Advanced RAG

Part 4 — Reference Architecture: The Seven Planes

How do you organise the system so that each concern has a clear owner, a defined boundary, and a deployable footprint — and what actually happens to one query as it moves through them?

8 min · Updated July 2026

A production multimodal hybrid agentic RAG system organises into seven conceptually distinct planes. Each has a clear ownership boundary, a defined data contract, and a separate deployment footprint. One clarifying idea before the tour: these aren’t seven sequential steps. Five planes sit in the request path (a query flows agentic → retrieval → indexing, with tooling as a side call and ingestion feeding the index offline), while two are cross-cutting: control governs what the others may do, and observability measures what they did. That’s why they’re “planes,” not “stages.”

The seven planes

The model gateway (LiteLLM) sits horizontally across all planes: every generation, embedding, reranking, and vision API call routes through it, giving unified cost accounting, per-role model routing, cross-provider fallback, and prompt caching.
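
To make per-role routing, fallback, and cost accounting concrete, here is a minimal sketch in plain Python. It is not LiteLLM's API: the `Gateway` class, the provider callables, and the `(text, cost)` return shape are all hypothetical, chosen only to show the idea of one choke point that every model call passes through.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical provider signature: takes (model, prompt), returns (text, cost_usd).
Provider = Callable[[str, str], tuple[str, float]]

class Gateway:
    """Illustrative stand-in for a model gateway; not LiteLLM's API."""

    def __init__(self, routes: dict[str, list[str]], providers: dict[str, Provider]):
        self.routes = routes          # role -> ordered models, first is primary
        self.providers = providers    # model -> callable that performs the call
        self.cost_by_role: dict[str, float] = defaultdict(float)

    def call(self, role: str, prompt: str) -> str:
        last_error = None
        for model in self.routes[role]:
            try:
                text, cost = self.providers[model](model, prompt)
            except Exception as exc:          # provider failed: try the next model
                last_error = exc
                continue
            self.cost_by_role[role] += cost   # unified cost accounting per role
            return text
        raise RuntimeError(f"all models failed for role {role!r}") from last_error
```

Because every plane calls `Gateway.call` with a role rather than a model name, swapping the model behind a role or adding a fallback becomes a configuration change rather than a code change.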

4.1 Control plane

Owns: configuration and policy. You touch it when: onboarding a tenant, changing a prompt template, or toggling an agentic pattern on or off. It holds per-tenant settings, per-role prompt templates, feature flags, and the immutable audit log required for compliance. Nothing here does retrieval or generation — it governs what the other planes are allowed to do. In practice it’s a resolved policy object handed to every request:

tenant_policy.py

# Control plane: one resolved policy object per request. No retrieval here.
from dataclasses import dataclass

@dataclass(frozen=True)
class TenantPolicy:
    tenant_id: str
    model_by_role: dict[str, str]      # {"generate": "gpt-4o", "rerank": "bge-reranker-large"}
    patterns_enabled: set[str]         # {"crag", "self_rag"}: feature flags
    max_correction_loops: int          # bound agentic cost per query
    allowed_tools: set[str]            # {"web_search", "sql"}: per-tenant allow-list

def resolve_policy(tenant_id: str, store) -> TenantPolicy:
    # `store` is any config backend exposing load(tenant_id) -> TenantPolicy
    return store.load(tenant_id)       # every downstream plane reads from this

4.2 Observability plane

Owns: measurement. You touch it when: a query returns a bad answer and you need the trace to see which plane failed. It receives OpenTelemetry spans from every plane, stores them in Langfuse for LLM-native trace analysis, feeds embedding-drift metrics to Arize Phoenix, and runs DeepEval / RAGAS CI gates on every pipeline change. This plane is not optional instrumentation — its signals are the input to the routing decisions in the ingestion and retrieval planes. Remove it and those planes lose their ability to self-correct.
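
A CI gate of the kind described above can be reduced to a threshold check over evaluation scores. The sketch below is stdlib-only and does not call DeepEval or RAGAS; the metric names and threshold values are assumptions for illustration, and the scores would come from whichever evaluation suite the pipeline runs.

```python
# Hypothetical thresholds; real values depend on the evaluation suite in use.
THRESHOLDS = {"faithfulness": 0.85, "context_recall": 0.80}

def gate(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return the metrics that fail their threshold; an empty list means pass."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or value < minimum:   # a missing metric counts as a failure
            failures.append(metric)
    return failures
```

In a CI job, a non-empty result would fail the build, which is what turns evaluation from a dashboard into a gate.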

4.3 Ingestion plane

Owns: accuracy determination. You touch it when: a document format retrieves badly and parsing needs fixing. Documents enter as raw files and leave as validated, confidence-graded, contextually enriched, deduplicated chunks ready for indexing. This is where most production accuracy is won or lost — covered in full in Part 5.
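
As a rough illustration of two of those exit conditions (confidence grading and deduplication), the sketch below filters chunks before indexing. The `Chunk` fields, the `parse_confidence` score, and the 0.7 cut-off are hypothetical; Part 5 covers the real pipeline.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    parse_confidence: float   # hypothetical 0..1 score attached by the parser

def content_key(text: str) -> str:
    # Normalise case and whitespace so trivially different copies collide.
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def prepare_for_index(chunks: list[Chunk], min_confidence: float = 0.7) -> list[Chunk]:
    seen: set[str] = set()
    kept = []
    for chunk in chunks:
        if chunk.parse_confidence < min_confidence:
            continue                  # low-confidence parse: hold back from the index
        key = content_key(chunk.text)
        if key in seen:
            continue                  # duplicate after normalisation
        seen.add(key)
        kept.append(chunk)
    return kept
```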

4.4 Tooling plane

Owns: external capability. You touch it when: the knowledge base can’t answer (stale data, missing content, numeric lookups) and the query needs a live source. The agentic plane invokes tools registered here — web search, SQL execution, HTTP fetch, on-demand transcription — running in sandboxed environments with per-tenant allow-lists.
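
The per-tenant allow-list can be enforced at the single point where tools are dispatched. This is a minimal sketch under that assumption: `ToolRegistry` and `ToolNotAllowed` are hypothetical names, and sandboxing is out of scope here.

```python
from typing import Callable

class ToolNotAllowed(PermissionError):
    """Raised when a tenant's policy does not include the requested tool."""

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object]) -> None:
        self._tools[name] = fn

    def invoke(self, name: str, allowed_tools: set[str], **kwargs) -> object:
        # Check the allow-list before dispatch, so a disallowed tool never runs.
        if name not in allowed_tools:
            raise ToolNotAllowed(f"tool {name!r} is not enabled for this tenant")
        return self._tools[name](**kwargs)
```

The `allowed_tools` argument would be taken from the tenant's resolved policy, so the agentic plane cannot widen its own permissions.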

4.5 Indexing plane

Owns: storage and isolation. You touch it when: tuning what’s stored or enforcing tenant boundaries. Qdrant holds dense and sparse vectors with payload-level ACL filters; Kuzu holds the knowledge graph for multi-hop traversal; RAPTOR recursive summaries enable global thematic queries. Tenant isolation is enforced here, not post-hoc — a query can never retrieve documents outside the requesting tenant’s scope.
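
The difference between filtering inside the query and filtering after it can be shown with an in-memory stand-in. This is not Qdrant's client API; `Point`, `cosine`, and `search` are hypothetical helpers that only illustrate applying the tenant filter before any scoring happens.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    tenant_id: str
    text: str
    vector: tuple[float, ...]

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(points: list[Point], query_vec, tenant_id: str, k: int = 3) -> list[Point]:
    # tenant_id is a required argument and is applied before scoring,
    # so out-of-scope points are never candidates.
    in_scope = [p for p in points if p.tenant_id == tenant_id]
    return sorted(in_scope, key=lambda p: cosine(p.vector, query_vec), reverse=True)[:k]
```

Making the tenant a required parameter of the search function, rather than an optional post-filter, means there is no code path that returns cross-tenant results by omission.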

4.6 Retrieval plane

Owns: search. You touch it when: recall or precision needs tuning. It executes hybrid queries (dense + BM25 + RRF), reranks with a cross-encoder, applies query transformations (HyDE, multi-query, step-back), compresses and reorders the context window, and returns a scored, ordered candidate set. The eight retrieval failure modes and nine accuracy patterns are covered in full in Part 6.
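
Reciprocal rank fusion (RRF) is the step that merges the dense and BM25 result lists: each document scores the sum of 1 / (k + rank) over the lists it appears in. A minimal sketch follows; the function name is hypothetical and k = 60 is a commonly used default, not a value taken from this architecture.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties break on doc_id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

RRF uses only ranks, not raw scores, so it needs no calibration between the cosine similarities of the dense retriever and the BM25 scores of the sparse one.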

4.7 Agentic plane

Owns: orchestration. You touch it when: changing how queries are routed, decomposed, or verified. It’s a LangGraph StateGraph that routes queries, decomposes compound questions, fans out parallel retrievals, grades context (CRAG), verifies answers (Self-RAG), enforces citation, and bounds correction loops. Not a pipeline — a graph with conditional edges and termination conditions:

agentic_graph.py

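# Agentic plane: a graph, not a pipeline. Conditional edges ARE the architecture.
# RAGState and the node functions are assumed to be defined elsewhere. RAGState is
# taken to be a dataclass or Pydantic model, so attribute access (s.relevant) works.
from langgraph.graph import StateGraph, END

g = StateGraph(RAGState)
g.add_node("route", route_query)          # pick strategy (uses control-plane flags)
g.add_node("retrieve", hybrid_retrieve)   # calls the retrieval plane
g.add_node("grade", grade_context)        # CRAG: is the context relevant?
g.add_node("generate", generate_answer)
g.add_node("verify", verify_grounding)    # Self-RAG: is the answer supported? (increments s.loops)
g.add_node("tool", call_external_tool)    # tooling plane fallback

# Fixed edges: the default path through the graph.
g.set_entry_point("route")
g.add_edge("route", "retrieve")
g.add_edge("retrieve", "grade")
g.add_edge("generate", "verify")
g.add_edge("tool", "generate")            # tool output becomes context for generation

g.add_conditional_edges("grade", lambda s:
    "generate" if s.relevant else "tool")          # bad context → fall back to tools
g.add_conditional_edges("verify", lambda s:
    END if s.grounded or s.loops >= s.max_loops    # bounded, respects policy
    else "retrieve")                                # ungrounded → try again

app = g.compile()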

The conditional edges are the whole point: grade can divert to tools, verify can loop back to retrieval, and max_loops (populated from max_correction_loops in the control plane’s TenantPolicy) keeps cost bounded. The seven agentic accuracy patterns are covered in full in Part 7.
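
The two routing decisions can be checked in isolation, without LangGraph installed. The sketch below assumes a dataclass shape for `RAGState` (the original does not define it) and uses a local `END` constant as a stand-in for the one LangGraph exports; `after_grade` and `after_verify` are hypothetical names for the two lambdas.

```python
from dataclasses import dataclass

END = "__end__"   # stand-in so the sketch runs without langgraph installed

@dataclass
class RAGState:
    question: str
    relevant: bool = False    # set by the grade node
    grounded: bool = False    # set by the verify node
    loops: int = 0            # incremented each time verify sends the query back
    max_loops: int = 3        # copied from the tenant policy

def after_grade(s: RAGState) -> str:
    # Irrelevant context diverts to the tooling plane.
    return "generate" if s.relevant else "tool"

def after_verify(s: RAGState) -> str:
    # Stop when the answer is grounded or the loop budget is spent.
    if s.grounded or s.loops >= s.max_loops:
        return END
    return "retrieve"
```

Writing the routers as named functions rather than lambdas makes the termination condition unit-testable, which matters because an unbounded correction loop is a cost incident, not just a bug.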

4.8 One query, all seven planes

To see how the planes compose, trace a single question — “What did we agree with Acme on payment terms, and is that still current?” — end to end:

Control resolves the tenant’s policy: which model generates, which patterns are enabled, max_loops = 3.
Agentic routes it. This is compound (“what we agreed” + “still current”), so it decomposes into two sub-questions.
Retrieval runs hybrid search for the contract clause; indexing returns only Acme-scoped chunks (ACL enforced in the query).
Agentic grades the context (CRAG). The “still current” part isn’t in the corpus → it diverts to tooling for a live lookup.
Agentic generates, then verifies grounding (Self-RAG). If unsupported, it loops, but no more than 3 times.
Observability has recorded a span at every step; those traces are what let the router improve next time.

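At query time the request passed through four planes (agentic, retrieval, indexing, tooling) and depended on a fifth, ingestion, which had populated the index offline. The other two, control and observability, sat around it: the two-axis split from the intro.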

Libraries & frameworks referenced on this page

Plane	Primary tooling
Model gateway (cross-cutting)	LiteLLM
Control	config store + feature flags (LaunchDarkly / in-house), immutable audit log
Observability	OpenTelemetry, Langfuse, Arize Phoenix, DeepEval, RAGAS
Ingestion	Unstructured, LlamaParse, Docling (detail in Part 5)
Tooling	sandboxed web search, SQL engines, HTTP fetch, Whisper
Indexing	Qdrant (dense + sparse + ACL), Kuzu (graph), RAPTOR
Retrieval	BM25 / OpenSearch, RRF, Cohere Rerank / bge-reranker, HyDE (detail in Part 6)
Agentic	LangGraph (detail in Part 7)
