Advanced RAG

Part 6 — The Retrieval Plane: Why Retrieval Fails

What are the eight ways retrieval silently returns the wrong answer, and what is the specific pattern that kills each one?

10 min · Updated June 2026

Retrieval is where most operational RAG accuracy is actually lost — and where it is most silently lost. A parse failure is visible (an empty table). A retrieval failure is not: the system returns a confident, fluent, fully cited answer — grounded in the wrong chunks.

6.1 The eight retrieval failure modes

Naming these precisely is the prerequisite for designing against them. “Retrieval is bad” is not actionable. These are.

F1 — Semantic gap. The question and the answer are phrased in different vocabulary. A question about “declining quarterly performance” and a document section titled “revenue shortfall in Q3” are semantically close but lexically distant. Dense vector similarity handles this well when the gap is small, but fails when technical domain language creates a large lexical distance.

F2 — Lexical miss. Some queries cannot be paraphrased. “Part number XR-7741,” “ISO 27001 Clause 9.4.2,” and “error code 0x8007045D” must be retrieved by exact token match. Dense embeddings represent these tokens in a distributed semantic space where they share neighbourhoods with dozens of unrelated concepts. BM25 exact-match retrieval solves this directly.
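
To make the exact-match point concrete, here is a minimal from-scratch BM25 sketch (the common variant with a non-negative IDF). The tokenizer, the three-document corpus, and the k1 and b defaults are assumptions made up for illustration; a production system would use the BM25 or sparse-vector support of its search engine rather than this loop.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Keep identifiers such as "xr-7741" or "9.4.2" as single tokens,
    # then strip sentence punctuation left at the end of a token.
    raw = re.findall(r"[a-z0-9][a-z0-9.\-]*", text.lower())
    return [tok.rstrip(".-") for tok in raw]

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Invented corpus: note the near-miss identifier XR-7714 in the third document.
docs = [
    "Replacement procedure for part number XR-7741 in the pump assembly.",
    "General maintenance schedule for pump assemblies and spare parts.",
    "Part number XR-7714 was discontinued last year.",
]
scores = bm25_scores("part number XR-7741", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The document containing the exact identifier scores highest, and the near-miss XR-7714 gets no credit for the identifier term because BM25 only rewards tokens that match exactly.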

F3 — Top-k cliff. The bi-encoder that drives dense retrieval is trained to rank relevant documents above irrelevant ones, but its discrimination is coarse at scale. The correct document is frequently retrieved, just at rank 11, 15, or 20, outside the top-k window. The fix is not to increase top-k indefinitely; it is to retrieve wide and rerank precisely.
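
The retrieve-wide, rerank-precisely idea can be sketched as a two-stage function. Everything here is illustrative: `coarse_score` stands in for a bi-encoder similarity, `precise_score` stands in for a cross-encoder, and the scores in the demo are invented so that the correct document sits at rank 13 under the coarse scorer.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    coarse_score: Scorer,
    precise_score: Scorer,
    wide_k: int = 100,
    final_k: int = 8,
) -> list[str]:
    # Stage 1: cheap scorer over the whole corpus, keep a wide candidate pool.
    candidates = sorted(corpus, key=lambda d: coarse_score(query, d), reverse=True)[:wide_k]
    # Stage 2: expensive scorer only over the candidates, keep a short final list.
    reranked = sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)
    return reranked[:final_k]

# Invented demo: 30 documents, the coarse scorer ranks doc12 at position 13.
corpus = [f"doc{i}" for i in range(30)]

def coarse(query: str, doc: str) -> float:
    return 30 - int(doc[3:])

def precise(query: str, doc: str) -> float:
    return 1.0 if doc == "doc12" else 0.1

# A plain top-10 cut on the coarse ranking misses doc12.
narrow = sorted(corpus, key=lambda d: coarse("q", d), reverse=True)[:10]
# A wide pool of 20 followed by reranking surfaces it first.
result = retrieve_then_rerank("q", corpus, coarse, precise, wide_k=20, final_k=3)
```

The wide pool costs one cheap pass; the expensive scorer only ever sees `wide_k` candidates, which is what keeps the second stage affordable.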

F4 — Multi-hop failure. “Which vendors did the CFO who joined in 2021 approve?” requires three retrieval steps: find the CFO’s name, find their approvals, find the vendor list. A single vector lookup retrieves documents on a single topic. No embedding captures the inference chain across documents.
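
A toy sketch of the three-step chain, under loud assumptions: the corpus, the names, and the PO numbers are invented; `retrieve` is a word-overlap stand-in for a real retriever; and the per-hop `extract` lambdas stand in for an LLM reading step. It shows only the control flow, where each sub-answer is substituted into the next sub-question.

```python
import re

# Invented three-passage corpus; names and PO numbers are illustrative only.
CORPUS = [
    "Dana Ruiz joined as CFO in 2021.",
    "Approvals signed by Dana Ruiz: PO-101, PO-205.",
    "Vendors: PO-101 was issued to Acme Freight; PO-205 was issued to Borealis Cloud.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))

def retrieve(question: str) -> str:
    # Stand-in retriever: return the passage sharing the most words with the question.
    q = tokens(question)
    return max(CORPUS, key=lambda passage: len(q & tokens(passage)))

# Each hop: a question template plus a stand-in for the LLM that reads the passage.
HOPS = [
    ("Who joined as CFO in 2021?",
     lambda p: re.match(r"(.+?) joined", p).group(1)),
    ("Approvals signed by {prev}",
     lambda p: re.search(r": (.+)\.$", p).group(1)),
    ("Which vendors were {prev} issued to?",
     lambda p: ", ".join(re.findall(r"issued to ([A-Z][A-Za-z ]+)", p))),
]

def multi_hop(hops):
    answer = ""
    trace = []
    for template, extract in hops:
        question = template.format(prev=answer)  # feed the previous sub-answer forward
        passage = retrieve(question)
        answer = extract(passage)
        trace.append((question, answer))
    return trace

trace = multi_hop(HOPS)
single_shot = retrieve("Which vendors did the CFO who joined in 2021 approve?")
```

On this toy data the single lookup lands on the passage about the CFO joining, which names no vendors, while the chained version reaches the vendor passage on the third hop.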

F5 — Lost-in-the-middle. The correct chunks are retrieved and placed in the context. The LLM generates the wrong answer anyway. Empirical research has consistently shown that language models attend primarily to the head and tail of long contexts and systematically underweight content in the middle. A retrieval system that returns 20 chunks and places the most relevant one at position 10 has effectively hidden the answer.
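
One simple way to keep the best chunks out of the middle, sketched under the assumption that chunks arrive sorted best-first: alternate them between the front and the back of the context, so the weakest material ends up in the centre. The function name is hypothetical.

```python
def reorder_head_tail(chunks_best_first: list[str]) -> list[str]:
    """Place rank 1 first, rank 2 last, rank 3 second, rank 4 second-to-last, and so on."""
    head, tail = [], []
    for i, chunk in enumerate(chunks_best_first):
        # Even positions fill the head, odd positions fill the tail.
        (head if i % 2 == 0 else tail).append(chunk)
    # The tail was collected best-first, so reverse it to end on the strongest chunk.
    return head + tail[::-1]

ordered = reorder_head_tail(["a", "b", "c", "d", "e"])
# ordered == ["a", "c", "e", "d", "b"]
```

With five chunks ranked a (best) through e (worst), the best chunk opens the context, the second best closes it, and the weakest sits in the middle.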

F6 — Distractor poisoning. One plausible-but-wrong chunk is retrieved alongside several correct ones. The LLM, trained to be helpful and to synthesize, blends the distractor into the answer. This is especially dangerous for numerical claims.

F7 — No-answer / stale-answer. The corpus simply does not contain the answer, or contains a stale version. The pipeline retrieves the closest available content and synthesizes from it anyway. This is the most systemic failure of static RAG systems: they have no “I don’t know” path, and they have no freshness awareness.
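
A minimal sketch of an explicit “I don’t know” path: synthesis only proceeds when the best reranked chunk clears a score floor. The `Chunk` type, the `gate` function, and the 0.5 threshold are assumptions for illustration; a real threshold would need calibrating against labelled queries for the specific reranker in use.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    rerank_score: float  # assumed to be comparable across queries

def gate(chunks: list[Chunk], min_top_score: float = 0.5) -> dict:
    # No candidates, or even the best candidate is weak: refuse to synthesize.
    if not chunks or max(c.rerank_score for c in chunks) < min_top_score:
        return {"action": "no_answer", "context": []}
    return {"action": "synthesize", "context": chunks}

strong = gate([Chunk("refund policy text", 0.82), Chunk("unrelated text", 0.40)])
weak = gate([Chunk("loosely related text", 0.21)])
empty = gate([])
```

The caller can map `no_answer` to a fallback search or to an explicit refusal; the point is that the decision is made before any text is generated.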

F8 — Wrong-tool routing. GraphRAG is genuinely superior for multi-hop and narrative questions. It is empirically inferior by approximately 13% on simple factoid questions (Han et al., 2025). Applying the most powerful retrieval tool to every query class is not conservative: it actively degrades accuracy for simple queries while adding latency and cost.
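
The routing decision can be sketched as a function from query to retrieval path. The regular expressions below are a deliberately crude, hypothetical stand-in for a trained or LLM-based classifier, and the route names `graph` and `vector` are invented labels; the sketch only illustrates that simple factoid queries should never reach the graph path by default.

```python
import re

def route(query: str) -> str:
    q = query.lower()
    # Two relative pronouns in one question hint at a chained, multi-hop lookup.
    if re.search(r"\b(who|which|that)\b.*\b(who|which|that)\b", q):
        return "graph"
    # Corpus-wide or thematic wording hints at a narrative question.
    if re.search(r"\b(themes?|overall|summar(y|ise|ize)|across)\b", q):
        return "graph"
    # Default: plain single-hop retrieval for simple factoid queries.
    return "vector"

multi_hop_route = route("Which vendors did the CFO who joined in 2021 approve?")
factoid_route = route("What is the refund window?")
thematic_route = route("Summarize the overall themes across the reports")
```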

6.2 The nine retrieval accuracy patterns

Pattern	Name	Failure modes killed	Mechanism
R1	Hybrid retrieval — dense + BM25 + RRF	F1, F2	Dense and BM25 run in parallel over every query; scores fused via RRF server-side in Qdrant. Neither channel is optional.
R2	Wide retrieve then cross-encoder rerank	F3, F6	Retrieve top-100 to 200 for recall; cross-encoder (BGE-Reranker-v2-m3) cuts to a precise top-8 to 10.
R3	Query transformation (HyDE / multi-query / step-back)	F1, F4	HyDE embeds a hypothetical answer; multi-query paraphrases; step-back abstracts for broad recall. All results fused with RRF.
R4	Adaptive routing (complexity classifier)	F8	Lightweight LLM classifier routes to: no-retrieval, single-hop, multi-hop, GraphRAG, SQL, or real-time. ~40% cost reduction when implemented well.
R5	Multi-hop iterative retrieval	F4	Decompose the compound question; retrieve for each sub-query; use the sub-answer to formulate the next.
R6	GraphRAG (gated) + RAPTOR	F4, thematic queries	Kuzu-backed graph traversal for entity-relationship chains; RAPTOR recursive cluster summaries for global thematic queries. Both only activated by R4 routing.
R7	Contextual compression and reordering	F5, F6	Drop chunks below relevance-score floor; reorder so highest-scoring chunks occupy head and tail of the context window.
R8	Metadata / ACL pre-filtering	F6, cross-tenant leakage	tenant_id, acl_set, date range applied as Qdrant payload filters inside the vector query — not post-hoc.
R9	Retrieval confidence gating	F7	If the top reranked score is below threshold, synthesis is suppressed; system routes to web-search fallback or returns an explicit no-answer.
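
The RRF fusion named in R1 can be sketched client-side in a few lines. This is an illustration of reciprocal rank fusion itself, not of how any particular vector database implements it server-side; the document IDs are invented, and k = 60 is the constant commonly used for RRF.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs: each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d3", "d1", "d7"]  # invented output of the dense channel
bm25_ranking = ["d7", "d3", "d9"]   # invented output of the BM25 channel
fused = rrf([dense_ranking, bm25_ranking])
# fused == ["d3", "d7", "d1", "d9"]
```

Because RRF uses only ranks, the dense and BM25 channels never need their raw scores normalised onto a common scale, and documents found by both channels rise above documents found by only one.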