
Part 6 — The Retrieval Plane: Why Retrieval Fails

What are the eight ways retrieval silently returns the wrong answer, and what is the specific pattern that kills each one?

10 min · Updated June 2026

Retrieval is where most operational RAG accuracy is actually lost — and where it is most silently lost. A parse failure is visible (an empty table). A retrieval failure is not: the system returns a confident, fluent, fully cited answer — grounded in the wrong chunks.

6.1 The eight retrieval failure modes

Naming these precisely is the prerequisite for designing against them. “Retrieval is bad” is not actionable. These are.

F1 — Semantic gap. The question and the answer are phrased in different vocabulary. A question about “declining quarterly performance” and a document section titled “revenue shortfall in Q3” are semantically close but lexically distant. Dense vector similarity handles this well when the gap is small, but it degrades when specialised domain terminology widens the gap between how the question and the source are worded.

F2 — Lexical miss. Some queries cannot be paraphrased. “Part number XR-7741,” “ISO 27001 Clause 9.4.2,” and “error code 0x8007045D” must be retrieved by exact token match. Dense embeddings represent these tokens in a distributed semantic space where they share neighbourhoods with dozens of unrelated concepts. BM25 exact-match retrieval solves this directly.
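
A minimal sketch of why exact-match scoring handles this case, assuming a hand-rolled Okapi BM25 (with the common `k1=1.5`, `b=0.75` defaults) rather than any particular search engine. The tokenizer, the function names and the three sample documents are illustrative assumptions, not part of the pipeline described here.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Keep identifiers such as "XR-7741" or "9.4.2" as single tokens.
    return re.findall(r"[a-z0-9]+(?:[-.][a-z0-9]+)*", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents each term occurs in.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact token match, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical corpus: two near-identical part numbers and one generic document.
docs = [
    "Replacement procedure for part number XR-7741 in the pump assembly.",
    "General maintenance schedule for pump assemblies and spare parts.",
    "Part number XR-7714 is discontinued; see the successor catalogue.",
]
scores = bm25_scores("XR-7741", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the document containing the literal token `xr-7741` receives a non-zero score. The look-alike `XR-7714` scores zero, which is the discrimination a dense embedding cannot be relied on to make.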

F3 — Top-k cliff. The bi-encoder that drives dense retrieval is trained to rank relevant documents above irrelevant ones, but its discrimination is coarse at scale. The correct document is frequently retrieved, just at rank 11 or 15 or 20, outside the top-k window. The fix is not to increase top-k indefinitely; it is to retrieve wide and rerank precisely.
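
The retrieve-wide, rerank-narrow shape can be sketched as below. Both scorers are deliberately crude stand-ins chosen so the example runs without a model: `coarse_score` plays the role of bi-encoder similarity and `fine_score` plays the role of a cross-encoder. The function names, the toy corpus and the `wide_k` / `final_k` values are assumptions for illustration.

```python
def coarse_score(query, doc):
    # Stand-in for bi-encoder similarity: count of query words found in the doc.
    words = set(doc.lower().split())
    return sum(1 for w in query.lower().split() if w in words)

def fine_score(query, doc):
    # Stand-in for a cross-encoder: looks at query and doc together, here by
    # rewarding an exact phrase match on top of the coarse signal.
    return coarse_score(query, doc) + (10 if query.lower() in doc.lower() else 0)

def retrieve_then_rerank(query, corpus, wide_k=100, final_k=8):
    # Stage 1: cheap scorer over the whole corpus, keep a wide candidate pool.
    pool = sorted(corpus, key=lambda d: coarse_score(query, d), reverse=True)[:wide_k]
    # Stage 2: expensive scorer over the pool only, keep a narrow final list.
    return sorted(pool, key=lambda d: fine_score(query, d), reverse=True)[:final_k]

query = "reset admin password"
target = "how to reset admin password from the console"
# Fifteen distractors that share every query word, followed by the real answer.
corpus = [f"admin password policy note {i} reset schedule" for i in range(15)]
corpus.append(target)

# A narrow top-10 on the coarse scorer alone misses the answer (it ties at rank 16).
narrow = sorted(corpus, key=lambda d: coarse_score(query, d), reverse=True)[:10]
reranked = retrieve_then_rerank(query, corpus)
```

In this toy run the target never enters the coarse top-10, yet it is first after reranking, because the wide pool kept it in play for the more precise scorer.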

F4 — Multi-hop failure. “Which vendors did the CFO who joined in 2021 approve?” requires three retrieval steps: find the CFO’s name, find their approvals, find the vendor list. A single vector lookup retrieves documents on a single topic. No embedding captures the inference chain across documents.

F5 — Lost-in-the-middle. The correct chunks are retrieved and placed in the context. The LLM generates the wrong answer anyway. Empirical research has consistently shown that language models attend primarily to the head and tail of long contexts and systematically underweight content in the middle. A retrieval system that returns 20 chunks and places the most relevant one at position 10 has effectively hidden the answer.
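
One way to counter this is to reorder the retrieved chunks before building the prompt, so the strongest ones sit at the head and tail and the weakest land in the middle. The sketch below assumes each chunk arrives as a `(text, score)` pair from the reranker; the function name and the alternating placement rule are illustrative choices, not a prescribed algorithm.

```python
def reorder_head_tail(chunks_with_scores):
    # Sort best-first, then deal chunks alternately to the head and the tail.
    ranked = sorted(chunks_with_scores, key=lambda pair: pair[1], reverse=True)
    head, tail = [], []
    for i, (chunk, _score) in enumerate(ranked):
        (head if i % 2 == 0 else tail).append(chunk)
    # Reverse the tail so the second-best chunk ends up in the final position.
    return head + tail[::-1]

retrieved = [("c", 0.7), ("a", 0.9), ("e", 0.5), ("b", 0.8), ("d", 0.6)]
ordered = reorder_head_tail(retrieved)
```

Here `ordered` is `["a", "c", "e", "d", "b"]`: the best chunk opens the context, the second best closes it, and the lowest-scoring chunk sits in the middle where underweighting costs the least.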

F6 — Distractor poisoning. One plausible-but-wrong chunk is retrieved alongside several correct ones. The LLM, trained to be helpful and to synthesize, blends the distractor into the answer. This is especially dangerous for numerical claims.

F7 — No-answer / stale-answer. The corpus simply does not contain the answer, or contains a stale version. The pipeline retrieves the closest available content and synthesizes from it anyway. This is the most systemic failure of static RAG systems: they have no “I don’t know” path, and they have no freshness awareness.
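
A hedged sketch of what an explicit “I don’t know” path with freshness awareness might look like. It assumes reranker scores lie in [0, 1] and that every chunk carries a last-updated date; the `Hit` type, the `gate_retrieval` name and both thresholds are hypothetical and would need tuning against real data.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Hit:
    text: str
    score: float   # reranker relevance score, assumed to lie in [0, 1]
    updated: date  # last-modified date of the source document

def gate_retrieval(hits, today, min_score=0.5, max_age_days=365):
    """Decide whether synthesis should run at all."""
    # Nothing relevant enough: refuse instead of synthesizing from near misses.
    if not hits or max(h.score for h in hits) < min_score:
        return ("no_answer", [])
    fresh = [h for h in hits if today - h.updated <= timedelta(days=max_age_days)]
    # Relevant material exists, but only in outdated documents.
    if not fresh or max(h.score for h in fresh) < min_score:
        return ("stale", [])
    return ("answer", [h for h in fresh if h.score >= min_score])

today = date(2026, 6, 1)
weak = [Hit("loosely related text", 0.2, date(2026, 1, 1))]
old = [Hit("policy as of 2023", 0.9, date(2023, 1, 1))]
good = [Hit("current policy", 0.9, date(2026, 5, 1)),
        Hit("tangent", 0.3, date(2026, 5, 1))]
```

With these inputs, `gate_retrieval(weak, today)` returns a no-answer decision, `old` is flagged as stale, and `good` passes through with only its high-scoring chunk. What the caller does with a `"no_answer"` or `"stale"` decision (fallback search, explicit refusal) is left open here.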

F8 — Wrong-tool routing. GraphRAG is genuinely superior for multi-hop and narrative questions. It is empirically inferior by approximately 13% on simple factoid questions (Han et al., 2025). Applying the most powerful retrieval tool to every query class is not conservative: it actively degrades accuracy for simple queries while adding latency and cost.

6.2 The nine retrieval accuracy patterns

| Pattern | Name | Failure modes killed | Mechanism |
|---|---|---|---|
| R1 | Hybrid retrieval (dense + BM25 + RRF) | F1, F2 | Dense and BM25 run in parallel over every query; scores fused via RRF server-side in Qdrant. Neither channel is optional. |
| R2 | Wide retrieve then cross-encoder rerank | F3, F6 | Retrieve top-100 to 200 for recall; cross-encoder (BGE-Reranker-v2-m3) cuts to a precise top-8 to 10. |
| R3 | Query transformation (HyDE / multi-query / step-back) | F1, F4 | HyDE embeds a hypothetical answer; multi-query paraphrases; step-back abstracts for broad recall. All results fused with RRF. |
| R4 | Adaptive routing (complexity classifier) | F8 | Lightweight LLM classifier routes to: no-retrieval, single-hop, multi-hop, GraphRAG, SQL, or real-time. ~40% cost reduction when implemented well. |
| R5 | Multi-hop iterative retrieval | F4 | Decompose the compound question; retrieve for each sub-query; use the sub-answer to formulate the next. |
| R6 | GraphRAG (gated) + RAPTOR | F4, thematic queries | Kuzu-backed graph traversal for entity-relationship chains; RAPTOR recursive cluster summaries for global thematic queries. Both only activated by R4 routing. |
| R7 | Contextual compression and reordering | F5, F6 | Drop chunks below relevance-score floor; reorder so highest-scoring chunks occupy head and tail of the context window. |
| R8 | Metadata / ACL pre-filtering | F6, cross-tenant leakage | tenant_id, acl_set, date range applied as Qdrant payload filters inside the vector query, not post-hoc. |
| R9 | Retrieval confidence gating | F7 | If the top reranked score is below threshold, synthesis is suppressed; system routes to web-search fallback or returns an explicit no-answer. |
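
Reciprocal rank fusion (RRF), which R1 and R3 both rely on, is small enough to show in full. The sketch below illustrates the formula only: each document scores the sum of `1 / (k + rank)` over every ranked list it appears in. It is not Qdrant's server-side implementation; `k=60` is a commonly used default, and the function name and document ids are made up for the example.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several best-first rankings of document ids into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters, so raw dense and BM25 scores never need
            # to be put on a common scale.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_ranking = ["d3", "d1", "d7"]  # hypothetical dense-channel result
bm25_ranking = ["d1", "d9", "d3"]   # hypothetical BM25-channel result
fused = rrf_fuse([dense_ranking, bm25_ranking])
```

Here `fused` is `["d1", "d3", "d9", "d7"]`. Documents found by both channels rise to the top, and `d1` edges out `d3` because its two ranks (2 and 1) beat `d3`'s (1 and 3).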