Advanced RAG

Part 5 — The Ingestion Plane: Where Accuracy Is Won or Lost

How do you parse every document format faithfully, and how do you know when parsing has failed?

14 min · Updated June 2026

The ingestion plane is the most underinvested part of almost every RAG system. Engineers spend days fine-tuning prompts, hours thinking about chunking, and an afternoon on parsing, when the priorities should be inverted: no retrieval or agent cleverness can recover information that was destroyed at ingest. A garbled table is garbled forever. A figure stripped of its caption is permanently orphaned. A sentence split mid-clause by a fixed-size chunker will never be retrieved in its correct context.

5.1 Modality coverage

Every modality requires a different parsing strategy. Using a single parser for all content is as misguided as using a single database for all data access patterns.

Born-digital PDFs and office documents contain machine-readable text layers and benefit from layout-aware parsers that understand multi-column flow, heading hierarchy, and table structure. Docling (IBM, MIT license) is the current state-of-the-art OSS parser for this class: it combines a DocLayNet-trained layout model with a TableFormer model for cell-level table extraction, and it does so without a generative model, so it cannot invent text that is not on the page.

Scanned documents require optical character recognition before any text extraction is possible. The challenge is calibrated escalation: low-quality OCR that passes silently is worse than no OCR, because it produces confidently wrong text. A tiered OCR strategy routes pages to increasingly powerful (and expensive) engines based on measured confidence, with commercial vision APIs as the last resort.
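The escalation logic can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the `OcrResult` type, the engine signature, and the 0.85 threshold are hypothetical placeholders, and each engine is modelled as a plain callable that returns text plus a confidence in [0, 1].

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class OcrResult:
    text: str
    confidence: float  # engine-reported confidence in [0.0, 1.0]
    engine: str


# Hypothetical engine signature: page image bytes in, (text, confidence) out.
OcrEngine = Callable[[bytes], Tuple[str, float]]


def ocr_with_escalation(
    page: bytes,
    tiers: Sequence[Tuple[str, OcrEngine]],
    min_confidence: float = 0.85,  # assumed threshold; tune per corpus
) -> OcrResult:
    """Try engines cheapest-first and stop at the first confident result."""
    if not tiers:
        raise ValueError("at least one OCR tier is required")
    best: Optional[OcrResult] = None
    for name, engine in tiers:
        text, confidence = engine(page)
        candidate = OcrResult(text, confidence, name)
        if best is None or candidate.confidence > best.confidence:
            best = candidate
        if confidence >= min_confidence:
            return candidate
    # No tier was confident: return the best attempt so the caller can
    # send it to human review instead of indexing it silently.
    assert best is not None
    return best
```

Ordering `tiers` from cheapest to most expensive keeps the commercial engine as the last resort; a result that never reaches the threshold still carries its low confidence, so it cannot pass downstream as clean text.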

Audio and video require ASR to produce a transcript, optionally with speaker diarization and word-level timestamps. For video, keyframes are also extracted and captioned. The canonical OSS tool is faster-whisper for high throughput, with WhisperX adding forced word alignment and pyannote-audio-based diarization.

Source code requires AST-aware chunking rather than character-count chunking. Tree-sitter parses 40+ languages and provides function and class boundaries as natural chunk boundaries.
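To illustrate the idea without a Tree-sitter dependency, the sketch below uses Python's standard-library `ast` module as a stand-in. It handles Python sources only, and the function name `chunk_python_source` is illustrative; a multi-language pipeline would walk Tree-sitter's syntax tree in the same way.

```python
import ast


def chunk_python_source(source: str) -> list:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start at the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append("\n".join(lines[start - 1 : node.end_lineno]))
    return chunks
```

Each chunk is a complete definition, so a retrieved function is never cut in half. This sketch drops module-level statements such as imports; a fuller version would attach them to a separate header chunk.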

Structured data (CSV / Excel / JSON) presents a subtlety: do not try to retrieve structured data as chunks. Exact numeric queries must route to a text-to-SQL agent operating directly over the structured data. Vector similarity over flattened table text gives approximate answers to questions that demand exact ones.
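A small sketch of the exact-query path, using an in-memory SQLite table. The table, column names, and figures are made-up sample data, and the SQL is hand-written here as a stand-in for what a text-to-SQL agent would generate.

```python
import sqlite3

# Made-up sample rows standing in for an ingested spreadsheet.
rows = [
    ("EMEA", "2025-Q1", 1200.50),
    ("EMEA", "2025-Q2", 1310.25),
    ("APAC", "2025-Q1", 980.00),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (region TEXT, quarter TEXT, amount REAL)")
conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)


def total_revenue(region: str) -> float:
    """Exact aggregate computed by the database, not estimated from text chunks."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0.0) FROM revenue WHERE region = ?",
        (region,),
    ).fetchone()
    return total
```

The query returns the precise sum or an explicit zero for an unknown region, whereas similarity search over flattened rows can only return rows that look related to the question.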

HTML and web content require boilerplate-aware extraction (Trafilatura) for static pages and headless browser rendering (Playwright) for JavaScript-rendered dynamic pages.

5.2 The ten ingestion accuracy patterns

Each pattern targets a specific, named failure mode. The table lists the ten patterns and the failure each one prevents; section 5.3 describes how Docling, OpenAI embeddings, and VLMs collaborate to implement them.

#	Pattern	Failure mode prevented
1	Faithful modular parsing	Parser hallucination — generative models inventing text not on the page
2	Confidence-routed OCR escalation	Silent bad OCR passed downstream as clean text
3	Table structure fidelity	Flattened or misaligned cells producing wrong numeric or relational content
4	Contextual chunk enrichment	Context-stripped chunks — "revenue fell 12%" with no who, when, or where
5	Structure-aware chunk boundaries	Mid-sentence or mid-function splits that destroy coherent retrieval units
6	Cross-modal linking and figure captioning	Orphaned figures, tables without context, or captions without their referent
7	Embedding input validation	Truncated inputs silently producing wrong vectors; near-duplicate chunks skewing retrieval
8	Idempotent quarantining pipeline	One bad document silently dropped or corrupting a batch
9	Confidence-gated human review	Bad parses accumulating in the index undetected at scale
10	Golden ingestion regression set	A parser or model upgrade silently degrading extraction quality
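As one concrete example from the table, Pattern 8 (idempotent quarantining) might look like the sketch below. The function name, the in-memory `already_ingested` set, and the `parse` callable are all hypothetical; a production system would persist the hashes and the quarantine queue.

```python
import hashlib
from typing import Callable, Dict, Iterable, Set, Tuple


def ingest_batch(
    docs: Iterable[Tuple[str, bytes]],
    parse: Callable[[bytes], str],
    already_ingested: Set[str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (parsed, quarantined); one bad document never aborts the batch."""
    parsed: Dict[str, str] = {}
    quarantined: Dict[str, str] = {}
    for name, content in docs:
        key = hashlib.sha256(content).hexdigest()
        if key in already_ingested:
            continue  # idempotent: re-running the batch is a no-op for this doc
        try:
            parsed[name] = parse(content)
            already_ingested.add(key)
        except Exception as exc:  # isolate the failure to this document
            quarantined[name] = f"{type(exc).__name__}: {exc}"
    return parsed, quarantined
```

Keying on a content hash rather than a filename means a renamed copy is skipped, and a failed document stays visible in the quarantine map with its error instead of disappearing.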

5.3 How Docling, OpenAI embeddings, and VLMs work together

These three components are not interchangeable — each occupies a distinct, non-overlapping role in the pipeline.

Docling is the authority on document structure. It runs deterministic, model-driven layout analysis (DocLayNet) and table extraction (TableFormer) — no generative component. Its outputs are structural: bounding boxes, element types, heading hierarchy, table cell grids, reading order. It also provides a calibrated confidence report per document and per page.

docling_confidence.py

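from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

# sources, route_by_confidence, and quarantine are supplied by the surrounding
# pipeline: a list of paths or URLs and two application-defined handlers.
converter = DocumentConverter()

# raises_on_error=False yields a result per input instead of aborting the batch.
for result in converter.convert_all(sources, raises_on_error=False):
    if result.status == ConversionStatus.SUCCESS:
        # The confidence report lives on the ConversionResult itself.
        confidence = result.confidence
        # confidence.mean_grade: POOR / FAIR / GOOD / EXCELLENT
        # confidence.low_grade: grade of the worst-performing areas
        # confidence.ocr_score: 0.0 - 1.0 OCR quality
        # confidence.layout_score: 0.0 - 1.0 layout recognition quality
        route_by_confidence(result, confidence)
    else:
        # Any non-success status (partial or failed conversion) is held back.
        quarantine(result)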

Commercial vision APIs are used only for what Docling cannot handle deterministically. There are exactly three cases: tier-escalation OCR when Docling’s ocr_score is below threshold; figure captioning when a figure element has no associated caption in the document structure; and complex table re-extraction when structural ambiguity is present. Vision API calls are conditional, not universal. Applying them to every element would be extremely expensive and would introduce generative risk on clean documents that Docling handles deterministically.

An OpenAI text-embedding-3 model (text-embedding-3-small or text-embedding-3-large) is applied only after structural integrity is confirmed. It never embeds raw, unvalidated parser output. Garbage in, garbage embedded, and garbage embedded means garbage retrieved.
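A pre-flight check in the spirit of Pattern 7 could be sketched as below. The whitespace token counter is only a stand-in for a real tokenizer such as tiktoken, the 8191-token limit is an assumption to confirm against the model documentation, and the duplicate check catches only chunks that are identical after case and whitespace normalization, not true near-duplicates.

```python
import hashlib
from typing import Callable, List, Tuple


def preflight(
    chunks: List[str],
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),  # stand-in tokenizer
    max_tokens: int = 8191,  # assumed limit; confirm against the model docs
) -> Tuple[List[str], List[str]]:
    """Return (accepted, rejected) so nothing is truncated or embedded twice."""
    accepted, rejected, seen = [], [], set()
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if not normalized or count_tokens(chunk) > max_tokens or digest in seen:
            rejected.append(chunk)  # empty, over-long, or duplicate
            continue
        seen.add(digest)
        accepted.append(chunk)
    return accepted, rejected
```

Rejecting over-long chunks outright, rather than letting the API truncate them, keeps the failure visible: the chunk goes back to the chunker instead of producing a vector for half its content.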

The data flow is a routing system with explicit quality gates, not a linear pipeline:

Diagram of Docling confidence-gated routing: documents are parsed and routed through GOOD/EXCELLENT, FAIR, POOR, and FAILED quality gates to chunking, VLM retranscription, human review, or quarantine
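The gates in the diagram reduce to a small routing function. The sketch below is illustrative: the `Grade` enum mirrors the grade names used in this section rather than importing Docling's own type, and the destination names are hypothetical queue labels.

```python
from enum import Enum
from typing import Optional


class Grade(str, Enum):  # mirrors the grade names used in this section
    POOR = "poor"
    FAIR = "fair"
    GOOD = "good"
    EXCELLENT = "excellent"


def route(converted: bool, grade: Optional[Grade]) -> str:
    """Map a conversion outcome to the next stage of the ingestion flow."""
    if not converted or grade is None:
        return "quarantine"  # failed conversions never reach the index
    if grade in (Grade.GOOD, Grade.EXCELLENT):
        return "chunking"
    if grade is Grade.FAIR:
        return "vlm_retranscription"
    return "human_review"  # POOR
```

Whether to route on the mean grade or the low grade is a design choice; routing on the low grade is the more conservative option because a single bad page is enough to trigger escalation.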

5.4 The role of observability in ingestion

Observability in ingestion serves three distinct functional roles.

Role 1: Observability is the routing signal. The OCR escalation trigger (Pattern 2), the table-VLM fallback trigger (Pattern 3), and the human-review gate (Pattern 9) are all driven by Docling confidence metrics emitted as OTel spans. Without the telemetry, those routing decisions cannot be made. Observability is not decoration on the pipeline; it is structural plumbing.

Role 2: Observability is the early warning system. When a new class of scanned documents arrives from a new vendor, the escalation rate metric will spike before any user notices degraded answers. The distribution of ocr_score, layout_score, and mean_grade across your document corpus tells you where the parsers are struggling.

Role 3: Observability is the regression gate. Pattern 10 turns ingestion quality into a CI metric. Every pipeline change re-runs against the golden document set and compares extraction F1 and structural match against the approved baseline.
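A regression gate of this kind can be sketched in a few lines. The token-level F1 below is one plausible definition of extraction F1, not necessarily the one a given team uses, and the function names and the 0.01 tolerance are assumptions; structural match would need a separate comparison.

```python
from collections import Counter
from typing import Dict, List


def token_f1(expected: str, extracted: str) -> float:
    """Token-level F1 between golden text and freshly extracted text."""
    exp, ext = Counter(expected.split()), Counter(extracted.split())
    overlap = sum((exp & ext).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


def regressions(
    golden: Dict[str, str],
    outputs: Dict[str, str],
    baseline: Dict[str, float],
    tolerance: float = 0.01,  # assumed slack for harmless whitespace changes
) -> List[str]:
    """Return IDs of golden documents whose F1 fell below the approved baseline."""
    return [
        doc_id
        for doc_id, expected in golden.items()
        if token_f1(expected, outputs.get(doc_id, "")) < baseline[doc_id] - tolerance
    ]
```

In CI, a non-empty list from `regressions` would fail the build, which is what turns a silent parser downgrade into a visible one.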

Metric	Source	Alert condition
docling.conversion_status	ConversionStatus enum	Count of FAILURE results above threshold
docling.mean_grade	confidence.mean_grade	Distribution shift toward POOR/FAIR
docling.ocr_score	confidence.ocr_score	Mean below 0.7 for a source type
docling.escalation_rate	OCR routing logic	Sudden spike for a new source
embedding.cache_hit_rate	Redis cache	Drop below 60% (costly)
embedding.truncation_count	Pre-flight tiktoken check	Any non-zero in a batch
ingestion.quarantine_depth	Quarantine queue	Absolute threshold
ingestion.dedup_ratio	Hash dedup	Unexpected zero (may indicate broken hashing or normalization)
contextual_prepend.cache_hit_rate	LiteLLM prompt cache	Drop signals cost increase
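As an illustration of one alert condition from the table, the sketch below flags source types whose mean ocr_score falls below 0.7. The function name and the shape of `samples` are assumptions; in practice the values would come from the telemetry backend rather than an in-memory list.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def low_ocr_sources(
    samples: List[Tuple[str, float]],  # (source_type, ocr_score) pairs
    threshold: float = 0.7,
) -> Dict[str, float]:
    """Return the mean ocr_score for every source type below the threshold."""
    by_source = defaultdict(list)
    for source_type, score in samples:
        by_source[source_type].append(score)
    means = {source: mean(scores) for source, scores in by_source.items()}
    return {source: m for source, m in means.items() if m < threshold}
```

Grouping by source type matters here: a corpus-wide mean can stay healthy while a single new vendor's scans degrade.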

5.5 The human-in-the-loop: when labelling is required

This is the part that most RAG implementations skip entirely, and it is the part that explains why some systems compound accuracy over time while others decay. The human’s role is not to replace automation — it is to curate the quality gates and training signals that automation cannot generate for itself.

The principle: label selectively, never in bulk. You never review every document. You review what the confidence signals flag. Steady-state human effort is low; it is front-loaded in bootstrapping and then sustained by a small, targeted review queue.

The three pipeline gates where humans can intervene:

Gate 1 (Quality review before indexing): Documents with POOR grades are held. A human reviews in Argilla or Label Studio, corrects, and approves. The approved version goes to indexing. This prevents bad documents from ever entering the search index.
Gate 2 (Spot-check of chunk quality): A random sample of chunks is surfaced in Argilla for a human reviewer to verify context enrichment, boundary quality, and cross-modal links. Identified problems become pattern updates.
Gate 3 (Query-answer review): When a query returns a low-faithfulness or low-relevance score in online eval, a human reviews the trace — what was retrieved, what was generated — and annotates the correct retrieval. This becomes a golden query-answer pair.
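The first two gates can be sketched as a queue-building step. Everything here is illustrative: the document dictionaries, the 2% sample rate, and the fixed seed are assumptions, and Gate 3 is omitted because it is driven by online evaluation scores rather than ingestion output.

```python
import random
from typing import Dict, List, Tuple


def build_review_queues(
    docs: List[Dict],
    sample_rate: float = 0.02,  # assumed spot-check rate
    seed: int = 0,  # fixed seed so the sample is reproducible
) -> Tuple[List[str], List[str]]:
    """Gate 1 holds every POOR document; Gate 2 samples chunks from the rest."""
    rng = random.Random(seed)
    held = [d["id"] for d in docs if d["grade"] == "POOR"]
    candidates = [c for d in docs if d["grade"] != "POOR" for c in d["chunks"]]
    if not candidates:
        return held, []
    # Always sample at least one chunk so small batches still get checked.
    k = min(len(candidates), max(1, round(len(candidates) * sample_rate)))
    return held, rng.sample(candidates, k)
```

The held list is exhaustive while the spot-check list is a sample, which matches the principle above: review everything the confidence signals flag, and only a small slice of everything else.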

The recommended annotation platform is Argilla (Apache 2.0) for AI-engineer and domain-expert collaboration on datasets. Label Studio (Apache 2.0) is preferred when reviewers need to work visually on the document image itself: bounding boxes, OCR regions, table cells.