Part 5 — The Ingestion Plane: Where Accuracy Is Won or Lost
How do you parse every document format faithfully, and how do you know when parsing has failed?
14 min · Updated June 2026
The ingestion plane is the most underinvested part of almost every RAG system. Engineers spend days tuning prompts, hours thinking about chunking, and an afternoon on parsing. The priorities should be inverted: no retrieval or agent cleverness can recover information that was destroyed at ingest. A garbled table is garbled forever. A figure stripped of its caption is permanently orphaned. A sentence split mid-clause by a fixed-size chunker will never be retrieved in its correct context.
5.1 Modality coverage
Every modality requires a different parsing strategy. Using a single parser for all content is as misguided as using a single database for all data access patterns.
Born-digital PDFs and office documents contain machine-readable text layers and benefit from layout-aware parsers that understand multi-column flow, heading hierarchy, and table structure. Docling (IBM, MIT license) is the current state-of-the-art OSS parser for this class: it combines a DocLayNet-trained layout model with a TableFormer model for cell-level table extraction. Neither model is generative, so the parser cannot invent text that is not on the page, although it can still misclassify layout or table structure.
Scanned documents require optical character recognition before any text extraction is possible. The challenge is calibrated escalation: low-quality OCR that passes silently is worse than no OCR, because it produces confidently wrong text. A tiered OCR strategy routes pages to increasingly powerful (and expensive) engines based on measured confidence, with commercial vision APIs as the last resort.
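A minimal sketch of confidence-routed escalation, under stated assumptions: each OCR engine is a stand-in callable that returns text plus a confidence in [0, 1], and the 0.85 threshold is a hypothetical value. None of these names come from a real OCR library.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical engine signature: page bytes in, (text, confidence in [0, 1]) out.
OcrEngine = Callable[[bytes], Tuple[str, float]]


@dataclass
class OcrOutcome:
    text: str
    confidence: float
    tier: int            # index of the engine that produced this result
    needs_review: bool   # True when no tier reached the threshold


def ocr_with_escalation(
    page: bytes, tiers: List[OcrEngine], threshold: float = 0.85
) -> OcrOutcome:
    """Try engines from cheapest to most expensive; stop at the first confident result."""
    if not tiers:
        raise ValueError("at least one OCR engine is required")
    best: Optional[OcrOutcome] = None
    for tier, engine in enumerate(tiers):
        text, confidence = engine(page)
        if best is None or confidence > best.confidence:
            best = OcrOutcome(text, confidence, tier, needs_review=False)
        if confidence >= threshold:
            return best
    # No tier cleared the bar: keep the best attempt, but flag it instead of
    # letting it pass downstream as clean text.
    best.needs_review = True
    return best


# Stub engines for illustration only.
def cheap_engine(page: bytes) -> Tuple[str, float]:
    return ("t0tal: 1O0", 0.40)


def strong_engine(page: bytes) -> Tuple[str, float]:
    return ("total: 100", 0.95)


outcome = ocr_with_escalation(b"page-image", [cheap_engine, strong_engine])
```

The point of the design is that a low-confidence result is never returned unmarked: either a later tier replaces it, or it leaves the function flagged for review.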
Audio and video require ASR to produce a transcript, optionally with speaker diarization and word-level timestamps. For video, keyframes are also extracted and captioned. The canonical OSS tool is faster-whisper for high throughput, with WhisperX adding forced word-level alignment and pyannote-audio-based diarization.
Source code requires AST-aware chunking rather than character-count chunking. Tree-sitter parses 40+ languages and provides function and class boundaries as natural chunk boundaries.
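To illustrate the idea without an external dependency, the sketch below swaps in Python's built-in `ast` module for Tree-sitter and handles Python source only. It splits a file at top-level function and class boundaries; the function name `chunk_python_source` is made up for this example.

```python
import ast
from typing import List, Tuple


def chunk_python_source(source: str) -> List[Tuple[str, str]]:
    """Return (name, code) pairs, one per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text spanned by the node
            # (decorator lines, if any, sit outside that span).
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


SAMPLE = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''

chunks = chunk_python_source(SAMPLE)
names = [name for name, _ in chunks]
```

Each chunk is a complete syntactic unit, so a retrieved function is never cut off halfway through its body.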
Structured data (CSV / Excel / JSON) presents a subtlety: do not try to retrieve structured data as chunks. Exact numeric queries must route to a text-to-SQL agent operating directly over the structured data. Vector similarity over flattened table text gives approximate answers to questions that demand exact ones.
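A small sketch of why exact queries belong in SQL, using only the standard library. The CSV contents and table name are invented for illustration, and the query is hand-written here; in the design described above it would come from a text-to-SQL agent.

```python
import csv
import io
import sqlite3

# Invented sample data for illustration.
CSV_TEXT = """region,quarter,revenue
EMEA,Q1,120.5
EMEA,Q2,98.25
APAC,Q1,75.0
"""


def load_sales(conn: sqlite3.Connection, csv_text: str) -> None:
    """Load the CSV rows into a typed table instead of flattening them into text chunks."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (:region, :quarter, :revenue)", rows
    )


conn = sqlite3.connect(":memory:")
load_sales(conn, CSV_TEXT)

# "What was total EMEA revenue?" has exactly one right answer, and SQL computes it.
total = conn.execute(
    "SELECT SUM(revenue) FROM sales WHERE region = ?", ("EMEA",)
).fetchone()[0]
```

A vector search over the flattened rows could surface either EMEA row, but it has no way to add them up.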
HTML and web content require boilerplate-aware extraction (Trafilatura) for static pages and headless browser rendering (Playwright) for JavaScript-rendered dynamic pages.
5.2 The ten ingestion accuracy patterns
Each pattern targets a specific, named failure mode. The table lists the ten patterns and the failure each one prevents; section 5.3 describes how Docling, OpenAI embeddings, and VLMs collaborate to implement them.
| # | Pattern | Failure mode prevented |
|---|---|---|
| 1 | Faithful modular parsing | Parser hallucination — generative models inventing text not on the page |
| 2 | Confidence-routed OCR escalation | Silent bad OCR passed downstream as clean text |
| 3 | Table structure fidelity | Flattened or misaligned cells producing wrong numeric or relational content |
| 4 | Contextual chunk enrichment | Context-stripped chunks — "revenue fell 12%" with no who, when, or where |
| 5 | Structure-aware chunk boundaries | Mid-sentence or mid-function splits that destroy coherent retrieval units |
| 6 | Cross-modal linking and figure captioning | Orphaned figures, tables without context, or captions without their referent |
| 7 | Embedding input validation | Truncated inputs silently producing wrong vectors; near-duplicate chunks skewing retrieval |
| 8 | Idempotent quarantining pipeline | One bad document silently dropped or corrupting a batch |
| 9 | Confidence-gated human review | Bad parses accumulating in the index undetected at scale |
| 10 | Golden ingestion regression set | A parser or model upgrade silently degrading extraction quality |
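As one illustration, Pattern 8 can be sketched with a content hash as the idempotency key and a quarantine store for failures. This is a toy under stated assumptions: `ingest_batch`, `toy_parse`, and the in-memory dictionaries are hypothetical stand-ins for a real parser and durable storage.

```python
import hashlib
from typing import Callable, Dict, Iterable, Tuple


def ingest_batch(
    docs: Iterable[Tuple[str, bytes]],
    parse: Callable[[bytes], str],
    index: Dict[str, str],
    quarantine: Dict[str, str],
) -> None:
    """Parse each document once; isolate failures instead of failing the batch."""
    for name, payload in docs:
        key = hashlib.sha256(payload).hexdigest()  # content hash = idempotency key
        if key in index:
            continue  # already ingested: re-running the batch is a no-op
        try:
            parsed = parse(payload)
        except Exception as exc:
            # Record the failure for inspection and move on to the next document.
            quarantine[key] = f"{name}: {exc}"
            continue
        quarantine.pop(key, None)  # a retry that succeeds clears the old entry
        index[key] = parsed


def toy_parse(payload: bytes) -> str:
    """Stand-in parser that rejects anything without a PDF header."""
    if not payload.startswith(b"%PDF"):
        raise ValueError("not a PDF")
    return payload.decode("latin-1")


index: Dict[str, str] = {}
quarantine: Dict[str, str] = {}
batch = [("a.pdf", b"%PDF-a"), ("broken.bin", b"\x00\x01"), ("b.pdf", b"%PDF-b")]
ingest_batch(batch, toy_parse, index, quarantine)
ingest_batch(batch, toy_parse, index, quarantine)  # second run changes nothing
```

The bad file ends up in exactly one place, with a reason attached, and the two good files are indexed once no matter how many times the batch is replayed.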
5.3 How Docling, OpenAI embeddings, and VLMs work together
These three components are not interchangeable — each occupies a distinct, non-overlapping role in the pipeline.
Docling is the authority on document structure. It runs deterministic, model-driven layout analysis (DocLayNet) and table extraction (TableFormer) — no generative component. Its outputs are structural: bounding boxes, element types, heading hierarchy, table cell grids, reading order. It also provides a calibrated confidence report per document and per page.
```python
from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

# `sources`, `route_by_confidence`, and `quarantine` are application-defined.
converter = DocumentConverter()

# raises_on_error=False yields a result for every input instead of aborting the batch.
for result in converter.convert_all(sources, raises_on_error=False):
    if result.status == ConversionStatus.SUCCESS:
        # The confidence report is attached to the conversion result.
        confidence = result.confidence
        # confidence.mean_grade:   POOR / FAIR / GOOD / EXCELLENT
        # confidence.low_grade:    lowest single-page grade
        # confidence.ocr_score:    0.0 - 1.0 OCR quality
        # confidence.layout_score: 0.0 - 1.0 layout recognition quality
        route_by_confidence(result, confidence)
    else:
        # Failed or partially converted documents are held, not dropped.
        quarantine(result)
```

Commercial vision APIs are used only for what Docling cannot handle deterministically. There are exactly three cases: tier-escalation OCR when Docling’s ocr_score is below threshold; figure captioning when a figure element has no associated caption in the document structure; and complex table re-extraction when structural ambiguity is present. Vision API calls are conditional, not universal. Applying them to every element would be extremely expensive and would introduce generative risk on clean documents that Docling handles deterministically.
OpenAI text-embedding-3 is applied after structural integrity is confirmed. It never embeds raw, unvalidated parser output. Garbage in, garbage embedded — and garbage embedded means garbage retrieved.
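A hedged sketch of a pre-embedding validation step. The 8191-token limit is an assumption to verify against the embedding model's documentation, the whitespace counter is a crude stand-in for a real tokenizer such as tiktoken, and the duplicate check only catches chunks that match after case and whitespace normalization, not true near-duplicates.

```python
import hashlib
from typing import Callable, Iterable, List, Tuple

MAX_INPUT_TOKENS = 8191  # assumed limit; confirm for the model in use


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer; undercounts for most text."""
    return len(text.split())


def validate_for_embedding(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_INPUT_TOKENS,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split chunks into (accepted, rejected-with-reason) before any embedding call."""
    accepted: List[str] = []
    rejected: List[Tuple[str, str]] = []
    seen = set()
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        if not normalized:
            rejected.append((chunk, "empty"))
            continue
        if count_tokens(chunk) > max_tokens:
            # Reject rather than let the input be truncated silently.
            rejected.append((chunk, "too_long"))
            continue
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            rejected.append((chunk, "duplicate"))
            continue
        seen.add(digest)
        accepted.append(chunk)
    return accepted, rejected


accepted, rejected = validate_for_embedding(
    ["Revenue fell 12%", "revenue  fell 12%", "", "one two three four five six"],
    max_tokens=5,  # tiny limit so the example triggers the length check
)
reasons = [reason for _, reason in rejected]
```

Every rejected chunk carries a reason, so the truncation and duplicate counts can be reported as metrics rather than discovered later in retrieval quality.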
The data flow is a routing system with explicit quality gates, not a linear pipeline.
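One way to picture that routing is as a function that turns a parsed document's signals into a list of actions. The sketch below is an assumption-laden illustration: `ParsedDoc`, `Element`, the action strings, and the 0.7 OCR threshold are all hypothetical, and a real system would re-score a document after escalation before indexing it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    kind: str                          # "text", "figure", or "table"
    has_caption: bool = False
    ambiguous_structure: bool = False  # e.g. merged or misaligned table cells


@dataclass
class ParsedDoc:
    status_ok: bool
    mean_grade: str                    # "POOR", "FAIR", "GOOD", or "EXCELLENT"
    ocr_score: float
    elements: List[Element] = field(default_factory=list)


def route(doc: ParsedDoc, ocr_threshold: float = 0.7) -> List[str]:
    """Map quality signals to actions; vision calls are issued only when a gate trips."""
    if not doc.status_ok:
        return ["quarantine"]
    actions: List[str] = []
    if doc.ocr_score < ocr_threshold:
        actions.append("escalate_ocr")
    for position, element in enumerate(doc.elements):
        if element.kind == "figure" and not element.has_caption:
            actions.append(f"caption_figure:{position}")
        if element.kind == "table" and element.ambiguous_structure:
            actions.append(f"reextract_table:{position}")
    # Final gate: POOR documents wait for a human, everything else is indexed.
    if doc.mean_grade == "POOR":
        actions.append("hold_for_human_review")
    else:
        actions.append("embed_and_index")
    return actions


clean = ParsedDoc(True, "GOOD", 0.95, [Element("text")])
messy = ParsedDoc(
    True, "POOR", 0.40,
    [Element("figure"), Element("table", ambiguous_structure=True)],
)
clean_actions = route(clean)
messy_actions = route(messy)
```

A clean document takes a single action and never touches a vision API, which is the cost argument made above.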

5.4 The role of observability in ingestion
Observability in ingestion serves three distinct functional roles.
Role 1: Observability is the routing signal. The OCR escalation trigger (Pattern 2), the table-VLM fallback trigger (Pattern 3), and the human-review gate (Pattern 9) are all driven by Docling confidence metrics emitted as OTel spans. Without the telemetry, those routing decisions cannot be made. Observability is not decoration on the pipeline; it is structural plumbing.
Role 2: Observability is the early warning system. When a new class of scanned documents arrives from a new vendor, the escalation rate metric will spike before any user notices degraded answers. The distribution of ocr_score, layout_score, and mean_grade across your document corpus tells you where the parsers are struggling.
Role 3: Observability is the regression gate. Pattern 10 turns ingestion quality into a CI metric. Every pipeline change re-runs against the golden document set and compares extraction F1 and structural match against the approved baseline.
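A regression gate of this kind can be sketched as a per-document score compared against a stored baseline. The example below assumes a simple token-level F1 as the extraction metric and a hypothetical 0.01 tolerance; the document ids and texts are invented, and a real gate would add a structural comparison alongside it.

```python
from collections import Counter
from typing import Dict, List, Tuple


def token_f1(expected: str, actual: str) -> float:
    """Token-level F1 between the golden text and the extracted text."""
    expected_tokens = Counter(expected.split())
    actual_tokens = Counter(actual.split())
    overlap = sum((expected_tokens & actual_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(actual_tokens.values())
    recall = overlap / sum(expected_tokens.values())
    return 2 * precision * recall / (precision + recall)


def regression_failures(
    golden: Dict[str, str],
    extracted: Dict[str, str],
    baseline: Dict[str, float],
    tolerance: float = 0.01,
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) for every golden document that fell below its baseline."""
    failures = []
    for doc_id, expected in golden.items():
        score = token_f1(expected, extracted.get(doc_id, ""))
        if score < baseline[doc_id] - tolerance:
            failures.append((doc_id, score))
    return failures


# Invented golden set for illustration.
golden = {"invoice-001": "total due 100 EUR", "report-007": "revenue fell 12 percent"}
baseline = {"invoice-001": 1.0, "report-007": 1.0}
extracted = {"invoice-001": "total due 100 EUR", "report-007": "revenue fell"}

failures = regression_failures(golden, extracted, baseline)
```

In CI, a non-empty `failures` list would fail the build, which is what makes a parser upgrade's damage visible before it reaches the index.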
| Metric | Source | Alert condition |
|---|---|---|
| docling.conversion_status | ConversionStatus enum | Any FAILED count above threshold |
| docling.mean_grade | confidence.mean_grade | Distribution shift toward POOR/FAIR |
| docling.ocr_score | confidence.ocr_score | Mean below 0.7 for a source type |
| docling.escalation_rate | OCR routing logic | Sudden spike for a new source |
| embedding.cache_hit_rate | Redis cache | Drop below 60% (costly) |
| embedding.truncation_count | Pre-flight tiktoken check | Any non-zero in a batch |
| ingestion.quarantine_depth | Quarantine queue | Absolute threshold |
| ingestion.dedup_ratio | Hash dedup | Unexpected zero (may indicate the dedup step is not running or is hashing unnormalized input) |
| contextual_prepend.cache_hit_rate | LiteLLM prompt cache | Drop signals cost increase |
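Two of the alert conditions in the table can be sketched as plain functions over collected samples. This is illustrative only: the function name, the sample format, and the alert strings are assumptions, and a production setup would normally express these rules in the monitoring system itself.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, List, Tuple


def evaluate_alerts(
    ocr_samples: Iterable[Tuple[str, float]],
    truncation_count: int,
    ocr_threshold: float = 0.7,
) -> List[str]:
    """Check mean ocr_score per source type and the truncation counter."""
    alerts: List[str] = []
    by_source = defaultdict(list)
    for source_type, score in ocr_samples:
        by_source[source_type].append(score)
    for source_type in sorted(by_source):
        if mean(by_source[source_type]) < ocr_threshold:
            alerts.append(f"docling.ocr_score low for {source_type}")
    if truncation_count > 0:
        # The table treats any truncated embedding input as alert-worthy.
        alerts.append("embedding.truncation_count non-zero")
    return alerts


# Invented samples: (source type, ocr_score) per document.
samples = [
    ("scanned_invoice", 0.55),
    ("scanned_invoice", 0.65),
    ("born_digital_pdf", 0.98),
]
alerts = evaluate_alerts(samples, truncation_count=2)
```

Grouping by source type matters because a healthy corpus-wide average can hide one vendor's scans failing badly.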
5.5 The human-in-the-loop: when labelling is required
This is the part that most RAG implementations skip entirely, and it is the part that explains why some systems compound accuracy over time while others decay. The human’s role is not to replace automation — it is to curate the quality gates and training signals that automation cannot generate for itself.
The principle: label selectively, never in bulk. You never review every document. You review what the confidence signals flag. Steady-state human effort is low; it is front-loaded in bootstrapping and then sustained by a small, targeted review queue.
The three pipeline gates where humans can intervene:
- Gate 1 (Quality review before indexing): Documents with POOR grades are held. A human reviews in Argilla or Label Studio, corrects, and approves. The approved version goes to indexing. This prevents bad documents from ever entering the search index.
- Gate 2 (Spot-check of chunk quality): A random sample of chunks is surfaced in Argilla for a human reviewer to verify context enrichment, boundary quality, and cross-modal links. Identified problems become pattern updates.
- Gate 3 (Query-answer review): When a query returns a low-faithfulness or low-relevance score in online eval, a human reviews the trace — what was retrieved, what was generated — and annotates the correct retrieval. This becomes a golden query-answer pair.
The recommended annotation platform is Argilla (Apache-2.0) for AI-engineer and domain-expert collaboration on datasets. Label Studio (Apache-2.0) is preferred when reviewers need to work visually on the document image itself: bounding boxes, OCR regions, table cells.
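Selecting what reaches reviewers at Gates 1 and 2 can be sketched in a few lines. The function name, the grade dictionary, and the chunk ids are hypothetical, and pushing the resulting items into Argilla or Label Studio is deliberately left out.

```python
import random
from typing import Dict, List, Sequence, Tuple


def build_review_queue(
    doc_grades: Dict[str, str],
    chunk_ids: Sequence[str],
    sample_size: int,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Gate 1: hold every POOR document. Gate 2: draw a reproducible random chunk sample."""
    held = [doc_id for doc_id, grade in doc_grades.items() if grade == "POOR"]
    rng = random.Random(seed)  # seeded so the same sample can be re-drawn for audit
    sampled = rng.sample(list(chunk_ids), min(sample_size, len(chunk_ids)))
    return held, sampled


# Invented inputs for illustration.
grades = {"doc-a": "GOOD", "doc-b": "POOR", "doc-c": "FAIR"}
chunk_ids = [f"chunk-{i}" for i in range(20)]
held, sampled = build_review_queue(grades, chunk_ids, sample_size=5, seed=42)
```

Reviewer effort stays bounded: the held list grows only with flagged documents, and the spot-check size is fixed regardless of corpus size.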