Advanced RAG

Part 8 — Technology Stack, Decisions, and What Makes a Production System

What does the confirmed production stack look like, what was deliberately traded away, and what separates a production system from a demo?

12 min · Updated June 2026

9.1 The confirmed stack

The technology stack is not a list of tools. Every choice reflects a tradeoff, and every tradeoff was made explicitly rather than by default.

Layer	Technology	Tradeoff accepted
Document parsing	Docling 2.x (MIT)	Best-in-class fidelity; slightly slower than lightweight alternatives
OCR escalation	Docling built-in OCR + OpenAI/Claude vision API	No GPU required; API cost on escalation
Audio transcription	faster-whisper / WhisperX	CPU-only; diarization available; slower than GPU-accelerated
LLM generation	OpenAI + Anthropic Claude via LiteLLM	Data transits third-party APIs; requires ZDR enterprise tiers
Embeddings	OpenAI text-embedding-3-large via LiteLLM	Dense only — no late chunking, no learned sparse; BM25 covers lexical channel
Sparse retrieval	BM25 (Qdrant native)	No learned sparse (SPLADE would need self-hosting); lexical recall is strong
Reranking	BGE-Reranker-v2-m3 (CPU container)	One self-hosted container; zero per-call token cost
Vector database	Qdrant 1.x (Apache-2)	Best-in-class hybrid + payload filtering; Milvus planned at >500M vectors
Knowledge graph	Kuzu (MIT)	Embedded, columnar, CPU-only; FalkorDB/Neo4j as alternatives
Chunking	Contextual retrieval (LLM-prepend) + semantic chunking	Late chunking unavailable with API embeddings; contextual retrieval is highest-accuracy-per-effort alternative
Agentic orchestration	LangGraph 1.0.x (MIT)	Deepest tracing, durable execution, most active development
Prompt optimization	DSPy (MIT)	Compiles prompts against metrics; requires golden sets
Ingestion pipeline	Dagster + Kafka/Redpanda	Asset-oriented model; strong observability hooks
Observability / telemetry	OpenTelemetry Collector + Langfuse	Full LLM-native tracing; Arize Phoenix for embedding drift
CI evaluation	Ragas + DeepEval + pytest	Ragas for RAG-specific metrics; DeepEval for agentic/multimodal metrics
Human review	Argilla (Apache-2)	Primary annotation platform; Label Studio for image-level annotation
Deployment	Kubernetes + Argo CD + KEDA	CPU-only cluster; GitOps; queue-driven autoscaling for ingestion

9.2 Consequential tradeoffs explained

OpenAI embeddings vs. self-hosted BGE-M3

This is the most consequential technical tradeoff. BGE-M3 is a single self-hosted model that produces dense vectors, learned sparse vectors, and multi-vector (ColBERT-style) representations. Running the embedding model in-cluster also exposes token-level outputs, which late chunking needs, and establishes the model-serving path that ColPali visual retrieval would require. OpenAI text-embedding-3 returns one dense vector per input and nothing else. Choosing OpenAI embeddings on a CPU-only, API-first cluster therefore gives up three accuracy techniques: late chunking, learned sparse retrieval (SPLADE-style), and ColPali.

The mitigation: contextual retrieval is promoted to the primary chunking strategy, BM25 covers the lexical channel, and ColPali is deferred but preserved via a reserved vector slot. The net accuracy difference is real but manageable for most domains; the operational simplicity gain is significant. If a domain turns out to be very visual or very dependent on rare technical terms, reintroducing self-hosted BGE-M3 is a container deployment and config change, not a rewrite.
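How the dense and BM25 channels are merged is not specified in this part. One common option is reciprocal rank fusion (RRF), and the sketch below assumes that choice. The function name, the `k=60` constant, and the chunk IDs are illustrative, not taken from the stack described here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID for a deterministic order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c3", "c1", "c7"]  # hypothetical dense-channel results
bm25_hits = ["c1", "c9", "c3"]   # hypothetical BM25-channel results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Chunks that both channels return rise to the top, which is the property that makes a lexical channel a useful backstop for rare technical terms that a dense-only index can miss.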

API-first vs. self-hosted LLM generation

Commercial API generation means documents and queries transit third-party infrastructure. This requires contractual zero-data-retention enterprise tiers, PII redaction before API egress, and a per-tenant data-class allow-list. The compensating benefit: no GPU provisioning, no model serving infrastructure, no quantization management. The LiteLLM interface means self-hosted vLLM is always a config change away for tenants with stricter data residency requirements.
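The egress controls described above can be sketched as a single gate in front of the API client. This is a minimal illustration under stated assumptions: the tenant names, data classes, and regex patterns are hypothetical, and the toy `redact` function stands in for a real PII detector such as Presidio.

```python
import re
from dataclasses import dataclass

# Hypothetical per-tenant allow-list: which data classes may leave the cluster.
TENANT_ALLOWED_CLASSES = {
    "tenant-a": {"public", "internal"},
    "tenant-b": {"public"},
}

# Toy patterns standing in for a real PII detector.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


@dataclass
class EgressDecision:
    allowed: bool
    route: str    # "commercial_api" or "self_hosted"
    payload: str  # redacted text when allowed, empty otherwise


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = _EMAIL.sub("<EMAIL>", text)
    return _PHONE.sub("<PHONE>", text)


def gate_egress(tenant_id: str, data_class: str, text: str) -> EgressDecision:
    """Allow API egress only for data classes on the tenant's allow-list."""
    allowed_classes = TENANT_ALLOWED_CLASSES.get(tenant_id, set())
    if data_class not in allowed_classes:
        # Assumed fallback: route to a self-hosted model; nothing is sent out.
        return EgressDecision(False, "self_hosted", "")
    return EgressDecision(True, "commercial_api", redact(text))


decision = gate_egress(
    "tenant-a", "internal", "Mail jane.doe@example.com or call +1 415 555 0100"
)
```

Unknown tenants get an empty allow-list, so the gate fails closed: anything not explicitly permitted stays inside the cluster.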

Contextual retrieval as the primary chunking strategy

Because late chunking is unavailable with dense-only API embeddings, contextual retrieval is the strongest accuracy lever remaining. It uses one LLM call per chunk to prepend a short situating context to that chunk before embedding. With prompt caching enabled on both the OpenAI and Anthropic routes, the cost is contained: the document-level context is cached as a shared prompt prefix, and only the chunk-specific content varies between calls. Without prompt caching, contextual retrieval at scale would be prohibitively expensive.
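A minimal sketch of the prepend step follows. It keeps the document text in a prefix that is identical for every chunk, which is what makes prefix caching possible. The instruction wording, the function names, and the `llm` callable are assumptions for illustration; how a prefix is marked as cacheable is provider-specific and not shown.

```python
# Hypothetical instruction text; the wording is an assumption.
INSTRUCTION = (
    "Give a short context that situates this chunk within the document, "
    "to improve search retrieval of the chunk. Answer with the context only."
)


def build_prompt(document: str, chunk: str) -> tuple[str, str]:
    """Split the prompt into a cacheable prefix and a per-chunk suffix."""
    prefix = f"<document>\n{document}\n</document>\n"
    suffix = f"<chunk>\n{chunk}\n</chunk>\n{INSTRUCTION}"
    return prefix, suffix


def contextualize(document, chunks, llm):
    """Prepend an LLM-written situating context to each chunk before embedding."""
    contextualized = []
    for chunk in chunks:
        prefix, suffix = build_prompt(document, chunk)
        context = llm(prefix, suffix).strip()
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized


def fake_llm(prefix: str, suffix: str) -> str:
    """Stand-in for a real model call; returns a fixed context string."""
    return "From the Q3 report, revenue section."


chunks = contextualize("full document text", ["Revenue grew 3%."], fake_llm)
```

Because the prefix depends only on the document, every chunk of the same document reuses it, and only the short suffix changes from call to call.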

10. What separates a production system from a demo

Figure: a demo RAG system contrasted with a production system across five dimensions: uncertainty awareness, quality gates, multi-tenancy, observability loops, and human review.

The list of technical components is not what separates them. A demo can use the same vector database, the same LLM API, the same chunking library. What separates them is architectural discipline in five specific areas.

1. The system knows its own uncertainty. A demo retrieves and generates. A production system measures, at every step, how confident it is in what it retrieved and what it generated. Docling’s confidence grades, reranker scores, CRAG grading, Self-RAG faithfulness checks — these are the signals that drive every quality gate, routing decision, and human-review trigger.

2. The system has explicit failure modes with explicit mitigations. Every pattern in this series exists to kill a named failure mode. A demo does not enumerate failure modes. A production system maps every known failure to a specific pattern, measures whether the pattern is working, and alerts when it is not.

3. The ingestion plane is treated as a first-class accuracy layer. In a demo, ingestion is one afternoon’s work. In a production system, ingestion is where most accuracy is actually determined — and it is instrumented, confidence-graded, routed, and human-reviewed with the same rigour as retrieval and generation.

4. Humans are structural components of the system, not an afterthought. A demo has no human-in-the-loop. A production system routes low-confidence documents, detected hallucinations, and failed retrievals to a structured human-review workflow and feeds corrections back as golden data that improves future performance. The human effort is targeted, not bulk — and it compounds, because each correction makes the confidence thresholds more accurate.

5. Accuracy is an enforced CI constraint, not a subjective judgment. A demo reports accuracy as a snapshot at demo time. A production system encodes the acceptable performance floor as a pytest/DeepEval gate that blocks every merge. Accuracy is maintained automatically rather than hoped for manually.
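The CI constraint in point 5 can be sketched as an ordinary pytest test. The floors and the metric values below are placeholders, and `load_metrics` is a hypothetical stand-in for running the evaluation suite over the golden set.

```python
# test_rag_quality_gate.py: collected by pytest; fails the build below the floor.

# Hypothetical floors; real values come from the team's accepted baseline.
FLOORS = {"context_recall": 0.80, "faithfulness": 0.90}


def load_metrics():
    """Stand-in for running the eval suite over the golden set (placeholder values)."""
    return {"context_recall": 0.86, "faithfulness": 0.93}


def violations(metrics, floors):
    """Return the metrics that are missing or below their floor."""
    return {
        name: metrics.get(name)
        for name, floor in floors.items()
        if metrics.get(name) is None or metrics[name] < floor
    }


def test_quality_gate():
    failed = violations(load_metrics(), FLOORS)
    assert not failed, f"metrics below floor: {failed}"
```

Treating a missing metric as a violation matters: a gate that passes when the evaluation silently produced nothing is not a gate.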

The honest summary: anyone can assemble the components. What cannot be assembled in an afternoon is the routing logic for failure modes, the confidence-gated human-review loop, the compounding accuracy flywheel, the CI enforcement, and the architectural discipline that keeps every component behind a stable interface so the stack can evolve without a rebuild. That is what a production-grade multimodal hybrid agentic RAG system actually is.
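As a rough illustration of confidence-gated routing, the sketch below maps three signals to a routing decision. The signal names, the 0 to 1 scaling, the thresholds, and the order of the checks are all assumptions; a real system would calibrate them against reviewed examples.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ANSWER = "answer"
    RETRY_RETRIEVAL = "retry_retrieval"
    HUMAN_REVIEW = "human_review"


@dataclass
class Signals:
    parse_confidence: float  # parser confidence grade mapped to 0..1 (assumed scale)
    top_rerank_score: float  # best reranker score, 0..1
    faithfulness: float      # faithfulness check on the draft answer, 0..1


def route(signals: Signals, min_parse=0.7, min_rerank=0.5, min_faith=0.8) -> Route:
    """Pick the next step from confidence signals; thresholds are hypothetical."""
    if signals.parse_confidence < min_parse:
        # The source document itself is suspect, so a person should look at it.
        return Route.HUMAN_REVIEW
    if signals.top_rerank_score < min_rerank:
        # Nothing relevant was retrieved; try a corrective retrieval pass first.
        return Route.RETRY_RETRIEVAL
    if signals.faithfulness < min_faith:
        # The draft answer is not supported by the retrieved context.
        return Route.HUMAN_REVIEW
    return Route.ANSWER


decision = route(Signals(parse_confidence=0.9, top_rerank_score=0.3, faithfulness=0.95))
```

Corrections made during human review can then be added to the golden set, which is how the thresholds get recalibrated over time.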

Appendix A: Open-source tool reference

Tool	Category	License	Role
Docling 2.x	Parsing	MIT	Primary document parser; confidence scoring
faster-whisper / WhisperX	ASR	MIT / BSD-2-Clause	Audio/video transcription and diarization
tree-sitter	Code parsing	MIT	AST-aware code chunking
Trafilatura	HTML extraction	Apache-2	Web content extraction
Chonkie	Chunking	MIT	Modality-aware chunking
Qdrant	Vector DB	Apache-2	Primary vector store; hybrid + named vectors
Kuzu	Knowledge graph	MIT	Embedded graph DB for GraphRAG
LangGraph 1.0.x	Agentic orchestration	MIT	Stateful multi-agent graph
LiteLLM	Model gateway	MIT	Unified API gateway; per-role routing
Dagster	Pipeline orchestration	Apache-2	Asset-oriented ingestion pipeline
Kafka / Redpanda	Message bus	Apache-2 / BSL 1.1	Streaming ingestion intake
KEDA	Autoscaling	Apache-2	Queue-driven worker autoscaling
Langfuse	Tracing / eval	MIT (self-hosted)	LLM-native trace, prompt, eval backend
Arize Phoenix	Embedding drift	Elastic-2	Embedding-space cluster visualization
OpenTelemetry Collector	Telemetry	Apache-2	Span collection and fan-out
Ragas	RAG evaluation	Apache-2	Context recall, faithfulness, relevancy
DeepEval	LLM evaluation	Apache-2	CI gate; 50+ metrics; agentic eval
Argilla	Human review	Apache-2	Annotation, golden dataset curation
Label Studio	Image annotation	Apache-2	Visual/OCR-level document annotation
DSPy	Prompt optimization	MIT	Metric-driven prompt compilation
Presidio	PII detection	MIT	PII redaction before API egress
BGE-Reranker-v2-m3	Reranking	Apache-2	Cross-encoder reranker (CPU container)
PostgreSQL 16+	Metadata / audit	PostgreSQL	Tenants, ACLs, ingestion state, episodic log
Redis / Valkey	Cache / state	RSALv2 / SSPLv1 / AGPLv3 (Redis 8+) / BSD-3-Clause (Valkey)	Embedding cache, query cache, rate limiting
Argo CD / Argo Workflows	GitOps / batch	Apache-2	Declarative deployment; batch job orchestration

Appendix B: Key empirical claims with attribution

These numbers appear throughout the series. They are sourced from the foundational research and should be treated as direction-of-effect estimates, not precise constants — results vary by domain and dataset.

Hybrid retrieval beats dense-only and sparse-only retrieval: the direction is consistent across all benchmark configurations; the margin varies, with a context recall improvement of roughly 10-25% being typical.
Contextual retrieval: 67% fewer retrieval errors (Anthropic, 2024) when combined with BM25 and a reranker.
GraphRAG: approximately 13% worse than vanilla RAG on simple factoid QA (Han et al., 2025).
Edge intent classification: approximately 40% cost reduction and approximately 35% latency reduction, as reported in the Adaptive RAG literature on complexity-based routing.
Agentic multipliers: Adaptive approximately 1.5-2x, Self-RAG approximately 2-3x, CRAG approximately 3-5x, multi-hop ReAct approximately 4-10x. Exact values are system-specific.
Embedding fine-tuning on domain data: 10-30% retrieval improvement, as reported in the literature on domain-specific embedding adaptation.
Late chunking: Jina AI, 2024 (attribution for the technique; no figure is quoted here).
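To see how the agentic multipliers combine, the arithmetic below blends them over a traffic mix. The ranges are the ones listed above; the traffic shares are invented for illustration, and the baseline of 1.0 for a single retrieve-and-generate pass is an assumption about what the multipliers are measured against.

```python
# (low, high) multipliers relative to an assumed single-pass baseline of 1.0.
MULTIPLIER_RANGES = {
    "vanilla": (1.0, 1.0),
    "adaptive": (1.5, 2.0),
    "self_rag": (2.0, 3.0),
    "crag": (3.0, 5.0),
    "multi_hop_react": (4.0, 10.0),
}


def blended_multiplier(traffic_mix):
    """Return the (low, high) blended multiplier for a traffic mix summing to 1."""
    total = sum(traffic_mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"traffic shares must sum to 1, got {total}")
    low = sum(share * MULTIPLIER_RANGES[p][0] for p, share in traffic_mix.items())
    high = sum(share * MULTIPLIER_RANGES[p][1] for p, share in traffic_mix.items())
    return low, high


# Hypothetical mix: most queries are simple, a few need multi-hop reasoning.
mix = {
    "vanilla": 0.60,
    "adaptive": 0.20,
    "self_rag": 0.10,
    "crag": 0.05,
    "multi_hop_react": 0.05,
}
low, high = blended_multiplier(mix)  # roughly 1.45 to 2.05 for this mix
```

Under this invented mix the blended overhead stays near 1.5-2x even though the most expensive pattern costs up to 10x, because only a small share of traffic reaches it. That is the arithmetic behind routing by query complexity.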

This knowledge article documents the design space of enterprise-grade multimodal hybrid agentic RAG as of June 2026. Framework versions, API capabilities, and benchmark results evolve — the design principles and failure-mode taxonomy are the durable parts.