QRefAI
Contents
Advanced RAG

Part 8 — Technology Stack, Decisions, and What Makes a Production System

What does the confirmed production stack look like, what was deliberately traded away, and what separates a production system from a demo?

12 min · Updated June 2026

9.1 The confirmed stack

The technology stack is not a list of tools. Every choice reflects a tradeoff, and every tradeoff was made explicitly rather than by default.

LayerTechnologyTradeoff accepted
Document parsingDocling 2.x (MIT)Best-in-class fidelity; slightly slower than lightweight alternatives
OCR escalationDocling built-in OCR + OpenAI/Claude vision APINo GPU required; API cost on escalation
Audio transcriptionfaster-whisper / WhisperXCPU-only; diarization available; slower than GPU-accelerated
LLM generationOpenAI + Anthropic Claude via LiteLLMData transits third-party APIs; requires ZDR enterprise tiers
EmbeddingsOpenAI text-embedding-3-large via LiteLLMDense only — no late chunking, no learned sparse; BM25 covers lexical channel
Sparse retrievalBM25 (Qdrant native)No learned sparse (SPLADE would need self-hosting); lexical recall is strong
RerankingBGE-Reranker-v2-m3 (CPU container)One self-hosted container; zero per-call token cost
Vector databaseQdrant 1.x (Apache-2)Best-in-class hybrid + payload filtering; Milvus planned at >500M vectors
Knowledge graphKuzu (MIT)Embedded, columnar, CPU-only; FalkorDB/Neo4j as alternatives
ChunkingContextual retrieval (LLM-prepend) + semantic chunkingLate chunking unavailable with API embeddings; contextual retrieval is highest-accuracy-per-effort alternative
Agentic orchestrationLangGraph 1.0.x (MIT)Deepest tracing, durable execution, most active development
Prompt optimizationDSPy (MIT)Compiles prompts against metrics; requires golden sets
Ingestion pipelineDagster + Kafka/RedpandaAsset-oriented model; strong observability hooks
Observability / telemetryOpenTelemetry Collector + LangfuseFull LLM-native tracing; Arize Phoenix for embedding drift
CI evaluationRagas + DeepEval + pytestRagas for RAG-specific metrics; DeepEval for agentic/multimodal metrics
Human reviewArgilla (Apache-2)Primary annotation platform; Label Studio for image-level annotation
DeploymentKubernetes + Argo CD + KEDACPU-only cluster; GitOps; queue-driven autoscaling for ingestion

9.2 Consequential tradeoffs explained

OpenAI embeddings vs. self-hosted BGE-M3

This is the most consequential technical tradeoff. BGE-M3 is a single model that produces dense vectors, sparse vectors (learned sparse), and multi-vector representations — enabling late chunking, SPLADE-based hybrid retrieval, and ColPali visual retrieval. OpenAI text-embedding-3 produces dense vectors only. The choice of OpenAI embeddings eliminates three accuracy techniques: late chunking, learned sparse (SPLADE), and ColPali.

The mitigation: contextual retrieval is promoted to the primary chunking strategy, BM25 covers the lexical channel, and ColPali is deferred but preserved via a reserved vector slot. The net accuracy difference is real but manageable for most domains; the operational simplicity gain is significant. If a domain turns out to be very visual or very dependent on rare technical terms, reintroducing self-hosted BGE-M3 is a container deployment and config change, not a rewrite.

API-first vs. self-hosted LLM generation

Commercial API generation means documents and queries transit third-party infrastructure. This requires contractual zero-data-retention enterprise tiers, PII redaction before API egress, and a per-tenant data-class allow-list. The compensating benefit: no GPU provisioning, no model serving infrastructure, no quantization management. The LiteLLM interface means self-hosted vLLM is always a config change away for tenants with stricter data residency requirements.

Contextual retrieval as the primary chunking strategy

Because late chunking is unavailable with dense-only API embeddings, contextual retrieval is the strongest accuracy lever remaining. It uses an LLM call to prepend a situating context to every chunk. With prompt caching enabled on both the OpenAI and Anthropic routes, this is cost-mitigated — the document’s structural context is cached, and only the chunk-specific content varies. Without prompt caching, contextual retrieval at scale would be prohibitively expensive.

10. What separates a production system from a demo

Diagram contrasting a demo RAG system with a production system across five dimensions: uncertainty awareness, quality gates, multi-tenancy, observability loops, and human review

The list of technical components is not what separates them. A demo can use the same vector database, the same LLM API, the same chunking library. What separates them is architectural discipline in five specific areas.

1. The system knows its own uncertainty. A demo retrieves and generates. A production system measures, at every step, how confident it is in what it retrieved and what it generated. Docling’s confidence grades, reranker scores, CRAG grading, Self-RAG faithfulness checks — these are the signals that drive every quality gate, routing decision, and human-review trigger.

2. The system has explicit failure modes with explicit mitigations. Every pattern in this series exists to kill a named failure mode. A demo does not enumerate failure modes. A production system maps every known failure to a specific pattern, measures whether the pattern is working, and alerts when it is not.

3. The ingestion plane is treated as a first-class accuracy layer. In a demo, ingestion is one afternoon’s work. In a production system, ingestion is where most accuracy is actually determined — and it is instrumented, confidence-graded, routed, and human-reviewed with the same rigour as retrieval and generation.

4. Humans are structural components of the system, not an afterthought. A demo has no human-in-the-loop. A production system routes low-confidence documents, detected hallucinations, and failed retrievals to a structured human-review workflow and feeds corrections back as golden data that improves future performance. The human effort is targeted, not bulk — and it compounds, because each correction makes the confidence thresholds more accurate.

5. Accuracy is an enforced CI constraint, not a subjective judgment. A demo reports accuracy as a snapshot at demo time. A production system encodes the acceptable performance floor as a pytest/DeepEval gate that blocks every merge. Accuracy is maintained automatically rather than hoped for manually.

The honest summary: anyone can assemble the components. What cannot be assembled in an afternoon is the routing logic for failure modes, the confidence-gated human-review loop, the compound accuracy flywheel, the CI enforcement, and the architectural discipline that keeps every component behind a stable interface so the stack can evolve without rebuilding. That is what a production-grade multimodal hybrid agentic RAG system actually is.

Appendix A: Open-source tool reference

ToolCategoryLicenseRole
Docling 2.xParsingMITPrimary document parser; confidence scoring
faster-whisper / WhisperXASRMIT / MITAudio/video transcription and diarization
tree-sitterCode parsingMITAST-aware code chunking
TrafilaturaHTML extractionApache-2Web content extraction
ChonkieChunkingApache-2Modality-aware chunking
QdrantVector DBApache-2Primary vector store; hybrid + named vectors
KuzuKnowledge graphMITEmbedded graph DB for GraphRAG
LangGraph 1.0.xAgentic orchestrationMITStateful multi-agent graph
LiteLLMModel gatewayMITUnified API gateway; per-role routing
DagsterPipeline orchestrationApache-2Asset-oriented ingestion pipeline
Kafka / RedpandaMessage busApache-2Streaming ingestion intake
KEDAAutoscalingApache-2Queue-driven worker autoscaling
LangfuseTracing / evalMIT (self-hosted)LLM-native trace, prompt, eval backend
Arize PhoenixEmbedding driftElastic-2Embedding-space cluster visualization
OpenTelemetry CollectorTelemetryApache-2Span collection and fan-out
RagasRAG evaluationApache-2Context recall, faithfulness, relevancy
DeepEvalLLM evaluationApache-2CI gate; 50+ metrics; agentic eval
ArgillaHuman reviewApache-2Annotation, golden dataset curation
Label StudioImage annotationApache-2Visual/OCR-level document annotation
DSPyPrompt optimizationMITMetric-driven prompt compilation
PresidioPII detectionMITPII redaction before API egress
BGE-Reranker-v2-m3RerankingMITCross-encoder reranker (CPU container)
PostgreSQL 16+Metadata / auditPostgreSQLTenants, ACLs, ingestion state, episodic log
Redis / ValkeyCache / stateBSD / Apache-2Embedding cache, query cache, rate limiting
Argo CD / Argo WorkflowsGitOps / batchApache-2Declarative deployment; batch job orchestration

Appendix B: Key empirical claims with attribution

These numbers appear throughout the series. They are sourced from the foundational research and should be treated as direction-of-effect estimates, not precise constants — results vary by domain and dataset.

  • Hybrid retrieval beats dense-only or sparse-only: consistent across all benchmark configurations; margin varies, approximately 10-25% context recall improvement typical.
  • Contextual retrieval: -67% retrieval errors (Anthropic, 2024) when combined with BM25 and a reranker.
  • GraphRAG approximately 13% worse than vanilla RAG on simple factoid QA (Han et al., 2025).
  • Edge intent classification: approximately 40% cost reduction, approximately 35% latency reduction from Adaptive RAG literature on complexity-based routing.
  • Agentic multipliers: Adaptive approximately 1.5-2x, Self-RAG approximately 2-3x, CRAG approximately 3-5x, multi-hop ReAct approximately 4-10x. Exact values are system-specific.
  • Embedding fine-tuning on domain data: 10-30% retrieval improvement from domain-specific embedding adaptation literature.
  • Late chunking: Jina AI, 2024.

This knowledge article documents the design space of enterprise-grade multimodal hybrid agentic RAG as of June 2026. Framework versions, API capabilities, and benchmark results evolve — the design principles and failure-mode taxonomy are the durable parts.