Part 8 — Technology Stack, Decisions, and What Makes a Production System
What does the confirmed production stack look like, what was deliberately traded away, and what separates a production system from a demo?
12 min · Updated June 2026
9.1 The confirmed stack
The technology stack is not a list of tools. Every choice reflects a tradeoff, and every tradeoff was made explicitly rather than by default.
| Layer | Technology | Tradeoff accepted |
|---|---|---|
| Document parsing | Docling 2.x (MIT) | Best-in-class fidelity; slightly slower than lightweight alternatives |
| OCR escalation | Docling built-in OCR + OpenAI/Claude vision API | No GPU required; API cost on escalation |
| Audio transcription | faster-whisper / WhisperX | CPU-only; diarization available; slower than GPU-accelerated |
| LLM generation | OpenAI + Anthropic Claude via LiteLLM | Data transits third-party APIs; requires ZDR enterprise tiers |
| Embeddings | OpenAI text-embedding-3-large via LiteLLM | Dense only — no late chunking, no learned sparse; BM25 covers lexical channel |
| Sparse retrieval | BM25 (Qdrant native) | No learned sparse (SPLADE would need self-hosting); lexical recall is strong |
| Reranking | BGE-Reranker-v2-m3 (CPU container) | One self-hosted container; zero per-call token cost |
| Vector database | Qdrant 1.x (Apache-2) | Best-in-class hybrid + payload filtering; Milvus planned at >500M vectors |
| Knowledge graph | Kuzu (MIT) | Embedded, columnar, CPU-only; FalkorDB/Neo4j as alternatives |
| Chunking | Contextual retrieval (LLM-prepend) + semantic chunking | Late chunking unavailable with API embeddings; contextual retrieval is highest-accuracy-per-effort alternative |
| Agentic orchestration | LangGraph 1.0.x (MIT) | Deepest tracing, durable execution, most active development |
| Prompt optimization | DSPy (MIT) | Compiles prompts against metrics; requires golden sets |
| Ingestion pipeline | Dagster + Kafka/Redpanda | Asset-oriented model; strong observability hooks |
| Observability / telemetry | OpenTelemetry Collector + Langfuse | Full LLM-native tracing; Arize Phoenix for embedding drift |
| CI evaluation | Ragas + DeepEval + pytest | Ragas for RAG-specific metrics; DeepEval for agentic/multimodal metrics |
| Human review | Argilla (Apache-2) | Primary annotation platform; Label Studio for image-level annotation |
| Deployment | Kubernetes + Argo CD + KEDA | CPU-only cluster; GitOps; queue-driven autoscaling for ingestion |
9.2 Consequential tradeoffs explained
OpenAI embeddings vs. self-hosted BGE-M3
This is the most consequential technical tradeoff. BGE-M3 is a single self-hosted model that produces dense vectors, learned sparse vectors, and multi-vector representations. Self-hosting the embedding layer is also what makes late chunking (which needs token-level embeddings before pooling) and ColPali visual retrieval (a separate vision-language retriever) practical. OpenAI text-embedding-3 produces dense vectors only. Choosing OpenAI embeddings therefore gives up three accuracy techniques: late chunking, learned sparse retrieval (SPLADE-style), and ColPali.
The mitigation: contextual retrieval is promoted to the primary chunking strategy, BM25 covers the lexical channel, and ColPali is deferred but preserved via a reserved vector slot. The net accuracy difference is real but manageable for most domains; the operational simplicity gain is significant. If a domain turns out to be very visual or very dependent on rare technical terms, reintroducing self-hosted BGE-M3 is a container deployment and config change, not a rewrite.
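As a rough illustration of how the dense channel and the BM25 channel can be combined, the sketch below fuses two ranked lists with reciprocal rank fusion (RRF). The source does not say which fusion method the stack uses, so RRF, the constant `k=60`, and the document IDs are all assumptions made for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default (assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from the two retrieval channels.
dense_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents that rank well in both channels (`doc_a`, `doc_c`) rise above those found by only one, which is the behaviour hybrid retrieval relies on before the reranker sees the candidates.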
API-first vs. self-hosted LLM generation
Commercial API generation means documents and queries transit third-party infrastructure. This requires contractual zero-data-retention enterprise tiers, PII redaction before API egress, and a per-tenant data-class allow-list. The compensating benefit: no GPU provisioning, no model serving infrastructure, no quantization management. The LiteLLM interface means self-hosted vLLM is always a config change away for tenants with stricter data residency requirements.
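The egress controls described above can be pictured as a single gate in front of the model gateway. The following is a minimal sketch only: the tenant names, data classes, route labels, and regex patterns are hypothetical, and the regexes are a crude stand-in for a real PII detector such as Presidio.

```python
import re
from dataclasses import dataclass

# Hypothetical per-tenant allow-list: data classes permitted to leave the cluster.
TENANT_ALLOWED_CLASSES = {
    "tenant_a": {"public", "internal"},
    "tenant_b": {"public"},
}

# Crude regex stand-ins for a real PII detector (assumption for illustration).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

@dataclass
class EgressDecision:
    route: str  # "commercial_api" or "self_hosted" (hypothetical labels)
    text: str   # text as it would be sent on that route

def redact(text: str) -> str:
    """Replace each detected PII span with a placeholder such as <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

def gate_egress(tenant: str, data_class: str, text: str) -> EgressDecision:
    """Decide whether text may go to a commercial API, redacting it if so."""
    allowed = TENANT_ALLOWED_CLASSES.get(tenant, set())
    if data_class not in allowed:
        # Not allow-listed: keep the request on self-hosted infrastructure.
        return EgressDecision("self_hosted", text)
    return EgressDecision("commercial_api", redact(text))
```

Unknown tenants fall through to the self-hosted route here, which is a deliberately conservative default chosen for the sketch rather than something the source specifies.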
Contextual retrieval as the primary chunking strategy
Because late chunking is unavailable with dense-only API embeddings, contextual retrieval is the strongest accuracy lever remaining. It uses an LLM call to prepend a situating context to every chunk. With prompt caching enabled on both the OpenAI and Anthropic routes, the cost stays manageable: the document’s structural context is cached, and only the chunk-specific content varies. Without prompt caching, contextual retrieval at scale would be prohibitively expensive.
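To make the caching argument concrete, here is a minimal sketch of how a contextual-retrieval prompt can be split into a stable prefix (the whole document) and a varying suffix (one chunk plus the instruction). The instruction wording, the function names, and the `llm` callable are illustrative assumptions; how the prefix is actually marked as cacheable is provider-specific and not shown.

```python
from typing import Callable

# Illustrative instruction text, not a quoted production prompt.
CONTEXT_INSTRUCTION = (
    "Give a short context that situates this chunk within the document "
    "above, to improve search retrieval. Answer with the context only."
)

def build_prompt(document: str, chunk: str) -> tuple[str, str]:
    """Return (cacheable_prefix, variable_suffix) for one chunk."""
    prefix = f"<document>\n{document}\n</document>\n"
    suffix = f"<chunk>\n{chunk}\n</chunk>\n{CONTEXT_INSTRUCTION}"
    return prefix, suffix

def contextualize(
    document: str,
    chunks: list[str],
    llm: Callable[[str, str], str],
) -> list[str]:
    """Prepend an LLM-written situating context to every chunk."""
    contextualized = []
    for chunk in chunks:
        prefix, suffix = build_prompt(document, chunk)
        context = llm(prefix, suffix)  # prefix is identical on every call
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized

def fake_llm(prefix: str, suffix: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "From the Q2 finance report."

chunks_out = contextualize(
    "Q2 report text", ["Revenue grew 3%.", "Costs fell."], fake_llm
)
```

Because the prefix is byte-identical across all chunks of a document, it is the part a prompt cache can reuse; only the short suffix is new on each call.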
10. What separates a production system from a demo

The list of technical components is not what separates a production system from a demo. A demo can use the same vector database, the same LLM API, the same chunking library. What separates them is architectural discipline in five specific areas.
1. The system knows its own uncertainty. A demo retrieves and generates. A production system measures, at every step, how confident it is in what it retrieved and what it generated. Docling’s confidence grades, reranker scores, CRAG grading, Self-RAG faithfulness checks — these are the signals that drive every quality gate, routing decision, and human-review trigger.
2. The system has explicit failure modes with explicit mitigations. Every pattern in this series exists to kill a named failure mode. A demo does not enumerate failure modes. A production system maps every known failure to a specific pattern, measures whether the pattern is working, and alerts when it is not.
3. The ingestion plane is treated as a first-class accuracy layer. In a demo, ingestion is one afternoon’s work. In a production system, ingestion is where most accuracy is actually determined — and it is instrumented, confidence-graded, routed, and human-reviewed with the same rigour as retrieval and generation.
4. Humans are structural components of the system, not an afterthought. A demo has no human-in-the-loop. A production system routes low-confidence documents, detected hallucinations, and failed retrievals to a structured human-review workflow and feeds corrections back as golden data that improves future performance. The human effort is targeted, not bulk — and it compounds, because each correction makes the confidence thresholds more accurate.
5. Accuracy is an enforced CI constraint, not a subjective judgment. A demo reports accuracy as a snapshot at demo time. A production system encodes the acceptable performance floor as a pytest/DeepEval gate that blocks every merge. Accuracy is maintained automatically rather than hoped for manually.
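Point 5 can be sketched as an ordinary pytest-style test that fails when any metric drops below its floor. The metric names, floor values, and hard-coded scores below are hypothetical; in a real pipeline the scores would come from running an evaluator such as Ragas or DeepEval over the golden set.

```python
# Hypothetical floors; real values would come from the golden-set baseline.
FLOORS = {"faithfulness": 0.85, "context_recall": 0.80}

def check_floors(
    scores: dict[str, float],
    floors: dict[str, float] = FLOORS,
) -> list[str]:
    """Return human-readable violations; an empty list means the gate passes."""
    violations = []
    for metric, floor in floors.items():
        value = scores.get(metric)
        if value is None:
            # A metric that was not computed counts as a failure, not a pass.
            violations.append(f"{metric}: missing")
        elif value < floor:
            violations.append(f"{metric}: {value:.2f} < floor {floor:.2f}")
    return violations

def test_accuracy_floor():
    # Placeholder scores standing in for a real evaluation run.
    scores = {"faithfulness": 0.91, "context_recall": 0.83}
    assert check_floors(scores) == []
```

Treating a missing metric as a violation is a design choice in this sketch: it prevents a broken evaluation job from silently passing the merge gate.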
The honest summary: anyone can assemble the components. What cannot be assembled in an afternoon is the routing logic for failure modes, the confidence-gated human-review loop, the compound accuracy flywheel, the CI enforcement, and the architectural discipline that keeps every component behind a stable interface so the stack can evolve without rebuilding. That is what a production-grade multimodal hybrid agentic RAG system actually is.
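A confidence-gated review loop of the kind summarised above might look like the following in its simplest form. The signal names, thresholds, and route labels are hypothetical, and a real system would calibrate the thresholds against reviewed data rather than hard-code them.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    parse_confidence: float  # parser confidence for the source document, 0..1
    top_rerank_score: float  # best reranker score among retrieved chunks, 0..1
    faithfulness: float      # post-generation faithfulness check, 0..1

# Hypothetical thresholds for illustration only.
THRESHOLDS = {"parse": 0.7, "rerank": 0.5, "faithfulness": 0.8}

def route(signals: Signals, retries_left: int = 1) -> str:
    """Pick the next step for a query based on its confidence signals."""
    if signals.parse_confidence < THRESHOLDS["parse"]:
        # A badly parsed source cannot be fixed by retrying retrieval.
        return "human_review"
    if signals.top_rerank_score < THRESHOLDS["rerank"]:
        # Weak retrieval: try a corrective pass, then escalate.
        return "corrective_retrieval" if retries_left > 0 else "human_review"
    if signals.faithfulness < THRESHOLDS["faithfulness"]:
        # Possible hallucination: regenerate once, then escalate.
        return "regenerate" if retries_left > 0 else "human_review"
    return "answer"
```

Every path that cannot be resolved automatically ends in `human_review`, which is where corrections would be captured as golden data.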
Appendix A: Open-source tool reference
| Tool | Category | License | Role |
|---|---|---|---|
| Docling 2.x | Parsing | MIT | Primary document parser; confidence scoring |
| faster-whisper / WhisperX | ASR | MIT / BSD-2 | Audio/video transcription and diarization |
| tree-sitter | Code parsing | MIT | AST-aware code chunking |
| Trafilatura | HTML extraction | Apache-2 | Web content extraction |
| Chonkie | Chunking | Apache-2 | Modality-aware chunking |
| Qdrant | Vector DB | Apache-2 | Primary vector store; hybrid + named vectors |
| Kuzu | Knowledge graph | MIT | Embedded graph DB for GraphRAG |
| LangGraph 1.0.x | Agentic orchestration | MIT | Stateful multi-agent graph |
| LiteLLM | Model gateway | MIT | Unified API gateway; per-role routing |
| Dagster | Pipeline orchestration | Apache-2 | Asset-oriented ingestion pipeline |
| Kafka / Redpanda | Message bus | Apache-2 / BSL 1.1 | Streaming ingestion intake |
| KEDA | Autoscaling | Apache-2 | Queue-driven worker autoscaling |
| Langfuse | Tracing / eval | MIT (self-hosted) | LLM-native trace, prompt, eval backend |
| Arize Phoenix | Embedding drift | Elastic-2 | Embedding-space cluster visualization |
| OpenTelemetry Collector | Telemetry | Apache-2 | Span collection and fan-out |
| Ragas | RAG evaluation | Apache-2 | Context recall, faithfulness, relevancy |
| DeepEval | LLM evaluation | Apache-2 | CI gate; 50+ metrics; agentic eval |
| Argilla | Human review | Apache-2 | Annotation, golden dataset curation |
| Label Studio | Image annotation | Apache-2 | Visual/OCR-level document annotation |
| DSPy | Prompt optimization | MIT | Metric-driven prompt compilation |
| Presidio | PII detection | MIT | PII redaction before API egress |
| BGE-Reranker-v2-m3 | Reranking | MIT | Cross-encoder reranker (CPU container) |
| PostgreSQL 16+ | Metadata / audit | PostgreSQL | Tenants, ACLs, ingestion state, episodic log |
| Redis / Valkey | Cache / state | Redis: BSD-3 through 7.2, then RSALv2/SSPLv1 (AGPL-3 option from 8.0); Valkey: BSD-3 | Embedding cache, query cache, rate limiting |
| Argo CD / Argo Workflows | GitOps / batch | Apache-2 | Declarative deployment; batch job orchestration |
Appendix B: Key empirical claims with attribution
These numbers appear throughout the series. They are sourced from the foundational research and should be treated as direction-of-effect estimates, not precise constants — results vary by domain and dataset.
- Hybrid retrieval beats dense-only or sparse-only retrieval consistently across benchmark configurations; the margin varies, with a context recall improvement of roughly 10-25% being typical.
- Contextual retrieval: 67% fewer retrieval errors (Anthropic, 2024) when combined with BM25 and a reranker.
- GraphRAG: approximately 13% worse than vanilla RAG on simple factoid QA (Han et al., 2025).
- Edge intent classification: approximately 40% cost reduction and approximately 35% latency reduction, per the Adaptive RAG literature on complexity-based routing.
- Agentic multipliers: Adaptive approximately 1.5-2x, Self-RAG approximately 2-3x, CRAG approximately 3-5x, multi-hop ReAct approximately 4-10x. Exact values are system-specific.
- Embedding fine-tuning on domain data: 10-30% retrieval improvement, per the literature on domain-specific embedding adaptation.
- Late chunking: introduced by Jina AI (2024).
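For budgeting, the agentic multipliers listed above can be blended over an expected traffic mix. The sketch below assumes the multipliers describe per-query cost relative to a single-pass RAG baseline of 1.0, which the list does not state explicitly, and the traffic mix is invented for the example.

```python
# Ranges quoted in the list above, interpreted (an assumption) as per-query
# cost relative to a single-pass RAG baseline of 1.0.
MULTIPLIER_RANGES = {
    "single_pass": (1.0, 1.0),
    "adaptive": (1.5, 2.0),
    "self_rag": (2.0, 3.0),
    "crag": (3.0, 5.0),
    "multi_hop_react": (4.0, 10.0),
}

def blended_multiplier(mix: dict[str, float]) -> tuple[float, float]:
    """Return the (low, high) expected multiplier for a traffic mix.

    `mix` maps a pattern name to its share of queries; shares must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    low = sum(share * MULTIPLIER_RANGES[name][0] for name, share in mix.items())
    high = sum(share * MULTIPLIER_RANGES[name][1] for name, share in mix.items())
    return low, high

# Hypothetical mix: most queries are simple, a few need multi-hop reasoning.
mix = {"single_pass": 0.6, "self_rag": 0.3, "multi_hop_react": 0.1}
low, high = blended_multiplier(mix)
```

With this invented mix the blended multiplier lands between roughly 1.6x and 2.5x, which shows why routing most traffic to the cheapest adequate pattern matters more than the headline multiplier of the most expensive one.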
This knowledge article documents the design space of enterprise-grade multimodal hybrid agentic RAG as of June 2026. Framework versions, API capabilities, and benchmark results evolve — the design principles and failure-mode taxonomy are the durable parts.