Part 6 — The production envelope
Why do agent demos fail to become products — and what does the gap actually consist of?
8 min · Updated June 2026
Demos are easy. The gap between a demo and a production vertical agent is almost entirely in this section, and it is the part that gets skipped. A model that is right 90% of the time is a great demo and a terrible claims processor.
Q6.1 — What is durable execution and why do agents need it?
Real agent tasks are long-running and they touch the messy real world: a claims workflow might take hours, span a dozen tool calls, and fail halfway through because an external API timed out. If your agent’s state lives only in memory, that failure means starting over — or worse, re-running side effects (sending the email twice, filing the claim twice).
Durable execution solves this. The agent’s progress is journaled, so if anything crashes, it resumes from exactly where it left off without repeating completed work. This has become recognised as core infrastructure rather than a nice-to-have. The main options:
- Temporal— the production default; battle-tested, with native integrations into the major agent SDKs. Heavier to operate.
- Restate— lighter footprint, edge- and serverless-friendly.
- DBOS— Postgres-only, library-based, zero new infrastructure to run.
- LangGraph’s built-in checkpointing — durability via state snapshots, included in the framework.
For any workflow that can exceed a minute or must survive a crash without re-doing side effects, you want one of these. The saga pattern — defining compensating actions to roll back partially-completed multi-step transactions — is now standard for financial and transactional agents.
Q6.2 — What does observability look like for agents?
An agent makes dozens of non-deterministic decisions per run. When it does something wrong in production, “read the logs” does not cut it — you need to trace the entire decision path: every model call, every tool call, every retrieval, with inputs, outputs, token counts, and latencies.
The field has standardised around OpenTelemetry’s GenAI semantic conventions as the vendor-neutral instrumentation layer. The practical advice: instrument once with a neutral layer — OpenLLMetry or OpenInference — so you can switch observability backends without re-wiring everything.
The leading open-source, self-hostable backends are Langfuse (broad, MIT-licensed) and Arize Phoenix (strong on RAG evaluation and drift), with Laminar purpose-built for debugging long agent runs. LangSmith offers the best experience if you are committed to the LangChain/LangGraph stack and do not mind the lock-in.
Tie observability to an evaluation pipeline: an offline test set, LLM-as-judge scoring, and regression gates that block a deploy if quality drops. For RAG-heavy verticals, Ragas gives you faithfulness and context-relevance metrics; DeepEval plugs quality gates into your CI; Promptfoois good for red-teaming. Build a domain-specific gold set with your subject-matter experts — generic benchmarks will not tell you whether the agent is right about your domain.
Q6.3 — What does layered guardrail architecture look like?
A single safety check is not enough for a regulated vertical. The 2026 baseline is layered guardrails, each catching a different class of problem:
- Input rails— screen incoming messages for prompt injection and policy violations before the model sees them.
- Retrieval / tool rails — screen the content coming back from tools and retrieval, because that is a major injection vector.
- Dialog rails— keep the agent on-topic and on-policy during the conversation.
- Output rails— validate what the agent is about to say or do (PII redaction, structured-output validation, content policy) before it leaves the system.
The open-source building blocks: NVIDIA NeMo Guardrails (a programmable rails framework), Llama Guard for hazard classification, Llama Prompt Guard as a fast first-pass injection filter, Guardrails AI for structured-output validation, and LLM Guardfor PII, secrets, and toxicity scanning. A typical regulated-vertical chain runs a fast injection gate, then a hazard classifier, then policy and PII rails, then output validation — adding somewhere in the low hundreds of milliseconds total.
Q6.4 — How should human-in-the-loop be designed?
In production, the human-in-the-loop (HITL) rate — the fraction of agent actions that require human approval — is the real measure of where a vertical actually is. Legal and compliance agents run at very high HITL rates (well over half of actions reviewed); sales-development agents run low (single digits). Neither is wrong; they reflect the cost of an error.
The practical principle: bake HITL gates into the workflow itself as explicit approval nodes in your orchestration graph, at every action with real monetary, clinical, or legal consequence. Do not bolt on a reviewer queue as an afterthought. Design the agent to pause and ask, and design the human’s review experience to be fast. Ultimate responsibility always rests with the human professional. That sentence is a design constraint, not a disclaimer.
Q6.5 — What are the binding governance and compliance constraints?
For business and vertical agents, the binding constraints are frequently regulatory, not technical:
- EU AI Act— high-risk system enforcement is set to bite in August 2026, with steep penalties and an explicit requirement (Article 12) to automatically log events across the system’s lifecycle. Your observability and audit choices should produce conformant logs now, regardless of where you are deployed.
- ISO/IEC 42001 and the NIST AI Risk Management Framework are converging as the operational baseline that demonstrates reasonable care.
- Colorado AI Act— note that this was rewritten in May 2026 (SB 26-189), narrowing the original sweeping regime into an automated-decision-making notice-and-review framework, effective January 1, 2027. A reminder that this regulatory landscape is genuinely in motion.
- Sector floors— HIPAA plus a Business Associate Agreement and SOC 2 Type II in healthcare; SR 11-7 (model risk) and NYDFS Part 500 (cybersecurity) in financial services; PCI DSS for anything touching payments.
Q6.6 — In what order should I attack cost and latency?
When the bill arrives, attack it in this order:
- 1.Prompt caching — highest ROI, zero quality cost. Fix your prompt ordering first.
- 2.Tiered model routing — use small, cheap models for easy steps; reserve the expensive reasoning models for genuinely hard ones. Use batch APIs (often around 50% off) for anything that tolerates delay.
- 3.Programmatic tool calling / Code Mode — for tool-heavy agents, the single biggest token reduction available.
- 4.Sub-agent summary compression — return summaries, not transcripts.
- 5.Pruning, offloading, summarisation — the context-management toolkit from Part 2.
- 6.Async and precompute paths — for anything non-interactive, batch it.