Custom AI Agents

Part 6 — The production envelope

Why do agent demos fail to become products — and what does the gap actually consist of?

11 min · Updated June 2026

Demos are easy. The gap between a demo and a production vertical agent is almost entirely in this section, and it is the part that gets skipped. A model that is right 90% of the time is a great demo and a terrible claims processor.

A note on how this article connects to the rest of the site. This page is the map of the production envelope; several of its sections are covered in depth in the AI Governance pillar. Where that’s the case, this article gives you the working picture and the ADK-level hooks, then points to the governance article that treats the topic as its whole subject. Read this for the “what and where in the stack”; follow the links for the “how to operate and prove it.”

6.1

My agent task runs for minutes and touches real systems — what happens when it crashes halfway?

Real agent tasks are long-running and they touch the messy real world: a claims workflow might take hours, span a dozen tool calls, and fail halfway through because an external API timed out. If your agent’s state lives only in memory, that failure means starting over — or worse, re-running side effects (sending the email twice, filing the claim twice).

Durable execution solves this. The agent’s progress is journaled, so if anything crashes, it resumes from exactly where it left off without repeating completed work. This has become recognised as core infrastructure rather than a nice-to-have. The main options:

Temporal— the production default; battle-tested, with native integrations into the major agent SDKs. Heavier to operate.
Restate— lighter footprint, edge- and serverless-friendly.
DBOS— Postgres-only, library-based, zero new infrastructure to run.
LangGraph’s built-in checkpointing — durability via state snapshots, included in the framework.

For any workflow that can exceed a minute or must survive a crash without re-doing side effects, you want one of these. The saga pattern — defining compensating actions to roll back partially-completed multi-step transactions — is now standard for financial and transactional agents.

One connection back to Part 3: durable execution is the production-grade version of the “short-term / working memory” checkpointing discussed there. The session state that let an agent resume a task is the same state a durable-execution engine journals to survive a crash — the difference is that the engine also guarantees exactly-once handling of the side effects, which in-memory state does not.

6.2

It passed every test but does strange things with real users — how do I see what it’s actually doing?

An agent makes dozens of non-deterministic decisions per run. When it does something wrong in production, “read the logs” does not cut it — you need to trace the entire decision path: every model call, every tool call, every retrieval, with inputs, outputs, token counts, and latencies.

The field has standardised around OpenTelemetry’s GenAI semantic conventions as the vendor-neutral instrumentation layer. The practical advice: instrument once with a neutral layer — OpenLLMetry or OpenInference — so you can switch observability backends without re-wiring everything.

The leading open-source, self-hostable backends are Langfuse (broad, MIT-licensed) and Arize Phoenix (strong on RAG evaluation and drift), with Laminar purpose-built for debugging long agent runs. LangSmith offers the best experience if you are committed to the LangChain/LangGraph stack and do not mind the lock-in.

This is one place ADK does the heavy lifting for you: it emits OpenTelemetry traces using the standard GenAI conventions natively, so “instrument once, swap backends freely” is a few lines of setup — point the OTLP exporter at a collector or directly at a backend like Langfuse:

telemetry_setup.py

import os
from google.adk.telemetry.setup import maybe_set_otel_providers

# Point ADK's native OTel traces at any OTLP endpoint — a collector, or Langfuse directly.
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = "http://your-collector:4318/v1/traces"
os.environ["OTEL_SERVICE_NAME"] = "claims-agent"
maybe_set_otel_providers()   # every agent run now produces a navigable trace tree

Every reasoning step, tool call, retrieval, latency, and token count then shows up as a span tree you can navigate by user and session.

Tie observability to an evaluation pipeline: an offline test set, LLM-as-judge scoring, and regression gates that block a deploy if quality drops. For RAG-heavy verticals, Ragas gives you faithfulness and context-relevance metrics; DeepEval plugs quality gates into your CI; Promptfoois good for red-teaming. Build a domain-specific gold set with your subject-matter experts — generic benchmarks will not tell you whether the agent is right about your domain.

This section is deliberately a summary — observability and evaluation each get a full treatment in the governance pillar, because in a regulated vertical they are not just engineering conveniences but the evidence base for compliance:

The Evaluation article covers how to run evaluation as a CI gate: the golden dataset as a living artifact, explicit pass thresholds (task completion, tool correctness, hallucination), and — crucially for agents — scoring the trajectory (did it call the right tools, in the right order, with the right arguments) rather than just the final output.
The Telemetry article covers production tracing in depth (ADK → OpenTelemetry → Langfuse, exactly the path above), plus two things this page doesn’t: managing prompts as versioned, release-controlled artifacts, and the feedback loop where surprising or low-scoring production traces become your next golden test cases. That loop is what turns agent quality into a ratchet instead of something that silently decays.

6.3

One safety check isn’t enough for a regulated vertical — how do I layer defences?

A single safety check is not enough for a regulated vertical. The 2026 baseline is layered guardrails, each catching a different class of problem:

Input rails— screen incoming messages for prompt injection and policy violations before the model sees them.
Retrieval / tool rails — screen the content coming back from tools and retrieval, because that is a major injection vector.
Dialog rails— keep the agent on-topic and on-policy during the conversation.
Output rails— validate what the agent is about to say or do (PII redaction, structured-output validation, content policy) before it leaves the system.

The open-source building blocks: NVIDIA NeMo Guardrails (a programmable rails framework), Llama Guard for hazard classification, Llama Prompt Guard as a fast first-pass injection filter, Guardrails AI for structured-output validation, and LLM Guardfor PII, secrets, and toxicity scanning. A typical regulated-vertical chain runs a fast injection gate, then a hazard classifier, then policy and PII rails, then output validation — adding somewhere in the low hundreds of milliseconds total.

In ADK, input and output rails attach at the model callbacks: before_model_callback inspects each prompt before it reaches the model (injection, policy), and after_model_callback inspects each response before it leaves (PII, content policy). This is where you’d invoke one of the classifiers above:

guardrail_callbacks.py

from google.adk.agents.context import Context
from google.adk.models import LlmRequest
from google.genai import types
from typing import Optional

def input_rail(context: Context, request: LlmRequest) -> Optional[types.Content]:
    """Input rail: screen the incoming prompt BEFORE the model sees it."""
    latest = request.contents[-1].parts[0].text if request.contents else ""
    if injection_classifier_flags(latest):        # e.g. Llama Prompt Guard
        # Returning Content short-circuits the model call — the rail blocked it.
        return types.Content(parts=[types.Part(
            text="I can't process that request.")], role="model")
    return None   # None = let the model call proceed

agent = LlmAgent(
    model="gemini-flash-latest",
    name="ClaimsAgent",
    before_model_callback=input_rail,   # input rail
    # after_model_callback=output_rail, # output rail: PII redaction, content policy
)

There is one important limitation to be honest about, and the governance pillar names it directly: evaluation, observability, and policy enforcement do not, by themselves, cover runtime content guardrails. Evaluation tests for injection and jailbreaks before release; a policy engine blocks the actions those attacks try to trigger; but neither inspects and filters the live model contenton every turn. That live-content layer — the callbacks above, plus adversarial red-teaming in CI — is a distinct, fourth component you have to add explicitly. The Accountability §5.2 article works through this gap and how to close it in two parts (pre-production red-teaming and a runtime guardrail component at the model callbacks).

6.4

How much should a human approve — and how do I build that in without bolting it on?

In production, the human-in-the-loop (HITL) rate — the fraction of agent actions that require human approval — is the real measure of where a vertical actually is. Legal and compliance agents run at very high HITL rates (well over half of actions reviewed); sales-development agents run low (single digits). Neither is wrong; they reflect the cost of an error.

The practical principle: bake HITL gates into the workflow itself as explicit approval nodes in your orchestration graph, at every action with real monetary, clinical, or legal consequence. Do not bolt on a reviewer queue as an afterthought. Design the agent to pause and ask, and design the human’s review experience to be fast. Ultimate responsibility always rests with the human professional. That sentence is a design constraint, not a disclaimer.

Mechanically, “pause and ask” means the agent must be able to suspend mid-task, surface the pending action for approval, and resumeon the human’s reply — which is why this section depends on the durable execution from the top of the article. In ADK, the primitive is a long-running tool that emits an approval request and waits, rather than returning immediately:

hitl_tool.py

from google.adk.tools import LongRunningFunctionTool

def request_refund_approval(claim_id: str, amount_cents: int) -> dict:
    """Pause the agent and ask a human to approve a high-value refund.

    Returns control to the caller immediately; the workflow resumes only
    when the human's decision is supplied back to the run.
    """
    return {
        "status": "pending_human_approval",   # the runner surfaces this and waits
        "claim_id": claim_id,
        "amount_cents": amount_cents,
    }

# Wrapping it as long-running is what lets the run suspend and later resume.
approval_gate = LongRunningFunctionTool(func=request_refund_approval)

The governance angle on HITL is the enforcement side: a human review step is only trustworthy if the agent cannotproceed without it. That’s runtime policy enforcement, covered in the next section and in Prevention §4.1 — a policy engine that intercepts the consequential tool and routes it to a human, so the human gate holds even if the model is manipulated into trying to skip it.

6.5

What are the constraints that are actually binding — the regulatory ones?

For business and vertical agents, the binding constraints are frequently regulatory, not technical:

EU AI Act— high-risk system enforcement is set to bite in August 2026, with steep penalties and an explicit requirement (Article 12) to automatically log events across the system’s lifecycle. Your observability and audit choices should produce conformant logs now, regardless of where you are deployed.
ISO/IEC 42001 and the NIST AI Risk Management Framework are converging as the operational baseline that demonstrates reasonable care.
Colorado AI Act— note that this was rewritten in May 2026 (SB 26-189), narrowing the original sweeping regime into an automated-decision-making notice-and-review framework, effective January 1, 2027. A reminder that this regulatory landscape is genuinely in motion.
Sector floors— HIPAA plus a Business Associate Agreement and SOC 2 Type II in healthcare; SR 11-7 (model risk) and NYDFS Part 500 (cybersecurity) in financial services; PCI DSS for anything touching payments.

This section is the hinge between this series and the governance pillar, so it’s worth being explicit about where each obligation is actually satisfied:

Preventingunauthorised or dangerous actions — the four preventive controls (runtime policy enforcement, cryptographic agent identity, execution sandboxing, and reliability engineering with a kill switch) are the subject of the Prevention article. Two of these map to Part 4 of this series: policy enforcement at the action boundary is the production-grade version of the tool guardrails there, and execution sandboxing is the containment layer for any code-executing tool.
Provingit to a regulator or auditor — the EU AI Act Article 12 logging requirement is not met by observability traces, which can be altered. It needs a tamper-evidentaudit log (cryptographically chained, control-mapped, signed attestations). That distinction — debugging traces versus the compliance system of record — is the core of the Accountability §5.1 article.
Deployinggovernance without new cloud lock-in — if you’re multi-cloud, on-prem, or sovereign, note the distinction the governance pillar draws between cloud-bound SaaS governance products and self-hostable, framework-agnostic governance libraries. Accountability §5.4 covers choosing the cloud-agnostic option.

The one-line version: this article tells you the constraints exist and where they sit in the stack; the governance pillar is the operating manual for satisfying and evidencing them.

6.6

When the bill arrives, in what order do I attack cost and latency?

When the bill arrives, attack it in this order:

1.Prompt caching— highest ROI, zero quality cost. Fix your prompt ordering first. (The mechanics and the ADK caching hook are in Part 2.)
2.Tiered model routing — use small, cheap models for easy steps; reserve the expensive reasoning models for genuinely hard ones. Use batch APIs (often around 50% off) for anything that tolerates delay.
3.Programmatic tool calling / Code Mode — for tool-heavy agents, the single biggest token reduction available. (See Part 4.)
4.Sub-agent summary compression — return summaries, not transcripts. (The 15×-cost point from Part 5.)
5.Pruning, offloading, summarisation — the context-management toolkit from Part 2.
6.Async and precompute paths — for anything non-interactive, batch it.

The ordering is the actual advice here: teams tend to reach for step 2 (swapping models) first because it feels like the obvious lever, but step 1 is higher ROI at zero quality cost. Work the list top-down and stop when the bill is acceptable — you rarely need all six.

Found this useful?

Libraries and frameworks referenced on this page

Google Agent Development Kit (ADK) — google-adk (Python) — used for the observability, guardrail, and HITL samples. Native OpenTelemetry export via maybe_set_otel_providers() pointed at any OTLP endpoint (collector or Langfuse). Input/output content rails via before_model_callback / after_model_callback (using Context and LlmRequest). Human-in-the-loop pause/resume via LongRunningFunctionTool. Model shown: gemini-flash-latest.
Durable execution— Temporal (production default), Restate, DBOS, LangGraph checkpointing. Referenced conceptually; the choice is independent of the agent framework, and the saga pattern is the standard for transactional rollback.
Observability backends — Langfuse and Arize Phoenix (open-source, self-hostable), Laminar (long-run debugging), LangSmith (LangChain/LangGraph-native). Neutral instrumentation via OpenTelemetry GenAI conventions, OpenLLMetry, or OpenInference.
Evaluation— DeepEval (CI quality gate), Ragas (RAG faithfulness/context-relevance), Promptfoo (red-teaming). Covered in depth in the governance Evaluation article.
Guardrails— NVIDIA NeMo Guardrails, Llama Guard, Llama Prompt Guard, Guardrails AI, LLM Guard. Attached at ADK’s model callbacks.
Governance / compliance tooling — the governance pillar treats runtime policy enforcement, cryptographic agent identity, execution sandboxing, reliability/kill-switch, and tamper-evident audit as its subject (including the Microsoft Agent Governance Toolkit as one open-source, cloud-agnostic implementation). Cited via cross-links rather than reproduced here.

Cross-references to the AI Governance pillar

Observability & evals → Evaluation and Telemetry
Runtime content-guardrails gap → Accountability §5.2
Human-in-the-loop enforcement → Prevention §4.1
Preventive controls (policy, identity, sandboxing, reliability) → Prevention
Tamper-evident audit & EU AI Act Article 12 → Accountability §5.1
Cloud-agnostic governance deployment → Accountability §5.4