QRefAI
Custom AI Agents

Part 8 — Reality check

What do real deployments actually look like — and what should I discount from everything I’ve read?

5 min · Updated June 2026

It is worth grounding everything in this series in what actual vertical deployments look like as of June 2026 — and being honest about how much of the public narrative is reliable.

Q8.1 — What do the marquee deployments have in common?

Figure: Common patterns across marquee AI agent deployments (custom architectures on major model APIs, durable execution, human checkpoints, and MCP as the integration layer).

The pattern across the well-known deployments — Klarna in fintech customer service, JPMorgan with hundreds of live use cases, Harvey and EvenUp in legal, Hippocratic AI and Abridge in healthcare, Allianz in insurance claims, Sierra as a multi-industry customer-experience platform — is remarkably consistent in one respect: almost all of them run custom-built architectures on top of the major model APIs, not off-the-shelf frameworks.

Frameworks like LangGraph show up most clearly where durable execution and human checkpoints dominate. MCP shows up most prominently as the named integration layer in the Allianz–Anthropic work. And notably, none of these public deployments names a specific memory framework or vector database, which strongly suggests that large enterprises build their own memory layers while the broader market buys frameworks. If you adopt a packaged memory layer, you are running slightly ahead of the disclosed market. That is fine, but know that this is the bet you are making.

Q8.2 — What four things should I stay skeptical about?

  • Headline ROI numbers get walked back. Klarna’s famous “AI does the work of 700 people” framing was partially reversed when customer satisfaction degraded on emotionally complex tickets and human capacity was re-expanded. The lesson is not “AI failed.” It is that the boundary between what to automate and what to route to a human is itself the product, and you will get it wrong before you get it right.
  • “Agentic” is often marketing. Legal-industry observers flagged through 2026 that many “agentic” claims are really workflow automation with better PR. Ask what the system actually decides on its own versus what is hard-coded.
  • Benchmarks are contested. Memory-framework scores, model-scaffold comparisons, and adoption statistics frequently come from parties with an interest in the result. The cleanest finding on orchestration is modest: the same model on a better scaffold gained a handful of points on a standard benchmark. That is meaningful, but not the order-of-magnitude jump that “agent framework” hype implies. Most of an agent’s quality comes from the model and your context engineering, not from a magic framework.
  • The tooling will shift again. Every major vendor shipped a production-grade agent SDK in roughly the same quarter. The recommendations in this series lean conservative precisely because the landscape is moving. Bet on portable, neutral layers (MCP, OpenTelemetry, LiteLLM) so that when the leader changes (and it will), you are not rebuilding from scratch.

Figure: The skeptic's cheat sheet. Four things to stay skeptical about: headline ROI numbers, agentic marketing claims, contested benchmarks, and shifting tooling.
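
One way to treat the automate-versus-human boundary from the first point as "the product" is to make it an explicit, tunable policy rather than logic buried in prompts. The sketch below is illustrative only: the `Ticket` fields, the `RoutingPolicy` class, and both thresholds are hypothetical names and values, not anything a deployment named above has disclosed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    text: str
    sentiment: float         # hypothetical score in [-1, 1]; negative means upset
    model_confidence: float  # hypothetical confidence score in [0, 1]


@dataclass(frozen=True)
class RoutingPolicy:
    # Both thresholds are assumptions to be tuned against your own outcome data.
    min_confidence: float = 0.8
    min_sentiment: float = -0.3

    def route(self, ticket: Ticket) -> str:
        """Return "agent" or "human" for a ticket."""
        # Emotionally charged tickets go to a person regardless of confidence.
        if ticket.sentiment < self.min_sentiment:
            return "human"
        # Low-confidence tickets also escalate.
        if ticket.model_confidence < self.min_confidence:
            return "human"
        return "agent"


policy = RoutingPolicy()
print(policy.route(Ticket("Where is my refund?", 0.1, 0.95)))
print(policy.route(Ticket("This is the third time you lost it!", -0.8, 0.95)))
```

Because the thresholds live in one small object, moving the boundary after you "get it wrong" is a configuration change you can review and audit, not a rewrite.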

Q8.3 — What is the one-paragraph version of everything?

An agent is a loop where a model decides its own actions; most business problems need a workflow, not a fully autonomous agent, and knowing the difference is the whole game. The central discipline is no longer prompt engineering but context engineering: curating exactly what the model sees, and when, with prompt caching and just-in-time loading as the cost levers. Memory is a layered system (thread checkpointing plus extracted, consolidated long-term memory) you should measure on your own data, not on vendor benchmarks. Tools reach the world increasingly through MCP, which is powerful, portable, and carries a real and active security burden you must design around with gateways, OAuth, and sandboxing. Orchestration has sobered up around the supervisor pattern and the hard truth that multi-agent systems cost roughly 15× more, so you use them only when the value justifies it. And the gap between a demo and a real vertical agent is the production envelope: durable execution, observability, layered guardrails, human-in-the-loop gates, and an audit trail a regulator would accept. Build on portable, neutral foundations, stay skeptical of the hype, and remember that the model and your context discipline, not the framework logo, are where the quality actually comes from.
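
The workflow-versus-agent distinction in the first sentence can be made concrete with a toy sketch. Everything here is a stand-in: the two tools, the `stub_policy` function (which takes the place of a real model call), and the step limit are hypothetical and chosen only to show where the decision about "what happens next" lives.

```python
from typing import Callable, Dict

State = Dict[str, str]

# Hypothetical tools; each takes the current state and returns an updated copy.
TOOLS: Dict[str, Callable[[State], State]] = {
    "lookup_order": lambda s: {**s, "order": "shipped"},
    "draft_reply": lambda s: {**s, "reply": f"Your order is {s.get('order', 'unknown')}."},
}


def run_workflow(state: State) -> State:
    """Workflow: the sequence of steps is hard-coded by the developer."""
    for name in ("lookup_order", "draft_reply"):
        state = TOOLS[name](state)
    return state


def run_agent(state: State, choose_action: Callable[[State], str], max_steps: int = 5) -> State:
    """Agent: a policy (normally a model call) picks the next action each turn."""
    for _ in range(max_steps):  # hard cap so a confused policy cannot loop forever
        action = choose_action(state)
        if action == "stop":
            break
        state = TOOLS[action](state)
    return state


def stub_policy(state: State) -> str:
    """Stand-in for a model: decides the next action from the current state."""
    if "order" not in state:
        return "lookup_order"
    if "reply" not in state:
        return "draft_reply"
    return "stop"


print(run_workflow({}))
print(run_agent({}, stub_policy))
```

Both paths reach the same result on this task, which is the point: when the steps are known in advance, the workflow version is cheaper and easier to audit, and the loop earns its cost only when the next step genuinely cannot be fixed ahead of time.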