Custom AI Agents

Part 2 — Context engineering

Why did “prompt engineering” stop being enough, and what replaced it?

8 min · Updated June 2026

Through 2023 and into 2024, the craft was prompt engineering — wording your instructions cleverly to coax better output from a model. As of 2026, that craft has been subsumed by a broader and more important one: context engineering.

Context engineering is the discipline of selecting, shaping, and delivering exactly the information and tools an agent needs at the moment it needs them. The prompt is now just one component of the context, and usually not the hard part.

Q2.1 — Why did the shift from prompt engineering happen?

The reason is mechanical. An agent loop accumulates context fast — every tool result, every sub-step, every retrieved document piles into the window. And longer context is not free or even neutral. It costs money and latency, and past a certain point it actively degrades reasoning.

Once you internalise that context is a scarce, curated resource — not a bucket you dump everything into — the core techniques become obvious. Treat the context window like a working desk, not a filing cabinet: only what you need for this decision, cleared off constantly.

Q2.2 — What are the five context failure modes?

The community now has a vocabulary for the failure modes. Learn these terms because you will see all of them in your own logs:

Context poisoning: a hallucination or error gets into the context and is then referenced as if it were true, compounding.
Context distraction: so much accumulated history that the model over-focuses on it and stops reasoning freshly about the actual task.
Context confusion: irrelevant material in the window that the model tries to use anyway.
Context clash: pieces of context that contradict each other, leaving the model to pick badly.
Context rot: the general degradation of response quality as the window fills, even with relevant material.

Q2.3 — What is the practical toolkit for managing context?

Offloading. Don’t keep everything in the model’s head. Give the agent a scratchpad (a “think” step where it can reason without that reasoning permanently bloating the conversation) and, increasingly, a filesystem. A widely used pattern is for the agent to maintain a MEMORY.md or notes file that it reads and updates, so durable state lives on disk and only the relevant slice gets pulled into context per step. “The filesystem is the context” is a real architectural stance in 2026, not a metaphor.
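
A minimal sketch of the notes-file pattern, assuming notes are grouped under "## " headings so that one section can be loaded at a time. The NotesFile class, its method names, and the section layout are illustrative assumptions, not a standard format.

```python
from pathlib import Path
import tempfile


class NotesFile:
    """Durable agent notes kept on disk, grouped under '## section' headings.

    Hypothetical helper: the file layout is an assumption for illustration.
    """

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> dict[str, list[str]]:
        sections: dict[str, list[str]] = {}
        if not self.path.exists():
            return sections
        current = None
        for line in self.path.read_text(encoding="utf-8").splitlines():
            if line.startswith("## "):
                current = line[3:].strip()
                sections.setdefault(current, [])
            elif current is not None and line.strip():
                sections[current].append(line)
        return sections

    def append(self, section: str, note: str) -> None:
        """Add one note to a section and rewrite the file."""
        sections = self._load()
        sections.setdefault(section, []).append(f"- {note}")
        body = "\n\n".join(
            f"## {name}\n" + "\n".join(lines) for name, lines in sections.items()
        )
        self.path.write_text(body + "\n", encoding="utf-8")

    def read_slice(self, section: str) -> str:
        """Return only the requested section, ready to place in context."""
        return "\n".join(self._load().get(section, []))


workdir = Path(tempfile.mkdtemp())
notes = NotesFile(workdir / "MEMORY.md")
notes.append("billing", "Customer is on the annual plan")
notes.append("shipping", "Prefers courier delivery")
notes.append("billing", "Refund issued for the March invoice")

# Only the billing notes are pulled into context for a billing step.
billing_context = notes.read_slice("billing")
print(billing_context)
```

The point of the design is that the shipping note stays on disk: it costs nothing until a step actually needs it.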

Pruning. As new information arrives, actively remove outdated or superseded material. The mature version of this is dynamic context pruning: evicting items by age and relevance, the way a cache does. Teams running this on tool-heavy agents report 50–70% token reductions with no quality loss, because most of what accumulates in a long agent run is genuinely dead weight by step ten.
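
One way to sketch eviction by age and relevance is to decay each item's relevance score with age and keep the best-scoring items that fit a token budget. The ContextItem fields, the half-life decay, and the relevance values below are assumptions for illustration; a real system would get relevance from its own scorer.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    text: str
    step: int          # agent step at which the item entered the context
    relevance: float   # 0..1, assumed to come from some external scorer
    tokens: int
    pinned: bool = False  # pinned items (e.g. the task statement) are never evicted


def prune(items: list[ContextItem], current_step: int, budget: int,
          half_life: float = 5.0) -> list[ContextItem]:
    """Keep pinned items plus the best-scoring others that fit the token budget."""

    def score(item: ContextItem) -> float:
        age = current_step - item.step
        # Relevance halves every `half_life` steps (an illustrative decay rule).
        return item.relevance * 0.5 ** (age / half_life)

    kept = [item for item in items if item.pinned]
    used = sum(item.tokens for item in kept)
    candidates = sorted((i for i in items if not i.pinned), key=score, reverse=True)
    for item in candidates:
        if used + item.tokens <= budget:
            kept.append(item)
            used += item.tokens
    kept.sort(key=lambda item: item.step)  # restore chronological order
    return kept


history = [
    ContextItem("task statement", step=0, relevance=1.0, tokens=50, pinned=True),
    ContextItem("old search result", step=1, relevance=0.9, tokens=400),
    ContextItem("superseded draft", step=2, relevance=0.2, tokens=300),
    ContextItem("latest tool output", step=9, relevance=0.8, tokens=350),
]
pruned = prune(history, current_step=10, budget=500)
print([item.text for item in pruned])
```

With these made-up numbers the old search result is evicted despite its high raw relevance, because nine steps of decay push it below the fresh tool output.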

Summarising the past. Once a conversation or sub-task exceeds some threshold, compress the old turns into a summary and carry the summary forward instead of the raw transcript. This is the single highest-leverage move in multi-agent systems.
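
A rough sketch of threshold-based compaction follows. The token counter is a crude word count and stub_summarize is a placeholder for what would normally be a model call; both are assumptions, as are the function names.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace word count, not a real tokenizer.
    return len(text.split())


def compact_history(turns: list[str], summarize: Callable[[list[str]], str],
                    max_tokens: int, keep_recent: int = 2) -> list[str]:
    """Replace older turns with one summary once the history exceeds max_tokens."""
    total = sum(count_tokens(turn) for turn in turns)
    cut = len(turns) - keep_recent
    if total <= max_tokens or cut <= 0:
        return turns
    old, recent = turns[:cut], turns[cut:]
    return [f"[summary of {len(old)} earlier turns] {summarize(old)}"] + recent


def stub_summarize(old_turns: list[str]) -> str:
    # Placeholder for an LLM call: keeps the first five words of each turn.
    return " | ".join(" ".join(turn.split()[:5]) for turn in old_turns)


turns = [
    "user: please compare the three vendor quotes we received last week in detail",
    "agent: vendor A is cheapest overall but has the longest delivery time of all",
    "user: what about the warranty terms for vendor B",
    "agent: vendor B offers a two year warranty",
]
compacted = compact_history(turns, stub_summarize, max_tokens=20)
for line in compacted:
    print(line)
```

The most recent turns are carried forward verbatim so the model still sees the exact wording of the live exchange; only the older material is compressed.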

Just-in-time loading. Don’t front-load every tool definition and every document. The 2026 pattern is discovery at runtime: the agent searches for the tool it needs when it needs it, rather than carrying definitions for fifty tools it will never use. Anthropic’s Tool Search Tool does exactly this and reports roughly an 85% reduction in tool-definition tokens. The same logic applies to knowledge — retrieve when relevant, don’t preload.
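
To make runtime discovery concrete, here is a small local sketch: the agent searches a registry by keyword and only the matching definitions are loaded into context. This is not Anthropic's Tool Search Tool API; the registry contents, search_tools, and load_definitions are hypothetical names, and keyword overlap stands in for whatever ranking a production search would use.

```python
import re

# Hypothetical registry; in practice this could hold dozens of definitions.
TOOL_REGISTRY = {
    "get_invoice": {"description": "Fetch an invoice by invoice number"},
    "refund_payment": {"description": "Refund a customer payment"},
    "book_meeting": {"description": "Schedule a calendar meeting"},
    "translate_text": {"description": "Translate text between languages"},
}


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_tools(query: str, registry: dict, limit: int = 2) -> list[str]:
    """Rank tools by keyword overlap between the query and name + description."""
    query_words = _words(query)
    scored = []
    for name, spec in registry.items():
        overlap = len(query_words & _words(name + " " + spec["description"]))
        if overlap:
            scored.append((overlap, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


def load_definitions(names: list[str], registry: dict) -> list[dict]:
    """Return full definitions only for the tools that were found."""
    return [{"name": name, **registry[name]} for name in names]


found = search_tools("refund the payment on this invoice", TOOL_REGISTRY)
loaded = load_definitions(found, TOOL_REGISTRY)
print(found)
```

Only the definitions in loaded would be sent to the model for this step; the rest of the registry never enters the window.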

Isolating context across agents. When you split work across multiple agents, give each one only the slice of context it needs and have it return a compressed summary, not its full transcript. Returning a full sub-agent transcript to the orchestrator is the classic way to blow up your token bill — practitioners cite something like a 15× cost difference between returning summaries and inlining transcripts.
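
The isolation rule can be sketched as a return type that separates the summary from the transcript. run_subagent below is a stand-in for a real agent loop, and all names are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SubAgentResult:
    summary: str
    # The full transcript is kept for logging, never handed to the orchestrator.
    transcript: list[str] = field(default_factory=list, repr=False)


def run_subagent(task: str, context_slice: list[str]) -> SubAgentResult:
    """Stand-in for a real sub-agent loop: it sees only its own slice of context."""
    transcript = [f"task: {task}"]
    transcript += [f"read: {doc}" for doc in context_slice]
    transcript.append("draft answer and intermediate reasoning ...")
    summary = f"{task}: reviewed {len(context_slice)} documents"
    return SubAgentResult(summary=summary, transcript=transcript)


def orchestrate(tasks: dict[str, list[str]]) -> list[str]:
    """Collect only compressed summaries into the orchestrator's context."""
    orchestrator_context = []
    for task, context_slice in tasks.items():
        result = run_subagent(task, context_slice)
        orchestrator_context.append(result.summary)  # transcript stays behind
    return orchestrator_context


tasks = {
    "check contract terms": ["contract.pdf", "amendment.pdf"],
    "check payment history": ["ledger.csv"],
}
context = orchestrate(tasks)
print(context)
```

Making the transcript a separate field means inlining it into the orchestrator has to be a deliberate choice rather than the default.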

Q2.4 — How does retrieval work in 2026?

Retrieval-Augmented Generation — fetching relevant documents and putting them in context — is still the backbone of most business agents, because verticals run on proprietary knowledge the model was never trained on. But naive RAG (embed the query, grab the top-k chunks, stuff them in the prompt) is now understood to be a weak baseline. The commonly cited figure is that naive pipelines fail at retrieval around 40% of the time.

The 2026 stack layers several improvements on top:

Hybrid retrieval is table stakes. Combine keyword search (BM25), dense vector similarity, and structured metadata filters in a single query. Pure vector search misses exact-match terms like product codes, statute numbers, and drug names; hybrid catches them.
Contextual retrieval prepends a short, LLM-generated description of where each chunk came from and what it is about before embedding it. Anthropic’s original work on this reported retrieval-failure reductions of up to 67% versus plain chunking, with the top figure coming from the full pipeline that also adds contextual BM25 and reranking. It is cheap and one of the highest-ROI improvements available.
Agentic RAG turns retrieval into a loop instead of a single shot: the agent decomposes the query, retrieves, critiques whether the results are actually relevant, and retries with a reformulated query if not. One legal-tech deployment cut irrelevant retrievals from around 40% down to under 8% this way.
GraphRAG and knowledge graphs matter wherever the reasoning is relational: legal cross-references, the trail from clinical trial to FDA approval, financial entity ownership chains. Use them when relationships are the problem, not by default.
Metadata governance turns out to matter more than people expect. Well-governed metadata has been shown to substantially improve agent accuracy on structured-data tasks, because it lets retrieval filter precisely instead of guessing.
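
As an illustration of the hybrid retrieval item above, the sketch below fuses a keyword ranking and a vector ranking with reciprocal rank fusion (RRF) and then applies a metadata filter. RRF is one common fusion method chosen here for brevity, not something the text prescribes; the two input rankings are hard-coded stand-ins for real BM25 and embedding searches, and the document IDs and metadata are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def hybrid_search(keyword_hits: list[str], vector_hits: list[str],
                  metadata: dict[str, dict], required: dict) -> list[str]:
    """Fuse both rankings, then drop documents that fail the metadata filter."""

    def allowed(doc_id: str) -> bool:
        return all(metadata[doc_id].get(key) == value for key, value in required.items())

    fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
    return [doc_id for doc_id in fused if allowed(doc_id)]


# Stand-ins for a BM25 ranking and a dense-vector ranking of the same query.
keyword_hits = ["doc_code", "doc_b", "doc_a"]   # exact product-code match ranks first
vector_hits = ["doc_a", "doc_b", "doc_old"]
metadata = {
    "doc_a": {"status": "current"},
    "doc_b": {"status": "current"},
    "doc_code": {"status": "current"},
    "doc_old": {"status": "archived"},
}
results = hybrid_search(keyword_hits, vector_hits, metadata, {"status": "current"})
print(results)
```

Note that doc_code, which only the keyword search found, survives into the final list, while the archived document is removed by the metadata filter regardless of its similarity score.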

Retrieval is no longer a preprocessing step. It is part of the agent’s reasoning. That reframing is what separates a 2026 RAG system from a 2023 one.
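
The retrieve, critique, and retry loop can be sketched as follows. Query decomposition is left out for brevity, and the three callables (retrieve, is_relevant, reformulate) are toy stand-ins for a search backend and model calls; the corpus and rewrite list are invented for the example.

```python
from typing import Callable

CORPUS = [
    "Termination requires 30 days written notice.",
    "Office hours are 9 to 5.",
]


def retrieve(query: str) -> list[str]:
    """Toy keyword search: return documents sharing any word with the query."""
    terms = set(query.lower().split())
    return [doc for doc in CORPUS if terms & set(doc.lower().rstrip(".").split())]


def is_relevant(question: str, hit: str) -> bool:
    # Stand-in for an LLM relevance judge.
    return "termination" in hit.lower()


def reformulate(question: str, last_query: str, attempt: int) -> str:
    # Stand-in for an LLM rewrite; a fixed rewrite list keeps the sketch runnable.
    rewrites = ["contract termination notice", "ending an agreement"]
    return rewrites[min(attempt - 1, len(rewrites) - 1)]


def agentic_retrieve(question: str,
                     retrieve: Callable[[str], list[str]],
                     is_relevant: Callable[[str, str], bool],
                     reformulate: Callable[[str, str, int], str],
                     max_attempts: int = 3) -> tuple[list[str], int]:
    """Retrieve, keep only hits judged relevant, and retry with a new query if none."""
    query = question
    for attempt in range(1, max_attempts + 1):
        hits = [hit for hit in retrieve(query) if is_relevant(question, hit)]
        if hits:
            return hits, attempt
        query = reformulate(question, query, attempt)
    return [], max_attempts


hits, attempts = agentic_retrieve(
    "what are the rules to end the contract", retrieve, is_relevant, reformulate
)
print(hits, attempts)
```

In this toy run the first query only surfaces the office-hours document, the relevance check rejects it, and the reformulated query finds the termination clause on the second attempt.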

Q2.5 — What is prompt caching and why does it matter?

Every major model provider now caches stable prefixes of your context so you do not pay full price to re-process them on every call. The economics are dramatic: Anthropic charges cached reads at about 10% of the normal input rate, OpenAI automatically discounts cached prefixes by around 50%, and others have similar schemes. For a typical agent loop — where the same system prompt and tool definitions get sent on every single step — this routinely cuts the input-token bill by 30–50% with zero change to output quality.
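
A back-of-envelope calculation shows where a saving in that range can come from. The sketch assumes a fixed stable prefix re-sent on every step, a cached-read price of 10% of the normal input rate, and a hypothetical price of 3.00 per million input tokens; it ignores any cache-write surcharge and the fact that history grows from step to step.

```python
from typing import Optional


def input_cost(steps: int, prefix_tokens: int, variable_tokens: int,
               price_per_mtok: float,
               cached_read_ratio: Optional[float] = None) -> float:
    """Total input cost of an agent loop, with or without prefix caching.

    Simplified model: the first call pays full price for everything; later
    calls pay `cached_read_ratio` of the normal rate for the stable prefix.
    """
    per_token = price_per_mtok / 1_000_000
    if cached_read_ratio is None:
        return steps * (prefix_tokens + variable_tokens) * per_token
    first_call = (prefix_tokens + variable_tokens) * per_token
    later_calls = (steps - 1) * (
        prefix_tokens * cached_read_ratio + variable_tokens
    ) * per_token
    return first_call + later_calls


# Hypothetical loop: 20 steps, 8,000 stable tokens, 12,000 variable tokens per step.
uncached = input_cost(20, 8_000, 12_000, price_per_mtok=3.0)
cached = input_cost(20, 8_000, 12_000, price_per_mtok=3.0, cached_read_ratio=0.10)
saving = 1 - cached / uncached
print(f"uncached={uncached:.4f} cached={cached:.4f} saving={saving:.1%}")
```

Under these assumptions the saving is a little over a third of the input bill. The larger the stable prefix is relative to the variable part, the larger the saving.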

There is an architectural catch you must design around: caching only works on a stable prefix. The cache breaks at the first byte that changes. So the ordering of your context is not cosmetic — it is a cost decision. Put stable content first and variable content last:

system prompt → tool definitions → static reference corpus → conversation history → current user message

Anything placed after a variable element cannot be cached. Teams that don’t know this rule often interleave dynamic content early in the prompt and then wonder why their cache hit rate — visible in the logs — is near zero. Check your cache hit rate. It is free money you are probably leaving on the table.
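
The effect of ordering can be checked with a small experiment: build the context for two consecutive calls and measure how much of the second call is an exact prefix match with the first. Character-level matching is used here as a rough proxy for what a provider's cache does at the token level, and build_context and shared_prefix_fraction are hypothetical helpers.

```python
import os


def build_context(system_prompt: str, tool_definitions: str, reference_corpus: str,
                  history: list[str], user_message: str) -> str:
    """Assemble the context in stable-first order."""
    return "\n\n".join(
        [system_prompt, tool_definitions, reference_corpus, "\n".join(history), user_message]
    )


def shared_prefix_fraction(previous: str, current: str) -> float:
    """Fraction of `current` that is an exact prefix match with `previous`."""
    # os.path.commonprefix compares character by character on any strings.
    shared = len(os.path.commonprefix([previous, current]))
    return shared / len(current)


SYSTEM = "You are a support agent."
TOOLS = "tools: get_invoice, refund_payment"
CORPUS = "policy: refunds within 30 days"

call_1 = build_context(SYSTEM, TOOLS, CORPUS, ["user: hi"], "user: where is my invoice?")
call_2 = build_context(
    SYSTEM, TOOLS, CORPUS,
    ["user: hi", "user: where is my invoice?", "agent: checking"],
    "user: thanks",
)

# Stable-first ordering: most of the second call repeats the first.
stable_first = shared_prefix_fraction(call_1, call_2)

# Same content with a changing timestamp at the front: the match ends almost at once.
dynamic_first = shared_prefix_fraction(
    "time: 10:00:01\n\n" + call_1, "time: 10:00:09\n\n" + call_2
)
print(f"stable-first={stable_first:.2f} dynamic-first={dynamic_first:.2f}")
```

The two variants carry the same information; only the position of the timestamp differs, and that alone decides whether the shared prefix covers most of the call or almost none of it.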