Context Engineering: How to Feed Your AI Agents
Ricardo Argüello — March 11, 2026
CEO & Founder
General summary
AI agents fail less because of the model and more because of what they receive as context. Context engineering — deciding exactly which tokens the model sees per call — produces bigger accuracy gains than switching models. This post is the practical guide.
- Context engineering focuses on what information is present, not how you ask — it's different from prompt engineering
- The system prompt, tool definitions, retrieved data, and conversation history all compete for the same token budget
- Loading data on demand instead of upfront dropped tokens from ~45K to ~12K and raised accuracy from ~72% to ~91% in one IQ Source project
- System prompts beyond ~1,500 tokens show diminishing returns — edge cases are better handled through tools and on-demand retrieval
- Sub-agent architecture lets each task start with a clean window and return only a condensed summary
Imagine you're cooking on a small counter. If you pile every ingredient you own on it, you can't find anything and you'll grab the wrong spice. But if you set out only what today's recipe needs — and pull extras from the pantry when you actually need them — everything goes smoothly. Context engineering does exactly that with the information you give an AI agent.
Consider a procurement agent with access to a 200-page supplier manual, 45 active contracts, and the current purchase request. I asked it to evaluate the cheapest vendor. It pulled data from a contract that had expired two years ago.
The model wasn’t bad. The context was.
In the previous post I showed the data: models lose accuracy as context grows. All of them — no exceptions. This post is the practical companion: now that we know the window is limited, the question is what to put in it.
Context is a budget, not a warehouse
Every token that enters the context window competes for the model’s attention. Anthropic describes it as finding “the smallest possible set of high-signal tokens.” It’s not about including everything that might be relevant. It’s about including only what is relevant for this specific call.
And here’s the distinction that matters: prompt engineering is how you ask. Context engineering is what information is present when the model processes your question.
The budget splits across four components:
- System prompt — the agent’s instructions and personality
- Tool definitions — what it can do and how
- Retrieved data — documents, records, search results
- Conversation history — everything said earlier in the session
Most deployments I review for clients share the same problem: they over-invest in one component and starve the others. A 4,000-token system prompt that leaves no room for the data the agent actually needs. Or duplicate tools that consume tokens without adding capability.
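To make the budget concrete, here is a minimal sketch of a per-component token audit. It uses a rough chars/4 heuristic instead of a real tokenizer, and the component texts are placeholders; the point is seeing where the window actually goes before you deploy.

```python
# Rough audit of how the four components split the context budget.
# chars/4 is a crude token estimate; a real tokenizer would be more accurate.

def approx_tokens(text: str) -> int:
    return len(text) // 4

def budget_report(components: dict[str, str], window: int = 200_000) -> dict:
    counts = {name: approx_tokens(text) for name, text in components.items()}
    used = sum(counts.values())
    return {"per_component": counts, "used": used, "remaining": window - used}

report = budget_report({
    "system_prompt": "You are a procurement assistant. " * 40,
    "tools": '{"name": "get_contract", "description": "..."} ' * 20,
    "retrieved": "Contract clause text... " * 400,
    "history": "User: evaluate vendor X. " * 100,
})
# report["per_component"] shows which layer is eating the budget.
```

Running this against a real agent's inputs usually makes the imbalance obvious at a glance.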
Anatomy of context that works
System prompts: the right altitude
An effective system prompt defines the agent’s role and constraints, specifies the output format, and uses clear sections (XML or markdown) so the model can find what it needs. Explicit exclusions matter too — what the agent shouldn’t do is as important as what it should.
The most common mistake: writing an operations manual inside the system prompt. Past ~1,500 tokens, the system prompt hits diminishing returns. Every additional instruction competes with the previous ones for the model’s attention.
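One way to keep a prompt at the right altitude is to structure it as a few short, clearly tagged sections and enforce a size ceiling in code. The tags, wording, and the procurement role below are illustrative, not a prescribed schema.

```python
# A sketch of a focused system prompt: role, constraints, output format,
# each in its own tagged section so the model can find what it needs.
SYSTEM_PROMPT = """\
<role>
You are a procurement review agent. You evaluate purchase requests
against the supplier contracts you are given.
</role>
<constraints>
- Only cite contracts that are currently active.
- If required data is missing, request it via a tool; do not guess.
- Never approve requests above the requester's spending limit.
</constraints>
<output_format>
Return JSON: {"decision": "approve" | "reject" | "needs_data", "reason": "..."}
</output_format>
"""

def approx_tokens(text: str) -> int:
    return len(text) // 4  # crude chars/4 estimate

# Keep the prompt well under the ~1,500-token ceiling where returns diminish.
assert approx_tokens(SYSTEM_PROMPT) < 1500
```

Everything else, including edge-case handling, moves into tools and on-demand retrieval.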
Tools: every definition costs tokens
Every tool you define consumes tokens — its name, description, parameter schema all count against the budget. If you have tools that overlap (search_customer and lookup_customer that do nearly the same thing), you’re paying double for the same capability and confusing the model about which one to use.
Definitions should be self-contained. If a model needs to read the system prompt to understand how to use a tool, the definition is incomplete. And if your tools expose APIs, the definition design connects directly to your API strategy for AI integration.
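As a sketch of what "self-contained" means in practice, here is a tool definition whose description and schema carry everything the model needs, so the system prompt never has to explain it. The shape follows the common JSON-Schema style most LLM APIs use; the tool name, fields, and error behavior are assumptions for illustration.

```python
# A self-contained tool definition: usage guidance and constraints live in
# the description, and every parameter documents itself in the schema.
GET_CONTRACT = {
    "name": "get_contract",
    "description": (
        "Fetch the current contract for a single vendor. Returns only "
        "active contracts; expired contracts produce a 'contract_expired' "
        "error. Use this instead of loading all contracts upfront."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_id": {
                "type": "string",
                "description": "Vendor identifier, e.g. 'V-1042'.",
            },
            "section": {
                "type": "string",
                "description": "Optional section to return, e.g. 'pricing'. "
                               "Omit to get the full contract.",
            },
        },
        "required": ["vendor_id"],
    },
}
```

A definition like this costs tokens on every call, which is exactly why a second, overlapping `lookup_contract` tool would be pure waste.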
Retrieved data: where most budgets leak
This is where most of the budget gets wasted. An agent that needs two paragraphs from a contract receives the full 50 pages. A support agent that needs one knowledge base article receives ten “just in case.”
The difference in practice:
| | Naive context | Engineered context |
|---|---|---|
| Procurement agent | Full manual (200 pages) + all contracts + request = ~180K tokens | Relevant manual section + specific vendor contract + request = ~8K tokens |
| Accuracy | ~68% | ~89% |
| Cost per call | ~$1.80 | ~$0.08 |
Same model. Same task. The tokens we removed were noise diluting the model’s attention.
Load on demand, not upfront
The most effective pattern we use at IQ Source is progressive loading. Instead of giving the agent all the information at the start, you give it lightweight identifiers and the ability to request data when it needs it.
A real example: a support escalation agent. The original version loaded the full ticket, customer history, contract terms, and knowledge base. All at the start, without knowing if it would need any of it.
The redesigned version starts with just the ticket metadata: customer ID, issue category, priority level. The agent analyzes and decides: do I need the contract? It requests it through a tool. Need the history? It asks. Data arrives exactly when it’s needed.
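The escalation flow above can be sketched as follows. The data stores, tool names, and the decision rules are stand-ins for a real model and database; in production the model itself decides when to call each tool.

```python
# Progressive loading sketch: the agent starts with ticket metadata only
# and pulls heavier records through tools when the task requires them.
CONTRACTS = {"C-7": "Premium support plan, 4h SLA, renewal 2026-09..."}
HISTORIES = {"ACME": "3 prior tickets, all resolved within SLA..."}

TOOLS = {
    "get_contract": lambda ticket: CONTRACTS[ticket["contract_id"]],
    "get_history": lambda ticket: HISTORIES[ticket["customer_id"]],
}

def run_agent(ticket: dict) -> list[str]:
    # The window opens with lightweight identifiers, not full records.
    context = [f"ticket: {ticket['category']} / priority {ticket['priority']}"]
    # Stand-in policy: billing escalations need contract terms...
    if ticket["category"] == "billing":
        context.append(TOOLS["get_contract"](ticket))
    # ...and only high-priority cases need the customer history.
    if ticket["priority"] == "high":
        context.append(TOOLS["get_history"](ticket))
    return context

ctx = run_agent({"customer_id": "ACME", "contract_id": "C-7",
                 "category": "billing", "priority": "low"})
# Only the contract was loaded; the history stayed in the database.
```

Data arrives exactly when it is needed, and everything the agent never asks for never costs a token.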
The result in a document review project: active tokens dropped from ~45K to ~12K. Accuracy went from ~72% to ~91%. We didn’t change the model. We changed when and what information it received.
Three techniques for agents that work for hours
When an agent needs to operate for hours — processing an approval pipeline, reviewing regulatory documentation, monitoring a deployment — the window inevitably fills up. These techniques address that:
Automatic compaction
The model auto-summarizes the oldest parts of the conversation. I covered this in detail in the context windows post: Claude 3 Opus cuts token usage by ~58% automatically.
But there’s a catch: compaction loses detail. If your process is regulated and you need exact traceability of every step, save the originals to an external system before they get compacted.
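A sketch of that archive-then-compact pattern: before the oldest turns are squashed into a summary, the originals are saved externally so traceability survives. `summarize()` is a stand-in for a model call; the function names are illustrative.

```python
# Compaction with an external archive: the window keeps a summary plus
# recent turns, while the originals persist outside the model entirely.

def summarize(messages: list[str]) -> str:
    # Stand-in for a model-generated summary of the old turns.
    return f"[summary of {len(messages)} earlier turns]"

def compact(history: list[str], keep_recent: int, archive: list[str]) -> list[str]:
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    archive.extend(old)               # originals saved BEFORE they vanish
    return [summarize(old)] + recent  # window: summary + recent turns

archive: list[str] = []
history = [f"turn {i}" for i in range(10)]
history = compact(history, keep_recent=3, archive=archive)
```

The window shrinks to four entries, while all ten originals remain retrievable from the archive.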
Structured scratchpads
The agent writes its findings to an external file or database as it works. When context gets trimmed, the findings persist. The agent reads its own notes back when it needs them.
It’s like a researcher who takes notes in a notebook: if short-term memory fades, the notes say what’s been found and what’s still missing.
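A minimal sketch of that notebook, using a local JSON file as the external store. The file name and record shape are assumptions; any database would serve the same role.

```python
# Structured scratchpad: findings persist in a file outside the context
# window, so the agent can re-read them after its history is trimmed.
import json
from pathlib import Path

SCRATCHPAD = Path("findings.json")
SCRATCHPAD.unlink(missing_ok=True)  # start clean for this demo

def note(finding: dict) -> None:
    records = json.loads(SCRATCHPAD.read_text()) if SCRATCHPAD.exists() else []
    records.append(finding)
    SCRATCHPAD.write_text(json.dumps(records))

def recall() -> list[dict]:
    return json.loads(SCRATCHPAD.read_text()) if SCRATCHPAD.exists() else []

note({"doc": "contract-12.pdf", "status": "reviewed", "issue": "expired clause 4.2"})
note({"doc": "contract-13.pdf", "status": "pending"})
# Even after the conversation is compacted, the notes answer
# "what's been found and what's still missing":
pending = [r["doc"] for r in recall() if r["status"] == "pending"]
```

The agent's working memory can now be trimmed aggressively, because nothing load-bearing lives only in the window.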
Sub-agent architecture
A parent agent orchestrates. Sub-agents start with clean windows, process specific tasks, and return condensed summaries (~1-2K tokens each). Where a single agent with all the information might process tens of thousands of tokens with declining accuracy, each sub-agent works with exactly what it needs.
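The orchestration pattern can be sketched like this. `run_subagent()` stands in for a fresh model call with a clean window; in a real system each call would carry its own system prompt and tools, and return a genuinely model-written summary.

```python
# Sub-agent sketch: the parent splits work into focused tasks, each
# sub-agent sees only its own slice of data, and only short summaries
# flow back into the parent's window.

def run_subagent(task: str, data: str) -> str:
    # Clean window: only `task` and `data`, none of the parent's history.
    return f"{task}: processed {len(data)} chars -> ok"

def orchestrate(tasks: dict[str, str]) -> list[str]:
    summaries = []
    for task, data in tasks.items():
        # Each summary is ~1-2K tokens in practice, not the raw input.
        summaries.append(run_subagent(task, data))
    return summaries  # the parent reasons over summaries, not raw data

summaries = orchestrate({
    "review_contract": "x" * 50_000,
    "check_pricing": "y" * 30_000,
})
```

The parent's window holds two short lines instead of 80K characters of raw input, which is the whole point of the pattern.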
For a deeper look, we covered how to orchestrate agents in enterprise operations in detail.
What we see go wrong in most deployments
After reviewing dozens of agent implementations with clients, these are the three anti-patterns that show up again and again:
Kitchen-sink system prompt. 4,000+ tokens covering every possible edge case. The model doesn’t ignore those instructions — they actively compete with the important ones. Past ~1,500 tokens, every additional rule dilutes the ones before it. The fix: move edge cases into tools the agent calls only when it needs them.
Duplicate information across layers. A tool’s description repeats what the system prompt says, which in turn repeats what’s already in the retrieved data. In a recent audit we found ~25% redundancy in the tokens of a customer service agent. Removing the duplicates improved accuracy without changing anything else.
Static context for dynamic tasks. The agent always receives the same 15K tokens of company policy, whether it’s classifying a support ticket or drafting a sales proposal. Context should adapt to the task type. A classifier needs labeled examples. A generator needs constraints and format specs. A lookup agent just needs the relevant data, not instructions on how to search.
Checklist before deploying your next agent
Before putting an agent in production, run through this exercise:
Budget your context. Calculate how many tokens you have available after the system prompt and tools. If your model has 200K tokens and the prompt plus tools already consume 8K, your real budget for data and conversation is 192K — but real-world accuracy degrades well before that limit.
Audit for redundancy. Read every token your agent receives. Literally. Highlight anything that appears more than once. If the same information lives in the system prompt and a tool description, it’s redundant in one of the two places.
Test with context ablation. Remove chunks of context and measure whether accuracy changes. If you can strip a 3K-token block and accuracy doesn’t drop, those tokens were noise. They were cost without value.
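The ablation loop is mechanical enough to automate. In this sketch, `evaluate()` is a stand-in for your real eval harness; the block names and the toy accuracy rule are assumptions for illustration.

```python
# Context ablation sketch: drop one block at a time, re-run the eval,
# and flag any block whose removal doesn't hurt accuracy.

def evaluate(context_blocks: list[str]) -> float:
    # Stand-in eval: accuracy only depends on the two blocks that matter.
    needed = {"vendor contract", "purchase request"}
    return 0.9 if needed <= set(context_blocks) else 0.6

def ablate(blocks: list[str], tolerance: float = 0.01) -> list[str]:
    baseline = evaluate(blocks)
    removable = []
    for i, block in enumerate(blocks):
        without = blocks[:i] + blocks[i + 1:]
        if evaluate(without) >= baseline - tolerance:
            removable.append(block)  # cost without value: safe to cut
    return removable

noise = ablate(["vendor contract", "full supplier manual", "purchase request"])
```

Here the 200-page manual is flagged as removable while the contract and request survive, which matches the intuition from the procurement table above.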
Design retrieval around task type. Not every agent call needs the same data:
- Classification → needs labeled examples
- Generation → needs constraints and format
- Lookup → needs the relevant data, not instructions on how to search
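Those three rules can be sketched as a simple dispatch: one context builder per task type instead of one static blob for everything. The builder names and task fields are illustrative.

```python
# Task-aware context assembly: each task type gets its own builder
# instead of the same static 15K tokens of company policy.

def classification_context(task: dict) -> list[str]:
    return ["labeled examples:"] + task["examples"]

def generation_context(task: dict) -> list[str]:
    return ["constraints:"] + task["constraints"] + ["format: " + task["format"]]

def lookup_context(task: dict) -> list[str]:
    return [task["record"]]  # just the data, no instructions on how to search

BUILDERS = {
    "classify": classification_context,
    "generate": generation_context,
    "lookup": lookup_context,
}

def build_context(task: dict) -> list[str]:
    return BUILDERS[task["type"]](task)

ctx = build_context({"type": "lookup",
                     "record": "Vendor V-1042: active contract C-7"})
```

Adding a new task type means adding one builder, not growing a shared prompt that every call pays for.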
The model is not the bottleneck
Every time a client tells us “the agent isn’t performing well, should we switch models?” — the first thing we review is the context. In most cases — I’d say ~80% — the problem isn’t the model. It’s that the model receives too much irrelevant information and too little relevant information.
Context engineering isn’t a one-time project. It’s ongoing maintenance, the same way you optimize database queries or refactor code. Every time you change a workflow, the available data, or the agent’s tools, the context needs adjustment.
If you already have agents deployed or a pilot running, send us the system prompt, tool definitions, and a sample of retrieved data for one agent. We’ll review it and send back a context engineering report — what’s redundant, what’s missing, where tokens are wasted. No meeting needed, just the files. Get in touch here.
Frequently Asked Questions
What is context engineering?

It's the discipline of deciding exactly which tokens the model sees per call — system prompt, tool definitions, retrieved data, and conversation history. Unlike prompt engineering (how you ask), context engineering focuses on what information is present. The goal is the smallest set of high-signal tokens for the desired outcome.

How long should a system prompt be?

Past ~1,500 tokens, system prompts show diminishing returns. Focused prompts that define role, constraints, and output format work better than 4,000-token rulebooks covering every edge case. Handle exceptions through tools and on-demand data retrieval instead.

How does sub-agent architecture work?

A parent agent orchestrates, sending focused tasks to sub-agents that start with clean context windows. Each sub-agent processes its task and returns a condensed summary of ~1-2K tokens. Result: higher accuracy per task, lower total token cost, and the ability to process information that exceeds a single window.

What results does progressive loading deliver?

In IQ Source projects, switching from full-document loading to progressive retrieval has improved task accuracy by ~20 points and reduced token cost by ~70%, without changing the model. The improvement comes from removing irrelevant information that competes for the model's attention.