Context Engineering: How to Feed Your AI Agents
Ricardo Argüello — March 11, 2026
CEO & Founder
General summary
AI agents fail less because of the model and more because of what they receive as context. Context engineering — deciding exactly which tokens the model sees per call — produces bigger accuracy gains than switching models. This post is the practical guide.
- Context engineering focuses on what information is present, not how you ask — it's different from prompt engineering
- The system prompt, tool definitions, retrieved data, and conversation history all compete for the same token budget
- Loading data on demand instead of upfront dropped tokens from ~45K to ~12K and raised accuracy from ~72% to ~91% in one IQ Source project
- System prompts beyond ~1,500 tokens show diminishing returns — edge cases are better handled through tools and on-demand retrieval
- Sub-agent architecture lets each task start with a clean window and return only a condensed summary
Imagine you're cooking on a small counter. If you pile every ingredient you own on it, you can't find anything and you'll grab the wrong spice. But if you set out only what today's recipe needs — and pull extras from the pantry when you actually need them — everything goes smoothly. Context engineering does exactly that with the information you give an AI agent.
Consider a procurement agent with access to a 200-page supplier manual, 45 active contracts, and the current purchase request. I asked it to evaluate the cheapest vendor. It pulled data from a contract that had expired two years ago.
The model wasn’t bad. The context was.
In the previous post I showed the data: models lose accuracy as context grows. All of them — no exceptions. This post is the practical companion: now that we know the window is limited, the question is what to put in it.
Context is a budget, not a warehouse
Every token that enters the context window competes for the model’s attention. Anthropic describes it as finding “the smallest possible set of high-signal tokens.” It’s not about including everything that might be relevant. It’s about including only what is relevant for this specific call.
And here’s the distinction that matters: prompt engineering is how you ask. Context engineering is what information is present when the model processes your question.
The budget splits across four components:
- System prompt — the agent’s instructions and personality
- Tool definitions — what it can do and how
- Retrieved data — documents, records, search results
- Conversation history — everything said earlier in the session
Most deployments I review for clients share the same problem: they over-invest in one component and starve the others. A 4,000-token system prompt that leaves no room for the data the agent actually needs. Or duplicate tools that consume tokens without adding capability.
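To make the budget concrete, here is a minimal sketch of a per-component token audit. It uses a rough chars/4 heuristic instead of a real tokenizer, and the component texts are placeholders; the point is seeing where the window actually goes before you deploy.

```python
# Rough audit of how the four components split the context budget.
# chars/4 is a crude token estimate; a real tokenizer would be more accurate.

def approx_tokens(text: str) -> int:
    return len(text) // 4

def budget_report(components: dict[str, str], window: int = 200_000) -> dict:
    counts = {name: approx_tokens(text) for name, text in components.items()}
    used = sum(counts.values())
    return {"per_component": counts, "used": used, "remaining": window - used}

report = budget_report({
    "system_prompt": "You are a procurement assistant. " * 40,
    "tools": '{"name": "get_contract", "description": "..."} ' * 20,
    "retrieved": "Contract clause text... " * 400,
    "history": "User: evaluate vendor X. " * 100,
})
# report["per_component"] shows which layer is eating the budget.
```

Running this against a real agent's inputs usually makes the imbalance obvious at a glance.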
Anatomy of context that works
System prompts: the right altitude
An effective system prompt defines the agent’s role and constraints, specifies the output format, and uses clear sections (XML or markdown) so the model can find what it needs. Explicit exclusions matter too — what the agent shouldn’t do is as important as what it should.
The most common mistake: writing an operations manual inside the system prompt. Past ~1,500 tokens, the system prompt hits diminishing returns. Every additional instruction competes with the previous ones for the model’s attention.
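One way to keep a prompt at the right altitude is to structure it as a few short, clearly tagged sections and enforce a size ceiling in code. The tags, wording, and the procurement role below are illustrative, not a prescribed schema.

```python
# A sketch of a focused system prompt: role, constraints, output format,
# each in its own tagged section so the model can find what it needs.
SYSTEM_PROMPT = """\
<role>
You are a procurement review agent. You evaluate purchase requests
against the supplier contracts you are given.
</role>
<constraints>
- Only cite contracts that are currently active.
- If required data is missing, request it via a tool; do not guess.
- Never approve requests above the requester's spending limit.
</constraints>
<output_format>
Return JSON: {"decision": "approve" | "reject" | "needs_data", "reason": "..."}
</output_format>
"""

def approx_tokens(text: str) -> int:
    return len(text) // 4  # crude chars/4 estimate

# Keep the prompt well under the ~1,500-token ceiling where returns diminish.
assert approx_tokens(SYSTEM_PROMPT) < 1500
```

Everything else, including edge-case handling, moves into tools and on-demand retrieval.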
Tools: every definition costs tokens
Every tool you define consumes tokens — its name, description, parameter schema all count against the budget. If you have tools that overlap (search_customer and lookup_customer that do nearly the same thing), you’re paying double for the same capability and confusing the model about which one to use.
Definitions should be self-contained. If a model needs to read the system prompt to understand how to use a tool, the definition is incomplete. And if your tools expose APIs, the definition design connects directly to your API strategy for AI integration.
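As a sketch of what "self-contained" means in practice, here is a tool definition whose description and schema carry everything the model needs, so the system prompt never has to explain it. The shape follows the common JSON-Schema style most LLM APIs use; the tool name, fields, and error behavior are assumptions for illustration.

```python
# A self-contained tool definition: usage guidance and constraints live in
# the description, and every parameter documents itself in the schema.
GET_CONTRACT = {
    "name": "get_contract",
    "description": (
        "Fetch the current contract for a single vendor. Returns only "
        "active contracts; expired contracts produce a 'contract_expired' "
        "error. Use this instead of loading all contracts upfront."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_id": {
                "type": "string",
                "description": "Vendor identifier, e.g. 'V-1042'.",
            },
            "section": {
                "type": "string",
                "description": "Optional section to return, e.g. 'pricing'. "
                               "Omit to get the full contract.",
            },
        },
        "required": ["vendor_id"],
    },
}
```

A definition like this costs tokens on every call, which is exactly why a second, overlapping `lookup_contract` tool would be pure waste.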
Retrieved data: where most budgets leak
This is where most of the budget gets wasted. An agent that needs two paragraphs from a contract receives the full 50 pages. A support agent that needs one knowledge base article receives ten “just in case.”
The difference in practice:
| | Naive context | Engineered context |
|---|---|---|
| Procurement agent | Full manual (200 pages) + all contracts + request = ~180K tokens | Relevant manual section + specific vendor contract + request = ~8K tokens |
| Accuracy | ~68% | ~89% |
| Cost per call | ~$1.80 | ~$0.08 |
Same model. Same task. The tokens we removed were noise diluting the model’s attention.
Load on demand, not upfront
The most effective pattern we use at IQ Source is progressive loading. Instead of giving the agent all the information at the start, you give it lightweight identifiers and the ability to request data when it needs it.
A real example: a support escalation agent. The original version loaded the full ticket, customer history, contract terms, and knowledge base. All at the start, without knowing if it would need any of it.
The redesigned version starts with just the ticket metadata: customer ID, issue category, priority level. The agent analyzes and decides: do I need the contract? It requests it through a tool. Need the history? It asks. Data arrives exactly when it’s needed.
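The escalation flow above can be sketched as follows. The data stores, tool names, and the decision rules are stand-ins for a real model and database; in production the model itself decides when to call each tool.

```python
# Progressive loading sketch: the agent starts with ticket metadata only
# and pulls heavier records through tools when the task requires them.
CONTRACTS = {"C-7": "Premium support plan, 4h SLA, renewal 2026-09..."}
HISTORIES = {"ACME": "3 prior tickets, all resolved within SLA..."}

TOOLS = {
    "get_contract": lambda ticket: CONTRACTS[ticket["contract_id"]],
    "get_history": lambda ticket: HISTORIES[ticket["customer_id"]],
}

def run_agent(ticket: dict) -> list[str]:
    # The window opens with lightweight identifiers, not full records.
    context = [f"ticket: {ticket['category']} / priority {ticket['priority']}"]
    # Stand-in policy: billing escalations need contract terms...
    if ticket["category"] == "billing":
        context.append(TOOLS["get_contract"](ticket))
    # ...and only high-priority cases need the customer history.
    if ticket["priority"] == "high":
        context.append(TOOLS["get_history"](ticket))
    return context

ctx = run_agent({"customer_id": "ACME", "contract_id": "C-7",
                 "category": "billing", "priority": "low"})
# Only the contract was loaded; the history stayed in the database.
```

Data arrives exactly when it is needed, and everything the agent never asks for never costs a token.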
The result in a document review project: active tokens dropped from ~45K to ~12K. Accuracy went from ~72% to ~91%. We didn’t change the model. We changed when and what information it received.
Three techniques for agents that work for hours
When an agent needs to operate for hours — processing an approval pipeline, reviewing regulatory documentation, monitoring a deployment — the window inevitably fills up. These techniques address that:
Automatic compaction
The model auto-summarizes the oldest parts of the conversation. I covered this in detail in the context windows post: Claude 3 Opus cuts token usage by ~58% automatically.
But there’s a catch: compaction loses detail. If your process is regulated and you need exact traceability of every step, save the originals to an external system before they get compacted.
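A sketch of that archive-then-compact pattern: before the oldest turns are squashed into a summary, the originals are saved externally so traceability survives. `summarize()` is a stand-in for a model call; the function names are illustrative.

```python
# Compaction with an external archive: the window keeps a summary plus
# recent turns, while the originals persist outside the model entirely.

def summarize(messages: list[str]) -> str:
    # Stand-in for a model-generated summary of the old turns.
    return f"[summary of {len(messages)} earlier turns]"

def compact(history: list[str], keep_recent: int, archive: list[str]) -> list[str]:
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    archive.extend(old)               # originals saved BEFORE they vanish
    return [summarize(old)] + recent  # window: summary + recent turns

archive: list[str] = []
history = [f"turn {i}" for i in range(10)]
history = compact(history, keep_recent=3, archive=archive)
```

The window shrinks to four entries, while all ten originals remain retrievable from the archive.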
Structured scratchpads
The agent writes its findings to an external file or database as it works. When context gets trimmed, the findings persist. The agent reads its own notes back when it needs them.
It’s like a researcher who takes notes in a notebook: if short-term memory fades, the notes say what’s been found and what’s still missing.
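A minimal sketch of that notebook, using a local JSON file as the external store. The file name and record shape are assumptions; any database would serve the same role.

```python
# Structured scratchpad: findings persist in a file outside the context
# window, so the agent can re-read them after its history is trimmed.
import json
from pathlib import Path

SCRATCHPAD = Path("findings.json")
SCRATCHPAD.unlink(missing_ok=True)  # start clean for this demo

def note(finding: dict) -> None:
    records = json.loads(SCRATCHPAD.read_text()) if SCRATCHPAD.exists() else []
    records.append(finding)
    SCRATCHPAD.write_text(json.dumps(records))

def recall() -> list[dict]:
    return json.loads(SCRATCHPAD.read_text()) if SCRATCHPAD.exists() else []

note({"doc": "contract-12.pdf", "status": "reviewed", "issue": "expired clause 4.2"})
note({"doc": "contract-13.pdf", "status": "pending"})
# Even after the conversation is compacted, the notes answer
# "what's been found and what's still missing":
pending = [r["doc"] for r in recall() if r["status"] == "pending"]
```

The agent's working memory can now be trimmed aggressively, because nothing load-bearing lives only in the window.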
Sub-agent architecture
A parent agent orchestrates. Sub-agents start with clean windows, process specific tasks, and return condensed summaries (~1-2K tokens each). Where a single agent with all the information might process tens of thousands of tokens with declining accuracy, each sub-agent works with exactly what it needs.
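The orchestration pattern can be sketched like this. `run_subagent()` stands in for a fresh model call with a clean window; in a real system each call would carry its own system prompt and tools, and return a genuinely model-written summary.

```python
# Sub-agent sketch: the parent splits work into focused tasks, each
# sub-agent sees only its own slice of data, and only short summaries
# flow back into the parent's window.

def run_subagent(task: str, data: str) -> str:
    # Clean window: only `task` and `data`, none of the parent's history.
    return f"{task}: processed {len(data)} chars -> ok"

def orchestrate(tasks: dict[str, str]) -> list[str]:
    summaries = []
    for task, data in tasks.items():
        # Each summary is ~1-2K tokens in practice, not the raw input.
        summaries.append(run_subagent(task, data))
    return summaries  # the parent reasons over summaries, not raw data

summaries = orchestrate({
    "review_contract": "x" * 50_000,
    "check_pricing": "y" * 30_000,
})
```

The parent's window holds two short lines instead of 80K characters of raw input, which is the whole point of the pattern.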
For a deeper look, we covered how to orchestrate agents in enterprise operations in detail.
What we see go wrong in most deployments
After reviewing dozens of agent implementations with clients, these are the three anti-patterns that show up again and again:
Kitchen-sink system prompt. 4,000+ tokens covering every possible edge case. The model doesn’t ignore those instructions — they actively compete with the important ones. Past ~1,500 tokens, every additional rule dilutes the ones before it. The fix: move edge cases into tools the agent calls only when it needs them.
Duplicate information across layers. A tool’s description repeats what the system prompt says, which in turn repeats what’s already in the retrieved data. In a recent audit we found ~25% redundancy in the tokens of a customer service agent. Removing the duplicates improved accuracy without changing anything else.
Static context for dynamic tasks. The agent always receives the same 15K tokens of company policy, whether it’s classifying a support ticket or drafting a sales proposal. Context should adapt to the task type. A classifier needs labeled examples. A generator needs constraints and format specs. A lookup agent just needs the relevant data, not instructions on how to search.
Checklist before deploying your next agent
Before putting an agent in production, run through this exercise:
Budget your context. Calculate how many tokens you have available after the system prompt and tools. If your model has 200K tokens and the prompt plus tools already consume 8K, your real budget for data and conversation is 192K — but real-world accuracy degrades well before that limit.
Audit for redundancy. Read every token your agent receives. Literally. Highlight anything that appears more than once. If the same information lives in the system prompt and a tool description, it’s redundant in one of the two places.
Test with context ablation. Remove chunks of context and measure whether accuracy changes. If you can strip a 3K-token block and accuracy doesn’t drop, those tokens were noise. They were cost without value.
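The ablation loop is mechanical enough to automate. In this sketch, `evaluate()` is a stand-in for your real eval harness; the block names and the toy accuracy rule are assumptions for illustration.

```python
# Context ablation sketch: drop one block at a time, re-run the eval,
# and flag any block whose removal doesn't hurt accuracy.

def evaluate(context_blocks: list[str]) -> float:
    # Stand-in eval: accuracy only depends on the two blocks that matter.
    needed = {"vendor contract", "purchase request"}
    return 0.9 if needed <= set(context_blocks) else 0.6

def ablate(blocks: list[str], tolerance: float = 0.01) -> list[str]:
    baseline = evaluate(blocks)
    removable = []
    for i, block in enumerate(blocks):
        without = blocks[:i] + blocks[i + 1:]
        if evaluate(without) >= baseline - tolerance:
            removable.append(block)  # cost without value: safe to cut
    return removable

noise = ablate(["vendor contract", "full supplier manual", "purchase request"])
```

Here the 200-page manual is flagged as removable while the contract and request survive, which matches the intuition from the procurement table above.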
Design retrieval around task type. Not every agent call needs the same data:
- Classification → needs labeled examples
- Generation → needs constraints and format
- Lookup → needs the relevant data, not instructions on how to search
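Those three rules can be sketched as a simple dispatch: one context builder per task type instead of one static blob for everything. The builder names and task fields are illustrative.

```python
# Task-aware context assembly: each task type gets its own builder
# instead of the same static 15K tokens of company policy.

def classification_context(task: dict) -> list[str]:
    return ["labeled examples:"] + task["examples"]

def generation_context(task: dict) -> list[str]:
    return ["constraints:"] + task["constraints"] + ["format: " + task["format"]]

def lookup_context(task: dict) -> list[str]:
    return [task["record"]]  # just the data, no instructions on how to search

BUILDERS = {
    "classify": classification_context,
    "generate": generation_context,
    "lookup": lookup_context,
}

def build_context(task: dict) -> list[str]:
    return BUILDERS[task["type"]](task)

ctx = build_context({"type": "lookup",
                     "record": "Vendor V-1042: active contract C-7"})
```

Adding a new task type means adding one builder, not growing a shared prompt that every call pays for.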
The model is not the bottleneck
Every time a client tells us “the agent isn’t performing well, should we switch models?” — the first thing we review is the context. In most cases — I’d say ~80% — the problem isn’t the model. It’s that the model receives too much irrelevant information and too little relevant information.
Context engineering isn’t a one-time project. It’s ongoing maintenance, the same way you optimize database queries or refactor code. Every time you change a workflow, the available data, or the agent’s tools, the context needs adjustment.
If you already have agents deployed or a pilot running, send us the system prompt, tool definitions, and a sample of retrieved data for one agent. We’ll review it and send back a context engineering report — what’s redundant, what’s missing, where tokens are wasted. No meeting needed, just the files. Get in touch here.
Frequently Asked Questions
What is context engineering?

It's the discipline of deciding exactly which tokens the model sees per call — system prompt, tool definitions, retrieved data, and conversation history. Unlike prompt engineering (how you ask), context engineering focuses on what information is present. The goal is the smallest set of high-signal tokens for the desired outcome.

How long should a system prompt be?

Past ~1,500 tokens, system prompts show diminishing returns. Focused prompts that define role, constraints, and output format work better than 4,000-token rulebooks covering every edge case. Handle exceptions through tools and on-demand data retrieval instead.

How does sub-agent architecture work?

A parent agent orchestrates, sending focused tasks to sub-agents that start with clean context windows. Each sub-agent processes its task and returns a condensed summary of ~1-2K tokens. Result: higher accuracy per task, lower total token cost, and the ability to process information that exceeds a single window.

What results does progressive loading deliver?

In IQ Source projects, switching from full-document loading to progressive retrieval has improved task accuracy by ~20 points and reduced token cost by ~70%, without changing the model. The improvement comes from removing irrelevant information that competes for the model's attention.