Context Windows That Actually Work
Ricardo Argüello — March 7, 2026
CEO & Founder
General summary
AI vendors compete on context window size — 200K, 1M, 2M tokens — but research shows most models lose information well before hitting their limit. What matters isn't how many tokens fit, but how many the model actually retains and uses accurately.
- Chroma's research found all 18 tested models degrade as input grows — they retain information at the beginning and end but lose accuracy in the middle
- On the MRCR v2 benchmark with 8 buried facts, Claude Opus 4.6 scores 76% while Gemini 3 Pro scores 26.3% despite having double the window size
- Structured content like contracts and code is actually harder for models to process than randomized text
- RAG works best for large knowledge bases with discrete lookups; full context is better for contracts, regulatory docs, and codebases with cross-dependencies
- Context compaction can reduce token usage by ~58%, enabling multi-hour agent sessions without losing critical information
Imagine you're reading a 3,000-page book, but you can only really pay attention to the first 50 pages and the last 50 — everything in the middle gets blurry. That's what happens to most AI models with long documents. The advertised 'context window' is like the number of pages the book has, but what actually matters is how many pages the model can read without losing track of important details.
200K tokens. 1M tokens. 2M tokens. AI vendors compete on context window size like it’s the spec that defines the model. Every launch brags about a bigger number.
But when I evaluate models for client projects at IQ Source, window size tells me very little. What I need to know is: what percentage of that window does the model actually use with accuracy?
Size is marketing. Retention is engineering.
A study by Chroma published in July 2025 tested 18 language models. All 18 showed degradation as input text grew. The researchers found a consistent pattern: models retain information better at the beginning and end of the text, but fail with data buried in the middle. They call it context rot.
Here’s the strange part: when researchers shuffled content randomly (instead of presenting it in logical order), models processed it better. Structured content — contracts, code, filings, the stuff we actually use in business — is harder for LLMs to process than jumbled text with no structure.
If your company processes long contracts, regulatory documentation, or large codebases, this matters. A penalty clause buried on page 147 of a contract isn’t at the beginning or the end. It’s in the zone where models lose accuracy.
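You can measure this effect on your own documents with a minimal needle-in-the-middle probe. A sketch under stated assumptions: `complete` stands in for whichever model API you call, and the filler, needle, and question are your own test material.

```python
def build_probe(filler_paragraphs, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    i = int(depth * len(filler_paragraphs))
    parts = filler_paragraphs[:i] + [needle] + filler_paragraphs[i:]
    return "\n\n".join(parts)

def run_probe(complete, filler, needle, question, answer,
              depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """For each depth, ask the model about the buried fact and
    record whether the expected answer appears in its reply."""
    results = {}
    for d in depths:
        prompt = build_probe(filler, needle, d) + "\n\n" + question
        results[d] = answer.lower() in complete(prompt).lower()
    return results
```

If context rot applies to your model, the mid-depth entries (0.25–0.75) will fail first while 0.0 and 1.0 keep passing.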
76% vs 18.5%: the difference that changes decisions
Anthropic published results from the MRCR v2 benchmark (multi-round co-reference resolution), which measures something very specific: whether a model can find and use data buried in a 1-million-token context when there are 8 hidden facts.
The results:
| Model | Window | MRCR v2 accuracy (8 needles) |
|---|---|---|
| Claude Opus 4.6 | 1M tokens | 76% |
| Gemini 3 Pro | 2M tokens | 26.3% |
| Claude Sonnet 4.5 | 200K tokens | 18.5% |
Gemini 3 Pro has double the window of Opus 4.6 and less than a third of its accuracy. It’s like having a 400-square-meter warehouse where you lose half of what you store, versus a 200-square-meter one where you can find everything.
According to analysis by Redis, models that advertise 200K token windows become unreliable around 130K tokens — between 60% and 70% of the stated capacity. This isn’t a one-off defect; it’s a structural limitation of how current attention mechanisms work.
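That rule of thumb is easy to encode as a guardrail before you send a document. A minimal sketch, assuming the ~65% usable fraction from the Redis estimate above; the right number varies per model and should come from your own measurements:

```python
def usable_budget(advertised_tokens, reliability=0.65):
    """Estimated reliably-usable portion of an advertised window."""
    return int(advertised_tokens * reliability)

def fits_reliably(doc_tokens, advertised_tokens, reliability=0.65):
    """True if the document stays inside the estimated reliable zone."""
    return doc_tokens <= usable_budget(advertised_tokens, reliability)
```

For a 200K window this yields a 130K budget, matching the threshold where Redis observed reliability dropping off.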
When to use full context and when not to
The answer isn’t “always use the full window” or “always use RAG.” It’s an engineering problem that depends on what you need to process.
Full context works best when:
- The document has cross-references between sections (contracts with clauses that refer to other clauses, code with dependencies across files)
- You need the model to understand relationships between distant parts of the text
- The document’s structure and order matter for interpretation
- You’re working with regulatory filings where an exception on page 80 modifies a rule on page 12
RAG works best when:
- You have large knowledge bases but queries target specific fragments
- Support documentation or FAQs where each answer is self-contained
- The body of information grows constantly and doesn’t fit in a single window
- You need answers from multiple sources without processing each source in full
The reality for most enterprises: a hybrid system. RAG to filter and select relevant documents, full context to process them deeply.
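The decision criteria above can be sketched as a toy router. The input signals (cross-references, discrete lookups) and the 65% usable-window fraction are assumptions you would calibrate against your own documents, not a product:

```python
def choose_strategy(doc_tokens: int, window_tokens: int,
                    cross_references: bool, discrete_lookup: bool) -> str:
    """Toy routing heuristic mirroring the criteria above."""
    budget = int(window_tokens * 0.65)  # reliably usable fraction (assumed)
    if cross_references and doc_tokens <= budget:
        return "full_context"   # model must see distant relationships
    if discrete_lookup and not cross_references:
        return "rag"            # answers live in self-contained fragments
    return "hybrid"             # RAG to filter, full context for deep analysis
```

A real pipeline would add more signals (document growth rate, query patterns, latency budget), but the shape of the decision stays the same.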
And here’s where cost comes in. A 1M token prompt at premium pricing costs around $10 per call. A RAG retrieval of 5K-10K relevant tokens: between $0.05 and $0.10. That’s a 100x difference. If your use case doesn’t require the model to see the full document, you’re paying 100 times more for a result that could be equal or better with well-implemented RAG.
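The arithmetic behind that comparison, using the $10-per-million-input-tokens figure above as the assumed premium rate:

```python
def prompt_cost(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of a single call at a flat per-million rate."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens

full_context = prompt_cost(1_000_000, 10.0)  # $10.00 per call
rag_retrieval = prompt_cost(10_000, 10.0)    # $0.10 per call
# full_context / rag_retrieval -> 100x
```

At the lower end of the retrieval range (5K tokens), the gap widens to 200x.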
Context compaction: infinite agent sessions
If you’re running AI agents on processes that last hours, there’s a development that changes the rules: context compaction.
When the context window fills up, instead of losing information or cutting the session, the model auto-summarizes the oldest parts of the conversation. It keeps the essentials, discards the redundant, and frees up space to keep working.
Claude Opus 4.6 implements this automatically. In one documented case, compaction reduced token usage by 58.6% without losing the conversation thread. In practice, this enables agent sessions that can run for hours without degrading.
For an enterprise running an agent through an approval pipeline, reviewing compliance documentation, or monitoring an infrastructure deployment, the difference is operational: the agent doesn’t “forget” what it did two hours ago.
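Mechanically, compaction is simple to sketch. Here `summarize` stands in for an LLM summarization call and `count_tokens` for your tokenizer; both are placeholders, and real implementations (including Claude's) decide what to keep with far more care than this:

```python
def compact(messages, summarize, max_tokens, count_tokens, keep_recent=5):
    """When the transcript nears the token budget, replace the oldest
    messages with a single summary and keep the most recent ones intact."""
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens or len(messages) <= keep_recent:
        return messages  # still within budget, nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [f"[summary] {summarize(old)}"] + recent
```

Run inside the agent loop, this keeps the working context bounded while preserving a condensed trail of everything that came before.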
What we do at IQ Source
In most projects we take on, the first question isn’t which model to use. It’s what kind of processing each document actually needs.
We’ve found that ~70% of B2B use cases work fine with RAG and intelligent chunking. You don’t need to stuff 200 pages into the window. But that other 30% — contracts with cross-referencing clauses, comparative analysis across vendors, code with distributed dependencies — needs a model that actually uses its full window.
In practice, here’s what we do with clients:
- Document evaluation: we classify client documents by required processing type (full context vs. RAG vs. hybrid)
- Retrieval architecture: we design pipelines that combine RAG for initial filtering with full context windows for deep analysis
- Model selection by use case: not every process needs the most expensive model. A customer support flow can run on Sonnet; a legal analysis needs Opus
- Accuracy monitoring: we implement validations that detect when a model loses information in long documents, before that turns into an incorrect business decision
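That last point, detecting silent information loss, can start as something very simple. A hedged sketch: list facts you already know the document contains and check that they survive into the model's output (`summarize_fn` is a placeholder for your model call, not a specific API):

```python
def retention_check(summarize_fn, document: str, known_facts: list[str]) -> list[str]:
    """Return the known facts that did NOT survive into the model's output.
    A non-empty result flags possible mid-context information loss."""
    output = summarize_fn(document).lower()
    return [fact for fact in known_facts if fact.lower() not in output]
```

Exact substring matching is crude; production checks would use semantic matching or a judge model, but even this catches the worst failures before they reach a decision.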
The race for bigger windows will keep going. Every quarter there’ll be a new model with a bigger number on the spec sheet. But the number that should be on the spec sheet — and never is — is retention accuracy at different depths.
If you’re evaluating models for processes that depend on long documents, compare with real accuracy data, not marketing figures. And if you need help building the right architecture — RAG, full context, or a hybrid of both — get in touch.
Frequently Asked Questions
Why do models lose accuracy in long documents?
This happens because the model's attention dilutes as context grows. Benchmarks show a model with a 1M-token window can retrieve data at the beginning and end of the text but loses accuracy in middle sections. More context doesn't always mean better answers — information placement matters as much as its presence.
Is a bigger context window always better?
It depends on the use case, but the data is clear: Gemini 3 Pro has a 2M token window and scores 26.3% on the MRCR v2 benchmark with 8 buried facts. Claude Opus 4.6, with 1M tokens, hits 76%. A larger window with poor retention is worse than a smaller one that actually works.
When should I use RAG instead of full context?
RAG works best for large knowledge bases, support documentation, and FAQs — cases where you need specific fragments, not full-document understanding. Full context is better for contracts with cross-referencing clauses, regulatory analysis, and codebases where the model needs to see dependencies across sections.
What is context compaction?
It's a technique where the model auto-summarizes older parts of a conversation to free up active window space. Claude Opus 4.6 does this automatically, achieving ~58% token reduction. For enterprises running agents on multi-hour processes, this enables continuous sessions without losing critical context.