Context Windows That Actually Work
Ricardo Argüello — March 7, 2026
CEO & Founder
General summary
AI vendors compete on context window size — 200K, 1M, 2M tokens — but research shows most models lose information well before hitting their limit. What matters isn't how many tokens fit, but how many the model actually retains and uses accurately.
- Chroma's research found all 18 tested models degrade as input grows — they retain information at the beginning and end but lose accuracy in the middle
- On the MRCR v2 benchmark with 8 buried facts, Claude Opus 4.6 scores 76% while Gemini 3 Pro scores 26.3% despite having double the window size
- Structured content like contracts and code is actually harder for models to process than randomized text
- RAG works best for large knowledge bases with discrete lookups; full context is better for contracts, regulatory docs, and codebases with cross-dependencies
- Context compaction can reduce token usage by ~58%, enabling multi-hour agent sessions without losing critical information
Imagine you're reading a 3,000-page book, but you can only really pay attention to the first 50 pages and the last 50 — everything in the middle gets blurry. That's what happens to most AI models with long documents. The advertised 'context window' is like the number of pages the book has, but what actually matters is how many pages the model can read without losing track of important details.
200K tokens. 1M tokens. 2M tokens. AI vendors compete on context window size like it’s the spec that defines the model. Every launch brags about a bigger number.
But when I evaluate models for client projects at IQ Source, window size tells me very little. What I need to know is: what percentage of that window does the model actually use with accuracy?
Size is marketing. Retention is engineering.
A study by Chroma published in July 2025 tested 18 language models. All 18 showed degradation as input text grew. The researchers found a consistent pattern: models retain information better at the beginning and end of the text, but fail with data buried in the middle. They call it context rot.
Here’s the strange part: when researchers shuffled content randomly (instead of presenting it in logical order), models processed it better. Structured content — contracts, code, filings, the stuff we actually use in business — is harder for LLMs to process than jumbled text with no structure.
If your company processes long contracts, regulatory documentation, or large codebases, this matters. A penalty clause buried on page 147 of a contract isn’t at the beginning or the end. It’s in the zone where models lose accuracy.
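You can measure this effect on your own documents with a minimal needle-in-the-middle probe. A sketch under stated assumptions: `complete` stands in for whichever model API you call, and the filler, needle, and question are your own test material.

```python
def build_probe(filler_paragraphs, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    i = int(depth * len(filler_paragraphs))
    parts = filler_paragraphs[:i] + [needle] + filler_paragraphs[i:]
    return "\n\n".join(parts)

def run_probe(complete, filler, needle, question, answer,
              depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """For each depth, ask the model about the buried fact and
    record whether the expected answer appears in its reply."""
    results = {}
    for d in depths:
        prompt = build_probe(filler, needle, d) + "\n\n" + question
        results[d] = answer.lower() in complete(prompt).lower()
    return results
```

If context rot applies to your model, the mid-depth entries (0.25–0.75) will fail first while 0.0 and 1.0 keep passing.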
76% vs 18.5%: the difference that changes decisions
Anthropic published results from the MRCR v2 benchmark (multi-round co-reference resolution), which measures something very specific: whether a model can find and use data buried in a 1-million-token context when there are 8 hidden facts.
The results:
| Model | Window | MRCR v2 accuracy (8 needles) |
|---|---|---|
| Claude Opus 4.6 | 1M tokens | 76% |
| Gemini 3 Pro | 2M tokens | 26.3% |
| Claude Sonnet 4.5 | 200K tokens | 18.5% |
Gemini 3 Pro has double the window of Opus 4.6 and less than a third of its accuracy. It’s like having a 400-square-meter warehouse where you lose half of what you store, versus a 200-square-meter one where you can find everything.
According to analysis by Redis, models that advertise 200K token windows become unreliable around 130K tokens — between 60% and 70% of the stated capacity. This isn’t a one-off defect; it’s a structural limitation of how current attention mechanisms work.
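That rule of thumb is easy to encode as a guardrail before you send a document. A minimal sketch, assuming the ~65% usable fraction from the Redis estimate above; the right number varies per model and should come from your own measurements:

```python
def usable_budget(advertised_tokens, reliability=0.65):
    """Estimated reliably-usable portion of an advertised window."""
    return int(advertised_tokens * reliability)

def fits_reliably(doc_tokens, advertised_tokens, reliability=0.65):
    """True if the document stays inside the estimated reliable zone."""
    return doc_tokens <= usable_budget(advertised_tokens, reliability)
```

For a 200K window this yields a 130K budget, matching the threshold where Redis observed reliability dropping off.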
When to use full context and when not to
The answer isn’t “always use the full window” or “always use RAG.” It’s an engineering problem that depends on what you need to process.
Full context works best when:
- The document has cross-references between sections (contracts with clauses that refer to other clauses, code with dependencies across files)
- You need the model to understand relationships between distant parts of the text
- The document’s structure and order matter for interpretation
- You’re working with regulatory filings where an exception on page 80 modifies a rule on page 12
RAG works best when:
- You have large knowledge bases but queries target specific fragments
- Support documentation or FAQs where each answer is self-contained
- The body of information grows constantly and doesn’t fit in a single window
- You need answers from multiple sources without processing each source in full
The reality for most enterprises: a hybrid system. RAG to filter and select relevant documents, full context to process them deeply.
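The decision criteria above can be sketched as a toy router. The input signals (cross-references, discrete lookups) and the 65% usable-window fraction are assumptions you would calibrate against your own documents, not a product:

```python
def choose_strategy(doc_tokens: int, window_tokens: int,
                    cross_references: bool, discrete_lookup: bool) -> str:
    """Toy routing heuristic mirroring the criteria above."""
    budget = int(window_tokens * 0.65)  # reliably usable fraction (assumed)
    if cross_references and doc_tokens <= budget:
        return "full_context"   # model must see distant relationships
    if discrete_lookup and not cross_references:
        return "rag"            # answers live in self-contained fragments
    return "hybrid"             # RAG to filter, full context for deep analysis
```

A real pipeline would add more signals (document growth rate, query patterns, latency budget), but the shape of the decision stays the same.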
And here’s where cost comes in. A 1M token prompt at premium pricing costs around $10 per call. A RAG retrieval of 5K-10K relevant tokens: between $0.05 and $0.10. That’s a 100x difference. If your use case doesn’t require the model to see the full document, you’re paying 100 times more for a result that could be equal or better with well-implemented RAG.
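The arithmetic behind that comparison, using the $10-per-million-input-tokens figure above as the assumed premium rate:

```python
def prompt_cost(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of a single call at a flat per-million rate."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens

full_context = prompt_cost(1_000_000, 10.0)  # $10.00 per call
rag_retrieval = prompt_cost(10_000, 10.0)    # $0.10 per call
# full_context / rag_retrieval -> 100x
```

At the lower end of the retrieval range (5K tokens), the gap widens to 200x.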
Context compaction: infinite agent sessions
If you’re running AI agents on processes that last hours, there’s a development that changes the rules: context compaction.
When the context window fills up, instead of losing information or cutting the session, the model auto-summarizes the oldest parts of the conversation. It keeps the essentials, discards the redundant, and frees up space to keep working.
Claude Opus 4.6 implements this automatically. In one documented case, compaction reduced token usage by 58.6% without losing the conversation thread. In practice, this enables agent sessions that can run for hours without degrading.
For an enterprise running an agent through an approval pipeline, reviewing compliance documentation, or monitoring an infrastructure deployment, the difference is operational: the agent doesn’t “forget” what it did two hours ago.
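Mechanically, compaction is simple to sketch. Here `summarize` stands in for an LLM summarization call and `count_tokens` for your tokenizer; both are placeholders, and real implementations (including Claude's) decide what to keep with far more care than this:

```python
def compact(messages, summarize, max_tokens, count_tokens, keep_recent=5):
    """When the transcript nears the token budget, replace the oldest
    messages with a single summary and keep the most recent ones intact."""
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens or len(messages) <= keep_recent:
        return messages  # still within budget, nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [f"[summary] {summarize(old)}"] + recent
```

Run inside the agent loop, this keeps the working context bounded while preserving a condensed trail of everything that came before.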
What we do at IQ Source
In most projects we take on, the first question isn’t which model to use. It’s what kind of processing each document actually needs.
We’ve found that ~70% of B2B use cases work fine with RAG and intelligent chunking. You don’t need to stuff 200 pages into the window. But that other 30% — contracts with cross-referencing clauses, comparative analysis across vendors, code with distributed dependencies — needs a model that actually uses its full window.
In practice, here’s what we do with clients:
- Document evaluation: we classify client documents by required processing type (full context vs. RAG vs. hybrid)
- Retrieval architecture: we design pipelines that combine RAG for initial filtering with full context windows for deep analysis
- Model selection by use case: not every process needs the most expensive model. A customer support flow can run on Sonnet; a legal analysis needs Opus
- Accuracy monitoring: we implement validations that detect when a model loses information in long documents, before that turns into an incorrect business decision
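That last point, detecting silent information loss, can start as something very simple. A hedged sketch: list facts you already know the document contains and check that they survive into the model's output (`summarize_fn` is a placeholder for your model call, not a specific API):

```python
def retention_check(summarize_fn, document: str, known_facts: list[str]) -> list[str]:
    """Return the known facts that did NOT survive into the model's output.
    A non-empty result flags possible mid-context information loss."""
    output = summarize_fn(document).lower()
    return [fact for fact in known_facts if fact.lower() not in output]
```

Exact substring matching is crude; production checks would use semantic matching or a judge model, but even this catches the worst failures before they reach a decision.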
The race for bigger windows will keep going. Every quarter there’ll be a new model with a bigger number on the spec sheet. But the number that should be on the spec sheet — and never is — is retention accuracy at different depths.
If you’re evaluating models for processes that depend on long documents, compare with real accuracy data, not marketing figures. And if you need help building the right architecture — RAG, full context, or a hybrid of both — get in touch.
Frequently Asked Questions
Why do models lose accuracy in long documents?
This happens because the model's attention dilutes as context grows. Benchmarks show a model with a 1M-token window can retrieve data at the beginning and end of the text but loses accuracy in middle sections. More context doesn't always mean better answers — information placement matters as much as its presence.
Is a bigger context window always better?
It depends on the use case, but the data is clear: Gemini 3 Pro has a 2M token window and scores 26.3% on the MRCR v2 benchmark with 8 buried facts. Claude Opus 4.6, with 1M tokens, hits 76%. A larger window with poor retention is worse than a smaller one that actually works.
When should I use RAG instead of full context?
RAG works best for large knowledge bases, support documentation, and FAQs — cases where you need specific fragments, not full-document understanding. Full context is better for contracts with cross-referencing clauses, regulatory analysis, and codebases where the model needs to see dependencies across sections.
What is context compaction?
It's a technique where the model auto-summarizes older parts of a conversation to free up active window space. Claude Opus 4.6 does this automatically, achieving ~58% token reduction. For enterprises running agents on multi-hour processes, this enables continuous sessions without losing critical context.