
Context Windows That Actually Work

Context window size is a marketing number. What matters is how much information the model actually retains. Real data and a practical guide for B2B teams.


Ricardo Argüello


CEO & Founder

AI & Automation · 5 min read

200K tokens. 1M tokens. 2M tokens. AI vendors compete on context window size like it’s the spec that defines the model. Every launch brags about a bigger number.

But when I evaluate models for client projects at IQ Source, window size tells me very little. What I need to know is: what percentage of that window does the model actually use with accuracy?

Size is marketing. Retention is engineering.

A study by Chroma published in July 2025 tested 18 language models. All 18 showed degradation as input text grew. The researchers found a consistent pattern: models retain information better at the beginning and end of the text, but fail with data buried in the middle. They call it context rot.

Here’s the strange part: when researchers shuffled content randomly (instead of presenting it in logical order), models processed it better. Structured content — contracts, code, filings, the stuff we actually use in business — is harder for LLMs to process than jumbled text with no structure.

If your company processes long contracts, regulatory documentation, or large codebases, this matters. A penalty clause buried on page 147 of a contract isn’t at the beginning or the end. It’s in the zone where models lose accuracy.

76% vs 18.5%: the difference that changes decisions

Anthropic published results from the MRCR v2 benchmark (Multi-Round Co-reference Resolution), which measures something very specific: whether a model can find and use data buried in a 1-million-token context when 8 separate facts are hidden in it.

The results:

Model             | Window      | MRCR v2 accuracy (8 needles)
------------------|-------------|-----------------------------
Claude Opus 4.6   | 1M tokens   | 76%
Gemini 3 Pro      | 2M tokens   | 26.3%
Claude Sonnet 4.5 | 200K tokens | 18.5%

Gemini 3 Pro has double the window of Opus 4.6 and roughly a third of its accuracy. It’s like having a 400-square-meter warehouse where you lose half of what you store, versus a 200-square-meter one where you can find everything.

According to analysis by Redis, models that advertise 200K token windows become unreliable around 130K tokens — between 60% and 70% of the stated capacity. This isn’t a one-off defect; it’s a structural limitation of how current attention mechanisms work.
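The degradation Redis describes is easy to probe yourself with a needle-in-a-haystack sweep. Here’s a minimal sketch in Python; the function names are mine, the scoring is deliberately crude, and the actual model call is left to whatever client you use:

```python
def build_haystack(filler_sentences, needle, depth_pct):
    """Bury a known fact (the 'needle') at a relative depth in filler text.

    depth_pct=0.0 puts it first, 1.0 last, 0.5 in the mid-document
    zone where retention is weakest.
    """
    pos = int(len(filler_sentences) * depth_pct)
    return " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])

def retention_score(model_answer, expected):
    """Crude substring check; production evals use fuzzy or judge-based scoring."""
    return 1.0 if expected.lower() in model_answer.lower() else 0.0
```

Sweep `depth_pct` from 0.0 to 1.0 at several total context lengths, send each haystack plus a question about the needle to your model, and plot score against depth. The mid-document dip, and the point where a "200K" model stops being reliable, show up quickly.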

When to use full context and when not to

The answer isn’t “always use the full window” or “always use RAG.” It’s an engineering problem that depends on what you need to process.

Full context works best when:

  • The document has cross-references between sections (contracts with clauses that refer to other clauses, code with dependencies across files)
  • You need the model to understand relationships between distant parts of the text
  • The document’s structure and order matter for interpretation
  • You’re working with regulatory filings where an exception on page 80 modifies a rule on page 12

RAG works best when:

  • You have large knowledge bases but queries target specific fragments
  • Support documentation or FAQs where each answer is self-contained
  • The body of information grows constantly and doesn’t fit in a single window
  • You need answers from multiple sources without processing each source in full

The reality for most enterprises: a hybrid system. RAG to filter and select relevant documents, full context to process them deeply.
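That hybrid can be sketched in a few lines. Assumptions here: keyword overlap stands in for a real embedding retriever, `llm` is any callable that takes a prompt string, and the function names are illustrative, not from a specific library:

```python
def rag_filter(query, documents, top_k=3):
    """Stage 1: cheap lexical scoring to shortlist candidate documents.

    Real systems use embeddings; keyword overlap keeps the sketch
    self-contained.
    """
    q_terms = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: -len(q_terms & set(d.lower().split())))
    return scored[:top_k]

def hybrid_answer(query, documents, llm):
    """Stage 2: feed only the shortlisted documents, in full, to the model."""
    shortlist = rag_filter(query, documents)
    prompt = "\n\n---\n\n".join(shortlist) + f"\n\nQuestion: {query}"
    return llm(prompt)
```

The design point: the retriever decides *which* documents the model sees, but each selected document goes in whole, so cross-references within it survive.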

And here’s where cost comes in. A 1M token prompt at premium pricing costs around $10 per call. A RAG retrieval of 5K-10K relevant tokens: between $0.05 and $0.10. That’s a 100x difference. If your use case doesn’t require the model to see the full document, you’re paying 100 times more for a result that could be equal or better with well-implemented RAG.
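The arithmetic behind that 100x is worth making explicit. A quick sketch, assuming a flat $10 per million input tokens (actual pricing varies by model and tier):

```python
PRICE_PER_MTOK = 10.00  # assumed premium input price, $ per 1M tokens

def call_cost(tokens, price_per_mtok=PRICE_PER_MTOK):
    """Input cost of a single call, in dollars."""
    return tokens / 1_000_000 * price_per_mtok

full_context = call_cost(1_000_000)  # full 1M-token prompt: $10.00
rag_call = call_cost(10_000)         # 10K retrieved tokens:   $0.10
ratio = full_context / rag_call      # 100x per call
```

Multiply that per-call ratio by daily call volume and the architecture decision often makes itself.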

Context compaction: infinite agent sessions

If you’re running AI agents on processes that last hours, there’s a development that changes the rules: context compaction.

When the context window fills up, instead of losing information or cutting the session, the model auto-summarizes the oldest parts of the conversation. It keeps the essentials, discards the redundant, and frees up space to keep working.

Claude Opus 4.6 implements this automatically. In one documented case, compaction reduced token usage by 58.6% without losing the conversation thread. In practice, this enables agent sessions that can run for hours without degrading.
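The mechanics of compaction can be sketched simply. Assumptions: a rough 4-characters-per-token estimate, a summarizer stub passed in as a callable, and a naive "condense the oldest half" policy; real implementations decide far more carefully what to keep verbatim:

```python
def estimate_tokens(text):
    """Rough heuristic: ~4 characters per token."""
    return max(1, len(text) // 4)

def compact(history, budget, summarize):
    """If the conversation exceeds the token budget, replace the oldest
    half of the turns with a summary and keep recent turns verbatim."""
    total = sum(estimate_tokens(t) for t in history)
    if total <= budget:
        return history
    half = len(history) // 2
    summary = summarize(history[:half])
    return ["[summary] " + summary] + history[half:]
```

The key property: the agent’s recent working context stays intact, while older material is traded for a compressed record instead of being dropped outright.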

For an enterprise running an agent through an approval pipeline, reviewing compliance documentation, or monitoring an infrastructure deployment, the difference is operational: the agent doesn’t “forget” what it did two hours ago.

What we do at IQ Source

In most projects we take on, the first question isn’t which model to use. It’s what kind of processing each document actually needs.

We’ve found that ~70% of B2B use cases work fine with RAG and intelligent chunking. You don’t need to stuff 200 pages into the window. But that other 30% — contracts with cross-referencing clauses, comparative analysis across vendors, code with distributed dependencies — needs a model that actually uses its full window.

In practice, here’s what we do with clients:

  • Document evaluation: we classify client documents by required processing type (full context vs. RAG vs. hybrid)
  • Retrieval architecture: we design pipelines that combine RAG for initial filtering with full context windows for deep analysis
  • Model selection by use case: not every process needs the most expensive model. A customer support flow can run on Sonnet; a legal analysis needs Opus
  • Accuracy monitoring: we implement validations that detect when a model loses information in long documents, before that turns into an incorrect business decision
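That last monitoring step can be as simple as checking that known facts from the source document survive into the model’s output. A minimal sketch (the function and field names are illustrative, not a product API):

```python
def validate_extraction(model_output, canary_facts):
    """Check that known facts from the source document appear in the output.

    Missing canaries flag context rot before it reaches a business decision.
    """
    missing = [f for f in canary_facts if f.lower() not in model_output.lower()]
    return {"passed": not missing, "missing": missing}
```

In practice the canaries are facts you extract deterministically (dates, amounts, clause numbers) from different depths of the document, so a failure also tells you *where* the model started losing information.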

The race for bigger windows will keep going. Every quarter there’ll be a new model with a bigger number on the spec sheet. But the number that should be on the spec sheet — and never is — is retention accuracy at different depths.

If you’re evaluating models for processes that depend on long documents, compare with real accuracy data, not marketing figures. And if you need help building the right architecture — RAG, full context, or a hybrid of both — get in touch.


