Context Rot Simulator
See how retrieval accuracy degrades as context fills, using published benchmark data.
Illustrative benchmark data
Curves are smoothed, class-level shapes derived from published needle-in-a-haystack style results (“Lost in the Middle”, Liu et al. 2023; RULER, Hsieh et al. 2024; community NIAH runs) — not vendor numbers for specific current models, and definitely not your workload. Use the shape, not the decimals.
≈ 87% estimated retrieval accuracy for the mid class (Sonnet / GPT-5 mini tier) at 60% fill with the needle in the middle (the lost-in-the-middle zone): 12 points below the same lookup in a near-empty context.
Accuracy vs context fill (needle position: middle, the lost-in-the-middle zone)
All classes at 60% fill
- Frontier class (Claude Opus / GPT-5 tier) ≈ 95%
- Mid class (Sonnet / GPT-5 mini tier, selected) ≈ 87%
- Small class (Haiku / nano tier) ≈ 77%
- Long-context class (Gemini Pro tier) ≈ 93%
Benchmark ≠ your workload: single-needle retrieval is the easy case. Multi-hop reasoning over long context degrades earlier and harder.
How it works
Context windows are marketed as a capacity — 200k, 400k, a million tokens — but capacity is not competence. Published research consistently shows retrieval accuracy declining as context fills, with a pronounced dead zone for information buried in the middle. This simulator makes the effect tangible: pick a model class, set how full the context is and where the fact you need sits, and read off an estimated retrieval accuracy with the full curve drawn behind it.
The data deserves a straight explanation. The curves embedded here are smoothed, class-level composites shaped from published needle-in-a-haystack style results — “Lost in the Middle” (Liu et al., 2023), RULER (Hsieh et al., 2024) and community NIAH evaluations — interpolated linearly between anchor points. They are deliberately not vendor numbers for specific current models: published runs vary in methodology, models update faster than papers, and a single decimal would imply precision the field does not have. The shapes are robust; the digits are illustrative.
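As a rough illustration of "interpolated linearly between anchor points", here is a minimal Python sketch. The anchor table is an assumption made for the example: only the 60% value (≈ 87%) and the 12-point gap to a near-empty context echo figures quoted on this page, while the 30% and 100% anchors are placeholders. The name `estimate_accuracy` is hypothetical and not taken from the tool's source.

```python
from bisect import bisect_right

# Hypothetical anchors for one class and needle position: (fill %, accuracy %).
# The 0% and 60% values echo the mid-class readout on this page;
# the 30% and 100% values are placeholders invented for illustration.
ANCHORS = [(0, 99.0), (30, 95.0), (60, 87.0), (100, 75.0)]


def estimate_accuracy(fill_pct, anchors=ANCHORS):
    """Linearly interpolate accuracy at fill_pct, clamping outside the anchors."""
    xs = [x for x, _ in anchors]
    if fill_pct <= xs[0]:
        return anchors[0][1]
    if fill_pct >= xs[-1]:
        return anchors[-1][1]
    # Index of the first anchor strictly to the right of fill_pct.
    i = bisect_right(xs, fill_pct)
    (x0, y0), (x1, y1) = anchors[i - 1], anchors[i]
    t = (fill_pct - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)
```

With these anchors, `estimate_accuracy(60)` returns 87.0 and `estimate_accuracy(45)` returns 91.0, the midpoint between the 30% and 60% anchors. The straight segments between anchors are why the shape is worth reading and the decimals are not.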
The needle-position selector encodes the most actionable finding. The same fact at the same fill level is dramatically easier to retrieve from the start or end of context than from the middle — which is why well-built agent harnesses pin system instructions first and recent turns last, leaving the middle for material you can afford to lose. If your agent stores critical constraints in turn 12 of a 60-turn session, this curve is the failure mode you are signing up for.
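The pinning pattern described above could be sketched roughly as follows. This is an assumption-laden illustration, not any particular harness's implementation: the `Segment` type, the role labels, and `order_for_retrieval` are all hypothetical names. It hoists system and critical text to the front, keeps the newest history turns at the end, and leaves everything else in the middle.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    role: str               # hypothetical labels: "system", "history", "reference"
    critical: bool = False  # True for constraints the agent must not lose


def order_for_retrieval(segments, recent_n=2):
    """Pin system/critical text first and the last recent_n history turns last."""
    history = [s for s in segments if s.role == "history"]
    recent = history[-recent_n:] if recent_n > 0 else []
    recent_ids = {id(s) for s in recent}
    # Critical text is hoisted to the front unless it is already in the recent tail.
    head = [s for s in segments
            if (s.role == "system" or s.critical) and id(s) not in recent_ids]
    head_ids = {id(s) for s in head}
    # Everything else lands in the middle, in its original order.
    middle = [s for s in segments
              if id(s) not in head_ids and id(s) not in recent_ids]
    return head + middle + recent
```

Under this sketch, a constraint stated in turn 12 of a 60-turn session would be marked `critical=True` and moved next to the system instructions instead of staying in the middle.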
The engineering response is to treat the window as a budget, not a warehouse: compact history, retrieve references on demand, reset at task boundaries. The Context Window Visualizer shows what fills your window today; this tool shows what that fill costs in reliability. And because the right model and context policy differ per task, FORG's rule engine can enforce them live — routing long-context work to models that hold up, and capping context growth before accuracy quietly falls off this curve.
Frequently asked questions
What is context rot?
The reliable degradation of a model's ability to find and use information as its context window fills. A fact a model retrieves perfectly from a 5k-token context gets missed, garbled or half-remembered when the same fact sits inside 150k tokens of other material. It is not a bug in any one model — every published long-context evaluation shows the effect to some degree, varying mainly in how late and how steep the decline is.
What does 'lost in the middle' mean?
Liu et al. (2023) showed that retrieval accuracy depends on where information sits: models recall facts placed at the very beginning or very end of context far better than facts buried in the middle. The U-shaped curve has been replicated broadly. Practically, this is why agent harnesses pin important instructions at the start and recent conversation at the end — and why the needle-position selector in this tool changes the answer so much.
How do I mitigate context rot?
Keep contexts lean rather than relying on the advertised window. The standard toolkit: compaction (summarize old history instead of carrying it verbatim), retrieval (store reference material outside the context and load only relevant chunks per query), positional discipline (critical instructions first, fresh data last), and session resets at task boundaries. The Context Window Visualizer helps budget what stays in.
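One way the compaction step might look, as a minimal Python sketch under stated assumptions: `count_tokens` is a crude whitespace word count standing in for a real tokenizer, `summarize` is a caller-supplied function (for example a model call), and the summary's own length is not charged against the budget. All names are hypothetical.

```python
def count_tokens(text):
    # Crude stand-in: whitespace word count. Real tokenizers count differently.
    return len(text.split())


def compact_history(turns, budget, summarize):
    """Keep the newest turns verbatim within budget; fold older turns into one summary.

    turns:     list of strings, oldest first
    budget:    token allowance for the verbatim tail
    summarize: caller-supplied function, list[str] -> str
    """
    kept, used = [], 0
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    older = turns[:len(turns) - len(kept)]
    return ([summarize(older)] if older else []) + kept
```

Calling `compact_history(turns, budget, summarize)` at task boundaries, or whenever fill crosses a threshold you choose, keeps the window closer to a budget than a warehouse.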
Do all models rot at the same rate?
No, and the differences are big enough to matter for routing decisions. Frontier models hold accuracy deep into their windows; smaller and cheaper tiers degrade earlier and more steeply. Models engineered specifically for long context can outperform their general capability tier on retrieval tasks. The class overlay in this tool shows the spread at any fill level, but always check current published benchmarks for the specific models you run.
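To make the routing idea concrete, here is a small Python sketch. It is not FORG's rule engine or its API. The table copies the illustrative 60%-fill, middle-needle estimates shown in this tool, and the preference order (cheapest first) is an assumption the caller would supply from their own pricing.

```python
# Illustrative estimates at 60% fill with a middle needle, as listed in this tool.
ACCURACY_AT_60 = {"frontier": 95, "mid": 87, "small": 77, "long-context": 93}


def pick_class(min_accuracy, preference, table=ACCURACY_AT_60):
    """Return the first class in `preference` that meets min_accuracy, else None.

    preference: class names in the caller's preferred order, e.g. cheapest first.
    """
    for name in preference:
        if table.get(name, 0) >= min_accuracy:
            return name
    return None


# Hypothetical cost ordering, cheapest first; adjust to your own pricing.
PREFERENCE = ["small", "mid", "long-context", "frontier"]
```

With this ordering, `pick_class(85, PREFERENCE)` returns `"mid"` and `pick_class(90, PREFERENCE)` returns `"long-context"`. A real policy would look up the estimate at the actual fill level and needle position instead of a single 60% snapshot.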
How literally should I take these numbers?
As shapes, not measurements. The curves are smoothed class-level composites derived from published needle-in-a-haystack style research (Lost in the Middle, RULER, community NIAH runs), and the tool labels them as illustrative. Real accuracy depends on your task: single-fact retrieval is the easy case, and multi-hop reasoning over long context degrades sooner and harder than any of these curves show.