Context Window Comparison
Every major model's context window, max output, price and speed in one sortable table.
| Model | Context window | Max output | Input ($/1M tokens) | Output ($/1M tokens) | Speed (tok/sec) |
|---|---|---|---|---|---|
| Gemini 3.5 Flash (google · frontier) | 1M | 65.5k | $1.5 | $9 | 160 |
| Gemini 3.1 Pro (google · frontier) | 1M | 65.5k | $2 | $12 | 80 |
| Gemini 3 Flash (google · mid) | 1M | 65.5k | $0.5 | $3 | 180 |
| Gemini 3.1 Flash-Lite (google · small) | 1M | 65.5k | $0.25 | $1.5 | 250 |
| Gemini 2.5 Pro (google · frontier) | 1M | 65.5k | $1.25 | $10 | 85 |
| Claude Fable 5 (anthropic · frontier) | 1M | 64k | $10 | $50 | 60 |
| Claude Opus 4.8 (anthropic · frontier) | 1M | 64k | $5 | $25 | 55 |
| Claude Sonnet 4.6 (anthropic · frontier) | 1M | 64k | $3 | $15 | 78 |
| DeepSeek V4 Flash (deepseek · mid) | 1M | 384k | $0.14 | $0.28 | 70 |
| DeepSeek V4 Pro (deepseek · frontier) | 1M | 384k | $0.435 | $0.87 | 55 |
| GPT-5.5 (openai · frontier) | 400k | 128k | $5 | $30 | 85 |
| GPT-5.5 Pro (openai · frontier) | 400k | 128k | $30 | $180 | 45 |
| GPT-5.4 (openai · frontier) | 400k | 128k | $2.5 | $15 | 90 |
| GPT-5.4 mini (openai · mid) | 400k | 128k | $0.75 | $4.5 | 150 |
| GPT-5.4 nano (openai · small) | 400k | 128k | $0.2 | $1.25 | 210 |
| GPT-5.3 Codex (openai · mid) | 400k | 128k | $1.75 | $14 | 95 |
| Claude Sonnet 4.5 (anthropic · frontier) | 200k | 64k | $3 | $15 | 78 |
| Claude Haiku 4.5 (anthropic · mid) | 200k | 64k | $1 | $5 | 130 |
Specs checked 2026-06-11 · sources: vendor docs and pricing pages. Tok/sec are median public benchmark figures — expect variance.
How it works
Context windows are the most-quoted and least-understood model spec. This table puts every major model side by side — total window, maximum output per response, current pricing and streaming speed — sortable by any column, with the data verification date shown (2026-06-11).
Two distinctions trip people up. First, window versus max output: a 200k window does not mean 200k of generation. Output caps are typically 8k-128k (the models in the table above range from 64k to 384k), and input plus output must fit the window together. Second, advertised versus usable context: published needle-in-a-haystack benchmarks show retrieval accuracy degrading well before windows fill, especially for facts buried mid-context. A million-token window is a real capability, but treating it as a database is how agents start hallucinating file contents they read an hour ago.
The cost dimension is on the same table because window and price interact: carrying a big context is not free, because it is re-billed on every call. At Claude Sonnet rates ($3 per 1M input tokens), holding 150k tokens of session history costs about $0.45 of input per turn before the model writes a word. Caching reduces that cost, and compaction shrinks the history being billed. The bars next to each window size keep the scale honest: a 1M window is five times the 200k window of Claude Sonnet 4.5 and Haiku 4.5, and the bar shows it.
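The per-turn figure above is simple arithmetic, sketched here in Python. The function name is illustrative, and the sketch assumes prices are quoted per one million input tokens with no caching discount applied.

```python
def input_cost_per_turn(context_tokens: int, input_price_per_million: float) -> float:
    """Dollar cost of sending `context_tokens` as input on a single call.

    Assumes flat per-token pricing quoted per 1M tokens, with no cache discount.
    """
    return context_tokens / 1_000_000 * input_price_per_million


# 150k tokens of history at $3 per 1M input tokens (the Claude Sonnet rate in the table).
cost = input_cost_per_turn(150_000, 3.0)
print(f"${cost:.2f} of input per turn")
```

Swapping in another row's input price shows how the same 150k of history is billed on a different model.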
Speed figures are median public benchmark numbers for streaming throughput and vary with load, region and prompt shape — treat them as ratios between models rather than guarantees. For how your own sessions actually fill these windows, FORG tracks context size per turn across every real session.
When two models look equivalent on this table, the tiebreakers usually live elsewhere: tokenizer efficiency (the same document can differ ten percent in token count between vocabularies), cache pricing, and how gracefully each model degrades as its window fills. The comparison here is the factual baseline; pair it with the Context Rot Simulator for the degradation story and the Model Capability Picker when the decision is about task fit rather than specs.
Frequently asked questions
Which model has the biggest context window in 2026?
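As of the table's check date (2026-06-11), several models tie at 1M tokens: the Gemini models, Claude Fable 5, Claude Opus 4.8, Claude Sonnet 4.6 and DeepSeek V4. The GPT-5 family sits at 400k, and Claude Sonnet 4.5 and Haiku 4.5 at 200k. Sort the table by context window for the current ranking, and remember that usable context is smaller than advertised context (see context rot).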
What is the difference between context window and max output?
The context window is the total the model can hold — your input plus its output combined. Max output caps how much it can generate in one response, and is typically much smaller (8k-128k). A 200k-window model with 64k max output cannot write you a 100k-token document in one call.
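The two limits can be checked together before a request is sent. This is a minimal sketch with hypothetical helper names (`fits`, `output_budget`); it assumes token counts are already known and that the only constraints are the two described above.

```python
def fits(window: int, max_output: int, input_tokens: int, requested_output: int) -> bool:
    """True if the request respects both the output cap and the total window."""
    within_output_cap = requested_output <= max_output
    within_window = input_tokens + requested_output <= window
    return within_output_cap and within_window


def output_budget(window: int, max_output: int, input_tokens: int) -> int:
    """Largest output that can still be generated for a given input size."""
    return max(0, min(max_output, window - input_tokens))


# A 200k-window model with a 64k output cap, asked for a 100k-token document.
print(fits(200_000, 64_000, input_tokens=10_000, requested_output=100_000))

# With 150k tokens already in context, the window, not the cap, is the limit.
print(output_budget(200_000, 64_000, input_tokens=150_000))
```

The first call fails on the output cap alone; the second shows the window becoming the binding limit once the input is large.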
Does a bigger context window actually help?
Up to a point. Retrieval accuracy degrades as windows fill — models reliably use the start and end of context but lose facts from the middle ('lost in the middle'). Past roughly 50-70% fill, adding more context often hurts. Bigger windows help most for code, where structure aids retrieval.
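One way to act on that rough 50-70% range is to track how full the window is on each turn. The sketch below is illustrative only: the thresholds come from the range quoted above, while the function name and the `ok` / `watch` / `compact` labels are assumptions, not part of any tool mentioned here.

```python
def fill_status(context_tokens: int, window: int,
                soft_limit: float = 0.5, hard_limit: float = 0.7) -> str:
    """Classify window fill against the rough 50-70% range.

    Returns "ok" below the soft limit, "watch" between the limits,
    and "compact" at or above the hard limit.
    """
    fill = context_tokens / window
    if fill >= hard_limit:
        return "compact"
    if fill >= soft_limit:
        return "watch"
    return "ok"


# 120k tokens in a 200k window is 60% full, inside the range where quality may start to slip.
print(fill_status(120_000, 200_000))
```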
Why does context size matter for cost?
You pay input rates on every token in context, on every call. An agent that carries 150k tokens of history pays for them again on each turn, which is why the total cost of a long session grows roughly quadratically with the number of turns rather than linearly. Pair this table with our Session Cost Estimator to see the effect.
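The growth pattern is easy to see in a small simulation. This sketch assumes a simplified session in which every turn adds the same number of tokens to the history and the full history is re-sent as input each turn, with no caching or compaction. The figure of 5,000 tokens per turn is a made-up illustration.

```python
def cumulative_input_tokens(turns: int, tokens_added_per_turn: int) -> int:
    """Total input tokens billed over a session where history is re-sent every turn."""
    total_billed = 0
    context = 0
    for _ in range(turns):
        context += tokens_added_per_turn  # history grows by a fixed amount
        total_billed += context           # the whole history is billed again
    return total_billed


ten_turns = cumulative_input_tokens(10, 5_000)
twenty_turns = cumulative_input_tokens(20, 5_000)
print(ten_turns, twenty_turns, twenty_turns / ten_turns)
```

Under these assumptions the total is `tokens_added_per_turn * turns * (turns + 1) / 2`, so doubling the session length roughly quadruples the billed input.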