
Context Window Comparison

Every major model's context window, max output, price and speed in one sortable table.

100% client-side · exact o200k tokenizer · zero uploads
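Context window and spec comparison for major AI models

| Model | Vendor · tier | Context window | Max output | Input ($/1M tokens) | Output ($/1M tokens) | Speed (tok/sec) |
|---|---|---|---|---|---|---|
| Gemini 3.5 Flash (largest) | google · frontier | 1M | 65.5k | $1.5 | $9 | 160 |
| Gemini 3.1 Pro | google · frontier | 1M | 65.5k | $2 | $12 | 80 |
| Gemini 3 Flash | google · mid | 1M | 65.5k | $0.5 | $3 | 180 |
| Gemini 3.1 Flash-Lite | google · small | 1M | 65.5k | $0.25 | $1.5 | 250 |
| Gemini 2.5 Pro | google · frontier | 1M | 65.5k | $1.25 | $10 | 85 |
| Claude Fable 5 | anthropic · frontier | 1M | 64k | $10 | $50 | 60 |
| Claude Opus 4.8 | anthropic · frontier | 1M | 64k | $5 | $25 | 55 |
| Claude Sonnet 4.6 | anthropic · frontier | 1M | 64k | $3 | $15 | 78 |
| DeepSeek V4 Flash | deepseek · mid | 1M | 384k | $0.14 | $0.28 | 70 |
| DeepSeek V4 Pro | deepseek · frontier | 1M | 384k | $0.435 | $0.87 | 55 |
| GPT-5.5 | openai · frontier | 400k | 128k | $5 | $30 | 85 |
| GPT-5.5 Pro | openai · frontier | 400k | 128k | $30 | $180 | 45 |
| GPT-5.4 | openai · frontier | 400k | 128k | $2.5 | $15 | 90 |
| GPT-5.4 mini | openai · mid | 400k | 128k | $0.75 | $4.5 | 150 |
| GPT-5.4 nano | openai · small | 400k | 128k | $0.2 | $1.25 | 210 |
| GPT-5.3 Codex | openai · mid | 400k | 128k | $1.75 | $14 | 95 |
| Claude Sonnet 4.5 | anthropic · frontier | 200k | 64k | $3 | $15 | 78 |
| Claude Haiku 4.5 | anthropic · mid | 200k | 64k | $1 | $5 | 130 |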

Specs checked 2026-06-11 · sources: vendor docs and pricing pages. Tok/sec are median public benchmark figures — expect variance.

o200k: exact GPT tokenizer, in-browser
≈3.6: chars/token Claude estimate, documented
18: models in the cost dataset
0: network requests per keystroke
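The ≈3.6 chars/token figure can be turned into a rough estimator. The sketch below is an illustration only: the function name `estimate_tokens` and the choice to round up are assumptions, and the page's own estimator is not shown in the source. It is a character-count heuristic, not a tokenizer, so real counts will differ by language and content.

```python
import math

# Documented Claude estimate quoted on this page (characters per token).
CHARS_PER_TOKEN = 3.6


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate from character count (hypothetical helper).

    Rounds up so that any non-empty text counts as at least one token.
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


if __name__ == "__main__":
    sample = "Context windows are the most-quoted and least-understood model spec."
    print(len(sample), "chars ->", estimate_tokens(sample), "estimated tokens")
```

Rounding up is a conservative choice for budgeting: an estimate that runs slightly high is safer than one that lets a prompt overflow the window.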

How it works

Context windows are the most-quoted and least-understood model spec. This table puts every major model side by side — total window, maximum output per response, current pricing and streaming speed — sortable by any column, with the data verification date shown (2026-06-11).

Two distinctions trip people up. First, window versus max output: a 200k window does not mean 200k of generation — output caps are typically 8-128k, and input plus output must fit the window together. Second, advertised versus usable context: published needle-in-a-haystack benchmarks show retrieval accuracy degrading well before windows fill, especially for facts buried mid-context. A million-token window is a real capability, but treating it as a database is how agents start hallucinating file contents they read an hour ago.
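The window-versus-output distinction reduces to one line of arithmetic: generation is limited both by the output cap and by whatever room the input leaves in the window. A minimal sketch, assuming both limits are counted in the same tokens; `max_generation` is a hypothetical helper, not part of any vendor SDK.

```python
def max_generation(window: int, max_output: int, input_tokens: int) -> int:
    """Tokens a single response can contain, given the input already in context.

    The result is the smaller of the output cap and the space left in the
    window, and never negative.
    """
    if window <= 0 or max_output <= 0 or input_tokens < 0:
        raise ValueError("window and max_output must be positive, input_tokens non-negative")
    remaining = window - input_tokens
    return max(0, min(max_output, remaining))


if __name__ == "__main__":
    # A 200k window with a 64k output cap, as in the Claude Sonnet 4.5 row.
    print(max_generation(200_000, 64_000, 50_000))   # 64000: the output cap binds
    print(max_generation(200_000, 64_000, 180_000))  # 20000: the window binds
    print(max_generation(200_000, 64_000, 210_000))  # 0: the input alone overflows
```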

The cost dimension is on the same table because window and price interact: carrying a big context is not free, because it is re-billed on every call. At Claude Sonnet's input rate of $3 per 1M tokens, holding 150k tokens of session history costs about $0.45 of input per turn before the model writes a word; caching reduces that bill, and compaction shrinks the history being billed. The bars next to each window size keep the scale honest: a 1M window is five times the 200k window of Claude Sonnet 4.5 or Claude Haiku 4.5, and the bar shows it.
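The $0.45 figure is the context size multiplied by the per-million input rate. A minimal sketch of that arithmetic follows; `input_cost_per_turn` is a hypothetical name, and the sketch deliberately ignores caching discounts because the source does not state cache prices.

```python
def input_cost_per_turn(context_tokens: int, usd_per_million_input: float) -> float:
    """Input cost in USD of sending `context_tokens` once, with no caching."""
    if context_tokens < 0 or usd_per_million_input < 0:
        raise ValueError("arguments must be non-negative")
    return context_tokens * usd_per_million_input / 1_000_000


if __name__ == "__main__":
    # 150k tokens of history at the $3 per 1M input rate from the table.
    cost = input_cost_per_turn(150_000, 3.0)
    print(f"${cost:.2f} per turn")  # $0.45 per turn
```

The same call with a different row's input price shows how quickly the per-turn bill moves between models for an identical context.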

Speed figures are median public benchmark numbers for streaming throughput and vary with load, region and prompt shape — treat them as ratios between models rather than guarantees. For how your own sessions actually fill these windows, FORG tracks context size per turn across every real session.

When two models look equivalent on this table, the tiebreakers usually live elsewhere: tokenizer efficiency (the same document can differ ten percent in token count between vocabularies), cache pricing, and how gracefully each model degrades as its window fills. The comparison here is the factual baseline; pair it with the Context Rot Simulator for the degradation story and the Model Capability Picker when the decision is about task fit rather than specs.

Frequently asked questions

Which model has the biggest context window in 2026?

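As of the 2026-06-11 check, the table tops out at 1M tokens, a tier shared by the Gemini models, Claude Fable 5, Claude Opus 4.8, Claude Sonnet 4.6 and both DeepSeek V4 models. The GPT-5.x family sits at 400k, and Claude Sonnet 4.5 and Claude Haiku 4.5 at 200k. Sort the table by context window for the current ranking, and remember that usable context is smaller than advertised context (see context rot).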

What is the difference between context window and max output?

The context window is the total the model can hold — your input plus its output combined. Max output caps how much it can generate in one response, and is typically much smaller (8k-128k). A 200k-window model with 64k max output cannot write you a 100k-token document in one call.

Does a bigger context window actually help?

Up to a point. Retrieval accuracy degrades as windows fill — models reliably use the start and end of context but lose facts from the middle ('lost in the middle'). Past roughly 50-70% fill, adding more context often hurts. Bigger windows help most for code, where structure aids retrieval.

Why does context size matter for cost?

You pay input rates on every token in context, on every call. An agent that carries 150k tokens of history pays for them again on each turn. When that history also grows with every turn, the cumulative input cost of a session grows roughly quadratically with the number of turns, not linearly. Pair this table with our Session Cost Estimator to see the effect.
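The quadratic growth is easy to see in a toy model. The sketch below assumes that every turn adds a fixed number of tokens to the history, that the whole history is re-sent as input on each turn, and that there is no caching or compaction. These are simplifying assumptions rather than a description of any particular agent, and `session_input_cost` is a hypothetical helper.

```python
def session_input_cost(turns: int, tokens_added_per_turn: int,
                       usd_per_million_input: float) -> float:
    """Cumulative input cost in USD when the full history is re-billed each turn."""
    if turns < 0 or tokens_added_per_turn < 0 or usd_per_million_input < 0:
        raise ValueError("arguments must be non-negative")
    history = 0        # tokens currently held in context
    billed_tokens = 0  # total input tokens billed across the session
    for _ in range(turns):
        history += tokens_added_per_turn
        billed_tokens += history  # the whole history is sent again
    return billed_tokens * usd_per_million_input / 1_000_000


if __name__ == "__main__":
    # 5k tokens added per turn at $3 per 1M input tokens.
    for n in (10, 20, 40):
        print(n, "turns:", f"${session_input_cost(n, 5_000, 3.0):.3f}")
```

Under these assumptions the billed total is tokens_added_per_turn × n(n+1)/2, so doubling the number of turns roughly quadruples the input bill: 10 turns bill 275k tokens and 20 turns bill 1.05M.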
