Streaming Latency Estimator

Tokens per second by model: estimate how long any output length takes to stream.

Throughput figures: benchmark medians, checked 2026-06-11. Real speeds vary with load, region and time of day.

11s to stream 800 tokens on Claude Sonnet 4.5 at 78 tok/s (incl. 0.8s TTFT). Gemini 3.1 Flash-Lite would finish in 4.0s.

All models at 800 output tokens

  • Gemini 3.1 Flash-Lite (fastest): 4.0s · 250 tok/s
  • GPT-5.4 nano: 4.6s · 210 tok/s
  • Gemini 3 Flash: 5.2s · 180 tok/s
  • Gemini 3.5 Flash: 5.8s · 160 tok/s
  • GPT-5.4 mini: 6.1s · 150 tok/s
  • Claude Haiku 4.5: 7.0s · 130 tok/s
  • GPT-5.3 Codex: 9.2s · 95 tok/s
  • GPT-5.4: 9.7s · 90 tok/s
  • GPT-5.5: 10s · 85 tok/s
  • Gemini 2.5 Pro: 10s · 85 tok/s
  • Gemini 3.1 Pro: 11s · 80 tok/s
  • Claude Sonnet 4.6: 11s · 78 tok/s
  • Claude Sonnet 4.5 (selected): 11s · 78 tok/s
  • DeepSeek V4 Flash: 12s · 70 tok/s
  • Claude Fable 5: 14s · 60 tok/s
  • Claude Opus 4.8: 15s · 55 tok/s
  • DeepSeek V4 Pro: 15s · 55 tok/s
  • GPT-5.5 Pro: 19s · 45 tok/s

18 models in the dataset · reference data verified 2026-06-11 · 100% of logic runs in your browser · 0 network requests per keystroke

How it works

This calculator answers the question that comes up every time an agent streams a long response: how many seconds until it finishes? Pick a model, set the expected output length, and the tool computes seconds-to-complete from each model's measured tokens-per-second, with an optional time-to-first-token allowance for the pause before streaming begins.

The math is transparent: completion time = TTFT (if included) + (output tokens ÷ tokens per second). The throughput figures come from our model dataset (last verified 2026-06-11), which aggregates published benchmark medians per model. The comparison bars below the result rank every model in the dataset at your output length, because the absolute number matters less than the relative gap. Haiku-class and Flash-class models routinely stream two to four times faster than frontier models, and that gap dominates the user experience when a human is watching the output arrive.
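
The formula above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's actual source: the function names are made up here, the throughput values are a subset copied from the comparison list, and applying the same 0.8s TTFT to every model is an assumption taken from the example result.

```python
from typing import Dict, List, Tuple

# Median throughput in tok/s, copied from the comparison list (subset only).
THROUGHPUT_TOK_PER_S: Dict[str, float] = {
    "Gemini 3.1 Flash-Lite": 250,
    "Claude Haiku 4.5": 130,
    "Claude Sonnet 4.5": 78,
    "GPT-5.5 Pro": 45,
}


def completion_seconds(output_tokens: int, tok_per_s: float, ttft_s: float = 0.0) -> float:
    """completion time = TTFT (if included) + output tokens / tokens per second."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return ttft_s + output_tokens / tok_per_s


def rank_models(output_tokens: int, ttft_s: float = 0.0) -> List[Tuple[str, float]]:
    """Return (model, seconds) pairs sorted fastest first."""
    times = [
        (name, completion_seconds(output_tokens, tps, ttft_s))
        for name, tps in THROUGHPUT_TOK_PER_S.items()
    ]
    return sorted(times, key=lambda pair: pair[1])


# Assumed inputs: 800 output tokens and a flat 0.8s TTFT for every model.
for name, seconds in rank_models(800, ttft_s=0.8):
    print(f"{name}: {seconds:.1f}s")
```

For Claude Sonnet 4.5 this gives 0.8 + 800 ÷ 78, or roughly 11.1 seconds, which the page rounds to 11s.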

Be honest with yourself about variance. These are median figures: provider load, region, time of day and prompt shape all move real throughput, and p95 latency during peak hours can run 2-3× the median. The rankings between models are far more stable than the absolute numbers, which is why this tool emphasizes the comparison view. If a latency budget genuinely matters to your product, measure your own traffic: median benchmarks tell you which model to try first, not what your users will experience at the 95th percentile.
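
If you do measure your own traffic, summarizing it takes very little code. The sketch below computes the median and a nearest-rank p95 from a list of observed completion times. The sample values are invented for illustration and the helper names are hypothetical; how you collect the timings is up to your own instrumentation.

```python
import math
import statistics
from typing import Dict, Sequence


def nearest_rank_percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_s: Sequence[float]) -> Dict[str, float]:
    """Median, p95, and the ratio between them for a set of completion times."""
    median = statistics.median(latencies_s)
    p95 = nearest_rank_percentile(latencies_s, 95)
    return {"median_s": median, "p95_s": p95, "p95_over_median": p95 / median}


# Hypothetical completion times in seconds, invented for illustration only.
samples = [10.2, 10.8, 11.0, 11.1, 11.3, 11.9, 12.4, 13.0, 19.5, 27.8]
print(summarize(samples))
```

With only a handful of samples the p95 is just the slowest request, so a real measurement needs far more traffic than this toy list before the tail figure means anything.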

The speed-versus-cost tradeoff is more interesting than it first appears. Faster models are usually also the cheaper ones, so for interactive workloads — code review comments a developer is waiting on, chat replies, agent steps a human supervises — the small model is often the right call twice over. The case for slow frontier models is reasoning depth on hard tasks, where nobody minds the wait because the alternative is a fast wrong answer. Decide per workload, not per project: routing easy calls to a fast model and hard ones to a strong model is exactly the kind of rule worth automating.
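
A per-workload routing rule can be as small as a lookup table. The sketch below is one possible shape, not a recommendation: the workload labels, the two tier names, and the choice to fall back to the strong model for unknown workloads are all assumptions made for illustration.

```python
from typing import Dict

# Placeholder tier names, not real model identifiers.
FAST = "fast-small-model"
STRONG = "strong-frontier-model"

# Hypothetical workload labels, based on the examples in the paragraph above.
ROUTES: Dict[str, str] = {
    "code_review_comment": FAST,
    "chat_reply": FAST,
    "supervised_agent_step": FAST,
    "hard_reasoning": STRONG,
}


def pick_model(workload: str, default: str = STRONG) -> str:
    """Route by workload; unknown workloads fall back to the default tier."""
    return ROUTES.get(workload, default)


print(pick_model("chat_reply"))      # routed to the fast tier
print(pick_model("hard_reasoning"))  # routed to the strong tier
```

Defaulting to the strong tier trades speed for a lower chance of a fast wrong answer on work nobody has classified yet. Flipping the default is equally defensible if most of your traffic is interactive.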

Frequently asked questions

What is the difference between TTFT and throughput?

Time-to-first-token (TTFT) is the wait before anything appears — the model processing your prompt — and typically runs 0.3 to 2 seconds depending on input size. Throughput (tokens per second) is how fast text flows once it starts. Short responses are dominated by TTFT; long responses by throughput. A 50-token answer on a fast model is mostly TTFT; an 800-token answer is mostly streaming time.
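
To see where the crossover sits, compute what fraction of the total wait is TTFT. This is a small sketch with a hypothetical helper name, using the 250 tok/s figure from the fastest model in the list and an assumed 0.8s TTFT.

```python
def ttft_share(output_tokens: int, tok_per_s: float, ttft_s: float) -> float:
    """Fraction of total completion time spent waiting for the first token."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    stream_s = output_tokens / tok_per_s
    total_s = ttft_s + stream_s
    return ttft_s / total_s if total_s > 0 else 0.0


# Assumed inputs: 250 tok/s and 0.8s TTFT.
for tokens in (50, 800):
    share = ttft_share(tokens, tok_per_s=250, ttft_s=0.8)
    print(f"{tokens} tokens: {share:.0%} of the wait is TTFT")
```

Under these assumptions TTFT is about 80% of the wait for a 50-token answer and about 20% for an 800-token one.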

How much do real-world speeds vary from these numbers?

Considerably. Tokens-per-second varies with provider load, time of day, region, and your prompt's characteristics — published benchmark figures are medians, and p95 latency can run 2-3× slower during peak hours. Treat the figures here as relative rankings between models (which are fairly stable) rather than guarantees of absolute speed for any single request you make.
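
One rough way to use that 2-3× figure is to bracket a median estimate with a peak-hour band and check it against a latency budget. The sketch below assumes "2-3× slower" means the whole completion time is multiplied by 2 to 3, which is a simplification. The function names are hypothetical.

```python
from typing import Tuple


def peak_hour_band(median_s: float, low: float = 2.0, high: float = 3.0) -> Tuple[float, float]:
    """Scale a median completion time by the assumed peak-hour slowdown range."""
    if median_s < 0 or not 0 < low <= high:
        raise ValueError("median_s must be >= 0 and 0 < low <= high")
    return median_s * low, median_s * high


def within_budget(median_s: float, budget_s: float, slowdown: float = 3.0) -> bool:
    """True if the estimate still fits the budget after the assumed slowdown."""
    return median_s * slowdown <= budget_s


print(peak_hour_band(11.0))       # band for an 11s median estimate
print(within_budget(4.0, 15.0))   # a 4s median against a 15s budget
print(within_budget(11.0, 15.0))  # an 11s median against the same budget
```

This only tells you which estimates deserve a closer look. It is no substitute for measuring your own tail latency.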

Should I pick a faster model or a cheaper one?

Depends on who is waiting. For interactive use — a developer watching an agent work, a user in a chat UI — speed is a product feature and the small-model speed advantage (often 2-4× faster) compounds across every interaction. For batch jobs and background agents, nobody is watching, so optimize for cost and quality instead and let the tokens stream at whatever pace they stream.

What is speculative decoding and does it change these numbers?

Speculative decoding uses a small draft model to propose several tokens ahead, which the large model then verifies in a single pass. The large model keeps the drafted tokens it agrees with and replaces the first one it rejects, so output quality is identical while effective throughput rises, often 2-3×. Providers increasingly apply it server-side, which is one reason published tokens-per-second figures keep improving for unchanged models. You cannot enable it yourself via the API; you simply benefit when the provider does.
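
The draft-then-verify loop can be illustrated with a toy example. The sketch below is the simplified greedy variant, where a drafted token is accepted only if it equals the token the large model would have picked. Production systems typically use a probabilistic accept/reject rule instead. The two "models" here are stand-in functions invented for the demo, and the verification step is counted as one pass even though this toy calls the target function once per position.

```python
from typing import Callable, List, Tuple

Token = str
NextToken = Callable[[List[Token]], Token]


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (new tokens, number of target passes)."""
    out: List[Token] = []
    target_passes = 0
    while len(out) < max_new_tokens:
        # 1. The draft model proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft_next(prompt + out + proposal))

        # 2. The target model checks every proposed position. A real system does
        #    this in one batched forward pass, so it is counted as a single pass.
        target_passes += 1
        accepted: List[Token] = []
        for guess in proposal:
            truth = target_next(prompt + out + accepted)
            accepted.append(truth)  # always keep the target's own token
            if guess != truth:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every drafted token matched, so the pass yields one bonus token.
            accepted.append(target_next(prompt + out + accepted))

        out.extend(accepted)
    return out[:max_new_tokens], target_passes


# Toy stand-ins: the target recites a fixed sentence, the draft gets one word wrong.
TARGET_TEXT = "the quick brown fox jumps over the lazy dog".split()


def target_next(context: List[Token]) -> Token:
    return TARGET_TEXT[len(context) % len(TARGET_TEXT)]


def draft_next(context: List[Token]) -> Token:
    word = target_next(context)
    return "cat" if word == "fox" else word


tokens, passes = speculative_decode(target_next, draft_next, [], max_new_tokens=9, k=4)
print(" ".join(tokens), "| target passes:", passes)
```

In this toy run the nine tokens come out identical to what the target alone would produce, using two target passes instead of nine. Real speedups depend on how often the draft agrees with the target and on the cost of running the draft model.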

Built by FORG — AI cost observability for agentic coding. Free tools, no signup, nothing leaves your browser.