Question 1

What is the difference between TTFT and throughput?

Accepted Answer

Time-to-first-token (TTFT) is the wait before anything appears, while the model processes your prompt, and typically runs 0.3 to 2 seconds depending on input size. Throughput (tokens per second) is how fast text flows once it starts. Total response time is roughly TTFT plus output tokens divided by throughput, so short responses are dominated by TTFT and long responses by throughput. A 50-token answer on a fast model is mostly TTFT; an 800-token answer is mostly streaming time.
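To make the split concrete, here is a minimal sketch that treats total response time as TTFT plus output tokens divided by throughput. The function name and the sample figures (1.0 s TTFT, 150 tokens per second) are illustrative assumptions, not measurements of any particular model.

```python
def estimate_latency(ttft_s: float, tokens_per_s: float, output_tokens: int) -> dict:
    """Rough latency model: wait for the first token, then stream the rest.

    Simplification: all output tokens are assumed to stream at a constant rate.
    """
    stream_s = output_tokens / tokens_per_s   # time spent streaming text
    total_s = ttft_s + stream_s               # end-to-end wall-clock estimate
    return {
        "ttft_s": ttft_s,
        "stream_s": stream_s,
        "total_s": total_s,
        "ttft_share": ttft_s / total_s,       # fraction of the wait that is TTFT
    }


# Hypothetical numbers, chosen only to illustrate the two regimes.
for n in (50, 800):
    r = estimate_latency(ttft_s=1.0, tokens_per_s=150.0, output_tokens=n)
    print(f"{n} tokens: total {r['total_s']:.2f}s, TTFT share {r['ttft_share']:.0%}")
```

Under these assumed figures, TTFT accounts for about three quarters of the 50-token response and about a sixth of the 800-token one.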

Question 2

How much do real-world speeds vary from these numbers?

Accepted Answer

Considerably. Tokens-per-second varies with provider load, time of day, region, and your prompt's characteristics — published benchmark figures are medians, and p95 latency can run 2-3× slower during peak hours. Treat the figures here as relative rankings between models (which are fairly stable) rather than guarantees of absolute speed for any single request you make.
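If you want to see the spread for your own workload, one option is to time a batch of requests and compare the median with the 95th percentile. The sketch below uses only the Python standard library; the sample latencies are made up for illustration, and `summarize_latencies` is a hypothetical helper name.

```python
import statistics


def summarize_latencies(samples_s: list[float]) -> dict:
    """Summarize measured end-to-end latencies (in seconds)."""
    median = statistics.median(samples_s)
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    # method="inclusive" keeps the result within the observed min/max.
    p95 = statistics.quantiles(samples_s, n=20, method="inclusive")[18]
    return {
        "median_s": median,
        "p95_s": p95,
        "p95_over_median": p95 / median,
    }


# Made-up measurements: mostly fast, with a couple of slow peak-hour requests.
samples = [1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.5, 1.6, 2.9, 3.4]
summary = summarize_latencies(samples)
print(summary)
```

A `p95_over_median` ratio well above 1 is the signal to budget for the slow tail rather than the typical case.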

Question 3

Should I pick a faster model or a cheaper one?

Accepted Answer

Depends on who is waiting. For interactive use — a developer watching an agent work, a user in a chat UI — speed is a product feature and the small-model speed advantage (often 2-4× faster) compounds across every interaction. For batch jobs and background agents, nobody is watching, so optimize for cost and quality instead and let the tokens stream at whatever pace they stream.

Question 4

What is speculative decoding and does it change these numbers?

Accepted Answer

Speculative decoding uses a small draft model to propose several tokens at once, which the large model then verifies in a single pass. The large model keeps the leading run of proposed tokens that match what it would have produced itself and discards the rest, so each pass can emit several tokens instead of one. That multiplies effective throughput, often 2-3×, with identical output quality. Providers increasingly apply it server-side, which is one reason published tokens-per-second figures keep improving for unchanged models. You cannot enable it yourself via the API; you simply benefit when the provider does.
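The draft-and-verify loop can be sketched with toy stand-ins for the two models. Everything here is an assumption made for illustration: `target_next` and `draft_next` are hypothetical deterministic functions rather than real models, and this is the greedy variant (sampling-based variants use a probabilistic acceptance rule instead of exact matching). A real system checks all proposed positions in one batched forward pass; the toy calls the target function per position but counts each round as one pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical 'large' model: a deterministic toy rule over token ids."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Hypothetical 'small' model: matches the target except at every 4th position."""
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call (one pass) per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The draft model proposes k tokens, each conditioned on its own guesses.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target verifies the proposal (a single batched pass in practice).
        passes += 1
        for tok in proposal:
            expected = target(out)
            if tok != expected:
                out.append(expected)   # first mismatch: use the target's own token
                break
            out.append(tok)            # match: accept the drafted token
        else:
            out.append(target(out))    # all accepted: the pass yields a bonus token
    return out[:goal], passes


prompt = [7, 3, 9]
tokens, passes = speculative_decode(target_next, draft_next, prompt, n_new=20)
assert tokens == greedy_decode(target_next, prompt, 20)  # same output as the target alone
print(f"generated 20 tokens in {passes} target passes")
```

Because every emitted token is either confirmed or supplied by the target, the output matches what the target would have produced on its own; the gain comes from needing fewer target passes.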

Streaming Latency Estimator
