Temperature & Top-p Playground
Interactive visualization of how temperature, top-p and top-k reshape token sampling.
Temperature ≈ 1: probabilities are used as the model produced them — the neutral setting.
top_p = 1: nucleus sampling is off — no probability-mass cutoff is applied.
top_k = 12: no rank cutoff; all 12 candidate tokens stay eligible (before top_p).
Next-token distribution for: "The best way to learn programming is ___"
12 of 12 tokens eligible after truncation · fixed illustrative logits, softmax at your temperature
10 samples at these settings
Seeded PRNG (seed 42) — the share link reproduces this exact strip. Real APIs draw fresh randomness per call.
How it works
Every token an LLM generates comes out of the same final step. The model produces a score (a logit) for every token in its vocabulary, those scores become a probability distribution, and one token is drawn from it. Temperature, top_p and top_k are the three dials on that drawing step, and they are far easier to understand by moving them than by reading definitions. This playground wires all three sliders to a fixed, illustrative 12-token distribution for the prompt "The best way to learn programming is ___" so you can watch the bars reshape in real time.
The math is the real pipeline, just on twelve tokens instead of a 100k-token vocabulary. Temperature divides every logit before the softmax: values below 1 exaggerate the gaps (at 0.2, "practice" swallows nearly all the mass), values above 1 compress them (at 2.0, even "magic" gets a visible bar). Top_k then cuts everything below the k-th rank, top_p keeps the smallest set of tokens whose cumulative probability reaches p, and the survivors are renormalized so they sum to 1 again. Cut tokens stay visible in the chart — grayed and struck through — because seeing what got truncated is the whole lesson.
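The pipeline described above fits in a few lines of Python. This is a minimal sketch, not the playground's own code, which runs in the browser and is not shown here. The `sampling_distribution` helper and the four logits are hypothetical. The sketch treats temperature 0 as the greedy limit instead of dividing by zero. It also measures the top_p cut against the softmax probabilities of the top_k survivors without renormalizing in between, although some libraries do renormalize after top_k.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Scale by temperature, softmax, apply top_k, apply top_p, renormalize.

    Returns one probability per token; truncated tokens get 0.0.
    """
    n = len(logits)
    if temperature == 0:
        # Greedy limit: all mass on the highest logit (first index wins ties).
        best = max(range(n), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(n)]

    # Temperature divides every logit before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely.
    ranked = sorted(range(n), key=lambda i: probs[i], reverse=True)

    # top_k: keep only the k highest-ranked tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    # (Measured on the softmax probabilities; no renormalization after top_k.)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they sum to 1 again.
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(n)]


# Hypothetical logits for four candidate tokens (not the playground's values).
tokens = ["practice", "building", "reading", "magic"]
logits = [4.0, 3.0, 2.0, -1.0]

for settings in ({"temperature": 0.2}, {"temperature": 2.0}, {"top_p": 0.9}):
    dist = sampling_distribution(logits, **settings)
    print(settings, [round(p, 3) for p in dist])
```

Cut tokens come back as explicit zeros rather than being dropped from the list. That mirrors the chart, which keeps truncated tokens visible.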
The sample strip below the chart draws ten tokens from the final distribution using a seeded pseudo-random generator, so a shared link reproduces the exact same strip — handy for teaching. Hit re-roll to draw with a fresh seed. At temperature 0 you will see ten identical tokens; at temperature 2 with no truncation, the strip turns chaotic. That visceral difference is what the parameters actually do to your production traffic.
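A seeded strip like the one described above can be sketched with Python's `random.Random`. The page does not say which generator it uses, so the same seed in Python will not reproduce the page's strip. The sketch only shows that a fixed seed makes the draws repeatable. The `sample_strip` helper and the probabilities are hypothetical.

```python
import random


def sample_strip(tokens, probs, n=10, seed=42):
    """Draw n tokens from a distribution; the same seed gives the same strip."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return rng.choices(tokens, weights=probs, k=n)


tokens = ["practice", "building", "reading", "magic"]

# Temperature 0 puts all mass on one token, so every draw is identical.
greedy = [1.0, 0.0, 0.0, 0.0]
print(sample_strip(tokens, greedy))

# A flatter (hypothetical) distribution gives a mixed strip, but re-running
# with seed=42 reproduces it exactly; a new seed acts like "re-roll".
flat = [0.4, 0.3, 0.2, 0.1]
print(sample_strip(tokens, flat) == sample_strip(tokens, flat))  # True
print(sample_strip(tokens, flat, seed=7))
```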
One honest framing note: the logits here are hand-picked for pedagogy, not pulled from a real model. Real distributions over huge vocabularies behave the same way but with a much longer tail, which is precisely why nucleus (top_p) sampling exists. The mechanics you see here (sharpening, flattening, rank cuts and mass cuts) carry over directly to the temperature and top_p fields you set in Anthropic, OpenAI or Gemini requests. The accepted ranges differ by provider, though: Anthropic's temperature runs from 0 to 1, OpenAI's from 0 to 2. Everything runs locally in your browser; there is no model behind this page and nothing is sent anywhere.
Frequently asked questions
What is the difference between temperature and top_p?
They attack randomness from different angles. Temperature reshapes the whole distribution: dividing every logit by the temperature sharpens it below 1 (the favorite gets even more likely) and flattens it above 1 (tail tokens gain ground). Top_p never reshapes anything — it truncates, keeping only the smallest set of tokens whose probabilities sum to p and renormalizing. In practice temperature controls how adventurous each pick is, while top_p caps how deep into the tail an adventurous pick is allowed to reach. Most providers recommend tuning one, not both.
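You can check the contrast numerically. In this minimal sketch, with hypothetical logits and helper names, raising the temperature leaves every token with some probability and keeps the ranking intact. Applying top_p at temperature 1, by contrast, zeroes the tail outright and renormalizes what remains.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: reshapes the distribution, removes nothing."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """top_p truncation: keep the smallest high-probability set reaching top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


logits = [4.0, 3.0, 2.0, -1.0]  # hypothetical logits
base = softmax(logits)
hot = softmax(logits, temperature=1.5)  # tail tokens gain ground, none vanish
cut = nucleus(base, top_p=0.9)  # tail removed; survivors keep their relative odds

print([round(p, 3) for p in base])
print([round(p, 3) for p in hot])
print([round(p, 3) for p in cut])
```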
What settings should I use for coding versus creative writing?
For code, structured output and tool calls, run cold: temperature 0 to 0.3 with top_p left at 1. Wrong-but-plausible tokens are expensive in code, and there is usually one right answer. For brainstorming, naming and prose, temperature 0.8 to 1.2 gives useful variety; pairing with top_p around 0.9 trims the incoherent tail that high temperature lifts. Agentic coding tools ship near-deterministic defaults for good reason — when an agent loops on a task, sampling noise compounds across turns.
Why is temperature 0 not fully deterministic in real APIs?
Temperature 0 makes the sampling step deterministic (always take the argmax), but the forward pass that produces the logits is not bit-identical between runs. Floating-point addition is not associative, so batched GPU kernels that split and order their sums differently, for example at different batch sizes, can return slightly different logits for the same prompt. Mixture-of-experts routing can also differ with batch composition, and providers spread requests across heterogeneous hardware. When two tokens sit within numerical noise of each other, the argmax itself can flip. So temperature 0 means "as deterministic as the infrastructure allows", not a reproducibility guarantee. For reproducibility, some APIs offer a seed parameter, which is also best-effort.
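Here is a toy illustration of the mechanism, in plain Python rather than GPU kernels. Summing the same three numbers in two orders yields two different doubles. When two candidates are this close, greedy (temperature 0) decoding then picks a different token. The near-tied logits and token names are hypothetical. Real kernels reduce thousands of terms, but the effect is the same in kind.

```python
# Floating-point addition is not associative: the same three numbers,
# summed in two different orders, give different doubles.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right, left == right)  # 0.6000000000000001 0.6 False


def greedy(logits):
    """Temperature-0 decoding: index of the largest logit (first wins ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


tokens = ["practice", "building"]  # hypothetical near-tied candidates
run_1 = [0.6, left]   # "building" logit after summing in one order
run_2 = [0.6, right]  # the same logit after summing in another order
print(tokens[greedy(run_1)], tokens[greedy(run_2)])  # building practice
```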
What does top_k do, and when does it matter over top_p?
Top_k keeps the k highest-probability tokens regardless of how much probability they hold, whereas top_p keeps however many tokens it takes to cover p probability mass. The difference shows at the extremes. In a confident distribution, top_p 0.9 might keep just one token while top_k 40 keeps the favorite plus thirty-nine near-zero stragglers. In a flat distribution, top_k 5 may cut tokens that held real probability. Anthropic and Google expose top_k directly. OpenAI's Chat Completions API does not, which is one of the parameter gaps to watch when porting requests between providers.
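Both extremes are easy to reproduce with two hypothetical distributions. The first is confident: a single token holds 95% of the mass over a 100-token vocabulary. The second is perfectly flat over eight tokens. The filter helpers below are illustrative, not any provider's implementation. The flat case uses 0.125 so the cumulative sums are exact in binary floating point.

```python
def ranked(probs):
    """Token indices from most to least likely."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def top_k_keep(probs, k):
    """Keep the k most likely tokens, however much (or little) mass they hold."""
    return set(ranked(probs)[:k])


def top_p_keep(probs, p):
    """Keep as many tokens as it takes for their cumulative mass to reach p."""
    kept, cumulative = set(), 0.0
    for i in ranked(probs):
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


# Confident: one favorite plus 99 tokens sharing the remaining 5%.
confident = [0.95] + [0.05 / 99] * 99
print(len(top_p_keep(confident, 0.9)))  # 1: the favorite alone covers 0.9
print(len(top_k_keep(confident, 40)))   # 40: the favorite plus 39 stragglers

# Flat: eight equally likely tokens.
flat = [0.125] * 8
kept = top_k_keep(flat, 5)
dropped = sum(p for i, p in enumerate(flat) if i not in kept)
print(dropped)                     # 0.375 of real probability cut by top_k 5
print(len(top_p_keep(flat, 0.9)))  # 8: top_p 0.9 needs every token here
```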
In what order are temperature, top_k and top_p applied?
A common pipeline, and the one this playground implements, runs in this order: scale logits by temperature, softmax into probabilities, apply top_k (rank cutoff), apply top_p (cumulative mass cutoff), renormalize the survivors, then sample. Not every engine uses this order. Some local inference stacks apply temperature after truncation or let you reorder the steps, and hosted APIs generally do not document theirs. The order matters: because temperature reshapes probabilities before truncation, a high temperature can pull tail tokens above the top_p waterline that would have been cut at temperature 1. Watch the chart: raise temperature with top_p at 0.9 and you can see the survivor set itself grow.
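The growing survivor set can be reproduced in a few lines. This sketch uses hypothetical logits and follows the temperature-first order described above: temperature, then softmax, then the top_p cut. It counts how many tokens survive top_p 0.9 at three temperatures.

```python
import math


def softmax(logits, temperature):
    """Divide logits by temperature, then softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs, top_p):
    """Number of top-ranked tokens needed for cumulative mass to reach top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)  # float rounding kept the total just under top_p


logits = [4.0, 3.0, 2.0, 1.0, 0.0, -1.0]  # hypothetical logits
for t in (0.5, 1.0, 2.0):
    print(t, nucleus_size(softmax(logits, t), 0.9))
# 0.5 2
# 1.0 3
# 2.0 4
```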
Built by FORG — AI cost observability for agentic coding. Free tools, no signup, nothing leaves your browser.