Multilingual Token Ratio
How many more tokens the same text costs in 16 languages, with live samples.
The sample sentence
"The new software update will be installed automatically tomorrow morning, and all your files will remain safe."
The same sentence, translated into 16 languages, each tokenized with the exact o200k_base encoding in your browser. Ratios are relative to the English token count — a ratio of 2.0× means the same meaning costs twice the input tokens (and twice the dollars) in that language.
Honest caveat: CJK languages use far fewer characters per sentence, so although their chars-per-token is low (often under 2), their per-sentence token ratio lands closer to English than character counts suggest. Translation length differences are part of the measurement — that is the real-world cost.
How it works
Token pricing looks language-neutral and isn't. Because tokenizer vocabularies are trained mostly on English text, the same meaning expressed in Hindi, Arabic or Polish fragments into more tokens than its English equivalent — and every one of those tokens is billed. If you serve international users, your per-conversation cost varies by language in ways an English-only cost model will never show you.
This page measures the effect instead of asserting it. One everyday sentence — a software update notice, the kind of text a product actually sends — is translated into 16 languages and each version is tokenized with the exact o200k_base encoding, in your browser, the moment the page loads. The table ranks languages by their ratio to the English baseline; the bars make the spread visible at a glance.
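The measurement described above can be sketched in a few lines. This is a minimal illustration and not the page's actual source: the names `rankByRatio`, `Sample` and `Row` are hypothetical, and the token counter is passed in as a function so the ranking logic does not depend on any particular tokenizer. The comment shows one way it could be wired to js-tiktoken.

```typescript
// Sketch: rank translations by token ratio against an English baseline.
// `countTokens` is injected by the caller. With js-tiktoken it could be
// wired up roughly like this (assumed setup, not taken from the page):
//   import { getEncoding } from "js-tiktoken";
//   const enc = getEncoding("o200k_base");
//   const countTokens = (text: string) => enc.encode(text).length;

type Sample = { language: string; text: string };
type Row = { language: string; tokens: number; ratio: number };

function rankByRatio(
  samples: Sample[],
  baselineLanguage: string,
  countTokens: (text: string) => number
): Row[] {
  const baseline = samples.find((s) => s.language === baselineLanguage);
  if (!baseline) throw new Error(`no sample for ${baselineLanguage}`);

  const baselineTokens = countTokens(baseline.text);
  if (baselineTokens === 0) throw new Error("baseline has zero tokens");

  return samples
    .map((s) => {
      const tokens = countTokens(s.text);
      // ratio 2.0 means twice the tokens of the baseline sentence
      return { language: s.language, tokens, ratio: tokens / baselineTokens };
    })
    .sort((a, b) => b.ratio - a.ratio); // highest premium first
}
```

Injecting the counter keeps the ranking testable without loading a vocabulary, and lets the same code compare a different encoding later.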
A few results worth internalizing. Western European languages cluster tightly above English, typically 1.1–1.5×, because their vocabularies are well represented in training data. The largest premiums on o200k tend to appear in Indic scripts. And CJK languages defy the simple story: each character costs more tokens than a Latin letter does (often under 2 characters per token), but a sentence needs far fewer characters, so the per-meaning cost lands much closer to English than chars-per-token folklore suggests. We state that nuance rather than flattening it.
Methodology and limits, plainly: this is one sentence, not a corpus, so treat the ratios as representative rather than definitive; technical jargon, names and numbers shift the ratios for any given text. The encoding is OpenAI's; Anthropic and Google tokenizers are not public, and while language rankings transfer broadly, absolute ratios will differ. Nothing is uploaded; the tokenizer runs locally.
To measure your own multilingual content rather than our sample, paste it into the Token Counter. To turn a measured ratio into a budget, the Token Cost Calculator prices any token count on every major model — multiply your English baseline by the ratio you see here and you have an honest per-market cost estimate.
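The multiplication described above is simple enough to write down. The sketch below is an assumption about how you might do it yourself, not the Token Cost Calculator's implementation; `estimateMarketCost` and its parameter names are hypothetical, and the price is whatever per-million-token rate your model charges.

```typescript
// Sketch: scale an English token baseline by a measured language ratio,
// then price the result. All inputs are supplied by the caller.
function estimateMarketCost(
  englishTokens: number,
  ratio: number,
  pricePerMillionTokens: number
): { tokens: number; cost: number } {
  // Tokens are whole units, so round the scaled count.
  const tokens = Math.round(englishTokens * ratio);
  return { tokens, cost: (tokens / 1e6) * pricePerMillionTokens };
}
```

Input and output tokens are usually priced differently, so in practice you would call this once per direction and add the two costs.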
Frequently asked questions
Why does the same sentence cost more tokens in some languages?
Tokenizer vocabularies are trained on web-scale corpora that skew heavily English, so English words map to single tokens far more often. Languages with rich morphology, agglutination, or scripts underrepresented in training data get fragmented into more, smaller pieces. The o200k encoding improved non-English coverage substantially over older encodings, but the skew remains measurable.
Which languages pay the largest token premium?
On o200k with this sample, Indic scripts like Hindi and some right-to-left scripts tend to show the highest ratios, often 1.5-2.5× English. Western European languages cluster near 1.1-1.5×. Your exact ratios depend on the text — technical vocabulary, names and numbers tokenize differently than everyday prose — which is why we show a real measured sample rather than quoting a single folklore number.
Are CJK languages cheap or expensive in tokens?
Both claims circulate, and the honest answer is: per character they are expensive (often under 2 characters per token), but per sentence they need far fewer characters than English to express the same meaning. The net per-meaning ratio on o200k usually lands between 0.9× and 1.6× English. This tool measures per-sentence ratios, which is what you actually pay.
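The two effects can be separated with a little arithmetic: tokens are roughly characters divided by characters-per-token, so the per-sentence ratio is the quotient of those two estimates. The sketch below uses a hypothetical helper, `perSentenceRatio`, and any numbers you feed it are your own measurements. As an invented illustration, a 30-character sentence at 1.5 characters per token is about 20 tokens, against 25 tokens for a 100-character English sentence at 4 characters per token, which gives 0.8×.

```typescript
// Sketch: combine sentence length and chars-per-token into one ratio.
// tokens ≈ chars / charsPerToken, for each language separately.
type SentenceStats = { chars: number; charsPerToken: number };

function perSentenceRatio(english: SentenceStats, other: SentenceStats): number {
  if (english.charsPerToken <= 0 || other.charsPerToken <= 0) {
    throw new Error("charsPerToken must be positive");
  }
  const englishTokens = english.chars / english.charsPerToken;
  if (englishTokens === 0) throw new Error("English sentence has no tokens");
  const otherTokens = other.chars / other.charsPerToken;
  return otherTokens / englishTokens;
}
```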
Does this affect what I pay per user in non-English markets?
Directly. API pricing is per token, so if your Hindi-speaking users' prompts and responses run 2× the tokens of English equivalents, those users cost roughly twice as much to serve at the same conversation length. Teams budgeting international rollouts should multiply their English-based cost models by measured language ratios, not assume parity.
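For a rollout that spans several markets, one way to apply this is a traffic-weighted multiplier on the English cost model. The sketch below is an assumption about how a team might do that. `blendedMultiplier` and the `Market` shape are hypothetical names, and the traffic shares and ratios must come from your own data.

```typescript
// Sketch: traffic-weighted average of per-language token ratios.
// Shares may be fractions or raw counts; they are normalized here.
type Market = { language: string; trafficShare: number; ratio: number };

function blendedMultiplier(markets: Market[]): number {
  const totalShare = markets.reduce((sum, m) => sum + m.trafficShare, 0);
  if (totalShare <= 0) throw new Error("traffic shares must sum to a positive number");
  const weighted = markets.reduce((sum, m) => sum + m.trafficShare * m.ratio, 0);
  return weighted / totalShare;
}
```

Multiplying an English-only cost forecast by the result gives a blended estimate at the same conversation length. It does not account for users writing longer or shorter messages in different markets.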
Is the comparison exact?
The token counts are exact o200k_base counts, computed in your browser with js-tiktoken, using the same encoding current OpenAI models bill against. The translations are fixed reference renderings of one sentence, so the ratios are a representative sample, not a corpus-level study. Claude and Gemini use different tokenizers; rankings are broadly similar, but absolute ratios will differ.
Built by FORG — AI cost observability for agentic coding. Free tools, no signup, nothing leaves your browser.