Skip to main content

Cost Per Task Comparator

Price a complete task — calls, retries and context included — across models honestly.

100% client-side⛁ prices verified 2026-06-11⌁ zero network calls
8
tokens
tokens
15%
$0.10

per task on Gemini 3 Flash — the cheapest of your four picks. Claude Sonnet 4.5 costs 5.6× more ($0.54) for the same task.

Total task cost — 9.2 effective calls (8 + 15% retries)

Claude Sonnet 4.5$0.54
Claude Haiku 4.5$0.18
GPT-5.4$0.48
Gemini 3 Flashcheapest$0.10

Task volume: 124.2k tokens billed per completed task.

18
models priced, 4 vendors
2026-06-11
prices verified against vendor pages
90d
price staleness tripwire in CI
0
network requests per keystroke

How it works

Per-million-token prices are how vendors sell; cost per completed task is how teams actually spend. This comparator prices a whole task — the number of API calls it takes, the input and output tokens each call carries, and the retries your pipeline really incurs — then shows the total side by side across four models you pick. Everything computes locally in your browser.

The retry rate is the honest part most calculators skip. A retried call resends the full input context and regenerates output, so it bills as a complete extra call: at 8 calls per task and 15% retries you pay for 9.2. Agentic pipelines see 10-20% retry rates routinely — 429s, malformed JSON, failed validations — and on long-context calls those retries are expensive precisely because the input is large. The effective-calls figure is shown above the bars so the multiplier is never hidden.

Define a task as whatever unit you budget by: a resolved ticket, a generated pull request, a summarized document. Estimate the average calls a completed instance takes and the typical context per call — for agentic coding, input context of 10k-30k tokens per call with 1k-2k output is a reasonable starting shape. The defaults model exactly that, so a real result renders before you touch anything.

Read the verdict line with appropriate suspicion. The comparison holds calls-per-task constant across models, which flatters cheaper, weaker models — in reality a smaller model may need more turns or more retries to finish the same task, and sometimes fails it outright. The useful signal is magnitude: a 10× gap justifies an experiment with a cheaper model on a slice of traffic; a 1.3× gap rarely justifies the capability risk.

Model prices come from the shared FORG model table, verified 2026-06-11 against the Anthropic, OpenAI, Google and DeepSeek pricing pages. What this tool estimates from sliders, FORG measures from your real sessions — actual calls per task, actual retry rates, actual cost per merged PR — which is the number worth putting in a planning doc. The share link preserves your full scenario, four models included.

Frequently asked questions

Why price per task instead of per token?

Because nobody ships tokens — they ship completed tasks. A task is several API calls, each carrying context, plus the retries your pipeline actually incurs. Per-token comparisons hide all of that: a model that needs more calls or more retries to finish can cost more in practice than one with a higher sticker price per million tokens.

How does the retry rate affect the total?

Each retry resends the full input context and regenerates output, so it bills like a complete extra call. The calculator multiplies your calls per task by (1 + retry rate): at 8 calls and a 15% retry rate, you are billed for 9.2 effective calls. Retry rates of 10-20% are normal for agentic pipelines hitting rate limits, malformed outputs and validation failures.

What counts as one task?

Whatever unit your team budgets by: a resolved support ticket, a generated pull request, a document summarized, a test suite repaired. Estimate the average number of API calls a completed instance takes, the typical input context per call (system prompt plus accumulated history) and the typical output, and the comparison holds for any task shape.

Should I always pick the cheapest model in the verdict?

No — the comparison assumes every model completes the task in the same number of calls, which favors weaker models. In practice a smaller model may need more turns, more retries or human cleanup. The right read: if the cheapest model is 10× cheaper, it is worth an experiment; if it is 1.3× cheaper, the capability risk probably is not worth it.

FORG tracks this automatically across every agent session — live cost attribution, budgets, and alerts.

Start tracking with FORG