Cost Per Task Comparator

Price a complete task — calls, retries and context included — across models honestly.

100% client-side · prices verified 2026-06-11 · zero network calls

API calls per task: 8

Input tokens per call

Output tokens per call

Retry rate: 15%


$0.10 per task on Gemini 3 Flash, the cheapest of your four picks. Claude Sonnet 4.5 costs 5.6× as much ($0.54) for the same task.

Total task cost — 9.2 effective calls (8 + 15% retries)

Claude Sonnet 4.5: $0.54

Claude Haiku 4.5: $0.18

GPT-5.4: $0.48

Gemini 3 Flash (cheapest): $0.10

Task volume: 124.2k tokens billed per completed task.

How it works

Per-million-token prices are how vendors sell; cost per completed task is how teams actually spend. This comparator prices a whole task — the number of API calls it takes, the input and output tokens each call carries, and the retries your pipeline really incurs — then shows the total side by side across four models you pick. Everything computes locally in your browser.
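The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the names `ModelPrice`, `cost_per_task` and `compare`, the model names, and the per-million-token prices are all hypothetical placeholders, and the 12,000 input / 1,500 output token scenario is an assumed example. Substitute current figures from each vendor's pricing page before relying on the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical price entry, in US dollars per million tokens."""
    name: str
    input_per_mtok: float
    output_per_mtok: float


def cost_per_task(price, calls, input_tokens, output_tokens, retry_rate):
    """Dollar cost of one completed task, counting each retry as a full extra call."""
    effective_calls = calls * (1 + retry_rate)
    cost_per_call = (
        input_tokens * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000
    return effective_calls * cost_per_call


def compare(models, **scenario):
    """Return (name, cost) pairs for the same scenario, cheapest first."""
    costs = [(m.name, cost_per_task(m, **scenario)) for m in models]
    return sorted(costs, key=lambda pair: pair[1])


# Placeholder prices for illustration only; these are NOT real vendor prices.
MODELS = [
    ModelPrice("model-large", 4.00, 20.00),
    ModelPrice("model-small", 0.40, 2.00),
]

ranked = compare(
    MODELS, calls=8, input_tokens=12_000, output_tokens=1_500, retry_rate=0.15
)
for name, cost in ranked:
    print(f"{name}: ${cost:.2f} per task")
```

Keeping the scenario separate from the price table mirrors the idea of the comparator: one task shape, priced against several models at once.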

The retry rate is the honest part most calculators skip. A retried call resends the full input context and regenerates output, so it bills as a complete extra call: at 8 calls per task and 15% retries you pay for 9.2. Agentic pipelines see 10-20% retry rates routinely — 429s, malformed JSON, failed validations — and on long-context calls those retries are expensive precisely because the input is large. The effective-calls figure is shown above the bars so the multiplier is never hidden.
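To make the multiplier concrete, the sketch below splits the tokens billed for one task into first-attempt and retry portions. The function name is illustrative, and the 12,000 input / 1,500 output split is an assumption: it was chosen because it sums to the 13,500 tokens per call implied by 124.2k billed tokens spread over 9.2 effective calls.

```python
def billed_tokens(calls, input_tokens, output_tokens, retry_rate):
    """Tokens billed per completed task, split into base and retry portions.

    Assumes, as the calculator does, that every retry resends the full
    input and regenerates the full output.
    """
    per_call = input_tokens + output_tokens
    base = calls * per_call
    retries = calls * retry_rate * per_call
    return {
        "effective_calls": calls * (1 + retry_rate),
        "base_tokens": base,
        "retry_tokens": retries,
        "total_tokens": base + retries,
    }


# Assumed scenario: 8 calls, 15% retries, 12,000 input + 1,500 output tokens.
usage = billed_tokens(calls=8, input_tokens=12_000, output_tokens=1_500, retry_rate=0.15)
print(f"effective calls: {usage['effective_calls']:.1f}")
print(f"total tokens:    {usage['total_tokens']:,.0f}")
print(f"retry overhead:  {usage['retry_tokens']:,.0f} tokens")
```

Under these assumed inputs the retry portion is 15% of the base volume, and because input dominates each call, most of that overhead is resent context rather than regenerated output.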

Define a task as whatever unit you budget by: a resolved ticket, a generated pull request, a summarized document. Estimate the average calls a completed instance takes and the typical context per call — for agentic coding, input context of 10k-30k tokens per call with 1k-2k output is a reasonable starting shape. The defaults model exactly that, so a real result renders before you touch anything.
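If you already log API usage per task, these inputs can be estimated from records rather than guessed. The sketch below assumes a hypothetical log shape: each completed task is a list of call records, and each record is a dict with `input_tokens`, `output_tokens` and an `is_retry` flag. Those field names are invented for illustration, so adapt them to whatever your pipeline actually records.

```python
from statistics import mean


def estimate_inputs(tasks):
    """Derive calculator inputs from logged tasks (hypothetical record shape).

    Raises statistics.StatisticsError if `tasks` contains no calls.
    """
    first_attempts = [[c for c in task if not c["is_retry"]] for task in tasks]
    all_calls = [c for task in tasks for c in task]
    n_first = sum(len(calls) for calls in first_attempts)
    n_retry = len(all_calls) - n_first
    return {
        "calls_per_task": mean(len(calls) for calls in first_attempts),
        "input_tokens_per_call": mean(c["input_tokens"] for c in all_calls),
        "output_tokens_per_call": mean(c["output_tokens"] for c in all_calls),
        # Retries expressed as a fraction of first-attempt calls.
        "retry_rate": n_retry / n_first,
    }


# Made-up sample: two completed tasks, one of which needed a retry.
sample = [
    [
        {"input_tokens": 10_000, "output_tokens": 1_000, "is_retry": False},
        {"input_tokens": 12_000, "output_tokens": 2_000, "is_retry": False},
        {"input_tokens": 12_000, "output_tokens": 2_000, "is_retry": True},
    ],
    [
        {"input_tokens": 11_000, "output_tokens": 1_500, "is_retry": False},
        {"input_tokens": 15_000, "output_tokens": 1_000, "is_retry": False},
    ],
]
estimate = estimate_inputs(sample)
print(estimate)
```

Counting retries against first attempts, rather than against all calls, keeps the estimate consistent with the effective-calls formula of calls × (1 + retry rate).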

Read the verdict line with appropriate suspicion. The comparison holds calls-per-task constant across models, which flatters cheaper, weaker models — in reality a smaller model may need more turns or more retries to finish the same task, and sometimes fails it outright. The useful signal is magnitude: a 10× gap justifies an experiment with a cheaper model on a slice of traffic; a 1.3× gap rarely justifies the capability risk.
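One way to act on that caveat is to ask how many calls the cheaper model could take before its price advantage disappears. The sketch below is an illustration under stated assumptions: the function name is hypothetical and the per-call costs are placeholder dollar amounts, not real prices.

```python
def break_even_calls(strong_cost_per_call, cheap_cost_per_call,
                     strong_calls, strong_retry_rate, cheap_retry_rate):
    """Calls per task at which the cheaper model costs as much as the stronger one."""
    strong_total = strong_calls * (1 + strong_retry_rate) * strong_cost_per_call
    return strong_total / ((1 + cheap_retry_rate) * cheap_cost_per_call)


# Placeholder per-call costs in dollars; the stronger model finishes in 8 calls.
wide_gap = break_even_calls(0.06, 0.006, 8, 0.15, 0.15)         # 10x cheaper per call
narrow_gap = break_even_calls(0.06, 0.06 / 1.3, 8, 0.15, 0.15)  # 1.3x cheaper per call
print(f"10x gap:  break-even at {wide_gap:.1f} calls per task")
print(f"1.3x gap: break-even at {narrow_gap:.1f} calls per task")
```

With equal retry rates, the break-even call count is simply the stronger model's call count multiplied by the price gap. A 10× gap leaves a lot of room for extra turns; a 1.3× gap is used up by two or three additional calls.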

Model prices come from the shared FORG model table, verified 2026-06-11 against the Anthropic, OpenAI, Google and DeepSeek pricing pages. What this tool estimates from sliders, FORG measures from your real sessions — actual calls per task, actual retry rates, actual cost per merged PR — which is the number worth putting in a planning doc. The share link preserves your full scenario, four models included.

Frequently asked questions

Why price per task instead of per token?

Because nobody ships tokens — they ship completed tasks. A task is several API calls, each carrying context, plus the retries your pipeline actually incurs. Per-token comparisons hide all of that: a model that needs more calls or more retries to finish can cost more in practice than one with a higher sticker price per million tokens.

How does the retry rate affect the total?

Each retry resends the full input context and regenerates output, so it bills like a complete extra call. The calculator multiplies your calls per task by (1 + retry rate): at 8 calls and a 15% retry rate, you are billed for 9.2 effective calls. Retry rates of 10-20% are normal for agentic pipelines hitting rate limits, malformed outputs and validation failures.

What counts as one task?

Whatever unit your team budgets by: a resolved support ticket, a generated pull request, a document summarized, a test suite repaired. Estimate the average number of API calls a completed instance takes, the typical input context per call (system prompt plus accumulated history) and the typical output, and the comparison holds for any task shape.

Should I always pick the cheapest model in the verdict?

No — the comparison assumes every model completes the task in the same number of calls, which favors weaker models. In practice a smaller model may need more turns, more retries or human cleanup. The right read: if the cheapest model is 10× cheaper, it is worth an experiment; if it is 1.3× cheaper, the capability risk probably is not worth it.

FORG tracks this automatically across every agent session — live cost attribution, budgets, and alerts.



Related tools

Multi-Model Blend Calculator

Agent Session Cost Estimator

Model Downgrade Advisor

Agent Loop Cost Simulator