Batch API Savings Calculator
See how much of your workload qualifies for 50% batch pricing and what you would save.
Does a job qualify for the Batch API?
- No user waiting on the response (eval runs, backfills, classification, embedding prep)
- Tolerates up to 24h completion (most batches finish much faster)
- Requests are independent — no chaining one result into the next call
- Volume is bursty or scheduled, not a steady interactive stream
$60.00 saved per month by batching 40% of 50M tokens on Claude Sonnet 4.5: your bill drops from $300.00 to $240.00.
All-realtime vs realtime + batch
- Batch discount: 50%
- Annualized savings: $720.00
- Realtime portion: $180.00
- Batched portion: $60.00
How it works
The Batch API is the least glamorous 50% discount in the industry: same models, same quality, half the price, in exchange for tolerating up to 24 hours of latency. This calculator takes your monthly token volume, the share of it that could run async, and a model, then shows the realtime bill next to the blended realtime-plus-batch bill. The difference is your savings — recomputed live as you move the slider, with no signup and no network calls.
The math is deliberately simple. We assume a 3:1 input-to-output token split (75% input), which matches typical mixed workloads; adjust your mental model if yours skews chattier. The full bill is (in × rate_in + out × rate_out) ÷ 1M, where in and out are token counts and the rates are dollars per million tokens. The batched share of that bill is reduced by the model's batch discount (50% on every model listed here), and the rest stays at realtime rates. Both input and output tokens get the discount, which is why batch savings scale with your whole bill rather than just the prompt side.
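The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function name and parameters are made up for the example, and the $3 / $15 per-million rates are assumptions chosen because they reproduce the $300.00 headline figure for 50M tokens at a 75% input split.

```python
def batch_savings(total_tokens, batch_share, rate_in, rate_out,
                  input_fraction=0.75, batch_discount=0.50):
    """Return (realtime_bill, blended_bill, monthly_savings) in dollars.

    rate_in and rate_out are dollars per million tokens.
    """
    tokens_in = total_tokens * input_fraction
    tokens_out = total_tokens - tokens_in
    # Full bill if every token ran at realtime rates.
    full_bill = (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000
    # The batched share is discounted; the rest stays at realtime rates.
    realtime_part = full_bill * (1 - batch_share)
    batched_part = full_bill * batch_share * (1 - batch_discount)
    blended_bill = realtime_part + batched_part
    return full_bill, blended_bill, full_bill - blended_bill


# Assumed rates: $3 input / $15 output per million tokens.
full, blended, saved = batch_savings(50_000_000, 0.40, rate_in=3.00, rate_out=15.00)
print(f"{full:.2f} {blended:.2f} {saved:.2f}")  # 300.00 240.00 60.00
```

Multiplying the monthly saving by 12 gives the annualized figure, $720.00 in this example.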
The honest question is not the math — it is the batchable percentage. Most teams underestimate it. Eval runs, prompt regression suites, bulk document processing, log summarization, classification backfills and synthetic data generation are all batchable, and at many companies those are 30–60% of total token volume once you audit it. The qualifying checklist next to the inputs is the audit: no user waiting, 24h tolerance, independent requests. If a job ticks all three, it is burning money at realtime rates.
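One way to run that audit is to list your jobs, tag each against the three checks, and sum the token volume that passes. The sketch below is illustrative only: the `Job` fields and the sample workloads are invented for the example, and the token figures are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    monthly_tokens: int
    user_waiting: bool          # is a person blocked on the response?
    tolerates_24h: bool         # can completion take up to 24 hours?
    independent_requests: bool  # no chaining of one result into the next call?


def is_batchable(job):
    """A job qualifies only if it passes all three checks."""
    return (not job.user_waiting
            and job.tolerates_24h
            and job.independent_requests)


def batchable_share(jobs):
    """Fraction of total monthly tokens that could move to batch."""
    total = sum(job.monthly_tokens for job in jobs)
    if total == 0:
        return 0.0
    batchable = sum(job.monthly_tokens for job in jobs if is_batchable(job))
    return batchable / total


# Hypothetical workload inventory.
jobs = [
    Job("support chat", 30_000_000, True, False, False),
    Job("nightly eval suite", 12_000_000, False, True, True),
    Job("classification backfill", 8_000_000, False, True, True),
]
print(f"{batchable_share(jobs):.0%}")  # 40%
```

The resulting fraction is the value to feed into the "% batchable" input.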
What the estimate leaves out: batch jobs have separate rate-limit pools, so moving work to batch also frees realtime quota for your interactive traffic, a second-order benefit this calculator does not price. It also assumes your batchable share has the same input/output mix as the rest, which slightly understates savings for output-heavy bulk generation. Prices verified 2026-06-11 against vendor pricing pages. If you do not know your real monthly volume or its mix, that is the measurement gap FORG closes: it attributes every token to a session, so the "% batchable" slider stops being a guess.
Frequently asked questions
Which jobs qualify for the Batch API?
Anything where no human is waiting on the answer: nightly eval suites, data backfills, bulk classification, summarizing a document corpus, embedding prep, regression-testing prompts. The requests must be independent of each other: if call B needs call A's output, it cannot go in the same batch. Interactive chat, agent loops and anything user-facing stay realtime.
What is the latency tradeoff with batch processing?
Providers commit to a 24-hour processing window, and in practice most batches finish in minutes to a few hours depending on system load. Requests still unfinished when the window closes can expire rather than complete late, and there is no SLA tighter than 24 hours, so design for the worst case: kick off batches from a scheduler, poll or use webhooks for completion, and never put a batch on a user-facing critical path.
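A polling loop designed for the worst case might look like the sketch below. It is provider-agnostic on purpose: `get_status` is a hypothetical callable you would wrap around your provider's batch-status call, and the status strings are placeholders rather than any vendor's real values.

```python
import time

# Placeholder terminal states; map your provider's real statuses onto these.
TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(get_status, poll_seconds=60.0, deadline_seconds=24 * 3600,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the batch reaches a terminal state or the deadline passes.

    Returns the terminal status, or "timed_out" if the deadline is hit first.
    """
    start = clock()
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() - start >= deadline_seconds:
            return "timed_out"
        sleep(poll_seconds)


# Usage with a fake status source, so the example runs without waiting.
fake_statuses = iter(["in_progress", "in_progress", "completed"])
result = wait_for_batch(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(result)  # completed
```

Passing `sleep` and `clock` in as parameters keeps the loop testable. In production you would leave the defaults and run it from a scheduled job, not from a request handler.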
What are the provider batch limits I should know about?
OpenAI batches accept up to 50,000 requests or a 200 MB input file per batch, with separate batch rate-limit pools so they do not eat your realtime quota. Anthropic's Message Batches API takes up to 100,000 requests per batch with results retrievable for 29 days. Both bill at 50% of the standard rate for input and output tokens alike.
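If a job exceeds the per-batch request cap, it has to be split into several batches. A minimal sketch, using the request caps quoted above as constants; the helper name is invented for the example, and it does not check the input-file size limit, which you would need to enforce separately.

```python
# Per-batch request caps as quoted in the answer above.
OPENAI_MAX_REQUESTS = 50_000
ANTHROPIC_MAX_REQUESTS = 100_000


def chunk_requests(requests, max_per_batch):
    """Split a list of requests into consecutive batches of at most max_per_batch."""
    if max_per_batch <= 0:
        raise ValueError("max_per_batch must be positive")
    return [requests[i:i + max_per_batch]
            for i in range(0, len(requests), max_per_batch)]


# 120,000 placeholder requests split under each cap.
requests = list(range(120_000))
print([len(b) for b in chunk_requests(requests, OPENAI_MAX_REQUESTS)])     # [50000, 50000, 20000]
print([len(b) for b in chunk_requests(requests, ANTHROPIC_MAX_REQUESTS)])  # [100000, 20000]
```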
Can I combine batch discounts with prompt caching?
On Anthropic, yes — cache reads inside a batch get both discounts stacked, which makes repeated-prefix bulk jobs extremely cheap. On OpenAI, cached input pricing does not apply to batch requests; the batch discount already applies to the full input. Run the numbers per provider rather than assuming the discounts compose the same way everywhere.
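To see why stacking matters, here is a rough per-provider comparison of the effective input rate. Two things in it are assumptions, not figures from this page: the cache-read multiplier of 0.10 (cached input billed at 10% of the base input rate) and the idea that the two discounts multiply together. Check both against the vendor pricing page before relying on the numbers.

```python
def effective_input_rate(rate_in, batch_discount=0.50, cache_read_multiplier=None):
    """Dollars per million input tokens after batch and (optionally) cache discounts.

    cache_read_multiplier=None means the cache discount does not apply in batch.
    Stacking is modeled as multiplicative, which is an assumption.
    """
    rate = rate_in * (1 - batch_discount)
    if cache_read_multiplier is not None:
        rate *= cache_read_multiplier
    return rate


base = 3.00  # assumed base input rate, dollars per million tokens
batch_only = effective_input_rate(base)
stacked = effective_input_rate(base, cache_read_multiplier=0.10)  # assumed multiplier
print(f"{batch_only:.2f} {stacked:.2f}")  # 1.50 0.15
```

Under these assumptions a cached prefix inside a batch costs a tenth of what the same prefix costs with the batch discount alone, which is why repeated-prefix bulk jobs benefit so much where the discounts do compose.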
FORG tracks this automatically across every agent session — live cost attribution, budgets, and alerts.