Skip to main content

Self-Host vs API Calculator

GPU rental versus API tokens: find your breakeven volume before buying hardware.

100% client-side⛁ prices verified 2026-06-11⌁ zero network calls
M tokens
60%

Hidden costs the curve ignores

  • Ops time: deploys, CUDA driver upgrades, on-call for a GPU box
  • Cold starts & autoscaling lag if you try to scale-to-zero
  • Redundancy — one replica is one outage away from downtime
  • Eval & quality drift work when you quantize to fit VRAM
  • Egress and storage for weights, logs and checkpoints
8111.1M tok/mo

is the breakeven for A100 80GB ×2 vs the API. At your 100M/mo: the API wins$3,650.00 self-hosted (1 replica) vs $45.00 API.

Monthly cost vs volume

self-hostAPI┊ your volume
GPU cost / replica
$3,650.00/mo
Capacity / replica
126.1M/mo
Replicas needed
1
Cost per 1M tokens
$36.50 vs $0.45 API
18
models priced, 4 vendors
2026-06-11
prices verified against vendor pages
90d
price staleness tripwire in CI
0
network requests per keystroke

How it works

The self-hosting question has a clean structure: the API is a variable cost (dollars per million tokens, zero at zero usage) and a GPU is a fixed cost (dollars per hour, whether or not tokens flow). Fixed beats variable only past a crossover volume. This calculator finds that crossover for your model class and GPU configuration, draws both cost curves, and gives a verdict at your stated monthly volume — all client-side, no signup.

The math: a replica costs hourly_rate × 730 per month and serves tok/s × 3,600 × 730 × utilization tokens. When your volume exceeds one replica's capacity, the calculator adds replicas — that is why the self-host curve steps upward instead of staying flat. The API line is simply volume times a blended per-million rate for a comparable hosted open-weight model. Breakeven is where one replica's monthly cost equals the API bill for the same tokens.

Every input here is an estimate and labeled as one. Throughput figures assume modern batched serving (vLLM-class) at sane quantization per configuration, and real numbers vary with sequence length, batch mix and serving stack. The blended API prices assume a 3:1 input-output split on hosted open-weight endpoints. The utilization slider is where most self-hosting business cases quietly die: the spreadsheet assumes 90% and production delivers 40%, which more than doubles the effective cost per token.

The hidden-costs checklist next to the inputs is not decoration — ops time, redundancy, cold starts and quantization-evaluation work routinely add 20–40% to the raw GPU number, and none of it appears on the curve. Treat a thin breakeven margin as a loss. And before deciding anything, measure your actual volume rather than estimating it: most teams guessing "100M tokens a month" are off by 3×. FORG measures real per-session token consumption across your team, which turns this calculator's main input from a guess into a number — and a breakeven analysis built on a measured number is the only kind worth taking to an infrastructure review.

Frequently asked questions

Why does GPU utilization matter so much in this comparison?

Because a rented GPU bills 730 hours a month whether tokens flow or not, while the API bills only for tokens. At 100% theoretical utilization self-hosting looks unbeatable; at the 30-60% real-world figure — bursty traffic, business-hours load, deploy windows — your effective cost per token doubles or triples. The utilization slider is the most honest knob on this page; very few production workloads sustain above 60%.

How does quantization change the economics?

Quantizing weights to int8 or int4 lets a bigger model fit on a smaller GPU — a 70B model squeezes onto a single A100 at int4 — and usually raises throughput too, both of which push the breakeven in self-hosting's favor. The cost is a quality haircut that varies by task and needs evaluation work to bound. The throughput figures here assume sane quantization per configuration.

What does the ops overhead really cost?

The line items people skip: someone owns deploys, CUDA driver and serving-framework upgrades, on-call for the GPU box, capacity planning, and a second replica if you need redundancy — one replica is one hardware fault away from total downtime. A common rule of thumb is to add 20-40% of the raw GPU cost in engineering time. If your breakeven margin is thinner than that, the API wins on the all-in number.

When does the API always win, regardless of volume?

When you need frontier-quality models (closed weights are not self-hostable at all), when traffic is spiky enough that utilization stays low, when you lack anyone to own GPU infrastructure, or when volume is small enough that even one always-on replica is underused. Also during fast model churn: an API model upgrade is a config change, while a self-hosted upgrade is a re-evaluation and redeployment project.

Built by FORG — AI cost observability for agentic coding. Free tools, no signup, nothing leaves your browser.

Learn about FORG