Question 1

Why does GPU utilization matter so much in this comparison?

Accepted Answer

Because a rented GPU bills for all 730 hours in a month whether tokens flow or not, while the API bills only for tokens. At a theoretical 100% utilization self-hosting looks unbeatable; at the 30-60% typical of real workloads (bursty traffic, business-hours load, deploy windows), your effective cost per token is roughly 1.7x to 3.3x the full-utilization figure, because the same monthly bill is spread over fewer tokens. The utilization slider is the most honest knob on this page; very few production workloads sustain above 60%.
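To make the utilization effect concrete, here is a minimal sketch of the arithmetic. The GPU price and peak throughput are hypothetical inputs chosen for illustration, not figures from this calculator, and the function name is my own.

```python
HOURS_PER_MONTH = 730  # a rented GPU is billed for every hour of the month


def effective_cost_per_million_tokens(gpu_hourly_usd, peak_tokens_per_sec, utilization):
    """Self-hosted cost per 1M tokens at an average utilization in (0, 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    monthly_cost = gpu_hourly_usd * HOURS_PER_MONTH
    # Tokens actually served: peak rate scaled down by average utilization.
    tokens_per_month = peak_tokens_per_sec * 3600 * HOURS_PER_MONTH * utilization
    return monthly_cost / tokens_per_month * 1_000_000


# Hypothetical inputs, for illustration only.
GPU_HOURLY_USD = 2.00
PEAK_TOKENS_PER_SEC = 1000

for u in (1.0, 0.6, 0.3):
    cost = effective_cost_per_million_tokens(GPU_HOURLY_USD, PEAK_TOKENS_PER_SEC, u)
    print(f"utilization {u:.0%}: ${cost:.2f} per 1M tokens")
```

The monthly bill is fixed, so cost per token scales with 1 / utilization: whatever the inputs, 30% utilization costs about 3.3 times as much per token as 100%.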

Question 2

How does quantization change the economics?

Accepted Answer

Quantizing weights to int8 or int4 lets a bigger model fit on a smaller GPU (a 70B model needs roughly 35 GB for its weights at int4, so it fits on a single 80 GB A100) and usually raises throughput too, both of which push the breakeven in self-hosting's favor. The cost is a quality loss that varies by task and takes evaluation work to bound. The throughput figures here assume a sensible quantization level for each configuration.
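A rough way to check whether a quantized model fits is to multiply parameter count by bytes per parameter. The sketch below does only that; the 20% headroom reserved for the KV cache and activations is an assumption, and real quantization formats add some overhead for scales and zero points that this ignores.

```python
# Nominal bytes per parameter for each weight precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, precision):
    """Approximate weight footprint in GB (1e9 params x bytes per param)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits_on_gpu(params_billions, precision, gpu_vram_gb, headroom_fraction=0.2):
    """True if weights fit after reserving headroom (assumed 20%) for KV cache etc."""
    usable_gb = gpu_vram_gb * (1 - headroom_fraction)
    return weight_memory_gb(params_billions, precision) <= usable_gb


for precision in ("fp16", "int8", "int4"):
    gb = weight_memory_gb(70, precision)
    verdict = "fits" if fits_on_gpu(70, precision, 80) else "does not fit"
    print(f"70B at {precision}: {gb:.0f} GB of weights, {verdict} on one 80 GB GPU")
```

Under these assumptions only the int4 variant of a 70B model fits on a single 80 GB card; fp16 and int8 would need more than one GPU.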

Question 3

What does the ops overhead really cost?

Accepted Answer

The line items people skip: someone owns deploys, CUDA driver and serving-framework upgrades, on-call for the GPU box, capacity planning, and a second replica if you need redundancy — one replica is one hardware fault away from total downtime. A common rule of thumb is to add 20-40% of the raw GPU cost in engineering time. If your breakeven margin is thinner than that, the API wins on the all-in number.
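The rule of thumb above can be folded into the breakeven calculation as a simple multiplier. This is a sketch under stated assumptions: the GPU price, API price, replica count, and the 30% overhead (the midpoint of the 20-40% range) are hypothetical inputs, and the function names are illustrative.

```python
HOURS_PER_MONTH = 730


def all_in_monthly_cost(gpu_hourly_usd, replicas=1, ops_overhead=0.3):
    """Raw GPU rental for all replicas plus engineering time as a fraction of it."""
    raw_gpu_cost = gpu_hourly_usd * HOURS_PER_MONTH * replicas
    return raw_gpu_cost * (1 + ops_overhead)


def breakeven_tokens_per_month(self_host_monthly_usd, api_usd_per_million_tokens):
    """Monthly token volume at which the API bill equals the self-hosting bill."""
    return self_host_monthly_usd / api_usd_per_million_tokens * 1_000_000


# Hypothetical inputs: $2.00/hour GPU, two replicas for redundancy, $1.00 per 1M API tokens.
raw_only = all_in_monthly_cost(2.00, replicas=2, ops_overhead=0.0)
all_in = all_in_monthly_cost(2.00, replicas=2, ops_overhead=0.3)

print(f"raw GPU cost: ${raw_only:,.0f}/month")
print(f"all-in cost:  ${all_in:,.0f}/month")
print(f"breakeven:    {breakeven_tokens_per_month(all_in, 1.00):,.0f} tokens/month")
```

Adding the second replica and the overhead moves the breakeven volume up in direct proportion, which is why a thin margin on raw GPU cost can flip to the API's favor on the all-in number.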

Question 4

When does the API always win, regardless of volume?

Accepted Answer

When you need frontier-quality models (closed weights are not self-hostable at all), when traffic is spiky enough that utilization stays low, when you lack anyone to own GPU infrastructure, or when volume is small enough that even one always-on replica is underused. Also during fast model churn: an API model upgrade is a config change, while a self-hosted upgrade is a re-evaluation and redeployment project.

Self-Host vs API Calculator
