Transcription Cost Calculator

Whisper and 4o-transcribe pricing per minute — price calls, podcasts and meeting archives.

100% client-side · prices verified 2026-06-11 · zero network calls
At 120 minutes per day: $21.92 per month on gpt-4o-transcribe (3,653 minutes, or 60.9 hours, at $0.36/hour).

Rate table — per hour of audio

| Model | $/min | $/hour | Your monthly |
|---|---|---|---|
| gpt-4o-transcribe (selected) | $0.0060 | $0.36 | $21.92 |
| gpt-4o-mini-transcribe | $0.0030 | $0.18 | $10.96 |
| gemini-3.5-live-translate | $0.04 | $2.21 | $134.42 |

Rates verified 2026-06-11. Monthly = 30.44 days when entering minutes/day.

18 models priced, 4 vendors
2026-06-11: prices verified against vendor pages
90d: price staleness tripwire in CI
0 network requests per keystroke

How it works

This calculator prices audio transcription at the unit that matters: the audio minute. Enter your volume as minutes per day or hours per month — whichever matches how you think about your pipeline — pick a model, and you get the monthly bill plus a rate table comparing every model at the same volume. All math runs locally in your browser.
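The calculation described above can be sketched in a few lines. This is a minimal illustration, not the tool's actual source: the function and constant names are hypothetical, the per-minute rates are copied from the rate table on this page, and the 30.44-day month is the convention the page states for daily input.

```python
# Hypothetical names; rates in USD per audio minute, copied from the rate table.
DAYS_PER_MONTH = 30.44  # month length the page states for minutes/day input

RATES_PER_MIN = {
    "gpt-4o-transcribe": 0.0060,
    "gpt-4o-mini-transcribe": 0.0030,
    "gemini-3.5-live-translate": 0.0368,
}


def monthly_minutes(minutes_per_day=None, hours_per_month=None):
    """Convert either input style to audio minutes per month."""
    if (minutes_per_day is None) == (hours_per_month is None):
        raise ValueError("pass exactly one of minutes_per_day or hours_per_month")
    if minutes_per_day is not None:
        return minutes_per_day * DAYS_PER_MONTH
    return hours_per_month * 60


def rate_table(minutes):
    """Monthly cost for every model at the same volume, rounded to cents."""
    return {model: round(minutes * rate, 2) for model, rate in RATES_PER_MIN.items()}


minutes = monthly_minutes(minutes_per_day=120)
table = rate_table(minutes)
for model, cost in table.items():
    print(f"{model}: ${cost:.2f}/month")
```

At 120 minutes per day this reproduces the monthly column of the rate table: $21.92, $10.96 and $134.42.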

The rates compress to simple per-minute figures. gpt-4o-transcribe costs about $0.006 per minute, built from $2.50 per million audio-input tokens plus $10 per million text-output tokens. gpt-4o-mini-transcribe halves that to roughly $0.003 per minute with modestly lower accuracy on noisy audio. gemini-3.5-live-translate sits at about $0.0368 per minute — audio in at $3.50/M, text out at $21/M, at roughly 25 tokens per second of audio — because it is doing live translation, not just transcription. Picking it for plain same-language transcripts means paying 6-12× more than necessary.
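The live-translate figure can be rebuilt from the token prices quoted above. The sketch below assumes that audio input and text output are both metered at roughly 25 tokens per second of audio; the page gives a single token rate and does not say how it splits, so treat that as an assumption that happens to land on the quoted per-minute figure. The function name is hypothetical.

```python
TOKENS_PER_SECOND = 25   # from the text: roughly 25 tokens per second of audio
AUDIO_IN_PER_M = 3.50    # USD per million audio-input tokens
TEXT_OUT_PER_M = 21.00   # USD per million text-output tokens


def per_minute_rate(tokens_per_second, in_price_per_m, out_price_per_m):
    """USD per audio minute.

    ASSUMPTION: input and output are both billed at the same token rate.
    """
    tokens_per_minute = tokens_per_second * 60
    return tokens_per_minute * (in_price_per_m + out_price_per_m) / 1_000_000


gemini = per_minute_rate(TOKENS_PER_SECOND, AUDIO_IN_PER_M, TEXT_OUT_PER_M)
print(f"gemini-3.5-live-translate: ${gemini:.5f}/min")

# Premium over the two transcribe models at their quoted per-minute rates.
print(f"vs gpt-4o-transcribe: {gemini / 0.006:.1f}x")
print(f"vs gpt-4o-mini-transcribe: {gemini / 0.003:.1f}x")
```

Under that assumption the rate comes to $0.03675 per minute, which the page quotes as about $0.0368, and the premium is a little over 6× against gpt-4o-transcribe and a little over 12× against gpt-4o-mini-transcribe.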

The hour framing makes budgets concrete: $0.36, $0.18 and $2.21 per hour respectively. A customer-support team archiving 500 hours of calls monthly pays $180 on the full 4o model or $90 on mini — numbers small enough that accuracy, not price, should usually decide. The calculus flips for always-on workloads: transcribing a 24/7 audio stream is 730 hours a month, where the model choice swings the bill from $131 to $1,614.
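The two scenarios above can be checked with the hourly rates directly. This is a rough sketch with hypothetical names; the always-on case assumes 24 hours a day over the same 30.44-day month used elsewhere on the page, so its totals land within about a dollar of the quoted figures rather than matching them to the cent.

```python
# USD per hour of audio, as quoted in the text.
HOURLY = {
    "gpt-4o-transcribe": 0.36,
    "gpt-4o-mini-transcribe": 0.18,
    "gemini-3.5-live-translate": 2.21,
}


def monthly_bill(hours):
    """Monthly cost per model for a given number of audio hours."""
    return {model: round(hours * rate, 2) for model, rate in HOURLY.items()}


support_archive = monthly_bill(500)

# ASSUMPTION: an always-on stream is 24 h/day over a 30.44-day month.
always_on_hours = 24 * 30.44
always_on = monthly_bill(always_on_hours)

for name, bill in (("500 h archive", support_archive), ("24/7 stream", always_on)):
    for model, cost in bill.items():
        print(f"{name:14s} {model:28s} ${cost:,.2f}")
```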

The estimate deliberately leaves three things out. Silence still bills as minutes, so trim dead air before upload. Second-pass processing (summaries, diarization, formatting via a chat model) adds separate token costs. And verbose output formats with word-level timestamps push the output-token share above the typical assumption baked into the per-minute rate.

Rates were verified on 2026-06-11 against the OpenAI pricing page and Google's Gemini API pricing docs, with the date embedded in the tool source as a documented constant. Monthly projections from daily volume use a 30.44-day month. For measured transcription spend from your actual API traffic — including the re-runs and second passes this estimate cannot see — FORG tracks it per session automatically. The share link keeps your inputs.
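A verification date stored as a constant makes a staleness check straightforward. The sketch below shows one way the 90-day tripwire listed in the page stats could work; the names are hypothetical, and whether the real check trips on day 90 or day 91 is not stated, so the strict comparison here is an assumption.

```python
from datetime import date

PRICES_VERIFIED = date(2026, 6, 11)  # verification date stated on this page
MAX_AGE_DAYS = 90                    # staleness window from the page stats


def prices_are_stale(today, verified=PRICES_VERIFIED, max_age_days=MAX_AGE_DAYS):
    """True once more than max_age_days have passed since verification.

    ASSUMPTION: day 90 itself still passes; the check trips on day 91.
    """
    return (today - verified).days > max_age_days


if __name__ == "__main__":
    # A CI job could exit non-zero here to force a price re-check.
    if prices_are_stale(date.today()):
        raise SystemExit("price table is older than 90 days; re-verify rates")
```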

Frequently asked questions

How is API transcription priced?

Modern transcription models bill audio-input tokens and text-output tokens, but the practical unit is the audio minute. gpt-4o-transcribe works out to roughly $0.006 per minute ($2.50/M audio-input tokens plus $10/M text-output tokens), gpt-4o-mini-transcribe to about $0.003 per minute, and gemini-3.5-live-translate to about $0.0368 per minute.

Why is the Gemini live translate rate so much higher?

Because it does more than transcribe: gemini-3.5-live-translate runs real-time translation over a live audio stream, billing audio input at $3.50/M tokens and text output at $21/M at roughly 25 tokens per second of audio. If you only need same-language transcripts of recorded files, the 4o-transcribe models are 6-12× cheaper for the same minutes.

What does an hour of audio cost to transcribe?

At verified rates, one hour costs about $0.36 on gpt-4o-transcribe, $0.18 on gpt-4o-mini-transcribe and $2.21 on gemini-3.5-live-translate. For scale: transcribing 100 hours of meetings per month runs $36, $18 or $221 respectively — the rate table in the tool shows all three side by side at your exact volume.

What can move my real bill away from these numbers?

Silence and dead air still bill as audio minutes, so untrimmed recordings cost more than their speech content suggests. Diarization or formatting passes that re-run the audio through a second model double the billed minutes. And long text outputs (verbose timestamps, word-level confidence) increase the output-token share. The per-minute figures here assume typical speech density and plain transcript output.
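Two of these effects are easy to fold into a rough estimate. The sketch below is illustrative only: `silence_fraction` (the share of recorded minutes trimmed before upload) and `audio_passes` (how many times the audio is run through a model) are hypothetical parameters, and it simplifies by billing every pass at the same per-minute rate. It does not model the output-token effect of verbose formats.

```python
def adjusted_monthly_cost(recorded_minutes, rate_per_min,
                          silence_fraction=0.0, audio_passes=1):
    """Rough monthly cost after trimming silence and repeating audio passes.

    ASSUMPTIONS: trimmed silence is never billed, and every pass is billed
    at the same per-minute rate.
    """
    if not 0.0 <= silence_fraction < 1.0:
        raise ValueError("silence_fraction must be in [0, 1)")
    if audio_passes < 1:
        raise ValueError("audio_passes must be at least 1")
    billable_minutes = recorded_minutes * (1.0 - silence_fraction)
    return billable_minutes * rate_per_min * audio_passes


minutes = 100 * 60  # 100 hours of meetings
baseline = adjusted_monthly_cost(minutes, 0.006)
trimmed = adjusted_monthly_cost(minutes, 0.006, silence_fraction=0.2)
two_pass = adjusted_monthly_cost(minutes, 0.006, audio_passes=2)
print(f"baseline ${baseline:.2f}, trimmed ${trimmed:.2f}, two passes ${two_pass:.2f}")
```

For 100 hours on gpt-4o-transcribe, trimming a fifth of the audio takes the $36 baseline to about $28.80, while a second audio pass takes it to about $72.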
