
GPU VRAM Calculator

Will that model fit on your GPU? VRAM needs by parameter count, quantization and context.

100% client-side, nothing leaves your browser, instant results
146.1 GB

VRAM for a 70B model at fp16, 8k context, batch 1: weights 130.4 GB + KV cache 2.4 GB, +10% overhead. These figures count a gigabyte as 2^30 bytes; in decimal units the weights are 140 GB.

Whether this model fits on common GPUs
Hardware                 Memory   Fits?
RTX 4090                 24 GB    ✗ no
A100 40GB                40 GB    ✗ no
A100 80GB                80 GB    ✗ no
H100 80GB                80 GB    ✗ no
M3 Max 128GB (unified)   128 GB   ✗ no

Formula: weights = params × 2 B; KV cache = 2 (keys and values) × 80 layers × 8 kv-heads × 128 head-dim × 2 B per token, × context × batch; total = (weights + KV cache) × 1.10 overhead. Real frameworks need ~1–2 GB extra headroom.

100% client-side compute
0 uploads (verify in devtools)
96 free tools in the directory
0 network requests per keystroke

How it works

This calculator answers the question every local-LLM experiment starts with: will the model fit? Pick a preset or enter a raw parameter count, choose a quantization level, set your target context length and batch size, and the tool computes the VRAM requirement and checks it against common hardware — RTX 4090, A100 in both 40 and 80 GB flavors, H100, and a 128 GB M3 Max with unified memory.

The formula is shown rather than hidden. Model weights need parameters × bytes-per-parameter: 2 bytes at fp16, 1 at int8, 0.5 at int4. The KV cache (the attention keys and values the model keeps for every token in context) adds 2 × layers × kv-heads × head-dim × 2 bytes per token, where the leading 2 counts keys and values and the trailing 2 bytes is the size of each cached element at 16-bit precision; that per-token figure is then multiplied by context length and batch size. Each preset carries its real architecture constants (layer count, head dimensions, grouped-query attention factor), which is why a Llama 70B with GQA, caching only 8 KV heads per layer, needs far less cache per token than its size suggests. A flat 10% overhead covers activations and framework buffers.
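The formula above can be written out in a few lines of Python. This is a minimal sketch, not the calculator's actual source: the function names are made up for illustration, and two conventions are assumed because they reproduce the 146.1 GB headline figure, namely that "8k" means 8,000 tokens and that results are reported in units of 2^30 bytes.

```python
GIB = 1024 ** 3  # assumption: the page's "GB" figures are 2**30-byte units

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, batch=1, cache_bytes=2):
    """Keys and values (the factor 2) for every layer, KV head and head dimension."""
    per_token = 2 * layers * kv_heads * head_dim * cache_bytes
    return per_token * context_tokens * batch


def estimate_vram_bytes(params, quant, layers, kv_heads, head_dim,
                        context_tokens, batch=1, overhead=0.10):
    """Weights plus KV cache, scaled by a flat overhead factor."""
    weights = params * BYTES_PER_PARAM[quant]
    kv = kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, batch)
    return (weights + kv) * (1 + overhead)


# 70B preset constants taken from the formula line on this page.
total = estimate_vram_bytes(70e9, "fp16", layers=80, kv_heads=8, head_dim=128,
                            context_tokens=8_000)  # assumption: 8k = 8,000 tokens
print(f"{total / GIB:.1f}")
```

Keeping the cache element size as a separate `cache_bytes` argument makes it easy to see what a quantized cache would save without touching the weight precision.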

Two practical insights fall out of the math. First, quantization is the single biggest lever: int4 cuts the weights to a quarter of fp16, turning a 140 GB model into a 35 GB one that fits on a single 48 GB workstation card. Second, at long contexts the KV cache becomes a major share of the total: at int8, the 70B model comes in under 80 GB at 8k context but needs well over 100 GB at 128k, where the cache alone is roughly 40 GB. That is why production servers obsess over cache quantization and paged attention.
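Both effects are easy to see by tabulating the estimate across quantization levels and context lengths. The sketch below is illustrative only: it hard-codes the 70B constants from the formula line, keeps the cache at 2 bytes per element regardless of weight precision, and assumes "8k" and "128k" mean 8,000 and 128,000 tokens with results in 2^30-byte units.

```python
GIB = 1024 ** 3  # assumption: figures reported in 2**30-byte units

PARAMS = 70e9
# Keys and values, 80 layers, 8 KV heads, 128 head-dim, 2 bytes per element.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2


def total_gib(bytes_per_param, context_tokens, batch=1, overhead=0.10):
    """Weights plus KV cache with flat overhead, in 2**30-byte units."""
    weights = PARAMS * bytes_per_param
    kv = KV_BYTES_PER_TOKEN * context_tokens * batch
    return (weights + kv) * (1 + overhead) / GIB


for name, bpp in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    short = total_gib(bpp, 8_000)
    long = total_gib(bpp, 128_000)
    print(f"{name}: {short:6.1f} at 8k, {long:6.1f} at 128k")
```

Reading down the columns shows the quantization lever; reading across a row shows how much of the budget the cache takes once the context grows.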

Estimates here are floors, not ceilings: real frameworks add CUDA context, allocator fragmentation and preallocated pools, so leave a gigabyte or two of headroom. The fits-on table treats multi-GPU setups as additive (two 24 GB cards ≈ one 48 GB budget), which holds for tensor-parallel inference apart from a small communication overhead. If the verdict is close, quantize down a step rather than gambling on an exact fit: out-of-memory errors halfway through a long generation waste more time than a slightly smaller model ever will.
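The fit check, including headroom and the additive multi-GPU assumption, might look like the following. This is a hedged sketch with hypothetical helper names; the 2 GB per-device headroom is an assumed default taken from the upper end of the advice above, and memory sizes are treated as directly comparable to the estimate.

```python
import math

# Memory sizes from the table on this page (assumption: same units as the estimate).
HARDWARE = {
    "RTX 4090": 24,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100 80GB": 80,
    "M3 Max 128GB (unified)": 128,
}


def fits(required, device_mem, n_devices=1, headroom=2.0):
    """True if the estimate fits in pooled memory after reserving per-device headroom."""
    usable = n_devices * (device_mem - headroom)
    return required <= usable


def devices_needed(required, device_mem, headroom=2.0):
    """Smallest device count whose pooled usable memory covers the estimate."""
    return math.ceil(required / (device_mem - headroom))


required = 146.1  # the fp16 70B, 8k-context estimate shown above
for name, mem in HARDWARE.items():
    print(f"{name}: fits on one device = {fits(required, mem)}")

print("80 GB cards needed:", devices_needed(required, 80))
print("24 GB cards needed:", devices_needed(required, 24))
```

Reserving headroom per device rather than once for the pool is the more conservative choice, since each card carries its own runtime context.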

Frequently asked questions

How much quality do I lose with int8 or int4 quantization?

Surprisingly little for most workloads. int8 quantization is generally indistinguishable from fp16 on standard benchmarks, while good int4 schemes (GPTQ, AWQ, GGUF Q4_K_M) typically lose one to three points on reasoning-heavy evaluations. The loss grows for smaller models and for tasks requiring precise numeric output. The pragmatic rule: int4 a 70B before you fp16 a 13B — a bigger quantized model usually beats a smaller full-precision one.

Why does the KV cache grow with context length and batch size?

The model stores a key and a value vector for every attention layer, for every token in the context, for every sequence in the batch. That memory scales linearly with both context and batch: doubling context doubles the KV cache, and so does doubling batch. With long contexts the KV cache can exceed quantized weights: the 70B preset needs roughly 40 GB of 16-bit cache at 128k context, more than its 35 GB of int4 weights and more than a 7B model needs for everything.
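Because the relationship is linear, it can be inverted to ask how much context a given memory budget allows. The helper below is a hypothetical sketch under the same assumptions as the page's formula (10% flat overhead, 2 bytes per cached element, 2^30-byte units for the budget); it is not something the calculator itself exposes.

```python
GIB = 1024 ** 3  # assumption: budget expressed in 2**30-byte units


def max_context_tokens(budget_gib, weight_bytes, kv_bytes_per_token,
                       batch=1, overhead=0.10):
    """Largest context whose weights + KV cache, with overhead, fit the budget."""
    available = budget_gib * GIB / (1 + overhead) - weight_bytes
    if available <= 0:
        return 0  # the weights alone do not fit
    return int(available // (kv_bytes_per_token * batch))


kv_per_token = 2 * 80 * 8 * 128 * 2  # 70B preset: 327,680 bytes per token
int4_weights = 70e9 * 0.5

for batch in (1, 2, 4):
    tokens = max_context_tokens(80, int4_weights, kv_per_token, batch)
    print(f"batch {batch}: up to {tokens} tokens per sequence")
```

Doubling the batch halves the context each sequence can use, which is the same linear trade-off seen from the other side.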

Can I split a model across multiple GPUs?

Yes — tensor parallelism splits each layer across GPUs and pipeline parallelism assigns whole layers to different cards, and frameworks like vLLM, TGI and llama.cpp handle this automatically. The total VRAM requirement stays roughly the same plus a small communication overhead, so two 24 GB cards behave approximately like one 48 GB card. Interconnect bandwidth (NVLink vs PCIe) determines how much throughput you sacrifice.

How does Apple Silicon unified memory compare to discrete GPU VRAM?

Apple's unified memory is shared between CPU and GPU, so a 128 GB M3 Max can hold models no single consumer GPU can: a quantized 70B fits comfortably. The trade-off is bandwidth and compute: an M3 Max delivers roughly 400 GB/s of memory bandwidth versus over 3 TB/s on an H100 SXM, so generation is several times slower. Great for local development and private inference, not for serving traffic.
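A rough way to see why bandwidth matters: during single-sequence generation each new token requires reading essentially all of the weights, so memory bandwidth divided by weight size gives an upper bound on tokens per second. The sketch below applies that rule of thumb to the bandwidth figures quoted above. It is an approximation under stated assumptions: it ignores KV cache reads, compute limits and batching, and the function name is hypothetical.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s, weight_gb):
    """Rough ceiling: one full pass over the weights per generated token."""
    return bandwidth_gb_s / weight_gb


weight_gb = 35.0  # 70B at int4, decimal GB
# Bandwidth figures from the paragraph above; 3000 GB/s is the "over 3 TB/s" lower bound.
m3_max = decode_tokens_per_second_ceiling(400.0, weight_gb)
h100 = decode_tokens_per_second_ceiling(3000.0, weight_gb)

print(f"M3 Max ceiling: {m3_max:.1f} tokens/s")
print(f"H100 ceiling:   {h100:.1f} tokens/s")
print(f"ratio:          {h100 / m3_max:.1f}x")
```

Real throughput lands below these ceilings on both platforms, but the ratio gives a feel for what "several times slower" means in practice.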

Why does my real-world usage exceed this estimate?

The calculator covers weights plus KV cache and adds a documented 10% overhead, but real frameworks allocate extra: CUDA context (~0.5–1 GB), activation buffers, fragmentation from the allocator, and preallocated cache pools in vLLM. Leave 1–2 GB of headroom beyond the figure shown, and more if you run a desktop environment on the same GPU.

Built by FORG — AI cost observability for agentic coding. Free tools, no signup, nothing leaves your browser.
