Question 1

How much quality do I lose with int8 or int4 quantization?

Accepted Answer

Surprisingly little for most workloads. int8 quantization is generally indistinguishable from fp16 on standard benchmarks, while good int4 schemes (GPTQ, AWQ, GGUF Q4_K_M) typically lose one to three points on reasoning-heavy evaluations. The loss grows for smaller models and for tasks requiring precise numeric output. The pragmatic rule: run a 70B at int4 before you run a 13B at fp16, because a bigger quantized model usually beats a smaller full-precision one.
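To see why that rule is practical, it helps to compare weight memory for the two options. The sketch below assumes exactly 4 and 16 bits per weight and ignores per-block scales and other quantization metadata, so real int4 files (Q4_K_M in particular) come out somewhat larger. The function name is illustrative, not taken from any library.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Assumption: every parameter takes exactly `bits_per_weight` bits;
    quantization scales and zero points are ignored.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


big_int4 = weight_memory_gb(70, 4)     # 70B model at 4 bits per weight
small_fp16 = weight_memory_gb(13, 16)  # 13B model at 16 bits per weight
print(f"70B int4: {big_int4:.1f} GB, 13B fp16: {small_fp16:.1f} GB")
```

Under these assumptions the 70B int4 weights need about 35 GB and the 13B fp16 weights about 26 GB, so the two choices land in the same hardware class.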

Question 2

Why does the KV cache grow with context length and batch size?

Accepted Answer

The model stores a key and a value vector for every attention layer, for every token in the context, for every sequence in the batch. That memory scales linearly with both context and batch: doubling context doubles the KV cache, and so does doubling batch. With long contexts the KV cache can exceed the weights themselves — a 70B model at 128k context needs more memory for cache than a 7B model needs for everything.
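The linear scaling can be written as a one-line formula. This is a minimal sketch, not the calculator's actual code: the layer count, KV head count and head dimension below are an assumed 70B-class shape with grouped-query attention, and fp16 cache entries (2 bytes each) are also an assumption.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size in bytes.

    The leading 2 counts one key vector plus one value vector
    per layer, per token, per sequence in the batch.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch_size * bytes_per_value)


# Assumed shape: 80 layers, 8 KV heads, head dimension 128, fp16 cache.
base = kv_cache_bytes(80, 8, 128, context_len=128 * 1024, batch_size=1)
double_context = kv_cache_bytes(80, 8, 128, context_len=256 * 1024, batch_size=1)
double_batch = kv_cache_bytes(80, 8, 128, context_len=128 * 1024, batch_size=2)

print(f"128k context, batch 1: {base / 2**30:.1f} GiB")
print(f"doubling context: x{double_context / base:.0f}, "
      f"doubling batch: x{double_batch / base:.0f}")
```

With this assumed shape the cache at 128k context comes to 40 GiB for a single sequence, which is more than the roughly 14 GB of fp16 weights in a 7B model.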

Question 3

Can I split a model across multiple GPUs?

Accepted Answer

Yes — tensor parallelism splits each layer across GPUs and pipeline parallelism assigns whole layers to different cards, and frameworks like vLLM, TGI and llama.cpp handle this automatically. The total VRAM requirement stays roughly the same plus a small communication overhead, so two 24 GB cards behave approximately like one 48 GB card. Interconnect bandwidth (NVLink vs PCIe) determines how much throughput you sacrifice.
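As a rough check on whether a split will fit, you can divide the total requirement evenly across cards and pad it for communication buffers. The 5% overhead below is a placeholder assumption, not a measured figure, and the helper names are hypothetical; real frameworks may also split unevenly.

```python
def per_gpu_gb(total_gb: float, num_gpus: int,
               comm_overhead_frac: float = 0.05) -> float:
    """Approximate memory needed on each GPU for an even split.

    comm_overhead_frac is an assumed allowance for communication
    buffers; the true value depends on the framework and model.
    """
    return total_gb / num_gpus * (1 + comm_overhead_frac)


def fits(total_gb: float, gpu_gb: float, num_gpus: int,
         comm_overhead_frac: float = 0.05) -> bool:
    """True if the even split fits within each card's memory."""
    return per_gpu_gb(total_gb, num_gpus, comm_overhead_frac) <= gpu_gb


# A 40 GB requirement on one 24 GB card versus two of them.
print(fits(40, 24, num_gpus=1))
print(fits(40, 24, num_gpus=2))
```

Here the single card fails and the pair succeeds, which matches the "two 24 GB cards behave approximately like one 48 GB card" approximation.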

Question 4

How does Apple Silicon unified memory compare to discrete GPU VRAM?

Accepted Answer

Apple's unified memory is shared between CPU and GPU, so a 128 GB M3 Max can hold models no single consumer GPU can — a quantized 70B fits comfortably. The trade-off is bandwidth and compute: an M3 Max delivers roughly 400 GB/s of memory bandwidth versus over 3 TB/s on an H100, so generation is several times slower. Great for local development and private inference, not for serving traffic.
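The speed gap follows from a common rule of thumb: single-sequence generation is usually limited by memory bandwidth, because each new token requires reading roughly all the weights once. The sketch below applies that assumption to the bandwidth figures above, taking 3000 GB/s as a round stand-in for "over 3 TB/s" and an assumed 35 GB of quantized 70B weights. It gives an upper bound, not a benchmark; compute limits, KV cache reads and software overhead all lower real numbers.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed for one sequence.

    Assumption: generation is memory-bandwidth bound and every token
    reads all model weights exactly once.
    """
    return bandwidth_gb_s / model_gb


MODEL_GB = 35.0  # assumed size of a 70B model at about 4 bits per weight

m3_max = max_tokens_per_second(400, MODEL_GB)   # ~400 GB/s unified memory
h100 = max_tokens_per_second(3000, MODEL_GB)    # ~3 TB/s HBM, rounded down

print(f"M3 Max: up to {m3_max:.1f} tok/s, H100: up to {h100:.1f} tok/s")
print(f"ratio: {h100 / m3_max:.1f}x")
```

Under these assumptions the H100 bound is 7.5 times the M3 Max bound, in line with "several times slower".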

Question 5

Why does my real-world usage exceed this estimate?

Accepted Answer

The calculator covers weights plus KV cache and adds a documented 10% overhead, but real frameworks allocate extra: CUDA context (~0.5–1 GB), activation buffers, fragmentation from the allocator, and preallocated cache pools in vLLM. Leave 1–2 GB of headroom beyond the figure shown, and more if you run a desktop environment on the same GPU.
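Put together, a budget might look like the sketch below. It assumes the 10% overhead is applied to the sum of weights and KV cache, which is one plausible reading of the description rather than the calculator's confirmed formula, and it uses the upper end of the suggested 1–2 GB headroom. The function name and example sizes are illustrative.

```python
def estimate_vram_gb(weights_gb: float, kv_cache_gb: float,
                     overhead_frac: float = 0.10,
                     headroom_gb: float = 2.0) -> tuple[float, float]:
    """Return (calculator-style figure, figure plus headroom) in GB.

    Assumption: overhead is a flat fraction of weights + KV cache.
    Headroom covers CUDA context, activation buffers and fragmentation.
    """
    figure = (weights_gb + kv_cache_gb) * (1 + overhead_frac)
    return figure, figure + headroom_gb


# Example: 35 GB of weights and 5 GB of KV cache.
figure, budget = estimate_vram_gb(35, 5)
print(f"estimate: {figure:.1f} GB, plan for: {budget:.1f} GB")
```

Raise `headroom_gb` further if the same GPU also drives a desktop environment.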

Hardware	Memory	Fits?
RTX 4090	24 GB	✗ no
A100 40GB	40 GB	✗ no
A100 80GB	80 GB	✗ no
H100 80GB	80 GB	✗ no
M3 Max 128GB (unified)	128 GB	✗ no

GPU VRAM Calculator

