LLM VRAM Calculator
Estimate the GPU VRAM required to serve a large language model on self-hosted hardware. Get a per-component breakdown (weights; KV cache and overhead) and the smallest fitting GPU instance across A10, A100 80GB SXM, H100, H200, L4, L40, L40S, B200, and B300, at FP16, FP8, or FP4 precision.
Formula
vram_required = (bits_precision / 8) × params_billions × kv_cache_allocation
bits_precision / 8 gives bytes per parameter, so with the parameter count in billions the result is in GB. kv_cache_allocation is a multiplier (for example 1.8×) that adds headroom for the KV cache and runtime overhead on top of the weights.
Source: Inference Engineering by Philip Kiely (Baseten Books, 2026), Fig 5.11, p.142.
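A minimal sketch of this formula in Python (the function and argument names are illustrative, not from the book):

```python
def vram_required_gb(params_billions: float,
                     bits_precision: int = 16,
                     kv_cache_allocation: float = 1.8) -> float:
    """Estimate serving VRAM in GB: bytes/param × params (in billions) × KV-cache multiplier."""
    bytes_per_param = bits_precision / 8
    return bytes_per_param * params_billions * kv_cache_allocation
```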
Supported model presets
Llama 3, Qwen 2.5, Qwen3-235B-A22B, Mistral, Mixtral, DeepSeek-V3.1, GPT-OSS, Gemma — plus a custom parameter-count input.
Worked example
DeepSeek-V3.1 (671B total parameters) at FP8 with a 1.8× KV-cache allocation requires 671 × 1 byte/param × 1.8 ≈ 1208 GB of VRAM, which fits on 8×B200 (1440 GB), reproducing the book's worked example.
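A sketch of reproducing this with the helper above. The per-GPU memory figures are illustrative assumptions, not values from the book (the 180 GB B200 figure is back-calculated from the 8×B200 = 1440 GB configuration cited; B300 is omitted). The selection simply returns the smallest listed configuration whose total memory covers the estimate:

```python
# Illustrative per-GPU memory capacities in GB (assumed, not from the book).
GPU_MEMORY_GB = {"L4": 24, "A10": 24, "L40": 48, "L40S": 48,
                 "A100 80GB SXM": 80, "H100": 80, "H200": 141, "B200": 180}

def smallest_fitting_config(required_gb: float):
    """Return (total_gb, gpu_count, gpu_name) for the smallest configuration that fits."""
    candidates = []
    for name, mem_gb in GPU_MEMORY_GB.items():
        for count in (1, 2, 4, 8):            # common instance sizes
            if count * mem_gb >= required_gb:
                candidates.append((count * mem_gb, count, name))
                break
    return min(candidates) if candidates else None

required = vram_required_gb(params_billions=671, bits_precision=8)  # DeepSeek-V3.1 at FP8
print(round(required))                    # 1208
print(smallest_fitting_config(required))  # (1440, 8, 'B200')
```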