GPUs & Hardware
The single most useful infrastructure skill for an application engineer: estimating whether a given model fits on given hardware, and what that costs. It’s mostly arithmetic.
Why GPUs
A CPU has a few dozen cores tuned for fast sequential work. A GPU has thousands of simpler cores tuned for doing the same operation across lots of data at once — exactly the matrix multiplications of a neural network. For LLM work, the spec that matters most isn’t raw speed; it’s memory (VRAM).
The memory math
To serve a model, its parameters must fit in GPU memory. The estimate:
memory for weights ≈ parameters × bytes-per-parameter
| Precision | Bytes/param | A 7B model | A 70B model |
|---|---|---|---|
| FP32 (full) | 4 | ~28 GB | ~280 GB |
| FP16 / BF16 (typical serving) | 2 | ~14 GB | ~140 GB |
| INT8 (quantized) | 1 | ~7 GB | ~70 GB |
| INT4 (quantized) | 0.5 | ~3.5 GB | ~35 GB |

So a 7B model at FP16 needs ~14 GB just for weights. Then add overhead:
- The KV cache — grows with context length and concurrency; can be many GB.
- Activations and framework overhead.
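To make "grows with context length and concurrency" concrete, here is a minimal sketch of the usual KV cache estimate: one key and one value vector per layer, per KV head, per token, per sequence. The function name and the model shape (32 layers, 32 KV heads, head dimension 128, FP16 values) are illustrative assumptions, not figures from this page.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens,
                   concurrent_seqs, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading 2 accounts for storing both a key and a value
    tensor in every layer.
    """
    return (2 * layers * kv_heads * head_dim * bytes_per_value
            * context_tokens * concurrent_seqs)


# Hypothetical 7B-class shape (assumed for illustration only).
per_token = kv_cache_bytes(32, 32, 128, context_tokens=1, concurrent_seqs=1)
total = kv_cache_bytes(32, 32, 128, context_tokens=4096, concurrent_seqs=8)

print(f"{per_token / 1024:.0f} KiB per token")          # 512 KiB per token
print(f"{total / 1e9:.1f} GB for 8 x 4096-token seqs")  # 17.2 GB
```

Under these assumptions the cache for eight concurrent 4096-token sequences is larger than the 14 GB of FP16 weights, which is why the cache cannot be ignored when sizing hardware. Models that share KV heads across query heads (a smaller `kv_heads`) shrink this proportionally.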
A working rule of thumb: budget ~1.5–2× the weight size for real serving. For the 7B/FP16 model that is ~21–28 GB, so it wants a ~24 GB GPU to serve comfortably; a 70B model at FP16 (~140 GB of weights alone) needs multiple GPUs.
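The arithmetic in this section fits in a few lines. The sketch below reproduces the precision table and the 1.5–2× serving budget; the function names are made up for illustration, and "GB" is treated as 10^9 bytes so that billions of parameters map directly to gigabytes.

```python
# Bytes per parameter for each precision in the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion, precision):
    """Weight memory in GB: parameters x bytes-per-parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def serving_budget_gb(params_billion, precision, low=1.5, high=2.0):
    """Rule-of-thumb (low, high) serving budget: 1.5-2x the weight size."""
    weights = weight_gb(params_billion, precision)
    return weights * low, weights * high


print(weight_gb(7, "fp16"))          # 14.0
print(serving_budget_gb(7, "fp16"))  # (21.0, 28.0)
print(weight_gb(70, "int4"))         # 35.0
```

The multiplier is a planning heuristic, not a measurement: long contexts or high concurrency can push the real figure past the upper bound.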
Quantization
Quantization stores weights at lower precision — INT8 or INT4 instead of 16-bit. From the table, INT4 cuts memory ~4× versus FP16. That can move a model from “needs a data-center GPU” to “runs on a consumer card” or even a laptop.
The trade-off is a usually small quality loss. Modern methods (GPTQ, AWQ, GGUF/llama.cpp k-quants) are good enough that 8-bit is often nearly indistinguishable from 16-bit, and 4-bit is acceptable for many uses. Quantization is the main reason open models are practical to self-host.
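To show where the quality loss comes from, here is a deliberately simplified sketch of symmetric per-tensor INT8 quantization in plain Python. It is not GPTQ, AWQ, or a llama.cpp k-quant, all of which are considerably more sophisticated. The function names and sample weights are assumptions for illustration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"max error {max_err:.5f} vs. half step {scale / 2:.5f}")
```

Each weight is stored in one byte instead of two, and the price is a bounded rounding error. Production methods reduce that error further, for example by using separate scales for small groups of weights.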
Training vs. inference
The two have very different appetites:
- Inference — needs memory for weights + KV cache. The focus of most app engineers.
- Training / fine-tuning — also needs gradients, optimizer state, and saved activations: often 3–4× the inference footprint. This is why LoRA, which trains only a tiny adapter, matters so much — it brings fine-tuning down to a single GPU.
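A rough sketch of why the training footprint is a multiple of the weights, and why LoRA avoids most of it. It assumes an Adam-style optimizer that keeps two moment estimates per trainable parameter, stores everything at the same precision, and ignores saved activations. The function names and the 1% adapter size are hypothetical.

```python
def full_finetune_gb(params_billion, bytes_per_param=2.0):
    """Weights + gradients + two optimizer moments, activations excluded."""
    weights = params_billion * bytes_per_param
    gradients = weights            # one gradient per trainable weight
    optimizer_state = 2 * weights  # two moment estimates (assumed Adam-style)
    return weights + gradients + optimizer_state


def lora_finetune_gb(params_billion, adapter_fraction=0.01, bytes_per_param=2.0):
    """Frozen base weights plus full training state for the adapter only."""
    base = params_billion * bytes_per_param
    adapter = params_billion * adapter_fraction * bytes_per_param
    return base + 4 * adapter      # adapter weights + gradients + two moments


print(full_finetune_gb(7))            # 56.0 GB, i.e. 4x the 14 GB of weights
print(round(lora_finetune_gb(7), 2))  # 14.56 GB
```

Under these assumptions, full fine-tuning of a 7B model needs about four times its weight memory before activations, while the LoRA variant stays close to the inference footprint. Setups that keep optimizer state in FP32 alongside 16-bit weights need more than this sketch suggests.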
Choosing hardware
| Need | Typical choice |
|---|---|
| Local dev, small/quantized models | Consumer GPU, or Apple Silicon (unified memory) |
| Serving mid-size models | One data-center GPU (e.g. A10, L40S, A100) |
| Serving large models / training | Multiple high-VRAM GPUs (A100/H100), networked |
| Occasional or spiky workloads | Rent cloud GPUs by the hour |
Key specs, in order: VRAM (does it fit?), memory bandwidth (decode speed), then compute throughput.
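The first two specs can be turned into quick checks. The sketch below assumes that single-stream decoding must read every weight from memory once per generated token, so memory bandwidth divided by weight size gives a rough ceiling on tokens per second. The function names and the 1000 GB/s bandwidth figure are illustrative assumptions, not the spec of any particular GPU.

```python
def fits_in_vram(weights_gb, vram_gb, overhead_factor=1.5):
    """Does the model fit once the serving overhead is budgeted?"""
    return weights_gb * overhead_factor <= vram_gb


def decode_tokens_per_sec_ceiling(weights_gb, bandwidth_gb_per_s):
    """Rough upper bound on single-stream decode speed.

    Assumes each generated token requires reading all weights once.
    """
    return bandwidth_gb_per_s / weights_gb


print(fits_in_vram(14, 24))   # True: a 7B FP16 model on a 24 GB card
print(fits_in_vram(140, 80))  # False: a 70B FP16 model on an 80 GB card
print(round(decode_tokens_per_sec_ceiling(14, 1000), 1))  # 71.4
```

Real throughput lands below this ceiling, and batching changes the picture because many sequences share each pass over the weights. The estimate is still useful for seeing why quantization, which shrinks the bytes read per token, also tends to speed up decoding.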
Cloud, rented, or owned
- Hyperscaler cloud (AWS/GCP/Azure) — flexible, integrated, priciest per GPU-hour.
- Specialized GPU clouds — often cheaper per hour for raw compute.
- Owned hardware — lowest cost if utilization is consistently high; you own capacity planning, failures, and idle time.
Cloud GPUs are scarce and expensive — multi-GPU instances especially. This scarcity, more than anything, is why hosted model APIs win for most teams: the provider amortizes GPUs across many customers far better than you can alone.
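The rent-versus-own decision in the list above is also arithmetic. This is a minimal sketch with entirely hypothetical prices (a $30,000 card, a $2.00/hour rental rate, $0.30/hour for power and hosting, a three-year service life); substitute real quotes before drawing conclusions.

```python
HOURS_PER_YEAR = 365 * 24


def owned_cost_per_used_hour(purchase_price, running_cost_per_hour,
                             utilization, lifetime_years=3):
    """Effective hourly cost of an owned GPU, counting only busy hours."""
    used_hours = lifetime_years * HOURS_PER_YEAR * utilization
    return purchase_price / used_hours + running_cost_per_hour


def break_even_utilization(purchase_price, running_cost_per_hour,
                           rental_rate_per_hour, lifetime_years=3):
    """Utilization above which owning becomes cheaper than renting."""
    lifetime_hours = lifetime_years * HOURS_PER_YEAR
    hourly_saving = rental_rate_per_hour - running_cost_per_hour
    return purchase_price / (lifetime_hours * hourly_saving)


# All figures below are made up for illustration.
threshold = break_even_utilization(30_000, 0.30, 2.00)
print(f"owning wins above {threshold:.0%} utilization")
print(f"at 25% utilization: ${owned_cost_per_used_hour(30_000, 0.30, 0.25):.2f}/hour")
```

With these invented numbers, owning only pays off if the card is busy roughly two thirds of the time, which illustrates why spiky workloads favor renting. The sketch ignores staffing, failures, and resale value.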
Key takeaways
GPUs serve LLMs because the math is massively parallel, and VRAM is the binding constraint. Estimate weight memory as parameters × bytes-per-parameter, then budget ~1.5–2× for the KV cache and overhead. Quantization (INT8/INT4) cuts memory several-fold for a usually-small quality cost, making self-hosting feasible. Training often needs 3–4× the memory of inference. Match hardware to VRAM first, and remember that API providers exist largely because GPUs are scarce.