Serving & Inference
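Inference is running a trained model to produce output: the serving side, as opposed to training. Serving LLMs efficiently is its own discipline, because an LLM request has an unusual performance profile: one parallel pass over the prompt, followed by a sequential loop that generates the response token by token.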
Two phases: prefill and decode
A single LLM request runs in two distinct phases with very different performance profiles:
- Prefill processes the entire prompt in one highly parallel pass, so it is fast per token even for long prompts. It determines the time to first token.
- Decode is inherently sequential: token N+1 can’t start until token N exists. This is why a long response usually takes longer than an equally long prompt, and why output tokens are typically priced higher than input tokens.
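As a rough illustration of why the two phases feel so different, the sketch below estimates request latency from two throughput figures. The function name `request_latency_s` and both rates (5,000 prompt tokens per second for prefill, 50 tokens per second for decode) are made-up assumptions for the example, not measurements of any real system.

```python
def request_latency_s(prompt_tokens, output_tokens,
                      prefill_tokens_per_s=5000.0,  # assumed, illustrative
                      decode_tokens_per_s=50.0):    # assumed, illustrative
    """Estimate end-to-end latency of one request, in seconds."""
    # Prefill handles the whole prompt in parallel: time to first token.
    time_to_first_token = prompt_tokens / prefill_tokens_per_s
    # Decode emits one token per step, strictly in order.
    decode_time = output_tokens / decode_tokens_per_s
    return time_to_first_token + decode_time


long_prompt = request_latency_s(prompt_tokens=2000, output_tokens=100)
long_response = request_latency_s(prompt_tokens=100, output_tokens=2000)
print(f"long prompt, short response: {long_prompt:.2f} s")
print(f"short prompt, long response: {long_response:.2f} s")
```

Under these assumed rates, the same 2,100 tokens take about 2.4 s when most of them are prompt and about 40 s when most of them are output.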
The KV cache
During decoding, each new token’s attention needs the keys and values of every previous token. Recomputing those each step would be hugely wasteful, so they are stored in what is called the KV cache.
The KV cache saves compute, but it costs memory: it grows linearly with both sequence length and the number of concurrent requests. On a serving GPU, the KV cache, not the model weights, is often what limits how many users you can handle at once. Long contexts are expensive partly because their KV cache is large.
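To get a feel for the numbers, the cache size per token can be estimated as 2 (one key and one value) × layers × KV heads × head dimension × bytes per value. The sketch below applies that estimate to a hypothetical model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values). The function name and the configuration are illustrative assumptions, and real engines add their own overheads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    """Rough KV cache size in bytes for batch_size sequences of seq_len tokens."""
    # Factor of 2: each token stores one key and one value vector
    # per layer and per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


GIB = 1024 ** 3
# Hypothetical model shape, chosen only to make the arithmetic concrete.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

one_request = kv_cache_bytes(**shape, seq_len=4096, batch_size=1)
sixteen_requests = kv_cache_bytes(**shape, seq_len=4096, batch_size=16)
print(one_request / GIB)       # 2.0
print(sixteen_requests / GIB)  # 32.0
```

With this shape, each token costs 0.5 MiB of cache, so sixteen concurrent 4,096-token requests need 32 GiB for the cache alone.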
Batching
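A GPU processing one request at a time is mostly idle: each decode step has to read the model weights from memory to produce a single token, so the compute units spend most of their time waiting. Batching runs many requests together so that one pass over the weights serves all of them, using the hardware far better.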
Naive (static) batching makes everyone wait for the slowest request in the batch. Modern inference servers use continuous batching (a.k.a. in-flight batching): finished requests leave the batch and new ones join mid-flight, so the GPU stays saturated. This is one of the biggest real-world throughput wins.
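The difference can be seen in a toy simulation. The sketch below assumes that every step generates one token for each in-flight request, that all requests are already queued, and that prefill is free. The function names and the workload are hypothetical, and real schedulers are considerably more involved.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Each in-flight request emits one token; finished ones leave.
        running = [left - 1 for left in running if left > 1]
    return steps


# Output lengths in tokens: two long requests mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 105
```

In this workload the short requests no longer wait for the long one in their batch, and the second long request starts as soon as a slot frees up, so the same work finishes in roughly half the steps.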
Inference servers
You don’t write inference loops by hand; you use a serving engine that bundles these optimizations:
| Engine | Notes |
|---|---|
| vLLM | Popular open-source server; continuous batching, efficient KV-cache paging |
| TGI | Hugging Face’s Text Generation Inference server |
| TensorRT-LLM | NVIDIA’s highly optimized stack |
| Ollama / llama.cpp | Local and laptop-scale serving, CPU/GPU |
These engines apply optimizations such as PagedAttention (managing the KV cache in fixed-size blocks, much like OS virtual memory pages, to cut wasted memory), quantization (smaller weights; see GPUs & Hardware), and speculative decoding (a small model drafts several tokens, and the big model verifies them at once).
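Speculative decoding is the least intuitive of these, so here is a minimal sketch of its greedy variant. The two "models" are toy functions invented for the example, and the verification loop stands in for what a real engine would do in a single batched forward pass of the large model. Production engines also typically use a sampling-based acceptance rule and do not call models this way.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding. Returns (tokens, target_passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens, one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks every drafted position. This is
        #    counted as one pass because a real engine scores all
        #    positions in a single forward pass.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)  # always keep the target's choice
            if draft[i] != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft token matched, so the pass yields a bonus token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


# Toy models (hypothetical): the target counts upward modulo 10, and the
# draft agrees with it except right after the token 6.
def target_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


baseline = greedy_decode(target_next, [0], 12)
spec_tokens, passes = speculative_decode(target_next, draft_next, [0], 12)
print(spec_tokens == baseline)  # True
print(passes)                   # 3
```

The output is identical to plain greedy decoding with the target model, but it takes 3 target passes instead of 12 because most drafted tokens are accepted.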
Throughput vs. latency
The central serving trade-off:
- Latency — how fast one request completes. What an interactive user feels.
- Throughput — how many requests per second across all users. What sets your cost-per-request.
Bigger batches raise throughput (cheaper per request) but raise latency (each request waits longer). Tune to the workload:
| Workload | Optimize for | Approach |
|---|---|---|
| Interactive chat | Latency | Smaller batches, stream tokens |
| Bulk / offline jobs | Throughput | Large batches, latency irrelevant |
| Mixed traffic | Balanced | Continuous batching, maybe separate pools |
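The trade-off can be made concrete with a toy cost model. The sketch below assumes each decode step costs a fixed amount (reading the weights once) plus a small amount per request in the batch. The function name and both constants are invented for illustration, and real step times depend on the model, the hardware, and the sequence lengths.

```python
def batch_tradeoff(batch_size, fixed_ms=20.0, per_request_ms=0.5):
    """Return (per-token latency in ms, total tokens per second).

    Toy model: one decode step costs fixed_ms plus per_request_ms for
    each request in the batch. Both constants are assumptions.
    """
    step_ms = fixed_ms + per_request_ms * batch_size
    tokens_per_s = batch_size * 1000.0 / step_ms
    return step_ms, tokens_per_s


for batch_size in (1, 8, 64):
    step_ms, tokens_per_s = batch_tradeoff(batch_size)
    print(f"batch={batch_size:>2}  latency/token={step_ms:.1f} ms  "
          f"throughput={tokens_per_s:.0f} tokens/s")
```

Under these assumptions, going from a batch of 1 to a batch of 64 raises per-token latency from 20.5 ms to 52 ms while total throughput rises from roughly 49 to roughly 1,231 tokens per second.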
Key takeaways
- LLM inference splits into a fast, parallel prefill phase and a slow, sequential decode phase, which is why responses stream and output costs more.
- The KV cache speeds decoding but consumes memory that limits concurrency.
- Continuous batching keeps the GPU saturated and is a major throughput win.
- Use a serving engine like vLLM rather than rolling your own.
- Throughput and latency trade off; tune batching to whether humans are waiting.