Scaling & Cost
Running one model on one GPU is easy. Serving fluctuating real traffic affordably is the actual job. This page covers scaling inference and the decision that drives the whole budget: API or self-host.
Why scaling LLMs is different
GPU-backed services scale unlike ordinary stateless web apps:
- GPUs are expensive and scarce. An idle GPU still bills; over-provisioning hurts immediately.
- Cold starts are brutal. Loading tens of GB of weights onto a GPU takes seconds to minutes — you can’t spin one up instantly on demand.
- Capacity is lumpy. You add a whole GPU at a time, not a few percent of one.
So you can’t just “autoscale to zero” the way you might a small API.
Scaling approaches
- Horizontal scaling: run more model replicas behind a load balancer and route requests across them. The standard approach for handling more traffic.
- Autoscaling — add and remove replicas with load. Effective, but a new replica’s cold start means you must scale ahead of demand, not after it.
- Keep a warm floor. Maintain a minimum number of always-on replicas so baseline traffic never eats a cold start; autoscale above that floor.
- Queue and smooth. Buffer bursts in a queue so a traffic spike becomes a brief latency increase instead of dropped requests or a scramble for GPUs.
- Separate workloads. Route latency-sensitive interactive traffic and latency-tolerant batch jobs to different pools, tuned differently — see throughput vs. latency.
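
To make "scale ahead of demand" and "keep a warm floor" concrete, here is a minimal sketch of a replica-target calculation. It assumes a naive linear traffic projection over the cold-start window; the function names, the 25% headroom default, and all numbers are illustrative assumptions, not part of any particular autoscaler.

```python
import math


def project_rps(current_rps: float, growth_rps_per_sec: float, cold_start_sec: float) -> float:
    """Project load one cold-start window ahead (hypothetical linear forecast)."""
    return max(0.0, current_rps + growth_rps_per_sec * cold_start_sec)


def desired_replicas(
    forecast_rps: float,
    per_replica_rps: float,
    warm_floor: int,
    headroom: float = 0.25,  # assumed safety margin for spikes
) -> int:
    """Replica target for the forecast load, never dropping below the warm floor."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    needed = math.ceil(forecast_rps * (1 + headroom) / per_replica_rps)
    return max(warm_floor, needed)


# Example: 40 req/s now, growing 0.5 req/s every second, 120 s cold start.
forecast = project_rps(current_rps=40, growth_rps_per_sec=0.5, cold_start_sec=120)
target = desired_replicas(forecast, per_replica_rps=10, warm_floor=2)
```

Sizing for the load expected when a new replica finishes loading, rather than the load right now, is what keeps the cold start off the request path. The warm floor handles the opposite case: when traffic drops to zero, the target stays at the floor instead of zero.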
Capacity planning
Size capacity from real numbers, not vibes:
1. Peak load: requests/sec at the busy hour
2. Per-replica rate: requests/sec one replica sustains at acceptable latency (measure it; batching makes this non-obvious)
3. Replicas needed = peak load ÷ per-replica rate, + headroom for spikes
4. Cost = replicas × GPU-hours per replica × price per GPU-hour
Always size against peak, not average. But a spiky workload sized for peak is exactly the case where renting elastic capacity (or an API) beats owning idle GPUs.
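
The arithmetic above can be sketched in a few lines. Every number here (peak load, per-replica rate, headroom, GPU price, 730 hours per month) is a made-up placeholder; substitute your own measurements.

```python
import math


def replicas_needed(peak_rps: float, per_replica_rps: float, headroom: float = 0.25) -> int:
    """Peak load divided by the measured per-replica rate, plus headroom, rounded up."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    return math.ceil(peak_rps / per_replica_rps * (1 + headroom))


def monthly_cost(
    replicas: int,
    gpus_per_replica: int,
    price_per_gpu_hour: float,
    hours: float = 730,  # assumed hours in an average month
) -> float:
    """Capacity-based cost: paid for every hour, busy or idle."""
    return replicas * gpus_per_replica * hours * price_per_gpu_hour


# Hypothetical inputs: 50 req/s at peak, 4 req/s per replica, $2.00 per GPU-hour.
n = replicas_needed(peak_rps=50, per_replica_rps=4)
cost = monthly_cost(n, gpus_per_replica=1, price_per_gpu_hour=2.0)
```

Rounding up matters because capacity is lumpy: 15.6 replicas of demand means paying for 16.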
API vs. self-hosting: the real decision
This choice dominates AI infrastructure cost and effort.
| Factor | Hosted API | Self-hosting |
|---|---|---|
| Time to start | Minutes | Weeks of setup |
| Ops burden | Provider’s problem | Yours: GPUs, scaling, uptime |
| Cost at low/medium volume | Cheaper — no idle GPUs | Expensive — GPUs sit idle |
| Cost at high, steady volume | Can get expensive | Cheaper if utilization is high |
| Data control | Leaves your environment | Stays in your environment |
| Model choice | Provider’s menu | Any open model; fully customizable |
| Latency floor | Network + provider queue | Tunable; no external dependency |
Pricing models differ too. APIs are usage-based: pay per token, zero when idle — great for spiky or low volume. Self-hosting is capacity-based: pay for GPU-hours whether busy or not — great only when utilization stays high.
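
A rough break-even estimate shows where usage-based and capacity-based pricing cross. This is a sketch under strong assumptions: the prices are invented placeholders, it ignores engineering and operations cost on the self-hosted side, and it assumes the GPU fleet can actually serve the break-even volume.

```python
def api_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Usage-based: cost scales with tokens and is zero when idle."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def self_host_monthly_cost(gpus: int, price_per_gpu_hour: float, hours: float = 730) -> float:
    """Capacity-based: cost is fixed by GPU-hours, regardless of traffic."""
    return gpus * price_per_gpu_hour * hours


def break_even_tokens(
    gpus: int,
    price_per_gpu_hour: float,
    price_per_million_tokens: float,
    hours: float = 730,
) -> float:
    """Monthly token volume at which the API bill equals the GPU bill."""
    fixed = self_host_monthly_cost(gpus, price_per_gpu_hour, hours)
    return fixed / price_per_million_tokens * 1_000_000


# Hypothetical prices: 2 GPUs at $2.00 per GPU-hour vs. $5.00 per million tokens.
threshold = break_even_tokens(gpus=2, price_per_gpu_hour=2.0, price_per_million_tokens=5.0)
```

Below the threshold the API is cheaper because idle GPUs still bill; above it, self-hosting can win, but only if utilization stays high enough to serve that volume on the same hardware.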
Controlling cost, wherever you run
The biggest savings come from the application layer, not the infrastructure; revisit Cost, Latency & Reliability:
- Cache aggressively: a cache hit skips the model call entirely, so it costs almost nothing and returns almost instantly.
- Route by difficulty — small cheap model for easy requests, frontier model only for hard ones.
- Trim tokens — shorter prompts, leaner retrieved context, capped output.
- Right-size the model — the smallest model that passes your evals, not the most capable one available.
- Batch offline work to maximize throughput.
- Attribute and monitor cost per feature and per user, with alerts — AI spend creeps silently.
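
Two of these levers, caching and routing by difficulty, can be combined in one small wrapper. The sketch below is illustrative only: `small_model`, `large_model`, and `is_hard` are hypothetical callables you would supply, and the cache is a plain exact-match dictionary with no eviction or expiry.

```python
from typing import Callable, Dict


def make_cached_router(
    small_model: Callable[[str], str],
    large_model: Callable[[str], str],
    is_hard: Callable[[str], bool],
) -> Callable[[str], str]:
    """Return a function that serves repeats from cache and routes misses by difficulty."""
    cache: Dict[str, str] = {}

    def answer(prompt: str) -> str:
        key = prompt.strip()
        if key in cache:
            # Cache hit: no model call at all.
            return cache[key]
        # Cache miss: use the large model only when the request looks hard.
        model = large_model if is_hard(prompt) else small_model
        result = model(prompt)
        cache[key] = result
        return result

    return answer
```

In practice the difficulty check might be a classifier or a heuristic on prompt length, and the cache would need size limits and invalidation; both are left out here to keep the control flow visible.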
Key takeaways
LLM inference scales unlike ordinary services: GPUs are costly and scarce, and cold starts are slow, so keep a warm floor of replicas, autoscale ahead of demand, and queue bursts. Plan capacity against measured peak load. The API-vs-self-host decision drives the budget: APIs win for spiky or low volume and fast starts; self-hosting wins for steady high volume, data control, or custom models. Default to an API and switch only on a concrete trigger. Most cost savings come from caching, routing, and trimming tokens at the app layer.