Scaling & Cost

Running one model on one GPU is easy. Serving fluctuating real traffic affordably is the actual job. This page covers scaling inference and the decision that drives the whole budget: API or self-host.

Why scaling LLMs is different

GPU-backed services scale unlike ordinary stateless web apps:

GPUs are expensive and scarce. An idle GPU still bills; over-provisioning hurts immediately.
Cold starts are brutal. Loading tens of GB of weights onto a GPU takes seconds to minutes — you can’t spin one up instantly on demand.
Capacity is lumpy. You add a whole GPU at a time, not a few percent of one.

So you can’t just “autoscale to zero” the way you might a small API.

Scaling approaches

Horizontal scaling — run more model replicas behind a load balancer; route requests across them. The standard approach for handling more traffic.
Autoscaling — add and remove replicas with load. Effective, but a new replica’s cold start means you must scale ahead of demand, not after it.
Keep a warm floor. Maintain a minimum number of always-on replicas so baseline traffic never eats a cold start; autoscale above that floor.
Queue and smooth. Buffer bursts in a queue so a traffic spike becomes a brief latency increase instead of dropped requests or a scramble for GPUs.
Separate workloads. Route latency-sensitive interactive traffic and latency-tolerant batch jobs to different pools, tuned differently — see throughput vs. latency.
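
As a rough illustration of "scale ahead of demand" combined with a warm floor, the sketch below sizes the replica count for the load expected once a cold start has finished, rather than for the load right now. The linear forecast, the function names, and every number are assumptions made for illustration; a real autoscaler would use its own metrics and forecasting.

```python
import math


def forecast_rps(current_rps: float, growth_rps_per_s: float, cold_start_s: float) -> float:
    """Naive linear forecast of load at the moment a replica started now becomes ready."""
    return max(0.0, current_rps + growth_rps_per_s * cold_start_s)


def desired_replicas(expected_rps: float, per_replica_rps: float,
                     warm_floor: int, max_replicas: int, headroom: float = 0.2) -> int:
    """Replica count to request now, clamped between the warm floor and a hard cap."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    needed = math.ceil(expected_rps * (1 + headroom) / per_replica_rps)
    # Never drop below the always-on floor; never exceed the budgeted maximum.
    return max(warm_floor, min(needed, max_replicas))


# Illustrative numbers only: 30 req/s now, growing 0.1 req/s each second, 120 s cold start.
expected = forecast_rps(current_rps=30, growth_rps_per_s=0.1, cold_start_s=120)
target = desired_replicas(expected, per_replica_rps=10, warm_floor=2, max_replicas=8)
print(f"expected load: {expected:.1f} req/s, target replicas: {target}")
```

The floor keeps baseline traffic away from cold starts even when the forecast drops to zero, and the cap bounds spend when the forecast overshoots.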

Capacity planning

Size capacity from real numbers, not vibes:

1. Peak load:        requests/sec at the busy hour
2. Per-replica rate: requests/sec one replica sustains at acceptable latency
                     (measure it — batching makes this non-obvious)
3. Replicas needed = peak load ÷ per-replica rate, + headroom for spikes
4. Cost = replicas × GPUs per replica × hours running × price per GPU-hour

Always size against peak, not average. The catch: a spiky workload sized for peak leaves GPUs idle the rest of the time, which is exactly the case where renting elastic capacity (or using an API) beats owning idle GPUs.

API vs. self-hosting: the real decision

This choice dominates AI infrastructure cost and effort.

Factor	Hosted API	Self-hosting
Time to start	Minutes	Weeks of setup
Ops burden	Provider’s problem	Yours: GPUs, scaling, uptime
Cost at low/medium volume	Cheaper — no idle GPUs	Expensive — GPUs sit idle
Cost at high, steady volume	Can get expensive	Cheaper if utilization is high
Data control	Leaves your environment	Stays in your environment
Model choice	Provider’s menu	Any open model; fully customizable
Latency floor	Network + provider queue	Tunable; no external dependency

Pricing models differ too. APIs are usage-based: pay per token, zero when idle — great for spiky or low volume. Self-hosting is capacity-based: pay for GPU-hours whether busy or not — great only when utilization stays high.
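
One way to make the comparison concrete is to compute the monthly token volume at which the two pricing models cost the same. This is a simplified sketch: the prices are placeholders rather than any provider's real rates, and it ignores engineering time and assumes the self-hosted GPUs can actually serve the volume in question.

```python
def api_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Usage-based: cost scales with tokens and is zero when idle."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def self_host_monthly_cost(gpus: int, price_per_gpu_hour: float, hours: float = 730) -> float:
    """Capacity-based: cost is fixed by GPU-hours, regardless of traffic."""
    return gpus * price_per_gpu_hour * hours


def break_even_tokens(gpus: int, price_per_gpu_hour: float,
                      price_per_million_tokens: float, hours: float = 730) -> float:
    """Monthly token volume above which the fixed GPU bill is cheaper than the API."""
    fixed = self_host_monthly_cost(gpus, price_per_gpu_hour, hours)
    return fixed / price_per_million_tokens * 1_000_000


# Placeholder prices: 2 GPUs at 2.00 per GPU-hour vs. 5.00 per million tokens.
threshold = break_even_tokens(gpus=2, price_per_gpu_hour=2.0, price_per_million_tokens=5.0)
print(f"break-even at roughly {threshold / 1e6:,.0f} million tokens per month")
```

Below the threshold the API is cheaper because you pay nothing for idle time; above it, self-hosting wins only if utilization stays high enough to serve that volume on the GPUs you are paying for.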

Controlling cost, wherever you run

The biggest savings come from the application layer, not the infrastructure — revisit Cost, Latency & Reliability:

Cache aggressively: a cache hit skips inference entirely, so it costs almost nothing and returns almost immediately.
Route by difficulty — small cheap model for easy requests, frontier model only for hard ones.
Trim tokens — shorter prompts, leaner retrieved context, capped output.
Right-size the model — the smallest model that passes your evals, not the most capable one available.
Batch offline work to maximize throughput.
Attribute and monitor cost per feature and per user, with alerts — AI spend creeps silently.
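
The first two items, caching and routing by difficulty, can be combined in a thin wrapper around the model calls. The sketch below is a minimal illustration: `CachedRouter`, the stand-in model callables, and the word-count difficulty check are all hypothetical, and an exact-match cache like this only helps when identical prompts recur.

```python
import hashlib
from typing import Callable, Dict

Model = Callable[[str], str]  # stand-in for any function that calls a model


class CachedRouter:
    """Exact-match response cache in front of a small/large model router."""

    def __init__(self, small: Model, large: Model, is_hard: Callable[[str], bool]) -> None:
        self.small = small
        self.large = large
        self.is_hard = is_hard
        self.cache: Dict[str, str] = {}
        self.cache_hits = 0
        self.large_calls = 0

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:
            # Cache hit: no model call at all.
            self.cache_hits += 1
            return self.cache[key]
        if self.is_hard(prompt):
            # Only hard requests reach the expensive model.
            self.large_calls += 1
            answer = self.large(prompt)
        else:
            answer = self.small(prompt)
        self.cache[key] = answer
        return answer


# Hypothetical stand-ins: real code would call actual models and use a real classifier.
router = CachedRouter(
    small=lambda p: "small:" + p,
    large=lambda p: "large:" + p,
    is_hard=lambda p: len(p.split()) > 50,
)
router.complete("What is a GPU?")
router.complete("What is a GPU?")  # served from the cache
print(router.cache_hits, router.large_calls)
```

The counters double as a starting point for the cost attribution in the last item: tracking hits and large-model calls per feature shows where spend actually goes.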

Key takeaways

LLM inference scales unlike ordinary services: GPUs are costly and scarce, and cold starts are slow, so keep a warm floor of replicas, autoscale ahead of demand, and queue bursts. Plan capacity against measured peak load. The API-vs-self-host decision drives the budget: APIs win for spiky or low volume and fast starts; self-hosting wins for steady high volume, data control, or custom models. Default to an API and switch only on a concrete trigger, such as sustained high utilization, a data-control requirement, or a need for a custom model. Most cost savings come from caching, routing, and trimming tokens at the app layer.