Scaling & Cost

Running one model on one GPU is easy. Serving fluctuating real traffic affordably is the actual job. This page covers scaling inference and the decision that drives the whole budget: API or self-host.

GPU-backed services scale unlike ordinary stateless web apps:

  • GPUs are expensive and scarce. An idle GPU still bills; over-provisioning hurts immediately.
  • Cold starts are brutal. Loading tens of GB of weights onto a GPU takes seconds to minutes — you can’t spin one up instantly on demand.
  • Capacity is lumpy. You add a whole GPU at a time, not a few percent of one.

So you can’t just “autoscale to zero” the way you might a small API.

  • Horizontal scaling — run more model replicas behind a load balancer; route requests across them. The standard approach for handling more traffic.
  • Autoscaling — add and remove replicas with load. Effective, but a new replica’s cold start means you must scale ahead of demand, not after it.
  • Keep a warm floor. Maintain a minimum number of always-on replicas so baseline traffic never eats a cold start; autoscale above that floor.
  • Queue and smooth. Buffer bursts in a queue so a traffic spike becomes a brief latency increase instead of dropped requests or a scramble for GPUs.
  • Separate workloads. Route latency-sensitive interactive traffic and latency-tolerant batch jobs to different pools, tuned differently — see throughput vs. latency.
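The warm floor and scale-ahead ideas above can be sketched as a replica-target calculation. This is a minimal illustration, not any particular autoscaler's algorithm: the function names, the linear forecast, and the `headroom` parameter are all assumptions made for the example.

```python
import math


def forecast_load(current_rps: float, growth_rps_per_s: float, cold_start_s: float) -> float:
    """Naive linear forecast (an assumption): the load expected by the time
    a replica started now has finished its cold start."""
    return max(0.0, current_rps + growth_rps_per_s * cold_start_s)


def desired_replicas(expected_rps: float, per_replica_rps: float,
                     warm_floor: int, max_replicas: int,
                     headroom: float = 0.25) -> int:
    """Replica target for the forecast load, clamped between the warm floor
    and a hypothetical budget ceiling (max_replicas)."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    needed = math.ceil(expected_rps * (1 + headroom) / per_replica_rps)
    # Never drop below the always-on floor; never exceed the ceiling.
    return max(warm_floor, min(max_replicas, needed))


# Example: 40 req/s now, growing 0.5 req/s each second, 120 s cold start.
expected = forecast_load(40.0, 0.5, 120.0)
target = desired_replicas(expected, per_replica_rps=10.0, warm_floor=2, max_replicas=20)
```

Sizing for the load expected one cold-start interval ahead, rather than the current load, is what lets new replicas be ready when the traffic arrives. A real forecast would likely use traffic history instead of a straight line.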

Size capacity from real numbers, not vibes:

1. Peak load: requests/sec at the busy hour
2. Per-replica rate: requests/sec one replica sustains at acceptable latency (measure it; batching makes this non-obvious)
3. Replicas needed = peak load ÷ per-replica rate, rounded up, plus headroom for spikes
4. Cost = replicas × GPU-hours per replica × price per GPU-hour

Always size against peak, not average — but a spiky workload sized for peak is exactly the case where renting elastic capacity (or an API) beats owning idle GPUs.
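The sizing steps above can be sketched as a small calculation. The numbers are made up for illustration, and `headroom`, `gpus_per_replica`, and the 730-hour month are assumptions of this sketch rather than values from the text.

```python
import math


def replicas_needed(peak_rps: float, per_replica_rps: float, headroom: float = 0.25) -> int:
    """Step 3: peak load divided by per-replica rate, plus headroom, rounded up."""
    return math.ceil(peak_rps / per_replica_rps * (1 + headroom))


def monthly_cost(replicas: int, gpus_per_replica: int,
                 price_per_gpu_hour: float, hours: float = 730.0) -> float:
    """Step 4: replicas x GPU-hours x price, for always-on capacity."""
    return replicas * gpus_per_replica * hours * price_per_gpu_hour


def average_utilization(average_rps: float, replicas: int, per_replica_rps: float) -> float:
    """Fraction of provisioned capacity used at average load."""
    return average_rps / (replicas * per_replica_rps)


# Hypothetical workload: 50 req/s at peak, 10 req/s on average,
# 8 req/s per replica (measured), 1 GPU per replica at $2.00/hour.
n = replicas_needed(peak_rps=50, per_replica_rps=8)
cost = monthly_cost(n, gpus_per_replica=1, price_per_gpu_hour=2.0)
util = average_utilization(average_rps=10, replicas=n, per_replica_rps=8)
print(f"{n} replicas, ${cost:,.0f}/month, {util:.0%} average utilization")
```

A low average utilization in a calculation like this is the signal that the workload is spiky enough for elastic capacity or an API to be worth pricing out.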

The choice between a hosted API and self-hosting dominates AI infrastructure cost and effort.

Factor                      | Hosted API                | Self-hosting
----------------------------|---------------------------|----------------------------------
Time to start               | Minutes                   | Weeks of setup
Ops burden                  | Provider’s problem        | Yours: GPUs, scaling, uptime
Cost at low/medium volume   | Cheaper: no idle GPUs     | Expensive: GPUs sit idle
Cost at high, steady volume | Can get expensive         | Cheaper if utilization is high
Data control                | Leaves your environment   | Stays in your environment
Model choice                | Provider’s menu           | Any open model; fully customizable
Latency floor               | Network + provider queue  | Tunable; no external dependency

Pricing models differ too. APIs are usage-based: pay per token, zero when idle — great for spiky or low volume. Self-hosting is capacity-based: pay for GPU-hours whether busy or not — great only when utilization stays high.
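The difference between usage-based and capacity-based pricing can be made concrete with a rough break-even estimate. All prices below are placeholders, not real quotes, and the sketch deliberately ignores engineering time, ops overhead, and whether the GPUs can actually serve the break-even volume.

```python
def api_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Usage-based: cost scales with tokens and is zero when idle."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def self_host_monthly_cost(gpus: int, price_per_gpu_hour: float, hours: float = 730.0) -> float:
    """Capacity-based: GPU-hours are billed whether busy or not."""
    return gpus * price_per_gpu_hour * hours


def break_even_tokens(gpus: int, price_per_gpu_hour: float,
                      price_per_million_tokens: float, hours: float = 730.0) -> float:
    """Monthly token volume at which the two bills are equal."""
    fixed = self_host_monthly_cost(gpus, price_per_gpu_hour, hours)
    return fixed / price_per_million_tokens * 1_000_000


# Hypothetical: 2 GPUs at $2.00/hour versus an API at $4.00 per million tokens.
threshold = break_even_tokens(gpus=2, price_per_gpu_hour=2.0, price_per_million_tokens=4.0)
print(f"Break-even at about {threshold / 1e6:,.0f}M tokens/month")
```

Below the threshold the API bill is smaller; above it, self-hosting starts to pay off only if the fleet can sustain that volume at acceptable latency.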

The biggest savings come from the application layer, not the infrastructure — revisit Cost, Latency & Reliability:

  • Cache aggressively — a cache hit costs nothing and is instant.
  • Route by difficulty — small cheap model for easy requests, frontier model only for hard ones.
  • Trim tokens — shorter prompts, leaner retrieved context, capped output.
  • Right-size the model — the smallest model that passes your evals, not the most capable one available.
  • Batch offline work to maximize throughput.
  • Attribute and monitor cost per feature and per user, with alerts — AI spend creeps silently.
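Two of the levers above, caching and routing by difficulty, can be combined in a thin wrapper around the model calls. This is a sketch under stated assumptions: `CachedRouter`, the exact-match cache with whitespace and case normalization, and the `is_hard` callback are all hypothetical, and the model callables stand in for whatever clients an application really uses.

```python
from typing import Callable, Dict


class CachedRouter:
    """Check a cache first, then send the prompt to a cheap or expensive model."""

    def __init__(self,
                 small_model: Callable[[str], str],
                 large_model: Callable[[str], str],
                 is_hard: Callable[[str], bool]) -> None:
        self._models = {"small": small_model, "large": large_model}
        self._is_hard = is_hard            # placeholder difficulty check
        self._cache: Dict[str, str] = {}
        self.calls = {"cache": 0, "small": 0, "large": 0}  # per-tier counters

    def answer(self, prompt: str) -> str:
        # Normalize so trivially different prompts share one cache entry.
        key = " ".join(prompt.lower().split())
        if key in self._cache:
            self.calls["cache"] += 1
            return self._cache[key]
        tier = "large" if self._is_hard(prompt) else "small"
        result = self._models[tier](prompt)
        self._cache[key] = result
        self.calls[tier] += 1
        return result
```

The `calls` counters are a starting point for cost attribution: multiplying each tier's count by its unit cost gives a rough per-feature bill. In practice the difficulty check would be validated against evals, and the cache would need expiry and a policy for prompts that must not be cached.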

LLM inference scales unlike ordinary services: GPUs are costly and scarce, and cold starts are slow — so keep a warm floor of replicas, autoscale ahead of demand, and queue bursts. Plan capacity against measured peak load. The API-vs-self-host decision drives the budget: APIs win for spiky or low volume and fast starts; self-hosting wins for steady high volume, data control, or custom models. Default to an API and switch only on a concrete trigger. Most cost savings come from caching, routing, and trimming tokens at the app layer.