Skip to content
About

Running Models Locally

Running a capable LLM on your own laptop or server is now routine. It’s worth doing — for privacy, cost, offline use, and the genuine intuition you gain from operating a model directly.

  • Privacy — data never leaves your machine. Decisive for sensitive content.
  • Cost — no per-token bill; once the hardware exists, inference is “free.”
  • Offline — works with no network, in air-gapped environments.
  • Learning — nothing builds intuition like running, quantizing, and prompting a model yourself.
  • No rate limits — bounded only by your hardware.

The trade-off: a model that fits on consumer hardware is smaller and less capable than a frontier API model. Local is excellent for many tasks — and not a drop-in replacement for the strongest closed models.

The easiest entry point. One command pulls and runs a model, with a local API server that mimics common LLM APIs.

Terminal window
ollama run llama3.2 # download + chat, interactively
# A local OpenAI-compatible endpoint also starts at localhost:11434

Your application code can point at that local endpoint with almost no changes — ideal for development and for swapping between local and hosted models.

The engine under much local inference, Ollama included. A highly optimized C++ runtime that runs models efficiently on CPUs, consumer GPUs, and Apple Silicon. Use it directly when you want maximum control or to embed inference in an app.

Desktop apps like LM Studio and Jan offer a GUI over the same engines; vLLM handles serious local serving on a GPU.

Local models are usually distributed as GGUF files — a format built for efficient CPU/GPU inference that bundles the model and its quantization.

Quantization is what makes local viable: 4-bit weights cut memory ~4× versus 16-bit, moving a model from “data-center GPU” to “your laptop.” You’ll pick a quantization level — a trade-off:

QuantizationMemoryQualityUse when
8-bit (Q8)HighestNear-originalYou have the RAM/VRAM to spare
4–5-bit (Q4/Q5)ModerateVery goodThe common sweet spot
2–3-bitLowestNoticeably degradedOnly if nothing else fits

Q4/Q5 is the usual default — most of the quality, a fraction of the memory.

The binding constraint is memory to hold the model — see the memory math. Roughly, for a 4-bit model:

Model sizeMemory (~Q4)Runs comfortably on
1–3B~1–3 GBAlmost any modern laptop
7–8B~5–8 GB16 GB RAM laptop; mainstream GPU
13–14B~9–12 GB32 GB RAM; a GPU with 12 GB+ VRAM
30–34B~20–24 GBHigh-memory machine or a 24 GB GPU
70B~40 GB+Workstation; multi-GPU; high-RAM Mac

Also watch speed: memory bandwidth sets token throughput, and small or heavily quantized models may run acceptably on CPU alone — but a GPU (or Apple Silicon) is far better for anything interactive.

When local makes sense — and when it doesn’t

Section titled “When local makes sense — and when it doesn’t”

Good fit: development and experimentation; privacy-sensitive data; offline or air-gapped use; high-volume simple tasks (classification, extraction) where a small model suffices; learning.

Poor fit: you need frontier-level capability; scaling to many concurrent users (that’s real serving infrastructure, not a laptop); you want zero operational effort.

A practical pattern: develop locally against a small model via Ollama’s API-compatible endpoint, then deploy against either a self-hosted model or a hosted API — the same code, just a different base URL.

Running models locally gives privacy, zero per-token cost, offline use, and real intuition — at the price of using smaller, less capable models. Ollama is the easiest start; llama.cpp is the engine beneath it. Models ship as quantized GGUF files, and Q4/Q5 is the usual quality/memory sweet spot. Memory is the constraint — size hardware to the model, and note that Apple Silicon’s unified memory makes Macs unusually capable. Use local for dev, privacy, and simple high-volume tasks; use hosted infrastructure for frontier capability and real multi-user scale.