Running Models Locally

Running a capable LLM on your own laptop or server is now routine. It’s worth doing — for privacy, cost, offline use, and the genuine intuition you gain from operating a model directly.

Why run locally

Privacy — data never leaves your machine. Decisive for sensitive content.
Cost — no per-token bill; once the hardware exists, inference is “free.”
Offline — works with no network, in air-gapped environments.
Learning — nothing builds intuition like running, quantizing, and prompting a model yourself.
No rate limits — bounded only by your hardware.

The trade-off: a model that fits on consumer hardware is smaller and less capable than a frontier API model. Local is excellent for many tasks — and not a drop-in replacement for the strongest closed models.

The tools

Ollama

The easiest entry point. One command pulls and runs a model, with a local API server that mimics common LLM APIs.

ollama run llama3.2          # download + chat, interactively
# A local OpenAI-compatible endpoint also starts at localhost:11434

Your application code can point at that local endpoint with almost no changes — ideal for development and for swapping between local and hosted models.

llama.cpp

The engine under much local inference, Ollama included. A highly optimized C++ runtime that runs models efficiently on CPUs, consumer GPUs, and Apple Silicon. Use it directly when you want maximum control or to embed inference in an app.

Others

Desktop apps like LM Studio and Jan offer a GUI over the same engines; vLLM handles serious local serving on a GPU.

The GGUF format and quantization

Local models are usually distributed as GGUF files — a format built for efficient CPU/GPU inference that bundles the model and its quantization.

Quantization is what makes local viable: 4-bit weights cut memory ~4× versus 16-bit, moving a model from “data-center GPU” to “your laptop.” You’ll pick a quantization level — a trade-off:

Quantization	Memory	Quality	Use when
8-bit (Q8)	Highest	Near-original	You have the RAM/VRAM to spare
4–5-bit (Q4/Q5)	Moderate	Very good	The common sweet spot
2–3-bit	Lowest	Noticeably degraded	Only if nothing else fits

Q4/Q5 is the usual default — most of the quality, a fraction of the memory.

Hardware: what you need

The binding constraint is memory to hold the model — see the memory math. Roughly, for a 4-bit model:

Model size	Memory (~Q4)	Runs comfortably on
1–3B	~1–3 GB	Almost any modern laptop
7–8B	~5–8 GB	16 GB RAM laptop; mainstream GPU
13–14B	~9–12 GB	32 GB RAM; a GPU with 12 GB+ VRAM
30–34B	~20–24 GB	High-memory machine or a 24 GB GPU
70B	~40 GB+	Workstation; multi-GPU; high-RAM Mac

Also watch speed: memory bandwidth sets token throughput, and small or heavily quantized models may run acceptably on CPU alone — but a GPU (or Apple Silicon) is far better for anything interactive.

When local makes sense — and when it doesn’t

Good fit: development and experimentation; privacy-sensitive data; offline or air-gapped use; high-volume simple tasks (classification, extraction) where a small model suffices; learning.

Poor fit: you need frontier-level capability; scaling to many concurrent users (that’s real serving infrastructure, not a laptop); you want zero operational effort.

A practical pattern: develop locally against a small model via Ollama’s API-compatible endpoint, then deploy against either a self-hosted model or a hosted API — the same code, just a different base URL.

Key takeaways

Running models locally gives privacy, zero per-token cost, offline use, and real intuition — at the price of using smaller, less capable models. Ollama is the easiest start; llama.cpp is the engine beneath it. Models ship as quantized GGUF files, and Q4/Q5 is the usual quality/memory sweet spot. Memory is the constraint — size hardware to the model, and note that Apple Silicon’s unified memory makes Macs unusually capable. Use local for dev, privacy, and simple high-volume tasks; use hosted infrastructure for frontier capability and real multi-user scale.