LLM Application Architecture
Most production LLM applications share the same skeleton. Learn it once and you can read — or design — almost any of them.
The reference architecture
Section titled “The reference architecture”The layers, one by one
Section titled “The layers, one by one”API gateway
Section titled “API gateway”The ordinary front door: authentication, per-user rate limiting, request validation. Nothing AI-specific, but it’s where you stop abuse before it costs you model calls.
Input guardrails
Section titled “Input guardrails”The first AI-aware layer. Cheap, fast checks before an expensive model call: prompt-injection detection, off-topic filtering, and abuse classification. Reject early.
Orchestration layer
Section titled “Orchestration layer”The brain of the application — your own code that decides the control flow:
- A simple feature: assemble a prompt, call the model, return the result.
- A RAG feature: retrieve context first, then call the model.
- An agent: loop — call the model, run the tool it requested, feed the result back, repeat. See AI Agents.
This layer is deterministic application code. Resist the urge to let the LLM control flow it doesn’t need to.
Prompt assembly
Section titled “Prompt assembly”Builds the final prompt from parts: a versioned system prompt, conversation history (often trimmed or summarized to fit the context window), retrieved context, and the user’s message. Treat prompts as versioned artifacts, not string literals scattered through the codebase.
Retrieval
Section titled “Retrieval”Fetches relevant knowledge — vector search, keyword search, SQL, an API call — and feeds it to prompt assembly. This is RAG, and it’s how you give the model facts it doesn’t have.
Tool / action execution
Section titled “Tool / action execution”When the model needs to do something — query a database, call an API, run code — this layer executes the requested tool in a sandbox and returns the result. Every tool call is a security boundary: validate arguments, scope permissions narrowly.
LLM gateway
Section titled “LLM gateway”A single internal choke point for all model calls. It centralizes retries, timeouts, fallback to a second provider, model routing (cheap model for easy requests, strong model for hard ones), caching, and cost tracking. This is the swappable-model principle made concrete.
Output guardrails
Section titled “Output guardrails”Before a response reaches the user: validate it against a schema, scan for unsafe content and leaked PII or secrets, and confirm it stays on policy. Fail closed.
Observability
Section titled “Observability”Cross-cutting, not a step. Every layer emits traces (the full chain for one request), metrics (latency, token cost, error and cache-hit rates), and evals (quality scored on sampled traffic). Covered in MLOps.
Start simple
Section titled “Start simple”You do not build all of this on day one. Most applications begin as:
Gateway → Prompt → LLM → ResponseAdd layers when a real problem demands them: retrieval when the model lacks knowledge, guardrails when you see abuse, an LLM gateway when you need fallback or routing, orchestration loops when one call can’t finish the job. Premature architecture is as costly as no architecture.
Key takeaways
Section titled “Key takeaways”Production LLM apps share a skeleton: gateway, input guardrails, an orchestration layer that owns control flow, prompt assembly, retrieval, tool execution, an LLM gateway, output guardrails, and pervasive observability. Orchestration is your deterministic code — keep it that way. The LLM gateway centralizes resilience and routing. Start with the minimal path and add layers only when a concrete problem justifies them.