
LLM System Design

The LLM system design interview is the round that matters most for AI engineering roles. It’s a classic system design interview with an AI core — and it rewards structure, trade-off reasoning, and awareness of how AI systems fail.

Use a repeatable structure so you cover everything under pressure:

1. Clarify: requirements, scale, constraints
2. Define success: what does "good" mean? How is it measured?
3. Sketch: high-level architecture and data flow
4. Deep-dive the AI core: retrieval, prompts, models
5. Evaluate: how you measure quality and catch regressions
6. Operate: cost, latency, scaling, monitoring
7. Iterate: trade-offs, what you would improve or cut

Steps 4–6 are where AI system design diverges from ordinary system design. Spend your time there.

Worked example: “Design a customer support AI assistant”


Ask before designing: What can it do — answer questions only, or take actions (refunds, ticket updates)? What’s the knowledge source — help docs, past tickets? Volume? Latency expectation? What happens when it’s unsure? Channels? Languages?

Assume: answers product questions from help docs and past tickets, can escalate to a human, ~10k conversations/day, chat latency, English first.

State metrics up front — it signals maturity: resolution rate (no human needed), answer accuracy/faithfulness, escalation rate, latency (p99), cost per conversation, and user satisfaction. Note explicitly that a wrong answer is worse than an escalation — that framing drives the whole design.
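To make these metrics concrete, here is a minimal sketch of how they might be computed from logged conversations. The `Conversation` record and its field names are assumptions made for illustration, and the p99 uses a simple nearest-rank percentile; accuracy, faithfulness, and user satisfaction need labeled or judged data and are left out.

```python
import math
from dataclasses import dataclass


@dataclass
class Conversation:
    # Hypothetical log record; field names are illustrative assumptions.
    escalated: bool     # handed to a human
    resolved: bool      # issue closed
    latency_ms: float   # end-to-end response latency
    cost_usd: float     # model and retrieval spend for the conversation


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(conversations):
    """Aggregate the headline metrics over a non-empty list of conversations."""
    n = len(conversations)
    return {
        # Resolved with no human involved.
        "resolution_rate": sum(c.resolved and not c.escalated for c in conversations) / n,
        "escalation_rate": sum(c.escalated for c in conversations) / n,
        "latency_p99_ms": percentile([c.latency_ms for c in conversations], 99),
        "cost_per_conversation_usd": sum(c.cost_usd for c in conversations) / n,
    }
```

Reporting resolution rate and escalation rate side by side keeps the "wrong answer is worse than an escalation" framing visible: a rising resolution rate only counts as a win if accuracy holds.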

This is a RAG system: the model has a knowledge gap (it doesn't know your product on its own), and the underlying data keeps changing:

User → Gateway → Input guardrail → Orchestrator → LLM (grounded prompt) → Output guardrail → User

The orchestrator draws on:

  • Retrieval (RAG): help docs + past tickets
  • Conversation memory
  • Escalation to human: ticket handed to a person

Every step is traced to observability.
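The data flow can be sketched as a single orchestrator function. This is an illustration under stated assumptions, not a reference implementation: `handle_message`, `Trace`, `REFUSAL`, and every injected callable (`input_ok`, `retrieve`, `generate`, `output_ok`, `escalate`) are hypothetical names, and refusing on a blocked input while escalating on a failed output check is one possible policy among several.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Collects one (step, detail) record per pipeline stage."""
    steps: list = field(default_factory=list)

    def record(self, step, detail):
        self.steps.append((step, detail))


# Hypothetical canned reply for inputs the guardrail blocks.
REFUSAL = "Sorry, I can't help with that request."


def handle_message(message, history, *, input_ok, retrieve, generate,
                   output_ok, escalate, trace):
    """Run one user turn through guardrails, retrieval, and generation."""
    allowed = input_ok(message)
    trace.record("input_guardrail", allowed)
    if not allowed:
        return REFUSAL

    chunks = retrieve(message)                 # RAG over help docs + past tickets
    trace.record("retrieval", len(chunks))

    draft = generate(message, chunks, history)  # grounded prompt + conversation memory
    trace.record("llm", len(draft))

    safe = output_ok(draft, chunks)
    trace.record("output_guardrail", safe)
    if not safe:
        trace.record("escalation", "output_guardrail_failed")
        return escalate(message, history)      # ticket handed to a person

    history.append((message, draft))           # update conversation memory
    return draft
```

Passing each stage in as a callable keeps the sketch testable with stubs and makes it easy to point at any box in the diagram and say what would replace it.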

Walk through the real decisions:

  • Retrieval — chunk docs and tickets; embed; store in a vector DB. Hybrid search so exact terms (error codes, product names) aren’t lost; rerank; metadata-filter by product and recency.
  • Generation — a grounding prompt: answer only from retrieved context, cite sources, and say “I’m not sure” rather than guess. Low temperature.
  • Model choice — start with a capable hosted model; later route easy questions to a cheaper model for cost.
  • Actions — keep refunds and account changes behind tool calls with strict validation and human approval. Read-only by default.
  • The unsure path — when retrieval is weak or confidence is low, escalate. Better than a confident wrong answer — and you defined it that way in step 2.
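Two of the decisions above are small enough to sketch in code. The first function merges a keyword ranking and a vector ranking with reciprocal rank fusion, which is one common way to implement hybrid search; the text does not prescribe a fusion method, so treat it as an assumption. The second builds the grounding prompt only when at least one reranked chunk clears a threshold, and otherwise takes the unsure path. The prompt wording, the `min_score` value, and all function names are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids; k=60 is a commonly used constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical grounding prompt: answer from context, cite, admit uncertainty.
GROUNDING_PROMPT = """Answer using only the context below. Cite the source id for every claim.
If the context does not contain the answer, reply exactly: I'm not sure.

Context:
{context}

Question: {question}"""


def build_prompt_or_escalate(question, scored_chunks, min_score=0.5):
    """scored_chunks: list of (source_id, text, rerank_score) tuples."""
    strong = [c for c in scored_chunks if c[2] >= min_score]
    if not strong:
        # Weak retrieval: escalate rather than risk a confident wrong answer.
        return {"action": "escalate", "reason": "weak_retrieval"}
    context = "\n".join(f"[{sid}] {text}" for sid, text, _ in strong)
    prompt = GROUNDING_PROMPT.format(context=context, question=question)
    return {"action": "answer", "prompt": prompt}
```

In this sketch the threshold is applied to reranker scores rather than to anything the model reports about itself, since retrieval strength is something the system can measure directly.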

Build a test set of real questions with ideal answers and source chunks. Score retrieval (context recall) and generation (faithfulness, correctness) separately. Gate every prompt or model change on it. In production, run evals on sampled live traffic and watch thumbs-down and escalations. See Advanced RAG & Evaluation.
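A regression gate along these lines might look like the sketch below. It scores retrieval (context recall against the labeled source chunks) and generation separately and passes only if both clear their thresholds. The test-case keys, the thresholds, and the injected `retrieve`, `generate`, and `score_answer` callables are assumptions; in practice `score_answer` would likely be a human rubric or a judge model, and faithfulness would be scored against the retrieved context with a similar callable.

```python
def context_recall(retrieved_ids, expected_ids):
    """Fraction of the labeled source chunks that retrieval actually returned."""
    expected = set(expected_ids)
    if not expected:
        return 1.0
    return len(expected & set(retrieved_ids)) / len(expected)


def run_gate(test_set, retrieve, generate, score_answer,
             min_recall=0.8, min_correctness=0.8):
    """Evaluate a non-empty test set and decide whether a change may ship."""
    recalls, correctness = [], []
    for case in test_set:
        retrieved = retrieve(case["question"])
        recalls.append(context_recall(retrieved, case["source_chunks"]))
        answer = generate(case["question"], retrieved)
        # score_answer returns a value in [0, 1]; how it judges is up to you.
        correctness.append(score_answer(answer, case["ideal_answer"]))
    mean_recall = sum(recalls) / len(recalls)
    mean_correctness = sum(correctness) / len(correctness)
    return {
        "mean_context_recall": mean_recall,
        "mean_correctness": mean_correctness,
        "passed": mean_recall >= min_recall and mean_correctness >= min_correctness,
    }
```

Keeping the two scores separate tells you where a regression came from: low recall with good correctness on the remaining cases points at retrieval, not at the prompt or the model.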

  • Cost: estimate tokens per conversation; cache common questions; consider model routing.
  • Latency: stream responses; parallelize retrieval.
  • Reliability: timeouts, retries, a fallback model, graceful escalation if AI is down.
  • Monitoring: trace every conversation; track cost, latency, faithfulness, escalation rate.
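A back-of-the-envelope cost estimate is easy to do out loud, and a sketch like the following shows the arithmetic. Only the ~10k conversations/day figure comes from the assumptions above; the turn count, token counts, prices, and cache hit rate are made-up placeholders, and the model treats a cache hit as costing nothing.

```python
def daily_cost_usd(conversations_per_day, turns_per_conversation,
                   input_tokens_per_turn, output_tokens_per_turn,
                   usd_per_1m_input, usd_per_1m_output, cache_hit_rate=0.0):
    """Estimate daily LLM spend; cached turns are assumed to skip the model."""
    llm_turns = conversations_per_day * turns_per_conversation * (1 - cache_hit_rate)
    input_cost = llm_turns * input_tokens_per_turn * usd_per_1m_input / 1_000_000
    output_cost = llm_turns * output_tokens_per_turn * usd_per_1m_output / 1_000_000
    return input_cost + output_cost


# Placeholder numbers for illustration only; substitute real prices and token counts.
baseline = daily_cost_usd(10_000, 4, 3_000, 300, 1.0, 4.0)
with_cache = daily_cost_usd(10_000, 4, 3_000, 300, 1.0, 4.0, cache_hit_rate=0.25)
print(f"baseline: ${baseline:.2f}/day, with cache: ${with_cache:.2f}/day")
```

With these placeholder inputs the input side dominates, because retrieved context inflates every prompt; that is usually the first lever to mention, before model routing.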

Name the trade-offs unprompted: hybrid search and reranking cost latency but lift accuracy; a stronger model costs more but escalates less; aggressive caching risks staleness. Show you see the tensions and chose deliberately.

| They want to see | They worry when you |
| --- | --- |
| Clarifying before designing | Jump straight to a solution |
| Evaluation as a first-class concern | Never mention measuring quality |
| Awareness of failure modes | Assume the LLM is always right |
| Explicit cost/latency/quality trade-offs | Ignore cost and latency |
| Guardrails and the “unsure” path | Let the model act unconstrained |
| Starting simple, then scaling | Over-engineer from slide one |

Drive the LLM system design interview with a fixed framework: clarify, define success, sketch, deep-dive the AI core, evaluate, operate, iterate. Spend your time on the AI-specific parts — retrieval, prompts, model choice, guardrails. Raise evaluation, failure modes, and cost/latency/quality trade-offs yourself. Start simple and scale deliberately. Treating a wrong answer as worse than an escalation — and designing for it — is what marks a real practitioner.