LLM System Design

The LLM system design interview is the round that matters most for AI engineering roles. It’s a classic system design interview with an AI core — and it rewards structure, trade-off reasoning, and awareness of how AI systems fail.

A framework

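Use a repeatable structure so you cover everything under pressure:

1. Clarify
2. Define success
3. Sketch the architecture
4. Deep-dive: the AI core
5. Evaluation
6. Operate
7. Trade-offs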

Steps 4–6 are where AI system design diverges from ordinary system design. Spend your time there.

Worked example: “Design a customer support AI assistant”

1. Clarify

Ask before designing: What can it do — answer questions only, or take actions (refunds, ticket updates)? What’s the knowledge source — help docs, past tickets? Volume? Latency expectation? What happens when it’s unsure? Channels? Languages?

Assume: answers product questions from help docs and past tickets, can escalate to a human, ~10k conversations/day, chat latency, English first.

2. Define success

State metrics up front — it signals maturity: resolution rate (no human needed), answer accuracy/faithfulness, escalation rate, latency (p99), cost per conversation, and user satisfaction. Note explicitly that a wrong answer is worse than an escalation — that framing drives the whole design.
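To make these metrics concrete, here is a minimal sketch of how they might be computed from logged conversations. The `Conversation` record and its field names are hypothetical, and treating every non-escalated conversation as resolved is a simplifying assumption (it ignores abandoned chats). Accuracy and satisfaction need labels or user feedback, so they are left out.

```python
import math
from dataclasses import dataclass


@dataclass
class Conversation:
    # Hypothetical log record; field names are assumptions for illustration.
    escalated: bool     # True if the conversation was handed to a human
    latency_ms: float   # time until the assistant's response completed
    cost_usd: float     # total model and retrieval spend for the conversation


def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(conversations):
    """Aggregate the step-2 metrics that can be derived from logs alone."""
    n = len(conversations)
    escalated = sum(c.escalated for c in conversations)
    return {
        # Assumption: "resolved" means "not escalated".
        "resolution_rate": (n - escalated) / n,
        "escalation_rate": escalated / n,
        "p99_latency_ms": percentile([c.latency_ms for c in conversations], 99),
        "cost_per_conversation_usd": sum(c.cost_usd for c in conversations) / n,
    }
```

Reporting p99 rather than the mean keeps the slowest conversations visible, which is where chat users notice latency.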

3. Sketch the architecture

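This is a RAG system, for two reasons: the model has a knowledge gap (it does not know your help docs and tickets), and that data keeps changing. The request path is: user question, retrieval over docs and tickets, grounded generation, then either a cited answer or escalation to a human.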

4. Deep-dive: the AI core

Walk through the real decisions:

Retrieval — chunk docs and tickets; embed; store in a vector DB. Hybrid search so exact terms (error codes, product names) aren’t lost; rerank; metadata-filter by product and recency.
Generation — a grounding prompt: answer only from retrieved context, cite sources, and say “I’m not sure” rather than guess. Low temperature.
Model choice — start with a capable hosted model; later route easy questions to a cheaper model for cost.
Actions — keep refunds and account changes behind tool calls with strict validation and human approval. Read-only by default.
The unsure path — when retrieval is weak or confidence is low, escalate. Better than a confident wrong answer — and you defined it that way in step 2.
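The retrieval, generation, and unsure-path decisions above can be sketched in a few functions. This is an illustration under stated assumptions, not a definitive implementation: reciprocal rank fusion is one common way to merge keyword and vector rankings (the text does not name a fusion method), the prompt wording is made up, and the `min_score` threshold and the shape of the reranker output are hypothetical.

```python
ANSWER, ESCALATE = "answer", "escalate"


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so a chunk ranked well by both keyword and vector search rises to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


def build_grounded_prompt(question, chunks):
    """Number the retrieved chunks so the model can cite them as [n]."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer using only the context below and cite sources as [n]. "
        "If the context does not contain the answer, reply \"I'm not sure\".\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def decide(reranked, min_score=0.5, min_strong_chunks=1):
    """Choose the unsure path when retrieval is weak.

    `reranked` is a list of (chunk_id, relevance) pairs from a reranker,
    with relevance assumed to lie in [0, 1]; the threshold is a placeholder
    that would be tuned on the evaluation set.
    """
    strong = [chunk_id for chunk_id, score in reranked if score >= min_score]
    return ANSWER if len(strong) >= min_strong_chunks else ESCALATE
```

Keeping the escalation check in plain code rather than inside the prompt means the "wrong answer is worse than an escalation" rule holds even when the model is overconfident.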

5. Evaluation

Build a test set of real questions with ideal answers and source chunks. Score retrieval (context recall) and generation (faithfulness, correctness) separately. Gate every prompt or model change on it. In production, run evals on sampled live traffic and watch thumbs-down and escalations. See Advanced RAG & Evaluation.
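As a rough sketch of the retrieval score and the release gate, assuming each test case lists the ids of its expected source chunks: `context_recall` measures how many of them retrieval returned, and `gate` blocks a change if any metric drops below the baseline. The metric names and the `tolerance` value are placeholders. Faithfulness and correctness need a human or model judge, so they appear here only as precomputed numbers.

```python
def context_recall(retrieved_ids, expected_ids):
    """Fraction of the expected source chunks that retrieval returned."""
    expected = set(expected_ids)
    if not expected:
        return 1.0  # nothing was required, so nothing was missed
    return len(expected & set(retrieved_ids)) / len(expected)


def mean_context_recall(cases):
    """Average recall over (retrieved_ids, expected_ids) pairs from the test set."""
    return sum(context_recall(r, e) for r, e in cases) / len(cases)


def gate(candidate, baseline, tolerance=0.01):
    """Compare a candidate prompt or model against the current baseline.

    Both arguments map metric name -> score (higher is better).
    Returns (passed, regressions), where regressions maps each failing
    metric to its (baseline, candidate) scores.
    """
    regressions = {
        metric: (baseline[metric], candidate[metric])
        for metric in baseline
        if candidate[metric] < baseline[metric] - tolerance
    }
    return not regressions, regressions
```

Scoring retrieval separately pays off here: if faithfulness holds but context recall drops, the regression is in chunking or search, not in the prompt.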

6. Operate

Cost: estimate tokens per conversation; cache common questions; consider model routing.
Latency: stream responses; parallelize retrieval.
Reliability: timeouts, retries, a fallback model, graceful escalation if AI is down.
Monitoring: trace every conversation; track cost, latency, faithfulness, escalation rate.
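Two of these points lend themselves to a quick sketch: the back-of-envelope cost estimate and the retry-then-fallback path. Everything numeric is an assumption for illustration (turns per conversation, token counts, and prices are not real figures), and each model is a hypothetical callable that is expected to enforce its own timeout and raise on failure.

```python
import time

ESCALATION_MESSAGE = "I'm connecting you with a human agent."


def daily_cost_usd(conversations_per_day, turns_per_conversation,
                   input_tokens_per_turn, output_tokens_per_turn,
                   usd_per_million_input, usd_per_million_output,
                   cache_hit_rate=0.0):
    """Estimate daily model spend; cached turns are assumed to cost nothing."""
    cost_per_turn = (
        input_tokens_per_turn * usd_per_million_input
        + output_tokens_per_turn * usd_per_million_output
    ) / 1_000_000
    model_turns = conversations_per_day * turns_per_conversation * (1 - cache_hit_rate)
    return model_turns * cost_per_turn


def answer_with_fallback(question, models, retries=2, backoff_s=0.5, sleep=time.sleep):
    """Try each model in order, retrying with exponential backoff.

    `models` is an ordered list of callables (primary first, fallback after).
    If every attempt fails, escalate instead of surfacing an error.
    """
    for model in models:
        for attempt in range(retries):
            try:
                return model(question)
            except Exception:
                # Back off before retrying the same model; move on after the last try.
                if attempt < retries - 1:
                    sleep(backoff_s * 2 ** attempt)
    return ESCALATION_MESSAGE
```

With the example's 10,000 conversations per day and placeholder inputs of 4 turns, 3,000 input and 300 output tokens per turn, and prices of $1 and $4 per million tokens, the estimate comes to $168 per day. The point is the shape of the calculation, since input tokens (retrieved context) usually dominate.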

7. Trade-offs

Name them unprompted: hybrid search and reranking cost latency but lift accuracy; a stronger model costs more but escalates less; aggressive caching risks staleness. Show you see the tensions and chose deliberately.

What interviewers reward

They want to see	They worry when you
Clarifying before designing	Jump straight to a solution
Evaluation as a first-class concern	Never mention measuring quality
Awareness of failure modes	Assume the LLM is always right
Explicit cost/latency/quality trade-offs	Ignore cost and latency
Guardrails and the “unsure” path	Let the model act unconstrained
Starting simple, then scaling	Over-engineer from slide one

Key takeaways

Drive the LLM system design interview with a fixed framework: clarify, define success, sketch, deep-dive the AI core, evaluate, operate, and name the trade-offs. Spend your time on the AI-specific parts: retrieval, prompts, model choice, guardrails. Raise evaluation, failure modes, and cost/latency/quality trade-offs yourself. Start simple and scale deliberately. Treating a wrong answer as worse than an escalation, and designing for it, is what marks a real practitioner.