
Design Principles

Designing AI systems is mostly about unlearning an assumption from ordinary software: that a given input produces a known, correct, repeatable output. With an LLM, none of that holds. These principles are how you build reliably anyway.

The same input can yield different outputs. So:

  • Never assume an exact response — parse and validate what comes back.
  • Don’t put unvalidated model output straight into a database, a query, or a UI.
  • Make operations idempotent where you can, so a retry is always safe.

The model is a source of proposals, not a source of truth. Your code decides what to trust.
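As a minimal sketch of treating a reply as a proposal, the snippet below parses a model reply and returns a label only if it passes every check. The label set and the `label` field name are assumptions made for illustration, not part of any provider's API.

```python
import json
from typing import Optional

# Hypothetical label set, chosen only for this example.
ALLOWED_LABELS = ("spam", "not_spam")


def parse_label(raw: str) -> Optional[str]:
    """Return a trusted label, or None if the reply cannot be trusted."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the model answered in prose or broken JSON
    if not isinstance(data, dict):
        return None  # valid JSON, but not the object we asked for
    label = data.get("label")
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return None  # missing field or a value outside the allowed set
```

The caller decides what `None` means (retry, default, or escalate); the model never gets to write an unchecked value into application state.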

As a rule of thumb, the more open-ended a model’s job, the less reliable its output.

[Figure: task spectrum running from closed tasks (classify, extract, transform) through summarize to open-ended generative tasks. Closed tasks are more reliable; generative tasks are less reliable.]

Whenever possible, narrow the task. Turn “answer the user” into “pick one of these five intents.” Turn “write the response” into “fill these named fields.” Closed tasks are easier to constrain, validate, and evaluate.
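One way to turn "answer the user" into "pick one of these five intents" might look like the sketch below. The intent names and the prompt wording are hypothetical; the point is that the reply is checked against a closed set and anything else falls back to a safe default.

```python
# Hypothetical intent names; substitute the ones your product actually needs.
INTENTS = ("billing", "cancel", "tech_support", "sales", "other")


def build_intent_prompt(message: str) -> str:
    """Ask for exactly one intent name instead of a free-form answer."""
    options = ", ".join(INTENTS)
    return (
        "Classify the user message into exactly one intent.\n"
        f"Allowed intents: {options}\n"
        "Reply with the intent name only.\n\n"
        f"Message: {message}"
    )


def resolve_intent(reply: str) -> str:
    """Map the model's reply onto the closed set, defaulting to 'other'."""
    candidate = reply.strip().lower()
    return candidate if candidate in INTENTS else "other"
```

Because the output space is five strings, evaluation reduces to comparing against labeled examples, which is much harder to do for free-form text.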

Free-form text is hostile to downstream code. Demand structured output — JSON matching a schema — so the result is machine-checkable.

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}
# Use the provider's structured-output / JSON-schema mode, then still validate.

A schema turns “hope the model formatted it right” into “reject it if it didn’t.”
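The "still validate" step can be sketched with the standard library alone. The checks below are hand-written to mirror the schema above; in practice a JSON Schema validator library would usually replace them. The helper names `schema_errors` and `ask_validated`, and the `ask` callable standing in for a model call, are assumptions for illustration.

```python
import json

SENTIMENTS = ("positive", "negative", "neutral")
REQUIRED = ("sentiment", "confidence")


def schema_errors(raw: str) -> list:
    """Return a list of problems; an empty list means the reply is acceptable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["reply is not valid JSON"]
    if not isinstance(data, dict):
        return ["reply is not a JSON object"]
    errors = [f"missing field: {name}" for name in REQUIRED if name not in data]
    if "sentiment" in data and data["sentiment"] not in SENTIMENTS:
        errors.append("sentiment is not one of the allowed values")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if "confidence" in data and (
        isinstance(confidence, bool) or not isinstance(confidence, (int, float))
    ):
        errors.append("confidence is not a number")
    return errors


def ask_validated(ask, prompt: str) -> dict:
    """Call ask(prompt); on invalid output, re-ask once, then give up."""
    raw = ask(prompt)
    errors = schema_errors(raw)
    if errors:
        retry_prompt = f"{prompt}\n\nYour last reply was rejected: {errors}. Reply with JSON only."
        raw = ask(retry_prompt)
        if schema_errors(raw):
            raise ValueError("model output failed validation twice")
    return json.loads(raw)
```

Capping the re-ask at one attempt keeps cost and latency bounded; a reply that fails twice is treated as a failure rather than retried indefinitely.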

Every LLM call has more failure modes than a normal function. Enumerate them and handle each:

Failure                       Mitigation
----------------------------  ------------------------------------------------
Timeout / slow response       Aggressive timeout + retry with backoff
Rate limited (429)            Backoff, queue, or fail over to another provider
Malformed / invalid output    Schema validation, then one re-ask
Hallucinated content          Ground with retrieval; verify before use
Provider outage               Fallback model or graceful degradation
Unsafe / off-topic output     Guardrails on input and output

A demo handles the happy path. A system handles this table.
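Two rows of that table, retry with backoff and graceful degradation, can be sketched as follows. `TransientError` is a hypothetical exception that an adapter would raise for timeouts and 429 responses; the delays and attempt count are illustrative defaults, not recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for timeouts, 429s, and other retryable failures."""


def call_with_backoff(
    call: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() on TransientError with exponential backoff plus jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller decide what to do
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay * 0.1))  # up to 10% jitter
    raise AssertionError("unreachable")


def complete_or_degrade(call: Callable[[], T], degraded: T, *, sleep=time.sleep) -> T:
    """Return the model's answer, or a canned degraded answer if it stays down."""
    try:
        return call_with_backoff(call, sleep=sleep)
    except TransientError:
        return degraded
```

Passing `sleep` as a parameter is a design choice that keeps the retry logic testable without real waiting.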

The model proposes; your system disposes. Before acting on output:

  • Validate structure — schema, types, required fields.
  • Validate semantics — is the cited document ID real? Is the number in range?
  • Cross-check when stakes are high — run the model’s SQL read-only first; confirm a quoted fact against the source.
  • Keep humans in the loop for irreversible or high-impact actions.
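The semantic checks and the human-approval gate above might be sketched like this. The field names (`citations`, `confidence`, `action`) and the list of irreversible actions are assumptions for illustration, and the sketch assumes the answer has already passed structural validation into a plain dict.

```python
# Hypothetical action names that should never run without a person approving.
IRREVERSIBLE_ACTIONS = ("delete_record", "send_payment")


def semantic_errors(answer: dict, known_doc_ids: set) -> list:
    """Check that a structurally valid answer also makes sense."""
    errors = []
    # Every cited document must be one we actually retrieved.
    for doc_id in answer.get("citations", []):
        if not isinstance(doc_id, str) or doc_id not in known_doc_ids:
            errors.append(f"unknown citation: {doc_id!r}")
    # Numbers must fall inside the range the application expects.
    confidence = answer.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be a number between 0 and 1")
    return errors


def needs_human_approval(answer: dict) -> bool:
    """Route irreversible or high-impact actions to a person before executing."""
    return answer.get("action") in IRREVERSIBLE_ACTIONS
```

For example, an answer citing `doc-9` when only `doc-1` and `doc-2` were retrieved is rejected even though its JSON is perfectly well formed.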

You cannot debug what you cannot see, and LLM systems fail quietly — a slightly worse answer, not a stack trace. Log, for every call: the full prompt, the raw response, token counts, latency, cost, model version, and which retrieved context was used. Without this, a production complaint is unreproducible.
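A per-call log record covering those fields could be sketched as below. The `model_fn` callable returning `(text, input_tokens, output_tokens)` and the flat `usd_per_token` price are simplifying assumptions; real providers report usage and pricing in their own shapes.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass, field

logger = logging.getLogger("llm_calls")


@dataclass
class CallRecord:
    prompt: str
    response: str
    model_version: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float
    context_ids: list = field(default_factory=list)


def logged_call(model_fn, prompt, *, model_version, context_ids=(), usd_per_token=0.0):
    """Call the model and emit one structured log line describing the call."""
    start = time.perf_counter()
    # Assumption: model_fn returns (text, input_tokens, output_tokens).
    text, input_tokens, output_tokens = model_fn(prompt)
    record = CallRecord(
        prompt=prompt,
        response=text,
        model_version=model_version,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        latency_s=time.perf_counter() - start,
        cost_usd=(input_tokens + output_tokens) * usd_per_token,
        context_ids=list(context_ids),
    )
    logger.info(json.dumps(asdict(record)))  # one JSON object per call
    return record
```

Emitting one JSON object per call means a complaint can later be matched to the exact prompt, context, and model version that produced the answer.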

A guardrail is a check around the model, independent of it:

  • Input guardrails — block prompt injection, off-topic requests, and abuse before spending a model call.
  • Output guardrails — scan responses for leaked secrets, unsafe content, or policy violations before they reach the user.

Guardrails are deterministic code or separate classifiers — never “we asked the model nicely not to.” The threats they defend against — prompt injection, data leakage, unsafe output — are covered in depth in AI Safety & Security.
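To make "deterministic code" concrete, here is a deliberately small sketch of one input check and one output check. The patterns are illustrative assumptions and far from complete: a single phrase match will not stop a determined prompt injection, and real secret scanning covers many more formats.

```python
import re

# Illustrative patterns only; a production list would be broader and tested.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
]
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


def input_allowed(user_text: str) -> bool:
    """Input guardrail: reject before spending a model call."""
    return not any(p.search(user_text) for p in INJECTION_PATTERNS)


def redact_output(model_text: str) -> str:
    """Output guardrail: mask secret-looking strings before the user sees them."""
    for pattern in SECRET_PATTERNS:
        model_text = pattern.sub("[REDACTED]", model_text)
    return model_text
```

Both functions run outside the model and give the same answer every time for the same text, which is what separates a guardrail from an instruction in the prompt.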

The model landscape changes monthly. Isolate provider calls behind your own interface so you can switch models, route by task, or fall back without rewriting the application.

class LLMClient:
    # "Result" is your own return type, defined elsewhere in the app.
    def complete(self, prompt: str, *, schema=None) -> "Result": ...

# App code depends on LLMClient, never on a vendor SDK directly.
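One possible way to flesh out that interface is a `typing.Protocol` plus small adapters, with fallback implemented as just another client. Everything below other than the `complete` signature is an assumption for illustration: the `Result` fields, the stand-in adapters, and the choice of `ConnectionError` as the outage signal.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Result:
    text: str
    model: str  # which model actually answered, useful for logging


class LLMClient(Protocol):
    def complete(self, prompt: str, *, schema: Optional[dict] = None) -> Result: ...


class EchoClient:
    """Stand-in adapter; a real one would wrap a vendor SDK here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str, *, schema: Optional[dict] = None) -> Result:
        return Result(text=prompt.upper(), model=self.name)


class FailingClient:
    """Stand-in for a provider that is currently down."""

    def complete(self, prompt: str, *, schema: Optional[dict] = None) -> Result:
        raise ConnectionError("provider down")


class FallbackClient:
    """Tries the primary client, then the secondary, behind the same interface."""

    def __init__(self, primary: LLMClient, secondary: LLMClient) -> None:
        self.primary = primary
        self.secondary = secondary

    def complete(self, prompt: str, *, schema: Optional[dict] = None) -> Result:
        try:
            return self.primary.complete(prompt, schema=schema)
        except ConnectionError:
            return self.secondary.complete(prompt, schema=schema)
```

Application code holds an `LLMClient` and cannot tell whether it is talking to one provider, a fallback chain, or a test double:

```python
client: LLMClient = FallbackClient(FailingClient(), EchoClient("backup"))
print(client.complete("hello").model)
```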

Build for non-determinism: validate everything, never trust raw output. Constrain both the task and the output format to make the model reliable and checkable. Enumerate failure modes and handle each one. Verify outputs against ground truth before acting. Instrument every call, wrap the model in guardrails, and keep it swappable behind your own abstraction.