Multi-Agent & Production

A single agent can do a lot. Multi-agent systems divide work across several specialized agents — and shipping any agent, single or multi, surfaces a hard set of production problems. This page covers both.

Multi-agent systems

The idea: instead of one agent juggling everything, use multiple agents, each with a focused role, a tailored prompt, and its own tools.

Orchestration patterns

Supervisor / orchestrator — a coordinator agent routes subtasks to worker agents and assembles their results. Flexible; the supervisor is the single point of control.
Pipeline — agents run in a fixed sequence, each handing off to the next (research → write → edit). Predictable and easy to debug.
Network — agents hand off to each other dynamically. Most flexible, least predictable — hardest to control.
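The first two patterns can be sketched in a few lines. This is a minimal illustration, not a framework API: an "agent" is assumed to be a plain function from text to text, and `run_pipeline`, `run_supervisor`, and the stub agents are hypothetical names. The network pattern is left out because its control flow depends on what each agent decides at runtime.

```python
from typing import Callable, Dict, List

# Assumption: an agent is a function from text to text. A real agent would call an LLM.
Agent = Callable[[str], str]

def run_pipeline(agents: List[Agent], task: str) -> str:
    """Pipeline: fixed order, each agent's output becomes the next agent's input."""
    result = task
    for agent in agents:
        result = agent(result)
    return result

def run_supervisor(
    route: Callable[[str], str],
    workers: Dict[str, Agent],
    subtasks: List[str],
) -> List[str]:
    """Supervisor: a routing function picks a worker per subtask; results are collected in order."""
    results = []
    for subtask in subtasks:
        worker_name = route(subtask)
        results.append(workers[worker_name](subtask))
    return results

# Stub agents standing in for research -> write -> edit.
def research(text: str) -> str:
    return f"notes({text})"

def write(text: str) -> str:
    return f"draft({text})"

def edit(text: str) -> str:
    return f"final({text})"

pipeline_output = run_pipeline([research, write, edit], "topic")
print(pipeline_output)  # final(draft(notes(topic)))
```

In the supervisor sketch the routing decision is one function, which is what makes the supervisor the single point of control: changing how work is assigned means changing `route` and nothing else.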

When multi-agent actually helps

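Multi-agent helps in three cases: when subtasks are genuinely distinct, so each agent gets a simpler, more focused context; when subtasks need different tools or permissions; and when parts can run in parallel. Every extra agent also adds cost, latency, and failure points, so prefer one good agent until its context genuinely overloads.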
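The parallel case might look like the following sketch. It assumes each subtask is independent and wrapped in a zero-argument callable; `run_parallel` is a hypothetical helper built on the standard library's `ThreadPoolExecutor`, which suits agents that spend most of their time waiting on network calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

def run_parallel(assignments: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run independent agent subtasks concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(assignments))) as pool:
        futures = {name: pool.submit(fn) for name, fn in assignments.items()}
        # result() blocks until that subtask finishes and re-raises any exception it hit.
        return {name: future.result() for name, future in futures.items()}

# Stub subtasks; real ones would invoke separate agents.
results = run_parallel({
    "pricing": lambda: "pricing summary",
    "reviews": lambda: "review summary",
})
print(results)
```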

Production realities

Agents fail in ways single LLM calls don’t, and each failure mode below has to be engineered for explicitly.

Runaway loops and cost

An agent can loop — repeating a failing action, oscillating between two states, or just taking too long. Each iteration costs money. Mandatory guardrails:

Hard step limit — a max iteration count, always.
Budget cap — a per-task token/dollar ceiling that aborts the run.
Loop detection — spot repeated identical actions and break out.
Wall-clock timeout — bound total runtime.

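from collections import namedtuple
AgentLimits = namedtuple("AgentLimits", "max_steps max_cost_usd max_seconds")  # illustrative container, not a library class
limits = AgentLimits(max_steps=15, max_cost_usd=0.50, max_seconds=120)
# Hitting any limit should end the run with a graceful partial result, never silently.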
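One way the four guardrails could fit into a single loop is sketched below. Everything here is an assumption for illustration: the block defines its own hypothetical `AgentLimits` dataclass so it runs on its own, `step(i)` stands in for one agent iteration returning `(action, cost_usd, done)`, and loop detection is the simplest possible variant (the same action repeated `max_repeats` times in a row).

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class AgentLimits:
    max_steps: int = 15
    max_cost_usd: float = 0.50
    max_seconds: float = 120.0
    max_repeats: int = 3  # identical consecutive actions tolerated before breaking out

@dataclass
class RunResult:
    actions: List[str]   # partial trajectory, returned even when a limit is hit
    cost_usd: float
    stop_reason: str     # "done", "max_steps", "budget", "timeout", or "loop"

def run_with_limits(
    step: Callable[[int], Tuple[str, float, bool]],
    limits: AgentLimits,
) -> RunResult:
    start = time.monotonic()
    actions: List[str] = []
    cost = 0.0
    for i in range(limits.max_steps):            # hard step limit
        if time.monotonic() - start > limits.max_seconds:
            return RunResult(actions, cost, "timeout")   # wall-clock timeout
        action, step_cost, done = step(i)
        actions.append(action)
        cost += step_cost
        if done:
            return RunResult(actions, cost, "done")
        if cost >= limits.max_cost_usd:
            return RunResult(actions, cost, "budget")    # budget cap
        tail = actions[-limits.max_repeats:]
        if len(tail) == limits.max_repeats and len(set(tail)) == 1:
            return RunResult(actions, cost, "loop")      # repeated identical action
    return RunResult(actions, cost, "max_steps")

# A stub agent that keeps retrying the same failing search.
stuck = run_with_limits(lambda i: ("search('x')", 0.01, False), AgentLimits())
print(stuck.stop_reason, len(stuck.actions))  # loop 3
```

Every exit path returns the partial trajectory together with a `stop_reason`, so a caller can tell a finished run from an aborted one. Note that the timeout is only checked between steps and the consecutive-repeat check does not catch oscillation between two states; both would need stronger mechanisms in a real system.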

Error compounding

A small error in step 2 becomes wrong input to step 3, and the agent confidently builds on it. Mitigate by validating intermediate results, having the agent verify before depending on a result, and using reflection checkpoints on critical steps.
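Validating intermediate results can be as simple as pairing each step with a check that must pass before the next step runs. The sketch below assumes steps are plain functions; `run_validated` and `StepValidationError` are hypothetical names, and the checks are deliberately trivial.

```python
from typing import Any, Callable, List, Tuple

class StepValidationError(Exception):
    """Raised when a step's output fails its check, before later steps can build on it."""

# Each step is (name, function, validity check on the function's output).
Step = Tuple[str, Callable[[Any], Any], Callable[[Any], bool]]

def run_validated(steps: List[Step], initial: Any) -> Any:
    value = initial
    for name, fn, is_valid in steps:
        value = fn(value)
        if not is_valid(value):
            raise StepValidationError(
                f"step '{name}' produced an invalid result: {value!r}"
            )
    return value

steps: List[Step] = [
    ("parse_quantity", int, lambda n: n >= 0),
    ("double", lambda n: n * 2, lambda n: n % 2 == 0),
]
print(run_validated(steps, "21"))  # 42
```

Failing at the step that produced the bad value keeps the error local: with input `"-5"` the run stops at `parse_quantity` instead of passing a negative quantity downstream.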

Reliability and observability

Treat agency as a risk surface. Keep humans in the loop for high-stakes actions. Make every run fully traceable: log every thought, tool call, argument, observation, and cost. When an agent misbehaves, the trace is the only way to find which step went wrong. This is non-negotiable for agents.
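A trace does not need special tooling to be useful. The sketch below records one structured event per thought, tool call, or observation and serializes the run as JSON Lines. `Trace` and `TraceEvent` are hypothetical names, and the payloads are assumed to be JSON-serializable.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

@dataclass
class TraceEvent:
    step: int
    kind: str                 # "thought", "tool_call", or "observation"
    payload: Dict[str, Any]   # assumed JSON-serializable
    cost_usd: float = 0.0
    timestamp: float = field(default_factory=time.time)

class Trace:
    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events: List[TraceEvent] = []

    def log(self, step: int, kind: str, payload: Dict[str, Any], cost_usd: float = 0.0) -> None:
        self.events.append(TraceEvent(step, kind, payload, cost_usd))

    def total_cost(self) -> float:
        return sum(event.cost_usd for event in self.events)

    def to_jsonl(self) -> str:
        """One JSON object per line, each tagged with the run id."""
        return "\n".join(
            json.dumps({"run_id": self.run_id, **asdict(event)})
            for event in self.events
        )

trace = Trace("run-1")
trace.log(1, "thought", {"text": "look up the order"})
trace.log(1, "tool_call", {"tool": "get_order", "args": {"id": 7}}, cost_usd=0.25)
trace.log(1, "observation", {"result": "shipped"}, cost_usd=0.5)
print(trace.to_jsonl())
```

Keeping arguments and observations in the same record stream as costs means a single file answers both "which step went wrong" and "what did this run cost".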

Security

The tool security rules apply with extra force: agents chain many actions, and prompt injection is the signature threat — a malicious instruction hidden in a retrieved document or tool result can hijack the loop. Least-privilege tools, sandboxing, argument validation, and human approval gates are the defenses.
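Three of these defenses (least privilege, argument validation, and approval gates) can sit in one wrapper that every tool call passes through. The sketch below is illustrative only: `guarded_call`, the tool names, and the `/data/` path rule are all assumptions, and sandboxing is not shown because it happens outside the Python process.

```python
from typing import Any, Callable, Dict, Set

class ToolCallRejected(Exception):
    """Raised when a requested tool call fails a security check."""

def guarded_call(
    tool_name: str,
    args: Dict[str, Any],
    tools: Dict[str, Callable[..., Any]],
    allowed: Set[str],
    validators: Dict[str, Callable[[Dict[str, Any]], bool]],
    needs_approval: Set[str],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> Any:
    # Least privilege: only tools explicitly granted to this agent may run.
    if tool_name not in allowed or tool_name not in tools:
        raise ToolCallRejected(f"tool not permitted: {tool_name}")
    # Argument validation: model-produced arguments are untrusted input.
    validator = validators.get(tool_name)
    if validator is not None and not validator(args):
        raise ToolCallRejected(f"invalid arguments for {tool_name}: {args!r}")
    # Human approval gate for high-stakes tools.
    if tool_name in needs_approval and not approve(tool_name, args):
        raise ToolCallRejected(f"approval denied for {tool_name}")
    return tools[tool_name](**args)

# Stub tools and a hypothetical rule: reads must stay under /data/.
tools = {
    "read_file": lambda path: "contents of " + path,
    "delete_file": lambda path: "deleted " + path,
}
validators = {
    "read_file": lambda a: a.get("path", "").startswith("/data/") and ".." not in a["path"],
}

result = guarded_call(
    "read_file", {"path": "/data/a.txt"},
    tools, {"read_file"}, validators, {"delete_file"},
    approve=lambda name, args: False,
)
print(result)  # contents of /data/a.txt
```

Because the checks run on the call itself and not on the prompt, an injected instruction that persuades the model to request `delete_file` still fails unless that tool was granted and a human approves it.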

Evaluating agents

Agents are harder to evaluate than single calls — success is a trajectory, not one output. Measure both:

Outcome — did it achieve the goal? (task success rate)
Process — how? Step count, tool-choice accuracy, cost per task, latency, wrong turns. Two agents can both succeed while one costs 5× as much.

Build a suite of representative tasks with known good outcomes and run it on every change — same evaluation discipline as the rest of the guide, applied to trajectories.
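A task suite that reports both outcome and process metrics might start like this. The sketch assumes an agent is a callable returning a `Trajectory` with its final output, step count, and cost, and that success is an exact match against the expected output. All names are hypothetical, and real suites usually need a looser success check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Trajectory:
    output: str
    steps: int
    cost_usd: float

@dataclass
class EvalTask:
    prompt: str
    expected: str

def evaluate(agent: Callable[[str], Trajectory], tasks: List[EvalTask]) -> Dict[str, float]:
    """Return outcome (success rate) and process (steps, cost) metrics for a task suite."""
    if not tasks:
        raise ValueError("task suite is empty")
    runs = [agent(task.prompt) for task in tasks]
    # Assumption: exact string match defines success.
    successes = sum(1 for task, run in zip(tasks, runs) if run.output == task.expected)
    n = len(tasks)
    return {
        "success_rate": successes / n,
        "avg_steps": sum(run.steps for run in runs) / n,
        "avg_cost_usd": sum(run.cost_usd for run in runs) / n,
    }

# Stub agent: uppercases the prompt, takes one "step" per character, fixed cost.
def stub_agent(prompt: str) -> Trajectory:
    return Trajectory(output=prompt.upper(), steps=len(prompt), cost_usd=0.25)

report = evaluate(stub_agent, [EvalTask("ab", "AB"), EvalTask("cd", "XX")])
print(report)  # {'success_rate': 0.5, 'avg_steps': 2.0, 'avg_cost_usd': 0.25}
```

Running the same suite against two agent versions and comparing `avg_cost_usd` at equal `success_rate` is how the "both succeed, one costs 5× as much" case becomes visible.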

Key takeaways

Multi-agent systems split work across focused agents via supervisor, pipeline, or network patterns — but they multiply cost, latency, and failure points, so prefer one good agent until its context genuinely overloads. In production, agents demand hard limits (steps, budget, time), loop detection, validation of intermediate results, full tracing, least-privilege tools, and human approval for high-stakes actions. Evaluate both the outcome and the process.