Tools & Memory

An agent’s loop is only as good as two things: the tools that let it affect the world, and the memory that lets it carry information across steps and sessions.

Tools and function calling

A tool is a function the agent can call — search, a database query, an API request, a calculator, code execution. Function calling is the mechanism: you describe your tools to the model; when it wants one, it returns a structured request naming the tool and its arguments; your code runs it and returns the result.

tools = [{
    "name": "get_order_status",
    "description": "Get the current status of a customer's order by order ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "e.g. 'o_8842'"},
        },
        "required": ["order_id"],
    },
}]
# The model returns: {"tool": "get_order_status", "args": {"order_id": "o_8842"}}
# Your code executes it — the model never runs anything itself.

Designing good tools

Tool quality drives agent reliability more than any prompt tweak. The model picks tools using only their descriptions, so treat tool design as API design for a non-human caller:

Crisp descriptions. State exactly what the tool does and when to use it. Vague descriptions cause wrong-tool choices.
Few, well-scoped tools. A handful of clear tools beats thirty overlapping ones — too many and the model gets lost choosing.
Constrained inputs. Use enums and typed schemas so the model can’t pass nonsense.
Helpful results — including errors. A result of "Error: order_id not found. Check the ID format (o_XXXX)." lets the agent self-correct; "Error 400" leaves it stuck.
Right-sized output. Don’t return a 50KB blob — it floods the context window. Return what the agent needs.

Memory

An LLM is stateless — each call only knows what’s in its context window. Memory is the engineering around that limit. Agents need several distinct kinds.

Short-term (working) memory

The current task’s context: the goal, recent steps, and observations — the running transcript inside the loop. The constraint is the context window. Long tasks overflow it, so you manage working memory actively:

Trim — drop the oldest, least relevant turns.
Summarize — periodically compress older history into a short recap.
Offload — write bulky results to external storage and keep only a reference in context.

Long-term memory

Information that must survive across sessions — user preferences, facts learned, past outcomes. It lives outside the model, usually in a vector database, and is retrieved when relevant. Long-term memory is essentially RAG pointed at the agent’s own history.

Memory types at a glance

Type	Holds	Lives in	Lifespan
Working / short-term	Current task state	Context window	This task
Episodic	Past events and outcomes	Vector DB / store	Across sessions
Semantic	Facts, preferences, knowledge	Vector DB / store	Long-lived
Procedural	How to do recurring tasks	Prompts / tools / code	Persistent

Context engineering

Deciding what occupies the limited context window each step — instructions, which memories, which tool results, how much history — is context engineering, and it’s the core craft of building agents. Too little context and the agent flounders; too much and it gets distracted, slows down, and costs more. Curate deliberately, every step.

Key takeaways

Tools let agents act; function calling is the protocol, and the model only requests — your code executes and authorizes. Design tools like APIs for a non-human caller: clear descriptions, narrow scope, typed inputs, useful errors — and treat every tool as a security boundary. Memory compensates for the stateless model: working memory (managed against the context limit), long-term memory (RAG over past sessions). Choosing what fills the context each step — context engineering — is the central skill.