Context & Decoding

Two mechanics govern day-to-day LLM work: how the model reads input (tokens and context) and how it chooses output (decoding). Both directly drive cost, latency, and quality.

LLMs don’t see characters or words — they see tokens, chunks of text from a fixed vocabulary. A token is roughly ¾ of an English word.

"Engineering with LLMs is practical."
["Engineering", " with", " LL", "Ms", " is", " practical", "."] -> 7 tokens

Tokens are the unit of everything:

  • Billing — APIs charge per input token and per output token, at different rates.
  • Limits — context windows and output caps are measured in tokens.
  • Latency — output tokens are generated one at a time, so response time scales with output length.

Rule of thumb: ~4 characters per token, or ~750 words per 1,000 tokens. Code and non-English text tokenize less efficiently.
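
The rule of thumb is enough for back-of-the-envelope budgeting. The sketch below estimates a token count from character length and converts it into a cost. The function names and the per-million-token prices are placeholders invented for illustration, not any provider's real rates, and a real tokenizer will give different counts (the sentence from the example above is 7 tokens, while this estimate gives 9).

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text; not exact


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (~4 characters per token)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars, given separate per-million-token input and output prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


prompt = "Engineering with LLMs is practical."
n_in = estimate_tokens(prompt)  # 35 characters -> 9 estimated tokens
# Hypothetical prices: $2.50 per million input tokens, $10.00 per million output tokens.
cost = estimate_cost(n_in, 200, 2.50, 10.00)
print(n_in, cost)
```

Because input and output are priced separately, the 200 output tokens dominate the cost here even though the prompt is short.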

The context window is the maximum number of tokens the model can consider at once — input plus output combined. Modern models range from tens of thousands to over a million tokens.

Everything the model “knows” in a request must fit here: system prompt, instructions, conversation history, retrieved documents, and the room left for its answer. The context window is a budget you actively manage.
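
One way to manage that budget is to reserve room for the answer first and then keep only as much recent history as fits. The sketch below is a minimal illustration under assumptions: `fit_history` and its parameters are hypothetical names, messages are plain strings, and token counts come from the rough four-characters-per-token estimate rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (at least 1)."""
    return max(1, len(text) // 4)


def fit_history(system_prompt: str, history: list[str],
                context_window: int, reserved_for_output: int) -> list[str]:
    """Keep the most recent messages that fit in the remaining input budget."""
    budget = context_window - reserved_for_output - estimate_tokens(system_prompt)
    kept: list[str] = []
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(history):
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 400, "b" * 200, "c" * 100]  # roughly 100, 50, and 25 tokens
trimmed = fit_history("You are helpful.", history,
                      context_window=200, reserved_for_output=100)
print(len(trimmed))  # the oldest message is dropped, so 2 remain
```

Dropping the oldest messages is only one policy; summarizing old turns or retrieving only the relevant documents are alternatives that spend the same budget differently.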

An embedding is a different use of these models: instead of generating text, an embedding model converts text into a fixed-length vector — a list of numbers (often 768–3072 of them) — that captures meaning.

The key property: semantically similar text produces nearby vectors. “How do I reset my password?” and “I forgot my login credentials” land close together even with no shared words. This is the foundation of vector search and retrieval-augmented generation (RAG), which are covered in depth separately.
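
"Nearby" is usually measured with cosine similarity. The sketch below shows the calculation on toy vectors; the three 4-dimensional vectors are invented for illustration and do not come from any embedding model, whose real outputs would have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example (not real embeddings).
reset_password = [0.9, 0.1, 0.3, 0.0]
forgot_login = [0.8, 0.2, 0.4, 0.1]
pizza_recipe = [0.0, 0.9, 0.1, 0.8]

sim_related = cosine_similarity(reset_password, forgot_login)
sim_unrelated = cosine_similarity(reset_password, pizza_recipe)
print(sim_related > sim_unrelated)  # True
```

Vector search amounts to running this comparison between a query vector and many stored vectors, then returning the closest matches.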

After the model produces next-token probabilities, decoding picks the actual token. You control this with parameters on every API call.

Temperature scales how much randomness is allowed when the next token is sampled.

  • temperature = 0 — near-deterministic; always take the most likely token. Use for extraction, classification, structured output, anything factual.
  • temperature ≈ 0.7 — balanced; the default for chat and general use.
  • temperature ≈ 1.0+ — high variety; use for brainstorming and creative drafting.
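
To see why these settings behave this way, it helps to look at the usual textbook formulation: logits are divided by the temperature before the softmax. The sketch below uses made-up logits for three candidate tokens; it illustrates the general technique and is not a description of any particular provider's internals. Temperature 0 cannot be plugged into the formula (division by zero), so it is conventionally treated as "pick the most likely token".

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.5)
for label, probs in (("T=0.2", low), ("T=1.5", high)):
    print(label, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass lands on the top token; at high temperature the distribution flattens and less likely tokens get sampled more often.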

Top-p (nucleus sampling) restricts choices to the smallest set of tokens whose cumulative probability reaches p. For example, top_p = 0.9 ignores the unlikely long tail. Usually tune either temperature or top-p, not both.
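
The filtering step can be sketched in a few lines. The function name `top_p_filter` and the four-token distribution are assumptions made for illustration; a real model applies the same idea over its full vocabulary.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the most likely tokens until their cumulative probability
    reaches p, then renormalize the survivors so they sum to 1."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.05}  # made-up distribution
filtered = top_p_filter(probs, 0.9)
print(sorted(filtered))  # ['a', 'the', 'this']
```

With p = 0.9 the first two tokens only reach 0.8, so a third is needed; the 5% tail token is never sampled.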

max_tokens is a hard cap on output length. Set it deliberately: it bounds cost and latency, but if set too low it will truncate a response mid-sentence.

Stop sequences are strings that halt generation when produced, which is useful for cutting output at a known boundary.
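
The interaction between the output cap and stop sequences can be simulated without a model. In the toy loop below, `generate` is a hypothetical stand-in that consumes a pre-written list of tokens; the reason labels "stop" and "length" are borrowed from common API conventions and are an assumption here, so check how your provider reports them.

```python
def generate(token_stream: list[str], max_tokens: int, stop: list[str]) -> tuple[str, str]:
    """Consume tokens until a stop sequence appears or the cap is hit.

    Returns (text, finish_reason), where finish_reason is "stop" or "length".
    Assumes max_tokens >= 1.
    """
    text = ""
    for count, token in enumerate(token_stream, start=1):
        text += token
        for sequence in stop:
            index = text.find(sequence)
            if index != -1:
                return text[:index], "stop"  # the stop sequence itself is dropped
        if count >= max_tokens:
            return text, "length"  # cut off by the cap
    return text, "stop"  # the stream ended on its own


stream = ["ORDER", "-", "1234", "\n", "Thanks", " for", " asking", "!"]
clean = generate(stream, max_tokens=50, stop=["\n"])
truncated = generate(stream, max_tokens=2, stop=["\n"])
print(clean)      # ('ORDER-1234', 'stop')
print(truncated)  # ('ORDER-', 'length')
```

A "length" result is the signal that the cap was too low for the task; a stop sequence ends the output cleanly at the boundary instead.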

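from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the order ID."}],
    temperature=0,  # near-deterministic: this is an extraction task
    max_tokens=50,  # the answer is short; cap cost and latency
)
print(response.choices[0].message.content)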
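
| Task                             | Temperature | Why                             |
|----------------------------------|-------------|---------------------------------|
| Data extraction / classification | 0           | Want the same answer every time |
| Q&A over documents               | 0 – 0.3     | Faithful, low-variance          |
| Conversational assistant         | 0.7         | Natural without being chaotic   |
| Brainstorming / creative writing | 0.9 – 1.2   | Variety is the goal             |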

Tokens are the universal unit — of billing, limits, and latency; budget about four characters each. The context window holds input and output together and must be curated, not flooded. Embeddings turn text into meaning-vectors and power retrieval. Decoding parameters are real controls: use temperature 0 for factual and structured tasks, higher for creative ones, and always set max_tokens on purpose.