Context & Decoding

Two mechanics govern day-to-day LLM work: how the model reads input (tokens and context) and how it chooses output (decoding). Both directly drive cost, latency, and quality.

Tokens

LLMs don’t see characters or words — they see tokens, chunks of text from a fixed vocabulary. A token is roughly ¾ of an English word.

"Engineering with LLMs is practical."
 ["Engineering", " with", " LL", "Ms", " is", " practical", "."]   -> 7 tokens

Tokens are the unit of everything:

Billing — APIs charge per input token and per output token, at different rates.
Limits — context windows and output caps are measured in tokens.
Latency — output tokens are generated one at a time, so response time scales with output length.

Rule of thumb: ~4 characters per token, or ~750 words per 1,000 tokens. Code and non-English text tokenize less efficiently.
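The rule of thumb above can be turned into a quick budgeting helper. The sketch below is a rough estimator, not a real tokenizer: the function names are hypothetical, the prices passed in are placeholders rather than any provider's actual rates, and exact counts require the tokenizer for your specific model.

```python
import math

CHARS_PER_TOKEN = 4  # rule-of-thumb average for English prose


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (not a real tokenizer)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimated cost, given placeholder prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


prompt = "Engineering with LLMs is practical."
print(estimate_tokens(prompt))
```

For the 35-character example sentence the heuristic gives 9 tokens, against the 7 shown in the tokenization above, which is a reminder that this is only an approximation for planning.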

The context window

The context window is the maximum number of tokens the model can consider at once — input plus output combined. Modern models range from tens of thousands to over a million tokens.

Everything the model “knows” in a request must fit here: system prompt, instructions, conversation history, retrieved documents, and the room left for its answer. The context window is a budget you actively manage.
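Managing that budget often means deciding which conversation history to keep. The following is a minimal sketch of one possible policy (keep the newest messages that fit after reserving room for the system prompt and the answer). The names fit_history and rough_count are hypothetical, and the token counter is the 4-characters-per-token heuristic rather than a real tokenizer.

```python
def rough_count(text: str) -> int:
    """Heuristic token count: ceil(len / 4). Swap in a real tokenizer if available."""
    return -(-len(text) // 4)


def fit_history(system_prompt, history, context_window, reserved_output,
                count_tokens=rough_count):
    """Keep the most recent messages that fit in the remaining token budget."""
    budget = context_window - reserved_output - count_tokens(system_prompt)
    kept = []
    for message in reversed(history):      # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break                          # stop at the first message that does not fit
        kept.append(message)
        budget -= cost
    return list(reversed(kept))            # restore chronological order


history = ["a" * 40, "b" * 40, "c" * 40]   # three messages of ~10 tokens each
trimmed = fit_history("s" * 20, history, context_window=40, reserved_output=15)
print(len(trimmed))
```

Dropping the oldest messages first is only one choice; summarizing old turns or ranking retrieved documents by relevance are common alternatives.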

Embeddings

An embedding is a different use of these models: instead of generating text, an embedding model converts text into a fixed-length vector — a list of numbers (often 768–3072 of them) — that captures meaning.

The key property: semantically similar text produces nearby vectors. “How do I reset my password?” and “I forgot my login credentials” land close together even with no shared words. This property is the foundation of vector search and retrieval-augmented generation (RAG), which are covered in depth separately.
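"Nearby" is commonly measured with cosine similarity. The sketch below uses made-up 4-dimensional vectors purely for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and the variable names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (values invented for this example).
reset_password = [0.9, 0.1, 0.0, 0.3]   # "How do I reset my password?"
forgot_login = [0.8, 0.2, 0.1, 0.4]     # "I forgot my login credentials"
pizza_recipe = [0.0, 0.9, 0.8, 0.1]     # an unrelated sentence

similar = cosine_similarity(reset_password, forgot_login)
unrelated = cosine_similarity(reset_password, pizza_recipe)
print(similar > unrelated)
```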

Decoding parameters

After the model produces next-token probabilities, decoding picks the actual token. You control this with parameters on every API call.

Temperature

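Temperature scales how much randomness is allowed when sampling the next token: low values concentrate probability on the most likely tokens, and high values spread it across more candidates.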

temperature = 0: greedy decoding; the model always takes the most likely token, so output is near-deterministic. Use for extraction, classification, structured output, anything factual.
temperature ≈ 0.7: balanced; a common setting for chat and general use.
temperature ≈ 1.0+: high variety; use for brainstorming and creative drafting.
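One way to see the effect: sampling implementations typically divide the model's raw scores (logits) by the temperature before applying softmax. The sketch below illustrates this with made-up logits. The function name is hypothetical, and real APIs may treat temperature = 0 as a special greedy case rather than dividing by zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                         # invented scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all probability lands on the top token; at higher temperature the distribution flattens and less likely tokens are sampled more often.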

Top-p (nucleus sampling)

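Top-p restricts sampling to the smallest set of tokens whose cumulative probability reaches p. With top_p = 0.9, the model samples only from the most likely tokens that together account for 90% of the probability mass and ignores the unlikely long tail. As a rule, tune either temperature or top-p, not both.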
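The nucleus selection can be sketched in a few lines. This is an illustrative implementation under simple assumptions (a small dictionary of invented token probabilities, and a hypothetical function name), not any particular library's code.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {token: prob / cumulative for token, prob in nucleus}


# Invented next-token distribution for illustration.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, 0.9)
print(sorted(nucleus))
```

With p = 0.9 the long-tail token "zebra" is dropped and the remaining three tokens are renormalized before sampling.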

Max tokens

A hard cap on output length. Set it deliberately: it bounds cost and latency, but if it is set too low, the response is cut off mid-sentence.

Stop sequences

Strings that halt generation when produced — useful for cutting output at a known boundary.
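The effect of a stop sequence can be mimicked on plain text. The sketch below is a client-side illustration only (the API applies stop sequences during generation, so later tokens are never produced or billed). The function name is hypothetical, and it assumes the stop string itself is excluded from the result, which is how many APIs behave.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence (stop excluded)."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


raw = "Order ID: 48213\n\nCustomer: Ada"   # invented model output
print(truncate_at_stop(raw, ["\n\n"]))
```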

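from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the order ID."}],
    temperature=0,        # near-deterministic: this is an extraction task
    max_tokens=50,        # the answer is short; cap cost and latency
)
print(response.choices[0].message.content)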

Task	Temperature	Why
Data extraction / classification	0	Want the same answer every time
Q&A over documents	0 – 0.3	Faithful, low-variance
Conversational assistant	0.7	Natural without being chaotic
Brainstorming / creative writing	0.9 – 1.2	Variety is the goal

Key takeaways

Tokens are the universal unit of billing, limits, and latency; budget about four characters each.
The context window holds input and output together and must be curated, not flooded.
Embeddings turn text into meaning-vectors and power retrieval.
Decoding parameters are real controls: use temperature 0 for factual and structured tasks, higher for creative ones, and always set max_tokens on purpose.