Context & Decoding
Two mechanics govern day-to-day LLM work: how the model reads input (tokens and context) and how it chooses output (decoding). Both directly drive cost, latency, and quality.
Tokens
LLMs don’t see characters or words; they see tokens, chunks of text drawn from a fixed vocabulary. A token is roughly ¾ of an English word.

```text
"Engineering with LLMs is practical."
["Engineering", " with", " LL", "Ms", " is", " practical", "."]  -> 7 tokens
```

Tokens are the unit of everything:
- Billing — APIs charge per input token and per output token, at different rates.
- Limits — context windows and output caps are measured in tokens.
- Latency — output tokens are generated one at a time, so response time scales with output length.
Rule of thumb: ~4 characters per token, or ~750 words per 1,000 tokens. Code and non-English text tokenize less efficiently.
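The rule of thumb can be turned into a quick budgeting helper. This is a minimal sketch, not a real tokenizer: `estimate_tokens` and `estimate_cost` are hypothetical names, and the prices in the usage lines are made-up placeholders rather than any provider's actual rates.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the text; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Rough token count from character length (not a real tokenizer)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Cost in the currency of the prices; input and output are billed at separate rates."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


prompt = "Engineering with LLMs is practical."
print(estimate_tokens(prompt))

# Hypothetical prices, chosen only for illustration.
print(estimate_cost(2_000, 500,
                    input_price_per_million=1.0,
                    output_price_per_million=4.0))
```

For the example sentence the heuristic gives 9 tokens where the tokenization shown earlier gives 7, so treat the result as an estimate and use the provider's tokenizer when exact counts matter.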
The context window
The context window is the maximum number of tokens the model can consider at once: input plus output combined. Modern models range from tens of thousands to over a million tokens.
Everything the model “knows” in a request must fit here: system prompt, instructions, conversation history, retrieved documents, and the room left for its answer. The context window is a budget you actively manage.
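One way to manage that budget is to reserve room for the answer first and then keep only as much recent history as still fits. The sketch below assumes the ~4 characters per token heuristic and a drop-oldest-first policy; `fit_history` is a hypothetical helper, and other policies (summarizing old turns, for instance) are equally plausible.

```python
import math


def estimate_tokens(text: str) -> int:
    # ~4 characters per token heuristic; a real tokenizer would be more accurate.
    return math.ceil(len(text) / 4)


def fit_history(system_prompt: str, history: list[str],
                context_window: int, reserved_for_output: int) -> list[str]:
    """Keep the most recent history messages that fit the remaining budget."""
    budget = context_window - reserved_for_output - estimate_tokens(system_prompt)
    kept: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


# Three messages of roughly 100, 50, and 25 tokens.
history = ["a" * 400, "b" * 200, "c" * 100]
kept = fit_history("x" * 40, history, context_window=200, reserved_for_output=100)
print(len(kept))
```

With a 200-token window, 100 tokens reserved for output, and a 10-token system prompt, only the two newest messages fit.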
Embeddings
An embedding is a different use of these models: instead of generating text, an embedding model converts text into a fixed-length vector, a list of numbers (often 768–3072 of them) that captures meaning.
The key property: semantically similar text produces nearby vectors. “How do I reset my password?” and “I forgot my login credentials” land close together even with no shared words. This is the foundation of vector search and RAG, which are covered in depth separately.
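"Nearby" is commonly measured with cosine similarity, although the text does not name a metric, so take that choice as an assumption. The vectors below are hand-made four-dimensional toys, not output from any embedding model; they only illustrate how the comparison works.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings.
reset_password = [0.9, 0.1, 0.8, 0.0]  # "How do I reset my password?"
forgot_login = [0.8, 0.2, 0.9, 0.1]    # "I forgot my login credentials"
pizza_recipe = [0.0, 0.9, 0.1, 0.8]    # an unrelated sentence

close = cosine_similarity(reset_password, forgot_login)
far = cosine_similarity(reset_password, pizza_recipe)
print(round(close, 3), round(far, 3))
```

In a real system the vectors would come from an embedding model and the comparison would run over an index rather than pair by pair.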
Decoding parameters
After the model produces next-token probabilities, decoding picks the actual token. You control this with parameters on every API call.
Temperature
Scales how much randomness is allowed when sampling the next token.

- temperature = 0: near-deterministic; always take the most likely token. Use for extraction, classification, structured output, anything factual.
- temperature ≈ 0.7: balanced; a common choice for chat and general use.
- temperature ≈ 1.0+: high variety; use for brainstorming and creative drafting.
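The effect of temperature is easiest to see on a small distribution. The sketch below uses the usual formulation, dividing the model's raw scores (logits) by the temperature before applying softmax, and treats temperature 0 as greedy selection. The logits are hypothetical, and individual providers may handle edge cases differently.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0, 0.7, 1.2):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token; higher ones spread it toward the alternatives.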
Top-p (nucleus sampling)
Restricts choices to the smallest set of most-likely tokens whose probabilities sum to at least p. For example, top_p = 0.9 ignores the unlikely long tail. Usually tune either temperature or top-p, not both.
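A minimal sketch of the nucleus filter, assuming the candidate probabilities are already known: sort by probability, keep tokens until the running total reaches p, then renormalize what is left. The token names and probabilities are hypothetical, and `top_p_filter` is an illustrative helper rather than any library's function.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest high-probability set whose total reaches p, renormalized."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept: list[tuple[str, float]] = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete; the rest is the long tail
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-token distribution.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, 0.9)
print(sorted(nucleus))
```

With p = 0.9 the filter keeps "cat", "dog", and "fish" and drops "zebra"; sampling then happens only among the kept tokens.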
Max tokens
A hard cap on output length. Set it deliberately: it bounds cost and latency, but set too low it will truncate a response mid-sentence.
Stop sequences
Strings that halt generation when produced, useful for cutting output at a known boundary.
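How the output cap and stop sequences interact can be shown with a toy loop over a scripted token stream. This is a simulation, not a model call: `generate` is a hypothetical function, and the "stop" and "length" labels are borrowed from the finish reasons reported by OpenAI-style chat APIs. It also assumes the stop sequence itself is excluded from the returned text.

```python
def generate(tokens: list[str], max_tokens: int, stop: list[str]) -> tuple[str, str]:
    """Consume a scripted token stream; return (text, finish_reason)."""
    text = ""
    for count, token in enumerate(tokens, start=1):
        text += token
        for sequence in stop:
            index = text.find(sequence)
            if index != -1:
                return text[:index], "stop"  # cut just before the stop sequence
        if count >= max_tokens:
            return text, "length"  # hit the hard cap, possibly mid-sentence
    return text, "stop"  # the stream ended on its own


# Scripted stand-in for what a model might emit.
stream = ["Order", " ID", ":", " A", "123", "\n", "Thanks", "!"]

print(generate(stream, max_tokens=50, stop=["\n"]))
print(generate(stream, max_tokens=3, stop=["\n"]))
```

The first call ends cleanly at the newline; the second is cut off by the cap after three tokens, which is the case worth checking for in production code.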
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the order ID."}],
    temperature=0,   # near-deterministic: this is an extraction task
    max_tokens=50,   # the answer is short; cap cost and latency
)
```

| Task | Temperature | Why |
|---|---|---|
| Data extraction / classification | 0 | Want the same answer every time |
| Q&A over documents | 0 – 0.3 | Faithful, low-variance |
| Conversational assistant | 0.7 | Natural without being chaotic |
| Brainstorming / creative writing | 0.9 – 1.2 | Variety is the goal |
Key takeaways
- Tokens are the universal unit of billing, limits, and latency; budget about four characters each.
- The context window holds input and output together and must be curated, not flooded.
- Embeddings turn text into meaning-vectors and power retrieval.
- Decoding parameters are real controls: use temperature 0 for factual and structured tasks, higher for creative ones, and always set max_tokens on purpose.