ai · level 10

Generation

Logits in. Sampled tokens out. Repeat until done.

175 XP

Generation

At inference time, a language model takes a token sequence and outputs a probability distribution over the next token. Generation is the process of sampling from that distribution repeatedly until a stop condition is met.

Analogy

Imagine a game-show wheel with tens of thousands of segments, each labelled with a word and sized to match how likely the host thinks it is. The contestant spins once, reads the segment it lands on, writes it down — and the wheel is rebuilt with new segment sizes based on everything written so far. Spin, rebuild, spin, rebuild. Temperature is how much you smooth the wheel: high temperature flattens the segments so rare words have a real chance; low temperature lets the biggest slice win almost every time.

Logits and softmax

The transformer's final layer outputs a vector of logits — one real number per vocabulary token. Softmax converts logits to probabilities:

p(token_i) = exp(logit_i) / Σ_j exp(logit_j)

For a vocabulary of 100,000 tokens, this is 100,000 probabilities that sum to 1. The model picks (or samples from) this distribution.
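A small NumPy sketch of this conversion (illustrative, not tied to any particular framework), using the standard max-subtraction trick so the exponentials don't overflow:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert a vector of logits to a probability distribution."""
    # Subtracting the max logit before exponentiating keeps exp() finite;
    # the shift cancels in the ratio, so the probabilities are unchanged.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1, -3.0])
probs = softmax(logits)
print(probs, probs.sum())  # four probabilities that sum to 1.0
```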

Greedy decoding

Always pick the highest-probability token. Fast, deterministic, coherent in the short term. Tends to produce repetitive, low-diversity output — the model keeps taking the same locally safe step and misses good-but-risky alternatives.
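A minimal sketch of the greedy loop, assuming an illustrative `model` callable that returns next-token logits for a token-id sequence and an `eos_id` stop token (both stand-ins, not any specific library's API):

```python
import numpy as np

def greedy_decode(model, prompt_ids, eos_id, max_new_tokens=100):
    """Greedy decoding sketch: always take the argmax token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)               # shape: (vocab_size,)
        next_id = int(np.argmax(logits))  # highest-probability token, no sampling
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```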

Temperature

Temperature T scales the logits before softmax:

p(token_i) ∝ exp(logit_i / T)
  • T = 1.0 — unchanged distribution
  • T < 1.0 — concentrates probability, more deterministic
  • T > 1.0 — flattens distribution, more random

T = 0 is conventionally treated as greedy decoding (take the argmax; the softmax itself is undefined at T = 0). T = 2.0 produces chaotic output. Production settings for chat are typically 0.7–1.0; code generation often uses 0.2–0.4.
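A sketch of temperature sampling under the same illustrative assumptions as above (NumPy, one logits vector per step):

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    """Scale logits by 1/T, convert to probabilities, then sample."""
    if temperature == 0.0:
        return int(np.argmax(logits))      # convention: T = 0 means greedy
    scaled = logits / temperature
    scaled -= scaled.max()                 # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(probs), p=probs))
```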

Top-k sampling

Sample only from the k highest-probability tokens. Set all other probabilities to zero, renormalize, then sample. k = 50 is common. This prevents sampling rare tokens that would break coherence, while preserving diversity among the most plausible options.
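A NumPy sketch of the same idea: select the k largest logits, renormalize over just those tokens, and sample among them.

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int = 50) -> int:
    """Keep the k highest-probability tokens, renormalize, sample."""
    top_idx = np.argpartition(logits, -k)[-k:]        # indices of the k largest logits
    top_logits = logits[top_idx] - logits[top_idx].max()
    probs = np.exp(top_logits) / np.exp(top_logits).sum()
    return int(np.random.choice(top_idx, p=probs))
```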

Top-p (nucleus) sampling

Instead of a fixed k, keep the smallest set of top tokens whose cumulative probability reaches p, renormalize, and sample from that set. If the top token already has 60% probability and p = 0.9, only a small nucleus of tokens is eligible. If the distribution is flat, many more tokens qualify.

This is adaptive: when the model is confident, it samples from a small nucleus; when uncertain, it considers more options. p = 0.9 is a widely-used default.
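A NumPy sketch of nucleus sampling: sort by probability, take the smallest prefix whose cumulative mass reaches p, renormalize, sample.

```python
import numpy as np

def top_p_sample(logits: np.ndarray, p: float = 0.9) -> int:
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, sample."""
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(probs)[::-1]                    # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # first prefix with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))
```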

Beam search

Maintain k candidate sequences ("beams") simultaneously. At each step, expand each beam into its top-k continuations and keep only the k highest-probability beams overall. The final output is the highest-probability complete sequence.

Beam search maximizes sequence probability rather than per-step probability. It avoids greedy's myopia. But beam search tends toward generic, conservative output — the highest-probability long sequence is often bland. It is standard for translation; rarely used for open-ended generation.
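A simplified sketch of beam search, again over an illustrative `model` callable that returns next-token logits; it tracks cumulative log-probabilities and keeps the best `beam_width` candidates at each step:

```python
import numpy as np

def beam_search(model, prompt_ids, eos_id, beam_width=4, max_new_tokens=50):
    """Beam search sketch: expand each beam, keep the best beams overall."""
    beams = [(list(prompt_ids), 0.0)]            # (token ids, cumulative log-prob)
    for _ in range(max_new_tokens):
        candidates = []
        for ids, score in beams:
            if ids[-1] == eos_id:                # finished beams carry over unchanged
                candidates.append((ids, score))
                continue
            logits = model(ids)
            # log-softmax: logits minus log-sum-exp, computed stably
            log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
            top = np.argsort(log_probs)[-beam_width:]   # expand only the top continuations
            for tok in top:
                candidates.append((ids + [int(tok)], score + float(log_probs[tok])))
        # keep the beam_width highest-scoring candidates overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(ids[-1] == eos_id for ids, _ in beams):
            break
    return beams[0][0]                           # highest-probability complete sequence
```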

Stop tokens

Generation terminates when:

  1. A special <eos> (end of sequence) token is sampled.
  2. A user-defined stop string is encountered.
  3. A maximum token limit is reached.

Models learn to emit <eos> from training data: documents end, conversations end, answers end. If the model is not well-trained, it may fail to stop.
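A sketch of a generation loop that wires the three stop conditions together; `model`, `tokenizer`, and `sample_next` are illustrative stand-ins (the model returns next-token logits, the tokenizer decodes ids to text, and `sample_next` is any of the sampling strategies above):

```python
def generate(model, tokenizer, sample_next, prompt_ids, eos_id,
             stop_strings=(), max_new_tokens=256):
    """Generation loop sketch showing the three stop conditions."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):                    # 3. maximum token limit
        next_id = sample_next(model(ids))
        ids.append(next_id)
        if next_id == eos_id:                          # 1. <eos> token sampled
            break
        new_text = tokenizer.decode(ids[len(prompt_ids):])
        if any(s in new_text for s in stop_strings):   # 2. user-defined stop string
            break
    return ids
```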

KV-cache

Without caching, the model would re-process the entire token sequence at every step to produce the next token — O(n²) attention per step, prohibitively expensive for long contexts. The KV-cache stores the key and value matrices computed at each layer for every past token. Each new token then computes its own query, key, and value once and attends against the cached K/V — O(n) incremental work per new token instead of recomputing everything from scratch. The cache grows with sequence length; at 128K tokens, it can occupy gigabytes of GPU memory.
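A toy, single-head sketch of the idea (real engines cache keys and values per layer and per attention head, on the GPU): the cache grows by one row per generated token, and each new token's query attends over everything stored so far.

```python
import numpy as np

class KVCache:
    """Toy single-layer, single-head KV-cache."""
    def __init__(self, d_head: int):
        self.keys = np.zeros((0, d_head))
        self.values = np.zeros((0, d_head))

    def step(self, q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
        """Append this token's key/value, then attend q over all cached tokens."""
        self.keys = np.vstack([self.keys, k[None, :]])      # cache grows by one row
        self.values = np.vstack([self.values, v[None, :]])
        scores = self.keys @ q / np.sqrt(q.shape[0])        # O(n) over cached keys
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values                        # attention output for new token
```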

Parallel vs. sequential generation

Autoregressive generation — one token at a time, left to right. Standard. Sequential by nature; each token depends on the previous.

Speculative decoding — a small draft model proposes k tokens; the large model verifies them in parallel in a single forward pass. If all k are accepted, you get k tokens for the cost of one large-model pass plus the cheap draft passes. Typical speedup: 2–3×.
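A much-simplified sketch of one speculative step, using a greedy-match acceptance rule rather than the full rejection-sampling scheme used in practice; `draft_model` and `target_model` are illustrative stand-ins returning next-token logits:

```python
import numpy as np

def speculative_step(target_model, draft_model, ids, k=4):
    """Simplified speculative decoding step with greedy-match verification:
    keep the longest prefix of draft tokens the target agrees with,
    plus one token chosen by the target itself."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft = list(ids)
    for _ in range(k):
        draft.append(int(np.argmax(draft_model(draft))))
    proposed = draft[len(ids):]

    # 2. Verify. A real system scores all positions in one batched target pass;
    #    this loop re-calls the target per position only to keep the sketch simple.
    accepted = []
    for tok in proposed:
        target_choice = int(np.argmax(target_model(ids + accepted)))
        if target_choice == tok:
            accepted.append(tok)             # draft and target agree: keep it
        else:
            accepted.append(target_choice)   # disagree: take the target's token, stop
            break
    return ids + accepted
```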

Diffusion-based generation — a newer paradigm that generates tokens in parallel by iteratively denoising a noisy sequence. Not yet competitive with autoregressive models on quality, but potentially faster at long outputs.