Attention
Every token asks: which other tokens matter to me?
Attention is the mechanism that lets every position in a sequence look at every other position and decide how much to borrow from it. It is what gives transformers their ability to capture long-range dependencies.
Analogy
Imagine a crowded cocktail party. You are trying to finish a sentence you started thirty seconds ago, and to do that you quickly glance around the room — mostly tuning people out, but leaning in a lot on the two friends who brought up the topic, and a little on the person who cracked the joke. Each word in a sentence does the same sweep of the whole conversation and weights every other voice before deciding what to say next. The party is the context; the attention is the selective listening.
The core idea
When processing token i, the model asks: which other tokens are relevant to understanding token i right now? Attention computes a weighted mixture of all token representations (including token i's own), where the weights come from learned compatibility scores.
Informally: "dog" in "the dog that chased the cat bit the man" needs to figure out that "bit" refers to it, not to "cat." Attention learns to route that connection.
Queries, keys, values
Each token produces three vectors from its embedding, via learned weight matrices:
- Query (Q) — "what am I looking for?"
- Key (K) — "what do I offer?"
- Value (V) — "what do I contribute if selected?"
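A minimal sketch of these projections in NumPy, with toy dimensions and randomly initialized matrices standing in for the learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 6, 16, 8   # toy sizes, not from any real model

X = rng.normal(size=(seq_len, d_model))   # one embedding vector per token

# Learned during training in a real model; random placeholders here.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # "what am I looking for?"
K = X @ W_k   # "what do I offer?"
V = X @ W_v   # "what do I contribute if selected?"
```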
The attention weight from token i to token j is:
attention_weight(i, j) = softmax_j( Q_i · K_j / √d_k )
The dot product Q_i · K_j is the raw compatibility score. Dividing by √d_k (where d_k is the dimension of the key vectors) keeps the dot products from growing too large in high dimensions, and the softmax normalizes the scores over all positions j into a probability distribution.
The output at position i is:
output_i = Σ_j attention_weight(i,j) × V_j
Every output is a weighted sum of all value vectors. High-weight tokens contribute more.
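A self-contained NumPy sketch of the whole computation; the Q, K, V here are random stand-ins rather than real learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) pairwise compatibility scores
    weights = softmax(scores, axis=-1)  # each row is a distribution over all positions
    return weights @ V, weights         # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape)          # (6, 6): one attention distribution per query token
print(weights.sum(axis=-1))   # rows sum to 1
```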
What it looks like
Imagine the sentence "The trophy didn't fit in the suitcase because it was too big."
To resolve "it," the model computes attention from the "it" token to all other tokens. The highest weights land on "trophy" — because "trophy" is too big to fit, not the suitcase. The attention pattern encodes the coreference.
| From token | High attention to |
|---|---|
| "it" | "trophy" (0.72) |
| "fit" | "trophy", "suitcase" (0.40, 0.38) |
| "big" | "trophy" (0.65) |
Multi-head attention
Running attention once gives one perspective. Multi-head attention runs h independent attention operations in parallel, each with its own Q/K/V weight matrices. The outputs are concatenated and projected back to the model dimension.
Different heads specialize. One head may track syntactic dependencies (subject → verb). Another may track coreference. Another may attend to domain-relevant keywords. The model learns from the data which specializations are useful.
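A sketch of the mechanics in NumPy, with toy dimensions and random matrices standing in for the learned per-head projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        # Each head uses its own slice of the projection matrices.
        sl = slice(h * d_head, (h + 1) * d_head)
        Q, K, V = X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]
        scores = Q @ K.T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V)
    # Concatenate the per-head outputs and project back to the model dimension.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)   # (6, 16): same shape as the input
```

Slicing columns of one big matrix per head is equivalent to giving each head its own smaller projection matrix; implementations typically do it this way so all heads can share one batched matrix multiply.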
Typical configurations:
| Model | Layers | Heads |
|---|---|---|
| BERT-base | 12 | 12 |
| GPT-2 | 12 | 12 |
| GPT-4 (estimated) | ~120 | ~96 |
Causal masking
In language model pre-training, the model predicts the next token. It must not look at future tokens; that would be cheating. A causal mask blocks attention from position i to every position j > i, in practice by setting those scores to −∞ before the softmax so their weights come out as zero. The attention pattern becomes lower-triangular: each token can only attend to itself and earlier tokens.
Token 0 can see: [0]
Token 1 can see: [0, 1]
Token 2 can see: [0, 1, 2]
At inference, the same mask applies. The model generates one token at a time, left to right.
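A minimal NumPy sketch of the mask: scores for future positions are set to −∞ before the softmax, so their weights come out as exact zeros:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)

# Lower-triangular mask: position i may attend only to positions j <= i.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(mask, scores, -np.inf)

weights = softmax(scores)
print(np.round(weights, 2))
# Row 0 is [1. 0. 0. 0.]: token 0 can only see itself.
# Each later row spreads its weight over itself and earlier tokens only.
```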
Attention is O(n²)
For a sequence of length n, attention computes n² pairwise scores. At 128,000 tokens, that is roughly 16 billion score computations per layer. This is why extending context windows is expensive: the cost is quadratic in the sequence length. Efficient attention variants (FlashAttention, ring attention) restructure the computation to be more memory-efficient without changing the math.
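A quick back-of-the-envelope check of that growth, in plain Python:

```python
# Pairwise attention scores grow quadratically with context length.
for n in (1_000, 8_000, 32_000, 128_000):
    print(f"{n:>8,} tokens -> {n * n:>18,} scores per attention operation")
```

At 128,000 tokens this prints 16,384,000,000, the "roughly 16 billion" figure above, and that cost recurs in every layer.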