ai · level 4

Attention

Every token asks: which other tokens matter to me?

200 XP

Attention

Attention is the mechanism that lets every position in a sequence look at every other position and decide how much to borrow from it. It is what gives transformers their long-range understanding.

Analogy

Imagine a crowded cocktail party. You are trying to finish a sentence you started thirty seconds ago, and to do that you quickly glance around the room — mostly tuning people out, but leaning in a lot on the two friends who brought up the topic, and a little on the person who cracked the joke. Each word in a sentence does the same sweep of the whole conversation and weights every other voice before deciding what to say next. The party is the context; the attention is the selective listening.

The core idea

When processing token i, the model asks: which other tokens are relevant to understanding token i right now? Attention computes a weighted mixture of all other tokens' representations, where the weights come from learned compatibility scores.

Informally: "dog" in "the dog that chased the cat bit the man" needs to figure out that "bit" refers to it, not to "cat." Attention learns to route that connection.

Queries, keys, values

Each token produces three vectors from its embedding, via learned weight matrices:

  • Query (Q) — "what am I looking for?"
  • Key (K) — "what do I offer?"
  • Value (V) — "what do I contribute if selected?"

The attention weight from token i to token j comes from a scaled dot product, normalized with a softmax over all positions:

attention_weight(i, j) = softmax_j( Q_i · K_j / √d_k )

√d_k is a scaling factor that keeps the dot products from growing too large in high dimensions; without it, the softmax saturates and gradients vanish. The softmax turns the raw scores into a probability distribution over all positions.

The output at position i is:

output_i = Σ_j  attention_weight(i,j) × V_j

Every output is a weighted sum of all value vectors. High-weight tokens contribute more.
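In code, the two formulas above come out to a few lines. Here is a minimal NumPy sketch: single head, no masking, toy random inputs (the shapes and seed are illustrative, not taken from any real model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k) query/key matrices; V: (n, d_v) value matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) pairwise compatibility
    # Softmax over the key axis: each row becomes a distribution over positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights              # weighted sum of value vectors

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)        # (4, 8): one output vector per position
print(w.sum(axis=-1))   # each row of weights sums to 1
```

Note the subtraction of the row maximum before exponentiating: it changes nothing mathematically but keeps the softmax numerically stable.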

What it looks like

Imagine the sentence "The trophy didn't fit in the suitcase because it was too big."

To resolve "it," the model computes attention from the "it" token to all other tokens. The highest weights land on "trophy" — because "trophy" is too big to fit, not the suitcase. The attention pattern encodes the coreference.

From token   High attention to
"it"         "trophy" (0.72)
"fit"        "trophy", "suitcase" (0.40, 0.38)
"big"        "trophy" (0.65)

Multi-head attention

Running attention once gives one perspective. Multi-head attention runs h independent attention operations in parallel, each with its own Q/K/V weight matrices. The outputs are concatenated and projected back to the model dimension.

Different heads specialize. One head may track syntactic dependencies (subject → verb). Another may track coreference. Another may attend to domain-relevant keywords. The model learns which specialization is useful from data.
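The split-attend-concatenate-project flow can be sketched in NumPy. This is a rough sketch, assuming d_model is divisible by h and folding all per-head Q/K/V matrices into single d_model × d_model projections (a common implementation layout, not the only one):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d_model). Wq/Wk/Wv/Wo: (d_model, d_model). h = number of heads."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split each projection into h heads: (n, d_model) -> (h, n, d_head)
    split = lambda M: M.reshape(n, h, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                          # softmax per head
    heads = w @ Vh                                         # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ Wo                                     # project back

rng = np.random.default_rng(1)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)   # (5, 16): same shape in and out
```

Each head attends independently over its own d_head-dimensional slice, which is what lets different heads learn different relationships.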

Typical configurations:

Model               Layers   Heads
BERT-base           12       12
GPT-2 (small)       12       12
GPT-4 (estimated)   ~120     ~96

Causal masking

In language-model pre-training, the model predicts the next token, so it must not look at future tokens. A causal mask sets every score from position i to positions j > i to −∞ before the softmax, which drives those attention weights to zero. The attention pattern becomes lower-triangular: each token can only attend to itself and earlier tokens.

Token 0 can see: [0]
Token 1 can see: [0, 1]
Token 2 can see: [0, 1, 2]

At inference, the same mask applies. The model generates one token at a time, left to right.
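A minimal NumPy sketch of the mask, using uniform (all-zero) scores so the lower-triangular pattern is easy to read:

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (n, n) raw Q·K/√d_k scores. Returns masked softmax weights."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -np.inf, scores)          # future positions -> -inf
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)          # exp(-inf) = 0

w = causal_attention_weights(np.zeros((3, 3)))
print(np.round(w, 2))
# rows: [1, 0, 0], [0.5, 0.5, 0], [0.33, 0.33, 0.33]
```

Token 0 puts all its weight on itself, token 1 splits weight over positions 0 and 1, and so on, matching the triangle above.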

Attention is O(n²)

For a sequence of length n, attention computes pairwise scores. At 128,000 tokens, that is 16 billion score computations per layer. This is why extending context windows is expensive — quadratic cost in the sequence length. Efficient attention variants (FlashAttention, ring attention) restructure the computation to be more memory-efficient without changing the math.
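The 16-billion figure is just 128,000 squared. A one-liner confirms the quadratic growth:

```python
# Pairwise-score counts grow quadratically with context length
for n in (1_000, 8_000, 128_000):
    print(f"{n:>7} tokens -> {n * n:>14,} scores per layer per head")
# 128,000 tokens -> 16,384,000,000 scores, i.e. ~16 billion
```

Doubling the context quadruples the score count, which is why long contexts are costly even before considering the memory for storing keys and values.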