ai · level 5

Transformers

Stacked attention and FFN layers compose into a world model.

200 XP

The transformer is the architecture underlying every major language model since 2017. It stacks attention and feedforward layers into a pipeline that converts a token sequence into predictions.

Analogy

Think of a car assembly line. A bare chassis rolls in at one end and a finished vehicle rolls out the other, but it never goes back — it passes through dozens of stations, each doing one specialised job (doors, wiring, paint, inspection). Every station looks at the whole vehicle and adds its own contribution before the next station takes over. A transformer is the same: tokens enter on the left, pass through identical-looking stations in sequence, and each one refines the representation a little more before handing it on.

The pipeline

Input tokens travel through this sequence of operations:

Tokens → Embeddings + Positional Encoding
       → [Attention → Add & Norm → FFN → Add & Norm] × N layers
       → Output logits → Softmax → Next-token probabilities

Each bracketed group is one transformer layer (also called a block). Modern models stack 12 to 120+ layers.
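Here is a minimal sketch of that pipeline in PyTorch. The sizes are toy values, PyTorch's built-in TransformerEncoderLayer stands in for one block, and the causal mask makes it behave decoder-style; a real model differs in many details.

import torch
import torch.nn as nn

vocab_size, d_model, n_layers, seq_len = 1000, 64, 4, 16

token_emb = nn.Embedding(vocab_size, d_model)        # Tokens -> Embeddings
pos_emb = nn.Embedding(seq_len, d_model)             # + Positional Encoding (learned)
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=4 * d_model,
                               batch_first=True)
    for _ in range(n_layers)
)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))  # a batch of one token sequence
causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
x = token_emb(tokens) + pos_emb(torch.arange(seq_len))
for block in blocks:                                 # [Attention -> Add & Norm -> FFN -> Add & Norm] x N
    x = block(x, src_mask=causal)
logits = lm_head(x)                                  # Output logits
probs = logits.softmax(dim=-1)                       # Softmax -> next-token probabilities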

Inside one layer

Multi-head attention — every token attends to every other token (subject to causal masking). The output is a weighted mixture of value vectors.

Add & Norm (residual + layer normalization) — the attention output is added to the original input (residual connection), then normalized. This is critical: without residuals, gradients vanish in deep networks and training fails.

Feed-forward network (FFN) — a two-layer MLP applied independently to each position. Typically 4× the model's hidden dimension wide. This is where most parameters live. GPT-3's FFN layers hold ~60% of its 175B parameters.

Add & Norm again — same residual + normalize pattern after the FFN.

Residual connections

Every layer's output is:

output = LayerNorm(x + sublayer(x))

The original x always flows forward unchanged, so gradients during backpropagation can flow directly from output to input, bypassing the sublayer entirely. This is what makes training models with 100+ layers possible; without residuals, each layer would have to preserve everything the previous ones computed, and deep stacks routinely fail to train.
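A tiny demonstration of that gradient path (the norm is omitted for clarity, and the sublayer here deliberately contributes nothing):

import torch

x = torch.ones(4, requires_grad=True)
sublayer = lambda t: torch.zeros_like(t)   # a useless sublayer
out = x + sublayer(x)                      # the residual still carries x through
out.sum().backward()
print(x.grad)                              # tensor([1., 1., 1., 1.]): gradient reaches x intact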

Positional encoding

Attention has no built-in notion of order — it treats the input as a set, not a sequence. Positional encodings inject order information.

Sinusoidal (original) — fixed, mathematical. Each position gets a unique vector of sin/cos values at different frequencies. No learned parameters; works at any sequence length (sketched in code after this list).

Learned — a trainable embedding table, one vector per position. Used by BERT and GPT-2. Doesn't generalize to sequences longer than seen during training.

Rotary (RoPE) — encodes relative distance into the Q/K dot product via rotation matrices. Used by Llama, Mistral, Gemma. Generalizes well to long sequences.
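Here is a sketch of the sinusoidal variant, following the formula from the original transformer paper (assumes an even d_model):

import torch

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
    freq = base ** (-torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)    # even dimensions get sines
    pe[:, 1::2] = torch.cos(pos * freq)    # odd dimensions get cosines
    return pe

print(sinusoidal_pe(seq_len=128, d_model=64).shape)   # torch.Size([128, 64]), no learned parameters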

Encoder vs. decoder

Feature      Encoder                       Decoder
Attention    Bidirectional (sees all)      Causal (sees past only)
Output       Contextual representations    Next-token probabilities
Use case     Classification, embedding     Text generation
Examples     BERT, RoBERTa                 GPT, Llama, Claude
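The difference in attention pattern comes down to a mask. A minimal sketch of a causal mask for five positions (True marks a blocked connection; an encoder simply uses no mask):

import torch

mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)
print(mask.int())
# tensor([[0, 1, 1, 1, 1],
#         [0, 0, 1, 1, 1],
#         [0, 0, 0, 1, 1],
#         [0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 0]])
# Row i (a query position) may attend only to columns 0..i: the past and itself.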

Encoder-decoder models (T5, BART) use both: an encoder processes the input, and a decoder generates the output while attending to the encoder's representations through cross-attention. Good for translation and summarization.

Pure decoders (GPT lineage) dominate for general language modeling. Given enough scale, they handle tasks originally thought to require encoders.

Scale and the number of layers

Deeper models learn more abstract representations. Earlier layers tend to capture syntax and low-level patterns; deeper layers capture semantics, world knowledge, and reasoning. This progression from surface to abstract is consistent across model families and scales.

The number of layers is one of three primary scale dimensions:

  • Depth (layers)
  • Width (hidden dimension)
  • Heads (attention heads)

All three grow together in well-tuned models.
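Depth and width dominate the parameter count; heads partition the width rather than adding parameters. A common back-of-envelope estimate for non-embedding parameters (assuming the 4× FFN described earlier, and ignoring biases and norms):

# Each layer: ~4*d^2 weights in attention projections, ~8*d^2 in the FFN.
def approx_params(depth, d_model):
    return 12 * depth * d_model ** 2

print(approx_params(depth=96, d_model=12288) / 1e9)   # ~174, close to GPT-3's 175B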