Neural Networks
Differentiable functions, composed to approximate anything.
A neural network is a differentiable function — a chain of simple operations that, when composed, can approximate arbitrarily complex input-output mappings. Every transformer layer contains neural network components.
Analogy
Picture an old audio mixing board with hundreds of sliders, arranged in rows. A raw recording enters at the top row; each slider lets through a bit of the signal from a few sliders above it, and the final row spits out the finished song. No single slider is doing anything clever, but together they can shape silence into symphonies. "Training" the board is a sound engineer tweaking every slider until the output matches the reference track — and a modern model has billions of sliders.
Neurons and layers
A single neuron computes:
output = activation( w · x + b )
w is a learned weight vector, x is the input, b is a bias scalar, and activation is a nonlinear function. Put thousands of neurons side by side and you get a layer. Stack layers and you get a network.
A layer with n_in inputs and n_out outputs is parameterized by a weight matrix W of shape [n_out, n_in] and a bias vector of length n_out.
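As a minimal sketch of those shapes (NumPy, with ReLU standing in for the activation and arbitrary example sizes), a layer's forward pass is one matrix multiply, a bias add, and an elementwise nonlinearity:

import numpy as np

def layer_forward(x, W, b):
    # x: input vector of length n_in
    # W: weight matrix of shape [n_out, n_in]
    # b: bias vector of length n_out
    pre_activation = W @ x + b               # one dot product per neuron
    return np.maximum(0.0, pre_activation)   # ReLU activation

# Illustrative sizes only: 8 inputs feeding 4 neurons.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
b = np.zeros(4)
x = rng.normal(size=8)
print(layer_forward(x, W, b))   # 4 outputs, one per neuron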
Why nonlinearity matters
Without activation functions, any depth of layers collapses to a single linear transformation. You can compose as many matrices as you like — the product is still a matrix. Nonlinear activations break this; they let the network model curves, boundaries, and interactions that no linear function can capture.
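A quick numerical check of that claim, in NumPy with arbitrary sizes: two stacked linear layers with no activation produce exactly the same outputs as one layer whose matrix is the product of the two.

import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 3))   # first linear layer
W2 = rng.normal(size=(2, 5))   # second linear layer
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)     # depth-2 network, no activation
one_layer = (W2 @ W1) @ x      # single collapsed layer
print(np.allclose(two_layers, one_layer))   # True: the extra depth added nothing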
Common activation functions
ReLU (max(0, x)) — simple, fast, the default for FFNs in earlier models. Dead neuron problem: if a neuron's pre-activation is consistently negative, its output is always zero, so no gradient flows back and its weights stop updating.
GeLU (Gaussian Error Linear Unit) — smoother than ReLU, used in GPT-2, BERT, and most modern transformers. Approximated as x × sigmoid(1.702x). Slightly more expensive, empirically better.
SiLU / Swish (x × sigmoid(x)) — self-gated, used in Llama and PaLM FFNs.
Sigmoid — maps to (0, 1). Used in gates and binary outputs. Not used in deep hidden layers due to gradient saturation.
Softmax — converts a vector of logits to a probability distribution. Used at the output layer and inside attention.
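For concreteness, here is each of these as a short NumPy sketch; the GeLU line uses the sigmoid approximation mentioned above rather than the exact Gaussian CDF form:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu_approx(x):
    # sigmoid approximation: x * sigmoid(1.702 * x)
    return x * sigmoid(1.702 * x)

def silu(x):
    # also called Swish: x * sigmoid(x)
    return x * sigmoid(x)

def softmax(logits):
    # subtract the max for numerical stability; the output sums to 1
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()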
The feedforward network in a transformer
Each transformer block contains a two-layer FFN:
FFN(x) = activation(x W₁ + b₁) W₂ + b₂
The intermediate dimension (after W₁) is typically 4× the model hidden size. For a 4,096-dimensional model, W₁ is shape [4096, 16384] — 67M parameters in one weight matrix, just for this one component, in one layer.
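A toy NumPy sketch of that block, using the GeLU approximation and the 4× expansion described above (the dimensions and the 0.02 weight scale are illustrative, and real implementations operate on whole batches of token vectors):

import numpy as np

def ffn(x, W1, b1, W2, b2):
    # x: [d_model] token representation
    # W1: [d_model, d_ff] expands to the intermediate dimension
    # W2: [d_ff, d_model] projects back down
    hidden = x @ W1 + b1
    hidden = hidden * (1.0 / (1.0 + np.exp(-1.702 * hidden)))   # GeLU approximation
    return hidden @ W2 + b2

d_model, d_ff = 8, 32   # a toy 4x expansion; a 4,096-dim model would use 16,384
rng = np.random.default_rng(2)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b1, b2 = np.zeros(d_ff), np.zeros(d_model)
x = rng.normal(size=d_model)
print(ffn(x, W1, b1, W2, b2).shape)   # (8,): same shape as the input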
Backpropagation
Training adjusts weights by computing gradients of the loss with respect to every parameter. The chain rule propagates error signals backward through every operation — activation functions, matrix multiplications, additions. Each parameter nudges toward a value that reduces loss.
The key insight: every operation in a neural network must be differentiable (or have a well-defined subgradient). This is why activation function choice matters — you need the gradient to flow through it.
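A minimal sketch of the chain rule at work, for a single ReLU neuron with a squared-error loss and made-up numbers; an autograd framework performs the same bookkeeping automatically across millions of operations:

import numpy as np

# Forward pass: one neuron, squared-error loss against a target.
x = np.array([1.0, -2.0, 3.0])
w = np.array([0.5, 0.1, 0.3])
b = 0.2
target = 1.0

z = w @ x + b                  # pre-activation
a = max(0.0, z)                # ReLU
loss = (a - target) ** 2

# Backward pass: apply the chain rule one operation at a time.
dloss_da = 2.0 * (a - target)          # d loss / d activation
da_dz = 1.0 if z > 0 else 0.0          # ReLU subgradient
dloss_dz = dloss_da * da_dz
dloss_dw = dloss_dz * x                # gradient w.r.t. each weight
dloss_db = dloss_dz                    # gradient w.r.t. the bias

# One gradient-descent step nudges the parameters toward lower loss.
lr = 0.1
w -= lr * dloss_dw
b -= lr * dloss_db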
Why depth works
Shallow networks can approximate any function (universal approximation theorem) but may require exponentially many neurons. Deep networks learn compositional representations: early layers detect edges, middle layers detect shapes, deep layers detect objects (or, in language, characters → words → syntax → semantics → pragmatics).
Each layer transforms the representation into a space where the next layer's job is easier. The transformer's residual connections preserve direct paths while the layers specialize.
Initialization
Starting with all-zero weights is fatal: every neuron computes the same gradient, and the network never breaks symmetry. Starting with too-large weights causes exploding gradients. Common strategies:
- Kaiming (He) initialization — scales variance by 2 / n_in, designed for ReLU
- Xavier (Glorot) initialization — scales variance by 2 / (n_in + n_out), for tanh and sigmoid
- Normal distribution scaled by model dimension — used in transformer language models
Good initialization means training converges faster and more reliably.
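A sketch of those three recipes in NumPy; the last one scales the standard deviation by 1/√d_model, which is one common reading of "scaled by model dimension", and exact recipes vary by model family:

import numpy as np

rng = np.random.default_rng(3)

def kaiming_normal(n_in, n_out):
    # Variance 2 / n_in, designed for ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def xavier_normal(n_in, n_out):
    # Variance 2 / (n_in + n_out), for tanh and sigmoid layers.
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def scaled_by_model_dim(d_model, n_out):
    # Standard deviation proportional to 1 / sqrt(d_model).
    return rng.normal(0.0, 1.0 / np.sqrt(d_model), size=(n_out, d_model))

W = kaiming_normal(1024, 4096)
print(W.std())   # close to sqrt(2 / 1024), about 0.044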