Scaling Laws

More compute, more data, more parameters — predictable gains.

Model performance follows predictable mathematical relationships with compute, parameters, and data. These relationships — scaling laws — let researchers forecast model quality before training and decide how to allocate compute budgets.

Analogy

Think of baking bread at scale. A bigger oven (more parameters), more dough (more data), and more time (more compute) all improve the bread, but not independently. Put a massive ball of dough in a tiny oven and the outside burns while the middle stays raw; crank a huge oven on a teaspoon of dough and you waste all the heat. Bakers learn the sweet-spot ratios from experience, and the Chinchilla paper is that recipe written down: for every extra litre of oven capacity, add roughly twenty litres of dough.

The Chinchilla finding

The 2022 DeepMind paper "Training Compute-Optimal Large Language Models" (Hoffmann et al.) found that prior large models were significantly undertrained relative to their parameter count. The optimal allocation given a fixed compute budget is:

N_opt ∝ C^0.5   (parameters)
D_opt ∝ C^0.5   (tokens)

For a fixed compute budget C, equal scaling of parameters and training tokens is roughly optimal. A rule of thumb: train on ~20 tokens per parameter.

Model        Parameters   Tokens trained
GPT-3        175B         300B  (~1.7 tokens/param)
Chinchilla   70B          1.4T  (~20 tokens/param)
Llama-3      70B          15T   (~214 tokens/param)

Llama-3 deliberately over-trains relative to compute-optimal: training compute is spent once, but inference cost recurs with every query, so a smaller model trained longer is cheaper and faster to serve at deployment time.
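
A minimal sketch of the allocation arithmetic, assuming the two rules of thumb above (C ≈ 6ND and D ≈ 20N): substituting D = 20N gives C ≈ 120 N^2, so N_opt ≈ (C/120)^0.5.

def chinchilla_optimal(compute_flops):
    """Roughly compute-optimal (parameters, tokens) for a FLOP budget,
    assuming C = 6*N*D and D = 20*N, hence C = 120*N**2."""
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: Chinchilla's own budget, C = 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs
n, d = chinchilla_optimal(5.88e23)
print(f"N ≈ {n / 1e9:.0f}B parameters, D ≈ {d / 1e12:.1f}T tokens")
# N ≈ 70B parameters, D ≈ 1.4T tokens, recovering the paper's own model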

Power law behavior

Loss decreases as a smooth power law of compute:

L(C) ≈ A × C^(-α) + L_∞

α is around 0.05–0.1 for language modeling, and L_∞ is the irreducible loss floor. The curve is a straight line on a log-log plot: every 10× increase in compute shrinks the reducible loss (L − L_∞) by a fixed factor. There is no known cliff or saturation within the compute ranges tested so far.
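
A quick numeric illustration of that log-linearity. The constants A, α, and L_∞ below are invented values within the reported range, not fitted numbers from any paper:

A, ALPHA, L_INF = 30.0, 0.07, 1.7   # illustrative constants, not fitted

def loss(compute_flops):
    """Power-law loss curve L(C) = A * C**(-alpha) + L_inf."""
    return A * compute_flops ** (-ALPHA) + L_INF

for c in (1e21, 1e22, 1e23, 1e24):
    print(f"C = {c:.0e}: L = {loss(c):.3f}")
# Each 10x step in compute multiplies the reducible loss (L - L_inf)
# by 10**(-ALPHA) ≈ 0.85: a straight line on a log-log plot.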

Emergent abilities

Some capabilities appear to be absent at small scale and suddenly present at large scale. The sharpest examples:

Capability                                  Approximate threshold
Multi-step arithmetic                       ~10B parameters
Chain-of-thought reasoning                  ~100B parameters
Instruction following without fine-tuning   ~100B parameters

These look like phase transitions: the loss decreases smoothly, but certain structured tasks go from chance-level to strong performance within a narrow parameter range. This is partly a measurement artifact (all-or-nothing metrics such as exact-match accuracy can make smooth underlying improvement look abrupt), but the pattern is real.
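
A toy calculation showing the metric side of this, with invented per-token accuracies: if per-token accuracy improves smoothly with scale but the benchmark requires an exact 20-token answer, the scored metric jumps.

# Smoothly improving per-token accuracy (values invented for illustration)
for per_token in (0.80, 0.90, 0.95, 0.98, 0.99):
    exact_match = per_token ** 20   # all 20 answer tokens must be correct
    print(f"per-token {per_token:.2f} -> exact-match {exact_match:.3f}")
# 0.80 -> 0.012 (near chance), 0.99 -> 0.818: an all-or-nothing metric
# turns a smooth underlying improvement into an apparent phase transition.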

Scaling dimensions

Three levers, all contributing:

Parameters (N) — more weights, more capacity to memorize patterns. Primarily controlled by widening layers (larger hidden dimension) or adding layers (more depth).

Training tokens (D) — more data, more diverse patterns learned. Language quality, knowledge breadth, and reasoning all improve with more tokens.

Compute (C) — determines the achievable (N, D) frontier. C ≈ 6ND for dense transformers.

Increasing one without the others eventually hits diminishing returns. Optimal scaling balances all three.
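
Plugging the table's numbers into C ≈ 6ND reproduces public training-compute estimates (for instance, roughly 3.1e23 FLOPs for GPT-3):

def training_flops(n_params, n_tokens):
    """C ≈ 6*N*D: ~2N FLOPs/token forward, ~4N FLOPs/token backward."""
    return 6 * n_params * n_tokens

print(f"GPT-3:       {training_flops(175e9, 300e9):.2e} FLOPs")   # ~3.15e23
print(f"Chinchilla:  {training_flops(70e9, 1.4e12):.2e} FLOPs")   # ~5.88e23
print(f"Llama-3 70B: {training_flops(70e9, 15e12):.2e} FLOPs")    # ~6.30e24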

Inference cost

Larger models are better but slower and more expensive to run. A 7B-parameter model fits on a single consumer GPU (about 14 GB of weights at 16-bit precision). A 70B model requires 4–8 datacenter GPUs (about 140 GB of weights). A 405B model requires a multi-node cluster.

This drives the interest in "inference-efficient" training: train a smaller model on more tokens to match the quality of a larger undertrained model, then serve the smaller one cheaply.
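
A back-of-the-envelope comparison of serving cost, assuming the common approximation of ~2N FLOPs per generated token for a dense model:

def inference_flops_per_token(n_params):
    """~2 FLOPs per parameter per generated token (dense model)."""
    return 2 * n_params

for name, n in (("7B", 7e9), ("70B", 70e9), ("405B", 405e9)):
    print(f"{name:>4}: {inference_flops_per_token(n):.1e} FLOPs/token")
# The 405B model costs ~58x more compute per token than the 7B. If a
# smaller, longer-trained model matches the quality, every served token
# is cheaper, and the extra training compute pays for itself.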

Beyond dense transformers

Mixture of Experts (MoE) — partition the FFN into many parallel "expert" networks and route each token to a small subset (often the top 1–2 of 8–64 experts). Total parameters grow large but compute per token stays roughly constant. Mixtral is openly an MoE; GPT-4 and Gemini are widely believed to use MoE variants.

MoE models can exceed 400B total parameters while spending per-token compute comparable to a ~20B dense model, a major efficiency win.
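
A minimal top-2 routing sketch in NumPy (toy dimensions, not any real model's configuration): every expert's parameters exist, but each token only runs through its top-k experts.

import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k, n_tokens = 16, 8, 2, 4

router_w = rng.normal(size=(d_model, n_experts))            # routing weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
tokens = rng.normal(size=(n_tokens, d_model))

logits = tokens @ router_w                       # (n_tokens, n_experts)
chosen = np.argsort(logits, axis=1)[:, -top_k:]  # top-k experts per token
output = np.zeros_like(tokens)
for t in range(n_tokens):
    scores = logits[t, chosen[t]]
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over top-k
    for w, e in zip(weights, chosen[t]):
        output[t] += w * (tokens[t] @ experts[e])     # only top_k of 8 run

# Compute per token scales with top_k expert matmuls, while total
# parameters scale with n_experts: the source of the MoE efficiency win.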