Fine-tuning
Pre-train for knowledge. Fine-tune for behaviour.
Pre-training produces a model that completes text. Fine-tuning shapes it into a model that follows instructions, stays in role, and produces the kind of responses that humans find helpful. The two stages are distinct and both matter.
Analogy
Think of hiring a new receptionist. You do not teach them the English language — they arrive fluent, having absorbed it over a lifetime. What you actually train is the job: how to answer the phone, whose call goes where, what counts as a polite escalation, how to handle the rude customer. The general knowledge was pre-trained by growing up; the behaviour on the job is fine-tuned in their first two weeks with your specific scripts and corrections.
The three-stage pipeline
Stage 1: Pre-training
Train on a massive, diverse corpus — Common Crawl, books, code, scientific papers. The model learns language, facts, reasoning patterns, and world knowledge. This is compute-expensive and rarely done outside of large labs.
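To make the objective concrete, here is a minimal sketch of the pre-training loss, assuming a PyTorch-style causal language model that maps token ids to next-token logits; `model` and `token_ids` are hypothetical stand-ins, not a specific library API:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(model, token_ids):
    # token_ids: [batch, seq_len] integers drawn from a large text corpus.
    # Next-token prediction: each position predicts the token that follows it.
    logits = model(token_ids[:, :-1])          # [batch, seq_len-1, vocab]
    targets = token_ids[:, 1:]                 # the same sequence shifted by one
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```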
Stage 2: Supervised Fine-tuning (SFT)
Show the pre-trained model thousands to millions of (prompt, ideal response) pairs written or curated by humans. Train with next-token prediction on the response given the prompt. The model learns to format answers, follow instructions, and adopt a tone.
SFT is inexpensive relative to pre-training. It shifts the model from "text completer" to "instruction follower" — but it can only teach what is in the demonstration data.
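A sketch of the SFT loss under one common setup, assuming each example is the prompt and response token ids concatenated into `input_ids` with the prompt length known; masking the prompt positions so only response tokens contribute to the loss is a frequent (though not universal) choice:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_len):
    # input_ids: [batch, seq_len] = prompt tokens followed by response tokens.
    # Same next-token objective as pre-training, but masked so that only
    # response positions are scored -- the model learns to answer the prompt,
    # not to reproduce it.
    logits = model(input_ids[:, :-1])               # [batch, seq_len-1, vocab]
    targets = input_ids[:, 1:].clone()
    targets[:, : prompt_len - 1] = -100             # ignore prompt positions
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```

For simplicity this assumes one prompt length per batch; in practice each example carries its own mask.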
Stage 3: Reinforcement Learning from Human Feedback (RLHF)
Collect human preferences: show two model responses to the same prompt, ask a human which is better. Train a reward model on these comparisons — a classifier that predicts which response a human would prefer.
Then use the reward model as a signal to further tune the language model with reinforcement learning (PPO or similar), nudging responses toward higher predicted preference. The language model optimizes for human approval rather than pure next-token accuracy.
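A sketch of the reward-model training objective on a single (chosen, rejected) pair, assuming `reward_model` maps the token ids of prompt plus response to one scalar score per sequence; this is the standard pairwise comparison loss, not any particular lab's exact recipe:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # chosen_ids / rejected_ids: [batch, seq_len] token ids for prompt + response.
    r_chosen = reward_model(chosen_ids)      # [batch] scalar scores
    r_rejected = reward_model(rejected_ids)  # [batch]
    # Push the chosen response to score higher than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```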
Why RLHF matters
SFT teaches the model to imitate demonstrations. RLHF teaches it to optimize for a target quality signal. The difference is large:
- SFT on "explain quantum entanglement" produces whatever the demonstrator wrote.
- RLHF on "explain quantum entanglement" produces responses that a human rater would rank as clear, accurate, and helpful — even for phrasings and topics not in the demonstration data.
Reward hacking
The reward model is an approximation. Given enough RL pressure, the language model finds ways to produce outputs that score highly on the reward model without actually being good — verbose flattery, sycophancy, confident confabulation. Mitigating this requires:
- KL penalty: penalize the RL policy for diverging too far from the SFT policy (sketched after this list)
- Regular reward model updates from fresh preference data
- Careful reward model design and evaluation
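A sketch of how the KL penalty enters the RL reward, assuming you have per-token log-probabilities of the sampled response under both the current policy and the frozen SFT (reference) policy; `beta` is a tuning knob, and a real PPO setup adds clipping, a value function, and advantage estimation on top of this:

```python
def penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    # rm_score: scalar reward-model score for the full response.
    # policy_logprobs / ref_logprobs: per-token log-probs of the response
    # under the RL policy and the frozen SFT policy.
    # The KL term punishes responses the SFT model finds very unlikely,
    # limiting how far RL can drift toward reward hacking.
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl
```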
Parameter-efficient fine-tuning (LoRA)
Full fine-tuning updates every parameter. For a 70B-parameter model, that requires enormous GPU memory and compute. LoRA (Low-Rank Adaptation) adds small trainable matrices alongside frozen pre-trained weights:
W_new = W_pretrained + B × A
B and A are low-rank matrices. If the hidden dimension is 4096 and the LoRA rank is 16, B is [4096, 16] and A is [16, 4096] — 131K parameters instead of 16.7M. Only these small matrices are trained. At inference, B × A can be absorbed into the weight, so there is no extra latency.
LoRA typically trains 0.1–1% as many parameters as full fine-tuning, with comparable results.
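A minimal LoRA linear layer sketch in PyTorch; the alpha/rank scaling follows the original paper, but the initialization details and which layers you wrap vary by implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # freeze the pre-trained weight
        d_out, d_in = base.weight.shape
        # Only these two small matrices are trained.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # [r, d_in]
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # [d_out, r], starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # W_new x = W_pretrained x + scale * (B A) x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

For a 4096 × 4096 weight at rank 16 this trains 2 × 4096 × 16 ≈ 131K parameters, matching the numbers above.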
DPO (Direct Preference Optimization)
DPO bypasses the explicit reward model. It directly fine-tunes the language model on preference pairs (chosen, rejected response) using a closed-form objective derived from the RLHF formulation. Simpler to implement, more stable to train, increasingly common.
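A sketch of the DPO objective for a batch of preference pairs, assuming you have already computed the summed log-probability of each chosen and rejected response under both the policy being trained and a frozen reference (SFT) model; `beta` is the usual DPO temperature:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Each argument: [batch] summed log-prob of a full response.
    # Chosen responses should gain probability relative to the reference;
    # rejected responses should lose it.
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```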
What fine-tuning does and doesn't do
Fine-tuning adjusts the model's behaviour, tone, and instruction-following. It does not add knowledge that wasn't in pre-training. A model fine-tuned on customer support transcripts will be better at support formatting and tone — but it cannot reliably learn new factual information from those transcripts. Use retrieval (RAG) to add knowledge, not fine-tuning.