A pretrained LLM is good at next-token prediction, but predicting the next token is not the same as being helpful, harmless, or honest. RLHF (Reinforcement Learning from Human Feedback) is the three-stage process that turns a raw language model into an assistant such as ChatGPT, Claude, or Gemini. The model is first fine-tuned on human-written demonstrations; humans then rank model outputs by quality and a reward model is trained to predict those rankings; finally the LLM is fine-tuned via RL to maximise predicted reward. This is a large part of what made AI assistants suddenly feel useful.
Three stages
- 1. Supervised Fine-Tuning (SFT): collect pairs of (instruction, ideal response) written by human contractors. Fine-tune the pretrained LLM on these. Teaches the model the format and tone of helpful responses.
- 2. Reward Modelling (RM): for each instruction, generate several model responses. Have humans rank them. Train a separate model (the reward model, often smaller than the LLM itself) to output a scalar score that predicts how humans would rank a response, given only the instruction and the response text.
- 3. PPO Fine-Tuning: use Proximal Policy Optimization (or a similar RL algorithm) to fine-tune the SFT model. Reward = score from the reward model. A KL penalty keeps the policy close to the SFT model, which limits (but does not eliminate) reward hacking.
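Stages 2 and 3 each reduce to a small formula, sketched below in plain Python. This is an illustration under assumptions, not a description of any particular lab's implementation: the pairwise (Bradley-Terry style) loss is one common way to train a reward model from rankings, and the function names, the `beta` default, and the use of whole-response log-probabilities are all hypothetical choices made for the example.

```python
import math


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Reward-model loss for one human comparison (assumed Bradley-Terry form).

    Computes -log(sigmoid(score_chosen - score_rejected)), which is small when
    the reward model scores the human-preferred response higher.
    """
    margin = score_chosen - score_rejected
    # Numerically stable log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(rm_score: float, logp_policy: float, logp_sft: float,
                     beta: float = 0.1) -> float:
    """Reward passed to the RL step: reward-model score minus a KL penalty.

    logp_policy and logp_sft are the log-probabilities that the current policy
    and the frozen SFT model assign to the same sampled response. Their
    difference is a single-sample estimate of the KL divergence between the
    two models. beta is an assumed penalty weight.
    """
    kl_estimate = logp_policy - logp_sft
    return rm_score - beta * kl_estimate
```

With equal scores, `pairwise_rm_loss` returns log 2 (the reward model is undecided), and it falls toward zero as the preferred response pulls ahead. In `kl_shaped_reward`, a response that the policy now finds much more likely than the SFT model did has part of its reward taken away, which is how the penalty discourages drifting far from the SFT behaviour.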
Why it's necessary
Pretraining objective: maximise log p(next token | context) on internet text
→ model learns to imitate the average internet writer
→ gives plausible but often unhelpful, biased, or dangerous text
RLHF objective: maximise expected human preference score
→ model learns to do what humans actually want from an assistant
→ refuses harmful requests, is more concise, follows instructions
Without RLHF: GPT-3 (raw, 2020) was a curiosity for researchers.
With RLHF: ChatGPT (Nov 2022) was widely reported as the fastest-growing consumer application up to that point.
Different objectives produce dramatically different behaviour from the same base model.
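The two objectives above can be written out more formally. The notation is an assumption introduced for this sketch and does not come from the original text: $p_\theta$ and $\pi_\theta$ denote the model being trained, $r_\phi$ the reward model, $\pi_{\text{SFT}}$ the frozen SFT model, $q$ a prompt, $y$ a sampled response, and $\beta$ the weight of the KL penalty described in stage 3.

```latex
% Pretraining: maximise next-token log-likelihood on internet text
\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D}_{\text{text}}}
  \left[ \sum_{t} \log p_{\theta}(x_t \mid x_{<t}) \right]

% RLHF (stage 3): maximise predicted preference score, with a KL penalty to the SFT model
\max_{\theta} \;
  \mathbb{E}_{q \sim \mathcal{D}_{\text{prompts}},\; y \sim \pi_{\theta}(\cdot \mid q)}
    \left[ r_{\phi}(q, y) \right]
  \; - \; \beta \,
  \mathbb{E}_{q \sim \mathcal{D}_{\text{prompts}}}
    \left[ \mathrm{KL}\!\left( \pi_{\theta}(\cdot \mid q) \,\|\, \pi_{\text{SFT}}(\cdot \mid q) \right) \right]
```

The first objective rewards matching whatever text comes next in the corpus; the second rewards whole responses that the reward model predicts humans would prefer, which is why the resulting behaviour differs so much.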
Failure modes and refinements
- Reward hacking: the model learns to game the reward model rather than improve actual quality (for example, verbose or sycophantic outputs that the reward model scores highly even though they are not genuinely better).
- Distribution shift: as the policy changes during PPO, it produces outputs unlike those the reward model was trained on, so the reward model's predictions become unreliable.
- Sycophancy: agreeing with users regardless of correctness (humans often rate confirmation higher than correction).
- DPO (Direct Preference Optimization, 2023): an alternative to PPO that directly optimises preferences without a separate reward model — simpler and often as effective.
- Constitutional AI (Anthropic): use the model itself to critique and revise its outputs guided by written principles, reducing reliance on human labellers.
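To make the DPO item above concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed log-probabilities of the preferred and rejected responses under both the policy being trained and a frozen reference model (typically the SFT model); the function name, argument names, and the `beta` default are hypothetical.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    The policy's log-ratio against the reference model acts as an implicit
    reward, so no separate reward model is trained. The loss is
    -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy is identical to the reference model, both log-ratios are zero and the loss is log 2. The loss decreases as the policy raises the probability of the preferred response, or lowers that of the rejected one, relative to the reference.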