Learn RLHF — Human Feedback Training - Free Visual AI and ML Lesson

A pretrained LLM is good at next-token prediction, but predicting the next token is not the same as being helpful, harmless, or honest. RLHF (Reinforcement Learning from Human Feedback) is the three-stage process used to turn a raw language model into an assistant such as ChatGPT, Claude, or Gemini. Humans rank model outputs by quality; a reward model is trained to predict those rankings; then the LLM is fine-tuned with RL to maximise the predicted reward. This is a large part of why AI assistants began to follow instructions reliably enough to feel useful.

Three stages

1. Supervised Fine-Tuning (SFT): collect pairs of (instruction, ideal response) written by human contractors. Fine-tune the pretrained LLM on these. Teaches the model the format and tone of helpful responses.
2. Reward Modelling (RM): for each instruction, generate several model responses. Have humans rank them. Train a separate model (the reward model, often initialised from the SFT model with a scalar output head) to predict, from the text alone, how humans would rank an output.
3. PPO Fine-Tuning: use Proximal Policy Optimization (or a similar RL algorithm) to fine-tune the SFT model. Reward = score from the reward model. A KL penalty keeps the policy close to the SFT model, which limits (but does not eliminate) reward hacking.
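
Stages 2 and 3 can be made concrete with a minimal Python sketch. It assumes the common pairwise (Bradley-Terry style) formulation of the reward-model loss and a sampled log-probability estimate of the KL term; the function names, the argument layout, and the default beta = 0.1 are illustrative assumptions, not taken from any particular library.

```python
import math


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Stage 2 (assumed form): -log sigmoid(score_chosen - score_rejected).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it gets the order wrong.
    """
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Stage 3 (assumed form): reward model score minus a KL penalty.

    logp_policy and logp_sft are per-token log-probabilities of the SAME
    sampled response under the current policy and the frozen SFT model.
    Their summed difference is a single-sample estimate of the KL divergence.
    """
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_sft))
    return rm_score - beta * kl_estimate


# When the policy has not moved away from the SFT model, the penalty is zero.
print(kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))
```

The further the policy drifts from the SFT model on a response, the more of the reward-model score is cancelled, which is how the KL term discourages chasing reward-model quirks.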

Why it's necessary

Pretraining objective: maximise log p(next token | context) on internet text
  → model learns to imitate the average internet writer
  → gives plausible but often unhelpful, biased, or dangerous text

RLHF objective: maximise the reward model's prediction of human preference, minus a KL penalty
  → model learns to do what humans actually want from an assistant
  → refuses harmful requests, follows instructions, stays on topic
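
Written out as formulas, the two objectives look roughly as follows. This is a sketch in the commonly used notation, and every symbol is an assumption introduced here: \theta for the LLM parameters, \mathcal{D} for the pretraining corpus, \mathcal{P} for the prompt distribution, r_\phi for the reward model, \pi_{\mathrm{SFT}} for the frozen stage-1 model, and \beta for the KL penalty weight.

```latex
\begin{align}
  % Pretraining: maximise the log-likelihood of each next token
  \text{Pretraining:}\quad
    & \max_{\theta}\;
      \mathbb{E}_{x \sim \mathcal{D}}
      \Big[ \sum_{t} \log p_{\theta}(x_t \mid x_{<t}) \Big] \\
  % RLHF: maximise predicted reward while staying close to the SFT model
  \text{RLHF:}\quad
    & \max_{\theta}\;
      \mathbb{E}_{x \sim \mathcal{P}}
      \Big[
        \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)} \big[ r_{\phi}(x, y) \big]
        - \beta\, \mathrm{KL}\big( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \big)
      \Big]
\end{align}
```

The first objective scores the model on text that already exists; the second scores it on responses y that the model itself generates, which is why RL is needed to optimise it.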

Without RLHF: GPT-3 (raw, 2020) was a curiosity for researchers.
With RLHF: ChatGPT (Nov 2022) was widely reported to be the fastest-growing consumer application up to that point.

Different objectives produce dramatically different behaviour from the same base model

Failure modes and refinements

Reward hacking: the model learns to game the reward model rather than improve actual quality (for example, verbose or flattering outputs that the reward model scores highly even when a careful human would not prefer them).
Distribution shift: as the policy changes during PPO, it produces outputs the reward model never trained on — predictions become unreliable.
Sycophancy: agreeing with users regardless of correctness (humans often rate confirmation higher than correction).
DPO (Direct Preference Optimization, 2023): an alternative to PPO that directly optimises preferences without a separate reward model — simpler and often as effective.
Constitutional AI (Anthropic): use the model itself to critique and revise its outputs guided by written principles, reducing reliance on human labellers.
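
The DPO entry above can be illustrated with a minimal sketch of its per-pair loss, assuming the standard formulation in which a frozen reference model (typically the SFT model) takes the place of the reward model. The function name, argument names, and default beta = 0.1 are illustrative assumptions; each argument is the total log-probability of a whole response.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (assumed standard form).

    loss = -log sigmoid(beta * ((logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected)))
    """
    # How much more likely each response is under the policy than the reference.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is zero.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the human-preferred response.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline, improved)
```

No reward model and no sampling loop appear here: the preference pairs are used directly as a supervised signal, which is the sense in which DPO is simpler than PPO.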
