
Learn RLHF — Human Feedback Training


Last updated: 2026-05-13.

A pretrained LLM is good at next-token prediction, but predicting the next token is not the same as being helpful, harmless, or honest. RLHF (Reinforcement Learning from Human Feedback) is the post-training process, usually described in three stages, that turns a raw language model into an assistant such as ChatGPT, Claude, or Gemini. The model is first fine-tuned on human-written example responses; humans then rank model outputs by quality and a reward model is trained to predict those rankings; finally the LLM is fine-tuned via RL to maximise the predicted reward. This shift from imitating text to optimising for human preferences is a large part of why AI assistants began to feel useful.

Three stages

  • 1. Supervised Fine-Tuning (SFT): collect pairs of (instruction, ideal response) written by human contractors. Fine-tune the pretrained LLM on these. Teaches the model the format and tone of helpful responses.
  • 2. Reward Modelling (RM): for each instruction, generate several model responses. Have humans rank them. Train a separate model (the reward model, often initialised from the SFT model) to predict 'how would humans rank this output?' from the text alone.
  • 3. PPO Fine-Tuning: use Proximal Policy Optimization (or a similar RL algorithm) to fine-tune the SFT model. Reward = score from the reward model. A KL penalty against the SFT model keeps the policy from drifting too far from it, which limits (but does not eliminate) reward hacking.
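
Stage 2 can be made concrete with a small sketch. A common choice, assumed here rather than stated in this lesson, is to break each human ranking into pairs and train the reward model with a Bradley-Terry style pairwise loss. The function names are illustrative, and the plain float scores stand in for the outputs of a real reward model.

```python
import math

def ranking_to_pairs(ranked_responses):
    """Expand a human ranking (best first) into (preferred, dispreferred) pairs."""
    pairs = []
    for i in range(len(ranked_responses)):
        for j in range(i + 1, len(ranked_responses)):
            pairs.append((ranked_responses[i], ranked_responses[j]))
    return pairs

def pairwise_preference_loss(score_preferred, score_dispreferred):
    """-log(sigmoid(score_preferred - score_dispreferred)).

    Small when the reward model scores the human-preferred response higher,
    large when it scores the dispreferred response higher.
    """
    margin = score_preferred - score_dispreferred
    # Numerically stable form of log(1 + exp(-margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical example: three responses ranked best to worst by a labeller
pairs = ranking_to_pairs(["response_a", "response_b", "response_c"])
agrees = pairwise_preference_loss(2.0, 0.0)     # reward model agrees with the human
disagrees = pairwise_preference_loss(0.0, 2.0)  # reward model disagrees
```

Averaging this loss over all pairs and minimising it by gradient descent pushes the reward model to reproduce the human ordering. In this sketch a ranking of k responses yields k(k-1)/2 training pairs.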

Why it's necessary

Pretraining objective: maximise log p(next token | context) on internet text
  → model learns to imitate the average internet writer
  → gives plausible but often unhelpful, biased, or dangerous text

RLHF objective: maximise expected human preference score
  → model learns to do what humans actually want from an assistant
  → refuses harmful requests, stays on task, follows instructions
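
The KL penalty from stage 3 is often folded into the reward that the RL step maximises. The sketch below assumes one common formulation: the reward-model score minus beta times a sample-based estimate of the KL divergence between the current policy and the frozen SFT model. The function name, the default beta, and the per-token log-probability inputs are illustrative assumptions, not a specific library's API.

```python
def kl_penalised_reward(reward_model_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Reward for one sampled response, penalised for drifting from the SFT model.

    policy_logprobs / sft_logprobs: log-probability of each generated token
    under the current policy and under the frozen SFT model, respectively.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("both log-prob lists must cover the same tokens")
    # Sum over tokens of log pi(token) - log pi_sft(token): a single-sample
    # estimate of KL(policy || SFT) for this response
    kl_estimate = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return reward_model_score - beta * kl_estimate

# Policy identical to the SFT model: no penalty is applied
unchanged = kl_penalised_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
# Policy has drifted toward this response: part of the reward is taken back
drifted = kl_penalised_reward(1.0, [-0.5, -0.5], [-1.0, -1.5])
```

A larger beta keeps the policy closer to the SFT model at the cost of a smaller gain in reward-model score; a smaller beta allows more drift and so more room for reward hacking.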

Without RLHF: GPT-3 (raw, 2020) was mostly a curiosity for researchers and developers.
With RLHF: ChatGPT (Nov 2022) was widely reported as the fastest-growing consumer application up to that time.

Different objectives produce dramatically different behaviour from the same base model.

Failure modes and refinements

  • Reward hacking: the model learns to game the reward model rather than improve actual quality (for example, verbose or sycophantic outputs that the reward model scores highly).
  • Distribution shift: as the policy changes during PPO, it produces outputs unlike those the reward model was trained on, so the reward model's predictions become unreliable.
  • Sycophancy: agreeing with users regardless of correctness (humans often rate confirmation higher than correction).
  • DPO (Direct Preference Optimization, 2023): an alternative to PPO that optimises the policy directly on preference pairs with a classification-style loss, without a separate reward model or an RL loop. Simpler to implement and often as effective.
  • Constitutional AI (Anthropic): use the model itself to critique and revise its outputs guided by written principles, reducing reliance on human labellers.
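
To show how DPO gets by without a reward model, here is a minimal sketch of its loss for a single preference pair, following the formulation in the 2023 DPO paper. The function name, argument names, and beta value are illustrative; each log-probability is assumed to be the sum over the tokens of a whole response.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt.

    The policy is the model being trained; the reference is a frozen copy
    of the starting (SFT) model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy equal to the reference: the loss starts at log(2)
at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the probability of the human-preferred response
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

The log-ratios play the role of an implicit reward, so the same pairwise comparison used to train a reward model is applied to the policy itself. The reference model still acts as an anchor, much as the KL penalty does in PPO.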

Practice questions

  1. Why isn't pretraining alone enough to make a useful AI assistant?
  2. What is the role of the reward model in RLHF?
  3. What is reward hacking in RLHF?
  4. What is an alternative to PPO that optimises preferences without a separate reward model?
