Learn How ChatGPT Works

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

ChatGPT feels magical, but it is the careful layering of four well-understood stages: pretraining on a huge corpus of text, supervised fine-tuning on curated assistant dialogues, reinforcement learning from human feedback (RLHF) for alignment, and token-by-token inference at runtime. Each training stage adds one skill. By the end you have a system that writes one token at a time, fast enough to feel like a real conversation.

Stage 1 — Pretraining

The model sees trillions of tokens of web text, books, code, and conversations. Its single training task: predict the next token. This forces it to learn grammar, facts, reasoning patterns, programming idioms, and dozens of languages — all just to do next-token prediction well. The base model after this stage is a knowledgeable but unhelpful text completer.
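The next-token objective described above can be sketched in a few lines. This is a minimal illustration, not the training code of any real model: the three-token vocabulary, the hand-written logits, and the function names are all assumptions made for the example. It shows the average cross-entropy loss that pretraining tries to drive down.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds one score per vocabulary entry, standing in
    for what a model would output after reading token_ids[: t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]  # the token that actually came next
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy data (hypothetical): vocabulary of 3 tokens, sequence [0, 2, 1].
token_ids = [0, 2, 1]
confident = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]
uniform = [[0.0, 0.0, 0.0]] * 3

print(next_token_loss(confident, token_ids))  # small: the right tokens are favoured
print(next_token_loss(uniform, token_ids))    # log(3): no better than guessing
```

A model that guesses uniformly pays log(vocabulary size) per token; everything the model learns during pretraining shows up as a loss below that baseline.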

Stage 2 — Supervised Fine-Tuning (SFT)

Human contractors write thousands of high-quality (instruction, ideal answer) pairs. The pretrained model is fine-tuned on these. After SFT the model has learned the 'assistant' format: it answers questions instead of continuing them, follows instructions, and adopts a helpful tone.
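One way to picture an SFT training example is as a single token sequence with a mask that marks which tokens count toward the loss. The sketch below assumes a common practice (train on the answer tokens only, not the prompt); the lesson does not say whether ChatGPT does this. The role markers, the whitespace "tokenizer", and the function name are hypothetical.

```python
def build_sft_example(instruction, answer):
    """Return (tokens, loss_mask) for one (instruction, ideal answer) pair.

    The role markers and whitespace splitting are hypothetical stand-ins
    for a real chat template and tokenizer.
    """
    prompt_tokens = ["<user>"] + instruction.split() + ["<assistant>"]
    answer_tokens = answer.split() + ["<end>"]
    tokens = prompt_tokens + answer_tokens
    # 0 = ignored by the loss (the prompt), 1 = trained on (the answer)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask

tokens, mask = build_sft_example("Name a prime number.", "Seven is prime.")
for tok, m in zip(tokens, mask):
    print(m, tok)
```

The training objective is still next-token prediction; only the data changes. Because every example ends the answer with a stop marker, the model also learns when to stop talking.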

Stage 3 — RLHF

Humans are shown two answers to the same prompt and pick the one they prefer. A reward model is trained on these preference pairs to predict which answer a human would choose. Then PPO (a reinforcement learning algorithm) updates the language model so that its answers score higher under the reward model, making it more helpful, less harmful, and more honest.
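The reward-model step can be illustrated with a pairwise loss. The lesson only says the reward model is trained on preference pairs; the specific formula below, -log(sigmoid(r_chosen - r_rejected)), is a commonly used Bradley-Terry style formulation and is assumed here for illustration. The scalar rewards are hand-picked numbers rather than outputs of a real network, and the PPO update itself is not shown.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    diff = reward_chosen - reward_rejected
    # log(1 + exp(-diff)), written so exp() never overflows
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def preference_probability(reward_chosen, reward_rejected):
    """Probability the reward model gives to the human-preferred answer."""
    return math.exp(-preference_loss(reward_chosen, reward_rejected))

# Hypothetical scores for the preferred and the rejected answer.
print(preference_loss(2.0, 0.0))         # small: the model agrees with the human
print(preference_loss(0.0, 2.0))         # large: the model disagrees
print(preference_probability(0.0, 0.0))  # 0.5: the model has no opinion yet
```

Minimizing this loss pushes the score of the preferred answer above the score of the rejected one, which is what lets the reward model stand in for human raters during the reinforcement learning step.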

Stage 4 — Inference (runtime)

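At runtime, ChatGPT computes a probability distribution over the next token, samples one token from it, appends that token to the text, and repeats until it produces a stop token or reaches a length limit. The chat history acts as the prompt. Temperature controls randomness: at 0 the model always picks the most likely token (greedy decoding), at 1 it samples from its distribution unchanged, and higher values flatten the distribution and make the output more varied. Streaming shows tokens as they are produced, which is why the answer feels like it is being typed live.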
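The predict, sample, append, repeat loop can be sketched as a small generator. Everything here is illustrative: toy_model is a hypothetical stand-in for the neural network (it simply favours "previous token + 1" over a four-token vocabulary), and the function names are assumptions. The temperature handling follows the usual convention of dividing the logits by the temperature before the softmax.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from logits; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def toy_model(tokens):
    """Hypothetical stand-in for the network: logits over 4 tokens."""
    last = tokens[-1]
    return [3.0 if i == (last + 1) % 4 else 0.0 for i in range(4)]

def generate(prompt, max_new_tokens, temperature, seed=0):
    """Yield tokens one at a time, the way a streamed answer arrives."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_model(tokens)                    # predict
        nxt = sample_token(logits, temperature, rng)  # sample
        tokens.append(nxt)                            # append
        yield nxt                                     # stream to the caller

print(list(generate([0], 5, temperature=0)))    # [1, 2, 3, 0, 1]
print(list(generate([0], 5, temperature=1.5)))  # varies with the seed
```

Because generate yields each token as soon as it is chosen, a caller can display it immediately instead of waiting for the whole answer. A real system would also stop early on an end-of-sequence token, which this sketch leaves out.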

Pretraining: ~15T tokens, months of training on thousands of GPUs
SFT: ~10k-50k carefully written instruction-answer pairs
RLHF: ~100k+ human preference rankings + PPO updates
Inference: ~50ms per token typical latency, streamed to user
Context window: how many tokens of history the model can see at once (4K to 1M+ in modern models)
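Since the context window caps how much history the model can see, a long chat has to be trimmed before each request. The sketch below shows one simple policy, keeping the most recent messages that fit. This policy is an assumption for illustration, not a description of how ChatGPT handles overflow, and word_count is a crude hypothetical stand-in for a real tokenizer.

```python
def fit_to_context(messages, context_window, count_tokens):
    """Keep the most recent messages whose total token count fits."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > context_window:
            break  # this message and everything older is dropped
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

def word_count(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

history = ["hello there", "hi how can I help", "explain tokens please"]
print(fit_to_context(history, 8, word_count))
# ['hi how can I help', 'explain tokens please']
```

With a budget of 8 tokens, the two newest messages use exactly 8, so the oldest one no longer fits and the model never sees it.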