Learn LLMs at Scale

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Why are GPT-4 and Claude so dramatically better than GPT-2? They share broadly the same architecture, mostly the same training objective, and more or less the same data sources. The answer is scale: more parameters, more data, more compute. Empirical scaling laws (Kaplan, Hoffmann/Chinchilla) show that loss decreases as a smooth power-law function of N (parameters) and D (training tokens). Crucially, certain capabilities appear to emerge only above particular scale thresholds: on some tasks the model goes from useless to surprisingly competent over a small window of scale increase.

Scaling laws (Chinchilla, 2022)

Loss = A / N^α + B / D^β + irreducible_loss
  where N = parameters, D = training tokens
  α, β ≈ 0.3-0.4
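
To make the formula concrete, here is a minimal sketch that evaluates it for two model sizes. The constants are an assumption: they are the fitted values reported by Hoffmann et al. (2022) for their own data and tokenizer, so treat the outputs as illustrative rather than as predictions for any real model.

```python
# Parametric loss from the formula above: L(N, D) = A / N**alpha + B / D**beta + E.
# Assumption: constants are the fit reported by Hoffmann et al. (2022);
# a different dataset or tokenizer would give different values.
A, B, E = 406.4, 410.7, 1.69   # E is the irreducible loss
ALPHA, BETA = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for N parameters and D training tokens."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E

gpt3_like = predicted_loss(175e9, 300e9)        # large model, few tokens
chinchilla_like = predicted_loss(70e9, 1.4e12)  # smaller model, more tokens

print(f"175B params, 300B tokens: {gpt3_like:.3f}")
print(f"70B params, 1.4T tokens:  {chinchilla_like:.3f}")
```

Under this fit the smaller model trained on more tokens gets the lower predicted loss (roughly 1.94 versus 2.00), because the data term B / D^β dominates the gap.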

Key insight (Chinchilla): for optimal compute use, N and D should grow together at roughly equal rates, which works out to about 20 training tokens per parameter.
  Old wisdom: bigger models, fewer tokens (GPT-3: 175B params, 300B tokens; undertrained by this standard)
  Chinchilla: 70B params, 1.4T tokens beats Gopher (280B params, 300B tokens) at the same training compute → a smaller model trained on more data is often better.

Example: Chinchilla (70B, 2022) outperforms the much larger GPT-3 (175B, 2020) on benchmarks such as MMLU, while using less inference compute per token.
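
The compute-optimal split can be sketched in a few lines. Two rules of thumb are assumed here and are not exact laws: training compute C ≈ 6 × N × D FLOPs for a dense transformer, and about 20 tokens per parameter (the ratio implied by Chinchilla's 70B parameters and 1.4T tokens). The function name is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumption: C ≈ 6 * N * D for dense transformers
TOKENS_PER_PARAM = 20      # assumption: 1.4T tokens / 70B params = 20

def compute_optimal(flops_budget: float) -> tuple[float, float]:
    """Split a FLOPs budget into (parameters, tokens) under the two assumptions.

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2.
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Budget equal to Chinchilla's own training run: 6 * 70B * 1.4T FLOPs.
budget = 6 * 70e9 * 1.4e12
n_opt, d_opt = compute_optimal(budget)
print(f"params: {n_opt:.3g}, tokens: {d_opt:.3g}")
```

Because C grows with N², a 100× larger budget only buys a 10× larger model, trained on 10× more tokens.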

Loss is predictable in N and D · scaling them together is optimal

Emergent capabilities

Below ~10B params: short coherent text, basic Q&A. Useful but visibly limited.
10B-100B: in-context learning works. Few-shot prompting starts paying off. Multi-step reasoning is hit-or-miss.
100B+: chain-of-thought reasoning emerges. Code generation. Multi-language fluency. Solving novel logic puzzles.
1T+: complex agentic tasks (browsing, tool use, multi-step planning). Specialised reasoning (chess, math olympiad-level).
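These thresholds are empirical observations rather than theoretical predictions, and they are approximate: they shift with data quality and training recipe. Some transitions are sharp (in published few-shot evaluations, chain-of-thought prompting starts to help at roughly the 60B-100B scale, with little warning from smaller models).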

Why scale costs grow nonlinearly

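Parameters cost memory: a 7B model needs 14GB in FP16 (2 bytes per parameter) for the weights alone. A 1T model needs 2TB.
Training compute scales as ~N × D (roughly 6 × N × D FLOPs for a dense transformer). Since N and D should grow together, a 10× larger model needs about 100× the compute. GPT-3 cost an estimated ~$5M in 2020. GPT-4 is estimated at $100M+. GPT-5/Gemini 3 likely $1B+.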
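Inference cost grows linearly in N (each generated token requires one forward pass through the whole model). A dense 1T model therefore costs about 100× more per token than a 10B model, so it is only viable when the quality gain justifies that cost at serving volume.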
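Mixture of Experts (MoE): only a few expert sub-networks fire per token, so compute per token tracks the active parameters rather than the total, although every expert must still be held in memory. Mixtral is an openly documented MoE model and Gemini 1.5 is described by Google as one; GPT-4 is widely reported, but not officially confirmed, to use MoE.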
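
The cost arithmetic above can be checked with a short sketch. The FLOPs formulas are common approximations for dense transformers (about 6 × N × D to train, about 2 FLOPs per active parameter per generated token), and the MoE model at the end is hypothetical, chosen only to illustrate the ratio.

```python
BYTES_PER_PARAM_FP16 = 2  # FP16 stores each weight in 2 bytes

def weight_memory_gb(n_params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def training_flops(n_params: float, n_tokens: float) -> float:
    """Assumption: the common 6 * N * D approximation for dense transformers."""
    return 6 * n_params * n_tokens

def inference_flops_per_token(active_params: float) -> float:
    """Assumption: about 2 FLOPs per active parameter per generated token."""
    return 2 * active_params

print(weight_memory_gb(7e9))          # 7B model in FP16, in GB
print(weight_memory_gb(1e12))         # 1T model in FP16, in GB
print(training_flops(175e9, 300e9))   # GPT-3-sized training run, in FLOPs

# Hypothetical MoE: 1T total parameters, but only 100B active per token.
dense_cost = inference_flops_per_token(1e12)
moe_cost = inference_flops_per_token(100e9)
print(dense_cost / moe_cost)          # per-token compute ratio, dense vs MoE
```

Note that the hypothetical MoE model saves compute per token but not memory: weight_memory_gb still has to be evaluated on the full 1T parameters.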
