Why are GPT-4 and Claude so dramatically better than GPT-2? The architecture is broadly the same, the training objective is mostly the same, and the data sources are more or less the same. The answer is scale: more parameters, more data, more compute. Empirical scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022, the Chinchilla paper) show that loss decreases as a smooth power-law function of N (parameters) and D (training tokens). Crucially, certain capabilities appear to emerge only above particular scale thresholds: the model goes from useless to surprisingly competent over a small window of scale increase.
Scaling laws (Chinchilla, 2022)
Loss = A / N^α + B / D^β + irreducible_loss
where N = parameters, D = training tokens
α ≈ 0.34, β ≈ 0.28 (fitted values reported in the Chinchilla paper; both are roughly 0.3)
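To make the formula concrete, here is a minimal Python sketch that evaluates it for two model configurations. The constants are the fitted values reported by Hoffmann et al. (2022) and should be treated as assumptions: they depend on that paper's dataset and tokenizer, and the function name `chinchilla_loss` is purely illustrative.

```python
# Parametric loss: Loss = A / N^alpha + B / D^beta + irreducible_loss
# Constants below are the fit reported in the Chinchilla paper (assumption:
# they are specific to that paper's data and tokenizer).
IRREDUCIBLE_LOSS = 1.69
A = 406.4
B = 410.7
ALPHA = 0.34
BETA = 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for N parameters and D training tokens."""
    return A / n_params**ALPHA + B / n_tokens**BETA + IRREDUCIBLE_LOSS


# A GPT-3-sized run (175B params, 300B tokens) vs. a Chinchilla-sized run
# (70B params, 1.4T tokens).
gpt3_like = chinchilla_loss(175e9, 300e9)
chinchilla_like = chinchilla_loss(70e9, 1.4e12)

print(f"175B params, 300B tokens: predicted loss {gpt3_like:.3f}")
print(f"70B params, 1.4T tokens:  predicted loss {chinchilla_like:.3f}")
```

Under this fit the smaller model trained on more tokens gets the lower predicted loss, because at these sizes the data term B / D^β dominates the parameter term A / N^α.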
Key insight (Chinchilla): for optimal compute use, N and D should grow together at roughly equal rates.
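Old wisdom: bigger models, fewer tokens (GPT-3: 175B params, 300B tokens, which is undertrained by the Chinchilla standard)
Chinchilla: 70B params, 1.4T tokens (about 20 tokens per parameter) beats Gopher (280B params, 300B tokens) trained with the same compute → smaller, more data is often better.
Example: Chinchilla (70B, 2022) also outperforms GPT-3 (175B, 2020) on most reported benchmarks while needing less inference compute per token, although its training run used roughly twice GPT-3's compute.
Loss is predictable in N and D · scaling them together is optimal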
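The "grow N and D together" rule can be sketched as a small calculation. This is a rough illustration under two stated assumptions, not the paper's fitting procedure: training compute is approximated as C ≈ 6·N·D FLOPs, and the compute-optimal ratio is taken to be about 20 tokens per parameter (the ratio implied by Chinchilla's 1.4T tokens / 70B params). The helper name `compute_optimal` is hypothetical.

```python
import math

# Assumption: common approximation for dense transformer training compute.
FLOPS_PER_PARAM_PER_TOKEN = 6.0
# Assumption: ratio implied by Chinchilla itself (1.4T tokens / 70B params).
TOKENS_PER_PARAM = 20.0


def compute_optimal(c_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens) with D = 20 * N and C = 6 * N * D."""
    n_params = math.sqrt(c_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Chinchilla's own budget under the 6*N*D approximation.
budget = FLOPS_PER_PARAM_PER_TOKEN * 70e9 * 1.4e12
n_opt, d_opt = compute_optimal(budget)
print(f"budget {budget:.2e} FLOPs -> {n_opt / 1e9:.0f}B params, {d_opt / 1e12:.1f}T tokens")

# With 10x the compute, both N and D grow by the same factor, sqrt(10).
n_big, d_big = compute_optimal(10 * budget)
print(f"10x budget -> {n_big / 1e9:.0f}B params, {d_big / 1e12:.1f}T tokens")
```

Because N and D both scale with the square root of the budget, a 10× increase in compute buys only about a 3.2× larger model under these assumptions.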
Emergent capabilities
- Below ~10B params: short coherent text, basic Q&A. Useful but visibly limited.
- 10B-100B: in-context learning works. Few-shot prompting starts paying off. Multi-step reasoning is hit-or-miss.
- 100B+: chain-of-thought reasoning emerges. Code generation. Multi-language fluency. Solving novel logic puzzles.
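- 1T+: complex agentic tasks (browsing, tool use, multi-step planning). Specialised reasoning (chess, math olympiad-level). Parameter counts at this tier are mostly undisclosed, so this band is an estimate.
- These thresholds aren't theoretical; they're empirical observations, and they shift with data quality and training recipe. Some are sharp (chain-of-thought emerges at roughly 60B to 100B parameters with little warning at smaller scales).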
Why scale costs grow nonlinearly
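- Parameters cost memory: a 7B model needs 14GB for its weights alone in FP16 (2 bytes per parameter). A 1T model needs 2TB.
- Training compute scales as ~N × D (roughly 6 × N × D FLOPs). Because compute-optimal training grows D along with N, compute grows roughly with N². GPT-3 cost an estimated ~$5M in 2020. GPT-4 is estimated at $100M+. GPT-5/Gemini 3 likely $1B+.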
- Inference cost grows linearly in N (in a dense model, each generated token requires one forward pass through all N parameters). A 1T model costs roughly 100× more per token than a 10B model, so it is viable only if each request is worth correspondingly more.
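- Mixture of Experts (MoE): only a few expert sub-networks fire per token, reducing inference compute (all experts must still be held in memory). Mixtral and Gemini 1.5 are MoE models; GPT-4 is widely reported, though not officially confirmed, to be one.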
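The cost arithmetic in the list above can be reproduced with a few back-of-the-envelope helpers. This is a sketch under common approximations, not a cost model: about 6·N·D FLOPs for training, about 2 FLOPs per active parameter per generated token for inference, and weight memory only (no KV cache, activations, or optimizer state). All function names and the MoE configuration are hypothetical.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone; FP16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9


def training_flops(n_params: float, n_tokens: float) -> float:
    """Assumption: the common ~6 * N * D approximation for dense transformers."""
    return 6.0 * n_params * n_tokens


def inference_flops_per_token(active_params: float) -> float:
    """Assumption: ~2 FLOPs per active parameter per generated token."""
    return 2.0 * active_params


def moe_active_params(shared: float, expert_total: float,
                      n_experts: int, experts_per_token: int) -> float:
    """Shared layers always run; only a subset of the expert blocks run per token."""
    return shared + expert_total * experts_per_token / n_experts


print(f"7B weights in FP16: {weight_memory_gb(7e9):.0f} GB")
print(f"1T weights in FP16: {weight_memory_gb(1e12):.0f} GB")
print(f"GPT-3-sized training run: {training_flops(175e9, 300e9):.2e} FLOPs")

# Hypothetical 1T-parameter MoE: 100B shared parameters plus 900B spread
# over 16 experts, with 2 experts routed per token.
active = moe_active_params(100e9, 900e9, n_experts=16, experts_per_token=2)
dense_cost = inference_flops_per_token(1e12)
moe_cost = inference_flops_per_token(active)
print(f"active params per token: {active / 1e9:.1f}B")
print(f"MoE inference compute vs. dense 1T: {moe_cost / dense_cost:.2%}")
```

In this made-up configuration the MoE model still needs the full 2TB for its weights, but each token touches only about a fifth of them, which is where the inference saving comes from.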