Generative AI - Advanced - 12 min

Learn LLMs at Scale


Last updated: 2026-05-13.

Why are GPT-4 and Claude so much more capable than GPT-2? The architecture is still the Transformer, the core training objective is still next-token prediction, and the pretraining data comes from broadly similar sources. The answer is largely scale: more parameters, more data, more compute. Empirical scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022, the Chinchilla paper) show that loss decreases as a smooth power-law function of N (parameters) and D (training tokens). Crucially, some capabilities appear to EMERGE only above particular scale thresholds: on certain tasks, the model goes from near-useless to surprisingly competent over a small window of scale increase.

Scaling laws (Chinchilla, 2022)

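Loss = A / N^α + B / D^β + irreducible_loss
  where N = parameters, D = training tokens, and A, B are fitted constants
  α ≈ 0.34, β ≈ 0.28 in the published Chinchilla fit (exact values depend on the data and the fitting method)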
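
The formula above can be evaluated directly. The sketch below is a minimal illustration, assuming the fitted constants published in the Chinchilla paper (E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28); the function name chinchilla_loss is made up for this example. Those constants were fitted on one particular dataset and tokenizer, so applying them to a GPT-3-sized run shows the shape of the trade-off and is not a prediction for any real model.

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N^alpha + B / D^beta.
# Constants are the fit reported by Hoffmann et al. (2022); they are
# dataset- and tokenizer-specific, so treat the outputs as illustrative.
E = 1.69       # irreducible loss
A = 406.4      # coefficient of the parameter term
B = 410.7      # coefficient of the data term
ALPHA = 0.34   # exponent on N (parameters)
BETA = 0.28    # exponent on D (training tokens)


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


if __name__ == "__main__":
    configs = [
        ("175B params, 300B tokens", 175e9, 300e9),
        ("70B params, 1.4T tokens", 70e9, 1.4e12),
    ]
    for label, n, d in configs:
        print(f"{label}: predicted loss {chinchilla_loss(n, d):.3f}")
```

Under this fit, the 70B model trained on 1.4T tokens gets a lower predicted loss than the 175B model trained on 300B tokens, because the data term B / D^β dominates when D is small.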

Key insight (Chinchilla): for optimal compute use, N and D should grow together at roughly equal rates.
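  Old wisdom (Kaplan et al.): grow the model faster than the data (GPT-3: 175B params, 300B tokens, undertrained by Chinchilla's standard)
  Chinchilla: 70B params, 1.4T tokens, trained with the same compute as the 280B-param Gopher, beats both Gopher and GPT-3 → for a fixed compute budget, a smaller model trained on more data is often better.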

Example: Chinchilla (70B, 2022) outperforms GPT-3 (175B, 2020) on most benchmarks, while needing roughly 2.5× less inference compute per token because it has 2.5× fewer parameters.
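
The "grow N and D together" rule can be turned into a small calculator. The sketch below rests on two assumptions that are common approximations rather than exact results: training compute C ≈ 6 × N × D FLOPs, and a compute-optimal ratio of about 20 training tokens per parameter (the ratio implied by Chinchilla's 70B params and 1.4T tokens). The function name compute_optimal is hypothetical.

```python
import math


def compute_optimal(flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a training budget between parameters and tokens.

    Assumes C = 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)). Both are rules of thumb.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Budget of a 70B-parameter, 1.4T-token run under the 6*N*D approximation.
    budget = 6.0 * 70e9 * 1.4e12
    n, d = compute_optimal(budget)
    print(f"params: {n:.3e}, tokens: {d:.3e}")

    # Doubling the budget scales both N and D by sqrt(2), not N alone.
    n2, d2 = compute_optimal(2 * budget)
    print(f"N ratio: {n2 / n:.3f}, D ratio: {d2 / d:.3f}")
```

Because N and D both scale with the square root of the budget here, a 100× larger budget buys roughly a 10× larger model trained on 10× more tokens.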

Loss is predictable in N and D · scaling them together is optimal

Emergent capabilities

  • Below ~10B params: short coherent text, basic Q&A. Useful but visibly limited.
  • 10B-100B: in-context learning works. Few-shot prompting starts paying off. Multi-step reasoning is hit-or-miss.
  • 100B+: chain-of-thought reasoning emerges. Code generation. Multi-language fluency. Solving novel logic puzzles.
  • 1T+ (sizes at this scale are mostly undisclosed, so this tier is an estimate): complex agentic tasks (browsing, tool use, multi-step planning). Specialised reasoning (chess, math olympiad-level).
  • These thresholds are rough empirical observations, not theoretical predictions, and they shift with data quality and training recipe. Some transitions are sharp: in published experiments, chain-of-thought prompting starts to help only at roughly 60B-100B params, with little warning at smaller scales.

Why scale costs grow nonlinearly

  • Parameters cost memory: a 7B model needs 14GB in FP16 (2 bytes per parameter) for the weights alone. A 1T model needs 2TB.
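  • Training compute scales as ~N × D (about 6 × N × D FLOPs). If D grows in proportion to N, compute grows roughly as N². GPT-3 is estimated to have cost ~$5M in 2020. GPT-4 is estimated at $100M+. GPT-5/Gemini 3 likely $1B+.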
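  • Inference cost per token grows linearly in N for a dense model (each generated token requires one forward pass through all the weights). A 1T dense model costs roughly 100× more per token than a 10B model, so it is viable only when the added capability justifies that cost.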
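  • Mixture of Experts (MoE): only a few expert sub-networks fire per token, so inference compute depends on the active parameters, not the total. Mixtral and Gemini 1.5 are MoE models; GPT-4 is widely reported, though not officially confirmed, to use MoE.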
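
The arithmetic behind these bullets is simple enough to check by hand. The sketch below is a back-of-envelope calculator under stated assumptions: weights only (no activations, optimizer state, or KV cache), about 6 FLOPs per parameter per training token, and about 2 FLOPs per active parameter per generated token. The function names and the MoE configuration (1T total parameters, 100B active per token) are hypothetical and chosen only for illustration.

```python
# Back-of-envelope cost estimates. These are common approximations,
# not exact accounting, and they ignore hardware utilisation.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def inference_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass compute: about 2 FLOPs per active parameter."""
    return 2.0 * active_params


if __name__ == "__main__":
    print(f"7B in FP16: {weight_memory_gb(7e9):.0f} GB")
    print(f"1T in FP16: {weight_memory_gb(1e12):.0f} GB")
    print(f"175B params x 300B tokens: {training_flops(175e9, 300e9):.2e} FLOPs")

    # Hypothetical MoE: 1T total parameters, 100B active per token.
    dense = inference_flops_per_token(1e12)
    moe = inference_flops_per_token(100e9)
    print(f"dense / MoE compute per token: {dense / moe:.0f}x")
```

Note that MoE reduces compute per token but not weight memory: all experts still have to be stored, so the hypothetical 1T MoE model above would still need about 2TB for FP16 weights.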

Practice questions

  1. What do empirical scaling laws (Chinchilla, Kaplan) show?
  2. What did Chinchilla (2022) reveal about the optimal balance of parameters vs training data?
  3. What does 'emergent capabilities' mean in the context of LLMs?
  4. What is Mixture of Experts (MoE) and why is it used for very large LLMs?
