Deep Learning - Advanced - 20 min

Learn Transformers in AI

Last updated: 2026-05-13.

Transformers are neural networks built from attention and feed-forward layers, with no recurrence and no convolution. A stack of transformer blocks reads a sequence in parallel and produces contextualised representations of every token. This architecture, introduced in 'Attention Is All You Need' (Vaswani et al., 2017), now powers GPT, BERT, T5, LLaMA, Claude, Gemini, and nearly every modern large language model.

Anatomy of a Transformer Block

One encoder block takes input X and produces Y:

  Step 1: Multi-head self-attention
    A = MultiHead(X, X, X)

  Step 2: Residual + LayerNorm
    X' = LayerNorm(X + A)

  Step 3: Feed-forward (2 linear layers with GELU/ReLU)
    F = FFN(X') = W₂ · GELU(W₁ · X' + b₁) + b₂
                (typically W₁ expands d_model → 4·d_model, W₂ contracts back)

  Step 4: Residual + LayerNorm
    Y = LayerNorm(X' + F)

The full transformer = N such blocks stacked on top of each other.
GPT-3 has 96 blocks. BERT-base has 12. LLaMA-7B has 32.

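Each block: attention (mix across tokens) → FFN (transform per token). Residuals let gradients flow across the entire stack. The layout shown here, with LayerNorm after each residual addition, is the post-norm design of the original paper; GPT-2 and most later models apply the normalisation before each sub-layer instead (pre-norm).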
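
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the parameter names (Wq, Wk, Wv, Wo, W1, b1, W2, b2), the init_params helper, and the tiny sizes are all hypothetical, LayerNorm is shown without its learned scale and shift, and tokens are stored as rows, so the lesson's W₁ · X' becomes x @ W1.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance
    # (the learned scale and shift are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_self_attention(x, p, n_heads):
    n, d = x.shape
    d_head = d // n_heads

    def split(t):
        # (n, d) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    heads = softmax(scores) @ v                           # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d)       # back to (n, d)
    return concat @ p["Wo"]

def encoder_block(x, p, n_heads):
    a = multi_head_self_attention(x, p, n_heads)          # Step 1
    x1 = layer_norm(x + a)                                # Step 2
    f = gelu(x1 @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]  # Step 3
    return layer_norm(x1 + f)                             # Step 4

def init_params(d_model, rng):
    # Hypothetical helper: small random weights, FFN expands to 4 * d_model.
    def w(*shape):
        return rng.normal(0.0, 0.02, size=shape)
    return {
        "Wq": w(d_model, d_model), "Wk": w(d_model, d_model),
        "Wv": w(d_model, d_model), "Wo": w(d_model, d_model),
        "W1": w(d_model, 4 * d_model), "b1": np.zeros(4 * d_model),
        "W2": w(4 * d_model, d_model), "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                      # 5 tokens, d_model = 16
blocks = [init_params(16, rng) for _ in range(3)]  # a 3-block stack
y = x
for params in blocks:
    y = encoder_block(y, params, n_heads=4)
```

The output y has the same shape as the input, which is what allows blocks to be stacked to any depth.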

Positional Encoding: Giving Tokens a Sense of Order

Self-attention on its own is permutation-equivariant: it has no built-in notion of order. Feed it 'cat sat mat' or 'mat sat cat' and each token gets exactly the same output vector; only the order of the rows changes. Fix: add a position vector to each input embedding.
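
A small numerical check of this point: without position information, reordering the input tokens only reorders the rows of the self-attention output, so each token's vector is the same wherever it sits. The single-head self_attention function, the random weights, and the random stand-in embeddings below are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head self-attention with no positional information.
    q, k, v = x @ wq, x @ wk, x @ wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 8
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

x = rng.normal(size=(3, d))   # stand-ins for the embeddings of 'cat', 'sat', 'mat'
perm = np.array([2, 1, 0])    # reorder to 'mat', 'sat', 'cat'

out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)

# True: the outputs are the same vectors, just in the permuted order.
same_up_to_reordering = np.allclose(out_perm, out[perm])
```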

Sinusoidal (original transformer):
  PE(pos, 2i)   = sin(pos / 10000^(2i/d))
  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Learned (used by BERT, GPT-2):
  PE = a lookup table, trained end-to-end like any embedding.

Rotary (RoPE — used in LLaMA, GPT-NeoX):
  Rotates Q and K by angle ∝ position in each 2D subspace — encodes relative distance in the attention dot product itself.
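
The rotary idea can be sketched as follows. This is a simplified version that assumes adjacent dimensions (2i, 2i+1) form each 2D pair and uses the same 10000 base as the sinusoidal formula; real implementations differ in how they pair dimensions, and the rope function name is hypothetical.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each (even, odd) pair of x by an angle proportional to pos.
    d = x.shape[-1]                            # assumed even
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2D pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3 positions apart) at two different absolute positions.
near = rope(q, 5) @ rope(k, 2)
far = rope(q, 103) @ rope(k, 100)
# A different relative offset (1 position apart).
other = rope(q, 5) @ rope(k, 4)
```

Here near and far agree up to floating-point error because the dot product of two rotated vectors depends only on the difference of their positions, while other differs because the offset changed.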

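Input to first block (sinusoidal or learned): x_token_embedding + x_position_embedding
(RoPE adds nothing at the input; it is applied to Q and K inside every attention layer.)

The same token at position 1 and at position 100 gets a different input representation, which is how the model can tell word orders apart.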
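
The sinusoidal formula and the embedding sum can be written out directly. In this sketch the embedding table, vocabulary size, and token id are made-up placeholders, and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle)
    pos = np.arange(n_positions)[:, None]      # (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angle = pos / (10000.0 ** (two_i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

rng = np.random.default_rng(0)
d_model = 16
pe = sinusoidal_positions(128, d_model)

# Hypothetical embedding table for a 10-token vocabulary.
token_embedding = rng.normal(size=(10, d_model))
token_id = 3

# The same token placed at position 1 and at position 100.
v1 = token_embedding[token_id] + pe[1]
v100 = token_embedding[token_id] + pe[100]
differs = not np.allclose(v1, v100)
```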

Encoder, Decoder, Encoder-Decoder: Three Flavours

  • Encoder-only (BERT, RoBERTa, DeBERTa): bidirectional self-attention. Good for understanding tasks — classification, NER, QA on passages. Every token sees every other token.
  • Decoder-only (GPT, LLaMA, Mistral, Claude, Gemini): causal (masked) self-attention. Good for generation: each token sees only itself and earlier tokens, which enables autoregressive next-token prediction.
  • Encoder-Decoder (T5, BART, original Transformer): encoder reads the source (bidirectional), decoder generates the target (causal), with cross-attention from decoder to encoder. Used for translation, summarisation, structured generation.
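
In code, the difference between the flavours comes down to which attention scores get masked. The sketch below uses random scores in place of real Q·K products, and attention_weights is a hypothetical helper name.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(scores, causal):
    n = scores.shape[-1]
    if causal:
        # Mask every position j > i so token i cannot look at the future.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # stand-in for Q @ K.T / sqrt(d)

bidirectional = attention_weights(scores, causal=False)  # encoder-style
causal = attention_weights(scores, causal=True)          # decoder-style

# Cross-attention: 2 target tokens attend over 4 source tokens, no mask needed.
cross = softmax(rng.normal(size=(2, 4)))
```

In bidirectional every entry is positive, while causal is lower-triangular: the first token can attend only to itself.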

Why Transformers Scale So Well

RNN compute graph for sequence of length N:
  depth ≈ N (sequential)
  parallel width ≈ 1
  → terrible GPU utilisation

Transformer compute graph:
  depth ≈ num_layers (fixed, ~12-96)
  parallel width ≈ N × H_heads
  → massive GPU utilisation during training (generation still emits one token at a time)
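
The contrast can be seen in a toy example. The RNN below has to loop because each hidden state depends on the previous one, while the attention layer handles every pair of positions in one matrix product. The weights, sizes, and the simple tanh RNN cell are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))

# RNN: step t needs the hidden state from step t - 1,
# so this loop cannot be parallelised across positions.
w_h = rng.normal(size=(d, d)) * 0.1
w_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for t in range(n):
    h = np.tanh(h @ w_h + x[t] @ w_x)
    rnn_states.append(h)
sequential_steps_rnn = len(rnn_states)  # grows with sequence length

# Self-attention (projections omitted): all n x n interactions at once.
scores = x @ x.T / np.sqrt(d)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
out = weights @ x                       # one parallel step for the whole sequence
```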

Empirical scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, 'Chinchilla'):
  loss ≈ A / N^α + B / D^β + irreducible
  where N = parameters (not the sequence length N used above), D = training tokens

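Doubling parameters or data shrinks the corresponding term by a fixed factor (2^-α or 2^-β), so the loss falls predictably, with diminishing returns as it approaches the irreducible floor.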
This predictability enabled the 'just make it bigger' era (GPT-2 → GPT-3 → GPT-4).
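
To see the formula in action, it can be evaluated for a few model and data sizes. The constants below are hypothetical placeholders chosen only to make the arithmetic visible; they are not fitted values from any paper.

```python
def predicted_loss(n_params, n_tokens, A, alpha, B, beta, irreducible):
    # loss ≈ A / N^alpha + B / D^beta + irreducible
    return A / n_params**alpha + B / n_tokens**beta + irreducible

# Hypothetical constants for illustration only (NOT fitted values).
consts = dict(A=300.0, alpha=0.3, B=300.0, beta=0.3, irreducible=2.0)

base = predicted_loss(1e9, 2e10, **consts)
doubled_params = predicted_loss(2e9, 2e10, **consts)
doubled_data = predicted_loss(1e9, 4e10, **consts)

# Doubling N multiplies the parameter term by 2 ** -alpha.
param_term = consts["A"] / 1e9 ** consts["alpha"]
param_term_doubled = consts["A"] / 2e9 ** consts["alpha"]
shrink_factor = param_term_doubled / param_term
```

Both doublings lower the predicted loss, but neither can push it below the irreducible term.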

Transformers are the architecture on which these scaling laws have been exploited most fully: across the ranges studied so far, bigger models trained on more data keep getting better.

Landmark Transformer Models

  • Transformer (2017, Vaswani et al): 6-layer encoder + 6-layer decoder, ~65M params, trained for machine translation. Launched the era.
  • BERT (2018): 12-24 encoder layers, 110M-340M params. Pre-trained on masked language modelling — predict the blank. Dominated NLP understanding benchmarks.
  • GPT-2 (2019): 48 decoder layers, 1.5B params. Showed that a scaled-up language model can perform many tasks zero-shot, without task-specific training.
  • GPT-3 (2020): 96 layers, 175B params. In-context learning, zero-shot reasoning, code generation without fine-tuning.
  • LLaMA (2023): efficient open-weight decoder-only models with RoPE and RMSNorm. The first release came in 7B, 13B, 33B, and 65B sizes; Llama 2, later the same year, in 7B, 13B, and 70B.
  • Today: GPT-4, Claude, Gemini. Still transformer-based, with improvements in training, RLHF, context length, and (in some models) mixture-of-experts routing.
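
The parameter counts in this list follow roughly from the block anatomy: four d_model × d_model attention matrices plus two FFN matrices of size d_model × 4·d_model give about 12 · d_model² weights per block. The sketch below applies that estimate; the d_model values (12288 for GPT-3, 768 for BERT-base) are not stated in this lesson and are taken from the published model configurations, and biases, LayerNorm, and embedding tables are ignored.

```python
def approx_block_params(n_layers, d_model, ffn_mult=4):
    # Rough count of weight-matrix parameters in a stack of blocks.
    attention = 4 * d_model * d_model       # W_Q, W_K, W_V, W_O
    ffn = 2 * ffn_mult * d_model * d_model  # W1 (expand) and W2 (contract)
    return n_layers * (attention + ffn)

gpt3 = approx_block_params(n_layers=96, d_model=12288)     # 173_946_175_488
bert_base = approx_block_params(n_layers=12, d_model=768)  # 84_934_656
```

The GPT-3 estimate lands within about 1% of the quoted 175B. For BERT-base the blocks account for roughly 85M of the 110M, with most of the remainder in the embedding tables.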

Practice questions

  1. What are the two main sub-layers inside a single transformer block, and what does each do?
  2. Why does a transformer need positional encoding?
  3. What is the difference between encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) transformers?
  4. Why do transformers scale so much better than RNNs on modern GPUs?
