Learn Transformers in AI

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Transformers are neural networks built from attention and feed-forward layers, with no recurrence and no convolution. A stack of transformer blocks reads a sequence in parallel and produces a contextualised representation of every token. This architecture, introduced in 'Attention Is All You Need' (2017), now powers GPT, BERT, T5, LLaMA, Claude, Gemini, and nearly every modern large language model.

Anatomy of a Transformer Block

One encoder block takes input X and produces Y:

  Step 1: Multi-head self-attention
    A = MultiHead(X, X, X)

  Step 2: Residual + LayerNorm
    X' = LayerNorm(X + A)

  Step 3: Feed-forward (2 linear layers with GELU/ReLU)
    F = FFN(X') = W₂ · GELU(W₁ · X' + b₁) + b₂
                (typically W₁ expands d_model → 4·d_model, W₂ contracts back)

  Step 4: Residual + LayerNorm
    Y = LayerNorm(X' + F)
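
The four steps can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (matmul, multi_head, encoder_block, init_params), the random weights, and the toy sizes (5 tokens, d_model = 8, 2 heads) are all assumptions made for the example, and LayerNorm omits its learned gain and bias. Tokens are stored as rows, so the code computes X · W where the formulas above write W · X.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance (no learned gain/bias)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def gelu(v):
    """tanh approximation of GELU."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def multi_head(x, p, n_heads):
    """Multi-head self-attention: queries, keys and values all come from x."""
    d_head = len(x[0]) // n_heads
    q, k, v = matmul(x, p["Wq"]), matmul(x, p["Wk"]), matmul(x, p["Wv"])
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        qh, kh, vh = [r[cols] for r in q], [r[cols] for r in k], [r[cols] for r in v]
        scores = matmul(qh, [list(c) for c in zip(*kh)])  # qh times kh transposed
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        heads.append(matmul(weights, vh))
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    return matmul(concat, p["Wo"])

def ffn(x, p):
    """Two linear layers with GELU in between, applied to each token independently."""
    hidden = [[gelu(v + b) for v, b in zip(row, p["b1"])] for row in matmul(x, p["W1"])]
    return [[v + b for v, b in zip(row, p["b2"])] for row in matmul(hidden, p["W2"])]

def encoder_block(x, p, n_heads):
    a = multi_head(x, p, n_heads)     # Step 1: multi-head self-attention
    x1 = layer_norm(add(x, a))        # Step 2: residual + LayerNorm
    f = ffn(x1, p)                    # Step 3: feed-forward
    return layer_norm(add(x1, f))     # Step 4: residual + LayerNorm

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_params(d_model, rng):
    """Random, untrained weights; W1 expands d_model to 4*d_model, W2 contracts back."""
    p = {name: rand_matrix(d_model, d_model, rng) for name in ("Wq", "Wk", "Wv", "Wo")}
    p["W1"], p["b1"] = rand_matrix(d_model, 4 * d_model, rng), [0.0] * (4 * d_model)
    p["W2"], p["b2"] = rand_matrix(4 * d_model, d_model, rng), [0.0] * d_model
    return p

rng = random.Random(0)
X = rand_matrix(5, 8, rng)                        # 5 tokens, d_model = 8
Y = encoder_block(X, init_params(8, rng), n_heads=2)
print(len(Y), len(Y[0]))                          # Y has the same shape as X
```

Because the output has the same shape as the input, blocks can be stacked by feeding Y into the next block.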

The full transformer = N such blocks stacked on top of each other.
GPT-3 has 96 blocks. BERT-base has 12. LLaMA-7B has 32.
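
As a rough check on these block counts, the weights inside the blocks can be estimated from d_model alone. The sketch below assumes the 4× feed-forward expansion described above and ignores biases, LayerNorm parameters, and embeddings. The d_model values (12288 for GPT-3, 768 for BERT-base) come from the published model configurations rather than from this lesson, and the function names are made up for the example.

```python
def block_params(d_model, ffn_mult=4):
    """Approximate weight count of one block, ignoring biases and LayerNorm."""
    attention = 4 * d_model * d_model                  # W_q, W_k, W_v, W_o
    feed_forward = 2 * ffn_mult * d_model * d_model    # W1 (expand) and W2 (contract)
    return attention + feed_forward                    # 12 * d_model^2 when ffn_mult = 4

def stack_params(n_blocks, d_model):
    return n_blocks * block_params(d_model)

for name, n_blocks, d_model in [("GPT-3", 96, 12288), ("BERT-base", 12, 768)]:
    print(f"{name}: about {stack_params(n_blocks, d_model) / 1e6:,.0f}M weights in the blocks")
```

This gives roughly 174 billion for GPT-3, close to its quoted 175B, and roughly 85 million for BERT-base, where token and position embeddings account for most of the remaining parameters.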

Each block: attention (mix across tokens) → FFN (transform per token). Residuals let gradients flow across the entire stack.

Positional Encoding: Giving Tokens a Sense of Order

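Self-attention on its own is permutation-equivariant: shuffle the input tokens and the outputs are shuffled in the same way, but each token's output vector is unchanged. Without positional info, every word in 'cat sat mat' gets exactly the same representation as it does in 'mat sat cat'. Fix: add a position vector to each input embedding.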

Sinusoidal (original transformer):
  PE(pos, 2i)   = sin(pos / 10000^(2i/d))
  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Learned (used by BERT, GPT-2):
  PE = a lookup table, trained end-to-end like any embedding.

Rotary (RoPE — used in LLaMA, GPT-NeoX):
  Rotates Q and K by angle ∝ position in each 2D subspace — encodes relative distance in the attention dot product itself.
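
The relative-distance property of RoPE can be checked numerically. The sketch below assumes the formulation that rotates adjacent pairs of dimensions (some implementations pair dimension i with i + d/2 instead), and the function names and the example q and k vectors are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):             # i is already 2 * (pair index)
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]    # hypothetical query vector
k = [1.1, 0.4, -0.6, 0.9]    # hypothetical key vector

near = dot(rope(q, 3), rope(k, 1))        # positions 3 and 1: offset 2
far = dot(rope(q, 103), rope(k, 101))     # positions 103 and 101: offset 2
print(abs(near - far) < 1e-9)             # same offset, same attention score
```

The two scores match because rotating q by one angle and k by another leaves a dot product that depends only on the difference between the angles, that is, on the offset between the two positions.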

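Input to first block (sinusoidal or learned encodings): x_token_embedding + x_position_embedding
RoPE skips this sum and instead applies position inside every attention layer.

The same token at position 1 and at position 100 gets a different representation, which is how the model can tell word orderings apart.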
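
A minimal sketch of the sinusoidal encoding and the sum that feeds the first block. It assumes an even d_model; the token embedding values and the names sinusoidal_pe and block_input are made up for the example.

```python
import math

def sinusoidal_pe(pos, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / 10000 ** (2 * i / d_model)
        pe += [math.sin(angle), math.cos(angle)]   # fills indices 2i and 2i+1
    return pe

# Hypothetical 8-dimensional embedding for a single token.
token_embedding = [0.2, -0.1, 0.4, 0.0, 0.3, -0.5, 0.1, 0.6]

def block_input(pos):
    """Token embedding plus position vector, as fed to the first block."""
    pe = sinusoidal_pe(pos, len(token_embedding))
    return [t + p for t, p in zip(token_embedding, pe)]

at_1, at_100 = block_input(1), block_input(100)
print(at_1 != at_100)   # the same token differs once position is added
```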

Encoder, Decoder, Encoder-Decoder: Three Flavours

Encoder-only (BERT, RoBERTa, DeBERTa): bidirectional self-attention. Good for understanding tasks — classification, NER, QA on passages. Every token sees every other token.
Decoder-only (GPT, LLaMA, Mistral, Claude, Gemini): causal (masked) self-attention. Good for generation — every token only sees previous tokens, enabling autoregressive next-token prediction.
Encoder-Decoder (T5, BART, original Transformer): encoder reads the source (bidirectional), decoder generates the target (causal), with cross-attention from decoder to encoder. Used for translation, summarisation, structured generation.
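
The difference between bidirectional and causal self-attention comes down to a mask applied before the softmax. The sketch below uses hypothetical pre-softmax scores for three tokens, and the function name is an assumption made for the example.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax. With causal=True, token t cannot attend to tokens after t."""
    weights = []
    for t, row in enumerate(scores):
        masked = [s if (not causal or j <= t) else -math.inf for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) is 0.0, so masked slots vanish
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical scaled dot-product scores (rows = queries, columns = keys).
scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -0.7, 0.4]]

encoder_style = attention_weights(scores, causal=False)   # every token sees every token
decoder_style = attention_weights(scores, causal=True)    # lower-triangular weights
for row in decoder_style:
    print([round(w, 2) for w in row])
```

In the causal case the first token can attend only to itself, so its row is all weight on position 0, and every entry above the diagonal is exactly zero.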

Why Transformers Scale So Well

RNN compute graph for sequence of length N:
  depth ≈ N (sequential)
  parallel width ≈ 1
  → terrible GPU utilisation

Transformer compute graph:
  depth ≈ num_layers (fixed, ~12-96)
  parallel width ≈ N × H_heads
  → massive GPU utilisation

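Empirical scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022, 'Chinchilla'):
  loss ≈ A / N^α + B / D^β + irreducible
  where N = parameters (not the sequence length used above), D = training tokens

Doubling parameters or data lowers loss by a predictable amount, with diminishing returns because each term is a power law.
This predictability enabled the 'just make it bigger' era (GPT-2 → GPT-3 → GPT-4).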
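
To see what the formula implies, it can be evaluated directly. The constants below are approximately the fitted values reported in the Chinchilla paper and should be treated as illustrative assumptions; the function name and the model and data sizes are made up for the example.

```python
def predicted_loss(n_params, n_tokens,
                   A=406.4, B=410.7, alpha=0.34, beta=0.28, irreducible=1.69):
    """loss ≈ A / N^alpha + B / D^beta + irreducible (constants are illustrative)."""
    return A / n_params ** alpha + B / n_tokens ** beta + irreducible

base = predicted_loss(1e9, 20e9)            # 1B parameters, 20B tokens
double_params = predicted_loss(2e9, 20e9)   # twice the parameters
double_data = predicted_loss(1e9, 40e9)     # twice the data
print(f"base={base:.3f}  2x params={double_params:.3f}  2x data={double_data:.3f}")
```

Both doublings lower the predicted loss, but neither can push it below the irreducible term, and each further doubling buys a smaller improvement than the one before.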

Transformers are the architecture where scaling laws have proved most useful in practice: across the ranges studied so far, bigger models trained on more data keep getting better, though with diminishing returns.

Landmark Transformer Models

Transformer (2017, Vaswani et al): 6-layer encoder + 6-layer decoder, ~65M params, trained for machine translation. Launched the era.
BERT (2018): 12-24 encoder layers, 110M-340M params. Pre-trained on masked language modelling — predict the blank. Dominated NLP understanding benchmarks.
GPT-2 (2019): 48 decoder layers, 1.5B params. Demonstrated that scaling + generation = emergent capabilities.
GPT-3 (2020): 96 layers, 175B params. In-context learning, zero-shot reasoning, code generation without fine-tuning.
LLaMA (2023): efficient open-weight decoder-only models (7B, 13B, 33B, 65B params) with RoPE and RMSNorm. Llama 2, released later the same year, came in 7B, 13B, and 70B sizes.
Today: GPT-4, Claude, Gemini. All are still transformer-based, with improvements in training, RLHF, context length, and (in some models) mixture-of-experts routing.

Translating 'The cat sat on the mat' to French with an encoder-decoder transformer:

Encoder:
  Input embeddings + positional encoding for [The, cat, sat, on, the, mat]
  6 blocks of self-attention + FFN → contextualised representations
  Output: 6 rich vectors capturing full English meaning.

Decoder (generates one token at a time):
  Start token: <BOS>
  Block operations:
    1. Causal self-attention over previously-generated tokens
    2. Cross-attention: Q from decoder, K and V from encoder output
    3. FFN
  Sample next token: 'Le'
  Repeat with [<BOS>, 'Le'] → 'chat'
  Continue: 'Le chat s'est assis sur le tapis'

The cross-attention step is where the decoder 'looks at' the source: exactly the same mechanism as self-attention, but Q comes from the target and K, V from the source.
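
The cross-attention step can be sketched in a few lines. This is a simplified single-head version that leaves out the learned projections W_q, W_k, and W_v (they are treated as identity), and the 4-dimensional vectors are hypothetical stand-ins for real encoder and decoder states.

```python
import math

def cross_attention(decoder_states, encoder_states):
    """Q comes from the decoder; K and V both come from the encoder output."""
    d = len(encoder_states[0])
    outputs, all_weights = [], []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in encoder_states]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]        # one weight per source token
        mixed = [sum(w * v[j] for w, v in zip(weights, encoder_states)) for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical encoder output: one vector per source token.
encoder_out = [[1.0, 0.0, 0.0, 0.0],   # The
               [0.0, 2.0, 0.0, 0.0],   # cat
               [0.0, 0.0, 1.0, 0.0],   # sat
               [0.0, 0.0, 0.0, 1.0],   # on
               [1.0, 0.0, 0.0, 0.0],   # the
               [0.0, 0.0, 1.0, 1.0]]   # mat

# Hypothetical decoder states for the tokens generated so far.
decoder_states = [[0.5, 0.0, 0.0, 0.0],   # <BOS>
                  [0.0, 1.5, 0.0, 0.0]]   # Le

outputs, weights = cross_attention(decoder_states, encoder_out)
print(len(outputs), len(weights[0]))   # one output per target token, one weight per source token
```

With these made-up vectors the second decoder state puts most of its weight on 'cat', which is the kind of source lookup the decoder needs before emitting 'chat'.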