GPT (Generative Pre-trained Transformer, OpenAI 2018+) takes the opposite approach to BERT. Instead of filling in blanks, GPT predicts the NEXT token given everything that came before. This single objective — language modelling — turns out to be enormously powerful: a model trained to predict the next word ends up learning grammar, facts, reasoning, even basic mathematics. Scale this recipe up (GPT-3 = 175B params), and you get an AI that can write code, draft essays, and answer questions zero-shot.
Architecture: Decoder-Only Transformer
GPT = a stack of transformer DECODER blocks (with self-attention only, no cross-attention).
GPT-2: 12-48 decoder layers, 768-1600 dim, up to 1.5B params
GPT-3: 96 decoder layers, 12,288 dim, 96 heads → 175B params
GPT-4: size undisclosed by OpenAI; unofficial estimates put it around ~1.8T params (reportedly mixture-of-experts)
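As a rough sanity check on the sizes listed above, the parameter count of a decoder-only stack can be approximated from the layer count and hidden dimension alone. The sketch below assumes a 4x feed-forward expansion and the 50,257-token BPE vocabulary used by GPT-2 and GPT-3, and it ignores biases, layer norms, and position embeddings. The function name `approx_decoder_params` is made up for illustration.

```python
def approx_decoder_params(n_layers: int, d_model: int, vocab_size: int = 50257) -> int:
    """Rough parameter count for a decoder-only transformer.

    Assumes a 4x feed-forward expansion; ignores biases, layer norms,
    and position embeddings.
    """
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward = 8 * d_model * d_model   # two matrices: d -> 4d and 4d -> d
    embeddings = vocab_size * d_model      # token embedding table
    return n_layers * (attention + feed_forward) + embeddings


gpt3 = approx_decoder_params(n_layers=96, d_model=12288)
print(f"GPT-3 estimate: {gpt3 / 1e9:.1f}B parameters")

gpt2_largest = approx_decoder_params(n_layers=48, d_model=1600)
print(f"GPT-2 (largest) estimate: {gpt2_largest / 1e9:.2f}B parameters")
```

Both estimates land close to the quoted 175B and 1.5B figures, which shows that almost all of the parameters sit in the per-layer attention and feed-forward matrices.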
Key distinction from BERT: causal self-attention.
At layer L, position t: only attend to positions 1, 2, ..., t (NOT t+1, t+2, ...)
Implementation: a mask that is −∞ strictly above the diagonal (and 0 elsewhere) is added to the attention logits before softmax, so future positions receive zero weight.
Why causal? GPT predicts the next token. If the model could see future tokens during training, it would 'cheat' by simply reading the answer. The causal mask ensures the model truly predicts from past context only.
Decoder-only · causal mask · same architecture, just bigger and better trained
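To make the causal mask concrete, here is a minimal sketch in plain Python that masks a square matrix of attention logits and applies a row-wise softmax. The function name `causal_softmax` and the toy logits are assumptions for illustration. Real implementations do this with batched tensor operations.

```python
import math


def causal_softmax(scores):
    """Mask a square matrix of attention logits causally, then softmax each row.

    scores[t][s] is the logit for query position t attending to key position s.
    """
    n = len(scores)
    weights = []
    for t in range(n):
        # -inf strictly above the diagonal: position t may not see s > t
        masked = [scores[t][s] if s <= t else -math.inf for s in range(n)]
        row_max = max(masked)  # always finite: the diagonal is never masked
        exps = [math.exp(x - row_max) for x in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


toy_logits = [[0.0] * 4 for _ in range(4)]  # uniform toy logits
attention = causal_softmax(toy_logits)
for row in attention:
    print([round(w, 2) for w in row])
```

With uniform logits, row t spreads its weight evenly over the first t+1 positions and puts exactly zero weight on every later one.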
Pre-training: Next-Token Prediction
Training objective is breathtakingly simple:
For each position t in the corpus, predict token x_t given the preceding tokens x_{t-1}, x_{t-2}, ..., x_{t-N} (N = context window).
Loss = cross-entropy of predicted distribution vs. actual next token.
Unlike BERT, every position in every training example contributes a loss signal
(BERT computes loss only on masked positions, ~15% of tokens),
so each training sequence yields a much denser learning signal.
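The objective can be sketched in a few lines of plain Python. This assumes the model has already produced one vector of logits per position. The function name `next_token_loss` and the toy data are hypothetical, and the uniform "model" is a stand-in only.

```python
import math


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy for next-token prediction.

    logits_per_position[i] holds the logits (one per vocabulary entry)
    produced after reading token_ids[: i + 1]; its target is token_ids[i + 1].
    """
    losses = []
    for i in range(len(token_ids) - 1):
        logits = logits_per_position[i]
        target = token_ids[i + 1]
        # log-sum-exp trick keeps the softmax normaliser numerically stable
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[target])  # equals -log p(target)
    return sum(losses) / len(losses)


# Toy vocabulary of 3 tokens; a stand-in "model" that outputs uniform logits.
tokens = [0, 2, 1, 2]
uniform_logits = [[0.0, 0.0, 0.0] for _ in tokens]
loss = next_token_loss(uniform_logits, tokens)
print(loss)  # a uniform guess over 3 tokens costs about log(3)
```

Every position except the last supplies a target, which is the "every position contributes" property described above.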
Training corpus:
GPT-2: ~40GB filtered web text (8M documents)
GPT-3: ~570GB of filtered CommonCrawl, plus WebText2, books, and Wikipedia
GPT-4: undisclosed, likely multi-TB including code
Training runs take hundreds of thousands of GPU-hours or more.
Result: a model that has implicitly read a large share of the public web.
One objective · billions of tokens · emergent capabilities at scale
Sampling: How Text is Actually Generated
At inference, GPT outputs a probability distribution over the vocabulary at each step.
We must SAMPLE a token from this distribution.
Greedy: pick the highest-probability token
+ Deterministic, focused
− Repetitive, often boring ('the the the the')
Temperature T: rescale logits before softmax: logits / T
T → 0 = greedy (T = 0 itself would divide by zero, so implementations treat it as argmax)
T = 1.0 = original distribution
T = 2.0 = flatter distribution → more random
Top-k: sample only from top k highest-probability tokens
k = 50 is typical, prevents picking very rare tokens
Top-p (nucleus): sample from the smallest set whose cumulative probability ≥ p
p = 0.9 is typical; the candidate set is dynamic: wider when the model is uncertain, narrower when it is confident
Most modern LLMs use a combination of temperature + top-p.
Sampling controls the trade-off between focus (low T) and creativity (high T)
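The strategies above can be combined in one small function. This is a sketch in plain Python under stated assumptions, not any particular library's implementation: the name `sample_token` is made up, `top_k` is assumed to be at least 1, and top-p is applied to the probabilities as computed before the top-k cut (some libraries renormalise in between).

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from raw logits using temperature, top-k and top-p."""
    if temperature <= 0:
        # T -> 0 limit: greedy decoding, which also avoids dividing by zero
        return max(range(len(logits)), key=lambda i: logits[i])

    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate token indices, most probable first
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest prefix reaching mass p
                break
        order = kept

    # random.choices treats the weights as relative, so no renormalising needed
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.1, -1.0]
greedy_choice = sample_token(toy_logits, temperature=0)
print(greedy_choice)  # greedy always returns the argmax, index 0 here
print(sample_token(toy_logits, temperature=0.8, top_p=0.9, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which helps when comparing settings.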
In-Context Learning: GPT-3's Magic Trick
- Zero-shot: just describe the task. 'Translate to French: Hello →' Model: 'Bonjour'
- One-shot: give one example. 'EN: cat → FR: chat. EN: dog → FR:' Model: 'chien'
- Few-shot: 3-5 examples. Model often beats zero-shot dramatically without any weight updates.
- Chain-of-thought: ask the model to think step by step. Solves complex math/reasoning problems via decomposition.
- All of this happens at INFERENCE TIME with no training — the prompt itself is the program. This was the breakthrough that made GPT-3 'feel intelligent' to non-specialists.
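Since the prompt itself is the program, a few-shot prompt is just a string assembled from an instruction, some worked examples, and an open-ended query. The helper below is a hypothetical illustration: `build_few_shot_prompt` and the `EN: ... -> FR:` layout are assumptions, not a required format. Sending the string to a model is left out.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the open query."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"EN: {source} -> FR: {target}")
    lines.append(f"EN: {query} -> FR:")  # left open for the model to complete
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cat", "chat"), ("dog", "chien")],
    "bird",
)
print(prompt)
```

Passing an empty list of examples turns the same helper into a zero-shot prompt, and one pair makes it one-shot. No weights change in either case.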