GPT Explained Visually - Autoregressive Transformer Lesson


A loss function measures how wrong your model is. It takes the model's prediction ŷ and the true answer y, and outputs a single number: the penalty. During training, the optimizer tries to minimize this number. The choice of loss function shapes what 'wrong' means — and that shapes everything the model learns.

Mean Squared Error (MSE)

MSE = (ŷ − y)²

Gradient:  dL/dŷ = 2(ŷ − y)

For a batch of n examples:
  MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²

Example:
  True y = 3.0,  Prediction ŷ = 4.2
  MSE = (4.2 − 3.0)² = 1.44
  Gradient = 2 × 1.2 = 2.4  →  push ŷ toward y

Gradient grows with error magnitude — large mistakes receive large corrections
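As a minimal sketch of the formulas above, the batch MSE and its gradient can be written in plain Python. The function names mse and mse_grad are illustrative choices, not part of the lesson, and the gradient here includes the 1/n factor that comes from averaging over the batch.

```python
def mse(preds, targets):
    """Mean squared error averaged over a batch of predictions."""
    n = len(preds)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / n


def mse_grad(preds, targets):
    """Gradient of the batch MSE with respect to each prediction: 2(p - t) / n."""
    n = len(preds)
    return [2 * (p - t) / n for p, t in zip(preds, targets)]


# Worked example from the text: y = 3.0, prediction = 4.2
loss = mse([4.2], [3.0])
grad = mse_grad([4.2], [3.0])[0]
print(round(loss, 2), round(grad, 2))  # 1.44 2.4
```

Doubling the error doubles the gradient and quadruples the loss, which is why a single large mistake can dominate a batch.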

Mean Absolute Error (MAE)

MAE = |ŷ − y|

Gradient:  dL/dŷ = +1 if ŷ > y,  −1 if ŷ < y  (undefined at ŷ = y; 0 is a common convention)

Example:
  True y = 3.0,  Prediction ŷ = 4.2
  MAE = |4.2 − 3.0| = 1.2
  Gradient = +1  →  push ŷ toward y with constant force

Constant gradient ±1 — outliers get the same treatment as small errors
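The constant gradient can be seen in a short sketch. The names mae and mae_grad are hypothetical, and returning 0 when the prediction equals the target is a convention assumed here, since the absolute value has no derivative at that point.

```python
def mae(preds, targets):
    """Mean absolute error averaged over a batch of predictions."""
    n = len(preds)
    return sum(abs(p - t) for p, t in zip(preds, targets)) / n


def mae_grad(pred, target):
    """(Sub)gradient of |pred - target| for one example.

    Returns 0.0 at pred == target by convention (an assumption of this sketch).
    """
    if pred > target:
        return 1.0
    if pred < target:
        return -1.0
    return 0.0


# A small error and a huge outlier receive the same gradient magnitude.
small = mae_grad(4.2, 3.0)
outlier = mae_grad(103.0, 3.0)
```

Both small and outlier evaluate to +1 here, whereas the MSE gradient for the second case would be a hundred times larger than for the first.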

Binary Cross-Entropy (BCE)

BCE = −y log(ŷ) − (1 − y) log(1 − ŷ)

where ŷ = σ(z) is the sigmoid output (a probability in (0,1))
and y ∈ {0, 1} is the true class.

When y = 1:   BCE = −log(ŷ)      → 0 loss when ŷ→1, infinite when ŷ→0
When y = 0:   BCE = −log(1 − ŷ)  → 0 loss when ŷ→0, infinite when ŷ→1

Gradient:  dL/dŷ = −y/ŷ + (1−y)/(1−ŷ)
Combined with the sigmoid derivative dŷ/dz = ŷ(1 − ŷ):  dL/dz = ŷ − y  ← beautifully clean

BCE explodes when the model is confident AND wrong — a strong learning signal
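The clean gradient dL/dz = ŷ − y can be checked numerically. The sketch below is one possible way to do it, assuming a small clamp eps to avoid log(0) and a central finite difference with step h; both are choices made for this example, and all function names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def bce(y_hat, y, eps=1e-12):
    """Binary cross-entropy for one example; y_hat is clamped away from 0 and 1."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)


def bce_grad_z(z, y):
    """Analytic gradient of BCE(sigmoid(z), y) with respect to the logit z."""
    return sigmoid(z) - y


def numeric_grad_z(z, y, h=1e-6):
    """Central finite-difference estimate of the same gradient."""
    return (bce(sigmoid(z + h), y) - bce(sigmoid(z - h), y)) / (2 * h)


# Confident and right vs. confident and wrong, both with true label y = 1
loss_right = bce(0.99, 1)
loss_wrong = bce(0.01, 1)
```

With y = 1, loss_wrong is several hundred times larger than loss_right, and bce_grad_z agrees with numeric_grad_z to within the finite-difference error.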

Huber Loss

Huber(e, δ=1) = ½e²       if |e| ≤ δ
              = δ(|e| − ½δ)  if |e| > δ

where e = ŷ − y

Gradient:  dL/de = e           if |e| ≤ δ
                 = δ·sign(e)   if |e| > δ

Behavior:
  Small errors (|e| ≤ δ): quadratic like MSE, so it is smooth with zero gradient at the minimum
  Large errors (|e| > δ): linear like MAE, so the gradient is capped at ±δ and outliers have bounded influence

δ controls the transition point — the smaller δ, the more like MAE
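The piecewise definition translates directly into code. This is a sketch with hypothetical names huber and huber_grad; delta defaults to 1 to match the formulas above, and for a general δ the large-error gradient is δ·sign(e).

```python
def huber(e, delta=1.0):
    """Huber loss for a single error e = prediction - target."""
    if abs(e) <= delta:
        return 0.5 * e * e                  # quadratic region
    return delta * (abs(e) - 0.5 * delta)   # linear region


def huber_grad(e, delta=1.0):
    """Gradient of the Huber loss with respect to e."""
    if abs(e) <= delta:
        return e
    return delta if e > 0 else -delta


# The two branches meet at |e| = delta, so the loss has no jump there.
at_boundary = huber(1.0)          # 0.5 from the quadratic branch
just_outside = huber(1.0 + 1e-9)  # essentially 0.5 from the linear branch
```

The gradient is also continuous at the boundary: it reaches δ from the quadratic side and stays at δ on the linear side, which is what makes Huber smoother to optimize than MAE.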

Which loss to choose?

Regression (predicting numbers): start with MSE. Switch to Huber if your data has outliers.
Binary classification (yes/no): use Binary Cross-Entropy (BCE) with a sigmoid output as the default.
Multi-class classification: use Categorical Cross-Entropy (CCE) with a softmax output.
Ranking problems: use custom margin losses (triplet loss, contrastive loss).
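Avoid MSE for classification: behind a sigmoid, its gradient is dL/dz = 2(ŷ − y)·ŷ(1 − ŷ), which shrinks toward zero when ŷ saturates near 0 or 1, even if the model is confidently wrong. BCE's gradient ŷ − y stays large in exactly that case.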
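To make the last point concrete, the sketch below compares the gradient with respect to the logit z for MSE and for BCE when both sit behind a sigmoid. The logit z = −6 with true label y = 1 is an arbitrary example of a confidently wrong prediction, and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def mse_grad_z(z, y):
    """d/dz of (sigmoid(z) - y)^2, via the chain rule."""
    y_hat = sigmoid(z)
    return 2 * (y_hat - y) * y_hat * (1 - y_hat)


def bce_grad_z(z, y):
    """d/dz of BCE(sigmoid(z), y), which simplifies to y_hat - y."""
    return sigmoid(z) - y


# Confidently wrong: the model outputs roughly 0.0025 for a positive example.
z, y = -6.0, 1
g_mse = mse_grad_z(z, y)
g_bce = bce_grad_z(z, y)
```

Here g_bce is close to −1 while g_mse is on the order of −0.005, so with MSE the example that most needs correcting produces almost no update.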
