A loss function measures how wrong your model is. It takes the model's prediction ŷ and the true answer y, and outputs a single number: the penalty. During training, the optimizer tries to minimize this number. The choice of loss function shapes what 'wrong' means — and that shapes everything the model learns.
Mean Squared Error (MSE)
MSE = (ŷ − y)²
Gradient: dL/dŷ = 2(ŷ − y)
For a batch of n examples:
MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²
Example:
True y = 3.0, Prediction ŷ = 4.2
MSE = (4.2 − 3.0)² = 1.44
Gradient = 2 × 1.2 = 2.4 → push ŷ toward y
Gradient grows with error magnitude: large mistakes receive large corrections
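The MSE formulas above can be checked with a few lines of plain Python. This is a minimal sketch, not a framework implementation: the function names `mse` and `mse_grad` are illustrative, and the gradient is taken with respect to each prediction in the batch, which is why it carries the extra 1/n factor.

```python
def mse(preds, targets):
    """Mean squared error over a batch of predictions."""
    n = len(preds)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / n


def mse_grad(preds, targets):
    """Gradient of the batch MSE with respect to each prediction."""
    n = len(preds)
    return [2 * (p - t) / n for p, t in zip(preds, targets)]


# Reproduce the worked example: y = 3.0, prediction = 4.2 (batch of one).
loss = mse([4.2], [3.0])
grad = mse_grad([4.2], [3.0])
print(round(loss, 4), [round(g, 4) for g in grad])  # 1.44 [2.4]
```

With a batch of one, the 1/n factor is 1, so the result matches the single-example gradient 2(ŷ − y).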
Mean Absolute Error (MAE)
MAE = |ŷ − y|
Gradient: dL/dŷ = +1 if ŷ > y, −1 if ŷ < y (undefined at ŷ = y)
Example:
True y = 3.0, Prediction ŷ = 4.2
MAE = |4.2 − 3.0| = 1.2
Gradient = +1 → push ŷ toward y with constant force
Constant gradient ±1: outliers get the same treatment as small errors
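A small sketch can make the constant-gradient behavior concrete. The names `mae` and `mae_grad_single` are illustrative. Returning 0 when the prediction equals the target is a convention assumed here, since the derivative of the absolute value does not exist at that point.

```python
def mae(preds, targets):
    """Mean absolute error over a batch of predictions."""
    n = len(preds)
    return sum(abs(p - t) for p, t in zip(preds, targets)) / n


def mae_grad_single(pred, target):
    """Subgradient of |pred - target| with respect to pred.

    Returns 0.0 at pred == target (an assumed convention; the true
    derivative is undefined there).
    """
    if pred > target:
        return 1.0
    if pred < target:
        return -1.0
    return 0.0


# A small error and a huge outlier produce the same gradient magnitude.
print(mae_grad_single(4.2, 3.0))    # 1.0
print(mae_grad_single(103.0, 3.0))  # 1.0
```

Compare this with MSE, where the second case would yield a gradient of 200 rather than 1.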
Binary Cross-Entropy (BCE)
BCE = −y log(ŷ) − (1 − y) log(1 − ŷ)
where ŷ = σ(z) is the sigmoid output (a probability in (0,1))
and y ∈ {0, 1} is the true class.
When y = 1: BCE = −log(ŷ) → 0 loss when ŷ→1, infinite when ŷ→0
When y = 0: BCE = −log(1 − ŷ) → 0 loss when ŷ→0, infinite when ŷ→1
Gradient: dL/dŷ = −y/ŷ + (1−y)/(1−ŷ)
Combined with sigmoid: dL/dz = ŷ − y ← beautifully clean
BCE explodes when the model is confident AND wrong: a strong learning signal
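The claim that the sigmoid and BCE gradients combine to ŷ − y can be checked numerically. The sketch below compares the analytic value against a central finite difference. The function names and the test point (z = −2, y = 1) are illustrative choices. The naive `bce` here would fail if ŷ were exactly 0 or 1, so practical implementations usually work from the logit z directly or clip ŷ.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping a logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(y_hat, y):
    """Binary cross-entropy for one example (y_hat strictly inside (0, 1))."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


def bce_from_logit(z, y):
    """BCE as a function of the logit z, via the sigmoid."""
    return bce(sigmoid(z), y)


# A wrong prediction: true class is 1, but the logit is negative.
z, y = -2.0, 1
analytic = sigmoid(z) - y  # dL/dz = y_hat - y

# Central finite difference as an independent check.
h = 1e-6
numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)

print(analytic, numeric)  # the two values should agree closely
```

The loss itself also behaves as described: `bce(0.01, 1)` is far larger than `bce(0.9, 1)`.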
Huber Loss
Huber(e, δ=1) = ½e² if |e| ≤ δ
= δ(|e| − ½δ) if |e| > δ
where e = ŷ − y
Gradient: dL/de = e if |e| ≤ 1
= sign(e) if |e| > 1
Behavior:
Small errors (|e| ≤ 1): behaves like MSE, smooth with zero gradient at the minimum
Large errors (|e| > 1): behaves like MAE, with a constant gradient that limits the influence of outliers
δ controls the transition point: the smaller δ, the more like MAE
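The piecewise definition translates directly into code. This is a minimal sketch with illustrative names `huber` and `huber_grad`. For a general δ the gradient in the linear region is δ·sign(e), which reduces to the sign(e) shown above when δ = 1.

```python
def huber(e, delta=1.0):
    """Huber loss for a single error e = y_hat - y."""
    if abs(e) <= delta:
        return 0.5 * e * e                  # quadratic region, like MSE
    return delta * (abs(e) - 0.5 * delta)   # linear region, like MAE


def huber_grad(e, delta=1.0):
    """Gradient of the Huber loss with respect to e."""
    if abs(e) <= delta:
        return e                            # grows with the error
    return delta if e > 0 else -delta       # capped at +/- delta


# Small error: quadratic loss, gradient equals the error.
print(huber(0.5), huber_grad(0.5))  # 0.125 0.5
# Large error: linear loss, gradient capped at delta.
print(huber(3.0), huber_grad(3.0))  # 2.5 1.0
```

The two branches agree at |e| = δ (both give ½δ²), so the loss and its gradient are continuous across the transition.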
Which loss to choose?
- Regression (predicting numbers): start with MSE. Switch to Huber if your data has outliers.
- Binary classification (yes/no): always use Binary Cross-Entropy (BCE) with a sigmoid output.
- Multi-class classification: use Categorical Cross-Entropy (CCE) with a softmax output.
- Ranking problems: use custom margin losses (triplet loss, contrastive loss).
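- Avoid MSE for classification: with a sigmoid output, its gradient with respect to z is 2(ŷ − y)ŷ(1 − ŷ), which shrinks toward zero as ŷ saturates near 0 or 1, so a confidently wrong prediction gets almost no correction.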
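The last point in the list can be illustrated by comparing gradients with respect to the logit z. The sketch below assumes a sigmoid output and a true class of 1. The function names are illustrative, and the MSE gradient follows from the chain rule using dŷ/dz = ŷ(1 − ŷ).

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping a logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mse_grad_logit(z, y):
    """d/dz of (sigmoid(z) - y)^2, via the chain rule."""
    y_hat = sigmoid(z)
    return 2 * (y_hat - y) * y_hat * (1 - y_hat)


def bce_grad_logit(z, y):
    """d/dz of BCE(sigmoid(z), y), which simplifies to y_hat - y."""
    return sigmoid(z) - y


# True class is 1; increasingly negative logits are increasingly wrong.
for z in (-1.0, -5.0, -10.0):
    print(z, mse_grad_logit(z, 1), bce_grad_logit(z, 1))
```

As z becomes more negative, the MSE gradient shrinks toward zero while the BCE gradient approaches −1, so BCE keeps pushing hardest exactly where the model is most wrong.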