Learn Regularization (L1 & L2)

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Simple theory: Regularization adds a penalty that discourages overly large or unnecessary weights. It keeps the model simpler so it is less likely to memorize training noise.

Without constraints, a model will fit the training data as closely as possible — including all the noise. Regularisation adds a penalty to the loss function that discourages large weights. The model must balance fitting the data and staying simple. Simplicity is the constraint that prevents memorisation.

L2 regularisation — Ridge

L2 objective:  Loss(w) = MSE(w) + λ·Σwᵢ²

Gradient update with L2:
  w ← w − lr·(∂MSE/∂w + 2λ·w)
  w ← w·(1 − 2·lr·λ) − lr·∂MSE/∂w

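Effect: each step multiplies every weight by (1 − 2·lr·λ) before the MSE gradient is applied, so weights shrink toward zero but never land exactly on zero (assuming 2·lr·λ < 1).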

λ (lambda) = regularisation strength. Tune via cross-validation.

L2 penalises large weights quadratically. The effect: all weights shrink toward (but never reach) zero. All features are retained but with reduced importance. L2 is preferred when all features carry some signal.
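
The two update lines above are the same step written two ways. A minimal sketch in plain Python can check that numerically; the function names and example numbers are assumptions made for illustration, not part of the lesson.

```python
def l2_step_gradient_form(w, grad_mse, lr, lam):
    """w <- w - lr * (dMSE/dw + 2*lam*w), applied per weight."""
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad_mse)]


def l2_step_decay_form(w, grad_mse, lr, lam):
    """w <- w * (1 - 2*lr*lam) - lr * dMSE/dw, applied per weight."""
    decay = 1.0 - 2.0 * lr * lam
    return [wi * decay - lr * gi for wi, gi in zip(w, grad_mse)]


# Illustrative values (assumed for this example only).
weights = [3.0, -1.5, 0.25]
grad_mse = [0.4, -0.2, 0.0]
lr, lam = 0.1, 0.5

step_a = l2_step_gradient_form(weights, grad_mse, lr, lam)
step_b = l2_step_decay_form(weights, grad_mse, lr, lam)

# With a zero MSE gradient, the penalty alone scales each weight
# by (1 - 2*lr*lam) per step: smaller every time, but not zero.
shrunk = weights
for _ in range(50):
    shrunk = l2_step_decay_form(shrunk, [0.0, 0.0, 0.0], lr, lam)

print(step_a, step_b, shrunk)
```

The second form is why L2 is often described as weight decay: the penalty shows up as a constant multiplicative factor on the weights.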

L1 objective:  Loss(w) = MSE(w) + λ·Σ|wᵢ|

L2 objective:  Loss(w) = MSE(w) + λ·Σwᵢ²

Elastic Net:   Loss(w) = MSE(w) + λ₁·Σ|wᵢ| + λ₂·Σwᵢ²

Elastic Net combines L1 sparsity with L2 stability, which helps when features are correlated.
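
To see why L1 can give exact zeros while L2 only shrinks, it helps to look at a single weight. The sketch below assumes a simplified one-weight data loss (w − a)², where a is the value the weight would take without any penalty. This toy setup and the function names are assumptions for illustration, not the lesson's model. Under that assumption both minimisers have closed forms: soft-thresholding for L1 and a rescaling for L2.

```python
def l1_minimiser(a, lam):
    """Minimise (w - a)**2 + lam * abs(w): soft-threshold a at lam / 2."""
    shrunk = abs(a) - lam / 2.0
    if shrunk <= 0.0:
        return 0.0  # small weights are set exactly to zero
    return shrunk if a > 0 else -shrunk


def l2_minimiser(a, lam):
    """Minimise (w - a)**2 + lam * w**2: rescale a by 1 / (1 + lam)."""
    return a / (1.0 + lam)


# Unpenalised weight values (made up for this example).
unregularised = [2.0, 0.3, -0.1, -1.2]
lam = 1.0

l1_weights = [l1_minimiser(a, lam) for a in unregularised]
l2_weights = [l2_minimiser(a, lam) for a in unregularised]

print("L1:", l1_weights)
print("L2:", l2_weights)
```

In this toy case L1 subtracts a fixed amount from every weight, so small weights hit zero, while L2 scales every weight by the same factor, so none of them do.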

Elastic Net and choosing λ

Elastic Net combines the L1 and L2 penalties as α × L1 + (1−α) × L2, where α between 0 and 1 sets the balance between the two. You get both sparsity (feature selection) and the stability of L2. Choose λ via cross-validation: try a range of values, measure the loss on held-out validation data, and pick the value with the lowest validation loss. If λ is too small, the model still overfits. If λ is too large, the model underfits (weights are forced to near zero and the model is too simple).
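
The α mix used here and the two-coefficient Elastic Net objective (λ₁ and λ₂) can describe the same penalty. The sketch below assumes an overall strength lam multiplying the mix, so that λ₁ = lam·α and λ₂ = lam·(1−α). That mapping and all function names are assumptions for illustration; libraries differ in how they scale each term.

```python
def l1_penalty(w):
    """Sum of absolute weights."""
    return sum(abs(wi) for wi in w)


def l2_penalty(w):
    """Sum of squared weights."""
    return sum(wi * wi for wi in w)


def elastic_net_mix(w, lam, alpha):
    """Penalty in mixing form: lam * (alpha * L1 + (1 - alpha) * L2)."""
    return lam * (alpha * l1_penalty(w) + (1.0 - alpha) * l2_penalty(w))


def elastic_net_two_lambdas(w, lam1, lam2):
    """Penalty in two-coefficient form: lam1 * L1 + lam2 * L2."""
    return lam1 * l1_penalty(w) + lam2 * l2_penalty(w)


# Illustrative values (assumed for this example only).
weights = [0.5, -2.0, 0.0, 1.5]
lam, alpha = 0.1, 0.7

mixed = elastic_net_mix(weights, lam, alpha)
split = elastic_net_two_lambdas(weights, lam * alpha, lam * (1.0 - alpha))

print(mixed, split)
```

Setting alpha to 1 leaves only the L1 term and setting it to 0 leaves only the L2 term, so the mixing form covers Lasso and Ridge as its two extremes.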

L1 (Lasso): λΣ|wᵢ| — produces sparse models (many exact zeros). Use for feature selection.
L2 (Ridge): λΣwᵢ² — shrinks every weight toward zero in proportion to its size, without setting any to exactly zero. Use when all features matter.
Elastic Net: α·L1 + (1-α)·L2 — combines both. Use when features are correlated.
Dropout (neural nets): randomly zeroing activations acts as implicit regularisation.
Early stopping: stop training when validation loss starts increasing — implicit regularisation.
Choose λ: use cross-validation with logarithmic grid search (0.0001, 0.001, 0.01, 0.1, 1, 10).
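
The λ selection procedure in the last item can be sketched in plain Python. To stay self-contained, this example assumes a one-weight ridge model with no intercept, a simple interleaved k-fold split, and made-up data. The helper names are hypothetical, and a real project would more likely rely on a library's cross-validation tools.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimiser of mean((y - w*x)**2) + lam * w**2 for one weight."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


def mse(xs, ys, w):
    """Mean squared error of the prediction w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_loss(xs, ys, lam, k=4):
    """Average validation MSE over k folds (fold i holds out every k-th point)."""
    total = 0.0
    for fold in range(k):
        pairs = list(enumerate(zip(xs, ys)))
        train = [p for i, p in pairs if i % k != fold]
        valid = [p for i, p in pairs if i % k == fold]
        w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
        total += mse([x for x, _ in valid], [y for _, y in valid], w)
    return total / k


# Toy data (made up): y is roughly 2*x plus fixed "noise" terms.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.3, 1.7, 3.4, 3.8, 5.3, 5.6, 7.4, 7.9]

# Logarithmic grid from the list above.
grid = [0.0001, 0.001, 0.01, 0.1, 1, 10]
scores = {lam: cross_val_loss(xs, ys, lam) for lam in grid}
best_lam = min(scores, key=scores.get)

print("validation loss per lambda:", scores)
print("selected lambda:", best_lam)
```

On data like this the largest λ in the grid pushes the weight far below its useful value, so its validation loss is much worse than the small-λ end of the grid, which is the underfitting case described earlier.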
