Classical ML - Advanced - 12 min

Learn Regularization (L1 & L2)

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Last updated: 2026-05-13.

Simple theory: Regularization adds a penalty that discourages overly large or unnecessary weights. It keeps the model simpler so it is less likely to memorize training noise.

Without constraints, a model will fit the training data as closely as possible — including all the noise. Regularisation adds a penalty to the loss function that discourages large weights. The model must balance fitting the data and staying simple. Simplicity is the constraint that prevents memorisation.

L2 regularisation — Ridge

L2 objective:  Loss(w) = MSE(w) + λ·Σwᵢ²

Gradient update with L2:
  w ← w − lr·(∂MSE/∂w + 2λ·w)
  w ← w·(1 − 2·lr·λ) − lr·∂MSE/∂w

Effect: each step multiplies the weights by (1 − 2·lr·λ), shrinking them toward zero without setting them exactly to zero (assuming 0 < 2·lr·λ < 1).

λ (lambda) = regularisation strength. Tune via cross-validation.

L2 penalises large weights quadratically. The effect: all weights shrink toward (but never reach) zero. All features are retained but with reduced importance. L2 is preferred when all features carry some signal.
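The two forms of the L2 update above are algebraically the same step. The following is a minimal sketch in plain Python that checks this numerically; the function names, the example weights, and the values of lr and λ are illustrative assumptions, and the MSE gradient is set to zero only to isolate the shrinkage effect.

```python
def l2_step(w, grad_mse, lr, lam):
    """One gradient step on MSE(w) + lam * sum(w_i^2)."""
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad_mse)]


def l2_step_decay_form(w, grad_mse, lr, lam):
    """The same step written as weight decay: scale w, then follow the MSE gradient."""
    decay = 1.0 - 2.0 * lr * lam
    return [wi * decay - lr * gi for wi, gi in zip(w, grad_mse)]


# Illustrative values (assumptions, not from the lesson).
w = [3.0, -1.5, 0.2]
grad = [0.0, 0.0, 0.0]  # zero data gradient, so only the penalty acts
lr, lam = 0.1, 0.5

a = l2_step(w, grad, lr, lam)
b = l2_step_decay_form(w, grad, lr, lam)

for before, after in zip(w, a):
    print(f"{before:+.3f} -> {after:+.3f}")
```

With these values the decay factor is 1 − 2·0.1·0.5 = 0.9, so every weight keeps its sign and loses 10% of its magnitude per step.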

L1 objective:  Loss(w) = MSE(w) + λ·Σ|wᵢ|

L2 objective:  Loss(w) = MSE(w) + λ·Σwᵢ²

Elastic Net:   Loss(w) = MSE(w) + λ₁·Σ|wᵢ| + λ₂·Σwᵢ²

Elastic Net combines the sparsity of L1 with the stability of L2.
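The three objectives differ only in the penalty term, which a short sketch can make concrete. The code below is one way to write them in plain Python for a linear model without an intercept; the helper names and the tiny dataset are assumptions made for illustration.

```python
def mse(w, X, y):
    """Mean squared error of a linear model with weights w (no intercept)."""
    preds = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
    return sum((p - yi) ** 2 for p, yi in zip(preds, y)) / len(y)


def l1_penalty(w):
    return sum(abs(wi) for wi in w)


def l2_penalty(w):
    return sum(wi * wi for wi in w)


def elastic_net_loss(w, X, y, lam1, lam2):
    """MSE + lam1 * sum|w_i| + lam2 * sum(w_i^2).

    lam2 = 0 gives the L1 (Lasso) objective; lam1 = 0 gives the L2 (Ridge) objective.
    """
    return mse(w, X, y) + lam1 * l1_penalty(w) + lam2 * l2_penalty(w)


# Tiny illustrative dataset (an assumption, not from the lesson).
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
w = [1.0, -2.0]

print("MSE:        ", mse(w, X, y))
print("L1 loss:    ", elastic_net_loss(w, X, y, 0.1, 0.0))
print("L2 loss:    ", elastic_net_loss(w, X, y, 0.0, 0.01))
print("Elastic Net:", elastic_net_loss(w, X, y, 0.1, 0.01))
```

Writing Lasso and Ridge as special cases of the Elastic Net function mirrors the formulas: setting one of the two strengths to zero removes that penalty term.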

Elastic Net and choosing λ

Elastic Net combines L1 and L2. It is often written with one overall strength λ and a mixing ratio α between 0 and 1: penalty = λ·(α·Σ|wᵢ| + (1−α)·Σwᵢ²), which matches the form above with λ₁ = λ·α and λ₂ = λ·(1−α). You get both sparsity (feature selection) and the stability of L2. Choose λ via cross-validation: try a range of values and pick the one with the lowest loss on held-out data. Too small λ → still overfitting. Too large λ → underfitting (weights forced to near-zero, model too simple).
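Choosing λ by cross-validation can be sketched as follows. To keep the example self-contained it uses a single-feature Ridge model without an intercept, for which minimising MSE(w) + λ·w² gives the closed form w = Σxy / (Σx² + n·λ). The synthetic data, the 5-fold split, and all function names are assumptions for illustration only.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimiser of (1/n) * sum((w*x - y)^2) + lam * w^2."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)


def mse_1d(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cv_loss(xs, ys, lam, k=5):
    """Average validation MSE over k folds (fold = index modulo k)."""
    total = 0.0
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        val = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
        total += mse_1d(w, [x for x, _ in val], [y for _, y in val])
    return total / k


# Synthetic data (an assumption): y = 2x plus Gaussian noise.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + random.gauss(0.0, 0.3) for x in xs]

grid = [0.0001, 0.001, 0.01, 0.1, 1, 10]  # logarithmic grid
losses = {lam: cv_loss(xs, ys, lam) for lam in grid}
best_lam = min(losses, key=losses.get)

for lam in grid:
    print(f"lambda={lam:<7} validation MSE={losses[lam]:.4f}")
print("best lambda:", best_lam)
```

On data like this the largest λ forces the weight close to zero and the validation loss rises sharply, which is the underfitting end of the trade-off.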

  • L1 (Lasso): λΣ|wᵢ|. Produces sparse models (many exact zeros). Use for feature selection.
  • L2 (Ridge): λΣwᵢ². Shrinks all weights toward zero in proportion to their size, so none become exactly zero. Use when all features matter.
  • Elastic Net: α·L1 + (1−α)·L2. Combines both. Use when features are correlated.
  • Dropout (neural nets): randomly zeroing activations during training acts as implicit regularisation.
  • Early stopping: stop training when validation loss starts increasing. This also acts as implicit regularisation.
  • Choose λ: use cross-validation with a logarithmic grid search (0.0001, 0.001, 0.01, 0.1, 1, 10).
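The difference between L1's exact zeros and L2's proportional shrinkage can be seen in a small sketch. The lesson does not name a specific L1 solver; the code below uses the proximal (soft-thresholding) step, a standard technique swapped in here for illustration, alongside the corresponding proximal step for the L2 penalty. The function names, weights, and threshold t are assumptions.

```python
def soft_threshold(w, t):
    """Proximal step for the penalty t*|w|: move toward zero by t, clamping at exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def l2_shrink(w, t):
    """Proximal step for the penalty t*w^2: scale by 1/(1 + 2t); nonzero stays nonzero."""
    return w / (1.0 + 2.0 * t)


# Illustrative weights (an assumption): two large, two small.
weights = [2.0, -0.8, 0.05, -0.02]
t = 0.1

l1_result = [soft_threshold(w, t) for w in weights]
l2_result = [l2_shrink(w, t) for w in weights]

print("L1:", l1_result)
print("L2:", l2_result)
```

The two small weights fall inside the threshold and become exactly 0.0 under the L1 step, while the L2 step only rescales every weight by the same factor.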

Practice questions

  1. What does L1 regularisation (Lasso) do to model weights that L2 regularisation (Ridge) does not?
  2. You're training on 1000 features but suspect only 50 are relevant. Which regulariser is more appropriate?
  3. What happens when you increase λ (the regularisation strength) from 0 to very large values?
  4. Elastic Net regularisation combines L1 and L2. What is the main reason to use it over either one alone?
