Simple theory: Regularisation adds a penalty that discourages overly large or unnecessary weights. It keeps the model simpler, so it is less likely to memorise training noise.
Without constraints, a sufficiently flexible model will fit the training data as closely as possible, including the noise. Regularisation adds a penalty to the loss function that discourages large weights, so the model must balance fitting the data against staying simple. Simplicity is the constraint that prevents memorisation.
L2 regularisation — Ridge
L2 objective: Loss(w) = MSE(w) + λ·Σwᵢ²
Gradient update with L2:
w ← w − lr·(∂MSE/∂w + 2λ·w)
Equivalently, collecting the terms in w:
w ← w·(1 − 2·lr·λ) − lr·∂MSE/∂w
Effect: the weights shrink toward zero at each step (never exactly zero).
λ (lambda) = regularisation strength. Tune via cross-validation.
L2 penalises large weights quadratically. The effect: all weights shrink toward (but never reach) zero. All features are retained but with reduced importance. L2 is preferred when all features carry some signal.
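The two forms of the L2 update above describe the same step. A minimal NumPy sketch can check this numerically; the helper names, toy data, learning rate, and λ value are assumptions chosen only for illustration.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of the mean squared error (1/n)·Σ(Xw − y)² with respect to w."""
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ w - y)

def step_penalty_form(w, X, y, lr, lam):
    # w ← w − lr·(∂MSE/∂w + 2λ·w)
    return w - lr * (mse_grad(w, X, y) + 2.0 * lam * w)

def step_decay_form(w, X, y, lr, lam):
    # w ← w·(1 − 2·lr·λ) − lr·∂MSE/∂w
    return w * (1.0 - 2.0 * lr * lam) - lr * mse_grad(w, X, y)

# Toy regression problem (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
w0 = rng.normal(size=3)

a = step_penalty_form(w0, X, y, lr=0.1, lam=0.5)
b = step_decay_form(w0, X, y, lr=0.1, lam=0.5)
same_step = np.allclose(a, b)  # both forms should agree up to rounding
```

The second form makes the shrinkage visible: before the data gradient is applied, every weight is multiplied by the same factor (1 − 2·lr·λ).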
L1 objective: Loss(w) = MSE(w) + λ·Σ|wᵢ|
L2 objective: Loss(w) = MSE(w) + λ·Σwᵢ²
Elastic Net: Loss(w) = MSE(w) + λ₁·Σ|wᵢ| + λ₂·Σwᵢ²
Elastic Net combines L1 sparsity with L2 stability, the best of both.
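As a rough illustration, the three objectives can be written as small NumPy functions. The function names and the toy numbers are assumptions made for this sketch, not part of any library.

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model X @ w."""
    return np.mean((X @ w - y) ** 2)

def l1_loss(w, X, y, lam):
    # Loss(w) = MSE(w) + λ·Σ|wᵢ|
    return mse(w, X, y) + lam * np.sum(np.abs(w))

def l2_loss(w, X, y, lam):
    # Loss(w) = MSE(w) + λ·Σwᵢ²
    return mse(w, X, y) + lam * np.sum(w ** 2)

def elastic_net_loss(w, X, y, lam1, lam2):
    # Loss(w) = MSE(w) + λ₁·Σ|wᵢ| + λ₂·Σwᵢ²
    return mse(w, X, y) + lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

# Tiny hand-checkable example: residuals are [0, -4], so MSE = 8.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, -2.0])

loss_l1 = l1_loss(w, X, y, lam=0.1)                       # 8 + 0.1·(1 + 2)
loss_l2 = l2_loss(w, X, y, lam=0.1)                       # 8 + 0.1·(1 + 4)
loss_en = elastic_net_loss(w, X, y, lam1=0.1, lam2=0.1)   # 8 + 0.3 + 0.5
```

Note how the weight of size 2 contributes 2 to the L1 sum but 4 to the L2 sum: the quadratic penalty grows faster for large weights.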
Elastic Net and choosing λ
Elastic Net mixes the L1 and L2 penalties: α × L1 + (1−α) × L2, where α between 0 and 1 sets the mix and an overall strength λ multiplies the sum (so λ₁ = λ·α and λ₂ = λ·(1−α) in the two-parameter form). You get both sparsity (feature selection) and the stability of L2. Choose λ via cross-validation: fit the model with a range of values and pick the one with the best validation loss. Too small λ → still overfitting. Too large λ → underfitting (weights forced to near-zero, model too simple).
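The λ selection described above can be sketched for the Ridge case as follows. This is a simplified illustration under stated assumptions: it uses a single hold-out validation split rather than full k-fold cross-validation, synthetic data, and a hypothetical helper `ridge_fit` that solves the L2 objective in closed form. The grid is the logarithmic one listed in the summary of this section.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise MSE(w) + λ·Σwᵢ² in closed form.

    Setting the gradient (2/n)·Xᵀ(Xw − y) + 2λ·w to zero gives
    (XᵀX + n·λ·I)·w = Xᵀy.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

# Synthetic data (illustrative): 20 features, only 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(size=80)

# Hold-out split standing in for cross-validation.
X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

lambdas = [0.0001, 0.001, 0.01, 0.1, 1, 10]
val_losses = [mse(ridge_fit(X_train, y_train, lam), X_val, y_val)
              for lam in lambdas]
best_lam = lambdas[int(np.argmin(val_losses))]  # smallest validation loss
```

With k-fold cross-validation the same loop would average the validation loss over the folds before taking the minimum.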
- L1 (Lasso): λΣ|wᵢ|. Produces sparse models (many exact zeros). Use for feature selection.
- L2 (Ridge): λΣwᵢ². Shrinks all weights toward zero without eliminating any. Use when all features matter.
- Elastic Net: α·L1 + (1-α)·L2. Combines both. Use when features are correlated.
- Dropout (neural nets): randomly zeroing activations acts as implicit regularisation.
- Early stopping: stop training when validation loss starts increasing — implicit regularisation.
- Choose λ: use cross-validation with logarithmic grid search (0.0001, 0.001, 0.01, 0.1, 1, 10).
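Early stopping, from the list above, can be sketched with plain gradient descent on a linear model. This is only one possible variant: the `patience` parameter (how many non-improving steps to tolerate before stopping), the function name, and the synthetic data are assumptions introduced for illustration.

```python
import numpy as np

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.01, max_steps=5000, patience=20):
    """Gradient descent on MSE; keep the weights with the best validation loss."""
    n, d = X_tr.shape
    w = np.zeros(d)
    best_w, best_val, bad_steps = w.copy(), mse(w, X_val, y_val), 0
    for step in range(max_steps):
        grad = (2.0 / n) * X_tr.T @ (X_tr @ w - y_tr)
        w = w - lr * grad
        val = mse(w, X_val, y_val)
        if val < best_val:
            # Validation loss improved: remember these weights.
            best_w, best_val, bad_steps = w.copy(), val, 0
        else:
            # No improvement: stop once this has happened `patience` times in a row.
            bad_steps += 1
            if bad_steps >= patience:
                break
    return best_w, best_val, step + 1

# Synthetic data (illustrative): few samples, many features, noisy targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(size=60)

best_w, best_val, steps_run = train_with_early_stopping(
    X[:30], y[:30], X[30:], y[30:])
```

Returning the best weights seen so far, rather than the final ones, is what makes the stopping rule act as a regulariser: training is cut off before the model has fully fitted the training noise.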