An optimizer decides how to update a model's weights after backprop computes the gradients. The simplest rule, subtracting a fixed fraction of the gradient, works, but often slowly and inconsistently. Better optimizers add momentum, adapt the step size per parameter, or both. The choice of optimizer can make the difference between a model that trains in hours and one that takes days.
SGD — Stochastic Gradient Descent
w ← w − η · ∇L(w)
η = learning rate (fixed)
∇L = gradient of loss w.r.t. weights
Problem: uses the same η for every parameter, every step.
In elongated loss landscapes (common in deep networks), SGD oscillates
between the steep walls while creeping along the shallow valley.
Example with η = 0.25 in a 10:1 aspect bowl:
Steep axis: overshoots repeatedly (zigzag)
Shallow axis: tiny steps toward minimum
Same step size everywhere: efficient only when gradients are balanced
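The zigzag can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation: it assumes a quadratic bowl L = 0.5·(a_steep·x² + a_shallow·y²), and the curvatures 6.0 and 0.6 are illustrative values picked only to give a 10:1 ratio with η = 0.25. The function name `sgd_on_bowl` is hypothetical, and the gradient is exact rather than sampled from mini-batches.

```python
# Plain gradient descent on an elongated quadratic bowl:
#   L(x, y) = 0.5 * (a_steep * x**2 + a_shallow * y**2)
# The curvatures are assumptions chosen to give a 10:1 ratio.

def sgd_on_bowl(eta=0.25, a_steep=6.0, a_shallow=0.6, steps=5, start=(1.0, 1.0)):
    """Return the list of (x, y) positions visited, starting from `start`."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        grad_x = a_steep * x      # dL/dx along the steep axis
        grad_y = a_shallow * y    # dL/dy along the shallow axis
        x -= eta * grad_x         # w <- w - eta * grad, same eta on both axes
        y -= eta * grad_y
        path.append((x, y))
    return path

path = sgd_on_bowl()
for step, (x, y) in enumerate(path):
    print(f"step {step}: steep x = {x:+.4f}, shallow y = {y:+.4f}")
```

Under these assumptions each step multiplies x by 1 − 0.25·6 = −0.5, so x flips sign every step, while y is multiplied by 1 − 0.25·0.6 = 0.85 and only creeps toward zero.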
SGD + Momentum
v ← β·v − η·∇L
w ← w + v
β = momentum coefficient (typically 0.9)
v = velocity: exponentially decaying sum of past gradient steps (each −η·∇L)
Effect: consistent directions accumulate speed (β=0.9 looks back ~10 steps).
Oscillating directions cancel out → smoother path.
With β = 0.9: effective lr ≈ η / (1−β) = 10η, a 10× amplification on consistent axes
β controls memory: higher β = more history, more inertia
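Both effects can be checked numerically. The sketch below applies the velocity update to two made-up gradient sequences, one constant and one alternating in sign; `run_momentum` and the values η = 0.01 and 200 steps are illustrative assumptions, not part of the method.

```python
def run_momentum(grads, eta=0.01, beta=0.9):
    """Apply v <- beta * v - eta * g over a gradient sequence; return the final v."""
    v = 0.0
    for g in grads:
        v = beta * v - eta * g
    return v

eta, beta = 0.01, 0.9
plain_step = eta * 1.0  # size of one plain SGD step for a gradient of magnitude 1

consistent = run_momentum([1.0] * 200, eta, beta)         # same sign every step
alternating = run_momentum([1.0, -1.0] * 100, eta, beta)  # sign flips every step

print("consistent  |v| / plain step:", abs(consistent) / plain_step)
print("alternating |v| / plain step:", abs(alternating) / plain_step)
```

With a constant gradient the velocity settles at η/(1−β), ten times the plain step. With a sign-flipping gradient it settles at η/(1+β), roughly half the plain step, which is the cancellation described above.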
Adam — Adaptive Moment Estimation
m ← β₁·m + (1−β₁)·g (1st moment — mean of gradients)
v ← β₂·v + (1−β₂)·g² (2nd moment — mean of squared gradients)
m̂ = m / (1−β₁ᵗ) (bias-corrected)
v̂ = v / (1−β₂ᵗ) (bias-corrected)
w ← w − η · m̂ / (√v̂ + ε)
Default: β₁=0.9, β₂=0.999, ε=1e-8, η=0.001
Key insight: m̂/√v̂ normalises by the root mean square of past gradients.
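Parameters with large gradients are scaled down and parameters with small gradients
are scaled up, so each step has magnitude of roughly η whatever the gradient scale.
Steadily signed gradients give |m̂|/√v̂ ≈ 1 (a step of about η); noisy, sign-flipping
gradients give a ratio below 1 (a smaller step).
Normalising by √v̂ makes Adam's step size nearly invariant to gradient scale across parameters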
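The update rules above translate almost line for line into code. The following is a scalar sketch using the default hyperparameters listed earlier; the helper names `adam_step` and `total_move`, and the constant test gradients 100 and 0.01, are assumptions made for illustration.

```python
import math

def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # 1st moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g    # 2nd moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias-corrected 1st moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected 2nd moment
    w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def total_move(grad, steps=100):
    """Run Adam from w = 0 with a constant gradient and return the final weight."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        w, m, v = adam_step(w, grad, m, v, t)
    return w

big = total_move(100.0)    # gradient 10,000x larger than the one below
small = total_move(0.01)
print(big, small)
```

Both weights end near −0.1, that is 100 steps of about η = 0.001 each, even though the gradients differ by four orders of magnitude. Plain SGD with the same η would have moved them by −10 and −0.001 respectively.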
Why Adam usually wins
- Per-parameter adaptive learning rate: each weight gets its own effective η based on its gradient history. A weight that rarely updates gets a large step when it finally fires; a weight with huge gradients gets scaled down.
- Handles sparse gradients: important for embedding tables, for example in NLP, where most word embeddings see zero gradient on most steps.
- Less sensitive to η choice: Adam's normalisation means the same η=0.001 often works across very different architectures without tuning.
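- Bias correction in early steps: m and v start at zero, so the raw estimates are biased toward zero. With the defaults, v is shrunk far more than m, so an uncorrected first step would be about (1−β₁)/√(1−β₂) ≈ 3.2× larger than η; the correction keeps it near η while the moment estimates warm up.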
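The bias-correction point can be checked by computing the size of the very first update with and without the correction. This is a small sketch using the default hyperparameters; `first_step_size` and its `bias_correction` flag are hypothetical names introduced here, not part of any library.

```python
import math

def first_step_size(g, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
                    bias_correction=True):
    """Magnitude of Adam's update at t = 1, starting from m = v = 0."""
    m = (1 - beta1) * g        # m after one update from zero
    v = (1 - beta2) * g * g    # v after one update from zero
    if bias_correction:
        m = m / (1 - beta1)    # at t = 1 the correction factor is 1 - beta1
        v = v / (1 - beta2)    # at t = 1 the correction factor is 1 - beta2
    return eta * abs(m) / (math.sqrt(v) + eps)

corrected = first_step_size(1.0)
uncorrected = first_step_size(1.0, bias_correction=False)
print("corrected:  ", corrected)
print("uncorrected:", uncorrected)
print("ratio:      ", uncorrected / corrected)
```

With these defaults the corrected first step is about η, and the uncorrected one is about 3.2 times larger. The gap widens as β₂ moves closer to 1.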