Learn Optimizers (SGD, Adam)

An optimizer decides how to update a model's weights after backprop computes the gradients. The simplest rule, subtracting a fixed fraction of the gradient, works, but often slowly and inconsistently. Better optimizers add momentum, adapt the step size per parameter, or both. The choice of optimizer can decide whether a model trains in hours or in days.

SGD — Stochastic Gradient Descent

w ← w − η · ∇L(w)

η  = learning rate (fixed)
∇L = gradient of loss w.r.t. weights

Problem: uses the same η for every parameter, every step.
In elongated loss landscapes (common in deep networks), SGD oscillates
between the steep walls while creeping along the shallow valley.

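Example with η = 0.25 in a 10:1 aspect bowl:
  Steep axis: overshoots repeatedly (zigzag), which happens whenever η × curvature exceeds 1
  Shallow axis: tiny steps toward minimum, because the gradient there is small

Same learning rate everywhere: efficient only when curvature is similar in every direction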
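The zigzag can be reproduced with a minimal NumPy sketch. The lesson does not give the curvatures of its bowl, so the values 7.0 and 0.7 below are assumptions chosen only to produce a 10:1 ratio with η = 0.25; sgd_path is a hypothetical helper name.

```python
import numpy as np

def sgd_path(w0, curvatures, lr, steps):
    """Plain gradient descent on L(w) = 0.5 * sum(curvatures * w**2)."""
    w = np.asarray(w0, dtype=float)
    path = [w.copy()]
    for _ in range(steps):
        grad = curvatures * w      # gradient of the quadratic bowl
        w = w - lr * grad          # w <- w - eta * grad
        path.append(w.copy())
    return np.array(path)

# Assumed curvatures: 7.0 (steep axis) and 0.7 (shallow axis), a 10:1 bowl.
curvatures = np.array([7.0, 0.7])
path = sgd_path([1.0, 1.0], curvatures, lr=0.25, steps=10)
steep, shallow = path[:, 0], path[:, 1]

# Steep axis: each step multiplies the coordinate by 1 - 0.25*7 = -0.75,
# so the sign flips every step. Shallow axis: the factor is 0.825, so the
# coordinate shrinks slowly without ever changing sign.
```

Under these assumptions, steep alternates in sign on every step while shallow decreases monotonically, which is the pattern described above.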

SGD + Momentum

v ← β·v − η·∇L
w ← w + v

β = momentum coefficient (typically 0.9)
v = velocity: an exponentially decaying sum of past gradient steps (−η·∇L)

Effect: consistent directions accumulate speed (β=0.9 looks back ~10 steps).
Oscillating directions cancel out → smoother path.

With β = 0.9:  effective lr ≈ η / (1−β) = 10η, i.e. 10× amplification on consistent axes

β controls memory — higher β = more history, more inertia
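Both effects can be checked numerically with a short sketch. The gradient sequences and the value η = 0.01 are illustrative assumptions, and final_velocity is a hypothetical helper that applies the velocity rule above to a scalar gradient.

```python
def final_velocity(grads, lr, beta):
    """Apply v <- beta*v - lr*g for each gradient g and return the last v."""
    v = 0.0
    for g in grads:
        v = beta * v - lr * g
    return v

lr, beta = 0.01, 0.9                              # illustrative values
consistent = [1.0] * 200                          # gradient always points the same way
alternating = [(-1.0) ** t for t in range(200)]   # gradient flips sign every step

# Velocity magnitude relative to a single plain SGD step (lr * |g|).
amp_consistent = abs(final_velocity(consistent, lr, beta)) / lr
amp_alternating = abs(final_velocity(alternating, lr, beta)) / lr
```

With a constant gradient the ratio approaches 1/(1−β) = 10; with an alternating gradient it settles near 1/(1+β) ≈ 0.53, so the oscillating direction is damped rather than amplified.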

Adam — Adaptive Moment Estimation

g = ∇L(w)                    (current gradient; g² is element-wise)
m ← β₁·m + (1−β₁)·g       (1st moment: running mean of gradients)
v ← β₂·v + (1−β₂)·g²      (2nd moment: running mean of squared gradients)

m̂ = m / (1−β₁ᵗ)            (bias-corrected)
v̂ = v / (1−β₂ᵗ)            (bias-corrected)

w ← w − η · m̂ / (√v̂ + ε)

Default: β₁=0.9, β₂=0.999, ε=1e-8, η=0.001

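Key insight: m̂/√v̂ divides the average gradient by the root mean square of past gradients.
Gradient magnitude cancels out: large gradients are scaled down and small gradients are scaled up.
Parameters with consistent gradients take steps close to η; parameters whose gradients keep flipping sign take smaller steps, because m̂ averages toward zero while √v̂ does not.

Normalising by √v̂ makes Adam's step size nearly invariant to gradient scale across parameters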
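The update rule above can be written as a minimal NumPy sketch. adam_step is a hypothetical helper, not a library API, and the two gradient values are assumptions picked to differ by a factor of one million.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following the rule above; t counts steps from 1."""
    m = beta1 * m + (1 - beta1) * g          # 1st moment
    v = beta2 * v + (1 - beta2) * g**2       # 2nd moment (element-wise square)
    m_hat = m / (1 - beta1**t)               # bias-corrected 1st moment
    v_hat = v / (1 - beta2**t)               # bias-corrected 2nd moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
g = np.array([1000.0, 0.001])   # assumed constant gradients, 1e6 apart in scale

for t in range(1, 6):           # five steps with the same gradient
    w, m, v = adam_step(w, g, m, v, t)
```

After five steps both weights have moved by roughly 5η = 0.005, even though their gradients differ by six orders of magnitude. Plain SGD with the same η would have moved the first weight a million times farther than the second.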

Why Adam usually wins

Per-parameter adaptive learning rate: each weight gets its own effective η based on its gradient history. A weight that rarely updates gets a large step when it finally fires; a weight with huge gradients gets scaled down.
Handles sparse gradients: important for embedding tables, as in NLP, where most word embeddings see zero gradient on most steps.
Less sensitive to η choice: Adam's normalisation means the same η=0.001 often works across very different architectures without tuning.
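Bias correction: m and v start at zero, so the early estimates are biased toward zero. Without the correction, the first step (t=1) with the default β₁ and β₂ would be about 3.2η (0.1/√0.001) rather than η.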
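The effect of bias correction on the very first update can be checked with a small sketch. first_step_size is a hypothetical helper and the gradient value 2.0 is an arbitrary assumption; the hyperparameters are the defaults listed earlier.

```python
import math

def first_step_size(g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, correct=True):
    """Magnitude of Adam's first update (t = 1), with or without bias correction."""
    m = (1 - beta1) * g          # m starts at 0, so only the new term remains
    v = (1 - beta2) * g * g      # v starts at 0 as well
    if correct:
        m = m / (1 - beta1)      # divide by 1 - beta1**1
        v = v / (1 - beta2)      # divide by 1 - beta2**1
    return abs(lr * m / (math.sqrt(v) + eps))

with_correction = first_step_size(g=2.0)
without_correction = first_step_size(g=2.0, correct=False)
ratio = without_correction / with_correction
```

With correction the first step is very close to η. Without it the step is larger by a factor of (1−β₁)/√(1−β₂) = √10 ≈ 3.16, independent of the gradient value chosen.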