Learn Backpropagation

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Backpropagation is how a neural network learns. After a forward pass produces a prediction, backprop computes how much each weight contributed to the error — then adjusts every weight to reduce that error. It's the chain rule of calculus applied layer by layer, moving backwards from output to input.

The chain rule

For any composed function y = f(g(x)):
  dy/dx = (dy/dg) × (dg/dx)

In a network with layers L₁ → L₂ → L₃:
  ∂L/∂w₁ = (∂L/∂a₃) × (∂a₃/∂a₂) × (∂a₂/∂w₁)

Each arrow is a multiplication — gradients chain backward layer by layer.

Every weight update is a product of all downstream derivatives

Step 1 — output gradient

For MSE loss and sigmoid output:
  L = ½(ŷ − y)²
  ∂L/∂z₂ = (ŷ − y) · ŷ · (1 − ŷ)

This is the error signal that enters the network from the right.
Large when the model is wrong and confident; small when nearly correct.

Output gradient: how wrong × how uncertain

Step 2 — hidden layer gradient

∂L/∂a₁[j] = Σᵢ  W₂[i][j] · ∂L/∂z₂[i]   (sum from all output neurons)
∂L/∂z₁[j] = ∂L/∂a₁[j] · ReLU'(z₁[j])

ReLU'(z) = 1 if z > 0,  0 if z ≤ 0

Key: if a hidden neuron was dead (z ≤ 0), its gradient is exactly 0.
That neuron's weights receive NO update — this is the dying ReLU problem.

Dead ReLU neurons are frozen — they never recover

Step 3 — weight update

∂L/∂W₂[i][j] = ∂L/∂z₂[i] · a₁[j]    (output weight gradient)
∂L/∂W₁[j][k] = ∂L/∂z₁[j] · x[k]     (hidden weight gradient)

Update rule (gradient descent):
  W  ←  W  −  η · ∂L/∂W

η is the learning rate. Too large: weights overshoot and diverge.
Too small: training takes forever.

η controls step size — the most sensitive hyperparameter

Why gradients vanish in deep networks

Each backward step multiplies by a weight matrix and an activation derivative. If these products are consistently less than 1, the gradient signal shrinks exponentially as it travels back through layers. By layer 10, the gradient may be 10⁻⁶ of its original size — essentially zero. This is the vanishing gradient problem, and it's why training very deep networks was nearly impossible before residual connections (ResNets) and better activation functions.