Deep Learning - Advanced - 18 min

Learn Backpropagation


Last updated: 2026-05-13.

Backpropagation is how a neural network learns. After a forward pass produces a prediction, backprop computes how much each weight contributed to the error; gradient descent then adjusts every weight to reduce that error. It's the chain rule of calculus applied layer by layer, moving backwards from output to input.

The chain rule

For any composed function y = f(g(x)):
  dy/dx = (dy/dg) × (dg/dx)

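In a network with layers 1 → 2 → 3, where aₖ is the output of layer k, w₁ is a weight in layer 1, and L is the loss:
  ∂L/∂w₁ = (∂L/∂a₃) × (∂a₃/∂a₂) × (∂a₂/∂a₁) × (∂a₁/∂w₁)

Each arrow contributes one factor to the product, so gradients chain backward layer by layer.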

Every weight update is a product of all downstream derivatives
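
The chain rule is easy to check numerically. The sketch below is illustrative only: the inner function g(x) = 3x + 1 and the outer function f(u) = tanh(u) are arbitrary choices made for this example, not part of the lesson. It compares the chain-rule derivative with a central finite difference.

```python
import math

def g(x):
    # Inner function (an arbitrary choice for this example).
    return 3.0 * x + 1.0

def f(u):
    # Outer function (also an arbitrary choice).
    return math.tanh(u)

def dy_dx_chain(x):
    # dy/dx = (dy/dg) * (dg/dx)
    u = g(x)
    dy_dg = 1.0 - math.tanh(u) ** 2   # derivative of tanh at u
    dg_dx = 3.0                       # derivative of 3x + 1
    return dy_dg * dg_dx

def dy_dx_numeric(x, eps=1e-6):
    # Central finite difference on the composed function f(g(x)).
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x0 = 0.2
analytic = dy_dx_chain(x0)
numeric = dy_dx_numeric(x0)
print("chain rule:", analytic)
print("finite difference:", numeric)
```

The two values should agree to several decimal places. The same comparison, applied to every weight, is the standard "gradient check" used to debug backprop code.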

Step 1 — output gradient

For MSE loss and a sigmoid output ŷ = σ(z₂), where z₂ is the output neuron's pre-activation:
  L = ½(ŷ − y)²
  ∂L/∂z₂ = (ŷ − y) · ŷ · (1 − ŷ)

This is the error signal that enters the network from the right.
Large when the model is wrong and uncertain (ŷ near 0.5); small when nearly correct, and also when the sigmoid saturates (ŷ near 0 or 1), even if the prediction is wrong.

Output gradient: how wrong × how uncertain
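
To see the "how wrong × how uncertain" factorization in numbers, the sketch below evaluates ∂L/∂z₂ at three pre-activations for a target of y = 1. The function names and the sample z₂ values are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_gradient(z2, y):
    """dL/dz2 for L = 0.5 * (y_hat - y)**2 with y_hat = sigmoid(z2)."""
    y_hat = sigmoid(z2)
    how_wrong = y_hat - y                  # error term
    how_uncertain = y_hat * (1.0 - y_hat)  # sigmoid derivative, peaks at y_hat = 0.5
    return how_wrong * how_uncertain

# Target is y = 1 in every case; the z2 values are illustrative.
for z2 in (-6.0, 0.0, 6.0):
    print(z2, sigmoid(z2), output_gradient(z2, 1.0))
```

At z₂ = 0 the prediction is 0.5 and the gradient is −0.125. At z₂ = −6 the prediction is far from the target, yet the gradient is much smaller in magnitude because the sigmoid is saturated.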

Step 2 — hidden layer gradient

∂L/∂a₁[j] = Σᵢ  W₂[i][j] · ∂L/∂z₂[i]   (sum over all output neurons i)
∂L/∂z₁[j] = ∂L/∂a₁[j] · ReLU'(z₁[j])

ReLU'(z) = 1 if z > 0,  0 if z ≤ 0

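Key: if a hidden neuron is inactive for this input (z ≤ 0), its gradient is exactly 0.
That neuron's weights receive NO update from this example. A neuron that is inactive for every training input stops learning entirely; this is the dying ReLU problem.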

Dead ReLU neurons are frozen: with zero gradient on every input, they rarely recover
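
The two hidden-layer formulas can be sketched in plain Python. The layer sizes and all numeric values below are made up for illustration; W2[i][j] follows the lesson's indexing (hidden neuron j to output neuron i).

```python
def relu_grad(z):
    # ReLU'(z) = 1 if z > 0, else 0
    return 1.0 if z > 0 else 0.0

def hidden_gradients(W2, dL_dz2, z1):
    """Return (dL/da1, dL/dz1) for a hidden layer feeding an output layer."""
    n_out, n_hidden = len(W2), len(z1)
    # dL/da1[j] = sum_i W2[i][j] * dL/dz2[i]
    dL_da1 = [sum(W2[i][j] * dL_dz2[i] for i in range(n_out))
              for j in range(n_hidden)]
    # dL/dz1[j] = dL/da1[j] * ReLU'(z1[j])
    dL_dz1 = [dL_da1[j] * relu_grad(z1[j]) for j in range(n_hidden)]
    return dL_da1, dL_dz1

# Hypothetical values: 3 hidden neurons, 2 output neurons.
W2 = [[0.5, -1.0, 2.0],
      [1.5, 0.25, -0.5]]
dL_dz2 = [0.2, -0.4]
z1 = [0.7, -0.3, 1.2]   # the middle neuron is inactive (z <= 0)

dL_da1, dL_dz1 = hidden_gradients(W2, dL_dz2, z1)
print("dL/da1:", dL_da1)
print("dL/dz1:", dL_dz1)
```

The middle neuron receives a nonzero ∂L/∂a₁, but ReLU' zeroes it out, so nothing flows further back through that neuron for this input.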

Step 3 — weight update

∂L/∂W₂[i][j] = ∂L/∂z₂[i] · a₁[j]    (output weight gradient)
∂L/∂W₁[j][k] = ∂L/∂z₁[j] · x[k]     (hidden weight gradient)
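
Putting steps 1 to 3 together, the sketch below runs a full backward pass on a tiny network and checks every weight gradient against a finite difference. The network shape (2 inputs, 2 ReLU hidden neurons, 1 sigmoid output, no biases, matching the formulas above) and all numeric values are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, W2, x):
    z1 = [sum(W1[j][k] * x[k] for k in range(len(x))) for j in range(len(W1))]
    a1 = [max(0.0, z) for z in z1]                      # ReLU
    z2 = [sum(W2[i][j] * a1[j] for j in range(len(a1))) for i in range(len(W2))]
    y_hat = [sigmoid(z) for z in z2]
    return z1, a1, y_hat

def loss(W1, W2, x, y):
    _, _, y_hat = forward(W1, W2, x)
    return sum(0.5 * (p - t) ** 2 for p, t in zip(y_hat, y))

def backward(W1, W2, x, y):
    z1, a1, y_hat = forward(W1, W2, x)
    # Step 1: output gradient
    dz2 = [(p - t) * p * (1.0 - p) for p, t in zip(y_hat, y)]
    # Step 2: hidden layer gradient
    da1 = [sum(W2[i][j] * dz2[i] for i in range(len(W2))) for j in range(len(a1))]
    dz1 = [da1[j] * (1.0 if z1[j] > 0 else 0.0) for j in range(len(z1))]
    # Step 3: weight gradients
    dW2 = [[dz2[i] * a1[j] for j in range(len(a1))] for i in range(len(W2))]
    dW1 = [[dz1[j] * x[k] for k in range(len(x))] for j in range(len(W1))]
    return dW1, dW2

def numeric_grad(W1, W2, x, y, which, r, c, eps=1e-6):
    """Central finite difference for one entry of W1 (which=1) or W2 (which=2)."""
    W = W1 if which == 1 else W2
    old = W[r][c]
    W[r][c] = old + eps
    up = loss(W1, W2, x, y)
    W[r][c] = old - eps
    down = loss(W1, W2, x, y)
    W[r][c] = old                                       # restore the weight
    return (up - down) / (2.0 * eps)

# Hypothetical weights, input, and target.
W1 = [[0.5, -0.25], [0.3, 0.4]]
W2 = [[0.7, -1.2]]
x = [1.0, -2.0]
y = [1.0]

dW1, dW2 = backward(W1, W2, x, y)
max_err = 0.0
for which, W, dW in ((1, W1, dW1), (2, W2, dW2)):
    for r in range(len(W)):
        for c in range(len(W[r])):
            gap = abs(dW[r][c] - numeric_grad(W1, W2, x, y, which, r, c))
            max_err = max(max_err, gap)
print("largest analytic-vs-numeric gap:", max_err)
```

With these values the second hidden neuron has z₁ = −0.5, so its row of ∂L/∂W₁ is zero, and the finite-difference check agrees with the analytic gradients to within numerical noise.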

Update rule (gradient descent):
  W  ←  W  −  η · ∂L/∂W

η is the learning rate. Too large: weights overshoot and diverge.
Too small: training takes forever.

η controls step size: often the most sensitive hyperparameter
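
The effect of η is easiest to see on a one-dimensional problem. The sketch below applies the update rule to a toy loss L(w) = w², whose gradient is 2w; the loss function, the starting point, and the three learning rates are assumptions picked for illustration.

```python
def run_gradient_descent(eta, steps=20, w0=1.0):
    """Minimise the toy loss L(w) = w**2 using W <- W - eta * dL/dW."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # dL/dw for L(w) = w**2
        w = w - eta * grad    # update rule
    return w

# Too small, reasonable, and too large (illustrative values).
for eta in (0.001, 0.1, 1.1):
    print(eta, run_gradient_descent(eta))
```

Each step multiplies w by (1 − 2η). With η = 0.1 the weight shrinks toward the minimum at 0; with η = 0.001 it barely moves in 20 steps; with η = 1.1 the factor is −1.2, so the weight flips sign and grows every step.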

Why gradients vanish in deep networks

Each backward step multiplies by a weight matrix and an activation derivative. If these products are consistently less than 1 in magnitude, the gradient signal shrinks exponentially as it travels back through layers. The sigmoid derivative, for example, is at most 0.25, so ten layers can scale the gradient by 0.25¹⁰ ≈ 10⁻⁶ of its original size, which is essentially zero. This is the vanishing gradient problem, and it's a major reason training very deep networks was so difficult before residual connections (ResNets) and better activation functions such as ReLU.
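
A rough sketch of the shrinkage, under a strong simplifying assumption: each layer is reduced to a single scalar weight and a sigmoid activation evaluated at z = 0, where its derivative takes its maximum value of 0.25. Real networks use weight matrices, so this is only an illustration of the exponential trend.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(n_layers, weight=1.0, z=0.0):
    """Product of per-layer factors weight * sigmoid'(z) over n_layers.

    Scalar stand-in for the matrix products in a real backward pass.
    """
    scale = 1.0
    for _ in range(n_layers):
        scale *= weight * sigmoid_prime(z)
    return scale

for n in (1, 5, 10):
    print(n, gradient_scale(n))
```

With weight = 1 the scale after 10 layers is 0.25¹⁰, just under 10⁻⁶. Per-layer factors larger than 1 in magnitude would instead make the product grow, which is the exploding-gradient counterpart.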

Practice questions

  1. A hidden neuron has z₁ = −0.3. What gradient does it pass back during backpropagation with ReLU?
  2. Why does backprop use the chain rule rather than computing ∂L/∂w directly?
  3. What is the dying ReLU problem?
  4. If the learning rate η is too large, what happens during weight updates?
