Math for ML - Intermediate - 10 min

Learn Chain Rule

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Last updated: 2026-05-13.

The chain rule is the mathematical idea at the core of training neural networks. It answers: how does changing the very first weight in a deep network affect the final loss when there are 50 layers of functions in between? The answer: multiply the local derivatives of every layer along the path. Backpropagation is this rule applied efficiently, working backward from the loss and reusing the intermediate products.

The formula

Two functions:   d/dx f(g(x))       = f′(g(x)) · g′(x)
Three functions: d/dx f(g(h(x)))   = f′(g(h(x))) · g′(h(x)) · h′(x)

Rule: multiply the local derivatives of each function in the chain, with each derivative evaluated at the input that function actually receives.

Each factor answers the question "how sensitive is this layer's output to its input?"
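As a rough numerical check of the three-function formula, the sketch below multiplies the local derivatives and compares the result with a finite-difference estimate. The functions h(x) = 2x + 1, g(u) = u², f(v) = exp(-v), the evaluation point 0.3, and the helper names are assumptions chosen for illustration; they are not part of the lesson.

```python
import math

# Illustrative functions (assumed for this sketch): y = f(g(h(x)))
def h(x):
    return 2 * x + 1

def g(u):
    return u ** 2

def f(v):
    return math.exp(-v)

def chain_rule_derivative(x):
    """Multiply the local derivatives, each evaluated at its own input."""
    u = h(x)
    v = g(u)
    d_f = -math.exp(-v)  # f'(v) at v = g(h(x))
    d_g = 2 * u          # g'(u) at u = h(x)
    d_h = 2              # h'(x)
    return d_f * d_g * d_h

def finite_difference(x, step=1e-6):
    """Central difference on the full composition, as an independent check."""
    return (f(g(h(x + step))) - f(g(h(x - step)))) / (2 * step)

x0 = 0.3  # arbitrary evaluation point
analytic = chain_rule_derivative(x0)
numeric = finite_difference(x0)
print(f"chain rule: {analytic:.6f}, finite difference: {numeric:.6f}")
```

The two numbers should agree to several decimal places. The finite-difference version treats the composition as a black box, while the chain-rule version only ever needs each function's own derivative.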

Vanishing & exploding gradients

If each local derivative is slightly less than 1 in magnitude, multiplying 100 of them gives a number near zero: the gradient vanishes. If each is slightly greater than 1 in magnitude, the product explodes. This is why deep networks were hard to train before ReLU, residual connections, and careful initialisation, all of which help keep the chain product from collapsing or blowing up.
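To make the arithmetic concrete, here is a small sketch that multiplies 100 identical local derivatives. The values 0.9 and 1.1 are assumed stand-ins for "slightly less than 1" and "slightly greater than 1"; in a real network every layer has a different derivative, so treat this only as an order-of-magnitude illustration.

```python
import math

DEPTH = 100  # number of layers in the chain, as in the text above

def chain_product(local_derivative, depth=DEPTH):
    """Product of `depth` identical local derivatives."""
    return math.prod([local_derivative] * depth)

vanishing = chain_product(0.9)  # each factor slightly below 1 (assumed value)
exploding = chain_product(1.1)  # each factor slightly above 1 (assumed value)

print(f"0.9 multiplied {DEPTH} times: {vanishing:.3e}")
print(f"1.1 multiplied {DEPTH} times: {exploding:.3e}")
```

Even though 0.9 and 1.1 differ from 1 by the same amount, after 100 layers the two products differ by roughly nine orders of magnitude.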

Practice questions

  1. What is d/dx of sin(x²) using the chain rule?
  2. In backpropagation, what does the chain rule compute?
  3. Why do vanishing gradients make training deep networks hard?
  4. A 3-layer network has local derivatives [0.8, 1.2, 0.9]. What is the gradient magnitude reaching layer 1?
