The chain rule is the single mathematical idea that makes neural networks trainable. It answers: how does changing the very first weight in a deep network affect the final loss — when there are 50 layers of functions in between? The answer: multiply the local derivatives all the way through. That's backpropagation.
The formula
Two functions: d/dx f(g(x)) = f′(g(x)) · g′(x)
Three functions: d/dx f(g(h(x))) = f′(g(h(x))) · g′(h(x)) · h′(x)
Rule: multiply the local derivatives of each function in the chain.Each factor is 'how sensitive is this layer's output to its input?'
Vanishing & exploding gradients
If each local derivative is slightly less than 1, multiplying 100 of them gives a number near zero — the gradient vanishes. If each is slightly greater than 1, the product explodes. This is why deep network training was hard before ReLU, residual connections, and careful initialisation — they all help keep the chain product from collapsing.