Stack two linear layers and you get... one linear layer. No matter how deep you go, pure matrix multiplication collapses to a single affine transformation. Activation functions break this collapse by introducing non-linearity after every neuron — the one ingredient that lets deep networks learn curves, spirals, and complex decision boundaries.
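The collapse can be checked numerically. The sketch below is only an illustration: the weights, biases, and input are made-up values, and the helper names (`matvec`, `matmul`, `add`) are not from any library.

```python
# Two stacked affine layers versus the single layer they collapse to.
# All weights, biases, and inputs are made-up illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]

W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.5, -0.5]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [1.0, 0.0]
x = [3.0, -2.0]

# Layer by layer: y = W2 (W1 x + b1) + b2
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Collapsed into one layer: y = (W2 W1) x + (W2 b1 + b2)
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

# With ReLU between the layers, the single collapsed layer no longer matches.
with_relu = add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

print(two_layers, one_layer, with_relu)
```

The first two results agree for any input, while the third differs as soon as a hidden value goes negative.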
ReLU — Rectified Linear Unit
ReLU(z) = max(0, z)
Derivative:
f′(z) = 1 if z > 0
f′(z) = 0 if z ≤ 0 (strictly, undefined at z = 0; 0 is the usual convention)
Examples:
z = 2.5 → ReLU = 2.5 (passes through unchanged)
z = −1.2 → ReLU = 0 (clipped, neuron silent)
z = 0.0 → ReLU = 0 (exactly on the boundary)
ReLU is piecewise linear and extremely cheap to compute (just a max). It is the default for hidden layers.
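A minimal sketch of ReLU and its derivative, run on the three example inputs above. The function names are illustrative, and the gradient at exactly z = 0 is set to 0 by convention.

```python
def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, otherwise 0 (0 at z == 0 by convention)."""
    return 1.0 if z > 0 else 0.0

for z in (2.5, -1.2, 0.0):
    print(f"z = {z:5.1f}  ReLU = {relu(z):4.1f}  gradient = {relu_grad(z):3.1f}")
```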
Sigmoid
σ(z) = 1 / (1 + e⁻ᶻ) → output ∈ (0, 1)
Derivative:
σ′(z) = σ(z) · (1 − σ(z)), max value = 0.25 at z = 0
Saturation examples:
z = 6: σ = 0.9975, σ′ ≈ 0.0025 (gradient ≈ 0)
z = −6: σ = 0.0025, σ′ ≈ 0.0025 (same: both ends saturate)
Use sigmoid in output layers for binary classification. Avoid it in hidden layers, where gradients vanish.
Sigmoid's gradient peaks at 0.25 (when z = 0) and shrinks rapidly toward zero as |z| grows. Backpropagation multiplies one such factor per layer, so in a 10-layer network the sigmoid derivatives alone contribute at most 0.25¹⁰ ≈ 0.000001 (ignoring the weight terms), and the early layers receive almost no signal. This is the vanishing gradient problem, a major obstacle to training deep networks before ReLU came into wide use around 2012.
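The saturation figures and the 10-layer product can be reproduced with a few lines. This is a sketch using only the standard library; the branch on the sign of z is an added numerical-stability detail, not part of the definition.

```python
import math

def sigmoid(z):
    """Sigmoid, written so that exp() never overflows for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_grad(z):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 6.0, -6.0):
    print(f"z = {z:5.1f}  sigmoid = {sigmoid(z):.4f}  gradient = {sigmoid_grad(z):.4f}")

# Best case for a 10-layer chain: every sigmoid sits at its peak gradient of 0.25.
best_case = sigmoid_grad(0.0) ** 10
print(f"0.25 ** 10 = {best_case:.2e}")
```

Real networks do worse than this best case, because most pre-activations are not exactly zero.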
Tanh
tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ) → output ∈ (−1, 1)
Derivative:
f′(z) = 1 − tanh(z)², max value = 1.0 at z = 0
Vs sigmoid:
tanh is zero-centred (outputs average to 0)
sigmoid is not (outputs always positive, average > 0)
Tanh is usually a better choice than sigmoid for hidden layers: same S-shape, but zero-centred.
Zero-centred outputs mean the average activation is near 0 rather than 0.5. This makes gradient updates more symmetric during backpropagation and tends to speed up convergence. Tanh still saturates at large |z|, so it still suffers from vanishing gradients, but less severely than sigmoid because its gradient peaks at 1.0 rather than 0.25. Common in RNN hidden states and in LSTM candidate and output values (the LSTM gates themselves use sigmoid).
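The zero-centring claim is easy to check on a symmetric set of inputs. The sketch below uses made-up sample points and also checks the identity tanh(z) = 2σ(2z) − 1, which is why the two curves share the same S-shape.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

zs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # illustrative inputs, symmetric around 0

mean_tanh = sum(math.tanh(z) for z in zs) / len(zs)
mean_sigmoid = sum(sigmoid(z) for z in zs) / len(zs)
print(f"mean tanh output    = {mean_tanh:.3f}")     # centred on 0
print(f"mean sigmoid output = {mean_sigmoid:.3f}")  # centred on 0.5

# tanh is a rescaled, shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
max_gap = max(abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) for z in zs)
print(f"largest gap from the identity = {max_gap:.1e}")

# Peak gradient is 1.0 at z = 0, but tanh still saturates for large |z|.
print(tanh_grad(0.0), tanh_grad(6.0))
```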
Leaky ReLU & GELU
Leaky ReLU fixes dying neurons (ReLU units whose input stays negative, so both output and gradient are stuck at zero) by allowing a small negative slope for z < 0, typically 0.01z, so the gradient is never exactly zero. GELU (Gaussian Error Linear Unit) is a smooth approximation of ReLU: instead of a hard cutoff at zero, it weights each input by its percentile in a standard Gaussian, GELU(z) = z · Φ(z), where Φ is the standard normal CDF. The result is a smooth, differentiable curve used in many Transformer architectures, including BERT and GPT.
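Both functions fit in a few lines. In this sketch the negative slope `alpha` is a parameter, and its default of 0.01 is an assumed, commonly used value. GELU is computed in its exact form z · Φ(z), with the standard normal CDF Φ obtained from `math.erf`.

```python
import math

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for z > 0, alpha * z otherwise (alpha is an assumed default)."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Gradient is 1 for z > 0 and alpha otherwise, so it is never exactly zero."""
    return 1.0 if z > 0 else alpha

def gelu(z):
    """Exact GELU: z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z = {z:5.1f}  leaky = {leaky_relu(z):7.3f}  "
          f"leaky grad = {leaky_relu_grad(z):4.2f}  gelu = {gelu(z):7.4f}")
```

For large positive z GELU approaches z, and for large negative z it approaches 0, matching ReLU away from the origin.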
- Hidden layers: use ReLU (default) or Leaky ReLU (if dying neurons are a problem)
- Binary output (probability): use Sigmoid — output ∈ (0,1) is interpretable as P(class=1)
- Multi-class output (K classes): use Softmax — generalises sigmoid to K probabilities summing to 1
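- RNN / LSTM: use Tanh for hidden-state and candidate values (zero-centred outputs improve gradient flow in recurrent loops); the LSTM gates themselves use Sigmoid
- Transformers (GPT, BERT): use GELU for a smooth gradient everywhere and empirically stronger results; LLaMA uses the related SwiGLU instead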
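Softmax appears in the list above but is not defined elsewhere in this section, so here is a minimal sketch. The logits are made-up values, and subtracting the maximum before exponentiating is a standard stability trick that does not change the result. The last lines check that two-class softmax with logits [z, 0] reduces to sigmoid(z).

```python
import math

def softmax(logits):
    """Turn K raw scores into K probabilities that sum to 1."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probs = softmax([2.0, 1.0, 0.1])  # illustrative logits for K = 3 classes
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))

# With K = 2 and logits [z, 0], the first probability equals sigmoid(z).
z = 1.5
two_class = softmax([z, 0.0])
print(two_class[0], sigmoid(z))
```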