Learn Activation Functions


Stack two linear layers and you get... one linear layer. No matter how deep you go, a stack of purely linear (affine) layers collapses to a single affine transformation, because the composition of affine maps is itself affine. Activation functions break this collapse by applying a non-linearity to each neuron's output. That is the one ingredient that lets deep networks learn curves, spirals, and complex decision boundaries.
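
The collapse can be checked numerically. The sketch below is a minimal illustration, assuming two tiny 2x2 layers with made-up weights (W1, b1, W2, b2 and the helper names are hypothetical, not part of the lesson). It folds the two layers into one and then shows that inserting a ReLU between them breaks the equivalence.

```python
# Two affine layers with illustrative (made-up) weights and biases.
W1 = [[1.0, 2.0], [0.5, -1.0]]
b1 = [0.5, -0.5]
W2 = [[2.0, 0.0], [1.0, 3.0]]
b2 = [1.0, 0.0]

def matvec(W, x):
    """Matrix-vector product W @ x for plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A @ B for plain Python lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise vector sum."""
    return [a + b for a, b in zip(u, v)]

def two_layers(x):
    """Layer 2 applied to layer 1, with no activation in between."""
    h = add(matvec(W1, x), b1)
    return add(matvec(W2, h), b2)

# Fold both layers into one: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    """The single affine layer equivalent to two_layers."""
    return add(matvec(W, x), b)

def two_layers_relu(x):
    """Same two layers, but with ReLU applied to the hidden values."""
    h = [max(0.0, v) for v in add(matvec(W1, x), b1)]
    return add(matvec(W2, h), b2)

x = [1.0, -2.0]
print(two_layers(x), one_layer(x), two_layers_relu(x))
```

For this input the first two results agree, while the ReLU version differs because one hidden value is negative and gets clipped.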

ReLU — Rectified Linear Unit

ReLU(z) = max(0, z)

Derivative:
  f′(z) = 1   if z > 0
  f′(z) = 0   if z ≤ 0   (undefined at exactly z = 0; 0 is the usual convention)

Examples:
  z =  2.5 → ReLU = 2.5   (passes through unchanged)
  z = −1.2 → ReLU = 0     (clipped, neuron silent)
  z =  0.0 → ReLU = 0     (exactly on the boundary)

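ReLU is piecewise linear and extremely cheap to compute (just a max). Its gradient is 1 for every positive input, so it does not saturate on that side, which makes it the default for hidden layers. The cost: a neuron whose input stays negative outputs 0 with gradient 0 and stops learning (a "dying" neuron).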
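
As a minimal sketch, the function and its derivative fit in two one-liners. The names relu and relu_grad are illustrative, and the value 0 for the gradient at z = 0 is a convention assumed here.

```python
def relu(z: float) -> float:
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)

def relu_grad(z: float) -> float:
    """Derivative of ReLU; 0 is used at z = 0 by convention (an assumption)."""
    return 1.0 if z > 0 else 0.0

# The three example inputs from the text.
for z in (2.5, -1.2, 0.0):
    print(f"z = {z:5.1f}   relu = {relu(z):.1f}   grad = {relu_grad(z):.1f}")
```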

Sigmoid

σ(z) = 1 / (1 + e⁻ᶻ)     → output ∈ (0, 1)

Derivative:
  σ′(z) = σ(z) · (1 − σ(z))   max value = 0.25 at z = 0

Saturation examples:
  z =  6:  σ = 0.9975,  σ′ ≈ 0.0025  (gradient ≈ 0)
  z = −6:  σ = 0.0025,  σ′ ≈ 0.0025  (same — both ends saturate)
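
The saturation figures above can be reproduced with a short sketch. The branch on the sign of z is an added precaution (an assumption, not something the lesson specifies) that avoids overflow in exp for large negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Sigmoid, written so that exp never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z: float) -> float:
    """Derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Peak at z = 0, then both tails saturate symmetrically.
for z in (0.0, 6.0, -6.0):
    print(f"z = {z:5.1f}   sigma = {sigmoid(z):.4f}   grad = {sigmoid_grad(z):.4f}")
```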

Use sigmoid in output layers for binary classification. Avoid in hidden layers — gradients vanish.

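Sigmoid's gradient peaks at 0.25 (when z = 0) and shrinks rapidly toward zero as |z| grows. In a 10-layer network, backpropagation multiplies one sigmoid derivative per layer. Even in the best case of 0.25 at every layer, the product is 0.25¹⁰ ≈ 0.000001, so the early layers receive almost no signal (this ignores the weight terms, which also enter the product). This is the vanishing gradient problem, a major obstacle to training deep networks before ReLU became the common choice in the early 2010s.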
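
The arithmetic can be sketched as a running product of per-layer derivatives. This is a simplified model: it assumes one activation derivative per layer and ignores the weight matrices, and the function name chained_gradient is hypothetical.

```python
import math

def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def chained_gradient(local_grads):
    """Multiply per-layer activation derivatives, as the chain rule does."""
    product = 1.0
    for g in local_grads:
        product *= g
    return product

depth = 10
best_case = chained_gradient([sigmoid_grad(0.0)] * depth)   # every layer at z = 0
saturated = chained_gradient([sigmoid_grad(3.0)] * depth)   # every layer at z = 3
relu_path = chained_gradient([1.0] * depth)                 # active ReLU units

print(f"sigmoid, best case: {best_case:.3e}")
print(f"sigmoid, z = 3:     {saturated:.3e}")
print(f"ReLU, active path:  {relu_path:.1f}")
```

The ReLU line shows why the problem largely disappears on active paths: each factor is exactly 1, so depth does not shrink the product.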

Tanh

tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)    → output ∈ (−1, 1)

Derivative:
  f′(z) = 1 − tanh(z)²    max value = 1.0 at z = 0

Vs sigmoid:
  tanh is zero-centred (outputs average to 0)
  sigmoid is not (outputs always positive, average > 0)

Tanh is generally preferred over sigmoid for hidden layers: it has the same S shape (tanh(z) = 2σ(2z) − 1) but is zero-centred and has a larger peak gradient (1.0 vs 0.25).

Zero-centred outputs mean the average activation is near 0 rather than 0.5. This makes gradient updates more symmetric during backpropagation and tends to speed up convergence. Tanh still saturates at large |z|, so it still suffers from vanishing gradients, but less severely than sigmoid. It is common in RNNs and in LSTMs, where tanh produces the candidate cell state and the hidden-state output while the gates themselves use sigmoid.
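
A small sketch makes the zero-centring claim concrete. It assumes a symmetric set of sample inputs (chosen here purely for illustration) and compares the mean output of tanh and sigmoid, along with tanh's derivative.

```python
import math

def sigmoid(z: float) -> float:
    """Plain sigmoid; fine for the small inputs used here."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh_grad(z: float) -> float:
    """Derivative of tanh: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t

# Illustrative inputs, symmetric around zero.
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]

mean_tanh = sum(math.tanh(z) for z in inputs) / len(inputs)
mean_sigmoid = sum(sigmoid(z) for z in inputs) / len(inputs)

print(f"mean tanh output:    {mean_tanh:.3f}")
print(f"mean sigmoid output: {mean_sigmoid:.3f}")
print(f"tanh'(0) = {tanh_grad(0.0):.2f}, tanh'(3) = {tanh_grad(3.0):.4f}")
```

The last line shows both sides of the trade-off: the peak gradient is four times sigmoid's, yet it still shrinks quickly once |z| grows.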

Leaky ReLU & GELU

Leaky ReLU fixes dying neurons by giving negative inputs a small slope (commonly 0.01z for z < 0, though the value is a tunable hyperparameter), so the gradient is never exactly zero. GELU (Gaussian Error Linear Unit) is a smooth relative of ReLU: instead of a hard cutoff at zero, it multiplies each input by its percentile under a standard Gaussian, GELU(z) = z · Φ(z). The result is a smooth, differentiable curve used in many Transformer architectures, including BERT and GPT.
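
Both functions can be sketched with the standard library alone. The slope parameter alpha and its default of 0.01 are assumptions for illustration. GELU is written in its exact form z · Φ(z) using the error function, alongside the widely used tanh approximation for comparison.

```python
import math

def leaky_relu(z: float, alpha: float = 0.01) -> float:
    """Leaky ReLU; alpha is the slope for negative inputs (0.01 is an assumed default)."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z: float, alpha: float = 0.01) -> float:
    """Derivative of Leaky ReLU: never exactly zero when alpha > 0."""
    return 1.0 if z > 0 else alpha

def gelu(z: float) -> float:
    """Exact GELU: z times the standard Gaussian CDF of z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z: float) -> float:
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z = {z:5.1f}   leaky = {leaky_relu(z):7.3f}   "
          f"gelu = {gelu(z):7.4f}   gelu_tanh = {gelu_tanh(z):7.4f}")
```

Unlike ReLU, GELU dips slightly below zero for small negative inputs before flattening out, which is what gives it a non-zero gradient there.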

Hidden layers: use ReLU (default) or Leaky ReLU (if dying neurons are a problem)
Binary output (probability): use Sigmoid — output ∈ (0,1) is interpretable as P(class=1)
Multi-class output (K classes): use Softmax — generalises sigmoid to K probabilities summing to 1
RNN / LSTM state updates: use Tanh, whose zero-centred outputs improve gradient flow in recurrent loops (LSTM gates themselves use Sigmoid)
Transformers (GPT, BERT): use GELU, which has a smooth gradient everywhere and is empirically strong (LLaMA uses the related SwiGLU instead)
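
The claim that softmax generalises sigmoid can be checked with a short sketch. Subtracting the largest logit before exponentiating is a standard stability trick added here as an assumption, and the example logits are made up.

```python
import math

def softmax(logits):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z: float) -> float:
    """Plain sigmoid for comparison."""
    return 1.0 / (1.0 + math.exp(-z))

# Three-class example with illustrative logits.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))

# Two classes with logits (z, 0): the first probability equals sigmoid(z).
z = 2.0
print(softmax([z, 0.0])[0], sigmoid(z))
```

With two classes and one logit fixed at 0, softmax reduces to e^z / (e^z + 1), which is the sigmoid.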