Deep Learning - Beginner - 10 min

Learn Activation Functions

Last updated: 2026-05-13.

Stack two linear layers and you get... one linear layer. No matter how deep you go, pure matrix multiplication collapses to a single affine transformation. Activation functions break this collapse by introducing non-linearity after every neuron — the one ingredient that lets deep networks learn curves, spirals, and complex decision boundaries.
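The collapse described above can be checked numerically. The sketch below is illustrative only: the weights W1, b1, W2, b2 and the helper names are made-up values for this example, not part of the lesson. It composes two linear layers, builds the single equivalent layer W = W2·W1, b = W2·b1 + b2, and then shows that putting a ReLU in between breaks the equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical weights and biases, chosen only for illustration.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]

def two_linear_layers(x):
    # Layer 2 applied directly to layer 1, with no activation in between.
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The single affine map that two stacked linear layers collapse to.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_affine_layer(x):
    return add(matvec(W, x), b)

def with_relu_between(x):
    # Same weights, but a ReLU after layer 1 makes the map non-linear.
    hidden = [max(0.0, v) for v in add(matvec(W1, x), b1)]
    return add(matvec(W2, hidden), b2)

x = [1.0, 2.0]
print(two_linear_layers(x), one_affine_layer(x), with_relu_between(x))
```

For this input the first two results agree, while the ReLU version differs because one hidden unit is clipped to zero.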

ReLU — Rectified Linear Unit

ReLU(z) = max(0, z)

Derivative:
  f′(z) = 1   if z > 0
  f′(z) = 0   if z ≤ 0   (strictly undefined at z = 0; set to 0 by convention)

Examples:
  z =  2.5 → ReLU = 2.5   (passes through unchanged)
  z = −1.2 → ReLU = 0     (clipped, neuron silent)
  z =  0.0 → ReLU = 0     (exactly on the boundary)

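ReLU is piecewise linear and extremely cheap to compute (just a max), which makes it the usual default for hidden layers. Its weakness is the "dying neuron" problem: a neuron whose input stays negative outputs 0 with gradient 0, so its weights stop updating.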
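As a minimal sketch, ReLU and its derivative can be written in a few lines of plain Python. The function names relu and relu_grad are chosen for this example, and the derivative at exactly z = 0 is set to 0 by convention.

```python
def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for z > 0, otherwise 0 (0 at z = 0 by convention)."""
    return 1.0 if z > 0 else 0.0

# The three example inputs from the lesson.
for z in (2.5, -1.2, 0.0):
    print(z, relu(z), relu_grad(z))
```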

Sigmoid

σ(z) = 1 / (1 + e⁻ᶻ)     → output ∈ (0, 1)

Derivative:
  σ′(z) = σ(z) · (1 − σ(z))   max value = 0.25 at z = 0

Saturation examples:
  z =  6:  σ = 0.9975,  σ′ ≈ 0.0025  (gradient ≈ 0)
  z = −6:  σ = 0.0025,  σ′ ≈ 0.0025  (same — both ends saturate)

Use sigmoid in output layers for binary classification. Avoid in hidden layers — gradients vanish.

Sigmoid's gradient peaks at 0.25 (when z=0) and shrinks rapidly toward zero as |z| grows. In a 10-layer network, even the best case of multiplying 10 sigmoid gradients of 0.25 together gives 0.25¹⁰ ≈ 0.000001 (ignoring the weight terms in the chain rule), so the early layers receive almost no signal. This is the vanishing gradient problem, one of the main reasons deep sigmoid networks were hard to train before ReLU became widespread around 2012.
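The saturation numbers and the 10-layer product above can be reproduced with a short sketch. The helper best_case_gradient_factor is a hypothetical name for this example; it only multiplies the peak sigmoid gradient by itself and ignores weights, so it is an upper bound on the activation part of the chain rule, not a full backpropagation.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: avoids overflow in exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def best_case_gradient_factor(n_layers):
    """Product of n_layers peak sigmoid gradients (each 0.25 at z = 0)."""
    return sigmoid_grad(0.0) ** n_layers

for z in (0.0, 6.0, -6.0):
    print(z, round(sigmoid(z), 4), round(sigmoid_grad(z), 4))

print(best_case_gradient_factor(10))
```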

Tanh

tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)    → output ∈ (−1, 1)

Derivative:
  f′(z) = 1 − tanh(z)²    max value = 1.0 at z = 0

Vs sigmoid:
  tanh is zero-centred (outputs average to 0)
  sigmoid is not (outputs always positive, average > 0)

Tanh is usually preferred over sigmoid for hidden layers: it has the same S-shape (tanh(z) = 2σ(2z) − 1) but is zero-centred and has a larger peak gradient (1.0 vs 0.25).

Zero-centred outputs mean the average activation is near 0 rather than 0.5. This makes gradient updates more symmetric during backpropagation and often leads to faster convergence. Tanh still saturates at large |z|, so it still suffers from vanishing gradients, but less severely than sigmoid because its peak gradient is 1.0 rather than 0.25. It is common in RNN hidden states and in the candidate and output computations of LSTMs (the LSTM gates themselves use sigmoid).
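A small sketch can make the zero-centring point concrete. It assumes a symmetric set of sample inputs (an arbitrary choice for illustration) and compares the mean output of tanh and sigmoid over them. It also checks the identity tanh(z) = 2σ(2z) − 1, which shows that tanh is a rescaled, shifted sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2, peaking at 1.0 when z = 0."""
    return 1.0 - math.tanh(z) ** 2

# Symmetric sample inputs, chosen only for illustration.
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]

mean_tanh = sum(math.tanh(z) for z in inputs) / len(inputs)
mean_sigmoid = sum(sigmoid(z) for z in inputs) / len(inputs)

print("mean tanh output:   ", mean_tanh)     # close to 0
print("mean sigmoid output:", mean_sigmoid)  # close to 0.5
print("peak tanh gradient: ", tanh_grad(0.0))
```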

Leaky ReLU & GELU

Leaky ReLU fixes dying neurons by giving negative inputs a small slope α (commonly 0.01), so f(z) = αz for z < 0 and the gradient is never exactly zero. GELU (Gaussian Error Linear Unit) is a smooth relative of ReLU: instead of a hard cutoff at zero, it weights each input by its percentile in a standard Gaussian, GELU(z) = z · Φ(z), where Φ is the standard normal CDF. The result is a smooth, differentiable curve used in many Transformer architectures, including BERT and GPT-2.
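Both functions are short to write down. In the sketch below the slope parameter alpha and its default of 0.01 are assumptions made for this example (libraries differ in their defaults), and gelu uses the exact form z · Φ(z), with the standard normal CDF Φ computed from math.erf, rather than the tanh approximation some implementations use.

```python
import math

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for z > 0, alpha * z otherwise (alpha is an assumed default)."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Gradient is 1 for z > 0 and alpha otherwise, so it is never exactly zero."""
    return 1.0 if z > 0 else alpha

def gelu(z):
    """Exact GELU: z * Phi(z), where Phi is the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, leaky_relu(z), round(gelu(z), 4))
```

For large positive z, GELU approaches z, and for large negative z it approaches 0, so it tracks ReLU away from the origin while staying smooth near it.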

  • Hidden layers: use ReLU (default) or Leaky ReLU (if dying neurons are a problem)
  • Binary output (probability): use Sigmoid — output ∈ (0,1) is interpretable as P(class=1)
  • Multi-class output (K classes): use Softmax — generalises sigmoid to K probabilities summing to 1
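  • RNN / LSTM state updates: use Tanh (zero-centred outputs improve gradient flow in recurrent loops; the LSTM gates themselves use Sigmoid)
  • Transformers (GPT-2, BERT): use GELU (smooth gradient everywhere, often empirically stronger than ReLU; LLaMA-family models use the related SwiGLU instead)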
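The claim that Softmax generalises sigmoid can be checked directly. The following sketch assumes a plain-list implementation with the usual max-subtraction trick for numerical stability; the function names and sample logits are chosen for this example. With two classes and logits [z, 0], the first softmax probability equals sigmoid(z).

```python
import math

def softmax(logits):
    """Convert K logits into K probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probs = softmax([2.0, 1.0, 0.1])  # sample logits, for illustration only
print(probs, sum(probs))

z = 1.5
print(softmax([z, 0.0])[0], sigmoid(z))  # the two values should match
```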

Practice questions

  1. Which activation function has the 'dying neuron' problem?
  2. What is sigmoid's output range?
  3. Why is Tanh usually preferred over Sigmoid for hidden layers?
  4. Which activation function is used in modern Transformer models like GPT-2 and BERT?
