BERT Explained Visually - Transformer Encoder Lesson

Forward propagation is the network doing its job — taking raw input and producing a prediction. Every time you ask a model 'is this email spam?' or 'what digit is this?', one forward pass runs through all the layers and a number comes out the other end. Understanding exactly what happens at each step is the foundation for understanding how networks learn.

Step 1: The weighted sum (z)

For neuron j in layer l, given activations aˡ⁻¹ from the previous layer:

  zⱼˡ = Σᵢ (wⱼᵢˡ · aᵢˡ⁻¹) + bⱼˡ

Expanded for a 3-input neuron:
  z = w₁·x₁ + w₂·x₂ + w₃·x₃ + b

In matrix form for a whole layer:
  z = W·a + b

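where W is the weight matrix (one row per neuron in the layer), a is the vector of activations from the previous layer, and b is the bias vector.

z is called the pre-activation: the raw score before the activation function is applied. At the output layer these raw scores are usually called logits.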
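The weighted sum above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names weighted_sum and layer_pre_activation and the sample numbers are assumptions chosen for this example.

```python
def weighted_sum(weights, inputs, bias):
    """Pre-activation of one neuron: z = sum_i(w_i * x_i) + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def layer_pre_activation(W, a, b):
    """Pre-activations of a whole layer: z = W·a + b.

    W is a list of rows (one row of weights per neuron),
    a is the input vector, b holds one bias per neuron.
    """
    return [weighted_sum(row, a, bias) for row, bias in zip(W, b)]


# Hypothetical 3-input neuron
z = weighted_sum([0.8, -0.3, 0.5], [0.8, 0.2, -0.5], 0.1)

# Hypothetical layer with 2 neurons and 2 inputs
z_layer = layer_pre_activation([[1.0, 0.0], [0.0, 2.0]], [3.0, 4.0], [0.5, -1.0])
```

The layer version is just the single-neuron sum repeated once per row of W, which is what the matrix product computes.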

Step 2: The activation (a)

Apply the activation function element-wise to z:

  aˡ = f(zˡ)

For ReLU hidden layers:
  a = max(0, z)   →   negative values clipped to 0

For sigmoid output (binary classification):
  a = 1/(1+e⁻ᶻ)  →   output squashed to (0,1), read as the probability of the positive class

Example (one hidden neuron):
  z = 0.8×0.8 + (−0.3)×0.2 + 0.5×(−0.5) + 0.1
  z = 0.64 − 0.06 − 0.25 + 0.1 = 0.43
  a = ReLU(0.43) = 0.43   ← positive, so the neuron fires and passes 0.43 to the next layer

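The activation function adds non-linearity: without it, any stack of layers would collapse into a single linear map of the form z = W·a + b. Sigmoid also squashes the raw score into the range (0,1); ReLU only clips negative values to 0 and leaves positive values unbounded.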
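As a rough sketch, the two activation functions from this step and the one-neuron example can be written in plain Python. The function names are illustrative choices, and the branch inside sigmoid is an added safeguard against overflow for very negative z, not something the lesson requires.

```python
import math


def relu(z):
    """ReLU: negative values are clipped to 0, positive values pass through."""
    return max(0.0, z)


def sigmoid(z):
    """Sigmoid: 1 / (1 + e^-z), written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# The worked example from this step: three inputs, three weights, one bias
z = 0.8 * 0.8 + (-0.3) * 0.2 + 0.5 * (-0.5) + 0.1   # approximately 0.43
a = relu(z)                                          # z is positive, so a equals z
```

Feeding a negative z to relu returns 0.0, which is the "silent neuron" case discussed later in the lesson.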

Step 3: Repeat for each layer

Start: a⁰ = input features (x₁, x₂, ..., xₙ)
Layer 1: z¹ = W¹·a⁰ + b¹ → a¹ = ReLU(z¹)
Layer 2: z² = W²·a¹ + b² → a² = ReLU(z²)
Output: z³ = W³·a² + b³ → ŷ = σ(z³) (binary) or softmax(z³) (multi-class)
Each layer's output aˡ becomes the input to layer l+1 — information flows strictly left to right
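The layer-by-layer recipe can be sketched as one loop. This is a minimal example under stated assumptions: the helper names (dense, relu, sigmoid, softmax, forward) and all weights and inputs are hypothetical, picked so the arithmetic is easy to follow by hand.

```python
import math


def dense(W, a, b):
    """z = W·a + b for one layer (W is a list of rows, one per neuron)."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def relu(z):
    """Element-wise ReLU on a vector."""
    return [max(0.0, v) for v in z]


def sigmoid(v):
    """Scalar sigmoid, for a single binary output."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def softmax(z):
    """Softmax for a multi-class output; subtracting max(z) avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden_layers, output_layer):
    """Run ReLU hidden layers, then a single sigmoid output (binary case)."""
    a = x                                  # a⁰ = input features
    for W, b in hidden_layers:
        a = relu(dense(W, a, b))           # zˡ = Wˡ·aˡ⁻¹ + bˡ, then aˡ = ReLU(zˡ)
    W_out, b_out = output_layer
    z_out = dense(W_out, a, b_out)
    return sigmoid(z_out[0])               # ŷ = σ(z³)


# Hypothetical network: 2 inputs → 2 hidden → 1 hidden → 1 output
hidden = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),   # z¹ = [-0.5, 2.0], a¹ = [0.0, 2.0]
    ([[1.0, 0.5]], [0.0]),                       # z² = [1.0],       a² = [1.0]
]
output = ([[2.0]], [-2.0])                       # z³ = [0.0]
y_hat = forward([1.0, 2.0], hidden, output)      # σ(0) = 0.5
```

For a multi-class model, the last line of forward would return softmax(z_out) instead of a single sigmoid value.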

What about the dead ReLU neuron?

When a hidden neuron's pre-activation z ≤ 0, its ReLU output is exactly 0. For that input the neuron contributes nothing to the forward pass; it is silent. Its weights also receive no update from that input during backpropagation, because the ReLU gradient is 0 when z ≤ 0 and so no gradient flows back through it. If z ≤ 0 for every (or nearly every) training input, the neuron never gets a gradient that could pull it back into the active range, and it 'dies'. This is harmless for a few neurons but damaging if a large fraction of the hidden layer goes dead.
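One way to picture a dead neuron is to check its pre-activation over a whole dataset. The sketch below is illustrative only: the helper names (pre_activation, relu_grad, is_dead) and the tiny dataset are assumptions, and it treats a neuron as dead when z ≤ 0 for every sample.

```python
def pre_activation(weights, inputs, bias):
    """z = sum_i(w_i * x_i) + b for one neuron."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def relu_grad(z):
    """Derivative of ReLU with respect to z: 1 for z > 0, otherwise 0."""
    return 1.0 if z > 0 else 0.0


def is_dead(weights, bias, dataset):
    """True if the neuron outputs 0 (and so gets zero gradient) on every sample."""
    return all(pre_activation(weights, x, bias) <= 0 for x in dataset)


# Hypothetical inputs, all with non-negative features
dataset = [[0.2, 0.9], [0.5, 0.1], [1.0, 1.0]]

# A large negative bias pushes z below 0 for every sample, so the neuron is dead
dead = is_dead([0.3, 0.4], -5.0, dataset)

# With a zero bias the same weights give z > 0 on every sample, so it is alive
alive = not is_dead([0.3, 0.4], 0.0, dataset)
```

In practice the same check would be run per hidden neuron to estimate what fraction of a layer has gone dead.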
