Forward propagation is the network doing its job — taking raw input and producing a prediction. Every time you ask a model 'is this email spam?' or 'what digit is this?', one forward pass runs through all the layers and a number comes out the other end. Understanding exactly what happens at each step is the foundation for understanding how networks learn.
Step 1: The weighted sum (z)
For neuron j in layer l, given activations aˡ⁻¹ from the previous layer:
zⱼˡ = Σᵢ (wⱼᵢˡ · aᵢˡ⁻¹) + bⱼˡ
Expanded for a 3-input neuron:
z = w₁·x₁ + w₂·x₂ + w₃·x₃ + b
In matrix form for a whole layer:
z = W·a + b
where W is the weight matrix, a is the input vector, and b is the bias vector.
z is called the pre-activation (or, at the output layer, the logit): the raw score before the activation function is applied.
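The matrix form z = W·a + b can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the helper name weighted_sum and the numbers in W, a_prev and b are made up for the example, and a real network would normally use an array library rather than nested lists.

```python
def weighted_sum(W, a, b):
    """Return z = W·a + b for one layer.

    W is a list of rows (one row of weights per neuron),
    a is the previous layer's activations, b is one bias per neuron.
    """
    return [
        sum(w_ji * a_i for w_ji, a_i in zip(row, a)) + b_j
        for row, b_j in zip(W, b)
    ]


# Hypothetical layer: 3 inputs feeding 2 neurons (values chosen for illustration).
W = [[0.5, -1.0, 2.0],
     [1.0, 0.0, -0.5]]
a_prev = [1.0, 2.0, 3.0]
b = [0.5, -1.0]

z = weighted_sum(W, a_prev, b)
print(z)  # [5.0, -1.5]
```

Each entry of z is one neuron's weighted sum, so row j of W holds exactly the weights wⱼᵢ used in the per-neuron formula.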
Step 2: The activation (a)
Apply the activation function element-wise to z:
aˡ = f(zˡ)
For ReLU hidden layers:
a = max(0, z) → negative values clipped to 0
For sigmoid output (binary classification):
a = 1/(1+e⁻ᶻ) → output squashed to (0,1), read as a probability
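The two activation functions above could be written as follows. This is a sketch using only the standard library; the function names relu and sigmoid are our own, and the branch inside sigmoid is an added assumption to avoid overflow in math.exp for very negative z, not something the formula itself requires.

```python
import math


def relu(z):
    """ReLU: pass positive values through, clip negatives to 0."""
    return max(0.0, z)


def sigmoid(z):
    """Sigmoid: map any real z into the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Algebraically the same value, but exp(z) cannot overflow when z < 0.
    e = math.exp(z)
    return e / (1.0 + e)


print(relu(-2.0))    # 0.0
print(relu(0.43))    # 0.43
print(sigmoid(0.0))  # 0.5
```

To apply either function element-wise to a whole layer, map it over the list of pre-activations, for example [relu(v) for v in z].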
Example (one hidden neuron):
z = 0.8×0.8 + (−0.3)×0.2 + 0.5×(−0.5) + 0.1
z = 0.64 − 0.06 − 0.25 + 0.1 = 0.43
a = ReLU(0.43) = 0.43 ← this fires and passes to the next layer
The activation adds non-linearity; without it, stacked layers would collapse into a single linear map. Bounded activations such as sigmoid also squash the raw score into a fixed range.
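The arithmetic in this example can be checked with a short script. It assumes, following the w·x ordering used earlier, that the first factor of each product is the weight and the second is the input; the variable names are ours.

```python
# Assumed reading of the example: weights first, inputs second, bias last.
weights = [0.8, -0.3, 0.5]
inputs = [0.8, 0.2, -0.5]
bias = 0.1

# Step 1: weighted sum.
z = sum(w * x for w, x in zip(weights, inputs)) + bias

# Step 2: ReLU activation.
a = max(0.0, z)

# Rounded for display, since floating-point sums carry tiny rounding errors.
print(round(z, 2), round(a, 2))  # 0.43 0.43
```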
Step 3: Repeat for each layer
- Start: a⁰ = input features (x₁, x₂, ..., xₙ)
- Layer 1: z¹ = W¹·a⁰ + b¹ → a¹ = ReLU(z¹)
- Layer 2: z² = W²·a¹ + b² → a² = ReLU(z²)
- Output: z³ = W³·a² + b³ → ŷ = σ(z³) (binary) or softmax(z³) (multi-class)
- Each layer's output aˡ becomes the input to layer l+1 — information flows strictly left to right
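The layer-by-layer loop above might look like this in plain Python. It is a sketch under stated assumptions: the helper names (dense, relu, sigmoid, forward), the 2-2-2-1 layer sizes and all weight values are invented for illustration, hidden layers use ReLU, and the output uses sigmoid for the binary case. The simple sigmoid here is only suitable for moderate z.

```python
import math


def dense(W, a, b):
    """z = W·a + b for one layer (W is a list of weight rows)."""
    return [sum(w * x for w, x in zip(row, a)) + b_j for row, b_j in zip(W, b)]


def relu(z):
    return [max(0.0, v) for v in z]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def forward(x, layers):
    """Run one forward pass; layers is a list of (W, b) pairs, output layer last."""
    a = x  # a⁰ = input features
    for W, b in layers[:-1]:
        a = relu(dense(W, a, b))  # hidden layer: weighted sum, then ReLU
    W_out, b_out = layers[-1]
    return sigmoid(dense(W_out, a, b_out))  # binary output


# Hypothetical network: 2 inputs -> 2 hidden -> 2 hidden -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]

y_hat = forward([1.0, 2.0], layers)
print(y_hat)  # [0.5]
```

With these values the first hidden layer gives z¹ = [-1.0, 1.5], so ReLU zeroes the first unit; the two second-layer units then both equal 1.5 and cancel at the output, giving z³ = 0 and ŷ = 0.5.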
What about the dead ReLU neuron?
When a hidden neuron's pre-activation z ≤ 0, its ReLU output is exactly 0. For that input the neuron contributes nothing to the forward pass: it is silent. Its weights receive no update from that input during backpropagation either, because the gradient of ReLU is 0 there and nothing flows back through it. If z ≤ 0 for essentially every input, the neuron never receives a gradient and cannot recover: it has 'died'. This is harmless for a few neurons but damaging if a large fraction of the hidden layer goes dead.
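One rough way to spot candidate dead neurons is to run a batch of inputs through a layer and flag the units whose pre-activation is never positive. The sketch below assumes a single ReLU layer with made-up weights and a made-up batch; dead_units is a hypothetical helper, and a unit that is silent on one batch is only suspected dead, not proven so.

```python
def dense(W, a, b):
    """z = W·a + b for one layer (W is a list of weight rows)."""
    return [sum(w * x for w, x in zip(row, a)) + b_j for row, b_j in zip(W, b)]


def dead_units(W, b, batch):
    """Return indices of units with z <= 0 for every input in the batch."""
    zs = [dense(W, x, b) for x in batch]
    return [j for j in range(len(b)) if all(z[j] <= 0.0 for z in zs)]


# Hypothetical hidden layer with 3 units and 2 inputs.
W = [[1.0, 1.0],
     [-1.0, -1.0],
     [1.0, -1.0]]
b = [0.0, -0.5, 0.0]
batch = [[1.0, 2.0], [0.5, 0.1], [3.0, 1.0]]

dead = dead_units(W, b, batch)
print(dead)                # [1]
print(len(dead) / len(b))  # fraction of the layer that is silent on this batch
```

Unit 2 is silent on the first input only, so it is not flagged; unit 1 has negative weights and a negative bias, so no non-negative input in this batch can make it fire.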