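Vanilla RNNs struggle on long sequences because their gradient shrinks exponentially with distance: a per-step factor of 0.7 applied over 100 steps gives 0.7¹⁰⁰ ≈ 3 × 10⁻¹⁶, effectively zero. LSTM (Long Short-Term Memory) was designed to fix this, and GRU (Gated Recurrent Unit) is a later, simpler design with the same goal. Both architectures add learnable gates that decide what to remember, what to forget, and what to output. When the gates stay open, they give the gradient a largely additive path back through time (a "gradient highway"), so it survives over many more steps.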
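The arithmetic behind the vanishing gradient can be checked in a few lines. This is only a sketch: it treats the per-step gradient factor as a fixed scalar, which is a simplification (in a real RNN the factor is a Jacobian that changes at every step). The 0.99 comparison value is an assumption chosen for illustration, not a number from the text.

```python
# Illustrative only: model the per-step gradient factor as a constant scalar.
factor = 0.7   # per-step shrink factor used in the text
steps = 100    # distance between the output and the input it depends on

gradient_scale = factor ** steps
print(f"{factor}^{steps} = {gradient_scale:.2e}")   # on the order of 1e-16

# Assumed comparison value: a factor close to 1 (like an open gate)
# shrinks far more slowly over the same distance.
gated_scale = 0.99 ** steps
print(f"0.99^{steps} = {gated_scale:.2f}")          # roughly 0.37
```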
LSTM Cell: Three Gates and a Candidate Working Together
At each step, given input xₜ, previous hidden state hₜ₋₁, and previous cell state cₜ₋₁ (where [hₜ₋₁, xₜ] denotes concatenation):
fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f) ← forget gate (how much of cₜ₋₁ to keep)
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i) ← input gate (how much new info to add)
gₜ = tanh(W_g · [hₜ₋₁, xₜ] + b_g) ← candidate values to possibly add
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o) ← output gate (how much of cell state to expose)
Update:
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ gₜ ← new cell state (memory)
hₜ = oₜ ⊙ tanh(cₜ) ← new hidden state (output)
σ = sigmoid → outputs in [0, 1] act as soft valves
⊙ = element-wise multiplication
The key trick: cₜ = fₜ ⊙ cₜ₋₁ + ... When fₜ ≈ 1, the cell state passes through unchanged and the gradient flows freely.
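The equations above translate almost line for line into NumPy. The following is a minimal sketch, not a reference implementation: the function names (`lstm_step`, `init_params`), the dictionary layout of the parameters, the sizes, and the initialization scale are all assumptions made for illustration. The last part forces fₜ ≈ 1 and iₜ ≈ 0 through the biases to show the cell state passing through unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. params maps "f", "i", "g", "o" to (W, b) pairs,
    with W of shape (hidden, hidden + input) and b of shape (hidden,)."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    W_f, b_f = params["f"]
    W_i, b_i = params["i"]
    W_g, b_g = params["g"]
    W_o, b_o = params["o"]
    f_t = sigmoid(W_f @ concat + b_f)        # forget gate
    i_t = sigmoid(W_i @ concat + b_i)        # input gate
    g_t = np.tanh(W_g @ concat + b_g)        # candidate values
    o_t = sigmoid(W_o @ concat + b_o)        # output gate
    c_t = f_t * c_prev + i_t * g_t           # new cell state
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

def init_params(input_size, hidden_size, rng):
    # Hypothetical initialization: small random weights, zero biases.
    shape = (hidden_size, hidden_size + input_size)
    return {name: (rng.normal(0.0, 0.1, size=shape), np.zeros(hidden_size))
            for name in ("f", "i", "g", "o")}

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = init_params(input_size, hidden_size, rng)

# Force f ≈ 1 (keep everything) and i ≈ 0 (write nothing) via large biases.
W_f, _ = params["f"]
W_i, _ = params["i"]
params["f"] = (np.zeros_like(W_f), np.full(hidden_size, 20.0))
params["i"] = (np.zeros_like(W_i), np.full(hidden_size, -20.0))

x_t = rng.normal(size=input_size)
h_prev = np.zeros(hidden_size)
c_prev = np.array([1.0, -2.0, 0.5, 3.0])

h_t, c_t = lstm_step(x_t, h_prev, c_prev, params)
print(np.allclose(c_t, c_prev, atol=1e-6))   # cell state is carried over
```

In a trained network the gates are input-dependent rather than pinned by the biases; the pinning here only isolates the pass-through behavior.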
What Each Gate Actually Learns
- Forget gate (fₜ): learns to erase outdated memory. Example: in 'The cat sat on the mat. Meanwhile, the dog...', when 'the dog' arrives, the forget gate can push the 'cat' entry in the cell state toward zero.
- Input gate (iₜ): learns which new information is worth remembering. A stopword like 'the' triggers a low iₜ (don't update memory); a named entity or verb triggers a high iₜ.
- Candidate values (gₜ): the actual content to write IF the input gate approves. Produced by tanh so values are in [−1, +1].
- Output gate (oₜ): learns which parts of memory are relevant right now. The cell may hold 'waiting for the verb of this sentence' for 10 steps, then expose it only when the verb finally arrives.
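The roles listed above can be made concrete by setting the gate values by hand instead of learning them. This is a toy sketch: the three-slot cell state, the chosen gate values, and the helper name `cell_update` are all made up for illustration. Real gates take soft values between 0 and 1 rather than exactly 0 or 1.

```python
import numpy as np

def cell_update(c_prev, f, i, g, o):
    """Apply the LSTM update equations with hand-picked gate values."""
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return c, h

c_prev = np.array([0.9, 0.9, 0.9])   # three memory slots with the same content

f = np.array([1.0, 0.0, 1.0])        # slot 0: keep,   slot 1: erase,     slot 2: keep
i = np.array([0.0, 1.0, 0.0])        # slot 0: ignore, slot 1: write,     slot 2: ignore
g = np.array([0.5, -0.8, 0.5])       # candidate content offered to each slot
o = np.array([1.0, 1.0, 0.0])        # slot 0: expose, slot 1: expose,    slot 2: hide

c, h = cell_update(c_prev, f, i, g, o)

# Slot 0 keeps its old value, slot 1 is overwritten by the candidate,
# slot 2 keeps its value in memory but contributes nothing to the output.
print(c)   # values 0.9, -0.8, 0.9
print(h)   # tanh(0.9), tanh(-0.8), 0.0
```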
GRU: Simplified LSTM with Two Gates
GRU merges the forget + input gates into one 'update' gate, and drops the separate cell state:
rₜ = σ(W_r · [hₜ₋₁, xₜ] + b_r) ← reset gate (how much past to ignore when computing candidate)
zₜ = σ(W_z · [hₜ₋₁, xₜ] + b_z) ← update gate (interpolate old vs new)
h̃ₜ = tanh(W · [rₜ ⊙ hₜ₋₁, xₜ] + b) ← candidate hidden state
Update:
hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ
GRU has ~75% of LSTM's parameters and is faster to train.
Empirical result: GRU ≈ LSTM on most tasks, LSTM slightly better on very long sequences.
If zₜ ≈ 0, hₜ = hₜ₋₁ (perfect memory). If zₜ ≈ 1, hₜ = h̃ₜ (full update). Same gradient-highway idea.
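The GRU equations above can be sketched the same way. The function name `gru_step`, the parameter dictionary, the sizes, and the initialization are assumptions for illustration only. The two calls at the end pin the update gate near 0 and near 1 to check the two limiting cases described in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step. params maps "r", "z", "h" to (W, b) pairs,
    with W of shape (hidden, hidden + input) and b of shape (hidden,)."""
    W_r, b_r = params["r"]
    W_z, b_z = params["z"]
    W_h, b_h = params["h"]
    concat = np.concatenate([h_prev, x_t])               # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat + b_r)                    # reset gate
    z_t = sigmoid(W_z @ concat + b_z)                    # update gate
    concat_reset = np.concatenate([r_t * h_prev, x_t])   # [r_t ⊙ h_{t-1}, x_t]
    h_tilde = np.tanh(W_h @ concat_reset + b_h)          # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde          # interpolate old vs new

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
shape = (hidden_size, hidden_size + input_size)
# Hypothetical initialization: small random weights, zero biases.
params = {name: (rng.normal(0.0, 0.1, size=shape), np.zeros(hidden_size))
          for name in ("r", "z", "h")}

x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)

# z ≈ 0: the old hidden state is copied forward.
params["z"] = (np.zeros(shape), np.full(hidden_size, -20.0))
h_keep = gru_step(x_t, h_prev, params)
print(np.allclose(h_keep, h_prev, atol=1e-6))

# z ≈ 1: the hidden state is replaced by the candidate, so it lies in (-1, 1).
params["z"] = (np.zeros(shape), np.full(hidden_size, 20.0))
h_new = gru_step(x_t, h_prev, params)
print(np.all(np.abs(h_new) < 1.0))
```

Conventions differ between sources: some libraries, PyTorch's `nn.GRU` among them, swap the roles of zₜ and (1 − zₜ), so that zₜ ≈ 1 means "keep the old state". The sketch follows the convention used in this text.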
LSTM vs GRU: Practical Guidance
- Use GRU when: compute budget is tight, sequences are short to medium (≤ 100 steps), or you're prototyping quickly. It has ~25% fewer parameters and is roughly 30% faster, though the exact speedup depends on hardware and implementation.
- Use LSTM when: sequences are very long (speech, music, long documents), you need maximum expressive power, or you have the compute to spare.
- Both typically outperform vanilla RNNs by a wide margin once sequences exceed ~20 steps.
- Both have been largely replaced by Transformers for NLP tasks, but they remain widely used in speech, time series, and anywhere streaming/online inference is required.
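The "~25% fewer parameters" figure follows directly from the equations: an LSTM has four weight blocks (f, i, g, o) and a GRU has three (r, z, candidate), each of the same shape. A quick count, assuming one weight matrix and one bias vector per block as written in this text; the helper name and the layer sizes are made up for illustration.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Parameter count with one W of shape (hidden, hidden + input)
    and one bias of shape (hidden,) per block."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block

input_size, hidden_size = 128, 256   # hypothetical layer sizes

lstm_params = gated_rnn_params(input_size, hidden_size, num_blocks=4)
gru_params = gated_rnn_params(input_size, hidden_size, num_blocks=3)
ratio = gru_params / lstm_params

print(lstm_params, gru_params, ratio)   # ratio is 3/4 regardless of the sizes
```

Library implementations may report slightly different totals, for example because PyTorch keeps two bias vectors per block, but the 3:4 ratio between GRU and LSTM still holds.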