Deep Learning - Advanced - 15 min

Learn RNN & Sequences

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Last updated: 2026-05-13.

A feedforward network processes inputs independently — give it a sentence word by word and it forgets every previous word before reading the next. Recurrent Neural Networks (RNNs) solve this by maintaining a hidden state: a compact summary of everything seen so far. At each step, the hidden state is updated using the new input and the previous hidden state, creating a memory that spans the entire sequence.

The RNN Cell: What Happens at Each Step

At time step t, given input xₜ and previous hidden state hₜ₋₁:

  hₜ = tanh(Wₓ · xₜ  +  Wₕ · hₜ₋₁  +  b)
  yₜ = Wᵧ · hₜ  +  bᵧ   (optional output at each step)

Key insight: the SAME weights (Wₓ, Wₕ, b) are used at every time step.
This weight sharing is analogous to convolutional filters —
one RNN cell processes sequences of any length.

Initialisation: h₀ = 0 (zero vector, no prior context)

Same weights at every step → parameter count is O(hidden_dim² + hidden_dim · input_dim), independent of seq_length
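The update rule above can be sketched in plain Python using only the standard library. The helper names (`matvec`, `rnn_step`, `run_rnn`) and the example weights are illustrative assumptions, not part of the lesson; real code would normally use NumPy or a deep learning framework.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One time step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    a = matvec(W_x, x_t)
    r = matvec(W_h, h_prev)
    return [math.tanh(a_i + r_i + b_i) for a_i, r_i, b_i in zip(a, r, b)]

def run_rnn(xs, W_x, W_h, b):
    """Apply the same cell to every input, starting from h_0 = 0."""
    h = [0.0] * len(b)
    hidden_states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        hidden_states.append(h)
    return hidden_states

# Hypothetical weights: input_dim = 2, hidden_dim = 3.
W_x = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]

short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = short_seq * 10  # 20 steps, processed with the very same weights

states_short = run_rnn(short_seq, W_x, W_h, b)
states_long = run_rnn(long_seq, W_x, W_h, b)

# The parameter count depends only on the dimensions, not on sequence length.
n_params = sum(len(row) for row in W_x) + sum(len(row) for row in W_h) + len(b)
print(len(states_short), len(states_long), n_params)  # prints: 2 20 18
```

Both sequences are handled by the same 18 parameters; only the number of loop iterations changes.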

Unrolling: the Same Cell, Different Time

An RNN is often visualised 'unrolled' — one copy of the cell per time step, each sharing the same weights. This unrolled view makes the information flow clear: input enters from below at each step, the hidden state flows right between cells, and an optional output exits from the top. Backpropagation through this unrolled diagram is called Backpropagation Through Time (BPTT) — gradients flow backwards through every copy of the cell.
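As a rough illustration of BPTT, the sketch below uses a scalar RNN (one input, one hidden unit) so the backward walk through the unrolled cells fits in a few lines. The squared-error loss on the final hidden state, the function names, and the numbers are assumptions chosen for the example. The result is checked against a finite-difference estimate.

```python
import math

def forward(xs, w_x, w_h, b):
    """Unrolled forward pass of a scalar RNN; returns [h_0, h_1, ..., h_T]."""
    hs = [0.0]
    for x_t in xs:
        hs.append(math.tanh(w_x * x_t + w_h * hs[-1] + b))
    return hs

def loss(xs, target, w_x, w_h, b):
    """Squared error on the final hidden state only (an assumed toy loss)."""
    return 0.5 * (forward(xs, w_x, w_h, b)[-1] - target) ** 2

def bptt_grad_w_h(xs, target, w_x, w_h, b):
    """dL/dw_h, computed by walking the unrolled graph backwards."""
    hs = forward(xs, w_x, w_h, b)
    grad_w_h = 0.0
    dL_dh = hs[-1] - target                 # gradient arriving at h_T
    for t in range(len(xs), 0, -1):         # t = T, T-1, ..., 1
        dL_dz = dL_dh * (1.0 - hs[t] ** 2)  # back through tanh
        grad_w_h += dL_dz * hs[t - 1]       # shared weight: contributions add up
        dL_dh = dL_dz * w_h                 # pass the gradient on to h_{t-1}
    return grad_w_h

xs = [0.5, -1.0, 0.8, 0.3]
target, w_x, w_h, b = 0.2, 0.7, 0.9, 0.1

analytic = bptt_grad_w_h(xs, target, w_x, w_h, b)
eps = 1e-6
numeric = (loss(xs, target, w_x, w_h + eps, b)
           - loss(xs, target, w_x, w_h - eps, b)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # prints: True
```

Because the weight is shared, every copy of the cell contributes a term to the same gradient, which is why `grad_w_h` is accumulated inside the loop.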

Vanishing Gradients: the Core Problem

For a sequence of length T with the loss computed at the final step, the gradient of the loss w.r.t. h₁ requires:

  ∂L/∂h₁ = (∂L/∂h_T) × ∏ₖ₌₁ᵀ⁻¹ (∂hₖ₊₁/∂hₖ)

Each factor is:
  ∂hₖ₊₁/∂hₖ = diag(tanh′(zₖ₊₁)) · Wₕ,   where zₖ₊₁ = Wₓ · xₖ₊₁ + Wₕ · hₖ + b

With tanh′ ≤ 1 and ‖Wₕ‖ < 1, each multiplication shrinks the gradient.
For T = 100 steps (99 factors) and average factor ≈ 0.7:
  ∂L/∂h₁ ≈ 0.7⁹⁹ × ∂L/∂h_T ≈ 4.6 × 10⁻¹⁶ × ∂L/∂h_T ≈ 0

The gradient signal from the output barely reaches the beginning of the sequence.
The network cannot learn dependencies between elements far apart in time.
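The shrinkage can be checked numerically. The sketch below first evaluates the product of 99 factors of 0.7, then measures |∂h_T/∂h₁| for a scalar RNN. The weight 0.9, the constant input 0.5, and the function name are assumptions made for the demonstration.

```python
import math

# Product of 99 identical factors of 0.7, as in the T = 100 example.
shrink = 0.7 ** 99
print(f"{shrink:.1e}")  # about 4.6e-16

def gradient_through_time(w_h, xs):
    """|dh_T/dh_1| for a scalar RNN h_t = tanh(w_h * h_{t-1} + x_t)."""
    h = 0.0
    hs = []
    for x_t in xs:
        h = math.tanh(w_h * h + x_t)
        hs.append(h)
    grad = 1.0
    for h_next in hs[1:]:                  # factors dh_{k+1}/dh_k, k = 1..T-1
        grad *= w_h * (1.0 - h_next ** 2)  # w_h * tanh'(z_{k+1})
    return abs(grad)

xs = [0.5] * 100
for steps in (5, 20, 100):
    print(steps, gradient_through_time(0.9, xs[:steps]))
```

Each factor here is below 0.9 in magnitude, so the printed gradient falls rapidly as the number of steps grows.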

LSTM was designed specifically to address this, and GRU is a later, simpler gated variant: both use gates to create gradient highways that carry the signal across many steps.

Four Types of RNN Applications

  • Many-to-one: sequence → single output. Examples: sentiment analysis (sentence → positive/negative), document classification, time series forecasting of a final value.
  • Many-to-many (same length): one output per input. Examples: part-of-speech tagging (each word → its tag), frame-by-frame video labelling, real-time sensor classification.
  • Many-to-many (different lengths — encoder-decoder): sequence → context vector → different-length sequence. Examples: machine translation, speech-to-text, text summarisation.
  • One-to-many: single input → sequence output. Examples: image captioning (image features → caption), music generation from a seed note.
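The four patterns differ mainly in which hidden states are read out and how the inputs are fed in. The toy sketch below uses a scalar cell with fixed, hypothetical weights and an identity readout (the output is simply the hidden state). `step`, `encode`, and `decode` are illustrative names, not a real library API.

```python
import math

def step(x_t, h_prev, w_x=0.8, w_h=0.5):
    """Toy scalar RNN cell with hypothetical fixed weights."""
    return math.tanh(w_x * x_t + w_h * h_prev)

def encode(xs, h0=0.0):
    """Run the cell over xs and return every hidden state."""
    hs, h = [], h0
    for x_t in xs:
        h = step(x_t, h)
        hs.append(h)
    return hs

def decode(h0, n_steps):
    """Generate n_steps outputs, feeding each output back in as the next input."""
    ys, h, y = [], h0, 0.0
    for _ in range(n_steps):
        h = step(y, h)
        y = h  # toy readout: the output is the hidden state itself
        ys.append(y)
    return ys

xs = [0.2, -0.4, 0.9]

many_to_one = encode(xs)[-1]             # one value for the whole sequence
many_to_many = encode(xs)                # one value per input
seq_to_seq = decode(encode(xs)[-1], 5)   # context vector -> 5 outputs
one_to_many = decode(step(0.7, 0.0), 4)  # single input -> 4 outputs

print(len(many_to_many), len(seq_to_seq), len(one_to_many))  # prints: 3 5 4
```

In the encoder-decoder case the final encoder state plays the role of the context vector, and the output length (5) is chosen independently of the input length (3).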

Bidirectional RNNs

Standard RNN: context flows left → right only.
  hₜ uses x₁, x₂, ..., xₜ (past and present)

Bidirectional RNN:
  Forward pass:  h→ₜ = RNN(xₜ, h→ₜ₋₁)   (left to right)
  Backward pass: h←ₜ = RNN(xₜ, h←ₜ₊₁)   (right to left)
  Combined:      hₜ = [h→ₜ ; h←ₜ]          (concatenated)

Each position now sees the FULL context — past AND future.
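Used in BiLSTMs for NLP tasks such as tagging and named-entity recognition,
where the full sequence is available at inference time.
BERT also reads context in both directions, but it is a transformer built on self-attention, not an RNN.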

Bidirectional = unusable for real-time, step-by-step generation (the future inputs do not exist yet) but ideal for understanding tasks
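A minimal sketch of the forward, backward, and combined passes, again with a scalar cell. The weights are made up, and each direction gets its own weight pair, as is usual for bidirectional RNNs. Changing only the last input leaves the forward state at position 0 untouched but changes the backward state there, which shows that the combined state sees future context.

```python
import math

def step(x_t, h_prev, w_x, w_h):
    """Toy scalar RNN cell."""
    return math.tanh(w_x * x_t + w_h * h_prev)

def run(xs, w_x, w_h):
    """Left-to-right pass over xs, starting from h = 0."""
    hs, h = [], 0.0
    for x_t in xs:
        h = step(x_t, h, w_x, w_h)
        hs.append(h)
    return hs

def bidirectional(xs, fwd_weights, bwd_weights):
    """Return one (forward, backward) pair of states per position."""
    forward = run(xs, *fwd_weights)
    # Run over the reversed sequence, then flip back to align with positions.
    backward = run(xs[::-1], *bwd_weights)[::-1]
    return list(zip(forward, backward))

xs = [1.0, 0.0, 0.0, 0.0]
xs_changed = [1.0, 0.0, 0.0, 5.0]  # only the LAST input differs

a = bidirectional(xs, (0.8, 0.5), (0.6, 0.4))
b = bidirectional(xs_changed, (0.8, 0.5), (0.6, 0.4))

# Position 0: forward half is unchanged, backward half reacts to the future input.
print(a[0][0] == b[0][0], a[0][1] != b[0][1])  # prints: True True
```

The backward pass needs the whole sequence before it can produce its first state, which is why this layout does not suit streaming generation.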

Practice questions

  1. What is the key difference between a feedforward network and an RNN when processing a sequence?
  2. Why do vanilla RNNs suffer from the vanishing gradient problem for long sequences?
  3. An RNN reads the sentence: 'The movie, despite its confusing plot, was surprisingly good.' The output at the final step should be 'positive'. What makes this hard for a vanilla RNN?
  4. What does 'unrolling' an RNN mean, and why is it useful?
