Advanced Topics - Advanced - 18 min

Learn Q-Learning & Deep RL


Last updated: 2026-05-13.

Q-learning is the classical algorithm that turns the abstract RL idea into a concrete learning rule. It builds a table, with one entry per (state, action) pair, that estimates the expected long-term (discounted) reward of taking each action in each state. Deep Q-Networks (DQN) replace that table with a neural network, so the same idea scales to environments with far too many states to enumerate (like Atari pixel screens).

The Q-update rule

Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') − Q(s, a) ]

The term in brackets is the TD error:
  • r                        = immediate reward
  • γ · max_{a'} Q(s', a')   = discounted best estimated future reward from the next state
  • − Q(s, a)                = current estimate (we move it toward the target r + γ · max_{a'} Q(s', a'))

α = learning rate. γ = discount factor. Update one cell per step.
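
A minimal sketch of the update rule above in plain Python. The table is a dictionary keyed by (state, action). The state names, the action names, and the values α = 0.1 and γ = 0.9 are illustrative assumptions, not part of the lesson.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # max_{a'} Q(s', a')
    td_error = r + gamma * best_next - Q[(s, a)]        # the bracketed term
    Q[(s, a)] += alpha * td_error                       # move one cell toward the target
    return td_error

Q = defaultdict(float)            # unseen (state, action) cells start at 0
actions = ["left", "right"]       # hypothetical action set
Q[("s1", "right")] = 2.0          # pretend this cell was learned earlier

td = q_update(Q, "s0", "right", r=1.0, s_next="s1", actions=actions)
print(td, Q[("s0", "right")])     # TD error is about 2.8, new cell value about 0.28
```

Only the cell for ("s0", "right") changes. Every other entry keeps its old value, which is what "one cell per step" means.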

From table to neural network

Tabular Q-learning works for small problems such as a grid world with 25 states. It fails for Atari, where the state is an 84×84 image (over 7,000 pixel dimensions), so the number of distinct states is far too large to store in a table. Deep Q-Networks replace the table with a CNN that takes pixels as input and outputs one Q-value per action. Training minimises the same TD error, now squared and averaged over samples, using backpropagation instead of a single-cell update.

Deep Q loss:
  L(θ) = E[ ( r + γ · max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) )² ]

θ⁻ is a frozen copy of the network (target network),
updated periodically to stabilise training.

DQN minimises mean-squared TD error via gradient descent.
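
To make the loss concrete, here is a small sketch that evaluates it on a mini-batch. The two "networks" are stand-in lookup functions (an assumption made to keep the example dependency-free); in a real DQN they would be the CNN with parameters θ and its frozen copy θ⁻. Terminal-state handling is left out, as in the formula above.

```python
def dqn_loss(batch, q_online, q_target, gamma=0.99):
    """Mean squared TD error over a mini-batch of (s, a, r, s_next) tuples."""
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * max(q_target(s_next))  # computed with the frozen copy
        td_error = target - q_online(s)[a]          # computed with the trainable network
        total += td_error ** 2
    return total / len(batch)

# Hypothetical stand-ins: each maps a state index to one Q-value per action.
online_table = {0: [0.0, 1.0], 1: [0.5, 0.0]}
target_table = {0: [0.0, 1.0], 1: [2.0, 0.0]}

def q_online(s):
    return online_table[s]

def q_target(s):
    return target_table[s]

batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]
loss = dqn_loss(batch, q_online, q_target, gamma=0.5)
print(loss)  # 0.5: squared errors are 1.0 and 0.0, averaged over two samples
```

In practice the gradient of this loss is taken with respect to θ only; the target term is treated as a constant.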

Four tricks that make DQN work

  • Experience replay: store transitions (s, a, r, s') in a buffer and sample mini-batches randomly. This breaks correlation between consecutive samples and reuses data efficiently.
  • Target network: a slowly-updated copy of Q used to compute targets. Without it, the target moves every step and training is unstable.
  • ε-greedy decay: start with ε = 1 (pure random), decay to ε = 0.01 over millions of steps. Encourages broad early exploration.
  • Reward clipping: in Atari, clip rewards to {−1, 0, +1} so loss scales similarly across games.
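
The four tricks listed above can be sketched with the standard library alone. The class and function names, the linear shape of the ε schedule, and the sync period of 10,000 steps are assumptions chosen for illustration; the lesson does not specify them.

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) transitions."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)   # oldest transitions are dropped when full

    def push(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return random.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)

def sync_target(step, online_params, target_params, sync_every=10_000):
    """Return a fresh copy of the online parameters every `sync_every` steps."""
    if step % sync_every == 0:
        return copy.deepcopy(online_params)
    return target_params

def epsilon_at(step, start=1.0, end=0.01, decay_steps=1_000_000):
    """Linear decay from `start` to `end`, then held constant (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

def clip_reward(r):
    """Map any reward to -1, 0, or +1 by its sign."""
    return (r > 0) - (r < 0)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(i, 0, clip_reward(10 * i), i + 1)   # only the last 3 transitions survive
print(len(buf), buf.sample(2), epsilon_at(500_000))
```

A training loop would call `push` after every environment step, `sample` before every gradient step, and `sync_target` once per step to decide whether the frozen copy is refreshed.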

Practice questions

  1. What does Q(s, a) represent?
  2. Why does DQN use a separate target network?
  3. What problem does experience replay solve?
  4. When does Q-learning struggle without a neural network?
