Q-learning is the classical algorithm that turns the abstract RL idea into a concrete learning rule. It builds a table — one entry per (state, action) pair — that estimates the long-term reward of taking each action. Deep Q-Networks (DQN) replace that table with a neural network so the same idea scales to environments with millions of states (like Atari pixel screens).
The Q-update rule
Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') − Q(s, a) ]
The term in brackets is the TD error:
• r = immediate reward
• γ · max_{a'} Q(s', a') = discounted best estimated future reward from the next state
• − Q(s, a) = current estimate (we move toward the target)
α is the learning rate and γ is the discount factor. Each step updates a single table cell.
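To make the update rule concrete, here is a minimal sketch in plain Python. The function name `q_update`, the dictionary-based table, and the toy states and actions are illustrative assumptions, not part of the original text. The `done` flag, which drops the future-reward term at the end of an episode, is a common convention added here and does not appear in the formula above.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Best estimated value of the next state (assumed zero if the episode ended).
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    # TD error: the bracketed term in the update rule.
    td_error = r + gamma * best_next - Q[(s, a)]
    # Move the current estimate a fraction alpha toward the target.
    Q[(s, a)] += alpha * td_error
    return td_error

# Hypothetical two-action problem; every table entry starts at 0.
Q = defaultdict(float)
actions = ["left", "right"]
delta = q_update(Q, s=0, a="right", r=1.0, s_next=1,
                 actions=actions, alpha=0.5, gamma=0.9)
```

With an all-zero table, the first TD error equals the reward (1.0), and the updated cell moves halfway toward it because α = 0.5. Only the entry for (0, "right") changes.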
From table to neural network
Tabular Q-learning works for grid worlds with 25 states. It becomes impractical for Atari, where the state is an 84×84 image (7,056 pixel values), so almost no state is ever visited twice. Deep Q-Networks replace the table with a CNN that takes pixels as input and outputs one Q-value per action. The training loss minimises the same TD error, now squared and averaged over a mini-batch, using backpropagation instead of a single-cell update.
Deep Q loss:
L(θ) = E[ ( r + γ · max_{a'} Q(s', a'; θ⁻) − Q(s, a; θ) )² ]
θ⁻ is a frozen copy of the network (target network),
updated periodically to stabilise training.
DQN minimises mean-squared TD error via gradient descent.
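The loss can be checked by hand on a tiny batch. The sketch below computes the mean-squared TD error in plain Python. The names `dqn_loss`, `q_online`, and `q_target` are hypothetical, and the two lookup tables merely stand in for the networks with parameters θ and θ⁻. A real implementation would use an autodiff framework and take gradients with respect to θ only.

```python
def dqn_loss(batch, q_online, q_target, gamma=0.99):
    """Mean-squared TD error over a mini-batch of (s, a, r, s_next) transitions.

    q_online(s) and q_target(s) are assumed to return a list of Q-values,
    one per action. q_target plays the role of the frozen copy θ⁻.
    """
    total = 0.0
    for s, a, r, s_next in batch:
        # The target is computed with the frozen parameters.
        target = r + gamma * max(q_target(s_next))
        td_error = target - q_online(s)[a]
        total += td_error ** 2
    return total / len(batch)

# Toy stand-ins for the two networks: two states, two actions.
online_table = {0: [1.0, 2.0], 1: [0.5, 0.0]}
target_table = {0: [1.0, 1.0], 1: [3.0, 4.0]}
q_online = lambda s: online_table[s]
q_target = lambda s: target_table[s]

batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]
loss = dqn_loss(batch, q_online, q_target, gamma=0.5)
```

For the first transition the target is 1 + 0.5 · 4 = 3 against an estimate of 2, so the squared error is 1. The second transition has zero error, so the batch loss is 0.5.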
Four tricks that make DQN work
- Experience replay: store transitions (s, a, r, s') in a buffer and sample mini-batches randomly. This breaks correlation between consecutive samples and reuses data efficiently.
- Target network: a slowly-updated copy of Q used to compute targets. Without it, the target moves every step and training is unstable.
- ε-greedy decay: start with ε = 1 (purely random actions) and decay to a small value such as ε = 0.01 over millions of steps. This encourages broad exploration early and mostly greedy behaviour later.
- Reward clipping: in Atari, clip rewards to {−1, 0, +1} so loss scales similarly across games.
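The ideas in the list above can be sketched with the Python standard library alone. All names here (`ReplayBuffer`, `epsilon_at`, `clip_reward`, `select_action`) are illustrative. The linear decay schedule is an assumption, since the text only says that ε decays. Target-network syncing is left out because it depends on the network library in use.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def epsilon_at(step, eps_start=1.0, eps_end=0.01, decay_steps=1_000_000):
    """Linearly anneal ε from eps_start to eps_end, then hold it constant."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def clip_reward(r):
    """Map a raw reward to -1, 0, or +1 according to its sign."""
    return (r > 0) - (r < 0)

def select_action(q_values, epsilon, rng=random):
    """ε-greedy: a random action with probability ε, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

# Usage: store clipped rewards, then draw a decorrelated mini-batch.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(t, 0, clip_reward(10 * (t - 2)), t + 1)
mini_batch = buffer.sample(2)
```

With a capacity of 3, only the three most recent transitions remain after five pushes. Real agents typically use buffers holding many thousands of transitions, so that a sampled mini-batch mixes experience from many different episodes.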