Learn Q-Learning and Deep RL

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Q-learning is a reinforcement learning method that estimates how valuable it is to take a given action in a given state. The agent learns a Q-value for each state-action pair: the expected cumulative future reward, discounted over time, if it takes that action now and then keeps choosing the best-known action afterward.

Why it matters

Q-learning introduces the core idea of learning long-term action value. Deep RL extends it to complex inputs such as game screens, robot sensors, or high-dimensional simulations.

Key terms

Q-value: expected cumulative discounted future reward for taking an action in a state.
Q-table: lookup table of state-action values for small problems.
Bellman update: rule that improves Q-values using reward plus estimated future value.
Learning rate (alpha): how strongly new experience updates old estimates.
Discount factor (gamma): how much future rewards count relative to immediate ones, usually a value between 0 and 1.
Epsilon-greedy: exploration method that chooses a random action with probability epsilon and the best-known action otherwise.
Deep Q-Network: neural network that approximates Q-values for large state spaces.
Replay buffer: stored experiences sampled during training to stabilize learning.
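
The epsilon-greedy term above can be made concrete with a short Python sketch. This is a minimal illustration, not a reference implementation: the function name epsilon_greedy, the list-of-floats representation of Q-values, and the tie-breaking rule are all assumptions made for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values (hypothetical helper).

    With probability epsilon, explore with a uniformly random action;
    otherwise exploit the action with the highest current Q-value.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Ties are broken in favor of the lowest action index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Made-up Q-values for three actions in one state.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
```

With epsilon set to 0 the agent always exploits; with epsilon set to 1 it always explores. Many training setups start with a high epsilon and lower it over time, although the right schedule depends on the problem.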

Q(s,a) <- Q(s,a) + alpha * [reward + gamma * max over next_action of Q(next_state, next_action) - Q(s,a)]

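Q-learning moves the current action value a step of size alpha toward a target: the reward just received plus the discounted value of the best action available in the next state.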
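
As a worked example of the update rule, here is a minimal tabular sketch in Python. The Q-table layout (a list of rows, one per state), the function name q_update, and all numbers are assumptions chosen for illustration. The done flag is also an addition not shown in the formula: it drops the future-value term when the episode has ended.

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Assumption: no future value is added once the episode has ended.
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + gamma * best_next         # reward plus discounted future value
    td_error = td_target - q_table[state][action]  # gap between target and old estimate
    q_table[state][action] += alpha * td_error     # move part of the way toward the target
    return q_table[state][action]


# Two states, two actions; every value here is made up.
q_table = [[0.0, 0.5],
           [1.0, 2.0]]
new_value = q_update(q_table, state=0, action=1, reward=1.0,
                     next_state=1, alpha=0.5, gamma=0.9)
print(new_value)  # approximately 1.65
```

By hand: the best next value is 2.0, so the target is 1.0 + 0.9 * 2.0 = 2.8. The old estimate is 0.5, so the error is 2.3, and with alpha = 0.5 the new estimate is 0.5 + 0.5 * 2.3 = 1.65.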

Deep RL basics

A neural network receives the state and predicts Q-values for possible actions.
The agent chooses actions using a mix of exploration and best-known Q-values.
Experience replay samples old transitions so the model does not learn only from recent steps.
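A target network, a copy of the Q-network that is updated only periodically, is often used to compute the update targets so they change slowly and training is more stable.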
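
The replay and target-network ideas above can be sketched without any deep learning framework. In this minimal Python example the class ReplayBuffer, the function td_targets, and the target_q callable are all hypothetical names; target_q stands in for a target network and simply returns a list of Q-values for a state. A real Deep Q-Network would replace it with a neural network and train the online network toward these targets.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement mixes old and recent steps.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma=0.99):
    """Compute one training target per transition from a frozen Q-function."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        # Terminal transitions contribute no future value.
        best_next = 0.0 if done else max(target_q(next_state))
        targets.append(reward + gamma * best_next)
    return targets


# Usage with made-up transitions and a constant stand-in for the target network.
buffer = ReplayBuffer(capacity=100)
for step in range(10):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=(step == 9))
batch = buffer.sample(batch_size=4)
targets = td_targets(batch, target_q=lambda s: [0.0, 2.0], gamma=0.5)
```

Keeping target_q fixed while the online network learns is what keeps the targets from shifting on every step; how often to refresh it is a tuning choice.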

Visual explanation suggestion

Show a grid with each action arrow labeled by Q-value. Animate one Bellman update after a reward, then switch to a neural network view that predicts action values from the current state.

Common mistakes

Confusing immediate reward with long-term action value.
Using too little exploration and never discovering better paths.
Training a deep RL agent without replay buffers or target networks, which often makes learning unstable.
Evaluating only reward and ignoring safety, cost, or constraint violations.
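
The first mistake in the list above, confusing immediate reward with long-term value, is easy to see with a small calculation. The following Python sketch uses made-up reward sequences and a hypothetical helper, discounted_return, to compare two choices from the same state.

```python
def discounted_return(rewards, gamma):
    """Sum rewards with the reward at step t weighted by gamma ** t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Made-up reward sequences for two choices from the same state.
grab_now = [1.0, 0.0, 0.0]       # small reward immediately, nothing after
wait_for_more = [0.0, 0.0, 8.0]  # nothing at first, a larger reward two steps later

gamma = 0.5
print(discounted_return(grab_now, gamma))       # 1.0
print(discounted_return(wait_for_more, gamma))  # 2.0
```

The first choice has the higher immediate reward, but the second has the higher discounted return, so a well-trained Q-value would rank it higher even with a fairly steep discount.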

Interview-style questions

What does a Q-value represent?
Write the intuition behind the Bellman update.
Why does deep RL use replay buffers?
What is epsilon-greedy exploration?

Related lessons

Reinforcement Learning Basics
Neural Network Architecture
Gradient Descent
RLHF - Human Feedback Training
AI Ethics & Bias

Related project/template CTA

After learning Q-learning, use the MLOps production checklist to think about monitoring, safety constraints, and rollback for any deployed decision system.