Q-learning is a reinforcement learning method that estimates how valuable it is to take a given action in a given state. The agent learns a Q-value: the expected cumulative future reward if it takes that action and then acts optimally afterward.
Why it matters
Q-learning introduces the core idea of learning long-term action value. Deep RL extends it to complex inputs such as game screens, robot sensors, or high-dimensional simulations.
Key terms
- Q-value: expected cumulative future reward for taking an action in a state and acting optimally afterward.
- Q-table: lookup table of state-action values for small problems.
- Bellman update: rule that improves Q-values using reward plus estimated future value.
- Learning rate (alpha): how strongly new experience updates old estimates.
- Discount factor (gamma): how much future rewards count relative to immediate ones.
- Epsilon-greedy: exploration method that sometimes chooses a random action.
- Deep Q-Network: neural network that approximates Q-values for large state spaces.
- Replay buffer: stored experiences sampled during training to stabilize learning.
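Epsilon-greedy from the list above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `epsilon_greedy` and the example values in `q_row` are assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values for one state."""
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for three actions in a single state.
q_row = [0.1, 0.7, 0.3]

# With epsilon=0.0 the agent never explores, so it picks index 1 (value 0.7).
greedy_action = epsilon_greedy(q_row, epsilon=0.0)
```

In practice epsilon often starts high and is reduced over training, so the agent explores early and exploits more as its estimates improve.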
Q(s,a) <- Q(s,a) + alpha * [reward + gamma * max over next_action of Q(next_state, next_action) - Q(s,a)]
Q-learning moves the current action value toward the reward plus the discounted best value of the next state.
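To make the update rule above concrete, here is a small Python sketch of one tabular update with worked numbers. The function name `q_update`, the state and action labels, and the `done` flag (which treats future value as zero once an episode has ended) are assumptions for illustration and are not part of the rule as written above.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9, done=False):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    # Best estimated value reachable from the next state.
    # Assumption: a finished episode has no future value.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next      # reward plus discounted future value
    td_error = td_target - Q[(state, action)]   # gap between target and current estimate
    Q[(state, action)] += alpha * td_error      # move part of the way toward the target
    return Q[(state, action)]

# Q-table that defaults every unseen state-action pair to 0.0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0  # hypothetical existing estimate

# target = 1.0 + 0.9 * 2.0 = 2.8, error = 2.8 - 0.0, new value = 0.5 * 2.8, about 1.4
new_value = q_update(Q, "s0", "right", reward=1.0, next_state="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
```

A larger alpha moves the estimate further toward the target on each step. A larger gamma gives the next state's value more weight relative to the immediate reward.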
Deep RL basics
- A neural network receives the state and predicts Q-values for possible actions.
- The agent chooses actions using a mix of exploration and best-known Q-values.
- Experience replay samples old transitions so the model does not learn only from recent steps.
- A target network, a periodically updated copy of the Q-network, is often used to compute update targets, which makes training more stable.
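The replay and target-network ideas above can be sketched without a deep learning library. In this rough Python illustration, `ReplayBuffer`, `td_targets`, and `toy_target_q` are hypothetical names, and `toy_target_q` is a plain function standing in for a frozen copy of the network. A real Deep Q-Network would train the online network toward these targets with gradient descent and periodically refresh the target copy.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling mixes old and recent transitions in each batch.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma=0.99):
    """Compute one training target per transition using the target network."""
    targets = []
    for state, action, reward, next_state, done in batch:
        # Assumption: terminal transitions contribute no future value.
        best_next = 0.0 if done else max(target_q(next_state))
        targets.append(reward + gamma * best_next)
    return targets


def toy_target_q(state):
    # Stand-in for a frozen network: returns one Q-value per action.
    return [0.0, float(state)]


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Only the three most recent transitions (steps 2, 3, 4) remain stored.
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=(step == 4))

batch = buffer.sample(batch_size=2)
targets = td_targets(batch, toy_target_q, gamma=0.5)
```

Computing targets from a separate, slowly changing copy keeps the network from chasing values that shift with every one of its own updates.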
Visual explanation suggestion
Show a grid with each action arrow labeled by Q-value. Animate one Bellman update after a reward, then switch to a neural network view that predicts action values from the current state.
Common mistakes
- Confusing immediate reward with long-term action value.
- Using too little exploration and never discovering better paths.
- Training a deep RL agent without replay buffers or target networks, which tends to make learning unstable.
- Evaluating only reward and ignoring safety, cost, or constraint violations.
Interview-style questions
- What does a Q-value represent?
- Explain the intuition behind the Bellman update.
- Why does deep RL use replay buffers?
- What is epsilon-greedy exploration?
Related lessons
- Reinforcement Learning Basics
- Neural Network Architecture
- Gradient Descent
- RLHF - Human Feedback Training
- AI Ethics & Bias
Related project/template CTA
After learning Q-learning, use the MLOps production checklist to think about monitoring, safety constraints, and rollback for any deployed decision system.