Advanced Topics - Advanced - 15 min

Learn Reinforcement Learning Basics


Last updated: 2026-05-13.

Supervised learning teaches a model from labeled examples. Reinforcement learning teaches an agent from consequences. The agent acts in an environment, receives rewards or penalties, and gradually learns which actions in which states maximise long-term reward. This approach is used in game-playing systems such as AlphaGo, in robotic arm control, and in ad bidding.

The four core ingredients

  • State (s) — what the agent perceives right now: pixels on screen, joint angles, board position.
  • Action (a) — what the agent chooses to do: move left, place stone, send query.
  • Reward (r) — a scalar signal from the environment: +1 for a goal, −1 for a fall, 0 otherwise.
  • Policy (π) — the agent's rule: π(a|s) = probability of action a in state s.
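
As a rough illustration of the policy notation, π(a|s) can be stored as a table of probabilities per state and sampled from. The state names, action names, and probabilities below are made up for the example, and sample_action is a hypothetical helper, not part of any library.

```python
import random

# Hypothetical policy table: policy[s][a] = π(a|s). Each row sums to 1.
policy = {
    "near_edge": {"left": 0.9, "right": 0.1},
    "centre": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state, rng):
    """Draw one action a with probability π(a|s)."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]

rng = random.Random(0)  # seeded so the run is repeatable
draws = [sample_action(policy, "near_edge", rng) for _ in range(1000)]
left_share = draws.count("left") / len(draws)
print(left_share)  # close to 0.9, but not exactly 0.9
```

A deterministic policy is the special case where one action in each row has probability 1.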

The Markov Decision Process (MDP)

RL formalises this as an MDP: at each time step the agent observes state s and picks action a, and the environment returns the next state s' and reward r. The Markov assumption says the next state and reward depend only on the current state and action, not on the entire history. This assumption makes learning tractable, because the agent only has to reason about the current state rather than everything that came before it.
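
The interaction loop can be sketched with a toy environment. Everything here is an assumption made for illustration: a five-cell corridor with the goal at the right end, a reward of +1 for reaching it, and an agent that acts uniformly at random. Note that step takes only the current state and action, which is the Markov assumption in code form.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal (assumed layout)
ACTIONS = (-1, +1)  # step left or step right

def step(state, action):
    """Environment dynamics: the outcome depends only on (state, action)."""
    next_state = min(max(state + action, 0), N_STATES - 1)  # walls at both ends
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def run_episode(rng, max_steps=1000):
    """Roll out a uniformly random policy and record (s, a, r, s') transitions."""
    state, trajectory = 0, []
    for _ in range(max_steps):
        action = rng.choice(ACTIONS)
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory

trajectory = run_episode(random.Random(0))
print(len(trajectory), trajectory[-1])
```

A learning agent would replace rng.choice with a policy that improves from the recorded transitions; the loop itself stays the same.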

Goal:  maximise   G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + …

γ ∈ [0, 1] is the discount factor; the effective look-ahead is roughly 1/(1−γ) steps:
  γ = 0     → only immediate reward matters
  γ = 0.9   → moderate look-ahead (~10 steps)
  γ = 0.99  → long-horizon planning (~100 steps)

Cumulative discounted reward — the quantity RL optimises.
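
A minimal sketch of computing G_t from a finite list of rewards, assuming rewards[k] holds r_{t+k+1}. Working backwards uses the recursion G_t = r_{t+1} + γ·G_{t+1}, which is the same sum written differently. The second call illustrates where the "~10 steps" figure for γ = 0.9 comes from: with a constant reward of 1, the sum approaches 1/(1−γ).

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):  # backwards pass: G = r + gamma * G_next
        g = r + gamma * g
    return g

# Reward of 1 arrives on the third step, so it is discounted by gamma squared.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # approximately 0.81

# Constant reward of 1 for many steps: the total approaches 1 / (1 - 0.9) = 10.
print(discounted_return([1.0] * 200, gamma=0.9))      # approximately 10
```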

Exploration vs exploitation

Early in training, the agent doesn't know which actions are good. If it always picks the action it currently thinks is best (exploit), it may never discover something better. If it always picks randomly (explore), it never makes use of what it has learned. A common, simple solution is ε-greedy: pick the best action with probability 1−ε, and a random action with probability ε. ε starts high (lots of exploration) and decays over time (more exploitation as the agent's estimates become more reliable).
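
A minimal sketch of ε-greedy selection with a decaying ε. The list q_values stands for the agent's current estimate of how good each action is; it and the exponential decay schedule (start, end, decay) are assumptions chosen for the example, and other schedules such as linear decay are equally valid.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def decayed_epsilon(step, start=1.0, end=0.05, decay=0.995):
    """Hypothetical schedule: shrink epsilon each step, never going below end."""
    return max(end, start * decay ** step)

rng = random.Random(0)
q_values = [0.1, 0.7, 0.2]  # made-up value estimates for three actions

print(epsilon_greedy(q_values, 0.0, rng))  # 1: with epsilon = 0 the choice is always greedy
for step in (0, 100, 1000):
    print(step, decayed_epsilon(step))
```

Because the random branch can also land on the best action, that action is chosen slightly more often than 1−ε in practice. Keeping a small floor on ε means the agent never stops exploring entirely.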

Practice questions

  1. What is the agent's goal in reinforcement learning?
  2. What does the discount factor γ control?
  3. What does ε in ε-greedy do?
  4. Why is the Markov assumption useful?
