Learn Reinforcement Learning Basics

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Supervised learning teaches a model from labeled examples. Reinforcement learning teaches an agent from consequences. The agent acts in an environment, receives rewards or penalties, and slowly learns which actions in which states maximise long-term reward. This is how AlphaGo, robotic arms, and ad-bidding systems learn to make decisions.

The four core ingredients

State (s) — what the agent perceives right now: pixels on screen, joint angles, board position.
Action (a) — what the agent chooses to do: move left, place stone, send query.
Reward (r) — a scalar signal from the environment: +1 for a goal, −1 for a fall, 0 otherwise.
Policy (π) — the agent's rule: π(a|s) = probability of action a in state s.
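
As a minimal sketch of the policy definition above, a stochastic policy can be stored as a table of probabilities and sampled from. The state names, action names, and probabilities here are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical policy table: POLICY[state][action] = π(a|s).
# Each row must sum to 1 because it is a probability distribution over actions.
POLICY = {
    "near_goal": {"left": 0.1, "right": 0.9},
    "far_from_goal": {"left": 0.5, "right": 0.5},
}


def sample_action(policy, state, rng):
    """Draw one action a with probability π(a|s)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs draw the same action
action = sample_action(POLICY, "near_goal", rng)
```

A deterministic policy is the special case where one action in each row has probability 1.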

The Markov Decision Process (MDP)

RL formalises this as an MDP: at each time step the agent observes state s and picks action a, and the environment returns the next state s' and reward r. The Markov assumption says the next state and reward depend only on the current state and action, not on the entire history. This assumption makes learning tractable, because the agent can choose well from the current state alone.
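
The interaction loop described above can be sketched with a toy environment. The `Corridor` class, its reward of +1 at the goal cell, and the `run_episode` helper are all assumptions made up for this illustration, not part of any library.

```python
import random


class Corridor:
    """Hypothetical 5-cell corridor: the agent starts in cell 0, the goal is cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position.
        # The result depends only on the current state and action (Markov).
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Roll out one episode and return its (s, a, r, s') transitions."""
    transitions = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)                  # agent picks an action from the state
        s_next, r, done = env.step(a)  # environment returns s' and r
        transitions.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return transitions


rng = random.Random(0)
episode = run_episode(Corridor(), lambda s: rng.choice([-1, 1]))
```

Each stored tuple (s, a, r, s') is one step of the MDP; learning algorithms consume exactly these transitions.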

Goal:  maximise   G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + …

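γ ∈ [0, 1] is the discount factor (with bounded rewards, γ < 1 keeps the sum finite when episodes never end):
  γ = 0     → only immediate reward matters
  γ = 0.9   → moderate look-ahead (~10 steps, from 1/(1−γ))
  γ = 0.99  → long-horizon planning (~100 steps)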

Cumulative discounted reward (the return): the quantity RL optimises in expectation.
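
To make the formula for G_t concrete, here is a small sketch that computes the discounted return of a finite reward sequence. The function names and the example rewards are assumptions for illustration; `effective_horizon` is the usual 1/(1−γ) rule of thumb, not an exact cutoff.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = r_1 + γ·r_2 + γ²·r_3 + ..., where rewards[k] holds r_{k+1}."""
    g = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + γ·G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def effective_horizon(gamma):
    """Rule-of-thumb look-ahead 1 / (1 - γ); only defined for γ < 1."""
    return 1.0 / (1.0 - gamma)


rewards = [0.0, 0.0, 1.0]  # hypothetical episode: the goal is reached on step 3
g_short_sighted = discounted_return(rewards, 0.0)
g_moderate = discounted_return(rewards, 0.9)
```

With γ = 0 the delayed goal reward contributes nothing to the return, while with γ = 0.9 it is scaled by γ², which is why a larger γ lets the agent value rewards that arrive later.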

Exploration vs exploitation

Early in training, the agent doesn't know which actions are good. If it always picks the action it currently thinks is best (exploit), it may miss something better. If it always picks randomly (explore), it never uses what it has learned. A common solution is ε-greedy: pick the best-known action with probability 1−ε and a random action with probability ε. ε starts high (lots of exploration) and decays over time (more exploitation as the agent's estimates improve).
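
A minimal sketch of ε-greedy action selection with a decaying ε follows. The list of action-value estimates (`q_values`) and the exponential schedule with its `start`, `end`, and `decay` parameters are assumptions for illustration; other decay schedules are equally valid.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability ε, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform over all actions
    # exploit: index of the largest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, start=1.0, end=0.05, decay=0.995):
    """Assumed schedule: exponential decay from start, floored at end."""
    return max(end, start * decay ** step)


rng = random.Random(0)
q_values = [0.2, 0.7, 0.1]  # hypothetical value estimates for three actions
choices = [
    epsilon_greedy(q_values, decayed_epsilon(step), rng)
    for step in range(1000)
]
```

Keeping a small floor on ε means the agent never stops exploring entirely, which helps if its early estimates were wrong.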
