Learn Decision Trees

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Simple theory: A decision tree predicts by asking a sequence of simple yes/no feature questions. Each split narrows the data until the final leaf gives a class or number prediction.

A decision tree turns the training problem into a series of questions: 'Is income above $50k? Is age below 30? Is credit score above 700?' Each question splits the data into two groups. Keep splitting until each group is mostly one class. The result is a flowchart any human can follow and verify.

Gini impurity — measuring purity

At every node, the algorithm scores every possible split on every feature and picks the one that produces the purest child nodes. Gini impurity measures how mixed the classes are at a node.

Gini(node) = 1 − Σ pᵢ²

pᵢ = fraction of samples at this node belonging to class i

Examples:
  Pure node (all class A):     Gini = 1 − 1² = 0.00  ← perfect
  50/50 split (binary class):  Gini = 1 − (0.5² + 0.5²) = 0.50  ← worst
  70/30 split:                 Gini = 1 − (0.7² + 0.3²) = 0.42

Information Gain (entropy-based) gives the same splits — Gini is faster to compute

Choosing the best split

Weighted Gini after split on feature F at threshold t:
  Gini_split = (n_left/n)·Gini(left) + (n_right/n)·Gini(right)

Algorithm tries every (feature, threshold) pair → picks lowest Gini_split

Example: 50 samples, split on income ≥ 40k
  Left  (20 samples, 18 class-0, 2 class-1): Gini = 1−(0.9²+0.1²) = 0.18
  Right (30 samples, 12 class-0, 18 class-1): Gini = 1−(0.4²+0.6²) = 0.48
  Gini_split = (20/50)×0.18 + (30/50)×0.48 = 0.072 + 0.288 = 0.36

Lower weighted Gini = better split. Try all features and all thresholds, pick the best.

Key hyperparameters

max_depth: limits how deep the tree grows — most important overfitting control
min_samples_split: minimum samples required to split a node (higher = simpler tree)
min_samples_leaf: minimum samples in any leaf (prevents tiny, noisy leaves)
max_features: how many features to consider at each split (randomness for forests)
criterion: 'gini' (faster) or 'entropy' (information gain) — usually similar results

When to use decision trees

Need full interpretability (regulated industries: banking, medicine, law)
Mixed numeric and categorical features with minimal preprocessing
Fast inference needed (tree lookup is O(depth) — extremely fast at prediction time)
As a building block for ensembles — Random Forests and Gradient Boosting are collections of trees
Don't use alone for tabular ML competitions — ensembles always win