Classical ML - Intermediate - 12 min

Learn Decision Trees

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Last updated: 2026-05-13.

Simple theory: A decision tree predicts by asking a sequence of simple yes/no feature questions. Each split narrows the data until the final leaf gives a class or number prediction.

A decision tree turns a prediction problem into a series of questions: 'Is income above $50k? Is age below 30? Is credit score above 700?' Each question splits the data into two groups, and the algorithm keeps splitting until each group is mostly one class or a stopping rule (such as a depth limit) is reached. The result is a flowchart that a person can follow and verify.
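
The flowchart idea can be sketched as a nested dictionary plus a short walk from root to leaf. This is only an illustration: the thresholds echo the example questions, while the tree shape, the leaf labels ('approve', 'deny'), and the names TREE and predict are assumptions invented here, not a trained model.

```python
# Hypothetical hand-built tree. Thresholds echo the example questions;
# the structure and the leaf labels are made up for illustration.
TREE = {
    "feature": "income", "threshold": 50_000,
    "left": {   # income <= 50k
        "feature": "credit_score", "threshold": 700,
        "left": {"label": "deny"},
        "right": {"label": "approve"},
    },
    "right": {  # income > 50k
        "feature": "age", "threshold": 30,
        "left": {"label": "deny"},
        "right": {"label": "approve"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, answering one yes/no question per level."""
    while "label" not in node:
        goes_right = sample[node["feature"]] > node["threshold"]
        node = node["right"] if goes_right else node["left"]
    return node["label"]


applicant = {"income": 62_000, "age": 41, "credit_score": 655}
print(predict(TREE, applicant))  # approve
```

Each prediction visits one node per level, so the path taken can be read back as a plain list of answered questions.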

Gini impurity — measuring purity

At every node, the algorithm scores every possible split on every feature and picks the one that produces the purest child nodes. Gini impurity measures how mixed the classes are at a node.

Gini(node) = 1 − Σ pᵢ²

pᵢ = fraction of samples at this node belonging to class i

Examples:
  Pure node (all class A):     Gini = 1 − 1² = 0.00  ← perfect
  50/50 split (binary class):  Gini = 1 − (0.5² + 0.5²) = 0.50  ← worst
  70/30 split:                 Gini = 1 − (0.7² + 0.3²) = 0.42
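
The three examples above can be checked with a few lines of Python. This is a minimal sketch; the function name gini and the label lists are chosen here for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(round(gini(["A"] * 10), 2))             # pure node  -> 0.0
print(round(gini(["A"] * 5 + ["B"] * 5), 2))  # 50/50      -> 0.5
print(round(gini(["A"] * 7 + ["B"] * 3), 2))  # 70/30      -> 0.42
```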

Information gain (entropy-based) usually selects very similar splits, though not always identical ones. Gini is slightly cheaper to compute because it needs no logarithms.
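
For comparison, here is a small sketch of both impurity measures side by side. The entropy formula (−Σ pᵢ·log₂ pᵢ) is not given in this lesson and is added here as the standard definition; information gain is the parent's entropy minus the weighted entropy of the children. Function names are illustrative.

```python
import math


def gini(p):
    """Gini impurity from a list of class fractions that sum to 1."""
    return 1.0 - sum(pi ** 2 for pi in p)


def entropy(p):
    """Shannon entropy in bits; classes with p_i = 0 contribute nothing."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)


# Both measures are 0 for a pure node and peak at a 50/50 mix,
# but they are on different scales in between.
for p in ([1.0, 0.0], [0.9, 0.1], [0.7, 0.3], [0.5, 0.5]):
    print(p, round(gini(p), 3), round(entropy(p), 3))
```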

Choosing the best split

Weighted Gini after split on feature F at threshold t:
  Gini_split = (n_left/n)·Gini(left) + (n_right/n)·Gini(right)

Algorithm tries every (feature, threshold) pair → picks lowest Gini_split

Example: 50 samples, split on income ≥ 40k
  Left  (20 samples, 18 class-0, 2 class-1): Gini = 1−(0.9²+0.1²) = 0.18
  Right (30 samples, 12 class-0, 18 class-1): Gini = 1−(0.4²+0.6²) = 0.48
  Gini_split = (20/50)×0.18 + (30/50)×0.48 = 0.072 + 0.288 = 0.36

Lower weighted Gini = better split. For comparison, the parent node in this example (30 class-0, 20 class-1) has Gini = 1−(0.6²+0.4²) = 0.48, so the split lowers impurity by 0.12.
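
The worked example and the exhaustive search can be sketched as follows. This is a brute-force illustration, not how production libraries implement it. The names weighted_gini and best_split, the use of midpoints between consecutive values as candidate thresholds, and the toy data set are all assumptions made here.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted average of the two child impurities."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels):
    """Try every (feature, threshold) pair; return (score, feature_index, threshold)."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Assumed candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = weighted_gini(left, right)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


# Reproduce the worked example: 18/2 on the left, 12/18 on the right.
left = [0] * 18 + [1] * 2
right = [0] * 12 + [1] * 18
print(round(weighted_gini(left, right), 2))  # 0.36

# Made-up toy data: columns are [income in thousands, age].
rows = [[25, 40], [30, 22], [35, 35], [45, 28], [50, 51], [60, 33]]
labels = [0, 0, 0, 1, 1, 1]
print(best_split(rows, labels))  # (0.0, 0, 40.0): income at 40 separates the classes
```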

Key hyperparameters

  • max_depth: limits how deep the tree grows; usually the first control to reach for against overfitting
  • min_samples_split: minimum samples required to split a node (higher = simpler tree)
  • min_samples_leaf: minimum samples in any leaf (prevents tiny, noisy leaves)
  • max_features: how many features to consider at each split (the source of per-split randomness in Random Forests)
  • criterion: 'gini' (faster) or 'entropy' (information gain) — usually similar results
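
To see what the stopping rules do, here is a minimal from-scratch sketch of a tree builder that honors max_depth, min_samples_split, and min_samples_leaf. It is not scikit-learn's implementation: the parameter names simply mirror the list above, and the helper names (build, predict, count_leaves, accuracy) and the noisy one-feature toy data are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build(rows, labels, depth=0, max_depth=None,
          min_samples_split=2, min_samples_leaf=1):
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping rules: pure node, too few samples to split, or depth limit hit.
    if (len(set(labels)) == 1
            or len(labels) < min_samples_split
            or (max_depth is not None and depth >= max_depth)):
        return {"label": majority}
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            li = [i for i, r in enumerate(rows) if r[f] <= t]
            ri = [i for i, r in enumerate(rows) if r[f] > t]
            # Reject splits that would create a leaf smaller than allowed.
            if len(li) < min_samples_leaf or len(ri) < min_samples_leaf:
                continue
            score = sum(len(ix) / len(rows) * gini([labels[i] for i in ix])
                        for ix in (li, ri))
            if best is None or score < best[0]:
                best = (score, f, t, li, ri)
    if best is None:
        return {"label": majority}
    _, f, t, li, ri = best
    opts = dict(max_depth=max_depth, min_samples_split=min_samples_split,
                min_samples_leaf=min_samples_leaf)
    return {
        "feature": f, "threshold": t,
        "left": build([rows[i] for i in li], [labels[i] for i in li], depth + 1, **opts),
        "right": build([rows[i] for i in ri], [labels[i] for i in ri], depth + 1, **opts),
    }


def predict(node, row):
    while "label" not in node:
        node = node["right"] if row[node["feature"]] > node["threshold"] else node["left"]
    return node["label"]


def count_leaves(node):
    if "label" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)


# Made-up noisy data: one feature, labels mostly 0 for small x and 1 for large x.
rows = [[1], [2], [3], [4], [5], [6], [7], [8]]
labels = [0, 0, 1, 0, 1, 1, 0, 1]

full = build(rows, labels)                  # no limits: grows until every leaf is pure
shallow = build(rows, labels, max_depth=1)  # a single split
for name, tree in (("full", full), ("shallow", shallow)):
    print(name, count_leaves(tree), accuracy(tree, rows, labels))
```

The unrestricted tree fits every training point, including the noisy ones, while the depth-limited tree accepts some training error in exchange for a simpler rule. Which setting generalizes better has to be checked on held-out data.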

When to use decision trees

  • Need full interpretability (regulated industries: banking, medicine, law)
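  • Mixed numeric and categorical features with minimal preprocessing (no feature scaling is needed, though some libraries, such as scikit-learn, still require categorical features to be encoded as numbers)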
  • Fast inference needed (tree lookup is O(depth) — extremely fast at prediction time)
  • As a building block for ensembles — Random Forests and Gradient Boosting are collections of trees
  • Avoid relying on a single tree when predictive accuracy is the main goal (for example, tabular ML competitions): ensembles of trees almost always score higher

Practice questions

  1. What does a Gini impurity of 0 mean at a node?
  2. A decision tree with max_depth=None trained on 1000 examples achieves 100% training accuracy. What is most likely true?
  3. Why are decision trees called 'interpretable' models?
  4. What is the main weakness of a single decision tree?
