Learn Random Forests

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Simple theory: A random forest combines many decision trees and lets them vote or average. This usually makes predictions more stable than relying on one tree alone.

A single decision tree is unstable: small changes in the training data can produce a very different tree. What if you built 500 of them, each slightly different, and let them vote? That's a Random Forest. It relies on the wisdom-of-crowds principle: diverse, imperfect models whose errors are only weakly correlated tend, when combined, to outperform any one of them. Random forests have been a standard choice for tabular ML for well over a decade.

Bagging — Bootstrap Aggregating

Each tree is trained on a bootstrap sample: N samples drawn with replacement from the N training examples. With N=1000, expect ~632 unique samples per tree (the remaining draws are duplicates). The ~368 samples not selected form that tree's out-of-bag (OOB) set, which acts as a free validation set for it. After training, each sample is predicted using only the trees that never saw it (roughly 37% of them), and scoring these OOB predictions gives a generalisation estimate that is usually close to cross-validation, if often slightly pessimistic. This can reduce the need for a separate validation split.
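The ~632 / ~368 split can be checked numerically. The sketch below uses only the standard library; `bootstrap_split` is a hypothetical helper written for this illustration, and the 200 repetitions stand in for 200 trees.

```python
import math
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (unique in-bag set, out-of-bag set)."""
    drawn = [rng.randrange(n) for _ in range(n)]
    in_bag = set(drawn)
    oob = set(range(n)) - in_bag
    return in_bag, oob


n = 1000
rng = random.Random(0)  # fixed seed so the run is repeatable

# Average the unique in-bag fraction over many simulated trees.
fractions = []
for _ in range(200):
    in_bag, oob = bootstrap_split(n, rng)
    fractions.append(len(in_bag) / n)

simulated = sum(fractions) / len(fractions)
exact = 1 - (1 - 1 / n) ** n   # finite-N probability that a sample appears
limit = 1 - 1 / math.e         # large-N limit, about 0.632

print(f"simulated: {simulated:.3f}  exact: {exact:.3f}  limit: {limit:.3f}")
```

The simulated fraction should land close to both the finite-N value and the 1 − 1/e limit, and the in-bag and out-of-bag sets together always cover all N samples.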

Prediction (classification):
  ŷ = majority_vote(tree₁(x), tree₂(x), ..., tree_K(x))

Prediction (regression):
  ŷ = (1/K) × Σ treeᵢ(x)

Bootstrap unique coverage ≈ 63.2%
  P(sample appears) = 1 − (1 − 1/N)ᴺ → 1 − 1/e ≈ 0.632

Aggregation is simple but powerful: averaging reduces variance without adding bias, and the less correlated the trees are, the larger the reduction
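The two prediction rules above can be sketched in a few lines. This is a minimal illustration, not a library implementation: `forest_classify` and `forest_regress` are hypothetical names, and the per-tree predictions are made-up values.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    # Counter.most_common keeps first-seen order for equal counts,
    # so a tie goes to the label that appeared first in the list.
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_predictions)


# Made-up outputs of K = 5 trees for a single input x.
votes = ["spam", "ham", "spam", "spam", "ham"]
prices = [310.0, 295.0, 305.0, 290.0, 300.0]

label = forest_classify(votes)   # 3 of 5 trees say "spam"
price = forest_regress(prices)   # (310 + 295 + 305 + 290 + 300) / 5

print(label, price)
```

Some libraries vary the classification rule slightly. scikit-learn, for example, averages each tree's class probabilities and picks the class with the highest mean rather than counting hard votes.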

Feature subsampling — max_features

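At every split, the tree considers only a random subset of the features rather than all of them. This stops a few dominant features from driving the top splits of every tree, which decorrelates the trees and pushes them to discover different patterns. Common choices: √p for classification, p/3 for regression (where p = total features). Library defaults vary: scikit-learn, for example, uses all p features for regression unless max_features is set.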
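As a rough sketch of what happens at one split, the code below picks the candidate features under each rule. The function names and the rule label "third" (for p/3) are assumptions made for this example, and the rounding convention is one reasonable choice among several.

```python
import math
import random


def n_candidate_features(p, rule):
    """Number of features examined at one split under a named rule."""
    if rule == "sqrt":
        return max(1, int(math.sqrt(p)))
    if rule == "log2":
        return max(1, int(math.log2(p)))
    if rule == "third":  # hypothetical label for the p/3 regression rule
        return max(1, p // 3)
    raise ValueError(f"unknown rule: {rule}")


def sample_split_features(p, rule, rng):
    """Pick the feature indices that one split is allowed to consider."""
    k = n_candidate_features(p, rule)
    return sorted(rng.sample(range(p), k))  # sampled without replacement


rng = random.Random(42)
p = 16  # total number of features
for split in range(3):
    # Each split draws a fresh subset, so different splits see different features.
    print(split, sample_split_features(p, "sqrt", rng))
```

A fresh subset is drawn at every split, not once per tree, which is what distinguishes this from simply training each tree on a fixed subset of columns.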

Feature importance

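Feature importance (Mean Decrease Impurity):
  importance(fⱼ) = (1/K) × Σᵢ Σₜ [node t of treeᵢ splits on fⱼ] × (nₜ/N) × Δimpurity(t)
  (nₜ = samples reaching node t; impurity is Gini for classification, variance for regression; scores are normalised to sum to 1)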

Example (house price prediction, 4 features):
  size_sqft:    0.42  ← most predictive
  location:     0.31
  age_years:    0.18
  num_rooms:    0.09  ← least predictive, candidate to drop

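Use importance scores to spot candidate features to remove, then retrain faster. Impurity-based scores tend to favour continuous and high-cardinality features, so confirm on held-out data before dropping anything.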
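Reading the scores off a fitted model might look like the sketch below, which assumes scikit-learn and NumPy are installed. The data is synthetic and the feature names are invented for illustration; because this is a regression forest, the impurity being decreased is variance rather than Gini.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Synthetic target: feature 0 matters most, feature 1 a little, feature 2 not at all.
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

names = ["strong", "weak", "noise"]  # made-up names for the three columns
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based (MDI) importances, normalised to sum to 1.
importances = forest.feature_importances_
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:8s} {score:.3f}")
```

Even the pure-noise column usually gets a small non-zero score, because fully grown trees keep splitting on it deep in the tree. That is one reason a low score is a hint, not proof, that a feature can be dropped.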

Key hyperparameters

n_estimators: number of trees (100–500 typical; more trees rarely hurt accuracy but cost time and memory, with diminishing returns after a few hundred)
max_features: features per split ('sqrt' for classification, 'log2' also common)
max_depth: maximum depth per tree (None = fully grown; limit to reduce overfitting)
min_samples_leaf: minimum samples in a leaf (2–5 for noisy data, 1 for clean data)
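bootstrap: True uses bagging (default); False trains every tree on the full dataset, which also rules out OOB scoring
oob_score: True computes the OOB score (accuracy for classifiers, R² for regressors); requires bootstrap=True and gives a validation estimate without an extra split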
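The parameter names above match scikit-learn's RandomForestClassifier, so a minimal sketch of setting them, assuming scikit-learn is installed, could look like this. The synthetic dataset and the specific values are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real tabular dataset.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_features="sqrt",   # features considered at each split
    max_depth=None,        # grow each tree fully
    min_samples_leaf=1,    # allow single-sample leaves
    bootstrap=True,        # needed for OOB scoring
    oob_score=True,        # score each sample with trees that never saw it
    random_state=0,        # fixed seed for repeatability
)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_test, y_test)
print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

Comparing the two printed numbers gives a feel for how closely the OOB estimate tracks a held-out score on a given dataset.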

When to use Random Forests

Tabular data with mixed numeric + categorical features (many implementations, scikit-learn included, need categoricals encoded as numbers first)
Need feature importance scores for interpretability or feature selection
Quick strong baseline with minimal hyperparameter tuning (defaults work well)
Noisy training data — forest is robust, single tree memorises noise
Small datasets where you can't afford a validation split (use OOB score instead)
Prefer gradient boosting (XGBoost/LightGBM) when you need the last bit of accuracy and can afford tuning, or faster training on large datasets