Simple theory: A random forest combines many decision trees and lets them vote or average. This usually makes predictions more stable than relying on one tree alone.
A single decision tree is unstable: small changes in the training data can produce a very different tree, and it overfits easily. What if you built 500 of them, each slightly different, and let them vote? That's a Random Forest. It applies the wisdom-of-crowds principle: diverse, imperfect models, when combined, usually outperform any one of them. It has been a go-to algorithm for tabular data for over a decade.
Bagging — Bootstrap Aggregating
Each tree is trained on a bootstrap sample: N samples drawn with replacement from the training set. With N=1000, expect about 632 unique samples per tree (the remaining draws are duplicates). The roughly 368 samples not selected form the out-of-bag (OOB) set, a free validation set for that tree. After training, each sample is predicted using only the trees that never saw it, and scoring these OOB predictions gives a generalisation estimate that is usually close to cross-validation. This often reduces the need for a separate validation split.
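The bootstrap draw described above can be simulated in a few lines. This is a minimal sketch using only the standard library; the helper name `bootstrap_indices` and the fixed seed are illustrative assumptions, not part of any library.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 1000
sample = bootstrap_indices(n, rng)

in_bag = set(sample)             # unique rows this tree would train on
oob = set(range(n)) - in_bag     # rows this tree never sees (out-of-bag)

unique_fraction = len(in_bag) / n
print(f"unique in-bag rows: {len(in_bag)} ({unique_fraction:.1%})")
print(f"out-of-bag rows:    {len(oob)}")
```

The exact counts vary from draw to draw, but for N=1000 they should land near 632 in-bag and 368 out-of-bag rows.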
Prediction (classification):
ŷ = majority_vote(tree₁(x), tree₂(x), ..., treeₖ(x))
Prediction (regression):
ŷ = (1/K) × Σ treeᵢ(x)
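The two aggregation rules are small enough to write out directly. The sketch below assumes each tree has already produced a prediction; the function names and the sample values are hypothetical. Note that some libraries, scikit-learn among them, average per-class probabilities across trees rather than counting hard votes.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_predictions):
    """Majority vote over the class labels predicted by each tree."""
    # On a tie, Counter returns the label that was seen first.
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Average of the numeric predictions from each tree."""
    return mean(tree_predictions)

votes = ["spam", "ham", "spam", "spam", "ham"]   # made-up tree outputs
label = forest_classify(votes)                   # "spam" wins 3 votes to 2

estimates = [310.0, 295.0, 305.0, 290.0]         # made-up tree outputs
price = forest_regress(estimates)                # 1200 / 4 = 300.0
```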
Bootstrap unique coverage ≈ 63.2%
P(sample appears) = 1 − (1 − 1/N)ᴺ → 1 − 1/e ≈ 0.632
Aggregation is simple but powerful: averaging reduces variance without increasing bias.
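The variance-reduction effect of averaging can be checked with a toy simulation. This sketch assumes K fully independent, unbiased predictors with Gaussian noise, which is an idealisation: real trees in a forest are correlated, so the reduction in practice is smaller than the 1/K seen here. All constants are hypothetical.

```python
import random
from statistics import mean, pvariance

rng = random.Random(42)
TRUE_VALUE = 10.0   # hypothetical target every model tries to predict
NOISE_SD = 2.0      # hypothetical noise of a single model
K = 50              # models averaged per ensemble
TRIALS = 2000       # repetitions used to estimate the variance

def noisy_prediction():
    """One unbiased but noisy prediction of TRUE_VALUE."""
    return rng.gauss(TRUE_VALUE, NOISE_SD)

single = [noisy_prediction() for _ in range(TRIALS)]
ensemble = [mean(noisy_prediction() for _ in range(K)) for _ in range(TRIALS)]

var_single = pvariance(single)       # close to NOISE_SD ** 2
var_ensemble = pvariance(ensemble)   # close to NOISE_SD ** 2 / K
print(f"single model variance: {var_single:.3f}")
print(f"ensemble variance:     {var_ensemble:.3f}")
print(f"ensemble mean:         {mean(ensemble):.3f}")
```

The ensemble mean stays at the true value (no added bias) while its spread shrinks, which is the behaviour the averaging argument relies on.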
Feature subsampling — max_features
At every split, each tree considers only a random subset of the features rather than all of them. This stops a few dominant features from being chosen near the top of every tree, so the trees are less correlated and different trees discover different patterns. Common choices: √p for classification, p/3 for regression (where p = total features). Library defaults vary, so check the one you use.
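A rough sketch of the per-split feature draw, using the √p and p/3 rules mentioned above. The function `candidate_features` is a hypothetical helper written for illustration; real implementations do this inside the tree-building loop.

```python
import math
import random

def candidate_features(n_features, rng, task="classification"):
    """Pick the random subset of feature indices one split may consider."""
    if task == "classification":
        k = max(1, int(math.sqrt(n_features)))   # sqrt(p) rule
    else:
        k = max(1, n_features // 3)              # p/3 rule
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(7)
p = 16
for split in range(3):
    # Each split gets a fresh draw, so the subsets usually differ.
    print(f"split {split}: consider features {candidate_features(p, rng)}")
```

With p=16, each classification split looks at 4 of the 16 features, so a strong feature is simply unavailable at most splits and other features get a chance to be used.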
Feature importance
Feature importance (Mean Decrease Impurity):
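importance(fⱼ) = (1/K) × Σᵢ Σₜ [node t of treeᵢ splits on fⱼ] × (nₜ/N) × ΔGini(t)
where nₜ/N is the fraction of training samples reaching node t, and the scores are normalised to sum to 1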
Example (house price prediction, 4 features):
size_sqft: 0.42 ← most predictive
location: 0.31
age_years: 0.18
num_rooms: 0.09 ← least useful, could drop
Use importance scores to identify and remove irrelevant features, then retrain faster.
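For a concrete look at impurity-based importances, the sketch below uses scikit-learn's `feature_importances_` attribute on synthetic data. The dataset and feature names are made up for illustration (they are not the house-price example above), and the exact scores depend on the data and seed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 3 columns are the informative
# ones and the last 3 are pure noise. The names are invented for display.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
names = ["inf_0", "inf_1", "inf_2", "noise_0", "noise_1", "noise_2"]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_   # impurity-based, sums to 1

for idx in np.argsort(importances)[::-1]:   # highest importance first
    print(f"{names[idx]:<8} {importances[idx]:.3f}")
```

Features near the bottom of the printed ranking are candidates for removal, though it is worth confirming with a retrain that accuracy does not drop.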
Key hyperparameters
- n_estimators: number of trees (100–500 typical; more trees rarely hurt accuracy but cost more time and memory, with diminishing returns often after ~200)
- max_features: features per split ('sqrt' for classification, 'log2' also common)
- max_depth: maximum depth per tree (None = fully grown; limit to reduce overfitting)
- min_samples_leaf: minimum samples in a leaf (2–5 for noisy data, 1 for clean data)
- bootstrap: True uses bagging (default); False trains all trees on full dataset
- oob_score: True computes the OOB score (accuracy for classifiers, R² for regressors); requires bootstrap=True and gives free validation without a test split
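The hyperparameters above map directly onto scikit-learn's `RandomForestClassifier`. The values and the synthetic dataset below are illustrative assumptions, a starting point rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real tabular dataset.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=4, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_features="sqrt",    # features considered at each split
    max_depth=None,         # grow each tree fully
    min_samples_leaf=1,     # smallest allowed leaf
    bootstrap=True,         # bagging; needed for OOB scoring
    oob_score=True,         # score each row with trees that never saw it
    random_state=0,         # repeatable results
)
forest.fit(X, y)
print(f"OOB accuracy: {forest.oob_score_:.3f}")
```

Raising `min_samples_leaf` or setting a finite `max_depth` are the usual first moves if the OOB score sits far below the training accuracy.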
When to use Random Forests
- Tabular data with mixed numeric + categorical features
- Need feature importance scores for interpretability or feature selection
- Quick strong baseline with minimal hyperparameter tuning (defaults work well)
- Noisy training data — forest is robust, single tree memorises noise
- Small datasets where you can't afford a validation split (use OOB score instead)
- Consider gradient boosting (XGBoost/LightGBM) instead when you need the last bit of accuracy or faster training on large datasets, at the cost of more tuning
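The point about noisy training data can be tried out with a quick comparison between one fully grown tree and a forest. This is a sketch on synthetic data with deliberately noisy labels; the dataset settings are assumptions chosen for illustration, and results on real data will differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 assigns a random class to about 20% of the samples (label noise).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

tree_train = tree.score(X_train, y_train)   # a fully grown tree fits the noise
tree_test = tree.score(X_test, y_test)
forest_test = forest.score(X_test, y_test)

print(f"single tree: train={tree_train:.3f}  test={tree_test:.3f}")
print(f"forest:      test={forest_test:.3f}")
```

A large gap between the single tree's training and test accuracy is the memorisation the list above refers to; the forest's test score is typically higher on data like this.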