Simple theory: Gradient boosting builds many small models in sequence. Each new model focuses on the mistakes left by the previous models, so the final ensemble becomes stronger step by step.
Bagging (Random Forests) builds many trees in parallel and averages them. Boosting does the opposite: build trees sequentially, each one specifically targeting the mistakes of the previous. It's less about averaging diverse opinions and more about iteratively correcting errors — a fundamentally different strategy that achieves state-of-the-art results on tabular data.
The Gradient Boosting algorithm
Start with a base prediction F₀ (e.g. the mean for regression). For each round m=1 to M: compute residuals rᵢ = yᵢ − Fₘ₋₁(xᵢ). Train a shallow tree hₘ to predict these residuals. Add to the model: Fₘ(x) = Fₘ₋₁(x) + η·hₘ(x). The residual is the negative gradient of MSE loss — hence 'gradient' boosting. Each tree is a small gradient descent step in function space.
Gradient Boosting update rule:
F₀(x) = mean(y) ← initial prediction
Fₘ(x) = Fₘ₋₁(x) + η × hₘ(x)
where:
η = learning rate (shrinkage, e.g. 0.1)
hₘ = shallow tree trained on residuals rᵢ = yᵢ − Fₘ₋₁(xᵢ)
After M rounds:
Final prediction = F₀ + η·h₁ + η·h₂ + ... + η·hₘ
Example (3 rounds predicting house price, η=0.5):
Round 0: F₀ = 300k (mean)
Round 1: h₁ predicts residuals → F₁ = 300k + 0.5×(+40k) = 320k
Round 2: h₂ targets remaining error → F₂ = 320k + 0.5×(+12k) = 326k
Round 3: F₃ = 326k + 0.5×(+4k) = 328k ← true: 329kLearning rate η controls step size — small η + many trees beats large η + few trees
XGBoost — why it dominates competitions
XGBoost (eXtreme Gradient Boosting) won virtually every Kaggle competition on tabular data from 2016–2020. It supercharges Gradient Boosting with: second-order Taylor expansion (faster convergence), built-in L1+L2 regularisation (prevents overfitting), column and row subsampling (adds randomness like Random Forests), parallel tree building on a single machine, and native handling of missing values (automatically learns which direction to send missing samples at each split).
XGBoost objective (each round):
Obj = Σᵢ L(yᵢ, ŷᵢ) + Σₖ Ω(fₖ)
where:
L = loss function (MSE for regression, log-loss for classification)
Ω(f) = γT + (λ/2)‖w‖² ← regularisation
T = number of leaves in the tree
w = leaf weights
γ = minimum loss reduction to make a split (pruning)
λ = L2 weight on leaf scores
Key params: n_estimators=500, max_depth=6, learning_rate=0.05, subsample=0.8XGBoost = Gradient Boosting + second-order gradients + regularisation + speed
Gradient Boosting vs Random Forest
- Training speed: Random Forest faster (parallel trees) vs Gradient Boosting slow (sequential)
- Accuracy: Gradient Boosting usually wins on clean, large tabular datasets
- Overfitting: Random Forest more robust; Gradient Boosting needs careful learning rate + depth tuning
- Interpretability: both provide feature importance; individual trees readable in both
- Missing values: XGBoost handles natively; Random Forest needs imputation
- Memory: Gradient Boosting uses less (shallow trees); Random Forest needs all trees stored