Learn Gradient Boosting & XGBoost

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Simple theory: Gradient boosting builds many small models in sequence. Each new model focuses on the mistakes left by the previous models, so the final ensemble becomes stronger step by step.

Bagging (Random Forests) builds many trees in parallel and averages them. Boosting does the opposite: build trees sequentially, each one specifically targeting the mistakes of the previous. It's less about averaging diverse opinions and more about iteratively correcting errors — a fundamentally different strategy that achieves state-of-the-art results on tabular data.

The Gradient Boosting algorithm

Start with a base prediction F₀ (e.g. the mean for regression). For each round m=1 to M: compute residuals rᵢ = yᵢ − Fₘ₋₁(xᵢ). Train a shallow tree hₘ to predict these residuals. Add to the model: Fₘ(x) = Fₘ₋₁(x) + η·hₘ(x). The residual is the negative gradient of MSE loss — hence 'gradient' boosting. Each tree is a small gradient descent step in function space.

Gradient Boosting update rule:
  F₀(x) = mean(y)  ← initial prediction
  Fₘ(x) = Fₘ₋₁(x) + η × hₘ(x)

where:
  η = learning rate (shrinkage, e.g. 0.1)
  hₘ = shallow tree trained on residuals rᵢ = yᵢ − Fₘ₋₁(xᵢ)

After M rounds:
  Final prediction = F₀ + η·h₁ + η·h₂ + ... + η·hₘ

Example (3 rounds predicting house price, η=0.5):
  Round 0: F₀ = 300k (mean)
  Round 1: h₁ predicts residuals → F₁ = 300k + 0.5×(+40k) = 320k
  Round 2: h₂ targets remaining error → F₂ = 320k + 0.5×(+12k) = 326k
  Round 3: F₃ = 326k + 0.5×(+4k) = 328k  ← true: 329k

Learning rate η controls step size — small η + many trees beats large η + few trees

XGBoost — why it dominates competitions

XGBoost (eXtreme Gradient Boosting) won virtually every Kaggle competition on tabular data from 2016–2020. It supercharges Gradient Boosting with: second-order Taylor expansion (faster convergence), built-in L1+L2 regularisation (prevents overfitting), column and row subsampling (adds randomness like Random Forests), parallel tree building on a single machine, and native handling of missing values (automatically learns which direction to send missing samples at each split).

XGBoost objective (each round):
  Obj = Σᵢ L(yᵢ, ŷᵢ) + Σₖ Ω(fₖ)

where:
  L = loss function (MSE for regression, log-loss for classification)
  Ω(f) = γT + (λ/2)‖w‖²  ← regularisation
    T = number of leaves in the tree
    w = leaf weights
    γ = minimum loss reduction to make a split (pruning)
    λ = L2 weight on leaf scores

Key params: n_estimators=500, max_depth=6, learning_rate=0.05, subsample=0.8

XGBoost = Gradient Boosting + second-order gradients + regularisation + speed

Gradient Boosting vs Random Forest

Training speed: Random Forest faster (parallel trees) vs Gradient Boosting slow (sequential)
Accuracy: Gradient Boosting usually wins on clean, large tabular datasets
Overfitting: Random Forest more robust; Gradient Boosting needs careful learning rate + depth tuning
Interpretability: both provide feature importance; individual trees readable in both
Missing values: XGBoost handles natively; Random Forest needs imputation
Memory: Gradient Boosting uses less (shallow trees); Random Forest needs all trees stored

The Gradient Boosting algorithm

XGBoost — why it dominates competitions

Gradient Boosting vs Random Forest

Practice questions

Related AI learning resources