Simple theory: Data preprocessing means cleaning and converting raw data into a form a model can learn from. It handles missing values, text categories, different numeric scales, and outliers before training starts.
Raw data is almost never ready for a model. It has missing values, wildly different scales, text categories, and outliers. Preprocessing converts messy real-world data into clean, numeric, consistently-scaled input the model can actually learn from.
Min-Max Normalisation
Rescales every value into the range [0, 1]. Simple and bounded — good for neural networks with sigmoid/tanh outputs, image pixel values, and cases where the feature range is known and meaningful.
x_norm = (x − x_min) / (x_max − x_min)
Example — Age column [18, 25, 40, 65]:
x_min = 18, x_max = 65
age=25: (25−18)/(65−18) = 7/47 = 0.149
age=40: (40−18)/(65−18) = 22/47 = 0.468
age=65: (65−18)/(65−18) = 1.000
Result always in [0, 1]. Sensitive to outliers: one extreme value compresses all others.
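The worked example above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `min_max_scale` and the extra age of 200 used to show the outlier effect are illustrative assumptions, not part of the original example.

```python
def min_max_scale(values):
    """Rescale values into [0, 1] using (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero, so map everything to 0.0
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


ages = [18, 25, 40, 65]
scaled = min_max_scale(ages)
print([round(v, 3) for v in scaled])  # [0.0, 0.149, 0.468, 1.0]

# Outlier sensitivity: one extreme (hypothetical) age squeezes the others toward 0
with_outlier = min_max_scale(ages + [200])
print([round(v, 3) for v in with_outlier])  # [0.0, 0.038, 0.121, 0.258, 1.0]
```

With the outlier added, age 65 drops from 1.000 to about 0.258, which is the compression effect described above.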
Z-Score Standardisation
Transforms data to have mean=0 and standard deviation=1. Doesn't bound the output: values can be negative or greater than 1. Often preferred for models such as linear regression, SVM, and neural networks because it is less sensitive to a single extreme value than min-max scaling, although outliers still shift μ and σ. Like min-max scaling, it is a linear transform, so it does not change the shape of the distribution.
x_std = (x − μ) / σ
μ = mean of the column
σ = standard deviation of the column
Example — Salary [30k, 50k, 70k, 120k]:
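μ = 67,500 σ ≈ 33,448 (population standard deviation, dividing by n)
salary=30k: (30k−67.5k)/33.4k = −1.12
salary=50k: (50k−67.5k)/33.4k = −0.52
salary=120k: (120k−67.5k)/33.4k = +1.57
CRITICAL: always fit μ and σ on training data only. Never use test data statistics, because that leaks information about the test set into training.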
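The sketch below shows one way to apply the formula while respecting the train-only rule. It uses Python's standard `statistics` module; the function names and the `test_salaries` values are hypothetical and chosen only for illustration. Note that σ depends on the convention: the population form (divide by n) gives about 33,448 for this column, while the sample form (divide by n−1) gives about 38,622. The code assumes the population form.

```python
from statistics import fmean, pstdev


def fit_standardiser(train_values):
    """Learn mu and sigma from the training column only."""
    mu = fmean(train_values)
    sigma = pstdev(train_values)  # population std: divides by n
    if sigma == 0:
        raise ValueError("constant column: sigma is 0, cannot standardise")
    return mu, sigma


def standardise(values, mu, sigma):
    """Apply x_std = (x - mu) / sigma with previously fitted statistics."""
    return [(x - mu) / sigma for x in values]


train_salaries = [30_000, 50_000, 70_000, 120_000]
test_salaries = [45_000, 90_000]  # hypothetical unseen data

mu, sigma = fit_standardiser(train_salaries)
print(mu, round(sigma))  # 67500.0 33448

train_z = standardise(train_salaries, mu, sigma)
print([round(z, 2) for z in train_z])  # [-1.12, -0.52, 0.07, 1.57]

# The test set reuses the training mu and sigma; it is never refitted
test_z = standardise(test_salaries, mu, sigma)
```

Keeping the fit and transform steps separate makes it hard to accidentally compute statistics on test data.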
Handling missing values
- Drop rows: safe when <5% of data is missing and dataset is large enough
- Mean imputation: replace NaN with column mean. Fast, but the mean is pulled towards outliers, and filling many gaps with one value shrinks the column's variance
- Median imputation: better for skewed distributions or columns with outliers
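- Many models (most scikit-learn estimators, for example) raise an error on NaN, so imputation is required before fitting. Some tree-based models, such as XGBoost and scikit-learn's histogram-based gradient boosting, handle missing values natively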
- Add a binary 'was_missing' indicator column alongside the imputed value — this lets the model know which values were imputed
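The median and indicator steps from the list above can be sketched as follows. This is a minimal standard-library illustration, not a reference implementation; the column values, the helper names `fit_median` and `impute_with_indicator`, and the use of `math.nan` to mark missing entries are all assumptions made for the example.

```python
import math
from statistics import median


def fit_median(train_column):
    """Median of the observed (non-NaN) training values."""
    observed = [x for x in train_column if not math.isnan(x)]
    return median(observed)


def impute_with_indicator(column, fill_value):
    """Return the imputed column plus a 0/1 'was_missing' flag per row."""
    was_missing = [1 if math.isnan(x) else 0 for x in column]
    imputed = [fill_value if math.isnan(x) else x for x in column]
    return imputed, was_missing


# Hypothetical income column (in thousands) with two gaps and one outlier
train_income = [32.0, math.nan, 41.0, 38.0, 250.0, math.nan]

fill = fit_median(train_income)
imputed, was_missing = impute_with_indicator(train_income, fill)

print(fill)         # 39.5 (the mean of the observed values would be 90.25)
print(imputed)      # [32.0, 39.5, 41.0, 38.0, 250.0, 39.5]
print(was_missing)  # [0, 1, 0, 0, 0, 1]
```

As with scaling, the fill value is learned from training data and then reused unchanged on validation and test data.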
Encoding categorical variables
Label encoding: colour → {red:0, green:1, blue:2} ← WRONG for unordered categories: the integers imply an order (red < green < blue) that does not exist
One-hot: colour → [is_red, is_green, is_blue]
red → [1, 0, 0]
green → [0, 1, 0]
blue → [0, 0, 1]
Use label encoding only for truly ordered categories (small/medium/large). Otherwise always one-hot.
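Both encodings can be sketched in a few lines of plain Python. The `one_hot` helper and the `size_order` mapping are hypothetical names introduced for illustration. The category list is passed in explicitly, on the assumption that it was determined from the training data.

```python
def one_hot(values, categories):
    """Return one 0/1 vector per value, with one position per category."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # raises KeyError for a category not seen in training
        rows.append(row)
    return rows


colours = ["red", "green", "blue", "green"]
encoded = one_hot(colours, ["red", "green", "blue"])
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

# Ordered categories: an explicit integer mapping preserves the real ordering
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = ["medium", "small", "large"]
size_codes = [size_order[s] for s in sizes]
print(size_codes)  # [1, 0, 2]
```

Writing the order mapping by hand, rather than letting it follow alphabetical order, keeps small < medium < large intact.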