Learn Data Preprocessing

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Simple theory: Data preprocessing means cleaning and converting raw data into a form a model can learn from. It handles missing values, text categories, different numeric scales, and outliers before training starts.

Raw data is almost never ready for a model. It has missing values, wildly different scales, text categories, and outliers. Preprocessing converts messy real-world data into clean, numeric, consistently-scaled input the model can actually learn from. Watch the 3D animation: the same data before and after scaling — the difference is dramatic.

Min-Max Normalisation

Rescales every value into the range [0, 1]. Simple and bounded — good for neural networks with sigmoid/tanh outputs, image pixel values, and cases where the feature range is known and meaningful.

x_norm = (x − x_min) / (x_max − x_min)

Example — Age column [18, 25, 40, 65]:
  x_min = 18,  x_max = 65
  age=25: (25−18)/(65−18) = 7/47 = 0.149
  age=40: (40−18)/(65−18) = 22/47 = 0.468
  age=65: (65−18)/(65−18) = 1.000

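On the data used to compute x_min and x_max, the result is always in [0, 1]; later values outside that range land outside [0, 1]. Sensitive to outliers: one extreme value compresses all the others into a narrow band.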
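
As a minimal sketch of the formula above, the following standard-library Python reproduces the Age example and then shows the outlier effect. The helper names `min_max_fit` and `min_max_transform` and the extra value 650 are illustrative assumptions, not part of the lesson.

```python
def min_max_fit(values):
    """Return (x_min, x_max) learned from the given values."""
    return min(values), max(values)


def min_max_transform(values, x_min, x_max):
    """Apply x_norm = (x - x_min) / (x_max - x_min) to each value."""
    span = x_max - x_min
    if span == 0:
        # Constant column: map everything to 0.0 instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


ages = [18, 25, 40, 65]
x_min, x_max = min_max_fit(ages)
scaled = min_max_transform(ages, x_min, x_max)
print([round(v, 3) for v in scaled])  # [0.0, 0.149, 0.468, 1.0]

# Hypothetical outlier: one extreme value squeezes the original ages toward 0.
ages_with_outlier = ages + [650]
lo, hi = min_max_fit(ages_with_outlier)
squeezed = min_max_transform(ages_with_outlier, lo, hi)
print([round(v, 3) for v in squeezed])
```

With the outlier present, the four original ages all end up below 0.1, which is the compression described above.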

Z-Score Standardisation

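Transforms data to have mean=0 and standard deviation=1. Doesn't bound the output: values can be negative or greater than 1. Often preferred for gradient-based and distance-sensitive models (linear regression, SVM, neural networks) because one extreme value does not squeeze every other value into a narrow band the way it does with min-max scaling. Outliers still shift μ and inflate σ, so it is less sensitive to them, not immune. Like min-max, it is a linear rescaling, so the shape of the distribution is unchanged.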

x_std = (x − μ) / σ

μ = mean of the column
σ = standard deviation of the column

Example — Salary [30k, 50k, 70k, 120k]:
  μ = 67,500   σ ≈ 33,448 (population standard deviation, dividing by n)
  salary=30k:  (30k−67.5k)/33.4k = −1.12
  salary=50k:  (50k−67.5k)/33.4k = −0.52
  salary=120k: (120k−67.5k)/33.4k = +1.57

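CRITICAL: always fit μ and σ on training data only, then reuse those same values to transform validation and test data. Never use test data statistics: doing so leaks information about the test set into training and makes evaluation scores look better than they are.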
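
The fit-on-train, reuse-on-test rule can be sketched with the standard library as follows. This assumes the population standard deviation (dividing by n); the sample version (dividing by n−1) gives a slightly larger σ. The helper names and the two test salaries are hypothetical, added only for illustration.

```python
import statistics


def standard_fit(train_values):
    """Learn mu and sigma from training values only.

    Assumption: population standard deviation (divide by n).
    """
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)
    return mu, sigma


def standard_transform(values, mu, sigma):
    """Apply x_std = (x - mu) / sigma using already-fitted statistics."""
    return [(x - mu) / sigma for x in values]


train_salaries = [30_000, 50_000, 70_000, 120_000]
test_salaries = [45_000, 200_000]  # hypothetical unseen rows

mu, sigma = standard_fit(train_salaries)
train_scaled = standard_transform(train_salaries, mu, sigma)
# The test rows reuse the training mu and sigma; nothing is refitted here.
test_scaled = standard_transform(test_salaries, mu, sigma)

print(round(mu), round(sigma))
print([round(z, 2) for z in train_scaled])
print([round(z, 2) for z in test_scaled])
```

Keeping fit and transform as separate steps makes it hard to recompute the statistics on test data by accident.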

Handling missing values

Drop rows: safe when <5% of data is missing and dataset is large enough
Mean imputation: replace NaN with the column mean. Fast, but outliers pull the mean away from typical values, and filling many gaps with one number shrinks the column's variance
Median imputation: better for skewed distributions or columns with outliers
Most scikit-learn estimators cannot handle NaN directly, so imputation is required before fitting. XGBoost is an exception: it handles missing values natively
Add a binary 'was_missing' indicator column alongside the imputed value — this lets the model know which values were imputed
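
A minimal sketch of median imputation combined with a 'was_missing' indicator, using only the standard library. The income column, the helper names, and the choice to treat both None and NaN as missing are assumptions made for illustration.

```python
import math
import statistics


def is_missing(x):
    """Treat None and float NaN as missing."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def fit_median(train_column):
    """Median of the observed (non-missing) training values."""
    observed = [x for x in train_column if not is_missing(x)]
    return statistics.median(observed)


def impute_with_indicator(column, fill_value):
    """Return (imputed_column, was_missing) as two parallel lists."""
    was_missing = [1 if is_missing(x) else 0 for x in column]
    imputed = [fill_value if is_missing(x) else x for x in column]
    return imputed, was_missing


# Hypothetical income column with one outlier and two gaps.
income = [32_000, None, 41_000, 38_000, float("nan"), 950_000]

fill = fit_median(income)
imputed, was_missing = impute_with_indicator(income, fill)
print(fill)         # 39500.0
print(was_missing)  # [0, 1, 0, 0, 1, 0]
```

Here the mean of the observed values would be 265,250, pulled up by the single 950,000 entry, while the median stays at 39,500. As with scaling, the fill value should come from training data and be reused on test data.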

Encoding categorical variables

Label encoding:  colour → {red:0, green:1, blue:2}   ← WRONG for unordered categories
One-hot:         colour → [is_red, is_green, is_blue]
  red   → [1, 0, 0]
  green → [0, 1, 0]
  blue  → [0, 0, 1]

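Use label encoding only for truly ordered categories (small/medium/large), where the integer order means something. For unordered categories it implies a false ranking (red < green < blue), so default to one-hot.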
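
The two encodings can be sketched in plain Python as below. The `one_hot` helper, the sample columns, and the decision to map an unseen category to all zeros are illustrative assumptions rather than a fixed convention.

```python
def one_hot(value, categories):
    """One 0/1 indicator per category; an unseen value becomes all zeros."""
    return [1 if value == c else 0 for c in categories]


# Unordered categories: one column per colour, in a fixed order.
categories = ["red", "green", "blue"]
colours = ["red", "green", "blue", "green"]
encoded = [one_hot(c, categories) for c in colours]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

# Ordered categories: the integer codes carry a real ranking.
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = ["medium", "small", "large"]
size_codes = [size_order[s] for s in sizes]
print(size_codes)  # [1, 0, 2]
```

Fixing the category list up front keeps the column order identical between training and test data.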