Classical ML - Beginner - 10 min

Learn Data Preprocessing

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Last updated: 2026-05-13.

Simple theory: Data preprocessing means cleaning and converting raw data into a form a model can learn from. It handles missing values, text categories, different numeric scales, and outliers before training starts.

Raw data is almost never ready for a model: it has missing values, wildly different scales, text categories, and outliers. Preprocessing converts this messy real-world data into clean, numeric, consistently scaled input the model can actually learn from. The 3D animation shows the same data before and after scaling.

Min-Max Normalisation

Rescales every value into the range [0, 1]. The result is simple and bounded, which suits neural networks with sigmoid/tanh activations, image pixel values, and features whose range is known and meaningful.

x_norm = (x − x_min) / (x_max − x_min)

Example — Age column [18, 25, 40, 65]:
  x_min = 18,  x_max = 65
  age=25: (25−18)/(65−18) = 7/47 = 0.149
  age=40: (40−18)/(65−18) = 22/47 = 0.468
  age=65: (65−18)/(65−18) = 1.000

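On the data the scaler was fitted to, the result is always in [0, 1]; new values below the fitted minimum or above the fitted maximum fall outside that range. Min-max scaling is sensitive to outliers: one extreme value compresses all the others into a narrow band.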
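The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name min_max_scale and the extra outlier value 650 are assumptions made for the example, and mapping a constant column to 0.0 is one possible convention.

```python
def min_max_scale(values):
    """Rescale a list of numbers into [0, 1] using its own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; mapping it to 0.0 is an assumption.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [18, 25, 40, 65]
scaled = min_max_scale(ages)
print([round(s, 3) for s in scaled])        # [0.0, 0.149, 0.468, 1.0]

# Add one extreme (hypothetical) value: the original ages are squeezed toward 0.
with_outlier = min_max_scale(ages + [650])
print([round(s, 3) for s in with_outlier])  # [0.0, 0.011, 0.035, 0.074, 1.0]
```

The second print shows the outlier effect: the four original ages now occupy less than a tenth of the [0, 1] range.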

Z-Score Standardisation

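Transforms data to have mean=0 and standard deviation=1. The output is not bounded: values can be negative or greater than 1. It is a common default for models such as linear regression, SVM, and neural networks because a single extreme value does not squeeze the rest of the column into a narrow band the way it does with min-max scaling. The mean and standard deviation are still influenced by outliers, and because the transform is linear, the shape of the distribution stays the same.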

x_std = (x − μ) / σ

μ = mean of the column
σ = standard deviation of the column

Example — Salary [30k, 50k, 70k, 120k]:
  μ = 67,500   σ ≈ 33,448 (population standard deviation)
  salary=30k:  (30k−67.5k)/33.4k = −1.12
  salary=50k:  (50k−67.5k)/33.4k = −0.52
  salary=120k: (120k−67.5k)/33.4k = +1.57

CRITICAL: always fit μ and σ on training data only. Never use test data statistics.
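The fit-on-training-only rule can be sketched as two separate steps: one function that learns μ and σ, and one that applies them. This is a minimal sketch using only the standard library. The function names and the test_salaries values are hypothetical, and σ here is the population standard deviation (dividing by n), which is an assumption about which definition the lesson intends.

```python
import statistics


def fit_standardiser(train_values):
    """Learn the mean and population standard deviation from training data."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)  # population std: divides by n
    return mu, sigma


def standardise(values, mu, sigma):
    """Apply z-score scaling using statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in values]


train_salaries = [30_000, 50_000, 70_000, 120_000]
test_salaries = [45_000, 200_000]  # hypothetical unseen values

mu, sigma = fit_standardiser(train_salaries)      # fit on training data only
train_z = standardise(train_salaries, mu, sigma)
test_z = standardise(test_salaries, mu, sigma)    # reuse the training statistics

print(round(mu), round(sigma))
print([round(z, 2) for z in train_z])
print([round(z, 2) for z in test_z])
```

Keeping "fit" and "apply" apart makes it hard to compute statistics on test data by accident. A constant training column would give σ = 0 and a division by zero, which this sketch does not handle.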

Handling missing values

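  • Drop rows: reasonable when only a small share of rows is affected (a common rule of thumb is <5%) and the dataset is large enough
  • Mean imputation: replace NaN with the column mean. Fast, but the mean is pulled by outliers, and filling many values with it shrinks the column's variance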
  • Median imputation: better for skewed distributions or columns with outliers
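  • Many models (for example, most scikit-learn estimators) cannot handle NaN directly, so imputation is required before fitting. Some tree-based models, such as XGBoost, handle missing values natively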
  • Add a binary 'was_missing' indicator column alongside the imputed value — this lets the model know which values were imputed
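Median imputation with a 'was_missing' indicator might look like the sketch below. It is plain Python under stated assumptions: missing entries are represented as None or NaN, the helper names (is_missing, fit_median, impute) are hypothetical, and the income values are made up for illustration. As with scaling, the fill value is learned from training data and then reused.

```python
import math
import statistics


def is_missing(v):
    """Treat None and float NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def fit_median(train_col):
    """Learn the fill value from the observed training entries only."""
    return statistics.median(v for v in train_col if not is_missing(v))


def impute(col, fill):
    """Return (filled values, 0/1 was_missing flags) for a column."""
    flags = [1 if is_missing(v) else 0 for v in col]
    filled = [fill if is_missing(v) else v for v in col]
    return filled, flags


incomes = [30_000, None, 50_000, 70_000, float("nan"), 1_000_000]

fill = fit_median(incomes)
filled, was_missing = impute(incomes, fill)

print(fill)         # 60000.0
print(filled)       # [30000, 60000.0, 50000, 70000, 60000.0, 1000000]
print(was_missing)  # [0, 1, 0, 0, 1, 0]
```

The mean of the observed incomes here is 287,500 because of the single 1,000,000 entry, while the median stays at 60,000, which is why the median is the safer fill value for skewed columns.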

Encoding categorical variables

Label encoding:  colour → {red:0, green:1, blue:2}   ← WRONG for unordered categories
One-hot:         colour → [is_red, is_green, is_blue]
  red   → [1, 0, 0]
  green → [0, 1, 0]
  blue  → [0, 0, 1]

Use label encoding only for truly ordered categories (small/medium/large). For unordered categories, one-hot encoding is the safe default.
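Both encodings can be sketched without any library. The one_hot helper and the size_order mapping below are hypothetical names chosen for this example. Encoding an unseen category as an all-zero row is a design choice of this sketch, not the only option.

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 row with one column per known category.

    A value that is not in `categories` becomes an all-zero row.
    """
    return [[1 if v == c else 0 for c in categories] for v in values]


# Unordered categories: one column per colour, no implied ranking.
colour_categories = ["red", "green", "blue"]
colours = ["red", "green", "blue", "green"]
colour_rows = one_hot(colours, colour_categories)
print(colour_rows)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

# Ordered categories: an explicit mapping keeps small < medium < large.
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = ["medium", "small", "large"]
size_codes = [size_order[s] for s in sizes]
print(size_codes)   # [1, 0, 2]
```

Writing the order out explicitly, rather than letting it follow alphabetical order, ensures the integer codes reflect the real ranking of the categories.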

Practice questions

  1. Feature A ranges from 0–100,000 (salary). Feature B ranges from 18–90 (age). What problem does this cause without scaling?
  2. You standardise features using the training set mean and std. You then apply the same scaler to the test set. Why do you use training statistics, not test statistics?
  3. A column has values [red, green, blue]. You apply label encoding: red=0, green=1, blue=2. What problem does this create?
  4. An important feature column has 5% missing values. Which strategy would you use to handle them, and why?
