Simple theory: A train/test split separates examples used for learning from examples used for checking. The goal is to measure whether the model can handle new data, not just memorize the data it already saw.
You've built a model and it scores 98% on your data. Impressive — until you realise you tested it on the exact same data you trained it on. The model didn't learn; it memorised. The train/test split is the fundamental safeguard that gives you an honest answer: does this model actually work on new data it's never seen?
The three-way split
Dataset (N total samples)
└── Train 70–80% ← model fits weights on this
└── Val 10–15% ← tune hyperparameters, pick best model
└── Test 10–15% ← evaluate ONCE at the end, never during dev
With N = 10,000 samples:
Train = 7,500 Val = 1,250 Test = 1,250Use stratified splitting to preserve class ratios in each set
- Train set: the only data the model sees during fitting — weights are updated using this
- Validation set: used to compare model variants, tune learning rate, depth, regularisation
- Test set: touched exactly once — after all development is done — to report final performance
- k-Fold Cross-Validation: rotates the validation window k times for more reliable estimates on small datasets
- Stratified split: ensures each set has the same class proportions as the full dataset (crucial for imbalanced data)
k-Fold Cross-Validation
With small datasets (< 10,000 samples), a single random split gives unstable estimates — different random seeds give different results. k-Fold (typically k=5 or k=10) solves this: split data into k equal folds, train k models each using a different fold as validation, average all k scores. More reliable, but k× slower to run.
k-Fold CV score = mean(score_fold_1, score_fold_2, ..., score_fold_k)
Example: 5-Fold CV on 5,000 samples
Fold 1: Train on 4,000, Val on 1,000 → acc = 84.2%
Fold 2: Train on 4,000, Val on 1,000 → acc = 83.1%
Fold 3: Train on 4,000, Val on 1,000 → acc = 85.5%
Fold 4: Train on 4,000, Val on 1,000 → acc = 82.8%
Fold 5: Train on 4,000, Val on 1,000 → acc = 84.9%
Final CV score = 84.1% ± 1.0% (mean ± std)The ± gives you confidence in the estimate — a large std means the score is unreliable
Data leakage — the silent killer
Data leakage is when information from outside the training set sneaks into your model during development, making performance look better than it is in production.