Classical ML - Intermediate - 12 min

Learn Model Evaluation Metrics


Last updated: 2026-05-13.

Simple theory: Evaluation metrics are numbers that tell you how a model is performing. Different metrics expose different mistakes, so accuracy alone is often not enough.

Your model is 99% accurate. Sounds great. But suppose 99% of your dataset belongs to the negative class. A model that always predicts 'negative' is also 99% accurate, yet it never detects a single positive case, so it is useless for the task. Accuracy is misleading on imbalanced datasets. Precision, recall, F1, and the confusion matrix tell you what's actually happening.
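The scenario above can be reproduced in a few lines. This is a minimal sketch on made-up data (990 negatives and 10 positives are assumed purely for illustration), with an always-negative "model" standing in for a real classifier.

```python
# Hypothetical dataset: 990 negatives (0) and 10 positives (1), i.e. 99% negative.
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts negative.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual positives that the model found.
positives_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = positives_found / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy looks excellent while recall shows that no positive case was ever caught.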

The confusion matrix

For binary classification:

True Positive (TP): correctly predicted positive.
True Negative (TN): correctly predicted negative.
False Positive (FP): predicted positive, actually negative (false alarm).
False Negative (FN): predicted negative, actually positive (missed case).

All metrics are derived from these four numbers.
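As a rough illustration, the four counts can be tallied directly from true labels and predictions. The helper name `confusion_counts` and the sample labels are hypothetical, and the sketch assumes labels are encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = positive, 0 = negative."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # correctly predicted positive
        elif predicted == 0 and actual == 0:
            tn += 1  # correctly predicted negative
        elif predicted == 1 and actual == 0:
            fp += 1  # false alarm
        else:
            fn += 1  # missed case
    return tp, tn, fp, fn


# Made-up labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
```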

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1 Score  = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy  = (TP + TN) / (TP + TN + FP + FN)

TP = True Positive   FP = False Positive
TN = True Negative   FN = False Negative

All four metrics come from the same four numbers in the confusion matrix
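The formulas above translate almost directly into code. The sketch below uses hypothetical names (`safe_div`, `classification_metrics`) and made-up counts. Returning 0.0 when a denominator is zero is one common convention, assumed here rather than taken from the lesson.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1 and accuracy from the four confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up counts: 1000 examples, 90 of them actually positive.
m = classification_metrics(tp=40, tn=900, fp=10, fn=50)
for name, value in m.items():
    print(f"{name:<9} = {value:.3f}")
# precision = 0.800
# recall    = 0.444
# f1        = 0.571
# accuracy  = 0.940
```

With these counts, accuracy is 94% even though the model misses more than half of the positive cases, which is what recall and F1 expose.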

Choosing the right metric

Spam filter: prioritise precision. A false positive sends a good email to spam, so false alarms should be rare.
Cancer screening: prioritise recall. A false negative is a missed case, which is far more costly than a false alarm.
Credit scoring: AUC-ROC measures ranking quality across all thresholds, that is, how reliably the model scores positives above negatives.
Imbalanced classes: prefer F1 or AUC-ROC over accuracy, which mostly reflects the majority class.
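The trade-off behind these choices can be sketched by sweeping a decision threshold over model scores. The scores and labels below are made up, and the function names are hypothetical. The AUC-ROC here is computed from its ranking interpretation (the fraction of positive-negative pairs in which the positive gets the higher score, with ties counted as half), which suits a small illustration but is not how a library would implement it.

```python
# Hypothetical model scores (higher = more likely positive) and true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def auc_roc(scores, labels):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


for threshold in (0.8, 0.5, 0.25):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
# threshold 0.80: precision 1.00, recall 0.50
# threshold 0.50: precision 0.75, recall 0.75
# threshold 0.25: precision 0.67, recall 1.00

print(f"AUC-ROC = {auc_roc(scores, labels):.4f}")  # AUC-ROC = 0.8125
```

A high threshold suits the spam-filter case (high precision), a low threshold suits the screening case (high recall), and AUC-ROC summarises ranking quality without committing to any single threshold.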

Practice questions

  1. A spam filter is 97% accurate on an email dataset where 3% are spam. A model that labels all emails as 'not spam' achieves what accuracy?
  2. In cancer screening, which metric is most important to maximise?
  3. A model catches 90 out of 100 actual fraud cases but also flags 50 legitimate transactions as fraud. What is its recall?
  4. F1 score is preferred over accuracy for imbalanced datasets because:
