Simple theory: Evaluation metrics are numbers that tell you how a model is performing. Different metrics expose different mistakes, so accuracy alone is often not enough.
Your model is 99% accurate. Sounds great. But your dataset is 99% negative class. A model that always predicts 'negative' is 99% accurate and completely useless. Accuracy is a lie on imbalanced datasets. Precision, recall, F1, and the confusion matrix tell you what's actually happening.
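The trap described above is easy to reproduce. The following is a minimal sketch in plain Python, assuming a made-up dataset of 1,000 samples with 10 positives (the sizes and variable names are illustrative, not taken from any real data):

```python
# Hypothetical dataset: 1,000 samples, 10 positive (1) and 990 negative (0).
y_true = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts negative.
y_pred = [0] * len(y_true)

# Accuracy: the share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: the share of actual positives the model found.
found_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = found_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy looks excellent, yet the model never finds a single positive case, which is exactly what recall exposes.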
The confusion matrix
For binary classification, every prediction falls into one of four cells:
True Positive (TP): predicted positive, actually positive.
True Negative (TN): predicted negative, actually negative.
False Positive (FP): predicted positive, actually negative (false alarm).
False Negative (FN): predicted negative, actually positive (missed case).
Precision, recall, F1, and accuracy are all derived from these four numbers.
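As an illustration of the four cells, here is a small sketch that counts them from two lists of labels. The function name `confusion_counts` and the convention that 1 means positive and 0 means negative are assumptions made for this example:

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = positive, 0 = negative."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # correctly predicted positive
        elif predicted == 0 and actual == 0:
            tn += 1  # correctly predicted negative
        elif predicted == 1 and actual == 0:
            fp += 1  # false alarm
        else:
            fn += 1  # missed case

    return tp, tn, fp, fn


# Illustrative labels, not real data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 2 4 1 1
```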
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP = True Positive, FP = False Positive
TN = True Negative, FN = False Negative
All four metrics come from the same four numbers in the confusion matrix.
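The four formulas translate directly into code. The sketch below assumes the counts are already known. The function name `metrics_from_counts` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example (other tools may warn or return NaN instead):

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    total = tp + tn + fp + fn

    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Illustrative counts: 8 TP, 85 TN, 2 FP, 5 FN.
m = metrics_from_counts(tp=8, tn=85, fp=2, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
# precision: 0.80
# recall: 0.62
# f1: 0.70
# accuracy: 0.93
```

With these counts, accuracy is 0.93 while recall is only 0.62, because 5 of the 13 actual positives were missed.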
Choosing the right metric
Spam filter: prioritise precision. You want few false alarms, because a false positive means a good email gets deleted.
Cancer screening: prioritise recall. You want to catch every case, because a missed case is catastrophic.
Credit scoring: use AUC-ROC, which measures how well the model ranks positives above negatives across all thresholds.
Imbalanced classes: prefer F1 or AUC-ROC over accuracy.
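To make "ranking quality" concrete: AUC-ROC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by comparing every positive/negative pair. The function name `auc_roc` and the sample scores are made up for illustration, and the pairwise loop is meant for clarity on small inputs rather than speed:

```python
def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(pos) * len(neg))


# Illustrative labels and model scores (higher score = more likely positive).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

auc = auc_roc(y_true, scores)
print(f"AUC-ROC = {auc:.3f}")  # AUC-ROC = 0.889
```

Because no threshold appears anywhere in the calculation, the result describes the model's ranking as a whole rather than its behaviour at one particular cut-off.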