Image classification asks 'what's in this picture?' Object detection asks something harder: 'what's in it AND where is each thing?' For every object the model finds, it must output a class label, a confidence score, AND a bounding box (x, y, width, height) that tightly wraps the object. Modern detectors run in real-time on phones — every face on a Zoom call, every car a Tesla sees, every defect a factory camera flags is a detection.
Three Families of Detectors
- Two-stage (Faster R-CNN, 2015): Stage 1 proposes ~2,000 candidate regions where objects might be. Stage 2 classifies each region and refines its box. More accurate but slower.
- One-stage (YOLO, SSD, RetinaNet): predicts boxes and classes in a single forward pass over a grid of anchor positions. Fast enough for real-time video, but historically slightly less accurate than two-stage detectors. The gap has largely closed since 2020.
- Modern (DETR, 2020+): treats detection as a SET prediction problem with a transformer. No anchors, no NMS — the model directly outputs the final list of detections.
The YOLO Idea — Detection as a Single Pass
YOLO (You Only Look Once) divides the image into an S×S grid (e.g., 13×13).
Each grid cell predicts B bounding boxes, plus class probabilities:
Per grid cell, output:
  For each of the B boxes:
    • (x, y, w, h): box center (x, y) relative to the cell, and width and height (w, h) relative to the whole image
    • confidence: Pr(object) × IoU between the predicted box and the ground-truth box
  Once per cell:
    • class_probs[C]: probability for each of the C classes
Final tensor shape: [S, S, B × 5 + C]
For the COCO dataset: S=13, B=3, C=80 → [13, 13, 95] = 16,055 numbers per image.
A single forward pass produces all detections — extremely fast.
Post-processing: predictions go through Non-Maximum Suppression (NMS) to remove
duplicate boxes covering the same object.
One forward pass · per-cell predictions · NMS to deduplicate
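The output layout above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the flat per-cell ordering ([x, y, w, h, confidence] repeated B times, then C class probabilities) are assumptions made for the example, not the layout of any particular YOLO implementation. The class-specific score (box confidence × class probability) follows the original YOLO formulation.

```python
def yolo_output_shape(S, B, C):
    """Shape of the output tensor: S x S cells, each holding B*5 + C values."""
    return (S, S, B * 5 + C)


def decode_cell(cell, B, C):
    """Split one cell's flat vector into B boxes and C class probabilities.

    Assumed (hypothetical) layout: [x, y, w, h, conf] repeated B times,
    followed by C class probabilities.
    """
    if len(cell) != B * 5 + C:
        raise ValueError("cell vector has the wrong length")
    boxes = [tuple(cell[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell[B * 5:])
    return boxes, class_probs


def class_scores(box, class_probs):
    """Class-specific scores for one box: box confidence x class probability."""
    conf = box[4]
    return [conf * p for p in class_probs]


shape = yolo_output_shape(S=13, B=3, C=80)
total = shape[0] * shape[1] * shape[2]
print(shape, total)  # (13, 13, 95) 16055
```

Every box in a cell shares that cell's class probabilities, which is why the per-cell size is B × 5 + C rather than B × (5 + C).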
Bounding Box Coordinates
Two common formats:
Corner format: (x_min, y_min, x_max, y_max)
Center format: (x_center, y_center, width, height)
Networks usually output center format (easier for regression), and the boxes are then converted to corner format for IoU calculations and visualization.
Boxes are usually normalised to [0, 1] coordinates (image-relative),
so the predicted coordinates do not depend on the image's pixel dimensions.
Both formats are equivalent; the choice depends on convenience.
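As a small sketch of the two formats, the conversions below go from center to corner format and back, and scale a normalised box to pixels. The function names are illustrative, not taken from any library.

```python
def center_to_corner(box):
    """(x_center, y_center, width, height) -> (x_min, y_min, x_max, y_max)."""
    xc, yc, w, h = box
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)


def corner_to_center(box):
    """(x_min, y_min, x_max, y_max) -> (x_center, y_center, width, height)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2,
            x_max - x_min, y_max - y_min)


def to_pixels(box, img_w, img_h):
    """Scale a normalised [0, 1] corner-format box to pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    return (x_min * img_w, y_min * img_h, x_max * img_w, y_max * img_h)


corner = center_to_corner((0.5, 0.5, 0.25, 0.5))
print(corner)                       # (0.375, 0.25, 0.625, 0.75)
print(to_pixels(corner, 640, 480))  # (240.0, 120.0, 400.0, 360.0)
```

Because the box is stored in normalised coordinates, the same values can be mapped onto an image of any resolution by changing only `img_w` and `img_h`.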
Intersection-over-Union (IoU): the Detection Yardstick
IoU is THE metric for evaluating box predictions:
IoU(A, B) = area(A ∩ B) / area(A ∪ B)
Values:
• IoU = 0: no overlap
• IoU = 1: boxes are identical
• IoU > 0.5: typically considered a 'correct' detection
• IoU > 0.75: a stricter threshold that requires tight localization
Used for:
1. Training: matching predicted boxes to ground truth
2. Non-Maximum Suppression: removing duplicate predictions
3. Evaluation: the mAP@0.5 and mAP@0.5:0.95 metrics
IoU is the contract between predictions and ground-truth boxes
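The IoU formula can be written directly for axis-aligned boxes. The sketch below assumes corner format (x_min, y_min, x_max, y_max) and returns 0 for degenerate boxes with zero union area; the function name is illustrative.

```python
def iou(a, b):
    """IoU of two corner-format boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so non-overlapping boxes give zero intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))            # 1.0  (identical boxes)
print(iou((0, 0, 2, 2), (5, 5, 6, 6)))            # 0.0  (no overlap)
print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 3))  # 0.143
```

The last example shows how quickly IoU drops: two 2×2 boxes offset by one unit in each direction share only 1 of 7 units of combined area, well below the 0.5 threshold.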
Non-Maximum Suppression (NMS): Removing Duplicates
- Problem: a single object often gets multiple predicted boxes (slight variations in position).
- Solution (NMS): 1) Sort predictions by confidence (highest first). 2) Take the top one and add it to the final list. 3) Remove all remaining boxes whose IoU with the kept box exceeds a threshold (typically 0.5). 4) Repeat from step 2 until no boxes remain.
- Result: each object ends up with one box — its highest-confidence prediction.
- Per class: NMS is applied independently for each class, so overlapping objects of different classes (e.g., a person on a bicycle) do not suppress each other.
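The greedy procedure described above can be sketched as follows. This is a minimal pure-Python illustration, assuming corner-format boxes and detections given as (box, score, label) tuples; the function names are hypothetical, and production code would normally use a vectorised library routine instead.

```python
def iou(a, b):
    """IoU of two corner-format boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs for a single class."""
    # Step 1: sort by confidence, highest first.
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        # Step 2: keep the most confident remaining box.
        best = remaining.pop(0)
        kept.append(best)
        # Step 3: drop boxes that overlap the kept one too much.
        remaining = [d for d in remaining
                     if iou(best[0], d[0]) <= iou_threshold]
    return kept


def nms_per_class(detections, iou_threshold=0.5):
    """Run NMS separately for each label in (box, score, label) tuples."""
    by_class = {}
    for box, score, label in detections:
        by_class.setdefault(label, []).append((box, score))
    result = []
    for label, dets in by_class.items():
        for box, score in nms(dets, iou_threshold):
            result.append((box, score, label))
    return result


dets = [
    ((0, 0, 10, 10), 0.9, "person"),
    ((1, 1, 11, 11), 0.8, "person"),    # duplicate of the first box
    ((0, 0, 10, 10), 0.7, "bicycle"),   # same place, different class
    ((50, 50, 60, 60), 0.6, "person"),  # a separate person
]
for det in nms_per_class(dets):
    print(det)
```

In this example the 0.8 "person" box is suppressed (its IoU with the 0.9 box is about 0.68), while the "bicycle" box survives even though it coincides exactly with a "person" box, because the two classes are processed independently.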