Learn Object Detection

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Image classification asks 'what's in this picture?' Object detection asks something harder: 'what's in it AND where is each thing?' For every object the model finds, it must output a class label, a confidence score, AND a bounding box (x, y, width, height) that tightly wraps the object. Modern detectors run in real-time on phones — every face on a Zoom call, every car a Tesla sees, every defect a factory camera flags is a detection.

Three Generations of Detectors

Two-stage (Faster R-CNN, 2015): Stage 1 proposes candidate regions where objects might be (typically a few hundred to ~2,000). Stage 2 classifies each region and refines its box. Generally more accurate but slower.
One-stage (YOLO, SSD, RetinaNet): predicts boxes and classes in a single forward pass over a grid of anchor positions. Fast enough for real-time video, with lightweight variants running on phones, but historically slightly less accurate. The gap has largely closed since 2020.
Modern (DETR, 2020+): treats detection as a SET prediction problem with a transformer. No anchors, no NMS — the model directly outputs the final list of detections.

The YOLO Idea — Detection as a Single Pass

YOLO (You Only Look Once) divides the image into an S×S grid (e.g., 13×13).
Each grid cell predicts B bounding boxes, plus class probabilities:

  Per grid cell, output:
    For each of B boxes:
      • (x, y, w, h)         — box coordinates relative to cell
      • confidence            — probability the box contains AN object × IoU
    Per cell:
      • class_probs[C]        — probability for each of C classes

Final tensor shape: [S, S, B × 5 + C]

For the COCO dataset (C=80), taking S=13 and B=3 as illustrative values → [13, 13, 95] = 16,055 numbers per image.
A single forward pass produces all detections — extremely fast.

Trick: predictions go through Non-Maximum Suppression (NMS) to remove
duplicate boxes covering the same object.

One forward pass · per-cell predictions · NMS to deduplicate
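
The tensor layout described above can be checked with a few lines of NumPy. This is a minimal sketch, not any particular YOLO version's code: it assumes the B boxes come first in each cell's vector, followed by the C class probabilities, and it uses a zero-filled array as a stand-in for a real network output.

```python
import numpy as np

# Illustrative values from the text: 13x13 grid, 3 boxes per cell, 80 classes.
S, B, C = 13, 3, 80

# 5 numbers per box (x, y, w, h, confidence) plus C class probabilities per cell.
depth = B * 5 + C
output = np.zeros((S, S, depth), dtype=np.float32)  # placeholder for a network output

print(output.shape)  # (13, 13, 95)
print(output.size)   # 16055

# Split one cell's vector into its parts (assumed ordering: boxes, then classes).
cell = output[6, 6]
boxes = cell[: B * 5].reshape(B, 5)  # each row: x, y, w, h, confidence
class_probs = cell[B * 5:]           # shared by all boxes in this cell

# Class-specific score for every (box, class) pair: confidence x class probability.
scores = boxes[:, 4:5] * class_probs[None, :]  # shape (B, C)
```

Multiplying box confidence by the class probabilities gives one score per box and class, which is the quantity NMS later sorts by.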

Bounding Box Coordinates

Two common formats:

  Corner format:  (x_min, y_min, x_max, y_max)
  Center format:  (x_center, y_center, width, height)

Networks typically output center format (easier for regression); the boxes are then converted to corner format for IoU calculations and visualization.

Boxes are usually normalised to [0, 1] coordinates (image-relative)
so the same box description applies regardless of image resolution.

Both formats are equivalent — choice depends on convenience
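
Converting between the two formats is simple arithmetic. The helper names below (center_to_corner, corner_to_center, normalize) are hypothetical, chosen for this sketch, and boxes are assumed to be plain 4-tuples of numbers.

```python
def center_to_corner(box):
    """(x_center, y_center, width, height) -> (x_min, y_min, x_max, y_max)."""
    x_c, y_c, w, h = box
    return (x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2)


def corner_to_center(box):
    """(x_min, y_min, x_max, y_max) -> (x_center, y_center, width, height)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min)


def normalize(box, img_w, img_h):
    """Scale a pixel-space box to [0, 1] image-relative coordinates.

    Works for either format: entries 0 and 2 are horizontal quantities,
    entries 1 and 3 are vertical quantities.
    """
    a, b, c, d = box
    return (a / img_w, b / img_h, c / img_w, d / img_h)


corner = center_to_corner((50, 40, 20, 10))
print(corner)                    # (40.0, 35.0, 60.0, 45.0)
print(corner_to_center(corner))  # (50.0, 40.0, 20.0, 10.0)
```

Normalising a box requires dividing horizontal values by the image width and vertical values by the image height, which is why one function can serve both formats.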

Intersection-over-Union (IoU): the Detection Yardstick

IoU is THE metric for evaluating box predictions:

  IoU(A, B) = area(A ∩ B) / area(A ∪ B)

Values:
  • IoU = 0: no overlap
  • IoU = 1: boxes are identical
  • IoU ≥ 0.5: typically counted as a 'correct' detection
  • IoU ≥ 0.75: a stricter threshold that demands tight localization

Used for:
  1. Training: matching predicted boxes to ground truth
  2. Non-Maximum Suppression: removing duplicate predictions
  3. Evaluation: mAP@0.5, mAP@0.5:0.95 metrics

IoU is the contract between predictions and ground-truth boxes
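
The IoU formula above translates directly into code. This is a minimal sketch assuming both boxes are in corner format (x_min, y_min, x_max, y_max); the function name iou is our own choice.

```python
def iou(a, b):
    """Intersection-over-Union of two corner-format boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so non-overlapping boxes yield zero intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 (identical boxes)
print(iou((0, 0, 2, 2), (5, 5, 6, 6)))  # 0.0 (no overlap)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
```

In the last example the overlap is a 1×1 square (area 1) and the union is 4 + 4 − 1 = 7, so two boxes that look fairly close still score well below the 0.5 threshold.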

Non-Maximum Suppression (NMS): Removing Duplicates

Problem: a single object often gets multiple predicted boxes (slight variations in position).
Solution (NMS):
  1. Sort predictions by confidence (highest first).
  2. Take the top one and add it to the final list.
  3. Remove all remaining boxes whose IoU with the kept one exceeds a threshold (typically 0.5).
  4. Repeat steps 2 and 3 on the remaining boxes until none are left.
Result: each object ends up with one box — its highest-confidence prediction.
Per class: NMS is applied independently for each class, so overlapping objects of different classes (e.g., a person on a bicycle) do not suppress each other.
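
The greedy procedure can be sketched in plain Python as follows. This is an illustrative, unoptimised version rather than any library's implementation: detections are assumed to be (box, score, class_id) tuples with corner-format boxes, and the per-class rule is handled by only suppressing boxes that share the kept box's class.

```python
def iou(a, b):
    """Intersection-over-Union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy per-class NMS over (box, score, class_id) tuples."""
    # Step 1: sort by confidence, highest first.
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        # Step 2: keep the most confident remaining detection.
        best = remaining.pop(0)
        kept.append(best)
        # Step 3: drop same-class boxes that overlap it too much.
        remaining = [
            d for d in remaining
            if d[2] != best[2] or iou(d[0], best[0]) <= iou_threshold
        ]
    return kept


detections = [
    ((0, 0, 10, 10), 0.9, "person"),
    ((1, 1, 11, 11), 0.8, "person"),   # near-duplicate of the first box
    ((0, 0, 10, 10), 0.7, "bicycle"),  # same location, different class
]
kept = nms(detections)
print([d[2] for d in kept])  # ['person', 'bicycle']
```

The second person box has an IoU of 81/119 (about 0.68) with the first and is removed, while the bicycle box survives despite overlapping completely because it belongs to a different class.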