Computer Vision - Advanced - 15 min

Learn Image Segmentation


Last updated: 2026-05-13.

Detection draws a box around each object, but a box is a coarse outline: it also covers background pixels, especially near its corners. Segmentation goes finer: it labels EVERY PIXEL with a class. The output is a 'mask': the same height and width as the input image, with each pixel colored by what it is. Segmentation is a core ingredient in applications such as scene understanding for driver-assistance systems like Tesla's Autopilot, medical imaging, and background blur in apps like Google Photos. Pixels, not boxes.

Three Flavors of Segmentation

  • Semantic segmentation: every pixel gets a CLASS label (sky, road, car, person). All cars get the same color. It doesn't distinguish between two cars: touching cars merge into one 'car blob'.
  • Instance segmentation: every pixel that belongs to a countable object gets a CLASS label AND an INSTANCE id. Car #1 and Car #2 get different colors. The model has to find each individual object; background regions such as sky and road are typically left unlabeled.
  • Panoptic segmentation: combines both. Each pixel gets (class, instance_id), where 'stuff' classes (sky, road) have no instance id and 'thing' classes (cars, people) do. The full picture of what's where.
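
The three output formats can be compared on a toy image. The sketch below is illustrative only: the class ids (0 = sky, 1 = road, 2 = car), the tiny 3 × 4 grid, and the helper name to_panoptic are assumptions made for this example, and plain Python lists are used to keep it dependency-free.

```python
# Toy 3 x 4 scene. Class ids are illustrative: 0 = sky, 1 = road, 2 = car.
semantic = [
    [0, 0, 0, 0],
    [2, 2, 0, 2],
    [1, 1, 1, 1],
]

# Instance ids for object pixels only; 0 means "no instance".
instance = [
    [0, 0, 0, 0],
    [1, 1, 0, 2],
    [0, 0, 0, 0],
]

THING_CLASSES = {2}  # cars are countable 'things'; sky and road are 'stuff'


def to_panoptic(semantic, instance, thing_classes):
    """Combine both maps into per-pixel (class, instance_id) pairs.

    'Stuff' pixels get instance_id None; 'thing' pixels keep their id.
    """
    panoptic = []
    for sem_row, inst_row in zip(semantic, instance):
        row = []
        for cls, inst in zip(sem_row, inst_row):
            row.append((cls, inst if cls in thing_classes else None))
        panoptic.append(row)
    return panoptic


panoptic = to_panoptic(semantic, instance, THING_CLASSES)

# Semantic view: a single 'car' class covers both cars.
car_pixels = sum(row.count(2) for row in semantic)
# Panoptic view: the same pixels split into separate objects.
car_ids = {inst for row in panoptic for cls, inst in row if cls == 2}
```

Here the semantic map only knows that three pixels are 'car', while the panoptic map also records that they belong to two different cars.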

How Semantic Segmentation Works

Architecture pattern: encoder-decoder with skip connections (U-Net is the classic example; DeepLabv3+ uses a related encoder-decoder design).

  Input: H × W × 3 image

  Encoder (downsample):  conv → pool → conv → pool → conv → pool
    spatial: 256 → 128 → 64 → 32
    channels: 3 → 64 → 128 → 256
    Captures 'what' (semantic features) at low spatial resolution

  Decoder (upsample):    upconv → upconv → upconv
    spatial: 32 → 64 → 128 → 256
    Recovers 'where' (per-pixel localisation)

  Skip connections: encoder features at each resolution copy
    over to the corresponding decoder layer.
    Lets the decoder use both deep semantic features and
    shallow positional features.

  Output: H × W × C (logits per pixel per class)
  Final mask: argmax over C → H × W class index
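
The final argmax step can be sketched in a few lines. This is a minimal illustration using nested Python lists in place of a tensor; the function name argmax_mask and the logit values are made up for the example.

```python
def argmax_mask(logits):
    """Turn H x W x C logits (nested lists) into an H x W map of class indices."""
    mask = []
    for row in logits:
        mask_row = []
        for pixel_logits in row:
            # Index of the largest logit; ties resolve to the lowest index.
            best = max(range(len(pixel_logits)), key=pixel_logits.__getitem__)
            mask_row.append(best)
        mask.append(mask_row)
    return mask


# 2 x 2 image with C = 3 classes (values are invented for illustration).
logits = [
    [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2]],
    [[-0.5, 0.0, 3.1], [4.0, 3.9, 0.0]],
]
mask = argmax_mask(logits)
```

Each pixel keeps only the index of its highest-scoring class, so the H × W × C volume collapses to an H × W mask.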

Down: see the big picture · Up: paint pixel-perfect masks · Skip: don't lose spatial detail
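
The down / up / skip pattern can be checked with simple shape bookkeeping. The sketch below makes several assumptions that are not stated in the lesson: each encoder stage is a convolution followed by 2× pooling, each up-convolution outputs as many channels as the skip it meets, and skips are merged by channel concatenation (as in U-Net). The function name trace_unet_shapes and the default of 4 output classes are hypothetical.

```python
def trace_unet_shapes(size=256, in_channels=3,
                      encoder_channels=(64, 128, 256), num_classes=4):
    """Return (stage, spatial_size, channels) tuples for a toy U-Net-style net.

    Decoder entries report the channel count right after the skip
    features are concatenated with the upsampled features.
    """
    trace = [("input", size, in_channels)]
    skips = []
    spatial = size
    for ch in encoder_channels:
        skips.append((spatial, ch))  # conv output, saved before pooling
        spatial //= 2                # pooling halves height and width
        trace.append(("encoder", spatial, ch))
    for skip_spatial, skip_ch in reversed(skips):
        spatial *= 2                 # up-convolution doubles height and width
        if spatial != skip_spatial:
            raise ValueError("skip and decoder resolutions do not match")
        concat_ch = skip_ch + skip_ch  # upsampled features + skip features
        trace.append(("decoder", spatial, concat_ch))
    trace.append(("output", spatial, num_classes))
    return trace


trace = trace_unet_shapes()
spatial_sizes = [spatial for _, spatial, _ in trace]
```

With the default arguments the spatial sizes run 256 → 128 → 64 → 32 and back up to 256, and every decoder stage finds an encoder feature map at exactly its own resolution to concatenate with.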

Loss Functions for Segmentation

Per-pixel cross-entropy is the basic loss:

  L = −(1 / HW) Σ_{p} log P(y_p | image)

where y_p is the correct class for pixel p.
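
As a rough illustration of this formula, the sketch below computes the mean per-pixel cross-entropy from raw logits, using only the standard library. The helper names softmax and pixel_cross_entropy and the example values are assumptions for this example; a real training loop would use a framework's built-in loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_cross_entropy(logits, target):
    """Mean of -log P(y_p | image) over all pixels.

    logits: H x W x C nested lists; target: H x W nested lists of class ids.
    """
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, target):
        for pixel_logits, y in zip(logit_row, target_row):
            total -= math.log(softmax(pixel_logits)[y])
            count += 1
    return total / count


# Uniform logits over C = 4 classes give P = 1/4 at every pixel,
# so the loss equals log(4) whatever the targets are.
logits = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
target = [[0, 1, 2], [3, 0, 1]]
loss = pixel_cross_entropy(logits, target)
```

The uniform-logit case is a handy sanity check: an untrained network with C classes should start near a loss of log(C).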

Problems:
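  • Class imbalance: 'sky' might cover 30% of a typical image while 'pole' covers 0.1%. Naive CE
    averages over all pixels, so frequent classes dominate and rare classes are barely learned.
  • Boundary errors: pixels at object edges are often misclassified, and they are too few to move the average loss much.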

Fixes:
  • Weighted CE: weight each pixel inversely to class frequency.
  • Dice loss: 1 − (2 × intersection / (sum_pred + sum_target)), computed per class over all pixels;
      directly optimises overlap, so it is less sensitive to imbalance than CE.
  • Focal loss: down-weights easy pixels (where the network is confident),
      focusing learning on hard cases.
  • Boundary loss: extra weight on pixels near object edges.
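
Three of the fixes above can be sketched for the binary (foreground / background) case. These are simplified illustrations, not reference implementations: the function names, the eps smoothing term, the clamp inside the logarithm, and the mean-1 normalisation of the weights are assumptions, and the focal loss shown omits the optional class-balancing factor used in some formulations.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for one class.

    pred: flat list of foreground probabilities in [0, 1];
    target: flat list of 0/1 labels. eps avoids division by zero
    when both prediction and target are empty.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def focal_loss(pred, target, gamma=2.0):
    """Mean binary focal loss: cross-entropy scaled by (1 - p_t) ** gamma."""
    total = 0.0
    for p, t in zip(pred, target):
        p_t = p if t == 1 else 1.0 - p  # probability given to the true label
        total += -((1.0 - p_t) ** gamma) * math.log(max(p_t, 1e-12))
    return total / len(pred)


def inverse_frequency_weights(class_counts):
    """Per-class weights proportional to 1 / pixel count, scaled to mean 1."""
    inv = [1.0 / c for c in class_counts]
    mean = sum(inv) / len(inv)
    return [w / mean for w in inv]


# 100 pixels, only one of them foreground; the model predicts "background"
# almost everywhere, which is right for 99% of the pixels.
target = [1] + [0] * 99
lazy_pred = [0.01] * 100
lazy_dice = dice_loss(lazy_pred, target)
weights = inverse_frequency_weights([300, 1])
```

The lazy prediction is correct on 99 of 100 pixels, yet its Dice loss stays close to 1 because the single foreground pixel is missed. Setting gamma to 0 in focal_loss recovers plain cross-entropy.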

Pixel-level supervision · Dice + CE is a common default

Instance Segmentation: Mask R-CNN Style

  • Step 1: detect objects with a Faster R-CNN-style detector → list of (class, bbox).
  • Step 2: for each detected box, run a small 'mask head' that outputs a low-resolution binary mask (e.g. 28 × 28 in Mask R-CNN) for the object in that box.
  • Step 3: resize each per-instance mask to its box size and paste it back onto the original image canvas.
  • Result: every detected object has its own pixel-precise outline.
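  • Modern alternatives: SAM (Segment Anything, 2023) lets you click a point (or draw a box) and get a clean mask for the object there. It is promptable and class-agnostic: it segments objects without per-class training, but it returns masks, not class labels.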
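
The paste step (Step 3 above) can be sketched as follows. This is a simplified illustration, not Mask R-CNN's actual post-processing: it assumes each mask has already been resized to its box and thresholded to 0/1, that box coordinates are integers inside the image with exclusive upper bounds, and that later detections overwrite earlier ones where they overlap. The function name paste_masks and the example detections are hypothetical.

```python
def paste_masks(height, width, detections):
    """Paste per-box binary masks onto a full-size instance-id canvas.

    detections: list of (class_id, (x0, y0, x1, y1), box_mask), where
    box_mask has (y1 - y0) rows and (x1 - x0) columns of 0/1 values.
    Returns the canvas (0 = background) and a dict instance id -> class id.
    """
    canvas = [[0] * width for _ in range(height)]
    classes = {}
    for instance_id, (class_id, box, box_mask) in enumerate(detections, start=1):
        x0, y0, x1, y1 = box
        classes[instance_id] = class_id
        for dy in range(y1 - y0):
            for dx in range(x1 - x0):
                if box_mask[dy][dx]:
                    # Later detections overwrite earlier ones on overlap.
                    canvas[y0 + dy][x0 + dx] = instance_id
    return canvas, classes


# Two hypothetical 'car' detections (class id 2) on a 4 x 6 image.
detections = [
    (2, (0, 1, 2, 3), [[1, 1], [0, 1]]),
    (2, (3, 0, 6, 2), [[0, 1, 1], [1, 1, 1]]),
]
canvas, classes = paste_masks(4, 6, detections)
```

Both detections share the class 'car', but each gets its own id on the canvas, which is exactly what separates instance segmentation from semantic segmentation.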

Practice questions

  1. What is the key difference between semantic segmentation and instance segmentation?
  2. Why is the U-Net architecture (with skip connections) so widely used for segmentation?
  3. Why does segmentation often use Dice loss in addition to (or instead of) cross-entropy?
  4. Why is annotated data so much more expensive for segmentation than for classification?
