Detection draws a box around each object, but a box is coarse: its corners usually cover background rather than the object itself. Segmentation goes finer: it labels EVERY PIXEL with a class. The output is a 'mask': same height and width as the input image, with each pixel colored by what it is. This kind of per-pixel output is what systems like Tesla's Autopilot, medical imaging tools, and Google Photos' background blur rely on. Pixels, not boxes.
Three Flavors of Segmentation
- Semantic segmentation: every pixel gets a CLASS label (sky, road, car, person). All cars get the same color. Doesn't distinguish between two cars — they merge into one 'car blob'.
- Instance segmentation: every pixel gets a CLASS label AND an INSTANCE id. Car #1 and Car #2 get different colors. The model has to find each individual object.
- Panoptic segmentation: combines both — each pixel gets (class, instance_id) where 'stuff' classes (sky, road) have no instance id and 'thing' classes (cars, people) do. The full picture of what's where.
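To make the three output formats concrete, here is a minimal NumPy sketch on a toy 3 × 5 image. The class ids (SKY, ROAD, CAR), the convention that instance id 0 means "no instance", and the helper to_panoptic are all assumptions made for illustration, not a standard encoding.

```python
import numpy as np

# Hypothetical class ids for this toy example.
SKY, ROAD, CAR = 0, 1, 2
THING_CLASSES = {CAR}  # countable objects; sky and road are "stuff"

# Semantic mask: one class id per pixel. Both cars share the same id.
semantic = np.array([
    [SKY,  SKY,  SKY,  SKY,  SKY],
    [CAR,  CAR,  ROAD, CAR,  CAR],
    [ROAD, ROAD, ROAD, ROAD, ROAD],
])

# Instance mask: 0 means "no instance"; each car gets its own id.
instance = np.array([
    [0, 0, 0, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0],
])

def to_panoptic(semantic, instance, thing_classes):
    """Stack (class, instance_id) per pixel; stuff pixels keep instance_id 0."""
    is_thing = np.isin(semantic, list(thing_classes))
    inst = np.where(is_thing, instance, 0)
    return np.stack([semantic, inst], axis=-1)  # shape H x W x 2

panoptic = to_panoptic(semantic, instance, THING_CLASSES)
# Distinct (class, instance_id) pairs are the segments of the image.
segments = np.unique(panoptic.reshape(-1, 2), axis=0)
```

In the semantic mask the two cars are indistinguishable; in the panoptic array they appear as (2, 1) and (2, 2), while sky and road each stay a single segment.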
How Semantic Segmentation Works
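Architecture pattern: encoder-decoder with skip connections. U-Net is the classic example; DeepLabv3+ follows a similar idea, with a lighter decoder on top of an atrous-convolution encoder.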
Input: H × W × 3 image
Encoder (downsample): conv → pool → conv → pool → conv → pool
spatial: 256 → 128 → 64 → 32
channels: 3 → 64 → 128 → 256
Captures 'what' (semantic features) at low spatial resolution
Decoder (upsample): upconv ← upconv ← upconv
spatial: 32 → 64 → 128 → 256
Recovers 'where' (per-pixel localisation)
Skip connections: encoder features at each resolution are copied
over to the decoder layer with the same resolution.
Lets the decoder use both deep semantic features and
shallow positional features.
Output: H × W × C (logits per pixel per class)
Final mask: argmax over C → H × W class index
Down: see the big picture · Up: paint pixel-perfect masks · Skip: don't lose spatial detail
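The shape flow in the diagram above can be traced with a small NumPy sketch. This is not a trained network: conv1x1 is a hypothetical stand-in that uses random 1 × 1 projections instead of learned 3 × 3 convolutions, and nearest-neighbour upsampling replaces the learned upconv. It only shows how resolution, channels, and skip concatenations line up.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_ch):
    """Stand-in for a learned conv block: random 1x1 projection + ReLU."""
    w = rng.standard_normal((x.shape[-1], out_ch)) * 0.1
    return np.maximum(x @ w, 0.0)

def pool2(x):
    """2x2 max pooling: halves H and W (both must be even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def up2(x):
    """Nearest-neighbour upsampling: doubles H and W."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def tiny_unet(image, num_classes):
    # Encoder: resolution goes down, channels go up.
    e1 = conv1x1(image, 64)         # H   x W   x 64
    e2 = conv1x1(pool2(e1), 128)    # H/2 x W/2 x 128
    e3 = conv1x1(pool2(e2), 256)    # H/4 x W/4 x 256
    b = pool2(e3)                   # H/8 x W/8 x 256 (bottleneck)
    # Decoder: upsample, concatenate the matching encoder features, project.
    d3 = conv1x1(np.concatenate([up2(b), e3], axis=-1), 256)   # H/4
    d2 = conv1x1(np.concatenate([up2(d3), e2], axis=-1), 128)  # H/2
    d1 = conv1x1(np.concatenate([up2(d2), e1], axis=-1), 64)   # H
    # Per-pixel class scores.
    return d1 @ rng.standard_normal((64, num_classes))

image = rng.random((32, 32, 3))            # small input so the sketch runs quickly
logits = tiny_unet(image, num_classes=5)   # H x W x C
mask = logits.argmax(axis=-1)              # H x W class index
```

H and W need to be divisible by 8 here because of the three pooling steps. With a 256 × 256 input the spatial sizes would follow the 256 → 128 → 64 → 32 path shown in the diagram.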
Loss Functions for Segmentation
Per-pixel cross-entropy is the basic loss:
L = −(1 / HW) Σ_{p} log P(y_p | image)
where y_p is the correct class for pixel p.
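As a rough illustration of the formula above, here is a minimal NumPy sketch that takes P to be the softmax over the C logits at each pixel. The function name pixel_cross_entropy and the H × W × C array layout are assumptions for this example.

```python
import numpy as np

def pixel_cross_entropy(logits, target):
    """Mean over all pixels of -log softmax(logits)[y_p].

    logits: H x W x C float array of class scores.
    target: H x W integer array with the correct class id per pixel.
    """
    # Subtract the per-pixel max so exp() cannot overflow.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the correct class at each pixel.
    picked = np.take_along_axis(log_probs, target[..., None], axis=-1)
    return -picked.mean()
```

A useful sanity check: when every logit is equal, the softmax is uniform and the loss is log C regardless of the target.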
Problems:
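• Class imbalance: 'sky' might cover 30% of a typical image while 'pole' covers 0.1%.
Naive CE is dominated by the frequent classes, so rare classes barely affect the loss.
• Boundary errors: pixels at object edges are the ones most often misclassified, yet CE weights them like any other pixel.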
Fixes:
• Weighted CE: weight each pixel inversely to class frequency.
• Dice loss: 1 − (2 × intersection / (sum_pred + sum_target))
directly optimises overlap, robust to imbalance.
• Focal loss: down-weights easy pixels (where the network is confident),
focusing learning on hard cases.
• Boundary loss: extra weight on pixels near object edges.
Pixel-level supervision · Dice + CE is the modern default
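The fixes listed above can be sketched in NumPy as follows. This is one plausible formulation under stated assumptions, not a reference implementation: the inputs are softmax probabilities of shape H × W × C, the Dice score is averaged over classes (a "soft" multi-class Dice), gamma=2.0 is just a common choice for focal loss, and all function names are made up for this example. Boundary loss is left out because it needs an edge-distance map.

```python
import numpy as np

def one_hot(target, num_classes):
    return np.eye(num_classes)[target]  # H x W x C

def true_class_prob(probs, target):
    """Probability the model assigns to the correct class at each pixel."""
    return np.take_along_axis(probs, target[..., None], axis=-1)[..., 0]

def inverse_frequency_weights(target, num_classes):
    """Weight each class inversely to how many pixels it covers."""
    counts = np.bincount(target.ravel(), minlength=num_classes)
    return counts.sum() / (num_classes * np.maximum(counts, 1))

def weighted_ce(probs, target, class_weights, eps=1e-7):
    w = class_weights[target]  # per-pixel weight, looked up by true class
    return -(w * np.log(true_class_prob(probs, target) + eps)).sum() / w.sum()

def dice_loss(probs, target, eps=1e-7):
    t = one_hot(target, probs.shape[-1])
    intersection = (probs * t).sum(axis=(0, 1))           # per class
    denom = probs.sum(axis=(0, 1)) + t.sum(axis=(0, 1))   # sum_pred + sum_target
    dice = (2 * intersection + eps) / (denom + eps)
    return 1.0 - dice.mean()

def focal_loss(probs, target, gamma=2.0, eps=1e-7):
    p = true_class_prob(probs, target)
    # (1 - p)^gamma shrinks the loss on pixels the model already gets right.
    return -(((1 - p) ** gamma) * np.log(p + eps)).mean()

def dice_plus_ce(probs, target, class_weights):
    """The combined Dice + CE objective mentioned in the caption above."""
    return weighted_ce(probs, target, class_weights) + dice_loss(probs, target)
```

The eps terms keep the logarithm and the Dice ratio defined when a probability is zero or a class is absent from both prediction and target.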
Instance Segmentation: Mask R-CNN Style
- Step 1: detect objects with a Faster R-CNN-style detector → list of (class, bbox).
- Step 2: for each detected box, run a small 'mask head' that outputs a low-resolution binary mask within that box (28 × 28 in the original Mask R-CNN).
- Step 3: resize each per-instance mask to its box size, threshold it, and paste it back onto the original image canvas.
- Result: every detected object has its own pixel-precise outline.
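- Modern alternatives: SAM (Segment Anything, 2023) takes a prompt such as a clicked point or a box and returns a clean mask for whatever is there. It is class-agnostic: it segments objects without per-class training, but it does not say what class they are.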
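Step 3 of the Mask R-CNN recipe above (pasting masks back onto the canvas) can be sketched in NumPy. The detector and mask head are assumed to have already run; boxes and mask_probs below are hypothetical inputs, and nearest-neighbour resizing stands in for the bilinear interpolation a real implementation would more likely use.

```python
import numpy as np

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D array."""
    rows = np.arange(out_h) * mask.shape[0] // out_h
    cols = np.arange(out_w) * mask.shape[1] // out_w
    return mask[rows[:, None], cols[None, :]]

def paste_masks(image_hw, boxes, mask_probs, threshold=0.5):
    """Return an H x W instance map: 0 is background, k is the k-th detection.

    boxes: list of (x0, y0, x1, y1) integer pixel coords, x1/y1 exclusive.
    mask_probs: list of small 2-D arrays in [0, 1], one per box.
    """
    h, w = image_hw
    canvas = np.zeros((h, w), dtype=np.int32)
    for k, ((x0, y0, x1, y1), probs) in enumerate(zip(boxes, mask_probs), start=1):
        resized = resize_nearest(probs, y1 - y0, x1 - x0)
        region = canvas[y0:y1, x0:x1]      # a view into the canvas
        region[resized >= threshold] = k   # later detections overwrite earlier ones
    return canvas

# Two hypothetical detections on a 6 x 8 image, each with a 2 x 2 mask output.
boxes = [(1, 1, 5, 5), (5, 0, 8, 2)]
mask_probs = [
    np.array([[0.9, 0.1],
              [0.9, 0.9]]),
    np.ones((2, 2)),
]
instance_map = paste_masks((6, 8), boxes, mask_probs)
```

Letting later detections overwrite earlier ones is a simplification; a real pipeline would typically resolve overlaps by detection score.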