Face Recognition Pipeline Explained Visually

Face recognition is a four-stage pipeline that runs every time you unlock your phone with your face, every time Facebook tags a friend in a photo, and every time a passport gate scans you at the airport. The stages are Detect → Align → Embed → Match. Each stage is a separate component that can be replaced or upgraded independently. Together they turn an image into an answer such as 'this is Alice with 96% confidence.'

Stage 1: Detection

Find the bounding box around every face in the image.

Classic models: MTCNN, RetinaFace, BlazeFace.
Modern: YOLO-Face, custom face detectors trained on the WIDER FACE dataset.

Output per detected face: bounding box (x, y, w, h) + confidence + 5 facial landmarks:
  • Left eye, right eye, nose tip, left mouth corner, right mouth corner.

These 5 landmarks are essential for the next step (alignment).

A typical photo:
  Input image → 1-50 detected faces with landmarks → list of cropped face regions.

Each face becomes a small image patch with known landmark positions
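
The detector output described above can be sketched as a small data structure plus a cropping helper. This is only an illustration: the names DetectedFace and crop_face are hypothetical and do not come from any particular detector library, and the landmark order is assumed to follow the list above.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class DetectedFace:
    # Hypothetical container for one detection (not a real library type).
    box: Tuple[int, int, int, int]  # (x, y, w, h) in image pixels
    confidence: float               # detector score in [0, 1]
    landmarks: np.ndarray           # shape (5, 2): eyes, nose tip, mouth corners


def crop_face(image: np.ndarray, face: DetectedFace):
    """Crop the face patch and shift the landmarks into patch coordinates."""
    x, y, w, h = face.box
    img_h, img_w = image.shape[:2]
    # Clamp the box to the image so a detection near the border still crops.
    x0, y0 = max(x, 0), max(y, 0)
    x1, y1 = min(x + w, img_w), min(y + h, img_h)
    patch = image[y0:y1, x0:x1]
    local_landmarks = face.landmarks - np.array([x0, y0])
    return patch, local_landmarks
```

Keeping the landmarks in patch coordinates means the alignment stage can work on the crop alone, without needing the original image.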

Stage 2: Alignment

Faces in the wild are tilted, rotated, and captured at varying scales and angles.
Alignment normalises every face to a canonical pose:

  1. Take the 5 detected landmarks (eyes, nose, mouth corners).
  2. Compute an affine transform that maps these to a canonical template:
       e.g., eyes always at fixed pixel positions, mouth at a fixed bottom point.
  3. Apply the transform → output is a 112 × 112 (or 160 × 160) RGB face image,
     consistently oriented.

Why this matters: the embedding network can now focus on identity
rather than wasting capacity on rotation/scale invariance.
In practice, alignment measurably improves benchmark accuracy, although the size of the gain depends on the model and the dataset.

Same face in different photos → after alignment, every crop is upright, centred, and at the same scale
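
Steps 1 and 2 of the alignment recipe can be sketched with a least-squares fit. The template coordinates below are illustrative values for a 112 × 112 crop, chosen for this example rather than taken from any specific model, and estimate_affine and apply_affine are hypothetical helper names.

```python
import numpy as np

# Illustrative canonical landmark positions for a 112x112 crop.
# These numbers are an assumption for this sketch, not an official template.
TEMPLATE_112 = np.array([
    [38.0, 52.0],  # left eye
    [74.0, 52.0],  # right eye
    [56.0, 72.0],  # nose tip
    [42.0, 92.0],  # left mouth corner
    [70.0, 92.0],  # right mouth corner
])


def estimate_affine(src, dst):
    """Least-squares 2x3 affine matrix M that maps src points onto dst points."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    ones = np.ones((src.shape[0], 1))
    A = np.hstack([src, ones])                        # (n, 3) homogeneous points
    solution, *_ = np.linalg.lstsq(A, dst, rcond=None)  # (3, 2)
    return solution.T                                 # (2, 3)


def apply_affine(M, points):
    """Apply a 2x3 affine matrix to an (n, 2) array of points."""
    points = np.asarray(points, dtype=float)
    return points @ M[:, :2].T + M[:, 2]
```

The resulting 2 × 3 matrix is the kind of input an image-warping routine expects for step 3, for example OpenCV's cv2.warpAffine(image, M, (112, 112)). Many pipelines restrict the fit to a similarity transform (rotation, uniform scale, translation) so the face is never sheared; the full affine fit is used here only because it is the simplest to write down.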

Stage 3: Embedding

A specialised CNN (e.g., a FaceNet model, or a backbone trained with ArcFace or AdaFace) takes the aligned face
and outputs a fixed-length vector: the FACE EMBEDDING.

  aligned face (112×112×3) → CNN → 512-dim vector

The network is trained with a metric-learning loss:
  • Same person's faces → embeddings close together
  • Different people's faces → embeddings far apart
  • Specifically: ArcFace loss adds a margin to the angle between
    classes, sharply separating identity clusters.
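
The angular margin can be sketched in a few lines of NumPy. This is a simplified forward pass only, assuming one weight vector per identity; the function name arcface_logits is hypothetical, and the margin and scale defaults are commonly quoted values treated here as assumptions. Real implementations also handle the case where the angle plus the margin exceeds π, which is omitted.

```python
import numpy as np


def arcface_logits(embeddings, class_weights, labels, margin=0.5, scale=64.0):
    """Scaled cosine logits with an additive angular margin on the true class."""
    # Normalise both sides so the dot product is the cosine of the angle.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos_theta = np.clip(e @ w.T, -1.0, 1.0)   # (batch, n_classes)
    theta = np.arccos(cos_theta)
    rows = np.arange(len(labels))
    logits = cos_theta.copy()
    # Penalise only the true class: cos(theta + m) instead of cos(theta).
    logits[rows, labels] = np.cos(theta[rows, labels] + margin)
    return scale * logits
```

These logits would then go into an ordinary softmax cross-entropy loss. Because the true class is scored as if its angle were larger than it is, the network has to pull same-identity embeddings closer to their class centre to compensate.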

Properties of a good embedding:
  • Pose-invariant: front, side, slightly tilted faces of one person
    map to nearby vectors
  • Lighting-invariant
  • Robust to aging, glasses, and changes of expression
  • L2-normalised: ‖v‖ = 1, so cosine similarity = dot product.

The embedding IS the face's identity · all comparison happens in this 512-d space
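
The claim that cosine similarity equals the dot product for L2-normalised vectors is easy to check numerically. A minimal sketch, with l2_normalize and cosine_similarity as hypothetical helper names:

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.maximum(norm, eps)  # eps guards against division by zero


def cosine_similarity(a, b):
    """Cosine of the angle between two unnormalised vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

After normalisation, l2_normalize(a) @ l2_normalize(b) gives the same number as cosine_similarity(a, b), which is why the matching stage can use a plain dot product.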

Stage 4: Matching

Given the query embedding e_query, search the database of known faces:

  similarity(e_query, e_db_i) = e_query · e_db_i   (cosine, since vectors are normalised)

  best_match = argmax_i similarity
  confidence = similarity at best_match

Decision:
  • If confidence ≥ threshold (e.g., 0.5) → recognised as that person
  • Else → unknown / reject
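
The search and the accept/reject decision fit in one short function. This sketch assumes the gallery rows and the query are already L2-normalised; the name identify and the parallel names list are hypothetical.

```python
import numpy as np


def identify(query, gallery, names, threshold=0.5):
    """Return (name, confidence), or (None, confidence) when below threshold.

    gallery: (N, d) array of unit-norm embeddings, one row per known face.
    query:   (d,) unit-norm embedding of the face to recognise.
    """
    scores = gallery @ query          # cosine similarity to every known face
    best = int(np.argmax(scores))
    confidence = float(scores[best])
    if confidence >= threshold:
        return names[best], confidence
    return None, confidence           # unknown / reject
```

Returning the confidence even on a reject is a design choice that makes it easier to log near misses when tuning the threshold.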

For scale (millions of faces):
  • Brute force: O(N) — fine for thousands
  • Approximate Nearest Neighbour (ANN): FAISS, ScaNN — sublinear at huge scale
  • Often combined with metadata filters (e.g., 'only employees of this office')
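
A metadata filter combined with a brute-force top-k search might look like the sketch below. The function name top_k_filtered and the boolean allowed_mask are assumptions made for this example; at very large scale an ANN library such as FAISS or ScaNN would take the place of the exhaustive dot product, and its API is deliberately not shown here.

```python
import numpy as np


def top_k_filtered(query, gallery, allowed_mask, k=5):
    """Brute-force top-k search restricted to rows where allowed_mask is True.

    Returns a list of (gallery_index, similarity), best match first.
    Assumes unit-norm embeddings and k >= 1.
    """
    candidate_ids = np.flatnonzero(allowed_mask)
    if candidate_ids.size == 0:
        return []
    scores = gallery[candidate_ids] @ query
    k = min(k, candidate_ids.size)
    # argpartition finds the k best in O(N); only those k are then sorted.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return [(int(candidate_ids[i]), float(scores[i])) for i in top]
```

Filtering before scoring shrinks N, which both speeds up the search and lowers the number of chances for a false match.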

Threshold choice = trade-off:
  • Lower threshold: more faces recognised (true positives ↑) but more false positives
  • Higher threshold: fewer false positives but more false rejections of genuine matches
  • Set per use-case (phone unlock = high threshold; photo tagging = lower)
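
The trade-off can be measured on a labelled validation set by counting errors at each candidate threshold. In this sketch, error_rates is a hypothetical helper, "genuine" scores come from same-person pairs, and "impostor" scores come from different-person pairs.

```python
import numpy as np


def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) at the given threshold."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    # False accept: a different-person pair scores at or above the threshold.
    far = float(np.mean(impostor >= threshold))
    # False reject: a same-person pair scores below the threshold.
    frr = float(np.mean(genuine < threshold))
    return far, frr
```

Sweeping the threshold over a range and picking the point where the false accept rate meets the target for the use-case is one common way to set it; a phone unlock would tolerate a far lower false accept rate than photo tagging.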

Identity = cosine similarity in embedding space · threshold controls accept/reject

1:N vs 1:1 Modes

1:1 verification ('is this Alice?'): compare the query against ONE known embedding. Used for phone unlock and payment authorisation. The threshold is set high because a false accept is dangerous.
1:N identification ('who is this?'): compare the query against a database of N people. Used for photo organisation and surveillance. False positives become more likely as N grows, because every extra comparison is another chance for a false match.
Open-set vs closed-set: in closed-set recognition, the query is always one of the known faces. Open-set recognition must also handle 'this person is unknown', which is much harder and requires careful threshold tuning.
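
How quickly false positives grow with N can be estimated with a one-line formula. The sketch below makes a strong simplifying assumption, namely that each comparison is an independent trial with the same false accept rate; real galleries are not independent, so treat the result as a rough guide. The function name is hypothetical.

```python
def false_match_probability(far_per_comparison: float, gallery_size: int) -> float:
    """Probability of at least one false match among gallery_size comparisons.

    Assumes independent comparisons with an identical false accept rate,
    which is a simplification.
    """
    return 1.0 - (1.0 - far_per_comparison) ** gallery_size
```

Under that assumption, a per-comparison false accept rate of 0.1% is harmless for 1:1 verification but gives roughly a 63% chance of at least one false match against 1,000 enrolled people, which is why 1:N systems need stricter thresholds.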
