Face recognition is a four-stage pipeline that runs every time you unlock your phone with your face, every time Facebook tags a friend in a photo, and every time a passport gate scans you at the airport: Detect → Align → Embed → Match. Each stage is a separate component (detection and embedding are learned models; alignment and matching are geometry and search), and each can be replaced or upgraded independently. Together they turn an image into the answer 'this is Alice with 96% confidence.'
Stage 1: Detection
Find the bounding box around every face in the image.
Classic models: MTCNN, RetinaFace, BlazeFace.
Modern: YOLO-Face, custom face detectors trained on the WIDER FACE dataset.
Output per detected face: bounding box (x, y, w, h) + confidence + 5 facial landmarks:
• Left eye, right eye, nose tip, left mouth corner, right mouth corner.
These 5 landmarks are essential for the next step (alignment).
A typical photo:
Input image → 1-50 detected faces with landmarks → list of cropped face regions.
Each face becomes a small image patch with known landmark positions.
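As a rough illustration of what Stage 1 hands to Stage 2, the sketch below stores one detection and crops confident faces out of the image. The `Detection` container, the `crop_faces` helper, and the `min_confidence` default of 0.9 are hypothetical names and values chosen for this example; they do not come from MTCNN, RetinaFace, or any other detector's API.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Detection:
    """One detected face (hypothetical container, not a library type)."""
    box: Tuple[int, int, int, int]   # (x, y, w, h) in pixels
    confidence: float                # detector score in [0, 1]
    landmarks: np.ndarray            # shape (5, 2): eyes, nose tip, mouth corners


def crop_faces(image: np.ndarray, detections: List[Detection],
               min_confidence: float = 0.9):
    """Return (patch, landmarks relative to the patch) for each confident detection."""
    height, width = image.shape[:2]
    crops = []
    for det in detections:
        if det.confidence < min_confidence:
            continue  # skip weak detections (threshold is an assumed value)
        x, y, w, h = det.box
        # Clamp the box to the image so slicing never goes out of bounds.
        x0, y0 = max(x, 0), max(y, 0)
        x1, y1 = min(x + w, width), min(y + h, height)
        patch = image[y0:y1, x0:x1]
        # Shift landmarks into the patch's own coordinate frame.
        local_landmarks = det.landmarks - np.array([x0, y0])
        crops.append((patch, local_landmarks))
    return crops
```

Keeping the landmarks alongside each patch is what lets the alignment stage work without running a second landmark model.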
Stage 2: Alignment
Faces in the wild are tilted, rotated, off-centre, and captured at varying scales.
Alignment normalises every face to a canonical pose:
1. Take the 5 detected landmarks (eyes, nose, mouth corners).
2. Compute an affine transform that maps these to a canonical template:
e.g., eyes always at fixed pixel positions, mouth at a fixed bottom point.
3. Apply the transform → output is a 112 × 112 (or 160 × 160) RGB face image,
consistently oriented.
Why this matters: the embedding network can now focus on identity
rather than wasting capacity on rotation/scale invariance.
Alignment lifts accuracy by several percent on benchmarks.
Same face in different photos → after alignment, every crop looks as if the person is facing you straight on.
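The three alignment steps can be sketched with NumPy alone. This example fits a similarity transform (rotation, uniform scale, translation), which is a restricted kind of affine transform and one common choice for 5-point alignment; it is not necessarily what any particular library does. The `TEMPLATE_112` coordinates are illustrative placeholders, and both function names are made up for this sketch.

```python
import numpy as np

# Hypothetical canonical landmark positions for a 112 x 112 crop:
# left eye, right eye, nose tip, left mouth corner, right mouth corner.
TEMPLATE_112 = np.array([
    [38.0, 52.0],
    [74.0, 52.0],
    [56.0, 72.0],
    [42.0, 92.0],
    [70.0, 92.0],
])


def estimate_similarity_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares 2x3 matrix M mapping src landmarks onto dst landmarks.

    Model: u = a*x - c*y + tx,  v = c*x + a*y + ty  (no shear, no reflection).
    """
    n = src.shape[0]
    A = np.zeros((2 * n, 4))
    b = dst.reshape(-1)              # interleaved targets [u0, v0, u1, v1, ...]
    A[0::2, 0] = src[:, 0]           # rows for u: [x, -y, 1, 0]
    A[0::2, 1] = -src[:, 1]
    A[0::2, 2] = 1.0
    A[1::2, 0] = src[:, 1]           # rows for v: [y, x, 0, 1]
    A[1::2, 1] = src[:, 0]
    A[1::2, 3] = 1.0
    a, c, tx, ty = np.linalg.lstsq(A, b, rcond=None)[0]
    return np.array([[a, -c, tx],
                     [c, a, ty]])


def apply_transform(M: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Apply a 2x3 transform to an (n, 2) array of points."""
    return points @ M[:, :2].T + M[:, 2]
```

The matrix only moves landmark coordinates here. To produce the aligned image itself, the same 2×3 matrix would be passed to an image-warping routine, for example OpenCV's `cv2.warpAffine(image, M, (112, 112))`.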
Stage 3: Embedding
A specialised CNN (e.g., FaceNet, ArcFace, AdaFace) takes the aligned face
and outputs a fixed-length vector — the FACE EMBEDDING.
aligned face (112×112×3) → CNN → 512-dim vector
The network is trained with a metric-learning loss:
• Same person's faces → embeddings close together
• Different people's faces → embeddings far apart
• Specifically: ArcFace loss adds a margin to the angle between an
embedding and its own class centre, which tightens each identity
cluster and sharply separates it from the others.
Properties of a good embedding:
• Pose-invariant: front, side, slightly tilted faces of one person
map to nearby vectors
• Lighting-invariant
• Aging/glasses/expression robust
• L2-normalised: ‖v‖ = 1, so cosine similarity = dot product.
The embedding IS the face's identity · all comparison happens in this 512-d space
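The L2-normalisation property is easy to check numerically. In this small sketch the helper names `l2_normalise` and `cosine_similarity` are made up for illustration, and random vectors stand in for real embeddings.

```python
import numpy as np


def l2_normalise(v: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors along the last axis to unit length; eps avoids division by zero."""
    norms = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.maximum(norms, eps)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook cosine similarity for two un-normalised vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
a, b = rng.normal(size=512), rng.normal(size=512)   # stand-ins for raw embeddings
a_unit, b_unit = l2_normalise(a), l2_normalise(b)

# After normalisation a plain dot product gives the same number as the cosine.
dot_of_unit_vectors = float(a_unit @ b_unit)
```

Normalising once at enrolment time means every later comparison is a single dot product, which is what makes large-scale search cheap.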
Stage 4: Matching
Given the query embedding e_query, search the database of known faces:
similarity(e_query, e_db_i) = e_query · e_db_i (cosine, since vectors are normalised)
best_match = argmax_i similarity
confidence = similarity at best_match
Decision:
• If confidence ≥ threshold (e.g., 0.5) → recognised as that person
• Else → unknown / reject
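The matching rule and accept/reject decision above can be sketched in a few lines. The function name `identify` and the `names` list are hypothetical, and the default threshold of 0.5 simply reuses the example value from the text. Both the query and the database rows are assumed to be L2-normalised already.

```python
import numpy as np


def identify(query, database, names, threshold=0.5):
    """Return (name, similarity) for the best match, or (None, similarity) if rejected.

    query:    shape (d,), unit length
    database: shape (N, d), one unit-length embedding per known face
    names:    list of N labels, aligned with the database rows
    """
    similarities = database @ query        # cosine similarity for unit vectors
    best = int(np.argmax(similarities))
    confidence = float(similarities[best])
    if confidence >= threshold:
        return names[best], confidence
    return None, confidence                # unknown / reject
```

Returning the similarity even on rejection is a design choice: it lets the caller log near-misses, which helps when tuning the threshold later.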
For scale (millions of faces):
• Brute force: O(N) — fine for thousands
• Approximate Nearest Neighbour (ANN): FAISS, ScaNN — sublinear at huge scale
• Often combined with metadata filters (e.g., 'only employees of this office')
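Here is a sketch of the brute-force O(N) baseline with an optional metadata filter. The name `search_top_k` and the boolean `allowed` mask are assumptions made for this example. It uses exact search in plain NumPy on a non-empty database; an ANN library such as FAISS or ScaNN would replace the scoring step at larger scale.

```python
import numpy as np


def search_top_k(query, database, k=5, allowed=None):
    """Exact top-k search over unit-length embeddings, with an optional filter.

    allowed: boolean mask of shape (N,), True for rows that may be returned
             (for example, 'only employees of this office').
    Returns (indices, similarities), sorted from most to least similar.
    """
    similarities = database @ query                     # one pass over all N rows
    if allowed is not None:
        # Filtered-out rows get -inf so they can never rank in the top k.
        similarities = np.where(allowed, similarities, -np.inf)
    k = min(k, len(similarities))
    # argpartition selects the k largest without fully sorting all N scores.
    top = np.argpartition(-similarities, k - 1)[:k]
    top = top[np.argsort(-similarities[top])]
    # Drop filtered rows that appear only because fewer than k rows were allowed.
    top = top[np.isfinite(similarities[top])]
    return top, similarities[top]
```

Filtering before ranking shrinks the effective gallery, which both speeds up the search and reduces the chance of a false match.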
Threshold choice = trade-off:
• Lower threshold: more recognised (true positives ↑) but more false positives
• Higher threshold: fewer false positives but more rejections
• Set per use-case (phone unlock = high threshold; photo tagging = lower)
Identity = cosine similarity in embedding space · threshold controls accept/reject
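The trade-off can be measured by scoring pairs known to be the same person (genuine) and pairs known to be different people (impostor), then counting errors at each candidate threshold. The function name `error_rates` is hypothetical, and the scores below are invented toy numbers for illustration, not measurements from any system.

```python
import numpy as np


def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) at the given threshold."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    # False accept: an impostor pair scores at or above the threshold.
    false_accept_rate = float(np.mean(impostor >= threshold))
    # False reject: a genuine pair scores below the threshold.
    false_reject_rate = float(np.mean(genuine < threshold))
    return false_accept_rate, false_reject_rate


# Toy similarity scores (made up for this example).
genuine = [0.82, 0.74, 0.65, 0.58, 0.41]
impostor = [0.05, 0.12, 0.33, 0.47, 0.61]

low = error_rates(genuine, impostor, threshold=0.4)    # lenient setting
high = error_rates(genuine, impostor, threshold=0.7)   # strict setting
```

On these toy scores the lenient threshold accepts every genuine pair but also lets two of the five impostors through, while the strict threshold blocks all impostors and rejects three of the five genuine pairs.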
1:N vs 1:1 Modes
- 1:1 verification ('is this Alice?'): compare query against ONE known embedding. Used for phone unlock, payment authorisation. Threshold is high — false accept is dangerous.
- 1:N identification ('who is this?'): compare query against a database of N people. Used for photo organisation, surveillance. False positives become more likely as N grows.
- Open-set vs closed-set: in closed-set, the query is always one of the known faces. Open-set must also handle 'this person is unknown', which is much harder and requires careful threshold tuning.
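A back-of-the-envelope calculation shows why false positives grow with N in 1:N identification. The sketch assumes each of the N comparisons is independent and shares the same per-comparison false accept rate. That is a simplification (real comparisons are correlated), and the function name is made up for this example.

```python
def false_match_probability(per_comparison_far: float, gallery_size: int) -> float:
    """P(at least one false match among N comparisons), assuming independence."""
    return 1.0 - (1.0 - per_comparison_far) ** gallery_size


# The same per-comparison rate, applied to 1:1 and to a large 1:N gallery.
one_to_one = false_match_probability(1e-6, 1)
one_to_million = false_match_probability(1e-6, 1_000_000)
```

Under these assumptions, a one-in-a-million false accept rate is negligible for 1:1 verification but gives roughly a 63% chance of at least one false match against a gallery of a million people. That is why 1:N systems usually need stricter thresholds or metadata filtering.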