A neural network is greedy for data — give it 1,000 photos, it overfits; give it 1,000,000, it generalises. But annotating millions of images is expensive. Image augmentation is the cheap-and-cheerful solution: take each training image and synthesise dozens of valid variations by flipping, rotating, cropping, or perturbing colors. The label stays the same. The model sees a vastly bigger 'effective dataset' and learns to be robust to all the natural variations it'll meet at test time.
Two Families of Augmentations
- Geometric transforms: change the spatial layout (flip, rotate, crop, scale, shear, translate). Cheap, and label-preserving for most tasks. Caveat: not every transform keeps the label valid. A vertical flip changes the meaning of letters and digits, and it turns upright scenes upside down.
- Photometric transforms: change colors and intensities — brightness, contrast, saturation, hue jitter, blur, noise, JPEG compression. Simulate different lighting and camera conditions.
- Modern (advanced), building on both families: MixUp (blend two images linearly), CutMix (paste a rectangle from one image into another), RandAugment (N randomly selected transforms applied at a shared magnitude M), AutoAugment (learned augmentation policies).
Common Geometric Transforms
Each is applied with some probability per training example:
Horizontal flip (50%): pixel[y, x] → pixel[y, W - 1 - x]
Generally safe for natural objects. Skip for text, signs, faces with asymmetric features, or any task where left vs right matters.
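The index mapping above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming images are stored as (H, W) or (H, W, C) arrays; the function names are hypothetical, not taken from any library.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an (H, W) or (H, W, C) image left-to-right."""
    # Reversing the width axis sends column x to column W - 1 - x.
    return image[:, ::-1].copy()

def random_horizontal_flip(image, rng, p=0.5):
    """Flip with probability p, otherwise return the image unchanged."""
    return horizontal_flip(image) if rng.random() < p else image

rng = np.random.default_rng(0)
image = np.arange(2 * 4 * 3).reshape(2, 4, 3)  # toy 2 x 4 RGB image
flipped = horizontal_flip(image)
maybe_flipped = random_horizontal_flip(image, rng, p=0.5)
```

Flipping twice returns the original image, which is a handy sanity check when wiring this into a data loader.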
Random crop: sample a c × c crop (c ≤ min(H, W)) from the H × W image, then resize it back to the input size
Forces the network to recognise objects from partial views.
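A random crop followed by a resize might look like the sketch below. It assumes (H, W, C) NumPy arrays and uses nearest-neighbour resizing only to stay dependency-free; real pipelines usually resize with bilinear or bicubic interpolation. Both function names are illustrative.

```python
import numpy as np

def random_crop(image, crop_size, rng):
    """Cut a random crop_size x crop_size window out of an (H, W, C) image."""
    h, w = image.shape[:2]
    if crop_size > min(h, w):
        raise ValueError("crop_size must fit inside the image")
    top = int(rng.integers(0, h - crop_size + 1))   # upper bound is exclusive
    left = int(rng.integers(0, w - crop_size + 1))
    return image[top:top + crop_size, left:left + crop_size]

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize (an assumption made for simplicity)."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h            # source row per output row
    cols = np.arange(out_w) * w // out_w            # source column per output column
    return image[rows[:, None], cols]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 48, 3), dtype=np.uint8)
crop = random_crop(image, 24, rng)
restored = resize_nearest(crop, 32, 48)             # back to the original H x W
```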
Rotation: rotate by angle ∈ [−15°, +15°]
Typically uses bilinear interpolation; the exposed corners are filled with black or padded (e.g. by reflection).
Random resized crop: crop random region with random aspect ratio, resize to fixed size
Standard in ResNet/ViT training recipes: a strong augmentation that is very effective in practice.
Translate / Shear / Scale: small affine perturbations.
Each transform is parameter-light · effects compound when stacked
Common Photometric Transforms
Brightness / Contrast: x → α x + β with contrast gain α ∈ [0.8, 1.2] and brightness offset β ∈ [−20, 20] (for 8-bit pixel values; clip the result back to [0, 255])
Simulates lighting conditions.
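As a rough sketch, the affine intensity map can be applied as follows. The code assumes 8-bit images in [0, 255] and clips the result so it stays a valid image; the function name and default ranges simply mirror the formula above.

```python
import numpy as np

def random_brightness_contrast(image, rng,
                               alpha_range=(0.8, 1.2),
                               beta_range=(-20.0, 20.0)):
    """Apply x -> alpha * x + beta to a uint8 image, then clip to [0, 255]."""
    alpha = rng.uniform(*alpha_range)   # contrast gain
    beta = rng.uniform(*beta_range)     # brightness offset
    out = alpha * image.astype(np.float32) + beta
    return np.clip(out, 0, 255).round().astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 16, 3), dtype=np.uint8)
jittered = random_brightness_contrast(image, rng)
```

Working in float and converting back at the end avoids the wrap-around that uint8 arithmetic would otherwise produce.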
Color jitter: random multiplicative gain applied to each RGB channel independently.
Simulates different cameras, white balance, time of day.
Hue shift: rotate hue in HSV space.
Object stays recognisable but color cast changes.
Gaussian noise: add independent per-pixel noise drawn from N(0, σ²).
Simulates sensor noise in low light.
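A minimal version of additive Gaussian noise, again assuming uint8 images; the default sigma of 10 grey levels is an arbitrary illustrative choice, not a recommended setting.

```python
import numpy as np

def add_gaussian_noise(image, rng, sigma=10.0):
    """Add independent zero-mean Gaussian noise to every pixel value."""
    noise = rng.normal(0.0, sigma, size=image.shape)   # sigma is the standard deviation
    out = image.astype(np.float32) + noise
    return np.clip(out, 0, 255).round().astype(np.uint8)

rng = np.random.default_rng(0)
image = np.full((64, 64, 3), 128, dtype=np.uint8)      # flat mid-grey test image
noisy = add_gaussian_noise(image, rng, sigma=10.0)
residual_std = float((noisy.astype(np.float32) - 128.0).std())
```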
Gaussian blur: convolve with a Gaussian kernel.
Simulates out-of-focus blur; only a rough stand-in for motion blur, which is directional.
JPEG compression: re-encode at low quality.
Simulates downloaded/compressed images at inference.
Cutout: erase a random rectangular region.
Forces robustness to occlusion.
Photometric augs simulate the camera/environment your model will see at deploy time
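Cutout, described above, is simple enough to sketch directly. The rectangle size bound (at most half of each side) and the zero fill value are assumptions made for this example, not values given in the text.

```python
import numpy as np

def cutout(image, rng, max_frac=0.5, fill=0):
    """Erase one random rectangle of an (H, W, C) image and return a copy."""
    h, w = image.shape[:2]
    # Rectangle height/width between 1 pixel and max_frac of the image side.
    eh = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    ew = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    top = int(rng.integers(0, h - eh + 1))
    left = int(rng.integers(0, w - ew + 1))
    out = image.copy()                                 # leave the input untouched
    out[top:top + eh, left:left + ew] = fill
    return out

rng = np.random.default_rng(0)
image = np.full((32, 32, 3), 200, dtype=np.uint8)
erased = cutout(image, rng)
mask = (erased == 0).all(axis=-1)                      # True where pixels were erased
```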
Modern Augmentations: MixUp, CutMix, RandAugment
- MixUp: take two images x_a, x_b with labels y_a, y_b. Mix: x = λ·x_a + (1-λ)·x_b, y = λ·y_a + (1-λ)·y_b. The model learns smooth decision boundaries by training on linear blends.
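- CutMix: paste a random rectangle from x_b into x_a; the label is mixed by area, y = λ·y_a + (1-λ)·y_b, where λ is the fraction of pixels still coming from x_a. Often better than MixUp for object-centric tasks because every region remains an unblended, natural image patch.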
- RandAugment: for each image, pick N transforms uniformly at random from a fixed list (translate, rotate, color jitter, ...) and apply them all at one shared magnitude M. Just two hyperparameters (N, M) instead of tuning each transform individually. Used in EfficientNet and ViT training recipes.
- AutoAugment: learn the augmentation policy itself, using reinforcement learning to search over transforms, probabilities, and magnitudes. Powerful, but the search is compute-intensive.
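The MixUp and CutMix formulas above can be sketched for a single pair of images as follows. Two assumptions are not stated in the text: labels are one-hot vectors, and λ is drawn from a Beta(α, α) distribution, which is the usual choice in the original methods. The function names are illustrative.

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, rng, alpha=0.2):
    """Blend two images and their one-hot labels with the same weight lam."""
    lam = float(rng.beta(alpha, alpha))                # assumed sampling scheme
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y, lam

def cutmix(x_a, y_a, x_b, y_b, rng, alpha=1.0):
    """Paste a random box from x_b into x_a; mix labels by the pasted area."""
    h, w = x_a.shape[:2]
    lam = float(rng.beta(alpha, alpha))
    # Box covering roughly a (1 - lam) fraction of the image, at a random centre.
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    x = x_a.copy()
    x[top:bottom, left:right] = x_b[top:bottom, left:right]
    # The box may be clipped at the border, so recompute lam from its true area.
    lam = 1.0 - (bottom - top) * (right - left) / (h * w)
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y, lam

rng = np.random.default_rng(0)
x_a, x_b = np.zeros((8, 8, 3)), np.ones((8, 8, 3))
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix, lam_mix = mixup(x_a, y_a, x_b, y_b, rng)
x_cut, y_cut, lam_cut = cutmix(x_a, y_a, x_b, y_b, rng)
```

In both cases the mixed label still sums to 1, so it can be fed to a cross-entropy loss that accepts soft targets.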
Augmentation Pipelines (Practical)
A typical training pipeline (PyTorch torchvision style):
RandomResizedCrop(224, scale=(0.08, 1.0))
RandomHorizontalFlip(p=0.5)
ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4)
RandAugment(num_ops=2, magnitude=9)
ToTensor()
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
RandomErasing(p=0.25)
Applied ONLY at training time: each image gets a fresh random version every time it is loaded, so the model rarely sees exactly the same pixels twice.
Evaluation pipeline is much simpler:
Resize(256)
CenterCrop(224)
ToTensor()
Normalize(...)
No randomness at evaluation = reproducible, deterministic predictions.
Train: heavy random augs · Eval: minimal deterministic preprocessing