Image Augmentation - Visual AI Lesson

A free visual AI and machine learning lesson with an interactive 3D visualization, plain-English theory, and quiz.

Neural networks are data-hungry: trained on 1,000 photos a model tends to overfit, while trained on 1,000,000 it usually generalises far better. But annotating millions of images is expensive. Image augmentation is the cheap-and-cheerful solution: take each training image and synthesise dozens of valid variations by flipping, rotating, cropping, or perturbing colors. The label stays the same. The model sees a much larger 'effective dataset' and becomes more robust to the natural variations it will meet at test time.

Two Families of Augmentations

Geometric transforms: change the spatial layout (flip, rotate, crop, scale, shear, translate). Cheap and label-preserving for most tasks. Caveat: a vertical flip changes the meaning when orientation carries information, such as letters, digits, or a landscape-vs-portrait label.
Photometric transforms: change colors and intensities — brightness, contrast, saturation, hue jitter, blur, noise, JPEG compression. Simulate different lighting and camera conditions.
Beyond these two families, modern (advanced) methods: MixUp (blend two images linearly), CutMix (paste a rectangle from one image into another), RandAugment (N randomly selected transforms applied at a shared magnitude M), AutoAugment (learned augmentation policies).

Common Geometric Transforms

Each is applied with some probability per training example:

  Horizontal flip (50%):  pixel[y, x] → pixel[y, W - 1 - x]
    Usually safe for natural objects. Skip it when left/right matters: text, signs, faces with asymmetric features.

  Random crop:            sample crop of size c × c from the H × W image, then resize back
    Forces the network to recognise objects from partial views.

  Rotation:               rotate by angle ∈ [−15°, +15°]
    Typically resampled with bilinear interpolation; the corners exposed by the rotation are filled with black or with padding.

  Random resized crop:    crop random region with random aspect ratio, resize to fixed size
    Used by ResNet/ViT training — strong augmentation, very effective.

  Translate / Shear / Scale: small affine perturbations.

Each transform is parameter-light · effects compound when stacked
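
As a rough illustration of the flip and crop rules above, here is a minimal NumPy sketch. The function names (horizontal_flip, random_crop, augment_geometric) and the toy 6 × 8 image are made up for this example, and the "resize back" step after the crop is left out to keep the code dependency-free.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror an H x W x C image left to right: pixel[y, x] -> pixel[y, W - 1 - x]."""
    return img[:, ::-1]

def random_crop(img, c, rng):
    """Cut a c x c window at a uniformly random position (no resize step here)."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - c + 1))
    left = int(rng.integers(0, w - c + 1))
    return img[top:top + c, left:left + c]

def augment_geometric(img, c, rng, p_flip=0.5):
    """Flip with probability p_flip, then take a random crop."""
    if rng.random() < p_flip:
        img = horizontal_flip(img)
    return random_crop(img, c, rng)

rng = np.random.default_rng(0)
img = np.arange(6 * 8 * 3, dtype=np.uint8).reshape(6, 8, 3)  # toy 6 x 8 RGB image
out = augment_geometric(img, c=4, rng=rng)
print(out.shape)
```

Stacking the two transforms is what makes the effects compound: each call draws a new flip decision and a new crop position, so the same source image can yield many distinct training views.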

Common Photometric Transforms

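  Brightness / Contrast: x → α x + β  with α ∈ [0.8, 1.2], β ∈ [−20, 20]  (pixel values on a 0–255 scale; α scales contrast, β shifts brightness; clip the result back to [0, 255])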
    Simulates lighting conditions.

  Color jitter: random multiplicative gain applied to each RGB channel independently.
    Simulates different cameras, white balance, time of day.

  Hue shift: rotate hue in HSV space.
    Object stays recognisable but color cast changes.

  Gaussian noise: add per-pixel noise N(0, σ²), i.e. zero mean and standard deviation σ.
    Simulates sensor noise in low light.

  Gaussian blur: convolve with a Gaussian kernel.
    Approximates out-of-focus blur (motion blur is directional and needs a line-shaped kernel instead).

  JPEG compression: re-encode at low quality.
    Simulates downloaded/compressed images at inference.

  Cutout: erase a random rectangular region (an occlusion rather than a color change, but usually grouped with these pixel-level transforms).
    Forces robustness to occlusion.

Photometric augs simulate the camera/environment your model will see at deploy time
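
Three of the transforms above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: images are uint8 arrays on a 0–255 scale, results are clipped back into range, and the helper names (brightness_contrast, gaussian_noise, cutout) and the σ = 5 noise level are invented for the example.

```python
import numpy as np

def brightness_contrast(img, alpha, beta):
    """Apply x -> alpha * x + beta on a 0..255 image and clip back into range."""
    out = alpha * img.astype(np.float32) + beta
    return np.clip(out, 0, 255).astype(np.uint8)

def gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma, then clip."""
    noise = rng.normal(0.0, sigma, size=img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def cutout(img, size, rng):
    """Zero out a size x size square at a random position (works on a copy)."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    out = img.copy()
    out[top:top + size, left:left + size] = 0
    return out

rng = np.random.default_rng(0)
img = np.full((8, 8, 3), 100, dtype=np.uint8)  # flat gray toy image

alpha = rng.uniform(0.8, 1.2)   # contrast factor, range taken from the text
beta = rng.uniform(-20, 20)     # brightness offset, range taken from the text
bright = brightness_contrast(img, alpha, beta)
noisy = gaussian_noise(img, sigma=5.0, rng=rng)  # sigma = 5 is an arbitrary choice
holed = cutout(img, size=3, rng=rng)
```

Clipping matters: without it, values pushed past 255 or below 0 would wrap around when cast back to uint8 and produce speckle artifacts that no real camera generates.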

Modern Augmentations: MixUp, CutMix, RandAugment

MixUp: take two images x_a, x_b with labels y_a, y_b. Mix: x = λ·x_a + (1-λ)·x_b, y = λ·y_a + (1-λ)·y_b, where λ ∈ [0, 1] is typically drawn from a Beta(α, α) distribution. Training on linear blends encourages smoother decision boundaries.
CutMix: paste a random rectangle from x_b into x_a; the label is mixed in proportion to the area each image contributes. Often works better than MixUp for object-centric tasks because it preserves clean image patches.
RandAugment: pick N transforms at random from a list (translate, rotate, color jitter, ...) and apply them all at a shared magnitude M. Just two hyperparameters (N, M) instead of tuning each transform individually. Used in EfficientNet and ViT training recipes.
AutoAugment: learns the augmentation policy itself via reinforcement learning. Powerful, but the policy search is compute-intensive.
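
The MixUp and CutMix formulas can be written out directly. The sketch below assumes one-hot label vectors, uses a fixed rectangle for CutMix so the area ratio is easy to check by hand, and draws λ from Beta(0.2, 0.2); that parameter value and the function names are assumptions made for this example.

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two images and their one-hot labels with the same weight lam."""
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y

def cutmix(x_a, y_a, x_b, y_b, top, left, height, width):
    """Paste a rectangle of x_b into x_a; mix labels by the pasted area fraction."""
    x = x_a.copy()
    x[top:top + height, left:left + width] = x_b[top:top + height, left:left + width]
    frac_b = (height * width) / (x_a.shape[0] * x_a.shape[1])
    y = (1.0 - frac_b) * y_a + frac_b * y_b
    return x, y

rng = np.random.default_rng(0)
x_a = np.zeros((8, 8, 3), dtype=np.float32)  # toy "image" of class 0
x_b = np.ones((8, 8, 3), dtype=np.float32)   # toy "image" of class 1
y_a = np.array([1.0, 0.0])
y_b = np.array([0.0, 1.0])

lam = rng.beta(0.2, 0.2)  # assumed Beta(0.2, 0.2) for illustration
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, lam)

# A 4 x 4 patch covers 16 of 64 pixels, so class 1 receives a label weight of 0.25.
x_cut, y_cut = cutmix(x_a, y_a, x_b, y_b, top=2, left=2, height=4, width=4)
print(y_cut)
```

In both cases the label is a soft distribution rather than a single class, so the loss has to accept soft targets (for example, cross-entropy against the mixed vector).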

Augmentation Pipelines (Practical)

A typical training pipeline (PyTorch torchvision style):

  RandomResizedCrop(224, scale=(0.08, 1.0))
  RandomHorizontalFlip(p=0.5)
  ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4)
  RandAugment(num_ops=2, magnitude=9)
  ToTensor()
  Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
  RandomErasing(p=0.25)

Applied ONLY at training time: each image gets a fresh random variant every time it is loaded, so every epoch sees different versions of the same data.
Evaluation pipeline is much simpler:

  Resize(256)
  CenterCrop(224)
  ToTensor()
  Normalize(...)

No randomness at evaluation = reproducible, deterministic predictions.

Train: heavy random augs · Eval: minimal deterministic preprocessing
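
The train/eval split can be illustrated without torchvision. The following NumPy sketch is a simplified stand-in, not the pipeline above: it skips the Resize step, uses only a random crop and flip for training, and the names train_view and eval_view are hypothetical. The mean and std values are the ones from the Normalize line.

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(img_u8):
    """Scale a uint8 H x W x 3 image to [0, 1], then standardize per channel."""
    return (img_u8.astype(np.float32) / 255.0 - MEAN) / STD

def center_crop(img, c):
    """Take the central c x c window; no randomness involved."""
    h, w = img.shape[:2]
    top, left = (h - c) // 2, (w - c) // 2
    return img[top:top + c, left:left + c]

def train_view(img, c, rng):
    """Random crop plus coin-flip mirror: draws new random choices on each call."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - c + 1))
    left = int(rng.integers(0, w - c + 1))
    crop = img[top:top + c, left:left + c]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    return normalize(crop)

def eval_view(img, c):
    """Center crop plus normalization: the same result on every call."""
    return normalize(center_crop(img, c))

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # toy random image

a = eval_view(img, 24)
b = eval_view(img, 24)
print(np.array_equal(a, b))  # eval preprocessing is repeatable
t = train_view(img, 24, rng)
```

Because eval_view takes no random generator at all, its determinism is structural rather than a matter of fixing a seed.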
