Diffusion models are the engine behind DALL-E, Stable Diffusion, Midjourney, and Sora. The idea is brilliantly counterintuitive: train a model to reverse the process of gradually adding noise to an image. Once trained, it can start from pure noise and run the reverse process to generate a new image. Unlike GANs, training is stable. Unlike VAEs, samples are sharp and photorealistic. The cost: sampling is iterative, so one image takes tens to hundreds of network forward passes (vs. one for a GAN).
Forward process — simple, no learning
Add Gaussian noise at each step:
x_t = √(1 − β_t) · x_{t−1} + √β_t · ε, where ε ~ N(0, I) and β_t ∈ (0, 1) is the noise variance added at step t
After T = 1000 steps with small β_t, x_T is essentially pure noise.
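Closed form: x_t = √(α̃_t) · x_0 + √(1 − α̃_t) · ε, where α̃_t = (1 − β_1) · (1 − β_2) · ... · (1 − β_t) is fixed by the β schedule. Any x_t can therefore be sampled directly from x_0 in one step.
Forward = mechanical · no learning needed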
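The closed form can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear β schedule from 1e-4 to 0.02, the helper names `linear_beta_schedule` and `q_sample`, and the 8×8 random array standing in for an image are all assumptions made for the example.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """beta_1..beta_T, linearly spaced (endpoint values are assumed here)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight from x_0 to x_t (t is 1-indexed) using the closed form."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)        # running product of (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # stand-in "image" scaled to [-1, 1]
x_mid, _ = q_sample(x0, 250, alpha_bar, rng)
x_T, _ = q_sample(x0, 1000, alpha_bar, rng)

# The weight on x_0 shrinks toward 0 as t grows, so x_T is almost all noise.
print(np.sqrt(alpha_bar[0]), np.sqrt(alpha_bar[-1]))
```

Because the signal weight √(α̃_t) only shrinks with t, training never has to simulate the chain step by step: one call produces a noisy sample at any depth.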
Reverse process — train a UNet
Train a network ε_θ(x_t, t) to predict the noise that was added at step t.
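Loss = E || ε − ε_θ(x_t, t) ||², averaged over training images x_0, steps t drawn uniformly from 1..T, and noise ε ~ N(0, I), with x_t built from x_0 and ε via the closed form.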
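One way to picture the training objective is as a Monte Carlo estimate over a batch. The sketch below assumes flattened inputs of shape (B, D), a linear β schedule, and a hypothetical stand-in `zero_predictor` in place of a real network; in practice ε_θ would be a UNet and the loss would be minimized with an autodiff framework.

```python
import numpy as np

def noise_prediction_loss(eps_theta, x0, alpha_bar, rng):
    """Estimate E||eps - eps_theta(x_t, t)||^2 on a batch x0 of shape (B, D)."""
    B = x0.shape[0]
    t = rng.integers(1, len(alpha_bar) + 1, size=B)   # one random step per sample
    eps = rng.standard_normal(x0.shape)               # the noise to be predicted
    a = alpha_bar[t - 1][:, None]                     # shape (B, 1) for broadcasting
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps    # closed-form forward jump
    return np.mean(np.sum((eps - eps_theta(x_t, t)) ** 2, axis=1))

def zero_predictor(x_t, t):
    """Hypothetical placeholder network: always predicts zero noise."""
    return np.zeros_like(x_t)

betas = np.linspace(1e-4, 0.02, 1000)                 # assumed schedule
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4096, 16))                  # stand-in data, D = 16
loss = noise_prediction_loss(zero_predictor, x0, alpha_bar, rng)
print(loss)   # close to 16, because E||eps||^2 equals the dimension D
```

The zero predictor gives a useful baseline: a trained network should drive the loss well below the data dimension.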
At sampling time:
• Start: x_T ~ N(0, I) (pure noise)
• For t = T, T−1, ..., 1:
    predicted noise: ε̂ = ε_θ(x_t, t)
    x_{t−1} = (x_t − β_t / √(1 − α̃_t) · ε̂) / √(1 − β_t) + σ_t · z, with z ~ N(0, I) for t > 1 and z = 0 at t = 1 (a common choice is σ_t² = β_t)
• Output: x_0 (sampled image)
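The sampling loop above can be sketched as follows. Since no trained network is available here, the example substitutes a hypothetical `oracle_eps`: for a toy dataset consisting of a single point `mu`, the added noise can be recovered exactly from x_t, so the loop should land on `mu`. The linear β schedule and the choice σ_t² = β_t are assumptions as well.

```python
import numpy as np

def ddpm_sample(eps_theta, shape, betas, rng):
    """Ancestral sampling: start from noise and denoise for t = T, ..., 1."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                    # x_T: pure noise
    for t in range(len(betas), 0, -1):
        i = t - 1                                     # 0-based index for step t
        eps_hat = eps_theta(x, t)
        # Remove the predicted noise, then undo the sqrt(1 - beta_t) shrink.
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        if t > 1:
            x = mean + np.sqrt(betas[i]) * rng.standard_normal(shape)  # sigma_t^2 = beta_t
        else:
            x = mean                                  # no noise on the final step
    return x

betas = np.linspace(1e-4, 0.02, 1000)                 # assumed schedule
alpha_bar = np.cumprod(1.0 - betas)
mu = np.array([0.5, -0.25, 1.0])                      # toy "dataset": one point

def oracle_eps(x_t, t):
    """Hypothetical perfect predictor for data concentrated at mu."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * mu) / np.sqrt(1.0 - a)

sample = ddpm_sample(oracle_eps, mu.shape, betas, np.random.default_rng(0))
print(sample)
```

With a real model, `oracle_eps` would be replaced by the trained ε_θ and `shape` by the image shape; the loop itself stays the same.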
Neural net: usually a UNet with attention layers + time conditioning.
T = 1000 in the original paper; modern samplers (DDIM, DPM-Solver) reach comparable quality in 20-50 steps.
Predict the noise · subtract a bit · iterate
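To illustrate how a faster sampler can skip steps, here is a sketch of the deterministic DDIM update (the η = 0 variant) on an evenly spaced subset of the 1000 training steps. The 50-step spacing, the linear β schedule, and the hypothetical `oracle_eps` (exact for a toy dataset made of the single point `mu`) are assumptions for the example.

```python
import numpy as np

def ddim_sample(eps_theta, shape, alpha_bar, num_steps, rng):
    """Deterministic DDIM sampling on a subsequence of the training steps."""
    T = len(alpha_bar)
    steps = np.linspace(T, 1, num_steps).round().astype(int)   # T, ..., 1
    x = rng.standard_normal(shape)                             # x_T: pure noise
    for k, t in enumerate(steps):
        a_t = alpha_bar[t - 1]
        # Signal level of the next visited step; 1.0 means "fully clean".
        a_prev = alpha_bar[steps[k + 1] - 1] if k + 1 < len(steps) else 1.0
        eps_hat = eps_theta(x, t)
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)   # implied clean image
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return x

betas = np.linspace(1e-4, 0.02, 1000)                 # assumed schedule
alpha_bar = np.cumprod(1.0 - betas)
mu = np.array([0.5, -0.25, 1.0])                      # toy "dataset": one point

def oracle_eps(x_t, t):
    """Hypothetical perfect predictor for data concentrated at mu."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * mu) / np.sqrt(1.0 - a)

sample = ddim_sample(oracle_eps, mu.shape, alpha_bar, 50, np.random.default_rng(0))
print(sample)
```

Each iteration costs one network call, so this run uses 50 calls instead of 1000. With a learned ε_θ the result is approximate, and the step count becomes a trade-off between quality and speed.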
Why diffusion works so well
- Stable training: just predict noise — no two-player game like GANs.
- High fidelity: each step makes only a small correction, so no single prediction has to get the whole image right, and later steps can repair earlier mistakes.
- Diversity: different noise seeds → different images.
- Conditioning: add text, class, or image guidance to ε_θ → DALL-E, Stable Diffusion.
- The trade-off is slow inference: each image needs N forward passes through the UNet, where N is the number of sampling steps (mitigated by latent diffusion + faster solvers).