Diffusion Models


“Learning to reverse noise and recover data.”


🌟 What Are Diffusion Models? #

Diffusion Models are a class of generative models that work by:

  1. Gradually adding noise to data until it becomes pure noise (the forward process).
  2. Learning to reverse this noise to reconstruct the original data (the reverse process).

This allows the model to generate new data from noise, like creating a painting from a blank canvas of static.


🧠 Intuition #

Imagine starting with a high-quality image:

  • You corrupt it step-by-step with increasing noise until it becomes unrecognizable.
  • Then you train a model to reverse this process, predicting how to “de-noise” it, step by step.

Eventually, the model can start from pure noise and work backward to generate entirely new data that looks realistic.

This process is probabilistic, iterative, and powerful.


🧪 How It Works (Step-by-Step) #

1. Forward Process (Diffusion) #

We slowly add Gaussian noise to an image over $T$ steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\,x_{t-1},\ \beta_t I\right)$$

where $\beta_t$ is the variance of the noise added at step $t$.

Eventually:

  • $x_T \sim \mathcal{N}(0, I)$ → almost pure noise.
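
A handy property of this process is that $x_t$ can be sampled directly from $x_0$ in closed form: with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, we get $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$. Here is a minimal PyTorch sketch of that one-shot forward step (the schedule endpoints follow the DDPM paper, but everything here is illustrative):

```python
import torch

# Linear noise schedule (endpoint values from the DDPM paper)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # alpha-bar_t = product of alphas up to t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) directly:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, 1, 1, 1)    # broadcast over (C, H, W)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

# Noise a batch of stand-in "images" to random timesteps
x0 = torch.randn(4, 3, 32, 32)
xt = q_sample(x0, torch.randint(0, T, (4,)))
```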

2. Reverse Process (Denoising) #

We train a neural network $\epsilon_\theta(x_t, t)$ to predict the noise added at each step.
This gives us:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

By sampling this chain in reverse (from $x_T$ down to $x_0$), we can generate new data.
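
Here is a hedged sketch of one ancestral sampling step, reusing `betas`, `alphas`, `alpha_bars`, and `T` from the forward-process sketch above; `model` stands for a trained noise predictor $\epsilon_\theta$, and $\sigma_t^2 = \beta_t$ is one common variance choice:

```python
import torch

@torch.no_grad()
def p_sample(model, xt, t):
    """One reverse step: predict the noise, form the posterior mean, add noise."""
    beta, alpha, alpha_bar = betas[t], alphas[t], alpha_bars[t]
    t_batch = torch.full((xt.shape[0],), t, dtype=torch.long)
    eps = model(xt, t_batch)                               # eps_theta(x_t, t)
    mean = (xt - beta / (1.0 - alpha_bar).sqrt() * eps) / alpha.sqrt()
    if t == 0:
        return mean                                        # final step: no noise added
    return mean + beta.sqrt() * torch.randn_like(xt)       # sigma_t = sqrt(beta_t)

@torch.no_grad()
def sample(model, shape):
    """Run the chain from pure noise x_T all the way down to x_0."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        x = p_sample(model, x, t)
    return x
```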


πŸ—οΈ Architecture #

Typically, a U-Net architecture is used (especially for images; see the toy sketch after this list), which:

  • Processes the noisy image and time step
  • Predicts the noise at each stage
  • Uses skip connections to retain spatial detail
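
Below is a toy PyTorch stand-in that shows those three ideas in miniature. Real diffusion U-Nets use sinusoidal time embeddings, several resolutions, and attention; every name here is illustrative:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy noise predictor: downsample once, condition on t, upsample with a skip."""
    def __init__(self, ch=64):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, ch), nn.SiLU(), nn.Linear(ch, ch))
        self.inp = nn.Conv2d(3, ch, 3, padding=1)               # full-resolution features
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # halve the resolution
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.out = nn.Conv2d(ch, 3, 3, padding=1)               # predicted noise

    def forward(self, x, t):
        temb = self.time_mlp(t.float().view(-1, 1) / 1000.0)    # crude timestep encoding
        h0 = torch.relu(self.inp(x))
        h = torch.relu(self.down(h0))
        h = torch.relu(self.mid(h) + temb[:, :, None, None])    # inject the timestep
        h = torch.relu(self.up(h))
        return self.out(h + h0)                                 # skip keeps spatial detail
```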

πŸ–ΌοΈ Visual Summary #

```text
Forward Process (add noise):
Image → slightly noisy → more noise → ... → pure noise

Reverse Process (remove noise):
Pure noise → slightly clearer → more details → final image
```

🧠 Loss Function #

The training goal is to predict the noise added during the forward process.
So, we minimize the difference between the actual noise and the predicted noise:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x, t, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]$$

This objective is closely related to denoising score matching.
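
One training step of this simplified objective looks like the following sketch (reusing `q_sample` and `T` from the forward-process example; `model` is any noise predictor with the signature used above):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0):
    """Simplified DDPM objective: sample t and eps, noise x0, regress the noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                  # the noise the model must recover
    xt = q_sample(x0, t, noise=eps)                             # closed-form forward process
    return F.mse_loss(model(xt, t), eps)                        # || eps - eps_theta(x_t, t) ||^2
```

A training loop then simply calls this on each batch and backpropagates as usual.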


🔧 Example Libraries #

  • 🤗 Hugging Face Diffusers
  • OpenAI GLIDE, DALL·E 2
  • Stability AI’s Stable Diffusion
  • Google’s Imagen

| Model | Description |
| --- | --- |
| DDPM | Denoising Diffusion Probabilistic Models |
| Stable Diffusion | Text-to-image diffusion with latent space |
| GLIDE | Guided Language-to-Image Diffusion |
| Imagen | High-fidelity image generation by Google |
| Latent Diffusion Models (LDMs) | Run diffusion in a compressed (latent) space |

💡 Real-World Use Cases #

🎨 1. Image Generation #

  • Generate high-quality images from text (e.g., β€œA dragon flying over Tokyo”).
  • E.g., Stable Diffusion, DALL·E 2, Midjourney

πŸ§‘β€πŸŽ¨ 2. Inpainting & Editing #

  • Fill missing parts in images (e.g., removing objects or painting new ones).

📹 3. Video Generation (in progress) #

  • Research into temporal diffusion models for video is advancing rapidly.

πŸ–ΌοΈ 4. Super-Resolution #

  • Enhance low-res images with photorealistic details.

🧪 5. Scientific Applications #

  • Molecule generation, protein folding, etc.

✅ Advantages #

| Feature | Why it matters |
| --- | --- |
| High sample quality | Rivals or exceeds GANs in realism |
| Stable training | No adversarial loss = fewer training issues |
| Diverse outputs | Different samples from the same prompt |
| Interpretability | Each generation step is explicit and guided |

⚠️ Limitations #

| Challenge | Description |
| --- | --- |
| Slow sampling | Multiple steps (50–1000) per image |
| Computational cost | Large models and long training times |
| Complexity | Requires careful tuning and understanding |

Solutions like Latent Diffusion address speed by running the diffusion process in a lower-dimensional latent space (see the sketch below).
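
As a rough illustration of the savings, here is a hedged sketch using the VAE behind Stable Diffusion via 🤗 Diffusers (the checkpoint name and the 0.18215 scaling factor are the standard Stable Diffusion v1 values): a 512×512×3 image becomes a 64×64×4 latent, roughly 48× fewer values for the diffusion loop to process.

```python
import torch
from diffusers import AutoencoderKL

# Stable Diffusion's VAE: 512x512x3 pixels -> 64x64x4 latents (~48x compression)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

images = torch.randn(1, 3, 512, 512)                   # stand-in for real images in [-1, 1]
with torch.no_grad():
    latents = vae.encode(images).latent_dist.sample() * 0.18215
    print(latents.shape)                               # torch.Size([1, 4, 64, 64])
    recon = vae.decode(latents / 0.18215).sample       # back to pixel space
```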


πŸ” Diffusion vs Other Generative Models #

| Feature | Diffusion Models | GANs | VAEs | Autoregressive |
| --- | --- | --- | --- | --- |
| Training Stability | ✅ Very stable | ❌ Often unstable | ✅ Stable | ✅ Stable |
| Output Quality | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| Sampling Speed | ❌ Slow | ✅ Fast | ✅ Fast | ❌ Slow |
| Interpretability | ✅ High | ❌ Low | ✅ Medium | ✅ High |

🧠 Summary #

| Concept | Description |
| --- | --- |
| Forward process | Add noise to the image step-by-step |
| Reverse process | Learn to remove noise and recover the image |
| Training goal | Predict the noise added at each step |
| Output | High-quality, diverse data (especially images) |
| Best for | Text-to-image generation, super-resolution, inpainting |

📌 Bonus: Code Example (Using 🤗 Diffusers) #

```bash
pip install diffusers transformers
```

```python
from diffusers import StableDiffusionPipeline
import torch

# Load the pretrained pipeline (float16 reduces GPU memory use)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a fantasy castle floating in the sky"
image = pipe(prompt).images[0]   # the pipeline returns a list of PIL images
image.show()
```

🎯 Final Thoughts #

Diffusion models are currently the gold standard in many generative AI applications, particularly text-to-image generation. Their ability to create photorealistic, diverse, and controllable outputs is reshaping creative industries, gaming, scientific research, and beyond.

Updated on June 6, 2025