Stable Diffusion: Revolutionizing Generative Models

Aayushma Pant
Jun 26, 2024


In the rapidly evolving field of generative models, Stable Diffusion has emerged as a groundbreaking technology, pushing the boundaries of what’s possible in creating realistic, high-quality data such as images, text, audio, and video. But how does this modern technique stack up against traditional methods? Let’s delve into the fascinating world of image generation and explore the innovations that Stable Diffusion brings.

Image generated from text prompt (Source from Author)

Traditional Diffusion Model: A Brief Overview

Diffusion models are generative models that create images by simulating a diffusion process. Fundamentally, a diffusion model works by progressively adding Gaussian noise to an image over several steps, creating a sequence of increasingly noisy images, and then learning to reverse this noise addition to generate a new image. This iterative process allows denoising diffusion models to produce high-quality images by refining noisy inputs into coherent outputs. While effective, these denoising diffusion models come with certain limitations.

[Source from [1]]
  1. High Dimensionality: These models operate in high-dimensional pixel space. This means noise is added and removed directly on the pixel values of the image, requiring a large amount of computational resources. For example, a colour image at 512x512 resolution has 512 × 512 × 3 = 786,432 pixel values (see the quick comparison after this list).
  2. Numerous Denoising Steps: Diffusion models typically require many denoising steps to produce high-quality images. This iterative process, while effective, can be time-consuming and computationally demanding.
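To make the dimensionality argument concrete, here is a quick back-of-the-envelope comparison in Python; the 4 × 64 × 64 latent shape is the one commonly used by Stable Diffusion v1 for 512 × 512 images and is an assumption for illustration, not a figure from this article.

# Pixel space vs. a typical Stable Diffusion latent space for a 512x512 RGB image.
pixel_values = 512 * 512 * 3   # 786,432 values the diffusion process would touch in pixel space
latent_values = 4 * 64 * 64    # 16,384 values in an assumed 4x64x64 latent

print(pixel_values, latent_values, pixel_values // latent_values)  # 786432 16384 48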

Stable Diffusion Models: A New Paradigm

Stable Diffusion models, also known as Latent Diffusion Models, represent a significant leap over traditional diffusion models in addressing these challenges. Instead of working in the pixel space of the image, Stable Diffusion operates in a latent space. This approach generates data in a compressed, abstract representation, making the process far more computationally efficient.

What is Latent Space?

Latent Space is a lower-dimensional representation of data, learned by a neural network, that captures the essential features of the input data. The idea is to encode complex high-dimensional data (like images) into a simpler, compact form while retaining as much relevant information as possible. For example, in the human mind, memories are not organized by distinct categories like childhood or work memories. Rather, they are organized in a multi-dimensional latent space based on various hidden factors such as emotions, sensory experiences, and contextual details. Similarly, latent space provides a computer with a compressed understanding of the world through a spatial representation.

The Structure of the Variational Autoencoder consists of an encoder, latent space, and a decoder [Source from 8]

The architecture of the Stable Diffusion model

[Source from 1]

The architecture of the Stable Diffusion model includes several key components:

1. Latent Space Representation

As said earlier, Stable Diffusion operates in a latent space. The process involves encoding an image into a latent representation using a pre-trained encoder (often a variational autoencoder (VAE) or a similar architecture), which maps a given image x to a latent z via the encoder E:

z = E(x)

This latent representation is then used as the starting point for the diffusion process. The encoder maps the input image into lower-dimensional latent data, and the decoder D reconstructs the image from the latent representation:

x̃ = D(z) = D(E(x))
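As a minimal sketch of this encode/decode round trip, the snippet below uses the Hugging Face diffusers library; the stabilityai/sd-vae-ft-mse checkpoint and the 4 × 64 × 64 latent shape are assumptions for illustration, not details from this article.

import torch
from diffusers import AutoencoderKL

# Load a pretrained VAE of the kind used by Stable Diffusion (checkpoint name is an assumption).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

image = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed RGB image scaled to [-1, 1]

with torch.no_grad():
    z = vae.encode(image).latent_dist.sample()  # z = E(x): roughly a 1 x 4 x 64 x 64 latent
    recon = vae.decode(z).sample                # x̃ = D(z): reconstructed 1 x 3 x 512 x 512 image

print(z.shape, recon.shape)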

2. Diffusion Process

After encoding the image into latent data, the diffusion process comes into play and involves two main phases:

  • Forward Diffusion (Noise Addition): This phase gradually adds Gaussian noise to the latent representation over a series of time steps (a minimal code sketch of this step appears after this list). A single noising step can be written as:

zₜ = √(αₜ) zₜ₋₁ + √(1 − αₜ) 𝛜ₜ

where zₜ is the noisy latent at time step t, αₜ is a noise schedule coefficient, and 𝛜ₜ is Gaussian noise.

  • Reverse Diffusion (Denoising): This phase involves learning to reverse the noise addition process to recover the original latent representation. A neural network (typically a U-Net) is trained to predict the noise component 𝛜ₜ at each time step, minimizing the difference between the predicted and actual noise (the exact loss is given in the Training section below).

Denoising of the image produced from the input text “A Nepali girl smiling in traditional dress” at each sampling step [Source from Author]
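The following is a minimal sketch of the forward (noising) process on a latent, assuming a simple linear β schedule; the schedule values, latent shape, and number of steps are illustrative choices rather than values from this article.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # assumed linear variance schedule
alphas = 1.0 - betas                   # α_t, the noise schedule coefficients

def forward_step(z_prev: torch.Tensor, t: int) -> torch.Tensor:
    """One noising step: z_t = sqrt(α_t) * z_{t-1} + sqrt(1 - α_t) * ε_t."""
    eps = torch.randn_like(z_prev)
    return alphas[t].sqrt() * z_prev + (1.0 - alphas[t]).sqrt() * eps

z = torch.randn(1, 4, 64, 64)          # a latent produced by the VAE encoder
for t in range(T):                     # in practice the closed form with ᾱ_t = Π α_s jumps straight to step t
    z = forward_step(z, t)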

3. U-Net Architecture

The U-Net architecture is crucial for the denoising process. It consists of an encoder-decoder structure with skip connections, allowing it to capture local and global information effectively.

Attention mechanisms are integrated into the U-Net to enhance the model’s ability to focus on relevant parts of the latent representation. Two types of attention are used.

  • Self-Attention: Allows the model to capture dependencies across different spatial locations in the latent representation. The attention mechanism is defined as

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

where Q, K and V are the query, key and value matrices, and dₖ is the dimensionality of the key vectors (see the code sketch after this list).
  • Cross-Attention: Used in conditional diffusion models to incorporate additional information, such as text embeddings, into the generation process. Text inputs are first converted into embeddings (vectors) using a language model 𝜏θ (e.g. BERT, CLIP) and then mapped into the U-Net via (multi-head) cross-attention layers. If c represents the conditioning information, the queries come from the latent features while the keys and values come from the conditioning embedding:

Q = W_Q · φ(zₜ),  K = W_K · 𝜏θ(c),  V = W_V · 𝜏θ(c)
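Below is a minimal PyTorch sketch of scaled dot-product attention applied in both the self- and cross-attention configurations; the token counts and the embedding width of 320 are illustrative assumptions, not values from this article.

import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5
    return F.softmax(scores, dim=-1) @ V

latent_tokens = torch.randn(1, 64 * 64, 320)  # flattened spatial positions of the latent (assumed width 320)
text_emb = torch.randn(1, 77, 320)            # e.g. 77 CLIP token embeddings projected to the same width

# Self-attention: queries, keys and values all come from the latent features.
self_out = attention(latent_tokens, latent_tokens, latent_tokens)

# Cross-attention: queries from the latent, keys and values from the text embedding τθ(c).
cross_out = attention(latent_tokens, text_emb, text_emb)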

4. Conditioning Mechanisms

Stable Diffusion can generate images conditioned on various types of auxiliary information. The purpose of conditioning is to steer the noise predictor so that, after the predicted noise is subtracted from the noisy latent, we get the result we want. Conditioning can be achieved by:

  • Concatenation: Directly concatenating the conditioning information (e.g., class labels or text embeddings) with the latent representation.
  • Adaptive Normalization: Using techniques like Conditional Batch Normalization (CBN) or Adaptive Instance Normalization (AdaIN) to modulate the latent representation based on the conditioning information.

In traditional diffusion models, pre-trained classifiers guide the generation process: they help generate images that satisfy certain conditions, such as specific class labels or textual descriptions. Reliance on such a classifier can limit the model, since it requires a separate classifier for each type of conditioning (class labels, text prompts), and the quality of the generated images depends heavily on the classifier’s performance. To eliminate these issues, the classifier-free guidance principle is used in Stable Diffusion.
The Stable Diffusion model is trained to handle both conditional and unconditional generation tasks. This is achieved by modifying the training process:

  1. Conditional Training: During training, the model receives both the input data (e.g., noisy latent representations) and the conditioning information (e.g., text prompts and class labels).
  2. Unconditional Training: The model is also trained without any conditioning information. In this scenario, the conditioning input is replaced with a null value or omitted entirely.

The model thus learns to perform both tasks and can flexibly switch between conditional and unconditional modes during inference, where the two noise predictions are typically combined to steer generation.
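A minimal sketch of how the two predictions are typically combined at sampling time (classifier-free guidance) is shown below; unet, the embedding arguments, and the guidance scale of 7.5 are placeholder assumptions rather than details from this article.

import torch

def guided_noise(unet, z_t, t, text_emb, null_emb, guidance_scale: float = 7.5):
    """Classifier-free guidance: push the prediction away from ε(z_t, t, ∅) toward ε(z_t, t, c)."""
    eps_cond = unet(z_t, t, text_emb)    # prediction with conditioning c
    eps_uncond = unet(z_t, t, null_emb)  # prediction with the null (empty) conditioning
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)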

5. Training

The training process optimizes the model to minimize the difference between the predicted noise and the actual noise added during the forward diffusion process. The loss function typically used is:

L = 𝔼 [ ‖ 𝛜 − 𝛜ᵩ(zₜ, t, c) ‖² ]

where 𝛜 is the actual Gaussian noise, 𝛜ᵩ is the noise predicted by the model, zₜ is the noisy latent at time step t, and c is the conditioning information.
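A minimal sketch of this objective in PyTorch is shown below; unet, the conditioning embedding, and the cumulative noise schedule alpha_bars are placeholders for the real noise-prediction network and its hyperparameters.

import torch
import torch.nn.functional as F

def diffusion_loss(unet, z0, t, text_emb, alpha_bars):
    """‖ε − ε_φ(z_t, t, c)‖² for one batch; z0 is a clean latent from the VAE encoder."""
    eps = torch.randn_like(z0)                            # actual Gaussian noise ε
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)               # cumulative noise schedule at step t
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps  # noisy latent z_t
    eps_pred = unet(z_t, t, text_emb)                     # predicted noise ε_φ(z_t, t, c)
    return F.mse_loss(eps_pred, eps)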

Examples and Applications

Stable Diffusion models have shown remarkable results in various image generation tasks, including:

Text to Image Generation

Stable Diffusion can generate an image directly from a given text prompt.
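A minimal sketch using the Hugging Face diffusers library is shown below; the runwayml/stable-diffusion-v1-5 checkpoint and the availability of a CUDA GPU are assumptions for illustration.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("A Nepali girl smiling in traditional dress").images[0]
image.save("text2img_output.png")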

Image-to-Image Generation

Image-to-image transforms one image into another using Stable Diffusion. Both an input image and a text prompt are supplied, and the generated image is conditioned on both.
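A minimal sketch with diffusers is shown below; the checkpoint name, the input file sketch.png, and the strength value are illustrative assumptions.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))  # hypothetical input image
result = pipe(
    prompt="A Nepali girl smiling in traditional dress",
    image=init_image,
    strength=0.75,       # how strongly to deviate from the input image
    guidance_scale=7.5,
).images[0]
result.save("img2img_output.png")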

  • Upscaling: Enhancing the resolution of low-quality images.
Upscaling Image from low quality to high [Source from 7]
  • Inpainting: Filling in missing parts of an image.
Image Inpainting example

Prompt: “A cat on a bench”
Generated inpainted image

In summary, Stable Diffusion represents a significant advancement over traditional diffusion models by leveraging latent space representations to improve computational efficiency and image quality. This innovative approach opens new possibilities for generative models, making them more practical and versatile for a wide range of applications.

References

  1. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” arXiv, 13 Apr 2022. [Online]. Available: https://arxiv.org/abs/2112.10752
  2. https://github.com/CompVis/stable-diffusion/tree/main
  3. https://cvpr2022-tutorial-diffusion-models.github.io/
  4. B. Wang, “Understanding Stable Diffusion from ‘Scratch’,” Harvard University.
  5. https://stable-diffusion-art.com/how-stable-diffusion-work/
  6. https://medium.com/@steinsfu/stable-diffusion-clearly-explained-ed008044e07e
  7. https://dublog.net/blog/stable-diffusion-2/
  8. V. Attari and R. Arróyave, “Machine learning-assisted high-throughput exploration of interface energy space in multi-phase-field model with CALPHAD potential,” Materials Theory, vol. 6, 2022. doi: 10.1186/s41313-021-00038-0
