2024/09/14


Notes on Diffusion Models



Diffusion Process

Diffusion models are a class of generative models that are used to create high-quality synthetic data. At a high level, they work by learning how to reverse a process where an initial dataset of images, audio, video, etc. is transformed into pure noise. Models that successfully learn this denoising process can generate realistic samples from their training set's distribution, starting with nothing but pure noise.



Diffusion models are generally composed of:

- A forward process that incrementally adds Gaussian noise to samples from the target distribution until they become (approximately) pure noise.
- A reverse sampler that learns to invert the forward process, removing noise step by step to recover samples from the target distribution.

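As a minimal sketch of the forward process (names like `forward_process` and the choice of per-step variance \( \sigma_q^2 \Delta t \) are my own, chosen to match the DDPM update rule later in these notes):

```python
import numpy as np

def forward_process(x0, sigma_q=1.0, dt=0.01, T=1.0, rng=None):
    """Simulate the forward diffusion of a clean sample x0 into noise.

    At each step, Gaussian noise with variance sigma_q**2 * dt is added,
    so after T / dt steps the total added variance is sigma_q**2 * T.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(int(T / dt)):
        x = x + rng.normal(0.0, sigma_q * np.sqrt(dt), size=x.shape)
        trajectory.append(x.copy())
    return trajectory
```

Running this on a batch of zeros, the final samples should have variance close to \( \sigma_q^2 T \), i.e. the clean data has been washed out into noise.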

The two main types of reverse samplers are the Denoising Diffusion Probabilistic Model (DDPM) and the Denoising Diffusion Implicit Model (DDIM). The most important difference between the two is that the DDPM injects random noise at each step of its path from pure noise back to the target distribution, whereas the DDIM follows a more deterministic path.



In both cases, the reverse sampler uses a neural network to predict how much noise was added at each step in the forward pass. This is done by training the network to estimate the mean \( \mu_{t - \Delta t}(x_t) \) of the conditional distribution \( p(x_{t-\Delta t} | x_t) \) at any given timestep \(t\). Estimating this mean is sufficient for sampling from the previous timestep's distribution because the conditional distribution between timesteps is assumed to be Gaussian with a predetermined variance, so learning its mean fully characterizes it. The neural network backbone that learns to predict the noise added at each timestep is usually built upon a U-Net or a Vision Transformer (ViT) architecture.



The DDPM Reverse Sampler update rule for generating a denoised sample \(\hat{x}_{t-\Delta t}\) given an input \(x_t\) at timestep \(t\):

\[ \hat{x}_{t-\Delta t} \gets \mu_{t-\Delta t}(x_t) + \mathcal{N}(0,\sigma^{2}_{q}\Delta t)\]

A noise term is added to the neural network's prediction to introduce randomness and let the sampler explore different possible paths through latent space. This stochasticity allows the model to generate diverse outputs and potentially capture more complex distributions, at the cost of requiring more sampling steps to converge on the target distribution.
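The DDPM update above can be written as a one-line step. Here `mu_pred` stands in for the neural network's output \( \mu_{t-\Delta t}(x_t) \); the function name and signature are illustrative, not from any particular library:

```python
import numpy as np

def ddpm_step(x_t, mu_pred, sigma_q=1.0, dt=0.01, rng=None):
    """One DDPM reverse-sampler update.

    mu_pred is the network's estimate of the mean of p(x_{t-dt} | x_t);
    fresh Gaussian noise with variance sigma_q**2 * dt is added on top,
    which is what makes the DDPM trajectory stochastic.
    """
    rng = np.random.default_rng() if rng is None else rng
    return mu_pred + rng.normal(0.0, sigma_q * np.sqrt(dt), size=np.shape(x_t))
```

Calling the step repeatedly with the same inputs yields different outputs, which is exactly the source of DDPM's sample diversity.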



The DDIM Reverse Sampler update rule for generating a denoised sample \(\hat{x}_{t-\Delta t}\) given an input \(x_t\) at timestep \(t\):

\[ \hat{x}_{t-\Delta t} \gets x_t + \lambda(\mu_{t-\Delta t}(x_t) - x_t)\]

In contrast to DDPMs, this process is deterministic as it opts for a direct update rule without any added noise. The next sample is calculated by interpolating between the current sample and the neural network's estimated mean, controlled by a parameter \(\lambda\). This parameter acts as a scheduled scaling factor, effectively determining how much each step's update depends on the neural network's prediction versus the current sample. When \( \lambda = 1\), the sampler fully relies on the neural network, and when \( \lambda = 0 \) there is no update at all. This approach ensures a consistent and reproducible trajectory through latent space for any given input, enabling faster sampling and more control over the generation process at the potential cost of reduced output diversity.
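The DDIM update is just a linear interpolation, which makes its determinism easy to see in code (again, `mu_pred` is a placeholder for the network's mean estimate):

```python
def ddim_step(x_t, mu_pred, lam):
    """One DDIM reverse-sampler update: a deterministic interpolation
    between the current sample x_t and the predicted mean mu_pred,
    controlled by the scheduled scaling factor lam (lambda).

    lam = 1 returns mu_pred exactly; lam = 0 leaves x_t unchanged.
    """
    return x_t + lam * (mu_pred - x_t)
```

Because there is no noise term, running `ddim_step` twice with identical arguments always produces identical results, unlike `ddpm_step` above.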



The general training process for diffusion models looks something like:

1. Sample a clean example \(x_0\) from the training set.
2. Sample a timestep \(t\) and run the forward process to obtain a noised version \(x_t\).
3. Have the neural network predict the mean \( \mu_{t-\Delta t}(x_t) \) of \( p(x_{t-\Delta t} | x_t) \).
4. Minimize a regression loss (typically mean squared error) between the prediction and the true \(x_{t-\Delta t}\).
5. Repeat over many examples and timesteps until the network converges.

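The steps above can be sketched end to end. This is a toy illustration only: `toy_model` is a single-parameter stand-in for a real U-Net/ViT, the gradient is computed by hand, and the \( \sigma_q^2 \Delta t \) variance matches the update rules earlier in these notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(x_t, w):
    # Stand-in for a U-Net/ViT: one learned scalar w that shrinks
    # the noisy sample x_t toward the predicted mean.
    return w * x_t

def training_step(w, x0_batch, sigma_q=1.0, dt=0.01, lr=0.1):
    """One gradient step of the mean-regression (MSE) objective."""
    t = rng.uniform(dt, 1.0)                      # random timestep
    # Forward process: noise the clean batch up to time t, keeping
    # the intermediate sample at t - dt as the regression target.
    x_prev = x0_batch + rng.normal(0.0, sigma_q * np.sqrt(t - dt), x0_batch.shape)
    x_t = x_prev + rng.normal(0.0, sigma_q * np.sqrt(dt), x0_batch.shape)
    pred = toy_model(x_t, w)
    # Gradient of mean((pred - x_prev)**2) with respect to w.
    grad = 2.0 * np.mean((pred - x_t + x_t - x_prev) * x_t)
    return w - lr * grad
```

Since only a small amount of noise separates \(x_{t-\Delta t}\) from \(x_t\), the learned scalar ends up close to (but slightly below) 1: the optimal denoiser shrinks the sample only slightly at each step.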

In practical applications, the choice between DDPMs and DDIMs comes down to a tradeoff between sampling speed and output diversity. The right choice depends on application-specific requirements and constraints, such as available computational resources, desired output diversity, and generation speed.





