Diffusion is a physical process that models random motion, first observed by Brown while studying pollen grains in water. In this section we first analyze a simplified 1-dimensional version, and then move on to diffusion models for images, closest in spirit to (Ho et al. 2020).

The Diffusion Process

This note follows Einstein’s original presentation, in a simplified version.

Let’s suppose we have a particle at position $i$ at $t = 0$. At each step it jumps to the left with probability $p$, to the right with probability $q$, and stays at the same position otherwise.

Concentration

We would like to know the concentration of particles at a certain position after a fixed number of steps, and then extend the idea to $N$ particles present at the beginning. Let’s call this concentration $C_{i}(t)$. The number of particles at a given time step and position is then $n_{i}(t) = NC_{i}(t)$.

Markov Process

$$ C_{i}(t + 1) = C_{i - 1}(t)q + C_{i + 1}(t)p + (1 - q - p) C_{i}(t) $$

$$ \begin{align} C_{i}(t + 1) - C_{i}(t) &= C_{i - 1}(t)q + C_{i + 1}(t)p + (1 - q - p) C_{i}(t) - C_{i}(t) \\ &= q(C_{i - 1}(t) - C_{i}(t)) + p(C_{i + 1}(t) - C_{i}(t)) \\ &= p C_{i + 1}(t) - (p + q)C_{i}(t) + qC_{i - 1}(t) \\ &= \frac{p - q}{2} (C_{i + 1}(t) - C_{i - 1}(t)) + \frac{p + q}{2}(C_{i + 1}(t) - 2C_{i}(t) + C_{i - 1}(t)) \end{align} $$

From the Markov process we obtain the master equation, and from that the Fokker-Planck equation. The temporal difference can be interpreted as the sum of a discrete first derivative and a discrete second derivative: the first derivative indicates a preferred direction of motion (drift), the second how fast the concentration spreads (diffusion).
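As a quick sanity check, here is a minimal simulation of the discrete update above; the lattice size, number of steps, and jump probabilities are illustrative values, not anything prescribed by the derivation.

```python
import numpy as np

# Jump probabilities (illustrative): left with p, right with q.
p, q = 0.25, 0.25
n_sites, n_steps = 201, 500

# Start with all the concentration on the central site.
C = np.zeros(n_sites)
C[n_sites // 2] = 1.0

for _ in range(n_steps):
    C = (q * np.roll(C, 1)       # particles arriving from the left site (jumped right)
         + p * np.roll(C, -1)    # particles arriving from the right site (jumped left)
         + (1 - p - q) * C)      # particles that stayed

# The mean squared displacement grows linearly in the number of steps.
x = np.arange(n_sites) - n_sites // 2
msd = np.sum(x**2 * C)
print(f"MSD after {n_steps} steps: {msd:.1f}  (expected ~ (p+q)*steps = {(p + q) * n_steps:.1f})")
```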

Going into continuous time

Let’s assume an update time of $\Delta t$ (so smaller time deltas imply more frequent updates). Let $\tau$ be the characteristic time scale of the system, defined so that $\tau / \Delta t$ is the number of updates corresponding to one Markov step as before. Intuitively, the characteristic time $\tau$ is the amount of time the system needs for a single change, while $\Delta t$ is how frequently we check the system; this is why we need to check the system $\frac{\tau}{\Delta t}$ times to observe a change.

$$ C_{i}(t + \Delta t) - C_{i}(t) = \frac{p' - q'}{2} (C_{i + 1}(t) - C_{i - 1}(t)) + \frac{p' + q'}{2}(C_{i + 1}(t) - 2C_{i}(t) + C_{i - 1}(t)) $$

with rescaled jump probabilities $p' = p\,\Delta t / \tau$ and $q' = q\,\Delta t / \tau$. Dividing by $\Delta t$ and letting $\Delta t \to 0$ gives

$$ \tau \frac{d}{dt} C_{i}(t) = \frac{p - q}{2} (C_{i + 1}(t) - C_{i - 1}(t)) + \frac{p + q}{2}(C_{i + 1}(t) - 2C_{i}(t) + C_{i - 1}(t)) $$

This is the master equation for the diffusion process. Going to continuous time just rescales the jump probabilities: if we wait only a fraction of the characteristic time, we get the same fraction of the jump probability, and this relation is what allows continuous updates.

Continuous Space

Here, we apply a rescaling of the form $i \pm 1 \to x \pm \Delta x$. To simplify the calculations, we further assume that there is no drift, meaning $p = q = \tilde{D}$. We now have jumps of size $\delta$, made of small steps of size $\Delta x$, in time $\tau$.

$$ C(x+\Delta x,t) - 2\,C(x,t) + C(x-\Delta x,t) = (\Delta x)^2\,\frac{\partial^2 C}{\partial x^2}(x,t) + O((\Delta x)^{4}) $$

We do another rescaling of the probability, $p'' = p (\delta / \Delta x)^{2}$, motivated by the second derivative in the diffusion term. Rewriting everything, we get the continuous-space equation:

$$ \begin{align} \tau \frac{d}{dt} C_{i}(t) &= \frac{p - q}{2} (C_{i + 1}(t) - C_{i - 1}(t)) + \frac{p'' + q''}{2}(C_{i + 1}(t) - 2C_{i}(t) + C_{i - 1}(t)) \\ &= \underbrace{\frac{p - q}{2}}_{p = q = \tilde{D} \implies p-q = 0}(C_{i + 1}(t) - C_{i - 1}(t)) + \delta^{2}\frac{p + q}{2}\frac{(C_{i + 1}(t) - 2C_{i}(t) + C_{i - 1}(t))}{\Delta x^{2}} \\ &\implies\frac{ \partial }{ \partial t } C(x, t) = D \frac{ \partial^{2} }{ \partial x^{2} } C(x, t), \text{ with } D = \frac{\delta^{2}(p + q)}{2 \tau} = \frac{ \delta^{2} \tilde{D}}{ \tau } \end{align} $$

When $\tilde{D} = \frac{1}{2}$ we recover Einstein’s diffusion, where $D$ is our diffusion coefficient. Taking a Fourier transform in space turns this into an ordinary differential equation in time, which is solvable; see the next section.

We can prove that the mean squared displacement (variance) grows linearly with time.
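Indeed, for the Gaussian solution derived in the next section (zero mean and variance $2Dt$):

$$ \langle x^{2}(t) \rangle = \int_{-\infty}^{\infty} x^{2}\, C(x, t)\, dx = 2Dt $$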

Solution for the diffusion process

$$ C(x, t) = \frac{1}{\sqrt{ 4 \pi D t }} e^{-x^{2} / 4Dt} $$

We define the Fourier transform pair

$$ F(k, t) = \int e^{-ikx}C(x, t) \, dx, \qquad C(x, t) = \frac{1}{2\pi}\int e^{ikx}F(k, t) \, dk $$

Transforming the diffusion equation in space gives

$$ \begin{align} \frac{d}{dt} \int e^{-ikx}C(x, t) \, dx &= D \int e^{-ikx} \frac{ \partial^{2} }{ \partial x^{2} } C(x, t) \, dx \\ \frac{d}{dt} F(k, t) &= -Dk^{2}F(k, t) \end{align} $$

Where the right-hand side follows from integrating by parts twice (the boundary terms vanish since $C$ decays at infinity).
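Solving this ODE for a particle initially localized at the origin ($C(x, 0) = \delta(x)$, so $F(k, 0) = 1$) and inverting the transform recovers the Gaussian above:

$$ F(k, t) = F(k, 0)\, e^{-Dk^{2}t} = e^{-Dk^{2}t}, \qquad C(x, t) = \frac{1}{2\pi}\int e^{ikx} e^{-Dk^{2}t}\, dk = \frac{1}{\sqrt{ 4 \pi D t }} e^{-x^{2}/4Dt} $$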

Fokker-Planck Equation

$$ \frac{ \partial }{ \partial t } f(x, t) = - \frac{ \partial }{ \partial x }\underbrace{ (A(x, t)f(x, t))}_{\substack{\text{Drift Term with} \\ \text{Force } A(x, t)}} + \frac{ \partial^{2} }{ \partial x^{2} } \underbrace{(B(x, t)f(x, t))}_{\substack{\text{Diffusion Term with} \\ \text{parameter } B(x, t)}} $$

The Ornstein-Uhlenbeck process solves the Fokker-Planck equation with $A(x, t) = - \gamma x$ and $B(x, t) = \sigma^{2}/2$, where $\gamma$ and $\sigma$ are constants. Its SDE has the form $dx = - \gamma x\, dt + \sigma\, dW$.
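As a quick illustration (not needed for what follows), here is a minimal Euler-Maruyama simulation of the Ornstein-Uhlenbeck SDE above; `gamma`, `sigma`, `dt`, `n_steps`, and `n_paths` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for dx = -gamma * x * dt + sigma * dW
gamma, sigma = 1.0, 0.5
dt, n_steps, n_paths = 1e-3, 10_000, 5_000

x = np.full(n_paths, 2.0)  # start all paths away from the mean
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    x += -gamma * x * dt + sigma * dW

# The stationary distribution is N(0, sigma^2 / (2 * gamma)).
print(f"empirical var: {x.var():.4f}, stationary var: {sigma**2 / (2 * gamma):.4f}")
```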

This is not particularly important for the analysis of diffusion models, but keep in mind that the roots of the theory are in physics!

Introduction to Diffusion Models

The Forward Encoder

Noise Schedule

$$ z_{t} = \sqrt{ 1 - \beta_{t} } z_{t - 1} + \sqrt{\beta_{t}} \epsilon_{t}$$

And $z_{0} = x$, where $\forall t,\ 0 < \beta_{t} < 1$ and $\epsilon_{t} \sim \mathcal{N}(0, I)$. So we have $q(z_{t} \mid z_{t-1}) = \mathcal{N}(z_{t}; \sqrt{ 1 - \beta_{t} }z_{t - 1}, \beta_{t}I)$. If we suppose $z_{t - 1}$ is a Gaussian random variable with mean $\mu_{t - 1}$ and covariance $\Sigma_{t - 1}$, we can observe that in the limit the chain converges to a Gaussian with mean $0$ and covariance $I$. This choice is made with the idea that, if we can invert the process, generating a new sample from the data manifold is just sampling from the standard Gaussian and then applying the inverse noising transformation many times.

$$ z_{t} = \sqrt{ \alpha_{t} } z_{0} + \sqrt{1 - \alpha_{t}} \epsilon $$

Where $\alpha_{t} = \prod_{i = 1}^{t} (1 - \beta_{i})$; this follows by matching the mean and variance of the resulting distribution. Now we can write $q(z_{t} \mid z_{0}) = \mathcal{N}(z_{t}; \sqrt{ \alpha_{t} } z_{0}, (1 - \alpha_{t})I)$.
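A minimal sketch of this closed-form forward process, assuming a simple linear $\beta$ schedule; the schedule values, shapes, and number of steps are illustrative, not those of any specific paper.

```python
import torch

T = 1000
# Illustrative linear schedule for beta_t, t = 1..T.
beta = torch.linspace(1e-4, 0.02, T)
alpha = torch.cumprod(1.0 - beta, dim=0)  # alpha_t = prod_{i<=t} (1 - beta_i)

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Sample z_t ~ q(z_t | z_0 = x0) = N(sqrt(alpha_t) x0, (1 - alpha_t) I)."""
    a = alpha[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over image dims
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps

# Usage: noise a batch of "images" at random timesteps.
x0 = torch.randn(8, 3, 32, 32)    # stand-in for data
t = torch.randint(0, T, (8,))     # random timestep per sample
eps = torch.randn_like(x0)
z_t = q_sample(x0, t, eps)
```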

$$ q(x_{t} \mid x_{t - 1}) = \mathcal{N}(x_{t}; \sqrt{ 1 - \beta_{t} } x_{t - 1}, \beta_{t}I), \qquad q(x_{t} \mid x_{0}) = \mathcal{N}(x_{t}; \sqrt{ \alpha_{t} } x_{0}, (1 - \alpha_{t})I) $$

The second form allows us to jump to an arbitrary timestep directly when we train, without simulating the whole chain.

Closed-form reverse conditional distribution

We will now use the same tricks as in Bayesian Linear Regression for computing closed forms of Gaussian distributions. If we don’t condition on the initial image, the reverse distribution is not tractable.

$$ q(z_{t - 1} \mid z_{t}, x) = \frac{q(z_{t} \mid z_{t - 1}, x) q(z_{t - 1} \mid x)}{q(z_{t} \mid x)} $$

Now observe: the denominator is just a normalization constant that does not depend on $z_{t - 1}$, and $q(z_{t} \mid z_{t - 1}, x) = q(z_{t} \mid z_{t - 1})$ by the Markov property (see Markov Chains). Now we can manipulate the Gaussians and derive the closed form of the distribution. Let’s take a close look inside the $\exp$ after taking the product. As with Bayesian Linear Regression, we would like to rewrite it in the form $z_{t-1}^{T}\Sigma^{-1}z_{t-1} - 2 \mu^{T}\Sigma^{-1}z_{t-1} + \text{const}$, since we know the product is still proportional to a Gaussian and just needs to be normalized.

$$ \begin{align} -2\log q(z_{t - 1} \mid z_{t}, x) & = (z_{t - 1} - \sqrt{ \alpha_{t - 1} }x)^{T} \frac{1}{1 - \alpha_{t - 1}} (z_{t - 1} - \sqrt{ \alpha_{t - 1} }x) + (z_{t} - \sqrt{ 1 - \beta_{t} }z_{t - 1})^{T} \frac{1}{\beta_{t}} (z_{t} - \sqrt{ 1 - \beta_{t} }z_{t - 1}) + \text{const} \\ & = \frac{1}{1 - \alpha_{t - 1}} z_{t - 1}^{T}z_{t - 1} - 2 \frac{\sqrt{ \alpha_{t - 1} }}{1 - \alpha_{t - 1}} z_{t - 1}^{T}x - 2 \frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}} z_{t - 1}^{T}z_{t} + \frac{1 - \beta_{t}}{\beta_{t}} z_{t - 1}^{T}z_{t - 1} + \text{const} \\ &= z^{T}_{t - 1}\left( \frac{1}{ 1 - \alpha_{t - 1}} + \frac{1 - \beta_{t}}{\beta_{t}} \right) z_{t - 1} - 2 z^{T}_{t - 1} \left( \frac{ \sqrt{ \alpha_{t - 1} }}{1 - \alpha_{t - 1}}x + \frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}} z_{t} \right) + \text{const} \end{align} $$

Reading off the inverse covariance and completing the square:

$$ \sigma_{t - 1}^{2} = \left( \frac{1}{ 1 - \alpha_{t - 1}} + \frac{1 - \beta_{t}}{\beta_{t}} \right)^{-1} = \frac{\beta_{t}(1 - \alpha_{t - 1})}{1 - \alpha_{t}} $$

$$ \begin{align} \mu_{t - 1} = m(x, z_{t})&= \sigma_{t - 1}^{2} \left( \frac{\sqrt{ \alpha_{t - 1} }}{1 - \alpha_{t - 1}} x + \frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}}z_{t} \right) \\ &= \frac{\beta_{t}(1 - \alpha_{t - 1})}{1 - \alpha_{t}} \left( \frac{\sqrt{ \alpha_{t - 1} }}{1 - \alpha_{t - 1}} x +\frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}}z_{t} \right) \\ &= \frac{\beta_{t}\sqrt{ \alpha_{t - 1} }}{1 - \alpha_{t}} x + \frac{ \sqrt{ 1 - \beta_{t} }\,(1 - \alpha_{t- 1})}{1 - \alpha_{t}} z_{t} \end{align} $$

Substituting $x = \frac{1}{\sqrt{ \alpha_{t} }} z_{t} - \frac{\sqrt{1 - \alpha_{t}}}{\sqrt{ \alpha_{t} }} \epsilon$ (obtained by inverting the closed-form forward process) gives

$$ \mu_{t - 1} = \frac{1}{\sqrt{ 1 - \beta_{t} }} \left( z_{t} - \frac{\beta_{t}}{\sqrt{ 1 - \alpha_{t} }}\epsilon \right) $$

We will see in a later section how this form can be used to train the reverse decoder. The idea is to use a parameterized neural network to predict this mean, because we already know the variance a priori from the diffusion schedule.

Since we have $q(x_{t - 1} \mid x_{t}, x_{0}) = \mathcal{N}(\mu_{t - 1}, \sigma_{t - 1}^{2}I)$ and our approximation is $p_{\theta}(x_{t - 1} \mid x_{t}) = \mathcal{N}(\mu_{\theta}(x_{t}, t), \sigma_{t - 1}^{2}I)$, their KL divergence reduces to matching the means (the covariances are equal). See Gaussians#General KL divergence between Gaussians for more details.
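A small sketch of the closed-form posterior parameters derived above; the function takes the relevant schedule values directly as arguments to avoid indexing conventions, and the names are illustrative.

```python
import torch

def posterior_params(x0, z_t, beta_t, alpha_t, alpha_tm1):
    """Mean and variance of q(z_{t-1} | z_t, x0) for Gaussian diffusion.

    beta_t    : beta_t
    alpha_t   : cumulative product alpha_t     = prod_{i<=t} (1 - beta_i)
    alpha_tm1 : cumulative product alpha_{t-1}
    """
    var = beta_t * (1.0 - alpha_tm1) / (1.0 - alpha_t)
    mean = (beta_t * alpha_tm1**0.5 / (1.0 - alpha_t)) * x0 \
         + ((1.0 - beta_t)**0.5 * (1.0 - alpha_tm1) / (1.0 - alpha_t)) * z_t
    return mean, var
```

During training $x_{0}$ is known; at sampling time it is replaced by the network's estimate, or equivalently by the predicted noise through the substitution above.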

The Reverse Decoder

Using small variance schedules

We now show briefly that, with small variance schedules, the reverse distribution is approximately Gaussian: $q(z_{t - 1}\mid z_{t}) \approx \mathcal{N}(z_{t- 1}; z_{t},\beta_{t}I)$.

$$ q(\mathbf{z}_{t-1} | \mathbf{z}_t) = \frac{q(\mathbf{z}_t | \mathbf{z}_{t-1}) q(\mathbf{z}_{t-1})}{q(\mathbf{z}_t)}. $$

Then take the logarithm and Taylor-expand $\log q(\mathbf{z}_{t-1})$ around $\mathbf{z}_t$ up to second order; the first- and second-order terms correct the mean and variance of the Gaussian factor $q(\mathbf{z}_t \mid \mathbf{z}_{t-1})$.

$$ \log q(\mathbf{z}_{t-1}) \approx \log q(\mathbf{z}_t) + (\mathbf{z}_{t-1} - \mathbf{z}_t)^\top \nabla_{\mathbf{z}} \log q(\mathbf{z})\big|_{\mathbf{z} = \mathbf{z}_t} + \frac{1}{2} (\mathbf{z}_{t-1} - \mathbf{z}_t)^\top \nabla^2_{\mathbf{z}} \log q(\mathbf{z})\big|_{\mathbf{z} = \mathbf{z}_t} (\mathbf{z}_{t-1} - \mathbf{z}_t) $$

The Loss Function

We will use ideas from Variational Inference. We will approximate the true reverse distribution $q(z_{t-1} \mid z_{t})$ with a member $p$ of a variational family of Gaussians, and maximize the ELBO (see the linked note for details).

The loss can be seen as a maximization of the ELBO: it decomposes into a reconstruction term, a prior matching term and a denoising matching term.

We will see that the loss function is the following:

$$ \mathcal{L}(w) = \mathbb{E}_{q} \left[ \sum_{t = 2}^{T} \ln \frac{p(z_{t - 1} \mid z_{t}, w)}{q(z_{t - 1} \mid z_{t}, x)} + \ln p(x \mid z_{1}, w) \right] $$

This can be rewritten in a form closer to Variational Autoencoders, studied in Autoencoders, where we have a reconstruction term and a consistency term.

$$ \mathcal{L}(w) = -\sum_{t = 1}^{T} \lVert g(\sqrt{ \alpha_{t} }x + \sqrt{ 1 - \alpha_{t} }\varepsilon_{t}, w, t) - \varepsilon_{t} \rVert ^{2} $$

Where $g$ is the neural network that we use to approximate the reverse conditional distribution (through its predicted noise). The $t = 1$ term plays the role of the reconstruction term, and the remaining terms form the consistency term, all combined in a single sum.

Training the Decoder

[Figure: training algorithm for diffusion models, from Bishop’s Deep Learning book]
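A minimal training-step sketch in the spirit of the algorithm in the figure, assuming a noise-prediction network `g(z_t, t)` and reusing `T` and `q_sample` from the earlier snippet; the optimizer and batch source are left to the caller and everything here is illustrative.

```python
import torch

def training_step(g, x0, optimizer):
    """One DDPM-style training step: predict the noise that was added to x0."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                  # target noise
    z_t = q_sample(x0, t, eps)                                  # closed-form forward process
    loss = torch.nn.functional.mse_loss(g(z_t, t), eps)         # match predicted vs. true noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```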

Sampling from the diffusion

[Figure: sampling algorithm, from Bishop’s Deep Learning book]
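A matching ancestral-sampling sketch, again reusing `beta`, `alpha`, and `T` from above and assuming a trained noise predictor `g`; the shape and the choice $\sigma_t^2 = \beta_t$ for the reverse variance are illustrative (a common choice, not the only one).

```python
import torch

@torch.no_grad()
def sample(g, shape=(8, 3, 32, 32)):
    """Ancestral sampling: start from pure noise and denoise step by step."""
    z = torch.randn(shape)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = g(z, t_batch)
        # Mean of p(z_{t-1} | z_t) with the predicted noise plugged in.
        mean = (z - beta[t] / (1.0 - alpha[t]).sqrt() * eps_hat) / (1.0 - beta[t]).sqrt()
        if t > 0:
            z = mean + beta[t].sqrt() * torch.randn_like(z)  # add noise except at the last step
        else:
            z = mean
    return z
```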

Training a Diffusion Model

If we know $x_{0}$ then the reverse distribution is tractable, as we have seen in #The Forward Encoder.

Diffusion models are also trained with the ELBO, in a manner similar to Autoencoders.

$$ \begin{align*} \log p(\mathbf{x}) &\geq \text{ELBO}_{\theta}(\mathbf{x}) \\ &= \mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)} [\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)] - D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0) \,\|\, p(\mathbf{x}_T)) - \sum_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)} [D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \,\|\, p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t))] \end{align*} $$

Where we have a reconstruction term, a prior matching term similar to the one in VAEs, and a denoising matching term, also referred to as the consistency term.

We want to make our approximation $p_{\theta}$ as close as possible to the true posterior $q$.

Since we know that both $q$ and $p_{\theta}$ are Gaussian, the optimization reduces to matching the means, i.e. we want $\mu_{\theta}(x_{t}, t) \approx \mu_{q}(x_{t}, t)$. We can rewrite everything so that the network just predicts the noise; this is a better-posed problem, since the noise is Gaussian, which is an easier distribution to predict than the space of all possible images.

Classifier Guidance

  • The idea is to use a classifier to guide the diffusion process, so that we can generate samples from a specific class. Another way to achieve something similar (but with more difficulty) is plain conditional generation (creating an embedding of the condition).

Modifying the loss

$$ \hat{\varepsilon}_{t} = \varepsilon_{t} - \lambda \sqrt{ 1 - \alpha_{t} }\, \nabla_{z_{t}} \log p(y \mid z_{t}) $$

Where $p(y \mid z_{t})$ is the classifier that we are using to guide the diffusion process, $\lambda$ is the guidance strength, and $\hat{\varepsilon}_{t}$ is the modified noise prediction. The idea is to nudge the noise prediction so that the denoised image moves towards the desired class $y$.
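A sketch of how this nudge could be computed, assuming a classifier that accepts noisy inputs and returns class logits; the function, its arguments, and the scaling are illustrative, roughly following the classifier-guidance recipe rather than any specific implementation.

```python
import torch

def guided_eps(eps_hat, z_t, y, t, classifier, alpha_t, lam=1.0):
    """Nudge the predicted noise towards samples the classifier assigns to class y.

    classifier(z_t, t) is assumed to return class logits for a *noisy* input z_t;
    y holds target class indices, lam is the guidance strength.
    """
    with torch.enable_grad():  # guidance needs gradients even inside a no_grad sampler
        z = z_t.detach().requires_grad_(True)
        log_p_y = classifier(z, t).log_softmax(dim=-1)[torch.arange(z.shape[0]), y]
        grad = torch.autograd.grad(log_p_y.sum(), z)[0]
    # Following the classifier gradient means subtracting it from the predicted noise.
    return eps_hat - lam * (1.0 - alpha_t) ** 0.5 * grad
```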

Problems with the technique:

  • We need the classifier to work on noisy data, at the same noise levels the diffusion model sees, and such a classifier is not always easy to obtain.
  • The classifier’s gradient can point in the wrong direction.
  • It is not very flexible, since we are limited to the classes the classifier was trained on.

Classifier-free guidance

If we train a conditioned generator while sometimes dropping the conditioning signal, the same network can also act as an unconditioned generator, and we can combine the two predictions:

$$ \begin{align*} \hat{\varepsilon}_{\theta}(\boldsymbol{x}, c; t) &= (1 + \rho)\,\varepsilon_{\theta}(\boldsymbol{x}, c; t) - \rho\,\varepsilon_{\theta}(\boldsymbol{x}; t) \\ &= \varepsilon_{\theta}(\boldsymbol{x}, c; t) + \rho\,(\varepsilon_{\theta}(\boldsymbol{x}, c; t) - \varepsilon_{\theta}(\boldsymbol{x}; t)) \end{align*} $$

Where $\rho$ is the strength of the guidance. In this manner we can generate both conditionally and unconditionally, and trade off diversity against quality. This is one of the best solutions at the time of writing.
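A sketch of the guided prediction above, assuming a noise predictor that accepts an optional conditioning input, with `c=None` standing in for the null/unconditional token (both the interface and the default `rho` are illustrative).

```python
import torch

def cfg_eps(g, z_t, t, c, rho=3.0):
    """Classifier-free guidance: mix conditional and unconditional predictions.

    g(z_t, t, c) is a noise predictor trained with the condition randomly
    dropped, so g(z_t, t, None) gives the unconditional prediction.
    rho is the guidance strength.
    """
    eps_cond = g(z_t, t, c)
    eps_uncond = g(z_t, t, None)
    return eps_cond + rho * (eps_cond - eps_uncond)
```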

ControlNet

A difficulty with text embeddings is that it is hard to create an image that closely matches what you meant with the text, since the meaning is usually highly contextual. Can you condition on other images, for example sketches? This is the idea of ControlNet.

The idea is to start from a sketch and condition on both the sketch and the text. ControlNet has some similarities with the feature modulation used in Generative Adversarial Networks and Normalizing Flows. See (Zhang et al. 2023).

[Figure: ControlNet architecture, from (Zhang et al. 2023)]

The added connections are initialized to zero, so that they have limited effect at the beginning of training.

One problem is that this approach roughly doubles the parameter count.
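A sketch of the zero-initialized connection idea (a “zero convolution”): a 1x1 convolution whose weights and bias start at zero, so the control branch contributes nothing until training moves it away from zero. The module below is purely illustrative, not the actual ControlNet code.

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution initialized to zero, used to attach a control branch."""

    def __init__(self, channels: int):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

# The control branch output is added to the frozen backbone features;
# at initialization this addition is exactly zero, so the pretrained
# behaviour of the backbone is preserved.
zero_conv = ZeroConv2d(channels=64)
backbone_features = torch.randn(1, 64, 32, 32)
control_features = torch.randn(1, 64, 32, 32)
out = backbone_features + zero_conv(control_features)
assert torch.allclose(out, backbone_features)
```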

Comparisons with other models

Diffusion models offer high-quality and diverse generations, model flexibility, and more stable training. VAEs (see Autoencoders) usually have quality problems, since many features in the latent space are not disentangled. GANs (see Generative Adversarial Networks) suffer from mode collapse and training instabilities that make them difficult to train. Normalizing Flows (see Normalizing Flows) have problems with the number of parameters and the computational cost required for good-quality outputs, but they offer invertibility of the transformations.

References

[1] Zhang et al. “Adding Conditional Control to Text-to-Image Diffusion Models” arXiv preprint arXiv:2302.05543 2023

[2] Ho et al. “Denoising Diffusion Probabilistic Models” arXiv preprint arXiv:2006.11239 2020