Diffusion is a physical process describing random motion, first observed by Brown while studying pollen grains in water. In this section, we first analyze a simplified one-dimensional version, and then delve into diffusion models for images, following most closely (Ho et al. 2020).

The Diffusion Process

This note follows Einstein's original presentation, in a simplified form.

Let's suppose we have a particle at some position . We have a probability of jumping to the left of to right of , the rest is staying at the same position.

Concentration

We would like to know the concentration of particles after a number of fixed steps at a certain position. Then we would like to know the same thing if we extend the idea to a certain number of starting particles at the beginning. Let's call this concentration . Then the number of particles at a certain time step and position is

Markov Process

Given the concentration process above, we would like to determine how it evolves in time. We have the following relation:

This defines a Markov process, see Markov Processes, where the value at a given time step depends only on the previous one. We can also interpret this as a recurrence relation. This is often used to model Brownian motion. We observe that the difference of

From the Markov process we obtain the master equation, from which we derive the Fokker-Planck equation. Intuitively, the temporal difference can be written as the sum of a first-derivative term and a second-derivative term in space. The first tells us the preferred direction of motion (drift), the second tells us how fast the particles spread (diffusion).
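As a minimal sketch of the recurrence described above, the snippet below evolves a concentration profile on a 1D grid. The names `p_left`, `p_right`, `evolve_concentration` and the periodic boundary are assumptions made for illustration, not the note's own notation: at each step a site receives particles that jumped right from its left neighbour, particles that jumped left from its right neighbour, and those that stayed.

```python
import numpy as np

def evolve_concentration(c0, p_left, p_right, n_steps):
    """Apply the jump recurrence n_steps times on a periodic 1D grid."""
    p_stay = 1.0 - p_left - p_right
    c = np.asarray(c0, dtype=float).copy()
    for _ in range(n_steps):
        # np.roll(c, 1)[x] == c[x - 1]: particles arriving from the left jumped right.
        # np.roll(c, -1)[x] == c[x + 1]: particles arriving from the right jumped left.
        c = p_right * np.roll(c, 1) + p_left * np.roll(c, -1) + p_stay * c
    return c

n_sites = 401
positions = np.arange(n_sites) - n_sites // 2
c0 = np.zeros(n_sites)
c0[n_sites // 2] = 1000.0  # all particles start at the origin

c = evolve_concentration(c0, p_left=0.25, p_right=0.25, n_steps=50)
total = c.sum()
mean = (positions * c).sum() / total
variance = ((positions - mean) ** 2 * c).sum() / total
print(total, mean, variance)
```

With symmetric jump probabilities the mean stays at the origin (no drift) while the profile spreads; the grid is chosen large enough that nothing wraps around in 50 steps.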

Going into continuous time

Let's assume we have an update time of (so smaller time deltas imply more frequent updates). We say is the time scale of the system (measure of time in the system), we define such that is the number of the updates for one Markov Step as before. More intuitively, the characteristic time tells us the system needs that amount of time to have a single change, while tells us how frequently we are going to check the system, this is the reason why we need to check the system times to observe a change.

We can now redefine the updated probabilities with this notion of time. and equivalently . Then the continuous Markov process is

If we consider the limit then we have that

This is the master equation for a diffusion process. Moving to continuous time only rescales the probability of jumping: over a fraction of the characteristic time, the particle jumps with the corresponding fraction of the probability, which is what allows continuous updates.

Continuous Space

Here, we apply a rescaling on the form . To simplify the calculations, we further assume that there is no Drift, meaning . Now, we have steps of size made of small steps of in time

This can be proven by using Taylor expansions:

We do another re-scaling of the probability which is motivated by our double derivative in the second part, then rewriting everything we get the continuous space equation. Then we get

When we have Einstein's diffusion. Where is our diffusion coefficient. We do a Fourier transform in space and we get an ordinary differential equation in time, which is solvable; see the next section.

We can prove that the mean squared displacement (variance) grows linearly with time.
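The linear growth can be checked empirically on the discrete walk. The sketch below is a Monte Carlo illustration under assumed jump probabilities (`p_left = p_right = 0.25`, hypothetical values), not a proof: it simulates many independent walkers and fits a line to the mean squared displacement.

```python
import numpy as np

rng = np.random.default_rng(0)
n_walkers, n_steps = 20_000, 200
p_left, p_right = 0.25, 0.25  # assumed symmetric jump probabilities (no drift)

# Each step is -1, 0 or +1 with probabilities (p_left, stay, p_right).
steps = rng.choice([-1, 0, 1], size=(n_steps, n_walkers),
                   p=[p_left, 1.0 - p_left - p_right, p_right])
positions = np.cumsum(steps, axis=0)

# Mean squared displacement at each time step, averaged over walkers.
msd = np.mean(positions.astype(float) ** 2, axis=1)
times = np.arange(1, n_steps + 1)
slope, intercept = np.polyfit(times, msd, 1)
print(slope, p_left + p_right)
```

For a walk without drift, each step adds `p_left + p_right` to the variance, so the fitted slope should be close to that value.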

Solution for the diffusion process

One can see that the following equation is a solution for the above problem:

One can verify this by brute force, or use a fancier method and pass to Fourier space, where the equation is simpler to treat. In fact, the Fourier transform of the function is

And its inverse is

If we apply the Fourier Transform we are getting the following:

Where the right hand side is derived through a double integration by parts.
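As a numerical sanity check, the sketch below assumes the diffusion equation has the standard form $\partial_t c = D\, \partial_x^2 c$ and that the solution is the Gaussian kernel $c(x, t) = \frac{1}{\sqrt{4\pi D t}} \exp\left(-\frac{x^2}{4Dt}\right)$. Both the form and the normalization are assumptions here and may differ from the note's convention by constant factors. It compares the two sides of the equation with finite differences.

```python
import numpy as np

def heat_kernel(x, t, D):
    """Gaussian kernel assumed to solve dc/dt = D * d^2c/dx^2 from a point source."""
    return np.exp(-x**2 / (4.0 * D * t)) / np.sqrt(4.0 * np.pi * D * t)

D, t = 0.5, 1.0
x = np.linspace(-5.0, 5.0, 201)
dx, dt = x[1] - x[0], 1e-4

# Central finite differences in time and in space.
dc_dt = (heat_kernel(x, t + dt, D) - heat_kernel(x, t - dt, D)) / (2.0 * dt)
c = heat_kernel(x, t, D)
d2c_dx2 = (c[2:] - 2.0 * c[1:-1] + c[:-2]) / dx**2

# The residual should be small, limited by the grid spacing.
residual = np.max(np.abs(dc_dt[1:-1] - D * d2c_dx2))
print(residual)
```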

Fokker Planck Equation

For completeness we will also give some details about the Fokker Planck equation. Given a density function , the Fokker Planck Equation is the following:

The Ornstein Uhlenbeck Process is a solution to the Fokker Planck equation with and , where and are constants. Its SDE has the form .

This is not essential for the analysis of diffusion models, but keep in mind that the theory has its roots in physics!
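For intuition, here is a hedged Euler-Maruyama sketch of an Ornstein-Uhlenbeck process. It assumes the SDE $dX_t = -\theta X_t\, dt + \sigma\, dW_t$; the symbols `theta` and `sigma` and the function `simulate_ou` are illustrative names, not taken from the note. Under this assumption the paths forget their starting point and settle around a Gaussian with variance $\sigma^2 / (2\theta)$.

```python
import numpy as np

def simulate_ou(theta, sigma, x0, dt, n_steps, n_paths, rng):
    """Euler-Maruyama integration of dX = -theta * X dt + sigma dW."""
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_paths)
        x = x - theta * x * dt + sigma * np.sqrt(dt) * noise
    return x

rng = np.random.default_rng(0)
theta, sigma = 1.0, 1.0
x_final = simulate_ou(theta, sigma, x0=3.0, dt=0.01,
                      n_steps=1000, n_paths=20_000, rng=rng)

# Compare the empirical variance with the stationary value sigma^2 / (2 * theta).
print(x_final.mean(), x_final.var(), sigma**2 / (2.0 * theta))
```

The same mechanism, a contraction towards zero plus injected Gaussian noise, is what the forward process of diffusion models uses in discrete time.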

Introduction to Diffusion Models

The Forward Encoder

Noise Schedule

The noise schedule of diffusion models is set as follows:

$$
z_{t} = \sqrt{ 1 - \beta_{t} } z_{t - 1} + \sqrt{\beta_{t}} \epsilon_{t}
$$

And $z_{0} = x$, where $\forall t, 0 < \beta_{t} < 1$ and $\epsilon_{t} \sim \mathcal{N}(0, I)$. So now we have $q(z_{t} \mid z_{t - 1}) = \mathcal{N}(z_{t}; \sqrt{ 1 - \beta_{t} }z_{t - 1}, \beta_{t}I)$.

If we suppose $z_{t - 1}$ is a [[Gaussians|Gaussian]] random variable with mean $\mu_{t - 1}$ and variance $\Sigma_{t - 1}^{2}$, then we observe that, as more and more noise is added, it converges to a Gaussian with mean $0$ and variance $I$. This choice is made with the idea that if we can invert the process, then generating a new sample from the data manifold amounts to sampling from the standard Gaussian and then applying the inverse noise transformation many times. We can write a closed form for the forward process:

$$
z_{t} = \sqrt{ \alpha_{t} } z_{0} + \sqrt{1 - \alpha_{t}} \epsilon
$$

Where $\alpha_{t} = \prod_{i = 1}^{t} (1 - \beta_{i})$. This follows by unrolling the recursion: a sum of independent Gaussians is again Gaussian, so it is enough to match the mean and variance of the resulting distribution. Now we can write $z_{t} \sim \mathcal{N}(z_{t}; \sqrt{ \alpha_{t} } z_{0}, (1 - \alpha_{t})I)$. This means that:

$$
\begin{align}
q(x_{t} \mid x_{t - 1}) &= \mathcal{N}(x_{t}; \sqrt{ 1 - \beta_{t} } x_{t - 1}, \beta_{t}I) \\
q(x_{t} \mid x_{0}) &= \mathcal{N}(x_{t}; \sqrt{ \alpha_{t} } x_{0}, (1 - \alpha_{t})I)
\end{align}
$$
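The agreement between the step-by-step process and the closed form can be checked numerically. The sketch below assumes a linear schedule for $\beta_t$ from $10^{-4}$ to $0.02$ over $T = 1000$ steps (one common choice, used here only for illustration) and a toy dataset made of copies of a single scalar; `q_sample` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas = np.cumprod(1.0 - betas)     # alpha_t = prod_{i <= t} (1 - beta_i)

def q_sample(z0, t, rng):
    """Sample z_t directly from z_0 with the closed form (t is 1-indexed)."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alphas[t - 1]) * z0 + np.sqrt(1.0 - alphas[t - 1]) * eps

z0 = np.full(50_000, 2.0)  # toy data: many copies of the scalar 2.0
t = 300

# Route 1: iterate the one-step update t times.
z_iter = z0.copy()
for i in range(t):
    noise = rng.standard_normal(z_iter.shape)
    z_iter = np.sqrt(1.0 - betas[i]) * z_iter + np.sqrt(betas[i]) * noise

# Route 2: jump to step t in one shot.
z_direct = q_sample(z0, t, rng)

print(z_iter.mean(), z_direct.mean(), np.sqrt(alphas[t - 1]) * 2.0)
print(z_iter.var(), z_direct.var(), 1.0 - alphas[t - 1])
print(alphas[-1])  # close to 0: z_T is almost pure standard Gaussian noise
```

Both routes give samples with the same mean and variance, which is why training can draw a random timestep and noise the data in a single step.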

These closed forms allow us to jump directly to any timestep during training, without simulating the intermediate steps.

#### Closed reverse conditional distribution

We will now use the same tricks as in [[Bayesian Linear Regression]] for computing closed forms of Gaussian distributions. If we don't condition on the initial image, the reverse distribution is not tractable. We use Bayes' rule to get the reverse conditional distribution:

$$
q(z_{t - 1} \mid z_{t}, x) = \frac{q(z_{t} \mid z_{t - 1}, x) q(z_{t - 1} \mid x)}{q(z_{t} \mid x)}
$$

Now observe: the denominator is just a normalization constant that does not depend on $z_{t - 1}$, and $q(z_{t} \mid z_{t - 1}, x) = q(z_{t} \mid z_{t - 1})$ by the Markov property (see [[Markov Chains]]). We can now play with the Gaussians and derive the closed form of the distribution. Let's take a close look inside the $\exp$ after taking the product. As with [[Bayesian Linear Regression]], we would like to rewrite the exponent in the form $z_{t - 1}^{T}\Sigma^{-1}z_{t - 1} - 2 \mu^{T}\Sigma^{-1}z_{t - 1} + \text{ const}$, since we know that the product is still an unnormalized Gaussian, from which we can read off the mean and covariance. So we have:

$$
\begin{align}
-2 \log q(z_{t - 1} \mid z_{t}, x) &= (z_{t - 1} - \sqrt{ \alpha_{t - 1} }x)^{T} \frac{1}{1 - \alpha_{t - 1}} (z_{t - 1} - \sqrt{ \alpha_{t - 1} }x) + (z_{t} - \sqrt{ 1 - \beta_{t} }z_{t - 1})^{T} \frac{1}{\beta_{t}} (z_{t} - \sqrt{ 1 - \beta_{t} }z_{t - 1}) + \text{const} \\
&= \frac{1}{1 - \alpha_{t - 1}} z_{t - 1}^{T}z_{t - 1} - 2 \frac{\sqrt{ \alpha_{t - 1} }}{1 - \alpha_{t - 1}} z_{t - 1}^{T}x - 2 \frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}} z_{t}^{T}z_{t - 1} + \frac{1 - \beta_{t}}{\beta_{t}} z_{t - 1}^{T}z_{t - 1} + \text{const} \\
&= z^{T}_{t - 1}\left( \frac{1}{ 1 - \alpha_{t - 1}} + \frac{1 - \beta_{t}}{\beta_{t}} \right) z_{t - 1} - 2 z^{T}_{t - 1} \left( \frac{ \sqrt{ \alpha_{t - 1} }}{1 - \alpha_{t - 1}}x + \frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}} z_{t} \right) + \text{const}
\end{align}
$$

$$
\sigma_{t - 1}^{2} = \left( \frac{1}{ 1 - \alpha_{t - 1}} + \frac{1 - \beta_{t}}{\beta_{t}} \right)^{-1} = \frac{\beta_{t}(1 - \alpha_{t - 1})}{1 - \alpha_{t}}
$$
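The simplification of the variance is worth spelling out. It only uses the definition of the schedule, $\alpha_{t} = (1 - \beta_{t})\alpha_{t - 1}$:

$$
\frac{1}{1 - \alpha_{t - 1}} + \frac{1 - \beta_{t}}{\beta_{t}} = \frac{\beta_{t} + (1 - \beta_{t})(1 - \alpha_{t - 1})}{\beta_{t}(1 - \alpha_{t - 1})} = \frac{1 - (1 - \beta_{t})\alpha_{t - 1}}{\beta_{t}(1 - \alpha_{t - 1})} = \frac{1 - \alpha_{t}}{\beta_{t}(1 - \alpha_{t - 1})}
$$

Inverting this expression gives the closed form for $\sigma_{t - 1}^{2}$.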

$$
\begin{align}
\mu_{t - 1} = m(x, z_{t}) &= \sigma_{t - 1}^{2} \left( \frac{1}{1 - \alpha_{t - 1}} \sqrt{ \alpha_{t - 1} }x + \frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}}z_{t} \right) \\
&= \frac{\beta_{t}(1 - \alpha_{t - 1})}{1 - \alpha_{t}} \left( \frac{1}{1 - \alpha_{t - 1}} \sqrt{ \alpha_{t - 1} }x +\frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}}z_{t} \right) \\
&= \frac{\beta_{t}}{1 - \alpha_{t}} \sqrt{ \alpha_{t - 1} }x + \frac{ \sqrt{ 1 - \beta_{t} }}{1 - \alpha_{t}} (1 - \alpha_{t- 1})z_{t}
\end{align}
$$

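Inverting the closed form of the forward process, we can express $x$ in terms of $z_{t}$ and the noise $\epsilon$:

$$
x = \frac{1}{\sqrt{ \alpha_{t} }} z_{t} - \frac{\sqrt{1 - \alpha_{t}}}{\sqrt{ \alpha_{t} }} \epsilon
$$

Substituting this into the mean, and using $\alpha_{t} = (1 - \beta_{t})\alpha_{t - 1}$, gives:

$$
\mu_{t - 1} = \frac{1}{\sqrt{ 1 - \beta_{t} }} \left( z_{t} - \frac{\beta_{t}}{\sqrt{ 1 - \alpha_{t} }}\epsilon \right)
$$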
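A quick numerical check confirms that the two expressions for the posterior mean agree, and that the variance matches the inverse-sum form. This is only a sketch: the linear schedule and the helper names (`posterior_mean_from_x`, `posterior_mean_from_eps`, `posterior_variance`) are assumptions made for illustration, and the functions are valid for $t \geq 2$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas = np.cumprod(1.0 - betas)     # alpha_t = prod_{i <= t} (1 - beta_i)

def posterior_mean_from_x(x, z_t, t):
    """Mean of q(z_{t-1} | z_t, x), written in terms of the clean data x."""
    a_t, a_prev, b_t = alphas[t - 1], alphas[t - 2], betas[t - 1]
    return (b_t * np.sqrt(a_prev) * x
            + np.sqrt(1.0 - b_t) * (1.0 - a_prev) * z_t) / (1.0 - a_t)

def posterior_mean_from_eps(eps, z_t, t):
    """Same mean, written in terms of the noise that produced z_t."""
    a_t, b_t = alphas[t - 1], betas[t - 1]
    return (z_t - b_t / np.sqrt(1.0 - a_t) * eps) / np.sqrt(1.0 - b_t)

def posterior_variance(t):
    a_t, a_prev, b_t = alphas[t - 1], alphas[t - 2], betas[t - 1]
    return b_t * (1.0 - a_prev) / (1.0 - a_t)

t = 500
x = rng.standard_normal(8)
eps = rng.standard_normal(8)
z_t = np.sqrt(alphas[t - 1]) * x + np.sqrt(1.0 - alphas[t - 1]) * eps

mu_x = posterior_mean_from_x(x, z_t, t)
mu_eps = posterior_mean_from_eps(eps, z_t, t)
var_direct = 1.0 / (1.0 / (1.0 - alphas[t - 2]) + (1.0 - betas[t - 1]) / betas[t - 1])
print(np.allclose(mu_x, mu_eps), np.isclose(posterior_variance(t), var_direct))
```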

We will see in a later section how this form is useful for training the reverse decoder. The idea is to use a parameterized neural network to predict this mean, because the variance is known a priori from the diffusion schedule. Since we have $q(x_{t - 1} \mid x_{t}, x_{0}) = \mathcal{N}(\mu_{t - 1}, \sigma_{t - 1}^{2})$ and our approximation is $p_{\theta}(x_{t - 1} \mid x_{t}) = \mathcal{N}(\mu_{\theta}(x_{t}, t), \sigma_{t - 1}^{2})$, the two share the same variance, so minimizing their KL divergence reduces to matching their means. See [[Gaussians#General KL divergence between Gaussians]] for more details.

### The Reverse Decoder

#### Using small variance schedules

We now briefly show that, for small variance schedules, $q(z_{t - 1}\mid z_{t}) \approx \mathcal{N}(z_{t- 1}; z_{t},\beta_{t}I)$. We first use Bayes' rule:

$$
q(\mathbf{z}_{t-1} | \mathbf{z}_t) = \frac{q(\mathbf{z}_t | \mathbf{z}_{t-1}) q(\mathbf{z}_{t-1})}{q(\mathbf{z}_t)}.
$$

Then we take the log and Taylor-expand $\log q(\mathbf{z}_{t-1})$ around $\mathbf{z}_t$ up to second order, which corrects the mean and variance of the first factor:

$$
\log q(\mathbf{z}_{t-1}) \approx \log q(\mathbf{z}_t) + (\mathbf{z}_{t-1} - \mathbf{z}_t)^\top \nabla_{\mathbf{z}_{t}} \log q(\mathbf{z}_t) + \frac{1}{2} (\mathbf{z}_{t-1} - \mathbf{z}_t)^\top \nabla^2_{\mathbf{z}_{t}} \log q(\mathbf{z}_t) (\mathbf{z}_{t-1} - \mathbf{z}_t).
$$

#### The Loss Function

We will use ideas from [[Variational Inference]]. We approximate the true reverse distribution $q(z_{t - 1} \mid z_{t})$ with a Gaussian model $p(z_{t - 1} \mid z_{t}, w)$, and maximize the ELBO (see linked note for details). The ELBO decomposes into a reconstruction term, a prior matching term and a denoising matching term. We will see that the loss function is the following:

$$
\mathcal{L}(w) = \mathbb{E}_{q} \left[ \sum_{t = 2}^{T} \ln \frac{p(z_{t - 1} \mid z_{t}, w)}{q(z_{t - 1} \mid z_{t}, x)} + \ln p(x \mid z_{1}, w) \right]
$$

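Writing the mean in terms of the noise and dropping the per-timestep weighting, this simplifies to:

$$
\mathcal{L}(w) = -\sum_{t = 1}^{T} \lVert g(\sqrt{ \alpha_{t} }x + \sqrt{ 1 - \alpha_{t} }\varepsilon_{t}, w, t) - \varepsilon_{t} \rVert ^{2}
$$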

Where $g$ is the neural network that predicts the noise, and thereby parameterizes the reverse conditional distribution. The $t = 1$ term corresponds to the reconstruction term and the remaining terms to the consistency terms, all written as a single sum in this case.

#### Training the Decoder

![[Diffusion Models-20241201211812630.webp|From Deep Learning with Bishop book]]

#### Sampling from the diffusion

![[Diffusion Models-20241201214123170.webp|Algorithm from Bishop book]]

### Training a Diffusion Model

If we know $x_{0}$ then the reverse distribution is tractable, as we have seen in [[#The Forward Encoder]]. Diffusion models are also trained with the ELBO, in a manner similar to [[Autoencoders]].
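To make the training objective and the sampling loop concrete, here is a minimal NumPy sketch under strong simplifying assumptions. The dataset is a single point `X0`, so the ideal noise predictor is known in closed form and stands in for a trained network (`oracle_g` is hypothetical, as are the schedule and the function names). The sampling loop uses the posterior mean and variance derived in the forward encoder section; it is not a transcription of the algorithm in the figures.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.05, T)   # assumed noise schedule
alphas = np.cumprod(1.0 - betas)     # alpha_t = prod_{i <= t} (1 - beta_i)
X0 = np.array([1.5, -0.5])           # toy dataset: one single data point

def oracle_g(z_t, t):
    """Stand-in for a trained network g(z_t, w, t): exact noise when all data equals X0."""
    a_t = alphas[t - 1]
    return (z_t - np.sqrt(a_t) * X0) / np.sqrt(1.0 - a_t)

def training_loss(g, x, rng):
    """One stochastic term of the noise-prediction loss (squared error, to be minimized)."""
    t = int(rng.integers(1, T + 1))          # random timestep in {1, ..., T}
    eps = rng.standard_normal(x.shape)
    z_t = np.sqrt(alphas[t - 1]) * x + np.sqrt(1.0 - alphas[t - 1]) * eps
    return float(np.sum((g(z_t, t) - eps) ** 2))

def sample(g, shape, rng):
    """Ancestral sampling: start from pure noise and denoise step by step."""
    z = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        a_t, b_t = alphas[t - 1], betas[t - 1]
        a_prev = alphas[t - 2] if t > 1 else 1.0   # alpha_0 = 1 by convention
        mean = (z - b_t / np.sqrt(1.0 - a_t) * g(z, t)) / np.sqrt(1.0 - b_t)
        var = b_t * (1.0 - a_prev) / (1.0 - a_t)   # equals 0 at t = 1
        noise = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        z = mean + np.sqrt(var) * noise
    return z

rng = np.random.default_rng(0)
loss = training_loss(oracle_g, X0, rng)
x_generated = sample(oracle_g, X0.shape, rng)
print(loss, x_generated)
```

With the oracle predictor the loss vanishes and sampling recovers the data point; with a real network, `oracle_g` would be replaced by the model and the loss would be minimized by gradient descent over many random draws of data, timestep and noise.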

The ELBO decomposes as follows:

$$
\begin{align*}
\log p(\mathbf{x}_0) &\geq \text{ELBO}_{\theta}(\mathbf{x}_0) \\
&= \mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)} [\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)] - D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0) || p(\mathbf{x}_T)) - \sum_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)} [D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) || p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t))]
\end{align*}
$$

Here we have a reconstruction term, a prior matching term that is similar to the one in VAEs, and a denoising matching term, also referred to as the consistency term. We want to make our approximation $p_{\theta}$ as close as possible to the true term $q$. Since both $q$ and $p_{\theta}$ are Gaussian, the optimization reduces to matching the means, i.e. we want $\mu_{\theta}(x_{t}, t) \approx \mu_{q}(x_{t}, t)$. We can rewrite everything so that the network just matches the noise. This is a better-posed problem: the noise is Gaussian, which is an easier distribution to predict than the space of all possible images.

### Classifier Guidance

The idea is to use a classifier to guide the diffusion process, so that we can generate samples from a specific class. Another way to achieve a similar result (but with more difficulty) is plain conditional generation (creating an embedding).

#### Modifying the loss

The main difference compared to classical diffusion is that we adjust the noise signal based on the class $y$, using a classifier that works on noisy images:

$$
\varepsilon_{t} \leftarrow \varepsilon_{t} + \lambda \nabla_{\varepsilon_{t}} \log p(y \mid \varepsilon_{t})
$$

Where the log-probability comes from the classifier that we are using to guide the diffusion process. The idea is to **nudge** the noise towards some class of images.

Problems with the technique:
- We want the classifier to work on noised data close to the inputs of the diffusion model, and this is not always easy to obtain.
- The classifier could make mistakes in the direction.
- It is not very flexible, since only the classes on which the classifier has been trained are available.

#### Classifier-free guidance

If we have a conditioned generator, we can re-purpose it to be an unconditioned generator.

$$
\begin{align*}
\varepsilon_{\theta}^{*}(\boldsymbol{x},c;t) &= (1 + \rho)\varepsilon_{\theta}(\boldsymbol{x},c;t) - \rho\varepsilon_{\theta}(\boldsymbol{x};t) \\
&= \varepsilon_{\theta}(\boldsymbol{x},c;t) + \rho(\varepsilon_{\theta}(\boldsymbol{x},c;t)-\varepsilon_{\theta}(\boldsymbol{x};t))
\end{align*}
$$
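The combination above is a simple extrapolation between two noise predictions. In the sketch below, `eps_cond` and `eps_uncond` are hypothetical stand-ins for the outputs of the conditioned and unconditioned network evaluations at the same noisy input and timestep; in practice both typically come from one network, with the condition dropped for the second call.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, rho):
    """Classifier-free guidance: push the conditional prediction away from the unconditional one."""
    return (1.0 + rho) * eps_cond - rho * eps_uncond

# Toy predictions standing in for two network evaluations.
eps_cond = np.array([0.2, -1.0, 0.5])
eps_uncond = np.array([0.0, -0.5, 0.5])

no_guidance = guided_noise(eps_cond, eps_uncond, rho=0.0)   # plain conditional prediction
strong = guided_noise(eps_cond, eps_uncond, rho=2.0)        # amplified conditional direction
print(no_guidance, strong)
```

Setting `rho = 0` recovers the conditional prediction unchanged, while larger values move further along the direction in which conditioning changes the prediction.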

In the guidance formula above, $\rho$ is the strength of the guidance. In this manner, we can generate in both a conditioned and an unconditioned manner, and we can trade off **diversity** against quality. This is one of the best solutions at the time of writing.

#### ControlNet

A difficulty with text embeddings is that it is hard to create an image that closely resembles what you meant with the text, since the meaning is usually highly contextual. Can we condition on other images, for example sketches? This is the idea of ControlNet: start from a sketch, and condition on both the sketch and the text. It has some similarities with feature modulation used in [[Generative Adversarial Networks]] and [[Normalizing Flows]]. See [[@zhangAddingConditionalControl2023]].

![[Diffusion Models-20250421140854735.webp|Initialized to zero so that it is limited in the beginning.]]

One problem is that this approach **doubles the parameter space**.

## Comparisons with other models

Diffusion models offer **high quality and diverse generations** with **model flexibility** and **more stable training**.

VAEs (see [[Autoencoders]]) usually have problems with quality, since many features are not disentangled in the latent space. GANs (see [[Generative Adversarial Networks]]) suffer from mode collapse and training instabilities that make them difficult to train. Normalizing Flows (see [[Normalizing Flows]]) need many parameters and a high computational cost to produce good quality outputs, but they offer invertibility of the transformations.

References

[1] Ho et al. “Denoising Diffusion Probabilistic Models” arXiv preprint arXiv:2006.11239 2020