Diffusion is a physical process that models random motion, first observed by Brown while studying pollen grains in water. In this section, we will first analyze a simplified 1-dimensional version, and then move on to diffusion models for images, in the formulation closest to (Ho et al. 2020).
The Diffusion Process
This note follows Einstein’s original presentation, in a simplified version.
Let’s suppose that at $t = 0$ we have a particle at some position $i$. At each step it jumps to the left with probability $p$, to the right with probability $q$, and stays at the same position with probability $1 - p - q$.
Concentration 🟨-
We would like to know the concentration of particles at a certain position after a fixed number of steps, and then extend the idea to $N$ particles at the start. Let’s call this concentration $C_{i}(t)$. The number of particles at a given time step and position is then $n_{i}(t) = N C_{i}(t)$.
Markov Process 🟨
Given the concentration process above, we would like to determine how it evolves in time. We have the following relation:
$$ C_{i}(t + 1) = C_{i - 1}(t)q + C_{i + 1}(t)p + (1 - q - p) C_{i}(t) $$This defines a Markov process, see Markov Processes, where the value at a certain timestep depends only on the previous one. We can also read this as a recurrence relation; it is often used to model Brownian motion. Now consider the difference between two consecutive timesteps:
$$ \begin{align} C_{i}(t + 1) - C_{i}(t) &= C_{i - 1}(t)q + C_{i + 1}(t)p + (1 - q - p) C_{i}(t) - C_{i}(t) \\ &= q(C_{i - 1}(t) - C_{i}(t)) + p(C_{i + 1}(t) - C_{i}(t)) \\ &= p C_{i + 1}(t) - (p + q)C_{i}(t) + qC_{i - 1}(t) \\ &= \frac{p - q}{2} (C_{i + 1}(t) - C_{i - 1}(t)) + \frac{p + q}{2}(C_{i + 1}(t) - 2C_{i}(t) + C_{i - 1}(t)) \end{align} $$From the Markov process we obtain the master equation, and from that the Fokker-Planck equation. The key point is that we can interpret the temporal difference as the sum of a first and a second spatial difference: the first-derivative term tells us the preferred direction of motion (drift), the second-derivative term how fast the particles spread out (diffusion).
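As a quick illustration, here is a minimal NumPy sketch of this update rule; the values of $p$, $q$, the lattice size, and the number of steps are arbitrary choices, and the boundary is treated as periodic for simplicity.

```python
import numpy as np

# Minimal sketch of the discrete master-equation update above.
# p: probability of jumping one site to the left,
# q: probability of jumping one site to the right;
# values, lattice size and number of steps are arbitrary illustrative choices.
p, q = 0.2, 0.2
n_sites, n_steps = 101, 200

# All the concentration starts on the central site.
C = np.zeros(n_sites)
C[n_sites // 2] = 1.0

for _ in range(n_steps):
    C_left = np.roll(C, 1)    # C_{i-1} (periodic boundary for simplicity)
    C_right = np.roll(C, -1)  # C_{i+1}
    # arrivals from i-1 (right jumps), from i+1 (left jumps), plus particles that stay
    C = q * C_left + p * C_right + (1 - p - q) * C

print(C.sum())  # total concentration is conserved (stays 1.0)
```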
Going into continuous time 🟥++
Let’s assume updates happen every $\Delta t$ (so smaller time deltas mean more frequent updates), and let $\tau$ be the characteristic time scale of the system. We define $\tau$ so that $\tau / \Delta t$ is the number of updates corresponding to one Markov step as before. More intuitively, the characteristic time $\tau$ is the amount of time the system needs to make a single change, while $\Delta t$ is how often we check the system; this is why we need to check the system $\frac{\tau}{\Delta t}$ times to observe a change.
We can now rescale the jump probabilities with this notion of time: $p'= p \Delta t / \tau$, and analogously for $q'$. The Markov process then reads
$$ C_{i}(t + \Delta t) - C_{i}(t) = \frac{p' - q'}{2} (C_{i + 1}(t) - C_{i - 1}(t)) + \frac{p' + q'}{2}(C_{i + 1}(t) - 2C_{i}(t) + C_{i - 1}(t)) $$If we consider the limit $\Delta t \to 0$ then we have that
$$ \tau \frac{d}{dt} C_{i}(t) = \frac{p - q}{2} (C_{i + 1}(t) - C_{i - 1}(t)) + \frac{p + q}{2}(C_{i + 1}(t) - 2C_{i}(t) + C_{i - 1}(t)) $$This is the master equation for the diffusion process. Going to continuous time only rescales the jump probabilities: if less than a full characteristic time has passed, we only get the corresponding fraction of the jump probability.
Continuous Space 🟨–
Here, we rescale space as $i \pm 1 \to x \pm \Delta x$. To simplify the calculations, we further assume that there is no drift, meaning $p = q = \tilde{D}$. We now think of steps of size $\delta$ as being made of small steps of size $\Delta x$, taken over a time $\tau$. We rescale the probability as $p'' = p (\delta / \Delta x )^{2}$ (and analogously $q''$), which is motivated by the second-difference term above: it is exactly the factor needed for that term to stay finite as $\Delta x \to 0$. Rewriting everything, we get the continuous-space equation:
$$ \begin{align} \tau \frac{d}{dt} C_{i}(t) &= \frac{p - q}{2} (C_{i + 1}(t) - C_{i - 1}(t)) + \frac{p'' + q''}{2}(C_{i + 1}(t) - 2C_{i}(t) + C_{i - 1}(t)) \\ &= \underbrace{\frac{p - q}{2}}_{p = q = \tilde{D} \implies p-q = 0}(C_{i + 1}(t) - C_{i - 1}(t)) + \delta^{2}\frac{p + q}{2}\frac{(C_{i + 1}(t) - 2C_{i}(t) + C_{i - 1}(t))}{\Delta x^{2}} \\ &\implies\frac{ \partial }{ \partial t } C(x, t) = D \frac{ \partial^{2} }{ \partial x^{2} } C(x, t), \text{ with } D = \frac{ \delta^{2} \tilde{D}}{ \tau } \end{align} $$When $\tilde{D} = \frac{1}{2}$ we recover Einstein’s diffusion coefficient $D = \frac{\delta^{2}}{2\tau}$. Here $D$ is our diffusion coefficient. Taking a Fourier transform in space turns this into an ordinary differential equation in time, which is solvable; see the next section.
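As a sanity check, here is a minimal explicit finite-difference sketch of this equation; $D$, the grid, the time step, and the initial condition are illustrative choices (the time step respects the usual stability condition $D\,\Delta t / \Delta x^{2} \le 1/2$).

```python
import numpy as np

# Explicit finite-difference sketch of dC/dt = D * d^2C/dx^2.
# D, grid spacing, time step and horizon are illustrative choices;
# D * dt / dx**2 = 0.4 <= 0.5, so the explicit scheme is stable.
D = 1.0
dx, dt, n_steps = 0.1, 0.004, 250
x = np.arange(-5.0, 5.0 + dx, dx)

# Narrow Gaussian as a stand-in for a point-like initial condition at time t0.
t0 = 0.01
C = np.exp(-x**2 / (4 * D * t0)) / np.sqrt(4 * np.pi * D * t0)

for _ in range(n_steps):
    laplacian = (np.roll(C, -1) - 2 * C + np.roll(C, 1)) / dx**2  # periodic boundary
    C = C + dt * D * laplacian

# The variance of the profile should be close to 2 * D * t (linear growth in time).
t = t0 + n_steps * dt
variance = np.sum(x**2 * C) / np.sum(C)
print(variance, 2 * D * t)
```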
We can prove that the mean squared displacement (variance) grows linearly with time.
Solution for the diffusion process
One can see that the following equation is a solution for the above problem:
$$ C(x, t) = \frac{1}{\sqrt{ 4 \pi D t }} e^{-x^{2} / 4Dt} $$One can verify this by brute force, but a nicer method is to pass to Fourier space, where the equation is simpler to treat. The Fourier transform of a function $f(x, t)$ is
$$ F(k, t) = \int e^{-ikx}f(x, t) \, dx $$And its inverse is
$$ f(x, t) = \frac{1}{2\pi}\int e^{ikx}F(k, t) \, dk $$Applying the Fourier transform to the diffusion equation, we get the following:
$$ \begin{align} \frac{d}{dt} \int e^{-ikx}C(x, t) \, dx = D \int e^{-ikx} \frac{ \partial^{2} }{ \partial x^{2} } C(x, t) \, dx \\ \frac{d}{dt} F(k, t) = -Dk^{2}F(k, t) \end{align} $$Where the right hand side is derived through a double integration by parts.
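To close the argument: solving this ordinary differential equation for a point-like initial condition $C(x, 0) = \delta(x)$ (so that $F(k, 0) = 1$) and applying the inverse transform recovers the Gaussian solution above:
$$ \begin{align} F(k, t) &= F(k, 0)\, e^{-Dk^{2}t} = e^{-Dk^{2}t} \\ C(x, t) &= \frac{1}{2\pi}\int e^{ikx} e^{-Dk^{2}t} \, dk = \frac{1}{\sqrt{ 4 \pi D t }} e^{-x^{2}/4Dt} \end{align} $$This is a Gaussian with variance $2Dt$, which also confirms that the mean squared displacement grows linearly in time, as claimed above.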
Fokker-Planck Equation
For completeness, we also give some details about the Fokker-Planck equation. Given a density function $f(x, t) = p(x, t \mid x_{0}, t_{0})$, the Fokker-Planck equation is the following:
$$ \frac{ \partial }{ \partial t } f(x, t) = - \frac{ \partial }{ \partial x }\underbrace{ (A(x, t)f(x, t))}_{\substack{\text{Drift Term with} \\ \text{Force } A(x, t)}} + \frac{ \partial^{2} }{ \partial x^{2} } \underbrace{(B(x, t)f(x, t))}_{\substack{\text{Diffusion Term with} \\ \text{parameter } B(x, t)}} $$The density of the Ornstein-Uhlenbeck process solves the Fokker-Planck equation with $A(x, t) = - \gamma x$ and $B(x, t) = \sigma^{2}/2$ (with the convention above, where no $\frac{1}{2}$ appears in front of the diffusion term), where $\gamma$ and $\sigma$ are constants. Its SDE has the form $dx = - \gamma x \, dt + \sigma \, dW$.
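Purely as an illustration (not needed for what follows), here is a minimal Euler-Maruyama sketch of the Ornstein-Uhlenbeck SDE above; $\gamma$, $\sigma$, the time step, and the horizon are arbitrary choices.

```python
import numpy as np

# Euler-Maruyama sketch of the OU SDE dx = -gamma * x dt + sigma dW.
# gamma, sigma, dt and the number of steps are illustrative choices.
rng = np.random.default_rng(0)
gamma, sigma = 1.0, 0.5
dt, n_steps, n_paths = 0.01, 5000, 1000

x = np.full(n_paths, 3.0)  # start all paths away from the origin
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)  # Brownian increment
    x = x - gamma * x * dt + sigma * dW

# The stationary distribution is N(0, sigma^2 / (2 * gamma)).
print(x.mean(), x.var(), sigma**2 / (2 * gamma))
```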
This is not essential for the analysis of diffusion models, but keep in mind that the theory has its roots in physics!
Introduction to Diffusion Models
The Forward Encoder
- What is the link to Autoencoders?
Noise Schedule
The forward noising process of diffusion models, governed by a noise schedule $\beta_{t}$, is defined as follows:
$$ z_{t} = \sqrt{ 1 - \beta_{t} } z_{t - 1} + \sqrt{\beta_{t}} \epsilon_{t}$$with $z_{0} = x$, where $0 < \beta_{t} < 1$ for all $t$ and $\epsilon_{t} \sim \mathcal{N}(0, I)$. Equivalently, $z_{t} \sim \mathcal{N}(z_{t}; \sqrt{ 1 - \beta_{t} }z_{t - 1}, \beta_{t}I)$. If $z_{t - 1}$ is a Gaussian random variable with mean $\mu_{t - 1}$ and covariance $\Sigma_{t - 1}$, then in the limit of many steps the distribution converges to a Gaussian with mean $0$ and covariance $I$. This choice is made with the idea that, if we can invert the process, generating a new sample from the data manifold amounts to sampling from a standard Gaussian and then applying the inverse noise transformation many times.
We can write a closed form for the forward process:
$$ z_{t} = \sqrt{ \alpha_{t} } z_{0} + \sqrt{1 - \alpha_{t}} \epsilon $$Where $\alpha_{t} = \prod_{i = 1}^{t} (1 - \beta_{i})$. This is possible because a sum of independent Gaussians is again Gaussian, so we only need to match the mean and variance of the resulting distribution. We can therefore write $z_{t} \sim \mathcal{N}(z_{t}; \sqrt{ \alpha_{t} } z_{0}, (1 - \alpha_{t})I)$.
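Here is a minimal NumPy sketch of both forms of the forward process (the step-by-step update and the closed form); the linear $\beta$ schedule, its range, and the number of steps are illustrative choices, not necessarily the setup of (Ho et al. 2020).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule (values chosen arbitrarily here).
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # betas[t-1] is beta_t for t = 1..T
alphas = np.cumprod(1.0 - betas)        # alphas[t-1] is alpha_t = prod_{i<=t} (1 - beta_i)

def forward_step(z_prev, t):
    """One step z_{t-1} -> z_t of the Markov chain."""
    eps = rng.normal(size=z_prev.shape)
    return np.sqrt(1.0 - betas[t - 1]) * z_prev + np.sqrt(betas[t - 1]) * eps

def forward_closed_form(x, t):
    """Sample z_t directly from q(z_t | x) using the closed form."""
    eps = rng.normal(size=x.shape)
    return np.sqrt(alphas[t - 1]) * x + np.sqrt(1.0 - alphas[t - 1]) * eps

x = rng.normal(size=(8,))         # a toy "data" vector
z_T = forward_closed_form(x, T)   # nearly N(0, I) for large T
```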
Diffusion Kernel
- Write the closed form
- Why is this usually useful?
Closed-form reverse conditional distribution
We will now use the same tricks from Bayesian Linear Regression for computing closed forms of Gaussian distributions.
We use Bayes’ rule to get the reverse conditional distribution:
$$ q(z_{t - 1} \mid z_{t}, x) = \frac{q(z_{t} \mid z_{t - 1}, x) q(z_{t - 1} \mid x)}{q(z_{t} \mid x)} $$Now observe: the denominator is just a normalization constant that does not depend on $z_{t - 1}$, and $q(z_{t} \mid z_{t - 1}, x) = q(z_{t} \mid z_{t - 1})$ by the Markov property (see Markov Chains). Now we can play with the Gaussians and derive the closed form of the distribution. Let’s look closely at what is inside the $\exp$ after taking the product. As in Bayesian Linear Regression, we would like to rewrite it in the form $z_{t-1}^{T}\Sigma^{-1}z_{t-1} - 2 \mu^{T}\Sigma^{-1}z_{t-1} + \text{const}$, since we know the product is still proportional to a Gaussian in $z_{t-1}$ and only needs to be normalized.
So we have:
$$ \begin{align} -2\log q(z_{t - 1} \mid z_{t}, x) & = (z_{t - 1} - \sqrt{ \alpha_{t - 1} }x)^{T} \frac{1}{1 - \alpha_{t - 1}} (z_{t - 1} - \sqrt{ \alpha_{t - 1} }x) + (z_{t} - \sqrt{ 1 - \beta_{t} }z_{t - 1})^{T} \frac{1}{\beta_{t}} (z_{t} - \sqrt{ 1 - \beta_{t} }z_{t - 1}) + \text{const} \\ & = \frac{1}{1 - \alpha_{t - 1}} z_{t - 1}^{T}z_{t - 1} - 2 \frac{1}{1 - \alpha_{t - 1}} z_{t - 1}^{T}\sqrt{ \alpha_{t - 1} }x - 2 \frac{1}{\beta_{t}} z_{t}^{T}\sqrt{ 1 - \beta_{t} }z_{t - 1} + \frac{1 - \beta_{t}}{\beta_{t}} z_{t - 1}^{T}z_{t - 1} + \text{const} \\ &= z^{T}_{t - 1}\left( \frac{1}{ 1 - \alpha_{t - 1}} + \frac{1 - \beta_{t}}{\beta_{t}} \right) z_{t - 1} - 2 z^{T}_{t - 1} \left( \frac{ \sqrt{ \alpha_{t - 1} }}{1 - \alpha_{t - 1}}x + \frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}} z_{t} \right) + \text{const} \end{align} $$And then we can see that:
$$ \sigma_{t - 1}^{2} = \left( \frac{1}{ 1 - \alpha_{t - 1}} + \frac{1 - \beta_{t}}{\beta_{t}} \right)^{-1} = \frac{\beta_{t}(1 - \alpha_{t - 1})}{1 - \alpha_{t}} $$And that:
$$ \begin{align} \mu_{t - 1} = m(x, z_{t})&= \sigma_{t - 1}^{2} \left( \frac{1}{1 - \alpha_{t - 1}} \sqrt{ \alpha_{t - 1} }x + \frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}}z_{t} \right) \\ &= \frac{\beta_{t}(1 - \alpha_{t - 1})}{1 - \alpha_{t}} \left( \frac{1}{1 - \alpha_{t - 1}} \sqrt{ \alpha_{t - 1} }x +\frac{\sqrt{ 1 - \beta_{t} }}{\beta_{t}}z_{t} \right) \\ &= \frac{\beta_{t}}{1 - \alpha_{t}} \sqrt{ \alpha_{t - 1} }x + \frac{ \sqrt{ 1 - \beta_{t} }}{1 - \alpha_{t}} (1 - \alpha_{t- 1})z_{t} \end{align} $$If we invert the forward closed form to express $x$ in terms of $z_{t}$ and $\epsilon$:
$$ x = \frac{1}{\sqrt{ \alpha_{t} }} z_{t} - \frac{\sqrt{1 - \alpha_{t}}}{\sqrt{ \alpha_{t} }} \epsilon $$Then we can rewrite the posterior mean as:
$$ \mu_{t - 1} = \frac{1}{\sqrt{ 1 - \beta_{t} }} \left( z_{t} - \frac{\beta_{t}}{\sqrt{ 1 - \alpha_{t} }}\epsilon \right) $$We will see in a later section how this form is useful for training the reverse decoder.
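The closed form above translates directly into code; this is a minimal sketch, with the same 1-based time indexing as the forward-process snippet above (so `alphas[t - 1]` is $\alpha_{t}$ and $\alpha_{0} = 1$).

```python
import numpy as np

def reverse_posterior(z_t, x, t, betas, alphas):
    """Mean and variance of q(z_{t-1} | z_t, x) from the closed form above.

    betas[t - 1] is beta_t and alphas[t - 1] is alpha_t (1-based time index t),
    as in the forward-process snippet.
    """
    beta_t = betas[t - 1]
    alpha_t = alphas[t - 1]
    alpha_prev = alphas[t - 2] if t > 1 else 1.0  # alpha_0 = 1 (empty product)

    var = beta_t * (1.0 - alpha_prev) / (1.0 - alpha_t)
    mean = (beta_t * np.sqrt(alpha_prev) * x
            + np.sqrt(1.0 - beta_t) * (1.0 - alpha_prev) * z_t) / (1.0 - alpha_t)
    return mean, var
```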
The Reverse Decoder
Using small variance schedules 🟩
We now show briefly that, for small variance schedules $\beta_{t}$, the reverse distribution satisfies $q(z_{t - 1}\mid z_{t}) \approx \mathcal{N}(z_{t- 1}; z_{t},\beta_{t}I)$.
We first use Bayes’ rule:
$$ q(\mathbf{z}_{t-1} | \mathbf{z}_t) = \frac{q(\mathbf{z}_t | \mathbf{z}_{t-1}) q(\mathbf{z}_{t-1})}{q(\mathbf{z}_t)}. $$Then we take the log and Taylor-expand $\log q(\mathbf{z}_{t-1})$ around $\mathbf{z}_{t}$ up to second order, which corrects the mean and variance contributed by the first factor:
$$ \log q(\mathbf{z}_{t-1}) \approx \log q(\mathbf{z}_t) + (\mathbf{z}_{t-1} - \mathbf{z}_t)^\top \nabla \log q(\mathbf{z}_t) + \frac{1}{2} (\mathbf{z}_{t-1} - \mathbf{z}_t)^\top \nabla^2 \log q(\mathbf{z}_t) (\mathbf{z}_{t-1} - \mathbf{z}_t) $$
The Loss Function 🟨–
We will use ideas from Variational Inference: we approximate the true reverse distribution with a learned Gaussian model $p(z_{t-1} \mid z_{t}, w)$ and maximize the ELBO (see the linked note for details), with the forward-process distributions $q$ playing the role of the fixed variational distribution.
We will see that the loss function is the following:
$$ \mathcal{L}(w) = \mathbb{E}_{q} \left[ \sum_{t = 2}^{T} \ln \frac{p(z_{t - 1} \mid z_{t}, w)}{q(z_{t - 1} \mid z_{t}, x)} + \ln p(x \mid z_{1}, w) \right] $$
This can be rewritten in a form closer to Variational Autoencoders, studied in Autoencoders, where we have a reconstruction term plus a consistency term.
With some rather long and involved manipulations, we can rewrite the loss function as:
$$ \mathcal{L}(w) = -\sum_{t = 1}^{T} \lVert g(\sqrt{ \alpha_{t} }x + \sqrt{ 1 - \alpha_{t} }\varepsilon_{t}, w, t) - \varepsilon_{t} \rVert ^{2} $$Where $g$ is the neural network we use to approximate the reverse conditional distribution (it predicts the added noise). The $t = 1$ term plays the role of the reconstruction term, and the remaining terms form the consistency term, all in a single sum here.
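A minimal sketch of the corresponding training objective for a single data point: `g` is a stand-in for the noise-prediction network $g(\cdot, w, t)$, and the helper below is hypothetical, not the training loop of (Ho et al. 2020). In practice one samples a single random timestep per example (a Monte Carlo estimate of the sum) and minimizes the squared error on the noise, which corresponds to maximizing the objective above.

```python
import numpy as np

def diffusion_loss(g, x, betas, alphas, rng):
    """Monte Carlo estimate of the noise-prediction loss for a single sample.

    g is a stand-in for the network g(z_t, w, t): any callable taking (z_t, t)
    and returning a prediction of the noise. betas / alphas are the arrays from
    the forward-process snippet (1-based time index t).
    """
    T = len(betas)
    t = rng.integers(1, T + 1)               # sample a timestep uniformly in 1..T
    eps = rng.normal(size=x.shape)           # sample the noise epsilon_t
    z_t = np.sqrt(alphas[t - 1]) * x + np.sqrt(1.0 - alphas[t - 1]) * eps
    return np.sum((g(z_t, t) - eps) ** 2)    # squared error on the noise
```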
Training the Decoder
Sampling from the diffusion
References
[1] Ho, J., Jain, A., and Abbeel, P. “Denoising Diffusion Probabilistic Models.” NeurIPS, 2020.