Normalizing flows have both a latent space and a tractable, explicit probability distribution (they are closest to Autoregressive Modelling, which also gives tractable likelihoods but has no latent space). This means we can compute the likelihood of any given sample.

This approach to modelling a flexible distribution is called a normalizing flow because the transformation of a probability distribution through a sequence of mappings is somewhat analogous to the flow of a fluid. From (Bishop & Bishop 2024)

The main idea

The intuition here is that we can have both a latent space and tractable likelihoods, as in autoregressive modelling. We want:

  • An analytical model that is easy to sample from.
  • A distribution flexible enough to represent complex data.
  • The idea is to apply many changes of variables:
$$\iint_R f(x,y) \, dx \, dy = \iint_G f(g(u,v),h(u,v)) \, \lvert J(u,v) \rvert \, du \, dv$$

$$ dx = \left\lvert \det \left( \frac{ \partial f^{-1}(z) }{ \partial z } \right) \right\rvert dz $$

$$ dx = dz \left\lvert \det \left( \frac{ \partial f(z) }{ \partial z } \right) \right\rvert ^{-1} $$

Recall that for an invertible matrix $A$ we have $\det(A^{-1}) = \det(A)^{-1}$, which is why the last two expressions are equivalent.
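As a quick one-dimensional sanity check (my own example, not from the lecture): take $z \sim \mathcal{N}(0,1)$ and $x = f(z) = e^{z}$, so that $f^{-1}(x) = \ln x$ and $\frac{ \partial f^{-1}(x) }{ \partial x } = \frac{1}{x}$. Then

$$ p_{X}(x) = p_{Z}(\ln x) \cdot \left\lvert \frac{1}{x} \right\rvert = \frac{1}{x \sqrt{ 2\pi }} \exp\left( -\frac{(\ln x)^{2}}{2} \right), $$

which is exactly the standard log-normal density, as expected.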

Normalizing Flows

Normalizing flows are a direct application of the change of variables formula. We have three desiderata:

  • Invertible
  • Differentiable
  • Preserve dimensionality

The model

$$ f_{\theta}: \mathbb{R}^{d} \to \mathbb{R}^{d}, \quad \text{s.t. } X = f_{\theta}(Z) \text{ and } Z = f_{\theta}^{-1}(X) $$

It is important that the transformation is invertible (but this also raises expressivity concerns, because it restricts the class of functions we can use).

$$ p_{X}(x ; \theta) = p_{z} (f^{-1}_{\theta}(x)) \cdot \left\lvert \det \left( \frac{ \partial f^{-1}_{\theta}(x) }{ \partial x } \right) \right\rvert $$
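As a minimal numerical sketch of this formula (my own example, not from the lecture), consider an affine flow $x = f_{\theta}(z) = az + b$ with a standard normal base distribution; the values of `a` and `b` below are arbitrary.

```python
import numpy as np
from scipy.stats import norm

# Affine flow x = f(z) = a*z + b with standard normal base density p_Z = N(0, 1).
a, b = 2.0, 0.5                          # arbitrary flow "parameters" (illustrative)
f_inv = lambda x: (x - b) / a            # f^{-1}(x)
log_abs_det_jac = -np.log(abs(a))        # log |d f^{-1}(x) / dx| = -log|a|

x = 1.3
# Change-of-variables density: log p_X(x) = log p_Z(f^{-1}(x)) + log |d f^{-1}/dx|
log_px = norm.logpdf(f_inv(x)) + log_abs_det_jac

# For an affine map the pushforward is known in closed form: X ~ N(b, a^2).
assert np.isclose(log_px, norm.logpdf(x, loc=b, scale=abs(a)))
print(log_px)
```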

Parameterizing the transformation

We want to find a way to learn the function $f$. The difficulty is building neural networks that are invertible and preserve dimensionality (some activations like ReLU are not invertible). We also want to compute the Jacobian determinant efficiently: this costs only $O(d)$ if the Jacobian is triangular, since the determinant is then just the product of the diagonal entries.

$$ x = f(z) = f_{k} \circ f_{k - 1} \circ \ldots \circ f_{1}(z) $$

$$ p_{X}(x ; \theta) = p_{Z}(z) \cdot \prod_{i = 1}^{k} \left\lvert \det \left( \frac{ \partial f_{i}^{-1}(x) }{ \partial x } \right) \right\rvert $$

The Coupling Layer

The main idea of a normalizing flow is to compose invertible transformations so that we get both the likelihood of the generated data and the generation itself. Simply stacking linear layers, though invertible, does not work (a stack of linear layers is still a linear layer).

Here we use an idea similar to the Feistel network from cryptography (see Block Ciphers): partition the input dimensions and apply a transformation to only one part, conditioned on the other. This technique was introduced as real NVP (real-valued non-volume preserving).

$$ \begin{pmatrix} y_{A} \\ y_{B} \end{pmatrix} = \begin{pmatrix} h(x_{A}, x_{B}; \theta) \\ x_{B} \end{pmatrix} $$

$$ \begin{pmatrix} x_{A} \\ x_{B} \end{pmatrix} = \begin{pmatrix} h^{-1}(y_{A}, y_{B}; \theta) \\ y_{B} \end{pmatrix} $$

$$ h(x_{A}, x_{B}; \theta) = \exp(\theta(x_{B})) \odot x_{A} + \mu(x_{B}, w) $$

Where $\odot$ is the element-wise product, and $\theta$ and $\mu$ are functions of the untouched half $x_{B}$ with parameters $w$. You can see that the inverse is quite easy to get: since $y_{B} = x_{B}$, we recover $x_{A} = (y_{A} - \mu(y_{B}, w)) \odot \exp(-\theta(y_{B}))$.

$$ \frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial h}{\partial x_{A}} & \frac{\partial h}{\partial x_{B}} \\ 0 & I \end{pmatrix} $$

And this determinant is very easy to compute: the matrix is block triangular, so the determinant is $\det\left( \frac{\partial h}{\partial x_{A}} \right) = \prod_{i} \exp(\theta(x_{B})_{i})$, and its log is just the sum of the components of $\theta(x_{B})$.
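A minimal PyTorch sketch of such an affine coupling layer (my own illustration matching the equations above; the small MLPs used for $\theta$ and $\mu$ are arbitrary choices):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y_A = exp(theta(x_B)) * x_A + mu(x_B),  y_B = x_B."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        assert dim % 2 == 0
        half = dim // 2
        # theta (log-scale) and mu (shift) are small MLPs of the untouched half x_B.
        self.theta = nn.Sequential(nn.Linear(half, hidden), nn.Tanh(), nn.Linear(hidden, half))
        self.mu = nn.Sequential(nn.Linear(half, hidden), nn.Tanh(), nn.Linear(hidden, half))

    def forward(self, x):
        x_a, x_b = x.chunk(2, dim=-1)
        log_s = self.theta(x_b)
        y_a = torch.exp(log_s) * x_a + self.mu(x_b)
        # Block-triangular Jacobian: log|det J| is just the sum of the log-scales.
        log_det = log_s.sum(dim=-1)
        return torch.cat([y_a, x_b], dim=-1), log_det

    def inverse(self, y):
        y_a, y_b = y.chunk(2, dim=-1)
        x_a = (y_a - self.mu(y_b)) * torch.exp(-self.theta(y_b))
        return torch.cat([x_a, y_b], dim=-1)
```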

Training the flow of transformations

$$ \log p_{x}(x) = \log p_{z}(f^{-1}_{\theta}(x)) + \sum_{i=1}^{k} \log \left\lvert \det \left( \frac{ \partial f_{i}^{-1}(x) }{ \partial x } \right) \right\rvert $$

We train by maximizing this log-likelihood, summed over the whole dataset. Because of this structure, we can explicitly evaluate how probable any sample is under the model.
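A sketch of the corresponding training loop (my own illustration, reusing the `AffineCoupling` sketch above; dataset, depth and optimizer settings are arbitrary). The halves are swapped between coupling layers so that every dimension eventually gets transformed, and we minimize the negative log-likelihood:

```python
import torch
import torch.nn as nn

class Flip(nn.Module):
    """Swap the halves so the next coupling layer transforms the other half (log|det| = 0)."""
    def forward(self, x):
        x_a, x_b = x.chunk(2, dim=-1)
        return torch.cat([x_b, x_a], dim=-1), torch.zeros(x.shape[0])
    def inverse(self, y):
        y_a, y_b = y.chunk(2, dim=-1)
        return torch.cat([y_b, y_a], dim=-1)

dim = 2
flow = nn.ModuleList([AffineCoupling(dim), Flip(), AffineCoupling(dim), Flip(), AffineCoupling(dim)])
base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)

data = 0.5 * torch.randn(1024, dim) + 1.0            # toy dataset (illustrative)
for step in range(1000):
    z, log_det = data, torch.zeros(data.shape[0])
    for layer in flow:                                # forward direction used as f^{-1}: data -> latent
        z, ld = layer(z)
        log_det = log_det + ld
    # log p_X(x) = log p_Z(z) + sum_i log|det J_i|
    loss = -(base.log_prob(z).sum(dim=-1) + log_det).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```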

Critiques

We don’t know how many layers we would need, the same problem we had in Clustering. With an idea similar to Dirichlet Processes, we can move to a continuous (meta-)parametrization and obtain continuous normalizing flows, close to (Chen et al. 2019), so that we do not need to fix a specific number of layers.

Because the flow must preserve dimensionality, working at full resolution is computationally expensive, so training and sampling are quite slow.

The Architecture

Squeeze 🟨–

We want to reshape, for example, a $4 \times 4 \times 1$ tensor into a $2 \times 2 \times 4$ one. A checkerboard-like pattern is usually used to move spatial positions into channels, so that the flow can process the parts in parallel. Squeeze layers only trade spatial resolution for channels; they do not change the number of values (see the sketch below).
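A minimal sketch of such a squeeze/unsqueeze pair (my own illustration of a 2×2 space-to-depth reshape, assuming channel-last tensors without a batch dimension):

```python
import torch

def squeeze2d(x):
    """Space-to-depth: (H, W, C) -> (H/2, W/2, 4C), e.g. 4x4x1 -> 2x2x4."""
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    x = x.permute(0, 2, 1, 3, 4)          # group each 2x2 spatial block together
    return x.reshape(H // 2, W // 2, 4 * C)

def unsqueeze2d(y):
    """Inverse of squeeze2d: (H/2, W/2, 4C) -> (H, W, C)."""
    h, w, c4 = y.shape
    y = y.reshape(h, w, 2, 2, c4 // 4)
    y = y.permute(0, 2, 1, 3, 4)
    return y.reshape(2 * h, 2 * w, c4 // 4)

x = torch.arange(16.0).reshape(4, 4, 1)
assert squeeze2d(x).shape == (2, 2, 4)
assert torch.equal(unsqueeze2d(squeeze2d(x)), x)
```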

Flow step

(Figure: a single flow step, consisting of actnorm, an invertible 1×1 convolution, and an affine coupling layer.)

Actnorm is similar to a batch norm (see the section in Convolutional Neural Network): a per-channel affine transformation whose scale and bias are initialized from the data.

1x1 Convolutions

1x1 convolutions are generalizations of permutations (i.e. they reduce to a permutation of the channels when the weight matrix is a permutation matrix); they were introduced in (Kingma & Dhariwal 2018).

$$ W = PL(U + \text{diag}(s)) $$

Where $P$ is a permutation matrix, $L$ is lower triangular with ones on the diagonal, and $U$ is upper triangular with zeros on the diagonal. One can observe that the log-determinant of this matrix is just $\sum_{i} \log \lvert s_{i} \rvert$, which is very quick to compute.
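A small sketch of this parametrization and the log-determinant identity (my own illustration; in Glow, $P$ is kept fixed after initialization while $L$, $U$ and $s$ are learned):

```python
import torch

d = 4                                               # number of channels (illustrative)
P = torch.eye(d)[torch.randperm(d)]                 # fixed permutation matrix, det = +/-1
L = torch.tril(torch.randn(d, d), diagonal=-1) + torch.eye(d)   # unit lower triangular, det = 1
U = torch.triu(torch.randn(d, d), diagonal=1)       # strictly upper triangular
s = torch.randn(d)                                  # free diagonal entries

# Weight of the invertible 1x1 convolution (applied across channels at each pixel).
W = P @ L @ (U + torch.diag(s))

# |det W| = prod |s_i|, so the log-determinant is just sum_i log|s_i|.
sign, logabsdet = torch.linalg.slogdet(W)
assert torch.allclose(logabsdet, torch.log(torch.abs(s)).sum(), atol=1e-4)
```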

The learnable weights $W$ enter the coupling layer through the conditioner $\beta$:

$$ \begin{pmatrix} y_{A} \\ y_{B} \end{pmatrix} = \begin{pmatrix} h(x_{A}, \beta(x_{B}, W)) \\ x_{B} \end{pmatrix} $$

Applications

SRFlow

SRFlow modifies $\beta$ by conditioning on (an encoding of) the low-resolution image, so the flow models the distribution of possible super-resolved images.

StyleFlow

Other works include StyleFlow, an extension of StyleGAN, where you can interpolate between identities or make other continuous, disentangled modifications. It uses normalizing flows to produce the encoded style weights, conditioned on attributes (e.g. lighting, head pose, etc.).

C-Flow for multi-modal data

The idea here is to condition one flow on another flow, which allows modelling more complex multi-modal data. This section needs further study, because I did not understand it.

Human Mesh Recovery

Examples are… TODO.

Continuous Flows

TODO.

Adjoint Sensitivity

$$ a(t) = \frac{ \partial L }{ \partial \boldsymbol{z}(t) } $$

This adjoint lets us compute the gradients needed to update the parameters of the ODE without backpropagating through the solver's internal steps.
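Concretely, (Chen et al. 2019) show that, for dynamics $\frac{d\boldsymbol{z}(t)}{dt} = f(\boldsymbol{z}(t), t, \theta)$, the adjoint obeys its own ODE, which is solved backwards in time together with the parameter gradients:

$$ \frac{d a(t)}{dt} = -a(t)^{\top} \frac{ \partial f(\boldsymbol{z}(t), t, \theta) }{ \partial \boldsymbol{z} }, \qquad \frac{ dL }{ d\theta } = -\int_{t_{1}}^{t_{0}} a(t)^{\top} \frac{ \partial f(\boldsymbol{z}(t), t, \theta) }{ \partial \theta } \, dt $$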

References

[1] Kingma & Dhariwal “Glow: Generative Flow with Invertible 1x1 Convolutions” arXiv preprint arXiv:1807.03039 2018

[2] Bishop & Bishop “Deep Learning: Foundations and Concepts” Springer International Publishing 2024

[3] Chen et al. “Neural Ordinary Differential Equations” arXiv preprint arXiv:1806.07366 2019