Gaussians are one of the most important probability distributions. They arise naturally in the central limit theorem and have some nice properties that we will briefly present and prove in this note. They also show up all the time in Gaussian Processes and the Expectation Maximization algorithm. If you want to learn this part really well, the best reference is section 2.3 of (Bishop 2006), so go there my friend :)
The Density Function
The single variable Gaussian is as follows:
$$ \mathcal{N}(x; \mu, \sigma^{2}) = \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) $$This generalizes to the multivariate case:
$$ \mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi )^{d/2}\sqrt{ \lvert \Sigma \rvert } } \exp\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right) $$Where $d$ is the dimensionality of $x$.
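As a quick sanity check, here is a minimal sketch (assuming NumPy and SciPy are available; the numbers are arbitrary) that evaluates the multivariate density directly and compares it against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density at a single point x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, -1.0])

print(gaussian_pdf(x, mu, Sigma))             # direct formula
print(multivariate_normal(mu, Sigma).pdf(x))  # SciPy reference, should match
```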
Integral is 1
We now prove that the integral of the Gaussian PDF is 1; this is a requirement for it to be a valid probability density function.
First, let’s prove a famous equality:
$$ I = \int_{-\infty}^{\infty} \exp( - x^{2}) \, dx = \sqrt{ \pi } $$This is kinda surprising, and we need some care to prove it: we square the integral, interpret it as a double integral over the plane, and switch to polar coordinates.
$$ \begin{align} I^{2} & = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp( - x^{2} - y^{2}) \, dx\, dy \\ & = \int_{0}^{2\pi} \int_{0}^{\infty} r\exp( - r^{2}) \, dr\, d\theta \\ & = 2\pi \cdot \left( -\frac{1}{2} \right) \int_{0}^{\infty} -2r\exp( - r^{2}) \, dr \\ & = -\frac{2\pi}{2} \exp(-r^{2}) \bigg\vert_{0}^{\infty} = \pi \\ & \implies I = \sqrt{ \pi } \end{align} $$Now we do the same for $\exp\left( - \frac{(x - \mu)^{2}}{2\sigma^{2}} \right)$: first a change of variables $y = x - \mu$ with $dy = dx$, then another change of variables $z = \frac{y}{\sqrt{ 2 \sigma^{2} }}$ with $dy = \sqrt{ 2\sigma^{2} }\, dz$. This leaves us with the same integral as above, times a constant multiplicative factor, so we have
$$ \int _{-\infty}^{\infty} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) \, dx = \sqrt{ 2\sigma^{2} } \int _{-\infty}^{\infty} \exp(-z^{2})\, dz = \sqrt{ 2\pi \sigma^{2}} $$Which finishes the derivation of the normalizing term.
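A small numerical check of the normalizing constant (a sketch, assuming SciPy's `quad`; $\mu$ and $\sigma$ are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.5, 0.7

# Integrate the unnormalized kernel exp(-(x - mu)^2 / (2 sigma^2)) over the real line.
integral, _ = quad(lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)), -np.inf, np.inf)

print(integral)                          # numerical integral
print(np.sqrt(2 * np.pi * sigma ** 2))   # closed form sqrt(2 pi sigma^2)
```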
Error Function
Sometimes, for example when computing the mean of the folded Gaussian, it is useful to consider the error function. This is defined as
$$ \text{erf}(x) = \frac{2}{\sqrt{ \pi }} \int_{0}^{x} \exp(-t^{2}) \, dt $$Sometimes, using the fact that the integrand is even, this is also written as
$$ \text{erf}(x) = \frac{1}{\sqrt{ \pi }} \int_{-x}^{x} \exp(-t^{2}) \, dt $$We observe that the limit $x \to +\infty$ is 1, and that $x \to -\infty$ is -1.
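A quick numerical illustration of the definition and its limits (a sketch, assuming `scipy.special.erf`; the evaluation points are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

x = 1.3

# erf(x) as 2/sqrt(pi) times the integral of exp(-t^2) from 0 to x.
val, _ = quad(lambda t: np.exp(-t ** 2), 0, x)
print(2 / np.sqrt(np.pi) * val)   # definition
print(erf(x))                     # SciPy reference

print(erf(50.0), erf(-50.0))      # approaches +1 and -1 in the limits
```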
Another useful relation is with the Gaussian CDF:
$$ \text{erf}\left( \frac{x}{\sqrt{ 2 }} \right) = 2\Phi(x) - 1 $$We also note that it is anti-symmetric:
$$ \text{erf}(-x) = -\text{erf}(x) $$
Some properties of Gaussians
The conditional Gaussian
If $X, Y$ are jointly Gaussian, then the conditional distribution $p(X = x \mid Y = y)$ is a Gaussian with the following mean and covariance:
$$ \mu_{X \mid Y = y} = \mu_{X} + \Sigma_{XY} \Sigma_{YY}^{-1}(y - \mu_{Y}) $$And
$$ \Sigma_{X \mid Y} = \Sigma_{XX} - \Sigma_{XY} \Sigma^{-1}_{YY} \Sigma_{YX} $$The proof is presented in section 2.3 of (Bishop 2006).
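Here is a minimal numeric sketch (assuming NumPy/SciPy; scalar $X$ and $Y$, arbitrary numbers) checking these formulas against the ratio $p(x, y)/p(y)$:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Joint Gaussian over (X, Y); here both blocks are scalars.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.5]])

y = 0.3  # value we condition on

# Conditional parameters from the formulas above.
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Compare p(x | y) with p(x, y) / p(y) at a few test points.
for x in [-1.0, 0.5, 2.0]:
    lhs = norm(mu_cond, np.sqrt(var_cond)).pdf(x)
    rhs = multivariate_normal(mu, Sigma).pdf([x, y]) / norm(mu[1], np.sqrt(Sigma[1, 1])).pdf(y)
    print(lhs, rhs)  # should agree
```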
Products of Gaussians are Gaussian 🟥
The product of two Gaussian densities is again Gaussian up to a normalizing constant. This is a little more involved to derive in detail, see this chatgpt response; a small numeric check of the univariate case is sketched below.
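A minimal sketch (assuming NumPy/SciPy; the parameters are arbitrary) of the standard univariate fact: multiplying two Gaussian densities in the same variable gives, after renormalization, a Gaussian whose precision is the sum of the precisions and whose mean is the precision-weighted average of the means.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu1, s1 = 0.0, 1.0
mu2, s2 = 2.0, 0.5

# Precisions add; the new mean is the precision-weighted average of the means.
prec = 1 / s1 ** 2 + 1 / s2 ** 2
var = 1 / prec
mean = var * (mu1 / s1 ** 2 + mu2 / s2 ** 2)

product = lambda x: norm(mu1, s1).pdf(x) * norm(mu2, s2).pdf(x)
Z, _ = quad(product, -np.inf, np.inf)  # normalizing constant of the product

for x in [-1.0, 0.5, 1.5]:
    print(product(x) / Z, norm(mean, np.sqrt(var)).pdf(x))  # should agree
```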
Gaussian distribution and linear models 🟨
There is a close connection between linear models and the Gaussian distribution, linked to the properties of sums and products of Gaussians.
Marginals are Gaussians 🟨
One can prove that any finite-dimensional marginal of a multivariate Gaussian is still a (multivariate) Gaussian.
Let's now write a closed form for this. Assume we have two jointly Gaussian random variables:
$$ \begin{bmatrix} A \\ B \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma) $$Where:
$$ \mu = \begin{bmatrix} \mu_{A} \\ \mu_{B} \end{bmatrix}, \Sigma = \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} $$We want to find $p(A)$ and $p(B \mid A)$. To do this it is useful to recall the following block decomposition of the precision matrix:
$$ \Sigma^{-1} = V = \begin{bmatrix} V_{AA} & V_{AB} \\ V_{BA} & V_{BB} \end{bmatrix} = \begin{bmatrix} I & V_{AB}V_{BB}^{-1} \\ 0 & I \end{bmatrix} \cdot \begin{bmatrix} V_{AA} - V_{AB}V_{BB}^{-1}V_{BA} & 0 \\ 0 & V_{BB} \end{bmatrix} \cdot \begin{bmatrix} I & 0 \\ V_{BB}^{-1}V_{BA} & I \end{bmatrix} $$Then, using $(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$, we get:
$$ \Sigma = V^{-1} = \begin{bmatrix} I & 0 \\ -V_{BB}^{-1}V_{BA} & I \end{bmatrix} \cdot \begin{bmatrix} (V_{AA} - V_{AB}V_{BB}^{-1}V_{BA})^{-1} & 0 \\ 0 & V_{BB}^{-1} \end{bmatrix} \cdot \begin{bmatrix} I & -V_{AB}V_{BB}^{-1} \\ 0 & I \end{bmatrix} = \begin{bmatrix} (V_{AA} - V_{AB}V_{BB}^{-1}V_{BA})^{-1} & -\Sigma_{AA}V_{AB}V_{BB}^{-1} \\ -V_{BB}^{-1}V_{BA}\Sigma_{AA} & V_{BB}^{-1} + V_{BB}^{-1}V_{BA}\Sigma_{AA}V_{AB}V_{BB}^{-1} \end{bmatrix} $$Where in the last step we read off from the top-left block that $\Sigma_{AA} = (V_{AA} - V_{AB}V_{BB}^{-1}V_{BA})^{-1}$; this is exactly what is used for the marginalization result. Applying the same block-inversion identity to $\Sigma$ instead of $V$ gives $V_{BB} = (\Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB})^{-1}$. Matching the top-right blocks also lets us write $V_{AB}$ nicely:
$$ \Sigma_{AB} = -\Sigma_{AA}V_{AB}V_{BB}^{-1} \implies V_{AB} = -\Sigma_{AA}^{-1}\Sigma_{AB}V_{BB} = -\Sigma_{AA}^{-1}\Sigma_{AB}(\Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB})^{-1} $$One can find in this manner that
$$ A \sim \mathcal{N}(\mu_{A}, \Sigma_{AA}) $$and that
$$ B \mid A \sim \mathcal{N}(\mu_{B} - V_{BB}^{-1}V_{BA} (A - \mu_{A}),\ V_{BB}^{-1}) $$If you are a student at ETH, watch [this](https://video.ethz.ch/lectures/d-infk/2024/autumn/263-5210-00L/0d924ef2-af34-4cb1-96cb-ba9a63c1f15b.html) for the derivation, around minute 46. Rewriting with the above expressions for $V_{BB}$ and $V_{BA} = V_{AB}^{T}$ we obtain:
$$ B \mid A \sim \mathcal{N}(\mu_{B} + \Sigma_{BA}\Sigma_{AA}^{-1}(A - \mu_{A}),\ \Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB}) $$Which is a nicer form, but quite long to derive. A small numeric check of both forms is sketched below.
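Here is a minimal sketch (NumPy only; the covariance and conditioning value are arbitrary) checking that the precision-matrix form $B \mid A \sim \mathcal{N}(\mu_B - V_{BB}^{-1}V_{BA}(A - \mu_A),\ V_{BB}^{-1})$ agrees with the covariance form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 3D joint covariance, split into a 1D block A and a 2D block B.
L = rng.normal(size=(3, 3))
Sigma = L @ L.T + 3 * np.eye(3)
mu = rng.normal(size=3)

iA, iB = [0], [1, 2]
S_AA, S_AB = Sigma[np.ix_(iA, iA)], Sigma[np.ix_(iA, iB)]
S_BA, S_BB = Sigma[np.ix_(iB, iA)], Sigma[np.ix_(iB, iB)]

V = np.linalg.inv(Sigma)
V_BB, V_BA = V[np.ix_(iB, iB)], V[np.ix_(iB, iA)]

a = np.array([0.7])  # value of A we condition on

# Covariance form.
mean_cov = mu[iB] + S_BA @ np.linalg.solve(S_AA, a - mu[iA])
cov_cov = S_BB - S_BA @ np.linalg.solve(S_AA, S_AB)

# Precision form.
mean_prec = mu[iB] - np.linalg.solve(V_BB, V_BA @ (a - mu[iA]))
cov_prec = np.linalg.inv(V_BB)

print(np.allclose(mean_cov, mean_prec), np.allclose(cov_cov, cov_prec))  # True True
```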
Gaussian characteristic function 🟩–
Characteristic functions are sometimes useful to prove that two distributions are equal. One can prove that the characteristic function of a Gaussian is
$$ \mathbb{E}[\exp(it^{T}X)] = \exp\left( i t^{T}\mu - \frac{1}{2} t^{T} \Sigma t \right) $$Let's prove the univariate case; we will see that it is exactly this value. We need to compute:
$$ \mathbb{E}[\exp(itX)] = \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} + itx \right) \, dx $$The idea is to complete the square and then, knowing the value of the integral of the completed square, simplify.
$$ \begin{align} \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} + itx \right) \, dx & = \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - (\mu + \sigma^{2}it))^{2} - 2\mu it\sigma^{2} + \sigma^{4}t^{2} }{2\sigma^{2}} \right) \, dx \\ & = \exp\left( \mu it - \frac{1}{2} \sigma^{2}t^{2} \right) \cdot \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - (\mu + \sigma^{2}it))^{2} }{2\sigma^{2}} \right) \, dx \\ & = \exp\left( \mu it - \frac{1}{2} \sigma^{2}t^{2} \right) \end{align} $$Where the last integral is 1, since it is a (shifted) normalized Gaussian density.
Sums of Gaussians are Gaussian 🟩
This is easy to prove. If we have independent $X \sim \mathcal{N}(\mu_{X}, \Sigma_{X})$ and a compatible $Y \sim \mathcal{N}(\mu_{Y}, \Sigma_{Y})$, then $X + Y \sim \mathcal{N} (\mu_{X} + \mu_{Y}, \Sigma_{X} + \Sigma_{Y})$. The proof uses characteristic functions, along the lines of linear Gaussians.
$$ \begin{align} \mathbb{E}[\exp(it^{T} (X + Y))] &= \mathbb{E}[\exp(it^{T}X)]\,\mathbb{E}[\exp(it^{T}Y)] & \text{by independence} \\ &= \exp\left( i t^{T}\mu_{X} - \frac{1}{2} t^{T} \Sigma_{X} t \right) \exp\left( i t^{T}\mu_{Y} - \frac{1}{2} t^{T} \Sigma_{Y} t \right) \\ &= \exp\left( i t^{T}(\mu_{X} + \mu_{Y}) - \frac{1}{2} t^{T} (\Sigma_{X} + \Sigma_{Y}) t \right) \end{align} $$Which finishes the proof, since this is the characteristic function of $\mathcal{N}(\mu_{X} + \mu_{Y}, \Sigma_{X} + \Sigma_{Y})$. One can also extend this result to every linear combination of independent Gaussians.
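A quick Monte Carlo sanity check of the sum rule (a sketch, NumPy only; the parameters and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

mu_X, Sigma_X = np.array([1.0, 0.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
mu_Y, Sigma_Y = np.array([-2.0, 3.0]), np.array([[0.8, -0.1], [-0.1, 1.2]])

# Draw independent samples of X and Y and look at their sum.
X = rng.multivariate_normal(mu_X, Sigma_X, size=n)
Y = rng.multivariate_normal(mu_Y, Sigma_Y, size=n)
Z = X + Y

print(Z.mean(axis=0), mu_X + mu_Y)     # empirical vs predicted mean
print(np.cov(Z.T), Sigma_X + Sigma_Y)  # empirical vs predicted covariance
```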
Properties to remember 🟩
- Compact representation of high-dimensional joint distributions: instead of the $2^{n}$ parameters needed for a general joint over $n$ binary variables, we only need $O(n^{2})$ (a mean and a covariance matrix); this is one reason Gaussian Processes are analytically handy.
- Closed-form inference (think of the fact that Gaussians are conjugate to themselves; this is because Gaussians are in The Exponential Family.)
Information theoretic properties
Entropy of a Gaussian distribution 🟩
We compute here the Entropy of a Univariate Gaussian distribution $\mathcal{N}(x; \mu, \sigma^{2})$. So we need to compute the following value:
$$ \begin{align} \int p(x) \log \frac{1}{p(x)} \, dx &= -\int \frac{1}{\sqrt{ 2\pi \sigma^{2} }} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) \cdot \left( -\frac{1}{2} \log(2\pi \sigma^{2}) -\frac{1}{2\sigma^{2}}(x - \mu)^{2} \right) \, dx \\ &= \frac{1}{2}\log(2\pi \sigma^{2}) +\frac{1}{2\sigma^{2}} \mathbb{E}_{x} [(x - \mu)^{2}] \\ &= \frac{1}{2}\log(2\pi \sigma^{2}) + \frac{1}{2} \\ &= \frac{1}{2} \log(2\pi \sigma^{2}e) \end{align} $$Building on this, one can show that Gaussians are the maximum-entropy distributions for a given mean and variance. See Maximum Entropy Principle.
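A one-line numerical check of this closed form (a sketch, assuming SciPy; `norm.entropy` returns the differential entropy in nats):

```python
import numpy as np
from scipy.stats import norm

sigma = 0.8
print(norm(0.0, sigma).entropy())                    # SciPy's differential entropy
print(0.5 * np.log(2 * np.pi * np.e * sigma ** 2))   # closed form above
```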
We can extend this to the multivariate case, observing the following:
$$ \begin{align} \mathbb{E}_{x \sim p}[-\log p(x)] & = \mathbb{E}_{x \sim p}\left[ \frac{d}{2} \log(2\pi ) + \frac{1}{2}\log \det \Sigma + \frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu) \right] \\ & = \frac{d}{2} \log(2\pi ) + \frac{1}{2}\log \det \Sigma + \frac{1}{2} \mathbb{E}_{x \sim p}[(x - \mu)^{T}\Sigma^{-1}(x - \mu)] \\ & =\frac{d}{2}(1 + \log(2\pi)) + \frac{1}{2}\log \det \Sigma \\ &= \frac{d}{2} \log(2\pi e) + \frac{1}{2}\log \det \Sigma \\ &= \frac{1}{2} \log((2\pi e)^{d} \lvert \Sigma \rvert ) \end{align} $$Where, to evaluate the expectation of the quadratic form, we used this equality:
$$ \begin{align} \mathbb{E}_{x \sim p}[(x - \mu)^{T}\Sigma^{-1}(x - \mu)] & = \mathbb{E}_{x \sim p}[\text{tr}((x - \mu)^{T}\Sigma^{-1}(x - \mu))] & \text{ trace of a scalar} \\ & = \mathbb{E}_{x \sim p}[\text{tr} (\Sigma^{-1}(x - \mu)(x - \mu)^{T})] & \text{ cyclic property, eq. 16 Matrix Cookbook} \\ & = \text{tr}(\mathbb{E}_{x \sim p}[\Sigma^{-1}(x - \mu)(x - \mu)^{T}]) & \text{ linearity of trace} \\ & = \text{tr}(\Sigma^{-1} \mathbb{E}_{x \sim p}[(x - \mu)(x - \mu)^{T}]) & \text{ linearity of expectation} \\ & = \text{tr}(\Sigma^{-1} \Sigma) & \text{ definition of covariance} \\ & = d & \text{ trace of identity matrix} \end{align} $$
The Matrix Cookbook mentioned above is this resource.
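And a Monte Carlo check of the multivariate formula (a sketch, assuming NumPy/SciPy; the covariance and sample size are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
d = 3
L = rng.normal(size=(d, d))
Sigma = L @ L.T + np.eye(d)
mu = np.zeros(d)

dist = multivariate_normal(mu, Sigma)
samples = dist.rvs(size=100_000, random_state=3)

print(-dist.logpdf(samples).mean())                                   # Monte Carlo estimate of E[-log p(x)]
print(0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma)))   # closed form
```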
Mutual information of Gaussians
Suppose we have a Gaussian $X \sim \mathcal{N}(\mu, \Sigma)$ and a noisy observation $Y = X + \varepsilon$, with $\varepsilon\sim \mathcal{N}(0, \sigma^{2}_{n}I)$ independent of $X$.
Using $I(X; Y) = H(Y) - H(Y \mid X)$, where $Y \sim \mathcal{N}(\mu, \Sigma + \sigma^{2}_{n}I)$ and $Y \mid X \sim \mathcal{N}(X, \sigma^{2}_{n}I)$, together with the Gaussian entropy formula above, we see that this is equal to:
$$ I(X; Y) = \frac{1}{2} \log\lvert I + \sigma^{-2}_{n}\Sigma \rvert $$
General KL divergence between Gaussians
The KL divergence between two Gaussians $p = \mathcal{N}(\mu_{1}, \Sigma_{1})$ and $q = \mathcal{N}(\mu_{2}, \Sigma_{2})$, both of dimension $d$, is given by:
$$ KL(p \mid \mid q) = \frac{1}{2} \left( \log \frac{\lvert \Sigma_{2} \rvert}{\lvert \Sigma_{1} \rvert} - d + \text{tr}(\Sigma_{2}^{-1}\Sigma_{1}) + (\mu_{2} - \mu_{1})^{T}\Sigma_{2}^{-1}(\mu_{2} - \mu_{1}) \right) $$This is a good resource for a proof.
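A Monte Carlo sanity check of this formula (a sketch, assuming SciPy; the parameters and sample size are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, Sigma1 = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
mu2, Sigma2 = np.array([1.0, -1.0]), np.array([[1.5, -0.2], [-0.2, 1.0]])
d = 2

p, q = multivariate_normal(mu1, Sigma1), multivariate_normal(mu2, Sigma2)

# Closed form.
diff = mu2 - mu1
kl = 0.5 * (np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)) - d
            + np.trace(np.linalg.solve(Sigma2, Sigma1))
            + diff @ np.linalg.solve(Sigma2, diff))

# Monte Carlo estimate of E_p[log p(x) - log q(x)].
x = p.rvs(size=200_000, random_state=4)
kl_mc = (p.logpdf(x) - q.logpdf(x)).mean()

print(kl, kl_mc)  # should be close
```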
Forward KL
One can prove that the forward KL divergence between $p = \mathcal{N}(\mu, \operatorname{diag}\left\{ \sigma^{2}_{1}, \dots, \sigma^{2}_{d} \right\})$ and $q = \mathcal{N}(0, I)$ is given by:
$$ KL(p \mid \mid q) = \frac{1}{2} \sum_{i = 1}^{d} \left( \sigma_{i}^{2} + \mu_{i}^{2} - \log \sigma_{i}^{2} - 1 \right) $$Let's interpret this. The $\mu_{i}^{2}$ term pulls the mean toward zero. The $\sigma_{i}^{2}$ term penalizes large variances, while the $-\log \sigma_{i}^{2}$ term penalizes small ones. This forward KL is what is used for (variational) Autoencoders.
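A quick check (NumPy only, arbitrary numbers) that this per-dimension expression matches the general formula above with $\Sigma_{1} = \operatorname{diag}(\sigma^{2})$, $\mu_{2} = 0$, $\Sigma_{2} = I$:

```python
import numpy as np

mu = np.array([0.5, -1.0, 2.0])
sigma2 = np.array([0.3, 1.5, 0.9])  # per-dimension variances
d = len(mu)

# Per-dimension closed form.
kl_diag = 0.5 * np.sum(sigma2 + mu ** 2 - np.log(sigma2) - 1)

# General formula with Sigma_1 = diag(sigma2), mu_2 = 0, Sigma_2 = I.
Sigma1 = np.diag(sigma2)
kl_general = 0.5 * (np.log(1.0 / np.linalg.det(Sigma1)) - d
                    + np.trace(Sigma1) + mu @ mu)

print(kl_diag, kl_general)  # identical up to floating point
```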
Reverse KL
Given the same assumptions, the reverse KL of $q$ with respect to $p$ is given by:
$$ KL(q \mid \mid p) = \frac{1}{2} \sum_{i = 1}^{d} \left( \frac{\mu_{i}^{2}}{\sigma_{i}^{2}} + \sigma_{i}^{-2} + \log \sigma_{i}^{2} - 1 \right) $$
References
[1] Bishop, C. M. *Pattern Recognition and Machine Learning*. Springer, 2006.