Gaussians are one of the most important families of probability distributions. They arise naturally from the central limit theorem and have some nice properties that we will briefly present and prove in this note. They are also central to Gaussian Processes and Clustering algorithms, and they have something to say about the Maximum Entropy Principle. The best resource if you want to learn this part well is section 2.3 of (Bishop 2006), so go there my friend :)

The Density Function

The univariate Gaussian density is:

$$ \mathcal{N}(x; \mu, \sigma^{2}) = \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) $$

This can be generalized to the multivariate case:

$$ \mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi )^{d/2}\sqrt{ \lvert \Sigma \rvert } } \exp\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right) $$

Where $d$ is the dimensionality of $x$, $\mu$ is the mean vector and $\Sigma$ is the covariance matrix.
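As a quick illustration, the two densities above can be evaluated directly with NumPy. This is only a minimal sketch: the function names `gaussian_pdf` and `multivariate_gaussian_pdf` are made up for this note, `sigma` is assumed to be a standard deviation, and `Sigma` is assumed to be symmetric positive definite.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (np.sqrt(2.0 * np.pi) * sigma)

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density at a single point x of shape (d,)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Solve Sigma @ y = diff instead of forming the inverse explicitly.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# With a diagonal covariance the joint density factorizes into univariate ones.
p_joint = multivariate_gaussian_pdf([0.5, 0.0], [0.0, 1.0], [[1.0, 0.0], [0.0, 4.0]])
p_factorized = gaussian_pdf(0.5, 0.0, 1.0) * gaussian_pdf(0.0, 1.0, 2.0)
```

The last two lines use the fact that a diagonal covariance makes the coordinates independent, so `p_joint` and `p_factorized` should agree up to floating point error.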

Integral is 1 🟥++

We now prove that the integral of the Gaussian PDF is 1; this is a requirement for it to be a probability density function.

First, let’s prove a famous equality:

$$ I = \int_{-\infty}^{\infty} \exp( - x^{2}) \, dx = \sqrt{ \pi } $$

This is kinda surprising, we need some care to prove it:

$$ \begin{align} I^{2} & = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp( - x^{2} - y^{2}) \, dx \, dy \\ & = \int_{0}^{2\pi} \int_{0}^{\infty} r\exp( - r^{2}) \, dr \, d\theta \\ & = 2\pi \cdot \left( -\frac{1}{2} \right) \int_{0}^{\infty} -2r\exp( - r^{2}) \, dr \\ & = -\frac{2\pi}{2} \exp(-r^{2}) \bigg\vert_{0}^{\infty} = \pi \\ & \implies I = \sqrt{ \pi } \end{align} $$

Note that in the second step we changed variables to polar coordinates, where $dx \, dy = r \, dr \, d\theta$. If we do the same derivation with $\exp\left( - \frac{(x - \mu)^{2}}{2\sigma^{2}} \right)$, first doing a change of variables $y = x - \mu$, where $dy = dx$, and then another change of variables $z = \frac{y}{\sqrt{ 2 \sigma^{2} }}$, where $\sqrt{ 2\sigma^{2} } \, dz = dy$, we get the same integral times a constant multiplicative term. So we have

$$ \int _{-\infty}^{\infty} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) \, dx = \sqrt{ 2\sigma^{2} } \int _{-\infty}^{\infty} \exp(-z^{2})\, dz = \sqrt{ 2\pi \sigma^{2}} $$

Which finishes the derivation of the normalizing term.
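The normalizing constant can also be sanity-checked numerically. The sketch below approximates the integral with a plain Riemann sum on a wide grid; the helper name `gaussian_integral`, the grid width and the test values of $\mu$ and $\sigma$ are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_integral(mu, sigma, half_width=10.0, n=200_001):
    """Riemann-sum approximation of the integral of exp(-(x - mu)^2 / (2 sigma^2))."""
    # The integrand is negligible beyond 10 standard deviations from the mean.
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    dx = x[1] - x[0]
    return np.sum(np.exp(-((x - mu) ** 2) / (2.0 * sigma**2))) * dx

approx = gaussian_integral(mu=1.5, sigma=0.7)
exact = np.sqrt(2.0 * np.pi * 0.7**2)

# Special case sigma^2 = 1/2 recovers the integral of exp(-x^2), i.e. sqrt(pi).
approx_sqrt_pi = gaussian_integral(mu=0.0, sigma=1.0 / np.sqrt(2.0))
```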

Error Function 🟨–

Sometimes, for example when calculating the mean of the folded Gaussian, it is useful to consider the error function. This is defined as

$$ \text{erf}(x) = \frac{2}{\sqrt{ \pi }} \int_{0}^{x} \exp(-t^{2}) \, dt $$

Sometimes this is also written, using the symmetry of the integrand around $t = 0$, as

$$ \text{erf}(x) = \frac{1}{\sqrt{ \pi }} \int_{-x}^{x} \exp(-t^{2}) \, dt $$

We observe that the limit $x \to +\infty$ is 1, and that $x \to -\infty$ is -1.

Another useful relation is with the Gaussian CDF:

$$ \text{erf}\left( \frac{x}{\sqrt{ 2 }} \right) = 2\Phi(x) - 1 $$

We also note that it is anti-symmetric:

$$ \text{erf}(-x) = -\text{erf}(x) $$
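The relations above are easy to check with the Python standard library, which ships both `math.erf` and `statistics.NormalDist`. A small sketch, where the helper name `phi` is just a label chosen here for the standard normal CDF $\Phi$:

```python
import math
from statistics import NormalDist

def phi(x):
    """Standard normal CDF written through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Rearranged relation: Phi(x) = (1 + erf(x / sqrt(2))) / 2.
cdf_via_erf = phi(1.3)
cdf_reference = NormalDist().cdf(1.3)

# erf(x / sqrt(2)) is the probability that a standard normal falls in [-x, x];
# for x = 1 this is the familiar "one sigma" mass of about 0.6827.
one_sigma_mass = math.erf(1.0 / math.sqrt(2.0))
```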

Some properties of Gaussians

The conditional Gaussian

If we have $X,Y$ which are jointly Gaussian, then the distribution $p(X = x \mid Y = y)$ is a Gaussian with the following mean and covariance:

$$ \mu_{X \mid Y = y} = \mu_{X} + \Sigma_{XY} \Sigma_{YY}^{-1}(y - \mu_{Y}) $$

And

$$ \Sigma_{X \mid Y} = \Sigma_{XX} - \Sigma_{XY} \Sigma^{-1}_{YY} \Sigma_{YX} $$

The proof is presented in section 2.3 of (Bishop 2006)
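To make the two formulas concrete, here is a minimal NumPy sketch of conditioning. The function name `condition_gaussian` and the toy numbers are assumptions made for illustration; the code assumes $\Sigma_{YY}$ is symmetric and invertible.

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    """Mean and covariance of X | Y = y for jointly Gaussian (X, Y)."""
    # K = S_xy S_yy^{-1}, computed with a linear solve (S_yy is symmetric).
    K = np.linalg.solve(S_yy, S_xy.T).T
    mean = mu_x + K @ (y - mu_y)
    cov = S_xx - K @ S_xy.T  # S_yx = S_xy^T
    return mean, cov

# Toy bivariate example: Var(X) = 2, Var(Y) = 1, Cov(X, Y) = 0.8.
mean, cov = condition_gaussian(
    mu_x=np.array([1.0]),
    mu_y=np.array([-1.0]),
    S_xx=np.array([[2.0]]),
    S_xy=np.array([[0.8]]),
    S_yy=np.array([[1.0]]),
    y=np.array([0.5]),
)
# Here mean = 1 + 0.8 * (0.5 + 1) = 2.2 and cov = 2 - 0.8^2 = 1.36.
```

Note that the conditional covariance does not depend on the observed value $y$, only the mean does.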

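Products of Gaussians are Gaussian 🟨++

This is a little more difficult to detail, see this chatgpt response. The product of two Gaussian densities is just an unnormalized Gaussian: it is proportional to a Gaussian density, but it does not integrate to 1.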
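For the univariate case the standard result is that precisions add: the product of $\mathcal{N}(x; \mu_{1}, \sigma_{1}^{2})$ and $\mathcal{N}(x; \mu_{2}, \sigma_{2}^{2})$ is proportional to a Gaussian with variance $(\sigma_{1}^{-2} + \sigma_{2}^{-2})^{-1}$ and a precision-weighted mean. The sketch below checks numerically that the ratio between the product and that Gaussian is constant in $x$; the function names and the test values are illustrative choices, not something from the original note.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density parameterized by mean and variance."""
    return np.exp(-((x - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def product_params(mu1, var1, mu2, var2):
    """Mean and variance of the Gaussian proportional to N(x; mu1, var1) * N(x; mu2, var2)."""
    var = 1.0 / (1.0 / var1 + 1.0 / var2)   # precisions add
    mu = var * (mu1 / var1 + mu2 / var2)    # precision-weighted mean
    return mu, var

mu1, var1, mu2, var2 = 0.0, 1.0, 2.0, 4.0
mu, var = product_params(mu1, var1, mu2, var2)

x = np.linspace(-3.0, 5.0, 9)
product = gaussian_pdf(x, mu1, var1) * gaussian_pdf(x, mu2, var2)
# If the product is an unnormalized Gaussian, this ratio is the same for every x.
ratio = product / gaussian_pdf(x, mu, var)
```

The constant in `ratio` is the missing normalizer; it equals the density of $\mathcal{N}(\mu_{1}; \mu_{2}, \sigma_{1}^{2} + \sigma_{2}^{2})$.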

Marginals are Gaussians 🟨

One can prove that any finite marginals of Gaussians are still multivariate Gaussians.

Let’s now write a closed form for this. Let’s assume we have these two jointly Gaussian random variables:

$$ \begin{bmatrix} A \\ B \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma) $$

Where:

$$ \mu = \begin{bmatrix} \mu_{A} \\ \mu_{B} \end{bmatrix}, \Sigma = \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} $$

We want to find $p(A)$ and $p(B \mid A)$. To derive them it is useful to remember the following factorization of the precision matrix:

$$ \Sigma^{-1} = V = \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix} = \begin{bmatrix} I & V_{12}V_{22}^{-1} \\ 0 & I \end{bmatrix} \cdot \begin{bmatrix} V_{11} - V_{12}V_{22}^{-1}V_{21} & 0 \\ 0 & V_{22} \end{bmatrix} \cdot \begin{bmatrix} I & 0 \\ V_{22}^{-1}V_{21} & I \end{bmatrix} $$

Then, using $(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$, the inverse is equal to:

$$ \Sigma = V^{-1} = \begin{bmatrix} I & 0 \\ -V_{22}^{-1}V_{21} & I \end{bmatrix} \cdot \begin{bmatrix} (V_{11} - V_{12}V_{22}^{-1}V_{21})^{-1} & 0 \\ 0 & V_{22}^{-1} \end{bmatrix} \cdot \begin{bmatrix} I & -V_{12}V_{22}^{-1} \\ 0 & I \end{bmatrix} = \begin{bmatrix} (V_{11} - V_{12}V_{22}^{-1}V_{21})^{-1} & -\Sigma_{11}V_{12}V_{22}^{-1} \\ -V_{22}^{-1}V_{21}\Sigma_{11} & V_{22}^{-1} + V_{22}^{-1}V_{21}\Sigma_{11}V_{12}V_{22}^{-1} \end{bmatrix} $$

Here the indices 1 and 2 refer to the blocks $A$ and $B$, and reading off the top-left block gives $\Sigma_{11} = (V_{11} - V_{12}V_{22}^{-1}V_{21})^{-1}$. By the same argument with the roles of $\Sigma$ and $V$ (and of the two blocks) exchanged, one can note that $V_{22} = (\Sigma_{22} - \Sigma_{21}\Sigma^{-1}_{11}\Sigma_{12})^{-1}$. This allows us to write $V_{12}$ nicely as

$$ \Sigma_{12} = -\Sigma_{11}V_{12}V_{22}^{-1} \implies V_{12} = -\Sigma_{11}^{-1}\Sigma_{12} V_{22} = -\Sigma_{11}^{-1}\Sigma_{12} (\Sigma_{22} - \Sigma_{21}\Sigma^{-1}_{11}\Sigma_{12})^{-1} $$

The identity $\Sigma_{11} = (V_{11} - V_{12}V_{22}^{-1}V_{21})^{-1}$ is what is used for the marginalization calculation. One can find in this manner that

$$ A \sim \mathcal{N}(\mu_{A}, \Sigma_{AA}) $$

And that

$$ B \mid A \sim \mathcal{N}(\mu_{B} - V_{BB}^{-1}V_{BA} (A - \mu_{A}), V_{BB}^{-1}) $$

If you are a student at ETH watch [this](https://video.ethz.ch/lectures/d-infk/2024/autumn/263-5210-00L/0d924ef2-af34-4cb1-96cb-ba9a63c1f15b.html) for the derivation, minute 46. Rewriting with the above properties for $V_{BB}$ and $V_{BA}$ we obtain:

$$ B \mid A \sim \mathcal{N}(\mu_{B} + \Sigma_{21}\Sigma_{11}^{-1}(A - \mu_{A}),\Sigma_{22} - \Sigma_{21}\Sigma^{-1}_{11}\Sigma_{12}) $$

Which is an OK form, but very long to derive.
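Since the block algebra is easy to get wrong, it can help to check the key identities numerically on a random covariance matrix. The sketch below is only such a sanity check, assuming $A$ is the first two coordinates and $B$ the remaining three; all variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
Sigma = M @ M.T + 5.0 * np.eye(5)   # random symmetric positive definite covariance
V = np.linalg.inv(Sigma)            # precision matrix

a = 2  # A = first 2 coordinates, B = remaining 3
S_AA, S_AB = Sigma[:a, :a], Sigma[:a, a:]
S_BA, S_BB = Sigma[a:, :a], Sigma[a:, a:]
V_AA, V_AB = V[:a, :a], V[:a, a:]
V_BA, V_BB = V[a:, :a], V[a:, a:]

# Marginal covariance of A, recovered from the precision blocks.
marg_cov_from_precision = np.linalg.inv(V_AA - V_AB @ np.linalg.solve(V_BB, V_BA))

# Conditional covariance of B | A: inverse of V_BB versus the Schur complement.
cond_cov_from_precision = np.linalg.inv(V_BB)
schur = S_BB - S_BA @ np.linalg.solve(S_AA, S_AB)

# Matrix multiplying (A - mu_A) in the conditional mean, in both parameterizations.
gain_from_precision = -np.linalg.solve(V_BB, V_BA)
gain_from_cov = S_BA @ np.linalg.inv(S_AA)
```

All three pairs should agree up to floating point error, which is exactly the equivalence between the precision form and the covariance form of $B \mid A$.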

Gaussian characteristic function 🟨++

Characteristic functions are sometimes useful to prove that two distributions are the same as each other. One can prove that the characteristic function for Gaussians is

$$ \mathbb{E}[\exp(it^{T}X)] = \exp\left( i t^{T}\mu - \frac{1}{2} t^{T} \Sigma t \right) $$

Let’s prove the uni-variate case, we will see that it will be exactly this value. We need to compute the value:

$$ \mathbb{E}[\exp(itX)] = \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} + itx \right) \, dx $$

The idea is to complete the square and then, knowing the value of the integral of the completed square, to simplify.

$$ \begin{align} \\ \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} + itx \right) \, dx \\ = \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - (\mu + \sigma^{2}it))^{2} - 2\mu it\sigma^{2} + \sigma^{4}t^{2} }{2\sigma^{2}} \right) \, dx \\ = \exp\left( \mu it - \frac{1}{2} \sigma^{2}t^{2} \right) \cdot \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - (\mu + \sigma^{2}it))^{2} }{2\sigma^{2}} \right) \, dx \\ = \exp\left( \mu it - \frac{1}{2} \sigma^{2}t^{2} \right) \end{align} $$
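The closed form can be compared against a direct numerical evaluation of the defining integral. This is a rough sketch using a Riemann sum with complex arithmetic; the function names and the values of $t$, $\mu$, $\sigma$ are arbitrary and only meant as a check.

```python
import numpy as np

def char_fn_numeric(t, mu, sigma, half_width=12.0, n=400_001):
    """Riemann-sum approximation of E[exp(i t X)] for X ~ N(mu, sigma^2)."""
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    dx = x[1] - x[0]
    pdf = np.exp(-((x - mu) ** 2) / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return np.sum(pdf * np.exp(1j * t * x)) * dx

def char_fn_closed(t, mu, sigma):
    """Closed form exp(i mu t - sigma^2 t^2 / 2) from the derivation above."""
    return np.exp(1j * mu * t - 0.5 * sigma**2 * t**2)

numeric = char_fn_numeric(t=0.7, mu=1.0, sigma=2.0)
closed = char_fn_closed(t=0.7, mu=1.0, sigma=2.0)
```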

Sums of Gaussians are Gaussian 🟩

This is easily provable: if we have $X \sim \mathcal{N}(\mu_{X}, \Sigma_{X})$ and an independent $Y \sim \mathcal{N}(\mu_{Y}, \Sigma_{Y})$ of the same dimension, then $X + Y \sim \mathcal{N} (\mu_{X} + \mu_{Y}, \Sigma_{X} + \Sigma_{Y})$. The proof uses characteristic functions; independence is what lets the expectation factorize in the first step.

$$ \begin{align} \mathbb{E}[\exp(it^{T} (X + Y))] &= \mathbb{E}[\exp(it^{T}X)]\mathbb{E}[\exp(it^{T}Y)] \\ &= \exp\left( i t^{T}\mu_{X} - \frac{1}{2} t^{T} \Sigma_{X} t \right) \exp\left( i t^{T}\mu_{Y} - \frac{1}{2} t^{T} \Sigma_{Y} t \right) \\ &= \exp\left( i t^{T}(\mu_{X} + \mu_{Y}) - \frac{1}{2} t^{T} (\Sigma_{X} + \Sigma_{Y}) t \right) \end{align} $$

Which finishes the proof. One can also extend this result to every linear combination of independent Gaussians.
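An empirical check of this result is straightforward: sample two independent Gaussians, add them, and compare the sample mean and covariance of the sum with $\mu_{X} + \mu_{Y}$ and $\Sigma_{X} + \Sigma_{Y}$. The parameters and the seed below are arbitrary; matching moments do not prove Gaussianity by themselves, so this is only a sanity check of the stated parameters.

```python
import numpy as np

rng = np.random.default_rng(42)

mu_x = np.array([1.0, -1.0])
Sigma_x = np.array([[1.0, 0.3], [0.3, 0.5]])
mu_y = np.array([0.5, 2.0])
Sigma_y = np.array([[2.0, -0.4], [-0.4, 1.0]])

n = 200_000
# Two separate draws from the generator, so X and Y are independent.
x = rng.multivariate_normal(mu_x, Sigma_x, size=n)
y = rng.multivariate_normal(mu_y, Sigma_y, size=n)
s = x + y

empirical_mean = s.mean(axis=0)
empirical_cov = np.cov(s, rowvar=False)
```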

Properties to remember 🟩

  • Compact representation of high dimensional joint distributions: instead of the $2^{n}$ parameters needed, for example, by a general joint over $n$ binary variables, we just need on the order of $n^{2}$ (the mean and the covariance matrix). This is why Gaussian Processes are analytically handy.
  • Closed form inference (I think of self-conjugacy here; this is because Gaussians are in The Exponential Family.)

Confidence Intervals

Gaussians are a nice distribution, and we have listed many of their properties by now. One of the most heavily used features is the ease of computing $1 - \alpha$ confidence intervals, where $\alpha$ is called the significance level: we want to find an interval that contains the quantity we are estimating with probability $1 - \alpha$. This is usually easy to compute with $z$ tables. The standard error of the mean of $n$ independent samples with standard deviation $\sigma$ is $\frac{\sigma}{\sqrt{ n }}$, which is the square root of the variance of the sample mean.

So after we have computed these values, the confidence interval is just

$$ \bar{x} \pm z \cdot SE $$

Where $\bar{x}$ is the sample mean and $z$ is the $1 - \alpha / 2$ quantile of the standard Gaussian (for example $z \approx 1.96$ for $\alpha = 0.05$).
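Instead of a $z$ table one can use the standard library. The sketch below computes a two-sided interval for a mean, assuming the standard deviation `sigma` is known and the samples are independent; the function name `mean_confidence_interval` and the sample values are made up for illustration.

```python
import math
from statistics import NormalDist, fmean

def mean_confidence_interval(samples, sigma, alpha=0.05):
    """Two-sided (1 - alpha) z-interval for the mean, with known sigma."""
    n = len(samples)
    x_bar = fmean(samples)
    se = sigma / math.sqrt(n)                      # standard error of the mean
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)    # replaces the z-table lookup
    return x_bar - z * se, x_bar + z * se

samples = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0]
low, high = mean_confidence_interval(samples, sigma=0.3, alpha=0.05)
# Sample mean is 5.0 and SE is 0.3 / sqrt(9) = 0.1, so the interval is roughly 5.0 +/- 0.196.
```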

Information theoretic properties

Entropy of a Gaussian distribution 🟩

We compute here the Entropy of a Univariate Gaussian distribution $\mathcal{N}(x; \mu, \sigma^{2})$. So we need to compute the following value:

$$ \begin{align} \int p(x) \log \frac{1}{p(x)} \, dx &= -\int \frac{1}{\sqrt{ 2\pi \sigma^{2} }} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) \cdot \left( -\frac{1}{2} \log(2\pi \sigma^{2}) -\frac{1}{2\sigma^{2}}(x - \mu)^{2} \right) \, dx \\ &= \frac{1}{2}\log(2\pi \sigma^{2}) +\frac{1}{2\sigma^{2}} \mathbb{E}_{x} [(x - \mu)^{2}] \\ &= \frac{1}{2}\log(2\pi \sigma^{2}) + \frac{1}{2} \\ &= \frac{1}{2} \log(2\pi \sigma^{2}e) \end{align} $$

With just the above proof one can prove that Gaussians are the distributions with maximum entropy for a given mean and variance. See Maximum Entropy Principle.

We can extend this to the multivariate case, observing the following:

$$ \begin{align} \mathbb{E}_{x \sim p}[-\log p(x)] & = \mathbb{E}_{x \sim p}\left[ \frac{d}{2} \log(2\pi ) + \frac{1}{2} \log \det \Sigma + \frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu) \right] \\ & = \frac{d}{2} \log(2\pi ) + \frac{1}{2} \log \det \Sigma + \frac{1}{2} \mathbb{E}_{x \sim p}[(x - \mu)^{T}\Sigma^{-1}(x - \mu)] \\ & =\frac{d}{2}(1 + \log(2\pi)) + \frac{1}{2} \log \det \Sigma \\ &= \frac{d}{2} \log(2\pi e) + \frac{1}{2} \log \det \Sigma \\ &= \frac{1}{2} \log((2\pi e)^{d} \lvert \Sigma \rvert ) \end{align} $$

Where we used this equality:

$$ \begin{align} \mathbb{E}_{x \sim p}[(x - \mu)^{T}\Sigma^{-1}(x - \mu)] & = \mathbb{E}_{x \sim p}[\text{tr}((x - \mu)^{T}\Sigma^{-1}(x - \mu))] & \text{ trace of real number} \\ & = \mathbb{E}_{x \sim p}[\text{tr} (\Sigma^{-1}(x - \mu)(x - \mu)^{T})] & \text{ eq. 16 Matrix Cookbook} \\ & = \text{tr}(\mathbb{E}_{x \sim p}[\Sigma^{-1}(x - \mu)(x - \mu)^{T}]) & \text{ linearity of trace} \\ & = \text{tr}(\Sigma^{-1} \mathbb{E}_{x \sim p}[(x - \mu)(x - \mu)^{T}]) & \text{ linearity of expectation} \\ & = \text{tr}(\Sigma^{-1} \Sigma) & \text{ definition of covariance} \\ & = d & \text{ trace of identity matrix} \end{align} $$

The Matrix Cookbook refers to this resource.
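Both the trace identity and the entropy formula $\frac{1}{2} \log((2\pi e)^{d} \lvert \Sigma \rvert)$ can be checked by Monte Carlo. The sketch below samples from a random 3-dimensional Gaussian and averages the quadratic form and $-\log p(x)$; the helper name `gaussian_entropy`, the seed and the parameters are illustrative assumptions.

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy (in nats) of a Gaussian with covariance Sigma."""
    d = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
d = 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)          # random symmetric positive definite covariance

x = rng.multivariate_normal(mu, Sigma, size=200_000)
diff = x - mu
# Row-wise quadratic form (x - mu)^T Sigma^{-1} (x - mu).
quad = np.einsum("ni,ni->n", diff @ np.linalg.inv(Sigma), diff)

_, logdet = np.linalg.slogdet(Sigma)
neg_log_p = 0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

mean_quad = quad.mean()              # should be close to d
mc_entropy = neg_log_p.mean()        # should be close to gaussian_entropy(Sigma)
```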

Mutual information of Gaussians

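Suppose we have a Gaussian $X \sim \mathcal{N}(\mu, \Sigma)$ and a noisy observation $Y = X + \varepsilon$, with $\varepsilon\sim \mathcal{N}(0, \sigma^{2}_{n}I)$ independent of $X$. Then $Y \sim \mathcal{N}(\mu, \Sigma + \sigma_{n}^{2}I)$ and $Y \mid X \sim \mathcal{N}(X, \sigma_{n}^{2}I)$, so using the entropy of a multivariate Gaussian we get:

$$ I(X; Y) = H(Y) - H(Y \mid X) = \frac{1}{2} \log \left( (2\pi e)^{d} \lvert \Sigma + \sigma_{n}^{2}I \rvert \right) - \frac{1}{2} \log \left( (2\pi e)^{d} \lvert \sigma_{n}^{2}I \rvert \right) = \frac{1}{2} \log\lvert I + \sigma^{-2}_{n}\Sigma \rvert $$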
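The mutual information formula can be cross-checked against the identity $I = H(X) + H(Y) - H(X, Y)$ computed from Gaussian entropies. This sketch assumes the noise $\varepsilon$ is independent of $X$, so that $\text{Cov}(X, Y) = \Sigma$; the function names are chosen here for illustration.

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy (in nats) of a Gaussian with covariance Sigma."""
    d = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def mutual_information_closed_form(Sigma, noise_var):
    """0.5 * log det(I + Sigma / noise_var)."""
    d = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(d) + Sigma / noise_var)
    return 0.5 * logdet

def mutual_information_from_entropies(Sigma, noise_var):
    """H(X) + H(Y) - H(X, Y) for Y = X + eps, eps independent of X."""
    d = Sigma.shape[0]
    Sigma_y = Sigma + noise_var * np.eye(d)
    # Joint covariance of the stacked vector (X, Y); Cov(X, Y) = Sigma.
    joint = np.block([[Sigma, Sigma], [Sigma, Sigma_y]])
    return gaussian_entropy(Sigma) + gaussian_entropy(Sigma_y) - gaussian_entropy(joint)

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mi_closed = mutual_information_closed_form(Sigma, noise_var=0.3)
mi_entropies = mutual_information_from_entropies(Sigma, noise_var=0.3)
```

In the scalar case with $\Sigma = 3$ and $\sigma_{n}^{2} = 1$ the formula gives $\frac{1}{2}\log 4 = \log 2$, i.e. one bit.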

General KL divergence between Gaussians

The KL divergence between two Gaussians is given by:

$$ KL(p \mid \mid q) = \frac{1}{2} \left( \log \frac{\lvert \Sigma_{q} \rvert}{\lvert \Sigma_{p} \rvert} - d + \text{tr}(\Sigma_{q}^{-1}\Sigma_{p}) + (\mu_{q} - \mu_{p})^{T}\Sigma_{q}^{-1}(\mu_{q} - \mu_{p}) \right) $$

This is a good resource for a proof.
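The general formula translates almost term by term into NumPy. The following is a minimal sketch, assuming both covariance matrices are symmetric positive definite; the function name `kl_gaussians` is a label chosen for this note.

```python
import numpy as np

def kl_gaussians(mu_p, Sigma_p, mu_q, Sigma_q):
    """KL(p || q) between N(mu_p, Sigma_p) and N(mu_q, Sigma_q), in nats."""
    d = mu_p.shape[0]
    _, logdet_p = np.linalg.slogdet(Sigma_p)
    _, logdet_q = np.linalg.slogdet(Sigma_q)
    diff = mu_q - mu_p
    # tr(Sigma_q^{-1} Sigma_p) and the quadratic term, both via linear solves.
    trace_term = np.trace(np.linalg.solve(Sigma_q, Sigma_p))
    quad_term = diff @ np.linalg.solve(Sigma_q, diff)
    return 0.5 * (logdet_q - logdet_p - d + trace_term + quad_term)

# One-dimensional example: p = N(0, 1), q = N(1, 4).
kl_pq = kl_gaussians(np.array([0.0]), np.array([[1.0]]),
                     np.array([1.0]), np.array([[4.0]]))
# Term by term: 0.5 * (log 4 - 1 + 1/4 + 1/4).
```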

Forward KL

One can prove that the forward KL divergence between two Gaussians defined as $p = \mathcal{N}(\mu, \text{diag}\left\{ \sigma^{2}_{1}, \dots, \sigma^{2}_{d} \right\})$ and $q = \mathcal{N}(0, I)$ is given by:

$$ KL(p \mid \mid q) = \frac{1}{2} \sum_{i = 1}^{d} \left( \sigma_{i}^{2} + \mu_{i}^{2} - \log \sigma_{i}^{2} - 1 \right) $$

Let’s interpret this. The $\mu^{2}$ term works to pull the mean toward zero. The $\sigma^{2}$ term introduces a penalty for high variance values, while the $-\log \sigma^2$ term imposes a cost for low values of $\sigma$. This forward KL is what is used for Variational Autoencoders.

Reverse KL

Under the same assumptions, the reverse KL divergence, from $q$ to $p$, is given by:

$$ KL(q \mid \mid p) = \frac{1}{2} \sum_{i = 1}^{d} \left( \frac{\mu_{i}^{2}}{\sigma_{i}^{2}} + \sigma_{i}^{-2} + \log \sigma_{i}^{2} - 1 \right) $$
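Because both Gaussians are diagonal, the KL divergence is a sum of one-dimensional terms, so the two closed forms can be checked against a numerical integral of $a(x) \log \frac{a(x)}{b(x)}$ in each dimension. The sketch below does this with a Riemann sum; the function names and the test values of $\mu$ and $\sigma^{2}$ are arbitrary.

```python
import numpy as np

def forward_kl(mu, var):
    """KL(p || q) for p = N(mu, diag(var)) and q = N(0, I)."""
    return 0.5 * np.sum(var + mu**2 - np.log(var) - 1.0)

def reverse_kl(mu, var):
    """KL(q || p) for the same p and q."""
    return 0.5 * np.sum(mu**2 / var + 1.0 / var + np.log(var) - 1.0)

def log_pdf(x, mu, var):
    """Log density of a univariate Gaussian with mean mu and variance var."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

def kl_numeric_1d(mu_a, var_a, mu_b, var_b, n=400_001):
    """Riemann-sum estimate of KL(N(mu_a, var_a) || N(mu_b, var_b)) in one dimension."""
    s = np.sqrt(var_a)
    x = np.linspace(mu_a - 12.0 * s, mu_a + 12.0 * s, n)
    dx = x[1] - x[0]
    log_a = log_pdf(x, mu_a, var_a)
    log_b = log_pdf(x, mu_b, var_b)
    return np.sum(np.exp(log_a) * (log_a - log_b)) * dx

mu = np.array([0.5, -1.0])
var = np.array([0.25, 2.0])

forward_numeric = sum(kl_numeric_1d(m, v, 0.0, 1.0) for m, v in zip(mu, var))
reverse_numeric = sum(kl_numeric_1d(0.0, 1.0, m, v) for m, v in zip(mu, var))
```

Both divergences vanish only when $\mu = 0$ and every $\sigma_{i}^{2} = 1$, and in general they take different values, which is why the direction matters.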

References

[1] Bishop, C. M. “Pattern Recognition and Machine Learning”, Springer, 2006.