Gaussians are one of the most important families of probability distributions. They arise naturally in the central limit theorem and have several nice properties that we briefly present and prove in this note. They are also central to Gaussian Processes and to Clustering algorithms such as Gaussian mixtures, and they have something to say about the Maximum Entropy Principle. The best place to learn this material properly is section 2.3 of (Bishop 2006), so go there my friend :)

The Density Function

$$ \mathcal{N}(x; \mu, \sigma^{2}) = \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) $$$$ \mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi )^{d/2}\sqrt{ \lvert \Sigma \rvert } } \exp\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right) $$

Where $d$ is the dimensionality of the multivariate Gaussian.
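As a quick numerical sanity check (my own snippet, not from the note), the multivariate density above can be implemented directly in NumPy and compared against `scipy.stats.multivariate_normal`; the helper name `gaussian_pdf` and the example numbers are arbitrary.

```python
# Compare a direct implementation of the density formula with scipy's.
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x; mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, -1.0])

print(gaussian_pdf(x, mu, Sigma))                       # direct formula
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # scipy, should match
```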

Integral is 1 🟥++

We now prove that the integral of the Gaussian PDF is 1, a requirement for it to be a valid probability density function.

$$ I = \int_{-\infty}^{\infty} \exp( - x^{2}) \, dx = \sqrt{ \pi } $$$$ \begin{align} I^{2} & = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp( - x^{2} - y^{2}) \, dx \, dy \\ & = \int_{0}^{2\pi} \int_{0}^{\infty} r\exp( - r^{2}) \, dr \, d\theta \\ & = 2\pi \cdot \left( -\frac{1}{2} \right) \int_{0}^{\infty} -2r\exp( - r^{2}) \, dr \\ & = -\frac{2\pi}{2} \exp(-r^{2}) \bigg\vert_{0}^{\infty} = \pi \\ & \implies I = \sqrt{ \pi } \end{align} $$

Note that in the second step we changed to polar coordinates, with $x = r\cos\theta$, $y = r\sin\theta$ and $dx\,dy = r\,dr\,d\theta$. If we repeat the derivation with $\exp\left( - \frac{(x - \mu)^{2}}{2\sigma^{2}} \right)$, first changing variables to $y = x - \mu$ (so $dy = dx$) and then to $z = \frac{y}{\sqrt{ 2 \sigma^{2} }}$ (so $dy = \sqrt{ 2\sigma^{2} }\,dz$), we recover the same integral up to a constant multiplicative factor. So we have

$$ \int _{-\infty}^{\infty} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) \, dx = \sqrt{ 2\sigma^{2} } \int _{-\infty}^{\infty} \exp(-z^{2})\, dz = \sqrt{ 2\pi \sigma^{2}} $$

Which finishes the derivation of the normalizing term.
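A small sketch of this check in code (mine, assuming SciPy is available): integrating the unnormalized exponential numerically should reproduce $\sqrt{2\pi\sigma^{2}}$ for arbitrary $\mu$ and $\sigma$.

```python
# Numerically confirm the normalizing constant sqrt(2*pi*sigma^2).
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.5, 0.7

unnormalized = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
area, _ = quad(unnormalized, -np.inf, np.inf)

print(area)                             # ~ sqrt(2*pi*sigma^2)
print(np.sqrt(2 * np.pi * sigma ** 2))
```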

Error Function 🟨–

$$ \text{erf}(x) = \frac{2}{\sqrt{ \pi }} \int_{0}^{x} \exp(-t^{2}) \, dt $$$$ \text{erf}(x) = \frac{1}{\sqrt{ \pi }} \int_{-x}^{x} \exp(-t^{2}) \, dt $$

We observe that the limit as $x \to +\infty$ is 1, and as $x \to -\infty$ it is $-1$.

$$ \text{erf}\left( \frac{x}{\sqrt{ 2 }} \right) = 2\Phi(x) - 1 $$$$ \text{erf}(-x) = -\text{erf}(x) $$
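A minimal check (mine) of the relation between $\text{erf}$ and the standard normal CDF $\Phi$, using only the Python standard library; the sample points are arbitrary.

```python
# Check erf(x / sqrt(2)) = 2*Phi(x) - 1 at a few points.
from math import erf, sqrt
from statistics import NormalDist

Phi = NormalDist().cdf                        # standard normal CDF
for x in (-2.0, -0.5, 0.0, 1.3, 3.0):
    print(erf(x / sqrt(2)), 2 * Phi(x) - 1)   # the two columns should agree
```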

Some properties of Gaussians

The conditional Gaussian

If $X, Y$ are jointly Gaussian, then the conditional distribution $p(X = x \mid Y = y)$ is a Gaussian with the following mean and covariance:

$$ \mu_{X \mid Y = y} = \mu_{X} + \Sigma_{XY} \Sigma_{YY}^{-1}(y - \mu_{Y}) $$$$ \Sigma_{X \mid Y} = \Sigma_{XX} - \Sigma_{XY} \Sigma^{-1}_{YY} \Sigma_{YX} $$

The proof is presented in section 2.3 of (Bishop 2006).
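Below is a small sketch (mine, not Bishop's proof) that checks the conditioning formulas in the bivariate case against a brute-force Monte Carlo estimate; the numbers and the slicing tolerance are arbitrary choices.

```python
# Closed-form conditional p(X | Y = y) vs. a Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])     # blocks: X = first coord, Y = second coord
y_obs = 2.0

# Closed-form conditional mean and variance
mu_x_given_y = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y_obs - mu[1])
var_x_given_y = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]

# Monte Carlo check: sample the joint, keep samples with Y close to y_obs
samples = rng.multivariate_normal(mu, Sigma, size=1_000_000)
near = samples[np.abs(samples[:, 1] - y_obs) < 0.02, 0]
print(mu_x_given_y, var_x_given_y)
print(near.mean(), near.var())     # should be close to the closed form
```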

Products of Gaussians are Gaussian 🟨++

This is a little more involved to spell out in full, see this chatgpt response. The key point is that multiplying the densities and completing the square in the exponent shows the product is just an Unnormalized Gaussian.
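As a sketch of the univariate case (my own completion of the square, under the convention $\mathcal{N}(x; a, A)$ for mean $a$ and variance $A$):

$$ \mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(a; b, A + B)\,\mathcal{N}(x; c, C), \qquad C = \left( A^{-1} + B^{-1} \right)^{-1}, \quad c = C\left( A^{-1}a + B^{-1}b \right) $$

So the product is Gaussian in $x$, but its overall scale $\mathcal{N}(a; b, A + B)$ depends on how far apart the two means are, which is why the result is an unnormalized Gaussian rather than a density.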

Marginals are Gaussians 🟨

One can prove that any finite-dimensional marginal of a multivariate Gaussian is still a multivariate Gaussian.

$$ \begin{bmatrix} A \\ B \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma) $$$$ \mu = \begin{bmatrix} \mu_{A} \\ \mu_{B} \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} $$$$ \Sigma^{-1} = V = \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix} = \begin{bmatrix} I & V_{12}V_{22}^{-1} \\ 0 & I \end{bmatrix} \cdot \begin{bmatrix} V_{11} - V_{12}V_{22}^{-1}V_{21} & 0 \\ 0 & V_{22} \end{bmatrix} \cdot \begin{bmatrix} I & 0 \\ V_{22}^{-1}V_{21} & I \end{bmatrix} $$

Then, using $(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$ and identifying the blocks $1 \equiv A$, $2 \equiv B$, the inverse is

$$ V^{-1} = \begin{bmatrix} I & 0 \\ -V_{22}^{-1}V_{21} & I \end{bmatrix} \cdot \begin{bmatrix} (V_{11} - V_{12}V_{22}^{-1}V_{21})^{-1} & 0 \\ 0 & V_{22}^{-1} \end{bmatrix} \cdot \begin{bmatrix} I & -V_{12}V_{22}^{-1} \\ 0 & I \end{bmatrix} = \begin{bmatrix} (V_{11} - V_{12}V_{22}^{-1}V_{21})^{-1} & -\Sigma_{11}V_{12}V_{22}^{-1} \\ -V_{22}^{-1}V_{21}\Sigma_{11} & V_{22}^{-1} + V_{22}^{-1}V_{21}\Sigma_{11}V_{12}V_{22}^{-1} \end{bmatrix} $$

Since $V^{-1} = \Sigma$, the top-left block gives $\Sigma_{11} = (V_{11} - V_{12}V_{22}^{-1}V_{21})^{-1}$, and by the symmetric argument (doing the same decomposition on $\Sigma$ instead of $V$) one gets $V_{22} = (\Sigma_{22} - \Sigma_{21}\Sigma^{-1}_{11}\Sigma_{12})^{-1}$. The off-diagonal block allows us to write $V_{12}$ nicely as

$$ \Sigma_{12} = -\Sigma_{11}V_{12}V_{22}^{-1} \implies V_{12} = -\Sigma_{11}^{-1}\Sigma_{12} V_{22} = -\Sigma_{11}^{-1}\Sigma_{12} (\Sigma_{22} - \Sigma_{21}\Sigma^{-1}_{11}\Sigma_{12})^{-1} $$

These identities are exactly what the marginalization and conditioning calculations need. One finds in this manner that

$$ A \sim \mathcal{N}(\mu_{A}, \Sigma_{AA}) $$

and that

$$ B \mid A \sim \mathcal{N}(\mu_{B} - V_{BB}^{-1}V_{BA} (A - \mu_{A}),\ V_{BB}^{-1}) $$

If you are a student at ETH watch [this](https://video.ethz.ch/lectures/d-infk/2024/autumn/263-5210-00L/0d924ef2-af34-4cb1-96cb-ba9a63c1f15b.html) for the derivation, minute 46. Rewriting with the above identities for $V_{BB}$ and $V_{BA}$ we obtain:

$$ B \mid A \sim \mathcal{N}(\mu_{B} + \Sigma_{21}\Sigma_{11}^{-1}(A - \mu_{A}),\ \Sigma_{22} - \Sigma_{21}\Sigma^{-1}_{11}\Sigma_{12}) $$

which is a nicer form, but quite long to derive.
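A quick numerical check (mine) of the two block identities that do the heavy lifting above, on a random SPD covariance; the dimension and block split are arbitrary.

```python
# With V = Sigma^{-1}: V_22 = (S_22 - S_21 S_11^{-1} S_12)^{-1}
# and Sigma_11 = (V_11 - V_12 V_22^{-1} V_21)^{-1}.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)          # random SPD covariance
V = np.linalg.inv(Sigma)

k = 2                                     # block A = first k coords, block B = rest
S11, S12 = Sigma[:k, :k], Sigma[:k, k:]
S21, S22 = Sigma[k:, :k], Sigma[k:, k:]
V11, V12 = V[:k, :k], V[:k, k:]
V21, V22 = V[k:, :k], V[k:, k:]

print(np.allclose(V22, np.linalg.inv(S22 - S21 @ np.linalg.inv(S11) @ S12)))  # True
print(np.allclose(S11, np.linalg.inv(V11 - V12 @ np.linalg.inv(V22) @ V21)))  # True
```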

Gaussian characteristic function 🟨++

Characteristic functions are sometimes useful to prove that two distributions coincide. One can prove that the characteristic function of a Gaussian is

$$ \mathbb{E}[\exp(it^{T}X)] = \exp\left( it^{T}\mu - \frac{1}{2} t^{T} \Sigma t \right) $$ In the univariate case this is the integral $$ \mathbb{E}[\exp(itX)] = \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} + itx \right) \, dx $$ which we evaluate by completing the square: $$ \begin{align} \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} + itx \right) \, dx & = \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - (\mu + \sigma^{2}it))^{2} - 2\mu it\sigma^{2} + \sigma^{4}t^{2} }{2\sigma^{2}} \right) \, dx \\ & = \exp\left( \mu it - \frac{1}{2} \sigma^{2}t^{2} \right) \cdot \int _{-\infty}^{\infty} \frac{1}{\sqrt{ 2\pi } \sigma} \exp\left( -\frac{(x - (\mu + \sigma^{2}it))^{2} }{2\sigma^{2}} \right) \, dx \\ & = \exp\left( \mu it - \frac{1}{2} \sigma^{2}t^{2} \right) \end{align} $$
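A Monte Carlo sanity check (mine) of the univariate characteristic function; the values of $\mu$, $\sigma$, $t$ and the sample size are arbitrary.

```python
# Empirical E[exp(itX)] vs. the closed form exp(it*mu - sigma^2 t^2 / 2).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, t = 0.5, 1.3, 0.8

x = rng.normal(mu, sigma, size=1_000_000)
empirical = np.mean(np.exp(1j * t * x))
closed_form = np.exp(1j * t * mu - 0.5 * sigma ** 2 * t ** 2)

print(empirical, closed_form)   # real and imaginary parts should roughly agree
```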

Sums of Gaussians are Gaussian 🟩

This is easy to prove: if $X \sim \mathcal{N}(\mu_{X}, \Sigma_{X})$ and $Y \sim \mathcal{N}(\mu_{Y}, \Sigma_{Y})$ are independent and of compatible dimension, then $X + Y \sim \mathcal{N} (\mu_{X} + \mu_{Y}, \Sigma_{X} + \Sigma_{Y})$. The proof uses characteristic functions, in the same spirit as linear Gaussians.

$$ \begin{align} \mathbb{E}[\exp(it (X + Y))] &= \mathbb{E}[\exp(itX)]\,\mathbb{E}[\exp(itY)] \\ &= \exp\left( \mu_{X} it - \frac{1}{2} t^{T} \Sigma_{X} t \right) \exp\left( \mu_{Y} it - \frac{1}{2} t^{T} \Sigma_{Y} t \right) \\ &= \exp\left( (\mu_{X} + \mu_{Y}) it - \frac{1}{2} t^{T} (\Sigma_{X} + \Sigma_{Y}) t \right) \end{align} $$

Where the first equality uses independence. This finishes the proof; one can extend the result to any linear combination of independent Gaussians.
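A simulation sketch (mine): sampling independent $X$ and $Y$ and checking that the empirical mean and covariance of $X + Y$ match $\mu_{X} + \mu_{Y}$ and $\Sigma_{X} + \Sigma_{Y}$; all numbers are arbitrary.

```python
# Empirical mean/covariance of a sum of independent Gaussians.
import numpy as np

rng = np.random.default_rng(3)
mu_x, Sigma_x = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
mu_y, Sigma_y = np.array([2.0, -1.0]), np.array([[0.8, -0.1], [-0.1, 0.3]])

X = rng.multivariate_normal(mu_x, Sigma_x, size=500_000)
Y = rng.multivariate_normal(mu_y, Sigma_y, size=500_000)
Z = X + Y

print(Z.mean(axis=0), mu_x + mu_y)   # means should agree
print(np.cov(Z.T))                   # ~ Sigma_x + Sigma_y
print(Sigma_x + Sigma_y)
```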

Properties to remember 🟩

  • Compact representation of high-dimensional joint distributions: instead of the $2^{n}$ parameters needed for a general joint distribution over $n$ binary variables, a Gaussian needs only $O(n^{2})$ (a mean vector and a covariance matrix); this is part of why Gaussian Processes are analytically handy.
  • Closed-form inference: Gaussians are conjugate to themselves (see Conjugacy), which follows from the fact that they belong to The Exponential Family.

Confidence Intervals

Gaussians are a nice distribution, and we have listed many of their properties by now. One of the most used features is the ease of computing $1 - \alpha$ confidence intervals, where $\alpha$ is called the significance level: we want an interval that contains the quantity we are estimating with probability $1 - \alpha$. This is usually easy to compute with $z$ tables. The Standard Error of the sample mean is $\frac{\sigma}{\sqrt{ n }}$, i.e. the square root of the variance of the mean.

$$ \bar{x} \pm z \cdot SE $$

Where $\bar{x}$ is the sample mean of our predictions and $z$ is the standard-normal quantile matching the chosen confidence level (e.g. $z \approx 1.96$ for $95\%$).
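A minimal example of this interval using only the Python standard library (my own made-up sample; `NormalDist().inv_cdf` supplies the $z$ quantile):

```python
# 95% z-interval for the mean: x_bar +/- z * SE.
from statistics import NormalDist, mean, stdev
from math import sqrt

data = [4.9, 5.1, 5.3, 4.8, 5.0, 5.2, 5.1, 4.7]   # made-up sample
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)            # ~1.96 for a 95% interval

x_bar = mean(data)
se = stdev(data) / sqrt(len(data))                 # standard error of the mean
print(x_bar - z * se, x_bar + z * se)
```

For small samples with unknown variance one would normally use a $t$ quantile instead of $z$, but the structure $\bar{x} \pm z \cdot SE$ stays the same.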

Information theoretic properties

Entropy of a Gaussian distribution 🟩

We compute here the Entropy of a univariate Gaussian distribution $\mathcal{N}(x; \mu, \sigma^{2})$, i.e. the following value:

$$ \begin{align} \int p(x) \log \frac{1}{p(x)} \, dx &= -\int \frac{1}{\sqrt{ 2\pi \sigma^{2} }} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right) \cdot \left( -\frac{1}{2} \log(2\pi \sigma^{2}) -\frac{1}{2\sigma^{2}}(x - \mu)^{2} \right) \, dx \\ &= \frac{1}{2}\log(2\pi \sigma^{2}) +\frac{1}{2\sigma^{2}} \mathbb{E}_{x} [(x - \mu)^{2}] \\ &= \frac{1}{2}\log(2\pi \sigma^{2}) + \frac{1}{2} \\ &= \frac{1}{2} \log(2\pi \sigma^{2}e) \end{align} $$

With just the above proof one can show that Gaussians are the distributions with maximum entropy for a given mean and variance. See Maximum Entropy Principle. The same computation for a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$ gives:

$$ \begin{align} \mathbb{E}_{x \sim p}[-\log p(x)] & = \mathbb{E}_{x \sim p}\left[ \frac{d}{2} \log(2\pi ) + \frac{1}{2}\log \det \Sigma + \frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu) \right] \\ & = \frac{d}{2} \log(2\pi ) + \frac{1}{2}\log \det \Sigma + \frac{1}{2} \mathbb{E}_{x \sim p}[(x - \mu)^{T}\Sigma^{-1}(x - \mu)] \\ & =\frac{d}{2}(1 + \log(2\pi)) + \frac{1}{2}\log \det \Sigma \\ &= \frac{d}{2} \log(2\pi e) + \frac{1}{2}\log \det \Sigma \\ &= \frac{1}{2} \log((2\pi e)^{d} \lvert \Sigma \rvert ) \end{align} $$

Where for the expectation of the quadratic term we used this equality: $$ \begin{align} \mathbb{E}_{x \sim p}[(x - \mu)^{T}\Sigma^{-1}(x - \mu)] & = \mathbb{E}_{x \sim p}[\text{tr}((x - \mu)^{T}\Sigma^{-1}(x - \mu))] & \text{ trace of a scalar} \\ & = \mathbb{E}_{x \sim p}[\text{tr} (\Sigma^{-1}(x - \mu)(x - \mu)^{T})] & \text{ eq. 16 Matrix Cookbook} \\ & = \text{tr}(\mathbb{E}_{x \sim p}[\Sigma^{-1}(x - \mu)(x - \mu)^{T}]) & \text{ linearity of trace} \\ & = \text{tr}(\Sigma^{-1} \mathbb{E}_{x \sim p}[(x - \mu)(x - \mu)^{T}]) & \text{ linearity of expectation} \\ & = \text{tr}(\Sigma^{-1} \Sigma) & \text{ definition of covariance} \\ & = d & \text{ trace of the identity matrix} \end{align} $$ The Matrix Cookbook refers to this resource.
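A quick check (mine) of the closed-form entropy $\frac{1}{2}\log((2\pi e)^{d}\lvert\Sigma\rvert)$ against SciPy's implementation; the covariance matrix is an arbitrary SPD example.

```python
# Closed-form differential entropy vs. scipy's multivariate_normal.entropy().
import numpy as np
from scipy.stats import multivariate_normal

Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
d = Sigma.shape[0]

closed_form = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))
print(closed_form)
print(multivariate_normal(mean=np.zeros(d), cov=Sigma).entropy())   # should match
```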

Mutual information of Gaussians

Suppose we have a Gaussian $X \sim \mathcal{N}(\mu, \Sigma)$ and a noisy observation $Y = X + \varepsilon$ with $\varepsilon\sim \mathcal{N}(0, \sigma^{2}_{n}I)$ independent of $X$. Then $I(X; Y) = H(Y) - H(Y \mid X) = \frac{1}{2}\log\lvert 2\pi e(\Sigma + \sigma_{n}^{2}I) \rvert - \frac{1}{2}\log\lvert 2\pi e\,\sigma_{n}^{2}I \rvert$, which simplifies to:

$$ I(X, Y) = \frac{1}{2} \log\lvert I + \sigma^{-2}_{n}\Sigma \rvert $$
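A small sketch (mine) verifying that this expression equals $H(Y) - H(Y \mid X)$ for $Y = X + \varepsilon$; the helper name `gaussian_entropy` and the example numbers are my own.

```python
# Mutual information I(X; Y) = H(Y) - H(Y | X) for Y = X + eps.
import numpy as np

def gaussian_entropy(Sigma):
    d = Sigma.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))

Sigma = np.array([[1.5, 0.4], [0.4, 0.9]])
sigma_n2 = 0.3                                          # noise variance sigma_n^2
d = Sigma.shape[0]

H_Y = gaussian_entropy(Sigma + sigma_n2 * np.eye(d))    # Y = X + eps
H_Y_given_X = gaussian_entropy(sigma_n2 * np.eye(d))    # only the noise remains
mi_from_entropies = H_Y - H_Y_given_X

mi_closed_form = 0.5 * np.log(np.linalg.det(np.eye(d) + Sigma / sigma_n2))
print(mi_from_entropies, mi_closed_form)                # should agree
```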

General KL divergence between Gaussians

$$ KL(p \mid \mid q) = \frac{1}{2} \left( \log \frac{\lvert \Sigma_{q} \rvert}{\lvert \Sigma_{p} \rvert} - d + \text{tr}(\Sigma_{q}^{-1}\Sigma_{p}) + (\mu_{q} - \mu_{p})^{T}\Sigma_{q}^{-1}(\mu_{q} - \mu_{p}) \right) $$

This is a good resource for a proof.
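A direct NumPy transcription (mine) of the closed-form expression above, with the hypothetical helper name `gaussian_kl`; as a sanity check, $KL(p \mid\mid p) = 0$.

```python
# Closed-form KL divergence between two multivariate Gaussians.
import numpy as np

def gaussian_kl(mu_p, Sigma_p, mu_q, Sigma_q):
    d = mu_p.shape[0]
    diff = mu_q - mu_p
    Sigma_q_inv = np.linalg.inv(Sigma_q)
    return 0.5 * (np.log(np.linalg.det(Sigma_q) / np.linalg.det(Sigma_p))
                  - d
                  + np.trace(Sigma_q_inv @ Sigma_p)
                  + diff @ Sigma_q_inv @ diff)

mu_p, Sigma_p = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
mu_q, Sigma_q = np.array([1.0, 0.0]), np.array([[2.0, 0.0], [0.0, 1.0]])

print(gaussian_kl(mu_p, Sigma_p, mu_q, Sigma_q))   # >= 0
print(gaussian_kl(mu_p, Sigma_p, mu_p, Sigma_p))   # 0 up to floating point
```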

Forward KL

One can prove that the forward KL divergence between the two Gaussians $p = \mathcal{N}(\mu, \operatorname{diag}\left\{ \sigma^{2}_{1}, \dots, \sigma^{2}_{d} \right\})$ and $q = \mathcal{N}(0, I)$ is given by:

$$ KL(p \mid \mid q) = \frac{1}{2} \sum_{i = 1}^{d} \left( \sigma_{i}^{2} + \mu_{i}^{2} - \log \sigma_{i}^{2} - 1 \right) $$

Let’s interpret this. The $\mu_{i}^{2}$ term pulls the mean toward zero. The $\sigma_{i}^{2}$ term penalizes large variances, while the $-\log \sigma_{i}^{2}$ term penalizes variances that are too small. This forward KL is the term used in (variational) Autoencoders.

Reverse KL

$$ KL(q \mid \mid p) = \frac{1}{2} \sum_{i = 1}^{d} \left( \frac{\mu_{i}^{2}}{\sigma_{i}^{2}} + \sigma_{i}^{-2} + \log \sigma_{i}^{2} - 1 \right) $$
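A small check (mine) that both diagonal formulas agree with the general closed form; `gaussian_kl` re-implements the general expression from the previous snippet so the block stays self-contained, and the example numbers are arbitrary.

```python
# Forward/reverse diagonal KL vs. the general Gaussian KL formula.
import numpy as np

def gaussian_kl(mu_p, Sigma_p, mu_q, Sigma_q):
    """KL(N(mu_p, Sigma_p) || N(mu_q, Sigma_q)), same formula as above."""
    d = mu_p.shape[0]
    diff = mu_q - mu_p
    Sq_inv = np.linalg.inv(Sigma_q)
    return 0.5 * (np.log(np.linalg.det(Sigma_q) / np.linalg.det(Sigma_p)) - d
                  + np.trace(Sq_inv @ Sigma_p) + diff @ Sq_inv @ diff)

mu = np.array([0.5, -1.0, 0.2])
sigma2 = np.array([0.8, 1.5, 0.3])        # diagonal variances of p
d = mu.shape[0]

forward = 0.5 * np.sum(sigma2 + mu ** 2 - np.log(sigma2) - 1)
reverse = 0.5 * np.sum(mu ** 2 / sigma2 + 1 / sigma2 + np.log(sigma2) - 1)

print(forward, gaussian_kl(mu, np.diag(sigma2), np.zeros(d), np.eye(d)))   # match
print(reverse, gaussian_kl(np.zeros(d), np.eye(d), mu, np.diag(sigma2)))   # match
```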

References

[1] C. M. Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006.