Gaussians are one of the most important families of probability distributions. They arise naturally in the central limit theorem and have some nice properties that we briefly present and prove in this note. They are also central to Gaussian Processes and Clustering algorithms, and they have something to say about the Maximum Entropy Principle. The best way to learn this material well is section 2.3 of (Bishop 2006), so go there my friend :)
The Density Function
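The single variable Gaussian is as follows:
$$
\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$
This can be generalized to the multivariable case:
$$
\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \lvert \Sigma \rvert^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)
$$
Where $d$ is the dimensionality of the multidimensional Gaussian.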
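To make the two density formulas concrete, here is a minimal NumPy sketch. The function names `gaussian_pdf` and `multivariate_gaussian_pdf` are made up for this note, and the multivariate version assumes a symmetric positive definite covariance and a single input point.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x; mu, sigma2); sigma2 is the variance."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def multivariate_gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, cov) at a single point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Mahalanobis term: solve cov @ z = diff instead of forming the inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    # slogdet returns (sign, log|cov|); the log form avoids overflow for large d.
    _, logdet = np.linalg.slogdet(cov)
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + logdet)
    return np.exp(log_norm - 0.5 * maha)
```

With an identity covariance the multivariate density factorizes into a product of univariate ones, which is a quick way to check the implementation.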
Integral is 1
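We now prove that the integral of the Gaussian PDF is 1; this is a requirement for it to be a probability density function.
First, let's prove a famous equality:
$$
\int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi}
$$
This is kinda surprising, and we need some care to prove it:
$$
\left( \int_{-\infty}^{\infty} e^{-x^2} \, dx \right)^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)} \, dx \, dy = \int_0^{2\pi} \int_0^{\infty} e^{-r^2} r \, dr \, d\theta = 2\pi \left[ -\frac{1}{2} e^{-r^2} \right]_0^{\infty} = \pi
$$
Note that in the second step we changed variables to polar coordinates: $x = r\cos\theta$, $y = r\sin\theta$, so that $dx \, dy = r \, dr \, d\theta$. This is quite interesting. If we do the same derivation with $\int e^{-(x - \mu)^2 / (2\sigma^2)} \, dx$, first doing a change of variables $y = x - \mu$, where we have $\int e^{-y^2 / (2\sigma^2)} \, dy$, then doing another change of variables $z = y / (\sqrt{2}\sigma)$, where we get $\sqrt{2}\sigma \int e^{-z^2} \, dz$, now we have the same integral, plus an added constant multiplicative term. So we have
$$
\int_{-\infty}^{\infty} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = \sqrt{2}\sigma \sqrt{\pi} = \sqrt{2\pi\sigma^2}
$$
Which finishes the derivation of the normalizing term.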
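A quick numerical sanity check of the normalization, as a sketch: a plain Riemann sum on a wide, fine grid. The grid width (10 standard deviations) and the number of points are arbitrary choices made here, not anything prescribed by the derivation.

```python
import numpy as np

def unnormalized_integral(half_width=10.0, n=200_001):
    """Riemann-sum approximation of the integral of exp(-x^2) over the real line."""
    x = np.linspace(-half_width, half_width, n)
    dx = x[1] - x[0]
    return float(np.sum(np.exp(-x**2)) * dx)

def gaussian_integral(mu, sigma, half_width=10.0, n=200_001):
    """Riemann-sum approximation of the integral of the Gaussian density."""
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    dx = x[1] - x[0]
    pdf = np.exp(-(x - mu) ** 2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma**2)
    return float(np.sum(pdf) * dx)
```

The first function should land very close to $\sqrt{\pi}$ and the second very close to 1 for any choice of mean and standard deviation.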
Error Function
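Sometimes, for example when calculating the mean of the folded Gaussian, it is useful to consider the error function. This is defined as
$$
\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt
$$
Sometimes this is also written, using the symmetry of the integrand about the vertical axis, as
$$
\operatorname{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^{x} e^{-t^2} \, dt
$$
We observe that the limit $\lim_{x \to \infty} \operatorname{erf}(x)$ is 1, and that $\lim_{x \to -\infty} \operatorname{erf}(x)$ is -1.
Another useful relation is with the standard Gaussian CDF $\Phi$:
$$
\Phi(x) = \frac{1}{2} \left( 1 + \operatorname{erf}\left( \frac{x}{\sqrt{2}} \right) \right)
$$
We also note that it is anti-symmetric:
$$
\operatorname{erf}(-x) = -\operatorname{erf}(x)
$$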
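The relation between the error function and the Gaussian CDF can be sketched with Python's standard library, which provides `math.erf`. The helper name `normal_cdf` is made up for this note; the location and scale arguments simply standardize the input first.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian CDF written through the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

For instance, `normal_cdf(0.0)` is one half by symmetry, and `normal_cdf(1.96)` is roughly 0.975, the value behind the usual 95% two-sided interval.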
Some properties of Gaussians
The conditional Gaussian
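If we have $X_A, X_B$ which are jointly Gaussian, with means $\mu_A, \mu_B$ and covariance blocks $\Sigma_{AA}, \Sigma_{AB}, \Sigma_{BA}, \Sigma_{BB}$, then the distribution $p(x_A \mid x_B)$ is a Gaussian with the following mean and variance:
$$
\mu_{A \mid B} = \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (x_B - \mu_B)
$$
And
$$
\Sigma_{A \mid B} = \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA}
$$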
The proof is presented in section 2.3 of (Bishop 2006)
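As a sketch of how conditioning looks in code, the function below computes the conditional mean $\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B)$ and covariance $\Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}$. The name `condition_gaussian` and the argument layout are assumptions made for this note; it also assumes $\Sigma_{BB}$ is symmetric and invertible.

```python
import numpy as np

def condition_gaussian(mu_a, mu_b, S_aa, S_ab, S_bb, x_b):
    """Mean and covariance of p(x_A | x_B = x_b) for jointly Gaussian blocks."""
    # gain = S_ab @ inv(S_bb), obtained from a linear solve (S_bb is symmetric).
    gain = np.linalg.solve(S_bb, S_ab.T).T
    mean = mu_a + gain @ (x_b - mu_b)
    # Schur complement: S_aa - S_ab @ inv(S_bb) @ S_ba, with S_ba = S_ab.T.
    cov = S_aa - gain @ S_ab.T
    return mean, cov
```

In the bivariate case with unit variances and correlation 0.8, observing $x_B = 1$ moves the mean of $x_A$ to 0.8 and shrinks its variance to $1 - 0.8^2 = 0.36$.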
Products of Gaussians are Gaussian
This is a little more difficult to detail, see this chatgpt response. The product of two Gaussian densities is just an unnormalized Gaussian.
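For the univariate case the product can be written out explicitly: $\mathcal{N}(x; \mu_1, \sigma_1^2)\,\mathcal{N}(x; \mu_2, \sigma_2^2) = c \cdot \mathcal{N}(x; \mu, \sigma^2)$ with $\sigma^2 = (1/\sigma_1^2 + 1/\sigma_2^2)^{-1}$, $\mu = \sigma^2(\mu_1/\sigma_1^2 + \mu_2/\sigma_2^2)$ and $c = \mathcal{N}(\mu_1; \mu_2, \sigma_1^2 + \sigma_2^2)$. The sketch below (function names are made up for this note) returns these three quantities so the identity can be checked pointwise.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density with variance var."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def product_of_gaussians(mu1, var1, mu2, var2):
    """Return (mean, var, scale) with N(x;mu1,var1) * N(x;mu2,var2) = scale * N(x;mean,var)."""
    # Precisions (inverse variances) add up.
    var = 1.0 / (1.0 / var1 + 1.0 / var2)
    # The new mean is the precision-weighted average of the two means.
    mean = var * (mu1 / var1 + mu2 / var2)
    # The product integrates to `scale`, not to 1: hence "unnormalized".
    scale = gaussian_pdf(mu1, mu2, var1 + var2)
    return mean, var, scale
```

Since the scale factor is generally not 1, the product is a Gaussian shape but not a density until it is renormalized.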
Marginals are Gaussians
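One can prove that any finite marginal of a multivariate Gaussian is still a multivariate Gaussian.
Let's now write a closed form for this. Let's assume we have these two random variables:
$$
\begin{bmatrix} X_A \\ X_B \end{bmatrix} \sim \mathcal{N}(\mu, \Sigma)
$$
Where:
$$
\mu = \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix}
$$
We want to find the value of $p(x_A)$ and of $p(x_A \mid x_B)$. To prove this it is useful to remember the value of the following matrix (the inverse of the Schur complement of $\Sigma_{BB}$):
$$
M = \left( \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA} \right)^{-1}
$$
Then the inverse $\Lambda = \Sigma^{-1}$, which is equal to:
$$
\Lambda = \begin{bmatrix} \Lambda_{AA} & \Lambda_{AB} \\ \Lambda_{BA} & \Lambda_{BB} \end{bmatrix} = \begin{bmatrix} M & -M \Sigma_{AB} \Sigma_{BB}^{-1} \\ -\Sigma_{BB}^{-1} \Sigma_{BA} M & \Sigma_{BB}^{-1} + \Sigma_{BB}^{-1} \Sigma_{BA} M \Sigma_{AB} \Sigma_{BB}^{-1} \end{bmatrix}
$$
One can note now from the inverse that $\Lambda_{AA} = M$ and $\Lambda_{AB} = -M \Sigma_{AB} \Sigma_{BB}^{-1}$. This allows us to write $\Sigma$ nicely as
$$
\Sigma = \begin{bmatrix} I & \Sigma_{AB} \Sigma_{BB}^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} M^{-1} & 0 \\ 0 & \Sigma_{BB} \end{bmatrix} \begin{bmatrix} I & 0 \\ \Sigma_{BB}^{-1} \Sigma_{BA} & I \end{bmatrix}
$$
Because then it is easily invertible and one can observe that $\lvert \Sigma \rvert = \lvert M^{-1} \rvert \, \lvert \Sigma_{BB} \rvert$; this is used for the marginalization calculation. One can find in this manner that
$$
p(x_A) = \mathcal{N}(x_A; \mu_A, \Sigma_{AA})
$$
And that
$$
p(x_A \mid x_B) = \mathcal{N}\left( x_A; \; \mu_A - \Lambda_{AA}^{-1} \Lambda_{AB} (x_B - \mu_B), \; \Lambda_{AA}^{-1} \right)
$$
If you are a student at ETH watch this for the derivation, minute 46. Rewriting with the above properties for $\Lambda_{AA}$ and $\Lambda_{AB}$ we obtain:
$$
p(x_A \mid x_B) = \mathcal{N}\left( x_A; \; \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (x_B - \mu_B), \; \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA} \right)
$$
Which is an OK form, but very, very long to derive.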
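The block identities behind this derivation are easy to check numerically. The sketch below builds an arbitrary positive definite covariance (the sizes, the seed and the block split are illustrative choices) and compares the precision-matrix form of the conditional with the covariance (Schur complement) form.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)    # random symmetric positive definite covariance
a, b = slice(0, 2), slice(2, 5)      # index sets for the blocks A and B

S_aa, S_ab, S_bb = Sigma[a, a], Sigma[a, b], Sigma[b, b]
Lam = np.linalg.inv(Sigma)           # precision matrix
L_aa, L_ab = Lam[a, a], Lam[a, b]

# Schur complement of S_bb: the conditional covariance in covariance form.
schur = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)

# The same quantities expressed through the precision blocks.
cond_cov_precision_form = np.linalg.inv(L_aa)
gain_precision_form = -np.linalg.solve(L_aa, L_ab)   # -inv(L_aa) @ L_ab
gain_covariance_form = S_ab @ np.linalg.inv(S_bb)    # S_ab @ inv(S_bb)
```

Both forms of the conditional covariance and of the gain matrix should agree up to floating point error, and the determinant of `Sigma` should factor as the determinant of `S_bb` times that of `schur`.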
Gaussian characteristic function
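Characteristic functions are sometimes useful to prove that two distributions are the same. One can prove that the characteristic function of a Gaussian $X \sim \mathcal{N}(\mu, \Sigma)$ is
$$
\varphi_X(t) = \mathbb{E}\left[ e^{i t^T X} \right] = \exp\left( i t^T \mu - \frac{1}{2} t^T \Sigma t \right)
$$
Let's prove the univariate case; we will see that it gives exactly this value with $\Sigma = \sigma^2$. We need to compute the value:
$$
\varphi_X(t) = \int_{-\infty}^{\infty} e^{itx} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) dx
$$
The idea is to complete the square and then, knowing the value of the integral of the completed square, simplify:
$$
-\frac{(x - \mu)^2}{2\sigma^2} + itx = -\frac{\left( x - (\mu + it\sigma^2) \right)^2}{2\sigma^2} + it\mu - \frac{\sigma^2 t^2}{2}
$$
The first term integrates to the Gaussian normalizing constant $\sqrt{2\pi\sigma^2}$ (the shift is complex, but the value of the integral is unchanged), so
$$
\varphi_X(t) = \exp\left( it\mu - \frac{\sigma^2 t^2}{2} \right)
$$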
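As a sanity check of the univariate result $\varphi_X(t) = \exp(it\mu - \sigma^2 t^2 / 2)$, the sketch below compares the closed form against a direct numerical integration of $e^{itx}$ times the density. The function names, the grid width and the resolution are choices made for this note.

```python
import numpy as np

def gaussian_cf_closed_form(t, mu, sigma):
    """Closed-form characteristic function of N(mu, sigma^2)."""
    return np.exp(1j * t * mu - 0.5 * sigma**2 * t**2)

def gaussian_cf_numeric(t, mu, sigma, half_width=12.0, n=400_001):
    """Riemann-sum approximation of E[exp(i t X)] for X ~ N(mu, sigma^2)."""
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    dx = x[1] - x[0]
    pdf = np.exp(-(x - mu) ** 2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma**2)
    return np.sum(np.exp(1j * t * x) * pdf) * dx
```

Both functions return complex numbers; comparing them for a few values of `t` is a cheap way to catch sign mistakes in the completed square.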
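Sums of Gaussians are Gaussian
This is easily provable: if we have $X \sim \mathcal{N}(\mu_X, \Sigma_X)$ and an independent $Y \sim \mathcal{N}(\mu_Y, \Sigma_Y)$ of compatible dimension, then we have that $X + Y \sim \mathcal{N}(\mu_X + \mu_Y, \Sigma_X + \Sigma_Y)$. The proof uses characteristic functions, along the lines of linear Gaussians. The characteristic function of $\mathcal{N}(\mu, \Sigma)$ is $\exp(i t^T \mu - \frac{1}{2} t^T \Sigma t)$, and by independence the characteristic function of the sum is the product of the two:
$$
\varphi_{X+Y}(t) = \varphi_X(t) \, \varphi_Y(t) = \exp\left( i t^T (\mu_X + \mu_Y) - \frac{1}{2} t^T (\Sigma_X + \Sigma_Y) t \right)
$$
This is the characteristic function of $\mathcal{N}(\mu_X + \mu_Y, \Sigma_X + \Sigma_Y)$.
Which finishes the proof. One can also extend this result to every linear combination of independent Gaussians.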
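A small Monte Carlo sketch of the statement for two independent 2-dimensional Gaussians. The means, covariances, seed and sample size are arbitrary illustrative values, and the check only looks at the first two moments of the sum, so it illustrates the parameters of the result rather than Gaussianity itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

mu_x, cov_x = np.array([1.0, -2.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu_y, cov_y = np.array([0.5, 3.0]), np.array([[1.0, -0.3], [-0.3, 0.5]])

# Independent draws from the two Gaussians.
x = rng.multivariate_normal(mu_x, cov_x, size=n)
y = rng.multivariate_normal(mu_y, cov_y, size=n)
s = x + y

emp_mean = s.mean(axis=0)            # should approach mu_x + mu_y
emp_cov = np.cov(s, rowvar=False)    # should approach cov_x + cov_y
```

With this many samples the empirical mean and covariance of the sum should match the sums of the parameters to about two decimal places.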
Properties to remember
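- Compact representation of high dimensional joint distributions: instead of a number of parameters exponential in the dimension $n$ (e.g. $2^n - 1$ for a generic joint distribution over $n$ binary variables), we just need $O(n^2)$ (the mean and the covariance matrix); this is why Gaussian Processes are analytically handy.
- Closed form inference (think of the Gaussian being conjugate to itself; this is because Gaussians are in the Exponential Family).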
Confidence Intervals
Gaussians are a nice distribution. We have listed many of their properties by now. But one of the most heavily used features is the ease of computing confidence intervals at level $1 - \alpha$, where $\alpha$ is called the significance level: we want to find the interval in which our prediction lies with probability $1 - \alpha$. This is usually easy to compute with tables. The Standard Error of a Gaussian is $SE = \sigma / \sqrt{n}$, where $n$ is the number of samples; it is the square root of the variance of the sample mean.
So after we have computed these values, the confidence interval for a prediction is just
$$
\hat{\mu} \pm z_{1 - \alpha/2} \cdot SE
$$
Where $\hat{\mu}$ is the expected value for our prediction and $z_{1 - \alpha/2}$ is the corresponding quantile of the standard Gaussian.
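Instead of tables, the quantile can be taken from Python's standard library (`statistics.NormalDist.inv_cdf`). The sketch below assumes a two-sided interval, a known standard deviation `sigma` and `n` independent samples; the function name is made up for this note.

```python
import math
from statistics import NormalDist

def gaussian_confidence_interval(mean, sigma, n, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for a Gaussian mean."""
    # Quantile of the standard Gaussian leaving alpha/2 in each tail.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Standard error of the mean of n independent samples.
    se = sigma / math.sqrt(n)
    return mean - z * se, mean + z * se
```

For `alpha=0.05` the quantile is about 1.96, so with `sigma=2` and `n=100` the interval is roughly the mean plus or minus 0.39.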
Information theoretic properties
Entropy of a Gaussian distribution
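We compute here the Entropy of a Univariate Gaussian distribution $X \sim \mathcal{N}(\mu, \sigma^2)$. So we need to compute the following value:
$$
h(X) = -\int p(x) \ln p(x) \, dx = \frac{1}{2} \ln(2\pi\sigma^2) + \frac{1}{2\sigma^2} \mathbb{E}\left[ (X - \mu)^2 \right] = \frac{1}{2} \ln(2\pi\sigma^2) + \frac{1}{2} = \frac{1}{2} \ln(2\pi e \sigma^2)
$$
Building on the above result one can prove that Gaussians are the distributions with maximum entropy for a given mean and variance. See Maximum Entropy Principle.
We can extend this to the multivariate case $X \sim \mathcal{N}(\mu, \Sigma)$ in $d$ dimensions, observing the following:
$$
h(X) = \frac{1}{2} \ln\left( (2\pi)^d \lvert \Sigma \rvert \right) + \frac{1}{2} \mathbb{E}\left[ (X - \mu)^T \Sigma^{-1} (X - \mu) \right] = \frac{1}{2} \ln\left( (2\pi)^d \lvert \Sigma \rvert \right) + \frac{d}{2} = \frac{1}{2} \ln\left( (2\pi e)^d \lvert \Sigma \rvert \right)
$$
Where in the last step we used this equality:
$$
\mathbb{E}\left[ (X - \mu)^T \Sigma^{-1} (X - \mu) \right] = \operatorname{tr}\left( \Sigma^{-1} \mathbb{E}\left[ (X - \mu)(X - \mu)^T \right] \right) = \operatorname{tr}(\Sigma^{-1} \Sigma) = d
$$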
The Matrix Cookbook refers to this resource.
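The closed form $h(X) = \frac{1}{2} \ln\left( (2\pi e)^d \lvert \Sigma \rvert \right)$ is a one-liner in NumPy. The sketch below (the function name is made up for this note) works in nats and accepts either a scalar variance or a covariance matrix.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a Gaussian with covariance cov."""
    # A scalar variance is promoted to a 1x1 covariance matrix.
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    d = cov.shape[0]
    # log-determinant via slogdet to avoid overflow/underflow of det.
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)
```

Note that the mean does not appear: shifting a distribution does not change its entropy. For a diagonal covariance the entropy is the sum of the univariate entropies, which gives an easy consistency check.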
Mutual information of Gaussians
Suppose we have a Gaussian and TODO
We will see that this is equal to:
General KL divergence between Gaussians
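The KL divergence between two $d$-dimensional Gaussians $p = \mathcal{N}(\mu_p, \Sigma_p)$ and $q = \mathcal{N}(\mu_q, \Sigma_q)$ is given by:
$$
KL(p \,\|\, q) = \frac{1}{2} \left[ \operatorname{tr}\left( \Sigma_q^{-1} \Sigma_p \right) + (\mu_q - \mu_p)^T \Sigma_q^{-1} (\mu_q - \mu_p) - d + \ln \frac{\lvert \Sigma_q \rvert}{\lvert \Sigma_p \rvert} \right]
$$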
This is a good resource for a proof.
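A direct NumPy transcription of the general formula $KL(p \,\|\, q) = \frac{1}{2}\left[\operatorname{tr}(\Sigma_q^{-1}\Sigma_p) + (\mu_q - \mu_p)^T \Sigma_q^{-1} (\mu_q - \mu_p) - d + \ln(\lvert\Sigma_q\rvert / \lvert\Sigma_p\rvert)\right]$, as a sketch. The function name `kl_gaussians` is made up for this note, and both covariances are assumed symmetric positive definite.

```python
import numpy as np

def kl_gaussians(mu_p, cov_p, mu_q, cov_q):
    """KL(p || q) in nats for p = N(mu_p, cov_p) and q = N(mu_q, cov_q)."""
    mu_p = np.asarray(mu_p, dtype=float)
    mu_q = np.asarray(mu_q, dtype=float)
    cov_p = np.asarray(cov_p, dtype=float)
    cov_q = np.asarray(cov_q, dtype=float)
    d = mu_p.shape[0]
    diff = mu_q - mu_p
    # tr(inv(cov_q) @ cov_p), computed with a solve instead of an inverse.
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    # Mahalanobis distance between the means under cov_q.
    maha = diff @ np.linalg.solve(cov_q, diff)
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (trace_term + maha - d + logdet_q - logdet_p)
```

The divergence of a Gaussian from itself is zero, and swapping the arguments generally changes the value, since KL is not symmetric.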
Forward KL
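One can prove that the forward KL divergence between two Gaussians defined as $p = \mathcal{N}(\mu, \operatorname{diag}(\sigma_1^2, \dots, \sigma_d^2))$ and $q = \mathcal{N}(0, I)$ is given by:
$$
KL(p \,\|\, q) = \frac{1}{2} \sum_{i=1}^{d} \left( \sigma_i^2 + \mu_i^2 - 1 - \ln \sigma_i^2 \right)
$$
Let’s interpret this. The term $\mu_i^2$ works to pull the mean toward zero. The term $\sigma_i^2$ introduces a penalty for high variance values, while the term $-\ln \sigma_i^2$ imposes a cost for low values of $\sigma_i^2$. This forward KL is what is used for Autoencoders.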
Reverse KL
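Given the same assumptions, $p = \mathcal{N}(\mu, \operatorname{diag}(\sigma_1^2, \dots, \sigma_d^2))$ and $q = \mathcal{N}(0, I)$, we have that the KL of $q$ over $p$ is given by:
$$
KL(q \,\|\, p) = \frac{1}{2} \sum_{i=1}^{d} \left( \frac{\mu_i^2}{\sigma_i^2} + \frac{1}{\sigma_i^2} - 1 + \ln \sigma_i^2 \right)
$$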
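To see the asymmetry between the two directions, here is a sketch for $p = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and $q = \mathcal{N}(0, I)$, using the closed forms $KL(p \,\|\, q) = \frac{1}{2}\sum_i(\sigma_i^2 + \mu_i^2 - 1 - \ln\sigma_i^2)$ and $KL(q \,\|\, p) = \frac{1}{2}\sum_i((1 + \mu_i^2)/\sigma_i^2 - 1 + \ln\sigma_i^2)$. The function names are made up for this note, and `sigma2` holds the per-dimension variances.

```python
import numpy as np

def forward_kl(mu, sigma2):
    """KL(N(mu, diag(sigma2)) || N(0, I)) in nats."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - np.log(sigma2))

def reverse_kl(mu, sigma2):
    """KL(N(0, I) || N(mu, diag(sigma2))) in nats."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    return 0.5 * np.sum((1.0 + mu**2) / sigma2 - 1.0 + np.log(sigma2))
```

Both are zero when `mu` is zero and `sigma2` is one. For a single dimension with `mu = 1` and `sigma2 = 4`, the forward direction gives about 1.307 nats and the reverse about 0.443 nats, so the two directions penalize a too-wide $p$ quite differently.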
References
[1] Bishop “Pattern Recognition and Machine Learning” Springer 2006