The maximum entropy principle is one of the most important guiding ideas in artificial intelligence. Its roots lie in a long tradition of probabilistic inference that goes back to Laplace and Occam’s Razor, i.e. the principle of parsimony.
Let’s start with a simple example taken from Andreas Krause’s lecture notes for the ETH course Probabilistic Artificial Intelligence:
Consider a criminal trial with three suspects, A, B, and C. The collected evidence shows that suspect C cannot have committed the crime; however, it does not yield any information about suspects A and B. Clearly, any distribution respecting the data must assign zero probability of having committed the crime to suspect C. However, any distribution interpolating between (1, 0, 0) and (0, 1, 0) respects the data. The principle of indifference suggests that the desired distribution is $(\frac{1}{2}, \frac{1}{2}, 0)$, and indeed, any alternative distribution seems unreasonable.
The maximum entropy principle, elaborated by Jaynes (1968), asserts that when several distributions are consistent with the available information, we should use the one that assumes as little information as possible, i.e. the one that maximizes the entropy:
$$ H[p] = \mathbb{E}_{x \sim p}[-\log p(x)] $$
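To make the suspect example above concrete, here is a minimal sketch (assuming NumPy; the `entropy` helper is just the discrete formula above) that evaluates $H$ over the candidate distributions $(p, 1-p, 0)$ and confirms that it peaks at the indifferent choice $(\frac{1}{2}, \frac{1}{2}, 0)$:

```python
import numpy as np

def entropy(p):
    """Discrete entropy H[p] = -sum_i p_i log p_i, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# Candidate distributions consistent with the evidence: (p, 1 - p, 0).
grid = np.linspace(0.0, 1.0, 101)
entropies = [entropy([p, 1.0 - p, 0.0]) for p in grid]

best = grid[int(np.argmax(entropies))]
print(f"entropy is maximized at p = {best:.2f}")  # -> p = 0.50, i.e. (1/2, 1/2, 0)
```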
## Gaussian as Maximum Entropy Distribution

One nice thing about Gaussians is that they have the maximum entropy among all distributions with known mean and variance. So we can say that the Gaussian is the distribution that assumes the least information: it’s the distribution that best applies Occam’s Razor.
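As a quick illustration (not from the lecture notes; the Laplace competitor is just one example of an alternative distribution), the sketch below compares the differential entropy of a standard Gaussian with that of a Laplace distribution matched to the same mean and variance, using SciPy’s closed-form `entropy()` method. The Gaussian comes out larger, as the principle predicts:

```python
import numpy as np
from scipy.stats import norm, laplace

mu, sigma2 = 0.0, 1.0                      # target mean and variance

f = norm(loc=mu, scale=np.sqrt(sigma2))    # Gaussian with that mean and variance
b = np.sqrt(sigma2 / 2.0)                  # Laplace scale so that Var = 2 b^2 = sigma2
g = laplace(loc=mu, scale=b)               # competing distribution, same first two moments

print(f"H[f] (Gaussian) = {float(f.entropy()):.4f}")  # 0.5 * log(2 pi e sigma2) ~ 1.4189
print(f"H[g] (Laplace)  = {float(g.entropy()):.4f}")  # 1 + log(2 b) ~ 1.3466 < H[f]
```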
Let’s say $f = \mathcal{N}(\mu, \Sigma)$ is a Gaussian density and assume we have another distribution $g$ with the same mean and variance. We want to show that $f$ is the distribution that maximizes the entropy, i.e. that $H[f] \geq H[g]$.
Let’s consider first the cross-entropy of $g$ relative to $f$; we will prove that this is equal to the entropy of $f$:
$$ H[g\mid f] = \mathbb{E}_{x \sim g}[-\log f(x)] = \mathbb{E}_{x\sim f}[-\log f(x)] = H[f] $$

To prove this, note that taking the expectation of $-\log f(x)$ for a Gaussian $f$ only requires the first two moments (i.e. mean and variance) of the distribution we average over, and $g$ and $f$ share them by assumption.
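In the one-dimensional case the argument can be made explicit (a sketch; the multivariate case is analogous with the quadratic form $(x-\mu)^\top \Sigma^{-1}(x-\mu)$). Writing $f = \mathcal{N}(\mu, \sigma^2)$,

$$ -\log f(x) = \tfrac{1}{2}\log(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}, \qquad \mathbb{E}_{x \sim g}\big[(x-\mu)^2\big] = \sigma^2, $$

so that

$$ H[g \mid f] = \mathbb{E}_{x \sim g}[-\log f(x)] = \tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2} = \mathbb{E}_{x \sim f}[-\log f(x)] = H[f]. $$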
Then we see that

$$ KL(g \,\|\, f) = H[g \mid f] - H[g] = H[f] - H[g] \geq 0 \implies H[f] \geq H[g] $$

for every possible distribution $g$ with the same mean and variance, since the KL divergence is always non-negative.
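As a final numerical check (again a sketch, reusing the Laplace competitor from above), one can estimate $KL(g \,\|\, f)$ by Monte Carlo and verify that it is non-negative and matches $H[f] - H[g]$:

```python
import numpy as np
from scipy.stats import norm, laplace

f = norm(loc=0.0, scale=1.0)                # Gaussian f with mean 0, variance 1
g = laplace(loc=0.0, scale=np.sqrt(0.5))    # Laplace g with the same mean and variance

# Monte Carlo estimate of KL(g || f) = E_{x ~ g}[log g(x) - log f(x)]
x = g.rvs(size=200_000, random_state=42)
kl_estimate = np.mean(g.logpdf(x) - f.logpdf(x))

print(f"KL(g || f)  ~ {kl_estimate:.4f}")                          # ~ 0.07, non-negative
print(f"H[f] - H[g] = {float(f.entropy() - g.entropy()):.4f}")     # agrees with the estimate
```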