This set of notes takes inspiration from Section 9.2 of (Bishop 2006). We assume the reader already knows what a Gaussian mixture model is, so we only restate the model here. We discuss the problem of estimating the best possible parameters (so, a density estimation problem) when the data is generated by a mixture of Gaussians.
Remember that the multivariate Gaussian density on $\mathbb{R}^{d}$ has this form:
$$ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}} \frac{1}{\lvert \Sigma \rvert^{1/2} } \exp \left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1}(x - \mu) \right) $$
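As a quick sanity check of this formula, here is a minimal NumPy sketch that evaluates the density directly and compares it against `scipy.stats.multivariate_normal` (both libraries assumed available; the example values are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x | mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return norm * np.exp(-0.5 * quad)

x = np.array([0.3, -1.2])
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.2], [0.2, 2.0]])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should match
```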
Problem statement
Given a set of data points $x_{1}, \dots, x_{n}$ in $\mathbb{R}^{d}$ sampled from a mixture of $k$ Gaussians, each with mixing coefficient $\pi_{i}$, the objective of this problem is to estimate the best mixing coefficients together with the corresponding means and covariance matrices. We will assume a latent-variable model with a variable $z$ which indicates which Gaussian was chosen for a specific sample. We have this prior:
$$ p(z) = \prod_{i = 1}^{k} \pi_{i}^{z_{i}} $$because $z$ is a $k$-dimensional one-hot vector: its single component equal to $1$ indicates which Gaussian was chosen, and all the others are $0$.
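With the same one-hot convention, the conditional of $x$ given $z$ and the marginal obtained by summing $z$ out are:
$$ p(x \mid z) = \prod_{i = 1}^{k} \mathcal{N}(x \mid \mu_{i}, \Sigma_{i})^{z_{i}}, \qquad p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{i = 1}^{k} \pi_{i}\, \mathcal{N}(x \mid \mu_{i}, \Sigma_{i}) $$The marginal is the mixture density whose log-likelihood we work with below.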
The Maximum Likelihood problem
The frequentist maximum likelihood approach is prone to edge cases that make it difficult to apply to this density estimation problem. Let’s remember that in the case of Gaussian mixture models, our objective is the log-likelihood:
$$ \max_{\pi, \mu, \Sigma} \; \log p(X \mid \pi, \mu, \Sigma) = \max_{\pi, \mu, \Sigma} \; \sum_{n = 1}^{N} \log \left\{ \sum_{k = 1}^{K} \pi_{k} \, \mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k}) \right\} $$So our parameter space is the following
$$ \Theta = \left\{ (\pi_{k}, \mu_{k}, \Sigma_{k}) : k \leq K \right\} $$Let’s now see a case where this function is not well behaved. Consider a component whose covariance matrix is $\sigma_{i}^{2}I$ and suppose one of the data points $x_{n}$ coincides exactly with its mean $\mu_{i}$. Then the contribution of this particular Gaussian to the likelihood at $x_{n}$ is
$$ \mathcal{N}(x_{n} \mid \mu_{i} = x_{n}, \sigma_{i}^{2} I) = \frac{1}{(2\pi)^{d/2}\, \sigma_{i}^{d}} $$If we now let $\sigma_{i} \to 0$, which the optimizer is free to do because the mean sits exactly on that point, this value explodes and the whole log-likelihood goes to infinity. This is a degenerate solution we want to avoid. There are methods that mitigate this problem, but in these notes we will not explore them and instead focus on the expectation-maximization algorithm.
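A tiny numeric illustration of this collapse (a throwaway sketch, not part of the algorithm): the contribution above grows without bound as $\sigma_{i}$ shrinks.

```python
import numpy as np

# Log-contribution of a component whose mean sits exactly on a data point:
# in 1-D, log N(x_n | x_n, sigma^2) = -log(sqrt(2*pi) * sigma), which
# diverges to +infinity as sigma -> 0.
for sigma in [1.0, 1e-2, 1e-4, 1e-8]:
    print(f"sigma = {sigma:.0e}, log-contribution = {-np.log(np.sqrt(2 * np.pi) * sigma):.2f}")
```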
Ideas for the solution
Let’s consider for instance the joint $p(X, Z)$: given the parameters this quantity is easy to evaluate, and its logarithm is equal to
$$ \log p(X, Z) = \sum_{n = 1}^{N}\log \left\{ \pi_{z_{n}} \mathcal{N}(x_{n} \mid \mu_{z_{n}}, \Sigma_{z_{n}}) \right\} = \sum_{n = 1}^{N} \left(\log \pi_{z_{n}} + \log\mathcal{N}(x_{n} \mid \mu_{z_{n}}, \Sigma_{z_{n}})\right) $$This decomposition is quite nice: the logarithm acts directly on each Gaussian rather than on a sum of Gaussians. With this observation in mind we can rewrite our objective as
$$ \log p(X) = \mathbb{E}_{Z \sim q} \left[ \log \frac{P(X, Z)}{P(Z \mid X)} \right] $$using the product rule $p(X) = p(X, Z) / p(Z \mid X)$ and the fact that the left-hand side does not depend on $Z$, so we may take the expectation under any distribution $q$ over $Z$.
The interesting part comes when we multiply and divide by $q(Z)$ inside the logarithm: the expression then decomposes into two parts:
$$ \log p(X) = \mathbb{E}_{Z \sim q} \left[ \log \frac{P(X, Z)}{q(Z)} \right] + \mathbb{E}_{Z \sim q} \left[ \log \frac{q(Z)}{P(Z \mid X)} \right] = M(q, \theta) + E(q, \theta) $$We note that $E$ is a Kullback-Leibler divergence, so it is always non-negative, and therefore $\log p(X) \geq M(q, \theta)$.
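For completeness, the multiply-and-divide step is just inserting $q(Z)/q(Z)$ inside the logarithm and splitting the logarithm of the product:
$$ \mathbb{E}_{Z \sim q} \left[ \log \frac{P(X, Z)}{P(Z \mid X)} \right] = \mathbb{E}_{Z \sim q} \left[ \log \left( \frac{P(X, Z)}{q(Z)} \cdot \frac{q(Z)}{P(Z \mid X)} \right) \right] $$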
Another fundamental observation is that we can choose $q$ so that $E$ is zero: the divergence vanishes exactly when the two distributions coincide, so it suffices to set $q(Z) = p(Z \mid X)$, which we can compute because the posterior is tractable for this model.
The expectation-maximization algorithm
Dempster et al. (1977) and McLachlan and Krishnan (1997) are useful references for this method.
The algorithm in brief goes as follows:
- Set an initial value $\theta^{(0)}$
- for values $t=1,2,\dots$
- E step: set $q^{*}(Z) = p(Z \mid X, \theta^{(t-1)})$, so that $E(q^{*}, \theta^{(t-1)}) = 0$; this is just minimizing the divergence term.
- M step: set $\theta^{(t)} = \arg\max_{\theta} M(q^{*}, \theta)$, which is tractable because it only involves the complete-data likelihood $p(X, Z \mid \theta)$.
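In symbols, one iteration of the loop reads:
$$ \text{E step:} \quad q^{(t)}(Z) = p(Z \mid X, \theta^{(t-1)}), \qquad \text{M step:} \quad \theta^{(t)} = \arg\max_{\theta}\; \mathbb{E}_{Z \sim q^{(t)}} \left[ \log p(X, Z \mid \theta) \right] $$The M step takes this form because the entropy term $-\mathbb{E}_{q}[\log q(Z)]$ inside $M(q, \theta)$ does not depend on $\theta$, so maximizing $M$ over $\theta$ is the same as maximizing the expected complete-data log-likelihood.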
Deriving the expected mean
First we do some multivariable calculus to derive stationarity conditions for the maximum: taking the derivative of the log-likelihood with respect to $\mu_{k}$ and setting it to zero gives
$$ 0 = - \sum_{n = 1}^{N} \frac{\pi_{k}\mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})}{\sum_{j} \pi_{j} \mathcal{N} (x_{n} \mid \mu_{j}, \Sigma_{j})} \Sigma^{-1}_{k} (x_{n} - \mu_{k}) $$The awkward part is the sum of Gaussians in the denominator, but we can recognize that ratio as the posterior over the latent variable:
$$ p(z_{j} \mid x) = \frac{p (x \mid z_{j}) p(z_{j})}{\sum_{i} p(x \mid z_{i}) p(z_{i})} = \frac{\pi_{j} \mathcal{N}(x \mid \mu_{j}, \Sigma_{j})}{\sum_{i} \pi_{i} \mathcal{N}(x \mid \mu_{i}, \Sigma_{i})} $$Let’s call this quantity the responsibility $\gamma(z_{j})$, and introduce an index for the data point: if $x = x_{n}$ (meaning our variable $x$ is the $n$th sample) we write it $\gamma(z_{nj})$. We can then rewrite the condition above, after multiplying both sides by $\Sigma_{k}$ (which is invertible, so this does not change the solutions), as
$$ \sum_{n=1}^{N} \gamma(z_{nk})(x_{n} - \mu_{k}) = 0 \implies \mu_{k} = \frac{1}{N_{k}} \sum_{n = 1}^{N} \gamma(z_{nk}) x_{n} $$where $N_{k} = \sum_{n = 1}^{N} \gamma(z_{nk})$.
We can interpret $N_{k}$ as the effective number of points generated by Gaussian $k$, and the expression for $\mu_{k}$ is just the weighted average of the data points, weighted by the responsibilities. This gives an easy interpretation of the mean update in this algorithm.
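In code, the mean update is a couple of vectorized lines. A small sketch (the array names `X` and `gamma` and the random placeholders are mine, just to make it runnable):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 500, 3, 2
X = rng.normal(size=(N, d))                  # placeholder data
gamma = rng.dirichlet(np.ones(K), size=N)    # placeholder responsibilities, rows sum to 1

N_k = gamma.sum(axis=0)                      # N_k = sum_n gamma(z_nk)
mu = (gamma.T @ X) / N_k[:, None]            # mu_k = (1 / N_k) sum_n gamma(z_nk) x_n
print(N_k, mu, sep="\n")
```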
Deriving the expected covariance
This one is harder (the matrix calculus takes more work, and I still have not fully worked through the derivative), but the end result is very similar to the above: we have
$$ \Sigma_{k} = \frac{1}{N_{k}} \sum_{n = 1}^{N} \gamma(z_{nk})(x_{n} - \mu_{k})(x_{n} - \mu_{k})^{T} $$
Deriving the priors
By using Lagrange multipliers to enforce the constraint $\sum_{k} \pi_{k} = 1$, we find that $\pi_{k} = \frac{N_{k}}{N}$, which fixes the last unknown parameter. These updates are not a closed-form solution, because the responsibilities $\gamma$ on the right-hand side depend on the parameters themselves; this is exactly why the iterative expectation-maximization approach is used. The two steps of the algorithm are then the following:
- Compute the responsibilities $\gamma$ (the posterior over the latent variables) using the current parameters
- Compute the new means, covariances and priors with the formulas above and update the parameters
- Repeat until convergence.
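Putting the two steps together, here is a minimal NumPy sketch of the whole loop. This is my own illustrative implementation, not code from Bishop; it computes responsibilities in log-space and adds a small jitter to the covariance diagonals as a crude guard against the singularities discussed earlier.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Row-wise log N(x_n | mu, Sigma) for X of shape (N, d)."""
    d = X.shape[1]
    diff = X - mu
    sol = np.linalg.solve(Sigma, diff.T).T          # Sigma^{-1} (x_n - mu), row-wise
    quad = np.sum(diff * sol, axis=1)               # quadratic form per row
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def em_gmm(X, K, n_iter=100, seed=0, jitter=1e-6):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].copy()   # initialize means on data points
    Sigma = np.stack([np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities gamma[n, k] = p(z_k | x_n), computed in log-space.
        log_r = np.stack([np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k])
                          for k in range(K)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)          # stabilize before exponentiating
        gamma = np.exp(log_r)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: the closed-form updates derived above.
        N_k = gamma.sum(axis=0)
        mu = (gamma.T @ X) / N_k[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / N_k[k]
            Sigma[k] += jitter * np.eye(d)                  # crude guard against collapse
        pi = N_k / N
    return pi, mu, Sigma, gamma

# Toy usage: two well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[-3.0, 0.0], scale=1.0, size=(200, 2)),
               rng.normal(loc=[3.0, 0.0], scale=0.5, size=(200, 2))])
pi, mu, Sigma, gamma = em_gmm(X, K=2)
print("mixing coefficients:", pi)
print("means:\n", mu)
```

On this toy data the two mixing coefficients should come out roughly equal and the means should land near the two blob centers.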
It is guaranteed that the likelihood never decreases across iterations, but we might get stuck in local maxima or run into the singular solutions discussed earlier.
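The monotonicity follows in one line from the decomposition above:
$$ \log p(X \mid \theta^{(t-1)}) = M(q^{(t)}, \theta^{(t-1)}) \leq M(q^{(t)}, \theta^{(t)}) \leq \log p(X \mid \theta^{(t)}) $$where the equality holds because the E step drives the divergence $E$ to zero, the first inequality holds because the M step maximizes $M$ over $\theta$, and the second because $M$ is always a lower bound on $\log p(X)$.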
References
[1] C. M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.