Gaussian Mixture Models
These notes take inspiration from chapter 9.2 of (Bishop 2006). We assume that the reader already knows quite well what a Gaussian mixture model is, so here we will just restate the model. We will discuss the problem of estimating the best possible parameters (so, this is a density estimation problem) when the data is generated by a mixture of Gaussians.
Recall that the standard multivariate Gaussian density has this form:
$$ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}} \frac{1}{\lvert \Sigma \rvert^{1/2} } \exp \left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1}(x - \mu) \right) $$
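As a small sanity check on this formula, here is a minimal NumPy implementation compared against SciPy; the function name and the example values are my own additions, not part of the original notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    # (2 pi)^(-d/2) * |Sigma|^(-1/2) * exp(-1/2 (x - mu)^T Sigma^{-1} (x - mu))
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

x, mu, Sigma = np.array([1.0, 2.0]), np.zeros(2), np.eye(2)
print(gaussian_density(x, mu, Sigma))         # ~0.013064
print(multivariate_normal(mu, Sigma).pdf(x))  # same value from SciPy
```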
Problem statement 🟩
Given a set of data points $x_{1}, \dots, x_{n}$ in $\mathbb{R}^{d}$ sampled from a mixture of $K$ Gaussians, each with mixing coefficient $\pi_{k}$, the objective of this problem is to estimate the best $\pi_{k}$ for each Gaussian, together with the corresponding mean and covariance matrix. We will assume a latent-variable model with a variable $z$ which indicates which Gaussian has been chosen for a specific sample. We have this prior:
$$ p(z) = \prod_{i = 1}^{K} \pi_{i}^{z_{i}} $$because $z$ is a $K$-dimensional one-hot vector: it has a single entry equal to $1$, indicating which Gaussian was chosen.
The Maximum Likelihood problem 🟩–
The frequentist maximum-likelihood approach is prone to edge cases that make it difficult to apply to this density estimation problem. Recall that in the case of Gaussian mixture models the objective is to maximize the following log-likelihood:
$$ \max_{\pi, \mu, \Sigma} \log p(X \mid \pi, \mu, \Sigma) = \sum_{n = 1}^{N} \log \left\{ \sum_{k = 1}^{K} \pi_{k} \mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k}) \right\} $$So our parameter space is the following
$$ \Theta = \left\{ \pi_{k}, \mu_{k}, \Sigma_{k} : k \leq K \right\} $$Let's now look at a case where this function is not well behaved. Consider a component with covariance matrix $\sigma_{i}^{2}I$, and suppose one data point coincides exactly with its mean $\mu_{i}$. Then the contribution of this particular Gaussian to the likelihood at that point is
$$ \mathcal{N}(x_{n} \mid \mu_{i} = x_{n}, \sigma_{i}^{2} I) = \frac{1}{(2\pi)^{d/2} \sigma_{i}^{d}} $$If $\sigma_{i} \to 0$, which is plausible because the component sits on a single point placed at its mean, this value explodes and makes the whole log-likelihood go to infinity. This is a degenerate solution we want to avoid. There are methods that address this singularity, but we do not explore them here and focus instead on the expectation-maximization algorithm.
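A tiny numerical illustration of this singularity; the two-dimensional setup and the grid of variances are made up for the example: as $\sigma \to 0$ for a component centred exactly on a data point, that point's log-density term diverges.

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.0, 0.0])            # a data point that coincides with the component mean
for sigma in [1.0, 1e-1, 1e-3, 1e-6]:
    comp = multivariate_normal(mean=x, cov=sigma**2 * np.eye(2))
    # log N(x | mu = x, sigma^2 I) = -log(2 pi) - 2 log(sigma), which grows without bound
    print(sigma, comp.logpdf(x))
```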
Ideas for the solution 🟨
Let's consider, for instance, the complete-data likelihood $p(X, Z)$: this is a tractable quantity, and its logarithm is equal to
$$ \log p(X, Z) = \sum_{n = 1}^{N}\log \left\{ \pi_{z_{n}} \mathcal{N}(x_{n} \mid \mu_{z_{n}}, \Sigma_{z_{n}}) \right\} = \sum_{n = 1}^{N} \left(\log \pi_{z_{n}} + \log\mathcal{N}(x_{n} \mid \mu_{z_{n}}, \Sigma_{z_{n}})\right) $$This decomposition is quite convenient. With this observation we can write our objective as
$$ \log p(X) = \mathbb{E}_{Z \sim q} \left[ \log \frac{P(X, Z)}{P(Z \mid X)} \right] $$which follows from the product rule: the ratio inside the logarithm equals $p(X)$ and does not depend on $Z$, so taking the expectation over any $q$ leaves it unchanged.
The interesting part comes when we multiply and divide by $q(Z)$; we can then decompose it further into two parts:
$$ \log p(X) = \mathbb{E}_{Z \sim q} \left[ \log \frac{P(X, Z)}{q(Z)} \right] + \mathbb{E}_{Z \sim q} \left[ \log \frac{q(Z)}{P(Z \mid X)} \right] = M(q, \theta) + E(q, \theta) $$We note that $E$ is a Kullback-Leibler divergence, so it is always non-negative, and therefore $\log p(X) \geq M(q, \theta)$.
Another fundamental observation is that we can choose $q$ so that $E$ is zero, because the divergence vanishes exactly when the two distributions coincide. This is feasible because, for a mixture of Gaussians, the posterior $p(Z \mid X)$ can be computed in closed form.
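Written out explicitly, the two terms are the evidence lower bound and a Kullback-Leibler divergence (these names are not used above, but they are the standard ones):

$$ \log p(X) \;=\; \underbrace{\mathbb{E}_{Z \sim q}\left[ \log \frac{p(X, Z)}{q(Z)} \right]}_{M(q, \theta)\ \text{(ELBO)}} \;+\; \underbrace{\operatorname{KL}\left( q(Z) \,\Vert\, p(Z \mid X) \right)}_{E(q, \theta)\ \geq\ 0} $$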
The expectation-maximization algorithm 🟩–
Dempster et al. (1977) and McLachlan and Krishnan (1997) are useful references for this method.
The algorithm in brief goes as follows:
- Set an initial value $\theta^{(0)}$
- for values $t=1,2,\dots$
- Set $q^{*}$ such that $E(q, \theta^{(t-1)}) = 0$, which amounts to minimizing this term over $q$ (the E-step).
- Set $\theta^{(t)}$ to the maximizer of $M(q^{*}, \theta)$ (the M-step). This is tractable because it only requires adjusting the parameters of $p(X, Z)$, which is easy to work with.
From a higher-level view:
- Compute the posterior responsibilities $\gamma$
- Compute the best means, covariances and mixing weights with the formulas derived below, and update them
- Repeat until convergence.
It is guaranteed that the likelihood never decreases, but we might get stuck in local maxima and similar issues.
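Here is a minimal NumPy/SciPy sketch of the whole loop. This is not the reference implementation of (Bishop 2006): the function names (`e_step`, `m_step`, `log_likelihood`, `fit_gmm`), the initialization strategy and the stopping rule are my own choices, while the update formulas are the ones derived in the sections below.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigmas):
    # gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
    N, K = X.shape[0], len(pis)
    gamma = np.zeros((N, K))
    for k in range(K):
        gamma[:, k] = pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

def m_step(X, gamma):
    N, d = X.shape
    Nk = gamma.sum(axis=0)                      # effective number of points per component
    pis = Nk / N                                # mixing weights
    mus = (gamma.T @ X) / Nk[:, None]           # responsibility-weighted means
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pis, mus, np.array(Sigmas)

def log_likelihood(X, pis, mus, Sigmas):
    dens = sum(pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
               for k in range(len(pis)))
    return np.log(dens).sum()

def fit_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]    # initialise means at random data points
    # small ridge on the covariance to stay away from the singularity discussed above
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    ll_old = -np.inf
    for _ in range(n_iter):
        gamma = e_step(X, pis, mus, Sigmas)     # E-step
        pis, mus, Sigmas = m_step(X, gamma)     # M-step
        ll = log_likelihood(X, pis, mus, Sigmas)
        if ll - ll_old < tol:                   # monotone increase gives a simple stopping rule
            break
        ll_old = ll
    return pis, mus, Sigmas
```

The log-likelihood trace produced by this loop is non-decreasing, which is exactly the property discussed in the next section.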
Convergence of EM
This is just a bounded, monotone optimization problem: the sequence of log-likelihood values is non-decreasing and bounded above, so by the convergence theorem for sequences (see Successioni), which asserts that a bounded monotone sequence always has a limit and that this limit is unique, the iteration converges.
We know that $\log p(X) = M(q, \theta) + E(q, \theta)$; after the E step, the corresponding Kullback-Leibler divergence is 0, so we have $\log p(X) = M(q', \theta)$, where $q'$ is the updated variational distribution.
Then, if we set $\theta^{'} = \arg\max_{\theta} M(q', \theta)$, we have the following chain of inequalities:
$$ \log p_{\theta'}(X) \geq M(q', \theta') \geq M(q', \theta) = \log p_{\theta}(X) $$which shows that the log-likelihood is non-decreasing across iterations; whenever the log-likelihood is bounded above, the sequence therefore converges.
The importance of the class
If we assume we know which class each point $x$ belongs to, the problem actually becomes quite easy. This is the original problem that concerns k-means too! We do not know a priori which class generated the point $x$, so we take the expected value over all possibilities, and this is what usually makes the problem hard.
This part corresponds to the E-step of the algorithm. In the case of Gaussian Mixture Models, it just corresponds to setting $q(z) = p(z \mid x)$, which is:
$$ q(z = i) = p(z = i \mid x) = \frac{p(x, z = i)}{p(x)} = \frac{\pi_{i}\mathcal{N}(x \mid \mu_{i}, \Sigma_{i})}{\sum_{j} \pi_{j}\mathcal{N}(x \mid \mu_{j}, \Sigma_{j})} $$
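For a single point this is just a weighted density evaluation followed by a normalization; here is a minimal sketch, where the point and the parameter values are made up for the example:

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.5, -1.0])
pis = np.array([0.3, 0.7])
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

unnorm = np.array([pi * multivariate_normal(mu, S).pdf(x)
                   for pi, mu, S in zip(pis, mus, Sigmas)])
q = unnorm / unnorm.sum()   # q(z = i) = pi_i N(x | mu_i, Sigma_i) / sum_j pi_j N(x | mu_j, Sigma_j)
```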
The Loss Function
It is possible to define a loss function with respect to the parameters $\pi, \Sigma, \mu$ after the variational posterior has been fitted in the $E$ step.
We denote
$$ \gamma(z_{nk}) = p(z_{k} = 1 \mid x_{n}) = \frac{p (x_{n} \mid z_{k} = 1)\, p(z_{k} = 1)}{\sum_{j} p(x_{n} \mid z_{j} = 1)\, p(z_{j} = 1)} = \frac{\pi_{k} \mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})}{\sum_{j} \pi_{j} \mathcal{N}(x_{n} \mid \mu_{j}, \Sigma_{j})} $$Then we can write the loss function as
$$ \begin{align} \mathcal{L}(\pi, \Sigma, \mu) = \sum_{n = 1}^{N} \sum_{k = 1}^{K} \gamma(z_{nk}) \left( \frac{1}{2}\log \lvert \Sigma_{k} \rvert + \frac{1}{2} (x_{n} - \mu_{k})^{T} \Sigma^{-1}_{k} (x_{n} - \mu_{k}) - \log \pi_{k} \right) \end{align} $$This is the negative expected complete-data log-likelihood, up to additive constants that do not depend on the parameters. You can then differentiate this loss to get the best mean, $\Sigma$ and $\pi$. This follows (Murphy 2012).
Deriving the expected mean 🟨++
Then we continue with the maximization step, which consists of finding the best parameters given this new variational distribution.
First we want to do some multivariable calculus in order to derive conditions for a stationary point. For this reason we take the derivative of the log-likelihood with respect to $\mu_{k}$, set it to zero, and derive that
$$ \begin{align} \frac{ \partial \log p(X) }{ \partial \mu_{k} } &= \sum_{n = 1}^{N} \frac{\pi_{k}\mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})}{\sum_{j} \pi_{j} \mathcal{N} (x_{n} \mid \mu_{j}, \Sigma_{j})} \frac{ \partial }{ \partial \mu_{k} } \left( -\frac{1}{2} (x_{n} - \mu_{k})^{T} \Sigma^{-1}_{k}(x_{n} - \mu_{k}) \right) \\ &\implies \sum_{n = 1}^{N} \frac{\pi_{k}\mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})}{\sum_{j} \pi_{j} \mathcal{N} (x_{n} \mid \mu_{j}, \Sigma_{j})} \Sigma^{-1}_{k} (x_{n} - \mu_{k}) = 0 \\ &\implies \sum_{n = 1}^{N} \gamma(z_{nk})\, \Sigma^{-1}_{k}(x_{n} - \mu_{k}) = 0 \\ &\implies \sum_{n = 1}^{N} \gamma(z_{nk})(x_{n} - \mu_{k}) = 0 \\ &\implies \mu_{k} = \frac{1}{N_{k}} \sum_{n = 1}^{N} \gamma(z_{nk})\, x_{n}
\end{align} $$
where $N_{k} = \sum_{n = 1}^{N} \gamma(z_{nk})$.
We can interpret $N_{k}$ as the effective number of points generated by Gaussian $k$, and the sum is just a weighted average of the points assigned to component $k$! This gives an easy interpretation of the mean update in the maximization step of this algorithm.
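A minimal NumPy rendering of this update, where `gamma` is the $N \times K$ matrix of responsibilities from the E-step; the function name is my own choice:

```python
import numpy as np

def update_means(X, gamma):
    # N_k = sum_n gamma_{nk};  mu_k = (1 / N_k) sum_n gamma_{nk} x_n
    Nk = gamma.sum(axis=0)               # shape (K,)
    return (gamma.T @ X) / Nk[:, None]   # shape (K, d)
```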
Deriving the expected deviation
This one is harder, and I still have not understood how exactly this matrix derivative is done, but the end result is very similar to the one above; we have
$$ \Sigma_{k} = \frac{1}{N_{k}} \sum_{n = 1}^{N} \gamma(z_{nk})(x_{n} - \mu_{k})(x_{n} -\mu_{k})^{T} $$
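A corresponding sketch for this update, together with the mixing-weight update $\pi_{k} = N_{k} / N$, which is not derived above but is the standard result of maximizing over $\pi$ under the constraint $\sum_{k} \pi_{k} = 1$; function and variable names are my own choices:

```python
import numpy as np

def update_covariances_and_weights(X, gamma, mus):
    N, d = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)
    Sigmas = np.zeros((K, d, d))
    for k in range(K):
        diff = X - mus[k]                                    # (N, d)
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pis = Nk / N                                             # pi_k = N_k / N
    return Sigmas, pis
```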
Selecting the number of Clusters
You can check this in chapter 25.2. There is a problem at the very beginning: even before we can apply the EM algorithm to estimate the density, we need to choose the hyperparameter $k$, the number of classes that we assume to exist. We would like a way of choosing $k$ that is more principled than a brute-force search over all possible values.
With the stick-breaking idea, we assume an infinite number of clusters; then only some of them will actually have realizations in the data. We have the result that as $N \to \infty$ we will have a realization of every cluster. Having a realization means that at least one sample $x$ belongs to that class.
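As a small illustration, here is a sketch of the standard stick-breaking construction of the mixture weights, $v_{i} \sim \mathrm{Beta}(1, \alpha)$ and $\pi_{i} = v_{i} \prod_{j < i} (1 - v_{j})$; the concentration parameter `alpha` and the truncation level are assumptions of this example, not something fixed by the notes above:

```python
import numpy as np

def stick_breaking_weights(alpha=1.0, truncation=50, seed=0):
    # v_i ~ Beta(1, alpha);  pi_i = v_i * prod_{j<i} (1 - v_j)
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining   # sums to < 1; the leftover mass belongs to the untruncated clusters
```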
We will see how to select this in the following section, where we delve into Dirichlet processes.
K-Means
The problem 🟩
Let's say we have a set of $d$-dimensional points $\mathcal{X} = \left\{ x_{1}, \dots, x_{n} \right\}$. We would like to learn a function
$$ c : \mathbb{R}^{d} \to \left\{ 1, \dots, k \right\} $$that assigns a unique label to each point.
Each class has a prototype, a representative of that class. In classical k-means we would like to minimize the squared distance (or another distance function) between each example and the prototype of its assigned class. We can write:
$$ R(c, \mathcal{Y}) = \sum_{i = 1}^{N} \lVert x_{i} - \mu_{c(x_{i})} \rVert ^{2} $$
The Algorithm 🟩
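A minimal NumPy sketch of the standard Lloyd iteration for this objective, alternating between assigning each point to its nearest prototype and recomputing each prototype as the mean of its assigned points; the function name and the random initialization are my own choices:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), k, replace=False)]   # initialise prototypes at random data points
    for _ in range(n_iter):
        # assignment step: c(x_i) = argmin_j ||x_i - mu_j||^2
        dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        c = dists.argmin(axis=1)
        # update step: each prototype becomes the mean of its assigned points
        new_mus = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mus[j]
                            for j in range(k)])
        if np.allclose(new_mus, mus):
            break
        mus = new_mus
    return c, mus
```

Each iteration can only decrease $R(c, \mathcal{Y})$, mirroring the monotonicity argument used for EM above.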
References
[1] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
[2] Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.