Gaussian Mixture Models
These notes take inspiration from chapter 9.2 of (Bishop 2006). We assume that the reader already knows quite well what a Gaussian mixture model is, so here we will just restate the model. We will discuss the problem of estimating the best possible parameters (so, this is a density estimation problem) when the data is generated by a mixture of Gaussians.
Recall that the standard multivariate Gaussian density has this form:
$$ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}} \frac{1}{\lvert \Sigma \rvert^{1/2} } \exp \left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1}(x - \mu) \right) $$
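As a small sanity check on this formula, here is a minimal NumPy implementation compared against SciPy; the function name and the example values are my own additions, not part of the original notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    # (2 pi)^(-d/2) * |Sigma|^(-1/2) * exp(-1/2 (x - mu)^T Sigma^{-1} (x - mu))
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

x, mu, Sigma = np.array([1.0, 2.0]), np.zeros(2), np.eye(2)
print(gaussian_density(x, mu, Sigma))         # ~0.013064
print(multivariate_normal(mu, Sigma).pdf(x))  # same value from SciPy
```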
Problem statement 🟩
Given a set of data points $x_{1}, \dots, x_{n}$ in $\mathbb{R}^{d}$ sampled from a mixture of $K$ Gaussians, each with mixing coefficient $\pi_{k}$, the objective of this problem is to estimate the best $\pi_{k}$ for each Gaussian, together with the corresponding mean and covariance matrix. We will assume a latent-variable model with a variable $z$ which indicates which Gaussian has been chosen for a specific sample. We have this prior:
$$ p(z) = \prod_{i = 1}^{K} \pi_{i}^{z_{i}} $$because $z$ is a $K$-dimensional one-hot vector: it has a single entry equal to $1$, indicating which Gaussian was chosen.
The Maximum Likelihood problem 🟩–
The frequentist maximum-likelihood approach is prone to edge cases that make it difficult to apply to this density estimation problem. Recall that in the case of Gaussian mixture models the objective is to maximize the following log-likelihood:
$$ \max_{\pi, \mu, \Sigma} \log p(X \mid \pi, \mu, \Sigma) = \sum_{n = 1}^{N} \log \left\{ \sum_{k = 1}^{K} \pi_{k} \mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k}) \right\} $$So our parameter space is the following
$$ \Theta = \left\{ \pi_{k}, \mu_{k}, \Sigma_{k} : k \leq K \right\} $$Let's now look at a case where this function is not well behaved. Consider a component with covariance matrix $\sigma_{i}^{2}I$, and suppose one data point coincides exactly with its mean $\mu_{i}$. Then the contribution of this particular Gaussian to the likelihood at that point is
$$ \mathcal{N}(x_{n} \mid \mu_{i} = x_{n}, \sigma_{i}^{2} I) = \frac{1}{(2\pi)^{d/2} \sigma_{i}^{d}} $$If $\sigma_{i} \to 0$, which is plausible because the component sits on a single point placed at its mean, this value explodes and makes the whole log-likelihood go to infinity. This is a degenerate solution we want to avoid. There are methods that address this singularity, but we do not explore them here and focus instead on the expectation-maximization algorithm.
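A tiny numerical illustration of this singularity; the two-dimensional setup and the grid of variances are made up for the example: as $\sigma \to 0$ for a component centred exactly on a data point, that point's log-density term diverges.

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.0, 0.0])            # a data point that coincides with the component mean
for sigma in [1.0, 1e-1, 1e-3, 1e-6]:
    comp = multivariate_normal(mean=x, cov=sigma**2 * np.eye(2))
    # log N(x | mu = x, sigma^2 I) = -log(2 pi) - 2 log(sigma), which grows without bound
    print(sigma, comp.logpdf(x))
```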
Ideas for the solution 🟨
Let's consider, for instance, the complete-data likelihood $p(X, Z)$: this is a tractable quantity, and its logarithm is equal to
$$ \log p(X, Z) = \sum_{n = 1}^{N}\log \left\{ \pi_{z_{n}} \mathcal{N}(x_{n} \mid \mu_{z_{n}}, \Sigma_{z_{n}}) \right\} = \sum_{n = 1}^{N} \left(\log \pi_{z_{n}} + \log\mathcal{N}(x_{n} \mid \mu_{z_{n}}, \Sigma_{z_{n}})\right) $$This decomposition is quite convenient. With this observation we can write our objective as
$$ \log p(X) = \mathbb{E}_{Z \sim q} \left[ \log \frac{P(X, Z)}{P(Z \mid X)} \right] $$which follows from the product rule: the ratio inside the logarithm equals $p(X)$ and does not depend on $Z$, so taking the expectation over any $q$ leaves it unchanged.
The interesting part comes when we multiply and divide by $q(Z)$; we can then decompose it further into two parts:
$$ \log p(X) = \mathbb{E}_{Z \sim q} \left[ \log \frac{P(X, Z)}{q(Z)} \right] + \mathbb{E}_{Z \sim q} \left[ \log \frac{q(Z)}{P(Z \mid X)} \right] = M(q, \theta) + E(q, \theta) $$We note that $E$ is a Kullback-Leibler divergence, so it is always non-negative, and therefore $\log p(X) \geq M(q, \theta)$.
Another fundamental observation is that we can choose $q$ so that $E$ is zero, because the divergence vanishes exactly when the two distributions coincide. This is feasible because, for a mixture of Gaussians, the posterior $p(Z \mid X)$ can be computed in closed form.
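Written out explicitly, the two terms are the evidence lower bound and a Kullback-Leibler divergence (these names are not used above, but they are the standard ones):

$$ \log p(X) \;=\; \underbrace{\mathbb{E}_{Z \sim q}\left[ \log \frac{p(X, Z)}{q(Z)} \right]}_{M(q, \theta)\ \text{(ELBO)}} \;+\; \underbrace{\operatorname{KL}\left( q(Z) \,\Vert\, p(Z \mid X) \right)}_{E(q, \theta)\ \geq\ 0} $$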
The expectation-maximization algorithm 🟩–
Dempster et al. (1977) and McLachlan and Krishnan (1997) are useful references for this method.
The algorithm in brief goes as follows:
- Set an initial value $\theta^{(0)}$
- for values $t=1,2,\dots$
- Set $q^{*}$ such that $E(q, \theta^{(t-1)}) = 0$, which amounts to minimizing this term over $q$ (the E-step).
- Set $\theta^{(t)}$ to the maximizer of $M(q^{*}, \theta)$ (the M-step). This is tractable because it only requires adjusting the parameters of $p(X, Z)$, which is easy to work with.
From a higher-level view:
- Compute the posterior responsibilities $\gamma$
- Compute the best means, covariances and mixing weights with the formulas derived below, and update them
- Repeat until convergence.
It is guaranteed that the likelihood never decreases, but we might get stuck in local maxima and similar issues.
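Here is a minimal NumPy/SciPy sketch of the whole loop. This is not the reference implementation of (Bishop 2006): the function names (`e_step`, `m_step`, `log_likelihood`, `fit_gmm`), the initialization strategy and the stopping rule are my own choices, while the update formulas are the ones derived in the sections below.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigmas):
    # gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
    N, K = X.shape[0], len(pis)
    gamma = np.zeros((N, K))
    for k in range(K):
        gamma[:, k] = pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

def m_step(X, gamma):
    N, d = X.shape
    Nk = gamma.sum(axis=0)                      # effective number of points per component
    pis = Nk / N                                # mixing weights
    mus = (gamma.T @ X) / Nk[:, None]           # responsibility-weighted means
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pis, mus, np.array(Sigmas)

def log_likelihood(X, pis, mus, Sigmas):
    dens = sum(pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
               for k in range(len(pis)))
    return np.log(dens).sum()

def fit_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]    # initialise means at random data points
    # small ridge on the covariance to stay away from the singularity discussed above
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    ll_old = -np.inf
    for _ in range(n_iter):
        gamma = e_step(X, pis, mus, Sigmas)     # E-step
        pis, mus, Sigmas = m_step(X, gamma)     # M-step
        ll = log_likelihood(X, pis, mus, Sigmas)
        if ll - ll_old < tol:                   # monotone increase gives a simple stopping rule
            break
        ll_old = ll
    return pis, mus, Sigmas
```

The log-likelihood trace produced by this loop is non-decreasing, which is exactly the property discussed in the next section.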
Convergence of EM
This is just a bounded, monotone optimization problem: the sequence of log-likelihood values is non-decreasing and bounded above, so by the convergence theorem for sequences (see Successioni), which asserts that a bounded monotone sequence always has a limit and that this limit is unique, the iteration converges.
We know that $\log p(X) = M(q, \theta) + E(q, \theta)$; after the E step, the corresponding Kullback-Leibler divergence is 0, so we have $\log p(X) = M(q', \theta)$, where $q'$ is the updated variational distribution.
Then, if we set $\theta^{'} = \arg\max_{\theta} M(q', \theta)$, we have the following chain of inequalities:
$$ \log p_{\theta'}(X) \geq M(q', \theta') \geq M(q', \theta) = \log p_{\theta}(X) $$which shows that the log-likelihood is non-decreasing across iterations; whenever the log-likelihood is bounded above, the sequence therefore converges.
The importance of the class
If we assume we know which class each point $x$ belongs to, the problem actually becomes quite easy. This is the original problem that concerns k-means too! We do not know a priori which class generated the point $x$, so we take the expected value over all possibilities, and this is what usually makes the problem hard.
This part corresponds to the E-step of the algorithm. In the case of Gaussian Mixture Models, it just corresponds to setting $q(z) = p(z \mid x)$, which is:
$$ q(z = i) = p(z = i \mid x) = \frac{p(x, z = i)}{p(x)} = \frac{\pi_{i}\mathcal{N}(x \mid \mu_{i}, \Sigma_{i})}{\sum_{j} \pi_{j}\mathcal{N}(x \mid \mu_{j}, \Sigma_{j})} $$
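For a single point this is just a weighted density evaluation followed by a normalization; here is a minimal sketch, where the point and the parameter values are made up for the example:

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.5, -1.0])
pis = np.array([0.3, 0.7])
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

unnorm = np.array([pi * multivariate_normal(mu, S).pdf(x)
                   for pi, mu, S in zip(pis, mus, Sigmas)])
q = unnorm / unnorm.sum()   # q(z = i) = pi_i N(x | mu_i, Sigma_i) / sum_j pi_j N(x | mu_j, Sigma_j)
```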
The Loss Function
It is possible to define a loss function with respect to the parameters $\pi, \Sigma, \mu$ after the variational posterior has been fitted in the $E$ step.
We denote
$$ \gamma(z_{nk}) = p(z_{k} = 1 \mid x_{n}) = \frac{p (x_{n} \mid z_{k} = 1)\, p(z_{k} = 1)}{\sum_{j} p(x_{n} \mid z_{j} = 1)\, p(z_{j} = 1)} = \frac{\pi_{k} \mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})}{\sum_{j} \pi_{j} \mathcal{N}(x_{n} \mid \mu_{j}, \Sigma_{j})} $$Then we can write the loss function as
$$ \begin{align} \mathcal{L}(\pi, \Sigma, \mu) = \sum_{n = 1}^{N} \sum_{k = 1}^{K} \gamma(z_{nk}) \left( \frac{1}{2}\log \lvert \Sigma_{k} \rvert + \frac{1}{2} (x_{n} - \mu_{k})^{T} \Sigma^{-1}_{k} (x_{n} - \mu_{k}) - \log \pi_{k} \right) \end{align} $$This is the negative expected complete-data log-likelihood, up to additive constants that do not depend on the parameters. You can then differentiate this loss to get the best mean, $\Sigma$ and $\pi$. This follows (Murphy 2012).
Deriving the expected mean 🟨++
Then we continue with the maximization step, which consists of finding the best parameters given this new variational distribution.
First we want to do some multivariable calculus in order to derive conditions for a stationary point. For this reason we take the derivative of the log-likelihood with respect to $\mu_{k}$, set it to zero, and derive that
$$ \begin{align} \frac{ \partial \log p(X) }{ \partial \mu_{k} } &= \sum_{n = 1}^{N} \frac{\pi_{k}\mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})}{\sum_{j} \pi_{j} \mathcal{N} (x_{n} \mid \mu_{j}, \Sigma_{j})} \frac{ \partial }{ \partial \mu_{k} } \left( -\frac{1}{2} (x_{n} - \mu_{k})^{T} \Sigma^{-1}_{k}(x_{n} - \mu_{k}) \right) \\ &\implies \sum_{n = 1}^{N} \frac{\pi_{k}\mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})}{\sum_{j} \pi_{j} \mathcal{N} (x_{n} \mid \mu_{j}, \Sigma_{j})} \Sigma^{-1}_{k} (x_{n} - \mu_{k}) = 0 \\ &\implies \sum_{n = 1}^{N} \gamma(z_{nk})\, \Sigma^{-1}_{k}(x_{n} - \mu_{k}) = 0 \\ &\implies \sum_{n = 1}^{N} \gamma(z_{nk})(x_{n} - \mu_{k}) = 0 \\ &\implies \mu_{k} = \frac{1}{N_{k}} \sum_{n = 1}^{N} \gamma(z_{nk})\, x_{n}
\end{align} $$
where $N_{k} = \sum_{n = 1}^{N} \gamma(z_{nk})$.
We can interpret $N_{k}$ as the effective number of points generated by Gaussian $k$, and the sum is just a weighted average of the points assigned to component $k$! This gives an easy interpretation of the mean update in the maximization step of this algorithm.
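A minimal NumPy rendering of this update, where `gamma` is the $N \times K$ matrix of responsibilities from the E-step; the function name is my own choice:

```python
import numpy as np

def update_means(X, gamma):
    # N_k = sum_n gamma_{nk};  mu_k = (1 / N_k) sum_n gamma_{nk} x_n
    Nk = gamma.sum(axis=0)               # shape (K,)
    return (gamma.T @ X) / Nk[:, None]   # shape (K, d)
```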
Deriving the expected deviation
This one is harder, and I still have not understood how exactly this matrix derivative is done, but the end result is very similar to the one above; we have
$$ \Sigma_{k} = \frac{1}{N_{k}} \sum_{n = 1}^{N} \gamma(z_{nk})(x_{n} - \mu_{k})(x_{n} -\mu_{k})^{T} $$
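A corresponding sketch for this update, together with the mixing-weight update $\pi_{k} = N_{k} / N$, which is not derived above but is the standard result of maximizing over $\pi$ under the constraint $\sum_{k} \pi_{k} = 1$; function and variable names are my own choices:

```python
import numpy as np

def update_covariances_and_weights(X, gamma, mus):
    N, d = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)
    Sigmas = np.zeros((K, d, d))
    for k in range(K):
        diff = X - mus[k]                                    # (N, d)
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pis = Nk / N                                             # pi_k = N_k / N
    return Sigmas, pis
```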
Selecting the number of Clusters
You can check this in chapter 25.2. There is a problem at the very beginning: even before we can apply the EM algorithm to estimate the density, we need to choose the hyperparameter $k$, the number of classes that we assume to exist. We would like a way of choosing $k$ that is more principled than a brute-force search over all possible values.
With the stick-breaking idea, we assume an infinite number of clusters; then only some of them will actually have realizations in the data. We have the result that as $N \to \infty$ we will have a realization of every cluster. Having a realization means that at least one sample $x$ belongs to that class.
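As a small illustration, here is a sketch of the standard stick-breaking construction of the mixture weights, $v_{i} \sim \mathrm{Beta}(1, \alpha)$ and $\pi_{i} = v_{i} \prod_{j < i} (1 - v_{j})$; the concentration parameter `alpha` and the truncation level are assumptions of this example, not something fixed by the notes above:

```python
import numpy as np

def stick_breaking_weights(alpha=1.0, truncation=50, seed=0):
    # v_i ~ Beta(1, alpha);  pi_i = v_i * prod_{j<i} (1 - v_j)
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining   # sums to < 1; the leftover mass belongs to the untruncated clusters
```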
We will see how to select this in the following section, where we delve into Dirichlet processes.
K-Means
The problem 🟩
Let's say we have a set of $d$-dimensional points $\mathcal{X} = \left\{ x_{1}, \dots, x_{n} \right\}$. We would like to learn a function
$$ c : \mathbb{R}^{d} \to \left\{ 1, \dots, k \right\} $$that assigns a unique label to each point.
Each class has a prototype, a representative of that class. In classical k-means we would like to minimize the squared distance (or another distance function) between each example and the prototype of its assigned class. We can write:
$$ R(c, \mathcal{Y}) = \sum_{i = 1}^{N} \lVert x_{i} - \mu_{c(x_{i})} \rVert ^{2} $$
The Algorithm 🟩
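A minimal NumPy sketch of the standard Lloyd iteration for this objective, alternating between assigning each point to its nearest prototype and recomputing each prototype as the mean of its assigned points; the function name and the random initialization are my own choices:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), k, replace=False)]   # initialise prototypes at random data points
    for _ in range(n_iter):
        # assignment step: c(x_i) = argmin_j ||x_i - mu_j||^2
        dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        c = dists.argmin(axis=1)
        # update step: each prototype becomes the mean of its assigned points
        new_mus = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mus[j]
                            for j in range(k)])
        if np.allclose(new_mus, mus):
            break
        mus = new_mus
    return c, mus
```

Each iteration can only decrease $R(c, \mathcal{Y})$, mirroring the monotonicity argument used for EM above.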
References
[1] Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
[2] Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.