Gaussian Mixture Models

This set takes inspiration from chapter 9.2 of (Bishop 2006). We assume that the reader already knows quite well what is a Gaussian Mixture Model and we will just restate the models here. We will discuss the problem of estimating the best possible parameters (so, this is a density estimation problem) when the data is generated by a mixture of Gaussians.

Remember that the standard multivariate Gaussian has this format:

Problem statement

Given a set of data points in sampled by Gaussian each with responsibility the objective of this problem is to estimate the best for each Gaussian and the relative mean and covariance matrix. We will assume a latent model with a variable which represents which Gaussian has been chosen for a specific sample. We have this prior:

Because we know that is a dimensional vector that has a single digit indicating which Gaussian was chosen.

The Maximum Likelihood problem

The frequentist approach with Maximum likelihood is quite probable to give rise to particular edge-cases that make this method difficult to apply for this density estimation problem. Let's remember that in the case of Gaussian mixture models, our loss function is the following:

So our parameter space is the following

Let's see now a case where this function is not well behaved. Let's consider the covariance matrix to be and let's say we have sampled a single point that is exactly then we have that the contribution of this particular Gaussian to our loss function is

If we have a single point, and which is reasonable because we have a single point on the mean, then this value explodes and makes the whole log-likelihood to go to infinity. This is a case we don't want to explore. There are some methods that try to solve this problem. But in this setting we don't want to explore this, and focus on the expectation maximization algorithm.

Ideas for the solution

Let's consider for instance the value this is a known value and it's equal to

That decomposition is quite nice. Having had this observation we can write our objective as

Using product rule and using the expectation to get rid of the .

The interesting part comes when we multiply and divide by then we can decompose it further into two parts:

We note that is a Kullback-Leibler divergence so it's always positive, and we have . The part is also known as the ELBO (see Variational Inference).

Another fundamental operation is that we can find the parameters of such that is null, because we know that if two distributions are the same then the divergence is null. We can compute this because we know the values of the posterior.

The expectation-maximization algorithm

Dempster et al., 1977; McLachlan and Krishnan, 1997 are useful references for this method.

Structure of the Algorithm

The algorithm in brief goes as follows:

  1. Set an initial value
  2. for values
    1. Set such that , which is just minimizing this value.
    2. Set to the max of . By adequately changing parameters for which is tractable.

From a more high level view:

  1. Compute the posterior
  2. Compute the best mean, variance and priors with the formula above and update them
  3. Repeat until convergence.

It is guaranteed that the likelihood is increasing, but we might be stuck on local maxima and similar things.

Convergence of EM

This is just a bounded optimization problem, after which you use the convergence theorem in Successioni which asserts that the limit for bounded monotone sequences always exists and is Unique.

We know that , after the step, the corresponding Kullback-Leibler divergence is 0, so we have where is the updated Variational estimator.

Then, if we set , we have the following equations:

Which is a increasing sequence. The upper bound is trivial, by axiomatic definition of .

The importance of the class

If you assume to know the class for which the point is part of, then the problems becomes actually quite easy. This is the original problem that concerns k-means too! We don't know a priori which class has been used to generate the point , so taking the expected value accounting for each possibility makes this usually quite hard.

This part corresponds to the E-step of the algorithm. In the case of Gaussian Mixture Models, this just corresponds setting Which is just:

The Loss Function

It is possible to define a loss function with respect to the parameters after the variational posterior has been fitted in the step.

We denote

Then we can write the loss function as

Then you can derive this loss to get the best mean, Sigma and . This follows (Murphy 2012), but I am not sure I modified it correctly.

Deriving the expected mean

Then we continue with the maximization step, which is finding the best variables under this new variational family.

First we want to do some multivariable analysis in order to derive some conditions of the minima, for this reason we take the derivative with respect to of the loss equation, and we derive that

Where

we can interpret to be the number of points generated by the Gaussian , and the internal part is just the weighted average of the points generated by ! This gives an easy interpretation of the mean of the expectation part of this algorithm.

Deriving the expected deviation

This one is harder, and I still have not understood how exactly this matrix derivative is done, but the end results is very similar to the above, we have

Selecting the number of Clusters

You can check this in chapter 25.2 There is a problem at the beginning, even before you can apply the EM algorithm to estimate the probability, we need to choose the hyperparameter for the number of the classes that we are assuming to exist. We need a way to find a solution to find that could be more principled than just searching over the possible in a bruteforce manner.

With the stick breaking idea, we assume to have and infinite number of clusters. Then we will have some realizations of the clusters. We have the result that with we will have a realization of every cluster. Having a realization means we have one member of that is part of this class.

We discover how to select this with the following section, where we delve into Dirichlet processes.

K-Means

The problem

Let's say we have a set of dimensional points We would like to learn a function

That assigns each point some unique label.

We consider the prototype which is a representative of one class. In classical k-means we would like to minimize the the squared distance (or other distance function) for each example of the row. We can write:

The Algorithm

Non-parametric Modeling-20241205152331188

References

[1] Bishop “Pattern Recognition and Machine Learning” Springer 2006

[2] Murphy “Machine Learning: A Probabilistic Perspective” 2012