Clustering

Huang, Xuanqiang Angelo

February 6, 2025 · Reading Time: 12 minutes · By Xuanqiang Angelo Huang

Table of Contents

Gaussian Mixture Models
The expectation-maximization algorithm
K-Means
- The problem
- The Algorithm

Gaussian Mixture Models

This section takes inspiration from chapter 9.2 of (Bishop 2006). We assume that the reader is already familiar with Gaussian Mixture Models, so we only restate the model here. We discuss how to estimate the best possible parameters when the data is generated by a mixture of Gaussians, which makes this a density estimation problem.

Recall that the multivariate Gaussian density in $d$ dimensions has this form:

$$\mathcal{N} (x ∣ μ, Σ) = \frac{1}{(2 π)^{d / 2}} \frac{1}{∣ Σ ∣^{1/2}} \exp \left( - \frac{1}{2} (x - μ)^{T} Σ^{- 1} (x - μ) \right)$$
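As a small illustration, the density can be evaluated numerically in log space. This is only a sketch: the function name `gaussian_logpdf` is hypothetical, and it assumes NumPy is available and that `sigma` is a valid (symmetric positive definite) covariance matrix.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a d-dimensional Gaussian at a single point x."""
    d = mu.shape[0]
    diff = x - mu
    # Solve sigma @ y = diff instead of forming the explicit inverse.
    maha = diff @ np.linalg.solve(sigma, diff)
    # slogdet returns (sign, log|sigma|); the sign is +1 for a valid covariance.
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

# Standard normal in d = 2 evaluated at its mean: the density is 1 / (2 pi).
density_at_mean = np.exp(gaussian_logpdf(np.zeros(2), np.zeros(2), np.eye(2)))
```

Working with the logarithm avoids underflow when $d$ is large or the point is far from the mean.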

Problem statement

Given a set of data points $x_{1}, \dots, x_{n}$ in $R^{d}$ sampled from a mixture of $k$ Gaussians, where the $i$-th Gaussian has mixing weight $π_{i}$, the objective is to estimate the best $π_{i}$ for each Gaussian together with its mean and covariance matrix. We assume a latent variable model in which a variable $z$ indicates which Gaussian has been chosen for a specific sample. We have this prior:

$$p (z) = \prod_{i = 1}^{k} π_{i}^{z_{i}}$$

This holds because $z$ is a $k$-dimensional one-hot vector: exactly one entry is 1, indicating which Gaussian was chosen, and all the others are 0.
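The latent model can be made concrete by sampling from it: first draw the one-hot $z$ from the prior, then draw $x$ from the chosen Gaussian. The sketch below assumes NumPy; the function name `sample_gmm` and all parameter values are made up for illustration.

```python
import numpy as np

def sample_gmm(n, pis, mus, sigmas, rng):
    """Draw n pairs (x, z) from the latent model p(z) p(x | z)."""
    k = len(pis)
    # Index of the chosen Gaussian for each sample, drawn with probabilities pi.
    idx = rng.choice(k, size=n, p=pis)
    # One-hot encoding: each row has a single 1 in the column of its Gaussian.
    z = np.eye(k, dtype=int)[idx]
    x = np.stack([rng.multivariate_normal(mus[i], sigmas[i]) for i in idx])
    return x, z

rng = np.random.default_rng(0)
pis = np.array([0.3, 0.7])                       # illustrative mixing weights
mus = np.array([[0.0, 0.0], [5.0, 5.0]])         # illustrative means
sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # illustrative covariances
x, z = sample_gmm(500, pis, mus, sigmas, rng)

# The prior p(z) = prod_i pi_i^{z_i} reduces to the weight of the chosen Gaussian.
prior = np.prod(pis ** z, axis=1)
```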

The Maximum Likelihood problem

The frequentist approach with maximum likelihood is prone to edge cases that make it difficult to apply to this density estimation problem. Recall that for Gaussian mixture models the objective is to maximize the log-likelihood:

$$\max_{π, μ, Σ} \log p (X ∣ π, μ, Σ) = \max_{π, μ, Σ} \sum_{n = 1}^{N} \log \left\{ \sum_{i = 1}^{K} π_{i} \mathcal{N} (x_{n} ∣ μ_{i}, Σ_{i}) \right\}$$

So our parameter space is the following

$$Θ = \{π_{k}, μ_{k}, Σ_{k} : k \leq K\}$$

Let's now see a case where this function is not well behaved. Suppose the covariance matrix of component $i$ is $σ_{i}^{2} I$, and that one of the sampled points $x_{n}$ is exactly equal to $μ_{i}$. Then the contribution of this Gaussian to the likelihood of $x_{n}$ is

$$\mathcal{N} (x_{n} ∣ x_{n}, σ_{i}^{2} I) = \frac{1}{(2 π)^{d / 2} σ_{i}^{d}}$$

If this component explains only that single point, nothing prevents $σ_{i} \to 0$; the value above then explodes and the whole log-likelihood goes to infinity. There are methods that try to solve this problem, but we do not explore them in this setting and focus on the expectation-maximization algorithm instead.
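The collapse described above can be observed numerically. The sketch below uses only the standard library and a made-up one-dimensional dataset: one component is centered exactly on the first data point while its standard deviation shrinks, and a second, fixed component keeps the other points at finite density. All numbers are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def log_likelihood(data, sigma_collapsing):
    """Two-component mixture: one component sits on data[0], the other is fixed."""
    total = 0.0
    for x in data:
        mix = 0.5 * normal_pdf(x, data[0], sigma_collapsing) + 0.5 * normal_pdf(x, 1.5, 1.0)
        total += math.log(mix)
    return total

data = [0.0, 1.0, 2.0, 3.0]
sigmas = [1.0, 0.1, 0.01, 0.001]
# The log-likelihood keeps growing as the first component shrinks onto data[0].
log_liks = [log_likelihood(data, s) for s in sigmas]
```

The fixed second component matters: without it, the other points would get vanishing density and the likelihood would not diverge.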

Ideas for the solution

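Consider for instance the complete-data likelihood $p (X, Z)$. Writing $z_{n}$ for the index of the Gaussian that generated $x_{n}$, it has a known closed form:

$$\log p (X, Z) = \sum_{n = 1}^{N} \log \left\{ π_{z_{n}} \mathcal{N} (x_{n} ∣ μ_{z_{n}}, Σ_{z_{n}}) \right\} = \sum_{n = 1}^{N} \left( \log π_{z_{n}} + \log \mathcal{N} (x_{n} ∣ μ_{z_{n}}, Σ_{z_{n}}) \right)$$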

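That decomposition is quite nice. With this observation, we can write our objective as

$$\log p (X) = E_{Z \sim q} \left[ \log \frac{p (X, Z)}{p (Z ∣ X)} \right]$$

This uses the product rule $p (X, Z) = p (Z ∣ X) p (X)$: the ratio inside the logarithm equals $p (X)$, which does not depend on $Z$, so the expectation leaves it unchanged for any distribution $q$ over $Z$.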

The interesting part comes when we multiply and divide by $q (Z)$ inside the logarithm, which decomposes the expression into two parts:

$$\log p (X) = E_{Z \sim q} \left[ \log \frac{p (X, Z)}{q (Z)} \right] + E_{Z \sim q} \left[ \log \frac{q (Z)}{p (Z ∣ X)} \right] = M (q, θ) + E (q, θ)$$

We note that $E$ is a Kullback-Leibler divergence, so it is always non-negative, and we have $\log p (X) \geq M (q, θ)$. The $M$ part is also known as the ELBO (see Variational Inference).

Another fundamental observation is that we can choose $q$ so that $E$ is zero: the divergence vanishes exactly when the two distributions coincide, so it suffices to set $q (Z) = p (Z ∣ X)$. We can compute this because for Gaussian mixtures the posterior is available in closed form.
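The decomposition can be checked numerically on a toy case. The sketch below uses only the standard library and assumes a single observation with a binary latent variable and unit-variance components; all numbers and the helper names `elbo` and `kl` are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

x = 1.0
pis, mus = [0.4, 0.6], [0.0, 3.0]
joint = [pi * normal_pdf(x, mu, 1.0) for pi, mu in zip(pis, mus)]  # p(x, z)
evidence = sum(joint)                                              # p(x)
posterior = [j / evidence for j in joint]                          # p(z | x)

def elbo(q):
    """M(q, theta) = E_q[log p(x, z) / q(z)]."""
    return sum(qz * math.log(j / qz) for qz, j in zip(q, joint))

def kl(q):
    """E(q, theta) = KL(q || p(z | x))."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

q_arbitrary = [0.5, 0.5]
# For any q the two parts add up to log p(x); with q equal to the posterior,
# the divergence is zero and the ELBO touches the log-evidence.
gap_arbitrary = elbo(q_arbitrary) + kl(q_arbitrary) - math.log(evidence)
gap_posterior = elbo(posterior) - math.log(evidence)
```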

The expectation-maximization algorithm

Dempster et al., 1977; McLachlan and Krishnan, 1997 are useful references for this method.

Structure of the Algorithm

The algorithm in brief goes as follows:

Set an initial value $θ^{(0)}$
For $t = 1, 2, \dots$
1. E-step: set $q^{*}$ such that $E (q^{*}, θ^{(t - 1)}) = 0$, which minimizes this divergence over $q$.
2. M-step: set $θ^{(t)}$ to the maximizer of $M (q^{*}, θ)$. This is tractable because it only requires adjusting the parameters of $p (X, Z)$.

From a higher-level view:

Compute the posterior $γ$
Compute the best means, covariances and priors with the closed-form updates derived in the sections below, and update them
Repeat until convergence.

The likelihood is guaranteed not to decrease from one iteration to the next, but the algorithm may get stuck in local maxima or other stationary points.
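The loop above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the names `em_gmm` and `log_gaussians`, the initialization (random data points as means, the global covariance for every component, uniform weights) and the fixed iteration count are all assumptions, and no safeguard against collapsing covariances is included.

```python
import numpy as np

def log_gaussians(X, mus, sigmas):
    """Return an (N, K) array with the log-density of each point under each Gaussian."""
    N, d = X.shape
    out = np.empty((N, len(mus)))
    for k, (mu, sigma) in enumerate(zip(mus, sigmas)):
        diff = X - mu
        maha = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
        out[:, k] = -0.5 * (d * np.log(2.0 * np.pi) + np.linalg.slogdet(sigma)[1] + maha)
    return out

def em_gmm(X, K, n_iter=50, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Assumed initialization: uniform weights, random data points, shared covariance.
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    sigmas = np.array([np.atleast_2d(np.cov(X, rowvar=False))] * K)
    history = []
    for _ in range(n_iter):
        # E-step: gamma[n, k] is the posterior probability that Gaussian k generated x_n.
        log_joint = np.log(pis) + log_gaussians(X, mus, sigmas)
        log_evidence = np.logaddexp.reduce(log_joint, axis=1)
        history.append(log_evidence.sum())  # log-likelihood before this update
        gamma = np.exp(log_joint - log_evidence[:, None])
        # M-step: closed-form updates of priors, means and covariances.
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pis, mus, sigmas, np.array(history)

# Usage on synthetic data: two blobs in the plane (illustrative values).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(6.0, 1.0, size=(150, 2))])
pis, mus, sigmas, history = em_gmm(X, K=2)
```

The recorded `history` makes it easy to check that the log-likelihood does not decrease. In practice one would stop when its increase falls below a tolerance instead of running a fixed number of iterations.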

Convergence of EM

This is just a bounded optimization problem: once the sequence of log-likelihood values is shown to be monotone and bounded, we can use the monotone convergence theorem for sequences (see Successioni), which asserts that the limit of a bounded monotone sequence always exists and is unique.

We know that $\log p (X) = M (q, θ) + E (q, θ)$. After the E-step the corresponding Kullback-Leibler divergence is 0, so we have $\log p_{θ} (X) = M (q^{'}, θ)$, where $q^{'}$ is the updated variational distribution.

Then, if we set $θ^{'} = \arg\max_{\tilde{θ}} M (q^{'}, \tilde{θ})$ in the M-step, we have the following chain of inequalities:

$$\log p_{θ^{'}} (X) \geq M (q^{'}, θ^{'}) \geq M (q^{'}, θ) = \log p_{θ} (X)$$

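So the log-likelihood values form a non-decreasing sequence. The upper bound is trivial for discrete distributions, where $p \leq 1$ by the axioms of probability; for Gaussian mixtures it holds as long as we stay away from the degenerate covariances discussed in the maximum likelihood section.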

The importance of the class

If you assume you know the class that the point $x$ belongs to, then the problem becomes quite easy. This is the original problem that concerns k-means too! We don't know a priori which class has been used to generate the point $x$, so we have to take the expected value over every possibility, which is usually what makes the problem hard.

This part corresponds to the E-step of the algorithm. In the case of Gaussian Mixture Models, this just corresponds to setting $q (z) = p (z ∣ x)$, which is:

$$q (z = i) = p (z = i ∣ x_{n}) = \frac{p (x_{n}, z = i)}{p (x_{n})} = \frac{π_{i} \mathcal{N} (x_{n} ∣ μ_{i}, Σ_{i})}{\sum_{j} π_{j} \mathcal{N} (x_{n} ∣ μ_{j}, Σ_{j})} = γ (z_{ni})$$

The Loss Function

It is possible to define a loss function with respect to the parameters $π, Σ, μ$ after the variational posterior has been fitted in the $E$ step.

We denote

$$γ (z_{nk}) = p (z_{k} ∣ x_{n}) = \frac{p (x_{n} ∣ z_{k}) p (z_{k})}{\sum_{i} p (x_{n} ∣ z_{i}) p (z_{i})} = \frac{π_{k} \mathcal{N} (x_{n} ∣ μ_{k}, Σ_{k})}{\sum_{i} π_{i} \mathcal{N} (x_{n} ∣ μ_{i}, Σ_{i})}$$

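Then, dropping additive constants that do not depend on the parameters, we can write the loss function (the negative expected complete-data log-likelihood) as

$$L (π, Σ, μ) = \sum_{n = 1}^{N} \sum_{k = 1}^{K} γ (z_{nk}) \left( \frac{1}{2} \log ∣ Σ_{k} ∣ + \frac{1}{2} (x_{n} - μ_{k})^{T} Σ_{k}^{- 1} (x_{n} - μ_{k}) - \log π_{k} \right)$$

Differentiating this loss gives the best mean, covariance and $π$. This follows (Murphy 2012); note the factor $\frac{1}{2}$ in front of $\log ∣ Σ_{k} ∣$, which comes from the $∣ Σ_{k} ∣^{- 1/2}$ normalization of the Gaussian density.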

Deriving the expected mean

Then we continue with the maximization step, which finds the best parameters given the variational distribution fitted in the E-step.

First we use some multivariable calculus to derive stationarity conditions: we take the derivative of the log-likelihood with respect to $μ_{k}$ and set it to zero, which gives

$$
\begin{aligned}
\frac{\partial \log p (X)}{\partial μ_{k}} &= \sum_{n = 1}^{N} γ (z_{nk}) \frac{\partial \log p (x_{n}, z_{k})}{\partial μ_{k}} \\
&= \sum_{n = 1}^{N} \frac{π_{k} \mathcal{N} (x_{n} ∣ μ_{k}, Σ_{k})}{\sum_{j} π_{j} \mathcal{N} (x_{n} ∣ μ_{j}, Σ_{j})} \cdot \left( - \frac{1}{2} \right) \frac{\partial (x_{n} - μ_{k})^{T} Σ_{k}^{- 1} (x_{n} - μ_{k})}{\partial μ_{k}} \\
&= \sum_{n = 1}^{N} γ (z_{nk}) Σ_{k}^{- 1} (x_{n} - μ_{k}) = 0 \\
&⟹ \sum_{n = 1}^{N} γ (z_{nk}) (x_{n} - μ_{k}) = 0 \\
&⟹ μ_{k} = \frac{1}{N_{k}} \sum_{n = 1}^{N} γ (z_{nk}) x_{n}
\end{aligned}
$$

where $N_{k} = \sum_{n = 1}^{N} γ (z_{nk})$.

We can interpret $N_{k}$ as the effective number of points generated by Gaussian $k$, so the update is just the responsibility-weighted average of the points attributed to $k$. This gives an easy interpretation of the mean update in the maximization step.
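The priors are updated in the same step, although their update is not derived in these notes. A brief sketch, assuming the standard Lagrange-multiplier argument (the multiplier $λ$ is notation introduced here): only the $- \log π_{k}$ term of the negative expected complete-data log-likelihood depends on $π$, and the weights must satisfy $\sum_{k} π_{k} = 1$, so

$$
\frac{\partial}{\partial π_{k}} \left[ - \sum_{n = 1}^{N} \sum_{j = 1}^{K} γ (z_{nj}) \log π_{j} + λ \left( \sum_{j = 1}^{K} π_{j} - 1 \right) \right] = - \frac{N_{k}}{π_{k}} + λ = 0 ⟹ π_{k} = \frac{N_{k}}{λ}
$$

Summing over $k$ and using $\sum_{k} N_{k} = N$ gives $λ = N$, hence $π_{k} = N_{k} / N$: the fraction of points effectively assigned to Gaussian $k$.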

Deriving the expected deviation

This one is harder, and I still have not understood exactly how this matrix derivative is done, but the end result is very similar to the above. We have

$$Σ_{k} = \frac{1}{N_{k}} \sum_{n = 1}^{N} γ (z_{nk}) (x_{n} - μ_{k}) (x_{n} - μ_{k})^{T}$$
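One possible route to the matrix derivative, sketched under the assumption that we differentiate with respect to the precision matrix $Λ_{k} = Σ_{k}^{- 1}$ (a symbol introduced here) and ignore the symmetry constraint. It uses the standard identities $\partial \log ∣ Λ ∣ / \partial Λ = Λ^{- 1}$ and $\partial \, \mathrm{tr} (Λ S) / \partial Λ = S$ for symmetric matrices. Writing $S_{n} = (x_{n} - μ_{k}) (x_{n} - μ_{k})^{T}$, the part of the negative expected complete-data log-likelihood that depends on $Σ_{k}$ is $\sum_{n} γ (z_{nk}) \left( - \frac{1}{2} \log ∣ Λ_{k} ∣ + \frac{1}{2} \mathrm{tr} (Λ_{k} S_{n}) \right)$, so

$$
\frac{\partial}{\partial Λ_{k}} \sum_{n = 1}^{N} γ (z_{nk}) \left( - \frac{1}{2} \log ∣ Λ_{k} ∣ + \frac{1}{2} \mathrm{tr} (Λ_{k} S_{n}) \right) = \frac{1}{2} \sum_{n = 1}^{N} γ (z_{nk}) \left( S_{n} - Λ_{k}^{- 1} \right) = 0 ⟹ Σ_{k} = Λ_{k}^{- 1} = \frac{1}{N_{k}} \sum_{n = 1}^{N} γ (z_{nk}) S_{n}
$$

The quadratic form becomes a trace because $(x_{n} - μ_{k})^{T} Λ_{k} (x_{n} - μ_{k}) = \mathrm{tr} (Λ_{k} S_{n})$.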

Selecting the number of Clusters

You can check this in chapter 25.2. There is a problem at the very beginning: even before we can apply the EM algorithm, we need to choose the hyperparameter $k$, the number of classes we assume to exist. We would like a more principled way of choosing $k$ than a brute-force search over the possible values.

With the stick-breaking idea, we assume an infinite number of clusters, of which we only observe some realizations. As $N \to \infty$, every cluster gets realized. A cluster is realized when at least one sample $x$ belongs to it.
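As a rough illustration of the stick-breaking idea, the sketch below generates the first few cluster weights by repeatedly breaking off a random fraction of what remains of a unit-length stick. It assumes the usual construction with Beta(1, α) fractions; the concentration `alpha`, the truncation `n_sticks` and the function name are choices made here, not taken from the notes.

```python
import random

def stick_breaking_weights(alpha, n_sticks, rng):
    """Return the first n_sticks weights and the length of stick left over."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any cluster
    for _ in range(n_sticks):
        beta = rng.betavariate(1.0, alpha)  # fraction broken off the remaining stick
        weights.append(remaining * beta)
        remaining *= 1.0 - beta
    return weights, remaining

rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, n_sticks=20, rng=rng)
```

The weights and the leftover always sum to one, so truncating the infinite sequence only ignores the small mass left on the remaining stick.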

We discover how to select this with the following section, where we delve into Dirichlet processes.

K-Means

The problem

Let's say we have a set of $d$-dimensional points $X = \{x_{1}, \dots, x_{n}\}$. We would like to learn a function

$$c : R^{d} \to \{1, \dots, k\}$$

that assigns each point a single label.

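Each class is represented by a prototype $μ_{j}$, a point that acts as the representative of that class. In classical k-means we would like to minimize the squared distance (or another distance function) between each example and the prototype of its class. With $Y = \{μ_{1}, \dots, μ_{k}\}$ denoting the set of prototypes, we can write:

$$R (c, Y) = \sum_{i = 1}^{n} ∥ x_{i} - μ_{c (x_{i})} ∥^{2}$$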

The Algorithm

[Figure: Non-parametric Modeling-20241205152331188]
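As a minimal sketch, assuming the standard Lloyd iteration is what is meant here: alternate between assigning each point to its nearest prototype and moving each prototype to the mean of its assigned points. The function names, the random initialization and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def assign(X, mus):
    """Label each point with the index of its nearest prototype."""
    dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_cost(X, mus, labels):
    """Sum of squared distances between points and their assigned prototypes."""
    return float(np.sum((X - mus[labels]) ** 2))

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct data points chosen at random.
    mus = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = assign(X, mus)
    costs = [kmeans_cost(X, mus, labels)]
    for _ in range(n_iter):
        # Update step: each prototype moves to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old prototype
                mus[j] = members.mean(axis=0)
        # Assignment step: relabel every point with its nearest prototype.
        labels = assign(X, mus)
        costs.append(kmeans_cost(X, mus, labels))
    return mus, labels, np.array(costs)

# Usage on synthetic data: two blobs in the plane (illustrative values).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(6.0, 1.0, size=(150, 2))])
mus, labels, costs = lloyd_kmeans(X, k=2)
```

Neither step can increase the cost, so `costs` is non-increasing. This mirrors the EM loop for Gaussian mixtures, with hard assignments in place of the responsibilities $γ$.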

References

[1] Bishop “Pattern Recognition and Machine Learning” Springer 2006

[2] Murphy “Machine Learning: A Probabilistic Perspective” 2012
