The DP (Dirichlet Process) is part of the family of models called non-parametric models. Non-parametric models concern learning models with a potentially infinite number of parameters. One of the classical applications is unsupervised techniques like clustering. Intuitively, clustering is about finding compact subsets of data, i.e. finding groups of points in the space that are particularly close by some measure.
The Dirichlet Process
The Dirichlet Distribution
The Dirichlet Distribution is a generalization of the Beta distribution. It is defined as:
$$ \text{Dir}(\rho_{1}, \dots, \rho_{k};\alpha_{1}, \dots, \alpha_{k}) = \frac{1}{B(\alpha)} \prod_{i = 1}^{k} \rho_{i}^{\alpha_{i} - 1} $$where $B(\alpha) = \frac{\prod_{i = 1}^{k} \Gamma(\alpha_{i})}{\Gamma(\sum_{i = 1}^{k} \alpha_{i})}$ is the normalization constant, $\rho_{i} \in [0, 1]$ and $\sum_{i = 1}^{k} \rho_{i} = 1$. The $\Gamma$ function is defined as
$$ \Gamma(x) = \int_{0}^{\infty} t^{x - 1} e^{-t} \, dt $$It has the nice properties that $\Gamma(x + 1) = x \Gamma(x)$ and $\Gamma(1) = 1$, which is why we can see the $\Gamma$ function as a generalization of the factorial. The important thing to note about the Dirichlet distribution is that it is the conjugate prior of the multinomial distribution. See here. So, if we have a multinomial likelihood with a Dirichlet prior, then the posterior is also a Dirichlet distribution. This allows us to update our prior with the data we observe, which is a nice property for Bayesian inference. You can learn more about prior updates in Bayesian Linear Regression.
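As a quick illustration of this conjugacy, here is a minimal sketch (NumPy assumed; the counts and hyperparameters are made up for illustration):

```python
import numpy as np

# Prior over the probabilities of a 3-category multinomial: Dir(alpha)
alpha = np.array([2.0, 2.0, 2.0])

# Hypothetical observed counts for each category
counts = np.array([10, 3, 1])

# Conjugacy: the posterior is again a Dirichlet with updated concentrations
alpha_post = alpha + counts

# Compare prior and posterior means, E[rho_i] = alpha_i / sum(alpha)
print("prior mean:    ", alpha / alpha.sum())
print("posterior mean:", alpha_post / alpha_post.sum())

# Monte Carlo sanity check of the posterior mean
samples = np.random.default_rng(0).dirichlet(alpha_post, size=100_000)
print("MC estimate:   ", samples.mean(axis=0))
```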
The DP effectively defines a conjugate prior on arbitrary measurable spaces.
Definition of the process
A Dirichlet Process is a distribution over probability measures $G: \Theta \to \mathbb{R}^{+}$. It is defined as:
$$ G \sim DP(\alpha, H) $$where, for any finite measurable partition of our sample space $\Theta$ into disjoint sets $\Theta_{1}, \dots, \Theta_{k}$, we have that:
$$ (G(\Theta_{1}), \dots, G(\Theta_{k})) \sim \text{Dir}(\alpha H(\Theta_{1}), \dots, \alpha H(\Theta_{k})) $$So they are jointly distributed as a Dirichlet distribution. $H$ is called the base measure and $\alpha$ is called the concentration parameter.
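For intuition, the simplest special case is a partition into a set $A$ and its complement:
$$ (G(A), G(A^{c})) \sim \text{Dir}(\alpha H(A), \alpha H(A^{c})) \iff G(A) \sim \text{Beta}(\alpha H(A), \alpha (1 - H(A))) $$so the base measure $H$ fixes the mean, $\mathbb{E}[G(A)] = H(A)$, while $\alpha$ controls how concentrated $G$ is around $H$.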
TODO: first explain the intuition of the prior, in the case of the beta, extend this to the dirichlet, and use the same intuition to explain how the prior is updated.
Choosing number of Clusters
Dirichlet processes are related to finding the best number of clusters (see Clustering): they offer a non-parametric way (meaning the number of clusters can grow with the data) to find that number.
Stick Breaking
We have done something similar with GPs in Gaussian Processes: the idea is sequential sampling instead of sampling all of the points at once from the difficult multivariate distribution.
We would like to sample $(\rho_{1}, \dots, \rho_{k}) \sim \text{Dir}(\alpha_{1}, \dots, \alpha_{k})$. So we first sample $\rho_{1}$ (marginally a Beta distribution), then sample the second component conditioned on the first (again a Beta distribution, now on the remaining stick), and so on.
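A minimal sketch of this sequential construction, assuming NumPy (each break uses the aggregation property $\rho_{1} \sim \text{Beta}(\alpha_{1}, \sum_{j > 1} \alpha_{j})$ applied to the remaining stick):

```python
import numpy as np

def dirichlet_stick_breaking(alphas, rng):
    """Sample from Dir(alphas) one component at a time."""
    alphas = np.asarray(alphas, dtype=float)
    k = len(alphas)
    rho = np.zeros(k)
    remaining = 1.0  # length of the stick that is still unbroken
    for i in range(k - 1):
        # fraction of the remaining stick given to component i
        frac = rng.beta(alphas[i], alphas[i + 1:].sum())
        rho[i] = remaining * frac
        remaining *= 1.0 - frac
    rho[-1] = remaining  # last component takes whatever is left
    return rho

rng = np.random.default_rng(0)
samples = np.array([dirichlet_stick_breaking([2.0, 3.0, 5.0], rng) for _ in range(50_000)])
print("sequential means:", samples.mean(axis=0))          # should be close to [0.2, 0.3, 0.5]
print("direct means:    ", rng.dirichlet([2.0, 3.0, 5.0], size=50_000).mean(axis=0))
```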
Griffiths-Engen-McCloskey distribution
This distribution is just the generalization of the stick-breaking process to infinitely many pieces. We write $\rho \sim GEM(\alpha)$, which means breaking the stick according to the distribution $\text{Beta}(1, \alpha)$: if $\alpha$ is smaller, each piece broken off tends to be longer, so most of the mass concentrates on the first few clusters. We can therefore interpret the parameter $\alpha$ as controlling how quickly the remaining stick length is used up.
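A minimal sketch of a truncated GEM sample, assuming NumPy (the truncation level K is arbitrary):

```python
import numpy as np

def gem(alpha, K, rng):
    """Truncated GEM(alpha): break a unit stick K times with Beta(1, alpha) fractions."""
    betas = rng.beta(1.0, alpha, size=K)                      # fraction broken off at each step
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining                                  # rho_k = beta_k * prod_{j<k} (1 - beta_j)

rng = np.random.default_rng(0)
for alpha in (0.5, 2.0, 10.0):
    rho = gem(alpha, K=200, rng=rng)
    print(f"alpha={alpha:4.1f}  largest weight={rho.max():.3f}  "
          f"top-5 weights cover={np.sort(rho)[-5:].sum():.3f}")
```

Smaller $\alpha$ gives a few dominant weights; larger $\alpha$ spreads the stick over many small pieces.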
De Finetti’s theorem
We say that an infinite sequence of random variables $(X_1, X_2, \dots)$ is exchangeable when, for every $n$ and every permutation $\sigma$ of the set $\{1, \dots, n\}$, it satisfies
$$ p(X_{1}, \dots, X_{n}) = p(X_{\sigma(1)}, \dots, X_{\sigma(n)}) $$
Given an infinitely exchangeable sequence of random variables, there exists a random variable $G$ such that $\forall n$:
$$ p(X_{1}, \dots, X_{n}) = \int \left( \prod_{i = 1}^{n} p(x_{i} \mid G) \right) \, dG $$which means we can decompose the infinitely exchangeable sequence into random variables that are conditionally independent given $G$. I have not fully understood why this is important; one standard reading is that the Chinese Restaurant Process below is exchangeable, so de Finetti's theorem guarantees the existence of an underlying random measure $G$ (which turns out to be a draw from a Dirichlet Process) under which the observations are conditionally i.i.d.
Classical Combinatorial problems
The Chinese Restaurant Process 🟥
We have some customers (observations) $x^{(i)}$ assigned to tables (clusters) $k$. Then we have the Matthew effect: the more people already seated at a table, the more probable it is that the next customer sits there; alternatively, a new table is created with probability $\propto \alpha$.
So, if $\mathcal{P}$ is the current set of tables, the probability of customer $n + 1$ joining table $\tau$ is:
$$ p(\tau_{n+1} = \tau \mid \tau_{1}, \dots, \tau_{n}) = \begin{cases} \frac{\lvert \tau \rvert}{n + \alpha} & \text{ if } \tau \in \mathcal{P} \\ \frac{\alpha}{n + \alpha} & \text{otherwise} \end{cases} $$Given this model, it's easy to compute the probability of a given table assignment, which is:
$$ P(\mathcal{P}) = \frac{\alpha^{\lvert \mathcal{P} \rvert }}{\alpha^{(n)}} \prod_{\tau \in \mathcal{P}} (\lvert \tau \rvert - 1)! $$where $\alpha^{(n)} = \alpha (\alpha + 1) \cdots (\alpha + n - 1)$ is the rising factorial. We can do this because the process is order- and labeling-independent. With this model, we can show that the expected number of tables is of order $\mathcal{O}(\alpha \log(N))$. In economics, this effect is called "rich get richer", because more popular clusters are able to attract more points.
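A minimal simulation of the seating process, assuming NumPy (table sizes kept in a plain Python list):

```python
import numpy as np

def crp(n, alpha, rng):
    """Seat n customers according to the Chinese Restaurant Process; return table sizes."""
    tables = []                                    # tables[k] = number of customers at table k
    for i in range(n):                             # customer i+1 arrives, i customers already seated
        probs = np.array(tables + [alpha], dtype=float)
        probs /= probs.sum()                       # last entry is the "new table" option
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)                       # open a new table
        else:
            tables[k] += 1                         # join an existing table (rich get richer)
    return tables

rng = np.random.default_rng(0)
sizes = crp(n=1000, alpha=2.0, rng=rng)
print("number of tables:", len(sizes))
print("largest tables:  ", sorted(sizes, reverse=True)[:5])
```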
Rate of increase of the number of tables
This number is important to track to get a sense of how the number of clusters grows. Let's call $K_{n}$ the random variable that counts the number of tables after $n$ customers have been seated. Then the expected increase at each step is equal to the probability of creating a new table (if it's easier for you, reader, you can also define an indicator variable $\mathbb{1}_{n}$ that tells you whether the count increased at step $n$):
$$ \mathbb{E}[K_{n} - K_{n - 1}] = \mathbb{E}[\mathbb{1}_{n}] = \frac{\alpha}{n - 1 + \alpha} $$since customer $n$ finds $n - 1$ customers already seated. Then the expected number of tables after $n$ customers is:
$$ \mathbb{E}[K_{n}] = \sum_{i = 1}^{n} \frac{\alpha}{i - 1 + \alpha} = \alpha \log(n) + \mathcal{O}(1) $$This is a simple consequence of the bound of the harmonic series. See Serie.
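A quick empirical check of this growth rate (a sketch assuming NumPy; it exploits the fact that the "new table" events at different steps are independent Bernoulli variables, so only those indicators are simulated):

```python
import numpy as np

def expected_tables(n, alpha, rng, trials=200):
    """Monte Carlo estimate of E[K_n] by simulating only the 'new table' events."""
    counts = np.zeros(trials)
    for t in range(trials):
        # customer i (1-based) opens a new table with probability alpha / (i - 1 + alpha)
        opens = rng.random(n) < alpha / (np.arange(n) + alpha)
        counts[t] = opens.sum()
    return counts.mean()

rng = np.random.default_rng(0)
alpha = 2.0
for n in (100, 1_000, 10_000):
    print(f"n={n:6d}  E[K_n] ~ {expected_tables(n, alpha, rng):7.2f}"
          f"   alpha*log(n) = {alpha * np.log(n):7.2f}")
```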
Pólya Urns
We have an urn with colored balls that are drawn at random. When a ball is drawn, it is duplicated and both copies are put back into the urn. We can prove that this is equivalent to the Chinese Restaurant Process with a finite number of clusters (colors). The important part is that this process is exchangeable.
Hoppe Urn
This is a variation of Pólya's urn. Here we have a special ball (typically given weight $\alpha$) that, when picked, is put back into the urn together with a ball of a brand-new color. This model is then exactly the same as the Chinese Restaurant Process. It should be a nice exercise to prove this. TODO: understand how you can apply de Finetti's theorem here.
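A minimal sketch of the Hoppe urn, assuming NumPy and giving the special ball weight $\alpha$; it deliberately mirrors the CRP sampler above, which makes the equivalence explicit (colors correspond to tables):

```python
import numpy as np

def hoppe_urn(n, alpha, rng):
    """Draw n balls from a Hoppe urn; return how many balls of each color end up in the urn."""
    color_counts = []                              # color_counts[c] = balls of color c in the urn
    for _ in range(n):
        weights = np.array(color_counts + [alpha], dtype=float)
        c = rng.choice(len(weights), p=weights / weights.sum())
        if c == len(color_counts):
            color_counts.append(1)                 # special ball drawn: add a brand-new color
        else:
            color_counts[c] += 1                   # colored ball drawn: put it back duplicated
    return color_counts

rng = np.random.default_rng(0)
print("largest color counts:", sorted(hoppe_urn(1000, alpha=2.0, rng=rng), reverse=True)[:5])
```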
The DP Mixture Model
The model 🟨
We have the probabilities of the clusters $\rho = (\rho_{1}, \dots) \sim GEM(\alpha)$, where $\alpha$ is chosen appropriately with some validation. The cluster centers are $\mu_{k} \sim \mathcal{N}(\mu_{0}, \sigma_{0}^{2})$. We draw the cluster assignments $z_{i}$ from a categorical distribution with probabilities $\rho$, and then draw the points $x_{i} \sim \mathcal{N}(\mu_{z_{i}}, \sigma^{2})$ based on the assigned cluster's parameters.
The easy way to remember this model is with a graphical model: $\alpha \to \rho \to z$, $H \to \mu$, $(z, \mu) \to x$, where $H$ is the base measure $\mathcal{N}(\mu_{0}, \sigma_{0}^{2})$.
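A minimal generative sketch of this model, assuming NumPy, a 1-D Gaussian likelihood, and a truncation of the GEM weights at $K$ components ($\mu_0$, $\sigma_0$, $\sigma$ are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K = 2.0, 50                     # concentration and truncation level
mu0, sigma0, sigma = 0.0, 10.0, 0.5    # base measure N(mu0, sigma0^2) and likelihood std

# rho ~ GEM(alpha), truncated at K components (stick breaking)
betas = rng.beta(1.0, alpha, size=K)
rho = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
rho /= rho.sum()                       # renormalize the truncated weights

mu = rng.normal(mu0, sigma0, size=K)   # cluster centers drawn from the base measure
z = rng.choice(K, size=500, p=rho)     # cluster assignments ~ Categorical(rho)
x = rng.normal(mu[z], sigma)           # observations from the assigned clusters

print("clusters actually used:", len(np.unique(z)))
```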
Fitting a learner 🟥
EM (see Clustering) is difficult to use for non-parametric models. We will use Gibbs Sampling, introduced in Monte Carlo Methods.
With Gibbs we need to be able to compute the posterior $p(z_{i} = k \mid z_{-i}, \boldsymbol{x}, \alpha, \boldsymbol{\mu})$. It's easy to see (also by using the independence assumptions of the graphical model) that this is proportional to:
$$ p(z_{i} = k \mid z_{-i}, \boldsymbol{x}, \alpha, \boldsymbol{\mu}) \propto p(z_{i} = k \mid z_{-i}, \alpha) \cdot p(x_{i} \mid z_{i} = k, \boldsymbol{x}_{-i}, \boldsymbol{\mu}) $$The first factor is the CRP prior from above and the second is the likelihood of $x_{i}$ under cluster $k$: these give the probabilities in the image.
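This is not the notebook referenced below; as a hedged, minimal sketch of the idea, here is a collapsed variant for the 1-D case, where instead of conditioning on $\boldsymbol{\mu}$ the cluster means are integrated out via the normal-normal conjugacy (NumPy and SciPy assumed, all hyperparameters illustrative):

```python
import numpy as np
from scipy import stats

def gibbs_sweep(x, z, alpha, mu0, sigma0, sigma, rng):
    """One collapsed Gibbs sweep for a 1-D DP mixture of Gaussians.

    Cluster means are integrated out (normal-normal conjugacy), so we only
    resample the assignments z_i one at a time given all the other assignments.
    """
    for i in range(len(x)):
        z[i] = -1                                          # remove point i from its cluster
        labels, counts = np.unique(z[z >= 0], return_counts=True)

        log_probs = []
        for k, n_k in zip(labels, counts):
            members = x[z == k]
            # posterior over the mean of cluster k given its current members
            prec = 1.0 / sigma0**2 + n_k / sigma**2
            post_mean = (mu0 / sigma0**2 + members.sum() / sigma**2) / prec
            # CRP prior term (n_k) times predictive likelihood of x_i under cluster k
            log_probs.append(np.log(n_k)
                             + stats.norm.logpdf(x[i], post_mean, np.sqrt(1.0 / prec + sigma**2)))
        # new cluster: CRP prior term (alpha) times the prior predictive
        log_probs.append(np.log(alpha)
                         + stats.norm.logpdf(x[i], mu0, np.sqrt(sigma0**2 + sigma**2)))

        log_probs = np.array(log_probs)
        probs = np.exp(log_probs - log_probs.max())
        choice = rng.choice(len(probs), p=probs / probs.sum())
        z[i] = labels[choice] if choice < len(labels) else z.max() + 1
    return z

# toy data: two well-separated groups, initialized with everything in one cluster
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5, 0.5, 50), rng.normal(5, 0.5, 50)])
z = np.zeros(len(x), dtype=int)
for _ in range(20):
    z = gibbs_sweep(x, z, alpha=1.0, mu0=0.0, sigma0=10.0, sigma=0.5, rng=rng)
print("cluster sizes:", np.unique(z, return_counts=True)[1])
```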
Code Example: just run it in a notebook. I noticed that if we initialize the code with all points assigned to a single cluster, the model seems to fail to learn that there are other classes; this is probably because, with the rich-get-richer effect, the sampler does not explore the space of clusterings enough. If we initialize with a random assignment over the maximum number of clusters, then it doesn't seem to be able to eliminate clusters correctly. It is quite probable that we would need to tune the $\alpha$ value, but finding a good one is probably quite time consuming.
Drawbacks of Dirichlet
We provide here a summary of the main drawbacks of the Dirichlet Process:
| Category | Drawback |
|---|---|
| Growth Assumptions | Logarithmic growth may not fit real-world data |
| Cluster Size Bias | Overemphasis on large clusters; poor handling of outliers |
| Hyperparameter Sensitivity | Choice of $\alpha$ and $G_{0}$ (the base measure, called $H$ above) affects clustering |
| Inference Complexity | MCMC/variational methods are computationally expensive |
| Exchangeability Assumption | Fails for temporally or spatially structured data |
| Expressiveness | Limited ability to handle complex or hierarchical data relationships |
| Scalability | Standard DP methods struggle with large datasets |
| Task Restriction | Primarily a clustering tool, requiring extensions for other tasks |
We observe that the main points are the exchangeability assumption, the growth assumptions (in real life some clusters may grow linearly or exponentially), and the scalability.