The DP (Dirichlet Process) is part of a family of models called non-parametric models. Non-parametric models concern learning models with a potentially infinite number of parameters. One of the classical applications is unsupervised techniques like clustering. Intuitively, clustering concerns finding compact subsets of data, i.e. finding groups of points in the space that are particularly close by some measure.
The Dirichlet Process
See Beta and Dirichlet Distributions for the definition and intuition of these two distributions. One quite important thing the Dirichlet Process allows is assigning an ever-growing number of clusters to the data. These models are thus quite flexible to change and growth.
Definition of the process
$$ G \sim DP(\alpha, H) $$means that for every finite partition $\Theta_{1}, \dots, \Theta_{k}$ of the space$$ (G(\Theta_{1}), \dots, G(\Theta_{k})) \sim \text{Dir}(\alpha H(\Theta_{1}), \dots, \alpha H(\Theta_{k})) $$So they are jointly distributed as a Dirichlet distribution. $H$ is called the base measure and $\alpha$ is called the concentration parameter.
The base measure $H$ represents our prior belief about the distribution of the data. When the concentration parameter $\alpha$ is small, the resulting distributions tend to concentrate on fewer clusters, effectively encouraging sparsity. Conversely, a larger $\alpha$ allows for a richer, more diverse set of clusters.
Choosing the number of Clusters
Dirichlet processes are related to finding the best number of clusters (see Clustering). They provide a non-parametric way (meaning the number of clusters can grow with the data) to find that number.
Stick Breaking 🟨++
We have done something similar to $GP$ in Gaussian Processes: the idea is sequential sampling instead of sampling all of the points at once from the difficult multivariate distribution.
We would like to sample $(\rho_{1}, \dots, \rho_{k}) \sim \text{Dir}(\alpha_{1}, \dots, \alpha_{k})$. So we first sample $\rho_{1} \sim \text{Beta}(\alpha_{1}, \sum_{j > 1} \alpha_{j})$, then sample the fraction of the remaining stick assigned to the second component (again a Beta distribution, conditioned on the first break), and so on, until we finish the possible clusters.
Image from the slides.
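As a sanity check, here is a minimal numpy sketch of this sequential (stick-breaking) sampling of a finite Dirichlet; the function name and the choice of `numpy.random.Generator.beta` are mine, not from the slides.

```python
import numpy as np

def sample_dirichlet_sequentially(alphas, rng=None):
    """Sample (rho_1, ..., rho_k) ~ Dir(alpha_1, ..., alpha_k) one piece at a
    time: each break of the remaining stick is a Beta draw."""
    rng = np.random.default_rng() if rng is None else rng
    alphas = np.asarray(alphas, dtype=float)
    k = len(alphas)
    rho = np.zeros(k)
    remaining = 1.0  # length of the stick that is still unbroken
    for i in range(k - 1):
        # fraction of the remaining stick given to component i
        frac = rng.beta(alphas[i], alphas[i + 1:].sum())
        rho[i] = remaining * frac
        remaining *= 1.0 - frac
    rho[-1] = remaining  # the last component takes whatever is left
    return rho

print(sample_dirichlet_sequentially([2.0, 3.0, 5.0]))  # sums to 1
```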
Griffiths-Engen-McCloskey distribution 🟨++
This distribution is just a generalization of the stick breaking process. We write $\rho \sim GEM(\alpha)$, which means breaking according to the distribution $\text{Beta}(1, \alpha)$. If $\alpha$ is smaller, each broken-off piece tends to be longer (its expected fraction of the remaining stick is $1/(1+\alpha)$), so the stick is used up faster and most mass falls on the first few clusters. We can thus interpret the parameter $\alpha$ as controlling how much stick length remains for later pieces.
Using the above image, we can define the stick breaking pattern with the constant values $1, \alpha$, which corresponds to the GEM distribution, i.e. the weights of a Dirichlet process with concentration parameter $\alpha$ and base measure $H$.
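A possible numpy sketch of a (truncated) GEM draw; the truncation level and function name are my own choices for illustration.

```python
import numpy as np

def sample_gem(alpha, truncation=100, rng=None):
    """Approximate rho ~ GEM(alpha) by truncating the stick-breaking
    construction after `truncation` pieces (the leftover mass is dropped)."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=truncation)                    # break fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining                                         # stick lengths

rho = sample_gem(alpha=1.0)
print(rho[:5], rho.sum())  # most of the mass sits on the first few sticks
```

With a small $\alpha$ the first few weights already sum to almost 1; with a large $\alpha$ the mass is spread over many sticks, matching the interpretation above.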
De Finetti’s theorem 🟨–
An infinite sequence of random variables is exchangeable if$$ p(X_{1}, \dots, X_{n}) = p(X_{\sigma(1)}, \dots, X_{\sigma(n)}) $$with $\sigma$ a permutation of the set $\{1, \dots, n\}$, for every $n$.
De Finetti’s theorem then states that$$ p(X_{1}, \dots, X_{n}) = \int \left( \prod_{i = 1}^{n} p(x_{i} \mid G) \right) \, dG $$which means we can decompose the infinite exchangeable sequence into conditionally independent random variables, given a latent random measure $G$.
In the context of the CRP and urns, this property allows us to compute the probability of ending up with a specific number of clusters, each with some number of elements.
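For the Dirichlet process specifically, the latent random measure in the integral is exactly a draw from the DP (this is the interpretation of $G$ given again after the urn models below):

$$ G \sim DP(\alpha, H), \qquad X_{1}, X_{2}, \dots \mid G \overset{iid}{\sim} G $$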
Classical Combinatorial problems
The Chinese Restaurant Process 🟨++
We have some customers (observations) $x^{(i)}$ assigned to some tables (clusters) $k$. Then we have the Matthew effect: the more people sitting at a table, the more probable it is that the next customer sits there. Otherwise, with probability $\propto \alpha$, the customer opens a new table.
$$ p(\tau_{n+1} = \tau \mid \tau_{1}, \dots, \tau_{n}) = \begin{cases} \frac{\lvert \tau \rvert}{n + \alpha} & \text{ if } \tau \in \mathcal{P} \\ \frac{\alpha}{n + \alpha} & \text{ if } \tau \text{ is a new table} \end{cases} $$$$ P(\mathcal{P}) = \frac{\alpha^{\lvert \mathcal{P} \rvert }}{\alpha^{(n)}} \prod_{\tau \in \mathcal{P}} (\lvert \tau \rvert - 1)! $$where $\alpha^{(n)} = \alpha(\alpha + 1)\cdots(\alpha + n - 1)$ is the rising factorial. We can write this closed form because the process is order- and labeling-independent (exchangeable). With this model, we can show that the expected number of tables is of the order $\mathcal{O}(\alpha \log(N))$. In economics, this effect is called rich get richer because more popular clusters are able to attract more points.
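A minimal simulation of the seating process, directly following the conditional above (function and variable names are my own):

```python
import numpy as np

def sample_crp(n, alpha, rng=None):
    """Seat n customers according to a Chinese Restaurant Process with
    concentration alpha; returns each customer's table and the table sizes."""
    rng = np.random.default_rng() if rng is None else rng
    tables, assignments = [], []          # tables[k] = customers at table k
    for _ in range(n):
        # existing tables weighted by their size, a new table weighted by alpha
        weights = np.array(tables + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(tables):
            tables.append(1)              # open a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return np.array(assignments), np.array(tables)

z, sizes = sample_crp(n=1000, alpha=2.0)
print(len(sizes), sizes[:10])  # a few large tables and many small ones
```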
Rate of increase of the number of tables 🟨
$$ \mathbb{E}[K_{n} - K_{n - 1}] = \mathbb{E}[\mathbb{1}_{n}] = \frac{\alpha}{n - 1 + \alpha} $$where $\mathbb{1}_{n}$ is the indicator that the $n$-th customer opens a new table. Summing over the customers,$$ \mathbb{E}[K_{n}] = \sum_{i = 1}^{n} \frac{\alpha}{i - 1 + \alpha} = O( \alpha \log(n)) $$This is a simple consequence of the bound on the harmonic series. See Serie.
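A quick numerical sanity check of the logarithmic growth; it relies on the fact, used above, that the $i$-th customer opens a new table with probability $\alpha/(i - 1 + \alpha)$, independently of how the previous customers are seated. The constants are arbitrary choices of mine.

```python
import numpy as np

# Empirical check that E[K_n] grows like alpha * log(n).
alpha, n, trials = 2.0, 5000, 200
rng = np.random.default_rng(0)
i = np.arange(1, n + 1)
# indicator that the i-th customer opens a new table, for each trial
new_table = rng.random((trials, n)) < alpha / (i - 1 + alpha)
print("empirical E[K_n]:", new_table.sum(axis=1).mean())
print("alpha * log(n)  :", alpha * np.log(n))
```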
Pólya Urns 🟩
We have urns with colored balls that can be drawn at random. When a ball is drawn, it is duplicated and both copies are put back into the urn. We can prove that this is equivalent to the Chinese Restaurant Process, given a finite number of clusters (colors). The important part is that this process is exchangeable.
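A small sketch of the urn dynamics (the color proportions converge to a draw from a Dirichlet distribution, which is the finite analogue of the statement above); the function name is mine:

```python
import numpy as np

def polya_urn(initial_counts, n_draws, rng=None):
    """Draw a ball with probability proportional to its color's count, then
    return it together with one extra copy of the same color."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.array(initial_counts, dtype=float)
    for _ in range(n_draws):
        color = rng.choice(len(counts), p=counts / counts.sum())
        counts[color] += 1
    return counts

counts = polya_urn([1, 1, 1], n_draws=10_000)
print(counts / counts.sum())  # an (approximate) Dirichlet draw
```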
Hoppe Urn 🟩
This is a variation of the Pólya urn. Here we have a special ball (of weight $\alpha$): when it is picked, it is put back into the urn together with a ball of a brand new color. This model is exactly the same as the Chinese Restaurant Process. It should be a nice exercise to prove this.
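A sketch of the Hoppe urn, assuming the special ball carries weight $\alpha$; note that the loop is essentially line by line the same as the CRP sampler above, which is exactly the claimed equivalence.

```python
import numpy as np

def hoppe_urn(alpha, n_draws, rng=None):
    """Special ball of weight alpha: when drawn, a ball of a brand new color
    is added; otherwise the drawn color gains one extra ball."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []                                   # counts[c] = balls of color c
    for _ in range(n_draws):
        weights = np.array(counts + [alpha], dtype=float)
        c = rng.choice(len(weights), p=weights / weights.sum())
        if c == len(counts):
            counts.append(1)                      # special ball: new color
        else:
            counts[c] += 1
    return counts

print(hoppe_urn(alpha=2.0, n_draws=1000))  # same law as the CRP table sizes
```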
We can interpret the random variable $G$ in De Finetti’s theorem as a single sample from an underlying Dirichlet Process. After we have sampled this (which acts as a prior, it is as if we were sampling hyperparameters), we use the CRP to assign the points to the clusters.
The DP Mixture Model
The model 🟨
The generative model is the following:
- $\rho = (\rho_{1}, \dots) \sim GEM(\alpha)$, where $\alpha$ is chosen appropriately, e.g. by validation.
- $\mu_{k} \sim \mathcal{N}(\mu_{0}, \sigma_{0}^{2})$ are the centers of the clusters.
- We draw some assignments to the clusters with a categorical distribution based on $\rho$. We call these $z_{i}$.
- We draw the points $x_{i} \sim \mathcal{N}(\mu_{z_{i}}, \sigma^{2})$, based on the above cluster parameters.
The easy way to remember this model is with a graphical model: $\alpha \to \rho \to z$, $H \to \mu$, $(z, \mu) \to x$.
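Putting the pieces together, here is a sketch of the generative process (1-D Gaussians, truncated GEM weights; all hyperparameter values are arbitrary choices of mine):

```python
import numpy as np

def sample_dp_mixture(n, alpha, mu0=0.0, sigma0=5.0, sigma=0.5,
                      truncation=100, rng=None):
    """rho ~ GEM(alpha), mu_k ~ N(mu0, sigma0^2), z_i ~ Cat(rho),
    x_i ~ N(mu_{z_i}, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=truncation)
    rho = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    rho /= rho.sum()                               # renormalize the truncated weights
    mu = rng.normal(mu0, sigma0, size=truncation)  # cluster centers from the base measure
    z = rng.choice(truncation, size=n, p=rho)      # cluster assignments
    x = rng.normal(mu[z], sigma)                   # observed points
    return x, z, mu

x, z, mu = sample_dp_mixture(n=500, alpha=1.0)
print(np.unique(z).size, "clusters actually used")
```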
Fitting a learner 🟥
EM (see Clustering) is difficult to use for non-parametric models. We will use Gibbs Sampling, introduced in Monte Carlo Methods.
$$ p(z_{i} = k \mid z_{-i}, \boldsymbol{x}, \alpha, \boldsymbol{\mu}) \propto p(z_{i} = k \mid z_{-i}, \alpha) \cdot p(x_{i} \mid \boldsymbol{z}, \boldsymbol{x}_{-i}, \boldsymbol{\mu}) $$The first factor is the CRP prior over the assignment and the second is the likelihood of the point under the chosen cluster; the resulting probabilities are shown in the image.

Code example: just run it in a notebook. I noticed that if we initialize the code with all points assigned to a single cluster, the model seems to fail to learn that there are other clusters; this is probably because the rich-get-richer effect is not enough to explore the space of clusterings. If we initialize the code with a random assignment over the maximum number of clusters, then it doesn’t seem to be able to eliminate clusters correctly. It is quite probable that we would need to tune the $\alpha$ value, but it is probably quite time consuming to find a good one.
A More Detailed Walk-Through 🟨–
Let’s outline the steps more algorithmically:
- Initialization:
    - Initialize the cluster assignments $z_i$ arbitrarily (for example, each data point in its own cluster or by some heuristic).
    - For each unique cluster, initialize the parameter $\theta_k$ by drawing from the base measure $G_0$ (the $H$ above) or by using an initial estimate from the assigned data.
- Gibbs Sampling Loop: for each iteration, and for each data point $y_i$:
    - a. Remove $y_i$ from its current cluster: adjust the counts $n_{-i,k}$ for the cluster $k$ that $y_i$ belonged to. If this removal makes the cluster empty, remove the cluster altogether.
    - b. Compute assignment probabilities:
        - For each existing cluster $k$: $$ p_k \propto n_{-i,k} \, f(y_i \mid \theta_k). $$
        - For a new cluster: $$ p_{\text{new}} \propto \alpha \, m(y_i), \quad \text{where } m(y_i)=\int f(y_i \mid \theta) \, dG_0(\theta) $$ is the marginal likelihood for $y_i$ under the base measure.
    - c. Sample a new assignment for $y_i$ from the discrete distribution defined by $\{p_k, p_{\text{new}}\}$.
    - d. If a new cluster is chosen: draw a new parameter $\theta_{\text{new}}$ from the posterior given $y_i$ (or directly from $G_0$ if you wish to assign it prior to any data update).
    - e. Update cluster parameters: for each cluster $k$, sample $$ \theta_k \sim p(\theta_k \mid \{y_i : z_i = k\}). $$
- Convergence and Inference: after a sufficient number of iterations, the samples of $\{z_i\}$ and $\{\theta_k\}$ represent draws from the posterior distribution of the DP mixture model. You can then estimate cluster structures, predictive densities, or other quantities of interest.
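Below is a minimal sketch of this loop for a 1-D Gaussian DP mixture with known observation variance and a conjugate $\mathcal{N}(\mu_0, \sigma_0^2)$ base measure; function names and hyperparameter values are my own and are only meant to illustrate the steps above, not to reproduce the code from the notebook.

```python
import numpy as np
from scipy.stats import norm

def dpmm_gibbs(y, alpha=1.0, sigma=0.5, mu0=0.0, sigma0=5.0,
               n_iters=100, rng=None):
    """Gibbs sampling for a 1-D Gaussian DP mixture with known variance sigma^2."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    z = np.zeros(n, dtype=int)          # start with everything in one cluster
    mus = {0: rng.normal(mu0, sigma0)}  # cluster parameters theta_k

    def posterior_draw(points):
        # conjugate Gaussian update for a cluster mean given its points
        prec = 1.0 / sigma0**2 + len(points) / sigma**2
        mean = (mu0 / sigma0**2 + np.sum(points) / sigma**2) / prec
        return rng.normal(mean, np.sqrt(1.0 / prec))

    for _ in range(n_iters):
        for i in range(n):
            # a. remove y_i from its cluster; drop any cluster that empties
            z[i] = -1
            for k, c in {k: np.sum(z == k) for k in list(mus)}.items():
                if c == 0:
                    del mus[k]
            # b. assignment probabilities for existing clusters and a new one
            ks = list(mus)
            p = [np.sum(z == k) * norm.pdf(y[i], mus[k], sigma) for k in ks]
            p.append(alpha * norm.pdf(y[i], mu0, np.sqrt(sigma0**2 + sigma**2)))
            p = np.array(p) / np.sum(p)
            # c. sample the new assignment
            choice = rng.choice(len(p), p=p)
            if choice == len(ks):
                # d. new cluster: draw its mean from the posterior given y_i alone
                k_new = max(mus, default=-1) + 1
                mus[k_new] = posterior_draw(y[i:i + 1])
                z[i] = k_new
            else:
                z[i] = ks[choice]
        # e. refresh every cluster mean given its currently assigned points
        for k in list(mus):
            mus[k] = posterior_draw(y[z == k])
    return z, mus

# hypothetical toy data: three well separated 1-D clusters
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(m, 0.5, 100) for m in (-5.0, 0.0, 5.0)])
z, mus = dpmm_gibbs(y, alpha=1.0, n_iters=50)
print(len(mus), "clusters found")
```

Note that this variant keeps explicit cluster means and resamples them in step e; a collapsed version would instead integrate the means out and use the cluster posterior predictive in step b.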
Drawbacks of the Dirichlet Process
We provide here a summary of the main drawbacks of the Dirichlet Process:
| Category | Drawback |
| --- | --- |
| Growth Assumptions | Logarithmic growth may not fit real-world data |
| Cluster Size Bias | Overemphasis on large clusters; poor handling of outliers |
| Hyperparameter Sensitivity | Choice of $\alpha$ and $G_{0}$ (base measure) affects clustering |
| Inference Complexity | MCMC/Variational methods are computationally expensive |
| Exchangeability Assumption | Fails for temporally or spatially structured data |
| Expressiveness | Limited ability to handle complex or hierarchical data relationships |
| Scalability | Standard DP methods struggle with large datasets |
| Task Restriction | Primarily a clustering tool, requiring extensions for other tasks |
We observe that the main points are the exchangeability assumption, the growth assumptions (in real life some clusters may grow linearly or exponentially), and the scalability.