The beta distribution
The beta distribution is a powerful tool for modeling probabilities and proportions between 0 and 1. Here’s a structured intuition to grasp its essence:
Core Concept
The beta distribution, defined on $[0, 1]$, is parameterized by two shape parameters: α (alpha) and β (beta). These parameters dictate the distribution’s shape, allowing it to flexibly represent beliefs about probabilities, rates, or proportions.
Key Intuitions
a. “Pseudo-Counts” Interpretation
- α acts like “successes” and β like “failures” in a hypothetical experiment.
- Example: If you use Beta(5, 3), it’s as if you’ve observed 5 successes and 3 failures before seeing actual data.
- After observing x real successes and y real failures, the posterior becomes Beta(α+x, β+y). This makes the beta distribution the conjugate prior for the binomial likelihood (a sequence of Bernoulli trials).
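As a quick illustration of the pseudo-count update (a minimal sketch assuming scipy; the counts are invented):

```python
from scipy import stats

# Prior pseudo-counts: as if we had already seen 5 successes and 3 failures
alpha_prior, beta_prior = 5, 3

# Hypothetical new observations: 10 successes, 4 failures
successes, failures = 10, 4

# Conjugacy: the posterior is again a beta, with the counts simply added
posterior = stats.beta(alpha_prior + successes, beta_prior + failures)

print(posterior.mean())  # (5 + 10) / (5 + 10 + 3 + 4) = 15/22 ≈ 0.68
```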
b. Shape Flexibility
- Uniform distribution: When α = β = 1, all values in [0, 1] are equally likely.
- Bell-shaped: When α, β > 1, the distribution peaks at mode = (α-1)/(α+β-2).
- Symmetric if α = β (e.g., Beta(5, 5) is centered at 0.5).
- U-shaped: When α, β < 1, the density spikes near 0 and 1 (useful for modeling polarization, i.e. we believe the underlying probability is close to 0 or close to 1, rarely in the middle).
- Skewed: If α > β, skewed toward 1; if β > α, skewed toward 0.
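A small sketch (scipy assumed, parameter values invented) that evaluates the density for these regimes; plotting `x` against each `pdf` reproduces the flat, bell, U, and skewed shapes:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.01, 0.99, 99)

shapes = {
    "uniform  (a=1,   b=1)":   (1.0, 1.0),   # flat on [0, 1]
    "bell     (a=5,   b=5)":   (5.0, 5.0),   # symmetric, peaked at 0.5
    "U-shaped (a=0.5, b=0.5)": (0.5, 0.5),   # mass piles up near 0 and 1
    "skewed   (a=5,   b=2)":   (5.0, 2.0),   # alpha > beta: skewed toward 1
}

for name, (a, b) in shapes.items():
    pdf = stats.beta(a, b).pdf(x)  # evaluate the density on a grid
    # for the flat and U-shaped cases the argmax just lands on a grid edge
    print(f"{name}: max density {pdf.max():.2f} at x = {x[np.argmax(pdf)]:.2f}")
```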
c. Moments
- Mean: $α/(α+β)$ – your “expected” probability of success.
- Variance: $αβ / [(α+β)²(α+β+1)]$ – decreases as α and β grow (more confidence).
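A quick sanity check of these two formulas against scipy (values invented):

```python
from scipy import stats

a, b = 5.0, 3.0
mean = a / (a + b)                          # 0.625
var = a * b / ((a + b) ** 2 * (a + b + 1))  # ≈ 0.026

m, v = stats.beta(a, b).stats(moments="mv")
assert abs(mean - m) < 1e-12 and abs(var - v) < 1e-12
```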
The mathematical model
$$ \text{Beta} (x \mid a, b) = \frac{1}{B(a, b)} \, x^{a - 1}(1 - x)^{b - 1} $$
where $B(a, b) = \Gamma(a) \Gamma(b) / \Gamma(a + b)$ and $\Gamma(t) = \int_{0}^{\infty}e^{-x}x^{t - 1} \, dx$.
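A direct transcription of this density (a sketch assuming scipy.special; scipy's built-in beta pdf is used only to check agreement):

```python
import numpy as np
from scipy import special, stats

def beta_pdf(x, a, b):
    """Beta density written exactly as above, with B(a, b) built from gamma functions."""
    B = special.gamma(a) * special.gamma(b) / special.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

x = np.linspace(0.05, 0.95, 19)
assert np.allclose(beta_pdf(x, 2.0, 3.5), stats.beta(2.0, 3.5).pdf(x))
```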
Visualization & Parameter Tuning
- Mean and Variance Control: to design a beta distribution with mean μ and variance σ², solve
  - α = μ(μ(1−μ)/σ² − 1)
  - β = (1−μ)(μ(1−μ)/σ² − 1)

Example: For μ = 0.5, σ² = 0.01, this gives Beta(12, 12) (see the sketch below).
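A minimal helper for this inversion (the function name is my own; the formulas are the ones above):

```python
def beta_params_from_moments(mu, var):
    """Method of moments: recover (alpha, beta) from a target mean and variance."""
    if not 0 < var < mu * (1 - mu):
        raise ValueError("need 0 < var < mu * (1 - mu) for a valid beta distribution")
    common = mu * (1 - mu) / var - 1
    return mu * common, (1 - mu) * common

print(beta_params_from_moments(0.5, 0.01))  # (12.0, 12.0) -> Beta(12, 12)
```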
Dirichlet Distribution
The Dirichlet Distribution is a generalization of the Beta distribution.
A mathematical definition
$$ \text{Dir}(\rho_{1}, \dots, \rho_{k};\alpha_{1}, \dots, \alpha_{k}) = \frac{1}{B(\alpha)} \prod_{i = 1}^{k} \rho_{i}^{\alpha_{i} - 1} $$
where the normalizer is the multivariate beta function $B(\alpha) = \prod_{i=1}^{k}\Gamma(\alpha_{i}) / \Gamma\left(\sum_{i=1}^{k}\alpha_{i}\right)$ and
$$ \Gamma(x) = \int_{0}^{\infty} t^{x - 1} e^{-t} \, dt $$
The gamma function has the nice properties $\Gamma(x + 1) = x \Gamma(x)$ and $\Gamma(1) = 1$, which is why it can be seen as a generalization of the factorial. The important thing to note about this distribution is that it is the conjugate prior of the multinomial distribution: if we have a multinomial likelihood with a Dirichlet prior, then the posterior is also a Dirichlet distribution. This allows us to update our prior with the data we observe, which is a key property for Bayesian inference. You can learn more about prior updates in Bayesian Linear Regression.
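As a quick check of the density and the normalizer $B(\alpha)$ above (a sketch assuming scipy; the evaluation point and parameters are invented):

```python
import numpy as np
from scipy import special, stats

def dirichlet_pdf(rho, alpha):
    """Dirichlet density with B(alpha) built from gamma functions, as in the formula above."""
    B = np.prod(special.gamma(alpha)) / special.gamma(np.sum(alpha))
    return np.prod(rho ** (alpha - 1)) / B

rho = np.array([0.2, 0.3, 0.5])     # a point on the probability simplex
alpha = np.array([2.0, 3.0, 4.0])
assert np.isclose(dirichlet_pdf(rho, alpha), stats.dirichlet(alpha).pdf(rho))
```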
The Dirichlet process (DP) extends this further, effectively defining a conjugate prior over arbitrary measurable spaces.
The following sections draw on (Murphy 2012).
Computing the posterior
Recall that the multinomial likelihood is $p(\mathcal{D}|\theta) = \prod_{k=1}^K \theta_k^{N_k}$, where $N_k$ is the number of observations falling in category $k$.
Since the Dirichlet is the conjugate prior for this likelihood, the posterior is:
$$ \begin{align} p(\theta|\mathcal{D}) &\propto p(\mathcal{D}|\theta)\,p(\theta) \\ &\propto \prod_{k=1}^K \theta_k^{N_k} \theta_k^{\alpha_k-1} = \prod_{k=1}^K \theta_k^{\alpha_k+N_k-1} \\ &\propto \text{Dir}(\theta|\alpha_1 + N_1,\ldots,\alpha_K + N_K) \end{align} $$
which is again a Dirichlet, as conjugacy requires.
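In code, conjugacy makes the update a single addition of count vectors (a sketch with invented counts):

```python
import numpy as np
from scipy import stats

alpha = np.array([1.0, 1.0, 1.0])   # uniform Dirichlet prior over K = 3 categories
counts = np.array([10, 2, 5])       # N_k: observed counts per category

posterior = stats.dirichlet(alpha + counts)   # Dir(alpha_k + N_k)
print(posterior.mean())   # E[theta_k | D] = (alpha_k + N_k) / (sum_k alpha_k + N)
```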
The Mode of the distribution
Using Lagrange Multipliers, we can find that the mode of the Dirichlet distribution is:
$$ \hat{\theta}_{k} = \frac{N_{k}+\alpha_{k} - 1}{N+\sum_{i=1}^{K} \alpha_{i} - K} $$
where $K$ is the number of categories and $N = \sum_{k} N_{k}$ is the total number of observations. The form is especially nice under the uniform prior $\alpha_{k} = 1$ for all $k$: the mode then reduces to the maximum-likelihood estimate $N_{k}/N$.
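The corresponding MAP estimate, reusing the invented counts from the sketch above:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])
counts = np.array([10, 2, 5])
K, N = len(counts), counts.sum()

theta_map = (counts + alpha - 1) / (N + alpha.sum() - K)
print(theta_map)   # with alpha_k = 1 this is just counts / N, the MLE
```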
The Posterior predictive
$$ \begin{align} p(X = j|\mathcal{D}) &= \int p(X = j|\theta)\,p(\theta|\mathcal{D})\,d\theta \\ &= \int p(X = j|\theta_j)\left[\int p(\theta_{-j},\theta_j|\mathcal{D})\,d\theta_{-j}\right] d\theta_j \\ &= \int \theta_j\,p(\theta_j|\mathcal{D})\,d\theta_j = \mathbb{E}[\theta_j|\mathcal{D}] = \frac{\alpha_j + N_j}{\sum_k(\alpha_k + N_k)} = \frac{\alpha_j + N_j}{\sum_{k}\alpha_{k} + N} \end{align} $$
We notice that this is a form of Bayesian smoothing.
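A final sketch of the posterior predictive; with $\alpha_k = 1$ this is exactly add-one (Laplace) smoothing:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])
counts = np.array([10, 2, 5])

# p(X = j | D) = (alpha_j + N_j) / (sum_k alpha_k + N)
predictive = (alpha + counts) / (alpha.sum() + counts.sum())
print(predictive)   # [0.55, 0.15, 0.30] -- no category is ever assigned zero probability
```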
References
[1] K. P. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press, 2012.