The beta distribution

The beta distribution is a powerful tool for modeling probabilities and proportions between 0 and 1. Here’s a structured intuition to grasp its essence:

Core Concept

The beta distribution, defined on $[0, 1]$, is parameterized by two shape parameters: α (alpha) and β (beta). These parameters dictate the distribution’s shape, allowing it to flexibly represent beliefs about probabilities, rates, or proportions.

Key Intuitions

a. “Pseudo-Counts” Interpretation

  • α acts like “successes” and β like “failures” in a hypothetical experiment.
    • Example: If you use Beta(5, 3), it’s as if you’ve observed 5 successes and 3 failures before seeing actual data.
  • After observing x real successes and y real failures, the posterior becomes Beta(α+x, β+y). This makes the beta distribution the conjugate prior for the binomial likelihood (Bernoulli process), as in the sketch below.
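
A minimal sketch of the pseudo-count update (the prior Beta(5, 3) and the counts x = 7, y = 2 are made-up numbers; scipy is assumed available):

```python
# Conjugate Beta-Binomial update: prior Beta(α, β) + data (x successes, y failures)
# → posterior Beta(α + x, β + y). All numbers here are illustrative.
from scipy import stats

alpha, beta_ = 5, 3                           # prior pseudo-counts
x, y = 7, 2                                   # hypothetical observed successes / failures

posterior = stats.beta(alpha + x, beta_ + y)  # Beta(α + x, β + y)
print(posterior.mean())                       # (α + x) / (α + β + x + y) = 12/17 ≈ 0.706
```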

b. Shape Flexibility

  • Uniform distribution: When α = β = 1, all values in [0, 1] are equally likely.
  • Bell-shaped: When α, β > 1, the distribution peaks at mode = (α-1)/(α+β-2).
    • Symmetric if α = β (e.g., Beta(5, 5) is centered at 0.5).
  • U-shaped: When α, β < 1, the density spikes near 0 and 1 (useful for modeling polarization, i.e., we believe the underlying probability is very likely to be near 0 or 1 rather than in the middle).
  • Skewed: If α > β, the distribution is skewed toward 1; if β > α, skewed toward 0. (A few of these shapes are evaluated in the sketch below.)
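
A quick sketch of the shapes described above, evaluating the density on a grid (the specific (α, β) pairs are illustrative; scipy/numpy assumed):

```python
# Evaluate the Beta pdf for a few (α, β) choices to see uniform, bell, U-shaped and skewed shapes.
import numpy as np
from scipy.stats import beta

grid = np.linspace(0.01, 0.99, 5)
shapes = {"uniform": (1, 1), "bell": (5, 5), "U-shaped": (0.5, 0.5), "skewed toward 1": (5, 2)}
for name, (a, b) in shapes.items():
    print(f"{name:>16}: {np.round(beta.pdf(grid, a, b), 3)}")
```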

c. Moments

  • Mean: $α/(α+β)$ – your “expected” probability of success.
  • Variance: $αβ / [(α+β)²(α+β+1)]$ – decreases as α and β grow (more confidence).
  • Mode: $(α-1)/(α+β-2)$ – the most likely value, valid when α, β > 1.
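
As a quick sanity check, the closed-form moments can be compared against scipy for an example Beta(5, 3) (the parameters are arbitrary):

```python
# Closed-form mean / variance / mode of Beta(a, b), checked against scipy.
from scipy.stats import beta

a, b = 5, 3
mean = a / (a + b)                          # 0.625
var = a * b / ((a + b) ** 2 * (a + b + 1))  # ≈ 0.0260
mode = (a - 1) / (a + b - 2)                # ≈ 0.667 (only valid for a, b > 1)

assert abs(mean - beta.mean(a, b)) < 1e-12
assert abs(var - beta.var(a, b)) < 1e-12
print(mean, var, mode)
```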

The mathematical model

$$ \text{Beta} (x \mid a, b) = \frac{1}{B(a, b)} \cdot x^{a -1 }(1 - x)^{b - 1} $$

Where $B(a, b) = \Gamma(a) \Gamma(b) / \Gamma(a + b)$ and $\Gamma(t) = \int_{0}^{\infty}e^{-x}x^{t - 1} \, dx$.
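
The density can be written straight from this formula; a small sketch using math.gamma for Γ, compared against scipy (the evaluation point and parameters are arbitrary):

```python
# Beta pdf implemented directly from the definition above.
import math
from scipy.stats import beta

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # normalizing constant B(a, b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

x, a, b = 0.3, 5.0, 3.0
print(beta_pdf(x, a, b), beta.pdf(x, a, b))  # the two values should agree
```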

Visualization & Parameter Tuning

  • Mean and Variance Control:
    To design a beta distribution with mean μ and variance σ², solve:
    • α = μ(μ(1−μ)/σ² − 1)
    • β = (1−μ)(μ(1−μ)/σ² − 1)
      Example: For μ = 0.5, σ² = 0.01, this gives Beta(12, 12) (checked in the sketch below).
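
A short check of this recipe (target moments taken from the example above; scipy assumed):

```python
# Recover (α, β) from a target mean μ and variance σ², then verify the moments.
from scipy.stats import beta

mu, var = 0.5, 0.01                # target mean and variance
nu = mu * (1 - mu) / var - 1       # common factor μ(1−μ)/σ² − 1
a, b = mu * nu, (1 - mu) * nu      # → (12.0, 12.0)

print(a, b, beta.mean(a, b), beta.var(a, b))  # 12.0 12.0 0.5 0.01
```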

Dirichlet Distribution

The Dirichlet distribution is the multivariate generalization of the beta distribution: instead of a single probability, it models a vector of $k$ probabilities that sum to 1.

A mathematical definition

$$ \text{Dir}(\rho_{1}, \dots, \rho_{k};\alpha_{1}, \dots, \alpha_{k}) = \frac{1}{B(\alpha)} \prod_{i = 1}^{k} \rho_{i}^{\alpha_{i} - 1} $$

Where $\rho_{i} \geq 0$, $\sum_{i=1}^{k} \rho_{i} = 1$, and $B(\alpha) = \prod_{i=1}^{k} \Gamma(\alpha_{i}) \big/ \Gamma\left(\sum_{i=1}^{k} \alpha_{i}\right)$ is the multivariate beta function, with the gamma function defined as

$$ \Gamma(x) = \int_{0}^{\infty} t^{x - 1} e^{-t} \, dt $$

The gamma function has the nice properties $\Gamma(x + 1) = x \Gamma(x)$ and $\Gamma(1) = 1$, which is why it can be seen as a generalization of the factorial. The important thing to note about the Dirichlet distribution is that it is the conjugate prior of the multinomial distribution: if we have a multinomial likelihood with a Dirichlet prior, then the posterior is also a Dirichlet distribution. This lets us update our prior with the data we observe, which is a nice property for Bayesian inference. You can learn more about prior updates in Bayesian Linear Regression.
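
A tiny numerical check of the gamma-function properties mentioned above (the test values are arbitrary):

```python
# Γ(x + 1) = x·Γ(x), and Γ(n + 1) = n! for integer n.
import math

x = 3.7
print(math.gamma(x + 1), x * math.gamma(x))  # equal up to floating-point error

n = 5
print(math.gamma(n + 1), math.factorial(n))  # both 120
```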

The Dirichlet process (DP) effectively defines a conjugate prior for arbitrary measurable spaces.

The following sections are based on (Murphy 2012).

Computing the posterior

Recall that the multinomial likelihood is $p(\mathcal{D}|\theta) = \prod_{k=1}^K \theta_k^{N_k}$, where $N_k$ is the number of observations in category $k$.

Then we see that the Dirichlet is the conjugate prior for this distribution, so the posterior is:

$$
\begin{align}
p(\theta|\mathcal{D}) &\propto p(\mathcal{D}|\theta)\,p(\theta) \\
&\propto \prod_{k=1}^K \theta_k^{N_k} \theta_k^{\alpha_k-1} = \prod_{k=1}^K \theta_k^{\alpha_k+N_k-1} \\
&= \text{Dir}(\theta|\alpha_1 + N_1,\ldots,\alpha_K + N_K)
\end{align}
$$

Which is the correct posterior.
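
A minimal sketch of this conjugate update (the prior concentrations and the observed counts are illustrative; scipy/numpy assumed):

```python
# Dirichlet-multinomial conjugate update: posterior is Dir(α₁ + N₁, …, α_K + N_K).
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([1.0, 2.0, 3.0])      # prior concentration parameters α_k
counts = np.array([10.0, 0.0, 5.0])    # observed counts N_k

posterior = dirichlet(alpha + counts)  # Dir(θ | α + N)
print(posterior.mean())                # (α_k + N_k) / Σ_j (α_j + N_j)
```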

The Mode of the distribution

Using Lagrange Multipliers, we can find that the mode of the Dirichlet distribution is:

$$ \hat{\theta}_{k} = \frac{N_{k}+\alpha_{k} - 1}{N+\sum_{i=1}^{K} \alpha_{i} - K} $$

Where $K$ is the number of categories (clusters) and $N = \sum_{k} N_{k}$ is the total number of samples observed from the multinomial process. We notice that the form becomes quite nice under the uniform prior $\alpha_{k} = 1$ for all $k$: the mode then reduces to the maximum-likelihood estimate $N_{k}/N$.
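
A sketch of the MAP estimate under a uniform prior, compared against the maximum-likelihood estimate (the counts are illustrative; numpy assumed):

```python
# Mode (MAP estimate) of the Dirichlet posterior vs. the MLE N_k / N.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])    # uniform prior, α_k = 1
counts = np.array([10.0, 0.0, 5.0])  # observed counts N_k
N, K = counts.sum(), len(counts)

theta_map = (counts + alpha - 1) / (N + alpha.sum() - K)
print(theta_map, counts / N)         # identical: with α_k = 1 the MAP equals the MLE
```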

The Posterior predictive

$$
\begin{align}
p(X = j|\mathcal{D}) &= \int p(X = j|\theta)\,p(\theta|\mathcal{D})\,d\theta \\
&= \int p(X = j|\theta_j)\left[\int p(\theta_{-j},\theta_j|\mathcal{D})\,d\theta_{-j}\right] d\theta_j \\
&= \int \theta_j\,p(\theta_j|\mathcal{D})\,d\theta_j = \mathbb{E}[\theta_j|\mathcal{D}] = \frac{\alpha_j + N_j}{\sum_k(\alpha_k + N_k)} = \frac{\alpha_j + N_j}{\sum_{k}\alpha_{k} + N}
\end{align}
$$

We notice that this is a form of Bayesian smoothing: even a category with zero observed counts receives nonzero predictive probability.
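
A sketch of the predictive formula with α_k = 1, which corresponds to add-one (Laplace) smoothing (counts illustrative; numpy assumed):

```python
# Posterior predictive (α_j + N_j) / (Σ_k α_k + N); with α_k = 1 this is add-one smoothing.
import numpy as np

alpha = np.ones(3)                   # α_k = 1
counts = np.array([10.0, 0.0, 5.0])  # observed counts N_k

pred = (alpha + counts) / (alpha.sum() + counts.sum())
print(pred)  # the zero-count category still receives nonzero probability
```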

References

[1] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.