The beta distribution

The beta distribution is a powerful tool for modeling probabilities and proportions between 0 and 1. Here's a structured intuition to grasp its essence:

Core Concept

The beta distribution, defined on the interval [0, 1], is parameterized by two positive shape parameters: α (alpha) and β (beta). These parameters dictate the distribution’s shape, allowing it to flexibly represent beliefs about probabilities, rates, or proportions.

Key Intuitions

a. "Pseudo-Counts" Interpretation

  • α acts like "successes" and β like "failures" in a hypothetical experiment.
    • Example: If you use Beta(5, 3), it’s as if you’ve observed 5 successes and 3 failures before seeing actual data.
  • After observing x real successes and y real failures, the posterior becomes Beta(α+x, β+y). This makes the beta distribution the conjugate prior for the binomial likelihood (and for the Bernoulli process that underlies it).
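
The pseudo-count update can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `update_beta` and the observed counts (7 successes, 2 failures) are made up for the example.

```python
def update_beta(alpha, beta, successes, failures):
    """Return the posterior Beta parameters after observing binomial data."""
    # Conjugacy: observed counts are simply added to the prior pseudo-counts.
    return alpha + successes, beta + failures


# Prior Beta(5, 3): as if 5 successes and 3 failures had already been seen.
prior_alpha, prior_beta = 5, 3

# Hypothetical data: 7 real successes and 2 real failures.
post_alpha, post_beta = update_beta(prior_alpha, prior_beta, successes=7, failures=2)
print(post_alpha, post_beta)  # 12 5
```

The posterior Beta(12, 5) can itself serve as the prior for the next batch of data, which is what makes the pseudo-count reading convenient.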

b. Shape Flexibility

  • Uniform distribution: When α = β = 1, all values in [0, 1] are equally likely.
  • Bell-shaped: When α, β > 1, the distribution peaks at mode = (α-1)/(α+β-2).
    • Symmetric if α = β (e.g., Beta(5, 5) is centered at 0.5).
  • U-shaped: When α, β < 1, the density spikes at 0 and 1. This is useful for modeling polarization, where we believe values fall near 0 or near 1 rather than in the middle.
  • Skewed: If α > β, the mass concentrates toward 1; if β > α, it concentrates toward 0.

c. Moments

  • Mean: E[X] = α/(α+β) – your "expected" probability of success.
  • Variance: Var[X] = αβ/((α+β)²(α+β+1)) – decreases as α and β grow (more confidence).
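
As a quick numerical check of the "more confidence" claim, the sketch below evaluates the standard closed forms, mean α/(α+β) and variance αβ/((α+β)²(α+β+1)), for two distributions with the same mean. The helper names `beta_mean` and `beta_variance` are illustrative, not taken from any library.

```python
def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of Beta(alpha, beta)."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


# Same mean (0.625), but ten times as many pseudo-counts in the second case.
for alpha, beta in [(5, 3), (50, 30)]:
    print(alpha, beta, beta_mean(alpha, beta), beta_variance(alpha, beta))
```

Scaling both parameters by the same factor leaves the mean unchanged and shrinks the variance, which matches the reading of α+β as the amount of evidence behind the belief.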

One can also compute the mode and discover that, for α, β > 1, it is:

$$\text{mode} = \frac{\alpha - 1}{\alpha + \beta - 2}$$

The mathematical model

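The probability density function of the beta distribution is:

$$f(x; \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}, \quad x \in [0, 1]$$

Where

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\, \Gamma(\beta)}{\Gamma(\alpha + \beta)}$$

And $\Gamma$ denotes the gamma function.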
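
To make the model concrete, here is a small sketch that evaluates the standard beta density x^(α−1)(1−x)^(β−1)/B(α, β), with B(α, β) = Γ(α)Γ(β)/Γ(α+β). It uses only Python's `math` module; the function name `beta_pdf` is an illustrative choice, and working in log space through `math.lgamma` is one option among several for avoiding overflow with large parameters.

```python
import math


def beta_pdf(x, alpha, beta):
    """Density of Beta(alpha, beta) at a point x strictly inside (0, 1)."""
    if not 0 < x < 1:
        raise ValueError("x must lie strictly between 0 and 1")
    # log B(alpha, beta) = log Gamma(alpha) + log Gamma(beta) - log Gamma(alpha + beta)
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    log_density = (alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x) - log_b
    return math.exp(log_density)


# Sanity check: a midpoint Riemann sum over (0, 1) should be close to 1.
n = 10_000
area = sum(beta_pdf((i + 0.5) / n, 2, 5) for i in range(n)) / n
print(area)

# Uniform, bell-shaped and U-shaped cases evaluated at the midpoint x = 0.5.
for alpha, beta in [(1, 1), (5, 5), (0.5, 0.5)]:
    print(alpha, beta, beta_pdf(0.5, alpha, beta))
```

The endpoints are excluded on purpose: for α < 1 or β < 1 the density is unbounded there, and the logarithm would fail.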

Visualization & Parameter Tuning

  • Mean and Variance Control:
    To design a beta distribution with mean μ and variance σ² (this requires σ² < μ(1−μ)), set:
    • α = μ(μ(1−μ)/σ² − 1)
    • β = (1−μ)(μ(1−μ)/σ² − 1)
      Example: For μ = 0.5, σ² = 0.01, use Beta(12, 12).
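
The two formulas translate directly into code. The sketch below is a minimal version under the assumption that 0 < μ < 1 and 0 < σ² < μ(1−μ), since otherwise no beta distribution has those moments; the function name `beta_from_mean_variance` is hypothetical.

```python
def beta_from_mean_variance(mean, variance):
    """Return (alpha, beta) of the beta distribution with the given mean and variance."""
    if not 0 < mean < 1:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0 < variance < mean * (1 - mean):
        raise ValueError("variance must lie strictly between 0 and mean * (1 - mean)")
    # Shared factor mu(1 - mu)/sigma^2 - 1, which equals alpha + beta.
    common = mean * (1 - mean) / variance - 1
    return mean * common, (1 - mean) * common


alpha, beta = beta_from_mean_variance(0.5, 0.01)
print(alpha, beta)  # approximately 12 and 12, up to floating-point rounding
```

The shared factor equals α+β, so a smaller target variance yields larger pseudo-counts.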

Dirichlet Distribution

The Dirichlet Distribution is a generalization of the Beta distribution.

A mathematical definition

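For a parameter vector $\alpha = (\alpha_1, \dots, \alpha_K)$ with $\alpha_k > 0$, it is defined as:

$$\mathrm{Dir}(\theta \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$

Where $B(\alpha) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}$ is the normalization constant. And $\theta_k \geq 0$ and $\sum_{k=1}^{K} \theta_k = 1$. The function $\Gamma$ is defined as

$$\Gamma(z) = \int_0^\infty t^{z - 1} e^{-t} \, dt$$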

The gamma function $\Gamma$ has the nice properties $\Gamma(1) = 1$ and $\Gamma(z + 1) = z\, \Gamma(z)$, so that $\Gamma(n + 1) = n!$ for non-negative integers $n$; this is why the gamma function can be seen as a generalization of the factorial. The important thing to note about the Dirichlet distribution is that it is the conjugate prior of the multinomial distribution. So, if we have a multinomial likelihood with a Dirichlet prior, then the posterior is also a Dirichlet distribution. This lets us update our prior with the data we have, which is a convenient property for Bayesian inference. You can learn more about prior updates in Bayesian Linear Regression.
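
A short sketch can make both points tangible: the factorial property of the gamma function and the Dirichlet density itself. It assumes the standard form of the density, a product of θ_k^(α_k − 1) divided by the normalizer ∏Γ(α_k)/Γ(Σα_k), and uses only Python's `math` module. The function name `dirichlet_pdf` and the test points are illustrative.

```python
import math


def dirichlet_pdf(theta, alpha):
    """Density of Dirichlet(alpha) at a point theta in the interior of the simplex."""
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    if any(t <= 0 for t in theta) or abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("theta must be strictly positive and sum to 1")
    # log of the normalization constant B(alpha)
    log_norm = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    log_kernel = sum((a - 1) * math.log(t) for a, t in zip(alpha, theta))
    return math.exp(log_kernel - log_norm)


# Gamma(n + 1) agrees with n! for small non-negative integers.
for n in range(6):
    print(n, math.gamma(n + 1), math.factorial(n))

# With K = 2 the Dirichlet reduces to a beta: this matches Beta(2, 2) at 0.5.
print(dirichlet_pdf([0.5, 0.5], [2, 2]))

# With all alpha_k = 1 the density is constant over the simplex.
print(dirichlet_pdf([0.2, 0.3, 0.5], [1, 1, 1]))
```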

The Dirichlet process (DP) effectively defines a conjugate prior for arbitrary measurable spaces.

The following sections have content brought to you by (Murphy 2012).

Computing the posterior

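Recall that the multinomial likelihood of a dataset $D$ in which category $k$ was observed $N_k$ times, under category probabilities $\theta = (\theta_1, \dots, \theta_K)$, is (up to the multinomial coefficient, which does not depend on $\theta$)

$$p(D \mid \theta) = \prod_{k=1}^{K} \theta_k^{N_k}$$

Then we see that the Dirichlet is the conjugate prior for this distribution, so the posterior is:

$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) \propto \prod_{k=1}^{K} \theta_k^{N_k}\, \theta_k^{\alpha_k - 1} = \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}$$

This is the kernel of $\mathrm{Dir}(\theta \mid \alpha_1 + N_1, \dots, \alpha_K + N_K)$, so the posterior is again a Dirichlet distribution.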
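
In code, the conjugate update amounts to adding the observed category counts to the prior parameters. The sketch below assumes observations are encoded as category indices 0..K−1; the function name `dirichlet_posterior` and the sample data are made up for illustration.

```python
from collections import Counter


def dirichlet_posterior(alpha, observations):
    """Posterior Dirichlet parameters after observing a list of category indices."""
    counts = Counter(observations)
    # Categories that never occur contribute a count of zero.
    return [a + counts.get(k, 0) for k, a in enumerate(alpha)]


prior = [1.0, 1.0, 1.0]                 # uniform prior over three categories
draws = [0, 2, 2, 0, 2, 2, 2, 0, 2, 2]  # hypothetical data: counts are 3, 0, 7
print(dirichlet_posterior(prior, draws))  # [4.0, 1.0, 8.0]
```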

The Mode of the distribution

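Using Lagrange multipliers to enforce the constraint $\sum_k \theta_k = 1$, we can find that the mode of the posterior Dirichlet distribution is:

$$\hat{\theta}_k = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K}$$

Where $K$ is the number of clusters, $N = \sum_k N_k$ is the number of new samples after running the multinomial process, $N_k$ is the number of those samples that fall in cluster $k$, and $\alpha_0 = \sum_k \alpha_k$. We notice that its form is quite nice when we set the uniform prior $\alpha_k = 1$: the mode reduces to $\hat{\theta}_k = N_k / N$, the maximum likelihood estimate.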
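
As a worked check, the sketch below computes the posterior mode with the standard formula (N_k + α_k − 1)/(N + α_0 − K), where N_k are the observed counts, N their sum, α_0 the sum of the prior parameters and K the number of categories. The function name `dirichlet_posterior_mode` and the counts are hypothetical, and the guard is an added assumption: the formula is only meaningful when every α_k + N_k is at least 1.

```python
def dirichlet_posterior_mode(alpha, counts):
    """Mode of Dir(alpha + counts), i.e. the MAP estimate of the category probabilities."""
    num_categories = len(alpha)
    posterior = [a + n for a, n in zip(alpha, counts)]
    denominator = sum(posterior) - num_categories  # N + alpha_0 - K
    if min(posterior) < 1 or denominator <= 0:
        raise ValueError("needs every alpha_k + N_k >= 1 and N + alpha_0 > K")
    return [(p - 1) / denominator for p in posterior]


counts = [3, 0, 7]
print(dirichlet_posterior_mode([2, 2, 2], counts))  # 4/13, 1/13, 8/13
print(dirichlet_posterior_mode([1, 1, 1], counts))  # [0.3, 0.0, 0.7]
```

With the uniform prior the result is just the empirical frequencies, so a category that was never observed gets probability exactly zero.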

The Posterior predictive

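For a single new draw $X$, the posterior predictive is the posterior mean of $\theta_j$:

$$p(X = j \mid D) = \int p(X = j \mid \theta)\, p(\theta \mid D)\, d\theta = \mathbb{E}[\theta_j \mid D] = \frac{\alpha_j + N_j}{\alpha_0 + N}$$

where $N_j$ is the observed count of category $j$, $N = \sum_k N_k$ and $\alpha_0 = \sum_k \alpha_k$. We notice that this is a form of Bayesian smoothing: a category with $N_j = 0$ still receives nonzero probability.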
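
The smoothing effect is easy to see numerically. The sketch below assumes the standard Dirichlet-multinomial result that the predictive probability of category j is (α_j + N_j)/(α_0 + N); the function name `posterior_predictive` and the counts are illustrative.

```python
def posterior_predictive(alpha, counts):
    """Probability of each category for the next draw under a Dirichlet posterior."""
    total = sum(alpha) + sum(counts)  # alpha_0 + N
    return [(a + n) / total for a, n in zip(alpha, counts)]


counts = [3, 0, 7]  # the second category was never observed
smoothed = posterior_predictive([1, 1, 1], counts)
empirical = [n / sum(counts) for n in counts]

print(empirical)  # [0.3, 0.0, 0.7]
print(smoothed)   # 4/13, 1/13, 8/13: the unseen category keeps some mass
```

With α_k = 1 for every category this is add-one (Laplace) smoothing.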

References

[1] Murphy, K. P. “Machine Learning: A Probabilistic Perspective.” MIT Press, 2012.