Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) is a model selection criterion that helps compare different statistical models while penalizing model complexity. It is rooted in Bayesian probability theory but is commonly used even in frequentist settings.

Mathematically Precise Definition

For a statistical model $M$ with $k$ parameters fitted to a dataset $\mathcal{D} = \{x_1, x_2, \dots, x_n\}$, the BIC is defined as:

$$ \text{BIC} = -2 \cdot \ln \hat{L} + k \cdot \ln(n) $$

where:

  • $\hat{L}$ is the maximum likelihood of the model, i.e., $\hat{L} = P(\mathcal{D} \mid \hat{\theta}, M)$, where $\hat{\theta}$ are the maximum likelihood estimates (MLE) of the parameters.
  • $k$ is the number of free parameters in the model.
  • $n$ is the number of observations in the dataset.
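Given these definitions, BIC is straightforward to compute once a model has been fitted by maximum likelihood. The sketch below (the dataset and the choice of a Gaussian model are purely illustrative) computes the maximized log-likelihood in closed form and plugs it into the formula:

```python
import numpy as np

def bic(log_likelihood, k, n):
    """BIC = -2 ln(L_hat) + k ln(n)."""
    return -2.0 * log_likelihood + k * np.log(n)

# Illustrative dataset: fit a Gaussian by maximum likelihood.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)
n = len(x)

# MLEs for a Gaussian: sample mean and the biased sample variance.
mu_hat = x.mean()
sigma2_hat = x.var()  # ddof=0 is the MLE

# Maximized log-likelihood of the Gaussian model (k = 2 free parameters).
log_L = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)
print(bic(log_L, k=2, n=n))
```

Note that adding a parameter always increases the penalty term by $\ln(n)$, so a richer model is only rewarded if its likelihood gain outweighs that cost.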

Interpretation

  1. Likelihood Component ($-2 \ln \hat{L}$):

    • Measures the fit of the model. A higher likelihood (or lower $-2 \ln \hat{L}$) indicates the model explains the data well.
  2. Penalty Term ($k \ln n$):

    • Penalizes model complexity. More parameters ($k$) increase the penalty, discouraging overfitting.
    • The penalty grows with $\ln(n)$: each additional parameter becomes more costly as the dataset grows, though only logarithmically.
  3. Model Selection Criterion:

    • Lower BIC values are better: among candidates, the model with the lowest BIC is preferred.
    • It balances goodness of fit and model simplicity: the best model explains the data well without unnecessary complexity.

Derivation of BIC

The BIC is derived from approximating the Bayesian model evidence under certain assumptions. Here’s a step-by-step derivation:

  1. Bayesian Model Selection: In Bayesian statistics, model selection is based on the posterior probability of a model $M$ given data $\mathcal{D}$:

    $$ P(M \mid \mathcal{D}) \propto P(\mathcal{D} | M) P(M) $$

    where:

    • $P(\mathcal{D} | M)$ is the marginal likelihood or model evidence: $$ P(\mathcal{D} | M) = \int P(\mathcal{D} | \theta, M) P(\theta \mid M) \, d\theta $$
  2. Laplace Approximation of Marginal Likelihood: When $n$ is large, we can approximate the integral using Laplace’s method around the MLE $\hat{\theta}$.

    • The likelihood $P(\mathcal{D} \mid \theta, M)$ is sharply peaked around $\hat{\theta}$.
    • The approximation gives: $$ P(\mathcal{D} \mid M) \approx P(\mathcal{D} \mid \hat{\theta}, M) \cdot \left( \frac{(2\pi)^{k/2}}{ \lvert I_{n}(\hat{\theta}) \rvert ^{1/2}} \right) \cdot P(\hat{\theta} \mid M) $$ where $I_n(\hat{\theta})$ is the observed Fisher information of the full sample; for large $n$, $I_n(\hat{\theta}) \approx n I(\hat{\theta})$, with $I(\hat{\theta})$ the per-observation Fisher information (see Parametric Modeling for its definition).
  3. Log Transformation and Simplification: Taking the logarithm, noting that $\ln \lvert n I(\hat{\theta}) \rvert = k \ln(n) + \ln \lvert I(\hat{\theta}) \rvert$, and dropping all terms that remain bounded as $n$ grows:

    $$ \ln P(\mathcal{D} \mid M) \approx \ln P(\mathcal{D} \mid \hat{\theta}, M) - \frac{k}{2} \ln(n) $$
  4. Formulating BIC: Multiplying both sides by \(-2\) (for consistency with the likelihood ratio test framework):

    $$ -2 \ln P(\mathcal{D} \mid M) \approx -2 \ln P(\mathcal{D} \mid \hat{\theta}, M) + k \ln(n) $$

    The right-hand side is precisely the BIC formula:

    $$ \text{BIC} = -2 \ln \hat{L} + k \ln(n) $$
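The approximation above can be checked numerically. In a conjugate Gaussian model (known unit variance, with a hypothetical $\mathcal{N}(0, \tau^2)$ prior on the mean) the marginal likelihood has a closed form, so $-2 \ln P(\mathcal{D} \mid M)$ can be compared directly against BIC; they should agree up to an $O(1)$ remainder. This is a sketch under those specific assumptions:

```python
import numpy as np

# Conjugate model: x_i ~ N(mu, 1), mu ~ N(0, tau^2).
# Marginally, x is jointly N(0, I + tau^2 * 11^T), giving an
# exact log evidence via the matrix determinant lemma and
# Sherman-Morrison.
rng = np.random.default_rng(42)
n, tau2 = 500, 1.0
x = rng.normal(loc=1.0, scale=1.0, size=n)
s1, s2 = x.sum(), (x**2).sum()

log_evidence = (-0.5 * n * np.log(2 * np.pi)
                - 0.5 * np.log(1 + n * tau2)
                - 0.5 * (s2 - tau2 * s1**2 / (1 + n * tau2)))

# BIC approximation: k = 1 free parameter (the mean).
mu_hat = x.mean()
log_L = -0.5 * n * np.log(2 * np.pi) - 0.5 * ((x - mu_hat)**2).sum()
bic = -2 * log_L + 1 * np.log(n)

# The two quantities should be close relative to their magnitude.
print(-2 * log_evidence, bic)
```

The leftover gap comes from the prior density and the Fisher-information determinant, exactly the terms the derivation discards as constants in $n$.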

Assumptions Behind BIC

  1. Large Sample Size ($n \to \infty$): The Laplace approximation assumes that the sample size is large enough for the likelihood to be sharply peaked.
  2. Regularity Conditions: The likelihood function must be well-behaved (differentiable, unimodal, etc.).
  3. Model Correctness: One of the candidate models is assumed to be the true model (though in practice, this is often violated).

Comparison to Other Criteria

  • AIC (Akaike Information Criterion): $$ \text{AIC} = -2 \ln \hat{L} + 2k $$
    • AIC penalizes complexity less harshly than BIC.
    • AIC is more suitable for predictive performance, while BIC is more conservative and often better at identifying the “true” model.
  • BIC vs AIC:

    • BIC tends to prefer simpler models as $n$ increases, due to the $\ln(n)$ term.
    • AIC is asymptotically efficient (minimizes prediction error), while BIC is consistent (selects the true model as $n \to \infty$).
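Because the penalties differ ($2k$ versus $k \ln n$, so BIC is harsher once $n \geq 8$), the two criteria can disagree on the same fits. The maximized log-likelihoods below are hypothetical numbers chosen to illustrate this:

```python
import numpy as np

def aic(log_L, k):
    return -2 * log_L + 2 * k

def bic(log_L, k, n):
    return -2 * log_L + k * np.log(n)

n = 100
# Hypothetical maximized log-likelihoods for two candidate models:
log_L_simple, k_simple = -100.0, 2    # simpler model
log_L_complex, k_complex = -96.5, 5   # better fit, more parameters

# AIC: 204 vs 203 -> prefers the complex model.
print(aic(log_L_simple, k_simple), aic(log_L_complex, k_complex))
# BIC: ~209.2 vs ~216.0 -> prefers the simple model.
print(bic(log_L_simple, k_simple, n), bic(log_L_complex, k_complex, n))
```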

How to Use the BIC

In practice:

  1. Fit candidate models to the data.
  2. Compute the BIC for each model.
  3. Select the model with the lowest BIC.

BIC is widely used in fields like machine learning, econometrics, and statistical modeling, where balancing fit and complexity is critical.