Bayesian Information Criterion

Bayesian Information Criterion (BIC) The Bayesian Information Criterion (BIC) is a model selection criterion that helps compare different statistical models while penalizing model complexity. It is rooted in Bayesian probability theory but is commonly used even in frequentist settings. Mathematically Precise Definition For a statistical model $M$ with $k$ parameters fitted to a dataset $\mathcal{D} = \{x_1, x_2, \dots, x_n\}$, the BIC is defined as: $$ \text{BIC} = -2 \cdot \ln \hat{L} + k \cdot \ln(n) $$where: ...

3 min · Xuanqiang 'Angelo' Huang

Bayesian Linear Regression

We have a prior $p(\text{model})$, we have a posterior $p(\text{model} \mid \text{data})$, a likelihood $p(\text{data} \mid \text{model})$ and $p(\text{data})$ is called the evidence. Classical Linear regression $$ y = w^{T}x + \varepsilon $$ Where $\varepsilon \sim \mathcal{N}(0, \sigma_{n}^{2}I)$ and it’s the irreducible noise, an error that cannot be eliminated by any model in the model class, this is also called aleatoric uncertainty. One could write this as follows: $y \sim \mathcal{N}(w^{T}x, \sigma^{2}_{n}I)$ and it’s the exact same thing as the previous, so if we look for the MLE estimate now we get ...

9 min · Xuanqiang 'Angelo' Huang

Bayesian Networks

Questi network bayesiani sono proprio dei grafi, che permettono una migliore comprensione delle relazioni causali o diagnostici fra le probabilità Esempio rete bayesiana Note generali Introduzione alla rete classica Una rete bayesiana ci permette di semplificare di molto il calcolo della full disjoint probability table, rendendola in questo modo Ossia andiamo a utilizzare una probabilità locale, o sparsa per fare i conti, cosa che semplifica molto, e quindi velocizza il calcolo. v ...

2 min · Xuanqiang 'Angelo' Huang

Bayesian neural networks

Robbins-Moro Algorithm The Algorithm $$ w_{n+1} = w_{n} - \alpha_{n} \Delta w_{n} $$For example with $\alpha_{0} > \alpha_{1} > \dots > \alpha_{n} \dots$, and $\alpha_{t} = \frac{1}{t}$ they satisfy the condition (in practice we use a constant $\alpha$, but we lose the convergence guarantee by Robbins Moro). More generally, the Robbins-Moro conditions re: $\sum_{n} \alpha_{n} = \infty$ $\sum_{n} \alpha_{n}^{2} < \infty$ Then the algorithm is guaranteed to converge to the best answer. One nice thing about this, is that we don’t need gradients. But often we use gradient versions (stochastic gradient descent and similar), using auto-grad, see Backpropagation. But learning with gradients brings some drawbacks: ...

10 min · Xuanqiang 'Angelo' Huang

Bayesian Optimization

While Active Learning looks for the most informative points to recover a true underlying function, Bayesian Optimization is just interested to find the maximum of that function. In Bayesian Optimization, we ask for the best way to find sequentially a set of points $x_{1}, \dots, x_{n}$ to find $\max_{x \in \mathcal{X}} f(x)$ for a certain unknown function $f$. This is what the whole thing is about. Definitions First we will introduce some useful definitions in this context. These were also somewhat introduced in N-Bandit Problem, which is one of the classical optimization problems we can find in the literature. ...

8 min · Xuanqiang 'Angelo' Huang