In the Bayesian setting we have a prior $p(\text{model})$, a likelihood $p(\text{data} \mid \text{model})$, a posterior $p(\text{model} \mid \text{data})$, and the normalizing constant $p(\text{data})$, which is called the evidence.
Classical Linear regression
$$ y = w^{T}x + \varepsilon $$Where $\varepsilon \sim \mathcal{N}(0, \sigma_{n}^{2}I)$ is the irreducible noise: an error that cannot be eliminated by any model in the model class, also called aleatoric uncertainty. Equivalently, one can write $y \sim \mathcal{N}(w^{T}x, \sigma^{2}_{n}I)$, which is exactly the same statement. Looking for the MLE estimate we get
$$ \hat{w}_{\text{MLE}} = \arg \max_{w \in \mathbb{R}^{d}} \sum_{i = 1}^{N} \log p (y_{i} \mid x_{i}, w) = \arg \min_{w \in \mathbb{R}^{d}} \sum_{i = 1}^{N} (y_{i} - w^{T}x_{i})^{2} $$And this result is quite nice: under these assumptions the maximum likelihood estimator coincides with the minimizer of the ordinary least squares problem, i.e., with classical linear regression.
$$ \mathcal{O}(d^{2}N + d^{3}) $$is the cost of the closed-form (normal equations) approach, where $d$ is the dimensionality of the input and $N$ the number of samples. Gradient descent can be a little faster in practice; it costs $\mathcal{O}(Nd)$ per iteration.
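As a quick sanity check, here is a minimal sketch (synthetic data and NumPy, both my own choices rather than part of the lecture) of the closed-form least-squares / MLE solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = w_true^T x + Gaussian noise (aleatoric, sigma_n)
d, N, sigma_n = 3, 200, 0.5
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + sigma_n * rng.normal(size=N)

# Closed-form OLS / MLE: w = (X^T X)^{-1} X^T y, cost O(d^2 N + d^3)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
print(w_true)
print(w_mle)  # close to w_true for large N
```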
Ridge Regression
Before diving into this section, you should know Ridge Regression quite well.
Maximum a posteriori estimate leads to Ridge Regression 🟩
We keep the Gaussian likelihood and place a Gaussian prior $w \sim \mathcal{N}(0, \sigma_{p}^{2}I)$ on the weights: $$ p(y_{i} \mid w, x_{i}) = \mathcal{N}(y_{i}; w^{T}x_{i}, \sigma^{2}_{n}) $$which is the same as saying that $y_{i} = w^{T}x_{i} + \varepsilon_{i}$ with $\varepsilon_{i} \sim \mathcal{N}(0, \sigma^{2}_{n})$.
$$ p(w \mid x_{1:n}, y_{1:n}) = \frac{1}{z} p(w) \, p(y_{1:n} \mid w, x_{1:n}) $$with $z = \int p(w) \, p(y_{1:n} \mid w, x_{1:n}) \, dw$ (the prior does not depend on the inputs, so $p(w \mid x_{1:n}) = p(w)$).
Now let’s expand everything: we will have:
$$ \frac{1}{z} \cdot \frac{1}{z_{p}} \exp\left( -\frac{1}{2\sigma^{2}_{p}} \lVert w \rVert ^{2}_{2} \right) \cdot \frac{1}{z_{l}} \prod_{i = 1}^{n} \exp\left( -\frac{1}{2\sigma_{n}^{2}} (y_{i} - w^{T}x_{i})^{2} \right) $$$$ = \frac{1}{z'} \exp\left( -\left[ \frac{1}{2\sigma^{2}_{p} }\lVert w \rVert^{2}_{2} + \frac{1}{2\sigma^{2}_{n}} \sum_{i = 1}^{n} (y_{i} - w^{T}x_{i})^{2} \right] \right) $$Now we take the maximum a posteriori (MAP) estimate for $\hat{w}$:
$$ \begin{align} \hat{w} &= \arg\max_{w} \log p(w \mid x_{1:n}, y_{1:n}) \\ &= \arg \max_{w} \log p(y_{1:n} \mid x_{1:n}, w) + \log p(w) \\ &= \arg \min_{w} \sum_{i = 1}^{n} (y_{i} - w^{T}x_{i})^{2} + \frac{\sigma_{n}^{2}}{\sigma^{2}_{p}} \lVert w \rVert ^{2}_{2} \end{align} $$
which is the same as Ridge regression with $\lambda = \frac{\sigma_{n}^{2}}{\sigma^{2}_{p}}$. So with these Gaussian assumptions we are back to a standard regularized regression problem! Remember that $\lambda = 0$ means essentially no regularization, while a large $\lambda$ means we barely trust the data: high noise variance pushes us to regularize more, and so does a small prior variance.
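A small numerical sketch (my own toy example, using `scipy.optimize.minimize`) that maximizing the log posterior under these Gaussian assumptions gives exactly the Ridge closed-form solution with $\lambda = \sigma_{n}^{2}/\sigma_{p}^{2}$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
d, n, sigma_n, sigma_p = 3, 100, 0.5, 2.0
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + sigma_n * rng.normal(size=n)

# Negative log posterior (up to constants): Gaussian likelihood + Gaussian prior
def neg_log_posterior(w):
    return (np.sum((y - X @ w) ** 2) / (2 * sigma_n**2)
            + np.sum(w**2) / (2 * sigma_p**2))

w_map = minimize(neg_log_posterior, x0=np.zeros(d)).x

# Ridge closed form with lambda = sigma_n^2 / sigma_p^2
lam = sigma_n**2 / sigma_p**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(w_map, w_ridge, atol=1e-4))  # expected: True
```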
The MAP estimate collapses all the posterior mass onto its mode. This can be harmful when we are highly unsure about the best model, e.g., because we have observed insufficient data. This is why fully Bayesian methods are often preferred in such situations.
$$ p(w) = \prod_{i = 1}^{d} \frac{\sigma^{2}_{p}}{4 \sigma^{2}_{n}} \exp\left( -\frac{\sigma^{2}_{p}}{2\sigma^{2}_{n}} \lvert w_{i} \rvert \right) $$This is a Laplace prior on the weights; by the same MAP argument as above it leads to L1-regularized (Lasso) regression instead of Ridge. I have not derived the exact constants myself; I took this from slide 12 here (you need ETH credentials to access the resource).
Posterior is Gaussian 🟩
Since the exponent is a quadratic form in $w$, the posterior is itself Gaussian, $p(w \mid x_{1:n}, y_{1:n}) = \mathcal{N}(w; \bar{\mu}, \bar{\Sigma})$:
$$ p(w \mid x_{1:n}, y_{1:n}) = \frac{1}{z'} \exp\left( -\left[ \frac{1}{2\sigma^{2}_{p} }\lVert w \rVert^{2}_{2} + \frac{1}{2\sigma^{2}_{n}} \sum_{i = 1}^{n} (y_{i} - w^{T}x_{i})^{2} \right] \right) $$
$$ \begin{aligned} \bar{\Sigma} &= \left( \frac{1}{\sigma_{n}^{2}} X^{T}X + \frac{1}{\sigma_{p}^{2}}I \right)^{-1} = \sigma_{n}^{2}\left( X^{T}X + \frac{\sigma^{2}_{n}}{\sigma_{p}^{2}} I \right) ^{-1} \\ \bar{\mu} &= \frac{1}{\sigma_{n}^{2}} \bar{\Sigma} X^{T}y = (X^{T}X + \lambda I)^{-1} X^{T}y \end{aligned} $$
where $\lambda$ is as before, and this is the link with linear regression: the posterior mean coincides with the Ridge solution. This form is nice because it also gives us confidence intervals over the parameters.
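A minimal sketch (again synthetic data, my own example) of computing $\bar{\Sigma}$ and $\bar{\mu}$ and reading off approximate 95% credible intervals for the weights:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, sigma_n, sigma_p = 3, 50, 0.5, 2.0
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + sigma_n * rng.normal(size=n)

# Posterior covariance and mean (Sigma_bar, mu_bar) from the formulas above
Sigma_bar = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(d) / sigma_p**2)
mu_bar = Sigma_bar @ X.T @ y / sigma_n**2

# Approximate 95% credible interval for each weight
std = np.sqrt(np.diag(Sigma_bar))
for i in range(d):
    print(f"w[{i}]: {mu_bar[i]:.3f} +/- {1.96 * std[i]:.3f} (true {w_true[i]:.3f})")
```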
Inference
Inference in Bayesian Linear Regression 🟨++
$$ p(y^{*} \mid x^{*}, x_{1:n}, y_{1:n}) = \int p(y^{*}, w \mid x^{*}, x_{1:n}, y_{1:n}) \, dw = \int p(y^{*} \mid w, x^{*}) \, p(w \mid x_{1:n}, y_{1:n}) \, dw $$For the noise-free function value $f^{*} = w^{T}x^{*}$ this gives $$ p(f^{*} \mid x^{*}, x_{1:n}, y_{1:n}) = \mathcal{N}(f^{*}; \bar{\mu} ^{T}x^{*} , x^{*T}\bar{\Sigma}x^{*}) $$and since $y^{*} = f^{*} + \varepsilon$ and the sum of independent Gaussians is Gaussian, we get $$ p(y^{*} \mid x^{*}, x_{1:n}, y_{1:n}) = \int p(y^{*} \mid w, x^{*}) \, p(w \mid x_{1:n}, y_{1:n}) \, dw = \mathcal{N}(y^{*}; \bar{\mu}^{T} x^{*}, x^{*T}\bar{\Sigma}x^{*} + \sigma^{2}_{n}) $$So compared to Ridge regression, Bayesian linear regression has an additional variance term: instead of committing to a single model, we average over all possible models, weighted by their posterior probability. Conceptually, the Bayesian prediction is richer than the MLE/MAP point prediction.
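A minimal sketch (synthetic data, my own example, not from the lecture) of the predictive mean and variance at a test point, with the two variance contributions made explicit:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, sigma_n, sigma_p = 3, 50, 0.5, 2.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + sigma_n * rng.normal(size=n)

# Posterior over the weights, as derived above
Sigma_bar = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(d) / sigma_p**2)
mu_bar = Sigma_bar @ X.T @ y / sigma_n**2

# Predictive distribution at a test point x*
x_star = np.array([1.0, -0.5, 2.0])
pred_mean = mu_bar @ x_star
epistemic = x_star @ Sigma_bar @ x_star   # x*^T Sigma_bar x*: uncertainty about f*
aleatoric = sigma_n**2                    # irreducible observation noise on y* given f*
print(pred_mean, epistemic + aleatoric)   # predictive mean and total variance
```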
Later, using the law of total variance, we will see how this predictive variance decomposes into an epistemic and an aleatoric part.
Relation between BLR and Ridge Regression 🟩
Ridge regression can be viewed as approximating the full posterior by placing all of its mass on the mode (other estimators can be interpreted similarly, e.g., the MLE, analyzed in Parametric Modeling).
So it’s a nice approximation: the posterior is replaced by a point mass at $\hat{w}$, i.e. $p(w \mid x_{1:n}, y_{1:n}) \approx \delta_{\hat{w}}(w)$, and the predictive integral collapses:
$$ p(y^{*} \mid x^{*}, x_{1:n}, y_{1:n}) = \int p(y^{*} \mid w, x^{*}) \, p(w \mid x_{1:n}, y_{1:n}) \, dw \approx \int p(y^{*} \mid w, x^{*}) \, \delta_{\hat{w}}(w) \, dw = p(y^{*} \mid \hat{w}, x^{*}) $$
Remember that for the Dirac delta distribution we have $$\mathbb{E}[f(x)] = \int f(x) \, \delta_{x'}(x) \, dx = f(x') $$
Epistemic and aleatoric uncertainty 🟩
We now start a philosophical discussion on epistemic and aleatoric uncertainty.
- Epistemic uncertainty: Uncertainty about the model due to the lack of data
- Aleatoric uncertainty: Irreducible noise
The term $x^{*T}\bar{\Sigma}x^{*}$ is the epistemic uncertainty about $f^{*}$, which shrinks as we observe more data. The $\sigma^{2}_{n}$ term is the aleatoric noise in $y^{*}$ given $f^{*}$: more data does not reduce it; we can only hope to lower it by changing the hypothesis class (e.g., richer features), so that less of the variation in $y$ is treated as noise.
The law of total variance can give another intuition about the decomposition between epistemic and aleatoric uncertainty:
$$ \text{Var}(y \mid x) = \underbrace{\mathbb{E}_{\theta}\left[ \text{Var}_{y}(y \mid x, \theta) \right]}_{\text{aleatoric}} + \underbrace{\text{Var}_{\theta}\left(\mathbb{E}_{y}[y \mid x, \theta]\right)}_{\text{epistemic}} $$
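As a sanity check (my own toy example), the decomposition can be verified by sampling weights from the posterior and applying the law of total variance empirically:

```python
import numpy as np

rng = np.random.default_rng(4)

# Any Gaussian posterior over the weights will do for this check (made-up numbers)
mu_bar = np.array([0.8, -0.3, 1.2])
Sigma_bar = np.diag([0.05, 0.02, 0.10])
sigma_n2 = 0.25
x_star = np.array([1.0, -0.5, 2.0])

# Sample models theta = w from the posterior, then apply the law of total variance
W = rng.multivariate_normal(mu_bar, Sigma_bar, size=200_000)
means = W @ x_star                     # E_y[y | x*, w] for each sampled w
aleatoric = sigma_n2                   # E_w[ Var_y(y | x*, w) ] is constant here
epistemic = means.var()                # Var_w( E_y[y | x*, w] )

print(epistemic + aleatoric)                    # Monte Carlo estimate of Var(y* | x*)
print(x_star @ Sigma_bar @ x_star + sigma_n2)   # closed form; the two agree
```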
Choosing hyper-parameters 🟩
We still need a way to choose $\sigma^{2}_{n}$ and $\sigma^{2}_{p}$: the variance of the noise and the variance of the prior must be fixed before we can apply these methods. One option is to find $\lambda = \frac{\sigma^{2}_{n}}{\sigma^{2}_{p}}$ by cross-validation, then estimate the noise variance from the data and recover the prior variance from the ratio.
One can also represent Bayesian linear regression as a graphical model.
Recursive Bayesian updates 🟨
$$ p^{(j)} (w) := p(w \mid y_{1: j}), \qquad p^{(0)}(w) = p(w) $$$$ p^{(j)}(w) = p( w \mid y_{1: j}) = \frac{1}{z_{j}} p(w) \cdot p(y_{1} \mid w) \cdots p(y_{j} \mid w) = \frac{1}{z_{j}} \left( z_{j-1} \cdot p^{(j - 1)}(w) \right) p(y_{j} \mid w) = f(p^{(j - 1)}(w), y_{j}) $$So at each step we only need to incorporate the new observation $y_{j}$ into the previous posterior: the old posterior acts as the new prior. These recursive updates are also the reason why the Dirichlet is a nice prior for multinomials, see Dirichlet Processes.
Example: Bayesian Update for a Coin toss
We can see how the posterior changes with the number of coin flips. We will use a Beta prior for the Bernoulli likelihood; since it is a conjugate prior, the posterior is available in closed form.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Initialize parameters for the Beta prior
alpha_prior = 1  # Prior "successes" (e.g., heads)
beta_prior = 1   # Prior "failures" (e.g., tails)

# Simulated coin flips (1 = heads, 0 = tails)
np.random.seed(42)  # For reproducibility
coin_flips = np.random.choice([0, 1], size=100, p=[0.7, 0.3])  # Unfair coin: 70% tails, 30% heads

# Function to plot the Beta distribution
def plot_beta(alpha, beta_val, flip_num):
    x = np.linspace(0, 1, 1000)
    y = beta.pdf(x, alpha, beta_val)
    plt.plot(x, y, label=f"After {flip_num} flips (α={alpha}, β={beta_val})")
    plt.xlabel("Probability of Heads (p)")
    plt.ylabel("Density")
    plt.title("Online Bayesian Update of Beta Distribution")
    plt.legend()

# Create a figure for the updates
plt.figure(figsize=(10, 6))

# Perform Bayesian updates after each coin flip
for i, flip in enumerate(coin_flips):
    # Update the Beta prior with the Bernoulli likelihood
    alpha_prior += flip      # Increment alpha if heads (flip=1)
    beta_prior += 1 - flip   # Increment beta if tails (flip=0)

    # Plot the updated distribution every few flips
    if i in {0, 4, 9, 19, 99}:  # Plot at these specific flips
        plot_beta(alpha_prior, beta_prior, i + 1)

# Show the final plot
plt.grid(alpha=0.3)
plt.show()
```
Online Bayesian regression
This section concerns building a Bayesian regression algorithm whose memory does not grow with the number of samples. It can be seen as a filtering process composed of two phases: conditioning (updating the posterior) and prediction.
$$ p(w \mid X, y) = \mathcal{N}(w; (X^{T}X +\sigma_{n}^{2}I)^{-1}X^{T}y, (\sigma^{-2}_{n}X^{T}X + I)^{-1}) $$Assuming $\sigma_{p}^{2} = 1$.
$$ X^{(t)} = \begin{bmatrix} x_{1}^{T} \\ \vdots \\ x_{t}^{T} \end{bmatrix} \in \mathbb{R}^{t \times d} \,\,\, Y^{(t)} = \begin{bmatrix} y_{1} \\ \vdots \\ y_{t} \end{bmatrix} \in \mathbb{R}^{t \times 1} $$$$ X^{(t + 1)T}Y^{(t+1)} = \sum_{i = 1}^{t+1} x_{i}y_{i} = X^{(t)T}Y^{(t)} + x_{t + 1}y_{t + 1} $$$$ X^{(t + 1)T} X^{(t+ 1)} = \sum_{i = 1}^{t+1} x_{i}x_{i}^{T} = X^{(t)T}X^{(t)} + x_{t+1}x_{t + 1}^{T} $$These sufficient statistics can be updated incrementally, so the memory requirement is constant in $t$ instead of growing as $\mathcal{O}(t)$.
$$ (A + uv^{T})^{-1} = A^{-1} - \frac{A^{-1}uv^{T}A^{-1}}{1 + v^{T}A^{-1}u} $$Using this (the Sherman–Morrison formula) we can take $A = X^{(t)T}X^{(t)} + \sigma_{n}^{2}I$ and $u = v = x_{t+1}$, since $x_{t+1}x_{t + 1}^{T}$ is a rank-one update. Each step then only requires matrix-vector products, bringing the per-update cost down to $\mathcal{O}(d^{2})$ instead of recomputing the inverse from scratch.
This is all for the analysis of the online Bayesian setting; a small sketch of the recursive update is below.
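Here is a small sketch (my own toy example, assuming $\sigma_{p}^{2} = 1$ as above) of the recursive update with a rank-one Sherman–Morrison step, checked against the batch posterior mean:

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, sigma_n = 3, 200, 0.5
w_true = rng.normal(size=d)

# Running statistics: A_inv maintains (X^T X + sigma_n^2 I)^{-1}, b = X^T y
A_inv = np.eye(d) / sigma_n**2
b = np.zeros(d)
X_all, y_all = [], []

for t in range(T):
    x = rng.normal(size=d)
    y = w_true @ x + sigma_n * rng.normal()
    X_all.append(x); y_all.append(y)

    # Sherman-Morrison rank-one update of A_inv, O(d^2) per step
    Ax = A_inv @ x
    A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
    b += x * y

w_online = A_inv @ b                      # posterior mean with sigma_p^2 = 1
X, Y = np.array(X_all), np.array(y_all)
w_batch = np.linalg.solve(X.T @ X + sigma_n**2 * np.eye(d), X.T @ Y)
print(np.allclose(w_online, w_batch))     # expected: True
```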
Logistic regression case 🟨
$$ P(y \mid x, w) = \text{Ber}(y; \sigma(w^{T}x)) $$where $\sigma$ is the sigmoid function; its output can be read as the confidence of the prediction.
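A tiny sketch (hypothetical weights and inputs, my own example) of reading the sigmoid output as a confidence:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.0])               # hypothetical weights
X = np.array([[1.0, 0.5], [0.1, 3.0]])  # two test inputs
p_pos = sigmoid(X @ w)                  # P(y = 1 | x, w)
print(p_pos)  # values near 0 or 1 are confident predictions, near 0.5 uncertain ones
```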
Relation with dimensionality 🟩
$$ f(x) = \sum_{i = 1}^{N} w_{i} \phi_{i}(x) $$$$ \phi(x) = [1, x_{1}, \dots, x_{d}, x_{1}^{2}, \dots, x_{d}^{2}, x_{1}x_{2}, \dots, x_{1}x_{2}\dots x_{d}] $$$$ \lvert \phi \rvert = \binom{d + m}{d} $$where $m$ is the maximum degree of the monomials: the number of features explodes combinatorially with the input dimension $d$ and the degree $m$.
The kernel trick 🟩
We first want to rewrite the problem in terms of inner products, and then replace those inner products with kernels. If we can do this, we can compute inner products in feature space very efficiently. For example, if $\phi(x)$ contains all monomials up to degree $m$, then $\phi(x)^{T}\phi(x')$ can be computed (up to a weighting of the monomials) as $(1 + x^{T}x')^{m}$. This is quite important to understand well. See Kernel Methods for more. This motivates trying to kernelize Bayesian linear regression.
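A small numerical check (my own example, degree $m = 2$ and $d = 2$, using the standard $\sqrt{2}$ weighting of the monomials) that the explicit feature inner product equals the polynomial kernel:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for d = 2, with the weights that make
    # phi(x) . phi(z) == (1 + x . z)**2 hold exactly
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z, m=2):
    return (1.0 + x @ z) ** m

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])
print(phi(x) @ phi(z), poly_kernel(x, z))  # the two numbers coincide
```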
Function view
In the previous sections we derived Bayesian linear regression by placing a prior on the weights and combining it with the likelihood. It is possible to derive something very similar by reasoning directly about inputs and outputs, i.e., by placing a distribution on the space of functions. This part builds the fundamental ideas that will later be needed for Gaussian Processes, so it is important to understand it well.
$$ f \mid X \sim \mathcal{N}(\Phi \, \mathbb{E}[w], \; \Phi \, \text{Cov}[w] \, \Phi^{T}) = \mathcal{N}(0, \sigma^{2}_{p}\Phi\Phi^{T}) $$where $f = \Phi w$ collects the function values at the training inputs and $w \sim \mathcal{N}(0, \sigma^{2}_{p} I)$. With this formulation the feature map is implicit in the choice of the kernel $k(x, x') = \sigma^{2}_{p}\, \phi(x)^{T}\phi(x')$.
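A short sketch (my own example, a degree-3 polynomial feature map on 1D inputs) of sampling functions directly from this prior over function values:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
sigma_p = 1.0

# 1D inputs and a simple polynomial feature map phi(x) = [1, x, x^2, x^3]
xs = np.linspace(-1, 1, 100)
Phi = np.vander(xs, N=4, increasing=True)

# Prior over function values: f | X ~ N(0, sigma_p^2 * Phi Phi^T)
K = sigma_p**2 * Phi @ Phi.T
samples = rng.multivariate_normal(np.zeros(len(xs)), K + 1e-9 * np.eye(len(xs)), size=5)

for f in samples:
    plt.plot(xs, f)
plt.title("Samples from the function-space prior")
plt.show()
```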