We have a prior , we have a posterior , a likelihood and is called the evidence.
Classical Linear regression
Let's start with a classical regression. In this setting we need to estimate a model that is generated from this kind of data:
Where and it's the irreducible noise, an error that cannot be eliminated by any model in the model class, this is also called aleatoric uncertainty. One could write this as follows: and it's the exact same thing as the previous, so if we look for the MLE estimate now we get
And this result is quite nice, because we just discovered that with this assumptions the best model for classical linear regression, the minimizer for the ordinary least squares problem, gives the same result with the maximum likelihood estimator.
One random note is that the complexity of computing linear regression is
Where is the dimensionality of the input and the number of samples. Using the closed formula approach. Usually the Gradient Descent method might be a little faster. It costs per iteration.
Ridge Regression
Before diving into this section, you should know Ridge Regression quite well.
Maximum a posteriori estimate leads to Ridge Regression
We assume a prior (this is the special thing about Bayesian learning, which gives us confidence bounds, we choose Gaussians because of nice properties, see Gaussians and the Maximum Entropy Principle), and we also assume that is independent of the , this further assumption acts as a regularization, as we will see briefly. The conditional likelihood is so we assume that the are conditionally independent given . And we further assume that
Which is the same as saying that where .
Now we can model the posterior:
And .
Now let's expand everything: we will have:
Let's call and now we have this:
Now we do Max. a posteriori (MAP) estimate for and we get this:
Which we note is the same as Ridge regression with . So with those Gaussian assumptions we just go back to a standard regression case! Remember that if we don't have basically a regularizer, if is high we practically don't care much about the data (if noise is high we want to regularize, but if the prior variance is small we regularize more)
The MAP estimate simply collapses all mass of the posterior around its mode. This can be harmful when we are highly unsure about the best model, e.g., because we have observed insufficient data.
This why usually bayesian methods are preferred in other times.
And interesting observation is that one could derive the lasso regression if we assume the underlying distribution of to be a Laplacian instead of a Gaussian. This should enable us to perform variable selection. But the lasso optimization is usually not derivable (computed using LARS or similars), and the form is not exactly nice. For completeness, we report it here anyways: The prior on the weights should be in this case
No clue how to derive it though. I took this from slide 12 here (you need eth credentials to access the resource).
Posterior is Gaussian
In theory this result is easy if you know what are The Exponential Family and see that Gaussian is a Conjugate prior of itself. We observe that we can rewrite the argument part of the exponent in the following equation to be a gaussian:
We can rewrite it as where
\begin{array} \\ \bar{\Sigma} = \left( \frac{1}{\sigma_{n}^{2}} X^{T}X + \frac{1}{\sigma_{p}^{2}}I \right)^{-1} = \sigma_{n}^{2}\left( X^{T}X + \frac{\sigma^{2}_{n}}{\sigma_{p}^{2}} I \right) ^{-1} \\ \bar{\mu} = \frac{1}{\sigma_{n}^{2}} \bar{\Sigma} X^{T}y = (X^{T}X + \lambda I)^{-1} X^{T}y \end{array}Where is as before, and this is the link with linear regression! This form is nice as it allows to have some confidence intervals over the parameters.
Inference
Inference in Bayesian Linear Regression
The main idea is to have predictions that are weighted by the probability of the model, so we write:
We won't actually use this model for inference. In the case of Bayesian Linear regression we have a simpler case, which can be written as follows: We first used the sum rule, then we used product rule and independence assumption . Let's suppose so we use the multiplication of Gaussians with a constant () rule and get
And assuming we have sum of Gaussian is Gaussian so we have
So compared to ridge regression, the Bayesian linear regression has a variance term that is added. Here we are using all possible models, weighted by their probability. Conceptually bayesian is better than MLE.
It is possible to see later with the law of total variance, the decomposition of the inference variance into epistemic and aleatoric variance.
Relation between BLR and Ridge Regression
Ridge regression can be viewed as approximating the full posterior by placing all mass on its mode (also other models can be seen similarly, one example is MLE, analyzed in Parametric Modeling).
So it's a nice approximation:
Remember that for delta direct distribution we have
Epistemic and aleatoric uncertainty
We now start a philosophical discussion on epistemic and aleatoric uncertainty.
- Epistemic uncertainty: Uncertainty about the model due to the lack of data
- Aleatoric uncertainty: Irreducible noise
The is the epistemic uncertainty about while the factor is the noise/uncertainty about given , meaning we need to change , or hypothesis class if we want to lower this uncertainty I think (not sure though).
The law of total variance can give another intuition about the decomposition between epistemic and aleatoric uncertainty:
Choosing hyper-parameters
We would like a manner to compute the values of and . Covariance of the prior and variance of the noise need to be chosen before we can apply these methods. One way is just finding with cross-validation, then estimate the variance from data and get the prior variance in this manner.
Then it's possible to have graphical models for bayesian linear regression.
Recursive bayesian updates
We want to incorporate the data we have today as future prior. The posterior we have today will be the prior of tomorrow. Let's suppose we have prior and we have observations such that We can define some intermediate posteriors:
If we have the posterior for the previous data points and now we have , then we can compute the next prior (we used Bayes rule here and i.i.d. property of ):
So we would just need to integrate the new information given by and then use the previous knowledge. The recursive updates are also the reason why the Dirichlet is a nice prior for multinomials, see Dirichlet Processes.
Example: Bayesian Update for a Coin toss
We can see how the posterior changes with the number of coin flips. We will use a Beta prior for the Bernoulli likelihood, this is a conjugate prior so we can compute the posterior in closed form.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
# Initialize parameters for the Beta prior
alpha_prior = 1 # Prior "successes" (e.g., heads)
beta_prior = 1 # Prior "failures" (e.g., tails)
# Simulated coin flips (1 = heads, 0 = tails)
np.random.seed(42) # For reproducibility
coin_flips = np.random.choice([0, 1], size=100, p=[0.7, 0.3]) # Unfair coin: 70% tails, 30% heads
# Function to plot the Beta distribution
def plot_beta(alpha, beta_val, flip_num):
x = np.linspace(0, 1, 1000)
y = beta.pdf(x, alpha, beta_val)
plt.plot(x, y, label=f"After {flip_num} flips (α={alpha}, β={beta_val})")
plt.xlabel("Probability of Heads (p)")
plt.ylabel("Density")
plt.title("Online Bayesian Update of Beta Distribution")
plt.legend()
# Create a figure for the updates
plt.figure(figsize=(10, 6))
# Perform Bayesian updates after each coin flip
for i, flip in enumerate(coin_flips):
# Update the Beta prior with the Bernoulli likelihood
alpha_prior += flip # Increment alpha if heads (flip=1)
beta_prior += 1 - flip # Increment beta if tails (flip=0)
# Plot the updated distribution every few flips
if i in {0, 4, 9, 19, 99}: # Plot at these specific flips
plot_beta(alpha_prior, beta_prior, i + 1)
# Show the final plot
plt.grid(alpha=0.3)
plt.show()
Online Bayesian regression
This section concerns in building a bayesian regression algorithm whose memory does not grow with the number of samples. It can be conceived as a filtering process, composed of two phased: conditioning (updating over the priors) and prediction.
We have discovered a manner to learn the best parameters for bayesian linear regression which is just this closed form:
Assuming .
Now we want to do online linear regression, assuming the data is coming one by one, we would like to update the corresponding matrices of interest, which are and . So, let's say we are given some matrices
We would like to efficiently compute and assuming the known matrices above. The latter is easy to compute because:
Now let's try to decompose the other product
This allows to compute the parameters in a way that the memory requirement is constant and does not grow as .
Another problem that we have is computing the inverse of is very expensive, usually takes but we can do it in with the use of Woodbury Identities: If and then the following is true:
And we can interpret the value as our and the as our two vectors. In this manner we just need to compute matrix multiplications which bring down the complexity to .
This is all for analysis in the online bayesian setting.
Logistic regression case
With this we just have the likelihood from a Bernoulli distribution (could also be extended for many dimensions, not just 0 or 1 labels, but now we have just the binary case). So we write
Where is the Sigmoid function which gives an idea of confidence of a prediction.
Relation with dimensionality
We can still apply linear methods on non-linearly transformed data, we just do something like
The problem is that the number of new features grows exponentially with the number of the dimensions, so it's not usually feasible using this trick. For example if we consider polynomials of degree of dimensions, then we will have the following variables:
which has dimension
The kernel trick
We first want to replace the problem using inner products then we want to replace inner products using kernels. If we are able to do this, then we can compute inner products in feature space quite efficiently. For example if then we can compute by just computing . This is quite important to understand well. See Kernel Methods for more. This leads us in an attempt to kernelize the bayesian linear regression.
Function view
In the previous sections we have derived the Bayesian Linear Regression based on priors on the weights and likelihoods. It is possible to derive something very similar by just observing inputs and outputs, so having a distribution on the space of the functions. This part builds the fundamental ideas that will be later needed for Gaussian Processes so it is quite important to understand this quite well.
Assuming this structure we can assume a structure on the space of functions:
With this formulation the feature map is implicit in the choice of the Kernel.
Predictions in function view
Let's suppose we have a new point , we need to show that with the function view we can do inference without too many problems: Let's define
At the end the inference process is very similar to the one in the linear regression setting: which is a normal with variance and mean 0. Then we just add the aleatoric noise and obtain
Other Trivia
We used Gaussians to represent the uncertainty.
Distributions are related:
