For Bayesian Linear Regression the posterior was a high-dimensional Gaussian, which made inference closed form. In general this is not true, so we need some kind of approximation.
Laplace approximation
Introduction to the Idea 🟩
The idea is to take a second-order Taylor expansion of the log posterior $\psi(\theta) = \log p(\theta \mid y)$ around its mode $\hat{\theta}$:
$$ \psi(\theta) \approx \hat{\psi}(\theta) = \psi(\hat{\theta}) + (\theta-\hat{\theta} ) ^{T} \nabla \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta} ) ^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta} ) = \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta} ) ^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta} ) $$
The first-order term vanishes because we expand around the mode $\hat{\theta}$, a stationary point where the gradient is zero.
$$ \hat{\psi}(\theta) =\log \mathcal{N}(\theta; \hat{\theta}, -H_{\psi}(\hat{\theta})^{-1}) + \text{const} $$
Then we choose $q(\theta)= \mathcal{N}(\theta; \hat{\theta}, -H_{\psi}(\hat{\theta})^{-1}) \propto \exp(\hat{\psi}(\theta))$. One can verify that the covariance matrix $-H_{\psi}(\hat{\theta})^{-1}$ is indeed symmetric positive semidefinite: the Hessian at a maximum is negative semidefinite, so its negation is positive semidefinite. Normally the precision matrix of the Laplace approximation is denoted $\Lambda = - H_{\psi}(\hat{\theta})$, so that $q(\theta) = \mathcal{N}(\theta; \hat{\theta}, \Lambda^{-1})$.
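To make the recipe concrete, here is a minimal numerical sketch of the Laplace approximation on a toy two-dimensional log-posterior (the function `psi` below is a hypothetical example, not one from these notes): find the mode, compute the Hessian there, and read off the Gaussian.

```python
import numpy as np
from scipy.optimize import minimize

# Toy un-normalized log-posterior psi(theta); a hypothetical, slightly skewed density.
def psi(theta):
    return -0.5 * theta @ theta - np.log(1 + np.exp(-3 * theta[0] + theta[1]))

def numerical_hessian(f, x, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i], np.eye(d)[j]
            H[i, j] = (f(x + eps*e_i + eps*e_j) - f(x + eps*e_i - eps*e_j)
                       - f(x - eps*e_i + eps*e_j) + f(x - eps*e_i - eps*e_j)) / (4 * eps**2)
    return H

# 1) Find the mode theta_hat (MAP estimate) by maximizing psi.
res = minimize(lambda t: -psi(t), x0=np.zeros(2))
theta_hat = res.x

# 2) Curvature at the mode: Lambda = -H_psi(theta_hat) is the precision matrix.
Lam = -numerical_hessian(psi, theta_hat)

# 3) Laplace approximation: q(theta) = N(theta_hat, Lambda^{-1}).
Sigma_laplace = np.linalg.inv(Lam)
print(theta_hat, Sigma_laplace)
```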
Laplace approximation for a Gaussian 🟩
We can easily see that the Laplace approximation of a Gaussian is the Gaussian itself, because the negative Hessian of its log-density is exactly the inverse of the covariance matrix:
$$ D_{\theta}D_{\theta}\log p(\theta) = (D_{\theta}(\Sigma^{-1}\mu - \Sigma^{-1} \theta))^{T} = -\Sigma^{-1} $$
An analysis of Bayesian Logistic regression
Setting
Assume labels $y_{i} \in \{-1, +1\}$, a likelihood
$$ p(y_{1:n} \mid w, x_{1:n}) = \prod_{i = 1}^{N} \sigma(y_{i}w^{T}x_{i}) $$
and a prior $p(w) = \mathcal{N}(w; 0, \sigma^{2}_{p}I)$, where $\sigma$ is the sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)}$.
$$ p(y_{i} \mid w, x_{i}) = \begin{cases} \sigma(w^{T}x_{i}) & \text{ if } y_{i} = +1 \\ 1 - \sigma(w^{T}x_{i}) & \text{ if } y_{i} = -1 \end{cases} \implies p(y_{i} \mid w, x_{i}) = \sigma(y_{i}w^{T}x_{i}) $$
The compact form follows from the property of the sigmoid $1 - \sigma(x) = \sigma(-x)$.
MAP Weight update
We then see that, in order to obtain the proxy distribution for $w$, we need the mode and the Hessian of the log posterior. We have (omitting $x$ for brevity):
$$ \hat{w} = \arg \max_{w} p(w \mid y_{1:n}) = \arg \max_{w} \log p(w) + \log p(y_{1:n} \mid w) = \arg \max_{w} -\frac{1}{2\sigma^{2}_{p}} w^{T}w + \sum_{i = 1}^{N} \log \sigma(y_{i}w^{T}x_{i}) $$
Continuing:
$$ = \arg \min_{w} \frac{1}{2\sigma^{2}_{p}}w^{T}w + \sum_{i = 1}^{N} \underbrace{ \log(1 + \exp(- y_{i}w^{T}x_{i})) }_{-\log p(y_{i} \mid w)} \implies \frac{ \partial \log p(y_{1:n} \mid w) }{ \partial w } = \sum_{i = 1}^{N}y_{i}x_{i} \sigma( - y_{i}w^{T}x_{i}) $$
This gives the stochastic gradient ascent update for the MAP estimate, with $c = \frac{1}{2\sigma^{2}_{p}}$ playing the role of a weight-decay constant:
$$ w \leftarrow w(1 - 2c\eta) + \eta y_{i}x_{i}\sigma(-y_{i}w^{T}x_{i}) $$
Laplace approximation for Bayesian logistic regression 🟨+
$$ \Lambda = -\nabla_{w} \nabla_{w} \log p(w \mid y_{1:n}, x_{1:n}) = X^{T}\text{diag}_{i \in [n]} \left\{ \pi_{i} (1 - \pi_{i}) \right\} X + \sigma^{-2}_{p} I $$
with $\pi_{i} = \sigma(\hat{w}^{T}x_{i})$, evaluated at the MAP estimate $\hat{w}$. This precision matrix has a nice interpretation: a data point whose prediction is confident ($\pi_{i}$ close to $0$ or $1$) contributes little to the precision, whereas a point near the decision boundary ($\pi_{i} \approx \frac{1}{2}$) contributes the most curvature.
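As a concrete sketch (assuming labels in $\{-1,+1\}$ and a synthetic dataset, both hypothetical), the following combines the MAP gradient-ascent update from the previous section with the precision formula above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (hypothetical): labels are encoded as y in {-1, +1}.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.5, -2.0, 0.5]) + 0.3 * rng.normal(size=n))

sigma_p2 = 1.0     # prior variance sigma_p^2
eta = 0.1          # learning rate

# 1) MAP estimate by (full-batch) gradient ascent on log p(w | y, X).
w = np.zeros(d)
for _ in range(2000):
    grad_lik = (y * sigmoid(-y * (X @ w))) @ X   # sum_i y_i x_i sigmoid(-y_i w^T x_i)
    grad_prior = -w / sigma_p2
    w = w + eta * (grad_lik + grad_prior) / n

# 2) Laplace precision at the mode: Lambda = X^T diag(pi_i (1 - pi_i)) X + sigma_p^{-2} I.
pi = sigmoid(X @ w)
Lam = X.T @ (X * (pi * (1 - pi))[:, None]) + np.eye(d) / sigma_p2

# 3) Gaussian approximation of the posterior: q(w) = N(w_MAP, Lambda^{-1}).
Sigma_post = np.linalg.inv(Lam)
print(w, np.sqrt(np.diag(Sigma_post)))
```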
Downsides of Laplace approximation 🟩
The Laplace approximation can be overconfident, in a way similar to what we saw for some samples in linear regression. This is because it greedily matches only the local shape of the posterior around the mode: the approximation may be good along some directions and quite bad along others.

Figure from Krause Book
Prediction with Variational Posterior 🟥++
$$ p(y^{*} \mid x^{*}, x_{1:n}, y_{1:n}) = \int p(y^{*} \mid x^{*}, \theta) p(\theta \mid x_{1:n}, y_{1:n}) \, d\theta \approx \int p(y^{*} \mid x^{*}, \theta) q_{\lambda}(\theta) \, d\theta = \mathbb{E}_{\theta \sim q_{\lambda}} [p(y^{*} \mid x^{*}, \theta)] $$
This expectation can be estimated with Monte Carlo samples $\theta^{(j)} \sim q_{\lambda}$:
$$ \mathbb{E}_{\theta \sim q_{\lambda}} [p(y^{*} \mid x^{*}, \theta)] \approx \frac{1}{m} \sum_{j = 1}^{m} p(y^{*} \mid x^{*}, \theta^{(j)}) $$
For logistic regression with a Gaussian $q(\theta) = \mathcal{N}(\mu, \Sigma)$ we can also change variables to $f = \theta^{T}x^{*}$, which is one-dimensional and Gaussian:
$$ \int p(y^{*} \mid x^{*}, \theta) q(\theta) \, d\theta= \int \sigma (y^{*} \cdot \theta^{T}x^{*})q(\theta) \, d\theta = \int \sigma(y^{*} f) \mathcal{N}(f; x^{*T}\mu, x^{*T}\Sigma x^{*}) \, df $$
As this is just a one-dimensional integral, it can be well approximated with numerical methods like Gauss-Legendre quadrature.
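A small sketch of both approximations for $p(y^{*}=1 \mid x^{*})$, assuming some Gaussian $q(\theta) = \mathcal{N}(\mu, \Sigma)$ is already available (the numbers below are hypothetical placeholders); it uses Gauss-Hermite quadrature, which is the natural choice for a Gaussian weight, instead of Gauss-Legendre:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed Gaussian posterior approximation q(theta) = N(mu, Sigma) (e.g. from Laplace or VI)
# and a test point x_star; all values here are hypothetical placeholders.
rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 0.2])
Sigma = 0.1 * np.eye(3)
x_star = np.array([0.5, 1.0, -1.0])

# Option 1: Monte Carlo over theta ~ q.
thetas = rng.multivariate_normal(mu, Sigma, size=2000)
p_mc = sigmoid(thetas @ x_star).mean()

# Option 2: reduce to a 1-D integral over f = theta^T x_star ~ N(m, s^2),
# then apply Gauss-Hermite quadrature.
m = x_star @ mu
s = np.sqrt(x_star @ Sigma @ x_star)
nodes, weights = np.polynomial.hermite.hermgauss(30)
p_quad = (weights * sigmoid(m + np.sqrt(2) * s * nodes)).sum() / np.sqrt(np.pi)

print(p_mc, p_quad)   # both approximate p(y* = 1 | x*)
```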
The variational approach
With this approach we define a family of distributions $\mathcal{Q}$, called the variational family, and then try to find the best member within this family to approximate our target $p$.
For example we can define the variational family of Gaussians as
$$ \mathcal{Q} = \left\{ q(\theta \mid \lambda) = \mathcal{N}(\theta; \mu, \Sigma) \right\} $$
where the parameters $\lambda = (\mu, \Sigma)$ are the mean and the covariance matrix of the Gaussian.
Reverse and Forward KL 🟩
So we optimize one of the following objectives. Reverse KL:
$$ q^{*} \in \arg \min_{q \in Q} KL(q \mid \mid p) $$
Forward KL:
$$ q^{*} \in \arg \min_{q \in Q} KL(p \mid \mid q) $$
See Kullback-Leibler divergence in Entropy.
KL of Gaussians 🟩
Understanding the properties of the KL divergence between Gaussian distributions is crucial for grasping variational inference. I strongly encourage you to review Gaussians#Information Theoretic Properties closely; we will use those results in the observations that follow.
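For reference, a small helper implementing the standard closed-form KL between two multivariate Gaussians (the function name is mine); note how asymmetric the two directions are, which is what the forward/reverse distinction below is about:

```python
import numpy as np

def kl_gaussians(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians (standard closed form)."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0)
                  + diff @ S1_inv @ diff
                  - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Example: the KL divergence is asymmetric.
mu0, S0 = np.zeros(2), np.eye(2)
mu1, S1 = np.array([1.0, 0.0]), 4.0 * np.eye(2)
print(kl_gaussians(mu0, S0, mu1, S1), kl_gaussians(mu1, S1, mu0, S0))
```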
Minimizing Forward KL Divergence
We can give two interpretations of minimizing the forward KL divergence: one as maximum likelihood estimation within the variational family, the other as moment matching.
MLE Estimation
We can prove the following:
$$ \begin{align} \arg\min_{\lambda} KL(p \mid \mid q_{\lambda}) & = \arg\min_{\lambda} H[p \mid \mid q_{\lambda}] - H[p] \\ & = \arg\min_{\lambda} \mathbb{E}_{\theta \sim p} [-\log q_{\lambda}(\theta)] - \text{const} \\ & \approx \arg\min_{\lambda} - \frac{1}{m} \sum_{j = 1}^{m} \log q_{\lambda}(\theta^{(j)}), \qquad \theta^{(j)} \sim p \end{align} $$
This tells us that a maximum likelihood fit of $q_{\lambda}$ on samples drawn from $p$ minimizes the forward KL divergence to the empirical distribution of those samples. The problem is that this is often infeasible in our setting: we cannot draw samples from the true posterior, so this interpretation is usually not directly usable.
Moment Matching
We can prove that if $q$ is a member of an exponential family and minimizes the forward KL divergence, then the expected sufficient statistics of $q$ and $p$ coincide; for Gaussians this means the first two moments match. Write $q$ in exponential-family form:
$$ q(\theta) = \exp(\lambda^{T}T(\theta) - A(\lambda)) $$Where $T(\theta)$ are the sufficient statistics of the distribution and $A(\lambda)$ is the log partition function.
$$ \begin{align} q^{*} = \arg\min_{\lambda}KL(p \mid \mid q) &= \arg\min_{\lambda}\mathbb{E}_{\theta \sim p} \left[ \log \frac{p(\theta)}{q(\theta)} \right] \\ &= \arg\max_{\lambda}\mathbb{E}_{\theta \sim p} \left[\lambda^{T}T(\theta) - A(\lambda) \right] \\ \end{align} $$
Recall that $A(\lambda) = \log \int \exp(\lambda^{T}T(\theta)) \, d\theta$. Setting the gradient with respect to $\lambda$ to zero:
$$ \begin{align} 0 &= \nabla_{\lambda} \mathbb{E}_{\theta \sim p} \left[\lambda^{T}T(\theta) - A(\lambda) \right] \\ &= \mathbb{E}_{\theta \sim p} \left[T(\theta) \right] - \nabla_{\lambda} A(\lambda) \\ &= \mathbb{E}_{\theta \sim p} \left[T(\theta) \right] - \nabla_{\lambda} \log \int \exp(\lambda^{T}T(\theta)) \, d\theta \\ &= \mathbb{E}_{\theta \sim p} \left[T(\theta) \right] - \mathbb{E}_{\theta \sim q} \left[T(\theta) \right] \\ &\implies \mathbb{E}_{\theta \sim p} \left[T(\theta) \right] = \mathbb{E}_{\theta \sim q} \left[T(\theta) \right] \end{align} $$
This means the expected sufficient statistics under the true distribution equal those under the variational one. For Gaussians the sufficient statistics are $\theta$ and $\theta\theta^{T}$, so the mean and the covariance matrix of $q$ match those of $p$.
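A tiny sketch of what moment matching looks like in practice: given samples from a (hypothetical) bimodal $p$, the forward-KL-optimal Gaussian simply matches the empirical mean and variance, and therefore spreads over both modes:

```python
import numpy as np

# Samples from a non-Gaussian "true" p (here a hypothetical bimodal mixture).
rng = np.random.default_rng(0)
z = rng.random(5000) < 0.5
samples = np.where(z, rng.normal(-2.0, 0.5, 5000), rng.normal(2.0, 1.0, 5000))

# Forward-KL / moment matching for a Gaussian variational family:
# match E_p[T(theta)] for T(theta) = (theta, theta^2), i.e. the mean and variance.
mu_q = samples.mean()
var_q = samples.var()
print(mu_q, var_q)  # q = N(mu_q, var_q) covers both modes, typical forward-KL behaviour
```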
Minimizing Reverse KL Divergence 🟩-
We need to compute the value
$$ \begin{align}
q^{*}_{\lambda} &= \arg\min_{q \in Q} KL(q \mid \mid p(\cdot \mid y)) \\
&= \arg\min_{q \in Q}\mathbb{E}_{\theta \sim q} \left[ \log \frac{q(\theta)}{p(\theta \mid y)} \right] \\
&= \arg\min_{q \in Q} - H(q) - \mathbb{E}_{\theta \sim q}[\log p(\theta \mid y)] \\
&= \arg\max_{q \in Q} H(q) + \mathbb{E}_{\theta \sim q}[\log p(\theta \mid y)] \\
&= \arg\max_{q \in Q} H(q) + \mathbb{E}_{\theta \sim q}[\log p(\theta)] + \mathbb{E}_{\theta \sim q} [\log p (y \mid \theta)] + \text{const} \\
&\implies q^{*}_{\lambda} = \arg\max_{q \in Q} \mathbb{E}_{\theta \sim q} [\log p(y \mid \theta)] - KL(q \mid \mid p(\theta)) = \text{ELBO}
\end{align} $$
Note that the KL term is now with respect to the prior, rather than the posterior we were trying to approximate, so every quantity in the objective can be evaluated. We can interpret the first term as an expected log-likelihood, while the second is a proximity term to the prior that acts as a regularizer. This cost function is related to the variational free energy principle (strongly put forward by Hinton), and it also has links with the minimum description length principle, cited in Randomness, Model of Analogies.
Evidence Lower Bound 🟨 –
$$ L(q, p; D_{n}) = \log p(y_{1:n} \mid x_{1:n}) - KL(q \mid \mid p(\cdot \mid x_{1:n}, y_{1:n})) \implies \log p(y_{1:n} \mid x_{1:n}) \geq L(q, p; D_{n}) $$
The evidence is defined as $\log p(y_{1:n} \mid x_{1:n})$. Since the KL divergence is non-negative, $L(q, p; D_{n})$ is a lower bound on the evidence. We can prove the lower bound in another way, noticing that:
$$ \begin{align} \log p(y_{1:n}) &= \log \int p(y_{1:n} \mid \theta) p(\theta) \, d\theta \\ &= \log \int p(y_{1:n} \mid \theta) p(\theta) \frac{q(\theta)}{q(\theta)} \, d\theta \\ &\geq \mathbb{E}_{\theta \sim q} \left[ \log \left( p(y_{1:n} \mid \theta) \frac{p(\theta)}{q(\theta)} \right) \right] \quad \text{(Jensen's inequality)} \\ &= \mathbb{E}_{\theta \sim q} \left[ \log p(y_{1:n} \mid \theta) \right] - KL(q \mid \mid p) = \text{ELBO} \end{align} $$
We observe that this is the same quantity we obtained while minimizing the reverse KL divergence. Thus, finding the best approximation within the variational family is equivalent to maximizing a lower bound on the evidence; essentially, it amounts to determining the approximate posterior that best explains the evidence.
This indicates that maximizing the evidence lower bound is a reasonable criterion for model selection, which can be used instead of maximizing the evidence (marginal likelihood) directly.
ELBO of logistic regression 🟨
Suppose we have classic logistic regression with prior $p(w) = \mathcal{N}(w; 0, I)$ on the weights and likelihood $p(y \mid w, x) = \prod_{i = 1}^{N} \sigma(y_{i}w^{T}x_{i})$. As variational family we take Gaussians $\mathcal{N}(\mu, \text{diag}_{i \in [d]}\left\{ \sigma_{i}^{2} \right\})$. We now want to find the ELBO for this model.
$$ KL(q \mid \mid p) = \frac{1}{2} \sum_{i = 1}^{d} \left( \sigma_{i}^{2} + \mu_{i}^{2} - \log \sigma_{i}^{2} - 1 \right) $$
$$ \mathbb{E}_{w \sim q} \left[ \log p(y \mid w, x) \right] = \sum_{i = 1}^{N} \mathbb{E}_{w \sim q} \left[ \log \sigma(y_{i}w^{T}x_{i}) \right] $$
Putting the two together gives the ELBO: the KL term is available in closed form, while the expected log-likelihood can be estimated with Monte Carlo samples from $q$.
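A minimal Monte Carlo estimate of this ELBO, assuming a diagonal Gaussian $q$ parametrized by $(\mu, \log\sigma)$ and a tiny hypothetical dataset with labels in $\{-1,+1\}$ (all names below are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elbo(mu, log_sigma, X, y, n_samples=32, rng=np.random.default_rng(0)):
    """Monte Carlo ELBO for logistic regression with prior N(0, I) and
    variational family N(mu, diag(sigma_i^2)); a sketch with hypothetical names."""
    sigma = np.exp(log_sigma)
    d = len(mu)

    # Closed-form KL( q || N(0, I) ) for a diagonal Gaussian.
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 2 * log_sigma - 1)

    # Reparametrized Monte Carlo estimate of E_q[ log p(y | w, X) ].
    eps = rng.normal(size=(n_samples, d))
    W = mu + sigma * eps                           # w^(j) = mu + sigma * eps^(j)
    logits = y[None, :] * (W @ X.T)                # y_i w^T x_i for every sample j
    exp_loglik = np.mean(np.sum(np.log(sigmoid(logits)), axis=1))

    return exp_loglik - kl

# Tiny hypothetical dataset with labels in {-1, +1}.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -1.2]])
y = np.array([1.0, -1.0, -1.0])
print(elbo(np.zeros(2), np.zeros(2), X, y))
```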
Gradient of ELBO 🟩–
We need the gradient of the ELBO with respect to the variational parameters $\lambda$, but the first term (the expected log-likelihood) is difficult to differentiate because the expectation is taken with respect to a distribution that itself depends on $\lambda$. One possible approach is the score function trick, explored in RL Function Approximation, also called score gradients or Monte Carlo gradients. Another is the so-called reparametrization trick.
$$ \mathbb{E}_{\theta \sim q} [f(\theta)] = \mathbb{E}_{\varepsilon \sim \phi} [f(g(\varepsilon, \lambda))] $$
Now we can sample from a known, fixed distribution $\phi$ and push the samples through $g(\varepsilon, \lambda)$; since the distribution we sample from no longer depends on $\lambda$, we can move the gradient inside the expectation and compute stochastic gradients. If you know the ideas of Volume, the change of variables formula becomes very clean.
Let’s work out an example: suppose $q$ is a Gaussian $\mathcal{N}(\mu, \Sigma)$; then we can reparametrize it as $\theta = \mu + \Sigma^{1/2} \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, which we can sample from.
$$ \nabla_{\lambda} L(\lambda) = \nabla_{\lambda} \left( \mathbb{E}_{\theta \sim q} [ \log p(y \mid x, \theta) ] - KL(q \mid \mid p) \right) = \nabla_{\lambda} \mathbb{E}_{\varepsilon \sim \phi} [ \log p(y \mid x, \Sigma^{1/2}\varepsilon + \mu ) ] - \nabla_{\lambda} KL(q \mid \mid p) $$
The KL term is available in closed form for a Gaussian $q$ and prior, so the difficult part is the expected log-likelihood. Writing $C = \Sigma^{1/2}$, so that $\lambda = (\mu, C)$:
$$ \begin{align} \nabla_{\lambda} \mathbb{E}_{\theta \sim q} [ \log p(y_{1:n} \mid x_{1:n}, \theta) ] &= \nabla_{\lambda} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} [ \log p(y_{1:n} \mid x_{1:n}, C\varepsilon + \mu ) ] \\ & = n \, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \mathbb{E}_{i \sim \text{Unif}\{1, \ldots, n\}}[\nabla_{\lambda} \log p(y_{i} \mid x_{i}, C\varepsilon + \mu ) ]\\ & \approx \frac{n}{m} \sum_{j = 1}^{m} \nabla_{C, \mu} \log p(y_{i_{j}} \mid x_{i_{j}}, C\varepsilon_{j} + \mu) \\ \end{align} $$
Which means: we can approximate the derivative of the likelihood term with respect to the variational parameters by sampling $\varepsilon$ from a standard normal and indices uniformly from the dataset, and only then computing an average gradient of the log-likelihood with respect to the parameters.
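A sketch of this estimator with the gradients written out by hand for logistic regression (the names and the toy dataset are hypothetical); the chain rule through $w = \mu + C\varepsilon$ gives $\nabla_{\mu} = g$ and $\nabla_{C} = g\varepsilon^{T}$, where $g$ is the gradient of the log-likelihood with respect to $w$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reparam_gradient(mu, C, X, y, m=64, rng=np.random.default_rng(0)):
    """Stochastic gradient of E_q[ log p(y_{1:n} | x_{1:n}, w) ] w.r.t. (mu, C),
    where w = mu + C eps with eps ~ N(0, I); a sketch with hypothetical names."""
    n, d = X.shape
    grad_mu = np.zeros(d)
    grad_C = np.zeros((d, d))
    for _ in range(m):
        eps = rng.normal(size=d)
        i = rng.integers(n)                               # uniformly sampled data index
        w = mu + C @ eps                                  # reparametrized sample w^(j)
        g_w = y[i] * X[i] * sigmoid(-y[i] * (w @ X[i]))   # d/dw log sigmoid(y_i w^T x_i)
        grad_mu += g_w                                    # chain rule: dw/dmu = I
        grad_C += np.outer(g_w, eps)                      # chain rule: grad_C = g eps^T
    return (n / m) * grad_mu, (n / m) * grad_C

# Hypothetical usage with a tiny dataset, labels in {-1, +1}.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -1.2]])
y = np.array([1.0, -1.0, -1.0])
g_mu, g_C = reparam_gradient(np.zeros(2), np.eye(2), X, y)
print(g_mu, g_C)
```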
Variational inference enables us to find approximations of distributions using highly optimized stochastic optimization techniques. However, a significant drawback is the difficulty in assessing the quality of these approximations.