With variational inference we want to find a good approximation of the posterior distribution from which it is easy to sample. The objective is to approximate the posterior with a simpler distribution, because the exact posterior is often difficult to compute, in particular its normalization constant.
$$ p(\theta \mid x_{1:n}, y_{1:n}) = \frac{1}{z} p(y_{1:n} \mid \theta, x_{1:n}) p(\theta \mid x_{1:n}) \approx q(\theta \mid \lambda) $$For Bayesian Linear Regression we had high dimensional Gaussians which made inference closed form; in general this is not true, so we need some kind of approximation.
Laplace approximation
Introduction to the Idea 🟩
Let’s consider $\psi(\theta) = \log p(\theta \mid y_{1:n}, x_{1:n})$ and its Taylor approximation (see Hopital, Taylor, Peano) around the MAP mode $\hat{\theta}$. Then we have
$$ \psi(\theta) \approx \hat{\psi}(\theta) = \psi(\hat{\theta}) + (\theta-\hat{\theta} ) ^{T} \nabla \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta} ) ^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta} ) = \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta} ) ^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta} ) $$We dropped the first-order term because we expand around the mode $\hat{\theta}$, which is a stationary point, so the gradient vanishes there.
This Taylor approximation is quite similar to the log of a Gaussian density, which for $\mathcal{N}(\mu, \Sigma)$ equals $-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)$ up to an additive constant, so we can do some pattern matching and consider our $\psi$ function to be:
$$ \hat{\psi}(\theta) =\log \mathcal{N}(\theta; \hat{\theta}, -H_{\psi}^{-1}) + const $$Then we choose $q(\theta)= \mathcal{N}(\theta; \hat{\theta}, -H_{\psi}^{-1}) \propto \exp(\hat{\psi}(\theta))$. One can verify that the resulting covariance matrix is indeed symmetric and positive definite: at the maximum $H_{\psi}(\hat{\theta})$ is negative (semi)definite, so $-H_{\psi}^{-1}$ is positive definite whenever the mode is a strict maximum. Normally, the precision matrix of the Laplace approximation is denoted $\Lambda = - H_{\psi}(\hat{\theta})$, so the covariance is $\Lambda^{-1}$.
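As a concrete illustration, here is a minimal sketch of the procedure in Python, assuming a generic unnormalized log-posterior `log_posterior` (any additive constant is irrelevant); the BFGS optimizer and the finite-difference Hessian are illustrative choices, not part of the derivation above.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(log_posterior, theta0, eps=1e-4):
    """Fit q(theta) = N(theta_hat, Lambda^{-1}) with Lambda = -H_psi(theta_hat)."""
    # 1) Find the mode (MAP estimate) by minimizing -psi.
    res = minimize(lambda th: -log_posterior(th), theta0)
    theta_hat = res.x
    d = theta_hat.size

    # 2) Numerically estimate the Hessian of psi at the mode (central differences).
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (log_posterior(theta_hat + e_i + e_j)
                       - log_posterior(theta_hat + e_i - e_j)
                       - log_posterior(theta_hat - e_i + e_j)
                       + log_posterior(theta_hat - e_i - e_j)) / (4 * eps**2)

    precision = -H                       # Lambda = -H_psi(theta_hat)
    covariance = np.linalg.inv(precision)
    return theta_hat, covariance
```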
Laplace approximation for a Gaussian 🟩
We can easily see that the Laplace approximation of a Gaussian is the Gaussian itself, because the negative Hessian of the log-density is exactly the inverse of the covariance matrix (the precision).
It’s easy to see that the mode of the Gaussian is exactly its mean, and the Hessian is computed as follows:
$$ D_{\theta}D_{\theta}\log p(\theta) = (D_{\theta}(\Sigma^{-1}\mu - \Sigma^{-1} \theta))^{T} = -\Sigma^{-1} $$
An analysis of Bayesian Logistic regression
Setting
For Bayesian logistic regression we have that
$$ p(y_{1:n} \mid w, x_{1:n}) = \prod_{i = 1}^{N} \sigma(y_{i}w^{T}x_{i}) $$And that $p(w) = \mathcal{N}(w; 0, \sigma^{2}_{p}I)$, where $\sigma$ is the Sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)}$.
We use this likelihood because, encoding the labels as $y_{i} \in \{-1, +1\}$, it is natural to say
$$ p(y_{i} \mid w, x_{1:n}) = \begin{cases} \sigma(w^{T}x_{i}) & \text{ if } y_{i} = 1 \\ 1 - \sigma(w^{T}x_{i}) & \text{ if } y_{i} = -1 \end{cases} \implies p(y_{i} \mid w, x_{1:n}) = \sigma(y_{i}w^{T}x_{i}) $$thanks to the property of the sigmoid function $1 - \sigma(x) = \sigma(-x)$.
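A quick numerical sanity check of this identity (with labels encoded as $\pm 1$); the names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f = 0.7  # some value of w^T x_i
for y in (+1, -1):
    case_form = sigmoid(f) if y == 1 else 1 - sigmoid(f)   # case-by-case likelihood
    compact_form = sigmoid(y * f)                          # compact form sigma(y w^T x)
    assert np.isclose(case_form, compact_form)             # uses 1 - sigma(f) = sigma(-f)
```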
MAP Weight update
We then see that in order to obtain the proxy distribution for $w$ we need the mode and the Hessian of the log-posterior. We have that (we omit $x$ for brevity):
$$ \hat{w} = \arg \max_{w} p(w \mid y_{1:n}) = \arg \max_{w} \log p(w) + \log p(y_{1:n} \mid w) = \arg \max_{w} -\frac{1}{2\sigma^{2}_{p}} w^{T}w + \sum_{i = 1}^{N} \log \sigma(y_{i}w^{T}x_{i}) $$
Continuing:
$$ = \arg \min_{w} \frac{1}{2\sigma^{2}_{p}}w^{T}w + \sum_{i = 1}^{N} \underbrace{ \log(1 + \exp(- y_{i}w^{T}x_{i})) }_{-\log p(y_{i} \mid w)} \implies \frac{ \partial (-\log p(y_{1:n} \mid w)) }{ \partial w } = -\sum_{i = 1}^{N}y_{i}x_{i} \sigma( - y_{i}w^{T}x_{i}) $$We can interpret $\sigma(-y_{i}w^{T}x_{i})$ as the probability of misclassifying the $i$-th point. This objective can then be minimized by stochastic gradient descent, sampling a data point $(x_{i}, y_{i})$ and using the regularization constant $c = \frac{1}{2\sigma^{2}_{p}}$, with updates as follows:
$$ w \leftarrow w(1 - 2c\eta) + \eta y_{i}x_{i}\sigma(-y_{i}w^{T}x_{i}) $$
Laplace approximation for Bayesian logistic regression 🟨+
Then for the precision matrix we have (computation as exercise):
$$ \Lambda = -\nabla_{w} \nabla_{w} \log p(w \mid y_{1:n}, x_{1:n}) = X^{T}\text{diag}_{i \in [n]} \left\{ \pi_{i} (1 - \pi_{i}) \right\} X + \sigma^{-2}_{p} I $$With $\pi_{i} = \sigma(w^{T}x_{i})$ evaluated at $\hat{w}$. There is a nice interpretation of this precision matrix: data points the model classifies confidently ($\pi_{i}$ close to $0$ or $1$) contribute little to the precision, while uncertain points ($\pi_{i} \approx \frac{1}{2}$) contribute the most.
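Putting the MAP estimate and the precision formula together, here is a minimal sketch, assuming labels in $\{-1, +1\}$; `laplace_logistic` and the full-batch gradient ascent (instead of the stochastic update above) are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_logistic(X, y, sigma_p2=1.0, lr=0.1, epochs=500):
    """Laplace approximation N(w_hat, Lambda^{-1}) for Bayesian logistic regression.

    X: (n, d) inputs, y: (n,) labels in {-1, +1}, sigma_p2: prior variance sigma_p^2.
    """
    n, d = X.shape
    w = np.zeros(d)

    # MAP estimate via gradient ascent on log p(w | y_{1:n}, x_{1:n}).
    for _ in range(epochs):
        grad_lik = X.T @ (y * sigmoid(-y * (X @ w)))  # sum_i y_i x_i sigma(-y_i w^T x_i)
        grad_prior = -w / sigma_p2                    # gradient of log N(w; 0, sigma_p2 I)
        w += lr * (grad_lik + grad_prior) / n

    # Precision: Lambda = X^T diag(pi_i (1 - pi_i)) X + sigma_p^{-2} I, pi_i = sigma(w^T x_i).
    pi = sigmoid(X @ w)
    Lambda = X.T @ ((pi * (1 - pi))[:, None] * X) + np.eye(d) / sigma_p2
    return w, Lambda
```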
Downsides of Laplace approximation 🟩
The Laplace approximation can be overconfident, in the same way linear regression is for some samples. This is because it only matches the posterior locally around its mode, so the approximation may be good along some directions and quite bad along others.
Prediction with Variational Posterior 🟥++
Now that we have an approximation of the posterior, it should be far easier to make predictions!
$$ p(y^{*} \mid x^{*}, x_{1:n}, y_{1:n}) = \int p(y^{*} \mid x^{*}, \theta) p(\theta \mid x_{1:n}, y_{1:n}) \, d\theta \approx \int p(y^{*} \mid x^{*}, \theta) q_{\lambda}(\theta) \, d\theta = \mathbb{E}_{\theta \sim q_{\lambda}} [p(y^{*} \mid x^{*}, \theta)] $$The last expectation can be approximated using Monte Carlo methods as
$$ \mathbb{E}_{\theta \sim q} [p(y^{*} \mid x^{*}, \theta)] \approx \frac{1}{m} \sum_{j = 1}^{m} p(y^{*} \mid x^{*}, \theta^{(j)}), \quad \theta^{(j)} \sim q_{\lambda} $$Or we could do a change of variables $f = \theta^{T}x^{*}$ and estimate the resulting one-dimensional integral with numerical methods, essentially to machine precision:
$$ \int p(y^{*} \mid x^{*}, \theta) q(\theta) \, d\theta= \int \sigma (y^{*} \cdot \theta^{T}x^{*})q(\theta) \, d\theta = \int \sigma(y^{*} f) \mathcal{N}(f; x^{*T}\mu, x^{*T}\Sigma x^{*}) \, df $$since $f = \theta^{T}x^{*}$ is a linear transformation of the Gaussian $q(\theta) = \mathcal{N}(\mu, \Sigma)$. As this is just a one-dimensional integral, it can be well approximated with numerical methods like Gauss-Legendre quadrature.
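A hedged sketch of both predictive approximations, assuming a Gaussian (e.g. Laplace) posterior $\mathcal{N}(\mu, \Sigma)$ over the weights and labels in $\{-1, +1\}$. Here Gauss-Hermite quadrature is used for the one-dimensional integral (a natural choice under a Gaussian weight, swapped in for the Gauss-Legendre rule mentioned above); all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predictive_prob(x_star, mu, Sigma, y_star=1, n_mc=1000, n_quad=50, rng=None):
    """P(y* | x*) under q(w) = N(mu, Sigma), via Monte Carlo and 1D quadrature."""
    rng = np.random.default_rng(0) if rng is None else rng

    # Monte Carlo: average sigma(y* w^T x*) over samples w ~ q.
    W = rng.multivariate_normal(mu, Sigma, size=n_mc)          # (n_mc, d)
    p_mc = sigmoid(y_star * (W @ x_star)).mean()

    # Change of variables: f = w^T x* is 1D Gaussian with these moments.
    m = x_star @ mu
    s2 = x_star @ Sigma @ x_star

    # Gauss-Hermite quadrature for E_{f ~ N(m, s2)}[sigma(y* f)].
    t, wts = np.polynomial.hermite.hermgauss(n_quad)
    f = m + np.sqrt(2 * s2) * t
    p_quad = (wts * sigmoid(y_star * f)).sum() / np.sqrt(np.pi)

    return p_mc, p_quad
```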
The variational approach
With this approach we define a family of distributions $\mathcal{Q}$, called the variational family, and then try to find the best member within this family to approximate our $p$.
For example we can define the variational family of Gaussians as
$$ \mathcal{Q} = \left\{ q(\theta \mid \lambda) = \mathcal{N}(\theta; \mu, \Sigma) \right\} $$Where the parameters $\lambda = (\mu, \Sigma)$ are the mean and the covariance matrix of the Gaussian.
Reverse and Forward KL 🟩
So we optimize one of the following objectives. The Reverse KL tends to greedily select a mode and underestimate the variance, which in this case leads to an overconfident prediction:
$$ q^{*} \in \arg \min_{q \in Q} KL(q \mid \mid p) $$
Or the Forward KL, which is usually preferable but also usually more difficult to compute:
$$ q^{*} \in \arg \min_{q \in Q} KL(p \mid \mid q) $$See Kullback-Leibler divergence in Entropy.
KL of Gaussians 🟩
Understanding the properties of the KL divergence between Gaussian distributions is crucial for grasping variational inference. I strongly encourage you to review Gaussians#Information Theoretic Properties closely; we will use those results in the observations below.
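For reference, a small sketch of the closed-form KL divergence between two multivariate Gaussians (the result referenced above); the function name is illustrative.

```python
import numpy as np

def kl_gaussians(mu0, Sigma0, mu1, Sigma1):
    """KL( N(mu0, Sigma0) || N(mu1, Sigma1) ), closed form."""
    d = mu0.size
    Sigma1_inv = np.linalg.inv(Sigma1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(Sigma1_inv @ Sigma0)        # trace term
                  + diff @ Sigma1_inv @ diff           # Mahalanobis term
                  - d                                  # dimension
                  + np.log(np.linalg.det(Sigma1) / np.linalg.det(Sigma0)))
```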
Minimizing Forward KL Divergence
We can provide two interpretations of the forward KL divergence: one as maximum likelihood estimation within the variational family, the other as moment matching.
MLE Estimation
We can prove the following:
$$ \begin{align} \arg\min_{\lambda} KL(p \mid \mid q_{\lambda}) & = \arg\min_{\lambda} H[p \mid \mid q_{\lambda}] - H[p] \\ & = \arg\min_{\lambda} \mathbb{E}_{\theta \sim p} [-\log q_{\lambda}(\theta)] - \text{const} \\ & \approx \arg\min_{\lambda} - \frac{1}{m} \sum_{j = 1}^{m} \log q_{\lambda}(\theta^{(j)}), \quad \theta^{(j)} \sim p \end{align} $$
The problem is that this is often infeasible: the approximation in the last step requires samples from the true distribution $p$, which is exactly what we do not have. Still, it tells us that any maximum likelihood estimate $q_{\lambda}$ minimizes the forward KL divergence to the empirical data distribution. Because we cannot draw samples from the true distribution, this first interpretation is usually not used directly.
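For intuition only, here is a sketch of the MLE view, assuming (purely for illustration) that samples from $p$ were available; `fit_forward_kl` and the diagonal-Gaussian variational family are my own choices.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def fit_forward_kl(samples_p):
    """Fit a diagonal Gaussian q_lambda to samples from p by (approximately)
    minimizing the forward KL, i.e. maximizing the average log q_lambda."""
    d = samples_p.shape[1]

    def neg_avg_loglik(lam):
        mu, log_sigma = lam[:d], lam[d:]
        # average log-density of the samples under q_lambda (negated for minimization)
        return -norm.logpdf(samples_p, loc=mu, scale=np.exp(log_sigma)).sum(axis=1).mean()

    res = minimize(neg_avg_loglik, np.zeros(2 * d))
    return res.x[:d], np.exp(res.x[d:])   # mean and per-dimension standard deviation
```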
Moment Matching
We can prove that if we find a $q$ that minimizes the forward KL divergence, then the expected sufficient statistics of $p$ and $q$ match; for a Gaussian variational family this means the first two moments of the distributions are the same.
First, we consider the variational family to be part of The Exponential Family. So we can write $q$ in the following manner:
$$ q(\theta) = \exp(\lambda^{T}T(\theta) - A(\lambda)) $$Where $T(\theta)$ are the sufficient statistics of the distribution and $A(\lambda)$ is the log partition function.
Then we can write the KL divergence as:
$$ \begin{align} q^{*} = \arg\min_{\lambda}KL(p \mid \mid q) &= \arg\min_{\lambda}\mathbb{E}_{\theta \sim p} \left[ \log \frac{p(\theta)}{q(\theta)} \right] \\ &= \arg\max_{\lambda}\mathbb{E}_{\theta \sim p} \left[\lambda^{T}T(\theta) - A(\lambda) \right] \\ \end{align} $$Recall that $A(\lambda) = \log \int \exp(\lambda^{T}T(\theta)) \, d\theta$, and note that $\log p(\theta)$ does not depend on $\lambda$.
If we now differentiate and set to zero we obtain:
$$ \begin{align} 0 &= \nabla_{\lambda} \mathbb{E}_{\theta \sim p} \left[\lambda^{T}T(\theta) - A(\lambda) \right] \\ &= \mathbb{E}_{\theta \sim p} \left[T(\theta) \right] - \nabla_{\lambda} A(\lambda) \\ &= \mathbb{E}_{\theta \sim p} \left[T(\theta) \right] - \nabla_{\lambda} \log \int \exp(\lambda^{T}T(\theta)) \, d\theta \\ &= \mathbb{E}_{\theta \sim p} \left[T(\theta) \right] - \mathbb{E}_{\theta \sim q} \left[T(\theta) \right] \\ &\implies \mathbb{E}_{\theta \sim p} \left[T(\theta) \right] = \mathbb{E}_{\theta \sim q} \left[T(\theta) \right] \end{align} $$Which means the expected values of the sufficient statistics under the true distribution are the same as under the variational one. For Gaussians, the sufficient statistics are $\theta$ and $\theta\theta^{T}$, so matching them matches the mean and the covariance matrix.
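For a Gaussian variational family this reduces to matching the empirical mean and covariance of samples from $p$ (again assuming, for illustration, that such samples exist); the sketch below just spells that out.

```python
import numpy as np

def moment_match_gaussian(samples_p):
    """Best Gaussian approximation of p under the forward KL: match the first two
    moments, i.e. the expected sufficient statistics of the Gaussian family."""
    mu = samples_p.mean(axis=0)
    Sigma = np.cov(samples_p, rowvar=False, bias=True)  # MLE covariance (normalize by m)
    return mu, Sigma
```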
Minimizing Reverse KL Divergence 🟩-
We need to compute the value
$$ \begin{align} q^{*}_{\lambda} &= \arg\min_{q \in Q} KL(q \mid \mid p(\cdot \mid y)) \\ &= \arg\min_{q \in Q}\mathbb{E}_{\theta \sim q} \left[ \log \frac{q(\theta)}{p(\theta \mid y)} \right] \\ &= \arg\min_{q \in Q} - H(q) - \mathbb{E}_{\theta \sim q}[\log p(\theta \mid y)] \\ &= \arg\max_{q \in Q} H(q) + \mathbb{E}_{\theta \sim q}[\log p(\theta \mid y)] \\ &= \arg\max_{q \in Q} H(q) + \mathbb{E}_{\theta \sim q}[\log p(\theta)] + \mathbb{E}_{\theta \sim q} [\log p (y \mid \theta)] + \text{const} \\ &\implies q^{*}_{\lambda} = \arg\max_{q \in Q} \mathbb{E}_{\theta \sim q} [\log p(y \mid \theta)] - KL(q \mid \mid p(\theta)) = \text{ELBO} \end{align} $$Note that the KL term now involves the prior $p(\theta)$, which we can evaluate, rather than the posterior we were trying to approximate. We can interpret the first term as an expected likelihood, while the second acts as a proximity measure to the prior, i.e. a regularizer.
Evidence Lower Bound 🟨 –
We first encountered this quantity in Autoencoders when analyzing variational auto-encoders. The ELBO can also be defined in other ways; for a dataset $D_{n}$, for instance, another formulation is:
$$ L(q, p; D_{n}) = \log p(y_{1:n} \mid x_{1:n}) - KL(q \mid \mid p(\cdot \mid x_{1:n}, y_{1:n})) \implies \log p(y_{1:n} \mid x_{1:n}) \geq L(q, p; D_{n}) $$The evidence is defined as $\log p(y_{1:n} \mid x_{1:n})$ (we omit the $x$'s below for brevity); since the KL divergence is non-negative, $L(q, p; D_{n})$ is a lower bound on it. We can prove the lower bound in another way, noticing that: $$ \begin{align} \log p(y_{1:n}) &= \log \int p(y_{1:n} \mid \theta) p(\theta) \, d\theta \\ &= \log \int p(y_{1:n} \mid \theta) p(\theta) \frac{q(\theta)}{q(\theta)} \, d\theta \\ \text{using Jensen } &\geq \mathbb{E}_{\theta \sim q} \left[ \log \left( p(y_{1:n} \mid \theta) \frac{p(\theta)}{q(\theta)} \right) \right] \\ & = \mathbb{E}_{\theta \sim q} \left[ \log p(y_{1:n} \mid \theta) \right] - KL(q \mid \mid p) = \text{ELBO} \end{align} $$We observe that this is the same quantity we obtained when minimizing the reverse KL divergence. Thus, finding the best approximation is equivalent to finding the tightest lower bound on the evidence. Essentially, it involves determining the posterior that is most likely to explain the evidence.
This indicates that maximizing the evidence lower bound is an adequate method of model selection which can be used instead of maximizing the evidence (marginal likelihood) directly.
ELBO of logistic regression 🟨
Suppose we have classic logistic regression with a prior on the weights $p(w) = \mathcal{N}(w; 0, I)$ and likelihood $p(y \mid w, x) = \prod_{i = 1}^{N} \sigma(y_{i}w^{T}x_{i})$. As our variational family we take Gaussian distributions $\mathcal{N}(\mu, \text{diag}_{i \in [d]}\left\{ \sigma_{i}^{2} \right\})$. We now want to find the ELBO for this model.
We decompose the ELBO again into the expected likelihood and the Kullback-Leibler term. We can see from Gaussians that the KL between such distributions is:
$$ KL(q \mid \mid p) = \frac{1}{2} \sum_{i = 1}^{d} \left( \sigma_{i}^{2} + \mu_{i}^{2} - \log \sigma_{i}^{2} - 1 \right) $$While the likelihood is:
$$ \mathbb{E}_{w \sim q} \left[ \log p(y \mid w, x) \right] = \sum_{i = 1}^{N} \mathbb{E}_{w \sim q} \left[ \log \sigma(y_{i}w^{T}x_{i}) \right] $$Putting the two terms together gives the ELBO; the expectation can be estimated by sampling $w \sim q$.
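A minimal sketch of estimating this ELBO, using the closed-form KL above and a Monte Carlo estimate of the expected log-likelihood; the names and the choice of sampling $w$ via reparametrization are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elbo_estimate(X, y, mu, sigma, n_samples=100, rng=None):
    """Monte Carlo estimate of the ELBO for logistic regression with
    q(w) = N(mu, diag(sigma^2)) and prior p(w) = N(0, I); y in {-1, +1}."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = mu.size

    # Closed-form KL( q || p ) for a diagonal q against a standard normal prior.
    kl = 0.5 * np.sum(sigma**2 + mu**2 - np.log(sigma**2) - 1)

    # Expected log-likelihood, estimated by sampling w ~ q (reparametrized).
    eps = rng.standard_normal((n_samples, d))
    W = mu + sigma * eps                                   # (n_samples, d)
    log_lik = np.log(sigmoid(y * (W @ X.T))).sum(axis=1)   # sum over data points
    return log_lik.mean() - kl
```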
Gradient of ELBO 🟥+
We want to compute the gradient of the ELBO with respect to the variational parameters $\lambda$, but the first term (the expected log-likelihood) is difficult, because the expectation itself depends on $\lambda$. One possible approach is the score function trick, explored in RL Function Approximation, also called score gradients or Monte Carlo gradients. Another is the so-called reparametrization trick.
Suppose we have a random variable $\varepsilon \sim \phi$ and that $\theta = g(\varepsilon, \lambda)$ for an invertible function $g$. Then we can write the expectation as
$$ \mathbb{E}_{\theta \sim q} [f(\theta)] = \mathbb{E}_{\varepsilon \sim \phi} [f(g(\varepsilon, \lambda))] $$So now we can sample from a known, fixed distribution and then compute the gradient of the expectation, which allows us to compute stochastic gradients. If you know the ideas in Volume, the change of variables formula becomes very clean.
Let’s work out an example: suppose $q$ follows a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$; then we can reparametrize it with $\theta = \mu + \Sigma^{1/2} \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, I)$ is something we can sample from.
$$ \begin{align} \nabla_{\lambda} L(\lambda) &= \nabla_{\lambda} \mathbb{E}_{\theta \sim q} [ \log p(y \mid x, \theta) ] - \nabla_{\lambda} KL(q \mid \mid p) \\ &= \nabla_{\lambda} \mathbb{E}_{\varepsilon \sim \phi} [ \log p(y \mid x, \Sigma^{1/2}\varepsilon + \mu ) ] - \nabla_{\lambda} KL(q \mid \mid p) \end{align} $$We can now take the gradient of the first term.
$$ \begin{align} \nabla_{\lambda} \mathbb{E}_{\theta \sim q} [ \log p(y_{1:n} \mid x_{1:n}, \theta) ] &= \nabla_{\lambda} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} [ \log p(y_{1:n} \mid x_{1:n}, \Sigma^{1/2}\varepsilon + \mu ) ] \\ & = n \, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \mathbb{E}_{i \sim \text{unif}\{1, \dots, n\}}[\nabla_{\lambda} \log p(y_{i} \mid x_{i}, \Sigma^{1/2}\varepsilon + \mu ) ]\\ & \approx \frac{n}{m} \sum_{j = 1}^{m} \nabla_{C, \mu} \log p(y_{i_{j}} \mid x_{i_{j}}, C\varepsilon_{j} + \mu) \end{align} $$where $C = \Sigma^{1/2}$, $\varepsilon_{j} \sim \mathcal{N}(0, I)$ and $i_{j} \sim \text{unif}\{1, \dots, n\}$. Which means: we can approximate the derivative of the likelihood term with respect to the variational parameters by sampling $\varepsilon$ from a standard normal and indices uniformly from the dataset, and only then computing an average gradient of the log-likelihood with respect to the parameters.
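A hedged sketch of this estimator for the logistic regression likelihood, with the gradients written out by hand (in practice one would use automatic differentiation); names like `reparam_gradient` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reparam_gradient(X, y, mu, C, m=32, rng=None):
    """Stochastic gradient of E_{w~q}[log p(y_{1:n} | x_{1:n}, w)] w.r.t. (mu, C),
    where q(w) = N(mu, C C^T), using the reparametrization w = mu + C eps."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    grad_mu, grad_C = np.zeros(d), np.zeros((d, d))

    for _ in range(m):
        eps = rng.standard_normal(d)            # eps ~ N(0, I)
        i = rng.integers(n)                     # i ~ unif{1, ..., n}
        w = mu + C @ eps                        # reparametrized sample
        g_w = y[i] * X[i] * sigmoid(-y[i] * (w @ X[i]))  # grad_w log p(y_i | x_i, w)
        grad_mu += g_w                          # chain rule: dw/dmu = I
        grad_C += np.outer(g_w, eps)            # chain rule: dw/dC gives outer product with eps

    return (n / m) * grad_mu, (n / m) * grad_C
```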
Variational inference enables us to find approximations of distributions using highly optimized stochastic optimization techniques. However, a significant drawback is the difficulty in assessing the quality of these approximations.