With variational inference we want to find a good approximation of the posterior distribution from which it is easy to sample. The objective is to approximate the posterior with a simpler distribution, because the exact posterior is usually difficult to compute (in particular its normalization constant).
$$ p(\theta \mid x_{1:n}, y_{1:n}) = \frac{1}{z} p(y_{1:n} \mid \theta, x_{1:n}) p(\theta \mid x_{1:n}) \approx q(\theta \mid \lambda) $$For Bayesian Linear Regression we had high dimensional Gaussians which made inference closed form; in general this is not the case, so we need some kind of approximation.
Laplace approximation
Introduction to the Idea 🟩
Let’s consider $\psi(\theta) = \log p(\theta \mid y_{1:n}, x_{1:n})$ and its Taylor approximation (see Hopital, Taylor, Peano) around the MAP estimate (the mode) $\hat{\theta}$. Then we have
$$ \psi(\theta) \approx \hat{\psi}(\theta) = \psi(\hat{\theta}) + (\theta-\hat{\theta} ) ^{T} \nabla \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta} ) ^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta} ) = \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta} ) ^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta} ) $$We dropped the first-order term because we are expanding around the mode, which is a stationary point, so the gradient vanishes there.
This Taylor approximation looks a lot like the log of a Gaussian density, which for $\mathcal{N}(\mu, \Sigma)$ is equal to $-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)$ up to an additive constant, so we can do some pattern matching and write our $\psi$ function as:
$$ \hat{\psi}(\theta) =\log \mathcal{N}(\theta; \hat{\theta}, -H_{\psi}^{-1}) + const $$Then we choose $q(\theta)= \mathcal{N}(\theta; \hat{\theta}, -H_{\psi}^{-1}) \propto \exp(\hat{\psi}(\theta))$. One can verify that the covariance matrix $-H_{\psi}^{-1}$ is indeed symmetric and positive semi-definite: at a maximum the Hessian is negative semi-definite, so flipping its sign makes it positive semi-definite. The precision matrix of the Laplace approximation is usually denoted $\Lambda = -H_{\psi}$, so the covariance is $\Lambda^{-1}$.
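As a concrete illustration, here is a minimal sketch of the recipe in code; the target $\psi$ is a hypothetical toy log-posterior, and only the two steps matter: find the mode, then invert the negative Hessian there.

```python
import numpy as np
from scipy.optimize import minimize

def psi(theta):
    """Hypothetical unnormalized log-posterior log p(theta | data) + const (toy, non-Gaussian)."""
    return -0.25 * np.sum(theta**4) - 0.5 * np.sum(theta**2)

def numerical_hessian(f, x, eps=1e-4):
    """Finite-difference Hessian of the scalar function f at x."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.zeros(d), np.zeros(d)
            e_i[i], e_j[j] = eps, eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i) - f(x + e_j) + f(x)) / eps**2
    return H

theta_hat = minimize(lambda t: -psi(t), x0=np.zeros(2)).x   # 1) mode = MAP estimate
Sigma = np.linalg.inv(-numerical_hessian(psi, theta_hat))   # 2) covariance = (-H_psi)^{-1}
# The Laplace approximation is q(theta) = N(theta_hat, Sigma).
```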
Laplace approximation for a Gaussian
We can easily see that the Laplace approximation of a Gaussian is the Gaussian itself: the Hessian of the log-density is exactly minus the inverse covariance matrix, so the second-order Taylor expansion is exact.
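Spelling this out: for $p(\theta) = \mathcal{N}(\theta; \mu, \Sigma)$ we have
$$ \psi(\theta) = -\frac{1}{2}(\theta - \mu)^{T}\Sigma^{-1}(\theta - \mu) + const, \qquad \hat{\theta} = \mu, \qquad H_{\psi}(\hat{\theta}) = -\Sigma^{-1} \implies q(\theta) = \mathcal{N}(\theta; \mu, \Sigma) = p(\theta) $$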
An analysis of Bayesian Logistic regression
Setting
For Bayesian logistic regression we have that
$$ p(y_{1:n} \mid w, x_{1:n}) = \prod_{i = 1}^{N} \sigma(y_{i}w^{T}x_{i}) $$And that $p(w) = \mathcal{N}(w; 0, \sigma^{2}_{p}I)$, with labels $y_{i} \in \{-1, +1\}$.
We use this probability because it’s natural to say
$$ p(y_{i} \mid w, x_{i}) = \begin{cases} \sigma(w^{T}x_{i}) & \text{ if } y_{i} = +1 \\ 1 - \sigma(w^{T}x_{i}) & \text{ if } y_{i} = -1 \end{cases} \implies p(y_{i} \mid w, x_{i}) = \sigma(y_{i}w^{T}x_{i}) $$Thanks to the property of the sigmoid function $1 - \sigma(z) = \sigma(-z)$.
MAP Weight update
We then see that in order to obtain the proxy distribution for $w$ we need the mode and the Hessian of the log posterior. For the mode we have (omitting the conditioning on $x$ for brevity):
$$ \hat{w} = \arg \max_{w} p(w \mid y_{1:n}) = \arg \max_{w} \log p(w) + \log p(y_{1:n} \mid w) = \arg \max_{w} -\frac{1}{2\sigma^{2}_{p}} w^{T}w + \sum_{i = 1}^{N} \log \sigma(y_{i}w^{T}x_{i}) $$
Continuing:
$$ = \arg \min_{w} \frac{1}{2\sigma^{2}_{p}}w^{T}w + \sum_{i = 1}^{N} \log(1 + \exp(- y_{i}w^{T}x_{i})) \implies \frac{ \partial \left(-\log p(y_{1:n} \mid w)\right) }{ \partial w } = -\sum_{i = 1}^{N}y_{i}x_{i}\, \sigma( - y_{i}w^{T}x_{i}) $$
We can interpret $\sigma(-y_{i}w^{T}x_{i})$ as the probability that the model gets the $i$-th point wrong. The objective can then be optimized by stochastic gradient descent with updates of the form
$$ w \leftarrow w(1 - 2c\eta) + \eta y_{i}x_{i}\sigma(-y_{i}w^{T}x_{i}) $$
where $\eta$ is the learning rate, $(x_{i}, y_{i})$ a randomly sampled data point, and $c = \frac{1}{2\sigma^{2}_{p}}$ the regularization constant coming from the prior.
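A minimal sketch of this MAP update in code, on a hypothetical toy dataset with labels in $\{-1, +1\}$ (with $c = \frac{1}{2\sigma_p^{2}}$ and learning rate $\eta$ as above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))                                              # toy inputs
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=n))   # toy labels in {-1, +1}

sigma_p, eta = 1.0, 0.1
c = 1.0 / (2 * sigma_p**2)                                               # regularization from the prior

w = np.zeros(d)
for step in range(2000):
    i = rng.integers(n)                                                  # sample one data point
    # w <- w(1 - 2*c*eta) + eta * y_i * x_i * sigma(-y_i w^T x_i)
    w = w * (1 - 2 * c * eta) + eta * y[i] * X[i] * sigmoid(-y[i] * (w @ X[i]))
```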
Laplace approximation for Bayesian logistic regression 🟨+
Then for the precision matrix we have (computation as exercise):
$$ \Lambda = -\nabla_{w} \nabla_{w} \log p(w \mid y_{1:n}, x_{1:n}) = X^{T}\text{diag}_{i \in [n]} \left\{ \pi_{i} (1 - \pi_{i}) \right\} X + \sigma^{-2}_{p} I $$With $\pi_{i} = \sigma(w^{T}x_{i})$. There is a nice interpretation of this precision matrix: if the model is very confident about a data point ($\pi_{i}$ close to $0$ or $1$), then $\pi_{i}(1 - \pi_{i})$ is small and that point contributes little precision, while uncertain points ($\pi_{i} \approx \frac{1}{2}$) contribute the most.
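A short sketch of this computation; the argument names `X`, `w_hat`, `sigma_p` are placeholders for the design matrix, the MAP weights, and the prior standard deviation.

```python
import numpy as np

def laplace_precision(X, w_hat, sigma_p):
    """Precision of the Laplace approximation: X^T diag(pi_i (1 - pi_i)) X + I / sigma_p^2."""
    pi = 1.0 / (1.0 + np.exp(-X @ w_hat))                 # pi_i = sigma(w_hat^T x_i)
    return X.T @ (X * (pi * (1 - pi))[:, None]) + np.eye(X.shape[1]) / sigma_p**2

# The Gaussian approximation of the posterior is then q(w) = N(w_hat, inv(Lambda)).
```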
Downsides of Laplace approximation 🟩
The Laplace approximation can be overconfident, in the same way linear regression is for some samples. This is because it builds the approximation greedily around the mode, using only the local curvature there. So the fit may be fine along one direction and quite bad along others.
Prediction with Variational Posterior 🟥
Now that we have an approximation of the posterior, it should be far easier to make predictions!
$$ p(y^{*} \mid x^{*}, x_{1:n}, y_{1:n}) = \int p(y^{*} \mid x^{*}, \theta) p(\theta \mid x_{1:n}, y_{1:n}) \, d\theta \approx \int p(y^{*} \mid x^{*}, \theta) q(\theta) \, d\theta = \mathbb{E}_{\theta \sim q} [p(y^{*} \mid x^{*}, \theta)] $$The last expectation can be approximated with Monte Carlo methods:
$$ \mathbb{E}_{\theta \sim q} [p(y^{*} \mid x^{*}, \theta)] \approx \frac{1}{m} \sum_{j = 1}^{m} p(y^{*} \mid x^{*}, \theta^{(j)}), \quad \theta^{(j)} \sim q $$
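As a sketch, assuming we already have the mean `mu` and covariance `Sigma` of the approximate posterior $q$ (e.g. from the Laplace approximation above), the Monte Carlo prediction for logistic regression looks like this:

```python
import numpy as np

def predict_mc(x_star, mu, Sigma, m=1000, seed=0):
    """Estimate P(y* = +1 | x*) = E_q[sigma(w^T x*)] by averaging over m samples w ~ N(mu, Sigma)."""
    rng = np.random.default_rng(seed)
    w = rng.multivariate_normal(mu, Sigma, size=m)        # theta^(j) ~ q
    return np.mean(1.0 / (1.0 + np.exp(-(w @ x_star))))
```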
Alternatively, we can do a change of variables with $f = \theta^{T}x^{*}$ and estimate the resulting one-dimensional integral with numerical methods basically to machine precision:
$$ \int p(y^{*} \mid x^{*}, \theta) q(\theta) \, d\theta= \int \sigma (y^{*} \cdot \theta^{T}x^{*})q(\theta) \, d\theta = \int \sigma (y^{*} \cdot f)q(f) \, df = \int \sigma(y^{*} f) \mathcal{N}(f; x^{*T}\mu, x^{*T}\Sigma x^{*}) \, df $$
The variational approach
With this approach we define a family of distributions $\mathcal{Q}$, called the variational family, and then try to find the best member within this family to approximate our target $p$.
For example we can define the variational family of Gaussians as
$$ \mathcal{Q} = \left\{ q(\theta \mid \lambda) = \mathcal{N}(\theta; \mu, \Sigma) \right\} $$Where the parameters $\lambda = (\mu, \Sigma)$ are the mean and the covariance matrix of the Gaussian.
Reverse and Forward KL 🟩
So we optimize one of two objectives. Reverse KL: it tends to greedily select one mode and underestimate the variance, which in this case leads to overconfident predictions.
$$ q^{*} \in \arg \min_{q \in Q} KL(q \mid \mid p) $$
Or the Forward KL: this is usually preferable, since it tends to cover all of the posterior's mass (overestimating the variance), but it is usually more difficult to compute.
$$ q^{*} \in \arg \min_{q \in Q} KL(p \mid \mid q) $$See Kullback-Leibler divergence in Entropy.
KL of Gaussians 🟩
Understanding the properties of the KL divergence between Gaussian distributions is crucial for grasping variational inference. I strongly encourage you to review Gaussians#Information Theoretic Properties closely; we will probably use that result in the observations below.
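For reference, for two $d$-dimensional Gaussians $\mathcal{N}(\mu_{0}, \Sigma_{0})$ and $\mathcal{N}(\mu_{1}, \Sigma_{1})$ the divergence has the closed form
$$ KL(\mathcal{N}(\mu_{0}, \Sigma_{0}) \mid \mid \mathcal{N}(\mu_{1}, \Sigma_{1})) = \frac{1}{2}\left( \text{tr}(\Sigma_{1}^{-1}\Sigma_{0}) + (\mu_{1} - \mu_{0})^{T}\Sigma_{1}^{-1}(\mu_{1} - \mu_{0}) - d + \log \frac{\det \Sigma_{1}}{\det \Sigma_{0}} \right) $$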
Minimizing Forward KL Divergence
TODO: this is an add-on, it’s quite complicated stuff.
We can prove the following:
$$ \begin{align} \arg\min_{\lambda} KL(p \mid \mid q_{\lambda}) & = \arg\min_{\lambda} H[p \mid \mid q_{\lambda}] - H[p] \\ & = \arg\min_{\lambda} \mathbb{E}_{\theta \sim p} [-\log q_{\lambda}(\theta)] - \text{const} \\ & \approx \arg\min_{\lambda} -\frac{1}{m} \sum_{j = 1}^{m} \log q_{\lambda}(\theta^{(j)}), \quad \theta^{(j)} \sim p \end{align} $$
The problem is that this is often infeasible, since it requires samples from the true distribution $p$, which we cannot easily obtain.
This tells us that any maximum likelihood estimate $q_{\lambda}$ minimizes the forward KL-divergence to the empirical data distribution.
Moment Matching
We can prove that if $q$ is Gaussian and minimizes the forward KL divergence, then its first two moments match those of $p$. This is a good exercise. TODO.
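A quick sketch of the Gaussian case: writing out the cross-entropy for $q_{\lambda} = \mathcal{N}(\mu, \Sigma)$ and setting its gradients to zero gives
$$ \begin{align} \mathbb{E}_{\theta \sim p}[-\log q_{\lambda}(\theta)] &= \frac{1}{2}\log \det(2\pi\Sigma) + \frac{1}{2}\mathbb{E}_{\theta \sim p}\left[(\theta - \mu)^{T}\Sigma^{-1}(\theta - \mu)\right] \\ \nabla_{\mu} = 0 &\implies \mu = \mathbb{E}_{p}[\theta] \\ \nabla_{\Sigma} = 0 &\implies \Sigma = \mathbb{E}_{p}\left[(\theta - \mu)(\theta - \mu)^{T}\right] = \text{Cov}_{p}[\theta] \end{align} $$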
Minimizing Reverse KL Divergence 🟩-
We need to compute the value
$$ \begin{align} q^{*}_{\lambda} &= \arg\min_{q \in Q} KL(q \mid \mid p(\cdot \mid y)) \\ &= \arg\min_{q \in Q} \mathbb{E}_{\theta \sim q} \left[ \log \frac{q(\theta)}{p(\theta \mid y)} \right] \\ &= \arg\min_{q \in Q} - H(q) - \mathbb{E}_{\theta \sim q}[\log p(\theta \mid y)] \\ &= \arg\max_{q \in Q} H(q) + \mathbb{E}_{\theta \sim q}[\log p(\theta \mid y)] \\ &= \arg\max_{q \in Q} H(q) + \mathbb{E}_{\theta \sim q}[\log p(\theta)] + \mathbb{E}_{\theta \sim q} [\log p (y \mid \theta)] + \text{const} \\ \implies q^{*}_{\lambda} &= \arg\max_{q \in Q} \mathbb{E}_{\theta \sim q} [\log p(y \mid \theta)] - KL(q \mid \mid p(\theta)) = \text{ELBO} \end{align} $$ Note that the final expression only involves the prior and the likelihood, not the intractable posterior, so it is something we can actually compute and optimize: this is the ELBO.
Evidence Lower Bound 🟨
We first encountered this value in Autoencoders when analyzing variational auto-encoders. The ELBO can also be defined in other ways; for a given dataset $D_{n}$, for instance, another formulation is
$$ L(q, p; D_{n}) = \log p(y_{1:n} \mid x_{1:n}) - KL(q \mid \mid p(\cdot \mid x_{1:n}, y_{1:n})) $$The evidence is defined as $\log p(y_{1:n})$. Using the above, we can derive a lower bound on it:
$$ \begin{align} \log p(y_{1:n}) &= \log \int p(y_{1:n} \mid \theta) p(\theta) \, d\theta \\ &= \log \int p(y_{1:n} \mid \theta) p(\theta) \frac{q(\theta)}{q(\theta)} \, d\theta \\ \text{(using Jensen)} &\geq \mathbb{E}_{\theta \sim q} \left[ \log \left( p(y_{1:n} \mid \theta) \frac{p(\theta)}{q(\theta)} \right) \right] \\ &= \mathbb{E}_{\theta \sim q} \left[ \log p(y_{1:n} \mid \theta) \right] - KL(q \mid \mid p) = \text{ELBO} \end{align} $$ We observe that this is the same quantity we obtained before when minimizing the reverse KL divergence. Thus, finding the best approximation is equivalent to maximizing a lower bound on the evidence: essentially, it amounts to finding the approximate posterior that best explains the evidence.
This indicates that maximizing the evidence lower bound is an adequate method of model selection which can be used instead of maximizing the evidence (marginal likelihood) directly.
ELBO of logistic regression 🟥
Suppose we have a classic logistic regression with prior on the weights $\mathcal{N}(0, \sigma^{2}_{p}I)$ and likelihood $\prod_{i = 1}^{N} \sigma(y_{i}w^{T}x_{i})$. Let the variational distribution over the weights be the Gaussian $\mathcal{N}(\mu, \sigma^{2}_{n}I)$. We now want to find the ELBO for this model.
TODO
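As a sketch of what the answer should look like (with $w \in \mathbb{R}^{d}$, plugging in the closed-form KL between the two isotropic Gaussians):
$$ L(\mu, \sigma_{n}) = \mathbb{E}_{w \sim \mathcal{N}(\mu, \sigma_{n}^{2}I)}\left[\sum_{i = 1}^{N} \log \sigma(y_{i}w^{T}x_{i})\right] - \frac{1}{2}\left( \frac{d\,\sigma_{n}^{2} + \lVert \mu \rVert^{2}}{\sigma_{p}^{2}} - d + d \log \frac{\sigma_{p}^{2}}{\sigma_{n}^{2}} \right) $$The remaining expectation has no closed form and is typically estimated by sampling, e.g. with the reparametrization trick discussed below.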
Gradient of ELBO 🟥++
We can compute the gradient of the ELBO with respect to the variational parameters $\lambda$. The first term (the expected log-likelihood) is difficult to differentiate directly, so we can use the so-called reparametrization trick. Other approaches go by the names of score gradients or Monte Carlo gradients.
Suppose we have a random variable $\varepsilon \sim \phi$ and that $\theta = g(\varepsilon, \lambda)$ for an invertible function $g$; then we can rewrite the expectation as
$$ \mathbb{E}_{\theta \sim q} [f(\theta)] = \mathbb{E}_{\varepsilon \sim \phi} [f(g(\varepsilon, \lambda))] $$So now we can sample from a known, fixed distribution and then compute the gradient of the expectation; this allows us to compute stochastic gradients. If you know the ideas of Volume, then the change of variables formula becomes very clean.
Let’s work out an example. Suppose $q$ is a Gaussian $\mathcal{N}(\mu, \Sigma)$; then we can reparametrize it with $\theta = \mu + \Sigma^{1/2} \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, I)$ is easy to sample from.
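A tiny sketch of this sampling step (toy $\mu$ and $\Sigma$; the Cholesky factor is one valid choice of $\Sigma^{1/2}$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
L = np.linalg.cholesky(Sigma)      # one valid "square root" of Sigma
eps = rng.standard_normal(2)       # eps ~ N(0, I), independent of lambda = (mu, Sigma)
theta = mu + L @ eps               # theta ~ N(mu, Sigma)
```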
The ELBO gradient then becomes
$$ \begin{align} \nabla_{\lambda} L(\lambda) &= \nabla_{\lambda} \left( \mathbb{E}_{\theta \sim q} [ \log p(y \mid x, \theta) ] - KL(q \mid \mid p) \right) \\ &= \nabla_{\lambda} \mathbb{E}_{\varepsilon \sim \phi} [ \log p(y \mid x, \Sigma^{1/2}\varepsilon + \mu ) ] - \nabla_{\lambda} KL(q \mid \mid p) \end{align} $$We can now take the gradient of the first term (for a Gaussian prior, the KL term is available in closed form).
$$ \begin{align} \nabla_{\lambda} \mathbb{E}_{\varepsilon \sim \phi} [ \log p(y \mid x, \Sigma^{1/2}\varepsilon + \mu ) ] & = \mathbb{E}_{\varepsilon \sim \phi} [ \nabla_{C, \mu} \log p(y \mid x, C\varepsilon + \mu ) ] \\ & \approx \frac{n}{m} \sum_{j = 1}^{m} \nabla_{C, \mu} \log p(y_{i_{j}} \mid x_{i_{j}}, C\varepsilon_{j} + \mu) \end{align} $$with $C = \Sigma^{1/2}$. Since $\phi$ does not depend on $\lambda = (\mu, C)$ we can exchange gradient and expectation, and then apply a Monte Carlo approximation over both the samples $\varepsilon_{j} \sim \phi$ and a random mini-batch of data points (hence the factor $\frac{n}{m}$).
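Putting everything together, here is a minimal, self-contained sketch of reparametrized stochastic variational inference for Bayesian logistic regression, assuming hypothetical toy data, a mean-field Gaussian $q(w) = \mathcal{N}(\mu, \text{diag}(\text{std}^{2}))$, and PyTorch autograd standing in for the hand-derived gradients:

```python
import math
import torch

torch.manual_seed(0)
n, d = 100, 3
X = torch.randn(n, d)                                  # toy inputs
y = torch.randint(0, 2, (n,)).float() * 2 - 1          # toy labels in {-1, +1}
sigma_p = 1.0                                          # prior std (assumed)

mu = torch.zeros(d, requires_grad=True)                # variational mean
log_std = torch.zeros(d, requires_grad=True)           # variational log std (diagonal)

def neg_elbo(m=8):
    """Monte Carlo estimate of -ELBO using the reparametrization w = mu + std * eps."""
    std = log_std.exp()
    eps = torch.randn(m, d)                            # eps ~ N(0, I)
    w = mu + std * eps                                 # m reparametrized samples from q
    logits = (X @ w.T) * y.unsqueeze(1)                # y_i * w_j^T x_i, shape (n, m)
    exp_loglik = torch.nn.functional.logsigmoid(logits).sum(0).mean()
    # KL(q || p) between diagonal Gaussians, in closed form
    kl = 0.5 * ((std**2 + mu**2) / sigma_p**2 - 1.0
                - 2.0 * (log_std - math.log(sigma_p))).sum()
    return kl - exp_loglik

opt = torch.optim.Adam([mu, log_std], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = neg_elbo()
    loss.backward()                                    # stochastic reparametrization gradients
    opt.step()
```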
Variational inference enables us to find approximations of distributions using highly optimized stochastic optimization techniques. However, a significant drawback is the difficulty in assessing the quality of these approximations.