Gaussian processes can be viewed through a Bayesian lens on function space: rather than sampling individual data points, we are now sampling entire functions. They extend the idea of Bayesian linear regression by introducing an infinite number of feature functions for the input $x$.

In geostatistics, Gaussian processes are referred to as kriging, and many other models, such as Kalman filters or radial basis function networks, can be understood as special cases of Gaussian processes. In this framework, certain functions are more likely than others, and we aim to model this probability distribution.

Gaussian processes retain epistemic uncertainty, meaning we are still modeling what we don’t know about the function, while remaining tractable. Essentially, they provide a normal distribution over functions.

If we index this infinite-dimensional Gaussian at a finite set of points, then by the marginalization property of Gaussians we again obtain a Gaussian. We now need a way to parametrize it; in this case we do it using a covariance function $k(x, x') = \text{Cov} (f(x), f(x'))$.


Formal definition of Gaussian Process

A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.

The informal idea is that we use the joint distribution, determined by our training dataset, to do inference on new points.

A Gaussian Process is an (infinite) set of random variables indexed by some set $X$ (the domain of the function): for each $x \in X$ there is a random variable $f_{x}$ such that for every finite subset $A = \left\{ x_{1}, \dots, x_{m} \right\} \subseteq X$ it holds that $f_{A} = [f_{x_{1}}, \dots, f_{x_{m}}] \sim \mathcal{N}(\mu_{A}, K_{AA})$. This is the noise-free setting; to obtain the observed value $y$, just add Gaussian noise with some variance. Note that GPs are very sensitive to adversarial data: every observation enters the prediction and conditions the mean and covariance of the posterior. Here

$$ K_{A A} = \begin{pmatrix} k(x_{1}, x_{1}) & \dots & k(x_{1}, x_{m}) \\ \vdots & & \vdots \\ k(x_{m}, x_{1}) & \dots & k(x_{m}, x_{m}) \end{pmatrix} \text{ and } \mu_A = \begin{pmatrix} \mu(x_{1}) \\ \vdots \\ \mu(x_{m}) \end{pmatrix} $$

$k : X \times X \to \mathbb{R}$ is called the covariance (kernel) function and $\mu : X \to \mathbb{R}$ is called the mean function. We write $f \sim GP(\mu, k)$ and define $f(x) := f_{x}$. See Kernel Methods for more about kernels.

Inference for Gaussian Processes 🟩

Inference is easy if you understand the conditional Gaussian distribution, see Gaussians. The old points are already given; you just need to condition the Gaussian on those previous data points.

Here is the setting, which is very similar to inference in the function-space view of Bayesian Linear Regression:

$$ \bar{f} = \begin{bmatrix} y \\ f^{*} \end{bmatrix}, \quad \bar{\Sigma} =\begin{bmatrix} \Sigma_{AA} + \sigma_{p}^{2}I & k(x^{*}, A)^{T} \\ k(x^{*}, A) & k(x^{*}, x^{*}) \end{bmatrix}, \quad \bar{\mu} = \begin{bmatrix} \mu_{A} \\ \mu_{f^{*}} \end{bmatrix} $$

We just use the conditional Gaussian to obtain our target distribution $f^{*} \mid y$, which is Gaussian with:

$$ \begin{cases} \mu = \mu_{f^{*}} + k(x^{*}, A)(\Sigma_{AA} + \sigma^{2}_{p}I)^{-1}(y - \mu_{A}) \\ \Sigma = k(x^{*}, x^{*}) - k(x^{*}, A)(\Sigma_{AA} + \sigma_{p}^{2}I)^{-1}k(x^{*}, A)^{T} \end{cases} $$

Then you can just sample from this distribution, which we address in the next section. Here $k(x^{*}, A) = \left[ k(x^{*}, x_{1}), \dots, k(x^{*}, x_{n}) \right]$.

If we have only the GP prior, then the mean $\mu_{f^{*}} = 0$. This is often a convention, because the prior mean is only used to compute the posterior mean and can easily be added back in. One note: the posterior mean depends on the observations $y$, while the posterior variance does not.
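
As a minimal sketch of these conditional formulas, here is a numpy implementation assuming a zero prior mean and an RBF kernel (both are illustrative assumptions, as are the lengthscale, variance, and noise defaults):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.1):
    """Posterior mean and covariance at X_test, assuming a zero prior mean."""
    K = rbf_kernel(X_train, X_train)              # Sigma_AA
    K_s = rbf_kernel(X_test, X_train)             # k(x*, A)
    K_ss = rbf_kernel(X_test, X_test)             # k(x*, x*)
    K_y = K + noise_var * np.eye(len(X_train))    # Sigma_AA + sigma_p^2 I
    # Cholesky solve instead of an explicit inverse, for numerical stability.
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # (Sigma_AA + sigma_p^2 I)^{-1} (y - 0)
    mu = K_s @ alpha                                            # posterior mean
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss - v.T @ v                                        # posterior covariance
    return mu, cov
```

For a nonzero prior mean one would subtract $\mu_{A}$ from `y_train` and add $\mu_{f^{*}}$ back onto `mu`, exactly as in the formula above.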

Sampling for Gaussian Processes

Standard Gaussian Sampling 🟩–

One simple solution is to sample using $X = \Sigma^{1/2}\varepsilon + \mu$ with $\varepsilon \sim \mathcal{N}(0, I)$, but taking the matrix square root of $\Sigma$ costs $\mathcal{O}(n^{3})$ time. This samples all needed points in one go. It is possible because, by assumption, the GP is jointly Gaussian, meaning:

$$ f_{1:n} \sim \mathcal{N}(\mu_{f}, K_{f}) $$
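
A small sketch of this joint sampling step, using a Cholesky factor as the matrix square root (the jitter term is an extra assumption, added only for numerical stability):

```python
import numpy as np

def sample_gp_joint(mu, K, n_samples=5, jitter=1e-9):
    """Draw joint samples f_{1:n} ~ N(mu, K) in one shot via a Cholesky factor."""
    n = len(mu)
    L = np.linalg.cholesky(K + jitter * np.eye(n))  # L L^T = K, plays the role of Sigma^{1/2}
    eps = np.random.randn(n, n_samples)             # eps ~ N(0, I)
    return mu[:, None] + L @ eps                    # X = Sigma^{1/2} eps + mu, one O(n^3) factorization
```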

Forward Sampling 🟩

Another approach, called forward sampling, samples the variables one by one, each conditioned on the previously sampled ones:

$$ \begin{array}{l} f_{1} \sim p(f_{1}) \\ f_{2} \sim p(f_{2} \mid f_{1}) \\ \vdots \\ f_{n} \sim p(f_{n} \mid f_{1:n-1}) \end{array} $$

But in this case we need to compute a matrix inverse for each conditional, which also takes $\mathcal{O}(n^{3})$ overall.
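
A naive sketch of forward sampling, applying the Gaussian conditioning rule at each step (recomputing the inverse every time, which is exactly where the cubic cost comes from):

```python
import numpy as np

def forward_sample(mu, K):
    """Sample f_1, ..., f_n sequentially from p(f_i | f_{1:i-1})."""
    n = len(mu)
    f = np.zeros(n)
    for i in range(n):
        if i == 0:
            cond_mean, cond_var = mu[0], K[0, 0]
        else:
            K_prev_inv = np.linalg.inv(K[:i, :i])     # naive inverse at every step
            k_vec = K[i, :i]
            cond_mean = mu[i] + k_vec @ K_prev_inv @ (f[:i] - mu[:i])
            cond_var = K[i, i] - k_vec @ K_prev_inv @ k_vec
        f[i] = np.random.normal(cond_mean, np.sqrt(max(cond_var, 0.0)))
    return f
```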

Learning Gaussian Processes

Kernel optimization 🟨–

It may also be useful to look at Kernel Methods. We want to optimize the following objective:

$$
\begin{split}
\hat{\theta}_{MLE} &= \arg \max_{\theta} p(y_{\text{train}} \mid x_{\text{train}}, \theta) = \arg \max_{\theta} \int p(y_{\text{train}} \mid f, x_{\text{train}}, \theta)\, p(f \mid \theta) \, df \\
&= \arg \max_{\theta} \int p(y_{1:n} \mid f_{1:n}, x_{1:n}, \theta)\, p(f_{1:n} \mid \theta) \, df \\
&= \arg \max_{\theta} \int \prod_{i = 1}^{n} \mathcal{N}(y_{i} ; f_{i}, \sigma^{2}_{n})\, \mathcal{N}(f_{1:n} ; 0, K_{f}(\theta)) \, df \\
&= \arg \max_{\theta} \mathcal{N}(y_{1:n} ; 0, K_{y}(\theta)) \\
&= \arg \max_{\theta} \frac{1}{\sqrt{ (2\pi)^{n} \lvert K_{y}(\theta) \rvert }}\exp\left( - \frac{1}{2} y_{1:n}^{T} K_{y}^{-1}(\theta)\, y_{1:n} \right)
\end{split}
$$

Now let's take the negative log and we get:

$$
\begin{align}
\hat{\theta}_{MLE} &= \arg \min_{\theta} - \log \mathcal{N}(y_{1:n}; 0, K_{y}(\theta)) \\
&= \arg \min_{\theta} \frac{n}{2} \log 2\pi + \frac{1}{2} \log \lvert K_{y}(\theta) \rvert + \frac{1}{2} y_{1:n}^{T} K_{y}^{-1}(\theta)\, y_{1:n} \\
&= \arg \min_{\theta} \frac{1}{2} \log \lvert K_{y}(\theta) \rvert + \frac{1}{2} y_{1:n}^{T} K_{y}^{-1}(\theta)\, y_{1:n}
\end{align}
$$

Where

$$
K_{f}(\theta) = K_{f_{1:n}}(\theta) = \begin{bmatrix} k_{\theta}(x_{1}, x_{1}) & \dots & k_{\theta}(x_{1}, x_{n}) \\ \vdots & & \vdots \\ k_{\theta}(x_{n}, x_{1}) & \dots & k_{\theta}(x_{n}, x_{n}) \end{bmatrix}
$$

and

$$
K_{y}(\theta) = K_{f}(\theta) + \sigma^{2}_{n} I
$$

One can also prove that the gradient of the log marginal likelihood is

$$ \frac{ \partial \log p(y_{1:n} \mid x_{1:n}, \theta) }{ \partial \theta_{j} } = \frac{1}{2} tr\left( (\alpha \alpha^{T} - K^{-1}_{y, \theta}) \frac{ \partial K_{y, \theta} }{ \partial \theta_{j}} \right) $$

with $\alpha = K_{y, \theta}^{-1}y$. Now let's take some time to analyze what we got from this derivation: the quadratic form can be interpreted as the likelihood, i.e. the goodness of the fit, while the log determinant of the kernel matrix measures the volume of possible predictions, which is kind of a prior (a complexity penalty). Minimizing the above loss with gradient methods or similar gives us the best kernel parameters.

We can also obtain a MAP estimate if we place a prior $p(\theta)$ on the kernel parameters.
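
A sketch of the MLE case above (a prior $p(\theta)$ would simply add a $-\log p(\theta)$ term to the loss), assuming an RBF kernel with log-parametrized lengthscale, signal variance, and noise variance; the parametrization and the use of `scipy.optimize.minimize` are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    """Negative log marginal likelihood -log N(y_{1:n}; 0, K_y(theta)) for an RBF kernel.
    log_params = (log lengthscale, log signal variance, log noise variance)."""
    lengthscale, signal_var, noise_var = np.exp(log_params)
    sq_dists = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    K_f = signal_var * np.exp(-0.5 * sq_dists / lengthscale**2)   # K_f(theta)
    K_y = K_f + noise_var * np.eye(len(X))                        # K_y(theta) = K_f(theta) + sigma_n^2 I
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))           # K_y^{-1} y
    # 0.5 * y^T K_y^{-1} y + 0.5 * log|K_y|   (constant n/2 log 2*pi dropped)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L)))

# Hypothetical usage: fit the kernel hyperparameters by gradient-based minimization.
# result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X_train, y_train))
```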

Hyper-priors 🟨+

Another way to optimize is to use hyper-priors. Basically, we assume the priors have parameters too, and we try to optimize over those parameters as well. A good way to choose kernels is usually posterior agreement: given the data, we would like to make the probability of that data as high as possible. This is very important, but I have not understood it well.

Approximating GPs

Learning a Gaussian Process is usually quite slow. We will introduce some local methods for faster approximation, and a technique introduced by Rahimi and Recht (NeurIPS 2007), which is a general algorithm for approximating stationary kernels.

Local methods 🟩

One way is to only use points that satisfy $\lvert k(x, x') \rvert \geq \tau$; the idea is that points that are far away usually do not have a great influence. This is still expensive if too many points are close by. Basically, we restrict ourselves to a subset of the covariance matrix, as most other entries will be set to $0$. This method essentially cuts off the tails of the distribution.
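
A minimal sketch of this idea for a single test point, assuming we simply drop training points whose covariance with the test point falls below the threshold $\tau$ (the function name, defaults, and the generic `kernel` argument are illustrative):

```python
import numpy as np

def local_gp_mean(x_star, X_train, y_train, kernel, noise_var=0.1, tau=1e-3):
    """Posterior mean at x_star, keeping only training points with |k(x_star, x_i)| >= tau."""
    k_star = kernel(x_star[None, :], X_train).ravel()   # k(x*, A)
    keep = np.abs(k_star) >= tau                        # local neighbourhood: drop far-away points
    K_local = kernel(X_train[keep], X_train[keep])
    K_y = K_local + noise_var * np.eye(int(keep.sum()))
    return k_star[keep] @ np.linalg.solve(K_y, y_train[keep])
```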

Kernel function approximation 🟥+

This technique is quite hard to understand well. Let's now consider an advanced technique, kernel function approximation: we want to reduce GPs to linear regression, so that the cost becomes $\mathcal{O}(nm^{2} + m^{3})$ instead of $\mathcal{O}(n^{3})$.

Consider a shift-invariant (stationary) kernel; these kernels have a Fourier transform, see Fourier Series.

$$ k(x, x') = k(x - x') = \int _{\mathbb{R}^{d}} p(\omega) e^{j\omega^{T}(x - x')} \, d\omega = \mathbb{E}_{\omega, b}[z_{\omega, b}(x) \cdot z_{\omega, b}(x')] $$

Where $\omega \sim p(\omega)$, $b \sim U([0, 2\pi])$, and we define $z_{\omega, b}(x) = \sqrt{ 2 } \cos(\omega^{T}x + b)$. We are reinterpreting the integral as an expectation; one can show that the two are equivalent.

Then we can approximate this expectation as follows:

$$ \mathbb{E}_{\omega, b}[z_{\omega, b}(x) \cdot z_{\omega, b}(x')] \approx \frac{1}{m} \sum_{i = 1}^{m} z_{\omega_{i}, b_{i}}(x) \cdot z_{\omega_{i}, b_{i}}(x') = z(x)^{T}z(x') $$

Where $z(x) = \frac{1}{\sqrt{ m }}[z_{\omega_{1}, b_{1}}(x), \dots, z_{\omega_{m}, b_{m}}(x)]$. Now we have an explicit random feature map, where $m$ acts as a parameter controlling the quality of the approximation.

One can also do some error analysis, telling you how close the approximation is to the original kernel. The theoretical result is that this approximates the kernel function uniformly well; uniform convergence is used in this setting, which we don't know exactly what it is.
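
A short sketch of the random feature construction for the RBF kernel, whose spectral density $p(\omega)$ is Gaussian (the lengthscale and seed defaults are illustrative placeholders):

```python
import numpy as np

def random_fourier_features(X, m, lengthscale=1.0, seed=0):
    """Feature map z(x) = sqrt(2/m) * cos(W x + b) approximating the RBF kernel.
    For k(x - x') = exp(-||x - x'||^2 / (2 l^2)), the spectral density p(omega) is N(0, I / l^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(m, d))   # omega_i ~ p(omega)
    b = rng.uniform(0.0, 2 * np.pi, size=m)               # b_i ~ U([0, 2*pi])
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)          # z(x) for every row of X, shape (n, m)

# k(x, x') is then approximated by z(x) @ z(x'), and GP regression reduces to
# Bayesian linear regression on these m features: O(n m^2 + m^3) instead of O(n^3).
```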

Inducing points 🟥

Introduced in Quinonero-Candela and Rasmussen (2005), the idea is to summarize the training set into a smaller set of inducing points, and then use those points for inference.
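
As one concrete member of this family, here is a sketch of the Subset-of-Regressors predictive mean; the choice of inducing points `X_ind`, the kernel, and the noise level are all assumptions for illustration, and how to pick the inducing points is a separate question:

```python
import numpy as np

def sor_predictive_mean(X_train, y_train, X_ind, X_test, kernel, noise_var=0.1):
    """Subset-of-Regressors predictive mean using m inducing points X_ind."""
    K_un = kernel(X_ind, X_train)    # K_{u,n}: inducing points vs. training points
    K_uu = kernel(X_ind, X_ind)      # K_{u,u}
    K_su = kernel(X_test, X_ind)     # k(x*, u)
    Sigma = np.linalg.inv(K_uu + K_un @ K_un.T / noise_var)
    return K_su @ Sigma @ (K_un @ y_train) / noise_var
```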