Gaussian processes can be viewed through a Bayesian lens of the function space: rather than sampling over individual data points, we are now sampling over entire functions. They extend the idea of bayesian linear regression by introducing an infinite number of feature functions for the input XXX.

In geostatistics, Gaussian processes are referred to as kriging regressions, and many other models, such as Kalman Filters or radial basis function networks, can be understood as special cases of Gaussian processes. In this framework, certain functions are more likely than others, and we aim to model this probability distribution.

Gaussian processes retain epistemic uncertainty, meaning we are still modeling what we don’t know about the function, while remaining tractable. Essentially, they provide a normal distribution over functions.

If we index this infinite dimensional Gaussian, thanks to the properties of Gaussians with marginalization, we still have another Gaussian. We need a way to parametrize now. In this case we do it using a covariance function $k(x, x') = \text{Cov} (f(x), f(x'))$

This seems a quite nice resource on the topic.

Formal definition of Gaussian Process

A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.

The informal ideas is that we use the probability of the joints, given by our training dataset, to do some inference.

A Gaussian Process is a (infinite) set of random variables, indexed by some set $X$ (domain of the function) and for each $x \in X$ there is a random variable $f_{x}$ such that for all $A \subseteq X$ and $A = \left\{ x_{1}, \dots, x_{m} \right\}$ it holds that $f_{A} = [f_{x_{1}}, \dots, f_{x_{m}}] \sim \mathcal{N}(\mu_{A}, K_{AA})$ , this is called noise-free prediction, to have the value of $y$ just add some Gaussians with a noise variance. You can see that GPs are very sensitive to adversarial data, it will be considered in the prediction and will condition the mean and Covariance of the prediction. Where

$$ K_{A A} = \begin{pmatrix} k(x_{1}, x_{1}), \dots, k(x_{1}, k_{m}) \\ \vdots \\ k(x_{m}, x_{1}), \dots, k(x_{m}, k_{m}) \end{pmatrix} \text{ and } \mu_A = \begin{pmatrix} \mu(x_{1}) \\ \vdots \\ \mu(x_{m}) \end{pmatrix} $$

$k : X \times X \to \mathbb{R}$ is called the covariance (kernel) function and $\mu : X \to \mathbb{R}$ is called the mean function. We write $f \sim GP(\mu, k)$ and define $f(x) := f_{x}$ See Kernel Methods for more about kernels.

Inference for Gaussian Processes 🟩

Inference is easy if you understand the conditional Gaussian distribution, see Gaussians. You already have the old points defined, you just need to get the conditioned Gaussian on the previous data points.

Here is the setting, and it’s very similar to the Bayesian Linear Regression (indeed, GP can be seen as aa generalization of BLR) function view inference

$$ \bar{f} = \begin{bmatrix} y \\ f* \end{bmatrix}, \, \bar{\Sigma} =\begin{bmatrix} \Sigma_{AA} + \sigma_{n}^{2}I & k(x^{*}, A)^{T} \\ k(x^{*}, A) & k(x^{*}, x^{*}) \end{bmatrix}, \mu = \begin{bmatrix} \mu_{A} \\ \mu_{f^{*}} \end{bmatrix} $$

We just use the conditional Gaussian to have that our wanted distribution $f^{*} \mid f$ has the following distribution:

$$ \begin{cases} \mu = \mu_{f^{*}} + k(x^{*}, A)^{T}(\Sigma_{AA} + \sigma^{2}_{n}I)^{-1}(A - \mu_{A}) \\ \Sigma = k(x^{*}, x^{*}) - k(x^{*}, A)^{T}(\Sigma_{AA} + \sigma_{n}^{2}I)^{-1}k(x^{*}, A) \end{cases} $$

And then you just sample with this, which we will solve in the next section. Where $k(x^{*}, A) = \left[ k(x^{*}, x_{1}), \dots,k(x^{*}, x_{n}) \right]$

If we have only the GP prior then the mean $\mu_{f^{*}} = 0$, this is often a convention because the prior mean is only used to compute the posterior mean, and can be easily added back in. One note: the posterior mean depends on the observations $y$ while the variance does not.

Sampling for Gaussian Processes

Standard Gaussian Sampling 🟩–

One simple solution is just sampling using $X = \Sigma^{1/2}\varepsilon + \mu$, but taking the square root of a matrix takes $\mathcal{O}(n^{3})$ time. This samples every needed point immediately. This is possible because by assumption we have that the GP is jointly Gaussian, meaning:

$$ \begin{array} \\ f_{1:n} \sim \mathcal{N}(\mu_{f}, K_{f}) \\ \end{array} $$

Forward Sampling 🟩

Another approach, called forward sampling just samples everything one by one. Meaning

$$ \begin{array} \\ f_{1} \sim p(f_{1}) \\ f_{2} \sim p(f_{2} \mid f_{1}) \\ \vdots \\ f_{n} \sim p(f_{n} \mid f_{1:n-1}) \end{array} $$

But we need to compute the matrix inverse in this case, which also takes $\mathcal{O}(n^{3})$.

Learning Gaussians Processes

Kernel optimization 🟨++

Once we have chosen a Kernel, we would like to find the best Hyper-parameters for our problem. Take a look at the Kernel Methods for possible Kernels. We want to optimize the following equation: $$ \begin{align} \ \hat{\theta_{MLE}} &= \arg \max_{\theta} p(y_{\text{train} }\mid x_{\text{train}}, \theta) \

&= \arg \max_{\theta} \int p(y_{\text{train}} \mid f, x_{\text{train}}, \theta)p(f \mid \theta) , df \ &= \arg \max_{\theta} \int p(y_{1:n} \mid f_{1:n}, x_{1:n}, \theta)p(f_{1:n} \mid \theta) , df \ &= \arg \max_{\theta} \int \prod_{i = 1}^{n} \mathcal{N}(y_{i} \mid f_{i}; 0, \sigma^{2}{n} )\mathcal{N}(f{1:n} ; 0, K_{f}(\theta)) df \ &= \arg \max_{\theta} \mathcal{N}(y_{1:n} ; 0, K_{y}(\theta)) \ &= \arg \max_{\theta} \frac{1}{\sqrt{ (2\pi)^{n} \lvert K_{y}(\theta) \rvert }}\exp\left( - \frac{1}{2} y_{1:n}^{T} K^{-1}{y}(\theta) y{1:n} \right) , \end{align}

$$ Now let's take the negative log and we will get: $$

\begin{align} \ \hat{\theta_{MLE}} &= \arg \min_{\theta} - \log \mathcal{N}(y_{1:n}; 0, K_{y}(\theta)) \ &= \arg \min_{\theta} \frac{n}{2} \log 2\pi + \frac{1}{2} \log \lvert K_{y}(\theta) \rvert + \frac{1}{2} y_{1:n}^{T} K^{-1}{y}(\theta) y{1:n} \ &= \arg \min_{\theta} \frac{1}{2} \log \lvert K_{y}(\theta) \rvert + \frac{1}{2} y_{1:n}^{T} K^{-1}{y}(\theta) y{1:n} \end{align}

$$ Where $$

K_{f}(\theta) = K_{f_{1:n}}(\theta) = \begin{bmatrix} k_{\theta}(x_{1}, x_{1}), & \dots & , k_{\theta}(x_{1}, x_{n}) \ \vdots & & \vdots \ k_{\theta}(x_{n}, x_{1}), & \dots & , k_{\theta}(x_{n}, x_{n}) \end{bmatrix}

$$ And $$

K_{y}(\theta) = K_{f}(\theta) + \sigma^{2}_{n}I $$

One can also prove that the gradient of our likelihood function is

$$ \frac{ \partial \log p(y_{1:n} \mid x_{1:n}, \theta) }{ \partial \theta_{j} } = \frac{1}{2} \text{tr}\left( (\alpha \alpha^{T} - K^{-1}_{y, \theta}) \frac{ \partial K_{y, \theta} }{ \partial \theta_{j}} \right) $$

With $\alpha = K_{y, \theta}^{-1}y$, you just need some formulas from Multi Variable Derivatives. Now let’s take some time to analyze what we good after all of this derivation: The quadratic form is interpretable as the likelihood, or the goodness of the fit. While the log of the Kernel matrix is the volume of the possible prediction, which is kinda the prior. Minimizing the above loss with gradient methods or similars can give us the best kernel parameters.

We can also obtain a Map Estimate if we place a prior on the kernel parameters $p(\theta)$.

Hyper-priors 🟩–

Another way to optimize is using hyper-priors. Basically, we assume the priors have parameters too, and we try to optimize over those parameters. Usually a good way to choose kernels is posterior agreement. Given the data, we would like to make the probability of the given data as high as possible. This is very important but I have not understood this well

Approximating GPs

Leaning a Gaussian Process is usually quite slow. We will introduce some local methods for faster approximation, and a technique introduced in Rahimi, Recht NeurIPs 2007, it is a general algorithm for simplifying stationary kernels.

Local methods 🟩

One way is only use points such that satisfy $\lvert k(x, x') \rvert \geq \tau$, the idea is that point that are far away usually have not such a great influence. This is still expensive if there are too many points close. Basically we can restrict ourselves to a subset of the Covariance matrix, as most other points will be set to $0$. This method essentially cuts off the tails of the distribution.

Kernel function approximation 🟥+

This technique is quite hard to understand well. The idea is to reduce GPs to linear regression, then the cost would be $\mathcal{O}(nm^{2} + m^{3})$ instead of $\mathcal{n^{3}}$. One simple way to do it, is estimate the kernel function as some

$$ k(x, x') \approx \phi(x)^{T}\phi(x') $$

Such that $\phi(x) \in \mathbb{R}^{m}$, and $m$ is a much smaller dimensionality (easier to compute). Then it’s just GP with a linear kernel, or a Bayesian linear regression.

Fourier Features 🟥

We make use of Bochner’s theorem which states that a Kernel is a valid Mercer Kernel if its Fourier transform $p(\omega)$ is non-negative (which implies it is a valid probability distribution). When $p$ is a valid function, it is called spectral density.

Consider a shift invariant (stationary kernel), as they can be seen as function in one variable, these kernels have a Fourier transform, see Fourier Series.

$$ k(x, x') = k(x - x') = \int _{\mathbb{R}^{d}} p(\omega) e^{j\omega^{T}(x - x')} \, d\omega = \mathbb{E}_{\omega, b}[z_{\omega, b}(x) \cdot z_{\omega, b}(x')] $$

Where $\omega \sim p(\omega), b \sim U([0, 2\pi])$, and define $z_{\omega, b}(x) = \sqrt{ 2 } \cos(\omega^{T}x + b)$ where we are reinterpreting the integral as an expectation, one can show that these two are equivalent. The important thing to note here is that we have an unbiased approximator for some Kernel, given we have its spectral density.

Then we can approximate this expectation as follows:

$$ \mathbb{E}_{\omega, b}[z_{\omega, b}(x) \cdot z_{\omega, b}(x')] = \frac{1}{m} \sum_{i = 1}^{m} z_{\omega, b}(x) \cdot z_{\omega, b}(x') = z(x)^{T}z(x) $$

Where $z(x) = \frac{1}{\sqrt{ m }}[z_{\omega_{1, b_{1}}}(x), \dots, z_{\omega_{m, b_{m}}}(x)]$ Now we have an explicit random feature map, where $m$ works as a parameter for quality of the approximation.

One can also have some error analysis, telling you how close is the approximation to the original kernel. The theoretical thing is that this approximates the kernel function uniformly well, we use uniform convergence in this setting, which we don’t know exactly what this is.

One nice thing to see is that one can prove BLR with fourier features is equivalent to GP with a stationary kernel.

Uniform Convergence of Fourier Features

The theorem states that if we have a compact subset of $\mathbb{R}^{d}$, called $\mathcal{M}$ of diameter $D_{\mathcal{M}}$, then for a stationary kernel $k$ and Fourier features $z$, it holds for any $\varepsilon$ that:

$$ \begin{align} \mathbb{P}( \lvert k(x, x') - z(x)^{T}z(x') \rvert > \varepsilon) \\ \leq2^{8} \left( \frac{\mathop{\mathbb{E}}_{w \sim p}[w^{T}w]D_{\mathcal{M}}}{\varepsilon} \right)^{2} \exp \left( -\frac{m\varepsilon^{2}}{8 (d + 2)} \right) \end{align} $$

Where we observe that the decay of the error is exponential in the number of Fourier features $m$

Proof: We first observe that $z(x)^{T}z(x)$ is shift invariant as the function that it attempts to approximate is. Then you can define some function $f(x, x') = k(x, x') - z(x)^{T}z(x')$ and then we can define the function solely using the difference $\Delta = x - x'$.

We use Hoeffding’s Inequality to prove that:

$$ \mathbb{P}(\lvert f(\Delta) \rvert > \varepsilon ) \leq \exp\left( - m \varepsilon^{2}\right) $$

Where we note that $0 \leq z(x)^{T}z(x) \leq 2$

TODO: finish, this proof is hard…

Inducing points 🟥

Introduced in (Quinonero-Candela and Rasmussen, 2005), we want to summarize the points of the training set into the inducing points, and then use these points for inference.