Gaussian processes can be viewed through a Bayesian lens of the function space: rather than sampling over individual data points, we are now sampling over entire functions. They extend the idea of bayesian linear regression by introducing an infinite number of feature functions for the input XXX.

In geostatistics, Gaussian processes are referred to as kriging regressions, and many other models, such as Kalman Filters or radial basis function networks, can be understood as special cases of Gaussian processes. In this framework, certain functions are more likely than others, and we aim to model this probability distribution.

Gaussian processes retain epistemic uncertainty, meaning we are still modeling what we don't know about the function, while remaining tractable. Essentially, they provide a normal distribution over functions.

If we index this infinite dimensional Gaussian, thanks to the properties of Gaussians with marginalization, we still have another Gaussian. We need a way to parametrize now. In this case we do it using a covariance function

This seems a quite nice resource on the topic.

Formal definition of Gaussian Process

A Gaussian process is an infinite set of random variables such that any finite number of them are jointly Gaussian.

The informal ideas is that we use the probability of the joints, given by our training dataset, to do some inference.

A Gaussian Process is a (infinite) set of random variables, indexed by some set (domain of the function) and for each there is a random variable such that for all and it holds that , this is called noise-free prediction, to have the value of just add some Gaussians with a noise variance. You can see that GPs are very sensitive to adversarial data, it will be considered in the prediction and will condition the mean and Covariance of the prediction. Where

is called the covariance (kernel) function and is called the mean function. We write and define See Kernel Methods for more about kernels.

Inference for Gaussian Processes

Inference is easy if you understand the conditional Gaussian distribution, see Gaussians. You already have the old points defined, you just need to get the conditioned Gaussian on the previous data points.

Here is the setting, and it's very similar to the Bayesian Linear Regression (indeed, GP can be seen as aa generalization of BLR) function view inference

We just use the conditional Gaussian to have that our wanted distribution has the following distribution:

And then you just sample with this, which we will solve in the next section. Where

If we have only the GP prior then the mean , this is often a convention because the prior mean is only used to compute the posterior mean, and can be easily added back in. One note: the posterior mean depends on the observations while the variance does not.

Sampling for Gaussian Processes

Standard Gaussian Sampling

One simple solution is just sampling using , but taking the square root of a matrix takes time. This samples every needed point immediately. This is possible because by assumption we have that the GP is jointly Gaussian, meaning:

\begin{array} \\ f_{1:n} \sim \mathcal{N}(\mu_{f}, K_{f}) \\ \end{array}

Forward Sampling

Another approach, called forward sampling just samples everything one by one. Meaning

\begin{array} \\ f_{1} \sim p(f_{1}) \\ f_{2} \sim p(f_{2} \mid f_{1}) \\ \vdots \\ f_{n} \sim p(f_{n} \mid f_{1:n-1}) \end{array}

But we need to compute the matrix inverse in this case, which also takes .

Learning Gaussians Processes

Kernel optimization

Once we have chosen a Kernel, we would like to find the best Hyper-parameters for our problem. Take a look at the Kernel Methods for possible Kernels. We want to optimize the following equation:

Now let's take the negative log and we will get:

Where

And

One can also prove that the gradient of our likelihood function is

With , you just need some formulas from Multi Variable Derivatives. Now let's take some time to analyze what we good after all of this derivation: The quadratic form is interpretable as the likelihood, or the goodness of the fit. While the log of the Kernel matrix is the volume of the possible prediction, which is kinda the prior. Minimizing the above loss with gradient methods or similars can give us the best kernel parameters.

We can also obtain a Map Estimate if we place a prior on the kernel parameters .

Hyper-priors

Another way to optimize is using hyper-priors. Basically, we assume the priors have parameters too, and we try to optimize over those parameters. Usually a good way to choose kernels is posterior agreement. Given the data, we would like to make the probability of the given data as high as possible. This is very important but I have not understood this well

Approximating GPs

Leaning a Gaussian Process is usually quite slow. We will introduce some local methods for faster approximation, and a technique introduced in Rahimi, Recht NeurIPs 2007, it is a general algorithm for simplifying stationary kernels.

Local methods

One way is only use points such that satisfy , the idea is that point that are far away usually have not such a great influence. This is still expensive if there are too many points close. Basically we can restrict ourselves to a subset of the Covariance matrix, as most other points will be set to . This method essentially cuts off the tails of the distribution.

Kernel function approximation

This technique is quite hard to understand well. The idea is to reduce GPs to linear regression, then the cost would be instead of . One simple way to do it, is estimate the kernel function as some

Such that , and is a much smaller dimensionality (easier to compute). Then it's just GP with a linear kernel, or a Bayesian linear regression.

Fourier Features

We make use of Bochner's theorem which states that a Kernel is a valid Mercer Kernel if its Fourier transform is non-negative (which implies it is a valid probability distribution). When is a valid function, it is called spectral density.

Consider a shift invariant (stationary kernel), as they can be seen as function in one variable, these kernels have a Fourier transform, see Fourier Series.

Where , and define where we are reinterpreting the integral as an expectation, one can show that these two are equivalent. The important thing to note here is that we have an unbiased approximator for some Kernel, given we have its spectral density.

Then we can approximate this expectation as follows:

Where Now we have an explicit random feature map, where works as a parameter for quality of the approximation.

One can also have some error analysis, telling you how close is the approximation to the original kernel. The theoretical thing is that this approximates the kernel function uniformly well, we use uniform convergence in this setting, which we don't know exactly what this is.

One nice thing to see is that one can prove BLR with fourier features is equivalent to GP with a stationary kernel.

Uniform Convergence of Fourier Features

The theorem states that if we have a compact subset of , called of diameter , then for a stationary kernel and Fourier features , it holds for any that:

Where we observe that the decay of the error is exponential in the number of Fourier features

Proof: We first observe that is shift invariant as the function that it attempts to approximate is. Then you can define some function and then we can define the function solely using the difference .

We use Hoeffding's Inequality to prove that:

Where we note that

TODO: finish, this proof is hard...

Inducing points

Introduced in (Quinonero-Candela and Rasmussen, 2005), we want to summarize the points of the training set into the inducing points, and then use these points for inference.

Then the original Gaussian can be recovered in this manner:

Recall that and are conditionally independen. The values and are respectively called testing conditional and training conditional.

Thus, the time complexity is cubic in the number of inducing points, but only linear in the number of data points.

Because we need to invert the matrix for the inducing point, but we don't need to invert the matrix for the training points.