As we will briefly see, kernels play an important role in many machine learning applications. In this note we will get to know what kernels are and why they are useful. Intuitively, a kernel measures the similarity between two input points: if the points are close, the kernel value should be large; otherwise it should be small.
We briefly state the requirements a kernel must satisfy, and then argue with a simple example why kernels are useful.
Kernel Function requirements
Every function $k$ must be
- Symmetric: $\forall x, x' \in \mathbb{X}$ we have $k(x, x') = k(x', x)$
- Positive semi-definite: for every finite set of points $A \subseteq \mathbb{X}$, the Gram matrix $K_{A A}$ is positive semi-definite (a negative "similarity" would not have a clear interpretation).

A kernel encodes information about the correlation between function values. A function satisfying these two conditions is called a Mercer kernel; Mercer proved that they are necessary and sufficient for a function to be a valid kernel. For example, with a linear kernel, kernelized regression and Bayesian linear regression are exactly the same thing.
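As a quick sanity check, these two conditions can be verified empirically on a finite sample: build the Gram matrix on random points and inspect its symmetry and eigenvalues. A minimal numpy sketch (the candidate kernels here are just illustrative):

```python
import numpy as np

def rbf(x, z, l=1.0):
    # Candidate kernel: squared-exponential similarity between two points.
    return np.exp(-np.sum((x - z) ** 2) / (2 * l ** 2))

def looks_like_mercer(k, X, tol=1e-9):
    """Empirically check symmetry and positive semi-definiteness of K_AA on the sample X."""
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)  # tolerate tiny numerical negatives
    return symmetric and psd

X = np.random.randn(50, 3)
print(looks_like_mercer(rbf, X))                          # True: the RBF kernel is a Mercer kernel
print(looks_like_mercer(lambda x, z: -np.dot(x, z), X))   # False: the Gram matrix has negative eigenvalues
```

Passing such a check on one sample is of course only a necessary hint, not a proof that the function is a valid kernel.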
Another nice thing about kernels is that we can create new kernels by combining existing ones (see the composition rules below). Choosing (or building) a kernel can be interpreted as choosing the correct prior over our functions.
A motivating example
Recall that we can use linear regression to learn non-linear functions: we just need a feature map.
Feature maps -> Curse of dimensionality
In a linear regression setting it is possible to also include monomials of higher order and then use the same old algorithms to fit their coefficients. But in doing so, one quickly runs into dimensionality problems: with $d$ variables, the feature map containing all monomials up to order $m$ has size $\mathcal{O}(d^{m})$, which is often prohibitive. We need a method that avoids storing all of these features explicitly.
Simple linear regression
Let’s consider a simple linear regression with a feature map. Say we want to find the best parameter $\theta$ for the model $\theta^{T} \phi(x)$, where $\phi : \mathbb{R}^{d} \to \mathbb{R}^{n}$, $n$ is the size of the feature map, and $\theta \in \mathbb{R}^{n}$. The update rule with gradient descent on the squared error is the following:
$$ \theta_{\text{new}} = \theta + \alpha \sum_{i = 1}^{N} (y^{(i)} - \theta^{T}\phi(x^{(i)}))\phi( x^{(i)}) $$Normally we would need to explicitly compute the feature vectors $\phi(x^{(i)})$ at every step, which becomes infeasible when the feature map is too large.
But if we note that $\theta$ can be written as a linear combination of the features, this problem solves itself! Say there are some $\beta_{1}, \dots, \beta_{N}$ such that $\theta_{0} = \sum_{i = 1}^{N} \beta_{i}\phi(x^{(i)})$. It can be shown by induction that every subsequent $\theta$ produced by gradient descent remains a linear combination of the $\phi(x^{(i)})$. With this in mind, we only need to store a number of coefficients that grows with $N$, not with the number of features $n$. We can then rewrite the update rule in the following way:
$$ \theta_{\text{new}} = \sum_{i = 1}^{N} \beta_{i}\phi(x^{(i)}) + \alpha \sum_{i = 1}^{N} \left(y^{(i)} - \sum_{j = 1}^{N} \beta_{j}\phi(x^{(j)})^{T}\phi(x^{(i)})\right)\phi( x^{(i)}) = \sum_{i = 1}^{N} \left( \beta_{i} + \alpha \left(y^{(i)} - \sum_{j = 1}^{N} \beta_{j}\phi(x^{(j)})^{T}\phi(x^{(i)})\right) \right) \phi(x^{(i)}) $$The part inside the big parentheses is our new $\beta_{i}$! Written more clearly, our update rule is now:
$$ \beta_{i} := \beta_{i} + \alpha \left(y^{(i)} - \sum_{j = 1}^{N} \beta_{j}\phi(x^{(j)})^{T}\phi(x^{(i)})\right) $$The quantity $\phi(x^{(j)})^{T}\phi(x^{(i)})$ is often referred to as the kernel function $K(x^{(j)}, x^{(i)})$, and we usually pre-compute it at the start of the optimization. Now we deal with just $\mathcal{O}(N)$ variables, one per training input. Another nice thing about these kernels is that, for polynomial feature maps, they are also quite fast to evaluate: one can prove that if, for example, $\phi(x)$ collects the constant $1$, the coordinates $x_{1}, \dots, x_{d}$, and all ordered products $x_{i}x_{j}$, then the inner product that we want to compute is
$$ \langle \phi(x), \phi(z) \rangle = 1 + \langle x, z \rangle + \langle x, z \rangle ^{2} $$which is quite fast: we just need to compute $\langle x, z \rangle$, which is $\mathcal{O}(d)$, whereas evaluating $\phi(x)$ explicitly would cost $\mathcal{O}(d^{2})$ (and $\mathcal{O}(d^{m})$ for degree-$m$ polynomials). So now, in vector notation, our update is just
$$ \beta := \beta + \alpha (y - K\beta) $$where $K$ is our $N \times N$ kernel matrix with entries $K_{ij} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^{T}\phi(x^{(j)})$.
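A minimal numpy sketch of this kernelized update (the data and step size are made up for illustration): we precompute the $N \times N$ kernel matrix with the degree-2 polynomial kernel $1 + \langle x, z \rangle + \langle x, z \rangle^{2}$, iterate on $\beta$, and predict with $\hat{f}(x) = \sum_{i} \beta_{i} K(x^{(i)}, x)$, never forming $\phi$ explicitly.

```python
import numpy as np

def poly_kernel(X, Z):
    # K(x, z) = 1 + <x, z> + <x, z>^2, computed for all pairs at once.
    G = X @ Z.T
    return 1 + G + G ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # N = 100 samples, d = 5
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

K = poly_kernel(X, X)                            # precomputed kernel matrix
beta = np.zeros(100)
alpha = 1e-4                                     # small step size, keeps the iteration stable
for _ in range(5000):
    beta += alpha * (y - K @ beta)               # beta_i += alpha * (y_i - sum_j beta_j K_ij)

X_test = rng.normal(size=(10, 5))
y_hat = poly_kernel(X_test, X) @ beta            # f(x) = sum_i beta_i K(x^(i), x)
```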
Some takeaways
From the above example, one can observe the kernel trick doing its job and bringing down the computational cost of our optimization problem. The same can be done in many other problems; it is quite a general idea. Two main observations make this framework tractable:
- The kernel trick allows us to run optimizations involving feature maps without ever computing the features explicitly. Even if the feature space has dimension $e$, which could be infinite or exponential in the number of input dimensions $d$, the optimization still scales with $N$, the number of samples.
- Computing a single entry of the kernel matrix is efficient: the complexity is often just $\mathcal{O}(d)$ instead of $\mathcal{O}(e)$ (often because we just need to compute the product $x^{T}z$ and then apply a scalar function to it; the best way to understand this is to work out some examples).
Building new Kernels 🟨
Once we have valid kernels, we can compose them in several ways while retaining the properties that make them kernels. We report the rules listed in (Bishop 2006), Chapter 6:
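For reference, the main closure rules from that chapter are the following: given valid kernels $k_{1}, k_{2}$, a constant $c > 0$, any function $f$, a polynomial $q$ with non-negative coefficients, and a symmetric positive semi-definite matrix $A$, the functions below are again valid kernels.
$$ \begin{align} k(x, x') &= c\,k_{1}(x, x') \\ k(x, x') &= f(x)\,k_{1}(x, x')\,f(x') \\ k(x, x') &= q(k_{1}(x, x')) \\ k(x, x') &= \exp(k_{1}(x, x')) \\ k(x, x') &= k_{1}(x, x') + k_{2}(x, x') \\ k(x, x') &= k_{1}(x, x')\,k_{2}(x, x') \\ k(x, x') &= x^{T}A x' \end{align} $$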
Taking the logarithm of a kernel would also be nice, but it does not preserve positive semi-definiteness: kernel values smaller than one are mapped to negative values.
Choosing the best kernel parameters 🟩
Classical cross-validation on predictive performance works fine for selecting the best parameters. You just do a MAP estimation, which in formulas is as follows: for each candidate parameter $\theta_{i}$ do the following.
$$ \hat{f}_{i} = \arg \max_{f} p(y_{\text{train}}\mid f, \theta_{i}, x_{\text{train}})\, p(f \mid \theta_{i}) $$Then pick $\hat{\theta} =\arg \max_{\theta_{i}} p(y_{\text{val} }\mid \hat{f}_{i}, \theta_{i}, x_{\text{val}})$. This method still collapses the uncertainty in $f$ into a point estimate, but we would like to keep that uncertainty.
An alternative is to attack the problem with a Bayesian approach and maximize the marginal likelihood of the data. This way we do not have to pre-select candidate parameters, and we effectively select across all models at once; remarkably, this is possible even without a validation set.
$$ \hat{\theta}_{\text{MLE}} = \arg \max_{\theta} p(y_{\text{train} }\mid x_{\text{train}}, \theta) = \arg \max_{\theta} \int p(y_{\text{train}} \mid f, x_{\text{train}}, \theta)\,p(f \mid \theta) \, df $$Here $p(f \mid \theta)$ is the prior and the other factor is the likelihood. The delightful property of this marginal-likelihood maximization is that it naturally prevents both underfitting and overfitting. This is often called Bayesian model selection.
This table attempts to give an intuitive understanding of what is explained here.
| | Likelihood | Prior |
| --- | --- | --- |
| $H$ is simple, underfitting | Small for most $f$ | Large for some $f$ (concentrated), small for most $f$. |
| $H$ is complex, overfitting | Large for some $f$, small for most. | Small for most $f$, more distributed. |
| Just right | ok | ok |
So in the overfitting case the prior is spread over too many functions, while in the underfitting case the likelihood is small for almost every $f$, so the integral is small as well; this is why the marginal likelihood tends to be maximized by a model that is just right.
Intuitively, we are characterizing the space of functions induced by a given set of parameters. We are asking: what is the prior probability of obtaining that function given those parameters? If we overfit, there is a huge number of functions to choose from, so the prior probability of any particular function (including the one that fits) is small. If we underfit, it is the other way around: the prior is concentrated, but the likelihood stays small.
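A minimal sketch of Bayesian model selection for a GP regression model (the kernel, noise level and grid of length scales are assumptions made for this illustration): we evaluate the closed-form Gaussian log marginal likelihood $\log p(y \mid X, \theta) = -\tfrac{1}{2} y^{T}(K_{\theta} + \sigma_{n}^{2}I)^{-1}y - \tfrac{1}{2}\log\det(K_{\theta} + \sigma_{n}^{2}I) - \tfrac{n}{2}\log 2\pi$ on each candidate and pick the maximizer, without any validation set.

```python
import numpy as np

def rbf_kernel(X, l):
    # Squared-exponential kernel matrix with length scale l.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * l ** 2))

def log_marginal_likelihood(X, y, l, noise=0.1):
    K = rbf_kernel(X, l) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                       # stable inverse and log-determinant
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=40)

grid = [0.05, 0.1, 0.3, 0.5, 1.0, 2.0, 5.0]
best_l = max(grid, key=lambda l: log_marginal_likelihood(X, y, l))
print(best_l)   # tiny l would overfit, huge l would underfit; the maximizer sits in between
```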
Little notes on Kernel Engineering
Kernels should be designed around specific characteristics of the data. If you know a dataset has a certain invariance, you should use kernels that share that invariance. For example, if you have periodic data, you should use periodic kernels; otherwise, the periodic structure would be hard to capture.
Some famous Kernels
It is sometimes important to recognize famous kernel families, because then you can compose them together and easily recognize whether some other function is a kernel or not.
Linear kernel 🟩
This is one of the simplest kernels: $k(x, y) = x^{T}y$, with $x, y \in\mathbb{R}^{d}$. A simple example of its use is the view of Gaussian Processes with a linear kernel as a generalization of Bayesian Linear Regression.
Radial Basis Kernel
Radial Basis kernels have the following form:
$$ k(x_{i}, x_{j}; l, \sigma) = \sigma\exp\left( -\frac{1}{2l^{2}} \lVert x_{i} - x_{j} \rVert_{2}^{2} \right) $$They are similar to the Gaussian probability density function, but here we do not need to normalize. This is one of the most natural interpretations of a kernel as a distance-based similarity function.
This family of kernels is both stationary and isotropic: they are translation invariant and depend only on the norm of the difference between the inputs. The $l$ parameter is called the length scale and determines how quickly the function can vary: a high value means points far apart still look similar to the kernel, which implies the function varies less.
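To get a feel for the length scale, here is a small sketch (the grid and values are arbitrary) that draws sample functions from a zero-mean Gaussian Process prior with this kernel: a larger $l$ produces slowly varying samples, a smaller $l$ produces wiggly ones.

```python
import numpy as np

def rbf_kernel(x, l, sigma=1.0):
    # k(x_i, x_j) = sigma * exp(-(x_i - x_j)^2 / (2 l^2)) on a 1-D grid.
    d2 = (x[:, None] - x[None, :]) ** 2
    return sigma * np.exp(-d2 / (2 * l ** 2))

x = np.linspace(0, 10, 200)
rng = np.random.default_rng(0)
for l in (0.2, 1.0, 5.0):
    K = rbf_kernel(x, l) + 1e-8 * np.eye(len(x))      # jitter for numerical stability
    f = rng.multivariate_normal(np.zeros(len(x)), K)  # one sample path from the GP prior
    print(l, np.std(np.diff(f)))                      # larger l -> smaller increments -> smoother sample
```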
If we take the Fourier transform (see Fourier Series) of this kernel (with unit length scale and $\sigma = 1$), we obtain
$$ p(\omega) = (2\pi)^{-d/2} \exp\left( - \frac{\lVert \omega \rVert_{2}^{2}}{2} \right) $$which is just the density of a standard Gaussian distribution in $d$ dimensions; this is why the radial basis kernel is sometimes called the Gaussian kernel.
RBF as infinite polynomials (Hard version)
If $x \in \mathbb{R}^{d}$ and $\psi$ is a polynomial, then we can write the polynomial in this format
$$ \psi(x) = \sum_{j_{1}, \dots, j_{d} \in \mathbb{N}} \beta_{j_{1}, \dots j_{d}} x_{1}^{j_{1}} \cdot \dots \cdot x^{j_{d}}_{d} = \beta^{T}\varphi(x) $$Here $\beta$ is the (infinite) vector whose entry at multi-index $(j_{1}, \dots, j_{d})$ is $\beta_{j_{1}, \dots j_{d}}$, and $\varphi$ is a somewhat strange feature map:
$$ \varphi_{j_{1},\dots j_{d}}(x) = \exp\left( -\frac{1}{2} \lVert x \rVert_{2}^{2} \right) \frac{x_{1}^{j_{1}} \cdot \dots \cdot x^{j_{d}}_{d}}{\sqrt{ j_{1}! \dots j_{d}! }} \implies \varphi(x)^{T}\varphi(x') = \exp\left( - \frac{\lVert x - x'\rVert_{2}^{2}}{2} \right) $$Here $\varphi_{j_{1}, \dots, j_{d}}$ is the entry of the feature map at exactly that index: you should imagine $\varphi(x)$ as the infinite vector indexed by these multi-indices. The identity follows because the multinomial series expansion, in multi-index notation, is
$$ \sum_{\alpha \in \mathbb{N}^{d}} \frac{x^{\alpha}(x')^\alpha}{\alpha!} = \exp(x^{T}x') $$This follows from the Taylor expansion of the exponential,
$$ \exp(t) = \sum_{j = 0}^{\infty} \frac{t^{j}}{j!} $$and it is a nice exercise to prove it.
RBF as infinite polynomials 🟨++
There is also an easier proof that works for $x \in \mathbb{R}$. The line of reasoning is exactly the same, but I find this proof much easier to understand, as you don’t need to introduce multi-indices. Let’s say our feature map is the following:
$$ \begin{align} \phi(x) = \begin{bmatrix} \phi_{0}(x) \\ \phi_{1}(x) \\ \vdots \end{bmatrix} && \phi_{j}(x) = \exp\left( -\frac{1}{2}x^{2} \right) \frac{x^{j}}{\sqrt{j!}} \\ \end{align} $$We can write the inner product of these two functions as
$$ \begin{align} \phi(x)^{T}\phi(x') &= \sum_{j = 0}^{\infty} \exp\left( -\frac{1}{2}x^{2} \right) \frac{x^{j}}{\sqrt{j!}} \exp\left( -\frac{1}{2}x'^{2} \right) \frac{x'^{j}}{\sqrt{j!}} \\ &= \sum_{j = 0}^{\infty} \exp\left( -\frac{1}{2}x^{2} -\frac{1}{2}x'^{2} \right) \frac{x^{j}x'^{j}}{j!} \\ &= \exp\left( -\frac{1}{2}x^{2} -\frac{1}{2}x'^{2} \right) \sum_{j = 0}^{\infty} \frac{(xx')^{j}}{j!} \\ &= \exp\left( -\frac{1}{2}x^{2} -\frac{1}{2}x'^{2} \right) \exp(xx') &\text{Taylor expansion of} \exp \\ &= \exp\left( -\frac{1}{2}x^{2} -\frac{1}{2}x'^{2} + xx' \right) \\ &= \exp\left( -\frac{1}{2}(x - x')^{2} \right) \\ \end{align} $$Which ends the proof $\square$.
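A quick numerical check of this identity (a throwaway sketch, not part of the proof): truncating the infinite feature vector at a moderate number of terms already reproduces $\exp\left(-\frac{1}{2}(x - x')^{2}\right)$ to machine precision.

```python
import numpy as np
from math import exp, factorial

def phi(x, J=30):
    # Truncated feature map: phi_j(x) = exp(-x^2 / 2) * x^j / sqrt(j!) for j = 0, ..., J-1.
    return np.array([exp(-0.5 * x * x) * x ** j / np.sqrt(factorial(j)) for j in range(J)])

x, xp = 0.7, -1.3
lhs = phi(x) @ phi(xp)             # truncated inner product of the feature vectors
rhs = exp(-0.5 * (x - xp) ** 2)    # closed-form RBF kernel value
print(abs(lhs - rhs))              # agrees to machine precision for J = 30
```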
Spectral Density
In this section we compute the spectral density of the RBF kernel; this is useful for random Fourier features in Gaussian Processes.
$$ \begin{align} p(\omega) &= \int k(x, x') \cdot \exp(-i \omega^{T}(x - x'))\, d(x - x') \\ &=\int \exp\left( -\frac{\lVert x - x' \rVert_{2}^{2}}{2l^{2}} \right) \exp(-i \omega^{T}(x - x'))\, d(x - x') \\ &= \int \exp\left( -\frac{\lVert \tau \rVert_{2} ^{2}}{2l^{2}} -i \omega^{T}\tau \right)\, d\tau \\ &= (2\pi l^{2})^{d/2} \exp\left( -\frac{l^{2}\lVert \omega \rVert_{2}^{2}}{2} \right) \end{align} $$In the last step we completed the square in the exponent (adding and subtracting $\frac{l^{2}\lVert \omega \rVert_{2}^{2}}{2}$) and used the Gaussian integral; $\tau$ denotes the difference $x - x'$.
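A sketch of how this spectral density is used for random Fourier features (following the Rahimi–Recht construction; the sizes below are arbitrary): normalizing $p(\omega)$ gives $\omega \sim \mathcal{N}(0, l^{-2}I)$, and with $b \sim \mathcal{U}[0, 2\pi]$ the features $z(x) = \sqrt{2/D}\,\cos(\omega^{T}x + b)$ satisfy $z(x)^{T}z(x') \approx k(x, x')$.

```python
import numpy as np

def rbf(X, Z, l):
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * l ** 2))

def random_fourier_features(X, D, l, rng):
    d = X.shape[1]
    W = rng.normal(scale=1.0 / l, size=(D, d))       # omega ~ N(0, I / l^2), i.e. the spectral density above
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z = random_fourier_features(X, D=2000, l=1.5, rng=rng)
print(np.max(np.abs(Z @ Z.T - rbf(X, X, 1.5))))      # small error, shrinking as D grows
```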
Exponential Kernel
Exponential kernels have the form
$$ k(x_{i}, x_{j} ; l) = \exp\left( - \frac{1}{l} \lvert x_{i} - x_{j} \rvert \right) $$They are also called Ornstein-Uhlenbeck Kernels or Laplace Kernels.
Matérn Kernels
These kernels have a quite weird form:
$$ k(x_{i}, x_{j}; v, \ell) = \frac{2^{1 - v}}{\Gamma(v)} \left( \frac{\sqrt{ 2v }}{\ell} \lVert x_{i} - x_{j} \rVert_{2} \right)^{v} K_{v} \left( \frac{\sqrt{ 2v }}{\ell} \lVert x_{i} - x_{j} \rVert_{2} \right) $$where $K_{v}$ is a modified Bessel function. I currently do not know what these are used for, but the resulting functions are $\lceil v \rceil - 1$ times differentiable.
Periodic Kernels
These kernels are useful to capture periodic data (something that repeats every so often):
$$ k(x, x') = \sigma^{2} \exp \left( - \frac{ 2\sin ^{2}\left( \frac{\pi \lVert x' - x \rVert }{p} \right) }{l^{2}} \right) $$
Reproducing Kernel Hilbert space
Definition of RKHS 🟩
Given a kernel $k: \mathcal{X}\times\mathcal{X} \to \mathbb{R}$, its RKHS is defined as follows:
$$ \mathcal{H}_{k}(\mathcal{X}) = \left\{ f(x) = \sum_{i = 1}^{n} \alpha_{i}k(x, x_{i}) :n \in \mathbb{N}, x_{i} \in \mathcal{X} , \alpha_{i} \in \mathbb{R}\right\} $$So it is the space of all functions that we can build by choosing a number of terms $n$, coefficients $\alpha_{i}$ from $\mathbb{R}$, and $n$ elements $x_{i}$ from $\mathcal{X}$. This set can be considered a Hilbert space with the following inner product (for $f = \sum_{i=1}^{n}\alpha_{i}k(\cdot, x_{i})$ and $g = \sum_{j=1}^{n'}\alpha'_{j}k(\cdot, x'_{j})$):
$$ \langle f, g \rangle_{k} = \sum_{i = 1}^{n}\sum_{j = 1}^{n '}\alpha_{i}\alpha'_{j} k(x_{i}, x'_{j}) $$With norm defined as:
$$ \lVert f \rVert _{k} = \sqrt{ \langle f, f \rangle _{k} } $$
The Reproducing Property 🟩–
The reproducing property asserts that for any $f \in \mathcal{H}_{k}(\mathcal{X})$ the following holds:
$$ f(x) = \langle f(\cdot), k(x, \cdot) \rangle _{k} $$Meaning that a simple evaluation of the function can be understood as an inner product between the function itself and the kernel centered at $x$.
By definition of the inner product, we have that $k(x, x') = \langle k(\cdot, x), k(\cdot, x') \rangle_{k}$. We see that
$$ f(x) = \sum_{i = 1}^{n} \alpha_{i} k(x, x_{i}) = \sum_{i = 1}^{n} \alpha_{i} \langle k(\cdot, x_{i}), k(\cdot, x) \rangle_{k} = \left\langle \sum_{i = 1}^{n} \alpha_{i}k(\cdot, x_{i}), k(\cdot, x) \right\rangle_{k} = \langle f(\cdot), k(x, \cdot) \rangle _{k} $$$\square$
Norm as a measure of smoothness 🟨
The norm of the RKHS can be interpreted as a measure of smoothness, i.e. it bounds the possible deviation of $f$ between two different points in our space. The theorem asserts that:
$$ \lvert f(x) - f(y) \rvert \leq \lVert f \rVert _{k} \lVert k(x, \cdot) - k(y, \cdot) \rVert _{k} $$Proof: this is an easy application of the Cauchy-Schwarz inequality, which gives
$$ \lvert\langle f, k(x, \cdot) - k(y, \cdot) \rangle \rvert^{2} \leq \langle f, f \rangle \cdot \langle k(x, \cdot) - k(y, \cdot), k(x, \cdot) - k(y, \cdot) \rangle $$If we focus on the LHS, we see that the inner product reduces as follows:
$$ \langle f, k(x, \cdot) - k(y, \cdot) \rangle = \sum_{i = 1}^{n} \alpha_{i} (k(x, x_{i}) - k(y, x_{i})) = f(x) - f(y) $$Taking the square root of both sides then gives the desired bound. $\square$
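A small numerical check of this bound (an illustrative sketch with an RBF kernel and random coefficients), using $\lVert f \rVert_{k}^{2} = \alpha^{T}K_{AA}\alpha$ and $\lVert k(x,\cdot) - k(y,\cdot) \rVert_{k}^{2} = k(x,x) - 2k(x,y) + k(y,y)$:

```python
import numpy as np

k = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))    # RBF kernel with unit length scale

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 2))                            # anchor points x_i defining f
alpha = rng.normal(size=20)                             # coefficients alpha_i
f = lambda x: sum(a * k(x, xi) for a, xi in zip(alpha, A))

K_AA = np.array([[k(xi, xj) for xj in A] for xi in A])
f_norm = np.sqrt(alpha @ K_AA @ alpha)                  # ||f||_k

x, y = rng.normal(size=2), rng.normal(size=2)
lhs = abs(f(x) - f(y))
rhs = f_norm * np.sqrt(k(x, x) - 2 * k(x, y) + k(y, y))
print(lhs <= rhs)                                       # True: the bound holds
```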
The Representer Theorem
This theorem by Schölkopf et al. links minimizers of regularized losses to inner products in the RKHS. It states that, given a dataset $D$ and $\lambda > 0$, the solution of the minimization problem
$$ \hat{f} \in \arg \min_{f} l(f, D) + \lambda \lVert f \rVert ^{2}_{k} $$admits a nice form, interpretable as an inner product between a coefficient vector and the kernel evaluated at the training points:
$$ \hat{f}(x) = \alpha^{T} k_{x, A} $$where $A$ is the set of all points in the dataset without the labels, and $k_{x,A}$ is the vector of kernel evaluations $(k(x, x_{1}), \dots, k(x, x_{N}))$. Something similar, the Riesz representation theorem, will be used in Counterfactual Invariance for the MMD computation.
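For the squared loss, the representer theorem yields kernel ridge regression in closed form: $\alpha = (K_{AA} + \lambda I)^{-1}y$ and $\hat{f}(x) = \alpha^{T}k_{x,A}$. A minimal sketch (data and kernel chosen only for illustration):

```python
import numpy as np

def rbf_kernel(X, Z, l=1.0):
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * l ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))                    # training inputs A
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=60)
lam = 1e-2

K = rbf_kernel(X, X)                                    # K_{AA}
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # alpha = (K_{AA} + lambda I)^{-1} y

X_test = np.linspace(-3, 3, 100)[:, None]
f_hat = rbf_kernel(X_test, X) @ alpha                   # f_hat(x) = alpha^T k_{x, A}
```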
References
[1] C. M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006.