As we will see, kernels play an important role in many machine learning applications. In this note we will get to know what kernels are and why they are useful. Intuitively, a kernel measures the similarity between two input points: if the points are close, the kernel value should be large, otherwise it should be small.
We briefly state the requirements of a kernel, then argue with a simple example why kernels are useful.
Kernel Function requirements
Every kernel function $k$ must be:
- Symmetric: $\forall x, x' \in \mathbb{X}$ we have $k(x, x') = k(x', x)$
- Positive (semi-)definiteness: for every finite set of points $A \subseteq \mathbb{X}$, the Gram matrix $K_{A A}$ is positive semi-definite. (There is no clear interpretation of a negative similarity.)

The kernel encodes information about the correlation between function values at different inputs. If a function satisfies these two conditions, it is called a Mercer kernel; Mercer proved that the two conditions are necessary and sufficient for a function to be a valid kernel. With a linear kernel, kernelized regression is exactly Bayesian linear regression: they are the same thing. (A numerical check of both conditions is sketched right after this list.)
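For concreteness, here is a minimal numpy sketch (my own, not part of the notes) that checks both requirements on a finite set of points: symmetry of the Gram matrix and non-negativity of its eigenvalues. The helper names and the choice of the RBF kernel are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, z, length_scale=1.0):
    """Candidate kernel: squared-exponential similarity between two points."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * length_scale ** 2))

def gram_matrix(kernel, X):
    """Build K_AA for the finite set of points A = rows of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 points in R^3
K = gram_matrix(rbf_kernel, X)

print("symmetric:", np.allclose(K, K.T))
# Positive semi-definiteness: all eigenvalues >= 0 (up to numerical tolerance).
eigvals = np.linalg.eigvalsh(K)
print("PSD:", np.all(eigvals > -1e-10))
```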
Another nice property of kernels is that we can create new kernels by combining existing ones (for instance linearly). Choosing or building a kernel can be interpreted as choosing the right prior over functions.
A motivating example
Feature maps -> Curse of dimensionality
In a linear regression setting it is possible to also include monomials of higher order as features, and then use the same old algorithms to fit their coefficients. But in doing so, one quickly runs into dimensionality problems: if we have $d$ variables, the feature map containing all monomials up to order $m$ has size $\mathcal{O}(d^{m})$, which is often prohibitive. We need a simple method that avoids storing all of this explicitly.
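To get a feel for this blow-up, here is a tiny Python sketch (my own, not part of the notes) that counts the monomials of total degree at most $m$ in $d$ variables, which is $\binom{d+m}{m}$:

```python
from math import comb

def num_monomials(d, m):
    """Number of monomials of total degree <= m in d variables: C(d + m, m)."""
    return comb(d + m, m)

for d in (10, 100, 1000):
    for m in (2, 3, 5):
        print(f"d={d:5d}, m={m}: {num_monomials(d, m):,} features")
```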
Simple linear regression
Let’s consider a simple linear regression with feature maps. Say we want to find the best parameter $\theta$ for the model $\theta^{T} \phi(x)$, where $\phi : \mathbb{R}^{d} \to \mathbb{R}^{n}$ is the feature map, $n$ is its dimension and $\theta \in \mathbb{R}^{n \times 1}$. The update rule with gradient descent on the squared loss is the following:
$$ \theta_{\text{new}} = \theta + \alpha \sum_{i = 1}^{N} (y^{(i)} - \theta^{T}\phi(x^{(i)}))\phi( x^{(i)}) $$Normally we would need to compute $\phi$ explicitly at every step, and if the feature map is too large this is infeasible.
But if we note that $\theta$ can always be written as a linear combination of the training features, then this problem solves itself! Suppose we initialize $\theta_{0}$ so that there exist $\beta_{1}, \dots, \beta_{N}$ with $\theta_{0} = \sum_{i = 1}^{N} \beta_{i}\phi(x^{(i)})$ (for example $\theta_{0} = 0$). We claim that $\theta$ then remains such a linear combination after every update, which is easy to prove by induction. If this is true, we only need to store a number of coefficients that grows with $N$, not with the feature dimension $n$, which grows like $d^{m}$! We can then rewrite the update rule in the following way:
$$ \theta_{\text{new}} = \sum_{i = 1}^{N} \beta_{i}\phi(x^{(i)}) + \alpha \sum_{i = 1}^{N} \left(y^{(i)} - \sum_{j = 1}^{N} \beta_{j}\phi(x^{(j)})^{T}\phi(x^{(i)})\right)\phi( x^{(i)}) = \sum_{i = 1}^{N} \left( \beta_{i} + \alpha \left(y^{(i)} - \sum_{j = 1}^{N} \beta_{j}\phi(x^{(j)})^{T}\phi(x^{(i)})\right) \right) \phi(x^{(i)}) $$The part inside the big parentheses gives our new $\beta$s! Written more clearly, the update rule is now:
$$ \beta_{i} := \beta_{i} + \alpha \left(y^{(i)} - \sum_{j = 1}^{N} \beta_{j}\phi(x^{(j)})^{T}\phi(x^{(i)})\right) $$The quantity $\phi(x^{(j)})^{T}\phi(x^{(i)})$ is often referred to as the kernel function $K(x^{(j)}, x^{(i)})$, and we usually pre-compute it for all pairs at the start of the optimization. Now we deal with just $\mathcal{O}(N)$ variables, where $N$ is the number of training inputs. Another nice thing: with polynomial feature maps the kernel is also quite fast to evaluate. One can show, for example, that if $\phi(x)$ contains all monomials of degree at most two, i.e. $1, x_{1}, \dots, x_{d}, x_{1}^{2}, x_{1}x_{2}, \dots, x^{2}_{d}$ (with suitable scaling constants), then the inner product we want to compute is
$$ \langle \phi(x), \phi(z) \rangle = 1 + \langle x, z \rangle + \langle x, z \rangle ^{2} $$which is quite fast, because we only need $\langle x, z \rangle$, which costs $\mathcal{O}(d)$. In vector notation, our update is then just
$$ \beta := \beta + \alpha (y - K\beta) $$where $K$ is our $N \times N$ kernel (Gram) matrix with entries $K_{ij} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^{T}\phi(x^{(j)})$.
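Here is a minimal numpy sketch of the kernelized update above (my own; the toy data, the step-size choice and the helper name `poly_kernel` are illustrative assumptions), using the degree-two polynomial kernel $1 + \langle x, z \rangle + \langle x, z \rangle^{2}$ so that the feature map is never built explicitly:

```python
import numpy as np

def poly_kernel(X, Z):
    """k(x, z) = 1 + <x, z> + <x, z>^2, evaluated without the explicit feature map."""
    G = X @ Z.T
    return 1.0 + G + G ** 2

rng = np.random.default_rng(0)
N, d = 50, 5
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)   # toy targets

K = poly_kernel(X, X)                  # N x N Gram matrix, pre-computed once
beta = np.zeros(N)
alpha = 1.0 / np.trace(K)              # step size <= 1 / lambda_max(K), keeps the iteration stable

for _ in range(5000):
    beta = beta + alpha * (y - K @ beta)   # beta := beta + alpha * (y - K beta)

# The fitted function at a training point is f(x_i) = sum_j beta_j k(x_j, x_i) = (K beta)_i.
print("training MSE:", np.mean((K @ beta - y) ** 2))
```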
Some takeaways
From the above example, one can see the kernel trick doing its job and bringing down the computational cost of our optimization problem. The same can be done in many other problems; it is quite a general idea. Two main observations make this framework tractable:
- The kernel trick allows us to run the optimization with feature functions without ever computing them explicitly. Even if the feature space has dimension $e$, which could be infinite, exponential in the number of input dimensions $d$, et cetera, the optimization still scales with $N$, the number of samples.
- Computing a single entry of the kernel matrix is efficient: often the complexity is just $\mathcal{O}(d)$ instead of $\mathcal{O}(e)$. (This is usually because we only need the product $x^{T}z$ and then a cheap transformation of that value; the best way to understand this is to work out some examples.)
Building new Kernels
Once we have valid kernels, we can compose them in several ways while retaining the properties that make them kernels: for example, sums, products, scalings by positive constants and exponentials of kernels are again kernels. The full list of composition rules is given in (Bishop 2006), Chapter 6.
Logarithms would also be nice, but we lose positive definiteness: kernel values below $1$ are mapped to negative numbers, and the resulting matrix is in general no longer positive semi-definite.
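A quick numerical illustration (my own sketch; the point set and the particular Gram matrices are arbitrary choices): sums and elementwise products of valid Gram matrices stay positive semi-definite, and so does the elementwise exponential, while the elementwise logarithm does not.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

def rbf_gram(X, length_scale):
    """Gram matrix of the RBF kernel exp(-||x - x'||^2 / (2 l^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * length_scale ** 2))

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

K1 = rbf_gram(X, 0.5)
K2 = X @ X.T                              # linear-kernel Gram matrix

print("sum     :", min_eig(K1 + K2))       # >= 0 (up to numerical tolerance)
print("product :", min_eig(K1 * K2))       # elementwise product, still >= 0 (Schur product theorem)
print("exp     :", min_eig(np.exp(K1)))    # elementwise exp of a kernel is still a kernel
print("log     :", min_eig(np.log(K1)))    # negative: the log of a kernel is generally not a kernel
```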
Bochner’s theorem
A shift-invariant (stationary) kernel $k(x, x') = k(x - x')$ is positive definite (which makes it a valid Mercer kernel) if and only if its Fourier transform $p(\omega)$ is non-negative.
Choosing the best kernel parameters
Classical cross-validation on predictive performance works fine for selecting the best parameters. One simply does a MAP estimation, which in formulas is as follows: for each candidate parameter vector $\theta_{i}$, compute
$$ \hat{f}_{i} = \arg \max_{f} p(y_{\text{train}}\mid f, \theta_{i}, x_{\text{train}})\, p(f \mid \theta_{i}) $$Then pick $\hat{\theta} =\arg \max_{\theta_{i}} p(y_{\text{val} }\mid \hat{f}_{i}, \theta_{i}, x_{\text{val}})$. This method still collapses the uncertainty in $f$ into a point estimate, but we would like to keep that uncertainty.
An alternative is to attack the problem with a Bayesian approach and maximize the marginal likelihood of the data. This way we do not have to fix a set of candidate parameters in advance, and we are effectively selecting across all models at once; somewhat surprisingly, this can be done without a validation set.
$$ \hat{\theta}_{\text{MLE}} = \arg \max_{\theta} p(y_{\text{train} }\mid x_{\text{train}}, \theta), \qquad p(y_{\text{train} }\mid x_{\text{train}}, \theta) = \int p(y_{\text{train}} \mid f, x_{\text{train}}, \theta)\,p(f \mid \theta) \, df $$Here $p(f \mid \theta)$ is the prior and the other factor is the likelihood. The delightful property of this marginal-likelihood maximization is that it naturally prevents both underfitting and overfitting. This is often called Bayesian model selection.
This table attempts to give an intuitive picture of why this happens.
| | Likelihood | Prior |
| --- | --- | --- |
| $H$ is simple, underfitting | Small for most $f$ | Large for some $f$ (concentrated), small for most $f$. |
| $H$ is complex, overfitting | Large for some $f$, small for most. | Small for most $f$, more distributed. |
| Just right | ok | ok |
So in the overfitting case the prior is spread over too many functions, so each gets little mass and the integral stays small; in the underfitting case the likelihood is small almost everywhere, so the integral is small too. A model of just the right complexity therefore tends to achieve the largest marginal likelihood.
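A minimal numpy sketch of this idea (my own; the toy dataset, the noise variance and the grid of length scales are arbitrary assumptions): score each candidate RBF length scale by the Gaussian-process log marginal likelihood $\log \mathcal{N}(y_{\text{train}};\, 0,\, K_{\theta} + \sigma_n^{2} I)$ and keep the best one.

```python
import numpy as np

def rbf_gram(X, length_scale):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * length_scale ** 2))

def log_marginal_likelihood(X, y, length_scale, noise_var=0.1):
    """log p(y | X, theta) for a zero-mean GP: log N(y; 0, K + noise_var * I)."""
    N = len(y)
    K = rbf_gram(X, length_scale) + noise_var * np.eye(N)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y via Cholesky
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                   # -1/2 log|K|
            - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=40)

for ell in (0.01, 0.1, 0.5, 1.0, 5.0):
    print(f"l = {ell:4}: log evidence = {log_marginal_likelihood(X, y, ell):8.2f}")
```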
Little notes on Kernel Engineering
Kernels should be designed around specific characteristics of the data. If you know a dataset has a certain invariance, you should use kernels that share that invariance. For example, if you have periodic data, you should use periodic kernels; otherwise the periodic structure would be hard to capture.
Some famous Kernels
It is important to recognize the famous kernel families, because then you can compose them together and easily tell whether some other function is a kernel or not.
Linear kernel
This is one of the simplest kernels: just $k(x, y) = x^{T}y$, with $x, y \in\mathbb{R}^{d}$. A simple example of its use is the view of Gaussian Processes with a linear kernel as a generalization of Bayesian Linear Regression.
Radial Basis Kernel
Radial Basis kernels have the following form:
$$ k(x_{i}, x_{j}; l, \sigma) = \sigma\exp\left( -\frac{1}{2l} \lVert x_{i} - x_{j} \rVert_{2}^{2} \right) $$They are similar to the Gaussian probability density function, but here we do not need to normalize. This kernel has one of the most natural interpretations: it is a distance-based similarity function.
This family of kernels is both stationary and isotropic, because they are translation invariant and depend only on the Euclidean norm of the difference. The $l$ parameter is called the length scale and determines how much the function can vary: a high value means that points far away in input space (e.g. older observations in a time series) still influence the prediction, which implies the function varies less.
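To see the effect of the length scale, here is a small numpy sketch (my own, matching the parametrization above) that draws one sample from a zero-mean Gaussian process with this kernel for several values of $l$ and measures how much the sample wiggles between neighbouring grid points; a larger $l$ gives a smoother, slower-varying function.

```python
import numpy as np

def rbf_gram(x, length_scale, sigma=1.0):
    """K_ij = sigma * exp(-||x_i - x_j||^2 / (2 l)), matching the parametrization above."""
    sq = (x[:, None] - x[None, :]) ** 2
    return sigma * np.exp(-sq / (2 * length_scale))

x = np.linspace(0, 10, 200)
rng = np.random.default_rng(3)

for l in (0.1, 1.0, 10.0):
    K = rbf_gram(x, l) + 1e-6 * np.eye(len(x))      # jitter for numerical stability
    sample = rng.multivariate_normal(np.zeros(len(x)), K)
    # Mean absolute increment between neighbouring grid points: a crude "wiggliness" measure.
    print(f"l = {l:5}: mean |f(x_{{i+1}}) - f(x_i)| = {np.abs(np.diff(sample)).mean():.3f}")
```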
If we take the Fourier transform (see Fourier Series) of this kernel (with $l = \sigma = 1$), we get
$$ p(\omega) = (2\pi)^{-d/2} \exp\left( - \frac{\lVert \omega \rVert_{2}^{2}}{2} \right) $$which is just the standard Gaussian density in $d$ dimensions; this is why the radial basis kernel is sometimes called the Gaussian kernel.
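Consistently with Bochner’s theorem, we can check this numerically (my own sketch): for $l = \sigma = 1$ the kernel value at a lag $\tau = x - x'$ should equal the expectation of $\cos(\omega^{T}\tau)$ under $\omega \sim \mathcal{N}(0, I_{d})$.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
tau = np.array([0.3, -0.7, 0.5])                 # tau = x - x'

# Draw frequencies from the spectral density p(omega) = N(0, I_d).
omega = rng.normal(size=(200_000, d))
mc_estimate = np.cos(omega @ tau).mean()          # Monte-Carlo estimate of E_omega[cos(omega^T tau)]
exact = np.exp(-np.dot(tau, tau) / 2)             # RBF kernel k(tau) with l = sigma = 1

print(mc_estimate, exact)                         # should agree to a few decimal places
```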
RBF as infinite polynomials 🟥
If $x \in \mathbb{R}^{d}$ and $\psi$ is a polynomial, then we can write it in this format:
$$ \psi(x) = \sum_{j_{1}, \dots, j_{d} \in \mathbb{N}} \beta_{j_{1}, \dots j_{d}} x_{1}^{j_{1}} \cdot \dots \cdot x^{j_{d}}_{d} = \beta^{T}\varphi(x) $$where $\beta$ is the (infinite) coefficient vector with entry $\beta_{j_{1}, \dots, j_{d}}$ at multi-index $(j_{1}, \dots, j_{d})$ (up to a rescaling of its entries), and $\varphi$ is a slightly unusual feature map:
$$ \varphi_{j_{1},\dots j_{d}}(x) = \exp\left( -\frac{1}{2} \lVert x \rVert_{2}^{2} \right) \frac{x_{1}^{j_{1}} \cdot \dots \cdot x^{j_{d}}_{d}}{\sqrt{ j_{1}! \cdots j_{d}! }} \implies \varphi(x)^{T}\varphi(x') = \exp\left( - \frac{\lVert x - x'\rVert_{2}^{2}}{2} \right) $$Here $\varphi_{j_{1}, \dots, j_{d}}$ is the feature map at exactly that index; you should imagine $\varphi(x)$ as the infinite vector indexed by these multi-indices. The implication holds because the multinomial series expansion, in multi-index notation, is
$$ \sum_{\alpha \in \mathbb{N}^{d}} \frac{x^{\alpha}(x')^\alpha}{\alpha!} = \exp(x^{T}x') $$This in turn follows from the Taylor expansion of the exponential,
$$ \exp(t) = \sum_{n = 0}^{\infty} \frac{t^{n}}{n!} $$and it is a nice exercise to prove it.
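As a sanity check (my own sketch, in one dimension): truncating the feature map at a maximum degree $J$ already reproduces the RBF kernel value to high accuracy for moderate $J$.

```python
import numpy as np
from math import factorial

def phi(x, J):
    """Truncated feature map: phi_j(x) = exp(-x^2/2) * x^j / sqrt(j!), for j = 0..J-1."""
    return np.array([np.exp(-x**2 / 2) * x**j / np.sqrt(factorial(j)) for j in range(J)])

x, z = 1.3, -0.4
exact = np.exp(-(x - z) ** 2 / 2)                 # RBF kernel value with unit length scale

for J in (2, 5, 10, 20):
    approx = phi(x, J) @ phi(z, J)                # truncated inner product of feature maps
    print(f"J = {J:2d}: {approx:.6f}   (exact: {exact:.6f})")
```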
Exponential Kernel
Exponential kernels have the form
$$ k(x_{i}, x_{j} ; l) = \exp\left( - \frac{1}{l} \lVert x_{i} - x_{j} \rVert \right) $$They are also called Ornstein-Uhlenbeck kernels or Laplace kernels.
Matérn Kernels
These kernels have a quite weird form:
$$ k(x_{i}, x_{j}; \nu, \ell) = \frac{2^{1 - \nu}}{\Gamma(\nu)} \left( \frac{\sqrt{ 2\nu }}{\ell} \lVert x_{i} - x_{j} \rVert_{2} \right)^{\nu} K_{\nu} \left( \frac{\sqrt{ 2\nu }}{\ell} \lVert x_{i} - x_{j} \rVert_{2} \right) $$where $K_{\nu}$ is a modified Bessel function of the second kind. I currently do not know what these are used for, but sampled functions are $\lceil \nu \rceil - 1$ times differentiable.
Periodic Kernels
These kernels are useful to capture periodic data (something that repeats at regular intervals):
$$ k(x, x') = \sigma^{2} \exp \left( - \frac{2\sin ^{2}\left( \frac{\pi \lVert x' - x \rVert }{p} \right)}{l^{2}} \right) $$

Reproducing kernel Hilbert space
TODO
References
[1] C. M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006.