As we will briefly see, kernels play an important role in many machine learning applications. In this note we will get to know what kernels are and why they are useful. Intuitively, a kernel measures the similarity between two input points: if the points are close, the kernel value should be large; otherwise it should be small.
We briefly state the requirements a kernel must satisfy, and then argue with a simple example why kernels are useful.
Kernel Function requirements
Every function $k$ must be
- Symmetric: $\forall x, x' \in \mathbb{X}$ we have $k(x, x') = k(x', x)$
- Positive semi-definite: for every finite set of points $A \subseteq \mathbb{X}$, the Gram matrix $K_{A A}$ is positive semi-definite (a negative "similarity" would not have a clear interpretation).

A kernel encodes information about the correlation between function values. A function satisfying these two conditions is called a Mercer kernel; Mercer proved that they are necessary and sufficient for a function to be a valid kernel. For example, with a linear kernel, kernelized regression and Bayesian linear regression are exactly the same thing.
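As a quick sanity check, these two conditions can be verified empirically on a finite sample: build the Gram matrix on random points and inspect its symmetry and eigenvalues. A minimal numpy sketch (the candidate kernels here are just illustrative):

```python
import numpy as np

def rbf(x, z, l=1.0):
    # Candidate kernel: squared-exponential similarity between two points.
    return np.exp(-np.sum((x - z) ** 2) / (2 * l ** 2))

def looks_like_mercer(k, X, tol=1e-9):
    """Empirically check symmetry and positive semi-definiteness of K_AA on the sample X."""
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)  # tolerate tiny numerical negatives
    return symmetric and psd

X = np.random.randn(50, 3)
print(looks_like_mercer(rbf, X))                          # True: the RBF kernel is a Mercer kernel
print(looks_like_mercer(lambda x, z: -np.dot(x, z), X))   # False: the Gram matrix has negative eigenvalues
```

Passing such a check on one sample is of course only a necessary hint, not a proof that the function is a valid kernel.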
Another nice thing about kernels is that we can create new kernels by combining existing ones (see the composition rules below). Choosing (or building) a kernel can be interpreted as choosing the correct prior over our functions.
A motivating example
Recall that we can use linear regression to learn non-linear functions: we just need a feature map.
Feature maps -> Curse of dimensionality
In a linear regression setting it is possible to also include monomials of higher order and then use the same old algorithms to fit their coefficients. But in doing so, one quickly runs into dimensionality problems: with $d$ variables, the feature map containing all monomials up to order $m$ has size $\mathcal{O}(d^{m})$, which is often prohibitive. We need a method that avoids storing all of these features explicitly.
Simple linear regression
Let’s consider a simple linear regression with a feature map. Say we want to find the best parameter $\theta$ for the model $\theta^{T} \phi(x)$, where $\phi : \mathbb{R}^{d} \to \mathbb{R}^{n}$, $n$ is the size of the feature map, and $\theta \in \mathbb{R}^{n}$. The update rule with gradient descent on the squared error is the following:
$$ \theta_{\text{new}} = \theta + \alpha \sum_{i = 1}^{N} (y^{(i)} - \theta^{T}\phi(x^{(i)}))\phi( x^{(i)}) $$Normally we would need to explicitly compute the feature vectors $\phi(x^{(i)})$ at every step, which becomes infeasible when the feature map is too large.
But if we note that $\theta$ can be written as a linear combination of the features, this problem solves itself! Say there are some $\beta_{1}, \dots, \beta_{N}$ such that $\theta_{0} = \sum_{i = 1}^{N} \beta_{i}\phi(x^{(i)})$. It can be shown by induction that every subsequent $\theta$ produced by gradient descent remains a linear combination of the $\phi(x^{(i)})$. With this in mind, we only need to store a number of coefficients that grows with $N$, not with the number of features $n$. We can then rewrite the update rule in the following way:
$$ \theta_{\text{new}} = \sum_{i = 1}^{N} \beta_{i}\phi(x^{(i)}) + \alpha \sum_{i = 1}^{N} \left(y^{(i)} - \sum_{j = 1}^{N} \beta_{j}\phi(x^{(j)})^{T}\phi(x^{(i)})\right)\phi( x^{(i)}) = \sum_{i = 1}^{N} \left( \beta_{i} + \alpha \left(y^{(i)} - \sum_{j = 1}^{N} \beta_{j}\phi(x^{(j)})^{T}\phi(x^{(i)})\right) \right) \phi(x^{(i)}) $$The part inside the big parentheses is our new $\beta_{i}$! Written more clearly, our update rule is now:
$$ \beta_{i} := \beta_{i} + \alpha \left(y^{(i)} - \sum_{j = 1}^{N} \beta_{j}\phi(x^{(j)})^{T}\phi(x^{(i)})\right) $$The quantity $\phi(x^{(j)})^{T}\phi(x^{(i)})$ is often referred to as the kernel function $K(x^{(j)}, x^{(i)})$, and we usually pre-compute it at the start of the optimization. Now we deal with just $\mathcal{O}(N)$ variables, one per training input. Another nice thing about these kernels is that, for polynomial feature maps, they are also quite fast to evaluate: one can prove that if, for example, $\phi(x)$ collects the constant $1$, the coordinates $x_{1}, \dots, x_{d}$, and all ordered products $x_{i}x_{j}$, then the inner product that we want to compute is
$$ \langle \phi(x), \phi(z) \rangle = 1 + \langle x, z \rangle + \langle x, z \rangle ^{2} $$which is quite fast: we just need to compute $\langle x, z \rangle$, which is $\mathcal{O}(d)$, whereas evaluating $\phi(x)$ explicitly would cost $\mathcal{O}(d^{2})$ (and $\mathcal{O}(d^{m})$ for degree-$m$ polynomials). So now, in vector notation, our update is just
$$ \beta := \beta + \alpha (y - K\beta) $$where $K$ is our $N \times N$ kernel matrix with entries $K_{ij} = K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^{T}\phi(x^{(j)})$.
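A minimal numpy sketch of this kernelized update (the data and step size are made up for illustration): we precompute the $N \times N$ kernel matrix with the degree-2 polynomial kernel $1 + \langle x, z \rangle + \langle x, z \rangle^{2}$, iterate on $\beta$, and predict with $\hat{f}(x) = \sum_{i} \beta_{i} K(x^{(i)}, x)$, never forming $\phi$ explicitly.

```python
import numpy as np

def poly_kernel(X, Z):
    # K(x, z) = 1 + <x, z> + <x, z>^2, computed for all pairs at once.
    G = X @ Z.T
    return 1 + G + G ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # N = 100 samples, d = 5
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

K = poly_kernel(X, X)                            # precomputed kernel matrix
beta = np.zeros(100)
alpha = 1e-4                                     # small step size, keeps the iteration stable
for _ in range(5000):
    beta += alpha * (y - K @ beta)               # beta_i += alpha * (y_i - sum_j beta_j K_ij)

X_test = rng.normal(size=(10, 5))
y_hat = poly_kernel(X_test, X) @ beta            # f(x) = sum_i beta_i K(x^(i), x)
```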
Some takeaways
From the above example, one can observe the kernel trick doing its job and bringing down the computational cost of our optimization problem. The same can be done in many other problems; it is quite a general idea. Two main observations make this framework tractable:
- The kernel trick allows us to run optimizations involving feature maps without ever computing the features explicitly. Even if the feature space has dimension $e$, which could be infinite or exponential in the number of input dimensions $d$, the optimization still scales with $N$, the number of samples.
- Computing a single entry of the kernel matrix is efficient: the complexity is often just $\mathcal{O}(d)$ instead of $\mathcal{O}(e)$ (often because we just need to compute the product $x^{T}z$ and then apply a scalar function to it; the best way to understand this is to work out some examples).
Building new Kernels 🟨
Once we have valid kernels, we can compose them in several ways while retaining the properties that make them kernels. We report the rules listed in (Bishop 2006), Chapter 6:
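For reference, the main closure rules from that chapter are the following: given valid kernels $k_{1}, k_{2}$, a constant $c > 0$, any function $f$, a polynomial $q$ with non-negative coefficients, and a symmetric positive semi-definite matrix $A$, the functions below are again valid kernels.
$$ \begin{align} k(x, x') &= c\,k_{1}(x, x') \\ k(x, x') &= f(x)\,k_{1}(x, x')\,f(x') \\ k(x, x') &= q(k_{1}(x, x')) \\ k(x, x') &= \exp(k_{1}(x, x')) \\ k(x, x') &= k_{1}(x, x') + k_{2}(x, x') \\ k(x, x') &= k_{1}(x, x')\,k_{2}(x, x') \\ k(x, x') &= x^{T}A x' \end{align} $$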
Taking the logarithm of a kernel would also be nice, but it does not preserve positive semi-definiteness: kernel values smaller than one are mapped to negative values.
Choosing the best kernel parameters 🟩
Classical cross-validation on predictive performance works fine for selecting the best parameters. You just do a MAP estimation, which in formulas is as follows: for each candidate parameter $\theta_{i}$ do the following.
$$ \hat{f}_{i} = \arg \max_{f} p(y_{\text{train}}\mid f, \theta_{i}, x_{\text{train}})\, p(f \mid \theta_{i}) $$Then pick $\hat{\theta} =\arg \max_{\theta_{i}} p(y_{\text{val} }\mid \hat{f}_{i}, \theta_{i}, x_{\text{val}})$. This method still collapses the uncertainty in $f$ into a point estimate, but we would like to keep that uncertainty.
An alternative is to attack the problem with a Bayesian approach and maximize the marginal likelihood of the data. This way we do not have to pre-select candidate parameters, and we effectively select across all models at once; remarkably, this is possible even without a validation set.
$$ \hat{\theta}_{\text{MLE}} = \arg \max_{\theta} p(y_{\text{train} }\mid x_{\text{train}}, \theta) = \arg \max_{\theta} \int p(y_{\text{train}} \mid f, x_{\text{train}}, \theta)\,p(f \mid \theta) \, df $$Here $p(f \mid \theta)$ is the prior and the other factor is the likelihood. The delightful property of this marginal-likelihood maximization is that it naturally prevents both underfitting and overfitting. This is often called Bayesian model selection.
This table attempts to give an intuitive understanding of what is explained here.
| | Likelihood | Prior |
| --- | --- | --- |
| $H$ is simple, underfitting | Small for most $f$ | Large for some $f$ (concentrated), small for most $f$. |
| $H$ is complex, overfitting | Large for some $f$, small for most. | Small for most $f$, more distributed. |
| Just right | ok | ok |
So in the overfitting case the prior is spread over too many functions, while in the underfitting case the likelihood is small for almost every $f$, so the integral is small as well; this is why the marginal likelihood tends to be maximized by a model that is just right.
Intuitively, we are characterizing the space of functions induced by a given set of parameters. We are asking: what is the prior probability of obtaining that function given those parameters? If we overfit, there is a huge number of functions to choose from, so the prior probability of any particular function (including the one that fits) is small. If we underfit, it is the other way around: the prior is concentrated, but the likelihood stays small.
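A minimal sketch of Bayesian model selection for a GP regression model (the kernel, noise level and grid of length scales are assumptions made for this illustration): we evaluate the closed-form Gaussian log marginal likelihood $\log p(y \mid X, \theta) = -\tfrac{1}{2} y^{T}(K_{\theta} + \sigma_{n}^{2}I)^{-1}y - \tfrac{1}{2}\log\det(K_{\theta} + \sigma_{n}^{2}I) - \tfrac{n}{2}\log 2\pi$ on each candidate and pick the maximizer, without any validation set.

```python
import numpy as np

def rbf_kernel(X, l):
    # Squared-exponential kernel matrix with length scale l.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * l ** 2))

def log_marginal_likelihood(X, y, l, noise=0.1):
    K = rbf_kernel(X, l) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                       # stable inverse and log-determinant
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=40)

grid = [0.05, 0.1, 0.3, 0.5, 1.0, 2.0, 5.0]
best_l = max(grid, key=lambda l: log_marginal_likelihood(X, y, l))
print(best_l)   # tiny l would overfit, huge l would underfit; the maximizer sits in between
```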
Little notes on Kernel Engineering
Kernels should be designed around specific characteristics of the data. If you know a dataset has a certain invariance, you should use kernels that share that invariance. For example, if you have periodic data, you should use periodic kernels; otherwise, the periodic structure would be hard to capture.
Some famous Kernels
It is sometimes important to recognize famous kernel families, because then you can compose them together and easily recognize whether some other function is a kernel or not.
Linear kernel 🟩
This is one of the simplest kernels: $k(x, y) = x^{T}y$, with $x, y \in\mathbb{R}^{d}$. A simple example of its use is the view of Gaussian Processes with a linear kernel as a generalization of Bayesian Linear Regression.
Radial Basis Kernel
Radial Basis kernels have the following form:
$$ k(x_{i}, x_{j}; l, \sigma) = \sigma\exp\left( -\frac{1}{2l^{2}} \lVert x_{i} - x_{j} \rVert_{2}^{2} \right) $$They are similar to the Gaussian probability density function, but here we do not need to normalize. This is one of the most natural interpretations of a kernel as a distance-based similarity function.
This family of kernels is both stationary and isotropic: they are translation invariant and depend only on the norm of the difference between the inputs. The $l$ parameter is called the length scale and determines how quickly the function can vary: a high value means points far apart still look similar to the kernel, which implies the function varies less.
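To get a feel for the length scale, here is a small sketch (the grid and values are arbitrary) that draws sample functions from a zero-mean Gaussian Process prior with this kernel: a larger $l$ produces slowly varying samples, a smaller $l$ produces wiggly ones.

```python
import numpy as np

def rbf_kernel(x, l, sigma=1.0):
    # k(x_i, x_j) = sigma * exp(-(x_i - x_j)^2 / (2 l^2)) on a 1-D grid.
    d2 = (x[:, None] - x[None, :]) ** 2
    return sigma * np.exp(-d2 / (2 * l ** 2))

x = np.linspace(0, 10, 200)
rng = np.random.default_rng(0)
for l in (0.2, 1.0, 5.0):
    K = rbf_kernel(x, l) + 1e-8 * np.eye(len(x))      # jitter for numerical stability
    f = rng.multivariate_normal(np.zeros(len(x)), K)  # one sample path from the GP prior
    print(l, np.std(np.diff(f)))                      # larger l -> smaller increments -> smoother sample
```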
If we take the Fourier transform (see Fourier Series) of this kernel (with unit length scale and $\sigma = 1$), we obtain
$$ p(\omega) = (2\pi)^{-d/2} \exp\left( - \frac{\lVert \omega \rVert_{2}^{2}}{2} \right) $$which is just the density of a standard Gaussian distribution in $d$ dimensions; this is why the radial basis kernel is sometimes called the Gaussian kernel.
RBF as infinite polynomials (Hard version)
If $x \in \mathbb{R}^{d}$ and $\psi$ is a polynomial, then we can write the polynomial in this format
$$ \psi(x) = \sum_{j_{1}, \dots, j_{d} \in \mathbb{N}} \beta_{j_{1}, \dots j_{d}} x_{1}^{j_{1}} \cdot \dots \cdot x^{j_{d}}_{d} = \beta^{T}\varphi(x) $$Here $\beta$ is the (infinite) vector whose entry at multi-index $(j_{1}, \dots, j_{d})$ is $\beta_{j_{1}, \dots j_{d}}$, and $\varphi$ is a somewhat strange feature map:
$$ \varphi_{j_{1},\dots j_{d}}(x) = \exp\left( -\frac{1}{2} \lVert x \rVert_{2}^{2} \right) \frac{x_{1}^{j_{1}} \cdot \dots \cdot x^{j_{d}}_{d}}{\sqrt{ j_{1}! \dots j_{d}! }} \implies \varphi(x)^{T}\varphi(x') = \exp\left( - \frac{\lVert x - x'\rVert_{2}^{2}}{2} \right) $$Here $\varphi_{j_{1}, \dots, j_{d}}$ is the entry of the feature map at exactly that index: you should imagine $\varphi(x)$ as the infinite vector indexed by these multi-indices. The identity follows because the multinomial series expansion, in multi-index notation, is
$$ \sum_{\alpha \in \mathbb{N}^{d}} \frac{x^{\alpha}(x')^\alpha}{\alpha!} = \exp(x^{T}x') $$This follows from the Taylor expansion of the exponential,
$$ \exp(t) = \sum_{j = 0}^{\infty} \frac{t^{j}}{j!} $$and it is a nice exercise to prove it.
RBF as infinite polynomials 🟨++
There is also an easier proof that works for $x \in \mathbb{R}$. The line of reasoning is exactly the same, but I find this proof much easier to understand, as you don’t need to introduce multi-indices. Let’s say our feature map is the following:
$$ \begin{align} \phi(x) = \begin{bmatrix} \phi_{0}(x) \\ \phi_{1}(x) \\ \vdots \end{bmatrix} && \phi_{j}(x) = \exp\left( -\frac{1}{2}x^{2} \right) \frac{x^{j}}{\sqrt{j!}} \\ \end{align} $$We can write the inner product of these two functions as
$$ \begin{align} \phi(x)^{T}\phi(x') &= \sum_{j = 0}^{\infty} \exp\left( -\frac{1}{2}x^{2} \right) \frac{x^{j}}{\sqrt{j!}} \exp\left( -\frac{1}{2}x'^{2} \right) \frac{x'^{j}}{\sqrt{j!}} \\ &= \sum_{j = 0}^{\infty} \exp\left( -\frac{1}{2}x^{2} -\frac{1}{2}x'^{2} \right) \frac{x^{j}x'^{j}}{j!} \\ &= \exp\left( -\frac{1}{2}x^{2} -\frac{1}{2}x'^{2} \right) \sum_{j = 0}^{\infty} \frac{(xx')^{j}}{j!} \\ &= \exp\left( -\frac{1}{2}x^{2} -\frac{1}{2}x'^{2} \right) \exp(xx') &\text{Taylor expansion of} \exp \\ &= \exp\left( -\frac{1}{2}x^{2} -\frac{1}{2}x'^{2} + xx' \right) \\ &= \exp\left( -\frac{1}{2}(x - x')^{2} \right) \\ \end{align} $$Which ends the proof $\square$.
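A quick numerical check of this identity (a throwaway sketch, not part of the proof): truncating the infinite feature vector at a moderate number of terms already reproduces $\exp\left(-\frac{1}{2}(x - x')^{2}\right)$ to machine precision.

```python
import numpy as np
from math import exp, factorial

def phi(x, J=30):
    # Truncated feature map: phi_j(x) = exp(-x^2 / 2) * x^j / sqrt(j!) for j = 0, ..., J-1.
    return np.array([exp(-0.5 * x * x) * x ** j / np.sqrt(factorial(j)) for j in range(J)])

x, xp = 0.7, -1.3
lhs = phi(x) @ phi(xp)             # truncated inner product of the feature vectors
rhs = exp(-0.5 * (x - xp) ** 2)    # closed-form RBF kernel value
print(abs(lhs - rhs))              # agrees to machine precision for J = 30
```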
Spectral Density
In this section we compute the spectral density of the RBF kernel; this is useful for random Fourier features in Gaussian Processes.
$$ \begin{align} p(\omega) &= \int k(x, x') \cdot \exp(-i \omega^{T}(x - x'))\, d(x - x') \\ &=\int \exp\left( -\frac{\lVert x - x' \rVert_{2}^{2}}{2l^{2}} \right) \exp(-i \omega^{T}(x - x'))\, d(x - x') \\ &= \int \exp\left( -\frac{\lVert \tau \rVert_{2} ^{2}}{2l^{2}} -i \omega^{T}\tau \right)\, d\tau \\ &= (2\pi l^{2})^{d/2} \exp\left( -\frac{l^{2}\lVert \omega \rVert_{2}^{2}}{2} \right) \end{align} $$In the last step we completed the square in the exponent (adding and subtracting $\frac{l^{2}\lVert \omega \rVert_{2}^{2}}{2}$) and used the Gaussian integral; $\tau$ denotes the difference $x - x'$.
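A sketch of how this spectral density is used for random Fourier features (following the Rahimi–Recht construction; the sizes below are arbitrary): normalizing $p(\omega)$ gives $\omega \sim \mathcal{N}(0, l^{-2}I)$, and with $b \sim \mathcal{U}[0, 2\pi]$ the features $z(x) = \sqrt{2/D}\,\cos(\omega^{T}x + b)$ satisfy $z(x)^{T}z(x') \approx k(x, x')$.

```python
import numpy as np

def rbf(X, Z, l):
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * l ** 2))

def random_fourier_features(X, D, l, rng):
    d = X.shape[1]
    W = rng.normal(scale=1.0 / l, size=(D, d))       # omega ~ N(0, I / l^2), i.e. the spectral density above
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z = random_fourier_features(X, D=2000, l=1.5, rng=rng)
print(np.max(np.abs(Z @ Z.T - rbf(X, X, 1.5))))      # small error, shrinking as D grows
```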
Exponential Kernel
Exponential kernels have the form
$$ k(x_{i}, x_{j} ; l) = \exp\left( - \frac{1}{l} \lvert x_{i} - x_{j} \rvert \right) $$They are also called Ornstein-Uhlenbeck Kernels or Laplace Kernels.
Matérn Kernels
These kernels have a quite weird form:
$$ k(x_{i}, x_{j}; v, \ell) = \frac{2^{1 - v}}{\Gamma(v)} \left( \frac{\sqrt{ 2v }}{\ell} \lVert x_{i} - x_{j} \rVert_{2} \right)^{v} K_{v} \left( \frac{\sqrt{ 2v }}{\ell} \lVert x_{i} - x_{j} \rVert_{2} \right) $$where $K_{v}$ is a modified Bessel function. I currently do not know what these are used for, but the resulting functions are $\lceil v \rceil - 1$ times differentiable.
Periodic Kernels
These kernels are useful to capture periodic data (something that repeats every so often):
$$ k(x, x') = \sigma^{2} \exp \left( - \frac{ 2\sin ^{2}\left( \frac{\pi \lVert x' - x \rVert }{p} \right) }{l^{2}} \right) $$
Reproducing Kernel Hilbert space
Definition of RKHS 🟩
Given a kernel $k: \mathcal{X}\times\mathcal{X} \to \mathbb{R}$, its RKHS is defined as follows:
$$ \mathcal{H}_{k}(\mathcal{X}) = \left\{ f(x) = \sum_{i = 1}^{n} \alpha_{i}k(x, x_{i}) :n \in \mathbb{N}, x_{i} \in \mathcal{X} , \alpha_{i} \in \mathbb{R}\right\} $$So it is the space of all functions that we can build by choosing a number of terms $n$, coefficients $\alpha_{i}$ from $\mathbb{R}$, and $n$ elements $x_{i}$ from $\mathcal{X}$. This set can be considered a Hilbert space with the following inner product (for $f = \sum_{i=1}^{n}\alpha_{i}k(\cdot, x_{i})$ and $g = \sum_{j=1}^{n'}\alpha'_{j}k(\cdot, x'_{j})$):
$$ \langle f, g \rangle_{k} = \sum_{i = 1}^{n}\sum_{j = 1}^{n '}\alpha_{i}\alpha'_{j} k(x_{i}, x'_{j}) $$With norm defined as:
$$ \lVert f \rVert _{k} = \sqrt{ \langle f, f \rangle _{k} } $$
The Reproducing Property 🟩–
The reproducing property asserts that for any $f \in \mathcal{H}_{k}(\mathcal{X})$ the following holds:
$$ f(x) = \langle f(\cdot), k(x, \cdot) \rangle _{k} $$Meaning that a simple evaluation of the function can be understood as an inner product between the function itself and the kernel centered at $x$.
By definition of the inner product, we have that $k(x, x') = \langle k(\cdot, x), k(\cdot, x') \rangle_{k}$. We see that
$$ f(x) = \sum_{i = 1}^{n} \alpha_{i} k(x, x_{i}) = \sum_{i = 1}^{n} \alpha_{i} \langle k(\cdot, x_{i}), k(\cdot, x) \rangle_{k} = \left\langle \sum_{i = 1}^{n} \alpha_{i}k(\cdot, x_{i}), k(\cdot, x) \right\rangle_{k} = \langle f(\cdot), k(x, \cdot) \rangle _{k} $$$\square$
Norm as a measure of smoothness 🟨
The norm of the RKHS can be interpreted as a measure of smoothness, i.e. it bounds the possible deviation of $f$ between two different points in our space. The theorem asserts that:
$$ \lvert f(x) - f(y) \rvert \leq \lVert f \rVert _{k} \lVert k(x, \cdot) - k(y, \cdot) \rVert _{k} $$Proof: this is an easy application of the Cauchy-Schwarz inequality, which gives
$$ \lvert\langle f, k(x, \cdot) - k(y, \cdot) \rangle \rvert^{2} \leq \langle f, f \rangle \cdot \langle k(x, \cdot) - k(y, \cdot), k(x, \cdot) - k(y, \cdot) \rangle $$If we focus on the LHS, we see that the inner product reduces as follows:
$$ \langle f, k(x, \cdot) - k(y, \cdot) \rangle = \sum_{i = 1}^{n} \alpha_{i} (k(x, x_{i}) - k(y, x_{i})) = f(x) - f(y) $$Taking the square root of both sides then gives the desired bound. $\square$
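A small numerical check of this bound (an illustrative sketch with an RBF kernel and random coefficients), using $\lVert f \rVert_{k}^{2} = \alpha^{T}K_{AA}\alpha$ and $\lVert k(x,\cdot) - k(y,\cdot) \rVert_{k}^{2} = k(x,x) - 2k(x,y) + k(y,y)$:

```python
import numpy as np

k = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))    # RBF kernel with unit length scale

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 2))                            # anchor points x_i defining f
alpha = rng.normal(size=20)                             # coefficients alpha_i
f = lambda x: sum(a * k(x, xi) for a, xi in zip(alpha, A))

K_AA = np.array([[k(xi, xj) for xj in A] for xi in A])
f_norm = np.sqrt(alpha @ K_AA @ alpha)                  # ||f||_k

x, y = rng.normal(size=2), rng.normal(size=2)
lhs = abs(f(x) - f(y))
rhs = f_norm * np.sqrt(k(x, x) - 2 * k(x, y) + k(y, y))
print(lhs <= rhs)                                       # True: the bound holds
```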
The Representer Theorem
This theorem by Schölkopf et al. links minimizers of regularized losses to inner products in the RKHS. It states that, given a dataset $D$ and $\lambda > 0$, the solution of the minimization problem
$$ \hat{f} \in \arg \min_{f} l(f, D) + \lambda \lVert f \rVert ^{2}_{k} $$admits a nice form, interpretable as an inner product between a coefficient vector and the kernel evaluated at the training points:
$$ \hat{f}(x) = \alpha^{T} k_{x, A} $$where $A$ is the set of all points in the dataset without the labels, and $k_{x,A}$ is the vector of kernel evaluations $(k(x, x_{1}), \dots, k(x, x_{N}))$. Something similar, the Riesz representation theorem, will be used in Counterfactual Invariance for the MMD computation.
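For the squared loss, the representer theorem yields kernel ridge regression in closed form: $\alpha = (K_{AA} + \lambda I)^{-1}y$ and $\hat{f}(x) = \alpha^{T}k_{x,A}$. A minimal sketch (data and kernel chosen only for illustration):

```python
import numpy as np

def rbf_kernel(X, Z, l=1.0):
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * l ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))                    # training inputs A
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=60)
lam = 1e-2

K = rbf_kernel(X, X)                                    # K_{AA}
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # alpha = (K_{AA} + lambda I)^{-1} y

X_test = np.linspace(-3, 3, 100)[:, None]
f_hat = rbf_kernel(X_test, X) @ alpha                   # f_hat(x) = alpha^T k_{x, A}
```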
References
[1] C. M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006.