A step-by-step derivation of this part of Support Vector Machines can be found in (Bishop 2006), chapter 7, which is a good resource. The main idea of this supervised method is to separate the classes with a large gap. We work with a hyperplane; when this hyperplane (possibly living in a higher-dimensional feature space) is projected back to the lower-dimensional data, it can look like a non-linear separator. After we have found this separator, the distance of a point from it intuitively gives us an idea of the confidence of the classification.

These models were very popular from the mid '90s to the early 2000s. With support vector machines, we generalize the idea of a linear separator.

Definitions

We consider a dataset $S = \left\{ (x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)}) \right\}$. This is a classical dataset with points $x^{(i)}$ of dimensionality $d$ and labels $y^{(i)} \in \left\{ -1, 1 \right\}$.

Functional Margin 🟨–

Given a hyperplane $w^{T}x + b = 0$, the margin for the $i$-th point is just $\lambda^{(i)} = y^{(i)}(w^{T}x^{(i)} + b)$. Why is this a good definition? We discussed that the main idea is that we want a large gap between the hyperplane and the points. When a classification is correct we have $\lambda^{(i)} > 0$, and the value of this margin can be interpreted as a sign of confidence. One observation is that $w$ and $b$ are not scaled, so the value can be made arbitrarily large by rescaling; we should usually normalize (e.g. divide by the norm of $w$, or apply any other transformation that is useful for us). The important thing to notice is the sign.

After we have defined this for every point, we have that the functional margin for this dataset is

$$ \lambda = \min_{i} \lambda^{(i)} $$

Geometrical Margin 🟨–

This margin has an easy geometrical interpretation. Let’s say we have a point $A$ and its projection onto our hyperplane, call this point $B$. The small trick is to express point $B$ as $A$ minus the unit normal times the projection distance, which is our geometrical margin. So we have

$$ w^{T}\left( x^{(i)} - \frac{w}{\lVert w \rVert }\lambda^{(i)} \right) + b = 0 \implies \lambda^{(i)} = \frac{w^{T}x^{(i) }+ b}{\lVert w \rVert } $$

We know that the vector $w$ is perpendicular to the hyperplane (this is easy to prove: take a vector lying on the hyperplane, say $p - q$ with $p, q$ points on the hyperplane; then $w^{T}(p - q) = 0$ because $w^{T}p + b = 0$ and the same for $q$), so we use $w / \lVert w \rVert$ as the unit normal to measure the distance. After we have this definition, we just define the geometrical margin to be the minimum:

$$ \lambda = \min_{i}\lambda^{(i)} $$

We note that if $\lVert w \rVert = 1$ the geometrical margin equals the functional margin, so we keep the correct-classification property that the functional margin gives, and we also have the easy interpretation of the geometrical margin.
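To make the two definitions concrete, here is a tiny numpy sketch (the separator and the points below are made up for illustration) that computes both margins on a toy dataset:

```python
import numpy as np

# hypothetical separator and toy points, only for illustration
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, -2.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])

functional = y * (X @ w + b)                  # y_i (w^T x_i + b)
geometric = functional / np.linalg.norm(w)    # divide by ||w||

# the dataset margins are the minima over all points
print(functional.min(), geometric.min())
```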

The technique

The primal problem 🟨–

We have said in the previous sections that the main idea of the SVM classifier is to find the margin that best separates our data. This motivates a simple model of the problem. We will use Lagrange Multipliers to find the solution of the optimization problem below. We first formalize the problem in this way:

$$ \begin{align} \max \lambda \\ \lVert w \rVert = 1 \\ y^{(i)}(w^{T}x^{(i)} + b) \geq \lambda \end{align} $$

Which means we want to find the best geometrical margin. But some conditions are not so nice to handle (the norm constraint is not easy because it is not convex). Instead, we can re-scale $w$ and $b$ so that the third condition becomes $\geq 1$ and the second becomes implicit; this is possible because scaling $w$ and $b$ does not change the separating hyperplane, and hence does not change the final margin. Let’s see this. Given the support points $x^{+}$ and $x^{-}$, we want to find the $w$ and $b$ that maximize the margin. We can write the optimization problem in this way:

$$ \begin{align} \max \lVert \text{proj}_{+} - \text{proj}_{-} \rVert = \max \frac{\lvert w^{T}x^{+} - w^{T}x^{-} \rvert}{\lVert w \rVert } = \max \frac{2}{\lVert w \rVert} \implies \max \frac{1}{\lVert w \rVert} \\ y^{(i)}(w^{T}x^{(i)} + b) \geq 1 \end{align} $$

Where we have used the definition of vector projection and the scaling property, which allowed us to build that constraint; with the rescaling, $w^{T}x^{+} + b = 1$ and $w^{T}x^{-} + b = -1$ at the support points, so the numerator equals $2$. We need to note that in this case we are still assuming linear separability of the data.

In this way we reduce this problem to

$$ \begin{align} \min \frac{1}{2} \lVert w \rVert ^{2} \\ y^{(i)}(w^{T}x^{(i)} + b) \geq 1 \end{align} $$

Which could be solved by Quadratic Programming and the use of Lagrange Multipliers. We now see how this is possible.
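As a quick illustration of the quadratic program above, here is a minimal sketch that hands the primal directly to a generic convex solver. The toy data and the use of cvxpy are assumptions for illustration, not part of the derivation:

```python
import numpy as np
import cvxpy as cp  # assumed to be installed

# hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min 1/2 ||w||^2  subject to  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```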

Functional and Geometrical Margin

The margin of the original problem is the geometrical margin (the distance actually corresponds to a real geometrical distance), while the unnormalized quantity $y^{(i)}(w^{T}x^{(i)} + b)$ is the functional margin; under the constraint $\lVert w \rVert = 1$ the two coincide. Since rescaling $w$ and $b$ does not change the classifier, both formulations lead to the same optimal separator.

The dual 🟥++

Using Lagrange Multipliers we obtain the following Lagrangian:

$$ \mathcal{L}(w, b, \alpha) = \frac{1}{2}\lVert w \rVert ^{2} - \sum_{i} \alpha_{i}(y^{(i)}(w^{T}x^{(i)} + b) - 1) $$

Where $\alpha_{i} \geq 0 \; \forall i$. We check that Slater’s condition holds, which implies strong duality for this optimization problem. For it to hold we need to find $w$ and $b$ such that the inequalities are strict. This is simple because the problem is scaling invariant: given some $w$ and $b$ for which the constraints hold with equality at the support vectors (true by construction, since we set their margin to $1$), just re-scale them by a factor $c > 1$ and you obtain a strictly feasible $w$ and $b$.

Now we compute the gradient of the Lagrangian, set it to zero, and substitute back into $\mathcal{L}$; we obtain the dual problem:

$$ \begin{align} \min_{\alpha} \frac{1}{2} \sum_{i, j} \alpha_{i}\alpha_{j}y^{(i)}y^{(j)}x^{(i)T}x^{(j)} - \sum_{i} \alpha_{i} \\ \text{subject to} \sum_{i} \alpha_{i}y^{(i)} = 0 \\ \alpha_{i} \geq 0 \end{align} $$
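The same toy data can be pushed through the dual instead. A hedged sketch (again assuming cvxpy is available) that uses the identity $\sum_{i,j}\alpha_{i}\alpha_{j}y^{(i)}y^{(j)}x^{(i)T}x^{(j)} = \lVert \sum_{i} \alpha_{i}y^{(i)}x^{(i)} \rVert^{2}$ to keep the objective in a solver-friendly form:

```python
import numpy as np
import cvxpy as cp  # assumed to be installed

# same hypothetical toy data as in the primal sketch
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

alpha = cp.Variable(len(y))
Z = y[:, None] * X                         # rows are y_i x_i

# 1/2 ||sum_i alpha_i y_i x_i||^2 - sum_i alpha_i
objective = cp.Minimize(0.5 * cp.sum_squares(Z.T @ alpha) - cp.sum(alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

w_star = Z.T @ alpha.value                 # w* = sum_i alpha_i y_i x_i
print(alpha.value, w_star)
```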

Finding support vectors 🟨+

This is a quadratic optimization problem, and it is usually solved by the SMO algorithm. After we have found the optimal $\alpha$, we can look for the support vectors in the following manner. We first observe that by complementary slackness we have $\alpha_{i}(1 - y^{(i)}(w^{T}x^{(i)} + b)) = 0$. This implies that if $\alpha_{i} > 0$ then $y^{(i)}(w^{T}x^{(i)} + b) = 1$, which is the condition for a support vector; if a point is not a support vector then $\alpha_{i} = 0$, which implies that $\boldsymbol{\alpha}$ is a quite sparse vector. Now you see that a simple linear search over the dataset is enough to find the support vectors! If you have derived the dual, at one step of the gradient calculation you would have found that

$$ w^{*} = \sum_{i} \alpha_{i}y^{(i)}x^{(i)} $$

Now you see that this is just a linear combination of the support vectors! We can also find $b$ by noting that $y^{(i)}(w^{T}x^{(i)} + b) = 1$ for a support vector and solving for $b$. For example, once we have a support vector of the positive class, $x^{+}$, the intercept follows directly:

$$ w^{*T}x^{+} + b = 1 \implies b = 1 - w^{*T}x^{+} $$
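In practice one rarely runs SMO by hand. A hedged sketch using scikit-learn's SVC (with a large C to approximate the hard margin, on made-up toy data) shows how the support vectors, $w^{*}$, and $b$ can be read back out of the dual solution:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

# dual_coef_ stores alpha_i * y_i for the support vectors only
w_star = (clf.dual_coef_ @ clf.support_vectors_).ravel()

# pick a support vector with label +1 and apply b = 1 - w*^T x^+
sv_labels = y[clf.support_]
x_plus = clf.support_vectors_[sv_labels == 1][0]
b = 1.0 - w_star @ x_plus

# b should agree with clf.intercept_ up to solver tolerance
print(w_star, b, clf.intercept_)
```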

Extensions of SVM

Soft SVM

In this case we no longer assume the data is linearly separable, so we relax the optimization problem. We add a slack variable $\xi_{i} \geq 0$ for each point, so that the optimization problem is now:

$$ \begin{align} \min \frac{1}{2} \lVert w \rVert ^{2} + C\sum_{i} \xi_{i} \\ y^{(i)}(w^{T}x^{(i)} + b) \geq 1 - \xi_{i} \\ \xi_{i} \geq 0 \end{align} $$

Intuitively, we are now allowing some points to violate the margin, but we penalize these violations. The parameter $C$ is a hyper-parameter that we can tune to adjust the tradeoff between the margin and the violations. Qualitatively, if $\xi_{i} = 0$ the point is correctly classified and respects the margin, if $0 < \xi_{i} \leq 1$ the point is inside the margin but still correctly classified, and if $\xi_{i} > 1$ the point is misclassified. A higher value of $\xi_{i}$ is penalized through the $C$ term, so $C$ tells you how much you penalize margin violations.
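A small sketch of the tradeoff (on synthetic, overlapping data, which is an assumption for illustration): a smaller $C$ tolerates more margin violations, which typically shows up as more support vectors, while a large $C$ tries hard to classify every point.

```python
import numpy as np
from sklearn.svm import SVC

# synthetic overlapping classes, just to see the effect of C
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.5, size=(50, 2)),
               rng.normal(+1.0, 1.5, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # the number of support vectors usually shrinks as C grows
    print(f"C={C:>6}: {clf.support_.size} support vectors")
```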

Relation with the Hinge Loss

The objective of the Soft SVM can be related to the hinge loss. The hinge loss is defined as

$$ \mathcal{L}_{\text{Hinge}}(y^{(i)}, x^{(i)}) = \max(0, 1 - y^{(i)}(w^{T}x^{(i)} + b)) = \max(0, 1 - y^{(i)}f(x^{(i)})) $$

Then, it’s just minimizing a regularized Hinge Loss:

$$ \frac{1}{2} \lVert w \rVert^{2} + C\sum_{i = 1}^{N}\mathcal{L}_{\text{Hinge}}(y^{(i)}, x^{(i)}) $$

which is the same as the Soft SVM loss function. This is because for points that satisfy the margin constraint the hinge loss is zero, while for points that violate it the hinge loss equals $1 - y^{(i)}f(x^{(i)})$, which is exactly the optimal value of the slack variable $\xi_{i}$.
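A tiny numpy sketch (with made-up $w$, $b$, and points) of the equivalence: setting each slack to the hinge loss value recovers the soft-SVM objective.

```python
import numpy as np

# hypothetical parameters and points, only to evaluate the two objectives
w = np.array([1.0, -0.5])
b = 0.2
X = np.array([[1.0, 2.0], [-1.5, 0.3], [0.2, -0.1]])
y = np.array([1, -1, 1])
C = 10.0

f = X @ w + b
xi = np.maximum(0.0, 1.0 - y * f)   # hinge loss = optimal slack for fixed (w, b)

soft_svm_objective = 0.5 * np.dot(w, w) + C * xi.sum()
print(xi, soft_svm_objective)
```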

Deriving the optimization objective 🟩

One can see, using the same techniques, that the Lagrangian in this case is:

$$ \begin{align} \mathcal{L}(w, b, \xi,\alpha, \mu) = \frac{1}{2}\lVert w \rVert ^{2} + C\sum_{i} \xi_{i} - \sum_{i} \alpha_{i}(y^{(i)}(w^{T}x^{(i)} + b) - 1 + \xi_{i}) - \sum_{i} \mu_{i}\xi_{i} \end{align} $$

Then the stationary conditions are:

$$ \begin{cases} \frac{\partial \mathcal{L}}{\partial w} = 0 \implies w = \sum_{i} \alpha_{i}y^{(i)}x^{(i)} \\ \frac{\partial \mathcal{L}}{\partial b} = 0 \implies \sum_{i} \alpha_{i}y^{(i)} = 0 \\ \frac{\partial \mathcal{L}}{\partial \xi_{i}} = 0 \implies \alpha_{i} = C - \mu_{i} \end{cases} $$

Substituting these in, we have that the dual of the soft margin SVM classifier is:

$$ \begin{align} \min_{\alpha} \frac{1}{2} \sum_{i, j} \alpha_{i}\alpha_{j}y^{(i)}y^{(j)}x^{(i)T}x^{(j)} - \sum_{i} \alpha_{i} \\ \text{subject to} \sum_{i} \alpha_{i}y^{(i)} = 0 \\ 0 \leq \alpha_{i} \leq C \end{align} $$

Which is very similar to the original problem, but with the added constraint that $\alpha_{i} \leq C$.

NOTE: we have compacted three conditions into one:

$$ \forall i \begin{cases} C = \alpha_{i} + \mu_{i} \\ \mu_{i} \geq 0 \\ \alpha_{i} \geq 0 \end{cases} \implies 0 \leq \alpha_{i} \leq C $$
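As a sanity check of the box constraint, a hedged sketch (scikit-learn on synthetic data, an assumption for illustration) reads back the dual coefficients of a fitted soft-margin SVM and verifies $0 \leq \alpha_{i} \leq C$:

```python
import numpy as np
from sklearn.svm import SVC

# synthetic, non-separable toy data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.5, size=(40, 2)),
               rng.normal(+1.0, 1.5, size=(40, 2))])
y = np.array([-1] * 40 + [1] * 40)

C = 2.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alphas = np.abs(clf.dual_coef_).ravel()   # dual_coef_ holds alpha_i * y_i
print(alphas.min() >= 0.0, alphas.max() <= C + 1e-8)
```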

Kernelized SVM

We have seen in the previous analysis of SVM that $w^{*} = \sum_{i} \alpha_{i}y^{(i)}x^{(i)}$. If we define a feature map $\varphi$ with a corresponding kernel (see Kernel Methods), the same result becomes $w^{*} = \sum_{i} \alpha_{i}y^{(i)}\varphi(x^{(i)})$, so at inference we just need to compute:

$$ w^{*T}\varphi(x) = \sum_{i} \alpha_{i}y^{(i)}\varphi(x^{(i)})^{T} \varphi(x) = \sum_{i} \alpha_{i}y^{(i)}K(x^{(i)}, x) $$

We see kernels again! Usually Kernelized SVM is justified by the observation that data might be separable in higher dimensional spaces, so we just lift the data into that space and compute everything there. For example, take $1$-D data labeled $+1$ on the interval $[0, 1]$ and $-1$ elsewhere on the real line; this is clearly not linearly separable. But if we lift each point to $(x, x^{2})$ it becomes separable. This is the intuition behind how kernels work for SVMs too! A sketch of this lifting is given below.
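A minimal sketch of this 1-D example; the grid of points and the explicit feature map are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# 1-D points labeled +1 on [0, 1] and -1 elsewhere (synthetic grid)
x = np.linspace(-2.0, 3.0, 200)
y = np.where((x >= 0.0) & (x <= 1.0), 1, -1)

phi = np.column_stack([x, x ** 2])          # explicit lift x -> (x, x^2)
clf = SVC(kernel="linear", C=1e3).fit(phi, y)

# in the lifted space the classes are separated by the line x2 = x1
print((clf.predict(phi) == y).mean())       # should be 1.0, or very close
```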

Kernelized SVM loss 🟨

If we do the math, we get a very similar quadratic optimization problem:

$$ \begin{align} \min_{\alpha} \frac{1}{2} \sum_{i, j} \alpha_{i}\alpha_{j}y^{(i)}y^{(j)}K(x^{(i)}, x^{(j)}) - \sum_{i} \alpha_{i} \\ \text{subject to} \sum_{i} \alpha_{i}y^{(i)} = 0 \\ 0 \leq \alpha_{i} \end{align} $$

Inference with K-SVM 🟩-

And for inference, we do something quite similar:

$$ w^{*T}\varphi(x) + b = \sum_{i} \alpha_{i}y^{(i)}K(x^{(i)}, x) + b $$

As in this case we have that $w^{*} = \sum_{i} \alpha_{i}y^{(i)}\varphi(x^{(i)})$.
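A short sketch of this inference rule; the kernel choice and the fitted `alphas`, `b`, and support set are assumptions passed in from whatever solver was used:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2), one common kernel choice
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def predict(x, X_sv, y_sv, alphas, b, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b; the sign gives the class
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, y_sv, X_sv))
    return np.sign(score + b)
```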

Structured SVM

Structured SVMs attempt to generalize the idea of SVMs to multi-class prediction, or to prediction of structured objects (like parse trees covered in Probabilistic Parsing or tags in Part of Speech Tagging).

The generalized Margin

The main idea of SSVMs is to extend the idea of margins to a more general form (Crammer and Singer, JMLR 2001). Given $k$ classes and $y$ one possible configuration of these, the new constraint that we have is

$$ (w_{i}^{T}x + w_{i,0}) - \max_{j \neq i} (w^{T}_{j}x + w_{j, 0}) \geq m $$

where $m$ is our margin, and $i$ is the currently considered class among the $k$ possible (note we have different weight vectors for each class). If we want to extend this to structured data, then instead of ranging over class indices, we should take the maximum over all structures different from the one we are considering, inside the space of possible structures (which is often huge). A toy sketch of this margin check follows.
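The per-class weight matrix and intercepts here are hypothetical inputs, just to spell out the constraint:

```python
import numpy as np

def multiclass_margin(x, i, W, b):
    # W has one weight vector per class (shape k x d), b one intercept per class
    scores = W @ x + b
    others = np.delete(scores, i)            # scores of all classes j != i
    return scores[i] - others.max()          # must be >= m for a margin of m
```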

This generalization is not exactly the generalization we would expect for the binary classification case, but in any case it is useful to extend the technique to structured objects.

Then, after making this consideration, we can still use the functional-margin idea of re-normalizing, and the soft-SVM idea of adding a slack variable to account for possible errors.

Another way to generalize SVMs to multiclass is to train many models in a one-versus-rest manner, and then just take the model with the highest margin among them. But we will not discuss this case.

Problems with SSVMs

From the slides:

  1. Compact representation of the output space: If we allowed even just one parameter for each class, we would already have more parameters than we could ever hope to have enough training data for.
  2. Efficient prediction: Sequentially enumerating all possible classes may be infeasible, hence making a single prediction on a new example may be computationally challenging.
  3. Prediction error: The notion of a prediction error must be refined – for example, predicting a parse tree that is almost correct should be treated differently from predicting a tree that is completely wrong. Furthermore, the training data may be inconsistent.
  4. Efficient training: Efficient training algorithms are needed that have a run-time complexity sub-linear in the number of classes.

Scores and feature maps

To solve the first two problems, we usually define a feature map $\psi(y, x)$ that tells us something about the input and its possible relations with the structures. This is often considered the hard part and needs a lot of problem knowledge. To address the prediction-error problem, we define a notion of score, which often comes naturally with a given problem; it is typically modeled as $f_{y, x} = w^{T}\psi(y, x)$, which allows us to compare even similar solutions. These ideas have also been developed in the NLP course, e.g. Log Linear Models clearly define the concept of scores, which are quite common in that community. While the features are often derived from data, nowadays neural network models are also used (e.g. in Part of Speech Tagging they used BERT to estimate the emission, which is the probability of a certain word having a certain tag).

Then we just choose the class with the highest score as our prediction.
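A tiny sketch of this prediction rule, where the feature map `psi` and the candidate set `Y` are problem-specific assumptions (here small enough to enumerate):

```python
import numpy as np

def predict(x, w, psi, Y):
    # pick the structure/class with the highest linear score w^T psi(y, x)
    return max(Y, key=lambda y: float(w @ psi(y, x)))
```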

References

[1] Bishop, C. M. “Pattern Recognition and Machine Learning”. Springer, 2006.