This is a quite good resource on this part of Support Vector Machines (a step-by-step derivation); (Bishop 2006), chapter 7, is another good reference. The main idea of this supervised method is to separate the classes with a large gap. The separator is a hyperplane: when that hyperplane lives in a higher-dimensional feature space and is viewed back in the original lower-dimensional space, it can look like a non-linear separator. Once we have found this separator, the distance of a point from it gives an intuitive notion of confidence.
These models were very popular from the mid '90s to the early 2000s. With support vector machines, we generalize the idea of a linear separator.
Definitions
We consider a dataset $S = \left\{ (x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)}) \right\}$. This is a classical dataset with points $x^{(i)} \in \mathbb{R}^{d}$ and labels $y^{(i)} \in \left\{ -1, 1 \right\}$.
Functional Margin
Given a hyperplane $w^{T}x + b = 0$, the functional margin for the $i$-th point is $\lambda^{(i)} = y^{(i)}(w^{T}x^{(i)} + b)$. Why is this a good definition? We said that the main idea is to have a large gap between the hyperplane and the points. When a point is classified correctly we have $\lambda^{(i)} > 0$, and the magnitude of this margin can be interpreted as a measure of confidence. One caveat is that the functional margin is not scale-invariant: rescaling $w$ and $b$ rescales $\lambda^{(i)}$ by the same factor without changing the separator, so we should usually normalize by $\lVert w \rVert$ (or apply some other convenient transformation). The important thing to notice is the sign. The margin of the whole dataset is the worst case over the points:
$$ \lambda = \min_{i} \lambda^{(i)} $$
Geometrical Margin
The geometrical margin is the Euclidean distance of a point from the hyperplane. For a correctly classified positive point, moving from $x^{(i)}$ against the unit normal $w / \lVert w \rVert$ by $\lambda^{(i)}$ must land on the hyperplane:
$$ w^{T}\left( x^{(i)} - \frac{w}{\lVert w \rVert }\lambda^{(i)} \right) + b = 0 \implies \lambda^{(i)} = \frac{w^{T}x^{(i)} + b}{\lVert w \rVert } $$
(in general we multiply by $y^{(i)}$ so that the margin is positive for correctly classified points), and again
$$ \lambda = \min_{i}\lambda^{(i)} $$
We note that if $\lVert w \rVert = 1$ the geometrical margin equals the functional margin, so we keep the correct-classification property that the functional margin gives, and we also get the easy geometrical interpretation of the geometrical margin.
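To make the two definitions concrete, here is a minimal sketch in Python/NumPy (the toy points and the hyperplane are made-up assumptions for illustration) that computes both margins and checks that they coincide once $w$ is normalized:

```python
import numpy as np

# Toy dataset and an arbitrary separating hyperplane (illustrative assumptions)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

functional = y * (X @ w + b)                  # lambda^(i) = y^(i) (w^T x^(i) + b)
geometrical = functional / np.linalg.norm(w)  # divide by ||w|| to get distances

print("functional margins :", functional)
print("geometrical margins:", geometrical)
print("dataset margin     :", geometrical.min())

# With ||w|| = 1 the two notions coincide
w_unit, b_unit = w / np.linalg.norm(w), b / np.linalg.norm(w)
print(np.allclose(y * (X @ w_unit + b_unit), geometrical))  # True
```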
The technique
The primal problem
$$ \begin{array}{l} \max_{w, b, \lambda} \ \lambda \\ \lVert w \rVert = 1 \\ y^{(i)}(w^{T}x^{(i)} + b) \geq \lambda \quad \forall i \end{array} $$
This means we want to find the best geometrical margin. But some of the constraints are not nice to handle (the norm constraint $\lVert w \rVert = 1$ is not convex), so instead we re-scale $w$ and $b$ such that the third condition becomes $\geq 1$ and the second becomes implicit (the norm moves into the objective). This is possible because scaling $w$ and $b$ does not change the separating hyperplane, hence it does not change the final margin. Let's see this. Given the support points $x^{+}$ and $x^{-}$ (the closest points on the two sides), we want to find the $w$ and $b$ that maximize the margin. We can write the optimization problem in this way:
$$ \begin{align} \max \lVert \text{proj}_{+} - \text{proj}_{-} \rVert &= \max \frac{ w^{T}x^{+} - w^{T}x^{-} }{\lVert w \rVert } = \max \frac{2}{\lVert w \rVert} \\ \text{subject to } \ y^{(i)}(w^{T}x^{(i)} + b) &\geq 1 \end{align} $$
Here we used the definition of vector projection onto the direction of $w$, and the re-scaling above, which fixes $w^{T}x^{+} + b = 1$ and $w^{T}x^{-} + b = -1$: the numerator equals $2$, so maximizing the margin amounts to maximizing $1 / \lVert w \rVert$. We also need to note that we are still assuming the data is linearly separable.
In this way we reduce the problem to
$$ \begin{array}{l} \min \frac{1}{2} \lVert w \rVert ^{2} \\ y^{(i)}(w^{T}x^{(i)} + b) \geq 1 \quad \forall i \end{array} $$
which can be solved with Quadratic Programming and Lagrange multipliers. We now see how this is possible.
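As a concrete illustration, here is a minimal sketch of this primal solved as a QP with CVXPY (the toy dataset and the choice of CVXPY as the solver are assumptions of this example, not part of the derivation):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative assumption)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min 1/2 ||w||^2   s.t.   y^(i) (w^T x^(i) + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, " b* =", b.value)
print("margins:", y * (X @ w.value + b.value))  # >= 1, equality at the support vectors
```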
Functional and Geometrical Margin
The margin in the original problem is the geometrical margin (the distance corresponds to an actual geometric distance), while $y^{(i)}(w^{T}x^{(i)} + b)$ without normalization is the functional margin. Under the constraint $\lVert w \rVert = 1$ the two coincide, so the optimization objective is the same and we are free to re-scale $w$ and $b$.
The dual
$$ \mathcal{L}(w, b, \alpha) = \frac{1}{2}\lVert w \rVert ^{2} - \sum_{i} \alpha_{i}(y^{(i)}(w^{T}x^{(i)} + b) - 1) $$
where $\alpha_{i} \geq 0 \ \forall i$. We check that Slater's condition holds, which implies strong duality for this optimization problem. For it to hold we need to find $w$ and $b$ such that the inequality constraints hold strictly. This is simple: given some $w$ and $b$ for which equality holds at the support vectors (true by construction, because we set their functional margin to $1$), just re-scale both by a constant $c > 1$ and the inequalities become strict.
Now we set the gradient of the Lagrangian with respect to $w$ and $b$ to zero and substitute back, and we obtain the dual problem:
$$ \begin{align} \min_{\alpha} \ & \frac{1}{2} \sum_{i, j} \alpha_{i}\alpha_{j}y^{(i)}y^{(j)}x^{(i)T}x^{(j)} - \sum_{i} \alpha_{i} \\ \text{subject to } & \sum_{i} \alpha_{i}y^{(i)} = 0 \\ & \alpha_{i} \geq 0 \end{align} $$
Finding support vectors
From the stationarity condition, the optimal weights are a combination of the training points:
$$ w^{*} = \sum_{i} \alpha_{i}y^{(i)}x^{(i)} $$
Only the points with $\alpha_{i} > 0$ (the support vectors) contribute to this sum. Given a positive support vector $x^{+}$, the bias follows from the margin condition:
$$ w^{*T}x^{+} + b = 1 \implies b = 1 - w^{*T}x^{+} $$
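Putting the dual and the recovery of $w^{*}$ and $b$ together, here is a minimal CVXPY sketch (same assumed toy data as in the primal sketch above; the threshold used to detect non-zero $\alpha_{i}$ is also an arbitrary choice):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative assumption)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Xy = y[:, None] * X                      # rows are y_i x_i
alpha = cp.Variable(n)
# Dual objective: 1/2 || sum_i alpha_i y_i x_i ||^2 - sum_i alpha_i
objective = cp.Minimize(0.5 * cp.sum_squares(Xy.T @ alpha) - cp.sum(alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
support = a > 1e-6                             # support vectors have alpha_i > 0
w_star = ((a * y)[:, None] * X).sum(axis=0)    # w* = sum_i alpha_i y_i x_i
x_plus = X[support & (y > 0)][0]               # any positive support vector
b_star = 1 - w_star @ x_plus                   # from  w*^T x+ + b = 1

print("support vectors:", np.where(support)[0])
print("w* =", w_star, " b* =", b_star)
```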
Extensions of SVM
Soft SVM
In this case we assume the data is not linearly separable, so we add a relaxation to the optimization problem.
New setting of the problem
$$ \begin{array}{l} \min \frac{1}{2} \lVert w \rVert ^{2} + C\sum_{i} \xi_{i} \\ y^{(i)}(w^{T}x^{(i)} + b) \geq 1 - \xi_{i} \\ \xi_{i} \geq 0 \end{array} $$
Intuitively, we now allow some points to violate the margin, but we penalize these violations. The parameter $C$ is a hyper-parameter that we can tune to adjust the trade-off between the size of the margin and the amount of violation. Qualitatively, if $\xi_{i} = 0$ the point satisfies the margin; if $0 < \xi_{i} \leq 1$ the point is inside the margin but still correctly classified (with $\xi_{i} = 1$ it lies exactly on the separating hyperplane); and if $\xi_{i} > 1$ the point is misclassified. Higher values of $\xi_{i}$ are penalized through the $C$ term, so $C$ tells you how much you penalize margin violations.
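As a quick illustration of the role of $C$, here is a hedged sketch using scikit-learn's `SVC` with a linear kernel on made-up overlapping data (the dataset and the specific $C$ values are assumptions of this example):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up, slightly overlapping 2-D blobs (illustrative assumption)
X = np.vstack([rng.normal([2, 2], 1.0, size=(50, 2)),
               rng.normal([-2, -2], 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> wide margin, many margin violations tolerated;
    # large C -> narrow margin, violations heavily penalized.
    print(f"C={C:>6}: ||w|| = {np.linalg.norm(clf.coef_):.3f}, "
          f"#support vectors = {clf.support_.size}")
```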
Relation with the Hinge Loss
$$ \mathcal{L}_{\text{Hinge}}(y^{(i)}, x^{(i)}) = \max(0, 1 - y^{(i)}(w^{T}x^{(i)} + b)) = \max(0, 1 - y^{(i)}f(x^{(i)})) $$
The regularized empirical risk
$$ \frac{1}{2} \lVert w \rVert^{2} + C\sum_{i = 1}^{N}\mathcal{L}_{\text{Hinge}}(y^{(i)}, x^{(i)}) $$
is the same as the Soft SVM objective. This is because at the optimum $\xi_{i} = \max(0, 1 - y^{(i)}f(x^{(i)}))$: for points that satisfy the margin the hinge loss (and the slack) is zero, while for points that violate it the hinge loss is $1 - y^{(i)}f(x^{(i)})$, which is exactly the slack variable $\xi_{i}$.
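A minimal NumPy sketch (toy values assumed) that evaluates the hinge-loss form of the objective and prints the hinge losses, which equal the optimal slacks $\xi_{i}$:

```python
import numpy as np

def soft_svm_objective(w, b, X, y, C):
    """1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * hinge.sum(), hinge

# Toy values (illustrative assumptions)
X = np.array([[2.0, 2.0], [0.5, 0.2], [-1.0, -1.0], [0.1, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, C = np.array([1.0, 1.0]), 0.0, 1.0

obj, hinge = soft_svm_objective(w, b, X, y, C)
print("objective:", obj)
print("hinge losses (= optimal slacks xi_i):", hinge)
```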
Deriving the optimization objective
The Lagrangian of the soft problem is
$$ \mathcal{L}(w, b, \xi,\alpha, \mu) = \frac{1}{2}\lVert w \rVert ^{2} + C\sum_{i} \xi_{i} - \sum_{i} \alpha_{i}(y^{(i)}(w^{T}x^{(i)} + b) - 1 + \xi_{i}) - \sum_{i} \mu_{i}\xi_{i} $$
Setting the gradients to zero gives
$$ \begin{cases} \frac{\partial \mathcal{L}}{\partial w} = 0 \implies w = \sum_{i} \alpha_{i}y^{(i)}x^{(i)} \\ \frac{\partial \mathcal{L}}{\partial b} = 0 \implies \sum_{i} \alpha_{i}y^{(i)} = 0 \\ \frac{\partial \mathcal{L}}{\partial \xi_{i}} = 0 \implies \alpha_{i} = C - \mu_{i} \end{cases} $$
Substituting back, the dual becomes
$$ \begin{align} \min_{\alpha} \ & \frac{1}{2} \sum_{i, j} \alpha_{i}\alpha_{j}y^{(i)}y^{(j)}x^{(i)T}x^{(j)} - \sum_{i} \alpha_{i} \\ \text{subject to } & \sum_{i} \alpha_{i}y^{(i)} = 0 \\ & 0 \leq \alpha_{i} \leq C \end{align} $$
which is very similar to the hard-margin dual, but with the added constraint that $\alpha_{i} \leq C$. The box constraint follows from the stationarity condition on $\xi_{i}$ and the non-negativity of the multipliers:
$$ \forall i \quad \begin{cases} C = \alpha_{i} + \mu_{i} \\ \mu_{i} \geq 0 \\ \alpha_{i} \geq 0 \end{cases} \implies 0 \leq \alpha_{i} \leq C $$
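A small sketch of the soft-margin dual with this box constraint, again with CVXPY (the toy data, the value of $C$, and the tolerances are assumptions); note that $b$ is recovered from a point with $0 < \alpha_{i} < C$, which lies exactly on the margin:

```python
import numpy as np
import cvxpy as cp

# Slightly overlapping toy data (illustrative assumption)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1.5, 1.5], 1.0, size=(30, 2)),
               rng.normal([-1.5, -1.5], 1.0, size=(30, 2))])
y = np.array([1.0] * 30 + [-1.0] * 30)
C = 1.0

Xy = y[:, None] * X
alpha = cp.Variable(len(y))
# Same dual objective as before, plus the box constraint 0 <= alpha_i <= C
objective = cp.Minimize(0.5 * cp.sum_squares(Xy.T @ alpha) - cp.sum(alpha))
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w_star = Xy.T @ a
# b is recovered from a "margin" support vector, i.e. one with 0 < alpha_i < C
margin_sv = np.where((a > 1e-5) & (a < C - 1e-5))[0][0]
b_star = y[margin_sv] - w_star @ X[margin_sv]
print("w* =", w_star, " b* =", b_star)
```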
Kernelized SVM
$$ w^{*T}\varphi(x) = \sum_{i} \alpha_{i}y^{(i)}\varphi(x^{(i)})^{T} \varphi(x) = \sum_{i} \alpha_{i}y^{(i)}K(x^{(i)}, x) $$
We see kernels again! Usually Kernelized SVM is justified by the observation that data might be separable in a higher-dimensional space, so we just lift the data into that space and compute everything there. For example, take 1-D data labeled $1$ on the interval $[0, 1]$ and $-1$ everywhere else on the real line: this is clearly not linearly separable, but if we lift each point to $(x, x^{2})$ it becomes separable. This is the intuition behind how kernels work for SVMs too!
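A small sketch of exactly this 1-D example (the data generation and the use of scikit-learn are assumptions for illustration): a linear SVM cannot separate the raw feature, while on the lifted features $(x, x^{2})$ the classes become linearly separable:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# 1-D data: +1 inside [0, 1] (with a small gap to the negatives), -1 elsewhere
x_pos = rng.uniform(0.1, 0.9, size=100)
x_neg = np.concatenate([rng.uniform(-2.0, -0.1, size=50),
                        rng.uniform(1.1, 3.0, size=50)])
x = np.concatenate([x_pos, x_neg])
y = np.array([1] * 100 + [-1] * 100)

# No single threshold on the raw 1-D feature separates the classes
raw = SVC(kernel="linear", C=100.0).fit(x.reshape(-1, 1), y)
print("train accuracy, raw 1-D feature:", raw.score(x.reshape(-1, 1), y))

# After lifting to (x, x^2) the classes are linearly separable (by x_2 = x_1)
lifted = np.column_stack([x, x ** 2])
lift = SVC(kernel="linear", C=100.0).fit(lifted, y)
print("train accuracy, lifted features:", lift.score(lifted, y))
```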
Kernelized SVM loss
$$ \begin{align} \min_{\alpha} \ & \frac{1}{2} \sum_{i, j} \alpha_{i}\alpha_{j}y^{(i)}y^{(j)}K(x^{(i)}, x^{(j)}) - \sum_{i} \alpha_{i} \\ \text{subject to } & \sum_{i} \alpha_{i}y^{(i)} = 0 \\ & 0 \leq \alpha_{i} \end{align} $$
Inference with K-SVM
$$ w^{*T}\varphi(x) + b = \sum_{i} \alpha_{i}y^{(i)}K(x^{(i)}, x) + b $$
since in this case $w^{*} = \sum_{i} \alpha_{i}y^{(i)}\varphi(x^{(i)})$, and the predicted label is the sign of this quantity. Note that we never need to compute $\varphi$ explicitly: only kernel evaluations against the support vectors are required.
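A minimal NumPy sketch of this inference rule (the RBF kernel choice, its bandwidth, and the dual variables are made-up placeholders, only meant to show the shape of the computation):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2), computed pairwise."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def ksvm_predict(X_new, X_sv, y_sv, alpha_sv, b):
    """sign( sum_i alpha_i y_i K(x_i, x) + b ) over the support vectors only."""
    K = rbf_kernel(X_sv, X_new)            # shape (n_sv, n_new)
    scores = (alpha_sv * y_sv) @ K + b
    return np.sign(scores)

# Placeholder support vectors and dual variables (illustrative assumptions)
X_sv = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]])
y_sv = np.array([1.0, 1.0, -1.0])
alpha_sv = np.array([0.5, 0.5, 1.0])
b = 0.1

print(ksvm_predict(np.array([[0.5, 0.5], [-2.0, -2.0]]), X_sv, y_sv, alpha_sv, b))
```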
Structured SVM
Structured SVMs attempt to generalize the idea of SVMs to multi-class prediction, or to the prediction of structured objects (like parse trees, covered in Probabilistic Parsing, or tags in Part of Speech Tagging).
The generalized Margin
$$ (w_{i}^{T}x + w_{i,0}) - \max_{j \neq i} (w^{T}_{j}x + w_{j, 0}) \geq m $$
where $m$ is our margin and $i$ is the index of the class currently considered among the $k$ possible ones (note that we have a different weight vector for each class). If we want to extend this to structured data, then instead of ranging over class indexes we take the maximum over all competing structures in the space of possible structures (which is often huge).
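A small NumPy sketch of this multi-class margin for a single example (the per-class weights, the input, and the margin value $m$ are made-up for illustration):

```python
import numpy as np

# Made-up per-class weights w_j, biases w_j0, one input x, and a target margin m
W = np.array([[1.0, 0.5], [-0.3, 0.8], [0.2, -1.0]])   # one row per class (k = 3)
w0 = np.array([0.1, 0.0, -0.2])
x = np.array([2.0, 1.0])
i, m = 0, 1.0                                           # true class and required margin

scores = W @ x + w0                                     # w_j^T x + w_j0 for every class
others = np.delete(scores, i)
margin_i = scores[i] - others.max()                     # score gap to the best competitor
print("scores:", scores, " margin:", margin_i, " satisfied:", margin_i >= m)
```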
This is not exactly the generalization we would expect from the binary classification case, but it is nonetheless useful for extending the technique to structured objects.
After this consideration, we can still apply the functional-margin idea of re-normalizing, and the soft-SVM idea of adding slack variables to account for possible errors.
Another way to generalize SVMs to the multi-class case is to train many models in a one-versus-rest fashion and then pick the model with the highest margin among them. But we will not discuss this case.
Problems with SSVMs
From the slides:
- Compact representation of the output space: If we allowed even just one parameter for each class, we would already have more parameters than we could ever hope to have enough training data for.
- Efficient prediction: Sequentially enumerating all possible classes may be infeasible, hence making a single prediction on a new example may be computationally challenging.
- Prediction error: The notion of a prediction error must be refined: for example, predicting a parse tree that is almost correct should be treated differently from predicting a tree that is completely wrong. Furthermore, the training data may be inconsistent.
- Efficient training: Efficient training algorithms are needed that have a run-time complexity sub-linear in the number of classes.
Scores and feature maps
To solve the first two problems, we usually define a feature map $\psi(y, x)$ that captures the input and its possible relations with the structures. This is often considered the hard part, and it requires a lot of problem knowledge. To address the prediction-error problem, we define a notion of score, which often comes naturally with a given problem; it is usually modeled as $f_{y, x} = w^{T}\psi(y, x)$, which allows us to compare even similar solutions. These ideas have also been developed in the NLP course: e.g. Log Linear Models clearly define the concept of scores, which are quite common in that community. The features themselves are often derived from data, nowadays with neural network models (e.g. in Part of Speech Tagging, BERT was used to estimate the emission score, the probability of a certain word having a certain tag).
Then we just choose the class with the highest score as our prediction.
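A toy sketch of this score-and-argmax recipe (the feature map $\psi$, the weights, and the candidate set are made-up placeholders; in a real structured problem the candidate space is huge and the argmax requires a specialized algorithm):

```python
import numpy as np

# Hypothetical joint feature map psi(y, x): here just the input features placed
# in the block corresponding to the label, only to show the shape of the computation.
def psi(y, x, num_classes):
    phi = np.zeros(num_classes * len(x))
    phi[y * len(x):(y + 1) * len(x)] = x
    return phi

def predict(w, x, candidates, num_classes):
    """argmax_y  w^T psi(y, x) over an (enumerable) candidate set."""
    scores = {y: w @ psi(y, x, num_classes) for y in candidates}
    return max(scores, key=scores.get), scores

# Made-up weights and input (illustrative assumptions)
num_classes = 3
w = np.array([1.0, 0.5, -0.2, 0.3, 0.0, 1.2])   # one block of weights per class
x = np.array([2.0, 1.0])

y_hat, scores = predict(w, x, candidates=range(num_classes), num_classes=num_classes)
print("scores:", scores, " prediction:", y_hat)
```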
References
[1] Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.