A simple motivation
Fisher’s Linear Discriminant is a simple idea for finding a linear projection of our data along which the classes are easy to separate.
The image above, taken from (Bishop 2006), summarizes the idea. We clearly see that if we project onto the direction of maximum variance (see Principal Component Analysis) the projected data is not linearly separable, but if we take other criteria into account the separation becomes much cleaner.
A first approach
We want to maximize the distance between the class means, while minimizing the within-class variance. Let’s consider these two classes of points and set
$$ m_{j} = \frac{1}{N_{j}}\sum_{i : y_{i} = j} x_{i} $$which is just the mean of class $j$. If we project with a vector $w$, the projected mean of class $j$ is $w^{T}m_{j}$. In the simple case of two classes, we would like to maximize the projected distance $w^{T}(m_{2} - m_{1})$, and since we only care about the direction we add the constraint $\lVert w \rVert = 1$; under this constraint the maximizer is $w \propto m_{2} - m_{1}$. This is easy to see with Lagrange multipliers: we are solving the maximization problem
$$ \arg\max_{w} \; w^{T}(m_{2} - m_{1}) + \lambda (1 - \lVert w \rVert _{2}^{2}) $$Setting the gradient to zero gives $m_{2} - m_{1} = 2\lambda w$, i.e. $w \propto m_{2} - m_{1}$. But the problem with this formulation is that the projected classes can still overlap strongly, due to non-diagonal covariances of the class distributions. We would also like to minimize the within-class variance under the projection, so that we can better separate the two classes.
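As a concrete illustration of this first approach, here is a minimal NumPy sketch (the function and variable names are my own, and the data assumes binary labels $y \in \{0, 1\}$) that projects onto the raw mean-difference direction; with correlated classes the projections can still overlap considerably:

```python
import numpy as np

def mean_difference_direction(X, y):
    """Naive projection direction: w proportional to m2 - m1.

    X: (N, D) data matrix, y: (N,) array of binary labels {0, 1}.
    Returns a unit-norm vector, since only the direction matters.
    """
    m1 = X[y == 0].mean(axis=0)   # mean of class 0
    m2 = X[y == 1].mean(axis=0)   # mean of class 1
    w = m2 - m1                   # maximizer of w^T (m2 - m1) under ||w|| = 1
    return w / np.linalg.norm(w)

# Example usage with synthetic classes sharing a non-diagonal covariance
rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov, 100),
               rng.multivariate_normal([2.0, 2.0], cov, 100)])
y = np.array([0] * 100 + [1] * 100)
w = mean_difference_direction(X, y)
projected = X @ w   # the two projected classes may still overlap strongly
```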
Adding within-class variance
Let’s define the within-class variances after our linear projection to be
$$ s_{i}^{2} = \sum_{n \in \mathcal{C}_{i}} (w^{T}x_{n} - w^{T}m_{i})^{2} $$We then divide the squared distance between the projected means by the total within-class variance, getting the Fisher criterion
$$ J(w) = \frac{(w^{T}(m_{2} - m_{1}))^{2}}{s_{1}^{2} + s_{2}^{2}} = \frac{w^{T}S_{B}w}{w^{T}S_{W}w} $$The denominator is the sum of the within-class variances, expressed through the within-class scatter matrix $S_{W} = \sum_{i}\sum_{n \in \mathcal{C}_{i}} (x_{n} - m_{i})(x_{n} - m_{i})^{T}$, while the numerator involves the between-class scatter matrix $S_{B} = (m_{2} - m_{1})(m_{2} - m_{1})^{T}$.
Now, if we differentiate $J(w)$ with respect to $w$ and set the derivative to zero, we get $(w^{T}S_{B}w)\,S_{W}w = (w^{T}S_{W}w)\,S_{B}w$; since $S_{B}w$ always points in the direction of $m_{2} - m_{1}$ and we only care about the direction of $w$, the best solution is $w \propto S^{-1}_{W}(m_{2} - m_{1})$.
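A minimal NumPy sketch of this closed-form solution (again with illustrative names, and assuming $S_{W}$ is invertible; otherwise a pseudo-inverse or a small regularization term would be needed):

```python
import numpy as np

def fisher_direction(X, y):
    """Fisher's linear discriminant direction for two classes.

    Computes w proportional to S_W^{-1} (m2 - m1), where S_W is the
    within-class scatter matrix. X: (N, D), y: (N,) labels in {0, 1}.
    """
    X1, X2 = X[y == 0], X[y == 1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum over both classes of (x - m_i)(x - m_i)^T
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # solve S_W w = (m2 - m1)
    return w / np.linalg.norm(w)
```

Compared with the naive mean-difference direction above, this direction also accounts for the shared covariance structure of the classes, which is what reduces the overlap of the projections.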
References
[1] C. M. Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006.