See the series at https://peterbloem.nl/blog/pca. The main idea is to find the directions with the most variance in a dataset; those will be the principal components.

There is a very easy derivation in (Bishop & Bishop 2024). The technique is also known as the Kosambi–Karhunen–Loève transform (you will probably like this name more if you come from physics). Another good resource is Wilkinson.

Technique setting

Let’s say we have observations $\left\{ \boldsymbol{x}_{n} \right\}$ of dimension $D$, with $n \in \left\{ 1, \dots, N \right\}$. We want to project them onto a space of dimension $M < D$, with the constraint of maximizing the variance of the projected data. Let’s say $u$ is our basis matrix.
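To make the notation concrete, here is a minimal NumPy sketch of this setting; the data matrix `X`, the shapes `(N, D)`, the target dimension `M`, and the random seed are purely illustrative assumptions, not anything fixed by the derivation below.

```python
import numpy as np

# Hypothetical data: N observations of dimension D (values chosen only for illustration).
rng = np.random.default_rng(0)
N, D, M = 500, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # mix the columns so they are correlated

# A candidate basis matrix u with M orthonormal columns (random here, then orthonormalized).
u, _ = np.linalg.qr(rng.normal(size=(D, M)))

# Projected data: row n is u^T x_n.
Z = X @ u          # shape (N, M)
print(Z.shape)     # (500, 2)
```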

Maximum variance approach

If $M=1$ then we have a single direction $u$ (a unit vector in $\mathbb{R}^{D}$), and it’s easy to see that the projected points are $u^{T}\boldsymbol{x}_{n}$. From these we can easily calculate the statistics of the projected data. The mean:

$$ \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_{n} \implies \text{ mean is }u^{T}\bar{x} $$

The variance:

$$ \frac{1}{N} \sum_{n = 1}^{N} \left( u^{T}x_{n} - u^{T}\bar{x} \right) ^{2} = u^{T}Su $$

With $S$ the covariance matrix, defined as $S = \frac{1}{N} \sum_{n = 1}^{N} (x_{n} - \bar{x}) (x_{n} - \bar{x})^{T}$, so that $S_{ij} = \text{Cov}(x_{i}, x_{j})$. We then want to maximize $u^{T}Su$ with respect to $u$, under the condition $u^{T}u = 1$ so that $u$ is a normalized direction; for this we can use the Lagrange Multipliers technique explained in Duality Theory#Lagrangian function. So our optimization problem is

$$ \text{ maximize } f(u, \lambda) = u^{T}Su + \lambda(1 - u^{T}u) $$

And we use the multi-variable optimization tools, derived in Massimi minimi multi-variabile, to see that

$$ \frac{ \partial f(u, \lambda) }{ \partial u } = 2Su - 2\lambda u $$

Setting this equal to zero for a stationary point, we get $Su = \lambda u$, which is just the eigenvalue problem for $S$, explained here Determinanti. Now if we left-multiply by $u^{T}$ and use the property that $u^{T}u = 1$, we observe that the variance is exactly the Lagrange multiplier: $u^{T}Su = \lambda$. So the eigenvalues of $S$ are the variances along the corresponding eigenvectors, and this variance will be the largest when we choose the eigenvector with the largest eigenvalue. This is also why we call it the first principal component.
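Continuing the illustrative NumPy sketch from the setting (same hypothetical `X`, `N`, `D`), here is a minimal check of this result: build $S$, take its eigendecomposition, and verify that the projected variance along the top eigenvector equals the largest eigenvalue.

```python
# Covariance matrix S with the 1/N convention used above.
x_bar = X.mean(axis=0)                    # sample mean, shape (D,)
Xc = X - x_bar                            # centered data
S = (Xc.T @ Xc) / N                       # shape (D, D)
assert np.allclose(S, np.cov(X.T, bias=True))   # bias=True uses the same 1/N normalization

# Eigendecomposition of the symmetric matrix S (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(S)
u1 = eigvecs[:, -1]    # eigenvector with the largest eigenvalue: first principal component
lam1 = eigvals[-1]

# The variance of the projected data is exactly the eigenvalue / Lagrange multiplier.
assert np.isclose((X @ u1).var(), lam1)
assert np.isclose(u1 @ S @ u1, lam1)
```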

The above is an instance of the Rayleigh quotient optimization problem; the covariance matrix is just one special choice of symmetric matrix.

Derivation of the covariance property

Before going into the details, take this easier route into consideration. If you are not familiar with matrix calculations and are unsure about why $\frac{1}{N} \sum_{n = 1}^{N} \left( u^{T}x_{n} - u^{T}\bar{x} \right)^{2} = u^{T}Su$ is true, check the following component-wise argument:

We know that:

$$ \left( u^{T}x_{n} - u^{T}\bar{x} \right)^{2} = (u_{1}x_{1} + \dots + u_{D}x_{D} - u_{1}\bar{x}_{1} - \dots - u_{D}\bar{x}_{D})^{2} $$

$$ = \sum_{i = 1}^{D} \left( u_{i}^{2}x_{i}^{2} + u_{i}^{2}\bar{x}_{i}^{2} \right) + 2\sum_{i < j}^{D} u_{i}u_{j}x_{i}x_{j} - 2\sum_{i,j=1}^{D} u_{i}u_{j}x_{i}\bar{x}_{j} + 2\sum_{i < j}^{D} u_{i}u_{j}\bar{x}_{i}\bar{x}_{j} $$

Note: first I used $n$ to indicate the $n$-th vector; in the expansion above $x_{i}$ denotes the $i$-th component of that same $x_{n}$ (and the components run up to $D$, the data dimension), so take care of this index reuse.

Now let’s compare the last result with the covariance matrix. Dropping the $\frac{1}{N}\sum_{n}$ for brevity, the contribution of a single sample to the entry $S_{ij}$ is

$$ S_{ij} = (x_{n} - \bar{x})_{i} (x_{n} - \bar{x})_{j} = x_{i}x_{j} - x_{i}\bar{x}_{j} - \bar{x}_{i}x_{j} + \bar{x}_{i}\bar{x}_{j} $$

Summing $S_{ij} + S_{ji}$ we have $2(x_{i}x_{j} + \bar{x}_{i}\bar{x}_{j} - x_{i}\bar{x}_{j} - \bar{x}_{i}x_{j})$, which takes care of all the terms with $i \neq j$ when we compute $u_{i}S_{ij}u_{j} + u_{j}S_{ji}u_{i}$. When $i = j$ we observe that $S_{ii} = x_{i}^{2} + \bar{x}_{i}^{2} - 2x_{i}\bar{x}_{i}$, which accounts for the diagonal terms. This proves the relation; it is just a lot of sums.
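If the index bookkeeping feels error-prone, a quick numerical check (still on the illustrative data and the matrix `S` computed above) confirms the identity $\frac{1}{N} \sum_{n} \left( u^{T}x_{n} - u^{T}\bar{x} \right)^{2} = u^{T}Su$ for an arbitrary unit vector:

```python
# Pick an arbitrary unit vector u and check the scalar identity numerically.
u_test = rng.normal(size=D)
u_test /= np.linalg.norm(u_test)

lhs = np.mean((X @ u_test - x_bar @ u_test) ** 2)   # (1/N) sum_n (u^T x_n - u^T x_bar)^2
rhs = u_test @ S @ u_test                           # u^T S u
assert np.isclose(lhs, rhs)
```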

Multi-variable derivative

To the people that are not used to matrix derivatives (like me) it could be useful to see why

$$ \frac{ \partial u^{T}Su }{ \partial u } = 2Su $$

First, we note that when you differentiate with respect to a vector, the output has the same dimension as that vector. The notation just means differentiating with respect to every single component independently and then stacking the results, so it is better understood as

$$ \frac{ \partial u^{T}Su }{ \partial u } = \begin{bmatrix} \frac{ \partial u^{T}Su }{ \partial u_{1} } \\ \vdots \\ \frac{ \partial u^{T}Su }{ \partial u_{D} } \end{bmatrix} = \begin{bmatrix} 2(Su)_{1} \\ \vdots \\ 2(Su)_{D} \end{bmatrix} = 2Su $$

So we can prove each derivative independently; it’s just a lot of manual work! We see that $u^{T}Su$ is a quadratic form, studied in [Massimi minimi multi-variabile#Forme quadratiche](/notes/massimi-minimi-multi-variabile#forme-quadratiche), so it is just computing this:

$$ u^{T}Su = \sum_{i, j = 1}^{D} u_{i}u_{j}S_{ij} \implies \frac{ \partial u^{T}Su }{ \partial u_{1} } = 2u_{1}S_{11} + \sum_{j \neq 1}^{D}(u_{j}S_{1j} + u_{j}S_{j1}) = 2\left( u_{1}S_{11} + \sum_{j \neq 1}u_{j}S_{1j} \right) = 2(Su)_{1} $$

The last equality holds because $S$ is a symmetric matrix, and then we easily see that this is indeed the first component of the vector $Su$, multiplied by 2.
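As another sanity check on the illustrative data, the gradient $2Su$ can be compared against a central finite-difference approximation of the quadratic form (reusing `S`, `D`, and `u_test` from the snippets above):

```python
# Finite-difference check that d(u^T S u)/du = 2 S u.
def quad_form(u):
    return u @ S @ u

eps = 1e-6
grad_numeric = np.array([
    (quad_form(u_test + eps * e) - quad_form(u_test - eps * e)) / (2 * eps)
    for e in np.eye(D)   # e runs over the standard basis vectors
])
assert np.allclose(grad_numeric, 2 * S @ u_test, atol=1e-5)
```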

Proof by induction

Now that we know the general idea in the special case $M = 1$, we want to generalize this approach and show that it also works for a larger number of components $M$. We have the base case of our induction done; we assume inductively that the eigenvectors of the covariance matrix associated with the $M - 1$ largest eigenvalues $\lambda_{1}, \dots, \lambda_{M-1}$ are the best solution to our optimization problem with $M - 1$ components. By the inductive hypothesis those eigenvectors are orthonormal, and the projection matrix they define, which we call $w$ with $w : \mathbb{R}^{D} \to \mathbb{R}^{M-1}$, is the one that maximizes the projected variance. Let’s add a column $u$ to this projection matrix. Besides $u^{T}u = 1$, we have extra conditions caused by the fact that we want orthonormal columns. Let’s call $v_{1}, v_{2}, \dots, v_{M-1}$ the eigenvectors associated with the eigenvalues found so far; then the Lagrange Multipliers technique tells us to

$$ \text{ maximize } u^{T}Su + \lambda(1 - u^{T}u) + \sum_{k = 1}^{M-1} \mu_{k} (0 - u^{T}v_{k}) $$

We differentiate with respect to $u$, and setting the derivative to zero for a stationary point we have

$$ 2Su - 2\lambda u - \sum_{k= 1}^{M-1}\mu_{k}v_{k} = 0 $$

Now, to determine the coefficients $\mu_{k}$, the trick is to left-multiply this equation by $v_{l}^{T}$, and then we have

$$ 2v_{l}^{T}Su - 2\lambda v_{l}^{T}u - \sum_{k=1}^{M-1}\mu_{k}v_{l}^{T}v_{k} = 0 - 0 - \mu_{l} = 0 $$

The first two terms vanish because $v_{l}^{T}Su = (Sv_{l})^{T}u = \lambda_{l}v_{l}^{T}u$ and $v_{l}^{T}u = 0$ by the orthogonality constraint, while in the sum only the $k = l$ term survives since the $v_{k}$ are orthonormal. This implies that $\mu_{l}=0$. We can repeat this for every $l$ and show that all of those coefficients must be 0, so the stationarity condition falls back to the original eigenvalue problem $Su = \lambda u$. In this way we know that the new direction is the next available eigenvector, i.e. the one with the largest eigenvalue among those not already used, because any of the previous eigenvectors would violate the orthogonality constraints.
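To see the induction step on the illustrative data, one possible sketch (a deflation-style check, not the only way to do it) is to restrict the covariance matrix to the subspace orthogonal to the first principal component `u1` found earlier and maximize the variance there; the result agrees with the eigenvector of $S$ with the second-largest eigenvalue.

```python
# Deflation-style check of the induction step: maximize variance orthogonally to u1.
P = np.eye(D) - np.outer(u1, u1)    # projector onto the subspace orthogonal to u1
S_defl = P @ S @ P                  # the quadratic form restricted to that subspace

vals, vecs = np.linalg.eigh(S_defl)
u2 = vecs[:, -1]                    # top eigenvector of the deflated matrix

# It agrees (up to sign) with the second eigenvector of S, and it is orthogonal to u1.
u2_direct = eigvecs[:, -2]
assert np.isclose(abs(u2 @ u2_direct), 1.0)
assert abs(u2 @ u1) < 1e-8
```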

References

[1] Bishop & Bishop, “Deep Learning: Foundations and Concepts”, Springer International Publishing, 2024.