Principal Component Analysis is a technique used to reduce the dimensionality of a dataset. We can view it as an algorithm that compresses the data while retaining most of the important information. It is one of the earliest and simplest techniques in machine learning. Single Layer Autoencoders learn to approximate this kind of transformation.

The main idea is to find the directions with the most variance in a dataset. Those directions are called the principal components.

Some resources and trivia: there is a very easy derivation in (Bishop & Bishop 2024). PCA is also known as the Kosambi-Karhunen-Loève transform (you will probably like this name more if you come from physics). One good resource is the series at https://peterbloem.nl/blog/pca. Another good resource is Wilkinson.

Technique setting

Let’s say we have observations $\left\{ \boldsymbol{x}_{n} \right\}$ of dimension $D$, with $n \in \left\{ 1, \dots, N \right\}$. We want to project the data down to dimension $M$, with the constraint of maximizing the variance of the projected data. Let’s call $u$ our basis matrix.
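To fix notation, here is a minimal numpy sketch of the setting (the shapes and variable names are my own assumptions, not fixed by the text): the observations are stacked as rows of an $N \times D$ matrix, and a basis matrix with orthonormal columns projects them down to $M$ dimensions.

```python
# Minimal sketch of the setting (assumed shapes): the data is an (N, D) matrix
# whose rows are the observations x_n, and the basis u is a (D, M) matrix with
# orthonormal columns.
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 500, 5, 2

X = rng.normal(size=(N, D))                    # rows are the observations x_n
U = np.linalg.qr(rng.normal(size=(D, M)))[0]   # an arbitrary orthonormal basis (D x M)

Z = X @ U                                      # projected data, shape (N, M)
print(Z.shape)                                 # (500, 2)
```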

Maximum variance approach

If $M=1$ then we have a single direction $u$, and it’s easy to see that the projected points are $u^{T}\boldsymbol{x}_{n}$. Then we can easily compute their statistics. Mean:

$$ \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_{n} \implies \text{ mean is }u^{T}\bar{x} $$

The variance:

$$ \frac{1}{N} \sum_{n = 1}^{N} \left( u^{T}x_{n} - u^{T}\bar{x} \right) ^{2} = u^{T}Su $$

With $S$ the covariance matrix, defined as $S = \frac{1}{N} \sum_{n = 1}^{N} (x_{n} - \bar{x}) (x_{n} - \bar{x})^{T}$, so that $S_{ij} = \text{ Cov}(x_{i}, x_{j})$. We then maximize $u^{T}Su$ with respect to $u$, under the condition that $u^{T}u = 1$ so that $u$ is a normalized direction; for this we can use the Lagrange Multipliers technique explained in Duality Theory#Lagrangian function. So our optimization problem is

$$ \text{ maximize } f(u, \lambda) = u^{T}Su + \lambda(1 - u^{T}u) $$

And we use the multi-variable optimization tools, derived in Massimi minimi multi-variabile to see that

$$ \frac{ \partial f(u, \lambda) }{ \partial u } = 2Su - 2\lambda u $$

Setting this to zero for a stationary point we get $Su = \lambda u$, which means we just have to find the eigenvalues and eigenvectors of $S$, explained here Determinanti. Now, using the property that $u^{T}u = 1$, we observe that the variance is exactly the Lagrange multiplier: $u^{T}Su = \lambda$. So the eigenvectors of $S$ describe directions of variance, and this variance is largest when we choose the eigenvector with the largest eigenvalue. This is also why we call it the first principal component.

The above is also known as the Rayleigh quotient optimization problem; the covariance matrix is just the special case of interest here.
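As a sanity check of this result, here is a small numpy sketch (my own, not from the references): it builds $S$, takes the eigenvector with the largest eigenvalue as the first principal component, and verifies that the variance of the projected data equals that eigenvalue.

```python
# Sketch: the eigenvector of S with the largest eigenvalue maximizes the
# projected variance, and that variance equals the eigenvalue (u^T S u = lambda).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic toy data

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)    # covariance matrix, 1/N convention

eigvals, eigvecs = np.linalg.eigh(S)        # eigh: ascending eigenvalues, S is symmetric
u1 = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue

proj = X @ u1                               # u^T x_n for every n
print(np.isclose(proj.var(), eigvals[-1]))  # True: projected variance == largest eigenvalue
print(np.isclose(u1 @ S @ u1, eigvals[-1])) # True: u^T S u == lambda
```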

Derivation of the covariance property

Before going into the details, consider this easier route. If you are not familiar with matrix calculus and are unsure why the identity above is true, check the following argument:

We know that:

$$ \left( u^{T}x_{n} - u^{T}\bar{x} \right) ^{2} = (u_{1}x_{1} + \dots + u_{D}x_{D} - u_{1}\bar{x}_{1} - \dots - u_{D}\bar{x}_{D})^{2} $$

$$ = \sum_{i = 1}^{D} \left( u_{i}^{2}x_{i}^{2} + u_{i}^{2}\bar{x}_{i}^{2} \right) + 2\sum_{i < j}^{D} u_{i}u_{j}x_{i}x_{j} - 2\sum_{i,j=1}^{D} u_{i}u_{j}x_{i}\bar{x}_{j} + 2\sum_{i < j}^{D} u_{i}u_{j}\bar{x}_{i}\bar{x}_{j} $$

Note: earlier $n$ indicated the $n$-th vector, while here $i$ and $j$ index the components of that vector, so $x_{i}$ is shorthand for the $i$-th component of $x_{n}$; take care of this.

Now let’s compare this expansion with the covariance matrix. For a single sample, the $(i,j)$ entry of the outer product $(x_{n} - \bar{x})(x_{n} - \bar{x})^{T}$, whose average over $n$ gives $S_{ij}$, is

$$ (x_{n} - \bar{x})_{i} (x_{n} - \bar{x})_{j} = x_{i}x_{j} - x_{i}\bar{x}_{j} - \bar{x}_{i}x_{j} + \bar{x}_{i}\bar{x}_{j} $$

Summing the $(i,j)$ and $(j,i)$ terms for $i \neq j$ we get $2(x_{i}x_{j} + \bar{x}_{i}\bar{x}_{j} - x_{i}\bar{x}_{j} - \bar{x}_{i}x_{j})$, which accounts for all the cross terms when we compute $u_{i}S_{ij}u_{j} + u_{j}S_{ji}u_{i}$. When $i = j$ we have $x_{i}^{2} + \bar{x}_{i}^{2} - 2x_{i}\bar{x}_{i}$ on the diagonal. This proves the relation; it is just a matter of bookkeeping the sums.
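A quick numeric check of the identity $\frac{1}{N}\sum_{n}(u^{T}x_{n} - u^{T}\bar{x})^{2} = u^{T}Su$ for an arbitrary direction, again a sketch of my own:

```python
# Numeric check of the identity derived above:
# (1/N) * sum_n (u^T x_n - u^T x_bar)^2 == u^T S u, for an arbitrary u.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
u = rng.normal(size=4)
u /= np.linalg.norm(u)                     # normalized direction

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)

lhs = np.mean((X @ u - x_bar @ u) ** 2)    # variance of the scalar projections
rhs = u @ S @ u
print(np.isclose(lhs, rhs))                # True
```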

Check Multi Variable Derivatives for derivatives of vectors and matrices if you are uncomfortable with them.

Generalization of the Reduction

Now that we know the general idea in the special case where $M = 1$, we want to generalize the approach. Let $u$ be the eigenvector found in the first step and consider the new dataset:

$$ X_{1} = \left\{ x - (u^{T}x)\cdot u : x \in \mathcal{X} \right\} $$

We see that the mean of the new dataset is $\bar{x} - (u^{T}\bar{x})\cdot u$, while the new covariance matrix is

$$ S_{1} = S - \lambda_{1}\, u u^{T}, \qquad \lambda_{1} = u^{T}Su $$

This is shown by applying the covariance formula to the new dataset and using $Su = \lambda_{1}u$. Then you observe that the new matrix has exactly the same eigenvectors as the previous one, except that $u$ now has eigenvalue $0$; so if you apply the maximum variance procedure again you will get the eigenvector with the second largest eigenvalue, and so on.
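Here is a small numpy sketch (my own) of this deflation step: subtracting $\lambda_{1} u u^{T}$ from $S$ zeroes out the variance along $u$ while leaving the other eigenvectors untouched, so the top eigenvector of the deflated matrix is the second principal component.

```python
# Sketch of the deflation step: S_1 = S - lambda_1 * u1 u1^T keeps the remaining
# eigenvectors untouched, and its top eigenvector is the second principal component.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4)) @ np.diag([4.0, 2.0, 1.0, 0.5])
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)

eigvals, eigvecs = np.linalg.eigh(S)       # ascending order
u1, u2 = eigvecs[:, -1], eigvecs[:, -2]    # first and second principal components

S1 = S - eigvals[-1] * np.outer(u1, u1)    # deflated covariance matrix

vals1, vecs1 = np.linalg.eigh(S1)
top = vecs1[:, -1]                         # leading eigenvector of S1
print(np.isclose(np.abs(top @ u2), 1.0))   # True: aligned with the second PC (up to sign)
print(np.isclose(u1 @ S1 @ u1, 0.0))       # True: u1 now carries zero variance
```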

Approach by Induction

This section aims to give a proof by induction of the maximum variance approach. It has not been checked, so read it carefully.

We have the base case of our induction. Assume inductively that $\lambda_{1}, \dots, \lambda_{M}$, the $M$ largest eigenvalues of the covariance matrix, together with their orthonormal eigenvectors $v_{1}, v_{2}, \dots, v_{M}$, give the best solution to our optimization problem so far: the projection matrix $w : \mathbb{R}^{D} \to \mathbb{R}^{M}$ built from these eigenvectors is the one that maximizes the projected variance. We now want to add one more column $u$ to this projection matrix, maximizing its variance $u^{T}Su$, with the additional constraints that $u$ is normalized and orthogonal to the eigenvectors already found. The Lagrange Multipliers technique then tells us to

$$ \text{ maximize } u^{T}Su + \lambda(1 - u^{T}u) + \sum_{k = 1}^{M} \mu_{k} (0 - u^{T}v_{k}) $$

We differentiate with respect to $u$ and set the gradient to zero:

$$ 2Su - 2\lambda u - \sum_{k= 1}^{M}\mu_{k}v_{k} = 0 $$

To continue, the trick is to multiply this on the left by $v_{l}^{T}$, which gives

$$ 2v_{l}^{T}Su - 2\lambda v_{l}^{T}u - \sum_{k=1}^{M}\mu_{k}v_{l}^{T}v_{k} = 0 - 0 - \mu_{l}\lVert v_{l} \rVert ^{2} = -\mu_{l} $$

Since the left-hand side is zero at the stationary point, this implies $\mu_{l} = 0$. We can repeat this for every $l$ and show that all of these coefficients must be $0$, which brings us back to the original problem $Su = \lambda u$, now restricted to directions orthogonal to the eigenvectors already found. Hence the next column is the remaining eigenvector with the largest eigenvalue, since anything else would either not solve $Su = \lambda u$ or not be orthonormal to the previous ones.
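The same argument can be checked numerically: greedily maximizing the variance under orthogonality to the directions already found recovers the eigenvectors of $S$ in order of decreasing eigenvalue. A sketch of my own, which enforces the constraints with a projector onto the orthogonal complement:

```python
# Sketch: at each step, restrict S to the orthogonal complement of the directions
# already found and take the top eigenvector there; this greedy procedure
# recovers the eigenvectors of S in order of decreasing eigenvalue.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 4)) @ np.diag([5.0, 3.0, 1.5, 0.5])
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)
D = S.shape[0]

found = []                                   # the columns of the projection matrix w
for _ in range(D):
    P = np.eye(D)                            # projector onto the orthogonal complement
    for v in found:
        P -= np.outer(v, v)
    vals, vecs = np.linalg.eigh(P @ S @ P)   # constrained maximum variance direction
    found.append(vecs[:, -1])

W = np.column_stack(found)
_, ref = np.linalg.eigh(S)
ref = ref[:, ::-1]                           # reference eigenvectors, descending eigenvalues
print(np.allclose(np.abs(np.sum(W * ref, axis=0)), 1.0))   # True: same vectors up to sign
```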

References

[1] Bishop & Bishop, “Deep Learning: Foundations and Concepts”, Springer International Publishing, 2024.