Anomaly detection is a machine learning problem of great interest in industry. For example, a bank needs to flag problematic transactions, doctors need to spot signs of illness, and law enforcement needs to detect suspicious behavior (no Orwell here). The main difference from classification is that here we have no labeled classes.

Setting of the problem

Let’s say we have a set $X = \left\{ x_{1}, \dots, x_{n} \right\} \subseteq \mathcal{N} \subseteq \mathcal{X} = \mathbb{R}^{D}$, where $\mathcal{N}$ is the normal set and $X$ contains our samples. Since $\mathcal{N}$ is quite complex, we need an approximation to decide whether a point is normal or not: a function $\phi : \mathcal{X} \to \left\{ 0, 1 \right\}$ with $\phi(x) = 1 \iff x \not \in \mathcal{N}$.
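As a minimal sketch (all names and the threshold are my own choices), $\phi$ can be approximated by fitting a single Gaussian to the samples and flagging points whose estimated density falls below a threshold:

```python
import numpy as np

def phi(x, mean, cov, tau=1e-3):
    """Return 1 if x is anomalous (density below tau), else 0."""
    d = len(mean)
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    density = norm * np.exp(-0.5 * diff @ inv @ diff)
    return int(density < tau)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                # stand-in for the normal samples
mean, cov = X.mean(axis=0), np.cov(X.T)      # fit the Gaussian model

print(phi(np.zeros(2), mean, cov))           # near the mode: 0
print(phi(np.array([8.0, 8.0]), mean, cov))  # far from the data: 1
```

A single Gaussian is of course too rigid for a complex $\mathcal{N}$, which is why the rest of the section moves to mixtures of Gaussians.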

We want to project this high-dimensional space down to a low-dimensional one and then use a mixture of Gaussians. An important fact to understand: when you project high-dimensional random variables to a low dimension, the result tends to look Gaussian, and this resemblance is a sign of losing information. Projection Pursuit therefore measures non-Gaussianity in the low-dimensional space: a strongly non-Gaussian projection tells us that the high-dimensional data has structure which might be interesting. Once we have fitted the Gaussians, we can derive an anomaly score.
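This can be illustrated numerically (the setup and names below are my own): excess kurtosis is a classic non-Gaussianity index used in Projection Pursuit. It is zero for a Gaussian, so a clearly nonzero value along some direction signals hidden structure.

```python
import numpy as np

def excess_kurtosis(z):
    """Excess kurtosis: ~0 for Gaussian data, clearly nonzero for structure."""
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(42)
D, n = 50, 2000
X = rng.normal(size=(n, D))
# Hide structure along coordinate 0: two well-separated clusters.
X[:, 0] += np.where(rng.random(n) < 0.5, -4.0, 4.0)

random_dir = rng.normal(size=D)
random_dir /= np.linalg.norm(random_dir)
structured_dir = np.eye(D)[0]

# A random projection looks nearly Gaussian; the structured direction
# is bimodal, giving strongly negative excess kurtosis.
print(excess_kurtosis(X @ random_dir))
print(excess_kurtosis(X @ structured_dir))
```

Most random directions wash the two clusters together into something Gaussian-looking, which is exactly the information-loss phenomenon described above.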

Standard approach

The standard pipeline for this kind of problem is

  • Dimensionality reduction
  • Clustering approach to determine the probability of the various data points

Variance tells us something about informativeness, so we want to maximize the variance of the projected data. We also need to choose the right level of abstraction, trading off tractability against fidelity. Concretely, we need to choose a projection $\pi : \mathbb{R}^{D} \to \mathbb{R}^{d}$ with $d \ll D$. An easy way to maximize variance is Principal Component Analysis (PCA). Once we have the projection, we use a clustering algorithm to fit the Gaussians.
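The whole pipeline can be sketched as follows (a hypothetical example on synthetic data; the sizes, component counts, and variable names are my own): PCA plays the role of $\pi$, and the negative log-likelihood under a fitted Gaussian mixture serves as the anomaly score.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
D, d = 20, 2
X_train = rng.normal(size=(1000, D))
X_train[:, :2] *= 5.0                      # most variance lives in 2 directions

pi = PCA(n_components=d).fit(X_train)      # the projection pi : R^D -> R^d
Z = pi.transform(X_train)
gmm = GaussianMixture(n_components=3, random_state=0).fit(Z)

def anomaly_score(x):
    """Negative log-likelihood under the mixture (higher = more anomalous)."""
    z = pi.transform(np.atleast_2d(x))
    return -gmm.score_samples(z)[0]

typical = np.zeros(D)
outlier = np.zeros(D)
outlier[0] = 50.0                          # 10 standard deviations out
print(anomaly_score(typical), anomaly_score(outlier))
```

Thresholding this score recovers the indicator $\phi$ from the problem setting: declare $\phi(x) = 1$ whenever the score exceeds a chosen cutoff.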