Transformers

Introduction to the structure Transformers are just repeated blocks of attention layers, norms, and MLPs, followed by a final softmax on the last MLP layer and preceded by an encoding layer. The first encoding layer has to embed some information about the original structure: semantic information about the input and positional information about the input. Then we use the transformer blocks to process the input and get the final embedding layer. Positional encoding We need to keep positional information about the contents (one common encoding is sketched below)....

6 min · Xuanqiang 'Angelo' Huang
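
The preview above mentions that the first encoding layer must carry positional information. One common choice (assumed here for illustration; the full post may cover a different scheme) is the sinusoidal positional encoding of Vaswani et al., which, for position $pos$, dimension index $i$, and embedding dimension $d$, adds to the token embedding

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right).$$

Because each dimension is a sinusoid of a different frequency, the encoding of a shifted position is a linear function of the original one, which is what lets the attention layers make use of order information.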

Anomaly Detection

Anomaly detection is a problem in machine learning that is of great interest in industry. For example, a bank needs to identify problems in transactions, doctors need it to detect illnesses, and law enforcement looks for suspicious behaviors (no Orwell here). The main difference between this and classification is that here we have no classes. Setting of the problem Let’s say we have a set $X = \left\{ x_{1}, \dots, x_{n} \right\} \subseteq \mathcal{N} \subseteq \mathcal{X} = \mathbb{R}^{d}$. We say $\mathcal{N}$ is the normal set and $X$ are our samples, but $\mathcal{N}$ is quite complex, so we need an approximation to decide whether a sample is normal or not (one common formulation is sketched below)....

2 min · Xuanqiang 'Angelo' Huang
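
One common way to build the approximation mentioned above (a density-based sketch, assumed here rather than taken from the post) is to fit a density estimate $\hat{p}$ on the samples $X$ and approximate the normal set by a level set,

$$\hat{\mathcal{N}}_{\tau} = \left\{ x \in \mathcal{X} : \hat{p}(x) \geq \tau \right\},$$

declaring a point anomalous whenever $\hat{p}(x) < \tau$. The threshold $\tau$ is a tuning parameter (hypothetical here), typically chosen so that only a small fraction of the training samples falls below it.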

Ensemble Methods

The idea of ensemble methods goes back to Sir Francis Galton. In 1906, he noted that although no single person guessed the right value, the average estimate of a crowd of people predicted it quite well. The main idea of ensemble methods is to combine relatively weak classifiers into a highly accurate predictor. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee” (a standard weighted-vote form of such a committee is shown below)....

7 min · Xuanqiang 'Angelo' Huang
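
To make the “committee” concrete, here is the standard weighted majority vote used, for example, by AdaBoost (a generic sketch, not necessarily the exact procedure in the post): given weak classifiers $G_{1}, \dots, G_{M}$ with weights $\alpha_{1}, \dots, \alpha_{M}$, the committee predicts

$$G(x) = \operatorname{sign}\!\left( \sum_{m=1}^{M} \alpha_{m} G_{m}(x) \right),$$

where classifiers that perform better on the (reweighted) training data receive larger $\alpha_{m}$ and therefore more influence on the final vote.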

Logistic Regression

These notes are very basic. For slightly more advanced material, see Bayesian Linear Regression, Linear Regression methods. Introduction to logistic regression Justification of the method This is one of the classic models, created by Minsky a few decades ago. In this case we directly compute the value of $P(Y|X)$ during inference, so we speak of a discriminative model. Introduction to the problem Suppose that $Y$ are Boolean variables, the $X_{i}$ are continuous variables, and the $X_{i}$ are independent of one another (the standard form of the model is written out below)....

4 min · Xuanqiang 'Angelo' Huang
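
For a Boolean $Y$ and continuous features $x \in \mathbb{R}^{d}$, the discriminative model described above is usually written with a weight vector $w$ and bias $w_{0}$ (introduced here for illustration) as

$$P(Y = 1 \mid X = x) = \sigma\!\left(w^{\top}x + w_{0}\right) = \frac{1}{1 + \exp\!\left(-(w^{\top}x + w_{0})\right)},$$

with $P(Y = 0 \mid X = x) = 1 - P(Y = 1 \mid X = x)$; the parameters are typically fit by maximizing the conditional log-likelihood.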

Parametric Modeling

In this note we will first briefly discuss some of the main differences between the three main approaches to statistics: the Bayesian, the frequentist, and the statistical learning methods. We will then present the concept of an estimator, compare how it differs from approach to approach, and explain the maximum likelihood estimator and the Cramér-Rao bound. Short introduction to the statistical methods Bayesian 🟩 With Bayesian methods we often assume a prior on the parameters, often human-picked, which acts as a regularization term over the possible distributions that we are trying to model (both the estimator and the bound are written out below)....

13 min · Xuanqiang 'Angelo' Huang
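
As a brief reminder of the two objects named above (standard definitions, not a summary of the post’s derivation): for i.i.d. samples $x_{1}, \dots, x_{n}$ drawn from $p(x \mid \theta)$, the maximum likelihood estimator is

$$\hat{\theta}_{ML} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_{i} \mid \theta),$$

and, under regularity conditions, the Cramér-Rao bound states that any unbiased estimator $\hat{\theta}$ satisfies $\operatorname{Var}(\hat{\theta}) \geq \frac{1}{n\, I(\theta)}$, where $I(\theta) = \mathbb{E}\left[ \left( \partial_{\theta} \log p(x \mid \theta) \right)^{2} \right]$ is the Fisher information of a single sample.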