Naïve Bayes

Introduzione a Naïve Bayes NOTE: this note should be reviewed after the course I took in NLP. This is a very old note, not even well written. Bisognerebbe in primo momento avere benissimo in mente il significato di probabilità condizionata e la regola di naive Bayes in seguito. Bayes ad alto livello Da un punto di vista intuitivo non è altro che predire la cosa che abbiamo visto più spesso in quello spazio ...

Reading Time: 6 minutes ·  By Xuanqiang 'Angelo' Huang

Variational Inference

$$ p(\theta \mid x_{1:n}, y_{1:n}) = \frac{1}{z} p(y_{1:n} \mid \theta, x_{1:n}) p(\theta \mid x_{1:n}) \approx q(\theta \mid \lambda) $$For Bayesian Linear Regression we had high dimensional Gaussians which made the inference closed form, in general this is not true, so we need some kinds of approximation. Laplace approximation Introduction to the Idea $$ \psi(\theta) \approx \hat{\psi}(\theta) = \psi(\hat{\theta}) + (\theta-\hat{\theta} ) ^{T} \nabla \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta} ) ^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta} ) = \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta} ) ^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta} ) $$ We simplified the term on the first order because we are considering the mode, so the gradient should be zero for the stationary point. ...

Reading Time: 9 minutes ·  By Xuanqiang 'Angelo' Huang

Reti convoluzionali

Abbiamo trattato i modelli classici in Convolutional Neural Network. Con i vecchi files di notion Il Kernel I punti interessanti delle immagini sono solamente i punti di cambio solo che attualmente siamo in stato discreto, quindi ci è difficile usare una derivata, si usano kernel del tipo: $\left[ 1, 0, -1 \right]$, che sarà positivo se cresce verso sinistra, negativo se scende. feature map Sono delle mappe che rappresentano alcune informazioni interessanti della nostra immagine. ...

Reading Time: 1 minute ·  By Xuanqiang 'Angelo' Huang

Anomaly Detection

Anomaly detection is a problem in machine learning that is of a big interest in industry. For example a bank needs to identify problems in transactions, doctors need it to see illness, or suspicious behaviors for law (no Orwell here). The main difference between this and classification is that here we have no classes. Setting of the problem Let’s say we have a set $X = \left\{ x_{1}, \dots, x_{n} \right\} \subseteq \mathcal{N} \subseteq \mathcal{X} = \mathbb{R}^{d}$ We say this set is the normal set, and $X$ are our samples but it’s quite complex, so we need an approximation to say whether if a set is normal or not. We need a function $\phi : \mathcal{X} \to \left\{ 0, 1 \right\}$ with $\phi(x) = 1 \iff x \not \in \mathcal{N}$. ...

Reading Time: 2 minutes ·  By Xuanqiang 'Angelo' Huang

Active Learning

Active Learning concerns methods to decide how to sample the most useful information in a specific domain; how can you select the best sample for an unknown model? Gathering data is very costly, we would like to create some principled manner to choose the best data point to humanly label in order to have the best model. In this setting, we are interested in the concept of usefulness of information. One of our main goals is to reduce uncertainty, thus, Entropy-based (mutual information) methods are often used. For example, we can use active learning to choose what samples needs to be labelled in order to have highest accuracy on the trained model, when labelling is costly. ...

Reading Time: 13 minutes ·  By Xuanqiang 'Angelo' Huang

Beta and Dirichlet Distributions

The beta distribution The beta distribution is a powerful tool for modeling probabilities and proportions between 0 and 1. Here’s a structured intuition to grasp its essence: Core Concept The beta distribution, defined on $[0, 1]$, is parameterized by two shape parameters: α (alpha) and β (beta). These parameters dictate the distribution’s shape, allowing it to flexibly represent beliefs about probabilities, rates, or proportions. Key Intuitions a. “Pseudo-Counts” Interpretation α acts like “successes” and β like “failures” in a hypothetical experiment. Example: If you use Beta(5, 3), it’s as if you’ve observed 5 successes and 3 failures before seeing actual data. After observing x real successes and y real failures, the posterior becomes Beta(α+x, β+y). This makes beta the conjugate prior for the binomial distribution (bernoulli process). b. Shape Flexibility Uniform distribution: When α = β = 1, all values in [0, 1] are equally likely. Bell-shaped: When α, β > 1, the distribution peaks at mode = (α-1)/(α+β-2). Symmetric if α = β (e.g., Beta(5, 5) is centered at 0.5). U-shaped: When α, β < 1, density spikes at 0 and 1 (useful for modeling polarization, meaning we believe the model to only produce values at 0 or 1, not in the middle.). Skewed: If α > β, skewed toward 1; if β > α, skewed toward 0. c. Moments Mean: $α/(α+β)$ – your “expected” probability of success. Variance: $αβ / [(α+β)²(α+β+1)]$ – decreases as α and β grow (more confidence). $$ \text{Mode} = \frac{\alpha - 1}{\alpha + \beta - 2} $$The mathematical model $$ \text{Beta} (x \mid a, b) = \frac{1}{B(a, b)} \cdot x^{a -1 }(1 - x)^{b - 1} $$ Where $B(a, b) = \Gamma(a) \Gamma(b) / \Gamma( + b)$ And $\Gamma(t) = \int_{0}^{\infty}e^{-x}x^{t - 1} \, dx$ ...

Reading Time: 4 minutes ·  By Xuanqiang 'Angelo' Huang

Counterfactual Invariance

Machine learning cannot distinguish between causal and environment features. Shortcut learning Often we observe shortcut learning: the model learns some dataset dependent shortcuts (e.g. the machine that was used to take the X-ray) to make inference, but this is very brittle, and is not usually able to generalize. Shortcut learning happens when there are correlations in the test set between causal and non-causal features. Our object of interest should be the main focus, not the environment around, in most of the cases. For example, a camel in a grass land should still be recognized as a camel, not a cow. One solution could be engineering invariant representations which are independent of the environment. So having a kind of encoder that creates these representations. ...

Reading Time: 9 minutes ·  By Xuanqiang 'Angelo' Huang

Cross Validation and Model Selection

There is a big difference between the empirical score and the expected score; in the beginning, we had said something about this in Introduction to Advanced Machine Learning. We will develop more methods to better comprehend this fundamental principles. How can we estimate the expected risk of a particular estimator or algorithm? We can use the cross-validation method. This method is used to estimate the expected risk of a model, and it is a fundamental method in machine learning. ...

Reading Time: 5 minutes ·  By Xuanqiang 'Angelo' Huang

Dirichlet Processes

The DP (Dirichlet Processes) is part of family of models called non-parametric models. Non parametric models concern learning models with potentially infinite number of parameters. One of the classical application is unsupervised techniques like clustering. Intuitively, clustering concerns in finding compact subsets of data, i.e. finding groups of points in the space that are particularly close by some measure. The Dirichlet Process See Beta and Dirichlet Distributions for the definition and intuition of these two distributions. One quite important thing that Dirichlet allows to do is the ability of assigning an ever growing number of clusters to data. This models are thus quite flexible to change and growth. ...

Reading Time: 7 minutes ·  By Xuanqiang 'Angelo' Huang

Ensemble Methods

The idea of ensemble methods goes back to Sir Francis Galton. In 787, he noted that although not every single person got the right value, the average estimate of a crowd of people predicted quite well. The main idea of ensemble methods is to combine relatively weak classifiers into a highly accurate predictor. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.” ...

Reading Time: 6 minutes ·  By Xuanqiang 'Angelo' Huang