Language Models

In order to understand language models we need to understand structured prediction. If you are familiar with Sentiment Analysis, where given an input text we need to classify it in a binary manner, in this case the output space usually scales in an exponential manner. The output has some structure, for example it could be a tree, it could be a set of words etc… This usually needs an intersection between statistics and computer science. ...

2 min · Xuanqiang 'Angelo' Huang

Linear Regression methods

We will present some methods related to regression methods for data analysis. Some of the work here is from (Hastie et al. 2009). This note does not treat the bayesian case, you should see Bayesian Linear Regression for that. Problem setting $$ Y = \beta_{0} + \sum_{j = 1}^{d} X_{j}\beta_{j} $$We usually don’t know the distribution of $P(X)$ or $P(Y \mid X)$ so we need to assume something about these distributions. ...

9 min · Xuanqiang 'Angelo' Huang

Log Linear Models

Log Linear Models can be considered the most basic model used in natural languages. The main idea is to try to model the correlations of our data, or how the posterior $p(y \mid x)$ varies, where $x$ is our single data point features and $y$ are the labels of interest. This is a form of generalization because contextualized events (x, y) with similar descriptions tend to have similar probabilities. These kinds of models are so common that it has been discovered in many fields (and thus assuming different names): some of the most famous are Gibbs distributions, undirected graphical models, Markov Random Fields or Conditional Random Fields, exponential models, and (regularized) maximum entropy models. Special cases include logistic regression and Boltzmann machines. ...

5 min · Xuanqiang 'Angelo' Huang

Markov Processes

Andiamo a parlare di processi Markoviani. Dobbiamo avere bene a mente il contenuto di Markov Chains prima di approcciare questo capitolo. Markov property Uno stato si può dire di godere della proprietà di Markov se, intuitivamente parlando, possiede già tutte le informazioni necessarie per predire lo stato successivo, ossia, supponiamo di avere la sequenza di stati $(S_n)_{n \in \mathbb{N}}$, allora si ha che $P(S_k | S_{k-1}) = P(S_k|S_0S_1...S_{k - 1})$, ossia lo stato attuale in $S_{k}$ dipende solamente dallo stato precedente. ...

12 min · Xuanqiang 'Angelo' Huang

Markup

Introduzione alle funzioni del markup 🟩 La semantica di una parola è caratterizzata dalla mia scelta (design sul significato). Non mi dice molto, quindi proviamo a raccontare qualcosa in più. Definiamo markup ogni mezzo per rendere esplicita una particolare interpretazione di un testo. In particolare è un modo per esplicitare qualche significato. (un po’ come la punteggiatura, che da qualche altra informazione oltre le singole parole, rende più chiaro l’uso del testo). ...

8 min · Xuanqiang 'Angelo' Huang

On The Double Descent Phenomenon

Double descent is a striking phenomenon in modern machine learning that challenges the traditional bias–variance tradeoff. In classical learning theory, increasing model complexity beyond a certain point is expected to increase test error because the model starts to overfit the training data. However, in many contemporary models—from simple linear predictors to deep neural networks—a second descent in test error emerges as the model becomes even more overparameterized. At its core, the double descent curve can be understood in three stages. In the first stage, as the model’s capacity increases, the error decreases because the model is better able to capture the underlying signal in the data. As the model approaches the interpolation threshold—where the number of parameters is roughly equal to the number of data points—the model fits the training data exactly. This exact fitting, however, makes the model extremely sensitive to noise, leading to a spike in test error. Surprisingly, when the model complexity is increased further into the highly overparameterized regime, the training algorithm (often stochastic gradient descent) tends to select from the many possible interpolating solutions one that exhibits desirable properties such as lower norm or smoothness. This implicit bias toward simpler, more generalizable solutions causes the test error to decrease again, producing the second descent. ...

3 min · Xuanqiang 'Angelo' Huang

Parametric Modeling

In this note we will first talk about briefly some of the main differences of the three main approaches regarding statistics: the bayesian, the frequentist and the statistical learning methods and then present the concept of the estimator, compare how the approaches differ from method to method, we will explain maximum likelihood estimator and the Rao-Cramer Bound. Short introduction to the statistical methods Bayesian 🟩 $$ p(\theta \mid X) = \frac{1}{z}p(X \mid \theta) p(\theta) $$The quantity $P(X \mid \theta)$ could be very complicated if our model is complicated. ...

11 min · Xuanqiang 'Angelo' Huang

Part of Speech Tagging

What is a part of Speech? A part of speech (POS) is a category of words that display similar syntactic behavior, i.e., they play similar roles within the grammatical structure of sentences. It has been known since the Latin era that some categories of words behave similarly (verbs for declination for example). The intuitive take is that knowing a specific part of speech can help understand the meaning of the sentence. ...

5 min · Xuanqiang 'Angelo' Huang

Performance at Large Scales

Some specific phenomenons in modern systems happen only when we scale into large systems. This note will gather some observations about the most important phenomena we observe at these scales. Tail Latency Phenomenon Tail latency refers to the high-end response time experienced by When scaling our services, using Massive Parallel Processing, and similar technology, it is not rare that a small percentage of requests in a system experience a high-end response time, typically measured at the 95th or 99th percentile. This significant delays that can degrade user experience or system reliability. ...

3 min · Xuanqiang 'Angelo' Huang

Planning

There is huge literature on planning. We will attack this problem from the view of probabilistic artificial intelligence. In this case we focus on continuous, fully observed with non-linear transitions, an environment often used for robotics. It’s called Model Predictive Control (MPC). \[...\] Moreover, modeling uncertainty in our model of the environment can be extremely useful in deciding where to explore. Learning a model can therefore help to dramatically reduce the sample complexity over model-free techniques. ...

8 min · Xuanqiang 'Angelo' Huang