Kalman Filters

Here is a historical treatment of the topic: https://jwmi.github.io/ASM/6-KalmanFilter.pdf. Kalman filters are defined as follows: we start with a variable $X_{0} \sim \mathcal{N}(\mu, \Sigma)$, and then we have a motion model and a sensor model: $$ \begin{cases} X_{t + 1} = FX_{t} + \varepsilon_{t} & F \in \mathbb{R}^{d\times d}, \varepsilon_{t} \sim \mathcal{N}(0, \Sigma_{x})\\ Y_{t} = HX_{t} + \eta_{t} & H \in \mathbb{R}^{m \times d}, \eta_{t} \sim \mathcal{N}(0, \Sigma_{y}) \end{cases} $$Inference then reduces to manipulating Gaussians. One can interpret $Y_{t}$ as the observations and $X_{t}$ as the underlying belief about a certain state. Kalman filters satisfy the Markov property, see Markov Chains. These independence properties allow an easy characterization of the joint distribution for Kalman filters: ...
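To make the claim that inference reduces to manipulating Gaussians concrete, here is a minimal sketch of one predict/update cycle in NumPy, using the $F$, $H$, $\Sigma_x$, $\Sigma_y$ notation above; the function name and the explicit matrix inverse are illustrative choices, not the note's implementation.

```python
import numpy as np

def kalman_step(mu, Sigma, y, F, H, Sigma_x, Sigma_y):
    """One predict/update cycle for X_{t+1} = F X_t + eps, Y_t = H X_t + eta."""
    # Predict: push the current Gaussian belief through the motion model.
    mu_pred = F @ mu
    Sigma_pred = F @ Sigma @ F.T + Sigma_x

    # Update: condition the predicted Gaussian on the new observation y.
    S = H @ Sigma_pred @ H.T + Sigma_y           # innovation covariance
    K = Sigma_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (y - H @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ H) @ Sigma_pred
    return mu_new, Sigma_new
```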

3 min · Xuanqiang 'Angelo' Huang

Kernel Methods

As we will briefly see, kernels play an important role in many machine learning applications. In this note we will get to know what kernels are and why they are useful. Intuitively, they measure the similarity between two input points: if the points are close, the kernel value should be large; otherwise it should be small. We briefly state the requirements of a kernel, then argue with a simple example why kernels are useful. ...
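As a concrete illustration of "close points get a large value", here is a minimal sketch of one standard kernel, the RBF (Gaussian) kernel; the bandwidth $\gamma$ and the test points are assumptions for the example.

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    """RBF kernel: near 1 for close inputs, near 0 for distant ones."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x = np.array([0.0, 0.0])
print(rbf_kernel(x, np.array([0.1, 0.0])))  # close pair: about 0.99
print(rbf_kernel(x, np.array([3.0, 4.0])))  # far pair: about 1e-11
```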

9 min · Xuanqiang 'Angelo' Huang

Linear Regression methods

We will present some regression methods for data analysis. Some of the work here is from (Hastie et al. 2009). This note does not treat the Bayesian case; see Bayesian Linear Regression for that. Problem setting $$ Y = \beta_{0} + \sum_{j = 1}^{d} X_{j}\beta_{j} $$We usually don't know the distribution of $P(X)$ or $P(Y \mid X)$, so we need to assume something about these distributions. One approach is to assume $Y \mid X \sim \mathcal{N}(f(X), \sigma^{2}I)$ and then maximize the log-likelihood. If we use a statistical learning approach, we want to solve $\arg \min_{f} \sum_{i = 1}^{n} (y_{i} - f(x_{i}))^{2}$, which is what is most often used. Both approaches lead to the same solution. Usually for these kinds of problems we use the Least Squares Method, initially studied in Minimi quadrati. We will describe it again, in more detail, in this setting. Let's consider the linear model ...
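As a sketch of the least squares fit for this linear model (synthetic data; the coefficients and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
beta_true = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ beta_true + rng.normal(scale=0.1, size=n)   # beta_0 = 1.0

# Add an intercept column and solve argmin_beta ||y - X beta||^2.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -1.0, 0.5]
```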

9 min · Xuanqiang 'Angelo' Huang

Markov Processes

Let us talk about Markov processes. We should have the content of Markov Chains well in mind before approaching this chapter. Markov property A state is said to satisfy the Markov property if, intuitively speaking, it already contains all the information needed to predict the next state. That is, suppose we have a sequence of states $(S_n)_{n \in \mathbb{N}}$; then $P(S_k \mid S_{k-1}) = P(S_k \mid S_0 S_1 \dots S_{k - 1})$, i.e. the current state $S_{k}$ depends only on the previous state. ...
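A minimal simulation sketch of the Markov property (the two-state transition matrix below is an assumption for the example): conditioning on more history than the previous state should not change the next-state distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],   # assumed transition matrix over states {0, 1}
              [0.4, 0.6]])

# Simulate the chain S_0, S_1, ...
T = 200_000
s = np.zeros(T, dtype=int)
for t in range(1, T):
    s[t] = rng.choice(2, p=P[s[t - 1]])

# Empirical P(S_k = 1 | S_{k-1} = 0) vs P(S_k = 1 | S_{k-1} = 0, S_{k-2} = 1):
one_step = np.mean(s[2:][s[1:-1] == 0])
two_step = np.mean(s[2:][(s[1:-1] == 0) & (s[:-2] == 1)])
print(one_step, two_step)  # both close to P[0, 1] = 0.1
```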

12 min · Xuanqiang 'Angelo' Huang

Markup

Introduction to the functions of markup The semantics of a word is characterized by a choice I make (a design decision about meaning). That alone does not say much, so let us try to tell a bit more. We define markup as any means of making a particular interpretation of a text explicit. In particular, it is a way of making some meaning explicit (a bit like punctuation, which adds information beyond the individual words and makes the intended use of the text clearer). ...

8 min · Xuanqiang 'Angelo' Huang

On The Double Descent Phenomenon

Double descent is a striking phenomenon in modern machine learning that challenges the traditional bias–variance tradeoff. In classical learning theory, increasing model complexity beyond a certain point is expected to increase test error because the model starts to overfit the training data. However, in many contemporary models—from simple linear predictors to deep neural networks—a second descent in test error emerges as the model becomes even more overparameterized. At its core, the double descent curve can be understood in three stages. In the first stage, as the model’s capacity increases, the error decreases because the model is better able to capture the underlying signal in the data. As the model approaches the interpolation threshold—where the number of parameters is roughly equal to the number of data points—the model fits the training data exactly. This exact fitting, however, makes the model extremely sensitive to noise, leading to a spike in test error. Surprisingly, when the model complexity is increased further into the highly overparameterized regime, the training algorithm (often stochastic gradient descent) tends to select from the many possible interpolating solutions one that exhibits desirable properties such as lower norm or smoothness. This implicit bias toward simpler, more generalizable solutions causes the test error to decrease again, producing the second descent. ...
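The second descent can be reproduced with a small experiment: minimum-norm least squares on random Fourier features, sweeping the number of features past the interpolation threshold. All sizes, the noise level, and the feature scale below are illustrative assumptions, not values from the note.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 40, 500

def target(x):
    return np.sin(2 * np.pi * x)

x_tr = rng.uniform(-1, 1, n_train)
y_tr = target(x_tr) + 0.3 * rng.normal(size=n_train)
x_te = rng.uniform(-1, 1, n_test)
y_te = target(x_te)

def features(x, W, b):
    # Random Fourier features: phi_j(x) = cos(W_j * x + b_j)
    return np.cos(np.outer(x, W) + b)

for p in [5, 10, 20, 40, 80, 200, 1000]:       # number of features
    W = rng.normal(scale=5.0, size=p)
    b = rng.uniform(0, 2 * np.pi, p)
    Phi_tr, Phi_te = features(x_tr, W, b), features(x_te, W, b)
    # Minimum-norm least-squares fit: interpolates the training set once p >= n_train.
    beta = np.linalg.pinv(Phi_tr) @ y_tr
    test_mse = np.mean((Phi_te @ beta - y_te) ** 2)
    print(f"p = {p:5d}   test MSE = {test_mse:.3f}")
# The test error typically spikes around p ~ n_train and falls again for large p.
```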

3 min · Xuanqiang 'Angelo' Huang

Planning

There is a huge literature on planning. We will attack this problem from the viewpoint of probabilistic artificial intelligence. In this case we focus on continuous, fully observed environments with non-linear transitions, a setting often used in robotics. The method is called Model Predictive Control (MPC). [...] Moreover, modeling uncertainty in our model of the environment can be extremely useful in deciding where to explore. Learning a model can therefore help to dramatically reduce the sample complexity compared to model-free techniques. ...
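A minimal sketch of the MPC loop via random shooting on a toy 1D double integrator; the dynamics, horizon, cost, and sample count are all assumptions for illustration. At every step we sample candidate action sequences, roll each one out through the model, execute only the first action of the best sequence, and then replan.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(state, a):
    """Toy known model: 1D position/velocity with acceleration input a."""
    pos, vel = state
    return np.array([pos + 0.1 * vel, vel + 0.1 * a])

def cost(state, a):
    return state[0] ** 2 + 0.01 * a ** 2     # drive the position to 0, penalize effort

def mpc_action(state, horizon=15, n_samples=256):
    # Random shooting: sample action sequences and evaluate them in the model.
    seqs = rng.uniform(-1, 1, size=(n_samples, horizon))
    totals = np.zeros(n_samples)
    for i, seq in enumerate(seqs):
        s = state
        for a in seq:
            totals[i] += cost(s, a)
            s = dynamics(s, a)
    return seqs[np.argmin(totals), 0]        # execute only the first action, then replan

state = np.array([1.0, 0.0])
for t in range(50):
    state = dynamics(state, mpc_action(state))
print(state)   # the position should have been driven towards 0
```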

8 min · Xuanqiang 'Angelo' Huang

Probabilistic Parsing

Language Constituents A constituent is a word or a group of words that functions as a single unit within a hierarchical structure. This is because there is a lot of evidence pointing towards a hierarchical organization of human language. Examples of constituents Let's look at some examples: John speaks [Spanish] fluently John speaks [Spanish and French] fluently Mary programs the homework [in the ETH computer laboratory] Mary programs the homework [in the laboratory] ...
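The bracketed constituents above can be written as a tree. A small sketch using nltk; the specific bracketing is one plausible analysis chosen for illustration, not the note's official parse.

```python
from nltk import Tree

# One plausible constituency analysis of "John speaks Spanish fluently".
t = Tree.fromstring("(S (NP John) (VP (V speaks) (NP Spanish) (AdvP fluently)))")
t.pretty_print()
print(t.leaves())   # ['John', 'speaks', 'Spanish', 'fluently']
print(t[1])         # the VP constituent, itself a single unit
```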

5 min · Xuanqiang 'Angelo' Huang

Probably Approximately Correct Learning

PAC Learning is one of the most famous theories in learning theory. Learning theory is concerned with answering questions like: What is learnable? (Somewhat akin to La macchina di Turing for computability theory.) How well can you learn something? PAC is a framework that allows us to answer these questions formally. There is also a Bayesian version of PAC on which there is a lot of research. Some definitions Empirical Risk Minimizer and Errors $$ \arg \min_{\hat{c} \in \mathcal{H}} \hat{R}_{n}(\hat{c}) $$ where $\hat{R}_{n}(\hat{c})$ is the empirical error. ...
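To make the ERM definition concrete, here is a minimal sketch with an assumed finite hypothesis class of threshold classifiers $\hat{c}_{\theta}(x) = \mathbb{1}[x \geq \theta]$, picking the hypothesis with the smallest empirical error:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
y = (x >= 0.3).astype(int)                      # true concept: threshold at 0.3
y = np.where(rng.random(n) < 0.05, 1 - y, y)    # 5% label noise

thresholds = np.linspace(0, 1, 101)             # finite hypothesis class H

def empirical_risk(theta):
    return np.mean((x >= theta).astype(int) != y)

risks = np.array([empirical_risk(theta) for theta in thresholds])
erm = thresholds[np.argmin(risks)]
print(erm, risks.min())   # ERM threshold near 0.3, empirical error near 0.05
```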

11 min · Xuanqiang 'Angelo' Huang

Querying Denormalized Data

TODO: write the introduction to the note. JSONiq presents itself as an easy query language that can run everywhere. It attempts to solve common shortcomings of SQL, i.e. the lack of support for nested data structures and for JSON data types. A nice thing about JSONiq is that it is functional, which makes its queries quite powerful and flexible. It is also declarative and set-based, which it has in common with SQL. ...
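Since the code sketches in this listing all use Python, here is the kind of nested-data query that JSONiq is designed to express declaratively, written imperatively over an assumed small JSON document (the document and field names are made up for illustration):

```python
import json

# An assumed nested document of the kind that flat relational tables handle awkwardly.
doc = json.loads("""
{
  "orders": [
    {"id": 1, "customer": "Ada",  "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"id": 2, "customer": "Alan", "items": [{"sku": "A", "qty": 5}]}
  ]
}
""")

# "For each order, return the customer and the total quantity" -- the kind of
# FLWOR-style query JSONiq would state declaratively in a few lines.
result = [
    {"customer": order["customer"],
     "total_qty": sum(item["qty"] for item in order["items"])}
    for order in doc["orders"]
]
print(result)  # [{'customer': 'Ada', 'total_qty': 3}, {'customer': 'Alan', 'total_qty': 5}]
```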

6 min · Xuanqiang 'Angelo' Huang