Notes

Gaussian Processes

Gaussian processes can be viewed through a Bayesian lens on function space: rather than sampling over individual data points, we sample over entire functions. They extend the idea of Bayesian linear regression by introducing an infinite number of feature functions of the input $X$. In geostatistics, Gaussian process regression is known as kriging, and many other models, such as Kalman Filters or radial basis function networks, can be understood as special cases of Gaussian processes. In this framework, certain functions are more likely than others, and we aim to model this probability distribution. ...
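The function-space view can be made concrete by drawing a few sample functions from a Gaussian process prior. The following is a minimal sketch, assuming a zero-mean prior and a squared-exponential kernel (neither is fixed by the note); the helper names `rbf_kernel`, `cholesky` and `sample_prior` are illustrative only.

```python
import math
import random

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_prior(xs, rng, jitter=1e-9):
    """Draw one function from a zero-mean GP prior, evaluated at the points xs."""
    # Covariance matrix of the function values; jitter keeps it numerically PD.
    K = [[rbf_kernel(a, b) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    # f = L z has covariance L L^T = K.
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]

xs = [float(i) for i in range(6)]
rng = random.Random(0)
draws = [sample_prior(xs, rng) for _ in range(3)]  # three sampled functions
```

Each entry of `draws` is one function evaluated on `xs`; nearby inputs get correlated values because the kernel assigns them high covariance.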

Cross Validation and Model Selection

There is a big difference between the empirical score and the expected score; we touched on this at the start, in Introduction to Advanced Machine Learning. We will develop more methods to better understand this fundamental principle. How can we estimate the expected risk of a particular estimator or algorithm? One answer is cross-validation, a fundamental method in machine learning for estimating the expected risk of a model. ...
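The gap between empirical and expected risk can be illustrated with a small k-fold cross-validation routine. This is a sketch under stated assumptions: the toy estimator `fit_mean`, the squared loss and the synthetic data are hypothetical choices made for illustration, not taken from the note.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_risk(xs, ys, fit, loss, k=5, seed=0):
    """Average held-out loss over k folds: an estimate of the expected risk."""
    fold_risks = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_risks.append(mean(loss(predict(xs[i]), ys[i]) for i in held_out))
    return mean(fold_risks)

def fit_mean(xs, ys):
    """Toy estimator (hypothetical): always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def squared_loss(pred, y):
    return (pred - y) ** 2

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
cv_risk = cross_val_risk(xs, ys, fit_mean, squared_loss, k=5)
# Empirical risk: the same estimator scored on the data it was fitted on.
train_risk = mean(squared_loss(fit_mean(xs, ys)(x), y) for x, y in zip(xs, ys))
```

For this estimator the held-out estimate `cv_risk` is never smaller than `train_risk`, which is the optimism of the empirical score that cross-validation is meant to correct.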

Rademacher Complexity

This note uses the definitions from Probably Approximately Correct Learning, so go there when you encounter a term you don’t know, or search online. Rademacher complexity is defined on the family of loss functions $$ \mathcal{G} = \left\{ g : (x, y) \to L(h(x), y) : h \in \mathcal{H} \right\} $$ where $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a generic loss function. The Rademacher complexity captures the richness of a family of functions by measuring the degree to which a hypothesis set can fit random noise (Mohri et al. 2012). ...
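The idea of "fitting random noise" can be sketched numerically. The snippet below is a Monte Carlo estimate of the empirical Rademacher complexity, i.e. the expectation over random signs $\sigma_i \in \{-1, +1\}$ of $\max_{g} \frac{1}{m} \sum_i \sigma_i g(z_i)$, assuming a finite family of functions that has already been evaluated on a fixed sample. The function name and the two toy families are illustrative assumptions.

```python
import random

def empirical_rademacher(outputs, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_g (1/m) sum_i sigma_i g(z_i) ].

    `outputs` has one row per function g, holding g(z_1), ..., g(z_m)
    on a fixed sample of size m.
    """
    rng = random.Random(seed)
    m = len(outputs[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]  # random noise labels
        total += max(sum(s * v for s, v in zip(sigma, row)) / m for row in outputs)
    return total / n_draws

m = 8
# A single function cannot adapt to the noise: the expectation is 0.
single = [[1.0] * m]
# All 2^m sign patterns: some function matches any noise vector exactly.
rich = [[1.0 if (mask >> i) & 1 else -1.0 for i in range(m)]
        for mask in range(2 ** m)]

r_single = empirical_rademacher(single)
r_rich = empirical_rademacher(rich, n_draws=200)
```

The poor family yields an estimate close to 0, while the family containing every sign pattern attains the maximum value of 1.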

Data Cubes

Data cubes are a data format especially useful for read-heavy workloads. They were popularized in business environments where the main use of data was producing reports (many reads). This links to the distinction between OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing): the former is optimized for reads, the latter for writes. The main driver behind data cubes was business intelligence. While traditional relational database systems are focused on the day-to-day business of a company and record keeping (with customers placing orders, inventories kept up to date, etc.), business intelligence is focused on the production of high-level reports for supporting C-level executives in making informed decisions. ...
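To illustrate why such a format suits reporting, here is a minimal sketch of a cube that pre-aggregates one measure over every subset of its dimensions, so that each report becomes a single lookup. The fact table, the dimension names and `build_cube` are hypothetical and only serve the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (year, country, product) dimensions and one measure.
facts = [
    (2023, "CH", "laptop", 10),
    (2023, "CH", "phone", 5),
    (2023, "DE", "laptop", 7),
    (2024, "CH", "laptop", 3),
    (2024, "DE", "phone", 8),
]
DIMS = ("year", "country", "product")

def build_cube(rows, n_dims):
    """Pre-aggregate the measure for every subset of dimensions (None = 'all')."""
    cube = defaultdict(int)
    for *coords, measure in rows:
        for r in range(n_dims + 1):
            for kept in combinations(range(n_dims), r):
                # Dimensions not in `kept` are rolled up and marked with None.
                key = tuple(coords[i] if i in kept else None for i in range(n_dims))
                cube[key] += measure
    return dict(cube)

cube = build_cube(facts, len(DIMS))
total = cube[(None, None, None)]            # grand total
sales_2023 = cube[(2023, None, None)]       # roll-up over country and product
ch_laptops = cube[(None, "CH", "laptop")]   # roll-up over year only
```

All aggregation work happens once at build time (the write side), which is the trade-off that makes the read side cheap.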

Introduction to Big Data

Data Science is similar to physics: it attempts to build theories of reality based on a formalism that another science provides. For physics it was mathematics; for data science it is computer science. The amount of data has grown rapidly in recent years and has reached a magnitude that, expressed in metres, corresponds to the distance to Jupiter. The galaxy is on the order of 400 yottametres, where the prefix yotta stands for a 1 followed by $3 \cdot 8$ zeros. So quite a lot. We don’t know whether the amount of data will keep growing this fast, but we certainly need to be prepared for that case. ...

Querying Denormalized Data

TODO: write the introduction to the note. JSONiq presents itself as an easy query language that can run everywhere. It attempts to solve common problems with SQL, namely the lack of support for nested data structures and for JSON data types. A nice thing about JSONiq is that it is functional, which makes its queries quite powerful and flexible. It is also declarative and set-based, two traits it shares with SQL. ...

Probabilistic Parsing

Language Constituents: A constituent is a word or a group of words that functions as a single unit within a hierarchical structure. The notion is motivated by the large body of evidence pointing towards a hierarchical organization of human language. Examples of constituents: John speaks [Spanish] fluently; John speaks [Spanish and French] fluently; Mary programs the homework [in the ETH computer laboratory]; Mary programs the homework [in the laboratory]. ...

Kalman Filters

Here is a historical treatment of the topic: https://jwmi.github.io/ASM/6-KalmanFilter.pdf. Kalman Filters are defined as follows: we start with a variable $X_{0} \sim \mathcal{N}(\mu, \Sigma)$, then we have a motion model and a sensor model: $$ \begin{cases} X_{t + 1} = FX_{t} + \varepsilon_{t} & F \in \mathbb{R}^{d\times d}, \varepsilon_{t} \sim \mathcal{N}(0, \Sigma_{x})\\ Y_{t} = HX_{t} + \eta_{t} & H \in \mathbb{R}^{m \times d}, \eta_{t} \sim \mathcal{N}(0, \Sigma_{y}) \end{cases} $$ Inference reduces to manipulating Gaussians, that is, conditioning and marginalizing. One can interpret $Y$ as the observations and $X$ as the underlying beliefs about a certain state. Kalman Filters satisfy the Markov Property, see Markov Chains. These independence properties allow an easy characterization of the joint distribution for Kalman Filters: ...
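The motion and sensor models above can be sketched in the scalar case $d = m = 1$, where every matrix becomes a number. This is a minimal illustration of the standard conditioning and prediction steps under that assumption; the function names and the numeric values are hypothetical, and `var_x`, `var_y` stand for $\Sigma_{x}$, $\Sigma_{y}$.

```python
def update(mu, var, y, H, var_y):
    """Condition the belief N(mu, var) over X_t on the observation Y_t = y."""
    gain = var * H / (H * var * H + var_y)  # Kalman gain
    return mu + gain * (y - H * mu), (1.0 - gain * H) * var

def predict(mu, var, F, var_x):
    """Push the belief over X_t through the motion model to obtain X_{t+1}."""
    return F * mu, F * var * F + var_x

mu, var = 0.0, 1.0                       # prior X_0 ~ N(0, 1)
F, H, var_x, var_y = 1.0, 1.0, 0.1, 0.5  # illustrative model parameters
for y in [1.2, 0.9, 1.1]:                # hypothetical observations Y_0, Y_1, Y_2
    mu, var = update(mu, var, y, H, var_y)
    mu, var = predict(mu, var, F, var_x)
```

Both steps keep the belief Gaussian, so the filter only ever tracks a mean and a variance (a mean vector and a covariance matrix in the general case).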

Introduction to Advanced Machine Learning

Introduction to the course: Machine learning offers a new way of thinking about reality: rather than attempting to directly capture a fragment of reality, as many traditional sciences have done, we elevate to the meta-level and strive to create an automated method for capturing it. This first lesson will be more philosophical in nature. We are witnessing a paradigm shift in the sense described by Thomas Kuhn in his theory of scientific revolutions. But what drives such a shift, and how does it unfold? ...

Introduction to Natural Language Processing

The landscape of NLP was very different at the beginning of the field: “But it must be recognized that the notion ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term” (Noam Chomsky, 1968, p. 53). Probabilistic approaches were not well regarded (Chomsky has said many wrong things indeed), and linguists were considered useless. Today, deep learning and computational papers are ubiquitous at major conferences in linguistics, e.g. ACL. ...