Notes

Recurrent Neural Networks

Recurrent Neural Networks allow us to model arbitrarily long sequence dependencies, at least in theory (which is also why they seem, in theory, a very nice choice for time series). This is very handy and has many interesting theoretical implications. But here we are also interested in practical applicability, so we need to analyze the common architectures used to implement these models, their main limitations and drawbacks, their nice properties, and some applications. ...

Skylake Microprocessor

The Skylake processor is a 2015 Intel processor.

The Intel Processor

In 1978 Intel made the choice to keep every later processor backward compatible. At that time they had the 8086, a 16-bit processor. Because of backward compatibility, the instruction set has usually just grown. Intel picks geographic locations as codenames because these names cannot be sued over. If we want new code to run on old processors, we need to set specific flags. ...

Sparse Matrix Vector Multiplication

Algorithms for Sparse Matrix-Vector Multiplication

Compressed Sparse Row

This is an optimized way to store rows for sparse matrices.

Sparse MVM using CSR:

    void smvm(int m, const double* values, const int* col_idx,
              const int* row_start, double* x, double* y)
    {
        int i, j;
        double d;

        /* Loop over m rows */
        for (i = 0; i < m; i++) {
            d = y[i]; /* Scalar replacement since reused */

            /* Loop over non-zero elements in row i */
            for (j = row_start[i]; j < row_start[i + 1]; j++) {
                d += values[j] * x[col_idx[j]];
            }
            y[i] = d;
        }
    }

Let’s analyze this code:

- Spatial locality: with respect to row_start, col_idx and values we have spatial locality.
- Temporal locality: with respect to y we have temporal locality (poor temporal locality with respect to $x$).
- Good storage efficiency for the sparse matrix.
- But it is 2x slower than the dense matrix multiplication when the matrix is dense.

Block CSR

But we cannot do block optimizations for the cache with this storage method. ...
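To make the three CSR arrays concrete, here is a minimal sketch that builds them from a small dense matrix and then runs the same multiplication loop. The helper `dense_to_csr` is hypothetical (it is not part of the note), the caller is assumed to provide large enough output arrays, and `smvm` is repeated here with `x` marked `const` only so that the block compiles on its own.

```c
#include <stddef.h>

/* y += A * x for an m-row matrix A in CSR form (same loop as in the note). */
void smvm(int m, const double *values, const int *col_idx,
          const int *row_start, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double d = y[i]; /* keep the accumulator in a scalar */
        for (int j = row_start[i]; j < row_start[i + 1]; j++)
            d += values[j] * x[col_idx[j]];
        y[i] = d;
    }
}

/*
 * Hypothetical helper: convert a row-major dense m x n matrix to CSR.
 * values and col_idx must hold at least as many entries as there are
 * non-zeros; row_start must hold m + 1 entries.
 * Returns the number of non-zeros stored.
 */
int dense_to_csr(int m, int n, const double *dense,
                 double *values, int *col_idx, int *row_start)
{
    int nnz = 0;
    for (int i = 0; i < m; i++) {
        row_start[i] = nnz;          /* row i begins at this offset */
        for (int j = 0; j < n; j++) {
            double a = dense[i * n + j];
            if (a != 0.0) {
                values[nnz] = a;     /* the non-zero value itself */
                col_idx[nnz] = j;    /* its column in the dense matrix */
                nnz++;
            }
        }
    }
    row_start[m] = nnz;              /* sentinel: one past the last row */
    return nnz;
}
```

For the 3x3 matrix with rows (4, 0, 1), (0, 0, 2), (0, 3, 0), this gives values = [4, 1, 2, 3], col_idx = [0, 2, 2, 1] and row_start = [0, 2, 3, 4]. Note that `smvm` accumulates into `y`, so `y` should be zeroed first if a plain product is wanted.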

The Perceptron Model

The perceptron is a fundamental binary linear classifier introduced by (Rosenblatt 1958). It maps an input vector $\mathbf{x} \in \mathbb{R}^n$ to an output $y \in \{0,1\}$ using a weighted sum followed by a threshold function.

Introduction to the Perceptron

A mathematical model

Given an input vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$ and a weight vector $\mathbf{w} = (w_1, w_2, \dots, w_n)$, the perceptron computes:

$$ z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b $$

$$ y = f(z) = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{otherwise} \end{cases} $$

Learning Rule

Given a labeled dataset $\{ (\mathbf{x}^{(i)}, y^{(i)}) \}_{i=1}^{m}$, the perceptron uses the following weight update rule for misclassified samples ($y^{(i)} \neq f(\mathbf{w}^\top \mathbf{x}^{(i)} + b)$): ...
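The forward computation $y = f(\mathbf{w}^\top \mathbf{x} + b)$ can be sketched in a few lines of C. Since the excerpt cuts off before stating the update, the training function below assumes the classic perceptron rule $\mathbf{w} \leftarrow \mathbf{w} + \eta\,(y - \hat{y})\,\mathbf{x}$, $b \leftarrow b + \eta\,(y - \hat{y})$, which may differ in detail from the rule given in the full note. The function names and the learning rate parameter `eta` are illustrative choices, not taken from the note.

```c
#include <stddef.h>

/* Threshold unit: returns 1 if w.x + b >= 0, else 0. */
int perceptron_predict(const double *w, double b, const double *x, size_t n)
{
    double z = b;
    for (size_t i = 0; i < n; i++)
        z += w[i] * x[i];
    return z >= 0.0 ? 1 : 0;
}

/*
 * One pass over m samples of dimension n, stored row-major in X.
 * ASSUMPTION: classic update, applied only to misclassified samples:
 *   w += eta * (y - y_hat) * x,   b += eta * (y - y_hat)
 * Returns the number of updates performed in this pass.
 */
int perceptron_epoch(double *w, double *b, const double *X, const int *y,
                     size_t m, size_t n, double eta)
{
    int mistakes = 0;
    for (size_t s = 0; s < m; s++) {
        const double *x = X + s * n;
        int err = y[s] - perceptron_predict(w, *b, x, n); /* -1, 0 or +1 */
        if (err != 0) {
            for (size_t i = 0; i < n; i++)
                w[i] += eta * err * x[i];
            *b += eta * err;
            mistakes++;
        }
    }
    return mistakes;
}
```

Calling `perceptron_epoch` repeatedly until it returns 0 trains the model. On linearly separable data such as the logical AND of two inputs, this loop terminates after finitely many updates.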

Transformers

Transformers, introduced for language translation in NLP (Vaswani et al. 2017), are one of the cornerstones of modern deep learning. For this reason, it is quite important to understand how they work.

Introduction to Transformers

Transformers are so called because they transform the input data space into another space with the same dimensionality. The goal of the transformation is for the new space to have a richer internal representation that is better suited to solving downstream tasks. (Bishop & Bishop 2024) ...

Systems for Artificial Intelligence

At the time of writing, the compute requirements for machine learning models and artificial intelligence are growing at a staggering rate of 200% every 3.5 months. Interest in the area has been quantified at 10k papers per month on the topic, while dollar investments in compute (energy, cooling, sustainability of compute in general) have had a hard time keeping up with the continuous new requests. ...

Container Virtualization

Containers

In this note, we introduce the famous Docker containers. We also explore how Linux Containers are implemented, and some parts of how Docker works.

What is a Container

We have explored Virtual Machines in a past section. Containers do not virtualize everything, but just the environment where the application runs. This includes:

- Libraries
- Binaries

We can see a container as a lightweight VM, even though containers do not offer the full level of isolation of traditional virtual machines. ...

Datacenter Hardware

We want to optimize the datacenter hardware so that the cost of operating the datacenter is lower, which means we need to think about the datacenter as a whole.

Datacenter CPUs

Desktop CPU vs Cloud CPU

- Isolation: Desktop CPUs have low isolation, since they are used by a single user. Cloud CPUs have high isolation, since they are shared among different users.
- Workload and performance: cloud CPUs usually run high workloads and move a lot of data around. They offer a spectrum of low-end and high-end cores: with high parallelism you can use lower-end cores, while for resource-intensive tasks it is better to have high-end cores, especially for latency-critical tasks. ...

OS Software Architecture

Depending on who uses it, the OS can be many things: just an interface if you are a programmer, or a set of services if you are a user (although most services are abstracted away and the user may not even be aware of them). If you are an OS programmer, however, you care about understanding the main components of the OS.

Slide: high-level OS components

Introduction to the components (skipped)

I skip this part because it is a very general description of what the OS handles with respect to drivers, processes, filesystem and I/O, so it is not very important ...

Cloud Reliability

Reliability is the ability of a system to remain operational over time, i.e., to offer the service it was designed for. Cloud hardware and software fail. In this note, we will try to find methods to analyze and predict when components fail, and how we can prevent these failures.

Defining the vocabulary

Reliability and Factors of Influence

Reliability is the probability that a system will perform its intended function without failure over a specified period of time. There are many factors that influence this value: ...
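The probabilistic definition of reliability above can be written compactly. The symbols are assumptions introduced here for illustration, not notation from the note: $T$ is the random time to failure and $F$ its cumulative distribution function.

$$ R(t) = P(T > t) = 1 - F(t) $$

As one common modeling choice (an assumption, not something the note states), a constant failure rate $\lambda$ gives

$$ R(t) = e^{-\lambda t} $$

so that, for example, $\lambda t = 0.1$ yields $R(t) = e^{-0.1} \approx 0.905$.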