Cluster Resource Management

We need an efficient and effective way to allocate resources across the cluster. This is what the resource management layer does. Introduction to the problem What is Cluster Resource Management? Most of the time, the user specifies an amount of resources, and then the cluster decides how much to allocate (though approaches like (Delimitrou & Kozyrakis 2014) do it differently). There are mainly two parts in cluster resource management: Allocation: deciding how many resources an application receives (techniques for this are presented in Cluster Management Policies). Assignment: deciding on which physical machine the application is effectively placed. Types of management architectures We mainly divide the management architectures in three ways: ...
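
Since the allocation/assignment split can feel abstract, here is a minimal sketch of the assignment step as a first-fit heuristic. This is an illustrative assumption, not the mechanism from the post: `Machine` and `first_fit_assign` are hypothetical names, and real schedulers weigh many more dimensions than CPU.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    capacity_cpu: float      # total CPUs on this machine
    used_cpu: float = 0.0    # CPUs already assigned

def first_fit_assign(machines, demand_cpu):
    """Assignment: place a request on the first machine with enough room.
    (Allocation, i.e. deciding demand_cpu itself, is a separate policy.)"""
    for i, m in enumerate(machines):
        if m.capacity_cpu - m.used_cpu >= demand_cpu:
            m.used_cpu += demand_cpu
            return i          # index of the chosen machine
    return None               # no machine can host the request

cluster = [Machine(16), Machine(8)]
print(first_fit_assign(cluster, 10))  # -> 0
print(first_fit_assign(cluster, 7))   # -> 1 (machine 0 only has 6 CPUs left)
```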

7 min · Xuanqiang 'Angelo' Huang

Design patterns

Introduction to design patterns All the images shown here are taken from https://refactoring.guru/. That is the main site to keep as a reference. Personal introduction Design patterns are similar to plug and play: solutions that have worked well in the past and are now reused. Usually this should be an implicit skill, i.e. something a good programmer can do without thinking about it; it should be automatic. Indeed, when doing design one does not explicitly follow a certain model, but doing so usually turns out to be useful for guiding the process. ...

4 min · Xuanqiang 'Angelo' Huang

Automata and Regexp

For lexical analysis we want a way to recognize the legal words of the language, and this is done with regular languages. Introduction to lexical analyzers Token The structure of a token is made of two parts: an identifier of the token's class, and an identifier of the token's value. Patterns and Lexemes Patterns are a general description of the form of the values of a token class. ...
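
As a quick illustration of tokens as (class, value) pairs, here is a minimal regex-based lexer sketch. The toy token classes and their patterns are made up for the example; they are not from the note.

```python
import re

# Hypothetical token classes and their patterns; each lexeme is matched
# against these, in order.
TOKEN_PATTERNS = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def tokenize(source):
    """Yield (class identifier, value identifier) pairs, i.e. tokens."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("x = 42 + y")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y')]
```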

5 min · Xuanqiang 'Angelo' Huang

Bayesian neural networks

Robbins-Monro Algorithm The Algorithm $$ w_{n+1} = w_{n} - \alpha_{n} \Delta w_{n} $$ For example, with $\alpha_{0} > \alpha_{1} > \dots > \alpha_{n} > \dots$ and $\alpha_{t} = \frac{1}{t}$, the conditions below are satisfied (in practice we use a constant $\alpha$, but we then lose the Robbins-Monro convergence guarantee). More generally, the Robbins-Monro conditions are: $\sum_{n} \alpha_{n} = \infty$ $\sum_{n} \alpha_{n}^{2} < \infty$ Then the algorithm is guaranteed to converge to the best answer. One nice thing about this is that we don't need gradients. But often we use gradient versions (stochastic gradient descent and similar), using auto-grad; see Backpropagation. But learning with gradients brings some drawbacks: ...
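
To make the update rule concrete, here is a minimal sketch of a Robbins-Monro iteration with $\alpha_n = 1/n$, used to find the mean of a distribution from noisy samples only (the classic gradient-free instance). The target distribution is an illustrative assumption.

```python
import random

def robbins_monro(sample, w0=0.0, steps=10_000):
    """Stochastically find the root of g(w) = E[w - X], i.e. the mean of X,
    using only noisy samples of X."""
    w = w0
    for n in range(1, steps + 1):
        alpha = 1.0 / n            # satisfies sum a_n = inf, sum a_n^2 < inf
        delta = w - sample()       # noisy estimate of g(w)
        w = w - alpha * delta      # w_{n+1} = w_n - a_n * Δw_n
    return w

random.seed(0)
print(robbins_monro(lambda: random.gauss(3.0, 1.0)))  # converges near 3.0
```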

10 min · Xuanqiang 'Angelo' Huang

Bayesian Optimization

While Active Learning looks for the most informative points to recover a true underlying function, Bayesian Optimization is interested only in finding the maximum of that function. In Bayesian Optimization, we ask for the best way to sequentially choose a set of points $x_{1}, \dots, x_{n}$ in order to find $\max_{x \in \mathcal{X}} f(x)$ for a certain unknown function $f$. This is what the whole thing is about. Definitions First we will introduce some useful definitions in this context. These were also somewhat introduced in N-Bandit Problem, which is one of the classical optimization problems we can find in the literature. ...
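
A minimal sketch of the sequential loop, assuming a Gaussian-process surrogate and a UCB acquisition rule over a discrete grid (the objective `f`, the grid, and the exploration weight 2.0 are illustrative choices, not from the note):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):                        # the unknown function we want to maximize
    return -(x - 2.0) ** 2

candidates = np.linspace(0, 5, 101).reshape(-1, 1)
X, y = [[0.0]], [f(0.0)]         # one initial observation

gp = GaussianProcessRegressor(alpha=1e-6)   # small jitter for stability
for _ in range(10):
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 2.0 * sigma       # optimism in the face of uncertainty
    x_next = float(candidates[int(np.argmax(ucb))][0])
    X.append([x_next])
    y.append(f(x_next))

print(max(y))                    # best observed value; approaches f(2.0) = 0
```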

8 min · Xuanqiang 'Angelo' Huang

Multi Variable Derivatives

Multi-variable derivative To the people that are not used to matrix derivatives (like me) it could be useful to see why $$ \frac{ \partial u^{T}Su }{ \partial u } = 2Su $$ First, we note that if you differentiate with respect to a vector, the output will have the same dimensions as that vector. That notation just means differentiating with respect to every single component independently and then stacking the results, so it is better understood as $$ \frac{ \partial u^{T}Su }{ \partial u } = \begin{bmatrix} \frac{ \partial u^{T}Su }{ \partial u_{1} } \\ \vdots \\ \frac{ \partial u^{T}Su }{ \partial u_{M} } \end{bmatrix} $$ So we can prove each derivative independently; it's just a lot of manual work! We see that $u^{T}Su$ is just a quadratic form, studied in Massimi minimi multi-variabile#Forme quadratiche, so it is just computing this: $$ u^{T}Su = \sum_{i, j = 1}^{M} u_{i}u_{j}S_{ij} \implies \frac{ \partial u^{T}Su }{ \partial u_{1} } = 2u_{1}S_{11} + \sum_{j \neq 1}^{M}(u_{j}S_{1j} + u_{j}S_{j1}) = 2\left( u_{1}S_{11} + \sum_{j \neq 1}u_{j}S_{1j} \right) = 2(Su)_{1} $$ The last equality holds because $S$ is a symmetric matrix; we then easily see that this is indeed the first component of $Su$ multiplied by 2. ...
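
As a sanity check, here is a small numerical verification (my own illustrative sketch, not from the note) that the gradient of $u^{T}Su$ matches $2Su$ for a symmetric $S$, using central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M))
S = (A + A.T) / 2                 # make S symmetric
u = rng.standard_normal(M)

def quad(v):
    return v @ S @ v              # the quadratic form u^T S u

# Central finite differences, one component at a time.
eps = 1e-6
grad_fd = np.array([
    (quad(u + eps * np.eye(M)[i]) - quad(u - eps * np.eye(M)[i])) / (2 * eps)
    for i in range(M)
])
print(np.allclose(grad_fd, 2 * S @ u, atol=1e-5))  # True
```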

5 min · Xuanqiang 'Angelo' Huang

Naïve Bayes

Introduction to Naïve Bayes NOTE: this note should be reviewed after the course I took in NLP. This is a very old note, not even well written. One should first have very clearly in mind the meaning of conditional probability, and then the naive Bayes rule. Bayes at a high level From an intuitive point of view, it is nothing more than predicting the thing we have seen most often in that space ...
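
To ground the "predict what we saw most often" intuition, here is a tiny illustrative Naïve Bayes classifier; the toy corpus, the Laplace smoothing, and all names are assumptions made for the example:

```python
from collections import Counter

# Toy training corpus: (class, text) pairs, invented for the example.
docs = [("spam", "buy cheap pills"), ("spam", "cheap pills now"),
        ("ham", "meeting at noon"), ("ham", "lunch at noon")]

priors = Counter(c for c, _ in docs)             # class counts -> P(c)
word_counts = {c: Counter() for c in priors}     # word counts  -> P(w | c)
for c, text in docs:
    word_counts[c].update(text.split())
vocab = {w for _, text in docs for w in text.split()}

def predict(text, alpha=1.0):
    """Pick the class maximizing P(c) * prod_w P(w | c), Laplace-smoothed."""
    def score(c):
        total = sum(word_counts[c].values())
        p = priors[c] / len(docs)
        for w in text.split():
            p *= (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
        return p
    return max(priors, key=score)

print(predict("cheap pills now"))  # -> 'spam': those words were seen most in spam
```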

6 min · Xuanqiang 'Angelo' Huang

Variational Inference

$$ p(\theta \mid x_{1:n}, y_{1:n}) = \frac{1}{z} p(y_{1:n} \mid \theta, x_{1:n}) p(\theta \mid x_{1:n}) \approx q(\theta \mid \lambda) $$ For Bayesian Linear Regression we had high-dimensional Gaussians, which made the inference closed form; in general this is not true, so we need some kind of approximation. Laplace approximation Introduction to the Idea $$ \psi(\theta) \approx \hat{\psi}(\theta) = \psi(\hat{\theta}) + (\theta-\hat{\theta})^{T} \nabla \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta})^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta}) = \psi(\hat{\theta}) + \frac{1}{2} (\theta-\hat{\theta})^{T} H_{\psi}(\hat{\theta})(\theta-\hat{\theta}) $$ We dropped the first-order term because we expand around the mode $\hat{\theta}$, a stationary point where the gradient is zero. ...
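
Here is a 1-D illustrative sketch of the idea: find the mode of a made-up negative log-posterior $\psi$, then approximate the posterior by the Gaussian $\mathcal{N}(\hat{\theta}, H_{\psi}(\hat{\theta})^{-1})$. The function, the crude grid search, and the finite-difference Hessian are assumptions for the example:

```python
import numpy as np
from math import sqrt

def psi(t):                        # negative log-density, up to a constant
    return t**2 / 2 + 0.1 * t**4   # non-Gaussian, so the approximation matters

# Find the mode by grid search (the gradient is zero there).
grid = np.linspace(-3, 3, 100001)
theta_hat = grid[np.argmin(psi(grid))]

# Second derivative at the mode via central finite differences.
eps = 1e-4
H = (psi(theta_hat + eps) - 2 * psi(theta_hat) + psi(theta_hat - eps)) / eps**2

# Laplace approximation: N(theta_hat, H^{-1}); here mean ~0.0, std ~1.0.
print(theta_hat, 1 / sqrt(H))
```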

9 min · Xuanqiang 'Angelo' Huang

Asymmetric Cryptography

Public Key Encryption We now formally define what a public-key encryption scheme is. Formal definition of Public Key Encryption We define a 3-tuple formed as follows: $(G, E, D)$ where $G$ is the generator for the private and public keys, from now on identified as $(pk, sk)$ (public key and secret key) $E(pk, m)$ is the encryption algorithm, which takes the $pk$ and the message as input $D(sk, c)$ is the decryption algorithm, which takes the $sk$ and the ciphertext as input. Now, is this definition useful? I don't think so! We can't create theorems for it; it's too general, I suppose. Is it clear? Yes! I think this is the usefulness of maths on many occasions: it delivers complex information in a concise and understandable manner. ...
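
As one way to read the 3-tuple, here is the $(G, E, D)$ interface written out as a typed sketch; the class and field names are hypothetical, not a real library's API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class KeyPair:
    pk: bytes   # public key
    sk: bytes   # secret key

class PublicKeyScheme(Protocol):
    """A sketch of the (G, E, D) triple as an interface."""
    def G(self) -> KeyPair: ...                      # key generation
    def E(self, pk: bytes, m: bytes) -> bytes: ...   # encrypt with the public key
    def D(self, sk: bytes, c: bytes) -> bytes: ...   # decrypt with the secret key

# Correctness requirement: for (pk, sk) = G(), D(sk, E(pk, m)) == m for all m.
```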

8 min · Xuanqiang 'Angelo' Huang

Human Vision

Vision is THE most important sense for humans. Most of the information we get is through vision: 90% vision, 8% tactile (touch), 3% hearing. I have no idea how they measured this. This is true for humans, but for mice, for example, it is different: their vision is probably equivalent to about a 64x64 pixel resolution. For humans, visual data is more important; it is faster compared to speech and other senses. Human vision is estimated to carry about 576 megapixels of data (3M snapshots patched together with saccades reach that pixel-image value), since it can distinguish 0.6 arc-minutes (0.01 degrees). There is an estimate of about 60 million rods and 3 million cones. ...

5 min · Xuanqiang 'Angelo' Huang