Gaussians

Gaussians are one of the most important families of probability distributions. They arise naturally in the central limit theorem and have some nice properties that we will briefly present and prove here in this note. They are also ubiquitous in Gaussian Processes and in clustering algorithms such as Gaussian mixtures. They also have something to say about the Maximum Entropy Principle. If you want to learn this material really well, the best place is section 2.3 of (Bishop 2006), so go there my friend :) ...

Reading Time: 8 minutes ·  By Xuanqiang 'Angelo' Huang

Anomaly Detection

Anomaly detection is a machine learning problem of great interest in industry. For example, a bank needs to identify problematic transactions, doctors need it to detect illness, and law enforcement looks for suspicious behavior (no Orwell here). The main difference between this and classification is that here we have no classes. Setting of the problem: let's say we have a set $X = \left\{ x_{1}, \dots, x_{n} \right\} \subseteq \mathcal{N} \subseteq \mathcal{X} = \mathbb{R}^{d}$. We call $\mathcal{N}$ the normal set and $X$ our samples; but $\mathcal{N}$ is quite complex, so we need an approximation to decide whether a point is normal or not. We need a function $\phi : \mathcal{X} \to \left\{ 0, 1 \right\}$ with $\phi(x) = 1 \iff x \not \in \mathcal{N}$. ...
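
As an illustration (not from the post itself), here is a minimal sketch of one way to realize such a $\phi$: fit a simple density model to the normal samples and flag points whose density falls below a threshold. The diagonal Gaussian model and the `threshold` value are assumptions made for this example.

```python
import numpy as np

def fit_gaussian(X):
    """Fit a diagonal Gaussian to the samples X (assumed to come from the normal regime)."""
    return X.mean(axis=0), X.std(axis=0) + 1e-9

def phi(x, mu, sigma, threshold=1e-4):
    """Return 1 if x looks anomalous (low density under the fitted model), else 0."""
    # Product of per-dimension Gaussian densities (diagonal covariance).
    density = np.prod(np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)))
    return int(density < threshold)

X = np.random.randn(1000, 2)                  # samples from the "normal" regime
mu, sigma = fit_gaussian(X)
print(phi(np.array([0.1, -0.3]), mu, sigma))  # typically 0 (normal)
print(phi(np.array([8.0, 8.0]), mu, sigma))   # typically 1 (anomalous)
```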

Reading Time: 2 minutes ·  By Xuanqiang 'Angelo' Huang

Asymptotic Notation

Introduction to asymptotic notation: we try to define the time a function takes to run in terms of the SIZE of the input (at the lowest level, the number of bits). But we have a measurement problem, since we must consider quantities that are independent of the machine. Characteristics of the notation: we want an asymptotic notation, one that looks at the behavior as the input grows to infinity. Memory access: every operation in a modern processor generally involves a constant number of memory accesses (we usually have a fixed number of possible operands, which means that if an algorithm has a certain complexity, it keeps that complexity even when memory-access operations are taken into account). This argument no longer holds if we consider arbitrary-precision numbers, which may require an arbitrary number of memory accesses to be computed. ...
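
A minimal sketch (my own, not from the note) of measuring cost in a machine-independent way: count basic operations as a function of the input size $n$ instead of measuring wall-clock time. The counting functions below are purely illustrative.

```python
def count_linear(n):
    """Count the basic operations of a linear scan: grows like Theta(n)."""
    ops = 0
    for _ in range(n):
        ops += 1          # one constant-time step (constant number of memory accesses)
    return ops

def count_quadratic(n):
    """Count the basic operations of a double loop: grows like Theta(n^2)."""
    ops = 0
    for _ in range(n):
        for _ in range(n):
            ops += 1
    return ops

# The growth rates are the same on any machine, even if each step's real cost differs.
for n in (10, 100, 1000):
    print(n, count_linear(n), count_quadratic(n))
```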

Reading Time: 4 minutes ·  By Xuanqiang 'Angelo' Huang

Massive Parallel Processing

We have a group of mappers that partition the keys for some reducers, which then actually work on that same group of data. The bottleneck is the assignment step: when mappers finish and need to hand the data over to the reducers. Introduction. Common input formats you need to know well: shards, textual input, binary formats (Parquet and similar), CSV and similar. Sharding: it is a common practice to divide a big dataset into chunks (or shards), smaller parts which, recomposed, give back the original dataset. For example, in Cloud Storage settings we often divide big files into chunks, while in distributed file systems the system automatically divides big files into native files of at most 10 GB. ...
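
A toy, in-memory sketch (not the actual framework) of the mapper → shuffle → reducer pipeline described above; the word-count example and all function names are assumptions made for illustration. The `shuffle` step is exactly the key-assignment phase identified as the bottleneck.

```python
from collections import defaultdict

def mapper(shard):
    """Mappers emit (key, value) pairs from their shard of the input."""
    for line in shard:
        for word in line.split():
            yield word, 1

def shuffle(mapped_pairs):
    """The assignment step: every key is routed to (grouped for) one reducer."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reducers aggregate all values that share the same key."""
    return key, sum(values)

shards = [["the quick brown fox"], ["the lazy dog", "the end"]]
mapped = [pair for shard in shards for pair in mapper(shard)]
print(dict(reducer(k, v) for k, v in shuffle(mapped).items()))
```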

Reading Time: 14 minutes ·  By Xuanqiang 'Angelo' Huang

Sequential Circuits

7.1 Introduction 7.1.1 Why use them: they are useful for keeping information over time. 7.1.2 Characteristics: they have feedback, i.e., some outputs feed back into the circuit, so without knowing anything it is very hard to tell what happens inside. Such a circuit is not combinational, which would be formalizable deterministically with Boolean algebra. 7.1.3 The memory bit: this bit has two inputs, a load and an input; if load is active it starts storing, otherwise the output is always the bit it has stored. ...
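
A toy software model (my own illustration, not from the note) of the memory bit described above; the class and method names are invented for the example.

```python
class MemoryBit:
    """Toy model of the memory bit: when 'load' is high the input is stored,
    otherwise the output keeps returning the stored bit (the feedback loop)."""

    def __init__(self):
        self.state = 0

    def tick(self, load, value):
        if load:
            self.state = value   # load active: store the new bit
        return self.state        # output is always the stored bit

bit = MemoryBit()
print(bit.tick(load=1, value=1))  # stores 1 -> outputs 1
print(bit.tick(load=0, value=0))  # load inactive -> still outputs 1
```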

Reading Time: 4 minutes ·  By Xuanqiang 'Angelo' Huang

Active Learning

Active Learning concerns methods to decide how to sample the most useful information in a specific domain; how can you select the best sample for an unknown model? Gathering data is very costly, so we would like a principled way to choose the best data point to label by hand in order to obtain the best model. In this setting, we are interested in the concept of usefulness of information. One of our main goals is to reduce uncertainty; thus, entropy-based (mutual information) methods are often used. For example, we can use active learning to choose which samples need to be labelled in order to obtain the highest accuracy for the trained model, when labelling is costly. ...
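
A minimal sketch of entropy-based sample selection (uncertainty sampling), assuming we already have a model's predicted class probabilities over an unlabelled pool; the function names and the toy probabilities are illustrative only.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of each discrete predictive distribution (rows of p)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def pick_next_query(predicted_probs):
    """Return the index of the unlabelled point whose prediction is most uncertain."""
    return int(np.argmax(entropy(predicted_probs)))

# Hypothetical model outputs over an unlabelled pool (rows = points, cols = classes).
pool_probs = np.array([[0.95, 0.05],
                       [0.55, 0.45],   # most uncertain -> chosen for labelling
                       [0.80, 0.20]])
print(pick_next_query(pool_probs))  # -> 1
```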

Reading Time: 13 minutes ·  By Xuanqiang 'Angelo' Huang

Turing Machine Extensions and Other Machines

These are possible, equivalent variations: • Additional tapes • Additional heads • Tapes infinite in both directions • Non-determinism • Probabilistic choice • Quantum choice. One can say that the definition of the TM has proven robust throughout history, because many formalisms that intuitively seem very different from the TM can in the end be shown to be equivalent to it. Turing machines with additional tapes: this is covered fairly easily in Sipser. ...

Reading Time: 6 minutes ·  By Xuanqiang 'Angelo' Huang

Monte Carlo Methods

We discuss the Law of Large Numbers and the Central Limit Theorem in Central Limit Theorem and Law of Large Numbers. Usually these methods are useful when you need to compute something similar to Bayes' rule but don't know how to calculate the denominator, which is often an infeasible integral. We estimate this value without explicitly calculating it. We are interested in $\mathbb{P}(x) = \frac{1}{Z} \mathbb{P}^{*}(x) = \frac{1}{Z} e^{-E(x)}$ and we can evaluate $E(x)$ at any $x$. Problem 1: draw samples $x^{(r)} \sim \mathbb{P}$. Problem 2: estimate expectations $\Phi = \sum_{x}\phi(x)\mathbb{P}(x)$. What we're not trying to do: we're not trying to find the most probable state, and we're not trying to visit all typical states. Law of large numbers: $$ S_{n} = \sum^n_{i=1} x_{i}, \qquad \bar{x}_{n} = \frac{S_{n}}{n} $$ $$ \bar{x}_{n} \to \mu $$ That is, the sample mean converges to the expected value of the random variables. ...
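
A minimal sketch of Problem 2 (estimating an expectation from samples), assuming we can draw $x^{(r)} \sim \mathbb{P}$ directly; the target $\mathbb{E}[x^{2}]$ under a standard Gaussian is an assumption chosen because its true value (1) is known.

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_expectation(phi, sampler, n):
    """Estimate Phi = E[phi(x)] with the empirical mean over n samples x^(r) ~ P."""
    samples = sampler(n)
    return np.mean(phi(samples))

# Example: E[x^2] under a standard Gaussian is exactly 1.
estimate = monte_carlo_expectation(lambda x: x**2,
                                   lambda n: rng.standard_normal(n),
                                   100_000)
print(estimate)  # close to 1 by the law of large numbers
```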

Reading Time: 7 minutes ·  By Xuanqiang 'Angelo' Huang

Apache Spark

This is a new framework that is faster than MapReduce (see Massive Parallel Processing). It is written in Scala and has a more functional approach to programming. Spark extends the previous MapReduce framework to a generic distributed dataflow, properly modeled as a DAG. There are other benefits of using Spark instead of the MapReduce framework: Spark processes data in memory, avoiding the disk I/O overhead of MapReduce, making it significantly faster. Spark uses a DAG to optimize the entire workflow, reducing data shuffling and stage count. But MapReduce sometimes has its advantages: ...
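
A minimal PySpark sketch of such a dataflow, assuming a local Spark installation; the word-count job and app name are illustrative, but they show how transformations build a lazy DAG that only runs when an action is called.

```python
from pyspark.sql import SparkSession

# Hypothetical word count expressed as a Spark dataflow (a DAG of transformations).
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["the quick brown fox", "the lazy dog"])
counts = (lines
          .flatMap(lambda line: line.split())  # transformation: nothing runs yet
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))    # still lazy: Spark only builds the DAG

print(counts.collect())                        # action: triggers the actual computation
spark.stop()
```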

Reading Time: 9 minutes ·  By Xuanqiang 'Angelo' Huang

Beta and Dirichlet Distributions

The beta distribution The beta distribution is a powerful tool for modeling probabilities and proportions between 0 and 1. Here’s a structured intuition to grasp its essence: Core Concept The beta distribution, defined on $[0, 1]$, is parameterized by two shape parameters: α (alpha) and β (beta). These parameters dictate the distribution’s shape, allowing it to flexibly represent beliefs about probabilities, rates, or proportions. Key Intuitions a. “Pseudo-Counts” Interpretation α acts like “successes” and β like “failures” in a hypothetical experiment. Example: If you use Beta(5, 3), it’s as if you’ve observed 5 successes and 3 failures before seeing actual data. After observing x real successes and y real failures, the posterior becomes Beta(α+x, β+y). This makes beta the conjugate prior for the binomial distribution (Bernoulli process). b. Shape Flexibility Uniform distribution: When α = β = 1, all values in [0, 1] are equally likely. Bell-shaped: When α, β > 1, the distribution peaks at mode = (α-1)/(α+β-2). Symmetric if α = β (e.g., Beta(5, 5) is centered at 0.5). U-shaped: When α, β < 1, density spikes at 0 and 1 (useful for modeling polarization, meaning we believe the process mostly produces values near 0 or 1, not in the middle). Skewed: If α > β, skewed toward 1; if β > α, skewed toward 0. c. Moments Mean: $α/(α+β)$ – your “expected” probability of success. Variance: $αβ / [(α+β)²(α+β+1)]$ – decreases as α and β grow (more confidence). $$ \text{Mode} = \frac{\alpha - 1}{\alpha + \beta - 2} $$ The mathematical model: $$ \text{Beta} (x \mid a, b) = \frac{1}{B(a, b)} \cdot x^{a -1 }(1 - x)^{b - 1} $$ where $B(a, b) = \Gamma(a) \Gamma(b) / \Gamma(a + b)$ and $\Gamma(t) = \int_{0}^{\infty}e^{-x}x^{t - 1} \, dx$ ...
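
A short illustration (not from the post) of the pseudo-count and moment facts above, using scipy.stats.beta; the prior Beta(5, 3) comes from the text, while the observed counts x = 10, y = 2 are assumed data for the example.

```python
from scipy.stats import beta

# Prior pseudo-counts: 5 "successes" and 3 "failures".
a, b = 5, 3
print(beta.mean(a, b))        # alpha / (alpha + beta) = 0.625
print(beta.var(a, b))         # alpha*beta / ((alpha+beta)^2 * (alpha+beta+1))
print((a - 1) / (a + b - 2))  # mode, valid for alpha, beta > 1

# Conjugacy: after observing x = 10 real successes and y = 2 real failures,
# the posterior is simply Beta(alpha + x, beta + y).
x, y = 10, 2
posterior = beta(a + x, b + y)
print(posterior.mean())       # updated expected probability of success
```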

Reading Time: 4 minutes ·  By Xuanqiang 'Angelo' Huang