Bayesian Optimization

While Active Learning looks for the most informative points to recover a true underlying function, Bayesian Optimization is only interested in finding the maximum of that function. In Bayesian Optimization, we ask how best to choose a sequence of points $x_{1}, \dots, x_{n}$ in order to find $\max_{x \in \mathcal{X}} f(x)$ for a certain unknown function $f$. Definitions First we will introduce some useful definitions in this context. These were also partly introduced in N-Bandit Problem, which is one of the classical optimization problems in the literature. ...

8 min · Xuanqiang 'Angelo' Huang

Beta and Dirichlet Distributions

The beta distribution The beta distribution is a powerful tool for modeling probabilities and proportions between 0 and 1. Here’s a structured intuition to grasp its essence: Core Concept The beta distribution, defined on $[0, 1]$, is parameterized by two shape parameters: α (alpha) and β (beta). These parameters dictate the distribution’s shape, allowing it to flexibly represent beliefs about probabilities, rates, or proportions. Key Intuitions a. “Pseudo-Counts” Interpretation α acts like “successes” and β like “failures” in a hypothetical experiment. Example: If you use Beta(5, 3), it’s as if you’ve observed 5 successes and 3 failures before seeing actual data. After observing x real successes and y real failures, the posterior becomes Beta(α+x, β+y). This makes the beta distribution the conjugate prior for the binomial distribution (Bernoulli process). b. Shape Flexibility Uniform distribution: When α = β = 1, all values in [0, 1] are equally likely. Bell-shaped: When α, β > 1, the distribution peaks at mode = (α-1)/(α+β-2). Symmetric if α = β (e.g., Beta(5, 5) is centered at 0.5). U-shaped: When α, β < 1, the density spikes at 0 and 1 (useful for modeling polarization, meaning we believe the model produces values near 0 or 1 rather than in the middle). Skewed: If α > β, skewed toward 1; if β > α, skewed toward 0. c. Moments Mean: $α/(α+β)$ – your “expected” probability of success. Variance: $αβ / [(α+β)²(α+β+1)]$ – decreases as α and β grow (more confidence). $$ \text{Mode} = \frac{\alpha - 1}{\alpha + \beta - 2} $$ The mathematical model $$ \text{Beta} (x \mid a, b) = \frac{1}{B(a, b)} \cdot x^{a - 1}(1 - x)^{b - 1} $$ where $B(a, b) = \Gamma(a) \Gamma(b) / \Gamma(a + b)$ and $\Gamma(t) = \int_{0}^{\infty}e^{-x}x^{t - 1} \, dx$ ...
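The formulas in this excerpt can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names (`beta_pdf`, `beta_mean`, `beta_variance`, `beta_mode`, `posterior`) are illustrative choices, not taken from the post.

```python
from math import gamma


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), using B(a, b) = Γ(a)Γ(b)/Γ(a+b)."""
    normalizer = gamma(a) * gamma(b) / gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / normalizer


def beta_mean(a, b):
    """Mean a / (a + b): the expected probability of success."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance ab / ((a + b)^2 (a + b + 1)); shrinks as a and b grow."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def beta_mode(a, b):
    """Mode (a - 1) / (a + b - 2); only valid in the bell-shaped case a, b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("the mode formula assumes a > 1 and b > 1")
    return (a - 1) / (a + b - 2)


def posterior(a, b, successes, failures):
    """Conjugate update: add observed counts to the pseudo-counts."""
    return a + successes, b + failures


# Start from the Beta(5, 3) prior, then observe 2 successes and 4 failures.
a_post, b_post = posterior(5, 3, successes=2, failures=4)
print(a_post, b_post, beta_mean(a_post, b_post))
```

With the Beta(5, 3) prior, observing 2 successes and 4 failures gives a symmetric Beta(7, 7) posterior, whose mean and mode both sit at 0.5.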

4 min · Xuanqiang 'Angelo' Huang

Counterfactual Invariance

Machine learning models cannot, by default, distinguish between causal and environmental features. Shortcut learning Often we observe shortcut learning: the model learns some dataset-dependent shortcuts (e.g. the machine that was used to take the X-ray) to make predictions, but this is very brittle and usually does not generalize. Shortcut learning happens when there are correlations in the training set between causal and non-causal features. In most cases, the object of interest should be the main focus, not the environment around it. For example, a camel in a grassland should still be recognized as a camel, not a cow. One solution could be to engineer invariant representations that are independent of the environment, for instance with an encoder that produces these representations. ...

9 min · Xuanqiang 'Angelo' Huang

Cross Validation and Model Selection

There is a big difference between the empirical score and the expected score; we said something about this at the beginning, in Introduction to Advanced Machine Learning. We will develop more methods to better understand these fundamental principles. How can we estimate the expected risk of a particular estimator or algorithm? We can use cross-validation. This method is used to estimate the expected risk of a model, and it is a fundamental method in machine learning. ...
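As a rough illustration of how cross-validation estimates the expected risk, here is a minimal k-fold sketch in plain Python. The helper names (`k_fold_indices`, `cross_val_risk`, `fit_mean`, `squared_loss`) and the toy constant predictor are assumptions made for this example, not the post's own code; the sketch also assumes `k` is at most the number of samples and does not shuffle the data.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_risk(xs, ys, fit, loss, k=5):
    """Average held-out loss over k folds: an estimate of the expected risk."""
    fold_risks = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        predict = fit(train_x, train_y)  # train only on the remaining folds
        fold_loss = sum(loss(predict(xs[i]), ys[i]) for i in held_out)
        fold_risks.append(fold_loss / len(held_out))
    return sum(fold_risks) / len(fold_risks)


def fit_mean(train_x, train_y):
    """Toy estimator (hypothetical): always predict the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


def squared_loss(prediction, target):
    return (prediction - target) ** 2


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 0.0, 1.0]
print(cross_val_risk(xs, ys, fit_mean, squared_loss, k=2))
```

Each point is scored by a model that never saw it during fitting, which is what separates this estimate from the empirical (training) score.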

5 min · Xuanqiang 'Angelo' Huang

Data Cubes

Data cubes are a data format especially useful for read-heavy workloads. They were popularized in business environments where the main use of data was to produce reports (many reads). This also links with the OLAP (Online Analytical Processing) vs OLTP (Online Transaction Processing) concepts, where the former is optimized for reads and the latter for writes. The main driver behind data cubes was business intelligence. While traditional relational database systems are focused on the day-to-day business of a company and record keeping (with customers placing orders, inventories kept up to date, etc), business intelligence is focused on the production of high-level reports for supporting C-level executives in making informed decisions. ...

4 min · Xuanqiang 'Angelo' Huang

Data Models and Validation

A data model is an abstract view over the data that hides the way it is stored physically. This is the same idea as in (Codd 1970), and it is why we should not modify data directly, but pass through some abstraction that maintains the properties of that specific data model. Data Models Tree view 🟩 We can view all JSON and XML data, as presented in Markup, as trees. This structure is usually quite evident, as it is inherent in their design. Converting from the in-memory tree structure to text is known as serialization, while the reverse process is called parsing. ...

10 min · Xuanqiang 'Angelo' Huang

Dirichlet Processes

The DP (Dirichlet Process) is part of a family of models called non-parametric models. Non-parametric models are models with a potentially infinite number of parameters. One of the classical applications is unsupervised techniques like clustering. Intuitively, clustering is about finding compact subsets of data, i.e. finding groups of points in the space that are particularly close by some measure. The Dirichlet Process See Beta and Dirichlet Distributions for the definition and intuition of these two distributions. One quite important thing that the Dirichlet Process allows us to do is to assign an ever-growing number of clusters to the data. These models are thus quite flexible to change and growth. ...

7 min · Xuanqiang 'Angelo' Huang

Distributed file systems

We want to know how to handle systems that hold a large amount of data. In the previous lesson we saw how to build scalable systems of huge dimensions with fast access, see Cloud Storage. Object storage can store billions of files; here we want to handle millions of petabyte-sized files. Designing DFSs The Use Case Remember that the size of files was heavily limited for Cloud Storage. The physical limitation was due to the limited size of a single hard disk, which was usually on the order of terabytes. Here, we would like to easily store petabytes of data in a single file, for example big datasets. Another feature that should be easily supported is highly concurrent access to the filesystem and, last but not least, the ability to set up permissions in the system. ...

10 min · Xuanqiang 'Angelo' Huang

Document Stores

Document stores provide a native database management system for semi-structured data. Document stores also scale to Gigabytes or Terabytes of data, and typically millions or billions of records (a record being a JSON object or an XML document). Introduction to Document Stores A document store, unlike a data lake, manages the data directly and the users do not see the physical layout. Unlike data lakes, document stores prevent us from breaking data independence by reading the data files directly: they offer an automatic management service for semi-structured data that we need to store and read quickly. ...

6 min · Xuanqiang 'Angelo' Huang

Ensemble Methods

The idea of ensemble methods goes back to Sir Francis Galton. In 1907, he noted that although not every single person got the right value, the average estimate of a crowd of people predicted quite well. The main idea of ensemble methods is to combine relatively weak classifiers into a highly accurate predictor. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.” ...
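The crowd-averaging intuition can be sketched in a few lines of Python. The numbers below are hypothetical, chosen only for illustration (they are not Galton's data), and the helper names `crowd_estimate` and `mean_individual_error` are assumptions made for this example.

```python
def crowd_estimate(estimates):
    """Combine individual guesses by simple averaging."""
    return sum(estimates) / len(estimates)


def mean_individual_error(estimates, truth):
    """Average absolute error of the individual guesses."""
    return sum(abs(e - truth) for e in estimates) / len(estimates)


# Hypothetical guesses and true value, for illustration only.
guesses = [1100, 1250, 1300, 1150, 1200]
truth = 1198

combined = crowd_estimate(guesses)
print(abs(combined - truth))                  # error of the averaged guess
print(mean_individual_error(guesses, truth))  # typical error of one guess
```

By the triangle inequality, the absolute error of the averaged guess can never exceed the average absolute error of the individual guesses, which is the simplest form of the argument for combining weak predictors.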

6 min · Xuanqiang 'Angelo' Huang