Ensemble Methods

The idea of ensemble methods goes back to Sir Francis Galton. In 787, he noted that although not every single person got the right value, the average estimate of a crowd of people predicted quite well. The main idea of ensemble methods is to combine relatively weak classifiers into a highly accurate predictor. The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.” ...

January 1, 2025 · Reading Time: 6 minutes ·  By Xuanqiang Angelo Huang

Linear Regression methods

We will present some methods related to regression methods for data analysis. Some of the work here is from (Hastie et al. 2009). This note does not treat the bayesian case, you should see Bayesian Linear Regression for that. Problem setting $$ Y = \beta_{0} + \sum_{j = 1}^{d} X_{j}\beta_{j} $$We usually don’t know the distribution of $P(X)$ or $P(Y \mid X)$ so we need to assume something about these distributions. One approach is assuming the distribution $Y \mid X \sim \mathcal{N}(f(X), \sigma^{2}I)$ and then solve the log likelihood on the probability If we use a statistical learning approach then we know we want to minimize this $\arg \min_{f} \sum_{i = 1}^{n} (y_{i} - f(x_{i}))^{2}$ which is what is often used. Both methods end with the same solution. Usually for these kind of problems we use the Least Squares Method, initially studied in Minimi quadrati. We will describe it again better in this setting. Let’s consider the linear model ...

December 30, 2024 · Reading Time: 9 minutes ·  By Xuanqiang Angelo Huang

Fisher's Linear Discriminant

A simple motivation Fisher’s Linear Discriminant is a simple idea used to linearly classify our data. The image above, taken from (Bishop 2006), is the summary of the idea. We clearly see that if we first project using the direction of maximum variance (See Principal Component Analysis) then the data is not linearly separable, but if we take other notions into consideration, then the idea becomes much more cleaner. ...

December 28, 2024 · Reading Time: 4 minutes ·  By Xuanqiang Angelo Huang

Kernel Methods

As we will briefly see, Kernels will have an important role in many machine learning applications. In this note we will get to know what are Kernels and why are they useful. Intuitively they measure the similarity between two input points. So if they are close the kernel should be big, else it should be small. We briefly state the requirements of a Kernel, then we will argue with a simple example why they are useful. ...

December 28, 2024 · Reading Time: 9 minutes ·  By Xuanqiang Angelo Huang

Wide Column Storage

Introduction to Wide Column Storages One of the bottlenecks of traditional relational databases is the speed of the Joints, which could be done in $\mathcal{O}(n)$ using a merge join, assuming some indexes are present which make the keys already sorted. The other solution, of just using Distributed file systems, is also not optimal: they have usually a high latency, with high throughput, that is not optimal with the series of small files that it is optimized for. While Object Storages, do not have APIs that could be helpful -> richer logical model. ...

December 28, 2024 · Reading Time: 9 minutes ·  By Xuanqiang Angelo Huang

Apache Spark

This is a new framework that is faster than MapReduce (See Massive Parallel Processing). It is written in Scala and has a more functional approach to programming. Spark extends the previous MapReduce framework to a generic distributed dataflow, properly modeled as a DAG. There are other benefits of using Spark instead of the Map reduce Framework: Spark processes data in memory, avoiding the disk I/O overhead of MapReduce, making it significantly faster. Spark uses a DAG to optimize the entire workflow, reducing data shuffling and stage count. But MapReduce sometimes has its advantages: ...

December 27, 2024 · Reading Time: 9 minutes ·  By Xuanqiang Angelo Huang

Gaussian Processes

Gaussian processes can be viewed through a Bayesian lens of the function space: rather than sampling over individual data points, we are now sampling over entire functions. They extend the idea of bayesian linear regression by introducing an infinite number of feature functions for the input XXX. In geostatistics, Gaussian processes are referred to as kriging regressions, and many other models, such as Kalman Filters or radial basis function networks, can be understood as special cases of Gaussian processes. In this framework, certain functions are more likely than others, and we aim to model this probability distribution. ...

December 25, 2024 · Reading Time: 8 minutes ·  By Xuanqiang Angelo Huang

Cross Validation and Model Selection

There is a big difference between the empirical score and the expected score; in the beginning, we had said something about this in Introduction to Advanced Machine Learning. We will develop more methods to better comprehend this fundamental principles. How can we estimate the expected risk of a particular estimator or algorithm? We can use the cross-validation method. This method is used to estimate the expected risk of a model, and it is a fundamental method in machine learning. ...

December 24, 2024 · Reading Time: 5 minutes ·  By Xuanqiang Angelo Huang

Rademacher Complexity

This note used the definitions present in Provably Approximately Correct Learning. So, go there when you encounter a word you don’t know. Or search online Rademacher Complexity $$ \mathcal{G} = \left\{ g : (x, y) \to L(h(x), y) : h \in \mathcal{H} \right\} $$ Where $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a generic loss function. The Rademacher complexity captures the richness of a family of functions by measuring the degree to which a hypothesis set can fit random noise. From (Mohri et al. 2012). ...

December 21, 2024 · Reading Time: 1 minute ·  By Xuanqiang Angelo Huang

Data Cubes

Data Cubes is a data format especially useful for heavy reads. It has been popularized in business environments where the main use for data was to make reports (many reads). This also links with the OLAP (Online Analytical Processing) vs OLTP (Online Transaction Processing) concepts, where one is optimized for reads and the other for writes. The main driver behind data cubes was business intelligence. While traditional relational database systems are focused on the day-to-day business of a company and record keeping (with customers placing or- ders, inventories kept up to date, etc), business intelligence is focused on the production of high-level reports for supporting C-level executives in making informed decisions. ...

December 20, 2024 · Reading Time: 4 minutes ·  By Xuanqiang Angelo Huang