The idea of ensemble methods goes back to Sir Francis Galton. In 1906, he observed that, although no single person guessed the right value, the average estimate of a crowd of people was remarkably accurate.
The main idea of ensemble methods is to combine relatively weak classifiers into a highly accurate predictor.
The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.”
In Buhmann’s opinion, this is also part of why democracy works well. It stops working, however, when too many individuals are correlated with each other.
Some properties of Ensembles
Variance of Ensembles 🟩–
Consider a set of estimators $f_{1}, f_{2}, \dots, f_{N}$ and their average $\hat{f}(x) = \frac{1}{N} \sum_{i = 1}^{N}f_{i}(x)$.
Then the bias of the average is the average of the individual biases, which implies that if the original estimators are unbiased, the average is unbiased too!
$$ \begin{align} \mathop{\mathbb{E}}[\hat{f}(x)] &= \frac{1}{N}\mathop{\mathbb{E}}\left[ \sum_{i = 1}^{N}f_{i}(x) \right] \\ &= \frac{1}{N}\sum_{i = 1}^{N}\mathop{\mathbb{E}}[f_{i}(x)] \\ &= \mathop{\mathbb{E}}[f_{i}(x)] \end{align} $$where the last step assumes the estimators are identically distributed. But if we look at the variance, we see that it decreases when the estimators are uncorrelated with each other! This leads to more accurate estimates.
$$ \begin{align} \text{Var}[\hat{f}(x)] &= \text{Var}\left[ \frac{1}{N}\sum_{i = 1}^{N}f_{i}(x) \right] \\ &= \frac{1}{N^{2}}\sum_{i = 1}^{N}\text{Var}[f_{i}(x)] + \frac{1}{N^{2}}\sum_{i \neq j}\text{Cov}[f_{i}(x), f_{j}(x)] \\ &\approx \frac{1}{N}\text{Var}[f(x)] \end{align} $$where the approximation holds when the covariances vanish and the individual variances are comparable. In reality, the predictors are often correlated, which is why the variance of the ensemble is not as low as we would like.
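As a quick numerical sanity check, here is a minimal simulation sketch: $N$ independent, unbiased noisy estimators of the same quantity, whose average should have roughly $1/N$ of the individual variance.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0
noise_std = 2.0
n_estimators = 50
n_trials = 100_000

# Each row is one trial; each column is one (independent, unbiased) estimator.
estimates = true_value + noise_std * rng.standard_normal((n_trials, n_estimators))
ensemble_average = estimates.mean(axis=1)

print("variance of a single estimator:", estimates[:, 0].var())     # ~ noise_std**2 = 4
print("variance of the ensemble average:", ensemble_average.var())  # ~ 4 / 50 = 0.08
```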
Advantages of Ensembles
Usually weak classifiers are easy to train (the only requirement is that they perform better than random guessing).
Data randomization in the spirit of the Bootstrap captures statistical dependencies between alternative hypotheses: the classifier ensemble encodes uncertainty intervals in the hypothesis class (output space).
Bagging
In this whole section we will assume that the classifiers output $\left\{ -1, 1 \right\}$. The idea of bagging (Breiman 1996) is to produce a diverse set of classifiers using bootstrapped samples (or other methods to induce diversity in the classifiers), and then to take the prediction as the average of the values predicted by each classifier. Techniques of this kind are also known as committee methods.
Bagging Classifiers 🟩
The basic idea of this method is quite simple and builds upon the bootstrap, introduced in Cross Validation and Model Selection. Let $B$ be the number of classifiers to train. Then repeat the following $B$ times:
- Sample the training set with replacement (this is the bootstrap part)
- Train a classifier $C_{b}$ on this sample
- Output the sign of the sum of the predictions as the classification result (see the sketch below). This setting only works for the binary classification problem; note that when the number of classifiers is even the vote can tie, and breaking ties introduces a slight bias.
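A minimal sketch of this bagging procedure, assuming labels in $\{-1, 1\}$, NumPy arrays for the data, and scikit-learn decision stumps as an arbitrary choice of weak classifier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=25, seed=0):
    """Train B classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    classifiers = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # sample with replacement
        clf = DecisionTreeClassifier(max_depth=1)  # a weak learner (decision stump)
        clf.fit(X[idx], y[idx])
        classifiers.append(clf)
    return classifiers

def bagging_predict(classifiers, X):
    """Sum the {-1, 1} predictions and take the sign."""
    votes = sum(clf.predict(X) for clf in classifiers)
    # With an odd number of classifiers the sum is never zero, so no tie-breaking is needed.
    return np.sign(votes)
```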
Compare and Bag 🟩
The idea is very similar to the previous method, but here we train two different ensembles and select one of them based on its performance on a validation set.
Random Forests
The Idea 🟩
This method builds on Decision Trees to grow a whole forest of them. The idea is to create sufficiently diverse classifiers so that we can exploit the observation above on the variance of ensembles.
Diversifying the models 🟩
Ways to diversify the trees are:
- Train them on different bootstrapped samples
- At each split, consider only a random subset of $p$ features (typically $p \approx \sqrt{d}$ for $d$ total features).
- Use different splitting criteria (Gini index, Entropy, Misclassification).
There could be other methods that we are not considering here.
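In scikit-learn terms (an illustrative sketch; the library’s own `RandomForestClassifier` bundles the first two ideas), a single diversified tree of the forest could be configured roughly like this and then trained on a bootstrap sample, as in the bagging sketch above:

```python
from sklearn.tree import DecisionTreeClassifier

# One tree of the forest: trained on a bootstrap sample (as in the bagging
# sketch above), with a random subset of features considered at each split
# and, optionally, a different splitting criterion per tree.
tree = DecisionTreeClassifier(
    max_features="sqrt",  # consider ~sqrt(d) features at each split
    criterion="gini",     # could also be "entropy"
)
```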
Boosting
The main difference between boosting and bagging methods is that here the classifiers are trained in sequence.
AdaBoost algorithm 🟨++
The main idea of AdaBoost, which has been a breakthrough in the field of statistical learning, is to train a sequence of classifiers, each of which focuses on the examples that the previous classifiers have misclassified. If a sample has been misclassified in a previous attempt, it will have a higher weight in the next classifier. It is implemented in the following way:
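A minimal sketch of the algorithm, consistent with the derivation below (the classifier weight is written $\gamma_{m}$ there and `alpha` here); the choice of decision stumps as weak learners is just an illustrative assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """AdaBoost for labels y in {-1, 1}; returns the weak learners and their weights."""
    n = len(X)
    w = np.full(n, 1.0 / n)                        # start with uniform sample weights
    classifiers, alphas = [], []
    for m in range(M):
        clf = DecisionTreeClassifier(max_depth=1)  # weak learner (decision stump)
        clf.fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        # Weighted misclassification error of this weak learner.
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = np.clip(err, 1e-12, 1 - 1e-12)       # guard against log(0) below
        # Weight of the weak learner (the gamma_m of the derivation below).
        alpha = 0.5 * np.log((1 - err) / err)
        # Misclassified samples get a larger weight for the next round.
        w = w * np.exp(-alpha * y * pred)
        w = w / np.sum(w)
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    score = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(score)
```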
Two choices may look curious at first: the value of the classifier weights $\alpha_{m}$ (the $\gamma_{m}$ of the derivation below) and the way the sample weights are rescaled. These will be discussed in the following sections.
A philosophical similarity
AdaBoost is just a collection of simple algorithmic steps which, in the larger scheme of things, can be interpreted as optimizing a global objective. This has some commonalities with the philosophical view of Wolfram in (Wolfram 2002).
Forward Stagewise Additive Modeling 🟨++
Forward stagewise additive modeling focuses on building a sequence of classifiers, each of which tries to fit the residual of the previous ones. The algorithm is the following:
- Initialize $f_{0}(x) = 0$
- For $m = 1, \dots, M$:
- Compute the residuals $r_{im} = y_{i} - f_{m-1}(x_{i})$
- Fit a classifier to the residuals $r_{im}$, call it $h_{m}(x)$
- Choose a step size $\gamma_{m}$ by solving the optimization problem
- The above three steps can be summarized as finding $$\arg \min_{\gamma_{m}, h_{m}} \sum_{i = 1}^{N} L(y_{i}, f_{m - 1}(x_{i}) + \gamma_{m}h_{m}(x_{i}))$$
- Update the model $f_{m}(x) = f_{m-1}(x) + \gamma_{m}h_{m}(x)$

The $h_{m}$ are also called basis functions; in this setting, they are the classifiers we are interested in.
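A minimal sketch under the squared loss, where fitting the residuals and the line search for $\gamma_{m}$ both have simple closed forms; the regression-tree base learner is again an illustrative assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fsam_fit(X, y, M=100):
    """Forward stagewise additive modeling with the squared loss."""
    f = np.zeros(len(y))                 # f_0(x) = 0
    basis, gammas = [], []
    for m in range(M):
        r = y - f                        # residuals of the current model
        h = DecisionTreeRegressor(max_depth=2).fit(X, r)
        h_pred = h.predict(X)
        # For the squared loss the optimal step size is a least-squares line search:
        # gamma_m = <r, h> / <h, h>.
        gamma = np.dot(r, h_pred) / np.dot(h_pred, h_pred)
        f = f + gamma * h_pred           # f_m = f_{m-1} + gamma_m * h_m
        basis.append(h)
        gammas.append(gamma)
    return basis, gammas
```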
Adaboost’s energy function 🟥++
We can prove that Adaboost is just a special case of these models with the following Loss function:
$$ L(y, f(x)) = \exp(-yf(x)) $$This has some slight resemblance to Log Linear Models.
Let’s see why. At step $m$, forward stagewise modeling with the exponential loss minimizes, over $\gamma_{m}$ and $h_{m}$, the objective
$$ \sum_{i = 1}^{N} \exp(-y_{i}f_{m - 1}(x_{i}) - y_{i}\gamma_{m}h_{m}(x_{i})) = \sum_{i = 1}^{N} w_{i}^{(m)} \exp(-y_{i}\gamma_{m}h_{m}(x_{i})) $$where we substituted $w_{i}^{(m)} = \exp(-y_{i}f_{m - 1}(x_{i}))$, which is a constant at step $m$ (it does not depend on $\gamma_{m}$ or $h_{m}$). We now observe that this energy (score) can be rewritten as
$$ (e^{\gamma_{m}} - e^{-\gamma_{m}})\sum_{i = 1}^{N} w_{i}^{(m)}\mathbb{1}(h_{m}(x_{i}) \neq y_{i}) + e^{-\gamma_{m}}\sum_{i = 1}^{N} w_{i}^{(m)} $$If we take the derivative with respect to $\gamma_{m}$ and set it to zero,
$$ (e^{\gamma_{m}} + e^{-\gamma_{m}})\sum_{i = 1}^{N} w_{i}^{(m)}\mathbb{1}(h_{m}(x_{i}) \neq y_{i}) - e^{-\gamma_{m}}\sum_{i = 1}^{N} w_{i}^{(m)} = 0, $$we find that the value of $\gamma_{m}$ is
$$ \gamma_{m} = \frac{1}{2} \log \frac{1 - \text{error}_{m}}{\text{error}_{m}} $$Where error is defined as
$$ \text{error}_{m} = \frac{\sum_{i = 1}^{N} w_{i}^{(m)}\mathbb{1}(h_{m}(x_{i}) \neq y_{i})}{\sum_{i = 1}^{N} w_{i}^{(m)}} $$Now we see how we update the weights:
$$ w_{i}^{(m + 1)} = w_{i}^{(m)}\exp(-\gamma_{m}y_{i}h_{m}(x_{i})) $$Then using the observation that
$$ -y_{i}h_{m}(x_{i}) = 2\mathbb{1}(h_{m}(x_{i}) \neq y_{i}) - 1 $$We rewrite the above as
$$ w_{i}^{(m + 1)} = w_{i}^{(m)}\exp(2\gamma_{m}\mathbb{1}(h_{m}(x_{i}) \neq y_{i}))\exp(-\gamma_{m}) $$The factor $\exp(-\gamma_{m})$ is the same for every sample, so it cancels in the weighted error and can be dropped. We have now recovered AdaBoost’s weight update rule in the classification setting.
Is the exponential a Good Loss?
From (Bishop 2006): the exponential loss grows exponentially for large negative values of the margin $yf(x)$, which is a sign of high sensitivity to outliers and mislabelled points. Another note on the exponential loss: it cannot be interpreted as the log-likelihood of any well-defined probability distribution.
Another drawback of the exponential loss compared to the cross-entropy loss is the extension to multi-class classification: while this is easy with cross-entropy, there is no straightforward way to extend the exponential loss to the multi-class setting.
Recall that the cross-entropy loss is the negative log-likelihood:
$$ -\log p(y \mid x) = -\log \prod_{i} p_{i}^{y_{i}} = -\sum_{i} y_{i}\log p_{i} $$In the case of $y \in \left\{ -1, 1 \right\}$, the loss becomes:
$$ \mathcal{L}(f(x), y) = \log(1 + \exp(-yf(x))) $$This logistic loss is the one used in Linear Classification, where the underlying probabilistic model is different from the exponential one.
And the hinge loss is:
$$ \max(0, 1 - yf(x)) $$which is used in Support Vector Machines.
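To make the difference in sensitivity concrete, here is a small numerical comparison of the three losses as a function of the margin $yf(x)$:

```python
import numpy as np

margins = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])    # values of the margin yf(x)

exponential = np.exp(-margins)
logistic = np.log(1 + np.exp(-margins))
hinge = np.maximum(0.0, 1 - margins)

# The exponential loss explodes for badly misclassified points (margin -4),
# while the logistic and hinge losses only grow linearly.
for m, e, l, h in zip(margins, exponential, logistic, hinge):
    print(f"margin={m:+.1f}  exp={e:8.2f}  logistic={l:6.2f}  hinge={h:5.2f}")
```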
Adaboost and Logistic Regression 🟩
What does AdaBoost estimate, and how well? It turns out that AdaBoost estimates (one half of) the log-odds of the posterior probability that a sample has label $y = 1$, just as logistic regression does. We can show that the minimizer of the expected energy is as follows:
$$ f^{*}(x) = \arg\min_{f(x)} \mathbb{E}_{y \mid x}[\exp(-yf(x))] = \frac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} $$The proof is quite straightforward:
$$ \begin{align} \mathop{\mathbb{E}}_{y \mid x}[\exp(-yf(x))] &= P(y = 1 \mid x)\exp(-f(x)) + P(y = -1 \mid x)\exp(f(x)) \\ \frac{\partial}{\partial f(x)} \mathop{\mathbb{E}}_{y \mid x}[\exp(-yf(x))] = 0 &\implies -P(y = 1 \mid x)\exp(-f(x)) + P(y = -1 \mid x)\exp(f(x)) = 0 \\ &\implies \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} = \exp(2f(x)) \\ &\implies f(x) = \frac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} \end{align} $$
This allows us to rewrite the probability of class 1 as:
$$ P(y = 1 \mid x) = \frac{1}{1 + \exp(-2f^{*}(x))} $$This is exactly a sigmoid (up to the factor 2 in the exponent)! Remember that this is related to why, in Logistic Regression, the sigmoid is the natural choice for the link function.
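As a tiny illustration, building on the hypothetical AdaBoost sketch above, the additive score can be turned into a probability estimate with exactly this sigmoid:

```python
import numpy as np

def adaboost_proba(classifiers, alphas, X):
    """Turn the additive score f(x) into P(y = 1 | x) = sigmoid(2 f(x))."""
    f = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return 1.0 / (1.0 + np.exp(-2.0 * f))
```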
References
[1] Bishop “Pattern Recognition and Machine Learning” Springer 2006
[2] Wolfram “A New Kind of Science” Wolfram Media 2002