The idea of ensemble methods goes back to Sir Francis Galton. In 1906, he observed that, although no single person guessed the right value, the average estimate of a crowd of people was remarkably accurate.
The main idea of ensemble methods is to combine relatively weak classifiers into a highly accurate predictor.
The motivation for boosting was a procedure that combines the outputs of many “weak” classifiers to produce a powerful “committee.”
In Buhmann’s opinion, this is also part of why democracy works well. It stops working, however, when too many individuals are correlated with each other.
Some properties of Ensembles
Variance of Ensembles 🟩–
Consider a set of estimators $f_{1}, f_{2}, \dots, f_{N}$ and their average $\hat{f}(x) = \frac{1}{N} \sum_{i = 1}^{N}f_{i}(x)$.
Then the bias of the average is the average of the individual biases, which implies that if the original estimators are unbiased, the average is unbiased too!
$$ \begin{align} \mathop{\mathbb{E}}[\hat{f}(x)] &= \frac{1}{N}\mathop{\mathbb{E}}\left[ \sum_{i = 1}^{N}f_{i}(x) \right] \\ &= \frac{1}{N}\sum_{i = 1}^{N}\mathop{\mathbb{E}}[f_{i}(x)] \\ &= \mathop{\mathbb{E}}[f_{i}(x)] \end{align} $$where the last step assumes the estimators are identically distributed. But if we look at the variance, we see that it decreases when the estimators are uncorrelated with each other! This leads to more accurate estimates.
$$ \begin{align} \text{Var}[\hat{f}(x)] &= \text{Var}\left[ \frac{1}{N}\sum_{i = 1}^{N}f_{i}(x) \right] \\ &= \frac{1}{N^{2}}\sum_{i = 1}^{N}\text{Var}[f_{i}(x)] + \frac{1}{N^{2}}\sum_{i \neq j}\text{Cov}[f_{i}(x), f_{j}(x)] \\ &\approx \frac{1}{N}\text{Var}[f(x)] \end{align} $$where the approximation holds when the covariances vanish and the individual variances are comparable. In reality, the predictors are often correlated, which is why the variance of the ensemble is not as low as we would like.
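As a quick numerical sanity check, here is a minimal simulation sketch: $N$ independent, unbiased noisy estimators of the same quantity, whose average should have roughly $1/N$ of the individual variance.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0
noise_std = 2.0
n_estimators = 50
n_trials = 100_000

# Each row is one trial; each column is one (independent, unbiased) estimator.
estimates = true_value + noise_std * rng.standard_normal((n_trials, n_estimators))
ensemble_average = estimates.mean(axis=1)

print("variance of a single estimator:", estimates[:, 0].var())     # ~ noise_std**2 = 4
print("variance of the ensemble average:", ensemble_average.var())  # ~ 4 / 50 = 0.08
```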
Advantages of Ensembles
Usually weak classifiers are easy to train (the only requirement is that they perform better than random guessing).
Data randomization in the spirit of the Bootstrap captures statistical dependencies between alternative hypotheses: the classifier ensemble encodes uncertainty intervals in the hypothesis class (output space).
Bagging
In this whole section we will assume that the classifiers output $\left\{ -1, 1 \right\}$. The idea of bagging (Breiman 1996) is to produce a diverse set of classifiers using bootstrapped samples (or other methods to induce diversity in the classifiers), and then to take the prediction as the average of the values predicted by each classifier. Techniques of this kind are also known as committee methods.
Bagging Classifiers 🟩
The basic idea of this method is quite simple and builds upon the bootstrap, introduced in Cross Validation and Model Selection. Let $B$ be the number of classifiers to train. Then repeat the following $B$ times:
- Sample the training set with replacement (this is the bootstrap part)
- Train a classifier $C_{b}$ on this sample
- Output the sign of the sum of the predictions as the classification result (see the sketch below). This setting only works for the binary classification problem; note that when the number of classifiers is even the vote can tie, and breaking ties introduces a slight bias.
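A minimal sketch of this bagging procedure, assuming labels in $\{-1, 1\}$, NumPy arrays for the data, and scikit-learn decision stumps as an arbitrary choice of weak classifier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=25, seed=0):
    """Train B classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    classifiers = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # sample with replacement
        clf = DecisionTreeClassifier(max_depth=1)  # a weak learner (decision stump)
        clf.fit(X[idx], y[idx])
        classifiers.append(clf)
    return classifiers

def bagging_predict(classifiers, X):
    """Sum the {-1, 1} predictions and take the sign."""
    votes = sum(clf.predict(X) for clf in classifiers)
    # With an odd number of classifiers the sum is never zero, so no tie-breaking is needed.
    return np.sign(votes)
```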
Compare and Bag 🟩
The idea is very similar to the previous method, but here we train two different ensembles and select one of them based on its performance on a validation set.
Random Forests
The Idea 🟩
This method builds on Decision Trees to grow a whole forest of them. The idea is to create sufficiently diverse classifiers so that we can exploit the observation above on the variance of ensembles.
Diversifying the models 🟩
Ways to diversify the trees are:
- Train them on different bootstrapped samples
- At each split, consider only a random subset of $p$ features (typically $p \approx \sqrt{d}$ for $d$ total features).
- Use different splitting criteria (Gini index, Entropy, Misclassification).
There could be other methods that we are not considering here.
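In scikit-learn terms (an illustrative sketch; the library’s own `RandomForestClassifier` bundles the first two ideas), a single diversified tree of the forest could be configured roughly like this and then trained on a bootstrap sample, as in the bagging sketch above:

```python
from sklearn.tree import DecisionTreeClassifier

# One tree of the forest: trained on a bootstrap sample (as in the bagging
# sketch above), with a random subset of features considered at each split
# and, optionally, a different splitting criterion per tree.
tree = DecisionTreeClassifier(
    max_features="sqrt",  # consider ~sqrt(d) features at each split
    criterion="gini",     # could also be "entropy"
)
```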
Boosting
The main difference between boosting and bagging methods is that here the classifiers are trained in sequence.
AdaBoost algorithm 🟨++
The main idea of AdaBoost, which has been a breakthrough in the field of statistical learning, is to train a sequence of classifiers, each of which focuses on the examples that the previous classifiers have misclassified. If a sample has been misclassified in a previous attempt, it will have a higher weight in the next classifier. It is implemented in the following way:
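A minimal sketch of the algorithm, consistent with the derivation below (the classifier weight is written $\gamma_{m}$ there and `alpha` here); the choice of decision stumps as weak learners is just an illustrative assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """AdaBoost for labels y in {-1, 1}; returns the weak learners and their weights."""
    n = len(X)
    w = np.full(n, 1.0 / n)                        # start with uniform sample weights
    classifiers, alphas = [], []
    for m in range(M):
        clf = DecisionTreeClassifier(max_depth=1)  # weak learner (decision stump)
        clf.fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        # Weighted misclassification error of this weak learner.
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = np.clip(err, 1e-12, 1 - 1e-12)       # guard against log(0) below
        # Weight of the weak learner (the gamma_m of the derivation below).
        alpha = 0.5 * np.log((1 - err) / err)
        # Misclassified samples get a larger weight for the next round.
        w = w * np.exp(-alpha * y * pred)
        w = w / np.sum(w)
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    score = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(score)
```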
Two choices may look curious at first: the value of the classifier weights $\alpha_{m}$ (the $\gamma_{m}$ of the derivation below) and the way the sample weights are rescaled. These will be discussed in the following sections.
A philosophical similarity
AdaBoost is just a collection of simple algorithmic steps which, in the larger scheme of things, can be interpreted as optimizing a global objective. This has some commonalities with the philosophical view of Wolfram in (Wolfram 2002).
Forward Stagewise Additive Modeling 🟨++
Forward stagewise additive modeling focuses on building a sequence of classifiers, each of which tries to fit the residual of the previous ones. The algorithm is the following:
- Initialize $f_{0}(x) = 0$
- For $m = 1, \dots, M$:
- Compute the residuals $r_{im} = y_{i} - f_{m-1}(x_{i})$
- Fit a classifier to the residuals $r_{im}$, call it $h_{m}(x)$
- Choose a step size $\gamma_{m}$ by solving the optimization problem
- The above three steps can be summarized as finding $$\arg \min_{\gamma_{m}, h_{m}} \sum_{i = 1}^{N} L(y_{i}, f_{m - 1}(x_{i}) + \gamma_{m}h_{m}(x_{i}))$$
- Update the model $f_{m}(x) = f_{m-1}(x) + \gamma_{m}h_{m}(x)$

The $h_{m}$ are also called basis functions; in this setting, they are the classifiers we are interested in.
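A minimal sketch under the squared loss, where fitting the residuals and the line search for $\gamma_{m}$ both have simple closed forms; the regression-tree base learner is again an illustrative assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fsam_fit(X, y, M=100):
    """Forward stagewise additive modeling with the squared loss."""
    f = np.zeros(len(y))                 # f_0(x) = 0
    basis, gammas = [], []
    for m in range(M):
        r = y - f                        # residuals of the current model
        h = DecisionTreeRegressor(max_depth=2).fit(X, r)
        h_pred = h.predict(X)
        # For the squared loss the optimal step size is a least-squares line search:
        # gamma_m = <r, h> / <h, h>.
        gamma = np.dot(r, h_pred) / np.dot(h_pred, h_pred)
        f = f + gamma * h_pred           # f_m = f_{m-1} + gamma_m * h_m
        basis.append(h)
        gammas.append(gamma)
    return basis, gammas
```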
Adaboost’s energy function 🟥++
We can prove that Adaboost is just a special case of these models with the following Loss function:
$$ L(y, f(x)) = \exp(-yf(x)) $$This has some slight resemblance to Log Linear Models.
Let’s see why. At step $m$, forward stagewise modeling with the exponential loss minimizes, over $\gamma_{m}$ and $h_{m}$, the objective
$$ \sum_{i = 1}^{N} \exp(-y_{i}f_{m - 1}(x_{i}) - y_{i}\gamma_{m}h_{m}(x_{i})) = \sum_{i = 1}^{N} w_{i}^{(m)} \exp(-y_{i}\gamma_{m}h_{m}(x_{i})) $$where we substituted $w_{i}^{(m)} = \exp(-y_{i}f_{m - 1}(x_{i}))$, which is a constant at step $m$ (it does not depend on $\gamma_{m}$ or $h_{m}$). We now observe that this energy (score) can be rewritten as
$$ (e^{\gamma_{m}} - e^{-\gamma_{m}})\sum_{i = 1}^{N} w_{i}^{(m)}\mathbb{1}(h_{m}(x_{i}) \neq y_{i}) + e^{-\gamma_{m}}\sum_{i = 1}^{N} w_{i}^{(m)} $$If we take the derivative with respect to $\gamma_{m}$ and set it to zero,
$$ (e^{\gamma_{m}} + e^{-\gamma_{m}})\sum_{i = 1}^{N} w_{i}^{(m)}\mathbb{1}(h_{m}(x_{i}) \neq y_{i}) - e^{-\gamma_{m}}\sum_{i = 1}^{N} w_{i}^{(m)} = 0, $$we find that the value of $\gamma_{m}$ is
$$ \gamma_{m} = \frac{1}{2} \log \frac{1 - \text{error}_{m}}{\text{error}_{m}} $$Where error is defined as
$$ \text{error}_{m} = \frac{\sum_{i = 1}^{N} w_{i}^{(m)}\mathbb{1}(h_{m}(x_{i}) \neq y_{i})}{\sum_{i = 1}^{N} w_{i}^{(m)}} $$Now we see how we update the weights:
$$ w_{i}^{(m + 1)} = w_{i}^{(m)}\exp(-\gamma_{m}y_{i}h_{m}(x_{i})) $$Then using the observation that
$$ -y_{i}h_{m}(x_{i}) = 2\mathbb{1}(h_{m}(x_{i}) \neq y_{i}) - 1 $$We rewrite the above as
$$ w_{i}^{(m + 1)} = w_{i}^{(m)}\exp(2\gamma_{m}\mathbb{1}(h_{m}(x_{i}) \neq y_{i}))\exp(-\gamma_{m}) $$The factor $\exp(-\gamma_{m})$ is the same for every sample, so it cancels in the weighted error and can be dropped. We have now recovered AdaBoost’s weight update rule in the classification setting.
Is the exponential a Good Loss?
From (Bishop 2006): the exponential loss grows exponentially for large negative values of the margin $yf(x)$, which is a sign of high sensitivity to outliers and mislabelled points. Another note on the exponential loss: it cannot be interpreted as the log-likelihood of any well-defined probability distribution.
Another drawback of the exponential loss compared to the cross-entropy loss is the extension to multi-class classification: while this is easy with cross-entropy, there is no straightforward way to extend the exponential loss to the multi-class setting.
Recall that the cross-entropy loss is the negative log-likelihood:
$$ -\log p(y \mid x) = -\log \prod_{i} p_{i}^{y_{i}} = -\sum_{i} y_{i}\log p_{i} $$In the case of $y \in \left\{ -1, 1 \right\}$, the loss becomes:
$$ \mathcal{L}(f(x), y) = \log(1 + \exp(-yf(x))) $$This logistic loss is the one used in Linear Classification, where the underlying probabilistic model is different from the exponential one.
And the hinge loss is:
$$ \max(0, 1 - yf(x)) $$which is used in Support Vector Machines.
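To make the difference in sensitivity concrete, here is a small numerical comparison of the three losses as a function of the margin $yf(x)$:

```python
import numpy as np

margins = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])    # values of the margin yf(x)

exponential = np.exp(-margins)
logistic = np.log(1 + np.exp(-margins))
hinge = np.maximum(0.0, 1 - margins)

# The exponential loss explodes for badly misclassified points (margin -4),
# while the logistic and hinge losses only grow linearly.
for m, e, l, h in zip(margins, exponential, logistic, hinge):
    print(f"margin={m:+.1f}  exp={e:8.2f}  logistic={l:6.2f}  hinge={h:5.2f}")
```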
Adaboost and Logistic Regression 🟩
What does AdaBoost estimate, and how well? It turns out that AdaBoost estimates (one half of) the log-odds of the posterior probability that a sample has label $y = 1$, just as logistic regression does. We can show that the minimizer of the expected energy is as follows:
$$ f^{*}(x) = \arg\min_{f(x)} \mathbb{E}_{y \mid x}[\exp(-yf(x))] = \frac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} $$The proof is quite straightforward:
$$ \begin{align} \mathop{\mathbb{E}}_{y \mid x}[\exp(-yf(x))] &= P(y = 1 \mid x)\exp(-f(x)) + P(y = -1 \mid x)\exp(f(x)) \\ \frac{\partial}{\partial f(x)} \mathop{\mathbb{E}}_{y \mid x}[\exp(-yf(x))] = 0 &\implies -P(y = 1 \mid x)\exp(-f(x)) + P(y = -1 \mid x)\exp(f(x)) = 0 \\ &\implies \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} = \exp(2f(x)) \\ &\implies f(x) = \frac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} \end{align} $$
This allows us to rewrite the probability of class 1 as:
$$ P(y = 1 \mid x) = \frac{1}{1 + \exp(-2f^{*}(x))} $$This is exactly a sigmoid (up to the factor 2 in the exponent)! Remember that this is related to why, in Logistic Regression, the sigmoid is the natural choice for the link function.
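As a tiny illustration, building on the hypothetical AdaBoost sketch above, the additive score can be turned into a probability estimate with exactly this sigmoid:

```python
import numpy as np

def adaboost_proba(classifiers, alphas, X):
    """Turn the additive score f(x) into P(y = 1 | x) = sigmoid(2 f(x))."""
    f = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return 1.0 / (1.0 + np.exp(-2.0 * f))
```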
References
[1] Bishop “Pattern Recognition and Machine Learning” Springer 2006
[2] Wolfram “A New Kind of Science” Wolfram Media 2002