Advanced Machine Learning


     

Lecturer: Dr. HAS Sothea

Code: AMSI61AML

🔗 Ensemble Learning

🗺️ Content

  • Motivation & Tools

  • Bagging: Random Forest

  • Boosting: Adaboost, XGBoost…

  • Consensual aggregation: COBRA, KernelCOBRA, MixCOBRA

  • Stacking: SuperLearner

Motivation & Tools

Motivation

🤔 Have you ever thought about:

  • Why do we trust a panel of judges more than a single judge in competitions?

  • In medicine, why do doctors often seek second or third opinions for complex cases?

  • Have you noticed that weather forecasts often give a probability of rain rather than a simple yes/no prediction?

  • Does the famous “Wisdom of Crowds” principle apply to machine learning?

Motivation & Tools

Tools

  • Ensemble Learning (EL): Combining several base learners/models is possible thanks to different theories of
    • Probability: Bayesian inference, Law of Large Numbers, Conditional Probability…
    • Statistics: Bias-variance trade-off, bootstrap theory…
    • Computational Learning: Probably Approximately Correct (PAC), Gradient-based optimization, and more…
  • We’re going to explore \(3\) main EL methods:
    • Bagging: Combine nearly decorrelated, high-variance models.

    • Boosting: Sequentially combine weak learners.

    • Stacking: The combination is learned from the base learners’ predicted features.

Bagging: Bootstrap Aggregating

Purpose

  • Reduce the variance of high-variance base learners (trees) to produce a more stable and accurate final model.

Methodology

  • Bootstrap Sampling: Generate multiple subsets of the training data by sampling with replacement.
  • Train Models: Train a separate model (typically the same type) on each bootstrap sample.
  • Aggregate Predictions: Combine the predictions of all models by averaging (for regression) or majority voting (for classification).

Pseudocode

  • Number of trees: \(T\)
  • for t=1,...,T:
    • Sampling: \(B_t\) from training data \(\cal D\).
    • Training models: \(f_t\) on \(B_t\)
  • Predictions:
    • Regression: \[\begin{align*}\color{blue}{\hat{y}}&\color{blue}{=\frac{1}{T}\sum_{t=1}^Tf_t(x)}\\ &\color{blue}{=\text{Averaging.}}\end{align*}\]
    • Classification: \[\begin{align*}\color{blue}{\hat{y}}&\color{blue}{=\arg\max_{1\leq k\leq M}\sum_{t=1}^T\mathbb{1}_{\{f_t(x)=k\}}}\\ &\color{blue}{=\text{Majority vote.}}\end{align*}\]
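
A minimal sketch of this pseudocode in Python for the regression case, assuming scikit-learn and NumPy are available (the synthetic data is only a placeholder for a real training set):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Placeholder data standing in for the training set D
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

T = 100                                          # number of trees
rng = np.random.default_rng(0)
trees = []
for t in range(T):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample B_t (with replacement)
    f_t = DecisionTreeRegressor()                # high-variance base learner (deep tree)
    f_t.fit(X[idx], y[idx])
    trees.append(f_t)

# Aggregate predictions by averaging (regression case)
x_new = X[:5]
y_hat = np.mean([f_t.predict(x_new) for f_t in trees], axis=0)
print(y_hat)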

Bagging: Bootstrap Aggregating

Why does it work?

  • Bias-variance trade-off: Assuming \(y_i=f(\text{x}_i)+\varepsilon_i\) where \(\varepsilon_i\overset{iid}{\sim}{\cal N}(0,\sigma^2)\), then for any model \(\hat{f}\) built using training data \(\cal D\), we can decompose MSE of \(\hat{f}\) at any fixed input \(\text{x}_0\) as \[\begin{align*} \mathbb{E}_{\cal D}[(\hat{f}(\text{x}_0)-y_0)^2]&=\mathbb{E}_{\cal D}[(\hat{f}(\text{x}_0)-\mathbb{E}[\hat{f}(\text{x}_0)])^2] + \mathbb{E}_{\cal D}[(\mathbb{E}[\hat{f}(\text{x}_0)]-f(\text{x}_0))^2] + \sigma^2\\ &=\underbrace{\mathbb{V}(\hat{f}(\text{x}_0))}_{\color{blue}{\text{Flexibility of }\hat{f}}}+\underbrace{(\text{Bias})^2}_{\color{darkgreen}{\text{How far }\hat{f}\text{ from } f}}+\underbrace{\sigma^2}_{\color{red}{\text{Uncontrollable Term}}}. \end{align*}\]

  • Bagging seeks to balance these terms by averaging nearly independent, high-variance models to produce a more stable predictive model.

  • By Jensen’s inequality, if \(\color{blue}{\hat{f}(\text{x}_0)=\frac{1}{T}\sum_{t=1}^Tf_t(\text{x}_0)}\), then \(\mathbb{E}_{\cal D}\left[\color{blue}{(\hat{f}(\text{x}_0)-y_0)^2}\right]\leq \frac{1}{T}\sum_{t=1}^T\mathbb{E}_{\cal D}\left[\color{red}{(f_t(\text{x}_0)-y_0)^2}\right],\) i.e. the averaged model is at least as good as the base learners on average; when the base learners are nearly uncorrelated, the variance term shrinks roughly like \(1/T\).
  • Bagging is well suited to:
    • High-variance base models: deep trees, \(K\)-NN with small \(K\)…
    • Settings where simplicity and scalability matter (trees can be trained in parallel)…

Bagging: Bootstrap Aggregating

Random Forest

  • In practice, plain Bagging does not work as well as hoped because the base learners (trees) are not as decorrelated as expected (the bootstrap samples themselves are highly correlated).
  • Random Forest: (What a cool name! 😎)
    • Number of trees: \(T\)
    • for t = 1,2,...,T:
      • Bootstrap sampling: sample \(B_t\) with replacement from \(\cal D\).
      • Random Features: at each split, consider only a random subset of the \(d\) input features (typically \(\sqrt{d}\) for classification, \(d/3\) for regression).
      • Build tree \(f_t\) on \(B_t\) using these feature-restricted splits.
    • Prediction: (same as before).
  • It works well because the random feature selection injects extra randomness on top of bootstrap sampling, further decorrelating the trees.
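
A hedged sketch of a Random Forest with scikit-learn’s RandomForestRegressor (the hyperparameter values are illustrative, not recommendations):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(
    n_estimators=300,      # T: number of trees
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
rf.fit(X_tr, y_tr)
print("Test R^2:", rf.score(X_te, y_te))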

Bagging: Bootstrap Aggregating

Cool things in bagging / random forest

Out-Of-Bag (OOB) Error

  • For each bootstrap sample \(B_t\), the observations \(\text{x}_i\notin B_t\) are called Out-Of-Bag (OOB) samples (\(\approx 37\%\) of the data).
  • Tree’s OOB Error: The average error for a tree, calculated using its out-of-bag samples.
  • OOB Sample Error: The average error for a given observation, computed from all trees whose bootstrap samples did not include that observation.
  • Overall or Average OOB Error: The average of all OOB Sample Errors, providing an overall measure of the model’s performance.
  • The Overall OOB Error can be used as an approximation of cross-validation error. However, it tends to overestimate the true error in classification according to Silke Janitza and Roman Hornung (2018).
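
A short sketch of how the Overall OOB Error comes for free in scikit-learn (reported as an OOB \(R^2\) for regression; the data are illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)
# oob_score_ is computed only on the out-of-bag samples of each tree,
# so no separate validation set is needed
print("OOB R^2:", rf.oob_score_)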

Bagging: Bootstrap Aggregating

Cool things in bagging / random forest

Feature Importances (FI)

Mean Decrease in Impurity (MDI)

  • The Reduction of the Impurity Measure when splitting on \(X_j\) at split \(t\) is denoted by \[\text{RIM}_t(X_j)=\text{Imp}_{t-1}(X_j)-\text{Imp}_t(X_j).\]
  • The importance of feature \(X_j\) within a tree: \(\sum_{t\in S_j}\text{RIM}_t(X_j),\) where \(S_j\) is the set of indices \(t\) at which a split is performed on \(X_j\).
  • The unnormalized FI of variable \(X_j\) is the sum of its within-tree FIs over all the built trees.
  • ⚠️ Might be biased toward high cardinality features and might not be accurate with highly correlated features.

Permutation Feature Importance

  • Train the initial random forest model, then compute its validation (or OOB) error, denoted by \(\text{Er}_0\).
  • For each \(X_j\), shuffle its column in the validation data, then recompute the validation error \(\text{Er}_j\) with the already trained model (no retraining).
  • The importance score for \(X_j\) is proportional to \(\text{Er}_j-\text{Er}_0\): the larger the increase in error after shuffling, the more important the feature.
  • It reflects a more direct influence of each feature on the model’s predictions.
  • ⚠️ Might be sensitive to data splits and a bit computationally expensive.
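
The two flavours of feature importance, sketched with scikit-learn (illustrative data; note that the permutation version reuses the already trained model and only shuffles the validation columns):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=5.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# MDI: accumulated impurity reduction per feature, computed during training
print("MDI importances:", rf.feature_importances_)

# Permutation importance: drop in validation score after shuffling each column,
# evaluated with the trained model (no retraining)
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
print("Permutation importances:", perm.importances_mean)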

Bagging: Bootstrap Aggregating

Extra-trees method

  • Number of trees: \(T\)
  • for t = 1,2,...,T:
    • Build tree: \(f_t\) is built on the full dataset (no bootstrap by default), but the splits are chosen differently. At each split,
      • Random Features: a random subset \(S_t\) of the full input features is considered.
      • Random Split Points: for each candidate feature, a cut-point is drawn at random, and the best of these random splits is kept.
  • Prediction: (same as before).
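
A sketch with scikit-learn’s ExtraTreesRegressor (illustrative settings); by default it uses the full training set and draws split thresholds at random instead of searching for the best one:

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

et = ExtraTreesRegressor(n_estimators=300, random_state=0)  # bootstrap=False by default
print("CV R^2:", cross_val_score(et, X, y, cv=5).mean())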

Bagging: Bootstrap Aggregating

Numerical experiment

First five rows of the Abalone dataset:

   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7


👉 Check out the notebook.

Boosting

General framework

  • Boosting combines weak learners (models that perform only slightly better than random guessing, e.g., trees with a few splits, a.k.a. stumps) to create a strong model; it was first introduced by Robert E. Schapire (1990).
  • Main framework: Sequentially train a new weak learner to correct mispredicted points of the previous learners. The resulting model is the weighted average of all the trained learners.
  • Form: \(\hat{f}(x)=\sum_{t=1}^Tw_tf_t(x)\), where \(f_t\) are weak learners built on data reweighted by \(D_t\).
  • It focuses more on reducing bias than on reducing variance.

Boosting

Algorithm

  • Number of trees: \(T\)
  • for t = 1, 2,..., T:
    • Build weak learner \(f_t\) by minimizing the weighted error \(\varepsilon_t\).
    • Learn and adjust weights:
      • Compute the learner weight \(w_t>0\) from the error on data weighted by \(\{D_t(i):i=1,\dots,n\}\).
      • Update the sample weights \(D_t(i)\to D_{t+1}(i)\) for each observation \(i\).
  • Combined model: \(\color{blue}{\hat{f}(x)=\sum_{t=1}^Tw_tf_t(x)}\).

Boosting

Adaboost: Adaptive Boosting

  • It’s for binary classification with \((\text{x}_i,y_i)\in\mathbb{R}^d\times\{-1,1\},i=1,\dots,n\).
  • Initialize the weight \(D_1(i)=1/n\) for all sample \(\text{x}_i\).
  • for t = 1, . . . , T:
    • Train the base classifier \(f_t\) by minimizing the weighted error \[\varepsilon_t=\mathbb{P}_{D_t}(f_t(X)\neq Y)=\sum_{i=1}^nD_t(i)\mathbb{1}_{\{f_t(\text{x}_i)\neq y_i\}}.\]
    • Calculate the learner weight \(w_t=\frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right).\)
    • Update the sample weights using \(D_{t+1}(i)=\frac{D_t(i)}{Z_t}e^{-w_ty_if_t(\text{x}_i)}\), where \(Z_t\) is the normalizing constant ensuring \(\sum_{i=1}^nD_{t+1}(i)=1\).
  • The final model: \(\color{blue}{\hat{f}(\text{x})=\text{sign}(\sum_{t=1}^Tw_tf_t(\text{x}))}\). (Read: Freund & Schapire (1999))
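
A from-scratch sketch of these update rules with decision stumps as weak learners (in practice one would rather call sklearn.ensemble.AdaBoostClassifier; the synthetic data and \(T=50\) are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y = 2 * y - 1                      # recode labels to {-1, +1}
n, T = len(X), 50

D = np.full(n, 1 / n)              # initial sample weights D_1(i) = 1/n
learners, w = [], []
for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=D)
    pred = stump.predict(X)
    eps = np.sum(D * (pred != y))                  # weighted error
    w_t = 0.5 * np.log((1 - eps) / (eps + 1e-12))  # learner weight
    D = D * np.exp(-w_t * y * pred)                # reweight the samples
    D = D / D.sum()                                # normalize (Z_t)
    learners.append(stump)
    w.append(w_t)

# Final model: sign of the weighted vote
F = sum(w_t * f_t.predict(X) for w_t, f_t in zip(w, learners))
print("Training accuracy:", np.mean(np.sign(F) == y))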

Boosting

XGBoost: EXtreme Gradient Boosting

  • It is a regularized, second-order variant of Gradient Boosting: \(F_M(\text{x})=\sum_{t=1}^Mw_tf_t(\text{x})\).
  • Let \(L\) be the loss function of the problem:
    • Initial model: \(F_0=\arg\min_{c}\sum_{i=1}^nL(y_i,c).\)
    • for t = 1, 2, ..., M:
      • Compute pseudo-residuals: \(r_{t,i}=-\frac{\partial L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)}\Big|_{F=F_{t-1}}\).
      • Train a base learner \(\color{blue}{f_t}\) on the new data \(\{(\text{x}_i,r_{t,i}):i=1,\dots,n\}\).
      • Solve for the step size \(\color{blue}{w_t}\): \[\color{blue}{w_t}=\arg\min_{\color{blue}{w}}\sum_{i=1}^nL(y_i,F_{t-1}(\text{x}_i)+\color{blue}{w\,f_t(\text{x}_i)}).\]
      • Update aggregated model: \(F_t(\text{x})=F_{t-1}(\text{x})+\color{blue}{w_tf_t(\text{x})}\).
  • In XGBoost:

\[\begin{align*}{\cal L}(F_t)&=\sum_{i=1}^nL(y_i,F_t(\text{x}_i))+\sum_{m=1}^t\Omega(f_m),\qquad \Omega(f)=\gamma T_f+\frac{1}{2}\lambda\|w\|^2,\\ {\cal L}(F_t)&\approx \sum_{i=1}^n\Big[L(y_i,F_{t-1}(\text{x}_i))+g_if_t(\text{x}_i)+\frac{1}{2}h_if_t^2(\text{x}_i)\Big]+\Omega(f_t),\\ g_i&=\frac{\partial L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)}\Big|_{F=F_{t-1}},\qquad h_i=\frac{\partial^2 L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)^2}\Big|_{F=F_{t-1}},\end{align*}\]
where \(T_f\) is the number of leaves of the tree \(f\) and \(w\) its vector of leaf weights.
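
A usage sketch with the xgboost package’s scikit-learn style API, assuming the package is installed; reg_lambda and gamma play the roles of \(\lambda\) and \(\gamma\) in \(\Omega\), and the remaining hyperparameters are illustrative:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

xgb = XGBRegressor(
    n_estimators=300,     # M: number of boosting rounds
    learning_rate=0.05,   # shrinkage applied to each added tree
    max_depth=4,          # depth of each weak learner
    reg_lambda=1.0,       # L2 penalty on leaf weights (lambda)
    gamma=0.0,            # penalty per additional leaf (gamma)
)
xgb.fit(X_tr, y_tr)
print("Test R^2:", xgb.score(X_te, y_te))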

Boosting

Other variants

  • LightGBM: Light Gradient Boosting Machine, by researchers at Microsoft (2017). It works for regression, classification and ranking problems. Main improvements:
    • Gradient-based One-Side Sampling (GOSS)
    • Exclusive Feature Bundling (EFB),

which together ensure the efficiency and accuracy of the method.

  • CatBoost: Efficient with categorical features (automatic encoding) and an unbiased estimate of the gradient at each iteration via ordered boosting (see: Prokhorenkova et al. (2017)).
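
Hedged usage sketches, assuming the lightgbm and catboost packages are installed (hyperparameters are illustrative):

from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

lgbm = LGBMRegressor(n_estimators=300, learning_rate=0.05).fit(X, y)
# For real categorical columns, CatBoost can take their indices via cat_features=[...]
cat = CatBoostRegressor(iterations=300, learning_rate=0.05, verbose=0).fit(X, y)
print(lgbm.predict(X[:3]), cat.predict(X[:3]))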

Boosting

Numerical Experiment: Abalone Dataset




👉 Check out the notebook.

Consensual Aggregation

General framework

  • Consensual Aggregation combines \(M\) base learners \(f_1,\dots,f_M\) according to how “close” the predictions given by these learners are.
  • Roughly speaking, it applies nonparametric methods to the predicted features: \(\widehat{\text{x}}=(f_1(\text{x}),\dots,f_M(\text{x}))\).
  • General form: \(\hat{y}=\sum_{i=1}^nW_{n,i}(\color{red}{\widehat{\text{x}}})y_i\)

\[\text{x}\overset{f_1,\dots,f_M}{\mapsto}\color{red}{\widehat{\text{x}}=(f_1(\text{x}),\dots,f_M(\text{x}))}\mapsto \hat{y}=\sum_{i=1}^nW_{n,i}(\color{red}{\widehat{\text{x}}})y_i.\]

Consensual Aggregation

Consensual Classifier: Mojirsheibani (1999)

  • Suppose we have 3 binary classifiers: \(C=(C_1, C_2, C_3)\) with training data:
Id            \(C_1\)   \(C_2\)   \(C_3\)   \(Y\)
1             \(0\)     \(1\)     \(1\)     \(1\)
2             \(1\)     \(1\)     \(0\)     \(\color{blue}{1}\)
3             \(0\)     \(0\)     \(0\)     \(0\)
4             \(1\)     \(1\)     \(0\)     \(\color{blue}{1}\)
5             \(1\)     \(1\)     \(0\)     \(\color{red}{0}\)
\(\text{x}\)  \(1\)     \(1\)     \(0\)     \(?\)
  • The prediction for \(\text{x}\) is the majority class among the training points whose base predictions coincide with those of \(\text{x}\) (rows 2, 4 and 5 above): \(\hat{y}=1\).
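
A tiny Python sketch reproducing this toy example (the arrays simply re-encode the table above):

import numpy as np

# Predictions of C1, C2, C3 on the five training points, and their labels Y
preds = np.array([[0, 1, 1],
                  [1, 1, 0],
                  [0, 0, 0],
                  [1, 1, 0],
                  [1, 1, 0]])
Y = np.array([1, 1, 0, 1, 0])

x_pred = np.array([1, 1, 0])             # base predictions for the new point x
match = np.all(preds == x_pred, axis=1)  # points whose predictions agree with those of x (rows 2, 4, 5)
y_hat = int(np.round(Y[match].mean()))   # majority vote among the matching points
print(y_hat)                             # -> 1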

Consensual Aggregation

COBRA by Biau et al. (2016)
Gradient COBRA by Has (2023)

  • Given base regressors: \({\bf r}=(r_1,\dots,r_M)\), the combination takes the form: \[\hat{y}=\sum_{i=1}^nW_{n,i}(\text{x})y_i.\]

  • Training such a model is equivalent to finding an optimal smoothing parameter \(h\) minimizing the \(K\)-fold cross-validation error:

\[\varphi(h)=\frac{1}{K}\sum_{k=1}^K\sum_{(\text{x}_i,y_i)\in F_k}(\hat{y}_i-y_i)^2,\]

where \(F_k\) denotes the \(k\)-th validation fold.
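
A minimal COBRA-style sketch under these definitions (the data split, the choice of base regressors and the bandwidth h are all illustrative; a real implementation would tune h by minimizing \(\varphi(h)\) as above):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=8, noise=5.0, random_state=0)
# One half trains the base regressors, the other half is kept for the aggregation
X_b, X_a, y_b, y_a = train_test_split(X, y, test_size=0.5, random_state=0)

bases = [LinearRegression().fit(X_b, y_b), DecisionTreeRegressor(max_depth=5).fit(X_b, y_b)]
P_a = np.column_stack([r.predict(X_a) for r in bases])  # predicted features on the aggregation set

def cobra_predict(x, h=10.0):
    """Average the y_i whose base predictions all lie within h of those of x."""
    p_x = np.array([r.predict(x.reshape(1, -1))[0] for r in bases])
    keep = np.all(np.abs(P_a - p_x) <= h, axis=1)       # consensus on every base prediction
    return y_a[keep].mean() if keep.any() else y_a.mean()

print(cobra_predict(X_a[0]))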

Consensual Aggregation

MixCOBRA by Fischer & Mougeot (2019)

  • Given base regressors \({\bf r}=(r_1,\dots,r_M)\), the combination takes the same form, but the weights now depend on both the input \(\text{x}\) and the predicted features: \[\hat{y}=\sum_{i=1}^nW_{n,i}(\text{x},\widehat{\text{x}})y_i.\]

  • In this case, we search for the pair of smoothing parameters \((\alpha,\beta)\) (for the input and the predicted features, respectively) minimizing the cross-validation error:

\[\varphi(\alpha,\beta)=\frac{1}{K}\sum_{k=1}^K\sum_{(\text{x}_i,y_i)\in F_k}(\hat{y}_i-y_i)^2.\]

Consensual Aggregation

Numerical Experiment: Abalone Dataset




👉 Check out the notebook.

Stacking

General framework

  • Stacking stands on the idea that, given \(M\) base learners \(f_1,\dots,f_M\), the predicted features \(\widehat{\text{x}}_i=(f_1(\text{x}_i),\dots,f_M(\text{x}_i))\) should already be close to the target \(y_i\).
  • Goal: use these predicted features \(\widehat{\text{x}}_i\) to predict \(y_i\).
  • Meta-model: a learner \(f\) trained on the \(\widehat{\text{x}}_i\) to predict \(y_i\).

\[\text{x}_i\overset{f_1,\dots,f_M}{\mapsto}\color{red}{\widehat{\text{x}}_i=(f_1(\text{x}_i),\dots,f_M(\text{x}_i))}\overset{\text{Meta-learner }f}{\mapsto} \hat{y}_i=f(\color{red}{\widehat{\text{x}}_i}).\]
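
A stacking sketch with scikit-learn’s StackingRegressor (base learners, meta-learner and cv value are illustrative); since the meta-learner is fitted on out-of-fold predictions of the base learners, this is close in spirit to the Super Learner described next:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=Ridge(),   # meta-learner f acting on the predicted features
    cv=5,                      # out-of-fold predictions for training the meta-learner
)
print("CV R^2:", cross_val_score(stack, X, y, cv=5).mean())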

Stacking

Super Learner by M. J. Van der Laan (2007)

  • It can be used for both classification and regression. The meta-learner is trained on cross-validated (out-of-fold) predictions of the base learners, which avoids overfitting due to information leakage.

Stacking

Numerical Experiment: Abalone Dataset




👉 Check out the notebook.

Summary

  • Ensemble learning: combine base learners to create a stronger predictor.

  • Bagging: Combines high-variance base learners (trees) to produce a more stable and accurate final model.

  • Boosting: Sequentially combines weak learners, each aiming to correct the mistakes of the previously built combined learner, to create a strong model.

  • Consensual Aggregation: Combines different learners based on the consensus of their predictions (predicted features).

  • Stacking: Build meta-learners using predicted features of base learners to predict the target.

🥳 It’s party time 🥂









📋 View party menu here: Party 5 Menu.

🫠 Download party invitation here: Party 5 Invitation Letter.