🔗 Ensemble Learning Methods


ITM 390 004: Machine Learning

Lecturer: Dr. Sothea HAS

🗺️ Content

  • Motivation & Introduction

  • Bootstrap Aggregating: Bagging

  • Random Forest

  • Boosting: Adaboost & XGBoost

  • Consensual aggregation: Gradient COBRA

Motivation & Introduction

Motivation & Introduction

Recap

  • Recap:
    • Parametric models:
      • Assume input-to-target relation: \(X\rightarrow y\).
        • Linear regression: \(\hat{y}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{x}_1+\dots+\color{blue}{\beta_d}\text{x}_d\).
        • Logistic regression: \(p(\color{blue}{Y=1}|\text{X}=\text{x})=\sigma(\color{blue}{\beta_0}+\sum_{j=1}^d\color{blue}{\beta_j}\text{x}_j)\).
    • Nonparametric models:
      • Do not assume any input-output form.
      • Predict based only on neighboring data.
        • \(k\)-Nearest Neighbors: neighbors are based on Euclidean distance.
        • Decision tree: neighbors are based on block/region.
  • 🤔 What remains to be explored?

Motivation & Introduction

Motivation

  • Why do we trust a panel of judges more than a single judge in competitions?

  • In medicine, why do doctors often seek second or third opinions for complex cases?

  • Have you noticed that weather forecasts often give a probability of rain rather than a simple yes/no prediction?

  • Does the famous “Wisdom of Crowds” principle apply to machine learning?

  • The key idea of ensemble learning is to combine multiple base models/learners to create a better/stronger one.


Motivation & Tools

Main methods

  • We’re going to explore \(3\) main ensemble learning (EL) methods:
    • Bagging: Combine nearly decorrelated, high-variance models.

    • Boosting: Sequentially combine weak learners to create a strong final model.

    • Stacking: Combine models based on their predicted features.

Bootstrap Aggregating: Bagging

Bagging: Bootstrap Aggregating

Purpose

  • Reduce the variance of high-variance base learners (e.g., deep trees) to produce a more stable and accurate final model.

Methodology

  • Bootstrap Sampling: Generate multiple subsets of the training data by sampling with replacement.
  • Train Models: Train a separate model (typically the same type) on each bootstrap sample.
  • Aggregate Predictions: Combine the predictions of all models by averaging (for regression) or majority voting (for classification).

Pseudocode

  • Number of trees: \(T\)
  • for t=1,...,T:
    • Sampling: \(B_t\) from training data \(\cal D\).
    • Training models: \(f_t\) on \(B_t\)
  • Predictions:
    • Regression: \[\begin{align*}\color{blue}{\hat{y}}&\color{blue}{=\frac{1}{T}\sum_{t=1}^Tf_t(x)}\\ &\color{blue}{=\text{Averaging.}}\end{align*}\]
    • Classification: \[\begin{align*}\color{red}{\hat{y}}&\color{red}{=\arg\max_{1\leq k\leq M}\sum_{t=1}^T\mathbb{1}_{\{f_t(x)=k\}}}\\ &\color{red}{=\text{Majority vote.}}\end{align*}\]
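
To make the pseudocode concrete, here is a minimal from-scratch sketch for the regression case, using scikit-learn decision trees as high-variance base learners. Function names and defaults are illustrative, not the course’s reference implementation.

```python
# Minimal bagging sketch (regression): T trees, each trained on a bootstrap sample.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def bagging_fit(X, y, T=100, random_state=0):
    """Train T trees, each on a bootstrap sample B_t of the training data."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    trees = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)       # sampling with replacement
        tree = DecisionTreeRegressor()         # high-variance base learner (deep tree)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def bagging_predict(trees, X):
    """Aggregate by averaging the T individual predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```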

Bagging: Bootstrap Aggregating

Why does it work?

  • Bias-variance trade-off: Assuming \(y_i=f(\text{x}_i)+\varepsilon_i\) where \(\varepsilon_i\overset{iid}{\sim}{\cal N}(0,\sigma^2)\), then for any model \(\hat{f}\) built using training data \(\cal D\), we can decompose MSE of \(\hat{f}\) at any fixed input \(\text{x}_0\) as \[\begin{align*} \mathbb{E}_{\cal D}[(\hat{f}(\text{x}_0)-y_0)^2]&=\mathbb{E}_{\cal D}[(\hat{f}(\text{x}_0)-\mathbb{E}[\hat{f}(\text{x}_0)])^2] + \mathbb{E}_{\cal D}[(\mathbb{E}[\hat{f}(\text{x}_0)]-f(\text{x}_0))^2] + \sigma^2\\ &=\underbrace{\mathbb{V}(\hat{f}(\text{x}_0))}_{\color{blue}{\text{Flexibility of }\hat{f}}}+\underbrace{(\text{Bias})^2}_{\color{darkgreen}{\text{How far }\hat{f}\text{ from } f}}+\underbrace{\sigma^2}_{\color{red}{\text{Uncontrollable Term}}}. \end{align*}\]

  • Bagging seeks to balance these terms by averaging nearly independent, high-variance models, reducing the variance term and yielding a more stable predictive model.

  • If \(\color{blue}{\hat{f}(\text{x}_0)=\frac{1}{T}\sum_{t=1}^Tf_t(\text{x}_0)}\), then by Jensen’s inequality \(\mathbb{E}_{\cal D}\left[\color{blue}{(\hat{f}(\text{x}_0)-y_0)^2}\right]\leq \frac{1}{T}\sum_{t=1}^T\mathbb{E}_{\cal D}\left[\color{red}{(f_t(\text{x}_0)-y_0)^2}\right]\), i.e., the ensemble does no worse than the average of its members (a small simulation below illustrates this).
  • Bagging is suitable for:
    • High-variance models: deep trees, \(K\)-NN with small \(K\).
    • Settings that favor simplicity and scalability…
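
As a quick illustration of the variance-reduction argument, the following small simulation on synthetic data of the assumed form \(y=f(x)+\varepsilon\) compares the test MSE of a single deep tree with that of an average of bootstrapped deep trees; the exact numbers depend on the random seed.

```python
# Synthetic illustration: bagged deep trees typically beat one deep tree on test MSE.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)   # y = f(x) + noise
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_test = np.sin(X_test[:, 0])                           # noiseless target for evaluation

single = DecisionTreeRegressor(random_state=0).fit(X, y)  # one high-variance tree

preds = []
for t in range(100):                                      # 100 bagged trees
    idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample
    preds.append(DecisionTreeRegressor(random_state=t).fit(X[idx], y[idx]).predict(X_test))

print("Single tree MSE :", mean_squared_error(y_test, single.predict(X_test)))
print("Bagged trees MSE:", mean_squared_error(y_test, np.mean(preds, axis=0)))
```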

Random Forest

Bagging: Bootstrap Aggregating

Random Forest

  • In practice, Bagging alone may not work as well as hoped because the base learners (trees) are not as decorrelated as expected (the bootstrap samples are highly correlated).
  • Random Forest: (What a cool name! 😎)
    • Number of trees: \(T\)
    • for t = 1,2,...,T:
      • Bootstrap sampling: sample \(B_t\) with replacement from \(\cal D\).
      • Random Features: select subset \(S_t\) from full \(d\) input features.
      • Build tree \(f_t\) on \(B_t\) using only features from \(S_t\).
    • Prediction: (same as before).
  • It works well because the Random Features step introduces additional randomness on top of bootstrap sampling, further decorrelating the trees (see the sketch below).
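
A minimal sketch of this recipe with scikit-learn follows; note that scikit-learn draws the random feature subset at each split rather than once per tree, and the dataset here is synthetic for illustration.

```python
# Random Forest sketch: n_estimators is T, max_features controls the random subset S_t.
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(
    n_estimators=300,      # T: number of trees
    max_features="sqrt",   # size of the random feature subset considered at each split
    random_state=0,
).fit(X_tr, y_tr)
print("Test R^2:", rf.score(X_te, y_te))
```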

Bagging: Bootstrap Aggregating

Have fun at 👉 Random Forest Playground

Bagging: Bootstrap Aggregating

Cool things in bagging/ random forest

Out-Of-Bag (OOB) Error

  • For each bootstrap sample \(B_t\), the observations \(\text{x}_i\notin B_t\) are called Out-Of-Bag samples (\(\approx 37\%\) of the data).
  • Tree’s OOB Error: The average error for a tree, calculated using its out-of-bag samples.
  • OOB Sample Error: The average error for a given observation, computed from all trees whose bootstrap samples did not include that observation.
  • Overall or Average OOB Error: The average of all OOB Sample Errors, providing an overall measure of the model’s performance.
  • The Overall OOB Error can be used as an approximation of cross-validation error. However, it tends to overestimate the true error in classification according to Silke Janitza and Roman Hornung (2018).
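
A sketch of the OOB idea with scikit-learn (synthetic data, illustrative settings): with `oob_score=True`, each observation is evaluated only by the trees whose bootstrap samples did not include it, giving a CV-like estimate without a separate validation set.

```python
# Overall OOB score as a cheap approximation of cross-validation error.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy (CV-like estimate):", rf.oob_score_)
```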

Bagging: Bootstrap Aggregating

Cool things in bagging/ random forest

Feature Importances (FI)

Mean Decrease in Impurity (MDI)

  • The Reduction of Impurity Measure when splitting at \(X_j\) is denoted by \[\color{blue}{\text{RIM}}_t(X_j)=\text{Imp}_{t-1}(X_j)-\text{Imp}_t(X_j).\]
  • The importance of feature \(X_j\) within a tree: \(\sum_{t\in S_j}\color{blue}{\text{RIM}_t(X_j)},\) where \(S_j\) is the set of indices \(t\) at which a split is performed on \(X_j\).
  • The unnormalized MDI of variable \(X_j\) is the sum of its FIs over all the built trees.
  • ⚠️ Might be biased toward high cardinality features and might not be accurate with highly correlated features.

Permutation Feature Importance

  • Train the initial random forest model, then compute its validation error, denoted by \(\color{red}{\text{Er}_0}\).
  • For each \(X_j\), shuffle its values in the validation data, then re-evaluate the trained model to obtain the validation error \(\color{red}{\text{Er}_j}\) (no retraining is required).
  • The PFI for \(X_j\) is proportional to \(\color{red}{\text{Er}_j-\text{Er}_0}\): shuffling an important feature increases the error.
  • It reflects the direct influence of each feature on the model’s predictions.
  • ⚠️ Might be sensitive to data splits and a bit computationally expensive.
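
The following sketch contrasts the two measures on synthetic data (illustrative settings): MDI via `feature_importances_` and permutation importance on a held-out split via `sklearn.inspection.permutation_importance`, which shuffles each feature without retraining.

```python
# Mean Decrease in Impurity vs. Permutation Feature Importance.
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=8, n_informative=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
mdi = rf.feature_importances_                      # MDI, computed from training splits
pfi = permutation_importance(rf, X_val, y_val,     # shuffle each feature on validation
                             n_repeats=10,         # data; the model is NOT retrained
                             random_state=0).importances_mean
print("MDI:", mdi.round(3))
print("Permutation importance:", pfi.round(3))
```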

Bagging: Bootstrap Aggregating

Extremely Randomized Trees (Extra-trees)

  • Number of trees: \(T\)
  • for t = 1,2,...,T:
    • Build tree: \(f_t\) is built on the full dataset, but each split is chosen differently:
      • Random Features: a random subset \(S_t\) of the full input features is considered.
      • Random Split Points: random thresholds \(\{a_1,\dots,a_p\}\) are drawn for the candidate features, and the best of these random splits is kept.
  • Prediction: (same as before).
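
A minimal sketch with scikit-learn (synthetic data, illustrative settings): `ExtraTreesRegressor` uses `bootstrap=False` by default, so each tree sees the full dataset and the randomness comes from random feature subsets and random split thresholds.

```python
# Extremely Randomized Trees: randomness from features and split points, not bootstrapping.
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
et = ExtraTreesRegressor(n_estimators=300, random_state=0).fit(X, y)
print("Training R^2:", et.score(X, y))
```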

Bagging: Bootstrap Aggregating

Numerical experiment

|   | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M   | 0.455  | 0.365    | 0.095  | 0.5140       | 0.2245         | 0.1010         | 0.150        | 15    |
| 1 | M   | 0.350  | 0.265    | 0.090  | 0.2255       | 0.0995         | 0.0485         | 0.070        | 7     |
| 2 | F   | 0.530  | 0.420    | 0.135  | 0.6770       | 0.2565         | 0.1415         | 0.210        | 9     |
| 3 | M   | 0.440  | 0.365    | 0.125  | 0.5160       | 0.2155         | 0.1140         | 0.155        | 10    |
| 4 | I   | 0.330  | 0.255    | 0.080  | 0.2050       | 0.0895         | 0.0395         | 0.055        | 7     |


👉 Check out the notebook.
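
A hedged sketch of a possible setup for this experiment: the column names follow the table above, but the UCI download URL and the modeling choices (one-hot encoding of Sex, 5-fold CV) are assumptions, not the notebook’s exact pipeline.

```python
# Random forest on the Abalone data (assumed UCI URL; adjust the path if needed).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

cols = ["Sex", "Length", "Diameter", "Height", "Whole weight",
        "Shucked weight", "Viscera weight", "Shell weight", "Rings"]
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
abalone = pd.read_csv(url, header=None, names=cols)

X = pd.get_dummies(abalone.drop(columns="Rings"), columns=["Sex"])  # encode Sex
y = abalone["Rings"]
rf = RandomForestRegressor(n_estimators=300, random_state=0)
print("CV R^2:", cross_val_score(rf, X, y, cv=5).mean())
```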

Boosting: Adaboost & XGBoost

Boosting

General framework

  • Boosting combines weak learners (models that perform only slightly better than random guessing, e.g., trees with very few splits, called stumps) to create a strong model. It was first introduced by Robert E. Schapire (1990).
  • Main framework: Sequentially train a new weak learner to correct the points mispredicted by the previous learners. The resulting model is the weighted average of all the trained learners.
  • Form: \(\color{blue}{\hat{f}(x)=\sum_{t=1}^Tw_tf_t(x)}\) where \(\color{blue}{f_t}\) are weak learners built on updated data \(\color{blue}{D_t}\).
  • It focuses more on reducing bias than variance.

Boosting

Algorithm

  • Number of trees: \(T\)
  • for t = 1, 2,..., T:
    • Build weak learner \(f_t\) by minimizing error \(\varepsilon_t\).
    • Learn and adjust weights:
      • Compute the learner weight \(w_t>0\) based on the weighted data with weights \(\{D_t(i):i=1,\dots,n\}\).
      • Update the sample weights \(D_t(i)\to D_{t+1}(i)\) for each observation \(i\).
  • Combined model: \(\color{blue}{\hat{f}(x)=\sum_{t=1}^Tw_tf_t(x)}\).

Boosting

Adaboost: Adaptive Boosting

  • It’s for binary classification with \((\text{x}_i,y_i)\in\mathbb{R}^d\times\{-1,1\},i=1,\dots,n\).
  • Initialize the sample weights: \(\color{purple}{D_1(i)=1/n}\) for all data points \(\text{x}_i\).
  • for t = 1, . . . , T:
    • Train the base classifier \(\color{blue}{f_t}\) by minimizing \[\color{red}{\varepsilon_t}=\mathbb{P}(\color{blue}{f_t}(X)\neq Y)=\sum_{i=1}^n\color{purple}{D_t(i)}\mathbb{1}_{\{\color{blue}{f_t}(\text{x}_i)\neq y_i\}}.\]
    • Calculate the learner weight: \(\color{blue}{w_t}=\frac{1}{2}\ln\left(\frac{1-\color{red}{\varepsilon_t}}{\color{red}{\varepsilon_t}}\right).\)
    • Update the sample weights using \(\color{purple}{D_{t+1}(i)}=\frac{\color{purple}{D_t(i)}}{Z_t}e^{\color{blue}{-w_t}y_i\color{blue}{f_t}(\text{x}_i)}\), where \(Z_t\) is the normalizing constant.
  • The final model: \(\color{blue}{\hat{f}}(\text{x}_0)=\text{sign}(\color{blue}{\sum_{t=1}^Tw_tf_t}(\text{x}_0))\). (Read: Freund & Schapire (1999))
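
A from-scratch sketch of these updates (illustrative, not a reference implementation), using depth-1 trees (stumps) as weak learners and labels \(y\in\{-1,+1\}\):

```python
# AdaBoost sketch: weighted stumps, learner weights w_t, and sample re-weighting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, T=50):
    y = np.asarray(y)                          # labels must be in {-1, +1}
    n = len(y)
    D = np.full(n, 1.0 / n)                    # D_1(i) = 1/n
    learners, weights = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)       # fit on the weighted data
        pred = stump.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)  # weighted error
        w = 0.5 * np.log((1 - eps) / eps)      # learner weight w_t
        D = D * np.exp(-w * y * pred)          # re-weight samples
        D /= D.sum()                           # divide by Z_t (normalizing constant)
        learners.append(stump)
        weights.append(w)
    return learners, weights


def adaboost_predict(learners, weights, X):
    score = sum(w * f.predict(X) for f, w in zip(learners, weights))
    return np.sign(score)                      # sign of the weighted sum
```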

Boosting

XGBoost: EXtreme Gradient Boosting

  • It’s a special case of Gradient Boosting: \(F_M(\text{x})=\sum_{t=1}^Mw_tf_t(\text{x})\).
  • Let \(L\) be the loss function of the problem:
    • Initial model: \(F_0=\arg\min_{c}\sum_{i=1}^nL(y_i,c).\)
    • for t = 1, 2, ..., M:
      • Compute pseudo-residuals: \(r_{t,i}=-\frac{\partial L(y_i,f(\text{x}_i))}{\partial f(\text{x}_i)}\Big|_{f=F_{t-1}}\).
      • Train a base learner \(f_t\) on the new data \(\{(\text{x}_i,r_{t,i})\}_{i=1}^n.\)
      • Solve for \(\color{blue}{f_t}\) and \(\color{blue}{w_t}\) from: \[(\color{blue}{w_t}, \color{blue}{f_t(\text{x})})=\arg\min_{\color{blue}{w,f}}\sum_{i=1}^nL(y_i,F_{t-1}(\text{x}_i)+\color{blue}{wf(\text{x}_i)}).\]
      • Update the model: \(F_t(\text{x})=F_{t-1}(\text{x})+\color{blue}{w_tf_t(\text{x})}\).
  • In XGBoost:

\[\begin{align*}{\cal L}(F_t)&=\sum_{i=1}^nL(y_i,F_t(\text{x}_i))+\sum_{m=1}^t\Omega(f_m),\\ \Omega(f)&=\gamma J+\frac{1}{2}\lambda\|w\|^2,\quad J=\text{number of leaves},\ w=\text{leaf weights},\\ {\cal L}(F_t)&\approx \sum_{i=1}^n\Big[g_if_t(\text{x}_i)+\frac{1}{2}h_if_t^2(\text{x}_i)\Big]+\Omega(f_t)+\text{constant},\\ g_i&=\frac{\partial L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)}\Big|_{F=F_{t-1}},\qquad h_i=\frac{\partial^2 L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)^2}\Big|_{F=F_{t-1}}. \end{align*}\]
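
A minimal usage sketch, assuming the `xgboost` Python package is installed and using synthetic data; `reg_lambda` and `gamma` correspond to the \(\lambda\) and \(\gamma\) regularization terms above.

```python
# XGBoost regression sketch (scikit-learn-style API of the xgboost package).
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=3,
                     reg_lambda=1.0, gamma=0.0).fit(X_tr, y_tr)  # lambda, gamma penalties
print("Test R^2:", model.score(X_te, y_te))
```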

Boosting

Have fun at 👉 Gradient Boosting Playground

Boosting

Other variants

  • LightGBM: Light Gradient Boosting Machine, by researchers at Microsoft (2017). It works for regression, classification, and ranking problems. Main improvements:
    • Gradient-based One-Sided Sampling (GOSS)
    • Exclusive Feature Bundling (EFB),

to ensure the efficiency and accuracy of the method.

  • CatBoost: Efficient with categorical features (auto transformation) and unbiased estimation of gradient at each iteration (see: Prokhorenkova et al. (2017)).
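
A hedged usage sketch of both variants, assuming the `lightgbm` and `catboost` packages are installed; the data here is synthetic and only meant to show that CatBoost can take a categorical column directly via `cat_features`.

```python
# LightGBM and CatBoost regressors on a small synthetic frame.
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({"length": rng.uniform(0.1, 0.8, 200),
                   "sex": rng.choice(["M", "F", "I"], 200)})
df["rings"] = 10 * df["length"] + rng.normal(size=200)

lgbm = LGBMRegressor(n_estimators=100).fit(df[["length"]], df["rings"])
cat = CatBoostRegressor(iterations=100, verbose=0)
cat.fit(df[["length", "sex"]], df["rings"], cat_features=["sex"])  # native categoricals
print(lgbm.predict(df[["length"]][:3]), cat.predict(df[["length", "sex"]][:3]))
```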

Boosting

Numerical Experiment: Abalone Dataset




👉 Check out the notebook.

Stacking: Consensual Aggregation

Stacking: Consensual Aggregation

General framework

  • Stacking: Consensual Aggregation combines \(M\) base learners \(f_1,\dots,f_M\) based on how “close” the predictions given by these learners are.
  • Roughly speaking, it applies nonparametric methods to the predicted features: \[\color{red}{\tilde{\text{x}}=(f_1(\text{x}),\dots,f_M(\text{x}))}.\]
  • General form: \(\hat{y}=\sum_{i=1}^nW_{n,i}(\color{red}{\tilde{\text{x}}})y_i\)
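
A minimal sketch of this idea (synthetic data, illustrative base learners): build the predicted features \(\tilde{\text{x}}\) with out-of-fold predictions, then run a nonparametric learner (here \(k\)-NN) on those features.

```python
# Stacking sketch: nonparametric aggregation on the predicted features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_regression(n_samples=800, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [Ridge(), DecisionTreeRegressor(max_depth=6, random_state=0)]
# Out-of-fold predicted features for training the aggregator, to avoid leakage
Z_tr = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=5) for m in base])
Z_te = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in base])

agg = KNeighborsRegressor(n_neighbors=10).fit(Z_tr, y_tr)   # weights W_{n,i}(x~)
print("Stacked test R^2:", agg.score(Z_te, y_te))
```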

Stacking: Consensual Aggregation

Consensual Classifier: Mojirsheibani (1999)

  • Suppose we have 3 binary classifiers: \(C=(C_1, C_2, C_3)\) with training data:
| Id | \(C_1\) | \(C_2\) | \(C_3\) | \(Y\) |
|----|---------|---------|---------|-------|
| 1 | \(0\) | \(1\) | \(1\) | \(1\) |
| 2 | \(1\) | \(1\) | \(0\) | \(\color{blue}{1}\) |
| 3 | \(0\) | \(0\) | \(0\) | \(0\) |
| 4 | \(1\) | \(1\) | \(0\) | \(\color{blue}{1}\) |
| 5 | \(1\) | \(1\) | \(0\) | \(\color{red}{0}\) |
| \(\text{x}\) | \(1\) | \(1\) | \(0\) | \(?\) |
  • The prediction for \(\text{x}\) is the majority class among the training points whose predicted labels coincide with those of \(\text{x}\) (rows 2, 4, and 5 here), hence \(\hat{y}=1\).
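
A tiny sketch of this consensual classifier on the toy table above (variable names are illustrative): keep the training points whose base-classifier predictions all agree with those for \(\text{x}\), then take the majority label.

```python
# Consensual classifier on the toy example: majority vote among matching rows.
import numpy as np

C_train = np.array([[0, 1, 1],    # rows: predictions (C1, C2, C3) on training points 1..5
                    [1, 1, 0],
                    [0, 0, 0],
                    [1, 1, 0],
                    [1, 1, 0]])
y_train = np.array([1, 1, 0, 1, 0])
C_x = np.array([1, 1, 0])          # predictions of the 3 classifiers at x

match = np.all(C_train == C_x, axis=1)        # rows 2, 4, 5 agree with x
y_hat = int(np.round(y_train[match].mean()))  # majority vote among {1, 1, 0} -> 1
print(y_hat)
```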

Stacking: Consensual Aggregation

COBRA by Biau et al. (2016)
Gradient COBRA by Has (2023)

  • Given base regressors \({\bf r}=(r_1,\dots,r_M)\), the combination takes the form: \[\hat{y}=\sum_{i=1}^nW_{n,i}(\color{red}{\tilde{\text{x}}})y_i,\] where \(\color{red}{\tilde{\text{x}}}=(r_1(\text{x}),\dots,r_M(\text{x}))\) is the vector of predicted features.

  • Training such a model is equivalent to finding an optimal smoothing parameter \(\color{blue}{h}\) minimizing cross-validation error:

\[\varphi(\color{blue}{h})=\frac{1}{K}\sum_{k=1}^K\sum_{(\text{x}_i,y_i)\in F_k}(\hat{y}_i(\color{blue}{h})-y_i)^2,\] where \(F_k\) denotes the \(k\)-th validation fold.
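
A rough sketch of the kernel-weighted aggregation with a Gaussian kernel and a fixed smoothing parameter \(h\) (function name and shapes are illustrative); in practice \(h\) is selected by minimizing the cross-validation error \(\varphi(h)\) above, e.g., by gradient descent in Gradient COBRA.

```python
# Kernel-based consensual aggregation: weight training labels by how close the
# base-learner predictions at x are to those at each training point.
import numpy as np


def cobra_predict(R_train, y_train, R_new, h):
    """R_train, R_new: predicted features r(x) = (r_1(x), ..., r_M(x)) as 2-D arrays."""
    # Squared distances between predicted-feature vectors
    d2 = ((R_new[:, None, :] - R_train[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2 * h ** 2))            # Gaussian kernel with bandwidth h
    W = K / K.sum(axis=1, keepdims=True)      # weights W_{n,i}(x~), summing to 1
    return W @ y_train                        # y_hat = sum_i W_{n,i} y_i
```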

Stacking: Consensual Aggregation

Numerical Experiment: Abalone Dataset




👉 Check out the notebook.

Stacking: Consensual Aggregation

Summary

  • Ensemble learning: combine base learners to create a stronger predictor.

  • Bagging: Combines high-variance base learners (trees) to produce a more stable and accurate final model.

  • Boosting: Sequentially combines weak learners, each aiming to correct the mistakes made by the previously built combined learner, to create a strong model.

  • Stacking: Consensual Aggregation: Combines different learners based on the consensus of their predicted features.

🥳 Yeahhhh….









Let’s Party… 🥂