🔗 Ensemble Learning Methods


ITM 390 004: Machine Learning

Lecturer: Dr. Sothea HAS

🗺️ Content

  • Motivation & Introduction

  • Bootstrap Aggregating: Bagging

  • Random Forest

  • Boosting: Adaboost & XGBoost

  • Consensual aggregation: Gradient COBRA

Motivation & Introduction

Motivation & Introduction

Recap

  • Recap:
    • Parametric models:
      • Assume input-to-target relation: \(X\rightarrow y\).
        • Linear regression: \(\hat{y}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{x}_1+\dots+\color{blue}{\beta_d}\text{x}_d\).
        • Logistic regression: \(p(\color{blue}{Y=1}|\text{X}=\text{x})=\sigma(\color{blue}{\beta_0}+\sum_{j=1}^d\color{blue}{\beta_j}\text{x}_j)\).
    • Nonparametric models:
      • Do not assume any input-output form.
      • Predict based only on neighboring data.
        • \(k\)-Nearest Neighbors: neighbors are based on Euclidean distance.
        • Decision tree: neighbors are based on block/region.
  • 🤔 What remains to be explored?

Motivation & Introduction

Motivation

  • Why do we trust a panel of judges more than a single judge in competitions?

  • In medicine, why do doctors often seek second or third opinions for complex cases?

  • Have you noticed that weather forecasts often give a probability of rain rather than a simple yes/no prediction?

  • Does the famous “Wisdom of Crowds” principle apply to machine learning?

  • The key idea of ensemble learning is to combine multiple base models/learners to create a better/stronger one.


Motivation & Tools

Main methods

  • We’re going to explore \(3\) main ensemble learning (EL) methods:
    • Bagging: Combine nearly decorrelated, high-variance models.

    • Boosting: Sequentially combine weak learners to create a strong final model.

    • Stacking: Combine models based on their predicted features.

Bootstrap Aggregating: Bagging

Bagging: Bootstrap Aggregating

Purpose

  • Reduce the variance of high-variance base learners (e.g., deep trees) to produce a more stable and accurate final model.

Methodology

  • Bootstrap Sampling: Generate multiple subsets of the training data by sampling with replacement.
  • Train Models: Train a separate model (typically the same type) on each bootstrap sample.
  • Aggregate Predictions: Combine the predictions of all models by averaging (for regression) or majority voting (for classification).

Pseudocode

  • Number of trees: \(T\)
  • for t=1,...,T:
    • Sampling: \(B_t\) from training data \(\cal D\).
    • Training models: \(f_t\) on \(B_t\)
  • Predictions:
    • Regression: \[\begin{align*}\color{blue}{\hat{y}}&\color{blue}{=\frac{1}{T}\sum_{t=1}^Tf_t(x)}\\ &\color{blue}{=\text{Averaging.}}\end{align*}\]
    • Classification: \[\begin{align*}\color{red}{\hat{y}}&\color{red}{=\arg\max_{1\leq k\leq M}\sum_{t=1}^T\mathbb{1}_{\{f_t(x)=k\}}}\\ &\color{red}{=\text{Majority vote.}}\end{align*}\]
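
To make the pseudocode concrete, here is a minimal from-scratch sketch for the regression case, using scikit-learn decision trees as high-variance base learners. Function names and defaults are illustrative, not the course’s reference implementation.

```python
# Minimal bagging sketch (regression): T trees, each trained on a bootstrap sample.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def bagging_fit(X, y, T=100, random_state=0):
    """Train T trees, each on a bootstrap sample B_t of the training data."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    trees = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)       # sampling with replacement
        tree = DecisionTreeRegressor()         # high-variance base learner (deep tree)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def bagging_predict(trees, X):
    """Aggregate by averaging the T individual predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```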

Bagging: Bootstrap Aggregating

Why does it work?

  • Bias-variance trade-off: Assuming \(y_i=f(\text{x}_i)+\varepsilon_i\) where \(\varepsilon_i\overset{iid}{\sim}{\cal N}(0,\sigma^2)\), then for any model \(\hat{f}\) built using training data \(\cal D\), we can decompose MSE of \(\hat{f}\) at any fixed input \(\text{x}_0\) as \[\begin{align*} \mathbb{E}_{\cal D}[(\hat{f}(\text{x}_0)-y_0)^2]&=\mathbb{E}_{\cal D}[(\hat{f}(\text{x}_0)-\mathbb{E}[\hat{f}(\text{x}_0)])^2] + \mathbb{E}_{\cal D}[(\mathbb{E}[\hat{f}(\text{x}_0)]-f(\text{x}_0))^2] + \sigma^2\\ &=\underbrace{\mathbb{V}(\hat{f}(\text{x}_0))}_{\color{blue}{\text{Flexibility of }\hat{f}}}+\underbrace{(\text{Bias})^2}_{\color{darkgreen}{\text{How far }\hat{f}\text{ from } f}}+\underbrace{\sigma^2}_{\color{red}{\text{Uncontrollable Term}}}. \end{align*}\]

  • Bagging seeks to balance these terms by averaging nearly independent, high-variance models, reducing the variance term and yielding a more stable predictive model.

  • If \(\color{blue}{\hat{f}(\text{x}_0)=\frac{1}{T}\sum_{t=1}^Tf_t(\text{x}_0)}\), then by Jensen’s inequality \(\mathbb{E}_{\cal D}\left[\color{blue}{(\hat{f}(\text{x}_0)-y_0)^2}\right]\leq \frac{1}{T}\sum_{t=1}^T\mathbb{E}_{\cal D}\left[\color{red}{(f_t(\text{x}_0)-y_0)^2}\right]\), i.e., the ensemble does no worse than the average of its members (a small simulation below illustrates this).
  • Bagging is suitable for:
    • High-variance models: deep trees, \(K\)-NN with small \(K\).
    • Settings that favor simplicity and scalability…
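
As a quick illustration of the variance-reduction argument, the following small simulation on synthetic data of the assumed form \(y=f(x)+\varepsilon\) compares the test MSE of a single deep tree with that of an average of bootstrapped deep trees; the exact numbers depend on the random seed.

```python
# Synthetic illustration: bagged deep trees typically beat one deep tree on test MSE.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)   # y = f(x) + noise
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_test = np.sin(X_test[:, 0])                           # noiseless target for evaluation

single = DecisionTreeRegressor(random_state=0).fit(X, y)  # one high-variance tree

preds = []
for t in range(100):                                      # 100 bagged trees
    idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample
    preds.append(DecisionTreeRegressor(random_state=t).fit(X[idx], y[idx]).predict(X_test))

print("Single tree MSE :", mean_squared_error(y_test, single.predict(X_test)))
print("Bagged trees MSE:", mean_squared_error(y_test, np.mean(preds, axis=0)))
```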

Random Forest

Bagging: Bootstrap Aggregating

Random Forest

  • In practice, Bagging alone may not work as well as hoped because the base learners (trees) are not as decorrelated as expected (the bootstrap samples are highly correlated).
  • Random Forest: (What a cool name! 😎)
    • Number of trees: \(T\)
    • for t = 1,2,...,T:
      • Bootstrap sampling: sample \(B_t\) with replacement from \(\cal D\).
      • Random Features: select subset \(S_t\) from full \(d\) input features.
      • Build tree \(f_t\) on \(B_t\) using only features from \(S_t\).
    • Prediction: (same as before).
  • It works well because the Random Features step introduces additional randomness on top of bootstrap sampling, further decorrelating the trees (see the sketch below).
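
A minimal sketch of this recipe with scikit-learn follows; note that scikit-learn draws the random feature subset at each split rather than once per tree, and the dataset here is synthetic for illustration.

```python
# Random Forest sketch: n_estimators is T, max_features controls the random subset S_t.
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(
    n_estimators=300,      # T: number of trees
    max_features="sqrt",   # size of the random feature subset considered at each split
    random_state=0,
).fit(X_tr, y_tr)
print("Test R^2:", rf.score(X_te, y_te))
```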

Bagging: Bootstrap Aggregating

Have fun at 👉 Random Forest Playground

Bagging: Bootstrap Aggregating

Cool things in bagging/ random forest

Out-Of-Bag (OOB) Error

  • For each bootstrap sample \(B_t\), the observations \(\text{x}_i\notin B_t\) are called Out-Of-Bag samples (\(\approx 37\%\) of the data).
  • Tree’s OOB Error: The average error for a tree, calculated using its out-of-bag samples.
  • OOB Sample Error: The average error for a given observation, computed from all trees whose bootstrap samples did not include that observation.
  • Overall or Average OOB Error: The average of all OOB Sample Errors, providing an overall measure of the model’s performance.
  • The Overall OOB Error can be used as an approximation of cross-validation error. However, it tends to overestimate the true error in classification according to Silke Janitza and Roman Hornung (2018).
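
A sketch of the OOB idea with scikit-learn (synthetic data, illustrative settings): with `oob_score=True`, each observation is evaluated only by the trees whose bootstrap samples did not include it, giving a CV-like estimate without a separate validation set.

```python
# Overall OOB score as a cheap approximation of cross-validation error.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy (CV-like estimate):", rf.oob_score_)
```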

Bagging: Bootstrap Aggregating

Cool things in bagging/ random forest

Feature Importances (FI)

Mean Decrease in Impurity (MDI)

  • The Reduction of Impurity Measure when splitting at \(X_j\) is denoted by \[\color{blue}{\text{RIM}}_t(X_j)=\text{Imp}_{t-1}(X_j)-\text{Imp}_t(X_j).\]
  • The importance of feature \(X_j\) within a tree: \(\sum_{t\in S_j}\color{blue}{\text{RIM}_t(X_j)},\) where \(S_j\) is the set of indices \(t\) at which a split is performed on \(X_j\).
  • The unnormalized MDI of variable \(X_j\) is the sum of its FIs over all the built trees.
  • ⚠️ Might be biased toward high cardinality features and might not be accurate with highly correlated features.

Permutation Feature Importance

  • Train the initial random forest model, then compute its validation error, denoted by \(\color{red}{\text{Er}_0}\).
  • For each \(X_j\), shuffle its values in the validation data, then re-evaluate the trained model to obtain the validation error \(\color{red}{\text{Er}_j}\) (no retraining is required).
  • The PFI for \(X_j\) is proportional to \(\color{red}{\text{Er}_j-\text{Er}_0}\): shuffling an important feature increases the error.
  • It reflects the direct influence of each feature on the model’s predictions.
  • ⚠️ Might be sensitive to data splits and a bit computationally expensive.
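
The following sketch contrasts the two measures on synthetic data (illustrative settings): MDI via `feature_importances_` and permutation importance on a held-out split via `sklearn.inspection.permutation_importance`, which shuffles each feature without retraining.

```python
# Mean Decrease in Impurity vs. Permutation Feature Importance.
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=8, n_informative=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
mdi = rf.feature_importances_                      # MDI, computed from training splits
pfi = permutation_importance(rf, X_val, y_val,     # shuffle each feature on validation
                             n_repeats=10,         # data; the model is NOT retrained
                             random_state=0).importances_mean
print("MDI:", mdi.round(3))
print("Permutation importance:", pfi.round(3))
```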

Bagging: Bootstrap Aggregating

Extremely Randomized Trees (Extra-trees)

  • Number of trees: \(T\)
  • for t = 1,2,...,T:
    • Build tree: \(f_t\) is built on the full dataset, but each split is chosen differently:
      • Random Features: a random subset \(S_t\) of the full input features is considered.
      • Random Split Points: random thresholds \(\{a_1,\dots,a_p\}\) are drawn for the candidate features, and the best of these random splits is kept.
  • Prediction: (same as before).
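
A minimal sketch with scikit-learn (synthetic data, illustrative settings): `ExtraTreesRegressor` uses `bootstrap=False` by default, so each tree sees the full dataset and the randomness comes from random feature subsets and random split thresholds.

```python
# Extremely Randomized Trees: randomness from features and split points, not bootstrapping.
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
et = ExtraTreesRegressor(n_estimators=300, random_state=0).fit(X, y)
print("Training R^2:", et.score(X, y))
```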

Bagging: Bootstrap Aggregating

Numerical experiment

|   | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M   | 0.455  | 0.365    | 0.095  | 0.5140       | 0.2245         | 0.1010         | 0.150        | 15    |
| 1 | M   | 0.350  | 0.265    | 0.090  | 0.2255       | 0.0995         | 0.0485         | 0.070        | 7     |
| 2 | F   | 0.530  | 0.420    | 0.135  | 0.6770       | 0.2565         | 0.1415         | 0.210        | 9     |
| 3 | M   | 0.440  | 0.365    | 0.125  | 0.5160       | 0.2155         | 0.1140         | 0.155        | 10    |
| 4 | I   | 0.330  | 0.255    | 0.080  | 0.2050       | 0.0895         | 0.0395         | 0.055        | 7     |


👉 Check out the notebook.
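
A hedged sketch of a possible setup for this experiment: the column names follow the table above, but the UCI download URL and the modeling choices (one-hot encoding of Sex, 5-fold CV) are assumptions, not the notebook’s exact pipeline.

```python
# Random forest on the Abalone data (assumed UCI URL; adjust the path if needed).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

cols = ["Sex", "Length", "Diameter", "Height", "Whole weight",
        "Shucked weight", "Viscera weight", "Shell weight", "Rings"]
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
abalone = pd.read_csv(url, header=None, names=cols)

X = pd.get_dummies(abalone.drop(columns="Rings"), columns=["Sex"])  # encode Sex
y = abalone["Rings"]
rf = RandomForestRegressor(n_estimators=300, random_state=0)
print("CV R^2:", cross_val_score(rf, X, y, cv=5).mean())
```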

Boosting: Adaboost & XGBoost

Boosting

General framework

  • Boosting combines weak learners (models that perform only slightly better than random guessing, e.g., trees with very few splits, called stumps) to create a strong model. It was first introduced by Robert E. Schapire (1990).
  • Main framework: Sequentially train a new weak learner to correct the points mispredicted by the previous learners. The resulting model is the weighted average of all the trained learners.
  • Form: \(\color{blue}{\hat{f}(x)=\sum_{t=1}^Tw_tf_t(x)}\) where \(\color{blue}{f_t}\) are weak learners built on updated data \(\color{blue}{D_t}\).
  • It focuses more on reducing bias than variance.

Boosting

Algorithm

  • Number of trees: \(T\)
  • for t = 1, 2,..., T:
    • Build weak learner \(f_t\) by minimizing error \(\varepsilon_t\).
    • Learn and adjust weights:
      • Compute the learner weight \(w_t>0\) based on the weighted data with weights \(\{D_t(i):i=1,\dots,n\}\).
      • Update the sample weights \(D_t(i)\to D_{t+1}(i)\) for each observation \(i\).
  • Combined model: \(\color{blue}{\hat{f}(x)=\sum_{t=1}^Tw_tf_t(x)}\).

Boosting

Adaboost: Adaptive Boosting

  • It’s for binary classification with \((\text{x}_i,y_i)\in\mathbb{R}^d\times\{-1,1\},i=1,\dots,n\).
  • Initialize the sample weights: \(\color{purple}{D_1(i)=1/n}\) for all data points \(\text{x}_i\).
  • for t = 1, . . . , T:
    • Train the base classifier \(\color{blue}{f_t}\) by minimizing \[\color{red}{\varepsilon_t}=\mathbb{P}(\color{blue}{f_t}(X)\neq Y)=\sum_{i=1}^n\color{purple}{D_t(i)}\mathbb{1}_{\{\color{blue}{f_t}(\text{x}_i)\neq y_i\}}.\]
    • Calculate the learner weight: \(\color{blue}{w_t}=\frac{1}{2}\ln\left(\frac{1-\color{red}{\varepsilon_t}}{\color{red}{\varepsilon_t}}\right).\)
    • Update the sample weights using \(\color{purple}{D_{t+1}(i)}=\frac{\color{purple}{D_t(i)}}{Z_t}e^{\color{blue}{-w_t}y_i\color{blue}{f_t}(\text{x}_i)}\), where \(Z_t\) is the normalizing constant.
  • The final model: \(\color{blue}{\hat{f}}(\text{x}_0)=\text{sign}(\color{blue}{\sum_{t=1}^Tw_tf_t}(\text{x}_0))\). (Read: Freund & Schapire (1999))
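
A from-scratch sketch of these updates (illustrative, not a reference implementation), using depth-1 trees (stumps) as weak learners and labels \(y\in\{-1,+1\}\):

```python
# AdaBoost sketch: weighted stumps, learner weights w_t, and sample re-weighting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, T=50):
    y = np.asarray(y)                          # labels must be in {-1, +1}
    n = len(y)
    D = np.full(n, 1.0 / n)                    # D_1(i) = 1/n
    learners, weights = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)       # fit on the weighted data
        pred = stump.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)  # weighted error
        w = 0.5 * np.log((1 - eps) / eps)      # learner weight w_t
        D = D * np.exp(-w * y * pred)          # re-weight samples
        D /= D.sum()                           # divide by Z_t (normalizing constant)
        learners.append(stump)
        weights.append(w)
    return learners, weights


def adaboost_predict(learners, weights, X):
    score = sum(w * f.predict(X) for f, w in zip(learners, weights))
    return np.sign(score)                      # sign of the weighted sum
```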

Boosting

XGBoost: EXtreme Gradient Boosting

  • It’s a special case of Gradient Boosting: \(F_M(\text{x})=\sum_{t=1}^Mw_tf_t(\text{x})\).
  • Let \(L\) be the loss function of the problem:
    • Initial model: \(F_0=\arg\min_{c}\sum_{i=1}^nL(y_i,c).\)
    • for t = 1, 2, ..., M:
      • Compute pseudo-residuals: \(r_{t,i}=-\frac{\partial L(y_i,f(\text{x}_i))}{\partial f(\text{x}_i)}\Big|_{f=F_{t-1}}\).
      • Train a base learner \(f_t\) on the new data \(\{(\text{x}_i,r_{t,i})\}_{i=1}^n.\)
      • Solve for \(\color{blue}{f_t}\) and \(\color{blue}{w_t}\) from: \[(\color{blue}{w_t}, \color{blue}{f_t(\text{x})})=\arg\min_{\color{blue}{w,f}}\sum_{i=1}^nL(y_i,F_{t-1}(\text{x}_i)+\color{blue}{wf(\text{x}_i)}).\]
      • Update the model: \(F_t(\text{x})=F_{t-1}(\text{x})+\color{blue}{w_tf_t(\text{x})}\).
  • In XGBoost:

\[\begin{align*}{\cal L}(F_t)&=\sum_{i=1}^nL(y_i,F_t(\text{x}_i))+\sum_{m=1}^t\Omega(f_m),\\ \Omega(f)&=\gamma J+\frac{1}{2}\lambda\|w\|^2,\quad J=\text{number of leaves},\ w=\text{leaf weights},\\ {\cal L}(F_t)&\approx \sum_{i=1}^n\Big[g_if_t(\text{x}_i)+\frac{1}{2}h_if_t^2(\text{x}_i)\Big]+\Omega(f_t)+\text{constant},\\ g_i&=\frac{\partial L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)}\Big|_{F=F_{t-1}},\qquad h_i=\frac{\partial^2 L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)^2}\Big|_{F=F_{t-1}}. \end{align*}\]
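
A minimal usage sketch, assuming the `xgboost` Python package is installed and using synthetic data; `reg_lambda` and `gamma` correspond to the \(\lambda\) and \(\gamma\) regularization terms above.

```python
# XGBoost regression sketch (scikit-learn-style API of the xgboost package).
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=3,
                     reg_lambda=1.0, gamma=0.0).fit(X_tr, y_tr)  # lambda, gamma penalties
print("Test R^2:", model.score(X_te, y_te))
```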

Boosting

Have fun at 👉 Gradient Boosting Playground

Boosting

Other variants

  • LightGBM: Light Gradient Boosting Machine, by researchers at Microsoft (2017). It works for regression, classification, and ranking problems. Main improvements:
    • Gradient-based One-Sided Sampling (GOSS)
    • Exclusive Feature Bundling (EFB),

to ensure the efficiency and accuracy of the method.

  • CatBoost: Efficient with categorical features (auto transformation) and unbiased estimation of gradient at each iteration (see: Prokhorenkova et al. (2017)).
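
A hedged usage sketch of both variants, assuming the `lightgbm` and `catboost` packages are installed; the data here is synthetic and only meant to show that CatBoost can take a categorical column directly via `cat_features`.

```python
# LightGBM and CatBoost regressors on a small synthetic frame.
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({"length": rng.uniform(0.1, 0.8, 200),
                   "sex": rng.choice(["M", "F", "I"], 200)})
df["rings"] = 10 * df["length"] + rng.normal(size=200)

lgbm = LGBMRegressor(n_estimators=100).fit(df[["length"]], df["rings"])
cat = CatBoostRegressor(iterations=100, verbose=0)
cat.fit(df[["length", "sex"]], df["rings"], cat_features=["sex"])  # native categoricals
print(lgbm.predict(df[["length"]][:3]), cat.predict(df[["length", "sex"]][:3]))
```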

Boosting

Numerical Experiment: Abalone Dataset




👉 Check out the notebook.

Stacking: Consensual Aggregation

Stacking: Consensual Aggregation

General framework

  • Stacking: Consensual Aggregation combines \(M\) base learners \(f_1,\dots,f_M\) based on how “close” the predictions given by these learners are.
  • Roughly speaking, it applies nonparametric methods to the predicted features: \[\color{red}{\tilde{\text{x}}=(f_1(\text{x}),\dots,f_M(\text{x}))}.\]
  • General form: \(\hat{y}=\sum_{i=1}^nW_{n,i}(\color{red}{\tilde{\text{x}}})y_i\)
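
A minimal sketch of this idea (synthetic data, illustrative base learners): build the predicted features \(\tilde{\text{x}}\) with out-of-fold predictions, then run a nonparametric learner (here \(k\)-NN) on those features.

```python
# Stacking sketch: nonparametric aggregation on the predicted features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_regression(n_samples=800, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [Ridge(), DecisionTreeRegressor(max_depth=6, random_state=0)]
# Out-of-fold predicted features for training the aggregator, to avoid leakage
Z_tr = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=5) for m in base])
Z_te = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in base])

agg = KNeighborsRegressor(n_neighbors=10).fit(Z_tr, y_tr)   # weights W_{n,i}(x~)
print("Stacked test R^2:", agg.score(Z_te, y_te))
```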

Stacking: Consensual Aggregation

Consensual Classifier: Mojirsheibani (1999)

  • Suppose we have 3 binary classifiers: \(C=(C_1, C_2, C_3)\) with training data:
| Id | \(C_1\) | \(C_2\) | \(C_3\) | \(Y\) |
|----|---------|---------|---------|-------|
| 1 | \(0\) | \(1\) | \(1\) | \(1\) |
| 2 | \(1\) | \(1\) | \(0\) | \(\color{blue}{1}\) |
| 3 | \(0\) | \(0\) | \(0\) | \(0\) |
| 4 | \(1\) | \(1\) | \(0\) | \(\color{blue}{1}\) |
| 5 | \(1\) | \(1\) | \(0\) | \(\color{red}{0}\) |
| \(\text{x}\) | \(1\) | \(1\) | \(0\) | \(?\) |
  • The prediction for \(\text{x}\) is the majority class among the training points whose predicted labels coincide with those of \(\text{x}\) (rows 2, 4, and 5 here), hence \(\hat{y}=1\).
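
A tiny sketch of this consensual classifier on the toy table above (variable names are illustrative): keep the training points whose base-classifier predictions all agree with those for \(\text{x}\), then take the majority label.

```python
# Consensual classifier on the toy example: majority vote among matching rows.
import numpy as np

C_train = np.array([[0, 1, 1],    # rows: predictions (C1, C2, C3) on training points 1..5
                    [1, 1, 0],
                    [0, 0, 0],
                    [1, 1, 0],
                    [1, 1, 0]])
y_train = np.array([1, 1, 0, 1, 0])
C_x = np.array([1, 1, 0])          # predictions of the 3 classifiers at x

match = np.all(C_train == C_x, axis=1)        # rows 2, 4, 5 agree with x
y_hat = int(np.round(y_train[match].mean()))  # majority vote among {1, 1, 0} -> 1
print(y_hat)
```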

Stacking: Consensual Aggregation

COBRA by Biau et al. (2016)
Gradient COBRA by Has (2023)

  • Given base regressors \({\bf r}=(r_1,\dots,r_M)\), the combination takes the form: \[\hat{y}=\sum_{i=1}^nW_{n,i}(\color{red}{\tilde{\text{x}}})y_i,\] where \(\color{red}{\tilde{\text{x}}}=(r_1(\text{x}),\dots,r_M(\text{x}))\) is the vector of predicted features.

  • Training such a model is equivalent to finding an optimal smoothing parameter \(\color{blue}{h}\) minimizing cross-validation error:

\[\varphi(\color{blue}{h})=\frac{1}{K}\sum_{k=1}^K\sum_{(\text{x}_i,y_i)\in F_k}(\hat{y}_i(\color{blue}{h})-y_i)^2,\] where \(F_k\) denotes the \(k\)-th validation fold.
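
A rough sketch of the kernel-weighted aggregation with a Gaussian kernel and a fixed smoothing parameter \(h\) (function name and shapes are illustrative); in practice \(h\) is selected by minimizing the cross-validation error \(\varphi(h)\) above, e.g., by gradient descent in Gradient COBRA.

```python
# Kernel-based consensual aggregation: weight training labels by how close the
# base-learner predictions at x are to those at each training point.
import numpy as np


def cobra_predict(R_train, y_train, R_new, h):
    """R_train, R_new: predicted features r(x) = (r_1(x), ..., r_M(x)) as 2-D arrays."""
    # Squared distances between predicted-feature vectors
    d2 = ((R_new[:, None, :] - R_train[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2 * h ** 2))            # Gaussian kernel with bandwidth h
    W = K / K.sum(axis=1, keepdims=True)      # weights W_{n,i}(x~), summing to 1
    return W @ y_train                        # y_hat = sum_i W_{n,i} y_i
```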

Stacking: Consensual Aggregation

Numerical Experiment: Abalone Dataset




👉 Check out the notebook.

Stacking: Consensual Aggregation

Summary

  • Ensemble learning: combine base learners to create a stronger predictor.

  • Bagging: Combines high-variance base learners (trees) to produce a more stable and accurate final model.

  • Boosting: Sequentially combines weak learners, each aiming to correct the mistakes made by the previously built combined learner, to create a strong model.

  • Stacking: Consensual Aggregation: Combines different learners based on the consensus of their predicted features.

🥳 Yeahhhh….









Let’s Party… 🥂