Advanced Machine Learning


     

Lecturer: Dr. HAS Sothea

Code: AMSI61AML

🔗 Ensemble Learning

🗺️ Content

  • Motivation & Tools

  • Bagging: Random Forest

  • Boosting: Adaboost, XGBoost…

  • Consensual aggregation: COBRA, KernelCOBRA, MixCOBRA

  • Stacking: SuperLearner

Motivation & Tools

Motivation

🤔 Have you ever thought about:

  • Why do we trust a panel of judges more than a single judge in competitions?

  • In medicine, why do doctors often seek second or third opinions for complex cases?

  • Have you noticed that weather forecasts often give a probability of rain rather than a simple yes/no prediction?

  • Does the famous “Wisdom of Crowds” principle apply to machine learning?

Motivation & Tools

Tools

  • Ensemble Learning (EL): Combining several base learners/models is possible thanks to different theories of
    • Probability: Bayesian inference, Law of Large Numbers, Conditional Probability…
    • Statistics: Bias-variance trade-off, bootstrap theory…
    • Computational Learning: Probably Approximately Correct (PAC), Gradient-based optimization, and more…
  • We’re going to explore \(3\) main EL methods:
    • Bagging: Combine nearly decorrelated, high-variance models.

    • Boosting: Sequentially combine weak learners.

    • Stacking: The combination is learned from the base learners’ predicted features.

Bagging: Bootstrap Aggregating

Purpose

  • Reduce the variance of high-variance base learners (trees) to produce a more stable and accurate final model.

Methodology

  • Bootstrap Sampling: Generate multiple subsets of the training data by sampling with replacement.
  • Train Models: Train a separate model (typically the same type) on each bootstrap sample.
  • Aggregate Predictions: Combine the predictions of all models by averaging (for regression) or majority voting (for classification).

Pseudocode

  • Number of trees: \(T\)
  • for t=1,...,T:
    • Sampling: \(B_t\) from training data \(\cal D\).
    • Training models: \(f_t\) on \(B_t\)
  • Predictions:
    • Regression: \[\begin{align*}\color{blue}{\hat{y}}&\color{blue}{=\frac{1}{T}\sum_{t=1}^Tf_t(x)}\\ &\color{blue}{=\text{Averaging.}}\end{align*}\]
    • Classification: \[\begin{align*}\color{blue}{\hat{y}}&\color{blue}{=\arg\max_{1\leq k\leq M}\sum_{t=1}^T\mathbb{1}_{\{f_t(x)=k\}}}\\ &\color{blue}{=\text{Majority vote.}}\end{align*}\]
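
A minimal sketch of this pseudocode in Python for the regression case, assuming scikit-learn and NumPy are available (the synthetic data is only a placeholder for a real training set):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Placeholder data standing in for the training set D
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

T = 100                                          # number of trees
rng = np.random.default_rng(0)
trees = []
for t in range(T):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample B_t (with replacement)
    f_t = DecisionTreeRegressor()                # high-variance base learner (deep tree)
    f_t.fit(X[idx], y[idx])
    trees.append(f_t)

# Aggregate predictions by averaging (regression case)
x_new = X[:5]
y_hat = np.mean([f_t.predict(x_new) for f_t in trees], axis=0)
print(y_hat)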

Bagging: Bootstrap Aggregating

Why does it work?

  • Bias-variance trade-off: Assuming \(y_i=f(\text{x}_i)+\varepsilon_i\) where \(\varepsilon_i\overset{iid}{\sim}{\cal N}(0,\sigma^2)\), then for any model \(\hat{f}\) built using training data \(\cal D\), we can decompose MSE of \(\hat{f}\) at any fixed input \(\text{x}_0\) as \[\begin{align*} \mathbb{E}_{\cal D}[(\hat{f}(\text{x}_0)-y_0)^2]&=\mathbb{E}_{\cal D}[(\hat{f}(\text{x}_0)-\mathbb{E}[\hat{f}(\text{x}_0)])^2] + \mathbb{E}_{\cal D}[(\mathbb{E}[\hat{f}(\text{x}_0)]-f(\text{x}_0))^2] + \sigma^2\\ &=\underbrace{\mathbb{V}(\hat{f}(\text{x}_0))}_{\color{blue}{\text{Flexibility of }\hat{f}}}+\underbrace{(\text{Bias})^2}_{\color{darkgreen}{\text{How far }\hat{f}\text{ from } f}}+\underbrace{\sigma^2}_{\color{red}{\text{Uncontrollable Term}}}. \end{align*}\]

  • Bagging seeks to balance these terms by averaging nearly independent, high-variance models to produce a more stable predictive model.

  • By Jensen’s inequality, if \(\color{blue}{\hat{f}(\text{x}_0)=\frac{1}{T}\sum_{t=1}^Tf_t(\text{x}_0)}\), then \(\mathbb{E}_{\cal D}\left[\color{blue}{(\hat{f}(\text{x}_0)-y_0)^2}\right]\leq \frac{1}{T}\sum_{t=1}^T\mathbb{E}_{\cal D}\left[\color{red}{(f_t(\text{x}_0)-y_0)^2}\right],\) i.e. the averaged model is at least as good as the base learners on average; when the base learners are nearly uncorrelated, the variance term shrinks roughly like \(1/T\).
  • Bagging is well suited to:
    • High-variance base models: deep trees, \(K\)-NN with small \(K\)…
    • Settings where simplicity and scalability matter (trees can be trained in parallel)…

Bagging: Bootstrap Aggregating

Random Forest

  • In practice, plain Bagging does not work as well as hoped because the base learners (trees) are not as decorrelated as expected (the bootstrap samples themselves are highly correlated).
  • Random Forest: (What a cool name! 😎)
    • Number of trees: \(T\)
    • for t = 1,2,...,T:
      • Bootstrap sampling: sample \(B_t\) with replacement from \(\cal D\).
      • Random Features: at each split, consider only a random subset of the \(d\) input features (typically \(\sqrt{d}\) for classification, \(d/3\) for regression).
      • Build tree \(f_t\) on \(B_t\) using these feature-restricted splits.
    • Prediction: (same as before).
  • It works well because the random feature selection injects extra randomness on top of bootstrap sampling, further decorrelating the trees.
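
A hedged sketch of a Random Forest with scikit-learn’s RandomForestRegressor (the hyperparameter values are illustrative, not recommendations):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(
    n_estimators=300,      # T: number of trees
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
rf.fit(X_tr, y_tr)
print("Test R^2:", rf.score(X_te, y_te))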

Bagging: Bootstrap Aggregating

Cool things in bagging / random forest

Out-Of-Bag (OOB) Error

  • For each bootstrap sample \(B_t\), the observations \(\text{x}_i\notin B_t\) are called Out-Of-Bag (OOB) samples (\(\approx 37\%\) of the data).
  • Tree’s OOB Error: The average error for a tree, calculated using its out-of-bag samples.
  • OOB Sample Error: The average error for a given observation, computed from all trees whose bootstrap samples did not include that observation.
  • Overall or Average OOB Error: The average of all OOB Sample Errors, providing an overall measure of the model’s performance.
  • The Overall OOB Error can be used as an approximation of cross-validation error. However, it tends to overestimate the true error in classification according to Silke Janitza and Roman Hornung (2018).
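
A short sketch of how the Overall OOB Error comes for free in scikit-learn (reported as an OOB \(R^2\) for regression; the data are illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)
# oob_score_ is computed only on the out-of-bag samples of each tree,
# so no separate validation set is needed
print("OOB R^2:", rf.oob_score_)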

Bagging: Bootstrap Aggregating

Cool things in bagging / random forest

Feature Importances (FI)

Mean Decrease in Impurity (MDI)

  • The Reduction of the Impurity Measure when splitting on \(X_j\) at split \(t\) is denoted by \[\text{RIM}_t(X_j)=\text{Imp}_{t-1}(X_j)-\text{Imp}_t(X_j).\]
  • The importance of feature \(X_j\) within a tree: \(\sum_{t\in S_j}\text{RIM}_t(X_j),\) where \(S_j\) is the set of indices \(t\) at which a split is performed on \(X_j\).
  • The unnormalized FI of variable \(X_j\) is the sum of its within-tree FIs over all the built trees.
  • ⚠️ Might be biased toward high cardinality features and might not be accurate with highly correlated features.

Permutation Feature Importance

  • Train the initial random forest model, then compute its validation (or OOB) error, denoted by \(\text{Er}_0\).
  • For each \(X_j\), shuffle its column in the validation data, then recompute the validation error \(\text{Er}_j\) with the already trained model (no retraining).
  • The importance score for \(X_j\) is proportional to \(\text{Er}_j-\text{Er}_0\): the larger the increase in error after shuffling, the more important the feature.
  • It reflects a more direct influence of each feature on the model’s predictions.
  • ⚠️ Might be sensitive to data splits and a bit computationally expensive.
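
The two flavours of feature importance, sketched with scikit-learn (illustrative data; note that the permutation version reuses the already trained model and only shuffles the validation columns):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=5.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# MDI: accumulated impurity reduction per feature, computed during training
print("MDI importances:", rf.feature_importances_)

# Permutation importance: drop in validation score after shuffling each column,
# evaluated with the trained model (no retraining)
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
print("Permutation importances:", perm.importances_mean)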

Bagging: Bootstrap Aggregating

Extra-trees method

  • Number of trees: \(T\)
  • for t = 1,2,...,T:
    • Build tree: \(f_t\) is built on the full dataset (no bootstrap by default), but the splits are chosen differently. At each split,
      • Random Features: a random subset \(S_t\) of the full input features is considered.
      • Random Split Points: for each candidate feature, a cut-point is drawn at random, and the best of these random splits is kept.
  • Prediction: (same as before).
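
A sketch with scikit-learn’s ExtraTreesRegressor (illustrative settings); by default it uses the full training set and draws split thresholds at random instead of searching for the best one:

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

et = ExtraTreesRegressor(n_estimators=300, random_state=0)  # bootstrap=False by default
print("CV R^2:", cross_val_score(et, X, y, cv=5).mean())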

Bagging: Bootstrap Aggregating

Numerical experiment

First five rows of the Abalone dataset:

   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7


👉 Check out the notebook.

Boosting

General framework

  • Boosting combines weak learners (models that perform only slightly better than random guessing, e.g., trees with a few splits, a.k.a. stumps) to create a strong model; it was first introduced by Robert E. Schapire (1990).
  • Main framework: Sequentially train a new weak learner to correct mispredicted points of the previous learners. The resulting model is the weighted average of all the trained learners.
  • Form: \(\hat{f}(x)=\sum_{t=1}^Tw_tf_t(x)\), where \(f_t\) are weak learners built on data reweighted by \(D_t\).
  • It focuses more on reducing bias than on reducing variance.

Boosting

Algorithm

  • Number of trees: \(T\)
  • for t = 1, 2,..., T:
    • Build weak learner \(f_t\) by minimizing the weighted error \(\varepsilon_t\).
    • Learn and adjust weights:
      • Compute the learner weight \(w_t>0\) from the error on data weighted by \(\{D_t(i):i=1,\dots,n\}\).
      • Update the sample weights \(D_t(i)\to D_{t+1}(i)\) for each observation \(i\).
  • Combined model: \(\color{blue}{\hat{f}(x)=\sum_{t=1}^Tw_tf_t(x)}\).

Boosting

Adaboost: Adaptive Boosting

  • It’s for binary classification with \((\text{x}_i,y_i)\in\mathbb{R}^d\times\{-1,1\},i=1,\dots,n\).
  • Initialize the weight \(D_1(i)=1/n\) for all sample \(\text{x}_i\).
  • for t = 1, . . . , T:
    • Train the base classifier \(f_t\) by minimizing the weighted error \[\varepsilon_t=\mathbb{P}_{D_t}(f_t(X)\neq Y)=\sum_{i=1}^nD_t(i)\mathbb{1}_{\{f_t(\text{x}_i)\neq y_i\}}.\]
    • Calculate the learner weight \(w_t=\frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right).\)
    • Update the sample weights using \(D_{t+1}(i)=\frac{D_t(i)}{Z_t}e^{-w_ty_if_t(\text{x}_i)}\), where \(Z_t\) is the normalizing constant ensuring \(\sum_{i=1}^nD_{t+1}(i)=1\).
  • The final model: \(\color{blue}{\hat{f}(\text{x})=\text{sign}(\sum_{t=1}^Tw_tf_t(\text{x}))}\). (Read: Freund & Schapire (1999))
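
A from-scratch sketch of these update rules with decision stumps as weak learners (in practice one would rather call sklearn.ensemble.AdaBoostClassifier; the synthetic data and \(T=50\) are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y = 2 * y - 1                      # recode labels to {-1, +1}
n, T = len(X), 50

D = np.full(n, 1 / n)              # initial sample weights D_1(i) = 1/n
learners, w = [], []
for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=D)
    pred = stump.predict(X)
    eps = np.sum(D * (pred != y))                  # weighted error
    w_t = 0.5 * np.log((1 - eps) / (eps + 1e-12))  # learner weight
    D = D * np.exp(-w_t * y * pred)                # reweight the samples
    D = D / D.sum()                                # normalize (Z_t)
    learners.append(stump)
    w.append(w_t)

# Final model: sign of the weighted vote
F = sum(w_t * f_t.predict(X) for w_t, f_t in zip(w, learners))
print("Training accuracy:", np.mean(np.sign(F) == y))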

Boosting

XGBoost: EXtreme Gradient Boosting

  • It is a regularized, second-order variant of Gradient Boosting: \(F_M(\text{x})=\sum_{t=1}^Mw_tf_t(\text{x})\).
  • Let \(L\) be the loss function of the problem:
    • Initial model: \(F_0=\arg\min_{c}\sum_{i=1}^nL(y_i,c).\)
    • for t = 1, 2, ..., M:
      • Compute pseudo-residuals: \(r_{t,i}=-\frac{\partial L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)}\Big|_{F=F_{t-1}}\).
      • Train a base learner \(\color{blue}{f_t}\) on the new data \(\{(\text{x}_i,r_{t,i}):i=1,\dots,n\}\).
      • Solve for the step size \(\color{blue}{w_t}\): \[\color{blue}{w_t}=\arg\min_{\color{blue}{w}}\sum_{i=1}^nL(y_i,F_{t-1}(\text{x}_i)+\color{blue}{w\,f_t(\text{x}_i)}).\]
      • Update aggregated model: \(F_t(\text{x})=F_{t-1}(\text{x})+\color{blue}{w_tf_t(\text{x})}\).
  • In XGBoost:

\[\begin{align*}{\cal L}(F_t)&=\sum_{i=1}^nL(y_i,F_t(\text{x}_i))+\sum_{m=1}^t\Omega(f_m),\qquad \Omega(f)=\gamma T_f+\frac{1}{2}\lambda\|w\|^2,\\ {\cal L}(F_t)&\approx \sum_{i=1}^n\Big[L(y_i,F_{t-1}(\text{x}_i))+g_if_t(\text{x}_i)+\frac{1}{2}h_if_t^2(\text{x}_i)\Big]+\Omega(f_t),\\ g_i&=\frac{\partial L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)}\Big|_{F=F_{t-1}},\qquad h_i=\frac{\partial^2 L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)^2}\Big|_{F=F_{t-1}},\end{align*}\]
where \(T_f\) is the number of leaves of the tree \(f\) and \(w\) its vector of leaf weights.
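
A usage sketch with the xgboost package’s scikit-learn style API, assuming the package is installed; reg_lambda and gamma play the roles of \(\lambda\) and \(\gamma\) in \(\Omega\), and the remaining hyperparameters are illustrative:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

xgb = XGBRegressor(
    n_estimators=300,     # M: number of boosting rounds
    learning_rate=0.05,   # shrinkage applied to each added tree
    max_depth=4,          # depth of each weak learner
    reg_lambda=1.0,       # L2 penalty on leaf weights (lambda)
    gamma=0.0,            # penalty per additional leaf (gamma)
)
xgb.fit(X_tr, y_tr)
print("Test R^2:", xgb.score(X_te, y_te))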

Boosting

Other variants

  • LightGBM: Light Gradient Boosting Machine, by researchers at Microsoft (2017). It works for regression, classification and ranking problems. Main improvements:
    • Gradient-based One-Side Sampling (GOSS)
    • Exclusive Feature Bundling (EFB),

which together ensure the efficiency and accuracy of the method.

  • CatBoost: Efficient with categorical features (automatic encoding) and an unbiased estimate of the gradient at each iteration via ordered boosting (see: Prokhorenkova et al. (2017)).
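
Hedged usage sketches, assuming the lightgbm and catboost packages are installed (hyperparameters are illustrative):

from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

lgbm = LGBMRegressor(n_estimators=300, learning_rate=0.05).fit(X, y)
# For real categorical columns, CatBoost can take their indices via cat_features=[...]
cat = CatBoostRegressor(iterations=300, learning_rate=0.05, verbose=0).fit(X, y)
print(lgbm.predict(X[:3]), cat.predict(X[:3]))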

Boosting

Numerical Experiment: Abalone Dataset




👉 Check out the notebook.

Consensual Aggregation

General framework

  • Consensual Aggregation combines \(M\) base learners \(f_1,\dots,f_M\) according to how “close” the predictions given by these learners are.
  • Roughly speaking, it applies nonparametric methods to the predicted features: \(\widehat{\text{x}}=(f_1(\text{x}),\dots,f_M(\text{x}))\).
  • General form: \(\hat{y}=\sum_{i=1}^nW_{n,i}(\color{red}{\widehat{\text{x}}})y_i\)

\[\text{x}\overset{f_1,\dots,f_M}{\mapsto}\color{red}{\widehat{\text{x}}=(f_1(\text{x}),\dots,f_M(\text{x}))}\mapsto \hat{y}=\sum_{i=1}^nW_{n,i}(\color{red}{\widehat{\text{x}}})y_i.\]

Consensual Aggregation

Consensual Classifier: Mojirsheibani (1999)

  • Suppose we have 3 binary classifiers: \(C=(C_1, C_2, C_3)\) with training data:
Id            \(C_1\)   \(C_2\)   \(C_3\)   \(Y\)
1             \(0\)     \(1\)     \(1\)     \(1\)
2             \(1\)     \(1\)     \(0\)     \(\color{blue}{1}\)
3             \(0\)     \(0\)     \(0\)     \(0\)
4             \(1\)     \(1\)     \(0\)     \(\color{blue}{1}\)
5             \(1\)     \(1\)     \(0\)     \(\color{red}{0}\)
\(\text{x}\)  \(1\)     \(1\)     \(0\)     \(?\)
  • The prediction for \(\text{x}\) is the majority class among the training points whose base predictions coincide with those of \(\text{x}\) (rows 2, 4 and 5 above): \(\hat{y}=1\).
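
A tiny Python sketch reproducing this toy example (the arrays simply re-encode the table above):

import numpy as np

# Predictions of C1, C2, C3 on the five training points, and their labels Y
preds = np.array([[0, 1, 1],
                  [1, 1, 0],
                  [0, 0, 0],
                  [1, 1, 0],
                  [1, 1, 0]])
Y = np.array([1, 1, 0, 1, 0])

x_pred = np.array([1, 1, 0])             # base predictions for the new point x
match = np.all(preds == x_pred, axis=1)  # points whose predictions agree with those of x (rows 2, 4, 5)
y_hat = int(np.round(Y[match].mean()))   # majority vote among the matching points
print(y_hat)                             # -> 1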

Consensual Aggregation

COBRA by Biau et al. (2016)
Gradient COBRA by Has (2023)

  • Given base regressors: \({\bf r}=(r_1,\dots,r_M)\), the combination takes the form: \[\hat{y}=\sum_{i=1}^nW_{n,i}(\text{x})y_i.\]

  • Training such a model is equivalent to finding an optimal smoothing parameter \(h\) minimizing the \(K\)-fold cross-validation error:

\[\varphi(h)=\frac{1}{K}\sum_{k=1}^K\sum_{(\text{x}_i,y_i)\in F_k}(\hat{y}_i-y_i)^2,\]

where \(F_k\) denotes the \(k\)-th validation fold.
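
A minimal COBRA-style sketch under these definitions (the data split, the choice of base regressors and the bandwidth h are all illustrative; a real implementation would tune h by minimizing \(\varphi(h)\) as above):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=8, noise=5.0, random_state=0)
# One half trains the base regressors, the other half is kept for the aggregation
X_b, X_a, y_b, y_a = train_test_split(X, y, test_size=0.5, random_state=0)

bases = [LinearRegression().fit(X_b, y_b), DecisionTreeRegressor(max_depth=5).fit(X_b, y_b)]
P_a = np.column_stack([r.predict(X_a) for r in bases])  # predicted features on the aggregation set

def cobra_predict(x, h=10.0):
    """Average the y_i whose base predictions all lie within h of those of x."""
    p_x = np.array([r.predict(x.reshape(1, -1))[0] for r in bases])
    keep = np.all(np.abs(P_a - p_x) <= h, axis=1)       # consensus on every base prediction
    return y_a[keep].mean() if keep.any() else y_a.mean()

print(cobra_predict(X_a[0]))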

Consensual Aggregation

MixCOBRA by Fischer & Mougeot (2019)

  • Given base regressors \({\bf r}=(r_1,\dots,r_M)\), the combination takes the same form, but the weights now depend on both the input \(\text{x}\) and the predicted features: \[\hat{y}=\sum_{i=1}^nW_{n,i}(\text{x},\widehat{\text{x}})y_i.\]

  • In this case, we search for the pair of smoothing parameters \((\alpha,\beta)\) (for the input and the predicted features, respectively) minimizing the cross-validation error:

\[\varphi(\alpha,\beta)=\frac{1}{K}\sum_{k=1}^K\sum_{(\text{x}_i,y_i)\in F_k}(\hat{y}_i-y_i)^2.\]

Consensual Aggregation

Numerical Experiment: Abalone Dataset




👉 Check out the notebook.

Stacking

General framework

  • Stacking stands on the idea that, given \(M\) base learners \(f_1,\dots,f_M\), the predicted features \(\widehat{\text{x}}_i=(f_1(\text{x}_i),\dots,f_M(\text{x}_i))\) should already be close to the target \(y_i\).
  • Goal: use these predicted features \(\widehat{\text{x}}_i\) to predict \(y_i\).
  • Meta-model: a learner \(f\) trained on the \(\widehat{\text{x}}_i\) to predict \(y_i\).

\[\text{x}_i\overset{f_1,\dots,f_M}{\mapsto}\color{red}{\widehat{\text{x}}_i=(f_1(\text{x}_i),\dots,f_M(\text{x}_i))}\overset{\text{Meta-learner }f}{\mapsto} \hat{y}_i=f(\color{red}{\widehat{\text{x}}_i}).\]
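
A stacking sketch with scikit-learn’s StackingRegressor (base learners, meta-learner and cv value are illustrative); since the meta-learner is fitted on out-of-fold predictions of the base learners, this is close in spirit to the Super Learner described next:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=Ridge(),   # meta-learner f acting on the predicted features
    cv=5,                      # out-of-fold predictions for training the meta-learner
)
print("CV R^2:", cross_val_score(stack, X, y, cv=5).mean())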

Stacking

Super Learner by M. J. Van der Laan (2007)

  • It can be used for both classification and regression. The meta-learner is trained on cross-validated (out-of-fold) predictions of the base learners, which avoids overfitting due to information leakage.

Stacking

Numerical Experiment: Abalone Dataset




👉 Check out the notebook.

Summary

  • Ensemble learning: combine base learners to create a stronger predictor.

  • Bagging: Combines high-variance base learners (trees) to produce a more stable and accurate final model.

  • Boosting: Sequentially combines weak learners, each aiming to correct the mistakes of the previously built combined learner, to create a strong model.

  • Consensual Aggregation: Combines different learners based on the consensus of their predictions (predicted features).

  • Stacking: Build meta-learners using predicted features of base learners to predict the target.

🥳 It’s party time 🥂









📋 View party menu here: Party 5 Menu.

🫠 Download party invitation here: Party 5 Invitation Letter.