Lab 5 - Ensemble Learning

Course: ITM 390 004: Machine Learning
Lecturer: Sothea HAS, PhD


Objective: Ensemble learning methods combine several base learners to enhance their overall performance. In this lab, you will apply each ensemble learning method to real datasets and analyze its sensitivity to the method's key hyperparameters. Moreover, feature importances will also be computed from each model.


1. Email Spam Dataset

This dataset contains bag-of-words features: percentages or counts of words/special characters in spam and nonspam emails. Your task is to build an email spam filter that identifies spam emails.

import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data = data.drop(columns=['Id'])
data.head(5)
make address all num3d our over remove internet order mail ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 spam
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 spam
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 spam
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 spam
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 spam

5 rows × 58 columns

A. Overview and preprocessing

  • Report the dimension of the dataset and identify its qualitative and quantitative columns.

  • Is predicting the type of email a regression problem or a classification problem? Is the dataset well-balanced?

  • Does the dataset contain any missing values? If so, handle them.

  • Does the dataset contain any duplicated rows? If so, handle them.

  • With 57 input columns, it's not easy to detect outliers column by column using boxplots; use the z-score method to handle outliers if there are any (a sketch follows the code cell below).

# To do
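
One possible sketch of these steps is given below; the |z| > 3 cutoff for dropping outlier rows is our own choice, not one prescribed by the lab.

import numpy as np

# Dimension and column types
print(data.shape)                                  # (rows, columns)
print(data.dtypes.value_counts())                  # quantitative vs qualitative columns

# Class balance of the categorical target `type` -> a classification problem
print(data['type'].value_counts(normalize=True))

# Missing values and duplicated rows
print(data.isna().sum().sum())                     # total number of missing cells
data = data.drop_duplicates()

# Z-score outlier handling on the numeric input columns:
# keep only rows where every column lies within 3 standard deviations of its mean.
num_cols = data.select_dtypes(include='number').columns
z = (data[num_cols] - data[num_cols].mean()) / data[num_cols].std()
data = data[(z.abs() <= 3).all(axis=1)]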

B. Random Forest: OOB vs Cross Validation

  • Split the dataset into \(80\%-20\%\) training-testing data using random_state = 42.

  • Build a random forest model with its default settings. Then compute the "free" Out-Of-Bag (OOB) error or score obtained by the forest (see model.oob_score_).

  • Compute the following metrics on the test data: Accuracy, Precision, Recall and F1-score. Store them in a data frame.

  • Fine-tune the hyperparameters of the random forest, then evaluate its CV error or score. Compare it to the OOB score from the second point.

  • Compute the four metrics of the fine-tuned random forest model on the test data. Compare them to those of the default model.

  • Compute and plot the mean decrease in impurity (MDI) and permutation feature importances of the better of the two models (a sketch follows the code cell below).

# To do
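
A minimal sketch of this workflow, assuming accuracy as the CV scoring metric and a small illustrative hyperparameter grid; the grid values and the helper name test_metrics are our own assumptions, not part of the lab.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.inspection import permutation_importance

X, y = data.drop(columns=['type']), data['type']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Default forest: setting oob_score=True gives the "free" OOB estimate
rf = RandomForestClassifier(oob_score=True, random_state=42).fit(X_train, y_train)
print("OOB score:", rf.oob_score_)

def test_metrics(model):
    """Accuracy, Precision, Recall and F1-score on the test data."""
    y_pred = model.predict(X_test)
    return pd.DataFrame({
        'Accuracy':  [accuracy_score(y_test, y_pred)],
        'Precision': [precision_score(y_test, y_pred, pos_label='spam')],
        'Recall':    [recall_score(y_test, y_pred, pos_label='spam')],
        'F1-score':  [f1_score(y_test, y_pred, pos_label='spam')]})

print(test_metrics(rf))

# Fine-tuning by cross-validation (illustrative grid)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [100, 300],
     'max_features': ['sqrt', 0.3],
     'min_samples_leaf': [1, 5]},
    cv=5).fit(X_train, y_train)
print("CV score:", grid.best_score_, "| OOB score:", rf.oob_score_)
print(test_metrics(grid.best_estimator_))

# Both feature importances of the better model: MDI and permutation
best = grid.best_estimator_
mdi = pd.Series(best.feature_importances_, index=X.columns).sort_values()
perm = permutation_importance(best, X_test, y_test, n_repeats=10, random_state=42)
perm = pd.Series(perm.importances_mean, index=X.columns).sort_values()
mdi.tail(15).plot.barh(title='Mean decrease in impurity')   # likewise for `perm`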

C. Extra-trees: OOB vs Cross Validation

  • Repeat the previous questions of part (B), from the second point onward, using the ExtraTreesClassifier model from the same module. Compare the results to the Random Forest.

  • Compute and plot the mean decrease in impurity and permutation feature importances of the best Extra-Trees model.

# To do
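
A sketch reusing the split and the test_metrics helper from the previous cell; note that, unlike the random forest, ExtraTreesClassifier only exposes an OOB score when bootstrap=True, which we set explicitly.

from sklearn.ensemble import ExtraTreesClassifier

# Extra-Trees draws no bootstrap samples by default, so request them for OOB
et = ExtraTreesClassifier(bootstrap=True, oob_score=True,
                          random_state=42).fit(X_train, y_train)
print("OOB score:", et.oob_score_)

grid_et = GridSearchCV(
    ExtraTreesClassifier(random_state=42),
    {'n_estimators': [100, 300], 'max_features': ['sqrt', 0.3]},
    cv=5).fit(X_train, y_train)
print("CV score:", grid_et.best_score_)
print(test_metrics(grid_et.best_estimator_))   # compare with the Random Forest table
# MDI and permutation importances follow the same pattern as in part (B)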

D. Boosting: Feature Importances

  • Build and fine-tune an AdaBoost model (AdaBoostClassifier from sklearn.ensemble) using the CV technique.

  • Compute both feature importances for this model and report its test performances (a sketch follows the code cell below).

# To do
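
A possible sketch with an illustrative grid, again reusing the split, the test_metrics helper, and the permutation_importance pattern from part (B).

from sklearn.ensemble import AdaBoostClassifier

grid_ada = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    {'n_estimators': [100, 300], 'learning_rate': [0.1, 0.5, 1.0]},
    cv=5).fit(X_train, y_train)

best_ada = grid_ada.best_estimator_
print(test_metrics(best_ada))                      # test performances

# Both feature importances: built-in MDI and permutation
mdi_ada = pd.Series(best_ada.feature_importances_, index=X.columns)
perm_ada = permutation_importance(best_ada, X_test, y_test,
                                  n_repeats=10, random_state=42)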

E. XGBoost

  • Build and fine-tune the hyperparameters of an XGBoost model (XGBClassifier from the xgboost package).

  • Compute both feature importances for the model and report the test performances.

# To do
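
A sketch assuming the scikit-learn wrapper XGBClassifier; recent xgboost versions expect numeric class labels, so the spam/nonspam strings are encoded first. The grid values are illustrative.

from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)            # nonspam/spam -> 0/1
y_test_enc = le.transform(y_test)

grid_xgb = GridSearchCV(
    XGBClassifier(eval_metric='logloss', random_state=42),
    {'n_estimators': [200, 400],
     'max_depth': [3, 6],
     'learning_rate': [0.05, 0.1]},
    cv=5).fit(X_train, y_train_enc)

best_xgb = grid_xgb.best_estimator_
print("Test accuracy:", best_xgb.score(X_test, y_test_enc))

# Both feature importances: the model's built-in scores and permutation
imp_xgb = pd.Series(best_xgb.feature_importances_, index=X.columns)
perm_xgb = permutation_importance(best_xgb, X_test, y_test_enc,
                                  n_repeats=10, random_state=42)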

2. Kaggle Stroke Dataset

Stroke, also known as a cerebrovascular accident (CVA), occurs when blood flow to a part of the brain is interrupted or reduced, depriving brain tissue of oxygen and nutrients. This dataset contains information such as age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, and body mass index (BMI). The goal is to use this data to build predictive models that can help identify individuals at high risk of stroke, enabling early intervention and potentially saving lives. It is a highly imbalanced dataset, so you may face challenges in building a model; random sampling and weighting methods may be considered. For more information, see: Kaggle Stroke Dataset.

import kagglehub

path = kagglehub.dataset_download("mirzahasnine/heart-disease-dataset")
data = pd.read_csv(path + '/heart_disease.csv')
data.head()
Gender age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose Heart_ stroke
0 Male 39 postgraduate 0 0.0 0.0 no 0 0 195.0 106.0 70.0 26.97 80.0 77.0 No
1 Female 46 primaryschool 0 0.0 0.0 no 0 0 250.0 121.0 81.0 28.73 95.0 76.0 No
2 Male 48 uneducated 1 20.0 0.0 no 0 0 245.0 127.5 80.0 25.34 75.0 70.0 No
3 Female 61 graduate 1 30.0 0.0 no 1 0 225.0 150.0 95.0 28.58 65.0 103.0 yes
4 Female 46 graduate 1 23.0 0.0 no 0 0 285.0 130.0 84.0 23.10 85.0 85.0 No
  • Build and fine-tune the ensemble learning models above and evaluate them on a 20% test split of this dataset (a sketch follows below).

  • Compute both feature importances of each model on this dataset.
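
One possible sketch for a single model, assuming the target column is named 'Heart_ stroke' exactly as printed above and using class weighting to address the imbalance; random over-/under-sampling (e.g. with imbalanced-learn) would be an alternative. The column name and the grid are assumptions.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

df = data.dropna().drop_duplicates()
X = pd.get_dummies(df.drop(columns=['Heart_ stroke']), drop_first=True)
y = df['Heart_ stroke'].str.lower()        # normalize the 'No'/'yes' labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# class_weight='balanced' reweights the minority (stroke) class
rf_stroke = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    {'n_estimators': [200, 400], 'min_samples_leaf': [1, 5]},
    cv=5, scoring='f1_macro').fit(X_tr, y_tr)
print(rf_stroke.score(X_te, y_te))         # f1_macro on the test split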

References

\(^{\text{📚}}\) Bagging predictors, Breiman (1996).
\(^{\text{📚}}\) The strength of weak learnability, Robert E. Schapire (1990).
\(^{\text{📚}}\) COBRA: A combined regression strategy, Biau et al. (2016).
\(^{\text{📚}}\) Gradient COBRA: A kernel-based consensual aggregation for regression, Has (2023).
\(^{\text{📚}}\) Aggregation using input–output trade-off, Fischer & Mougeot (2019).
\(^{\text{📚}}\) Super Learner, M. J. van der Laan (2007).