Lab 5 - Ensemble Learning

Course: ITM 390 004: Machine Learning
Lecturer: Sothea HAS, PhD


Objective: Ensemble learning methods combine several base learners to enhance their overall performance. In this lab, you will apply each ensemble learning method to real datasets and analyze its sensitivity to the method's key hyperparameters. Moreover, feature importances will also be computed from each model.


1. Email Spam Dataset

This dataset contains bag-of-words features: percentages or counts of words/special characters in spam and nonspam emails. Your task is to build an email spam filter that identifies spam emails.

import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data = data.drop(columns=['Id'])
data.head(5)
make address all num3d our over remove internet order mail ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 spam
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 spam
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 spam
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 spam
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 spam

5 rows × 58 columns

A. Overview and preprocessing

  • Report the dimension of the dataset and identify its qualitative and quantitative columns.

  • Is predicting the type of email a regression problem or a classification problem? Is the dataset well-balanced?

  • Does the dataset contain any missing values? If so, handle them.

  • Does the dataset contain any duplicated rows? If so, handle them.

  • With 57 input columns, it's not easy to detect outliers column by column using boxplots; use the z-score method to handle outliers if there are any (a sketch follows the code cell below).

# To do
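
One possible sketch of these steps is given below; the |z| > 3 cutoff for dropping outlier rows is our own choice, not one prescribed by the lab.

import numpy as np

# Dimension and column types
print(data.shape)                                  # (rows, columns)
print(data.dtypes.value_counts())                  # quantitative vs qualitative columns

# Class balance of the categorical target `type` -> a classification problem
print(data['type'].value_counts(normalize=True))

# Missing values and duplicated rows
print(data.isna().sum().sum())                     # total number of missing cells
data = data.drop_duplicates()

# Z-score outlier handling on the numeric input columns:
# keep only rows where every column lies within 3 standard deviations of its mean.
num_cols = data.select_dtypes(include='number').columns
z = (data[num_cols] - data[num_cols].mean()) / data[num_cols].std()
data = data[(z.abs() <= 3).all(axis=1)]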

B. Random Forest: OOB vs Cross Validation

  • Split the dataset into \(80\%-20\%\) training-testing data using random_state = 42.

  • Build a random forest model with its default settings. Then compute the "free" Out-Of-Bag (OOB) error or score obtained by the forest (see model.oob_score_).

  • Compute the following metrics on the test data: Accuracy, Precision, Recall and F1-score. Store them in a data frame.

  • Fine-tune the hyperparameters of the random forest, then evaluate its CV error or score. Compare it to the OOB score from the second point.

  • Compute the four metrics of the fine-tuned random forest model on the test data. Compare them to those of the default model.

  • Compute and plot the mean decrease in impurity (MDI) and permutation feature importances of the better of the two models (a sketch follows the code cell below).

# To do
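
A minimal sketch of this workflow, assuming accuracy as the CV scoring metric and a small illustrative hyperparameter grid; the grid values and the helper name test_metrics are our own assumptions, not part of the lab.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.inspection import permutation_importance

X, y = data.drop(columns=['type']), data['type']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Default forest: setting oob_score=True gives the "free" OOB estimate
rf = RandomForestClassifier(oob_score=True, random_state=42).fit(X_train, y_train)
print("OOB score:", rf.oob_score_)

def test_metrics(model):
    """Accuracy, Precision, Recall and F1-score on the test data."""
    y_pred = model.predict(X_test)
    return pd.DataFrame({
        'Accuracy':  [accuracy_score(y_test, y_pred)],
        'Precision': [precision_score(y_test, y_pred, pos_label='spam')],
        'Recall':    [recall_score(y_test, y_pred, pos_label='spam')],
        'F1-score':  [f1_score(y_test, y_pred, pos_label='spam')]})

print(test_metrics(rf))

# Fine-tuning by cross-validation (illustrative grid)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [100, 300],
     'max_features': ['sqrt', 0.3],
     'min_samples_leaf': [1, 5]},
    cv=5).fit(X_train, y_train)
print("CV score:", grid.best_score_, "| OOB score:", rf.oob_score_)
print(test_metrics(grid.best_estimator_))

# Both feature importances of the better model: MDI and permutation
best = grid.best_estimator_
mdi = pd.Series(best.feature_importances_, index=X.columns).sort_values()
perm = permutation_importance(best, X_test, y_test, n_repeats=10, random_state=42)
perm = pd.Series(perm.importances_mean, index=X.columns).sort_values()
mdi.tail(15).plot.barh(title='Mean decrease in impurity')   # likewise for `perm`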

C. Extra-trees: OOB vs Cross Validation

  • Repeat the previous questions of part (B), from the second point onward, using the ExtraTreesClassifier model from the same module. Compare the results to the Random Forest.

  • Compute and plot the mean decrease in impurity and permutation feature importances of the best Extra-Trees model.

# To do
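
A sketch reusing the split and the test_metrics helper from the previous cell; note that, unlike the random forest, ExtraTreesClassifier only exposes an OOB score when bootstrap=True, which we set explicitly.

from sklearn.ensemble import ExtraTreesClassifier

# Extra-Trees draws no bootstrap samples by default, so request them for OOB
et = ExtraTreesClassifier(bootstrap=True, oob_score=True,
                          random_state=42).fit(X_train, y_train)
print("OOB score:", et.oob_score_)

grid_et = GridSearchCV(
    ExtraTreesClassifier(random_state=42),
    {'n_estimators': [100, 300], 'max_features': ['sqrt', 0.3]},
    cv=5).fit(X_train, y_train)
print("CV score:", grid_et.best_score_)
print(test_metrics(grid_et.best_estimator_))   # compare with the Random Forest table
# MDI and permutation importances follow the same pattern as in part (B)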

D. Boosting: Feature Importances

  • Build and fine-tune an AdaBoost model (AdaBoostClassifier from sklearn.ensemble) using the CV technique.

  • Compute both feature importances for this model and report its test performances (a sketch follows the code cell below).

# To do
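
A possible sketch with an illustrative grid, again reusing the split, the test_metrics helper, and the permutation_importance pattern from part (B).

from sklearn.ensemble import AdaBoostClassifier

grid_ada = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    {'n_estimators': [100, 300], 'learning_rate': [0.1, 0.5, 1.0]},
    cv=5).fit(X_train, y_train)

best_ada = grid_ada.best_estimator_
print(test_metrics(best_ada))                      # test performances

# Both feature importances: built-in MDI and permutation
mdi_ada = pd.Series(best_ada.feature_importances_, index=X.columns)
perm_ada = permutation_importance(best_ada, X_test, y_test,
                                  n_repeats=10, random_state=42)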

E. XGBoost

  • Build and fine-tune the hyperparameters of an XGBoost model (XGBClassifier from the xgboost package).

  • Compute both feature importances for the model and report the test performances.

# To do
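
A sketch assuming the scikit-learn wrapper XGBClassifier; recent xgboost versions expect numeric class labels, so the spam/nonspam strings are encoded first. The grid values are illustrative.

from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)            # nonspam/spam -> 0/1
y_test_enc = le.transform(y_test)

grid_xgb = GridSearchCV(
    XGBClassifier(eval_metric='logloss', random_state=42),
    {'n_estimators': [200, 400],
     'max_depth': [3, 6],
     'learning_rate': [0.05, 0.1]},
    cv=5).fit(X_train, y_train_enc)

best_xgb = grid_xgb.best_estimator_
print("Test accuracy:", best_xgb.score(X_test, y_test_enc))

# Both feature importances: the model's built-in scores and permutation
imp_xgb = pd.Series(best_xgb.feature_importances_, index=X.columns)
perm_xgb = permutation_importance(best_xgb, X_test, y_test_enc,
                                  n_repeats=10, random_state=42)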

2. Kaggle Stroke Dataset

Stroke, also known as a cerebrovascular accident (CVA), occurs when blood flow to a part of the brain is interrupted or reduced, depriving brain tissue of oxygen and nutrients. This dataset contains information such as age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, and body mass index (BMI). The goal is to use this data to build predictive models that can help identify individuals at high risk of stroke, enabling early intervention and potentially saving lives. It is a highly imbalanced dataset, so you may face challenges in building a model; random sampling and weighting methods may be considered. For more information, see: Kaggle Stroke Dataset.

import kagglehub

path = kagglehub.dataset_download("mirzahasnine/heart-disease-dataset")
data = pd.read_csv(path + '/heart_disease.csv')
data.head()
Gender age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose Heart_ stroke
0 Male 39 postgraduate 0 0.0 0.0 no 0 0 195.0 106.0 70.0 26.97 80.0 77.0 No
1 Female 46 primaryschool 0 0.0 0.0 no 0 0 250.0 121.0 81.0 28.73 95.0 76.0 No
2 Male 48 uneducated 1 20.0 0.0 no 0 0 245.0 127.5 80.0 25.34 75.0 70.0 No
3 Female 61 graduate 1 30.0 0.0 no 1 0 225.0 150.0 95.0 28.58 65.0 103.0 yes
4 Female 46 graduate 1 23.0 0.0 no 0 0 285.0 130.0 84.0 23.10 85.0 85.0 No
  • Build and fine-tune the ensemble learning models above and evaluate them on a 20% test split of this dataset (a sketch follows below).

  • Compute both feature importances of each model on this dataset.
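
One possible sketch for a single model, assuming the target column is named 'Heart_ stroke' exactly as printed above and using class weighting to address the imbalance; random over-/under-sampling (e.g. with imbalanced-learn) would be an alternative. The column name and the grid are assumptions.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

df = data.dropna().drop_duplicates()
X = pd.get_dummies(df.drop(columns=['Heart_ stroke']), drop_first=True)
y = df['Heart_ stroke'].str.lower()        # normalize the 'No'/'yes' labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# class_weight='balanced' reweights the minority (stroke) class
rf_stroke = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=42),
    {'n_estimators': [200, 400], 'min_samples_leaf': [1, 5]},
    cv=5, scoring='f1_macro').fit(X_tr, y_tr)
print(rf_stroke.score(X_te, y_te))         # f1_macro on the test split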

References

\(^{\text{📚}}\) Bagging predictors, Breiman (1996).
\(^{\text{📚}}\) The strength of weak learnability, Robert E. Schapire (1990).
\(^{\text{📚}}\) COBRA: A combined regression strategy, Biau et al. (2016).
\(^{\text{📚}}\) Gradient COBRA: A kernel-based consensual aggregation for regression, Has (2023).
\(^{\text{📚}}\) Aggregation using input–output trade-off, Fischer & Mougeot (2019).
\(^{\text{📚}}\) Super Learner, M. J. van der Laan (2007).