TP5 - Ensemble Learning


Course: Advanced Machine Learning
Lecturer: Sothea HAS, PhD

Objective: Ensemble learning methods combine several base learners to enhance overall predictive performance. In this TP, you will apply each ensemble learning method to real datasets and analyze its sensitivity to the method's key hyperparameters.


1. Food Delivery Dataset

This dataset is designed for predicting food delivery times based on various influencing factors such as distance, weather, traffic conditions, and time of day. It offers a practical and engaging challenge for machine learning practitioners, especially those interested in logistics and operations research. Read about and load the data from Kaggle: Food Delivery Dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("denkuznetz/food-delivery-time-prediction")

# Import data
import pandas as pd
data = pd.read_csv(path + "/Food_Delivery_Times.csv")
data.head()
Order_ID Distance_km Weather Traffic_Level Time_of_Day Vehicle_Type Preparation_Time_min Courier_Experience_yrs Delivery_Time_min
0 522 7.93 Windy Low Afternoon Scooter 12 1.0 43
1 738 16.42 Clear Medium Evening Bike 20 2.0 84
2 741 9.52 Foggy Low Night Scooter 28 1.0 59
3 661 7.44 Rainy Medium Afternoon Scooter 5 1.0 37
4 412 19.03 Clear Low Morning Bike 16 5.0 68
import numpy as np
data.dropna().shape
(883, 9)

A. Overview of the dataset.

  • Report the dimensions of the dataset and identify its qualitative and quantitative columns.

  • Create a statistical summary of the dataset.

  • Identify problems and handle them if there are any:

    • Missing values,
    • Duplicated data,
    • Outliers…
  • Perform bivariate analysis to detect useful inputs for the model:

    • Correlation matrix
    • Graphs…
# To do
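The EDA steps above can be sketched with standard pandas calls. The frame below is a tiny synthetic stand-in for the delivery data (its columns and values are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the delivery data (illustrative only).
df = pd.DataFrame({
    "Distance_km": [7.9, 16.4, np.nan, 7.4, 19.0, 19.0],
    "Weather": ["Windy", "Clear", "Foggy", "Rainy", "Clear", "Clear"],
    "Delivery_Time_min": [43, 84, 59, 37, 68, 68],
})

# Dimension, qualitative vs quantitative columns
print(df.shape)
print(df.dtypes)

# Statistical summary of numeric and categorical columns at once
print(df.describe(include="all"))

# Missing values and duplicated rows
print(df.isna().sum())
print(df.duplicated().sum())

# Correlation between numeric columns (bivariate analysis)
print(df[["Distance_km", "Delivery_Time_min"]].corr())
```

On the real data, the same calls run on `data` directly; categorical columns such as `Weather` or `Traffic_Level` can be explored further with `value_counts()` and grouped boxplots.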

B. Model development: OOB vs Cross Validation

  • Split the dataset into \(80\%-20\%\) training-testing data using random_state = 42.

  • Build a random forest model and fine-tune its hyperparameters using the MSE criterion, based on two different approaches:

    • Out-Of-Bag Errors (see model.oob_score_)
    • Cross-validation method (you may use GridSearchCV from sklearn.model_selection).
  • Report the test RMSE and compare the two results.

  • Repeat the questions with the ExtraTrees model from the same module. Compare the results to Random Forest.

# To do
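A minimal sketch of the two tuning approaches, using synthetic regression data in place of the prepared delivery features (the grid over `max_features` is only an example; tune whichever hyperparameters you choose):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the delivery features.
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Approach 1: pick max_features by OOB score (oob_score_ requires bootstrap).
best_oob, best_mf = -np.inf, None
for mf in [2, 4, 8]:
    rf = RandomForestRegressor(n_estimators=200, max_features=mf,
                               oob_score=True, random_state=42).fit(X_train, y_train)
    if rf.oob_score_ > best_oob:
        best_oob, best_mf = rf.oob_score_, mf

# Approach 2: same grid via 5-fold cross-validation on (negative) MSE.
grid = GridSearchCV(RandomForestRegressor(n_estimators=200, random_state=42),
                    {"max_features": [2, 4, 8]},
                    scoring="neg_mean_squared_error", cv=5).fit(X_train, y_train)

# Refit the OOB winner and report test RMSE.
rf_final = RandomForestRegressor(n_estimators=200, max_features=best_mf,
                                 random_state=42).fit(X_train, y_train)
rmse = mean_squared_error(y_test, rf_final.predict(X_test)) ** 0.5
print(best_mf, grid.best_params_["max_features"], round(rmse, 2))
```

OOB tuning needs only one fit per candidate, while cross-validation refits `cv` times per candidate; the two often select similar values. The same pattern applies to `ExtraTreesRegressor`.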

C. Boosting: Feature Importances

  • Compute Mean Decrease Impurity (MDI) and Permutation Feature Importance (PFI) from the optimal random forest built in the previous question.

  • Build and fine-tune an AdaBoost model using AdaBoostRegressor from sklearn.ensemble. Compute both feature importances for this model and report its test performance.

  • Build and fine-tune an XGBoost model from the xgboost package. Compute both feature importances for the model and report the test performance.

# To do
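The two importance measures can be sketched as follows on synthetic data: MDI comes for free from the fitted trees via `feature_importances_`, while PFI shuffles one column at a time and measures the score drop on held-out data. `xgboost.XGBRegressor` exposes the same `fit`/`feature_importances_` interface, so the pattern carries over:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with only 2 truly informative features out of 5.
X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)

# MDI: impurity-based importances accumulated during training (sums to 1).
mdi = rf.feature_importances_

# PFI: mean drop in test score when each column is shuffled.
pfi = permutation_importance(rf, X_test, y_test, n_repeats=10,
                             random_state=42).importances_mean

# Same MDI attribute on a boosting model.
ada = AdaBoostRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
print(np.round(mdi, 3), np.round(pfi, 3), np.round(ada.feature_importances_, 3))
```

MDI is computed on training impurities and can favor high-cardinality features, whereas PFI is evaluated on test data, so comparing the two rankings is itself informative.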

D. Consensual Aggregation and Stacking

  • Build consensual aggregators and stacking models, then report their test performances.

  • Compare to the previous models. Conclude.

# To do
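Consensual aggregation (COBRA-style, per the references below) has its own dedicated implementations; the sketch here covers only the stacking part, using sklearn's `StackingRegressor` on synthetic data. Base learners' out-of-fold predictions become the features of a meta-learner (here a Ridge regression, one common choice):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for the prepared delivery features.
X, y = make_regression(n_samples=300, n_features=6, noise=8.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Stacking: base learners' cross-validated predictions feed the meta-learner.
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
                ("knn", KNeighborsRegressor(n_neighbors=5))],
    final_estimator=Ridge(),
).fit(X_train, y_train)

rmse = mean_squared_error(y_test, stack.predict(X_test)) ** 0.5
print(round(rmse, 2))
```

Diverse base learners (trees, nearest neighbors, linear models) tend to help the meta-learner more than several copies of the same model family.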

E. Neural Network.

  • Design a neural network to predict the testing data and compute its RMSE.

  • Compare to the previous results and conclude.
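A minimal neural-network baseline can be built with sklearn's `MLPRegressor` (any framework works; the hidden-layer sizes below are arbitrary starting values, and feature scaling matters for gradient-based training):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the delivery features.
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize inputs, then fit a small two-hidden-layer network.
net = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000,
                                 random_state=42))
net.fit(X_train, y_train)

rmse = mean_squared_error(y_test, net.predict(X_test)) ** 0.5
print(round(rmse, 2))
```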

2. Kaggle Stroke Dataset

Stroke, also known as a cerebrovascular accident (CVA), occurs when blood flow to a part of the brain is interrupted or reduced, depriving brain tissue of oxygen and nutrients. This dataset contains information such as age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, and body mass index (BMI). The goal is to use this data to build predictive models that can help identify individuals at high risk of stroke, enabling early intervention and potentially saving lives. The dataset is highly imbalanced, so you may face challenges in building a model; random sampling and weighting methods may be considered. For more information, see: Kaggle Stroke Dataset.

path = kagglehub.dataset_download("fedesoriano/stroke-prediction-dataset")

data = pd.read_csv(path + '/healthcare-dataset-stroke-data.csv')
data.head()
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 Female 61.0 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 Male 80.0 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 Female 79.0 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1
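The two remedies mentioned above (weighting and random sampling) can be sketched on synthetic imbalanced data, with roughly 5% positives standing in for the rare stroke class:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~5% positives) standing in for stroke labels.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Remedy 1 (weighting): penalize errors on the rare class more heavily.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=42).fit(X_train, y_train)
f1 = f1_score(y_test, clf.predict(X_test))
print(round(f1, 3))

# Remedy 2 (random over-sampling): duplicate minority rows in the training set
# until the classes are balanced; apply it to training data only.
pos = np.where(y_train == 1)[0]
extra = np.random.default_rng(42).choice(pos, size=len(y_train) - 2 * len(pos))
X_bal = np.vstack([X_train, X_train[extra]])
y_bal = np.concatenate([y_train, y_train[extra]])
print(round(y_bal.mean(), 2))  # roughly balanced positive rate
```

With such imbalance, accuracy is misleading; prefer F1, recall, or ROC/PR-AUC, and keep the test set at its natural class ratio.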

References

\(^{\text{📚}}\) Bagging predictors, Breiman (1996).
\(^{\text{📚}}\) The strength of weak learnability, Robert E. Schapire (1990).
\(^{\text{📚}}\) COBRA: A combined regression strategy, Biau et al. (2016).
\(^{\text{📚}}\) Gradient COBRA: A kernel-based consensual aggregation for regression, Has (2023).
\(^{\text{📚}}\) Aggregation using input–output trade-off, Fischer & Mougeot (2019).
\(^{\text{📚}}\) Super Learner, M. J. Van der Laan (2007).