TP5 - Ensemble Learning


Course: Advanced Machine Learning
Lecturer: Sothea HAS, PhD

Objective: Ensemble learning methods combine several base learners to enhance overall predictive performance. In this TP, you will apply each ensemble learning method to real datasets and analyze its sensitivity to the method's key hyperparameters.


1. Food Delivery Dataset

This dataset is designed for predicting food delivery times based on various influencing factors such as distance, weather, traffic conditions, and time of day. It offers a practical and engaging challenge for machine learning practitioners, especially those interested in logistics and operations research. Read about and load the data from Kaggle: Food Delivery Dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("denkuznetz/food-delivery-time-prediction")

# Import data
import pandas as pd
data = pd.read_csv(path + "/Food_Delivery_Times.csv")
data.head()
Order_ID Distance_km Weather Traffic_Level Time_of_Day Vehicle_Type Preparation_Time_min Courier_Experience_yrs Delivery_Time_min
0 522 7.93 Windy Low Afternoon Scooter 12 1.0 43
1 738 16.42 Clear Medium Evening Bike 20 2.0 84
2 741 9.52 Foggy Low Night Scooter 28 1.0 59
3 661 7.44 Rainy Medium Afternoon Scooter 5 1.0 37
4 412 19.03 Clear Low Morning Bike 16 5.0 68
import numpy as np
data.dropna().shape
(883, 9)

A. Overview of the dataset.

  • Report the dimensions of the dataset and identify its qualitative and quantitative columns.

  • Create a statistical summary of the dataset.

  • Identify problems and handle them if there are any:

    • Missing values,
    • Duplicated data,
    • Outliers…
  • Perform bivariate analysis to detect useful inputs for the model:

    • Correlation matrix
    • Graphs…
# To do
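The EDA steps above can be sketched with standard pandas calls. The frame below is a tiny synthetic stand-in for the delivery data (its columns and values are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the delivery data (illustrative only).
df = pd.DataFrame({
    "Distance_km": [7.9, 16.4, np.nan, 7.4, 19.0, 19.0],
    "Weather": ["Windy", "Clear", "Foggy", "Rainy", "Clear", "Clear"],
    "Delivery_Time_min": [43, 84, 59, 37, 68, 68],
})

# Dimension, qualitative vs quantitative columns
print(df.shape)
print(df.dtypes)

# Statistical summary of numeric and categorical columns at once
print(df.describe(include="all"))

# Missing values and duplicated rows
print(df.isna().sum())
print(df.duplicated().sum())

# Correlation between numeric columns (bivariate analysis)
print(df[["Distance_km", "Delivery_Time_min"]].corr())
```

On the real data, the same calls run on `data` directly; categorical columns such as `Weather` or `Traffic_Level` can be explored further with `value_counts()` and grouped boxplots.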

B. Model development: OOB vs Cross Validation

  • Split the dataset into \(80\%-20\%\) training-testing data using random_state = 42.

  • Build a random forest model and fine-tune its hyperparameters using the MSE criterion, based on two different approaches:

    • Out-Of-Bag Errors (see model.oob_score_)
    • Cross-validation method (you may use GridSearchCV from sklearn.model_selection).
  • Report the test RMSE and compare the two results.

  • Repeat the questions with the ExtraTrees model from the same module. Compare the results to Random Forest.

# To do
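A minimal sketch of the two tuning approaches, using synthetic regression data in place of the prepared delivery features (the grid over `max_features` is only an example; tune whichever hyperparameters you choose):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the delivery features.
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Approach 1: pick max_features by OOB score (oob_score_ requires bootstrap).
best_oob, best_mf = -np.inf, None
for mf in [2, 4, 8]:
    rf = RandomForestRegressor(n_estimators=200, max_features=mf,
                               oob_score=True, random_state=42).fit(X_train, y_train)
    if rf.oob_score_ > best_oob:
        best_oob, best_mf = rf.oob_score_, mf

# Approach 2: same grid via 5-fold cross-validation on (negative) MSE.
grid = GridSearchCV(RandomForestRegressor(n_estimators=200, random_state=42),
                    {"max_features": [2, 4, 8]},
                    scoring="neg_mean_squared_error", cv=5).fit(X_train, y_train)

# Refit the OOB winner and report test RMSE.
rf_final = RandomForestRegressor(n_estimators=200, max_features=best_mf,
                                 random_state=42).fit(X_train, y_train)
rmse = mean_squared_error(y_test, rf_final.predict(X_test)) ** 0.5
print(best_mf, grid.best_params_["max_features"], round(rmse, 2))
```

OOB tuning needs only one fit per candidate, while cross-validation refits `cv` times per candidate; the two often select similar values. The same pattern applies to `ExtraTreesRegressor`.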

C. Boosting: Feature Importances

  • Compute Mean Decrease Impurity (MDI) and Permutation Feature Importance (PFI) from the optimal random forest built in the previous question.

  • Build and fine-tune an AdaBoost model using AdaBoostRegressor from sklearn.ensemble. Compute both feature importances for this model and report its test performance.

  • Build and fine-tune an XGBoost model from the xgboost package. Compute both feature importances for the model and report the test performance.

# To do
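The two importance measures can be sketched as follows on synthetic data: MDI comes for free from the fitted trees via `feature_importances_`, while PFI shuffles one column at a time and measures the score drop on held-out data. `xgboost.XGBRegressor` exposes the same `fit`/`feature_importances_` interface, so the pattern carries over:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with only 2 truly informative features out of 5.
X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)

# MDI: impurity-based importances accumulated during training (sums to 1).
mdi = rf.feature_importances_

# PFI: mean drop in test score when each column is shuffled.
pfi = permutation_importance(rf, X_test, y_test, n_repeats=10,
                             random_state=42).importances_mean

# Same MDI attribute on a boosting model.
ada = AdaBoostRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
print(np.round(mdi, 3), np.round(pfi, 3), np.round(ada.feature_importances_, 3))
```

MDI is computed on training impurities and can favor high-cardinality features, whereas PFI is evaluated on test data, so comparing the two rankings is itself informative.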

D. Consensual Aggregation and Stacking

  • Build consensual aggregators and stacking models, then report their test performances.

  • Compare to the previous models. Conclude.

# To do
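Consensual aggregation (COBRA-style, per the references below) has its own dedicated implementations; the sketch here covers only the stacking part, using sklearn's `StackingRegressor` on synthetic data. Base learners' out-of-fold predictions become the features of a meta-learner (here a Ridge regression, one common choice):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data standing in for the prepared delivery features.
X, y = make_regression(n_samples=300, n_features=6, noise=8.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Stacking: base learners' cross-validated predictions feed the meta-learner.
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=100, random_state=42)),
                ("knn", KNeighborsRegressor(n_neighbors=5))],
    final_estimator=Ridge(),
).fit(X_train, y_train)

rmse = mean_squared_error(y_test, stack.predict(X_test)) ** 0.5
print(round(rmse, 2))
```

Diverse base learners (trees, nearest neighbors, linear models) tend to help the meta-learner more than several copies of the same model family.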

E. Neural Network.

  • Design a neural network to predict the testing data and compute its RMSE.

  • Compare to the previous results and conclude.
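A minimal neural-network baseline can be built with sklearn's `MLPRegressor` (any framework works; the hidden-layer sizes below are arbitrary starting values, and feature scaling matters for gradient-based training):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the delivery features.
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize inputs, then fit a small two-hidden-layer network.
net = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000,
                                 random_state=42))
net.fit(X_train, y_train)

rmse = mean_squared_error(y_test, net.predict(X_test)) ** 0.5
print(round(rmse, 2))
```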

2. Kaggle Stroke Dataset

Stroke, also known as a cerebrovascular accident (CVA), occurs when blood flow to a part of the brain is interrupted or reduced, depriving brain tissue of oxygen and nutrients. This dataset contains information such as age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, and body mass index (BMI). The goal is to use this data to build predictive models that can help identify individuals at high risk of stroke, enabling early intervention and potentially saving lives. The dataset is highly imbalanced, so you may face challenges in building a model; random sampling and weighting methods may be considered. For more information, see: Kaggle Stroke Dataset.

path = kagglehub.dataset_download("fedesoriano/stroke-prediction-dataset")

data = pd.read_csv(path + '/healthcare-dataset-stroke-data.csv')
data.head()
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 Female 61.0 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 Male 80.0 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 Female 79.0 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1
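The two remedies mentioned above (weighting and random sampling) can be sketched on synthetic imbalanced data, with roughly 5% positives standing in for the rare stroke class:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~5% positives) standing in for stroke labels.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Remedy 1 (weighting): penalize errors on the rare class more heavily.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=42).fit(X_train, y_train)
f1 = f1_score(y_test, clf.predict(X_test))
print(round(f1, 3))

# Remedy 2 (random over-sampling): duplicate minority rows in the training set
# until the classes are balanced; apply it to training data only.
pos = np.where(y_train == 1)[0]
extra = np.random.default_rng(42).choice(pos, size=len(y_train) - 2 * len(pos))
X_bal = np.vstack([X_train, X_train[extra]])
y_bal = np.concatenate([y_train, y_train[extra]])
print(round(y_bal.mean(), 2))  # roughly balanced positive rate
```

With such imbalance, accuracy is misleading; prefer F1, recall, or ROC/PR-AUC, and keep the test set at its natural class ratio.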

References

\(^{\text{📚}}\) Bagging predictors, Breiman (1996).
\(^{\text{📚}}\) The strength of weak learnability, Robert E. Schapire (1990).
\(^{\text{📚}}\) COBRA: A combined regression strategy, Biau et al. (2016).
\(^{\text{📚}}\) Gradient COBRA: A kernel-based consensual aggregation for regression, Has (2023).
\(^{\text{📚}}\) Aggregation using input–output trade-off, Fischer & Mougeot (2019).
\(^{\text{📚}}\) Super Learner, M. J. Van der Laan (2007).