{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **TP5 - Ensemble Learning**\n",
"\n",
"-----\n",
"\n",
"**Course**: Advanced Machine Learning
\n",
"**Lecturer**: Sothea HAS, PhD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Objective:** Ensemble Learning Methods are about combining several base learners to enhance its performance. In this TP, you will apply each ensemble learning method on real datasets and analyze its sensitivity in terms of the key hyperparameters of the method.\n",
"\n",
"- The `notebook` of this `TP` can be downloaded here: [TP5_Ensemble_Learning.ipynb](https://hassothea.github.io/Advanced-Machine-Learning-ITC/TPs/TP5_Ensemble_Learning.ipynb){target=\"_blank\"}.\n",
"\n",
"----------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **1. Food Delivery Dataset**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is designed for predicting food delivery times based on various influencing factors such as distance, weather, traffic conditions, and time of day. It offers a practical and engaging challenge for machine learning practitioners, especially those interested in logistics and operations research. Read and load the data from kaggle: [Food Delivery Dataset](https://www.kaggle.com/datasets/denkuznetz/food-delivery-time-prediction/data)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Order_ID | \n",
" Distance_km | \n",
" Weather | \n",
" Traffic_Level | \n",
" Time_of_Day | \n",
" Vehicle_Type | \n",
" Preparation_Time_min | \n",
" Courier_Experience_yrs | \n",
" Delivery_Time_min | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 522 | \n",
" 7.93 | \n",
" Windy | \n",
" Low | \n",
" Afternoon | \n",
" Scooter | \n",
" 12 | \n",
" 1.0 | \n",
" 43 | \n",
"
\n",
" \n",
" 1 | \n",
" 738 | \n",
" 16.42 | \n",
" Clear | \n",
" Medium | \n",
" Evening | \n",
" Bike | \n",
" 20 | \n",
" 2.0 | \n",
" 84 | \n",
"
\n",
" \n",
" 2 | \n",
" 741 | \n",
" 9.52 | \n",
" Foggy | \n",
" Low | \n",
" Night | \n",
" Scooter | \n",
" 28 | \n",
" 1.0 | \n",
" 59 | \n",
"
\n",
" \n",
" 3 | \n",
" 661 | \n",
" 7.44 | \n",
" Rainy | \n",
" Medium | \n",
" Afternoon | \n",
" Scooter | \n",
" 5 | \n",
" 1.0 | \n",
" 37 | \n",
"
\n",
" \n",
" 4 | \n",
" 412 | \n",
" 19.03 | \n",
" Clear | \n",
" Low | \n",
" Morning | \n",
" Bike | \n",
" 16 | \n",
" 5.0 | \n",
" 68 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Order_ID Distance_km Weather Traffic_Level Time_of_Day Vehicle_Type \\\n",
"0 522 7.93 Windy Low Afternoon Scooter \n",
"1 738 16.42 Clear Medium Evening Bike \n",
"2 741 9.52 Foggy Low Night Scooter \n",
"3 661 7.44 Rainy Medium Afternoon Scooter \n",
"4 412 19.03 Clear Low Morning Bike \n",
"\n",
" Preparation_Time_min Courier_Experience_yrs Delivery_Time_min \n",
"0 12 1.0 43 \n",
"1 20 2.0 84 \n",
"2 28 1.0 59 \n",
"3 5 1.0 37 \n",
"4 16 5.0 68 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import kagglehub\n",
"\n",
"# Download latest version\n",
"path = kagglehub.dataset_download(\"denkuznetz/food-delivery-time-prediction\")\n",
"\n",
"# Import data\n",
"import pandas as pd\n",
"data = pd.read_csv(path + \"/Food_Delivery_Times.csv\")\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(883, 9)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"data.dropna().shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**A. Overview of the dataset.** \n",
"\n",
"- Address the dimension, qualitative and quantitative columns of the dataset.\n",
"\n",
"- Create statistical summary of the dataset. \n",
"\n",
"- Identify problems and handle them if there is any:\n",
" - Missing values,\n",
" - Duplicated data,\n",
" - Outliers...\n",
"\n",
"- Perform bivariate analysis to detect useful inputs for the model:\n",
" - Correlation matrix\n",
" - Graphs... "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# To do"
]
},
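{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A minimal sketch of the overview step (not a full solution), run on a small synthetic stand-in for the delivery data so the cell is self-contained; the column names here are illustrative, and the same calls apply to the `data` frame loaded above.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Synthetic stand-in for the Kaggle data (illustrative columns only)\n",
"rng = np.random.default_rng(0)\n",
"df = pd.DataFrame({'Distance_km': rng.uniform(1, 20, 50),\n",
"                   'Weather': rng.choice(['Clear', 'Rainy', 'Windy'], 50),\n",
"                   'Preparation_Time_min': rng.integers(5, 30, 50),\n",
"                   'Delivery_Time_min': rng.integers(20, 90, 50)})\n",
"\n",
"print(df.shape)                           # dimension\n",
"print(df.dtypes)                          # quantitative vs qualitative columns\n",
"print(df.describe(include='all'))         # statistical summary\n",
"print(df.isna().sum())                    # missing values per column\n",
"print(df.duplicated().sum())              # duplicated rows\n",
"print(df.select_dtypes('number').corr())  # correlation matrix (numeric columns)"
]
},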
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**B. Model development: OOB vs Cross Validation** \n",
"\n",
"- Split the dataset into $80\\%-20\\%$ training-testing data using `random_state = 42`.\n",
"\n",
"- Build a random forest model and fine-tune its hyperparameters using `MSE` criterion based on two different approaches:\n",
" - Out-Of-Bag Errors (see `model.oob_score_`)\n",
" - Cross-validation method (you may use `GridSearchCV` from `sklearn.model_selection`).\n",
"\n",
"- Report the test RMSE and compare the two results.\n",
"\n",
"- Repeat the questions with ExtraTrees model from the same module. Compare the result to Random Forest."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# To do"
]
},
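{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A sketch of the OOB-vs-CV tuning workflow on synthetic data standing in for the delivery dataset; the grid values are illustrative and only `n_estimators` is tuned here.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import GridSearchCV, train_test_split\n",
"\n",
"# Synthetic regression data standing in for the delivery dataset\n",
"rng = np.random.default_rng(42)\n",
"X = pd.DataFrame({'Distance_km': rng.uniform(1, 20, 200),\n",
"                  'Preparation_Time_min': rng.integers(5, 30, 200)})\n",
"y = 3 * X['Distance_km'] + X['Preparation_Time_min'] + rng.normal(0, 2, 200)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# 1) OOB tuning: keep the forest with the highest OOB score\n",
"best_oob, rf_oob = -np.inf, None\n",
"for n in [50, 100, 200]:\n",
"    rf = RandomForestRegressor(n_estimators=n, oob_score=True, random_state=42)\n",
"    rf.fit(X_train, y_train)\n",
"    if rf.oob_score_ > best_oob:\n",
"        best_oob, rf_oob = rf.oob_score_, rf\n",
"\n",
"# 2) Cross-validation tuning over the same grid\n",
"grid = GridSearchCV(RandomForestRegressor(random_state=42),\n",
"                    {'n_estimators': [50, 100, 200]},\n",
"                    scoring='neg_mean_squared_error', cv=5)\n",
"grid.fit(X_train, y_train)\n",
"\n",
"def rmse(model):\n",
"    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))\n",
"\n",
"print('OOB-tuned test RMSE:', round(rmse(rf_oob), 2))\n",
"print('CV-tuned test RMSE:', round(rmse(grid.best_estimator_), 2))"
]
},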
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**C. Boosting: Feature Importances**\n",
"\n",
"- Compute Mean Decrease Impurity (MDI) and Permutation Feature Importance (PFI) from the optimal random forest built in the previous question.\n",
"\n",
"- Build and fine-tune Adaboost model using `AdaBoostRegressor` from `sklearn.ensemble`. Compute both feature importances for this model and report its test performace.\n",
"\n",
"- Build and fine-tune XGBoost from [`XGboost`](https://xgboost.readthedocs.io/en/stable/python/python_intro.html#install-xgboost). Compute both feature importances for the model and report the test performance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# To do"
]
},
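{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A sketch contrasting the two importance measures on synthetic data where only the `signal` column drives the target; both measures should rank it far above `noise`.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.inspection import permutation_importance\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic data: 'signal' drives the target, 'noise' does not\n",
"rng = np.random.default_rng(0)\n",
"X = pd.DataFrame({'signal': rng.normal(size=300), 'noise': rng.normal(size=300)})\n",
"y = 2 * X['signal'] + rng.normal(0, 0.1, 300)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)\n",
"\n",
"rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)\n",
"\n",
"# MDI: impurity-based importances stored on the fitted model\n",
"mdi = pd.Series(rf.feature_importances_, index=X.columns)\n",
"\n",
"# PFI: mean drop in test score when each column is shuffled\n",
"perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)\n",
"pfi = pd.Series(perm.importances_mean, index=X.columns)\n",
"\n",
"print(mdi.round(3))\n",
"print(pfi.round(3))"
]
},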
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**D. Consensual Aggregation and Stacking**\n",
"\n",
"- Build consensual aggregators and stacking models then report their test performances. \n",
"\n",
"- Compare to the previous models. Conclude."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# To do"
]
},
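{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A sketch of a stacked ensemble with scikit-learn's `StackingRegressor` on synthetic data; the base learners and the ridge meta-learner are illustrative choices, not the required ones.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.ensemble import RandomForestRegressor, StackingRegressor\n",
"from sklearn.linear_model import Ridge\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic regression problem\n",
"rng = np.random.default_rng(1)\n",
"X = rng.normal(size=(300, 3))\n",
"y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.2, 300)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)\n",
"\n",
"# Stack a forest and a tree; a ridge meta-learner combines their\n",
"# cross-validated predictions\n",
"stack = StackingRegressor(\n",
"    estimators=[('rf', RandomForestRegressor(n_estimators=50, random_state=42)),\n",
"                ('tree', DecisionTreeRegressor(random_state=42))],\n",
"    final_estimator=Ridge())\n",
"stack.fit(X_tr, y_tr)\n",
"\n",
"rmse_stack = float(np.sqrt(mean_squared_error(y_te, stack.predict(X_te))))\n",
"print('Stacking test RMSE:', round(rmse_stack, 3))"
]
},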
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**E. Neural Network.**\n",
"\n",
"- Design a neural network to predict the testing data and compute its RMSE.\n",
"\n",
"- Compre to the previous results and conclude."
]
},
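{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A sketch of a small neural-network baseline using scikit-learn's `MLPRegressor` on synthetic data (a Keras or PyTorch network would follow the same train/evaluate pattern); the architecture is an illustrative choice.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.neural_network import MLPRegressor\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Synthetic regression problem\n",
"rng = np.random.default_rng(0)\n",
"X = rng.normal(size=(400, 3))\n",
"y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 400)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)\n",
"\n",
"# Scale inputs, then fit a two-hidden-layer MLP\n",
"net = make_pipeline(StandardScaler(),\n",
"                    MLPRegressor(hidden_layer_sizes=(32, 16),\n",
"                                 max_iter=2000, random_state=42))\n",
"net.fit(X_tr, y_tr)\n",
"rmse_nn = float(np.sqrt(mean_squared_error(y_te, net.predict(X_te))))\n",
"print('NN test RMSE:', round(rmse_nn, 3))"
]
},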
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **2. [Kaggle Stroke Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)**\n",
"\n",
"Stroke, also known as a cerebrovascular accident (CVA), occurs when blood flow to a part of the brain is interrupted or reduced, depriving brain tissue of oxygen and nutrients. This dataset contains information such as age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, and body mass index (BMI). The goal is to use this data to build predictive models that can help identify individuals at high risk of stroke, enabling early intervention and potentially saving lives. It is a very highly imbalanced dataset, you may face challenges in building a model. Random sampling and weighting methods may be considered. For more information, see: [Kaggle Stroke Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" gender | \n",
" age | \n",
" hypertension | \n",
" heart_disease | \n",
" ever_married | \n",
" work_type | \n",
" Residence_type | \n",
" avg_glucose_level | \n",
" bmi | \n",
" smoking_status | \n",
" stroke | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 9046 | \n",
" Male | \n",
" 67.0 | \n",
" 0 | \n",
" 1 | \n",
" Yes | \n",
" Private | \n",
" Urban | \n",
" 228.69 | \n",
" 36.6 | \n",
" formerly smoked | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 51676 | \n",
" Female | \n",
" 61.0 | \n",
" 0 | \n",
" 0 | \n",
" Yes | \n",
" Self-employed | \n",
" Rural | \n",
" 202.21 | \n",
" NaN | \n",
" never smoked | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" 31112 | \n",
" Male | \n",
" 80.0 | \n",
" 0 | \n",
" 1 | \n",
" Yes | \n",
" Private | \n",
" Rural | \n",
" 105.92 | \n",
" 32.5 | \n",
" never smoked | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" 60182 | \n",
" Female | \n",
" 49.0 | \n",
" 0 | \n",
" 0 | \n",
" Yes | \n",
" Private | \n",
" Urban | \n",
" 171.23 | \n",
" 34.4 | \n",
" smokes | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" 1665 | \n",
" Female | \n",
" 79.0 | \n",
" 1 | \n",
" 0 | \n",
" Yes | \n",
" Self-employed | \n",
" Rural | \n",
" 174.12 | \n",
" 24.0 | \n",
" never smoked | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id gender age hypertension heart_disease ever_married \\\n",
"0 9046 Male 67.0 0 1 Yes \n",
"1 51676 Female 61.0 0 0 Yes \n",
"2 31112 Male 80.0 0 1 Yes \n",
"3 60182 Female 49.0 0 0 Yes \n",
"4 1665 Female 79.0 1 0 Yes \n",
"\n",
" work_type Residence_type avg_glucose_level bmi smoking_status \\\n",
"0 Private Urban 228.69 36.6 formerly smoked \n",
"1 Self-employed Rural 202.21 NaN never smoked \n",
"2 Private Rural 105.92 32.5 never smoked \n",
"3 Private Urban 171.23 34.4 smokes \n",
"4 Self-employed Rural 174.12 24.0 never smoked \n",
"\n",
" stroke \n",
"0 1 \n",
"1 1 \n",
"2 1 \n",
"3 1 \n",
"4 1 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"path = kagglehub.dataset_download(\"fedesoriano/stroke-prediction-dataset\")\n",
"\n",
"data = pd.read_csv(path + '/healthcare-dataset-stroke-data.csv')\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References\n",
"\n",
"$^{\\text{π}}$ [Bagging predictors, Breiman (1996)](https://link.springer.com/article/10.1007/BF00058655){target=\"_blank\"}.
\n",
"$^{\\text{π}}$ [The strength of weak learnability, Robert E. Schapire (1990).](https://link.springer.com/article/10.1007/BF00116037){target=\"_blank\"}.
\n",
"$^{\\text{π}}$ [COBRA: A combined regression strategy, Beau et al. (2016)](https://www.sciencedirect.com/science/article/pii/S0047259X15000950){target=\"_blank\"}.
\n",
"$^{\\text{π}}$ [Gradient COBRA: A kernel-based consensual aggregation for regression, Has (2023)](https://doi.org/10.52933/jdssv.v3i2.70){target=\"_blank\"}.
\n",
"$^{\\text{π}}$ [Aggregation using inputβoutput trade-off, Fischer & Mougeot (2019)](https://www.sciencedirect.com/science/article/abs/pii/S0378375818302349){target=\"_blank\"}.
\n",
"$^{\\text{π}}$ [Super Learner, M. J. Van der Laan (2007)](https://www.degruyter.com/document/doi/10.2202/1544-6115.1309/html){target=\"_blank\"}.
"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}