{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **TP5 - Ensemble Learning**\n",
"\n",
"-----\n",
"\n",
"**Course**: Advanced Machine Learning
\n",
"**Lecturer**: Sothea HAS, PhD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Objective:** Ensemble Learning Methods are about combining several base learners to enhance its performance. In this TP, you will apply each ensemble learning method on real datasets and analyze its sensitivity in terms of the key hyperparameters of the method.\n",
"\n",
"- The `notebook` of this `TP` can be downloaded here: [TP5_Ensemble_Learning.ipynb](https://hassothea.github.io/Advanced-Machine-Learning-ITC/TPs/TP5_Ensemble_Learning.ipynb){target=\"_blank\"}.\n",
"\n",
"----------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **1. Food Delivery Dataset**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is designed for predicting food delivery times based on various influencing factors such as distance, weather, traffic conditions, and time of day. It offers a practical and engaging challenge for machine learning practitioners, especially those interested in logistics and operations research. Read and load the data from kaggle: [Food Delivery Dataset](https://www.kaggle.com/datasets/denkuznetz/food-delivery-time-prediction/data)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Order_ID | \n",
" Distance_km | \n",
" Weather | \n",
" Traffic_Level | \n",
" Time_of_Day | \n",
" Vehicle_Type | \n",
" Preparation_Time_min | \n",
" Courier_Experience_yrs | \n",
" Delivery_Time_min | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 522 | \n",
" 7.93 | \n",
" Windy | \n",
" Low | \n",
" Afternoon | \n",
" Scooter | \n",
" 12 | \n",
" 1.0 | \n",
" 43 | \n",
"
\n",
" \n",
" 1 | \n",
" 738 | \n",
" 16.42 | \n",
" Clear | \n",
" Medium | \n",
" Evening | \n",
" Bike | \n",
" 20 | \n",
" 2.0 | \n",
" 84 | \n",
"
\n",
" \n",
" 2 | \n",
" 741 | \n",
" 9.52 | \n",
" Foggy | \n",
" Low | \n",
" Night | \n",
" Scooter | \n",
" 28 | \n",
" 1.0 | \n",
" 59 | \n",
"
\n",
" \n",
" 3 | \n",
" 661 | \n",
" 7.44 | \n",
" Rainy | \n",
" Medium | \n",
" Afternoon | \n",
" Scooter | \n",
" 5 | \n",
" 1.0 | \n",
" 37 | \n",
"
\n",
" \n",
" 4 | \n",
" 412 | \n",
" 19.03 | \n",
" Clear | \n",
" Low | \n",
" Morning | \n",
" Bike | \n",
" 16 | \n",
" 5.0 | \n",
" 68 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Order_ID Distance_km Weather Traffic_Level Time_of_Day Vehicle_Type \\\n",
"0 522 7.93 Windy Low Afternoon Scooter \n",
"1 738 16.42 Clear Medium Evening Bike \n",
"2 741 9.52 Foggy Low Night Scooter \n",
"3 661 7.44 Rainy Medium Afternoon Scooter \n",
"4 412 19.03 Clear Low Morning Bike \n",
"\n",
" Preparation_Time_min Courier_Experience_yrs Delivery_Time_min \n",
"0 12 1.0 43 \n",
"1 20 2.0 84 \n",
"2 28 1.0 59 \n",
"3 5 1.0 37 \n",
"4 16 5.0 68 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import kagglehub\n",
"\n",
"# Download latest version\n",
"path = kagglehub.dataset_download(\"denkuznetz/food-delivery-time-prediction\")\n",
"\n",
"# Import data\n",
"import pandas as pd\n",
"data = pd.read_csv(path + \"/Food_Delivery_Times.csv\")\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(883, 9)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"data.dropna().shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**A. Overview of the dataset.** \n",
"\n",
"- Address the dimension, qualitative and quantitative columns of the dataset.\n",
"\n",
"- Create statistical summary of the dataset. \n",
"\n",
"- Identify problems and handle them if there is any:\n",
" - Missing values,\n",
" - Duplicated data,\n",
" - Outliers...\n",
"\n",
"- Perform bivariate analysis to detect useful inputs for the model:\n",
" - Correlation matrix\n",
" - Graphs... "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# To do"
]
},
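{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A minimal sketch of the overview step (not a full solution), run on a small synthetic stand-in for the delivery data so the cell is self-contained; the column names here are illustrative, and the same calls apply to the `data` frame loaded above.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Synthetic stand-in for the Kaggle data (illustrative columns only)\n",
"rng = np.random.default_rng(0)\n",
"df = pd.DataFrame({'Distance_km': rng.uniform(1, 20, 50),\n",
"                   'Weather': rng.choice(['Clear', 'Rainy', 'Windy'], 50),\n",
"                   'Preparation_Time_min': rng.integers(5, 30, 50),\n",
"                   'Delivery_Time_min': rng.integers(20, 90, 50)})\n",
"\n",
"print(df.shape)                           # dimension\n",
"print(df.dtypes)                          # quantitative vs qualitative columns\n",
"print(df.describe(include='all'))         # statistical summary\n",
"print(df.isna().sum())                    # missing values per column\n",
"print(df.duplicated().sum())              # duplicated rows\n",
"print(df.select_dtypes('number').corr())  # correlation matrix (numeric columns)"
]
},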
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**B. Model development: OOB vs Cross Validation** \n",
"\n",
"- Split the dataset into $80\\%-20\\%$ training-testing data using `random_state = 42`.\n",
"\n",
"- Build a random forest model and fine-tune its hyperparameters using `MSE` criterion based on two different approaches:\n",
" - Out-Of-Bag Errors (see `model.oob_score_`)\n",
" - Cross-validation method (you may use `GridSearchCV` from `sklearn.model_selection`).\n",
"\n",
"- Report the test RMSE and compare the two results.\n",
"\n",
"- Repeat the questions with ExtraTrees model from the same module. Compare the result to Random Forest."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# To do"
]
},
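{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A sketch of the OOB-vs-CV tuning workflow on synthetic data standing in for the delivery dataset; the grid values are illustrative and only `n_estimators` is tuned here.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import GridSearchCV, train_test_split\n",
"\n",
"# Synthetic regression data standing in for the delivery dataset\n",
"rng = np.random.default_rng(42)\n",
"X = pd.DataFrame({'Distance_km': rng.uniform(1, 20, 200),\n",
"                  'Preparation_Time_min': rng.integers(5, 30, 200)})\n",
"y = 3 * X['Distance_km'] + X['Preparation_Time_min'] + rng.normal(0, 2, 200)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# 1) OOB tuning: keep the forest with the highest OOB score\n",
"best_oob, rf_oob = -np.inf, None\n",
"for n in [50, 100, 200]:\n",
"    rf = RandomForestRegressor(n_estimators=n, oob_score=True, random_state=42)\n",
"    rf.fit(X_train, y_train)\n",
"    if rf.oob_score_ > best_oob:\n",
"        best_oob, rf_oob = rf.oob_score_, rf\n",
"\n",
"# 2) Cross-validation tuning over the same grid\n",
"grid = GridSearchCV(RandomForestRegressor(random_state=42),\n",
"                    {'n_estimators': [50, 100, 200]},\n",
"                    scoring='neg_mean_squared_error', cv=5)\n",
"grid.fit(X_train, y_train)\n",
"\n",
"def rmse(model):\n",
"    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))\n",
"\n",
"print('OOB-tuned test RMSE:', round(rmse(rf_oob), 2))\n",
"print('CV-tuned test RMSE:', round(rmse(grid.best_estimator_), 2))"
]
},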
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**C. Boosting: Feature Importances**\n",
"\n",
"- Compute Mean Decrease Impurity (MDI) and Permutation Feature Importance (PFI) from the optimal random forest built in the previous question.\n",
"\n",
"- Build and fine-tune Adaboost model using `AdaBoostRegressor` from `sklearn.ensemble`. Compute both feature importances for this model and report its test performace.\n",
"\n",
"- Build and fine-tune XGBoost from [`XGboost`](https://xgboost.readthedocs.io/en/stable/python/python_intro.html#install-xgboost). Compute both feature importances for the model and report the test performance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# To do"
]
},
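{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A sketch contrasting the two importance measures on synthetic data where only the `signal` column drives the target; both measures should rank it far above `noise`.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.inspection import permutation_importance\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic data: 'signal' drives the target, 'noise' does not\n",
"rng = np.random.default_rng(0)\n",
"X = pd.DataFrame({'signal': rng.normal(size=300), 'noise': rng.normal(size=300)})\n",
"y = 2 * X['signal'] + rng.normal(0, 0.1, 300)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)\n",
"\n",
"rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)\n",
"\n",
"# MDI: impurity-based importances stored on the fitted model\n",
"mdi = pd.Series(rf.feature_importances_, index=X.columns)\n",
"\n",
"# PFI: mean drop in test score when each column is shuffled\n",
"perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)\n",
"pfi = pd.Series(perm.importances_mean, index=X.columns)\n",
"\n",
"print(mdi.round(3))\n",
"print(pfi.round(3))"
]
},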
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**D. Consensual Aggregation and Stacking**\n",
"\n",
"- Build consensual aggregators and stacking models then report their test performances. \n",
"\n",
"- Compare to the previous models. Conclude."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# To do"
]
},
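{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A sketch of a stacked ensemble with scikit-learn's `StackingRegressor` on synthetic data; the base learners and the ridge meta-learner are illustrative choices, not the required ones.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.ensemble import RandomForestRegressor, StackingRegressor\n",
"from sklearn.linear_model import Ridge\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic regression problem\n",
"rng = np.random.default_rng(1)\n",
"X = rng.normal(size=(300, 3))\n",
"y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.2, 300)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)\n",
"\n",
"# Stack a forest and a tree; a ridge meta-learner combines their\n",
"# cross-validated predictions\n",
"stack = StackingRegressor(\n",
"    estimators=[('rf', RandomForestRegressor(n_estimators=50, random_state=42)),\n",
"                ('tree', DecisionTreeRegressor(random_state=42))],\n",
"    final_estimator=Ridge())\n",
"stack.fit(X_tr, y_tr)\n",
"\n",
"rmse_stack = float(np.sqrt(mean_squared_error(y_te, stack.predict(X_te))))\n",
"print('Stacking test RMSE:', round(rmse_stack, 3))"
]
},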
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**E. Neural Network.**\n",
"\n",
"- Design a neural network to predict the testing data and compute its RMSE.\n",
"\n",
"- Compre to the previous results and conclude."
]
},
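{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A sketch of a small neural-network baseline using scikit-learn's `MLPRegressor` on synthetic data (a Keras or PyTorch network would follow the same train/evaluate pattern); the architecture is an illustrative choice.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.neural_network import MLPRegressor\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Synthetic regression problem\n",
"rng = np.random.default_rng(0)\n",
"X = rng.normal(size=(400, 3))\n",
"y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 400)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)\n",
"\n",
"# Scale inputs, then fit a two-hidden-layer MLP\n",
"net = make_pipeline(StandardScaler(),\n",
"                    MLPRegressor(hidden_layer_sizes=(32, 16),\n",
"                                 max_iter=2000, random_state=42))\n",
"net.fit(X_tr, y_tr)\n",
"rmse_nn = float(np.sqrt(mean_squared_error(y_te, net.predict(X_te))))\n",
"print('NN test RMSE:', round(rmse_nn, 3))"
]
},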
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **2. [Kaggle Stroke Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)**\n",
"\n",
"Stroke, also known as a cerebrovascular accident (CVA), occurs when blood flow to a part of the brain is interrupted or reduced, depriving brain tissue of oxygen and nutrients. This dataset contains information such as age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, and body mass index (BMI). The goal is to use this data to build predictive models that can help identify individuals at high risk of stroke, enabling early intervention and potentially saving lives. It is a very highly imbalanced dataset, you may face challenges in building a model. Random sampling and weighting methods may be considered. For more information, see: [Kaggle Stroke Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" gender | \n",
" age | \n",
" hypertension | \n",
" heart_disease | \n",
" ever_married | \n",
" work_type | \n",
" Residence_type | \n",
" avg_glucose_level | \n",
" bmi | \n",
" smoking_status | \n",
" stroke | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 9046 | \n",
" Male | \n",
" 67.0 | \n",
" 0 | \n",
" 1 | \n",
" Yes | \n",
" Private | \n",
" Urban | \n",
" 228.69 | \n",
" 36.6 | \n",
" formerly smoked | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 51676 | \n",
" Female | \n",
" 61.0 | \n",
" 0 | \n",
" 0 | \n",
" Yes | \n",
" Self-employed | \n",
" Rural | \n",
" 202.21 | \n",
" NaN | \n",
" never smoked | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" 31112 | \n",
" Male | \n",
" 80.0 | \n",
" 0 | \n",
" 1 | \n",
" Yes | \n",
" Private | \n",
" Rural | \n",
" 105.92 | \n",
" 32.5 | \n",
" never smoked | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" 60182 | \n",
" Female | \n",
" 49.0 | \n",
" 0 | \n",
" 0 | \n",
" Yes | \n",
" Private | \n",
" Urban | \n",
" 171.23 | \n",
" 34.4 | \n",
" smokes | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" 1665 | \n",
" Female | \n",
" 79.0 | \n",
" 1 | \n",
" 0 | \n",
" Yes | \n",
" Self-employed | \n",
" Rural | \n",
" 174.12 | \n",
" 24.0 | \n",
" never smoked | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id gender age hypertension heart_disease ever_married \\\n",
"0 9046 Male 67.0 0 1 Yes \n",
"1 51676 Female 61.0 0 0 Yes \n",
"2 31112 Male 80.0 0 1 Yes \n",
"3 60182 Female 49.0 0 0 Yes \n",
"4 1665 Female 79.0 1 0 Yes \n",
"\n",
" work_type Residence_type avg_glucose_level bmi smoking_status \\\n",
"0 Private Urban 228.69 36.6 formerly smoked \n",
"1 Self-employed Rural 202.21 NaN never smoked \n",
"2 Private Rural 105.92 32.5 never smoked \n",
"3 Private Urban 171.23 34.4 smokes \n",
"4 Self-employed Rural 174.12 24.0 never smoked \n",
"\n",
" stroke \n",
"0 1 \n",
"1 1 \n",
"2 1 \n",
"3 1 \n",
"4 1 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"path = kagglehub.dataset_download(\"fedesoriano/stroke-prediction-dataset\")\n",
"\n",
"data = pd.read_csv(path + '/healthcare-dataset-stroke-data.csv')\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References\n",
"\n",
"$^{\\text{π}}$ [Bagging predictors, Breiman (1996)](https://link.springer.com/article/10.1007/BF00058655){target=\"_blank\"}.
\n",
"$^{\\text{π}}$ [The strength of weak learnability, Robert E. Schapire (1990).](https://link.springer.com/article/10.1007/BF00116037){target=\"_blank\"}.
\n",
"$^{\\text{π}}$ [COBRA: A combined regression strategy, Beau et al. (2016)](https://www.sciencedirect.com/science/article/pii/S0047259X15000950){target=\"_blank\"}.
\n",
"$^{\\text{π}}$ [Gradient COBRA: A kernel-based consensual aggregation for regression, Has (2023)](https://doi.org/10.52933/jdssv.v3i2.70){target=\"_blank\"}.
\n",
"$^{\\text{π}}$ [Aggregation using inputβoutput trade-off, Fischer & Mougeot (2019)](https://www.sciencedirect.com/science/article/abs/pii/S0378375818302349){target=\"_blank\"}.
\n",
"$^{\\text{π}}$ [Super Learner, M. J. Van der Laan (2007)](https://www.degruyter.com/document/doi/10.2202/1544-6115.1309/html){target=\"_blank\"}.
"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}