Objective: Ensemble learning methods combine several base learners to enhance predictive performance. In this TP, you will apply each ensemble learning method to real datasets and analyze its sensitivity to the method's key hyperparameters.
A. Exploratory Data Analysis
This dataset is designed for predicting food delivery times based on various influencing factors such as distance, weather, traffic conditions, and time of day. It offers a practical and engaging challenge for machine learning practitioners, especially those interested in logistics and operations research. Read and load the data from Kaggle: Food Delivery Dataset.
Report the dimensions of the dataset and identify its qualitative and quantitative columns.
Create a statistical summary of the dataset.
Identify the following problems and handle them if any are present:
Missing values,
Duplicated data,
Outliers…
Perform bivariate analysis to detect useful inputs for the model:
Correlation matrix
Graphs…
# To do
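A minimal exploratory sketch is given below. The file name `Food_Delivery_Times.csv` and the implied target column are assumptions about the downloaded Kaggle file and should be adapted to the actual data.

```python
import pandas as pd

# File name is an assumption; point this at the CSV downloaded from Kaggle
df = pd.read_csv("Food_Delivery_Times.csv")

# Dimensions and column types (quantitative vs qualitative)
print(df.shape)
print(df.dtypes)

# Statistical summary of numeric and categorical columns
print(df.describe(include="all"))

# Missing values and duplicated rows
print(df.isna().sum())
print(df.duplicated().sum())

# Simple handling: drop duplicates, impute numeric NaNs with the median
df = df.drop_duplicates()
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Bivariate analysis: correlation matrix of the numeric columns
print(df[num_cols].corr())
```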
B. Model Development: OOB vs Cross-Validation
Split the dataset into \(80\%-20\%\) training and testing sets using random_state = 42.
Build a random forest model and fine-tune its hyperparameters using the MSE criterion with two different approaches:
Out-Of-Bag Errors (see model.oob_score_)
Cross-validation method (you may use GridSearchCV from sklearn.model_selection).
Report the test RMSE and compare the two results.
Repeat the questions with the ExtraTrees model from the same module. Compare the results to those of the random forest.
# To do
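A possible sketch of the two tuning approaches, reusing the dataframe `df` from the previous sketch; the target name `Delivery_Time_min` and the hyperparameter grid are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

# One-hot encode the qualitative columns; the target name is an assumption
X = pd.get_dummies(df.drop(columns=["Delivery_Time_min"]), drop_first=True).astype(float)
y = df["Delivery_Time_min"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"n_estimators": [100, 300, 500], "max_depth": [None, 10, 20]}

# Approach 1: tune on the OOB score (R^2 on out-of-bag samples; maximizing it is
# equivalent to minimizing the OOB MSE, and squared_error is the default split criterion)
best_oob, best_params = -np.inf, None
for n in param_grid["n_estimators"]:
    for depth in param_grid["max_depth"]:
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth, oob_score=True,
                                   random_state=42, n_jobs=-1).fit(X_train, y_train)
        if rf.oob_score_ > best_oob:
            best_oob, best_params = rf.oob_score_, {"n_estimators": n, "max_depth": depth}
rf_oob = RandomForestRegressor(**best_params, random_state=42, n_jobs=-1).fit(X_train, y_train)
rmse_oob = mean_squared_error(y_test, rf_oob.predict(X_test)) ** 0.5

# Approach 2: 5-fold cross-validation over the same grid
grid = GridSearchCV(RandomForestRegressor(random_state=42, n_jobs=-1), param_grid,
                    scoring="neg_root_mean_squared_error", cv=5).fit(X_train, y_train)
rmse_cv = mean_squared_error(y_test, grid.best_estimator_.predict(X_test)) ** 0.5
print(f"RF test RMSE - OOB tuning: {rmse_oob:.3f}, CV tuning: {rmse_cv:.3f}")

# Same cross-validation procedure with ExtraTrees (pass bootstrap=True if OOB tuning is wanted)
et = GridSearchCV(ExtraTreesRegressor(random_state=42, n_jobs=-1), param_grid,
                  scoring="neg_root_mean_squared_error", cv=5).fit(X_train, y_train)
rmse_et = mean_squared_error(y_test, et.best_estimator_.predict(X_test)) ** 0.5
print(f"ExtraTrees test RMSE - CV tuning: {rmse_et:.3f}")
```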
C. Boosting: Feature Importances
Compute Mean Decrease Impurity (MDI) and Permutation Feature Importance (PFI) from the optimal random forest built in the previous question.
Build and fine-tune an AdaBoost model using AdaBoostRegressor from sklearn.ensemble. Compute both feature importances for this model and report its test performance.
Build and fine-tune an XGBoost model using XGBRegressor from the xgboost library. Compute both feature importances for the model and report the test performance.
# To do
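A sketch of the feature-importance computations and of the two boosting models, reusing `X_train`, `X_test`, `y_train`, `y_test` and the tuned forest `grid.best_estimator_` from the previous sketch; the hyperparameter grids are illustrative assumptions.

```python
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

best_rf = grid.best_estimator_  # tuned random forest from the previous sketch

# MDI comes built in as feature_importances_; PFI is estimated on the test set
mdi = pd.Series(best_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
perm = permutation_importance(best_rf, X_test, y_test, n_repeats=10, random_state=42)
pfi = pd.Series(perm.importances_mean, index=X_test.columns).sort_values(ascending=False)
print(mdi.head(), pfi.head(), sep="\n")

# AdaBoost: tune the number of estimators and the learning rate
ada = GridSearchCV(AdaBoostRegressor(random_state=42),
                   {"n_estimators": [50, 200, 500], "learning_rate": [0.05, 0.1, 1.0]},
                   scoring="neg_root_mean_squared_error", cv=5).fit(X_train, y_train)
rmse_ada = mean_squared_error(y_test, ada.best_estimator_.predict(X_test)) ** 0.5

# XGBoost: tune depth, learning rate and number of boosting rounds
xgb = GridSearchCV(XGBRegressor(random_state=42),
                   {"n_estimators": [200, 500], "max_depth": [3, 6], "learning_rate": [0.05, 0.1]},
                   scoring="neg_root_mean_squared_error", cv=5).fit(X_train, y_train)
rmse_xgb = mean_squared_error(y_test, xgb.best_estimator_.predict(X_test)) ** 0.5
print(f"AdaBoost test RMSE: {rmse_ada:.3f}, XGBoost test RMSE: {rmse_xgb:.3f}")

# The same MDI/PFI pattern applies to ada.best_estimator_ and xgb.best_estimator_,
# since both expose feature_importances_ and work with permutation_importance.
```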
D. Consensual Aggregation and Stacking
Build consensual aggregators and stacking models, then report their test performances.
Compare to the previous models. Conclude.
# To do
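A stacking sketch using scikit-learn's StackingRegressor, reusing the training split from the earlier sketches. The consensual aggregation step is method-specific and is not provided by scikit-learn; the pycobra package or a hand-written combination scheme is one possible starting point, mentioned here only as an assumption about the intended method.

```python
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Stack two base learners behind a linear meta-learner fitted on out-of-fold predictions
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(random_state=42, n_jobs=-1)),
        ("xgb", XGBRegressor(random_state=42)),
    ],
    final_estimator=RidgeCV(),
    cv=5,
).fit(X_train, y_train)

rmse_stack = mean_squared_error(y_test, stack.predict(X_test)) ** 0.5
print(f"Stacking test RMSE: {rmse_stack:.3f}")
```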
E. Neural Network.
Design a neural network to predict the target on the testing data and report its test RMSE.
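One minimal sketch with scikit-learn's MLPRegressor (a Keras or PyTorch network would work equally well); the architecture is an arbitrary assumption.

```python
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Standardize the inputs, then fit a small fully connected network
nn = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=42),
).fit(X_train, y_train)

rmse_nn = mean_squared_error(y_test, nn.predict(X_test)) ** 0.5
print(f"Neural network test RMSE: {rmse_nn:.3f}")
```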
Stroke, also known as a cerebrovascular accident (CVA), occurs when blood flow to a part of the brain is interrupted or reduced, depriving brain tissue of oxygen and nutrients. This dataset contains information such as age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, and body mass index (BMI). The goal is to use this data to build predictive models that can help identify individuals at high risk of stroke, enabling early intervention and potentially saving lives. The dataset is highly imbalanced, so you may face challenges in building a model; random sampling and weighting methods may be considered. For more information, see: Kaggle Stroke Dataset.
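A minimal sketch of the two strategies mentioned above (class weighting and random oversampling). The file name healthcare-dataset-stroke-data.csv and the column names id, bmi, and stroke follow the usual Kaggle release and are assumptions to adjust against the actual data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# File and column names are assumptions about the Kaggle release of this dataset
stroke = pd.read_csv("healthcare-dataset-stroke-data.csv")
stroke["bmi"] = stroke["bmi"].fillna(stroke["bmi"].median())
X = pd.get_dummies(stroke.drop(columns=["id", "stroke"]), drop_first=True)
y = stroke["stroke"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Strategy 1: class weighting penalizes errors on the rare positive class more heavily
clf_w = RandomForestClassifier(class_weight="balanced", random_state=42, n_jobs=-1).fit(X_tr, y_tr)
print(classification_report(y_te, clf_w.predict(X_te)))

# Strategy 2: random oversampling of the minority class, applied to the training set only
train = X_tr.copy()
train["stroke"] = y_tr.values
majority, minority = train[train["stroke"] == 0], train[train["stroke"] == 1]
balanced = pd.concat([majority, minority.sample(n=len(majority), replace=True, random_state=42)])
clf_os = RandomForestClassifier(random_state=42, n_jobs=-1).fit(
    balanced.drop(columns=["stroke"]), balanced["stroke"]
)
print(classification_report(y_te, clf_os.predict(X_te)))
```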