# Lab 7 - Ensemble Learning
Course: Advanced Machine Learning
Lecturer: Dr. Sothea HAS
Objective: Ensemble learning methods combine several base learners to enhance their performance. In this lab, you will apply each ensemble learning method to real datasets and analyze its sensitivity to the method's key hyperparameters. Moreover, feature importances will be computed from each model.
The Jupyter Notebook for this TP can be downloaded here: TP7_Ensemble_Learning.
## 1. Auto-MPG Dataset
The dataset is downloaded from the UCI Machine Learning Repository. The data concerns city-cycle fuel consumption in miles per gallon, to be predicted from 3 multivalued discrete and 5 continuous attributes (Quinlan, 1993).
Load the dataset from Kaggle using the following link: Auto-MPG dataset.
### A. Overview and Univariate Analysis
- Check the dimension of the dataset and fix columns with the wrong data type.
- What's wrong with the column `horsepower`? Solve it properly.
- Perform univariate analysis aimed at understanding the individual columns of the data and detecting the following problems (see the sketch after this list):
- Outliers
- Duplications
- Missing data.
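A minimal sketch of these checks with pandas, assuming the Kaggle CSV is saved locally as `auto-mpg.csv` (a hypothetical path; adjust it and the column names to your download):

```python
# A minimal sketch, assuming the Kaggle CSV is saved locally as
# "auto-mpg.csv" (hypothetical path; adjust to your download).
import pandas as pd

df = pd.read_csv("auto-mpg.csv")
print(df.shape)    # dimension of the dataset
print(df.dtypes)   # spot columns with the wrong data type

# "horsepower" is typically read as object because missing values are
# coded as "?"; coercing to numeric turns those entries into NaN.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

print(df.duplicated().sum())  # duplications
print(df.isna().sum())        # missing data
print(df.describe())          # ranges and quantiles help flag outliers
```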
### B. Bivariate Analysis
- Plot a pairplot of the quantitative columns. Take note of the most promising predictors for the target `MPG` (see the sketch after this list).
- Is `origin` useful for predicting the target `MPG`?
- Preprocess the inputs for model development.
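A short plotting sketch for the first two points, reusing the cleaned `df` from part A; the column names below are assumptions to adjust to your file:

```python
# A short sketch, reusing the cleaned df from part A; the column names
# below are assumptions to adjust to your file.
import seaborn as sns
import matplotlib.pyplot as plt

quant_cols = ["mpg", "cylinders", "displacement", "horsepower",
              "weight", "acceleration"]
sns.pairplot(df[quant_cols].dropna())
plt.show()

# Well-separated mpg distributions across origin levels would suggest
# that origin is informative for the target.
sns.boxplot(data=df, x="origin", y="mpg")
plt.show()
```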
### C. Random Forest: OOB vs Cross-Validation
- Split the dataset into \(80\%-20\%\) training-testing data using `random_state = 42`.
- Build a random forest model with its default settings, then compute the free Out-Of-Bag (OOB) error or score obtained by the forest (see `model.oob_score_`). Compute suitable metrics on the test data and store them in a data frame.
- Fine-tune the key hyperparameters of the random forest (`max_depth`, `max_features`, `n_estimators`, ...), then evaluate its CV error or score and compare it to the corresponding OOB criterion for each combination of hyperparameters. To achieve this, follow these steps (a sketch follows this list):
  - Initialize: set `oob_score=True` when initializing your `RandomForestRegressor`.
  - OOB Error: after fitting the model, access the OOB score using the `.oob_score_` attribute of the fitted model object, and convert it to an error (e.g., \(1 - \text{score}\)).
  - CV Error: for the same model configuration, calculate the mean cross-validation score using the `cross_val_score` function (or `cross_validate`) with \(k = 5\) folds, and convert the mean score to an error.
  - Visualize: plot the OOB error and the CV error on the same graph for each hyperparameter, and analyze the gap between the two curves.
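Here is a minimal sketch of this comparison over `n_estimators` (the same loop works for `max_depth` or `max_features`), assuming `X_train` and `y_train` come from the 80%-20% split above:

```python
# A sketch of the OOB-vs-CV comparison over n_estimators; the same loop
# works for max_depth or max_features. Assumes X_train, y_train come
# from the 80%-20% split above.
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

n_grid = [50, 100, 200, 400]
oob_err, cv_err = [], []
for n in n_grid:
    rf = RandomForestRegressor(n_estimators=n, oob_score=True, random_state=42)
    rf.fit(X_train, y_train)
    oob_err.append(1 - rf.oob_score_)  # OOB error = 1 - OOB R^2
    scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="r2")
    cv_err.append(1 - scores.mean())   # CV error = 1 - mean CV R^2

plt.plot(n_grid, oob_err, marker="o", label="OOB error")
plt.plot(n_grid, cv_err, marker="s", label="5-fold CV error")
plt.xlabel("n_estimators")
plt.ylabel("error (1 - R^2)")
plt.legend()
plt.show()
```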
Compare the test metrics of the three models:
- The default random forest (second point).
- The model with the best OOB performance.
- The model with the best CV performance.
Compute and plot the mean decrease in impurity (MDI) measure and the permutation feature importances of the last two models (see the sketch below).
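A sketch of both importance measures for one fitted model; the names `rf_best`, `X_test`, and `y_test` are assumptions, and the step should be repeated for each of the two models:

```python
# A sketch of both importance measures for one fitted model; rf_best,
# X_test and y_test are assumed names. Repeat for each of the two models.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.inspection import permutation_importance

mdi = pd.Series(rf_best.feature_importances_, index=X_train.columns)
perm = permutation_importance(rf_best, X_test, y_test,
                              n_repeats=10, random_state=42)
perm_imp = pd.Series(perm.importances_mean, index=X_test.columns)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
mdi.sort_values().plot.barh(ax=axes[0], title="Mean decrease in impurity")
perm_imp.sort_values().plot.barh(ax=axes[1], title="Permutation importance")
plt.tight_layout()
plt.show()
```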
### D. Extra-Trees: OOB vs Cross-Validation
Repeat the previous questions of part (C), from the second point on, using the Extra-Trees model from the same module. Compare the results to the Random Forest.
Compute and plot the mean decrease in impurity measure and the permutation feature importances of the last two Extra-Trees models (see the note below).
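One detail worth noting when repeating part (C): unlike `RandomForestRegressor`, `ExtraTreesRegressor` uses `bootstrap=False` by default, so OOB scoring requires turning bootstrapping on. A minimal sketch (hyperparameter values are arbitrary):

```python
# Extra-Trees uses bootstrap=False by default, so OOB scoring requires
# bootstrap=True; the rest of part (C) carries over unchanged.
from sklearn.ensemble import ExtraTreesRegressor

et = ExtraTreesRegressor(n_estimators=200, bootstrap=True,
                         oob_score=True, random_state=42)
et.fit(X_train, y_train)
print("OOB R^2:", et.oob_score_)
```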
### E. XGBoost: OOB vs Cross-Validation
- Repeat the previous questions with XGBoost (a sketch follows this list).
- Compute and draw both feature importances for the last two XGBoost models.
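Since `XGBRegressor` has no OOB mechanism, the CV error naturally plays the role of the out-of-sample criterion while tuning. A hedged sketch using the scikit-learn interface of `xgboost` (hyperparameter values are arbitrary):

```python
# XGBRegressor has no OOB mechanism, so the 5-fold CV error serves as the
# out-of-sample criterion while tuning; hyperparameter values are arbitrary.
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

xgb = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1,
                   random_state=42)
scores = cross_val_score(xgb, X_train, y_train, cv=5, scoring="r2")
print("5-fold CV error:", 1 - scores.mean())

xgb.fit(X_train, y_train)
# Built-in importances; permutation importances work exactly as in part (C).
print(dict(zip(X_train.columns, xgb.feature_importances_)))
```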
### F. GradientCOBRA: Fine-tune the hyperparameter \(h\)
- Build a `GradientCOBRA` model by fine-tuning the most suitable smoothing hyperparameter \(h > 0\) (a from-scratch sketch follows this list).
- Compute its test performance and compare it to the previous models.
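The `gradientcobra` package by the paper's author tunes \(h\) by gradient descent; check its documentation for the official API. As a from-scratch illustration of the underlying idea in Has (2023), here is a minimal Gaussian-kernel consensual aggregation with a simple grid over \(h\); the base learners and split sizes are arbitrary illustrative choices, and in practice \(h\) should be selected on a validation set, not the test set:

```python
# A from-scratch sketch of Gaussian-kernel consensual aggregation with a
# grid over h; the official gradientcobra package instead optimizes h by
# gradient descent, so check its documentation for the real API.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Split the training data: one half trains the base learners, the other
# half (the "aggregation set") anchors the kernel weights.
X_tr, X_ag, y_tr, y_ag = train_test_split(X_train, y_train,
                                          test_size=0.5, random_state=42)
base = [Ridge().fit(X_tr, y_tr),
        RandomForestRegressor(random_state=42).fit(X_tr, y_tr)]
R_ag = np.column_stack([m.predict(X_ag) for m in base])
y_ag = np.asarray(y_ag)

def aggregate(R_new, h):
    # Gaussian kernel weights on distances between base-learner predictions
    d2 = ((R_new[:, None, :] - R_ag[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2 * h ** 2))
    return W @ y_ag / (W.sum(axis=1) + 1e-12)  # guard against zero weights

R_test = np.column_stack([m.predict(X_test) for m in base])
for h in [0.1, 0.5, 1.0, 2.0, 5.0]:  # in practice, tune h on validation data
    mse = mean_squared_error(y_test, aggregate(R_test, h))
    print(f"h = {h}: test MSE = {mse:.3f}")
```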
## 2. Kaggle Cybersecurity Intrusion Detection Dataset
This Kaggle Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. You may find a detailed explanation of the dataset, including its structure, feature importance, possible analysis approaches, and how it can be used for machine learning, in the provided link.
Question: import the dataset, analyze it, and preprocess it properly. Build your best model to detect intrusion activities in the dataset. Report the test performance and the features that seem to influence the model's performance the most (a starting-point sketch follows).
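A hedged starting point, assuming the Kaggle file is saved as `cybersecurity_intrusion_data.csv` and that the binary label column is named `attack_detected` (both hypothetical; check the dataset's documentation):

```python
# A hedged starting point: the file path and the label column name
# "attack_detected" are assumptions to verify against the dataset docs.
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("cybersecurity_intrusion_data.csv")
y = df.pop("attack_detected")  # hypothetical label column name
X = df

# One-hot encode categorical columns, pass numeric columns through.
pre = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_include=object)),
    remainder="passthrough")
clf = make_pipeline(pre, RandomForestClassifier(random_state=42))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```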
## References
\(^{\text{📚}}\) Bagging predictors, Breiman (1996).
\(^{\text{📚}}\) The strength of weak learnability, Schapire (1990).
\(^{\text{📚}}\) COBRA: A combined regression strategy, Biau et al. (2016).
\(^{\text{📚}}\) Gradient COBRA: A kernel-based consensual aggregation for regression, Has (2023).
\(^{\text{📚}}\) Aggregation using input-output trade-off, Fischer & Mougeot (2019).
\(^{\text{📚}}\) Super Learner, Van der Laan et al. (2007).