Lab7 - Ensemble Learning

Course: Advanced Machine Learning
Lecturer: Dr. Sothea HAS


Objective: Ensemble learning methods combine several base learners to enhance predictive performance. In this lab, you will apply each ensemble learning method to real datasets and analyze its sensitivity to the method's key hyperparameters. Moreover, feature importances will be computed from each model.

The Jupyter Notebook for this TP can be downloaded here: TP7_Ensemble_Learning.


1. Auto-MPG Dataset

The dataset is downloaded from the UCI Machine Learning Repository. The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes (Quinlan, 1993).

Load the dataset from Kaggle using the following link: Auto-MPG dataset.

# To do

A. Overview and Univariate Analysis:

  • Check the dimension of the dataset and modify columns with wrong data type.
  • What’s wrong with the column horsepower? Properly solve it.
  • Perform univariate analysis to understand the individual columns of the data and detect the following problems:
    • Outliers
    • Duplications
    • Missing data.
# To do
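A hint for the horsepower question: in the raw Auto-MPG file, missing horsepower values are encoded as `'?'`, which makes pandas read the whole column as `object` instead of a number. The sketch below reproduces the problem on a tiny inline sample (the column names mirror the real dataset, but the rows are made up) and fixes it with `pd.to_numeric(errors="coerce")` followed by median imputation:

```python
from io import StringIO

import pandas as pd

# Tiny inline sample mimicking the Auto-MPG file, where a missing
# horsepower value is encoded as '?' (hypothetical rows, real column names).
csv = StringIO(
    "mpg,cylinders,horsepower,weight\n"
    "18.0,8,130,3504\n"
    "25.0,4,?,2046\n"
    "16.0,8,150,3433\n"
)
df = pd.read_csv(csv)
print(df["horsepower"].dtype)  # object, because of the '?' entry

# Coerce to numeric: '?' becomes NaN, which can then be imputed
# (here with the median) or dropped.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())
print(df["horsepower"].tolist())
```

The same two lines apply unchanged to the full dataset once it is loaded from Kaggle.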

B. Bivariate Analysis

  • Plot a pairplot of the quantitative columns. Take note of the most promising predictors for the target MPG.
  • Is origin useful for predicting the target MPG?
  • Preprocess the inputs for model development.
# To do
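For the preprocessing step, note that origin is a nominal code (1 = USA, 2 = Europe, 3 = Japan in the original dataset), so it should be one-hot encoded rather than fed to the model as a raw integer. A minimal sketch on a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical two-column frame standing in for the real inputs.
df = pd.DataFrame({
    "weight": [3504, 2046, 2372],
    "origin": [1, 3, 2],
})

# One-hot encode the nominal 'origin' column; quantitative columns pass through.
X = pd.get_dummies(df, columns=["origin"], prefix="origin")
print(X.columns.tolist())
# ['weight', 'origin_1', 'origin_2', 'origin_3']
```

Tree ensembles are insensitive to feature scaling, so standardization is optional here, but the categorical encoding still matters.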

C. Random Forest: OOB vs Cross Validation

  • Split the dataset into \(80\%-20\%\) training-testing data using random_state = 42.

  • Build a random forest model with its default settings. Then compute the Out-Of-Bag (OOB) error or score that the forest provides for free (see model.oob_score_). Compute suitable metrics on the test data and store them in a data frame.

  • Fine-tune the key hyperparameters of the random forest (max_depth, max_features, n_estimators…), then evaluate its CV error or score. Compare it to the corresponding OOB criterion for each combination of the hyperparameters. To achieve this, follow these steps:

    1. Initialize: Set oob_score=True when initializing your RandomForestRegressor.
    2. OOB Error: After fitting the model, access the OOB score via the .oob_score_ attribute of the fitted model, and convert it to an error (e.g., 1 − score).
    3. CV Error: For the same model configuration, compute the mean cross-validation score using the cross_val_score function (or cross_validate) with k = 5 folds, then convert the mean score to an error.
    4. Visualize: Plot the OOB error and the CV error on the same graph for each hyperparameter, and analyze the gap between the two curves.
  • Compare the test metrics of the three models:

    • The default random forest (second point).
    • The model with the best OOB performance.
    • The model with the best CV performance.
  • Compute and plot the mean decrease in impurity (MDI) and permutation feature importances of the last two models.

# To do
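The OOB-vs-CV loop in steps 1–4 can be sketched as follows. The example runs on a synthetic regression problem (a stand-in for the preprocessed Auto-MPG inputs) and varies only n_estimators; the same pattern applies to max_depth and max_features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed Auto-MPG inputs.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)

oob_errors, cv_errors = {}, {}
for n in [50, 100, 200]:
    rf = RandomForestRegressor(n_estimators=n, oob_score=True, random_state=42)
    rf.fit(X, y)
    # The OOB R^2 comes for free from bagging; turn it into an error.
    oob_errors[n] = 1.0 - rf.oob_score_
    # 5-fold CV R^2 for the same configuration, also turned into an error.
    cv_errors[n] = 1.0 - cross_val_score(rf, X, y, cv=5, scoring="r2").mean()

print(oob_errors)
print(cv_errors)
```

For the visualization step, plot both dictionaries against n (e.g., with matplotlib) and compare the gap: OOB needs a single fit per configuration, while CV needs k fits, so OOB is a much cheaper estimate to tune against.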

D. Extra-trees: OOB vs Cross Validation

  • Repeat the previous questions of part (C), from the second point onward, using the ExtraTrees model from the same module. Compare the results to Random Forest.

  • Compute and plot the mean decrease in impurity and permutation feature importances of the last two Extra-Trees models.

# To do
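One pitfall worth flagging: unlike RandomForestRegressor, ExtraTreesRegressor uses bootstrap=False by default (each tree sees the full sample), so there are no out-of-bag observations unless you enable bootstrapping explicitly. A minimal sketch, again on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for the preprocessed Auto-MPG inputs.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)

# bootstrap=True is required here: Extra-Trees does not bootstrap by default,
# and oob_score=True raises an error without it.
et = ExtraTreesRegressor(n_estimators=100, bootstrap=True, oob_score=True,
                         random_state=42)
et.fit(X, y)

print(1.0 - et.oob_score_)            # OOB error
print(et.feature_importances_.sum())  # MDI importances are normalized to 1
```

With bootstrapping enabled, the OOB-vs-CV comparison loop from part (C) carries over unchanged.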

E. XGBoost: OOB vs Cross Validation

  • Repeat the previous questions with XGBoost.

  • Compute and draw both feature importances for the last two XGBoost models.

# To do

F. GradientCOBRA: Fine-tune the hyperparameter \(h\)

  • Build a GradientCOBRA model by fine-tuning the most suitable smoothing hyperparameter \(h>0\).
  • Compute its test performance and compare it to the previous models.
# To do

2. Kaggle Cybersecurity Intrusion Detection Dataset

This Kaggle Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. You may find a detailed explanation of the dataset, including its structure, feature importance, possible analysis approaches, and how it can be used for machine learning, at the provided link.

Question: Import the dataset, analyze the data, and preprocess it properly. Build your best model to detect intrusion activities in the dataset. Report the test performance and the features that seem to influence the model's predictions the most.

# To do
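A skeleton for this question, using a synthetic binary classification problem as a stand-in for the preprocessed Kaggle features (the real dataset needs the same encoding/cleaning steps as part 1 first). It fits a random forest, reports test metrics, and ranks features by permutation importance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic 'intrusion vs normal' stand-in; replace X, y with the
# preprocessed Kaggle features and the intrusion label.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_tr, y_tr)

# Test performance: precision/recall/F1 per class.
print(classification_report(y_te, clf.predict(X_te)))

# Which features drive the detector? Permutation importance on the test set.
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=42)
ranking = perm.importances_mean.argsort()[::-1]
print("Most influential features (by index):", ranking[:3])
```

For intrusion detection the classes are usually imbalanced, so prefer precision/recall/F1 (or ROC-AUC) over plain accuracy, and consider stratified splits as above.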

References

\(^{\text{📚}}\) Bagging predictors, Breiman (1996).
\(^{\text{📚}}\) The strength of weak learnability, Robert E. Schapire (1990).
\(^{\text{📚}}\) COBRA: A combined regression strategy, Biau et al. (2016).
\(^{\text{📚}}\) Gradient COBRA: A kernel-based consensual aggregation for regression, Has (2023).
\(^{\text{📚}}\) Aggregation using input–output trade-off, Fischer & Mougeot (2019).
\(^{\text{📚}}\) Super Learner, M. J. Van der Laan (2007).