Objective: Ensemble learning methods combine several base learners to enhance predictive performance. In this lab, you will apply each ensemble learning method to real datasets and analyze its sensitivity to the method's key hyperparameters. You will also compute feature importances from each model.
This dataset contains bag-of-words style features (percentages and counts of words/special characters) extracted from spam and non-spam emails. Your task is to build an email spam filter that identifies spam emails.
Report the dimensions of the dataset and identify its qualitative and quantitative columns.
Is predicting the type of email a regression or a classification problem? Is the dataset well balanced?
Does the dataset contain any missing values? If so, handle them.
Does the dataset contain any duplicated rows? If so, handle them.
With 57 input columns, it is not practical to inspect outliers column by column with boxplots; use the z-score method to detect and handle outliers, if there are any (a sketch is given after the To do marker below).
# To do
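A minimal preprocessing sketch is given below. The file name spambase.csv and the target column name spam are assumptions; adapt them to the actual dataset.

```python
# Minimal preprocessing sketch; file name and target column "spam" are assumptions.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("spambase.csv")                     # assumed file name

print(df.shape)                                      # dimensions
print(df.dtypes.value_counts())                      # qualitative vs quantitative columns
print(df["spam"].value_counts(normalize=True))       # class balance

df = df.dropna()                                     # handle missing values, if any
df = df.drop_duplicates()                            # handle duplicated rows, if any

# z-score outlier handling: keep rows whose features all lie within 3 standard deviations
z = np.abs(stats.zscore(df.drop(columns=["spam"])))
df = df[(z < 3).all(axis=1)]
print(df.shape)
```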
B. Random Forest: OOB vs Cross Validation
Split the dataset into \(80\%-20\%\) training and testing sets using random_state = 42.
Build a random forest model with its default settings. Then compute the Out-Of-Bag (OOB) error or score that the forest provides for free (see model.oob_score_).
Compute the following metrics on the test data: Accuracy, Precision, Recall and F1-score. Store them in a data frame.
Fine-tune the hyperparameters of the random forest, then evaluate its CV error or score. Compare it to the OOB score from the second point.
Compute the four metrics of the fine-tuned random forest model on the test data and compare them to those of the default model (a sketch of this workflow follows the To do marker below).
Compute and plot the mean decrease in impurity (MDI) and the permutation feature importances of the better of the two models (see the second sketch below).
# To do
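A possible sketch of the OOB vs CV workflow is shown below. It assumes the cleaned df and target column spam from part (A), and the parameter grid is only illustrative.

```python
# Sketch for part (B): default vs fine-tuned random forest, OOB vs CV score.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X = df.drop(columns=["spam"])
y = df["spam"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Default random forest; oob_score=True gives the "free" out-of-bag estimate
rf = RandomForestClassifier(oob_score=True, random_state=42).fit(X_train, y_train)
print("OOB score:", rf.oob_score_)

def test_metrics(model, name):
    """Evaluate a fitted model on the held-out test set."""
    pred = model.predict(X_test)
    return pd.Series({"Accuracy": accuracy_score(y_test, pred),
                      "Precision": precision_score(y_test, pred),
                      "Recall": recall_score(y_test, pred),
                      "F1-score": f1_score(y_test, pred)}, name=name)

# Fine-tuning with cross-validation (the grid below is only an example)
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [100, 300, 500],
                                "max_features": ["sqrt", "log2"]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("CV score:", grid.best_score_, "vs OOB score:", rf.oob_score_)

results = pd.DataFrame([test_metrics(rf, "default RF"),
                        test_metrics(grid.best_estimator_, "tuned RF")])
print(results)
```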
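The two importance measures could then be computed and plotted as sketched below; grid.best_estimator_ is assumed to be the better of the two models (swap in the default forest if it performs better).

```python
# Sketch of the two feature-importance measures for the better random forest.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.inspection import permutation_importance

best_rf = grid.best_estimator_   # assumption: the tuned model performed better

# Mean decrease in impurity (computed on the training data during fitting)
mdi = pd.Series(best_rf.feature_importances_, index=X.columns).sort_values()

# Permutation importance (computed here on the test data)
perm = permutation_importance(best_rf, X_test, y_test, n_repeats=10, random_state=42)
perm = pd.Series(perm.importances_mean, index=X.columns).sort_values()

fig, axes = plt.subplots(1, 2, figsize=(12, 10))
mdi.plot.barh(ax=axes[0], title="Mean decrease in impurity")
perm.plot.barh(ax=axes[1], title="Permutation importance (test set)")
plt.tight_layout()
plt.show()
```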
C. Extra-trees: OOB vs Cross Validation
Repeat the previous questions of part (B), starting from the second point, using the ExtraTrees model from the same module. Compare the results to those of the Random Forest (a brief sketch follows the To do marker below).
Compute and plot the mean decrease in impurity and the permutation feature importances of the best Extra-Trees model.
# To do
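One point worth noting: scikit-learn's ExtraTreesClassifier uses bootstrap=False by default, so bootstrapping has to be switched on to obtain an OOB score. A minimal sketch, reusing the split from the part (B) sketch:

```python
# Extremely randomized trees: bootstrap=True is required for an OOB score.
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(bootstrap=True, oob_score=True, random_state=42)
et.fit(X_train, y_train)
print("Extra-Trees OOB score:", et.oob_score_)

# Fine-tuning, test metrics, and both feature importances then follow exactly
# as in part (B), with ExtraTreesClassifier in place of RandomForestClassifier.
```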
D. Boosting: Feature Importances
Build and fine-tune an AdaBoost model (AdaBoostClassifier from sklearn.ensemble) using the CV technique, as sketched after the To do marker below.
Compute both feature importances for this model and report its test performances.
# To do
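A possible sketch, reusing the split and the test_metrics helper from the part (B) sketch; the grid values are illustrative only.

```python
# Sketch: fine-tune AdaBoost with CV, then compute both importances and test metrics.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import permutation_importance

ada_grid = GridSearchCV(AdaBoostClassifier(random_state=42),
                        param_grid={"n_estimators": [50, 200, 500],
                                    "learning_rate": [0.1, 0.5, 1.0]},
                        cv=5, scoring="accuracy")
ada_grid.fit(X_train, y_train)
best_ada = ada_grid.best_estimator_

# Impurity-based and permutation importances, plus test metrics as in part (B)
mdi_ada = pd.Series(best_ada.feature_importances_, index=X.columns)
perm_ada = permutation_importance(best_ada, X_test, y_test, n_repeats=10, random_state=42)
print(test_metrics(best_ada, "tuned AdaBoost"))
```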
E. XGBoost:
Build and fine-tune the hyperparameters of the XGBoost model from the xgboost package (a sketch is given below).
Compute both feature importances for the model and report the test performances.
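A possible sketch using xgboost's scikit-learn wrapper XGBClassifier; the split and the test_metrics helper from the part (B) sketch are assumed, and the grid is illustrative.

```python
# Sketch: fine-tune XGBoost with CV, then compute both importances and test metrics.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import permutation_importance

xgb_grid = GridSearchCV(XGBClassifier(eval_metric="logloss", random_state=42),
                        param_grid={"n_estimators": [100, 300],
                                    "max_depth": [3, 6],
                                    "learning_rate": [0.05, 0.1, 0.3]},
                        cv=5, scoring="accuracy")
xgb_grid.fit(X_train, y_train)
best_xgb = xgb_grid.best_estimator_

mdi_xgb = pd.Series(best_xgb.feature_importances_, index=X.columns)
perm_xgb = permutation_importance(best_xgb, X_test, y_test, n_repeats=10, random_state=42)
print(test_metrics(best_xgb, "tuned XGBoost"))
```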
Stroke, also known as a cerebrovascular accident (CVA), occurs when blood flow to a part of the brain is interrupted or reduced, depriving brain tissue of oxygen and nutrients. This dataset contains information such as age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, and body mass index (BMI). The goal is to use this data to build predictive models that can help identify individuals at high risk of stroke, enabling early intervention and potentially saving lives. The dataset is highly imbalanced, so you may face challenges in building a model; random sampling and weighting methods may be considered. For more information, see: Kaggle Stroke Dataset.
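Two common ways to account for the imbalance are sketched below: class weighting in scikit-learn ensembles and positive-class weighting in XGBoost. The variable name y_train (the stroke training labels after a split) is an assumption; random over- or under-sampling of the training set, for example with the imbalanced-learn package, is another option.

```python
# Sketch of two weighting options for the imbalanced stroke data; y_train is the
# (assumed) training target of the stroke dataset after an 80%-20% split.
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# 1) Class weighting in scikit-learn ensembles
rf_bal = RandomForestClassifier(class_weight="balanced", random_state=42)

# 2) Positive-class weighting in XGBoost: ratio of negative to positive training samples
ratio = (y_train == 0).sum() / (y_train == 1).sum()
xgb_bal = XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss", random_state=42)
```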