Objective: In this lab, we will learn how to implement a decision tree model and tune its hyperparameters, mostly those related to tree complexity and size, using cross-validation. We will implement decision trees for both types of problems: regression and classification.
You can work directly with Google Colab here: Lab4_tree.ipynb.
1. Abalone Dataset
Abalone is a popular seafood in Japanese and European cuisine. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a boring and time-consuming task. Other measurements that are easier to obtain, such as physical dimensions and weights, are therefore used to predict the age. This section aims to predict the Rings of an abalone from its physical measurements. Read and load the data from Kaggle: Abalone dataset.
# %pip install kagglehub  # if you have not installed the "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

# Import data
import pandas as pd
data = pd.read_csv(path + "/abalone.csv")
data.head()
   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7
A. Overview of the dataset.
Using the preprocessing steps from the previous lab, load the dataset and check its basic information, including the number of rows and columns, data types, missing values, statistical summary, etc.
Clean the data if necessary and get it ready for modeling.
# To do
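As a possible starting point for the checks listed above, the sketch below runs them on a tiny stand-in frame built from the five preview rows, so that it executes without a download. The helper name `overview` and the choice of one-hot encoding for `Sex` are assumptions made for illustration, not steps prescribed by the lab.

```python
import pandas as pd

# Stand-in frame with the same columns as abalone.csv (the five preview rows only).
data = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Length": [0.455, 0.350, 0.530, 0.440, 0.330],
    "Diameter": [0.365, 0.265, 0.420, 0.365, 0.255],
    "Height": [0.095, 0.090, 0.135, 0.125, 0.080],
    "Whole weight": [0.5140, 0.2255, 0.6770, 0.5160, 0.2050],
    "Shucked weight": [0.2245, 0.0995, 0.2565, 0.2155, 0.0895],
    "Viscera weight": [0.1010, 0.0485, 0.1415, 0.1140, 0.0395],
    "Shell weight": [0.150, 0.070, 0.210, 0.155, 0.055],
    "Rings": [15, 7, 9, 10, 7],
})

def overview(df):
    """Hypothetical helper: gather shape, dtypes, missing and duplicate counts."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }

info = overview(data)
print(info["shape"])
print(info["missing"])
print(data.describe())  # statistical summary of the numeric columns

# One-hot encode the categorical column so that a scikit-learn tree can use it.
encoded = pd.get_dummies(data, columns=["Sex"], dtype=int)
print(encoded.columns.tolist())
```

On the full dataset, the same calls apply to the `data` frame loaded earlier; it may also be worth checking for implausible values such as non-positive measurements before modeling.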
B. Model development.
Split the dataset into \(80\%-20\%\) training-testing data using random_state = 42.
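The split can be sketched with scikit-learn's `train_test_split`. The frame `demo` below is synthetic and only borrows a few column names from the abalone data; it is an assumption used to keep the example runnable on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, two features, one target.
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "Length": rng.uniform(0.1, 0.8, n),
    "Diameter": rng.uniform(0.1, 0.6, n),
    "Rings": rng.integers(1, 30, n),
})

X = demo.drop(columns="Rings")  # features
y = demo["Rings"]               # target

# test_size=0.2 gives the 80%-20% split; random_state=42 makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```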
Build a regression tree, use it to predict the testing data, and report its RMSE and MAPE performance metrics. Use the default hyperparameters of the model.
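A minimal sketch of this step, assuming scikit-learn's `DecisionTreeRegressor` and a synthetic positive target in place of `Rings`. Setting `random_state` on the tree is an addition for reproducibility; every other hyperparameter is left at its default.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the abalone features and target (assumption, not the real data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 5 + 10 * X[:, 0] + rng.uniform(-1, 1, size=300)  # strictly positive, so MAPE is well defined

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tree = DecisionTreeRegressor(random_state=42)  # default depth, leaf size, etc.
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mape = mean_absolute_percentage_error(y_test, y_pred)  # returned as a fraction, not a percentage
print(f"RMSE: {rmse:.3f}  MAPE: {mape:.2%}")
```

A fully grown default tree usually fits the training data almost perfectly, so comparing training and testing errors is a quick way to see how much it overfits.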
Fine-tune the following hyperparameters of the previous tree model using cross-validation:
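One common way to run such a search is scikit-learn's `GridSearchCV`. In the sketch below, the grid over `max_depth` and `min_samples_leaf`, the 5-fold setting, and the synthetic data are all assumptions chosen for illustration; substitute the hyperparameters and values required by the lab.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption, not the abalone dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 5 + 10 * X[:, 0] + rng.uniform(-1, 1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative grid over two complexity-related hyperparameters (assumed values).
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,                                   # 5-fold cross-validation on the training set only
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign flip
)
search.fit(X_train, y_train)

cv_rmse = -search.best_score_  # undo the sign flip
test_rmse = np.sqrt(mean_squared_error(y_test, search.best_estimator_.predict(X_test)))
print(search.best_params_)
print(f"CV RMSE: {cv_rmse:.3f}  Test RMSE: {test_rmse:.3f}")
```

The test set is touched only once, after the search has finished, so the reported test RMSE is not biased by the tuning.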
Reload the heart disease dataset introduced in Lab2.
Build a decision tree to classify patients by following the same procedure as in the previous section, except that this time the evaluation metrics are the following:
Accuracy
Recall
Precision
F1-score.
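The four metrics listed above can be computed with functions from `sklearn.metrics`. The sketch below assumes a binary target and uses data generated by `make_classification` as a stand-in, since the heart disease dataset from Lab2 is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, not the heart disease dataset).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)  # default hyperparameters
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# By default, recall, precision and F1 treat class 1 as the positive class.
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The same dictionary can be built for a tuned tree, which makes a side-by-side comparison on the test set straightforward.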
Compare all the performance metrics on the test set and conclude.