TP4: Decision Tree Models

Course: ITM 390 004: Machine Learning
Lecturer: Sothea HAS, PhD

Objective: In this lab, we will learn how to implement a Decision Tree model and fine-tune its hyperparameters, mostly those related to tree complexity and size, using cross-validation. We will implement decision trees for both types of problems: regression and classification.

You can work directly with Google Colab here: Lab4_tree.ipynb.

1. Abalone Dataset

Abalone is a popular seafood in Japanese and European cuisine. However, the age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of Rings through a microscope, which is a boring and time-consuming task. Other measurements that are easier to obtain, such as physical dimensions and weights, are used to predict the age instead. This section aims at predicting the Rings of abalone from these physical measurements. Read and load the data from Kaggle: Abalone dataset.

# %pip install kagglehub   # if you have not installed "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

# Import data
import os
import pandas as pd
data = pd.read_csv(os.path.join(path, "abalone.csv"))
data.head()

	Sex	Length	Diameter	Height	Whole weight	Shucked weight	Viscera weight	Shell weight	Rings
0	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15
1	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7
2	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9
3	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10
4	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7

A. Overview of the dataset.

Using the preprocessing steps from the previous lab, load the dataset and check its basic information, including the number of rows and columns, data types, missing values, statistical summary, etc.
Clean the data if necessary. Get it ready for modeling.

# To do
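
As a hint, the overview steps listed above can be sketched as follows. This is a minimal sketch, not the expected solution: to stay self-contained it rebuilds `data` from the five rows displayed by `data.head()`, whereas in the lab you would run the same checks on the full DataFrame. The one-hot encoding of `Sex` is one possible choice, assumed here because scikit-learn trees require numeric inputs.

```python
import pandas as pd

# Stand-in for the loaded data: the five rows shown by data.head() above.
# In the lab, use the full `data` DataFrame read from abalone.csv instead.
data = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Length": [0.455, 0.350, 0.530, 0.440, 0.330],
    "Diameter": [0.365, 0.265, 0.420, 0.365, 0.255],
    "Height": [0.095, 0.090, 0.135, 0.125, 0.080],
    "Whole weight": [0.5140, 0.2255, 0.6770, 0.5160, 0.2050],
    "Shucked weight": [0.2245, 0.0995, 0.2565, 0.2155, 0.0895],
    "Viscera weight": [0.1010, 0.0485, 0.1415, 0.1140, 0.0395],
    "Shell weight": [0.150, 0.070, 0.210, 0.155, 0.055],
    "Rings": [15, 7, 9, 10, 7],
})

print(data.shape)               # (number of rows, number of columns)
print(data.dtypes)              # data type of each column
print(data.isna().sum())        # missing values per column
print(data.duplicated().sum())  # number of fully duplicated rows
print(data.describe())          # statistical summary of the numeric columns

# Assumed preprocessing choice: one-hot encode the categorical column "Sex"
X = pd.get_dummies(data.drop(columns="Rings"), columns=["Sex"])
y = data["Rings"]
print(X.columns.tolist())
```

On the full dataset, it is also worth looking at the minimum of each measurement in `data.describe()`, since a physical measurement equal to zero would point to a data-entry problem.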

B. Model development.

Split the dataset into \(80\%-20\%\) training-testing data using random_state = 42.
Build a Regression tree to predict the testing data and report its RMSE and MAPE performance metrics. Use the default hyperparameters of the model.
Fine-tune the following hyperparameters of the previous tree model using cross-validation:
- Important hyperparameters:
  - max_depth: [3, 5, 10, 15, 20, 30, 50]
  - min_samples_leaf: [1, 2, 5, 10, 20, 30]
- Moderately important hyperparameters:
  - min_samples_split: [2, 5, 10, 20, 50]
  - max_features: [None, 'sqrt', 'log2']
- Less important hyperparameters:
  - criterion: ['squared_error', 'friedman_mse', 'absolute_error']
Compute the test performance metrics: RMSE and MAPE. How does the tuned tree perform compared to the \(k\)-NN model from Lab3?
Report the optimal hyperparameters and compare the results of the two models in the following table:

Model	RMSE	MAPE
Tree default	…	…
Tree CV	…	…
kNN CV	…	…
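
The split and the default tree with its two metrics might look like the sketch below. It assumes scikit-learn and, to remain runnable on its own, uses synthetic data from `make_regression` as a stand-in for the cleaned abalone features and `Rings`; the numbers it prints therefore say nothing about the real dataset. Fixing `random_state` on the tree is an added choice for reproducibility, all other hyperparameters are left at their defaults.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in (assumption): replace X, y with the abalone features and Rings
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
y = y - y.min() + 1.0  # shift the target to be strictly positive, like Rings, so MAPE is well defined

# 80%-20% training-testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Regression tree with default hyperparameters
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# RMSE is the square root of the mean squared error; MAPE is returned as a fraction
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mape = mean_absolute_percentage_error(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, MAPE: {mape:.3%}")
```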

# To do
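
For the tuning step, one possible approach is scikit-learn's `GridSearchCV`. The sketch below is an illustration under assumptions: it searches only the two "important" hyperparameters to keep the run short, uses 5-fold cross-validation and RMSE as the selection score (neither is prescribed by the lab), and again relies on synthetic stand-in data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in (assumption): replace with the abalone training/testing data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Only the "important" hyperparameters; add min_samples_split, max_features
# and criterion as extra keys to search the full grid from the list above
param_grid = {
    "max_depth": [3, 5, 10, 15, 20, 30, 50],
    "min_samples_leaf": [1, 2, 5, 10, 20, 30],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,                                   # assumed number of folds
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)

# best_estimator_ is refitted on the whole training set by default
y_pred = search.best_estimator_.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Test RMSE:", test_rmse)
```

The full grid contains 7 × 6 × 5 × 3 × 3 = 1890 combinations, so with 5 folds it requires 9450 fits; searching it in stages or with `RandomizedSearchCV` are common ways to reduce the cost.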

2. Revisit `heart disease` dataset

Reload the heart disease dataset introduced in Lab2.
Build a decision tree to classify patients, following the same procedure as in the previous section, except that the evaluation metrics are now:
- Accuracy
- Recall
- Precision
- F1-score.
Compare all the performance metrics on the test set and conclude.
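
The four classification metrics can be computed as in the sketch below. It is a hedged illustration only: the heart disease data is not reproduced here, so a synthetic binary problem from `make_classification` stands in for it, and the stratified split is an optional addition. Note that for `DecisionTreeClassifier` the `criterion` values differ from the regression case (for example 'gini' and 'entropy').

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary stand-in (assumption): replace with the heart disease features and target
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 80%-20% split; stratify=y keeps the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Classification tree with default hyperparameters
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Recall, precision and F1 are computed for the positive class (label 1) by default
metrics = {
    "Accuracy": accuracy_score(y_test, y_pred),
    "Recall": recall_score(y_test, y_pred),
    "Precision": precision_score(y_test, y_pred),
    "F1-score": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```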

