TP3: \(k\)-NN Models


Course: ITM 390 004: Machine Learning
Lecturer: Sothea HAS, PhD


Objective: In this lab, we will learn how to implement the \(k\)-NN model and fine-tune its hyperparameter \(k\) using different data-splitting schemes. Moreover, we will apply \(k\)-NN to both types of problems: regression and classification.


1. Abalone Dataset

Abalone is a popular seafood in Japanese and European cuisine. However, determining the age of an abalone requires cutting the shell through the cone, staining it, and counting the number of Rings under a microscope, a tedious and time-consuming task. Other measurements that are easier to obtain, including physical dimensions and weights, are therefore used to predict the age. This section aims at predicting the Rings of abalone from these physical measurements. Read and load the data from Kaggle: Abalone dataset.

# %pip install kagglehub   # if you have not installed "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

# Import data
import pandas as pd
data = pd.read_csv(path + "/abalone.csv")
data.head()
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7

A. Overview of the dataset.

  • What is the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset?

  • Create a statistical summary of the dataset. Identify any problems in the dataset.

  • Study the correlation matrix of this dataset and comment on it.

# To do

B. Model development.

  • Split the dataset into \(85\%\)-\(15\%\) training-testing data using random_state = 42.

  • Build a \(k\)-NN model with its default configuration. Evaluate the model performance on test data using MAPE and RMSE as evaluation metrics.

  • Fine-tune \(k\) using the train/validation/test scheme (of size 70%/15%/15%); you only need to split the training part into a 70% training set and a 15% validation set.

    • Compute the test performance metrics.
    • What do you observe if you rerun this part all over again?
  • Fine-tune \(k\) using \(10\)-fold cross validation method.

    • Report the optimal \(k\) and test performance.
    • What do you observe if you rerun this part again?
  • Report and compare the test results from all cases and conclude. Hint: a performance-summary table is a good tool to use.

Model RMSE MAPE
kNN default
kNN with 3 splits
kNN CV
# To do

2. Revisiting the Heart Disease Dataset

  • Reload the heart disease dataset introduced in Lab 2.

  • Follow the same procedure as in the previous section on this dataset, except that this time the evaluation metrics are the following:

    • Accuracy
    • Recall
    • Precision
    • F1-score.
  • Compare all the performance metrics on the test set and conclude.

age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 52 1 0 125 212 0 1 168 0 1.0 2 2 3 0
1 53 1 0 140 203 1 0 155 1 3.1 0 0 3 0
2 70 1 0 145 174 0 1 125 1 2.6 0 0 3 0
3 61 1 0 148 203 0 1 161 0 0.0 2 1 3 0
4 62 0 0 138 294 1 1 106 0 1.9 1 3 2 0
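The classification counterpart of the workflow can be sketched as below. Again, a synthetic stand-in from `make_classification` keeps the example self-contained; in the lab, use the 13 predictors and the binary `target` column of the reloaded heart disease dataset, and apply the same validation-split and 10-fold CV tuning of \(k\) as in the regression section.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, recall_score,
                             precision_score, f1_score)

# Synthetic stand-in for the heart disease data; in the lab, use
# X = data.drop(columns="target"), y = data["target"].
X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Default k-NN classifier (k = 5 in scikit-learn)
clf = KNeighborsClassifier().fit(X_train, y_train)
pred = clf.predict(X_test)

# The four evaluation metrics requested above
acc = accuracy_score(y_test, pred)
rec = recall_score(y_test, pred)
prec = precision_score(y_test, pred)
f1 = f1_score(y_test, pred)
print(f"Accuracy={acc:.3f}, Recall={rec:.3f}, "
      f"Precision={prec:.3f}, F1={f1:.3f}")
```

Since the features are on very different scales (e.g. `chol` vs `oldpeak`), standardizing them before fitting \(k\)-NN (e.g. with `StandardScaler`) usually improves all four metrics; distance-based models are sensitive to feature scales.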

References

\(^{\text{📚}}\) The Elements of Statistical Learning, Hastie et al. (2009).
\(^{\text{📚}}\) A Distribution-Free Theory of Nonparametric Regression, Györfi et al. (2002).
\(^{\text{📚}}\) A Probabilistic Theory of Pattern Recognition, Devroye et al. (1996).