TP3: \(k\)-NN Models


Course: ITM 390 004: Machine Learning
Lecturer: Sothea HAS, PhD


Objective: In this lab, we will learn how to implement the \(k\)-NN model and fine-tune its hyperparameter \(k\) using different data-splitting schemes. Moreover, we will apply \(k\)-NN to both types of problems: regression and classification.


1. Abalone Dataset

Abalone is a popular seafood in Japanese and European cuisine. However, determining the age of an abalone requires cutting the shell through the cone, staining it, and counting the number of Rings under a microscope, a tedious and time-consuming task. Other measurements that are easier to obtain, including physical dimensions and weights, are therefore used to predict the age. This section aims at predicting the Rings of abalone from these physical measurements. Read and load the data from Kaggle: Abalone dataset.

# %pip install kagglehub   # if you have not installed "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

# Import data
import pandas as pd
data = pd.read_csv(path + "/abalone.csv")
data.head()
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7

A. Overview of the dataset.

  • What is the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset?

  • Create a statistical summary of the dataset. Identify any problems in the dataset.

  • Study the correlation matrix of this dataset and comment on it.

# To do

B. Model development.

  • Split the dataset into \(85\%\)-\(15\%\) training-testing data using random_state = 42.

  • Build a \(k\)-NN model with its default configuration. Evaluate the model performance on test data using MAPE and RMSE as evaluation metrics.

  • Fine-tune \(k\) using the train/validation/test scheme (of size 70%/15%/15%); you only need to split the training part into a 70% training set and a 15% validation set.

    • Compute the test performance metrics.
    • What do you observe if you rerun this part all over again?
  • Fine-tune \(k\) using \(10\)-fold cross validation method.

    • Report the optimal \(k\) and test performance.
    • What do you observe if you rerun this part again?
  • Report and compare the test results from all cases and conclude. Hint: a performance-summary table is a good tool to use.

Model RMSE MAPE
kNN default
kNN with 3 splits
kNN CV
# To do

2. Revisiting the Heart Disease Dataset

  • Reload the heart disease dataset introduced in Lab 2.

  • Follow the same procedure as in the previous section on this dataset, except that this time the evaluation metrics are the following:

    • Accuracy
    • Recall
    • Precision
    • F1-score.
  • Compare all the performance metrics on the test set and conclude.

age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 52 1 0 125 212 0 1 168 0 1.0 2 2 3 0
1 53 1 0 140 203 1 0 155 1 3.1 0 0 3 0
2 70 1 0 145 174 0 1 125 1 2.6 0 0 3 0
3 61 1 0 148 203 0 1 161 0 0.0 2 1 3 0
4 62 0 0 138 294 1 1 106 0 1.9 1 3 2 0
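The classification counterpart of the workflow can be sketched as below. Again, a synthetic stand-in from `make_classification` keeps the example self-contained; in the lab, use the 13 predictors and the binary `target` column of the reloaded heart disease dataset, and apply the same validation-split and 10-fold CV tuning of \(k\) as in the regression section.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, recall_score,
                             precision_score, f1_score)

# Synthetic stand-in for the heart disease data; in the lab, use
# X = data.drop(columns="target"), y = data["target"].
X, y = make_classification(n_samples=500, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Default k-NN classifier (k = 5 in scikit-learn)
clf = KNeighborsClassifier().fit(X_train, y_train)
pred = clf.predict(X_test)

# The four evaluation metrics requested above
acc = accuracy_score(y_test, pred)
rec = recall_score(y_test, pred)
prec = precision_score(y_test, pred)
f1 = f1_score(y_test, pred)
print(f"Accuracy={acc:.3f}, Recall={rec:.3f}, "
      f"Precision={prec:.3f}, F1={f1:.3f}")
```

Since the features are on very different scales (e.g. `chol` vs `oldpeak`), standardizing them before fitting \(k\)-NN (e.g. with `StandardScaler`) usually improves all four metrics; distance-based models are sensitive to feature scales.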

References

\(^{\text{📚}}\) The Elements of Statistical Learning, Hastie et al. (2009).
\(^{\text{📚}}\) A Distribution-Free Theory of Nonparametric Regression, Györfi et al. (2002).
\(^{\text{📚}}\) A Probabilistic Theory of Pattern Recognition, Devroye et al. (1996).