Objective: In this lab, we will learn how to implement a \(k\)-NN model and fine-tune its key hyperparameter \(k\) using different data-splitting schemes. Moreover, we will apply \(k\)-NN to both types of problems: regression and classification.
You can work directly with Google Colab here: Lab3_kNN.ipynb.
1. Abalone Dataset
Abalone is a popular seafood in Japanese and European cuisine. However, the age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a tedious and time-consuming task. Other measurements, which are easier to obtain, can be used to predict the age instead, including physical measurements, weights, etc. This section aims at predicting the Rings of abalone from its physical measurements. Read and load the data from Kaggle: Abalone dataset.
```python
# %pip install kagglehub  # if you have not installed the "kagglehub" module yet
import kagglehub
import pandas as pd

# Download the latest version of the dataset
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

# Import the data
data = pd.read_csv(path + "/abalone.csv")
data.head()
```
|   | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.150 | 15 |
| 1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 |
| 2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 |
| 3 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 |
| 4 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 |
A. Overview of the dataset.
What’s the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset?
Create a statistical summary of the dataset. Identify any problems in this dataset.
Study the correlation matrix of this dataset. Comment on this correlation matrix.
# To do
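The overview steps above can be sketched as follows. A synthetic stand-in DataFrame is used here so the snippet runs on its own; in the lab, simply apply the same calls to the `data` DataFrame loaded from Kaggle.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the abalone DataFrame (replace with the loaded `data`)
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "Sex": rng.choice(["M", "F", "I"], size=100),   # qualitative variable
    "Length": rng.uniform(0.1, 0.8, size=100),      # quantitative variables
    "Diameter": rng.uniform(0.1, 0.7, size=100),
    "Rings": rng.integers(1, 25, size=100),
})

print(data.shape)                    # dimension: (n_rows, n_columns)
print(data.dtypes)                   # object columns are qualitative, the rest quantitative
print(data.describe())               # statistical summary of numeric columns
print(data.corr(numeric_only=True))  # correlation matrix of numeric columns
```

`numeric_only=True` keeps the qualitative `Sex` column out of the correlation computation.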
B. Model development.
Split the dataset into \(85\%-15\%\) training-testing data using random_state = 42.
Build a \(k\)-NN model with its default configuration. Evaluate the model performance on test data using MAPE and RMSE as evaluation metrics.
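A minimal sketch of this step, assuming scikit-learn and using a synthetic regression dataset as a stand-in for the abalone features and Rings target:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

# Synthetic stand-in for the abalone features/target
X, y = make_regression(n_samples=500, n_features=7, noise=5.0, random_state=42)
y = y - y.min() + 1  # keep the target positive so MAPE is well defined

# 85%-15% training-testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

knn = KNeighborsRegressor()  # default configuration: k = 5
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

mape = mean_absolute_percentage_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"MAPE = {mape:.3f}, RMSE = {rmse:.3f}")
```

On the real data, remember to encode or drop the qualitative `Sex` column before fitting, since \(k\)-NN needs numeric features.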
Fine-tune \(k\) using a train/validation/test scheme of size 70%/15%/15% (you just need to split the training part from the previous step into a 70% training set and a 15% validation set).
Compute the test performance metrics.
What do you observe if you rerun this part all over again?
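The three steps above can be sketched as follows (synthetic stand-in data; the candidate range \(k \in \{1,\dots,30\}\) is an assumption, adjust as needed):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the abalone features/target
X, y = make_regression(n_samples=500, n_features=7, noise=5.0, random_state=42)

# First carve off 15% for testing, then split the rest into 70%/15%
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
# 0.15/0.85 of the remaining 85% gives a 15% validation slice overall
X_train, X_val, y_train, y_val = train_test_split(
    X_tr, y_tr, test_size=0.15 / 0.85, random_state=42)

best_k, best_rmse = None, np.inf
for k in range(1, 31):
    pred = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train).predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, pred))
    if rmse < best_rmse:
        best_k, best_rmse = k, rmse

print(f"best k = {best_k} (validation RMSE = {best_rmse:.3f})")
```

If you drop the fixed `random_state` and rerun, the validation split changes each time, so the selected \(k\) (and the resulting test metrics) can change too.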
Fine-tune \(k\) using the \(10\)-fold cross-validation method.
Report the optimal \(k\) and test performance.
What do you observe if you rerun this part again?
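A sketch of the cross-validation approach, assuming scikit-learn's `GridSearchCV` (synthetic stand-in data again; the candidate range for \(k\) is an assumption):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the abalone features/target
X, y = make_regression(n_samples=500, n_features=7, noise=5.0, random_state=42)

# 10-fold cross-validation over candidate values of k
grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": range(1, 31)},
    cv=10,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
print("optimal k:", grid.best_params_["n_neighbors"])
```

Because the score for each \(k\) is averaged over 10 deterministic folds, rerunning this part on the same data gives the same optimal \(k\), unlike the single train/validation split.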
Report and compare the test results from all cases and conclude. Hint: a performance-summary table is a good tool to use.
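One way to organize the comparison is a small pandas DataFrame; the skeleton below uses `None` placeholders that you would replace with the metrics actually obtained in the steps above (the row labels are assumptions matching the three schemes in this part):

```python
import pandas as pd

# Skeleton of a performance-summary table; fill in the None entries
# with the k values and metrics you actually obtained above.
summary = pd.DataFrame({
    "Scheme": ["default (k=5)", "train/val/test", "10-fold CV"],
    "optimal k": [5, None, None],
    "MAPE": [None, None, None],
    "RMSE": [None, None, None],
})
print(summary.to_string(index=False))
```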