TP4 - Nonparametric Models


Course: Advanced Machine Learning
Lecturer: Sothea HAS, PhD

Objective: We have seen in the course that nonparametric models estimate the regression function (the minimizer of the MSE criterion) directly, without assuming a parametric form. In this TP, we shall learn how to implement three basic nonparametric models: \(K\)-NN, Decision Trees, and the Kernel Smoother.


1. Abalone Dataset

Abalone is a popular seafood in Japanese and European cuisine. Its age is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope, which is a tedious and time-consuming task. Other measurements that are easier to obtain, such as physical dimensions and weights, can be used to predict the age instead. This section aims at predicting the Rings of an abalone from its physical measurements. Read and load the data from Kaggle: Abalone dataset.

# %pip install kagglehub   # if you have not installed "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

# Import data
import pandas as pd
data = pd.read_csv(path + "/abalone.csv")
data.head()
   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7

A. Overview of the dataset.

  • What’s the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset?

  • Create a statistical summary of the dataset. Identify any problems in the data.

  • Study the correlation matrix of this dataset and comment on it.

# To do
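
One possible starting point for part A is sketched below. It is only an illustration: the helper name `overview` is hypothetical, and the small `sample` frame simply re-types the five rows printed by `data.head()` so that the code runs without downloading anything. In the TP itself, the full `data` frame would be passed instead.

```python
import pandas as pd

# The five rows shown by data.head() above, used as an offline stand-in.
sample = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Length": [0.455, 0.350, 0.530, 0.440, 0.330],
    "Diameter": [0.365, 0.265, 0.420, 0.365, 0.255],
    "Height": [0.095, 0.090, 0.135, 0.125, 0.080],
    "Whole weight": [0.5140, 0.2255, 0.6770, 0.5160, 0.2050],
    "Shucked weight": [0.2245, 0.0995, 0.2565, 0.2155, 0.0895],
    "Viscera weight": [0.1010, 0.0485, 0.1415, 0.1140, 0.0395],
    "Shell weight": [0.150, 0.070, 0.210, 0.155, 0.055],
    "Rings": [15, 7, 9, 10, 7],
})


def overview(df):
    """Collect the basic facts asked for in part A (hypothetical helper)."""
    quantitative = df.select_dtypes(include="number").columns.tolist()
    qualitative = df.select_dtypes(exclude="number").columns.tolist()
    return {
        "shape": df.shape,                          # (rows, columns)
        "quantitative": quantitative,
        "qualitative": qualitative,
        "summary": df.describe(include="all"),      # statistical summary
        "n_missing": int(df.isna().sum().sum()),    # total missing cells
        # Physical measurements should be positive, so count suspicious values.
        "n_nonpositive": (df[quantitative] <= 0).sum(),
        "correlation": df[quantitative].corr(),     # Pearson correlation
    }


info = overview(sample)
print(info["shape"], info["qualitative"])
print(info["n_nonpositive"])
print(info["correlation"].round(2))
```

Counting non-positive values per column is a design choice for this dataset: a measurement such as a height or a weight equal to zero would point to a data-entry problem rather than a real observation.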

B. Model development.

  • Split the dataset into \(80\%-20\%\) training-testing data using random_state = 42.

  • Build a \(K\)-NN model and fine-tune it to predict the testing data. Report its RMSE.

  • Build a Regression Tree to predict the testing data and report its RMSE.

  • Build a Kernel Smoother to predict the testing data and report its RMSE (you may use my Python package gradientcobra and its module KernelSmoother).
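
The split, \(K\)-NN and Regression Tree steps above could be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not a reference solution: the function name `compare_knn_and_tree`, the hyperparameter grids, and the synthetic `toy` frame (random numbers that only mimic the column layout of the abalone data) are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor


def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def compare_knn_and_tree(df, target="Rings"):
    # One-hot encode qualitative columns such as Sex.
    X = pd.get_dummies(df.drop(columns=target), drop_first=True, dtype=float)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # K-NN is distance based, so the inputs are standardized first.
    knn = GridSearchCV(
        make_pipeline(StandardScaler(), KNeighborsRegressor()),
        param_grid={"kneighborsregressor__n_neighbors": [3, 5, 10, 20, 40]},
        scoring="neg_root_mean_squared_error",
        cv=5,
    ).fit(X_train, y_train)

    # Trees are scale invariant; depth and leaf size control complexity.
    tree = GridSearchCV(
        DecisionTreeRegressor(random_state=42),
        param_grid={"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 20]},
        scoring="neg_root_mean_squared_error",
        cv=5,
    ).fit(X_train, y_train)

    return {
        "knn": rmse(y_test, knn.predict(X_test)),
        "tree": rmse(y_test, tree.predict(X_test)),
    }


# Synthetic stand-in data (NOT the abalone data), only to make the sketch runnable.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "Sex": rng.choice(["M", "F", "I"], size=n),
    "Length": rng.uniform(0.1, 0.8, size=n),
    "Whole weight": rng.uniform(0.0, 2.5, size=n),
})
toy["Rings"] = np.round(
    5 + 8 * toy["Length"] + 2 * toy["Whole weight"] + rng.normal(0, 1, size=n)
).astype(int)

scores = compare_knn_and_tree(toy)
print(scores)
```

With the real data, `compare_knn_and_tree(data)` would be called instead. Hyperparameters are tuned by cross-validation on the training part only, so the test RMSE remains an honest estimate.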

# To do
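
For the Kernel Smoother, the sketch below implements a plain Nadaraya-Watson estimator with a Gaussian kernel, \(\hat{f}(x) = \sum_i K_h(x - x_i)\,y_i / \sum_i K_h(x - x_i)\), directly in NumPy. It is a stand-in for illustration and is not the API of gradientcobra's KernelSmoother; the function names, the bandwidth grid and the synthetic sine data are assumptions made here.

```python
import numpy as np
from scipy.spatial.distance import cdist


def nadaraya_watson_predict(X_train, y_train, X_new, bandwidth=1.0):
    """Gaussian-kernel weighted average of the training responses."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    # Squared Euclidean distances, shape (n_new, n_train).
    sq_dist = cdist(X_new, X_train, metric="sqeuclidean")
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    total = weights.sum(axis=1)
    # Fall back to the global mean where every weight underflows to zero.
    pred = np.full(X_new.shape[0], y_train.mean())
    ok = total > 0
    pred[ok] = (weights[ok] @ y_train) / total[ok]
    return pred


def choose_bandwidth(X_fit, y_fit, X_val, y_val, grid):
    """Pick the bandwidth with the smallest RMSE on a validation set."""
    errors = [
        np.sqrt(np.mean((y_val - nadaraya_watson_predict(X_fit, y_fit, X_val, h)) ** 2))
        for h in grid
    ]
    best = int(np.argmin(errors))
    return grid[best], float(errors[best])


# Synthetic one-dimensional example (not the abalone data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
X_fit, y_fit, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

grid = [0.05, 0.1, 0.3, 1.0, 3.0]
best_h, val_rmse = choose_bandwidth(X_fit, y_fit, X_val, y_val, grid)
print(best_h, round(val_rmse, 3))
```

As with \(K\)-NN, the estimator depends on distances, so on the abalone data it would be sensible to standardize the inputs and to select the bandwidth on a validation split of the training data before computing the test RMSE.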

C. Neural Network.

  • Design a neural network to predict the testing data and compute its RMSE.

  • Compare with the previous results and conclude.
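
A small feed-forward network for part C could be sketched with scikit-learn's `MLPRegressor`, as below. The architecture (two hidden layers of 64 and 32 units), the helper name `build_mlp` and the synthetic data are assumptions chosen only for illustration; any deep learning framework could be used instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_mlp(hidden=(64, 32), seed=42):
    """Standardize the inputs, then fit a ReLU multilayer perceptron."""
    return make_pipeline(
        StandardScaler(),
        MLPRegressor(
            hidden_layer_sizes=hidden,
            activation="relu",
            max_iter=500,
            random_state=seed,
        ),
    )


# Synthetic regression data (not the abalone data), to keep the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

mlp = build_mlp().fit(X_train, y_train)
mlp_rmse = float(np.sqrt(np.mean((y_test - mlp.predict(X_test)) ** 2)))
print(round(mlp_rmse, 3))
```

For a fair comparison with the nonparametric models, the network should be evaluated on the same \(80\%-20\%\) split (random_state = 42) and with the same RMSE metric.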

2. Revisit Spam dataset

  • Your task in this section is to create email spam filters by applying the nonparametric models introduced in the course.

  • Report test performance metrics on the spam dataset loaded below.

  • Build a pipeline that takes the raw text of a real email as input and returns the type of the email, using the best spam filter you found in the first question.

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
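
The last task requires turning a raw email into the same numeric features the filter was trained on. The sketch below assumes, without having checked the actual columns of spam.txt, that the features are word-frequency percentages (100 times the count of a word divided by the total number of words). The vocabulary `VOCAB`, the toy emails and the helper names are all invented for illustration; in practice the vocabulary would come from the columns of the loaded table and the classifier would be your best model from the first question.

```python
import re

import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical vocabulary: replace with the word columns of the real table.
VOCAB = ["free", "money", "meeting", "project"]


def text_to_features(text, vocab=VOCAB):
    """Percentage of the words in `text` that match each vocabulary word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = max(len(words), 1)  # avoid division by zero on empty text
    row = {w: 100.0 * words.count(w) / total for w in vocab}
    return pd.DataFrame([row], columns=vocab)


def report(y_true, y_pred):
    """Usual test metrics for a binary spam filter (1 = spam)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Toy labelled emails, invented for illustration only (1 = spam, 0 = nonspam).
toy_emails = [
    ("free money now claim your free prize", 1),
    ("win money fast free offer", 1),
    ("free free money guaranteed", 1),
    ("money back free trial click here", 1),
    ("project meeting moved to monday", 0),
    ("notes from the project meeting attached", 0),
    ("can we schedule a meeting about the project", 0),
    ("project update before the next meeting", 0),
]
X = pd.concat([text_to_features(t) for t, _ in toy_emails], ignore_index=True)
y = [label for _, label in toy_emails]
spam_filter = DecisionTreeClassifier(random_state=42).fit(X, y)


def classify(text):
    """Full pipeline: raw text -> features -> predicted email type."""
    label = spam_filter.predict(text_to_features(text))[0]
    return "spam" if label == 1 else "nonspam"


print(classify("claim your free money today"))
print(report([1, 0, 1, 0], [1, 0, 0, 0]))
```

On the real data, `report` would be applied to the predictions on a held-out test set. Precision matters particularly for a spam filter, since a legitimate email wrongly sent to the spam folder is usually the costlier error.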

References

\(^{\text{📚}}\) The Elements of Statistical Learning, Hastie et al. (2002).
\(^{\text{📚}}\) A Distribution-free Theory of Nonparametric Regression, Györfi et al. (2002).
\(^{\text{📚}}\) A Probabilistic Theory of Pattern Recognition, Devroye et al. (1997).