# **TP6 - Nonparametric Models**

**Course: Advanced Machine Learning** <br>
**Lecturer: Dr. Sothea HAS**

----

**Objective:** We have seen in the course that nonparametric models aim to estimate the regression function, i.e. the minimizer of the MSE criterion, directly from the data. In this TP, we shall learn how to implement three basic nonparametric models: $K$-NN, Decision Trees and the Kernel Smoother method.

- The `notebook` of this `TP` can be downloaded here: [TP6_Nonparametric.ipynb](https://hassothea.github.io/Advanced-Machine-Learning-ITC/TPs/TP6_Nonparametric.ipynb).

----------

## 1. Abalone Dataset

Abalone is a popular seafood in Japanese and European cuisine. Determining the age of an abalone requires cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a boring and time-consuming task. Other measurements that are easier to obtain, such as physical dimensions and weights, are therefore used to predict the age. This section aims at predicting the `Rings` of abalone from these physical measurements. Read and load the data from Kaggle: [Abalone dataset](https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset).

In [2]:
# %pip install kagglehub   # if you have not installed "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

# Import data
import pandas as pd
data = pd.read_csv(path + "/abalone.csv")
data.head()

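|   | Sex | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|-------|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |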


**A. Overview of the dataset.** 

- What's the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset?

- Create a statistical summary and visualize the distributions of the quantitative columns, then of the qualitative one. Identify and handle any problems you find.

- Inspect whether there are any duplicated rows.

- Study both correlation matrices of this dataset and comment on them.

- Is the qualitative column useful for predicting the target `Rings`?

In [4]:
# To do
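
The checks listed in part A can be sketched as follows. This is only a minimal illustration on a made-up toy frame (`toy`) that borrows a few abalone column names; its values are not taken from the real data. It also assumes that "both correlation matrices" refers to Pearson and Spearman, and that a zero `Height` counts as a problem worth removing.

```python
import pandas as pd

# Toy stand-in with abalone-like columns; the values are invented for illustration.
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M", "F", "I", "M", "M"],
    "Length": [0.455, 0.530, 0.330, 0.440, 0.530, 0.425, 0.455, 0.475],
    "Height": [0.095, 0.135, 0.080, 0.125, 0.000, 0.095, 0.095, 0.125],
    "Rings": [15, 9, 7, 10, 9, 8, 15, 9],
})

# Dimension and variable types.
print(toy.shape)
quantitative = toy.select_dtypes(include="number").columns.tolist()
qualitative = toy.select_dtypes(exclude="number").columns.tolist()

# Statistical summaries: numeric columns, then the qualitative one.
print(toy[quantitative].describe())
print(toy["Sex"].value_counts())

# Duplicated rows and implausible values (assumption: a height of 0 is an error).
n_duplicates = int(toy.duplicated().sum())
clean = toy.drop_duplicates()
clean = clean[clean["Height"] > 0]

# Two correlation matrices (assumed here to be Pearson and Spearman).
pearson = clean[quantitative].corr(method="pearson")
spearman = clean[quantitative].corr(method="spearman")
print(pearson.round(2))
print(spearman.round(2))

# Does the qualitative column carry information about the target?
rings_by_sex = clean.groupby("Sex")["Rings"].agg(["mean", "std", "count"])
print(rings_by_sex)
```

On the real data, the same calls apply to `data` after dropping the index column; histograms and box plots of `Rings` per `Sex` would complement the grouped summary.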

**B. Model development.** 

- Split the dataset into $80\%-20\%$ training-testing data using `random_state = 42`.

- Build a $K$-NN model and fine-tune it to predict the testing data. Report its CV-RMSE.

- Build and fine-tune a Regression Tree to predict the testing data and report its CV-RMSE.

- Build a Kernel Smoother model to predict the testing data and report its CV-RMSE (see the Python package [`gradientcobra`](https://pypi.org/project/gradientcobra/) and its [`KernelSmoother`](https://hassothea.github.io/files/CodesPhD/kernelsmoother.html) module).

- Compare the cross-validation performance of the three models, then test all three models on the testing data. Create the following comparison table and conclude.

| **Model** | **CV-RMSE** | **Test-RMSE** | **Test-$R^2$** | **Test-MAPE** |
|-----------|-------------|---------------|----------------|---------------|
| **KNN**   | $\dots$     | $\dots$       | $\dots$        | $\dots$       |
| **Tree**  | $\dots$     | $\dots$       | $\dots$        | $\dots$       |
| **Kernel**| $\dots$     | $\dots$       | $\dots$        | $\dots$       |

In [3]:
# To do
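
One possible workflow for part B is sketched here under several assumptions: the data are synthetic stand-ins rather than the abalone features, the hyperparameter grids are illustrative, and `NWKernelRegressor` is a hypothetical hand-written Nadaraya-Watson smoother with a Gaussian kernel. It is not the `KernelSmoother` class of `gradientcobra`, whose interface is not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor


class NWKernelRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical Nadaraya-Watson smoother with a Gaussian kernel."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        # A kernel smoother only memorizes the training sample.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y, dtype=float)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        sq_dist = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=2)
        # Shifting each row by its minimum leaves the weighted mean unchanged
        # and guarantees at least one weight equal to 1 (no division by zero).
        sq_dist -= sq_dist.min(axis=1, keepdims=True)
        weights = np.exp(-sq_dist / (2.0 * self.bandwidth ** 2))
        return weights @ self.y_ / weights.sum(axis=1)


# Synthetic stand-in for the features and the target `Rings` (assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(400, 3))
y = 8.0 + 10.0 * X[:, 0] + 3.0 * np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 1.0, size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate models with illustrative grids; distance-based models get scaled inputs.
candidates = {
    "KNN": (make_pipeline(StandardScaler(), KNeighborsRegressor()),
            {"kneighborsregressor__n_neighbors": [3, 5, 10, 20, 40]}),
    "Tree": (DecisionTreeRegressor(random_state=42),
             {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 20]}),
    "Kernel": (make_pipeline(StandardScaler(), NWKernelRegressor()),
               {"nwkernelregressor__bandwidth": [0.1, 0.2, 0.5, 1.0]}),
}

rows = []
for name, (estimator, grid) in candidates.items():
    # 5-fold CV on the training set; the best model is refit on all training data.
    search = GridSearchCV(estimator, grid, cv=5, scoring="neg_root_mean_squared_error")
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    rows.append({
        "Model": name,
        "CV-RMSE": -search.best_score_,
        "Test-RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "Test-R2": r2_score(y_test, y_pred),
        "Test-MAPE": mean_absolute_percentage_error(y_test, y_pred),
    })

comparison = pd.DataFrame(rows).set_index("Model")
print(comparison.round(3))
```

The scaler sits inside the pipeline so that it is refit on each training fold, which keeps the CV-RMSE free of information from the validation fold.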

## 2. Revisiting the `Spam` Dataset


- Your task in this section is to create email spam filters by applying the nonparametric models introduced in the [course](https://hassothea.github.io/Advanced-Machine-Learning-ITC/courses/AML3_Nonparametric_Models.html){target='_blank'}. 

- Report CV and test performance metrics on the spam dataset loaded below.

In [3]:
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)

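|   | Id | make | address | all | num3d | our | over | remove | internet | order | ... | charSemicolon | charRoundbracket | charSquarebracket | charExclamation | charDollar | charHash | capitalAve | capitalLong | capitalTotal | type |
|---|----|------|---------|-----|-------|-----|------|--------|----------|-------|-----|---------------|------------------|-------------------|-----------------|------------|----------|------------|-------------|--------------|------|
| 0 | 1 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | spam |
| 1 | 2 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | spam |
| 2 | 3 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | spam |
| 3 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | spam |
| 4 | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | spam |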
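
Reporting CV and test metrics for the spam filters might look like the following sketch. It runs on synthetic data from `make_classification` relabelled as `spam`/`nonspam` (an assumption made so the example is self-contained), and the grids and the choice of accuracy and F1 as metrics are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the spam features and the `type` column (assumption).
X, y01 = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=42)
y = np.where(y01 == 1, "spam", "nonspam")

# Stratify so that both classes keep their proportions in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidates = {
    "KNN": (make_pipeline(StandardScaler(), KNeighborsClassifier()),
            {"kneighborsclassifier__n_neighbors": [3, 5, 11, 21]}),
    "Tree": (DecisionTreeClassifier(random_state=42),
             {"max_depth": [3, 5, 8, None], "min_samples_leaf": [1, 5, 20]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    results[name] = {
        "CV-accuracy": search.best_score_,
        "Test-accuracy": accuracy_score(y_test, y_pred),
        "Test-F1 (spam)": f1_score(y_test, y_pred, pos_label="spam"),
    }
    # Rows are true classes, columns are predicted classes.
    print(name)
    print(confusion_matrix(y_test, y_pred, labels=["nonspam", "spam"]))

report = pd.DataFrame(results).T
print(report.round(3))
```

With the real data, `X` would be the feature columns of `data` (without `Id` and `type`) and `y` the `type` column.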


- Build a pipeline that takes the raw text of a real email as input and returns the type of the email (`spam` or `nonspam`), using the best spam filter found in the first question.

In [None]:
# Example:
email = 'Hi Jack,\n I hope this email finds you well. I am writing to ask for the address of Marry because I want to send her an invitation for my wedding.\n\n Thank you for the information.\n\n Best regards, Mark'

# `SpamFilter` is the pipeline you are asked to build; it is not defined in this notebook.
# This is the prediction by KNN (here `SpamFilter` wraps your K-NN model)
print(f'* KNN predicts this email to be: {SpamFilter(email)}')

# This is the prediction by Tree (here `SpamFilter` wraps your Tree model)
print(f'* Tree predicts this email to be: {SpamFilter(email)}')

```
* KNN predicts this email to be: nonspam
* Tree predicts this email to be: nonspam
```
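
The first stage of such a pipeline has to turn raw text into the same columns as the training table. The sketch below assumes the columns follow the usual Spambase definitions (word and character frequencies expressed as percentages, plus statistics on runs of capital letters) and that a `num` prefix marks a token starting with a digit, e.g. `num3d` for `3d`. The helper `email_to_features` and the mapping `CHAR_COLUMNS` are hypothetical names, and only the word columns visible in the preview above are used.

```python
import re

# Assumed mapping from character columns to the characters they count.
CHAR_COLUMNS = {
    "charSemicolon": ";",
    "charRoundbracket": "(",
    "charSquarebracket": "[",
    "charExclamation": "!",
    "charDollar": "$",
    "charHash": "#",
}


def email_to_features(email, word_columns):
    """Map raw email text to a dict of Spambase-style features (hypothetical helper)."""
    words = re.findall(r"[A-Za-z0-9]+", email.lower())
    n_words = max(len(words), 1)
    n_chars = max(len(email), 1)

    features = {}
    for col in word_columns:
        # Assumption: "num3d" stands for the token "3d".
        token = col[3:] if col.startswith("num") else col
        features[col] = 100.0 * words.count(token) / n_words
    for col, char in CHAR_COLUMNS.items():
        features[col] = 100.0 * email.count(char) / n_chars

    # Lengths of uninterrupted runs of capital letters.
    runs = [len(run) for run in re.findall(r"[A-Z]+", email)]
    features["capitalAve"] = sum(runs) / len(runs) if runs else 0.0
    features["capitalLong"] = max(runs, default=0)
    features["capitalTotal"] = sum(runs)
    return features


word_columns = ["make", "address", "all", "num3d", "our", "over", "remove", "internet", "order"]
example = "ORDER now!! Remove all doubt, order our FREE offer"
features = email_to_features(example, word_columns)
print(features)
```

A full `SpamFilter` would pass every word column of the training table, arrange the resulting dict in the training column order (for instance with `pd.DataFrame([features])[X_train.columns]`), and call `predict` on the fitted model.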

# References

$^{\text{📚}}$ [The Elements of Statistical Learning, Hastie et al. (2002)](https://www.stat.ntu.edu.tw/download/教學文件/bigdata/The%20Elements%20of%20Statistical%20Learning.pdf). <br>
$^{\text{📚}}$ [A Distribution-free Theory of Nonparametric Regression, Györfi et al. (2002)](https://link.springer.com/book/10.1007/b97848). <br>
$^{\text{📚}}$ [A Probabilistic Theory of Pattern Recognition, Devroye et al. (1997)](https://www.szit.bme.hu/~gyorfi/pbook.pdf). <br>
