Lab 5: \(K\)-Nearest Neighbors & Decision Trees

Course: CSCI-866-001: Data Mining & Knowledge Discovery
Lecturer: Sothea HAS, PhD


Objective: In this lab, we will explore nonparametric models that predict the label of a data point based on its similarity to training examples. Additionally, we will examine strategies to enhance model performance by applying cross-validation to fine-tune hyperparameters, ensuring optimal predictive accuracy.


Email Spam Dataset

Let's start by exploring the email spam dataset introduced in the previous chapter. The data can be imported as follows.

import pandas as pd

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
Id make address all num3d our over remove internet order ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0 1 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 spam
1 2 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 spam
2 3 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 spam
3 4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 spam
4 5 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 spam

5 rows × 59 columns

1. Univariate Analysis: Preprocessing & Data Analysis

A. Visualize the distribution of the target type.

B. Compute the minimum value of every feature and check that all of them are non-negative.

C. Are there any NaN or NA values in this dataset?

D. Are there any duplicated observations?

# To do
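A possible sketch of the four steps is shown below. It assumes the column names `Id` and `type` from the printout above; the bar plot for part A uses matplotlib, which is one choice among several.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")

# A. Distribution of the target `type`
counts = data["type"].value_counts()
print(counts)
counts.plot(kind="bar", title="Distribution of type")
plt.close("all")

# B. Minimum of each input feature (excluding Id and the target)
features = data.drop(columns=["Id", "type"])
print(features.min())

# C. Missing values anywhere in the dataset
print("Missing values:", data.isna().sum().sum())

# D. Duplicated observations (ignoring the Id column)
print("Duplicates:", data.drop(columns=["Id"]).duplicated().sum())
```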

2. Bivariate Analysis: Exploratory Data Analysis & Important Feature Detection

A. Pick three input features and visualize their relationship with the target type. Do the chosen inputs seem to be related to the target?

B. Trying to visualize or detect the connection between all 57 inputs and the target is a challenging task. For this purpose, statistical tests such as Analysis of Variance (ANOVA) and its nonparametric counterpart, the Kruskal-Wallis test, are useful tools. We will use the Kruskal-Wallis test to detect informative inputs for email classification.

  • Import the kruskal function from scipy.stats as follows:

    from scipy.stats import kruskal
  • For each of the three input features selected in the first point, perform the Kruskal-Wallis test to check whether the median of that input differs significantly between the spam and nonspam groups.

  • Conduct the Kruskal-Wallis test on all 57 columns of the dataset to assess whether there are significant differences in the medians of input features between the spam and nonspam groups.

  • Select only the features where the p-value is less than 1e-10, indicating that the difference in medians is statistically significant.

# To do
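The loop below sketches the test over all 57 inputs at once; running it on just the three chosen columns covers the earlier point as well. It assumes the target takes the two values "spam" and "nonspam", as used in the text.

```python
import pandas as pd
from scipy.stats import kruskal

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")

spam = data[data["type"] == "spam"]
nonspam = data[data["type"] == "nonspam"]

# Kruskal-Wallis test on every input column: compare the two groups
inputs = data.columns.drop(["Id", "type"])
pvalues = pd.Series(
    {col: kruskal(spam[col], nonspam[col]).pvalue for col in inputs}
)

# Keep only features whose p-value falls below the 1e-10 threshold
selected = pvalues[pvalues < 1e-10].index.tolist()
print(f"{len(selected)} of {len(inputs)} features selected")
```

A small p-value means the two group medians are unlikely to be equal, so the corresponding input carries information about the class.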

3. \(K\)-Nearest Neighbors (KNN)

3.1. Preparation

A. Split the dataset into 80%-20% of training and testing data.

B. Standardize both the training and testing input features.

C. Choose your favorite \(K\) and build two KNN models on the training data using all columns and only the selected features.

D. Test the performance of the two models on the testing data using suitable metrics.

# To do
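One way to carry out steps A-D for the all-columns model is sketched below; \(K = 5\), `random_state=42`, and the stratified split are arbitrary illustrative choices. Note that the scaler is fit on the training inputs only, then applied to both sets, to avoid leaking test information.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
X = data.drop(columns=["Id", "type"])
y = data["type"]

# A. 80%-20% train/test split, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# B. Standardize: fit on the training inputs only, transform both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# C. KNN with an arbitrary choice K = 5, using all columns
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)

# D. Performance on the testing data
y_pred = knn.predict(X_test_s)
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy (all features, K=5): {acc:.3f}")
print(classification_report(y_test, y_pred))
```

Repeating steps B-D with X restricted to the Kruskal-Wallis-selected columns gives the second model.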

3.2. Fine-tune \(K\)

A. For each model, perform \(K\)-fold cross-validation to select the best value of \(K\) (the number of neighbors).

B. Evaluate the performance of the two new models using their respective optimal value of \(K\).

C. Conclude.

# To do
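A sketch of the cross-validated search for one model follows; the 5 folds and the odd candidate values of \(K\) up to 29 are arbitrary choices. Wrapping the scaler and the classifier in a Pipeline ensures the standardization is refit inside each fold rather than on the full training set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
X = data.drop(columns=["Id", "type"])
y = data["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling + KNN in one pipeline, so each CV fold scales its own training part
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# 5-fold cross-validation over odd values of K
grid = GridSearchCV(pipe,
                    param_grid={"knn__n_neighbors": list(range(1, 31, 2))},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

best_K = grid.best_params_["knn__n_neighbors"]
test_acc = grid.score(X_test, y_test)  # refit on full training data, scored on test
print(f"Best K = {best_K}, test accuracy = {test_acc:.3f}")
```

The same search run on the Kruskal-Wallis-selected columns gives the tuned version of the second model, and the two test accuracies can then be compared to conclude.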

4. Decision Trees

  • Default setting: Build two different decision tree models on the 80% training data. Test the performance of the models on the testing data.
  • Fine-tuned model: Perform cross-validation to tune the hyperparameters of the models including:
    • depth of the trees
    • minimal size of the terminal nodes (leaves)
    • number of features to be considered at each split.
    • splitting criterion…
  • Measure their performance on the corresponding testing dataset.
  • Conclude.
# To do
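The steps above can be sketched for one tree as follows; the grid values and `random_state=42` are illustrative choices, and decision trees do not require standardized inputs, so the raw features are used directly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
X = data.drop(columns=["Id", "type"])
y = data["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Default setting: a fully grown tree with default hyperparameters
default_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
default_acc = default_tree.score(X_test, y_test)
print(f"Default tree test accuracy: {default_acc:.3f}")

# Fine-tuned model: 5-fold cross-validation over the listed hyperparameters
param_grid = {
    "max_depth": [5, 10, 20, None],      # depth of the tree
    "min_samples_leaf": [1, 5, 10],      # minimal size of the leaves
    "max_features": [None, "sqrt"],      # features considered at each split
    "criterion": ["gini", "entropy"],    # splitting criterion
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

tuned_acc = grid.score(X_test, y_test)
print("Best parameters:", grid.best_params_)
print(f"Tuned tree test accuracy: {tuned_acc:.3f}")
```

Comparing `default_acc` and `tuned_acc` (and the same pair for the second tree) supports the concluding discussion.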

Further Reading

\(^{\text{📚}}\) Pandas python library: https://pandas.pydata.org/docs/getting_started/index.html#getting-started
\(^{\text{📚}}\) Pandas Cheatsheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
\(^{\text{📚}}\) 10 Minutes to Pandas: https://pandas.pydata.org/docs/user_guide/10min.html
\(^{\text{📚}}\) Pandas Lessons on Kaggle: https://www.kaggle.com/learn/pandas
\(^{\text{📚}}\) Chapter 4, An Introduction to Statistical Learning with Applications in R, James et al. (2021).
\(^{\text{📚}}\) The Elements of Statistical Learning, Hastie et al. (2009).
\(^{\text{📚}}\) A Distribution-Free Theory of Nonparametric Regression, Györfi et al. (2002).
\(^{\text{📚}}\) A Probabilistic Theory of Pattern Recognition, Devroye et al. (1996).