Objective: In this lab, we will explore nonparametric models that predict the label of a data point from its similarity to training examples. We will also use cross-validation to tune hyperparameters and improve predictive accuracy.
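To make the idea of similarity-based prediction concrete, here is a minimal sketch of a nearest-neighbor vote written with NumPy. The function name `knn_predict` and the toy points are made up for illustration; they are not part of the lab material, and Euclidean distance is an assumed choice of similarity.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x_new to every training point (assumed similarity measure)
    dists = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of those neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (hypothetical): two small clusters in the plane
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_toy = ["nonspam", "nonspam", "nonspam", "spam", "spam", "spam"]

pred_a = knn_predict(X_toy, y_toy, [0.05, 0.05], k=3)
pred_b = knn_predict(X_toy, y_toy, [1.0, 0.95], k=3)
print(pred_a, pred_b)
```

No parameters are estimated here: the training set itself is the model, which is what makes the method nonparametric.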
Let's start by exploring the email spam dataset introduced in the previous chapter. The data can be imported as follows.
import pandas as pd

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
   Id  make  address   all  num3d   our  over  remove  internet  order  ...  charSemicolon  charRoundbracket  charSquarebracket  charExclamation  charDollar  charHash  capitalAve  capitalLong  capitalTotal  type
0   1  0.00     0.64  0.64    0.0  0.32  0.00    0.00      0.00   0.00  ...           0.00             0.000                0.0            0.778       0.000     0.000       3.756           61           278  spam
1   2  0.21     0.28  0.50    0.0  0.14  0.28    0.21      0.07   0.00  ...           0.00             0.132                0.0            0.372       0.180     0.048       5.114          101          1028  spam
2   3  0.06     0.00  0.71    0.0  1.23  0.19    0.19      0.12   0.64  ...           0.01             0.143                0.0            0.276       0.184     0.010       9.821          485          2259  spam
3   4  0.00     0.00  0.00    0.0  0.63  0.00    0.31      0.63   0.31  ...           0.00             0.137                0.0            0.137       0.000     0.000       3.537           40           191  spam
4   5  0.00     0.00  0.00    0.0  0.63  0.00    0.31      0.63   0.31  ...           0.00             0.135                0.0            0.135       0.000     0.000       3.537           40           191  spam
5 rows × 59 columns
1. Univariate Analysis: Preprocessing & Data Analysis
A. Visualize the distribution of the target type.
B. Compute the minimum value of every feature and check that all of them are non-negative.
C. Are there any NaN or NA values in this dataset?
D. Are there any duplicated observations?
# To do
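One possible way to approach parts A to D is sketched below. The helper `univariate_checks` and the tiny `toy` frame are hypothetical stand-ins so that the example runs without downloading the data; on the real dataset you would pass `data` instead. Ignoring the `Id` column when looking for duplicates is an assumption, made because `Id` is unique per row.

```python
import pandas as pd

def univariate_checks(df, target="type", id_col="Id"):
    """Summarize class balance, feature minimums, missing values and duplicates."""
    features = df.drop(columns=[target, id_col])
    return {
        # A. class distribution of the target
        "class_counts": df[target].value_counts(),
        # B. minimum of each feature, and whether all are >= 0
        "feature_minimums": features.min(),
        "all_nonnegative": bool((features.min() >= 0).all()),
        # C. total number of missing cells
        "n_missing": int(df.isna().sum().sum()),
        # D. duplicated rows, ignoring the Id column (assumption)
        "n_duplicates": int(df.drop(columns=[id_col]).duplicated().sum()),
    }

# Hypothetical toy frame with the same column layout as the spam data
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "make": [0.0, 0.21, 0.0, 0.0],
    "charDollar": [0.0, 0.18, 0.0, 0.0],
    "type": ["spam", "spam", "nonspam", "nonspam"],
})

report = univariate_checks(toy)
print(report["class_counts"])
print(report["all_nonnegative"], report["n_missing"], report["n_duplicates"])
```

If matplotlib is installed, `report["class_counts"].plot(kind="bar")` gives a quick view of the target distribution.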
2. Bivariate Analysis: Exploratory Data Analysis & Important Feature Detection
A. Pick three input features and visualize their relationship with the target type. Do the chosen inputs seem to be related to the target?
B. Visualizing or detecting the connection between each of the 57 inputs and the target by eye is challenging. Statistical tests such as Analysis of Variance (ANOVA) and its nonparametric counterpart, the Kruskal-Wallis test, are useful tools for this purpose. We will use the Kruskal-Wallis test to detect informative inputs for email classification.
Import the kruskal function from scipy.stats as follows:
from scipy.stats import kruskal
For each of the three input features selected in part A, perform the Kruskal-Wallis test to check whether its median differs significantly between the spam and nonspam groups.
Conduct the Kruskal-Wallis test on all 57 input features of the dataset to assess whether their medians differ significantly between the spam and nonspam groups.
Select only the features whose p-value is less than 1e-10, indicating that the difference in medians is statistically significant.
# To do
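The screening step described above can be sketched as follows. The helper `kruskal_pvalues` and the synthetic frame `toy` (one shifted feature, one pure-noise feature) are assumptions made for illustration only; with the real data you would call the helper on `data`. Note that `kruskal` may raise an error or return NaN for a feature whose values are all identical, depending on the SciPy version.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

def kruskal_pvalues(df, target="type", exclude=("Id",)):
    """Return Kruskal-Wallis p-values (sorted ascending), one per input feature."""
    feature_cols = [c for c in df.columns if c != target and c not in exclude]
    # One sub-frame per class of the target (spam / nonspam)
    groups = [g for _, g in df.groupby(target)]
    pvals = {}
    for col in feature_cols:
        samples = [g[col].to_numpy() for g in groups]
        pvals[col] = kruskal(*samples).pvalue
    return pd.Series(pvals).sort_values()

# Synthetic stand-in data (hypothetical): "informative" is shifted for spam
rng = np.random.default_rng(0)
n = 300
labels = np.repeat(["spam", "nonspam"], n // 2)
toy = pd.DataFrame({
    "Id": np.arange(1, n + 1),
    "informative": rng.normal(size=n) + 3.0 * (labels == "spam"),
    "noise": rng.normal(size=n),
    "type": labels,
})

pvals = kruskal_pvalues(toy)
# Keep only features below the threshold used in the lab
selected = pvals[pvals < 1e-10].index.tolist()
print(pvals)
print(selected)
```

Sorting the p-values makes it easy to see which features are kept and how close the others are to the threshold.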
3. \(K\)-Nearest Neighbors (KNN)
3.1. Preparation
A. Split the dataset into 80% training and 20% testing data.
B. Standardize both the training and testing input features, using the mean and standard deviation computed on the training data.
C. Choose your favorite \(K\) and build two KNN models on the training data: one using all input features and one using only the features selected in Section 2.
D. Test the performance of the two models on the testing data using suitable metrics.
# To do
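A minimal sketch of steps A to D with scikit-learn is given below. It runs on synthetic data from `make_classification` rather than the spam data, and `selected` is a hypothetical list of column indices standing in for the features kept by the Kruskal-Wallis screening. The choice of `k=5` and of accuracy, F1 score and confusion matrix as metrics are assumptions, not requirements of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spam data; with shuffle=False the first
# n_informative columns are the informative ones
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           shuffle=False, random_state=0)
all_cols = list(range(X.shape[1]))
selected = [0, 1, 2, 3]  # hypothetical "selected features"

# A. 80%-20% split, stratified so both parts keep the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# B. Fit the scaler on the training data only, then apply it to both parts
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

def fit_and_score(cols, k=5):
    """C and D: fit KNN on the given columns and evaluate it on the test set."""
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train_s[:, cols], y_train)
    pred = model.predict(X_test_s[:, cols])
    return {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "confusion": confusion_matrix(y_test, pred),
    }

results_all = fit_and_score(all_cols)
results_sel = fit_and_score(selected)
print(results_all["accuracy"], results_sel["accuracy"])
```

Fitting the scaler on the training part only keeps information from the test set out of the model, which matters because KNN distances depend directly on feature scales.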
3.2. Fine-tune \(K\)
A. For each model, use cross-validation to select the best number of neighbors \(K\). Note that the number of folds is a separate choice from \(K\).
B. Evaluate the performance of the two new models using their respective optimal value of \(K\).
C. Conclude
# To do
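One way to tune \(K\) is a grid search with cross-validation, sketched below on synthetic data. The grid of odd values from 1 to 31, the 5 folds, and accuracy as the scoring rule are all assumptions chosen for illustration. Placing the scaler inside a `Pipeline` means it is refitted on the training folds only at each cross-validation step.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the spam data (hypothetical)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling and KNN chained so that scaling is redone inside every fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate numbers of neighbors (assumed grid)
param_grid = {"knn__n_neighbors": list(range(1, 32, 2))}

# 5-fold stratified cross-validation on the training data only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
# The best model is refitted on the full training set, then scored on the test set
test_accuracy = search.score(X_test, y_test)
print(best_k, search.best_score_, test_accuracy)
```

Running the same search once with all columns and once with the selected columns gives the two tuned models to compare in part B.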
4. Decision Trees
Default setting: Build two different decision tree models on the 80% training data, then test their performance on the testing data.
Fine-tuned model: Perform cross-validation to tune the hyperparameters of the models, including:
depth of the trees
minimal size of the terminal nodes (leaves)
number of features to be considered at each split
splitting criteria…
Measure their performance on the corresponding testing dataset.
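The default and tuned trees can be sketched as follows, again on synthetic data. The mapping of the hyperparameters listed above to `max_depth`, `min_samples_leaf`, `max_features` and `criterion` follows scikit-learn's `DecisionTreeClassifier`; the candidate values in the grid are assumptions, and only one of the two models is shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the spam data (hypothetical)
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Default setting: a tree with scikit-learn's default hyperparameters
default_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
default_acc = default_tree.score(X_test, y_test)

# Fine-tuned model: cross-validated grid search (assumed candidate values)
param_grid = {
    "max_depth": [3, 5, 10, None],         # depth of the tree
    "min_samples_leaf": [1, 5, 20],        # minimal size of the leaves
    "max_features": [None, "sqrt", 0.5],   # features considered at each split
    "criterion": ["gini", "entropy"],      # splitting criterion
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Best tree refitted on the whole training set, scored on the test set
tuned_acc = search.score(X_test, y_test)
print(default_acc, tuned_acc, search.best_params_)
```

Trees do not rely on distances, so no standardization step is needed here, unlike for KNN.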