Lab 5: \(K\)-Nearest Neighbors & Decision Trees

Course: CSCI-866-001: Data Mining & Knowledge Discovery
Lecturer: Sothea HAS, PhD


Objective: In this lab, we will explore nonparametric models that predict the label of a data point based on its similarity to training examples. Additionally, we will examine strategies to enhance model performance by applying cross-validation to fine-tune hyperparameters, ensuring optimal predictive accuracy.


Email Spam Dataset

Let's start by exploring the email spam dataset introduced in the previous chapter. The data can be imported as follows.

import pandas as pd

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
Id make address all num3d our over remove internet order ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0 1 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 spam
1 2 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 spam
2 3 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 spam
3 4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 spam
4 5 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 spam

5 rows × 59 columns

1. Univariate Analysis: Preprocessing & Data Analysis

A. Visualize the distribution of the target type.

B. Compute the minimum value of every feature and check that all of them are non-negative.

C. Are there any NaN or NA values in this dataset?

D. Are there any duplicated observations?

# To do
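A possible sketch of the four steps is shown below. It assumes the column names `Id` and `type` from the printout above; the bar plot for part A uses matplotlib, which is one choice among several.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")

# A. Distribution of the target `type`
counts = data["type"].value_counts()
print(counts)
counts.plot(kind="bar", title="Distribution of type")
plt.close("all")

# B. Minimum of each input feature (excluding Id and the target)
features = data.drop(columns=["Id", "type"])
print(features.min())

# C. Missing values anywhere in the dataset
print("Missing values:", data.isna().sum().sum())

# D. Duplicated observations (ignoring the Id column)
print("Duplicates:", data.drop(columns=["Id"]).duplicated().sum())
```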

2. Bivariate Analysis: Exploratory Data Analysis & Important Feature Detection

A. Pick three input features and visualize their relationship with the target type. Do the chosen inputs seem to be related to the target?

B. Trying to visualize or detect the connection between all 57 inputs and the target is a challenging task. For this purpose, statistical tests such as Analysis of Variance (ANOVA) and its nonparametric counterpart, the Kruskal-Wallis test, are useful tools. We will use the Kruskal-Wallis test to detect informative inputs for email classification.

  • Import the kruskal function from scipy.stats as follows:

    from scipy.stats import kruskal
  • For each of the three input features selected in the first point, perform the Kruskal-Wallis test to check whether the median of that input differs significantly between the spam and nonspam groups.

  • Conduct the Kruskal-Wallis test on all 57 columns of the dataset to assess whether there are significant differences in the medians of input features between the spam and nonspam groups.

  • Select only the features where the p-value is less than 1e-10, indicating that the difference in medians is statistically significant.

# To do
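The loop below sketches the test over all 57 inputs at once; running it on just the three chosen columns covers the earlier point as well. It assumes the target takes the two values "spam" and "nonspam", as used in the text.

```python
import pandas as pd
from scipy.stats import kruskal

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")

spam = data[data["type"] == "spam"]
nonspam = data[data["type"] == "nonspam"]

# Kruskal-Wallis test on every input column: compare the two groups
inputs = data.columns.drop(["Id", "type"])
pvalues = pd.Series(
    {col: kruskal(spam[col], nonspam[col]).pvalue for col in inputs}
)

# Keep only features whose p-value falls below the 1e-10 threshold
selected = pvalues[pvalues < 1e-10].index.tolist()
print(f"{len(selected)} of {len(inputs)} features selected")
```

A small p-value means the two group medians are unlikely to be equal, so the corresponding input carries information about the class.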

3. \(K\)-Nearest Neighbors (KNN)

3.1. Preparation

A. Split the dataset into 80%-20% of training and testing data.

B. Standardize both the training and testing input features.

C. Choose your favorite \(K\) and build two KNN models on the training data using all columns and only the selected features.

D. Test the performance of the two models on the testing data using suitable metrics.

# To do
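One way to carry out steps A-D for the all-columns model is sketched below; \(K = 5\), `random_state=42`, and the stratified split are arbitrary illustrative choices. Note that the scaler is fit on the training inputs only, then applied to both sets, to avoid leaking test information.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
X = data.drop(columns=["Id", "type"])
y = data["type"]

# A. 80%-20% train/test split, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# B. Standardize: fit on the training inputs only, transform both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# C. KNN with an arbitrary choice K = 5, using all columns
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)

# D. Performance on the testing data
y_pred = knn.predict(X_test_s)
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy (all features, K=5): {acc:.3f}")
print(classification_report(y_test, y_pred))
```

Repeating steps B-D with X restricted to the Kruskal-Wallis-selected columns gives the second model.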

3.2. Fine-tune \(K\)

A. For each model, perform \(K\)-fold cross-validation to select the best value of \(K\) (the number of neighbors).

B. Evaluate the performance of the two new models using their respective optimal value of \(K\).

C. Conclude.

# To do
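A sketch of the cross-validated search for one model follows; the 5 folds and the odd candidate values of \(K\) up to 29 are arbitrary choices. Wrapping the scaler and the classifier in a Pipeline ensures the standardization is refit inside each fold rather than on the full training set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
X = data.drop(columns=["Id", "type"])
y = data["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling + KNN in one pipeline, so each CV fold scales its own training part
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# 5-fold cross-validation over odd values of K
grid = GridSearchCV(pipe,
                    param_grid={"knn__n_neighbors": list(range(1, 31, 2))},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

best_K = grid.best_params_["knn__n_neighbors"]
test_acc = grid.score(X_test, y_test)  # refit on full training data, scored on test
print(f"Best K = {best_K}, test accuracy = {test_acc:.3f}")
```

The same search run on the Kruskal-Wallis-selected columns gives the tuned version of the second model, and the two test accuracies can then be compared to conclude.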

4. Decision Trees

  • Default setting: Build two different decision tree models on the 80% training data. Test the performance of the models on the testing data.
  • Fine-tuned model: Perform cross-validation to tune the hyperparameters of the models including:
    • depth of the trees
    • minimal size of the terminal nodes (leaves)
    • number of features to be considered at each split.
    • splitting criterion…
  • Measure their performance on the corresponding testing dataset.
  • Conclude.
# To do
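The steps above can be sketched for one tree as follows; the grid values and `random_state=42` are illustrative choices, and decision trees do not require standardized inputs, so the raw features are used directly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
X = data.drop(columns=["Id", "type"])
y = data["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Default setting: a fully grown tree with default hyperparameters
default_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
default_acc = default_tree.score(X_test, y_test)
print(f"Default tree test accuracy: {default_acc:.3f}")

# Fine-tuned model: 5-fold cross-validation over the listed hyperparameters
param_grid = {
    "max_depth": [5, 10, 20, None],      # depth of the tree
    "min_samples_leaf": [1, 5, 10],      # minimal size of the leaves
    "max_features": [None, "sqrt"],      # features considered at each split
    "criterion": ["gini", "entropy"],    # splitting criterion
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

tuned_acc = grid.score(X_test, y_test)
print("Best parameters:", grid.best_params_)
print(f"Tuned tree test accuracy: {tuned_acc:.3f}")
```

Comparing `default_acc` and `tuned_acc` (and the same pair for the second tree) supports the concluding discussion.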

Further Reading

\(^{\text{📚}}\) Pandas python library: https://pandas.pydata.org/docs/getting_started/index.html#getting-started
\(^{\text{📚}}\) Pandas Cheatsheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
\(^{\text{📚}}\) 10 Minutes to Pandas: https://pandas.pydata.org/docs/user_guide/10min.html
\(^{\text{📚}}\) Pandas Lessons on Kaggle: https://www.kaggle.com/learn/pandas
\(^{\text{📚}}\) Chapter 4, An Introduction to Statistical Learning with Applications in R, James et al. (2021).
\(^{\text{📚}}\) The Elements of Statistical Learning, Hastie et al. (2009).
\(^{\text{📚}}\) A Distribution-Free Theory of Nonparametric Regression, Györfi et al. (2002).
\(^{\text{📚}}\) A Probabilistic Theory of Pattern Recognition, Devroye et al. (1996).