Classification: \(K\)-Nearest Neighbors & Decision Trees


CSCI-866-001: Data Mining & Knowledge Discovery



Lecturer: Dr. Sothea HAS

🗺️ Content

  • K-Nearest Neighbors
    • Theoretical Motivation
    • Setting
    • Tuning \(K\)
    • Curse of Dimensionality
  • Decision Trees
    • Classification & Regression Trees
    • Application

1. \(K\)-Nearest Neighbors

Motivation

Motivation

Imputing Missing Pixels

Code
from skimage import data
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import KNNImputer

# Load the astronaut image (512 x 512 x 3)
image = data.astronaut()

# Sample centers and zero out small (up to 4x4) patches: these are the "missing" pixels
n_miss = 100
np.random.seed(42)
missing_id = np.random.choice(512, size=(n_miss, 2))
image_nan = image.copy()

for i in range(n_miss):
    image_nan[(missing_id[i,0]-2):(missing_id[i,0]+2), (missing_id[i,1]-2):(missing_id[i,1]+2), :] = 0
plt.figure(figsize=(4,4))
plt.imshow(image_nan)
plt.axis('off')
plt.show()

  • How would you impute the missing pixels in this image?
Code
for i in range(n_miss):
    for ch in range(3):
        # Neighborhood around the missing patch in channel ch
        temp = image_nan[(missing_id[i,0]-5):(missing_id[i,0]+5), (missing_id[i,1]-5):(missing_id[i,1]+5), ch]
        # Fill the missing patch with the mean of the non-missing (non-zero) neighboring pixels
        image_nan[(missing_id[i,0]-2):(missing_id[i,0]+2), (missing_id[i,1]-2):(missing_id[i,1]+2), ch] = temp[temp > 0].mean()
plt.figure(figsize=(4,4))
plt.imshow(image_nan)
plt.axis('off')
plt.show()

  • Missing pixels \(=\frac{1}{K}\sum_{k=1}^K\text{pixel}_k\) where \(\text{pixel}_k\) are neighboring non-missing pixels.
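  • The code above imports KNNImputer but fills the patches with a hand-rolled local average. As a minimal, hedged sketch of the same idea with scikit-learn (a toy matrix with made-up values, not the image above), KNNImputer replaces each missing entry with the average of that column over the \(K\) nearest rows:
Code
# Sketch only: KNN-based imputation on a small toy matrix (illustrative values)
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.uniform(0, 255, size=(20, 5))                        # toy "pixel" table: 20 samples, 5 features
X[rng.integers(0, 20, 8), rng.integers(0, 5, 8)] = np.nan    # knock out a few entries

imputer = KNNImputer(n_neighbors=5)                          # neighbors found with a NaN-aware Euclidean distance
X_filled = imputer.fit_transform(X)                          # each NaN replaced by the mean over its 5 nearest rows
print(np.isnan(X_filled).any())                              # False: all entries imputed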

Introduction

Supervised Learning Setting

Objective: Predicting target \(y\) using input \(x\).

  • If \(y\) is continuous, the problem is called Regression.
  • If \(y\) is discrete/categorical, it’s called Classification.

Criteria

  • Quality of a prediction \(\hat{y}\) is measured by the loss: \(\ell(\hat{y},y)\).
  • Quality of a model \(\hat{f}\) is measured by the risk: \[{\cal R}(\hat{f})=\mathbb{E}_{X,Y}[\ell(Y, \hat{f}(X))].\]

Theoretical Motivation

Decision theory for classification

  • In classification, misclassification loss: \(\ell(\hat{f}(x),y)=\mathbb{1}_{\{\hat{f}(x)\neq y\}}=\begin{cases}1,\text{ if }\hat{f}(x)\neq y\\0,\text{ if }\hat{f}(x)=y\end{cases}\).
  • The misclassification risk is given by: \[{\cal R}_c(\hat{f})=\mathbb{E}[\ell(\hat{f}(X),Y)]=\mathbb{E}[\mathbb{1}_{\{\hat{f}(X)\neq Y\}}]=\mathbb{P}(\hat{f}(X)\neq Y).\]
  • This is equivalent to classification problems that maximize accuracy score because \[\text{Accuracy}=1-\mathbb{P}(\hat{f}(X)\neq Y)=\mathbb{P}(\hat{f}(X)=Y).\]

In balanced \(M\)-class classification, \[\text{If } g^*(x)=\arg\max_{1\leq k\leq M}\mathbb{P}(Y=k|X=x)\text{ then }{\cal R}_c(g^*)=\inf_{\hat{f}}{\cal R}_c(\hat{f}).\]
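  • A quick numerical sanity check (a hedged sketch with made-up conditional probabilities): on simulated data where \(\mathbb{P}(Y=1|X=x)\) is known by construction, the rule that picks the most probable class attains a lower empirical misclassification risk (equivalently, a higher accuracy) than a competing fixed rule.
Code
# Sketch: the "most probable class" rule vs. a fixed rule on simulated data
import numpy as np

rng = np.random.default_rng(0)
p1 = {0: 0.9, 1: 0.6, 2: 0.2}                 # made-up P(Y=1 | X=x) for x in {0, 1, 2}
X = rng.integers(0, 3, size=100_000)
Y = (rng.random(X.size) < np.vectorize(p1.get)(X)).astype(int)

g_star = (np.vectorize(p1.get)(X) >= 0.5).astype(int)   # predict the most probable class
always_1 = np.ones_like(Y)                              # a competing fixed rule

print("empirical risk of g*:        ", np.mean(g_star != Y))    # ~ (0.1 + 0.4 + 0.2)/3 = 0.23
print("empirical risk of 'always 1':", np.mean(always_1 != Y))  # ~ (0.1 + 0.4 + 0.8)/3 = 0.43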

Theoretical Motivation

Summary

  • The key idea of nonparametric methods: the prediction for any new observation \(\text{x}\) depends on the labels/targets of the training data points that are similar to \(\text{x}\).

  • Example: Missing pixels can be imputed by the average of neighboring pixels.

  • In balanced \(M\)-class classification with Accuracy score, \(g^*(\text{x})=\arg\max_{1\leq k\leq M}\color{blue}{\mathbb{P}(Y=k|X=\text{x})}\) is the best classifier.

  • In the following sections, each method aims at defining a “similarity/neighborhood” structure over the data points.

  • The prediction is the “majority vote” (classification) or “averaging” (regression).

Setting

\(K\)-Nearest Neighbors (\(K\)-NN)

Setting

  • Given iid observations: \(\{(\text{x}_1,y_1),\dots, (\text{x}_n,y_n)\}\subset \mathbb{R}^d\times{\cal Y}\).
  • If \(D\) is a distance on \(\mathbb{R}^d\) (e.g., the Euclidean distance), the \(k\)-th nearest neighbor \(\color{red}{\text{x}_{(k)}}\) of \(\text{x}\in\mathbb{R}^d\) is defined by
    • \(D(\text{x},\text{x}_{(1)})\leq D(\text{x},\text{x}_{(2)})\leq\dots\leq D(\text{x},\color{red}{\text{x}_{(k)}})\leq \dots\leq D(\text{x},\text{x}_{(n)})\).
    • Let \(y_{(1)},\dots,y_{(n)}\) be the target of \(\text{x}_{(1)},\dots,\text{x}_{(n)}\) respectively.
  • If \(K\geq 1\), then \(K\)-NN predicts the target of an input \(\text{x}\) by
  • Regression: \[\begin{align*}\hat{y}&=\frac{1}{K}\sum_{k=1}^Ky_{(k)}\\ &=\text{Average $y_{(k)}$ among the $K$ neighbors}.\end{align*}\]
  • Classification: \[\begin{align*}\hat{y}&=\arg\max_{1\leq m\leq M}\frac{1}{K}\sum_{k=1}^K\mathbb{1}_{\{y_{(k)}=m\}}\\ &=\text{Majority class among the $K$ neighbors.}\end{align*}\]
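  • To make both formulas concrete, here is a hedged from-scratch sketch (toy data, Euclidean distance, illustrative names only): sort the training points by distance to a query \(\text{x}\), then average the \(K\) nearest targets (regression) or take their majority class (classification).
Code
# Minimal from-scratch K-NN prediction (illustrative toy data)
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K=5, task="classification"):
    dists = np.linalg.norm(X_train - x, axis=1)      # D(x, x_i) for every training point
    nn_idx = np.argsort(dists)[:K]                   # indices of x_(1), ..., x_(K)
    y_nn = y_train[nn_idx]                           # their targets y_(1), ..., y_(K)
    if task == "regression":
        return y_nn.mean()                           # average among the K neighbors
    return Counter(y_nn).most_common(1)[0][0]        # majority class among the K neighbors

rng = np.random.default_rng(42)
X_train = rng.normal(size=(30, 2))
y_class = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
y_reg = 2 * X_train[:, 0] + rng.normal(scale=0.1, size=30)

x_new = np.array([0.5, 0.2])
print(knn_predict(X_train, y_class, x_new, K=5))                     # majority vote
print(knn_predict(X_train, y_reg, x_new, K=5, task="regression"))    # neighbor average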

\(K\)-Nearest Neighbors (\(K\)-NN)

Application

  • Let’s work with our Heart Disease Dataset (shape: \(1025\times 14\)) and choose \(K=5\).
  • Since \(K\)-NN is a distance-based method, it’s essential to:
    • Scale the inputs
    • Watch out for outliers/missing values
    • Encode categorical inputs
    • Watch out for the effect of imbalanced classes…
Code
import numpy as np
import pandas as pd
data = pd.read_csv(path + "/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')

# Train test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()

# OnehotEncoding and Scaling
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

# KNN
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["5NN"])
test_perf
Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
  • Q3: Can we do better?
  • A3: Yes! \(K=5\) is arbitrary. We should fine-tune it!

Tuning \(K\)

\(K\)-Nearest Neighbors (\(K\)-NN)

Influence of \(K\)

  • Too small \(K\Leftrightarrow\) too flexible \(\Leftrightarrow\) high variance \(\Leftrightarrow\) Overfitting.
  • Too large \(K\Leftrightarrow\) too rigid \(\Leftrightarrow\) high bias \(\Leftrightarrow\) Underfitting.
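  • A hedged illustration of this trade-off (synthetic data from make_classification; the numbers are illustrative, not the heart-disease results): training accuracy versus cross-validated accuracy as \(K\) varies.
Code
# Sketch: training vs. cross-validated accuracy across K (synthetic data)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
for K in [1, 5, 15, 30]:
    knn = KNeighborsClassifier(n_neighbors=K).fit(X, y)
    train_acc = knn.score(X, y)                                              # accuracy on the training set
    cv_acc = cross_val_score(KNeighborsClassifier(n_neighbors=K), X, y, cv=5).mean()
    print(f"K={K:2d}  train={train_acc:.3f}  cv={cv_acc:.3f}")
# K=1 memorizes the training set (train accuracy 1.0: each point is its own nearest
# neighbor) but typically generalizes worse than moderate values of K.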

\(K\)-Nearest Neighbors (\(K\)-NN)

Fine-tuning \(K\): Cross-validation

  • There are many ways to perform cross-validation in Python.
  • Let’s use GridSearchCV from the sklearn.model_selection module.
Code
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

# Tune K over 1..30 with 10-fold CV, scoring by F1 on the positive class
scorer = make_scorer(f1_score, pos_label=1)
param_grid = {'n_neighbors': list(range(1, 31))}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search.fit(X_train_encoded, y_train)

knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN"])], axis=0)
test_perf
Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
1NN 1.000000 1.000000 1.000000 1.000000 1.000000

\(K\)-Nearest Neighbors (\(K\)-NN)

What can go wrong?

 

  • Duplicated data!
  • Let’s try again.
Code
# Drop duplicated data
data_no_dup = data.drop_duplicates()

# Train test split
from sklearn.model_selection import train_test_split
X, y = data_no_dup.iloc[:,:-1], data_no_dup.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()

# OnehotEncoding and Scaling
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search.fit(X_train_encoded, y_train)

# CV F1 score
import plotly.graph_objects as go  # needed below for go.Figure / go.Scatter
cv_results = grid_search.cv_results_
fig2 = go.Figure(go.Scatter(
    x=param_grid['n_neighbors'], y=cv_results['mean_test_score'],
    mode="markers+lines", name="CV F1 score"))
fig2.add_trace(go.Scatter(x=[grid_search.best_params_['n_neighbors'], grid_search.best_params_['n_neighbors']], 
        y=[np.max(cv_results['mean_test_score']), np.min(cv_results['mean_test_score'])], 
        name="Optimal K", mode="lines", line=dict(color='red', dash="dash")))
fig2.update_layout(title="CV f1-score vs Number of Neighbors (K)",
        width=710, height=200, xaxis_title="Number of Neighbors (K)", yaxis_title="CV f1-score")
fig2.show()

knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': np.round(accuracy_score(y_test, y_pred), 3),
          'Precision': np.round(precision_score(y_test, y_pred),3),
          'Recall': np.round(recall_score(y_test, y_pred),3),
          'F1-score': np.round(f1_score(y_test, y_pred),3),
          'AUC': np.round(roc_auc_score(y_test, y_pred),3)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN_No_Dup"])], axis=0)
test_perf
Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
1NN 1.000000 1.000000 1.000000 1.000000 1.000000
16NN_No_Dup 0.885000 0.882000 0.909000 0.896000 0.883000

Curse of dimensionality

Curse of dimensionality

  • Curse of dimensionality refers to various challenges and phenomena that arise when working with high-dimensional data.
  • The main challenge for \(K\)-NN is that the notion of distance or closeness loses its meaning in high-dimensional spaces.
  • In this scenario, data points tend to be equally distant from one another.
  • Example: Simulate \(\text{x}_1,\dots,\text{x}_n\sim{\cal U}[-5,5]^d\) with \(d=1,10,100,500,1000, 5000, 10000, 50000\).
    • For each dimension \(d\), we compute: \[r(d)=\frac{\max_{i\neq j}D(\text{x}_i,\text{x}_j)}{\min_{i\neq j}D(\text{x}_i,\text{x}_j)}.\]
    • Obtain the following graph 👉
Code
N = 10
Ds = np.zeros(shape=(N*(N-1)//2, 8))
j = 0
for d in [1,10,100, 500, 1000, 5000, 10000, 50000]:
    Xd = np.random.uniform(-5,5,size=(N,d))
    i = 0
    for s in range(1, N):
        for k in range(s):
            Ds[i,j] = np.linalg.norm(Xd[s,:] - Xd[k,:])
            i += 1
    j += 1
import plotly.express as px
df_dist = pd.DataFrame({'Ratio': np.round(Ds.max(axis=0)/Ds.min(axis=0), 2), 'Dim': ['1','10','100', '500', '1000', '5000', '10000', '50000']})
fig4 = px.line(df_dist, x='Dim', y="Ratio", text="Ratio")
fig4.update_layout(
    width=400, height=450, 
    title="Max-min distance ratio at various dimension")
fig4.update_xaxes(title='Dimension')
fig4.update_yaxes(title='Max-Min Distance Ratio')
fig4.update_traces(textposition="top right")
fig4.show()

\(K\)-Nearest Neighbors (\(K\)-NN)

Summary

  • \(K\)-NN is a nonparametric model that predicts the target of any new observation \(\text{x}\) based only on the \(K\) nearest neighbors of \(\text{x}\).
  • Data preprocessing is essential: scaling, encoding, outliers…

  • The key parameter \(K\) can be tuned using cross-validation.

  • \(K\)-NN may not be suitable in high-dimensional cases due to the Curse of dimensionality. However, we can try (a sketch follows below):

    • Feature selection
    • Dimensionality reduction
    • Alternative distance metrics
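  • As a hedged sketch of one of these remedies (dimensionality reduction, on synthetic data; not part of the lecture pipeline), PCA and \(K\)-NN can be chained in a scikit-learn Pipeline so the projection is fit inside cross-validation:
Code
# Sketch: scaling + PCA + K-NN tuned jointly with GridSearchCV (synthetic data)
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=100, n_informative=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # K-NN is distance-based: scale first
    ("pca", PCA()),                   # dimensionality reduction
    ("knn", KNeighborsClassifier()),
])
param_grid = {"pca__n_components": [5, 10, 20], "knn__n_neighbors": [5, 10, 20]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))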

2. Decision Trees

Classification & Regression Trees

Decision Trees

CART: Classification And Regression Trees

  • In \(K\)-NN, “neighbors” are defined by the distance \(D\) (\(K\) points within the ball centered at the query point).
  • In CART, “neighbors” are defined by rectangular regions within the input space.

Decision Trees

CART: Classification And Regression Trees

  • Building a CART \(\Leftrightarrow\) recursively splitting the input space into smaller regions.
  • Rectangular regions \(\Rightarrow\) defining neighbors \(\Rightarrow\) prediction.
  • Start at the root node, where all points belong to the same region (no split yet).
  • At each step, a region is split on feature \(X_j\) at a threshold \(a\in\mathbb{R}\) so that the two subregions \(R_1=\{x\in\mathbb{R}^d: X_j\leq a\}\) and \(R_2=\{x\in\mathbb{R}^d: X_j> a\}\) are as pure as possible.
  • Impurity is quantified by an impurity measure (a small sketch follows this list):
    • Regression: \(\sum_{y\in R_1}(y-\overline{y}_1)^2+\sum_{y\in R_2}(y-\overline{y}_2)^2.\)
    • Classification (\(M\) classes):
      • Misclassification error: \(1-\hat{p}_{k^*}\) where \(k^*\) is the majority class.
      • Gini impurity: \(\sum_{k=1}^M\hat{p}_{k}(1-\hat{p}_{k})\).
      • Entropy: \(-\sum_{k}\hat{p}_{k}\log(\hat{p}_{k})\), where \(\hat{p}_{k}=\frac{\sum_{i\in R}\mathbb{1}_{\{y_i=k\}}}{|R|}\) on any region \(R\).
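  • Here is the small sketch mentioned above (illustrative helper functions, not library code), computing the three classification impurity measures from a region's class counts; the entropy of a 16-vs-3 region matches the \(0.436\) used in the worked example below.
Code
# Sketch: classification impurity measures from a region's class proportions
import numpy as np

def class_proportions(y_region):
    _, counts = np.unique(y_region, return_counts=True)
    return counts / counts.sum()                  # p_hat_k for each class present in R

def misclassification(p):
    return 1 - p.max()                            # 1 - p_hat_{k*}, with k* the majority class

def gini(p):
    return np.sum(p * (1 - p))                    # sum_k p_hat_k (1 - p_hat_k)

def entropy(p):
    return -np.sum(p * np.log(p))                 # -sum_k p_hat_k log(p_hat_k), natural log

y_region = np.array([1] * 16 + [0] * 3)           # a region with 16 points of one class, 3 of the other
p = class_proportions(y_region)
print(round(misclassification(p), 3), round(gini(p), 3), round(entropy(p), 3))
# 0.158 0.266 0.436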

Decision Trees

CART: Classification And Regression Trees

  • First split:
    • \(\text{En}(R_1)=-1\log(1)=0\)
    • \(\text{En}(R_2)=-\color{blue}{16/19\log(16/19)}-\color{red}{3/19\log(3/19)} =0.436\).
    • \(\text{En}_1=\tfrac{11}{30}(0)+\tfrac{19}{30}(0.436)=0.276.\)
    • Info gain: \(\text{En}_0-\text{En}_1\), where \(\text{En}_0\) is the entropy at the parent node before the split.

  • Prediction:
    • Regression: Average target values within the region.
    • Classification: Majority vote of target within the region.
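  • The arithmetic of this first split can be checked numerically (a sketch assuming \(R_1\) holds 11 points of a single class and \(R_2\) holds 16 and 3 points of the two classes, i.e. the counts quoted above):
Code
# Check the entropy numbers of the first split
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                                 # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

en_R1 = entropy([11])                            # pure region: -1 * log(1) = 0
en_R2 = entropy([16, 3])                         # -16/19 log(16/19) - 3/19 log(3/19)
en_1 = (11/30) * en_R1 + (19/30) * en_R2         # weighted impurity after the split
print(round(en_R2, 3), round(en_1, 3))           # 0.436 0.276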

Decision Trees

Pruning tree

  • Hyperparameters of a tree: depth, minimum leaf size, impurity measure, maximum features considered at each split…
  • Deep trees \(\Leftrightarrow\) high variance \(\Rightarrow\) Overfitting.
  • Pruning a tree = removing branches with small improvement.

  • Pruning criterion to be minimized: \(C_{\alpha}(T)=\sum_{j=1}^{|T|}N_j\text{Imp}_j(T)+\alpha|T|\) where
    • \(|T|\): the number of leaves of the tree \(T\),
    • \(N_j\): the number of training points in the \(j\)-th leaf, and
    • \(\text{Imp}_j(T)\): the impurity of the \(j\)-th leaf.
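  • In scikit-learn, this criterion corresponds (up to normalization by the sample size) to minimal cost-complexity pruning, exposed through the ccp_alpha parameter. A hedged sketch on synthetic data (values illustrative):
Code
# Sketch: cost-complexity pruning via ccp_alpha (synthetic data)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate alphas at which pruning changes the tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

for alpha in path.ccp_alphas[::5]:               # subsample the path for brevity
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves():3d}  "
          f"test acc={tree.score(X_te, y_te):.3f}")
# Larger alpha => more pruning (fewer leaves); an intermediate alpha often generalizes best.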

Application

Decision Trees

Application: Heart Disease Dataset

  • We use GridSearchCV to search over the hyperparameters:
    • Impurity (criterion)
    • Depth (max_depth)
    • Minimum samples to split a node (min_samples_split)
    • Minimum size of leaf nodes (min_samples_leaf)
    • Maximum features (max_features).
Code
data = pd.read_csv(path + "/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')

data = data.drop_duplicates()
# Train test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.model_selection import GridSearchCV 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

clf = DecisionTreeClassifier()
param_grid = {'criterion': ['gini', 'entropy'], 
              'max_depth': [3, 5, 10, 15, 20, 25, 30, 35,  40, 45, 50], 
              'min_samples_split': [2, 5, 7, 10, 15], 
              'min_samples_leaf': [3, 4, 5, 6, 8, 10, 15, 20, 25, 30], 
              'max_features': [2,3, 4, 5, 6, 7, 8, 9, 10, 11]}
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='f1_weighted', n_jobs=-1) 
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_ 
y_pred = best_model.predict(X_test)

test_perf = pd.concat([test_perf,  pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Tree"])], axis=0)
print(f"Best hyperparameters: {best_model}")
print("\n")
test_perf.iloc[[0,2,3],:]
Best hyperparameters: DecisionTreeClassifier(max_depth=3, max_features=4, min_samples_leaf=15)

Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
16NN_No_Dup 0.885000 0.882000 0.909000 0.896000 0.883000
Tree 0.819672 0.805556 0.878788 0.840580 0.814394

Decision Trees

Summary

  • CART predicts the target of any input \(\text{x}\) using the target values of the observations falling into the same rectangular region/leaf node as \(\text{x}\).
  • Trees are not sensitive to feature scaling.

  • The key parameters include depth, minimum leaf size, impurity measure, maximum features considered at each split…, which can be tuned using cross-validation.

  • Trees are efficient in high-dimensional cases as they handle features individually.

  • They can handle categorical data as well.

  • Just like \(K\)-NN with small \(K\), deep trees are high-variance methods.

  • They should be pruned.

🥳 It’s party time 🥂