Classification: \(K\)-Nearest Neighbors & Decision Trees


CSCI-866-001: Data Mining & Knowledge Discovery



Lecturer: Dr. Sothea HAS

🗺️ Content

  • K-Nearest Neighbors
    • Theoretical Motivation
    • Setting
    • Tuning \(K\)
    • Curse of Dimensionality
  • Decision Trees
    • Classification & Regression Trees
    • Application

1. \(K\)-Nearest Neighbors

Motivation

Motivation

Imputing Missing Pixels

Code
from skimage import data
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import KNNImputer

# Load the astronaut image (512 x 512 x 3)
image = data.astronaut()

# Sample centers and zero out small (up to 4x4) patches: these are the "missing" pixels
n_miss = 100
np.random.seed(42)
missing_id = np.random.choice(512, size=(n_miss, 2))
image_nan = image.copy()

for i in range(n_miss):
    image_nan[(missing_id[i,0]-2):(missing_id[i,0]+2), (missing_id[i,1]-2):(missing_id[i,1]+2), :] = 0
plt.figure(figsize=(4,4))
plt.imshow(image_nan)
plt.axis('off')
plt.show()

  • How would you impute the missing pixels in this image?
Code
for i in range(n_miss):
    for ch in range(3):
        # Neighborhood around the missing patch in channel ch
        temp = image_nan[(missing_id[i,0]-5):(missing_id[i,0]+5), (missing_id[i,1]-5):(missing_id[i,1]+5), ch]
        # Fill the missing patch with the mean of the non-missing (non-zero) neighboring pixels
        image_nan[(missing_id[i,0]-2):(missing_id[i,0]+2), (missing_id[i,1]-2):(missing_id[i,1]+2), ch] = temp[temp > 0].mean()
plt.figure(figsize=(4,4))
plt.imshow(image_nan)
plt.axis('off')
plt.show()

  • Missing pixels \(=\frac{1}{K}\sum_{k=1}^K\text{pixel}_k\) where \(\text{pixel}_k\) are neighboring non-missing pixels.
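  • The code above imports KNNImputer but fills the patches with a hand-rolled local average. As a minimal, hedged sketch of the same idea with scikit-learn (a toy matrix with made-up values, not the image above), KNNImputer replaces each missing entry with the average of that column over the \(K\) nearest rows:
Code
# Sketch only: KNN-based imputation on a small toy matrix (illustrative values)
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.uniform(0, 255, size=(20, 5))                        # toy "pixel" table: 20 samples, 5 features
X[rng.integers(0, 20, 8), rng.integers(0, 5, 8)] = np.nan    # knock out a few entries

imputer = KNNImputer(n_neighbors=5)                          # neighbors found with a NaN-aware Euclidean distance
X_filled = imputer.fit_transform(X)                          # each NaN replaced by the mean over its 5 nearest rows
print(np.isnan(X_filled).any())                              # False: all entries imputed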

Introduction

Supervised Learning Setting

Objective: Predicting target \(y\) using input \(x\).

  • If \(y\) is continuous, the problem is called Regression.
  • If \(y\) is discrete/categorical, it’s called Classification.

Criteria

  • Quality of a prediction \(\hat{y}\) is measured by the loss: \(\ell(\hat{y},y)\).
  • Quality of a model \(\hat{f}\) is measured by the risk: \[{\cal R}(\hat{f})=\mathbb{E}_{X,Y}[\ell(Y, \hat{f}(X))].\]

Theoretical Motivation

Decision theory for classification

  • In classification, misclassification loss: \(\ell(\hat{f}(x),y)=\mathbb{1}_{\{\hat{f}(x)\neq y\}}=\begin{cases}1,\text{ if }\hat{f}(x)\neq y\\0,\text{ if }\hat{f}(x)=y\end{cases}\).
  • The misclassification risk is given by: \[{\cal R}_c(\hat{f})=\mathbb{E}[\ell(\hat{f}(X),Y)]=\mathbb{E}[\mathbb{1}_{\{\hat{f}(X)\neq Y\}}]=\mathbb{P}(\hat{f}(X)\neq Y).\]
  • This is equivalent to classification problems that maximize accuracy score because \[\text{Accuracy}=1-\mathbb{P}(\hat{f}(X)\neq Y)=\mathbb{P}(\hat{f}(X)=Y).\]

In balanced \(M\)-class classification, \[\text{If } g^*(x)=\arg\max_{1\leq k\leq M}\mathbb{P}(Y=k|X=x)\text{ then }{\cal R}_c(g^*)=\inf_{\hat{f}}{\cal R}_c(\hat{f}).\]
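  • A quick numerical sanity check (a hedged sketch with made-up conditional probabilities): on simulated data where \(\mathbb{P}(Y=1|X=x)\) is known by construction, the rule that picks the most probable class attains a lower empirical misclassification risk (equivalently, a higher accuracy) than a competing fixed rule.
Code
# Sketch: the "most probable class" rule vs. a fixed rule on simulated data
import numpy as np

rng = np.random.default_rng(0)
p1 = {0: 0.9, 1: 0.6, 2: 0.2}                 # made-up P(Y=1 | X=x) for x in {0, 1, 2}
X = rng.integers(0, 3, size=100_000)
Y = (rng.random(X.size) < np.vectorize(p1.get)(X)).astype(int)

g_star = (np.vectorize(p1.get)(X) >= 0.5).astype(int)   # predict the most probable class
always_1 = np.ones_like(Y)                              # a competing fixed rule

print("empirical risk of g*:        ", np.mean(g_star != Y))    # ~ (0.1 + 0.4 + 0.2)/3 = 0.23
print("empirical risk of 'always 1':", np.mean(always_1 != Y))  # ~ (0.1 + 0.4 + 0.8)/3 = 0.43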

Theoretical Motivation

Summary

  • The key idea of nonparametric methods: the prediction for any new observation \(\text{x}\) depends on the labels/targets of the training data points that are similar to \(\text{x}\).

  • Example: Missing pixels can be imputed by the average of neighboring pixels.

  • In balanced \(M\)-class classification with Accuracy score, \(g^*(\text{x})=\arg\max_{1\leq k\leq M}\color{blue}{\mathbb{P}(Y=k|X=\text{x})}\) is the best classifier.

  • In the following sections, each method aims at defining a “similarity/neighborhood” structure over the data points.

  • The prediction is the “majority vote” (classification) or “averaging” (regression).

Setting

\(K\)-Nearest Neighbors (\(K\)-NN)

Setting

  • Given iid observations: \(\{(\text{x}_1,y_1),\dots, (\text{x}_n,y_n)\}\subset \mathbb{R}^d\times{\cal Y}\).
  • If \(D\) is a distance on \(\mathbb{R}^d\) (e.g., the Euclidean distance), the \(k\)-th nearest neighbor \(\color{red}{\text{x}_{(k)}}\) of \(\text{x}\in\mathbb{R}^d\) is defined by
    • \(D(\text{x},\text{x}_{(1)})\leq D(\text{x},\text{x}_{(2)})\leq\dots\leq D(\text{x},\color{red}{\text{x}_{(k)}})\leq \dots\leq D(\text{x},\text{x}_{(n)})\).
    • Let \(y_{(1)},\dots,y_{(n)}\) be the target of \(\text{x}_{(1)},\dots,\text{x}_{(n)}\) respectively.
  • If \(K\geq 1\), then \(K\)-NN predicts the target of an input \(\text{x}\) by
  • Regression: \[\begin{align*}\hat{y}&=\frac{1}{K}\sum_{k=1}^Ky_{(k)}\\ &=\text{Average $y_{(k)}$ among the $K$ neighbors}.\end{align*}\]
  • Classification: \[\begin{align*}\hat{y}&=\arg\max_{1\leq m\leq M}\frac{1}{K}\sum_{k=1}^K\mathbb{1}_{\{y_{(k)}=m\}}\\ &=\text{Majority class among the $K$ neighbors.}\end{align*}\]
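  • To make both formulas concrete, here is a hedged from-scratch sketch (toy data, Euclidean distance, illustrative names only): sort the training points by distance to a query \(\text{x}\), then average the \(K\) nearest targets (regression) or take their majority class (classification).
Code
# Minimal from-scratch K-NN prediction (illustrative toy data)
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K=5, task="classification"):
    dists = np.linalg.norm(X_train - x, axis=1)      # D(x, x_i) for every training point
    nn_idx = np.argsort(dists)[:K]                   # indices of x_(1), ..., x_(K)
    y_nn = y_train[nn_idx]                           # their targets y_(1), ..., y_(K)
    if task == "regression":
        return y_nn.mean()                           # average among the K neighbors
    return Counter(y_nn).most_common(1)[0][0]        # majority class among the K neighbors

rng = np.random.default_rng(42)
X_train = rng.normal(size=(30, 2))
y_class = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
y_reg = 2 * X_train[:, 0] + rng.normal(scale=0.1, size=30)

x_new = np.array([0.5, 0.2])
print(knn_predict(X_train, y_class, x_new, K=5))                     # majority vote
print(knn_predict(X_train, y_reg, x_new, K=5, task="regression"))    # neighbor average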

\(K\)-Nearest Neighbors (\(K\)-NN)

Application

  • Let’s work with our Heart Disease Dataset (shape: \(1025\times 14\)) and choose \(K=5\).
  • Since \(K\)-NN is a distance-based method, it’s essential to:
    • Scale the inputs
    • Watch out for outliers/missing values
    • Encode categorical inputs
    • Watch out for the effect of imbalanced classes…
Code
import numpy as np
import pandas as pd
data = pd.read_csv(path + "/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')

# Train test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()

# OnehotEncoding and Scaling
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

# KNN
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["5NN"])
test_perf
Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
  • Q3: Can we do better?
  • A3: Yes! \(K=5\) is arbitrary. We should fine-tune it!

Tuning \(K\)

\(K\)-Nearest Neighbors (\(K\)-NN)

Influence of \(K\)

  • Too small \(K\Leftrightarrow\) too flexible \(\Leftrightarrow\) high variance \(\Leftrightarrow\) Overfitting.
  • Too large \(K\Leftrightarrow\) too rigid \(\Leftrightarrow\) high bias \(\Leftrightarrow\) Underfitting.
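  • A hedged illustration of this trade-off (synthetic data from make_classification; the numbers are illustrative, not the heart-disease results): training accuracy versus cross-validated accuracy as \(K\) varies.
Code
# Sketch: training vs. cross-validated accuracy across K (synthetic data)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
for K in [1, 5, 15, 30]:
    knn = KNeighborsClassifier(n_neighbors=K).fit(X, y)
    train_acc = knn.score(X, y)                                              # accuracy on the training set
    cv_acc = cross_val_score(KNeighborsClassifier(n_neighbors=K), X, y, cv=5).mean()
    print(f"K={K:2d}  train={train_acc:.3f}  cv={cv_acc:.3f}")
# K=1 memorizes the training set (train accuracy 1.0: each point is its own nearest
# neighbor) but typically generalizes worse than moderate values of K.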

\(K\)-Nearest Neighbors (\(K\)-NN)

Fine-tuning \(K\): Cross-validation

  • There are many ways to perform cross-validation in Python.
  • Let’s use GridSearchCV from the sklearn.model_selection module.
Code
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

# Tune K over 1..30 with 10-fold CV, scoring by F1 on the positive class
scorer = make_scorer(f1_score, pos_label=1)
param_grid = {'n_neighbors': list(range(1, 31))}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search.fit(X_train_encoded, y_train)

knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN"])], axis=0)
test_perf
Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
1NN 1.000000 1.000000 1.000000 1.000000 1.000000

\(K\)-Nearest Neighbors (\(K\)-NN)

What can go wrong?

 

  • Duplicated data!
  • Let’s try again.
Code
# Drop duplicated data
data_no_dup = data.drop_duplicates()

# Train test split
from sklearn.model_selection import train_test_split
X, y = data_no_dup.iloc[:,:-1], data_no_dup.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()

# OnehotEncoding and Scaling
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search.fit(X_train_encoded, y_train)

# CV F1 score
import plotly.graph_objects as go  # needed below for go.Figure / go.Scatter
cv_results = grid_search.cv_results_
fig2 = go.Figure(go.Scatter(
    x=param_grid['n_neighbors'], y=cv_results['mean_test_score'],
    mode="markers+lines", name="CV F1 score"))
fig2.add_trace(go.Scatter(x=[grid_search.best_params_['n_neighbors'], grid_search.best_params_['n_neighbors']], 
        y=[np.max(cv_results['mean_test_score']), np.min(cv_results['mean_test_score'])], 
        name="Optimal K", mode="lines", line=dict(color='red', dash="dash")))
fig2.update_layout(title="CV f1-score vs Number of Neighbors (K)",
        width=710, height=200, xaxis_title="Number of Neighbors (K)", yaxis_title="CV f1-score")
fig2.show()

knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': np.round(accuracy_score(y_test, y_pred), 3),
          'Precision': np.round(precision_score(y_test, y_pred),3),
          'Recall': np.round(recall_score(y_test, y_pred),3),
          'F1-score': np.round(f1_score(y_test, y_pred),3),
          'AUC': np.round(roc_auc_score(y_test, y_pred),3)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN_No_Dup"])], axis=0)
test_perf
Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
1NN 1.000000 1.000000 1.000000 1.000000 1.000000
16NN_No_Dup 0.885000 0.882000 0.909000 0.896000 0.883000

Curse of dimensionality

Curse of dimensionality

  • Curse of dimensionality refers to various challenges and phenomena that arise when working with high-dimensional data.
  • The main challenge for \(K\)-NN is that the notion of distance or closeness loses its meaning in high-dimensional spaces.
  • In this scenario, data points tend to be equally distant from one another.
  • Example: Simulate \(\text{x}_1,\dots,\text{x}_n\sim{\cal U}[-5,5]^d\) with \(d=1,10,100,500,1000, 5000, 10000, 50000\).
    • For each dimension \(d\), we compute: \[r(d)=\frac{\max_{i\neq j}D(\text{x}_i,\text{x}_j)}{\min_{i\neq j}D(\text{x}_i,\text{x}_j)}.\]
    • Obtain the following graph 👉
Code
N = 10
Ds = np.zeros(shape=(N*(N-1)//2, 8))
j = 0
for d in [1,10,100, 500, 1000, 5000, 10000, 50000]:
    Xd = np.random.uniform(-5,5,size=(N,d))
    i = 0
    for s in range(1, N):
        for k in range(s):
            Ds[i,j] = np.linalg.norm(Xd[s,:] - Xd[k,:])
            i += 1
    j += 1
import plotly.express as px
df_dist = pd.DataFrame({'Ratio': np.round(Ds.max(axis=0)/Ds.min(axis=0), 2), 'Dim': ['1','10','100', '500', '1000', '5000', '10000', '50000']})
fig4 = px.line(df_dist, x='Dim', y="Ratio", text="Ratio")
fig4.update_layout(
    width=400, height=450, 
    title="Max-min distance ratio at various dimension")
fig4.update_xaxes(title='Dimension')
fig4.update_yaxes(title='Max-Min Distance Ratio')
fig4.update_traces(textposition="top right")
fig4.show()

\(K\)-Nearest Neighbors (\(K\)-NN)

Summary

  • \(K\)-NN is a nonparametric model that predicts the target of any new observation \(\text{x}\) based only on the \(K\) nearest neighbors of \(\text{x}\).
  • Data preprocessing is essential: scaling, encoding, outliers…

  • The key parameter \(K\) can be tuned using cross-validation.

  • \(K\)-NN may not be suitable in high-dimensional cases due to the Curse of dimensionality. However, we can try (a sketch follows below):

    • Feature selection
    • Dimensionality reduction
    • Alternative distance metrics
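  • As a hedged sketch of one of these remedies (dimensionality reduction, on synthetic data; not part of the lecture pipeline), PCA and \(K\)-NN can be chained in a scikit-learn Pipeline so the projection is fit inside cross-validation:
Code
# Sketch: scaling + PCA + K-NN tuned jointly with GridSearchCV (synthetic data)
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=100, n_informative=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # K-NN is distance-based: scale first
    ("pca", PCA()),                   # dimensionality reduction
    ("knn", KNeighborsClassifier()),
])
param_grid = {"pca__n_components": [5, 10, 20], "knn__n_neighbors": [5, 10, 20]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))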

2. Decision Trees

Classification & Regression Trees

Decision Trees

CART: Classification And Regression Trees

  • In \(K\)-NN, “neighbors” are defined by the distance \(D\) (\(K\) points within the ball centered at the query point).
  • In CART, “neighbors” are defined by rectangular regions within the input space.

Decision Trees

CART: Classification And Regression Trees

  • Building a CART \(\Leftrightarrow\) recursively splitting the input space into smaller regions.
  • Rectangular regions \(\Rightarrow\) defining neighbors \(\Rightarrow\) prediction.
  • Start at the root node, where all points belong to the same region (no split yet).
  • At each step, a region is split on feature \(X_j\) at a threshold \(a\in\mathbb{R}\) so that the two subregions \(R_1=\{x\in\mathbb{R}^d: X_j\leq a\}\) and \(R_2=\{x\in\mathbb{R}^d: X_j> a\}\) are as pure as possible.
  • Impurity is quantified by an impurity measure (a small sketch follows this list):
    • Regression: \(\sum_{y\in R_1}(y-\overline{y}_1)^2+\sum_{y\in R_2}(y-\overline{y}_2)^2.\)
    • Classification (\(M\) classes):
      • Misclassification error: \(1-\hat{p}_{k^*}\) where \(k^*\) is the majority class.
      • Gini impurity: \(\sum_{k=1}^M\hat{p}_{k}(1-\hat{p}_{k})\).
      • Entropy: \(-\sum_{k}\hat{p}_{k}\log(\hat{p}_{k})\), where \(\hat{p}_{k}=\frac{\sum_{i\in R}\mathbb{1}_{\{y_i=k\}}}{|R|}\) on any region \(R\).
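  • Here is the small sketch mentioned above (illustrative helper functions, not library code), computing the three classification impurity measures from a region's class counts; the entropy of a 16-vs-3 region matches the \(0.436\) used in the worked example below.
Code
# Sketch: classification impurity measures from a region's class proportions
import numpy as np

def class_proportions(y_region):
    _, counts = np.unique(y_region, return_counts=True)
    return counts / counts.sum()                  # p_hat_k for each class present in R

def misclassification(p):
    return 1 - p.max()                            # 1 - p_hat_{k*}, with k* the majority class

def gini(p):
    return np.sum(p * (1 - p))                    # sum_k p_hat_k (1 - p_hat_k)

def entropy(p):
    return -np.sum(p * np.log(p))                 # -sum_k p_hat_k log(p_hat_k), natural log

y_region = np.array([1] * 16 + [0] * 3)           # a region with 16 points of one class, 3 of the other
p = class_proportions(y_region)
print(round(misclassification(p), 3), round(gini(p), 3), round(entropy(p), 3))
# 0.158 0.266 0.436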

Decision Trees

CART: Classification And Regression Trees

  • First split:
    • \(\text{En}(R_1)=-1\log(1)=0\)
    • \(\text{En}(R_2)=-\color{blue}{16/19\log(16/19)}-\color{red}{3/19\log(3/19)} =0.436\).
    • \(\text{En}_1=\tfrac{11}{30}(0)+\tfrac{19}{30}(0.436)=0.276.\)
    • Info gain: \(\text{En}_0-\text{En}_1\), where \(\text{En}_0\) is the entropy at the parent node before the split.

  • Prediction:
    • Regression: Average target values within the region.
    • Classification: Majority vote of target within the region.
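  • The arithmetic of this first split can be checked numerically (a sketch assuming \(R_1\) holds 11 points of a single class and \(R_2\) holds 16 and 3 points of the two classes, i.e. the counts quoted above):
Code
# Check the entropy numbers of the first split
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                                 # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

en_R1 = entropy([11])                            # pure region: -1 * log(1) = 0
en_R2 = entropy([16, 3])                         # -16/19 log(16/19) - 3/19 log(3/19)
en_1 = (11/30) * en_R1 + (19/30) * en_R2         # weighted impurity after the split
print(round(en_R2, 3), round(en_1, 3))           # 0.436 0.276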

Decision Trees

Pruning tree

  • Hyperparameters of a tree: depth, minimum leaf size, impurity measure, maximum features considered at each split…
  • Deep trees \(\Leftrightarrow\) high variance \(\Rightarrow\) Overfitting.
  • Pruning a tree = removing branches with small improvement.

  • Pruning criterion to be minimized: \(C_{\alpha}(T)=\sum_{j=1}^{|T|}N_j\text{Imp}_j(T)+\alpha|T|\) where
    • \(|T|\): the number of leaves of the tree \(T\),
    • \(N_j\): the number of training points in the \(j\)-th leaf, and
    • \(\text{Imp}_j(T)\): the impurity of the \(j\)-th leaf.
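  • In scikit-learn, this criterion corresponds (up to normalization by the sample size) to minimal cost-complexity pruning, exposed through the ccp_alpha parameter. A hedged sketch on synthetic data (values illustrative):
Code
# Sketch: cost-complexity pruning via ccp_alpha (synthetic data)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate alphas at which pruning changes the tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

for alpha in path.ccp_alphas[::5]:               # subsample the path for brevity
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves():3d}  "
          f"test acc={tree.score(X_te, y_te):.3f}")
# Larger alpha => more pruning (fewer leaves); an intermediate alpha often generalizes best.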

Application

Decision Trees

Application: Heart Disease Dataset

  • We use GridSearchCV to search over the hyperparameters:
    • Impurity (criterion)
    • Depth (max_depth)
    • Minimum samples to split a node (min_samples_split)
    • Minimum size of leaf nodes (min_samples_leaf)
    • Maximum features (max_features).
Code
data = pd.read_csv(path + "/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')

data = data.drop_duplicates()
# Train test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.model_selection import GridSearchCV 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

clf = DecisionTreeClassifier()
param_grid = {'criterion': ['gini', 'entropy'], 
              'max_depth': [3, 5, 10, 15, 20, 25, 30, 35,  40, 45, 50], 
              'min_samples_split': [2, 5, 7, 10, 15], 
              'min_samples_leaf': [3, 4, 5, 6, 8, 10, 15, 20, 25, 30], 
              'max_features': [2,3, 4, 5, 6, 7, 8, 9, 10, 11]}
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='f1_weighted', n_jobs=-1) 
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_ 
y_pred = best_model.predict(X_test)

test_perf = pd.concat([test_perf,  pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Tree"])], axis=0)
print(f"Best hyperparameters: {best_model}")
print("\n")
test_perf.iloc[[0,2,3],:]
Best hyperparameters: DecisionTreeClassifier(max_depth=3, max_features=4, min_samples_leaf=15)

Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
16NN_No_Dup 0.885000 0.882000 0.909000 0.896000 0.883000
Tree 0.819672 0.805556 0.878788 0.840580 0.814394

Decision Trees

Summary

  • CART predicts the target of any input \(\text{x}\) using the target values of the observations falling into the same rectangular region/leaf node as \(\text{x}\).
  • Trees are not sensitive to feature scaling.

  • The key parameters include depth, minimum leaf size, impurity measure, maximum features considered at each split…, which can be tuned using cross-validation.

  • Trees are efficient in high-dimensional cases as they handle features individually.

  • They can handle categorical data as well.

  • Just like \(K\)-NN with small \(K\), deep trees are high-variance methods.

  • They should be pruned.

🥳 It’s party time 🥂