\(k\)-nearest neighbors


ITM 390 004: Machine Learning

Lecturer: Dr. Sothea HAS

🗺️ Content

  • Motivation & Introduction

  • Euclidean Distance

  • \(k\)-Nearest Neighbors

  • Fine-tune \(k\)

  • Performance metrics

  • Application

Motivation & Introduction

Motivation & Introduction

Motivation

  • Linear Regression is for regression problems.
  • Logistic Regression is for classification problems.

  • Both models require an input-output formula or form.
  • Do we have something that
    • Works both for classification & regression?
    • DOESN’T assume any input-output formula for prediction?

Motivation & Introduction

Introduction

  • Models that DO NOT assume any input-output form for prediction are known as Non-parametric models.
  • In this case, the prediction is based on two main points
    • Similarity of input data (\(\color{blue}{\text{x}}\approx\color{red}{\text{x}_i}\)?)
    • Using the outputs of those similar points (\(\color{red}{y_i}\)) to predict the output of the query point (\(\color{blue}{y}\)).
  • We use this idea all the time:
    • “You are the average of the five people you spend the most time with”—Jim Rohn.
    • “The sky is so dark, it’s going to be raining!”…
        x1         x2  y
 -0.752759   2.704286  1
  1.935603  -0.838856  0
 -0.546282  -1.960234  0
  0.952162  -2.022393  0
 -0.955179   2.584544  1
 -2.458261   2.011815  1
  2.449595  -1.562629  0
  1.065386  -2.900473  0
 -0.793301   0.793835  1
  2.015881   1.175845  0
 -0.016509  -1.194730  0

Euclidean distance

Euclidean distance

  • The core idea of some Non-parametric models is using the outputs of similar data points to predict any query point.
  • But what does similar or different mean?
  • In ML, we often use distances to measure how different the data points are.
  • The most common distance is the Euclidean one:
    • For example: \(A=(1,3,4)\) and \(B=(-1,2,5)\) then \[D(A,B)=\sqrt{(1-(-1))^2+(3-2)^2+(4-5)^2}=\sqrt{6}\approx 2.45\ (\text{unit}).\]
  • For two input data \(\color{blue}{\text{x}=(x_1,x_2,...,x_d)}\) and \(\color{red}{\text{x'}=(x_1',x_2',...,x_d')}\), the Euclidean distance between them is given by \[D(\color{blue}{\text{x}},\color{red}{\text{x}'})=\sqrt{\sum_{i=1}^d(\color{blue}{x_i}-\color{red}{x_i'})^2}.\]
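  • A quick numerical check of the worked example above (a minimal sketch using NumPy; both lines compute the same distance):
Code
import numpy as np

# Points from the worked example above
A = np.array([1, 3, 4])
B = np.array([-1, 2, 5])

# Euclidean distance: square root of the sum of squared coordinate differences
D = np.sqrt(np.sum((A - B) ** 2))   # explicit formula
D_alt = np.linalg.norm(A - B)       # equivalent, using the built-in norm
print(D, D_alt)                     # both print 2.449... (= sqrt(6))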

Euclidean distance

  • For two input data \(\color{blue}{\text{x}=(x_1,x_2,...,x_d)}\) and \(\color{red}{\text{x'}=(x_1',x_2',...,x_d')}\), the Euclidean distance between them is given by \[D(\color{blue}{\text{x}},\color{red}{\text{x}'})=\sqrt{\sum_{i=1}^d(\color{blue}{x_i}-\color{red}{x_i'})^2}.\]

🔑 Smaller distance = Closer the points = More similar the data.

        x1         x2  y
 -0.752759   2.704286  1
  1.935603  -0.838856  0
 -0.546282  -1.960234  0
  0.952162  -2.022393  0
 -0.955179   2.584544  1
  • Can you identify the most similar point to the first point based on its input?
  • What’s the label of that nearest point?
  • Assume that you know the labels of all the points except for the first one.
  • 🤔 How would you guess its label? (See the sketch below.)
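  • A small sketch of how one might answer this with NumPy, using only the five rows shown above (the labels come straight from the table):
Code
import numpy as np

# The five points from the table above: columns are (x1, x2); y holds the labels
X = np.array([[-0.752759,  2.704286],
              [ 1.935603, -0.838856],
              [-0.546282, -1.960234],
              [ 0.952162, -2.022393],
              [-0.955179,  2.584544]])
y = np.array([1, 0, 0, 0, 1])

# Distances from the first point to every other point
dists = np.linalg.norm(X[1:] - X[0], axis=1)
nearest = np.argmin(dists) + 1      # +1 because row 0 was excluded
print(nearest, y[nearest])          # the nearest point is the last row (x1≈-0.955, x2≈2.585), labelled 1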

\(k\)-Nearest Neighbors (\(k\)-NN)

  • Given the training data: \(\{(\text{x}_1,y_1),\dots, (\text{x}_n,y_n)\}\subset \mathbb{R}^d\times{\cal Y}\).
  • If \(D\) is a distance on \(\mathbb{R}^d\) (e.g. Euclidean distance), \(\color{red}{\text{x}_{(k)}}\) is called the \(k\)-th nearest neighbor of \(\color{blue}{\text{x}}\in\mathbb{R}^d\) if its distance to \(\color{blue}{\text{x}}\) ranks \(k\)-th among all the input points, i.e.,
    • \(D(\color{blue}{\text{x}},\text{x}_{(1)})\leq D(\color{blue}{\text{x}},\text{x}_{(2)})\leq\dots\leq D(\color{blue}{\text{x}},\text{x}_{(k-1)})\leq D(\color{blue}{\text{x}},\color{red}{\text{x}_{(k)}})\leq \dots\leq D(\color{blue}{\text{x}},\text{x}_{(n)})\).
    • Let \(y_{(1)},\dots,y_{(n)}\) be the target of \(\text{x}_{(1)},\dots,\text{x}_{(n)}\) respectively.
  • If \(k\geq 1\), then \(k\)-NN predicts the target of an input \(\color{blue}{\text{x}}\) by
  • Regression: \[\begin{align*}\color{blue}{\hat{y}}&=\frac{1}{k}\sum_{j=1}^ky_{(j)}\\ &=\text{Average $y_{(j)}$ among the $k$ neighbors}.\\ &=\text{The predicted value.}\end{align*}\]
  • Classification with \(M\) classes: \[\begin{align*}\color{blue}{\hat{y}}&=\arg\max_{1\leq m\leq M}\frac{1}{k}\sum_{j=1}^k\mathbb{1}_{\{y_{(j)}=m\}}\\ &=\text{Majority group among the $k$ neighbors.}\\ &=\text{The predicted class.}\end{align*}\]

\(k\)-Nearest Neighbors (\(k\)-NN)

Example

  • Regression: \[\begin{align*}\color{blue}{\hat{y}}&=\frac{1}{k}\sum_{j=1}^ky_{(j)}\\ &=\text{Average $y_{(j)}$ among the $k$ neighbors}.\\ &=\text{The predicted value.}\end{align*}\]
  • Classification with \(M\) classes: \[\begin{align*}\color{blue}{\hat{y}}&=\arg\max_{1\leq m\leq M}\frac{1}{k}\sum_{j=1}^k\mathbb{1}_{\{y_{(j)}=m\}}\\ &=\text{Majority group among the $k$ neighbors.}\\ &=\text{The predicted class.}\end{align*}\]
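  • To make both rules concrete, here is a minimal from-scratch sketch (the helper names knn_regress and knn_classify are just illustrative):
Code
import numpy as np
from collections import Counter

def knn_regress(X_train, y_train, x_query, k=3):
    """Regression rule: average the targets of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]              # indices of the k smallest distances
    return y_train[nearest].mean()

def knn_classify(X_train, y_train, x_query, k=3):
    """Classification rule: majority vote among the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny example with 2-D points similar to the data shown earlier
X = np.array([[-0.75, 2.70], [1.94, -0.84], [-0.55, -1.96], [0.95, -2.02], [-0.96, 2.58]])
y = np.array([1, 0, 0, 0, 1])
print(knn_classify(X, y, np.array([-0.8, 2.5]), k=3))   # -> 1 (two of the three nearest points have label 1)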

\(k\)-Nearest Neighbors (\(k\)-NN)

Influence of \(k\)

  • Too large \(k\Leftrightarrow\) Using many points \(\Leftrightarrow\) too inflexible \(\Leftrightarrow\) Underfitting.

  • Too small \(k\Leftrightarrow\) Using fewer points \(\Leftrightarrow\) too flexible \(\Leftrightarrow\) Overfitting (see the sketch below for this trade-off).
  • How to choose a good \(k\)?
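  • A quick way to see this trade-off on simulated data (a sketch; the exact scores vary from run to run):
Code
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy two-class data: very flexible models can fit the noise
X, y = make_moons(n_samples=500, noise=0.35, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

for k in [1, 15, 300]:                           # very small, moderate, very large
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, knn.score(X_tr, y_tr), knn.score(X_va, y_va))
# k=1  : (near-)perfect training accuracy but a weaker validation score (overfitting)
# k=300: both scores drop because the prediction is too rigid (underfitting)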

Fine-tune \(k\)

Fine-tune \(k\)

Data splitting: Train/Validate/Test

  • A good model is one that generalizes well to new, unseen data.
  • The first attempt: splitting the data into 3 parts (see the split sketch below).
Set          Common %   Purpose
Train        60%–70%    For training the model
Validation   15%–20%    For tuning hyperparameters (e.g., \(k\))
Test         15%–20%    For evaluating final model performance
  • In this case, the best \(k\) is the one achieving the best performance on the Validation set.
  • The final performance is measured using the Test set.
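  • A common way to obtain the three sets is to call train_test_split twice (a sketch of a 60/20/20 split, assuming X and y hold the full inputs and targets):
Code
from sklearn.model_selection import train_test_split

# First split off the test set (20%), then carve a validation set out of the rest
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% = 20% of the original data, giving a 60/20/20 split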

Fine-tune \(k\)

Data splitting: \(K\)-fold Cross-Validation

  • In the previous splitting scheme, the best \(k\) depends strongly on the split.
  • To reduce this dependency, a more stable scheme called \(K\)-fold Cross-Validation is used.

Pseudocode

  • For \(\color{blue}{k}\) in [1,2,3,...,N]:
    • For f in [1,...,K]:
      • Train \(k\)-NN on all data except for fold f.
      • Predict and measure performance on fold f.
      • Save the performance as \(\epsilon_f\).
    • Compute CV performance for \(\color{blue}{k}\): \[\text{CV}(\color{blue}{k})=\frac{1}{K}\sum_{f=1}^K\epsilon_f.\]
  • Choose the \(k\) with the best CV performance (a code sketch follows below).
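  • The pseudocode translates almost line by line into cross_val_score (a sketch, assuming X_train and y_train are already scaled/encoded arrays):
Code
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

K = 10                               # number of folds
candidate_ks = range(1, 21)          # values of k to try
cv_scores = []
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # cross_val_score trains on K-1 folds and scores on the held-out fold, K times
    scores = cross_val_score(knn, X_train, y_train, cv=K)
    cv_scores.append(scores.mean())  # CV(k) = average performance over the folds

best_k = candidate_ks[int(np.argmax(cv_scores))]
print(best_k)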

Performance metrics

Performance metrics

  • Selecting the best \(k\) depends not only on the splitting scheme, but also on the performance metric, which defines what “best” actually means.
  • What is a performance metric?
  • It is a value that measures the quality of a model when it is used to predict new, unseen observations.
  • They are divided into two main types:
    • Score: larger \(\Leftrightarrow\) better model.
    • Error: smaller \(\Leftrightarrow\) better model.
  • ⚠️ Do not confuse:
    • Metric: for fine-tuning the key hyperparameters of the model (measured on validation or test data).
      • Example: \(R^2\), Adjusted \(R^2\), Accuracy…
    • Loss: for training the model; it is computed on the training data.
      • Example: Mean Squared Error (MSE), Mean Absolute Error (MAE)…

Performance metrics

Regression metrics

  • These are some common metrics in regression problems.
  • Mean Squared Error (MSE): \[\text{MSE}=\frac{1}{\text{n}_{\text{test}}}\sum_{i=1}^{\text{n}_{\text{test}}}(\color{blue}{y_i}-\color{red}{\hat{y}_i})^2.\]
  • Mean Absolute Error (MAE): \[\text{MAE}=\frac{1}{\text{n}_{\text{test}}}\sum_{i=1}^{\text{n}_{\text{test}}}|\color{blue}{y_i}-\color{red}{\hat{y}_i}|.\]
  • Root Mean Squared Error (RMSE): \[\text{RMSE}=\sqrt{\frac{1}{\text{n}_{\text{test}}}\sum_{i=1}^{\text{n}_{\text{test}}}(\color{blue}{y_i}-\color{red}{\hat{y}_i})^2}.\]
  • Mean Absolute Percentage Error (MAPE): \[\text{MAPE}=\frac{1}{\text{n}_{\text{test}}}\sum_{i=1}^{\text{n}_{\text{test}}}\left|\frac{\color{blue}{y_i}-\color{red}{\hat{y}_i}}{\color{blue}{y_i}}\right|.\]
  • Coefficient of Determination: \(R^2=1-\sum_{i=1}^{\text{n}_{\text{test}}}(\color{blue}{y_i}-\color{red}{\hat{y}_i})^2/\sum_{i=1}^{\text{n}_{\text{test}}}(\color{blue}{y_i}-\overline{\color{blue}{y}})^2.\)
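  • All of these are available in sklearn.metrics (a small sketch on made-up predictions, just to show the calls; MAPE requires a reasonably recent scikit-learn):
Code
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up test targets
y_pred = np.array([2.8, 5.4, 2.0, 6.5])   # made-up predictions

mse  = mean_squared_error(y_true, y_pred)
mae  = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE is just the square root of MSE
mape = mean_absolute_percentage_error(y_true, y_pred)
r2   = r2_score(y_true, y_pred)
print(mse, mae, rmse, mape, r2)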

Performance metrics

Classification metrics

  • These are some common metrics in classification problems.
  • Misclassification Error (ME): \[\text{ME}=\frac{\text{#}\{i:\color{blue}{y_i}\neq\color{red}{\hat{y}_i}\}}{\text{n}_{\text{test}}}.\]

  • Precision: \[\text{Precision}=\frac{\text{True Positive}}{\text{All Positive Predictions}}.\]

  • Accuracy: \[\text{Accuracy}=\frac{\text{#}\{i:\color{blue}{y_i}=\color{red}{\hat{y}_i}\}}{\text{n}_{\text{test}}}.\]

  • Recall: \[\text{Recall}=\frac{\text{True Positive}}{\text{All Positive Labels}}.\]

  • F1-score: \[\text{F1-score}=\frac{2\times \text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}.\]
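  • The classification metrics have direct counterparts in sklearn.metrics as well (a sketch on made-up labels):
Code
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # made-up test labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # made-up predictions

print(accuracy_score(y_true, y_pred))         # share of correct predictions
print(1 - accuracy_score(y_true, y_pred))     # misclassification error (ME)
print(precision_score(y_true, y_pred))        # TP / all positive predictions
print(recall_score(y_true, y_pred))           # TP / all positive labels
print(f1_score(y_true, y_pred))               # harmonic mean of precision and recall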

Application

\(k\)-Nearest Neighbors (\(k\)-NN)

In action

  • Let’s work with our Heart Disease Dataset (shape: \(1025\times 14\)) and choose \(k=5\).
  • Since \(k\)-NN is a distance-based method, it is essential to:
    • Scale the inputs
    • Watch out for outliers/missing values
    • Encode categorical inputs
    • Watch out for the effect of imbalanced classes…
Code
import numpy as np
import pandas as pd
data = pd.read_csv(path + "/heart.csv")  # `path` is the folder containing heart.csv
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')

# Train test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()

# One-hot encoding and scaling
# (get_dummies is applied to train and test separately; this assumes every category appears in both splits)
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

# KNN
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=["5NN"])
test_perf
Accuracy Precision Recall F1-score
5NN 0.868293 0.861111 0.885714 0.873239
  • Q: Can we do better?
  • A: Yes! \(k=5\) was an arbitrary choice. We should fine-tune it!

\(k\)-Nearest Neighbors (\(k\)-NN)

Fine-tuning \(k\): Cross-validation

  • There are many ways to perform cross-validation in Python.
  • Let’s use GridSearchCV from sklearn.model_selection module.
Code
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

# Tune k with 10-fold CV, using the F1-score (positive class = 1) as the metric
scorer = make_scorer(f1_score, pos_label=1)
param_grid = {'n_neighbors': list(range(1,20))}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search.fit(X_train_encoded, y_train)

knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN"])], axis=0)
test_perf
Accuracy Precision Recall F1-score
5NN 0.868293 0.861111 0.885714 0.873239
1NN 1.000000 1.000000 1.000000 1.000000

\(k\)-Nearest Neighbors (\(k\)-NN)

What can go wrong?

 

  • A perfect test score for \(1\)-NN is a red flag: the dataset contains duplicated rows, so some test observations also appear (as exact copies) in the training set, and their nearest neighbor is at distance \(0\).
  • Let’s drop the duplicates and try again (a sketch of this step follows the results table below).
Accuracy Precision Recall F1-score
5NN 0.868293 0.861111 0.885714 0.873239
1NN 1.000000 1.000000 1.000000 1.000000
16NN_No_Dup 0.885246 0.882353 0.909091 0.895522
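  • A sketch of the fix (reusing the imports and the preprocessing/grid-search steps from the previous slides; only the splitting step changes):
Code
# Drop duplicated rows BEFORE splitting, so no observation appears in both train and test
data_nodup = data.drop_duplicates()
X, y = data_nodup.iloc[:, :-1], data_nodup.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# ...then repeat the one-hot encoding, scaling and GridSearchCV steps exactly as before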

\(k\)-Nearest Neighbors (\(k\)-NN)

Curse of dimensionality

  • Curse of dimensionality refers to various challenges and phenomena that arise when working with high-dimensional data.
  • The main challenge for \(k\)-NN is that distances, and hence the notion of closeness, lose their meaning in high-dimensional spaces.
  • In this scenario, data points tend to be equally distant from one another.
  • Example: Simulate \(\text{x}_1,\dots,\text{x}_n\sim{\cal U}[-5,5]^d\) with \(d=1,10,100,500,1000, 5000, 10000, 50000\).
    • For each dimension \(d\), we compute: \[r(d)=\frac{\max_{i\neq j}D(\text{x}_i,\text{x}_j)}{\min_{i\neq j}D(\text{x}_i,\text{x}_j)}.\]
    • Obtain the following graph 👉
Code
N = 10                                           # number of simulated points per dimension
dims = [1, 10, 100, 500, 1000, 5000, 10000, 50000]
Ds = np.zeros(shape=(N*(N-1)//2, len(dims)))     # one column of pairwise distances per dimension
j = 0
for d in dims:
    Xd = np.random.uniform(-5, 5, size=(N, d))   # N points drawn uniformly from [-5, 5]^d
    i = 0
    for s in range(1, N):
        for k in range(s):
            Ds[i, j] = np.linalg.norm(Xd[s, :] - Xd[k, :])   # Euclidean distance between points s and k
            i += 1
    j += 1

import plotly.express as px
df_dist = pd.DataFrame({'Ratio': np.round(Ds.max(axis=0)/Ds.min(axis=0), 2),
                        'Dim': [str(d) for d in dims]})
fig4 = px.line(df_dist, x='Dim', y="Ratio", text="Ratio")
fig4.update_layout(
    width=400, height=450,
    title="Max-min distance ratio across dimensions")
fig4.update_xaxes(title='Dimension')
fig4.update_yaxes(title='Max-Min Distance Ratio')
fig4.update_traces(textposition="top right")
fig4.show()

\(k\)-Nearest Neighbors (\(k\)-NN)

Summary

  • \(k\)-NN predicts the label/value of a new point by looking at the \(k\) closest neighbors of the point.
  • Data preprocessing is essential: scaling, encoding, outliers…

  • The key parameter \(k\) can be tuned using cross-validation technique.

  • \(k\)-NN may not be suitable in high-dimensional cases due to the Curse of dimensionality. However, we can try:

    • Feature selection
    • Dimensionality reduction (see the sketch below)
    • Alternative distance metrics
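  • For instance, dimensionality reduction can be chained in front of \(k\)-NN with a Pipeline (a sketch; the number of components is an arbitrary choice here and could itself be tuned by CV):
Code
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scale, project onto a few principal components, then run k-NN in the reduced space
knn_pca = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),            # arbitrary choice of dimension
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
# knn_pca.fit(X_train_encoded, y_train); knn_pca.predict(X_test_encoded)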

🥳 Yeahhhh….

Let’s Party… 🥂