Objective: Predicting target \(y\) using input \(x\).
If \(y\) is continuous, the problem is called Regression.
If \(y\) is discrete/categorical, it’s called Classification.
Criteria
Quality of a prediction \(\hat{y}\) is measured by the loss: \(\ell(\hat{y},y)\).
Quality of a model \(\hat{f}\) is measured by the risk: \[{\cal R}(\hat{f})=\mathbb{E}_{X,Y}[\ell(Y, \hat{f}(X))].\]
Introduction
Supervised Learning Setting
Parametric models assume a specific form that depends on a fixed number of parameters. Example:
Linear model: \(\hat{f}_{\color{blue}{\beta}}(\text{x})=\color{blue}{\beta_0}+\sum_{j=1}^d\color{blue}{\beta_j}x_j\) with \(\color{blue}{\beta_0},\dots,\color{blue}{\beta_d}\in\mathbb{R}.\)
Polynomial model: \(\hat{f}_{\color{blue}{\beta}}(\text{x})=\color{blue}{\beta_0}+\sum_{p=0}^m\sum_{k=1}^d\sum_{j\leq k}\color{blue}{\beta_{jkp}}x_j^{p}x_k^{m-p}\) with \(\color{blue}{\beta_0},\color{blue}{\beta_{jkp}}\in\mathbb{R},\forall j,k,p.\)
Building a model means searching for \(\color{blue}{\beta^*}\) that minimizes a penalized risk: \[\color{blue}{\beta^*}=\arg\min_{\color{blue}{\beta}}\Big[\underbrace{{\cal R}(\hat{f}_{\color{blue}{\beta}})}_{\text{Goodness of fit}} + \underbrace{\color{gray}{\Omega(\color{blue}{\beta})}}_{\text{Flexibility}}\Big]\]
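To make this concrete, here is a minimal sketch (assuming scikit-learn; the synthetic data and penalty strength are illustrative) of fitting a parametric linear model where the flexibility term plays the role of \(\Omega(\beta)\), with the empirical risk standing in for the true risk:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data: y = 2 + 3*x1 - x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 + 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

# Ridge minimizes the sum of squared errors plus alpha * ||beta||^2 (the penalty Omega)
model = Ridge(alpha=1.0).fit(X, y)
print(model.intercept_, model.coef_)  # estimates of beta_0 and (beta_1, beta_2)
```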
Introduction
Supervised Learning Setting
Nonparametric models rely directly on the target \(y\) without assuming any form of input-output relationship.
General weighted average form: \[\hat{f}(\text{x})=\sum_{i=1}^ny_i\color{blue}{W_{n,i}(\text{x})}.\]
Interpretation:
The prediction at \(\text{x}\) is obtained by averaging the target values \(y_i\).
Targets \(y_i\) whose inputs \(\text{x}_i\) are similar to \(\text{x}\) are more important and carry heavier weights \(\color{blue}{W_{n,i}(\text{x})}\) in the average.
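As a tiny illustration of this weighted-average form, here is a minimal NumPy sketch with hypothetical data; the inverse-distance weights used here are just one possible choice of \(\color{blue}{W_{n,i}(\text{x})}\):

```python
import numpy as np

# Hypothetical training sample (x_i, y_i) and a query point x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 1.5, 3.0])
x = np.array([1.2])

# One possible choice of weights: inverse distances, normalized to sum to 1
dist = np.linalg.norm(X - x, axis=1)
w = 1.0 / (dist + 1e-8)
w = w / w.sum()            # these play the role of the W_{n,i}(x)

y_hat = np.sum(y * w)      # prediction = weighted average of the y_i
print(y_hat)
```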
🤔 Why does it work?
🗝️ Because it estimates \(\color{red}{\eta(x)=\mathbb{E}(Y|X=x)}\).
🤔 But what’s this \(\color{red}{\eta(x)}\)?
Theoretical Motivation
Regression theory: \(L_2\)/MSE minimizer
\(\color{blue}{\text{Mean Squared Error}}\) of any model \(\hat{f}\) is defined by: \[\color{blue}{\text{MSE}}(\hat{f})=\mathbb{E}[(\hat{f}(X)-Y)^2].\]
Let \(\text{x}\) be an input and \(\color{red}{\eta(\text{x})=\mathbb{E}(Y|X=\text{x})}\); then \(\forall\hat{f}\): \[\begin{align*}
\color{blue}{\text{MSE}}(\hat{f})&=\mathbb{E}[(\hat{f}(X)-\color{red}{\eta}(X))^2]+\mathbb{E}[(\color{red}{\eta}(X)-Y)^2]
\end{align*}\]
This implies that \(\color{red}{\eta}\) is the minimizer of \(\color{blue}{\text{MSE}}\) criterion, i.e., \[\color{blue}{\text{MSE}}(\color{red}{\eta})=\inf_{\hat{f}}\color{blue}{\text{MSE}}(\hat{f})=\mathbb{E}[(\color{red}{\eta}(X)-Y)^2].\]
\(\color{red}{\eta}\) is called Regression Function/\(L_2\) Bayesian Estimator.
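For completeness, the decomposition above follows from a short standard argument: expand the square around \(\color{red}{\eta}(X)\) and note that the cross term vanishes by conditioning on \(X\),
\[\begin{align*}
\color{blue}{\text{MSE}}(\hat{f})&=\mathbb{E}\big[(\hat{f}(X)-\color{red}{\eta}(X)+\color{red}{\eta}(X)-Y)^2\big]\\
&=\mathbb{E}\big[(\hat{f}(X)-\color{red}{\eta}(X))^2\big]+2\,\mathbb{E}\big[(\hat{f}(X)-\color{red}{\eta}(X))(\color{red}{\eta}(X)-Y)\big]+\mathbb{E}\big[(\color{red}{\eta}(X)-Y)^2\big],
\end{align*}\]
where \(\mathbb{E}\big[(\hat{f}(X)-\color{red}{\eta}(X))(\color{red}{\eta}(X)-Y)\,\big|\,X\big]=(\hat{f}(X)-\color{red}{\eta}(X))\big(\color{red}{\eta}(X)-\mathbb{E}[Y|X]\big)=0.\)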
Theoretical Motivation
Regression theory: \(L_2\)/MSE minimizer
\(\color{red}{\eta}\) is called Regression Function/\(L_2\) Bayesian Estimator.
\(\color{red}{\eta(\text{x})=\mathbb{E}(Y|X=\text{x})}\) is the average of \(y\) for inputs that are similar to the query point \(\text{x}\). In the discrete case: \[\begin{align*}\color{red}{\eta}(\text{x})&=\mathbb{E}(Y|X=\text{x})=\sum_{i=1}^{\infty}y_i\color{blue}{\mathbb{P}(Y=y_i|X=\text{x})}
\end{align*}\]
Recall the nonparametric model form: \(\hat{f}(\text{x})=\sum_{i=1}^ny_i\color{blue}{W_{n,i}(\text{x})}.\)
Q1: Why don’t we compute \(\color{blue}{\mathbb{P}(Y=y_i|X=\text{x})}\) directly?
A1: We don’t know the distribution of \(Y|X\).
The weights \(\color{blue}{W_{n,i}(\text{x})}\) play the role of an estimate of \(\color{blue}{\mathbb{P}(Y=y_i|X=\text{x})}\).
Different choices of weights lead to different nonparametric models.
Weights \(\color{blue}{W_{n,i}(\text{x})}\) are often defined according to how “close” \(\text{x}_i\) is to \(\text{x}\).
“Closeness” is often described by “metrics, neighborhoods or topological structures”…
Theoretical Motivation
Example: Missing Value Imputation
Code
```python
# Load the astronaut image and artificially remove some 4x4 pixel blocks
from skimage import data
import matplotlib.pyplot as plt
import numpy as np

image = data.astronaut()
n_miss = 100
np.random.seed(42)
missing_id = np.random.choice(512, size=(n_miss, 2))
image_nan = image.copy()
for i in range(n_miss):
    image_nan[(missing_id[i, 0] - 2):(missing_id[i, 0] + 2),
              (missing_id[i, 1] - 2):(missing_id[i, 1] + 2), :] = 0

plt.figure(figsize=(4, 4))
plt.imshow(image_nan)
plt.axis('off')
plt.show()
```
Code
```python
# Impute each missing block, channel by channel, with the average of the
# surrounding non-missing (non-zero) pixels in a 10x10 neighborhood
for i in range(n_miss):
    for ch in range(3):
        temp = image_nan[(missing_id[i, 0] - 5):(missing_id[i, 0] + 5),
                         (missing_id[i, 1] - 5):(missing_id[i, 1] + 5), ch]
        image_nan[(missing_id[i, 0] - 2):(missing_id[i, 0] + 2),
                  (missing_id[i, 1] - 2):(missing_id[i, 1] + 2), ch] = temp[temp > 0].mean()

plt.figure(figsize=(4, 4))
plt.imshow(image_nan)
plt.axis('off')
plt.show()
```
Each missing pixel is imputed by \(\frac{1}{N}\sum_{k=1}^Np_k\), where the \(p_k\) are the \(N\) neighboring non-missing pixels.
Q2: What are the weights \(\color{blue}{W_{n,i}(\text{x})}\) in this case?
Theoretical Motivation
Decision theory for classification
In classification, misclassification loss: \(\ell(\hat{f}(x),y)=\mathbb{1}_{\{\hat{f}(x)\neq y\}}=\begin{cases}1,\text{ if }\hat{f}(x)\neq y\\0,\text{ if }\hat{f}(x)=y\end{cases}\).
The misclassification risk is given by: \[{\cal R}_c(\hat{f})=\mathbb{E}[\ell(\hat{f}(X),Y)]=\mathbb{E}[\mathbb{1}_{\{\hat{f}(X)\neq Y\}}]=\mathbb{P}(\hat{f}(X)\neq Y).\]
Minimizing this risk is equivalent to maximizing the accuracy score because \[\text{Accuracy}=1-\mathbb{P}(\hat{f}(X)\neq Y)=\mathbb{P}(\hat{f}(X)=Y).\]
In balanced \(M\)-class classification, \[\text{If } g^*(x)=\arg\max_{1\leq k\leq M}\mathbb{P}(Y=k|X=x)\text{ then }{\cal R}_c(g^*)=\inf_{\hat{f}}{\cal R}_c(\hat{f}).\]
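A short standard argument (sketched here for completeness) explains why \(g^*\) is optimal: for any classifier \(\hat{f}\),
\[{\cal R}_c(\hat{f})=\mathbb{P}(\hat{f}(X)\neq Y)=1-\mathbb{E}\big[\mathbb{P}(Y=\hat{f}(X)\,|\,X)\big]\ \geq\ 1-\mathbb{E}\Big[\max_{1\leq k\leq M}\mathbb{P}(Y=k\,|\,X)\Big]={\cal R}_c(g^*).\]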
Theoretical Motivation
Summary
In regression with MSE, \(\color{red}{\eta(\text{x})=\mathbb{E}(Y|X=\text{x})}\) is the best regressor.
We cannot compute it, but we can estimate it.
Parametric models assume that \(\color{red}{\eta}\) takes some forms (linear, polynomial…).
Nonparametric models attempt to estimate \(\color{red}{\eta}\) using weighted average form: \(\hat{f}(\text{x})=\sum_{i=1}^ny_i\color{blue}{W_{n,i}(x)}\).
In balanced \(M\)-class classification with Accuracy score, \(g^*(\text{x})=\arg\max_{1\leq k\leq M}\color{blue}{\mathbb{P}(Y=k|X=\text{x})}\) is the best classifier.
We will now look at three different models that attempt to estimate the regression function \(\color{red}{\eta}\) using different forms of weights \(\color{blue}{W_{n,i}(\text{x})}\).
\(K\)-Nearest Neighbors (\(K\)-NN)
Setting
Given iid observations: \(\{(\text{x}_1,y_1),\dots, (\text{x}_n,y_n)\}\subset \mathbb{R}^d\times{\cal Y}\).
If \(D\) is a distance on \(\mathbb{R}^d\) (e.g. the Euclidean distance), the \(k\)-th nearest neighbor \(\color{red}{\text{x}_{(k)}}\) of \(\text{x}\in\mathbb{R}^d\) is defined by reordering the sample so that \(D(\text{x}_{(1)},\text{x})\leq D(\text{x}_{(2)},\text{x})\leq\dots\leq D(\text{x}_{(n)},\text{x})\).
Let \(y_{(1)},\dots,y_{(n)}\) be the targets of \(\text{x}_{(1)},\dots,\text{x}_{(n)}\), respectively.
If \(K\geq 1\), then \(K\)-NN predicts the target of an input \(\text{x}\) by
Regression:\[\begin{align*}\hat{y}&=\frac{1}{K}\sum_{k=1}^Ky_{(k)}\qquad\quad \color{gray}{(\color{blue}{W_{n,i}(\text{x})}=?)}\\
&=\text{Average $y_{(k)}$ among the $K$ neighbors}.\end{align*}\]
Classification:\[\begin{align*}\hat{y}&=\arg\max_{1\leq m\leq M}\frac{1}{K}\sum_{k=1}^K\mathbb{1}_{\{y_{(k)}=m\}}\\
&=\text{Majority class among the $K$ neighbors.}\end{align*}\]
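A minimal from-scratch sketch of these two prediction rules (assuming NumPy, the Euclidean distance, and hypothetical toy data):

```python
import numpy as np

def knn_predict(X_train, y_train, x, K=3, task="regression"):
    """Predict the target of a query point x with K-NN."""
    dist = np.linalg.norm(X_train - x, axis=1)   # distances D(x_i, x)
    nn_idx = np.argsort(dist)[:K]                # indices of the K nearest neighbors
    y_nn = y_train[nn_idx]                       # their targets y_(1), ..., y_(K)
    if task == "regression":
        return y_nn.mean()                       # average among the K neighbors
    # Classification: majority class among the K neighbors
    classes, counts = np.unique(y_nn, return_counts=True)
    return classes[np.argmax(counts)]

# Hypothetical toy data
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
print(knn_predict(X_train, np.array([1.0, 2.0, 2.5, 4.0]), np.array([1.2, 1.1]), K=2))
print(knn_predict(X_train, np.array([0, 0, 1, 1]), np.array([2.6, 2.7]), K=3, task="classification"))
```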
\(K\)-Nearest Neighbors (\(K\)-NN)
In action
Let’s work with our Heart Disease Dataset (shape: \(1025\times 14\)) and choose \(K=5\).
Since \(K\)-NN is a distance-based method, it’s essential to:
Scale the inputs
Watch out for outliers/missing values
Encode categorical inputs
Watch out for the effect of imbalanced classes…
Code
```python
import numpy as np
import pandas as pd

data = pd.read_csv(path + "/heart.csv")
quan_vars = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
qual_vars = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']

# Convert to correct types
for i in quan_vars:
    data[i] = data[i].astype('float')
for i in qual_vars:
    data[i] = data[i].astype('category')

# Train/test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# One-hot encoding and scaling
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_train_encoded = scaler.fit_transform(
    np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)
X_test_encoded = scaler.transform(
    np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

# KNN with K = 5
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

# Test performance
from sklearn.metrics import (roc_auc_score, accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay)
test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["5NN"])
test_perf
```
|     | Accuracy | Precision | Recall   | F1-score | AUC      |
|-----|----------|-----------|----------|----------|----------|
| 5NN | 0.868293 | 0.861111  | 0.885714 | 0.873239 | 0.867857 |
Q3: Can we do better?
A3: Yes! \(K=5\) is arbitrary. We should fine-tune it!
\(K\)-Nearest Neighbors (\(K\)-NN)
Influence of \(K\)
Too small \(K\Leftrightarrow\) too flexible \(\Leftrightarrow\) high variance \(\Leftrightarrow\) Overfitting.
Too large \(K\Leftrightarrow\) too rigid \(\Leftrightarrow\) high bias \(\Leftrightarrow\) Underfitting.
\(K\)-Nearest Neighbors (\(K\)-NN)
Fine-tuning \(K\): Cross-validation
There are many ways to perform cross-validation in Python.
Let’s use GridSearchCV from the sklearn.model_selection module.
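A possible sketch (reusing `X_train_encoded` and `y_train` from the code above; the grid of candidate values and the 5-fold setting are illustrative assumptions):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Candidate values of K (illustrative grid)
param_grid = {"n_neighbors": list(range(1, 31))}

grid = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=param_grid,
    scoring="accuracy",   # matches the accuracy criterion discussed earlier
    cv=5)                 # 5-fold cross-validation
grid.fit(X_train_encoded, y_train)

print(grid.best_params_)           # the tuned K
best_knn = grid.best_estimator_    # refit on the whole training set
```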
\(K\)-Nearest Neighbors (\(K\)-NN)
Curse of dimensionality
Curse of dimensionality refers to various challenges and phenomena that arise when working with high-dimensional data.
The main challenge for \(K\)-NN is that distances, and hence closeness, lose their meaning in high-dimensional spaces.
In this scenario, data points tend to be equally distant from one another.
Example: Simulate \(\text{x}_1,\dots,\text{x}_n\sim{\cal U}[-5,5]^d\) with \(d=1,10,100,500,1000, 5000, 10000, 50000\).
For each dimension \(d\), we compute: \[r(d)=\frac{\max_{i\neq j}D(\text{x}_i,\text{x}_j)}{\min_{i\neq j}D(\text{x}_i,\text{x}_j)}.\]
Obtain the following graph 👉
Code
```python
import numpy as np
import pandas as pd

N = 10
Ds = np.zeros(shape=(N * (N - 1) // 2, 8))
j = 0
for d in [1, 10, 100, 500, 1000, 5000, 10000, 50000]:
    Xd = np.random.uniform(-5, 5, size=(N, d))
    i = 0
    # Pairwise distances among the N simulated points in dimension d
    for s in range(1, N):
        for k in range(s):
            Ds[i, j] = np.linalg.norm(Xd[s, :] - Xd[k, :])
            i += 1
    j += 1

import plotly.express as px
df_dist = pd.DataFrame({
    'Ratio': np.round(Ds.max(axis=0) / Ds.min(axis=0), 2),
    'Dim': ['1', '10', '100', '500', '1000', '5000', '10000', '50000']})
fig4 = px.line(df_dist, x='Dim', y="Ratio", text="Ratio")
fig4.update_layout(width=400, height=450,
                   title="Max-min distance ratio at various dimension")
fig4.update_xaxes(title='Dimension')
fig4.update_yaxes(title='Max-Min Distance Ratio')
fig4.update_traces(textposition="top right")
fig4.show()
```
\(K\)-Nearest Neighbors (\(K\)-NN)
Summary
\(K\)-NN is a nonparametric model with weights \(\color{blue}{W_{n,i}(\text{x})=\frac{\mathbb{1}_{\{\text{x}_i\in{\cal N}_{\text{x}}\}}}{K}}\), where \({\cal N}_{\text{x}}=\{\text{x}_{(1)},\dots,\text{x}_{(K)}\}\) is the set of the \(K\) nearest neighbors of \(\text{x}\).
Data preprocessing is essential: scaling, encoding, outliers…
The key parameter \(K\) can be tuned using cross-validation technique.
\(K\)-NN may not be suitable in high-dimensional cases due to the Curse of dimensionality. However, remedies such as dimensionality reduction or feature selection can help.
In \(K\)-NN, “neighbors” are defined by the distance \(D\) (\(K\) points within the ball centered at the query point).
In CART, “neighbors” are defined by rectangular regions within the input space.
Decision Trees
CART: Classification And Regression Trees
Building a CART \(\Leftrightarrow\) recursively split input space into smaller regions.
Rectangular regions \(\Rightarrow\) defining neighbors \(\Rightarrow\) prediction.
Start at root node where all points are in the same region (without split).
At each step, a region \(R\) is split on a feature \(j\) at a threshold \(a\in\mathbb{R}\) such that the two subregions \(R_1=\{x\in R: x_j\leq a\}\) and \(R_2=\{x\in R: x_j> a\}\) are as pure as possible.
Purity is measured by an impurity criterion such as the Entropy: \(-\sum_{k}\hat{p}_{k}\log(\hat{p}_{k})\) where \(\hat{p}_{k}=\frac{\sum_{i\in R}\mathbb{1}_{\{y_i=k\}}}{|R|}\) on any region \(R\).
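To make the splitting step concrete, here is a minimal NumPy sketch (hypothetical 1-D data) that scans candidate thresholds on a single feature and keeps the split minimizing the weighted entropy of the two subregions:

```python
import numpy as np

def entropy(y):
    """Entropy of the labels falling in a region."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def best_split(x, y):
    """Best threshold a on a single feature x (minimizes weighted entropy of R1, R2)."""
    best_a, best_impurity = None, np.inf
    for a in np.unique(x)[:-1]:                  # candidate thresholds
        left, right = y[x <= a], y[x > a]
        impurity = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if impurity < best_impurity:
            best_a, best_impurity = a, impurity
    return best_a, best_impurity

# Hypothetical data: the classes separate around x = 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # threshold a = 3.0 with weighted entropy 0 (a pure split)
```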
CART is a nonparametric model with weights \(\color{blue}{W_{n,i}(\text{x})=\frac{\mathbb{1}_{\{\text{x}_i\in R_{\text{x}}\}}}{|R_{\text{x}}|}}\), where \(R_{\text{x}}\) is the leaf (region) containing \(\text{x}\) and \(|R_{\text{x}}|\) is the number of training points it contains.
Trees are not sensitive to feature scaling.
The key parameters include the depth, the minimum leaf size, the impurity measure, the number of splits, the maximum number of features considered at each split… which can be tuned using cross-validation.
It is efficient in high-dimensional cases as it handles features individually.
It can handle categorical data as well.
Just like \(K\)-NN with small \(K\), deep trees are high-variance methods.
They should be pruned (see the sketch below).
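A brief sketch of tuning and pruning a tree with scikit-learn (reusing `X_train_encoded` and `y_train` from the \(K\)-NN example; the parameter grid is an illustrative assumption):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Control flexibility through depth, leaf size and cost-complexity pruning
param_grid = {
    "max_depth": [2, 3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
    "ccp_alpha": [0.0, 0.001, 0.01],   # larger alpha => more aggressive pruning
}
tree_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
tree_cv.fit(X_train_encoded, y_train)   # same training data as in the K-NN example
print(tree_cv.best_params_)
```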
Kernel Smoothing Methods
Smoother weights
This is inspired by density estimation.
It aims at directly estimating \(\eta(x)=\mathbb{E}(Y|X=x)\) for regression or \(\mathbb{P}(Y=k|X=x)\) for classification.
Regression: \[\hat{y}=\sum_{i=1}^ny_i\color{blue}{W_{n,i}(\text{x})},\] where \(\color{blue}{W_{n,i}(\text{x})}=\frac{K_h(\|\text{x}_i-\text{x}\|)}{\sum_{j=1}^nK_h(\|\text{x}_j-\text{x}\|)}\) with \(K_h(t)=K(t/h)\) for some smoothing parameter \(h>0\) (a minimal sketch of this estimator is given at the end of this section).
Kernel smoother is a nonparametric model that assigns weight to each points using some kernel function \(K\), i.e., \[W_{n,i}(\text{x})=\frac{K_h(\|\text{x}_i-\text{x}\|)}{\sum_{j=1}^nK_h(\|\text{x}_j-\text{x}\|)}.\]
Training a kernel smoother essentially amounts to tuning the smoothing parameter \(h>0\), for example by cross-validation.
Preprocessing is essential: scaling, encoding, transformation,…
Just like \(K\)-NN, it might not work well in high-dimensional cases but can be improved using the same remedies.
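Here is the promised minimal Nadaraya–Watson sketch of this weighting scheme (NumPy only, a Gaussian kernel, and hypothetical 1-D data; the bandwidth \(h\) is fixed here but would be tuned by cross-validation in practice):

```python
import numpy as np

def gaussian_kernel(t):
    return np.exp(-0.5 * t**2)

def kernel_smoother(X_train, y_train, x, h=0.5):
    """Nadaraya-Watson estimate of eta(x) with bandwidth h."""
    dist = np.linalg.norm(X_train - x, axis=1)   # ||x_i - x||
    k = gaussian_kernel(dist / h)                # K_h(||x_i - x||)
    w = k / k.sum()                              # weights W_{n,i}(x)
    return np.sum(y_train * w)

# Hypothetical data: noisy samples from a sine curve
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 2 * np.pi, size=(100, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.2, size=100)

print(kernel_smoother(X_train, y_train, np.array([np.pi / 2]), h=0.3))  # close to sin(pi/2) = 1
```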