Advanced Machine Learning


     

Lecturer: Dr. HAS Sothea

Code: AMSI61AML

Nonparametric Models

🗺️ Content

  • Introduction

  • Theoretical Motivation

  • K-Nearest Neighbors

  • Decision Trees

  • Kernel Smoothing Methods

Introduction

Supervised Learning Setting

Objective: Predicting target \(y\) using input \(x\).

  • If \(y\) is continuous, the problem is called Regression.
  • If \(y\) is discrete/categorical, it’s called Classification.

Criteria

  • Quality of a prediction \(\hat{y}\) is measured by the loss: \(\ell(\hat{y},y)\).
  • Quality of a model \(\hat{f}\) is measured by the risk: \[{\cal R}(\hat{f})=\mathbb{E}_{X,Y}[\ell(Y, \hat{f}(X))].\]

Introduction

Supervised Learning Setting

  • Parametric models assume a specific form that depends on a fixed number of parameters. Example:
    • Linear model: \(\hat{f}_{\color{blue}{\beta}}(\text{x})=\color{blue}{\beta_0}+\sum_{j=1}^d\color{blue}{\beta_j}x_j\) with \(\color{blue}{\beta_0},\dots,\color{blue}{\beta_d}\in\mathbb{R}.\)
    • Polynomial model: \(\hat{f}_{\color{blue}{\beta}}(\text{x})=\color{blue}{\beta_0}+\sum_{p=0}^m\sum_{k=1}^d\sum_{j\leq k}\color{blue}{\beta_{jkp}}x_j^{p}x_k^{m-p}\) with \(\color{blue}{\beta_0},\color{blue}{\beta_{jkp}}\in\mathbb{R},\forall j,k,p.\)
    • \(2\)-layer Neural Networks: \(\hat{f}(\text{x})=\sigma_2\left(\color{blue}{W_2}\sigma_1(\color{blue}{W_1}x+\color{blue}{b_1})+\color{blue}{b_2}\right).\)
  • Building a model amounts to searching for \(\color{blue}{\beta^*}\) that minimizes the (regularized) risk, i.e., \[{\cal R}(\hat{f}_{\color{blue}{\beta^*}})=\inf_{\color{blue}{\beta}}\Big[\underbrace{{\cal R}(\hat{f}_{\color{blue}{\beta}})}_{\text{Goodness of fit}} + \underbrace{\color{gray}{\Omega(\color{blue}{\beta})}}_{\text{Flexibility}}\Big]\] A minimal sketch of this recipe follows.
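The block below is a minimal, hedged sketch of the parametric recipe above: fit the parameters by minimizing an empirical risk plus a penalty \(\Omega(\color{blue}{\beta})\). The simulated data and the ridge penalty are illustrative assumptions of this sketch, not part of the lecture.

Code
import numpy as np
from sklearn.linear_model import Ridge

# Simulated linear data (illustrative assumption): y = 1 + 2*x1 - x2 + 0.5*x3 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

# Ridge regression: beta* minimizes empirical MSE + alpha * ||beta||^2 (the Omega term)
model = Ridge(alpha=1.0).fit(X, y)
print(model.intercept_, model.coef_)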

Introduction

Supervised Learning Setting

  • Nonparametric models rely directly on the target \(y\) without assuming any form of input-output relationship.
  • General weighted average form: \[\hat{f}(\text{x})=\sum_{i=1}^ny_i\color{blue}{W_{n,i}(\text{x})}.\]
  • Interpretation:
    • The prediction of \(\text{x}\) is determined by averaging values of targets \(y_i\)’s.
    • Targets \(y_i\) whose inputs \(\text{x}_i\) are similar to \(\text{x}\) are more important and carry heavier weights \(\color{blue}{W_{n,i}(\text{x})}\) in the average (see the numerical sketch below).
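Below is a tiny numerical sketch of this weighted-average form; the weights here are arbitrary illustrative numbers summing to one, not those of any particular method.

Code
import numpy as np

# Targets of 4 training points and (illustrative) weights for a query point x
y = np.array([3.0, 5.0, 4.0, 10.0])
W = np.array([0.5, 0.3, 0.2, 0.0])   # points similar to x get heavier weights; they sum to 1

y_hat = np.sum(y * W)                # prediction: weighted average of the targets
print(y_hat)                         # 3.0*0.5 + 5.0*0.3 + 4.0*0.2 = 3.8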

🤔 Why does it work?

🗝️ Because it estimates \(\color{red}{\eta(x)=\mathbb{E}(Y|X=x)}\).

🤔 But what’s this \(\color{red}{\eta(x)}\)?

Theoretical Motivation

Regression theory: \(L_2\)/MSE minimizer

  • \(\color{blue}{\text{Mean Squared Error}}\) of any model \(\hat{f}\) is defined by: \[\color{blue}{\text{MSE}}(\hat{f})=\mathbb{E}[(\hat{f}(X)-Y)^2].\]
  • If \(\text{x}\) is an input and \(\color{red}{\eta(\text{x})=\mathbb{E}(Y|X=\text{x})}\), then \(\forall\hat{f}\): \[\begin{align*} \color{blue}{\text{MSE}}(\hat{f})&=\mathbb{E}[(\hat{f}(X)-\color{red}{\eta}(X))^2]+\mathbb{E}[(\color{red}{\eta}(X)-Y)^2] \end{align*}\] (see the short derivation sketched below this list).
  • This implies that \(\color{red}{\eta}\) is the minimizer of \(\color{blue}{\text{MSE}}\) criterion, i.e., \[\color{blue}{\text{MSE}}(\color{red}{\eta})=\inf_{\hat{f}}\color{blue}{\text{MSE}}(\hat{f})=\mathbb{E}[(\color{red}{\eta}(X)-Y)^2].\]
  • \(\color{red}{\eta}\) is called Regression Function/\(L_2\) Bayesian Estimator.
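To justify the decomposition above (the step the slide skips), expand the square around \(\color{red}{\eta}(X)\): \[\begin{align*} \color{blue}{\text{MSE}}(\hat{f})&=\mathbb{E}[(\hat{f}(X)-\color{red}{\eta}(X)+\color{red}{\eta}(X)-Y)^2]\\ &=\mathbb{E}[(\hat{f}(X)-\color{red}{\eta}(X))^2]+\mathbb{E}[(\color{red}{\eta}(X)-Y)^2]+2\,\mathbb{E}[(\hat{f}(X)-\color{red}{\eta}(X))(\color{red}{\eta}(X)-Y)], \end{align*}\] and the cross term vanishes because, conditioning on \(X\), \(\mathbb{E}[\color{red}{\eta}(X)-Y\mid X]=\color{red}{\eta}(X)-\mathbb{E}[Y\mid X]=0\).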



Theoretical Motivation

Regression theory: \(L_2\)/MSE minimizer

  • \(\color{red}{\eta}\) is called Regression Function/\(L_2\) Bayesian Estimator.
  • \(\color{red}{\eta(\text{x})=\mathbb{E}(Y|X=\text{x})}\) is the average of \(y\) for inputs that are similar to the query point \(\text{x}\). In discrete case: \[\begin{align*}\color{red}{\eta}(\text{x})&=\mathbb{E}(Y|X=\text{x})=\sum_{i=1}^{\infty}y_i\color{blue}{\mathbb{P}(Y=y_i|X=\text{x})} \end{align*}\]

  • Recall that nonparametric models take the form: \(\hat{f}(\text{x})=\sum_{i=1}^ny_i\color{blue}{W_{n,i}(\text{x})}.\)

Theoretical Motivation

Regression theory: \(L_2\)/MSE minimizer

  • Nonparametric models: \[\hat{f}(\text{x})=\sum_{i=1}^ny_i\color{blue}{W_{n,i}(\text{x})}.\]
  • Regression function: \[\color{red}{\eta}(\text{x})=\sum_{i=1}^{\infty}y_i\color{blue}{\mathbb{P}(Y=y_i|X=\text{x})}.\]
  • Q1: Why don’t we compute \(\color{blue}{\mathbb{P}(Y=y_i|X=\text{x})}\) directly?

  • A1: We don’t know the distribution of \(Y|X\).

  • The weights \(\color{blue}{W_{n,i}(\text{x})}\) play the role of estimates of \(\color{blue}{\mathbb{P}(Y=y_i|X=\text{x})}\).

  • Different choices of weights lead to different nonparametric models.

  • Weights \(\color{blue}{W_{n,i}(\text{x})}\) are often defined according to how “close” \(\text{x}_i\) is to \(\text{x}\).

  • “Closeness” is often described by “metrics, neighborhoods or topological structures”…

Theoretical Motivation

Example: Missing Value Imputation

Code
from skimage import data
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import KNNImputer  # imported for reference; the filling below is done manually

# Load the 512 x 512 x 3 astronaut test image and knock out 100 small patches
image = data.astronaut()
n_miss = 100
np.random.seed(42)
missing_id = np.random.choice(512, size=(n_miss, 2))
image_nan = image.copy()

# Set a 4 x 4 patch around each sampled location to 0 ("missing" pixels)
for i in range(n_miss):
    image_nan[(missing_id[i,0]-2):(missing_id[i,0]+2), (missing_id[i,1]-2):(missing_id[i,1]+2), :] = 0
plt.figure(figsize=(4,4))
plt.imshow(image_nan)
plt.axis('off')
plt.show()

Code
# Fill each missing patch, channel by channel, with the mean of the
# non-missing (non-zero) pixels in the surrounding 10 x 10 window
for i in range(n_miss):
    for ch in range(3):
        temp = image_nan[(missing_id[i,0]-5):(missing_id[i,0]+5), (missing_id[i,1]-5):(missing_id[i,1]+5), ch]
        image_nan[(missing_id[i,0]-2):(missing_id[i,0]+2), (missing_id[i,1]-2):(missing_id[i,1]+2), ch] = temp[temp > 0].mean()
plt.figure(figsize=(4,4))
plt.imshow(image_nan)
plt.axis('off')
plt.show()

  • Each missing pixel is imputed as \(\frac{1}{N}\sum_{k=1}^Np_k\), where \(p_1,\dots,p_N\) are the neighboring non-missing pixels (a hedged KNNImputer alternative is sketched below).
  • Q2: What are the weights \(\color{blue}{W_{n,i}(\text{x})}\) in this case?
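The code above imports scikit-learn’s KNNImputer without using it. As a hedged alternative sketch (my own illustration, assuming missing pixels are marked with np.nan and each colour channel is imputed as a 2-D matrix whose rows are treated as samples), it could be applied like this:

Code
import numpy as np
from skimage import data
from sklearn.impute import KNNImputer

image = data.astronaut().astype(float)            # work in float so NaN can mark missing pixels
rng = np.random.default_rng(42)
missing_id = rng.integers(2, 510, size=(100, 2))
for r, c in missing_id:
    image[r-2:r+2, c-2:c+2, :] = np.nan           # mark 4 x 4 patches as missing

# Impute each channel: rows are samples, columns are features, so a missing
# pixel is filled using the 5 most similar rows of that channel
imputer = KNNImputer(n_neighbors=5)
filled = np.stack([imputer.fit_transform(image[:, :, ch]) for ch in range(3)], axis=-1)
filled = filled.clip(0, 255).astype(np.uint8)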

Theoretical Motivation

Decision theory for classification

  • In classification, misclassification loss: \(\ell(\hat{f}(x),y)=\mathbb{1}_{\{\hat{f}(x)\neq y\}}=\begin{cases}1,\text{ if }\hat{f}(x)\neq y\\0,\text{ if }\hat{f}(x)=y\end{cases}\).
  • The misclassification risk is given by: \[{\cal R}_c(\hat{f})=\mathbb{E}[\ell(\hat{f}(X),Y)]=\mathbb{E}[\mathbb{1}_{\{\hat{f}(X)\neq Y\}}]=\mathbb{P}(\hat{f}(X)\neq Y).\]
  • Minimizing this risk is equivalent to maximizing the accuracy score because \[\text{Accuracy}=1-\mathbb{P}(\hat{f}(X)\neq Y)=\mathbb{P}(\hat{f}(X)=Y).\]

In balanced \(M\)-class classification, \[\text{If } g^*(x)=\arg\max_{1\leq k\leq M}\mathbb{P}(Y=k|X=x)\text{ then }{\cal R}_c(g^*)=\inf_{\hat{f}}{\cal R}_c(\hat{f}).\]
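As a quick sanity check of this statement (a toy example of my own, not from the lecture: a binary problem where \(X\) takes three values with made-up conditional probabilities), brute force over all deterministic classifiers confirms that the arg-max rule attains the best accuracy:

Code
import numpy as np
from itertools import product

p_x = np.array([0.3, 0.5, 0.2])              # P(X = x) for x in {0, 1, 2}
p_y_given_x = np.array([[0.8, 0.2],          # P(Y = k | X = x), two classes
                        [0.4, 0.6],
                        [0.1, 0.9]])

bayes_acc = np.sum(p_x * p_y_given_x.max(axis=1))          # accuracy of g*(x) = argmax_k P(Y=k|X=x)
best_acc = max(np.sum(p_x * p_y_given_x[np.arange(3), np.asarray(g)])
               for g in product([0, 1], repeat=3))         # best over all 2^3 deterministic rules
print(bayes_acc, best_acc)                                 # both equal 0.72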

Theoretical Motivation

Summary

  • In regression with MSE, \(\color{red}{\eta(\text{x})=\mathbb{E}(Y|X=\text{x})}\) is the best regressor.
    • We cannot compute it, but we can estimate it.
      • Parametric models assume that \(\color{red}{\eta}\) takes a specific form (linear, polynomial…).
      • Nonparametric models attempt to estimate \(\color{red}{\eta}\) using weighted average form: \(\hat{f}(\text{x})=\sum_{i=1}^ny_i\color{blue}{W_{n,i}(x)}\).
  • In balanced \(M\)-class classification with Accuracy score, \(g^*(\text{x})=\arg\max_{1\leq k\leq M}\color{blue}{\mathbb{P}(Y=k|X=\text{x})}\) is the best classifier.
  • We will now look at three different models that attempt to estimate the regression function \(\color{red}{\eta}\) using different forms of weights \(\color{blue}{W_{n,i}(\text{x})}\).

\(K\)-Nearest Neighbors (\(K\)-NN)

Setting

  • Given iid observations: \(\{(\text{x}_1,y_1),\dots, (\text{x}_n,y_n)\}\subset \mathbb{R}^d\times{\cal Y}\).
  • If \(D\) is a distance on \(\mathbb{R}^d\) (e.g. Euclidean distance), the \(k\)-th nearest neighbor \(\color{red}{\text{x}_{(k)}}\) of \(\text{x}\in\mathbb{R}^d\) is defined by
    • \(D(\text{x},\text{x}_{(1)})\leq D(\text{x},\text{x}_{(2)})\leq\dots\leq D(\text{x},\color{red}{\text{x}_{(k)}})\leq \dots\leq D(\text{x},\text{x}_{(n)})\).
    • Let \(y_{(1)},\dots,y_{(n)}\) be the target of \(\text{x}_{(1)},\dots,\text{x}_{(n)}\) respectively.
  • If \(K\geq 1\), then \(K\)-NN predicts the target of an input \(\text{x}\) by
  • Regression: \[\begin{align*}\hat{y}&=\frac{1}{K}\sum_{k=1}^Ky_{(k)}\qquad\quad \color{gray}{(\color{blue}{W_{n,i}(\text{x})}=?)}\\ &=\text{Average $y_{(k)}$ among the $K$ neighbors}.\end{align*}\]
  • Classification: \[\begin{align*}\hat{y}&=\arg\max_{1\leq m\leq M}\frac{1}{K}\sum_{k=1}^K\mathbb{1}_{\{y_{(k)}=m\}}\\ &=\text{Majority class among the $K$ neighbors.}\end{align*}\] A from-scratch sketch of both cases follows this list.
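For concreteness, here is a minimal from-scratch sketch (Euclidean distance, toy simulated data of my own) that makes the uniform weights \(\color{blue}{W_{n,i}(\text{x})}=1/K\) on the \(K\) nearest neighbors explicit:

Code
import numpy as np

def knn_predict(x, X, y, K=3):
    d = np.linalg.norm(X - x, axis=1)     # distances D(x, x_i)
    nn = np.argsort(d)[:K]                # indices of the K nearest neighbors
    W = np.zeros(len(X))
    W[nn] = 1.0 / K                       # uniform weights on the K neighbors, 0 elsewhere
    return float(W @ y)                   # regression: weighted average of the targets
    # classification: return np.bincount(y[nn].astype(int)).argmax()

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
print(knn_predict(np.array([0.5, -1.0]), X, y, K=5))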

\(K\)-Nearest Neighbors (\(K\)-NN)

Example

  • Regression: \[\begin{align*}\hat{y}&=\frac{1}{K}\sum_{k=1}^Ky_{(k)}\qquad\quad \color{gray}{(\color{blue}{W_{n,i}(\text{x})}=?)}\\ &=\text{Average $y_{(k)}$ among the $K$ neighbors}.\end{align*}\]
  • Classification: \[\begin{align*}\hat{y}&=\arg\max_{1\leq m\leq M}\frac{1}{K}\sum_{k=1}^K\mathbb{1}_{\{y_{(k)}=m\}}\\ &=\text{Majority class among the $K$ neighbors.}\end{align*}\]

\(K\)-Nearest Neighbors (\(K\)-NN)

In action

  • Let’s work with our Heart Disease Dataset (shape: \(1025\times 14\)) and choose \(K=5\).
  • Since \(K\)-NN is a distance-based method, it’s essential to:
    • Scale the inputs
    • Watch out for outliers/missing values
    • Encode categorical inputs
    • Watch out for the effect of imbalanced class…
Code
import numpy as np
import pandas as pd
data = pd.read_csv(path + "/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')

# Train test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()

# One-hot encode the categorical columns and scale all features to [0, 1]
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)
# Align test dummies with the training columns in case a rare category is missing from one split
X_test_cat = X_test_cat.reindex(columns=X_train_cat.columns, fill_value=0)
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

# KNN
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["5NN"])
test_perf
Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
  • Q3: Can we do better?
  • A3: Yes! \(K=5\) is arbitrary. We should fine-tune it!

\(K\)-Nearest Neighbors (\(K\)-NN)

Influence of \(K\)

  • Too small \(K\Leftrightarrow\) too flexible \(\Leftrightarrow\) high variance \(\Leftrightarrow\) Overfitting; too large \(K\Leftrightarrow\) too smooth \(\Leftrightarrow\) high bias \(\Leftrightarrow\) Underfitting. A sketch for visualizing this trade-off follows.
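One hedged way to see this trade-off (my own sketch, assuming the X_train_encoded, y_train, X_test_encoded, y_test objects from the previous slide are still in scope) is to plot train and test accuracy against \(K\):

Code
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

Ks = range(1, 31)
train_acc, test_acc = [], []
for K in Ks:
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_train_encoded, y_train)
    train_acc.append(knn.score(X_train_encoded, y_train))   # near 1 for small K (overfitting)
    test_acc.append(knn.score(X_test_encoded, y_test))

plt.plot(Ks, train_acc, label="train accuracy")
plt.plot(Ks, test_acc, label="test accuracy")
plt.xlabel("K"); plt.ylabel("Accuracy"); plt.legend(); plt.show()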

\(K\)-Nearest Neighbors (\(K\)-NN)

Fine-tuning \(K\): Cross-validation

  • There are many ways to perform cross-validation in Python.
  • Let’s use GridSearchCV from the sklearn.model_selection module.
Code
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

# 10-fold CV over K = 2, ..., 30, scored by F1 on the positive class
scorer = make_scorer(f1_score, pos_label=1)
param_grid = {'n_neighbors': list(range(2, 31))}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=10, scoring=scorer, return_train_score=True)
grid_search.fit(X_train_encoded, y_train)

knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_encoded, y_train)
y_pred = knn.predict(X_test_encoded)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN"])], axis=0)
test_perf
Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
2NN 0.970732 1.000000 0.942857 0.970588 0.971429

\(K\)-Nearest Neighbors (\(K\)-NN)

What can go wrong?

 

  • The dataset contains duplicated rows, so some test points have exact copies in the training set!
  • Let’s drop the duplicates and try again (a sketch of the deduplication step follows the table below).
Accuracy Precision Recall F1-score AUC
5NN 0.868293 0.861111 0.885714 0.873239 0.867857
2NN 0.970732 1.000000 0.942857 0.970588 0.971429
16NN_No_Dup 0.885246 0.882353 0.909091 0.895522 0.883117
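A hedged sketch of the deduplication step (reusing the drop_duplicates call applied later in the decision-tree experiments, and assuming the heart DataFrame `data` and the imports from the earlier slides):

Code
# The 1025-row table contains many exact duplicates, so a test point often
# has an identical copy in the training set, which flatters small-K results
print(data.duplicated().sum(), "duplicated rows out of", len(data))

data_nodup = data.drop_duplicates()
X, y = data_nodup.iloc[:, :-1], data_nodup.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# ...then redo the encoding/scaling and the GridSearchCV from the previous slides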

\(K\)-Nearest Neighbors (\(K\)-NN)

Curse of dimensionality

  • Curse of dimensionality refers to various challenges and phenomena that arise when working with high-dimensional data.
  • The main challenge for \(K\)-NN is that distances, and hence closeness, lose their meaning in high-dimensional spaces.
  • In this scenario, data points tend to be equally distant from one another.
  • Example: Simulate \(\text{x}_1,\dots,\text{x}_n\sim{\cal U}[-5,5]^d\) with \(d=1,10,100,500,1000, 5000, 10000, 50000\).
    • For each dimension \(d\), we compute: \[r(d)=\frac{\max_{i\neq j}D(\text{x}_i,\text{x}_j)}{\min_{i\neq j}D(\text{x}_i,\text{x}_j)}.\]
    • Obtain the following graph 👉
Code
# For each dimension d, simulate N points uniformly on [-5, 5]^d and store
# all pairwise Euclidean distances in one column of Ds
N = 10
Ds = np.zeros(shape=(N*(N-1)//2, 8))
j = 0
for d in [1, 10, 100, 500, 1000, 5000, 10000, 50000]:
    Xd = np.random.uniform(-5, 5, size=(N, d))
    i = 0
    for s in range(1, N):
        for k in range(s):
            Ds[i, j] = np.linalg.norm(Xd[s, :] - Xd[k, :])
            i += 1
    j += 1
import plotly.express as px
df_dist = pd.DataFrame({'Ratio': np.round(Ds.max(axis=0)/Ds.min(axis=0), 2), 'Dim': ['1','10','100', '500', '1000', '5000', '10000', '50000']})
fig4 = px.line(df_dist, x='Dim', y="Ratio", text="Ratio")
fig4.update_layout(
    width=400, height=450, 
    title="Max-min distance ratio at various dimension")
fig4.update_xaxes(title='Dimension')
fig4.update_yaxes(title='Max-Min Distance Ratio')
fig4.update_traces(textposition="top right")
fig4.show()

\(K\)-Nearest Neighbors (\(K\)-NN)

Summary

  • \(K\)-NN is a nonparametric model with weights: \(\color{blue}{W_{n,i}(\text{x})=\frac{\mathbb{1}_{\{\text{x}_i\in{\cal N}_{\text{x}}\}}}{K}}\), where \({\cal N}_{\text{x}}=\{\text{x}_{(1)},\dots,\text{x}_{(K)}\}\) is the set of the \(K\) nearest neighbors of \(\text{x}\).
  • Data preprocessing is essential: scaling, encoding, outliers…

  • The key parameter \(K\) can be tuned using cross-validation technique.

  • \(K\)-NN may not be suitable in high-dimensional cases due to the Curse of dimensionality. However, we can try (sketched after this list):

    • Feature selection
    • Dimensionality reduction
    • Alternative distance metrics
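A hedged sketch of these remedies with standard scikit-learn building blocks (the pipeline choices and parameter values below are illustrative assumptions, not the lecture's):

Code
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Feature selection before K-NN
pipe_fs = make_pipeline(MinMaxScaler(), SelectKBest(f_classif, k=8),
                        KNeighborsClassifier(n_neighbors=5))
# Dimensionality reduction before K-NN
pipe_pca = make_pipeline(MinMaxScaler(), PCA(n_components=5),
                         KNeighborsClassifier(n_neighbors=5))
# An alternative distance metric (cosine, with brute-force neighbor search)
knn_cos = KNeighborsClassifier(n_neighbors=5, metric="cosine", algorithm="brute")
# Each pipeline is then fitted like any estimator, e.g. pipe_pca.fit(X_train, y_train)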

\(K\)-Nearest Neighbors (\(K\)-NN)

\(K\)-NN on MNIST

  • We try a few different approaches:
    • Plain \(3\)-NN (various sources, for example, see here)
    • Dimensionality reduction with PCA, tuned with cross-validation
    • Fourier-transform features, tuned with cross-validation.
Code
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Reshape data from 3D array to (n, 784)
X_train_reshaped = X_train.reshape(-1, 28*28)/255.0
X_test_reshaped = X_test.reshape(-1, 28*28)/255.0

# Model
knn = KNeighborsClassifier(n_neighbors=3)
knn = knn.fit(X_train_reshaped, y_train)
y_pred = knn.predict(X_test_reshaped)

test_mnist = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred, average="macro"),
          'Recall': recall_score(y_test, y_pred, average="macro"),
          'F1-score': f1_score(y_test, y_pred, average="macro")},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"3NN"])

# PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pca = PCA(n_components=150)
X_proj_train = pca.fit_transform(X_train_reshaped)
X_proj_test = pca.transform(X_test_reshaped)

# Model
# CV over K on the PCA features; use a macro-averaged F1 scorer, since the
# binary scorer defined earlier would fail on the 10-class MNIST labels
param_grid = {'n_neighbors': [2, 4, 6, 9, 11, 13, 15, 17, 20]}
knn = KNeighborsClassifier()
scorer_macro = make_scorer(f1_score, average="macro")
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring=scorer_macro, return_train_score=True)
grid_search.fit(X_proj_train, y_train)

knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_proj_train, y_train)
y_pred = knn.predict(X_proj_test)

test_mnist = pd.concat([test_mnist, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred, average="macro"),
          'Recall': recall_score(y_test, y_pred, average="macro"),
          'F1-score': f1_score(y_test, y_pred, average="macro")},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN_PCA"])], axis=0)

# Fourier transform
from scipy.fftpack import fft2, fftshift
def compute_fourier_transform(image): 
    f_transform = fft2(image) 
    f_transform_shifted = fftshift(f_transform) 
    magnitude_spectrum = np.abs(f_transform_shifted) 
    return magnitude_spectrum

X_train_fft = np.array([compute_fourier_transform(img) for img in X_train]).reshape(-1, 28*28)
X_test_fft = np.array([compute_fourier_transform(img) for img in X_test]).reshape(-1, 28*28)

# Model
# Same CV setup on the Fourier-transformed features
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring=scorer_macro, return_train_score=True)
grid_search.fit(X_train_fft, y_train)

knn = KNeighborsClassifier(n_neighbors=grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_fft, y_train)
y_pred = knn.predict(X_test_fft)

test_mnist = pd.concat([test_mnist, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred, average="macro"),
          'Recall': recall_score(y_test, y_pred, average="macro"),
          'F1-score': f1_score(y_test, y_pred, average="macro")},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"{grid_search.best_params_['n_neighbors']}NN_FT"])], axis=0)
test_mnist
Accuracy Precision Recall F1-score
3NN 0.9705 0.970912 0.970114 0.970375
2NN_PCA 0.9659 0.966592 0.965448 0.965672
2NN_FT 0.9187 0.921945 0.917746 0.917453

Decision Trees

CART: Classification And Regression Trees

  • In \(K\)-NN, “neighbors” are defined by the distance \(D\) (\(K\) points within the ball centered at the query point).
  • In CART, “neighbors” are defined by rectangular regions within the input space.

Decision Trees

CART: Classification And Regression Trees

  • Building a CART \(\Leftrightarrow\) recursively split input space into smaller regions.
  • Rectangular regions \(\Rightarrow\) defining neighbors \(\Rightarrow\) prediction.
  • Start at root node where all points are in the same region (without split).
  • At each step, each region is split on feature \(X_j\) at threshold \(a\in\mathbb{R}\) such that the two subregions \(R_1=\{x\in\mathbb{R}^d: X_j\leq a\}\) and \(R_2=\{x\in\mathbb{R}^d: X_j> a\}\) are as pure as possible.
  • Impurity is defined by impurity measures:
    • Regression: \(\sum_{y\in R_1}(y-\overline{y}_1)^2+\sum_{y\in R_2}(y-\overline{y}_2)^2.\)
    • Classification (\(M\) classes):
      • Misclassification error: \(1-\hat{p}_{k^*}\) where \(k^*\) is the majority class.
      • Gini impurity: \(\sum_{k=1}^M\hat{p}_{k}(1-\hat{p}_{k})\).
      • Entropy: \(-\sum_{k}\hat{p}_{k}\log(\hat{p}_{k})\), where \(\hat{p}_{k}=\frac{\sum_{i\in R}\mathbb{1}_{\{y_i=k\}}}{|R|}\) on any region \(R\). A small numerical sketch follows this list.
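For reference, a small numerical sketch of my own of the three impurity measures on toy class counts, chosen to match the 11-point and 19-point regions of the first-split example later in this section:

Code
import numpy as np

def impurities(labels):
    p = np.bincount(labels) / len(labels)    # class proportions p_k in the region
    p = p[p > 0]
    miscl = 1 - p.max()                      # misclassification error
    gini = np.sum(p * (1 - p))               # Gini impurity
    entropy = -np.sum(p * np.log(p))         # entropy (natural log)
    return miscl, gini, entropy

R1 = np.array([0] * 11)                      # pure region: 11 points of class 0
R2 = np.array([0] * 3 + [1] * 16)            # mixed region: 3 points vs 16 points
print(impurities(R1))                        # (0.0, 0.0, 0.0)
print(impurities(R2))                        # entropy ~ 0.436, as in the worked example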

Decision Trees

CART: Classification And Regression Trees

  • First split:
    • \(\text{En}(R_1)=-1\log(1)=0\)
    • \(\text{En}(R_2)=-\color{blue}{16/19\log(16/19)}-\color{red}{3/19\log(3/19)} =0.436\).
    • \(\text{En}_1=(0)11/30+(0.436)19/30=0.276.\)
    • Info gain: \(\text{En}_0-\text{En}_1.\)

  • Prediction:
    • Regression: \(\hat{y}=\frac{1}{|R_{\text{x}}|}\sum_{y_i\in R_{\text{x}}}y_i\).
    • Classification: \(\hat{y}=\arg\max_{1\leq m\leq M}\frac{1}{|R_{\text{x}}|}\sum_{y_i\in R_{\text{x}}}\mathbb{1}_{\{y_i=m\}}\).

Decision Trees

Pruning tree

  • Hyperparameters of a tree: depth, minimum leaf size, impurity measure, maximum features considered at each split…
  • Deep trees \(\Leftrightarrow\) high variance \(\Rightarrow\) Overfitting.
  • Pruning a tree = removing branches with small improvement.

  • Pruning criterion to be minimized: \(C_{\alpha}(T)=\sum_{j=1}^{|T|}N_j\text{Imp}_j(T)+\alpha|T|\), where
    • \(|T|\): number of leaves of the tree \(T\) and
    • \(\text{Imp}_j(T)\): the impurity at the \(j\)-th leaf. A hedged scikit-learn sketch follows.
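In scikit-learn this criterion corresponds to cost-complexity pruning via the ccp_alpha parameter. The following is a hedged sketch (assuming X_train, y_train hold a tabular training split such as the deduplicated Heart Disease data used on the next slide):

Code
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Candidate alpha values along the cost-complexity pruning path of a full tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Pick alpha by cross-validation, i.e. prune branches until C_alpha(T) stops improving
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {'ccp_alpha': path.ccp_alphas}, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
pruned_tree = grid.best_estimator_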

Decision Trees

In action: Heart Disease Dataset

  • We use GridSearchCV to search over the hyperparameters:
    • Impurity (criterion)
    • Depth (max_depth)
    • Minimum samples to split a node (min_samples_split)
    • Minimum size of leaf nodes (min_samples_leaf)
    • Maximum features (max_features).
Code
data = pd.read_csv(path + "/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')

data = data.drop_duplicates()
# Train test split
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.model_selection import GridSearchCV 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

clf = DecisionTreeClassifier()
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [10, 20, 30, 40, 50],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4],
              'max_features': ['sqrt', 'log2', None]}  # 'auto' has been removed in recent scikit-learn versions
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1) 
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_ 
y_pred = best_model.predict(X_test)

test_tr = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Tree"])
test_tr
Accuracy Precision Recall F1-score AUC
Tree 0.721311 0.766667 0.69697 0.730159 0.723485

Decision Trees

Summary

  • CART is a nonparametric model with weights: \(\color{blue}{W_{n,i}(\text{x})=\frac{\mathbb{1}_{\{\text{x}_i\in R_{\text{x}}\}}}{|R_{\text{x}}|}}\), where \(R_{\text{x}}\) is the leaf node containing \(\text{x}\).
  • They are not sensitive to scaling.

  • The key parameters include depth, minimum leaf size, impurity measure, number of splits, maximum features considered at each split… which can be tuned using cross-validation.

  • It is efficient in high-dimensional cases as it handles features individually.

  • It can handle categorical data as well.

  • Just like \(K\)-NN with small \(K\), deep trees are high-variance methods.

  • They should be pruned.

Kernel Smoothing Methods

Smoother weights

  • This is inspired by density estimation.
  • It aims at directly estimating \(\eta(x)=\mathbb{E}(Y|X=x)\) for regression or \(\mathbb{P}(Y=k|X=x)\) for classification.
  • Regression: \[\hat{y}=\sum_{i=1}^ny_i\color{blue}{W_{n,i}(\text{x})},\] where \(\color{blue}{W_{n,i}(\text{x})}=\frac{K_h(\|\text{x}_i-\text{x}\|)}{\sum_{j=1}^nK_h(\|\text{x}_j-\text{x}\|)}\) with \(K_h(t)=K(t/h)\) for some smoothing parameter \(h>0\).
  • Classification: \[\hat{y}=\arg\max_{1\leq k\leq M}\sum_{i=1}^n\mathbb{1}_{\{y_i=k\}}\color{blue}{W_{n,i}(\text{x})}.\]
  • A common example is the Gaussian kernel \(K(t)=\exp(-t^2/(2\sigma^2))\) for some \(\sigma>0\). A minimal sketch follows this list.
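Here is a minimal Nadaraya-Watson-style sketch with a Gaussian kernel on toy 1-D data (the data and the bandwidth are illustrative assumptions), implementing the weights defined above:

Code
import numpy as np

def kernel_smoother(x, X, y, h=0.5):
    d = np.abs(X - x)                      # ||x_i - x|| in one dimension
    K = np.exp(-(d / h) ** 2 / 2)          # Gaussian kernel evaluated at d / h
    W = K / K.sum()                        # normalized weights W_{n,i}(x)
    return float(W @ y)                    # weighted average of the targets

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(X) + 0.2 * rng.normal(size=100)
print(kernel_smoother(np.pi / 2, X, y, h=0.3))   # should be close to sin(pi/2) = 1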

Kernel Smoothing Methods

Example: Gaussian Kernel

  • Building a kernel smoother \(\Leftrightarrow\) tuning \(h>0\) (cross-validation).

Kernel Smoothing Methods

Example: Auto-MPG Dataset

  mpg cylinders displacement horsepower weight acceleration model year origin
mpg 1.000000 -0.777618 -0.805127 -0.778427 -0.832244 0.423329 0.580541 0.565209
cylinders -0.777618 1.000000 0.950823 0.842983 0.897527 -0.504683 -0.345647 -0.568932
displacement -0.805127 0.950823 1.000000 0.897257 0.932994 -0.543800 -0.369855 -0.614535
horsepower -0.778427 0.842983 0.897257 1.000000 0.864538 -0.689196 -0.416361 -0.455171
weight -0.832244 0.897527 0.932994 0.864538 1.000000 -0.416839 -0.309120 -0.585005
acceleration 0.423329 -0.504683 -0.543800 -0.689196 -0.416839 1.000000 0.290316 0.212746
model year 0.580541 -0.345647 -0.369855 -0.416361 -0.309120 0.290316 1.000000 0.181528
origin 0.565209 -0.568932 -0.614535 -0.455171 -0.585005 0.212746 0.181528 1.000000

Kernel Smoothing Methods

Example: Auto-MPG Dataset

Code
# X here is the Auto-MPG DataFrame (first column: mpg, the regression target)
X = X.drop_duplicates()
X_train, X_test, y_train, y_test = train_test_split(X.iloc[:,1:], X.iloc[:,0], test_size=0.2, random_state=42)
from sklearn.neighbors import KNeighborsRegressor 
from sklearn.tree import DecisionTreeRegressor 
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only: the scaler must be fitted on the training set alone

# Define the KNN and Decision Tree Regressor 
knn = KNeighborsRegressor()
param_grid = {'n_neighbors': [2, 3, 5, 7, 9, 11, 13, 15, 17, 20]}
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1) 
# Fit the model
grid_search.fit(X_train_scaled, y_train) 

# Extract the best model and parameters 
knn = KNeighborsRegressor(n_neighbors = grid_search.best_params_['n_neighbors'])
knn = knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
df_mse = pd.DataFrame(
    {'RMSE': np.round(np.sqrt(mean_squared_error(y_test, y_pred)), 3),
    'MAE': np.round(mean_absolute_error(y_test, y_pred), 3)},
    index = [f"{grid_search.best_params_['n_neighbors']}NN"]
)

tree = DecisionTreeRegressor() 
param_grid = {'criterion': ['squared_error', 'absolute_error'], 
              'min_samples_leaf': [2, 3, 5, 7, 10],
              'max_depth' : [5, 10, 15, 20, 30]}
grid_search = GridSearchCV(tree, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1) 
# Fit the model
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
tree = DecisionTreeRegressor(**best_params)
tree = tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

df_mse = pd.concat([df_mse, pd.DataFrame(
    {'RMSE': np.round(np.sqrt(mean_squared_error(y_test, y_pred)), 3),
    'MAE': np.round(mean_absolute_error(y_test, y_pred), 3)},
    index = [f"Tree"])])

# Kernel smoother
from gradientcobra.gradientcobra import KernelSmoother
ks = KernelSmoother(opt_method='grid',
                    bandwidth_list = np.linspace(0.00001, 50, 300),
                    show_progress = False)
ks = ks.fit(X_train_scaled, y_train)
y_pred = ks.predict(X_test_scaled)

df_mse = pd.concat([df_mse, pd.DataFrame(
    {'RMSE': np.round(np.sqrt(mean_squared_error(y_test, y_pred)), 3),
    'MAE': np.round(mean_absolute_error(y_test, y_pred), 3)},
    index = [f"Kernel Smoother"])])
df_mse
RMSE MAE
3NN 2.624 1.822
Tree 2.741 2.087
Kernel Smoother 2.460 1.783

Kernel Smoothing Methods

Summary

  • A kernel smoother is a nonparametric model that assigns a weight to each point using a kernel function \(K\), i.e., \[W_{n,i}(\text{x})=\frac{K_h(\|\text{x}_i-\text{x}\|)}{\sum_{j=1}^nK_h(\|\text{x}_j-\text{x}\|)}.\]
  • Training a kernel smoother is equivalent to tuning the smoothing parameter \(h>0\) using, for example, cross-validation error.
  • Preprocessing is essential: scaling, encoding, transformation,…
  • Just like \(K\)-NN, it might not work well in high-dimensional cases but can be improved using the same remedies.

🥳 It’s party time 🥂









📋 View party menu here: Party 4 Menu.

🫠 Download party invitation here: Party 4 Invitation Letter.