Advanced Classification: SVM & Ensemble Methods


CSCI-866-001: Data Mining & Knowledge Discovery



Lecturer: Dr. Sothea HAS

🗺️ Content

  • Support Vector Machine
    • Model Setting: Hard Margin
    • Soft Margin
    • Kernel Tricks
  • Ensemble Learning
    • Bootstrap Aggregating Trees (Tree Bagging)
    • Random Forest
    • Extremely Randomized Trees (Extra-trees)
    • Gradient Boosting & XGBoost
    • Feature Importance

1. Support Vector Machine (SVM)

Model Setting

Model Setting: Hard Margin

Linear SVM vs Logistic Regression

  • We are still working in binary classification:
    • Input: \(\text{x}\in\mathbb{R}^d\)
    • Target: \(y\in\{-1,1\}\) (for the sake of math).
  • Methods: NBC, Logistic Regression, KNN, Trees.
Code
import numpy as np
import pandas as pd
data = pd.read_csv(path + "/heart.csv")
data[['age', 'chol','target']].sample(3, random_state=42)
age chol target
527 62 209 1
359 53 216 1
447 55 289 0

Logistic Regression

  • Model Assumption: \[\mathbb{P}(Y=1|X=\text{x})=\frac{1}{1+e^{-\beta_0-\beta^T\text{x}}}.\] (The decision boundary is linear.)

Linear SVM

  • Assumption 1:
    • The decision boundary is linear &
    • It has Maximum Margin
  • Boundary form: \(f(\text{x})=\color{blue}{\vec{w}^T}\text{x}+\color{blue}{b}.\)
  • Objective: Find the best \(\color{blue}{\vec{w}}\).

Model Setting

Hard Margin Linear SVM

  • Assumption 2: The classes are linearly separable (LS).
  • Objective: Find the best \(\color{blue}{\vec{w}}\).
  • Simple maths implies that \(\text{Margin}\propto 1/\|\color{blue}{\vec{w}}\|,\) with
    • \(\|\color{blue}{\vec{w}}\|:\) norm of \(\color{blue}{\vec{w}}\) (size/length of \(\color{blue}{\vec{w}}\)).
  • Goal: The problem now reduces to finding \(\color{blue}{\vec{w}}\) that minimizes \(\|\color{blue}{\vec{w}}\|\)
    while still separating the classes well!
  • Final problem: \[\begin{align*}&\min_{\color{blue}{\vec{w}}\in\mathbb{R}^d, \color{blue}{b}\in\mathbb{R}}\frac{1}{2}\|\color{blue}{\vec{w}}\|^2\qquad (\text{maximizing the margin})\\ &\text{subject to: }y_i(\color{blue}{\vec{w}^T}\text{x}_i+\color{blue}{b})\geq 1,\text{ for all }i=1,2,\dots,n\quad (\text{separate the classes well!})\end{align*}\]
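
As a concrete illustration, here is a minimal sketch on hypothetical synthetic data (not the course dataset): scikit-learn has no explicit hard-margin mode, so a very large \(C\) is used to approximate it.
Code
# Minimal sketch: approximate a hard-margin linear SVM on a toy linearly
# separable dataset by using a very large C in scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))    # class +1
X_neg = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))  # class -1
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

hard_svm = SVC(kernel='linear', C=1e6)   # huge C ~ hard margin
hard_svm.fit(X, y)

w = hard_svm.coef_[0]        # learned w*
b = hard_svm.intercept_[0]   # learned b*
print("w*:", w, " b*:", b, " margin width:", 2 / np.linalg.norm(w))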

Model Setting

Solution of Hard Margin Linear SVM

  • Final problem: \[\begin{align*}&\min_{\color{blue}{\vec{w}}\in\mathbb{R}^d, \color{blue}{b}\in\mathbb{R}}\frac{1}{2}\|\color{blue}{\vec{w}}\|^2\qquad (\text{maximizing the margin})\\ &\text{subject to: }y_i(\color{blue}{\vec{w}^T}\text{x}_i+\color{blue}{b})\geq 1,\text{ for all }i=1,2,\dots,n\quad (\text{separate the classes well!})\end{align*}\]
  • Solution: \[\color{blue}{\vec{w}^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i\text{x}_i}\] such that
    • \(\alpha_i=0\) for \(\text{x}_i\) NOT ON the margin.
    • \(\color{green}{\alpha_{i_0}}\neq 0\) for \(\color{green}{\text{x}_{i_0}}\) ON the margin and therefore, \[\color{blue}{b^*}=\color{green}{y_{i_0}}-\sum_{i=1}^n\alpha_iy_i\color{green}{\text{x}_{i_0}^T}[\text{x}_i].\]

Model Setting

Prediction by Hard Margin Linear SVM

  • Solution: \[\color{blue}{\vec{w}^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i\text{x}_i}\] such that
    • \(\alpha_i=0\) for \(\text{x}_i\) NOT ON the margin.
    • \(\color{green}{\alpha_{i_0}}\neq 0\) for \(\color{green}{\text{x}_{i_0}}\) ON the margin and therefore, \[\color{blue}{b^*}=\color{green}{y_{i_0}}-\sum_{i=1}^n\alpha_iy_i\color{green}{\text{x}_{i_0}^T}[\text{x}_i].\]

  • The decision boundary: \(f(\text{x})=\color{blue}{(\vec{w}^*)^T}\text{x}+\color{blue}{b^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\text{x}]+\color{blue}{b^*}.\)
  • Prediction rule: Given a new input \(\color{red}{\text{x}_0}\in\mathbb{R}^d\) we have \[\color{red}{y_0}=\text{sign}(f(\text{x}_0))=\text{sign}\Big(\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\color{red}{\text{x}_0}]+\color{blue}{b^*}\Big).\]
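
A minimal sketch (again on synthetic data) showing that scikit-learn's SVC exposes exactly the quantities above: dual_coef_ stores \(\alpha_iy_i\) for the support vectors, support_vectors_ stores the \(\text{x}_i\) with \(\alpha_i\neq 0\), and intercept_ stores \(b^*\); the manual sum reproduces decision_function.
Code
# Minimal sketch: read off the dual solution of a fitted linear SVC and
# reproduce its decision function f(x) = sum_i alpha_i y_i x_i^T x + b*.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)
y = 2 * y - 1                            # map labels {0,1} -> {-1,+1}

clf = SVC(kernel='linear', C=1e6).fit(X, y)

alpha_y = clf.dual_coef_[0]              # alpha_i * y_i, support vectors only
sv = clf.support_vectors_                # the x_i with alpha_i != 0
b = clf.intercept_[0]                    # b*

x0 = np.array([0.5, -0.5])               # a new input
f_manual = alpha_y @ (sv @ x0) + b       # sum_i alpha_i y_i (x_i^T x0) + b*
f_sklearn = clf.decision_function(x0.reshape(1, -1))[0]
print(np.sign(f_manual), np.isclose(f_manual, f_sklearn))  # same prediction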

Soft Margin

Model Setting

Soft Margin Linear SVM (slack variables)

  • The result from Linear SVM is beautiful! BUT
  • In practice, data are rarely Linearly Separable!
  • Soft Margin: Introducing slack variables \(\color{blue}{s_i},i=1,2,...,n\) and \[\begin{align*}&\min_{\color{blue}{\vec{w}}\in\mathbb{R}^d, \color{blue}{b,s_i}\in\mathbb{R}}\frac{1}{2}\|\color{blue}{\vec{w}}\|^2+\color{red}{C}\sum_{i=1}^n\color{blue}{s_i}\\ &\text{subject to: }y_i(\color{blue}{\vec{w}^T}\text{x}_i+\color{blue}{b})\geq 1-\color{blue}{s_i}\quad\text{and }s_i\geq 0,\text{ for all }i=1,2,...\end{align*}\]
  • Interpretation: Slack variables \(\color{blue}{s_i}\) relax the linearly separable condition by allowing some misclassified points.
  • The constant \(\color{red}{C}>0\) controls the trade-off between maximizing the margin and minimizing the classification error.
  • Lower \(\color{red}{C}\Leftrightarrow\) more relaxed: more points may fall within the margin (larger margin).
  • Larger \(\color{red}{C}\Leftrightarrow\) stricter: fewer points may fall within the margin (smaller margin); see the sketch below.
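
A minimal sketch of this trade-off on synthetic data (not the heart dataset): smaller \(C\) gives a wider margin and more support vectors.
Code
# Minimal sketch: effect of C on margin width and number of support vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {clf.n_support_.sum()}")
# Expected trend: lower C -> wider margin and more points inside it (more SVs).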

Model Setting

Prediction by Soft Margin Linear SVM

  • Soft Margin Solution: \(\color{blue}{\vec{w}^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i\text{x}_i}\) such that
    • \(\alpha_i=0\) for \(\text{x}_i\) NOT ON the margin (correctly classified, outside the margin).
    • If \(0<\color{green}{\alpha_{i_0}}<\color{red}{C}\), then \(\color{blue}{s_{i_0}}=0\) and \(\color{green}{\text{x}_{i_0}}\) lies EXACTLY ON the margin, so \[\color{blue}{b^*}=\color{green}{y_{i_0}}-\sum_{i=1}^n\alpha_iy_i\color{green}{\text{x}_{i_0}^T}[\text{x}_i].\]
    • If \(\color{green}{\alpha_{i_0}}=\color{red}{C}\), then \(\color{green}{\text{x}_{i_0}}\) FALLS WITHIN THE MARGIN (or is misclassified).

  • The decision boundary: \(f(\text{x})=\color{blue}{(\vec{w}^*)^T}\text{x}+\color{blue}{b^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\text{x}]+\color{blue}{b^*}.\)
  • Prediction rule: Given a new input \(\color{red}{\text{x}_0}\in\mathbb{R}^d\) we have \[\color{red}{y_0}=\text{sign}(f(\text{x}_0))=\text{sign}\Big(\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\color{red}{\text{x}_0}]+\color{blue}{b^*}\Big).\]

Hard & Soft Margin Linear SVM

Application

  • Let’s consider the Heart Disease Dataset (shape: \(1025\times 14\)).
  • The key parameter \(C\) is tuned using \(K\)-Fold CV.
  • Categorical inputs are encoded using one-hot encoding (OneHotEncoder).
  • Standardization is also applied.
Code
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv(path + "/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')


data.drop_duplicates(inplace=True)

# Train test split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
encoder = OneHotEncoder(drop='first', sparse_output=False)

# OnehotEncoding and Scaling
X_train_cat = encoder.fit_transform(X_train.select_dtypes(include="category"))
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = encoder.transform(X_test.select_dtypes(include="category"))
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

# Linear SVM
from sklearn.svm import SVC

svm_model = SVC(kernel='linear', random_state=42)

# Define the hyperparameter grid
param_grid = {'C': [0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.75, 1]}

# Perform GridSearch for optimal C
grid_search = GridSearchCV(svm_model, param_grid, cv=20, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Get the best parameter
best_C = grid_search.best_params_['C']

# Train final model using best C
final_model = SVC(kernel='linear', C=best_C, random_state=42)  # linear kernel, matching the tuned linear SVM
final_model.fit(X_train_encoded, y_train)

# Predict on test set
y_pred = final_model.predict(X_test_encoded)

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["LSVM"])
test_perf
Accuracy Precision Recall F1-score AUC
LSVM 0.852459 0.852941 0.878788 0.865672 0.850108
  • Best \(C=\) 0.1.

Model Setting

Summary: Hard & Soft Margin Linear SVM

  • Hard Margin Linear SVM: assumes that the classes are linearly separable.
  • Soft Margin Linear SVM: does not assume linear separability (uses slack variables).
  • The decision boundary: \(f(\text{x})=\color{blue}{(\vec{w}^*)^T}\text{x}+\color{blue}{b^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\text{x}]+\color{blue}{b^*}.\)
  • Prediction rule: Given a new input \(\color{red}{\text{x}_0}\in\mathbb{R}^d\) we have \[\color{red}{y_0}=\text{sign}(f(\text{x}_0))=\text{sign}\Big(\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\color{red}{\text{x}_0}]+\color{blue}{b^*}\Big).\]
  • A VERY IMPORTANT REMARK: the prediction depends on the training inputs only through the scalar products \(\color{blue}{\text{x}_i^T}\color{red}{\text{x}_0}\).
  • This observation lets us go beyond linear boundaries using the Kernel Trick.

Nonlinear SVM: Kernel Trick

Nonlinear SVM

Kernel Trick

  • Prediction rule: Given a new input \(\color{red}{\text{x}_0}\in\mathbb{R}^d\) we have \[\color{red}{y_0}=\text{sign}(f(\text{x}_0))=\text{sign}\Big(\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\color{red}{\text{x}_0}]+\color{blue}{b^*}\Big).\]
  • If \(K\) is a kernel function, for example,
    • Gaussian radial basis function (RBF): \(K(\text{x}_i, \text{x}_j)=e^{-\gamma\|\text{x}_i-\text{x}_j\|^2},\gamma>0\).
    • Polynomial kernel: \(K(\text{x}_i, \text{x}_j)=(r+\text{x}_i^T\text{x}_j)^p\), with \(p\) the polynomial degree.
  • A kernel function implicitly introduces features/transformations of the original inputs.
  • Instead of the scalar product \(\color{blue}{\text{x}_i^T}\color{red}{\text{x}_j}\), we can use the kernel value \(K(\color{blue}{\text{x}_i},\color{red}{\text{x}_j})\).
  • It works well in problems with nonlinear decision boundaries.
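
A minimal sketch computing the RBF kernel by hand and checking it against scikit-learn's pairwise helpers (rbf_kernel and polynomial_kernel from sklearn.metrics.pairwise); the arrays are random data used only for the check.
Code
# Minimal sketch: compute K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) by hand
# and compare with scikit-learn's built-in pairwise kernels.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # 5 points in R^3
B = rng.normal(size=(4, 3))   # 4 points in R^3
gamma = 0.5

# Manual RBF kernel: squared distances via broadcasting
sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

K_sklearn = rbf_kernel(A, B, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))   # True

# Polynomial kernel (1 + x_i^T x_j)^3
K_poly = polynomial_kernel(A, B, degree=3, gamma=1.0, coef0=1.0)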

Nonlinear SVM

Kernel Trick

  • Instead of the scalar product \(\color{blue}{\text{x}_i^T}\color{red}{\text{x}_j}\), we can use the kernel value \(K(\color{blue}{\text{x}_i},\color{red}{\text{x}_j})\).
  • In this case, the prediction rule becomes: for \(\color{red}{\text{x}_0}\in\mathbb{R}^d\), \[\color{red}{y_0}=\text{sign}(f(\text{x}_0))=\text{sign}\Big(\color{blue}{\sum_{i=1}^n\alpha_iy_i}K(\color{blue}{\text{x}_i},\color{red}{\text{x}_0})+\color{blue}{b^*}\Big).\]
  • It works well in problems with nonlinear decision boundaries.

Nonlinear SVM

Application

  • We work with Heart Disease Dataset.
  • The key parameter \(C\) is tuned using \(K\)-Fold CV.
  • We try 3rd degree Polynomial and RBF Kernels.
Code
svm_model = SVC(kernel='rbf', random_state=42)

# Define the hyperparameter grid
param_grid = {'C': [5, 6, 7, 8, 10, 12.5, 15, 17, 20, 30, 45, 50]}
# Perform GridSearch for optimal C
grid_search = GridSearchCV(svm_model, param_grid, cv=20, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Get the best parameter
best_C_rbf = grid_search.best_params_['C']

# Train final model using best C
final_model = SVC(kernel='rbf', C=best_C_rbf, random_state=42)
final_model.fit(X_train_encoded, y_train)

# Predict on test set
y_pred = final_model.predict(X_test_encoded)

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["RBF-SVM"])])

# Polynomial 3 kernel
svm_model = SVC(kernel='poly', degree=3, random_state=42)

# Perform GridSearch for optimal C
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Get the best parameter
best_C_poly = grid_search.best_params_['C']

# Train final model using best C
final_model = SVC(kernel='poly', degree=3, C=best_C_poly, random_state=42)
final_model.fit(X_train_encoded, y_train)

# Predict on test set
y_pred = final_model.predict(X_test_encoded)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Poly3-SVM"])])
test_perf
Accuracy Precision Recall F1-score AUC
LSVM 0.852459 0.852941 0.878788 0.865672 0.850108
RBF-SVM 0.819672 0.866667 0.787879 0.825397 0.822511
Poly3-SVM 0.852459 0.875000 0.848485 0.861538 0.852814
  • RBF: \(C=\) 5 and Degree-3 Polynomial: \(C=\) 5.

Soft vs Hard Margin SVM

Summary

  • Hard Linear SVM = No slack variable + Original input SVM.
  • Soft Linear SVM = With slack variable + Original input SVM.
  • Soft Nonlinear SVM = With slack variable + Kernel Feature SVM.
  • Key Parameters to be tuned using Cross-Validation:
    • \(C\): trade-off between maximizing the margin & minimizing the classification error.
    • Kernel: ‘rbf’, ‘poly’…
  • Preprocessing:
    • Encode categorical features: OneHotEncoder
    • Scale numerical features: StandardScaler, MinMaxScaler
    • Make sure there are no duplicated rows.
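
A minimal sketch of this preprocessing + tuning workflow wrapped in a Pipeline/ColumnTransformer, so the encoder and scaler are refit inside each CV fold; it assumes X_train and y_train are the raw (unencoded) training split from above, and the column lists follow the heart dataset.
Code
# Minimal sketch: one-hot encoding + scaling inside a Pipeline, so GridSearchCV
# refits the transformers within each CV fold (avoids leaking test information).
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

quan_vars = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
qual_vars = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), quan_vars),
    ('cat', OneHotEncoder(drop='first', sparse_output=False), qual_vars),
])

pipe = Pipeline([('prep', preprocess), ('svm', SVC(kernel='rbf'))])
param_grid = {'svm__C': [0.1, 1, 10], 'svm__gamma': ['scale', 0.01, 0.1]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
# grid.fit(X_train, y_train)   # X_train: raw (unencoded) training DataFrame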

2. Ensemble Methods

Bootstrap Aggregating Trees (Tree Bagging)

Tree Bagging

Bagging Algorithm

  • Decision trees are easy to construct but sometimes prone to overfitting (deep trees).
  • Tree Bagging leverages Bootstrap to stabilize predictions (reduce variance).
  • Algorithm:
    • for t = 1,2,..,T:
      • Bootstrap sample \(B_t\)
      • Build a tree \(f_t\) on \(B_t\)
    • Prediction:
      • Classification: Vote.
      • Regression: Average.
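
A minimal from-scratch sketch of the bagging loop for classification (bootstrap + majority vote), on synthetic data rather than the heart dataset.
Code
# Minimal sketch: tree bagging from scratch -- bootstrap samples, one tree per
# sample, majority vote at prediction time.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

T = 51                # number of trees (odd, to avoid ties in the vote)
n = len(X)
trees = []
for t in range(T):
    idx = rng.integers(0, n, size=n)   # bootstrap sample B_t (with replacement)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over the T trees (labels are 0/1 here)
all_preds = np.stack([tree.predict(X) for tree in trees])   # shape (T, n)
votes = (all_preds.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the vote:", (votes == y).mean())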

Regression Case

  • Bootstrap sample = Sample with replacement.
  • Variance reduction: averaging smooths out the predictions.

Tree Bagging

Application

  • Dataset: Heart Disease Dataset.
  • Key hyperparameters to be fine-tuned using CV:
    • Number of trees: \(T\)
    • Maximum samples of \(B_t\)
    • Tree hyperparameters: max_depth, min_samples_split, min_samples_leaf and criterion.
Code
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Define base Decision Tree (with tunable parameters)
base_tree = DecisionTreeClassifier()

# Set up Bagging Classifier
bagging_model = BaggingClassifier(estimator=base_tree)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200, 300, 500],  # Number of trees in Bagging
    'max_samples': [0.1, 0.25, 0.4, 0.5, 0.6],  # Proportion of samples per tree
    'estimator__min_samples_leaf': [3, 5, 7, 10],  # Minimum samples per leaf
    'estimator__criterion': ["gini", "entropy"],  # Splitting criterion
}

grid_search = GridSearchCV(bagging_model, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Best parameters and corresponding model
best_model = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")

# Make predictions with the best model
y_pred = best_model.predict(X_test_encoded)
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Bagging"])])
test_perf
Best parameters: {'estimator__criterion': 'gini', 'estimator__min_samples_leaf': 3, 'max_samples': 0.6, 'n_estimators': 500}
Accuracy Precision Recall F1-score AUC
LSVM 0.852459 0.852941 0.878788 0.865672 0.850108
RBF-SVM 0.819672 0.866667 0.787879 0.825397 0.822511
Poly3-SVM 0.852459 0.875000 0.848485 0.861538 0.852814
Bagging 0.803279 0.862069 0.757576 0.806452 0.807359

Random Forest

Random Forest

Decorrelating trees

  • Tree Bagging: combines trees built on different bootstrap samples.
  • However, bootstrap samples are not truly independent, and neither are the resulting trees.
  • Random Forest inserts ONE key step to introduce more randomness when building the trees.
  • Algorithm:
    • for t = 1,2,..,T:
      • Bootstrap sample \(B_t\)
      • Sample subset of features \(S_t\)
      • Build a tree \(f_t\) on \(B_t\) using only \(S_t\)
    • Prediction:
      • Classification: Vote.
      • Regression: Average.
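
A minimal sketch of the extra ingredient on synthetic data: in scikit-learn, max_features controls the size of the random feature subset, and the subset is redrawn at every split (rather than once per tree).
Code
# Minimal sketch: Random Forest = bagged trees + feature subsampling.
# max_features sets how many features are tried at each split.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# 'sqrt' (the default for classification) tries sqrt(d) features per split;
# max_features=None would consider all features, i.e. plain tree bagging.
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=0)
rf.fit(X, y)
print("training accuracy (optimistic):", rf.score(X, y))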

Random Forest

Application

  • Dataset: Heart Disease Dataset.
  • Key hyperparameters to be fine-tuned using CV:
    • Number of trees: \(T\)
    • Maximum samples of \(B_t\)
    • Maximum features \(S_t\): max_features
    • Tree hyperparameters: max_depth, min_samples_split, min_samples_leaf and criterion.
Code
from sklearn.ensemble import RandomForestClassifier

# Set up Random Forest Classifier
rf = RandomForestClassifier()

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 500],  # Number of trees in Bagging
    'max_samples': [0.25, 0.4, 0.5, 0.6, 0.75, 0.9],  # Proportion of samples per tree
    'max_depth' : [5, 10, 20, 25, 30, 35, 40],
    'criterion': ["gini", "entropy"],  # Splitting criterion
}

grid_search = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Best parameters and corresponding model
best_model = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")

# Make predictions with the best model
y_pred = best_model.predict(X_test_encoded)
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["RF"])])
test_perf
Best parameters: {'criterion': 'gini', 'max_depth': 5, 'max_samples': 0.6, 'n_estimators': 100}
Accuracy Precision Recall F1-score AUC
LSVM 0.852459 0.852941 0.878788 0.865672 0.850108
RBF-SVM 0.819672 0.866667 0.787879 0.825397 0.822511
Poly3-SVM 0.852459 0.875000 0.848485 0.861538 0.852814
Bagging 0.803279 0.862069 0.757576 0.806452 0.807359
RF 0.868852 0.903226 0.848485 0.875000 0.870671

Extra-trees

Extremely Randomized Trees

Split at random values

  • Random Forest has been successful in various tasks.
  • Extra-Trees is another variation of RF that aims to introduce even more randomness when building the individual trees.
  • Algorithm:
    • for t = 1,2,..,T:
      • Sample subset of features \(S_t\)
      • Build a tree \(f_t\) on the full training set (no bootstrap by default) using only \(S_t\), where the split thresholds are chosen at random.
  • Prediction:
    • Classification: Vote.
    • Regression: Average.

Extra-Trees

Application

  • Dataset: Heart Disease Dataset.
  • Key hyperparameters to be fine-tuned using CV:
    • Number of trees: \(T\)
    • Maximum features: max_features
    • Tree hyperparameters: max_depth, min_samples_split, min_samples_leaf and criterion.
Code
from sklearn.ensemble import ExtraTreesClassifier

# Set up Extra-Trees Classifier
extr = ExtraTreesClassifier()

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 350, 500],  # Number of trees in Bagging
    'max_depth' : [5, 7, 10, 20, 30, 40],
    'max_features': [2, 3, 5, 7, 9],
    'criterion': ["gini", "entropy"],  # Splitting criterion
}

grid_search = GridSearchCV(extr, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Best parameters and corresponding model
best_model = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")

# Make predictions with the best model
y_pred = best_model.predict(X_test_encoded)
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Extra-Trees"])])
test_perf
Best parameters: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 3, 'n_estimators': 100}
Accuracy Precision Recall F1-score AUC
LSVM 0.852459 0.852941 0.878788 0.865672 0.850108
RBF-SVM 0.819672 0.866667 0.787879 0.825397 0.822511
Poly3-SVM 0.852459 0.875000 0.848485 0.861538 0.852814
Bagging 0.803279 0.862069 0.757576 0.806452 0.807359
RF 0.868852 0.903226 0.848485 0.875000 0.870671
Extra-Trees 0.819672 0.823529 0.848485 0.835821 0.817100

Summary: Bagging, Random Forest, Extra-Trees

  • Each method relies on two main ideas:

    • Build many (roughly) independent tree base predictors.
    • Combine them by averaging or voting over their predictions.
  • The individual trees within each method may be too flexible (high variance); however, averaging/voting yields more stable predictions.

  • There are many hyperparameters to be optimized (CV):

    • Number of trees
    • Number of features used to build each tree
    • Individual tree structure
      • Depth of each tree
      • Minimum samples at the leaf node
      • Impurity criterion…

Boosting & XGBoost

Boosting

Algorithm

  • T: the number of trees
  • for t = 1, 2,..., T:
    • Build a weak learner \(f_t\) by minimizing the weighted error \(\varepsilon_t\).
    • Learn and adjust weights:
      • Compute the learner weight \(w_t>0\) from the weighted error under the sample weights \(\{D_t(i):i=1,\dots,n\}\).
      • Update the sample weights \(D_t(i)\to D_{t+1}(i)\) for each observation \(i\).
  • Combined model: \(\color{blue}{\hat{f}(x)=\sum_{t=1}^Tw_tf_t(x)}\).

Boosting

Adaboost: Adaptive Boosting

  • It’s for binary classification with \((\text{x}_i,y_i)\in\mathbb{R}^d\times\{-1,1\},i=1,\dots,n\).
  • Initialize the weight \(D_1(i)=1/n\) for all sample \(\text{x}_i\).
  • for t = 1, . . . , T:
    • Train the base classifier \(f_t\) by minimizing the weighted error \[\varepsilon_t=\mathbb{P}_{i\sim D_t}(f_t(\text{x}_i)\neq y_i)=\sum_{i=1}^nD_t(i)\mathbb{1}_{\{f_t(\text{x}_i)\neq y_i\}}.\]
    • Calculate the learner weight \(w_t=\frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right).\)
    • Update the sample weights using \(D_{t+1}(i)=\frac{D_t(i)}{Z_t}e^{-w_ty_if_t(\text{x}_i)}\), where \(Z_t\) is the normalization constant.
  • The final model: \(\color{blue}{\hat{f}(\text{x})=\text{sign}(\sum_{t=1}^Tw_tf_t(\text{x}))}\). (Read: Freund & Schapire (1999))
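
A minimal from-scratch sketch of these update rules with decision stumps as weak learners, on synthetic data with labels in \(\{-1,1\}\).
Code
# Minimal sketch: AdaBoost with decision stumps, following the updates above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y - 1                        # labels in {-1, +1}
n = len(X)

T = 30
D = np.full(n, 1 / n)                # D_1(i) = 1/n
stumps, w = [], []

for t in range(T):
    f_t = DecisionTreeClassifier(max_depth=1)        # weak learner (stump)
    f_t.fit(X, y, sample_weight=D)
    pred = f_t.predict(X)
    eps = D[pred != y].sum()                         # weighted error eps_t
    eps = np.clip(eps, 1e-10, 1 - 1e-10)             # numerical safeguard
    w_t = 0.5 * np.log((1 - eps) / eps)              # learner weight
    D = D * np.exp(-w_t * y * pred)                  # reweight samples
    D = D / D.sum()                                  # normalize (Z_t)
    stumps.append(f_t); w.append(w_t)

# Final model: sign of the weighted sum of the stumps
F = np.sign(sum(w_t * f_t.predict(X) for w_t, f_t in zip(w, stumps)))
print("training accuracy:", (F == y).mean())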

Boosting

XGBoost: EXtreme Gradient Boosting

  • It’s a special case of Gradient Boosting: \(F_M(\text{x})=\sum_{t=1}^Mw_tf_t(\text{x})\).
  • Let \(L\) be the loss function of the problem:
    • Initial model: \(F_0=\arg\min_{c}\sum_{i=1}^nL(y_i,c).\)
    • for t = 1, 2, ..., M:
      • Compute the pseudo-residuals (negative gradients): \(r_{t,i}=-\frac{\partial L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)}\Big|_{F=F_{t-1}}\).
      • Train a base learner on the new data: \(\{(\text{x}_i,r_{t,i})\}_{i=1}^n.\)
      • Solve for \(\color{blue}{f_t}\) and \(\color{blue}{w_t}\) from: \[(\color{blue}{w_t}, \color{blue}{f_t(\text{x})})=\arg\min_{\color{blue}{w,f}}\sum_{i=1}^nL(y_i,F_{t-1}(\text{x}_i)+\color{blue}{wf(\text{x}_i)}).\]
      • Update aggregated model: \(F_t(\text{x})=F_{t-1}(\text{x})+\color{blue}{w_tf_t(\text{x})}\).
  • In XGBoost:

\[\begin{align*}L(F_t)&=\sum_{i=1}^nL\big(y_i,F_{t-1}(\text{x}_i)+f_t(\text{x}_i)\big)+\sum_{m=1}^t\Omega(f_m),\\ \Omega(f)&=\gamma T_f+\frac{1}{2}\lambda\|w\|^2\quad (T_f:\text{ number of leaves of }f,\ w:\text{ its leaf weights}),\\ L(F_t)&\approx \sum_{i=1}^n\Big[g_if_t(\text{x}_i)+\frac{1}{2}h_if_t^2(\text{x}_i)\Big]+\Omega(f_t)+\text{constant},\\ g_i&=\frac{\partial L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)}\Big|_{F=F_{t-1}},\qquad h_i=\frac{\partial^2 L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)^2}\Big|_{F=F_{t-1}}. \end{align*}\]
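
A minimal usage sketch with scikit-learn's GradientBoostingClassifier on synthetic data; the commented lines assume the separate xgboost package is installed and use its scikit-learn-style wrapper, where reg_lambda and gamma correspond to the \(\lambda\) and \(\gamma\) penalties above.
Code
# Minimal sketch: gradient boosting on synthetic data with scikit-learn;
# the commented xgboost lines assume the xgboost package is installed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gb.fit(X_tr, y_tr)
print("GBM test accuracy:", gb.score(X_te, y_te))

# from xgboost import XGBClassifier
# xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
#                     reg_lambda=1.0, gamma=0.0)  # lambda, gamma: Omega(f) penalties
# xgb.fit(X_tr, y_tr)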

Boosting

Other variants

  • LightGBM: Light Gradient Boosting Machine, by researchers at Microsoft (2017). It works for regression, classification and ranking problems. Its main improvements,
    • Gradient-based One-Sided Sampling (GOSS) and
    • Exclusive Feature Bundling (EFB),

ensure the efficiency and accuracy of the method.

  • CatBoost: efficient with categorical features (automatic transformation) and uses unbiased estimates of the gradient at each iteration (see: Prokhorenkova et al. (2017)).

Boosting

Numerical Experiment: Heart Disease Dataset




👉 Check out the notebook.

Feature Importance

Feature Importances (FI)

Mean Decrease in Impurity (MDI)

  • The Reduction of Impurity Measure when splitting at \(X_j\) at step \(t\) is \[\text{RIM}_t(X_j)=\text{Imp}_{t-1}(X_j)-\text{Imp}_t(X_j).\]
  • The importance of feature \(X_j\) within a tree: \(\sum_{t\in S_j}\text{RIM}_t(X_j),\) where \(S_j\) is the set of indices \(t\) at which a split is performed on \(X_j\).
  • The unnormalized FI of variable \(X_j\) is the sum of its FIs over all the built trees.
  • ⚠️ Might be biased toward high cardinality features and might not be accurate with highly correlated features.
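
A minimal sketch on synthetic data: for a fitted scikit-learn tree (or tree ensemble), the feature_importances_ attribute is the normalized MDI score.
Code
# Minimal sketch: MDI feature importance of a single decision tree.
# feature_importances_ is the (normalized) total impurity decrease per feature.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
for j, imp in enumerate(tree.feature_importances_):
    print(f"X{j}: {imp:.3f}")   # importances sum to 1; informative features dominate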

Permutation Feature Importance

  • Train the initial model, then compute its validation error, denoted \(\color{blue}{\text{Er}_0}\).
  • For each \(X_j\), shuffle its column in the validation data, then re-evaluate the trained model to obtain the error \(\color{red}{\text{Er}_j}\).
  • The importance score for \(X_j\) is proportional to \(\color{red}{\text{Er}_j}-\color{blue}{\text{Er}_0}\).
  • It reflects the direct influence of each feature on the model’s predictions.
  • ⚠️ Might be sensitive to data splits and is computationally more expensive.
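
A minimal sketch using scikit-learn's permutation_importance on synthetic data: it re-evaluates the already-fitted model on shuffled copies of each column (no retraining).
Code
# Minimal sketch: permutation feature importance with scikit-learn.
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_val, y_val, n_repeats=20,
                                scoring='accuracy', random_state=0)
for j in result.importances_mean.argsort()[::-1]:
    print(f"X{j}: {result.importances_mean[j]:.3f} +/- {result.importances_std[j]:.3f}")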

Feature Importances (FI)

Feature Importance by Random Forest

  • Heart Disease Dataset
Code
# MDI importances of the best tree-ensemble model from the last grid search
importances = best_model.feature_importances_
importance_df = pd.DataFrame({'Feature': list(X_train.select_dtypes(include="number").columns)+list(encoder.get_feature_names_out()), 'Importance': importances})

# Sort by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

import plotly.express as px
# Plot feature importance using Plotly
fig = px.bar(importance_df, x='Feature', y='Importance', title='Feature Importance (Random Forest)',
             labels={'Feature': 'Features', 'Importance': 'Importance Score'},
             text=np.round(importance_df['Importance'], 3),
             color='Importance', color_continuous_scale='Viridis')

fig.update_layout(xaxis_title="Features", yaxis_title="Importance Score", coloraxis_showscale=False, height=400, width=900)
fig.show()

🥳 It’s party time 🥂