Advanced Classification: SVM & Ensemble Methods


CSCI-866-001: Data Mining & Knowledge Discovery



Lecturer: Dr. Sothea HAS

🗺️ Content

  • Support Vector Machine
    • Model Setting: Hard Margin
    • Soft Margin
    • Kernel Tricks
  • Ensemble Learning
    • Bootstrap Aggregating Trees (Tree Bagging)
    • Random Forest
    • Extremely Randomized Trees (Extra-trees)
    • Gradient Boosting & XGBoost
    • Feature Importance

1. Support Vector Machine (SVM)

Model Setting

Model Setting: Hard Margin

Linear SVM vs Logistic Regression

  • We are still working in binary classification:
    • Input: \(\text{x}\in\mathbb{R}^d\)
    • Target: \(y\in\{-1,1\}\) (for the sake of math).
  • Methods: NBC, Logistic Regression, KNN, Trees.
Code
import numpy as np
import pandas as pd
data = pd.read_csv(path + "/heart.csv")
data[['age', 'chol','target']].sample(3, random_state=42)
age chol target
527 62 209 1
359 53 216 1
447 55 289 0

Logistic Regression

  • Model Assumption: \[\mathbb{P}(Y=1|X=\text{x})=\frac{1}{1+e^{-\beta_0-\beta^T\text{x}}}.\] (The decision boundary is linear.)

Linear SVM

  • Assumption 1:
    • The decision boundary is linear &
    • It has Maximum Margin
  • Boundary form: \(f(\text{x})=\color{blue}{\vec{w}^T}\text{x}+\color{blue}{b}.\)
  • Objective: Find the best \(\color{blue}{\vec{w}}\).

Model Setting

Hard Margin Linear SVM

  • Assumption 2: The classes are linearly separable (LS).
  • Objective: Find the best \(\color{blue}{\vec{w}}\).
  • Simple maths implies that \(\text{Margin}\propto 1/\|\color{blue}{\vec{w}}\|,\) with
    • \(\|\color{blue}{\vec{w}}\|:\) norm of \(\color{blue}{\vec{w}}\) (size/length of \(\color{blue}{\vec{w}}\)).
  • Goal: The problem now reduces to finding \(\color{blue}{\vec{w}}\) that minimizes \(\|\color{blue}{\vec{w}}\|\)
    while still separating the classes well!
  • Final problem: \[\begin{align*}&\min_{\color{blue}{\vec{w}}\in\mathbb{R}^d, \color{blue}{b}\in\mathbb{R}}\frac{1}{2}\|\color{blue}{\vec{w}}\|^2\qquad (\text{maximizing the margin})\\ &\text{subject to: }y_i(\color{blue}{\vec{w}^T}\text{x}_i+\color{blue}{b})\geq 1,\text{ for all }i=1,2,\dots,n\quad (\text{separate the classes well!})\end{align*}\]
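
As a concrete illustration, here is a minimal sketch on hypothetical synthetic data (not the course dataset): scikit-learn has no explicit hard-margin mode, so a very large \(C\) is used to approximate it.
Code
# Minimal sketch: approximate a hard-margin linear SVM on a toy linearly
# separable dataset by using a very large C in scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))    # class +1
X_neg = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))  # class -1
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

hard_svm = SVC(kernel='linear', C=1e6)   # huge C ~ hard margin
hard_svm.fit(X, y)

w = hard_svm.coef_[0]        # learned w*
b = hard_svm.intercept_[0]   # learned b*
print("w*:", w, " b*:", b, " margin width:", 2 / np.linalg.norm(w))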

Model Setting

Solution of Hard Margin Linear SVM

  • Final problem: \[\begin{align*}&\min_{\color{blue}{\vec{w}}\in\mathbb{R}^d, \color{blue}{b}\in\mathbb{R}}\frac{1}{2}\|\color{blue}{\vec{w}}\|^2\qquad (\text{maximizing the margin})\\ &\text{subject to: }y_i(\color{blue}{\vec{w}^T}\text{x}_i+\color{blue}{b})\geq 1,\text{ for all }i=1,2,\dots,n\quad (\text{separate the classes well!})\end{align*}\]
  • Solution: \[\color{blue}{\vec{w}^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i\text{x}_i}\] such that
    • \(\alpha_i=0\) for \(\text{x}_i\) NOT ON the margin.
    • \(\color{green}{\alpha_{i_0}}\neq 0\) for \(\color{green}{\text{x}_{i_0}}\) ON the margin and therefore, \[\color{blue}{b^*}=\color{green}{y_{i_0}}-\sum_{i=1}^n\alpha_iy_i\color{green}{\text{x}_{i_0}^T}[\text{x}_i].\]

Model Setting

Prediction by Hard Margin Linear SVM

  • Solution: \[\color{blue}{\vec{w}^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i\text{x}_i}\] such that
    • \(\alpha_i=0\) for \(\text{x}_i\) NOT ON the margin.
    • \(\color{green}{\alpha_{i_0}}\neq 0\) for \(\color{green}{\text{x}_{i_0}}\) ON the margin and therefore, \[\color{blue}{b^*}=\color{green}{y_{i_0}}-\sum_{i=1}^n\alpha_iy_i\color{green}{\text{x}_{i_0}^T}[\text{x}_i].\]

  • The decision boundary: \(f(\text{x})=\color{blue}{(\vec{w}^*)^T}\text{x}+\color{blue}{b^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\text{x}]+\color{blue}{b^*}.\)
  • Prediction rule: Given a new input \(\color{red}{\text{x}_0}\in\mathbb{R}^d\) we have \[\color{red}{y_0}=\text{sign}(f(\text{x}_0))=\text{sign}\Big(\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\color{red}{\text{x}_0}]+\color{blue}{b^*}\Big).\]
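
A minimal sketch (again on synthetic data) showing that scikit-learn's SVC exposes exactly the quantities above: dual_coef_ stores \(\alpha_iy_i\) for the support vectors, support_vectors_ stores the \(\text{x}_i\) with \(\alpha_i\neq 0\), and intercept_ stores \(b^*\); the manual sum reproduces decision_function.
Code
# Minimal sketch: read off the dual solution of a fitted linear SVC and
# reproduce its decision function f(x) = sum_i alpha_i y_i x_i^T x + b*.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)
y = 2 * y - 1                            # map labels {0,1} -> {-1,+1}

clf = SVC(kernel='linear', C=1e6).fit(X, y)

alpha_y = clf.dual_coef_[0]              # alpha_i * y_i, support vectors only
sv = clf.support_vectors_                # the x_i with alpha_i != 0
b = clf.intercept_[0]                    # b*

x0 = np.array([0.5, -0.5])               # a new input
f_manual = alpha_y @ (sv @ x0) + b       # sum_i alpha_i y_i (x_i^T x0) + b*
f_sklearn = clf.decision_function(x0.reshape(1, -1))[0]
print(np.sign(f_manual), np.isclose(f_manual, f_sklearn))  # same prediction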

Soft Margin

Model Setting

Soft Margin Linear SVM (slack variables)

  • The result from Linear SVM is beautiful! BUT
  • In practice, data are rarely Linearly Separable!
  • Soft Margin: Introducing slack variables \(\color{blue}{s_i},i=1,2,...,n\) and \[\begin{align*}&\min_{\color{blue}{\vec{w}}\in\mathbb{R}^d, \color{blue}{b,s_i}\in\mathbb{R}}\frac{1}{2}\|\color{blue}{\vec{w}}\|^2+\color{red}{C}\sum_{i=1}^n\color{blue}{s_i}\\ &\text{subject to: }y_i(\color{blue}{\vec{w}^T}\text{x}_i+\color{blue}{b})\geq 1-\color{blue}{s_i}\quad\text{and }s_i\geq 0,\text{ for all }i=1,2,...\end{align*}\]
  • Interpretation: Slack variables \(\color{blue}{s_i}\) relax the linearly separable condition by allowing some misclassified points.
  • The constant \(\color{red}{C}>0\) controls the trade-off between maximizing the margin and minimizing the classification error.
  • Lower \(\color{red}{C}\Leftrightarrow\) more relaxed: more points may fall within the margin (larger margin).
  • Larger \(\color{red}{C}\Leftrightarrow\) stricter: fewer points may fall within the margin (smaller margin); see the sketch below.
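
A minimal sketch of this trade-off on synthetic data (not the heart dataset): smaller \(C\) gives a wider margin and more support vectors.
Code
# Minimal sketch: effect of C on margin width and number of support vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {clf.n_support_.sum()}")
# Expected trend: lower C -> wider margin and more points inside it (more SVs).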

Model Setting

Prediction by Soft Margin Linear SVM

  • Soft Margin Solution: \(\color{blue}{\vec{w}^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i\text{x}_i}\) such that
    • \(\alpha_i=0\) for \(\text{x}_i\) NOT ON the margin (correctly classified, outside the margin).
    • If \(0<\color{green}{\alpha_{i_0}}<\color{red}{C}\), then \(\color{blue}{s_{i_0}}=0\) and \(\color{green}{\text{x}_{i_0}}\) lies EXACTLY ON the margin, so \[\color{blue}{b^*}=\color{green}{y_{i_0}}-\sum_{i=1}^n\alpha_iy_i\color{green}{\text{x}_{i_0}^T}[\text{x}_i].\]
    • If \(\color{green}{\alpha_{i_0}}=\color{red}{C}\), then \(\color{green}{\text{x}_{i_0}}\) FALLS WITHIN THE MARGIN (or is misclassified).

  • The decision boundary: \(f(\text{x})=\color{blue}{(\vec{w}^*)^T}\text{x}+\color{blue}{b^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\text{x}]+\color{blue}{b^*}.\)
  • Prediction rule: Given a new input \(\color{red}{\text{x}_0}\in\mathbb{R}^d\) we have \[\color{red}{y_0}=\text{sign}(f(\text{x}_0))=\text{sign}\Big(\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\color{red}{\text{x}_0}]+\color{blue}{b^*}\Big).\]

Hard & Soft Margin Linear SVM

Application

  • Let’s consider the Heart Disease Dataset (shape: \(1025\times 14\)).
  • The key parameter \(C\) is tuned using \(K\)-Fold CV.
  • Categorical inputs are encoded using one-hot encoding (OneHotEncoder).
  • Standardization is also applied.
Code
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv(path + "/heart.csv")
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

# Convert to correct types
for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')


data.drop_duplicates(inplace=True)

# Train test split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
encoder = OneHotEncoder(drop='first', sparse_output=False)

# OnehotEncoding and Scaling
X_train_cat = encoder.fit_transform(X_train.select_dtypes(include="category"))
X_train_encoded = scaler.fit_transform(np.column_stack([X_train.select_dtypes(include="number").to_numpy(), X_train_cat]))
X_test_cat = encoder.transform(X_test.select_dtypes(include="category"))
X_test_encoded = scaler.transform(np.column_stack([X_test.select_dtypes(include="number").to_numpy(), X_test_cat]))

# Linear SVM
from sklearn.svm import SVC

svm_model = SVC(kernel='linear', random_state=42)

# Define the hyperparameter grid
param_grid = {'C': [0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.75, 1]}

# Perform GridSearch for optimal C
grid_search = GridSearchCV(svm_model, param_grid, cv=20, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Get the best parameter
best_C = grid_search.best_params_['C']

# Train final model using best C
final_model = SVC(kernel='linear', C=best_C, random_state=42)  # linear kernel, matching the tuned linear SVM
final_model.fit(X_train_encoded, y_train)

# Predict on test set
y_pred = final_model.predict(X_test_encoded)

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["LSVM"])
test_perf
Accuracy Precision Recall F1-score AUC
LSVM 0.852459 0.852941 0.878788 0.865672 0.850108
  • Best \(C=\) 0.1.

Model Setting

Summary: Hard & Soft Margin Linear SVM

  • Hard Margin Linear SVM: assumes that the classes are linearly separable.
  • Soft Margin Linear SVM: does not assume linear separability (uses slack variables).
  • The decision boundary: \(f(\text{x})=\color{blue}{(\vec{w}^*)^T}\text{x}+\color{blue}{b^*}=\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\text{x}]+\color{blue}{b^*}.\)
  • Prediction rule: Given a new input \(\color{red}{\text{x}_0}\in\mathbb{R}^d\) we have \[\color{red}{y_0}=\text{sign}(f(\text{x}_0))=\text{sign}\Big(\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\color{red}{\text{x}_0}]+\color{blue}{b^*}\Big).\]
  • A VERY IMPORTANT REMARK: the prediction depends on the training inputs only through the scalar products \(\color{blue}{\text{x}_i^T}\color{red}{\text{x}_0}\).
  • This observation lets us go beyond linear boundaries using the Kernel Trick.

Nonlinear SVM: Kernel Trick

Nonlinear SVM

Kernel Trick

  • Prediction rule: Given a new input \(\color{red}{\text{x}_0}\in\mathbb{R}^d\) we have \[\color{red}{y_0}=\text{sign}(f(\text{x}_0))=\text{sign}\Big(\color{blue}{\sum_{i=1}^n\alpha_iy_i}[\color{blue}{\text{x}_i^T}\color{red}{\text{x}_0}]+\color{blue}{b^*}\Big).\]
  • If \(K\) is a kernel function, for example,
    • Gaussian radial basis function (RBF): \(K(\text{x}_i, \text{x}_j)=e^{-\gamma\|\text{x}_i-\text{x}_j\|^2},\gamma>0\).
    • Polynomial kernel: \(K(\text{x}_i, \text{x}_j)=(r+\text{x}_i^T\text{x}_j)^p\), with \(p\) the polynomial degree.
  • A kernel function implicitly introduces features/transformations of the original inputs.
  • Instead of the scalar product \(\color{blue}{\text{x}_i^T}\color{red}{\text{x}_j}\), we can use the kernel value \(K(\color{blue}{\text{x}_i},\color{red}{\text{x}_j})\).
  • It works well in problems with nonlinear decision boundaries.
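
A minimal sketch computing the RBF kernel by hand and checking it against scikit-learn's pairwise helpers (rbf_kernel and polynomial_kernel from sklearn.metrics.pairwise); the arrays are random data used only for the check.
Code
# Minimal sketch: compute K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) by hand
# and compare with scikit-learn's built-in pairwise kernels.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # 5 points in R^3
B = rng.normal(size=(4, 3))   # 4 points in R^3
gamma = 0.5

# Manual RBF kernel: squared distances via broadcasting
sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

K_sklearn = rbf_kernel(A, B, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))   # True

# Polynomial kernel (1 + x_i^T x_j)^3
K_poly = polynomial_kernel(A, B, degree=3, gamma=1.0, coef0=1.0)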

Nonlinear SVM

Kernel Trick

  • Instead of the scalar product \(\color{blue}{\text{x}_i^T}\color{red}{\text{x}_j}\), we can use the kernel value \(K(\color{blue}{\text{x}_i},\color{red}{\text{x}_j})\).
  • In this case, the prediction rule becomes: for \(\color{red}{\text{x}_0}\in\mathbb{R}^d\), \[\color{red}{y_0}=\text{sign}(f(\text{x}_0))=\text{sign}\Big(\color{blue}{\sum_{i=1}^n\alpha_iy_i}K(\color{blue}{\text{x}_i},\color{red}{\text{x}_0})+\color{blue}{b^*}\Big).\]
  • It works well in problems with nonlinear decision boundaries.

Nonlinear SVM

Application

  • We work with Heart Disease Dataset.
  • The key parameter \(C\) is tuned using \(K\)-Fold CV.
  • We try 3rd degree Polynomial and RBF Kernels.
Code
svm_model = SVC(kernel='rbf', random_state=42)

# Define the hyperparameter grid
param_grid = {'C': [5, 6, 7, 8, 10, 12.5, 15, 17, 20, 30, 45, 50]}
# Perform GridSearch for optimal C
grid_search = GridSearchCV(svm_model, param_grid, cv=20, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Get the best parameter
best_C_rbf = grid_search.best_params_['C']

# Train final model using best C
final_model = SVC(kernel='rbf', C=best_C_rbf, random_state=42)
final_model.fit(X_train_encoded, y_train)

# Predict on test set
y_pred = final_model.predict(X_test_encoded)

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["RBF-SVM"])])

# Polynomial 3 kernel
svm_model = SVC(kernel='poly', degree=3, random_state=42)

# Perform GridSearch for optimal C
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Get the best parameter
best_C_poly = grid_search.best_params_['C']

# Train final model using best C
final_model = SVC(kernel='poly', degree=3, C=best_C_poly, random_state=42)
final_model.fit(X_train_encoded, y_train)

# Predict on test set
y_pred = final_model.predict(X_test_encoded)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Poly3-SVM"])])
test_perf
Accuracy Precision Recall F1-score AUC
LSVM 0.852459 0.852941 0.878788 0.865672 0.850108
RBF-SVM 0.819672 0.866667 0.787879 0.825397 0.822511
Poly3-SVM 0.852459 0.875000 0.848485 0.861538 0.852814
  • RBF: \(C=\) 5 and Degree-3 Polynomial: \(C=\) 5.

Soft vs Hard Margin SVM

Summary

  • Hard Linear SVM = No slack variable + Original input SVM.
  • Soft Linear SVM = With slack variable + Original input SVM.
  • Soft Nonlinear SVM = With slack variable + Kernel Feature SVM.
  • Key Parameters to be tuned using Cross-Validation:
    • \(C\): trade-off between maximizing the margin & minimizing the classification error.
    • Kernel: ‘rbf’, ‘poly’…
  • Preprocessing:
    • Encode categorical features: OneHotEncoder
    • Scale numerical features: StandardScaler, MinMaxScaler
    • Make sure there are no duplicated rows.
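
A minimal sketch of this preprocessing + tuning workflow wrapped in a Pipeline/ColumnTransformer, so the encoder and scaler are refit inside each CV fold; it assumes X_train and y_train are the raw (unencoded) training split from above, and the column lists follow the heart dataset.
Code
# Minimal sketch: one-hot encoding + scaling inside a Pipeline, so GridSearchCV
# refits the transformers within each CV fold (avoids leaking test information).
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

quan_vars = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
qual_vars = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), quan_vars),
    ('cat', OneHotEncoder(drop='first', sparse_output=False), qual_vars),
])

pipe = Pipeline([('prep', preprocess), ('svm', SVC(kernel='rbf'))])
param_grid = {'svm__C': [0.1, 1, 10], 'svm__gamma': ['scale', 0.01, 0.1]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
# grid.fit(X_train, y_train)   # X_train: raw (unencoded) training DataFrame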

2. Ensemble Methods

Bootstrap Aggregating Trees (Tree Bagging)

Tree Bagging

Bagging Algorithm

  • Decision trees are easy to construct but sometimes prone to overfitting (deep trees).
  • Tree Bagging leverages Bootstrap to stabilize predictions (reduce variance).
  • Algorithm:
    • for t = 1,2,..,T:
      • Bootstrap sample \(B_t\)
      • Build a tree \(f_t\) on \(B_t\)
    • Prediction:
      • Classification: Vote.
      • Regression: Average.
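
A minimal from-scratch sketch of the bagging loop for classification (bootstrap + majority vote), on synthetic data rather than the heart dataset.
Code
# Minimal sketch: tree bagging from scratch -- bootstrap samples, one tree per
# sample, majority vote at prediction time.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

T = 51                # number of trees (odd, to avoid ties in the vote)
n = len(X)
trees = []
for t in range(T):
    idx = rng.integers(0, n, size=n)   # bootstrap sample B_t (with replacement)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over the T trees (labels are 0/1 here)
all_preds = np.stack([tree.predict(X) for tree in trees])   # shape (T, n)
votes = (all_preds.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the vote:", (votes == y).mean())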

Regression Case

  • Bootstrap sample = Sample with replacement.
  • Variance reduction: averaging smooths out the predictions.

Tree Bagging

Application

  • Dataset: Heart Disease Dataset.
  • Key hyperparameters to be fine-tuned using CV:
    • Number of trees: \(T\)
    • Maximum samples of \(B_t\)
    • Tree hyperparameters: max_depth, min_samples_split, min_samples_leaf and criterion.
Code
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Define base Decision Tree (with tunable parameters)
base_tree = DecisionTreeClassifier()

# Set up Bagging Classifier
bagging_model = BaggingClassifier(estimator=base_tree)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200, 300, 500],  # Number of trees in Bagging
    'max_samples': [0.1, 0.25, 0.4, 0.5, 0.6],  # Proportion of samples per tree
    'estimator__min_samples_leaf': [3, 5, 7, 10],  # Minimum samples per leaf
    'estimator__criterion': ["gini", "entropy"],  # Splitting criterion
}

grid_search = GridSearchCV(bagging_model, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Best parameters and corresponding model
best_model = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")

# Make predictions with the best model
y_pred = best_model.predict(X_test_encoded)
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Bagging"])])
test_perf
Best parameters: {'estimator__criterion': 'gini', 'estimator__min_samples_leaf': 3, 'max_samples': 0.6, 'n_estimators': 500}
Accuracy Precision Recall F1-score AUC
LSVM 0.852459 0.852941 0.878788 0.865672 0.850108
RBF-SVM 0.819672 0.866667 0.787879 0.825397 0.822511
Poly3-SVM 0.852459 0.875000 0.848485 0.861538 0.852814
Bagging 0.803279 0.862069 0.757576 0.806452 0.807359

Random Forest

Random Forest

Decorrelating trees

  • Tree Bagging: combines trees built on different bootstrap samples.
  • However, bootstrap samples are not truly independent, and neither are the resulting trees.
  • Random Forest inserts ONE key step to introduce more randomness when building the trees.
  • Algorithm:
    • for t = 1,2,..,T:
      • Bootstrap sample \(B_t\)
      • Sample subset of features \(S_t\)
      • Build a tree \(f_t\) on \(B_t\) using only \(S_t\)
    • Prediction:
      • Classification: Vote.
      • Regression: Average.
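
A minimal sketch of the extra ingredient on synthetic data: in scikit-learn, max_features controls the size of the random feature subset, and the subset is redrawn at every split (rather than once per tree).
Code
# Minimal sketch: Random Forest = bagged trees + feature subsampling.
# max_features sets how many features are tried at each split.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# 'sqrt' (the default for classification) tries sqrt(d) features per split;
# max_features=None would consider all features, i.e. plain tree bagging.
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=0)
rf.fit(X, y)
print("training accuracy (optimistic):", rf.score(X, y))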

Random Forest

Application

  • Dataset: Heart Disease Dataset.
  • Key hyperparameters to be fine-tuned using CV:
    • Number of trees: \(T\)
    • Maximum samples of \(B_t\)
    • Maximum features \(S_t\): max_features
    • Tree hyperparameters: max_depth, min_samples_split, min_samples_leaf and criterion.
Code
from sklearn.ensemble import RandomForestClassifier

# Set up Random Forest Classifier
rf = RandomForestClassifier()

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 500],  # Number of trees in Bagging
    'max_samples': [0.25, 0.4, 0.5, 0.6, 0.75, 0.9],  # Proportion of samples per tree
    'max_depth' : [5, 10, 20, 25, 30, 35, 40],
    'criterion': ["gini", "entropy"],  # Splitting criterion
}

grid_search = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Best parameters and corresponding model
best_model = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")

# Make predictions with the best model
y_pred = best_model.predict(X_test_encoded)
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["RF"])])
test_perf
Best parameters: {'criterion': 'gini', 'max_depth': 5, 'max_samples': 0.6, 'n_estimators': 100}
Accuracy Precision Recall F1-score AUC
LSVM 0.852459 0.852941 0.878788 0.865672 0.850108
RBF-SVM 0.819672 0.866667 0.787879 0.825397 0.822511
Poly3-SVM 0.852459 0.875000 0.848485 0.861538 0.852814
Bagging 0.803279 0.862069 0.757576 0.806452 0.807359
RF 0.868852 0.903226 0.848485 0.875000 0.870671

Extra-trees

Extremely Randomized Trees

Split at random values

  • Random Forest has been successful in various tasks.
  • Extra-Trees is another variation of RF that aims to introduce even more randomness when building the individual trees.
  • Algorithm:
    • for t = 1,2,..,T:
      • Sample subset of features \(S_t\)
      • Build a tree \(f_t\) on the full training set (no bootstrap by default) using only \(S_t\), where the split thresholds are chosen at random.
  • Prediction:
    • Classification: Vote.
    • Regression: Average.

Extra-Trees

Application

  • Dataset: Heart Disease Dataset.
  • Key hyperparameters to be fine-tuned using CV:
    • Number of trees: \(T\)
    • Maximum features: max_features
    • Tree hyperparameters: max_depth, min_samples_split, min_samples_leaf and criterion.
Code
from sklearn.ensemble import ExtraTreesClassifier

# Set up Extra-Trees Classifier
extr = ExtraTreesClassifier()

# Define hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 350, 500],  # Number of trees in Bagging
    'max_depth' : [5, 7, 10, 20, 30, 40],
    'max_features': [2, 3, 5, 7, 9],
    'criterion': ["gini", "entropy"],  # Splitting criterion
}

grid_search = GridSearchCV(extr, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train_encoded, y_train)

# Best parameters and corresponding model
best_model = grid_search.best_estimator_
print(f"Best parameters: {grid_search.best_params_}")

# Make predictions with the best model
y_pred = best_model.predict(X_test_encoded)
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred),
          'Precision': precision_score(y_test, y_pred),
          'Recall': recall_score(y_test, y_pred),
          'F1-score': f1_score(y_test, y_pred),
          'AUC': roc_auc_score(y_test, y_pred)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Extra-Trees"])])
test_perf
Best parameters: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 3, 'n_estimators': 100}
Accuracy Precision Recall F1-score AUC
LSVM 0.852459 0.852941 0.878788 0.865672 0.850108
RBF-SVM 0.819672 0.866667 0.787879 0.825397 0.822511
Poly3-SVM 0.852459 0.875000 0.848485 0.861538 0.852814
Bagging 0.803279 0.862069 0.757576 0.806452 0.807359
RF 0.868852 0.903226 0.848485 0.875000 0.870671
Extra-Trees 0.819672 0.823529 0.848485 0.835821 0.817100

Summary: Bagging, Random Forest, Extra-Trees

  • Each method relies on two main ideas:

    • Build many (roughly) independent tree base predictors.
    • Combine them by averaging or voting over their predictions.
  • The individual trees within each method may be too flexible (high variance); however, averaging/voting yields more stable predictions.

  • There are many hyperparameters to be optimized (CV):

    • Number of trees
    • Number of features used to build each tree
    • Individual tree structure
      • Depth of each tree
      • Minimum samples at the leaf node
      • Impurity criterion…

Boosting & XGBoost

Boosting

Algorithm

  • T: the number of trees
  • for t = 1, 2,..., T:
    • Build a weak learner \(f_t\) by minimizing the weighted error \(\varepsilon_t\).
    • Learn and adjust weights:
      • Compute the learner weight \(w_t>0\) from the weighted error under the sample weights \(\{D_t(i):i=1,\dots,n\}\).
      • Update the sample weights \(D_t(i)\to D_{t+1}(i)\) for each observation \(i\).
  • Combined model: \(\color{blue}{\hat{f}(x)=\sum_{t=1}^Tw_tf_t(x)}\).

Boosting

Adaboost: Adaptive Boosting

  • It’s for binary classification with \((\text{x}_i,y_i)\in\mathbb{R}^d\times\{-1,1\},i=1,\dots,n\).
  • Initialize the weight \(D_1(i)=1/n\) for all sample \(\text{x}_i\).
  • for t = 1, . . . , T:
    • Train the base classifier \(f_t\) by minimizing the weighted error \[\varepsilon_t=\mathbb{P}_{i\sim D_t}(f_t(\text{x}_i)\neq y_i)=\sum_{i=1}^nD_t(i)\mathbb{1}_{\{f_t(\text{x}_i)\neq y_i\}}.\]
    • Calculate the learner weight \(w_t=\frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right).\)
    • Update the sample weights using \(D_{t+1}(i)=\frac{D_t(i)}{Z_t}e^{-w_ty_if_t(\text{x}_i)}\), where \(Z_t\) is the normalization constant.
  • The final model: \(\color{blue}{\hat{f}(\text{x})=\text{sign}(\sum_{t=1}^Tw_tf_t(\text{x}))}\). (Read: Freund & Schapire (1999))
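
A minimal from-scratch sketch of these update rules with decision stumps as weak learners, on synthetic data with labels in \(\{-1,1\}\).
Code
# Minimal sketch: AdaBoost with decision stumps, following the updates above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y - 1                        # labels in {-1, +1}
n = len(X)

T = 30
D = np.full(n, 1 / n)                # D_1(i) = 1/n
stumps, w = [], []

for t in range(T):
    f_t = DecisionTreeClassifier(max_depth=1)        # weak learner (stump)
    f_t.fit(X, y, sample_weight=D)
    pred = f_t.predict(X)
    eps = D[pred != y].sum()                         # weighted error eps_t
    eps = np.clip(eps, 1e-10, 1 - 1e-10)             # numerical safeguard
    w_t = 0.5 * np.log((1 - eps) / eps)              # learner weight
    D = D * np.exp(-w_t * y * pred)                  # reweight samples
    D = D / D.sum()                                  # normalize (Z_t)
    stumps.append(f_t); w.append(w_t)

# Final model: sign of the weighted sum of the stumps
F = np.sign(sum(w_t * f_t.predict(X) for w_t, f_t in zip(w, stumps)))
print("training accuracy:", (F == y).mean())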

Boosting

XGBoost: EXtreme Gradient Boosting

  • It’s a special case of Gradient Boosting: \(F_M(\text{x})=\sum_{t=1}^Mw_tf_t(\text{x})\).
  • Let \(L\) be the loss function of the problem:
    • Initial model: \(F_0=\arg\min_{c}\sum_{i=1}^nL(y_i,c).\)
    • for t = 1, 2, ..., M:
      • Compute the pseudo-residuals (negative gradients): \(r_{t,i}=-\frac{\partial L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)}\Big|_{F=F_{t-1}}\).
      • Train a base learner on the new data: \(\{(\text{x}_i,r_{t,i})\}_{i=1}^n.\)
      • Solve for \(\color{blue}{f_t}\) and \(\color{blue}{w_t}\) from: \[(\color{blue}{w_t}, \color{blue}{f_t(\text{x})})=\arg\min_{\color{blue}{w,f}}\sum_{i=1}^nL(y_i,F_{t-1}(\text{x}_i)+\color{blue}{wf(\text{x}_i)}).\]
      • Update aggregated model: \(F_t(\text{x})=F_{t-1}(\text{x})+\color{blue}{w_tf_t(\text{x})}\).
  • In XGBoost:

\[\begin{align*}L(F_t)&=\sum_{i=1}^nL\big(y_i,F_{t-1}(\text{x}_i)+f_t(\text{x}_i)\big)+\sum_{m=1}^t\Omega(f_m),\\ \Omega(f)&=\gamma T_f+\frac{1}{2}\lambda\|w\|^2\quad (T_f:\text{ number of leaves of }f,\ w:\text{ its leaf weights}),\\ L(F_t)&\approx \sum_{i=1}^n\Big[g_if_t(\text{x}_i)+\frac{1}{2}h_if_t^2(\text{x}_i)\Big]+\Omega(f_t)+\text{constant},\\ g_i&=\frac{\partial L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)}\Big|_{F=F_{t-1}},\qquad h_i=\frac{\partial^2 L(y_i,F(\text{x}_i))}{\partial F(\text{x}_i)^2}\Big|_{F=F_{t-1}}. \end{align*}\]
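
A minimal usage sketch with scikit-learn's GradientBoostingClassifier on synthetic data; the commented lines assume the separate xgboost package is installed and use its scikit-learn-style wrapper, where reg_lambda and gamma correspond to the \(\lambda\) and \(\gamma\) penalties above.
Code
# Minimal sketch: gradient boosting on synthetic data with scikit-learn;
# the commented xgboost lines assume the xgboost package is installed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gb.fit(X_tr, y_tr)
print("GBM test accuracy:", gb.score(X_te, y_te))

# from xgboost import XGBClassifier
# xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
#                     reg_lambda=1.0, gamma=0.0)  # lambda, gamma: Omega(f) penalties
# xgb.fit(X_tr, y_tr)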

Boosting

Other variants

  • LightGBM: Light Gradient Boosting Machine, by researchers at Microsoft (2017). It works for regression, classification and ranking problems. Its main improvements,
    • Gradient-based One-Sided Sampling (GOSS) and
    • Exclusive Feature Bundling (EFB),

ensure the efficiency and accuracy of the method.

  • CatBoost: efficient with categorical features (automatic transformation) and uses unbiased estimates of the gradient at each iteration (see: Prokhorenkova et al. (2017)).

Boosting

Numerical Experiment: Heart Disease Dataset




👉 Check out the notebook.

Feature Importance

Feature Importances (FI)

Mean Decrease in Impurity (MDI)

  • The Reduction of Impurity Measure when splitting at \(X_j\) at step \(t\) is \[\text{RIM}_t(X_j)=\text{Imp}_{t-1}(X_j)-\text{Imp}_t(X_j).\]
  • The importance of feature \(X_j\) within a tree: \(\sum_{t\in S_j}\text{RIM}_t(X_j),\) where \(S_j\) is the set of indices \(t\) at which a split is performed on \(X_j\).
  • The unnormalized FI of variable \(X_j\) is the sum of its FIs over all the built trees.
  • ⚠️ Might be biased toward high cardinality features and might not be accurate with highly correlated features.
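
A minimal sketch on synthetic data: for a fitted scikit-learn tree (or tree ensemble), the feature_importances_ attribute is the normalized MDI score.
Code
# Minimal sketch: MDI feature importance of a single decision tree.
# feature_importances_ is the (normalized) total impurity decrease per feature.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
for j, imp in enumerate(tree.feature_importances_):
    print(f"X{j}: {imp:.3f}")   # importances sum to 1; informative features dominate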

Permutation Feature Importance

  • Train the initial model, then compute its validation error, denoted \(\color{blue}{\text{Er}_0}\).
  • For each \(X_j\), shuffle its column in the validation data, then re-evaluate the trained model to obtain the error \(\color{red}{\text{Er}_j}\).
  • The importance score for \(X_j\) is proportional to \(\color{red}{\text{Er}_j}-\color{blue}{\text{Er}_0}\).
  • It reflects the direct influence of each feature on the model’s predictions.
  • ⚠️ Might be sensitive to data splits and is computationally more expensive.
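
A minimal sketch using scikit-learn's permutation_importance on synthetic data: it re-evaluates the already-fitted model on shuffled copies of each column (no retraining).
Code
# Minimal sketch: permutation feature importance with scikit-learn.
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_val, y_val, n_repeats=20,
                                scoring='accuracy', random_state=0)
for j in result.importances_mean.argsort()[::-1]:
    print(f"X{j}: {result.importances_mean[j]:.3f} +/- {result.importances_std[j]:.3f}")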

Feature Importances (FI)

Feature Importance by Random Forest

  • Heart Disease Dataset
Code
# MDI importances of the best tree-ensemble model from the last grid search
importances = best_model.feature_importances_
importance_df = pd.DataFrame({'Feature': list(X_train.select_dtypes(include="number").columns)+list(encoder.get_feature_names_out()), 'Importance': importances})

# Sort by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

import plotly.express as px
# Plot feature importance using Plotly
fig = px.bar(importance_df, x='Feature', y='Importance', title='Feature Importance (Random Forest)',
             labels={'Feature': 'Features', 'Importance': 'Importance Score'},
             text=np.round(importance_df['Importance'], 3),
             color='Importance', color_continuous_scale='Viridis')

fig.update_layout(xaxis_title="Features", yaxis_title="Importance Score", coloraxis_showscale=False, height=400, width=900)
fig.show()

🥳 It’s party time 🥂