TP1 - NBC, LDA, QDA & RDA

Advanced Machine Learning

Author

HAS Sothea, PhD

Objective

This practical session (TP) aims to familiarize you with the key assumptions of each introduced model. The first section is designed to test your understanding of the data and models, while the second section focuses on applying your knowledge to real-world datasets. Please read the instructions carefully and try to complete the tasks independently. If you get stuck, don’t hesitate to ask for help. Good luck!

Please download the Jupyter Notebook by clicking here. The session will cover the following sections:

  1. Model Assumptions and Data Simulation: It is important to verify that each model works well on data that respect its assumptions.
  • Data Simulation: Create datasets that either respect or violate the assumptions of each model, including addressing imbalance problems.
  • Model Implementation: Put the models into action.
  • Model Evaluation: Report the performance of the models using appropriate metrics.
  2. Real Data Implementation: In real-world problems, things are more complicated because the assumptions of the models are often violated. We shall see this by exploring the following real datasets.
  • Real Datasets: you may start with the Spam dataset to reproduce the experimental results shown in the course, or explore the Heart Disease Dataset.
  • Preprocessing/Descriptive Analysis: Understand the features and verify the assumptions (use correlation metrics, for example).
  • Implementation: Apply the models to the datasets.
  • Evaluation: Assess the performance of the models.

1. Simulation

  • Write a function simulateClassificationData(n=200, d=2, M = 2, method = "nbc") that returns input-output observations with
    • observation size n (\(200\) by default)
    • input \(x_i\) are of dimension d (\(2\) by default)
    • the target \(y\) contains M classes taking values in \(\{1,2,\dots,M\}\)
    • and lastly, method defines the preferred method that is supposed to work well on this dataset. It should be an element of ['nbc', 'lda', 'qda', 'rda'] ('nbc' by default).

An example is given below; feel free to implement it differently.

Code
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.datasets import make_spd_matrix

def simulateClassificationData(n=200, d=2, M=2, method="nbc", class_weights = None):
    """
    Generates input-output observations for classification designed to favor the chosen method.
    
    Parameters:
    n (int): Number of samples.
    d (int): Number of features.
    M (int): Number of classes.
    method (str): Method name, default is "nbc" (Naive Bayes Classification).
    class_weights (arr): The proportion of each class. If `None`, the data is balanced.
    
    Returns:
    X (numpy.ndarray): Feature matrix.
    y (numpy.ndarray): Labels.
    """

    # Check if the class weight is given. If it's None, it's a balanced case.
    if class_weights is None:
        class_weights = np.ones(M)/M
    idx = np.random.multinomial(1, class_weights, size=n)
    n_class = [np.sum(idx[:,i]) for i in range(M)]

    # Generate data that prefers NBC model.
    if method == "nbc":
        b = np.random.randint(1,10, 1)
        x1 = np.random.poisson(b, size=(n_class[0], d//2))
        b = np.random.uniform(-2, 2, 2)
        x2 = np.random.normal(np.min(b), np.abs(np.max(b)), size=(n_class[0], d//2))
        X = np.column_stack([x1,x2])
        for i in range(1,M):
            b = np.random.randint(1,10, 1)
            x1 = np.random.poisson(b, size=(n_class[i], d//2))
            b = np.random.uniform(-2,2, 2)
            x2 = np.random.normal(np.min(b), np.abs(np.max(b)), size=(n_class[i], d//2))
            X = np.vstack([X, np.column_stack([x1, x2])])
        y = np.repeat([str(i) for i in range(1,M+1)], n_class)

    # Data that prefers LDA
    elif method == "lda":
        # generate parameters (means & covariances)
        mu = np.random.uniform(-5, 5, size=d)
        sigma0 = make_spd_matrix(n_dim=d)
        for i in range(M-1):
            mu = np.vstack([mu, np.random.uniform(-5, 5, size=d)])
            
        # generate observations
        X = np.vstack([np.random.multivariate_normal(mean=mu[i,:], cov=sigma0, size=int(n*class_weights[i])) for i in range(M)])
        y = np.concatenate([np.repeat(str(i), int(n*class_weights[i-1])) for i in range(1,M+1)])

    # Data that prefers QDA or RDA
    elif method in ["qda", "rda"]:
        # generate parameters (means & covariances)
        mu = np.random.uniform(-5, 5, size=d)
        sigma = [make_spd_matrix(n_dim=d)]
        for i in range(M-1):
            mu = np.vstack([mu, np.random.uniform(-5, 5, size=d)])
            sigma.append(make_spd_matrix(n_dim=d))
        # generate observations
        X = np.vstack([np.random.multivariate_normal(mean=mu[i,:], cov=sigma[i], size=int(n*class_weights[i])) for i in range(M)])
        y = np.concatenate([np.repeat(str(i), int(n*class_weights[i-1])) for i in range(1,M+1)])
    # If method is none of the above, raise a ValueError.
    else:
        raise ValueError("method must be one of 'nbc', 'lda', 'qda' or 'rda'!")
    # Randomly shuffle the data
    id_shuffle = np.random.permutation(range(len(y)))
    return X[id_shuffle,:], y[id_shuffle].astype(object)

1.1. Balanced datasets

A. With \((n,d,M)=(700,2,3)\) fixed and three different options for method in the list ['nbc', 'lda', 'qda'], use the function above to generate three different datasets: (X_nbc, y_nbc), (X_lda, y_lda), and (X_qda, y_qda), where X and y are different input-output pairs. An example is given below.

# Set parameters
n, d, M = 700, 2, 3

# Set random seed for our data generation
np.random.seed(42)
X_nbc, y_nbc = simulateClassificationData(n, d, M, method="nbc")
X_lda, y_lda = simulateClassificationData(n, d, M, method="lda")
X_qda, y_qda = simulateClassificationData(n, d, M, method="qda")


B. Write the code to visualize the first \(500\) observations from each dataset using scatterplots colored according to the classes of the target y. The remaining \(200\) observations are treated as the testing data. Your result should look similar to the figures below.

# To do
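If you need a starting point, here is a minimal sketch using seaborn (the figure size and styling are only suggestions):

import matplotlib.pyplot as plt
import seaborn as sns

datasets = {"nbc": (X_nbc, y_nbc), "lda": (X_lda, y_lda), "qda": (X_qda, y_qda)}
fig, axes = plt.subplots(1, 3, figsize=(9.75, 3))
for ax, (name, (Xd, yd)) in zip(axes, datasets.items()):
    # only the first 500 observations (the training part) are plotted
    sns.scatterplot(x=Xd[:500, 0], y=Xd[:500, 1], hue=yd[:500], ax=ax)
    ax.set_title(f"'{name}' dataset")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
plt.tight_layout()
plt.show()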


C. For now, work with the (X_nbc, y_nbc) dataset.

a. Train NBC, LDA and QDA on the first \(500\) observations. Report the accuracy, precision, recall and F1-score of the three models on the remaining \(200\) testing points. You should obtain results as illustrated below.

# To do

       Recall  Precision  F1-score
NBC  0.831125   0.831003  0.824050
LDA  0.639893   0.634448  0.636846
QDA  0.820832   0.818865  0.814520
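
One way to obtain such a table is sketched below, assuming scikit-learn's GaussianNB, LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis are used as the NBC, LDA and QDA models, with macro-averaged metrics as in the rest of this TP:

from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.metrics import recall_score, precision_score, f1_score

# Train / test split: first 500 points for training, last 200 for testing
X_train, y_train = X_nbc[:500, :], y_nbc[:500]
X_test, y_test = X_nbc[500:, :], y_nbc[500:]

models = {"NBC": GaussianNB(),
          "LDA": LinearDiscriminantAnalysis(),
          "QDA": QuadraticDiscriminantAnalysis()}
scores = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = [recall_score(y_test, y_pred, average="macro"),
                    precision_score(y_test, y_pred, average="macro"),
                    f1_score(y_test, y_pred, average="macro")]
print(pd.DataFrame(scores, index=["Recall", "Precision", "F1-score"]).T)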


b. Draw the decision boundary with the testing data of the three models on the (X_nbc, y_nbc) dataset. An example of a function for drawing such a boundary is given below.

Code
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import DecisionBoundaryDisplay
from matplotlib.colors import ListedColormap

def plot_decision_boundary(X, y,
                           models,
                           conf_mat,
                           palette,
                           grid_resolution=100,
                           model_names=['NBC', 'LDA', 'QDA']):
    # `models`: list of fitted classifiers; `conf_mat`: dict of their confusion
    # matrices on the plotted data (used for the accuracy shown in each title);
    # `palette`: dict mapping class labels to colors.
    fig, axes = plt.subplots(1, len(models), figsize=(8, 3))
    keys = list(conf_mat.keys())
    for idx, model in enumerate(models):
        ax = axes[idx]
        DecisionBoundaryDisplay.from_estimator(
            model, X, response_method='predict', ax=ax,
            grid_resolution=grid_resolution,
            cmap=ListedColormap(list(palette.values())), alpha=0.5)
        sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette=palette, ax=ax)
        acc = (conf_mat[keys[idx]].diagonal().sum() / conf_mat[keys[idx]].sum()).round(3)
        ax.set_title(f"{model_names[idx]} with acc = {acc}")
        ax.set_xlabel("x1")
        if idx == 0:
            ax.set_ylabel("x2")
    plt.tight_layout()
    plt.show()
# To do: apply to the dataset
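
A possible way to apply it, reusing the fitted models from the sketch in C.a above (the color palette below is just a choice):

from sklearn.metrics import confusion_matrix

palette = {"1": "tab:blue", "2": "tab:orange", "3": "tab:green"}
conf_mat = {name: confusion_matrix(y_test, model.predict(X_test))
            for name, model in models.items()}
plot_decision_boundary(X_test, y_test, list(models.values()), conf_mat, palette,
                       model_names=list(models.keys()))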


D. Do the same with the other two datasets. Make some comments on the results: Are the results reasonable? Explain.


  • For (X_lda, y_lda), you should produce the following result.
# To do

       Recall  Precision  F1-score
NBC  0.975737   0.973918  0.974699
LDA  0.980639   0.978421  0.979373
QDA  0.980639   0.978421  0.979373


  • For (X_qda, y_qda), your result should be similar to the result below.
# To do

       Recall  Precision  F1-score
NBC  0.928110   0.334877  0.929750
LDA  0.932805   0.339067  0.934641
QDA  0.948944   0.338166  0.950155


  • Your comments:

1.2. Imbalanced datasets

Now, we will work with imbalanced simulated datasets. The goal is to identify problems associated with imbalanced datasets and to propose possible solutions we have studied so far.


A. With the same options as in the balanced case, but adding the additional option class_weights = [0.2, 0.15, 0.65], generate three other imbalanced datasets. An example is provided below.

weights = [0.2, 0.15, 0.65]
np.random.seed(42) 
X_nbc, y_nbc = simulateClassificationData(n, d, M, method="nbc", class_weights=weights)
X_lda, y_lda = simulateClassificationData(n, d, M, method="lda", class_weights=weights)
X_qda, y_qda = simulateClassificationData(n, d, M, method="qda", class_weights=weights)


a. Recreate the countplots and scatterplots for the three simulated datasets as illustrated below.

# To do
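A minimal sketch for the countplots (the scatterplots can be produced exactly as in Section 1.1.B):

fig, axes = plt.subplots(1, 3, figsize=(9.75, 3))
for ax, (name, yd) in zip(axes, {"nbc": y_nbc, "lda": y_lda, "qda": y_qda}.items()):
    sns.countplot(x=yd, ax=ax)   # class frequencies of the imbalanced targets
    ax.set_title(f"'{name}' dataset")
plt.tight_layout()
plt.show()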


b. Recreate the following performance metrics on the imbalanced nbc dataset.

# To do

       Recall  Precision  F1-score
NBC  0.671337   0.695662  0.682120
LDA  0.545978   0.467226  0.501748
QDA  0.663761   0.689268  0.674789


c. How does this result compare to the balanced case? Does this surprise you?

Your opinion…


B. Do the same (compute these metrics) for the other two imbalanced datasets.

# To do

C. In NBC, LDA, and QDA, all parameters (such as means and variances) are directly estimated using data. However, in RDA, we can tune the trade-off parameter \(\alpha\) that combines the QDA covariance matrices with the LDA covariance matrix. This allows for a balance between the flexibility of QDA and the simplicity of LDA:

\[\hat{\Sigma}_k(\alpha)=\alpha\hat{\Sigma}_k+(1-\alpha)\hat{\Sigma}.\]

We now search for an optimal \(\alpha^*\) that yields better, or even the best, scores (recall, precision and F\(_1\)-score). This can be achieved using the cross-validation technique. The steps we follow are:

a. First, let’s fix \(\alpha_0=0.5\) and write a Python function called deltaRDA(X, y, X_test, alpha = 0.5, lda_cov = None, qda_cov = None, means = None) where

  • X,y: input-output training data
  • X_test: an array containing testing inputs \(x_i\) for computing \(\delta(x)\).
  • alpha: regularization strength (default is \(0.5\)).
  • lda_cov: common covariance matrix in LDA (\(\hat{\Sigma}\)). It should be estimated from the data (X,y) if it’s None.
  • qda_cov: list of per-class covariance matrices in QDA (\(\hat{\Sigma}_k\)). They should be estimated from the data if it’s None.
  • means: list of per-class means. They should also be estimated if it’s None.

This function computes the predictions for X_test using RDA at the regularization value \(\alpha_0\).¹ Hint: use the implementation on slide 37.

def deltaRDA(X, y, X_test, alpha = 0.5, lda_cov = None, qda_cov = None, means = None):
    # To do
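
If you get stuck, one possible sketch is given below. It uses the standard regularized Gaussian discriminant score \(\delta_k(x) = -\tfrac{1}{2}\log|\hat{\Sigma}_k(\alpha)| - \tfrac{1}{2}(x-\hat{\mu}_k)^\top\hat{\Sigma}_k(\alpha)^{-1}(x-\hat{\mu}_k) + \log\hat{\pi}_k\) with plug-in estimates; the exact form on slide 37 may differ slightly.

def deltaRDA(X, y, X_test, alpha = 0.5, lda_cov = None, qda_cov = None, means = None):
    classes = np.unique(y)
    n = X.shape[0]
    # Plug-in estimates of the per-class means, covariances and priors
    if means is None:
        means = [X[y == c].mean(axis=0) for c in classes]
    if qda_cov is None:
        qda_cov = [np.cov(X[y == c], rowvar=False) for c in classes]
    if lda_cov is None:
        # pooled (within-class) covariance, as used by LDA
        lda_cov = sum((np.sum(y == c) - 1) * qda_cov[k]
                      for k, c in enumerate(classes)) / (n - len(classes))
    priors = np.array([np.mean(y == c) for c in classes])

    # Discriminant score delta_k(x) for every test point and every class
    scores = np.zeros((X_test.shape[0], len(classes)))
    for k in range(len(classes)):
        cov_k = alpha * qda_cov[k] + (1 - alpha) * lda_cov   # regularized covariance
        inv_k = np.linalg.inv(cov_k)
        _, logdet = np.linalg.slogdet(cov_k)
        diff = X_test - means[k]
        maha = np.einsum('ij,jk,ik->i', diff, inv_k, diff)   # squared Mahalanobis distances
        scores[:, k] = -0.5 * logdet - 0.5 * maha + np.log(priors[k])
    # Predict the class with the largest discriminant score
    return classes[np.argmax(scores, axis=1)]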


b. Compute the confusion matrix and all the performance metrics on the test data using deltaRDA at \(\alpha=0.5\) as defined above.

Code
from sklearn.metrics import (confusion_matrix, ConfusionMatrixDisplay,
                             precision_score, recall_score, f1_score)

# making predictions using delta(x)
y_hat = deltaRDA(X_nbc[:500,:], y_nbc[:500], X_nbc[500:,:])

# confusion matrix
conf_mat = confusion_matrix(y_nbc[500:], y_hat)
# metrics
prec = precision_score(y_nbc[500:], y_hat, average="macro")
rec = recall_score(y_nbc[500:], y_hat, average="macro")
f1 = f1_score(y_nbc[500:], y_hat, average="macro")

_, ax = plt.subplots(figsize = (5,4))
ConfusionMatrixDisplay(conf_mat).plot(ax=ax, colorbar=False)
ax.set_title(r"$\text{RDA at }\alpha=0.5\text{ on 'nbc' dataset}$")

plt.show()

res_nbc = pd.DataFrame(
    {
        'Recall': rec,
        'Precision': prec,
        'F1-score': f1,
        'Accuracy':  (conf_mat.diagonal().sum() / conf_mat.sum()).round(3)
    },
    index=["RDA"]
)
print(res_nbc)

       Recall  Precision  F1-score  Accuracy
RDA  0.726954   0.667442  0.686269      0.69


Remark question: What do you think about the results of RDA compared to the previous results from NBC, LDA, and QDA? Remember, we initially chose \(\alpha_0=0.5\). Now, let’s find the optimal value of \(\alpha\).


c. Split your training data into \(K\) folds, namely \(F_1,F_2,\dots,F_K\). Let \(F_{-k}\) represent all the training data except the \(k\)th fold. For any fixed \(\alpha\), write a function cvRDA(alpha) that returns the average F\(_1\)-score over the folds \(F_k\), each computed with the model built on \(F_{-k}\). One way to do this is shown below.

Code
K = 5          # choose K = 5
np.random.seed(42)
rand_id = np.random.choice(range(K), replace=True, size=500) # randomly assign each training point to one of the K folds
X_train = X_nbc[:500,:]
y_train = y_nbc[:500]
X_test = X_nbc[500:,:]
y_test = y_nbc[500:]

def cvRDA(alpha):
    res = []
    for k in range(K):
        y_hat = deltaRDA(X=X_train[rand_id != k,:],
                         y=y_train[rand_id != k],
                         X_test=X_train[rand_id == k,:],
                         alpha=alpha)
        res.append(f1_score(y_train[rand_id == k], y_hat, average="macro"))
    return np.mean(res).round(3)

print(f"* Cross-validation F1-score for alpha = 0.5: {cvRDA(0.5)}")
* Cross-validation F1-score for alpha = 0.5: 0.76


d. Choose a grid for \(\alpha\), for example G=np.linspace(0,1,100), which generates a uniform grid of \(100\) values on the interval \([0,1]\). Now, write Python code that searches for the \(\alpha^*\) on the grid \(G\) with the largest cross-validation F\(_1\)-score. Plot the graph of \((x,y)=(\alpha, \text{CV F}_1\text{-score}(\alpha))\).
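
A possible sketch for this grid search, using the cvRDA function above (keeping every \(\alpha\) that ties for the best CV score is a choice; it is one way to end up with several optimal values, as in the output below):

G = np.linspace(0, 1, 100)                         # grid of candidate alphas
cv_scores = np.array([cvRDA(alpha) for alpha in G])
alpha_opt = G[np.isclose(cv_scores, cv_scores.max())].round(3)   # keep all ties
print(f"* Optimal alpha: {alpha_opt}")

# Plot the cross-validation F1-score as a function of alpha
plt.figure(figsize=(5, 3))
plt.plot(G, cv_scores)
plt.scatter(alpha_opt, np.repeat(cv_scores.max(), len(alpha_opt)), color="red")
plt.xlabel(r"$\alpha$")
plt.ylabel(r"CV $F_1$-score")
plt.tight_layout()
plt.show()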

* Optimal alpha: [0.263 0.293 0.303]

e. Build the RDA model with the observed optimal \(\alpha^*\) using all \(500\) training data points. Report the performance metrics on the testing data. Conclude your findings.

Code
preds = {alpha: deltaRDA(X=X_train,
                         y=y_train,
                         X_test=X_test,
                         alpha=alpha) for alpha in alpha_opt}
recall = {alpha: np.round(recall_score(y_test, y, average="macro"), 3) for alpha, y in preds.items()}
precision = {alpha: np.round(precision_score(y_test, y, average="macro"), 3) for alpha, y in preds.items()}
f1 = {alpha: np.round(f1_score(y_test, y, average="macro"), 3) for alpha, y in preds.items()}
acc = {alpha: np.round(np.mean(y_test == y), 3) for alpha, y in preds.items()}

df_acc = pd.DataFrame({
    "alpha" : list(acc.keys()),
    "accucary" : list(acc.values())
})

print("* Accuracy:")

print(df_acc)

sns.set(style="white")
fig, ax = plt.subplots(1, 3, figsize = (9.75, 3))
i = 0
for alpha in alpha_opt:
    mat = confusion_matrix(y_test, preds[alpha])
    ConfusionMatrixDisplay(mat).plot(ax=ax[i], colorbar=False)
    ax[i].set_title(f"RDA at alpha={np.round(alpha,3)}; score={np.round(f1[alpha], 3)}")
    i += 1
* Accuracy:
   alpha  accuracy
0  0.263     0.715
1  0.293     0.720
2  0.303     0.720

Code
print("* Other metrics:")

print(pd.DataFrame({
        "recall": recall.values(),
        "precision" : precision.values(),
        "f1-score" : f1.values() 
    },index=list(recall.keys())))
* Other metrics:
       recall  precision  f1-score
0.263   0.704      0.679      0.69
0.293   0.717      0.687      0.70
0.303   0.717      0.687      0.70


Your conclusion:


2. Real datasets

In this section, you will reproduce the results shown in the course with the Spam dataset and explore the Heart Disease Dataset. We can download the Spam data using the code below:

path_spam = "https://raw.githubusercontent.com/hassothea/MLcourses/refs/heads/main/data/spam.txt"
spam = pd.read_csv(path_spam, index_col=0, sep=" ")
spam.sample(5)
make address all num3d our over remove internet order mail ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
Id
3924 0.08 0.0 0.16 0.0 0.0 0.08 0.0 0.08 0.73 0.00 ... 0.126 0.172 0.057 0.000 0.022 0.0 3.212 44 665 nonspam
3631 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.00 0.00 ... 0.000 0.000 0.000 0.000 0.000 0.0 1.000 1 5 nonspam
4081 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.00 0.00 ... 0.000 0.344 0.000 0.000 0.000 0.0 1.400 5 14 nonspam
180 0.00 0.0 0.00 0.0 0.0 0.00 0.0 2.04 0.00 0.00 ... 0.000 0.000 0.000 0.969 0.000 0.0 2.179 18 85 spam
590 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.00 1.51 ... 0.000 0.000 0.000 0.000 0.000 0.0 2.769 15 36 spam

5 rows × 58 columns


A. Perform a descriptive analysis of the datasets: aim to uncover connections between the inputs and the target.
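
For instance, a minimal sketch of one such check (the correlation of each input with the binarized target; this is only one of many possible descriptive analyses):

# Correlation between each input and the target encoded as 0/1 (nonspam/spam)
target = (spam["type"] == "spam").astype(int)
corr = spam.drop(columns="type").corrwith(target).sort_values()
print(corr.tail(10))   # the inputs most positively correlated with 'spam'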

B. Split the data and build models based on your analysis in the previous step.

C. Report the results on the testing data.
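
As a starting point for B and C, here is a minimal sketch on the Spam dataset; the 80/20 stratified split, the choice of GaussianNB/LDA/QDA and the use of classification_report are assumptions you can adapt:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.metrics import classification_report

# Separate the inputs from the target column 'type' (spam / nonspam)
X = spam.drop(columns="type")
y = spam["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

for name, model in {"NBC": GaussianNB(),
                    "LDA": LinearDiscriminantAnalysis(),
                    "QDA": QuadraticDiscriminantAnalysis()}.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(f"--- {name} ---")
    print(classification_report(y_test, y_pred))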


References

\(^{\text{📚}}\) Chapter 4, Introduction to Statistical Learning with R, James et al. (2021).
\(^{\text{📚}}\) Chapter 2, The Elements of Statistical Learning, Hastie et al. (2008).
\(^{\text{📚}}\) Friedman (1989).
\(^{\text{📚}}\) Cross-Validation technique.
\(^{\text{📚}}\) Heart Disease Dataset.


Footnotes

  1. This may be the most difficult question.