Solution: TP1 - NBC, LDA, QDA & RDA

Advanced Machine Learning

Author

HAS Sothea, PhD

Objective

This practical session (TP) aims to familiarize you with the key assumptions of each introduced model. The first section is designed to test your understanding of the data and models, while the second section focuses on applying your knowledge to real-world datasets. Please read the instructions carefully and try to complete the tasks independently. If you get stuck, don’t hesitate to ask for help. Good luck!

Please download the Jupyter Notebook by clicking here. The session will cover the following sections:

  1. Model Assumptions and Data Simulation: It is important to verify that each model works well on data that respect its assumptions.
  • Data Simulation: Create datasets that either respect or violate the assumptions of each model, including addressing imbalance problems.
  • Model Implementation: Put the models into action.
  • Model Evaluation: Report the performance of the models using appropriate metrics.
  2. Real Data Implementation: In real-world problems, things are more complicated because the assumptions of the models are often violated. We shall see this by exploring the following real datasets.
  • Real Datasets: you may start with the Spam dataset to reproduce the experimental results shown in the course, or explore the Heart Disease Dataset.
  • Preprocessing/Descriptive Analysis: Understand the features and verify the assumptions (use correlation metrics, for example).
  • Implementation: Apply the models to the datasets.
  • Evaluation: Assess the performance of the models.

1. Simulation

  • Write a function simulateClassificationData(n=200, d=2, M = 2, method = "nbc") that returns input-output observations with
    • observation size n (\(200\) by default)
    • input \(x_i\) are of dimension d (\(2\) by default)
    • the target \(y\) contains M classes taking values in \(\{1,2,\dots,M\}\)
    • and lastly, method defines the preferred method, i.e., the one that is supposed to work well on this dataset. It should be an element of ['nbc', 'lda', 'qda', 'rda'] ('nbc' by default).

I set an example below; you can do it differently.

Code
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.datasets import make_spd_matrix

def simulateClassificationData(n=200, d=2, M=2, method="nbc", class_weights = None):
    """
    Generates a synthetic classification dataset designed to favor the specified method.
    
    Parameters:
    n (int): Number of samples.
    d (int): Number of features.
    M (int): Number of classes.
    method (str): Method expected to work well on the data: "nbc", "lda", "qda" or "rda" ("nbc" by default).
    class_weights (array-like): The proportion of each class. If `None`, the data is balanced.
    
    Returns:
    X (numpy.ndarray): Feature matrix.
    y (numpy.ndarray): Labels.
    """

    # Check if the class weight is given. If it's None, it's a balanced case.
    if class_weights is None:
        class_weights = np.ones(M)/M
    idx = np.random.multinomial(1, class_weights, size=n)
    n_class = [np.sum(idx[:,i]) for i in range(M)]

    # Generate data that prefers NBC model.
    if method == "nbc":
        b = np.random.randint(1,10, 1)
        x1 = np.random.poisson(b, size=(n_class[0], d//2))
        b = np.random.uniform(-2, 2, 2)
        x2 = np.random.normal(np.min(b), np.abs(np.max(b)), size=(n_class[0], d//2))
        X = np.column_stack([x1,x2])
        for i in range(1,M):
            b = np.random.randint(1,10, 1)
            x1 = np.random.poisson(b, size=(n_class[i], d//2))
            b = np.random.uniform(-2,2, 2)
            x2 = np.random.normal(np.min(b), np.abs(np.max(b)), size=(n_class[i], d//2))
            X = np.row_stack([X, np.column_stack([x1, x2])])
        y = np.repeat([str(i) for i in range(1,M+1)], n_class)

    # Data that prefers LDA
    elif method == "lda":
        # generate parameters (means & covariances)
        mu = np.random.uniform(-5, 5, size=d)
        sigma0 = make_spd_matrix(n_dim=d)
        for i in range(M-1):
            mu = np.row_stack([mu, np.random.uniform(-5, 5, size=d)])
            
        # generate observations
        X = np.row_stack([np.random.multivariate_normal(mean=mu[i,:], cov=sigma0, size=int(n*class_weights[i])) for i in range(M)])
        y = np.concatenate([np.repeat(str(i), int(n*class_weights[i-1])) for i in range(1,M+1)])

    # Data that prefers QDA or RDA
    elif method in ["qda", "rda"]:
        # generate parameters (means & covariances)
        mu = np.random.uniform(-5, 5, size=d)
        sigma = [make_spd_matrix(n_dim=d)]
        for i in range(M-1):
            mu = np.row_stack([mu, np.random.uniform(-5, 5, size=d)])
            sigma.append(make_spd_matrix(n_dim=d))
        # generate observations
        X = np.row_stack([np.random.multivariate_normal(mean=mu[i,:], cov=sigma[i], size=int(n*class_weights[i])) for i in range(M)])
        y = np.concatenate([np.repeat(str(i), int(n*class_weights[i-1])) for i in range(1,M+1)])
    # If method is other than above, return value error.
    else:
        raise ValueError("method must be one of 'nbc', 'lda', 'qda' or 'rda'!")
    # Randomly shuffle the data
    id_shuffle = np.random.permutation(range(len(y)))
    return X[id_shuffle,:], y[id_shuffle].astype(object)

1.1. Balanced datasets

A. With \((n,d,M)=(700,2,3)\) fixed and three different options for method in the list ['nbc', 'lda', 'qda'], use the function above to generate three different datasets: (X_nbc, y_nbc), (X_lda, y_lda), and (X_qda, y_qda), where X and y are different input-output pairs. An example is given below.

# Set parameters
n, d, M = 700, 2, 3

# Set random seed for our data generation
np.random.seed(42)
X_nbc, y_nbc = simulateClassificationData(n, d, M, method="nbc")
X_lda, y_lda = simulateClassificationData(n, d, M, method="lda")
X_qda, y_qda = simulateClassificationData(n, d, M, method="qda")


B. Write the code to visualize the first \(500\) observations from each dataset using scatterplots colored according to the classes of the target y. The remaining \(200\) observations are treated as the testing data. Your result should look similar to the figures below.

Code
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")

# Color 
color_tab = sns.color_palette("Set2")
palette = {str(i): color_tab[i-1] for i in range(1,M+1)}

fig, ax = plt.subplots(1, 3, figsize = (9, 3))
sns.scatterplot(x=X_nbc[:500,0], y=X_nbc[:500,1], hue=y_nbc[:500], ax=ax[0], palette=palette)
ax[0].set_title('Naive Bayes')
ax[0].set_xlabel('x1')
ax[0].set_ylabel('x2')
sns.scatterplot(x=X_lda[:500,0], y=X_lda[:500,1], hue=y_lda[:500], ax=ax[1], palette=palette)
ax[1].set_title('LDA')
ax[1].set_xlabel('x1')
# ax[1].set_ylabel('x2')
sns.scatterplot(x=X_qda[:500,0], y=X_qda[:500,1], hue=y_qda[:500], ax=ax[2], palette=palette)
ax[2].set_title('QDA')
ax[2].set_xlabel('x1')
# ax[2].set_ylabel('x2')
fig.show()


C. For now, work with (X_nbc, y_nbc) dataset.

a. Train NBC, LDA and QDA on the first \(500\) observations. Report the accuracy, precision, recall and F\(_1\)-score of the three models on the remaining \(200\) testing points. You should obtain results similar to those illustrated below.

Code
sns.set(style="white")
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, f1_score, accuracy_score

fig, ax = plt.subplots(1, 3, figsize = (9.5, 3))

models = {"nbc": GaussianNB(), "lda": LDA(), "qda": QDA()}
fit = {x: m.fit(X_nbc[:500,:], y_nbc[:500]) for x,m in models.items()}
preds = {x: m.predict(X_nbc[500:,:]) for x,m in fit.items()}
conf_mat = {x: confusion_matrix(y_nbc[500:], y) for x,y in preds.items()}
ConfusionMatrixDisplay(conf_mat['nbc']).plot(ax=ax[0], colorbar=False)
ax[0].set_title(f"NBC with acc = {(conf_mat['nbc'].diagonal().sum() / conf_mat['nbc'].sum()).round(3)}")
ConfusionMatrixDisplay(conf_mat['lda']).plot(ax=ax[1],colorbar=False)
ax[1].set_title(f"LDA with acc = {(conf_mat['lda'].diagonal().sum() / conf_mat['lda'].sum()).round(3)}")
ConfusionMatrixDisplay(conf_mat['qda']).plot(ax=ax[2], colorbar=False)
ax[2].set_title(f"QDA with acc = {(conf_mat['qda'].diagonal().sum() / conf_mat['qda'].sum()).round(3)}")

prec = [precision_score(y_nbc[500:], y, average="macro") for y in preds.values()]
rec = [recall_score(y_nbc[500:], y, average="macro") for y in preds.values()]
f1 = [f1_score(y_nbc[500:], y, average="macro") for y in preds.values()]
plt.show()
res_nbc = pd.DataFrame(
    {
        'Recall': rec,
        'Precision': prec,
        'F1-score': f1
    },
    index=["NBC", "LDA", "QDA"]
)
print(res_nbc)

       Recall  Precision  F1-score
NBC  0.831125   0.831003  0.824050
LDA  0.639893   0.634448  0.636846
QDA  0.820832   0.818865  0.814520


b. Draw the decision boundary with the testing data of the three models on the (X_nbc, y_nbc) dataset. An example of a function for drawing such a boundary is given below.

Code
from sklearn.inspection import DecisionBoundaryDisplay
from matplotlib.colors import ListedColormap

def plot_decision_boundary(X, y,
                           models=[],
                           model_names=['NBC', 'LDA', 'QDA']):
    fig, axes = plt.subplots(1, 3, figsize=(8, 3))
    keys = list(conf_mat.keys())   # reuse the confusion matrices computed above for the titles
    for idx, model in enumerate(models):
        ax = axes[idx]
        # Background: predicted-class regions of the fitted model
        DecisionBoundaryDisplay.from_estimator(
            model, X, response_method='predict', ax=ax,
            cmap=ListedColormap(list(palette.values())), alpha=0.5)
        # Foreground: the testing points colored by their true class
        sns.scatterplot(x=X[:,0], y=X[:,1], hue=y, palette=palette, ax=ax)
        ax.set_xlabel("x1")
        if idx == 0:
            ax.set_ylabel("x2")
        ax.set_title(f"{model_names[idx]} with acc = {(conf_mat[keys[idx]].diagonal().sum() / conf_mat[keys[idx]].sum()).round(3)}")
    plt.tight_layout()
    plt.show()
Code
plot_decision_boundary(X_nbc[500:,:], y_nbc[500:], 
                       models=models.values())


D. Do the same with the other two datasets. Make some comments on the results: Are the results reasonable? Explain.


  • For (X_lda, y_lda), you should produce the following result.
Code
fig, ax = plt.subplots(1, 3, figsize = (9.5, 3))

models = {"nbc": GaussianNB(), "lda": LDA(), "qda": QDA()}
fit = {x: m.fit(X_lda[:500,:], y_lda[:500]) for x,m in models.items()}
preds = {x: m.predict(X_lda[500:,:]) for x,m in fit.items()}
conf_mat = {x: confusion_matrix(y_lda[500:], y) for x,y in preds.items()}
ConfusionMatrixDisplay(conf_mat['nbc']).plot(ax=ax[0], colorbar=False)
ax[0].set_title(f"NBC with acc = {(conf_mat['nbc'].diagonal().sum() / conf_mat['nbc'].sum()).round(3)}")
ConfusionMatrixDisplay(conf_mat['lda']).plot(ax=ax[1], colorbar=False)
ax[1].set_title(f"LDA with acc = {(conf_mat['lda'].diagonal().sum() / conf_mat['lda'].sum()).round(3)}")
ConfusionMatrixDisplay(conf_mat['qda']).plot(ax=ax[2], colorbar=False)
ax[2].set_title(f"QDA with acc = {(conf_mat['qda'].diagonal().sum() / conf_mat['qda'].sum()).round(3)}")
plt.show()

plot_decision_boundary(X_lda[500:,:], y_lda[500:], 
                       models=models.values())

prec = [precision_score(y_lda[500:], y, average="macro") for y in preds.values()]
rec = [recall_score(y_lda[500:], y, average="macro") for y in preds.values()]
f1 = [f1_score(y_lda[500:], y, average="macro") for y in preds.values()]
plt.show()
res_lda = pd.DataFrame(
    {
        'Recall': rec,
        'Precision': prec,
        'F1-score': f1
    },
    index=["NBC", "LDA", "QDA"]
)
print(res_lda)

       Recall  Precision  F1-score
NBC  0.975737   0.973918  0.974699
LDA  0.980639   0.978421  0.979373
QDA  0.980639   0.978421  0.979373


  • For (X_qda, y_qda), your result should be similar to the result below.
Code
fig, ax = plt.subplots(1, 3, figsize = (9.5, 3))

models = {"nbc": GaussianNB(), "lda": LDA(), "qda": QDA()}
fit = {x: m.fit(X_qda[:500,:], y_qda[:500]) for x,m in models.items()}
preds = {x: m.predict(X_qda[500:,:]) for x,m in fit.items()}
conf_mat = {x: confusion_matrix(y_qda[500:], y) for x,y in preds.items()}
ConfusionMatrixDisplay(conf_mat['nbc']).plot(ax=ax[0], colorbar=False)
ax[0].set_title(f"NBC with acc = {(conf_mat['nbc'].diagonal().sum() / conf_mat['nbc'].sum()).round(3)}")
ConfusionMatrixDisplay(conf_mat['lda']).plot(ax=ax[1], colorbar=False)
ax[1].set_title(f"LDA with acc = {(conf_mat['lda'].diagonal().sum() / conf_mat['lda'].sum()).round(3)}")
ConfusionMatrixDisplay(conf_mat['qda']).plot(ax=ax[2], colorbar=False)
ax[2].set_title(f"QDA with acc = {(conf_mat['qda'].diagonal().sum() / conf_mat['qda'].sum()).round(3)}")
plt.show()

plot_decision_boundary(X_qda[500:,:], y_qda[500:], 
                       models=models.values())


prec = [precision_score(y_qda[500:], y, average="macro") for y in preds.values()]
rec = [recall_score(y_qda[500:], y, average="macro") for y in preds.values()]
f1 = [f1_score(y_qda[500:], y, average="macro") for y in preds.values()]
plt.show()
res_qda = pd.DataFrame(
    {
        'Recall': rec,
        'Precision': prec,
        'F1-score': f1
    },
    index=["NBC", "LDA", "QDA"]
)
print(res_qda)

       Recall  Precision  F1-score
NBC  0.928110   0.334877  0.929750
LDA  0.932805   0.339067  0.934641
QDA  0.948944   0.338166  0.950155


  • Each simulated dataset is designed to satisfy the assumptions of its corresponding model. From the observed results, the matching model indeed tends to outperform the others on the dataset that respects its assumptions.

1.2. Imbalanced datasets

Now, we will work with imbalanced simulated datasets. The goal is to identify problems associated with imbalanced datasets and to propose possible solutions we have studied so far.


A. With the same options as in the balanced case, but with the additional option class_weights = [0.2, 0.15, 0.65], generate three more imbalanced datasets. An example is provided below.

weights = [0.2, 0.15, 0.65] #[0.2, 0.15, 0.65]
np.random.seed(42) 
X_nbc, y_nbc = simulateClassificationData(n, d, M, method="nbc", class_weights=weights)
X_lda, y_lda = simulateClassificationData(n, d, M, method="lda", class_weights=weights)
X_qda, y_qda = simulateClassificationData(n, d, M, method="qda", class_weights=weights)


a. Recreate the countplots and scatterplots for the three simulated datasets as illustrated below.

Code
sns.set(style="whitegrid")
fig, ax = plt.subplots(1, 3, figsize = (9, 3))
sns.countplot(x=y_nbc[:500], hue=y_nbc[:500], ax=ax[0], order=['1', '2', '3'], palette=palette)
ax[0].set_title('Naive Bayes')
sns.countplot(x=y_lda[:500], hue=y_lda[:500], ax=ax[1], order=['1', '2', '3'], palette=palette)
ax[1].set_title('LDA')
ax[1].set_ylabel("")
sns.countplot(x=y_qda[:500], hue=y_qda[:500], ax=ax[2], order=['1', '2', '3'], palette=palette)
ax[2].set_title('QDA')
ax[2].set_ylabel("")
fig.show()

fig, ax = plt.subplots(1, 3, figsize = (9, 3))
sns.scatterplot(x=X_nbc[:500,0], y=X_nbc[:500,1], hue=y_nbc[:500], ax=ax[0], palette=palette)
ax[0].set_title('Naive Bayes')
sns.scatterplot(x=X_lda[:500,0], y=X_lda[:500,1], hue=y_lda[:500], ax=ax[1], palette=palette)
ax[1].set_title('LDA')
sns.scatterplot(x=X_qda[:500,0], y=X_qda[:500,1], hue=y_qda[:500], ax=ax[2], palette=palette)
ax[2].set_title('QDA')
fig.show()


b. Recreate the following performance metrics on the imbalanced nbc dataset.

Code
sns.set(style="white")
fig, ax = plt.subplots(1, 3, figsize = (9.5, 3))

models = {"nbc": GaussianNB(), "lda": LDA(store_covariance=True), "qda": QDA(store_covariance=True)}
fit = {x: m.fit(X_nbc[:500,:], y_nbc[:500]) for x,m in models.items()}
preds = {x: m.predict(X_nbc[500:,:]) for x,m in fit.items()}
conf_mat = {x: confusion_matrix(y_nbc[500:],y) for x,y in preds.items()}
ConfusionMatrixDisplay(conf_mat['nbc']).plot(ax=ax[0], colorbar=False)
ax[0].set_title(f"NBC with acc = {(conf_mat['nbc'].diagonal().sum() / conf_mat['nbc'].sum()).round(3)}")
ConfusionMatrixDisplay(conf_mat['lda']).plot(ax=ax[1], colorbar=False)
ax[1].set_title(f"LDA with acc = {(conf_mat['lda'].diagonal().sum() / conf_mat['lda'].sum()).round(3)}")
ConfusionMatrixDisplay(conf_mat['qda']).plot(ax=ax[2], colorbar=False)
ax[2].set_title(f"QDA with acc = {(conf_mat['qda'].diagonal().sum() / conf_mat['qda'].sum()).round(3)}")
plt.show()

prec = [precision_score(y_nbc[500:], y, average="macro") for y in preds.values()]
rec = [recall_score(y_nbc[500:], y, average="macro") for y in preds.values()]
f1 = [f1_score(y_nbc[500:], y, average="macro") for y in preds.values()]
res_nbc = pd.DataFrame(
    {
        'Recall': rec,
        'Precision': prec,
        'F1-score': f1
    },
    index=["NBC", "LDA", "QDA"]
)
print(res_nbc)

       Recall  Precision  F1-score
NBC  0.671337   0.695662  0.682120
LDA  0.545978   0.467226  0.501748
QDA  0.663761   0.689268  0.674789


c. How does this result compare to the balanced case? Does this surprise you?

We observe lower performance metrics in the imbalanced case than in the balanced case. This results from the models being biased towards the majority class.
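One possible remedy among those discussed in the course is to make the class imbalance explicit when fitting the models. The sketch below is only an illustration, not part of the original solution: it refits the three scikit-learn classifiers on the same imbalanced training split but passes uniform class priors through their priors argument, then compares the macro F\(_1\)-scores. The names models_prior and f1_prior are mine.

Code
# Hypothetical sketch: refit NBC, LDA and QDA with uniform class priors
# instead of the empirical (imbalanced) ones, then compare macro F1-scores.
uniform_priors = np.ones(M) / M
models_prior = {"nbc": GaussianNB(priors=uniform_priors),
                "lda": LDA(priors=uniform_priors),
                "qda": QDA(priors=uniform_priors)}
f1_prior = {}
for name, model in models_prior.items():
    model.fit(X_nbc[:500, :], y_nbc[:500])
    y_hat = model.predict(X_nbc[500:, :])
    f1_prior[name] = round(f1_score(y_nbc[500:], y_hat, average="macro"), 3)
print("Macro F1 with uniform priors:", f1_prior)

Re-weighting the priors typically trades some overall accuracy for better recall on the minority classes; resampling the training data is another option.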


B. Do the same (compute these metrics) for the other two imbalanced datasets.


C. In NBC, LDA, and QDA, all parameters (such as means and variances) are estimated directly from the data. However, in RDA, we can tune the trade-off parameter \(\alpha\) that combines the QDA covariance matrices with the LDA covariance matrix. This allows for a balance between the flexibility of QDA and the simplicity of LDA:

\[\hat{\Sigma}_k(\alpha)=\alpha\hat{\Sigma}_k+(1-\alpha)\hat{\Sigma}.\]

We now search for an optimal \(\alpha^*\) that yields better, or even the best, scores (recall, precision and F\(_1\)-score). This can be achieved using the cross-validation technique. The steps we follow are:

a. First, let’s fix \(\alpha_0=0.5\) and write a Python function called deltaRDA(X, y, X_test, alpha = 0.5, lda_cov = None, qda_cov = None, means = None), where

  • X,y: input-output training data
  • X_test: an array containing testing inputs \(x_i\) for computing \(\delta(x)\).
  • alpha: regularization strength (\(0.5\) by default).
  • lda_cov: common covariance in LDA (\(\hat{\Sigma}\)). It should be estimated using data (X,y) if it’s None.
  • qda_cov: list of per-class covariances in QDA (\(\hat{\Sigma}_k\)). It should be estimated using data.
  • means: list of per-class means. It should also be estimated if it’s None.

This function computes the predictions for X_test using RDA at the regularization value \(\alpha_0\).¹ Hint: use the implementation on slide 37.
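For reference, one standard form of the regularized Gaussian discriminant, with \(\hat{\pi}_k\) the class proportions in the training data, is

\[\delta_k(x)=\log\hat{\pi}_k-\frac{1}{2}\log\big|\hat{\Sigma}_k(\alpha)\big|-\frac{1}{2}(x-\hat{\mu}_k)^\top\hat{\Sigma}_k(\alpha)^{-1}(x-\hat{\mu}_k),\]

and the prediction for \(x\) is \(\arg\max_k\delta_k(x)\).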

Code
def deltaRDA(X, y, X_test, alpha = 0.5, lda_cov = None, qda_cov = None, means = None):
    classes = np.unique(y)
    decoder = {i: classes[i] for i in range(len(classes))}
    # Estimate per-class means and covariances from (X, y) if they are not provided
    if means is None:
        means = [np.mean(X[y == c,:], axis=0) for c in classes]
    if qda_cov is None:
        qda_cov = [np.cov(X[y == c,:].transpose()) for c in classes]
    if lda_cov is None:
        # Pooled (common) covariance: (n_k - 1)-weighted average of the per-class covariances
        n_k = np.array([np.sum(y == c) for c in classes])
        lda_cov = np.sum([(n_k[i] - 1) * qda_cov[i] for i in range(len(classes))], axis=0) / (len(y) - len(classes))
    # Regularized covariances: alpha * Sigma_k + (1 - alpha) * Sigma
    cov = [qda_cov[i]*alpha + (1-alpha)*lda_cov for i in range(len(classes))]
    # Quadratic discriminant score delta_k(x) evaluated at each test point
    def delta(x, pi, mu, sigma):
        inv_sigma = np.linalg.inv(sigma)
        res = np.row_stack([np.log(pi) - np.log(np.linalg.det(sigma))/2
                            - np.dot(np.matmul((v - mu), inv_sigma), (v - mu))/2 for v in x])
        return res
    # Class priors estimated by the class proportions in the training data
    prop = np.array([np.sum(y == c) for c in classes])
    prop = prop / np.sum(prop)
    temp = np.column_stack([delta(X_test, prop[i], means[i], cov[i]) for i in range(len(classes))])
    # Predict the class with the largest discriminant score
    loc = np.argmax(temp, axis=1)
    return np.array([decoder[loc[i]] for i in range(X_test.shape[0])])


b. Compute the confusion matrix and all the performance metrics on the test data using deltaRDA at \(\alpha=0.5\) as defined above.

Code
# making prediction using delta(x)
y_hat = deltaRDA(X_nbc[:500,:], y_nbc[:500], X_nbc[500:,:])

# confusion matrix
conf_mat = confusion_matrix(y_nbc[500:], y_hat)
# metrics
prec = precision_score(y_nbc[500:], y_hat, average="macro")
rec = recall_score(y_nbc[500:], y_hat, average="macro")
f1 = f1_score(y_nbc[500:], y_hat, average="macro")

_, ax = plt.subplots(figsize = (5,4))
ConfusionMatrixDisplay(conf_mat).plot(ax=ax, colorbar=False)
ax.set_title(r"$\text{RDA at }\alpha=0.5\text{ on 'nbc' dataset}$")

plt.show()

res_nbc = pd.DataFrame(
    {
        'Recall': rec,
        'Precision': prec,
        'F1-score': f1,
        'Accuracy':  (conf_mat.diagonal().sum() / conf_mat.sum()).round(3)
    },
    index=["RDA"]
)
print(res_nbc)

       Recall  Precision  F1-score  Accuracy
RDA  0.726954   0.667442  0.686269      0.69


Remark question: What do you think about the results of RDA compared to the previous results from NBC, LDA, and QDA? Remember, we initially chose \(\alpha_0=0.5\). Now, let’s find the optimal value of \(\alpha\).


c. Split your training data into \(K\) folds, namely \(F_1,F_2,\dots,F_K\). Let \(F_{-k}\) denote all the training data except the \(k\)th fold. For any fixed \(\alpha\), write a function cvRDA(alpha) that returns the average F\(_1\)-score over the folds \(F_k\), using the model built on \(F_{-k}\) each time. One way to do this is shown below.

Code
K = 5          # choose K = 5
np.random.seed(42)
rand_id = np.random.choice(range(K), replace=True, size=500) # Randomly assign each training point to one of the K folds
X_train = X_nbc[:500,:]
y_train = y_nbc[:500]
X_test = X_nbc[500:,:]
y_test = y_nbc[500:]

def cvRDA(alpha):
    res = []
    for k in range(K):
        y_hat = deltaRDA(X=X_train[rand_id != k,:],
                         y=y_train[rand_id != k],
                         X_test=X_train[rand_id == k,:],
                         alpha=alpha)
        res.append(f1_score(y_train[rand_id == k], y_hat, average="macro"))
    return np.mean(res).round(3)

print(f"* Cross-validation F1-score for alpha = 0.5: {cvRDA(0.5)}")
* Cross-validation F1-score for alpha = 0.5: 0.76


d. Choose a grid for \(\alpha\), for example G=np.linspace(0,1,100), which generates a uniform grid of \(100\) values on the interval \([0,1]\). Now, write Python code that searches the grid \(G\) for the \(\alpha^*\) with the largest cross-validation F\(_1\)-score. Plot the following graph of \((x,y)=(\alpha, \text{CVError}(\alpha))\).

Code
G = np.linspace(0,1,100)
cvError = [cvRDA(alpha) for alpha in G]

plt.plot(G, cvError, label="CV error")
plt.title(r"$\alpha\text{ vs cross-validation score}$")
plt.xlabel(r"$\alpha$")
plt.ylabel("Cross-validation score")
alpha_opt = G[cvError==np.max(cvError)].round(3)
plt.vlines(alpha_opt, ymin=np.min(cvError), ymax=np.max(cvError), colors='r', linestyles="dashed", label=r"$\text{Optimal }\alpha^*$")
plt.scatter(x=alpha_opt, y=[np.min(cvError) for i in range(len(alpha_opt))], c='r')
plt.legend()
print(f"* Optimal alpha: {alpha_opt}")
* Optimal alpha: [0.263 0.293 0.303]

e. Build the RDA model with the observed optimal \(\alpha^*\) using all \(500\) training data points. Report the performance metrics on the testing data and summarize your findings.

Code
preds = {alpha: deltaRDA(X=X_train,
                         y=y_train,
                         X_test=X_test,
                         alpha=alpha) for alpha in alpha_opt}
recall = {alpha: np.round(recall_score(y_test, y, average="macro"), 3) for alpha, y in preds.items()}
precision = {alpha: np.round(precision_score(y_test, y, average="macro"), 3) for alpha, y in preds.items()}
f1 = {alpha: np.round(f1_score(y_test, y, average="macro"), 3) for alpha, y in preds.items()}
acc = {alpha: np.round(np.mean(y_test == y), 3) for alpha, y in preds.items()}

df_acc = pd.DataFrame({
    "alpha" : list(acc.keys()),
    "accucary" : list(acc.values())
})

print("* Accuracy:")

print(df_acc)

sns.set(style="white")
fig, ax = plt.subplots(1, 3, figsize = (9.75, 3))
i = 0
for alpha in alpha_opt:
    mat = confusion_matrix(y_test, preds[alpha])
    ConfusionMatrixDisplay(mat).plot(ax=ax[i], colorbar=False)
    ax[i].set_title(f"RDA at alpha={np.round(alpha,3)}; score={np.round(f1[alpha], 3)}")
    i += 1
* Accuracy:
   alpha  accuracy
0  0.263     0.715
1  0.293     0.720
2  0.303     0.720

Code
print("* Other metrics:")

print(pd.DataFrame({
        "recall": recall.values(),
        "precision" : precision.values(),
        "f1-score" : f1.values() 
    },index=list(recall.keys())))
* Other metrics:
       recall  precision  f1-score
0.263   0.704      0.679      0.69
0.293   0.717      0.687      0.70
0.303   0.717      0.687      0.70


Conclusion: NBC, LDA and QDA are simple models whose parameters can be estimated directly from the data and that require no hyperparameter tuning. Despite being simple, they are powerful in many tasks, including text classification and detection. Moreover, we can enhance their performance by tuning the shape of the covariance matrices through regularization, as illustrated in this section.


2. Real datasets

In this section, you will reproduce the results shown in the course with the Spam dataset and explore the Heart Disease Dataset. We can download the Spam data using the code below:

path_spam = "https://raw.githubusercontent.com/hassothea/MLcourses/refs/heads/main/data/spam.txt"
spam = pd.read_csv(path_spam, index_col=0, sep=" ")
spam.sample(5)
make address all num3d our over remove internet order mail ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
Id
3924 0.08 0.0 0.16 0.0 0.0 0.08 0.0 0.08 0.73 0.00 ... 0.126 0.172 0.057 0.000 0.022 0.0 3.212 44 665 nonspam
3631 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.00 0.00 ... 0.000 0.000 0.000 0.000 0.000 0.0 1.000 1 5 nonspam
4081 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.00 0.00 ... 0.000 0.344 0.000 0.000 0.000 0.0 1.400 5 14 nonspam
180 0.00 0.0 0.00 0.0 0.0 0.00 0.0 2.04 0.00 0.00 ... 0.000 0.000 0.000 0.969 0.000 0.0 2.179 18 85 spam
590 0.00 0.0 0.00 0.0 0.0 0.00 0.0 0.00 0.00 1.51 ... 0.000 0.000 0.000 0.000 0.000 0.0 2.769 15 36 spam

5 rows × 58 columns


A. Perform a descriptive analysis of the dataset: look for connections between the inputs and the target.

B. Split the data and build models based on your analysis in the previous step.

C. Report the results on the testing data.


The solution to this section has already been shown in the course. It is left for students to explore by reading the course slides here.
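As a starting point, here is a minimal sketch under my own assumptions (an 80/20 stratified split with seed 42, and no preprocessing); it is not the course solution. It checks how strongly each feature correlates with the binarized target and then fits NBC, LDA and QDA on the spam data loaded above.

Code
from sklearn.model_selection import train_test_split

# Binary target: 1 for spam, 0 for nonspam
y = (spam["type"] == "spam").astype(int)
X = spam.drop(columns="type")

# Descriptive step: correlation of each numeric feature with the target
corr_with_target = X.corrwith(y).sort_values(key=np.abs, ascending=False)
print(corr_with_target.head(10))

# Split, fit and evaluate the three models as in the simulation section
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
for name, model in {"NBC": GaussianNB(), "LDA": LDA(), "QDA": QDA()}.items():
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    print(f"{name}: acc = {accuracy_score(y_test, y_hat):.3f}, "
          f"F1 = {f1_score(y_test, y_hat):.3f}")

The split ratio, the seed and the absence of preprocessing are assumptions made for illustration; refer to the course slides for the full treatment.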


References

\(^{\text{📚}}\) Chapter 4, Introduction to Statistical Learning with R, James et al. (2021).
\(^{\text{📚}}\) Chapter 2, The Elements of Statistical Learning, Hastie et al. (2008).
\(^{\text{📚}}\) Friedman (1989).
\(^{\text{📚}}\) Cross-Validation technique.
\(^{\text{📚}}\) Heart Disease Dataset.


Footnotes

  1. This may be the most difficult question.↩︎