Advanced Machine Learning


     

Lecturer: Dr. HAS Sothea

Code: AMSI61AML

Course Criteria

Criteria Percentage
Attendance 10%
Participation & quiz 10%
Midterm Exam and/or Project 15%+15%
Final Exam 20%
Final Project & Presentation 30%
  • Programming: Python (NumPy, pandas, Matplotlib, Seaborn, scikit-learn, Plotly, …)

Course webpage

Introduction

What’s Machine Learning (ML)?

  • Machine = Computer
  • Learning = Improving at some task w.r.t. some performance measure.





“The field of study that gives computers the ability to learn (from data) without being explicitly programmed.” — Arthur Samuel

“A computer program is said to learn from experience \(E\) with respect to some class of tasks \(T\) and performance measure \(P\) if its performance at tasks in \(T\), as measured by \(P\), improves with experience \(E\).” — Tom M. Mitchell

Why does it matter?

  • Data is the cornerstone of decision-making and innovation.
  • Growth of data + growth of computational power.
    Source: Statista & Wikipedia

Why does it matter?

  • ML is a powerful tool in this era.

Source: https://www.javatpoint.com/applications-of-machine-learning

Course Outline

Week Topic
1 Naive Bayesian Classifier, LDA & QDA
2 Logistic Regression & Regularization
3 KNN & Decision Trees
4 Ensemble Learning
5 Model Selection
6 Hard Clustering
7 Probabilistic Clustering
8 Midterm Exam
9 Dimensionality Reduction: PCA, SVD
10 Association Rules
11 Markov Decision Process & Q-learning
12 Upper Confidence Bound (UCB) & Thompson Sampling
13 Deep Learning
14 MLOps
15-16 ML for SDG & Future Trends
17-18 Final exam & Projects

What we will focus on

  • Data comprehension:
    • What should we focus on when analyzing the data?
    • How can data comprehension guide us to the best model?
  • Method comprehension:
    • How do ML models work?
    • Which models seem suitable for a given dataset?
    • Pros & cons?
  • Practical tasks:
    • Quizzes
    • Practical labs (TP)
    • Projects + presentations…

1. Naive Bayesian Classifier (NBC)

Example: Email spam filter ✉️

  • Input \(\text{x}_i=(x_{i1},x_{i2},\dots, x_{id})\): Bag of words of email \(i\).
  • Label \(y_i\in\{1,0\}\) where \(1=\) Spam & \(0=\) Nonspam.
  • Objective: Classify if an email is a spam based on its input.
import pandas as pd     # Load package 'pandas' and call it 'pd'
spam = pd.read_csv(path + "spam.txt", sep=" ") # 'path' is the data folder on your machine
spam = spam.drop(columns=["Id"])
spam.sample(2)
(Output: 2 rows × 58 columns — word/character-frequency features such as make, address, …, capitalTotal, plus the type label.)

Example: Email spam filter ✉️

import seaborn as sns
sns.set(style="whitegrid")
sns.catplot(data=spam, x="type", kind="count", hue="type", height=4.5)

  • Consider \(3\) inputs: make, address & capitalTotal.
  • Suitable graphic\(^{\text{📚}}\): histogram, density, boxplot, violinplot…
import matplotlib.pyplot as plt
_, ax = plt.subplots(1, 3, figsize = (12, 3.5))
sns.histplot(data=spam, x="make", ax=ax[0], binwidth=0.2, hue = "type")
ax[0].set_title("Distribution of 'make'")
sns.histplot(data=spam, x="address", ax=ax[1], binwidth=0.5, hue = "type")
ax[1].set_title("Distribution of 'address'")
sns.histplot(data=spam, x="capitalTotal", ax=ax[2], binwidth=700, hue = "type")
ax[2].set_title("Distribution of 'capitalTotal'")

⚠️ Values are too concentrated around \(0\)!

  • Use a log scale (on the counts): the range is too wide and the values are too dense around \(0\).
_, ax = plt.subplots(1, 3, figsize = (12, 3.5))
sns.histplot(data=spam, x="make", ax=ax[0], binwidth=0.2, hue = "type")
ax[0].set_title("Distribution of 'make'")
ax[0].set_yscale("log")
sns.histplot(data=spam, x="address", ax=ax[1], binwidth=0.5, hue = "type")
ax[1].set_title("Distribution of 'address'")
ax[1].set_yscale("log")
sns.histplot(data=spam, x="capitalTotal", ax=ax[2], binwidth=700, hue = "type")
ax[2].set_title("Distribution of 'capitalTotal'")
ax[2].set_yscale("log")

✅ Better, isn’t it?

Example: Email spam filter ✉️

  • Consider \(X=(\) make, address, capitalTotal \()\in\mathbb{R}^3\).
Code
from scipy.stats import gaussian_kde
import numpy as np
x1 = [1.5, 4.2, 5050]
x2 = [1.5, 4.2, 9000]

# Given Y = 1
ker_make = gaussian_kde(spam.make[spam.type=="spam"])
ker_address = gaussian_kde(spam.address[spam.type=="spam"])
ker_capital = gaussian_kde(spam.capitalTotal[spam.type=="spam"])
den11 = [ker_make(x1[0]), ker_address(x1[1]), ker_capital(x1[2])]
den12 = [ker_make(x2[0]), ker_address(x2[1]), ker_capital(x2[2])]
n_spam = np.sum(spam.type=="spam")
n_non = spam.shape[0] - n_spam
pro11 = n_spam/spam.shape[0] * np.prod(den11)
pro12 = n_spam/spam.shape[0] * np.prod(den12)

# Given Y = 0
ker_make = gaussian_kde(spam.make[spam.type=="nonspam"])
ker_address = gaussian_kde(spam.address[spam.type=="nonspam"])
ker_capital = gaussian_kde(spam.capitalTotal[spam.type=="nonspam"])
den01 = [ker_make(x1[0]), ker_address(x1[1]), ker_capital(x1[2])]
den02 = [ker_make(x2[0]), ker_address(x2[1]), ker_capital(x2[2])]
pro01 = n_non/spam.shape[0] * np.prod(den01)
pro02 = n_non/spam.shape[0] * np.prod(den02)

pro01, pro11 = pro01/(pro01+pro11), pro11/(pro01+pro11)
pro02, pro12 = pro02/(pro02+pro12), pro12/(pro02+pro12)
# 'ax' and 'fig' refer to the log-scale histogram figure created above
ax[0].vlines([x1[0]], ymin=[0], ymax=[3000], color='black', linestyle = "dashed")
ax[1].vlines([x1[1]], ymin=[0], ymax=[3000], color='black', linestyle = "dashed")
ax[2].vlines([x1[2], x2[2]], ymin=[0], ymax=[3000], color=['red', 'blue'], linestyle = "dashed")
display(fig)

  • Type of \(\text{x}_1=(1.5, 4.2, \color{red}{5050})\) and \(\text{x}_2=(1.5, 4.2, \color{blue}{9000})\)?

Recall: Bayes’s Theorem

Bayes’s Theorem

For any two events \(E,H\) with \(\mathbb{P}(E)>0,\) one has \[\begin{equation}\overbrace{\mathbb{P}(H|E)}^{\text{Posterior}}=\frac{\overbrace{\mathbb{P}(E|H)}^{\text{Likelihood}}\times\overbrace{\mathbb{P}(H)}^{\text{Prior}}}{\underbrace{\mathbb{P}(E)}_{\text{Marginal}}}.\end{equation}\]

  • \(\mathbb{P}(H)\): Prior belief of having hypothesis \(H\).
  • \(\mathbb{P}(E|H)\): If \(H\) is true, how likely for \(E\) to be observed?
  • \(\mathbb{P}(H|E)\): If \(E\) is observed, how likely for \(H\) to be true?
  • \(\mathbb{P}(E)\): How likely for \(E\) to be observed in general?
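A quick numerical illustration of the theorem (with hypothetical numbers): suppose \(40\%\) of emails are spam (\(\mathbb{P}(H)=0.4\)), and the word “free” appears in \(60\%\) of spam emails but only \(5\%\) of nonspam emails. Then \[\mathbb{P}(H|E)=\frac{0.6\times 0.4}{0.6\times 0.4+0.05\times 0.6}=\frac{0.24}{0.27}\approx 0.889,\] so observing “free” raises the belief that the email is spam from \(0.4\) to about \(0.89\).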

Back to email spam filter problem

  • For any email \(\text{x}=(x_1,\dots, x_d)\): \[\mathbb{P}(Y=1|X=\text{x})=\frac{\mathbb{P}^{\small\text{📚}}(X=\text{x}|Y=1)\times\mathbb{P}(Y=1)}{\mathbb{P}(X=\text{x})}.\]

  • \(\mathbb{P}(Y=1|X=\text{x})^{\small\text{📚}}\) allows us to classify email \(x\): \[\text{Email x} \text{ is a }\begin{cases}\text{spam}& \mbox{if }\mathbb{P}(Y=1|X=\text{x})\geq\delta\\ \text{nonspam}& \mbox{if }\mathbb{P}(Y=1|X=\text{x})<\delta\end{cases}\] for some \(\delta\in (0,1)\). A common choice is \(\delta=0.5\).



Main assumption of NBC

Key quantities in classification

\[\mathbb{P}(Y=1|X=\text{x})\propto \color{red}{\mathbb{P}(X=\text{x}|Y=1)}\times\color{green}{\mathbb{P}(Y=1)}.\]

  • \(\color{green}{\mathbb{P}(Y=1)}\) can be estimated by \(\frac{n(\text{spams})}{n(\text{emails})}\)

  • \(\color{red}{\mathbb{P}(X=\text{x}|Y=1)}\) is more complicated (key to different models) ⚠️

Main Assumption of Naive Bayes

Within any class \(k\in\{1,0\}\), the components of \(X|Y=k\) are independent, i.e., \[\color{red}{\mathbb{P}(X=\text{x}|Y=k)}=\prod_{j=1}^d\mathbb{P}(X_j=x_j|Y=k).\]

Key quantities of NBC

Key quantities in Naive Bayes

\[\mathbb{P}(Y=1|X=\text{x})\propto \color{green}{\mathbb{P}(Y=1)}\color{red}{\prod_{j=1}^d\mathbb{P}(X_j=x_j|Y=1)},\]

  • \(\mathbb{P}(X_j=x_j|Y=1)\) can be estimated (in 1D)\(^{\small\text{📚}}\) as follows:
Type of \(X_j\) Distribution Graphic
Qualitative Bernoulli, Multinomial… barplot, countplot
Quantitative Gaussian, Exponential… displot, hist, density

Back to email spam filter problem

  • For \(\text{x}_1=(1.5, 4.2, \color{red}{5050})\) and \(\text{x}_2=(1.5, 4.2, \color{blue}{9000})\):
    • \(\mathbb{P}(Y=1|X=\text{x}_1)=\) 3.84e-05.
    • \(\mathbb{P}(Y=1|X=\text{x}_2)=\) 1.0, with \(\mathbb{P}(Y=1)=\) 0.394.

\(M\)-class Naive Bayes Classifier

  • For any \(d\)-dimensional input \(\text{x}\) and \(k\in\{1,2,...,M\}\):

Key quantity

\[\mathbb{P}(Y=k|X=\text{x})\propto\mathbb{P}(Y=k)\prod_{j=1}^d\mathbb{P}(X_j=x_j|Y=k).\]

Classification rule

\[x\text{ belongs to class }k^*\text{ if }\mathbb{P}(Y=k^*|X=x)=\max_{1\leq k\leq M}\mathbb{P}(Y=k|X=\text{x}).\]
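A minimal from-scratch sketch of this \(M\)-class rule, assuming quantitative features modeled with per-class Gaussians (as in the table above); it is for illustration only — scikit-learn's GaussianNB, used later, implements the same idea:

import numpy as np

def fit_gaussian_nb(X, y):
    # Estimate the prior, per-feature mean and variance for each class k
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}
    means = {k: X[y == k].mean(axis=0) for k in classes}
    variances = {k: X[y == k].var(axis=0) + 1e-9 for k in classes}  # jitter avoids zero variance
    return classes, priors, means, variances

def predict_gaussian_nb(x, classes, priors, means, variances):
    # Return the class k maximizing log P(Y=k) + sum_j log P(X_j = x_j | Y=k)
    def score(k):
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances[k])
                                + (x - means[k]) ** 2 / variances[k])
        return np.log(priors[k]) + log_lik
    return max(classes, key=score)

# Example on the three spam inputs used earlier
X = spam[["make", "address", "capitalTotal"]].to_numpy()
y = spam["type"].to_numpy()
model = fit_gaussian_nb(X, y)
predict_gaussian_nb(np.array([1.5, 4.2, 9000]), *model)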

Pros & Cons of NBC

Pros

  • Efficiency & simplicity
  • Less training data requirement
  • Scalability: works well on large and high-dimensional data
  • Ability to handle categorical data
  • Ability to handle missing data
  • Sometimes, it still works well even though the assumption of independence is violated.

Cons

  • May perform poorly when features are highly correlated, since the independence assumption is violated.
  • May not work well for complex relationships between features.
  • Zero probability: a category that never appears in the training data for some feature gets probability \(0\).
  • Continuous features are often modeled with a Gaussian distribution, which is not always appropriate.

Results on Spam dataset

  • Data: \((\text{x},y)\in\mathbb{R}^{57}\times\{0,1\}\).
  • Test data: \(20\%\) of total 4601 observations.
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
sns.set(style="white")
X_train1, X_test1, y_train1, y_test1 = train_test_split(spam.iloc[:,:57], spam.iloc[:,57], test_size = 0.2, random_state=42)
nb1 = GaussianNB()
nb1 = nb1.fit(X_train1, y_train1)
pred1 = nb1.predict(X_test1)
conf1 = confusion_matrix(pred1, y_test1)
con_fig1 = ConfusionMatrixDisplay(conf1)

  • Accuracy: \[\frac{387+369}{387+369+21+144}\approx 0.821.\]
  • Misclassification error: \[1-\text{accuracy}\approx 0.179.\]
  • This depends on the split ⚠️
  • Input: \(x=(\) make, address, capitalTotal \()\in\mathbb{R}^3\).
  • Test data: the same \(20\%\).
X_train2, X_test2, y_train2, y_test2 = train_test_split(spam[["make","address", "capitalTotal"]], spam.iloc[:,57], test_size = 0.2, random_state=42)
nb2 = GaussianNB()
nb2 = nb2.fit(X_train2, y_train2)
pred2 = nb2.predict(X_test2)
conf2 = confusion_matrix(pred2, y_test2)
con_fig2 = ConfusionMatrixDisplay(conf2)

  • Accuracy: \[\frac{418+157}{418+157+233+113}\approx 0.624.\]
  • Misclassification error: \[1-\text{accuracy}\approx0.376.\]
  • This depends on the split ⚠️

Imbalanced data ⚠️

  • If the data contains \(95\%\) nonspam, always guessing nonspam gives \(0.95\) accuracy!
  • Accuracy isn’t the right metric for imbalanced data ⚠️

Confusion Matrix


  • \(\color{purple}{\text{Precision}}=\frac{\color{CornflowerBlue}{\text{TP}}}{\color{CornflowerBlue}{\text{TP}}+\color{purple}{\text{FP}}}\)
  • \(\color{Tan}{\text{Recall}}=\frac{\color{CornflowerBlue}{\text{TP}}}{\color{CornflowerBlue}{\text{TP}}+\color{Tan}{\text{FN}}}\)
  • \(\color{ForestGreen}{\text{F1-score}}=\frac{2\cdot\color{purple}{\text{Precision}}\cdot\color{Tan}{\text{Recall}}}{\color{purple}{\text{Precision}}+\color{Tan}{\text{Recall}}}\).
  • \(\color{ForestGreen}{\text{F1-score}}\) balances \(\color{purple}{\text{FP}}\) & \(\color{Tan}{\text{FN}}\).
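These formulas are easy to compute directly from the confusion-matrix counts; here is a small sketch with purely hypothetical counts:

# Hypothetical counts read off a confusion matrix
TP, FP, FN = 90, 10, 30

precision = TP / (TP + FP)                          # 0.9
recall = TP / (TP + FN)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # ~0.818
print(precision, recall, round(f1, 3))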

Imbalanced data ⚠️

  • \(\color{purple}{\text{Precision}}=\frac{\color{CornflowerBlue}{\text{TP}}}{\color{CornflowerBlue}{\text{TP}}+\color{purple}{\text{FP}}}\)
  • \(\color{Tan}{\text{Recall}}=\frac{\color{CornflowerBlue}{\text{TP}}}{\color{CornflowerBlue}{\text{TP}}+\color{Tan}{\text{FN}}}\)
  • \(\color{ForestGreen}{\text{F1-score}}=\frac{2\cdot\color{purple}{\text{Precision}}\cdot\color{Tan}{\text{Recall}}}{\color{purple}{\text{Precision}}+\color{Tan}{\text{Recall}}}\).
Code
import plotly.graph_objects as go
x = np.linspace(0.01, 1, 20)  # start slightly above 0 to avoid 0/0 in the F1 formula
y = np.linspace(0.01, 1, 20)
z1 = [[2*x[i]*y[j]/(x[i]+y[j]) for j in range(len(y))] for i in range(len(x))]
z2 = [[(x[i]+y[j])/2 for j in range(len(y))] for i in range(len(x))]

camera = dict(
    eye=dict(x=1.7, y=-1.2, z=1.2)
)

fig = go.Figure(go.Surface(x = x,
                           y = y,
                           z = z1,
                           name = "F1-score",
                           colorscale = "Blues",
                           showscale = False))
fig.add_trace(go.Surface(x = x,
                         y = y,
                         z = z2,
                        name = "Mean",
                        colorscale = "Electric",
                        showscale = False))
fig.update_layout(scene = dict(
                    xaxis_title='Precision',
                    yaxis_title='Recall',
                    zaxis_title='Scores'),
                  title = dict(text="F1-score vs Mean", 
                               y=0.9,
                               x=0.5,
                               font=dict(size = 30, 
                                         color = "#1C66B5")
                              ),
                  scene_camera=camera,
                  width = 560,
                  height = 500)
fig.show()

Imbalanced data ⚠️

Receiver Operating Characteristic Curve (ROC)

\(\bullet\) ROC \(=\{(\)FPR\(_{\delta}\),TPR\(_{\delta}):\delta\in[0,1]\}\).
\(\bullet\) Better model = Larger Area Under the Curve (AUC).

Code
from plot_metric.functions import BinaryClassification
from plotly.tools import mpl_to_plotly
# Visualisation with plot_metric
y1 = 1*(y_test1 == "spam")
# Scores: predicted probability of the 'spam' class from each Naive Bayes model
pr1 = nb1.predict_proba(X_test1)[:, list(nb1.classes_).index("spam")]
pr2 = nb2.predict_proba(X_test2)[:, list(nb2.classes_).index("spam")]
bc1 = BinaryClassification(y1, pr1, labels=["nonspam", "spam"], seaborn_style="whitegrid")

bc2 = BinaryClassification(y1, pr2, labels=["nonspam", "spam"], seaborn_style="whitegrid")

# Figures
a = bc1.plot_roc_curve()
fig_full = plt.gcf()
pl_full = mpl_to_plotly(fig_full)
pl_full.update_layout(width=500, height=450, 
                      title=dict(text="ROC Curve of Full model", 
                                 font=dict(size=25)),
                      xaxis_title = dict(text='False Positive Rate', font=dict(size=20, color = "red")),
                      yaxis_title = dict(text='True Positive Rate (Recall)', font=dict(size=20, color = "#EBB31D")),
                      template='plotly_white')
pl_full.show()

Imbalanced data ⚠️

Receiver Operating Characteristic Curve (ROC)

\(\bullet\) ROC \(=\{(\)FPR\(_{\delta}\),TPR\(_{\delta}):\delta\in[0,1]\}\).
\(\bullet\) Better model = Larger Area Under the Curve (AUC).

Code
bc2 = BinaryClassification(y1, pr2, labels=["nonspam", "spam"], seaborn_style="whitegrid")

# Figures
b = bc2.plot_roc_curve()
fig_3 = plt.gcf()
pl_3 = mpl_to_plotly(fig_3)
pl_3.update_layout(width=500, height=450, 
                      title=dict(text="ROC Curve of 3-input model", 
                                 font=dict(size=25)),
                      xaxis_title = dict(text='False Positive Rate', font=dict(size=20, color = "red")),
                      yaxis_title = dict(text='True Positive Rate (Recall)', font=dict(size=20, color = "#EBB31D")),
                      template='plotly_white')
pl_3.show()

Imbalanced data ⚠️ (Summary)

Confusion matrix

  • Precision: controls FP.
  • Recall: controls FN.
  • F1-score: balances the two.

ROC Curve & AUC

  • ROC Curve: balances TPR and FPR.
  • Can be used to select \(\delta\in [0,1]\).
  • Better model = Larger AUC.
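The same ROC curve and AUC can be obtained with scikit-learn (a sketch, reusing nb1, X_test1 and y_test1 from the Naive Bayes fit above):

from sklearn.metrics import roc_curve, roc_auc_score

# Score of each test email = predicted probability of the 'spam' class
scores = nb1.predict_proba(X_test1)[:, list(nb1.classes_).index("spam")]
y_true = (y_test1 == "spam").astype(int)

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold delta
print("AUC:", roc_auc_score(y_true, scores))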

Imbalanced data ⚠️ (to explore)

  • Sampling methods:
    • Oversampling: random, SMOTE, SMOTE SVM, ADASYN…
    • Undersampling: random, NearMiss, CNN, Tomek Links…
  • Weight adjustment methods (nonparametric)
    • Tree-based algorithms, \(k\)-NN, kernel methods…
  • Tuning threshold \(\delta\).
  • Work with packages that handle imbalanced data (e.g., imbalanced-learn).
  • Helpful links: Geeks for Geeks, Angelleon Collado, Machine Learning Mastery
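For instance, a minimal sketch of oversampling with SMOTE from the imbalanced-learn package (assuming it is installed; only the training split is resampled, reusing X_train1 and y_train1 from earlier):

from collections import Counter
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train1, y_train1)  # resample the training data only
print(Counter(y_train1), "->", Counter(y_res))      # classes are now balanced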

Quiz time 🤗

Showtime 🫣

Quiz time (again) 🤗

Showtime (once more) 🫣

2. Linear, Quadratic & Regularized Discriminant Analysis (LDA, QDA & RDA)

Univariate Gaussian Distribution

Code
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objs as go

# define mean and standard deviation
mu1, Sigma1 = 0, 3

# Simulate points
x1 = np.random.normal(mu1, Sigma1, 100)

# Plot points
fig = go.Figure(go.Scatter(x=x1, 
                           y=[0 for i in range(len(x1))],
               mode = "markers",
               name = "Points/Observations",
               showlegend = True,
               marker = dict(size = 6)))
# Density
def density_gaussian1d(x, mu, sigma):
  return 1/((2*np.pi*sigma ** 2) ** (1/2)) * np.exp(-1/2 * (x-mu) ** 2/ sigma ** 2)

x = np.linspace(-10, 10, 50)
y1 = np.array([density_gaussian1d(xi, mu1, Sigma1) for xi in x])

fig.add_trace(
    go.Scatter(x=x,
               y=y1,
               mode="lines",
               name = "Density/Possibility",
               showlegend = True))

fig.update_layout(title = dict(text="1D Gaussian Random Variables",
                               x = 0.5,
                               y = 0.98,
                               font=dict(size = 20, 
                                          color = "#1C66B5")),
                  width = 450,
                  height = 430,
                  yaxis_title=dict(text='Density'),
                  legend=dict(
                  yanchor="top",
                  y=0.95,
                  xanchor="left",
                  x=0.01))
  • Simulate \(x_1,\dots,x_{100}\) with \[x_i\sim{\cal N}(0,3^2).\]
  • Density function: \[f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2},\]
    • \(\mathbb{E}(X)=\mu\in\mathbb{R}\)
    • \(\mathbb{V}(X)=\mathbb{E}[(X-\mu)^2]=\sigma^2\).
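As a sanity check, the hand-written density_gaussian1d above should agree with scipy.stats.norm (same \(\mu=0\), \(\sigma=3\)):

from scipy.stats import norm

x_check = np.linspace(-10, 10, 5)
np.allclose(density_gaussian1d(x_check, mu1, Sigma1),
            norm.pdf(x_check, loc=mu1, scale=Sigma1))  # True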

Multivariate Gaussian Distribution

Code
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objs as go

# define mean and covariance matrix
mu1, Sigma1 = [0, 0], [[1, 0.5], [0.5, 3]]

# Simulate points
x1 = np.random.multivariate_normal(mean=mu1, cov=Sigma1, size=100)

# Plot points
fig = go.Figure(go.Scatter3d(x=x1[:,0], 
                           y=x1[:,1],  
                           z=[0 for i in range(x1.shape[0])],
               mode = "markers",
               name = "Points/Observations",
               showlegend = True,
               marker = dict(size = 4)))

# Simulate points
# Density
def density_gaussian2d(x, mu, Sigma):
  return 1/((2*np.pi) ** (len(x)/2) * np.sqrt(np.linalg.det(Sigma))) * np.exp(-1/2 * np.dot(np.dot(x-mu, np.linalg.inv(Sigma)), x-mu))

x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
f1 = np.array([[density_gaussian2d(np.array([xi,yi]), mu1, Sigma1) for xi in x] for yi in y])

fig.add_trace(
    go.Surface(x=x,
               y=y,
               z=f1,
               name = "Density/Possibility",
               opacity=0.3,
               showlegend = True,
               showscale=False))

fig.update_layout(title = dict(text="2D Gaussian Random Variables",
                               x = 0.5,
                               y = 0.98,
                               font=dict(size = 20, 
                                          color = "#1C66B5")),
                  width = 400,
                  height = 430,
                  legend=dict(
                  yanchor="top",
                  xanchor="center",
                  y = 0.9,
                  x = 0.5
                ))
fig.update_scenes(zaxis_title_text='Density')
  • Simulate \(x_1,\dots,x_{100}\) with \(x_i\sim{\cal N}_2\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 3 \end{pmatrix}\right)\)
  • \(d\)-dimensional density: \(f_X(\text{x})=\frac{e^{-\frac{1}{2}(\text{x}-\mu)^t\Sigma^{-1}(\text{x}-\mu)}}{\sqrt{(2\pi)^d|\Sigma|}}.\)
    • \(\mathbb{E}(X)=\mu\in\mathbb{R}^d\)
    • \(\mathbb{V}(X)=\mathbb{E}[(X-\mu)(X-\mu)^t]=\Sigma\).
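Likewise, density_gaussian2d defined in the code above can be checked against scipy.stats.multivariate_normal (same \(\mu\) and \(\Sigma\)):

from scipy.stats import multivariate_normal

pt = np.array([1.0, -0.5])
np.allclose(density_gaussian2d(pt, np.array(mu1), np.array(Sigma1)),
            multivariate_normal(mean=mu1, cov=Sigma1).pdf(pt))  # True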

Why Gaussian Distribution?

Code
# define means and covariance matrices
mu1, Sigma1 = [0, 0], [[1, 0], [0, 3]]

mu2, Sigma2  = [5, 5], [[3, -1], [-1, 3]]

mu3, Sigma3 = [-2, 6], [[3, 1.5], [1.5, 1]]

mu4, Sigma4 = [6, 0], [[3, 0.1], [0.1, 0.25]]

x1 = np.random.multivariate_normal(mean=mu1, cov=Sigma1, size=300)
x2 = np.random.multivariate_normal(mean=mu2, cov=Sigma2, size=300)
x3 = np.random.multivariate_normal(mean=mu3, cov=Sigma3, size=300)
x4 = np.random.multivariate_normal(mean=mu4, cov=Sigma4, size=300)

# Save data for later
df_qda = pd.DataFrame({
    "x1" : np.concatenate([x1[:,0], x2[:,0], x3[:,0], x4[:,0]]),
    "x2" : np.concatenate([x1[:,1], x2[:,1], x3[:,1], x4[:,1]]),
    "y" : np.repeat([1,2,3,4], 300)
})

# Plot points
fig0 = go.Figure(go.Scatter3d(x=x1[:,0], 
               y = x1[:,1], 
               z = [0] * len(x1[:,0]),
               mode = "markers",
               name = "Case 1",
               showlegend = True,
               marker = dict(size = 3)))

fig0.add_trace(
    go.Scatter3d(x=x2[:,0], 
               y = x2[:,1], 
               z = [0] * len(x1[:,0]),
               mode = "markers",
               name = "Case 2",
               showlegend = True,
               marker = dict(size = 3)))
fig0.add_trace(
    go.Scatter3d(x=x3[:,0], 
               y = x3[:,1], 
               z = [0] * len(x1[:,0]),
               mode = "markers",
               name = "Case 3",
               showlegend = True,
               marker = dict(size = 3)))
fig0.add_trace(
    go.Scatter3d(x=x4[:,0], 
               y = x4[:,1], 
               z = [0] * len(x1[:,0]),
               mode = "markers",
               name = "Case 4",
               showlegend = True,
               marker = dict(size = 3)))

# Density
def density_gaussian2d(x, mu, Sigma):
  return 1/((2*np.pi) ** (len(x)/2) * np.sqrt(np.linalg.det(Sigma))) * np.exp(-1/2 * np.dot(np.dot(x-mu, np.linalg.inv(Sigma)), x-mu))

x = np.linspace(-10, 15, 50)
y = np.linspace(-5, 12, 50)
f1 = np.array([[density_gaussian2d(np.array([xi,yi]), mu1, Sigma1) for xi in x] for yi in y])
f2 = np.array([[density_gaussian2d(np.array([xi,yi]), mu2, Sigma2) for xi in x] for yi in y])
f3 = np.array([[density_gaussian2d(np.array([xi,yi]), mu3, Sigma3) for xi in x] for yi in y])
f4 = np.array([[density_gaussian2d(np.array([xi,yi]), mu4, Sigma4) for xi in x] for yi in y])

fig0.add_trace(
    go.Surface(x = x,
               y = y,
               z = f1,
               name = "Density 1",
               showlegend = True,
               opacity=0.5,
               showscale=False))
fig0.add_trace(
    go.Surface(x=x,
               y =y,
               z=f2,
               name = "Density 2",
               opacity=0.5,
               showlegend = True,
               showscale=False))
fig0.add_trace(
    go.Surface(x=x,
               y =y,
               z=f3,
               name = "Density 3",
               opacity=0.5,
               showlegend = True,
               showscale=False))
fig0.add_trace(
    go.Surface(x=x,
               y =y,
               z=f4,
               name = "Density 4",
               showlegend = True,
               opacity=0.5,
               showscale=False))
camera = dict(
    eye=dict(x=0, y=-1.2, z=1.5)
)
fig0.update_layout(title = dict(text="Gaussian models",
                               x = 0.4,
                               y = 0.9,
                               font=dict(size = 20, 
                                          color = "#1C66B5")),
                  scene_camera=camera,
                  width = 420,
                  height = 510)
fig0.show()
Code
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_length", 
                y="petal_length", 
                color="species",
                size='petal_width')
fig.update_layout(title = dict(text="Iris dataset",
                               x = 0.4,
                               y = 0.9,
                               font=dict(size = 20, 
                                          color = "#1C66B5")),
                  width = 420,
                  height = 510)

Quadratic Discriminant Analysis

Recall key quantities

\[\mathbb{P}(Y=k|X=\text{x})\propto \color{red}{\mathbb{P}(X=\text{x}|Y=k)}\times\color{green}{\mathbb{P}(Y=k)}.\]

Main Assumption of QDA

For any class \(k\in\{1,\dots,M\}\): \[\color{red}{\mathbb{P}(X=\text{x}|Y=k)}={\cal N}_d(\mu_k,\Sigma_k),\] for some \(\mu_k\in\mathbb{R}^d\) and \(d\times d\) matrix \(\Sigma_k\) (to be estimated).

🔑 Within any class, the shape of input \(X\) is assumed to be Gaussian.

Discriminant Function

Goal: Search for the class \(k^*\) such that

\[\begin{align*} k^*=&\arg\max_{1\leq m\leq M}\mathbb{P}(Y=m|X=\text{x})\\ =&\arg\max_{1\leq m\leq M} \mathbb{P}(Y=m)\mathbb{P}(X=\text{x}|Y=m)\\ =&\arg\max_{1\leq m\leq M} \log\left(\mathbb{P}(Y=m)\mathbb{P}(X=\text{x}|Y=m)\right)\\ =&\arg\max_{1\leq m\leq M} \delta_m(\text{x}), \end{align*}\] where \(\delta_m(\text{x})\) measures the association of the input \(\text{x}\) with the class \(m\) and is defined by \[\delta_m(\text{x})=\color{green}{\log(\pi_m)}-\color{blue}{\frac{1}{2}\log(|\Sigma_m|)}-\color{red}{\frac{1}{2}(\text{x}-\mu_m)^t\Sigma_m^{-1}(\text{x}-\mu_m)}\] with \(\pi_m=\mathbb{P}(Y=m)\), \(\mu_m\in\mathbb{R}^d\) and matrix \(\Sigma_m\) to be estimated \(\forall m=1,\dots,M\).

Boundary Decision of QDA

Definition: Boundary Decision

Boundary decision of 2 classes \(k\) and \(j\) is the set of inputs \(\text{x}\in\mathbb{R}^d\) satisfying: \[\mathbb{P}(Y=k|X=\text{x})=\mathbb{P}(Y=j|X=\text{x}).\]

Boundary Decision of QDA

  • The decision boundary between classes \(k\) and \(j\) in QDA is the set of inputs \(\text{x}\) s.t. \(\delta_k(\text{x})=\delta_j(\text{x})\).
  • It is a quadratic form in \(\text{x}\): \(\text{x}^tA\text{x}+v^t\text{x}+c=0\), where the \(d\times d\) symmetric matrix \(A\), \(v\in\mathbb{R}^d\) and \(c\in\mathbb{R}\) depend on \(\pi_k,\pi_j,\mu_k,\mu_j,\Sigma_k\) and \(\Sigma_j\) (see the expansion below).
  • Such a boundary is more flexible than a linear boundary, but can easily overfit the data!
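Indeed, expanding \(\delta_k(\text{x})=\delta_j(\text{x})\) with the discriminant functions above gives the coefficients explicitly: \[A=-\frac{1}{2}\left(\Sigma_k^{-1}-\Sigma_j^{-1}\right),\qquad v=\Sigma_k^{-1}\mu_k-\Sigma_j^{-1}\mu_j,\] \[c=\log\frac{\pi_k}{\pi_j}-\frac{1}{2}\log\frac{|\Sigma_k|}{|\Sigma_j|}-\frac{1}{2}\left(\mu_k^t\Sigma_k^{-1}\mu_k-\mu_j^t\Sigma_j^{-1}\mu_j\right),\] and when \(\Sigma_k=\Sigma_j\) the quadratic term \(A\) vanishes, which is exactly the LDA case below.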

Implementation of QDA

Implementation\(^{\text{📚}}\)

  • Given: Training data \({\cal D}_n=\{(\text{x}_1,y_1),\dots,(\text{x}_n,y_n)\}\) and the query input \(\text{x}\).
  • for \(k\) in range(M) estimate:
    • \(\hat{\pi}_k=n_k/n\) where \(n_k=\text{card}(\{i:y_i=k\})\) (estimate prior \(\mathbb{P}(Y=k)\))
    • \(\hat{\mu}_k=\frac{1}{n_k}\sum_{i:y_i=k}\text{x}_i\) (estimate mean of class \(k\))
    • \(\hat{\Sigma}_k=\frac{1}{n_k}\sum_{i:y_i=k}(\text{x}_i-\hat{\mu}_k)(\text{x}_i-\hat{\mu}_k)^t\) (estimate covariance matrix of class \(k\))
    • Compute \(\delta_k(\text{x})=\log(\hat{\pi}_k)-\frac{1}{2}\log(|\hat{\Sigma}_k|)-\frac{1}{2}(\text{x}-\hat{\mu}_k)^t\hat{\Sigma}_k^{-1}(\text{x}-\hat{\mu}_k)\)
  • return \(k\) with the largest value of \(\delta_k(\text{x})\).
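A minimal NumPy sketch of this pseudocode (illustration only; in practice one would use sklearn's QuadraticDiscriminantAnalysis, shown later in the slides):

import numpy as np

def qda_fit(X, y):
    # Estimate (pi_k, mu_k, Sigma_k) for every class k
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X),                      # prior pi_k
                     Xk.mean(axis=0),                       # class mean mu_k
                     np.cov(Xk, rowvar=False, bias=True))   # class covariance Sigma_k (1/n_k)
    return params

def qda_predict(x, params):
    # Return the class with the largest discriminant delta_k(x)
    def delta(k):
        pi, mu, Sigma = params[k]
        diff = x - mu
        return (np.log(pi) - 0.5 * np.log(np.linalg.det(Sigma))
                - 0.5 * diff @ np.linalg.inv(Sigma) @ diff)
    return max(params, key=delta)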

Linear Discriminant Analysis (LDA)

  • QDA can be expensive when \(M\) is large.
  • LDA is achieved by imposing the following assumption.

Main Assumption of LDA

In LDA, we assume that the input \(X\) has the same covariance matrix within all classes, i.e., \[\Sigma_k=\Sigma, \forall k=1,\dots,M.\] In other words, \(X\) has the same shape for all classes.

Code
# define means and covariance matrices
mu1, Sigma1 = [0, 0], [[2, 0], [0, 2]]

mu2, Sigma2  = [5, 5], [[3, 0], [0, 3]]

mu3, Sigma3 = [-2, 6], [[1, 0], [0, 1]]

mu4, Sigma4 = [6, 0], [[1.75, 0], [0, 1.75]]

x1 = np.random.multivariate_normal(mean=mu1, cov=Sigma1, size=300)
x2 = np.random.multivariate_normal(mean=mu2, cov=Sigma2, size=300)
x3 = np.random.multivariate_normal(mean=mu3, cov=Sigma3, size=300)
x4 = np.random.multivariate_normal(mean=mu4, cov=Sigma4, size=300)

# Save data for later
df_lda = pd.DataFrame({
    "x1" : np.concatenate([x1[:,0], x2[:,0], x3[:,0], x4[:,0]]),
    "x2" : np.concatenate([x1[:,1], x2[:,1], x3[:,1], x4[:,1]]),
    "y" : np.repeat([1,2,3,4], 300)
})

# Plot points
fig1 = go.Figure(go.Scatter3d(x=x1[:,0], 
               y = x1[:,1], 
               z = [0] * len(x1[:,0]),
               mode = "markers",
               name = "Case 1",
               showlegend = True,
               marker = dict(size = 3)))

fig1.add_trace(
    go.Scatter3d(x=x2[:,0], 
               y = x2[:,1], 
               z = [0] * len(x1[:,0]),
               mode = "markers",
               name = "Case 2",
               showlegend = True,
               marker = dict(size = 3)))
fig1.add_trace(
    go.Scatter3d(x=x3[:,0], 
               y = x3[:,1], 
               z = [0] * len(x1[:,0]),
               mode = "markers",
               name = "Case 3",
               showlegend = True,
               marker = dict(size = 3)))
fig1.add_trace(
    go.Scatter3d(x=x4[:,0], 
               y = x4[:,1], 
               z = [0] * len(x1[:,0]),
               mode = "markers",
               name = "Case 4",
               showlegend = True,
               marker = dict(size = 2)))

# Density
def density_gaussian2d(x, mu, Sigma):
  return 1/((2*np.pi) ** (len(x)/2) * np.sqrt(np.linalg.det(Sigma))) * np.exp(-1/2 * np.dot(np.dot(x-mu, np.linalg.inv(Sigma)), x-mu))

x = np.linspace(-5, 10, 50)
y = np.linspace(-5, 10, 50)
f1 = np.array([[density_gaussian2d(np.array([xi,yi]), mu1, Sigma1) for xi in x] for yi in y])
f2 = np.array([[density_gaussian2d(np.array([xi,yi]), mu2, Sigma2) for xi in x] for yi in y])
f3 = np.array([[density_gaussian2d(np.array([xi,yi]), mu3, Sigma3) for xi in x] for yi in y])
f4 = np.array([[density_gaussian2d(np.array([xi,yi]), mu4, Sigma4) for xi in x] for yi in y])

fig1.add_trace(
    go.Surface(x = x,
               y = y,
               z = f1,
               name = "Density 1",
               showlegend = True,
               opacity=0.5,
               showscale=False))
fig1.add_trace(
    go.Surface(x=x,
               y =y,
               z=f2,
               name = "Density 2",
               opacity=0.5,
               showlegend = True,
               showscale=False))
fig1.add_trace(
    go.Surface(x=x,
               y =y,
               z=f3,
               name = "Density 3",
               opacity=0.5,
               showlegend = True,
               showscale=False))
fig1.add_trace(
    go.Surface(x=x,
               y =y,
               z=f4,
               name = "Density 4",
               showlegend = True,
               opacity=0.5,
               showscale=False))
camera = dict(
    eye=dict(x=0, y=-1.2, z=1.5)
)
fig1 = fig1.update_layout(
                   title = dict(text=r'$\Sigma_k\text{ in LDA}$', 
                                y=0.9,
                                x=0.25,
                                font=dict(size = 20, 
                                          color = "#1C66B5")
                              ),
                  scene_camera=camera,
                  width = 320,
                  height = 250)

import ipywidgets as ipw 
fig0.update_layout(title = dict(text=r'$\Sigma_k\text{ in QDA}$', 
                                y=0.9,
                                x=0.25,
                                font=dict(size = 20, 
                                          color = "#1C66B5")
                              ),
                  scene_camera=camera,
                  width = 320,
                  height = 250)
fig0 = go.FigureWidget(fig0)
fig1 = go.FigureWidget(fig1)
ipw.HBox([fig0, fig1])

LDA is slightly simpler than QDA

Discriminant Function

  • In LDA, \(\delta_k(\text{x})=\log(\pi_k)-\frac{1}{2}(\text{x}-\mu_k)^t\Sigma^{-1}(\text{x}-\mu_k).\)

Boundary Decision of LDA

  • The decision boundary takes a linear form in \(\text{x}\): \[\text{x}^tv+c=0,\] where \(v\in\mathbb{R}^d\) and \(c\in\mathbb{R}\) depend on \(\pi_k,\pi_j,\mu_k,\mu_j\) and \(\Sigma\) (see the expression below).
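Concretely, with a common \(\Sigma\) the quadratic terms cancel in \(\delta_k(\text{x})=\delta_j(\text{x})\), leaving \[v=\Sigma^{-1}(\mu_k-\mu_j),\qquad c=\log\frac{\pi_k}{\pi_j}-\frac{1}{2}\left(\mu_k^t\Sigma^{-1}\mu_k-\mu_j^t\Sigma^{-1}\mu_j\right).\]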
Code
import matplotlib as mpl
from matplotlib import colors
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

# Functions for ellipses and boundary decision
def plot_ellipse(mean, cov, color, ax):
    v, w = np.linalg.eigh(cov)
    u = w[0] / np.linalg.norm(w[0])
    angle = np.arctan(u[1] / u[0])
    angle = 180 * angle / np.pi  # convert to degrees
    # filled Gaussian at 2 standard deviation
    ell = mpl.patches.Ellipse(
        mean,
        2 * v[0] ** 0.5,
        2 * v[1] ** 0.5,
        angle=180 + angle,
        facecolor=color,
        edgecolor="black",
        linewidth=2,
    )
    ell.set_clip_box(ax.bbox)
    ell.set_alpha(0.4)
    ax.add_artist(ell)

def plot_result(estimator, X, y, ax):
    cmap = colors.ListedColormap(["tab:red", "tab:blue"])
    DecisionBoundaryDisplay.from_estimator(
        estimator,
        X,
        response_method="predict_proba",
        plot_method="pcolormesh",
        ax=ax,
        cmap="RdBu",
        alpha=0.3,
    )
    DecisionBoundaryDisplay.from_estimator(
        estimator,
        X,
        response_method="predict_proba",
        plot_method="contour",
        ax=ax,
        alpha=1.0,
        levels=[0.5],
    )
    y_pred = estimator.predict(X)
    X_right, y_right = X[y == y_pred], y[y == y_pred]
    X_wrong, y_wrong = X[y != y_pred], y[y != y_pred]
    ax.scatter(X_right[:, 0], X_right[:, 1], c=y_right, s=20, cmap=cmap, alpha=0.5)
    ax.scatter(
        X_wrong[:, 0],
        X_wrong[:, 1],
        c=y_wrong,
        s=30,
        cmap=cmap,
        alpha=0.9,
        marker="x",
    )
    ax.scatter(
        estimator.means_[:, 0],
        estimator.means_[:, 1],
        c="yellow",
        s=200,
        marker="*",
        edgecolor="black",
    )
    if isinstance(estimator, LDA):
        covariance = [estimator.covariance_] * 2
    else:
        covariance = estimator.covariance_
    plot_ellipse(estimator.means_[0], covariance[0], "tab:red", ax)
    plot_ellipse(estimator.means_[1], covariance[1], "tab:blue", ax)

    ax.set_box_aspect(1)
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.set(xticks=[], yticks=[])

fig, axs = plt.subplots(nrows=1, ncols=2, sharex="row", sharey="row", figsize=(8, 12))

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA

lda = LDA(solver="svd", store_covariance=True)
qda = QDA(store_covariance=True)

for ax_row, X, y in zip(
    (axs,),
    (df_qda[["x1", "x2"]].to_numpy()[:600,:], ),
    (df_qda['y'].to_numpy()[:600], ),
):
    lda.fit(X, y)
    plot_result(lda, X, y, ax_row[0])
    qda.fit(X, y)
    plot_result(qda, X, y, ax_row[1])

axs[0].set_title("Boundary decision of LDA")
axs[1].set_title("Boundary decision of QDA")
plt.show()

Implementation of LDA

Implementation\(^{\text{📚}}\)

  • Given: Training data \({\cal D}_n=\{(\text{x}_1,y_1),\dots,(\text{x}_n,y_n)\}\) and the query input \(\text{x}\).
  • \(\hat{\Sigma}=\frac{1}{n-M}\sum_{k=1}^M\sum_{i:y_i=k}(\text{x}_i-\hat{\mu}_k)(\text{x}_i-\hat{\mu}_k)^t\) (estimate common covariance matrix)
  • for \(k\) in range(M) estimate:
    • \(\hat{\pi}_k=n_k/n\) where \(n_k=\text{card}(\{i:y_i=k\})\) (estimate prior \(\mathbb{P}(Y=k)\))
    • \(\hat{\mu}_k=\frac{1}{n_k}\sum_{i:y_i=k}\text{x}_i\) (estimate mean of class \(k\))
    • Compute \(\delta_k(\text{x})=\log(\hat{\pi}_k)-\frac{1}{2}(\text{x}-\hat{\mu}_k)^t\hat{\Sigma}^{-1}(\text{x}-\hat{\mu}_k)\)
  • return \(k\) with the largest value of \(\delta_k(\text{x})\).

Regularized Discriminant Analysis

Regularized DA (RDA) is about regularizing the covariance matrices\(^{\text{📚}}\)

  • LDA & QDA can be combined, for example, Friedman (1989) proposed: \[\hat{\Sigma}_k(\alpha)=\alpha\hat{\Sigma}+(1-\alpha)\hat{\Sigma}_k,\] for some \(\alpha\in[0,1]\).
  • The common covariance \(\hat{\Sigma}\) can also be shrunk toward the scalar covariance: \[\hat{\Sigma}(\lambda)=\lambda\hat{\Sigma}+(1-\lambda)\hat{\sigma}^2I_d,\] for some \(\lambda\in[0,1]\).
  • Define a family of covariance matrices \(\hat{\Sigma}_k(\alpha,\lambda)\) for \((\alpha,\lambda)\in[0,1]^2\) to be tuned.
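A short NumPy sketch of this family of regularized covariance matrices (following the two formulas above; here \(\hat{\sigma}^2\) is taken as the average diagonal entry of \(\hat{\Sigma}\), one common convention):

import numpy as np

def rda_covariance(Sigma_k, Sigma, alpha, lam):
    # Sigma_k(alpha, lambda) = (1-alpha)*Sigma_k + alpha*[lam*Sigma + (1-lam)*sigma^2*I]
    d = Sigma.shape[0]
    sigma2 = np.trace(Sigma) / d                       # scalar variance estimate
    shrunk = lam * Sigma + (1 - lam) * sigma2 * np.eye(d)
    return (1 - alpha) * Sigma_k + alpha * shrunk

# alpha = 1, lam = 1 recovers LDA's common Sigma; alpha = 0 recovers QDA's Sigma_k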

Regularization effect in action

Code
mu1, Sigma1 = [0, 0], np.array([[1, 0], [0, 3]])

mu2, Sigma2  = [5, 5], np.array([[3, -1], [-1, 3]])

mu3, Sigma3 = [-2, 6], np.array([[3, 1.5], [1.5, 1]])

mu4, Sigma4 = [6, 0], np.array([[3, 0.1], [0.1, 0.25]])


df_alpha = np.row_stack([np.random.multivariate_normal(mean=mu1, 
                                                          cov=Sigma1,
                                                          size=100),
                         np.random.multivariate_normal(mean=mu2, 
                                                          cov=Sigma2,
                                                          size=100),
                         np.random.multivariate_normal(mean=mu3, 
                                                          cov=Sigma3,
                                                          size=100),
                         np.random.multivariate_normal(mean=mu4, 
                                                          cov=Sigma4,
                                                          size=100)])

df0 = df_alpha.copy()

Sigma0 = np.array([[1.25, 0], [0, 1.25]])

df_final = np.row_stack([np.random.multivariate_normal(mean=mu1, 
                                                          cov=Sigma0,
                                                          size=100),
                         np.random.multivariate_normal(mean=mu2, 
                                                          cov=Sigma0,
                                                          size=100),
                         np.random.multivariate_normal(mean=mu3, 
                                                          cov=Sigma0,
                                                          size=100),
                         np.random.multivariate_normal(mean=mu4, 
                                                          cov=Sigma0,
                                                          size=100)])

alpha_list = np.linspace(0, 1, 15)
alphas = [0] * 400
classes = np.repeat([int(i) for i in range(1,5)], 100)

for alpha in alpha_list[1:]:
  temp = df0 + alpha * (df_final - df0)
  df_alpha = np.row_stack([df_alpha,  temp])
  alphas = np.concatenate((alphas, [alpha] * 400))
  classes = np.concatenate((classes, np.repeat([int(i) for i in range(1,5)], 100)))

df_alpha = np.column_stack([df_alpha, alphas, classes])
df_alpha = pd.DataFrame(df_alpha)
df_alpha.columns = ['x1', 'x2', 'alpha', 'class']
df_alpha['alpha'] = df_alpha['alpha'].round(3)
df_alpha['class'] = df_alpha['class'].astype(int)
df_alpha['class'] = df_alpha['class'].astype(str)

import plotly.graph_objects as go
import matplotlib.pyplot as plt
import plotly.express as px

df_alpha["dummy_size"] = 1
fig = px.scatter(df_alpha, x="x1", y="x2", animation_frame="alpha", color="class", hover_data='class', size="dummy_size", size_max = 10)
fig.update_layout(title = dict(text=r'Transition of covariance', 
                                y=1,
                                x=0.2,
                                font=dict(size = 30, 
                                          color = "#1C66B5")
                              ), 
    width=500, height = 520,
    transition = {'duration': 500})
fig.show()
  • Right covariances \(\approx\) right shapes!

  • \(\alpha\in[0,1]\) balances the trade-off between the covariance \(\Sigma_k\) (QDA) and the common variance \(\Sigma\) (LDA): \[\Sigma_k(\alpha)=\alpha\Sigma+(1-\alpha)\Sigma_k.\]

  • In practice, \(\alpha\) is tuned using cross-validation (covered later); see the sketch below.
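For instance, scikit-learn's QDA exposes a single regularization parameter, reg_param, which shrinks each \(\hat{\Sigma}_k\) toward the identity (not exactly Friedman's \((\alpha,\lambda)\) family, but the same idea); a sketch of tuning it by cross-validation on the spam training split:

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import GridSearchCV
import numpy as np

grid = GridSearchCV(QDA(), param_grid={"reg_param": np.linspace(0, 1, 11)}, cv=5)
grid.fit(X_train1, y_train1)
print(grid.best_params_, round(grid.best_score_, 3))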

Summary of LDA, QDA & RDA

  • Main assumption: Normality of inputs within each class: \[\mathbb{P}(X=x|Y=k)={\cal N}(\mu_k,\Sigma_k)\]
  • Assumptions on covariance matrices lead to different methods:
    • QDA: \(\hat{\Sigma}_k\) are computed differently.
    • LDA: \(\hat{\Sigma}_k=\hat{\Sigma}\) are all the same.
    • RDA: \(\hat{\Sigma}_k(\alpha,\lambda)=(1-\alpha)\hat{\Sigma}_k+\alpha[\lambda\hat{\Sigma}+(1-\lambda)\hat{\sigma}^2I_d]\).
  • No method is always the best; it depends on the data.

LDA, QDA & RDA on Spam dataset

  • Same \(20\%\) test data of total 4601 observations.
sns.set(style="white")
# Build LDA object & predict
lda = LDA(solver="svd", store_covariance=True)
lda1 = lda.fit(X_train1, y_train1)
pred1_lda = lda1.predict(X_test1)
conf1_lda = confusion_matrix(pred1_lda, y_test1)
con_fig1_lda = ConfusionMatrixDisplay(conf1_lda)

qda = QDA(store_covariance=True)
qda1 = qda.fit(X_train1, y_train1)
pred1_qda = qda1.predict(X_test1)
conf1_qda = confusion_matrix(pred1_qda, y_test1)
con_fig1_qda = ConfusionMatrixDisplay(conf1_qda)

  • LDA accuracy: 0.882.

  • QDA accuracy: 0.882.
  • Input: \(x=(\) make, address, capitalTotal \()\in\mathbb{R}^3\).
lda = LDA(solver="svd", store_covariance=True)
lda2 = lda.fit(X_train2, y_train2)
pred2_lda = lda2.predict(X_test2)
conf2_lda = confusion_matrix(pred2_lda, y_test2)
con_fig2_lda = ConfusionMatrixDisplay(conf2_lda)

qda = QDA(store_covariance=True)
qda2 = qda.fit(X_train2, y_train2)
pred2_qda = qda2.predict(X_test2)
conf2_qda = confusion_matrix(pred2_qda, y_test2)
con_fig2_qda = ConfusionMatrixDisplay(conf2_qda)

  • LDA accuracy: 0.626.

  • QDA accuracy: 0.626.





Coming in the TP 😁!

Pros & Cons

Pros

  • Effective for multiclass problems (compute & compare \(\delta_k(x),\forall k\).)
  • Computationally efficient (LDA).
  • Allows computing class probabilities.
  • Minimal hyperparameter tuning.
  • Stability with small sample sizes (LDA but not QDA).

Cons

  • Assumption of multivariate normality may not be suitable.
  • Sensitivity to outliers.
  • Less flexible with non-linear boundaries (especially for LDA).
  • Require enough data in each class for variance estimates (QDA & RDA)…

Quiz time 🤗

Showtime 🫣

🥳 It’s party time 🥂









📋 View party menu here: Party 1 Menu.

🫠 Download party invitation here: Party 1 Invitation Letter.