Naive Bayes Classifier

Advanced Machine Learning

Lecturer: Dr. HAS Sothea

🗺️ Content

Motivation and Introduction
Naive Bayes Classifier (NBC)
- NBC Model Setting and Assumption
- Application
- Handling Imbalanced Data

Motivation & Introduction

We humans can effortlessly recognize cats and dogs:

No sweat even with these kinds of patterns:

How can we make a computer program learn to do this?
Beyond cats & dogs, it’s much more useful to identify
- Spam emails
- Fraud in banking systems
- Diseases in health care systems …

Motivation & Introduction

Machine Learning (ML): a branch of AI focused on enabling computers/machines to imitate the way that humans learn.

Three main branches:
- Supervised Learning: for predicting some target (categorical or numerical).
- Unsupervised Learning: for grouping, understanding and reducing the dimension of the data (not for predicting).
- Reinforcement Learning: a model works on some task and learns to improve itself based on how it has performed so far.

I. Naive Bayes Classifier (NBC)

1. Setting and Main Assumption

Binary NBC

Consider the email spam dataset:

	make	address	all	our	over	remove	internet	order	mail	...	charSemicolon	charRoundbracket	charExclamation	charDollar	charHash	capitalAve	capitalLong	capitalTotal	type
0	0.00	0.64	0.64	0.32	0.00	0.00	0.00	0.00	0.00	...	0.00	0.000	0.778	0.000	0.000	3.756	61	278	spam
1	0.21	0.28	0.50	0.14	0.28	0.21	0.07	0.00	0.94	...	0.00	0.132	0.372	0.180	0.048	5.114	101	1028	spam
2	0.06	0.00	0.71	1.23	0.19	0.19	0.12	0.64	0.25	...	0.01	0.143	0.276	0.184	0.010	9.821	485	2259	spam

3 rows × 58 columns

Input matrix \(M\in\mathbb{R}^{n\times d}\): \[M=\begin{pmatrix} x_{1,1} & x_{1,2} & \dots & x_{1,d}\\ x_{2,1} & x_{2,2} & \dots & x_{2,d}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n,1} & x_{n,2} & \dots & x_{n,d} \end{pmatrix}\]
Target vector \(\text{y}\in\mathbb{R}^n\): \[\text{y}=\begin{pmatrix} y_{1}\\ y_{2}\\ \vdots\\ y_{n}\end{pmatrix}\]

Input \(\text{x}_i=(x_{i1},x_{i2},\dots, x_{id})\): Bag of words of email \(i\).

Target \(y_i\in\{1,0\}\) with \(1=\) spam and \(0=\) nonspam.
Objective: Classify whether an email is spam or not based on its input.

Data shape: (4601, 58).
Total missing values: 0.
Total duplicated rows: 391.
After removing duplicates:

Code

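import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
_, ax = plt.subplots(1,1,figsize=(4,2.75))
# `spam` is the email DataFrame shown above; drop its duplicated rows
spam.drop_duplicates(inplace=True)
# Draw on `ax` explicitly; bar heights are percentages of all emails
sns.countplot(data=spam, x="type", hue="type", stat="percent", ax=ax)
ax.set_title("Barplot of Email Type")
# One bar container per email type: label each bar with its percentage
for container in ax.containers:
    ax.bar_label(container, fmt="%.2f")
plt.show()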

1. Setting and Main Assumption

Model Motivation

Given an email (input) \(\text{x}_i=(0, 0.64,\dots,278)\in\mathbb{R}^{57}\), how would you guess its type \(y_i\)?
🔑 The most important quantity in (binary) classification: \[\color{blue}{\mathbb{P}(Y_i=1|X=\text{x}_i)}.\] Interpretation: Given input \(\text{x}_i\), how likely is it to belong to class 1 (spam)?
Decision rule: \[\text{ Email x}_i\text{ is spam }\Leftrightarrow\color{blue}{\mathbb{P}(Y_i=1|X=\text{x}_i)}\geq 0.5.\]
From now on, building a classifier amounts to estimating this probability: \[\color{blue}{\mathbb{P}(Y_i=1|X=\text{x}_i)}.\]

1. Setting and Main Assumption

Recall Bayes’s Theorem

Bayes’s Theorem

For any two events \(O,H\) with \(\mathbb{P}(O)\times\mathbb{P}(H)>0,\) \[\begin{equation}\overbrace{\mathbb{P}(H|O)}^{\text{Posterior}}=\frac{\overbrace{\mathbb{P}(O|H)}^{\text{Likelihood}}\times\overbrace{\mathbb{P}(H)}^{\text{Prior}}}{\underbrace{\mathbb{P}(O)}_{\text{Marginal}}}.\end{equation}\]

\(\mathbb{P}(H)\): Prior belief of having hypothesis \(H\).
\(\mathbb{P}(O|H)\): If \(H\) is true, how likely for \(O\) to be observed?
\(\mathbb{P}(H|O)\): If \(O\) is observed, how likely for \(H\) to be true?
\(\mathbb{P}(O)\): How likely for \(O\) to be observed in general?
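
As a quick numeric illustration of the theorem, the sketch below takes \(H=\) "the email is spam" and \(O=\) "the email contains a given word". All three input probabilities are made-up values chosen for illustration, not estimates from the spam data.

```python
# Hypothetical numbers chosen only for illustration.
prior_spam = 0.4            # P(H): prior probability that an email is spam
lik_word_given_spam = 0.5   # P(O|H): probability the word appears in a spam email
lik_word_given_ham = 0.1    # P(O|not H): probability the word appears in a nonspam email

# Marginal P(O) by the law of total probability.
marginal = lik_word_given_spam * prior_spam + lik_word_given_ham * (1 - prior_spam)

# Posterior P(H|O) by Bayes's theorem.
posterior_spam = lik_word_given_spam * prior_spam / marginal
print(round(posterior_spam, 4))  # prints 0.7692
```

Observing the word raises the belief in spam from the prior \(0.4\) to roughly \(0.77\), because the word is five times more likely under spam than under nonspam.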

1. Setting and Main Assumption

Model Setting

Bayes’s theorem implies: \[\begin{align*}\color{blue}{\mathbb{P}(Y_i=1|X=\text{x}_i)}&=\frac{\color{red}{\mathbb{P}(X=\text{x}_i|Y_i=1)}\times\color{green}{\mathbb{P}(Y_i=1)}}{\mathbb{P}(X=\text{x}_i)}\\ &\propto \color{red}{\mathbb{P}(X=\text{x}_i|Y_i=1)}\times\color{green}{\mathbb{P}(Y_i=1)}.\end{align*}\]
Interpretation:
- \(\color{green}{\mathbb{P}(Y_i=1)}\): The chance that email \(i\) is spam.
- \(\color{red}{\mathbb{P}(X=\text{x}_i|Y_i=1)}\): If email \(i\) is spam, how likely is it that its input is \(\text{x}_i\)?
\(\color{green}{\mathbb{P}(Y_i=1)}\) is easy to estimate 😊: \(\frac{\text{Number of spam emails}}{\text{Total number of emails}}\).
\(\color{red}{\mathbb{P}(X=\text{x}_i|Y_i=1)}\) is very complicated to estimate 🥹.

1. Setting and Main Assumption

Main assumption & key quantity in NBC

Main assumption of NBC

Within any class \(k\in\{1,0\}\), the components of input \(X|Y=k\) are independent, i.e., \[\color{red}{\mathbb{P}(X=\text{x}|Y=k)}=\prod_{j=1}^d\mathbb{P}(X_j=x_j|Y=k).\]

Key quantity in NBC

From above, the classification probability is computed by

\[\mathbb{P}(Y=1|X=\text{x})\propto \color{green}{\mathbb{P}(Y=1)}\color{red}{\prod_{j=1}^d\mathbb{P}(X_j=x_j|Y=1)}.\]

\(\color{red}{\mathbb{P}(X_j=x_j|Y=1)}\) is just a 1D distribution and can be estimated \(^{\small\text{📚}}\) as follows:

Type of \(X_j\)	Distribution	Graphic
Qualitative	Bernoulli, Multinomial…	`barplot`, `countplot`
Quantitative	Gaussian, Exponential…	`displot`, `hist`, `density`…

\(^{\text{📚}}\) Chapter 4, An Introduction to Statistical Learning with Applications in R, James et al. (2021).
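
To make the key quantity concrete, here is a minimal from-scratch sketch for quantitative inputs, assuming each \(X_j|Y=k\) is Gaussian. The function names (`fit_gaussian_nb`, `log_joint`, `predict`) and the toy data are hypothetical, and logarithms are used so that the product over \(j\) becomes a sum. It is a sketch of the idea, not the implementation used by any library.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = {
            "prior": Xk.shape[0] / X.shape[0],  # estimate of P(Y=k)
            "mean": Xk.mean(axis=0),            # per-feature mean within class k
            "var": Xk.var(axis=0) + 1e-9,       # per-feature variance, small floor for stability
        }
    return params

def log_joint(x, p):
    """log P(Y=k) + sum_j log N(x_j; mean_j, var_j): the log of the key quantity."""
    log_lik = -0.5 * (np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
    return np.log(p["prior"]) + log_lik.sum()

def predict(x, params):
    """Return the class with the largest unnormalized log posterior."""
    return max(params, key=lambda k: log_joint(x, params[k]))

# Tiny synthetic data set (hypothetical values, not the spam data).
X_toy = np.array([[0.1, 1.0], [0.2, 1.2], [0.0, 0.9],
                  [2.0, 5.0], [2.2, 5.5], [1.9, 4.8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

nb_params = fit_gaussian_nb(X_toy, y_toy)
pred_low = predict(np.array([0.1, 1.1]), nb_params)
pred_high = predict(np.array([2.1, 5.1]), nb_params)
print(pred_low, pred_high)
```

The normalizing constant \(\mathbb{P}(X=\text{x})\) is never computed: it is the same for every class, so comparing the unnormalized scores is enough to classify.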

Code

import plotly.figure_factory as ff
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(cols=2, rows=1, subplot_titles=("Exclamation", "CapitalTotal"))
temp1 = px.histogram(spam, x='charExclamation', color='type', nbins=10000)
temp1.data[0].showlegend=False
temp1.data[1].showlegend=False
temp2 = px.histogram(spam, x='capitalTotal', color='type', nbins=2)
fig.add_trace(temp1.data[0], row=1, col=1)
fig.add_trace(temp1.data[1], row=1, col=1)
fig.add_trace(temp2.data[0], row=1, col=2)
fig.add_trace(temp2.data[1], row=1, col=2)
fig.update_yaxes(type='log', row=1, col=1)
# fig.update_xaxes(type='log', row=1, col=2)
fig.update_layout(width=1000, height=230)
fig.show()

1. Setting and Main Assumption

\(M\)-class Naive Bayes Classifier

Suppose the target \(y\in\{1,2,...,M\}\).

Key quantity

For any \(d\)-dimensional input \(\text{x}\) and \(k\in\{1,2,...,M\}\):

\[\color{blue}{\mathbb{P}(Y=k|X=\text{x})}\propto\color{green}{\mathbb{P}(Y=k)}\color{red}{\prod_{j=1}^d\mathbb{P}(X_j=x_j|Y=k)}.\]

Classification Rule

\[\text{x}\text{ belongs to class }\color{blue}{k^*}\text{ if }\color{blue}{\mathbb{P}(Y=k^*|X=\text{x})}=\max_{1\leq k\leq M}\color{blue}{\mathbb{P}(Y=k|X=\text{x})}.\]
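
The classification rule can be sketched numerically as follows. The three log scores are hypothetical values of \(\log\mathbb{P}(Y=k)+\sum_j\log\mathbb{P}(X_j=x_j|Y=k)\) for \(M=3\) classes. With many features these scores are very negative, so the sketch subtracts the maximum before exponentiating (the log-sum-exp trick) to recover normalized posteriors without underflow.

```python
import numpy as np

# Hypothetical unnormalized log scores for M = 3 classes and one input x.
log_scores = np.array([-1050.0, -1048.0, -1055.0])

# Exponentiating directly underflows to 0 for every class,
# so shift by the maximum first; the shift cancels in the ratio.
shifted = log_scores - log_scores.max()
posterior = np.exp(shifted) / np.exp(shifted).sum()

# Classification rule: pick the class with the largest posterior.
# Classes are labeled 1..M in the text, hence the +1.
k_star = int(np.argmax(posterior)) + 1
print(k_star, posterior.round(3))
```

Here the rule selects class 2, the class with the largest score. The argmax is the same before and after normalization, so the normalization is only needed when the probabilities themselves are of interest.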

2. Application

Email type vs only 3 inputs

Take \(\text{x}=(\)make, address, capitalTotal\()\).
Test data: \(20\%\) of all observations.

Code

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
sns.set(style="white")
X_train2, X_test2, y_train2, y_test2 = train_test_split(spam[["make","address", "capitalTotal"]], spam.iloc[:,57], test_size = 0.2, random_state=42)
nb2 = GaussianNB()
_, ax = plt.subplots(1,1, figsize=(4,4))
nb2 = nb2.fit(X_train2, y_train2)
pred2 = nb2.predict(X_test2)
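# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
conf2 = confusion_matrix(y_test2, pred2, labels=nb2.classes_)
con_fig2 = ConfusionMatrixDisplay(conf2, display_labels=nb2.classes_)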
pr2 = nb2.predict_proba(X_test2)[:,1]
con_fig2.plot(ax=ax)
plt.show()

Accuracy: proportion of correctly classified emails.

\[\frac{463+62}{463+62+20+297}\approx 0.624.\]

Misclassification error: \[1-\text{accuracy}\approx0.376.\]
This depends on the split ⚠️
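
The accuracy and error above follow directly from the confusion-matrix counts. A minimal sketch of that arithmetic is given below. The helper name `accuracy_from_confusion` is hypothetical, and the placement of the individual counts inside the matrix is an assumption, since the text reports only the correct and incorrect totals.

```python
import numpy as np

def accuracy_from_confusion(conf):
    """Accuracy = (sum of diagonal counts) / (sum of all counts)."""
    conf = np.asarray(conf)
    return np.trace(conf) / conf.sum()

# Counts reported above for the 3-input model. Which cell holds which count
# is an assumption; accuracy only needs the diagonal total (463 + 62)
# and the off-diagonal total (20 + 297).
conf_3_inputs = [[463, 20],
                 [297, 62]]

acc = accuracy_from_confusion(conf_3_inputs)
print(f"accuracy = {acc:.3f}, error = {1 - acc:.3f}")
# prints: accuracy = 0.624, error = 0.376
```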

2. Application

Email type vs all inputs

Data: \(\text{x}\in\mathbb{R}^{57},y\in\{0,1\}\).
Test data: the same \(20\%\).

Code

X_train1, X_test1, y_train1, y_test1 = train_test_split(spam.iloc[:,:57], spam.iloc[:,57], test_size = 0.2, random_state=42)
_, ax = plt.subplots(1,1, figsize=(4,4))
nb1 = GaussianNB()
nb1 = nb1.fit(X_train1, y_train1)
pred1 = nb1.predict(X_test1)
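# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
conf1 = confusion_matrix(y_test1, pred1, labels=nb1.classes_)
con_fig1 = ConfusionMatrixDisplay(conf1, display_labels=nb1.classes_)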
pr1 = nb1.predict_proba(X_test1)[:,1]
con_fig1.plot(ax=ax)
plt.show()

Accuracy: proportion of correctly classified emails.

\[\frac{365+348}{365+348+11+118}\approx 0.847.\]

Misclassification error: \[1-\text{accuracy}\approx 0.153.\]
This depends on the split ⚠️
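
To see how much the accuracy can move with the split, one option is to repeat the 80/20 split with several seeds and compare the results. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption made so the example is self-contained); with the spam data one would pass the feature columns and `type` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption; replace with the spam features and labels).
X_syn, y_syn = make_classification(n_samples=500, n_features=10,
                                   n_informative=5, random_state=0)

accuracies = []
for seed in range(5):
    # Same 80/20 proportion as above, but a different random split each time.
    X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                              random_state=seed)
    model = GaussianNB().fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, model.predict(X_te)))

print([round(a, 3) for a in accuracies])
print("spread:", round(max(accuracies) - min(accuracies), 3))
```

Reporting the mean and spread over several splits gives a more honest picture than a single number from one split.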

2. Application

Email type vs Selected Inputs

Visualizing type against each input can help us filter useful input features.
When there are too many features to inspect by eye, we can apply a statistical method instead, for example ANOVA.

Code

from scipy.stats import f_oneway
# Initialize list to store selected features
selected_features = []
p_values = {}

# Get feature column names (excluding the target)
feature_columns = [col for col in spam.columns if col != 'type']

# For loop to test each feature
for feature in feature_columns:
    # Get unique classes in target variable
    classes = spam['type'].unique()

    # Create groups for ANOVA test
    groups = []
    for class_label in classes:
        group_data = spam[spam['type'] == class_label][feature]
        groups.append(group_data)
    
    # Perform one-way ANOVA
    f_stat, p_value = f_oneway(*groups)
    
    # Store p-value
    p_values[feature] = p_value
    
    # Select feature if p-value < 1e-15
    if p_value < 1e-15:
        selected_features.append(feature)
# Create DataFrame with only the selected features (the target column is not included)
if selected_features:
    filtered_data = spam[selected_features]

Number of selected features: 38.

Accuracy:

\[\frac{359+350}{359+350+9+124}\approx 0.842.\]

Misclassification error: \[1-\text{accuracy}\approx 0.158.\]
This depends on the split ⚠️

3. Pros & Cons of NBC

Pros

Efficiency & simplicity
Requires relatively little training data
Scalability: works well on large and high-dimensional data
Ability to handle categorical data
Ability to handle missing data
Sometimes it still works well even though the independence assumption is violated.

Cons

May perform poorly when features are highly correlated, since this violates the independence assumption.
May not work well when the relationship between inputs and target is complex.
Zero probability: if a category never appears with some class in the training data, its estimated conditional probability is zero, which forces the whole product for that class to zero.
Continuous features are often modeled with a Gaussian distribution, which might not always be appropriate.
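
A common remedy for the zero-probability problem, not covered in these slides, is additive (Laplace) smoothing: add a small pseudo-count \(\alpha\) to every category before normalizing. The sketch below is illustrative only; the helper `category_probs` and the toy feature values are hypothetical.

```python
from collections import Counter

def category_probs(values, categories, alpha=1.0):
    """Estimate P(X_j = c | Y = k) with additive (Laplace) smoothing."""
    counts = Counter(values)
    total = len(values) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

# Hypothetical categorical feature observed within one class.
# The category "medium" never appears in this class.
observed = ["low", "low", "high", "low"]
cats = ["low", "medium", "high"]

unsmoothed = category_probs(observed, cats, alpha=0.0)
smoothed = category_probs(observed, cats, alpha=1.0)
print(unsmoothed["medium"], round(smoothed["medium"], 4))
```

Without smoothing the unseen category gets probability \(0\); with \(\alpha=1\) it gets \(1/7\), so a single unseen category can no longer wipe out the whole product.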

II. Imbalanced Data ⚠️

1. Imbalanced data ⚠️

If the data contains \(95\%\) nonspam, always guessing nonspam gives \(0.95\) accuracy!
Accuracy isn’t the right metric for imbalanced data ⚠️

Confusion Matrix

\(\color{purple}{\text{Precision}}=\frac{\color{CornflowerBlue}{\text{TP}}}{\color{CornflowerBlue}{\text{TP}}+\color{purple}{\text{FP}}}\)
\(\color{Tan}{\text{Recall}}=\frac{\color{CornflowerBlue}{\text{TP}}}{\color{CornflowerBlue}{\text{TP}}+\color{Tan}{\text{FN}}}\)
\(\color{ForestGreen}{\text{F1-score}}=\frac{2\cdot\color{purple}{\text{Precision}}\cdot\color{Tan}{\text{Recall}}}{\color{purple}{\text{Precision}}+\color{Tan}{\text{Recall}}}\).
\(\color{ForestGreen}{\text{F1-score}}\) balances \(\color{purple}{\text{FP}}\) & \(\color{Tan}{\text{FN}}\).
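
The following sketch applies these formulas to the \(95\%\) nonspam scenario, with spam as the positive class. Both sets of counts are hypothetical, and reporting a ratio with a zero denominator as \(0\) is a convention chosen here, not something stated in the slides.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); a ratio with zero denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

# Hypothetical test set: 950 nonspam and 50 spam emails.
# Classifier A always predicts nonspam.
tp_a, fp_a, fn_a, tn_a = 0, 0, 50, 950
acc_a = (tp_a + tn_a) / (tp_a + fp_a + fn_a + tn_a)
scores_a = precision_recall_f1(tp_a, fp_a, fn_a)

# Classifier B (also hypothetical) catches 40 of the 50 spam emails with 20 false alarms.
tp_b, fp_b, fn_b, tn_b = 40, 20, 10, 930
acc_b = (tp_b + tn_b) / (tp_b + fp_b + fn_b + tn_b)
scores_b = precision_recall_f1(tp_b, fp_b, fn_b)

print(acc_a, scores_a)
print(acc_b, tuple(round(s, 3) for s in scores_b))
```

The two accuracies are close (\(0.95\) versus \(0.97\)), but the F1-scores are not: classifier A scores \(0\) because it never detects spam, while classifier B scores about \(0.727\).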

1. Imbalanced data ⚠️

Code

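import numpy as np
import plotly.graph_objects as go
# Grid of precision (x) and recall (y) values; the (0, 0) corner gives 0/0 = NaN for F1
x = np.linspace(0,1,20)
y = np.linspace(0,1,20)
# z1: harmonic mean of precision and recall (F1-score)
z1 = [[2*x[i]*y[j]/(x[i]+y[j]) for j in range(len(y))] for i in range(len(x))]
# z2: arithmetic mean of precision and recall, for comparison
z2 = [[(x[i]+y[j])/2 for j in range(len(y))] for i in range(len(x))]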

camera = dict(
    eye=dict(x=1.7, y=-1.2, z=1.2)
)

fig = go.Figure(go.Surface(x = x,
                           y = y,
                           z = z1,
                           name = "F1-score",
                           colorscale = "Blues",
                           showscale = False))
fig.add_trace(go.Surface(x = x,
                         y = y,
                         z = z2,
                        name = "Mean",
                        colorscale = "Electric",
                        showscale = False))
fig.update_layout(scene = dict(
                    xaxis_title='Precision',
                    yaxis_title='Recall',
                    zaxis_title='Scores'),
                  title = dict(text="F1-score vs Mean", 
                               y=0.9,
                               x=0.5,
                               font=dict(size = 30, 
                                         color = "#1C66B5")
                              ),
                  scene_camera=camera,
                  width = 500,
                  height = 500)
fig.show()

1. Imbalanced data ⚠️

Receiver Operating Characteristic Curve (ROC)

\(\bullet\) ROC \(=\{(\)FPR\(_{\delta}\),TPR\(_{\delta}):\delta\in[0,1]\}\).
\(\bullet\) Better model = Larger Area Under the Curve (AUC).
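
As a sketch of how the curve is built: for each threshold \(\delta\), predict class 1 when the predicted probability is at least \(\delta\), then record TPR \(=\) TP\(/(\)TP\(+\)FN\()\) and FPR \(=\) FP\(/(\)FP\(+\)TN\()\). The helper `roc_points` and the four toy scores are hypothetical. An extra threshold of \(+\infty\) is added so the curve starts at \((0,0)\), and the AUC is approximated with the trapezoidal rule.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return arrays (fpr, tpr) obtained by sweeping the threshold over the scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # From "nothing predicted positive" (inf) down to "everything predicted positive".
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [], []
    for delta in thresholds:
        pred = scores >= delta
        tpr.append((pred & (y_true == 1)).sum() / n_pos)
        fpr.append((pred & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

# Hypothetical labels and predicted probabilities of class 1.
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]

fpr_toy, tpr_toy = roc_points(y_toy, p_toy)
# Area under the curve by the trapezoidal rule.
auc_toy = float(np.sum(np.diff(fpr_toy) * (tpr_toy[1:] + tpr_toy[:-1]) / 2))
print(fpr_toy, tpr_toy, auc_toy)
```

For these four points the area is \(0.75\). A classifier that ranks every spam above every nonspam would reach \(1\), and random guessing gives about \(0.5\) (the diagonal).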

Code

from sklearn.metrics import roc_curve, RocCurveDisplay
from plotly.tools import mpl_to_plotly

# Compute the ROC curve with roc_curve and plot it with Matplotlib
y1 = 1*(y_test1 == "spam")
fpr, tpr, thresholds = roc_curve(y1.values, pr1.flatten())
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier') # Diagonal line
plt.plot(fpr, tpr, label='ROC Curve')

# Figures

fig_full = plt.gcf()
pl_1 = mpl_to_plotly(fig_full)
pl_1.update_layout(width=500, height=450, 
                      title=dict(text="ROC Curve of Full model", 
                                 font=dict(size=25)),
                      xaxis_title = dict(text='False Positive Rate', font=dict(size=20, color = "red")),
                      yaxis_title = dict(text='True Positive Rate (Recall)', font=dict(size=20, color = "#EBB31D")),
                      template='plotly_white')
pl_1.show()

1. Imbalanced data ⚠️

Code

# y_test2 has the same rows as y_test1 (same split seed), so y1 is reused here
fpr, tpr, thresholds = roc_curve(y1.values, pr2.flatten())
plt.plot(fpr, tpr, label='ROC Curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier') # Diagonal line

# Figures
fig_2 = plt.gcf()
pl_2 = mpl_to_plotly(fig_2)
pl_2.update_layout(width=500, height=450, 
                      title=dict(text="ROC Curve of 3-input model", 
                                 font=dict(size=25)),
                      xaxis_title = dict(text='False Positive Rate', font=dict(size=20, color = "red")),
                      yaxis_title = dict(text='True Positive Rate (Recall)', font=dict(size=20, color = "#EBB31D")),
                      template='plotly_white')
pl_2.show()

1. Imbalanced data ⚠️ (Summary)

Confusion matrix

Precision: controls FP.
Recall: controls FN.
F1-score: balances the two.

ROC Curve & AUC

ROC Curve: balances TPR and FPR.
Can be used to select \(\delta\in [0,1]\).
Better model = Larger AUC.

1. Imbalanced data ⚠️ (to explore)

Sampling methods:
- Oversampling: random, SMOTE, SMOTE SVM, ADASYN…
- Undersampling: random, NearMiss, CNN (Condensed Nearest Neighbour), Tomek Links…
Weight adjustment methods (nonparametric)
- Tree-based algorithms, \(k\)-NN, kernel methods…
Tuning threshold \(\delta\).
Work with packages that handle imbalanced data:
- imbalanced-learn, PyCaret…
Helpful links: Geeks for Geeks, Angelleon Collado, Machine Learning Mastery…
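
Tuning the threshold \(\delta\), listed above, can be sketched as a simple grid search: evaluate a metric for each candidate \(\delta\) and keep the best one. Choosing F1 as the metric is one option among several, and the helper `f1_at_threshold`, the labels and the probabilities are all hypothetical. In practice the search should be run on validation data rather than on the test set.

```python
import numpy as np

def f1_at_threshold(y_true, proba, delta):
    """F1-score of the rule 'predict class 1 when proba >= delta'."""
    pred = proba >= delta
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

# Hypothetical imbalanced sample: 8 nonspam (0) and 2 spam (1),
# with predicted probabilities of class 1.
y_val = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
p_val = np.array([0.02, 0.12, 0.17, 0.22, 0.27, 0.32, 0.37, 0.62, 0.42, 0.72])

grid = np.linspace(0.0, 1.0, 21)  # candidate thresholds 0, 0.05, ..., 1
f1_scores = np.array([f1_at_threshold(y_val, p_val, d) for d in grid])
best_delta = float(grid[np.argmax(f1_scores)])

print("F1 at 0.5:", f1_at_threshold(y_val, p_val, 0.5))
print("best delta:", round(best_delta, 2), "F1:", round(float(f1_scores.max()), 3))
```

On this toy sample the default \(\delta=0.5\) misses one of the two spam emails (F1 \(=0.5\)), while \(\delta=0.4\) catches both at the cost of one false alarm (F1 \(=0.8\)).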

🥳 It’s party time 🥂