Supervised Learning: for predicting some target (categorical or numerical).
Unsupervised Learning: for grouping, understanding and reducing the dimension of the data (not for predicting).
Reinforcement Learning: a model performs some task and learns to improve itself based on feedback about how it has performed so far.
NBC Model Setting and Main Assumption
Naive Bayes Classifiers (NBC)
Consider the Email spam dataset ✉️
|   | make | address | all | num3d | our | over | remove | internet | order | mail | … | charSemicolon | charRoundbracket | charSquarebracket | charExclamation | charDollar | charHash | capitalAve | capitalLong | capitalTotal | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.64 | 0.64 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | … | 0.00 | 0.000 | 0.0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | spam |
| 1 | 0.21 | 0.28 | 0.50 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | … | 0.00 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | spam |
| 2 | 0.06 | 0.00 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | … | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | spam |

3 rows × 58 columns
Input \(\text{x}_i=(x_{i1},x_{i2},\dots, x_{id})\): bag-of-words representation of email \(i\).
Target \(y_i\in\{1,0\}\), with \(1=\) spam and \(0=\) nonspam.
Objective: classify whether an email is spam or not based on its input.
Data shape: (4601, 58).
Total missing values: 0.
Total duplicated rows: 391.
After removing duplicates:
Code
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")
_, ax = plt.subplots(1, 1, figsize=(3, 2.15))
spam.drop_duplicates(inplace=True)
sns.countplot(data=spam, x="type", hue="type", stat="percent")
ax.set_title("Barplot of Email Type")
ax.bar_label(ax.containers[0], fmt="%.2f")
ax.bar_label(ax.containers[1], fmt="%.2f")
plt.show()
Naive Bayes Classifiers (NBC)
Model Motivation
Given an email (input) \(\text{x}_i=(0, 0.64,\dots,278)\in\mathbb{R}^{57}\), how would you guess its type \(y_i\)?
🔑 The most important quantity in (binary) classification: \[\color{blue}{\mathbb{P}(Y_i=1|X=\text{x}_i)}.\] Interpretation: Given input \(\text{x}_i\), how likely is it to belong to class 1 (spam)?
Decision rule: \[\text{Email }\text{x}_i\text{ is spam }\Leftrightarrow\color{blue}{\mathbb{P}(Y_i=1|X=\text{x}_i)}\geq 0.5.\]
From now on, building a classifier amounts to estimating this quantity: \[\color{blue}{\mathbb{P}(Y_i=1|X=\text{x}_i)}.\]
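For example, once a model produces estimated probabilities, applying this decision rule is a one-line threshold. A minimal sketch (the probabilities `p_hat` below are made-up values, not model output):

```python
import numpy as np

# Hypothetical estimated probabilities P(Y=1 | X=x_i) for a few emails
p_hat = np.array([0.91, 0.12, 0.50, 0.37])

# Decision rule: classify as spam when the estimated probability is >= 0.5
y_pred = np.where(p_hat >= 0.5, "spam", "nonspam")
print(y_pred)  # ['spam' 'nonspam' 'spam' 'nonspam']
```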
Naive Bayes Classifiers (NBC)
Recall Bayes’s Theorem
Bayes’s Theorem
For any two events \(O,H\) with \(\mathbb{P}(O)\times\mathbb{P}(H)>0,\)\[\begin{equation}\overbrace{\mathbb{P}(H|O)}^{\text{Posterior}}=\frac{\overbrace{\mathbb{P}(O|H)}^{\text{Likelihood}}\times\overbrace{\mathbb{P}(H)}^{\text{Prior}}}{\underbrace{\mathbb{P}(O)}_{\text{Marginal}}}.\end{equation}\]
\(\mathbb{P}(H)\): Prior belief of having hypothesis \(H\).
\(\mathbb{P}(O|H)\): If \(H\) is true, how likely for \(O\) to be observed?
\(\mathbb{P}(H|O)\): If \(O\) is observed, how likely for \(H\) to be true?
\(\mathbb{P}(O)\): How likely for \(O\) to be observed in general?
Applying Bayes's Theorem with \(H=\{Y_i=1\}\) and \(O=\{X=\text{x}_i\}\) gives \[\mathbb{P}(Y_i=1|X=\text{x}_i)=\frac{\color{red}{\mathbb{P}(X=\text{x}_i|Y_i=1)}\times\color{green}{\mathbb{P}(Y_i=1)}}{\mathbb{P}(X=\text{x}_i)}.\]
\(\color{green}{\mathbb{P}(Y_i=1)}\) is easy to estimate 😊: \(\frac{\text{Number of all spams}}{\text{Number of total emails}}\).
\(\color{red}{\mathbb{P}(X=\text{x}_i|Y_i=1)}\) is very complicated to estimate 🥹.
Main Assumption of Naive Bayes Classifier
Within any class \(k\in\{1,0\}\), the components of the input \(X|Y=k\) are independent, i.e., \[\color{red}{\mathbb{P}(X=\text{x}|Y=k)}=\prod_{j=1}^d\mathbb{P}(X_j=x_j|Y=k).\]
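Under this assumption, each one-dimensional term \(\mathbb{P}(X_j=x_j|Y=k)\) can be estimated separately, for instance with one Gaussian density per feature and class. Below is a hedged sketch using scikit-learn's GaussianNB on the spam data; the train/test split and variable names are illustrative assumptions, not taken from the slides.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Illustrative split of the spam data (assumes `spam` is the DataFrame above)
X = spam.drop(columns="type")
y = spam["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Gaussian NB: estimates P(X_j | Y=k) with one univariate normal per feature and class
nbc = GaussianNB()
nbc.fit(X_train, y_train)
print("Test accuracy:", nbc.score(X_test, y_test))
```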
Visualizing the type against the input features can help us filter out useful ones.
Since there are too many features, we can apply a statistical method, e.g. ANOVA, to select them.
Code
from scipy.stats import f_oneway

# Initialize list to store selected features
selected_features = []
p_values = {}

# Get feature column names (excluding the target)
feature_columns = [col for col in spam.columns if col != 'type']

# For loop to test each feature
for feature in feature_columns:
    # Get unique classes in target variable
    classes = spam['type'].unique()
    # Create groups for ANOVA test
    groups = []
    for class_label in classes:
        group_data = spam[spam['type'] == class_label][feature]
        groups.append(group_data)
    # Perform one-way ANOVA
    f_stat, p_value = f_oneway(*groups)
    # Store p-value
    p_values[feature] = p_value
    # Select feature if p-value < 1e-15
    if p_value < 1e-15:
        selected_features.append(feature)

# Create DataFrame with only selected features and target
if selected_features:
    filtered_data = spam[selected_features + ['type']]
import numpy as np
import plotly.graph_objects as go

x = np.linspace(0, 1, 20)
y = np.linspace(0, 1, 20)
# Harmonic mean (F1-score) and arithmetic mean surfaces over (Precision, Recall)
z1 = [[2*x[i]*y[j]/(x[i]+y[j]) for j in range(len(y))] for i in range(len(x))]
z2 = [[(x[i]+y[j])/2 for j in range(len(y))] for i in range(len(x))]
camera = dict(eye=dict(x=1.7, y=-1.2, z=1.2))
fig = go.Figure(go.Surface(x=x, y=y, z=z1, name="F1-score",
                           colorscale="Blues", showscale=False))
fig.add_trace(go.Surface(x=x, y=y, z=z2, name="Mean",
                         colorscale="Electric", showscale=False))
fig.update_layout(
    scene=dict(xaxis_title='Precision', yaxis_title='Recall', zaxis_title='Scores'),
    title=dict(text="F1-score vs Mean", y=0.9, x=0.5,
               font=dict(size=30, color="#1C66B5")),
    scene_camera=camera, width=560, height=500)
fig.show()
Imbalanced data ⚠️
Receiver Operating Characteristic Curve (ROC)
\(\bullet\) ROC \(=\{(\text{FPR}_{\delta},\text{TPR}_{\delta}):\delta\in[0,1]\}\).
\(\bullet\) Better model = larger Area Under the Curve (AUC).
Code
from plot_metric.functions import BinaryClassification
from plotly.tools import mpl_to_plotly

# Visualisation with plot_metric
y1 = 1*(y_test1 == "spam")
bc1 = BinaryClassification(y1, pr1, labels=["nonspam", "spam"], seaborn_style="whitegrid")
bc2 = BinaryClassification(y1, pr2, labels=["nonspam", "spam"], seaborn_style="whitegrid")

# Figures
a = bc1.plot_roc_curve()
fig_full = plt.gcf()
pl_full = mpl_to_plotly(fig_full)
pl_full.update_layout(
    width=500, height=450,
    title=dict(text="ROC Curve of Full model", font=dict(size=25)),
    xaxis_title=dict(font=dict(size=20, color="red")),
    yaxis_title=dict(text='True Positive Rate (Recall)', font=dict(size=20, color="#EBB31D")),
    template='plotly_white')
pl_full.show()
Imbalanced data ⚠️
Receiver Operating Characteristic Curve (ROC)
\(\bullet\) ROC \(=\{(\text{FPR}_{\delta},\text{TPR}_{\delta}):\delta\in[0,1]\}\).
\(\bullet\) Better model = larger Area Under the Curve (AUC).
Binary Logistic Regression
Objective: Given input \(\text{x}_i\in\mathbb{R}^d\), classify whether \(y_i\in\{0,1\}\) (male or female).
Main idea: classifying \(\Leftrightarrow\) identifying the decision boundary.
Main assumption: the boundary (B) is linear.
Model: Given input \(\text{x}_i\), the chance that it belongs to class \(1\) is given by \[\mathbb{P}(Y_i=1|X=\text{x}_i)=\sigma(\color{blue}{\beta_0}+\sum_{j=1}^d\color{blue}{\beta_j}x_{ij}),\] where \(\color{blue}{\beta_0,\beta_1,\dots,\beta_d}\in\mathbb{R}\) are the key parameters to be estimated from the data, and \(\sigma(t)=1/(1+e^{-t}),\forall t\in\mathbb{R}\).
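As a quick numerical illustration of this formula, here is a minimal sketch; the coefficients and the input below are made up, not estimated from any data.

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + exp(-t))
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical coefficients and one input x_i = (height, weight)
beta0, beta = 0.5, np.array([-0.04, 0.03])
x_i = np.array([170.0, 65.0])

p1 = sigmoid(beta0 + x_i @ beta)   # P(Y_i = 1 | X = x_i)
print(round(p1, 3))
```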
Binary Logistic Regression
Model intuition
Ex: Given \(\text{x}_0=[\text{h}_0,\text{w}_0]\in\mathbb{R}^2,\) for any candidate parameter \(\color{blue}{\vec{\beta}=[\beta_0,\beta_1,\beta_2]}\), \[\color{green}{z_0}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{h}_0+\color{blue}{\beta_2}\text{w}_0\text{ is the relative distance from }\text{x}_0\to\text{ Boundary (B)}.\]
That is to say:
\(\color{green}{z_0}>0\Leftrightarrow \text{x}_0\) is above the boundary.
\(|\color{green}{z_0}|\) is large \(\Leftrightarrow\) \(\text{x}_0\) is far from the boundary.
A good boundary should be such that:
\(|\color{green}{z_0}|\) large \(\Rightarrow\)“certain about its class”.
\(|\color{green}{z_0}|\) small \(\Rightarrow\)“less certain about its class”.
Binary Logistic Regression
Model intuition
A good boundary should be such that:
\(|\color{green}{z_0}|\) large \(\Rightarrow\)“certain about its class”.
\(|\color{green}{z_0}|\) small \(\Rightarrow\)“less certain about its class”.
Interpretation: \(\text{x}_1,\text{x}_2\) are located above the line \((B):1+x_1-2x_2=0\) since \(z_1,z_2>0\), and are predicted to be in class \(\color{blue}{1}\). On the other hand, \(\text{x}_3\) is located below the line (\(z_3<0\)) and is predicted to be in class \(\color{red}{0}\).
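A hedged sketch of this computation, using the boundary \((B):1+x_1-2x_2=0\) with made-up points (the actual \(\text{x}_1,\text{x}_2,\text{x}_3\) from the figure are not reproduced here):

```python
import numpy as np

# Boundary (B): 1 + x1 - 2*x2 = 0, i.e. beta = [1, 1, -2]
beta = np.array([1.0, 1.0, -2.0])

# Made-up points: the first two lie above the line, the third below
points = np.array([[2.0, 1.0],    # z = 1 + 2 - 2  =  1 > 0  -> class 1
                   [3.0, 1.5],    # z = 1 + 3 - 3  =  1 > 0  -> class 1
                   [0.0, 2.0]])   # z = 1 + 0 - 4  = -3 < 0  -> class 0

z = beta[0] + points @ beta[1:]   # relative (signed) distances to (B)
pred = (z > 0).astype(int)        # predicted class given by the sign of z
print(z, pred)                    # [ 1.  1. -3.] [1 1 0]
```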
Q4: Now, how do we find the best key parameters \(\color{blue}{\beta_0,\dots,\beta_d}\)?
We will build a criterion just like RSS in linear regression.
Objective: search for \(\color{blue}{\beta_0}\in\mathbb{R},\color{blue}{\vec{\beta}}\in\mathbb{R}^d\) such that the model is best aligned with the data \({\cal D}\): \[p(y_i|\text{x}_i)\text{ is large for all }i\in\{1,\dots,n\}.\]
Conditional Likelihood Function: If the data are iid, one has \[\begin{align*}{L}(\color{blue}{\beta_0},\color{blue}{\vec{\beta}})&=\mathbb{P}(Y_1=y_1,\dots,Y_n=y_n|X_1=\text{x}_1,\dots,X_n=\text{x}_n)\\
&=\prod_{i=1}^np(y_i|\text{x}_i)\\
&=\prod_{i=1}^n\Big[p(1|\text{x}_i)\Big]^{y_i}\Big[p(0|\text{x}_i)\Big]^{1-y_i}\\
&=\prod_{i=1}^n\Big[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\Big]^{y_i}\Big[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\Big]^{1-y_i}.
\end{align*}\]
def log_loss(beta1, beta2, X, y, beta0=0.1):
    beta_ = np.array([beta0, beta1, beta2])
    sig = sigmoid(X @ beta_)
    loss = np.mean(y*np.log(sig) + (1-y)*np.log(1-sig))
    return -loss

beta1 = np.linspace(-7, 2, 25)
beta2 = np.linspace(-2, 7, 30)
losses = np.array([[log_loss(b1, b2, X=x1, y=y_) for b1 in beta1] for b2 in beta2])

bright_blue1 = [[0, '#235b63'], [1, '#45e696']]
fig = go.Figure(go.Surface(
    x=beta1, y=beta2, z=losses, name="Loss function",
    colorscale=bright_blue1,
    hovertemplate='beta1: %{x}<br>' + 'beta2: %{y}<br>' + 'Loss: %{z}<extra></extra>'))
fig.update_layout(
    scene=dict(xaxis_title='beta1', yaxis_title='beta2', zaxis_title='Loss'),
    width=800, height=300,
    title="Loss function of the simulated dataset")
fig.show()
Binary Logistic Regression
Estimating coefficients
We search for coefficients \(\color{blue}{\vec{\beta}}=[\color{blue}{\beta_0,\dots,\beta_d}]\) minimizing \[\text{CEn}(\color{blue}{\vec{\beta}})=-\sum_{i=1}^n\Big[y_i\log[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]+(1-y_i)\log[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]\Big].\]
😭 Unfortunately, such minimizing values \((\color{blue}{\widehat{\beta}_0,\widehat{\vec{\beta}}})\) CANNOT be computed analytically.
😊 Fortunately, it can be numerically approximated!
We can use optimization algorithms such as Gradient Descent Algorithm to estimate the best \(\color{blue}{\hat{\beta}}\).
This is the next step!
Binary Logistic Regression
Summary
Logistic Regression Model
Main model: \(p(1|\text{x})=1/(1+e^{-\color{green}{z}})=1/(1+e^{-(\color{blue}{\beta_0}+\text{x}^T\color{blue}{\vec{\beta}})})\).
Interpretation:
Boundary decision is Linear defined by the coefficients \(\color{blue}{\beta_0}\) and \(\color{blue}{\vec{\beta}}\).
Probability of being in each class depends on the relative distance of that point to the boundary.
Works well when classes are linearly separable.
Objective: building a Logistic Regression model is equivalent to searching for parameters \(\color{blue}{\beta_0}\) and \(\color{blue}{\vec{\beta}}\) that minimize the Cross-entropy.
The loss cannot be minimized analytically but can be minimized numerically.
Optimization: GD & SGD
Binary Logistic Regression
Gradient
Let \(L:\mathbb{R}^d\to\mathbb{R}\), \(y=L(x_1,\dots,x_d)\), be a real-valued function of \(d\) real variables. For any point \(a\in\mathbb{R}^d\), the gradient of \(L\) at \(a\) is the \(d\)-dimensional vector defined by \[\nabla L(a)=\begin{bmatrix}\frac{\partial L}{\partial x_1}(a)\\ \vdots\\ \frac{\partial L}{\partial x_d}(a)\end{bmatrix}.\]
It is the direction (from the point \(a\)) along which the function \(L\) increases fastest.
Ex: Take, for instance, \(f(x,y)=0.1x^4+\tfrac{y^2}{2}\). One has \(\frac{\partial f}{\partial x}=0.4x^3\) and \(\frac{\partial f}{\partial y}=y\).
Therefore, \(\nabla f(1,-1)=(0.4,-1).\)
Meaning: from point \((1,-1)\), function \(f\) increases fastest along the direction \((0.4, -1)\).
Around 1847, Augustin-Louis Cauchy used this idea to propose an ingenious algorithm that makes the seemingly impossible possible!
Binary Logistic Regression
Gradient Descent Algorithm (GD)
If \(L\) is the loss function, then our goal is to find its lowest position.
Consider \(\theta_0\in\mathbb{R}^d\). If \(L\) is differentiable, we can approximate \(L\) by a linear form around \(\theta_0\): \[L(\color{green}{\theta_0+h})\approx L(\color{red}{\theta_0})+\nabla L(\theta_0)^Th.\]
If \(h=-\eta\nabla L(\theta_0)\) for some \(\eta>0\), we have: \(L(\color{green}{\theta_0-\eta\nabla L(\theta_0)})\approx L(\color{red}{\theta_0})-\eta\|\nabla L(\theta_0)\|^2\leq L(\theta_0).\)
Interpretation: The loss \(L(\color{green}{\theta_0-\eta\nabla L(\theta_0)})\) is likely to be lower than \(L(\color{red}{\theta_0})\); therefore, in searching for the lowest position of \(L\), we should move from \(\color{red}{\theta_0}\to\color{green}{\theta_0-\eta\nabla L(\theta_0)}\). This is the main idea of the GD algorithm.
Gradient Descent Algorithm:
Initialization:
initial point \(\theta_0\in\mathbb{R}^d\), number of iterations \(N\), and tolerance \(\delta>0\);
learning rate \(\eta>0\) (small, \(\approx 0.001\) or \(0.01\)).
Update: for t=1,2,...,N: \[\color{green}{\theta_{t}:=\theta_{t-1}-\eta\nabla L(\theta_{t-1})}.\]
if \(\|\nabla L(\theta_{t^*})\|<\delta\) at some step \(t^*\): return \(\color{green}{\theta_{t^*}}\).
else: return \(\color{green}{\theta_{N}}\).
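Below is a minimal sketch of GD on a toy loss; the loss and its gradient are illustrative assumptions, and for Logistic Regression one would plug in the cross-entropy and its gradient instead.

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.01, N=1000, delta=1e-6):
    """Plain GD: theta_t = theta_{t-1} - eta * grad(theta_{t-1})."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(N):
        g = grad(theta)
        if np.linalg.norm(g) < delta:   # stopping rule ||grad|| < delta
            break
        theta = theta - eta * g
    return theta

# Toy example: L(theta) = ||theta - c||^2 has gradient 2*(theta - c)
c = np.array([3.0, -1.0])
theta_hat = gradient_descent(lambda th: 2*(th - c), theta0=[0.0, 0.0], eta=0.1)
print(theta_hat)   # approximately [ 3. -1.]
```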
Binary Logistic Regression
Gradient Descent Algorithm (GD)
Binary Logistic Regression
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) Algorithm
Recall that for Logistic Regression, log loss is defined by \[\color{red}{{\cal L}(\beta_0,\vec{\beta})}=-\color{red}{\sum_{i=1}^n}\left[y_i\log(p(1|\text{x}_i))+(1-y_i)\log(p(0|\text{x}_i))\right]:=\color{red}{\sum_{i=1}^n}\ell(y_i,\hat{y}_i).\]
In GD, the gradient \(\nabla {\cal L}\) is computed using the whole training data, which is very costly for large datasets.
Using statistics, at each iteration \(t=1,\dots,N\), we can approximate \(\color{red}{\sum_{i=1}^n}\) in the loss by \(\color{green}{\sum_{i\in B_t}}\) using some random subset \(B_t\subset \{1,\dots,n\}\) (called minibatch), i.e., \(\color{red}{{\cal L}(\beta_0,\vec{\beta})}\approx\color{green}{\widehat{\cal L}(\beta_0,\vec{\beta})}=\color{green}{\sum_{i\in B_t}}\ell(y_i,\hat{y}_i)\).
Stochastic Gradient Descent Algorithm: (In Logistic Regression: \(\theta=(\beta_0,\vec{\beta})\))
Initialize: \(\eta, N, \theta_0, \delta>0\) (same as in GD), and minibatch size \(b=|B_t|,\forall t=1,2,...,N\).
Main loop: for \(t=1,2,\dots,N\): \[\color{green}{\theta_{t}}:=\color{red}{\theta_{t-1}}-\eta\nabla\color{green}{\widehat{\cal L}}(\color{red}{\theta_{t-1}}).\]
Ex: Consider the 1st simulated data 👉
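Below is a minimal minibatch-SGD sketch for the logistic log loss; the synthetic data and all variable names are illustrative assumptions, not the simulated dataset from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_logistic(X, y, eta=0.1, N=200, batch_size=32):
    """Minibatch SGD on the log loss; theta = (beta_0, beta_vec)."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend a column of 1s for beta_0
    theta = np.zeros(d + 1)
    for _ in range(N):
        batch = rng.choice(n, size=batch_size, replace=False)   # minibatch B_t
        p = sigmoid(Xb[batch] @ theta)
        grad = Xb[batch].T @ (p - y[batch])    # gradient of the minibatch log loss
        theta = theta - eta * grad / batch_size
    return theta

# Illustrative synthetic data with a roughly linear boundary
X = rng.normal(size=(500, 2))
y = (X[:, 0] - 2*X[:, 1] + 1 + 0.3*rng.normal(size=500) > 0).astype(float)
print(sgd_logistic(X, y))
```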
Binary Logistic Regression
Stochastic Gradient Descent (SGD) in Action
Binary Logistic Regression
Gradient of Log Loss
Optimizing Log Loss in Logistic Regression
Log loss for any parameter \(\beta_0,\vec{\beta}\in\mathbb{R}^d: {\cal L}(\beta_0,\vec{\beta})=-\sum_{i=1}^n[y_i\log(p(1|\text{x}_i))+(1-y_i)\log(p(0|\text{x}_i))]\).
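A standard computation (not spelled out here) gives the gradient of this loss, writing \(\hat{p}_i=p(1|\text{x}_i)=\sigma(\beta_0+\text{x}_i^T\vec{\beta})\): \[\frac{\partial {\cal L}}{\partial \beta_0}=\sum_{i=1}^n(\hat{p}_i-y_i),\qquad \nabla_{\vec{\beta}}{\cal L}=\sum_{i=1}^n(\hat{p}_i-y_i)\,\text{x}_i.\] This is the quantity fed to GD/SGD in the previous section.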
We introduced the basic concept of the Logistic Regression Model: \[p(1|X=\text{x})=\frac{1}{1+e^{-\color{blue}{\beta_0}-\text{x}^T\color{blue}{\vec{\beta}}}}.\]
The intuition of the model: the probability of being in class \(1\) depends on the relative distance from \(\text{x}\) to a linear boundary defined by \(\color{blue}{[\beta_0,\beta_1,\dots,\beta_d]}\).
The linear boundary assumption may be too restrictive in practice.
The performance of the model can be improved further by
Selecting relevant features
Feature engineering: polynomial transform…
Regularization or penalty methods…
Multinomial Logistic Regression
Multinomial Logistic Regression
Model
In this case, \(y\) takes \(M\) distinct values; it is often (one-hot) encoded as an \(M\)-dimensional vector: \[y=k\Leftrightarrow \text{p}=(0,\dots,0,\underbrace{1}_{k\text{th index}},0,\dots,0)^T.\]
The model also returns an \(M\)-dimensional probability vector, where the index with the highest probability is the predicted class.
\(\text{Softmax}:\mathbb{R}^M\to\mathbb{S}_{M-1}=\{\text{p}\in[0,1]^M:\sum_{j=1}^Mp_j=1\}\) defined by \[\vec{z}\mapsto \vec{s}=\text{Softmax}(\vec{z})=\Big(\frac{e^{z_1}}{\sum_{j=1}^Me^{z_j}},\dots,\frac{e^{z_M}}{\sum_{j=1}^Me^{z_j}}\Big).\]
It converts any \(M\)-dimensional vector \(\vec{z}\) to \(M\)-dimensional probability vector \(\vec{s}\).
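A minimal numerical sketch of this map (the vector \(\vec{z}\) below is made up):

```python
import numpy as np

def softmax(z):
    # Subtract max(z) for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0])
s = softmax(z)
print(s, s.sum())   # non-negative entries summing to 1
```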
Multinomial Logistic Regression
Model
\(\text{Softmax}:\mathbb{R}^M\to\mathbb{S}_{M-1}=\{\text{p}\in[0,1]^M:\sum_{j=1}^Mp_j=1\}\) defined by \[\vec{z}\mapsto \vec{s}=\text{Softmax}(\vec{z})=\Big(\frac{e^{z_1}}{\sum_{j=1}^Me^{z_j}},\dots,\frac{e^{z_M}}{\sum_{j=1}^Me^{z_j}}\Big).\]
Model: Given a parameter matrix \(\color{green}{W}\) of size \(M\times d\) and an intercept vector \(\color{green}{\vec{b}}\in\mathbb{R}^M\), the predicted probability vector of any input \(\text{x}_i\in\mathbb{R}^d\) is defined by \[\color{red}{\widehat{\text{p}}_i}=\text{Softmax}(\color{green}{W}\text{x}_i+\color{green}{\vec{b}}).\]
We search for the best possible parameters \(\color{green}{W}\in\mathbb{R}^{M\times d}\) and \(\color{green}{\vec{b}}\in\mathbb{R}^M\) by minimizing the Cross-Entropy loss defined below: \[\text{CEn}(W,\vec{b})=-\sum_{i=1}^n\sum_{m=1}^M\color{blue}{p_{im}}\log(\color{red}{\hat{p}_{im}}),\text{ where }\color{red}{\widehat{\text{p}}_i}=(\color{red}{\hat{p}_{i1}},\color{red}{\dots},\color{red}{\hat{p}_{iM}})\text{ is the predicted probability vector for the one-hot target }\color{blue}{\text{p}_i}.\]
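A small sketch of the predicted probability vector and the cross-entropy term for a single observation; the parameter values \(W\), \(\vec{b}\), and the input below are made-up assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

M, d = 3, 2                                   # 3 classes, 2 features
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.3],
              [-1.0,  0.8]])                  # shape (M, d)
b = np.array([0.1, 0.0, -0.1])                # shape (M,): one intercept per class
x_i = np.array([0.5, 1.5])
p_i = np.array([0.0, 1.0, 0.0])               # one-hot target: class at index 1

p_hat = softmax(W @ x_i + b)                  # predicted probability vector
cross_entropy_i = -np.sum(p_i * np.log(p_hat))
print(p_hat, cross_entropy_i)
```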
Multinomial Logistic Regression
Gradient of Cross-Entropy
Optimizing Cross-entropy in Multinomial Logistic Regression
Cross-Entropy for any parameters \(\color{green}{W}\in\mathbb{R}^{M\times d}\) and \(\color{green}{\vec{b}}\in\mathbb{R}^M\): \(\text{CEn}(\color{green}{W},\color{green}{\vec{b}})=-\sum_{i=1}^n\sum_{m=1}^M\color{blue}{p_{im}}\log(\color{red}{\widehat{p}_{im}})\).
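Its gradient (a standard computation, not shown on the slide) has the same compact form as in the binary case, with \(\color{red}{\widehat{\text{p}}_i}=\text{Softmax}(\color{green}{W}\text{x}_i+\color{green}{\vec{b}})\): \[\nabla_{W}\,\text{CEn}=\sum_{i=1}^n(\color{red}{\widehat{\text{p}}_i}-\color{blue}{\text{p}_i})\,\text{x}_i^T,\qquad \nabla_{\vec{b}}\,\text{CEn}=\sum_{i=1}^n(\color{red}{\widehat{\text{p}}_i}-\color{blue}{\text{p}_i}).\]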
Pros
Simplicity: It’s easy to understand and implement.
Efficiency: It’s not so computationally expensive, making it suitable for large datasets.
Interpretability: It converts relative distances from input data to boundary decision into probabilities, which can be meaningfully interpreted.
No Scaling Required: It doesn’t require input features to be scaled or standardized (though you can still transform them for more efficient computation).
Works Well with Linearly Separable Data: Performs well when the data is linearly separable.
Cons
Assumes Linearity: Assumes a linear relationship between the independent variables and the log-odds of the dependent variable.
Sensitivity to Outliers: Outliers can affect the model’s performance and accuracy.
Overfitting: Can overfit if the dataset has a large number of features compared to the number of observations.
Limited to Binary Outcomes: Standard logistic regression is designed for binary classification and may not perform as well with multiclass classification problems.
Assumes Independence of Errors: Assumes that the observations are independent of each other and that there is no multicollinearity among predictors.