Logistic Regression


Advanced Machine Learning

     

Lecturer: Dr. HAS Sothea

🗺️ Content

  • 1. Binary Logistic Regression
    • Model setting and Intuition
    • Conditional Likelihood and Loss function
  • 2. Optimization: GD and SGD
  • 3. Application
  • 4. Multinomial Logistic Regression
  • 5. Polynomial Features
  • 6. Regularization/Penalization
  • 7. Hyperparameter Tuning

1. Binary Logistic Regression

1. Binary Logistic Regression

1.1. Model setting

x1 x2 y
-0.752759 2.704286 1
1.935603 -0.838856 0
Code
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
fig = px.scatter(
    data_toy1, x="x1", y="x2", 
    color="y")

# Lines
line_coefs = np.array(
    [[1, 1, -2], [-1, 0.3, 0.5], [-0.5, 0.3, -1], [0.1, -1, 1]])
frames = []
x_line = np.array([np.min(x1[:,1]), np.max(x1[:,1])])
y_range = np.array([np.min(x1[:,2]), np.max(x1[:,2])])
id_small = np.argsort(np.abs(x1[:,1]) + np.abs(x1[:,2]))[5]
point_far = np.array([x_line[0], y_range[1]])
point_near = np.array([x1[id_small,1],x1[id_small,2]])

for i, coef in enumerate(line_coefs):
    y_line = (-coef[0] - coef[1] * x_line) / coef[2]
    a, b = -coef[1]/coef[2], -coef[0]/coef[2]
    point_proj_far = np.array([(point_far[0]+a*point_far[1]-a*b)/(a**2+1), a*(point_far[0]+a*point_far[1]-a*b)/(a**2+1)+b])
    point_proj_near = np.array([(point_near[0]+a*point_near[1]-a*b)/(a**2+1), a*(point_near[0]+a*point_near[1]-a*b)/(a**2+1)+b])
    p1 = np.row_stack([point_far, point_proj_far])
    p2 = np.row_stack([point_near, point_proj_near])
    frames.append(go.Frame(
        data=[fig.data[0],
              fig.data[1],
              go.Scatter(
                x=p1[:,0],
                y=p1[:,1],
                name="Far from boundary",
                line=dict(dash="dash"),
                visible="legendonly"
              ),
              go.Scatter(
                x=p2[:,0],
                y=p2[:,1],
                name="Close to boundary",
                line=dict(dash="dash"),
                visible="legendonly"
              ),
              go.Scatter(
                x=x_line, y=y_line, mode='lines',
                line=dict(width=3, color="black"), 
                name=f'Line: {i+1}')],
        name=f'{i+1}'))

y_line = (-line_coefs[0,0] - line_coefs[0,1] * x_line) / line_coefs[0,2]
a, b = -line_coefs[0,1]/line_coefs[0,2], -line_coefs[0,0]/line_coefs[0,2]  # slope, intercept (same convention as in the loop above)
point_proj_far = np.array([(point_far[0]+a*point_far[1]-a*b)/(a**2+1), a*(point_far[0]+a*point_far[1]-a*b)/(a**2+1)+b])
point_proj_near = np.array([(point_near[0]+a*point_near[1]-a*b)/(a**2+1), a*(point_near[0]+a*point_near[1]-a*b)/(a**2+1)+b])
p1 = np.row_stack([point_far, point_proj_far])
p2 = np.row_stack([point_near, point_proj_near])
fig1 = go.Figure(
    data=[
        fig.data[0],
        fig.data[1],
        go.Scatter(
            x=p1[:,0],
            y=p1[:,1],
            name="Far from boundary",
            line=dict(dash="dash"),
            visible="legendonly"
            ),
        go.Scatter(
            x=p2[:,0],
            y=p2[:,1],
            name="Close to boundary",
            line=dict(dash="dash"),
            visible="legendonly"
        ),
        go.Scatter(
            x=x_line, y=y_line, mode='lines',
            line=dict(width=3, color="black"), 
            name=f'Line: 1')
    ],
    layout=go.Layout(
        title="1st simulated data & boundary lines",
        xaxis=dict(title="x1", range=[-3.4, 3.4]),
        yaxis=dict(title="x2", range=[-3.4, 3.4]),
        updatemenus=[{
            "buttons": [
                {
                    "args": [None, {"frame": {"duration": 1000, "redraw": True}, "fromcurrent": True, "mode": "immediate"}],
                    "label": "Play",
                    "method": "animate"
                },
                {
                    "args": [[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate"}],
                    "label": "Stop",
                    "method": "animate"
                }
            ],
            "type": "buttons",
            "showactive": False,
            "x": -0.1,
            "y": 1.25,
            "pad": {"r": 10, "t": 50}
        }],
        sliders=[{
            "active": 0,
            "currentvalue": {"prefix": "Line: "},
            "pad": {"t": 50},
            "steps": [{"label": f"{i}",
                       "method": "animate",
                       "args": [[f'{i}'], {"frame": {"duration": 1000, "redraw": True}, "mode": "immediate", 
                       "transition": {"duration": 10}}]}
                      for i in range(1,5)]
        }]
    ),
    frames=frames
)

fig1.update_layout(height=350, width=500)

fig1.show()
  • Objective: Given input \(\text{x}_i\in\mathbb{R}^d\), classify whether \(y\in\{0,1\}\) (e.g., male or female).
  • Main idea: classify \(\Leftrightarrow\) identify the decision boundary.
  • Main assumption: the boundary (B) is linear.
  • Model: Given input \(\text{x}_i\), the chance that it belongs to class \(1\) is given by \[\mathbb{P}(Y_i=1|X=\text{x}_i)=\sigma(\color{blue}{\beta_0}+\sum_{j=1}^d\color{blue}{\beta_j}x_{ij}),\] where \(\color{blue}{\beta_0,\beta_1,\dots,\beta_d}\in\mathbb{R}\) are the key parameters to be estimated from the data, and \(\sigma(t)=1/(1+e^{-t}),\forall t\in\mathbb{R}\).

1. Binary Logistic Regression

1.2. Model intuition

  • Ex: Given \(\text{x}_0=[\text{h}_0,\text{w}_0]\in\mathbb{R}^2,\) for any candidate parameter \(\color{blue}{\vec{\beta}=[\beta_0,\beta_1,\beta_2]}\), \[\color{green}{z_0}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{h}_0+\color{blue}{\beta_2}\text{w}_0\text{ is the relative distance from }\text{x}_0\to\text{ Boundary (B)}.\]
  • That is to say:
    • \(\color{green}{z_0}>0\Leftrightarrow \text{x}_0\) is above the boundary.
    • \(|\color{green}{z_0}|\) is large \(\Leftrightarrow\) \(\text{x}_0\) is far from the boundary.
  • A good boundary should be such that:
    • \(|\color{green}{z_0}|\) large \(\Rightarrow\) “certain about its class”.
    • \(|\color{green}{z_0}|\) small \(\Rightarrow\) “less certain about its class”.

1. Binary Logistic Regression

1.2. Model intuition

  • A good boundary should be such that:
    • \(|\color{green}{z_0}|\) large \(\Rightarrow\) “certain about its class”.
    • \(|\color{green}{z_0}|\) small \(\Rightarrow\) “less certain about its class”.
  • Sigmoid function, \(\sigma:\mathbb{R}\to (0,1)\)

\[\color{green}{z}\mapsto\sigma(\color{green}{z})=\frac{1}{1+\exp(-\color{green}{z})}.\]

Key ideas

  • \(\color{green}{z_0}=\color{blue}{\beta_0}+\text{x}_0^T\color{blue}{\vec{\beta}}\) is the relative distance of \(\text{x}_0\) w.r.t (B).
  • Sigmoid converts this relative distance into probability.

1. Binary Logistic Regression

Example

  • For \(\color{blue}{(\beta_0,\beta_1,\beta_2)=(-1, -1, 2)}\), compute \(p(\color{blue}{1}|\text{x})\) for the data:
x1 x2 y
2.489186 -0.779048 0
-2.572868 -1.086146 1
2.767143 2.432104 0
  • Compute relative distance \(z_i=\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}}\), then \(p(1|\text{x}_i)\) \[\begin{align*}z_1&=-1-1(2.489186)+2(-0.779048)\\ &=-5.047282<0\\ \Rightarrow p(1|\text{x}_1)&=\sigma(z_1)=1/(1+e^{-(-5.047282)})=\color{red}{0.00641}.\\ \\ z_2&=-1-1(-2.572868)+2(-1.086146)\\ &=-0.599424<0\\ \Rightarrow p(1|\text{x}_2)&=\sigma(z_2)=1/(1+e^{-(-0.599424)})=\color{red}{0.354483}.\\ \\ z_3&=-1-1(2.767143)+2(2.432104)\\ &=1.097065 > 0\\ \Rightarrow p(1|\text{x}_3)&=\sigma(z_3)=1/(1+e^{-1.097065})=\color{blue}{0.749788}.\end{align*}\]

  • Interpretation: \(\text{x}_1,\text{x}_2\) are located below the line \((B):-1-x_1+2x_2=0\) as \(z_1,z_2<0\) and are predicted to be in class \(\color{red}{0}\). On the other hand, \(\text{x}_3\) is located above the line (\(z_3>0\)) and is predicted to be in class \(\color{blue}{1}\).

  • Question: Now, how do we find the best key parameter \(\color{blue}{\beta_0,\dots,\beta_d}\)?
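  • Before answering that, here is a quick numerical check of the example above (a minimal sketch; it assumes only numpy):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

beta = np.array([-1.0, -1.0, 2.0])              # (beta0, beta1, beta2)
X = np.array([[ 2.489186, -0.779048],
              [-2.572868, -1.086146],
              [ 2.767143,  2.432104]])
z = beta[0] + X @ beta[1:]                      # relative distances to the boundary
p = sigmoid(z)                                  # P(Y = 1 | x_i)
print(np.round(z, 6))                           # [-5.047282 -0.599424  1.097065]
print(np.round(p, 6))                           # compare with the probabilities above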

1. Binary Logistic Regression

1.3. Conditional likelihood \(\to\) Cross-entropy

  • Data: \({\cal D}=\{(\text{x}_1,y_1),...,(\text{x}_n,y_n)\}\subset\mathbb{R}^d\times \{0,1\}\).
  • Objective: search for \(\color{blue}{\beta_0}\in\mathbb{R},\color{blue}{\vec{\beta}}\in\mathbb{R}^d\) such that the model is best aligned with the data \({\cal D}\): \[p(y_i|\text{x}_i)\text{ is large for all }i\in\{1,\dots,n\}.\]
  • Conditional Likelihood Function: If the data are iid, one has \[\begin{align*}{L}(\color{blue}{\beta_0},\color{blue}{\vec{\beta}})&=\mathbb{P}(Y_1=y_1,\dots,Y_n=y_n|X_1=\text{x}_1,\dots,X_n=\text{x}_n)\\ &=\prod_{i=1}^np(y_i|\text{x}_i)\\ &=\prod_{i=1}^n\Big[p(1|\text{x}_i)\Big]^{y_i}\Big[p(0|\text{x}_i)\Big]^{1-y_i}\\ &=\prod_{i=1}^n\Big[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\Big]^{y_i}\Big[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\Big]^{1-y_i}. \end{align*}\]

  • Cross-entropy: \(\text{CEn}(\color{blue}{\vec{\beta}})=-\sum_{i=1}^n\Big[y_i\log[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]+(1-y_i)\log[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]\Big]\).

1. Binary Logistic Regression

1.3. Conditional likelihood \(\to\) Cross-entropy

  • Cross-entropy: \[\text{CEn}(\color{blue}{\vec{\beta}})=-\sum_{i=1}^n\Big[y_i\log[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]+(1-y_i)\log[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]\Big].\]
  • It’s also known as log-loss or logistic loss.
Code
def log_loss(beta1, beta2, X, y, beta0 = 0.1):
    beta_ = np.array([beta0, beta1, beta2])
    sig = sigmoid(X @ beta_)
    loss = np.mean(y*np.log(sig) + (1-y)*np.log(1-sig))
    return -loss

beta1 = np.linspace(-7,2,25)
beta2 = np.linspace(-2,7,30)
losses = np.array([[log_loss(b1, b2, X=x1, y=y_) for b1 in beta1] for b2 in beta2])
bright_blue1 = [[0, '#235b63'], [1, '#45e696']]
fig = go.Figure(go.Surface(
    x=beta1, y=beta2, z=losses, name="Loss function",
    colorscale=bright_blue1,
    hovertemplate='beta1: %{x}<br>' + 
                  'beta2: %{y}<br>' +
                  'Loss: %{z}<extra></extra>'))
fig.update_layout(
                scene = dict(
                xaxis_title= 'beta1',
                yaxis_title= 'beta2',
                zaxis_title='Loss'),
                width=1000, height=250, title="Loss function of the simulated dataset")
fig.show()

1. Binary Logistic Regression

1.3. Conditional likelihood \(\to\) Cross-entropy

  • We search for coefficients \(\color{blue}{\vec{\beta}}=[\color{blue}{\beta_0,\dots,\beta_d}]\) minimizing \[\text{CEn}(\color{blue}{\vec{\beta}})=-\sum_{i=1}^n\Big[y_i\log[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]+(1-y_i)\log[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]\Big].\]

  • 😭 Unfortunately, the minimizing values \((\color{blue}{\widehat{\beta}_0,\widehat{\vec{\beta}}})\) CANNOT be computed analytically.

  • 😊 Fortunately, it can be numerically approximated!

  • We can use optimization algorithms such as Gradient Descent Algorithm to estimate the best \(\color{blue}{\hat{\beta}}\).

  • This is the next step!

1. Binary Logistic Regression

Summary

Logistic Regression Model

  • Main model: \(p(1|\text{x})=1/(1+e^{-\color{green}{z}})=1/(1+e^{-(\color{blue}{\beta_0}+\text{x}^T\color{blue}{\vec{\beta}})})\).
    • Interpretation:
      • The decision boundary is linear, defined by the coefficients \(\color{blue}{\beta_0}\) and \(\color{blue}{\vec{\beta}}\).
      • Probability of being in each class depends on the relative distance of that point to the boundary.
      • Works well when classes are linearly separable.

  • Objective: building a Logistic Regression model is equivalent to searching for parameters \(\color{blue}{\beta_0}\) and \(\color{blue}{\vec{\beta}}\) that minimize the Cross-entropy.
  • The loss cannot be minimized analytically but can be minimized numerically.

2. Optimization: GD & SGD

2. Optimization algorithm

2.1. Gradient

  • Let \(L:\mathbb{R}^d\to\mathbb{R}, y=L(x_1,...,x_d)\) be a real-valued function of \(d\) real variables. For any point \(a\in\mathbb{R}^d\), the gradient of \(L\) at point \(a\) is a \(d\)-dimensional vector defined by \[\nabla L(a)=\begin{bmatrix}\frac{\partial L}{\partial x_1}(a)\\ \vdots\\ \frac{\partial L}{\partial x_d}(a)\end{bmatrix}.\]
  • It is the direction (from the point \(a\)) with the fastest increasing rate of the function \(L\).
  • Ex: \(f(x,y)=0.1x^4+0.5y^2\). Compute \(\nabla f(1,-1)\).
    • One has \(\frac{\partial f}{\partial x}=0.4x^3\) and \(\frac{\partial f}{\partial y}=y\).
    • Therefore, \(\nabla f(1,-1)=(0.4,-1).\)
    • Meaning: from point \((1,-1)\), function \(f\) increases fastest along the direction \((0.4, -1)\).
  • Around 1847, Augustin-Louis Cauchy used this idea to propose an ingenious algorithm that turns seemingly impossible problems into solvable ones!
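  • The gradient in the example above can also be checked numerically by finite differences (a minimal sketch, assuming only numpy):

import numpy as np

def f(x, y):
    return 0.1 * x**4 + 0.5 * y**2

a, eps = np.array([1.0, -1.0]), 1e-6
num_grad = np.array([
    (f(a[0] + eps, a[1]) - f(a[0] - eps, a[1])) / (2 * eps),   # d f / d x at a
    (f(a[0], a[1] + eps) - f(a[0], a[1] - eps)) / (2 * eps),   # d f / d y at a
])
print(np.round(num_grad, 4))                                   # approximately [ 0.4 -1. ]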



2. Optimization algorithm

2.2. Gradient Descent Algorithm (GD)

  • If \(L\) is the loss function, then our goal is to find its lowest position.
  • Consider \(\theta_0\in\mathbb{R}^d\). If \(L\) is differentiable, we can approximate \(L\) by a linear form around \(\theta_0\): \[L(\color{green}{\theta_0+h})\approx L(\color{red}{\theta_0})+\nabla L(\theta_0)^Th.\]
    • If \(h=-\eta\nabla L(\theta_0)\) for some \(\eta>0\), we have: \(L(\color{green}{\theta_0-\eta\nabla L(\theta_0)})\approx L(\color{red}{\theta_0})-\eta\|\nabla L(\theta_0)\|^2\leq L(\theta_0).\)
    • Interpretation: The loss \(L(\color{green}{\theta_0-\eta\nabla L(\theta_0)})\) is likely to be lower than \(L(\color{red}{\theta_0})\); therefore, in searching for the lowest position of \(L\), we should move from \(\color{red}{\theta_0}\to\color{green}{\theta_0-\eta\nabla L(\theta_0)}\). This is the main idea of the GD algorithm.
  • Gradient Descent Algorithm:
    • Initialization:
      • learning rate: \(\eta>0\) (small \(\approx 0.001\) or \(0.01\))
      • initial value: \(\theta_0\)
      • Maximum iteration: \(N\)
      • Stopping threshold: \(\delta>0\) (small \(< 10^{-5}\)).
    • Update: for t=1,2,...,N: \[\color{green}{\theta_{t}:=\theta_{t-1}-\eta\nabla L(\theta_{t-1})}.\]
      • if \(\|\nabla L(\theta_{t^*})\|<\delta\) at some step \(t^*\): return \(\color{green}{\theta_{t^*}}\).
      • else: return \(\color{green}{\theta_{N}}\).
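  • A minimal sketch of this algorithm on the example \(f(x,y)=0.1x^4+0.5y^2\) from above (assuming only numpy):

import numpy as np

def grad_f(theta):
    # gradient of f(x, y) = 0.1*x**4 + 0.5*y**2
    return np.array([0.4 * theta[0]**3, theta[1]])

theta = np.array([1.0, -1.0])        # initial value theta_0
eta, N, delta = 0.1, 1000, 1e-5      # learning rate, max iterations, threshold
for t in range(N):
    g = grad_f(theta)
    if np.linalg.norm(g) < delta:    # stopping rule
        break
    theta = theta - eta * g          # update: theta_t = theta_{t-1} - eta * grad
print(theta)                         # moves toward the minimizer (0, 0)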

2. Optimization algorithm

2.2. Gradient Descent Algorithm (GD)

2. Optimization algorithm

2.3. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) Algorithm

  • Recall that for Logistic Regression, log loss is defined by \[\color{red}{{\cal L}(\beta_0,\vec{\beta})}=-\color{red}{\sum_{i=1}^n}\left[y_i\log(p(1|\text{x}_i))+(1-y_i)\log(p(0|\text{x}_i))\right]:=\color{red}{\sum_{i=1}^n}\ell(y_i,\hat{y}_i).\]
  • In GD, the gradient \(\nabla {\cal L}\) is computed using the whole training data, which is very costly for large datasets.
  • Using statistics, at each iteration \(t=1,\dots,N\), we can approximate \(\color{red}{\sum_{i=1}^n}\) in the loss by \(\color{green}{\sum_{i\in B_t}}\) using some random subset \(B_t\subset \{1,\dots,n\}\) (called minibatch), i.e., \(\color{red}{{\cal L}(\beta_0,\vec{\beta})}\approx\color{green}{\widehat{\cal L}(\beta_0,\vec{\beta})}=\color{green}{\sum_{i\in B_t}}\ell(y_i,\hat{y}_i)\).
  • Stochastic Gradient Descent Algorithm: (In Logistic Regression: \(\theta=(\beta_0,\vec{\beta})\))
    • Initialize: \(\eta, N, \theta_0, \delta>0\) (same as in GD), and minibatch size \(b=|B_t|,\forall t=1,2,...,N\).
    • Main loop: for t=1,2,...,N: \[\color{green}{\theta_{t}}:=\color{red}{\theta_{t-1}}-\eta\nabla\color{green}{\widehat{\cal L}}(\color{red}{\theta_{t-1}}).\]
  • Ex: Consider the 1st simulated data 👉

2. Optimization algorithm

2.3. Stochastic Gradient Descent (SGD) in Action

2. Optimization algorithm

2.4. Gradient of Log Loss

Optimizing Log Loss in Logistic Regression

  • Log loss for any parameter \(\beta_0,\vec{\beta}\in\mathbb{R}^d: {\cal L}(\beta_0,\vec{\beta})=-\sum_{i=1}^n[y_i\log(p(1|\text{x}_i))+(1-y_i)\log(p(0|\text{x}_i))]\).
  • You should verify that:
    • \(\partial_{\beta_0} {\cal L}(\beta_0,\vec{\beta})=\sum_{i=1}^n\partial_{\beta_0} \ell(y_i,\hat{y}_i)=-\sum_{i=1}^n(y_i-p(1|\text{x}_i))\in\mathbb{R}\).
    • \(\nabla_{\vec{\beta}} {\cal L}(\beta_0,\vec{\beta})=\sum_{i=1}^n\nabla_{\vec{\beta}} \ell(y_i,\hat{y}_i)=-\sum_{i=1}^n\underbrace{(y_i-p(1|\text{x}_i))}_{\in\mathbb{R}}\underbrace{\text{x}_i}_{\in\mathbb{R}^d}\).
  • We define the sigmoid function \(\sigma\) and the gradient of the log loss above as follows:
def sigmoid(z):
    return 1/(1+np.exp(-z))

def gradLogLoss(beta, X, y):
    # prepend a column of ones so that beta[0] plays the role of the intercept
    X_ = np.column_stack([np.ones_like(y), np.array(X)])
    # gradient: -sum_i (y_i - sigma(beta_0 + x_i^T beta)) * [1, x_i]
    return -np.sum(np.diag(y - sigmoid(X_ @ beta)) @ X_, axis=0)
  • GD code is given in the next panel.

  • SGD can be implemented easily from this GD code.

d = 3 # input dimension of X
def grad_desc(grad, X, y, theta0 = [0]*(d+1), 
              max_Iter = 100, lr = 0.1, threshold = 1e-10):
    if theta0 is None:
        val = [np.random.uniform(-10, 10, size=d+1)]   # random initialization
    else:
        val = [np.array(theta0)]
    grads = [grad(val[-1], X, y)]
    for i in range(max_Iter):
        if np.sum(np.abs(grads[-1])) < threshold:      # stopping rule
            return np.row_stack(val), np.row_stack(grads)
        new = val[-1] - lr * grads[-1]                 # GD update
        val.append(new)
        gr = grad(new, X, y)
        grads.append(gr)
    return np.row_stack(val), np.row_stack(grads)
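A minimal sketch of such an SGD variant (assuming gradLogLoss and the data arrays X, y above); at each step the gradient is evaluated on a random minibatch instead of the full data:

def sgd(grad, X, y, theta0=[0]*(d+1), max_Iter=100, lr=0.1, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.array(X), np.array(y)
    val = [np.array(theta0, dtype=float)]
    for t in range(max_Iter):
        idx = rng.choice(len(y), size=batch_size, replace=False)  # minibatch B_t
        g = grad(val[-1], X[idx], y[idx])                         # gradient on B_t only
        val.append(val[-1] - lr * g)                              # theta_t = theta_{t-1} - lr * g
    return np.row_stack(val)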

2. Optimization algorithm

2.5. More about gradient-based optimizations

3. Application

3. Application

3.1. Application on Spam dataset

  • We build Binary Logistic Regression for spam filtering.
  • The same \(20\%\) test data is considered.
  • Feature engineering:
    • Standardize the features: \[x_j\to\tilde{x}_j=\frac{x_j-\mu_j}{\sigma_j}.\]
  • Standardization often helps the optimization to converge faster.
  • We obtain the following ROC Curve (TPR vs. FPR); a sketch for plotting it follows the code below.
Code
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Standardization
scaler = StandardScaler()
X_sc_train = scaler.fit_transform(X_train1)
X_sc_test = scaler.transform(X_test1)

# Fit model
logit = LogisticRegression()
logit = logit.fit(X_sc_train, y_train1)
logit_pred = logit.predict(X_sc_test)
logit_pred_pro = logit.predict_proba(X_sc_test)[:,1]
acc = np.mean(logit_pred.flatten() == y_test1.to_numpy().flatten())
sns.set(style="white")
_, ax = plt.subplots(1,1,figsize=(3,1.5))
conf = confusion_matrix(y_test1, logit_pred)   # (y_true, y_pred)
con_fig = ConfusionMatrixDisplay(conf)
con_fig.plot(ax=ax)
ax.set_title(f"Accuracy = {np.round(acc,3)}")
plt.show()
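A sketch of the ROC curve mentioned above, built from the logit_pred_pro and y_test1 computed in the code block (assumes scikit-learn and matplotlib as above):

from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(y_test1, logit_pred_pro)        # FPR and TPR over thresholds
roc_auc = auc(fpr, tpr)
_, ax = plt.subplots(1, 1, figsize=(3, 3))
ax.plot(fpr, tpr, label=f"Logistic (AUC = {roc_auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="gray")   # chance level
ax.set_xlabel("False Positive Rate (FPR)")
ax.set_ylabel("True Positive Rate (TPR)")
ax.legend()
plt.show()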

Summary

  • We introduce basic concept of Logistic Regression Model: \[p(1|X=\text{x})=\frac{1}{(1+e^{-\color{blue}{\beta_0}-\text{x}^T\color{blue}{\vec{\beta}}})}.\]
  • The intuition of the model: the probability of being in class \(1\) depends on the relative distance from \(\text{x}\) to a linear boundary defined by \(\color{blue}{[\beta_0,\beta_1,\dots,\beta_d]}\).
  • The linear boundary assumption may be too restrictive in practice.
  • The performance of the model can be improved further by
    • Selecting relevant features
    • Feature engineering: polynomial transform…
    • Regularization or penalty methods…

4. Multinomial Logistic Regression

4. Multinomial Logistic Regression

4.1. Model

  • In this case, \(y\) takes \(M\) distinct values and is often one-hot encoded as an \(M\)-dimensional vector: \[y=k\Leftrightarrow p=(0,\dots,0,\underbrace{1}_{k\text{th index}},0,\dots,0)^T.\]
  • The model returns an \(M\)-dimensional probability vector; the index with the highest probability is the predicted class.
  • \(\text{Softmax}:\mathbb{R}^M\to\mathbb{S}_{M-1}=\{\vec{p}\in[0,1]^M:\sum_{j=1}^Mp_j=1\}\) defined by \[\vec{z}\mapsto \vec{s}=\text{Softmax}(\vec{z})=\Big(\frac{e^{z_1}}{\sum_{j=1}^Me^{z_j}},\dots,\frac{e^{z_M}}{\sum_{j=1}^Me^{z_j}}\Big).\]
  • It converts any \(M\)-dimensional vector \(\vec{z}\) to \(M\)-dimensional probability vector \(\vec{s}\).
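  • A minimal numpy sketch of the Softmax function (the input vector is illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # subtracting the max improves numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
s = softmax(z)
print(np.round(s, 3))            # [0.659 0.242 0.099]
print(s.sum())                   # sums to 1 (up to floating point)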

4. Multinomial Logistic Regression

4.1. Model

  • \(\text{Softmax}:\mathbb{R}^M\to\mathbb{S}_{M-1}=\{\vec{p}\in[0,1]^M:\sum_{j=1}^Mp_j=1\}\) defined by \[\vec{z}\mapsto \vec{s}=\text{Softmax}(\vec{z})=\Big(\frac{e^{z_1}}{\sum_{j=1}^Me^{z_j}},\dots,\frac{e^{z_M}}{\sum_{j=1}^Me^{z_j}}\Big).\]
  • Model: Given a parameter matrix \(\color{green}{W}\) of size \(M\times d\) and an intercept vector \(\color{green}{\vec{b}}\in\mathbb{R}^M\), the predicted probability of any input data \(\text{x}_i\in\mathbb{R}^d\) is defined by

\[\begin{align*}\color{red}{\hat{p}_i}&=\text{Softmax}(\color{green}{W}\text{x}_i+\color{green}{\vec{b}})\\ &=\text{Softmax}\begin{pmatrix} \color{green}{\begin{bmatrix}w_{11}& \dots & w_{1d}\\ w_{21}& \dots & w_{2d}\\ \vdots & \vdots & \vdots\\ w_{M1}& \dots & w_{Md}\end{bmatrix}}\begin{bmatrix}x_{i1}\\ x_{i2}\\ \vdots\\ x_{id}\end{bmatrix}+\color{green}{\begin{bmatrix}b_{1}\\ b_{2}\\ \vdots\\ b_{M}\end{bmatrix}}\end{pmatrix}=\color{red}{\begin{bmatrix}\hat{p}_{i1}\\ \hat{p}_{i2}\\ \vdots\\ \hat{p}_{iM}\end{bmatrix}}\end{align*}\]

4. Multinomial Logistic Regression

4.1. Model

\[\begin{align*}\color{red}{\hat{p}_i}&=\text{Softmax}(\color{green}{W}\text{x}_i+\color{green}{\vec{b}})\\ &=\text{Softmax}\begin{pmatrix} \color{green}{\begin{bmatrix}w_{11}& \dots & w_{1d}\\ w_{21}& \dots & w_{2d}\\ \vdots & \vdots & \vdots\\ w_{M1}& \dots & w_{Md}\end{bmatrix}}\begin{bmatrix}x_{i1}\\ x_{i2}\\ \vdots\\ x_{id}\end{bmatrix}+\color{green}{\begin{bmatrix}b_{1}\\ b_{2}\\ \vdots\\ b_{M}\end{bmatrix}}\end{pmatrix}=\color{red}{\begin{bmatrix}\hat{p}_{i1}\\ \hat{p}_{i2}\\ \vdots\\ \hat{p}_{iM}\end{bmatrix}}\end{align*}\]

4. Multinomial Logistic Regression

4.2. Cross-entropy loss

  • We search for the best possible parameters \(\color{green}{W}\in\mathbb{R}^{M\times d}\) and \(\color{green}{\vec{b}}\in\mathbb{R}^M\) by minimizing the Cross-Entropy loss defined below: \[\text{CEn}(W,\vec{b})=-\sum_{i=1}^n\sum_{m=1}^M\color{blue}{p_{im}}\log(\color{red}{\hat{p}_{im}}),\text{ where }\color{red}{\widehat{\text{p}}_i}=(\color{red}{\hat{p}_{i1}},\color{red}{\dots},\color{red}{\hat{p}_{iM}})\text{ is the predicted probability vector for observation }i\text{ with one-hot target }\color{blue}{\text{p}_i}.\]
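A minimal numpy sketch of this loss for one-hot targets (the two illustrative rows below are hypothetical):

import numpy as np

def cross_entropy(P, P_hat, eps=1e-12):
    # P: one-hot targets (n x M); P_hat: predicted probabilities (n x M)
    return -np.sum(P * np.log(P_hat + eps))

P     = np.array([[1, 0, 0],
                  [0, 1, 0]])
P_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
print(cross_entropy(P, P_hat))   # -(log 0.7 + log 0.8) ≈ 0.580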

4. Multinomial Logistic Regression

4.3. Gradient of Cross-Entropy

Optimizing Cross-entropy in Multinomial Logistic Regression

  • Cross Entropy of any parameters \(\color{green}{W}\in\mathbb{R}^{M\times d}\) and \(\color{green}{\vec{b}}\in\mathbb{R}^M\): \(\text{CEn}(\color{green}{W},\color{green}{\vec{b}})=-\sum_{i=1}^n\sum_{m=1}^M\color{blue}{p_{im}}\log(\color{red}{\widehat{p}_{im}})\).
  • Challenge: Show that
    • \(\nabla_{\color{green}{\vec{b}}} \text{CEn}(\color{green}{W},\color{green}{\vec{b}})=\sum_{i=1}^n\nabla_{\color{green}{\vec{b}}} \ell(\color{blue}{\text{p}_i},\color{red}{\hat{\text{p}}_i})=-\sum_{i=1}^n\underbrace{(\color{blue}{\text{p}_i}-\text{Softmax}(\color{green}{W}\text{x}_i+\color{green}{\vec{b}}))}_{\in\mathbb{R}^M}\).
    • \(\nabla_{\color{green}{W}} \text{CEn}(\color{green}{W},\color{green}{\vec{b}})=\sum_{i=1}^n\nabla_{\color{green}{W}} \ell(\color{blue}{\text{p}_i},\color{red}{\widehat{\text{p}}_i})=-\sum_{i=1}^n\underbrace{(\color{blue}{\text{p}_i}-\color{red}{\widehat{\text{p}}_i})}_{\in\mathbb{R}^{M\times 1}}\underbrace{\text{x}_i^T}_{\in\mathbb{R}^{1\times d}}=\underbrace{(\color{red}{\widehat{P}}-\color{blue}{P})^TX}_{\in\mathbb{R}^{M\times d}}\), where

\[\begin{align*} \color{red}{\hat{P}}=\color{red}{\underbrace{\begin{bmatrix} \hat{p}_{11} &\dots & \hat{p}_{1M}\\ \vdots & \vdots & \vdots\\ \hat{p}_{n1} &\dots & \hat{p}_{nM} \end{bmatrix}}_{\text{Matrix of predicted probabilities}}}, \color{blue}{P}=\color{blue}{\underbrace{\begin{bmatrix} p_{11} &\dots & p_{1M}\\ \vdots & \vdots & \vdots\\ p_{n1} &\dots & p_{nM} \end{bmatrix}}_{\text{One-hot encoded matrix of $y_i$}}}\text{ and } X=\underbrace{\begin{bmatrix} x_{11} &\dots & x_{1d}\\ \vdots & \vdots & \vdots\\ x_{n1} &\dots & x_{nd} \end{bmatrix}}_{\text{Matrix of inputs $\text{x}_i$}} \end{align*}\]

Further reading: J. Chiang (2023) and K. Stratos (2018).
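A minimal sketch checking the shapes of these gradient formulas on random data (all names and sizes below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, d, M = 5, 3, 4                                           # samples, features, classes
X = rng.normal(size=(n, d))
W, b = rng.normal(size=(M, d)), rng.normal(size=M)

Z = X @ W.T + b                                             # scores, shape (n, M)
P_hat = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)    # row-wise Softmax
P = np.eye(M)[rng.integers(0, M, size=n)]                   # one-hot targets

grad_W = (P_hat - P).T @ X                                  # gradient w.r.t. W, shape (M, d)
grad_b = (P_hat - P).sum(axis=0)                            # gradient w.r.t. b, shape (M,)
print(grad_W.shape, grad_b.shape)                           # (4, 3) (4,)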

4. Multinomial Logistic Regression

4.5. Pros & Cons

Pros

  • Simplicity: It’s easy to understand and implement.
  • Efficiency: It’s not so computationally expensive, making it suitable for large datasets.
  • Interpretability: It converts relative distances from input data to boundary decision into probabilities, which can be meaningfully interpreted.
  • No Scaling Required: It doesn’t require input features to be scaled or standardized (though standardizing can still speed up optimization).
  • Works Well with Linearly Separable Data: Performs well when the data is linearly separable.

Cons

  • Assumes Linearity: Assumes a linear relationship between the independent variables and the log-odds of the dependent variable.
  • Sensitivity to Outliers: Outliers can affect the model’s performance and accuracy.
  • Overfitting: Can overfit if the dataset has a large number of features compared to the number of observations.
  • Limited to Binary Outcomes: Standard logistic regression is designed for binary classification and may not perform as well with multiclass classification problems.
  • Assumes Independence of Errors: Assumes that the observations are independent of each other and that there is no multicollinearity among predictors.

5. Polynomial features (Beyond linearity)

5. Polynomial Features

  • The linear boundary assumption may be too restrictive in real situations.
  • Polynomial features: \(X_1,\dots,X_d\to X_i^kX_j^{p-k}, k=0,1,\dots,p\).
  • Original data:
x1 x2
0 -0.752759 2.704286
1 1.391964 0.591951
2 -2.063888 -2.064033
  • Poly. features:
x1 x2 x1^2 x1*x2 x2^2
0 -0.752759 2.704286 0.566647 -2.035676 7.313162
1 1.391964 0.591951 1.937563 0.823974 0.350406
2 -2.063888 -2.064033 4.259634 4.259933 4.260232
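The degree-2 features above can be generated with scikit-learn's PolynomialFeatures; a minimal sketch on the same three rows:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({"x1": [-0.752759, 1.391964, -2.063888],
                  "x2": [ 2.704286, 0.591951, -2.064033]})
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(poly.fit_transform(X),
                      columns=poly.get_feature_names_out(X.columns))
print(X_poly.round(6))           # columns: x1, x2, x1^2, x1 x2, x2^2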

5. Polynomial Features

  • The linear boundary assumption may be too restrictive in real situations.
  • Polynomial features: \(X_1,\dots,X_d\to X_i^kX_j^{p-k}, k=0,1,\dots,p\).
  • Offer more flexible boundaries.
  • May be more suitable in complex problems.
  • Higher risk of overfitting!!!
  • Overfitting: happens when a model learns the training data too well, capturing noise and fluctuations rather than the underlying pattern.

6. Regularization/penalization

6. Regularization/penalization

Why Regularization?

  • In many cases, the linear assumption is not suitable.
  • However, when a model is too flexible (e.g., high-order polynomial features), it risks overfitting.
  • Building a good model means capturing the pattern of the data, NOT fitting every training point.
  • The effect of controlling parameters:

6. Regularization/penalization

6.1. \(L_2\) Penalization

  • \(L_2\) regularized loss function: for any \(\alpha>0\), \[{\cal L}_2(\beta_0,\vec{\beta})=\underbrace{\color{red}{{\cal L}(\beta_0,\vec{\beta})}}_{\text{loss function}}+\alpha\underbrace{\color{green}{\sum_{j=0}^d\beta_j^2}}_{\text{Control}}.\]
    • \(\alpha\) is called regularization strength.
    • Large \(\alpha\Leftrightarrow\) strong penalization \(\Leftrightarrow\) small \(\beta_0,\vec{\beta}\).
    • Small \(\alpha\Leftrightarrow\) light penalization \(\Leftrightarrow\) large \(\beta_0,\vec{\beta}\).
  • Parameter \(\alpha\) balances goodness of fit (loss \(\color{red}{{\cal L}}\)) and flexibility of the model (\(\color{green}{\sum_{j=0}^d\beta_j^2}\)).
  • 🔑 Key idea: Tune a suitable penalization strength \(\alpha\) that balances these two terms.
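  • A minimal sketch of this regularized loss, extending the log_loss function from Section 1 (sigmoid, X, and y are assumed to be defined as there):

def log_loss_l2(beta, X, y, alpha=1.0):
    sig = sigmoid(X @ beta)
    ce = -np.mean(y * np.log(sig) + (1 - y) * np.log(1 - sig))   # loss term
    return ce + alpha * np.sum(np.array(beta)**2)                # + alpha * control term

  • In scikit-learn's LogisticRegression, the strength is passed through C, the inverse of \(\alpha\) (the tuning code in Section 7 uses C=1/alpha).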

6. Regularization/penalization

6.2. \(L_1\) Penalization

  • \(L_1\) regularized loss function: for any \(\alpha>0\), \[{\cal L}_1(\beta_0,\vec{\beta})=\underbrace{\color{red}{{\cal L}(\beta_0,\vec{\beta})}}_{\text{loss function}}+\alpha\underbrace{\color{green}{\sum_{j=0}^d|\beta_j|}}_{\text{Control}}.\]
  • By its nature, \(L_1\) regularization performs variable selection.
  • It tends to force unimportant components/variables to be \(0\).
  • 🔑 Key idea: Tune a suitable penalization strength \(\alpha\).

  • Remark: Regularization is an important method that can be used with any parametric model, not only logistic regression.

7. Hyperparameter Tuning

7. Hyperparameter Tuning

7.1. Data splitting: Train/Validate/Test

  • A good model is one that generalizes well to new, unseen data.
  • The first attempt: splitting the data into 3 parts.
Set Common % Purpose
Train 60%–70% For training the model
Validation 15%–20% Tune hyperparameters
Test 15%–20% Evaluate final model performance
  • In this case, the best \(k\) is the one achieving the best performance on Validation set.
  • The final performance is measured using the Test set.
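  • A minimal sketch of such a split with scikit-learn (X, y denote the full dataset; the 60/20/20 proportions are illustrative):

from sklearn.model_selection import train_test_split

# 60% train, then split the remaining 40% equally into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)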

7. Hyperparameter Tuning

7.2. Data splitting: \(K\)-fold Cross-Validation

  • In the previous splitting scheme, the best \(k\) depends strongly on the split.
  • To reduce this dependency, a more stable scheme, called \(K\)-fold Cross-Validation, is used.

Pseudocode

  • For \(\color{blue}{k}\) in [1,2,3,...,N]:
    • For f in [1,...,K]:
      • Train the model with hyperparameter \(\color{blue}{k}\) on all data except fold f.
      • Predict and measure performance on fold f.
      • Save the performance as \(\color{red}{\epsilon_f}\).
    • Compute CV performance for \(\color{blue}{k}\): \[\text{CV}(\color{blue}{k})=\frac{1}{K}\sum_{f=1}^K\color{red}{\epsilon_f}.\]
  • Choose the best \(k\) with the best CV performance.
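  • A minimal sketch of this pseudocode (make_model(k) is a hypothetical factory returning an unfitted model for hyperparameter value k; X, y are numpy arrays):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def cv_score(make_model, k, X, y, K=10):
    kf = KFold(n_splits=K, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = make_model(k).fit(X[train_idx], y[train_idx])                  # train on the other folds
        scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))   # epsilon_f
    return np.mean(scores)                                                     # CV(k)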

7. Hyperparameter Tuning

7.3. Tuning polynomial degree

Tuning Optimal Polynomial Degree using \(10\)-fold Cross-Validation

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# List of degrees to search over
degrees = list(range(1,11))

# List for storing accuracy
acc = []
for deg in degrees:
    poly = PolynomialFeatures(degree=deg, include_bias=False)
    x_poly = poly.fit_transform(
        pd.DataFrame(x1[:,1], columns=['x1'])
    )
    x_poly = np.column_stack([x_poly,x1[:,2],x1[:,1]*x1[:,2]])

    # model
    model = LogisticRegression()
    score = cross_val_score(model, x_poly, y_, cv=10, 
                scoring='accuracy'
            ).mean()
    acc.append(score)
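The best degree is then the one with the highest CV accuracy, e.g.:

best_degree = degrees[int(np.argmax(acc))]   # degree with the highest CV accuracy
print(best_degree, max(acc))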

7. Hyperparameter Tuning

7.4. Tuning Penalty Strength \(\alpha\) in \(L_2\) Penalty

Tuning Penalty Strength \(\alpha\) using \(10\)-fold Cross-Validation

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
# Data
poly = PolynomialFeatures(degree=5, include_bias=False)
x_poly = poly.fit_transform(
        pd.DataFrame(x1[:,1], columns=['x1']))
x_poly = np.column_stack([x_poly,x1[:,2],x1[:,1]*x1[:,2]])
# Penalty strength
alphas = np.concatenate([
    np.linspace(1e-3,10, 50), 
    np.linspace(10.1, 1000, 50)]
)
coefficients = {f'{alp}': [] for alp in alphas}
acc = []
for alp in alphas:
    # model
    model = LogisticRegression(penalty="l2", C=1/alp)
    score = cross_val_score(model, x_poly, y_, cv=10, 
                scoring='balanced_accuracy').mean()
    acc.append(score)
    # Fitting
    model.fit(x_poly, y_)
    coefficients[f'{alp}'] = model.coef_[0]

7. Hyperparameter Tuning

7.4. Tuning Penalty Strength \(\alpha\) in \(L_2\) Penalty

Tuning Penalty Strength \(\alpha\) using \(10\)-fold Cross-Validation

7. Hyperparameter Tuning

7.5. Tuning Penalty Strength \(\alpha\) in \(L_1\) Penalty

Tuning Penalty Strength \(\alpha\) using \(10\)-fold Cross-Validation

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
# Data
poly = PolynomialFeatures(degree=5, include_bias=False)
x_poly = poly.fit_transform(
        pd.DataFrame(x1[:,1], columns=['x1']))
x_poly = np.column_stack([x_poly,x1[:,2],x1[:,1]*x1[:,2]])
# Penalty strength
alphas = np.concatenate([
    np.linspace(1e-3,10, 50), 
    np.linspace(10.1, 1000, 50)]
)
coefficients = {f'{alp}': [] for alp in alphas}
acc = []
for alp in alphas:
    # model
    model = LogisticRegression(penalty="l1", 
                               C=1/alp, solver='saga')
    score = cross_val_score(model, x_poly, y_, cv=10, 
                scoring='balanced_accuracy').mean()
    acc.append(score)
    # Fitting
    model.fit(x_poly, y_)
    coefficients[f'{alp}'] = model.coef_[0]

7. Hyperparameter Tuning

7.5. Tuning Penalty Strength \(\alpha\) in \(L_1\) Penalty

Tuning Penalty Strength \(\alpha\) using \(10\)-fold Cross-Validation

🎉 Conclusion

  • Logistic Regression is a powerful classification algorithm especially for binary outcomes.
  • It can be extended to handle non-linear decision boundaries using polynomial features.
  • Regularization techniques like \(L_1\) and \(L_2\) help prevent overfitting and improve model generalization.
  • Hyperparameter tuning, especially using cross-validation, is crucial for selecting the best model parameters.

🥳 It’s party time 🥂