Logistic Regression


Advanced Machine Learning

     

Lecturer: Dr. HAS Sothea

🗺️ Content

  • 1. Binary Logistic Regression
    • Model setting and Intuition
    • Conditional Likelihood and Loss function
  • 2. Optimization: GD and SGD
  • 3. Application
  • 4. Multinomial Logistic Regression
  • 5. Polynomial Features
  • 6. Regularization/Penalization
  • 7. Hyperparameter Tuning

1. Binary Logistic Regression

1. Binary Logistic Regression

1.1. Model setting

x1 x2 y
-0.752759 2.704286 1
1.935603 -0.838856 0
Code
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
fig = px.scatter(
    data_toy1, x="x1", y="x2", 
    color="y")

# Lines
line_coefs = np.array(
    [[1, 1, -2], [-1, 0.3, 0.5], [-0.5, 0.3, -1], [0.1, -1, 1]])
frames = []
x_line = np.array([np.min(x1[:,1]), np.max(x1[:,1])])
y_range = np.array([np.min(x1[:,2]), np.max(x1[:,2])])
id_small = np.argsort(np.abs(x1[:,1]) + np.abs(x1[:,2]))[5]
point_far = np.array([x_line[0], y_range[1]])
point_near = np.array([x1[id_small,1],x1[id_small,2]])

for i, coef in enumerate(line_coefs):
    y_line = (-coef[0] - coef[1] * x_line) / coef[2]
    a, b = -coef[1]/coef[2], -coef[0]/coef[2]
    point_proj_far = np.array([(point_far[0]+a*point_far[1]-a*b)/(a**2+1), a*(point_far[0]+a*point_far[1]-a*b)/(a**2+1)+b])
    point_proj_near = np.array([(point_near[0]+a*point_near[1]-a*b)/(a**2+1), a*(point_near[0]+a*point_near[1]-a*b)/(a**2+1)+b])
    p1 = np.row_stack([point_far, point_proj_far])
    p2 = np.row_stack([point_near, point_proj_near])
    frames.append(go.Frame(
        data=[fig.data[0],
              fig.data[1],
              go.Scatter(
                x=p1[:,0],
                y=p1[:,1],
                name="Far from boundary",
                line=dict(dash="dash"),
                visible="legendonly"
              ),
              go.Scatter(
                x=p2[:,0],
                y=p2[:,1],
                name="Close to boundary",
                line=dict(dash="dash"),
                visible="legendonly"
              ),
              go.Scatter(
                x=x_line, y=y_line, mode='lines',
                line=dict(width=3, color="black"), 
                name=f'Line: {i+1}')],
        name=f'{i+1}'))

y_line = (-line_coefs[0,0] - line_coefs[0,1] * x_line) / line_coefs[0,2]
a, b = -line_coefs[0,1]/line_coefs[0,2], -line_coefs[0,0]/line_coefs[0,2]  # slope, intercept (same convention as in the loop above)
point_proj_far = np.array([(point_far[0]+a*point_far[1]-a*b)/(a**2+1), a*(point_far[0]+a*point_far[1]-a*b)/(a**2+1)+b])
point_proj_near = np.array([(point_near[0]+a*point_near[1]-a*b)/(a**2+1), a*(point_near[0]+a*point_near[1]-a*b)/(a**2+1)+b])
p1 = np.row_stack([point_far, point_proj_far])
p2 = np.row_stack([point_near, point_proj_near])
fig1 = go.Figure(
    data=[
        fig.data[0],
        fig.data[1],
        go.Scatter(
            x=p1[:,0],
            y=p1[:,1],
            name="Far from boundary",
            line=dict(dash="dash"),
            visible="legendonly"
            ),
        go.Scatter(
            x=p2[:,0],
            y=p2[:,1],
            name="Close to boundary",
            line=dict(dash="dash"),
            visible="legendonly"
        ),
        go.Scatter(
            x=x_line, y=y_line, mode='lines',
            line=dict(width=3, color="black"), 
            name=f'Line: 1')
    ],
    layout=go.Layout(
        title="1st simulated data & boundary lines",
        xaxis=dict(title="x1", range=[-3.4, 3.4]),
        yaxis=dict(title="x2", range=[-3.4, 3.4]),
        updatemenus=[{
            "buttons": [
                {
                    "args": [None, {"frame": {"duration": 1000, "redraw": True}, "fromcurrent": True, "mode": "immediate"}],
                    "label": "Play",
                    "method": "animate"
                },
                {
                    "args": [[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate"}],
                    "label": "Stop",
                    "method": "animate"
                }
            ],
            "type": "buttons",
            "showactive": False,
            "x": -0.1,
            "y": 1.25,
            "pad": {"r": 10, "t": 50}
        }],
        sliders=[{
            "active": 0,
            "currentvalue": {"prefix": "Line: "},
            "pad": {"t": 50},
            "steps": [{"label": f"{i}",
                       "method": "animate",
                       "args": [[f'{i}'], {"frame": {"duration": 1000, "redraw": True}, "mode": "immediate", 
                       "transition": {"duration": 10}}]}
                      for i in range(1,5)]
        }]
    ),
    frames=frames
)

fig1.update_layout(height=350, width=500)

fig1.show()
  • Objective: Given input \(\text{x}_i\in\mathbb{R}^d\), classify whether \(y\in\{0,1\}\) (e.g., male or female).
  • Main idea: classify \(\Leftrightarrow\) identify the decision boundary.
  • Main assumption: the boundary (B) is linear.
  • Model: Given input \(\text{x}_i\), the chance that it belongs to class \(1\) is given by \[\mathbb{P}(Y_i=1|X=\text{x}_i)=\sigma(\color{blue}{\beta_0}+\sum_{j=1}^d\color{blue}{\beta_j}x_{ij}),\] where \(\color{blue}{\beta_0,\beta_1,\dots,\beta_d}\in\mathbb{R}\) are the key parameters to be estimated from the data, and \(\sigma(t)=1/(1+e^{-t}),\forall t\in\mathbb{R}\).

1. Binary Logistic Regression

1.2. Model intuition

  • Ex: Given \(\text{x}_0=[\text{h}_0,\text{w}_0]\in\mathbb{R}^2,\) for any candidate parameter \(\color{blue}{\vec{\beta}=[\beta_0,\beta_1,\beta_2]}\), \[\color{green}{z_0}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{h}_0+\color{blue}{\beta_2}\text{w}_0\text{ is the relative distance from }\text{x}_0\to\text{ Boundary (B)}.\]
  • That is to say:
    • \(\color{green}{z_0}>0\Leftrightarrow \text{x}_0\) is above the boundary.
    • \(|\color{green}{z_0}|\) is large \(\Leftrightarrow\) \(\text{x}_0\) is far from the boundary.
  • A good boundary should be such that:
    • \(|\color{green}{z_0}|\) large \(\Rightarrow\) “certain about its class”.
    • \(|\color{green}{z_0}|\) small \(\Rightarrow\) “less certain about its class”.

1. Binary Logistic Regression

1.2. Model intuition

  • A good boundary should be such that:
    • \(|\color{green}{z_0}|\) large \(\Rightarrow\) “certain about its class”.
    • \(|\color{green}{z_0}|\) small \(\Rightarrow\) “less certain about its class”.
  • Sigmoid function, \(\sigma:\mathbb{R}\to (0,1)\)

\[\color{green}{z}\mapsto\sigma(\color{green}{z})=\frac{1}{1+\exp(-\color{green}{z})}.\]

Key ideas

  • \(\color{green}{z_0}=\color{blue}{\beta_0}+\text{x}_0^T\color{blue}{\vec{\beta}}\) is the relative distance of \(\text{x}_0\) w.r.t (B).
  • Sigmoid converts this relative distance into probability.

1. Binary Logistic Regression

Example

  • For \(\color{blue}{(\beta_0,\beta_1,\beta_2)=(-1, -1, 2)}\), compute \(p(\color{blue}{1}|\text{x})\) for the data:
x1 x2 y
2.489186 -0.779048 0
-2.572868 -1.086146 1
2.767143 2.432104 0
  • Compute relative distance \(z_i=\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}}\), then \(p(1|\text{x}_i)\) \[\begin{align*}z_1&=-1-1(2.489186)+2(-0.779048)\\ &=-5.047282<0\\ \Rightarrow p(1|\text{x}_1)&=\sigma(z_1)=1/(1+e^{-(-5.047282)})=\color{red}{0.00641}.\\ \\ z_2&=-1-1(-2.572868)+2(-1.086146)\\ &=-0.599424<0\\ \Rightarrow p(1|\text{x}_2)&=\sigma(z_2)=1/(1+e^{-(-0.599424)})=\color{red}{0.354483}.\\ \\ z_3&=-1-1(2.767143)+2(2.432104)\\ &=1.097065 > 0\\ \Rightarrow p(1|\text{x}_3)&=\sigma(z_3)=1/(1+e^{-1.097065})=\color{blue}{0.749788}.\end{align*}\]

  • Interpretation: \(\text{x}_1,\text{x}_2\) are located below the line \((B):-1-x_1+2x_2=0\) as \(z_1,z_2<0\) and are predicted to be in class \(\color{red}{0}\). On the other hand, \(\text{x}_3\) is located above the line (\(z_3>0\)) and is predicted to be in class \(\color{blue}{1}\).

  • Question: Now, how do we find the best key parameter \(\color{blue}{\beta_0,\dots,\beta_d}\)?
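  • Before answering that, here is a quick numerical check of the example above (a minimal sketch; it assumes only numpy):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

beta = np.array([-1.0, -1.0, 2.0])              # (beta0, beta1, beta2)
X = np.array([[ 2.489186, -0.779048],
              [-2.572868, -1.086146],
              [ 2.767143,  2.432104]])
z = beta[0] + X @ beta[1:]                      # relative distances to the boundary
p = sigmoid(z)                                  # P(Y = 1 | x_i)
print(np.round(z, 6))                           # [-5.047282 -0.599424  1.097065]
print(np.round(p, 6))                           # compare with the probabilities above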

1. Binary Logistic Regression

1.3. Conditional likelihood \(\to\) Cross-entropy

  • Data: \({\cal D}=\{(\text{x}_1,y_1),...,(\text{x}_n,y_n)\}\subset\mathbb{R}^d\times \{0,1\}\).
  • Objective: search for \(\color{blue}{\beta_0}\in\mathbb{R},\color{blue}{\vec{\beta}}\in\mathbb{R}^d\) such that the model is best aligned with the data \({\cal D}\): \[p(y_i|\text{x}_i)\text{ is large for all }i\in\{1,\dots,n\}.\]
  • Conditional Likelihood Function: If the data are iid, one has \[\begin{align*}{L}(\color{blue}{\beta_0},\color{blue}{\vec{\beta}})&=\mathbb{P}(Y_1=y_1,\dots,Y_n=y_n|X_1=\text{x}_1,\dots,X_n=\text{x}_n)\\ &=\prod_{i=1}^np(y_i|\text{x}_i)\\ &=\prod_{i=1}^n\Big[p(1|\text{x}_i)\Big]^{y_i}\Big[p(0|\text{x}_i)\Big]^{1-y_i}\\ &=\prod_{i=1}^n\Big[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\Big]^{y_i}\Big[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\Big]^{1-y_i}. \end{align*}\]

  • Cross-entropy: \(\text{CEn}(\color{blue}{\vec{\beta}})=-\sum_{i=1}^n\Big[y_i\log[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]+(1-y_i)\log[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]\Big]\).

1. Binary Logistic Regression

1.3. Conditional likelihood \(\to\) Cross-entropy

  • Cross-entropy: \[\text{CEn}(\color{blue}{\vec{\beta}})=-\sum_{i=1}^n\Big[y_i\log[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]+(1-y_i)\log[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]\Big].\]
  • It’s also known as log-loss or logistic loss.
Code
def log_loss(beta1, beta2, X, y, beta0 = 0.1):
    beta_ = np.array([beta0, beta1, beta2])
    sig = sigmoid(X @ beta_)
    loss = np.mean(y*np.log(sig) + (1-y)*np.log(1-sig))
    return -loss

beta1 = np.linspace(-7,2,25)
beta2 = np.linspace(-2,7,30)
losses = np.array([[log_loss(b1, b2, X=x1, y=y_) for b1 in beta1] for b2 in beta2])
bright_blue1 = [[0, '#235b63'], [1, '#45e696']]
fig = go.Figure(go.Surface(
    x=beta1, y=beta2, z=losses, name="Loss function",
    colorscale=bright_blue1,
    hovertemplate='beta1: %{x}<br>' + 
                  'beta2: %{y}<br>' +
                  'Loss: %{z}<extra></extra>'))
fig.update_layout(
                scene = dict(
                xaxis_title= 'beta1',
                yaxis_title= 'beta2',
                zaxis_title='Loss'),
                width=1000, height=250, title="Loss function of the simulated dataset")
fig.show()

1. Binary Logistic Regression

1.3. Conditional likelihood \(\to\) Cross-entropy

  • We search for coefficients \(\color{blue}{\vec{\beta}}=[\color{blue}{\beta_0,\dots,\beta_d}]\) minimizing \[\text{CEn}(\color{blue}{\vec{\beta}})=-\sum_{i=1}^n\Big[y_i\log[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]+(1-y_i)\log[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})]\Big].\]

  • 😭 Unfortunately, the minimizing values \((\color{blue}{\widehat{\beta}_0,\widehat{\vec{\beta}}})\) CANNOT be computed analytically.

  • 😊 Fortunately, it can be numerically approximated!

  • We can use optimization algorithms such as Gradient Descent Algorithm to estimate the best \(\color{blue}{\hat{\beta}}\).

  • This is the next step!

1. Binary Logistic Regression

Summary

Logistic Regression Model

  • Main model: \(p(1|\text{x})=1/(1+e^{-\color{green}{z}})=1/(1+e^{-(\color{blue}{\beta_0}+\text{x}^T\color{blue}{\vec{\beta}})})\).
    • Interpretation:
      • The decision boundary is linear, defined by the coefficients \(\color{blue}{\beta_0}\) and \(\color{blue}{\vec{\beta}}\).
      • Probability of being in each class depends on the relative distance of that point to the boundary.
      • Works well when classes are linearly separable.

  • Objective: building a Logistic Regression model is equivalent to searching for parameters \(\color{blue}{\beta_0}\) and \(\color{blue}{\vec{\beta}}\) that minimize the Cross-entropy.
  • The loss cannot be minimized analytically but can be minimized numerically.

2. Optimization: GD & SGD

2. Optimization algorithm

2.1. Gradient

  • Let \(L:\mathbb{R}^d\to\mathbb{R}, y=L(x_1,...,x_d)\) be a real-valued function of \(d\) real variables. For any point \(a\in\mathbb{R}^d\), the gradient of \(L\) at point \(a\) is a \(d\)-dimensional vector defined by \[\nabla L(a)=\begin{bmatrix}\frac{\partial L}{\partial x_1}(a)\\ \vdots\\ \frac{\partial L}{\partial x_d}(a)\end{bmatrix}.\]
  • It is the direction (from the point \(a\)) with the fastest increasing rate of the function \(L\).
  • Ex: \(f(x,y)=0.1x^4+0.5y^2\). Compute \(\nabla f(1,-1)\).
    • One has \(\frac{\partial f}{\partial x}=0.4x^3\) and \(\frac{\partial f}{\partial y}=y\).
    • Therefore, \(\nabla f(1,-1)=(0.4,-1).\)
    • Meaning: from point \((1,-1)\), function \(f\) increases fastest along the direction \((0.4, -1)\).
  • Around 1847, Augustin-Louis Cauchy used this idea to propose an ingenious algorithm that turns seemingly impossible problems into solvable ones!
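  • The gradient in the example above can also be checked numerically by finite differences (a minimal sketch, assuming only numpy):

import numpy as np

def f(x, y):
    return 0.1 * x**4 + 0.5 * y**2

a, eps = np.array([1.0, -1.0]), 1e-6
num_grad = np.array([
    (f(a[0] + eps, a[1]) - f(a[0] - eps, a[1])) / (2 * eps),   # d f / d x at a
    (f(a[0], a[1] + eps) - f(a[0], a[1] - eps)) / (2 * eps),   # d f / d y at a
])
print(np.round(num_grad, 4))                                   # approximately [ 0.4 -1. ]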



2. Optimization algorithm

2.2. Gradient Descent Algorithm (GD)

  • If \(L\) is the loss function, then our goal is to find its lowest position.
  • Consider \(\theta_0\in\mathbb{R}^d\). If \(L\) is differentiable, we can approximate \(L\) by a linear form around \(\theta_0\): \[L(\color{green}{\theta_0+h})\approx L(\color{red}{\theta_0})+\nabla L(\theta_0)^Th.\]
    • If \(h=-\eta\nabla L(\theta_0)\) for some \(\eta>0\), we have: \(L(\color{green}{\theta_0-\eta\nabla L(\theta_0)})\approx L(\color{red}{\theta_0})-\eta\|\nabla L(\theta_0)\|^2\leq L(\theta_0).\)
    • Interpretation: The loss \(L(\color{green}{\theta_0-\eta\nabla L(\theta_0)})\) is likely to be lower than \(L(\color{red}{\theta_0})\); therefore, in searching for the lowest position of \(L\), we should move from \(\color{red}{\theta_0}\to\color{green}{\theta_0-\eta\nabla L(\theta_0)}\). This is the main idea of the GD algorithm.
  • Gradient Descent Algorithm:
    • Initialization:
      • learning rate: \(\eta>0\) (small \(\approx 0.001\) or \(0.01\))
      • initial value: \(\theta_0\)
      • Maximum iteration: \(N\)
      • Stopping threshold: \(\delta>0\) (small \(< 10^{-5}\)).
    • Update: for t=1,2,...,N: \[\color{green}{\theta_{t}:=\theta_{t-1}-\eta\nabla L(\theta_{t-1})}.\]
      • if \(\|\nabla L(\theta_{t^*})\|<\delta\) at some step \(t^*\): return \(\color{green}{\theta_{t^*}}\).
      • else: return \(\color{green}{\theta_{N}}\).
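  • A minimal sketch of this algorithm on the example \(f(x,y)=0.1x^4+0.5y^2\) from above (assuming only numpy):

import numpy as np

def grad_f(theta):
    # gradient of f(x, y) = 0.1*x**4 + 0.5*y**2
    return np.array([0.4 * theta[0]**3, theta[1]])

theta = np.array([1.0, -1.0])        # initial value theta_0
eta, N, delta = 0.1, 1000, 1e-5      # learning rate, max iterations, threshold
for t in range(N):
    g = grad_f(theta)
    if np.linalg.norm(g) < delta:    # stopping rule
        break
    theta = theta - eta * g          # update: theta_t = theta_{t-1} - eta * grad
print(theta)                         # moves toward the minimizer (0, 0)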

2. Optimization algorithm

2.2. Gradient Descent Algorithm (GD)

2. Optimization algorithm

2.3. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) Algorithm

  • Recall that for Logistic Regression, log loss is defined by \[\color{red}{{\cal L}(\beta_0,\vec{\beta})}=-\color{red}{\sum_{i=1}^n}\left[y_i\log(p(1|\text{x}_i))+(1-y_i)\log(p(0|\text{x}_i))\right]:=\color{red}{\sum_{i=1}^n}\ell(y_i,\hat{y}_i).\]
  • In GD, the gradient \(\nabla {\cal L}\) is computed using the whole training data, which is very costly for large datasets.
  • Using statistics, at each iteration \(t=1,\dots,N\), we can approximate \(\color{red}{\sum_{i=1}^n}\) in the loss by \(\color{green}{\sum_{i\in B_t}}\) using some random subset \(B_t\subset \{1,\dots,n\}\) (called minibatch), i.e., \(\color{red}{{\cal L}(\beta_0,\vec{\beta})}\approx\color{green}{\widehat{\cal L}(\beta_0,\vec{\beta})}=\color{green}{\sum_{i\in B_t}}\ell(y_i,\hat{y}_i)\).
  • Stochastic Gradient Descent Algorithm: (In Logistic Regression: \(\theta=(\beta_0,\vec{\beta})\))
    • Initialize: \(\eta, N, \theta_0, \delta>0\) (same as in GD), and minibatch size \(b=|B_t|,\forall t=1,2,...,N\).
    • Main loop: for t=1,2,...,N: \[\color{green}{\theta_{t}}:=\color{red}{\theta_{t-1}}-\eta\nabla\color{green}{\widehat{\cal L}}(\color{red}{\theta_{t-1}}).\]
  • Ex: Consider the 1st simulated data 👉

2. Optimization algorithm

2.3. Stochastic Gradient Descent (SGD) in Action

2. Optimization algorithm

2.4. Gradient of Log Loss

Optimizing Log Loss in Logistic Regression

  • Log loss for any parameter \(\beta_0,\vec{\beta}\in\mathbb{R}^d: {\cal L}(\beta_0,\vec{\beta})=-\sum_{i=1}^n[y_i\log(p(1|\text{x}_i))+(1-y_i)\log(p(0|\text{x}_i))]\).
  • You should verify that:
    • \(\partial_{\beta_0} {\cal L}(\beta_0,\vec{\beta})=\sum_{i=1}^n\partial_{\beta_0} \ell(y_i,\hat{y}_i)=-\sum_{i=1}^n(y_i-p(1|\text{x}_i))\in\mathbb{R}\).
    • \(\nabla_{\vec{\beta}} {\cal L}(\beta_0,\vec{\beta})=\sum_{i=1}^n\nabla_{\vec{\beta}} \ell(y_i,\hat{y}_i)=-\sum_{i=1}^n\underbrace{(y_i-p(1|\text{x}_i))}_{\in\mathbb{R}}\underbrace{\text{x}_i}_{\in\mathbb{R}^d}\).
  • We define the sigmoid function \(\sigma\) and the gradient of the log loss above as follows:
def sigmoid(z):
    return 1/(1+np.exp(-z))

def gradLogLoss(beta, X, y):
    # prepend a column of ones so that beta[0] plays the role of the intercept
    X_ = np.column_stack([np.ones_like(y), np.array(X)])
    # gradient: -sum_i (y_i - sigma(beta_0 + x_i^T beta)) * [1, x_i]
    return -np.sum(np.diag(y - sigmoid(X_ @ beta)) @ X_, axis=0)
  • GD code is given in the next panel.

  • SGD can be implemented easily from this GD code.

d = 3 # input dimension of X
def grad_desc(grad, X, y, theta0 = [0]*(d+1), 
              max_Iter = 100, lr = 0.1, threshold = 1e-10):
    if theta0 is None:
        val = [np.random.uniform(-10, 10, size=d+1)]   # random initialization
    else:
        val = [np.array(theta0)]
    grads = [grad(val[-1], X, y)]
    for i in range(max_Iter):
        if np.sum(np.abs(grads[-1])) < threshold:      # stopping rule
            return np.row_stack(val), np.row_stack(grads)
        new = val[-1] - lr * grads[-1]                 # GD update
        val.append(new)
        gr = grad(new, X, y)
        grads.append(gr)
    return np.row_stack(val), np.row_stack(grads)
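A minimal sketch of such an SGD variant (assuming gradLogLoss and the data arrays X, y above); at each step the gradient is evaluated on a random minibatch instead of the full data:

def sgd(grad, X, y, theta0=[0]*(d+1), max_Iter=100, lr=0.1, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.array(X), np.array(y)
    val = [np.array(theta0, dtype=float)]
    for t in range(max_Iter):
        idx = rng.choice(len(y), size=batch_size, replace=False)  # minibatch B_t
        g = grad(val[-1], X[idx], y[idx])                         # gradient on B_t only
        val.append(val[-1] - lr * g)                              # theta_t = theta_{t-1} - lr * g
    return np.row_stack(val)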

2. Optimization algorithm

2.5. More about gradient-based optimizations

3. Application

3. Application

3.1. Application on Spam dataset

  • We build Binary Logistic Regression for spam filtering.
  • The same \(20\%\) test data is considered.
  • Feature engineering:
    • Standardize the features: \[x_j\to\tilde{x}_j=\frac{x_j-\mu_j}{\sigma_j}.\]
  • Standardization often helps the optimization to converge faster.
  • We obtain the following ROC Curve (TPR vs. FPR); a sketch for plotting it follows the code below.
Code
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Standardization
scaler = StandardScaler()
X_sc_train = scaler.fit_transform(X_train1)
X_sc_test = scaler.transform(X_test1)

# Fit model
logit = LogisticRegression()
logit = logit.fit(X_sc_train, y_train1)
logit_pred = logit.predict(X_sc_test)
logit_pred_pro = logit.predict_proba(X_sc_test)[:,1]
acc = np.mean(logit_pred.flatten() == y_test1.to_numpy().flatten())
sns.set(style="white")
_, ax = plt.subplots(1,1,figsize=(3,1.5))
conf = confusion_matrix(y_test1, logit_pred)   # (y_true, y_pred)
con_fig = ConfusionMatrixDisplay(conf)
con_fig.plot(ax=ax)
ax.set_title(f"Accuracy = {np.round(acc,3)}")
plt.show()
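A sketch of the ROC curve mentioned above, built from the logit_pred_pro and y_test1 computed in the code block (assumes scikit-learn and matplotlib as above):

from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(y_test1, logit_pred_pro)        # FPR and TPR over thresholds
roc_auc = auc(fpr, tpr)
_, ax = plt.subplots(1, 1, figsize=(3, 3))
ax.plot(fpr, tpr, label=f"Logistic (AUC = {roc_auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="gray")   # chance level
ax.set_xlabel("False Positive Rate (FPR)")
ax.set_ylabel("True Positive Rate (TPR)")
ax.legend()
plt.show()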

Summary

  • We introduce basic concept of Logistic Regression Model: \[p(1|X=\text{x})=\frac{1}{(1+e^{-\color{blue}{\beta_0}-\text{x}^T\color{blue}{\vec{\beta}}})}.\]
  • The intuition of the model: the probability of being in class \(1\) depends on the relative distance from \(\text{x}\) to a linear boundary defined by \(\color{blue}{[\beta_0,\beta_1,\dots,\beta_d]}\).
  • The linear boundary assumption may be too restrictive in practice.
  • The performance of the model can be improved further by
    • Selecting relevant features
    • Feature engineering: polynomial transform…
    • Regularization or penalty methods…

4. Multinomial Logistic Regression

4. Multinomial Logistic Regression

4.1. Model

  • In this case, \(y\) takes \(M\) distinct values and is often one-hot encoded as an \(M\)-dimensional vector: \[y=k\Leftrightarrow p=(0,\dots,0,\underbrace{1}_{k\text{th index}},0,\dots,0)^T.\]
  • The model returns an \(M\)-dimensional probability vector; the index with the highest probability is the predicted class.
  • \(\text{Softmax}:\mathbb{R}^M\to\mathbb{S}_{M-1}=\{\vec{p}\in[0,1]^M:\sum_{j=1}^Mp_j=1\}\) defined by \[\vec{z}\mapsto \vec{s}=\text{Softmax}(\vec{z})=\Big(\frac{e^{z_1}}{\sum_{j=1}^Me^{z_j}},\dots,\frac{e^{z_M}}{\sum_{j=1}^Me^{z_j}}\Big).\]
  • It converts any \(M\)-dimensional vector \(\vec{z}\) to \(M\)-dimensional probability vector \(\vec{s}\).
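  • A minimal numpy sketch of the Softmax function (the input vector is illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))    # subtracting the max improves numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
s = softmax(z)
print(np.round(s, 3))            # [0.659 0.242 0.099]
print(s.sum())                   # sums to 1 (up to floating point)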

4. Multinomial Logistic Regression

4.1. Model

  • \(\text{Softmax}:\mathbb{R}^M\to\mathbb{S}_{M-1}=\{\vec{p}\in[0,1]^M:\sum_{j=1}^Mp_j=1\}\) defined by \[\vec{z}\mapsto \vec{s}=\text{Softmax}(\vec{z})=\Big(\frac{e^{z_1}}{\sum_{j=1}^Me^{z_j}},\dots,\frac{e^{z_M}}{\sum_{j=1}^Me^{z_j}}\Big).\]
  • Model: Given a parameter matrix \(\color{green}{W}\) of size \(M\times d\) and an intercept vector \(\color{green}{\vec{b}}\in\mathbb{R}^M\), the predicted probability of any input data \(\text{x}_i\in\mathbb{R}^d\) is defined by

\[\begin{align*}\color{red}{\hat{p}_i}&=\text{Softmax}(\color{green}{W}\text{x}_i+\color{green}{\vec{b}})\\ &=\text{Softmax}\begin{pmatrix} \color{green}{\begin{bmatrix}w_{11}& \dots & w_{1d}\\ w_{21}& \dots & w_{2d}\\ \vdots & \vdots & \vdots\\ w_{M1}& \dots & w_{Md}\end{bmatrix}}\begin{bmatrix}x_{i1}\\ x_{i2}\\ \vdots\\ x_{id}\end{bmatrix}+\color{green}{\begin{bmatrix}b_{1}\\ b_{2}\\ \vdots\\ b_{M}\end{bmatrix}}\end{pmatrix}=\color{red}{\begin{bmatrix}\hat{p}_{i1}\\ \hat{p}_{i2}\\ \vdots\\ \hat{p}_{iM}\end{bmatrix}}\end{align*}\]

4. Multinomial Logistic Regression

4.1. Model

\[\begin{align*}\color{red}{\hat{p}_i}&=\text{Softmax}(\color{green}{W}\text{x}_i+\color{green}{\vec{b}})\\ &=\text{Softmax}\begin{pmatrix} \color{green}{\begin{bmatrix}w_{11}& \dots & w_{1d}\\ w_{21}& \dots & w_{2d}\\ \vdots & \vdots & \vdots\\ w_{M1}& \dots & w_{Md}\end{bmatrix}}\begin{bmatrix}x_{i1}\\ x_{i2}\\ \vdots\\ x_{id}\end{bmatrix}+\color{green}{\begin{bmatrix}b_{1}\\ b_{2}\\ \vdots\\ b_{M}\end{bmatrix}}\end{pmatrix}=\color{red}{\begin{bmatrix}\hat{p}_{i1}\\ \hat{p}_{i2}\\ \vdots\\ \hat{p}_{iM}\end{bmatrix}}\end{align*}\]

4. Multinomial Logistic Regression

4.2. Cross-entropy loss

  • We search for the best possible parameters \(\color{green}{W}\in\mathbb{R}^{M\times d}\) and \(\color{green}{\vec{b}}\in\mathbb{R}^M\) by minimizing the Cross-Entropy loss defined below: \[\text{CEn}(W,\vec{b})=-\sum_{i=1}^n\sum_{m=1}^M\color{blue}{p_{im}}\log(\color{red}{\hat{p}_{im}}),\text{ where }\color{red}{\widehat{\text{p}}_i}=(\color{red}{\hat{p}_{i1}},\color{red}{\dots},\color{red}{\hat{p}_{iM}})\text{ is the predicted probability vector for observation }i\text{ with one-hot target }\color{blue}{\text{p}_i}.\]
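A minimal numpy sketch of this loss for one-hot targets (the two illustrative rows below are hypothetical):

import numpy as np

def cross_entropy(P, P_hat, eps=1e-12):
    # P: one-hot targets (n x M); P_hat: predicted probabilities (n x M)
    return -np.sum(P * np.log(P_hat + eps))

P     = np.array([[1, 0, 0],
                  [0, 1, 0]])
P_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
print(cross_entropy(P, P_hat))   # -(log 0.7 + log 0.8) ≈ 0.580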

4. Multinomial Logistic Regression

4.3. Gradient of Cross-Entropy

Optimizing Cross-entropy in Multinomial Logistic Regression

  • Cross Entropy of any parameters \(\color{green}{W}\in\mathbb{R}^{M\times d}\) and \(\color{green}{\vec{b}}\in\mathbb{R}^M\): \(\text{CEn}(\color{green}{W},\color{green}{\vec{b}})=-\sum_{i=1}^n\sum_{m=1}^M\color{blue}{p_{im}}\log(\color{red}{\widehat{p}_{im}})\).
  • Challenge: Show that
    • \(\nabla_{\color{green}{\vec{b}}} \text{CEn}(\color{green}{W},\color{green}{\vec{b}})=\sum_{i=1}^n\nabla_{\color{green}{\vec{b}}} \ell(\color{blue}{\text{p}_i},\color{red}{\hat{\text{p}}_i})=-\sum_{i=1}^n\underbrace{(\color{blue}{\text{p}_i}-\text{Softmax}(\color{green}{W}\text{x}_i+\color{green}{\vec{b}}))}_{\in\mathbb{R}^M}\).
    • \(\nabla_{\color{green}{W}} \text{CEn}(\color{green}{W},\color{green}{\vec{b}})=\sum_{i=1}^n\nabla_{\color{green}{W}} \ell(\color{blue}{\text{p}_i},\color{red}{\widehat{\text{p}}_i})=-\sum_{i=1}^n\underbrace{(\color{blue}{\text{p}_i}-\color{red}{\widehat{\text{p}}_i})}_{\in\mathbb{R}^{M\times 1}}\underbrace{\text{x}_i^T}_{\in\mathbb{R}^{1\times d}}=\underbrace{(\color{red}{\widehat{P}}-\color{blue}{P})^TX}_{\in\mathbb{R}^{M\times d}}\), where

\[\begin{align*} \color{red}{\hat{P}}=\color{red}{\underbrace{\begin{bmatrix} \hat{p}_{11} &\dots & \hat{p}_{1M}\\ \vdots & \vdots & \vdots\\ \hat{p}_{n1} &\dots & \hat{p}_{nM} \end{bmatrix}}_{\text{Matrix of predicted probabilities}}}, \color{blue}{P}=\color{blue}{\underbrace{\begin{bmatrix} p_{11} &\dots & p_{1M}\\ \vdots & \vdots & \vdots\\ p_{n1} &\dots & p_{nM} \end{bmatrix}}_{\text{One-hot encoded matrix of $y_i$}}}\text{ and } X=\underbrace{\begin{bmatrix} x_{11} &\dots & x_{1d}\\ \vdots & \vdots & \vdots\\ x_{n1} &\dots & x_{nd} \end{bmatrix}}_{\text{Matrix of inputs $\text{x}_i$}} \end{align*}\]

Further reading: J. Chiang (2023) and K. Stratos (2018).
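A minimal sketch checking the shapes of these gradient formulas on random data (all names and sizes below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, d, M = 5, 3, 4                                           # samples, features, classes
X = rng.normal(size=(n, d))
W, b = rng.normal(size=(M, d)), rng.normal(size=M)

Z = X @ W.T + b                                             # scores, shape (n, M)
P_hat = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)    # row-wise Softmax
P = np.eye(M)[rng.integers(0, M, size=n)]                   # one-hot targets

grad_W = (P_hat - P).T @ X                                  # gradient w.r.t. W, shape (M, d)
grad_b = (P_hat - P).sum(axis=0)                            # gradient w.r.t. b, shape (M,)
print(grad_W.shape, grad_b.shape)                           # (4, 3) (4,)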

4. Multinomial Logistic Regression

4.5. Pros & Cons

Pros

  • Simplicity: It’s easy to understand and implement.
  • Efficiency: It’s not so computationally expensive, making it suitable for large datasets.
  • Interpretability: It converts relative distances from input data to boundary decision into probabilities, which can be meaningfully interpreted.
  • No Scaling Required: It doesn’t require input features to be scaled or standardized (though standardizing can still speed up optimization).
  • Works Well with Linearly Separable Data: Performs well when the data is linearly separable.

Cons

  • Assumes Linearity: Assumes a linear relationship between the independent variables and the log-odds of the dependent variable.
  • Sensitivity to Outliers: Outliers can affect the model’s performance and accuracy.
  • Overfitting: Can overfit if the dataset has a large number of features compared to the number of observations.
  • Limited to Binary Outcomes: Standard logistic regression is designed for binary classification and may not perform as well with multiclass classification problems.
  • Assumes Independence of Errors: Assumes that the observations are independent of each other and that there is no multicollinearity among predictors.

5. Polynomial features (Beyond linearity)

5. Polynomial Features

  • The linear boundary assumption may be too restrictive in real situations.
  • Polynomial features: \(X_1,\dots,X_d\to X_i^kX_j^{p-k}, k=0,1,\dots,p\).
  • Original data:
x1 x2
0 -0.752759 2.704286
1 1.391964 0.591951
2 -2.063888 -2.064033
  • Poly. features:
x1 x2 x1^2 x1*x2 x2^2
0 -0.752759 2.704286 0.566647 -2.035676 7.313162
1 1.391964 0.591951 1.937563 0.823974 0.350406
2 -2.063888 -2.064033 4.259634 4.259933 4.260232
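The degree-2 features above can be generated with scikit-learn's PolynomialFeatures; a minimal sketch on the same three rows:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({"x1": [-0.752759, 1.391964, -2.063888],
                  "x2": [ 2.704286, 0.591951, -2.064033]})
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(poly.fit_transform(X),
                      columns=poly.get_feature_names_out(X.columns))
print(X_poly.round(6))           # columns: x1, x2, x1^2, x1 x2, x2^2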

5. Polynomial Features

  • The linear boundary assumption may be too restrictive in real situations.
  • Polynomial features: \(X_1,\dots,X_d\to X_i^kX_j^{p-k}, k=0,1,\dots,p\).
  • Offer more flexible boundaries.
  • May be more suitable in complex problems.
  • Higher risk of overfitting!!!
  • Overfitting: happens when a model learns the training data too well, capturing noise and fluctuations rather than the underlying pattern.

6. Regularization/penalization

6. Regularization/penalization

Why Regularization?

  • In many cases, the linear assumption is not suitable.
  • However, when a model is too flexible (e.g., high-order polynomial features), it risks overfitting.
  • Building a good model means capturing the pattern of the data, NOT fitting every training point.
  • The effect of controlling parameters:

6. Regularization/penalization

6.1. \(L_2\) Penalization

  • \(L_2\) regularized loss function: for any \(\alpha>0\), \[{\cal L}_2(\beta_0,\vec{\beta})=\underbrace{\color{red}{{\cal L}(\beta_0,\vec{\beta})}}_{\text{loss function}}+\alpha\underbrace{\color{green}{\sum_{j=0}^d\beta_j^2}}_{\text{Control}}.\]
    • \(\alpha\) is called regularization strength.
    • Large \(\alpha\Leftrightarrow\) strong penalization \(\Leftrightarrow\) small \(\beta_0,\vec{\beta}\).
    • Small \(\alpha\Leftrightarrow\) light penalization \(\Leftrightarrow\) large \(\beta_0,\vec{\beta}\).
  • Parameter \(\alpha\) balances goodness of fit (loss \(\color{red}{{\cal L}}\)) and flexibility of the model (\(\color{green}{\sum_{j=0}^d\beta_j^2}\)).
  • 🔑 Key idea: Tune a suitable penalization strength \(\alpha\) that balances these two terms.
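  • A minimal sketch of this regularized loss, extending the log_loss function from Section 1 (sigmoid, X, and y are assumed to be defined as there):

def log_loss_l2(beta, X, y, alpha=1.0):
    sig = sigmoid(X @ beta)
    ce = -np.mean(y * np.log(sig) + (1 - y) * np.log(1 - sig))   # loss term
    return ce + alpha * np.sum(np.array(beta)**2)                # + alpha * control term

  • In scikit-learn's LogisticRegression, the strength is passed through C, the inverse of \(\alpha\) (the tuning code in Section 7 uses C=1/alpha).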

6. Regularization/penalization

6.2. \(L_1\) Penalization

  • \(L_1\) regularized loss function: for any \(\alpha>0\), \[{\cal L}_1(\beta_0,\vec{\beta})=\underbrace{\color{red}{{\cal L}(\beta_0,\vec{\beta})}}_{\text{loss function}}+\alpha\underbrace{\color{green}{\sum_{j=0}^d|\beta_j|}}_{\text{Control}}.\]
  • By its nature, \(L_1\) regularization performs variable selection.
  • It tends to force unimportant components/variables to be \(0\).
  • 🔑 Key idea: Tune a suitable penalization strength \(\alpha\).

  • Remark: Regularization is an important method that can be used with any parametric model, not only logistic regression.

7. Hyperparameter Tuning

7. Hyperparameter Tuning

7.1. Data splitting: Train/Validate/Test

  • A good model is one that generalizes well to new, unseen data.
  • The first attempt: splitting the data into 3 parts.
Set Common % Purpose
Train 60%–70% For training the model
Validation 15%–20% Tune hyperparameters
Test 15%–20% Evaluate final model performance
  • In this case, the best \(k\) is the one achieving the best performance on Validation set.
  • The final performance is measured using the Test set.
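  • A minimal sketch of such a split with scikit-learn (X, y denote the full dataset; the 60/20/20 proportions are illustrative):

from sklearn.model_selection import train_test_split

# 60% train, then split the remaining 40% equally into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)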

7. Hyperparameter Tuning

7.2. Data splitting: \(K\)-fold Cross-Validation

  • In the previous splitting scheme, the best \(k\) depends strongly on the split.
  • To reduce this dependency, a more stable scheme, called \(K\)-fold Cross-Validation, is used.

Pseudocode

  • For \(\color{blue}{k}\) in [1,2,3,...,N]:
    • For f in [1,...,K]:
      • Train the model with hyperparameter \(\color{blue}{k}\) on all data except fold f.
      • Predict and measure performance on fold f.
      • Save the performance as \(\color{red}{\epsilon_f}\).
    • Compute CV performance for \(\color{blue}{k}\): \[\text{CV}(\color{blue}{k})=\frac{1}{K}\sum_{f=1}^K\color{red}{\epsilon_f}.\]
  • Choose the best \(k\) with the best CV performance.
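  • A minimal sketch of this pseudocode (make_model(k) is a hypothetical factory returning an unfitted model for hyperparameter value k; X, y are numpy arrays):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def cv_score(make_model, k, X, y, K=10):
    kf = KFold(n_splits=K, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = make_model(k).fit(X[train_idx], y[train_idx])                  # train on the other folds
        scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))   # epsilon_f
    return np.mean(scores)                                                     # CV(k)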

7. Hyperparameter Tuning

7.3. Tuning polynomial degree

Tuning Optimal Polynomial Degree using \(10\)-fold Cross-Validation

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# List of degrees to search over
degrees = list(range(1,11))

# List for storing accuracy
acc = []
for deg in degrees:
    poly = PolynomialFeatures(degree=deg, include_bias=False)
    x_poly = poly.fit_transform(
        pd.DataFrame(x1[:,1], columns=['x1'])
    )
    x_poly = np.column_stack([x_poly,x1[:,2],x1[:,1]*x1[:,2]])

    # model
    model = LogisticRegression()
    score = cross_val_score(model, x_poly, y_, cv=10, 
                scoring='accuracy'
            ).mean()
    acc.append(score)
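The best degree is then the one with the highest CV accuracy, e.g.:

best_degree = degrees[int(np.argmax(acc))]   # degree with the highest CV accuracy
print(best_degree, max(acc))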

7. Hyperparameter Tuning

7.4. Tuning Penalty Strength \(\alpha\) in \(L_2\) Penalty

Tuning Penalty Strength \(\alpha\) using \(10\)-fold Cross-Validation

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
# Data
poly = PolynomialFeatures(degree=5, include_bias=False)
x_poly = poly.fit_transform(
        pd.DataFrame(x1[:,1], columns=['x1']))
x_poly = np.column_stack([x_poly,x1[:,2],x1[:,1]*x1[:,2]])
# Penalty strength
alphas = np.concatenate([
    np.linspace(1e-3,10, 50), 
    np.linspace(10.1, 1000, 50)]
)
coefficients = {f'{alp}': [] for alp in alphas}
acc = []
for alp in alphas:
    # model
    model = LogisticRegression(penalty="l2", C=1/alp)
    score = cross_val_score(model, x_poly, y_, cv=10, 
                scoring='balanced_accuracy').mean()
    acc.append(score)
    # Fitting
    model.fit(x_poly, y_)
    coefficients[f'{alp}'] = model.coef_[0]

7. Hyperparameter Tuning

7.4. Tuning Penalty Strength \(\alpha\) in \(L_2\) Penalty

Tuning Penalty Strength \(\alpha\) using \(10\)-fold Cross-Validation

7. Hyperparameter Tuning

7.5. Tuning Penalty Strength \(\alpha\) in \(L_1\) Penalty

Tuning Penalty Strength \(\alpha\) using \(10\)-fold Cross-Validation

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
# Data
poly = PolynomialFeatures(degree=5, include_bias=False)
x_poly = poly.fit_transform(
        pd.DataFrame(x1[:,1], columns=['x1']))
x_poly = np.column_stack([x_poly,x1[:,2],x1[:,1]*x1[:,2]])
# Penalty strength
alphas = np.concatenate([
    np.linspace(1e-3,10, 50), 
    np.linspace(10.1, 1000, 50)]
)
coefficients = {f'{alp}': [] for alp in alphas}
acc = []
for alp in alphas:
    # model
    model = LogisticRegression(penalty="l1", 
                               C=1/alp, solver='saga')
    score = cross_val_score(model, x_poly, y_, cv=10, 
                scoring='balanced_accuracy').mean()
    acc.append(score)
    # Fitting
    model.fit(x_poly, y_)
    coefficients[f'{alp}'] = model.coef_[0]

7. Hyperparameter Tuning

7.5. Tuning Penalty Strength \(\alpha\) in \(L_1\) Penalty

Tuning Penalty Strength \(\alpha\) using \(10\)-fold Cross-Validation

🎉 Conclusion

  • Logistic Regression is a powerful classification algorithm especially for binary outcomes.
  • It can be extended to handle non-linear decision boundaries using polynomial features.
  • Regularization techniques like \(L_1\) and \(L_2\) help prevent overfitting and improve model generalization.
  • Hyperparameter tuning, especially using cross-validation, is crucial for selecting the best model parameters.

🥳 It’s party time 🥂