TP2 - Logistic Regression & Regularization


Course: Advanced Machine Learning
Lecturer: Sothea HAS, PhD

Objective: This practical session (TP) is designed to help you apply the concepts and techniques you have learned about Logistic Regression. You will learn to analyze each input variable and its connection to the target which will guide you to a suitable model in predicting the target.


1. Binary Logistic Regression

Let’s us begin by exploring the Heart Disease Dataset.

import numpy as np
import pandas as pd
import kagglehub
# Download latest version
path = kagglehub.dataset_download("johnsmith88/heart-disease-dataset")
data = pd.read_csv(path + "/heart.csv")
data.head(5)
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 52 1 0 125 212 0 1 168 0 1.0 2 2 3 0
1 53 1 0 140 203 1 0 155 1 3.1 0 0 3 0
2 70 1 0 145 174 0 1 125 1 2.6 0 0 3 0
3 61 1 0 148 203 0 1 161 0 0.0 2 1 3 0
4 62 0 0 138 294 1 1 106 0 1.9 1 3 2 0

A. General view of the dataset.

  • Load the dataset into jupyter notebook.
  • What’s the dimension of the dataset?
  • How many qualitative and quantitative variables are there in this dataset (answer this question carefully! Some qualitative variables may be encoded using numerical values)?
  • Convert variables into their suitable data type if there are any inconsistent variable types.
import numpy as np
import pandas as pd
print(f'The dimension of the dataset: {data.shape}')
The dimension of the dataset: (1025, 14)
data.dtypes.to_frame().T
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 int64 int64 int64 int64 int64 int64 int64 int64 int64 float64 int64 int64 int64 int64
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal','target']

for i in quan_vars:
  data[i] = data[i].astype('float')
for i in qual_vars:
  data[i] = data[i].astype('category')
print(f'The number of categorical data is {data.select_dtypes(include="category").shape[1]}')
print(f'The number of quantitative data is {data.select_dtypes(include="number").shape[1]}')
The number of categorical data is 9
The number of quantitative data is 5

B. Univariate Analysis.

  • Compute summary statistics and visualize the distribution of the target and the inputs according to their types.
  • Are there any missing values? Duplicate data? Outliers?
  • Address or handle the above problems.
data.describe()
age trestbps chol thalach oldpeak
count 1025.000000 1025.000000 1025.00000 1025.000000 1025.000000
mean 54.434146 131.611707 246.00000 149.114146 1.071512
std 9.072290 17.516718 51.59251 23.005724 1.175053
min 29.000000 94.000000 126.00000 71.000000 0.000000
25% 48.000000 120.000000 211.00000 132.000000 0.000000
50% 56.000000 130.000000 240.00000 152.000000 0.800000
75% 61.000000 140.000000 275.00000 166.000000 1.800000
max 77.000000 200.000000 564.00000 202.000000 6.200000
import seaborn as sns
import matplotlib.pyplot as plt
x, axs = plt.subplots(1, 5, figsize=(15, 3))

for var in quan_vars:
  sns.histplot(data=data, x=var, ax=axs[quan_vars.index(var)], kde=True)
  axs[quan_vars.index(var)].set_title(f"Histogram of {var}")
plt.tight_layout()
plt.show()

_, axs = plt.subplots(3, 3, figsize=(12, 9))

for var in qual_vars:
  sns.countplot(data=data, x=var, ax=axs[qual_vars.index(var)//3, qual_vars.index(var)%3], stat="proportion")
  axs[qual_vars.index(var)//3, qual_vars.index(var)%3].set_title(f"Countplot of {var}")
  axs[qual_vars.index(var)//3, qual_vars.index(var)%3].bar_label(axs[qual_vars.index(var)//3, qual_vars.index(var)%3].containers[0], fmt="%.2f")
plt.tight_layout()
plt.show()

data.isna().sum().to_frame().T
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
id_duplicated = data.duplicated()
data.loc[id_duplicated,:].index

print(f"Target of duplicated rows: {data.loc[data.duplicated(), ['target']].value_counts()}")
print(f'Percentage of duplicated rows: {data.duplicated().sum()/data.shape[0]*100:.2f}%')
Target of duplicated rows: target
1         362
0         361
Name: count, dtype: int64
Percentage of duplicated rows: 70.54%
data.drop_duplicates(inplace=True)

Since the data contains more than 70% duplications, we keeps the duplicated rows.

C. Bivariate Analysis.

# Pearson's correaltion matrix
data.select_dtypes(include="number").corr()
age trestbps chol thalach oldpeak
age 1.000000 0.283121 0.207216 -0.395235 0.206040
trestbps 0.283121 1.000000 0.125256 -0.048023 0.194600
chol 0.207216 0.125256 1.000000 -0.005308 0.050086
thalach -0.395235 -0.048023 -0.005308 1.000000 -0.342201
oldpeak 0.206040 0.194600 0.050086 -0.342201 1.000000
# Spearman's correaltion
data.select_dtypes(include="number").corr(method="spearman")
age trestbps chol thalach oldpeak
age 1.000000 0.289705 0.188903 -0.393453 0.263625
trestbps 0.289705 1.000000 0.130210 -0.042699 0.156807
chol 0.188903 0.130210 1.000000 -0.040367 0.039565
thalach -0.393453 -0.042699 -0.040367 1.000000 -0.430495
oldpeak 0.263625 0.156807 0.039565 -0.430495 1.000000
  • Quantitative inputs vs Target
_, axs = plt.subplots(1, 5, figsize=(15, 3))
for var in quan_vars:
  sns.boxplot(data=data, y=var, x="target", hue="target", ax=axs[quan_vars.index(var)])
  axs[quan_vars.index(var)].set_title(f"{var} vs Heart disease")

plt.tight_layout()
plt.show()

_, axs = plt.subplots(2, 4, figsize=(15, 6))

for var in qual_vars:
  if var != "target":
    sns.countplot(data=data, x=var, hue="target", ax=axs[qual_vars.index(var)//4, qual_vars.index(var)%4])
    axs[qual_vars.index(var)//4, qual_vars.index(var)%4].set_title(f"Countplot of {var}")
    axs[qual_vars.index(var)//4, qual_vars.index(var)%4].bar_label(axs[qual_vars.index(var)//4, qual_vars.index(var)%4].containers[0], fmt="%.2f")
    axs[qual_vars.index(var)//4, qual_vars.index(var)%4].bar_label(axs[qual_vars.index(var)//4, qual_vars.index(var)%4].containers[1], fmt="%.2f")
plt.tight_layout()
plt.show()

Remark: You should try to understand the differences between these two types of correlation as they are helpful in guiding you to the correct transformation of inputs for model development. At the end of this step, you should have strong intuition on the most impactful inputs for building the model and how can to handle the inputs before building models.

D. Building Logistic Regression Models

  • Split the data into \(80\%-20\%\) training and testing data.
from sklearn.model_selection import train_test_split
X, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
  • Build a logistic regression model using only categorical variables on the training data then compute its performance on the test data using suitable metrics.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler

# OnehotEncoding
X_train_cat = pd.get_dummies(X_train.select_dtypes(include="category"), drop_first=True)
X_test_cat = pd.get_dummies(X_test.select_dtypes(include="category"), drop_first=True)

# Fit the model
lgr_cat = LogisticRegression(max_iter=500)
lgr_cat = lgr_cat.fit(X_train_cat, y_train)

# Test performance
y_pred_cat = lgr_cat.predict(X_test_cat)
test_perf = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred_cat),
          'Precision': precision_score(y_test, y_pred_cat),
          'Recall': recall_score(y_test, y_pred_cat),
          'F1-score': f1_score(y_test, y_pred_cat),
          'AUC': roc_auc_score(y_test, y_pred_cat)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Qual_model"])
test_perf
Accuracy Precision Recall F1-score AUC
Qual_model 0.836066 0.848485 0.848485 0.848485 0.834957
  • Repeat the previous question using quantitative variables instead.
X_train_quan = X_train.select_dtypes(include="number")
X_test_quan = X_test.select_dtypes(include="number")

# Fit the model
lgr_quan = LogisticRegression(max_iter=500)
lgr_quan = lgr_quan.fit(X_train_quan, y_train)

# Test performance
y_pred_quan = lgr_quan.predict(X_test_quan)
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred_quan),
          'Precision': precision_score(y_test, y_pred_quan),
          'Recall': recall_score(y_test, y_pred_quan),
          'F1-score': f1_score(y_test, y_pred_quan),
          'AUC': roc_auc_score(y_test, y_pred_quan)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Quan_model"])], axis=0)
test_perf
Accuracy Precision Recall F1-score AUC
Qual_model 0.836066 0.848485 0.848485 0.848485 0.834957
Quan_model 0.786885 0.833333 0.757576 0.793651 0.789502
  • Do it again with all the variables.

Note that when the optimization does not converge, one can perform feature scaling to make the optimazation converges faster.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_all = pd.concat([X_train_cat, X_train_quan], axis=1)
X_train_all = scaler.fit_transform(X_train_all)
X_test_all = pd.concat([X_test_cat, X_test_quan], axis=1)
X_test_all = scaler.transform(X_test_all)

# Fit the model
lgr_all = LogisticRegression(max_iter=1000)
lgr_all = lgr_quan.fit(X_train_all, y_train)

# Test performance
y_pred_all = lgr_all.predict(X_test_all)
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred_all),
          'Precision': precision_score(y_test, y_pred_all),
          'Recall': recall_score(y_test, y_pred_all),
          'F1-score': f1_score(y_test, y_pred_all),
          'AUC': roc_auc_score(y_test, y_pred_all)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Full_model"])], axis=0)
test_perf
Accuracy Precision Recall F1-score AUC
Qual_model 0.836066 0.848485 0.848485 0.848485 0.834957
Quan_model 0.786885 0.833333 0.757576 0.793651 0.789502
Full_model 0.836066 0.848485 0.848485 0.848485 0.834957

Here, we build a model with all the informative inputs from EDA:

X_train_eda = np.column_stack([
    X_train[['age', 'oldpeak', 'thalach']].to_numpy(),
    pd.get_dummies(X_train.drop(columns=['fbs']).select_dtypes(include="category"), drop_first=True).to_numpy()])
X_test_eda = np.column_stack([
    X_test[['age', 'oldpeak', 'thalach']].to_numpy(),
    pd.get_dummies(X_test.drop(columns=['fbs']).select_dtypes(include="category"), drop_first=True).to_numpy()])
# Fit the model
lgr_eda = LogisticRegression(max_iter=1000)
lgr_eda = lgr_eda.fit(X_train_eda, y_train)

# Test performance
y_pred_eda = lgr_eda.predict(X_test_eda)
test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred_eda),
          'Precision': precision_score(y_test, y_pred_eda),
          'Recall': recall_score(y_test, y_pred_eda),
          'F1-score': f1_score(y_test, y_pred_eda),
          'AUC': roc_auc_score(y_test, y_pred_eda)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["EDA_model"])], axis=0)
test_perf
Accuracy Precision Recall F1-score AUC
Qual_model 0.836066 0.848485 0.848485 0.848485 0.834957
Quan_model 0.786885 0.833333 0.757576 0.793651 0.789502
Full_model 0.836066 0.848485 0.848485 0.848485 0.834957
EDA_model 0.836066 0.848485 0.848485 0.848485 0.834957
  • Comment your finding.

Based on EDA step, we found that qualitative variables are more related to the target than the some of numerical variables. When building models, we actually observe that the model with qualitative variables does perform better than the model with only quantitative variables.

  • Try to study logistic regression using polynomial features. Compute its performance on the test data and compare to the previous result.

We study polynomial features with numerical variables.

from sklearn.preprocessing import PolynomialFeatures
scaler = StandardScaler()
for degree in range(2,16):
  poly = PolynomialFeatures(degree=degree, include_bias=False)
  X_train_poly = np.column_stack([poly.fit_transform(X_train.select_dtypes(include="number")),
                                 X_train_cat])
  X_train_poly = scaler.fit_transform(X_train_poly)
  X_test_poly = np.column_stack([poly.fit_transform(X_test.select_dtypes(include="number")),
                                X_test_cat])
  X_test_poly = scaler.transform(X_test_poly)
  model = LogisticRegression(max_iter=1000)
  model = model.fit(X_train_poly, y_train)
  y_pred_poly = model.predict(X_test_poly)
  test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred_poly),
          'Precision': precision_score(y_test, y_pred_poly),
          'Recall': recall_score(y_test, y_pred_poly),
          'F1-score': f1_score(y_test, y_pred_poly),
          'AUC': roc_auc_score(y_test, y_pred_poly)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=[f"Poly{degree}_model"])], axis=0)
test_perf
Accuracy Precision Recall F1-score AUC
Qual_model 0.836066 0.848485 0.848485 0.848485 0.834957
Quan_model 0.786885 0.833333 0.757576 0.793651 0.789502
Full_model 0.836066 0.848485 0.848485 0.848485 0.834957
EDA_model 0.836066 0.848485 0.848485 0.848485 0.834957
Poly2_model 0.836066 0.848485 0.848485 0.848485 0.834957
Poly3_model 0.836066 0.870968 0.818182 0.843750 0.837662
Poly4_model 0.819672 0.866667 0.787879 0.825397 0.822511
Poly5_model 0.819672 0.866667 0.787879 0.825397 0.822511
Poly6_model 0.786885 0.812500 0.787879 0.800000 0.786797
Poly7_model 0.803279 0.818182 0.818182 0.818182 0.801948
Poly8_model 0.803279 0.818182 0.818182 0.818182 0.801948
Poly9_model 0.786885 0.812500 0.787879 0.800000 0.786797
Poly10_model 0.770492 0.787879 0.787879 0.787879 0.768939
Poly11_model 0.754098 0.764706 0.787879 0.776119 0.751082
Poly12_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly13_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly14_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly15_model 0.737705 0.757576 0.757576 0.757576 0.735931

It is very common that with higher degrees, the model becomes too flexible and overfits the data. One can use regularization method to control the flexibility of the model.

  • Apply regularization methods and evaluate their performances on the test data.
from sklearn.model_selection import cross_val_score
scaler = StandardScaler()
# let's fix degree
deg = 3
poly = PolynomialFeatures(degree=deg, include_bias=False)
X_train_poly = np.column_stack([poly.fit_transform(X_train.select_dtypes(include="number")),
                                 X_train_cat])
X_train_poly = scaler.fit_transform(X_train_poly)
X_test_poly = np.column_stack([poly.fit_transform(X_test.select_dtypes(include="number")),
                                X_test_cat])
X_test_poly = scaler.transform(X_test_poly)
alphas = 2 ** np.linspace(-10, 13, 20)
losses = []
for alpha in alphas:
  model = LogisticRegression(penalty="l2", C=1/alpha, max_iter=1000)
  model = model.fit(X_train_poly, y_train)
  loss = -cross_val_score(model, X_train_poly, y_train, cv=5, scoring="neg_log_loss")
  losses.append(np.mean(loss))
opt_alpha = alphas[np.argmin(losses)]
print(f"Optimal alpha: {opt_alpha}")
ax = sns.lineplot(x=alphas, y=losses)
ax.vlines(x=opt_alpha, ymin=min(losses), ymax=max(losses), colors='red', linestyles="dashed")
ax.set_xscale("log")
ax.set_xlabel("alpha")
ax.set_ylabel("CV_MSE")
plt.show()

lgr_reg = LogisticRegression(penalty="l2", C=1/opt_alpha, max_iter=1000)
lgr_reg = lgr_reg.fit(X_train_poly, y_train)
y_pred_poly = lgr_reg.predict(X_test_poly)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred_poly),
          'Precision': precision_score(y_test, y_pred_poly),
          'Recall': recall_score(y_test, y_pred_poly),
          'F1-score': f1_score(y_test, y_pred_poly),
          'AUC': roc_auc_score(y_test, y_pred_poly)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=[f"Poly3_Ridge_model"])], axis=0)
test_perf
Optimal alpha: 9.957540715712186

Accuracy Precision Recall F1-score AUC
Qual_model 0.836066 0.848485 0.848485 0.848485 0.834957
Quan_model 0.786885 0.833333 0.757576 0.793651 0.789502
Full_model 0.836066 0.848485 0.848485 0.848485 0.834957
EDA_model 0.836066 0.848485 0.848485 0.848485 0.834957
Poly2_model 0.836066 0.848485 0.848485 0.848485 0.834957
Poly3_model 0.836066 0.870968 0.818182 0.843750 0.837662
Poly4_model 0.819672 0.866667 0.787879 0.825397 0.822511
Poly5_model 0.819672 0.866667 0.787879 0.825397 0.822511
Poly6_model 0.786885 0.812500 0.787879 0.800000 0.786797
Poly7_model 0.803279 0.818182 0.818182 0.818182 0.801948
Poly8_model 0.803279 0.818182 0.818182 0.818182 0.801948
Poly9_model 0.786885 0.812500 0.787879 0.800000 0.786797
Poly10_model 0.770492 0.787879 0.787879 0.787879 0.768939
Poly11_model 0.754098 0.764706 0.787879 0.776119 0.751082
Poly12_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly13_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly14_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly15_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly3_Ridge_model 0.852459 0.875000 0.848485 0.861538 0.852814
losses = []
for alpha in alphas:
  model = LogisticRegression(penalty="l1", solver='liblinear', C=1/alpha, max_iter=1000)
  model = model.fit(X_train_poly, y_train)
  loss = -cross_val_score(model, X_train_poly, y_train, cv=5, scoring="neg_log_loss")
  losses.append(np.mean(loss))

opt_alpha = alphas[np.argmin(losses)]
print(f"Optimal alpha: {opt_alpha}")
ax = sns.lineplot(x=alphas, y=losses)
ax.vlines(x=opt_alpha, ymin=min(losses), ymax=max(losses), colors='red', linestyles="dashed")
ax.set_xscale("log")
ax.set_xlabel("alpha")
ax.set_ylabel("CV_MSE")
plt.show()
Optimal alpha: 4.302762344880728

lgr_reg = LogisticRegression(penalty="l1", solver="saga", C=1/opt_alpha, max_iter=1000)
lgr_reg = lgr_reg.fit(X_train_poly, y_train)
y_pred_poly = lgr_reg.predict(X_test_poly)

test_perf = pd.concat([test_perf, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred_poly),
          'Precision': precision_score(y_test, y_pred_poly),
          'Recall': recall_score(y_test, y_pred_poly),
          'F1-score': f1_score(y_test, y_pred_poly),
          'AUC': roc_auc_score(y_test, y_pred_poly)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=[f"Poly3_Lasso_model"])], axis=0)
test_perf
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_sag.py:349: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn(
Accuracy Precision Recall F1-score AUC
Qual_model 0.836066 0.848485 0.848485 0.848485 0.834957
Quan_model 0.786885 0.833333 0.757576 0.793651 0.789502
Full_model 0.836066 0.848485 0.848485 0.848485 0.834957
EDA_model 0.836066 0.848485 0.848485 0.848485 0.834957
Poly2_model 0.836066 0.848485 0.848485 0.848485 0.834957
Poly3_model 0.836066 0.870968 0.818182 0.843750 0.837662
Poly4_model 0.819672 0.866667 0.787879 0.825397 0.822511
Poly5_model 0.819672 0.866667 0.787879 0.825397 0.822511
Poly6_model 0.786885 0.812500 0.787879 0.800000 0.786797
Poly7_model 0.803279 0.818182 0.818182 0.818182 0.801948
Poly8_model 0.803279 0.818182 0.818182 0.818182 0.801948
Poly9_model 0.786885 0.812500 0.787879 0.800000 0.786797
Poly10_model 0.770492 0.787879 0.787879 0.787879 0.768939
Poly11_model 0.754098 0.764706 0.787879 0.776119 0.751082
Poly12_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly13_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly14_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly15_model 0.737705 0.757576 0.757576 0.757576 0.735931
Poly3_Ridge_model 0.852459 0.875000 0.848485 0.861538 0.852814
Poly3_Lasso_model 0.836066 0.848485 0.848485 0.848485 0.834957

From these results, Polynomial feature of logistic regression with Ridge regularization is the best model with better score than other models.

E. Try what you have done on Spam dataset.

import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
Id make address all num3d our over remove internet order ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0 1 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 spam
1 2 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 spam
2 3 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 spam
3 4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 spam
4 5 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 spam

5 rows × 59 columns

  • To detect informative inputs in this case, we use ANOVA test on the set of inputs.
from scipy.stats import f_oneway
imp_var = []
very_imp = []
for va in data.columns[1:-1]:
  val = f_oneway(data.loc[data['type'] == "spam", va], data.loc[data['type'] == "nonspam", va])[1]
  if val < 0.01:
    imp_var.append(va)
  if val < 1e-10:
    very_imp.append(va)
print(f"Number of important variables: {len(imp_var)}")
print(f"Number of very important variables: {len(very_imp)}")
Number of important variables: 54
Number of very important variables: 43
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
X, y = data[imp_var], LabelEncoder().fit_transform(data['type'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

lgr0 = LogisticRegression(max_iter=1000)
lgr0 = lgr0.fit(X_train, y_train)
y_pred0 = lgr0.predict(X_test)

# Test spam
test_spam = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred0),
          'Precision': precision_score(y_test, y_pred0),
          'Recall': recall_score(y_test, y_pred0),
          'F1-score': f1_score(y_test, y_pred0),
          'AUC': roc_auc_score(y_test, y_pred0)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Imp_model"])
test_spam
Accuracy Precision Recall F1-score AUC
Imp_model 0.931596 0.928571 0.895317 0.911641 0.925257
  • Logistic regression on full inputs
X, y = data.iloc[:,1:-1], LabelEncoder().fit_transform(data['type'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

lgr1 = LogisticRegression(max_iter=1000)
lgr1 = lgr1.fit(X_train, y_train)
y_pred_full = lgr1.predict(X_test)

# Test spam
test_spam = pd.concat([test_spam, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred_full),
          'Precision': precision_score(y_test, y_pred_full),
          'Recall': recall_score(y_test, y_pred_full),
          'F1-score': f1_score(y_test, y_pred_full),
          'AUC': roc_auc_score(y_test, y_pred_full)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=["Full_model"])])
test_spam
Accuracy Precision Recall F1-score AUC
Imp_model 0.931596 0.928571 0.895317 0.911641 0.925257
Full_model 0.929425 0.920904 0.898072 0.909344 0.923946

This suggest that model with only variables that passed ANOVA test are better. We will use only these variables for the later models.

  • Polynomial features
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures
X, y = data.iloc[:,1:-1], LabelEncoder().fit_transform(data['type'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

for degree in range(2, 4):
  poly = PolynomialFeatures(degree=degree, include_bias=False)
  X_train_poly = poly.fit_transform(X_train)
  X_test_poly = poly.transform(X_test)
  scaler = StandardScaler()
  X_train_poly = scaler.fit_transform(X_train_poly)
  X_test_poly = scaler.transform(X_test_poly)
  model = LogisticRegression(max_iter=500)
  model = model.fit(X_train_poly, y_train)
  y_pred_poly = model.predict(X_test_poly)
  test_spam = pd.concat([test_spam, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred_poly),
          'Precision': precision_score(y_test, y_pred_poly),
          'Recall': recall_score(y_test, y_pred_poly),
          'F1-score': f1_score(y_test, y_pred_poly),
          'AUC': roc_auc_score(y_test, y_pred_poly)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=[f"Poly{degree}_model"])], axis=0)
test_spam
Accuracy Precision Recall F1-score AUC
Imp_model 0.931596 0.928571 0.895317 0.911641 0.925257
Full_model 0.929425 0.920904 0.898072 0.909344 0.923946
Poly2_model 0.921824 0.903047 0.898072 0.900552 0.917674
Poly3_model 0.933768 0.919444 0.911846 0.915629 0.929937
from sklearn.model_selection import cross_val_score
import numpy as np
scaler = StandardScaler()
# let's fix degree
deg = 3
poly = PolynomialFeatures(degree=deg, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_train_poly = scaler.fit_transform(X_train_poly)
X_test_poly = poly.fit_transform(X_test)
X_test_poly = scaler.transform(X_test_poly)
alphas = 2 ** np.linspace(-10, 13, 20)
losses = []

for alpha in alphas:
  model = LogisticRegression(penalty="l2", C=1/alpha, max_iter=500, class_weight="balanced")
  loss = -cross_val_score(model, X_train_poly, y_train, cv=5, scoring="neg_log_loss")
  losses.append(np.mean(loss))
  print(f"Alpha: {alpha}, Loss: {np.mean(loss)}")
Alpha: 0.0009765625, Loss: 1.7812730271312869
Alpha: 0.002259980932192813, Loss: 1.5698581329062695
Alpha: 0.005230094145408093, Loss: 1.2017006732303417
Alpha: 0.012103590955208248, Loss: 0.9853968348171029
Alpha: 0.028010378004307994, Loss: 0.9053187054956029
Alpha: 0.06482219027788698, Loss: 0.7580831728163
Alpha: 0.15001283994726292, Loss: 0.6492299724675931
Alpha: 0.3471627856536638, Loss: 0.5361732012192744
Alpha: 0.8034112265668824, Loss: 0.45009350383578256
Alpha: 1.8592707100168133, Loss: 0.3777216854210937
Alpha: 4.302762344880728, Loss: 0.3144187535322351
Alpha: 9.957540715712186, Loss: 0.264672260673909
Alpha: 23.043944600620158, Loss: 0.22985830175987476
Alpha: 53.328768409506836, Loss: 0.21342469353625498
Alpha: 123.41452773663924, Loss: 0.20651320425316802
Alpha: 285.6084269469554, Loss: 0.2093283391507102
Alpha: 660.9608693490712, Loss: 0.2222600484958329
Alpha: 1529.609176734192, Loss: 0.24551337260602762
Alpha: 3539.8528748814592, Loss: 0.2797027402203298
Alpha: 8192.0, Loss: 0.3244312135839792
opt_alpha = alphas[np.argmin(losses)]
print(f"Optimal alpha: {opt_alpha}")
ax = sns.lineplot(x=alphas, y=losses)
ax.vlines(x=opt_alpha, ymin=min(losses), ymax=max(losses), colors='red', linestyles="dashed")
ax.set_xscale("log")
ax.set_xlabel("alpha")
ax.set_ylabel("CV_MSE")
plt.show()

lgr_reg = LogisticRegression(penalty="l2", C=1/opt_alpha, max_iter=1000)
lgr_reg = lgr_reg.fit(X_train_poly, y_train)
y_pred_poly = lgr_reg.predict(X_test_poly)

test_spam = pd.concat([test_spam, pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_pred_poly),
          'Precision': precision_score(y_test, y_pred_poly),
          'Recall': recall_score(y_test, y_pred_poly),
          'F1-score': f1_score(y_test, y_pred_poly),
          'AUC': roc_auc_score(y_test, y_pred_poly)},
    columns=["Accuracy", "Precision", "Recall", "F1-score", "AUC"],
    index=[f"Poly{deg}_Ridge_model"])], axis=0)
test_spam
Optimal alpha: 53.328768409506836

Accuracy Precision Recall F1-score AUC
Imp_model 0.931596 0.928571 0.895317 0.911641 0.925257
Full_model 0.929425 0.920904 0.898072 0.909344 0.923946
Poly2_model 0.981542 0.960106 0.994490 0.976996 0.983804
Poly3_model 0.967427 0.937008 0.983471 0.959677 0.970230
Poly2_Ridge_model 0.925081 0.922414 0.884298 0.902954 0.917955

2. Multiple Logistic Regression

In this section, you will work with Mnist dataset. It can be imported using the following codes.

from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

import matplotlib.pyplot as plt
import numpy as np
digit = np.random.choice(train_images.shape[0], size=10)
_ , axs = plt.subplots(2,5, figsize=(9, 3))
for i in range(10):
  axs[i//5, i%5].imshow(train_images[digit[i]])
  axs[i//5, i%5].axis("off")
  axs[i//5, i%5].set_title(f"True label: {train_labels[digit[i]]}")
plt.tight_layout()
plt.show()
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step

  • Build Multinomial Logistic Regressoin to recognize images of testing digits of this dataset.
# Data preprocessing
M = np.max(train_images)
X, y = train_images.reshape(60000, 28*28)/M, train_labels
X_test, y_test = test_images.reshape(10000, 28*28)/M, test_labels

# Distribution of target
import seaborn as sns
sns.countplot(x=y)
plt.show()

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

mlr = LogisticRegression(max_iter=500)
mlr = mlr.fit(X, train_labels)
  • Evaluate its performance using suitable matrix and conclude.
y_hat = mlr.predict(X_test)

import pandas as pd
test_minst = pd.DataFrame(
    data={'Accuracy': accuracy_score(y_test, y_hat),
          'Precision': precision_score(y_test, y_hat, average="macro"),
          'Recall': recall_score(y_test, y_hat, average="macro"),
          'F1-score': f1_score(y_test, y_hat, average="macro")},
    columns=["Accuracy", "Precision", "Recall", "F1-score"],
    index=[f"Model"])
test_minst
Accuracy Precision Recall F1-score
Model 0.9258 0.924814 0.924713 0.924692

References

\(^{\text{📚}}\) Chapter 4, Introduction to Statistical Learning with R, James et al. (2021)..
\(^{\text{📚}}\) Chapter 2, The Elements of Statistical Learning, Hastie et al. (2008)..
\(^{\text{📚}}\) Friedman (1989).
\(^{\text{📚}}\) Heart Disease Dataset.
\(^{\text{📚}}\) Different Type of Correlation Metrics Used by Data Scientists, Ashray.