Objective: This practical session (TP) is designed to help you apply the concepts and techniques you have learned about logistic regression. You will analyze each input variable and its relationship to the target, which will guide you toward a suitable model for predicting that target.
import numpy as np
import pandas as pd
import kagglehub

# Download latest version
path = kagglehub.dataset_download("johnsmith88/heart-disease-dataset")
data = pd.read_csv(path + "/heart.csv")
data.head(5)
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   52    1   0       125   212    0        1      168      0      1.0      2   2     3       0
1   53    1   0       140   203    1        0      155      1      3.1      0   0     3       0
2   70    1   0       145   174    0        1      125      1      2.6      0   0     3       0
3   61    1   0       148   203    0        1      161      0      0.0      2   1     3       0
4   62    0   0       138   294    1        1      106      0      1.9      1   3     2       0
A. General view of the dataset.
Load the dataset into jupyter notebook.
What’s the dimension of the dataset?
How many qualitative and quantitative variables are there in this dataset (answer this question carefully! Some qualitative variables may be encoded using numerical values)?
Convert variables into their suitable data type if there are any inconsistent variable types.
import numpy as np
import pandas as pd

print(f'The dimension of the dataset: {data.shape}')
The dimension of the dataset: (1025, 14)
data.dtypes.to_frame().T
   age    sex    cp     trestbps  chol   fbs    restecg  thalach  exang  oldpeak  slope  ca     thal   target
0  int64  int64  int64  int64     int64  int64  int64    int64    int64  float64  int64  int64  int64  int64
quan_vars = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
qual_vars = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']
for i in quan_vars:
    data[i] = data[i].astype('float')
for i in qual_vars:
    data[i] = data[i].astype('category')
print(f'The number of categorical data is {data.select_dtypes(include="category").shape[1]}')
print(f'The number of quantitative data is {data.select_dtypes(include="number").shape[1]}')
The number of categorical data is 9
The number of quantitative data is 5
B. Univariate Analysis.
Compute summary statistics and visualize the distribution of the target and the inputs according to their types.
Are there any missing values? Duplicate data? Outliers?
Address or handle the above problems.
data.describe()
        age          trestbps     chol        thalach      oldpeak
count   1025.000000  1025.000000  1025.00000  1025.000000  1025.000000
mean      54.434146   131.611707   246.00000   149.114146     1.071512
std        9.072290    17.516718    51.59251    23.005724     1.175053
min       29.000000    94.000000   126.00000    71.000000     0.000000
25%       48.000000   120.000000   211.00000   132.000000     0.000000
50%       56.000000   130.000000   240.00000   152.000000     0.800000
75%       61.000000   140.000000   275.00000   166.000000     1.800000
max       77.000000   200.000000   564.00000   202.000000     6.200000
import seaborn as sns
import matplotlib.pyplot as plt

_, axs = plt.subplots(1, 5, figsize=(15, 3))
for var in quan_vars:
    sns.histplot(data=data, x=var, ax=axs[quan_vars.index(var)], kde=True)
    axs[quan_vars.index(var)].set_title(f"Histogram of {var}")
plt.tight_layout()
plt.show()
_, axs = plt.subplots(3, 3, figsize=(12, 9))
for var in qual_vars:
    ax = axs[qual_vars.index(var) // 3, qual_vars.index(var) % 3]
    sns.countplot(data=data, x=var, ax=ax, stat="proportion")
    ax.set_title(f"Countplot of {var}")
    ax.bar_label(ax.containers[0], fmt="%.2f")
plt.tight_layout()
plt.show()
data.isna().sum().to_frame().T
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0    0    0   0         0     0    0        0        0      0        0      0   0     0       0
id_duplicated = data.duplicated()
data.loc[id_duplicated, :].index
print(f"Target of duplicated rows: {data.loc[data.duplicated(), ['target']].value_counts()}")
print(f'Percentage of duplicated rows: {data.duplicated().sum() / data.shape[0] * 100:.2f}%')
Target of duplicated rows: target
1 362
0 361
Name: count, dtype: int64
Percentage of duplicated rows: 70.54%
data.drop_duplicates(inplace=True)
Since more than 70% of the rows are exact duplicates, we drop the duplicates and keep only one copy of each observation.
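The summary statistics above also hint at a few extreme values (for instance, a maximum chol of 564 and a maximum oldpeak of 6.2, far above the upper quartiles). As a minimal sketch of how potential outliers could be flagged, assuming the common 1.5×IQR convention (whether to cap or remove them is a judgment call, not something the dataset dictates):

# Sketch: flag potential outliers with the 1.5*IQR rule (a common convention, assumed here)
for var in quan_vars:
    q1, q3 = data[var].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (data[var] < q1 - 1.5 * iqr) | (data[var] > q3 + 1.5 * iqr)
    print(f"{var}: {mask.sum()} potential outliers")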
_, axs = plt.subplots(1, 5, figsize=(15, 3))
for var in quan_vars:
    sns.boxplot(data=data, y=var, x="target", hue="target", ax=axs[quan_vars.index(var)])
    axs[quan_vars.index(var)].set_title(f"{var} vs Heart disease")
plt.tight_layout()
plt.show()
_, axs = plt.subplots(2, 4, figsize=(15, 6))
for var in qual_vars:
    if var != "target":
        ax = axs[qual_vars.index(var) // 4, qual_vars.index(var) % 4]
        sns.countplot(data=data, x=var, hue="target", ax=ax)
        ax.set_title(f"Countplot of {var}")
        ax.bar_label(ax.containers[0], fmt="%.2f")
        ax.bar_label(ax.containers[1], fmt="%.2f")
plt.tight_layout()
plt.show()
Remark: You should try to understand the differences between these two types of correlation, as they help guide you toward the correct transformation of the inputs for model development. At the end of this step, you should have a strong intuition about the most impactful inputs for building the model and about how to handle the inputs before building models.
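The exact association measures used in the original analysis are not shown in this extract; as one minimal sketch (assuming the cleaned data, quan_vars and qual_vars from above), the point-biserial correlation can quantify the link between each quantitative input and the binary target, while a chi-square test of independence can do the same for each qualitative input:

from scipy.stats import pointbiserialr, chi2_contingency

target = data['target'].astype(int)

# Quantitative inputs vs the binary target: point-biserial correlation
for var in quan_vars:
    r, p = pointbiserialr(target, data[var])
    print(f"{var}: point-biserial r = {r:.3f} (p = {p:.3g})")

# Qualitative inputs vs the target: chi-square test of independence
for var in qual_vars:
    if var != 'target':
        chi2, p, _, _ = chi2_contingency(pd.crosstab(data[var], target))
        print(f"{var}: chi2 = {chi2:.2f} (p = {p:.3g})")

Large absolute correlations and small p-values point to the most informative inputs.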
D. Building Logistic Regression Models
Split the data into \(80\%-20\%\) training and testing data.
from sklearn.model_selection import train_test_split

X, y = data.iloc[:, :-1], data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
Build a logistic regression model using only categorical variables on the training data then compute its performance on the test data using suitable metrics.
Based on the EDA step, we found that the qualitative variables are more related to the target than some of the numerical variables. When building models, we indeed observe that the model using only qualitative variables performs better than the model using only quantitative variables.
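A minimal sketch of how such a categorical-only model could be fitted and scored (the exact encoding and pipeline behind the results table reported later are not shown here; one-hot encoding and scikit-learn defaults are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# One-hot encode the categorical inputs (the target itself is excluded)
cat_cols = [c for c in qual_vars if c != 'target']
X_train_cat = pd.get_dummies(X_train[cat_cols], drop_first=True)
X_test_cat = pd.get_dummies(X_test[cat_cols], drop_first=True).reindex(columns=X_train_cat.columns, fill_value=0)
y_tr, y_te = y_train.astype(int), y_test.astype(int)

qual_model = LogisticRegression(max_iter=1000).fit(X_train_cat, y_tr)
y_pred = qual_model.predict(X_test_cat)
y_prob = qual_model.predict_proba(X_test_cat)[:, 1]

print(f"Accuracy : {accuracy_score(y_te, y_pred):.4f}")
print(f"Precision: {precision_score(y_te, y_pred):.4f}")
print(f"Recall   : {recall_score(y_te, y_pred):.4f}")
print(f"F1-score : {f1_score(y_te, y_pred):.4f}")
print(f"AUC      : {roc_auc_score(y_te, y_prob):.4f}")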
Try to study logistic regression using polynomial features. Compute its performance on the test data and compare to the previous result.
We apply polynomial features to the numerical variables only.
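A minimal sketch of how the degree-d models reported in the table below could be built (assuming X_train_cat / X_test_cat from the previous sketch and scikit-learn's default logistic regression settings; the helper name poly_test_accuracy is just for illustration):

from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def poly_test_accuracy(degree):
    # Expand only the numerical inputs, then append the encoded categorical inputs
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    scaler = StandardScaler()
    Xtr = np.column_stack([poly.fit_transform(X_train.select_dtypes(include="number")), X_train_cat])
    Xte = np.column_stack([poly.transform(X_test.select_dtypes(include="number")), X_test_cat])
    Xtr, Xte = scaler.fit_transform(Xtr), scaler.transform(Xte)
    model = LogisticRegression(max_iter=1000).fit(Xtr, y_tr)
    return accuracy_score(y_te, model.predict(Xte))

for d in range(2, 6):
    print(f"Degree {d}: test accuracy = {poly_test_accuracy(d):.4f}")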
It is very common that, with higher degrees, the model becomes too flexible and overfits the data. One can use regularization methods to control the flexibility of the model.
Apply regularization methods and evaluate their performances on the test data.
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scaler = StandardScaler()

# let's fix the degree
deg = 3
poly = PolynomialFeatures(degree=deg, include_bias=False)

# X_train_cat / X_test_cat are the encoded categorical inputs built earlier
X_train_poly = np.column_stack([poly.fit_transform(X_train.select_dtypes(include="number")), X_train_cat])
X_train_poly = scaler.fit_transform(X_train_poly)
X_test_poly = np.column_stack([poly.transform(X_test.select_dtypes(include="number")), X_test_cat])
X_test_poly = scaler.transform(X_test_poly)

alphas = 2 ** np.linspace(-10, 13, 20)
losses = []
for alpha in alphas:
    model = LogisticRegression(penalty="l2", C=1/alpha, max_iter=1000)
    model = model.fit(X_train_poly, y_train)
    loss = -cross_val_score(model, X_train_poly, y_train, cv=5, scoring="neg_log_loss")
    losses.append(np.mean(loss))
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_sag.py:349: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
warnings.warn(
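The cross-validated losses can then be used to pick the regularization strength and evaluate the corresponding Ridge model on the test set; a minimal sketch (refitting on the whole training set with the selected alpha is an assumption about the workflow):

# Select the alpha with the smallest cross-validated log-loss, then refit and evaluate
best_alpha = alphas[np.argmin(losses)]
ridge_model = LogisticRegression(penalty="l2", C=1/best_alpha, max_iter=1000)
ridge_model.fit(X_train_poly, y_train)
print(f"Best alpha: {best_alpha:.4f}")
print(f"Test accuracy (Poly3 + Ridge): {accuracy_score(y_test, ridge_model.predict(X_test_poly)):.4f}")

The table below summarizes the test performance of all the models considered.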
                   Accuracy  Precision    Recall  F1-score       AUC
Qual_model         0.836066   0.848485  0.848485  0.848485  0.834957
Quan_model         0.786885   0.833333  0.757576  0.793651  0.789502
Full_model         0.836066   0.848485  0.848485  0.848485  0.834957
EDA_model          0.836066   0.848485  0.848485  0.848485  0.834957
Poly2_model        0.836066   0.848485  0.848485  0.848485  0.834957
Poly3_model        0.836066   0.870968  0.818182  0.843750  0.837662
Poly4_model        0.819672   0.866667  0.787879  0.825397  0.822511
Poly5_model        0.819672   0.866667  0.787879  0.825397  0.822511
Poly6_model        0.786885   0.812500  0.787879  0.800000  0.786797
Poly7_model        0.803279   0.818182  0.818182  0.818182  0.801948
Poly8_model        0.803279   0.818182  0.818182  0.818182  0.801948
Poly9_model        0.786885   0.812500  0.787879  0.800000  0.786797
Poly10_model       0.770492   0.787879  0.787879  0.787879  0.768939
Poly11_model       0.754098   0.764706  0.787879  0.776119  0.751082
Poly12_model       0.737705   0.757576  0.757576  0.757576  0.735931
Poly13_model       0.737705   0.757576  0.757576  0.757576  0.735931
Poly14_model       0.737705   0.757576  0.757576  0.757576  0.735931
Poly15_model       0.737705   0.757576  0.757576  0.757576  0.735931
Poly3_Ridge_model  0.852459   0.875000  0.848485  0.861538  0.852814
Poly3_Lasso_model  0.836066   0.848485  0.848485  0.848485  0.834957
From these results, the degree-3 polynomial logistic regression with Ridge regularization is the best model, scoring higher than all the other models.
E. Try what you have done on the Spam dataset.
import pandas as pd

path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
   Id  make  address   all  num3d   our  over  remove  internet  order  ...  charSemicolon  charRoundbracket  charSquarebracket  charExclamation  charDollar  charHash  capitalAve  capitalLong  capitalTotal  type
0   1  0.00     0.64  0.64    0.0  0.32  0.00    0.00      0.00   0.00  ...           0.00             0.000                0.0            0.778       0.000     0.000       3.756           61           278  spam
1   2  0.21     0.28  0.50    0.0  0.14  0.28    0.21      0.07   0.00  ...           0.00             0.132                0.0            0.372       0.180     0.048       5.114          101          1028  spam
2   3  0.06     0.00  0.71    0.0  1.23  0.19    0.19      0.12   0.64  ...           0.01             0.143                0.0            0.276       0.184     0.010       9.821          485          2259  spam
3   4  0.00     0.00  0.00    0.0  0.63  0.00    0.31      0.63   0.31  ...           0.00             0.137                0.0            0.137       0.000     0.000       3.537           40           191  spam
4   5  0.00     0.00  0.00    0.0  0.63  0.00    0.31      0.63   0.31  ...           0.00             0.135                0.0            0.135       0.000     0.000       3.537           40           191  spam
5 rows × 59 columns
To detect informative inputs in this case, we apply a one-way ANOVA test to each input, comparing its values across the spam and non-spam groups.
from scipy.stats import f_oneway

imp_var = []
very_imp = []
for va in data.columns[1:-1]:
    val = f_oneway(data.loc[data['type'] == "spam", va],
                   data.loc[data['type'] == "nonspam", va])[1]
    if val < 0.01:
        imp_var.append(va)
    if val < 1e-10:
        very_imp.append(va)
print(f"Number of important variables: {len(imp_var)}")
print(f"Number of very important variables: {len(very_imp)}")
Number of important variables: 54
Number of very important variables: 43
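To repeat the same workflow on this dataset, a baseline could restrict the inputs to the variables flagged as important; a minimal sketch (the split, scaling, and metric choices here are assumptions, not the original solution):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Keep only the inputs flagged as important by the ANOVA test
X = data[imp_var]
y = (data['type'] == "spam").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

spam_model = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
print(f"Test accuracy: {accuracy_score(y_test, spam_model.predict(X_test_s)):.4f}")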
In this section, you will work with the MNIST dataset. It can be imported using the following code.
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

digit = np.random.choice(train_images.shape[0], size=10)
_, axs = plt.subplots(2, 5, figsize=(9, 3))
for i in range(10):
    axs[i // 5, i % 5].imshow(train_images[digit[i]])
    axs[i // 5, i % 5].axis("off")
    axs[i // 5, i % 5].set_title(f"True label: {train_labels[digit[i]]}")
plt.tight_layout()
plt.show()
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 0s 0us/step
Build a multinomial logistic regression model to recognize the test digit images of this dataset.
# Data preprocessing
M = np.max(train_images)
X, y = train_images.reshape(60000, 28 * 28) / M, train_labels
X_test, y_test = test_images.reshape(10000, 28 * 28) / M, test_labels

# Distribution of target
import seaborn as sns
sns.countplot(x=y)
plt.show()
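A minimal sketch of the multinomial model itself (the solver and iteration budget are assumptions; fitting on all 60,000 images can take a few minutes):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# With the lbfgs solver, scikit-learn fits a multinomial (softmax) model for multi-class targets
mnist_model = LogisticRegression(solver="lbfgs", max_iter=200)
mnist_model.fit(X, y)
print(f"Test accuracy: {accuracy_score(y_test, mnist_model.predict(X_test)):.4f}")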