Lab5: Logistic Regression

Course: M2-DAS: Advanced Machine Learning
Lecturer: Dr. Sothea HAS

Objective: In this lab, you will learn how to build a binary logistic regression model to predict heart failure in patients. You will also learn to identify informative features so that the models you build reach their full potential. You will see that quantitative features are not always the most important ones for a good predictive model, so every type of data must be treated carefully.

The notebook of this TP can be downloaded here: Lab5_Logistic_Regression.ipynb.

1. Heart Failure Prediction

We will work with the Kaggle Heart Failure Prediction dataset introduced in TP1. You may reuse the preprocessing steps from that lab.

import kagglehub
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("fedesoriano/heart-failure-prediction")
data = pd.read_csv(path + "/heart.csv")
data.head()

	Age	Sex	ChestPainType	RestingBP	Cholesterol	RestingECG	MaxHR	ExerciseAngina	Oldpeak	ST_Slope	HeartDisease
0	40	M	ATA	140	289	Normal	172	N	0.0	Up	0
1	49	F	NAP	160	180	Normal	156	N	1.0	Flat	1
2	37	M	ATA	130	283	ST	98	N	0.0	Up	0
3	48	F	ASY	138	214	Normal	108	Y	1.5	Flat	1
4	54	M	NAP	150	195	Normal	122	N	0.0	Up	0

1.1. Binary Logistic Regression

Split the data into \(80\%\) training and \(20\%\) testing data.
Start with feature transformation:
- Perform one-hot encoding for all the qualitative variables.
- Standardize all the inputs.
Construct four binary logistic regression models on the 80% training data using different sets of inputs:
- lg_quan: logistic regression using only quantitative inputs.
- lg_qual: logistic regression using only qualitative inputs (use one-hot encoding: pd.get_dummies()).
- lg_eda: logistic regression using your selected inputs.
- lg_full: logistic regression using all inputs.
Measure their performance on the corresponding testing data. Compare the results to the result of NBC from the previous TP.
Comment on what you observe.
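
The steps above can be sketched as follows. This is only a minimal illustration, not the expected solution: the helper name fit_and_score, the random seed, and the small synthetic frame toy are all assumptions made so that the snippet runs offline. In the lab, pass the real `data` frame and your own column lists instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_score(df, features, target="HeartDisease", seed=42):
    """Hypothetical helper: fit a logistic regression on `features`
    and return its accuracy on a held-out 20% test split."""
    # One-hot encode qualitative columns; numeric columns pass through unchanged.
    X = pd.get_dummies(df[features], drop_first=True, dtype=float)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # The scaler is fitted on the training part only, then applied to the test part.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in for the heart data (illustration only, not real patients).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Age": rng.integers(30, 80, n),
    "MaxHR": rng.integers(80, 200, n),
    "Sex": rng.choice(["M", "F"], n),
    "ExerciseAngina": rng.choice(["N", "Y"], n),
})
signal = (
    0.05 * (toy["Age"] - 55)
    - 0.02 * (toy["MaxHR"] - 140)
    + 1.5 * (toy["ExerciseAngina"] == "Y")
)
toy["HeartDisease"] = (signal + rng.normal(0, 1, n) > 0.5).astype(int)

quan = ["Age", "MaxHR"]              # quantitative inputs
qual = ["Sex", "ExerciseAngina"]     # qualitative inputs
results = {
    "lg_quan": fit_and_score(toy, quan),
    "lg_qual": fit_and_score(toy, qual),
    "lg_full": fit_and_score(toy, quan + qual),
}
print(results)
```

Using the same seed for every model keeps the train/test split identical, so the accuracies are directly comparable. A model lg_eda would be obtained the same way by passing your own selected columns.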

# To do

1.2. Polynomial Features

Based on the EDA results, one may try to further improve model performance through feature engineering or by addressing the issues recorded in your problem list. Here, we will try feature engineering.

Tasks:

Quadratic features: Build a model by introducing quadratic features of some selected variables, i.e., \(X_1, X_2, X_3\to X_1^2, X_2^2, X_3^2, X_1X_2, X_1X_3, X_2X_3\). Test its performance on the test data.
Penalty parameter C: As more features are created, the model can become too flexible, so it is recommended to fine-tune the penalty parameter \(C\) as well.
- Random choice: Try a different value of \(C\), for example \(C=0.01\), as follows: LogisticRegression(C=0.01). Fit the model to the training data, then measure its performance on the testing data.
- Search for the best \(C\): Now search for the best value of \(C\), and report the test performance of the model built with this optimal value.
Compare the performance of all models.
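
One possible way to organize these tasks is sketched below, assuming scikit-learn's PolynomialFeatures and GridSearchCV are acceptable tools here (the lab does not prescribe them). The three inputs are synthetic stand-ins generated with make_classification, and the grid of candidate values for \(C\) is an arbitrary choice. Note that in scikit-learn \(C\) is the inverse of the regularization strength, so a smaller \(C\) means a stronger penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for three selected quantitative inputs X1, X2, X3.
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

pipe = Pipeline([
    # degree=2 adds the squares and pairwise products to the original inputs.
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("lg", LogisticRegression(max_iter=1000)),
])

# Random choice: a single value C = 0.01.
pipe.set_params(lg__C=0.01).fit(X_train, y_train)
acc_c001 = pipe.score(X_test, y_test)

# Search: 5-fold cross-validation on the training data over an assumed grid.
grid = GridSearchCV(pipe, {"lg__C": np.logspace(-3, 2, 6)}, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
best_C = grid.best_params_["lg__C"]
acc_best = grid.score(X_test, y_test)

print(f"C=0.01 test accuracy: {acc_c001:.3f}")
print(f"best C={best_C:g}, test accuracy: {acc_best:.3f}")
```

The search uses only the training data, so the test set stays untouched until the final measurement. Keeping the polynomial expansion and the scaler inside the pipeline means they are refitted within each cross-validation fold.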

# To do

2. Logistic Regression on Email Spam Dataset

The spam dataset contains the frequencies of some common words together with the class of each email (‘spam’ or ‘nonspam’). The following code imports the data into our environment.

import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data = data.drop(columns=['Id'])
data.head(5)

Inspect the dataset to find missing values and the proportions of spam and nonspam emails.
Split the data into training and testing parts.
Apply the techniques from the previous part to identify spam emails.
Evaluate model performance on the test data using suitable metrics: accuracy, recall, precision, and F1 score.
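
A minimal sketch of this workflow is given below. So that it runs without a download, it uses a small synthetic frame toy. The word columns free, money, and meeting and the label column name type are hypothetical and should be replaced by the actual column names of the spam data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the word-frequency data (illustration only).
rng = np.random.default_rng(1)
n = 600
is_spam = rng.random(n) < 0.4
toy = pd.DataFrame({
    "free": rng.exponential(0.2, n) + 0.8 * is_spam,
    "money": rng.exponential(0.1, n) + 0.5 * is_spam,
    "meeting": rng.exponential(0.3, n) * ~is_spam,
    "type": np.where(is_spam, "spam", "nonspam"),  # hypothetical label column
})

# Inspect: total number of missing cells and the class proportions.
n_missing = int(toy.isna().sum().sum())
class_share = toy["type"].value_counts(normalize=True)
print("missing values:", n_missing)
print(class_share)

# Encode the label as 1 = spam, 0 = nonspam, then split with stratification.
X = toy.drop(columns=["type"])
y = (toy["type"] == "spam").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(metrics)
```

When the classes are imbalanced, accuracy alone can be misleading, which is why precision, recall, and F1 score for the spam class are reported alongside it.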

Further Reading