Lab 2: Logistic Regression

Course: ITM 390 004: Machine Learning
Lecturer: Sothea HAS, PhD


Objective: In this lab, you will apply what you have learned about logistic regression to real-world datasets. You will also go beyond what we have covered, including feature selection tests and feature engineering, to further improve model performance and compare the resulting models.


1. Logistic Regression on the Heart Disease Dataset

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. In this lab, we aim to pinpoint the most relevant risk factors of heart disease and to predict the overall risk using logistic regression.

We will explore the Heart Disease Dataset. Load the dataset into the environment.

import numpy as np
import pandas as pd
import kagglehub
# Download latest version
path = kagglehub.dataset_download("johnsmith88/heart-disease-dataset")
data = pd.read_csv(path + "/heart.csv")
data.head(5)
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 52 1 0 125 212 0 1 168 0 1.0 2 2 3 0
1 53 1 0 140 203 1 0 155 1 3.1 0 0 3 0
2 70 1 0 145 174 0 1 125 1 2.6 0 0 3 0
3 61 1 0 148 203 0 1 161 0 0.0 2 1 3 0
4 62 0 0 138 294 1 1 106 0 1.9 1 3 2 0

A. General view of the dataset.

  • What’s the dimension of the dataset?
  • How many qualitative and quantitative variables are there in this dataset (answer this question carefully! Some qualitative variables may be encoded using numerical values)?
  • Convert variables to suitable data types if any of them are inconsistently typed.
# To do
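
Hint: a minimal sketch is given below; the list of qualitative columns is an assumption based on the preview above and should be checked against the dataset description.

print(data.shape)        # dimension of the dataset: (rows, columns)
print(data.dtypes)       # every column is read as numeric by default

# Assumed qualitative columns (verify this list yourself!)
cat_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']
data[cat_cols] = data[cat_cols].astype('category')
print(data.dtypes)       # the qualitative columns now have the 'category' dtype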

B. Univariate Analysis.

  • Compute summary statistics and visualize the distribution of the target and the inputs according to their types.
  • Are there any missing values? Duplicate data? Outliers?
  • Address or handle the above problems.
# To do
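
Hint: the sketch below shows one possible way to start; the plotted columns ('age', 'chol') are illustrative choices, and dropping duplicates is only one of several reasonable ways to handle them.

import matplotlib.pyplot as plt
import seaborn as sns

print(data.describe())                       # summary of the quantitative variables
print(data.isna().sum())                     # missing values per column
print("Duplicated rows:", data.duplicated().sum())
data = data.drop_duplicates()                # one simple way to handle duplicates

sns.countplot(data=data, x='target')         # distribution of the target
plt.show()
sns.histplot(data=data, x='age', bins=20)    # repeat for the other quantitative inputs
plt.show()
sns.boxplot(data=data, x='chol')             # boxplots help spot outliers
plt.show()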

C. Bivariate Analysis.

  • Explore the relationship between each input and the target, as well as among the inputs, using statistics and visualizations suited to their types (see the hint below).
# To do
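
Hint: one possible bivariate analysis is sketched below; the chosen columns ('thalach', 'cp') are illustrative and should be extended to the other inputs.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Quantitative input vs target: compare the distribution in each class
sns.boxplot(data=data, x='target', y='thalach')
plt.show()

# Correlation among the quantitative inputs
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, fmt='.2f')
plt.show()

# Qualitative input vs target: cross-tabulation of proportions
print(pd.crosstab(data['cp'], data['target'], normalize='index'))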

D. Building Logistic Regression Models

  • Split the data into 80%-20% training and testing data.
  • Build a logistic regression model on the training data, then compute its performance on the test data using suitable metrics.
  • Comment on your findings.
  • Try logistic regression with polynomial features. Compute its performance on the test data and compare it to the previous result.
  • Apply regularization methods and evaluate their performances on the test data.
# To do
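
Hint: a minimal sketch is given below; the one-hot encoding step, random_state, and the value of C are illustrative choices, not required ones.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X = pd.get_dummies(data.drop(columns=['target']), drop_first=True)
y = data['target'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000)        # L2-regularized by default
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))

# Regularization strength is controlled by C (smaller C = stronger penalty);
# an L1 penalty needs a compatible solver such as 'liblinear'.
model_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
model_l1.fit(X_train, y_train)
print("L1 accuracy:", model_l1.score(X_test, y_test))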

E. Polynomial features (Optional)

  • Apply 2nd-order polynomial features to all quantitative columns and combine them with the qualitative columns to form new inputs for the training and testing datasets.
  • Retrain and evaluate logistic regression on these new input features.
  • Conclude.
# To do
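
Hint: the sketch below builds the new input matrix; the list of quantitative columns is an assumption to verify against part A, and the model from part D can then be retrained on X_new.

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

quant_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']   # assumed quantitative columns
qual_cols = [c for c in data.columns if c not in quant_cols + ['target']]

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(poly.fit_transform(data[quant_cols]),
                      columns=poly.get_feature_names_out(quant_cols),
                      index=data.index)
X_new = pd.concat([X_poly,
                   pd.get_dummies(data[qual_cols].astype('category'), drop_first=True)],
                  axis=1)

# X_new can now replace X in the split/fit/evaluate steps of part D.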

2. Logistic Regression on the Spam Dataset

The spam dataset contains the frequencies of some common words in each email together with its class (‘spam’ or ‘nonspam’). The following code imports this data into our environment.

import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data = data.drop(columns=['Id'])
data.head(5)
make address all num3d our over remove internet order mail ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 spam
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 spam
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 spam
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 spam
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 spam

5 rows × 58 columns

  • Inspect the dataset for missing values and compute the proportions of spam and nonspam emails.
  • Split the data and train a logistic regression model to identify spam and nonspam emails.
  • Evaluate the model performance on the test data using suitable metrics: accuracy, recall, precision, F1 score.
# To do
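
Hint: a minimal sketch is shown below; random_state is an illustrative choice.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

print("Missing values:", data.isna().sum().sum())
print(data['type'].value_counts(normalize=True))    # proportions of spam / nonspam

X = data.drop(columns=['type'])
y = (data['type'] == 'spam').astype(int)            # 1 = spam, 0 = nonspam
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=5000)             # many features, so allow more iterations
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))   # accuracy, precision, recall, F1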

Further readings