Lab 2: Logistic Regression

Course: ITM 390 004: Machine Learning
Lecturer: Sothea HAS, PhD


Objective: In this lab, you will apply what you have learned about logistic regression to real-world datasets. You will also go beyond what we have covered, including feature selection tests and feature engineering, to further improve model performance and compare the resulting models.


1. Logistic Regression on the Heart Disease Dataset

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. In this lab, we aim to pinpoint the most relevant risk factors of heart disease and to predict the overall risk using logistic regression.

We will explore the Heart Disease Dataset. Load the dataset into the environment.

import numpy as np
import pandas as pd
import kagglehub
# Download latest version
path = kagglehub.dataset_download("johnsmith88/heart-disease-dataset")
data = pd.read_csv(path + "/heart.csv")
data.head(5)
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 52 1 0 125 212 0 1 168 0 1.0 2 2 3 0
1 53 1 0 140 203 1 0 155 1 3.1 0 0 3 0
2 70 1 0 145 174 0 1 125 1 2.6 0 0 3 0
3 61 1 0 148 203 0 1 161 0 0.0 2 1 3 0
4 62 0 0 138 294 1 1 106 0 1.9 1 3 2 0

A. General view of the dataset.

  • What’s the dimension of the dataset?
  • How many qualitative and quantitative variables are there in this dataset (answer this question carefully! Some qualitative variables may be encoded using numerical values)?
  • Convert variables to suitable data types if any of them are inconsistently typed.
# To do
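
Hint: a minimal sketch is given below; the list of qualitative columns is an assumption based on the preview above and should be checked against the dataset description.

print(data.shape)        # dimension of the dataset: (rows, columns)
print(data.dtypes)       # every column is read as numeric by default

# Assumed qualitative columns (verify this list yourself!)
cat_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']
data[cat_cols] = data[cat_cols].astype('category')
print(data.dtypes)       # the qualitative columns now have the 'category' dtype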

B. Univariate Analysis.

  • Compute summary statistics and visualize the distribution of the target and the inputs according to their types.
  • Are there any missing values? Duplicate data? Outliers?
  • Address or handle the above problems.
# To do
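
Hint: the sketch below shows one possible way to start; the plotted columns ('age', 'chol') are illustrative choices, and dropping duplicates is only one of several reasonable ways to handle them.

import matplotlib.pyplot as plt
import seaborn as sns

print(data.describe())                       # summary of the quantitative variables
print(data.isna().sum())                     # missing values per column
print("Duplicated rows:", data.duplicated().sum())
data = data.drop_duplicates()                # one simple way to handle duplicates

sns.countplot(data=data, x='target')         # distribution of the target
plt.show()
sns.histplot(data=data, x='age', bins=20)    # repeat for the other quantitative inputs
plt.show()
sns.boxplot(data=data, x='chol')             # boxplots help spot outliers
plt.show()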

C. Bivariate Analysis.

  • Explore the relationship between each input and the target, as well as among the inputs, using statistics and visualizations suited to their types (see the hint below).
# To do
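
Hint: one possible bivariate analysis is sketched below; the chosen columns ('thalach', 'cp') are illustrative and should be extended to the other inputs.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Quantitative input vs target: compare the distribution in each class
sns.boxplot(data=data, x='target', y='thalach')
plt.show()

# Correlation among the quantitative inputs
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, fmt='.2f')
plt.show()

# Qualitative input vs target: cross-tabulation of proportions
print(pd.crosstab(data['cp'], data['target'], normalize='index'))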

D. Building Logistic Regression Models

  • Split the data into 80%-20% training and testing data.
  • Build a logistic regression model on the training data, then compute its performance on the test data using suitable metrics.
  • Comment on your findings.
  • Try logistic regression with polynomial features. Compute its performance on the test data and compare it to the previous result.
  • Apply regularization methods and evaluate their performances on the test data.
# To do
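
Hint: a minimal sketch is given below; the one-hot encoding step, random_state, and the value of C are illustrative choices, not required ones.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X = pd.get_dummies(data.drop(columns=['target']), drop_first=True)
y = data['target'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000)        # L2-regularized by default
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))

# Regularization strength is controlled by C (smaller C = stronger penalty);
# an L1 penalty needs a compatible solver such as 'liblinear'.
model_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
model_l1.fit(X_train, y_train)
print("L1 accuracy:", model_l1.score(X_test, y_test))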

E. Polynomial features (Optional)

  • Apply 2nd-order polynomial features to all quantitative columns and combine them with the qualitative columns to form new inputs for the training and testing datasets.
  • Retrain and evaluate logistic regression on these new input features.
  • Conclude.
# To do
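
Hint: the sketch below builds the new input matrix; the list of quantitative columns is an assumption to verify against part A, and the model from part D can then be retrained on X_new.

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

quant_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']   # assumed quantitative columns
qual_cols = [c for c in data.columns if c not in quant_cols + ['target']]

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(poly.fit_transform(data[quant_cols]),
                      columns=poly.get_feature_names_out(quant_cols),
                      index=data.index)
X_new = pd.concat([X_poly,
                   pd.get_dummies(data[qual_cols].astype('category'), drop_first=True)],
                  axis=1)

# X_new can now replace X in the split/fit/evaluate steps of part D.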

2. Logistic Regression on the Spam Dataset

The spam dataset contains the frequencies of some common words in each email together with its class (‘spam’ or ‘nonspam’). The following code imports this data into our environment.

import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data = data.drop(columns=['Id'])
data.head(5)
make address all num3d our over remove internet order mail ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 spam
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 spam
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 spam
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 spam
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 spam

5 rows × 58 columns

  • Inspect the dataset for missing values and compute the proportions of spam and nonspam emails.
  • Split the data and train a logistic regression model to identify spam and nonspam emails.
  • Evaluate the model performance on the test data using suitable metrics: accuracy, recall, precision, F1 score.
# To do
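
Hint: a minimal sketch is shown below; random_state is an illustrative choice.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

print("Missing values:", data.isna().sum().sum())
print(data['type'].value_counts(normalize=True))    # proportions of spam / nonspam

X = data.drop(columns=['type'])
y = (data['type'] == 'spam').astype(int)            # 1 = spam, 0 = nonspam
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=5000)             # many features, so allow more iterations
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))   # accuracy, precision, recall, F1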

Further readings