Objective: In this lab, you will apply what you have learned about logistic regression to a real-world dataset. You will also go beyond what we have covered so far by trying feature selection tests and feature engineering to improve the model, then comparing the performance of the resulting models.
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. This lab aims to identify the most relevant risk factors of heart disease and to predict the overall risk using logistic regression.
import numpy as np
import pandas as pd
import kagglehub

# Download latest version
path = kagglehub.dataset_download("johnsmith88/heart-disease-dataset")
data = pd.read_csv(path + "/heart.csv")
data.head(5)
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   52    1   0       125   212    0        1      168      0      1.0      2   2     3       0
1   53    1   0       140   203    1        0      155      1      3.1      0   0     3       0
2   70    1   0       145   174    0        1      125      1      2.6      0   0     3       0
3   61    1   0       148   203    0        1      161      0      0.0      2   1     3       0
4   62    0   0       138   294    1        1      106      0      1.9      1   3     2       0
A. General view of the dataset.
What are the dimensions of the dataset?
How many qualitative and quantitative variables are there in this dataset (answer this question carefully! Some qualitative variables may be encoded using numerical values)?
Convert any variable whose stored type does not match its nature to a suitable data type.
# To do
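One possible starting point for part A is sketched below. It is only an illustration, not the expected solution: `demo` is a small made-up frame that reuses a few of the column names shown above, and the choice of which columns count as qualitative is an assumption that you should check against the dataset description.

```python
import pandas as pd

# Hypothetical stand-in for `data`: a few of the lab's column names, made-up rows.
demo = pd.DataFrame({
    "age": [52, 53, 70, 61, 62, 58],
    "sex": [1, 1, 1, 1, 0, 0],
    "cp": [0, 0, 0, 0, 0, 2],
    "chol": [212, 203, 174, 203, 294, 248],
    "oldpeak": [1.0, 3.1, 2.6, 0.0, 1.9, 1.0],
    "target": [0, 0, 0, 0, 0, 1],
})

# Dimension of the dataset as (number of rows, number of columns).
print(demo.shape)

# Assumption: these integer-coded columns are really categories.
qualitative = ["sex", "cp", "target"]
demo = demo.astype({col: "category" for col in qualitative})

# Everything that is not qualitative is treated as quantitative.
quantitative = [col for col in demo.columns if col not in qualitative]

print(demo.dtypes)
print("qualitative:", qualitative)
print("quantitative:", quantitative)
```

On the real data, `data.nunique()` can help spot numerically encoded qualitative variables, since they usually take only a handful of distinct values.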
B. Univariate Analysis.
Compute summary statistics and visualize the distribution of the target and the inputs according to their types.
Are there any missing values? Duplicate data? Outliers?
Visualize the relationship between each input and the target.
# To do
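The checks in part B can be sketched as follows. This is a minimal example on a made-up frame (`demo` and its values are hypothetical), and the 1.5 × IQR rule used to flag outliers is one common convention, not a method prescribed by the lab.

```python
import pandas as pd

# Hypothetical stand-in for `data`, with one repeated row and one extreme value.
demo = pd.DataFrame({
    "age": [52, 53, 70, 61, 62, 52],
    "chol": [212, 203, 174, 203, 564, 212],
    "target": [0, 0, 0, 1, 1, 0],
})

# Summary statistics of the numeric columns.
summary = demo.describe()
print(summary)

# Missing values per column and number of fully duplicated rows.
n_missing = demo.isna().sum()
n_duplicates = int(demo.duplicated().sum())
print(n_missing)
print("duplicated rows:", n_duplicates)

# Flag outliers in one quantitative column with the 1.5 * IQR rule.
q1 = demo["chol"].quantile(0.25)
q3 = demo["chol"].quantile(0.75)
iqr = q3 - q1
is_outlier = (demo["chol"] < q1 - 1.5 * iqr) | (demo["chol"] > q3 + 1.5 * iqr)
outliers = demo[is_outlier]
print(outliers)
```

For the visual part, histograms or boxplots suit quantitative columns and bar charts of `value_counts()` suit qualitative ones.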
D. Building Logistic Regression Models
Split the data into training (\(80\%\)) and testing (\(20\%\)) sets.
Build a logistic regression model on the training data then compute its performance on the test data using suitable metrics.
Comment on your findings.
Try logistic regression with polynomial features. Compute its performance on the test data and compare it to the previous result.
Apply regularization methods and evaluate their performances on the test data.
# To do
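The workflow of part D might look like the sketch below. Several things here are assumptions made for illustration: the data is synthetic (generated with `make_classification`) so that the example runs on its own, the split is stratified, the inputs are standardized, and accuracy and F1 are picked as example metrics. Regularization strength is varied through `C`, the inverse regularization strength of scikit-learn's `LogisticRegression`, whose default penalty is L2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the heart dataset.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# 80% training / 20% testing split, stratified on the target (an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Smaller C means stronger L2 regularization.
results = {}
for C in (0.01, 1.0, 100.0):
    model = make_pipeline(
        StandardScaler(), LogisticRegression(C=C, max_iter=1000)
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[C] = (accuracy_score(y_test, pred), f1_score(y_test, pred))

for C, (acc, f1) in results.items():
    print(f"C={C}: accuracy={acc:.3f}, F1={f1:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into the model.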
E. Polynomial features (Optional)
Apply 2nd-order polynomial features to all quantitative columns and combine them with the qualitative columns to form the new inputs of the training and testing datasets.
Retrain and evaluate logistic regression on these new input features.
Conclude.
# To do
2. Logistic Regression on Spam dataset
The spam dataset contains the frequencies of some common words along with the class of each observation (‘spam’ or ‘nonspam’). The following code imports this data into your environment.