Objective: In this lab, you will learn how to build Naive Bayes Classifier (NBC) and Binary Logistic Regression models to predict heart failure in patients. You will also learn to detect informative features in order to maximize the potential of the constructed models, and you will see that quantitative features are not always the most important ones for building a good predictive model. Every type of data has to be treated carefully.
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.
The following Heart Failure dataset was obtained by combining 5 different heart disease datasets; it consists of 11 features and a target column indicating the heart disease status of each patient. We will build classification models to predict this status.
import kagglehub
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("fedesoriano/heart-failure-prediction")
print("Path to dataset files:", path)

data = pd.read_csv(path + "/heart.csv")
data.head()
Path to dataset files: C:\Users\hasso\.cache\kagglehub\datasets\fedesoriano\heart-failure-prediction\versions\1
   Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  MaxHR ExerciseAngina  Oldpeak ST_Slope  HeartDisease
0   40   M           ATA        140          289          0     Normal    172              N      0.0       Up             0
1   49   F           NAP        160          180          0     Normal    156              N      1.0     Flat             1
2   37   M           ATA        130          283          0         ST     98              N      0.0       Up             0
3   48   F           ASY        138          214          0     Normal    108              Y      1.5     Flat             1
4   54   M           NAP        150          195          0     Normal    122              N      0.0       Up             0
1. Univariate Analysis: Preprocessing & Data Analysis
Check whether any columns have an inappropriate data type and convert them if necessary.
Compute descriptive statistics for each column. Do you observe anything strange? Handle whatever seems problematic appropriately.
Are there any NaN or NA values in this dataset?
Are there any duplicated observations?
# To do
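A minimal sketch of these checks, assuming the dataframe is named data as above (the columns flagged in the comments are illustrative, not a complete answer):

# Data types: note that FastingBS and HeartDisease are stored as integers
# but are really binary categorical flags
print(data.dtypes)

# Descriptive statistics: look for physiologically implausible values,
# e.g. a resting blood pressure or cholesterol of 0
print(data.describe())

# Missing values per column
print(data.isna().sum())

# Number of duplicated rows
print(data.duplicated().sum())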
2. Bivariate Analysis: Exploratory Data Analysis & Important Feature Detection
Visualize the relationship between each quantitative column and HeartDisease. Note the ones that seem related to the target.
Visualize the relationship between each qualitative column and the target. Note any interesting (target-related) qualitative columns.
# To do
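One possible starting point, a sketch assuming matplotlib and seaborn are available (the column groupings are taken from the table above):

import matplotlib.pyplot as plt
import seaborn as sns

# Quantitative columns vs. target: boxplots reveal distributional shifts
quantitative = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]
fig, axes = plt.subplots(1, len(quantitative), figsize=(20, 4))
for ax, col in zip(axes, quantitative):
    sns.boxplot(data=data, x="HeartDisease", y=col, ax=ax)
plt.show()

# Qualitative columns vs. target: count plots split by HeartDisease
qualitative = ["Sex", "ChestPainType", "FastingBS", "RestingECG",
               "ExerciseAngina", "ST_Slope"]
fig, axes = plt.subplots(2, 3, figsize=(18, 8))
for ax, col in zip(axes.ravel(), qualitative):
    sns.countplot(data=data, x=col, hue="HeartDisease", ax=ax)
plt.show()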
3. Naive Bayes Classifier
In the following code, train_test_split is imported from sklearn.model_selection. It can be used to split the dataset into 80% training data (for constructing the model) and 20% testing data (for evaluating model performance).
# To do
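A sketch of the split described above (the random_state value is an arbitrary choice for reproducibility):

from sklearn.model_selection import train_test_split

X = data.drop(columns="HeartDisease")
y = data["HeartDisease"]

# 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)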
Build NBC models on the 80% training data using:
Only quantitative columns (name it nbc_quan)
Only qualitative columns (name it nbc_qual). Hint: you should encode the categorical columns using LabelEncoder from sklearn.preprocessing.
All columns (name it nbc_full). Hint: make sure you encode the categorical columns using one-hot encoding so that GaussianNB can be applied to the encoded features.
Your selected columns from the previous analysis step (name it nbc_analysis).
Construct confusion matrices for the four models and compute the following metrics on the 20% testing data:
Accuracy
Precision
Recall
F1-score
AUC
Which model seems the most promising? Why do you think this is the case?
# To do
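A sketch of one of the four variants, nbc_full, reusing X_train and y_train from the split above (the other three variants follow the same pattern with different column subsets):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# One-hot encode the qualitative columns so GaussianNB can consume them;
# reindex the test columns to match the training encoding
X_train_full = pd.get_dummies(X_train)
X_test_full = pd.get_dummies(X_test).reindex(
    columns=X_train_full.columns, fill_value=0)

nbc_full = GaussianNB().fit(X_train_full, y_train)
y_pred = nbc_full.predict(X_test_full)
y_prob = nbc_full.predict_proba(X_test_full)[:, 1]

print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))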
4. Binary Logistic Regression
We start with feature engineering:
Standardize the quantitative inputs
Perform one-hot encoding for all the qualitative variables.
Construct 4 Binary Logistic Regression models on the 80% training data using the same input options as in the previous section.
Measure their performance on the corresponding testing dataset.
Conclude.
# To do
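A sketch for the full-input model using an sklearn Pipeline; the Pipeline/ColumnTransformer organization is one possible way to combine standardization and one-hot encoding, not necessarily the lab's intended one:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

quantitative = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]
qualitative = ["Sex", "ChestPainType", "FastingBS", "RestingECG",
               "ExerciseAngina", "ST_Slope"]

# Standardize quantitative inputs, one-hot encode qualitative inputs
preprocess = ColumnTransformer([
    ("num", StandardScaler(), quantitative),
    ("cat", OneHotEncoder(handle_unknown="ignore"), qualitative),
])

logreg_full = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_train, y_train)

print("Test accuracy:", logreg_full.score(X_test, y_test))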
5. Beyond Original Features (optional)
If you think that's all the models can do, you are probably wrong, because the models have so far been constructed only on the original features. We can probably push the models' performance a little further by:
- For logistic regression, restricting/penalizing the parameters \(\beta_j\) further to prevent overfitting (i.e., the model becoming too flexible).
- Introducing new features by transforming the original ones. Be cautious, because these new features might weaken the interpretability of the models.
Tasks:
Penalty parameter C: Try varying the parameter \(C\), for example \(C=0.01\), as follows: LogisticRegression(C=0.01). Fit the model to the training data, then test its performance on the testing data.
Search for the best \(C\): Now try to search for the best \(C\) and report the performance on the test data of the model built with the optimal value of \(C\).
Quadratic features: generate quadratic features from selected columns, i.e., \(X_1, X_2, X_3 \to X_1^2, X_2^2, X_3^2, X_1X_2, X_1X_3, X_2X_3\) (see the sketch after this list). When more features are created, the model naturally becomes more flexible, so it is recommended to fine-tune the penalty parameter \(C\) in this case as well.
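A possible sketch for the last two tasks, combining a grid search over \(C\) with quadratic feature generation. It reuses logreg_full and the column list quantitative from the Section 4 sketch; the candidate grid and the use of PolynomialFeatures are assumptions, not the lab's original code:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

# Grid search for the best C on the training data (5-fold CV, AUC scoring)
param_grid = {"clf__C": np.logspace(-3, 2, 11)}
search = GridSearchCV(logreg_full, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
print("Best C:", search.best_params_)
print("Test AUC:", search.score(X_test, y_test))

# Quadratic features on the quantitative inputs:
# degree=2 keeps the linear terms and adds squares and pairwise products
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_quad = poly.fit_transform(X_train[quantitative])
X_test_quad = poly.transform(X_test[quantitative])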