Lab 4: Naive Bayes Classifier & Logistic Regression

Course: CSCI-866-001: Data Mining & Knowledge Discovery
Lecturer: Sothea HAS, PhD


Objective: In this lab, you will learn how to build Naive Bayes Classifier (NBC) and Binary Logistic Regression models to predict heart failure in patients. You will also learn to detect informative features that maximize the potential of the constructed models, and you will see that quantitative features are not always the most important ones for building a good predictive model: every type of data must be treated carefully.


Introduction to Heart Failure Prediction

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.

The Heart Failure dataset below was obtained by combining five different heart disease datasets; it consists of 11 features and a target column indicating the heart disease status of each patient. We will build classification models to predict this status.

We will explore the Kaggle Heart Failure Prediction dataset. Load it into the environment:

import kagglehub
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("fedesoriano/heart-failure-prediction")

print("Path to dataset files:", path)
data = pd.read_csv(path + "/heart.csv")
data.head()
Path to dataset files: C:\Users\hasso\.cache\kagglehub\datasets\fedesoriano\heart-failure-prediction\versions\1
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0

1. Univariate Analysis: Preprocessing & Data Analysis

  • Check whether any columns have an inappropriate data type and modify them if so.
  • Compute descriptive statistics for each column. Do you observe anything strange? Handle whatever seems to be the problem properly.
  • Are there any NaN or NA values in this dataset?
  • Are there any duplicated observations?
# To do
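A minimal sketch of these checks (treating the 0/1 flag columns FastingBS and HeartDisease as categorical is one reasonable choice, not the only one):

print(data.dtypes)                # inspect data types
# FastingBS and HeartDisease are 0/1 flags, so they can be treated as categorical
data["FastingBS"] = data["FastingBS"].astype("category")
data["HeartDisease"] = data["HeartDisease"].astype("category")

print(data.describe())            # look for impossible values, e.g. a resting
                                  # blood pressure or cholesterol of 0
print(data.isna().sum())          # missing values per column
print(data.duplicated().sum())    # number of duplicated rows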

2. Bivariate Analysis: Exploratory Data Analysis & Important Feature Detection

  • Visualize the relationship between each quantitative column and HeartDisease. Note those that seem related to the target.

  • Visualize the relationship between each qualitative column and the target. Note any qualitative columns that appear related to it.

# To do
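A minimal sketch using seaborn (assumed to be installed): boxplots for the quantitative columns and countplots for the qualitative ones.

import matplotlib.pyplot as plt
import seaborn as sns

quant_cols = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]
qual_cols = ["Sex", "ChestPainType", "FastingBS", "RestingECG",
             "ExerciseAngina", "ST_Slope"]

# Quantitative columns vs the target: distribution by HeartDisease status
fig, axes = plt.subplots(1, len(quant_cols), figsize=(18, 3))
for ax, col in zip(axes, quant_cols):
    sns.boxplot(data=data, x="HeartDisease", y=col, ax=ax)
plt.tight_layout()
plt.show()

# Qualitative columns vs the target: class counts within each category
fig, axes = plt.subplots(1, len(qual_cols), figsize=(18, 3))
for ax, col in zip(axes, qual_cols):
    sns.countplot(data=data, x=col, hue="HeartDisease", ax=ax)
plt.tight_layout()
plt.show()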

3. Naive Bayes Classifier

  • Use train_test_split from sklearn.model_selection to split the dataset into 80% training data (for constructing the models) and 20% testing data (for evaluating model performance).
# To do
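A minimal sketch of the split (stratifying on the target and fixing random_state are our own choices):

from sklearn.model_selection import train_test_split

X = data.drop(columns="HeartDisease")
y = data["HeartDisease"]

# 80%/20% split; stratify keeps the class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)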
  • Build NBC models on the 80% training data using:
    • Only the quantitative columns (name it nbc_quan).
    • Only the qualitative columns (name it nbc_qual). Hint: encode the categorical columns using LabelEncoder from sklearn.preprocessing.
    • All columns (name it nbc_full). Hint: one-hot encode the categorical columns so that GaussianNB can be applied to the encoded features.
    • The columns you selected in the previous analysis step (name it nbc_analysis).
  • Construct a confusion matrix for each of the four models and compute the following metrics on the 20% testing data:
    • Accuracy
    • Precision
    • Recall
    • F1-score
    • AUC
  • Which model seems to be the most promising? Why do you think this is the case?
# To do
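A minimal sketch for one of the four models (nbc_quan) and its test metrics; the other variants follow the same pattern on label-encoded or one-hot-encoded inputs (e.g. via pd.get_dummies):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

quant_cols = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

# Gaussian NBC on the quantitative columns only
nbc_quan = GaussianNB()
nbc_quan.fit(X_train[quant_cols], y_train)

y_pred = nbc_quan.predict(X_test[quant_cols])
y_prob = nbc_quan.predict_proba(X_test[quant_cols])[:, 1]  # P(HeartDisease = 1)

print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))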

4. Binary Logistic Regression

  • We start with feature engineering:
    • Standardize the quantitative inputs
    • Perform one-hot encoding for all the qualitative variables.
  • Construct 4 Binary Logistic Regression models on the 80% training data using the same input options as in the previous section.
  • Measure their performance on the corresponding testing dataset.
  • Conclude.
# To do
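A minimal sketch of the full-input model, assuming the split X_train/X_test from Section 3; the scaler is fit on the training data only, and the test dummies are aligned with the training columns:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

quant_cols = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

# Standardize the quantitative inputs
scaler = StandardScaler().fit(X_train[quant_cols])
X_train_quan = scaler.transform(X_train[quant_cols])
X_test_quan = scaler.transform(X_test[quant_cols])

# One-hot encode the qualitative inputs
X_train_qual = pd.get_dummies(X_train.drop(columns=quant_cols))
X_test_qual = pd.get_dummies(X_test.drop(columns=quant_cols))
X_test_qual = X_test_qual.reindex(columns=X_train_qual.columns, fill_value=0)

# Full model: scaled quantitative inputs + encoded qualitative inputs
X_train_full = np.hstack([X_train_quan, X_train_qual.values])
X_test_full = np.hstack([X_test_quan, X_test_qual.values])

logit_full = LogisticRegression(max_iter=1000).fit(X_train_full, y_train)
print("Test accuracy:", logit_full.score(X_test_full, y_test))

The quantitative-only, qualitative-only, and selected-feature models reuse the corresponding pieces of this pipeline.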

5. Beyond Original Features (optional)

If you think that is all the models can do, you are probably wrong: so far the models have been built only on the original features. We can probably push their performance a little further by:

  • Penalizing the parameters \(\beta_j\) of the logistic regression model further to prevent overfitting (otherwise the model may become too flexible).
  • Introducing new features by transforming the original ones. Be cautious: these new features might weaken the interpretability of the models.

Tasks:

  • Penalty parameter C: Try varying the parameter \(C\), for example \(C=0.01\), as follows: LogisticRegression(C=0.01). In scikit-learn, \(C\) is the inverse of the regularization strength, so smaller values penalize more strongly. Fit the model to the training data, then test its performance on the testing data.
  • Search for the best \(C\): Now search for the best \(C\) and report the test-data performance of the model built with the optimal value of \(C\).
  • Quadratic features: generate quadratic features from the selected features, i.e., \(X_1, X_2, X_3 \to X_1^2, X_2^2, X_3^2, X_1X_2, X_1X_3, X_2X_3\) (see the sketch below). When more features are created, the model naturally becomes more flexible, so it is recommended to fine-tune the penalty parameter \(C\) in this case as well.
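A combined sketch of the three tasks, assuming the matrices X_train_full/X_test_full from Section 4; the C grid and cv=5 are arbitrary choices, and note that PolynomialFeatures(degree=2) also keeps the original degree-1 features alongside the squares and products:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

# 1) A strongly penalized model (small C = strong penalty in scikit-learn)
logit_pen = LogisticRegression(C=0.01, max_iter=1000).fit(X_train_full, y_train)
print("Test accuracy (C=0.01):", logit_pen.score(X_test_full, y_test))

# 2) Cross-validated search for the best C on the training data
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
                    cv=5, scoring="accuracy")
grid.fit(X_train_full, y_train)
print("Best C:", grid.best_params_["C"])
print("Test accuracy (best C):", grid.score(X_test_full, y_test))

# 3) Quadratic features (squares and pairwise products), with C tuned again
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_full)
X_test_poly = poly.transform(X_test_full)

grid_poly = GridSearchCV(LogisticRegression(max_iter=2000),
                         param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},
                         cv=5, scoring="accuracy")
grid_poly.fit(X_train_poly, y_train)
print("Test accuracy (quadratic, best C):", grid_poly.score(X_test_poly, y_test))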

Further Reading

\(^{\text{πŸ“š}}\) Pandas Python library: https://pandas.pydata.org/docs/getting_started/index.html#getting-started
\(^{\text{πŸ“š}}\) Pandas Cheatsheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
\(^{\text{πŸ“š}}\) 10 Minutes to Pandas: https://pandas.pydata.org/docs/user_guide/10min.html
\(^{\text{πŸ“š}}\) Pandas Lessons: https://www.kaggle.com/learn/pandas
\(^{\text{πŸ“š}}\) Chapter 4, Introduction to Statistical Learning with R, James et al. (2021).
\(^{\text{πŸ“š}}\) Chapter 2, The Elements of Statistical Learning, Hastie et al. (2008).
\(^{\text{πŸ“š}}\) Friedman (1989).
\(^{\text{πŸ“š}}\) Heart Disease Dataset.
\(^{\text{πŸ“š}}\) Different Type of Correlation Metrics Used by Data Scientists, Ashray.