Lab4: Naive Bayes Classifier

Course: M2-DAS: Advanced Machine Learning
Lecturer: Dr. Sothea HAS


Objective: In this lab, you will learn how to build a Naive Bayes Classifier (NBC) and a binary logistic regression model to predict heart failure in patients. You will also learn to detect informative features in order to get the most out of the constructed models, and you will see that quantitative features are not always the most important ones in building a good predictive model: all types of data have to be treated carefully.


Introduction to Heart Failure Prediction

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

The following Heart Failure dataset was obtained by combining 5 different heart disease datasets. It consists of 11 features and a target column indicating each patient's heart disease status. We will build a classification model to predict this status.

We will explore the Kaggle Heart Failure dataset. First, load the dataset into the environment.

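import os

import kagglehub
import pandas as pd

# Download the latest version of the dataset (requires internet access)
path = kagglehub.dataset_download("fedesoriano/heart-failure-prediction")
# Build the file path in an OS-independent way, then read the CSV
data = pd.read_csv(os.path.join(path, "heart.csv"))
data.head()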
   Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
0   40    M            ATA        140          289          0      Normal    172               N      0.0        Up             0
1   49    F            NAP        160          180          0      Normal    156               N      1.0      Flat             1
2   37    M            ATA        130          283          0          ST     98               N      0.0        Up             0
3   48    F            ASY        138          214          0      Normal    108               Y      1.5      Flat             1
4   54    M            NAP        150          195          0      Normal    122               N      0.0        Up             0

1. Univariate Analysis: Preprocessing & Data Analysis

  • Check whether any columns have an inappropriate data type, and convert them if so.
  • Compute descriptive statistics of each column. Do you observe anything strange? Handle what seems to be the problem properly.
  • Are there any NaN or NA (missing) values in this dataset?
  • Are there potential outliers?
  • Are there any duplicated observations?
# To do
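
As a starting point, the checks listed above might be sketched as follows. This is only one possible approach, not the expected solution. The frame toy holds a few columns of the five rows printed by data.head(), plus two hypothetical rows (a duplicate and a record with Cholesterol equal to 0) that were added purely so that each check has something to find. Treating a zero cholesterol value as missing is an assumption that you should verify on the real data.

```python
import pandas as pd

# Rows 0-4 come from data.head() above; rows 5-6 are HYPOTHETICAL additions
# (a duplicate of row 4 and a zero-cholesterol record) used for illustration.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 54, 60],
    "Sex": ["M", "F", "M", "F", "M", "M", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ASY"],
    "RestingBP": [140, 160, 130, 138, 150, 150, 120],
    "Cholesterol": [289, 180, 283, 214, 195, 195, 0],
    "FastingBS": [0, 0, 0, 0, 0, 0, 1],
    "HeartDisease": [0, 1, 0, 1, 0, 0, 1],
})

# 1. Data types: 0/1 codes and labels are categories, not quantities.
for col in ["Sex", "ChestPainType", "FastingBS", "HeartDisease"]:
    toy[col] = toy[col].astype("category")

# 2. Descriptive statistics of the remaining numeric columns.
summary = toy.describe()

# 3. ASSUMPTION: a cholesterol of 0 is a placeholder for "not measured",
#    so it is turned into NaN instead of being kept as a real value.
toy["Cholesterol"] = toy["Cholesterol"].mask(toy["Cholesterol"] == 0)

# 4. Missing values per column and number of duplicated rows.
n_missing = toy.isna().sum()
n_duplicates = int(toy.duplicated().sum())
toy = toy.drop_duplicates().reset_index(drop=True)

# 5. Potential outliers in one column with the 1.5 * IQR rule.
q1, q3 = toy["RestingBP"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["RestingBP"] < q1 - 1.5 * iqr) | (toy["RestingBP"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

print(summary)
print(n_missing)
print("duplicates:", n_duplicates, "| potential outliers:", len(outliers))
```

On the real dataset, the same calls can be applied to data directly. How to handle the suspicious values (drop, impute, or keep) is a modelling decision left to you.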

2. Bivariate Analysis: Exploratory Data Analysis & Important Feature Detection

  • Visualize the relationship between each quantitative column and HeartDisease. Note the columns that seem to be related to the target.

  • Visualize the relationship between each qualitative column and the target. Note any qualitative columns that seem to be related to the target.

# To do
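
Before plotting, it can help to compute the numbers that the plots would display. The sketch below is an illustration only. It uses a few columns of the five rows printed by data.head(), so the values it produces say nothing about the full dataset. The variable names quan_cols, qual_cols, group_means and rate_by_level are hypothetical.

```python
import pandas as pd

# A few columns of the five rows shown by data.head() above.
head = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "MaxHR": [172, 156, 98, 108, 122],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0],
    "Sex": ["M", "F", "M", "F", "M"],
    "ExerciseAngina": ["N", "N", "N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Up", "Flat", "Up"],
    "HeartDisease": [0, 1, 0, 1, 0],
})
quan_cols = ["Age", "MaxHR", "Oldpeak"]
qual_cols = ["Sex", "ExerciseAngina", "ST_Slope"]

# Quantitative vs. target: compare the mean of each column per class.
# A clear gap between the two rows hints at a useful feature.
group_means = head.groupby("HeartDisease")[quan_cols].mean()

# Qualitative vs. target: share of each class within every level.
# Rows sum to 1; levels with very different shares are informative.
rate_by_level = {
    col: pd.crosstab(head[col], head["HeartDisease"], normalize="index")
    for col in qual_cols
}

print(group_means)
for col, table in rate_by_level.items():
    print(table)
```

The same comparisons can then be drawn, for example as boxplots per class for the quantitative columns and as grouped bar charts for the qualitative ones.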

3. Naive Bayes Classifier

  • In the following code, train_test_split is imported from sklearn.model_selection. Use it to split the dataset into 80% training data (for constructing the model) and 20% testing data (for evaluating model performance).
# To do
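
A minimal sketch of the 80/20 split is given below. It runs on a small synthetic frame (demo, with made-up values) so that it is self-contained; on the lab data you would pass the columns of data instead. The stratify and random_state arguments are optional choices, not requirements of the lab: the first keeps the class proportions equal in both parts, and the second makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# HYPOTHETICAL synthetic data: 100 rows, balanced target, random features.
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "Age": rng.integers(30, 80, size=n),
    "MaxHR": rng.integers(80, 200, size=n),
    "HeartDisease": np.repeat([0, 1], n // 2),
})

X = demo.drop(columns="HeartDisease")  # features
y = demo["HeartDisease"]               # target

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```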
  • Build NBC models on the 80% training data using:
    • Only quantitative columns (name it nbc_quan)
    • Only qualitative columns (name it nbc_qual). Hint: you should encode the categorical columns using LabelEncoder from sklearn.preprocessing.
    • All columns (name it nbc_full). Hint: encode the categorical columns using one-hot encoding so that GaussianNB can be applied to the encoded features.
    • Your selected columns from the previous analysis step (name it nbc_analysis).
  • Construct a confusion matrix for each of the four models and compute the following metrics on the 20% testing data:
    • Accuracy
    • Precision
    • Recall
    • F1-score
    • AUC
  • Which model seems to be the most promising one? Why do you think this is the case?
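
The four models requested above could be set up along the following lines. This is a sketch under several assumptions, not the reference solution. The data frame demo is synthetic (made-up values with only two quantitative and two qualitative columns), the pairing of CategoricalNB with the label-encoded columns is one reasonable choice that the lab text does not prescribe, and the list selected is a placeholder for whatever columns your own analysis singles out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.preprocessing import LabelEncoder

# HYPOTHETICAL synthetic stand-in for the heart data (not real patients).
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "Age": rng.normal(50 + 5 * y, 9).round(),
    "MaxHR": rng.normal(145 - 15 * y, 20).round(),
    "Sex": rng.choice(["M", "F"], size=n),
    "ST_Slope": np.where(rng.random(n) < 0.25 + 0.5 * y, "Flat", "Up"),
    "HeartDisease": y,
})
quan_cols = ["Age", "MaxHR"]
qual_cols = ["Sex", "ST_Slope"]
target = demo["HeartDisease"]

# Label-encode each qualitative column into integer codes (for nbc_qual).
X_qual = demo[qual_cols].copy()
for col in qual_cols:
    X_qual[col] = LabelEncoder().fit_transform(X_qual[col])
n_levels = [demo[col].nunique() for col in qual_cols]

# One-hot encode the qualitative columns and join them with the
# quantitative ones (for nbc_full and nbc_analysis).
X_full = pd.concat(
    [demo[quan_cols], pd.get_dummies(demo[qual_cols], dtype=float)], axis=1
)

# Split row positions once so that all models share the same 80/20 split.
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.2, stratify=target, random_state=42
)
y_train, y_test = target.iloc[idx_train], target.iloc[idx_test]

nbc_quan = GaussianNB().fit(demo[quan_cols].iloc[idx_train], y_train)
# min_categories guards against a level that is absent from the training part.
nbc_qual = CategoricalNB(min_categories=n_levels).fit(X_qual.iloc[idx_train], y_train)
nbc_full = GaussianNB().fit(X_full.iloc[idx_train], y_train)

# Placeholder selection: replace with the columns your analysis suggests.
selected = ["MaxHR", "ST_Slope_Flat", "ST_Slope_Up"]
nbc_analysis = GaussianNB().fit(X_full[selected].iloc[idx_train], y_train)

test_accuracy = {
    "nbc_quan": nbc_quan.score(demo[quan_cols].iloc[idx_test], y_test),
    "nbc_qual": nbc_qual.score(X_qual.iloc[idx_test], y_test),
    "nbc_full": nbc_full.score(X_full.iloc[idx_test], y_test),
    "nbc_analysis": nbc_analysis.score(X_full[selected].iloc[idx_test], y_test),
}
print(test_accuracy)
```

Note that applying GaussianNB to one-hot columns treats 0/1 indicators as if they were Gaussian. This follows the hint given for nbc_full, but it is an approximation worth keeping in mind when you compare the models.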
# To do
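
The metrics listed above can all be computed with functions from sklearn.metrics. The sketch below wraps them in a small helper, evaluate, which is a hypothetical name introduced here for illustration. The labels and probabilities are hand-made toy values, not results from the heart dataset. For a fitted model such as nbc_quan, the probabilities would come from its predict_proba method (second column, the probability of class 1).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)


def evaluate(y_true, y_prob, threshold=0.5):
    """Return the confusion matrix and five metrics for binary labels.

    y_prob holds the predicted probability of class 1. A threshold of 0.5
    matches what predict() does for a two-class Naive Bayes model.
    """
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        # Rows are true classes, columns are predicted: [[TN, FP], [FN, TP]].
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # AUC uses the probabilities themselves, not the thresholded labels.
        "auc": roc_auc_score(y_true, y_prob),
    }


# HYPOTHETICAL toy labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.9]

report = evaluate(y_true, y_prob)
for name, value in report.items():
    print(name, value, sep=": ")
```

On this toy input there are 3 true negatives, 1 false positive, 1 false negative and 3 true positives, so accuracy, precision, recall and F1-score are all 0.75, while the AUC is 11/16 = 0.6875. The difference shows why AUC is reported separately: it measures how well the probabilities rank the patients, independently of the threshold.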

Further Reading

📚 Pandas Python library: https://pandas.pydata.org/docs/getting_started/index.html#getting-started
📚 Pandas Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
📚 10 Minutes to pandas: https://pandas.pydata.org/docs/user_guide/10min.html
📚 Some Pandas lessons: https://www.kaggle.com/learn/pandas
📚 Chapter 4, An Introduction to Statistical Learning with Applications in R, James et al. (2021).
📚 Chapter 2, The Elements of Statistical Learning, Hastie et al. (2008).
📚 Friedman (1989).
📚 Heart Disease Dataset.
📚 Different Type of Correlation Metrics Used by Data Scientists, Ashray.