Lab4: Naive Bayes Classifier

Course: M2-DAS: Advanced Machine Learning
Lecturer: Dr. Sothea HAS

Objective: In this lab, you will learn how to build a Naive Bayes Classifier (NBC) and a binary logistic regression model to predict heart disease in patients. You will also learn to detect informative features in order to get the most out of these models. You will see that quantitative features are not always the most important ones for building a good predictive model, so every type of data must be treated carefully.

The notebook for this lab (TP) can be downloaded here: Lab4_NBC.ipynb.

Introduction to Heart Failure Prediction

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

The following Heart Failure dataset was obtained by combining 5 different heart disease datasets. It consists of 11 features and a target column indicating the heart disease status of each patient. We will build a classification model to predict that status.

We will explore the Kaggle Heart Failure Prediction dataset. Start by loading it into the environment.

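import os

import kagglehub
import pandas as pd

# Download the latest version of the dataset; the call returns the local folder path
path = kagglehub.dataset_download("fedesoriano/heart-failure-prediction")
# Read the CSV file from that folder and preview the first five rows
data = pd.read_csv(os.path.join(path, "heart.csv"))
data.head()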

	Age	Sex	ChestPainType	RestingBP	Cholesterol	RestingECG	MaxHR	ExerciseAngina	Oldpeak	ST_Slope	HeartDisease
0	40	M	ATA	140	289	Normal	172	N	0.0	Up	0
1	49	F	NAP	160	180	Normal	156	N	1.0	Flat	1
2	37	M	ATA	130	283	ST	98	N	0.0	Up	0
3	48	F	ASY	138	214	Normal	108	Y	1.5	Flat	1
4	54	M	NAP	150	195	Normal	122	N	0.0	Up	0

1. Univariate Analysis: Preprocessing & Data Analysis

Check whether any columns have an inappropriate data type, and fix them if so.
Compute descriptive statistics for each column. Do you observe anything strange? Handle any problem you find appropriately.
Are there any NaN or NA values in this dataset?
Are there potential outliers?
Are there any duplicated observations?
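The checks listed above could be sketched as follows. This is only one possible approach, not the expected solution. The toy frame reuses the five rows displayed earlier and adds two made-up rows (a duplicate and a zero cholesterol reading) so that every check has something to find. The helper name quality_report, the choice of columns in which a zero is treated as suspicious, the 1.5 * IQR rule, and the median imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: the five rows displayed above, plus two made-up rows (a duplicate
# and a zero cholesterol reading) added only to trigger the checks.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 54, 61],
    "Sex": ["M", "F", "M", "F", "M", "M", "M"],
    "RestingBP": [140, 160, 130, 138, 150, 150, 120],
    "Cholesterol": [289, 180, 283, 214, 195, 195, 0],
    "MaxHR": [172, 156, 98, 108, 122, 122, 110],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 2.0],
    "HeartDisease": [0, 1, 0, 1, 0, 0, 1],
})


def quality_report(df, positive_cols=("RestingBP", "Cholesterol")):
    """Collect basic univariate checks in one dictionary (hypothetical helper)."""
    quan = df.select_dtypes(include="number")
    # 1.5 * IQR rule: count the values outside the whiskers, per numeric column
    q1, q3 = quan.quantile(0.25), quan.quantile(0.75)
    iqr = q3 - q1
    outliers = ((quan < q1 - 1.5 * iqr) | (quan > q3 + 1.5 * iqr)).sum()
    return {
        "missing": df.isna().sum(),
        "duplicates": int(df.duplicated().sum()),
        # A zero in these columns is assumed to be a placeholder, not a measurement
        "zeros": {c: int((df[c] == 0).sum()) for c in positive_cols if c in df},
        "outliers": outliers,
    }


report = quality_report(toy)
print(report["duplicates"], report["zeros"])

# One possible treatment: drop duplicates, mark implausible zeros as missing,
# then impute with the median. Other choices are equally defensible.
clean = toy.drop_duplicates().copy()
clean["Cholesterol"] = clean["Cholesterol"].replace(0, np.nan)
clean["Cholesterol"] = clean["Cholesterol"].fillna(clean["Cholesterol"].median())
# The binary target is better treated as a category than as an integer
clean["HeartDisease"] = clean["HeartDisease"].astype("category")
```

On the real data, data.describe() and data.dtypes give the same information at a glance; the helper only gathers the counts in one place.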

# To do

2. Bivariate Analysis: Exploratory Data Analysis & Important Feature Detection

Visualize the relationship between each quantitative column and HeartDisease. Note those that seem to be related to the target.
Visualize the relationship between each qualitative column and the target. Note any qualitative columns that seem to be related to the target.
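Before plotting, it can help to compute the numbers that the plots would display. The sketch below is an illustration on the five rows shown earlier, not a full analysis: it compares class-conditional means for a few quantitative columns and builds a row-normalized contingency table for one qualitative column. The selection of columns is an assumption, and five rows are far too few to draw conclusions from.

```python
import pandas as pd

# The five rows displayed above, restricted to a few columns
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "MaxHR": [172, 156, 98, 108, 122],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0],
    "Sex": ["M", "F", "M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Up", "Flat", "Up"],
    "HeartDisease": [0, 1, 0, 1, 0],
})

quan_cols = ["Age", "MaxHR", "Oldpeak"]

# Quantitative vs target: compare the mean of each column within each class
class_means = toy.groupby("HeartDisease")[quan_cols].mean()

# Qualitative vs target: share of each class within every category
# (normalize="index" makes each row sum to 1)
slope_table = pd.crosstab(toy["ST_Slope"], toy["HeartDisease"], normalize="index")

print(class_means)
print(slope_table)
```

Boxplots or density plots per class are the natural visual counterpart of class_means, and grouped or stacked bar charts are the counterpart of slope_table.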

# To do

3. Naive Bayes Classifier

The function train_test_split from sklearn.model_selection can be used to split the dataset into 80% training data (for constructing the model) and 20% testing data (for testing model performance).
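A minimal sketch of such a split is given here on 100 made-up rows, so that it runs without the real file. The stratify argument and the random_state value are optional choices made for this illustration: stratifying keeps the class proportions the same in both parts, and fixing the seed makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (100 made-up rows, balanced classes)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(30, 80, size=100),
    "MaxHR": rng.integers(80, 200, size=100),
    "HeartDisease": np.repeat([0, 1], 50),
})

X = df.drop(columns="HeartDisease")  # features
y = df["HeartDisease"]               # target

# 80% training, 20% testing, with class proportions preserved in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```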

# To do

Build an NBC model on the 80% training data using:
- Only quantitative columns (name it nbc_quan).
- Only qualitative columns (name it nbc_qual). Hint: encode the categorical columns using LabelEncoder from sklearn.preprocessing.
- All columns (name it nbc_full). Hint: encode the categorical columns with one-hot encoding so that GaussianNB can be applied to the encoded features.
- The columns you selected in the previous analysis step (name it nbc_analysis).
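The first three variants could look like the sketch below, which runs on made-up training data. It follows the hints above for the encodings. The use of CategoricalNB for the label-encoded columns is an assumption, since the text does not name an estimator for nbc_qual. The column subsets and the synthetic data generation are likewise illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.preprocessing import LabelEncoder

# Made-up training data with the same kinds of columns as the heart dataset
rng = np.random.default_rng(1)
n = 200
y_train = rng.integers(0, 2, size=n)
train = pd.DataFrame({
    # numeric features whose means shift with the class
    "MaxHR": rng.normal(150 - 25 * y_train, 15),
    "Oldpeak": rng.normal(0.5 + 1.0 * y_train, 0.5),
    # categorical features that agree with the class about 80% of the time
    "ExerciseAngina": np.where(rng.random(n) < 0.8, y_train, 1 - y_train),
    "ST_Slope": np.where(rng.random(n) < 0.8, y_train, 1 - y_train),
})
train["ExerciseAngina"] = train["ExerciseAngina"].map({0: "N", 1: "Y"})
train["ST_Slope"] = train["ST_Slope"].map({0: "Up", 1: "Flat"})

quan_cols = ["MaxHR", "Oldpeak"]
qual_cols = ["ExerciseAngina", "ST_Slope"]

# nbc_quan: Gaussian likelihoods on the numeric columns only
nbc_quan = GaussianNB().fit(train[quan_cols], y_train)

# nbc_qual: integer codes from one LabelEncoder per column, then CategoricalNB
encoders = {c: LabelEncoder().fit(train[c]) for c in qual_cols}
X_qual = np.column_stack([encoders[c].transform(train[c]) for c in qual_cols])
nbc_qual = CategoricalNB().fit(X_qual, y_train)

# nbc_full: one-hot encode the categorical columns, then GaussianNB on everything
X_full = pd.get_dummies(train, columns=qual_cols, dtype=float)
nbc_full = GaussianNB().fit(X_full, y_train)
```

At test time the same fitted encoders should be reused, and the one-hot test matrix should be aligned to the training columns, for example with X_test_full.reindex(columns=X_full.columns, fill_value=0.0).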
Construct the confusion matrix for each of the four models and compute the following metrics on the 20% testing data:
- Accuracy
- Precision
- Recall
- F1-score
- AUC
Which model seems to be the most promising? Why do you think this is the case?
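The requested metrics are all available in sklearn.metrics. The sketch below wraps them in a small helper (the name evaluate is hypothetical) and applies it to hand-made labels and scores, chosen only so that the arithmetic is easy to check by hand. Note that AUC is computed from the predicted probability of the positive class, not from the hard predictions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)


def evaluate(y_true, y_pred, y_score):
    """Return the confusion matrix and the five requested metrics."""
    return {
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # AUC uses the score for class 1, not the thresholded labels
        "auc": roc_auc_score(y_true, y_score),
    }


# Hand-made example: 4 negatives followed by 4 positives
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.6, 0.3, 0.4, 0.7, 0.8, 0.9])
y_pred = (y_score >= 0.5).astype(int)  # threshold at 0.5

res = evaluate(y_true, y_pred, y_score)
print(res["confusion_matrix"])  # rows: true class, columns: predicted class
print({k: v for k, v in res.items() if k != "confusion_matrix"})
```

With a fitted model, y_pred would come from model.predict(X_test) and y_score from model.predict_proba(X_test)[:, 1]. Calling evaluate once per model and collecting the results in a pandas DataFrame makes the four models easy to compare side by side.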

# To do
