Course: M2-DAS: Advanced Machine Learning Lecturer: Dr. Sothea HAS
Objective: In this lab, you will learn how to build NBC and Binary Logistic Regression models to predict heart failure in patients. You will also learn to detect informative features in order to get the most out of the constructed models, and you will see that quantitative features are not always the most important ones for building a good predictive model: every type of data has to be treated carefully.
The notebook of this TP can be downloaded here: Lab4_NBC.ipynb.
Introduction to Heart Failure Prediction
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.
The following Heart Failure dataset was obtained by combining 5 different heart disease datasets; it consists of 11 features and a target column indicating the heart disease status of each patient. We will build a classification model to predict the heart disease status of the patients.
1. Univariate Analysis: Preprocessing & Data Analysis
Check whether any columns have an inappropriate data type and convert them if necessary.
Compute descriptive statistics for each column. Do you observe anything strange? Handle any problem you find appropriately.
Are there any nan or NA values in this dataset?
Are there potential outliers?
Are there any duplicated observations?
# To do
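Below is a minimal sketch of these checks, assuming the data have been loaded into a pandas DataFrame named data (the file name heart.csv, the variable name, and the columns used in the outlier check are assumptions and may need to be adapted to your copy of the dataset):

```python
import pandas as pd

# Load the dataset (the file name is an assumption; adjust the path to your copy).
data = pd.read_csv("heart.csv")

# Data types: spot columns stored with an inappropriate dtype.
print(data.dtypes)

# Descriptive statistics of all columns (numeric and categorical).
print(data.describe(include="all"))

# Missing values (NaN/NA) per column.
print(data.isna().sum())

# Duplicated observations.
print("Duplicated rows:", data.duplicated().sum())

# Potential outliers: zero values of Cholesterol or RestingBP are physiologically
# implausible (column names assume the standard Heart Failure dataset).
print((data[["Cholesterol", "RestingBP"]] == 0).sum())
```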
2. Bivariate Analysis: Exploratory Data Analysis & Important Feature Detection
Visualize the relationship between each quantitative column and HeartDisease. Note those that seem to be related to the target.
Visualize the relationship between each qualitative column and the target. Note any interesting qualitative columns (i.e., those related to the target).
# To do
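As an illustration, the sketch below uses box plots for the quantitative columns and count plots for the qualitative ones; the use of seaborn and the automatic selection of columns by dtype are assumptions, and any equivalent plotting approach works:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Quantitative columns vs the target: box plots grouped by HeartDisease.
quant_cols = data.select_dtypes(include="number").columns.drop("HeartDisease")
for col in quant_cols:
    sns.boxplot(data=data, x="HeartDisease", y=col)
    plt.title(f"{col} by HeartDisease")
    plt.show()

# Qualitative columns vs the target: count plots split by HeartDisease.
qual_cols = data.select_dtypes(exclude="number").columns
for col in qual_cols:
    sns.countplot(data=data, x=col, hue="HeartDisease")
    plt.title(f"{col} by HeartDisease")
    plt.show()
```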
3. Naive Bayes Classifier
In the following code, train_test_split is imported from sklearn.model_selection. It can be used to split the dataset into 80% training data (for constructing the models) and 20% testing data (for evaluating model performance).
# To do
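A possible split, assuming the target column is HeartDisease; stratifying on the target and fixing random_state are choices made here for reproducibility, not requirements of the lab:

```python
from sklearn.model_selection import train_test_split

X = data.drop(columns="HeartDisease")
y = data["HeartDisease"]

# 80% training / 20% testing split, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```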
Build NBC models on the 80%-training data using (a sketch follows this list):
Only quantitative columns (name it nbc_quan)
Only qualitative columns (name it nbc_qual). Hint: you should encode the categorical columns using LabelEncoder from sklearn.preprocessing.
All columns (name it nbc_full). Hint: make sure you encode the categorical columns using one-hot encoding so that GaussianNB can be applied to the encoded features.
Your selected columns from the previous analysis step (name it nbc_analysis).
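One possible construction of the four models is sketched below, assuming the split from above. Using CategoricalNB for the label-encoded qualitative columns and the particular columns chosen for nbc_analysis are assumptions; the selected columns in particular should come from your own bivariate analysis:

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.preprocessing import LabelEncoder

quant_cols = X_train.select_dtypes(include="number").columns
qual_cols = X_train.select_dtypes(exclude="number").columns

# 1) Only quantitative columns.
nbc_quan = GaussianNB().fit(X_train[quant_cols], y_train)

# 2) Only qualitative columns: label-encode each column, then fit an NBC
#    (CategoricalNB is one option for label-encoded features).
encoders = {col: LabelEncoder().fit(X_train[col]) for col in qual_cols}
X_train_qual = X_train[qual_cols].apply(lambda s: encoders[s.name].transform(s))
X_test_qual = X_test[qual_cols].apply(lambda s: encoders[s.name].transform(s))
nbc_qual = CategoricalNB().fit(X_train_qual, y_train)

# 3) All columns: one-hot encode the qualitative columns, then fit GaussianNB.
X_train_full = pd.get_dummies(X_train, columns=list(qual_cols))
X_test_full = pd.get_dummies(X_test, columns=list(qual_cols))
X_test_full = X_test_full.reindex(columns=X_train_full.columns, fill_value=0)
nbc_full = GaussianNB().fit(X_train_full, y_train)

# 4) Columns selected from your analysis (these names are placeholders;
#    replace them with the features you found informative).
selected_cols = ["Oldpeak", "MaxHR", "ST_Slope_Up", "ExerciseAngina_Y"]
nbc_analysis = GaussianNB().fit(X_train_full[selected_cols], y_train)
```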
Construct the confusion matrix for each of the four models and compute the following metrics on the 20%-testing data (see the sketch after this list):
Accuracy
Precision
Recall
F1-score
AUC
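A sketch of the evaluation, assuming the fitted models and the test matrices defined above; the AUC is computed from the predicted probabilities of the positive class:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

models = {
    "nbc_quan": (nbc_quan, X_test[quant_cols]),
    "nbc_qual": (nbc_qual, X_test_qual),
    "nbc_full": (nbc_full, X_test_full),
    "nbc_analysis": (nbc_analysis, X_test_full[selected_cols]),
}

for name, (model, X_te) in models.items():
    y_pred = model.predict(X_te)
    y_proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    print(f"--- {name} ---")
    print(confusion_matrix(y_test, y_pred))
    print(f"Accuracy : {accuracy_score(y_test, y_pred):.3f}")
    print(f"Precision: {precision_score(y_test, y_pred):.3f}")
    print(f"Recall   : {recall_score(y_test, y_pred):.3f}")
    print(f"F1-score : {f1_score(y_test, y_pred):.3f}")
    print(f"AUC      : {roc_auc_score(y_test, y_proba):.3f}")
```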
Which model seems to be the most promising one? Why do you think this is the case?