TP3 - Data Preprocessing

Exploratory Data Analysis & Unsupervised Learning
Course: Dr. Sothea HAS


Objective: This lab will demonstrate the critical role of data preprocessing. You will learn to identify and resolve common data quality issues (e.g., missing values, inconsistencies, outliers) to ensure the validity and reliability of your final analysis.

The Jupyter Notebook for this TP can be downloaded here: TP3-Data-Preprocessing.


1. Missing Values

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.

The following Heart Failure dataset was obtained by combining 5 different heart disease datasets; it consists of 11 features and a target column indicating each patient's heart disease status. We will build a classification model to predict that status.

We will explore the Kaggle Heart Failure Dataset. Load the dataset into the environment.

a. Import this data and name it data.

import kagglehub
import pandas as pd

# To do
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0

b. What is the dimension of the dataset? Check for, and correct, any columns with an inappropriate data type.

# To do
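A minimal sketch of this step, using a few toy rows that mimic the Heart Failure columns (the values are illustrative only): `shape` gives the dimension, `dtypes` reveals columns whose stored type does not match their meaning (e.g., binary flags stored as integers), and `astype` recasts them.

```python
import pandas as pd

# Toy rows mimicking a few Heart Failure columns (values are illustrative only)
data = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "FastingBS": [0, 0, 0],        # stored as int, but really a yes/no flag
    "HeartDisease": [0, 1, 0],     # target, also binary
})

print(data.shape)                  # (rows, columns)
print(data.dtypes)

# Recast categorical-looking columns that were read in as object/int
for col in ["Sex", "FastingBS", "HeartDisease"]:
    data[col] = data[col].astype("category")
print(data.dtypes)
```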

c. Compute descriptive statistics of each column.

  • Do you observe anything strange?

  • Handle what seems to be the problem properly.

# To do
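For the "anything strange" question: published versions of this dataset are known to contain physiologically impossible zeros (e.g., `Cholesterol` of 0), which are best read as disguised missing values. A sketch of spotting and handling them, on a toy column (median imputation shown; other strategies are equally defensible):

```python
import numpy as np
import pandas as pd

# Toy column where impossible zeros stand in for missing values (illustrative)
data = pd.DataFrame({"Cholesterol": [289, 0, 283, 0, 195]})

print(data.describe())     # a minimum of 0 for Cholesterol is a red flag

# Treat the zeros as missing, then impute with the median of the valid values
data["Cholesterol"] = data["Cholesterol"].replace(0, np.nan)
data["Cholesterol"] = data["Cholesterol"].fillna(data["Cholesterol"].median())
print(data["Cholesterol"].tolist())
```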

d. Are there any potential outliers? Take note of them to revisit later in your analysis.

  • Are there any duplicated observations? Handle them properly.

# To do
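One common way to flag candidate outliers is the 1.5 × IQR rule, and duplicates can be counted and dropped with pandas built-ins. A sketch on a toy column (the values are illustrative only):

```python
import pandas as pd

data = pd.DataFrame({"RestingBP": [140, 160, 130, 160, 300, 130]})

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = data["RestingBP"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (data["RestingBP"] < q1 - 1.5 * iqr) | (data["RestingBP"] > q3 + 1.5 * iqr)
print(data[mask])                  # candidate outliers to note for later

# Duplicated observations: count them, then keep only the first occurrence
print(data.duplicated().sum())
data = data.drop_duplicates()
print(data.shape)
```

The IQR rule only flags candidates; whether a flagged value is an error or a genuine extreme observation is a judgment call.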

2. Outliers & high leverage data

In an unsupervised framework, outliers are data points that deviate significantly from the majority of observations. In a supervised framework, inputs with extreme values (in the features, but not the target) are known as high leverage points. Both outliers and high leverage points may (though not always) obscure the true underlying patterns in the data, complicating analysis and leading to potential inaccuracies. We will start hunting for outliers and high leverage points using the Abalone dataset.

a. Download and import the dataset.

# To do
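A sketch of one possible import path, assuming the UCI repository copy of Abalone (the URL and the column names, including `Type` for the first column, are assumptions; use whatever source your course provides):

```python
import pandas as pd

# The UCI file has no header row, so column names are supplied by hand.
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "abalone/abalone.data")
cols = ["Type", "LongestShell", "Diameter", "Height", "WholeWeight",
        "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]
abalone = pd.read_csv(url, header=None, names=cols)
print(abalone.shape)
```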

b. Compute and visualize both correlation matrices (e.g., Pearson and Spearman). Give your first impression of them.

# To do
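Reading "both" as the linear (Pearson) and rank-based (Spearman) matrices, a minimal sketch on toy numeric columns standing in for the abalone measurements (values are illustrative only):

```python
import pandas as pd

# Toy numeric columns standing in for the abalone measurements (illustrative)
df = pd.DataFrame({
    "Diameter":    [0.35, 0.45, 0.40, 0.50, 0.55],
    "WholeWeight": [0.20, 0.50, 0.40, 0.80, 1.00],
    "Rings":       [7, 9, 10, 12, 15],
})

pearson = df.corr(method="pearson")     # linear association
spearman = df.corr(method="spearman")   # monotone (rank) association
print(pearson.round(2))
print(spearman.round(2))
# For a visual version, a heatmap works well, e.g. with seaborn:
# sns.heatmap(pearson, annot=True, cmap="coolwarm")
```

Comparing the two matrices is informative: a Spearman value much larger than the Pearson one suggests a monotone but nonlinear relation, or the influence of extreme points.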

c. Study the relation between Type and the variable of main interest (the target), Rings.

# To do
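Since Type is categorical and Rings is numeric, the natural tools are per-group summaries and side-by-side boxplots. A sketch on a toy sample (assuming Type takes the abalone sex levels M/F/I; the values are illustrative only):

```python
import pandas as pd

# Toy sample: Type is the abalone sex (M/F/I), Rings the target (illustrative)
abalone = pd.DataFrame({
    "Type":  ["M", "F", "I", "M", "F", "I"],
    "Rings": [10, 12, 6, 11, 13, 5],
})

# Compare the Rings distribution across the levels of Type
print(abalone.groupby("Type")["Rings"].describe())

# A side-by-side boxplot gives the same picture graphically:
# abalone.boxplot(column="Rings", by="Type")
```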

d. Inspect outliers and high leverage data points (considering Rings as the target) in the dataset. You might find the following lessons from STAT 501 helpful:

  • Using Leverages to Help Identify Extreme x Values, STAT 501

  • Identifying Outliers (Unusual y Values), STAT 501

# To do
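Following the STAT 501 approach, leverages are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ (values above 3p/n are considered extreme in x), while unusual y values are flagged by large internally studentized residuals. A sketch on a toy one-feature regression with a single extreme x point (the data are illustrative only):

```python
import numpy as np

# Toy regression x -> y with one extreme-x (high leverage) point (illustrative)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)
y = np.array([1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 8.0, 9.1, 29.5])
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept

# Leverages: diagonal of the hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
n, p = X.shape
print(leverage.round(3))
# STAT 501 rule of thumb: h_ii greater than 3p/n is extreme in x
print(np.where(leverage > 3 * p / n)[0])

# Unusual y values: internally studentized residuals larger than ~2-3 in size
resid = y - H @ y
mse = (resid ** 2).sum() / (n - p)
student = resid / np.sqrt(mse * (1 - leverage))
print(student.round(2))
```

Note that the leverages always sum to p (here 2), so a point carrying most of that total is, by construction, dominating the fit.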

e. Once again, study the correlation matrices of the data after removing outliers and high leverage data points. Conclude.

# To do
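The effect of removing flagged points is easiest to see on a toy example where one extreme row dominates the Pearson correlation (the data and the simple quantile-based flagging rule below are illustrative stand-ins for the leverage/residual rules of part d):

```python
import pandas as pd

# Toy data where one extreme row distorts the correlation (illustrative)
df = pd.DataFrame({
    "x":     [1.0, 2.0, 3.0, 4.0, 100.0],
    "Rings": [4.0, 3.0, 2.0, 1.0, 50.0],
})

print(df.corr().round(2))          # dominated by the single extreme row

flagged = df["x"] > df["x"].quantile(0.9)   # stand-in for the leverage rule
clean = df[~flagged]
print(clean.corr().round(2))       # here the sign of the relation even flips
```

Recomputing the matrices on the cleaned data, as the exercise asks, shows how strongly a handful of extreme points can distort correlation-based conclusions.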

Further Readings