Exploratory Data Analysis & Unsupervised Learning Course: Dr. Sothea HAS
Objective: This lab will demonstrate the critical role of data preprocessing. You will learn to identify and resolve common data quality issues (e.g., missing values, inconsistencies, outliers) to ensure the validity and reliability of your final analysis.
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.
The following Heart Failure dataset is obtained by combining 5 different heart disease datasets, consisting of 11 features and a target column indicating heart disease status of the patients. We will build a classification model to predict the heart status of the patients.
b. What are the dimensions of the dataset? Check whether any columns have an inappropriate data type, and convert them if so.
# To do
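As a starting point, here is a minimal sketch of how dimensions and data types might be inspected with pandas. The toy table, its column names, and the choice of which columns to treat as categorical are all assumptions made for illustration; they are not taken from the actual dataset.

```python
import pandas as pd

# Toy stand-in for the real data; column names and values are hypothetical.
df = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "Sex": ["M", "F", "M", "F"],
    "FastingBS": [0, 0, 1, 0],      # binary flag stored as integers
    "HeartDisease": [0, 1, 0, 1],   # target label stored as integers
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # storage type of each column

# Columns that encode categories rather than quantities are converted,
# so that later summaries treat them as labels and not as numbers.
categorical_cols = ["Sex", "FastingBS", "HeartDisease"]
df = df.astype({col: "category" for col in categorical_cols})
print(df.dtypes)
```

Whether a 0/1 column is better kept numeric or converted to a category depends on how it is used later, so treat the conversion above as one possible choice.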
c. Compute descriptive statistics of each column.
Do you observe anything strange?
Handle what seems to be the problem properly.
# To do
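One kind of strangeness that descriptive statistics can reveal is a physically implausible value, such as a zero recorded for a measurement that cannot be zero. The sketch below assumes that situation on toy data; the column names and the interpretation of zero as "missing" are assumptions, and the real dataset may have a different problem.

```python
import numpy as np
import pandas as pd

# Toy data; column names and the zero-as-missing convention are assumptions.
df = pd.DataFrame({
    "RestingBP": [140, 0, 130, 120, 150],
    "Cholesterol": [289, 180, 0, 214, 0],
})

summary = df.describe()
print(summary)  # a minimum of 0 is implausible for either measurement

# Recode the implausible zeros as missing instead of keeping them as values.
cols = ["RestingBP", "Cholesterol"]
df[cols] = df[cols].replace(0, np.nan)
print(df.isna().sum())  # number of missing values per column
```

Once such values are marked as missing, they can be dropped or imputed depending on how many there are and how they are distributed.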
d. Are there any potential outliers? Make a note of them for later use in your analysis.
Are there any duplicated observations? Handle them properly.
# To do
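A rough sketch of both checks follows. It uses the 1.5 x IQR rule to flag candidate outliers, which is a common convention and only one of several possible criteria; the toy data and column names are hypothetical.

```python
import pandas as pd

# Toy data with one repeated row and one extreme value (hypothetical columns).
df = pd.DataFrame({
    "Age": [40, 49, 37, 49, 54, 39],
    "MaxHR": [172, 156, 98, 156, 122, 250],
})

# Count fully duplicated rows, then keep only the first occurrence of each.
n_dup = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Flag values lying more than 1.5 * IQR outside the quartiles.
q1, q3 = df["MaxHR"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["MaxHR"] < q1 - 1.5 * iqr) | (df["MaxHR"] > q3 + 1.5 * iqr)

print(n_dup)
print(df[is_outlier])  # rows flagged for a closer look, not removed
```

The flagged rows are only listed here, not deleted, in line with the instruction to note potential outliers for later.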
2. Outliers & high leverage data
In an unsupervised framework, outliers are data points that deviate significantly from the majority of observations. In a supervised framework, observations whose inputs (but not their targets) take extreme values are known as high leverage points. Both outliers and high leverage points may (though not always) obscure the true underlying patterns in the data, complicating analysis and leading to potential inaccuracies. We will start hunting outliers and high leverage points using the Abalone dataset.
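To make the notion of leverage concrete, the sketch below computes the diagonal of the hat matrix for a simple linear regression on toy inputs. The data are invented, and the cutoff of 2p/n is a common rule of thumb rather than something prescribed by this lab.

```python
import numpy as np

# Toy inputs: five ordinary values and one far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Rule-of-thumb cutoff (an assumption): leverage above 2p/n is "high".
n, p = X.shape
threshold = 2 * p / n
high_leverage_idx = np.where(leverage > threshold)[0]

print(leverage.round(3))
print(high_leverage_idx)  # positions of the high leverage inputs
```

Note that leverage depends only on the inputs: the target never enters the computation, which matches the definition above.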
a. Download and import the dataset.
# To do
b. Compute and visualize both correlation matrices. Provide your first impression of the correlation matrices.
# To do
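The text does not say which two correlation matrices are meant. Assuming they are the Pearson and Spearman matrices, a minimal sketch on toy data could look like this; the column names and values are hypothetical, and only the computation is shown, not the plotting.

```python
import pandas as pd

# Toy data with a monotonic but non-linear relation (hypothetical columns).
df = pd.DataFrame({"Length": [1.0, 2.0, 3.0, 4.0, 5.0]})
df["Weight"] = df["Length"] ** 3

# Pearson measures linear association; Spearman measures monotonic association.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

Because the relation is monotonic but curved, the Spearman coefficient is higher than the Pearson one here. Either matrix can then be passed to a heatmap function of your plotting library for the visual comparison.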
c. Study the relation between Type and the most interesting variable (target) Rings.
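One simple way to start on this is a grouped summary of Rings for each level of Type. The sketch below uses invented values, and the three levels shown for Type are an assumption about how the column is coded.

```python
import pandas as pd

# Toy data; the Type levels and Rings values are hypothetical.
df = pd.DataFrame({
    "Type": ["M", "F", "I", "M", "F", "I"],
    "Rings": [10, 12, 5, 9, 11, 6],
})

# Summary statistics of the target within each category.
by_type = df.groupby("Type")["Rings"].agg(["count", "mean", "median"])
print(by_type)
```

A grouped boxplot of Rings by Type would show the same comparison graphically, including the spread within each group.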