Exploratory Data Analysis & Unsupervised Learning Course: PHAUK Sokkey, PhD TP: HAS Sothea, PhD
Objective: Preprocessing is an important part of any data-related task. In this TP, you will explore different challenges you may encounter when preprocessing data, and we will discuss reasonable solutions to them.
We will begin with missing data, which are very common in real-world datasets. A common question about missing values is “Should we remove them?” We will look at suitable ways to handle this problem.
We will work with the Enfants dataset, available here: Enfants dataset.
a. Import this data and name it data.
import pandas as pd
import numpy as np

# Adjust the path to where Enfants.txt is stored on your machine
data = pd.read_table("D:/Sothea_PC/Teaching_ITC/EDA/data/Enfants.txt", sep="\t")
data.head(5)
b. Compute statistical values and visualize the distribution of each variable.
Statistics
# To do
Visualization
# To do
c. Did you find anything strange in the previous graphs?
Your response:
d. Create a new dataset called data_NoNA by replacing all missing values with np.nan, then remove all the rows containing missing values.
- Repeat point (b) on data_NoNA.
- Compare the distribution of these variables before and after removing missing values.
# To do
# To do
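As a rough illustration of the steps in point (d), here is a minimal sketch on toy data, not on the Enfants dataset. The column names and the placeholder code -999 are hypothetical; the way missing values are actually encoded in Enfants has to be found by inspecting the data first.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns); -999 is an assumed placeholder for "missing"
toy = pd.DataFrame({
    "age": [3.0, 5.0, -999.0, 4.0, 6.0],
    "weight": [14.2, -999.0, 16.0, 15.1, 18.3],
})

# Step 1: replace the placeholder code with np.nan
toy_nan = toy.replace(-999.0, np.nan)

# Count missing values per column before removing anything
n_missing = toy_nan.isna().sum()
print(n_missing)

# Step 2: remove every row that contains at least one missing value
toy_no_na = toy_nan.dropna()

# Compare summary statistics before and after removing rows
# (describe() skips NaN when computing each statistic)
print(toy_nan.describe())
print(toy_no_na.describe())
```

Note that the two summaries differ: describe() on toy_nan uses every available value in each column, while toy_no_na only keeps rows that are complete in all columns.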
What do you observe?
Your response:
What’s the mechanism/type of these missing values?
Your response:
e. Do you spot any outliers in this dataset?
Your response:
2. Outliers & high leverage data
In an unsupervised framework, outliers are data points that deviate significantly from the majority of observations. In a supervised framework, observations whose inputs, rather than their target, take extreme values are known as high leverage points. Both outliers and high leverage points may (though not always) obscure the true underlying patterns in the data, complicating the analysis and potentially leading to inaccurate conclusions. We will start hunting for outliers and high leverage points using the Abalone dataset.
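One common convention for flagging univariate outliers is the interquartile range (IQR) rule, sketched below on toy values. This is only one possible criterion and not necessarily the one intended for this TP; the helper name iqr_outlier_mask and the factor k = 1.5 are assumptions made for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy values, not taken from the Abalone dataset
toy_values = pd.Series([9, 10, 10, 11, 12, 11, 10, 40])
mask = iqr_outlier_mask(toy_values)
outliers = toy_values[mask]
print(outliers)
```

The rule looks at one variable at a time, so it can miss points that are unusual only in combination with other variables.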
a. Download and import the dataset.
# To do
b. Compute and visualize the correlation matrix. Give your first impression of the correlation matrix.
# To do
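For reference, a correlation matrix can be computed with pandas as in the sketch below. It uses simulated toy columns, not the Abalone data, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Simulated toy data: "diameter" is built to depend on "length",
# while "noise" is independent of both
rng = np.random.default_rng(0)
length = rng.uniform(0.2, 0.8, size=200)
toy = pd.DataFrame({
    "length": length,
    "diameter": 0.8 * length + rng.normal(0, 0.02, size=200),
    "noise": rng.normal(0, 1, size=200),
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr()
print(corr.round(2))
```

The resulting square DataFrame can then be passed to the plotting tool of your choice, for example as a heatmap. On a real dataset, non-numeric columns need to be excluded or encoded before calling corr().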
c. Study the relationship between Type and Rings, the most interesting variable (the target).