TP3 - Data Preprocessing

Exploratory Data Analysis & Unsupervised Learning
Course: Dr. Sothea HAS


Objective: This lab will demonstrate the critical role of data preprocessing. You will learn to identify and resolve common data quality issues (e.g., missing values, inconsistencies, outliers) to ensure the validity and reliability of your final analysis.

The Jupyter Notebook for this TP can be downloaded here: TP3-Data-Preprocessing.


1. Missing Values

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.

The following Heart Failure dataset was obtained by combining 5 different heart disease datasets; it consists of 11 features and a target column indicating each patient's heart disease status. We will build a classification model to predict that status.

We will explore the Kaggle Heart Failure Dataset. Load the dataset into the environment.

a. Import this data and name it data.

import kagglehub
import pandas as pd

# To do
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0

b. What is the dimension of the dataset? Check for, and correct, any columns with an inappropriate data type.

# To do
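A minimal sketch of this step, using a few toy rows that mimic the Heart Failure columns (the values are illustrative only): `shape` gives the dimension, `dtypes` reveals columns whose stored type does not match their meaning (e.g., binary flags stored as integers), and `astype` recasts them.

```python
import pandas as pd

# Toy rows mimicking a few Heart Failure columns (values are illustrative only)
data = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "FastingBS": [0, 0, 0],        # stored as int, but really a yes/no flag
    "HeartDisease": [0, 1, 0],     # target, also binary
})

print(data.shape)                  # (rows, columns)
print(data.dtypes)

# Recast categorical-looking columns that were read in as object/int
for col in ["Sex", "FastingBS", "HeartDisease"]:
    data[col] = data[col].astype("category")
print(data.dtypes)
```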

c. Compute descriptive statistics of each column.

  • Do you observe anything strange?

  • Handle what seems to be the problem properly.

# To do
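For the "anything strange" question: published versions of this dataset are known to contain physiologically impossible zeros (e.g., `Cholesterol` of 0), which are best read as disguised missing values. A sketch of spotting and handling them, on a toy column (median imputation shown; other strategies are equally defensible):

```python
import numpy as np
import pandas as pd

# Toy column where impossible zeros stand in for missing values (illustrative)
data = pd.DataFrame({"Cholesterol": [289, 0, 283, 0, 195]})

print(data.describe())     # a minimum of 0 for Cholesterol is a red flag

# Treat the zeros as missing, then impute with the median of the valid values
data["Cholesterol"] = data["Cholesterol"].replace(0, np.nan)
data["Cholesterol"] = data["Cholesterol"].fillna(data["Cholesterol"].median())
print(data["Cholesterol"].tolist())
```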

d. Are there any potential outliers? Take note of them to revisit later in your analysis.

  • Are there any duplicated observations? Handle them properly.

# To do
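One common way to flag candidate outliers is the 1.5 × IQR rule, and duplicates can be counted and dropped with pandas built-ins. A sketch on a toy column (the values are illustrative only):

```python
import pandas as pd

data = pd.DataFrame({"RestingBP": [140, 160, 130, 160, 300, 130]})

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = data["RestingBP"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (data["RestingBP"] < q1 - 1.5 * iqr) | (data["RestingBP"] > q3 + 1.5 * iqr)
print(data[mask])                  # candidate outliers to note for later

# Duplicated observations: count them, then keep only the first occurrence
print(data.duplicated().sum())
data = data.drop_duplicates()
print(data.shape)
```

The IQR rule only flags candidates; whether a flagged value is an error or a genuine extreme observation is a judgment call.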

2. Outliers & high leverage data

In an unsupervised framework, outliers are data points that deviate significantly from the majority of observations. In a supervised framework, inputs with extreme values (in the features, but not the target) are known as high leverage points. Both outliers and high leverage points may (though not always) obscure the true underlying patterns in the data, complicating analysis and leading to potential inaccuracies. We will start hunting for outliers and high leverage points using the Abalone dataset.

a. Download and import the dataset.

# To do
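A sketch of one possible import path, assuming the UCI repository copy of Abalone (the URL and the column names, including `Type` for the first column, are assumptions; use whatever source your course provides):

```python
import pandas as pd

# The UCI file has no header row, so column names are supplied by hand.
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "abalone/abalone.data")
cols = ["Type", "LongestShell", "Diameter", "Height", "WholeWeight",
        "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]
abalone = pd.read_csv(url, header=None, names=cols)
print(abalone.shape)
```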

b. Compute and visualize both correlation matrices (e.g., Pearson and Spearman). Give your first impression of them.

# To do
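Reading "both" as the linear (Pearson) and rank-based (Spearman) matrices, a minimal sketch on toy numeric columns standing in for the abalone measurements (values are illustrative only):

```python
import pandas as pd

# Toy numeric columns standing in for the abalone measurements (illustrative)
df = pd.DataFrame({
    "Diameter":    [0.35, 0.45, 0.40, 0.50, 0.55],
    "WholeWeight": [0.20, 0.50, 0.40, 0.80, 1.00],
    "Rings":       [7, 9, 10, 12, 15],
})

pearson = df.corr(method="pearson")     # linear association
spearman = df.corr(method="spearman")   # monotone (rank) association
print(pearson.round(2))
print(spearman.round(2))
# For a visual version, a heatmap works well, e.g. with seaborn:
# sns.heatmap(pearson, annot=True, cmap="coolwarm")
```

Comparing the two matrices is informative: a Spearman value much larger than the Pearson one suggests a monotone but nonlinear relation, or the influence of extreme points.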

c. Study the relation between Type and the variable of main interest (the target), Rings.

# To do
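Since Type is categorical and Rings is numeric, the natural tools are per-group summaries and side-by-side boxplots. A sketch on a toy sample (assuming Type takes the abalone sex levels M/F/I; the values are illustrative only):

```python
import pandas as pd

# Toy sample: Type is the abalone sex (M/F/I), Rings the target (illustrative)
abalone = pd.DataFrame({
    "Type":  ["M", "F", "I", "M", "F", "I"],
    "Rings": [10, 12, 6, 11, 13, 5],
})

# Compare the Rings distribution across the levels of Type
print(abalone.groupby("Type")["Rings"].describe())

# A side-by-side boxplot gives the same picture graphically:
# abalone.boxplot(column="Rings", by="Type")
```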

d. Inspect outliers and high leverage data points (considering Rings as the target) in the dataset. You might find the following lessons from STAT 501 helpful:

  • Using Leverages to Help Identify Extreme x Values, STAT 501

  • Identifying Outliers (Unusual y Values), STAT 501

# To do
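Following the STAT 501 approach, leverages are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ (values above 3p/n are considered extreme in x), while unusual y values are flagged by large internally studentized residuals. A sketch on a toy one-feature regression with a single extreme x point (the data are illustrative only):

```python
import numpy as np

# Toy regression x -> y with one extreme-x (high leverage) point (illustrative)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)
y = np.array([1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 8.0, 9.1, 29.5])
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept

# Leverages: diagonal of the hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
n, p = X.shape
print(leverage.round(3))
# STAT 501 rule of thumb: h_ii greater than 3p/n is extreme in x
print(np.where(leverage > 3 * p / n)[0])

# Unusual y values: internally studentized residuals larger than ~2-3 in size
resid = y - H @ y
mse = (resid ** 2).sum() / (n - p)
student = resid / np.sqrt(mse * (1 - leverage))
print(student.round(2))
```

Note that the leverages always sum to p (here 2), so a point carrying most of that total is, by construction, dominating the fit.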

e. Once again, study the correlation matrices of the data after removing outliers and high leverage data points. Conclude.

# To do
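The effect of removing flagged points is easiest to see on a toy example where one extreme row dominates the Pearson correlation (the data and the simple quantile-based flagging rule below are illustrative stand-ins for the leverage/residual rules of part d):

```python
import pandas as pd

# Toy data where one extreme row distorts the correlation (illustrative)
df = pd.DataFrame({
    "x":     [1.0, 2.0, 3.0, 4.0, 100.0],
    "Rings": [4.0, 3.0, 2.0, 1.0, 50.0],
})

print(df.corr().round(2))          # dominated by the single extreme row

flagged = df["x"] > df["x"].quantile(0.9)   # stand-in for the leverage rule
clean = df[~flagged]
print(clean.corr().round(2))       # here the sign of the relation even flips
```

Recomputing the matrices on the cleaned data, as the exercise asks, shows how strongly a handful of extreme points can distort correlation-based conclusions.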

Further Readings