TP3 - Data Preprocessing

Exploratory Data Analysis & Unsupervised Learning
Course: PHAUK Sokkey, PhD
TP: HAS Sothea, PhD


Objective: Preprocessing is an important step in any data-related task. In this TP, you will explore different challenges you may encounter when performing data preprocessing, and we will discuss reasonable solutions to these challenges.


The Jupyter Notebook for this TP can be downloaded here: TP3-Data-Preprocessing.


1. Missing Data

We will begin with missing data, which are very common in real-world datasets. One common question about missing values is “Should we remove them?” We will delve into possible solutions to this problem.

We will work with the Enfants dataset, available here: Enfants dataset.

a. Import this data and name it data.

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
data = pd.read_table("D:/Sothea_PC/Teaching_ITC/EDA/data/Enfants.txt", sep="\t")
data.head(5)
GENRE AGE TAILLE MASSE
0 F 68 0 20
1 M 74 116 18
2 M 69 120 23
3 M 72 121 25
4 M 73 114 17

b. Compute statistical values and visualize the distribution of each variable.

  • Statistics
data.describe().transpose().drop(columns=["count"])
mean std min 25% 50% 75% max
AGE 68.570836 4.087658 49.0 66.0 69.0 72.0 86.0
TAILLE 90.627505 46.039818 0.0 106.0 112.0 116.0 139.0
MASSE 18.878369 6.634948 0.0 18.0 20.0 22.0 40.0
data["GENRE"].value_counts()
GENRE
M    1492
F    1402
Name: count, dtype: int64
  • Visualization
import seaborn as sns
import matplotlib.pyplot as plt
names = data.columns
_, ax = plt.subplots(1,3, figsize=(9, 4))
i = 0
for var in names:
    if var != "GENRE":
        sns.histplot(data, x=var, kde=True, ax=ax[i], binwidth=3)
        ax[i].set_title(f"Distribution of {var}")
        i += 1
plt.tight_layout()

sns.countplot(data=data, x="GENRE", hue="GENRE", legend=True)

c. Did you find anything strange in the previous graphs?

Your response: It appears that some children have height or weight equal to \(0\), which is impossible; these zeros are most likely missing values encoded as \(0\).
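A quick count of these suspicious zeros confirms how many rows are affected:

# Count zero entries in each numeric column (zeros encode missing values here)
(data[["AGE", "TAILLE", "MASSE"]] == 0).sum()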

d. Create a new dataset called data_NoNA by replacing all missing values with np.nan and removing all rows containing missing values.
  • Repeat point (b) on the data_NoNA data.
  • Compare the distributions of these variables before and after removing missing values.

# Replacing missing values by nan
data.loc[data.TAILLE == 0, "TAILLE"] = np.nan
data.loc[data.MASSE == 0, "MASSE"] = np.nan
data_NoNA = data.loc[~(data['TAILLE'].isna() | data['MASSE'].isna()), :]
_, ax = plt.subplots(1,3, figsize=(9, 4))
i = 0
for var in names:
    if var != "GENRE":
        sns.histplot(data_NoNA, x=var, kde=True, ax=ax[i], binwidth=3)
        ax[i].set_title(f"Distribution of {var}")
        i += 1
plt.tight_layout()

# Genre
sns.countplot(data=data_NoNA, x = "GENRE", hue="GENRE", legend=True)

  • What do you observe?

The distribution of AGE remains similar to that of the original data even after removing the missing values. On the other hand, the distribution of GENRE changes after removing the missing values: most of the removed rows correspond to girls.
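We can quantify this shift by comparing the GENRE proportions before and after deletion:

# GENRE proportions in the original data vs. after dropping rows with missing values
print(data["GENRE"].value_counts(normalize=True))
print(data_NoNA["GENRE"].value_counts(normalize=True))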

  • What’s the mechanism/type of these missing values?

As most of the missing values occur among girls, they are neither Missing Completely At Random (MCAR) nor Missing Not At Random (MNAR): the missingness is related to the observed variable GENRE, and the values are therefore Missing At Random (MAR).
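Since the mechanism is MAR given GENRE, deletion is not the only option: we can instead impute within each GENRE group. Below is a minimal sketch using the group median (the choice of median imputation is ours for illustration, not prescribed by the TP):

# Impute TAILLE and MASSE with the median of the corresponding GENRE group
data_imputed = data.copy()
for col in ["TAILLE", "MASSE"]:
    data_imputed[col] = data_imputed.groupby("GENRE")[col].transform(
        lambda s: s.fillna(s.median()))
data_imputed.isna().sum()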

e. Do you spot any outliers in this dataset?

Yes, according to the following boxplots.

_, ax = plt.subplots(1,3, figsize=(9, 4))
i = 0
for var in names:
    if var != "GENRE":
        sns.boxplot(data_NoNA, y=var, ax=ax[i])
        ax[i].set_title(f"Boxplot of {var}")
        i += 1
plt.tight_layout()

2. Outliers & high leverage data

In an unsupervised framework, outliers are data points that deviate significantly from the majority of observations. In a supervised framework, inputs with extreme values (but not their target) are known as high leverage points. Both outliers and high leverage points may (though not always) obscure the true underlying patterns in the data, complicating analysis and leading to potential inaccuracies. We will start hunting outliers and high leverage points using the Abalone dataset.
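As a warm-up, one simple univariate way to flag such points is a robust z-score based on the median and the median absolute deviation (MAD). The \(3.5\) cutoff below is a common rule of thumb; this helper is an illustration, not the method used in the rest of the TP:

# Flag values whose robust z-score (median/MAD-based) exceeds a cutoff
def robust_z_outliers(x, cutoff=3.5):
    med = np.median(x)
    mad = np.median(np.abs(x - med))   # median absolute deviation
    z = 0.6745 * (x - med) / mad       # 0.6745 rescales MAD to match the std under normality
    return np.abs(z) > cutoff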

a. Download and import the dataset.

Abalone = pd.read_csv("D:/Sothea_PC/Teaching_ITC/EDA/data/abalone.txt", sep=" ").drop(columns=["Id"])
Abalone.sample(5)
Type LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight ShellWeight Rings
1011 F 0.625 0.480 0.170 1.3525 0.6235 0.2780 0.3650 10
1790 F 0.550 0.380 0.165 1.2050 0.5430 0.2940 0.3345 10
2530 F 0.600 0.485 0.165 1.1405 0.5870 0.2175 0.2880 9
3637 I 0.435 0.325 0.100 0.3420 0.1335 0.0835 0.1050 6
3692 M 0.650 0.520 0.170 1.3655 0.6155 0.2885 0.3600 11

b. Compute and visualize the correlation matrix. Provide your first impression of the correlation matrix.

Abalone.iloc[:,1:].corr().style.background_gradient()
  LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight ShellWeight Rings
LongestShell 1.000000 0.986812 0.827554 0.925261 0.897914 0.903018 0.897706 0.556720
Diameter 0.986812 1.000000 0.833684 0.925452 0.893162 0.899724 0.905330 0.574660
Height 0.827554 0.833684 1.000000 0.819221 0.774972 0.798319 0.817338 0.557467
WholeWeight 0.925261 0.925452 0.819221 1.000000 0.969405 0.966375 0.955355 0.540390
ShuckedWeight 0.897914 0.893162 0.774972 0.969405 1.000000 0.931961 0.882617 0.420884
VisceraWeight 0.903018 0.899724 0.798319 0.966375 0.931961 1.000000 0.907656 0.503819
ShellWeight 0.897706 0.905330 0.817338 0.955355 0.882617 0.907656 1.000000 0.627574
Rings 0.556720 0.574660 0.557467 0.540390 0.420884 0.503819 0.627574 1.000000
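First impression: all size and weight measurements are strongly correlated with one another (all above \(0.77\), a clear sign of multicollinearity), while the target Rings is only moderately correlated with the predictors, with ShellWeight the strongest at about \(0.63\). Sorting the correlations with Rings makes this ranking explicit:

# Rank predictors by their correlation with the target Rings
Abalone.iloc[:,1:].corr()["Rings"].drop("Rings").sort_values(ascending=False)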

c. Study the relation between Type and the most interesting variable, the target Rings.

sns.set(style="whitegrid")
ax = sns.boxplot(data=Abalone, y = "Rings", hue="Type")
ax.set_title("Rings vs Type of Abalone")
plt.show()

The boxplot suggests some connection between Type and Rings: infant abalones (Type I) clearly stand apart from the other two types. The data also seem to respect the assumptions of ANOVA (normality and homoscedasticity), so we confirm this using an ANOVA test.

from scipy.stats import f_oneway
# One-way ANOVA of Rings across the three abalone types
f_oneway(*[Abalone.Rings[Abalone.Type == t] for t in np.unique(Abalone.Type)])
F_onewayResult(statistic=499.33254468883257, pvalue=3.724620497195191e-195)

We can reject the hypothesis that all groups share the same mean!
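The homoscedasticity assumption claimed above can itself be verified, for example with Levene's test (a quick sketch; the choice of test is ours, not part of the TP):

# Levene's test for equal variances of Rings across the three types
from scipy.stats import levene
groups = [Abalone.Rings[Abalone.Type == t] for t in np.unique(Abalone.Type)]
print(levene(*groups))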

d. Inspect outliers and high leverage data points (considering Rings as the target) in the dataset. You might be interested in reading the following posts from STAT 501:
  • Using Leverages to Help Identify Extreme x Values, STAT 501
  • Identifying Outliers (Unusual y Values), STAT 501

ax = sns.boxplot(Abalone.iloc[:,1:].melt(), x="value", hue="variable")
ax.set_xscale("log")

e. Once again, study the correlation matrix of the data after removing outliers and leverage data points. Conclude.

def remove_outliers_iqr(data):
    # Flag values lying outside the IQR fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    ID_out = np.logical_or(data < (Q1 - 1.5 * IQR), data > (Q3 + 1.5 * IQR))
    return ID_out.values.reshape(-1)

ids = {x: remove_outliers_iqr(Abalone[[x]]) for x in Abalone.columns[1:]}
id_df = pd.DataFrame(ids)
ID = id_df.any(axis=1)
Abalone_no_out = Abalone.loc[~ID,:]
ax = sns.boxplot(Abalone_no_out.iloc[:,1:].melt(), x="value", hue="variable")
ax.set_xscale("log")

Abalone_no_out.iloc[:,1:].corr().style.background_gradient()
  LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight ShellWeight Rings
LongestShell 1.000000 0.985969 0.894911 0.940949 0.916001 0.913637 0.923421 0.587108
Diameter 0.985969 1.000000 0.900426 0.938907 0.909477 0.908181 0.929413 0.604331
Height 0.894911 0.900426 1.000000 0.894523 0.849048 0.874604 0.901294 0.615657
WholeWeight 0.940949 0.938907 0.894523 1.000000 0.973089 0.966545 0.962402 0.561207
ShuckedWeight 0.916001 0.909477 0.849048 0.973089 1.000000 0.929280 0.903340 0.469503
VisceraWeight 0.913637 0.908181 0.874604 0.966545 0.929280 1.000000 0.923197 0.546232
ShellWeight 0.923421 0.929413 0.901294 0.962402 0.903340 0.923197 1.000000 0.624113
Rings 0.587108 0.604331 0.615657 0.561207 0.469503 0.546232 0.624113 1.000000

Remark: This is not an ideal approach! Removing outliers, just like removing missing values, can influence the other columns of the data, so you should pay attention to this influence when dealing with outliers. Moreover, in the modeling sense, outliers are points with unusual target values, so how we handle them depends on the target.
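A gentler alternative to deleting rows is to clip (winsorize) the extreme values at the same IQR fences, which bounds the outliers while keeping every observation. A minimal sketch:

# Clip each numeric column to its IQR fences instead of dropping rows
Abalone_clipped = Abalone.copy()
for col in Abalone.columns[1:]:
    Q1 = Abalone[col].quantile(0.25)
    Q3 = Abalone[col].quantile(0.75)
    IQR = Q3 - Q1
    Abalone_clipped[col] = Abalone[col].clip(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)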

  • Detecting high leverage points

A data point has high leverage if it has “extreme” predictor \(x\) values. With a single predictor, an extreme \(x\) value is simply one that is particularly high or low. With multiple predictors, extreme \(x\) values may be particularly high or low for one or more predictors, or may be “unusual” combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor).

To hunt high leverage points, we need the hat matrix \(H = X(X^\top X)^{-1}X^\top\): the leverage of observation \(i\) is the diagonal element \(h_{ii} = x_i^\top(X^\top X)^{-1}x_i\), which measures how far the input \(x_i\) lies from the centroid of all inputs. We compute these values as follows.

import statsmodels.api as sm
X = sm.add_constant(Abalone.iloc[:,1:-1])   # Adds the intercept term
y = Abalone.Rings
model = sm.OLS(y, X).fit()

influence = model.get_influence()
leverage = influence.hat_matrix_diag    # leverage values h_ii
cooks = influence.cooks_distance[0]     # Cook's distance of each observation
plt.figure(figsize=(10, 6))
sns.scatterplot(x=leverage, y=model.resid, size=cooks, legend=False, sizes=(20, 200))
plt.xlabel('Leverage')
plt.ylabel('Residuals')
plt.title('Leverage vs. Residuals')
plt.axhline(0, color='red', linestyle='--', linewidth=1)
plt.show()

This plot indicates that there are some high leverage points (far from the mean of the inputs) with large negative residuals (not well predicted by the model). These high-leverage points are concerning because they combine unusual input and output values.
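Before thresholding the leverage values, we can sanity-check them against the hat-matrix formula directly (a minimal sketch reusing X and leverage from the cells above):

# Recompute h_ii = x_i^T (X^T X)^{-1} x_i and compare with statsmodels
Xv = X.to_numpy()
lev_manual = np.einsum("ij,jk,ik->i", Xv, np.linalg.inv(Xv.T @ Xv), Xv)
print(np.allclose(lev_manual, leverage))   # should print True (up to numerical error)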

threshold = 3 * X.shape[1] / len(X)         # rule of thumb: flag h_ii > 3p/n as high leverage
high_lev_mask = leverage > threshold        # boolean mask over all observations
high_lev_df = Abalone.loc[high_lev_mask]
low_lev_df = Abalone.loc[~high_lev_mask]    # negate the boolean mask, not integer positions
low_lev_df.iloc[:,1:].corr().style.background_gradient()
  LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight ShellWeight Rings
LongestShell 1.000000 0.983796 0.895052 0.920812 0.887464 0.879698 0.898355 0.598223
Diameter 0.983796 1.000000 0.904749 0.924017 0.885534 0.872586 0.910550 0.619908
Height 0.895052 0.904749 1.000000 0.866042 0.807687 0.827685 0.884090 0.648165
WholeWeight 0.920812 0.924017 0.866042 1.000000 0.966727 0.958884 0.947852 0.567011
ShuckedWeight 0.887464 0.885534 0.807687 0.966727 1.000000 0.917582 0.865314 0.472378
VisceraWeight 0.879698 0.872586 0.827685 0.958884 0.917582 1.000000 0.886303 0.534098
ShellWeight 0.898355 0.910550 0.884090 0.947852 0.865314 0.886303 1.000000 0.608338
Rings 0.598223 0.619908 0.648165 0.567011 0.472378 0.534098 0.608338 1.000000
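Conclusion: reading the Rings row across the three correlation matrices, the correlations with the target increase only modestly after removing outliers or high leverage rows, and the strong multicollinearity among the size and weight measurements persists in every case. A quick side-by-side view of the Rings column:

# Correlations with Rings: full data vs. data without high leverage rows
pd.DataFrame({
    "full": Abalone.iloc[:,1:].corr()["Rings"],
    "low_leverage": low_lev_df.iloc[:,1:].corr()["Rings"],
})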

Further Readings