Data Quality & Preprocessing

INF-604: Data Analysis

Lecturer: Dr. Sothea HAS

📋 Outline

Data Sources
Data Quality
Data Preprocessing
Real Examples

Data Sources

Primary

Data collected directly from the source for a specific purpose.

Example:
- Surveys or Questionnaires 🗳️
- Interviews 🎙️
- Observations 🧐
- Experiments 🔬

Secondary

Data that has already been collected, processed, and made available by others.

Example:
- Government publications or reports 📄
- Books and articles 📚
- Online databases and repositories 🌐
- Industry/NGO reports 🏭

Data Sources

Format

Structured

Highly organized and easily searchable in databases using predefined schemas.

Structure: typically stored in tables with rows and columns.
Example:
- Spreadsheets: Excel
- CSV files

Unstructured

Lacks a predefined format or schema and is typically stored in its raw form.

Structure: Free-form and can be text, images, videos…
Example:
- Emails/Documents (e.g., Word files, PDFs)
- Social media posts, images, audio, videos
- Web pages…

Data Quality

Data quality

Someone in 60s said Garbage In, Garbage Out (GIGO)!.
Data quality is the most important thing in Data Analysis.

Data quality

Data quality

Timeliness: how up-to-date the data is for its intended use.
Ex: Temperature of 60s wouldn’t help forecasting tomorrow.

Data quality

Uniqueness: data should not be duplicated.
Ex: Recording the same female heart disease patient many times may lead to a conclude that females have a higher likelihood of developing heart disease.

Data quality

Validity: data should take values within its valid range.
Ex: Height & weight should not be 0.

Data quality

Consistency: data should be uniform and compatible (format, type…) across different datasets and over time.
Ex: 15/03/2004 & 03/15/2004, Male & M…

Data quality

Accuracy: data should be accurate and reflects what it is meant to measure.
Ex: You entered ‘I like DA Course’ in my survey beause it’s not anonymous.

Data quality

Completeness: data should not contain missing values.
Ex: In Lab1, Cabin column of Titanic dataset contains mostly NaN.

Data quality

Data quality includes these 6 factors.

Data quality

Data quality includes these 6 factors.
If there is a problem with any of these, you may ☝️

For secondary sources, Incompleteness is the common one.

Data Preprocessing

Data preprocessing

Consider an example

	Sex	Pclass	Fare	Cabin
0	male	3	7.2500	NaN
1	female	1	71.2833	C85
2	female	3	7.9250	NaN
3	female	1	53.1000	C123
4	male	3	8.0500	NaN
5	male	3	8.4583	NaN
6	male	1	51.8625	E46
7	male	3	21.0750	NaN
8	female	3	11.1333	NaN
9	female	2	30.0708	NaN
10	female	3	16.7000	G6
11	female	1	26.5500	C103

Data preprocessing

Consider an example

	Sex	Pclass	Fare	Cabin
0	male	3	7.2500	NaN
1	female	1	71.2833	C85
2	female	3	7.9250	NaN
3	female	1	53.1000	C123
4	male	3	8.0500	NaN
5	male	3	8.4583	NaN
6	male	1	51.8625	E46
7	male	3	21.0750	NaN
8	female	3	11.1333	NaN
9	female	2	30.0708	NaN
10	female	3	16.7000	G6
11	female	1	26.5500	C103

data.dropna(inplace=True)

Data preprocessing

Consider an example

	Sex	Pclass	Fare	Cabin
0	male	3	7.2500	NaN
1	female	1	71.2833	C85
2	female	3	7.9250	NaN
3	female	1	53.1000	C123
4	male	3	8.0500	NaN
5	male	3	8.4583	NaN
6	male	1	51.8625	E46
7	male	3	21.0750	NaN
8	female	3	11.1333	NaN
9	female	2	30.0708	NaN
10	female	3	16.7000	G6
11	female	1	26.5500	C103

data.drop(columns = ['Cabin'])

Data preprocessing

Missing values

Data of \(4\)-\(7\) years old kids.

Gender	Age	Height	Weight
F	68	0	20
F	68	0	18
F	65	105	0
F	63	0	15
F	68	112	0
F	66	106	0

What’s wrong with this data?

These are probably missing values (NA, nan, NaN…) in disguise.
Question: how do we handle it: Drop or Impute?
Answer: we should at least know what kind of missing values are they: MCAR, MAR or MNAR?

Data Preprocessing

Missing values

Missing Completely At Random (MCAR)

They are randomly missing.
Easy to handle with imputation or dropping methods if not so many.
They don’t introduce bias.
Removing them does not affect other columns.
Ex: Missing values do not affect Age. They are MCAR in this case.

Code

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
data_dropped_NA = data_kids.loc[(data_kids.Height > 0) & (data_kids.Weight > 0)]
fig_kid1 = go.Figure(go.Histogram(
    x=data_kids.Age, 
    name="Before dropping NA", 
    showlegend=True))
fig_kid1.add_trace(
    go.Histogram(
        x=data_dropped_NA.Age, 
        name="After dropping NA", 
        showlegend=True, 
        visible="legendonly"))
fig_kid1.update_layout(barmode='overlay', 
                       title="Distribution of Age", 
                       xaxis=dict(title="Age"),
                       yaxis=dict(title="Count"),
                       width=500,
                       height=400)
fig_kid1.update_traces(opacity=0.5)
fig_kid1.show()

Data Preprocessing

Missing values

Missing At Random (MAR)

The missingness is related to other columns.
One can try using those related columns to impute.
If not too many, model-based imputation often work well: KNN…
Ex: Most of the missing values are from female children. They are MAR in this case.

Code

count = data_kids.Gender.value_counts()
fig_kid2 = go.Figure(
    go.Bar(
        x=count.index, 
        y=count, 
        name="Before dropping NA"))
count_NA = data_dropped_NA.Gender.value_counts()
fig_kid2.add_trace(
    go.Bar(x=count_NA.index, 
    y=count_NA, 
    name="After dropping NA", 
    visible="legendonly"))
fig_kid2.update_layout(barmode='overlay', 
                       title="Distribution of Gender", 
                       xaxis=dict(title="Gender"),
                       yaxis=dict(title="Count"),
                       width=500,
                       height=400)
fig_kid2.update_traces(opacity=0.5)
fig_kid2.show()

Data Preprocessing

Missing values

Missing Not At Random (MNAR)

These are the trickiest, as the missingness is related to the missing values themselves.
It may require domain-specific knowledge or advanced techniques (more data, external info…).
It’s hard to judge if missing values are actually MNAR.
Ex: Very high or very low salaries are often missing from a survey if it’s optional.
If not so many, dropping is a common solution.

Data Preprocessing

Rules of Thumb

Proportion of `NA`	Rules of thumb
\(< 5\%\)	Drop/remove rows.
\(5-10\%\)	Can be dropped but must be cautious about the type of missing.
\(10-20\%\)	Better to be imputed according to their types.
\(20-30\%\)	Remove the entire column, if it’s not so important.
\(>30\%\)	Remove the entire column.

Data Preprocessing

Outliers

Data points that deviate significantly from the majority of observations in a dataset.
It can influence our analyses: insightful or problematic!
We can hunt them down using:
- Graphs: Scatterplots, Boxplots or histograms…
- They often fall outside \([\text{Q}_1-1.5\text{IQR},\text{Q}_3+1.5\text{IQR}]\).

Data Preprocessing

Handling outliers

Not all outliers would affect the analysis (may be ignored).
We can apply capping (limiting outliers to some values) or Trimming (completely remove them).
Some transformations may reduce the effect of outliers:
- Z-score: \(x\to \frac{x-\overline{x}}{\sigma_{x}}\) (centered by mean, scaled by std).
- Min-Max scaling: \(x\to\frac{x-\min}{\max-\min}\in [0,1]\).
- If the data are positive: \(x\to \log(x)\) or \(x\to \sqrt{x}\)
No absolute solution! It depends on the analysis.

Data Preprocessing

One-hot encoding

This is taken from gapminder dataset (2007).

Code

from gapminder import gapminder
import numpy as np
from sklearn.preprocessing import OneHotEncoder as onehot
encoder = onehot()
encoded_data = encoder.fit_transform(gapminder.loc[gapminder.year == 2007, ['continent']]).toarray()

# encoded dataset
X_encoded = pd.DataFrame(encoded_data, columns=[x.replace('continent_', '') for x in encoder.get_feature_names_out(['continent'])])
df_encoded = X_encoded.copy()
df_encoded['lifeExp'] = gapminder.lifeExp.loc[gapminder.year==2007].values
fig_cont = px.box(data_frame=gapminder.loc[gapminder.year==2007,:],
                  x="continent", y="lifeExp", color="continent")
fig_cont.update_layout(title="Life Expectancy vs Continent", height=250, width=500)
fig_cont.show()

Sometimes, categorical data are very informative and useful especially for building models.
One-hot encoding is a way to covert them into numbers.
For example:
[‘Asia’, ‘Europe’, ‘Africa’, ‘Africa’] 👇

	Africa	Asia	Europe
0	0.0	1.0	0.0
1	0.0	0.0	1.0
2	1.0	0.0	0.0
3	1.0	0.0	0.0

Real Example

Titanic Dataset (891 rows, 12 columns)

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked
0	0	3	male	34.5	0	0	7.8292	NaN	Q
1	1	3	female	47.0	1	0	7.0000	NaN	S

Data types:

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked
0	int64	int64	object	float64	int64	int64	float64	object	object

Missing values:

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked
0	0	0	0	86	0	0	1	327	0

Question: What should we do in the preprocessing step?
Answer: We should:
- Convert Survived and Pclass to be object.
- Handle missing values: drop 1 NA of Fare, remove column Cabin and study Age.

Real Example

Titanic Dataset (891 rows, 12 columns)

Convert data types:

col_to_be_converted = ['Survived', 'Pclass']
for col in col_to_be_converted:
    data[col] = data[col].astype(object)
data[col_to_be_converted].dtypes.to_frame().T

	Survived	Pclass
0	object	object

Drop column Cabin:

data.drop(columns = ["Cabin"], inplace = True)

Drop 1 row with NA in Fare:

data.dropna(subset = ['Fare'], inplace = True)

Study missing values in Age:
- Impact on qual columns:

Code

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
fig, axs = plt.subplots(2, 4, figsize=(6, 3.75))
col_qual = ['Survived', 'Pclass', 'Sex', 'Embarked']
for i, va in enumerate(col_qual):
    sns.countplot(data, x=va, ax=axs[0,i], stat = "proportion")
    axs[0,i].bar_label(axs[0,i].containers[0], fmt="%0.2f")

    sns.countplot(data.dropna(), x=va, ax=axs[1,i] , stat = "proportion")
    axs[1,i].bar_label(axs[1,i].containers[0], fmt="%0.2f")
    if i == 0:
        axs[0,i].set_ylabel("Before remove NA")
        axs[1,i].set_ylabel("After remove NA")
    else:
        axs[0,i].set_ylabel("")
        axs[1,i].set_ylabel("")
plt.tight_layout()
plt.show()

Real Example

Titanic Dataset (891 rows, 12 columns)

Impact on quan columns:

Code

sns.set(style="whitegrid")
fig, axs = plt.subplots(2, 3, figsize=(6, 3.5))
col_quan = ['SibSp', 'Parch', 'Fare']
for i, va in enumerate(col_quan):
    sns.histplot(data, x=va, ax=axs[0,i], kde=True)
    sns.histplot(data.dropna(), x=va, ax=axs[1,i], kde=True)
    if i == 0:
        axs[0,i].set_ylabel("Before remove NA")
        axs[1,i].set_ylabel("After remove NA")
    else:
        axs[0,i].set_ylabel("")
        axs[1,i].set_ylabel("")
plt.tight_layout()
plt.show()

Do you think that removing NA greatly affects other columns?

Study missing values in Age:
- Impact on qual columns:

Code

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
fig, axs = plt.subplots(2, 4, figsize=(6, 3.75))
col_qual = ['Survived', 'Pclass', 'Sex', 'Embarked']
for i, va in enumerate(col_qual):
    sns.countplot(data, x=va, ax=axs[0,i], stat = "proportion")
    axs[0,i].bar_label(axs[0,i].containers[0], fmt="%0.2f")

    sns.countplot(data.dropna(), x=va, ax=axs[1,i] , stat = "proportion")
    axs[1,i].bar_label(axs[1,i].containers[0], fmt="%0.2f")
    if i == 0:
        axs[0,i].set_ylabel("Before remove NA")
        axs[1,i].set_ylabel("After remove NA")
    else:
        axs[0,i].set_ylabel("")
        axs[1,i].set_ylabel("")
plt.tight_layout()
plt.show()

Real Example

Titanic Dataset (891 rows, 12 columns)

As removing NA barely impacts other columns, we can
- Drop rows with NA or
- Impute with mean or median.

data.fillna(value = data[['Age']].mean(), inplace = True)
data.iloc[:,[1,2,4,5,6,7,9,10]].isna().sum().to_frame().T

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	0	0	0	0	0	0	0	0

Age after imputation:

🥳 Yeahhhh….

Let’s Party… 🥂

	Africa	Asia	Europe
0	0.0	1.0	0.0
1	0.0	0.0	1.0
2	1.0	0.0	0.0
3	1.0	0.0	0.0

	Africa	Asia	Europe
0	0.0	1.0	0.0
1	0.0	0.0	1.0
2	1.0	0.0	0.0
3	1.0	0.0	0.0

	Africa	Asia	Europe
0	0.0	1.0	0.0
1	0.0	0.0	1.0
2	1.0	0.0	0.0
3	1.0	0.0	0.0