Lab3: Data Quality & Preprocessing

Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD

Objective: In this lab, you will delve deeper into assessing the quality of datasets and employing preprocessing techniques to properly clean them.

The notebook of this Lab can be downloaded here: Lab3_Preprocessing.ipynb.
Or you can work directly with Google Colab here: Lab3_Preprocessing.ipynb.

1. `Food Delivery` Dataset

Let’s consider Delivery dataset discussed in the previous Lab2. Read and load the data from kaggle: Food Delivery Dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("denkuznetz/food-delivery-time-prediction")

# Import data
import pandas as pd
data = pd.read_csv(path + "/Food_Delivery_Times.csv")
data.head()

	Order_ID	Distance_km	Weather	Traffic_Level	Time_of_Day	Vehicle_Type	Preparation_Time_min	Courier_Experience_yrs	Delivery_Time_min
0	522	7.93	Windy	Low	Afternoon	Scooter	12	1.0	43
1	738	16.42	Clear	Medium	Evening	Bike	20	2.0	84
2	741	9.52	Foggy	Low	Night	Scooter	28	1.0	59
3	661	7.44	Rainy	Medium	Afternoon	Scooter	5	1.0	37
4	412	19.03	Clear	Low	Morning	Bike	16	5.0	68

A. Address the problems related to the quality of this dataset.

Answer: In this dataset, the columns are corrected encoded with correct data type and the only problem is missing values.

data.dtypes.to_frame().T

	Order_ID	Distance_km	Weather	Traffic_Level	Time_of_Day	Vehicle_Type	Preparation_Time_min	Courier_Experience_yrs	Delivery_Time_min
0	int64	float64	object	object	object	object	int64	float64	int64

Identifying Missing Values:
- How many columns contain missing values?
- How many missing values are in each of those columns?

data.isna().sum().to_frame().T

	Order_ID	Distance_km	Weather	Traffic_Level	Time_of_Day	Vehicle_Type	Preparation_Time_min	Courier_Experience_yrs	Delivery_Time_min
0	0	0	30	30	30	0	0	30	0

data.describe()

	Order_ID	Distance_km	Preparation_Time_min	Courier_Experience_yrs	Delivery_Time_min
count	1000.000000	1000.000000	1000.000000	970.000000	1000.000000
mean	500.500000	10.059970	16.982000	4.579381	56.732000
std	288.819436	5.696656	7.204553	2.914394	22.070915
min	1.000000	0.590000	5.000000	0.000000	8.000000
25%	250.750000	5.105000	11.000000	2.000000	41.000000
50%	500.500000	10.190000	17.000000	5.000000	55.500000
75%	750.250000	15.017500	23.000000	7.000000	71.000000
max	1000.000000	19.990000	29.000000	9.000000	153.000000

Analyze a Column with Missing Values:
- Choose one column with missing values.
- Visualize the distribution of the remaining columns before and after dropping the missing values in the chosen column.
- What do you think is the nature of these missing values (e.g., MCAR, MAR, or MNAR)?
- Why might this be the case?

Here we chooe to analyze the missing values of column Weather.

We analyze the effect of removing missing values on categorical columns, then on numerical columns. This will help us identify the nature of missing values in column Weather.

import matplotlib.pyplot as plt    # for graphs
import seaborn as sns

# Consider weather
# Qualitative columns
qual = data.select_dtypes(include=['object']).columns
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))
i = 0
for va in qual:
  if va != "Weather":
    # Before removing NA in weather
    sns.countplot(data, x=va, ax=axes[0,i], stat="proportion")
    axes[0,i].set_title(va + ' before dropping NA')
    axes[0,i].bar_label(axes[0,i].containers[0], fmt="%.2f")
    # After removing NA in weather
    sns.countplot(data.dropna(subset=['Weather']), x=va, ax=axes[1,i], stat="proportion")
    axes[1,i].set_title(va + ' after dropping NA')
    axes[1,i].bar_label(axes[1,i].containers[0], fmt="%.3f")
    i = i + 1
plt.tight_layout()
plt.show()

import matplotlib.pyplot as plt    # for graphs
import seaborn as sns

# Quantitative columns
quan = data.select_dtypes(include=['number']).columns
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(14, 8))
i = 0
for va in quan:
  if va != "Order_ID":
  # Before removing NA in weather
    if va != "Delivery_Time_min":
      width = 2
    else:
      width = 5
    sns.histplot(data, x=va, ax=axes[0,i], binwidth=width, kde=True)
    axes[0,i].set_title(va + ' before dropping NA', fontsize=10)
    # After removing NA in weather
    sns.histplot(data.dropna(subset=['Weather']), x=va, ax=axes[1,i], binwidth=width, kde=True)
    axes[1,i].set_title(va + ' after dropping NA', fontsize=10)
    i = i + 1
plt.tight_layout()
plt.show()

As dropping missing values within column Weather does not affect other columns, we can conclude that the nature of the missing values in Weather are ‘Missing Completely At Random (MCAR)’.

data.isna().sum().to_frame().T/data.shape[0] * 100

	Order_ID	Distance_km	Weather	Traffic_Level	Time_of_Day	Vehicle_Type	Preparation_Time_min	Courier_Experience_yrs	Delivery_Time_min
0	0.0	0.0	3.0	3.0	3.0	0.0	0.0	3.0	0.0

Repeat the Analysis:
- Perform the same analysis for other columns with missing values.

We can drop missing values from each variable and consider the distribution of other columns all at once as follows:

# Consider Traffic column
## Qualitative columns

fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(14, 12))
for i, va0 in enumerate(qual):
  sns.countplot(data, x=va0, ax=axes[0,i], stat="proportion")
  axes[0,i].set_title(va0)
  axes[0,i].bar_label(axes[0,i].containers[0], fmt="%.2f")
  axes[i+1,0].set_ylabel(f"Drop NA by {va0}")
  for j, va in enumerate(qual):
      if va != va0:
        sns.countplot(data.dropna(subset=[va0]), x=va, ax=axes[i+1,j], stat="proportion")
        axes[i+1,j].set_title(va)
        axes[i+1,j].bar_label(axes[i+1,j].containers[0], fmt="%.3f")
plt.tight_layout()
plt.show()

fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(12, 14))
i = 0
for va in quan:
  if va != "Order_ID":
  # Before removing NA
    if va != "Delivery_Time_min":
      width = 2
    else:
      width = 5
    sns.histplot(data, x=va, ax=axes[0,i], binwidth=width, kde=True)
    axes[0,i].set_title(va + ' before dropping NA', fontsize=10)
    axes[0,i].set_ylabel('Before dropping NA', fontsize=12)
    # After removing NA in Traffic level
    sns.histplot(data.dropna(subset=['Traffic_Level']), x=va, ax=axes[1,i], binwidth=width, kde=True)
    axes[1,i].set_title(va + ' after dropping NA', fontsize=10)
    axes[1,i].set_ylabel('Dropping NA by Traffic', fontsize=12)
    # After removing NA in Day time
    sns.histplot(data.dropna(subset=['Time_of_Day']), x=va, ax=axes[2,i], binwidth=width, kde=True)
    axes[2,i].set_title(va + ' after dropping NA', fontsize=10)
    axes[2,i].set_ylabel('Dropping NA by Day time', fontsize=12)
    # After removing NA in Vehicle type
    sns.histplot(data.dropna(subset=['Vehicle_Type']), x=va, ax=axes[3,i], binwidth=width, kde=True)
    axes[3,i].set_title(va + ' after dropping NA', fontsize=10)
    axes[3,i].set_ylabel('Dropping NA by Vehicle types', fontsize=12)
    i = i + 1
plt.tight_layout()
plt.show()

B. Drop all rows with at least one missing values:

Visualize the distribution of the columns without missing values before and after dropping all rows that contain at least one missing value.
What observations can you make from these visualizations?
Impute the missing values using an appropriate method (e.g., mean, median, mode, or advanced imputation techniques).

# Consider weather
# Qualitative columns
quan = data.select_dtypes(include=['number']).columns
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(14, 6))
i = 0
for va in ['Distance_km', 'Vehicle_Type', 'Preparation_Time_min', 'Delivery_Time_min']:
    if va == "Vehicle_Type":
      sns.countplot(data, x=va, ax=axes[0,i], stat="proportion")
      axes[0,i].set_title(va + ' before dropping NA')
      axes[0,i].bar_label(axes[0,i].containers[0], fmt="%.2f")
      # After removing NA in weather
      sns.countplot(data.dropna(), x=va, ax=axes[1,i], stat="proportion")
      axes[1,i].set_title(va + ' after dropping NA')
      axes[1,i].bar_label(axes[1,i].containers[0], fmt="%.2f")
    else:
      if va != "Delivery_Time_min":
        width = 2
      else:
        width = 5
      sns.histplot(data.dropna(), x=va, ax=axes[0,i], binwidth=width, kde=True)
      axes[0,i].set_title(va + ' before dropping NA', fontsize=10)

      # After removing NA in weather
      sns.histplot(data.dropna(), x=va, ax=axes[1,i], binwidth=width, kde=True)
      axes[1,i].set_title(va + ' after dropping NA', fontsize=10)
    i = i + 1
plt.tight_layout()
plt.show()

Before and after dropping the missing values, the distributions of columns are mostly preserved, therefore they are missing completely at random. Dropping or imputing with mean or median or mode is fine.

C. Analyzing Connections Between Columns

Impact of Weather on Delivery Time: determine if weather conditions affect delivery time.
Influence of Traffic Level: determine whether traffic levels impact delivery time?
Effect of Vehicle Type: How can we evaluate if the type of vehicle used influences delivery time?
Role of Distance: How can we analyze the relationship between distance and delivery time?

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))
for i, va in enumerate(["Weather", "Traffic_Level", "Vehicle_Type", "Distance_km"]):
    if va != "Distance_km":
        sns.boxplot(data, y="Delivery_Time_min", hue=va, ax=axs[i//2, i%2])
        axs[i//2, i%2].legend(loc="upper left")
    else:
        sns.scatterplot(data, x=va, y="Delivery_Time_min", ax=axs[i//2, i%2])
    axs[i//2, i%2].set_title(f"Delivery time vs {va}")
plt.tight_layout()
plt.show()

According to the previous graphs:

Weahter condition seems to not affect delivery time that much except for the snowy time where the delivery times take slightly longer than other conditions.
The traffic level also seems to not influence delivery time that much either. The more busy the traffic, the slightly longer delivery time.
The vehicle type on the other hand does not influence the delivery time at all since the boxplots look nearly identical on all types of vehicle.
The Scatterplot shows a clear trend incating that the longer the distance, the longer time it takes to delivery the foods.

2. `Auto-MPG` Dataset

This dataset contains spec of various cars and is available in kaggle. For more, read here.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("uciml/autompg-dataset")
auto = pd.read_csv(path + '/auto-mpg.csv')
auto.head()

Investigate factors that may affect the quality of the dataset and implement appropriate measures to handle them.

1. Food Delivery Dataset

2. Auto-MPG Dataset

Further Reading

1. `Food Delivery` Dataset

2. `Auto-MPG` Dataset