Lab3: Data Quality & Preprocessing

Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD


Objective: In this lab, you will delve deeper into assessing the quality of datasets and employing preprocessing techniques to properly clean them.


1. Food Delivery Dataset

Consider the Delivery dataset discussed in Lab2. Load the data from Kaggle: Food Delivery Dataset.

   Order_ID  Distance_km  Weather  Traffic_Level  Time_of_Day  Vehicle_Type  Preparation_Time_min  Courier_Experience_yrs  Delivery_Time_min
0       522         7.93    Windy            Low    Afternoon       Scooter                    12                     1.0                 43
1       738        16.42    Clear         Medium      Evening          Bike                    20                     2.0                 84
2       741         9.52    Foggy            Low        Night       Scooter                    28                     1.0                 59
3       661         7.44    Rainy         Medium    Afternoon       Scooter                     5                     1.0                 37
4       412        19.03    Clear            Low      Morning          Bike                    16                     5.0                 68

A. Overview:

  1. Drop the column Order_ID. What are the dimensions of this dataset?

  2. Compute a statistical summary to briefly understand each individual column.

  3. Visualize the distribution of each column and describe the graph briefly.

  4. Address the problems related to the quality of this dataset.
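A possible starting point for steps 1 and 2 is sketched below. It is only a sketch: the frame is rebuilt from the five rows printed above so that the snippet runs on its own, and the variable names (data, num_summary, cat_summary) are our own choices. With the real data, replace the hand-built frame with pd.read_csv on the CSV file you downloaded from Kaggle.

```python
import pandas as pd

# The five rows printed in the lab sheet, used here only so the example runs
# without the Kaggle file. In practice: data = pd.read_csv(<your CSV path>).
data = pd.DataFrame({
    "Order_ID": [522, 738, 741, 661, 412],
    "Distance_km": [7.93, 16.42, 9.52, 7.44, 19.03],
    "Weather": ["Windy", "Clear", "Foggy", "Rainy", "Clear"],
    "Traffic_Level": ["Low", "Medium", "Low", "Medium", "Low"],
    "Time_of_Day": ["Afternoon", "Evening", "Night", "Afternoon", "Morning"],
    "Vehicle_Type": ["Scooter", "Bike", "Scooter", "Scooter", "Bike"],
    "Preparation_Time_min": [12, 20, 28, 5, 16],
    "Courier_Experience_yrs": [1.0, 2.0, 1.0, 1.0, 5.0],
    "Delivery_Time_min": [43, 84, 59, 37, 68],
})

# Step 1: Order_ID is an identifier, not a feature, so drop it.
data = data.drop(columns=["Order_ID"])
n_rows, n_cols = data.shape
print(f"{n_rows} rows x {n_cols} columns")

# Step 2: summarize numeric and non-numeric columns separately.
num_summary = data.describe()  # count, mean, std, min, quartiles, max
cat_summary = data.select_dtypes(exclude="number").describe()  # count, unique, top, freq
print(num_summary)
print(cat_summary)
```

For step 3, histograms suit the numeric columns and bar charts of value counts suit the categorical ones. Note that the count row of describe() only counts non-missing values, so on the full dataset it already hints at the quality problems asked about in step 4.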


# To do

B. Handling Missing Data:

You already visualized all the columns of this dataset in the previous section. We now study the nature of the missing values.

  1. Compute the percentage of missing values within each column.

  2. Create two lists:

    • NA_list: a list of the names of all columns with missing values.
    • NoNA_list: a list of the names of all columns without missing values.

  3. Show the distribution of each column in NoNA_list, both before and after removing the rows with missing values in the columns of NA_list.

  4. Compute the statistical summary of each column in NoNA_list, both before and after removing missing values based on NA_list.

  5. Deduce the nature of the missing values (e.g., MCAR, MAR, or MNAR).

  6. Can you explain the main reasoning that may lead to this conclusion?

  7. How would you handle these missing values based on your conclusion?
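Steps 1, 2 and 4 can be sketched as follows. The small frame toy and its values are made up purely for illustration (they are not taken from the Kaggle file), and the names na_pct, complete and comparison are our own. The idea is to compare the fully observed columns before and after dropping the incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries (illustration only).
toy = pd.DataFrame({
    "Distance_km": [7.93, 16.42, 9.52, 7.44, 19.03, 12.10],
    "Weather": ["Windy", None, "Foggy", "Rainy", "Clear", None],
    "Courier_Experience_yrs": [1.0, 2.0, np.nan, 1.0, 5.0, 3.0],
    "Delivery_Time_min": [43, 84, 59, 37, 68, 55],
})

# Step 1: percentage of missing values per column.
na_pct = toy.isna().mean() * 100
print(na_pct.round(2))

# Step 2: split the column names by presence of missing values.
NA_list = na_pct[na_pct > 0].index.tolist()
NoNA_list = na_pct[na_pct == 0].index.tolist()

# Step 4: drop every row that is missing a value in any NA_list column,
# then compare summaries of the fully observed columns side by side.
complete = toy.dropna(subset=NA_list)
before = toy[NoNA_list].describe()
after = complete[NoNA_list].describe()
comparison = pd.concat({"before": before, "after": after}, axis=1)
print(comparison)
```

If the summaries (and the distributions from step 3) barely change after the rows are removed, this is consistent with values missing completely at random (MCAR), although it does not prove it. A clear shift suggests that missingness depends on other observed columns, which points towards MAR.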

# To do

C. Analyzing Connections Between Columns

  1. Impact of Weather on Delivery Time: determine if weather conditions affect delivery time.

  2. Influence of Traffic Level: determine whether traffic levels impact delivery time.

  3. Effect of Vehicle Type: How can we evaluate if the type of vehicle used influences delivery time?

  4. Role of Distance: How can we analyze the relationship between distance and delivery time?

Hint: You can extract such a relation by visualizing two variables at the same time. For example, the impact of Weather on Delivery Time can be visualized using a graph that describes Delivery Time within each Weather condition. A conditional boxplot is a good option for such a task (shown below).
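The hint can be sketched in code as below. The frame toy uses made-up values for illustration only, and plot_conditional_boxplot is a hypothetical helper name of our own. The grouped summary gives the numbers a conditional boxplot would draw (quartiles per group), and the Pearson correlation is one simple way to quantify the link between two numeric columns such as distance and delivery time.

```python
import pandas as pd

# Hypothetical toy data (illustration only, not the Kaggle values).
toy = pd.DataFrame({
    "Weather": ["Clear", "Clear", "Rainy", "Rainy", "Foggy", "Foggy"],
    "Distance_km": [5.0, 10.0, 5.0, 10.0, 5.0, 10.0],
    "Delivery_Time_min": [30, 45, 40, 60, 38, 55],
})

# Categorical vs. numeric: summary of delivery time within each weather group.
by_weather = toy.groupby("Weather")["Delivery_Time_min"].describe()
print(by_weather)

# Numeric vs. numeric: Pearson correlation between distance and delivery time.
r = toy["Distance_km"].corr(toy["Delivery_Time_min"])
print(f"correlation(distance, delivery time) = {r:.2f}")


def plot_conditional_boxplot(df, group_col, value_col="Delivery_Time_min"):
    """Draw one box of value_col per category of group_col (needs matplotlib)."""
    import matplotlib.pyplot as plt

    df.boxplot(column=value_col, by=group_col)
    plt.suptitle("")  # remove the automatic "Boxplot grouped by ..." title
    plt.ylabel(value_col)
    plt.show()
```

Calling plot_conditional_boxplot(data, "Weather") on the real data, and then again with "Traffic_Level" and "Vehicle_Type", covers questions 1 to 3. For question 4, a scatterplot of Distance_km against Delivery_Time_min is the natural companion to the correlation.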

Further Reading