Lab3: Data Quality & Preprocessing

Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD


Objective: In this lab, you will delve deeper into assessing the quality of datasets and employing preprocessing techniques to properly clean them.


1. Food Delivery Dataset

Consider the food delivery dataset discussed in Lab2. Download and load the data from Kaggle: Food Delivery Dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("denkuznetz/food-delivery-time-prediction")

# Import data
import pandas as pd
data = pd.read_csv(path + "/Food_Delivery_Times.csv")
data.head()
   Order_ID  Distance_km  Weather  Traffic_Level  Time_of_Day  Vehicle_Type  Preparation_Time_min  Courier_Experience_yrs  Delivery_Time_min
0       522         7.93    Windy            Low    Afternoon       Scooter                    12                     1.0                 43
1       738        16.42    Clear         Medium      Evening          Bike                    20                     2.0                 84
2       741         9.52    Foggy            Low        Night       Scooter                    28                     1.0                 59
3       661         7.44    Rainy         Medium    Afternoon       Scooter                     5                     1.0                 37
4       412        19.03    Clear            Low      Morning          Bike                    16                     5.0                 68

A. Address the problems related to the quality of this dataset.

Answer:


  • Identifying Missing Values:

    • How many columns contain missing values?

    • How many missing values are in each of those columns?

  • Analyze a Column with Missing Values:

    • Choose one column with missing values.

    • Visualize the distribution of the remaining columns before and after dropping the missing values in the chosen column.

    • What do you think is the nature of these missing values (e.g., MCAR, MAR, or MNAR)?

    • Why might this be the case?

  • Repeat the Analysis:

    • Perform the same analysis for other columns with missing values.
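One possible starting point for part A is sketched below. It uses a small made-up table (`toy`) that only borrows the column names from the delivery data, so the numbers carry no meaning; with the real data you would replace `toy` by `data`. The choice of `Weather` as the column to study and of `Delivery_Time_min` as the column to compare are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the delivery data (all values are made up).
toy = pd.DataFrame({
    "Distance_km": [7.9, 16.4, 9.5, 7.4, 19.0, 12.3],
    "Weather": ["Windy", "Clear", None, "Rainy", "Clear", None],
    "Courier_Experience_yrs": [1.0, 2.0, np.nan, 1.0, 5.0, 3.0],
    "Delivery_Time_min": [43, 84, 59, 37, 68, 55],
})

# Count missing values per column and keep only the columns that have any.
na_counts = toy.isna().sum()
na_counts = na_counts[na_counts > 0]
print(na_counts)

# Summarize another column before and after dropping the rows
# where the chosen column is missing.
chosen = "Weather"
before = toy["Delivery_Time_min"].describe()
after = toy.dropna(subset=[chosen])["Delivery_Time_min"].describe()
comparison = pd.DataFrame({"before": before, "after": after})
print(comparison)
```

The numeric summary is only a first check. Histograms or bar plots of the same two subsets give the visual comparison the questions ask for, and a clear shift between them would argue against the values being missing completely at random.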

B. Drop all rows with at least one missing value:

  • Visualize the distribution of the columns without missing values before and after dropping all rows that contain at least one missing value.

  • What observations can you make from these visualizations?

  • Impute the missing values using an appropriate method (e.g., mean, median, mode, or advanced imputation techniques).
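The two options in part B, dropping incomplete rows and imputing, can be sketched as follows. The table is again a made-up stand-in, and the rule used here (median for numeric columns, mode for categorical ones) is only one simple choice among those listed above, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the delivery data (all values are made up).
toy = pd.DataFrame({
    "Distance_km": [7.9, 16.4, 9.5, 7.4, 19.0, 12.3],
    "Weather": ["Windy", "Clear", None, "Rainy", "Clear", None],
    "Courier_Experience_yrs": [1.0, 2.0, np.nan, 1.0, 5.0, 3.0],
    "Delivery_Time_min": [43, 84, 59, 37, 68, 55],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()
print(f"{len(toy)} rows before, {len(dropped)} rows after dropping")

# Option 2: impute instead of dropping.
imputed = toy.copy()
for col in imputed.columns:
    if not imputed[col].isna().any():
        continue
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Numeric column: fill with the median, which is robust to outliers.
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        # Categorical column: fill with the most frequent category.
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

Comparing the distributions of `dropped` and `imputed` against the original table shows how much each strategy distorts the data.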

C. Analyzing Connections Between Columns

  • Impact of Weather on Delivery Time: Determine whether weather conditions affect delivery time.

  • Influence of Traffic Level: Determine whether traffic levels affect delivery time.

  • Effect of Vehicle Type: How can we evaluate if the type of vehicle used influences delivery time?

  • Role of Distance: How can we analyze the relationship between distance and delivery time?
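A minimal sketch for part C, assuming grouped summaries for the categorical columns and a Pearson correlation for distance. The values below are made up, and `Traffic_Level` stands in for any of the categorical columns (`Weather`, `Vehicle_Type`). Group-wise boxplots, a scatter plot, or a formal test such as ANOVA would be natural next steps.

```python
import pandas as pd

# Toy stand-in for the delivery data (all values are made up).
toy = pd.DataFrame({
    "Distance_km": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "Traffic_Level": ["Low", "Low", "Medium", "Medium", "High", "High"],
    "Delivery_Time_min": [20, 28, 40, 46, 62, 70],
})

# Categorical vs. quantitative: compare delivery time across groups.
by_traffic = (
    toy.groupby("Traffic_Level")["Delivery_Time_min"]
    .agg(["mean", "median", "count"])
)
print(by_traffic)

# Quantitative vs. quantitative: Pearson correlation between
# distance and delivery time.
r = toy["Distance_km"].corr(toy["Delivery_Time_min"])
print(f"Pearson correlation: {r:.3f}")
```

Large differences between group means suggest an effect of the categorical column, although confounding with distance should be checked before drawing conclusions.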

2. Auto-MPG Dataset

This dataset contains the specifications of various cars and is available on Kaggle. For more details, read here.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("uciml/autompg-dataset")
auto = pd.read_csv(path + '/auto-mpg.csv')
auto.head()
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
0  18.0          8         307.0         130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0         165    3693          11.5          70       1  buick skylark 320
2  18.0          8         318.0         150    3436          11.0          70       1  plymouth satellite
3  16.0          8         304.0         150    3433          12.0          70       1  amc rebel sst
4  17.0          8         302.0         140    3449          10.5          70       1  ford torino

  • Investigate factors that may affect the quality of the dataset and implement appropriate measures to handle them.
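Some generic checks that may help with this investigation are sketched below on a made-up table. The `"?"` placeholder in `horsepower` is an assumption of the sketch; whether the real file contains such entries, and in which columns, is for you to verify with `auto.dtypes` and similar checks.

```python
import pandas as pd

# Toy stand-in for the Auto-MPG data (all values are made up).
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 25.0],
    "horsepower": ["130", "165", "?", "95"],  # "?" is a hypothetical placeholder
    "car name": ["car a", "car b", "car c", "car d"],
})

# 1. A numeric column stored as text usually hides non-numeric entries.
print(toy.dtypes)

# 2. Coerce to numeric; entries that cannot be parsed become NaN.
as_num = pd.to_numeric(toy["horsepower"], errors="coerce")

# 3. List the original entries that failed to parse.
bad = toy.loc[as_num.isna() & toy["horsepower"].notna(), "horsepower"]
print(bad.tolist())

# 4. Keep the numeric version so the hidden missing values become visible.
toy["horsepower"] = as_num
print(toy.isna().sum())

# 5. Count fully duplicated rows.
n_dupes = int(toy.duplicated().sum())
print(f"duplicated rows: {n_dupes}")
```

Once hidden missing values are exposed as `NaN`, the dropping and imputation strategies from the first exercise apply unchanged.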

Further Reading