Lab3: Basic Data Analysis

Course: CSCI-866-001: Data Mining & Knowledge Discovery
Lecturer: Sothea HAS, PhD


Objective: In this lab, you will explore the columns of a dataset according to their data types. Your task is to use summary statistics and graphical representations to understand the dataset before conducting deeper analysis.


1. Food Delivery Dataset

This dataset contains food delivery times together with factors that influence them, such as distance, weather, traffic conditions, and time of day. It offers a practical and engaging challenge for machine learning practitioners, especially those interested in logistics and operations research. Download and load the data from Kaggle: Food Delivery Dataset.

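import os

import kagglehub
import pandas as pd

# Download the latest version of the dataset; this returns the local folder path
path = kagglehub.dataset_download("denkuznetz/food-delivery-time-prediction")

# Load the CSV file and preview the first five rows
data = pd.read_csv(os.path.join(path, "Food_Delivery_Times.csv"))
data.head()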
   Order_ID  Distance_km  Weather  Traffic_Level  Time_of_Day  Vehicle_Type  Preparation_Time_min  Courier_Experience_yrs  Delivery_Time_min
0       522         7.93    Windy            Low    Afternoon       Scooter                    12                     1.0                 43
1       738        16.42    Clear         Medium      Evening          Bike                    20                     2.0                 84
2       741         9.52    Foggy            Low        Night       Scooter                    28                     1.0                 59
3       661         7.44    Rainy         Medium    Afternoon       Scooter                     5                     1.0                 37
4       412        19.03    Clear            Low      Morning          Bike                    16                     5.0                 68

A. What are the dimensions of the data? Which variables are quantitative and which are qualitative?

Answer:

  • Are there any rows with missing values?

  • Are there any duplicated rows?

  • Handling missing values can be more complicated than you might expect. Here, we can simply drop the rows that contain them.
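One possible way to approach the checks in part A is sketched below. So that it runs without a download, it uses a stand-in DataFrame called `sample`, built from the five rows printed by `data.head()` above; the helper names `overview` and `drop_incomplete` are hypothetical and not part of the lab. In your own answer, pass the full `data` instead.

```python
import pandas as pd

# Stand-in for the Kaggle file: the five rows printed by data.head() above.
sample = pd.DataFrame({
    "Order_ID": [522, 738, 741, 661, 412],
    "Distance_km": [7.93, 16.42, 9.52, 7.44, 19.03],
    "Weather": ["Windy", "Clear", "Foggy", "Rainy", "Clear"],
    "Traffic_Level": ["Low", "Medium", "Low", "Medium", "Low"],
    "Time_of_Day": ["Afternoon", "Evening", "Night", "Afternoon", "Morning"],
    "Vehicle_Type": ["Scooter", "Bike", "Scooter", "Scooter", "Bike"],
    "Preparation_Time_min": [12, 20, 28, 5, 16],
    "Courier_Experience_yrs": [1.0, 2.0, 1.0, 1.0, 5.0],
    "Delivery_Time_min": [43, 84, 59, 37, 68],
})


def overview(df):
    """Hypothetical helper: shape, column types, missing and duplicate counts."""
    return {
        "shape": df.shape,  # (number of rows, number of columns)
        # Split by dtype only. Order_ID is stored as a number but is really
        # an identifier, so you may want to treat it separately.
        "quantitative": df.select_dtypes(include="number").columns.tolist(),
        "qualitative": df.select_dtypes(exclude="number").columns.tolist(),
        "missing_per_column": df.isna().sum().to_dict(),
        "rows_with_missing": int(df.isna().any(axis=1).sum()),
        "duplicated_rows": int(df.duplicated().sum()),
    }


def drop_incomplete(df):
    """Hypothetical helper: drop rows with missing values, then exact duplicates."""
    return df.dropna().drop_duplicates().reset_index(drop=True)


info = overview(sample)
print(info["shape"])
print(info["qualitative"])
```

Splitting columns by dtype is only a first guess: always compare it with what each column means before deciding how to summarize it.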

B. Qualitative variables:

  • Create a statistical summary of the qualitative columns.

  • Create graphical representations of these qualitative columns to understand them better.

  • Explain each column based on the statistical values and graphs.
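A minimal sketch for part B follows, assuming frequency tables and bar charts are an acceptable summary for categorical columns. The stand-in `sample` holds the qualitative columns of the five preview rows shown earlier, and the helpers `summarize_qualitative` and `plot_qualitative` are hypothetical names introduced for illustration. The plotting helper assumes matplotlib is installed and imports it only when called.

```python
import pandas as pd

# Qualitative columns of the five rows printed by data.head() above.
sample = pd.DataFrame({
    "Weather": ["Windy", "Clear", "Foggy", "Rainy", "Clear"],
    "Traffic_Level": ["Low", "Medium", "Low", "Medium", "Low"],
    "Time_of_Day": ["Afternoon", "Evening", "Night", "Afternoon", "Morning"],
    "Vehicle_Type": ["Scooter", "Bike", "Scooter", "Scooter", "Bike"],
})


def summarize_qualitative(df, columns):
    """Hypothetical helper: count and proportion of each category, per column."""
    tables = {}
    for col in columns:
        counts = df[col].value_counts()
        tables[col] = pd.DataFrame({
            "count": counts,
            "proportion": (counts / len(df)).round(3),
        })
    return tables


def plot_qualitative(df, columns):
    """Hypothetical helper: one bar chart of category counts per column."""
    import matplotlib.pyplot as plt  # assumed to be installed

    fig, axes = plt.subplots(1, len(columns),
                             figsize=(4 * len(columns), 3), squeeze=False)
    for ax, col in zip(axes[0], columns):
        df[col].value_counts().plot.bar(ax=ax, title=col)
        ax.set_ylabel("count")
    fig.tight_layout()
    return fig


tables = summarize_qualitative(sample, list(sample.columns))
print(tables["Weather"])
```

Calling `plot_qualitative(data, list_of_qualitative_columns)` on the full dataset draws the charts. When explaining each column, look at the most frequent category and at how balanced the categories are.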

C. Quantitative variables:

  • Create a statistical summary of the quantitative columns.

  • Create graphical representations of these quantitative columns to understand them better.

  • Explain each column based on the statistical values and graphs.

  • Are there any columns with outliers?
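Part C could be sketched as follows. The outlier check uses the common 1.5 × IQR rule, which is an assumption here: the lab does not prescribe a particular method. The stand-in `sample` again holds the quantitative columns of the five preview rows, and `iqr_outliers` and `plot_quantitative` are hypothetical helper names. With only five rows the flags are purely illustrative, so run the same code on the full `data`.

```python
import pandas as pd

# Quantitative columns of the five rows printed by data.head() above.
sample = pd.DataFrame({
    "Distance_km": [7.93, 16.42, 9.52, 7.44, 19.03],
    "Preparation_Time_min": [12, 20, 28, 5, 16],
    "Courier_Experience_yrs": [1.0, 2.0, 1.0, 1.0, 5.0],
    "Delivery_Time_min": [43, 84, 59, 37, 68],
})

# count, mean, std, min, quartiles, and max for every numeric column
summary = sample.describe().T


def iqr_outliers(series, k=1.5):
    """Hypothetical helper: values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


outlier_counts = {col: len(iqr_outliers(sample[col])) for col in sample.columns}


def plot_quantitative(df, columns):
    """Hypothetical helper: histogram (top row) and boxplot (bottom row) per column."""
    import matplotlib.pyplot as plt  # assumed to be installed

    fig, axes = plt.subplots(2, len(columns),
                             figsize=(4 * len(columns), 6), squeeze=False)
    for j, col in enumerate(columns):
        df[col].plot.hist(ax=axes[0, j], bins=20, title=col)
        df[col].plot.box(ax=axes[1, j])
    fig.tight_layout()
    return fig


print(summary[["mean", "std", "min", "max"]])
print(outlier_counts)
```

The boxplots drawn by `plot_quantitative` use the same 1.5 × IQR whiskers by default, so points plotted beyond the whiskers should match the values returned by `iqr_outliers`.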

Further Reading