Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD
Objective: In this lab, you will delve deeper into assessing the quality of datasets and employing preprocessing techniques to properly clean them.
1. Food Delivery
Dataset
Let’s consider the Delivery dataset discussed in the previous lab (Lab 2). Read and load the data from Kaggle: Food Delivery Dataset.
import kagglehub
# Download latest version
path = kagglehub.dataset_download("denkuznetz/food-delivery-time-prediction")
# Import data
import pandas as pd
data = pd.read_csv(path + "/Food_Delivery_Times.csv")
data.head()
0 | 522 | 7.93 | Windy | Low | Afternoon | Scooter | 12 | 1.0 | 43
1 | 738 | 16.42 | Clear | Medium | Evening | Bike | 20 | 2.0 | 84
2 | 741 | 9.52 | Foggy | Low | Night | Scooter | 28 | 1.0 | 59
3 | 661 | 7.44 | Rainy | Medium | Afternoon | Scooter | 5 | 1.0 | 37
4 | 412 | 19.03 | Clear | Low | Morning | Bike | 16 | 5.0 | 68
A. Address the problems related to the quality of this dataset.
Answer:
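As a starting point for this question, a per-column summary of types, missing values, and distinct values, plus a count of duplicated rows, usually surfaces the most common quality problems. The sketch below is only one possible approach. The helper `quality_report` is hypothetical, the toy frame uses made-up values, and the column names are assumptions because the header row of the data is not shown here; in the lab, the function would be applied to `data` instead.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing values, and distinct values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })


# Toy frame standing in for `data`; values and column names are illustrative.
toy = pd.DataFrame({
    "Distance_km": [7.93, 16.42, 9.52, 7.93],
    "Weather": ["Windy", None, "Foggy", "Windy"],
    "Courier_Experience_yrs": [1.0, 2.0, np.nan, 1.0],
})

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())  # fully identical rows

print(report)
print("duplicated rows:", n_duplicates)
```

Columns with a non-zero `n_missing`, an unexpected `dtype`, or a suspicious number of distinct values are natural candidates to discuss in the answer.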
B. Drop all rows with at least one missing value:
Visualize the distribution of the columns without missing values before and after dropping all rows that contain at least one missing value.
What observations can you make from these visualizations?
Impute the missing values using an appropriate method (e.g., mean, median, mode, or advanced imputation techniques).
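The steps in part B can be sketched as follows: drop incomplete rows, compare the columns that had no missing values before and after the drop, then impute instead of dropping. This is a minimal sketch under stated assumptions. The helper `impute_simple` is hypothetical, the median/mode rule is just one simple choice among the methods listed above, and the toy frame (made-up values, assumed column names) stands in for `data`.

```python
import numpy as np
import pandas as pd


def impute_simple(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy frame standing in for `data`; values and column names are illustrative.
toy = pd.DataFrame({
    "Delivery_Time_min": [43, 84, 59, 37, 68, 50],
    "Weather": ["Windy", "Clear", None, "Rainy", "Clear", "Clear"],
    "Courier_Experience_yrs": [1.0, 2.0, 1.0, np.nan, 5.0, 3.0],
})

dropped = toy.dropna()        # keep only fully observed rows
imputed = impute_simple(toy)  # keep all rows, fill the gaps

# Compare columns that never had missing values, before and after dropping.
complete_cols = toy.columns[toy.notna().all()]
comparison = pd.concat(
    {
        "before": toy[complete_cols].describe(),
        "after_drop": dropped[complete_cols].describe(),
    },
    axis=1,
)
print(comparison)
```

The numeric summary here is a stand-in for the requested plots: histograms or box plots of the same `complete_cols` before and after the drop show the same comparison graphically. A visible shift between the two suggests the rows being dropped are not a random subset.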
C. Analyzing Connections Between Columns
Impact of Weather on Delivery Time: determine if weather conditions affect delivery time.
Influence of Traffic Level: determine whether traffic levels impact delivery time.
Effect of Vehicle Type: How can we evaluate if the type of vehicle used influences delivery time?
Role of Distance: How can we analyze the relationship between distance and delivery time?
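One way to approach the four questions above is to summarize delivery time within each category (weather, traffic level, vehicle type) and to compute a correlation for the numeric pair (distance and delivery time). The sketch below is an illustration, not the expected solution. The helpers `group_summary` and `correlation_ratio` are hypothetical, the column names are assumptions, and the toy values are made up. The correlation ratio (eta squared) is used here as a simple effect-size measure; a formal test such as ANOVA would be a natural next step.

```python
import pandas as pd


def group_summary(df: pd.DataFrame, cat_col: str, target: str) -> pd.DataFrame:
    """Count, mean, median, and std of the target within each category."""
    stats = df.groupby(cat_col)[target].agg(["count", "mean", "median", "std"])
    return stats.sort_values("mean")


def correlation_ratio(df: pd.DataFrame, cat_col: str, target: str) -> float:
    """Eta squared: share of target variance explained by group means (0 to 1)."""
    sub = df[[cat_col, target]].dropna()
    grand_mean = sub[target].mean()
    groups = sub.groupby(cat_col)[target]
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((sub[target] - grand_mean) ** 2).sum()
    return float(ss_between / ss_total) if ss_total > 0 else 0.0


# Toy frame standing in for `data`; values and column names are illustrative.
toy = pd.DataFrame({
    "Weather": ["Clear", "Clear", "Rainy", "Rainy"],
    "Distance_km": [2.0, 4.0, 6.0, 8.0],
    "Delivery_Time_min": [20, 30, 40, 50],
})

summary = group_summary(toy, "Weather", "Delivery_Time_min")
eta2 = correlation_ratio(toy, "Weather", "Delivery_Time_min")

# Numeric vs. numeric: Pearson correlation between distance and delivery time.
r = toy["Distance_km"].corr(toy["Delivery_Time_min"])

print(summary)
print("eta squared:", eta2)
print("Pearson r:", r)
```

The same two helpers can be reused for the traffic level and vehicle type columns by changing `cat_col`. Box plots per category and a scatter plot of distance against delivery time are the usual visual counterparts.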
2. Auto-MPG
Dataset
This dataset contains the specifications of various cars and is available on Kaggle. For more details, see the dataset's Kaggle page (uciml/autompg-dataset).
import kagglehub
# Download latest version
path = kagglehub.dataset_download("uciml/autompg-dataset")
auto = pd.read_csv(path + '/auto-mpg.csv')
auto.head()
0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu
1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320
2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite
3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst
4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino
- Investigate factors that may affect the quality of the dataset and implement appropriate measures to handle them.
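A common issue worth checking in this kind of file is a numeric column that was read as text because missing entries are written as a placeholder string. The sketch below assumes, without having verified it on the downloaded file, that a column named `horsepower` contains a `"?"` placeholder. The helper `coerce_numeric`, the toy values, and the choice of median imputation are all illustrative.

```python
import pandas as pd


def coerce_numeric(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Convert a column to numeric, turning unparseable entries into NaN."""
    out = df.copy()
    out[column] = pd.to_numeric(out[column], errors="coerce")
    return out


# Toy frame standing in for `auto`; values and column names are illustrative.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 21.0],
    "horsepower": ["130", "165", "?", "90"],  # "?" is an assumed placeholder
    "car name": ["car a", "car b", "car c", "car d"],
})

# Step 1: list the raw entries that cannot be parsed as numbers.
unparseable = pd.to_numeric(toy["horsepower"], errors="coerce").isna()
non_numeric = toy.loc[unparseable, "horsepower"].unique()
print("non-numeric entries:", list(non_numeric))

# Step 2: convert the column, then fill the resulting gaps (median shown here).
clean = coerce_numeric(toy, "horsepower")
clean["horsepower"] = clean["horsepower"].fillna(clean["horsepower"].median())
print(clean.dtypes)
```

Inspecting `auto.dtypes` first shows whether any column that should be numeric was loaded as text. Duplicated rows and implausible values (for example via `auto.describe()`) are also worth checking.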