Lab2: Univariate Analysis

Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD

Objective: In this lab, you will explore the columns of a dataset according to their data types. Your task is to employ various techniques, including statistical values and graphical representations, to understand the dataset before conducting deeper analysis.

The notebook of this Lab can be downloaded here: Lab2_Univariate_Analysis.ipynb.
Or you can work directly with Google Colab here: Lab2_Univariate_Analysis.ipynb.

1. `Food Delivery` Dataset

This dataset contains food delivery times together with factors that influence them, such as distance, weather, traffic conditions, and time of day. It offers a practical and engaging challenge for machine learning practitioners, especially those interested in logistics and operations research. Read and load the data from Kaggle: Food Delivery Dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("denkuznetz/food-delivery-time-prediction")

# Import data
import pandas as pd
data = pd.read_csv(path + "/Food_Delivery_Times.csv")
data.head()

	Order_ID	Distance_km	Weather	Traffic_Level	Time_of_Day	Vehicle_Type	Preparation_Time_min	Courier_Experience_yrs	Delivery_Time_min
0	522	7.93	Windy	Low	Afternoon	Scooter	12	1.0	43
1	738	16.42	Clear	Medium	Evening	Bike	20	2.0	84
2	741	9.52	Foggy	Low	Night	Scooter	28	1.0	59
3	661	7.44	Rainy	Medium	Afternoon	Scooter	5	1.0	37
4	412	19.03	Clear	Low	Morning	Bike	16	5.0	68

A. What’s the dimension of the data? Which variables are considered quantitative and which are qualitative?

Answer:

Qualitative columns are: Weather, Traffic_Level, Time_of_Day, Vehicle_Type.
Quantitative: Distance_km, …

print(f"The dimension of the data is {data.shape}")
data.drop(columns = ['Order_ID'], inplace=True)

The dimension of the data is (1000, 9)

data.dtypes.to_frame().T

	Distance_km	Weather	Traffic_Level	Time_of_Day	Vehicle_Type	Preparation_Time_min	Courier_Experience_yrs	Delivery_Time_min
0	float64	object	object	object	object	int64	float64	int64

print(f"* Qualitative columns are {list(data.select_dtypes(include=['object']).columns)}")
print(f"* Quantitative columns are {list(data.select_dtypes(include=['number']).columns)}")

* Qualitative columns are ['Weather', 'Traffic_Level', 'Time_of_Day', 'Vehicle_Type']
* Quantitative columns are ['Distance_km', 'Preparation_Time_min', 'Courier_Experience_yrs', 'Delivery_Time_min']

Are there any rows with missing values?

Yes. Here are the missing-value counts per column:

data.isna().sum().to_frame().T

	Distance_km	Weather	Traffic_Level	Time_of_Day	Vehicle_Type	Preparation_Time_min	Courier_Experience_yrs	Delivery_Time_min
0	0	30	30	30	0	0	30	0

Are there any duplicated data?

No. The check below counts the fully duplicated rows:

data.duplicated().sum()

Handling missing values is more complicated than you may expect. Here, we can simply drop those rows.

data.dropna(inplace=True)  # inplace=True modifies `data` directly instead of returning a new DataFrame
data.shape

(883, 8)
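Dropping rows is only one of several options. As a point of comparison, here is a minimal sketch of a common alternative, simple imputation, on a small made-up frame. The values in `toy` are hypothetical; only the two column names are borrowed from the dataset. Missing categories are filled with the most frequent level and missing numbers with the median.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Weather": ["Clear", None, "Rainy", "Clear"],
    "Courier_Experience_yrs": [1.0, 2.0, np.nan, 6.0],
})

# Option 1: keep only the rows without any missing value.
dropped = toy.dropna()

# Option 2: fill missing values instead of dropping rows.
imputed = toy.copy()
imputed["Weather"] = imputed["Weather"].fillna(imputed["Weather"].mode()[0])
imputed["Courier_Experience_yrs"] = imputed["Courier_Experience_yrs"].fillna(
    imputed["Courier_Experience_yrs"].median()
)

print(dropped.shape, imputed.shape)
```

Imputation keeps every row but changes the distribution of the filled columns, so which option is preferable depends on how many values are missing and why.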

B. Qualitative variables:

Create a statistical summary of the qualitative columns.
Create graphical representations of these qualitative columns to understand them better.
Explain each column based on the statistical values and graphs.

data[['Weather']].value_counts().to_frame().T  # Compute the frequency

Weather	Clear	Rainy	Foggy	Snowy	Windy
count	425	188	98	86	86

data[['Weather']].value_counts(normalize=True).to_frame().T  # Compute the relative frequency/ proportion

Weather	Clear	Rainy	Foggy	Snowy	Windy
proportion	0.481314	0.212911	0.110985	0.097395	0.097395

data[['Vehicle_Type']].value_counts(normalize=True).to_frame().T

Vehicle_Type	Bike	Scooter	Car
proportion	0.510759	0.294451	0.19479
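The frequency and relative-frequency tables above can also be produced side by side. The sketch below uses a hypothetical helper, `freq_table`, which is not part of the lab, and rebuilds the Weather column from the counts reported above so that it runs on its own.

```python
import pandas as pd

def freq_table(column: pd.Series) -> pd.DataFrame:
    """Return counts and proportions of each category, most frequent first."""
    counts = column.value_counts()  # missing values are ignored by default
    return pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})

# Rebuild the Weather column from the counts reported above.
weather_counts = {"Clear": 425, "Rainy": 188, "Foggy": 98, "Snowy": 86, "Windy": 86}
weather = pd.Series(
    [level for level, n in weather_counts.items() for _ in range(n)],
    name="Weather",
)

table = freq_table(weather)
print(table.round(3))
```

On the real data, the same helper could be applied to each qualitative column in turn, for example `freq_table(data["Traffic_Level"])`.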

Graphs: qualitative columns

import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(2, 2, figsize=(10, 7))  # Create a 2-by-2 grid of subplots in a 10-by-7 inch figure

sns.countplot(data, x="Weather", stat = 'percent', ax=axs[1,0])    # Draw on the bottom-left subplot
axs[1,0].set_title('Countplot of Weather', fontsize = 15)
axs[1,0].bar_label(axs[1,0].containers[0], fmt="%.2f", fontsize=10) # Label each bar with its percentage, 2 decimals

sns.countplot(data, x="Vehicle_Type", stat = 'percent', ax=axs[0,1])    # Draw on the top-right subplot
axs[0,1].set_title('Countplot of Vehicle Types', fontsize = 15)
axs[0,1].bar_label(axs[0,1].containers[0], fmt="%.2f", fontsize=10) # Label each bar with its percentage, 2 decimals

sns.countplot(data, x="Time_of_Day", stat = 'percent', ax=axs[0,0])    # Draw on the top-left subplot
axs[0,0].set_title('Countplot of Time of Day', fontsize = 15)
axs[0,0].bar_label(axs[0,0].containers[0], fmt="%.2f", fontsize=10) # Label each bar with its percentage, 2 decimals

sns.countplot(data, x="Traffic_Level", stat = 'percent', ax=axs[1,1])    # Draw on the bottom-right subplot
axs[1,1].set_title('Countplot of Traffic level', fontsize = 15)
axs[1,1].bar_label(axs[1,1].containers[0], fmt="%.2f", fontsize=10) # Label each bar with its percentage, 2 decimals
plt.tight_layout()
plt.show()

C. Quantitative variables:

Create a statistical summary of the quantitative columns.
Create graphical representations of these quantitative columns to understand them better.
Explain each column based on the statistical values and graphs.
Are there any columns with outliers?

data.describe().drop(index=['count'])

	Distance_km	Preparation_Time_min	Courier_Experience_yrs	Delivery_Time_min
mean	10.051586	17.019253	4.639864	56.425821
std	5.688582	7.260201	2.922172	21.568482
min	0.590000	5.000000	0.000000	8.000000
25%	5.130000	11.000000	2.000000	41.000000
50%	10.280000	17.000000	5.000000	55.000000
75%	15.025000	23.000000	7.000000	71.000000
max	19.990000	29.000000	9.000000	141.000000
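One way to approach the outlier question from this summary alone is the 1.5 × IQR rule, the same rule a standard box plot uses to place its whiskers. The sketch below copies the minimum, quartiles, and maximum from the table above and flags a column when its minimum or maximum falls outside the fences. The 1.5 multiplier is a convention, not something prescribed by the lab.

```python
import pandas as pd

# min, Q1 (25%), Q3 (75%) and max copied from the describe() table above.
summary = pd.DataFrame(
    {
        "Distance_km": [0.59, 5.13, 15.025, 19.99],
        "Preparation_Time_min": [5, 11, 23, 29],
        "Courier_Experience_yrs": [0, 2, 7, 9],
        "Delivery_Time_min": [8, 41, 71, 141],
    },
    index=["min", "25%", "75%", "max"],
).T  # one row per variable

# Interquartile range and the usual 1.5 * IQR fences.
iqr = summary["75%"] - summary["25%"]
summary["lower_fence"] = summary["25%"] - 1.5 * iqr
summary["upper_fence"] = summary["75%"] + 1.5 * iqr

# A column has at least one outlier if its extremes cross a fence.
summary["has_outliers"] = (summary["min"] < summary["lower_fence"]) | (
    summary["max"] > summary["upper_fence"]
)
print(summary)
```

For Delivery_Time_min, the IQR is 71 - 41 = 30, so the upper fence is 71 + 1.5 × 30 = 116. The maximum of 141 lies above it, so this column contains high outliers under this rule, while the other three columns stay within their fences.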

sns.histplot(data, x='Delivery_Time_min')

sns.boxplot(data, x="Delivery_Time_min")

sns.violinplot(data, x="Delivery_Time_min")

fig, axs = plt.subplots(2, 2, figsize = (10, 7))

for i, va in enumerate(data.select_dtypes(include="number").columns):
  sns.histplot(data, x=va, ax=axs[i // 2, i % 2], kde=True)
  axs[i // 2, i % 2].set_title(f"{va} Histogram")

plt.tight_layout()
plt.show()

2. Cardiovascular Disease dataset

This dataset consists of 70,000 patient records with 11 features and a column indicating the presence or absence of cardiovascular disease. The data can be downloaded from Kaggle using the following link: Cardiovascular Disease dataset.

import kagglehub
import pandas as pd
# Download latest version
path = kagglehub.dataset_download("sulianova/cardiovascular-disease-dataset")

data = pd.read_csv(path + "/cardio_train.csv", sep=";")
data.head()

	id	age	gender	height	weight	ap_hi	ap_lo	cholesterol	gluc	active	cardio
0	0	18393	2	168	62.0	110	80	1	1	1	0
1	1	20228	1	156	85.0	140	90	3	1	1	1
2	2	18857	1	165	64.0	130	70	3	1	0	1
3	3	17623	2	169	82.0	150	100	1	1	1	1
4	4	17474	1	156	56.0	100	60	1	1	0	0

Column type conversion:

data.dtypes.to_frame().T

	id	age	gender	height	weight	ap_hi	ap_lo	cholesterol	gluc	smoke	alco	active	cardio
0	int64	int64	int64	int64	float64	int64	int64	int64	int64	int64	int64	int64	int64

data['gender'] = data['gender'].astype('category')
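The same conversion can be applied to several integer-coded columns at once. The sketch below uses the first three rows of the preview above with only four of its columns. Treating `cholesterol` and `cardio` as categories is an assumption based on their small set of integer codes, and converting `age` to years assumes it is recorded in days, which values around 18 000 suggest.

```python
import pandas as pd

# First three rows copied from data.head() above (subset of columns).
toy = pd.DataFrame({
    "age": [18393, 20228, 18857],
    "gender": [2, 1, 1],
    "cholesterol": [1, 3, 3],
    "cardio": [0, 1, 1],
})

# Assumption: these integer codes label groups rather than measure quantities.
coded_cols = ["gender", "cholesterol", "cardio"]
toy = toy.astype({col: "category" for col in coded_cols})

# Assumption: age is stored in days; convert to years for readability.
toy["age_years"] = toy["age"] / 365.25

print(toy.dtypes)
```

After the conversion, `select_dtypes(include=['number'])` no longer returns the coded columns, so they are summarized with frequency tables rather than means and quartiles.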

Task: Analyze each column of the provided data by answering the questions listed in the previous section.
