TP6 - Data Wrangling

Course: EDA & Unsupervised Learning
M1-DAS
Lecturer: HAS Sothea, PhD

Objective: In practice, real insights from data often require data transformation and manipulation. Data wrangling is the process of cleaning, structuring, and organizing raw data into a usable format. It involves tasks like handling missing values, correcting errors, and transforming data to make it suitable for analysis. In this TP, you will learn how to perform some important tasks of data wrangling including: Creating, loading, changing types, aggregating, merging and manipulating data frames in pandas…

The Jupyter Notebook for this Lab can be downloaded here: TP6_Wrangling.ipynb.

1. Creating data frames

Create an artificial data frame:
- Columns: [‘Electricity’, ‘Water’, ‘Currency’]
- Rows: 12 (indexed by month)
- Values: Electricity (Kw), Water ($m^3$), Currency (Dollar or Riel)
Add a column Electricity_Exp to the data, which is the expense on electricity givinig that the price is $800$R or $0.2\$$ per KW.
Add a Water_Exp column to the data, which is the expense on water giving the price of $1000$R or $0.25\$$ per $m^3$.
Add two columns of total expenses in dollar and riel: Total_dollar and Total_riel.
Identify the months with the highest and lowest expenses for:
- Electricity
- Water
- Total expenses.

# To do

2. Food Delivery Dataset

This dataset is designed for predicting food delivery times based on various influencing factors such as distance, weather, traffic conditions, and time of day. It offers a practical and engaging challenge for machine learning practitioners, especially those interested in logistics and operations research. Read and load the data from kaggle: Food Delivery Dataset.

Download the dataset and import it.

# %pip install kagglehub   # if you have not installed "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("denkuznetz/food-delivery-time-prediction")

print(f'The data is downloaded and store at: {path}')


# To do

A. Data overview:

Address the dimension, qualitative and quantitative columns of the dataset.
Create statistical summary of the dataset.
Identify problems and handle them if there is any:
- Missing values,
- Duplicated data,
- Outliers…

# To do

B. Data Statistics

Compute average delivery time according to:
- Weather condition.
- Traffic level.
- Time of the day.
- Vehical type.
Compute average delivery time according to:
- Weather and Traffic level.
- Time of the day and vehical type.
- Weather, Traffic and Time of the day.
Compute average delivery time for
- Courier with more than 3 year experience and those who have less.
- Distances longer and shorter than 10km.
- Different class of the above conditions.

# To do

3. Flight Dataset

In this section, we explore NYC Flights 2013 dataset available in kaggle: https://www.kaggle.com/datasets/aephidayatuloh/nyc-flights-2013. There are five datasets in total:

nyc_airlines.csv: information about the airlines
nyc_airports.csv: information about the airports
nyc_flights.csv: information about the flights
nyc_planes.csv: information about the planes
nyc_weather.csv: information about the weather.

You will explore these datasets and try to answer the questions below. But first, let’s import the datasets.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("aephidayatuloh/nyc-flights-2013")

First, we will explore nyc_flights.csv dataset.

import pandas as pd

flights = pd.read_csv(path + '/nyc_flights.csv')
flights.head()

	year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
0	2013	1	1	517.0	515	2.0	830.0	819	11.0	UA	1545	N14228	EWR	IAH	227.0	1400	5	15	2013-01-01T10:00:00Z
1	2013	1	1	533.0	529	4.0	850.0	830	20.0	UA	1714	N24211	LGA	IAH	227.0	1416	5	29	2013-01-01T10:00:00Z
2	2013	1	1	542.0	540	2.0	923.0	850	33.0	AA	1141	N619AA	JFK	MIA	160.0	1089	5	40	2013-01-01T10:00:00Z
3	2013	1	1	544.0	545	-1.0	1004.0	1022	-18.0	B6	725	N804JB	JFK	BQN	183.0	1576	5	45	2013-01-01T10:00:00Z
4	2013	1	1	554.0	600	-6.0	812.0	837	-25.0	DL	461	N668DN	LGA	ATL	116.0	762	6	0	2013-01-01T11:00:00Z

Answer the following questions:

A. Find and count the number rows with missing values in the dataset. Can you guess what might those rows represent?

B. Find all flights with arrival delay smaller than departure delay.

C. Find all flights departing during weekends (Saturday, Sunday) from La Guardia airport (LGA). Hint: you may need to use functions strptime and strftime of datetime module from datetime library. For example:

from datetime import datetime

day = datetime.strptime(flights['time_hour'][0], '%Y-%m-%dT%H:%M:%SZ')
datetime.strftime(day, "%A")

'Tuesday'

D. Find all flights that landed between 6.pm and 8.pm and took off from one of EWR, JFK, LGA (take the remainder of your birthday when divided by 3, i.e., day % 3 + 1. If the result is 1, the airport is EWR, if the remainder is 2 or 3 then the airport is JFK or LGA accordingly.).

E. Find all flights with distance more than 600 miles and speed larger than 475 miles per hour. Sort theresults by decreasing speed.

F. Find all supersonic flights (speed larger than 700 miles per hour).

G. Add a new column named speed with speed in km per hour.

H. Compute the number of cancelled flights for each day of week.

I. For each carrier, day of week, compute the number of flights departing from JFK.

J. For each carrier, compute the proportion of cancelled flights.

K. Find the carriers that serve at least 10 distinct destinations.

L. Find the carriers that do not serve all origin airports.

M. Find the carriers that serve both ATL and LAX.

N. Find the carriers that serve both ATL and LAX but serve neither IAH nor HNL.

O. Find all destinations that are served every day of June from at least one New York airport.

P. Which carrier is operating the plane (tailnum) that serve the largest number of distinct destinations?

Q. Find the 10 most distant destinations from LGA.

R. Find the destinations served by the largest number of carriers?

S. What is the average number of flights bound to LAX on Sunday?

T. Does there exist a plane (tailnum) that flew out of NY under two distinct carriers? If yes, list the carriers.

U. Find rows such that time_hour, year, month, day, hour, minute are not consistent, i.e.,

1. Creating data frames

2. Food Delivery Dataset

3. Flight Dataset

Answer the following questions:

Further readings