# To do
TP6 - Data Wrangling
Course: EDA & Unsupervised Learning
M1-DAS
Lecturer: HAS Sothea, PhD
Objective: In practice, real insights from data often require data transformation and manipulation. Data wrangling
is the process of cleaning, structuring, and organizing raw data into a usable format. It involves tasks like handling missing values, correcting errors, and transforming data to make it suitable for analysis. In this TP, you will learn how to perform some important tasks of data wrangling including: Creating, loading, changing types, aggregating, merging and manipulating data frames in pandas
…
The
Jupyter Notebook
for this Lab can be downloaded here: TP6_Wrangling.ipynb.
1. Creating data frames
Create an artificial data frame:
- Columns: [‘Electricity’, ‘Water’, ‘Currency’]
- Rows: 12 (indexed by month)
- Values: Electricity (Kw), Water (\(m^3\)), Currency (Dollar or Riel)
Add a column
Electricity_Exp
to the data, which is the expense on electricity givinig that the price is \(800\)R or \(0.2\$\) per KW.Add a
Water_Exp
column to the data, which is the expense on water giving the price of \(1000\)R or \(0.25\$\) per \(m^3\).Add two columns of total expenses in dollar and riel:
Total_dollar
andTotal_riel
.Identify the months with the highest and lowest expenses for:
- Electricity
- Water
- Total expenses.
2. Food Delivery Dataset
This dataset is designed for predicting food delivery times based on various influencing factors such as distance, weather, traffic conditions, and time of day. It offers a practical and engaging challenge for machine learning practitioners, especially those interested in logistics and operations research. Read and load the data from kaggle: Food Delivery Dataset
.
Download the dataset and import it.
# %pip install kagglehub # if you have not installed "kagglehub" module yet
import kagglehub
# Download latest version
= kagglehub.dataset_download("denkuznetz/food-delivery-time-prediction")
path
print(f'The data is downloaded and store at: {path}')
# To do
A. Data overview:
- Address the dimension, qualitative and quantitative columns of the dataset.
- Create statistical summary of the dataset.
- Identify problems and handle them if there is any:
- Missing values,
- Duplicated data,
- Outliers…
# To do
B. Data Statistics
- Compute average delivery time according to:
- Weather condition.
- Traffic level.
- Time of the day.
- Vehical type.
- Compute average delivery time according to:
- Weather and Traffic level.
- Time of the day and vehical type.
- Weather, Traffic and Time of the day.
- Compute average delivery time for
- Courier with more than 3 year experience and those who have less.
- Distances longer and shorter than 10km.
- Different class of the above conditions.
# To do
3. Flight Dataset
In this section, we explore NYC Flights 2013
dataset available in kaggle: https://www.kaggle.com/datasets/aephidayatuloh/nyc-flights-2013. There are five datasets in total:
nyc_airlines.csv
: information about the airlinesnyc_airports.csv
: information about the airportsnyc_flights.csv
: information about the flightsnyc_planes.csv
: information about the planesnyc_weather.csv
: information about the weather.
You will explore these datasets and try to answer the questions below. But first, let’s import the datasets.
import kagglehub
# Download latest version
= kagglehub.dataset_download("aephidayatuloh/nyc-flights-2013") path
First, we will explore nyc_flights.csv
dataset.
import pandas as pd
= pd.read_csv(path + '/nyc_flights.csv')
flights flights.head()
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
Answer the following questions:
A. Find and count the number rows with missing values in the dataset. Can you guess what might those rows represent?
B. Find all flights with arrival delay smaller than departure delay.
C. Find all flights departing during weekends (Saturday, Sunday) from La Guardia airport (LGA). Hint
: you may need to use functions strptime
and strftime
of datetime
module from datetime
library. For example:
from datetime import datetime
= datetime.strptime(flights['time_hour'][0], '%Y-%m-%dT%H:%M:%SZ')
day "%A") datetime.strftime(day,
'Tuesday'
D. Find all flights that landed between 6.pm and 8.pm and took off from one of EWR
, JFK
, LGA
(take the remainder of your birthday when divided by 3, i.e., day % 3 + 1
. If the result is 1, the airport is EWR
, if the remainder is 2 or 3 then the airport is JFK
or LGA
accordingly.).
E. Find all flights with distance more than 600 miles and speed larger than 475 miles per hour. Sort theresults by decreasing speed.
F. Find all supersonic flights (speed larger than 700 miles per hour).
G. Add a new column named speed
with speed in km per hour.
H. Compute the number of cancelled flights for each day of week.
I. For each carrier, day of week, compute the number of flights departing from JFK
.
J. For each carrier, compute the proportion of cancelled flights.
K. Find the carriers that serve at least 10 distinct destinations.
L. Find the carriers that do not serve all origin airports.
M. Find the carriers that serve both ATL
and LAX
.
N. Find the carriers that serve both ATL
and LAX
but serve neither IAH
nor HNL
.
O. Find all destinations that are served every day of June from at least one New York airport.
P. Which carrier is operating the plane (tailnum
) that serve the largest number of distinct destinations?
Q. Find the 10 most distant destinations from LGA
.
R. Find the destinations served by the largest number of carriers?
S. What is the average number of flights bound to LAX
on Sunday?
T. Does there exist a plane (tailnum) that flew out of NY under two distinct carriers? If yes, list the carriers.
U. Find rows such that time_hour, year, month, day, hour, minute are not consistent, i.e.,