TP6 - Data Wrangling

Course: EDA & Unsupervised Learning
M1-DAS
Lecturer: HAS Sothea, PhD


Objective: In practice, real insights from data often require data transformation and manipulation. Data wrangling is the process of cleaning, structuring, and organizing raw data into a usable format. It involves tasks like handling missing values, correcting errors, and transforming data to make it suitable for analysis. In this TP, you will learn how to perform some important tasks of data wrangling including: Creating, loading, changing types, aggregating, merging and manipulating data frames in pandas

The Jupyter Notebook for this Lab can be downloaded here: TP6_Wrangling.ipynb.


1. Creating data frames

  • Create an artificial data frame:

    • Columns: [‘Electricity’, ‘Water’, ‘Currency’]
    • Rows: 12 (indexed by month)
    • Values: Electricity (Kw), Water (\(m^3\)), Currency (Dollar or Riel)
  • Add a column Electricity_Exp to the data, which is the expense on electricity givinig that the price is \(800\)R or \(0.2\$\) per KW.

  • Add a Water_Exp column to the data, which is the expense on water giving the price of \(1000\)R or \(0.25\$\) per \(m^3\).

  • Add two columns of total expenses in dollar and riel: Total_dollar and Total_riel.

  • Identify the months with the highest and lowest expenses for:

    • Electricity
    • Water
    • Total expenses.
# To do

2. Food Delivery Dataset

This dataset is designed for predicting food delivery times based on various influencing factors such as distance, weather, traffic conditions, and time of day. It offers a practical and engaging challenge for machine learning practitioners, especially those interested in logistics and operations research. Read and load the data from kaggle: Food Delivery Dataset.

Download the dataset and import it.

# %pip install kagglehub   # if you have not installed "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("denkuznetz/food-delivery-time-prediction")

print(f'The data is downloaded and store at: {path}')


# To do

A. Data overview:

  • Address the dimension, qualitative and quantitative columns of the dataset.
  • Create statistical summary of the dataset.
  • Identify problems and handle them if there is any:
    • Missing values,
    • Duplicated data,
    • Outliers…
# To do

B. Data Statistics

  • Compute average delivery time according to:
    • Weather condition.
    • Traffic level.
    • Time of the day.
    • Vehical type.
  • Compute average delivery time according to:
    • Weather and Traffic level.
    • Time of the day and vehical type.
    • Weather, Traffic and Time of the day.
  • Compute average delivery time for
    • Courier with more than 3 year experience and those who have less.
    • Distances longer and shorter than 10km.
    • Different class of the above conditions.
# To do

3. Flight Dataset

In this section, we explore NYC Flights 2013 dataset available in kaggle: https://www.kaggle.com/datasets/aephidayatuloh/nyc-flights-2013. There are five datasets in total:

  • nyc_airlines.csv: information about the airlines
  • nyc_airports.csv: information about the airports
  • nyc_flights.csv: information about the flights
  • nyc_planes.csv: information about the planes
  • nyc_weather.csv: information about the weather.

You will explore these datasets and try to answer the questions below. But first, let’s import the datasets.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("aephidayatuloh/nyc-flights-2013")

First, we will explore nyc_flights.csv dataset.

import pandas as pd

flights = pd.read_csv(path + '/nyc_flights.csv')
flights.head()
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA 1545 N14228 EWR IAH 227.0 1400 5 15 2013-01-01T10:00:00Z
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA 1714 N24211 LGA IAH 227.0 1416 5 29 2013-01-01T10:00:00Z
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA 1141 N619AA JFK MIA 160.0 1089 5 40 2013-01-01T10:00:00Z
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576 5 45 2013-01-01T10:00:00Z
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN LGA ATL 116.0 762 6 0 2013-01-01T11:00:00Z

Answer the following questions:

A. Find and count the number rows with missing values in the dataset. Can you guess what might those rows represent?

B. Find all flights with arrival delay smaller than departure delay.

C. Find all flights departing during weekends (Saturday, Sunday) from La Guardia airport (LGA). Hint: you may need to use functions strptime and strftime of datetime module from datetime library. For example:

from datetime import datetime

day = datetime.strptime(flights['time_hour'][0], '%Y-%m-%dT%H:%M:%SZ')
datetime.strftime(day, "%A")
'Tuesday'

D. Find all flights that landed between 6.pm and 8.pm and took off from one of EWR, JFK, LGA (take the remainder of your birthday when divided by 3, i.e., day % 3 + 1. If the result is 1, the airport is EWR, if the remainder is 2 or 3 then the airport is JFK or LGA accordingly.).

E. Find all flights with distance more than 600 miles and speed larger than 475 miles per hour. Sort theresults by decreasing speed.

F. Find all supersonic flights (speed larger than 700 miles per hour).

G. Add a new column named speed with speed in km per hour.

H. Compute the number of cancelled flights for each day of week.

I. For each carrier, day of week, compute the number of flights departing from JFK.

J. For each carrier, compute the proportion of cancelled flights.

K. Find the carriers that serve at least 10 distinct destinations.

L. Find the carriers that do not serve all origin airports.

M. Find the carriers that serve both ATL and LAX.

N. Find the carriers that serve both ATL and LAX but serve neither IAH nor HNL.

O. Find all destinations that are served every day of June from at least one New York airport.

P. Which carrier is operating the plane (tailnum) that serve the largest number of distinct destinations?

Q. Find the 10 most distant destinations from LGA.

R. Find the destinations served by the largest number of carriers?

S. What is the average number of flights bound to LAX on Sunday?

T. Does there exist a plane (tailnum) that flew out of NY under two distinct carriers? If yes, list the carriers.

U. Find rows such that time_hour, year, month, day, hour, minute are not consistent, i.e.,

Further readings