{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Lab3: Data Quality & Preprocessing**\n",
"\n",
"**Course**: **INF-604: Data Analysis**
\n",
"**Lecturer**: **Sothea HAS, PhD**\n",
"\n",
"-----\n",
"\n",
"**Objective:** In this lab, you will delve deeper into assessing the quality of datasets and employing preprocessing techniques to properly clean them.\n",
"\n",
"- The `notebook` of this `Lab` can be downloaded here: [Lab3_Preprocessing.ipynb](https://hassothea.github.io/Data_Analysis_AUPP/Labs/Lab3_Preprocessing.ipynb){target=\"_blank\"}.\n",
"\n",
"- Or you can work directly with `Google Colab` here: [Lab3_Preprocessing.ipynb](https://colab.research.google.com/drive/1T3JiNKY5TOtG4L1WvqCuRwIACRkW7KDF?usp=sharing){target=\"_blank\"}.\n",
"\n",
"\n",
"-----\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. `Food Delivery` Dataset\n",
"\n",
"Let's consider Delivery dataset discussed in the previous [Lab2](https://hassothea.github.io/Data_Analysis_AUPP/Labs/Lab2_Univariate_Analysis.html). Read and load the data from kaggle: [Food Delivery Dataset](https://www.kaggle.com/datasets/denkuznetz/food-delivery-time-prediction/data)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Order_ID | \n",
" Distance_km | \n",
" Weather | \n",
" Traffic_Level | \n",
" Time_of_Day | \n",
" Vehicle_Type | \n",
" Preparation_Time_min | \n",
" Courier_Experience_yrs | \n",
" Delivery_Time_min | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 522 | \n",
" 7.93 | \n",
" Windy | \n",
" Low | \n",
" Afternoon | \n",
" Scooter | \n",
" 12 | \n",
" 1.0 | \n",
" 43 | \n",
"
\n",
" \n",
" 1 | \n",
" 738 | \n",
" 16.42 | \n",
" Clear | \n",
" Medium | \n",
" Evening | \n",
" Bike | \n",
" 20 | \n",
" 2.0 | \n",
" 84 | \n",
"
\n",
" \n",
" 2 | \n",
" 741 | \n",
" 9.52 | \n",
" Foggy | \n",
" Low | \n",
" Night | \n",
" Scooter | \n",
" 28 | \n",
" 1.0 | \n",
" 59 | \n",
"
\n",
" \n",
" 3 | \n",
" 661 | \n",
" 7.44 | \n",
" Rainy | \n",
" Medium | \n",
" Afternoon | \n",
" Scooter | \n",
" 5 | \n",
" 1.0 | \n",
" 37 | \n",
"
\n",
" \n",
" 4 | \n",
" 412 | \n",
" 19.03 | \n",
" Clear | \n",
" Low | \n",
" Morning | \n",
" Bike | \n",
" 16 | \n",
" 5.0 | \n",
" 68 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Order_ID Distance_km Weather Traffic_Level Time_of_Day Vehicle_Type \\\n",
"0 522 7.93 Windy Low Afternoon Scooter \n",
"1 738 16.42 Clear Medium Evening Bike \n",
"2 741 9.52 Foggy Low Night Scooter \n",
"3 661 7.44 Rainy Medium Afternoon Scooter \n",
"4 412 19.03 Clear Low Morning Bike \n",
"\n",
" Preparation_Time_min Courier_Experience_yrs Delivery_Time_min \n",
"0 12 1.0 43 \n",
"1 20 2.0 84 \n",
"2 28 1.0 59 \n",
"3 5 1.0 37 \n",
"4 16 5.0 68 "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import kagglehub\n",
"\n",
"# Download latest version\n",
"path = kagglehub.dataset_download(\"denkuznetz/food-delivery-time-prediction\")\n",
"\n",
"# Import data\n",
"import pandas as pd\n",
"data = pd.read_csv(path + \"/Food_Delivery_Times.csv\")\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**A.** Address the problems related to the quality of this dataset.\n",
"\n",
"> **Answer**:\n",
"\n",
"\n",
"\n",
"--------------\n",
"\n",
"\n",
"- Identifying Missing Values:\n",
"\n",
" - How many columns contain missing values?\n",
"\n",
" - How many missing values are in each of those columns?\n",
"\n",
"- Analyze a Column with Missing Values:\n",
"\n",
" - Choose one column with missing values.\n",
"\n",
" - Visualize the distribution of the remaining columns before and after dropping the missing values in the chosen column.\n",
"\n",
" - What do you think is the nature of these missing values (e.g., **MCAR**, **MAR**, or **MNAR**)?\n",
"\n",
" - Why might this be the case?\n",
"\n",
"- Repeat the Analysis:\n",
"\n",
" - Perform the same analysis for other columns with missing values.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**B. Drop all rows with at least one missing values:**\n",
"\n",
"- Visualize the distribution of the columns without missing values before and after dropping all rows that contain at least one missing value.\n",
"\n",
"- What observations can you make from these visualizations?\n",
"\n",
"- Impute the missing values using an appropriate method (e.g., mean, median, mode, or advanced imputation techniques).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**C. Analyzing Connections Between Columns**\n",
"\n",
"- *Impact of Weather on Delivery Time*: determine if weather conditions affect delivery time.\n",
"\n",
"- *Influence of Traffic Level*: determine whether traffic levels impact delivery time?\n",
"\n",
"- *Effect of Vehicle Type*: How can we evaluate if the type of vehicle used influences delivery time?\n",
"\n",
"- *Role of Distance*: How can we analyze the relationship between distance and delivery time?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. `Auto-MPG` Dataset\n",
"\n",
"This dataset contains spec of various cars and is available in kaggle. For more, read [here](https://www.kaggle.com/datasets/uciml/autompg-dataset)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.6), please consider upgrading to the latest version (0.3.7).\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mpg | \n",
" cylinders | \n",
" displacement | \n",
" horsepower | \n",
" weight | \n",
" acceleration | \n",
" model year | \n",
" origin | \n",
" car name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 18.0 | \n",
" 8 | \n",
" 307.0 | \n",
" 130 | \n",
" 3504 | \n",
" 12.0 | \n",
" 70 | \n",
" 1 | \n",
" chevrolet chevelle malibu | \n",
"
\n",
" \n",
" 1 | \n",
" 15.0 | \n",
" 8 | \n",
" 350.0 | \n",
" 165 | \n",
" 3693 | \n",
" 11.5 | \n",
" 70 | \n",
" 1 | \n",
" buick skylark 320 | \n",
"
\n",
" \n",
" 2 | \n",
" 18.0 | \n",
" 8 | \n",
" 318.0 | \n",
" 150 | \n",
" 3436 | \n",
" 11.0 | \n",
" 70 | \n",
" 1 | \n",
" plymouth satellite | \n",
"
\n",
" \n",
" 3 | \n",
" 16.0 | \n",
" 8 | \n",
" 304.0 | \n",
" 150 | \n",
" 3433 | \n",
" 12.0 | \n",
" 70 | \n",
" 1 | \n",
" amc rebel sst | \n",
"
\n",
" \n",
" 4 | \n",
" 17.0 | \n",
" 8 | \n",
" 302.0 | \n",
" 140 | \n",
" 3449 | \n",
" 10.5 | \n",
" 70 | \n",
" 1 | \n",
" ford torino | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mpg cylinders displacement horsepower weight acceleration model year \\\n",
"0 18.0 8 307.0 130 3504 12.0 70 \n",
"1 15.0 8 350.0 165 3693 11.5 70 \n",
"2 18.0 8 318.0 150 3436 11.0 70 \n",
"3 16.0 8 304.0 150 3433 12.0 70 \n",
"4 17.0 8 302.0 140 3449 10.5 70 \n",
"\n",
" origin car name \n",
"0 1 chevrolet chevelle malibu \n",
"1 1 buick skylark 320 \n",
"2 1 plymouth satellite \n",
"3 1 amc rebel sst \n",
"4 1 ford torino "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import kagglehub\n",
"\n",
"# Download latest version\n",
"path = kagglehub.dataset_download(\"uciml/autompg-dataset\")\n",
"auto = pd.read_csv(path + '/auto-mpg.csv')\n",
"auto.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Investigate factors that may affect the quality of the dataset and implement appropriate measures to handle them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Further Reading**\n",
"\n",
"- `Pandas` python library: [https://pandas.pydata.org/docs/getting_started/index.html#getting-started](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)\n",
"\n",
"- `Pandas Cheatsheet`: [https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)\n",
"\n",
"- `10 Minute to Pandas`: [https://pandas.pydata.org/docs/user_guide/10min.html](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
"\n",
"- `Some Pandas Lession`: [https://www.kaggle.com/learn/pandas](https://pandas.pydata.org/docs/user_guide/10min.html)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}