{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Lab3: Data Quality & Preprocessing**\n",
    "\n",
    "**Course**: **INF-604: Data Analysis** <br>\n",
    "**Lecturer**: **Sothea HAS, PhD**\n",
    "\n",
    "-----\n",
    "\n",
    "**Objective:** In this lab, you will delve deeper into assessing the quality of datasets and employing preprocessing techniques to properly clean them.\n",
    "\n",
    "- The `notebook` of this `Lab` can be downloaded here: [Lab3_Preprocessing.ipynb](https://hassothea.github.io/Data_Analysis_AUPP/Labs/Lab3_Preprocessing.ipynb){target=\"_blank\"}.\n",
    "\n",
    "- Or you can work directly with `Google Colab` here: [Lab3_Preprocessing.ipynb](https://colab.research.google.com/drive/1T3JiNKY5TOtG4L1WvqCuRwIACRkW7KDF?usp=sharing){target=\"_blank\"}.\n",
    "\n",
    "\n",
    "-----\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. `Food Delivery` Dataset\n",
    "\n",
    "Let's consider Delivery dataset discussed in the previous [Lab2](https://hassothea.github.io/Data_Analysis_AUPP/Labs/Lab2_Univariate_Analysis.html). Read and load the data from kaggle: [Food Delivery Dataset](https://www.kaggle.com/datasets/denkuznetz/food-delivery-time-prediction/data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Order_ID</th>\n",
       "      <th>Distance_km</th>\n",
       "      <th>Weather</th>\n",
       "      <th>Traffic_Level</th>\n",
       "      <th>Time_of_Day</th>\n",
       "      <th>Vehicle_Type</th>\n",
       "      <th>Preparation_Time_min</th>\n",
       "      <th>Courier_Experience_yrs</th>\n",
       "      <th>Delivery_Time_min</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>522</td>\n",
       "      <td>7.93</td>\n",
       "      <td>Windy</td>\n",
       "      <td>Low</td>\n",
       "      <td>Afternoon</td>\n",
       "      <td>Scooter</td>\n",
       "      <td>12</td>\n",
       "      <td>1.0</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>738</td>\n",
       "      <td>16.42</td>\n",
       "      <td>Clear</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Evening</td>\n",
       "      <td>Bike</td>\n",
       "      <td>20</td>\n",
       "      <td>2.0</td>\n",
       "      <td>84</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>741</td>\n",
       "      <td>9.52</td>\n",
       "      <td>Foggy</td>\n",
       "      <td>Low</td>\n",
       "      <td>Night</td>\n",
       "      <td>Scooter</td>\n",
       "      <td>28</td>\n",
       "      <td>1.0</td>\n",
       "      <td>59</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>661</td>\n",
       "      <td>7.44</td>\n",
       "      <td>Rainy</td>\n",
       "      <td>Medium</td>\n",
       "      <td>Afternoon</td>\n",
       "      <td>Scooter</td>\n",
       "      <td>5</td>\n",
       "      <td>1.0</td>\n",
       "      <td>37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>412</td>\n",
       "      <td>19.03</td>\n",
       "      <td>Clear</td>\n",
       "      <td>Low</td>\n",
       "      <td>Morning</td>\n",
       "      <td>Bike</td>\n",
       "      <td>16</td>\n",
       "      <td>5.0</td>\n",
       "      <td>68</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Order_ID  Distance_km Weather Traffic_Level Time_of_Day Vehicle_Type  \\\n",
       "0       522         7.93   Windy           Low   Afternoon      Scooter   \n",
       "1       738        16.42   Clear        Medium     Evening         Bike   \n",
       "2       741         9.52   Foggy           Low       Night      Scooter   \n",
       "3       661         7.44   Rainy        Medium   Afternoon      Scooter   \n",
       "4       412        19.03   Clear           Low     Morning         Bike   \n",
       "\n",
       "   Preparation_Time_min  Courier_Experience_yrs  Delivery_Time_min  \n",
       "0                    12                     1.0                 43  \n",
       "1                    20                     2.0                 84  \n",
       "2                    28                     1.0                 59  \n",
       "3                     5                     1.0                 37  \n",
       "4                    16                     5.0                 68  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import kagglehub\n",
    "\n",
    "# Download latest version\n",
    "path = kagglehub.dataset_download(\"denkuznetz/food-delivery-time-prediction\")\n",
    "\n",
    "# Import data\n",
    "import pandas as pd\n",
    "data = pd.read_csv(path + \"/Food_Delivery_Times.csv\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A.** Address the problems related to the quality of this dataset.\n",
    "\n",
    "> **Answer**:\n",
    "\n",
    "\n",
    "\n",
    "--------------\n",
    "\n",
    "\n",
    "- Identifying Missing Values:\n",
    "\n",
    "    - How many columns contain missing values?\n",
    "\n",
    "    - How many missing values are in each of those columns?\n",
    "\n",
    "- Analyze a Column with Missing Values:\n",
    "\n",
    "    - Choose one column with missing values.\n",
    "\n",
    "    - Visualize the distribution of the remaining columns before and after dropping the missing values in the chosen column.\n",
    "\n",
    "    - What do you think is the nature of these missing values (e.g., **MCAR**, **MAR**, or **MNAR**)?\n",
    "\n",
    "    - Why might this be the case?\n",
    "\n",
    "- Repeat the Analysis:\n",
    "\n",
    "    - Perform the same analysis for other columns with missing values.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**B. Drop all rows with at least one missing values:**\n",
    "\n",
    "- Visualize the distribution of the columns without missing values before and after dropping all rows that contain at least one missing value.\n",
    "\n",
    "- What observations can you make from these visualizations?\n",
    "\n",
    "- Impute the missing values using an appropriate method (e.g., mean, median, mode, or advanced imputation techniques).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**C. Analyzing Connections Between Columns**\n",
    "\n",
    "- *Impact of Weather on Delivery Time*: determine if weather conditions affect delivery time.\n",
    "\n",
    "- *Influence of Traffic Level*: determine whether traffic levels impact delivery time?\n",
    "\n",
    "- *Effect of Vehicle Type*: How can we evaluate if the type of vehicle used influences delivery time?\n",
    "\n",
    "- *Role of Distance*: How can we analyze the relationship between distance and delivery time?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. `Auto-MPG` Dataset\n",
    "\n",
    "This dataset contains spec of various cars and is available in kaggle. For more, read [here](https://www.kaggle.com/datasets/uciml/autompg-dataset)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.6), please consider upgrading to the latest version (0.3.7).\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mpg</th>\n",
       "      <th>cylinders</th>\n",
       "      <th>displacement</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>weight</th>\n",
       "      <th>acceleration</th>\n",
       "      <th>model year</th>\n",
       "      <th>origin</th>\n",
       "      <th>car name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>18.0</td>\n",
       "      <td>8</td>\n",
       "      <td>307.0</td>\n",
       "      <td>130</td>\n",
       "      <td>3504</td>\n",
       "      <td>12.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>chevrolet chevelle malibu</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>15.0</td>\n",
       "      <td>8</td>\n",
       "      <td>350.0</td>\n",
       "      <td>165</td>\n",
       "      <td>3693</td>\n",
       "      <td>11.5</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>buick skylark 320</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>18.0</td>\n",
       "      <td>8</td>\n",
       "      <td>318.0</td>\n",
       "      <td>150</td>\n",
       "      <td>3436</td>\n",
       "      <td>11.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>plymouth satellite</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>16.0</td>\n",
       "      <td>8</td>\n",
       "      <td>304.0</td>\n",
       "      <td>150</td>\n",
       "      <td>3433</td>\n",
       "      <td>12.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>amc rebel sst</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>17.0</td>\n",
       "      <td>8</td>\n",
       "      <td>302.0</td>\n",
       "      <td>140</td>\n",
       "      <td>3449</td>\n",
       "      <td>10.5</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>ford torino</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    mpg  cylinders  displacement horsepower  weight  acceleration  model year  \\\n",
       "0  18.0          8         307.0        130    3504          12.0          70   \n",
       "1  15.0          8         350.0        165    3693          11.5          70   \n",
       "2  18.0          8         318.0        150    3436          11.0          70   \n",
       "3  16.0          8         304.0        150    3433          12.0          70   \n",
       "4  17.0          8         302.0        140    3449          10.5          70   \n",
       "\n",
       "   origin                   car name  \n",
       "0       1  chevrolet chevelle malibu  \n",
       "1       1          buick skylark 320  \n",
       "2       1         plymouth satellite  \n",
       "3       1              amc rebel sst  \n",
       "4       1                ford torino  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import kagglehub\n",
    "\n",
    "# Download latest version\n",
    "path = kagglehub.dataset_download(\"uciml/autompg-dataset\")\n",
    "auto = pd.read_csv(path + '/auto-mpg.csv')\n",
    "auto.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Investigate factors that may affect the quality of the dataset and implement appropriate measures to handle them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Further Reading**\n",
    "\n",
    "- `Pandas` python library: [https://pandas.pydata.org/docs/getting_started/index.html#getting-started](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)\n",
    "\n",
    "- `Pandas Cheatsheet`: [https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)\n",
    "\n",
    "- `10 Minute to Pandas`: [https://pandas.pydata.org/docs/user_guide/10min.html](https://pandas.pydata.org/docs/user_guide/10min.html)\n",
    "\n",
    "- `Some Pandas Lession`: [https://www.kaggle.com/learn/pandas](https://pandas.pydata.org/docs/user_guide/10min.html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}