{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **TP3 - Data Preprocessing**\n",
    "\n",
    "**Exploratory Data Analysis & Unsuperivsed Learning** <br>\n",
    "**M1-DAS**<br>\n",
    "**Lecturer: HAS Sothea, PhD**\n",
    "\n",
    "-------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Objective:** Preprocessing is important in data related tasks. In this TP, you will explore different challanges you may encounted during when performing data preprocessing. We will discuss reasonable solution to these challanges.\n",
    "\n",
    "> **The `Jupyter Notebook` for this TP can be downloaded here: [TP3_Preprocessing.ipynb](https://hassothea.github.io/M1_EDA_ITC/TPs/TP3_Preprocessing.ipynb)**.\n",
    "\n",
    "-----------\n",
    "\n",
    "## 1. Titanic dataset\n",
    "\n",
    "The `Titanic` dataset contains information on the passengers aboard the RMS Titanic, which sank in $1912$. It includes details like age, gender, class, and survival status.\n",
    "\n",
    "I bet you have heard about or watched `Tiannic` movie at least once. How about we take a look at the real dataset of `Titanic` available in Kaggle. For more information about the dataset and the columns, read [`Titanic dataset`](https://www.kaggle.com/datasets/surendhan/titanic-dataset). Let's import it into our Jupyter Notebook by running the following code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>892</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Kelly, Mr. James</td>\n",
       "      <td>male</td>\n",
       "      <td>34.5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>330911</td>\n",
       "      <td>7.8292</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Q</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>893</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Wilkes, Mrs. James (Ellen Needs)</td>\n",
       "      <td>female</td>\n",
       "      <td>47.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>363272</td>\n",
       "      <td>7.0000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>894</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Myles, Mr. Thomas Francis</td>\n",
       "      <td>male</td>\n",
       "      <td>62.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>240276</td>\n",
       "      <td>9.6875</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Q</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>895</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Wirz, Mr. Albert</td>\n",
       "      <td>male</td>\n",
       "      <td>27.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>315154</td>\n",
       "      <td>8.6625</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>896</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Hirvonen, Mrs. Alexander (Helga E Lindqvist)</td>\n",
       "      <td>female</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3101298</td>\n",
       "      <td>12.2875</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived  Pclass  \\\n",
       "0          892         0       3   \n",
       "1          893         1       3   \n",
       "2          894         0       2   \n",
       "3          895         0       3   \n",
       "4          896         1       3   \n",
       "\n",
       "                                           Name     Sex   Age  SibSp  Parch  \\\n",
       "0                              Kelly, Mr. James    male  34.5      0      0   \n",
       "1              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   \n",
       "2                     Myles, Mr. Thomas Francis    male  62.0      0      0   \n",
       "3                              Wirz, Mr. Albert    male  27.0      0      0   \n",
       "4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1   \n",
       "\n",
       "    Ticket     Fare Cabin Embarked  \n",
       "0   330911   7.8292   NaN        Q  \n",
       "1   363272   7.0000   NaN        S  \n",
       "2   240276   9.6875   NaN        Q  \n",
       "3   315154   8.6625   NaN        S  \n",
       "4  3101298  12.2875   NaN        S  "
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import kagglehub\n",
    "\n",
    "# Download latest version\n",
    "path = kagglehub.dataset_download(\"surendhan/titanic-dataset\")\n",
    "\n",
    "# Import data\n",
    "import pandas as pd\n",
    "data = pd.read_csv(path + \"/titanic.csv\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A.** What's the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset (read about the data [here](https://www.kaggle.com/datasets/surendhan/titanic-dataset))?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**B.** Are there any missing values? If so,\n",
    "\n",
    "- Study the impact of missing value removal on the quantitative variables.\n",
    "- Study the impact of missing value removal on the qualitative variables.\n",
    "- Conclude the dynamic of the missing values and handle them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**C.** What are the most common passenger names? Were `Rose` and `Jack` on the ship?\n",
    "\n",
    "**Hint**: `WordCloud` is a useful graph for such text summary. For more, read [here](https://www.kaggle.com/code/anandhuh/word-cloud-in-python-for-beginners)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Bivariate/Multivariate Analysis\n",
    "\n",
    "We are primarily interested in exploring the relationship between each column and the likelihood of passenger survival. The following questions will guide you through this exploration. In each question, try to give some comments on what you observe in the graphs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A. Survival Analysis:** How did the survival rates vary by gender? How about by class?\n",
    "\n",
    "*Hint*: Create bar charts or stacked bar charts showing the survival rates for different genders and different passenger classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**B. Fare and Survival:** Is there a relationship between the fare paid and the likelihood of survival?\n",
    "\n",
    "*Hint*: Create boxplots to analyze the fare distribution among survivors and non-survivors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**C. Family Size:** How does family size (number of siblings/spouses and parents/children) impact the chances of survival?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**D. Embarkation Points:** How do survival rates differ based on the port of embarkation (C, Q, S)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**E. Pclass and Age**: How does passenger class correlate with age?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**F. Gender and Age**: How does age distribution differ between male and female passengers?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**G. Age, Fare and Gender:** View the connection of Age, Fare and Gender in one graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**H. Age, Fare and Class:** View the connection of Age, Fare and Class in one graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**I.** Based on your analysis, which variables appear to have the greatest impact on the likelihood of survival?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Further readings\n",
    "- Graphical tools:\n",
    "    - [`matplotlib`](https://matplotlib.org/stable/index.html)\n",
    "    - [`seaborn`](https://seaborn.pydata.org/)\n",
    "    - [`plotly`](https://plotly.com/python/)\n",
    "- [`Titanic dataset`](https://www.kaggle.com/datasets/surendhan/titanic-dataset)\n",
    "- [Fundamentals of Data Visualization, Claus O. Wilke](https://clauswilke.com/dataviz/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}