{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **TP7 - Corresponding Analysis (CA)**\n",
    "\n",
    "<b>Exploratory Data Analysis & Unsuperivsed Learning </b><br>\n",
    "**Course: PHAUK Sokkey, PhD** <br> \n",
    "**TP: HAS Sothea, PhD**\n",
    "\n",
    "-------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Objective**: Qualitative columns are often ignored in predictive models or analysis. It is important to notice that qualitative variables are as important as the quantitative ones when it comes to building predictive models or analyzing their connection within the dataset. In this TP, we will focus on identifying the associations between two qualitative variables.\n",
    "\n",
    "---------\n",
    "\n",
    "> **The `Jupyter Notebook` for this TP can be downloaded here: [TP7_CA.ipynb](https://hassothea.github.io/EDA_ITC/TPs/TP7_CA.ipynb)**.\n",
    "\n",
    "-------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Data loading and Preprocessing\n",
    "\n",
    "`Titanic` dataset contains information on the passengers aboard the RMS Titanic, which sank in 1912. It includes details like age, gender, class, and survival status.\n",
    "\n",
    "**A.** Import the `Titanic` dataset from kaggle using: [Titanic dataset](https://www.kaggle.com/datasets/surendhan/titanic-dataset).\n",
    "\n",
    "- How many quantitative and qualitative variables are there in this dataset?\n",
    "- Convert each column into its correct data type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import kagglehub\n",
    "\n",
    "# Download latest version\n",
    "path = kagglehub.dataset_download(\"surendhan/titanic-dataset\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>892</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Kelly, Mr. James</td>\n",
       "      <td>male</td>\n",
       "      <td>34.5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>330911</td>\n",
       "      <td>7.8292</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Q</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>893</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Wilkes, Mrs. James (Ellen Needs)</td>\n",
       "      <td>female</td>\n",
       "      <td>47.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>363272</td>\n",
       "      <td>7.0000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>894</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Myles, Mr. Thomas Francis</td>\n",
       "      <td>male</td>\n",
       "      <td>62.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>240276</td>\n",
       "      <td>9.6875</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Q</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>895</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Wirz, Mr. Albert</td>\n",
       "      <td>male</td>\n",
       "      <td>27.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>315154</td>\n",
       "      <td>8.6625</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>896</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Hirvonen, Mrs. Alexander (Helga E Lindqvist)</td>\n",
       "      <td>female</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3101298</td>\n",
       "      <td>12.2875</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   PassengerId  Survived  Pclass  \\\n",
       "0          892         0       3   \n",
       "1          893         1       3   \n",
       "2          894         0       2   \n",
       "3          895         0       3   \n",
       "4          896         1       3   \n",
       "\n",
       "                                           Name     Sex   Age  SibSp  Parch  \\\n",
       "0                              Kelly, Mr. James    male  34.5      0      0   \n",
       "1              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   \n",
       "2                     Myles, Mr. Thomas Francis    male  62.0      0      0   \n",
       "3                              Wirz, Mr. Albert    male  27.0      0      0   \n",
       "4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1   \n",
       "\n",
       "    Ticket     Fare Cabin Embarked  \n",
       "0   330911   7.8292   NaN        Q  \n",
       "1   363272   7.0000   NaN        S  \n",
       "2   240276   9.6875   NaN        Q  \n",
       "3   315154   8.6625   NaN        S  \n",
       "4  3101298  12.2875   NaN        S  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Import data\n",
    "import pandas as pd\n",
    "data = pd.read_csv(path + \"/titanic.csv\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**B.** Are there any missing values? If so,\n",
    "\n",
    "- Study the impact of missing value removal on the quantitative variables.\n",
    "- Study the impact of missing value removal on the qualitative variables.\n",
    "- Conclude the dynamic of the missing values and handle them.\n",
    "- Remove redundant observations if there is any."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. $\\chi^2$-test and CA\n",
    "\n",
    "The chi-square test is a statistical method used to determine if there is a significant association between two categorical variables. It tests the following hypotheses:\n",
    "$$\\begin{cases}\n",
    "H_0:\\text{ There is no association between the two variables (they are independent).}\\\\\n",
    "H_1:\\text{ There is an association between the two variables (they are not independent).}\n",
    "\\end{cases}$$\n",
    "Under null hypothesis $H_0$, $\\chi^2$-statistic defined by $\\chi^2=\\sum_{i,j}\\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\\sim\\chi^2((r-1)(c-1))$ where\n",
    "\n",
    "- $r,c$: the number of categories of the 1st and 2nd variable respectively.\n",
    "- $O_{ij}$: the observed frequency of $i$-th and $j$-th category of the 1st and the 2nd variable.\n",
    "- $E_{ij}$: the expected/theoretical frequency of $i$-th and $j$-th category of the 1st and the 2nd variable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A. $\\chi^2$-test for Pclass vs Survived.** \n",
    "\n",
    "- Visualize the relationship between the two variables.\n",
    "- Compute the $\\chi^2$ statistics of the pair `Pclass` and `Survived` variable.\n",
    "- Deduce the p-value of $\\chi^2$-test of the two variables.\n",
    "- Can we reject the null hypothesis $H_0$ of the two variables being independent at $95\\%$ confidence level?\n",
    "- Recall the assumptions of $\\chi^2$-test. Is the result above reliable?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.stats import chi2_contingency\n",
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**B. Pclass vs Embarked:**\n",
    "\n",
    "- Perform $\\chi^2$-test on this pair of variables.\n",
    "- Perform CA on this pair of variables.\n",
    "- Create `symmetric biplot` of the resulting CA.\n",
    "- Interpret the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Eye and Hair color\n",
    "\n",
    "Study the connection between eye and hair colors from the `Eye & Hair Color` dataset available in kaggle as [Hair Eye Color](https://www.kaggle.com/datasets/jasleensondhi/hair-eye-color)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Countries and languages\n",
    "\n",
    "Reproduce results of the association between countries and primary language spoken within those countries conducted [here](https://pmc.ncbi.nlm.nih.gov/articles/PMC3718710/). The contingency table of country of residence and primary language spoken is given below:\n",
    "\t\n",
    "| Country\\ Language  | English | French | Spanish | German | Italian | Total |\n",
    "|:-------------------|:-------:|:------:|:-------:|:------:|:-------:|:-----:|\n",
    "| Canada | 688 | 280 | 10 | 11 | 11 | 1000 |\n",
    "| USA | 730 | 31 | 190 | 8 | 41 | 1000 |\n",
    "| England | 798 | 74 | 38 | 31 | 59 | 1000 |\n",
    "| Italy | 17 | 13 | 11 | 15 | 944 | 1000 |\n",
    "| Switzerland | 15 | 222 | 20 | 648 | 95 | 1000 |\n",
    "| Total | 2248 | 620 | 269 | 713 | 1150 | 5000 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Further Readings\n",
    "\n",
    "- [Correspondence Analysis, Hervé Abdi & Michel Béra](https://cedric.cnam.fr/fichiers/art_3066.pdf)\n",
    "- [Correspondence analysis is a useful tool to uncover the relationships among categorical variables](https://pmc.ncbi.nlm.nih.gov/articles/PMC3718710/)\n",
    "- [The Use of Correspondence Analysis in the Exploration of Health Survey Data](www.fbbva.es/wp-content/uploads/2017/05/dat/DT_2002_05.pdf)\n",
    "- [Correspondence Analysis](https://link.springer.com/referenceworkentry/10.1007/978-1-4939-7131-2_140)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}