import kagglehub
# Download the latest version; `path` is the local folder containing the files
path = kagglehub.dataset_download("surendhan/titanic-dataset")
TP8 - Correspondence Analysis (CA)
- Course: EDA & Unsupervised Learning
- M-DAS
- Lecturer: HAS Sothea, PhD
Objective: Qualitative columns are often ignored in predictive models and exploratory analysis, yet they matter as much as quantitative ones, both for building predictive models and for understanding relationships within a dataset. In this TP, we focus on identifying associations between two qualitative variables.
The Jupyter Notebook for this TP can be downloaded here: TP8_CA.ipynb.
1. Data loading and Preprocessing
In this section, we will work with the Titanic dataset (TP3).
A. Import the Titanic dataset from Kaggle using: Titanic dataset.
- How many quantitative and qualitative variables are there in this dataset?
- Convert each column into its correct data type.
# Import data
import pandas as pd
data = pd.read_csv(path + "/titanic.csv")
data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
# To do
B. In TP3 of data preprocessing, we already handled some problems of this dataset (See TP3-Solution).
- Preprocess this dataset:
- Data types,
- Handle missing values,
- Handle duplicated data…
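The three preprocessing steps listed above can be sketched on a small made-up table. This is only an illustration: the frame `toy` and its values are hypothetical (they are not the Titanic data), and median/mode imputation is one possible choice among several, not necessarily the one used in TP3.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking a few Titanic-like columns (not the real data).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 2, 3, 3],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, np.nan, 30.0, 40.0, 22.0],
    "Embarked": ["S", "C", None, "Q", "S"],
})

# 1. Data types: integer codes such as Survived and Pclass are categories, not quantities.
for col in ["Survived", "Pclass", "Sex", "Embarked"]:
    toy[col] = toy[col].astype("category")

# 2. Missing values (assumed strategy): median for numeric, mode for categorical columns.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# 3. Duplicates: drop rows that are exact copies of an earlier row.
clean = toy.drop_duplicates().reset_index(drop=True)

print(clean.dtypes)
print(clean)
```

On real data, it is worth inspecting `data.isna().sum()` and `data.duplicated().sum()` first, since the right strategy depends on how much is missing in each column.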
# To do
2. \(\chi^2\)-test and CA
The chi-square test is a statistical method used to determine whether there is a significant association between two categorical variables. It tests the following hypotheses:
\[\begin{cases} H_0:\text{ There is no association between the two variables (they are independent).}\\ H_1:\text{ There is an association between the two variables (they are not independent).} \end{cases}\]
Under the null hypothesis \(H_0\), the \(\chi^2\)-statistic is defined by
\[\chi^2=\sum_{i,j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\sim\chi^2((r-1)(c-1)),\]
that is, for large samples it approximately follows a chi-square distribution with \((r-1)(c-1)\) degrees of freedom, where
- \(r,c\): the number of categories of the 1st and 2nd variable respectively.
- \(O_{ij}\): the observed frequency of the cell combining the \(i\)-th category of the 1st variable with the \(j\)-th category of the 2nd variable.
- \(E_{ij}\): the expected/theoretical frequency of the same cell under \(H_0\), computed as (total of row \(i\)) \(\times\) (total of column \(j\)) divided by the grand total.
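As a minimal sketch, the statistic above can be computed by hand and cross-checked against SciPy. The 2x3 table `observed` is a hypothetical example, not taken from any dataset in this TP; `correction=False` is passed so that SciPy returns the plain statistic defined above.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x3 table of observed counts O_ij (illustrative numbers only).
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)   # shape (r, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, c)

# Expected counts under independence: E_ij = (row total i) * (column total j) / n
expected = row_totals @ col_totals / n

# Chi-square statistic, degrees of freedom, and upper-tail p-value
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi2_stat, dof)                  # P(chi2(dof) > chi2_stat)

# Cross-check with SciPy (no continuity correction)
stat_ref, p_ref, dof_ref, expected_ref = chi2_contingency(observed, correction=False)

print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p-value = {p_value:.4f}")
```

Writing the computation out once makes it easier to see which cells contribute most to the statistic: the matrix `(observed - expected) ** 2 / expected` holds the per-cell contributions.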
A. \(\chi^2\)-test for Pclass vs Survived.
- Visualize the relationship between the two variables.
- Compute the \(\chi^2\) statistic of the pair of variables Pclass and Survived.
- Deduce the p-value of the \(\chi^2\)-test of the two variables.
- Can we reject the null hypothesis \(H_0\) of the two variables being independent at the \(95\%\) confidence level (i.e., at significance level \(\alpha = 0.05\))?
- Recall the assumptions of \(\chi^2\)-test. Is the result above reliable?
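The workflow in the questions above can be sketched as follows on made-up data. The frame `toy` and its columns `group` and `outcome` are hypothetical stand-ins for two categorical columns, and the "all expected counts at least 5" check is a common rule of thumb rather than a strict requirement.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: two categorical columns (not the Titanic data).
toy = pd.DataFrame({
    "group":   ["a"] * 40 + ["b"] * 60,
    "outcome": ["yes"] * 30 + ["no"] * 10 + ["yes"] * 20 + ["no"] * 40,
})

# Contingency table of observed counts, and row-wise proportions for a first look
table = pd.crosstab(toy["group"], toy["outcome"])
row_profiles = pd.crosstab(toy["group"], toy["outcome"], normalize="index")

# Chi-square test; for 2x2 tables SciPy applies Yates' correction by default,
# so correction=False is passed here to get the uncorrected statistic.
stat, p_value, dof, expected = chi2_contingency(table, correction=False)

# Decision at significance level alpha = 0.05
reject_h0 = p_value < 0.05

# Rule-of-thumb reliability check: every expected count should be at least 5
assumption_ok = (expected >= 5).all()

print(table)
print(row_profiles)
print(f"chi2 = {stat:.3f}, dof = {dof}, p-value = {p_value:.2e}")
print("Reject H0:", reject_h0, "| expected counts >= 5:", assumption_ok)
```

A grouped bar chart of `row_profiles` (for example with `row_profiles.plot(kind="bar")`) is one simple way to visualize the relationship before testing it.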
import numpy as np
from scipy.stats import chi2_contingency
# To do
B. Pclass vs Embarked:
- Visualize the relationship between these two columns.
- Perform \(\chi^2\)-test on this pair of variables.
- Perform CA on this pair of variables.
- Create a symmetric biplot of the resulting CA.
- Interpret the result.
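For reference, CA can be sketched directly with NumPy as a singular value decomposition of the standardized residuals of the contingency table. This is a minimal sketch on a hypothetical 3x3 table `N`, not the solution for Pclass vs Embarked; dedicated CA packages exist, but their APIs are not assumed here.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3x3 contingency table (illustrative counts only).
N = np.array([[30.0, 10.0, 5.0],
              [10.0, 25.0, 10.0],
              [5.0, 10.0, 35.0]])

n = N.sum()
P = N / n                      # correspondence matrix (relative frequencies)
r = P.sum(axis=1)              # row masses
c = P.sum(axis=0)              # column masses

# Standardized residuals: S_ij = (P_ij - r_i c_j) / sqrt(r_i c_j)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# SVD of the residual matrix; at most min(rows, cols) - 1 non-trivial dimensions
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
k = min(N.shape) - 1

# Principal coordinates of rows (F) and columns (G), used in a symmetric biplot
F = (U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None]
G = (Vt.T[:, :k] * sigma[:k]) / np.sqrt(c)[:, None]

# Principal inertias (eigenvalues) and share of inertia per dimension
inertia = sigma[:k] ** 2
explained = inertia / inertia.sum()

# Link with the chi-square test: total inertia equals chi2 / n
chi2_stat = chi2_contingency(N, correction=False)[0]

print("Explained inertia per dimension:", np.round(explained, 3))
print("Total inertia:", inertia.sum(), "| chi2 / n:", chi2_stat / n)
```

Plotting the first two columns of `F` and `G` on the same axes gives the symmetric biplot. In that plot, distances among row points (or among column points) are interpretable, while distances between a row point and a column point should be read with caution.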
# To do
3. Eye and Hair color
Study the connection between eye and hair colors using the Eye & Hair Color dataset, available on Kaggle as Hair Eye Color.
# To do
4. Countries and languages
Reproduce the results of the analysis of the association between countries and the primary language spoken in those countries, conducted here. The contingency table of country of residence and primary language spoken is given below:
Country / Language | English | French | Spanish | German | Italian | Total |
---|---|---|---|---|---|---|
Canada | 688 | 280 | 10 | 11 | 11 | 1000 |
USA | 730 | 31 | 190 | 8 | 41 | 1000 |
England | 798 | 74 | 38 | 31 | 59 | 1000 |
Italy | 17 | 13 | 11 | 15 | 944 | 1000 |
Switzerland | 15 | 222 | 20 | 648 | 95 | 1000 |
Total | 2248 | 620 | 269 | 713 | 1150 | 5000 |
# To do