TP8 - Categorical Analysis & \(\chi^2\)-test

Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD

Objective: This lab aims to determine whether two categorical variables are related using visualization techniques and the chi-square (\(\chi^2\)) test. Additionally, it explores how this method can help identify valuable features for classification tasks.

The notebook of this Lab can be downloaded here: Lab8_Categorical_Analysis.ipynb.
Or you can work directly with Google Colab here: Lab8_Categorical_Analysis.ipynb.

I. Revisit Titanic Dataset

We are interested in identifying the factors that affect the passengers' chance of survival. Although the dataset contains missing values, the columns concerned are not relevant for the purpose of categorical analysis. First, load the dataset into your environment and set the appropriate data type for the categorical columns (except Cabin).
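
As a starting point, the type conversion might look like the minimal sketch below. The helper name cast_categoricals and the tiny in-memory frame are assumptions made for illustration; in the lab, the frame would come from the actual Titanic file (for example via pd.read_csv with your own path), and the list of columns to convert is yours to decide.

```python
import pandas as pd

def cast_categoricals(df, columns):
    """Return a copy of df with the given columns stored as 'category'."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out

# Toy rows standing in for the real file (hypothetical values).
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
    "Cabin": [None, "C85", None],
})

# Cabin is deliberately left untouched, as the instructions ask.
titanic = cast_categoricals(toy, ["Survived", "Pclass", "Sex", "Embarked"])
print(titanic.dtypes)
```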

# To do

A.1. \(\chi^2\)-test for Pclass and Embarked

As a warm-up, we analyze whether the embarkation port (Embarked) is related to the status of the passengers (Pclass).

Visualize the relationship between the two columns.
Compute two-way contingency table between the two columns.
Compute the marginal relative frequency for both columns.
Construct the expected and observed contingency tables.
Compute the \(\chi^2\)-distance between the two tables.
At \(\alpha=0.01\), test \(H_0: \text{Pclass and Embarked are independent}\) against \(H_1:\text{Pclass and Embarked are not independent}\). Hint: use the chi2 distribution from scipy.stats.
Is the result reliable?
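
The steps above can be sketched on a small made-up table (these are not the Titanic counts). The function name chi2_by_hand and the toy numbers are assumptions for illustration only. The smallest expected count is returned as well, since a common rule of thumb asks for expected counts of at least 5 in every cell before trusting the test.

```python
import numpy as np
from scipy.stats import chi2

def chi2_by_hand(observed, alpha=0.01):
    """Chi-square test of independence computed step by step."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    # Marginal relative frequencies of the row and column variables.
    row_freq = observed.sum(axis=1) / n
    col_freq = observed.sum(axis=0) / n
    # Expected counts under independence: n * P(row) * P(column).
    expected = n * np.outer(row_freq, col_freq)
    # Chi-square distance between observed and expected tables.
    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    critical = chi2.ppf(1 - alpha, dof)   # rejection threshold
    p_value = chi2.sf(stat, dof)          # P(chi2_dof >= stat)
    return {
        "stat": stat,
        "dof": dof,
        "critical": critical,
        "p_value": p_value,
        "reject_H0": bool(stat > critical),
        "min_expected": expected.min(),
    }

# Toy 2 x 3 contingency table (hypothetical counts).
toy = np.array([[30, 10, 20],
                [20, 30, 40]])
result = chi2_by_hand(toy)
print(result)
```

With the real data, the observed table would typically come from pd.crosstab applied to the two columns.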

# To do

A.2. Reproduce the previous result using chi2_contingency

Import chi2_contingency from scipy.stats and perform the \(\chi^2\)-test of independence on the previous pair.
Verify that the results are identical to the previous ones.
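
A minimal sketch of this comparison, again on a made-up table rather than the Titanic counts, could look as follows. Note that chi2_contingency applies Yates' continuity correction by default when the table has one degree of freedom (a 2 x 2 table); in that case, pass correction=False to match a by-hand statistic.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy 2 x 3 contingency table (hypothetical counts).
observed = np.array([[30, 10, 20],
                     [20, 30, 40]])

# Library result: statistic, p-value, degrees of freedom, expected counts.
stat, p_value, dof, expected = chi2_contingency(observed)

# Manual computation for comparison.
manual_expected = (
    np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
)
manual_stat = ((observed - manual_expected) ** 2 / manual_expected).sum()

print("same statistic:", np.isclose(stat, manual_stat))
print("same expected table:", np.allclose(expected, manual_expected))
```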

# To do

B. Survived vs all categorical columns

Perform the \(\chi^2\)-test of independence between Survived and Sex.
Perform the \(\chi^2\)-test of independence between Survived and each of the other categorical columns.
Verify the assumptions of each test and list the factors that seem to affect the likelihood of survival of the passengers.
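
One possible way to organize these repeated tests is a small helper that loops over the columns and records the smallest expected count next to each result. The helper name chi2_against_target and the toy frame below are assumptions for illustration; the toy rows are invented and do not reproduce the Titanic data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_against_target(df, target, columns, alpha=0.01):
    """Run a chi-square independence test of each column against target."""
    rows = []
    for col in columns:
        table = pd.crosstab(df[col], df[target])
        stat, p_value, dof, expected = chi2_contingency(table)
        rows.append({
            "column": col,
            "chi2": stat,
            "dof": dof,
            "p_value": p_value,
            # Rule of thumb: expected counts should be at least 5.
            "min_expected": expected.min(),
            "reject_H0": bool(p_value < alpha),
        })
    return pd.DataFrame(rows).set_index("column")

# Toy data (hypothetical): 40 passengers, two candidate factors.
toy = pd.DataFrame({
    "Survived": [0] * 16 + [1] * 4 + [0] * 5 + [1] * 15,
    "Sex": ["male"] * 20 + ["female"] * 20,
    "Embarked": ["S", "C"] * 20,
})
summary = chi2_against_target(toy, "Survived", ["Sex", "Embarked"])
print(summary)
```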

# To do

II. Countries and Languages

Perform the \(\chi^2\)-test of independence (both by hand and using chi2_contingency) between country of residence and the primary language spoken in that country (source here). The contingency table of country of residence and primary language spoken is given below:

Country \ Language	English	French	Spanish	German	Italian	Total
Canada	688	280	10	11	11	1000
USA	730	31	190	8	41	1000
England	798	74	38	31	59	1000
Italy	17	13	11	15	944	1000
Switzerland	15	222	20	648	95	1000
Total	2248	620	269	713	1150	5000
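
The table above can be entered directly as a DataFrame, leaving out the Total row and column, which are margins rather than observations. The sketch below is one way to set up both computations; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

# Observed counts from the table (margins excluded).
observed = pd.DataFrame(
    [[688, 280, 10, 11, 11],
     [730, 31, 190, 8, 41],
     [798, 74, 38, 31, 59],
     [17, 13, 11, 15, 944],
     [15, 222, 20, 648, 95]],
    index=["Canada", "USA", "England", "Italy", "Switzerland"],
    columns=["English", "French", "Spanish", "German", "Italian"],
)

# By hand: expected counts, chi-square distance, degrees of freedom, p-value.
counts = observed.to_numpy(dtype=float)
n = counts.sum()
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
stat = ((counts - expected) ** 2 / expected).sum()
dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
p_value = chi2.sf(stat, dof)

# With the library, for comparison.
lib_stat, lib_p, lib_dof, lib_expected = chi2_contingency(observed)

print("chi2 =", stat, "dof =", dof, "p-value =", p_value)
print("matches chi2_contingency:", np.isclose(stat, lib_stat))
print("smallest expected count:", expected.min())
```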

# To do

III. Categorical analysis for model construction

The Heart Disease Dataset dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The “target” field refers to the presence of heart disease in the patient. It is integer-valued: 0 = no disease and 1 = disease.

Let’s build a logistic regression model to predict the target column.

How many qualitative and quantitative variables are there in this dataset? (Answer this question carefully!)
Convert variables into their suitable data type if there are any inconsistent variable types.
Are there any missing values? Duplicate data?

# To do

Compute the \(\eta\) coefficients between all quantitative columns and the target.
Perform the \(\chi^2\)-test of independence between each qualitative input and the target.
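
For the first item, a minimal sketch is given below, assuming that \(\eta\) denotes the correlation ratio, i.e. the square root of the between-group sum of squares divided by the total sum of squares. The function name correlation_ratio and the toy values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

def correlation_ratio(categories, values):
    """Correlation ratio eta between a categorical and a numeric variable."""
    categories = pd.Series(categories).reset_index(drop=True)
    values = pd.Series(values, dtype=float).reset_index(drop=True)
    grand_mean = values.mean()
    groups = values.groupby(categories)
    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    # Total sum of squares around the grand mean.
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0
    return float(np.sqrt(ss_between / ss_total))

# Toy example (hypothetical values): x separates the two classes well.
toy = pd.DataFrame({
    "target": [0, 0, 0, 1, 1, 1],
    "x": [1, 2, 3, 7, 8, 9],
    "z": [1, 2, 3, 1, 2, 3],
})
etas = pd.Series(
    {col: correlation_ratio(toy["target"], toy[col]) for col in ["x", "z"]}
)
print(etas)
```

A value close to 1 suggests that the quantitative column differs strongly between the target classes, while a value close to 0 suggests that the group means are nearly identical.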

# To do

Split the data into \(80\%\) training data and \(20\%\) testing data.
Build three logistic regression models:
- qual_model: built using only qualitative columns,
- quan_model: built using only quantitative columns,
- full_model: built using all columns.
Predict the testing data and measure the performance of the three models using: accuracy, precision, recall and f1-score.
Which model seems to be the best one?
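
One way to structure this comparison with scikit-learn is sketched below. The helper name fit_and_score, the preprocessing choices (one-hot encoding for qualitative columns, standardization for quantitative ones), and the synthetic data are all assumptions made for illustration; the synthetic frame is not the heart disease dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def fit_and_score(df, target, qual_cols, quan_cols, seed=0):
    """Fit the three models on an 80/20 split and score them on the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(columns=target), df[target],
        test_size=0.2, random_state=seed, stratify=df[target],
    )
    feature_sets = {
        "qual_model": (qual_cols, []),
        "quan_model": ([], quan_cols),
        "full_model": (qual_cols, quan_cols),
    }
    scores = {}
    for name, (qual, quan) in feature_sets.items():
        transformers = []
        if qual:
            transformers.append(
                ("onehot", OneHotEncoder(handle_unknown="ignore"), qual))
        if quan:
            transformers.append(("scale", StandardScaler(), quan))
        # Columns not listed in a transformer are dropped by default.
        model = make_pipeline(ColumnTransformer(transformers),
                              LogisticRegression(max_iter=1000))
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred, zero_division=0),
            "f1": f1_score(y_test, pred, zero_division=0),
        }
    return pd.DataFrame(scores).T

# Synthetic data (hypothetical): one quantitative and one qualitative input.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "x": rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
signal = 1.5 * toy["x"] + (toy["group"] == "a") * 1.0
toy["target"] = (signal + rng.normal(size=n) > 0.3).astype(int)

scores = fit_and_score(toy, "target", qual_cols=["group"], quan_cols=["x"])
print(scores)
```

Stratifying the split on the target keeps the class proportions similar in the training and testing parts, which makes the four metrics easier to compare across models.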

# To do
