TP8 - Categorical Analysis & \(\chi^2\)-test

Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD


Objective: This lab aims to determine whether two categorical variables are related using visualization techniques and the chi-square (\(\chi^2\)) test. Additionally, it explores how this method can help identify valuable features for classification tasks.


I. Revisit Titanic Dataset

We are interested in identifying the factors that affect the passengers' chances of survival. Although the dataset contains missing values, the affected columns are not relevant for the purpose of this categorical analysis. First, load the dataset into your environment and convert the categorical columns (except Cabin) to a suitable data type.

# To do
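
As a starting point, the type conversion might look like the sketch below. The small DataFrame is made-up data standing in for the real file, and the list of categorical columns is an assumption based on the usual Titanic column names; in practice you would load your own copy of the dataset with `pd.read_csv`.

```python
import pandas as pd

# Made-up rows standing in for the Titanic file (assumption: the real data
# would be loaded with pd.read_csv from wherever your copy is stored).
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass":   [3, 1, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female"],
    "Embarked": ["S", "C", "S", "Q", None],
    "Age":      [22.0, 38.0, 26.0, None, 35.0],
})

# Assumed list of categorical columns; Cabin is deliberately left out.
categorical_cols = ["Survived", "Pclass", "Sex", "Embarked"]
titanic = titanic.astype({col: "category" for col in categorical_cols})

# Inspect the resulting types and the number of missing values per column.
print(titanic.dtypes)
print(titanic.isna().sum())
```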

A.1. \(\chi^2\)-test for Pclass and Embarked

As a warm-up, we analyze whether the port of embarkation (Embarked) is related to the passenger class (Pclass).

  • Visualize the relationship between the two columns.
  • Compute the two-way contingency table between the two columns.
  • Compute the marginal relative frequency for both columns.
  • Construct the expected and observed contingency tables.
  • Compute the \(\chi^2\)-distance between the two tables.
  • At \(\alpha=0.01\), test \(H_0: \text{Pclass and Embarked are independent}\) against \(H_1:\text{Pclass and Embarked are not independent}\). Hint: use the chi2 distribution from scipy.stats.
  • Is the result reliable?
# To do
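
The steps above (apart from the plot) can be sketched as follows. This is a minimal illustration on made-up toy data, not the Titanic result: the `toy` frame and its values are hypothetical, and the "expected count of at least 5" check is the common rule of thumb for judging whether the \(\chi^2\) approximation is reliable.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

# Hypothetical toy data standing in for the Pclass and Embarked columns.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 1, 2],
    "Embarked": list("CCSSSSSQQSCS"),
})

# Observed two-way contingency table.
observed = pd.crosstab(toy["Pclass"], toy["Embarked"])
n = observed.to_numpy().sum()

# Marginal relative frequencies of each variable.
row_freq = observed.sum(axis=1) / n
col_freq = observed.sum(axis=0) / n

# Expected counts under independence: n * P(row) * P(column).
expected = pd.DataFrame(
    np.outer(row_freq, col_freq) * n,
    index=observed.index,
    columns=observed.columns,
)

# Chi-square distance between the observed and expected tables.
chi2_stat = ((observed - expected) ** 2 / expected).to_numpy().sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value from the upper tail of the chi-square distribution.
p_value = chi2.sf(chi2_stat, dof)
alpha = 0.01
reject_h0 = p_value < alpha

# Rule of thumb: the approximation is doubtful when expected counts fall below 5.
n_small_cells = int((expected < 5).to_numpy().sum())

print(observed, expected.round(2), sep="\n\n")
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p-value = {p_value:.4f}")
print(f"reject H0 at alpha = {alpha}: {reject_h0}")
print(f"cells with expected count < 5: {n_small_cells}")
```

With only 12 toy rows every expected count is small, so this particular test would not be reliable; the real dataset has to be checked in the same way.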

A.2. Reproduce the previous result using chi2_contingency

  • Import chi2_contingency from scipy.stats and perform the \(\chi^2\)-test of independence on the previous pair of columns.
  • Verify that the results are identical to the previous ones.
# To do
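
A possible way to check the agreement is sketched below on a hypothetical table of counts (the numbers are made up for illustration). `chi2_contingency` returns the statistic, the p-value, the degrees of freedom and the expected table; its continuity correction only applies when there is a single degree of freedom, so for a 3×3 table it should match the manual computation.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 3x3 table of observed counts (not the Titanic values).
observed = np.array([
    [30, 10,  60],
    [ 5,  5,  90],
    [20, 40, 140],
])

# Result from scipy.
stat, p_value, dof, expected = chi2_contingency(observed)

# Manual computation for comparison.
n = observed.sum()
expected_manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
stat_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
p_manual = chi2.sf(stat_manual, dof)

print(f"scipy : chi2 = {stat:.4f}, p-value = {p_value:.3e}, dof = {dof}")
print(f"manual: chi2 = {stat_manual:.4f}, p-value = {p_manual:.3e}")
print("identical:", np.isclose(stat, stat_manual) and np.isclose(p_value, p_manual))
```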

B. Survived vs all categorical columns

  • Perform the \(\chi^2\)-test of independence between Survived and Sex.
  • Perform the \(\chi^2\)-test of independence between Survived and each of the other categorical columns.
  • Verify the assumptions of each test and list the factors that seem to affect the likelihood of survival of the passengers.
# To do
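
One way to organize these tests is a small helper that loops over the candidate columns, sketched below. The function name `chi2_against_target`, the toy data and the default `alpha` are all hypothetical choices made for illustration. Note that for a 2×2 table such as Survived vs Sex, `chi2_contingency` applies Yates' continuity correction by default.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_against_target(df, target, columns, alpha=0.05):
    """Run a chi-square independence test of each column against `target`."""
    rows = []
    for col in columns:
        table = pd.crosstab(df[col], df[target])
        stat, p_value, dof, expected = chi2_contingency(table)
        rows.append({
            "feature": col,
            "chi2": stat,
            "dof": dof,
            "p_value": p_value,
            # Smallest expected count, to check the "at least 5" rule of thumb.
            "min_expected": expected.min(),
            "reject_H0": bool(p_value < alpha),
        })
    return pd.DataFrame(rows).set_index("feature")

# Made-up data standing in for the Titanic columns.
toy = pd.DataFrame({
    "Sex":      ["male"] * 10 + ["female"] * 10,
    "Survived": [0] * 8 + [1] * 2 + [1] * 7 + [0] * 3,
    "Pclass":   [1, 2, 3, 3, 3, 3, 3, 2, 1, 2,
                 1, 1, 2, 1, 2, 3, 3, 1, 2, 3],
})

results = chi2_against_target(toy, "Survived", ["Sex", "Pclass"])
print(results)
```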

II. Countries and Languages

Perform the \(\chi^2\)-test of independence (both by hand and using chi2_contingency) between country of residence and primary language spoken. The contingency table of country of residence and primary language spoken is given below:

Country \ Language   English   French   Spanish   German   Italian   Total
Canada                   688      280        10       11        11    1000
USA                      730       31       190        8        41    1000
England                  798       74        38       31        59    1000
Italy                     17       13        11       15       944    1000
Switzerland               15      222        20      648        95    1000
Total                   2248      620       269      713      1150    5000
# To do
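
The table above can be entered directly as an array, as in the sketch below, which computes the test both by hand and with `chi2_contingency`. The variable names are illustrative, and the significance level 0.01 is an assumption carried over from Part I, since this exercise does not state one.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

countries = ["Canada", "USA", "England", "Italy", "Switzerland"]
languages = ["English", "French", "Spanish", "German", "Italian"]

# Observed counts copied from the table (rows: countries, columns: languages).
observed = np.array([
    [688, 280,  10,  11,  11],
    [730,  31, 190,   8,  41],
    [798,  74,  38,  31,  59],
    [ 17,  13,  11,  15, 944],
    [ 15, 222,  20, 648,  95],
])

# By hand: expected counts = row total * column total / grand total.
n = observed.sum()
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected = np.outer(row_totals, col_totals) / n

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (len(countries) - 1) * (len(languages) - 1)
p_value = chi2.sf(chi2_stat, dof)

# With scipy, for comparison.
stat_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed)

alpha = 0.01  # assumed significance level
print(f"by hand: chi2 = {chi2_stat:.2f}, dof = {dof}, p-value = {p_value:.3e}")
print(f"scipy  : chi2 = {stat_scipy:.2f}, dof = {dof_scipy}, p-value = {p_scipy:.3e}")
print("reject independence:", p_value < alpha)
```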

III. Categorical analysis for model construction

The Heart Disease dataset dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to using a subset of 14 of them. The “target” field refers to the presence of heart disease in the patient. It is integer-valued: 0 = no disease and 1 = disease.

Let’s build a logistic regression model to predict the target column.

  • How many qualitative and quantitative variables are there in this dataset? (Answer this question carefully!)
  • Convert any variables whose data type is inconsistent with their nature into a suitable data type.
  • Are there any missing values? Duplicate data?
# To do
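
The inspection steps above might be sketched as follows. The frame `heart_toy` is made-up data, and the column names and the choice of which columns count as qualitative are assumptions for illustration; the point is that integer-coded columns can still be categories, so the decision should come from the data description rather than from the dtype alone.

```python
import pandas as pd

# Made-up rows with hypothetical column names, standing in for the real file.
heart_toy = pd.DataFrame({
    "age":    [63, 45, 63, 52, 58, 45],
    "sex":    [1, 0, 1, 1, 0, 0],
    "cp":     [3, 1, 3, 0, 2, 1],
    "chol":   [233, 204, 233, 199, 318, 204],
    "target": [1, 1, 1, 0, 0, 1],
})

# Every column is stored as an integer, so dtypes alone are misleading.
print(heart_toy.dtypes)
print(heart_toy.nunique())

# Assumed split, decided by reading the data description.
qualitative_cols = ["sex", "cp", "target"]
quantitative_cols = ["age", "chol"]
heart_toy = heart_toy.astype({col: "category" for col in qualitative_cols})

# Missing values and duplicated rows.
n_missing = int(heart_toy.isna().sum().sum())
n_duplicates = int(heart_toy.duplicated().sum())
heart_clean = heart_toy.drop_duplicates().reset_index(drop=True)

print(f"missing values: {n_missing}, duplicated rows: {n_duplicates}")
print(f"rows after removing duplicates: {len(heart_clean)}")
```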
  • Compute the \(\eta\) coefficient (correlation ratio) between each quantitative column and the target.
  • Perform the \(\chi^2\)-test of independence between each qualitative input and the target.
# To do
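
For the \(\eta\) coefficient, a minimal sketch is given below, assuming the usual definition of the correlation ratio, \(\eta = \sqrt{SS_{\text{between}} / SS_{\text{total}}}\), where the groups are the classes of the target. The helper name `correlation_ratio` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def correlation_ratio(values, groups):
    """Correlation ratio (eta) of a quantitative variable given a grouping variable."""
    values = pd.Series(values, dtype=float).reset_index(drop=True)
    groups = pd.Series(groups).reset_index(drop=True)

    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        # A constant variable carries no information about the groups.
        return 0.0

    grouped = values.groupby(groups)
    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total))

# Made-up data with hypothetical column names.
toy = pd.DataFrame({
    "age":    [63, 45, 61, 52, 58, 47, 66, 50],
    "chol":   [233, 204, 240, 199, 318, 210, 250, 260],
    "target": [1, 0, 1, 0, 1, 0, 1, 0],
})

eta = pd.Series(
    {col: correlation_ratio(toy[col], toy["target"]) for col in ["age", "chol"]},
    name="eta",
)
print(eta)
```

A value of \(\eta\) close to 1 suggests that the group means differ strongly relative to the overall spread, while a value close to 0 suggests that the variable hardly separates the classes.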
  • Split the data into \(80\%\) training and \(20\%\) testing data.
  • Build three logistic regression models:
    • qual_model: built using only qualitative columns,
    • quan_model: built using only quantitative columns,
    • full_model: built using all columns.
  • Predict the testing data and measure the performance of the three models using: accuracy, precision, recall and f1-score.
  • Which model seems to be the best one?
# To do
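
The modelling steps above could be sketched as follows. This assumes scikit-learn, which the lab does not name explicitly, and uses synthetic data with hypothetical column names, so the scores it prints say nothing about the real dataset. Qualitative columns are one-hot encoded with `pd.get_dummies` before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data (hypothetical columns and relationship).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "age":  rng.normal(55, 9, n),
    "chol": rng.normal(245, 50, n),
    "sex":  rng.integers(0, 2, n),
    "cp":   rng.integers(0, 4, n),
})
logit = 0.08 * (toy["age"] - 55) + 0.6 * (toy["cp"] - 1.5)
toy["target"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

quan_cols = ["age", "chol"]
qual_cols = ["sex", "cp"]

# One-hot encode the qualitative inputs; keep quantitative inputs as they are.
X_qual = pd.get_dummies(toy[qual_cols].astype("category"), drop_first=True, dtype=float)
X_quan = toy[quan_cols]
X_full = pd.concat([X_quan, X_qual], axis=1)
y = toy["target"]

# One shared 80%-20% split so that the three models see the same rows.
train_idx, test_idx = train_test_split(
    toy.index, test_size=0.2, stratify=y, random_state=42
)

def fit_and_score(X):
    """Fit a logistic regression on the training rows and score it on the test rows."""
    model = LogisticRegression(max_iter=5000)
    model.fit(X.loc[train_idx], y.loc[train_idx])
    pred = model.predict(X.loc[test_idx])
    y_test = y.loc[test_idx]
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

feature_sets = {"qual_model": X_qual, "quan_model": X_quan, "full_model": X_full}
scores = pd.DataFrame({name: fit_and_score(X) for name, X in feature_sets.items()}).T
print(scores.round(3))
```

Using a single split for all three models keeps the comparison fair; with a single 20% test set the differences between models can still be noisy.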

Further Readings