TP8 - Categorical Analysis & \(\chi^2\)-test
Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD
Objective: This lab aims to determine whether two categorical variables are related using visualization techniques and the chi-square (\(\chi^2\)) test. Additionally, it explores how this method can help identify valuable features for classification tasks.
The notebook of this lab can be downloaded here: Lab8_Categorical_Analysis.ipynb. Or you can work directly with Google Colab here: Lab8_Categorical_Analysis.ipynb.
I. Revisit Titanic Dataset
We are interested in identifying the factors that affect a passenger's chance of survival. Although the dataset contains missing values, the affected columns are not relevant for the purpose of categorical analysis. First, load the dataset into our environment and take care of the data type of the categorical columns (except Cabin).
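A minimal loading sketch, assuming the Kaggle Titanic file is saved locally as `titanic.csv` (the path is an assumption; adjust it to your copy):

```python
import pandas as pd

# Load the Titanic dataset (file name assumed; change it to wherever your copy lives)
titanic = pd.read_csv("titanic.csv")

# Cast the categorical columns to the "category" dtype (Cabin is left aside)
cat_cols = ["Survived", "Pclass", "Sex", "Embarked"]
titanic[cat_cols] = titanic[cat_cols].astype("category")

print(titanic.dtypes)
```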
A.1. \(\chi^2\)-test for Pclass and Embarked
As a warm-up, we analyze whether the embarkation port is related to the passenger class.
- Visualize the relationship between the two columns.
- Compute the two-way contingency table of the two columns.
- Compute the marginal relative frequencies of both columns.
- Construct the expected and observed contingency tables.
- Compute the \(\chi^2\)-distance between the two tables.
- At \(\alpha=0.01\), test \(H_0: \text{Pclass and Embarked are independent}\) against \(H_1:\text{Pclass and Embarked are not independent}\). Hint: use the chi2 module from scipy.stats (a worked sketch follows the placeholder below).
- Is the result reliable?
# To do
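A possible by-hand sketch, reusing the `titanic` DataFrame loaded above; the bar chart and the final check correspond to the first and last bullets:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import chi2

# Visualize the relationship with a stacked bar chart of row proportions
pd.crosstab(titanic["Pclass"], titanic["Embarked"], normalize="index").plot(kind="bar", stacked=True)
plt.show()

# Observed two-way contingency table
obs = pd.crosstab(titanic["Pclass"], titanic["Embarked"])
n = obs.values.sum()

# Marginal relative frequencies
row_marg = obs.sum(axis=1) / n
col_marg = obs.sum(axis=0) / n

# Expected counts under independence: n * P(row) * P(col)
expected = np.outer(row_marg, col_marg) * n

# Chi-square distance between the observed and expected tables
chi2_stat = ((obs.values - expected) ** 2 / expected).sum()

# Degrees of freedom and p-value
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
p_value = chi2.sf(chi2_stat, dof)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p-value = {p_value:.4g}")
# Reject H0 at alpha = 0.01 when p_value < 0.01

# Reliability rule of thumb: every expected count should be at least 5
print("all expected counts >= 5:", (expected >= 5).all())
```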
A.2. Reproduce the previous result using chi2_contingency
- Import chi2_contingency from scipy.stats and perform the \(\chi^2\)-test of independence on the previous pair (see the sketch below).
- Verify that the results are identical to the previous ones.
# To do
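A sketch with chi2_contingency; the statistic, degrees of freedom, expected table, and p-value should match the by-hand values, since no continuity correction is applied to tables larger than 2×2:

```python
from scipy.stats import chi2_contingency

# Same contingency table as before
obs = pd.crosstab(titanic["Pclass"], titanic["Embarked"])
stat, p_value, dof, expected = chi2_contingency(obs)
print(f"chi2 = {stat:.3f}, dof = {dof}, p-value = {p_value:.4g}")
```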
B. Survived vs all categorical columns
- Perform the \(\chi^2\)-test of independence between Survived and Sex.
- Perform the \(\chi^2\)-test of independence between Survived and each of the other categorical columns (a loop sketch follows the placeholder below).
- Verify the assumptions of each test and list the factors that seem to affect the likelihood of survival of the passengers.
# To do
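One way to loop the test over the candidate columns; the list of columns below is an assumption, so adapt it to the categorical columns you kept:

```python
from scipy.stats import chi2_contingency

candidate_cols = ["Sex", "Pclass", "Embarked", "SibSp", "Parch"]  # assumed list; adjust as needed
for col in candidate_cols:
    table = pd.crosstab(titanic["Survived"], titanic[col])
    stat, p_value, dof, expected = chi2_contingency(table)
    reliable = (expected >= 5).all()  # assumption of the test: large enough expected counts
    print(f"Survived vs {col}: chi2 = {stat:.2f}, p = {p_value:.3g}, reliable: {reliable}")
```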
II. Countries and Languages
Perform a \(\chi^2\)-test of independence (both by hand and using chi2_contingency) between country of residence and the primary language spoken within those countries, for the survey conducted here. The contingency table of country of residence and primary language spoken is given below:
| Country \ Language | English | French | Spanish | German | Italian | Total |
|---|---|---|---|---|---|---|
| Canada | 688 | 280 | 10 | 11 | 11 | 1000 |
| USA | 730 | 31 | 190 | 8 | 41 | 1000 |
| England | 798 | 74 | 38 | 31 | 59 | 1000 |
| Italy | 17 | 13 | 11 | 15 | 944 | 1000 |
| Switzerland | 15 | 222 | 20 | 648 | 95 | 1000 |
| Total | 2248 | 620 | 269 | 713 | 1150 | 5000 |
# To do
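A sketch for this exercise, entering the observed counts from the table above (the Total row and column are excluded) and comparing the by-hand statistic with chi2_contingency:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

# Observed counts copied from the contingency table above
obs = pd.DataFrame(
    [[688, 280, 10, 11, 11],
     [730, 31, 190, 8, 41],
     [798, 74, 38, 31, 59],
     [17, 13, 11, 15, 944],
     [15, 222, 20, 648, 95]],
    index=["Canada", "USA", "England", "Italy", "Switzerland"],
    columns=["English", "French", "Spanish", "German", "Italian"],
)

# By hand: expected counts under independence, chi-square distance, p-value
n = obs.values.sum()
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
stat = ((obs.values - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
print("by hand:", stat, chi2.sf(stat, dof))

# With scipy, for comparison
stat2, p2, dof2, _ = chi2_contingency(obs)
print("chi2_contingency:", stat2, p2)
```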
III. Categorical analysis for model construction
Heart Disease Dataset dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to using a subset of 14 of them. The “target” field refers to the presence of heart disease in the patient. It is integer valued 0 = no disease and 1 = disease.
Let's build a logistic regression model to predict the target column.
- How many qualitative and quantitative variables are there in this dataset (answer this question carefully!)?
- Convert variables to suitable data types if any are inconsistently typed.
- Are there any missing values? Any duplicated rows? (A preparation sketch follows the placeholder below.)
# To do
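A preparation sketch, assuming the common 14-column heart.csv version of the dataset; the file name, the column names, and the qualitative/quantitative split below are assumptions to check against your copy:

```python
import pandas as pd

heart = pd.read_csv("heart.csv")  # path/filename assumed; adjust to your copy

# Assumed split into qualitative and quantitative columns (verify against your data)
qual_cols = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "target"]
quan_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]

# These columns are stored as integers but are really categorical
heart[qual_cols] = heart[qual_cols].astype("category")

print(heart.dtypes)
print("Missing values per column:\n", heart.isna().sum())
print("Number of duplicated rows:", heart.duplicated().sum())
```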
- Compute the \(\eta\) coefficients between all quantitative columns and the target.
- Perform the \(\chi^2\)-test of independence between each qualitative input and the target (a sketch follows the placeholder below).
# To do
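A sketch continuing from the preparation step; `correlation_ratio` is a small helper written here (not a library function) implementing \(\eta = \sqrt{SS_{\text{between}}/SS_{\text{total}}}\):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def correlation_ratio(categories, values):
    """Correlation ratio eta = sqrt(between-group SS / total SS)."""
    df = pd.DataFrame({"g": categories, "y": values.astype(float)})
    grand_mean = df["y"].mean()
    ss_total = ((df["y"] - grand_mean) ** 2).sum()
    stats = df.groupby("g", observed=True)["y"].agg(["count", "mean"])
    ss_between = (stats["count"] * (stats["mean"] - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total)

# Eta between the target and each quantitative column
for col in quan_cols:
    print(f"eta(target, {col}) = {correlation_ratio(heart['target'], heart[col]):.3f}")

# Chi-square test of independence between each qualitative input and the target
for col in [c for c in qual_cols if c != "target"]:
    table = pd.crosstab(heart[col], heart["target"])
    stat, p, dof, expected = chi2_contingency(table)
    print(f"{col}: chi2 = {stat:.2f}, p = {p:.3g}, expected counts >= 5: {(expected >= 5).all()}")
```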
- Split the data into training (\(80\%\)) and testing (\(20\%\)) sets.
- Build three logistic regression models:
  - qual_model: built using only the qualitative columns,
  - quan_model: built using only the quantitative columns,
  - full_model: built using all columns.
- Predict on the testing data and measure the performance of the three models using accuracy, precision, recall and F1-score (see the sketch after the placeholder below).
- Which model seems to be the best one?
# To do
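A sketch of the modelling step using scikit-learn pipelines; the library choice and the preprocessing (one-hot encoding for qualitative columns, standardization for quantitative ones) are assumptions, and any equivalent setup works:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 80%-20% train/test split, stratified on the target
X = heart.drop(columns="target")
y = heart["target"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

qual_inputs = [c for c in qual_cols if c != "target"]

def make_model(qual, quan):
    """Logistic regression on one-hot-encoded qualitative and standardized quantitative columns."""
    transformers = []
    if qual:
        transformers.append(("qual", OneHotEncoder(handle_unknown="ignore"), qual))
    if quan:
        transformers.append(("quan", StandardScaler(), quan))
    return Pipeline([("pre", ColumnTransformer(transformers)),
                     ("clf", LogisticRegression(max_iter=1000))])

models = {
    "qual_model": make_model(qual_inputs, []),
    "quan_model": make_model([], quan_cols),
    "full_model": make_model(qual_inputs, quan_cols),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"--- {name} ---")
    print(classification_report(y_test, y_pred))  # accuracy, precision, recall, f1-score
```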