TP8 - Multiple Correspondence Analysis (MCA)

Exploratory Data Analysis & Unsupervised Learning
Course: PHAUK Sokkey, PhD
TP: HAS Sothea, PhD

Objective: In the previous TP, we studied Correspondence Analysis (CA), which analyzes the association between two qualitative variables. In this TP, we extend it to the more general case of Multiple Correspondence Analysis (MCA), which studies the associations among several qualitative variables, typically survey data encoded as an indicator matrix or a Burt matrix.

The Jupyter Notebook for this TP can be downloaded here: TP8_MCA.ipynb.

1. Kaggle Stroke Dataset

A stroke occurs when the blood supply to part of the brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die within minutes. Strokes can be classified into two main types: ischemic stroke, caused by a blockage in an artery, and hemorrhagic stroke, caused by a blood vessel leaking or bursting. Immediate medical attention is crucial to minimize brain damage and potential complications (see Mayo Clinic and WebMD).

The Kaggle Stroke Prediction Dataset (available here) is designed to predict the likelihood of a stroke based on various health parameters. We will analyze this association using MCA.

The Stroke dataset can be imported from Kaggle as follows.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("fedesoriano/stroke-prediction-dataset")

# Import data
import pandas as pd
data = pd.read_csv(path + "/healthcare-dataset-stroke-data.csv")
data.head()

	id	gender	age	hypertension	heart_disease	ever_married	work_type	Residence_type	avg_glucose_level	bmi	smoking_status	stroke
0	9046	Male	67.0	0	1	Yes	Private	Urban	228.69	36.6	formerly smoked	1
1	51676	Female	61.0	0	0	Yes	Self-employed	Rural	202.21	NaN	never smoked	1
2	31112	Male	80.0	0	1	Yes	Private	Rural	105.92	32.5	never smoked	1
3	60182	Female	49.0	0	0	Yes	Private	Urban	171.23	34.4	smokes	1
4	1665	Female	79.0	1	0	Yes	Self-employed	Rural	174.12	24.0	never smoked	1

# To do

A. Perform data preprocessing to ensure that the columns are in correct types, clean and complete for further analysis.

# To do
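As a starting point for step A, here is a minimal sketch of one possible preprocessing pipeline. It rebuilds by hand the five rows shown by `data.head()` so that it runs offline. The helper name `preprocess_stroke`, the cut points for age, BMI and glucose, and the choice to keep missing BMI values as an explicit "unknown" category are all illustrative assumptions, not part of the TP; you should justify your own choices.

```python
import numpy as np
import pandas as pd

# The five rows displayed by data.head(), rebuilt by hand for an offline example.
sample = pd.DataFrame({
    "id": [9046, 51676, 31112, 60182, 1665],
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "hypertension": [0, 0, 0, 0, 1],
    "heart_disease": [1, 0, 1, 0, 0],
    "ever_married": ["Yes", "Yes", "Yes", "Yes", "Yes"],
    "work_type": ["Private", "Self-employed", "Private", "Private", "Self-employed"],
    "Residence_type": ["Urban", "Rural", "Rural", "Urban", "Rural"],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23, 174.12],
    "bmi": [36.6, np.nan, 32.5, 34.4, 24.0],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked",
                       "smokes", "never smoked"],
    "stroke": [1, 1, 1, 1, 1],
})


def preprocess_stroke(df):
    """Hypothetical helper: return an all-categorical copy of the stroke data."""
    # Remove exact duplicates, then drop the identifier, which carries no information.
    out = df.drop_duplicates().drop(columns="id").copy()

    # Binary 0/1 flags become readable labels.
    for col in ["hypertension", "heart_disease", "stroke"]:
        out[col] = out[col].map({0: "No", 1: "Yes"})

    # Continuous variables are binned; every cut point below is an assumption.
    out["age"] = pd.cut(out["age"], bins=[0, 18, 40, 60, np.inf],
                        labels=["0-17", "18-39", "40-59", "60+"], right=False)
    out["avg_glucose_level"] = pd.cut(out["avg_glucose_level"],
                                      bins=[0, 100, 126, np.inf],
                                      labels=["normal", "elevated", "high"],
                                      right=False)
    bmi_groups = pd.cut(out["bmi"], bins=[0, 18.5, 25, 30, np.inf],
                        labels=["underweight", "normal", "overweight", "obese"],
                        right=False)
    # Missing BMI is kept as its own category instead of being imputed.
    out["bmi"] = bmi_groups.cat.add_categories("unknown").fillna("unknown")

    return out.astype("category")


clean = preprocess_stroke(sample)
print(clean.dtypes)
```

On the full dataset, the same function can be called as `preprocess_stroke(data)`. Keeping "unknown" as a category avoids discarding rows, but it also adds a point to the MCA map, so dropping or imputing those rows is an equally defensible option.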

B. Recall that in CA (or even MCA), if \(X\in\mathbb{R}^{n\times m}\) is the design matrix, let \(Z=X/N\) be the normalized matrix, where \(N=\sum_{i,j}x_{ij}\) is the grand total of \(X\). The row and column marginal relative frequencies are given by \(\textbf{r}=Z\mathbb{1}_{m}\) and \(\textbf{c}=Z^T\mathbb{1}_{n}\), where \(\mathbb{1}_d\) denotes the \(d\)-dimensional column vector with all elements equal to 1. Let the row and column weights be \(D_r=\text{diag}(\textbf{r})\) and \(D_c=\text{diag}(\textbf{c})\); then the factor scores are obtained from the following singular value decomposition:

\[D_r^{-1/2}(Z-\textbf{r}\textbf{c}^T)D_c^{-1/2}=UD V^T,\]

where \(D\) is the diagonal matrix of singular values and the diagonal entries of \(\Sigma=D^2\) are the eigenvalues (principal inertias). The row and column scores are given by: \[F=D_r^{-1/2}UD\quad\text{and}\quad G=D_c^{-1/2}VD,\] which are the scores represented on the principal axes with respect to the \(\chi^2\)-distance.
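The decomposition above can be sketched directly in NumPy. This is a minimal illustration, not the implementation used by any particular library: the function name `ca_factor_scores` and the toy indicator matrix `X_toy` are made up for the example, and the matrix is normalized by its grand total so that the entries of \(Z\) sum to one.

```python
import numpy as np


def ca_factor_scores(X):
    """Hypothetical helper: row scores F, column scores G and eigenvalues of a CA."""
    X = np.asarray(X, dtype=float)
    Z = X / X.sum()                      # normalized matrix, entries sum to 1
    r = Z.sum(axis=1)                    # row masses, r = Z 1_m
    c = Z.sum(axis=0)                    # column masses, c = Z^T 1_n
    # Standardized residuals D_r^{-1/2} (Z - r c^T) D_c^{-1/2}
    S = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U * d) / np.sqrt(r)[:, None]    # F = D_r^{-1/2} U D
    G = (Vt.T * d) / np.sqrt(c)[:, None]  # G = D_c^{-1/2} V D
    return F, G, d ** 2


# Toy indicator matrix: 6 individuals, one variable with 2 categories
# (columns 0-1) and one variable with 3 categories (columns 2-4).
X_toy = np.array([
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
])

F, G, eigenvalues = ca_factor_scores(X_toy)
explained = eigenvalues / eigenvalues.sum()  # share of inertia per axis
print("Share of inertia on the first two axes:", explained[:2])
```

Two sanity checks are useful here. The mass-weighted mean of the row scores is zero on every axis, and for an indicator matrix with \(Q\) variables and \(J\) observed categories in total, the eigenvalues sum to \(J/Q-1\) (here \(5/2-1=1.5\)).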

With the Stroke dataset:

Perform MCA on the stroke dataset using prince, available here: https://github.com/MaxHalford/prince.
Compute the explained variance of the first two axes.
Compute the row factor scores.
Compute the column factor scores.
Create a symmetric biplot: a balanced view of both variables and observations, with both sets in principal coordinates. Explain the graph.
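Before running the steps above, it can help to look at the two matrices that MCA is usually defined on. The sketch below builds an indicator matrix and the corresponding Burt matrix with pandas for three of the stroke columns, using the five rows shown earlier. The variable names `toy`, `indicator` and `burt` and the `=` separator in the column labels are choices made for this example only.

```python
import pandas as pd

# Three categorical columns from the five rows displayed by data.head().
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "Residence_type": ["Urban", "Rural", "Rural", "Urban", "Rural"],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked",
                       "smokes", "never smoked"],
})

# Indicator (one-hot) matrix: one column per category, one 1 per variable and row.
indicator = pd.get_dummies(toy, prefix_sep="=").astype(int)

# Burt matrix: all pairwise cross-tabulations stacked into one symmetric table.
burt = indicator.T @ indicator

print(indicator)
print(burt)
```

Each row of `indicator` sums to the number of variables, the diagonal of `burt` holds the category counts, and each off-diagonal block is the contingency table of a pair of variables, which is how MCA generalizes the two-variable CA of the previous TP.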

# To do

2. Sleep Health and Lifestyle Dataset

MCA has been widely used in studies examining the associations between instances based on different categorical characteristics. In this section, we will use MCA to analyze the associations within the Kaggle Sleep Health and Lifestyle Dataset, available here: https://www.kaggle.com/datasets/siamaktahmasbi/insights-into-sleep-patterns-and-daily-habits.

A. Apply data preprocessing on the dataset:

Make sure that there are no missing values or duplicated rows, and that the data is ready for the analysis.
Convert some continuous variables into grouped data for the analysis, for example blood pressure into [‘low’, ‘normal’, ‘high’]. You have to choose the groupings on your own, based on your own research.

import kagglehub
import pandas as pd
# Download latest version
path = kagglehub.dataset_download("siamaktahmasbi/insights-into-sleep-patterns-and-daily-habits")

data = pd.read_csv(path + '/sleep_health_lifestyle_dataset.csv')
data.head()

	Person ID	Gender	Age	Occupation	Sleep Duration (hours)	Quality of Sleep (scale: 1-10)	Physical Activity Level (minutes/day)	Stress Level (scale: 1-10)	BMI Category	Blood Pressure (systolic/diastolic)	Heart Rate (bpm)	Daily Steps	Sleep Disorder
0	1	Male	29	Manual Labor	7.4	7.0	41	7	Obese	124/70	91	8539	NaN
1	2	Female	43	Retired	4.2	4.9	41	5	Obese	131/86	81	18754	NaN
2	3	Male	44	Retired	6.1	6.0	107	4	Underweight	122/70	81	2857	NaN
3	4	Male	29	Office Worker	8.3	10.0	20	10	Obese	124/72	55	6886	NaN
4	5	Male	67	Retired	9.1	9.5	19	4	Overweight	133/78	97	14945	Insomnia
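For step A, the grouping of blood pressure could look like the sketch below, applied to three columns of the five rows displayed above. The thresholds in `bp_category`, the sleep duration bins, and the decision to treat a missing `Sleep Disorder` as the category "None" are assumptions made for illustration; the TP asks you to research and defend your own cut points.

```python
import numpy as np
import pandas as pd

# Three columns from the five rows displayed by data.head().
sleep = pd.DataFrame({
    "Sleep Duration (hours)": [7.4, 4.2, 6.1, 8.3, 9.1],
    "Blood Pressure (systolic/diastolic)": ["124/70", "131/86", "122/70",
                                            "124/72", "133/78"],
    "Sleep Disorder": [np.nan, np.nan, np.nan, np.nan, "Insomnia"],
})


def bp_category(reading):
    """Hypothetical helper: map a 'systolic/diastolic' string to low/normal/high."""
    systolic, diastolic = (int(v) for v in reading.split("/"))
    # Illustrative thresholds only; replace them with values you can justify.
    if systolic < 90 or diastolic < 60:
        return "low"
    if systolic >= 130 or diastolic >= 80:
        return "high"
    return "normal"


sleep["Blood Pressure"] = (
    sleep["Blood Pressure (systolic/diastolic)"].map(bp_category).astype("category")
)

# Assumption: a missing value here means that no disorder was reported.
sleep["Sleep Disorder"] = sleep["Sleep Disorder"].fillna("None").astype("category")

# Assumed bins: under 6 hours is short, 6 to under 9 is adequate, 9 or more is long.
sleep["Sleep Duration"] = pd.cut(sleep["Sleep Duration (hours)"],
                                 bins=[0, 6, 9, np.inf],
                                 labels=["short", "adequate", "long"],
                                 right=False)

print(sleep[["Blood Pressure", "Sleep Disorder", "Sleep Duration"]])
```

The same pattern (parse, threshold, cast to `category`) extends to the other numeric columns such as heart rate, stress level and daily steps.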

B. Perform MCA on the dataset:

Compute variances explained by the first two dimensions.
Compute the row factor scores on the first two dimensions.
Compute the column factor scores on the first two dimensions.
Create a symmetric Biplot: a balanced view of variables and observations, with both sets in principal coordinates. Explain the graph.

Further Readings