Correspondence Analysis (CA)

Exploratory Data Analysis & Unsupervised Learning

Lecturer: Dr. HAS Sothea

Content

Motivation
Recall Categorical \(\chi^2\) Test
\(\chi^2\) distance in detail
GSVD & Correspondence Analysis (CA)
Applications

Motivation

UCI Adult dataset

The dataset was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker.
For more about the dataset, see the UCI Adult dataset page.
Dimension: (32561, 15), and some columns are shown below:

	age	workclass	education	marital.status	occupation	relationship
0	90	?	HS-grad	Widowed	?	Not-in-family
1	82	Private	HS-grad	Widowed	Exec-managerial	Not-in-family
2	66	?	Some-college	Widowed	?	Unmarried
3	54	Private	7th-8th	Divorced	Machine-op-inspct	Unmarried
4	41	Private	Some-college	Separated	Prof-specialty	Own-child

Do you see any problem in the table above?
For demonstration purposes, we will simply drop the rows with missing values.
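
As a minimal sketch of that cleaning step, assuming missing values are encoded as the string `?` (as in the preview above): the tiny `sample` frame below is a hypothetical copy of a few preview columns, not the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample copied from the preview table (not the full dataset).
sample = pd.DataFrame({
    "age": [90, 82, 66, 54, 41],
    "workclass": ["?", "Private", "?", "Private", "Private"],
    "education": ["HS-grad", "HS-grad", "Some-college", "7th-8th", "Some-college"],
    "occupation": ["?", "Exec-managerial", "?", "Machine-op-inspct", "Prof-specialty"],
})

# Treat the "?" placeholder as a missing value, then drop incomplete rows.
cleaned = sample.replace("?", np.nan).dropna()
print(cleaned.shape)
```

The slides that follow filter unwanted categories with `query` instead; both approaches remove the rows that contain `?`.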

Motivation

UCI Adult dataset

Challenges when dealing with categorical data:
- How to visualize category-to-category relationships between two or more categorical variables?
- How to reduce the dimension of categorical data?
- How to cluster categorical data?
Correspondence Analysis (CA) is a powerful data analysis technique to address these challenges.
It can be considered as the counterpart of PCA for categorical data.
After rejecting the null hypothesis of independence in a \(\chi^2\)-test, one may want to explore the association between two categorical variables.
This is where CA comes into play. It offers a way to visualize the association between two categorical variables in a low-dimensional space.

Recall Categorical \(\chi^2\) Test

Contingency Table

Two-way contingency table between workclass and marital.status:

Code

data_clean = data[['workclass', 'marital.status']].query("(workclass not in ['?', 'Never-worked', 'Without-pay'])  and (`marital.status` not in ['?', 'Married-AF-spouse'])")
contingency_table = pd.crosstab(data_clean['workclass'], data_clean['marital.status'])
contingency_table

marital.status	Divorced	Married-civ-spouse	Married-spouse-absent	Never-married	Separated	Widowed
workclass
Federal-gov	168	471	11	245	26	36
Local-gov	369	1023	22	530	63	86
Private	3119	9732	302	8186	754	588
Self-emp-inc	100	837	5	125	20	29
Self-emp-not-inc	292	1680	31	409	53	74
State-gov	210	588	17	413	43	26

Recall Categorical \(\chi^2\) Test

\(\chi^2\) Test of Independence

Code

from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square val: {chi2:.4f}.")
print(f"P-value: {p:.4f}.")

Chi-square val: 1091.7196.
P-value: 0.0000.

As the p-value is very small, we reject the null hypothesis of independence.
Next, we shall explore the association between these two categorical variables.
A bivariate plot may be used:

Code

import plotly.express as px
df_freq = data_clean.groupby(['workclass', 'marital.status']).size().reset_index(name='Freq')
# Percentage of each marital status within its workclass (transform keeps the original row alignment)
df_freq['percent'] = df_freq['Freq'] / df_freq.groupby('workclass')['Freq'].transform('sum') * 100
fig = px.bar(
    df_freq, 
    x="workclass", 
    y="percent",
    color="marital.status",
    barmode='stack',
    text= df_freq['percent'].round(2).astype(str) + '%')
fig.update_layout(width=510, height=430, 
    title='Stacked Barplot of `Workclass` vs `Marital Status`',
    xaxis_title='Workclass',
    yaxis_title='Percentage (%)',
    legend_title='Marital Status')
fig.update_traces(textposition='inside')
fig.show()

Recall Categorical \(\chi^2\) Test

\(\chi^2\) Test of Independence

However, the bivariate plot may not clearly show the association between the categories of the two variables.
It only shows the distribution of marital.status within each workclass category.
CA gives a more complete picture of the association between the two variables.

\(\chi^2\) distance in detail

Contingency Table of Proportions

Let \(\color{blue}{X}\) and \(\color{red}{Y}\) be two categorical variables with \(\color{blue}{I}\) and \(\color{red}{J}\) categories, respectively. The contingency table of proportions \(P = (p_{ij})\) is given by:

\[ P=\begin{array}{|c|cccc|c|} \color{blue}{X} \setminus \color{red}{Y} & \color{red}{Y_1} & \color{red}{Y_2} & \dots & \color{red}{Y_J} & \color{blue}{\text{r}} \\ \hline \color{blue}{X_1} & p_{11} & p_{12} & \dots & p_{1J} & \color{blue}{r_1} \\ \color{blue}{X_2} & p_{21} & p_{22} & \dots & p_{2J} & \color{blue}{r_2} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \color{blue}{X_I} & p_{I1} & p_{I2} & \dots & p_{IJ} & \color{blue}{r_I} \\ \hline \color{red}{\text{c}} & \color{red}{c_1} & \color{red}{c_2} & \dots & \color{red}{c_J} & 1 \end{array} \]

\(N=\sum_{i,j}n_{ij}\): total observations.
\(p_{ij} = n_{ij}/N\): proportion of type \((i,j)\).
\(\color{blue}{r_i} = \sum_{j=1}^{J} p_{ij}\): row marginal prop.
\(\color{red}{c_j} = \sum_{i=1}^{I} p_{ij}\): col. marginal prop.

Row profile for the row \(\color{blue}{i}\) is given by: \(\left(\frac{p_{\color{blue}{i}1}}{\color{blue}{r_i}},\frac{p_{\color{blue}{i}2}}{\color{blue}{r_i}}, \ldots, \frac{p_{\color{blue}{i}J}}{\color{blue}{r_i}}\right)\).
Column profile for the column \(\color{red}{j}\) is given by: \(\left(\frac{p_{1\color{red}{j}}}{\color{red}{c_j}}, \frac{p_{2\color{red}{j}}}{\color{red}{c_j}}, \ldots, \frac{p_{I\color{red}{j}}}{\color{red}{c_j}}\right)\).
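
The definitions above can be sketched in a few lines of pandas. The counts are copied from the workclass vs marital.status contingency table shown earlier; the names `counts`, `row_profiles` and `col_profiles` are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rows = ["Federal-gov", "Local-gov", "Private",
        "Self-emp-inc", "Self-emp-not-inc", "State-gov"]
cols = ["Divorced", "Married-civ-spouse", "Married-spouse-absent",
        "Never-married", "Separated", "Widowed"]
# Counts copied from the contingency table printed earlier.
counts = pd.DataFrame(
    [[168, 471, 11, 245, 26, 36],
     [369, 1023, 22, 530, 63, 86],
     [3119, 9732, 302, 8186, 754, 588],
     [100, 837, 5, 125, 20, 29],
     [292, 1680, 31, 409, 53, 74],
     [210, 588, 17, 413, 43, 26]],
    index=rows, columns=cols)

N = counts.to_numpy().sum()          # total observations
P = counts / N                       # proportions p_ij
r = P.sum(axis=1)                    # row marginals r_i
c = P.sum(axis=0)                    # column marginals c_j
row_profiles = P.div(r, axis=0)      # p_ij / r_i, each row sums to 1
col_profiles = P.div(c, axis=1)      # p_ij / c_j, each column sums to 1

print(row_profiles.round(3))
print(c.round(3))
```

Note that the column marginal `c` is the weighted mean of the row profiles, and the row marginal `r` is the weighted mean of the column profiles.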

\(\chi^2\) distance in detail

Row profile: Prop. of cols within the row

Row profile for the row \(\color{blue}{i}\) is given by: \(\left(\frac{p_{\color{blue}{i}1}}{\color{blue}{r_i}},\frac{p_{\color{blue}{i}2}}{\color{blue}{r_i}}, \ldots, \frac{p_{\color{blue}{i}J}}{\color{blue}{r_i}}\right)\).

marital.status	Divorced	Married-civ-spouse	Married-spouse-absent	Never-married	Separated	Widowed	Sum
workclass
Federal-gov	0.176	0.492	0.011	0.256	0.027	0.038	1.0
Local-gov	0.176	0.489	0.011	0.253	0.030	0.041	1.0
Private	0.138	0.429	0.013	0.361	0.033	0.026	1.0
Self-emp-inc	0.090	0.750	0.004	0.112	0.018	0.026	1.0
Self-emp-not-inc	0.115	0.662	0.012	0.161	0.021	0.029	1.0
State-gov	0.162	0.453	0.013	0.318	0.033	0.020	1.0

These row profiles are often compared to their mean row profile:
\(\color{red}{\text{c}=}\) [0.139, 0.467, 0.013, 0.323, 0.031, 0.027].

\(\chi^2\) distance in detail

Column profile: Prop. of rows within the col.

Column profile for the column \(\color{red}{j}\) is given by: \(\left(\frac{p_{1\color{red}{j}}}{\color{red}{c_j}}, \frac{p_{2\color{red}{j}}}{\color{red}{c_j}}, \ldots, \frac{p_{I\color{red}{j}}}{\color{red}{c_j}}\right)\).

marital.status	Divorced	Married-civ-spouse	Married-spouse-absent	Never-married	Separated	Widowed
workclass
Federal-gov	0.039	0.033	0.028	0.025	0.027	0.043
Local-gov	0.087	0.071	0.057	0.053	0.066	0.103
Private	0.733	0.679	0.778	0.826	0.786	0.701
Self-emp-inc	0.023	0.058	0.013	0.013	0.021	0.035
Self-emp-not-inc	0.069	0.117	0.080	0.041	0.055	0.088
State-gov	0.049	0.041	0.044	0.042	0.045	0.031
Sum	1.000	1.000	1.000	1.000	1.000	1.000

These column profiles are often compared to their mean column profile: \(\color{blue}{\text{r}=}\) [0.031, 0.068, 0.739, 0.036, 0.083, 0.042].

\(\chi^2\) distance in detail

Definition of \(\chi^2\) distance

The \(\chi^2\) distance between two row profiles \(\color{blue}{i}\) and \(\color{blue}{i'}\): \[d^2(\color{blue}{i}, \color{blue}{i'}) = \sum_{j=1}^{J} \frac{1}{\color{red}{c_j}}\left(\frac{p_{\color{blue}{i}j}}{\color{blue}{r_i}} - \frac{p_{\color{blue}{i'}j}}{\color{blue}{r_{i'}}}\right)^2 \]
The \(\chi^2\) distance between two column profiles \(\color{red}{j}\) and \(\color{red}{j'}\): \[d^2(\color{red}{j}, \color{red}{j'}) = \sum_{i=1}^{I} \frac{1}{\color{blue}{r_i}}\left(\frac{p_{i\color{red}{j}}}{\color{red}{c_j}} - \frac{p_{i\color{red}{j'}}}{\color{red}{c_{j'}}}\right)^2\]

\(\chi^2\) distance in detail

Definition of \(\chi^2\) distance

The \(\chi^2\) distance between two row or column profiles measures how different they are using weighted differences.
Intuition: a small difference in a rare category is more significant than the same difference in a common category.
For example, consider the \(\chi^2\) distances between our workclass row profiles:

Code

import numpy as np
from scipy.spatial.distance import pdist, squareform
import pandas as pd

def chi_squared_distance_matrix(contingency_table):
    """
    Computes the pairwise Chi-squared distance matrix for the row profiles
    of a contingency table.
    
    Args:
        contingency_table (np.array or pd.DataFrame): The raw count data.
        
    Returns:
        pd.DataFrame: A square symmetric distance matrix.
    """
    # Ensure input is a numpy array
    data = np.array(contingency_table, dtype=float)
    
    # 1. Compute Row Profiles (p_ij / r_i = n_ij / n_i.)
    # Sum across columns to get row totals
    row_sums = data.sum(axis=1)[:, np.newaxis]
    # Avoid division by zero: rows with a zero total get an all-zero profile
    row_profiles = np.divide(data, row_sums, out=np.zeros_like(data), where=row_sums != 0)
    
    # 2. Compute Column Masses (c_j = n_.j / N)
    grand_total = data.sum()
    col_sums = data.sum(axis=0)
    col_masses = col_sums / grand_total
    
    # 3. Transform data for Weighted Euclidean Calculation
    # We weigh the coordinates by 1 / sqrt(col_masses)
    # This turns the Chi-sq distance into a standard Euclidean distance problem
    weights = 1.0 / np.sqrt(col_masses)
    transformed_data = row_profiles * weights
    
    # 4. Compute Pairwise Euclidean Distances on transformed data
    # pdist computes the upper triangle of the distance matrix
    distances = pdist(transformed_data, metric='euclidean')
    
    # Convert to a square matrix
    dist_matrix = squareform(distances)
    
    # Return as DataFrame for better readability if input was DataFrame
    if isinstance(contingency_table, pd.DataFrame):
        return pd.DataFrame(
            dist_matrix, 
            index=contingency_table.index, 
            columns=contingency_table.index
        )
    
    return dist_matrix

# Calculate Distances
dist_mat = chi_squared_distance_matrix(contingency_table)
dist_mat.round(2)

workclass	Federal-gov	Local-gov	Private	Self-emp-inc	Self-emp-not-inc	State-gov
workclass
Federal-gov	0.00	0.03	0.24	0.52	0.35	0.17
Local-gov	0.03	0.00	0.25	0.53	0.35	0.19
Private	0.24	0.25	0.00	0.67	0.50	0.11
Self-emp-inc	0.52	0.53	0.67	0.00	0.18	0.61
Self-emp-not-inc	0.35	0.35	0.50	0.18	0.00	0.44
State-gov	0.17	0.19	0.11	0.61	0.44	0.00

Here is the \(\chi^2\) distance matrix of marital.status column profiles:

marital.status	Divorced	Married-civ-spouse	Married-spouse-absent	Never-married	Separated	Widowed
marital.status
Divorced	0.00	0.27	0.16	0.22	0.13	0.15
Married-civ-spouse	0.27	0.00	0.30	0.40	0.32	0.22
Married-spouse-absent	0.16	0.30	0.00	0.15	0.10	0.25
Never-married	0.22	0.40	0.15	0.00	0.09	0.33
Separated	0.13	0.32	0.10	0.09	0.00	0.25
Widowed	0.15	0.22	0.25	0.33	0.25	0.00
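
The code behind this column-profile matrix is not shown in the slides. One plausible way to obtain it, sketched here with plain NumPy, is to rescale each column profile by \(1/\sqrt{r_i}\) and take ordinary Euclidean distances. The variable names are illustrative; the counts are copied from the contingency table.

```python
import numpy as np

# Rows: workclass, columns: marital.status (counts from the contingency table).
counts = np.array([
    [168, 471, 11, 245, 26, 36],
    [369, 1023, 22, 530, 63, 86],
    [3119, 9732, 302, 8186, 754, 588],
    [100, 837, 5, 125, 20, 29],
    [292, 1680, 31, 409, 53, 74],
    [210, 588, 17, 413, 43, 26],
], dtype=float)

P = counts / counts.sum()
r = P.sum(axis=1)                      # row masses r_i
c = P.sum(axis=0)                      # column masses c_j

col_profiles = (P / c).T               # shape (J, I): one column profile per row
scaled = col_profiles / np.sqrt(r)     # weight coordinate i by 1 / sqrt(r_i)

# Pairwise Euclidean distances on the rescaled profiles = chi-squared distances.
diff = scaled[:, None, :] - scaled[None, :, :]
col_dist = np.sqrt((diff ** 2).sum(axis=-1))
print(col_dist.round(2))
```

The same result should follow from calling the `chi_squared_distance_matrix` helper defined earlier on the transposed table.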

Interpretation:
- Row profiles: Federal-gov and Local-gov are the most similar row profiles (distance 0.03), i.e., the marital status distributions of people working in these two workclasses are quite similar.
- Column profiles: Never-married and Separated are the most similar column profiles (distance 0.09), i.e., the workclass distributions of people with these two marital statuses are quite similar.

\(\chi^2\) distance in detail

Link between \(\chi^2\) distance & \(\chi^2\) statistic

\(\chi^2\) statistic: \[\chi^2=\sum_{i,j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}=\sum_{i,j}\frac{(Np_{\color{blue}{i}\color{red}{j}}-N\color{blue}{r_i}\color{red}{c_j})^2}{N\color{blue}{r_i}\color{red}{c_j}}=N\sum_{i,j}\frac{(p_{\color{blue}{i}\color{red}{j}}-\color{blue}{r_i}\color{red}{c_j})^2}{\color{blue}{r_i}\color{red}{c_j}}\]
Total inertia of row profiles (weighted squared distances to the mean row profile \(\color{red}{\text{c}}\)): \[\color{blue}{I_X}=\sum_{i=1}^{I}\color{blue}{r_i}d^2(\color{blue}{i}, \color{red}{\text{c}})=\sum_{i=1}^{I}\color{blue}{r_i}\sum_{j=1}^{J}\frac{1}{\color{red}{c_j}}\left(\frac{p_{\color{blue}{i}j}}{\color{blue}{r_i}} - \color{red}{c_j}\right)^2=\sum_{i,j}\frac{(p_{\color{blue}{i}\color{red}{j}}-\color{blue}{r_i}\color{red}{c_j})^2}{\color{blue}{r_i}\color{red}{c_j}}\]
Total inertia of column profiles (weighted squared distances to the mean column profile \(\color{blue}{\text{r}}\)): \[\color{red}{I_Y}=\sum_{j=1}^{J}\color{red}{c_j}d^2(\color{red}{j}, \color{blue}{\text{r}})=\sum_{j=1}^{J}\color{red}{c_j}\sum_{i=1}^{I}\frac{1}{\color{blue}{r_i}}\left(\frac{p_{i\color{red}{j}}}{\color{red}{c_j}} - \color{blue}{r_i}\right)^2=\sum_{i,j}\frac{(p_{\color{blue}{i}\color{red}{j}}-\color{blue}{r_i}\color{red}{c_j})^2}{\color{blue}{r_i}\color{red}{c_j}}\]

Therefore: \(\color{blue}{I_X}=\color{red}{I_Y}=\chi^2/N\) (total inertia of the cloud of profiles).
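
A quick numerical check of this identity on our contingency table, written as a sketch with plain NumPy (the \(\chi^2\) statistic is recomputed by hand here rather than taken from `chi2_contingency`; variable names are illustrative):

```python
import numpy as np

counts = np.array([
    [168, 471, 11, 245, 26, 36],
    [369, 1023, 22, 530, 63, 86],
    [3119, 9732, 302, 8186, 754, 588],
    [100, 837, 5, 125, 20, 29],
    [292, 1680, 31, 409, 53, 74],
    [210, 588, 17, 413, 43, 26],
], dtype=float)

N = counts.sum()
P = counts / N
r = P.sum(axis=1)
c = P.sum(axis=0)

# Chi-squared statistic from observed and expected counts.
expected = N * np.outer(r, c)
chi2 = ((counts - expected) ** 2 / expected).sum()

# Total inertia of row profiles: sum_i r_i * d^2(row profile i, c).
row_profiles = P / r[:, None]
inertia_rows = (r * ((row_profiles - c) ** 2 / c).sum(axis=1)).sum()

# Total inertia of column profiles: sum_j c_j * d^2(column profile j, r).
col_profiles = P / c
inertia_cols = (c * ((col_profiles - r[:, None]) ** 2 / r[:, None]).sum(axis=0)).sum()

print(chi2 / N, inertia_rows, inertia_cols)
```

All three printed numbers should agree up to floating-point error.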

\(\chi^2\) distance in detail

Geometric interpretation of \(\chi^2\) distance

The geometry defined by the \(\chi^2\) distance is not ordinary Euclidean space: each coordinate is rescaled by the inverse of its marginal weight (\(1/\color{red}{c_j}\) for row profiles, \(1/\color{blue}{r_i}\) for column profiles).
Objective: find a low-dimensional representation that best preserves the \(\chi^2\) distances between profiles.
This leads to the Generalized SVD solution, which we cover in the next section.

Generalized SVD
& Correspondence Analysis (CA)

GSVD & Correspondence Analysis (CA)

GSVD Recap

Recall that the GSVD of a rank-\(q\) matrix \(A\in\mathbb{R}^{n\times d}\) with row weights \(\color{blue}{W_r}\in\mathbb{R}^{n\times n}\) and column weights \(\color{red}{W_c}\in\mathbb{R}^{d\times d}\) is given by: \[\underbrace{A}_{n\times d} =\underbrace{U}_{n\times q}\overbrace{\Sigma}^{q\times q}\underbrace{V^T}_{q\times d}\text{ satisfying: }U^T \underbrace{\color{blue}{W_r}}_{\text{diagonal }n\times n} U=V^T \underbrace{\color{red}{W_c}}_{\text{diagonal }d\times d} V=I_q.\]
The GSVD of \((A, \color{blue}{W_r}, \color{red}{W_c})\) can be computed via the SVD of the matrix: \[B = \color{blue}{W_r}^{1/2} A \color{red}{W_c}^{1/2}\]
If the SVD gives \(B=\tilde{U}\Delta\tilde{V}^T\), with left and right singular vectors \(\tilde{U}\) and \(\tilde{V}\) respectively, then the generalized left and right singular vectors and the singular values of \(A\) are given by: \[U = \color{blue}{W_r}^{-1/2} \tilde{U}, \quad V = \color{red}{W_c}^{-1/2} \tilde{V},\quad \Sigma = \Delta.\]
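
The recipe above can be sketched as a small NumPy function. This is an illustration under simplifying assumptions: the weights are diagonal and passed as 1-D vectors, `A` has full rank, and the helper name `gsvd_diag` is hypothetical rather than a library function.

```python
import numpy as np

def gsvd_diag(A, w_r, w_c):
    """GSVD of A under diagonal weights W_r = diag(w_r) and W_c = diag(w_c)."""
    sr, sc = np.sqrt(w_r), np.sqrt(w_c)
    # B = W_r^{1/2} A W_c^{1/2}
    B = sr[:, None] * A * sc[None, :]
    U_tilde, delta, Vt_tilde = np.linalg.svd(B, full_matrices=False)
    # U = W_r^{-1/2} U~,  V = W_c^{-1/2} V~,  Sigma = Delta
    U = U_tilde / sr[:, None]
    V = Vt_tilde.T / sc[:, None]
    return U, delta, V

# Random full-rank example with positive weights.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
w_r = rng.uniform(0.5, 2.0, size=5)
w_c = rng.uniform(0.5, 2.0, size=3)

U, sigma, V = gsvd_diag(A, w_r, w_c)
print(np.allclose(U @ np.diag(sigma) @ V.T, A))            # A = U Sigma V^T
print(np.allclose(U.T @ (w_r[:, None] * U), np.eye(3)))    # U^T W_r U = I
print(np.allclose(V.T @ (w_c[:, None] * V), np.eye(3)))    # V^T W_c V = I
```

Working with weight vectors instead of full diagonal matrices avoids building \(n\times n\) and \(d\times d\) matrices.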

GSVD & Correspondence Analysis (CA)

CA via GSVD

Given the contingency table \(P=(p_{ij})\) of relative frequencies, with row and column marginals \(\color{blue}{\text{r}}\) and \(\color{red}{\text{c}}\), we define:
- Row and column weights (metrics): \(\color{blue}{D_r^{-1}} = \text{diag}(\color{blue}{\text{r}})^{-1}\) & \(\color{red}{D_c^{-1}} = \text{diag}(\color{red}{\text{c}})^{-1}\).
- Residual matrix: \(S = P - \color{blue}{\text{r}} \color{red}{\text{c}}^T\).
The CA of \(P\) is obtained by the GSVD of the triplet \((S, \color{blue}{D_r^{-1}}, \color{red}{D_c^{-1}})\): \[S = U \Sigma V^T \text{ satisfying: } U^T \color{blue}{D_r^{-1}} U =V^T \color{red}{D_c^{-1}} V = I.\]
Principal coordinates: projection of profiles onto axes, scaled by the singular values, given by: \[F = \underbrace{\color{blue}{D_r^{-1}}}_{\text{Weight normalization}} \overbrace{U}^{\text{Rotation}}\underbrace{\Sigma}_{\text{Scale}}, \quad G = \color{red}{D_c^{-1}} V \Sigma\]

Code

import numpy as np
from scipy.linalg import svd
# Grand total N, then residual matrix S = P - r c^T
N = contingency_table.values.sum()
S = contingency_table.values / N - np.outer(contingency_table.sum(axis=1).values / N, contingency_table.sum(axis=0).values / N)
pd.DataFrame(S.round(3), 
      index=contingency_table.index, 
      columns=contingency_table.columns)

marital.status	Divorced	Married-civ-spouse	Married-spouse-absent	Never-married	Separated	Widowed
workclass
Federal-gov	0.001	0.001	-0.0	-0.002	-0.000	0.000
Local-gov	0.003	0.001	-0.0	-0.005	-0.000	0.001
Private	-0.001	-0.028	0.0	0.028	0.001	-0.001
Self-emp-inc	-0.002	0.010	-0.0	-0.008	-0.000	-0.000
Self-emp-not-inc	-0.002	0.016	-0.0	-0.013	-0.001	0.000
State-gov	0.001	-0.001	0.0	-0.000	0.000	-0.000

	Dim 1	Dim 2
workclass
Federal-gov	0.103	-0.136
Local-gov	0.103	-0.151
Private	-0.087	0.015
Self-emp-inc	0.574	0.084
Self-emp-not-inc	0.411	0.021
State-gov	-0.019	-0.038

	Dim 1	Dim 2
marital.status
Divorced	-0.055	-0.110
Married-civ-spouse	0.180	0.019
Married-spouse-absent	-0.108	0.012
Never-married	-0.224	0.031
Separated	-0.139	-0.006
Widowed	0.053	-0.135
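
The code that produces the two coordinate tables above is not shown. A minimal sketch that follows the GSVD recipe with weights \(D_r^{-1}\) and \(D_c^{-1}\) is given below; variable names are illustrative, and the sign of each axis is arbitrary in an SVD, so columns may come out flipped relative to the tables.

```python
import numpy as np

counts = np.array([
    [168, 471, 11, 245, 26, 36],
    [369, 1023, 22, 530, 63, 86],
    [3119, 9732, 302, 8186, 754, 588],
    [100, 837, 5, 125, 20, 29],
    [292, 1680, 31, 409, 53, 74],
    [210, 588, 17, 413, 43, 26],
], dtype=float)

P = counts / counts.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
S = P - np.outer(r, c)                       # residual matrix

# B = D_r^{-1/2} S D_c^{-1/2}, then an ordinary SVD.
B = S / np.sqrt(np.outer(r, c))
U_tilde, Sigma, Vt_tilde = np.linalg.svd(B, full_matrices=False)

# Generalized singular vectors: U = D_r^{1/2} U~, V = D_c^{1/2} V~.
U = np.sqrt(r)[:, None] * U_tilde
V = np.sqrt(c)[:, None] * Vt_tilde.T

# Principal coordinates: F = D_r^{-1} U Sigma, G = D_c^{-1} V Sigma.
F = (U / r[:, None]) * Sigma
G = (V / c[:, None]) * Sigma

print(np.round(F[:, :2], 3))   # row (workclass) coordinates, Dim 1 and Dim 2
print(np.round(G[:, :2], 3))   # column (marital.status) coordinates
```

The squared singular values `Sigma**2` sum to the total inertia \(\chi^2/N\).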

GSVD & Correspondence Analysis (CA)

CA via GSVD

Variance explained by axis \(i\) is given by \(\sigma_i^2=\Sigma_{ii}^2\).
Explained variance ratio of axis \(i\) is given by: \(\sigma_i^2/\sum_{j=1}^q\sigma_j^2\).

Code

pd.DataFrame({
    'Explained variance' : Sigma**2,
    'Explained variance ratio' : Sigma**2 / (Sigma ** 2).sum(),
    'Cumulative explained variance ratio' : (Sigma**2).cumsum() / (Sigma**2).sum()
}, index=range(1, len(Sigma)+1)).round(3).T

	1	2	3	4	5	6
Explained variance	0.033	0.003	0.000	0.000	0.0	0.0
Explained variance ratio	0.916	0.075	0.006	0.003	0.0	0.0
Cumulative explained variance ratio	0.916	0.991	0.997	1.000	1.0	1.0

Code

import plotly.express as px
import plotly.graph_objects as go

# 1. Plot F2 (Base Layer): Blue Diamonds
fig = px.scatter(F2, x='Dim 1', y='Dim 2', text=F2.index)
fig.update_traces(
    textposition='top center',  # Shifts text up
    marker=dict(
        color='blue', 
        symbol='diamond', 
        size=12                 # Increases point size
    )
)

# 2. Add G2 (Second Layer): Red Round (Circle)
trace_g2 = px.scatter(G2, x='Dim 1', y='Dim 2', text=G2.index)
trace_g2.update_traces(
    textposition='top center',  # Shifts text up
    marker=dict(
        color='red', 
        symbol='circle', 
        size=12                 # Increases point size
    )
)
fig.add_trace(trace_g2.data[0])

# 3. Add Lines: Gray and Dashed
fig.add_trace(go.Scatter(
    x=[-0.4, 0.7], y=[0,0], mode='lines', 
    line=dict(color='gray', dash='dash', width=1)
))

fig.add_trace(go.Scatter(
    x=[0,0], y=[min(G2['Dim 2'])*1.5, 0.15], mode='lines', 
    line=dict(color='gray', dash='dash', width=1)
))

fig.update_layout(width=600, height=250, showlegend=False)
fig.show()

Near origin: State-gov & Private are close to the origin, meaning that their row profiles are similar to the mean row profile \(\color{red}{\text{c}}\).
Axis 1: among the row profiles, separates the self-employed workclasses (Self-emp-inc, Self-emp-not-inc) from the other workclasses; among the column profiles, separates Married-civ-spouse from the unmarried or separated statuses (Never-married, Separated, Married-spouse-absent).
Axis 2: among the row profiles, separates government workclasses (Federal-gov, Local-gov) from non-government ones; among the column profiles, separates Divorced and Widowed from the other marital statuses (roughly, life stage in marriage).
Clusters: similar profiles are grouped together, and associated row and column categories also appear close to each other.
Deviation from origin: indicates rare categories or profiles that differ from the mean profile.

GSVD & Correspondence Analysis (CA)

Symmetric Biplot

This symmetric biplot summarizes:

Similar profiles of the same type: similar rows or similar columns.
Association between profiles of different types (row vs column).
Rare profiles, or profiles very different from the mean profile (far from the origin).
Profiles similar to the mean profile (near the origin).
The explained variances indicate how much of the variation in the data each axis captures.

All \(\sigma_j\in[0,1]\), and \(\sigma_j=1\) only for perfect association, meaning that a group of row categories is associated exclusively with a certain group of column categories and vice versa.
Connection between principal row and column coordinates (each row coordinate \(\color{blue}{f_{ik}}\) is, up to the factor \(1/\sigma_k\), a weighted average of the column coordinates \(\color{red}{g_{jk}}\), and vice versa): \[\begin{align*} \color{blue}{f_{ik}} &= \frac{1}{\sigma_k} \sum_{j} \left( \frac{p_{ij}}{\color{blue}{r_i}} \right) \color{red}{g_{jk}}, \qquad \color{red}{g_{jk}}= \frac{1}{\sigma_k} \sum_{i} \left( \frac{p_{ij}}{\color{red}{c_j}} \right) \color{blue}{f_{ik}} \end{align*}\]
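
These relations between row and column coordinates can be checked numerically. The sketch below recomputes the principal coordinates from the contingency table and verifies both relations on the first two axes only, since they require \(\sigma_k > 0\); variable names are illustrative.

```python
import numpy as np

counts = np.array([
    [168, 471, 11, 245, 26, 36],
    [369, 1023, 22, 530, 63, 86],
    [3119, 9732, 302, 8186, 754, 588],
    [100, 837, 5, 125, 20, 29],
    [292, 1680, 31, 409, 53, 74],
    [210, 588, 17, 413, 43, 26],
], dtype=float)

P = counts / counts.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
S = P - np.outer(r, c)

# SVD of the standardized residuals, then principal coordinates F and G.
U_tilde, Sigma, Vt_tilde = np.linalg.svd(S / np.sqrt(np.outer(r, c)),
                                         full_matrices=False)
F = (U_tilde / np.sqrt(r)[:, None]) * Sigma      # F = D_r^{-1/2} U~ Sigma
G = (Vt_tilde.T / np.sqrt(c)[:, None]) * Sigma   # G = D_c^{-1/2} V~ Sigma

K = 2  # check the first two axes only
# f_ik = (1 / sigma_k) * sum_j (p_ij / r_i) * g_jk
F_from_G = (P / r[:, None]) @ G[:, :K] / Sigma[:K]
# g_jk = (1 / sigma_k) * sum_i (p_ij / c_j) * f_ik
G_from_F = (P / c).T @ F[:, :K] / Sigma[:K]

print(np.allclose(F_from_G, F[:, :K]))
print(np.allclose(G_from_F, G[:, :K]))
```

Because both relations use the same \(\sigma_k\), they hold regardless of the sign chosen for each axis.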

🥳 Yeahhh! Party Time…. 🥂

Any questions?
