A1: To analyze the relationship between two categorical variables, we often use a contingency table.
Contingency Table
To obtain a two-way contingency table (Hair vs. Eye) from a three-way one, we sum over Gender: \[n(\text{H, E})=\sum_{x\in\{M,F\}}n(\text{H, E, G}=x).\]
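This marginalization is just a sum along one axis. A minimal sketch with NumPy, using a made-up \(2\times 2\times 2\) count array (the actual data has more categories per variable):

```python
import numpy as np

# Hypothetical 3-way count array n(Hair, Eye, Gender):
# axis 0 = Hair, axis 1 = Eye, axis 2 = Gender (M, F)
n3 = np.array([[[10, 12], [3, 5]],
               [[7,  6], [9, 8]]])

# n(H, E) = sum over x in {M, F} of n(H, E, G = x)
n2 = n3.sum(axis=2)
print(n2)
# [[22  8]
#  [13 17]]
```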
Does it look like there’s a relationship between Hair and Eye color?
Visualization:
Code
```python
import matplotlib.pyplot as plt

# df_hair_eye: the Hair-by-Eye frequency table built on the previous slide
df_hair_eye.plot(kind="bar", figsize=(5, 3), width=0.8)
plt.title("Grouped Bar Chart of Hair vs. Eye Color")
plt.xlabel("Hair Color")
plt.ylabel("Frequency")
plt.legend(title="Eye Color")
plt.show()
```
How about now?
Goal: Establish statistical evidence to determine whether an association exists between two categorical variables.
Joint relative frequency (JRF):\(\color{green}{p_{i,j}=n_{i,j}/N}\) with \(\color{green}{N=\sum_{i,j}n_{i,j}}\).
Marginal relative frequency (MRF):
Row RF:\(\color{blue}{p_{i,.}=n_{i,.}/N}\).
Column RF:\(\color{red}{p_{.,j}=n_{.,j}/N}\).
Some key questions:
By assuming Hair\(\!\perp\!\!\!\perp\)Eye, compute:
\(\mathbb{P}(\text{Hair='Black', Eye='Brown'})\)?
\(\mathbb{P}(\text{Hair='Red', Eye='Blue'})\)?
\(\mathbb{P}(\text{Hair='Blond', Eye='Hazel'})\)?
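Under the independence assumption, each joint probability is the product of the two marginal relative frequencies. A sketch with NumPy, using the observed Hair/Eye counts that appear later in these slides (\(N=592\)):

```python
import numpy as np

# Observed counts: rows = Hair (Black, Blond, Brown, Red),
# columns = Eye (Blue, Brown, Green, Hazel)
O = np.array([[20,  68,  5, 15],
              [94,   7, 16, 10],
              [84, 119, 29, 54],
              [17,  26, 14, 14]])
N = O.sum()                    # 592

p_row = O.sum(axis=1) / N      # marginal RF of Hair
p_col = O.sum(axis=0) / N      # marginal RF of Eye

# Under independence, P(Hair=i, Eye=j) = p_row[i] * p_col[j]
print(round(p_row[0] * p_col[1], 4))   # P(Black, Brown) ~ 0.0678
print(round(p_row[3] * p_col[0], 4))   # P(Red, Blue)    ~ 0.0436
print(round(p_row[1] * p_col[3], 4))   # P(Blond, Hazel) ~ 0.0337
```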
Expected vs Observed Contingency Table
If hair and eye are truly independent, then we expect the following contingency table: \[E_{ij}=p_{i,.}p_{.,j}N\]
| Hair \ Eye | Blue   | Brown  | Green | Hazel |
|------------|--------|--------|-------|-------|
| Black      | 35.52  | 41.44  | 11.84 | 17.76 |
| Blond      | 47.36  | 47.36  | 11.84 | 17.76 |
| Brown      | 100.64 | 106.56 | 29.60 | 47.36 |
| Red        | 23.68  | 23.68  | 5.92  | 11.84 |
This is called the ‘expected contingency table’.
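The expected table can be computed in one line as an outer product of the marginals. A sketch with NumPy; note that the exact values can differ slightly from the table above depending on how the marginals are rounded before multiplying:

```python
import numpy as np

# Observed counts: rows = Hair (Black, Blond, Brown, Red),
# columns = Eye (Blue, Brown, Green, Hazel)
O = np.array([[20,  68,  5, 15],
              [94,   7, 16, 10],
              [84, 119, 29, 54],
              [17,  26, 14, 14]])
N = O.sum()

# E_ij = p_{i,.} * p_{.,j} * N = (row sum * column sum) / N
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / N
print(E.round(2)[0])   # Black-hair row: [39.22 40.14 11.68 16.97]
```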
In reality, we observed:
| Hair \ Eye | Blue | Brown | Green | Hazel |
|------------|------|-------|-------|-------|
| Black      | 20   | 68    | 5     | 15    |
| Blond      | 94   | 7     | 16    | 10    |
| Brown      | 84   | 119   | 29    | 54    |
| Red        | 17   | 26    | 14    | 14    |
This is called the ‘observed contingency table’.
Q2: If Hair and Eye are independent, what can we say about these two tables?
A2: They should be very similar!
We need a tool to measure the similarity between these two tables.
\(\chi^2\)-Distance
Measurement of similarity
Expected table:
| Hair \ Eye | Blue   | Brown  | Green | Hazel |
|------------|--------|--------|-------|-------|
| Black      | 35.52  | 41.44  | 11.84 | 17.76 |
| Blond      | 47.36  | 47.36  | 11.84 | 17.76 |
| Brown      | 100.64 | 106.56 | 29.60 | 47.36 |
| Red        | 23.68  | 23.68  | 5.92  | 11.84 |
Observed table:
| Hair \ Eye | Blue | Brown | Green | Hazel |
|------------|------|-------|-------|-------|
| Black      | 20   | 68    | 5     | 15    |
| Blond      | 94   | 7     | 16    | 10    |
| Brown      | 84   | 119   | 29    | 54    |
| Red        | 17   | 26    | 14    | 14    |
The \(\chi^2\)-distance between an expected and an observed table is defined by \[\chi^2=\sum_{i,j}\frac{(\color{blue}{E_{ij}}-\color{red}{O_{ij}})^2}{\color{blue}{E_{ij}}},\] where \(\color{blue}{E_{ij}}\) and \(\color{red}{O_{ij}}\) are the expected and observed counts in cell \((i,j)\).
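Applying this formula to the expected and observed tables above takes one line with NumPy (a sketch):

```python
import numpy as np

# Observed counts: rows = Hair (Black, Blond, Brown, Red),
# columns = Eye (Blue, Brown, Green, Hazel)
O = np.array([[20,  68,  5, 15],
              [94,   7, 16, 10],
              [84, 119, 29, 54],
              [17,  26, 14, 14]])

# Expected counts under independence, computed from the exact marginals
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

# chi^2-distance: sum over cells of (E - O)^2 / E
chi2_stat = ((E - O) ** 2 / E).sum()
print(round(chi2_stat, 2))   # ~ 138.29
```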
Just like the \(\color{blue}{z}\)-value and the \(\color{red}{t}\)-value, the \(\chi^2\)-distance follows a known distribution if the two variables are independent.
This leads to the following hypothesis testing.
\(\chi^2\)-test
Let \(X\) and \(Y\) be two categorical variables (Hair and Eye, for example).
Goal: Testing \(H_0: X\!\perp\!\!\!\perp Y\) against \(H_1: X\not\!\perp\!\!\!\perp Y\) at significance level \(\alpha\).
Assumptions:
Observations are independent
Expected frequencies satisfy \(E_{ij}\geq 1\), and no more than \(20\%\) of them are smaller than \(5\).
Key result: If \(H_0\) is true, then \(\chi^2\sim\color{blue}{\chi^2((I-1)(J-1))}\), the chi-squared distribution with DF \(=(I-1)(J-1)\).
We reject \(H_0\) if \(\chi^2\geq \color{red}{c_{\alpha}}\), where \(\color{red}{c_{\alpha}}\) is such that \(\mathbb{P}(\color{blue}{\chi^2((I-1)(J-1))}\geq \color{red}{c_{\alpha}})=\color{red}{\alpha}\).
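For the Hair/Eye example, \(I=J=4\), so \(\text{df}=9\); the critical value at \(\alpha=0.05\) can be read off with SciPy instead of a printed table (a sketch):

```python
from scipy.stats import chi2

alpha = 0.05
df = (4 - 1) * (4 - 1)           # I = J = 4 hair/eye categories
c_alpha = chi2.ppf(1 - alpha, df)  # quantile such that P(chi2(df) >= c) = alpha
print(round(c_alpha, 2))          # 16.92
```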
Code
```python
import numpy as np
import plotly.graph_objects as go
from scipy.stats import chi2

# Create x-axis values (domain for chi-squared)
x = np.linspace(0, 50, 100)

# Degrees of freedom to display
dfs = [1, 5, 10, 15, 20, 30]

# Create figure
fig = go.Figure()

# Add trace for each degree of freedom
for df in dfs:
    y = chi2.pdf(x, df)
    # Add line to plot
    fig.add_trace(
        go.Scatter(x=x, y=y, mode='lines', name=f'df = {df}', line=dict(width=2))
    )

# Update layout
fig.update_layout(
    title=r'$\chi^2(\text{df})$',
    xaxis_title='x',
    yaxis_title='Density',
    legend_title='DFs',
    template='plotly_white',
    hovermode='closest',
    width=240,
    height=180,
)
fig.show()
```
\(\chi^2\)-Test
Test of independence
Summary of the \(\chi^2\)-test
Compute the \(\chi^2\)-distance value \(\chi^2\) between the observed and expected tables.
For a given significance level \(\color{red}{\alpha}\) (for example, \(0.05\)), compute \(\color{blue}{\text{df}=(I-1)(J-1)}\), then:
Critical value method: Look at the table of \(\color{blue}{\chi^2(\text{df})}\) to find \(\color{red}{c_{\alpha}}\). Decision: Reject \(H_0\) if \(\chi^2\geq \color{red}{c_{\alpha}}\).
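The whole procedure is available in a single call via `scipy.stats.chi2_contingency`, shown here on the observed Hair/Eye counts (a sketch):

```python
import numpy as np
from scipy.stats import chi2_contingency, chi2

# Observed counts: rows = Hair (Black, Blond, Brown, Red),
# columns = Eye (Blue, Brown, Green, Hazel)
O = np.array([[20,  68,  5, 15],
              [94,   7, 16, 10],
              [84, 119, 29, 54],
              [17,  26, 14, 14]])

# Pearson chi-squared test of independence
stat, p_value, df, expected = chi2_contingency(O)
print(round(stat, 2), df)        # chi^2 ~ 138.29, df = 9

# Critical value method at alpha = 0.05
alpha = 0.05
c_alpha = chi2.ppf(1 - alpha, df)
print(stat >= c_alpha)           # True: reject H0, Hair and Eye are associated
```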