Bivariate & Multivariate Analysis

Exploratory Data Analysis & Unsupervised Learning

Lecturer: Dr. HAS Sothea

Outline

0. Motivation
1. Bivariate Visualization
2. Multivariate Visualization
3. Time series data
4. Animated charts/graphs

🌐 https://clauswilke.com/dataviz/

0. Motivation

`Gapminder dataset` (1704, 5)

This dataset captures the world’s evolution from \(1952\) to \(2007\).
Now, take a look at the data from year \(2007\).

Code

from gapminder import gapminder
import numpy as np
data2007 = gapminder[gapminder.year == 2007]  # filter to year 2007
data2007.iloc[:5,:].drop(columns=['year']).style.hide()

country	continent	lifeExp	pop	gdpPercap
Afghanistan	Asia	43.828000	31889923	974.580338
Albania	Europe	76.423000	3600523	5937.029526
Algeria	Africa	72.301000	33333216	6223.367465
Angola	Africa	42.731000	12420476	4797.231267
Argentina	Americas	75.320000	40301927	12779.379640

0. Motivation

Now, take a look at the data from year \(2007\) (summary).

Code

quan_vars = ["pop", "lifeExp", "gdpPercap"]
data2007[quan_vars].describe().drop(index=["count", "25%", "75%"])  # keep mean, std, min, median, max

	pop	lifeExp	gdpPercap
mean	4.402122e+07	67.007423	11680.071820
std	1.476214e+08	12.073021	12859.937337
min	1.995790e+05	39.613000	277.551859
50%	1.051753e+07	71.935500	6124.371108
max	1.318683e+09	82.603000	49357.190170

0. Motivation

Now, take a look at the data from year \(2007\) (visualization).

Code

from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=1, cols=3, 
              subplot_titles=("Boxplot of pop", "Violinplot of lifeExp", "Histogram of GDP Per Capita"))
fig.add_trace(go.Box(y=data2007['pop'], name="pop"), col=1, row=1)
fig.add_trace(go.Violin(y=data2007['lifeExp'], name="lifeExp"), row=1, col=2)
fig.add_trace(go.Histogram(x=data2007['gdpPercap'],
              name="gdpPercap"), row=1, col=3)
fig.update_layout(height=280, width=1000)
fig.update_yaxes(type="log", row=1, col=1)
fig.update_xaxes(title="Population", row=1, col=1)
fig.update_xaxes(title="Life Expectancy", row=1, col=2)
fig.update_xaxes(title="GDP Per Capita", row=1, col=3)
fig.show()

Hans Rosling’s 200 Countries, 200 Years in 4 Minutes.

0. Motivation

Objective

The main objectives of this chapter:

What clues/indicators tell us how columns are related?
What graphs can help us see that relationship?

1. Bivariate Analysis

1.1. Quan. vs Quan.

Indicator: Covariance

Suppose \(X=[\text{x}_1,\text{x}_2,\dots,\text{x}_n]\) is a quan. column.
Mean/average: \(\overline{\text{x}}=\displaystyle\frac{1}{n}\sum_{i=1}^n\text{x}_i\).
Variance: \(V(X)=\displaystyle\frac{1}{n-1}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}})^2.\)
Standard deviation: \(s=\sqrt{V(X)}\).
If \(Y=[\text{y}_1,\text{y}_2,\dots,\text{y}_n]\) is another quan. column, the covariance between \(X\) and \(Y\) is defined by

\[\text{Cov}(X,Y)=\frac{1}{n-1}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}})(\text{y}_i-\overline{\text{y}}).\]

Code

import plotly.express as px
df = px.data.tips()
fig = px.scatter(df, y="tip", x="total_bill", color="sex", hover_data=df.columns)
fig.update_layout(width=380, height=300, title="Tips vs total bill & gender")
fig.show()

	tip	total_bill	sex
0	1.01	16.99	Female
1	1.66	10.34	Male
2	3.50	21.01	Male

1.1. Quan. vs Quan.

Indicator: Covariance

Covariance between quan. columns \(X\) and \(Y\):

\[\text{Cov}(X,Y)=\frac{1}{n-1}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}})(\text{y}_i-\overline{\text{y}}).\]

It indicates the direction of the (linear) relationship between the two variables.
- Positive value \(\approx\) the two variables tend to change in the same direction.
- Negative value \(\approx\) they tend to change in opposite directions.

The magnitude of the covariance is hard to interpret because it depends on the scales (units) of \(X\) and \(Y\).
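
A minimal pure-Python sketch of the covariance formula, using only the three tips rows displayed on this slide rather than the full dataset. The helper name `cov` and the dollars-to-cents conversion are illustrative assumptions. It suggests why the raw value is hard to read: rescaling one column rescales the covariance by the same factor.

```python
def cov(xs, ys):
    """Sample covariance with the 1/(n-1) convention used in the slides."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# First three rows of the tips table shown above (illustration only)
tip = [1.01, 1.66, 3.50]
total_bill = [16.99, 10.34, 21.01]

cov_dollars = cov(tip, total_bill)
# Same bills expressed in cents: the covariance is multiplied by 100
cov_cents = cov(tip, [100 * b for b in total_bill])

print(cov_dollars > 0)          # positive: tip and bill move together here
print(cov_cents / cov_dollars)  # close to 100
```

With pandas, `df['tip'].cov(df['total_bill'])` uses the same \(n-1\) convention.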

Code

fig.update_layout(title=f"Tips vs total bill (Cov = {float(np.cov(df['tip'].values, df['total_bill'].values).round(2)[0,1])}) & gender ")
fig.show()

	tip	total_bill	sex
0	1.01	16.99	Female
1	1.66	10.34	Male
2	3.50	21.01	Male

1.1. Quan. vs Quan.

Indicator: Pearson Correlation Coefficient

Correlation between two quan. columns \(X\) and \(Y\): \[r=r_{X,Y}=\frac{\sum_{i=1}^n(\text{x}_{i}-\overline{\text{x}})(\text{y}_{i}-\overline{\text{y}})}{\sqrt{\left(\sum_{i=1}^n(\text{x}_{i}-\overline{\text{x}})^2\right)\left(\sum_{i=1}^n(\text{y}_{i}-\overline{\text{y}})^2\right)}}=\frac{\text{Cov}(X,Y)}{s_Xs_Y}.\]
It quantifies the strength and direction of the linear relationship between the two variables.
- For any pair \(X\) and \(Y\), one has \(-1\leq r\leq 1\).
- If \(r\approx 1\), then \(X\) and \(Y\) are positively correlated (change in the same direction).
- If \(r\approx -1\), then \(X\) and \(Y\) are negatively correlated (change in opposite directions).
- If \(r\approx 0\), then \(X\) and \(Y\) are uncorrelated (no linear pattern/trend/tendency).
It helps identify informative/useful inputs for building models.
It also helps identify redundant (strongly correlated) inputs.
For the tip example: \(\text{Corr}(\text{tip}, \text{bill})=\) 0.676.
Correlation does not imply causation; it only indicates a tendency, not a cause-and-effect link [👉 For more, read here].
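
The properties listed above can be checked with a small sketch. The function `pearson` and the toy lists are made up for illustration; they are not part of the lecture code. The V-shaped example is a reminder that \(r\approx 0\) only rules out a linear trend.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: centered cross-products over the product of norms."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2 * v + 1 for v in x]        # exact increasing line
y_down = [-3 * v for v in x]         # exact decreasing line
y_vshape = [2.0, 1.0, 0.0, 1.0, 2.0] # clear pattern, but no linear trend

r_up = pearson(x, y_up)
r_down = pearson(x, y_down)
r_vshape = pearson(x, y_vshape)
# Changing units (multiplying x by 100) leaves r unchanged
r_rescaled = pearson([100 * v for v in x], y_up)

print(r_up, r_down, r_vshape, r_rescaled)
```

Unlike the covariance, the coefficient is unit-free, which is what makes values comparable across pairs of columns.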

1.1. Quan. vs Quan.

Indicator: Pearson Correlation Coefficient

Source: https://en.wikipedia.org/wiki/Correlation.

1.1. Quan. vs Quan.

Indicator: Pearson Correlation Matrix

To detect linear relationships among several quan. columns, the Pearson correlation matrix is a common tool.
Consider the Pearson corr. matrix on Gapminder in 2007:

cor = data2007[["gdpPercap", "lifeExp", "pop"]].corr()
cor.style.background_gradient(cmap='Accent')

	gdpPercap	lifeExp	pop
gdpPercap	1.000000	0.678662	-0.055676
lifeExp	0.678662	1.000000	0.047553
pop	-0.055676	0.047553	1.000000

LifeExp and GDP appear to be positively related.
Population appears to be uncorrelated with the other two.
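
One way to read such a matrix programmatically is to flag the pairs whose absolute correlation exceeds a cut-off. The sketch below copies the three off-diagonal values from the matrix above; the threshold of 0.6 is a hypothetical choice, not a rule from the lecture.

```python
# Pearson correlations copied from the 2007 Gapminder matrix above
cor = {
    ("gdpPercap", "lifeExp"): 0.678662,
    ("gdpPercap", "pop"): -0.055676,
    ("lifeExp", "pop"): 0.047553,
}

THRESHOLD = 0.6  # hypothetical cut-off, to be tuned per application

strong_pairs = [pair for pair, r in cor.items() if abs(r) >= THRESHOLD]
weak_pairs = [pair for pair, r in cor.items() if abs(r) < THRESHOLD]

print(strong_pairs)
```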

1.1. Quan./Ordi vs Quan./Ordi

Indicator: Spearman’s Rank Correlation

Pearson correlation is sensitive to outliers (we will see that in the lab) and cannot capture non-linear relationships between quantitative columns.
It is not suitable for ordinal data (dislike-like ratings, for example).
Spearman’s rank correlation does not rely on the values of the observations but on their ‘ranks’, so it also works with ordinal data.
Let \(R[\text{x}_i]\) and \(R[\text{y}_i]\) be the ranks of observations \(\text{x}_i\) and \(\text{y}_i\) in their own lists. Spearman’s rank correlation coefficient between \(X\) and \(Y\) is defined as the Pearson correlation of the ranks of \(X\) and \(Y\), i.e., \[\rho_{X,Y}=r_{R[X],R[Y]}=\frac{\text{Cov}(R[X],R[Y])}{s_{R[X]}s_{R[Y]}}=1-\frac{6\sum_{i=1}^nd_i^2}{n(n^2-1)},\] where \(d_i=R[\text{x}_i]-R[\text{y}_i]\) is the difference in ranks of the \(i\)-th observation (the last equality holds when there are no tied ranks).

1.1. Quan./Ordi vs Quan./Ordi

Indicator: Spearman’s Rank Correlation

Example: \(X=[3,2,1,5,9]\) and \(Y=[8,5,0,23,80]\)

Pearson

\(\overline{x}=\frac{3+2+1+5+9}{5}=\color{red}{4}\) and \(\overline{y}=\frac{8+5+0+23+80}{5}=\color{blue}{23.2}\).
\(s_X=\sqrt{\frac{(3-\color{red}{4})^2+(2-\color{red}{4})^2+\dots+(9-\color{red}{4})^2}{n-1}}=3.1623\) and \(s_Y=\sqrt{\frac{(8-\color{blue}{23.2})^2+(5-\color{blue}{23.2})^2+\dots+(80-\color{blue}{23.2})^2}{n-1}}=32.889\).
\(r_{X,Y}=\frac{\frac{1}{n-1}\left[(3-\color{red}{4})(8-\color{blue}{23.2})+\dots+(9-\color{red}{4})(80-\color{blue}{23.2})\right]}{(3.1623)(32.889)}=0.9735.\)

Spearman

\(R[X]=[3,2,1,4,5]\) and \(R[Y]=[3,2,1,4,5]\).
All \(d_i=0\) therefore \(\rho_{X,Y}=1-\frac{6\sum_{i}d_i^2}{5(5^2-1)}=1\).
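
The worked example can be reproduced with a short sketch. It takes \(X=[3,2,1,5,9]\), the values used in the mean computation above, and assumes no tied values; the helpers `pearson` and `ranks` are illustrative names, not lecture code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank 1 = smallest value; assumes no ties (true for this example)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

X = [3, 2, 1, 5, 9]
Y = [8, 5, 0, 23, 80]

r = pearson(X, Y)
rx, ry = ranks(X), ranks(Y)
rho_from_ranks = pearson(rx, ry)                # Pearson applied to the ranks
n = len(X)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho_shortcut = 1 - 6 * d2 / (n * (n ** 2 - 1))  # valid only without ties

print(round(r, 4), rx, ry, rho_from_ranks, rho_shortcut)
```

With ties, libraries typically assign average ranks and the shortcut formula no longer applies; pandas' `.corr("spearman")` handles that case.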

1.1. Quan./Ordi vs Quan./Ordi

Indicator: Spearman’s Rank Correlation

Pearson corr. on Gapminder in 2007:

cor = data2007[["gdpPercap", "lifeExp", "pop"]].corr()
cor.style.background_gradient(cmap='Accent')

	gdpPercap	lifeExp	pop
gdpPercap	1.000000	0.678662	-0.055676
lifeExp	0.678662	1.000000	0.047553
pop	-0.055676	0.047553	1.000000

Spearman corr. on Gapminder in 2007:

cor = data2007[["gdpPercap", "lifeExp", "pop"]]\
    .corr("spearman")
cor.style.background_gradient(cmap='Accent')

	gdpPercap	lifeExp	pop
gdpPercap	1.000000	0.856590	-0.064588
lifeExp	0.856590	1.000000	0.003355
pop	-0.064588	0.003355	1.000000

1.1. Quan./Ordi vs Quan./Ordi

Indicator: Spearman’s Rank Correlation

Consider the change in both coefficients.

1.1. Quan. vs Quan.

Indicator: Spearman (Summary)

Aspect	`Pearson`	`Spearman`
Type	Parametric	Non-parametric
Measure	Linear relationship	Monotonic relationship
Data Type	Continuous	Ordinal or continuous
Outliers	Sensitive to outliers	Less sensitive to outliers
Range	\([-1,1]\)	\([-1,1]\)
Interpretation	\(\approx\) 1: strong positive linear relationship; \(\approx\) -1: strong negative linear relationship; \(\approx\) 0: no linear relationship	\(\approx\) 1: strong positive rank correlation; \(\approx\) -1: strong negative rank correlation; \(\approx\) 0: no rank (monotonic) correlation
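
The "Outliers" row of the table can be illustrated with synthetic data: a perfectly linear series in which the last value is replaced by an extreme one. The data and helper names below are made up for this sketch, and the ranking helper assumes no ties.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank 1 = smallest value; assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

x = list(range(1, 11))
y_clean = [float(v) for v in x]       # perfectly linear
y_outlier = y_clean[:-1] + [1000.0]   # one extreme value, order unchanged

pearson_clean = pearson(x, y_clean)
pearson_out = pearson(x, y_outlier)    # drops to about 0.53
spearman_out = spearman(x, y_outlier)  # ranks are unchanged, so still 1

print(pearson_clean, round(pearson_out, 2), spearman_out)
```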

1.1. Quan. vs Quan.

Visualization: Scatterplot

A scatterplot shows trends/relations between quantitative pairs.
Let’s visualize the relation of gdpPercap with lifeExp & pop.

Code

import plotly.graph_objects as go
import plotly.express as px
data2007 = gapminder.query("year == 2007")
fig1 = px.scatter(data2007, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig1.update_traces(marker=dict(size=10))
fig1.update_layout(height=350, width=500, title="The world GDP vs LifeExp in 2007")
fig1.show()

Code

data2007 = gapminder.query("year == 2007")
fig2 = px.scatter(data2007, x="gdpPercap", y="pop", hover_name="country", opacity=0.7)
fig2.update_traces(marker=dict(size=10))
fig2.update_layout(height=350, width=500, title="The world GDP vs Population in 2007")
fig2.show()

1.1. Quan. vs Quan.

Visualization: Scatterplot

A scatterplot shows trends/relations between quantitative pairs.
Let’s visualize the relation of gdpPercap with lifeExp & pop.

Code

fig1.update_layout(title="The world (log) GDP vs LifeExp in 2007")  # fig1 plots lifeExp, not population
fig1.update_xaxes(type="log")
fig1.show()

Code

fig2.update_layout(title="The world GDP vs (log) Population 2007 ")
fig2.update_yaxes(type="log")
fig2.show()

1.1. Quan. vs Quan.

Visualization: Scatterplot

GDP vs Life Expectancy:
- General trend: countries with high GDP per capita tend to be healthier (higher life expectancy).
- There are also a few countries with GDP per capita well above average whose life expectancy is still low.

GDP vs Population:
- General trend: no clear trend!
- GDP per capita does not appear to be related to a country’s population size.
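
A possible reason the log-scaled GDP axis makes the trend easier to read is that a logarithmic relationship becomes a straight line after the transform. The sketch below uses synthetic numbers (not Gapminder values) in which life expectancy is assumed to grow with \(\log_{10}\) of GDP, and compares Pearson's \(r\) before and after the transform.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data: a hypothetical, exactly logarithmic relationship
gdp = [100, 300, 1000, 3000, 10000, 30000]
life = [40 + 10 * math.log10(g) for g in gdp]

r_raw = pearson(gdp, life)                           # curved on a linear axis
r_log = pearson([math.log10(g) for g in gdp], life)  # straight on a log axis

print(round(r_raw, 2), round(r_log, 2))
```

This is also consistent with Spearman's coefficient on Gapminder being higher than Pearson's for gdpPercap and lifeExp, since ranks are unaffected by a monotonic transform.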

1.1. Quan. vs Quan.

Visualization: Scatterplot

A proper visualization should

🎯 Have a clear purpose – deliver the main message effectively.
👥 Fit the audience – match their knowledge and needs.
📊 Use the right chart – accurately represent the data.
🎨 Be clean and consistent – simple design, meaningful colors, clear labels & title…
🧠 Ensure clarity and honesty – no distortion or clutter.
⚙️ Allow interaction (if needed) – make exploration easy.
♿ Be accessible – readable for everyone, including color-blind users.

1.2. Quan. vs Qual.

Indicator: \(\eta^2\) coefficient

Let \(\color{red}{G}\) and \(\color{blue}{X}\) be a qualitative and a quan. column, respectively (by abuse of notation, \(\color{red}{G}\) also denotes the number of categories).
Between Sum of Squares (BSS): \[\color{blue}{\text{BSS}}=\sum_{g=1}^{\color{red}{G}}n_g(\overline{\text{x}}_g-\color{blue}{\overline{\text{x}}})^2,\] where
- \(\overline{\text{x}}_g\) is the mean of \(\color{blue}{X}\) over a category \(g\) of \(\color{red}{G}\).
- \(\color{blue}{\overline{\text{x}}}\) is the global mean of \(\color{blue}{X}\).
- \(n_g\) is the number of observations within category \(g\) of \(\color{red}{G}\).

It measures how far apart the group means of \(\color{blue}{X}\) are across the categories of \(\color{red}{G}\).
Total Sum of Squares (TSS): \[\color{red}{\text{TSS}}=\sum_{i=1}^n(\text{x}_i-\color{blue}{\overline{\text{x}}})^2.\]

1.2. Quan. vs Qual.

Indicator: \(\eta^2\) coefficient

\(\eta\)-squared coefficient: \(\eta^2=\frac{\color{blue}{\text{BSS}}}{\color{red}{\text{TSS}}}.\)
One always has \(0\leq \eta^2\leq 1\):
- \(\eta^2\approx 0\): no relation between group \(\color{red}{G}\) and quan. column \(\color{blue}{X}\) (group means are similar).
- \(\eta^2\approx 1\): strong relation (group means differ strongly).

The \(\eta^2\) coefficient measures the proportion of variation in the quan. variable that is explained by the categories of the qual. variable.
\(\eta^2\) is commonly used as an effect size in Analysis of Variance (ANOVA), which studies how a quan. variable differs across the classes of a qual. variable.
Just like the Pearson coefficient, it’s sensitive to outliers!

Example: \(\eta^2\)-coefficients of lifeExp and gdpPercap with respect to continent:

	LifeExp	GDP
Continent	0.635	0.424
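
The definition \(\eta^2=\text{BSS}/\text{TSS}\) can be sketched in a few lines. The function name `eta_squared` and the toy data are assumptions made for illustration; the code that produced the table above is not shown in the slides.

```python
from collections import defaultdict

def eta_squared(values, groups):
    """eta^2 = BSS / TSS for a quan. list `values` and a qual. list `groups`."""
    n = len(values)
    grand_mean = sum(values) / n
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    # Between sum of squares: group sizes times squared mean deviations
    bss = sum(len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
              for vs in by_group.values())
    # Total sum of squares around the global mean
    tss = sum((v - grand_mean) ** 2 for v in values)
    return bss / tss

# Toy data (made up for illustration, not Gapminder values)
values = [1, 2, 3, 7, 8, 9]
groups_separated = ["A", "A", "A", "B", "B", "B"]
groups_mixed = ["A", "B", "A", "B", "A", "B"]

eta2_separated = eta_squared(values, groups_separated)  # 54 / 58
eta2_mixed = eta_squared(values, groups_mixed)          # 6 / 58

print(round(eta2_separated, 3), round(eta2_mixed, 3))
```

Well-separated groups give a value close to 1, while groups with similar means give a value close to 0.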

1.2. Quan. vs Qual.

Visualization: Conditional box/dot plots

To see how the values of a quan. column differ across the groups of a qual. column, we can use:
- Conditional boxplots: boxplots within different groups.
- Conditional histograms/densities are also possible but less common.

Code

sorted_data = data2007.sort_values(by='lifeExp')
fig = px.box(data2007, x="continent", y="lifeExp", points='all', hover_name="country", color="continent", category_orders={'continent': sorted_data['continent']})
fig.update_layout(title="Life Expectancy on each continent in 2007", height=300, width=450)
fig.show()

🔑 A clear separation of the quan. values between groups indicates a connection between the two columns.
Example:
- The clear separation of lifeExp across continents suggests that there is a relation between the two.
- continent is useful for predicting / explaining lifeExp.

1.2. Quan. vs Qual.

Visualization: Conditional histogram

To see how the values of a quan. column differ across the groups of a qual. column, we can use:
- Conditional boxplots: boxplots within different groups.
- Conditional histograms/densities are also possible but less common.

Code

import plotly.figure_factory as ff
group_labels = list(data2007.continent.unique())
hist_data = [data2007.lifeExp[data2007.continent == x] for x in group_labels]
colors = ["#f1ab17", "#f13c26", "#9be155", "#4ab8dc", "#d567f3"]
fig3 = ff.create_distplot(hist_data, group_labels, colors=colors,
                         bin_size=1.5, show_rug=False)
fig3.update_layout(title="Life Expectancy on each continent in 2007", height=300, width=450)
fig3.show()

🔑 A clear separation of the quan. values between groups indicates a connection between the two columns.
Example:
- The clear separation of lifeExp across continents suggests that there is a relation between the two.
- continent is useful for predicting / explaining lifeExp.

1.2. Quan. vs Qual.

Visualization: Conditional Box/Violin Plot

How about GDP on each continent?

Code

sorted_data = data2007.sort_values(by='gdpPercap')
hist_data_gdp = [data2007.gdpPercap[data2007.continent == x] for x in group_labels]
colors = ["#f1ab17", "#f13c26", "#9be155", "#4ab8dc", "#d567f3"]
fig4 = px.box(data2007, 
    x="continent", y="gdpPercap", hover_name="country", points='all',
    color="continent", category_orders={'continent': sorted_data['continent']})
fig4.update_layout(title="GDP per Capita on each continent in 2007", height=350, width=500)
fig4.show()

Code

sorted_data = data2007.sort_values(by='gdpPercap')
hist_data_gdp = [data2007.gdpPercap[data2007.continent == x] for x in group_labels]
colors = ["#f1ab17", "#f13c26", "#9be155", "#4ab8dc", "#d567f3"]
fig_ = px.violin(data2007, x="continent", y="gdpPercap", 
    hover_name="country", color="continent", points='all',
    category_orders={'continent': sorted_data['continent']})
fig_.update_layout(title="GDP per Capita on each continent in 2007", height=350, width=500)
fig_.show()

Example:
- The separation of GDP per capita between continents is not as clear as for life expectancy, yet one can still see differences.
- continent is useful for predicting / explaining gdpPercap, though not as strongly/clearly as for lifeExp.
- In such cases, one column is a good predictor of the other.

1.2. Quan. vs Qual.

Indicator & Visualization

Code

import plotly.express as px
fig = px.box(data2007, 
    x='continent', 
    y='lifeExp', 
    points='all',
    category_orders={'continent': sorted_data['continent']}, 
    color='continent')
fig.update_layout(
    height=500, 
    width=500,
    title=f"LifeExp per continent with eta-squared {np.round(df_eta['LifeExp'].values[0], 3)}")
fig.show()

Code

fig4 = px.box(data2007, 
    x='continent', 
    y='gdpPercap', 
    points='all',
    category_orders={'continent': sorted_data['continent']}, 
    color='continent')
fig4.update_layout(
    height=500, 
    width=500,
    title=f"GDP per continent with eta-squared {np.round(df_eta['GDP'].values[0], 3)}")
fig4.show()

1.3. Qual. vs Qual.

Visualization: Mosaic plot

We don’t have many qualitative columns, so
I grouped GDP per capita into three categories by its terciles:
- If GDP \(\leq\) the \(33.33\%\) quantile 👉 Developing
- elif GDP \(\leq\) the \(66.66\%\) quantile 👉 Emerging
- else: 👉 Developed.

Example:
- GDP per capita seems to be related to continent, and this remains true for categorical GDP.
- In Asia, the three types of economic conditions are well balanced, whereas the majority of African countries are developing, followed by emerging economies.

Code

from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots(figsize=(7, 5))
plt.rcParams.update({'font.size': 15})
def prop(key):
    if "Asia" in key:
        return {'color': '#51cb4b'}
    if "Africa" in key:
        return {'color': '#e35441'}
    if "Americas" in key:
        return {'color': '#41b4e3'}
    if "Europe" in key:
        return {'color': '#dda63e'}
    if "Oceania" in key:
        return {'color': '#b374df'}

data2007 = data2007.copy()  # avoid pandas' SettingWithCopyWarning on the filtered slice
data2007['gdp_category'] = pd.qcut(data2007['gdpPercap'], q=3, labels=['Developing', 'Emerging', 'Developed'])  # terciles
mosaic(data2007.sort_values('continent'), ['continent','gdp_category'], 
    gap=0.01, properties = prop, 
    label_rotation=30, ax=ax)
plt.title("Mosaicplot of categorical GDP vs Continent")
plt.show()

1.3. Qual. vs Qual.

Visualization: Stacked/Grouped barplots

We don’t have many qualitative columns, so
I grouped GDP per capita into three categories by its terciles:
- If GDP \(\leq\) the \(33.33\%\) quantile 👉 Developing
- elif GDP \(\leq\) the \(66.66\%\) quantile 👉 Emerging
- else: 👉 Developed.

Example:
- GDP per capita seems to be related to continent, and this remains true for categorical GDP.
- In Asia, the three types of economic conditions are well balanced, whereas the majority of African countries are developing, followed by emerging economies.

Code

df_freq = data2007.groupby(['continent','gdp_category'], observed=False).size().reset_index(name='Freq')
# Share of each GDP category within its continent (in %)
df_freq['Percent'] = df_freq.groupby('continent')['Freq'].transform(lambda x: x / x.sum() * 100)
fig = px.bar(
    df_freq, 
    x="continent", 
    y="Percent",
    color="gdp_category",
    barmode='stack',
    text= df_freq['Percent'].round(2).astype(str) + '%')
fig.update_layout(width=510, height=470, 
    title='Stacked Barplot of Categorical GDP vs Continent')
fig.show()

Code

fig = px.bar(
    df_freq, 
    x="continent", 
    y="Freq",
    color="gdp_category",
    barmode='group',
    text= df_freq['Percent'].round(2).astype(str) + '%')
fig.update_layout(width=510, height=470, title='Grouped Barplot of Categorical GDP vs Continent')
fig.show()

1.3. Qual. vs Qual.

Indicator: \(\chi^2\) test

The contingency table of two nominal variables \(\color{blue}{X}\) and \(\color{red}{Y}\) is defined by:

\(\color{blue}{X}\) - \(\color{red}{Y}\)	\(\color{red}{Y_1}\)	\(\dots\)	\(\color{red}{Y_J}\)	Total
\(\color{blue}{X_1}\)	\(n_{1,1}\)	\(\dots\)	\(n_{1,J}\)	\(\color{blue}{n_{1,.}}\)
\(\vdots\)	\(\vdots\)	\(\ddots\)	\(\vdots\)	\(\color{blue}{\vdots}\)
\(\color{blue}{X_I}\)	\(n_{I,1}\)	\(\dots\)	\(n_{I,J}\)	\(\color{blue}{n_{I,.}}\)
Total	\(\color{red}{n_{.,1}}\)	\(\color{red}{\dots}\)	\(\color{red}{n_{.,J}}\)	\(N\)

where \(n_{i,j}\) is the freq of observations being in class \(\color{blue}{X_i}\) of variable \(\color{blue}{X}\) and \(\color{red}{Y_j}\) of variable \(\color{red}{Y}\).

Obs. rel. freq: \(O_{i,j}=n_{i,j}/N\).
Exp. rel. freq (under independence): \(E_{i,j}=\color{blue}{O_{i,.}}\times \color{red}{O_{.,j}}=\frac{\color{blue}{n_{i,.}}\color{red}{n_{.,j}}}{N^2}\).
We would like to check whether \(\color{blue}{X}\) & \(\color{red}{Y}\) are independent.

\(\chi^2\) hypothesis test: \[\begin{cases}H_0&: \color{blue}{X}\text{ is independent of }\color{red}{Y}\\ H_1&: \color{blue}{X}\text{ is NOT independent of }\color{red}{Y}.\end{cases}\]
🔑 What can we say about the observed and expected relative frequencies \(O_{i,j}\) & \(E_{i,j}\)?
🔑 If \(H_0\) is true, then \(\color{green}{O_{i,j}\approx E_{i,j}}\).
\(\chi^2\)-distance: \(\chi^2(\color{blue}{X},\color{red}{Y})=N\sum_{i,j}\frac{(O_{i,j}-E_{i,j})^2}{E_{i,j}}\) (the factor \(N\) converts relative frequencies back to counts).

Under the assumption that \(H_0\) is true, \(\chi^2(\color{blue}{X},\color{red}{Y})\) approximately follows \(\chi^2(\text{df})\) with \(\text{df}=(\color{blue}{I}-1)(\color{red}{J}-1)\).

In practice, compute
- Degrees of freedom \(\text{df}\) and \(\chi^2(\color{blue}{X},\color{red}{Y})\).
- \(\text{p-val}=\mathbb{P}(\chi^2(\text{df}) \geq \chi^2(\color{blue}{X},\color{red}{Y}))\).
- Small \(\text{p-val}\Rightarrow\) reject \(H_0\).

1.3. Qual. vs Qual.

Indicator: \(\chi^2\) test (Summary)

Observed two-way relative freq. table:

\(\color{blue}{X}\) - \(\color{red}{Y}\)	\(\color{red}{Y_1}\)	\(\dots\)	\(\color{red}{Y_J}\)	Total
\(\color{blue}{X_1}\)	\(O_{1,1}\)	\(\dots\)	\(O_{1,J}\)	\(\color{blue}{O_{1,.}}\)
\(\vdots\)	\(\vdots\)	\(\ddots\)	\(\vdots\)	\(\color{blue}{\vdots}\)
\(\color{blue}{X_I}\)	\(O_{I,1}\)	\(\dots\)	\(O_{I,J}\)	\(\color{blue}{O_{I,.}}\)
Total	\(\color{red}{O_{.,1}}\)	\(\color{red}{\dots}\)	\(\color{red}{O_{.,J}}\)	\(1\)

Expected two-way relative freq. table:

\(\color{blue}{X}\) - \(\color{red}{Y}\)	\(\color{red}{Y_1}\)	\(\dots\)	\(\color{red}{Y_J}\)
\(\color{blue}{X_1}\)	\(\color{red}{O_{.,1}}\color{blue}{O_{1,.}}\)	\(\dots\)	\(\color{red}{O_{.,J}}\color{blue}{O_{1,.}}\)
\(\vdots\)	\(\vdots\)	\(\ddots\)	\(\vdots\)
\(\color{blue}{X_I}\)	\(\color{red}{O_{.,1}}\color{blue}{O_{I,.}}\)	\(\dots\)	\(\color{red}{O_{.,J}}\color{blue}{O_{I,.}}\)

\(\chi^2(\color{blue}{X},\color{red}{Y})\) measures how different they are!

In practice, compute
- \(\chi^2(\color{blue}{X},\color{red}{Y})\) and the degrees of freedom \(\text{df}\).
- \(\color{blue}{\text{p-val}}=\mathbb{P}(\chi^2(\text{df}) \geq \chi^2(\color{blue}{X},\color{red}{Y}))\).
- Small \(\color{blue}{\text{p-val}}\ (<0.05)\Rightarrow\) reject \(H_0\).
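
The p-value step can be sketched without any library in the special case of an even number of degrees of freedom, where the tail probability has the closed form \(e^{-x/2}\sum_{k<\text{df}/2}\frac{(x/2)^k}{k!}\). The function name `chi2_sf_even` is hypothetical, and the shortcut does not cover odd \(\text{df}\).

```python
import math

def chi2_sf_even(x, dof):
    """P(chi2(dof) >= x) for an EVEN number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this shortcut only covers positive even dof")
    half = x / 2
    term, total = 1.0, 0.0
    for k in range(dof // 2):
        total += term           # add (x/2)^k / k!
        term *= half / (k + 1)  # next term of the series
    return math.exp(-half) * total

p_borderline = chi2_sf_even(5.991, 2)  # close to the usual 0.05 level
p_tail = chi2_sf_even(60.0, 18)        # far in the right tail: tiny p-value

print(round(p_borderline, 3), p_tail)
```

For arbitrary degrees of freedom, `scipy.stats.chi2.sf(x, df)` returns the same tail probability.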

Code

import plotly.graph_objects as go
from scipy.stats import chi2

# Create x-axis values (domain for chi-squared)
x = np.linspace(0, 50, 100)

# Degrees of freedom to display
dfs = [1, 5, 10, 15, 20, 30]

# Create figure
fig = go.Figure()

# Add trace for each degree of freedom
for dof in dfs:  # 'dof' avoids shadowing a DataFrame named df
    y = chi2.pdf(x, dof)
    
    # Add line to plot
    fig.add_trace(
        go.Scatter(
            x=x,
            y=y,
            mode='lines',
            name=f'df = {dof}',
            line=dict(width=2)
        )
    )

# Update layout
fig.update_layout(
    title=r'$\chi^2(\text{df})$',
    xaxis_title='x',
    yaxis_title='Density',
    legend_title='DFs',
    template='plotly_white',
    hovermode='closest',
    width=490,
    height=270
)

fig.show()

1.3. Qual. vs Qual.

Indicator: \(\chi^2\) test (Example)

🇫🇷 Covid survey in 2021 of 85 855 people.

	F	SE	SEP	MEP	E	MW	NW
3	263	1085	7704	4346	4052	1551	2319
2	616	2265	10088	8889	11264	4713	3892
1	222	950	2864	3238	4517	1910	1145
0	126	562	1105	1420	2596	1531	622

0-3: from least to most in favor of the vaccine
F : Farmer
SE : Self-employed and entrepreneurs
SEP : Senior executive professionals
MEP : Middle executive professionals
E : Employees
MW : Manual workers
NW : Never worked and others.

Code

df_vaccine = df.melt(value_name='Count', var_name='Job')
df_vaccine['Vaccine Favor'] = list(range(4)) * 7
df_vaccine['Vaccine Favor'] = df_vaccine[['Vaccine Favor']].astype(object)
temp = df_vaccine.groupby('Job')['Count'].apply(lambda x: x/x.sum() * 100).reset_index(level=1)
temp.columns = ['level_1', 'Percent']
df_vaccine = pd.merge(temp, df_vaccine, left_on='level_1', right_index=True, how='left')
fig = px.bar(
    df_vaccine, 
    x="Job", 
    y="Percent",
    color="Vaccine Favor",
    barmode='stack',
    text= df_vaccine['Percent'].round(2).astype(str) + '%')
fig.update_layout(width=510, height=310, 
    title='Stacked Barplot of Job vs Vaccine Favor')
fig.show()

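We have \(\text{df}=(4-1)(7-1)=\) 18 and \(\chi^2(\color{blue}{X},\color{red}{Y})\approx\) 3298, which lies far in the right tail of \(\chi^2(18)\): the p-value is essentially 0, so we reject \(H_0\).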
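
The statistic and the degrees of freedom can be recomputed from the counts in the survey table. This is a plain-Python sketch working directly on counts (equivalent to multiplying the relative-frequency distance by \(N\)); the variable names are illustrative.

```python
# Counts copied from the survey table above
# rows: vaccine favor 3, 2, 1, 0; columns: F, SE, SEP, MEP, E, MW, NW
counts = [
    [263, 1085, 7704, 4346, 4052, 1551, 2319],
    [616, 2265, 10088, 8889, 11264, 4713, 3892],
    [222, 950, 2864, 3238, 4517, 1910, 1145],
    [126, 562, 1105, 1420, 2596, 1531, 622],
]

N = sum(sum(row) for row in counts)
row_tot = [sum(row) for row in counts]
col_tot = [sum(col) for col in zip(*counts)]

# Expected count under independence: (row total * column total) / N
chi2_stat = sum(
    (counts[i][j] - row_tot[i] * col_tot[j] / N) ** 2
    / (row_tot[i] * col_tot[j] / N)
    for i in range(len(counts))
    for j in range(len(counts[0]))
)
dof = (len(counts) - 1) * (len(counts[0]) - 1)

print(N, dof, chi2_stat)
```

`scipy.stats.chi2_contingency(counts)` performs the same computation and also returns the p-value and the table of expected counts.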

1.3. Qual. vs Qual.

Indicator: Cramér’s V

It is based on Pearson’s chi-squared statistic and was introduced by Harald Cramér in 1946.
Cramér’s V formula: \(V=\sqrt{\frac{\chi^2/n}{\min(I-1,J-1)}}\), where \(n\) is the total number of observations.

Property

\(V\in [0,1]\) with \(V\approx 1\) indicating strong association.
It’s a biased estimator of the association strength between the two qual. variables.
It can heavily overestimate the true association strength.

Bias correction of Cramér’s V: \(\tilde{V}=\sqrt{\frac{\tilde{\phi}^2}{\min(\tilde{I}-1,\tilde{J}-1)}},\) where \(\tilde{\phi}^2=\max\left(0,\frac{\chi^2}{n}-\frac{\text{df}}{n-1}\right)\), \(\tilde{I}=I-\frac{(I-1)^2}{n-1}\) and \(\tilde{J}=J-\frac{(J-1)^2}{n-1}\).
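
Both formulas can be sketched directly from \(\chi^2\), \(n\), \(I\) and \(J\). The example plugs in the rounded values from the vaccine survey (\(\chi^2\approx 3298\), \(n=85\,855\), a \(4\times 7\) table); the function names are illustrative assumptions.

```python
import math

def cramers_v(chi2_stat, n, n_rows, n_cols):
    """Plain Cramér's V from the chi-squared statistic."""
    return math.sqrt((chi2_stat / n) / min(n_rows - 1, n_cols - 1))

def cramers_v_corrected(chi2_stat, n, n_rows, n_cols):
    """Bias-corrected Cramér's V, following the formula above."""
    dof = (n_rows - 1) * (n_cols - 1)
    phi2 = max(0.0, chi2_stat / n - dof / (n - 1))
    rows_t = n_rows - (n_rows - 1) ** 2 / (n - 1)
    cols_t = n_cols - (n_cols - 1) ** 2 / (n - 1)
    return math.sqrt(phi2 / min(rows_t - 1, cols_t - 1))

# Rounded values from the vaccine survey example
v = cramers_v(3298, 85855, 4, 7)
v_tilde = cramers_v_corrected(3298, 85855, 4, 7)

print(round(v, 3), round(v_tilde, 3))
```

Here \(V\) is only about 0.11: with a very large sample, a tiny p-value can go together with a weak association, which is why an effect-size measure is reported alongside the test.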

2. Multiple Information

2.1. Color: quantitative & qualitative

Color can represent:
- qualitative data (discrete colors).
- quantitative data (as a color gradient).

Code

import numpy as np
data2007[' '] = np.repeat('Data', data2007.shape[0])
fig = px.scatter(
    data2007, x="gdpPercap", y="lifeExp",
    hover_name="country", size_max=80, color=" ")
fig.update_layout(width=472, height=400, title='Life Expectancy vs GDP per Capita')  # no continent shown in this plot
fig.update_xaxes(type="log")
fig.show()

2.1. Color: quantitative & qualitative

Color can represent:
- qualitative data (discrete colors).
- quantitative data (as a color gradient).
Example:
- Color = continent, which is a categorical column.

Code

fig = px.scatter(
    data2007, x="gdpPercap", y="lifeExp", color="continent", hover_name="country", size_max=80)
fig.update_layout(width=500, height=400, title='Life Expectancy vs GDP per Capita & Continent')
fig.update_xaxes(type="log")
fig.show()

2.1. Color: quantitative & qualitative

Color can represent:
- qualitative data (discrete colors).
- quantitative data (as a color gradient).
Example:
- Color = continent, which is a categorical column.
- Color = lifeExp, which is a quantitative column.

Code

fig = px.scatter(
    data2007, x="gdpPercap", y="lifeExp", color="lifeExp", hover_name="country", size_max=80)
fig.update_layout(width=500, height=240, title='Life Expectancy vs GDP per Capita (color = lifeExp)')
fig.update_xaxes(type="log")
fig.show()

2.2. Shape/Symbol: qualitative

Shape (symbol) can represent qualitative data.
Example:
- Symbol = gdp_category.
- Color = continent.

Combining numerous colors and symbols can complicate a graph.

Use them carefully and only when appropriate.

Code

fig = px.scatter(
    data2007, x="gdpPercap", y="lifeExp", color="continent", 
    hover_name="country", symbol='gdp_category', size_max=80)
fig.update_layout(width=500, height=350, title='Life Expectancy vs GDP per Capita & Continent')
fig.update_xaxes(type="log")
fig.show()

2.3. Size: quantitative

Size can represent quantitative data.
Example:
- Size = pop.
- Color = continent.

Combining color and size is common; the resulting graph is often called a bubble chart.

One should choose a suitable maximum size to get a readable graph.

Code

fig = px.scatter(
    data2007, x="gdpPercap", y="lifeExp", color="continent",
    hover_name="country", size="pop", size_max=35)
fig.update_layout(width=500, height=370, title='Life Expectancy, GDP, Population & Continent')
fig.update_xaxes(type="log")
fig.show()

2.4. 3D: quantitative

All the previous options can be used with a 3D scatter plot.
Example: Marketing
- X = Youtube.
- Y = Facebook.
- Z = Sales.
- Size = Newspaper.
- Color = Newspaper.

Avoid 3D plots if they are not interactive [Section: “Don’t go 3D” by Claus O. Wilke (2019)].

3. Time series data

3.1. Quantitative: lineplot

Let’s take a look at Oceania from 1952 to 2007.

Code

df_ocean = gapminder.query("continent == 'Oceania'")
fig = px.line(df_ocean, x='year', y='lifeExp',
    symbol="country", color="country",
    title="Evolution of Life Expectancy")
fig.update_layout(height=350, width=330)
fig.show()

Code

fig = px.line(df_ocean, x='year', y='pop',
    symbol="country", color="country",
    title="Evolution of Population")
fig.update_layout(height=350, width=330)
fig.show()

Code

fig = px.line(df_ocean, x='year', y='gdpPercap',
    symbol="country", color="country",
    title="Evolution of GDP per Capita")
fig.update_layout(height=350, width=330)
fig.show()

Is there any other country’s evolution you would like to see?

3.2. Qualitative: Evolution barplots

Let’s take a look at the evolution of GDP per capita categories for Asian countries over time.

Code

def cat_gdp(yearly_data):
    return pd.qcut(yearly_data, q=3, labels=['Developing', 'Emerging', 'Developed'])
df = gapminder.copy()  # work on a copy so the original data frame is left untouched
# Apply the function to each year
df['GDP_Category'] = df.groupby('year').apply(lambda x: cat_gdp(x.gdpPercap)).reset_index(level=0, drop=True)

df_asia = df.query("continent == 'Asia'")
# Aggregate the data: number of Asian countries per year and GDP category
df_agg = df_asia.groupby(['year', 'GDP_Category']).size().reset_index(name='Count')

# Create the stacked bar chart
fig = px.bar(
    df_agg, x='year', y='Count', 
    color='GDP_Category', barmode='stack',
    title="Evolution of Asian Countries' GDP Categories from 1952 to 2007",
    labels={'Count': 'Number of Countries', 'year': 'Year'})

fig.update_layout(height=350, width=1000)
fig.show()

Animated Graphs

4. Animated Graphs

Animation with Plotly

Code

import plotly.express as px
df = px.data.gapminder()
fig_anime = px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
           size="pop", color="continent", hover_name="country",
           log_x=True, size_max=50, range_x=[100,100000], 
           range_y=[25,90])
fig_anime.update_layout(height=460, width=1000, 
    title="The Evolution of the World in a Single Graph")
fig_anime.show()

Start by looking at indicators, then visualize the interesting ones.

🥳 Yeahhhh….

Let’s Party… 🥂

Quiz time 🫣

Food delivery dataset

French Children Height Boxplot
