Exploratory Data Analysis (EDA): Correlation Analysis

INF-604: Data Analysis

Lecturer: Dr. Sothea HAS

Outline

Motivation
Pearson Correlation Coefficient
Spearman’s Rank Correlation Coefficient
\(\eta\)-squared Coefficient

Motivation

Consider `Gapminder Dataset` (1704, 5)

In 2007, we observed that
- People in countries with strong economies appeared to be healthier.
- The economy and health conditions did not seem to be related to population size.

Code

import plotly.graph_objects as go
import plotly.express as px
from gapminder import gapminder
data2007 = gapminder.query("year == 2007")
fig1 = px.scatter(
    data2007, x="gdpPercap", y="lifeExp", 
    hover_name="country", opacity=0.7,
    log_x=True)
fig1.update_traces(marker=dict(size=10))
fig1.update_layout(height=310, width=500, title="Life expectancy vs GDP per capita (log scale) in 2007")
fig1.show()

Code

data2007 = gapminder.query("year == 2007")
fig2 = px.scatter(
    data2007, x="gdpPercap", y="pop", 
    hover_name="country", opacity=0.7,
    log_x=True, log_y=True)
fig2.update_traces(marker=dict(size=10))
fig2.update_layout(height=310, width=500, title="Population vs GDP per capita (log-log scale) in 2007")
fig2.show()

Motivation

Consider `Gapminder Dataset` (1704, 5)

In 2007, we observed that
- Life expectancy and GDP per capita appeared to vary across different continents.

Code

sorted_data = data2007.sort_values(by='lifeExp')
fig3 = px.box(
    data2007, x="continent", y="lifeExp", 
    hover_name="country", color="continent", 
    category_orders={'continent': sorted_data['continent']})
fig3.update_layout(title="Life expectancy on each continent in 2007", height=350, width=500)
fig3.show()

Code

sorted_data = data2007.sort_values(by='gdpPercap')
fig4 = px.box(
    data2007, x="continent", y="gdpPercap", 
    hover_name="country", color="continent", 
    category_orders={'continent': sorted_data['continent']})
fig4.update_layout(title="GDP per Capita on each continent in 2007", height=350, width=500)
fig4.show()

Motivation

Consider `Gapminder Dataset` (1704, 5)

We will explore
some indicators that capture
such tendency or relationship.

Summary/Figure \(\color{red}{\underset{\text{Visualization}}{\longleftarrow}}\) Data \(\color{blue}{\underset{\text{Correlation}}{\longrightarrow}}\) Number.

Pearson Correlation Coefficient

Covariance between two variables/columns

Let \(X=[\text{x}_1,\text{x}_2,\dots,\text{x}_n]\) be a quantitative column.
Mean/average: \(\overline{\text{x}}=\displaystyle\frac{1}{n}\sum_{i=1}^n\text{x}_i\).
Variance: \(V(X)=\displaystyle\frac{1}{n-1}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}})^2.\)
Standard deviation: \(s_X=\sqrt{V(X)}\).
If \(Y=[\text{y}_1,\text{y}_2,\dots,\text{y}_n]\) is another quantitative column, the covariance between \(X\) and \(Y\) is defined by \[\text{Cov}(X,Y)=\frac{1}{n-1}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}})(\text{y}_i-\overline{\text{y}}).\]

Code

fig3.update_layout(width=350, height=450, title="Life expectancy per continent")
fig3.show()

Pearson Correlation Coefficient

Covariance between two variables/columns

If \(Y=[\text{y}_1,\text{y}_2,\dots,\text{y}_n]\) is another quantitative column, the covariance between \(X\) and \(Y\) is defined by \[\text{Cov}(X,Y)=\frac{1}{n-1}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}})(\text{y}_i-\overline{\text{y}}).\]

It determines tendency/direction of the relationship between the two variables.
- Positive value \(\approx\) change in the same direction.
- Negative value \(\approx\) change in opposite direction.

It’s hard to interpret the value of covariance as it can be large or small according to the scale of \(X\) and \(Y\).
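The scale dependence of covariance can be checked numerically. The following is a minimal sketch on made-up data (the arrays `height_m` and `weight_kg` and the helper `sample_cov` are hypothetical, not part of the Gapminder example): changing the unit of one column changes the covariance by the same factor, even though the relationship itself is unchanged.

```python
import numpy as np

# Hypothetical toy data: heights in metres and weights in kilograms.
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([50.0, 58.0, 66.0, 77.0, 89.0])

def sample_cov(x, y):
    """Sample covariance with the 1/(n-1) normalisation used in the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

cov_m = sample_cov(height_m, weight_kg)
# Same heights expressed in centimetres: the covariance is multiplied by 100.
cov_cm = sample_cov(height_m * 100, weight_kg)

print("Covariance (metres):     ", cov_m)
print("Covariance (centimetres):", cov_cm)
```

The sign (positive here) is stable under the change of unit, but the magnitude is not, which is why a normalised quantity is needed.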

Code

fig3.show()

Pearson Correlation Coefficient

Definition

Pearson correlation coefficient

Correlation between two quantitative columns \(X\) and \(Y\): \[r=r_{X,Y}=\frac{\sum_{i=1}^n(\text{x}_{i}-\overline{\text{x}})(\text{y}_{i}-\overline{\text{y}})}{\sqrt{\left(\sum_{i=1}^n(\text{x}_{i}-\overline{\text{x}})^2\right)\left(\sum_{i=1}^n(\text{y}_{i}-\overline{\text{y}})^2\right)}}=\frac{\text{Cov}(X,Y)}{s_Xs_Y}.\]
It quantifies the linear relationship/tendency between the two variables.
- For any pair \(X\) and \(Y\) one has \(-1\leq r\leq 1\).
- If \(r\approx 1\), then \(X\) and \(Y\) are positively correlated (change in the same direction).
- If \(r\approx -1\), then \(X\) and \(Y\) are negatively correlated (change in opposite direction).
- If \(r\approx 0\), then \(X\) and \(Y\) are uncorrelated (no linear pattern/trend/tendency).
It helps identify informative/useful inputs for building models.
It also helps identify redundant (strongly correlated) inputs.
Note: Correlation does not imply causation; it only indicates a relationship, not a cause-and-effect link [👉 For more, read here].
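As an illustration of the definition, here is a minimal sketch that computes \(r\) directly from the sums and compares it with NumPy's `np.corrcoef`. The helper name `pearson_r` and the toy arrays are assumptions made for this example only.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from the defining sums."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0                          # exactly linear, increasing
y_down = -3.0 * x + 10.0                      # exactly linear, decreasing
y_flat = np.array([2.0, 1.0, 3.0, 1.0, 2.0])  # no linear trend

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_flat = pearson_r(x, y_flat)

# Unlike covariance, r does not change when a column is rescaled.
r_rescaled = pearson_r(100.0 * x, y_up)

print(r_up, r_down, r_flat, r_rescaled)
print(np.corrcoef(x, y_up)[0, 1])  # NumPy's built-in for comparison
```

The three toy columns correspond to the three bullet cases above: \(r\approx 1\), \(r\approx -1\) and \(r\approx 0\).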

Pearson Correlation Coefficient

Examples:

Source: https://en.wikipedia.org/wiki/Correlation.

Pearson Correlation Coefficient

Correlation matrix: Gapminder

Consider Gapminder dataset in 2007:

cor = data2007[["gdpPercap", "lifeExp", "pop"]].corr()
cor.style.background_gradient(cmap='Accent')

	gdpPercap	lifeExp	pop
gdpPercap	1.000000	0.678662	-0.055676
lifeExp	0.678662	1.000000	0.047553
pop	-0.055676	0.047553	1.000000

Pearson Correlation Coefficient

Summary

The Pearson correlation coefficient measures the linear relationship between two quantitative columns.
It captures the pattern of the scatterplot between the two columns.
It only describes the tendency, not a cause-and-effect relation.
The correlation may not be reliable when:
- There are outliers.
- The number of observations is small.
- The relation is non-linear or there are confounding variables.
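The effect of outliers can be seen on a small made-up example (all values below are hypothetical): ten points with almost no linear trend, to which a single extreme observation is added.

```python
import numpy as np

# Ten hypothetical points with almost no linear trend.
x_clean = np.arange(1, 11, dtype=float)
y_clean = np.array([5, 4, 6, 5, 4, 6, 5, 4, 6, 5], dtype=float)

# Same data plus one extreme observation at (30, 30).
x_outlier = np.append(x_clean, 30.0)
y_outlier = np.append(y_clean, 30.0)

r_clean = np.corrcoef(x_clean, y_clean)[0, 1]
r_outlier = np.corrcoef(x_outlier, y_outlier)[0, 1]

print("Without the outlier:", round(float(r_clean), 3))
print("With the outlier:   ", round(float(r_outlier), 3))
```

A single point is enough to move the coefficient from weak to strong here, so it is worth looking at the scatterplot before trusting the number.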

Spearman’s Rank Correlation

Beyond linearity

Pearson correlation is sensitive to outliers (we will see that in the lab) and cannot capture non-linear relationships between quantitative columns.
It is not suitable for ordinal data (dislike-like rating, for example).
Spearman’s rank correlation does not rely on the values of the observations but rather on their ‘ranks’.
Let \(R[\text{x}_i]\) and \(R[\text{y}_i]\) be the ranks of observations \(\text{x}_i\) and \(\text{y}_i\) in their own lists. Spearman’s rank correlation coefficient between \(X\) and \(Y\) is defined as the Pearson correlation between the ranks of \(X\) and \(Y\), i.e., \[\rho_{X,Y}=r_{R[X],R[Y]}=\frac{\text{Cov}(R[X],R[Y])}{s_{R[X]}s_{R[Y]}}=1-\frac{6\sum_{i=1}^nd_i^2}{n(n^2-1)},\] where \(d_i=R[\text{x}_i]-R[\text{y}_i]\) is the difference in ranks of the \(i\)-th observation. The last equality holds when there are no tied ranks.
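The two expressions for \(\rho_{X,Y}\) can be compared numerically. This is a minimal sketch assuming no tied values; the helper names `ranks`, `spearman_rho` and `spearman_rho_from_d` and the toy data are hypothetical.

```python
import numpy as np

def ranks(values):
    """Ranks 1..n (smallest value gets rank 1); assumes no tied values."""
    values = np.asarray(values)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

def spearman_rho_from_d(x, y):
    """Spearman's rho from the rank differences d_i (valid without ties)."""
    d = ranks(x) - ranks(y)
    n = len(d)
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

x = [10, 20, 30, 40, 50, 60]
y = [1.2, 0.9, 2.5, 2.4, 3.8, 5.0]

rho_pearson_on_ranks = spearman_rho(x, y)
rho_from_d = spearman_rho_from_d(x, y)
print(rho_pearson_on_ranks, rho_from_d)
```

With tied values, ranks are usually averaged within each tie and only the first expression (Pearson on ranks) remains exact.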

Spearman’s Rank Correlation

Example

Example: \(X=[3,2,1,5,9]\) and \(Y=[8,5,0,23,80]\)

Pearson

\(\overline{x}=\frac{3+2+1+5+9}{5}=\color{red}{4}\) and \(\overline{y}=\frac{8+5+0+23+80}{5}=\color{blue}{23.2}\).
\(s_X=\sqrt{\frac{(3-\color{red}{4})^2+(2-\color{red}{4})^2+\dots+(9-\color{red}{4})^2}{n-1}}=3.1623\) and \(s_Y=\sqrt{\frac{(8-\color{blue}{23.2})^2+(5-\color{blue}{23.2})^2+\dots+(80-\color{blue}{23.2})^2}{n-1}}=32.889\).
\(r_{X,Y}=\frac{(3-\color{red}{4})(8-\color{blue}{23.2})+\dots+(9-\color{red}{4})(80-\color{blue}{23.2})}{(n-1)(3.1623)(32.889)}=\frac{405}{416.02}=0.9735.\)

Spearman

\(R[X]=[3,2,1,4,5]\) and \(R[Y]=[3,2,1,4,5]\).
All \(d_i=0\) therefore \(\rho_{X,Y}=1-\frac{6\sum_{i}d_i^2}{5(5^2-1)}=1\).
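The worked example can be double-checked in a few lines. The sketch below takes \(X=[3,2,1,5,9]\) (the values for which \(\overline{x}=4\)) and \(Y=[8,5,0,23,80]\), and uses a double `argsort` to get ranks, which is valid here because there are no ties.

```python
import numpy as np

x = np.array([3, 2, 1, 5, 9], dtype=float)
y = np.array([8, 5, 0, 23, 80], dtype=float)

# Pearson correlation on the raw values.
r = np.corrcoef(x, y)[0, 1]

# Ranks (1 = smallest); the double argsort is valid because there are no ties.
rank_x = np.argsort(np.argsort(x)) + 1
rank_y = np.argsort(np.argsort(y)) + 1

# Spearman correlation from the rank differences.
d = rank_x - rank_y
n = len(x)
rho = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(round(float(r), 4), float(rho))  # 0.9735 1.0
```

The relationship is perfectly monotonic (both columns have identical ranks) but not perfectly linear, hence \(\rho=1\) while \(r<1\).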

Spearman’s Rank Correlation

Correlation matrix: Gapminder

Consider Gapminder dataset in 2007:

cor = data2007[["gdpPercap", "lifeExp", "pop"]].corr()
cor.style.background_gradient(cmap='Accent')

	gdpPercap	lifeExp	pop
gdpPercap	1.000000	0.678662	-0.055676
lifeExp	0.678662	1.000000	0.047553
pop	-0.055676	0.047553	1.000000

cor = data2007[["gdpPercap", "lifeExp", "pop"]]\
    .corr("spearman")
cor.style.background_gradient(cmap='Accent')

	gdpPercap	lifeExp	pop
gdpPercap	1.000000	0.856590	-0.064588
lifeExp	0.856590	1.000000	0.003355
pop	-0.064588	0.003355	1.000000

Spearman’s Rank Correlation

More examples

Spearman’s Rank Correlation

Summary

Aspect	`Pearson`	`Spearman`
Type	Parametric	Non-parametric
Measure	Linear relationship	Monotonic relationship
Data Type	Continuous	Ordinal or continuous
Outliers	Sensitive to outliers	Less sensitive to outliers
Range	-1 to 1	-1 to 1
Interpretation	\(\bullet\) 1: Perfect positive linear relationship \(\bullet\) -1: Perfect negative linear relationship \(\bullet\) 0: No linear relationship	\(\bullet\) 1: Perfect increasing monotonic relationship \(\bullet\) -1: Perfect decreasing monotonic relationship \(\bullet\) 0: No monotonic relationship
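The "linear vs monotonic" row of the table can be illustrated with a strictly increasing but strongly non-linear relationship. This is a minimal sketch on synthetic data (\(y=e^x\) is chosen only for illustration), with ranks obtained by a double `argsort` since there are no ties.

```python
import numpy as np

# Synthetic data: y grows exponentially with x (monotonic, not linear).
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

# Pearson on the raw values.
pearson = np.corrcoef(x, y)[0, 1]

# Spearman = Pearson on the ranks (no ties here).
rank_x = np.argsort(np.argsort(x)) + 1
rank_y = np.argsort(np.argsort(y)) + 1
spearman = np.corrcoef(rank_x, rank_y)[0, 1]

print("Pearson: ", round(float(pearson), 3))
print("Spearman:", round(float(spearman), 3))
```

Spearman reaches 1 because the ordering of the two columns is identical, while Pearson stays clearly below 1 because the points do not lie on a straight line.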

\(\eta\)-squared Coefficient

Quantitative-qualitative

Code

data2007[['continent', 'lifeExp', 'gdpPercap']].head(4)

	continent	lifeExp	gdpPercap
11	Asia	43.828	974.580338
23	Europe	76.423	5937.029526
35	Africa	72.301	6223.367465
47	Africa	42.731	4797.231267

Recall that in 2007:
- continent and lifeExp are related (life expectancy varies across different continents).
- gdpPercap also differs across different continents.

How can we quantify these relations?

\(\eta\)-squared Coefficient

Quantitative-qualitative

Let \(G\) be a qualitative column and \(X\) a quantitative column (below, \(G\) also denotes the number of categories of the qualitative column).
Between Sum of Squares (BSS): \[\color{blue}{\text{BSS}}=\sum_{g=1}^Gn_g(\overline{\text{x}}_g-\color{blue}{\overline{\text{x}}})^2,\] where
- \(\overline{\text{x}}_g\) is the mean of \(X\) over a category \(g\) of \(G\).
- \(\color{blue}{\overline{\text{x}}}\) is the global mean of \(X\).
- \(n_g\) is the number of observations within category \(g\) of \(G\).

It measures how far the group means of \(X\) are from the global mean across the categories of \(G\).
Total Sum of Squares (TSS): \[\color{red}{\text{TSS}}=\sum_{i=1}^n(\text{x}_i-\color{blue}{\overline{\text{x}}})^2.\]

\(\eta\)-squared Coefficient

Quantitative-qualitative

\(\eta\)-squared coefficient: \(\eta^2=\frac{\color{blue}{\text{BSS}}}{\color{red}{\text{TSS}}}.\)
One always has \(0\leq \eta^2\leq 1\):
- \(\eta^2\approx 0\): no relation between group \(G\) and quantitative column \(X\).
- \(\eta^2\approx 1\): strong relation.

Eta-squared value: 0.635

Eta-squared value: 0.424
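The definition of \(\eta^2\) translates directly into code. The following is a minimal sketch on a small made-up dataset, not the code that produced the values above; the function name `eta_squared` and the columns `region`, `separated` and `overlapping` are hypothetical.

```python
import numpy as np

def eta_squared(groups, values):
    """Eta-squared = BSS / TSS for a qualitative and a quantitative column."""
    groups = np.asarray(groups)
    values = np.asarray(values, dtype=float)
    global_mean = values.mean()
    # Total sum of squares: spread of all values around the global mean.
    tss = np.sum((values - global_mean) ** 2)
    # Between sum of squares: spread of the group means, weighted by group size.
    bss = 0.0
    for g in np.unique(groups):
        x_g = values[groups == g]
        bss += len(x_g) * (x_g.mean() - global_mean) ** 2
    return bss / tss

# Hypothetical data: three groups of three observations each.
region = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
separated = [10, 11, 12, 20, 21, 22, 30, 31, 32]    # groups far apart
overlapping = [10, 20, 30, 11, 21, 31, 12, 22, 32]  # groups overlap heavily

eta2_separated = eta_squared(region, separated)
eta2_overlapping = eta_squared(region, overlapping)
print(round(float(eta2_separated), 3), round(float(eta2_overlapping), 3))
```

Both columns contain exactly the same values, so TSS is identical; only the way the values are distributed across the groups changes, which moves \(\eta^2\) from close to 1 to close to 0.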

\(\eta\)-squared Coefficient

Summary

The \(\eta\)-squared coefficient measures the proportion of variation in the quantitative variable that is explained by the categories of the qualitative variable.
\(\eta\)-squared is commonly used as an effect size in Analysis of Variance (ANOVA), which studies the effect of a grouping variable on a quantitative target.
Example: Which qualitative column influences the delivery time the most in the food delivery dataset of Lab 2?

	Weather	Traffic_Level	Time_of_Day	Vehicle_Type
Delivery time	0.040175	0.03845	0.001226	0.001181


🥳 Yeahhhh….

Let’s Party… 🥂
