Exploratory Data Analysis (EDA): Correlation Analysis


INF-604: Data Analysis

Lecturer: Dr. Sothea HAS

Outline

  • Motivation

  • Pearson Correlation Coefficient

  • Spearman’s Rank Correlation Coefficient

  • \(\eta\)-squared Coefficient

Motivation

Motivation

Consider Gapminder Dataset (1704, 5)

  • In 2007, we observed that
    • People in countries with strong economies appeared to be healthier.
    • The economy and health conditions did not seem to be related to population size.
Code
import plotly.graph_objects as go
import plotly.express as px
from gapminder import gapminder
data2007 = gapminder.query("year == 2007")
fig1 = px.scatter(
    data2007, x="gdpPercap", y="lifeExp", 
    hover_name="country", opacity=0.7,
    log_x=True)
fig1.update_traces(marker=dict(size=10))
fig1.update_layout(height=310, width=500, title="Life expectancy vs GDP per capita (log scale) in 2007")
fig1.show()
Code
data2007 = gapminder.query("year == 2007")
fig2 = px.scatter(
    data2007, x="gdpPercap", y="pop", 
    hover_name="country", opacity=0.7,
    log_x=True, log_y=True)
fig2.update_traces(marker=dict(size=10))
fig2.update_layout(height=310, width=500, title="Population vs GDP per capita (log-log scale) in 2007")
fig2.show()

Motivation

Consider Gapminder Dataset (1704, 5)

  • In 2007, we observed that
    • Life expectancy and GDP per capita appeared to vary across different continents.
Code
sorted_data = data2007.sort_values(by='lifeExp')
fig3 = px.box(
    data2007, x="continent", y="lifeExp", 
    hover_name="country", color="continent", 
    category_orders={'continent': sorted_data['continent']})
fig3.update_layout(title="Life expectancy on each continent in 2007", height=350, width=500)
fig3.show()
Code
sorted_data = data2007.sort_values(by='gdpPercap')
fig4 = px.box(
    data2007, x="continent", y="gdpPercap", 
    hover_name="country", color="continent", 
    category_orders={'continent': sorted_data['continent']})
fig4.update_layout(title="GDP per Capita on each continent in 2007", height=350, width=500)
fig4.show()

Motivation

Consider Gapminder Dataset (1704, 5)



We will explore
some indicators that capture
such tendency or relationship.


Summary/Figure \(\color{red}{\underset{\text{Visualization}}{\longleftarrow}}\) Data \(\color{blue}{\underset{\text{Correlation}}{\longrightarrow}}\) Number.

Pearson Correlation Coefficient

Pearson Correlation Coefficient

Covariance between two variables/columns

  • Let \(X=[\text{x}_1,\text{x}_2,\dots,\text{x}_n]\) be a quantitative column.
  • Mean/average: \(\overline{\text{x}}=\displaystyle\frac{1}{n}\sum_{i=1}^n\text{x}_i\).
  • Variance: \(V(X)=\displaystyle\frac{1}{n-1}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}})^2.\)
  • Standard deviation: \(s_X=\sqrt{V(X)}\).
  • If \(Y=[\text{y}_1,\text{y}_2,\dots,\text{y}_n]\) is another quantitative column, the covariance between \(X\) and \(Y\) is defined by \[\text{Cov}(X,Y)=\frac{1}{n-1}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}})(\text{y}_i-\overline{\text{y}}).\]
Code
fig3.update_layout(width=350, height=450, title="Life expectancy per continent")
fig3.show()
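
The definitions above can be checked numerically. The sketch below uses two small made-up columns (hypothetical values, not taken from Gapminder) and compares the hand-written formulas with NumPy's built-ins, assuming the \(1/(n-1)\) convention used on this slide.

```python
import numpy as np

# Hypothetical toy columns, chosen only for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])
n = len(x)

# Formulas written exactly as on the slide
mean_x = x.sum() / n
var_x = ((x - mean_x) ** 2).sum() / (n - 1)
std_x = np.sqrt(var_x)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# NumPy equivalents; ddof=1 selects the 1/(n-1) convention
# (np.var and np.std default to ddof=0, np.cov defaults to ddof=1).
assert np.isclose(var_x, np.var(x, ddof=1))
assert np.isclose(std_x, np.std(x, ddof=1))
assert np.isclose(cov_xy, np.cov(x, y)[0, 1])

print(mean_x, var_x, std_x, cov_xy)
```

Note that `np.var` and `np.std` divide by \(n\) unless `ddof=1` is passed, which is a common source of small mismatches with hand computations.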

Pearson Correlation Coefficient

Covariance between two variables/columns

  • If \(Y=[\text{y}_1,\text{y}_2,\dots,\text{y}_n]\) is another quantitative column, the covariance between \(X\) and \(Y\) is defined by \[\text{Cov}(X,Y)=\frac{1}{n-1}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}})(\text{y}_i-\overline{\text{y}}).\]
  • It determines tendency/direction of the relationship between the two variables.
    • Positive value \(\approx\) change in the same direction.
    • Negative value \(\approx\) change in opposite direction.

It’s hard to interpret the value of covariance as it can be large or small according to the scale of \(X\) and \(Y\).
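
To illustrate this scale dependence, here is a minimal sketch with made-up heights and weights (hypothetical numbers). Converting heights from metres to centimetres multiplies the covariance by 100 although the relationship itself is unchanged, while the standardized coefficient returned by `np.corrcoef` stays the same.

```python
import numpy as np

# Hypothetical heights (m) and weights (kg) for five people.
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 58.0, 69.0, 75.0, 88.0])

# Same information expressed in a different unit
height_cm = height_m * 100

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
r_m = np.corrcoef(height_m, weight_kg)[0, 1]
r_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(f"covariance:  {cov_m:.3f} (m)  vs  {cov_cm:.1f} (cm)")
print(f"correlation: {r_m:.4f} (m)  vs  {r_cm:.4f} (cm)")
```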

Code
fig3.show()

Pearson Correlation Coefficient

Definition

Pearson correlation coefficient

  • Correlation between two quantitative columns \(X\) and \(Y\): \[r=r_{X,Y}=\frac{\sum_{i=1}^n(\text{x}_{i}-\overline{\text{x}})(\text{y}_{i}-\overline{\text{y}})}{\sqrt{\left(\sum_{i=1}^n(\text{x}_{i}-\overline{\text{x}})^2\right)\left(\sum_{i=1}^n(\text{y}_{i}-\overline{\text{y}})^2\right)}}=\frac{\text{Cov}(X,Y)}{s_Xs_Y}.\]
  • It quantifies the linear relationship/tendency between the two variables.
    • For any pair \(X\) and \(Y\) one has \(-1\leq r\leq 1\).
    • If \(r\approx 1\), then \(X\) and \(Y\) are positively correlated (change in the same direction).
    • If \(r\approx -1\), then \(X\) and \(Y\) are negatively correlated (change in opposite direction).
    • If \(r\approx 0\), then \(X\) and \(Y\) are uncorrelated (no linear pattern/trend/tendency).
  • It helps identify informative/useful inputs for building models.
  • It also helps identify redundant (strongly correlated) inputs.
  • Note: Correlation does not imply causation; it only indicates a relationship, not a cause-and-effect link.
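
As a minimal sketch of the definition, the function below (the name `pearson_r` and the toy columns are assumptions made for illustration) computes \(r\) directly from the sums in the formula and compares it with `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from the sums in the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical toy columns
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1                                   # exact increasing line
y_down = -3 * x + 10                               # exact decreasing line
y_noisy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])     # roughly linear

print(pearson_r(x, y_up))      # close to 1
print(pearson_r(x, y_down))    # close to -1
print(pearson_r(x, y_noisy))

# Agreement with NumPy's implementation
assert np.isclose(pearson_r(x, y_noisy), np.corrcoef(x, y_noisy)[0, 1])
```

The common factor \(1/(n-1)\) cancels between numerator and denominator, which is why it does not appear in the function.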

Pearson Correlation Coefficient

Examples:

Pearson Correlation Coefficient

Correlation matrix: Gapminder

  • Consider Gapminder dataset in 2007:
cor = data2007[["gdpPercap", "lifeExp", "pop"]].corr()
cor.style.background_gradient(cmap='Accent')
             gdpPercap    lifeExp        pop
gdpPercap     1.000000   0.678662  -0.055676
lifeExp       0.678662   1.000000   0.047553
pop          -0.055676   0.047553   1.000000


Pearson Correlation Coefficient

Summary

  • The Pearson correlation coefficient measures the linear relationship between two quantitative columns.

  • It captures the pattern of the scatterplot between the two columns.

  • It describes only the tendency, not a cause-and-effect relation.

  • The correlation may not be reliable when there are:

    • Outliers
    • Few observations
    • Non-linear relations or confounding variables
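
The non-linear case can be made concrete with a small sketch (toy data, chosen for illustration): \(Y\) is completely determined by \(X\) through \(y=x^2\), yet the Pearson coefficient is zero because the relationship is not linear.

```python
import numpy as np

# Symmetric toy inputs: -3, -2, ..., 3
x = np.arange(-3, 4, dtype=float)
y = x ** 2   # Y depends perfectly on X, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(r)     # zero up to floating-point error
```

A value of \(r\) near zero therefore rules out a linear trend only; it does not show that the two columns are unrelated.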

Spearman’s Rank Correlation

Spearman’s Rank Correlation

Beyond linearity

  • Pearson correlation is sensitive to outliers (we will see that in the lab) and cannot capture non-linear relationships between quantitative columns.
  • It is not suitable for ordinal data (dislike-like ratings, for example).
  • Spearman’s rank correlation does not rely on the values of the observations but on their ‘ranks’.
  • Let \(R[\text{x}_i]\) and \(R[\text{y}_i]\) be the ranks of observations \(\text{x}_i\) and \(\text{y}_i\) in their own lists. Spearman’s rank correlation coefficient between \(X\) and \(Y\) is defined as the Pearson correlation of the ranks of \(X\) and \(Y\), i.e., \[\rho_{X,Y}=r_{R[X],R[Y]}=\frac{\text{Cov}(R[X],R[Y])}{s_{R[X]}s_{R[Y]}}=1-\frac{6\sum_{i=1}^nd_i^2}{n(n^2-1)},\] where \(d_i=R[\text{x}_i]-R[\text{y}_i]\) is the difference in rank of the \(i\)-th observation. The last equality holds when there are no ties.
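
The two expressions for \(\rho\) can be compared in code. This is a minimal sketch assuming no tied values; the helper names (`ranks`, `spearman_rho`, `spearman_rho_shortcut`) and the toy data are hypothetical.

```python
import numpy as np

def ranks(values):
    """Ranks 1..n of the observations, assuming there are no ties."""
    values = np.asarray(values)
    order = np.argsort(values)               # positions of sorted values
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman_rho(x, y):
    """Pearson correlation applied to the ranks."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

def spearman_rho_shortcut(x, y):
    """Shortcut 1 - 6*sum(d_i^2) / (n(n^2-1)), valid without ties."""
    d = ranks(x) - ranks(y)
    n = len(d)
    return 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

# Hypothetical toy columns without ties
x = np.array([10, 20, 30, 40, 50, 60])
y = np.array([3, 1, 4, 2, 6, 5])

print(spearman_rho(x, y), spearman_rho_shortcut(x, y))
```

With tied values the ranks are usually averaged and the shortcut no longer matches exactly; the Pearson-on-ranks form remains the definition.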

Spearman’s Rank Correlation

Example

  • Example: \(X=[3,2,1,5,9]\) and \(Y=[8,5,0,23,80]\)

Pearson

  • \(\overline{x}=\frac{3+2+1+5+9}{5}=\color{red}{4}\) and \(\overline{y}=\frac{8+5+0+23+80}{5}=\color{blue}{23.2}\).
  • \(s_X=\sqrt{\frac{(3-\color{red}{4})^2+(2-\color{red}{4})^2+\dots+(9-\color{red}{4})^2}{n-1}}=3.1623\) and \(s_Y=\sqrt{\frac{(8-\color{blue}{23.2})^2+(5-\color{blue}{23.2})^2+\dots+(80-\color{blue}{23.2})^2}{n-1}}=32.889\).
  • \(r_{X,Y}=\frac{(3-\color{red}{4})(8-\color{blue}{23.2})+\dots+(9-\color{red}{4})(80-\color{blue}{23.2})}{(n-1)(3.1623)(32.889)}=0.9735.\)

Spearman

  • \(R[X]=[3,2,1,4,5]\) and \(R[Y]=[3,2,1,4,5]\).
  • All \(d_i=0\) therefore \(\rho_{X,Y}=1-\frac{6\sum_{i}d_i^2}{5(5^2-1)}=1\).
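
The worked example can be reproduced with a few lines of NumPy. The sketch assumes the last value of \(X\) is 9, which is the value consistent with \(\overline{x}=4\) and with \(r=0.9735\).

```python
import numpy as np

# Assumption: last x value is 9 (this is what gives mean 4 and r = 0.9735)
x = np.array([3, 2, 1, 5, 9], dtype=float)
y = np.array([8, 5, 0, 23, 80], dtype=float)

# Pearson on the raw values
r = np.corrcoef(x, y)[0, 1]

# Ranks (no ties): a double argsort gives 0-based rank positions
rank_x = np.argsort(np.argsort(x)) + 1
rank_y = np.argsort(np.argsort(y)) + 1

# Spearman through the rank-difference shortcut
d = rank_x - rank_y
n = len(x)
rho = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

print(round(r, 4))       # 0.9735
print(rank_x, rank_y)    # [3 2 1 4 5] [3 2 1 4 5]
print(rho)               # 1.0
```

Pearson stays below 1 because \(Y\) grows faster than linearly in \(X\), while Spearman equals 1 because the ordering of the observations is identical in both columns.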

Spearman’s Rank Correlation

Correlation matrix: Gapminder

  • Consider Gapminder dataset in 2007:
cor = data2007[["gdpPercap", "lifeExp", "pop"]].corr()
cor.style.background_gradient(cmap='Accent')
             gdpPercap    lifeExp        pop
gdpPercap     1.000000   0.678662  -0.055676
lifeExp       0.678662   1.000000   0.047553
pop          -0.055676   0.047553   1.000000


cor = data2007[["gdpPercap", "lifeExp", "pop"]]\
    .corr("spearman")
cor.style.background_gradient(cmap='Accent')
             gdpPercap    lifeExp        pop
gdpPercap     1.000000   0.856590  -0.064588
lifeExp       0.856590   1.000000   0.003355
pop          -0.064588   0.003355   1.000000

Spearman’s Rank Correlation

More examples

Spearman’s Rank Correlation

Summary

Aspect         | Pearson                                  | Spearman
Type           | Parametric                               | Non-parametric
Measure        | Linear relationship                      | Monotonic relationship
Data type      | Continuous                               | Ordinal or continuous
Outliers       | Sensitive to outliers                    | Less sensitive to outliers
Range          | -1 to 1                                  | -1 to 1
Interpretation | 1: Perfect positive linear relationship  | 1: Perfect positive rank correlation
               | -1: Perfect negative linear relationship | -1: Perfect negative rank correlation
               | 0: No linear relationship                | 0: No monotonic (rank) relationship
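
The "Outliers" row can be illustrated with a minimal sketch on made-up data: one extreme value that preserves the ordering lowers the Pearson coefficient considerably but leaves the Spearman coefficient unchanged. The helper `spearman_rho` is a hypothetical name and assumes no ties.

```python
import numpy as np

def spearman_rho(x, y):
    """Pearson correlation of the ranks (assumes no tied values)."""
    rx = np.argsort(np.argsort(x)) + 1
    ry = np.argsort(np.argsort(y)) + 1
    return np.corrcoef(rx, ry)[0, 1]

# Toy data: a perfect line, then the same line with one extreme value
x = np.arange(1, 11, dtype=float)
y_clean = x.copy()
y_outlier = x.copy()
y_outlier[-1] = 1000.0   # extreme, but still the largest value

r_clean = np.corrcoef(x, y_clean)[0, 1]
r_outlier = np.corrcoef(x, y_outlier)[0, 1]
rho_clean = spearman_rho(x, y_clean)
rho_outlier = spearman_rho(x, y_outlier)

print(f"Pearson : {r_clean:.3f} -> {r_outlier:.3f}")
print(f"Spearman: {rho_clean:.3f} -> {rho_outlier:.3f}")
```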

\(\eta\)-squared Coefficient

\(\eta\)-squared Coefficient

Quantitative-qualitative

Code
data2007[['continent', 'lifeExp', 'gdpPercap']].head(4)
   continent  lifeExp    gdpPercap
11      Asia   43.828   974.580338
23    Europe   76.423  5937.029526
35    Africa   72.301  6223.367465
47    Africa   42.731  4797.231267
  • Recall that in 2007:
    • continent and lifeExp are related (life expectancy varies across different continents).
    • gdpPercap also differs across different continents.
  • How can we quantify these relations?

\(\eta\)-squared Coefficient

Quantitative-qualitative

  • Let \(G\) be a qualitative column and \(X\) a quantitative column.
  • Between Sum of Squares (BSS): \[\color{blue}{\text{BSS}}=\sum_{g=1}^Gn_g(\overline{\text{x}}_g-\color{blue}{\overline{\text{x}}})^2,\] where
    • \(\overline{\text{x}}_g\) is the mean of \(X\) over a category \(g\) of \(G\).
    • \(\color{blue}{\overline{\text{x}}}\) is the global mean of \(X\).
    • \(n_g\) is the number of observations within category \(g\) of \(G\).
  • It measures how far the group means of \(X\) are from the global mean, i.e., how much \(X\) varies across the groups of \(G\).
  • Total Sum of Squares (TSS): \[\color{red}{\text{TSS}}=\sum_{i=1}^n(\text{x}_i-\color{blue}{\overline{\text{x}}})^2.\]
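
The two sums can be computed directly from their definitions. The sketch below uses a small made-up grouping (hypothetical labels and values) and also computes the within-group sum of squares, a quantity not shown on the slide, to check the standard decomposition TSS = BSS + WSS.

```python
import numpy as np

# Hypothetical qualitative column G and quantitative column X
groups = np.array(["a", "a", "a", "b", "b", "c", "c", "c"])
x = np.array([1.0, 2.0, 3.0, 6.0, 8.0, 10.0, 11.0, 12.0])

global_mean = x.mean()
tss = ((x - global_mean) ** 2).sum()

bss = 0.0   # between-group sum of squares
wss = 0.0   # within-group sum of squares (extra, not on the slide)
for g in np.unique(groups):
    x_g = x[groups == g]                               # values in category g
    bss += len(x_g) * (x_g.mean() - global_mean) ** 2
    wss += ((x_g - x_g.mean()) ** 2).sum()

print(bss, wss, tss)
assert np.isclose(tss, bss + wss)
```

Since WSS is never negative, BSS can never exceed TSS.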

\(\eta\)-squared Coefficient

Quantitative-qualitative

  • \(\eta\)-squared coefficient: \(\eta^2=\frac{\color{blue}{\text{BSS}}}{\color{red}{\text{TSS}}}.\)

  • One always has \(0\leq \eta^2\leq 1\):

    • \(\eta^2\approx 0\): no relation between group \(G\) and quantitative column \(X\).
    • \(\eta^2\approx 1\): strong relation.
Eta-squared value: 0.635
Eta-squared value: 0.424
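
A possible way to compute \(\eta^2\) with pandas is sketched below. The function name `eta_squared` and the toy DataFrame are assumptions made for illustration; this is not necessarily the code that produced the values above. On the Gapminder data one would presumably call it as `eta_squared(data2007, "continent", "lifeExp")`.

```python
import pandas as pd

def eta_squared(df, group_col, value_col):
    """BSS / TSS for a qualitative column and a quantitative column."""
    global_mean = df[value_col].mean()
    # Per-category mean and size of the quantitative column
    stats = df.groupby(group_col)[value_col].agg(["mean", "count"])
    bss = (stats["count"] * (stats["mean"] - global_mean) ** 2).sum()
    tss = ((df[value_col] - global_mean) ** 2).sum()
    return bss / tss

# Hypothetical toy data: 'group' separates the values well, 'coin' does not
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "coin":  ["h", "t", "h", "t", "h", "t", "h", "t"],
    "value": [1.0, 2.0, 3.0, 6.0, 8.0, 10.0, 11.0, 12.0],
})

eta_group = eta_squared(toy, "group", "value")
eta_coin = eta_squared(toy, "coin", "value")
print(f"Eta-squared (group): {eta_group:.3f}")
print(f"Eta-squared (coin):  {eta_coin:.3f}")
```

The same function can be applied to several qualitative columns in turn to compare how strongly each one is related to the quantitative column.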

\(\eta\)-squared Coefficient

Summary

  • The \(\eta\)-squared coefficient measures the proportion of variation in the quantitative variable that is explained by the categories of the qualitative variable.
  • \(\eta\)-squared is commonly used as an effect size in Analysis of Variance (ANOVA), which studies the effect of a grouping variable on a quantitative target.
  • Example: Which qualitative column influences the delivery time the most in the Food delivery dataset of Lab 2?
               Weather   Traffic_Level  Time_of_Day  Vehicle_Type
Delivery time  0.040175  0.03845        0.001226     0.001181
