TP5 - EDA: Correlation Analysis

Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD


Objective: In this lab, you will apply correlation analysis to real examples. We will also explore the limitations of correlation analysis and what to watch out for when drawing conclusions from each type of correlation.


#%pip install gapminder           # This is for installing the package
from gapminder import gapminder
import pandas as pd
import numpy as np
gapminder.head()
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106

1. Pearson and Spearman’s correlations

a. Compute the Pearson correlation matrix of the three quantitative variables (lifeExp, pop and gdpPercap) for the years \(1952\), \(1987\) and \(2007\) using the pandas DataFrame method .corr(). Give a brief intuition of the relationship between these variables.

# To do
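As a hedged sketch of the mechanics only: the snippet below filters one year at a time and calls `DataFrame.corr()` on the quantitative columns. The `toy` table and its values are made up for illustration, and the helper name `corr_by_year` is hypothetical; with the real data you would pass `gapminder` and the three years of the question.

```python
import pandas as pd

# Made-up toy rows with the same column names as gapminder (NOT real data).
toy = pd.DataFrame({
    "year":      [1952, 1952, 1952, 1952, 2007, 2007, 2007, 2007],
    "lifeExp":   [30.0, 40.0, 50.0, 45.0, 50.0, 60.0, 70.0, 80.0],
    "pop":       [8.0e6, 2.0e6, 5.0e7, 1.0e5, 3.0e7, 4.0e6, 6.0e7, 3.0e5],
    "gdpPercap": [800.0, 1500.0, 6000.0, 100000.0, 1000.0, 5000.0, 20000.0, 40000.0],
})

def corr_by_year(df, years, cols=("lifeExp", "pop", "gdpPercap")):
    """Hypothetical helper: one Pearson correlation matrix per requested year."""
    out = {}
    for y in years:
        subset = df.loc[df["year"] == y, list(cols)]
        out[y] = subset.corr()  # method="pearson" is the default
    return out

mats = corr_by_year(toy, [1952, 2007])
for y, m in mats.items():
    print(y)
    print(m.round(3))
```

Each matrix is symmetric with ones on the diagonal, so only the three off-diagonal pairs need interpreting.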

Description:

b. Compute Spearman’s Rank Correlation of the previous columns in 1952, 1987 and 2007. What do you observe?

# To do
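A small sketch of what Spearman's coefficient measures, on synthetic data (not gapminder): it is the Pearson correlation computed on ranks, so it reaches 1 for any strictly increasing relationship, even a strongly non-linear one.

```python
import numpy as np
import pandas as pd

# Synthetic monotone but non-linear relation (illustration only).
x = pd.Series(np.arange(1, 9, dtype=float))
y = np.exp(x)

pearson = x.corr(y)                 # strength of the linear association
spearman = x.rank().corr(y.rank())  # Pearson on ranks, i.e. Spearman

print(round(pearson, 3), round(spearman, 3))
```

On a full table, `DataFrame.corr(method="spearman")` gives the same rank-based matrix directly.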

c. From the previous results, pick the most interesting pair of variables and, for each year, draw a plot illustrating their relationship, with proper axis scaling and a title.

import matplotlib.pyplot as plt
import seaborn as sns

# To do
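One possible layout, sketched under the assumption that the chosen pair is life expectancy and GDP per capita: a scatter plot with a log-scaled x-axis, since GDP per capita spans several orders of magnitude. The helper name `scatter_log_gdp` and the toy values are hypothetical; replace `toy` with the rows of one year.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only (NOT real gapminder rows).
toy = pd.DataFrame({
    "gdpPercap": [500.0, 1000.0, 5000.0, 20000.0, 40000.0],
    "lifeExp":   [42.0, 50.0, 65.0, 75.0, 80.0],
})

def scatter_log_gdp(df, year_label):
    """Hypothetical helper: life expectancy vs GDP per capita, log-scaled x-axis."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(df["gdpPercap"], df["lifeExp"], alpha=0.7)
    ax.set_xscale("log")  # spread out the low-income countries
    ax.set_xlabel("GDP per capita (log scale)")
    ax.set_ylabel("Life expectancy (years)")
    ax.set_title(f"Life expectancy vs GDP per capita, {year_label}")
    return ax

ax = scatter_log_gdp(toy, "toy data")
```

Calling the helper once per year (or drawing on a row of subplots) makes the three years directly comparable.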

d. Revisit your intuition about the correlation matrix for 1952 from question (a). Can you see why we observed such a poor correlation in 1952?

  • Now, drop the outlying country from the 1952 data. Re-plot and recompute the correlation between the health and economic conditions of the world in 1952. Conclude.

Remark: The Pearson correlation matrix summarizes linear relationships between pairs of quantitative variables, but it can be inaccurate because it is influenced by

  • outliers,
  • non-linearity,
  • small sample size,
  • confounding (causal) variables…
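To make the first point concrete, here is a minimal sketch on synthetic data (not gapminder): twenty points lying exactly on a line, then the same points plus a single extreme observation. The one outlier is enough to flip the sign of Pearson's coefficient, while the rank-based coefficient stays clearly positive.

```python
import numpy as np
import pandas as pd

# Synthetic data: 20 points on an exact line (illustration only).
x = pd.Series(np.arange(1, 21, dtype=float))
y = 2 * x + 1

# Same data with one extreme point appended: very large x, very small y.
x_out = pd.concat([x, pd.Series([500.0])], ignore_index=True)
y_out = pd.concat([y, pd.Series([0.0])], ignore_index=True)

r_clean = x.corr(y)                        # Pearson without the outlier
r_out = x_out.corr(y_out)                  # Pearson with the outlier
rho_out = x_out.rank().corr(y_out.rank())  # Spearman with the outlier

print(round(r_clean, 3), round(r_out, 3), round(rho_out, 3))
```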

2. \(\eta\)-squared correlations

a. We have seen how life expectancy and economy vary across continents in 1952 (Lab4) and 2007 (course). Compute the \(\eta\)-squared correlation between continent and lifeExp, then between continent and gdpPercap, in 1952, 1987 and 2007.

  • Do you find the results reasonable?
# To do
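A minimal sketch, assuming the usual definition of \(\eta\)-squared as the between-group sum of squares divided by the total sum of squares (check it against the definition given in the course). The function name `eta_squared`, the column names and the toy values are hypothetical.

```python
import pandas as pd

def eta_squared(df, group_col, value_col):
    """Between-group sum of squares divided by total sum of squares."""
    values = df[value_col]
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    grouped = df.groupby(group_col)[value_col]
    # Each group contributes its size times its squared distance to the grand mean.
    ss_between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Made-up toy data: two well-separated groups (illustration only).
toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "value": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
})
eta2 = eta_squared(toy, "group", "value")
print(eta2)
```

With the real data, the call would look like `eta_squared(gapminder[gapminder["year"] == 1952], "continent", "lifeExp")`. Values close to 1 mean that the grouping variable explains most of the variance of the quantitative one.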

3. Time evolution

a. Plot the evolution of the following correlations from 1952 to 2007:

  • Pearson and Spearman correlations between life expectancy and GDP per capita
  • \(\eta\)-squared coefficients of continent vs life expectancy, and continent vs GDP per capita.
import plotly.express as px

# To do
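One way to prepare the data for such a plot, sketched on a made-up table: group by year, compute both coefficients within each group, and collect them into one tidy table. The helper name `corr_over_time` is hypothetical and the toy values are not real gapminder rows.

```python
import pandas as pd

def corr_over_time(df, x="gdpPercap", y="lifeExp", time="year"):
    """Hypothetical helper: Pearson and Spearman correlation of x and y per time value."""
    rows = []
    for t, g in df.groupby(time):
        rows.append({
            time: t,
            "pearson": g[x].corr(g[y]),
            "spearman": g[x].rank().corr(g[y].rank()),  # Pearson on ranks
        })
    return pd.DataFrame(rows)

# Made-up toy data (illustration only).
toy = pd.DataFrame({
    "year":      [1952, 1952, 1952, 1952, 2007, 2007, 2007, 2007],
    "gdpPercap": [1.0, 2.0, 3.0, 100.0, 1.0, 2.0, 3.0, 4.0],
    "lifeExp":   [30.0, 40.0, 50.0, 45.0, 50.0, 60.0, 70.0, 80.0],
})
result = corr_over_time(toy)
print(result)
```

The resulting table can be reshaped with `result.melt(id_vars="year")` and passed to `px.line(..., x="year", y="value", color="variable")` to draw one curve per coefficient.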

b. From what you have learned from the dataset, describe how the world changed from 1952 to 2007.

Description:

Further readings