Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD
Objective: In this lab, you will apply correlation analysis to real examples. We will also explore the limitations of correlation analysis and what to watch out for when drawing conclusions from each type of correlation.
The notebook of this Lab can be downloaded here: Lab5_EDA.ipynb.
Or you can work directly with Google Colab here: Lab5_EDA.ipynb.
#%pip install gapminder # This is for installing the package
from gapminder import gapminder
import pandas as pd
import numpy as np
gapminder.head()
   country      continent  year  lifeExp  pop       gdpPercap
0  Afghanistan  Asia       1952  28.801   8425333   779.445314
1  Afghanistan  Asia       1957  30.332   9240934   820.853030
2  Afghanistan  Asia       1962  31.997   10267083  853.100710
3  Afghanistan  Asia       1967  34.020   11537966  836.197138
4  Afghanistan  Asia       1972  36.088   13079460  739.981106
1. Pearson and Spearman’s correlations
a. Compute the Pearson correlation matrix of the three quantitative variables for the years \(1952\), \(1987\) and \(2007\) using the pandas DataFrame method .corr(). Give a brief intuition of the relationships between these variables.
# To do
Description:
b. Compute Spearman’s rank correlation of the same columns in 1952, 1987 and 2007. What do you observe?
# To do
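To see how the two coefficients can disagree, here is a minimal sketch on synthetic data (the `toy` frame and its columns `x`, `y`, `log_y` are hypothetical and not part of gapminder). It builds a relationship that is perfectly monotone but not linear, then compares Pearson and Spearman, and recomputes Pearson after a log transform.

```python
import numpy as np
import pandas as pd

# Synthetic (hypothetical) data: y grows exponentially with x,
# so the relationship is monotone but far from linear.
x = np.linspace(0.0, 5.0, 30)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman is Pearson applied to ranks.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# After a log transform the relationship becomes exactly linear.
toy["log_y"] = np.log(toy["y"])
pearson_log = toy[["x", "log_y"]].corr(method="pearson").loc["x", "log_y"]

print(f"Pearson (raw y):  {pearson:.3f}")
print(f"Spearman (raw y): {spearman:.3f}")
print(f"Pearson (log y):  {pearson_log:.3f}")
```

In this toy case Spearman equals 1 because the ranks agree perfectly, while Pearson on the raw values is clearly below 1. A large gap between the two on real data is a hint to look at the scatter plot, possibly on a transformed axis.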
c. From the previous result, pick the most interesting pair of variables and draw a plot illustrating their relationship for each year, using proper axis scaling and a title.
import matplotlib.pyplot as plt
import seaborn as sns
# To do
d. Revisit your intuition about the correlation matrix for 1952 from question (a). Can you see why we observed such a poor correlation in 1952?
Now, drop the outlying country from the 1952 data. Redraw the plot and recompute the correlation between the health and economic conditions of the world in 1952. Conclude.
Remark: The Pearson correlation matrix summarizes linear relationships between pairs of quantitative variables, but it can be inaccurate and is influenced by:
outliers,
non-linearity,
small sample size,
confounding (causal) variables…
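The effect of a single outlier can be illustrated with a small sketch on synthetic data (the frames `clean` and `with_outlier` are hypothetical and unrelated to gapminder). Ten points lie exactly on a line, then one extreme point is appended.

```python
import numpy as np
import pandas as pd

# Synthetic (hypothetical) data: ten points exactly on the line y = 2x + 1.
x = np.arange(1, 11, dtype=float)
clean = pd.DataFrame({"x": x, "y": 2.0 * x + 1.0})

# Append one extreme observation: a very large x paired with a very small y.
outlier = pd.DataFrame({"x": [100.0], "y": [0.0]})
with_outlier = pd.concat([clean, outlier], ignore_index=True)

pearson_clean = clean.corr(method="pearson").loc["x", "y"]
pearson_outlier = with_outlier.corr(method="pearson").loc["x", "y"]
spearman_outlier = with_outlier.corr(method="spearman").loc["x", "y"]

print(f"Pearson without outlier: {pearson_clean:.3f}")
print(f"Pearson with outlier:    {pearson_outlier:.3f}")
print(f"Spearman with outlier:   {spearman_outlier:.3f}")
```

Here one point out of eleven is enough to turn a perfect positive Pearson correlation into a negative one. Spearman is also reduced, but it stays positive because the outlier only shifts the ranks by a bounded amount.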
2. \(\eta\)-squared correlations
a. We have seen how life expectancy and economy vary across continents in 1952 (Lab4) and 2007 (course). Compute the \(\eta\)-squared correlation between continent and lifeExp, then between continent and gdpPercap, in 1952, 1987 and 2007.
Do you find the results reasonable?
# To do
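As a reminder of the computation, here is a minimal sketch assuming the usual definition of \(\eta\)-squared as the between-group sum of squares divided by the total sum of squares. The helper name `eta_squared` and the `toy` data are hypothetical; check the formula against the one given in the course before relying on it.

```python
import pandas as pd

def eta_squared(df, group_col, value_col):
    """Correlation ratio: between-group sum of squares / total sum of squares.

    Assumes the standard definition; `group_col` is qualitative and
    `value_col` is quantitative.
    """
    values = df[value_col]
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()

    groups = df.groupby(group_col)[value_col]
    # Each group contributes (group size) * (group mean - grand mean)^2.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Hypothetical toy data: two well-separated groups.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 5.0, 6.0],
})
eta2_toy = eta_squared(toy, "group", "value")
print(f"eta-squared: {eta2_toy:.3f}")
```

For the toy data the group means are 1.5 and 5.5 around a grand mean of 3.5, so the ratio is 16/17: almost all of the variation is explained by group membership. The same helper could be applied to a one-year subset of a data frame with a qualitative and a quantitative column.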
3. Time evolution
a. Draw the evolution of the following correlations from 1952 to 2007:
Pearson and Spearman correlations between life expectancy and GDP per capita
\(\eta\)-squared coefficients of continent vs life expectancy, and continent vs GDP per capita.
import plotly.express as px
# To do
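One possible pattern for tracking a coefficient over time is to group the rows by year and compute the coefficient within each group. The sketch below uses synthetic data; the helper `corr_by_year`, the `toy` frame, and its columns are hypothetical and only illustrate the grouping step, not the expected solution.

```python
import numpy as np
import pandas as pd

# Synthetic (hypothetical) panel: 50 observations for each of three years.
rng = np.random.default_rng(0)
years = [2000, 2005, 2010]
toy = pd.DataFrame({
    "year": np.repeat(years, 50),
    "x": rng.normal(size=150),
})
toy["y"] = toy["x"] + rng.normal(scale=0.5, size=150)

def corr_by_year(df, col_a, col_b, method="pearson"):
    """Return one correlation coefficient per year as a Series indexed by year."""
    return pd.Series({
        year: sub[[col_a, col_b]].corr(method=method).loc[col_a, col_b]
        for year, sub in df.groupby("year")
    })

# One column per coefficient, one row per year.
evolution = pd.DataFrame({
    "pearson": corr_by_year(toy, "x", "y", method="pearson"),
    "spearman": corr_by_year(toy, "x", "y", method="spearman"),
})
evolution.index.name = "year"
print(evolution)
```

A table shaped like `evolution` (years as rows, coefficients as columns) is convenient for a line chart, with one line per coefficient.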
b. From what you have learned from the dataset, describe how the world changed from 1952 to 2007.