TP2 - Bivariate Analysis

Exploratory Data Analysis & Unsuperivsed Learning
Course: Dr. HAS Sothea
TP: Dr. PHAUK Sokkey
       Ms. UANN Sreyvi


Objective: To equip you all with the skills to analyze and interpret the relationship between two variables. You will explore these relationships as a function of time using the Gapminder dataset.


The Jupyter Notebook for this TP can be downloaded here: TP2-Bivariate-Gapminder.


#%pip install gapminder           # This is for installing the package
from gapminder import gapminder
import pandas as pd
import numpy as np
gapminder.head()

1. Pearson and Spearman’s correlations

a. Compute Pearson correlation matrix of the three quantitative variables on year \(1952\), \(1987\) and then \(2007\) using pd.corr(). Give a brief intuition of the relationship between these variables.

# To do

Description:

b. Compute Spearman’s Rank Correlation of the previous columns in 1952, 1987 and 2007. What do you observe?

# To do

c. From the previous result, pick the most interesting pair of variables and plot a graph illustrating their relationship on each year, using proper axis scaling and title.

# To do

d. Revisit your intuition of the correlation matrix in year 1952 from question (a), can you see why we observed such a (poor) correlation in 1952?!

  • Now, drop the weird country of year 1952. Revisualize and recompute the correaltion between health and economy condition of the world in 1952. Conclude.

Remark: Pearson correlation matrix can summarize linear relationship between pairs of quantitative variables but it might be inacurate and influenced by

  • outliers,
  • non-linearity,
  • small sample size,
  • confounding (causal) variables…

2. Visualization with more information

Scatterplot is the primary graphic type for visualizing the relationship between two quantitative variables. Additionally, you can include other factors of interest using color, shape, size, facets, etc., depending on the type of those variables.


The following questions apply to the years 1952, 1987, and 2007:

a. Do you think health conditions and economies differ across continents? Compute some indicators to illustrate this.

# To do

b. Visualize the previous claim using conditional distribution. Hint: plot distribution of each continuous variable on each continent.

# To do

Description:

c. We will try to confirm this using Analysis of Variance (ANOVA).

In 2007,

  • Perform one way ANOVA on lifeExp and continent (you can use f_oneway from scipy.stats module).
from scipy.stats import f_oneway
# To do
  • Do the same with gdpPercap and continent.
# To do
  • Are these results reliable? Why?

Your response:

  • Propose ideas that might solve this problem or an alternative method.

Your response:

d. We have previously described the relationship between population and life expectancy or GDP per capita. Now, illustrate the effect of population defining size of points according to variable pop.

# To do

e. Now, directly visualize the relation between pop and lifeExp, then the relation between pop and gdpPercap. Explain the resulting graphs.

# To do

3. Quantitative vs Qualitative

a. We have seen how life expectancy and economy vary across continents in 2007 (Slide 25 of the course). Compute \(\eta\)-squared correlation between continent and lifeExp, then continent with gdpPercap in 1952, 1987 and 2007.

  • Do you find the results reasonable?
# To do

4. Time evolution

a. Draw the evolution of the following correaltions from 1952 to 2007:

  • Person and Spearman corerlation between life expectancy and GDP per capita
  • \(\eta\)-squared coefficients of continent vs life expectancy, and continents vs GDP per capita.

Hint: the following figure is expected.

import plotly.io as pio
pio.renderers.default = 'notebook'
import plotly.express as px

# To do

b. Describe what you observed: the world from 1952 to 2007.

Description:

Further readings