TP2 - Bivariate Analysis

M-DAS: Exploratory Data Analysis & Unsupervised Learning
Lecturer: Dr. HAS Sothea


Objective: To equip you all with the skills to analyze and interpret the relationship between two variables. You will explore these relationships as a function of time using the Gapminder dataset.

The Jupyter Notebook for this TP can be downloaded here: TP2-Bivariate.


#%pip install gapminder           # This is for installing the package
from gapminder import gapminder
import pandas as pd
import numpy as np

1. Correlation matrix

a. Compute the Pearson correlation matrices of the three quantitative variables (lifeExp, pop and gdpPercap) for the years \(1952\), \(1987\) and \(2007\) using the DataFrame method data.corr().

  • What’s your intuition for the low Pearson correlation between gdpPercap and lifeExp in 1952?

Description:

# To do

b. Compute the Spearman correlation matrices of the three quantitative columns in 1952, 1987 and 2007 using data.corr(method='spearman').

  • Compared with the Pearson matrices, do you notice anything strange? Give a brief explanation.
# To do

c. From the previous result, pick the most interesting pair of variables and visualize their relationship for each year using proper axis scaling and title.

  • Do you see one interesting country in 1952? Guess which country it is, then investigate why this is the case.
import matplotlib.pyplot as plt
import seaborn as sns

# To do

d. Revisit your explanation of the correlation matrix for 1952 in question a. Can you now see why we observed such a poor correlation in 1952?

Remark: The Pearson correlation matrix summarizes the linear relationship between pairs of quantitative variables, but it can be inaccurate because it is influenced by:

  • Outliers
  • Non-linearity
  • Small sample size
  • Confounding variables…

In practice, the Spearman correlation should also be computed. It is based on ranks, so it can detect monotonic but non-linear relationships between columns.
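
The remark above can be illustrated on synthetic data. The following is a minimal sketch, assuming a made-up DataFrame named toy (not the Gapminder data): y is an exponential, hence monotone but non-linear, function of x, and one artificial outlier is appended afterwards. All names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, made-up data (assumption, not Gapminder):
# y grows exponentially with x, so the relationship is monotone but not linear.
x = np.linspace(0, 5, 50)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman is Pearson computed on ranks.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Append one artificial outlier: smallest x, by far the largest y.
outlier = pd.DataFrame({"x": [0.0], "y": [1e4]})
toy_out = pd.concat([toy, outlier], ignore_index=True)
pearson_out = toy_out.corr(method="pearson").loc["x", "y"]
spearman_out = toy_out.corr(method="spearman").loc["x", "y"]

print(f"without outlier: Pearson={pearson:.2f}, Spearman={spearman:.2f}")
print(f"with outlier:    Pearson={pearson_out:.2f}, Spearman={spearman_out:.2f}")
```

Before the outlier is added, the ranks of x and y coincide, so the Spearman coefficient is 1 while the Pearson coefficient is lower. A single extreme point then changes the Pearson coefficient much more than the Spearman coefficient.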

2. Visualization with more information

The scatterplot is the primary graphic type for visualizing the relationship between two quantitative variables. Additionally, you can include other factors of interest using color, shape, size, facets, etc., depending on the type of those variables.
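
As a minimal sketch of encoding extra variables in a scatterplot, the example below maps a categorical column to color and a numeric column to point size with seaborn. It assumes a made-up DataFrame named toy; the column names income, health, weight and group are hypothetical and are not the Gapminder columns.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data (assumption, not Gapminder); all column names are made up.
toy = pd.DataFrame({
    "income": [500, 1200, 3000, 8000, 20000, 35000],
    "health": [38, 45, 55, 66, 74, 80],
    "weight": [5, 50, 20, 300, 60, 10],
    "group":  ["A", "A", "B", "B", "C", "C"],
})

fig, ax = plt.subplots(figsize=(6, 4))
# hue -> color by category, size -> point area by a numeric column.
sns.scatterplot(data=toy, x="income", y="health",
                hue="group", size="weight", sizes=(30, 300), ax=ax)
ax.set_xscale("log")  # a log axis spreads out values spanning several orders of magnitude
ax.set_title("Toy example: health vs. income")
plt.show()
```

Whether a log scale is appropriate depends on the data; it is used here only because the made-up income values span several orders of magnitude.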


The following questions apply to the years 1952, 1987, and 2007:

a. Do you think health conditions and economies differ across continents within these three different years? Provide some indicators.
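
One possible way to obtain such indicators is a group-wise summary with pandas. The sketch below assumes a made-up DataFrame named toy with hypothetical columns period, group and score; it only shows the mechanics and is not the Gapminder analysis itself.

```python
import pandas as pd

# Hypothetical toy data (assumption, not Gapminder): two groups in two periods.
toy = pd.DataFrame({
    "period": [1, 1, 1, 1, 2, 2, 2, 2],
    "group":  ["A", "A", "B", "B", "A", "A", "B", "B"],
    "score":  [40.0, 44.0, 60.0, 66.0, 50.0, 54.0, 70.0, 72.0],
})

# Median, minimum and maximum of "score" for each (period, group) pair.
summary = toy.groupby(["period", "group"])["score"].agg(["median", "min", "max"])
print(summary)
```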

# To do

b. Confirm the previous claim by visualizing conditional distributions and describe the graphs briefly. Hint: plot the distribution of each continuous variable on each continent.

# To do

Description:

c. Now, show the connection of population to GDP per capita and life expectancy by setting the size of the points according to the variable pop. What do you observe?

# To do

3. Time evolution

We have looked at the world in three frames so far (1952, 1987 and 2007). Now, we will summarize the world from 1952 to 2007 in one graph using the animation tools of plotly.express.

  • Seaborn is built on Matplotlib; it simplifies statistical plotting and integrates well with Pandas. It is great for quick, aesthetic and informative statistical plots.
  • Plotly, on the other hand, is built for highly interactive plots with features like zooming, panning, and hover information. It works well in web applications and dashboards. It offers extensive customization options for aesthetics, and it supports 3D visualizations, complex plots and animations. Read more here: https://plotly.com/python/.

a. Using plotly, create one scatterplot that summarizes the world using all the information (gdpPercap, lifeExp, pop, continent) and set the option animation_frame="year", which creates a frame-by-frame animated scatterplot of the world from 1952 to 2007.

import plotly.io as pio
pio.renderers.default = 'notebook'
import plotly.express as px

# To do

b. Describe what you observe about the world from 1952 to 2007.

Description:

Further readings