TP2 - Bivariate Analysis

Exploratory Data Analysis & Unsupervised Learning
Course: PHAUK Sokkey, PhD
TP: HAS Sothea, PhD

Objective: To equip you with the skills to analyze and interpret the relationship between two variables. You will also explore how these relationships change over time using the Gapminder dataset.

The Jupyter Notebook for this TP can be downloaded here: TP2-Bivariate-Gapminder.

# %pip install gapminder           # Uncomment this line to install the package
from gapminder import gapminder
import pandas as pd
import numpy as np

1. Correlation matrix

a. Compute the correlation matrix of the three quantitative variables (lifeExp, pop and gdpPercap) for the years \(1952\), \(1987\) and \(2007\) using the pandas DataFrame method .corr(). Give a brief description of each correlation matrix.
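As a reminder of the mechanics, here is a minimal sketch on a toy table. The column names a, b, c and all values are hypothetical and are not taken from Gapminder; the point is only the pattern "filter one year, keep the quantitative columns, call .corr()".

```python
import pandas as pd

# Toy data with hypothetical values (NOT taken from Gapminder).
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000, 2010, 2010, 2010, 2010],
    "a": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.1, 5.9, 8.2, 8.0, 6.1, 3.9, 2.0],
    "c": [5.0, 3.0, 6.0, 2.0, 1.0, 4.0, 2.0, 5.0],
})

# One correlation matrix per year: filter the rows of that year,
# keep only the quantitative columns, then call DataFrame.corr().
corr_by_year = {
    year: toy.loc[toy["year"] == year, ["a", "b", "c"]].corr()
    for year in (2000, 2010)
}

print(corr_by_year[2000].round(2))
print(corr_by_year[2010].round(2))
```

By default, .corr() computes the Pearson correlation; other methods can be requested with the method argument, for example method="spearman".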

# To do

Description:

b. Using the previous results, visualize each correlation matrix with a color map, which is especially helpful for large correlation matrices (you can use corr.style.background_gradient() or your favorite package).
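One possible way to draw such a color map is a seaborn heatmap. The sketch below uses a toy table with hypothetical columns and values (not Gapminder), so it only illustrates the call, not the answer.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data with hypothetical values (NOT taken from Gapminder).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 3.0, 6.0, 2.0, 4.0],
})
corr = toy.corr()

fig, ax = plt.subplots(figsize=(4, 3))
# Fixing the color scale to [-1, 1] keeps colors comparable between matrices.
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
ax.set_title("Correlation matrix (toy data)")
```

In a notebook, corr.style.background_gradient(cmap="coolwarm") gives a similar colored table without creating a figure.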

# To do

c. From the previous result, pick the most interesting pair of variables and plot their relationship for each year, using proper axis scaling and a title. Do you see one interesting country in 1952? Guess which country it is, then investigate why this is the case.

import matplotlib.pyplot as plt
import seaborn as sns

# To do

d. Revisit your description of the correlation matrix for 1952 in question a. Can you now see why we observed such a (poor) correlation in 1952?

Remark: A correlation matrix summarizes the linear relationship between pairs of quantitative variables, but it can be inaccurate, since it is influenced by outliers, non-linearity, small sample size, confounding variables…
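The effect of outliers mentioned in the remark above can be seen on a small synthetic example. The numbers below are made up for illustration: ten perfectly aligned points, then the same points plus one extreme observation. The comparison with the rank-based Spearman correlation is an addition of this sketch, not something the TP requires.

```python
import pandas as pd

# Ten perfectly aligned points (synthetic data).
x = pd.Series(range(1, 11), dtype=float)
y = x.copy()
r_clean = x.corr(y)  # Pearson correlation of a perfect line

# Add a single extreme point: very large x, very small y.
x_out = pd.concat([x, pd.Series([100.0])], ignore_index=True)
y_out = pd.concat([y, pd.Series([0.0])], ignore_index=True)

r_pearson = x_out.corr(y_out)                      # becomes negative (about -0.41)
r_spearman = x_out.corr(y_out, method="spearman")  # rank-based, stays positive (0.5)

print(f"Pearson without outlier: {r_clean:.2f}")
print(f"Pearson with outlier:    {r_pearson:.2f}")
print(f"Spearman with outlier:   {r_spearman:.2f}")
```

One point out of eleven is enough to flip the sign of the Pearson coefficient here, which is why a scatterplot should always accompany a correlation matrix.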

2. Visualization with more information

The scatterplot is the primary graphic type for visualizing the relationship between two quantitative variables. Additionally, you can encode other factors of interest using color, shape, size, facets, etc., depending on the type of those variables.
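As a rough illustration of these encodings, the sketch below maps a categorical column to color and a numeric column to point size with seaborn. The table, its column names (income, health, group, weight) and its values are hypothetical, not Gapminder data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data with hypothetical values (NOT taken from Gapminder).
toy = pd.DataFrame({
    "income": [500, 1500, 4000, 12000, 30000, 800],
    "health": [45, 55, 62, 71, 79, 50],
    "group": ["A", "A", "B", "B", "C", "C"],
    "weight": [5, 50, 20, 300, 60, 10],
})

fig, ax = plt.subplots(figsize=(5, 4))
sns.scatterplot(
    data=toy, x="income", y="health",
    hue="group",      # categorical variable -> color
    size="weight",    # numeric variable -> point size
    sizes=(20, 300),  # smallest and largest marker area
    ax=ax,
)
ax.set_xscale("log")  # a log axis spreads out strongly right-skewed values
ax.set_title("Two quantitative variables with color and size (toy data)")
```

Facets work similarly with sns.relplot(..., col="group"), which draws one panel per category.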

The following questions apply to the years 1952, 1987, and 2007:

a. Do you think health conditions and economies differ across continents? Visualize it in a graphic.

# To do

b. Confirm the previous claim using conditional distribution graphs. Hint: plot the distribution of each continuous variable for each continent.

# To do

Description:

c. We will try to confirm this using Analysis of Variance (ANOVA).

In 2007,

Perform a one-way ANOVA on lifeExp and continent (you can use f_oneway from the scipy.stats module).
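f_oneway expects one array of values per group, so the usual pattern is to split the quantitative column by the categorical one and unpack the resulting list. Here is a minimal sketch on made-up data (the group and value columns are hypothetical):

```python
import pandas as pd
from scipy.stats import f_oneway

# Toy data with hypothetical values (NOT taken from Gapminder).
toy = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "value": [50, 52, 49, 51, 53,
              60, 62, 61, 59, 63,
              70, 72, 69, 71, 73],
})

# One array of observations per group, in the order given by groupby.
samples = [g["value"].to_numpy() for _, g in toy.groupby("group")]

# Null hypothesis: all groups share the same mean.
result = f_oneway(*samples)
print(f"F = {result.statistic:.2f}, p-value = {result.pvalue:.3g}")
```

A small p-value indicates that at least one group mean differs from the others; it does not say which one.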

from scipy.stats import f_oneway
# To do

Do the same with gdpPercap and continent.

# To do

Are these results reliable? Why?

Your response:

Propose ideas that might solve this problem, or suggest an alternative method.

Your response:

d. We have previously described the relationship between population and life expectancy or GDP per capita. Now, show this by mapping the size of the points to the variable pop.

# To do

e. Now, directly visualize the relationship between population and life expectancy, then the relationship between population and GDP per capita. Explain the resulting graphs.

# To do

3. Time evolution

We have looked at the world in three frames so far (1952, 1987 and 2007). Now, we will summarize the world from 1952 to 2007 in a single graph using the animation tool from plotly.express.

Seaborn is built on Matplotlib, which makes statistical plots simpler to produce, and it integrates well with Pandas. It is great for quick, aesthetic and informative statistical plots.
Plotly, on the other hand, is built for highly interactive plots, with features like zooming, panning, and hover information. It works well in web applications and dashboards. It offers extensive customization options for aesthetics, and it supports 3D visualizations, complex plots and animations. Read more here: https://plotly.com/python/.

a. Using plotly, create one scatterplot that summarizes the world using all the information: gdpPercap, lifeExp, pop and continent. Set the option animation_frame="year", which creates a frame-by-frame animated scatterplot of the world from 1952 to 2007.

import plotly.io as pio
pio.renderers.default = 'notebook'
import plotly.express as px

# To do

b. Describe what you observed: the world from 1952 to 2007.

Description:
