#%pip install gapminder # This is for installing the package
from gapminder import gapminder
import pandas as pd
import numpy as np
TP2 - Bivariate Analysis
Exploratory Data Analysis & Unsupervised Learning
Course: PHAUK Sokkey, PhD
TP: HAS Sothea, PhD
Objective: To equip you with the skills to analyze and interpret the relationship between two variables. You will explore these relationships as a function of time using the Gapminder dataset.
The Jupyter Notebook for this TP can be downloaded here: TP2-Bivariate-Gapminder.
1. Correlation matrix
a. Compute the correlation matrix of the three quantitative variables for the years 1952, 1987, and 2007 using the pandas DataFrame method .corr(). Give a brief description of each correlation matrix.
# To do
Description:
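As a starting point for question a., here is a minimal sketch of filtering one year and calling DataFrame.corr(). The table `toy` is a made-up stand-in with the same column names as the Gapminder data, and the helper `corr_by_year` is a hypothetical name introduced only for illustration; swap in the real `gapminder` DataFrame when working on the TP.

```python
import pandas as pd

# Toy stand-in for the Gapminder table (values are invented for illustration).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1952, 1952, 1952, 2007, 2007, 2007],
    "lifeExp": [40.0, 55.0, 68.0, 60.0, 72.0, 80.0],
    "pop": [5e6, 2e7, 5e7, 9e6, 3e7, 6e7],
    "gdpPercap": [800.0, 3000.0, 9000.0, 1500.0, 9000.0, 30000.0],
})

quant_vars = ["lifeExp", "pop", "gdpPercap"]


def corr_by_year(df, year):
    """Pearson correlation matrix of the quantitative columns for one year."""
    return df.loc[df["year"] == year, quant_vars].corr()


corr_1952 = corr_by_year(toy, 1952)
print(corr_1952.round(2))
```

The same call with 1987 or 2007 gives the other matrices; by default .corr() computes Pearson correlation.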
b. From the previous results, create a graphic for each correlation matrix using a color map, which is very helpful for large correlation matrices (you can use corr.style.background_gradient() or your favorite package).
# To do
c. From the previous result, pick the most interesting pair of variables and plot their relationship for each year, using proper axis scaling and a title. Do you see one interesting country in 1952? Guess which country it is, and investigate why this is the case.
import matplotlib.pyplot as plt
import seaborn as sns
# To do
d. Revisit your explanation of the correlation matrix for 1952 in question a. Can you now see why we observed such a (poor) correlation in 1952?
Remark: A correlation matrix summarizes the linear relationship between pairs of quantitative variables, but it can be inaccurate and is influenced by outliers, non-linearity, small sample size, confounding variables…
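To illustrate the remark about outliers, the following sketch uses synthetic data (not Gapminder) to show how a single extreme point can distort the Pearson correlation, while a rank-based (Spearman) correlation is much less affected. The data, the seed, and the outlier coordinates are arbitrary assumptions chosen for the demonstration.

```python
import numpy as np
import pandas as pd

# Synthetic, strongly linear data: y is roughly 2x plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=50)
y = 2 * x + rng.normal(0, 1, size=50)
clean = pd.DataFrame({"x": x, "y": y})

# Add one extreme point: very large x with a very low y.
outlier = pd.DataFrame({"x": [1000.0], "y": [0.0]})
with_outlier = pd.concat([clean, outlier], ignore_index=True)

# Pearson correlation before and after adding the outlier.
pearson_clean = clean.corr().loc["x", "y"]
pearson_outlier = with_outlier.corr().loc["x", "y"]

# Spearman works on ranks, so one extreme value has limited influence.
spearman_outlier = with_outlier.corr(method="spearman").loc["x", "y"]

print(f"Pearson  (clean):        {pearson_clean:.2f}")
print(f"Pearson  (with outlier): {pearson_outlier:.2f}")
print(f"Spearman (with outlier): {spearman_outlier:.2f}")
```

Comparing the Pearson and Spearman matrices, or plotting the pair on a log scale, is one way to check whether a weak correlation reflects the bulk of the data or a few unusual observations.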
2. Visualization with more information
The scatterplot is the primary graphic type for visualizing the relationship between two quantitative variables. Additionally, you can include other factors of interest using color, shape, size, facets, etc., depending on the type of those variables.
The following questions apply to the years 1952, 1987, and 2007:
a. Do you think health conditions and economies differ across continents? Visualize it in a graphic.
# To do
b. Confirm the previous claim using conditional distribution graphs. Hint: plot distribution of each continuous variable on each continent.
# To do
Description:
c. We will try to confirm this using Analysis of Variance (ANOVA).
In 2007,
- Perform a one-way ANOVA on lifeExp and continent (you can use f_oneway from the scipy.stats module).
from scipy.stats import f_oneway
# To do
- Do the same with gdpPercap and continent.
# To do
- Are these results reliable? Why?
Your response:
- Propose ideas that might solve this problem or an alternative method.
Your response:
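For the ANOVA steps in question c., the mechanics of f_oneway can be sketched as follows: it expects one array of values per group, so the data are first split by continent. The toy table and its continent labels "X", "Y", "Z" are invented for illustration; with the real data, filter the year 2007 first and use the actual continent column.

```python
import pandas as pd
from scipy.stats import f_oneway

# Toy data: three made-up groups standing in for continents.
toy = pd.DataFrame({
    "continent": ["X"] * 4 + ["Y"] * 4 + ["Z"] * 4,
    "lifeExp": [50.0, 52.0, 49.0, 51.0,
                70.0, 72.0, 69.0, 71.0,
                78.0, 80.0, 79.0, 81.0],
})

# One array of lifeExp values per continent.
groups = [g["lifeExp"].to_numpy() for _, g in toy.groupby("continent")]

# One-way ANOVA: tests whether all group means are equal.
result = f_oneway(*groups)
print(f"F = {result.statistic:.2f}, p-value = {result.pvalue:.3g}")
```

A small p-value suggests that at least one group mean differs, provided the assumptions of the test hold for the data at hand.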
d. We have previously described the relationship between population and life expectancy or GDP per capita. Now, show this by setting the size of the points according to the variable pop.
# To do
e. Now, directly visualize the relationship between population and life expectancy, then the relationship between population and GDP per capita. Explain the resulting graphs.
# To do
3. Time evolution
We have looked at the world in three frames so far (1952, 1987, and 2007). Now, we will summarize the world from 1952 to 2007 in one graph using the animation tool from plotly.express.
Seaborn is built on Matplotlib, which makes statistical plots simpler, and it integrates well with Pandas. It is great for quick, aesthetic, and informative statistical plots. Plotly, on the other hand, is built for highly interactive plots with features like zooming, panning, and hover information. It works well in web applications and dashboards. There are extensive customization options for aesthetics, and it supports 3D visualizations, complex plots, and animations. Read more here: https://plotly.com/python/.
a. Using plotly, create one scatterplot that summarizes the world using all the information: gdpPercap, lifeExp, pop, and continent. Set the option animation_frame="year", which creates a frame-by-frame animated scatterplot of the world from 1952 to 2007.
import plotly.io as pio
pio.renderers.default = 'notebook'  # render figures inline in the notebook
import plotly.express as px
# To do
b. Describe what you observed: the world from 1952 to 2007.
Description:
Further readings
- Gapminder documentation: https://www.gapminder.org/data/documentation/
- A short demonstration video is available here: Hans Rosling’s 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four.
- Graphical tools: