# %pip install gapminder  # Uncomment this line to install the package
from gapminder import gapminder
import pandas as pd
import numpy as np
TP2 - Bivariate Analysis
M-DAS: Exploratory Data Analysis & Unsupervised Learning
Lecturer: Dr. HAS Sothea
Objective: To equip you all with the skills to analyze and interpret the relationship between two variables. You will explore these relationships as a function of time using the Gapminder dataset.
The Jupyter Notebook for this TP can be downloaded here: TP2-Bivariate.
1. Correlation matrix
a. Compute the Pearson correlation matrices of the three quantitative variables for the years 1952, 1987 and 2007 using the DataFrame method corr(), e.g. data.corr().
- What’s your intuition for the low Pearson correlation between gdpPercap and lifeExp in 1952?
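As a possible starting point (a sketch, not the expected solution), the per-year matrices can be obtained by filtering on `year` and calling `DataFrame.corr()` on the quantitative columns. The helper name `corr_by_year` and the small `toy` frame are illustrative assumptions: `toy` is synthetic and only mimics the column names of the dataset, so in the notebook you would pass the `gapminder` DataFrame imported above.

```python
import numpy as np
import pandas as pd

# Quantitative columns of the gapminder DataFrame
QUANT_COLS = ["lifeExp", "pop", "gdpPercap"]

def corr_by_year(data, years, method="pearson"):
    """Return a dict {year: correlation matrix of the quantitative columns}."""
    return {
        year: data.loc[data["year"] == year, QUANT_COLS].corr(method=method)
        for year in years
    }

# Synthetic stand-in with gapminder-like column names (NOT real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "year": np.repeat([1952, 1987, 2007], 30),
    "lifeExp": rng.normal(60, 10, 90),
    "pop": rng.lognormal(15, 1, 90),
    "gdpPercap": rng.lognormal(8, 1, 90),
})

matrices = corr_by_year(toy, [1952, 1987, 2007])
print(matrices[1952].round(2))
```

Passing `method="spearman"` to the same helper returns rank-based matrices instead.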
Description:
# To do
b. Compute the Spearman correlation matrices of the three quantitative columns in 1952, 1987 and 2007 using data.corr(method='spearman').
- Now, is there anything strange? Give a brief explanation.
# To do
c. From the previous result, pick the most interesting pair of variables and visualize their relationship for each year using proper axis scaling and a title.
- Do you see one interesting country in 1952? Guess which country it is, then investigate why this is the case.
import matplotlib.pyplot as plt
import seaborn as sns
# To do
d. Revisit your explanation of the correlation matrix for 1952 in question a. Can you now see why we observed such a poor correlation in 1952?
Remark: The Pearson correlation matrix summarizes the linear relationship between pairs of quantitative variables, but it can be inaccurate and is influenced by:
- Outliers
- Non-linearity
- Small sample size
- Confounding variables…
In practice, the Spearman correlation should also be computed to detect possible non-linear but monotonic relationships between columns.
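The difference between the two coefficients can be illustrated on synthetic data. The sketch below assumes a made-up relationship `y = exp(x)`, which is strictly increasing but far from linear. Pearson measures linear association, while Spearman is the Pearson correlation of the ranks, so it depends only on the ordering of the values.

```python
import numpy as np
import pandas as pd

# Synthetic example (not gapminder): y increases with x, but not linearly
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = np.exp(x)

df = pd.DataFrame({"x": x, "y": y})
pearson = df["x"].corr(df["y"], method="pearson")
spearman = df["x"].corr(df["y"], method="spearman")

print(f"Pearson:  {pearson:.3f}")   # noticeably below 1: the relation is not linear
print(f"Spearman: {spearman:.3f}")  # equal to 1 up to rounding: the relation is monotonic
```

A large gap between the two coefficients on the same pair of columns is a hint to look at the scatterplot, possibly on a log scale, before interpreting either number.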
2. Visualization with more information
The scatterplot is the primary graphic for visualizing the relationship between two quantitative variables. Additionally, you can include other factors of interest using color, shape, size, facets, etc., depending on the type of those variables.
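A minimal sketch of these encodings with seaborn is given below, assuming a frame with gapminder-like column names. The `toy` data is synthetic and serves only to make the example runnable. Color (`hue`) carries a qualitative variable, point size carries a quantitative one, and a log scale on the x-axis is one common choice for a strongly skewed variable.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in with gapminder-like column names (NOT real data)
rng = np.random.default_rng(2)
n = 40
toy = pd.DataFrame({
    "gdpPercap": rng.lognormal(8, 1, n),
    "lifeExp": rng.normal(60, 10, n),
    "pop": rng.lognormal(15, 1, n),
    "continent": rng.choice(["Africa", "Asia", "Europe"], size=n),
})

fig, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(
    data=toy, x="gdpPercap", y="lifeExp",
    hue="continent",   # color encodes a qualitative variable
    size="pop",        # point size encodes a quantitative variable
    sizes=(20, 400),   # smallest and largest marker areas
    alpha=0.6, ax=ax,
)
ax.set_xscale("log")   # spreads out a right-skewed variable
ax.set_title("Life expectancy vs GDP per capita (synthetic data)")
```

For one panel per year, `sns.relplot(..., col="year")` accepts the same `hue` and `size` arguments.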
The following questions apply to the years 1952, 1987, and 2007:
a. Do you think health conditions and economies differ across continents within these three different years? Provide some indicators.
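One way to obtain such indicators, sketched under the assumption that medians per continent are a reasonable summary, is a `groupby` on `year` and `continent`. The function name `continent_summary` and the synthetic `toy` frame are illustrative; in the notebook the `gapminder` DataFrame would be passed instead.

```python
import numpy as np
import pandas as pd

def continent_summary(data, years=(1952, 1987, 2007)):
    """Median lifeExp and gdpPercap for each (year, continent) pair."""
    subset = data[data["year"].isin(years)]
    return subset.groupby(["year", "continent"])[["lifeExp", "gdpPercap"]].median()

# Synthetic stand-in with gapminder-like column names (NOT real data)
rng = np.random.default_rng(1)
n = 60
toy = pd.DataFrame({
    "year": np.tile([1952, 1987, 2007], n // 3),
    "continent": rng.choice(["Africa", "Asia", "Europe"], size=n),
    "lifeExp": rng.normal(60, 10, n),
    "gdpPercap": rng.lognormal(8, 1, n),
})

summary = continent_summary(toy)
print(summary.round(1))
```

The median is used here rather than the mean because it is less sensitive to a few extreme values; replacing `.median()` with `.describe()` gives a fuller picture.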
# To do
b. Confirm the previous claim using conditional distribution visualization and describe the graphs briefly. Hint: plot the distribution of each continuous variable on each continent.
# To do
Description:
c. Now, show the connection of population to GDP per capita and life expectancy by setting the size of the points according to the variable pop. What do you observe?
# To do
3. Time evolution
We have looked at the world in three frames so far (1952, 1987 and 2007). Now, we will summarize the world from 1952 to 2007 in one graph using the animation tools of plotly.express.
Seaborn is built on Matplotlib, which makes statistical plots simpler to produce, and it integrates well with Pandas. It is great for quick, aesthetic and informative statistical plots. Plotly, on the other hand, is built for highly interactive plots with features like zooming, panning, and hover information. It works well in web applications and dashboards, offers extensive customization options for aesthetics, and supports 3D visualizations, complex plots and animations. Read more here: https://plotly.com/python/.
a. Using plotly, create one scatterplot that summarizes the world using all the information: gdpPercap, lifeExp, pop and continent. Set the option animation_frame="year", which creates a frame-by-frame animated scatterplot of the world from 1952 to 2007.
import plotly.io as pio
pio.renderers.default = 'notebook'
import plotly.express as px
# To do
b. Describe what you observe: the world from 1952 to 2007.
Description:
Further readings
- Gapminder documentation: https://www.gapminder.org/data/documentation/
- A short demonstration video is available here: Hans Rosling’s 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four.
- Graphical tools: