Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD
Objective: In this lab, you will apply the data visualization techniques you have studied to a real dataset. Each graph you create should be complete and easy to read, with a title, a legend, and any other information the audience needs. An effective graph should take viewers no longer than 15 seconds to understand.
Gapminder is an excerpt of the data available at https://gapminder.org. For each of \(142\) countries (country), the package provides values for life expectancy (lifeExp), GDP per capita (gdpPercap), and population (pop), every five years, from \(1952\) to \(2007\) (year). It was originally distributed in Jennifer Bryan’s excellent gapminder teaching package for R (ggplot, tidyverse, and more). For more information about gapminder:
# useful packages
import numpy as np
import pandas as pd
# %pip install gapminder (to install gapminder if you do not have it yet)
from gapminder import gapminder
print(f"* Number of observations: {gapminder.shape[0]}")
gapminder.sample(3)
* Number of observations: 1704
       country continent  year  lifeExp       pop     gdpPercap
1424     Spain    Europe  1992   77.570  39549438  18603.064520
269       Chad    Africa  1977   47.383   4388260   1133.984950
135    Bolivia  Americas  1967   45.032   4040665   2586.886053
A. Variable types
State the dimensions of the dataset (number of rows and columns).
Which variables are considered quantitative and which are qualitative?
# To do
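One possible way to approach both questions is sketched below. It is only a starting point under stated assumptions: the three rows are copied from the sample preview above so the snippet runs without the gapminder package (in the lab you would pass the full `gapminder` DataFrame), and the helper name `split_variable_types` is hypothetical.

```python
import pandas as pd

# Three rows copied from the sample preview above (stand-in for the full dataset).
sample = pd.DataFrame({
    "country": ["Spain", "Chad", "Bolivia"],
    "continent": ["Europe", "Africa", "Americas"],
    "year": [1992, 1977, 1967],
    "lifeExp": [77.570, 47.383, 45.032],
    "pop": [39549438, 4388260, 4040665],
    "gdpPercap": [18603.064520, 1133.984950, 2586.886053],
})


def split_variable_types(df):
    """Hypothetical helper: split column names by dtype (numeric vs non-numeric)."""
    quantitative = df.select_dtypes(include="number").columns.tolist()
    qualitative = df.select_dtypes(exclude="number").columns.tolist()
    return quantitative, qualitative


n_rows, n_cols = sample.shape  # dimensions as (rows, columns)
quantitative, qualitative = split_variable_types(sample)
print(f"{n_rows} rows x {n_cols} columns")
print("numeric dtype:", quantitative)
print("non-numeric dtype:", qualitative)
```

The dtype is only a first guess. For example, year is stored as an integer, yet you may prefer to treat it as a time index or an ordered category depending on the graph.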
B. Year 1952
B.1. Quantitative vs quantitative
Create a subdataset called data1952 that contains only the information in year 1952.
View relation between gdpPercap and lifeExp in 1952.
View relation between gdpPercap and pop in 1952.
View relation between lifeExp and pop in 1952.
Do they look different from year 2007?
Hint: You can produce the same graphs as shown in the course using the Plotly package, available here: plotly python.
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go  # for interactive graphs
import plotly.express as px  # for interactive graphs
# To do
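A minimal sketch of the two steps in B.1, subsetting by year and drawing a scatter plot, might look as follows. It uses Matplotlib rather than Plotly, and the log scale on the x-axis is my own choice, not something the lab prescribes. The helper names `subset_year` and `scatter_pair` are hypothetical, and the three rows come from the sample preview above so the snippet runs without the gapminder package.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Three rows copied from the sample preview above (stand-in for the full dataset).
sample = pd.DataFrame({
    "country": ["Spain", "Chad", "Bolivia"],
    "continent": ["Europe", "Africa", "Americas"],
    "year": [1992, 1977, 1967],
    "lifeExp": [77.570, 47.383, 45.032],
    "pop": [39549438, 4388260, 4040665],
    "gdpPercap": [18603.064520, 1133.984950, 2586.886053],
})


def subset_year(df, year):
    """Hypothetical helper: keep only the rows of one year."""
    return df[df["year"] == year].copy()


def scatter_pair(df, x, y, title):
    """Hypothetical helper: scatter plot of y against x with labelled axes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(df[x], df[y], alpha=0.7)
    ax.set_xscale("log")  # assumption: a log scale spreads out skewed values
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(title)
    return ax


# With the full data you would write: data1952 = subset_year(gapminder, 1952)
data1977 = subset_year(sample, 1977)
ax = scatter_pair(sample, "gdpPercap", "lifeExp",
                  "Life expectancy vs GDP per capita (sample rows)")
```

The same `scatter_pair` call can be reused for the other two pairs by changing the `x` and `y` arguments.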
According to the data, in 1952:
Which country was the richest?
Which country was the poorest?
Which country was the healthiest?
Which country was the unhealthiest?
# To do
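The four questions above all reduce to finding the row with the smallest or largest value of a column. The sketch below assumes that "richest" means highest gdpPercap and "healthiest" means highest lifeExp, which is my reading of the lab rather than a stated definition. The helper `extremes` is hypothetical, and the three sample rows come from different years, so they only demonstrate the call.

```python
import pandas as pd

# Three rows copied from the sample preview above (stand-in for data1952).
sample = pd.DataFrame({
    "country": ["Spain", "Chad", "Bolivia"],
    "continent": ["Europe", "Africa", "Americas"],
    "year": [1992, 1977, 1967],
    "lifeExp": [77.570, 47.383, 45.032],
    "pop": [39549438, 4388260, 4040665],
    "gdpPercap": [18603.064520, 1133.984950, 2586.886053],
})


def extremes(df, column):
    """Hypothetical helper: countries with the lowest and highest value of `column`."""
    lowest = df.loc[df[column].idxmin(), "country"]
    highest = df.loc[df[column].idxmax(), "country"]
    return lowest, highest


poorest, richest = extremes(sample, "gdpPercap")
unhealthiest, healthiest = extremes(sample, "lifeExp")
print(f"gdpPercap: lowest={poorest}, highest={richest}")
print(f"lifeExp:   lowest={unhealthiest}, highest={healthiest}")
```

In the lab, the same call on a one-year subset (for example the rows of 1952, then 1977) answers the questions for that year.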
Repeat the previous question for the year 1977. Before computing, can you guess which country had the lowest life expectancy around that year?
# To do
B.2. Quantitative vs qualitative
We observed differences in health conditions across continents in 2007. Was this also the case in 1952? Please visualize your findings.
What about the economy? Visualize and explain your results.
# To do
B.3. Qualitative vs qualitative
Qualitative variables are simpler than quantitative ones, as we primarily focus on the proportion or frequency of each category. In our dataset, the existing qualitative variables are not suitable for analysis because they are repeated each year. Therefore, we will create a new qualitative variable from lifeExp.
Add to data1952 a column lifeExpQual containing three categories, [“unhealthy”, “moderate”, “healthy”], obtained by splitting lifeExp into 3 classes.
Hint: you may cheat using slide 15. The function pd.qcut is helpful for such a task.
# To do
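As a hedged illustration of the `pd.qcut` hint, the sketch below splits lifeExp into three quantile-based classes, so each class holds roughly the same number of rows. The helper name `add_life_exp_classes` is hypothetical, and the three rows are copied from the sample preview above; in the lab you would pass data1952.

```python
import pandas as pd

# Three rows copied from the sample preview above (stand-in for data1952).
sample = pd.DataFrame({
    "country": ["Spain", "Chad", "Bolivia"],
    "continent": ["Europe", "Africa", "Americas"],
    "year": [1992, 1977, 1967],
    "lifeExp": [77.570, 47.383, 45.032],
    "pop": [39549438, 4388260, 4040665],
    "gdpPercap": [18603.064520, 1133.984950, 2586.886053],
})


def add_life_exp_classes(df, labels=("unhealthy", "moderate", "healthy")):
    """Hypothetical helper: add a lifeExpQual column with quantile-based classes."""
    out = df.copy()
    # q = number of classes; labels are assigned from lowest to highest lifeExp.
    out["lifeExpQual"] = pd.qcut(out["lifeExp"], q=len(labels), labels=list(labels))
    return out


classified = add_life_exp_classes(sample)
print(classified[["country", "lifeExp", "lifeExpQual"]])
```

If equal-width classes are preferred over equal-count classes, `pd.cut` could be swapped in; which of the two matches the slide is not something this sketch can confirm.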
Graphically represent the connection between lifeExpQual and continent in year 1952.
Describe what you see.
# To do
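Before drawing anything, it can help to tabulate the two qualitative variables against each other. The sketch below builds a contingency table with `pd.crosstab`, assuming lifeExpQual was created with `pd.qcut` as hinted earlier. The three rows are copied from the sample preview above and only show the mechanics; with data1952 the table would have several countries per cell.

```python
import pandas as pd

# Three rows copied from the sample preview above (stand-in for data1952).
sample = pd.DataFrame({
    "country": ["Spain", "Chad", "Bolivia"],
    "continent": ["Europe", "Africa", "Americas"],
    "lifeExp": [77.570, 47.383, 45.032],
})
sample["lifeExpQual"] = pd.qcut(
    sample["lifeExp"], q=3, labels=["unhealthy", "moderate", "healthy"]
)

# Counts of countries per (continent, class) pair.
counts = pd.crosstab(sample["continent"], sample["lifeExpQual"])

# Row proportions: share of each class within a continent.
shares = pd.crosstab(sample["continent"], sample["lifeExpQual"], normalize="index")

print(counts)
print(shares)
```

A table like `counts` can then feed a stacked or grouped bar chart, for instance through `counts.plot(kind="bar", stacked=True)`, which is one reasonable way to represent the connection graphically.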
C. Time evolution
Gapminder captures global changes from \(1952\) to \(2007\). It is more insightful to examine how these variables evolve over time.
C.1. Evolution of average lifeExp of the \(5\) continents
Create a line plot of lifeExp for the five continents from \(1952\) to \(2007\) using sns.lineplot. What observations can you make from the plot?
# To do
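When several rows share the same x value, `sns.lineplot` by default draws their mean (with a confidence band), so a call such as `sns.lineplot(data=gapminder, x="year", y="lifeExp", hue="continent")` shows one averaged line per continent. The sketch below reproduces that averaging step with pandas so the plotted numbers can be inspected. The values in `toy` are made up for illustration and are not Gapminder figures; the helper name `mean_by_continent` is hypothetical.

```python
import pandas as pd

# Made-up values for illustration only (NOT real Gapminder data).
toy = pd.DataFrame({
    "continent": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
    "country": ["a", "b", "a", "b", "c", "d", "c", "d"],
    "year": [1952, 1952, 1957, 1957, 1952, 1952, 1957, 1957],
    "lifeExp": [40.0, 50.0, 42.0, 54.0, 60.0, 70.0, 62.0, 72.0],
})


def mean_by_continent(df, value):
    """Hypothetical helper: mean of `value` per year (rows) and continent (columns)."""
    return df.groupby(["continent", "year"])[value].mean().unstack("continent")


means = mean_by_continent(toy, "lifeExp")
print(means)
```

Each column of `means` corresponds to one line of the plot, which makes it easy to check a trend you think you see in the graph.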
C.2. Other variables
Repeat the same process with pop, and then with gdpPercap. Provide your comments for each case.
# To do
C.3. Cambodia
Plot the evolution of the three quantitative variables for Cambodia. What do you observe?