Lab4: Data Visualization

Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD


Objective: In this lab, you will apply the data visualization techniques you have studied to a real dataset. Each graph you create should be complete and easy to understand, including a title, a legend, and any other information that helps the audience interpret it. An effective graph shouldn’t take viewers longer than 15 seconds to understand.


1. Gapminder dataset

Gapminder is an excerpt of the data available at https://gapminder.org. For each of \(142\) countries (country), the package provides values for life expectancy (lifeExp), GDP per capita (gdpPercap), and population (pop) every five years from \(1952\) to \(2007\) (year). It was originally used in Jennifer Bryan’s excellent gapminder teaching package for R (ggplot, tidyverse, and more). For more information about gapminder:

# useful packages
import numpy as np
import pandas as pd

# %pip install gapminder  (run this once if gapminder is not installed yet)
from gapminder import gapminder
print(f"* Number of observations: {gapminder.shape[0]}")
gapminder.sample(3)
* Number of observations: 1704
      country continent  year  lifeExp       pop     gdpPercap
1424    Spain    Europe  1992   77.570  39549438  18603.064520
269      Chad    Africa  1977   47.383   4388260   1133.984950
135   Bolivia  Americas  1967   45.032   4040665   2586.886053

A. Variable types

  • State the dimensions of the dataset (number of rows and columns).
  • Which variables are considered quantitative and which are qualitative?
# To do
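As a possible starting point for these two questions, the sketch below inspects the shape and the column types of a small frame. To stay self-contained, it rebuilds only the three sample rows printed above instead of importing the package; the name sample_rows is an assumption made for illustration. On the lab data, the same calls would be applied to gapminder.

```python
import pandas as pd

# Three rows copied from the sample output above (not the full dataset).
sample_rows = pd.DataFrame(
    {
        "country": ["Spain", "Chad", "Bolivia"],
        "continent": ["Europe", "Africa", "Americas"],
        "year": [1992, 1977, 1967],
        "lifeExp": [77.570, 47.383, 45.032],
        "pop": [39549438, 4388260, 4040665],
        "gdpPercap": [18603.064520, 1133.984950, 2586.886053],
    }
)

n_rows, n_cols = sample_rows.shape  # (rows, columns)
print(f"{n_rows} rows x {n_cols} columns")

# Columns stored as text are candidates for qualitative variables;
# numeric columns are candidates for quantitative ones.
text_columns = sample_rows.select_dtypes(exclude="number").columns.tolist()
numeric_columns = sample_rows.select_dtypes(include="number").columns.tolist()
print("Text columns:", text_columns)
print("Numeric columns:", numeric_columns)
```

The storage type is only a first guess. For example, year is stored as an integer, and whether to treat it as quantitative or as an ordered category is a judgment the question asks you to make.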

B. Year 1952

B.1. Quantitative vs quantitative

  • Create a subset called data1952 that contains only the observations from the year 1952.
  • Visualize the relationship between gdpPercap and lifeExp in 1952.
  • Visualize the relationship between gdpPercap and pop in 1952.
  • Visualize the relationship between lifeExp and pop in 1952.
  • Do these relationships look different from those in 2007?

Hint: You can reproduce the graphs shown in the course using the Plotly package, available here: plotly python.

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go  # for interactive graphs
import plotly.express as px  # for interactive graphs
# To do
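The filtering and plotting steps asked for above can be sketched as follows. This is only one possible approach, shown with Matplotlib rather than Plotly. The frame toy and its values are invented for illustration and are not real Gapminder figures; on the lab data, gapminder would take its place.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy values invented for illustration; NOT real Gapminder figures.
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C", "A", "B", "C"],
        "year": [1952, 1952, 1952, 2007, 2007, 2007],
        "lifeExp": [40.0, 55.0, 68.0, 52.0, 70.0, 79.0],
        "gdpPercap": [500.0, 3000.0, 12000.0, 900.0, 9000.0, 35000.0],
    }
)

# A boolean mask keeps only the 1952 rows; .copy() makes the subset an
# independent frame, so adding columns to it later raises no warning.
toy1952 = toy[toy["year"] == 1952].copy()

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(toy1952["gdpPercap"], toy1952["lifeExp"])
# Income variables are often right-skewed; a log axis spreads the points out.
ax.set_xscale("log")
ax.set_xlabel("GDP per capita (log scale)")
ax.set_ylabel("Life expectancy (years)")
ax.set_title("Life expectancy vs GDP per capita, 1952 (toy data)")
```

In a notebook the figure is displayed when the cell finishes. Whether a log scale helps on the real 1952 data is something to check by plotting both versions.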
  • According to the data, in 1952:
    • Which country was the richest?
    • Which country was the poorest?
    • Which country was the healthiest?
    • Which country was the unhealthiest?
# To do
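One way to answer questions of this kind is to locate the rows holding the extreme values. The sketch below assumes that "richest" means the highest gdpPercap and "healthiest" means the highest lifeExp; that reading is an interpretation, not something the dataset defines. The toy values are invented and are not real Gapminder figures.

```python
import pandas as pd

# Toy values invented for illustration; NOT real Gapminder figures.
toy1952 = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "lifeExp": [40.0, 55.0, 68.0],
        "gdpPercap": [500.0, 12000.0, 3000.0],
    }
)

# idxmax/idxmin return the index label of the extreme value;
# .loc then looks up the country on that row.
richest = toy1952.loc[toy1952["gdpPercap"].idxmax(), "country"]
poorest = toy1952.loc[toy1952["gdpPercap"].idxmin(), "country"]
longest_lived = toy1952.loc[toy1952["lifeExp"].idxmax(), "country"]
shortest_lived = toy1952.loc[toy1952["lifeExp"].idxmin(), "country"]

print(richest, poorest, longest_lived, shortest_lived)
```

If several countries tie for an extreme, idxmax and idxmin report only the first one, so sorting with sort_values and inspecting the top rows is a useful cross-check.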
  • Repeat the previous question for the year 1977. Before computing, can you guess which country had the lowest life expectancy around that year?
# To do

B.2. Quantitative vs qualitative

  • We observed differences in health conditions across continents in 2007. Was this also the case in 1952? Please visualize your findings.
  • What about the economy? Visualize and explain your results.
# To do
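A box plot per category is one common way to compare a quantitative variable across groups. The sketch below is a minimal example under that assumption, not the only valid choice (violin or strip plots would also work). The continent labels "X" and "Y" and all values are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy values invented for illustration; NOT real Gapminder figures.
toy1952 = pd.DataFrame(
    {
        "continent": ["X", "X", "X", "Y", "Y", "Y"],
        "lifeExp": [38.0, 41.0, 45.0, 62.0, 66.0, 70.0],
    }
)

fig, ax = plt.subplots(figsize=(6, 4))
# One box per category: the qualitative variable goes on x, the quantitative on y.
sns.boxplot(data=toy1952, x="continent", y="lifeExp", ax=ax)
ax.set_xlabel("Continent")
ax.set_ylabel("Life expectancy (years)")
ax.set_title("Life expectancy by continent, 1952 (toy data)")

# Group medians give numbers to quote alongside the picture.
medians = toy1952.groupby("continent")["lifeExp"].median()
print(medians)
```

For the economy question, the same pattern applies with gdpPercap on the y-axis; a log scale on that axis may be worth trying if a few countries dominate the range.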

B.3. Qualitative vs qualitative

Qualitative variables are simpler to analyze than quantitative ones, since we mainly look at the proportion or frequency of each category. In our dataset, the existing qualitative variables are not suitable for this analysis because they are repeated each year. Therefore, we will create a new qualitative variable from lifeExp.

  • Add a column lifeExpQual to data1952 with three categories, [“unhealthy”, “moderate”, “healthy”], obtained by splitting lifeExp into 3 classes.

Hint: you may cheat using slide 15. The function pd.qcut is helpful for such a task.

# To do
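The hint about pd.qcut can be illustrated on a tiny example. The sketch assumes a quantile-based split, so each class receives roughly the same number of rows; an equal-width split with pd.cut would be a different but equally legitimate choice. The values are invented for illustration.

```python
import pandas as pd

# Toy values invented for illustration; NOT real Gapminder figures.
toy1952 = pd.DataFrame({"lifeExp": [35.0, 40.0, 45.0, 50.0, 60.0, 70.0]})

# q=3 cuts at the 1/3 and 2/3 quantiles, so each class holds about a third
# of the rows. Labels are assigned from the lowest bin to the highest.
toy1952["lifeExpQual"] = pd.qcut(
    toy1952["lifeExp"], q=3, labels=["unhealthy", "moderate", "healthy"]
)

print(toy1952["lifeExpQual"].value_counts(sort=False))
```

The result is an ordered categorical column, so later tables and plots keep the order unhealthy, moderate, healthy instead of sorting the labels alphabetically.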
  • Graphically represent the connection between lifeExpQual and continent in 1952.
  • Describe what you see.
# To do
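For two qualitative variables, a contingency table with a stacked bar chart is one standard representation. The sketch below is a minimal example under that assumption; a mosaic plot or a heatmap of the same table would be alternatives. The continent labels and class assignments are invented for illustration.

```python
import pandas as pd

# Toy values invented for illustration; NOT real Gapminder figures.
toy1952 = pd.DataFrame(
    {
        "continent": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
        "lifeExpQual": [
            "unhealthy", "unhealthy", "unhealthy", "moderate",
            "moderate", "healthy", "healthy", "healthy",
        ],
    }
)

# Contingency table: rows are continents, columns are health classes.
# normalize="index" turns counts into within-continent proportions.
proportions = pd.crosstab(
    toy1952["continent"], toy1952["lifeExpQual"], normalize="index"
)
print(proportions)

# Stacked bars: each bar is one continent and sums to 1.
ax = proportions.plot(kind="bar", stacked=True, figsize=(6, 4))
ax.set_ylabel("Proportion of countries")
ax.set_title("Health class by continent, 1952 (toy data)")
```

Proportions make continents with very different numbers of countries comparable; dropping normalize shows raw counts instead. Here the class column holds plain strings, so the table columns come out in alphabetical order.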

C. Time evolution

Gapminder captures global changes from \(1952\) to \(2007\). It is more insightful to examine how these variables evolve over time.

C.1. Evolution of average lifeExp of the \(5\) continents

  1. Create a line plot of lifeExp for the five continents from \(1952\) to \(2007\) using sns.lineplot. What observations can you make from the plot?
# To do

C.2. Other variables

  • Repeat the same process with pop, and then with gdpPercap. Provide your comments for each case.
# To do

C.3. Cambodia

  • Plot the evolution of the three quantitative variables for Cambodia. What do you observe?
# To do
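One possible layout for a single-country view is one panel per variable, sketched below with Matplotlib. The country "A" and all values are invented for illustration; on the lab data the filter would select Cambodia from gapminder instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy values invented for illustration; NOT real Gapminder figures.
toy = pd.DataFrame(
    {
        "country": ["A", "A", "A", "B", "B", "B"],
        "year": [1952, 1977, 2007, 1952, 1977, 2007],
        "lifeExp": [40.0, 35.0, 60.0, 65.0, 70.0, 78.0],
        "pop": [5e6, 7e6, 14e6, 30e6, 36e6, 40e6],
        "gdpPercap": [400.0, 500.0, 1700.0, 4000.0, 13000.0, 28000.0],
    }
)

# Keep one country and make sure the rows are in chronological order.
one_country = toy[toy["country"] == "A"].sort_values("year")
variables = ["lifeExp", "pop", "gdpPercap"]

# One panel per variable: the three scales differ too much to share a y-axis.
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5), sharex=True)
for ax, name in zip(axes, variables):
    ax.plot(one_country["year"], one_country[name], marker="o")
    ax.set_xlabel("Year")
    ax.set_ylabel(name)
    ax.set_title(f"{name} over time")
fig.suptitle("Country A (toy data)")
fig.tight_layout()
```

Separate panels with a shared x-axis make it easier to line up a change in one variable with what the other two were doing in the same years.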

Further readings