Lab4: Data Visualization

Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD

Objective: In this lab, you will apply the data visualization techniques you have studied to a real dataset. Each graph you create should be complete and easy to understand, with a title, a legend, and any other information that helps the audience read it. An effective graph shouldn’t take viewers longer than 15 seconds to understand.

The notebook for this lab can be downloaded here: Lab4_Data_Visualization.ipynb.
Alternatively, you can work directly in Google Colab here: Lab4_Data_Visualization.ipynb.

1. `Gapminder` dataset

Gapminder is an excerpt of data available at https://gapminder.org. For each of \(142\) countries (country), the dataset provides values for life expectancy (lifeExp), GDP per capita (gdpPercap), and population (pop), every five years, from \(1952\) to \(2007\) (year). It was originally distributed in Jennifer Bryan’s excellent gapminder teaching package for R (ggplot, tidyverse, and more). For more information about gapminder:

Documentation: https://www.gapminder.org/data/documentation/
A short demonstration video is available here: Hans Rosling’s 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four.

# useful packages
import numpy as np
import pandas as pd

#%pip install gapminder  (run this once if gapminder is not installed yet)
from gapminder import gapminder
print(f"* Number of observations: {gapminder.shape[0]}")
gapminder.sample(3)

* Number of observations: 1704

	country	continent	year	lifeExp	pop	gdpPercap
1424	Spain	Europe	1992	77.570	39549438	18603.064520
269	Chad	Africa	1977	47.383	4388260	1133.984950
135	Bolivia	Americas	1967	45.032	4040665	2586.886053

A. Variable types

Report the dimensions of the dataset.
Which variables are considered quantitative and which are qualitative?
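One possible starting point for both questions is to inspect `shape` and the column dtypes. The sketch below is only illustrative: `sample` is a stand-in built from the three rows printed above, not the full dataset, and a dtype is a hint rather than a verdict (for example, year is stored as a number but may be better treated as a time index).

```python
import pandas as pd

# Stand-in for `gapminder`: the three sample rows printed earlier in this lab.
sample = pd.DataFrame({
    "country": ["Spain", "Chad", "Bolivia"],
    "continent": ["Europe", "Africa", "Americas"],
    "year": [1992, 1977, 1967],
    "lifeExp": [77.570, 47.383, 45.032],
    "pop": [39549438, 4388260, 4040665],
    "gdpPercap": [18603.064520, 1133.984950, 2586.886053],
})

# Dimensions: (number of rows, number of columns).
n_rows, n_cols = sample.shape

# Columns stored as numbers are candidates for quantitative variables;
# the remaining columns are candidates for qualitative variables.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Replacing `sample` with `gapminder` applies the same checks to the full dataset.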

# To do

B. Year 1952

B.1. Quantitative vs quantitative

Create a subset called data1952 that contains only the observations from the year 1952.
Visualize the relationship between gdpPercap and lifeExp in 1952.
Visualize the relationship between gdpPercap and pop in 1952.
Visualize the relationship between lifeExp and pop in 1952.
Do these relationships look different from those in 2007?

Hint: You can produce the same graphs as shown in the course using the Plotly package, available here: plotly python.
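As a minimal static alternative to Plotly, the pattern "filter one year, then scatter two quantitative columns" might look like the sketch below. The helper name `scatter_gdp_vs_life` is hypothetical, `sample` is a stand-in made of the three rows printed earlier, and the logarithmic x-axis is a design choice assumed here because GDP per capita spans several orders of magnitude.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for `gapminder`: the three sample rows printed earlier in this lab.
sample = pd.DataFrame({
    "country": ["Spain", "Chad", "Bolivia"],
    "continent": ["Europe", "Africa", "Americas"],
    "year": [1992, 1977, 1967],
    "lifeExp": [77.570, 47.383, 45.032],
    "pop": [39549438, 4388260, 4040665],
    "gdpPercap": [18603.064520, 1133.984950, 2586.886053],
})


def scatter_gdp_vs_life(df, year):
    """Hypothetical helper: scatter lifeExp against gdpPercap for one year."""
    # .copy() gives an independent subset, so columns can be added to it later.
    data = df[df["year"] == year].copy()
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(data["gdpPercap"], data["lifeExp"])
    ax.set_xscale("log")  # assumed choice: spreads out low-income countries
    ax.set_xlabel("GDP per capita (log scale)")
    ax.set_ylabel("Life expectancy (years)")
    ax.set_title(f"Life expectancy vs GDP per capita, {year}")
    return data, ax


# The stand-in has a single 1977 row, so this draws one point only.
data1977, ax = scatter_gdp_vs_life(sample, 1977)
```

With the full dataset, `scatter_gdp_vs_life(gapminder, 1952)` would return the 1952 subset together with its plot.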

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go  # for interactive graphs
import plotly.express as px  # for interactive graphs
# To do

According to the data, in 1952:
- Which country was the richest?
- Which country was the poorest?
- Which country was the healthiest?
- Which country was the unhealthiest?
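A sketch of how such questions can be answered with `idxmax` and `idxmin`, assuming that "richest" means highest gdpPercap and "healthiest" means highest lifeExp. The stand-in `sample` mixes rows from different years, so the names printed here only demonstrate the mechanics and say nothing about 1952.

```python
import pandas as pd

# Stand-in for a one-year subset such as data1952: the three sample rows
# printed earlier in this lab (they come from different years).
sample = pd.DataFrame({
    "country": ["Spain", "Chad", "Bolivia"],
    "continent": ["Europe", "Africa", "Americas"],
    "year": [1992, 1977, 1967],
    "lifeExp": [77.570, 47.383, 45.032],
    "pop": [39549438, 4388260, 4040665],
    "gdpPercap": [18603.064520, 1133.984950, 2586.886053],
})

# idxmax / idxmin return the row label of the largest / smallest value,
# which .loc then uses to look up the country name.
richest = sample.loc[sample["gdpPercap"].idxmax(), "country"]
poorest = sample.loc[sample["gdpPercap"].idxmin(), "country"]
healthiest = sample.loc[sample["lifeExp"].idxmax(), "country"]
unhealthiest = sample.loc[sample["lifeExp"].idxmin(), "country"]

print(richest, poorest, healthiest, unhealthiest)
# Spain Chad Spain Bolivia
```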

# To do

Repeat the previous question for the year 1977. Before computing, can you guess which country had the lowest life expectancy around that year?

# To do

B.2. Quantitative vs qualitative

We observed differences in health conditions across continents in 2007. Was this also the case in 1952? Please visualize your findings.
What about the economy? Visualize and explain your results.

# To do

B.3. Qualitative vs qualitative

Qualitative variables are simpler than quantitative ones, as we primarily focus on the proportion or frequency of each category. In our dataset, the existing qualitative variables are not suitable for analysis because they are repeated each year. Therefore, we will create a new qualitative variable from lifeExp.

Add to data1952 a column lifeExpQual containing three categories: [“unhealthy”, “moderate”, “healthy”] by splitting lifeExp into 3 classes.

Hint: you may cheat using slide 15. The function pd.qcut is helpful for such a task.
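The mechanics of `pd.qcut` can be sketched on the three lifeExp values printed earlier. This is only an illustration: the values come from different years, and with three observations each quantile class ends up holding exactly one row. Note that `pd.qcut` builds classes with roughly equal counts, whereas `pd.cut` would build classes of equal width.

```python
import pandas as pd

# lifeExp values of the three sample rows printed earlier (Spain, Chad, Bolivia).
life = pd.Series([77.570, 47.383, 45.032], index=["Spain", "Chad", "Bolivia"])

# q=3 splits at the 1/3 and 2/3 quantiles; labels are assigned from the
# lowest class to the highest.
life_qual = pd.qcut(life, q=3, labels=["unhealthy", "moderate", "healthy"])

print(life_qual.tolist())
# ['healthy', 'moderate', 'unhealthy']
```

On the real subset, the same call would be applied to `data1952["lifeExp"]` and the result stored in a new column.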

# To do

Graphically represent the connection between lifeExpQual and continent in year 1952.
Describe what you see.
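Two qualitative variables are usually summarized with a contingency table before plotting. The sketch below uses `pd.crosstab` on a made-up table: the continent labels "A" and "B" and the category assignments are hypothetical and chosen only to show the shape of the result.

```python
import pandas as pd

# Made-up stand-in for data1952 with only the two qualitative columns.
toy = pd.DataFrame({
    "continent": ["A", "A", "A", "B", "B", "B"],
    "lifeExpQual": ["unhealthy", "unhealthy", "moderate",
                    "moderate", "healthy", "healthy"],
})
levels = ["unhealthy", "moderate", "healthy"]

# Counts of each category within each continent, columns in a meaningful order.
counts = pd.crosstab(toy["continent"], toy["lifeExpQual"])
counts = counts.reindex(columns=levels, fill_value=0)

# Row-wise proportions: each row sums to 1, which makes continents with
# different numbers of countries comparable.
shares = counts.div(counts.sum(axis=1), axis=0)

print(counts)
print(shares.round(2))
```

A table like `shares` can then be drawn as a stacked bar chart, for example with `shares.plot(kind="bar", stacked=True)`, or as a heatmap.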

# To do

C. Time evolution

Gapminder captures global changes from \(1952\) to \(2007\). It is more insightful to examine how these variables evolve over time.

C.1. Evolution of average `lifeExp` of the \(5\) continents

Create a line plot of lifeExp for the five continents from \(1952\) to \(2007\) using sns.lineplot. What observations can you make from the plot?
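When several rows share the same x value, `sns.lineplot` by default draws their mean together with a confidence band. The sketch below computes that mean explicitly with pandas so you can see which numbers the line goes through. The table `toy`, its continents "X" and "Y", and all its values are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for `gapminder`: two continents, two years.
toy = pd.DataFrame({
    "continent": ["X", "X", "X", "X", "Y", "Y"],
    "year": [1952, 1952, 2007, 2007, 1952, 2007],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 60.0, 80.0],
})

# Average lifeExp per continent and year, reshaped to one column per continent.
means = (
    toy.groupby(["continent", "year"])["lifeExp"]
    .mean()
    .unstack("continent")
)

print(means)
```

On the full dataset, a call such as `sns.lineplot(data=gapminder, x="year", y="lifeExp", hue="continent")` should draw one line per continent through these per-year averages.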

# To do

C.2. Other variables

Repeat the same process with pop, and then with gdpPercap. Provide your comments for each case.

# To do

C.3. Cambodia

Plot the evolution of the three quantitative variables for Cambodia. What do you observe?

# To do
