TP1 - Variable Types & Descriptive Statistics

Exploratory Data Analysis & Unsupervised Learning
Course: PHAUK Sokkey, PhD
TP: HAS Sothea, PhD


Objective: This initial practical session is designed to enhance your understanding of various data variable types and the corresponding statistical and graphical tools suitable for each type.


The Jupyter Notebook for this TP can be downloaded here: TP1-Gapminder.


1. Gapminder dataset

Gapminder is an excerpt of the data available at https://gapminder.org. For each of \(142\) countries (country), the package provides values for life expectancy (lifeExp), GDP per capita (gdpPercap), and population (pop), every five years, from \(1952\) to \(2007\) (year). It was originally used in Jennifer Bryan’s excellent gapminder teaching package for R (ggplot, tidyverse, and more).

For more information about gapminder:

  • Documentation: https://www.gapminder.org/data/documentation/
  • A short demonstration video: Hans Rosling’s 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four.

# useful packages
import numpy as np
import pandas as pd

#%pip install gapminder  (uncomment and run if the gapminder package is not installed yet)
from gapminder import gapminder
print(f"* Number of observations: {gapminder.shape[0]}")
gapminder.sample(3)
* Number of observations: 1704
      country continent  year  lifeExp       pop     gdpPercap
1424    Spain    Europe  1992   77.570  39549438  18603.064520
269      Chad    Africa  1977   47.383   4388260   1133.984950
135   Bolivia  Americas  1967   45.032   4040665   2586.886053

2. Variable types

Exploratory data analysis (EDA) involves summarizing and visualizing data to uncover patterns, detect anomalies, and understand relationships between variables. Statistical summaries, such as the mean, median, and standard deviation, are essential tools in this process.

  • Which variables are considered quantitative and which are qualitative?

Hint: You can check the default column types by using gapminder.dtypes.
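As a minimal sketch of the hint, the snippet below inspects the column types of a small stand-in table. The frame `toy` and its values are made up for illustration (they are not Gapminder figures); only the column names mirror the real dataset.

```python
import pandas as pd

# Hypothetical toy table with the same column names as gapminder.
# The values are invented for illustration only.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "continent": ["X", "X", "Y"],
    "year": [2002, 2002, 2002],
    "lifeExp": [50.1, 62.3, 78.9],
    "pop": [1_000_000, 25_000_000, 8_000_000],
    "gdpPercap": [900.0, 4200.0, 31000.0],
})

# Storage type of each column.
print(toy.dtypes)

# Columns stored as numbers versus all the others.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```

The storage type is only a starting point: whether a numerically stored column such as year should be treated as quantitative in your analysis is something to argue in your response.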

Your response:

2.1. Quantitative variables

  1. In year 2002,
  • Compute suitable statistics for each quantitative variable (excluding year) to obtain an overall summary.
  • Recall the definitions of Pearson’s second coefficient of skewness and kurtosis introduced in the course. Compute these metrics for each quantitative variable and explain the distribution of each variable based on these values.
# To do
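The two shape measures can be sketched as follows, assuming the usual textbook definitions: Pearson’s second coefficient of skewness \(3(\bar{x} - \text{median})/s\), and kurtosis as the fourth standardized moment. The course may use a slightly different convention (for example, excess kurtosis), so check against your notes. The helper names `pearson_skew2` and `kurtosis` and the toy series are made up for illustration.

```python
import pandas as pd

def pearson_skew2(x: pd.Series) -> float:
    """Pearson's second skewness coefficient: 3 * (mean - median) / std.

    Uses the sample standard deviation (pandas default, ddof=1).
    """
    return float(3 * (x.mean() - x.median()) / x.std())

def kurtosis(x: pd.Series) -> float:
    """Fourth standardized moment (about 3 for a normal distribution).

    Assumption: population standard deviation (ddof=0), no bias correction.
    """
    z = (x - x.mean()) / x.std(ddof=0)
    return float((z ** 4).mean())

# Made-up toy samples, not Gapminder values.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

print(pearson_skew2(symmetric), kurtosis(symmetric))
print(pearson_skew2(right_tailed), kurtosis(right_tailed))
```

With the real data, the same helpers could be applied column by column, for example `gapminder.loc[gapminder.year == 2002, ["lifeExp", "pop", "gdpPercap"]].apply(pearson_skew2)`, alongside `.describe()` for the overall summary. Note that the built-in pandas `.kurt()` reports bias-corrected excess kurtosis (normal distribution equals 0), so its values differ from the sketch above.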
  2. Graphically represent the distribution of each variable for the year \(2002\). After plotting the distributions, provide a brief explanation for each variable.
import matplotlib.pyplot as plt
import seaborn as sns
# To do
  3. According to the data, in 2002:
  • Which country is the richest?
  • Which country is the poorest?
  • Which country is the healthiest?
  • Which country is the unhealthiest?
# To do
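A compact way to find such extremes is `idxmax` / `idxmin` on a table indexed by country. The sketch below assumes that “richest” is read as highest gdpPercap and “healthiest” as highest lifeExp, which is one interpretation of the questions; the frame `toy` and its values are made up.

```python
import pandas as pd

# Hypothetical toy table; values are invented for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "year": [2002] * 4,
    "lifeExp": [50.1, 62.3, 78.9, 43.0],
    "gdpPercap": [900.0, 4200.0, 31000.0, 1500.0],
})

# Keep one year and index by country so idxmax/idxmin return country names.
in_2002 = toy[toy["year"] == 2002].set_index("country")

extremes = pd.DataFrame({
    "highest": in_2002[["gdpPercap", "lifeExp"]].idxmax(),
    "lowest": in_2002[["gdpPercap", "lifeExp"]].idxmin(),
})
print(extremes)
```

Changing the year in the filter is enough to repeat the comparison for another period.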
  4. Repeat the previous question for the year 1977. Before computing, can you guess which country had the lowest life expectancy around that year?
# To do

2.2. Qualitative variables

Qualitative variables are simpler than quantitative ones, as we primarily focus on the proportion or frequency of each category. In our dataset, the existing qualitative variables are not suitable for analysis because they are repeated each year. Therefore, we will create three new qualitative variables associated with the three quantitative ones by dividing them into three categories each.

In year 2002,

  1. Add the following three variables to the gapminder dataset by grouping each quantitative variable into \(3\) groups.
  • Create variable gdpQual with three categories: [“developing”, “moderate”, “developed”] using variable gdpPercap.
  • Create variable popQual with three categories: [“small”, “medium”, “large”] using variable pop.
  • Create variable lifeExpQual with three categories: [“unhealthy”, “moderate”, “healthy”] using variable lifeExp.

Hint: you may find the np.histogram and pd.cut functions helpful.

# To do
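The hint can be sketched as follows for gdpQual, on made-up values. Two binning choices are shown: equal-width bins, using the edges returned by `np.histogram`, and equal-frequency bins with `pd.qcut`, which is an alternative not named in the hint. Which one is appropriate is a modelling decision left to you; the column names `gdpQual_width` and `gdpQual_tercile` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; values are invented for illustration.
toy = pd.DataFrame({
    "country": list("ABCDEF"),
    "gdpPercap": [500.0, 900.0, 4000.0, 7000.0, 25000.0, 40000.0],
})

labels = ["developing", "moderate", "developed"]

# Option 1: three equal-width bins; np.histogram returns the bin edges.
_, edges = np.histogram(toy["gdpPercap"], bins=3)
toy["gdpQual_width"] = pd.cut(
    toy["gdpPercap"], bins=edges, labels=labels, include_lowest=True
)

# Option 2: three equal-frequency bins (terciles).
toy["gdpQual_tercile"] = pd.qcut(toy["gdpPercap"], q=3, labels=labels)

print(toy)
```

With skewed variables, equal-width bins tend to put most observations in the first category, whereas terciles give balanced groups by construction.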
  2. Compute the appropriate statistical values and graphically represent the distribution of each newly created qualitative variable.
# To do
  • Graphical representation
# To do
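For a qualitative variable, a frequency table and a bar chart are the natural summaries. The sketch below uses a made-up lifeExpQual column; the frame `toy` and the table `freq` are hypothetical names.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

levels = ["unhealthy", "moderate", "healthy"]

# Hypothetical toy column; the category assignments are invented.
toy = pd.DataFrame({
    "lifeExpQual": pd.Categorical(
        ["healthy", "moderate", "healthy", "unhealthy", "healthy", "moderate"],
        categories=levels,
        ordered=True,
    ),
})

# Counts and proportions, listed in the natural order of the categories.
counts = toy["lifeExpQual"].value_counts().reindex(levels)
freq = pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})
print(freq)

# Bar chart of the counts.
fig, ax = plt.subplots(figsize=(4, 3))
sns.countplot(data=toy, x="lifeExpQual", ax=ax)
```

Declaring the column as an ordered categorical keeps the bars in a meaningful order instead of alphabetical or frequency order.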

3. Time evolution

Gapminder captures global changes from \(1952\) to \(2007\). It is more insightful to examine how these variables evolve over time.

3.1. Evolution of average lifeExp of the \(5\) continents

  1. Create a line plot of lifeExp for the five continents from \(1952\) to \(2007\) using sns.lineplot. What observations can you make from the plot?
# To do
  2. Repeat the same process with pop, and then with gdpPercap. Provide your comments for each case.
# To do
  3. Plot the evolution of the three quantitative variables for Cambodia. What do you observe?
# To do

Further readings