TP1 - Variable Types & Descriptive Statistics
Exploratory Data Analysis & Unsupervised Learning
Course: PHAUK Sokkey, PhD
TP: HAS Sothea, PhD
Objective: This initial practical session is designed to enhance your understanding of various data variable types and the corresponding statistical and graphical tools suitable for each type.
The Jupyter Notebook for this TP can be downloaded here: TP1-Gapminder.
1. Gapminder dataset
Gapminder is an excerpt of data available at https://gapminder.org. For each of $142$ countries (`country`), the package provides values for life expectancy (`lifeExp`), GDP per capita (`gdpPercap`), and population (`pop`), every five years from $1952$ to $2007$ (`year`). It was originally used in Jennifer Bryan's excellent `gapminder` teaching package for R (`ggplot`, `tidyverse`, and more). For more information about `gapminder`:
- Documentation: https://www.gapminder.org/data/documentation/
- A short demonstration video is available here: Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four.
# useful packages
import numpy as np
import pandas as pd
# %pip install gapminder  (uncomment and run once if the package is not installed yet)
from gapminder import gapminder
print(f"* Number of observations: {gapminder.shape[0]}")
gapminder.sample(3)
* Number of observations: 1704
| | country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|---|
1424 | Spain | Europe | 1992 | 77.570 | 39549438 | 18603.064520 |
269 | Chad | Africa | 1977 | 47.383 | 4388260 | 1133.984950 |
135 | Bolivia | Americas | 1967 | 45.032 | 4040665 | 2586.886053 |
2. Variable types
EDA involves summarizing and visualizing data to uncover patterns, detect anomalies, and understand relationships between variables. Statistical summaries, such as mean, median, and standard deviation, are essential tools in this process.
- Which variables are considered quantitative and which are qualitative?
Hint: You can check the default column types by using `gapminder.dtypes`.
Your response:
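As a starting point for the hint above, here is a minimal sketch of how column types can be inspected and split into numeric and non-numeric groups. The `toy` frame is a made-up stand-in with the same column names as `gapminder` (its values are not real data), so the snippet runs without the package. Note that a numeric dtype alone does not settle whether a column such as `year` should be analysed as quantitative; that remains a judgment call.

```python
import pandas as pd

# Toy stand-in for the gapminder frame: same column names, made-up values.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "continent": ["X", "X", "Y"],
    "year": [2002, 2002, 2002],
    "lifeExp": [70.1, 55.3, 80.2],
    "pop": [1_000_000, 25_000_000, 5_000_000],
    "gdpPercap": [3000.0, 800.0, 30000.0],
})

# Default storage type of each column.
print(toy.dtypes)

# Split the column names by dtype family.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Replacing `toy` with `gapminder` should give the corresponding split for the real dataset.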
2.1. Quantitative variables
- In year 2002,
  - Compute suitable statistics for each quantitative variable (excluding `year`) to obtain an overall summary.
  - Recall the definitions of Pearson's second coefficient of skewness and kurtosis introduced in the course. Compute these metrics for each quantitative variable and explain the distribution of each variable based on these values.
# To do
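The sketch below illustrates these summaries on a small made-up sample (not Gapminder data). It assumes the usual form of Pearson's second coefficient of skewness, $3(\bar{x} - \text{median})/s$, with $s$ the sample standard deviation (`ddof=1`, the pandas default). For kurtosis it uses `Series.kurt()`, which returns the bias-corrected excess kurtosis (Fisher's definition, about $0$ for a normal distribution); the definition given in the course may differ, so adapt accordingly. The helper name `pearson_second_skewness` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

def pearson_second_skewness(x: pd.Series) -> float:
    """Pearson's second skewness: 3 * (mean - median) / sample std."""
    return 3 * (x.mean() - x.median()) / x.std()

# Made-up sample with one large value, so it is skewed to the right.
x = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 20.0])

summary = x.describe()            # count, mean, std, min, quartiles, max
skew2 = pearson_second_skewness(x)
excess_kurt = x.kurt()            # excess kurtosis (Fisher), bias-corrected

print(summary)
print(f"Pearson's second skewness: {skew2:.3f}")
print(f"Excess kurtosis: {excess_kurt:.3f}")
```

A positive skewness value means the mean lies above the median, which is typical of a distribution with a long right tail. On the real data, the same calls can be applied to each quantitative column of the rows where `year` equals 2002.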
- Graphically represent the distribution of each variable for the year $2002$. After plotting the distributions, provide a brief explanation for each variable.
import matplotlib.pyplot as plt
import seaborn as sns
# To do
- According to the data, in 2002:
  - Which country is the richest?
  - Which country is the poorest?
  - Which country is the healthiest?
  - Which country is the unhealthiest?
# To do
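One possible way to answer questions of this kind is to filter a given year and look up the rows holding the minimum and maximum of a column. The sketch below assumes that "richest" and "poorest" refer to `gdpPercap` and that "healthiest" and "unhealthiest" refer to `lifeExp`; this reading is an assumption, not something stated in the TP. The `toy` frame and the helper `extremes` are hypothetical and use made-up values.

```python
import pandas as pd

# Made-up values for three fictional countries in two years.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1977, 1977, 1977, 2002, 2002, 2002],
    "lifeExp": [60.0, 45.0, 72.0, 68.0, 50.0, 79.0],
    "gdpPercap": [2000.0, 600.0, 15000.0, 3500.0, 550.0, 28000.0],
})

def extremes(df: pd.DataFrame, year: int, column: str) -> tuple:
    """Return (country with the minimum, country with the maximum) of `column` in `year`."""
    sub = df[df["year"] == year]
    # idxmin / idxmax return the index label of the extreme row.
    return (
        sub.loc[sub[column].idxmin(), "country"],
        sub.loc[sub[column].idxmax(), "country"],
    )

poorest, richest = extremes(toy, 2002, "gdpPercap")
unhealthiest, healthiest = extremes(toy, 2002, "lifeExp")
print(poorest, richest, unhealthiest, healthiest)
```

The same helper can be called with another year, for example `extremes(toy, 1977, "lifeExp")`.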
- Repeat the previous question for the year 1977. Before computing, can you guess which country had the lowest life expectancy around that year?
# To do
2.2. Qualitative variables
Qualitative variables are simpler than quantitative ones, as we primarily focus on the proportion or frequency of each category. In our dataset, the existing qualitative variables are not suitable for analysis because they are repeated each year. Therefore, we will create three new qualitative variables associated with the three quantitative ones by dividing them into three categories each.
In year 2002,
- Add the following three variables to the `gapminder` dataset by grouping each quantitative variable into $3$ groups.
  - Create variable `gdpQual` with three categories: ["developing", "moderate", "developed"] using variable `gdpPercap`.
  - Create variable `popQual` with three categories: ["small", "medium", "large"] using variable `pop`.
  - Create variable `lifeExpQual` with three categories: ["unhealthy", "moderate", "healthy"] using variable `lifeExp`.
Hint: you may find the `np.histogram` and `pd.cut` functions helpful.
# To do
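To illustrate how the two functions from the hint can work together, the sketch below bins a made-up `lifeExp` sample into three groups. It assumes equal-width bins, which is only one possible choice since the TP does not say how the groups should be formed; `pd.qcut` would give equal-frequency groups instead.

```python
import numpy as np
import pandas as pd

# Made-up life expectancy values (not Gapminder data).
life = pd.Series([42.0, 50.0, 58.0, 66.0, 74.0, 82.0], name="lifeExp")

# np.histogram returns the counts and the edges of 3 equal-width bins.
counts, edges = np.histogram(life, bins=3)

# pd.cut maps each value to a labelled bin; include_lowest=True keeps
# the minimum value inside the first bin.
labels = ["unhealthy", "moderate", "healthy"]
life_qual = pd.cut(life, bins=edges, labels=labels, include_lowest=True)

print(edges)
print(life_qual.tolist())
```

One caveat: `np.histogram` treats bins as closed on the left, while `pd.cut` closes them on the right by default, so a value lying exactly on an interior edge may be counted differently by the two functions.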
- Compute the appropriate statistical values and graphically represent the distribution of each newly created qualitative variable.
# To do
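For a qualitative variable, the natural summaries are the frequency and proportion of each category, together with the mode. A minimal sketch on a made-up `popQual` column follows; the values are hypothetical and only show the mechanics.

```python
import pandas as pd

# Made-up ordered categorical variable.
qual = pd.Series(
    pd.Categorical(
        ["small", "small", "medium", "large", "small", "medium"],
        categories=["small", "medium", "large"],
        ordered=True,
    ),
    name="popQual",
)

# Counts and proportions, sorted by category order rather than by size.
freq = qual.value_counts().sort_index()
prop = qual.value_counts(normalize=True).sort_index()
table = pd.DataFrame({"count": freq, "proportion": prop})
mode = qual.mode()[0]  # most frequent category

print(table)
print("mode:", mode)
```

Such a table can then be drawn as a bar chart, for example with `table["count"].plot.bar()` or `sns.countplot`.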
- Graphical representation
# To do
3. Time evolution
Gapminder captures global changes from $1952$ to $2007$. It is more insightful to examine how these variables evolve over time.
3.1. Evolution of average `lifeExp` of the $5$ continents
- Create a line plot of `lifeExp` for the five continents from $1952$ to $2007$ using `sns.lineplot`. What observations can you make from the plot?
# To do
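When several rows share the same x value, `sns.lineplot` aggregates them with the mean by default and draws a confidence band around it. The sketch below computes that per-continent mean explicitly with pandas, on a made-up frame with two fictional continents, so that the numbers behind the lines can be checked. The seaborn call in the comment is left unexecuted so the example depends only on pandas.

```python
import pandas as pd

# Made-up values: two fictional continents, two countries each, two years.
toy = pd.DataFrame({
    "continent": ["X", "X", "Y", "Y", "X", "X", "Y", "Y"],
    "country": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "year": [1952, 1952, 1952, 1952, 2007, 2007, 2007, 2007],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 55.0, 65.0, 75.0, 81.0],
})

# Mean lifeExp per (year, continent), reshaped to one column per continent.
mean_life = (
    toy.groupby(["year", "continent"])["lifeExp"]
    .mean()
    .unstack("continent")
)
print(mean_life)

# A comparable plot with seaborn (not executed here):
# sns.lineplot(data=toy, x="year", y="lifeExp", hue="continent")
```

Calling `mean_life.plot()` would draw one line per continent from this table.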
- Repeat the same process with `pop`, and then with `gdpPercap`. Provide your comments for each case.
# To do
- Plot the evolution of the three quantitative variables for Cambodia. What do you observe?
# To do
Further readings
- Gapminder documentation: https://www.gapminder.org/data/documentation/
- A short demonstration video is available here: Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four.
- Graphical tools: