Exploratory Data Analysis & Unsupervised Learning
Course: HAS Sothea
TP: PHAUK Sokkey, PhD; Ms. UANN Sreyvi
Objective: This initial practical session is designed to enhance your understanding of various data variable types and the corresponding statistical and graphical tools suitable for each type.
The Jupyter Notebook for this TP can be downloaded here: TP1-Gapminder.
# useful packages
import numpy as np
import pandas as pd
# %pip install gapminder  (uncomment to install gapminder if it is not installed yet)
from gapminder import gapminder
print(f"* Number of observations: {gapminder.shape[0]}")
gapminder.sample(3)
* Number of observations: 1704
      country continent  year  lifeExp       pop     gdpPercap
1424    Spain    Europe  1992   77.570  39549438  18603.064520
269      Chad    Africa  1977   47.383   4388260   1133.984950
135   Bolivia  Americas  1967   45.032   4040665   2586.886053
2. Variable types
EDA involves summarizing and visualizing data to uncover patterns, detect anomalies, and understand relationships between variables. Statistical summaries, such as mean, median, and standard deviation, are essential tools in this process.
Which variables are considered quantitative and which are qualitative?
Hint: You can check the default column types with gapminder.dtypes. However, these types can be misleading, because some categorical variables may be encoded as numbers.
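As a minimal sketch of the hint above, the snippet below builds a small toy table (the three rows are copied from the sample output; the name `toy` is only illustrative) and lists which columns pandas stores as numbers. It only reports storage types; deciding whether a numerically stored column such as year behaves as a quantity or as a label remains a judgment call.

```python
import pandas as pd

# Toy rows mimicking the gapminder columns (values copied from the sample above).
toy = pd.DataFrame({
    "country": ["Spain", "Chad", "Bolivia"],
    "continent": ["Europe", "Africa", "Americas"],
    "year": [1992, 1977, 1967],
    "lifeExp": [77.570, 47.383, 45.032],
    "pop": [39549438, 4388260, 4040665],
    "gdpPercap": [18603.064520, 1133.984950, 2586.886053],
})

# Types inferred by pandas: text columns appear as `object`
# (or a string dtype in recent pandas versions), numbers as int/float.
print(toy.dtypes)

# Split the columns by storage type. A numeric dtype does not by itself
# make a variable quantitative: `year` is an integer but often acts as a label.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(text_cols)
```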
Your response:
2.1. Quantitative variables
In year 2002,
Compute suitable statistics for each quantitative variable (excluding year) to obtain an overall summary.
Recall the definitions of skewness and kurtosis introduced in the course.
Compute these metrics for each quantitative variable and explain the distribution of each variable based on these values.
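The sketch below illustrates skewness and kurtosis on a hypothetical right-skewed sample, assuming the usual moment-based definitions (third and fourth central moments divided by the appropriate power of the variance); the course may use a slightly different normalization. The helper names `moment_skewness` and `moment_excess_kurtosis` are illustrative. pandas' own `skew()` and `kurt()` apply small-sample bias corrections, and `kurt()` reports excess kurtosis (a normal distribution gives 0).

```python
import pandas as pd

# Hypothetical right-skewed sample (not gapminder data): one very large value.
x = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 50.0])

def moment_skewness(s):
    """Population skewness: third central moment divided by sigma^3."""
    d = s - s.mean()
    return (d**3).mean() / (d**2).mean() ** 1.5

def moment_excess_kurtosis(s):
    """Population excess kurtosis: fourth central moment / sigma^4, minus 3."""
    d = s - s.mean()
    return (d**4).mean() / (d**2).mean() ** 2 - 3.0

print(moment_skewness(x), moment_excess_kurtosis(x))

# pandas uses bias-corrected estimators, so the numbers differ somewhat
# from the plain moment formulas, but the signs agree on this sample.
print(x.skew(), x.kurt())
```

A positive skewness indicates a longer right tail, and a positive excess kurtosis indicates heavier tails than a normal distribution.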
# To do
Graphically represent the distribution of each variable for the year \(2002\). After plotting the distributions, provide a brief explanation for each variable.
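One possible starting point, sketched on synthetic data rather than on gapminder itself: for a strongly right-skewed variable, a histogram on the raw scale is often dominated by a few large values, and a log scale can make the shape easier to read. The variable `values` and the choice of 20 bins are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical right-skewed values standing in for a column such as `pop`.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=15.0, sigma=1.5, size=140)

# Same data, two scales: raw values on the left, log10 values on the right.
fig, axes = plt.subplots(1, 2, figsize=(9, 3))
axes[0].hist(values, bins=20)
axes[0].set_title("raw scale")
axes[1].hist(np.log10(values), bins=20)
axes[1].set_title("log10 scale")
fig.tight_layout()
```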
import matplotlib.pyplot as plt
import seaborn as sns
# To do
According to the data, in 2002:
Which country is the richest?
Which country is the poorest?
Which country is the healthiest?
Which country is the unhealthiest?
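A minimal sketch of how such questions can be answered, under the assumption that "richest" is read as highest gdpPercap and "healthiest" as highest lifeExp (other readings are possible). The toy rows reuse values from the sample shown earlier and mix several years, so they only demonstrate the mechanics of `idxmax` and `idxmin`.

```python
import pandas as pd

# Toy rows (values taken from the sample shown earlier; not a full year of data).
toy = pd.DataFrame({
    "country": ["Spain", "Chad", "Bolivia"],
    "lifeExp": [77.570, 47.383, 45.032],
    "gdpPercap": [18603.064520, 1133.984950, 2586.886053],
})

# idxmax / idxmin return the row label of the extreme value;
# .loc then retrieves the matching country name.
highest_gdp = toy.loc[toy["gdpPercap"].idxmax(), "country"]
lowest_gdp = toy.loc[toy["gdpPercap"].idxmin(), "country"]
highest_life = toy.loc[toy["lifeExp"].idxmax(), "country"]
lowest_life = toy.loc[toy["lifeExp"].idxmin(), "country"]
print(highest_gdp, lowest_gdp, highest_life, lowest_life)
```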
# To do
Repeat the previous question for the year 1977. Before computing, can you guess which country had the lowest life expectancy around that year?
# To do
2.2. Qualitative variables
Qualitative variables are simpler to summarize than quantitative ones, as we primarily focus on the frequency or proportion of each category. In our dataset, the existing qualitative variables (country and continent) are not suitable for this analysis because they are simply repeated each year. Therefore, we will create three new qualitative variables from the three quantitative columns by cutting each of them into three categories.
In year 2002,
Add the following three variables to the gapminder dataset by grouping each quantitative variable into \(3\) groups.
Create variable gdpQual with three categories: [“developing”, “moderate”, “developed”] using variable gdpPercap.
Create variable popQual with three categories: [“small”, “medium”, “large”] using variable pop.
Create variable lifeExpQual with three categories: [“unhealthy”, “moderate”, “healthy”] using variable lifeExp.
Hint: you may find pd.cut function helpful.
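To illustrate the hint, here is a small sketch of three ways to form three groups from a numeric column. The values 58.0 and 66.5 and the thresholds 50 and 70 are hypothetical, chosen only to make the example readable; which binning rule is appropriate for the exercise is left to you.

```python
import pandas as pd

# First, second and last values come from the sample output; 58.0 and 66.5 are made up.
life = pd.Series([45.032, 47.383, 58.0, 66.5, 77.570], name="lifeExp")
labels = ["unhealthy", "moderate", "healthy"]

# bins=3 splits the observed range into three equal-width intervals.
equal_width = pd.cut(life, bins=3, labels=labels)

# Explicit bin edges are an alternative; these thresholds are hypothetical.
by_threshold = pd.cut(life, bins=[0, 50, 70, 120], labels=labels)

# qcut uses quantiles instead, giving groups of (nearly) equal size.
equal_count = pd.qcut(life, q=3, labels=labels)

print(pd.DataFrame({
    "lifeExp": life,
    "equal_width": equal_width,
    "by_threshold": by_threshold,
    "equal_count": equal_count,
}))
```

Equal-width bins can leave some categories nearly empty when a variable is strongly skewed, which is one reason to compare them with quantile-based bins.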
# To do
Compute the appropriate statistical values and graphically represent the distribution of each newly created qualitative variable.
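For a qualitative variable, the natural summaries are the count and the share of each category. The sketch below uses a hypothetical series of labels, such as `pd.cut` might produce, and is not the expected answer for the real data.

```python
import pandas as pd

# Hypothetical category labels, as might result from pd.cut.
gdp_qual = pd.Series(
    ["developing", "developing", "moderate", "developed", "developing"],
    name="gdpQual",
)

counts = gdp_qual.value_counts()                # absolute frequencies
shares = gdp_qual.value_counts(normalize=True)  # relative frequencies (sum to 1)
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

A bar chart of `counts` (for example with `counts.plot(kind="bar")` or `sns.countplot`) is a common way to display the same information.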
# To do
Graphically represent the three new qualitative variables and describe each graph.
# To do
3. Time evolution
Gapminder captures global changes from \(1952\) to \(2007\). It is more insightful to examine how these variables evolve over time.
3.1. Evolution of quantitative columns
Create a line plot of lifeExp for the five continents from \(1952\) to \(2007\) using sns.lineplot. What observations can you make from the plot?
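When several countries share the same year and continent, `sns.lineplot(data=..., x="year", y="lifeExp", hue="continent")` aggregates them, by default drawing the mean with a 95% confidence band. The sketch below computes that mean explicitly with pandas on hypothetical rows, which can help when interpreting the plotted lines; note that it is an unweighted mean over countries, not a population-weighted one.

```python
import pandas as pd

# Hypothetical long-format rows: two countries per continent, two years.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Asia",
                  "Europe", "Europe", "Europe", "Europe"],
    "country": ["A", "B", "A", "B", "C", "D", "C", "D"],
    "year": [1952, 1952, 2007, 2007, 1952, 1952, 2007, 2007],
    "lifeExp": [40.0, 50.0, 70.0, 74.0, 64.0, 66.0, 78.0, 80.0],
})

# Unweighted mean over countries: one value per (year, continent),
# reshaped so that each continent becomes a column.
mean_life = (
    toy.groupby(["year", "continent"])["lifeExp"]
       .mean()
       .unstack("continent")
)
print(mean_life)
```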
# To do
Repeat the same process with pop, and then with gdpPercap. Provide your comments for each case.
# To do
Plot the evolution of each of the three quantitative variables for Cambodia vs Thailand. Describe the graphs.
# To do
3.2. Evolution of qualitative columns
Visualize the evolution of lifeExpQual column in Asia from 1952 to 2007. What do you observe?
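One way to follow a qualitative column over time is to tabulate, for each year, the share of countries in each category. The sketch below does this with `pd.crosstab` on hypothetical rows, assuming the categories were defined with the same thresholds for every year so that they are comparable across time.

```python
import pandas as pd

# Hypothetical rows: three countries observed in two years.
toy = pd.DataFrame({
    "year": [1952, 1952, 1952, 2007, 2007, 2007],
    "lifeExpQual": ["unhealthy", "unhealthy", "moderate",
                    "moderate", "healthy", "healthy"],
})

# Rows: years; columns: categories; normalize="index" makes each row sum to 1.
shares = pd.crosstab(toy["year"], toy["lifeExpQual"], normalize="index")
print(shares)
```

A stacked bar chart, for example `shares.plot(kind="bar", stacked=True)`, then shows how the mix of categories shifts from year to year.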
# To do
Do the same with the popQual and gdpQual columns.
# To do
Visualize the evolution of the three qualitative columns above for African countries from 1952 to 2007.