Data Visualization: Art & Science

ITM-370: Data Analytics

Lecturer: Dr. Sothea Has

🗺️ Content

Univariate distribution
Bivariate distribution
Multiple information
Time series data
Telling a story & Making a point

🌐 https://clauswilke.com/dataviz/

Univariate distribution

Motivation

Gapminder dataset: world’s changes from \(1952\) to \(2007\).
Video: Hans Rosling’s 200 Countries, 200 Years, 4 Minutes.
Now, take a look at year \(2007\):

Code

from gapminder import gapminder
data2007 = gapminder[gapminder.year == 2007]  # filter to year 2007
data2007.iloc[:4,:].drop(columns=['year']).style.hide()

country	continent	lifeExp	pop	gdpPercap
Afghanistan	Asia	43.828000	31889923	974.580338
Albania	Europe	76.423000	3600523	5937.029526
Algeria	Africa	72.301000	33333216	6223.367465
Angola	Africa	42.731000	12420476	4797.231267

Code

quan_vars = ["pop", "lifeExp", "gdpPercap"]
data2007[quan_vars].describe().drop(index=["count", "25%", "75%"])  # keep mean, std, min, median, max

	pop	lifeExp	gdpPercap
mean	4.402122e+07	67.007423	11680.071820
std	1.476214e+08	12.073021	12859.937337
min	1.995790e+05	39.613000	277.551859
50%	1.051753e+07	71.935500	6124.371108
max	1.318683e+09	82.603000	49357.190170

Code

from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=1, cols=3, 
              subplot_titles=("Boxplot of pop", "Violinplot of lifeExp", "Histogram of GDP Per Capita"))
fig.add_trace(go.Box(y=data2007['pop'], name="pop"), col=1, row=1)
fig.add_trace(go.Violin(y=data2007['lifeExp'], name="lifeExp"), row=1, col=2)
fig.add_trace(go.Histogram(x=data2007['gdpPercap'],  # x= puts GDP on the x-axis, matching its axis title
              name="gdpPercap"), row=1, col=3)
fig.update_layout(height=300, width=1000)
fig.update_yaxes(type="log", row=1, col=1)
fig.update_xaxes(title="Population", row=1, col=1)
fig.update_xaxes(title="Life Expectancy", row=1, col=2)
fig.update_xaxes(title="GDP Per Capita", row=1, col=3)
fig.show()

Quantitative (numerical) data

Statistical values

Note

Mean (average)
Median
Variance
Standard deviation
Percentiles
Skewness
Kurtosis…
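As a minimal sketch, the quantities listed above can be computed with Python's standard library. The sample is the life expectancy of the four countries shown in the 2007 preview table, so the numbers only illustrate the calculations and say nothing about the full dataset. The moment-based skewness and excess kurtosis used here are one common convention, chosen as an assumption; other libraries may apply bias corrections and return slightly different values.

```python
import statistics as st

# Life expectancy of the four countries in the 2007 preview table.
life_exp = [43.828, 76.423, 72.301, 42.731]
n = len(life_exp)

mean = st.fmean(life_exp)
median = st.median(life_exp)
variance = st.variance(life_exp)   # sample variance (divides by n - 1)
std = st.stdev(life_exp)           # square root of the sample variance
q1, q2, q3 = st.quantiles(life_exp, n=4, method="inclusive")  # 25%, 50%, 75%

# Central moments, used for moment-based skewness and excess kurtosis.
m2 = sum((x - mean) ** 2 for x in life_exp) / n
m3 = sum((x - mean) ** 3 for x in life_exp) / n
m4 = sum((x - mean) ** 4 for x in life_exp) / n
skewness = m3 / m2 ** 1.5
excess_kurtosis = m4 / m2 ** 2 - 3

print(f"mean={mean:.3f} median={median:.3f} std={std:.3f}")
print(f"Q1={q1:.3f} Q3={q3:.3f} skew={skewness:.3f} kurt={excess_kurtosis:.3f}")
```

The mean and the median are close here, but with only four points the higher moments are not reliable.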

Quantitative (numerical) data

Graph: Histogram

Note

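Bins: the consecutive intervals that partition the range of the data, one per bar.
Binwidth: the width of each bin.
Bars: the height of each bar is the number of data points falling in its bin.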
Mathematically, if \(B_x\) is the bin containing \(x\in\mathbb{R}\) then \[\text{hist}(x)=\sum_{i=1}^n\mathbb{1}_{\{x_i\in B_x\}}.\]
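The formula can be transcribed almost literally. The sketch below assumes half-open bins of equal width starting at an `origin` of 0; both are illustrative choices, and plotting libraries such as seaborn pick their own bin edges. The data are the four life-expectancy values from the 2007 preview table.

```python
import math

def hist(x, data, binwidth, origin=0.0):
    """Count the observations in the bin B_x that contains x."""
    k = math.floor((x - origin) / binwidth)   # index of the bin containing x
    left = origin + k * binwidth              # bin is [left, right)
    right = left + binwidth
    return sum(1 for xi in data if left <= xi < right)

life_exp = [43.828, 76.423, 72.301, 42.731]

print(hist(43.0, life_exp, binwidth=3))   # bin [42, 45) holds 42.731 and 43.828
print(hist(60.0, life_exp, binwidth=3))   # bin [60, 63) is empty
```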

Code

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
plt.figure(figsize=(8, 4))
ax = sns.histplot(data2007, x="lifeExp", binwidth=3)
ax.set_title("Histogram of Life Expectancy")
plt.show()

Quantitative (numerical) data

Graph: Kernel Density Estimation (KDE)

Note

Smooths the bars of a histogram into a continuous curve.
Mathematically, if \(K\) is a kernel function, for instance, the Gaussian kernel \(K(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}\), and \(h>0\) is the bandwidth controlling the amount of smoothing, then for any \(x\in\mathbb{R}\): \[\hat{f}(x)=\frac{1}{nh}\sum_{i=1}^nK\Big(\frac{x-x_i}{h}\Big).\]
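A minimal sketch of the estimator with a Gaussian kernel follows. The bandwidth `h = 5.0` is an arbitrary value chosen for illustration (seaborn selects its own bandwidth automatically), and the data are the four life-expectancy values from the 2007 preview table.

```python
import math

def gaussian_kernel(t):
    """Standard Gaussian density K(t)."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Kernel density estimate at x with bandwidth h."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

life_exp = [43.828, 76.423, 72.301, 42.731]
h = 5.0  # illustrative bandwidth, not a recommended value

for x in (40, 60, 75):
    print(x, round(kde(x, life_exp, h), 4))

# A density should integrate to about 1 (Riemann sum on a wide grid).
area = sum(kde(i * 0.1, life_exp, h) for i in range(1201)) * 0.1
print(round(area, 3))
```

A larger `h` gives a smoother, flatter curve; a smaller `h` follows the individual points more closely.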

Code

plt.figure(figsize=(8, 4))
ax = sns.histplot(data2007, x="lifeExp", kde=True, linewidth=3, binwidth=3)
ax.set_title("Histogram & density of Life Expectancy")
plt.show()

Quantitative (numerical) data

Graph: Boxplot

Note

\(Q_1\) & \(Q_3\): 1st (\(25\%\)) & 3rd (\(75\%\)) quartiles.
Interquartile range: \(\text{IQR}=Q_3-Q_1\).
Range: [\(Q_1-1.5\text{IQR}\),\(Q_3+1.5\text{IQR}\)] contains around \(99.3\%\) of the values if the data are normally distributed.
Boxplots use this range to bound the whiskers and to flag points outside it as potential outliers.
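The \(99.3\%\) figure can be checked on a standard Normal distribution, and the same fences can be applied to a sample. In this sketch the helper name `iqr_fences` and the toy sample are made up for illustration, and the quartiles use linear interpolation (`method="inclusive"`), which is one of several conventions.

```python
from statistics import NormalDist, quantiles

def iqr_fences(data):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for a sample."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 1) Coverage of the range under a standard Normal distribution.
z = NormalDist()
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
coverage = z.cdf(q3 + 1.5 * iqr) - z.cdf(q1 - 1.5 * iqr)
print(f"{coverage:.1%}")  # about 99.3%

# 2) Outlier detection on a toy sample (values invented for illustration).
sample = [10, 12, 13, 14, 15, 16, 18, 40]
low, high = iqr_fences(sample)
outliers = [x for x in sample if x < low or x > high]
print(low, high, outliers)
```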

Code

plt.figure(figsize=(4, 4))
ax = sns.boxplot(data2007, y="lifeExp")
ax.set_title("Boxplot of Life Expectancy")
ax.set_ylim((30, 95))
plt.show()

Quantitative (numerical) data

Graph: Violin plot

Note

Combines KDE + boxplot.
Shape of the violin is defined by a KDE.
The box inside is a standard boxplot.
It provides more details of where the points are concentrated.
Because the KDE smooths the data, the violin may extend beyond the actual minimum and maximum values.

Code

plt.figure(figsize=(4, 4))
ax = sns.violinplot(data2007, y="lifeExp")
ax.set_title("Violin plot of Life Expectancy")
ax.set_ylim((30, 95))
plt.show()

Quantitative (numerical) data

Graph: Empirical Cumulative Distribution Function

Note

The cumulative proportion of data up to a given point.
Estimate of CDF: \(F(x)=\mathbb{P}(X\leq x)\).
ECDF increases slowly on any range with sparse data points and rapidly in regions with concentrated data points.
Mathematically, for any point \(x\in\mathbb{R}\): \[\text{ECDF}(x)=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{\{x_i\leq x\}}.\]
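The ECDF formula translates directly into code. This is a minimal sketch on the four life-expectancy values from the 2007 preview table, not the full dataset.

```python
def ecdf(x, data):
    """Proportion of observations less than or equal to x."""
    return sum(1 for xi in data if xi <= x) / len(data)

life_exp = [43.828, 76.423, 72.301, 42.731]

print(ecdf(40.0, life_exp))     # no value is <= 40
print(ecdf(43.828, life_exp))   # two of the four values are <= 43.828
print(ecdf(80.0, life_exp))     # all values are <= 80
```

The function is a step function: it jumps by \(1/n\) at each observation, so it rises quickly where observations are close together.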

Code

plt.figure(figsize=(8, 3.5))
ax = sns.ecdfplot(data2007, x="lifeExp", linewidth=3)
ax.set_title("ECDF of Life Expectancy")
plt.show()

Qualitative data

Statistical values & Pie Chart

Note

Frequency: The count of occurrences within a dataset.
Relative frequency: The proportion of occurrences within the dataset.

continent_count = data2007.continent.value_counts()
print(continent_count)

continent
Africa      52
Asia        33
Europe      30
Americas    25
Oceania      2
Name: count, dtype: int64
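Relative frequencies follow from the counts printed above by dividing each one by the total. The sketch below hard-codes those counts in a plain dictionary; in pandas, `value_counts(normalize=True)` returns the proportions directly.

```python
# Continent counts for 2007, copied from the output above.
counts = {"Africa": 52, "Asia": 33, "Europe": 30, "Americas": 25, "Oceania": 2}

total = sum(counts.values())  # number of countries
relative = {name: k / total for name, k in counts.items()}

for name, p in relative.items():
    print(f"{name:<9}{p:.1%}")
```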

Code

plt.figure(figsize=(5, 4))
plt.pie(continent_count, labels=continent_count.index, autopct='%.0f%%')
plt.title("Pie chart of continent")
plt.show()

Pie charts can be challenging to read when there are many categories.
They’re harder to perceive when many categories have similar proportions.

Qualitative data

Statistical values & Barplot

Code

plt.figure(figsize=(5, 4))
order = data2007.continent.value_counts().sort_values().index
ax = sns.countplot(data2007, x="continent", order=order)
ax.bar_label(ax.containers[0])
ax.set_title("Barplot of continent")
plt.show()

Bivariate distribution

Quantitative vs quantitative

Correlation matrices

In year \(1952\):

Code

data1952 = gapminder.loc[gapminder.year == 1952]
data1952[quan_vars].corr().style.background_gradient()

	pop	lifeExp	gdpPercap
pop	1.000000	-0.002725	-0.025260
lifeExp	-0.002725	1.000000	0.278024
gdpPercap	-0.025260	0.278024	1.000000

In year \(1982\):

Code

data1982 = gapminder.loc[gapminder.year == 1982]
data1982[quan_vars].corr().style.background_gradient()

	pop	lifeExp	gdpPercap
pop	1.000000	0.036242	-0.059943
lifeExp	0.036242	1.000000	0.722763
gdpPercap	-0.059943	0.722763	1.000000

In year \(2007\):

Code

data2007[quan_vars].corr().style.background_gradient()

	pop	lifeExp	gdpPercap
pop	1.000000	0.047553	-0.055676
lifeExp	0.047553	1.000000	0.678662
gdpPercap	-0.055676	0.678662	1.000000

🤔 Anything interesting from these 3 correlation matrices?
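For reference, `DataFrame.corr()` computes the Pearson correlation by default, which measures linear association only. The sketch below implements that coefficient by hand and applies it to a made-up, perfectly logarithmic relationship to show that a strong but curved relationship can yield a coefficient below 1. The function name `pearson_r` and the toy values are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy data: y depends on x exactly, but through a logarithm.
x = [1, 10, 100, 1000]
y = [math.log10(v) for v in x]

print(round(pearson_r(x, y), 3))                            # noticeably below 1
print(round(pearson_r([math.log10(v) for v in x], y), 3))   # exactly linear after the log
```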

Quantitative vs quantitative

Graph: Scatterplot

Code

import plotly.graph_objects as go
import plotly.express as px
fig1 = px.scatter(data1952, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig1.update_traces(marker=dict(size=10))
fig1.update_layout(height=400, width=330, title="The world in 1952")
fig1.show()

Code

fig2 = px.scatter(data1982, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig2.update_traces(marker=dict(size=10))
fig2.update_layout(height=400, width=330, title="The world in 1982")
fig2.show()

Code

fig3 = px.scatter(data2007, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig3.update_traces(marker=dict(size=10))
fig3.update_layout(height=400, width=330, title="The world in 2007")
fig3.show()

Quantitative vs quantitative

Graph: Scatterplot

Code

fig1.update_xaxes(type="log")
fig1.show()

Code

fig2.update_xaxes(type="log")
fig2.show()

Code

fig3.update_xaxes(type="log")
fig3.show()

Better? This is the power of "log" scaling!
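One way to see why the log scale helps is to ask how much of the axis the poorer half of the countries occupies. The sketch below uses the 2007 minimum, median and maximum of gdpPercap from the summary table earlier; the helper `axis_position` is a hypothetical name introduced for illustration.

```python
import math

# 2007 gdpPercap summary values (min, median, max) from the describe() table.
gdp_min, gdp_median, gdp_max = 277.551859, 6124.371108, 49357.190170

def axis_position(value, lo, hi, scale="linear"):
    """Relative position (0 to 1) of value on an axis spanning [lo, hi]."""
    if scale == "log":
        value, lo, hi = (math.log10(v) for v in (value, lo, hi))
    return (value - lo) / (hi - lo)

linear_pos = axis_position(gdp_median, gdp_min, gdp_max)
log_pos = axis_position(gdp_median, gdp_min, gdp_max, scale="log")

print(f"linear axis: {linear_pos:.0%}")   # roughly 12%
print(f"log axis:    {log_pos:.0%}")      # roughly 60%
```

On a linear axis, the half of the countries below the median is squeezed into about an eighth of the plot width; the log axis spreads those points out.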

Quantitative vs qualitative

Graph: Conditional distribution

Does the distribution of the quantitative variable differ across the categories of the qualitative variable?
Different = Influenced = Related.
Just use what we have learned:
- Pick one graph for quantitative data.
- Draw it separately for each group of the qualitative variable.
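The same idea applies to numerical summaries: compute a statistic of the quantitative variable within each category and compare. This minimal sketch groups the four rows of the 2007 preview table by continent, which is far too few to draw conclusions and only shows the mechanics. With pandas, `data2007.groupby("continent")["lifeExp"].median()` gives the equivalent result on the full data.

```python
from collections import defaultdict
from statistics import median

# (country, continent, lifeExp) rows from the 2007 preview table.
rows = [
    ("Afghanistan", "Asia", 43.828),
    ("Albania", "Europe", 76.423),
    ("Algeria", "Africa", 72.301),
    ("Angola", "Africa", 42.731),
]

# Collect the quantitative values separately for each category.
groups = defaultdict(list)
for _, continent, life_exp in rows:
    groups[continent].append(life_exp)

# One summary per category: the conditional median of lifeExp.
summary = {continent: median(values) for continent, values in groups.items()}
print(summary)
```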

Code

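# Order the continents by their lowest life expectancy (first appearance after sorting)
sorted_data = data2007.sort_values(by='lifeExp')
continent_order = sorted_data['continent'].drop_duplicates().tolist()
fig = px.box(data2007, x="continent", y="lifeExp", hover_name="country", color="continent",
             category_orders={'continent': continent_order})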
fig.update_layout(title="Life Expectancy on each continent in 2007", height=350, width=450)
fig.show()

Qualitative vs qualitative

Graph: Mosaic plot

We grouped gdpPercap into 3 equal-sized classes (terciles):
- Developing
- Emerging
- Developed
Does the distribution of the 1st qualitative variable differ across the categories of the 2nd qualitative variable?
A mosaic plot represents this effect.
Different = Influenced = Related.
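Behind a mosaic plot is a contingency table: the count of each pair of categories, which can then be turned into proportions within each category of one variable. The sketch below uses made-up (continent, gdp_category) pairs, not the Gapminder values, to show the computation; with pandas, `pd.crosstab(..., normalize="index")` produces the same kind of table.

```python
from collections import Counter

# Hypothetical observations, invented for illustration only.
pairs = [
    ("Africa", "Developing"), ("Africa", "Developing"), ("Africa", "Emerging"),
    ("Europe", "Developed"), ("Europe", "Developed"), ("Europe", "Emerging"),
    ("Asia", "Developing"), ("Asia", "Developed"),
]

joint = Counter(pairs)                                     # contingency table
row_totals = Counter(continent for continent, _ in pairs)  # counts per continent

# Proportion of each GDP category within each continent.
conditional = {
    (continent, category): k / row_totals[continent]
    for (continent, category), k in joint.items()
}

for key, p in sorted(conditional.items()):
    print(key, round(p, 2))
```

If the two variables were unrelated, these within-continent proportions would look roughly the same for every continent.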

Code

from statsmodels.graphics.mosaicplot import mosaic
import pandas as pd
fig, ax = plt.subplots(figsize=(9, 5))
def prop(key):
    if "Asia" in key:
        return {'color': '#51cb4b'}
    if "Europe" in key:
        return {'color': '#dda63e'}
    if "Africa" in key:
        return {'color': '#e35441'}
    if "Americas" in key:
        return {'color': '#41b4e3'}
    if "Oceania" in key:
        return {'color': '#b374df'}

data2007 = data2007.copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
data2007['gdp_category'] = pd.qcut(data2007['gdpPercap'], q=3, labels=['Developing', 'Emerging', 'Developed'])
mosaic(data2007, ['continent','gdp_category'], gap=0.01, properties = prop, label_rotation=30, ax=ax)
plt.show()

Qualitative vs qualitative

Graph: Grouped Barplot

A grouped barplot is an alternative graph for showing the connection between two qualitative variables.
Different = Influenced = Related.

Code

fig = px.histogram(data2007, x="continent", color="gdp_category", barmode='group')
fig.update_layout(width=500, height=350, title='Grouped Barplot of continent vs GDP Category')
fig.show()

Multiple information

Color, Shape, Size and 3D Graph

Color: can represent both quantitative data (as a gradient) and qualitative data (as discrete colors).
Shape: can represent qualitative variables.
Size: can represent quantitative variables.
3D Graph: often used to represent the relationship between 3 quantitative variables.

Code

fig = px.scatter(data2007, x="gdpPercap", y="lifeExp", color="continent", size="pop", hover_name="country", size_max=50)
fig.update_layout(width=500, height=450, title='Life Expectancy vs GDP per Capita,<br> Population and Continent')
fig.update_xaxes(type="log")
fig.show()

Time series data

Quantitative data

Graph: Lineplot

Let’s look at Cambodia from 1952 to 2007.

Code

cam_df = gapminder.loc[gapminder.country == "Cambodia"]
fig = px.line(cam_df, x='year', y='lifeExp', title="Evolution of Life Expectancy of Cambodia")
fig.update_layout(height=350, width=330)
fig.show()

Code

fig = px.line(cam_df, x='year', y='pop', title="Evolution of Cambodian population")
fig.update_layout(height=350, width=330)
fig.show()

Code

fig = px.line(cam_df, x='year', y='gdpPercap', title="Evolution of GDP per Capita of Cambodia")
fig.update_layout(height=350, width=330)
fig.show()

Quantitative data

Graph: Lineplot

Now, take a look at the world from 1952 to 2007.

Code

plt.figure(figsize=(5,2.5)) 
sns.lineplot(gapminder, x='year', y='lifeExp', hue="continent", legend=False)
plt.title('Global Life Expectancy Evolution')
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.show()

Code

plt.figure(figsize=(8,2.6)) 
sns.lineplot(gapminder, x='year', y='pop', hue="continent")
plt.title('Global Population Evolution')
plt.xlabel('Year')
plt.ylabel('Population')
plt.yscale("log")
plt.legend(title='Continent', loc='lower center', ncol=5, bbox_to_anchor=(0.5, -0.5))
plt.show()

Code

plt.figure(figsize=(5,2.5)) 
sns.lineplot(gapminder, x='year', y='gdpPercap', hue="continent", legend=False)
plt.title('Global GDP per Capita Evolution')
plt.xlabel('Year')
plt.ylabel('GDP per Capita ($)')
plt.show()

Animated Graph

The world evolution from 1952 to 2007.

Code

fig = px.scatter(gapminder, x="gdpPercap", y="lifeExp", animation_frame="year",
           size="pop", color="continent", hover_name="country",
           log_x=True, size_max=60, range_x=[100,100000], range_y=[25,90])
fig.update_layout(title="The world evolution from 1952 to 2007", 
                  width=1000, height=450)
fig.show()

Telling a story & Making a point

Whole story needs more than just graphs

Context: Gapminder captures the economic and health conditions of the entire world from 1952 to 2007.
- Global Changes: How has the world changed over these 55 years?
- Direction of Evolution: What direction is this evolution taking?
- Our Role: How can we prepare for or contribute to this ongoing evolution?
Telling the story: analysis steps
- Understand Key Variables: Begin by getting a grip on the essential metrics.
- Make Connections: Use correlation matrices and graphs to highlight relationships.
- Investigate: Analyze data from a local (yearly) perspective and then expand to uncover global trends.
- Illustration: Always pair numbers with clear, comprehensible graphs.
Conclusion:
- Address the key questions posed in the context.
- Discuss limitations of the analysis if there are any.
- Mention areas where further research or additional data might provide a more complete picture.

🥳 Yeahhhh……. 🥂

Any questions?