Data Visualization:
Art & Science


ITM-370: Data Analytics

Lecturer: Dr. Sothea Has

🗺 Content

  • Univariate distribution

  • Bivariate distribution

  • Multiple information

  • Time series data

  • Telling a story & Making a point

Univariate distribution

Motivation

Code
from gapminder import gapminder
data2007 = gapminder[gapminder.year == 2007]  # filter to year 2007
data2007.iloc[:4,:].drop(columns=['year']).style.hide()
country continent lifeExp pop gdpPercap
Afghanistan Asia 43.828000 31889923 974.580338
Albania Europe 76.423000 3600523 5937.029526
Algeria Africa 72.301000 33333216 6223.367465
Angola Africa 42.731000 12420476 4797.231267

Motivation

Code
quan_vars = ["pop", "lifeExp", "gdpPercap"]
# Summary statistics, keeping only mean, std, min, median (50%) and max
data2007[quan_vars].describe().drop(index=["count", "25%", "75%"])
pop lifeExp gdpPercap
mean 4.402122e+07 67.007423 11680.071820
std 1.476214e+08 12.073021 12859.937337
min 1.995790e+05 39.613000 277.551859
50% 1.051753e+07 71.935500 6124.371108
max 1.318683e+09 82.603000 49357.190170

Motivation

Code
from plotly.subplots import make_subplots
import plotly.graph_objects as go
# One row, three panels: a different univariate graph for each variable
fig = make_subplots(rows=1, cols=3, 
              subplot_titles=("Boxplot of pop", "Violinplot of lifeExp", "Histogram of GDP Per Capita"))
fig.add_trace(go.Box(y=data2007['pop'], name="pop"), row=1, col=1)
fig.add_trace(go.Violin(y=data2007['lifeExp'], name="lifeExp"), row=1, col=2)
# x= puts the values on the horizontal axis, so the x-axis title below is accurate
fig.add_trace(go.Histogram(x=data2007['gdpPercap'],
              name="gdpPercap"), row=1, col=3)
fig.update_layout(height=300, width=1000)
fig.update_yaxes(type="log", row=1, col=1)  # population spans several orders of magnitude
fig.update_xaxes(title="Population", row=1, col=1)
fig.update_xaxes(title="Life Expectancy", row=1, col=2)
fig.update_xaxes(title="GDP Per Capita", row=1, col=3)
fig.show()

Quantitative (numerical) data

Statistical values

Note

  • Mean (average)
  • Median
  • Variance
  • Standard deviation
  • Percentiles
  • Skewness
  • Kurtosis
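
As a rough illustration of the quantities listed above, the sketch below computes each of them with Python's standard library. The sample `x` is a made-up toy sample, not Gapminder data. Skewness and kurtosis are computed here from central moments that divide by \(n\), and kurtosis is reported as excess kurtosis (Normal = 0); other conventions exist and give slightly different numbers.

```python
import statistics as st

# Hypothetical toy sample (not taken from the Gapminder data)
x = [1, 2, 3, 4, 10]
n = len(x)

mean = st.fmean(x)        # average
median = st.median(x)     # middle value of the sorted sample
var = st.variance(x)      # sample variance (divides by n - 1)
std = st.stdev(x)         # sample standard deviation
# 25th, 50th and 75th percentiles
q1, q2, q3 = st.quantiles(x, n=4, method="inclusive")

def central_moment(k):
    """k-th central moment: (1/n) * sum((x_i - mean)^k)."""
    return sum((xi - mean) ** k for xi in x) / n

# Positive skewness indicates a longer right tail
skewness = central_moment(3) / central_moment(2) ** 1.5
excess_kurtosis = central_moment(4) / central_moment(2) ** 2 - 3

print("mean, median:", mean, median)
print("variance, std:", var, std)
print("quartiles:", q1, q2, q3)
print("skewness, excess kurtosis:", skewness, excess_kurtosis)
```

The single large value (10) pulls the mean above the median and makes the skewness positive, which is the pattern seen for `pop` and `gdpPercap` in the summary table.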


Quantitative (numerical) data

Graph: Histogram

Note

  • Bins: the intervals that partition the range of the data.
  • Binwidth: the width of each interval.
  • Bars: the height of each bar is the number of points that fall within its bin.
  • Mathematically, if \(B_x\) is the bin containing \(x\in\mathbb{R}\) then \[\text{hist}(x)=\sum_{i=1}^n\mathbb{1}_{\{x_i\in B_x\}}.\]
Code
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
plt.figure(figsize=(8, 4))
ax = sns.histplot(data2007, x="lifeExp", binwidth=3)
ax.set_title("Histogram of Life Expectancy")
plt.show()
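
The histogram formula above can be checked by hand. The sketch below is a direct, pure-Python reading of \(\text{hist}(x)\): it counts the points that share a bin with \(x\). The helper `hist_count`, its `origin` parameter (where the first bin starts) and the toy values are assumptions made for illustration; seaborn chooses its own bin edges, so its bars may be aligned differently.

```python
import math

def hist_count(x, data, binwidth, origin=0.0):
    """Number of points of `data` that fall in the same bin as `x`.

    Bins are the half-open intervals
    [origin + k*binwidth, origin + (k+1)*binwidth) for integer k.
    """
    def bin_index(v):
        return math.floor((v - origin) / binwidth)

    target = bin_index(x)
    return sum(1 for xi in data if bin_index(xi) == target)

# Hypothetical toy values, binned with a bin width of 3 starting at 39
sample = [40.2, 41.0, 43.5, 44.9, 47.1]
for x in (40, 43, 46, 50):
    print(x, hist_count(x, sample, binwidth=3, origin=39))
```

Every \(x\) inside the same bin gets the same count, which is why a histogram is piecewise constant.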

Quantitative (numerical) data

Graph: Kernel Density Estimation (KDE)

Note

  • Smoothing out each bar of a histogram to create a continuous curve.
  • Mathematically, if \(K\) is a kernel function, for instance the Gaussian kernel \(K(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}\), and \(h>0\) is the bandwidth (smoothing parameter), then for any \(x\in\mathbb{R}\): \[\hat{f}(x)=\frac{1}{nh}\sum_{i=1}^nK\Big(\frac{x-x_i}{h}\Big).\]
Code
plt.figure(figsize=(8, 4))
ax = sns.histplot(data2007, x="lifeExp", kde=True, linewidth=3, binwidth=3)
ax.set_title("Histogram & density of Life Expectancy")
plt.show()
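
To make the KDE formula concrete, here is a minimal pure-Python sketch of \(\hat{f}(x)\) with the Gaussian kernel. The toy sample and the bandwidth `h = 1.5` are assumptions chosen for illustration; seaborn selects a bandwidth automatically, so its curve will generally differ from this one.

```python
import math

def gaussian_kernel(t):
    """K(t) = exp(-t^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Kernel density estimate at x with bandwidth h > 0."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

sample = [40.2, 41.0, 43.5, 44.9, 47.1]  # hypothetical toy values
h = 1.5                                  # hypothetical bandwidth

# Sanity check: a density should integrate to about 1 (Riemann sum on a fine grid)
step = 0.01
lo, hi = min(sample) - 8 * h, max(sample) + 8 * h
grid = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
area = sum(kde(x, sample, h) for x in grid) * step

print("density near the data:", kde(43.5, sample, h))
print("density far from the data:", kde(30.0, sample, h))
print("approximate area under the curve:", round(area, 4))
```

A larger `h` gives a smoother, flatter curve; a smaller `h` follows individual points more closely.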

Quantitative (numerical) data

Graph: Boxplot

Note

  • \(Q_1\) & \(Q_3\): 1st (\(25\%\)) & 3rd (\(75\%\)) quartiles.
  • Interquartile range: \(\text{IQR}=Q_3-Q_1\).
  • Range: [\(Q_1-1.5\text{IQR}\),\(Q_3+1.5\text{IQR}\)] contains around \(99.3\%\) of the values if the data are normally distributed.
  • Boxplots use this range for their whiskers; points outside it are flagged as potential outliers.
Code
plt.figure(figsize=(4, 4))
ax = sns.boxplot(data2007, y="lifeExp")
ax.set_title("Boxplot of Life Expectancy")
ax.set_ylim((30, 95))
plt.show()
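
The \(99.3\%\) figure and the outlier rule can both be verified in a few lines. This is a sketch using only the standard library: the first part evaluates the coverage of \([Q_1-1.5\,\text{IQR},\,Q_3+1.5\,\text{IQR}]\) for a standard Normal distribution, and the second applies the same rule to a made-up sample (the values in `sample` are hypothetical).

```python
import statistics as st

# 1) Coverage of [Q1 - 1.5*IQR, Q3 + 1.5*IQR] under a standard Normal distribution
z = st.NormalDist()                       # mean 0, standard deviation 1
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
coverage = z.cdf(upper) - z.cdf(lower)
print(f"Normal coverage: {coverage:.4f}")

# 2) Flagging potential outliers in a hypothetical toy sample
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]
s_q1, _, s_q3 = st.quantiles(sample, n=4, method="inclusive")
s_iqr = s_q3 - s_q1
outliers = [v for v in sample
            if v < s_q1 - 1.5 * s_iqr or v > s_q3 + 1.5 * s_iqr]
print("fences:", s_q1 - 1.5 * s_iqr, s_q3 + 1.5 * s_iqr)
print("outliers:", outliers)
```

Quartile conventions differ between libraries, so the fences computed by a plotting library can deviate slightly from these on small samples.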

Quantitative (numerical) data

Graph: Violin plot

Note

  • Combines KDE + boxplot.
  • Shape of the violin is defined by a KDE.
  • The box inside is a regular boxplot.
  • It provides more details of where the points are concentrated.
  • It may overshoot the range of the actual data because it smooths the data and can extend beyond the minimum and maximum values.
Code
plt.figure(figsize=(4, 4))
ax = sns.violinplot(data2007, y="lifeExp")
ax.set_title("Violin plot of Life Expectancy")
ax.set_ylim((30, 95))
plt.show()

Quantitative (numerical) data

Graph: Empirical Cumulative Distribution Function (ECDF)

Note

  • The cumulative proportion of data up to a given point.
  • It estimates the CDF: \(F(x)=\mathbb{P}(X\leq x)\).
  • The ECDF increases slowly over ranges with sparse data points and rapidly over regions with concentrated data points.
  • Mathematically, for any point \(x\in\mathbb{R}\): \[\text{ECDF}(x)=\frac{1}{n}\sum_{i=1}^n\mathbb{1}_{\{x_i\leq x\}}.\]
Code
plt.figure(figsize=(8, 3.5))
ax = sns.ecdfplot(data2007, x="lifeExp", linewidth=3)
ax.set_title("ECDF of Life Expectancy")
plt.show()
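
The ECDF formula translates almost word for word into code. The sketch below is a plain-Python version evaluated on a made-up sample (the values are hypothetical); it is meant to show the definition, not how seaborn computes the curve internally.

```python
def ecdf(x, data):
    """Proportion of points in `data` that are less than or equal to x."""
    return sum(1 for xi in data if xi <= x) / len(data)

sample = [3, 1, 2, 2]  # hypothetical toy values
for x in (0, 1, 2, 3):
    print(x, ecdf(x, sample))
```

The function jumps by \(1/n\) at each data point (by \(2/n\) at the repeated value 2), starts at 0 below the minimum and reaches 1 at the maximum.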

Qualitative data

Statistical values & Pie Chart

Note

  • Frequency: The count of occurrences within a dataset.
  • Relative frequency: The proportion of occurrences within the dataset.
continent_count = data2007.continent.value_counts()
print(continent_count)
continent
Africa      52
Asia        33
Europe      30
Americas    25
Oceania      2
Name: count, dtype: int64
Code
plt.figure(figsize=(5, 4))
plt.pie(continent_count, labels=continent_count.index, autopct='%.0f%%')
plt.title("Pie chart of continent")
plt.show()

  • Pie charts can be hard to read when there are many categories.
  • They are also harder to perceive when several categories have similar proportions.
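
Relative frequencies follow from the counts printed above by dividing each count by the total. The sketch below redoes that arithmetic with the five continent counts copied from the output; with pandas, `value_counts(normalize=True)` should return the same proportions directly.

```python
# Counts copied from the value_counts() output above
continent_count = {"Africa": 52, "Asia": 33, "Europe": 30, "Americas": 25, "Oceania": 2}

total = sum(continent_count.values())
relative = {name: count / total for name, count in continent_count.items()}

print("total:", total)
for name, p in relative.items():
    print(f"{name:<9}{p:6.1%}")
```

These proportions are what the pie chart labels display after rounding to whole percentages.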

Qualitative data

Statistical values & Barplot

Note

  • Frequency: The count of occurrences within a dataset.
  • Relative frequency: The proportion of occurrences within the dataset.
continent_count = data2007.continent.value_counts()
print(continent_count)
continent
Africa      52
Asia        33
Europe      30
Americas    25
Oceania      2
Name: count, dtype: int64
Code
plt.figure(figsize=(5, 4))
order = data2007.continent.value_counts().sort_values().index
ax = sns.countplot(data2007, x="continent", order=order)
ax.bar_label(ax.containers[0])
ax.set_title("Barplot of continent")
plt.show()

Bivariate distribution

Quantitative vs quantitative

Correlation matrices

  • In year \(1952\):
Code
data1952 = gapminder.loc[gapminder.year == 1952]
data1952[quan_vars].corr().style.background_gradient()
  pop lifeExp gdpPercap
pop 1.000000 -0.002725 -0.025260
lifeExp -0.002725 1.000000 0.278024
gdpPercap -0.025260 0.278024 1.000000
  • In year \(1982\):
Code
data1982 = gapminder.loc[gapminder.year == 1982]
data1982[quan_vars].corr().style.background_gradient()
  pop lifeExp gdpPercap
pop 1.000000 0.036242 -0.059943
lifeExp 0.036242 1.000000 0.722763
gdpPercap -0.059943 0.722763 1.000000
  • In year \(2007\):
Code
data2007[quan_vars].corr().style.background_gradient()
  pop lifeExp gdpPercap
pop 1.000000 0.047553 -0.055676
lifeExp 0.047553 1.000000 0.678662
gdpPercap -0.055676 0.678662 1.000000



🤔 Anything interesting from these 3 correlation matrices?

Quantitative vs quantitative

Graph: Scatterplot

Code
import plotly.graph_objects as go
import plotly.express as px
fig1 = px.scatter(data1952, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig1.update_traces(marker=dict(size=10))
fig1.update_layout(height=400, width=330, title="The world in 1952")
fig1.show()
Code
fig2 = px.scatter(data1982, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig2.update_traces(marker=dict(size=10))
fig2.update_layout(height=400, width=330, title="The world in 1982")
fig2.show()
Code
fig3 = px.scatter(data2007, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig3.update_traces(marker=dict(size=10))
fig3.update_layout(height=400, width=330, title="The world in 2007")
fig3.show()

Quantitative vs quantitative

Graph: Scatterplot

Code
fig1.update_xaxes(type="log")
fig1.show()
Code
fig2.update_xaxes(type="log")
fig2.show()
Code
fig3.update_xaxes(type="log")
fig3.show()
  • Better? This is the power of "log" scaling!
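
One way to see why the log scale helps: the correlation coefficient measures linear association, so a relationship that is linear in \(\log(x)\) looks weaker on the raw scale. The sketch below uses made-up numbers (the `gdp` and `life` lists are hypothetical, built so that `life` rises by 10 each time `gdp` is multiplied by 10) and a hand-written Pearson correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: +10 in `life` for every tenfold increase in `gdp`
gdp = [100, 1_000, 10_000, 100_000]
life = [40, 50, 60, 70]

r_raw = pearson(gdp, life)
r_log = pearson([math.log10(v) for v in gdp], life)
print("correlation on raw scale:", round(r_raw, 3))
print("correlation on log scale:", round(r_log, 3))
```

On the log scale the points fall on a straight line, so the correlation is essentially 1, while the raw-scale correlation is noticeably lower.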

Quantitative vs qualitative

Graph: Conditional distribution

  • Does the distribution of the quantitative variable differ across the categories of the qualitative variable?

  • Different = Influenced = Related.

  • Just use what we have learned:

    • Use one graph for quantitative data
    • But draw it separately for each group of the qualitative variable.
Code
# Order the continents by their lowest life expectancy (first appearance after sorting)
sorted_data = data2007.sort_values(by='lifeExp')
continent_order = list(sorted_data['continent'].unique())
fig = px.box(data2007, x="continent", y="lifeExp", hover_name="country", color="continent", category_orders={'continent': continent_order})
fig.update_layout(title="Life Expectancy on each continent in 2007", height=350, width=450)
fig.show()
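
Numerically, a conditional distribution amounts to summarising the quantitative variable within each category. As a small sketch, the code below groups the life expectancies of the four 2007 rows shown in the first table of these slides by continent and takes the median of each group; with the full data, `data2007.groupby("continent")["lifeExp"].median()` would be the usual pandas equivalent.

```python
from collections import defaultdict
import statistics as st

# (country, continent, lifeExp) copied from the first four 2007 rows shown earlier
rows = [
    ("Afghanistan", "Asia", 43.828),
    ("Albania", "Europe", 76.423),
    ("Algeria", "Africa", 72.301),
    ("Angola", "Africa", 42.731),
]

# Collect the life expectancies of each continent
groups = defaultdict(list)
for _, continent, life_exp in rows:
    groups[continent].append(life_exp)

# One summary value per category, listed from lowest to highest
medians = {continent: st.median(values) for continent, values in groups.items()}
for continent, m in sorted(medians.items(), key=lambda kv: kv[1]):
    print(continent, round(m, 3))
```

Four rows are far too few to draw conclusions; the point is only the group-then-summarise pattern that the boxplot per continent visualises.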

Qualitative vs qualitative

Graph: Mosaic plot

  • We group gdpPercap into 3 classes of equal size (terciles):

    • Developing
    • Emerging
    • Developed
  • Does the distribution of the 1st qualitative variable differ across the categories of the 2nd qualitative variable?

  • Mosaic plot represents this effect.

  • Different = Influenced = Related.

Code
from statsmodels.graphics.mosaicplot import mosaic
import pandas as pd
fig, ax = plt.subplots(figsize=(9, 5))
# Each tile key is a (continent, gdp_category) tuple; color tiles by continent
def prop(key):
    if "Asia" in key:
        return {'color': '#51cb4b'}
    if "Europe" in key:
        return {'color': '#dda63e'}
    if "Africa" in key:
        return {'color': '#e35441'}
    if "Americas" in key:
        return {'color': '#41b4e3'}
    if "Oceania" in key:
        return {'color': '#b374df'}

# Work on a copy so that adding a column does not trigger pandas' SettingWithCopyWarning
data2007 = data2007.copy()
# Split gdpPercap into terciles: lowest third, middle third, highest third
data2007['gdp_category'] = pd.qcut(data2007['gdpPercap'], q=3, labels=['Developing', 'Emerging', 'Developed'])
mosaic(data2007, ['continent','gdp_category'], gap=0.01, properties = prop, label_rotation=30, ax=ax)
plt.show()
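
Behind a mosaic plot sits a contingency table: the number of observations for each pair of categories. The sketch below builds one from made-up pairs (the `pairs` list is hypothetical, not the Gapminder data) and derives, for each tile, the share of its first category and the conditional share of the second category within it; the product of the two is the tile's area. With a DataFrame, `pd.crosstab` would give the counts directly.

```python
from collections import Counter

# Hypothetical (continent, gdp_category) pairs, one per country
pairs = [
    ("Africa", "Developing"), ("Africa", "Developing"), ("Africa", "Emerging"),
    ("Europe", "Developed"), ("Europe", "Developed"), ("Europe", "Emerging"),
    ("Asia", "Developing"), ("Asia", "Developed"),
]

n = len(pairs)
counts = Counter(pairs)                               # contingency table
continent_totals = Counter(c for c, _ in pairs)       # marginal counts

# For each tile: (share of the continent, share of the category within that continent)
tiles = {
    (continent, category): (continent_totals[continent] / n,
                            k / continent_totals[continent])
    for (continent, category), k in counts.items()
}

for (continent, category), (share, cond_share) in tiles.items():
    print(f"{continent:<7}{category:<11}"
          f"share={share:.3f}  conditional share={cond_share:.3f}  "
          f"area={share * cond_share:.3f}")
```

If the two variables were unrelated, the conditional shares would be about the same in every continent; visibly different tile proportions suggest a relationship.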

Qualitative vs qualitative

Graph: Grouped Barplot

  • A grouped barplot is an alternative way to show the relationship between two qualitative variables.

  • Different = Influenced = Related.

Code
fig = px.histogram(data2007, x="continent", color="gdp_category", barmode='group')
fig.update_layout(width=500, height=350, title='Grouped Barplot of continent vs GDP Category')
fig.show()

Multiple information

Color, Shape, Size and 3D Graph

  • Color: can represent both quantitative data (as a color gradient) and qualitative data (as discrete colors).

  • Shape: can represent qualitative variables.

  • Size: can represent quantitative variables.

  • 3D Graph: often used to represent the relationship among 3 quantitative variables.

Code
fig = px.scatter(data2007, x="gdpPercap", y="lifeExp", color="continent", size="pop", hover_name="country", size_max=50)
fig.update_layout(width=500, height=450, title='Life Expectancy vs GDP per Capita,<br> Population and Continent')
fig.update_xaxes(type="log")
fig.show()

Time series data

Quantitative data

Graph: Lineplot

  • Let’s look at Cambodia from 1952 to 2007.
Code
cam_df = gapminder.loc[gapminder.country == "Cambodia"]
fig = px.line(cam_df, x='year', y='lifeExp', title="Evolution of Life Expectancy of Cambodia")
fig.update_layout(height=350, width=330)
fig.show()
Code
fig = px.line(cam_df, x='year', y='pop', title="Evolution of Cambodian population")
fig.update_layout(height=350, width=330)
fig.show()
Code
fig = px.line(cam_df, x='year', y='gdpPercap', title="Evolution of GDP per Capita of Cambodia")
fig.update_layout(height=350, width=330)
fig.show()

Quantitative data

Graph: Lineplot

  • Now, take a look at the world from 1952 to 2007.
Code
plt.figure(figsize=(5,2.5)) 
sns.lineplot(gapminder, x='year', y='lifeExp', hue="continent", legend=False)
plt.title('Global Life Expectancy Evolution')
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.show()

Code
plt.figure(figsize=(8,2.6)) 
sns.lineplot(gapminder, x='year', y='pop', hue="continent")
plt.title('Global Population Evolution')
plt.xlabel('Year')
plt.ylabel('Population')
plt.yscale("log")
plt.legend(title='Continent', loc='lower center', ncol=5, bbox_to_anchor=(0.5, -0.5))
plt.show()

Code
plt.figure(figsize=(5,2.5)) 
sns.lineplot(gapminder, x='year', y='gdpPercap', hue="continent", legend=False)
plt.title('Global GDP per Capita Evolution')
plt.xlabel('Year')
plt.ylabel('GDP per Capita ($)')
plt.show()

Animated Graph

  • The world evolution from 1952 to 2007.
Code
# One frame per year; bubble size encodes population, color encodes continent
fig = px.scatter(gapminder, x="gdpPercap", y="lifeExp", animation_frame="year",
           size="pop", color="continent", hover_name="country",
           log_x=True, size_max=60, range_x=[100,100000], range_y=[25,90])
fig.update_layout(title="The world evolution from 1952 to 2007", 
                  width=1000, height=450)
fig.show()

Telling a story
& Making a point

Whole story needs more than just graphs

  • Context: Gapminder captures the economic and health conditions of the entire world from 1952 to 2007.
    • Global Changes: How has the world changed over these 55 years?
    • Direction of Evolution: What direction is this evolution taking?
    • Our Role: How can we prepare for or contribute to this ongoing evolution?
  • Telling the story: analysis steps
    • Understand Key Variables: Begin by getting a grip on the essential metrics.
    • Make Connections: Use correlation matrices and graphs to highlight relationships.
    • Investigate: Analyze data from a local (yearly) perspective and then expand to uncover global trends.
    • Illustration: Always pair numbers with clear, comprehensible graphs.
  • Conclusion:
    • Address the key questions posed in the context.
    • Discuss the limitations of the analysis, if there are any.
    • Mention areas where further research or additional data might provide a more complete picture.

🥳 Yeahhhh 🥂

Any questions?