Data Visualization


INF-604: Data Analysis

Lecturer: Dr. Sothea HAS

Outline

  • Motivation


  • Bivariate Visualization


  • Multivariate Visualization


  • Time series data


  • Animated charts/graphs

Motivation

Motivation

Gapminder dataset (1704, 6)

  • This dataset captures the world’s evolution from \(1952\) to \(2007\).
  • Now, take a look at the data from year \(2007\).
Code
from gapminder import gapminder
data2007 = gapminder[gapminder.year == 2007]  # filter to year 2007
data2007.iloc[:5,:].drop(columns=['year']).style.hide()
country      continent  lifeExp    pop       gdpPercap
Afghanistan  Asia       43.828000  31889923  974.580338
Albania      Europe     76.423000  3600523   5937.029526
Algeria      Africa     72.301000  33333216  6223.367465
Angola       Africa     42.731000  12420476  4797.231267
Argentina    Americas   75.320000  40301927  12779.379640

Motivation

Gapminder dataset (1704, 6)

  • This dataset captures the world’s evolution from \(1952\) to \(2007\).
  • Now, take a look at the data from year \(2007\) (summary).
Code
quan_vars = ["pop", "lifeExp", "gdpPercap"]
# Keep only mean, std, min, median (50%) and max
data2007[quan_vars].describe().drop(index=["count", "25%", "75%"])
      pop           lifeExp    gdpPercap
mean  4.402122e+07  67.007423  11680.071820
std   1.476214e+08  12.073021  12859.937337
min   1.995790e+05  39.613000  277.551859
50%   1.051753e+07  71.935500  6124.371108
max   1.318683e+09  82.603000  49357.190170

Motivation

Gapminder dataset (1704, 6)

  • This dataset captures the world’s evolution from \(1952\) to \(2007\).
  • Now, take a look at the data from year \(2007\) (visualization).
Code
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=1, cols=3, 
              subplot_titles=("Boxplot of pop", "Violinplot of lifeExp", "Histogram of GDP Per Capita"))
fig.add_trace(go.Box(y=data2007['pop'], name="pop"), col=1, row=1)
fig.add_trace(go.Violin(y=data2007['lifeExp'], name="lifeExp"), row=1, col=2)
fig.add_trace(go.Histogram(x=data2007['gdpPercap'],
              name="gdpPercap"), row=1, col=3)
fig.update_layout(height=280, width=1000)
fig.update_yaxes(type="log", row=1, col=1)
fig.update_xaxes(title="Population", row=1, col=1)
fig.update_xaxes(title="Life Expectancy", row=1, col=2)
fig.update_xaxes(title="GDP Per Capita", row=1, col=3)
fig.show()
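
The population panel above uses a log-scaled axis. A minimal sketch of why that helps, using only the summary statistics printed in the earlier table (the variable names are illustrative, not part of the original code):

```python
import math

# Summary statistics of `pop` in 2007, copied from the describe() table
pop_min = 1.995790e+05
pop_median = 1.051753e+07
pop_mean = 4.402122e+07
pop_max = 1.318683e+09

# A mean several times larger than the median signals a strongly right-skewed variable
mean_to_median = pop_mean / pop_median

# Number of powers of ten between the smallest and largest country
orders_of_magnitude = math.log10(pop_max) - math.log10(pop_min)

print(f"mean / median       = {mean_to_median:.2f}")
print(f"orders of magnitude = {orders_of_magnitude:.2f}")
```

On a linear axis, a few very large countries would squeeze all the others near zero; a log axis spreads the values out more evenly.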

Bivariate Visualization

Bivariate Visualization

Quantitative vs quantitative: Scatterplot

  • A scatterplot shows the trend/relationship between a pair of quantitative variables.
  • Let’s visualize the relationship of gdpPercap with lifeExp and with pop.
Code
import plotly.graph_objects as go
import plotly.express as px
data2007 = gapminder.query("year == 2007")
fig1 = px.scatter(data2007, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig1.update_traces(marker=dict(size=10))
fig1.update_layout(height=350, width=500, title="The world GDP vs LifeExp in 2007")
fig1.show()
Code
data2007 = gapminder.query("year == 2007")
fig2 = px.scatter(data2007, x="gdpPercap", y="pop", hover_name="country", opacity=0.7)
fig2.update_traces(marker=dict(size=10))
fig2.update_layout(height=350, width=500, title="The world GDP vs Population in 2007")
fig2.show()

Bivariate Visualization

Quantitative vs quantitative: Scatterplot

  • A scatterplot shows the trend/relationship between a pair of quantitative variables.
  • Let’s visualize the relationship of gdpPercap with lifeExp and with pop.
Code
fig1.update_layout(title="The world (log) GDP vs LifeExp in 2007")
fig1.update_xaxes(type="log")
fig1.show()
Code
fig2.update_layout(title="The world GDP vs (log) Population in 2007")
fig2.update_yaxes(type="log")
fig2.show()

Bivariate Visualization

Quantitative vs quantitative: Scatterplot

  • GDP vs Life Expectancy:
    • General trend: Countries with high GDP per capita tend to have higher life expectancy.
    • A few countries have GDP per capita well above average yet still have poor life expectancy.


  • GDP vs Population:
    • General trend: no clear trend!
    • GDP per capita does not appear to be significantly influenced by a country’s population size.
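
A visual trend can be backed by a number. The sketch below computes Pearson correlations on the five 2007 rows printed earlier in these slides; five rows are far too few to conclude anything, so treat it only as an illustration of the computation (on the full data one would use the corresponding columns of data2007):

```python
import numpy as np

# The five 2007 rows shown in the first table (NOT the full dataset)
gdp_percap = np.array([974.580338, 5937.029526, 6223.367465, 4797.231267, 12779.379640])
life_exp = np.array([43.828, 76.423, 72.301, 42.731, 75.320])
population = np.array([31889923, 3600523, 33333216, 12420476, 40301927])

# Pearson correlation; the log10 transforms mirror the log axes used in the plots
r_gdp_life = np.corrcoef(np.log10(gdp_percap), life_exp)[0, 1]
r_gdp_pop = np.corrcoef(gdp_percap, np.log10(population))[0, 1]

print(f"corr(log10 GDP, lifeExp) = {r_gdp_life:.2f}")
print(f"corr(GDP, log10 pop)     = {r_gdp_pop:.2f}")
```

A value near +1 or -1 indicates a strong linear trend, while a value near 0 indicates no linear trend.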

Bivariate Visualization

Quantitative vs qualitative: Conditional

  • To see how a quantitative variable differs across the groups of a qualitative one, we can use:
    • Conditional Boxplots: boxplots within different groups.
    • Conditional Histogram/Density.
Code
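# Order continents by first appearance after sorting countries by life expectancy
sorted_data = data2007.sort_values(by='lifeExp')
continent_order = sorted_data['continent'].unique().tolist()
fig = px.box(data2007, x="continent", y="lifeExp", hover_name="country", color="continent", category_orders={'continent': continent_order})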
fig.update_layout(title="Life Expectancy on each continent in 2007", height=300, width=450)
fig.show()
  • Key: If the distribution of the quantitative variable differs clearly between groups, the two variables are related.

  • Example:

    • The clear differences in lifeExp across continents suggest that the two variables are related.
    • continent is useful for predicting / explaining lifeExp.
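
The comparison that conditional boxplots make visually can also be done numerically. A minimal sketch, assuming only the five 2007 rows printed earlier (so the group sizes are tiny and the numbers are purely illustrative):

```python
import pandas as pd

# The five 2007 rows shown in the first table (illustration only)
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Angola", "Argentina"],
    "continent": ["Asia", "Europe", "Africa", "Africa", "Americas"],
    "lifeExp": [43.828, 76.423, 72.301, 42.731, 75.320],
})

# Median life expectancy per continent: the centre line of each box
group_medians = sample.groupby("continent")["lifeExp"].median().sort_values()
print(group_medians)
```

Applied to the full data2007 table, medians that differ widely between groups would correspond to boxes sitting at clearly different heights.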

Bivariate Visualization

Quantitative vs qualitative: Conditional

  • To see how a quantitative variable differs across the groups of a qualitative one, we can use:
    • Conditional Boxplots: boxplots within different groups.
    • Conditional Histogram/Density.
Code
import plotly.figure_factory as ff
group_labels = list(data2007.continent.unique())
hist_data = [data2007.lifeExp[data2007.continent == x] for x in group_labels]
colors = ["#f1ab17", "#f13c26", "#9be155", "#4ab8dc", "#d567f3"]
fig = ff.create_distplot(hist_data, group_labels, colors=colors,
                         bin_size=1.5, show_rug=False)
fig.update_layout(title="Life Expectancy on each continent in 2007", height=300, width=450)
fig.show()
  • Key: If the distribution of the quantitative variable differs clearly between groups, the two variables are related.

  • Example:

    • The clear differences in lifeExp across continents suggest that the two variables are related.
    • continent is useful for predicting / explaining lifeExp.

Bivariate Visualization

Quantitative vs qualitative: Conditional

  • How about GDP on each continent?
Code
# Order continents by first appearance after sorting countries by GDP per capita
sorted_data = data2007.sort_values(by='gdpPercap')
continent_order = sorted_data['continent'].unique().tolist()
fig3 = px.box(data2007, x="continent", y="gdpPercap", hover_name="country", color="continent", category_orders={'continent': continent_order})
fig3.update_layout(title="GDP per Capita on each continent in 2007", height=350, width=500)
fig3.show()
Code
# Same ordering as the boxplot, this time drawn as violins
sorted_data = data2007.sort_values(by='gdpPercap')
continent_order = sorted_data['continent'].unique().tolist()
fig3 = px.violin(data2007, x="continent", y="gdpPercap", hover_name="country", color="continent", category_orders={'continent': continent_order})
fig3.update_layout(title="GDP per Capita on each continent in 2007", height=350, width=500)
fig3.show()
  • Example:
    • The separation of GDP per capita between continents is not as clear as for life expectancy, yet the differences are still visible.
    • continent is useful for predicting / explaining gdpPercap, though not as strongly/clearly as for lifeExp.

Bivariate Visualization

Qualitative vs qualitative: mosaic plot

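  • We don’t have many qualitative columns,
    so we group GDP per capita into three categories by its quantiles:
    • If GDP per capita \(\leq\) the \(33.33\%\) quantile 👉 Developing
    • elif GDP per capita \(\leq\) the \(66.66\%\) quantile 👉 Emerging
    • else (GDP per capita \(>\) the \(66.66\%\) quantile) 👉 Developed.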
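
The three-way grouping above can be sketched with pandas' qcut. The example assumes only the five 2007 rows printed earlier, so the cut points (and therefore the labels) are relative to these five countries, not to the full dataset:

```python
import pandas as pd

# GDP per capita of the five 2007 rows shown in the first table (illustration only)
gdp = pd.Series(
    [974.580338, 5937.029526, 6223.367465, 4797.231267, 12779.379640],
    index=["Afghanistan", "Albania", "Algeria", "Angola", "Argentina"],
    name="gdpPercap",
)

# The two cut points: the 1/3 and 2/3 quantiles of the values
cut_points = gdp.quantile([1 / 3, 2 / 3])

# qcut assigns each value to the quantile bin it falls into
gdp_category = pd.qcut(gdp, q=3, labels=["Developing", "Emerging", "Developed"])

print(cut_points)
print(gdp_category)
```

Because the bins are defined by quantiles, each category receives roughly one third of the countries, whatever the actual GDP values are.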

Bivariate Visualization

Qualitative vs qualitative: mosaic plot

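  • We don’t have many qualitative columns,
    so we group GDP per capita into three categories by its quantiles:
    • If GDP per capita \(\leq\) the \(33.33\%\) quantile 👉 Developing
    • elif GDP per capita \(\leq\) the \(66.66\%\) quantile 👉 Emerging
    • else (GDP per capita \(>\) the \(66.66\%\) quantile) 👉 Developed.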
  • Example:
    • GDP per capita appeared to be related to continent, and the same holds for the categorical GDP.
    • In Asia, the three types of economic conditions are well balanced, whereas the majority of African countries are developing, followed by emerging economies.
Code
from statsmodels.graphics.mosaicplot import mosaic
import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots(figsize=(7, 5))
plt.rcParams.update({'font.size': 15})
def prop(key):
    if "Asia" in key:
        return {'color': '#51cb4b'}
    if "Africa" in key:
        return {'color': '#e35441'}
    if "Americas" in key:
        return {'color': '#41b4e3'}
    if "Europe" in key:
        return {'color': '#dda63e'}
    if "Oceania" in key:
        return {'color': '#b374df'}

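# Bin GDP per capita into terciles; assign() avoids modifying a filtered slice in place
data2007 = data2007.assign(gdp_category=pd.qcut(data2007['gdpPercap'], q=3, labels=['Developing', 'Emerging', 'Developed']))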
mosaic(data2007.sort_values('continent'), ['continent','gdp_category'], 
    gap=0.01, properties = prop, 
    label_rotation=30, ax=ax)
plt.title("Mosaicplot of categorical GDP vs Continent")
plt.show()
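
A mosaic plot is a picture of a contingency table: each tile's area is proportional to the count in one (continent, category) cell. A minimal sketch of that underlying table, using hypothetical toy labels rather than the gapminder values:

```python
import pandas as pd

# Hypothetical toy data (NOT the gapminder values): one row per country
toy = pd.DataFrame({
    "continent": ["Africa", "Africa", "Africa", "Asia", "Asia", "Asia", "Europe", "Europe"],
    "gdp_category": ["Developing", "Developing", "Emerging",
                     "Developing", "Emerging", "Developed",
                     "Developed", "Developed"],
})

# Number of countries in each (continent, category) cell
counts = pd.crosstab(toy["continent"], toy["gdp_category"])

# Share of each category within a continent (every row sums to 1)
row_props = pd.crosstab(toy["continent"], toy["gdp_category"], normalize="index")

print(counts)
print(row_props.round(2))
```

With the real data, the same two calls on data2007["continent"] and data2007["gdp_category"] would give the numbers behind the mosaic and the bar plots.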

Bivariate Visualization

Qualitative vs qualitative: grouped barplots

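  • We don’t have many qualitative columns,
    so we group GDP per capita into three categories by its quantiles:
    • If GDP per capita \(\leq\) the \(33.33\%\) quantile 👉 Developing
    • elif GDP per capita \(\leq\) the \(66.66\%\) quantile 👉 Emerging
    • else (GDP per capita \(>\) the \(66.66\%\) quantile) 👉 Developed.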
  • Example:
    • GDP per capita appeared to be related to continent, and the same holds for the categorical GDP.
    • In Asia, the three types of economic conditions are well balanced, whereas the majority of African countries are developing, followed by emerging economies.
Code
fig = px.histogram(
    data2007, x="continent", color="gdp_category")
fig.update_layout(width=510, height=470, 
    title='Stacked Barplot of Categorical GDP vs Continent')
fig.show()
Code
fig = px.histogram(
    data2007, x="continent", color="gdp_category", barmode='group')
fig.update_layout(width=510, height=470, title='Grouped Barplot of Categorical GDP vs Continent')
fig.show()

Multiple Information

Multiple Information

Color: quantitative & qualitative

  • Color can represent:
    • qualitative data (discrete color).
    • quantitative data (as a color gradient).
Code
import numpy as np
data2007[' '] = np.repeat('Data', data2007.shape[0])
fig = px.scatter(
    data2007, x="gdpPercap", y="lifeExp",
    hover_name="country", size_max=80, color=" ")
fig.update_layout(width=472, height=400, title='Life Expectancy vs GDP per Capita')
fig.update_xaxes(type="log")
fig.show()

Multiple Information

Color: quantitative & qualitative

  • Color can represent:
    • qualitative data (discrete color).
    • quantitative data (as a color gradient).
  • Example:
    • Color = continent, which is a categorical column.
Code
fig = px.scatter(
    data2007, x="gdpPercap", y="lifeExp", color="continent", hover_name="country", size_max=80)
fig.update_layout(width=500, height=400, title='Life Expectancy vs GDP per Capita & Continent')
fig.update_xaxes(type="log")
fig.show()

Multiple Information

Color: quantitative & qualitative

  • Color can represent:
    • qualitative data (discrete color).
    • quantitative data (as a color gradient).
  • Example:
    • Color = continent, which is a categorical column.
    • Color = lifeExp, which is a quantitative column.
Code
fig = px.scatter(
    data2007, x="gdpPercap", y="lifeExp", color="lifeExp", hover_name="country", size_max=80)
fig.update_layout(width=500, height=240, title='Life Expectancy vs GDP per Capita (color = Life Expectancy)')
fig.update_xaxes(type="log")
fig.show()

Multiple Information

Shape/Symbol: qualitative

  • Shape (symbol) can represent qualitative data.

  • Example:

    • Symbol = gdp_category.
    • Color = continent.

Combining numerous colors and symbols can complicate a graph.

Use them carefully and only when appropriate.

Code
fig = px.scatter(
    data2007, x="gdpPercap", y="lifeExp", color="continent", 
    hover_name="country", symbol='gdp_category', size_max=80)
fig.update_layout(width=500, height=350, title='Life Expectancy vs GDP per Capita & Continent')
fig.update_xaxes(type="log")
fig.show()

Multiple Information

Size: quantitative

  • Size can represent quantitative data.

  • Example:

    • Size = pop.
    • Color = continent.

Color and size are commonly combined, and the resulting graph is often called a bubble chart.

One should choose a suitable maximum size to get a readable graph.

Code
fig = px.scatter(
    data2007, x="gdpPercap", y="lifeExp", color="continent",
    hover_name="country", size="pop", size_max=35)
fig.update_layout(width=500, height=370, title='Life Expectancy, GDP, Population & Continent')
fig.update_xaxes(type="log")
fig.show()
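
To see what the maximum size controls, here is a minimal sketch assuming area-proportional scaling, i.e. the bubble's area (not its diameter) grows with the value. This is the usual bubble-chart convention and, to my understanding, the Plotly Express default; the helper bubble_diameter is hypothetical and not a Plotly function:

```python
import math

size_max = 35               # same value as in the chart above
pop_max = 1.318683e+09      # largest 2007 population (from the summary table)
pop_afghanistan = 31889923  # from the first table

def bubble_diameter(value, value_max, size_max):
    """Diameter in pixels, assuming the marker AREA is proportional to `value`."""
    return size_max * math.sqrt(value / value_max)

d_max = bubble_diameter(pop_max, pop_max, size_max)
d_afg = bubble_diameter(pop_afghanistan, pop_max, size_max)

print(f"largest country: {d_max:.1f} px, Afghanistan: {d_afg:.1f} px")
```

Under this assumption the largest country gets a diameter of size_max, and a country with a much smaller population still gets a visible marker because the square root compresses the range.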

Multiple Information

3D: quantitative

  • All the previous options can be used with a 3D scatter plot.

  • Example: Marketing

    • X = Youtube.
    • Y = Facebook.
    • Z = Sales.
    • Size = Newspaper.
    • Color = Newspaper.

Avoid 3D plots if they are not interactive [see the section “Don’t go 3D” by Claus O. Wilke (2019)].

Time series data

Time series data

Quantitative: lineplot

  • Let’s take a look at Oceania from 1952 to 2007.
Code
df_ocean = gapminder.query("continent == 'Oceania'")
fig = px.line(df_ocean, x='year', y='lifeExp',
    symbol="country", color="country",
    title="Evolution of Life Expectancy")
fig.update_layout(height=350, width=330)
fig.show()
Code
fig = px.line(df_ocean, x='year', y='pop',
    symbol="country", color="country",
    title="Evolution of Population")
fig.update_layout(height=350, width=330)
fig.show()
Code
fig = px.line(df_ocean, x='year', y='gdpPercap',
    symbol="country", color="country",
    title="Evolution of GDP per Capita")
fig.update_layout(height=350, width=330)
fig.show()
  • Is there any other country’s evolution you would like to see?

Time series data

Quantitative: lineplot

  • Is there any other country’s evolution you would like to see?

Time series data

Qualitative: Evolution barplots

  • Let’s take a look at the evolution of GDP per capita categories for Asian countries over time.
Code
def cat_gdp(yearly_data):
    return pd.qcut(yearly_data, q=3, labels=['Developing', 'Emerging', 'Developed'])
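df = gapminder.copy()  # work on a copy so the original table stays untouched
# Categorize GDP per capita within each year (cut points are recomputed per year)
df['GDP_Category'] = df.groupby('year', group_keys=True)['gdpPercap'].apply(cat_gdp).reset_index(level=0, drop=True)

df_asia = df.query("continent == 'Asia'")
# Count the Asian countries in each category for every year
df_agg = df_asia.groupby(['year', 'GDP_Category'], observed=True).size().reset_index(name='Count')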

# Create the stacked bar chart
fig = px.bar(
    df_agg, x='year', y='Count', 
    color='GDP_Category', barmode='stack',
    title="Evolution of Asian Countries' GDP Categories from 1952 to 2007",
    labels={'Count': 'Number of Countries', 'year': 'Year'})

fig.update_layout(height=350, width=1000)
fig.show()
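
Because the cut points are recomputed for each year, a category is relative to the other countries in that same year. A minimal sketch on hypothetical toy data (not gapminder values) makes this visible:

```python
import pandas as pd

# Hypothetical toy data: three countries observed in two years
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1952, 1952, 1952, 2007, 2007, 2007],
    "gdpPercap": [100.0, 200.0, 300.0, 900.0, 500.0, 700.0],
})
labels = ["Developing", "Emerging", "Developed"]

# Bin each year separately, then stitch the pieces back together;
# the assignment aligns on the original row index
pieces = [pd.qcut(g, q=3, labels=labels) for _, g in toy.groupby("year")["gdpPercap"]]
toy["GDP_Category"] = pd.concat(pieces).astype(str)

# One row per country, one column per year
print(toy.pivot(index="country", columns="year", values="GDP_Category"))
```

Country B doubles its GDP per capita between the two years and still moves down a category, because the other two countries grew faster.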

Animated Graphs

Animated Graphs

Animation with Plotly

Code
import plotly.express as px
df = px.data.gapminder()
fig_anime = px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
           size="pop", color="continent", hover_name="country",
           log_x=True, size_max=50, range_x=[100,100000], 
           range_y=[25,90])
fig_anime.update_layout(height=460, width=1000, 
    title="The Evolution of the World in a Single Graph")
fig_anime.show()

“There is no such thing as information overload. There is only bad design.” – Edward Tufte.

🥳 Yeahhhh….

Let’s Party… 🥂