from gapminder import gapminder
data2007 = gapminder[gapminder.year == 2007]  # filter to year 2007
data2007.iloc[:4, :].drop(columns=['year']).style.hide()
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=("Boxplot of pop", "Violinplot of lifeExp", "Histogram of GDP Per Capita"))
fig.add_trace(go.Box(y=data2007['pop'], name="pop"), col=1, row=1)
fig.add_trace(go.Violin(y=data2007['lifeExp'], name="lifeExp"), row=1, col=2)
fig.add_trace(go.Histogram(y=data2007['gdpPercap'], name="gdpPercap"), row=1, col=3)
fig.update_layout(height=300, width=1000)
fig.update_yaxes(type="log", row=1, col=1)
fig.update_xaxes(title="Population", row=1, col=1)
fig.update_xaxes(title="Life Expectancy", row=1, col=2)
fig.update_xaxes(title="GDP Per Capita", row=1, col=3)
fig.show()
Quantitative (numerical) data
Statistical values
Note
Mean (average)
Median
Variance
Standard deviation
Percentiles
Skewness
Kurtosis
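As a rough illustration of the statistical values listed above, here is a minimal sketch using only Python's standard library. The `sample` values are made up for the example and are not taken from gapminder; the skewness and kurtosis are computed as the standardized third and fourth moments (population versions), which is one common convention among several.

```python
import math
import statistics

# Hypothetical life-expectancy sample (years); NOT taken from gapminder.
sample = [43.8, 52.3, 59.7, 64.1, 70.2, 72.9, 75.6, 78.4, 80.7, 82.6]
n = len(sample)

mean = statistics.fmean(sample)
median = statistics.median(sample)
variance = statistics.pvariance(sample)  # population variance (divides by n)
std = math.sqrt(variance)                # standard deviation

# Quartiles, i.e. the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")

# Standardized third and fourth moments (assumed convention: population moments).
skewness = sum((x - mean) ** 3 for x in sample) / (n * std ** 3)
kurtosis = sum((x - mean) ** 4 for x in sample) / (n * std ** 4)

print(f"mean={mean:.2f} median={median:.2f} var={variance:.2f} std={std:.2f}")
print(f"quartiles=({q1:.2f}, {q2:.2f}, {q3:.2f})")
print(f"skewness={skewness:.2f} kurtosis={kurtosis:.2f}")
```

With a pandas DataFrame such as `data2007`, the `describe()` method reports several of these values at once.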
Quantitative (numerical) data
Graph: Histogram
Note
Bins: the intervals into which the range of values is divided, one per bar.
Binwidth: the width of each interval.
Bars: the height of each bar is the count of points falling within its bin.
Mathematically, if \(B_x\) is the bin containing \(x\in\mathbb{R}\) then \[\text{hist}(x)=\sum_{i=1}^n\mathbb{1}_{\{x_i\in B_x\}}.\]
Code
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")
plt.figure(figsize=(8, 4))
ax = sns.histplot(data2007, x="lifeExp", binwidth=3)
ax.set_title("Histogram of Life Expectancy")
plt.show()
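The counting formula for the histogram can be sketched directly in plain Python. This is only an illustration: the helper `hist`, its `origin` parameter, and the `values` list are assumptions made for the example, and the bins here start at `origin` rather than at wherever a plotting library would place its bin edges.

```python
import math

def hist(x, data, binwidth, origin=0.0):
    """Count the points of `data` that fall in the bin containing `x`.

    Bins are the half-open intervals
    [origin + k * binwidth, origin + (k + 1) * binwidth).
    """
    k = math.floor((x - origin) / binwidth)  # index of the bin B_x
    return sum(1 for xi in data if math.floor((xi - origin) / binwidth) == k)

# Hypothetical life-expectancy values (years), for illustration only.
values = [43.8, 52.3, 59.7, 60.1, 61.9, 70.2, 72.9, 75.6, 78.4, 80.7]

# 61.0 lies in the bin [60, 63), which also contains 60.1 and 61.9.
print(hist(61.0, values, binwidth=3))
```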
Quantitative (numerical) data
Graph: Kernel Density Estimation (KDE)
Note
A KDE smooths the bars of a histogram into a continuous curve.
Mathematically, if \(K\) is a kernel function, for instance the Gaussian kernel \(K(t)=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}\), and \(h>0\) is the bandwidth (smoothing parameter), then for any \(x\in\mathbb{R}\): \[\hat{f}(x)=\frac{1}{nh}\sum_{i=1}^nK\Big(\frac{x-x_i}{h}\Big).\]
Code
plt.figure(figsize=(8, 4))
ax = sns.histplot(data2007, x="lifeExp", kde=True, linewidth=3, binwidth=3)
ax.set_title("Histogram & density of Life Expectancy")
plt.show()
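To make the KDE formula concrete, here is a minimal sketch that evaluates it by hand with a Gaussian kernel. The function names, the `values` list, and the bandwidth `h=3.0` are assumptions chosen for illustration; plotting libraries typically select the bandwidth automatically.

```python
import math

def gaussian_kernel(t):
    """Gaussian kernel K(t) = exp(-t^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Kernel density estimate at x: (1 / (n * h)) * sum of K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

# Hypothetical life-expectancy values (years), for illustration only.
values = [43.8, 52.3, 59.7, 60.1, 61.9, 70.2, 72.9, 75.6, 78.4, 80.7]

# Evaluate the estimated density on a coarse grid of ages.
for x in range(40, 90, 10):
    print(x, round(kde(x, values, h=3.0), 4))
```

A larger `h` gives a smoother, flatter curve; a smaller `h` follows the individual points more closely.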
Anything interesting from these 3 correlation matrices?
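For reference, a correlation matrix collects the pairwise correlation coefficients between quantitative variables. The sketch below computes Pearson correlations by hand on made-up numbers; the `columns` values are hypothetical and do not reproduce the matrices discussed here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values for five countries; NOT taken from gapminder.
columns = {
    "lifeExp": [48.3, 59.7, 70.6, 76.4, 80.7],
    "gdpPercap": [700.0, 1700.0, 5000.0, 12000.0, 36000.0],
    "pop": [12.0, 80.0, 1300.0, 45.0, 60.0],  # in millions
}

names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a.ljust(10), [round(matrix[a][b], 2) for b in names])
```

With pandas, the `corr()` method of a DataFrame returns the same kind of matrix for its numeric columns.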
Quantitative vs quantitative
Graph: Scatterplot
Code
import plotly.graph_objects as go
import plotly.express as px

data1952 = gapminder[gapminder.year == 1952]  # assumed filter, mirroring data2007
fig1 = px.scatter(data1952, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig1.update_traces(marker=dict(size=10))
fig1.update_layout(height=400, width=330, title="The world in 1952")
fig1.show()
Code
data1982 = gapminder[gapminder.year == 1982]  # assumed filter, mirroring data2007
fig2 = px.scatter(data1982, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig2.update_traces(marker=dict(size=10))
fig2.update_layout(height=400, width=330, title="The world in 1982")
fig2.show()
Code
fig3 = px.scatter(data2007, x="gdpPercap", y="lifeExp", hover_name="country", opacity=0.7)
fig3.update_traces(marker=dict(size=10))
fig3.update_layout(height=400, width=330, title="The world in 2007")
fig3.show()
Quantitative vs quantitative
Graph: Scatterplot
Code
fig1.update_xaxes(type="log")
fig1.show()
Code
fig2.update_xaxes(type="log")
fig2.show()
Code
fig3.update_xaxes(type="log")
fig3.show()
Better? This is the power of "log" scaling!
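One way to see why the log scale helps: on a linear axis, equal ratios take up very unequal space, while on a log axis every tenfold increase takes up the same distance. A minimal sketch, with hypothetical GDP per capita values chosen for illustration:

```python
import math

# Hypothetical GDP per capita values, each ten times the previous one.
gdp = [300, 3_000, 30_000]

# On a linear axis, the gaps between consecutive values differ by a factor of 10.
linear_gaps = [b - a for a, b in zip(gdp, gdp[1:])]

# On a log10 axis, the same values are evenly spaced.
log_positions = [math.log10(v) for v in gdp]
log_gaps = [b - a for a, b in zip(log_positions, log_positions[1:])]

print("linear gaps:", linear_gaps)
print("log10 gaps: ", [round(g, 6) for g in log_gaps])
```

Poor countries that are squeezed against the left edge of the linear plot are therefore spread out on the log axis.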
Quantitative vs qualitative
Graph: Conditional distribution
Does the distribution of the quantitative variable differ across the categories of the qualitative variable?
Different = Influenced = Related.
Just use what we have learned:
Pick one graph for quantitative data,
then draw it separately for each group of the qualitative variable.
Code
sorted_data = data2007.sort_values(by='lifeExp')
fig = px.box(data2007, x="continent", y="lifeExp", hover_name="country", color="continent",
             category_orders={'continent': sorted_data['continent']})
fig.update_layout(title="Life Expectancy on each continent in 2007", height=350, width=450)
fig.show()
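The same comparison can be made numerically by summarizing the quantitative variable within each category. The sketch below groups made-up life-expectancy values by continent and compares the group medians; the `rows` data are hypothetical, not gapminder values.

```python
from collections import defaultdict
import statistics

# Hypothetical (continent, life expectancy) pairs; NOT taken from gapminder.
rows = [
    ("Africa", 46.2), ("Africa", 52.9), ("Africa", 56.7),
    ("Asia", 64.7), ("Asia", 70.6), ("Asia", 74.2),
    ("Europe", 76.4), ("Europe", 79.4), ("Europe", 80.9),
]

# Collect the quantitative values of each category.
groups = defaultdict(list)
for continent, life_exp in rows:
    groups[continent].append(life_exp)

# One summary per category: clearly different medians suggest a relationship.
medians = {continent: statistics.median(vals) for continent, vals in groups.items()}

for continent, med in sorted(medians.items(), key=lambda item: item[1]):
    print(continent.ljust(8), med)
```

With pandas, a `groupby` on the qualitative column followed by `median()` gives the same per-group summary.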
Qualitative vs qualitative
Graph: Mosaic plot
We grouped gdpPercap into 3 classes:
Developing
Emerging
Developed
Does the distribution of the categories of the 1st qualitative variable differ across the categories of the 2nd qualitative variable?
A grouped barplot is an alternative graph for showing the connection between two qualitative variables.
Different = Influenced = Related.
Code
fig = px.histogram(data2007, x="continent", color="gdp_category", barmode='group')
fig.update_layout(width=500, height=350, title='Grouped Barplot of continent vs GDP Category')
fig.show()
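The plot relies on a `gdp_category` column whose construction is not shown. One possible way to build such a grouping, and the table of counts that a grouped barplot displays, is sketched below. The thresholds `LOW` and `HIGH`, the helper `gdp_category`, and the `rows` data are all hypothetical; the cut-offs actually used for the three classes are not given.

```python
from collections import Counter

# Hypothetical cut-offs in dollars; the thresholds actually used are not stated.
LOW, HIGH = 2_000, 12_000

def gdp_category(gdp_percap):
    """Map a GDP per capita value to one of three classes (assumed thresholds)."""
    if gdp_percap < LOW:
        return "Developing"
    if gdp_percap < HIGH:
        return "Emerging"
    return "Developed"

# Hypothetical (continent, GDP per capita) pairs; NOT taken from gapminder.
rows = [
    ("Africa", 900), ("Africa", 4_500),
    ("Asia", 1_500), ("Asia", 25_000),
    ("Europe", 9_000), ("Europe", 33_000), ("Europe", 40_000),
]

# Contingency table: number of countries per (continent, category) pair.
table = Counter((continent, gdp_category(gdp)) for continent, gdp in rows)

for (continent, category), count in sorted(table.items()):
    print(continent.ljust(8), category.ljust(11), count)
```

With pandas, `pd.cut` can create the category column and `pd.crosstab` can produce the table of counts.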
Multiple information
Color, Shape, Size and 3D Graph
Color: can represent both quantitative data (as a gradient) and qualitative data (as discrete colors).
Shape: can represent qualitative variables.
Size: can represent quantitative variables.
3D Graph: often used to represent the relationship between 3 quantitative variables.
Code
fig = px.scatter(data2007, x="gdpPercap", y="lifeExp", color="continent", size="pop",
                 hover_name="country", size_max=50)
fig.update_layout(width=500, height=450,
                  title='Life Expectancy vs GDP per Capita,<br> Population and Continent')
fig.update_xaxes(type="log")
fig.show()
Time series data
Quantitative data
Graph: Lineplot
Let's look at Cambodia from 1952 to 2007.
Code
cam_df = gapminder.loc[gapminder.country == "Cambodia"]
fig = px.line(cam_df, x='year', y='lifeExp', title="Evolution of Life Expectancy of Cambodia")
fig.update_layout(height=350, width=330)
fig.show()
Code
fig = px.line(cam_df, x='year', y='pop', title="Evolution of Cambodian population")
fig.update_layout(height=350, width=330)
fig.show()
Code
fig = px.line(cam_df, x='year', y='gdpPercap', title="Evolution of GDP per Capita of Cambodia")
fig.update_layout(height=350, width=330)
fig.show()
Quantitative data
Graph: Lineplot
Now, take a look at the world from 1952 to 2007.
Code
plt.figure(figsize=(5, 2.5))
sns.lineplot(gapminder, x='year', y='lifeExp', hue="continent", legend=False)
plt.title('Global Life Expectancy Evolution')
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.show()
plt.figure(figsize=(5, 2.5))
sns.lineplot(gapminder, x='year', y='gdpPercap', hue="continent", legend=False)
plt.title('Global GDP per Capita Evolution')
plt.xlabel('Year')
plt.ylabel('GDP per Capita ($)')
plt.show()
Animated Graph
The world evolution from 1952 to 2007.
Code
fig = px.scatter(gapminder, x="gdpPercap", y="lifeExp", animation_frame="year", size="pop",
                 color="continent", hover_name="country", log_x=True, size_max=60,
                 range_x=[100, 100000], range_y=[25, 90])
fig.update_layout(title="The world evolution from 1952 to 2007", width=1000, height=450)
fig.show()  # display explicitly, as in the other examples
Telling a story & Making a point
A whole story needs more than just graphs
Context: Gapminder captures the economic and health conditions of the entire world from 1952 to 2007.
Global Changes: How has the world changed over these 55 years?
Direction of Evolution: What direction is this evolution taking?
Our Role: How can we prepare for or contribute to this ongoing evolution?
Telling a story: Analysis steps
Understand Key Variables: Begin by getting a grip on the essential metrics.
Make Connections: Use correlation matrices and graphs to highlight relationships.
Investigate: Analyze data from a local (yearly) perspective and then expand to uncover global trends.
Illustration: Always pair numbers with clear, comprehensible graphs.
Conclusion:
Address the key questions posed in the context.
Discuss the limitations of the analysis, if there are any.
Mention areas where further research or additional data might provide a more complete picture.