{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Lab3: Data Visualization**\n", "\n", "**Course**: **CSCI-866-001: Data Mining & Knowledge Discovery**\n", "\n", "**Lecturer**: **Sothea HAS, PhD**\n", "\n", "-----\n", "\n", "**Objective:** In this lab, you will apply the data visualization techniques you have studied to a real dataset. Each graph you create should be complete and easy to understand, with a title, a legend, and any other information that helps the audience read it. An effective graph should take viewers no longer than 15 seconds to understand.\n", "\n", "- The `notebook` of this `Lab` can be downloaded here: [Lab3_Data_Visualization.ipynb](https://hassothea.github.io/Data_Mining_AUPP/Labs/Lab3_Data_Visualization.ipynb).\n", "\n", "- Or you can work directly with `Google Colab` here: [Lab3_Data_Visualization.ipynb](https://colab.research.google.com/drive/1vwdOK8eApO2OxY7IJA6up2vxGnPJ-a8W?usp=sharing).\n", "\n", "\n", "-----\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. `Gapminder` dataset\n", "\n", "[`Gapminder`](https://pypi.org/project/gapminder/) is an excerpt of the data available at [https://gapminder.org](https://www.gapminder.org/). For each of $142$ countries (`country`), the package provides values for life expectancy (`lifeExp`), GDP per capita (`gdpPercap`), and population (`pop`), every five years, from $1952$ to $2007$ (`year`). It was originally used in [Jennifer Bryan's excellent `gapminder` teaching package for R](https://github.com/jennybc/gapminder/) ([`ggplot`](https://ggplot2.tidyverse.org/), [`tidyverse`](https://www.tidyverse.org/), and more). For more information about `gapminder`: \n", "\n", "- Documentation: [https://www.gapminder.org/data/documentation/](https://www.gapminder.org/data/documentation/)\n", "- A short demonstration video is available here: [Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four](https://youtu.be/jbkSRLYSojo?si=qipg08VIi999hEgo)." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* Number of observations: 1704\n" ] }, { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>country</th><th>continent</th><th>year</th><th>lifeExp</th><th>pop</th><th>gdpPercap</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>1424</th><td>Spain</td><td>Europe</td><td>1992</td><td>77.570</td><td>39549438</td><td>18603.064520</td></tr>\n", "<tr><th>269</th><td>Chad</td><td>Africa</td><td>1977</td><td>47.383</td><td>4388260</td><td>1133.984950</td></tr>\n", "<tr><th>135</th><td>Bolivia</td><td>Americas</td><td>1967</td><td>45.032</td><td>4040665</td><td>2586.886053</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " country continent year lifeExp pop gdpPercap\n", "1424 Spain Europe 1992 77.570 39549438 18603.064520\n", "269 Chad Africa 1977 47.383 4388260 1133.984950\n", "135 Bolivia Americas 1967 45.032 4040665 2586.886053" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# useful packages\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# %pip install gapminder  (uncomment and run if gapminder is not installed yet)\n", "from gapminder import gapminder\n", "print(f\"* Number of observations: {gapminder.shape[0]}\")\n", "gapminder.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------\n", "\n", "## **A. Variable types**\n", "\n", "- State the dimensions of the dataset. \n", "- Which variables are quantitative and which are qualitative? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **B. Year 1952**\n", "\n", "### **B.1. Quantitative vs quantitative**\n", "\n", "- Create a subset called `data1952` that contains only the data for the year 1952.\n", "- Visualize the relationship between `gdpPercap` and `lifeExp` in 1952.\n", "- Visualize the relationship between `gdpPercap` and `pop` in 1952.\n", "- Visualize the relationship between `lifeExp` and `pop` in 1952.\n", "- Do they look different from the same graphs for the year 2007?\n", "\n", "`Hint`: You can produce the same graphs as shown in the course using the `Plotly` package, available here: [plotly python](https://plotly.com/python/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt \n", "import seaborn as sns\n", "import plotly.graph_objects as go # for interactive graphs\n", "import plotly.express as px # for interactive graphs\n", "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- According to the data, in 1952:\n", "    - Which country was the richest? 
\n", "    - Which country was the poorest?\n", "    - Which country was the healthiest?\n", "    - Which country was the unhealthiest?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Repeat the previous question for the year 1977. Before computing, can you guess which country had the lowest life expectancy around that year?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **B.2. Quantitative vs qualitative**\n", "\n", "- We observed differences in health conditions across continents in 2007. Was this also the case in 1952? Please visualize your findings.\n", "- What about the economy? Visualize and explain your results." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **B.3. Qualitative vs qualitative**\n", "\n", "Qualitative variables are simpler than quantitative ones, as we primarily focus on the proportion or frequency of each category. In our dataset, the existing qualitative variables are not suitable for analysis because they are repeated each year. Therefore, we will create a new qualitative variable from `lifeExp`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Add a column `lifeExpQual` to `data1952` containing three categories, [**\"unhealthy\"**, **\"moderate\"**, **\"healthy\"**], obtained by splitting `lifeExp` into 3 classes.\n", "\n", "> Hint: you may cheat using [slide 15](https://hassothea.github.io/Data_Analysis_AUPP/Slides/Data_Visualization.html#/bivariate-visualization-8). The function `pd.qcut` is helpful for such a task." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Graphically represent the relationship between `lifeExpQual` and `continent` in the year 1952.\n", "- Describe what you see." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **C. Time evolution**\n", "\n", "`Gapminder` captures global changes from $1952$ to $2007$. A single year hides these changes, so it is more insightful to examine how the variables evolve over time.\n", "\n", "### **C.1. Evolution of average `lifeExp` of the $5$ continents**\n", "\n", "1. Create a line plot of the average `lifeExp` for the five continents from $1952$ to $2007$ using `sns.lineplot`. What observations can you make from the plot?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **C.2. Other variables**\n", "\n", "- Repeat the same process with `pop`, and then with `gdpPercap`. Provide your comments for each case." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **C.3. Cambodia**\n", "\n", "- Plot the evolution of the three quantitative variables for **Cambodia**. What do you observe?" 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Further readings\n", "- Gapminder documentation: [https://www.gapminder.org/data/documentation/](https://www.gapminder.org/data/documentation/)\n", "- A short demonstration video is available here: [Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four](https://youtu.be/jbkSRLYSojo?si=qipg08VIi999hEgo).\n", "- Graphical tools:\n", "    - [`matplotlib`](https://matplotlib.org/stable/index.html)\n", "    - [`seaborn`](https://seaborn.pydata.org/)\n", "    - [`plotly`](https://plotly.com/python/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.5" } }, "nbformat": 4, "nbformat_minor": 2 }