{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **TP5 - EDA: Correlation Analysis**\n", "\n", "**Course**: **INF-604: Data Analysis**
\n", "**Lecturer**: **Sothea HAS, PhD**\n", "\n", "-----\n", "\n", "**Objective:** In this lab, you will apply correlation analysis to real examples. We will also explore the limitations of correlation analysis and what to watch out for when drawing conclusions from each type of correlation.\n", "\n", "- The `notebook` of this `Lab` can be downloaded here: [Lab5_EDA.ipynb](https://hassothea.github.io/Data_Analysis_AUPP/Labs/Lab5_EDA.ipynb).\n", "\n", "- Or you can work directly with `Google Colab` here: [Lab5_EDA.ipynb](https://colab.research.google.com/drive/1j48iYepkme3Mjr3LDNbXROl8b7lcSbjD?usp=sharing).\n", "\n", "\n", "-----\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>country</th><th>continent</th><th>year</th><th>lifeExp</th><th>pop</th><th>gdpPercap</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Afghanistan</td><td>Asia</td><td>1952</td><td>28.801</td><td>8425333</td><td>779.445314</td></tr>\n",
"    <tr><th>1</th><td>Afghanistan</td><td>Asia</td><td>1957</td><td>30.332</td><td>9240934</td><td>820.853030</td></tr>\n",
"    <tr><th>2</th><td>Afghanistan</td><td>Asia</td><td>1962</td><td>31.997</td><td>10267083</td><td>853.100710</td></tr>\n",
"    <tr><th>3</th><td>Afghanistan</td><td>Asia</td><td>1967</td><td>34.020</td><td>11537966</td><td>836.197138</td></tr>\n",
"    <tr><th>4</th><td>Afghanistan</td><td>Asia</td><td>1972</td><td>36.088</td><td>13079460</td><td>739.981106</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " country continent year lifeExp pop gdpPercap\n", "0 Afghanistan Asia 1952 28.801 8425333 779.445314\n", "1 Afghanistan Asia 1957 30.332 9240934 820.853030\n", "2 Afghanistan Asia 1962 31.997 10267083 853.100710\n", "3 Afghanistan Asia 1967 34.020 11537966 836.197138\n", "4 Afghanistan Asia 1972 36.088 13079460 739.981106" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#%pip install gapminder # This is for installing the package\n", "from gapminder import gapminder\n", "import pandas as pd\n", "import numpy as np\n", "gapminder.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Pearson and Spearman correlations\n", "\n", "**a.** Compute the Pearson correlation matrix of the three quantitative variables (`lifeExp`, `pop` and `gdpPercap`) for the years $1952$, $1987$ and $2007$ using the `DataFrame.corr()` method of `pandas`. Give a brief intuition of the relationship between these variables." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Description: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b.** Compute Spearman's rank correlation matrix of the same columns in 1952, 1987 and 2007. What do you observe?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**c.** From the previous results, pick the most interesting pair of variables and draw a plot illustrating their relationship for each year, using appropriate axis scaling and a title."
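, "\n", "\n", "As a reminder of how the two coefficients from questions **(a)** and **(b)** behave, here is a minimal sketch on hypothetical toy data (not taken from `gapminder`). It assumes `DataFrame.corr(method=...)` is the tool used for both coefficients. The variable names `toy`, `pearson`, `spearman` and `pearson_log` are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y grows exponentially with x, so the relationship
# is perfectly monotone but far from linear.
x = np.arange(1, 21)
toy = pd.DataFrame({'x': x, 'y': np.exp(x / 2.0)})

# Pearson measures linear association, Spearman measures monotone association.
pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']

# On a log scale the relationship becomes linear, so Pearson recovers it.
toy['log_y'] = np.log(toy['y'])
pearson_log = toy.corr(method='pearson').loc['x', 'log_y']

print(f'Pearson (raw scale): {pearson:.3f}')
print(f'Spearman:            {spearman:.3f}')
print(f'Pearson (log scale): {pearson_log:.3f}')
```

A large gap between the two coefficients is often a hint that a different axis scale (for example a logarithmic one) may show the relationship more clearly."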
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**d.** Revisit your intuition about the 1952 correlation matrix from question **(a)**. Can you see why we observed such a (poor) correlation in 1952?\n", "\n", "- Now, drop the outlying country from the 1952 data. Redraw the plot and recompute the correlation between the health (`lifeExp`) and economy (`gdpPercap`) conditions of the world in 1952. Conclude." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Remark:** The Pearson correlation matrix summarizes the linear relationships between pairs of quantitative variables, but it can be inaccurate and is influenced by \n", "\n", "> - outliers, \n", "> - non-linearity, \n", "> - small sample size, \n", "> - confounding (causal) variables..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. $\\eta$-squared correlations\n", "\n", "**a.** We have seen how life expectancy and economy vary across continents in 1952 ([Lab4](https://hassothea.github.io/Data_Analysis_AUPP/Labs/Lab4_Data_Visualization.html)) and 2007 ([course](https://hassothea.github.io/Data_Analysis_AUPP/Slides/Data_Visualization.html#/bivariate-visualization-4)). Compute the $\\eta$-squared correlation between `continent` and `lifeExp`, then between `continent` and `gdpPercap`, in 1952, 1987 and 2007. \n", "\n", "- Do you find the results reasonable?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Time evolution\n", "\n", "**a.** Draw the evolution of the following correlations from 1952 to 2007:\n", "\n", "- `Pearson` and `Spearman` correlations between life expectancy and GDP per capita\n", "- $\\eta$-squared coefficients of continent vs life expectancy, and continent vs GDP per capita."
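, "\n", "\n", "The $\\eta$-squared coefficient mentioned above can be sketched as follows, assuming the usual definition as the ratio of the between-group sum of squares to the total sum of squares. The helper `eta_squared` and the toy data are hypothetical and are not part of `pandas` or of the course material.

```python
import pandas as pd

def eta_squared(groups, values):
    # Correlation ratio: between-group sum of squares over total sum of squares.
    values = pd.Series(values, dtype=float).reset_index(drop=True)
    groups = pd.Series(groups).reset_index(drop=True)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    # Replace each observation by the mean of its group.
    group_means = values.groupby(groups).transform('mean')
    ss_between = ((group_means - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Hypothetical toy data: three well-separated groups.
toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b', 'c', 'c'],
    'value': [1.0, 2.0, 5.0, 6.0, 9.0, 10.0],
})
eta2 = eta_squared(toy['group'], toy['value'])
print(round(eta2, 3))
```

A value close to 1 means the group explains most of the variation in the quantitative variable, and a value close to 0 means it explains almost none. To follow a coefficient over time, one possible approach is to apply such a helper to the rows of each year separately and collect the results in a `DataFrame` before plotting."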
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.express as px\n", "\n", "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b.** From what you have studied in this dataset, describe the evolution of the world from 1952 to 2007." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Description: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Further reading\n", "- Gapminder documentation: [https://www.gapminder.org/data/documentation/](https://www.gapminder.org/data/documentation/)\n", "- A short demonstration video is available here: [Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four](https://youtu.be/jbkSRLYSojo?si=qipg08VIi999hEgo).\n", "- Graphical tools:\n", " - [`matplotlib`](https://matplotlib.org/stable/index.html)\n", " - [`seaborn`](https://seaborn.pydata.org/)\n", " - [`plotly`](https://plotly.com/python/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.5" } }, "nbformat": 4, "nbformat_minor": 2 }