{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **TP6 - Principal Component Analysis (PCA)**\n", "\n", "Exploratory Data Analysis & Unsupervised Learning
\n", "**Course: PHAUK Sokkey, PhD**
\n", "**TP: HAS Sothea, PhD**\n", "\n", "-------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Objective**: In this lab, let's dive into an essential unsupervised learning method: Principal Component Analysis (PCA). PCA is a key technique for dimensionality reduction that simplifies data while preserving its crucial patterns. We will explore PCA from multiple perspectives in this TP.\n", "\n", "---------\n", "\n", "> **The `Jupyter Notebook` for this TP can be downloaded here: [TP6_PCA.ipynb](https://hassothea.github.io/EDA_ITC/TPs/TP6_PCA.ipynb)**.\n", "\n", "-------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Analyzing US Crime Dataset with PCA\n", "\n", "The `USArrests` data, available on `Kaggle`, provides statistics on arrests for crimes including **rape**, **assault**, and **murder** in the 50 states of the United States in 1973.\n", "\n", "For more information, read about the dataset [here](https://www.kaggle.com/datasets/halimedogan/usarrests). We will use PCA to identify which U.S. state was the most dangerous or the safest in 1973.\n", "\n", "**A.** Import the data and visualize each column to get a general sense of the dataset." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Unnamed: 0</th>\n", "      <th>Murder</th>\n", "      <th>Assault</th>\n", "      <th>UrbanPop</th>\n", "      <th>Rape</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>Alabama</td>\n", "      <td>13.2</td>\n", "      <td>236</td>\n", "      <td>58</td>\n", "      <td>21.2</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>Alaska</td>\n", "      <td>10.0</td>\n", "      <td>263</td>\n", "      <td>48</td>\n", "      <td>44.5</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>Arizona</td>\n", "      <td>8.1</td>\n", "      <td>294</td>\n", "      <td>80</td>\n", "      <td>31.0</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>Arkansas</td>\n", "      <td>8.8</td>\n", "      <td>190</td>\n", "      <td>50</td>\n", "      <td>19.5</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>California</td>\n", "      <td>9.0</td>\n", "      <td>276</td>\n", "      <td>91</td>\n", "      <td>40.6</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Unnamed: 0 Murder Assault UrbanPop Rape\n", "0 Alabama 13.2 236 58 21.2\n", "1 Alaska 10.0 263 48 44.5\n", "2 Arizona 8.1 294 80 31.0\n", "3 Arkansas 8.8 190 50 19.5\n", "4 California 9.0 276 91 40.6" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import kagglehub\n", "\n", "# Download the latest version of the dataset from Kaggle\n", "path = kagglehub.dataset_download(\"halimedogan/usarrests\")\n", "\n", "import pandas as pd\n", "\n", "# Load the CSV file and preview the first rows\n", "data = pd.read_csv(path + \"/usarrests.csv\")\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**B.** Study the correlations between the columns of the data using both `Pearson` and `Spearman` correlation coefficients.\n", "\n", "- Create a pairplot of all columns of the data.\n", "\n", "- Given such a pairplot and the correlations above, is it a good idea to perform dimensionality reduction on this dataset? Why?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**C.** Perform reduced PCA (on scaled and centered data) on this dataset.\n", "\n", "- Create the scree plot of the explained variances of the data.\n", "\n", "- What percentage of the explained variance is retained by the first two principal components?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**D.** Create the correlation circle of the obtained PCA and explain it.\n", "\n", "- Compute the contributions of the original variables to the first two PCs (loadings).\n", "\n", "- Compute the contribution of each individual to the first two PCs." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**E.** Create a biplot of the data on the first factorial plane (PC1 and PC2). Based on this biplot, which US state in 1973 was\n", "\n", "- the most dangerous?\n", "- the safest?\n", "- the most urbanized?\n", "- Verify your answers by checking the actual figures for those states." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Analyzing Auto-MPG Dataset with PCA\n", "\n", "**A.** Import the `Auto-MPG` dataset from Kaggle, available [here](https://www.kaggle.com/datasets/uciml/autompg-dataset).\n", "\n", "- Compute the correlation matrix of the quantitative columns of this dataset." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**B.** Perform reduced PCA on this dataset.\n", "\n", "- How much information or variation is retained by the first two PCs?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**C.** Create the correlation circle and biplot, and comment on them. *(Minimal code sketches for Sections 1 and 2 follow below.)*" ] },
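{ "cell_type": "markdown", "metadata": {}, "source": [ "*The sketch below shows one possible way to approach questions **B**-**E** of Section 1 with `scikit-learn`, `seaborn`, and `matplotlib`. It assumes the `data` DataFrame loaded earlier, with state names stored in the `Unnamed: 0` column (check this against the actual file); treat it as a starting point rather than the unique solution. The last two lines numerically check the eigenvector characterization of PCA derived in Section 3.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A possible sketch for questions B-E of Section 1 (assumes `data` from above)\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import seaborn as sns\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "X = data.set_index(\"Unnamed: 0\")  # state names as row labels\n", "print(X.corr(method=\"pearson\"))   # Pearson correlations\n", "print(X.corr(method=\"spearman\"))  # Spearman correlations\n", "sns.pairplot(X)                   # pairplot of all columns\n", "\n", "Z = StandardScaler().fit_transform(X)  # center and scale: reduced PCA\n", "pca = PCA().fit(Z)\n", "scores = pca.transform(Z)  # coordinates of the states (PC scores)\n", "\n", "# Scree plot of the explained variance ratios\n", "plt.figure()\n", "plt.bar(range(1, Z.shape[1] + 1), pca.explained_variance_ratio_)\n", "plt.xlabel(\"Principal component\")\n", "plt.ylabel(\"Explained variance ratio\")\n", "\n", "# Correlation circle: correlations between the original variables and the PCs\n", "loadings = pca.components_.T * np.sqrt(pca.explained_variance_)\n", "fig, ax = plt.subplots(figsize=(5, 5))\n", "ax.add_patch(plt.Circle((0, 0), 1, fill=False))\n", "for name, (x, y) in zip(X.columns, loadings[:, :2]):\n", "    ax.arrow(0, 0, x, y, head_width=0.03, color=\"red\")\n", "    ax.annotate(name, (x, y), color=\"red\")\n", "ax.set_xlabel(\"PC1\")\n", "ax.set_ylabel(\"PC2\")\n", "\n", "# Biplot: states on the first factorial plane (overlay the variable arrows if desired)\n", "fig, ax = plt.subplots(figsize=(6, 6))\n", "ax.scatter(scores[:, 0], scores[:, 1], s=10)\n", "for state, (x, y) in zip(X.index, scores[:, :2]):\n", "    ax.annotate(state, (x, y), fontsize=7)\n", "ax.set_xlabel(\"PC1\")\n", "ax.set_ylabel(\"PC2\")\n", "plt.show()\n", "\n", "# Sanity check (cf. Section 3): PC directions are eigenvectors of Z^T Z\n", "eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)  # eigenvalues in ascending order\n", "print(np.allclose(np.abs(eigvecs[:, ::-1].T), np.abs(pca.components_)))" ] },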
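{ "cell_type": "markdown", "metadata": {}, "source": [ "*A similar sketch for Section 2. The file name `auto-mpg.csv` and the `?` placeholders in the `horsepower` column are assumptions about the Kaggle files; verify them after downloading.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A possible sketch for Section 2 (file and column names are assumptions to verify)\n", "import kagglehub\n", "import pandas as pd\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "path = kagglehub.dataset_download(\"uciml/autompg-dataset\")\n", "auto = pd.read_csv(path + \"/auto-mpg.csv\")  # assumed file name\n", "\n", "# \"horsepower\" may be read as text because of \"?\" placeholders: coerce, then drop NAs\n", "auto[\"horsepower\"] = pd.to_numeric(auto[\"horsepower\"], errors=\"coerce\")\n", "quant = auto.select_dtypes(\"number\").dropna()\n", "\n", "print(quant.corr())  # correlation matrix of the quantitative columns\n", "\n", "pca = PCA().fit(StandardScaler().fit_transform(quant))\n", "print(pca.explained_variance_ratio_[:2].sum())  # variance share kept by PC1 and PC2" ] },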
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Mathematical Problem of PCA\n", "\n", "From a mathematical point of view, PCA can be seen as a process of searching for a subspace with maximum [projection variance](https://en.wikipedia.org/wiki/Principal_component_analysis) of the data points, or of searching for their closest [low-rank approximation](https://en.wikipedia.org/wiki/Low-rank_approximation).\n", "\n", "Suppose we have a design matrix of observations $X\in\mathbb{R}^{n\times d}$ with centered columns. We aim to mathematically define the 1st, 2nd, ..., $d$-th principal components of this matrix.\n", "\n", "**A. First PC:** a vector $\vec{u}_1\in\mathbb{R}^d$ is the 1st PC of $X$ if it is the direction (unit vector) in which the projection of the observations $X$ achieves maximum variance, i.e.,\n", "\n", "$$\vec{u}_1=\arg\max_{\vec{u}:\|\vec{u}\|=1}\|X\vec{u}\|^2.$$\n", "\n", "- Show that $\vec{u}_1$ is the eigenvector of the matrix $X^TX$ corresponding to its largest eigenvalue $\lambda_1$.\n", "\n", "**B. The $k$-th PC:** Let $\widehat{X}_k=X-\sum_{j=1}^{k-1}X\vec{u}_j\vec{u}_j^T$ be the data with the first $k-1$ principal directions projected out; then the $k$-th PC of $X$ is the vector $\vec{u}_k\in\mathbb{R}^d$ that is orthogonal to all the previous PCs $\{\vec{u}_1,\dots,\vec{u}_{k-1}\}$ and satisfies\n", "\n", "$$\vec{u}_k=\arg\max_{\vec{u}:\|\vec{u}\|=1}\|\widehat{X}_k\vec{u}\|^2.$$\n", "\n", "- Show that $\vec{u}_k$ is the $k$-th eigenvector of the matrix $X^TX$, corresponding to its $k$-th largest eigenvalue $\lambda_k\leq\lambda_{k-1}\leq\dots\leq \lambda_1$.\n", "\n", "**C.** Show that the matrix $\widetilde{X}_k=\sum_{j=1}^kX\vec{u}_j\vec{u}_j^T$ is the best [low-rank approximation](https://en.wikipedia.org/wiki/Low-rank_approximation) of the original data $X$ w.r.t. the [Frobenius norm](https://en.wikipedia.org/wiki/Frobenius_norm), i.e.,\n", "\n", "$$\widetilde{X}_k=\arg\min_{W:\text{rank}(W)\leq k}\|X-W\|_{F},\quad\text{with}\quad\min_{W:\text{rank}(W)\leq k}\|X-W\|_{F}=\sqrt{\sum_{j=k+1}^d\lambda_j},$$\n", "\n", "where $\lambda_{k+1}\geq\dots\geq\lambda_d$ are the $d-k$ smallest eigenvalues of $X^TX$ (equivalently, $\sqrt{\lambda_j}$ is the $j$-th singular value of $X$)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Further Readings\n", "\n", "- [PCA, sklearn](https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.PCA.html)\n", "- [Low-rank approximation problem](https://en.wikipedia.org/wiki/Low-rank_approximation)\n", "- [Principal Component Analysis (PCA)](https://builtin.com/data-science/step-step-explanation-principal-component-analysis)\n", "- [USArrests Kaggle Dataset](https://www.kaggle.com/datasets/halimedogan/usarrests)\n", "- [Auto-MPG Kaggle Dataset](https://www.kaggle.com/datasets/uciml/autompg-dataset)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.5" } }, "nbformat": 4, "nbformat_minor": 2 }