TP6 - Principal Component Analysis (PCA)

Exploratory Data Analysis & Unsupervised Learning
Course: PHAUK Sokkey, PhD
TP: HAS Sothea, PhD


Objective: In this lab, let’s dive into an essential unsupervised learning method: Principal Component Analysis (PCA). PCA is a key technique for dimensionality reduction that simplifies data while preserving its crucial patterns. We will explore PCA from multiple perspectives in this TP.


The Jupyter Notebook for this TP can be downloaded here: TP6_PCA.ipynb.


1. Analyzing US Crime Dataset with PCA

The USArrests dataset, available on Kaggle, provides statistics on arrests for violent crimes (murder, assault, and rape) per 100,000 residents in each of the 50 US states in 1973, together with the percentage of the population living in urban areas.

For more information, read about the dataset here. We will use PCA to identify which U.S. state was the most dangerous and which was the safest in 1973.

A. Import the data and visualize each column to get a general sense of the dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("halimedogan/usarrests")

import pandas as pd

data = pd.read_csv(path + "/usarrests.csv")
data.head()
   Unnamed: 0  Murder  Assault  UrbanPop  Rape
0     Alabama    13.2      236        58  21.2
1      Alaska    10.0      263        48  44.5
2     Arizona     8.1      294        80  31.0
3    Arkansas     8.8      190        50  19.5
4  California     9.0      276        91  40.6
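
A minimal sketch for the visualization step, continuing from the loading code above and assuming matplotlib and seaborn are installed (renaming the unnamed state column and using it as the index is a convenience, not part of the original data):

import matplotlib.pyplot as plt
import seaborn as sns

# Use the state names (stored in the unnamed first column) as the index
data = data.rename(columns={"Unnamed: 0": "State"}).set_index("State")

# Histogram (with a KDE curve) for each numerical column
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.ravel(), data.columns):
    sns.histplot(data[col], kde=True, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()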

B. Study correlations between columns of the data using both Pearson and Spearman correlation coefficients.

  • Create a pairplot for all columns of the data.

  • Given the pairplot and the correlations above, is it a good idea to perform dimensionality reduction on this dataset? Why?

# To do
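
One possible sketch for this step, reusing the data loaded in part A:

import seaborn as sns

# Keep only the numeric columns (Murder, Assault, UrbanPop, Rape)
num = data.select_dtypes("number")

# Pearson (linear) and Spearman (rank-based) correlation matrices
print(num.corr(method="pearson"))
print(num.corr(method="spearman"))

# Pairwise scatter plots with histograms on the diagonal
sns.pairplot(num)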

C. Perform a reduced PCA (i.e., PCA on centered and scaled data) on this dataset.

  • Create the scree plot of explained variances of the data.

  • What percentage of the explained variance is retained by the first two principal components?

# To do
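
A sketch using scikit-learn, where the reduced PCA is obtained by standardizing the columns before fitting (an assumption about the intended workflow):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Reduced PCA = PCA on centered and scaled (standardized) data
X = StandardScaler().fit_transform(data.select_dtypes("number"))
pca = PCA().fit(X)
scores = pca.transform(X)

# Scree plot: explained variance ratio of each PC and its cumulative sum
ratios = pca.explained_variance_ratio_
pcs = range(1, len(ratios) + 1)
plt.bar(pcs, ratios, label="per component")
plt.plot(pcs, np.cumsum(ratios), "o-", color="red", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.legend()
plt.show()

print(f"PC1 + PC2 retain {100 * ratios[:2].sum():.1f}% of the total variance")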

D. Create the correlation circle of the obtained PCA. Explain this correlation circle.

  • Compute the contributions of the original variables to the first two PCs (loadings).

  • Compute the contribution of each individual to the first two PCs.

# To do
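
A sketch building on the pca object fitted above; the formulas used (variable–PC correlations as loadings times the square root of the eigenvalues, contributions as squared coordinates normalized per component) are the usual conventions, stated here as assumptions:

import numpy as np
import matplotlib.pyplot as plt

cols = data.select_dtypes("number").columns

# Correlation of each original variable with PC1 and PC2
# (for standardized data: loading * sqrt(eigenvalue of the PC))
var_coords = pca.components_[:2].T * np.sqrt(pca.explained_variance_[:2])

fig, ax = plt.subplots(figsize=(6, 6))
ax.add_patch(plt.Circle((0, 0), 1, fill=False))
for (x, y), name in zip(var_coords, cols):
    ax.arrow(0, 0, x, y, head_width=0.03, color="steelblue")
    ax.text(1.1 * x, 1.1 * y, name)
ax.set_xlim(-1.1, 1.1); ax.set_ylim(-1.1, 1.1)
ax.set_xlabel("PC1"); ax.set_ylabel("PC2")
plt.show()

# Contribution (in %) of each variable to PC1 and PC2
var_contrib = 100 * pca.components_[:2].T ** 2
# Contribution (in %) of each individual (state) to PC1 and PC2
ind_contrib = 100 * scores[:, :2] ** 2 / (scores[:, :2] ** 2).sum(axis=0)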

E. Create a biplot of the data on the first factorial plane (PC1 and PC2); a code sketch is given after the list below. Based on this biplot, which US state in 1973 was

  • the most dangerous?
  • the safest?
  • the most urbanized?
  • Verify your answers by checking the actual figures for those states in the data.
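
A sketch of the biplot, overlaying the states (scores) and the variable directions on the first factorial plane; it assumes the state names were set as the index in part A, and the arrow scaling factor is only for readability:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(scores[:, 0], scores[:, 1], s=10)
for name, (x, y) in zip(data.index, scores[:, :2]):
    ax.annotate(name, (x, y), fontsize=7)

# Variable directions, rescaled to the spread of the scores
scale = np.abs(scores[:, :2]).max()
for (x, y), col in zip(pca.components_[:2].T, data.select_dtypes("number").columns):
    ax.arrow(0, 0, scale * x, scale * y, color="red", head_width=0.05)
    ax.text(1.1 * scale * x, 1.1 * scale * y, col, color="red")
ax.set_xlabel("PC1"); ax.set_ylabel("PC2")
plt.show()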

2. Analyzing Auto-MPG dataset with PCA

A. Import the Auto-MPG dataset from Kaggle, available here.

  • Compute the correlation matrix of the quantitative columns of this dataset.
# To do
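
A sketch for loading the dataset and computing the correlation matrix; the Kaggle handle and file name below are assumptions and should be adjusted to the dataset linked above:

import kagglehub
import pandas as pd

# Hypothetical Kaggle handle and file name; replace with the linked dataset
path = kagglehub.dataset_download("uciml/autompg-dataset")
auto = pd.read_csv(path + "/auto-mpg.csv")

# In some versions 'horsepower' is stored as text with '?' for missing values
auto["horsepower"] = pd.to_numeric(auto["horsepower"], errors="coerce")

# Correlation matrix of the quantitative columns
print(auto.select_dtypes("number").corr())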

B. Perform reduced PCA on this dataset.

  • How much information (variance) is retained by the first two PCs?
# To do
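
A sketch of the reduced PCA on the quantitative columns, dropping rows with missing values first:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

num_auto = auto.select_dtypes("number").dropna()
X_auto = StandardScaler().fit_transform(num_auto)
pca_auto = PCA().fit(X_auto)

ratios = pca_auto.explained_variance_ratio_
print(f"PC1 + PC2 retain {100 * ratios[:2].sum():.1f}% of the total variance")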

C. Create the correlation circle and the biplot, then comment on them.
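
These plots can be produced exactly as in parts 1.D and 1.E; a condensed sketch reusing pca_auto, X_auto, and num_auto from above:

import numpy as np
import matplotlib.pyplot as plt

scores_auto = pca_auto.transform(X_auto)
var_coords = pca_auto.components_[:2].T * np.sqrt(pca_auto.explained_variance_[:2])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# Correlation circle of the variables
ax1.add_patch(plt.Circle((0, 0), 1, fill=False))
for (x, y), name in zip(var_coords, num_auto.columns):
    ax1.arrow(0, 0, x, y, head_width=0.03)
    ax1.text(1.1 * x, 1.1 * y, name)
ax1.set_xlim(-1.1, 1.1); ax1.set_ylim(-1.1, 1.1)
ax1.set_xlabel("PC1"); ax1.set_ylabel("PC2")
# Biplot: cars as points, variable directions as arrows
ax2.scatter(scores_auto[:, 0], scores_auto[:, 1], s=5, alpha=0.5)
scale = np.abs(scores_auto[:, :2]).max()
for (x, y), name in zip(pca_auto.components_[:2].T, num_auto.columns):
    ax2.arrow(0, 0, scale * x, scale * y, color="red", head_width=0.1)
    ax2.text(1.1 * scale * x, 1.1 * scale * y, name, color="red")
ax2.set_xlabel("PC1"); ax2.set_ylabel("PC2")
plt.show()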

3. Mathematical Problem of PCA

From a mathematical point of view, PCA can be seen as searching for a subspace onto which the projections of the data points have maximum variance, or, equivalently, as searching for the closest low-rank approximation of the data.

Suppose we have a design matrix of observations \(X\in\mathbb{R}^{n\times d}\) with centered columns. We aim to define mathematically the 1st, 2nd, …, \(d\)th principal components of this matrix.

A. First PC: a vector \(\vec{u}_1\in\mathbb{R}^d\) is the first PC of \(X\) if it is the direction (unit vector) in which the projection of the observations \(X\) achieves maximum variance, i.e.,

\[\vec{u}_1=\arg\max_{\vec{u}:\|\vec{u}\|=1}\|X\vec{u}\|^2.\]

  • Show that \(\vec{u}_1\) is the first eigenvector of the matrix \(X^TX\), corresponding to its largest eigenvalue \(\lambda_1\).
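
One possible line of argument (a sketch): introduce a Lagrange multiplier \(\mu\) for the unit-norm constraint and set the gradient to zero,

\[\mathcal{L}(\vec{u},\mu)=\vec{u}^TX^TX\vec{u}-\mu(\vec{u}^T\vec{u}-1),\qquad \nabla_{\vec{u}}\mathcal{L}=2X^TX\vec{u}-2\mu\vec{u}=\vec{0}\;\Longrightarrow\; X^TX\vec{u}=\mu\vec{u}.\]

Any critical point is therefore an eigenvector of \(X^TX\), and at such a point \(\|X\vec{u}\|^2=\vec{u}^TX^TX\vec{u}=\mu\), so the maximum is attained at the eigenvector with the largest eigenvalue \(\mu=\lambda_1\).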

B. The \(k\)th PC: let \(\widehat{X}_k=X-\sum_{j=1}^{k-1}X\vec{u}_j\vec{u}_j^T\) (the data deflated by the first \(k-1\) components); then the \(k\)th PC of \(X\) is the vector \(\vec{u}_k\in\mathbb{R}^d\), orthogonal to all the previous PCs \(\{\vec{u}_1,\dots,\vec{u}_{k-1}\}\), satisfying

\[\vec{u}_k=\arg\max_{\vec{u}:\|\vec{u}\|=1}\|\widehat{X}_k\vec{u}\|^2.\]

  • Show that \(\vec{u}_k\) is the \(k\)th eigenvector of the matrix \(X^TX\), corresponding to its \(k\)th largest eigenvalue \(\lambda_k\leq\lambda_{k-1}\leq\dots\leq\lambda_1\).

C. Show that the matrix \(\widetilde{X}_k=\sum_{j=1}^kX\vec{u}_j\vec{u}_j^T\), the projection of \(X\) onto the first \(k\) PCs, is the best rank-\(k\) approximation of the original data \(X\) w.r.t. the Frobenius norm, i.e.,

\[\widetilde{X}_k=\arg\min_{W:\,\text{rank}(W)\leq k}\|X-W\|_{F},\qquad\text{with}\qquad\|X-\widetilde{X}_k\|_{F}=\sqrt{\sum_{j=k+1}^d\lambda_j},\]

where \(\lambda_{k+1}\geq\dots\geq\lambda_d\) are the \(d-k\) smallest eigenvalues of \(X^TX\).
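
These statements can be checked numerically; a small self-contained sketch with NumPy on synthetic data (the symbol \(X\) here is a random centered matrix, not the datasets above):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)          # center the columns

# Eigen-decomposition of X^T X, sorted by decreasing eigenvalue
eigval, eigvec = np.linalg.eigh(X.T @ X)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# A: the first eigenvector attains the maximum projected variance lambda_1
u1 = eigvec[:, 0]
print(np.isclose(np.linalg.norm(X @ u1) ** 2, eigval[0]))

# C: rank-k reconstruction and the Frobenius error formula
k = 2
Xk = X @ eigvec[:, :k] @ eigvec[:, :k].T
print(np.isclose(np.linalg.norm(X - Xk, "fro"), np.sqrt(eigval[k:].sum())))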

Further Readings