Exploratory Data Analysis & Unsupervised Learning Course: PHAUK Sokkey, PhD TP: HAS Sothea, PhD
Objective: In this lab, let’s dive into an essential unsupervised learning method: Principal Component Analysis (PCA). PCA is a key technique for dimensionality reduction that simplifies data while preserving its crucial patterns. We will explore PCA from multiple perspectives in this TP.
The Jupyter Notebook for this TP can be downloaded here: TP6_PCA.ipynb.
1. Analyzing US Crime Dataset with PCA
The USArrests dataset, available on Kaggle, provides statistics on arrests for violent crimes (murder, assault, and rape) in each of the 50 US states in 1973.
For more information, read about the dataset here. We will use PCA to identify which U.S. states were the most dangerous and the safest in 1973.
A. Import the data and visualize each column to get a general sense of the dataset.
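A minimal sketch of this step, assuming the Kaggle file was downloaded as `USArrests.csv` (a hypothetical local filename) with the state names in the first column:

```python
# Sketch: load USArrests and take a first look (assumes the file "USArrests.csv"
# with the state names in the first column).
import pandas as pd
import matplotlib.pyplot as plt

# Load the data and use the state names as the index
df = pd.read_csv("USArrests.csv", index_col=0)
print(df.head())
print(df.describe())

# Histogram of each numerical column to get a general sense of the distributions
df.hist(figsize=(8, 6), bins=15)
plt.tight_layout()
plt.show()
```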
B. Study correlations between columns of the data using both Pearson and Spearman correlation coefficients.
Create a pairplot of all columns of the data.
Given the pairplot and the correlations above, is it a good idea to perform dimensionality reduction on this dataset? Why?
# To do
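One possible way to carry out step B, reusing the `df` DataFrame loaded above (Pearson measures linear association, Spearman rank-based monotonic association):

```python
# Sketch: Pearson and Spearman correlation matrices plus a pairplot,
# reusing the `df` DataFrame loaded in step A.
import seaborn as sns
import matplotlib.pyplot as plt

pearson_corr = df.corr(method="pearson")
spearman_corr = df.corr(method="spearman")
print(pearson_corr.round(2))
print(spearman_corr.round(2))

# Heatmaps side by side for easier comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.heatmap(pearson_corr, annot=True, cmap="coolwarm", ax=axes[0]).set(title="Pearson")
sns.heatmap(spearman_corr, annot=True, cmap="coolwarm", ax=axes[1]).set(title="Spearman")
plt.show()

# Pairplot of all columns
sns.pairplot(df)
plt.show()
```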
C. Perform reduced PCA (scaled and centered data) on this dataset.
Create the scree plot of explained variances of the data.
What percentage of the explained variance is retained by the first two principal components?
# To do
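A sketch of the reduced (centered and scaled) PCA and its scree plot with scikit-learn, reusing `df` from above:

```python
# Sketch: reduced PCA (standardized data) and scree plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize: reduced PCA works on centered and scaled variables
X_scaled = StandardScaler().fit_transform(df)

pca = PCA()
scores = pca.fit_transform(X_scaled)   # coordinates of individuals (PC scores)

# Scree plot of explained variance ratios with the cumulative curve
ratios = pca.explained_variance_ratio_
plt.bar(range(1, len(ratios) + 1), ratios)
plt.plot(range(1, len(ratios) + 1), np.cumsum(ratios), "o-", color="red")
plt.xlabel("Principal component")
plt.ylabel("Proportion of explained variance")
plt.title("Scree plot")
plt.show()

# Percentage of variance retained by the first two PCs
print(f"PC1 + PC2 retain {100 * ratios[:2].sum():.2f}% of the variance")
```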
D. Create the correlation circle of the obtained PCA and explain it.
Compute the contributions of the original variables to the first two PCs (loadings).
Compute the contributions of each individual (state) to the first two PCs.
# To do
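One way to build the correlation circle and the contributions, reusing `pca`, `scores`, and `df` from the previous cells; the formulas below follow the usual conventions for standardized data (variable-PC correlation = loading times the square root of the eigenvalue):

```python
# Sketch: correlation circle plus contributions of variables and individuals.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

eigvals = pca.explained_variance_
# Correlation between each original (standardized) variable and each PC
var_corr = pca.components_.T * np.sqrt(eigvals)

fig, ax = plt.subplots(figsize=(6, 6))
ax.add_patch(plt.Circle((0, 0), 1, fill=False))
for j, name in enumerate(df.columns):
    ax.arrow(0, 0, var_corr[j, 0], var_corr[j, 1], head_width=0.03, color="steelblue")
    ax.text(var_corr[j, 0] * 1.1, var_corr[j, 1] * 1.1, name)
ax.set(xlim=(-1.1, 1.1), ylim=(-1.1, 1.1),
       xlabel="PC1", ylabel="PC2", title="Correlation circle")
plt.show()

# Contribution (in %) of each variable to each PC (squared loadings sum to 1 per PC)
var_contrib = pd.DataFrame(100 * pca.components_.T ** 2,
                           index=df.columns,
                           columns=[f"PC{k+1}" for k in range(len(eigvals))])
print(var_contrib.iloc[:, :2].round(2))

# Contribution (in %) of each individual (state) to each PC
ind_contrib = pd.DataFrame(100 * scores ** 2 / (scores ** 2).sum(axis=0),
                           index=df.index,
                           columns=var_contrib.columns)
print(ind_contrib.iloc[:, :2].round(2))
```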
E. Create the biplot of the data on the first factorial plane (PC1 and PC2); a code sketch is given after this list. Based on this biplot, which US state in 1973 was
the most dangerous?
the safest?
the most urbanized?
Verify your answers by checking the actual situation of those states in the data.
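A hedged sketch of the biplot construction, reusing `scores`, `var_corr`, and `df` from the previous cells (the arrow scaling factor is only a visual choice):

```python
# Sketch: biplot on the first factorial plane (PC1, PC2) overlaying
# individuals (states) and variable arrows.
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))

# Individuals: PC scores of the 50 states
ax.scatter(scores[:, 0], scores[:, 1], s=15, color="gray")
for i, state in enumerate(df.index):
    ax.text(scores[i, 0], scores[i, 1], state, fontsize=8)

# Variables: arrows rescaled so they are visible on the same plane
scale = np.abs(scores[:, :2]).max()
for j, name in enumerate(df.columns):
    ax.arrow(0, 0, var_corr[j, 0] * scale, var_corr[j, 1] * scale,
             color="red", head_width=0.05)
    ax.text(var_corr[j, 0] * scale * 1.1, var_corr[j, 1] * scale * 1.1,
            name, color="red")

ax.axhline(0, linestyle="--", linewidth=0.5)
ax.axvline(0, linestyle="--", linewidth=0.5)
ax.set(xlabel="PC1", ylabel="PC2", title="Biplot of USArrests on PC1-PC2")
plt.show()
```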
2. Analyzing Auto-MPG dataset with PCA
A. Import the Auto-MPG dataset from Kaggle, available here.
Compute the correlation matrix of the quantitative columns of this dataset.
# To do
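A minimal sketch, assuming the Kaggle file was saved as `auto-mpg.csv` (a hypothetical local filename) with the usual column names (mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin, car name):

```python
# Sketch: load Auto-MPG and compute the correlation matrix of quantitative columns.
import pandas as pd

auto = pd.read_csv("auto-mpg.csv")
print(auto.head())

# In this dataset 'horsepower' is often stored as text with '?' for missing values
auto["horsepower"] = pd.to_numeric(auto["horsepower"], errors="coerce")
auto = auto.dropna()

# Keep numeric columns only; 'origin' is a categorical variable coded as a number,
# so one may exclude it from the quantitative analysis
quant = auto.select_dtypes(include="number").drop(columns=["origin"], errors="ignore")
print(quant.corr().round(2))
```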
B. Perform reduced PCA on this dataset.
How much information or variation is retained by the first two PCs?
# To do
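The same reduced-PCA recipe as in Part 1, applied to the quantitative Auto-MPG columns (reusing `quant` from the previous cell):

```python
# Sketch: reduced PCA on the quantitative Auto-MPG columns.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_auto = StandardScaler().fit_transform(quant)
pca_auto = PCA()
scores_auto = pca_auto.fit_transform(X_auto)

ratios_auto = pca_auto.explained_variance_ratio_
print(f"PC1 + PC2 retain {100 * ratios_auto[:2].sum():.2f}% of the total variation")
```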
C. Create the correlation circle and the biplot. Comment on them.
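These plots can be built exactly as in Part 1; a sketch reusing `pca_auto`, `scores_auto`, and `quant` from the cells above:

```python
# Sketch: correlation circle and biplot for Auto-MPG, same construction as Part 1.
import numpy as np
import matplotlib.pyplot as plt

corr_auto = pca_auto.components_.T * np.sqrt(pca_auto.explained_variance_)

fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Left panel: correlation circle of the quantitative variables
axes[0].add_patch(plt.Circle((0, 0), 1, fill=False))
for j, name in enumerate(quant.columns):
    axes[0].arrow(0, 0, corr_auto[j, 0], corr_auto[j, 1], head_width=0.03)
    axes[0].text(corr_auto[j, 0] * 1.1, corr_auto[j, 1] * 1.1, name)
axes[0].set(xlim=(-1.1, 1.1), ylim=(-1.1, 1.1), xlabel="PC1", ylabel="PC2",
            title="Correlation circle")

# Right panel: biplot (cars as points, variables as rescaled arrows)
scale = np.abs(scores_auto[:, :2]).max()
axes[1].scatter(scores_auto[:, 0], scores_auto[:, 1], s=10, alpha=0.5)
for j, name in enumerate(quant.columns):
    axes[1].arrow(0, 0, corr_auto[j, 0] * scale, corr_auto[j, 1] * scale, color="red")
    axes[1].text(corr_auto[j, 0] * scale * 1.1, corr_auto[j, 1] * scale * 1.1,
                 name, color="red")
axes[1].set(xlabel="PC1", ylabel="PC2", title="Biplot")
plt.show()
```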
3. Mathematical Problem of PCA
From a mathematical point of view, PCA can be seen as the process of searching for a subspace onto which the projection of the data points has maximum variance, or equivalently as searching for the closest low-rank approximation of the data.
Suppose we have a design matrix of observations \(X\in\mathbb{R}^{n\times d}\) with centered columns. We aim to define mathematically the 1st, 2nd, …, \(d\)th principal components of this matrix.
A. First PC: a vector \(\vec{u}_1\in\mathbb{R}^d\) is the \(1\)st PC of \(X\) if it is the direction (unit vector) in which the projection of the observations \(X\) achieves maximum variance, i.e.,
\[\vec{u}_1=\arg\max_{\|\vec{u}\|=1}\frac{1}{n}\|X\vec{u}\|^2=\arg\max_{\|\vec{u}\|=1}\vec{u}^TX^TX\vec{u}.\]
Show that \(\vec{u}_1\) is the first eigenvector of matrix \(X^TX\) corresponding to its largest eigenvalue \(\lambda_1\).
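Hint (one possible route, sketched): introduce a Lagrange multiplier \(\lambda\) for the unit-norm constraint \(\vec{u}^T\vec{u}=1\). The Lagrangian and its stationarity condition are
\[\mathcal{L}(\vec{u},\lambda)=\vec{u}^TX^TX\vec{u}-\lambda(\vec{u}^T\vec{u}-1),\qquad \nabla_{\vec{u}}\mathcal{L}=2X^TX\vec{u}-2\lambda\vec{u}=0\ \Longleftrightarrow\ X^TX\vec{u}=\lambda\vec{u},\]
so every critical point is an eigenvector of \(X^TX\), and at such a point the objective equals \(\vec{u}^TX^TX\vec{u}=\lambda\); the maximum is therefore attained at the eigenvector with the largest eigenvalue \(\lambda_1\).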
B. The \(k\)th PC: Let \(\widehat{X}_k=X-\sum_{j=1}^{k-1}X\vec{u}_j\vec{u}_j^T\) be the data with the first \(k-1\) components removed (the deflated data). Then the \(k\)th PC of \(X\) is the vector \(\vec{u}_k\in\mathbb{R}^d\) that is orthogonal to all the previous PCs \(\{\vec{u}_1,\dots,\vec{u}_{k-1}\}\) and satisfies
\[\vec{u}_k=\arg\max_{\|\vec{u}\|=1,\ \vec{u}\perp\vec{u}_1,\dots,\vec{u}_{k-1}}\|\widehat{X}_k\vec{u}\|^2.\]
Show that \(\vec{u}_k\) is the \(k\)th eigenvector of matrix \(X^TX\) corresponding to its \(k\)th largest eigenvalue \(\lambda_k\leq\lambda_{k-1}\leq\dots\leq \lambda_1\).
C. Show that the matrix \(\widetilde{X}_k=\sum_{j=1}^kX\vec{u}_j\vec{u}_j^T\) is the best rank-\(k\) approximation of the original data \(X\) w.r.t. the Frobenius norm, i.e.,
\[\widetilde{X}_k=\arg\min_{B\in\mathbb{R}^{n\times d}:\ \operatorname{rank}(B)\leq k}\|X-B\|_F^2.\]
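As a sanity check (not a proof), the eigen-characterization and the low-rank property can be verified numerically on the centered USArrests data; this sketch assumes the `df` DataFrame from Part 1 is available:

```python
# Numerical sanity check of parts A-C on the centered USArrests data.
# Assumes the `df` DataFrame from Part 1 is already loaded.
import numpy as np
from sklearn.decomposition import PCA

X = df.to_numpy(dtype=float)
X = X - X.mean(axis=0)                      # center the columns

# Eigen-decomposition of X^T X (symmetric, so eigh is appropriate)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]           # sort eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# A: the first PC from sklearn matches the top eigenvector of X^T X (up to sign)
pca = PCA().fit(X)
u1 = pca.components_[0]
print(np.allclose(np.abs(u1 @ eigvecs[:, 0]), 1.0))

# C: rank-k reconstruction from the top-k eigenvectors vs. a random rank-k matrix
k = 2
Uk = eigvecs[:, :k]
Xk = X @ Uk @ Uk.T                          # projection onto the top-k PCs
rng = np.random.default_rng(0)
B = rng.normal(size=(X.shape[0], k)) @ rng.normal(size=(k, X.shape[1]))
print(np.linalg.norm(X - Xk, "fro") <= np.linalg.norm(X - B, "fro"))
```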