TP4 - Principal Component Analysis (PCA)

Exploratory Data Analysis & Unsupervised Learning
Course: Dr. Sothea HAS
TP: Mr. UANN Sreyvi


Objective: In this lab, let’s dive into an essential unsupervised learning method: Principal Component Analysis (PCA). PCA is a key technique for dimensionality reduction that simplifies data while preserving its crucial patterns. We will explore PCA from multiple perspectives in this TP.


The Jupyter Notebook for this TP can be downloaded here: TP4_PCA.ipynb.


1. Analyzing US Crime Dataset with PCA

The USArrests dataset, available on Kaggle, provides arrest statistics (arrests per 100,000 residents) for violent crimes, including murder, assault, and rape, in each of the 50 US states in 1973.

For information, read about the dataset here. We will use PCA to identify which U.S. state was the most dangerous or the safest in 1973.

A. Import the data and visualize each column to get a general sense of the dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("halimedogan/usarrests")

import pandas as pd

data = pd.read_csv(path + "/usarrests.csv")
data.head()
Unnamed: 0 Murder Assault UrbanPop Rape
0 Alabama 13.2 236 58 21.2
1 Alaska 10.0 263 48 44.5
2 Arizona 8.1 294 80 31.0
3 Arkansas 8.8 190 50 19.5
4 California 9.0 276 91 40.6

B. Study correlations between columns of the data using both Pearson and Spearman correlation coefficients.

  • Create pairplot for all columns of the data.

  • Given the pairplot and the correlations above, is it a good idea to perform dimensionality reduction on this dataset? Why?

# To do
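As a starting point, here is a minimal sketch (assuming `data` is the data frame loaded above and that the first column holds the state names):

import seaborn as sns
import matplotlib.pyplot as plt

# Keep only the numeric columns (the first column holds the state names)
X = data.drop(columns=["Unnamed: 0"])

# Pearson measures linear association, Spearman measures monotone association
print(X.corr(method="pearson"))
print(X.corr(method="spearman"))

# Pairplot of all numeric columns to visualize the pairwise relationships
sns.pairplot(X)
plt.show()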

C. Perform reduced PCA (scaled and centered data) on this dataset.

  • Create the scree plot of explained variances of the data.

  • What percentage of the explained variance is retained by the first two principal components?

# To do
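One possible way to carry this out, as a minimal sketch reusing `X` from the previous block:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reduced PCA = PCA on centered and scaled (standardized) variables
Z = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(Z)            # coordinates of the individuals on the PCs
ratios = pca.explained_variance_ratio_

# Scree plot of the proportion of variance explained by each PC
plt.bar(range(1, len(ratios) + 1), ratios)
plt.xlabel("Principal component")
plt.ylabel("Proportion of explained variance")
plt.title("Scree plot")
plt.show()

# Percentage of variance retained by the first two PCs
print(f"PC1 + PC2 retain {100 * ratios[:2].sum():.1f}% of the total variance")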

D. Create the correlation circle of the obtained PCA and explain it.

  • Compute the contributions (loadings) of the original variables to the first two PCs.

  • Compute the contribution of each individual to the first two PCs.

# To do
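A possible sketch, reusing `pca`, `Z`, and `scores` from the previous block; the loadings are taken as the correlations between the original (standardized) variables and the PCs:

# Loadings: correlations between the original variables and the PCs
eigvals = pca.explained_variance_
loadings = pca.components_.T * np.sqrt(eigvals)

# Correlation circle on the first factorial plane
fig, ax = plt.subplots(figsize=(6, 6))
ax.add_patch(plt.Circle((0, 0), 1, fill=False))
for j, col in enumerate(X.columns):
    ax.arrow(0, 0, loadings[j, 0], loadings[j, 1], head_width=0.03)
    ax.text(loadings[j, 0], loadings[j, 1], col)
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()

# Contribution (in %) of each variable to PC1 and PC2
var_contrib = 100 * loadings[:, :2] ** 2 / (loadings[:, :2] ** 2).sum(axis=0)
print(pd.DataFrame(var_contrib, index=X.columns, columns=["PC1", "PC2"]))

# Contribution (in %) of each individual (state) to PC1 and PC2
ind_contrib = 100 * scores[:, :2] ** 2 / (scores[:, :2] ** 2).sum(axis=0)
print(pd.DataFrame(ind_contrib, index=data["Unnamed: 0"], columns=["PC1", "PC2"]))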

E. Create the biplot of the data on the first factorial plane (PC1 and PC2). A sketch for producing it is given after the questions below. Based on this biplot, which US state in 1973 was

  • the most dangerous?
  • the safest?
  • the most urbanized?
  • Verify your answers by checking the actual situation of those states.
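A minimal sketch of the biplot, reusing `scores` and `loadings` from the previous blocks (the variable arrows are rescaled only to make them readable on the same axes):

states = data["Unnamed: 0"]

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(scores[:, 0], scores[:, 1], s=10)
for i, name in enumerate(states):
    ax.text(scores[i, 0], scores[i, 1], name, fontsize=7)

# Overlay the variable directions, rescaled for readability
scale = np.abs(scores[:, :2]).max()
for j, col in enumerate(X.columns):
    ax.arrow(0, 0, scale * loadings[j, 0], scale * loadings[j, 1], color="red", head_width=0.05)
    ax.text(scale * loadings[j, 0], scale * loadings[j, 1], col, color="red")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()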

2. Analyzing Auto-MPG dataset with PCA

A. Import the Auto-MPG dataset from Kaggle, available here.

  • Compute the correlation matrix of the quantitative columns of this dataset.
# To do
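A minimal sketch, assuming the commonly used Kaggle slug uciml/autompg-dataset and file name auto-mpg.csv (both are assumptions; adjust them to the dataset actually linked above). In that version, horsepower is read as text because missing values are coded as '?':

import kagglehub
import pandas as pd

# Hypothetical slug and file name; replace with the dataset linked in the TP
path = kagglehub.dataset_download("uciml/autompg-dataset")
auto = pd.read_csv(path + "/auto-mpg.csv")

# horsepower contains '?' for missing values, so convert it to numeric
auto["horsepower"] = pd.to_numeric(auto["horsepower"], errors="coerce")

# Correlation matrix of the quantitative columns (origin and model year are coded as numbers)
quant = auto.select_dtypes(include="number")
print(quant.corr())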

B. Perform reduced PCA on this dataset.

  • How much information (variation) is retained by the first two PCs?
# To do
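A minimal sketch continuing from the block above (rows with missing horsepower are dropped before standardizing):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reduced PCA on the standardized quantitative columns
quant_clean = quant.dropna()
Z_auto = StandardScaler().fit_transform(quant_clean)

pca_auto = PCA().fit(Z_auto)
ratios_auto = pca_auto.explained_variance_ratio_
print(f"PC1 + PC2 retain {100 * ratios_auto[:2].sum():.1f}% of the total variance")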

C. Create correlation circle and biplot. Comment.

D. Which variables contribute the most to PC1 and PC2?

  • Perform variable clustering as introduced in the course (a sketch is given below).

  • Try different values of the threshold on the second eigenvalue and observe its influence on the number of clusters.
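The exact variable-clustering procedure from the course is not reproduced here; the sketch below only illustrates the common VARCLUS-style idea of recursively splitting a cluster whenever the second eigenvalue of its correlation matrix exceeds a threshold (the function name and split rule are hypothetical choices):

import numpy as np

def varclus(df, threshold=1.0):
    # Split a set of variables into clusters: a cluster is divided whenever the
    # second eigenvalue of its correlation matrix exceeds `threshold`.
    def split(cols):
        if len(cols) < 2:
            return [list(cols)]
        eigvals, eigvecs = np.linalg.eigh(df[cols].corr().values)  # ascending order
        if eigvals[-2] <= threshold:
            return [list(cols)]                  # homogeneous enough: stop splitting
        # Assign each variable to the top eigenvector it loads on the most
        load = eigvecs[:, -2:]                   # columns: 2nd largest, largest
        group1 = [c for c, l in zip(cols, load) if abs(l[1]) >= abs(l[0])]
        group2 = [c for c in cols if c not in group1]
        if not group1 or not group2:
            return [list(cols)]
        return split(group1) + split(group2)
    return split(list(df.columns))

# Influence of the threshold on the number of clusters
for t in [0.7, 1.0, 1.3]:
    print(t, varclus(quant_clean, threshold=t))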

E. Supervised framework: Build predictive models for MPG using selected or all features, according to your preference.

  • Replace the chosen features with PCs from the PCA.

  • Select some features based on the variable-clustering results and compare the performance of the models across these cases (a comparison sketch is given below).
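A minimal comparison sketch, assuming the column names from the earlier Auto-MPG blocks (with mpg as the target); the two feature sets below are only examples:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Target and full feature matrix (column names assumed from the Kaggle CSV)
y = quant_clean["mpg"]
X_auto = quant_clean.drop(columns=["mpg"])

# (1) all original features, (2) the first two PCs of the standardized features
pipe_orig = make_pipeline(StandardScaler(), LinearRegression())
pipe_pca = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())

for name, model in [("original features", pipe_orig), ("first 2 PCs", pipe_pca)]:
    r2 = cross_val_score(model, X_auto, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")

# A third pipeline can be built the same way using only one representative per variable cluster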

3. Mathematical Problem of PCA

From a mathematical point of view, PCA can be seen as the search for a subspace onto which the projected data points have maximum variance, or equivalently as the search for the closest low-rank approximation of the data.

Suppose we have a design matrix of observations \(X\in\mathbb{R}^{n\times d}\) with centered columns. We aim to mathematically define the 1st, 2nd, …, \(d\)th principal components of this matrix.

A. First PC: a vector \(\vec{u}_1\in\mathbb{R}^d\) is the \(1\)st PC direction of \(X\) if it is the unit vector along which the projection of the observations \(X\) achieves maximum variance, i.e.,

\[\vec{u}_1=\arg\max_{\vec{u}:\|\vec{u}\|=1}\|X\vec{u}\|^2.\]

  • Show that \(\vec{u}_1\) is the first eigenvector of matrix \(X^TX\) corresponding to its largest eigenvalue \(\lambda_1\).
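A possible first step for the proof (a hint only, not the full argument): rewrite the objective as a quadratic form and introduce a Lagrange multiplier for the unit-norm constraint,

\[\|X\vec{u}\|^2=\vec{u}^TX^TX\vec{u},\qquad \mathcal{L}(\vec{u},\lambda)=\vec{u}^TX^TX\vec{u}-\lambda(\vec{u}^T\vec{u}-1),\qquad \nabla_{\vec{u}}\mathcal{L}=0\iff X^TX\vec{u}=\lambda\vec{u},\]

so every stationary point is an eigenvector of \(X^TX\), and the attained value \(\vec{u}^TX^TX\vec{u}=\lambda\) is largest for the eigenvector associated with \(\lambda_1\).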

B. The \(k\)th PC: Let \(\widehat{X}_k=X-\sum_{j=1}^{k-1}X\vec{u}_j\vec{u}_j^T\); then the \(k\)th PC direction of \(X\) is the vector \(\vec{u}_k\in\mathbb{R}^d\), orthogonal to all the previous PC directions \(\{\vec{u}_1,\dots,\vec{u}_{k-1}\}\), satisfying

\[\vec{u}_k=\arg\max_{\vec{u}:\|\vec{u}\|=1}\|\widehat{X}_k\vec{u}\|^2.\]

  • Show that \(\vec{u}_k\) is the \(k\)th eigenvector of the matrix \(X^TX\), corresponding to its \(k\)th largest eigenvalue \(\lambda_k\leq\lambda_{k-1}\leq\dots\leq\lambda_1\).

C. Show that the matrix \(\tilde{X}_k=\sum_{j=1}^kX\vec{u}_j\vec{u}_j^T\) is the best rank-\(k\) approximation of the original data \(X\) w.r.t. the Frobenius norm, i.e.,

\[\tilde{X}_k=\arg\min_{W:\text{rank}(W)\leq k}\|X-W\|_{F},\qquad\text{with }\min_{W:\text{rank}(W)\leq k}\|X-W\|_{F}^2=\sum_{j=k+1}^d\lambda_j.\]

4. Analyzing the Wine Dataset with PCA

The dataset is available as the Wine dataset on Kaggle and can be downloaded as follows.

  • Your task is to perform PCA and then variable clustering on this dataset.

  • Interpret the findings, then predict the wine class using different feature sets: the original features, the PCs, and the features selected by variable clustering, and compare their performance (a starting sketch is given after the data preview below).

import kagglehub
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("tawfikelmetwally/wine-dataset")
data = pd.read_csv(path + '/Wine dataset.csv')
data.head() 
class Alcohol Malic acid Ash Alcalinity of ash Magnesium Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue OD280/OD315 of diluted wines Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
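As a starting point, here is a minimal sketch for the PCA and prediction parts (the variable-clustering step can reuse the varclus sketch from Section 2; the number of retained PCs below is an arbitrary choice):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# The cultivar label is in the `class` column; all other columns are features
y = data["class"]
X = data.drop(columns=["class"])

# Reduced PCA on the standardized features
Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)
print("Explained variance ratios:", pca.explained_variance_ratio_.round(3))

# Compare classification with the original features vs. the first three PCs
pipe_orig = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
pipe_pca = make_pipeline(StandardScaler(), PCA(n_components=3), LogisticRegression(max_iter=5000))
for name, model in [("original features", pipe_orig), ("first 3 PCs", pipe_pca)]:
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")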

Further Readings