TP6 - Principal Component Analysis (PCA)

Exploratory Data Analysis & Unsupervised Learning
Course: PHAUK Sokkey, PhD
TP: HAS Sothea, PhD


Objective: In this lab, let’s dive into an essential unsupervised learning method: Principal Component Analysis (PCA). PCA is a key technique for dimensionality reduction that simplifies data while preserving its crucial patterns. We will explore PCA from multiple perspectives in this TP.


The Jupyter Notebook for this TP can be downloaded here: TP6_PCA.ipynb.


1. Analyzing US Crime Dataset with PCA

The USArrests data, available on Kaggle, provides statistics on arrests for crimes including rape, assault, and murder in the 50 states of the United States in 1973.

For information, read about the dataset here. We will use PCA to identify which U.S. state was the most dangerous or the safest in 1973.

A. Import the data and visualize each column to get a general sense of the dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("halimedogan/usarrests")

import pandas as pd

data = pd.read_csv(path + "/usarrests.csv")
data.columns = ['State'] + list(data.columns[1:])
data.head()
State Murder Assault UrbanPop Rape
0 Alabama 13.2 236 58 21.2
1 Alaska 10.0 263 48 44.5
2 Arizona 8.1 294 80 31.0
3 Arkansas 8.8 190 50 19.5
4 California 9.0 276 91 40.6
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
_, axs = plt.subplots(2,2, figsize=(9,6))
i = 0
for var in data.columns[1:]:
    sns.histplot(data=data, x=var, kde=True, ax=axs[i // 2,i % 2])
    i += 1
plt.tight_layout()
plt.show()

B. Study correlations between columns of the data using both Pearson and Spearman correlation coefficients.

data.iloc[:,1:].corr().style.background_gradient()
  Murder Assault UrbanPop Rape
Murder 1.000000 0.801873 0.069573 0.563579
Assault 0.801873 1.000000 0.258872 0.665241
UrbanPop 0.069573 0.258872 1.000000 0.411341
Rape 0.563579 0.665241 0.411341 1.000000
data.iloc[:,1:].corr(method="spearman").style.background_gradient()
  Murder Assault UrbanPop Rape
Murder 1.000000 0.817274 0.106716 0.679427
Assault 0.817274 1.000000 0.275213 0.714368
UrbanPop 0.106716 0.275213 1.000000 0.438107
Rape 0.679427 0.714368 0.438107 1.000000
  • Create a pairplot for all columns of the data.
pair_plot = sns.pairplot(data=data, kind='reg')
# Function to calculate and display correlation coefficients 
import numpy as np
from scipy.stats import spearmanr
def corr_func(x, y, **kws): 
    r1 = np.corrcoef(x, y)[0, 1] 
    r2, p_value = spearmanr(x, y)
    ax = plt.gca() 
    ax.annotate(f"Pearson's corr = {r1:.2f}", xy=(0.5, 0.9), xycoords=ax.transAxes, ha='center', fontsize=12, color='red') 
    ax.annotate(f"Spearman's corr = {r2:.2f}", xy=(0.5, 0.8), xycoords=ax.transAxes, ha='center', fontsize=12, color='blue') 
# Map the correlation coefficients onto the upper panels of the pairplot
pair_plot.map_upper(corr_func)
plt.show()

  • Given such a pairplot and based on the correlations above, is it a good idea to perform dimensionality reduction on this dataset? Why?

The Pearson correlation matrix shows that several columns of the USArrests dataset are strongly linearly correlated. Highly correlated columns carry largely redundant information and can therefore be represented in a lower-dimensional space with little loss. In short, the more correlated the columns are, the more effective dimensionality reduction tends to be.
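As a quick numerical check of this intuition (a minimal sketch, not part of the original TP, using the data frame loaded above), the eigenvalues of the Pearson correlation matrix already indicate how compressible the data is: a few large leading eigenvalues mean that most of the standardized variance lies in a low-dimensional subspace.

import numpy as np

# Eigenvalues of the correlation matrix, sorted in descending order.
# A few dominant eigenvalues suggest the data can be summarized by few directions.
eigvals = np.linalg.eigvalsh(data.iloc[:, 1:].corr())[::-1]
print("Eigenvalues of the correlation matrix:", eigvals.round(3))
print("Share captured by the two largest:", (eigvals[:2].sum() / eigvals.sum()).round(3))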

C. Perform reduced PCA (scaled and centered data) on this dataset.

  • Create the scree plot of explained variances of the data.

  • What percentage of explained variance is retained by the first two principal components?

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
X = scaler.fit_transform(data.iloc[:,1:])
pca = PCA(n_components=4).fit(X)

print(f'The first two PCs can capture around {np.sum(pca.explained_variance_ratio_[:2]*100).round(2)}% of the total variation of the data.')

# Screeplot
ax = sns.barplot(x = ['PC' + str(i) for i in range(1,5)], y=pca.explained_variance_)
plt.plot(['PC' + str(i) for i in range(1,5)], pca.explained_variance_, 'r')
ax.bar_label(ax.containers[0])
plt.title("Screeplot of the performed PCA")
plt.show()
The first two PCs can capture around 86.75% of the total variation of the data.

print("Percentage of explained varaince:")
pca.explained_variance_ratio_ * 100
Percentage of explained varaince:
array([62.00603948, 24.74412881,  8.91407951,  4.33575219])
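The cumulative explained variance shows the same information at a glance and is a common way to decide how many components to keep (for instance, enough PCs to pass a chosen threshold such as 80-90%). A short sketch using the pca and np objects already defined above:

# Cumulative explained variance of the principal components.
cum_var = np.cumsum(pca.explained_variance_ratio_) * 100
for k, v in enumerate(cum_var, start=1):
    print(f"First {k} PC(s): {v:.2f}% of the total variance")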

D. Create circle of correlation of the obtained PCA. Explain this correlation circle.

sns.set(style="whitegrid")
pc_df = pd.DataFrame(pca.transform(X), columns=[f'PC{i+1}' for i in range(X.shape[1])]) 
# Combine original data and principal components 
combined_data = pd.concat([pd.DataFrame(X, columns=data.columns[1:]), pc_df], axis=1)
# Compute correlation matrix 
Corr_matrix = combined_data.corr()[['PC'+str(i) for i in range(1,3)]].iloc[:4,:]
# Circle of correlation 
plt.figure(figsize=(8, 8)) 
plt.axhline(0, color='grey', linestyle="--", lw=0.5) 
plt.axvline(0, color='grey', linestyle="--", lw=0.5) 
circle = plt.Circle((0, 0), 1, color='grey', fill=False) 
plt.gca().add_artist(circle) 
# Plot correlations 
for i in range(4): 
    plt.arrow(0, 0, Corr_matrix.iloc[i,0], Corr_matrix.iloc[i,1], color='r', alpha=0.9, head_width=0.03, head_length=0.03) 
    plt.text(Corr_matrix.iloc[i,0], Corr_matrix.iloc[i,1] + 0.05, data.columns[i+1], color='b', ha='center', va='center') 
plt.xlabel(f'PC1 ({(pca.explained_variance_ratio_[0]*100).round(2)}%)') 
plt.ylabel(f'PC2 ({(pca.explained_variance_ratio_[1]*100).round(2)}%)') 
plt.title('Circle of Correlation') 
plt.xlim(-1.1, 1.1) 
plt.ylim(-1.1, 1.1) 
plt.grid() 
plt.show()

  • Compute the contribution of original variables on the first two PCs (loadings).

When decomposing the correlation matrix (reduced PCA) as \(C=U\Sigma U^T\), where

  • \(U\) is the unitary matrix of eigenvectors of \(C\), of size \(d\times d\), i.e., \(U^TU=I_d\);
  • \(\Sigma\) is a diagonal matrix of eigenvalues \(\lambda_1\geq\lambda_2\geq\dots\geq\lambda_d\geq 0\), which measure the variation along the 1st, 2nd, …, \(d\)th directions of the data;

the contribution of original variable \(j\) to the \(k\)th PC (based on the loadings) is defined by

\[\text{contr}_j(\text{PC}_k)=\frac{[U\Sigma^{1/2}]_{jk}^2}{\sum_{p=1}^d[U\Sigma^{1/2}]_{pk}^2}=\frac{[U\Sigma^{1/2}]_{jk}^2}{\lambda_k}\]
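Before computing these contributions, it helps to relate the symbols to scikit-learn's attributes: for the fitted pca object above, pca.components_ stores \(U^T\) (each row is an eigenvector) and pca.explained_variance_ stores the corresponding eigenvalues \(\lambda_1\geq\dots\geq\lambda_d\) (up to the usual \(n\) versus \(n-1\) normalization). A minimal sanity check of the denominator in the formula, assuming only these attributes:

# U Sigma^{1/2}: columns of U scaled by the square roots of the eigenvalues.
U_sqrt_sigma = pca.components_.T * np.sqrt(pca.explained_variance_)
# The squared entries of column k sum to lambda_k, the denominator of contr_j(PC_k).
print(np.allclose((U_sqrt_sigma ** 2).sum(axis=0), pca.explained_variance_))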

loadings = (pca.components_.T) ** 2 * pca.explained_variance_
contr = pd.DataFrame(loadings/pca.explained_variance_, index=data.columns[1:], columns=[f'PC{i+1}' for i in range(pca.components_.shape[1])])
print("Percentages of contributions of all variables on PCs:")
contr.abs().round(4)*100
Percentages of contributions of all variables on PCs:
PC1 PC2 PC3 PC4
Murder 28.72 17.49 11.64 42.15
Assault 34.01 3.53 7.19 55.27
UrbanPop 7.74 76.18 14.29 1.79
Rape 29.53 2.80 66.88 0.79
plt.figure(figsize=(10,5))
ax = sns.barplot(data=contr.reset_index().melt(id_vars='index', value_name="Contribution", var_name="PCs"), x="PCs", y="Contribution", hue="index")
ax.set_title("Contribution of each variable onto each PC")
for lab in ax.containers:
    ax.bar_label(lab, fmt="%.2f")
plt.legend(title="Variable")
plt.show()

PC1 receives its main contributions from the three crime columns, while UrbanPop contributes the most to PC2. We can give meaning to these PCs according to the variables that contribute the most to them.

  • PC1 can be seen as a Crime axis or Danger Level, since the crime variables contribute the most to it.
  • PC2 may represent an Urbanization axis.

Just by looking at these loadings, we might think that Murder and Assault contribute more to PC4 than to PC1 and should therefore be interpreted through PC4. However, this is not the right way to read the contributions. Although the percentage contributions to PC1 are smaller than to PC4, PC1 captures far more of the data's variation than PC4 does, so a smaller contribution to PC1 can be much more impactful and meaningful than a large contribution to an insignificant direction such as PC4. In our case, PC1 captures more than \(60\%\) of the variation of the data while PC4 captures only about \(4\%\); a \(5\%\) contribution to PC1 can therefore matter more than a \(70\%\) contribution to PC4.
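One way to make this weighting explicit (an illustrative sketch, not required by the TP) is the quality of representation of each variable on the first factorial plane, often called cos²: the squared correlation between the variable and each of PC1 and PC2. Unlike the raw contributions, it automatically down-weights directions that capture little variance, and it can be read directly from the Corr_matrix computed for the correlation circle above:

# cos^2: squared correlations (in %) between each original variable and PC1/PC2.
cos2 = (Corr_matrix ** 2) * 100
cos2['PC1 + PC2'] = cos2.sum(axis=1)  # how well the PC1-PC2 plane represents each variable
print(cos2.round(2))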

We can also verify that the total contribution of all original variables to each PC is \(100\%\):

np.sum(contr.abs().round(4)*100, axis=0)
PC1    100.0
PC2    100.0
PC3    100.0
PC4    100.0
dtype: float64
  • Compute the contribution of each individual on the first two PCs.

We can also compute the contribution of each observation to each PC from its score as follows:

\[\text{Score}_i(\text{PC}_k)=\frac{[XU]_{ik}^2}{\sum_{\ell=1}^n[XU]_{\ell k}^2}=\frac{[XU]_{ik}^2}{(n-1)\lambda_k}\]

pc_score = pd.DataFrame(pca.transform(X) , index=data['State'], columns=contr.columns)
scores = pc_score ** 2/((data.shape[0]-1) * pca.explained_variance_) * 100
scores
PC1 PC2 PC3 PC4
State
Alabama 0.783263 2.595723 1.107096 0.281605
Alaska 3.066667 2.327394 23.342924 2.218248
Arizona 2.506809 1.124411 0.016833 8.033733
Arkansas 0.016127 2.533823 0.073631 0.385398
California 5.136980 4.810526 2.009575 1.348804
Colorado 1.849740 1.970700 6.725542 0.000025
Connecticut 1.488503 2.396051 2.320937 0.161852
Delaware 0.001835 0.213906 2.896728 8.970584
Florida 7.320596 0.003110 1.866330 0.106911
Georgia 2.166925 3.305216 0.657830 13.371283
Hawaii 0.671663 4.983697 0.014465 9.399294
Idaho 2.168292 0.089940 0.378596 2.872684
Illinois 1.533234 0.939430 2.574581 0.171703
Indiana 0.206021 0.046417 0.291724 2.079696
Iowa 4.095505 0.021878 0.151902 0.003554
Kansas 0.512063 0.147487 0.003663 0.491734
Kentucky 0.454625 1.856214 0.004514 5.185332
Louisiana 1.974530 1.533164 3.443101 2.384564
Maine 4.632445 0.286271 0.024199 1.259340
Maryland 2.507394 0.369560 0.138700 3.604435
Massachusetts 0.190592 4.393244 2.083710 0.371975
Michigan 3.584751 0.048796 0.830842 0.120856
Minnesota 2.310397 0.807776 0.131424 0.052258
Mississippi 0.800729 11.579028 3.078260 0.535590
Missouri 0.391504 0.140146 0.799095 0.588095
Montana 1.133193 0.582429 0.341899 0.176580
Nebraska 1.291677 0.076014 0.172907 0.002913
Nevada 6.662370 1.215552 7.591665 1.140748
New Hampshire 4.582661 0.000661 0.007619 0.012663
New Jersey 0.026583 4.245587 3.277900 0.683101
New Mexico 3.161381 0.041234 0.189267 1.329447
New York 2.282896 1.369279 2.319620 0.002097
North Carolina 1.017626 10.030660 4.183024 10.503879
North Dakota 7.219792 0.725310 0.509127 0.743926
Ohio 0.041174 1.113229 0.005439 2.590050
Oklahoma 0.078386 0.167434 0.001315 0.001231
Oregon 0.002819 0.592315 4.954443 0.652017
Pennsylvania 0.636456 0.659057 0.900280 1.486765
Rhode Island 0.601637 4.498036 10.526893 4.341432
South Carolina 1.406566 7.553415 0.506631 0.199314
South Dakota 3.186181 1.369808 0.850056 0.138453
Tennessee 0.805957 1.495369 0.198423 4.915315
Texas 1.480823 0.343800 1.358142 4.770804
Utah 0.244430 4.375434 0.483932 0.078136
Vermont 6.328341 3.973492 3.969695 0.242093
Virginia 0.007483 0.080613 0.000769 0.515223
Washington 0.037937 1.901746 2.190151 0.562460
West Virginia 3.585241 4.102363 0.061575 0.200657
Wisconsin 3.487734 0.755026 0.108163 0.390869
Wyoming 0.319467 0.208230 0.324862 0.320277
import plotly.graph_objects as go
import plotly.express as px

fig = px.scatter(data_frame=scores, x="PC1", y="PC2", hover_name=scores.index, size=scores.PC1 ** 2+ scores.PC2 ** 2, color=scores.PC1 ** 2+ scores.PC2 ** 2)
fig.update_layout(
    width=800, height=600, title="Individual scores on each PC",
    xaxis = dict(title=f"PC1 ({np.round(pca.explained_variance_ratio_[0]*100,2)}%)"),
    yaxis = dict(title=f"PC2 ({np.round(pca.explained_variance_ratio_[1]*100,2)}%)"))
fig.show()

These are the contributions, in percentages, of the individual observations to each PC. Summing them over all observations gives \(100\%\) of the variation along each direction, which we can verify:

np.sum(scores, axis=0)
PC1    100.0
PC2    100.0
PC3    100.0
PC4    100.0
dtype: float64

E. Create a biplot of the data on the first factorial plane (PC1 and PC2).

import plotly.io as pio
pio.renderers.default = 'notebook'

pc_score.columns = ['PC' + str(i) for i in range(1,5)]
fig = px.scatter(
    pc_score, x='PC1', y='PC2', size=data.UrbanPop, 
    hover_name=pc_score.index, size_max=20, color='PC1',
    title='Biplot of PCA on USArrests dataset') 

cols = ['sandybrown', 'seagreen', 'red', 'sienna']

# Add arrows for the loadings 
for i, feature in enumerate(data.columns[1:]): 
    fig.add_trace(go.Scatter(
        x=[0, Corr_matrix.iloc[i,0]], y=[0, Corr_matrix.iloc[i,1]], 
        mode='lines+text', name=feature, line=dict(color=cols[i], width=2))) 
    fig.add_annotation( x=Corr_matrix.iloc[i, 0], y=Corr_matrix.iloc[i, 1], 
                       ax=0, ay=0, 
                       xref='x', yref='y', 
                       axref='x', ayref='y', 
                       showarrow=True, 
                       arrowhead=3, 
                       arrowsize=1, 
                       arrowwidth=2, 
                       arrowcolor=cols[i] )
# Customize layout 
fig.update_layout( 
    width = 700, height=500,
    xaxis_title=f'PC1 ({(pca.explained_variance_ratio_[0]*100).round(2)}%)', 
    yaxis_title=f'PC2 ({(pca.explained_variance_ratio_[1]*100).round(2)}%)', 
    coloraxis_colorbar=dict(x=-0.25),
    showlegend=True )

Based on this biplot, and since PC1 can be viewed as the crime axis, in 1973:

  • the most dangerous states were Florida, Nevada and California (far right on PC1).
  • the safest was North Dakota (far left).
  • along PC2 (the urbanization axis), the least urbanized states sit toward the top in this run, while highly urbanized states such as California sit toward the bottom.

Let’s verify our answers:

print("Most dangerous states:")
pc_score.index = data['State']
data.loc[data['State'].isin(pc_score.sort_values(by='PC1', ascending=False).iloc[:3,:].index)]
Most dangerous states:
State Murder Assault UrbanPop Rape
4 California 9.0 276 91 40.6
8 Florida 15.4 335 80 31.9
27 Nevada 12.2 252 81 46.0
print("Most safest states:")
pc_score.index = data['State']
data.loc[data['State'].isin(pc_score.sort_values(by='PC1', ascending=True).iloc[:3,:].index)]
Most safest states:
State Murder Assault UrbanPop Rape
18 Maine 2.1 83 51 7.8
33 North Dakota 0.8 45 44 7.3
44 Vermont 2.2 48 32 11.2
print("Most urbanized states:")
pc_score.index = data['State']
data.loc[data['State'].isin(pc_score.sort_values(by='PC2', ascending=False).iloc[:5,:].index)]
Most urbanized states:
State Murder Assault UrbanPop Rape
23 Mississippi 16.1 259 44 17.1
32 North Carolina 13.0 337 45 16.1
39 South Carolina 14.4 279 48 22.5
44 Vermont 2.2 48 32 11.2
47 West Virginia 5.7 81 39 9.3

2. Analyzing Auto-MPG dataset with PCA

A. Import the Auto-MPG dataset from Kaggle, available here.

  • Compute correlation matrix of quantitative columns of this dataset.
import kagglehub

# Download latest version
path = kagglehub.dataset_download("uciml/autompg-dataset")
auto = pd.read_csv(path + '/auto-mpg.csv')
auto.head()
mpg cylinders displacement horsepower weight acceleration model year origin car name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
X = auto.loc[auto['horsepower'] != '?'].copy()  # .copy() avoids SettingWithCopyWarning when converting below
car_names = X['car name']
X['horsepower'] = X.horsepower.astype(int)
X_clean = X.iloc[:,:-3]                         # keep the six quantitative columns
print(X_clean.dtypes)
X_clean.corr().style.background_gradient()
mpg             float64
cylinders         int64
displacement    float64
horsepower        int32
weight            int64
acceleration    float64
dtype: object
  mpg cylinders displacement horsepower weight acceleration
mpg 1.000000 -0.777618 -0.805127 -0.778427 -0.832244 0.423329
cylinders -0.777618 1.000000 0.950823 0.842983 0.897527 -0.504683
displacement -0.805127 0.950823 1.000000 0.897257 0.932994 -0.543800
horsepower -0.778427 0.842983 0.897257 1.000000 0.864538 -0.689196
weight -0.832244 0.897527 0.932994 0.864538 1.000000 -0.416839
acceleration 0.423329 -0.504683 -0.543800 -0.689196 -0.416839 1.000000
print(X_clean.isna().sum())
mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
dtype: int64
from sklearn.preprocessing import StandardScaler
scaler  = StandardScaler()
X_clean = scaler.fit_transform(X_clean)
X_clean.shape
(392, 6)

B. Perform reduced PCA on this dataset.

  • How much information or variation is retained by the first two PCs?
from sklearn.decomposition import PCA
pca = PCA(n_components=6)
pca = pca.fit(X_clean)
print(f'Variation explained by the first two PCs: {np.sum(pca.explained_variance_ratio_[:2])}')
Variation explained by the first two PCs: 0.9194828785333572
# Screeplot
plt.figure(figsize=(10,7))
ax = sns.barplot(x = ['PC' + str(i) for i in range(1,X_clean.shape[1]+1)], y=pca.explained_variance_)
plt.plot(['PC' + str(i) for i in range(1,X_clean.shape[1]+1)], pca.explained_variance_, 'r')
ax.bar_label(ax.containers[0], fmt="%.2f")
plt.title("Screeplot of the performed PCA")
plt.show()

C. Create correlation circle and biplot. Comment.

sns.set(style="whitegrid")
pc_df = pd.DataFrame(pca.transform(X_clean), columns=[f'PC{i+1}' for i in range(X_clean.shape[1])]) 
# Combine original data and principal components 
combined_data = pd.concat([pd.DataFrame(X_clean, columns=auto.columns[:-3]), pc_df], axis=1)
# Compute correlation matrix 
Corr_matrix = combined_data.corr()[['PC'+str(i) for i in range(1,3)]].iloc[:6,:]
print(Corr_matrix)
# Circle of correlation 
plt.figure(figsize=(8, 8)) 
plt.axhline(0, color='grey', linestyle="--", lw=0.5) 
plt.axvline(0, color='grey', linestyle="--", lw=0.5) 
circle = plt.Circle((0, 0), 1, color='grey', fill=False) 
plt.gca().add_artist(circle) 
# Plot correlations 
for i in range(6): 
    plt.arrow(0, 0, Corr_matrix.iloc[i,0], Corr_matrix.iloc[i,1], color='r', alpha=0.9, head_width=0.03, head_length=0.03) 
    plt.text(Corr_matrix.iloc[i,0], Corr_matrix.iloc[i,1] + 0.05, auto.columns[i], color='b', ha='center', va='center') 
plt.xlabel(f'PC1 ({(pca.explained_variance_ratio_[0]*100).round(2)}%)') 
plt.ylabel(f'PC2 ({(pca.explained_variance_ratio_[1]*100).round(2)}%)') 
plt.title('Circle of Correlation') 
plt.xlim(-1.1, 1.1) 
plt.ylim(-1.1, 1.1) 
plt.grid() 
plt.show()
                   PC1       PC2
mpg          -0.873037 -0.208990
cylinders     0.942277  0.126601
displacement  0.970540  0.092613
horsepower    0.949950 -0.141833
weight        0.941156  0.244211
acceleration -0.638795  0.761967

  • Biplot of the data
import plotly.io as pio
pio.renderers.default = 'notebook'

import plotly.graph_objects as go
import plotly.express as px
pc_df.columns = ['PC' + str(i) for i in range(1,7)]
pc_df.index = car_names
fig1 = px.scatter(
    pc_df, x='PC1', y='PC2', size=X.dropna().cylinders, 
    hover_name=car_names, size_max=15, color=X.mpg, symbol= X.dropna()['origin'].astype(object),
    title='Biplot of Auto-MPG dataset') 

cols = ['sandybrown', 'seagreen', 'red', 'sienna', 'skyblue', 'blue']

# Add arrows for the loadings 
for i, feature in enumerate(Corr_matrix.index): 
    fig1.add_trace(go.Scatter(
        x=[0, 2*Corr_matrix.iloc[i,0]], y=[0, 2*Corr_matrix.iloc[i,1]], 
        mode='lines+text', name=feature, 
        line=dict(color=cols[i], width=2))) 
    fig1.add_trace(go.Scatter( 
        x=[2*Corr_matrix.iloc[i,0]], y=[2*Corr_matrix.iloc[i,1]], 
        mode='markers', marker=dict(size=10, symbol='arrow-bar-up', 
                                    angle=np.degrees(np.arctan2(2*Corr_matrix.iloc[i,0], 2*Corr_matrix.iloc[i,1])), 
                                    color=cols[i]), 
                                    showlegend=False))
    # Customize layout 
fig1.update_layout( 
    coloraxis_colorbar_x=-0.275,
    width = 700, height=500,
    xaxis_title=f'PC1 ({(pca.explained_variance_ratio_[0]*100).round(2)}%)', 
    yaxis_title=f'PC2 ({(pca.explained_variance_ratio_[1]*100).round(2)}%)', 
    showlegend=True )
fig1.layout.coloraxis.colorbar.title = 'MPG'
fig1.show()

3. Mathematical Problem of PCA

From a mathematical point of view, PCA can be seen as the search for a subspace onto which the projected data points have maximum variance, or equivalently as the search for the closest low-rank approximation of the data.

Suppose we have a design matrix of observations \(X\in\mathbb{R}^{n\times d}\) with centered columns. We aim to define mathematically the 1st, 2nd, …, \(d\)th principal components of this matrix.

A. First PC: a vector \(\vec{u}_1\in\mathbb{R}^d\) is the \(1\) st PC direction of \(X\) if it is the direction (unit vector) in which the projection of observations \(X\) achieves maximum variance, i.e.,

\[\vec{u}_1=\arg\max_{\vec{u}:\|\vec{u}\|=1}\|X\vec{u}\|^2.\]

  • Show that \(\vec{u}_1\) is the first eigenvector of matrix \(X^TX\) corresponding to its largest eigenvalue \(\lambda_1\).

Proof: This is the maximization of \(\|X\vec{u}\|^2=\vec{u}^TX^TX\vec{u}\) under the constraint \(\|\vec{u}\|=1\). The Lagrangian of the problem, with multiplier \(\lambda>0\), writes:

\[\begin{align*} {\cal L}(\vec{u},\lambda)&=\vec{u}^TX^TX\vec{u}-\lambda(\vec{u}^T\vec{u}-1)\\ \Rightarrow &\begin{cases}\frac{\partial {\cal L}(\vec{u},\lambda)}{\partial \vec{u}}=2X^TX\vec{u}-2\lambda\vec{u}\\ \frac{\partial {\cal L}(\vec{u},\lambda)}{\partial \lambda}=-(\vec{u}^T\vec{u}-1) \end{cases}\\ \text{Setting }\frac{\partial {\cal L}(\vec{u},\lambda)}{\partial \vec{u}}\Big|_{\vec{u}=\vec{u}_1}&=0\Leftrightarrow X^TX\vec{u}_1=\lambda\vec{u}_1\quad (1) \end{align*}\]

This implies that \(\vec{u}_1\) is an eigenvector of \(X^TX\) with some eigenvalue \(\lambda>0\). Since \(X^TX\) is symmetric, the Spectral Theorem ensures that it is diagonalizable, with the following decomposition:

\[ \begin{align*}X^TX&=U\Sigma U^T=\underbrace{\begin{bmatrix}u_{11} & u_{12} & \dots & u_{1d}\\ u_{21} & u_{22} & \dots & u_{2d}\\ \vdots & \vdots & \ddots & \vdots\\ u_{d1} & u_{d2} & \dots & u_{dd}\\ \end{bmatrix}}_{U}\underbrace{\begin{bmatrix}\lambda_{1} & 0 & \dots & 0\\ 0 & \lambda_{2} & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & \lambda_{d}\\ \end{bmatrix}}_{\Sigma}\underbrace{\begin{bmatrix}u_{11} & u_{12} & \dots & u_{1d}\\ u_{21} & u_{22} & \dots & u_{2d}\\ \vdots & \vdots & \ddots & \vdots\\ u_{d1} & u_{d2} & \dots & u_{dd}\\ \end{bmatrix}^T}_{U^T}\\ &=\sum_{j=1}^d\lambda_j\vec{u}_j\vec{u}_j^T=\sum_{j=1}^d\lambda_j\underbrace{\begin{bmatrix}u_{1j}\\ u_{2j}\\ \vdots\\ u_{dj}\\ \end{bmatrix}}_{\vec{u}_j}\underbrace{\begin{bmatrix}u_{1j} & u_{2j} & \dots & u_{dj} \end{bmatrix}}_{\vec{u}_j^T} \end{align*}\]

where

  • \(U\) is a \(d\times d\)-dimensional unitary matrix (\(U^TU=UU^T=I_d\)) whose columns \(\vec{u}_1,\dots,\vec{u}_d\) are the eigenvectors of \(X^TX\).
  • \(\Sigma=\text{diag}(\lambda_1,\dots,\lambda_d)\) is a diagonal matrix whose diagonal elements \(\lambda_1\geq \dots\geq \lambda_d\geq 0\) are the eigenvalues corresponding to the eigenvectors \(\vec{u}_1,\dots,\vec{u}_d\), respectively.

From this result, \((1)\) is an eigenvalue equation, and at any stationary point the objective value equals \(\vec{u}^TX^TX\vec{u}=\lambda\vec{u}^T\vec{u}=\lambda\). Maximizing the objective therefore forces the Lagrange multiplier to be \(\lambda=\lambda_1\), the largest eigenvalue, and \(\vec{u}_1\) must be the eigenvector corresponding to \(\lambda_1\).
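This can be checked numerically. Below is a small self-contained sketch on synthetic data (the names Xc and pca_check are illustrative, not part of the TP): the leading eigenvector of \(X^TX\) coincides, up to sign, with the first PC direction returned by scikit-learn.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Xc = rng.normal(size=(200, 5))
Xc -= Xc.mean(axis=0)                            # center the columns

eig_values, eig_vectors = np.linalg.eigh(Xc.T @ Xc)   # eigenvalues in ascending order
u1 = eig_vectors[:, -1]                               # eigenvector of the largest eigenvalue

pca_check = PCA().fit(Xc)
print(np.allclose(np.abs(u1), np.abs(pca_check.components_[0])))  # True: same direction up to sign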


B. The \(k\)th PC: Let \(\widehat{X}_k=X-\sum_{j=1}^{k-1}X\vec{u}_j\vec{u}_j^T\). The \(k\)th PC direction of \(X\) is then the vector \(\vec{u}_k\in\mathbb{R}^d\), orthogonal to all the previous PC directions \(\vec{u}_1,\dots,\vec{u}_{k-1}\), satisfying

\[\vec{u}_k=\arg\max_{\vec{u}:\|\vec{u}\|=1}\|\widehat{X}_k\vec{u}\|^2.\]

  • Show that \(\vec{u}_k\) is the \(k\) th eigenvector of matrix \(X^TX\) corresponding to its \(k\) th largest eigenvalue \(\lambda_k\leq\lambda_{k-1}\leq\dots\leq \lambda_1\).

Proof:

We proceed by induction: suppose the statement is true for the first \(k\) directions, with \(1\leq k<d\). We show that it then holds for the \((k+1)\)th direction.

We write the Lagrangian of the optimization problem, with multiplier \(\lambda>0\):
\[{\cal L}(\vec{u}_{k+1},\lambda)=\vec{u}_{k+1}^T\widehat{X}_{k+1}^T\widehat{X}_{k+1}\vec{u}_{k+1}-\lambda(\vec{u}_{k+1}^T\vec{u}_{k+1}-1).\]
Since \(\vec{u}_{k+1}\perp\vec{u}_j\) for all \(j\leq k\),
\[\widehat{X}_{k+1}\vec{u}_{k+1}=X\Big(I_d-\sum_{j=1}^{k}\vec{u}_j\vec{u}_j^T\Big)\vec{u}_{k+1}=X\Big(\vec{u}_{k+1}-\sum_{j=1}^{k}\vec{u}_j\underbrace{\vec{u}_j^T\vec{u}_{k+1}}_{=\,0}\Big)=X\vec{u}_{k+1},\]
so the Lagrangian reduces to
\[{\cal L}(\vec{u}_{k+1},\lambda)=\vec{u}_{k+1}^TX^TX\vec{u}_{k+1}-\lambda(\vec{u}_{k+1}^T\vec{u}_{k+1}-1).\]

Differentiating with respect to \(\vec{u}_{k+1}\) and \(\lambda\) and setting the derivatives to zero, we obtain the same eigenvalue equation as for the first principal direction: \(X^TX\vec{u}_{k+1}=\lambda\vec{u}_{k+1}\). By the Spectral Theorem, the only unit vector orthogonal to all the previous eigenvectors \(\vec{u}_1,\dots,\vec{u}_k\) that achieves the maximum remaining variance is the \((k+1)\)th eigenvector of \(X^TX\), and \(\lambda=\lambda_{k+1}\) is the maximum variance along this direction. This completes the induction.
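The deflation argument can be verified numerically in the same spirit (again a self-contained sketch on synthetic data with illustrative names): deflating \(X\) by its first principal direction and maximizing the projected variance again recovers the second eigenvector of \(X^TX\).

import numpy as np

# Check of claim B for k = 2: the top direction of the deflated matrix X_hat_2
# is the second eigenvector of X^T X.
rng = np.random.default_rng(1)
Xc = rng.normal(size=(200, 5))
Xc -= Xc.mean(axis=0)

lam, U = np.linalg.eigh(Xc.T @ Xc)
U = U[:, ::-1]                                  # reorder: largest eigenvalue first
X_hat2 = Xc - Xc @ np.outer(U[:, 0], U[:, 0])   # X - X u_1 u_1^T
_, V = np.linalg.eigh(X_hat2.T @ X_hat2)
u2_deflated = V[:, -1]                          # top direction of the deflated matrix
print(np.allclose(np.abs(u2_deflated), np.abs(U[:, 1])))  # True: matches the 2nd eigenvector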


C. Show that the matrix \(\tilde{X}_k=X\sum_{j=1}^k\vec{u}_j\vec{u}_j^T\), the projection of \(X\) onto the span of the first \(k\) PC directions, is the best rank-\(k\) approximation of the original data \(X\) w.r.t. the Frobenius norm, i.e.,

\[\tilde{X}_k=\arg\min_{W:\text{rank}(W)\leq k}\|X-W\|_{F},\quad\text{with}\quad \|X-\tilde{X}_k\|_{F}=\sqrt{\sum_{j=k+1}^d\lambda_j}.\]

Proof:

This is the Eckart–Young theorem, an immediate consequence of the Singular Value Decomposition, which can itself be derived from the Spectral Theorem. Many proofs of this result can be found here: Low-rank approximation proofs.
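A quick numerical illustration (a self-contained sketch on synthetic data with illustrative names): the rank-\(k\) truncated SVD of \(X\) attains the error \(\sqrt{\sum_{j=k+1}^d\lambda_j}\), and another rank-\(k\) matrix, such as the projection of \(X\) onto a random \(k\)-dimensional subspace, does no better.

import numpy as np

rng = np.random.default_rng(2)
Xc = rng.normal(size=(100, 6))
Xc -= Xc.mean(axis=0)
k = 2

# Best rank-k approximation via truncated SVD.
A, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_k = A[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err_best = np.linalg.norm(Xc - X_k, 'fro')
lam = s ** 2                                          # eigenvalues of X^T X
print(np.isclose(err_best, np.sqrt(lam[k:].sum())))   # True: error equals sqrt of trailing eigenvalues

# A projection of X onto a random k-dimensional subspace has rank <= k but larger error.
Q, _ = np.linalg.qr(rng.normal(size=(6, k)))
W = Xc @ Q @ Q.T
print(np.linalg.norm(Xc - W, 'fro') >= err_best)      # True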


Further Readings