Principal Component Analysis (PCA)


Exploratory Data Analysis & Unsupervised Learning

     

Lecturer: HAS Sothea, PhD

šŸ—ŗļø Content

  • Motivation

  • Principal Directions & Components

  • Dimensional Reduction & Reconstruction

  • Correlation Circles

  • Contribution Scores

Motivation

Motivation

California Housing Prices dataset

  • The data contains information from the 1990 California census.
  • Dimension: (20640, 10). It describes blocks across California.
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
  • How can we summarize and visualize the relationships among the individuals and the columns of the data?
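A minimal sketch for loading and inspecting the data (the URL is an assumption, not from the slides: the commonly distributed housing.csv, e.g., from Aurélien Géron's hands-on-ML repository):
Code
# Hypothetical loading step; the URL is an assumption, not part of the lecture.
import pandas as pd

url = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv"
data = pd.read_csv(url)
print(data.shape)   # (20640, 10)
data.head()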

Motivation

California Housing Prices dataset

  • Correlation matrix
Code
import matplotlib.pyplot as plt
import seaborn as sns

# `data` is the California Housing Prices DataFrame loaded earlier
f, ax = plt.subplots(figsize=(6, 3.25))
corr = data.select_dtypes(include="number").corr().round(1)
sns.heatmap(corr, annot=True,
            vmin=-1.0, vmax=1.0,
            square=True, ax=ax)
ax.set_yticklabels(ax.get_yticklabels(), ha='right')
ax.set_xticklabels(ax.get_xticklabels(), rotation=30, ha='right')
plt.show()

  • Recall: the correlation matrix can
    • Summarize the linear relation between columns.
    • Absolute values \(\approx 1\) indicate a strong linear relation.
    • Values \(\approx 0\) indicate a weak linear relation (possibly a non-linear one), BUT correlation may be distorted by
      • Outliers (see TP2 Solution; a toy illustration follows this list),
      • Lack of data points,
      • Confounding variables…
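For instance, a single extreme point can inflate an otherwise near-zero correlation; a minimal toy illustration (not from the lecture data):
Code
# Toy illustration: one outlier drastically changes the Pearson correlation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)                  # independent of x: correlation ~ 0
print(np.corrcoef(x, y)[0, 1])

x_out = np.append(x, 20.0)                # append a single extreme point
y_out = np.append(y, 20.0)
print(np.corrcoef(x_out, y_out)[0, 1])    # jumps close to 1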

Motivation

California Housing Prices dataset

  • Correlation matrix (same heatmap as on the previous slide).

  • High correlations among columns suggest that the information can be compressed/reduced.
  • Geometrically, the pattern of the data can be represented in a lower-dimensional space.
  • PCA detects the important directions of the data and reduces its dimension.
  • The important directions may carry meaningful interpretations for summarizing the data.
  • Let’s explore this…

Principal directions & components

Principal directions & components

Principal directions (PDs) & Components (PCs)

  • Given data matrix: \(X\in\mathbb{R}^{n\times d}\),
X Y Z
0 -0.959 -0.808 -1.409
1 -2.439 0.870 -1.008
2 0.661 0.043 1.787
3 -3.891 -1.206 -4.043
4 0.618 1.229 0.470
5 1.310 0.082 0.454
6 0.822 -2.479 -1.142
7 4.057 1.629 6.200
8 1.932 1.685 4.132

Principal directions & components

Principal directions (PDs) & Components (PCs)

Principal directions \(\vec{u}_k\)

  • The first principal direction \(\color{red}{\vec{u}_1}\in\mathbb{R}^d\) is the direction along which the data spreads or varies the most.
  • The second principal direction \(\color{green}{\vec{u}_2}\in\mathbb{R}^d\) is the direction that is perpendicular to \(\color{red}{\vec{u}_1}\) and captures the second largest variation of the data.
  • For \(3\leq k\leq d\), the \(k\)-th principal direction \(\color{blue}{\vec{u}_k}\in\mathbb{R}^d\) is the direction that is perpendicular to \(\{\color{red}{\vec{u}_1},\color{green}{\vec{u}_2},\dots,\color{purple}{\vec{u}_{k-1}}\}\) and captures the \(k\)-th largest variation of the data.
  • Let \(U\) be the \(d\times d\) matrix with all PDs as columns, i.e., \[U=[\color{red}{\vec{u}_1}|\color{green}{\vec{u}_2}|\dots |\color{blue}{\vec{u}_d}],\] then \(\tilde{X}=XU\) is the rotated data with uncorrelated columns, whose 1st, 2nd, 3rd… columns have the largest, 2nd largest, 3rd largest… variance, respectively. They’re called Principal Components (PCs).

Principal directions & components

How to find PDs & PCs

Solution to Principal Directions \(\vec{u}_k\)

  • Let \(X\in\mathbb{R}^{n\times d}\) be the (centered) data matrix.
  • The PDs \(\color{red}{\vec{u}_1},\dots,\color{blue}{\vec{u}_d}\) are the 1st, …, \(d\)-th eigenvectors of the matrix \(A=X^TX\), with eigenvalues \(\color{red}{\lambda_1}\geq\dots\geq\color{blue}{\lambda_d}\geq 0\) respectively.
  • Principal components are the columns of

\[\begin{align*} \underbrace{\tilde{X}}_{n\times d}&=\underbrace{X}_{n\times d}\underbrace{\color{blue}{U}}_{d\times d}\\ &=\begin{bmatrix} \text{x}_{11} & \dots & \text{x}_{1d}\\ \text{x}_{21} & \dots & \text{x}_{2d}\\ \text{x}_{31} & \dots & \text{x}_{3d}\\ \vdots & \ddots & \vdots\\ \text{x}_{n1} & \dots & \text{x}_{nd}\end{bmatrix} \begin{bmatrix} \text{u}_{11} & \dots & \text{u}_{1d}\\ \text{u}_{21} & \dots & \text{u}_{2d}\\ \vdots & \ddots & \vdots\\ \text{u}_{d1} & \dots & \text{u}_{dd}\end{bmatrix} \end{align*}\]

  • The \(j\)-th PC is a linear combination of the original columns \(X_i\) of \(X\), i.e., \(\text{PC}_j=\sum_{i=1}^du_{ij}X_i\) (see the sketch below).
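A minimal NumPy sketch of this recipe on toy data (we center \(X\) first, so that \(A=X^TX\) is proportional to the covariance matrix):
Code
# Sketch: PDs as eigenvectors of A = X^T X (X centered), PCs as X @ U.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(9, 3))
X = X - X.mean(axis=0)                    # center the columns

A = X.T @ X
eigval, eigvec = np.linalg.eigh(A)        # eigh returns ascending eigenvalues
order = np.argsort(eigval)[::-1]          # reorder: largest first
lam, U = eigval[order], eigvec[:, order]

X_tilde = X @ U                           # principal components

# The j-th PC is a linear combination of the original columns:
j = 0
pc_j = sum(U[i, j] * X[:, i] for i in range(X.shape[1]))
print(np.allclose(pc_j, X_tilde[:, j]))   # True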

Principal directions & components

Implementation of PCA

Unnormalized PCA

  • Given data: \(X\in\mathbb{R}^{n\times d}\)
  • Centering \(X_c=\underbrace{X}_{n\times d} -\underbrace{\mathbb{1}_n}_{n\times n}\underbrace{X}_{n\times d}/n.\)
  • Perform Eigenvalue Decomposition on covariance matrix \(V=X_c^TX_c/(n-1)\): \[V=\color{blue}{U}\Sigma \color{blue}{U}^T,\]
    • \(\color{blue}{U}\): matrix of eigenvectors
    • \(\Sigma=\text{diag}(\lambda_1,\dots,\lambda_d)\): diagonal matrix of eigenvalues \(\lambda_1\geq \dots\geq \lambda_d\geq 0\).

Normalized PCA

  • Given data: \(X\in\mathbb{R}^{n\times d}\)
  • Standardizing: compute std matrix \(D=\text{diag}(\widehat{\sigma}_1,\dots,\widehat{\sigma}_d)\) and standardized data \(Z=(\underbrace{X}_{n\times d} -\underbrace{\mathbb{1}_n}_{n\times n}\underbrace{X}_{n\times d}/n)D^{-1}.\)
  • Perform Eigenvalue Decomposition on correlation matrix \(C=Z^TZ/(n-1)\): \[C=\color{blue}{U}\Sigma \color{blue}{U}^T.\]
  • Rotated data: \(\color{blue}{\tilde{X}}=X_c\color{blue}{U}\) (or \(Z\color{blue}{U}\) for normalized PCA). [👉 Demo]
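A NumPy sketch of both variants under these definitions (toy data; the signs of the eigenvectors are arbitrary):
Code
# Sketch: unnormalized vs. normalized PCA via eigendecomposition.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
n = X.shape[0]

# Unnormalized PCA: eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)                    # centering
V = Xc.T @ Xc / (n - 1)                    # covariance matrix
lam, U = np.linalg.eigh(V)
lam, U = lam[::-1], U[:, ::-1]             # descending eigenvalues
X_rot = Xc @ U                             # rotated data (PCs)

# Normalized PCA: the same recipe on the correlation matrix.
Z = Xc / X.std(axis=0, ddof=1)             # standardizing
C = Z.T @ Z / (n - 1)                      # correlation matrix
lam_z, U_z = np.linalg.eigh(C)
lam_z, U_z = lam_z[::-1], U_z[:, ::-1]
Z_rot = Z @ U_z
print(lam, lam_z)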

Dimensional Reduction/Reconstruction

Reduction/Reconstruction

Dimensional Reduction

  • Reduction to \(k\) dimensions (\(k<d\)) is done by \(\color{blue}{\tilde{X}_k=\tilde{X}[:,:k]}.\)
  • Quality of representation on each direction: \(\left[\frac{\lambda_1}{\sum_{j=1}^d\lambda_j},\dots,\frac{\lambda_k}{\sum_{j=1}^d\lambda_j}\right].\)
  • Variation explained by \(k\) dimensions: \(\frac{\sum_{j=1}^k\lambda_j}{\sum_{j=1}^d\lambda_j}\times 100.\)
  • For visualization, one can choose \(k=2\) or \(3\). [👉 Demo]
* Total variation of X: 15.25
X Y Z
0 -0.959 -0.808 -1.409
1 -2.439 0.870 -1.008
2 0.661 0.043 1.787
3 -3.891 -1.206 -4.043
4 0.618 1.229 0.470
* Total variation of projection: 15.2
PC1 PC2 PC3
0 -2.424 -0.173 -0.197
1 -2.287 2.086 -0.406
2 1.239 -0.292 -0.558
3 -6.119 1.163 -0.709
4 0.509 0.555 0.818
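The check above can be reproduced with a short sketch (reusing Xc, U, and lam from the NumPy sketch of the previous section):
Code
# Sketch: reduce to k dimensions and measure the variation kept.
k = 2
X_tilde = Xc @ U                           # all PCs (rotated data)
X_k = X_tilde[:, :k]                       # keep the first k PCs

total_var = lam.sum()                      # total variation of X
quality = lam[:k] / total_var              # quality of representation per direction
explained = lam[:k].sum() / total_var * 100
print(total_var, quality, explained)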

Reduction/Reconstruction

Dimensional Reduction

  • Application on the California Housing Prices dataset:
  • Percentage of variance explained by the first two PCs: 79.86%.
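With scikit-learn, this percentage comes from explained_variance_ratio_; a sketch assuming `data` is the housing DataFrame loaded earlier and, as in the demo, longitude/latitude are left out:
Code
# Sketch: normalized PCA on the numeric housing columns with scikit-learn.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

num = data.select_dtypes(include="number").drop(columns=["longitude", "latitude"]).dropna()
Z = StandardScaler().fit_transform(num)    # normalized PCA = PCA on standardized data
pca = PCA().fit(Z)
print(pca.explained_variance_ratio_[:2].sum() * 100)   # % variance of the first two PCs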

Reduction/Reconstruction


Reconstruction

  • The reconstruction is defined by: \[\color{red}{X_k^{\text{rec}}}=\tilde{X}_k\color{blue}{U[:,:k]^T}.\]

  • The Eckart–Young theorem (best low-rank approximation) shows that this reconstruction \(\color{red}{X_k^{\text{rec}}}\) is the best rank-\(k\) approximation of \(X\) with respect to the Frobenius norm, i.e., \[\color{red}{X_k^{\text{rec}}}=\arg\min_{M_k:\,r(M_k)\leq k}\|X-M_k\|_F,\] where \(\|\cdot\|_F\) is the Frobenius norm.
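A sketch of reduction followed by reconstruction, in the same NumPy notation as above:
Code
# Sketch: rank-k reconstruction from the first k PCs.
k = 2
X_k = (Xc @ U)[:, :k]                      # reduced data
X_rec = X_k @ U[:, :k].T                   # back to the original coordinates

# Eckart-Young: no rank-k matrix is closer to Xc in Frobenius norm than X_rec.
err = np.linalg.norm(Xc - X_rec, ord="fro")
print(err)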

Reduction/Reconstruction

Summary

  • PCA captures and arranges the directions with the largest, 2nd largest, … variation into the 1st, 2nd, … columns using the Spectral Theorem (eigenvalue decomposition).
  • These directions are the columns of the eigenvector matrix of the covariance (unnormalized) or correlation (normalized) matrix of the data.
  • The variation along each direction is given by the corresponding eigenvalue: \(\lambda_1\geq\dots\geq\lambda_d\).
  • Our focus here is the dimensionality-reduction and reconstruction power of PCA; it is also widely used in data analysis, EDA, data summarization…
  • Remark: many highly correlated columns indicate effective dimensionality reduction (the first two PCs can capture significant variation).

Correlation Circle

Correlation Circles

  • Besides dimensionality reduction, PCA can be used to analyze the data.
  • How would you explain this scatterplot?
  • One should understand the connection between the PCs and the original columns.
  • This is done using the Correlation Circle.
  • One can compute the correlations between the original columns and the PCs:
Code
import pandas as pd

# Z: PC scores (rotated data); X, quan_cols: numeric data array and column names
# from the demo (the first two columns are longitude/latitude)
d = len(quan_cols)
combined_data = pd.concat([pd.DataFrame(Z[:, :2], columns=['PC' + str(i + 1) for i in range(2)]),
                           pd.DataFrame(X[:, 2:], columns=quan_cols[2:])], axis=1)
Corr_matrix = combined_data.corr()[['PC' + str(i + 1) for i in range(2)]].iloc[2:, :]
Corr_matrix
PC1 PC2
housing_median_age -0.426668 0.048696
total_rooms 0.963481 0.087248
total_bedrooms 0.971857 -0.077096
population 0.930272 -0.113646
households 0.975068 -0.059403
median_income 0.107198 0.912892
median_house_value 0.084727 0.913812
Code
import numpy as np
import plotly.graph_objects as go

# `pca` is the sklearn PCA fitted earlier (used for the axis labels below)
fig_cir = go.Figure()
circle = go.Scatter(
    x=np.cos(np.linspace(0, 2 * np.pi, 100)),
    y=np.sin(np.linspace(0, 2 * np.pi, 100)),
    mode='lines',
    name='Unit Circle',
    line=dict(color='black', dash='dash')
)
fig_cir.add_trace(circle)
fig_cir.add_trace(go.Scatter(
    x=[-1,1],
    y = [0,0],
    mode="lines", line=dict(color='gray', dash='dash')
))
fig_cir.add_trace(go.Scatter(
    x=[0,0],
    y = [-1,1],
    mode="lines", line=dict(color='gray', dash='dash')
))

loc = ['top right', 'top left', 'bottom right', 'bottom left', 'top right', 'top left', 'bottom right']

# Add loading vectors
for i, feature in enumerate(quan_cols[2:]):
    fig_cir.add_trace(go.Scatter(
        x=[0, Corr_matrix.iloc[i, 0]],
        y=[0, Corr_matrix.iloc[i, 1]],
        mode='lines+text',
        text=[None, feature],
        textposition=[loc[i]],
        name=feature,
        showlegend=True
    ))

# Set the layout
fig_cir.update_layout(
    title='Correlation Circle',
    xaxis=dict(title='PC1 ('+str(np.round(pca.explained_variance_ratio_[0]*100,2)) + '%)', range=[-1.2, 1.2]),
    yaxis=dict(title='PC2 ('+str(np.round(pca.explained_variance_ratio_[1]*100,2)) + '%)', range=[-1.2, 1.2]),
    showlegend=False,
    width=470,
    height=470
)
fig_cir.show()
Code
# `fig_bi_house` is the PC1-PC2 scatterplot of individuals from the earlier demo
fig_cor = fig_bi_house
fig_cor.add_trace(go.Scatter(
    x=[-1,1],
    y = [0,0],
    mode="lines", line=dict(color='gray', dash='dash')
))
fig_cor.add_trace(go.Scatter(
    x=[0,0],
    y = [-1,1],
    mode="lines", line=dict(color='gray', dash='dash')
))
for i, var in enumerate(quan_cols[2:]):
    fig_cor.add_shape(
        type='line',
        x0=0, y0=0,
        x1=Corr_matrix.iloc[i, 0]*2, y1=Corr_matrix.iloc[i, 1]*2
    )
    fig_cor.add_annotation(
        x=Corr_matrix.iloc[i, 0]+np.random.normal(0,0.25), y=Corr_matrix.iloc[i, 1]+np.random.normal(0,0.25),
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=var
    )
fig_cor.update_layout(width=470, height=470)
fig_cor.show()

Correlation Circles

Interpretation

  • PC1 is strongly related to total_rooms, households, population, and (negatively) housing_median_age.
  • PC2 is strongly related to median_income and median_house_value.
  • The length of each variable’s arrow represents how well the variable is represented in the plane of PC1 and PC2 (the 1st factorial plane).
  • Highly correlated original variables are mostly aligned.
  • Those that are nearly perpendicular tend to be uncorrelated.
  • For this example:
    • PC1 seems to represent the size and condition of houses within each block.
    • PC2 seems to represent the economic condition of houses within each neighborhood.
  • This can be used to summarize the scatterplot of individuals.

Contribution Scores

Contribution Scores

Variable contributions/loadings

  • We define the contribution of the original variable \(j\) on PC \(k\) by \[\text{Loading}_j(\text{PC}_k)=\frac{[U\Sigma^{1/2}]_{jk}^2}{\sum_{p=1}^d[U\Sigma^{1/2}]_{pk}^2}=\frac{[U\Sigma^{1/2}]_{jk}^2}{\lambda_k}.\]
Code
# `pca` is the sklearn PCA fitted on the standardized numeric columns
loadings = (pca.components_.T) ** 2 * pca.explained_variance_    # squared loadings [UΣ^{1/2}]²
contr = pd.DataFrame(loadings / pca.explained_variance_,         # divide the k-th column by λ_k
                     index=quan_cols[2:],
                     columns=[f'PC{i+1}' for i in range(pca.components_.shape[0])])
contr.iloc[:, :2].T.abs().round(4) * 100
housing_median_age total_rooms total_bedrooms population households median_income median_house_value
PC1 4.68 23.87 24.28 22.25 24.44 0.3 0.18
PC2 0.14 0.45 0.35 0.76 0.21 49.0 49.10
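As a sanity check, the contributions of all variables on a given PC sum to 100%:
Code
# Each column of `contr` sums to 1 (i.e., 100% after the scaling above).
print(contr.sum(axis=0))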

Contribution Scores

Individual contributions/\(\cos^2\)

  • We define the contribution of each individual \(i\) on PC \(k\) by \[\text{Contr}_i(\text{PC}_k)=\cos^2(i,k)=\frac{[XU]_{ik}^2}{\sum_{\ell=1}^n[XU]_{\ell k}^2}=\frac{[XU]_{ik}^2}{(n-1)\lambda_k}\]
Code
import plotly.express as px

# Assuming `Z` holds the PC scores (rotated data) computed in the demo
pc_score = pd.DataFrame(Z, columns=contr.columns)
scores = pc_score ** 2 / ((data.shape[0] - 1) * pca.explained_variance_) * 100
df_contr = pd.DataFrame(
    {
        'Contribution on PC1': scores['PC1'],
        'Contribution on PC2': scores['PC2'],
        'Individual': pc_score.index
    }
)
fig_cont = px.scatter(df_contr, x='Contribution on PC1', y='Contribution on PC2',
                      hover_name=pc_score.index,
                      size=df_contr['Contribution on PC1'] + df_contr['Contribution on PC2'])
fig_cont.update_layout(height=220, width=600, title="Contribution of individuals on PCs")
fig_cont.show()

🄳 Yeahhh! Party Time…. 🄂

Any questions?