TP9 - Dimensionality Reduction


Course: Advanced Machine Learning
Lecturer: Sothea HAS, PhD

Objective: Dimensionality reduction is useful when dealing with high-dimensional datasets. It also supports clustering and exploratory data analysis. We will explore its potential for data compression and reconstruction, as well as its use as a preprocessing step in predictive models.


1. Fashion MNIST Dataset

We revisit the Fashion-MNIST dataset from TP8.


A. Dimensionality reduction with PCA

Import the dataset into the Python environment and display the first 12 training items (3 rows and 4 columns) with titles corresponding to their actual item names (you can find the true label of each item here: https://www.kaggle.com/datasets/zalando-research/fashionmnist).

  • Perform reduced/normalized PCA on the training inputs of this dataset.
  • How many dimensions would you keep to retain \(90\%\) of the variation in the data?
  • What’s the percentage of variance explained by the first two dimensions?
  • Visualize the data in 2-dimensional space using PCA.
  • Perform a clustering algorithm using all the PCs that accumulate \(80\%\) of the variance. Analyze the clustering performance.
  • Test whether a DNN trained on the original features outperforms one trained only on the PCs accumulating \(80\%\) of the total variance.
# To do
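A possible starting point for the PCA tasks above, sketched with scikit-learn. For testability, a random array stands in for the flattened Fashion-MNIST training pixels; swap in the real \(60000 \times 784\) matrix (e.g. from `keras.datasets.fashion_mnist`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Stand-in for the flattened Fashion-MNIST training images (60000 x 784);
# replace with the real pixel matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 784))

# "Reduced/normalized" PCA: standardize each pixel column before fitting.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# Number of components needed to retain 90% of the variance.
cum_var = np.cumsum(pca.explained_variance_ratio_)
d90 = int(np.searchsorted(cum_var, 0.90) + 1)
print("components for 90% variance:", d90)

# Percentage of variance explained by the first two PCs.
print("first two PCs explain:", 100 * cum_var[1], "%")

# 2-D scores for visualization (scatter-plot PC1 vs PC2, colored by label).
scores = pca.transform(X_std)[:, :2]
print("2-D scores shape:", scores.shape)
```

On the real images, the same `cum_var` array also gives the PCs accumulating \(80\%\) of the variance for the clustering and DNN comparisons.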

B. Dimensionality reduction with \(t\)-SNE

  • Visualize the data in 2-dimensional space using \(t\)-SNE.
  • Perform a clustering algorithm on these embedded features. Analyze the clustering performance.
  • Implement some predictive models using the embedded features as inputs to predict the item type, and report their performance on the test data.
# To do
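The embedding-plus-clustering steps might be sketched as follows, again with random stand-in data in place of the flattened images and labels (the perplexity value is a tunable assumption, not prescribed by the exercise):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Stand-in for flattened images and their item labels; use the real data.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 10, size=300)

# 2-D t-SNE embedding for visualization.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Cluster the embedded points into 10 groups (one per clothing class).
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb)

# Compare the clusters against the true item types.
print("ARI:", adjusted_rand_score(y, labels))
```

Note that \(t\)-SNE has no `transform` for unseen points, so for the predictive-model task the test images must be embedded jointly with (or mapped into) the training embedding.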

C. Dimensionality reduction with the Johnson-Lindenstrauss lemma

  • Project the images onto \(d=2, 5, 10\) dimensional spaces (call them X_JL2, X_JL5 and X_JL10, respectively).
  • Perform a clustering algorithm on these projected data. Analyze the clustering performance for each case.
  • Implement some predictive models using the projected features as inputs to predict the item type, and report their performance on the test data.
# To do
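The projections can be obtained with Gaussian random matrices, in the spirit of the Johnson-Lindenstrauss lemma; a minimal sketch with scikit-learn, using a random stand-in for the pixel matrix:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Stand-in for the flattened images; replace with the real pixel matrix.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 784))

# Gaussian random projections onto d = 2, 5, 10 dimensions.
projections = {}
for d in (2, 5, 10):
    proj = GaussianRandomProjection(n_components=d, random_state=0)
    projections[f"X_JL{d}"] = proj.fit_transform(X)

for name, Z in projections.items():
    print(name, Z.shape)
```

The fitted projection matrix is fixed, so the same `proj.transform` can be applied to the test images before the predictive-model step.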

D. Dimensionality reduction with an autoencoder

  • Build an autoencoder to encode and reconstruct the item images using an architecture of your own design.
  • Visualize some of the original, embedded, and reconstructed images side by side.
  • Perform a clustering algorithm on the latent representations produced by the network. Analyze the clustering performance.
  • Implement some selected models using the latent encodings as inputs to predict the item type, and report their performance.
# To do
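In practice you would build the autoencoder in Keras or PyTorch; as a dependency-free illustration of the encode/decode idea, here is a tiny linear autoencoder trained by plain gradient descent on a random stand-in for the image matrix (all sizes and the learning rate are illustrative assumptions):

```python
import numpy as np

# Stand-in for flattened item images; replace with real pixels.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 64))
d_in, d_lat, lr = X.shape[1], 8, 1e-2

# Linear encoder and decoder weights (a real model adds nonlinearities).
W_enc = rng.normal(scale=0.1, size=(d_in, d_lat))
W_dec = rng.normal(scale=0.1, size=(d_lat, d_in))

mse0 = float(np.mean((X - (X @ W_enc) @ W_dec) ** 2))  # initial error
for _ in range(300):
    Z = X @ W_enc          # latent codes (the "embedded images")
    X_hat = Z @ W_dec      # reconstructions
    err = X_hat - X
    # Gradients of the mean squared reconstruction error.
    g_dec = Z.T @ err / len(X)
    g_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

mse = float(np.mean((X - (X @ W_enc) @ W_dec) ** 2))
print("latent shape:", (X @ W_enc).shape)
print("reconstruction MSE:", mse0, "->", mse)
```

The latent codes `Z` play the role of the encoded images for the clustering and prediction tasks, and `X_hat` gives the reconstructions to display next to the originals.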
  • Conclude your findings.

References

\(^{\text{πŸ“š}}\) Hinton and Roweis (2002), Stochastic Neighbor Embedding.
\(^{\text{πŸ“š}}\) Laurens van der Maaten, \(t\)-SNE page.
\(^{\text{πŸ“š}}\) Satellite Images.
\(^{\text{πŸ“š}}\) van der Maaten and Hinton (2008), Visualizing Data using t-SNE.
\(^{\text{πŸ“š}}\) Bank et al. (2021), Autoencoders.
\(^{\text{πŸ“š}}\) Umberto Michelucci (2022), An Introduction to Autoencoders.