TP7 - Dimensional Reduction


Course: Advanced Machine Learning
Lecturer: Sothea HAS, PhD

Objective: Dimensional reduction is useful when dealing with high-dimensional datasets, and it also supports clustering and exploratory data analysis. We will explore its potential using our previous dataset of satellite images.


1. Satellite Image Segmentation

We will explore in this TP the satellite image dataset available in the Kaggle repository: Satellite Images.

A. Dimensional reduction with PCA

  • Load the assembled data in the environment.
  • Perform reduced/normalized PCA on this dataset. How many dimensions should we keep to retain \(90\%\) of the variation in the data?
  • What’s the percentage of variance explained by the first two dimensions?
  • Visualize the data in two-dimensional space using PCA.
  • Based on the resulting graph, would it be better to work with the first two PCs (PC1 and PC2)?
  • Perform a clustering algorithm on these two PCs and analyze its performance.
  • Implement some predictive models using PC1 and PC2 as inputs to predict the type of satellite images and report their performances.
# To do
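The PCA steps above can be sketched as follows. A synthetic matrix `X` stands in for the assembled satellite data (an assumption for runnability); swap in the real matrix once it is loaded in your environment.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the assembled image matrix (n samples x p features);
# replace X with the real satellite data loaded in your environment.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))

# Reduced/normalized PCA: standardize each feature, then fit all components.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# Smallest number of components retaining 90% of the variance.
cum_var = np.cumsum(pca.explained_variance_ratio_)
d90 = int(np.searchsorted(cum_var, 0.90) + 1)

# Share of variance explained by the first two components.
var_2d = float(pca.explained_variance_ratio_[:2].sum())

# 2-D projection: plot PC1 vs PC2, coloring points by image type.
X_pc = pca.transform(X_std)[:, :2]
print(f"keep {d90} dims for 90%; PC1 + PC2 explain {var_2d:.1%}")
```

`X_pc` can then be fed directly to a clustering algorithm or to predictive models as the two-column input.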

B. Dimensional reduction with \(t\)-SNE

  • Visualize the data in two-dimensional space using \(t\)-SNE.
  • Based on the resulting graph, would it be better to work with the embedded features by \(t\)-SNE?
  • Perform a clustering algorithm on these embedded features and analyze its performance.
  • Implement some predictive models using the embedded features as inputs to predict the type of satellite images and report their performances.
# To do
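A minimal sketch of the \(t\)-SNE embedding, again on a synthetic stand-in matrix `X` (an assumption; use the satellite data in practice):

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the image matrix; replace with the satellite data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))

# 2-D t-SNE embedding; perplexity (roughly the effective number of
# neighbors) is the main hyperparameter to tune, typically 5-50.
X_tsne = TSNE(n_components=2, perplexity=30.0, init="pca",
              random_state=0).fit_transform(X)
```

Note that, unlike PCA, a fitted t-SNE model cannot project unseen points (it has no separate `transform` step), so models trained on `X_tsne` cannot score new images without re-embedding everything.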

C. Dimensional reduction with Johnson-Lindenstrauss Lemma

  • Project the images onto \(d=2, 5, 10\) dimensional spaces (call them X_JL2, X_JL5, and X_JL10, respectively).
  • Perform a clustering algorithm on the projected data and analyze its performance for each case.
  • Implement some predictive models using the projected features as inputs to predict the type of satellite images and report their performances.
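One way to realize the Johnson-Lindenstrauss projections is with Gaussian random projection, sketched here on a synthetic stand-in matrix `X` (an assumption; use the image matrix in practice):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Synthetic stand-in for the image matrix; replace with the satellite data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))

# Johnson-Lindenstrauss style Gaussian random projections onto d = 2, 5, 10.
X_JL2, X_JL5, X_JL10 = [
    GaussianRandomProjection(n_components=d, random_state=0).fit_transform(X)
    for d in (2, 5, 10)
]
```

The JL lemma only guarantees near-preservation of pairwise distances when \(d\) grows like \(\log(n)/\varepsilon^2\); \(d=2\) will typically distort distances substantially, which should be visible in the clustering scores.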

D. Dimensional reduction with Autoencoder

  • Build an autoencoder to encode and reconstruct the satellite images using an architecture of your own design.
  • Visualize some of the original, embedded, and reconstructed images side by side.
  • Perform a clustering algorithm on the latent representations produced by the network and analyze its performance for each case.
  • Implement some selected models using the latent encoded images as inputs to predict the type of satellite images and report their performances.
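To make the encode → latent → decode principle concrete, here is a deliberately simplified linear autoencoder trained by gradient descent in plain NumPy, on synthetic stand-in data (both assumptions). A practical autoencoder for images would be a deeper, nonlinear network built in Keras or PyTorch.

```python
import numpy as np

# Synthetic stand-in for flattened, standardized images; replace with
# the real satellite data. This linear NumPy version only illustrates
# the encode -> latent -> decode principle.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n, d, k, lr = X.shape[0], X.shape[1], 2, 0.01
W_enc = rng.normal(scale=0.1, size=(d, k))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, d))   # decoder weights

losses = []
for epoch in range(300):
    Z = X @ W_enc                 # latent codes (the "embedded images")
    X_hat = Z @ W_dec             # reconstructions
    err = X_hat - X
    losses.append(float((err ** 2).mean()))
    # Gradients of the mean squared reconstruction error
    # (constant factors absorbed into the learning rate).
    g_dec = (Z.T @ err) / n
    g_enc = (X.T @ (err @ W_dec.T)) / n
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
```

After training, `Z` plays the role of the latent images for clustering and prediction, and `X_hat` gives the reconstructions to display next to the originals.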

2. Revisit Spam dataset

Task:

  • Perform clustering algorithms on the Spam dataset using the projected data obtained from the most suitable dimensional reduction method.
  • Build models to classify whether an email is spam or not, using the projected data from the most suitable dimensional reduction approach above.
import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
Id make address all num3d our over remove internet order ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0 1 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 spam
1 2 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 spam
2 3 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 spam
3 4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 spam
4 5 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 spam

5 rows × 59 columns
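The two tasks can be sketched end to end as follows. A synthetic two-class matrix stands in for the spam features (an assumption so the snippet runs standalone); in practice use `data.drop(columns=["Id", "type"])` and `data["type"]` from the snippet above, and PCA is chosen here purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Synthetic stand-in with two separated classes; replace with the
# spam features and labels loaded above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 30)),
               rng.normal(2.0, 1.0, (150, 30))])
y = np.repeat([0, 1], 150)

# Project onto a few principal components.
X_proj = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))

# Clustering on the projected data, scored against the true labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_proj)
ari = adjusted_rand_score(y, clusters)

# Spam/ham classifier on the projected features.
X_tr, X_te, y_tr, y_te = train_test_split(X_proj, y, random_state=0, stratify=y)
acc = accuracy_score(y_te, LogisticRegression().fit(X_tr, y_tr).predict(X_te))
```

The adjusted Rand index compares the cluster assignments with the true spam/ham labels, while held-out accuracy measures the classifier on the same projected features.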

References

\(^{\text{📚}}\) Hinton and Roweis (2002), Stochastic Neighbor Embedding.
\(^{\text{📚}}\) Laurens van der Maaten, \(t\)-SNE page.
\(^{\text{📚}}\) Satellite Images (Kaggle dataset).
\(^{\text{📚}}\) van der Maaten and Hinton (2008), Visualizing Data using t-SNE.
\(^{\text{📚}}\) Bank et al. (2021), Autoencoders.
\(^{\text{📚}}\) Umberto Michelucci (2022), An Introduction to Autoencoders.