TP7 - Dimensional Reduction


Course: Advanced Machine Learning
Lecturer: Sothea HAS, PhD

Objective: Dimensional reduction is useful when dealing with high-dimensional datasets, and it also supports clustering and exploratory data analysis. We will explore its potential using our previous dataset of satellite images.


1. Satellite Image Segmentation

We will explore in this TP the satellite image dataset available in the Kaggle repository: Satellite Images.

A. Dimensional reduction with PCA

  • Load the assembled data in the environment.
  • Perform reduced/normalized PCA on this dataset. How many dimensions should we keep to retain \(90\%\) of the variation in the data?
  • What’s the percentage of variance explained by the first two dimensions?
  • Visualize the data in two-dimensional space using PCA.
  • Based on the resulting graph, would it be better to work with the first two PCs (PC1 and PC2)?
  • Perform a clustering algorithm on these two PCs and analyze its performance.
  • Implement some predictive models using PC1 and PC2 as inputs to predict the type of satellite images and report their performances.
# To do
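The PCA steps above can be sketched as follows. A synthetic matrix `X` stands in for the assembled satellite data (an assumption for runnability); swap in the real matrix once it is loaded in your environment.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the assembled image matrix (n samples x p features);
# replace X with the real satellite data loaded in your environment.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))

# Reduced/normalized PCA: standardize each feature, then fit all components.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# Smallest number of components retaining 90% of the variance.
cum_var = np.cumsum(pca.explained_variance_ratio_)
d90 = int(np.searchsorted(cum_var, 0.90) + 1)

# Share of variance explained by the first two components.
var_2d = float(pca.explained_variance_ratio_[:2].sum())

# 2-D projection: plot PC1 vs PC2, coloring points by image type.
X_pc = pca.transform(X_std)[:, :2]
print(f"keep {d90} dims for 90%; PC1 + PC2 explain {var_2d:.1%}")
```

`X_pc` can then be fed directly to a clustering algorithm or to predictive models as the two-column input.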

B. Dimensional reduction with \(t\)-SNE

  • Visualize the data in two-dimensional space using \(t\)-SNE.
  • Based on the resulting graph, would it be better to work with the embedded features by \(t\)-SNE?
  • Perform a clustering algorithm on these embedded features and analyze its performance.
  • Implement some predictive models using the embedded features as inputs to predict the type of satellite images and report their performances.
# To do
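A minimal sketch of the \(t\)-SNE embedding, again on a synthetic stand-in matrix `X` (an assumption; use the satellite data in practice):

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the image matrix; replace with the satellite data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))

# 2-D t-SNE embedding; perplexity (roughly the effective number of
# neighbors) is the main hyperparameter to tune, typically 5-50.
X_tsne = TSNE(n_components=2, perplexity=30.0, init="pca",
              random_state=0).fit_transform(X)
```

Note that, unlike PCA, a fitted t-SNE model cannot project unseen points (it has no separate `transform` step), so models trained on `X_tsne` cannot score new images without re-embedding everything.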

C. Dimensional reduction with Johnson-Lindenstrauss Lemma

  • Project the images onto \(d=2, 5, 10\) dimensional spaces (call them X_JL2, X_JL5, and X_JL10, respectively).
  • Perform a clustering algorithm on the projected data and analyze its performance for each case.
  • Implement some predictive models using the projected features as inputs to predict the type of satellite images and report their performances.
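One way to realize the Johnson-Lindenstrauss projections is with Gaussian random projection, sketched here on a synthetic stand-in matrix `X` (an assumption; use the image matrix in practice):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Synthetic stand-in for the image matrix; replace with the satellite data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))

# Johnson-Lindenstrauss style Gaussian random projections onto d = 2, 5, 10.
X_JL2, X_JL5, X_JL10 = [
    GaussianRandomProjection(n_components=d, random_state=0).fit_transform(X)
    for d in (2, 5, 10)
]
```

The JL lemma only guarantees near-preservation of pairwise distances when \(d\) grows like \(\log(n)/\varepsilon^2\); \(d=2\) will typically distort distances substantially, which should be visible in the clustering scores.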

D. Dimensional reduction with Autoencoder

  • Build an autoencoder to encode and reconstruct the satellite images using an architecture of your own design.
  • Visualize some of the original, embedded, and reconstructed images side by side.
  • Perform a clustering algorithm on the latent representations produced by the network and analyze its performance for each case.
  • Implement some selected models using the latent encoded images as inputs to predict the type of satellite images and report their performances.
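To make the encode → latent → decode principle concrete, here is a deliberately simplified linear autoencoder trained by gradient descent in plain NumPy, on synthetic stand-in data (both assumptions). A practical autoencoder for images would be a deeper, nonlinear network built in Keras or PyTorch.

```python
import numpy as np

# Synthetic stand-in for flattened, standardized images; replace with
# the real satellite data. This linear NumPy version only illustrates
# the encode -> latent -> decode principle.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n, d, k, lr = X.shape[0], X.shape[1], 2, 0.01
W_enc = rng.normal(scale=0.1, size=(d, k))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, d))   # decoder weights

losses = []
for epoch in range(300):
    Z = X @ W_enc                 # latent codes (the "embedded images")
    X_hat = Z @ W_dec             # reconstructions
    err = X_hat - X
    losses.append(float((err ** 2).mean()))
    # Gradients of the mean squared reconstruction error
    # (constant factors absorbed into the learning rate).
    g_dec = (Z.T @ err) / n
    g_enc = (X.T @ (err @ W_dec.T)) / n
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
```

After training, `Z` plays the role of the latent images for clustering and prediction, and `X_hat` gives the reconstructions to display next to the originals.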

2. Revisit Spam dataset

Task:

  • Perform clustering algorithms on the Spam dataset using the projected data obtained from the most suitable dimensional reduction method.
  • Build models to classify whether an email is spam or not, using the projected data from the most suitable dimensional reduction approach above.
import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
Id make address all num3d our over remove internet order ... charSemicolon charRoundbracket charSquarebracket charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
0 1 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 ... 0.00 0.000 0.0 0.778 0.000 0.000 3.756 61 278 spam
1 2 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 spam
2 3 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 spam
3 4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 spam
4 5 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 spam

5 rows × 59 columns
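The two tasks can be sketched end to end as follows. A synthetic two-class matrix stands in for the spam features (an assumption so the snippet runs standalone); in practice use `data.drop(columns=["Id", "type"])` and `data["type"]` from the snippet above, and PCA is chosen here purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Synthetic stand-in with two separated classes; replace with the
# spam features and labels loaded above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 30)),
               rng.normal(2.0, 1.0, (150, 30))])
y = np.repeat([0, 1], 150)

# Project onto a few principal components.
X_proj = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))

# Clustering on the projected data, scored against the true labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_proj)
ari = adjusted_rand_score(y, clusters)

# Spam/ham classifier on the projected features.
X_tr, X_te, y_tr, y_te = train_test_split(X_proj, y, random_state=0, stratify=y)
acc = accuracy_score(y_te, LogisticRegression().fit(X_tr, y_tr).predict(X_te))
```

The adjusted Rand index compares the cluster assignments with the true spam/ham labels, while held-out accuracy measures the classifier on the same projected features.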

References

\(^{\text{📚}}\) Hinton and Roweis (2002), Stochastic Neighbor Embedding.
\(^{\text{📚}}\) Laurens van der Maaten, \(t\)-SNE page.
\(^{\text{📚}}\) Satellite Images (Kaggle dataset).
\(^{\text{📚}}\) van der Maaten and Hinton (2008), Visualizing Data using t-SNE.
\(^{\text{📚}}\) Bank et al. (2021), Autoencoders.
\(^{\text{📚}}\) Umberto Michelucci (2022), An Introduction to Autoencoders.