# To do
TP7 - Dimensional Reduction
Course: Advanced Machine Learning
Lecturer: Sothea HAS, PhD
Objective: Dimensional reduction is useful when dealing with high-dimensional datasets. It can also be used in clustering and data analysis. We will explore its potential using our previous dataset of satellite images.
- The notebook of this TP can be downloaded here: TP7_Dimensional_Reduction.ipynb.
1. Satellite Image Segmentation
In this TP, we will explore the satellite image dataset available on Kaggle: Satellite Images.
A. Dimensional reduction with PCA
- Load the assembled data in the environment.
- Perform reduced/normalized PCA on this dataset. How many dimensions should we keep to retain \(90\%\) of the variation in the data?
- What's the percentage of variance explained by the first two dimensions?
- Visualize the data in two-dimensional space using PCA.
- Based on the resulting graph, would it be better to work with the first two PCs (PC1 and PC2)?
- Apply a clustering algorithm to these two PCs. Analyze the performance of the clustering algorithm.
- Implement some predictive models using PC1 and PC2 as inputs to predict the type of satellite images and report their performances.
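The variance questions in part A can be sketched as follows. This is a minimal illustration, assuming scikit-learn is available; the matrix `X` is a synthetic stand-in for the flattened image data (one row per image), and the names `n_keep`, `var_first_two`, and `X_pc2` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the flattened image matrix
# (n_samples x n_pixels). Replace X and y with the assembled data.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Reduced/normalized PCA: standardize every column, then fit all components.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# Cumulative share of variance explained by the first k components.
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest k such that the first k components explain at least 90%.
n_keep = int(np.searchsorted(cum_var, 0.90) + 1)

# Share of variance explained by PC1 and PC2 together.
var_first_two = cum_var[1]

# Two-dimensional coordinates for plotting, clustering and prediction.
X_pc2 = pca.transform(X_std)[:, :2]
print(n_keep, round(var_first_two, 3), X_pc2.shape)
```

A scatter plot of `X_pc2` colored by `y` gives the requested 2D view. If `var_first_two` is small, two PCs are unlikely to be enough for clustering or prediction.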
B. Dimensional reduction with \(t\)-SNE
- Visualize the data in two-dimensional space using \(t\)-SNE.
- Based on the resulting graph, would it be better to work with the embedded features by \(t\)-SNE?
- Apply a clustering algorithm to these embedded features. Analyze the performance of the clustering algorithm.
- Implement some predictive models using the embedded features as inputs to predict the type of satellite images and report their performances.
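One possible way to embed the data with \(t\)-SNE and then score a clustering of the embedding is sketched below. It assumes scikit-learn. The data are synthetic blobs standing in for the images, and the choice of KMeans with 4 clusters and `perplexity=30` is illustrative, not prescribed by the TP.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score, silhouette_score

# ASSUMPTION: synthetic stand-in for the image matrix and its true labels.
X, y = make_blobs(n_samples=200, n_features=30, centers=4, random_state=0)

# Embed into 2 dimensions; perplexity must be smaller than n_samples.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X)

# Cluster the embedded points (the number of clusters is an assumption).
km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X_tsne)

# External score (needs true labels) and internal score (does not).
ari = adjusted_rand_score(y, labels)
sil = silhouette_score(X_tsne, labels)
print(f"ARI = {ari:.3f}, silhouette = {sil:.3f}")
```

Note that scikit-learn's `TSNE` has no `transform` method for unseen points, so any predictive model built on `X_tsne` has to use an embedding computed on all samples before the train/test split. That makes the reported test performance optimistic.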
C. Dimensional reduction with Johnson-Lindenstrauss Lemma
- Project images onto \(d=2, 5, 10\) dimensional spaces (call them X_JL2, X_JL5 and X_JL10 respectively).
- Apply a clustering algorithm to these projected data. Analyze the performance of the clustering algorithm in each case.
- Implement some predictive models using the projected features as inputs to predict the type of satellite images and report their performances.
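The projections in part C can be sketched with a Gaussian random matrix, which is the standard construction behind the Johnson-Lindenstrauss lemma. This is a NumPy-only illustration: `X` is random stand-in data, and the helpers `jl_project` and `pairwise_sq_dists` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: random stand-in for the flattened images (100 images, 1000 pixels).
X = rng.normal(size=(100, 1000))

def jl_project(X, d, rng):
    """Project rows of X onto d dimensions with a Gaussian matrix scaled by 1/sqrt(d)."""
    R = rng.normal(size=(X.shape[1], d)) / np.sqrt(d)
    return X @ R

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all distinct pairs of rows of A."""
    sq = (A ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return D[np.triu_indices(len(A), k=1)]

X_JL2 = jl_project(X, 2, rng)
X_JL5 = jl_project(X, 5, rng)
X_JL10 = jl_project(X, 10, rng)

# Ratio of projected to original squared distances: the scaling makes it
# equal to 1 in expectation, and its spread shrinks as d grows.
orig = pairwise_sq_dists(X)
spread = {}
for name, Z in [("X_JL2", X_JL2), ("X_JL5", X_JL5), ("X_JL10", X_JL10)]:
    ratio = pairwise_sq_dists(Z) / orig
    spread[name] = (ratio.mean(), ratio.std())
    print(name, Z.shape, spread[name])
```

scikit-learn offers the same idea through `sklearn.random_projection.GaussianRandomProjection`. With \(d\) as small as 2, 5, or 10, distances are only roughly preserved, so clustering quality is expected to vary noticeably between the three cases.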
D. Dimensional reduction with Autoencoder
- Build an autoencoder to encode and reconstruct satellite images using an architecture of your own design.
- Visualize some of the original, embedded, and reconstructed images side by side.
- Apply a clustering algorithm to the latent images of the network. Analyze the performance of the clustering algorithm.
- Implement a few selected models using the latent encoded images as inputs to predict the type of satellite images and report their performances.
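As a starting point for part D, the sketch below trains a very small autoencoder. It is not the architecture the TP asks you to design: it swaps in scikit-learn's `MLPRegressor` with one hidden layer, trained to reproduce its own input, instead of a deep-learning framework. The 8x8 digit images bundled with scikit-learn stand in for the satellite images, and `encode` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor

# ASSUMPTION: 8x8 digit images as a stand-in, flattened and scaled to [0, 1].
X = load_digits().data / 16.0            # shape (1797, 64)

# One hidden layer of 16 units is the bottleneck (the latent code).
# Fitting X -> X makes the regressor act as an autoencoder.
ae = MLPRegressor(hidden_layer_sizes=(16,), activation="tanh",
                  max_iter=300, random_state=0)
ae.fit(X, X)

def encode(model, X):
    """Latent codes: hidden-layer activations of the fitted network."""
    return np.tanh(X @ model.coefs_[0] + model.intercepts_[0])

Z = encode(ae, X)          # embedded images, shape (n_samples, 16)
X_rec = ae.predict(X)      # reconstructed images, same shape as X

# Reconstruction error versus the trivial baseline of predicting the mean image.
mse = np.mean((X - X_rec) ** 2)
baseline = np.mean((X - X.mean(axis=0)) ** 2)
print(f"MSE = {mse:.4f}, baseline = {baseline:.4f}")
```

For the side-by-side view, `X[i].reshape(8, 8)` and `X_rec[i].reshape(8, 8)` can be passed to an image plotting function, and `Z` can be fed to the clustering and predictive models. A convolutional architecture would be a more natural choice for real satellite images.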
2. Revisit Spam dataset
Task:
- Perform clustering algorithms on the Spam dataset using projected data obtained from the most suitable dimensional reduction method.
- Build models to classify whether an email is spam or not, using projected data from the most suitable dimensional reduction approach above.
import pandas as pd

# Load the Spam dataset (space-separated text file) and preview the first 5 rows
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
| | Id | make | address | all | num3d | our | over | remove | internet | order | ... | charSemicolon | charRoundbracket | charSquarebracket | charExclamation | charDollar | charHash | capitalAve | capitalLong | capitalTotal | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00 | 0.64 | 0.64 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.000 | 0.0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | spam |
| 1 | 2 | 0.21 | 0.28 | 0.50 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | ... | 0.00 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | spam |
| 2 | 3 | 0.06 | 0.00 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | spam |
| 3 | 4 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | ... | 0.00 | 0.137 | 0.0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | spam |
| 4 | 5 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | ... | 0.00 | 0.135 | 0.0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | spam |
5 rows × 59 columns
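Given a frame laid out like the one above (an `Id` column, numeric features, and a `type` label), one possible classification pipeline on reduced data is sketched here. PCA and logistic regression are illustrative choices, not necessarily the most suitable ones. The function `pca_spam_classifier` and the toy frame (including its `nonspam` label) are hypothetical, used so the example runs without downloading the file.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pca_spam_classifier(data, n_components=10, seed=0):
    """Fit scaler -> PCA -> logistic regression and return (model, test accuracy)."""
    X = data.drop(columns=["Id", "type"])          # Id is a row identifier, not a feature
    y = (data["type"] == "spam").astype(int)       # 1 = spam, 0 = anything else
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    # The pipeline fits the scaler and PCA on the training part only,
    # which avoids leaking test information into the projection.
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=n_components),
                          LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    return model, model.score(X_te, y_te)

# ASSUMPTION: toy frame with the same layout as the Spam data.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(rng.random((n, 12)), columns=[f"f{i}" for i in range(12)])
toy.insert(0, "Id", np.arange(1, n + 1))
toy["type"] = np.where(toy["f0"] + toy["f1"] > 1.0, "spam", "nonspam")

model, acc = pca_spam_classifier(toy, n_components=5)
print(f"test accuracy = {acc:.3f}")
```

With the real data, the call would be `pca_spam_classifier(data)`. The PCA step can be replaced by another projection that supports `transform`, such as a random projection, to compare methods.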
References
- Hinton and Roweis (2002).
- Laurens, \(t\)-SNE Page.
- Satellite Images.
- van der Maaten and Hinton (2008), Visualizing Data using t-SNE.
- Bank et al. (2021), Autoencoder.
- Umberto Michelucci (2022), An Introduction to Autoencoders.