TP6 - Clustering


Course: Advanced Machine Learning
Lecturer: Sothea HAS, PhD

Objective: Clustering is an unsupervised learning method that aims to group data into clusters based on their similarities. In this TP, we will use the clustering algorithms we have seen to solve practical tasks such as image and data segmentation.


1. Satellite Image Segmentation

A. Assembling data

  • Download the satellite images from the following Kaggle repository: Satellite Images.
  • There are four folders of different areas captured by satellite images:
    • cloudy (\(1500\times 256\times 256\))
    • desert (\(1131\times 256\times 256\))
    • green_area (\(1500\times 64\times 64\))
    • water (\(1500\times 64\times 64\))
  • Assemble these four types of images (converting them all to \(64\times 64\) resolution) and save them as satellite_images.npy. You may find the following libraries useful:
    • cv2
    • glob
    • PIL
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mahmoudreda55/satellite-image-classification")
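The assembling step above could be sketched as follows. This is only one possible approach, not the expected solution: the helper name `assemble_images` is hypothetical, and it assumes that the four category folders sit directly under a single root directory, which should be checked against the actual layout of the download.

```python
import glob
import os

import numpy as np
from PIL import Image

CATEGORIES = ["cloudy", "desert", "green_area", "water"]


def assemble_images(root, categories=CATEGORIES, size=(64, 64)):
    """Load every image under root/<category>, resize it, and stack the results.

    Hypothetical helper: `root` is assumed to contain one sub-folder per category.
    Returns the image array and an integer label (index into `categories`) per image.
    """
    images, labels = [], []
    for label, category in enumerate(categories):
        # sorted() keeps the ordering reproducible across runs
        for file in sorted(glob.glob(os.path.join(root, category, "*"))):
            with Image.open(file) as img:
                # Force 3 channels so that all arrays share the same shape
                resized = img.convert("RGB").resize(size)
                images.append(np.asarray(resized, dtype=np.uint8))
            labels.append(label)
    return np.stack(images), np.array(labels)


# Usage (the "data" sub-folder is an assumption; inspect `path` first):
# X, y = assemble_images(os.path.join(path, "data"))
# np.save("satellite_images.npy", X)
```

Keeping the integer labels next to the images is a design choice made here for convenience, since the true categories are needed again later when the clusters and the predictive models are evaluated.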

B. Clustering

  • Load the assembled data and apply different clustering algorithms to it.
  • Determine the optimal number of clusters. Is the result reasonable?
  • Explore whether the clustering algorithms group images into their real categories.
# To do
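As a starting point, the three steps above might look like the sketch below. It is a minimal illustration under stated assumptions rather than a full solution: it runs on synthetic Gaussian blobs standing in for the flattened images, uses K-means only, picks the number of clusters with the silhouette score (one criterion among several), and the helper name `choose_k` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score


def choose_k(features, k_values=range(2, 9), seed=0):
    """Return the k with the highest silhouette score, plus all the scores."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return max(scores, key=scores.get), scores


# Stand-in data: 4 well-separated Gaussian blobs replacing the flattened images.
# With the real data one could use X.reshape(len(X), -1) / 255.0 instead.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(4, 20))
y_true = np.repeat(np.arange(4), 50)
features = centers[y_true] + rng.normal(size=(200, 20))

best_k, scores = choose_k(features)
clusters = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(features)

# Rows: true categories, columns: cluster labels (cluster ids are arbitrary)
print(pd.crosstab(y_true, clusters, rownames=["category"], colnames=["cluster"]))
print("Selected k:", best_k)
print("Adjusted Rand index:", adjusted_rand_score(y_true, clusters))
```

The contingency table shows which categories end up in which cluster, while the adjusted Rand index summarizes the agreement in a single number that does not depend on how the clusters are numbered.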

C. Predictive Models

  • Create a target of four categories \(y=\) ['cloudy', 'desert', 'forest', 'water'], where 'forest' denotes the green_area folder.
  • Randomly select 10% of each category and store them as test data.
  • Train ML models to predict the category of images.
  • Report the accuracy of the models.
# To do
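The per-category 10% hold-out and the accuracy report can be sketched as below. This is a hedged example on synthetic stand-in features, and the random forest is just one arbitrary choice of model; with the real data, `X` would be the flattened images and `y` the category of each image.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

categories = np.array(["cloudy", "desert", "forest", "water"])

# Stand-in data: 100 samples per category drawn around a category-specific mean
rng = np.random.default_rng(0)
y = np.repeat(categories, 100)
class_means = {c: rng.normal(scale=3.0, size=30) for c in categories}
X = np.stack([class_means[c] + rng.normal(size=30) for c in y])

# stratify=y makes the split hold out 10% of each category
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```

Stratifying the split keeps the test set balanced across the four categories, so the reported accuracy is not dominated by any single category.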

2. Revisiting the Spam Dataset

Task: Apply clustering algorithms to the Spam dataset. Can they distinguish spam from non-spam emails based on their characteristics?

import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
   Id  make  address   all  num3d   our  over  remove  internet  order  ...  charSemicolon  charRoundbracket  charSquarebracket  charExclamation  charDollar  charHash  capitalAve  capitalLong  capitalTotal  type
0   1  0.00     0.64  0.64    0.0  0.32  0.00    0.00      0.00   0.00  ...           0.00             0.000                0.0            0.778       0.000     0.000       3.756           61           278  spam
1   2  0.21     0.28  0.50    0.0  0.14  0.28    0.21      0.07   0.00  ...           0.00             0.132                0.0            0.372       0.180     0.048       5.114          101          1028  spam
2   3  0.06     0.00  0.71    0.0  1.23  0.19    0.19      0.12   0.64  ...           0.01             0.143                0.0            0.276       0.184     0.010       9.821          485          2259  spam
3   4  0.00     0.00  0.00    0.0  0.63  0.00    0.31      0.63   0.31  ...           0.00             0.137                0.0            0.137       0.000     0.000       3.537           40           191  spam
4   5  0.00     0.00  0.00    0.0  0.63  0.00    0.31      0.63   0.31  ...           0.00             0.135                0.0            0.135       0.000     0.000       3.537           40           191  spam

5 rows × 59 columns
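One way to approach the task is to cluster the numeric features and cross-tabulate the clusters against the `type` column. The sketch below is an illustration under assumptions: the helper name `cluster_vs_type` is hypothetical, K-means with two clusters is only one possible choice, and the small `toy` table (including its "nonspam" label) is synthetic stand-in data that mimics the column layout shown above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def cluster_vs_type(df, n_clusters=2, seed=0):
    """Cluster the numeric features and cross-tabulate clusters against `type`."""
    features = df.drop(columns=["Id", "type"])
    # Standardize so that large-scale columns (e.g. capitalTotal) do not dominate
    scaled = StandardScaler().fit_transform(features)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(scaled)
    table = pd.crosstab(df["type"], clusters, colnames=["cluster"])
    return table, adjusted_rand_score(df["type"], clusters)


# Tiny synthetic stand-in with two of the real column names (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Id": np.arange(1, 101),
    "make": np.r_[rng.normal(0.0, 0.1, 50), rng.normal(2.0, 0.1, 50)],
    "capitalTotal": np.r_[rng.normal(100, 10, 50), rng.normal(1000, 10, 50)],
    "type": ["nonspam"] * 50 + ["spam"] * 50,
})

table, ari = cluster_vs_type(toy)
print(table)
print("Adjusted Rand index:", ari)

# With the real data: table, ari = cluster_vs_type(data)
```

The `type` column is dropped before clustering and used only afterwards, so the comparison measures what the algorithm recovers from the email characteristics alone.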

References

\(^{\text{📚}}\) Linder, T. (2002).
\(^{\text{📚}}\) von Luxburg, U. (2007).
\(^{\text{📚}}\) Satellite Images.