import kagglehub

# Download the latest version of the dataset
path = kagglehub.dataset_download("mahmoudreda55/satellite-image-classification")
TP6 - Clustering
Course: Advanced Machine Learning
Lecturer: Sothea HAS, PhD
Objective: Clustering is an unsupervised learning method that groups data into clusters based on their similarities. In this TP, we will use the various clustering algorithms we have seen to solve practical tasks such as image and data segmentation.
- The notebook of this TP can be downloaded here: TP6_Clustering.ipynb.
1. Satellite Image Segmentation
A. Assembling data
- Download the satellite images from the following Kaggle repository: Satellite Images.
- There are four folders of different areas captured by satellite images:
  - cloudy (\(1500\times 256\times 256\))
  - desert (\(1131\times 256\times 256\))
  - green_area (\(1500\times 64\times 64\))
  - water (\(1500\times 64\times 64\))
- Assemble these four types of images (convert them to \(64\times 64\) resolution) and save them as satellite_images.npy. You may find the following libraries useful: cv2, glob, PIL.
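The assembling step can be sketched as follows. This is a minimal sketch, not the reference solution: the helper name `assemble_images`, the `root` argument, and the choice to convert every image to RGB are assumptions, and it supposes that each category folder sits directly under `root` and contains only image files.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def assemble_images(root, categories, size=(64, 64)):
    """Load every image under root/<category>, resize it, and stack the results.

    Hypothetical helper: assumes each category folder contains only image files.
    """
    images, labels = [], []
    for category in categories:
        # sorted() keeps the file order reproducible across runs
        for file in sorted(Path(root, category).glob("*")):
            if not file.is_file():
                continue
            with Image.open(file) as img:
                # Assumption: convert to RGB so every array has 3 channels
                resized = img.convert("RGB").resize(size)
                images.append(np.asarray(resized, dtype=np.uint8))
            labels.append(category)
    return np.stack(images), np.array(labels)


# Usage (root is wherever the four folders were extracted):
# X, y = assemble_images(root, ["cloudy", "desert", "green_area", "water"])
# np.save("satellite_images.npy", X)
```

The returned array has shape `(n_images, 64, 64, 3)`; keeping the label array alongside it makes the later comparison with the real categories straightforward.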
B. Clustering
- Load the assembled data and apply different clustering algorithms to it.
- Detect the optimal number of clusters. Is the result reasonable?
- Explore whether the clustering algorithms group the images into their real categories.
# To do
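One possible starting point for the clustering questions, given as a sketch under stated assumptions: it uses K-means only, reduces the flattened pixels with PCA first, picks the number of clusters by the silhouette score, and measures agreement with the real categories by the adjusted Rand index. The function names, the 20 PCA components, and the range of \(k\) values are illustrative choices, not part of the TP statement.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score


def choose_k_by_silhouette(X, k_values=range(2, 9), seed=0):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


def cluster_images(images, true_labels=None, n_components=20, seed=0):
    """Cluster flattened images with K-means after a PCA reduction (assumed choice)."""
    # Flatten each image to one row and rescale pixel values to [0, 1]
    flat = images.reshape(len(images), -1).astype(np.float32) / 255.0
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(flat)
    best_k, _ = choose_k_by_silhouette(reduced, seed=seed)
    labels = KMeans(n_clusters=best_k, n_init=10, random_state=seed).fit_predict(reduced)
    # Adjusted Rand index: 1 means perfect agreement, about 0 means chance level
    ari = None if true_labels is None else adjusted_rand_score(true_labels, labels)
    return best_k, labels, ari


# Usage, assuming X and y were built in the assembling step:
# best_k, labels, ari = cluster_images(np.load("satellite_images.npy"), true_labels=y)
```

The same skeleton can be reused with other algorithms (for example hierarchical or spectral clustering) by swapping the `KMeans` calls, which makes it easy to compare how each one recovers the four categories.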
C. Predictive Models
- Create a target of four categories \(y=\) [‘cloudy’, ‘desert’, ‘forest’, ‘water’] (here ‘forest’ corresponds to the green_area folder).
- Randomly select 10% of each category and store them as test data.
- Train ML models to predict the category of images.
- Report the accuracy of the models.
# To do
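The steps of this part can be sketched as below. It is a minimal sketch, assuming flattened pixels as features and two off-the-shelf scikit-learn models; the function name `evaluate_classifiers` and the model settings are illustrative. Note that `stratify=y` holds out approximately 10% of each category (the per-class counts are rounded).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_classifiers(images, y, test_size=0.1, seed=0):
    """Train two baseline models on flattened pixels and return their test accuracy."""
    X = images.reshape(len(images), -1).astype(np.float32) / 255.0
    # stratify=y keeps the same proportion of every category in the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Assumed model choices; any classifier with fit/predict can be added here
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=seed),
    }
    accuracy = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        accuracy[name] = accuracy_score(y_test, model.predict(X_test))
    return accuracy


# Usage, assuming X (images) and y (category names) from the assembling step:
# print(evaluate_classifiers(X, y))
```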
2. Revisit Spam dataset
Task: Apply clustering algorithms to the Spam dataset. Can clustering algorithms distinguish spam from non-spam emails based on their characteristics?
import pandas as pd

# Load the Spam dataset (space-separated text file) and preview the first rows
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
| | Id | make | address | all | num3d | our | over | remove | internet | order | ... | charSemicolon | charRoundbracket | charSquarebracket | charExclamation | charDollar | charHash | capitalAve | capitalLong | capitalTotal | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.00 | 0.64 | 0.64 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.000 | 0.0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | spam |
1 | 2 | 0.21 | 0.28 | 0.50 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | ... | 0.00 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | spam |
2 | 3 | 0.06 | 0.00 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | spam |
3 | 4 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | ... | 0.00 | 0.137 | 0.0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | spam |
4 | 5 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | ... | 0.00 | 0.135 | 0.0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | spam |
5 rows × 59 columns
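A possible way to approach the task is sketched below, under stated assumptions: the `Id` and `type` columns shown in the preview are dropped before clustering, all remaining columns are numeric, features are standardized, and K-means with two clusters is used. The helper name `cluster_vs_type` is hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def cluster_vs_type(data, label_col="type", id_col="Id", n_clusters=2, seed=0):
    """Cluster the feature columns and compare the clusters with the known labels."""
    # The identifier and the label must not be used as clustering features
    features = data.drop(columns=[label_col, id_col])
    # Standardize so that large-scale columns do not dominate the distances
    scaled = StandardScaler().fit_transform(features)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(scaled)
    # Contingency table: rows are true types, columns are cluster ids
    table = pd.crosstab(data[label_col], clusters, rownames=[label_col], colnames=["cluster"])
    return table, adjusted_rand_score(data[label_col], clusters)


# Usage with the DataFrame loaded above:
# table, ari = cluster_vs_type(data)
# print(table, ari)
```

If most emails of both types fall into the same cluster, or the adjusted Rand index is close to 0, the clusters do not reflect the spam/non-spam distinction.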
References
\(^{\text{📚}}\) Linder, T. (2002).
\(^{\text{📚}}\) Luxburg (2007).
\(^{\text{📚}}\) Satellite Images.