TP4 - Clustering
Exploratory Data Analysis & Unsupervised Learning
Course: PHAUK Sokkey, PhD
TP: HAS Sothea, PhD
Objective: Clustering is a technique of ML and Data Analysis used to group similar data points together. The goal is to partition a dataset into distinct subsets, or clusters, such that data points within each cluster are more similar to one another than to those in other clusters. This practical class aims to enhance your understanding of two different clustering algorithms, including their strengths and weaknesses.
The Jupyter Notebook for this TP can be downloaded here: TP4-Clustering.
1. Kmeans Algorithm
We will begin with a toy example using a simulated dataset.
a. Write a function `simulateData(k, n)` that generates an ideal dataset for `Kmeans`, consisting of \(k\) groups of 2D normally distributed data points with \(n\) observations in each group (you can choose any value of \(k\in\{3,4,5,\dots,8\}\)). Visualize your dataset and make sure that the groups are spread evenly.
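A minimal sketch of such a function (placing the centers evenly on a circle, and the `spread` and `seed` values, are assumptions chosen only to keep the groups well separated):

```python
import numpy as np
import matplotlib.pyplot as plt

def simulateData(k, n, spread=8.0, seed=42):
    """Simulate k well-separated 2D Gaussian clusters with n points each."""
    rng = np.random.default_rng(seed)
    # Place the cluster centers evenly on a circle so the groups are spread evenly
    angles = 2 * np.pi * np.arange(k) / k
    centers = spread * np.column_stack([np.cos(angles), np.sin(angles)])
    X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n, 2)) for c in centers])
    y = np.repeat(np.arange(k), n)
    return X, y

X, y = simulateData(k=5, n=100)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="tab10", s=10)
plt.title("Simulated dataset: k = 5 Gaussian groups")
plt.show()
```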
b. We now try to detect the number of clusters \(k\) using the within-class variance:
- Check the equality: \(\text{Within-class variation} + \text{Between-class variation} = \text{Total variation}\).
- Perform the `Kmeans` algorithm using `KMeans` from the `sklearn.cluster` module with different numbers of clusters, and compute the within-class variation for each case.
- Plot the within-class variances as a function of the number of clusters.
- What do you observe?
```python
from sklearn.cluster import KMeans
# To do
```
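A possible sketch for this step, reusing `X` from part a (the range of candidate values of \(k\) is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

# Total variation of the data (sum of squared deviations from the global mean)
total = ((X - X.mean(axis=0)) ** 2).sum()

within = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # km.inertia_ is the within-class variation (squared distances to centroids)
    sizes = np.bincount(km.labels_)
    between = (sizes[:, None] * (km.cluster_centers_ - X.mean(axis=0)) ** 2).sum()
    assert np.isclose(km.inertia_ + between, total)  # the equality in question
    within.append(km.inertia_)

plt.plot(ks, within, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Within-class variation")
plt.show()
```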
c. Can you propose a systematic approach to approximate the most suitable number of clusters?
- Run your code \(30\) times on the same data. How many times did you get the number of clusters right? Why?
- Try to set the argument `n_init = 5` in `KMeans`, then use the previous method to approximate the optimal number of clusters. This time, within \(30\) runs, how many times do you get the number of clusters right? Explain why.
```python
# To do
```
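One possible systematic rule is sketched below; picking the \(k\) with the largest second-order difference of the within-class variance curve (its sharpest bend) is an assumption, not the only valid criterion:

```python
import numpy as np

def detect_k(X, k_max=10, n_init=1):
    """Estimate k as the sharpest bend (largest second difference) of the inertia curve."""
    inertias = [KMeans(n_clusters=k, n_init=n_init).fit(X).inertia_
                for k in range(1, k_max + 1)]
    return int(np.argmax(np.diff(inertias, n=2))) + 2  # index 0 corresponds to k = 2

for n_init in (1, 5):
    hits = sum(detect_k(X, n_init=n_init) == 5 for _ in range(30))  # 5 = true k here
    print(f"n_init = {n_init}: correct in {hits}/30 runs")
```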
d. Compute and visualize the Silhouette Coefficient for each number of clusters considered above. Conclude.
```python
from sklearn.metrics import silhouette_score
# To do
```
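A possible sketch (`silhouette_score` is undefined for a single cluster, so the candidate values start at \(k = 2\)):

```python
import matplotlib.pyplot as plt

sil = []
ks = range(2, 11)
for k in ks:
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil.append(silhouette_score(X, labels_k))

plt.plot(ks, sil, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Mean Silhouette Coefficient")
plt.show()
```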
2. Hierarchical clustering
Unlike the `Kmeans` algorithm, Hierarchical clustering or `hcluster` does not require the number of clusters in advance. It either iteratively merges clusters (agglomerative or bottom-up approach), forming fewer and fewer clusters starting from each point being a cluster of its own, or iteratively splits clusters (divisive or top-down approach), forming more and more clusters starting from a single cluster containing all data points.
a. Apply Hierarchical clustering to the previously simulated dataset.
```python
# To do
```
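A minimal sketch using scipy's agglomerative implementation (Ward linkage is an assumption; it pairs naturally with the within-variance criterion used above):

```python
from scipy.cluster.hierarchy import linkage, fcluster
import matplotlib.pyplot as plt

Z = linkage(X, method="ward")                     # full merge history of the hierarchy
groups = fcluster(Z, t=5, criterion="maxclust")   # cut the tree into 5 groups

plt.scatter(X[:, 0], X[:, 1], c=groups, cmap="tab10", s=10)
plt.title("Agglomerative clustering (Ward linkage)")
plt.show()
```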
b. Plot the dendrogram associated with the resulting groups.
```python
# To do
```
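A sketch reusing the linkage matrix `Z` from part a; truncating the display to the last 30 merges is an assumption for readability:

```python
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
dendrogram(Z, truncate_mode="lastp", p=30)  # show only the last 30 merges
plt.xlabel("Merged clusters")
plt.ylabel("Linkage distance")
plt.show()
```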
c. Can you decide the most suitable number of clusters from the previous dendrogram?
```python
# To do
```
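One common heuristic, sketched below under the assumption that a good cut lies in the largest gap between consecutive merge heights:

```python
import numpy as np

heights = Z[:, 2]                              # merge distances, in increasing order
widest_gap = int(np.argmax(np.diff(heights)))  # index of the merge below the widest gap
k_hat = len(heights) - widest_gap              # clusters left when cutting in that gap
print(f"Suggested number of clusters: {k_hat}")
```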
3. Real dataset
Now apply both algorithms to the `Mnist` dataset of handwritten digits, which can be downloaded here: Mnist dataset, or imported from the `keras.datasets` module as follows:
```python
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np

# Load the train/test splits of handwritten digit images
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Pick 8 random images and display them with their true labels
ID = np.random.randint(low=0, high=len(y_train), size=8)
fig, ax = plt.subplots(2, 4, figsize=(12, 6))
for i in range(8):
    ax[i//4, i%4].imshow(X_train[ID[i],:,:])
    ax[i//4, i%4].set_title(f"True label: {y_train[ID[i]]}")
plt.tight_layout()
plt.show()
```
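A possible sketch for applying both algorithms (flattening the images and working on a 5000-point subsample are assumptions; the full linkage matrix would not fit comfortably in memory for all 60,000 images):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage, fcluster

# Flatten the 28x28 images into 784-dimensional vectors, scaled to [0, 1]
n_sub = 5000
idx = np.random.choice(len(X_train), size=n_sub, replace=False)
X_flat = X_train[idx].reshape(n_sub, -1) / 255.0

# Kmeans with one cluster per digit class
km_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_flat)

# Agglomerative clustering (Ward linkage), cut into 10 groups
hc_labels = fcluster(linkage(X_flat, method="ward"), t=10, criterion="maxclust")

for name, labels in [("Kmeans", km_labels), ("Hierarchical", hc_labels)]:
    print(f"{name}: silhouette = {silhouette_score(X_flat, labels):.3f}")
```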