TP4 - Clustering

Exploratory Data Analysis & Unsupervised Learning
M1-DAS
Lecturer: HAS Sothea, PhD


Objective: Clustering is a technique of Machine Learning and Data Analysis used to group similar data points together. The goal is to partition a dataset into distinct subsets, or clusters, such that data points within each cluster are more similar to one another than to those in other clusters. This practical class aims to enhance your understanding of two different clustering algorithms, including their strengths and weaknesses.


The Jupyter Notebook for this TP can be downloaded here: TP4-Clustering.


1. Kmeans Algorithm

We will begin with a toy example using a simulated dataset.

a. Write a function simulateData(k, n) that generates an ideal dataset for Kmeans, consisting of \(k\) groups of \(n\) observations each, drawn from 2D normal distributions (you can choose any value of \(k\in\{3,4,5,\dots,8\}\)). Visualize your dataset and make sure that the groups are spread evenly. An example is given below:

import numpy as np

def simulateData(k, n):
    # np.random.seed(69)  # uncomment for reproducibility
    # Each group: n points from a 2D normal with a random mean in [-10, 10]^2
    X = np.vstack([np.random.multivariate_normal(np.random.uniform(-10, 10, 2),
                                                 3 * np.eye(2), size=n)
                   for _ in range(k)])
    return X, np.repeat([str(i) for i in range(1, k + 1)], n)
# Simulate
import pandas as pd

data, labels = simulateData(3,100)
df_sim = pd.DataFrame({
    'x' : data[:,0],
    'y' : data[:,1],
    'label' : labels
})
import plotly.io as pio
pio.renderers.default = 'notebook'
import plotly.express as px
fig = px.scatter(df_sim, x="x", y="y", color="label")
fig.update_layout(width=500, height=400, title="Simulated data with 3 clusters")
fig.show()

b. Perform the Kmeans clustering algorithm on the generated dataset using different numbers of clusters (a sketch is given after the list below).

  • Check the equality: Within-class variance + Between-class variance = Total variance.
  • For each number of clusters \(k\), compute the within-class variance.
  • Plot the within-class variance as a function of the number of clusters.
  • What do you observe?
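
A minimal sketch of one possible solution is given below. It reuses `data` from part (a); the range of \(k\) values, `n_init=3`, and `random_state=0` are arbitrary choices. Note that `KMeans.inertia_` is the within-class sum of squares, so the between-class part can be recovered from the centroids and the global mean.

from sklearn.cluster import KMeans
import numpy as np
import plotly.express as px

total_ss = np.sum((data - data.mean(axis=0)) ** 2)   # total sum of squares

within = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=3, random_state=0).fit(data)
    wss = km.inertia_   # within-class part: squared distances to the closest centroid
    # Between-class part: size-weighted squared distances of centroids to the global mean
    _, counts = np.unique(km.labels_, return_counts=True)
    bss = np.sum(counts * np.sum((km.cluster_centers_ - data.mean(axis=0)) ** 2, axis=1))
    print(f"k={k}: within + between = {wss + bss:.2f}, total = {total_ss:.2f}")
    within.append(wss)

fig = px.line(x=list(ks), y=within, markers=True,
              labels={"x": "number of clusters k", "y": "within-class variance"})
fig.update_layout(width=500, height=400, title="Within-class variance vs number of clusters")
fig.show()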

c. Can you propose a systematic approach to approximate the most suitable number of clusters? (A sketch of one possible approach is given after the list below.)

  • Run your function \(30\) times on the same dataset. How many times was the real \(K\) found?
  • Set the argument n_init = 3 in the KMeans module and rerun the algorithm \(30\) times. How many times was the actual \(K\) found this time?
  • Compare the computational time of the algorithm with n_init = 3 and with this argument left at its default.
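
Below is a hedged sketch of one such approach: it keeps the \(k\) that maximizes the silhouette score (the elbow or the gap statistic would be equally valid criteria). The helper best_k and the range of \(k\) values are illustrative assumptions, not part of the exercise statement.

import time
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, k_range=range(2, 9), **km_kwargs):
    # Keep the k that maximizes the silhouette score
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, **km_kwargs).fit_predict(X))
              for k in k_range}
    return max(scores, key=scores.get)

for kwargs in ({}, {"n_init": 3}):   # default n_init vs n_init=3
    start = time.time()
    hits = sum(best_k(data, **kwargs) == 3 for _ in range(30))
    print(f"{kwargs or 'default n_init'}: real K found {hits}/30 times "
          f"in {time.time() - start:.1f} s")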

d. Compute and visualize the Silhouette Coefficient for each number of clusters considered above. Conclude.

from sklearn.metrics import silhouette_score
# To do
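
A possible completion, assuming `data` from part (a); the silhouette is only defined for \(k\ge 2\):

from sklearn.cluster import KMeans
import plotly.express as px

sil = []
ks = range(2, 9)   # the silhouette needs at least 2 clusters
for k in ks:
    labels_k = KMeans(n_clusters=k, n_init=3, random_state=0).fit_predict(data)
    sil.append(silhouette_score(data, labels_k))

fig = px.line(x=list(ks), y=sil, markers=True,
              labels={"x": "number of clusters k", "y": "silhouette coefficient"})
fig.update_layout(width=500, height=400, title="Silhouette coefficient vs number of clusters")
fig.show()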

2. Hierarchical clustering

Unlike the Kmeans algorithm, Hierarchical clustering (or hcluster) does not require a prior number of clusters. It either iteratively merges clusters (agglomerative or bottom-up approach), starting from each point being a cluster of its own and ending with fewer and fewer clusters, or iteratively splits clusters (divisive or top-down approach), starting from a single cluster containing all data points and creating more and more clusters.

a. Apply Hierarchical clustering on the previously simulated dataset.

# To do
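
One possible sketch using scikit-learn's agglomerative (bottom-up) implementation; Ward linkage and n_clusters=3 are illustrative choices, and `df_sim` and `data` come from part 1:

from sklearn.cluster import AgglomerativeClustering
import plotly.express as px

# Agglomerative clustering; Ward linkage merges the two clusters
# whose fusion increases the within-class variance the least
hc = AgglomerativeClustering(n_clusters=3, linkage="ward")
df_sim["hc_label"] = hc.fit_predict(data).astype(str)

fig = px.scatter(df_sim, x="x", y="y", color="hc_label")
fig.update_layout(width=500, height=400, title="Hierarchical clustering (Ward linkage)")
fig.show()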

b. Plot the dendrogram associated with the resulting hierarchy.

# To do
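
A sketch using scipy's linkage and dendrogram functions, with matplotlib assumed for the plot (plotly users could use plotly.figure_factory.create_dendrogram instead):

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

Z = linkage(data, method="ward")   # linkage matrix: one row per merge
plt.figure(figsize=(8, 4))
dendrogram(Z, no_labels=True)      # leaf labels hidden: 300 of them would be unreadable
plt.title("Dendrogram (Ward linkage)")
plt.ylabel("merge distance")
plt.show()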

c. Can you decide the most suitable number of clusters from the previous dendrogram?

# To do
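
One common heuristic, sketched below under the assumption that `Z` is the linkage matrix from part (b): cut where the merge distance jumps the most, i.e. at the largest vertical gap in the dendrogram.

import numpy as np
from scipy.cluster.hierarchy import fcluster

gaps = np.diff(Z[:, 2])                      # jumps between successive merge heights
k_best = len(data) - (np.argmax(gaps) + 1)   # clusters remaining just before the jump
print("Suggested number of clusters:", k_best)

cut_labels = fcluster(Z, t=k_best, criterion="maxclust")   # cut the tree accordingly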

3. Real dataset

Now apply both algorithms on the Iris dataset, imported as follows:

from sklearn.datasets import load_iris

# Load data
Iris = load_iris()
X = Iris.data
y = Iris.target
# To do
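
A possible starting point: run both algorithms with three clusters and compare their labels to the true species using a permutation-invariant score such as the adjusted Rand index (one choice among several):

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Both algorithms with 3 clusters (Iris has three species)
km_labels = KMeans(n_clusters=3, n_init=3, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Cluster indices are arbitrary, so compare with a permutation-invariant score
print("Kmeans ARI:      ", adjusted_rand_score(y, km_labels))
print("Hierarchical ARI:", adjusted_rand_score(y, hc_labels))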

Further Readings