Lab7: KMeans & Hierarchical Clustering

Course: CSCI-866-001: Data Mining & Knowledge Discovery
Lecturer: Sothea HAS, PhD


Objective: Clustering is a technique in machine learning and data analysis used to group similar data points together. The goal is to partition a dataset into distinct subsets, or clusters, such that data points within each cluster are more similar to one another than to those in other clusters. This practical class aims to deepen your understanding of two different clustering algorithms, including their strengths and weaknesses.



1. KMeans Algorithm

We will begin with a toy example using a simulated dataset.

a. Write a function simulateData(k, n) that generates an ideal dataset for KMeans: \(k\) groups of 2D normally distributed data points, with \(n\) observations in each group (you can choose any value of \(k\in\{3,4,5,...,8\}\)). Visualize your dataset and make sure that the groups are spread evenly.

Code
import numpy as np
def simulateData(k, n):
    # np.random.seed(69)  # uncomment for reproducible draws
    # k groups of n points, each drawn from a 2D normal with a random mean in [-10, 10]^2 and covariance 3*I
    X = np.vstack([np.random.multivariate_normal(np.random.uniform(-10, 10, 2), 3 * np.eye(2), size=n)
                   for _ in range(k)])
    # Group labels "1", ..., "k", repeated n times each
    return X, np.repeat([str(i) for i in range(1, k + 1)], n)
  • An example can be simulated as follows.
Code
# Simulate
import pandas as pd
K = 4
data, labels = simulateData(K,100)
df_sim = pd.DataFrame({
    'x' : data[:,0],
    'y' : data[:,1],
    'label' : labels
})
Code
import plotly.io as pio
pio.renderers.default = 'notebook'
import plotly.express as px
fig = px.scatter(df_sim, x="x", y="y")
fig.update_layout(width=700, height=400, title=f"Simulated data with {K} clusters")
fig.show()

b. Perform the KMeans clustering algorithm on the generated dataset using different numbers of clusters \(k\in\{2,3,4,5,6,7,8\}\).

  • For each value of \(k\), check the equality: \(\text{TSS}=\text{WSS}+\text{BSS}\) where
    • TSS: Total Sum of Squares: \(\text{TSS}=\sum_{i=1}^n\|\text{x}_i-\overline{\text{x}}_n\|^2\).
    • WSS: Within Sum of Squares: \(\text{WSS}=\sum_{k=1}^K\sum_{i=1}^{n_k}\|\text{x}_i^{k}-\overline{\text{x}}_n^{k}\|^2\).
    • BSS: Between Sum of Squares: \(\text{BSS}=\sum_{k=1}^Kn_k\|\overline{\text{x}}_n^{k}-\overline{\text{x}}_n\|^2\).
  • Plot the within-cluster variances (WSS) as a function of the number of clusters (a sketch is given after the to-do cell below).
  • Can you spot a clear elbow in the previous graph?
  • Try different simulated datasets. Can you see the connection between the clustering structure and the shape of the elbow?
# To do
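One way this step could look (a sketch, not the required solution), reusing data and px from part (a): WSS is taken from KMeans' inertia_, TSS is computed directly from the data, and BSS is deduced from the cluster sizes and centroids; the WSS values then give the elbow plot.
Code
import numpy as np
from sklearn.cluster import KMeans

# TSS depends only on the data, not on the clustering
TSS = np.sum((data - data.mean(axis=0)) ** 2)

ks = range(2, 9)
wss_values = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=0).fit(data)
    WSS = km.inertia_  # within-cluster sum of squares
    # BSS: squared distance of each centroid to the global mean, weighted by cluster size
    sizes = np.bincount(km.labels_)
    BSS = np.sum(sizes * np.sum((km.cluster_centers_ - data.mean(axis=0)) ** 2, axis=1))
    print(f"k={k}: TSS={TSS:.2f}, WSS+BSS={WSS + BSS:.2f}")
    wss_values.append(WSS)

# Elbow plot: WSS as a function of the number of clusters
fig = px.line(x=list(ks), y=wss_values, markers=True,
              labels={"x": "Number of clusters k", "y": "WSS"})
fig.update_layout(width=700, height=400, title="Elbow curve")
fig.show()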
  • From the results of the KMeans module, write a function find_optimal_cluster that automatically returns the optimal \(K^*\) where the elbow occurs (one possible heuristic is sketched after the stub below).
def find_optimal_cluster(kmeans):
    pass
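One possible heuristic (a sketch, assuming kmeans is a list of fitted KMeans objects for consecutive values of k): take the k where the WSS curve bends most sharply, i.e. where its second-order difference is largest.
Code
import numpy as np
from sklearn.cluster import KMeans

def find_optimal_cluster(kmeans):
    # `kmeans`: list of fitted KMeans objects for consecutive numbers of clusters (e.g. k = 2,...,8)
    ks = np.array([km.n_clusters for km in kmeans])
    wss = np.array([km.inertia_ for km in kmeans])
    # The elbow is taken where the WSS curve bends most sharply (largest second difference);
    # other elbow heuristics are equally acceptable.
    return ks[np.argmax(np.diff(wss, 2)) + 1]

# Example usage on the simulated data
fits = [KMeans(n_clusters=k, random_state=0).fit(data) for k in range(2, 9)]
print(find_optimal_cluster(fits))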

c. Sensitivity to random initialization:

  • Run your function on the results of \(30\) different runs of the KMeans algorithm. How many times was the real \(K\) found?
  • Increase the argument n_init = 3 in the KMeans module and rerun the algorithm \(30\) times. This time, how many times was the actual \(K\) found?
  • Compare the computational time of the algorithm when n_init = 3 and when this argument is left at its default (a sketch is given below).
# To do
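A sketch, assuming find_optimal_cluster from part (b) is available: repeat the whole pipeline 30 times with the default initialization and with n_init = 3, count how often the true \(K\) is recovered, and time both settings.
Code
import time
from sklearn.cluster import KMeans

def count_recovered(n_runs=30, true_k=K, **kmeans_kwargs):
    # Run KMeans for k = 2,...,8 `n_runs` times and count how often the elbow equals `true_k`
    hits = 0
    for _ in range(n_runs):
        fits = [KMeans(n_clusters=k, **kmeans_kwargs).fit(data) for k in range(2, 9)]
        hits += int(find_optimal_cluster(fits) == true_k)
    return hits

start = time.time()
hits_default = count_recovered()  # default initialization
print(f"default : {hits_default}/30 correct in {time.time() - start:.1f}s")

start = time.time()
hits_n3 = count_recovered(n_init=3)  # n_init = 3
print(f"n_init=3: {hits_n3}/30 correct in {time.time() - start:.1f}s")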

d. Compute and visualize the Silhouette Coefficient as a function of the number of clusters \(K\). Conclude.

from sklearn.metrics import silhouette_score
# To do
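A minimal sketch: compute the silhouette score of each KMeans partition for \(K = 2,\dots,8\) and plot it; the \(K\) that maximizes the score is a natural candidate for the number of clusters.
Code
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 9)
sil = [silhouette_score(data, KMeans(n_clusters=k, random_state=0).fit_predict(data)) for k in ks]

fig = px.line(x=list(ks), y=sil, markers=True,
              labels={"x": "Number of clusters K", "y": "Silhouette score"})
fig.update_layout(width=700, height=400, title="Silhouette score as a function of K")
fig.show()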

2. Hierarchical Clustering

Unlike the KMeans algorithm, hierarchical clustering (hcluster) does not require the number of clusters to be fixed in advance. It either iteratively merges clusters (agglomerative, or bottom-up, approach), starting from each point being a cluster of its own, or iteratively splits clusters (divisive, or top-down, approach), starting from a single cluster containing all data points.

a. Apply hierarchical clustering to the previously simulated dataset.

# To do
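A sketch using scikit-learn's AgglomerativeClustering on the simulated data (Ward linkage is just one choice; other linkages work too).
Code
from sklearn.cluster import AgglomerativeClustering

# Agglomerative (bottom-up) clustering with Ward linkage
hclust = AgglomerativeClustering(n_clusters=K, linkage="ward")
df_sim["hclust"] = hclust.fit_predict(data).astype(str)

fig = px.scatter(df_sim, x="x", y="y", color="hclust")
fig.update_layout(width=700, height=400, title="Agglomerative clustering of the simulated data")
fig.show()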

b. Plot the dendrogram associated with the resulting groups.

# To do
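A sketch using SciPy's linkage and dendrogram functions (matplotlib is assumed for the figure):
Code
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Build the merge tree with Ward linkage and draw the dendrogram
Z = linkage(data, method="ward")
plt.figure(figsize=(10, 4))
dendrogram(Z, no_labels=True)
plt.title("Dendrogram of the simulated data (Ward linkage)")
plt.ylabel("Merge distance")
plt.show()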

c. Can you decide the most suitable number of clusters from the previous dendrogram?

# To do

3. Real dataset

We now consider the Bank Loan dataset. It contains 5000 rows, one per customer, with information including age, experience, income, family size, and more. The dataset can be imported as follows.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("itsmesunil/bank-loan-modelling")
data = pd.read_excel(path + "/Bank_Personal_Loan_Modelling.xlsx", "Data")
data.head()
ID Age Experience Income ZIP Code Family CCAvg Education Mortgage Personal Loan Securities Account CD Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1

  • Choose suitable columns and perform KMeans clustering using different numbers of clusters \(K\) (a starting sketch is given after this list).
  • Use the elbow method and silhouette scores to detect the optimal number of clusters for this dataset.
  • Interpret each cluster.
  • Apply Hierarchical clustering and interpret the result.
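A possible starting point (the column choice and scaling below are assumptions, not the required answer): keep a few numeric customer attributes, standardize them, and reuse the elbow and silhouette analyses from Section 1.
Code
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Candidate numeric features (an assumption; adjust to your own analysis)
features = ["Age", "Experience", "Income", "Family", "CCAvg", "Mortgage"]
X = StandardScaler().fit_transform(data[features])

wss, sil = [], []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    wss.append(km.inertia_)
    sil.append(silhouette_score(X, km.labels_))

print(pd.DataFrame({"K": list(ks), "WSS": wss, "Silhouette": sil}))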

Further Readings