Advanced Machine Learning


     

Lecturer: Dr. HAS Sothea

Code: AMSI61AML

Unsupervised Learning: Clustering

🗺️ Content

  • Introduction & Motivation

  • Kmeans as vector quantization

  • Hierarchical Clustering

  • Spectral Clustering

  • Applications

Introduction & Motivation

Old Faithful dataset (\(272\) rows, \(2\) columns)

Code
import pandas as pd                  # Import pandas package
import numpy as np                   # Numerical arrays (used throughout)
import seaborn as sns                # Package for beautiful graphs
import plotly.express as px          # Interactive graphs
import matplotlib.pyplot as plt      # Graph management
# path = "https://gist.githubusercontent.com/curran/4b59d1046d9e66f2787780ad51a1cd87/raw/9ec906b78a98cf300947a37b56cfe70d01183200/data.tsv"  # A copy of the data is also available at this link
df0 = pd.read_csv(path0 + "/faithful.csv")   # Import it into Python (path0 points to the local data folder)
df0.head(5)                          # Display the first 5 rows
eruptions waiting
0 3.600 79
1 1.800 54
2 3.333 74
3 2.283 62
4 4.533 85

Code
fig0 = px.scatter(df0, x="waiting", y="eruptions", size_max=50)    # Create scatterplot
fig0.update_layout(
    title="Old Faithful data from Yellowstone National Park, US",
    width=500, height=380)    # Title
fig0.show()
  • Two blocks: shorter wait & eruption, and the longer ones.

Introduction & Motivation

Satellite Image Segmentation

Introduction & Motivation

More motivations

Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records.

Biology: classification of plants and animals given their features.

Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds.

Bookshops: book ordering (recommendation).

City-planning: identifying groups of houses according to their type, value and geographical location.

Internet: document classification; clustering weblog data to discover groups of similar access patterns; topic modeling…

Introduction & Motivation

Clustering

Code
df0.head(2)
eruptions waiting
0 3.6 79
1 1.8 54
  • Clustering aims at partitioning a set of points from some metric space in such a way that
    • Points within the same group are close/similar, and
    • Points from different groups are distant.
  • It belongs to the Unsupervised Learning branch of Machine Learning (ML).
  • Objective: group data into clusters based on their similarities.

Kmeans as a vector quantization

Clustering as a compression problem

  • Given set of data \({\cal D}=\{\text{x}_1,\dots,\text{x}_n\}\subset\mathbb{R}^d\).
  • Compression: Represent \(\text{x}_i\) using a unique code-vector \(c_k\in{\cal C}=\{c_1,\dots,c_K\}\subset\mathbb{R}^d\) for some \(K\geq 1\).
eruptions waiting
0 3.600 79
1 1.800 54
2 3.333 74
3 2.283 62

Original data

eruptions waiting
0 4.298 80.285
1 2.094 54.750
2 4.298 80.285
3 2.094 54.750

Compressed data

  • Compression loss: \(\ell(\text{x}_i,c_k)=\|\text{x}_i-c_k\|^2=\sum_{j=1}^d(\text{x}_{ij}-c_{kj})^2.\)
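
A minimal sketch of this compression step (it assumes the Old Faithful dataframe df0 loaded above; the two code-vectors below are picked by hand for illustration, not learned):

Code
import numpy as np

# Illustrative codebook C = {c_1, c_2} in R^2 (eruptions, waiting)
C = np.array([[4.3, 80.3],
              [2.1, 54.8]])
X = df0[["eruptions", "waiting"]].to_numpy()               # original data, shape (n, 2)

# Quantize: map each x_i to its nearest code-vector
d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)    # squared distances to each c_k
X_compressed = C[d2.argmin(axis=1)]                        # compressed representation of the data
print(X_compressed[:4])                                    # each row is now one of the two code-vectors
print("Total compression loss:", ((X - X_compressed) ** 2).sum())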

Kmeans as a vector quantization

Problem setting

  • Compression loss: \(\ell(\text{x}_i,c_k)=\|\text{x}_i-c_k\|^2=\sum_{j=1}^d(\text{x}_{ij}-c_{kj})^2.\)
  • A \(K\)-point quantizer associated with codebook \(\cal C\) is any mapping:

\[\begin{align*}q:\mathbb{R}^d&\to{\cal C}=\{c_k:k=1,\dots,K\}\\ \text{x}_i&\mapsto q(\text{x}_i)=c_k\text{ for some }k.\end{align*}\]

  • We define Within Sum of Square criterion for a quantizer \(q\) by:

\[\color{red}{\text{WSS}}(q)=\sum_{i=1}^n\ell(\text{x}_i,q(\text{x}_i))=\sum_{i=1}^n\|\text{x}_i-q(\text{x}_i)\|^2.\]

  • Objective: find \(\color{blue}{q^*}\) that provides the lowest WSS, i.e.,

\[\color{red}{\text{WSS}}(\color{blue}{q^*})=\min_{q\in{\cal Q}}\color{red}{\text{WSS}}(q).\]
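
A small helper for evaluating this criterion when \(q\) assigns each point to its nearest code-vector (a sketch; X is any \(n\times d\) NumPy array, e.g. the Old Faithful values):

Code
import numpy as np

def wss(X, codebook):
    # WSS(q) = sum_i ||x_i - q(x_i)||^2 for the nearest-code-vector quantizer q
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
    return d2.min(axis=1).sum()                                      # each point pays the distance to its closest c_k

# Example on the Old Faithful data (assumes df0 from above)
X = df0[["eruptions", "waiting"]].to_numpy()
C_bad  = np.array([[0.0, 0.0], [10.0, 100.0]])   # arbitrary codebook
C_good = np.array([[4.3, 80.3], [2.1, 54.8]])    # roughly the two visible blocks
print(wss(X, C_bad), wss(X, C_good))             # the second codebook gives a much smaller WSS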

Kmeans as a vector quantization

Visualizing WSS

\[\color{red}{\text{WSS}}(q)=\sum_{i=1}^n\ell(\text{x}_i,q(\text{x}_i))=\sum_{i=1}^n\|\text{x}_i-q(\text{x}_i)\|^2.\]

  • Objective: find \(\color{blue}{q^*}\) that provides the lowest WSS, i.e.,

\[\color{red}{\text{WSS}}(\color{blue}{q^*})=\min_{q\in{\cal Q}}\color{red}{\text{WSS}}(q).\]

  • Quantizer \(q\) maps \(\text{x}_i\) to some centroid \(c_k\).
  • For k=1,2,...,K:
    • Within group k: \(\color{red}{\sum_{\text{x}_i\in G_k}\|\text{x}_i-c_k\|^2}.\)
  • \(\color{red}{\text{WSS}}(q)=\sum_{k=1}^K\color{red}{\sum_{\text{x}_i\in G_k}\|\text{x}_i-c_k\|^2}.\)

Measures the proximity of points within each group.

Kmeans as a vector quantization

Basic Identity WSS & Voronoi partition

For any number of clusters \(K\), one has: \[\color{blue}{\text{TSS}}=\color{green}{\text{BSS}}+\color{red}{\text{WSS}},\] where

  • \(\color{blue}{\text{TSS}}=\sum_{i=1}^n\|\text{x}_i-\overline{\text{x}}\|^2\) and
  • \(\color{green}{\text{BSS}}=\sum_{k=1}^Kn_k\|\overline{\text{x}}_k-\overline{\text{x}}\|^2,\) with \(\overline{\text{x}}_k\) the centroid of cluster \(k\) and \(\overline{\text{x}}\) the global mean.


  • If codebook \(\cal C=\{c_1,\dots,c_K\}\) is known, Voronoi partition associated with \(\cal C\) is defined by \[\color{green}{S_k}=\{\text{x}_i:\|\text{x}_i-\color{green}{c_k}\|\leq \|\text{x}_i-c_j\|, \forall j\neq k\}.\]
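
A quick numerical check of this identity (a sketch on synthetic blobs; the identity holds exactly when each centroid \(c_k\) is the mean of its cluster):

Code
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

x_bar = X.mean(axis=0)                                            # global mean
centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])
TSS = ((X - x_bar) ** 2).sum()
WSS = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(3))
BSS = sum((labels == k).sum() * ((centroids[k] - x_bar) ** 2).sum() for k in range(3))
print(TSS, BSS + WSS)                                             # equal up to floating-point error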

Kmeans as a vector quantization

Nearest Neighbor Condition

  • If codebook \({\cal C}=\{c_1,\dots,c_K\}\) is known, Voronoi partition associated with \(\cal C\) is defined by \[\color{green}{S_k}=\{\text{x}_i:\|\text{x}_i-\color{green}{c_k}\|\leq \|\text{x}_i-c_j\|, \forall j\neq k\}.\]

  • Given centroids \({\cal C}=\{c_1,\dots,c_K\}\), we can define groups by simply computing Voronoi partition.

Lemma 1: Nearest Neighbor Condition [Linder, T. (2002)]

Given a codebook \({\cal C}\), let \(q\) be any quantizer and \(\color{blue}{\hat{q}}\) be the quantizer that assigns any observation \(\text{x}_i\) to its closest code-vector (centroid of its Voronoi partition), i.e., \[\color{blue}{\hat{q}}(\text{x}_i)=\arg\min_{c_k\in{\cal C}}\|\text{x}_i-c_k\|^2\]

Then, one has \[\color{red}{\text{WSS}}(\color{blue}{\hat{q}})\leq \color{red}{\text{WSS}}(q).\]

Kmeans as a vector quantization

Centroid Condition

  • If codebook \({\cal C}=\{c_1,\dots,c_K\}\) is known, Voronoi partition associated with \(\cal C\) is defined by \[\color{green}{S_k}=\{\text{x}_i:\|\text{x}_i-\color{green}{c_k}\|\leq \|\text{x}_i-c_j\|, \forall j\neq k\}.\]

  • Given centroids \({\cal C}=\{c_1,\dots,c_K\}\), we can define groups by simply computing Voronoi partition.

Lemma 2: Centroid Condition [Linder, T. (2002)]

Given a partition \({\cal S}=\{S_1,\dots,S_K\}\), let \(q\) be any quantizer using this partition and let \(\color{blue}{\hat{q}}\) be the quantizer on the same partition whose codebook \(\hat{{\cal C}}=\{\hat{c}_k\}\) consists of the cluster means \[\hat{c}_k=\frac{1}{|S_k|}\sum_{\text{x}_i\in S_k}\text{x}_i.\]

Then, one has \[\color{red}{\text{WSS}}(\color{blue}{\hat{q}})\leq \color{red}{\text{WSS}}(q).\]

Kmeans as a vector quantization

Summary of Lemma 1

  • Nearest Neighbor Condition: Given a codebook \({\cal C}=\{c_k\}\), we know how to assign \(\text{x}_i\mapsto c_k\) optimally, and therefore how to find the best Voronoi partition.

  • Centroid Condition: Given a partition \({\cal S}=\{S_k\}\), we know how to define an optimal codebook \({\cal C}=\{c_k\}\).

  • Q1: Based on these results, how would you design a clustering algorithm?
  • A1: Something like: \({\cal C}_1\overset{\color{green}{\text{NNC}}}{\to}{\cal S}_1\overset{\color{red}{\text{CC}}}{\to}{\cal C}_2\overset{\color{green}{\text{NNC}}}{\to}{\cal S}_2...\)
  • This leads to the well-known KMeans clustering algorithm.
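
A minimal from-scratch sketch of this alternation (NNC then CC); X is assumed to be an \(n\times d\) NumPy array, and the sketch mirrors the algorithm on the next slide rather than replacing sklearn's KMeans:

Code
import numpy as np

def kmeans_lloyd(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)]              # random initial codebook
    for _ in range(n_iter):
        # Cluster assignment (Nearest Neighbor Condition)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Centroid recomputation (Centroid Condition)
        new_C = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else C[k]
                          for k in range(K)])
        if np.allclose(new_C, C):                                 # centroids stopped moving: converged
            break
        C = new_C
    return labels, C

# Usage: labels, centroids = kmeans_lloyd(X, K=2)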

Kmeans as a vector quantization

KMeans Algorithm

  • Given data \({\cal D}=\{\text{x}_1,\dots,\text{x}_n\}\subset\mathbb{R}^d\) and \(K\).
  • Initialization: \({\cal C}^{0}=\{c_1,\dots,c_K\}\) (randomly).
  • Cluster Assignment (NNC):

for i = 1,...,n:
    for k = 1,...,K: \[\text{Assign }\text{x}_i\to \color{green}{S_k}\text{ if }\|\text{x}_i-\color{green}{c_k}\|\leq \|\text{x}_i-c_j\|,\forall j\neq k.\]

  • Centroid Recomputation (CC): From \({\cal S}=\{S_k\}\) recompute new centroids:

\[c_k=\frac{1}{|S_k|}\sum_{\text{x}_i\in S_k}\text{x}_i.\]

  • Alternately repeat the assignment and recomputation steps until convergence.
Code
import plotly.graph_objects as go   # needed for go.Figure / go.Scatter below
from sklearn.cluster import KMeans
# df1 (a 2-D array of simulated points) and col1 (a color map) are assumed to be defined earlier in the slides.
km = KMeans(n_clusters=4, init='random', max_iter=7, n_init=1, random_state=12)
km = km.fit(df1)
fig_km = go.Figure(go.Scatter(x=df1[:,0], y=df1[:,1], name="Data", mode="markers", 
                   marker=dict(color=[col1[str(j+1)] for j in km.labels_], size=7)))
fig_km.add_trace(go.Scatter(x=km.cluster_centers_[:,0], y=km.cluster_centers_[:,1],
                name="Centroid", mode="markers", marker=dict(color='black', size=15, symbol='star')))
fig_km.update_layout(width=500, height=480, title="KMeans Algorithm in Action")
fig_km.show()

Kmeans as a vector quantization

KMeans Algorithm may get stuck

  • Given data \({\cal D}=\{\text{x}_1,\dots,\text{x}_n\}\subset\mathbb{R}^d\) and \(K\).
  • Initialization: \(\color{red}{{\cal C}^{0}=\{c_1,\dots,c_K\}}\) (randomly).
  • Cluster Assignment (NNC):

for i = 1,...,n:
    for k = 1,...,K: \[\text{Assign }\text{x}_i\to \color{green}{S_k}\text{ if }\|\text{x}_i-\color{green}{c_k}\|\leq \|\text{x}_i-c_j\|,\forall j\neq k.\]

  • Centroid Recomputation (CC): From \({\cal S}=\{S_k\}\) recompute new centroids:

\[c_k=\frac{1}{|S_k|}\sum_{\text{x}_i\in S_k}\text{x}_i.\]

  • Alternately repeat the assignment and recomputation steps until convergence.
Code
from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='random', max_iter=7, n_init=1, random_state=11)
km = km.fit(df1)
fig_km = go.Figure(go.Scatter(x=df1[:,0], y=df1[:,1], name="Data", mode="markers", 
                   marker=dict(color=[col1[str(j+1)] for j in km.labels_], size=7)))
fig_km.add_trace(go.Scatter(x=km.cluster_centers_[:,0], y=km.cluster_centers_[:,1],
                name="Centroid", mode="markers", marker=dict(color='black', size=15, symbol='star')))
fig_km.update_layout(width=500, height=480, title="KMeans Algorithm in Action")
fig_km.show()

Kmeans as a vector quantization

KMeans Algorithm may get stuck

  • Given data \({\cal D}=\{\text{x}_1,\dots,\text{x}_n\}\subset\mathbb{R}^d\) and \(K\).
  • A solution from KMeans of sklearn.cluster module:
from sklearn.cluster import KMeans
km = KMeans(
    n_clusters=4, 
    n_init=5     # Perform KMeans 5 times
)
km = km.fit(df1) # The best one among the 5

🔑 KMeans is run n_init times with different random initializations; the run with the lowest WSS is kept as the final result.
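
A small sketch of what n_init buys, on synthetic blobs (with a single random initialization the algorithm may stop at a worse local minimum, i.e. a higher WSS):

Code
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.5, random_state=3)
single = KMeans(n_clusters=4, init='random', n_init=1,  random_state=0).fit(X)
multi  = KMeans(n_clusters=4, init='random', n_init=10, random_state=0).fit(X)

# inertia_ is exactly the WSS of the returned solution;
# keeping the best of several restarts gives a WSS no worse than a typical single run.
print("WSS with n_init=1 :", single.inertia_)
print("WSS with n_init=10:", multi.inertia_)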

Q2: How to find a suitable \(K\)?

Code
fig_km1.show()

Kmeans as a vector quantization

Elbow Method

  • 🔑 \(\color{red}{\text{WSS}}\) decreases as \(K\) increases:
    • \(K=1\Rightarrow \color{red}{\text{WSS}}=\color{blue}{\text{TSS}}\)
    • \(K=n\Rightarrow \color{red}{\text{WSS}}=0\).
    • Beyond a suitable \(K\) (the “elbow”), \(\color{red}{\text{WSS}}\) decreases only slowly.
Code
wss = []
list_k = list(range(1,11))
for k in list_k:
    km = KMeans(n_clusters = k, n_init=2)
    km = km.fit(df1)
    wss.append(km.inertia_)

data_wss = pd.DataFrame({
    'K' : list_k,
    'WSS': wss
})
fig6 = px.line(data_wss, x="K", y="WSS", markers="WSS", title="WSS vs Number of cluster (K)")
fig6.add_trace(go.Scatter(x=[4,4], y=[0,wss[3]], line=dict(color="red", dash="dash"),name="Optimal K"))
fig6.update_layout(width=600, height=280)
fig6.show()
Code
frames = []
for k in list_k[1:-2]: 
    km = KMeans(n_clusters=k, max_iter=100, n_init=2)
    km = km.fit(df1)
    frames.append(go.Frame(
        data=[go.Scatter(x=df1[:,0], y=df1[:,1], mode='markers', 
              marker=dict(color=km.labels_, size=7), 
              name=f'K: {k}'),
              go.Scatter(
                x=km.cluster_centers_[:,0],
                y=km.cluster_centers_[:,1],
                name="Centroid", mode="markers", 
                marker=dict(color='black', size=10, symbol='star'))],
        name=f'{k}'))

km = KMeans(n_clusters=2, max_iter=100, n_init=2)
km = km.fit(df1)
# Plotting
fig_km2 = go.Figure(
        data=[go.Scatter(x=df1[:,0], y=df1[:,1], mode='markers', 
              marker=dict(color=km.labels_, size=7)),
              go.Scatter(
                x=km.cluster_centers_[:,0],
                y=km.cluster_centers_[:,1],
                name="Centroid", mode="markers", 
                marker=dict(color='black', size=15, symbol='star'))],
        layout=go.Layout(
            title="KMeans iteration",
            xaxis=dict(title="X1", range=[-15, 7]),
            yaxis=dict(title="X2", range=[-15, 17]),
            updatemenus=[{
                "buttons": [
                    {
                        "args": [None, {"frame": {"duration": 1000, "redraw": True}, "fromcurrent": True, "mode": "immediate"}],
                        "label": "Play",
                        "method": "animate"
                    },
                    {
                        "args": [[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate"}],
                        "label": "Stop",
                        "method": "animate"
                    }
                ],
                "type": "buttons",
                "showactive": False,
                "x": 0,
                "y": 1.25,
                "pad": {"r": 10, "t": 50}
            }],
            sliders=[{
                "active": 0,
                "currentvalue": {"prefix": "K: "},
                "pad": {"t": 50},
                "steps": [{"label": f"{i}",
                        "method": "animate",
                        "args": [[f'{i}'], {"frame": {"duration": 1000, "redraw": True}, "mode": "immediate", 
                        "transition": {"duration": 10}}]}
                        for i in list_k[1:-2]]
            }]
        ),
    frames=frames)

# Update layout
fig_km2.update_layout(
    width=450, height=450, 
    title="KMeans Algorithm with different K")

fig_km2.show()

Kmeans as a vector quantization

Silhouette Coefficients

🔑 If \(\color{blue}{\text{x}_i}\) belongs to the \(k\)-th cluster, we define

  • \(a(\color{blue}{i})=\frac{1}{|S_k|-1}\sum_{\text{x}_j\in S_k, j\neq i}\|\text{x}_j-\color{blue}{\text{x}_i}\|\): the average distance between \(\color{blue}{\text{x}_i}\) and the other members of its own cluster.
  • \(b(\color{blue}{i})=\min_{j\neq k}\frac{1}{|S_j|}\sum_{\text{x}\in S_j}\|\text{x}-\color{blue}{\text{x}_i}\|\): the average distance between \(\color{blue}{\text{x}_i}\) and the members of its nearest neighboring cluster.
  • Silhouette value for any data point \(\color{blue}{\text{x}_i}\) is defined by: \[-1\leq s(\color{blue}{i})=\frac{b(\color{blue}{i})-a(\color{blue}{i})}{\max\{a(\color{blue}{i}),b(\color{blue}{i})\}}\leq 1.\]
    • \(s(\color{blue}{i})\approx 1\) indicates that the data point \(\color{blue}{\text{x}_i}\) is well-clustered within its cluster and distant from other groups.
    • \(s(\color{blue}{i})\approx -1\) indicates that the data point \(\color{blue}{\text{x}_i}\) is distant from other members of its cluster and should belong to the nearest group.
  • Silhouette Coefficient for a given \(K\) is \(\tilde{s}(K)=\sum_{i=1}^ns(i)/n.\)
  • Optimal number of clusters: \(K^*=\arg\max_{K_\min\leq k\leq K_\max}\{\tilde{s}(k)\}.\)
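
To make these definitions concrete, here is a sketch that computes \(s(i)\) for one point by hand and compares it with sklearn's silhouette_samples (synthetic blobs; it follows the distance-based definition that sklearn uses):

Code
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=5, random_state=1).fit_predict(X)

i = 0                                                    # pick one data point
dists = np.linalg.norm(X - X[i], axis=1)                 # distances from x_i to all points
own = labels == labels[i]
a_i = dists[own & (np.arange(len(X)) != i)].mean()       # average distance within its own cluster
b_i = min(dists[labels == k].mean() for k in set(labels) if k != labels[i])  # nearest other cluster
s_i = (b_i - a_i) / max(a_i, b_i)
print(s_i, silhouette_samples(X, labels)[i])             # the two values coincide
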
Code
from sklearn.metrics import silhouette_score, silhouette_samples
km = KMeans(n_clusters=4, max_iter=100, n_init=2)
km = km.fit(df1)
clusters = km.labels_
silhouette_avg = silhouette_score(df1, clusters)
sample_silhouette_values = silhouette_samples(df1, clusters)

# Plot silhouette scores
fig, ax1 = plt.subplots(1, 1, figsize=(6,2.5))
y_lower = 10
for k in range(km.n_clusters):
    ith_cluster_silhouette_values = sample_silhouette_values[clusters == k]
    ith_cluster_silhouette_values.sort()
    size_cluster_k = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_k
    ax1.fill_betweenx(np.arange(y_lower, y_upper), 0,
                      ith_cluster_silhouette_values)
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_k, str(k+1))
    y_lower = y_upper + 10
ax1.set_title("Silhouette plot for the various clusters")
ax1.set_xlabel("Silhouette coefficient values")
ax1.set_ylabel("Cluster label")
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

Code
sc_av = []
for k in range(2,10):
  kmeans = KMeans(n_clusters=k, n_init=3)
  clusters = kmeans.fit_predict(df1)
  sc_av.append(silhouette_score(df1, clusters))
sc_df = pd.DataFrame({'k' : range(2, 10),
                      'Silhouette Coefficient' : sc_av})
fig6 = px.line(data_frame=sc_df, x='k', y='Silhouette Coefficient', markers='Silhouette Coefficient')
fig6.add_trace(go.Scatter(x=[4,4],y=[np.min(sc_av), np.max(sc_av)],  line=dict(dash="dash"), name=r"$K^*$"))
fig6.update_layout(width=300, height=180, title = "Silhouette Coefficient vs K")
fig6.show()

Kmeans as a vector quantization

Summary

  • KMeans alternates between two processes:

    • Cluster assignment: \({\cal C}=\{c_k\}\to {\cal S}=\{S_k\}\).
    • Centroid recomputation: \({\cal S}=\{S_k\}\to{\cal C}=\{c_k\}\).
  • The algorithm may get stuck if it starts from a bad initialization. Increasing n_init usually mitigates this problem.

  • A suitable number of clusters \(K\) may be estimated using:

    • Elbow method: find the elbow of the \(\color{red}{\text{WSS}}\) vs \(K\) curve.
    • Silhouette coefficient: find the \(K\) that maximizes this coefficient.

Hierarchical clustering

Agglomerative clustering (Bottom-up)

  • It does not require prior knowledge of \(K\).
  • Let \({\cal D}=\{\text{x}_1,\dots, \text{x}_n\}\) be the data points.

Agglomerative clustering

  • Start: Each point belongs to its own cluster (\(K=n\)).
  • Merging: Pairs of clusters are merged as we move up.
  • End: All points are in one cluster.
  • At each step, \(\color{red}{\text{WSS}}\) is recorded.
  • As we move up, \(K\) decreases; a large jump in \(\color{red}{\text{WSS}}\) indicates a suitable number of clusters \(K\).
  • Q3: How to link clusters together?
  • A3: We need tools known as “linkages” (see the table below).
Linkage Formula
Single Linkage \(d_{\text{SL}}(A, B) = \min \{ \|a - b\|^2 : a \in A, b \in B \}\)
Complete Linkage \(d_{\text{CoL}}(A, B) = \max \{ \|a - b\|^2 : a \in A, b \in B \}\)
Average Linkage \(d_{\text{AL}}(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} \|a - b\|^2\)
Centroid Linkage \(d_{\text{CeL}}(A, B) = \|\bar{a} - \bar{b}\|^2\), where \(\bar{a}\) and \(\bar{b}\) are the centroids of clusters \(A\) and \(B\), respectively.
Ward’s Linkage \(d_{\text{WL}}(A, B) = \sqrt{\frac{|A||B|}{|A| + |B|} \| \bar{a} - \bar{b} \|^2}\)
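
A small sketch evaluating these linkages between two toy clusters (squared Euclidean distances, to match the formulas above):

Code
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])       # cluster A
B = np.array([[4.0, 4.0], [5.0, 5.0]])                   # cluster B
D2 = cdist(A, B, metric="sqeuclidean")                   # all pairwise ||a - b||^2

single   = D2.min()                                      # closest pair of points
complete = D2.max()                                      # farthest pair of points
average  = D2.mean()                                     # mean over all |A||B| pairs
centroid = ((A.mean(axis=0) - B.mean(axis=0)) ** 2).sum()            # squared distance between centroids
ward     = np.sqrt(len(A) * len(B) / (len(A) + len(B)) * centroid)   # Ward's linkage as in the table
print(single, complete, average, centroid, ward)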

Hierarchical clustering

Agglomerative clustering (Bottom-up)

Code
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
import seaborn as sns
fig, axs = plt.subplots(2,4, figsize=(14, 6))
data_linkage = linkage(df1, method='single')
dendrogram(data_linkage, ax=axs[0,0])
axs[0,0].set_title("Single Linkage")
axs[0,0].set_xticks([])
clust = AgglomerativeClustering(n_clusters=2).fit_predict(df1)
sns.scatterplot(x=df1[:,0], y=df1[:,1], c=clust.astype(object), ax=axs[1,0])
axs[1,0].set_title("K = 2")

data_linkage = linkage(df1, method='complete')
dendrogram(data_linkage, ax=axs[0,1])
axs[0,1].set_title("Complete Linkage")
axs[0,1].set_xticks([])
clust = AgglomerativeClustering(n_clusters=3).fit_predict(df1)
sns.scatterplot(x=df1[:,0], y=df1[:,1], c=clust.astype(object), ax=axs[1,1])
axs[1,1].set_title("K = 3")

data_linkage = linkage(df1, method='average')
dendrogram(data_linkage, ax=axs[0,2])
axs[0,2].set_title("Average Linkage")
axs[0,2].set_xticks([])
clust = AgglomerativeClustering(n_clusters=4).fit_predict(df1)
sns.scatterplot(x=df1[:,0], y=df1[:,1], c=clust.astype(object), ax=axs[1,2])
axs[1,2].set_title("K = 4")

data_linkage = linkage(df1, method='ward')
dendrogram(data_linkage, ax=axs[0,3])
axs[0,3].set_title("Ward's Linkage")
axs[0,3].set_xticks([])
clust = AgglomerativeClustering(n_clusters=5).fit_predict(df1)
sns.scatterplot(x=df1[:,0], y=df1[:,1], c=clust.astype(object), ax=axs[1,3])
axs[1,3].set_title("K = 5")
plt.show()

Spectral Clustering

Motivation

  • Not all clusters are Convex.
  • KMeans and Hierarchical clustering algorithms often struggle with non-convex cluster structures.
Code
np.random.seed(0) 
theta = np.linspace(0, 2*np.pi, 100) 
r1 = 1 + 0.2 * np.sin(4 * theta) 
r2 = 2 + 0.2 * np.sin(4 * theta) 
r3 = 0.5 * np.linspace(0.1, 5, 100)
x1 = r1 *np.cos(theta) - 3
y1 = r1 * np.sin(theta) 
x2 = r2 * np.cos(theta) - 3
y2 = r2 * np.sin(theta) 
x3 = r3 * np.cos(theta) + 3.25
y3 = r3 * np.sin(theta)
x4 = r3 * np.cos(theta - np.pi) +  2.75
y4 = r3 * np.sin(theta - np.pi)
data = np.vstack((np.concatenate([x1, x2, x3, x4]), np.concatenate([y1, y2, y3, y4]))).T 
# Create a DataFrame
df = pd.DataFrame(data, columns=['x', 'y']) 
df['cluster'] = ['A'] * 100 + ['B'] * 100 + ['C'] * 100 + ['D'] * 100
# Plot using Plotly 
fig = px.scatter(df, x='x', y='y', title='Data with Non-Convex Clusters') 
fig.update_layout(width=500, height=300)
fig.show()
Code
# Kmeans
km1 = KMeans(n_clusters=4)
km1 = km1.fit(data)
# Hierarchical
Agg_clust = AgglomerativeClustering(n_clusters=4).fit_predict(data)

from plotly.subplots import make_subplots
fig7 = make_subplots(rows=2, cols=1, subplot_titles=("KMeans", "Hierarchical"))
fig7.add_trace(
    go.Scatter(x=df['x'], y=df['y'], mode="markers", 
    marker=dict(color=km1.labels_), showlegend=False),
    row=1, col=1
)

fig7.add_trace(
    go.Scatter(x=df['x'], y=df['y'], mode="markers", 
    marker=dict(color=Agg_clust), showlegend=False),
    row=2, col=1
)

fig7.update_layout(height=450, width=500)
fig7.show()

Spectral Clustering

Motivation

  • Not all clusters are Convex.
  • KMeans and Hierarchical clustering algorithms often struggle with non-convex cluster structures.
Code
fig.show()
Code
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_kernels
def gaussian_similarity(X, sigma):
    # RBF similarity: s(i,j) = exp(-||x_i - x_j||^2 / (2*sigma^2))
    return pairwise_kernels(X, metric='rbf', gamma=1/(2*sigma**2))
sigma = 0.7
affinity_matrix = gaussian_similarity(data, sigma)
# Spectral Clustering
sc = SpectralClustering(n_clusters=4,
        affinity="nearest_neighbors",
        eigen_solver="arpack",
        assign_labels='discretize')
labels = sc.fit_predict(affinity_matrix)

fig8 = go.Figure(go.Scatter(x=df['x'], y=df['y'], mode="markers", 
    marker=dict(color=labels), showlegend=False))
fig8.update_layout(height=300, width=500, title="Spectral Clustering")
fig8.show()
  • Q4: But how and why does it work?

  • A4: It relies on the graph Laplacian from graph theory.

Spectral Clustering

Graph Laplacian

  • Ex: Social media can be represented by a graph.

\[W=\begin{bmatrix} & A & B & \dots & G \\ A & w_{11} & w_{12} & \dots & w_{17} \\ B & w_{21} & w_{22} & \vdots & w_{27} \\ \vdots & \vdots &\vdots &\ddots &\vdots\\ G & w_{71} & w_{72} & \dots & w_{77} \end{bmatrix} \]

  • \(W\) is called the (weighted) adjacency matrix of the graph \({\cal G}=(V, E)\).

  • Degree of the \(i\)-th node: \(d_i=\sum_{j=1}^nw_{ij}\).

  • Degree matrix: \(D=\text{diag}(d_1,\dots,d_n)\).

  • Unnormalized Laplacian: \[L=D-W\]

  • Normalized Laplacian:

    • \(L_{\text{sym}}=D^{-1/2}LD^{-1/2}\)
    • \(L_{\text{rw}}=D^{-1}L=I-D^{-1}W\).
  • These Laplacian matrices capture the connectivity of the graph \(\cal G\).

  • The number of connected components of the graph is related to the eigenvalues/eigenvectors of these matrices.

Spectral Clustering

Graph Laplacian

Number of connected components

Let \(\cal G\) be an undirected graph with non-negative weights. Then the multiplicity \(k\) of the eigenvalue \(0\) of both \(L_{\text{rw}}\) and \(L_{\text{sym}}\) equals the number of connected components \(A_1,...,A_k\) in the graph. For \(L_{\text{rw}}\), the eigenspace of \(0\) is spanned by the indicator vectors \(\mathbb{1}_{A_i}\) of those components. For \(L_{\text{sym}}\), the eigenspace of \(0\) is spanned by the vectors \(D^{-1/2}\mathbb{1}_{A_i}\).

For vertices ordered by connected component, any of these Laplacians is block-diagonal, with one block per component: \[L=\begin{bmatrix} L_1 & 0 & \dots & 0 \\ 0 & L_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & L_k \end{bmatrix}.\]
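
A small sketch that builds these Laplacians for a toy graph with two connected components and checks that the eigenvalue \(0\) has multiplicity \(2\) (an unweighted adjacency matrix is assumed for simplicity):

Code
import numpy as np

# Adjacency matrix of a graph with two components: {0, 1, 2} and {3, 4}
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

d = W.sum(axis=1)                                 # node degrees
D = np.diag(d)
L = D - W                                         # unnormalized Laplacian
L_sym = np.diag(d**-0.5) @ L @ np.diag(d**-0.5)   # normalized (symmetric) Laplacian
L_rw  = np.diag(1/d) @ L                          # normalized (random-walk) Laplacian

print(np.round(np.linalg.eigvalsh(L_sym), 6))     # exactly two zero eigenvalues -> two components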

Spectral Clustering

Normalized Spectral Clustering

Algorithm [Shi and Malik (2000)]

  • Inputs:
    • Similarity matrix \(S\in\mathbb{R}^{n\times n}\), for example: \(s(i,j)=e^{-\|\text{x}_i-\text{x}_j\|^2/(2\sigma^2)}\)
    • Number of clusters \(K\).
  • Compute the weighted adjacency matrix \(W\) of a similarity graph built from \(S\) (e.g., the full graph with \(w_{ij}=s(i,j)\), or a \(k\)-nearest-neighbor graph).
  • Compute the unnormalized Laplacian \(L=D-W\).
  • Compute the first \(K\) generalized eigenvectors \(u_1,\dots, u_K\) of \(Lu=\lambda Du\), corresponding to the \(K\) smallest eigenvalues \(0=\lambda_1\leq \lambda_2\leq \dots\leq \lambda_K\).
  • Let \(U=[u_1|\dots |u_K]\in\mathbb{R}^{n\times K}\) be the matrix with these eigenvectors as columns, and let \(\tilde{x}_j\) denote its \(j\)-th row.
  • Perform KMeans clustering on the rows \(\tilde{x}_1,\dots,\tilde{x}_n\) of \(U\) to obtain clusters \(C_1,\dots,C_K\).
  • Return the final clusters \(A_k\) according to the indices of \(C_k\), i.e., \(A_k=\{j:\tilde{x}_j\in C_k\}\).
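
A compact from-scratch sketch of these steps on the classic two-moons toy data (uses a Gaussian similarity as the weighted adjacency matrix and scipy's generalized eigensolver; a pedagogical approximation, not a substitute for sklearn's SpectralClustering):

Code
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_kernels

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
K, sigma = 2, 0.1

# 1) Similarity / weighted adjacency matrix: w_ij = exp(-||x_i - x_j||^2 / (2*sigma^2))
W = pairwise_kernels(X, metric="rbf", gamma=1/(2*sigma**2))
D = np.diag(W.sum(axis=1))
L = D - W                                    # unnormalized Laplacian

# 2) First K generalized eigenvectors of L u = lambda D u (smallest eigenvalues)
eigvals, eigvecs = eigh(L, D)                # eigenvalues returned in ascending order
U = eigvecs[:, :K]                           # columns u_1, ..., u_K

# 3) KMeans on the rows of U gives the final clusters
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(U)
print(np.bincount(labels))                   # cluster sizes; with a suitable sigma each moon forms one cluster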

Spectral Clustering

Summary

  • In some cases, connections between data points can be represented by a graph.
  • The Laplacian of the graph captures the connectivity within it.
  • The eigenvectors corresponding to the smallest non-zero eigenvalues capture the “smoothest” variations in the graph. These variations reflect the most significant and coherent groups of connected nodes, which correspond to the clusters.
  • The clustering structure (connected components) of the graph is clearly revealed in this eigenvector matrix.
  • Therefore, clustering the rows of this eigenvector matrix recovers the connected components as clusters.

Spectral Clustering

More Examples

Applications

Image segmentation

Code
from skimage import data
image = data.astronaut()
plt.imshow(image[:,:,1])
plt.axis('off')
plt.show()

Code
image_gray = image[:,:,1].reshape(-1,1)
_, ax = plt.subplots(2, 2, figsize=(6, 5))
for k in range(2, 6):
    km_image = KMeans(n_clusters=k)
    km_image_fit = km_image.fit(image_gray)
    image_compressed = np.array([np.mean(image_gray[km_image_fit.labels_==i]) for i in range(k)])
    image_seg = np.zeros_like(km_image_fit.labels_, dtype=float)   # will hold the mean intensity of each cluster
    for i in range(k):
        image_seg[km_image_fit.labels_ == i] = image_compressed[i]

    image_reshaped = image_seg.reshape(image.shape[0], image.shape[1])
    ax[(k-2)//2,(k-2)%2].imshow(image_reshaped, cmap='gray')
    ax[(k-2)//2,(k-2)%2].set_title(f"K = {k}")
    ax[(k-2)//2,(k-2)%2].axis('off')   # hide axes on each subplot
plt.tight_layout()
plt.show()

Applications

Customer Segmentation: Credit Card dataset

Genre Age Annual_Income_(k$) Spending_Score
0 Male 19 15 39
1 Male 21 15 81
2 Female 20 16 6
3 Female 23 16 77
4 Female 31 17 40
  • Q5: What should you do in the preprocessing step?

  • A5: Encode Genre, handle missing and duplicated values, scale the features… (see the sketch below)
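
A minimal preprocessing sketch along those lines (the tiny dataframe below is a stand-in built from the five rows shown above; in practice the full customer CSV would be loaded instead):

Code
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in with the columns shown above
df_cust = pd.DataFrame({
    "Genre": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "Annual_Income_(k$)": [15, 15, 16, 16, 17],
    "Spending_Score": [39, 81, 6, 77, 40],
})

df_cust = df_cust.drop_duplicates()                               # drop duplicated rows
print(df_cust.isna().sum())                                       # count missing values per column
df_cust["Genre"] = (df_cust["Genre"] == "Female").astype(int)     # simple binary encoding of Genre
X_scaled = StandardScaler().fit_transform(df_cust)                # scale so no feature dominates the distances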

Applications

Customer Segmentation: Credit Card dataset

Missing values:
    Genre  Age  Annual_Income_(k$)  Spending_Score
0      0    0                   0               0

Applications

Can you interpret each group?

Applications

How about now?

Summary

  • Clustering is a key technique in the unsupervised learning branch of machine learning.

  • It plays a crucial role in tasks involving the organization and segmentation of data based on their similarities.

  • It can be applied in various fields such as market segmentation, image processing, and anomaly detection.

  • There are numerous clustering algorithms available, each suited to different types of data and purposes. Examples include KMeans, Hierarchical Clustering, DBSCAN, and Spectral Clustering.

  • Interpreting clustering results can be challenging but is an essential step to ensure the validity and usefulness of the clusters identified.

🥳 It’s party time 🥂









📋 View party menu here: Party 6 Menu.

🫠 Download party invitation here: Party 6 Invitation Letter.