KMeans & Hierarchical Clustering


ITM 390 004: Machine Learning

Lecturer: Dr. Sothea HAS

🗺️ Content

  • Introduction & Motivation

  • KMeans Clustering Algorithm

  • Hierarchical Clustering

  • Spectral Clustering

  • Applications

Introduction & Motivation

Introduction & Motivation

Old Faithful dataset (\(272\) rows, \(2\) columns)

Code
import pandas as pd                 # Import pandas package
import seaborn as sns               # Package for beautiful graphs
import plotly.express as px
import matplotlib.pyplot as plt     # Graph management
# path = "https://gist.githubusercontent.com/curran/4b59d1046d9e66f2787780ad51a1cd87/raw/9ec906b78a98cf300947a37b56cfe70d01183200/data.tsv"                       # The data can be found in this link
df0 = pd.read_csv(path0 + "/faithful.csv")   # Import it into Python (path0: local data folder)
df0.head(5)                        # Show the first 5 rows
eruptions waiting
0 3.600 79
1 1.800 54
2 3.333 74
3 2.283 62
4 4.533 85

Code
fig0 = px.scatter(df0, x="waiting", y="eruptions", size_max=50)    # Create scatterplot
fig0.update_layout(
    title="Old Faithful data from Yellowstone National Park, US",
    width=500, height=380)    # Title
fig0.show()
  • Two clusters are visible: shorter waiting times with shorter eruptions, and longer ones.

Introduction & Motivation

Satellite Image Segmentation

Introduction & Motivation

More motivations

Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records.

Biology: classification of plants and animals given their features.

Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds.

Bookshops: book ordering (recommendation).

City-planning: identifying groups of houses according to their type, value and geographical location.

Internet: document classification; clustering weblog data to discover groups of similar access patterns; topic modeling…

Introduction & Motivation

Clustering

Code
df0.head(2)
eruptions waiting
0 3.6 79
1 1.8 54
  • Clustering aims at partitioning a set of points from some metric space in such a way that
    • Points within the same group are close/similar, and
    • Points from different groups are distant.
  • It belongs to the Unsupervised Learning branch of Machine Learning (ML).
  • Objective: group data into clusters based on their similarities (a toy sketch follows).
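As a toy illustration of this objective, within-group distances should be much smaller than between-group distances. A hedged sketch with hand-picked points (not part of the original demo):

Code
import numpy as np

# Two hand-picked groups mimicking the Old Faithful pattern (illustrative only)
short = np.array([[1.8, 54], [2.0, 55], [1.9, 52]])   # short eruptions & waits
long_ = np.array([[4.5, 85], [4.3, 80], [4.6, 84]])   # long eruptions & waits

within  = np.mean([np.linalg.norm(a - b) for i, a in enumerate(short) for b in short[i+1:]])
between = np.mean([np.linalg.norm(a - b) for a in short for b in long_])
print(f"avg within-group distance:  {within:.2f}")    # small
print(f"avg between-group distance: {between:.2f}")   # much larger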

KMeans Clustering Algorithm

KMeans Clustering Algorithm

Algorithm

  • Given
    • The number of clusters \(K\)
    • A set of unlabeled data \({\cal D}=\{\text{x}_1,\dots,\text{x}_n\}\subset\mathbb{R}^d\),
  • The KMeans algorithm consists of two steps that are repeated iteratively until convergence:
  • 1. Cluster Assignment Step: assign each observation \(\text{x}_i\) to its closest centroid \(c_k\) (code-vector).
  • 2. Centroid Update Step: recompute centroids as the mean of all observations within each cluster, i.e., for \(1\leq k\leq K:\) \[c_k=\frac{1}{|G_k|}\sum_{\text{x}_i\in G_k}\text{x}_i.\]
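The demos on this slide also call a helper simulateData that is not defined in these slides. A minimal sketch, assuming it simply draws K Gaussian blobs via sklearn's make_blobs:

Code
import numpy as np
from sklearn.datasets import make_blobs

def simulateData(k=4, n=100, random_state=None):
    # Hypothetical reconstruction of the course helper:
    # simulate n points in R^2 drawn from k Gaussian blobs.
    X, y = make_blobs(n_samples=n, centers=k, n_features=2,
                      cluster_std=1.5, random_state=random_state)
    return X, y                    # data matrix (n, 2) and true blob labels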
Code
import numpy as np
import plotly.graph_objects as go   # Plotly figure objects used below
from sklearn.cluster import KMeans
df1, lab1 = simulateData(k=4, n=100, random_state = 11)
col1 = {'1': "#2f8f47",
        '2': "#f34141",
        '3': "#5b92e8",
        '4': "#e350da"}
km = KMeans(
    n_clusters=4, init='random', max_iter=1, n_init=1, random_state=12)
km = km.fit(df1)
fig_km = go.Figure()
fig_km.add_trace(
    go.Scatter(x=df1[:,0], y=df1[:,1], name="Data", mode="markers", 
    marker=dict(size=7)))
fig_km.add_trace(
    go.Scatter(
        x=km.cluster_centers_[:,0], 
        y=km.cluster_centers_[:,1],
        name="Centroid", mode="markers", 
        marker=dict(color='black', size=15, symbol='star')))
fig_km.update_layout(
    width=500, height=490, 
    xaxis=dict(title="X1", range=[-15, 7]),
    yaxis=dict(title="X2", range=[-15, 17]),
    title="0. Initial random centroids")
fig_km.show()
Code
fig_km.add_trace(
    go.Scatter(x=df1[:,0], y=df1[:,1], showlegend=False, mode="markers", 
        marker=dict(color=[col1[str(j+1)] for j in km.labels_], size=7))) 
fig_km.add_trace(fig_km.data[1])
fig_km.data[3].showlegend = False
fig_km.update_layout(title="1. Cluster assignment step")
fig_km.show()
Code
km = KMeans(
    n_clusters=4, init='random', max_iter=2, n_init=1, random_state=12)
km = km.fit(df1)
fig_km1 = go.Figure()
fig_km1.add_trace(fig_km.data[0])
fig_km1.add_trace(fig_km.data[2])
fig_km1.add_trace(
    go.Scatter(
        x=km.cluster_centers_[:,0],
        y=km.cluster_centers_[:,1],
        name="Centroid", mode="markers", 
        marker=dict(color='black', size=15, symbol='star')))
fig_km1.update_layout(
    width=500, height=490, 
    xaxis=dict(title="X1", range=[-15, 7]),
    yaxis=dict(title="X2", range=[-15, 17]),
    title="2. Centroid update step")
fig_km1.show()
Code
fig_km1.add_trace(
    go.Scatter(x=df1[:,0], y=df1[:,1], showlegend=False, mode='markers', 
    marker=dict(color=[col1[str(j+1)] for j in km.labels_], size=7)))
fig_km1.add_trace(fig_km1.data[2])
fig_km1.data[4].showlegend = False
fig_km1.update_layout(title="1. Cluster assignment step")
fig_km1.show()
Code
km = KMeans(
    n_clusters=4, init='random', max_iter=3, n_init=1, random_state=12)
km = km.fit(df1)
fig_km2 = go.Figure()
fig_km2.add_trace(fig_km.data[0])
fig_km2.add_trace(fig_km1.data[3])
fig_km2.add_trace(
    go.Scatter(
        x=km.cluster_centers_[:,0],
        y=km.cluster_centers_[:,1],
        name="Centroid", mode="markers", 
        marker=dict(color='black', size=15, symbol='star')))
fig_km2.update_layout(
    width=500, height=490, 
    xaxis=dict(title="X1", range=[-15, 7]),
    yaxis=dict(title="X2", range=[-15, 17]),
    title="2. Centroid update step")
fig_km2.show()
Code
fig_km2.add_trace(
    go.Scatter(x=df1[:,0], y=df1[:,1], showlegend=False, mode='markers', 
    marker=dict(color=[col1[str(j+1)] for j in km.labels_], size=7)))
fig_km2.add_trace(fig_km2.data[2])
fig_km2.data[4].showlegend = False
fig_km2.update_layout(title="1. Cluster assignment step")
fig_km2.show()

KMeans Clustering Algorithm

KMeans Algorithm

  • Given data \({\cal D}=\{\text{x}_1,\dots,\text{x}_n\}\subset\mathbb{R}^d\) and \(K\).
  • Initialization: \({\cal C}^{0}=\{c_1,\dots,c_K\}\) (randomly).
  • Cluster Assignment (NNC):

for i = 1,...,n:
    for k = 1,...,K: \[\text{Assign }\text{x}_i\to \color{green}{S_k}\text{ if }\|\text{x}_i-\color{green}{c_k}\|\leq \|\text{x}_i-c_j\|,\ \forall j\neq k.\]

  • Centroid Recomputation (CC): From \({\cal S}=\{S_k\}\) recompute new centroids:

\[c_k=\frac{1}{|S_k|}\sum_{\text{x}_i\in S_k}\text{x}_i.\]

  • Alternately repeat the assignment (NNC) and recomputation (CC) steps until convergence (a minimal sketch follows).
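A minimal NumPy sketch of one NNC/CC iteration (not sklearn's implementation; it assumes data X of shape (n, d) and current centroids C of shape (K, d)):

Code
import numpy as np

def kmeans_iteration(X, C):
    # NNC: distance of every point to every centroid, shape (n, K)
    dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    labels = dists.argmin(axis=1)              # index of the closest centroid
    # CC: new centroid = mean of its assigned points (keep old one if empty)
    C_new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                      else C[k] for k in range(len(C))])
    return labels, C_new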
Code
from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='random', max_iter=7, n_init=1, random_state=12)
km = km.fit(df1)
fig_km = go.Figure(go.Scatter(x=df1[:,0], y=df1[:,1], name="Data", mode="markers", 
                   marker=dict(color=[col1[str(j+1)] for j in km.labels_], size=7)))
fig_km.add_trace(go.Scatter(x=km.cluster_centers_[:,0], y=km.cluster_centers_[:,1],
                name="Centroid", mode="markers", marker=dict(color='black', size=15, symbol='star')))
fig_km.update_layout(width=500, height=480, title="KMeans Algorithm in Action")
fig_km.show()

KMeans Clustering Algorithm

KMeans Algorithm may get stuck

  • Given data \({\cal D}=\{\text{x}_1,\dots,\text{x}_n\}\subset\mathbb{R}^d\) and \(K\).
  • Initialization: \(\color{red}{{\cal C}^{0}=\{c_1,\dots,c_K\}}\) (randomly).
  • Cluster Assignment (NNC):

for i = 1,...,n:
    for k = 1,...,K: \[\text{Assign }\text{x}_i\to \color{green}{S_k}\text{ if }\|\text{x}_i-\color{green}{c_k}\|\leq \|\text{x}_i-c_j\|,\ \forall j\neq k.\]

  • Centroid Recomputation (CC): From \({\cal S}=\{S_k\}\) recompute new centroids:

\[c_k=\frac{1}{|S_k|}\sum_{\text{x}_i\in S_k}\text{x}_i.\]

  • Alternately repeat the assignment (NNC) and recomputation (CC) steps until convergence.
Code
from sklearn.cluster import KMeans
km = KMeans(n_clusters=4, init='random', max_iter=7, n_init=1, random_state=11)
km = km.fit(df1)
fig_km = go.Figure(go.Scatter(x=df1[:,0], y=df1[:,1], name="Data", mode="markers", 
                   marker=dict(color=[col1[str(j+1)] for j in km.labels_], size=7)))
fig_km.add_trace(go.Scatter(x=km.cluster_centers_[:,0], y=km.cluster_centers_[:,1],
                name="Centroid", mode="markers", marker=dict(color='black', size=15, symbol='star')))
fig_km.update_layout(width=500, height=480, title="KMeans Algorithm in Action")
fig_km.show()

KMeans Clustering Algorithm

KMeans Algorithm may get stuck

  • Given data \({\cal D}=\{\text{x}_1,\dots,\text{x}_n\}\subset\mathbb{R}^d\) and \(K\).
  • A solution: the KMeans class from the sklearn.cluster module:
from sklearn.cluster import KMeans
km = KMeans(
    n_clusters=4, 
    n_init=5     # Perform KMeans 5 times
)
km = km.fit(df1) # The best one among the 5

🔑 Perform KMeans n_init times with different random initializations; the run with the lowest WSS is taken as the final result.
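Equivalently, one can restart KMeans by hand and keep the fit with the lowest WSS (exposed by sklearn as the inertia_ attribute). A sketch, reusing df1 from the demos above:

Code
best_km = None
for seed in range(5):                          # 5 random initializations
    km_try = KMeans(n_clusters=4, init='random',
                    n_init=1, random_state=seed).fit(df1)
    if best_km is None or km_try.inertia_ < best_km.inertia_:
        best_km = km_try                       # keep the lowest-WSS fit
print(best_km.inertia_)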

Q2: How to find a suitable \(K\)?

Code
fig_km1.show()

KMeans Clustering Algorithm

Criterion: Within Sum of Squares

  • Within Sum of Squares (WSS) measures the compactness of the clusters, defined as \[\color{red}{\text{WSS}}(K)=\sum_{k=1}^K\color{red}{\sum_{\text{x}_i\in G_k}\|\text{x}_i-c_k\|^2}.\]
  • Within Sum of Squares (WSS) can be used to
    • Evaluate the clustering quality for a given number of clusters \(K\).
    • Find the optimal number of clusters \(K\).
  • For k=1,2,...,K:
    • Within group k: \(\color{red}{\sum_{\text{x}_i\in G_k}\|\text{x}_i-c_k\|^2}.\)
  • Summing over all groups: \(\color{red}{\text{WSS}}(K)=\sum_{k=1}^K\color{red}{\sum_{\text{x}_i\in G_k}\|\text{x}_i-c_k\|^2}.\)

Measures the proximity of points within each group.
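WSS can be computed directly from the fitted labels and centroids, and coincides with sklearn's inertia_ attribute. A sketch, assuming df1 and a fitted km from the demos above:

Code
import numpy as np

wss = sum(np.sum((df1[km.labels_ == k] - c)**2)
          for k, c in enumerate(km.cluster_centers_))
print(wss, km.inertia_)    # the two values match (up to floating-point error)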

KMeans Clustering Algorithm

Elbow Method 💪

  • 🔑 \(\color{red}{\text{WSS}}\) decreases as \(K\) increases:
    • \(K=1\Rightarrow \color{red}{\text{WSS}}=\color{blue}{\text{TSS}}\)
    • \(K=n\Rightarrow \color{red}{\text{WSS}}=0\).
    • Beyond the suitable \(K\), \(\color{red}{\text{WSS}}\) decreases only slowly (the "elbow").
Code
wss = []
list_k = list(range(1,11))
for k in list_k:
    km = KMeans(n_clusters = k, n_init=2)
    km = km.fit(df1)
    wss.append(km.inertia_)

data_wss = pd.DataFrame({
    'K' : list_k,
    'WSS': wss
})
fig6 = px.line(data_wss, x="K", y="WSS", markers=True, title="WSS vs Number of clusters (K)")
fig6.add_trace(go.Scatter(x=[4,4], y=[0,wss[3]], line=dict(color="red", dash="dash"),name="Optimal K"))
fig6.update_layout(width=500, height=280)
fig6.show()
Code
frames = []
for k in list_k[1:-2]: 
    km = KMeans(n_clusters=k, max_iter=100, n_init=2)
    km = km.fit(df1)
    frames.append(go.Frame(
        data=[go.Scatter(x=df1[:,0], y=df1[:,1], mode='markers', 
              marker=dict(color=km.labels_, size=7), 
              name=f'K: {k}'),
              go.Scatter(
                x=km.cluster_centers_[:,0],
                y=km.cluster_centers_[:,1],
                name="Centroid", mode="markers", 
                marker=dict(color='black', size=10, symbol='star'))],
        name=f'{k}'))

km = KMeans(n_clusters=2, max_iter=100, n_init=2)
km = km.fit(df1)
# Plotting
fig_km2 = go.Figure(
        data=[go.Scatter(x=df1[:,0], y=df1[:,1], mode='markers', 
              marker=dict(color=km.labels_, size=7)),
              go.Scatter(
                x=km.cluster_centers_[:,0],
                y=km.cluster_centers_[:,1],
                name="Centroid", mode="markers", 
                marker=dict(color='black', size=15, symbol='star'))],
        layout=go.Layout(
            title="KMeans iteration",
            xaxis=dict(title="X1", range=[-15, 7]),
            yaxis=dict(title="X2", range=[-15, 17]),
            updatemenus=[{
                "buttons": [
                    {
                        "args": [None, {"frame": {"duration": 1000, "redraw": True}, "fromcurrent": True, "mode": "immediate"}],
                        "label": "Play",
                        "method": "animate"
                    },
                    {
                        "args": [[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate"}],
                        "label": "Stop",
                        "method": "animate"
                    }
                ],
                "type": "buttons",
                "showactive": False,
                "x": 0,
                "y": 1.25,
                "pad": {"r": 10, "t": 50}
            }],
            sliders=[{
                "active": 0,
                "currentvalue": {"prefix": "K: "},
                "pad": {"t": 50},
                "steps": [{"label": f"{i}",
                        "method": "animate",
                        "args": [[f'{i}'], {"frame": {"duration": 1000, "redraw": True}, "mode": "immediate", 
                        "transition": {"duration": 10}}]}
                        for i in list_k[1:-2]]
            }]
        ),
    frames=frames)

# Update layout
fig_km2.update_layout(
    width=450, height=450, 
    title="KMeans Algorithm with different K")

fig_km2.show()

KMeans Clustering Algorithm

Silhouette Coefficients

🔑 If \(\color{blue}{\text{x}_i}\) belongs to the \(k\)-th cluster \(S_k\), we define

  • \(a(\color{blue}{i})=\frac{1}{|S_k|-1}\sum_{\text{x}_j\in S_k,\,j\neq i}\|\text{x}_j-\color{blue}{\text{x}_i}\|^2\): the proximity between \(\color{blue}{\text{x}_i}\) and the other members of its own cluster.
  • \(b(\color{blue}{i})=\min_{j\neq k}\frac{1}{|S_j|}\sum_{\text{x}\in S_j}\|\text{x}-\color{blue}{\text{x}_i}\|^2\): proximity between \(\color{blue}{\text{x}_i}\) and other members of the nearest cluster.
  • Silhouette value for any data point \(\color{blue}{\text{x}_i}\) is defined by: \[-1\leq s(\color{blue}{i})=\frac{b(\color{blue}{i})-a(\color{blue}{i})}{\max\{a(\color{blue}{i}),b(\color{blue}{i})\}}\leq 1.\]
    • \(s(\color{blue}{i})\approx 1\) indicates that the data point \(\color{blue}{\text{x}_i}\) is well-clustered within its cluster and distant from other groups.
    • \(s(\color{blue}{i})\approx -1\) indicates that the data point \(\color{blue}{\text{x}_i}\) is distant from other members of its cluster and should belong to the nearest group.
  • Silhouette Coefficient for a given \(K\) is \(\tilde{s}(K)=\sum_{i=1}^ns(i)/n.\)
  • Optimal number of clusters: \(K^*=\arg\max_{K_\min\leq k\leq K_\max}\{\tilde{s}(k)\}\) (a sketch of \(s(i)\) follows).
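To ground the formulas, a sketch computing s(i) for one point exactly as defined above (it uses the squared distances of this slide, whereas sklearn's silhouette_samples uses plain Euclidean distances, so values may differ slightly):

Code
import numpy as np

def silhouette_point(i, X, labels):
    k = labels[i]
    same = X[labels == k]                      # members of x_i's own cluster
    a = np.sum(np.linalg.norm(same - X[i], axis=1)**2) / (len(same) - 1)
    b = min(np.mean(np.linalg.norm(X[labels == j] - X[i], axis=1)**2)
            for j in set(labels) if j != k)    # nearest other cluster
    return (b - a) / max(a, b)

print(silhouette_point(0, df1, km.labels_))    # assumes km fitted on df1 above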
Code
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples
km = KMeans(n_clusters=4, max_iter=100, n_init=2)
km = km.fit(df1)
clusters = km.labels_
silhouette_avg = silhouette_score(df1, clusters)
sample_silhouette_values = silhouette_samples(df1, clusters)

# Plot silhouette scores
fig, ax1 = plt.subplots(1, 1, figsize=(4,2.5))
y_lower = 10
for k in range(km.n_clusters):
    ith_cluster_silhouette_values = sample_silhouette_values[clusters == k]
    ith_cluster_silhouette_values.sort()
    size_cluster_k = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_k
    ax1.fill_betweenx(np.arange(y_lower, y_upper), 0,
                      ith_cluster_silhouette_values)
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_k, str(k+1))
    y_lower = y_upper + 10
ax1.set_title("Silhouette plot for the various clusters")
ax1.set_xlabel("Silhouette coefficient values")
ax1.set_ylabel("Cluster label")
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

Code
sc_av = []
for k in range(2,10):
  kmeans = KMeans(n_clusters=k, n_init=3)
  clusters = kmeans.fit_predict(df1)
  sc_av.append(silhouette_score(df1, clusters))
sc_df = pd.DataFrame({'k' : range(2, 10),
                      'Silhouette Coefficient' : sc_av})
fig6 = px.line(data_frame=sc_df, x='k', y='Silhouette Coefficient', markers=True)
fig6.add_trace(go.Scatter(x=[4,4],y=[np.min(sc_av), np.max(sc_av)],  line=dict(dash="dash"), name=r"$K^*$"))
fig6.update_layout(width=300, height=180, title = "Silhouette Coefficient vs K")
fig6.show()

KMeans Clustering Algorithm

Summary

  • KMeans alternates between two processes:

    • Cluster assignment: \({\cal C}=\{c_k\}\to {\cal S}=\{S_k\}\).
    • Centroid recomputation: \({\cal S}=\{S_k\}\to{\cal C}=\{c_k\}\).
  • The algorithm may get stuck if it starts from a bad initialization. Increasing n_init may solve this problem.

  • A suitable number of cluster \(K\) may be estimated using:

    • Elbow method: Find the elbow of \(\color{red}{\text{WSS}}\) vs \(K\) curve.
    • Silhouette coefficient: find \(K\) that maximizes this coefficient.

Hierarchical clustering

Hierarchical clustering

Agglomerative clustering (Bottom-up)

  • It does not require the prior knowledge of \(K\).
  • Let \(\cal D=\{\text{x}_1,\dots, \text{x}_n\}\) be the data points.

Agglomerative clustering

  • Start: Each point belongs to its own cluster (\(K=n\)).
  • Merging: Pairs of clusters are merged as we move up.
  • End: All points are in one cluster.
  • At each step, \(\color{red}{\text{WSS}}\) is recorded.
  • As we move up, \(K\) decreases; a jump in \(\color{red}{\text{WSS}}\) should indicate the suitable number of clusters \(K\).
  • Q3: How to link clusters together?
  • A3: We need tools AKA “Linkages”.
Linkage Formula
Single Linkage \(d_{\text{SL}}(A, B) = \min \{ \|a - b\|^2 : a \in A, b \in B \}\)
Complete Linkage \(d_{\text{CoL}}(A, B) = \max \{ \|a - b\|^2 : a \in A, b \in B \}\)
Average Linkage \(d_{\text{AL}}(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} \|a - b\|^2\)
Centroid Linkage \(d_{\text{CeL}}(A, B) = \|\bar{a} - \bar{b}\|^2\), where \(\bar{a}\) and \(\bar{b}\) are the centroids of clusters \(A\) and \(B\), respectively.
Ward’s Linkage \(d_{\text{WL}}(A, B) = \sqrt{\frac{|A||B|}{|A| + |B|} \| \bar{a} - \bar{b} \|^2}\)
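A quick sketch evaluating some of these linkages on two tiny clusters (plain Euclidean distances are used for readability; scipy's linkage function applies the same ideas internally):

Code
import numpy as np

A = np.array([[0., 0.], [1., 0.]])             # cluster A
B = np.array([[4., 0.], [6., 0.]])             # cluster B
pair = np.array([[np.linalg.norm(a - b) for b in B] for a in A])

print("single  :", pair.min())                 # closest pair
print("complete:", pair.max())                 # farthest pair
print("average :", pair.mean())                # mean over all pairs
print("centroid:", np.linalg.norm(A.mean(0) - B.mean(0)))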

Hierarchical clustering

Agglomerative clustering (Bottom-up)

Code
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
import seaborn as sns
fig, axs = plt.subplots(2,4, figsize=(14, 6))
data_linkage = linkage(df1, method='single')
dendrogram(data_linkage, ax=axs[0,0])
axs[0,0].set_title("Single Linkage")
axs[0,0].set_xticks([])
clust = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(df1)
sns.scatterplot(x=df1[:,0], y=df1[:,1], c=[i+1 for i in clust], ax=axs[1,0])
axs[1,0].set_title("K = 2")

data_linkage = linkage(df1, method='complete')
dendrogram(data_linkage, ax=axs[0,1])
axs[0,1].set_title("Complete Linkage")
axs[0,1].set_xticks([])
clust = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(df1)
sns.scatterplot(x=df1[:,0], y=df1[:,1], c=[i+1 for i in clust], ax=axs[1,1])
axs[1,1].set_title("K = 3")

data_linkage = linkage(df1, method='average')
dendrogram(data_linkage, ax=axs[0,2])
axs[0,2].set_title("Average Linkage")
axs[0,2].set_xticks([])
clust = AgglomerativeClustering(n_clusters=4, linkage='average').fit_predict(df1)
sns.scatterplot(x=df1[:,0], y=df1[:,1], c=[i+1 for i in clust], ax=axs[1,2])
axs[1,2].set_title("K = 4")

data_linkage = linkage(df1, method='ward')
dendrogram(data_linkage, ax=axs[0,3])
axs[0,3].set_title("Ward's Linkage")
axs[0,3].set_xticks([])
clust = AgglomerativeClustering(n_clusters=5, linkage='ward').fit_predict(df1)
sns.scatterplot(x=df1[:,0], y=df1[:,1], c=[i+1 for i in clust], ax=axs[1,3])
axs[1,3].set_title("K = 5")
plt.show()
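The "jump" heuristic from the previous slide can be read off the linkage matrix: its third column stores the merge distance at each step, and a large gap between consecutive merges suggests where to cut. A sketch, assuming df1 from above:

Code
heights = linkage(df1, method='ward')[:, 2]    # merge distance at each step
jumps = np.diff(heights)                       # increase between merges
k_star = len(heights) - np.argmax(jumps)       # clusters left before the jump
print("Suggested K:", k_star)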

Applications

Applications

Image segmentation

  • Perform image segmentation on one channel of the following image.
Code
from skimage import data
image = data.astronaut()
plt.imshow(image[:,:,1])
plt.axis('off')
plt.show()

Code
sns.set(style="white")
image_gray = image[:,:,1].reshape(-1,1)
_, ax = plt.subplots(2, 2, figsize=(7, 5.25))
for k in range(2, 6):
    km_image = KMeans(n_clusters=k)
    km_image_fit = km_image.fit(image_gray)
    image_compressed = np.array([np.mean(image_gray[km_image_fit.labels_==i]) for i in range(k)])
    image_seg = np.zeros_like(km_image_fit.labels_, dtype=float)  # float to hold channel means
    for i in range(k):
        image_seg[km_image_fit.labels_ == i] = image_compressed[i]

    image_reshaped = image_seg.reshape(image.shape[0], image.shape[1])
    ax[(k-2)//2,(k-2)%2].imshow(image_reshaped, cmap='gray')
    ax[(k-2)//2,(k-2)%2].set_title(f"K = {k}")
    ax[(k-2)//2,(k-2)%2].axis('off')
plt.tight_layout()
plt.show()

Applications

Customer Segmentation: Credit Card dataset

Genre Age Annual_Income_(k$) Spending_Score
0 Male 19 15 39
1 Male 21 15 81
2 Female 20 16 6
3 Female 23 16 77
4 Female 31 17 40
  • Q4: What should you do in the preprocessing step?

  • A4: Encode Genre; handle missing and duplicated values; apply scaling… (a sketch follows)
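A minimal preprocessing sketch along those lines (assuming the dataframe is named df and has the column names shown above):

Code
from sklearn.preprocessing import StandardScaler

df = df.drop_duplicates()                      # remove duplicated rows
print(df.isna().sum())                         # check for missing values
df['Genre'] = df['Genre'].map({'Male': 0, 'Female': 1})   # encode Genre
X = StandardScaler().fit_transform(df)         # scale all features for KMeans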

Applications

Customer Segmentation: Credit Card dataset

Missing values:
    Genre  Age  Annual_Income_(k$)  Spending_Score
0      0    0                   0               0

Applications

Can you interpret each group?

Applications

How about now?

Summary

  • Clustering is a key technique in the unsupervised learning branch of machine learning.

  • It plays a crucial role in tasks involving the organization and segmentation of data based on their similarities.

  • It can be applied in various fields such as market segmentation, image processing, and anomaly detection.

  • There are numerous clustering algorithms available, each suited to different types of data and purposes. Examples include KMeans, Hierarchical Clustering, DBSCAN, and Spectral Clustering.

  • Interpreting clustering results can be challenging but is an essential step to ensure the validity and usefulness of the clusters identified.

🥳 It’s party time 🥂