Gaussian Mixture Models (GMM) and EM Algorithm


Exploratory Data Analysis & Unsupervised Learning

     

Lecturer: Dr. HAS Sothea

Outline


0. Motivation


I. Mixture Models


II. Gaussian Mixture Models


III. EM Algorithm

0. Motivation

0. Motivation

Old Faithful dataset (\(272\) rows, \(2\) columns)

Code
import pandas as pd                 # Import pandas package
import seaborn as sns               # Package for beautiful graphs
import plotly.express as px
import matplotlib.pyplot as plt     # Graph management
# path = "https://gist.githubusercontent.com/curran/4b59d1046d9e66f2787780ad51a1cd87/raw/9ec906b78a98cf300947a37b56cfe70d01183200/data.tsv"                       # The data can be found in this link
df0 = pd.read_csv(path0 + "/faithful.csv")  # Import it into Python (path0: directory containing faithful.csv)
df0.head(5)                        # Show the first 5 rows
   eruptions  waiting
0      3.600       79
1      1.800       54
2      3.333       74
3      2.283       62
4      4.533       85
Code
fig = px.scatter(df0, x='waiting', y='eruptions', size='eruptions')
fig.update_layout(width=440, height=400, title='Eruptions vs waiting time')
fig.show()

0. Motivation

Old Faithful dataset (\(272\) rows, \(2\) columns)

KMeans problems

  • KMeans tends to form convex/spherical clusters rather than wide elliptical ones.
  • It is ideal for linearly separable clusters, which is rarely the case in practice.
Code
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
km = KMeans(n_clusters=2)
scaler = StandardScaler()
X = scaler.fit_transform(df0)
km.fit(X)
df0['Group'] = [str(x+1) for x in km.labels_]
fig = px.scatter(df0, x='waiting', y='eruptions', size='eruptions', color='Group')
fig.update_layout(width=500, height=400, title='Eruptions vs waiting time')
fig.show()

0. Motivation

Old Faithful dataset (\(272\) rows, \(2\) columns)

Clustering types

  • Hard Assignment: each object must belong to exactly one cluster.
  • Soft Assignment: each object has a probability of belonging to every cluster.
  • The latter falls into the probabilistic approach to clustering (a small sketch below contrasts the two).
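As a quick illustration of the difference, here is a minimal sketch; it assumes X and km from the previous slide are still in memory, and previews sklearn's GaussianMixture only to show what a soft assignment looks like.

Code
# Hard assignment: one label per point; soft assignment: a probability per cluster.
from sklearn.mixture import GaussianMixture

hard_labels = km.predict(X)                             # e.g. array([0, 1, 0, ...])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)                       # shape (n, 2), each row sums to 1
print(hard_labels[:3])                                  # exactly one cluster per point
print(soft_probs[:3].round(3))                          # a probability for every cluster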
Code
fig = px.scatter(df0, x='waiting', y='eruptions',
    size='eruptions', marginal_x='histogram', marginal_y='histogram', color='Group')
fig.update_layout(width=400, height=400, title='Eruptions vs waiting time')
fig.show()

I. Mixture Models

1. Mixture Models

1.1. Mixture Density

  • Mixture Models: a statistical tool that represents data as a combination of several simple components.
  • Mixture model form: Assume \(K\geq 1\) clusters, \[\forall \text{x}\in\mathcal{D}\subset\mathbb{R}^d: f(\text{x})=\sum_{k=1}^K\color{blue}{\pi_k}f_k(\text{x}; \color{blue}{\theta_k}),\] where,
    • \(\color{blue}{\vec{\pi}}=(\color{blue}{\pi_1},\dots,\color{blue}{\pi_K}):\) proportion of each component, \(\sum_{k=1}^K\color{blue}{\pi_k}=1\).
    • \(\color{blue}{\vec{\theta}}=(\color{blue}{\theta_1},\dots,\color{blue}{\theta_K}):\) parameters of each component.
    • \(f_k(\text{x};\color{blue}{\theta_k}):\) density of component \(k\) evaluated at any data point \(\text{x}\).
Code
fig.update_layout(height=450)
fig.show()
Code
import numpy as np
import plotly.graph_objects as go

cluster1 = df0[df0['eruptions'] < 3]
cluster2 = df0[df0['eruptions'] >= 3]
cols = ['eruptions', 'waiting']
# 3. Calculate Parameters Manually
# We need Mean vector and Covariance matrix for each cluster
mu1, cov1 = cluster1[cols].mean().values, cluster1[cols].cov().values
mu2, cov2 = cluster2[cols].mean().values, cluster2[cols].cov().values

# 4. Define Manual Gaussian Function
def manual_gaussian_pdf(x_grid, y_grid, mu, cov):
    # Unpack grid to (N, 2) points
    points = np.column_stack([x_grid.flatten(), y_grid.flatten()])
    
    det = np.linalg.det(cov)
    inv_cov = np.linalg.inv(cov)
    norm_const = 1.0 / (2 * np.pi * np.sqrt(det))   # normalizing constant of a 2D Gaussian (d = 2)
    
    diff = points - mu
    exponent = -0.5 * np.sum((diff @ inv_cov) * diff, axis=1)
    z = norm_const * np.exp(exponent)
    return z.reshape(x_grid.shape)

# 5. Create Grid for plotting the surface
x_range = np.linspace(df0['eruptions'].min()*0.85, df0['eruptions'].max()*1.1, 100)
y_range = np.linspace(df0['waiting'].min()*0.85, df0['waiting'].max()*1.1, 100)
xx, yy = np.meshgrid(x_range, y_range)

# 6. Calculate Z (Height)
z1 = manual_gaussian_pdf(xx, yy, mu1, cov1)
z2 = manual_gaussian_pdf(xx, yy, mu2, cov2)
z_total = z1 * cluster1.shape[0]/df0.shape[0] + z2 * cluster2.shape[0]/df0.shape[0]

fig = go.Figure()
fig.add_trace(go.Surface(
    z=z_total, x=xx, y=yy,
    opacity=0.6,               # Make it slightly transparent
    colorscale='Viridis',
    showscale=False,
    name='Density'
))

# Add the Data Points (The "Scatter")
# We set z=0 to place them on the "floor"
fig.add_trace(go.Scatter3d(
    x=df0['eruptions'].values,
    y=df0['waiting'].values,
    z=np.zeros(len(df0)),
    mode='markers',
    marker=dict(
        size=5, 
        color=df0['Group'].map({'1': 'red', '2': 'blue'})),
    name='Observations'
))

# Update Layout for better viewing
fig.update_layout(
    title='Density of eruptions and waiting time',
    scene=dict(
        xaxis_title='Eruption Time',
        yaxis_title='Waiting Time',
        zaxis_title='Density',
        camera = dict(
            eye=dict(x=0.2, y=2, z=0.65)
        )
    ),
    width=400,
    height=450
)
fig.show()

1. Mixture Models

1.2. Observed Likelihood/Log-Likelihood

  • Maximum Likelihood Estimation: the classical method to estimate \(f\), i.e., \([\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}}]\).

1. Observed Likelihood/Log-likelihood

  • The Observed Likelihood for \([\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}}]\) on the observations \(\text{x}_1,\dots,\text{x}_n\) is given by: \[L_{\text{obs}}(\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}})=p(X|\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}})=\prod_{i=1}^nf(\text{x}_i)=\prod_{i=1}^n\sum_{k=1}^K\color{blue}{\pi_k}f_k(\text{x}_i; \color{blue}{\theta_k}).\]
  • The Observed log-likelihood is given by: \[\mathcal{L}_{\text{obs}}(\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}})=\ln[p(X|\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}})]=\sum_{i=1}^n\ln\left[\sum_{k=1}^K\color{blue}{\pi_k}f_k(\text{x}_i; \color{blue}{\theta_k})\right]\ \ \color{red}{(1)}.\]
  • \(\color{red}{(1)}\) is intractable in general!
Code
fig.show()
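To make \(\color{red}{(1)}\) concrete, here is a minimal sketch evaluating \(\mathcal{L}_{\text{obs}}\) for the two-component split used above; it assumes df0, cluster1, cluster2, mu1, cov1, mu2, cov2 from the earlier code block, and uses logsumexp for numerical stability.

Code
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

X_ll = df0[['eruptions', 'waiting']].values
pis = np.array([len(cluster1), len(cluster2)]) / len(df0)     # mixing proportions pi_k

# ln(pi_k) + ln N(x_i; mu_k, Sigma_k) for every point i and component k
log_terms = np.column_stack([
    np.log(pis[0]) + multivariate_normal(mu1, cov1).logpdf(X_ll),
    np.log(pis[1]) + multivariate_normal(mu2, cov2).logpdf(X_ll),
])
obs_loglik = logsumexp(log_terms, axis=1).sum()               # sum_i ln sum_k pi_k f_k(x_i)
print(obs_loglik)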

1. Mixture Models

1.3. Latent Variables/Complete Log-Likelihood

  • Latent Variable (unobserved): \(\forall\text{x}_i\in\mathcal{D}\subset\mathbb{R}^d\), let \(\color{red}{\text{z}_i}=(\color{red}{z_{i1}},\dots,\color{red}{z_{iK}})\) such that \(\color{red}{\text{z}}_i|\text{x}_i\sim \mathcal{M}(\color{red}{\vec{\gamma_i}})\) i.e. \(\color{red}{z_{ik}}\in\{0,1\}\) & \(\sum_{k=1}^K\color{red}{z_{ik}}=1, \forall i=1,\dots,n\).
  • Key idea: \(\color{red}{z_{ik}}=1\Leftrightarrow \text{x}_i\) belongs to cluster \(k\).
  • It’s not observed but crucial for optimizing \(\color{red}{(1)}\).

2. Complete Likelihood/Log-likelihood

  • Complete Likelihood/Log-Likelihood for \([\color{blue}{\Theta}, \color{red}{\Gamma}]=[(\color{blue}{\vec{\pi}}, \color{blue}{\vec{\theta}}),(\color{red}{\vec{\gamma_i}})_{i=1}^{n}]\) on the observations \((\text{x}_i)\) and latents \((\color{red}{\text{z}_i})\) is given by: \[\begin{align*}L_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})&=p(X,\color{red}{Z}|\color{blue}{\Theta}, \color{red}{\Gamma})=\prod_{i=1}^n\prod_{k=1}^K\left[\color{blue}{\pi_k}f_k(\text{x}_i; \color{blue}{\theta_k})\right]^{\color{red}{z_{ik}}}.\\ \mathcal{L}_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})&=\ln[p(X,\color{red}{Z}|\color{blue}{\Theta},\color{red}{\Gamma})]=\sum_{i=1}^n\sum_{k=1}^K\color{red}{z_{ik}}\left[\ln(\color{blue}{\pi_k})+\ln\left(f_k(\text{x}_i; \color{blue}{\theta_k})\right)\right]\ \ \color{red}{(2)}.\end{align*}\]

1. Mixture Models

1.3. Latent Variables/Complete Log-Likelihood

2. Complete Likelihood/Log-likelihood

  • Complete Likelihood/Log-Likelihood for \([\color{blue}{\Theta}, \color{red}{\Gamma}]=[(\color{blue}{\vec{\pi}}, \color{blue}{\vec{\theta}}),(\color{red}{\vec{\gamma_i}})_{i=1}^{n}]\) on the observations \((\text{x}_i)\) and latents \((\color{red}{\text{z}_i})\) is given by: \[\begin{align*}L_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})&=p(X,\color{red}{Z}|\color{blue}{\Theta}, \color{red}{\Gamma})=\prod_{i=1}^n\prod_{k=1}^K\left[\color{blue}{\pi_k}f_k(\text{x}_i; \color{blue}{\theta_k})\right]^{\color{red}{z_{ik}}}.\\ \mathcal{L}_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})&=\ln[p(X,\color{red}{Z}|\color{blue}{\Theta},\color{red}{\Gamma})]=\sum_{i=1}^n\sum_{k=1}^K\color{red}{z_{ik}}\left[\ln(\color{blue}{\pi_k})+\ln\left(f_k(\text{x}_i; \color{blue}{\theta_k})\right)\right]\ \ \color{red}{(2)}.\end{align*}\]
  • Q1: What do we want?
  • A1: Maximize \(\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})=\ln[p(X|\color{blue}{\Theta})]\) for \(\color{blue}{\Theta}\).
  • The Observed and Complete Likelihoods are related by: \[L_{\text{obs}}(X|\color{blue}{\Theta})=\mathbb{E}_{\color{red}{Z}}[L_{\text{com}}(X,\color{red}{Z}|\color{blue}{\Theta}, \color{red}{\Gamma})].\]

1. Mixture Models

1.4. Evidence Lower Bound (ELBO)

  • The Observed and Complete Likelihoods are related by: \[L_{\text{obs}}(X|\color{blue}{\Theta})=\mathbb{E}_{\color{red}{Z}}[L_{\text{com}}(X,\color{red}{Z}|\color{blue}{\Theta}, \color{red}{\Gamma})].\]
  • The observed log-likelihood (OLL) \(\mathcal{L}_{\text{obs}}\), also known as the Evidence, is related to the complete log-likelihood (CLL) \(\mathcal{L}_{\text{com}}\) by: \[\begin{align*} \mathcal{L}_{\text{obs}}(\color{blue}{\Theta})&=\mathcal{L}_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})-\ln\left[p(\color{red}{Z}|X,\color{red}{\Gamma})\right].\end{align*}\]
  • For any distribution \(q(\color{red}{Z})\), by taking expectation w.r.t \(\color{red}{Z}\) on both sides, \[\begin{align*} \mathcal{L}_{\text{obs}}(\color{blue}{\Theta})&=\underbrace{\mathbb{E}_{q(\color{red}{Z})}[\mathcal{L}_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})-\ln[q(\color{red}{Z})]]}_{\color{blue}{\text{ELBO: }}\text{denoted by }\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma})}+\underbrace{\mathbb{E}_{q(\color{red}{Z})}\left[\ln\left[\frac{q(\color{red}{Z})}{p(\color{red}{Z}|X,\color{red}{\Gamma})} \right]\right]}_{\color{red}{\text{KL}}[q(\color{red}{Z})||p(\color{red}{Z}|X,\color{red}{\Gamma})]\geq 0}\ \ \color{red}{(3)}.\end{align*}\]
  • Key breakthrough: In \(\color{red}{(3)}\), if we take \(q(\color{red}{Z})=p(\color{red}{Z}|X,\color{red}{\Gamma})\), the KL term vanishes, so maximizing the ELBO \(\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma})\) w.r.t. \(\color{blue}{\Theta}\) also maximizes \(\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})\).
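  • Where \(\color{red}{(3)}\) comes from: adding and subtracting \(\ln q(\color{red}{Z})\) in the relation above gives \[\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})=\left[\mathcal{L}_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})-\ln q(\color{red}{Z})\right]+\ln\left[\frac{q(\color{red}{Z})}{p(\color{red}{Z}|X,\color{red}{\Gamma})}\right],\] and since the left-hand side does not depend on \(\color{red}{Z}\), taking \(\mathbb{E}_{q(\color{red}{Z})}\) on both sides yields \(\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})=\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma})+\color{red}{\text{KL}}[q(\color{red}{Z})||p(\color{red}{Z}|X,\color{red}{\Gamma})]\), which is exactly \(\color{red}{(3)}\).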

1. Mixture Models

1.4. Evidence Lower Bound (ELBO): Summary

  • Given data \(D=\{\text{x}_1,\dots,\text{x}_n\}\subset\mathbb{R}^d\) and number of components \(K\geq 1\), we model the density of the data by \[f(\text{x})=\sum_{k=1}^K\color{blue}{\pi_k}f_k(\text{x}; \color{blue}{\theta_k}).\]
  • Latent variable \(\color{red}{\text{z}_i}=(\color{red}{z_{i1}},\dots,\color{red}{z_{iK}}):\) unobserved group membership of \(\text{x}_i\).
  • Given \(\text{x}_i\), it may belong to component \(k\) with probability \(\color{red}{\gamma_{ik}}\), called the ‘responsibility’. One has \[\color{red}{\Gamma}=\begin{pmatrix}\color{red}{\gamma_{11}}&\dots&\color{red}{\gamma_{1K}}\\ \vdots&\ddots&\vdots\\ \color{red}{\gamma_{n1}}&\dots&\color{red}{\gamma_{nK}} \end{pmatrix}\in [0,1]^{n\times K}\ (\text{clustering signature}).\]
  • The best \(\color{blue}{\Theta}=(\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}})\) is estimated by iteratively maximizing ELBO: \(\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma})\).

II. Gaussian Mixture Model and EM Algorithm

1. Gaussian Mixture Models (GMM)

1.1. GMM Definition

  • Each component \(f_k\) of a Mixture Model can be modeled using any distribution.
  • The most common choice is \(f_k(\text{x})=\mathcal{N}(\text{x};\color{blue}{\vec{\mu}_k},\color{blue}{\Sigma_k})\) where
    • \(\color{blue}{\vec{\mu}_k}\in\mathbb{R}^d\) is the mean vector
    • \(\color{blue}{\Sigma_k}\in\mathbb{R}^{d\times d}\) is the covariance matrix of component \(k\) of the model.
Code
import numpy as np
import plotly.graph_objs as go
mu1, Sigma1 = [0, 0], [[1, 0], [0, 3]]

mu2, Sigma2  = [5, 5], [[2, -1], [-1, 2]]

mu3, Sigma3 = [-1, 6], [[2, 1.2], [1.2, 1]]

mu4, Sigma4 = [6, 0], [[3, 0.1], [0.1, 0.25]]
x1 = np.random.multivariate_normal(mean=mu1, cov=Sigma1, size=100)
x2 = np.random.multivariate_normal(mean=mu2, cov=Sigma2, size=100)
x3 = np.random.multivariate_normal(mean=mu3, cov=Sigma3, size=100)
x4 = np.random.multivariate_normal(mean=mu4, cov=Sigma4, size=100)
fig = go.Figure(
    go.Scatter(x=x1[:,0], y = x1[:,1],
               mode = "markers",
               name = r"$\color{blue}{\mu_1=\begin{pmatrix}0\\ 0\end{pmatrix}, \Sigma_1=\begin{pmatrix}1&0\\0&3\end{pmatrix}}$",
               showlegend = True,
               marker = dict(size = 10)))
fig.add_trace(
    go.Scatter(x=x2[:,0], y = x2[:,1],
               mode = "markers",
               name = r"$\color{red}{\mu_1=\begin{pmatrix}5\\ 5\end{pmatrix},\Sigma_2=\begin{pmatrix}2&-1\\-1&2\end{pmatrix}}$",
               showlegend = True,
               marker = dict(size = 10)))
fig.add_trace(
    go.Scatter(x=x3[:,0], y = x3[:,1],
               mode = "markers",
               name = r"$\color{green}{\mu_1=\begin{pmatrix}-1\\ 6\end{pmatrix}, \Sigma_3=\begin{pmatrix}2&1.2\\1.2&1\end{pmatrix}}$",
               showlegend = True,
               marker = dict(size = 10)))
fig.add_trace(
    go.Scatter(x=x4[:,0], y = x4[:,1],
               mode = "markers",
               name = r"$\color{purple}{\mu_1=\begin{pmatrix}6\\ 0\end{pmatrix},\Sigma_4=\begin{pmatrix}3&0.1\\0.1&0.25\end{pmatrix}}$",
               showlegend = True,
               marker = dict(size = 10)))

fig.update_layout(title = "2D Gaussian Distribution",
                  width = 500,
                  height = 280)
fig.show()
Code
import numpy as np
import plotly.graph_objects as go
from scipy.stats import multivariate_normal

# 3. Create a Grid for Density Calculation
# We calculate the bounds to ensure our grid covers all data
all_x = np.concatenate([x1[:,0], x2[:,0], x3[:,0], x4[:,0]])
all_y = np.concatenate([x1[:,1], x2[:,1], x3[:,1], x4[:,1]])
x_range = np.linspace(all_x.min()-1, all_x.max()+1, 100)
y_range = np.linspace(all_y.min()-1, all_y.max()+1, 100)
x_grid, y_grid = np.meshgrid(x_range, y_range)
pos = np.dstack((x_grid, y_grid))

# 4. Calculate PDFs (Z-axis height)
z1 = multivariate_normal(mu1, Sigma1).pdf(pos)
z2 = multivariate_normal(mu2, Sigma2).pdf(pos)
z3 = multivariate_normal(mu3, Sigma3).pdf(pos)
z4 = multivariate_normal(mu4, Sigma4).pdf(pos)

# 5. Build the 3D Figure
fig = go.Figure()

# --- Add Scatter Points on the floor (z=0) ---
# We set z=0 for all points to place them on the 'floor'
for data, color, name in [(x1, 'blue', 'Cluster 1'), 
                          (x2, 'red', 'Cluster 2'), 
                          (x3, 'green', 'Cluster 3'), 
                          (x4, 'purple', 'Cluster 4')]:
    fig.add_trace(go.Scatter3d(
        x=data[:,0], y=data[:,1], z=np.zeros(len(data)),
        mode='markers',
        marker=dict(size=6, color=color, opacity=0.8),
        name=name + ' (Data)'
    ))

# --- Add Density Surfaces ---
# We use separate surfaces for each cluster with matching colors
# 'showscale=False' hides the colorbar to keep it clean
# colorscale=[[0, color], [1, color]] forces a solid color surface
fig.add_trace(go.Surface(x=x_grid, y=y_grid, z=z1, opacity=0.3, showscale=False, 
                         colorscale=[[0, 'blue'], [1, 'blue']], name='Density 1'))
fig.add_trace(go.Surface(x=x_grid, y=y_grid, z=z2, opacity=0.3, showscale=False, 
                         colorscale=[[0, 'red'], [1, 'red']], name='Density 2'))
fig.add_trace(go.Surface(x=x_grid, y=y_grid, z=z3, opacity=0.3, showscale=False, 
                         colorscale=[[0, 'green'], [1, 'green']], name='Density 3'))
fig.add_trace(go.Surface(x=x_grid, y=y_grid, z=z4, opacity=0.3, showscale=False, 
                         colorscale=[[0, 'purple'], [1, 'purple']], name='Density 4'))

# 6. Layout adjustments
fig.update_layout(
    title="3D Gaussian Mixture Densities",
    width=500, height=280,
    scene=dict(
        xaxis_title='X',
        yaxis_title='Y',
        zaxis_title='Probability Density',
        aspectmode='cube' # Keeps x, y, z scales proportional
    ),
    margin=dict(l=0, r=0, b=0, t=50) # Tight layout
)

fig.show()

1. Gaussian Mixture Models (GMM)

  • GMM form: \(f(\text{x})=\sum_{k=1}^K\frac{\color{blue}{\pi_k}}{(2\pi)^{d/2}|\color{blue}{\Sigma_k}|^{1/2}}\exp\left[-\frac{1}{2}(\text{x}-\color{blue}{\vec{\mu_k}})^T\color{blue}{\Sigma_k}^{-1}(\text{x}-\color{blue}{\vec{\mu_k}})\right]\).
  • The following notation is used:
    • Model parameters: \(\color{blue}{\Theta}=(\color{blue}{\vec{\pi}},(\color{blue}{\vec{\mu_k}})_k,(\color{blue}{\Sigma}_k)_k)\): all model parameters.
    • Responsibility: \(\color{red}{\Gamma}=(\color{red}{\gamma_{ik}})_{i,k}\): the chance of observation \(i\) being in class \(k\).
  • The true posterior can be computed using Bayes’ rule: \[p(\color{red}{z_{ik}}=1|\text{x}_i, \color{red}{\vec{\gamma}_{i}})=\frac{p(\text{x}_i,\color{red}{z_{ik}}=1|\color{red}{\vec{\gamma}_{i}},\color{blue}{\Theta})}{p(\text{x}_i|\color{blue}{\Theta})}=\frac{\color{blue}{\pi_k}\mathcal{N}(\text{x}_i; \color{blue}{\vec{\mu_k}}, \color{blue}{\Sigma_k})}{\sum_{j=1}^K\color{blue}{\pi_j}\mathcal{N}(\text{x}_i; \color{blue}{\mu_j}, \color{blue}{\Sigma_j})}\]
  • ELBO: \[\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \color{red}{\gamma_{ik}} \left[ \ln(\color{blue}{\pi_k}) + \ln[\mathcal{N}(\text{x}_i;\color{blue}{\vec{\mu_k}}, \color{blue}{\Sigma_k})] \right]-\mathbb{E}_{q(\color{red}{Z})}[\ln q(\color{red}{Z})].\]
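  • As an illustration, a minimal sketch of this Bayes-rule computation for one point with \(K=2\) (the parameter values below are made-up assumptions, not fitted values):
Code
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative (assumed) parameters Theta for K = 2 components in d = 2
pis = np.array([0.35, 0.65])
mus = [np.array([2.0, 54.0]), np.array([4.3, 80.0])]
covs = [np.array([[0.1, 0.5], [0.5, 34.0]]),
        np.array([[0.2, 0.9], [0.9, 36.0]])]

x_i = np.array([3.6, 79.0])                   # one observation (eruptions, waiting)
num = np.array([pis[k] * multivariate_normal(mus[k], covs[k]).pdf(x_i)
                for k in range(2)])           # pi_k * N(x_i; mu_k, Sigma_k)
gamma_i = num / num.sum()                     # responsibilities: normalize over components
print(gamma_i)                                # soft membership of x_i, sums to 1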

2. EM Algorithm for GMM

  • 💡 Key: For any choice of \(\color{red}{\Gamma}\), one has: \(\max_{\color{blue}{\Theta}}\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma})\leq\max_{\color{blue}{\Theta}}\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})\).
  • EM Algorithm: An iterative method to find MLE of parameters \(\color{blue}{\Theta}\) of the models with latent variables \(\color{red}{\Gamma}\) by alternating between two steps until convergence. For \(t=0,1,...,T\):
    • E-step: Compute the responsibilities \(\color{red}{\gamma_{ik}}^{(t+1)}\) from current parameters \(\color{blue}{\Theta}^{(t)}\): \[\color{red}{\gamma_{ik}}^{(t+1)}=p(\color{red}{z_{ik}}=1|\text{x}_i, \color{red}{\vec{\gamma}_{i}}^{(t)},\color{blue}{\Theta}^{(t)})=\frac{\color{blue}{\pi_k}^{(t)}\mathcal{N}(\text{x}_i; \color{blue}{\vec{\mu_k}}^{(t)}, \color{blue}{\Sigma_k}^{(t)})}{\sum_{j=1}^K\color{blue}{\pi_j}^{(t)}\mathcal{N}(\text{x}_i; \color{blue}{\mu_j}^{(t)}, \color{blue}{\Sigma_j}^{(t)})}\text{ then }\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma}^{(t+1)}).\]
    • M-step: Update the parameters \(\color{blue}{\Theta}^{(t)}\to\color{blue}{\Theta}^{(t+1)}=\arg\max_{\color{blue}{\Theta}}\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma}^{(t+1)})\): \[\begin{align*} \color{blue}{\pi_k}^{(t+1)}&=\frac{1}{n}\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)},\\ \color{blue}{\vec{\mu_k}}^{(t+1)}&=\frac{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}\text{x}_i}{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}},\\ \color{blue}{\Sigma_k}^{(t+1)}&=\frac{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}(\text{x}_i-\color{blue}{\vec{\mu_k}}^{(t+1)})(\text{x}_i-\color{blue}{\vec{\mu_k}}^{(t+1)})^T}{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}}.\end{align*}\]
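  • The M-step formulas above translate directly into a few lines of code; a minimal sketch assuming X of shape (n, d) and a responsibility matrix resp of shape (n, K):
Code
import numpy as np

def m_step(X, resp):
    # Closed-form M-step updates for a GMM (sketch)
    n, d = X.shape
    N_k = resp.sum(axis=0)                                # effective count per component
    pis = N_k / n                                         # pi_k
    mus = (resp.T @ X) / N_k[:, None]                     # weighted means mu_k
    covs = []
    for k in range(resp.shape[1]):
        diff = X - mus[k]
        covs.append((resp[:, k, None] * diff).T @ diff / N_k[k])   # weighted covariance Sigma_k
    return pis, mus, np.array(covs)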

2. EM Algorithm for GMM

  • E-step: Compute the responsibilities \(\color{red}{\gamma_{ik}}^{(t+1)}\) from current parameters \(\color{blue}{\Theta}^{(t)}\): \[\color{red}{\gamma_{ik}}^{(t+1)}=p(\color{red}{z_{ik}}=1|\text{x}_i, \color{red}{\vec{\gamma}_{i}}^{(t)},\color{blue}{\Theta}^{(t)})=\frac{\color{blue}{\pi_k}^{(t)}\mathcal{N}(\text{x}_i; \color{blue}{\vec{\mu_k}}^{(t)}, \color{blue}{\Sigma_k}^{(t)})}{\sum_{j=1}^K\color{blue}{\pi_j}^{(t)}\mathcal{N}(\text{x}_i; \color{blue}{\mu_j}^{(t)}, \color{blue}{\Sigma_j}^{(t)})}\text{ then }\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma}^{(t+1)}).\]
  • M-step: Update the parameters \(\color{blue}{\Theta}^{(t)}\to\color{blue}{\Theta}^{(t+1)}=\arg\max_{\color{blue}{\Theta}}\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma}^{(t+1)})\): \[\begin{align*} \color{blue}{\pi_k}^{(t+1)}&=\frac{1}{n}\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)},\\ \color{blue}{\vec{\mu_k}}^{(t+1)}&=\frac{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}\text{x}_i}{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}},\\ \color{blue}{\Sigma_k}^{(t+1)}&=\frac{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}(\text{x}_i-\color{blue}{\vec{\mu_k}}^{(t+1)})(\text{x}_i-\color{blue}{\vec{\mu_k}}^{(t+1)})^T}{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}}.\end{align*}\]

\[\text{Process: }\color{blue}{\Theta}^\color{black}{(0)}\underbrace{\longrightarrow}_{\text{E-step}}\color{red}{\Gamma}^{(1)}\underbrace{\longrightarrow}_{\text{M-step}}\color{blue}{\Theta}^\color{black}{(1)}\underbrace{\longrightarrow}_{\text{E-step}}\color{red}{\Gamma}^{(2)}\underbrace{\longrightarrow}_{\text{M-step}}\color{blue}{\Theta}^\color{black}{(2)}\dots\underbrace{\longrightarrow}_{\text{E-step}}\color{red}{\Gamma}^{(t)}\underbrace{\longrightarrow}_{\text{M-step}}\color{blue}{\Theta}^\color{black}{(t)}...\]

2. EM Algorithm for GMM

Application on Old Faithful Dataset

Code
import numpy as np
import plotly.graph_objects as go
from scipy.stats import multivariate_normal

# --- (Assuming df0 is already loaded as per your previous code) ---
# Data Preparation
# We still define X, but we won't rely on df0['color'] for the animation anymore
X = df0[['eruptions', 'waiting']].values

# --- MODIFICATION 1: Initialization ---
np.random.seed(42)
K = 2
n_samples, n_features = X.shape

# Old way: random_indices = np.random.choice(n_samples, K, replace=False)
# Old way: mu = X[random_indices].copy()

# New way: initialize both means from nearby data points (a random point and the one 20 rows earlier)
ids = np.random.choice(n_samples, 1, replace=False)
mu = np.array([X[ids[0],:], X[ids[0]-20,:]])

global_cov = np.cov(X.T)
cov = [global_cov.copy() for _ in range(K)]
pi = np.array([0.5, 0.5])

# Visualization Grid (Kept same)
x_range = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 60)
y_range = np.linspace(X[:, 1].min() - 5, X[:, 1].max() + 5, 60)
xx, yy = np.meshgrid(x_range, y_range)
pos = np.dstack((xx, yy))

def get_mixture_density(pi, mu, cov, grid_pos):
    z = np.zeros(grid_pos.shape[:2])
    for k in range(K):
        z += pi[k] * multivariate_normal(mu[k], cov[k]).pdf(grid_pos)
    return z

# EM Algorithm with Frame Capture
iterations = 15
frames = []

for i in range(iterations):
    # E-STEP
    probs = np.zeros((n_samples, K))
    for k in range(K):
        probs[:, k] = pi[k] * multivariate_normal(mu[k], cov[k]).pdf(X)
    resp = probs / (probs.sum(axis=1, keepdims=True) + 1e-10)
    
    # --- MODIFICATION 2: Dynamic Coloring based on current Label ---
    # Assign label based on which cluster has higher responsibility
    current_labels = np.argmax(resp, axis=1)
    # Map labels to colors (0 -> Red, 1 -> Blue)
    current_colors = np.where(current_labels == 0, 'red', 'blue')

    # Text for parameters
    param_text = (
        f"<b>Iteration {i}</b><br>"
        f"π: [{pi[0]:.2f}, {pi[1]:.2f}]<br>"
        f"μ1: [{mu[0][0]:.2f}, {mu[0][1]:.2f}]<br>"
        f"μ2: [{mu[1][0]:.2f}, {mu[1][1]:.2f}]"
    )
    
    z_total = get_mixture_density(pi, mu, cov, pos)
    
    # Store state: Surface + Observations + Centroids
    frames.append(go.Frame(
        data=[
            go.Surface(z=z_total, x=xx, y=yy), # Trace 0
            # Trace 1 (Observations): Now uses 'current_colors'
            go.Scatter3d(
                x=df0['eruptions'], 
                y=df0['waiting'], 
                z=np.zeros(n_samples),
                mode='markers', 
                marker=dict(size=4, color=current_colors, opacity=0.8) 
            ), 
            go.Scatter3d(                     # Trace 2 (Moving Centroids)
                x=mu[:, 0], 
                y=mu[:, 1], 
                z=[0, 0], 
                mode='markers', 
                marker=dict(size=10, color='black', symbol='diamond', line=dict(width=2, color='white')),
                name='Centroids'
            )
        ],
        name=f"Iteration {i}",
        layout=go.Layout(annotations=[dict(
            text=param_text, align='left', showarrow=False,
            x=0.05, y=0.95, xref='paper', yref='paper',
            bgcolor="rgba(255, 255, 255, 0.7)", bordercolor="black", borderwidth=1
        )])
    ))
    
    # M-STEP
    N_k = resp.sum(axis=0)
    for k in range(K):
        mu[k] = (resp[:, k, np.newaxis] * X).sum(axis=0) / N_k[k]
        diff = X - mu[k]
        weighted_diff = resp[:, k, np.newaxis, np.newaxis] * np.einsum('ni,nj->nij', diff, diff)
        cov[k] = weighted_diff.sum(axis=0) / N_k[k]
        pi[k] = N_k[k] / n_samples

# Build Figure
fig = go.Figure(
    data=[
        go.Surface(
            z=frames[0].data[0].z, x=xx, y=yy, 
            opacity=0.6, colorscale='Viridis', showscale=False, name='Density'
        ),
        go.Scatter3d(
            x=df0['eruptions'].values, y=df0['waiting'].values, z=np.zeros(n_samples),
            # --- MODIFICATION 3: Initial Plot uses colors from Frame 0 ---
            mode='markers', marker=dict(size=4, color=frames[0].data[1].marker.color, opacity=0.8),
            name='Observations'
        ),
        go.Scatter3d(
            x=frames[0].data[2].x, y=frames[0].data[2].y, z=[0, 0],
            mode='markers', marker=dict(size=8, color='black', symbol='diamond', line=dict(width=2, color='white')),
            name='Centroids'
        )
    ],
    layout=go.Layout(
        width=400, height=470,
        title="EM Algorithm Dynamic",
        scene=dict(
            xaxis_title='Eruptions', yaxis_title='Waiting', zaxis_title='Density',
            camera=dict(eye=dict(x=0, y=1.5, z=0.6), center=dict(x=0, y=0, z=-0.2))
        ),
        annotations=frames[0].layout.annotations,
        updatemenus=[{
            "buttons": [
                {"args": [None, {"frame": {"duration": 400, "redraw": True}, "fromcurrent": True}],
                 "label": "Play", "method": "animate"},
                {"args": [[None], {"frame": {"duration": 0, "redraw": True}, "mode": "immediate"}],
                 "label": "Pause", "method": "animate"}
            ],
            "type": "buttons", "showactive": False, "x": -0.05, "y": 0.1, "xanchor": "left", "yanchor": "top"
        }],
        sliders=[{
            "steps": [
                {"args": [[f.name], {"frame": {"duration": 0, "redraw": True}, "mode": "immediate"}],
                 "label": str(k), "method": "animate"} for k, f in enumerate(frames)
            ],
            "active": 0, "currentvalue": {"prefix": "Iteration: "}, "pad": {"t": 50}
        }]
    ),
    frames=frames
)

fig.show()
  • Q\(^*\): Why does it work?
  • A\(^*\): Because each step increases the ELBO:
    • E-step: \[\color{blue}{Q}(\color{blue}{\Theta}^{(t)}, \color{red}{\Gamma}^{(t)})\leq \color{blue}{Q}(\color{blue}{\Theta}^{(t)}, \color{red}{\Gamma}^{(t+1)})=\mathcal{L}_{\text{obs}}(\color{blue}{\Theta}^{(t)}).\]
    • M-step: \[\mathcal{L}_{\text{obs}}(\color{blue}{\Theta}^{(t)})=\color{blue}{Q}(\color{blue}{\Theta}^{(t)}, \color{red}{\Gamma}^{(t+1)})\leq \color{blue}{Q}(\color{blue}{\Theta}^{(t+1)}, \color{red}{\Gamma}^{(t+1)}).\]
    • Chain of \(Q(\color{blue}{\Theta}, \color{red}{\Gamma})\) in EM-Algorithm: \[\begin{align*}&\mathcal{L}_{\text{obs}}(\color{blue}{\Theta}^{(0)})\overset{\color{red}{(3)}}{=}\color{blue}{Q}(\color{blue}{\Theta}^{(0)}, \color{red}{\Gamma}^{(1)})\overset{\color{red}{(M)}}{\leq} \color{blue}{Q}(\color{blue}{\Theta}^{(1)}, \color{red}{\Gamma}^{(1)})\\ \overset{\color{red}{(E)}}{\leq}&\mathcal{L}_{\text{obs}}(\color{blue}{\Theta}^{(1)})\overset{\color{red}{(3)}}{=}\color{blue}{Q}(\color{blue}{\Theta}^{(1)}, \color{red}{\Gamma}^{(2)})\overset{\color{red}{(M)}}{\leq} \color{blue}{Q}(\color{blue}{\Theta}^{(2)}, \color{red}{\Gamma}^{(2)})\\ \overset{\color{red}{(E)}}{\leq}&\mathcal{L}_{\text{obs}}(\color{blue}{\Theta}^{(2)})\overset{\color{red}{(3)}}{=}\color{blue}{Q}(\color{blue}{\Theta}^{(2)}, \color{red}{\Gamma}^{(3)})\overset{\color{red}{(M)}}{\leq} \color{blue}{Q}(\color{blue}{\Theta}^{(3)}, \color{red}{\Gamma}^{(3)})\\ \vdots&\end{align*}\]
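  • This chain can be checked empirically; a minimal sketch (assuming X from the code above) runs sklearn's EM one iteration at a time with warm_start and records the mean observed log-likelihood, which should never decrease.
Code
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, max_iter=1, warm_start=True,
                      init_params='random', random_state=0)
scores = []
for _ in range(20):
    gmm.fit(X)                       # one extra EM iteration per call (warm_start=True);
                                     # the ConvergenceWarning from max_iter=1 is expected here
    scores.append(gmm.score(X))      # mean observed log-likelihood per sample
print(all(b >= a - 1e-9 for a, b in zip(scores, scores[1:])))   # True: non-decreasing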

2. EM Algorithm for GMM

Issues of EM in GMM

  • The EM algorithm only guarantees convergence to a local maximum of the observed data log-likelihood \(\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})\) (which may result in bad clusters).
  • Issues in GMM:
    • Initialization sensitivity: Poor initialization can lead to suboptimal local maxima (people often use K-means to initialize the means).
    • Number of components (K): Choosing too few or too many components can lead to underfitting or overfitting (next part).
    • Singular covariance matrices: If a component collapses to a single point, its covariance matrix becomes singular, causing numerical instability (people often add a small value to the diagonal of the covariance matrices).
    • Convergence criteria: The choice of convergence threshold can affect the quality of the solution and computational efficiency.
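  • In sklearn these mitigations map directly to GaussianMixture arguments; a small sketch (the specific values are illustrative):
Code
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=2,
    init_params='kmeans',   # K-means initialization reduces sensitivity to the starting point
    n_init=10,              # several random restarts, keep the best local maximum
    reg_covar=1e-6,         # small value added to covariance diagonals to avoid singularities
    tol=1e-3,               # convergence threshold on the lower bound
    max_iter=200,
    random_state=42,
).fit(X)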

2. EM Algorithm for GMM

Issues of EM in GMM

Code
import numpy as np
import plotly.graph_objects as go
from scipy.stats import multivariate_normal

# --- (Assuming df0 is already loaded as per your previous code) ---
# Data Preparation
# We still define X, but we won't rely on df0['color'] for the animation anymore
X = df0[['eruptions', 'waiting']].values

# --- MODIFICATION 1: Initialization ---
np.random.seed(42)
K = 2
n_samples, n_features = X.shape

# Old way: random_indices = np.random.choice(n_samples, K, replace=False)
# Old way: mu = X[random_indices].copy()

# New way: deliberately poor start, two nearby data points (a random point and the one 10 rows later)
ids = np.random.choice(n_samples, 1, replace=False)
mu = np.array([X[ids[0],:], X[ids[0]+10,:]])

global_cov = np.cov(X.T)
cov = [global_cov.copy() for _ in range(K)]
pi = np.array([0.5, 0.5])

# Visualization Grid (Kept same)
x_range = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 60)
y_range = np.linspace(X[:, 1].min() - 5, X[:, 1].max() + 5, 60)
xx, yy = np.meshgrid(x_range, y_range)
pos = np.dstack((xx, yy))

def get_mixture_density(pi, mu, cov, grid_pos):
    z = np.zeros(grid_pos.shape[:2])
    for k in range(K):
        z += pi[k] * multivariate_normal(mu[k], cov[k]).pdf(grid_pos)
    return z

# EM Algorithm with Frame Capture
iterations = 15
frames = []

for i in range(iterations):
    # E-STEP
    probs = np.zeros((n_samples, K))
    for k in range(K):
        probs[:, k] = pi[k] * multivariate_normal(mu[k], cov[k]).pdf(X)
    resp = probs / (probs.sum(axis=1, keepdims=True) + 1e-10)
    
    # --- MODIFICATION 2: Dynamic Coloring based on current Label ---
    # Assign label based on which cluster has higher responsibility
    current_labels = np.argmax(resp, axis=1)
    # Map labels to colors (0 -> Red, 1 -> Blue)
    current_colors = np.where(current_labels == 0, 'red', 'blue')

    # Text for parameters
    param_text = (
        f"<b>Iteration {i}</b><br>"
        f"π: [{pi[0]:.2f}, {pi[1]:.2f}]<br>"
        f"μ1: [{mu[0][0]:.2f}, {mu[0][1]:.2f}]<br>"
        f"μ2: [{mu[1][0]:.2f}, {mu[1][1]:.2f}]"
    )
    
    z_total = get_mixture_density(pi, mu, cov, pos)
    
    # Store state: Surface + Observations + Centroids
    frames.append(go.Frame(
        data=[
            go.Surface(z=z_total, x=xx, y=yy), # Trace 0
            # Trace 1 (Observations): Now uses 'current_colors'
            go.Scatter3d(
                x=df0['eruptions'], 
                y=df0['waiting'], 
                z=np.zeros(n_samples),
                mode='markers', 
                marker=dict(size=4, color=current_colors, opacity=0.8) 
            ), 
            go.Scatter3d(                     # Trace 2 (Moving Centroids)
                x=mu[:, 0], 
                y=mu[:, 1], 
                z=[0, 0], 
                mode='markers', 
                marker=dict(size=10, color='black', symbol='diamond', line=dict(width=2, color='white')),
                name='Centroids'
            )
        ],
        name=f"Iteration {i}",
        layout=go.Layout(annotations=[dict(
            text=param_text, align='left', showarrow=False,
            x=0.05, y=0.95, xref='paper', yref='paper',
            bgcolor="rgba(255, 255, 255, 0.7)", bordercolor="black", borderwidth=1
        )])
    ))
    
    # M-STEP
    N_k = resp.sum(axis=0)
    for k in range(K):
        mu[k] = (resp[:, k, np.newaxis] * X).sum(axis=0) / N_k[k]
        diff = X - mu[k]
        weighted_diff = resp[:, k, np.newaxis, np.newaxis] * np.einsum('ni,nj->nij', diff, diff)
        cov[k] = weighted_diff.sum(axis=0) / N_k[k]
        pi[k] = N_k[k] / n_samples

# Build Figure
fig = go.Figure(
    data=[
        go.Surface(
            z=frames[0].data[0].z, x=xx, y=yy, 
            opacity=0.6, colorscale='Viridis', showscale=False, name='Density'
        ),
        go.Scatter3d(
            x=df0['eruptions'].values, y=df0['waiting'].values, z=np.zeros(n_samples),
            # --- MODIFICATION 3: Initial Plot uses colors from Frame 0 ---
            mode='markers', marker=dict(size=4, color=frames[0].data[1].marker.color, opacity=0.8),
            name='Observations'
        ),
        go.Scatter3d(
            x=frames[0].data[2].x, y=frames[0].data[2].y, z=[0, 0],
            mode='markers', marker=dict(size=8, color='black', symbol='diamond', line=dict(width=2, color='white')),
            name='Centroids'
        )
    ],
    layout=go.Layout(
        width=800, height=450,
        title="EM Algorithm with a bad initialization",
        scene=dict(
            xaxis_title='Eruptions', yaxis_title='Waiting', zaxis_title='Density',
            camera=dict(eye=dict(x=0, y=1.5, z=0.6), center=dict(x=0, y=0, z=-0.2))
        ),
        annotations=frames[0].layout.annotations,
        updatemenus=[{
            "buttons": [
                {"args": [None, {"frame": {"duration": 400, "redraw": True}, "fromcurrent": True}],
                 "label": "Play", "method": "animate"},
                {"args": [[None], {"frame": {"duration": 0, "redraw": True}, "mode": "immediate"}],
                 "label": "Pause", "method": "animate"}
            ],
            "type": "buttons", "showactive": False, "x": -0.05, "y": 0.1, "xanchor": "left", "yanchor": "top"
        }],
        sliders=[{
            "steps": [
                {"args": [[f.name], {"frame": {"duration": 0, "redraw": True}, "mode": "immediate"}],
                 "label": str(k), "method": "animate"} for k, f in enumerate(frames)
            ],
            "active": 0, "currentvalue": {"prefix": "Iteration: "}, "pad": {"t": 50}
        }]
    ),
    frames=frames
)

fig.show()

3. Parameter tuning

  • Unlike hard clustering (KMeans), choosing \(K\) and the other parameters of a GMM is based on probabilistic reasoning.
  • For any \(K\) and candidate parameters \(\color{blue}{\Theta}\), log-likelihood: indicates the goodness of fit of the model to the data: \[\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})=\sum_{i=1}^{n}\ln\left(\sum_{k=1}^K\color{blue}{\pi_k}\mathcal{N}(\text{x}_i;\color{blue}{\vec{\mu_k}}, \color{blue}{\Sigma_k})\right).\]
  • However, simply maximizing \(\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})\) may lead to overfitting (e.g., increasing \(K\) always increases likelihood). Thus, we use model selection criteria that balance fit and complexity:
  • Akaike Information Criterion (AIC): \[\text{AIC} = \underbrace{- 2\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})}_{\color{blue}{\text{Goodness of fit}}} + \underbrace{2\color{red}{p}}_{\color{red}{\text{Complexity}}},\] where \(\color{red}{p}\) is the number of parameters.
  • Bayesian Information Criterion (BIC): \[\text{BIC} = \underbrace{- 2\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})}_{\color{blue}{\text{Goodness of fit}}} + \underbrace{\color{red}{p}\ln(n)}_{\color{red}{\text{Complexity}}},\] where \(n\) is the number of data points.
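  • For a full-covariance GMM, \(\color{red}{p}=(K-1)+Kd+K\frac{d(d+1)}{2}\) (mixing weights, means, covariance matrices). A small sketch computing both criteria from a log-likelihood value (the helper function is illustrative; sklearn's gmm.aic(X) and gmm.bic(X), used below, do this for you):
Code
import numpy as np

def gmm_aic_bic(loglik, n, d, K):
    # p = free mixing weights + means + full covariance matrices
    p = (K - 1) + K * d + K * d * (d + 1) // 2
    aic = -2 * loglik + 2 * p
    bic = -2 * loglik + p * np.log(n)
    return aic, bic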

3. Parameter tuning

  • Akaike Information Criterion (AIC): \[\text{AIC} = \underbrace{- 2\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})}_{\color{blue}{\text{Goodness of fit}}} + \underbrace{2\color{red}{p}}_{\color{red}{\text{Complexity}}},\] where \(\color{red}{p}\) is the number of parameters.
  • Bayesian Information Criterion (BIC): \[\text{BIC} = \underbrace{- 2\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})}_{\color{blue}{\text{Goodness of fit}}} + \underbrace{\color{red}{p}\ln(n)}_{\color{red}{\text{Complexity}}},\] where \(n\) is the number of data points.
Code
from sklearn.mixture import GaussianMixture
n_components_range = range(1, 11)
cv_types = ['full', 'tied', 'diag', 'spherical']
bic_scores = []
aic_scores = []

results = {ctype: {'aic': [], 'bic': []} for ctype in cv_types}
for ctype in cv_types:
    for n in n_components_range:
        # Fit GMM
        gmm = GaussianMixture(n_components=n, covariance_type=ctype, random_state=42)
        gmm.fit(X)
        
        # Store metrics
        results[ctype]['aic'].append(gmm.aic(X))
        results[ctype]['bic'].append(gmm.bic(X))

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

# Create subplots: 2 rows, 1 column
fig = make_subplots(
    rows=2, cols=1,
    vertical_spacing=0.12,
    shared_xaxes=True  # Share x-axis to zoom both together
)

# Define a color map for consistency
colors = px.colors.qualitative.Set1
type_colors = {ctype: colors[i % len(colors)] for i, ctype in enumerate(cv_types)}

for ctype in cv_types:
    color = type_colors[ctype]
    
    # 1. AIC Trace (Solid Line, Circle Marker)
    fig.add_trace(
        go.Scatter(
            x=list(n_components_range),
            y=results[ctype]['aic'],
            mode='lines+markers',
            name=f'Cov: {ctype}',
            line=dict(color=color),
            marker=dict(symbol='circle'),
            legendgroup=ctype  # Group AIC and BIC controls
        ),
        row=1, col=1
    )

    # 2. BIC Trace (Dashed Line, X Marker)
    fig.add_trace(
        go.Scatter(
            x=list(n_components_range),
            y=results[ctype]['bic'],
            mode='lines+markers',
            name=f'Cov: {ctype}',
            line=dict(color=color, dash='dash'),
            marker=dict(symbol='x'),
            legendgroup=ctype,
            showlegend=False  # Hide duplicate legend entries
        ),
        row=2, col=1
    )

# Update Layout
fig.update_layout(
    height=390, width=500,
    hovermode="x unified",  # Show values for all lines at the hover point
    template="ggplot2"
)

# Axis Labels
fig.update_xaxes(title_text="Number of Components (K)", row=2, col=1)
fig.update_yaxes(title_text="AIC", row=1, col=1)
fig.update_yaxes(title_text="BIC", row=2, col=1)

fig.show()
  • Four types of covariance structures are available in sklearn’s GaussianMixture:
    • full: each component has its own \(\color{blue}{\Sigma_k}\).
    • tied: all components share the same \(\color{blue}{\Sigma}\).
    • diag: each component has a diagonal covariance matrix \(\color{blue}{\Sigma_k} = \text{diag}(\color{blue}{\sigma_{k1}^2},\dots,\color{blue}{\sigma_{kd}^2})\).
    • spherical: each component has its own single variance (isotropic) \(\color{blue}{\Sigma_k} = \color{blue}{\sigma_k^2 I_d}\).
  • Interpreting the plots:
    • Elbow/Minimum: Look for the \(K\) with elbow or lowest score.
    • Covariance Type: all types except spherical look reasonable, with the best score at \(K=2\).
    • Overfitting: scores flatten or rise again after \(K=2\).

Summary: GMM & EM Algorithm

  1. The Core Concept
  • Probabilistic Clustering: GMM provides soft assignments (probabilities). It assumes data is generated from a mixture of several Gaussian distributions.
  • Flexibility: Can model clusters of different sizes and elliptical shapes (via the Covariance Matrix), not just spheres.
  2. The Engine: Expectation-Maximization (EM)
  • Iterative Process: Alternates between estimating hidden variables and optimizing parameters.
    • E-Step (Expectation): Calculate “Responsibilities” (\(\color{red}{\gamma}_{ik}\)). How likely is it that point \(\text{x}_i\) belongs to cluster \(k\)?
    • M-Step (Maximization): Update parameters \((\color{blue}{\vec{\pi}},\color{blue}{\vec{\mu}},\color{blue}{\vec{\Sigma}})\) using these responsibilities as weights.
  • Convergence: Guaranteed to increase the likelihood (or the lower bound Q) at each step, converging to a local optimum.

Summary: GMM & EM Algorithm

  3. Critical Challenges
  • Local Optima: The algorithm is sensitive to initialization. It may get stuck in a “bad” solution. (Mitigation: Run multiple times or use K-Means initialization).
  • Singularities: If a cluster collapses onto a single point, variance \(\approx 0\) and likelihood \(\to\infty\). (Mitigation: Regularization).
  4. Model Selection
  • Finding K: Do not use Inertia. Use AIC or BIC to balance model fit against complexity. Look for the minimum score.
  • Covariance Type: Tune full, tied, diag, or spherical to prevent overfitting.

🥳 Yeahhhh….









Let’s Party… 🥂