Exploratory Data Analysis & Unsupervised Learning Course: PHAUK Sokkey, PhD TP: HAS Sothea, PhD
Objective: In this lab, we’ll dive into the fascinating world of Gaussian Mixture Models (GMMs) and the Expectation-Maximization (EM) algorithm, both of which are fundamental concepts of unsupervised machine learning. GMMs can be viewed from various angles, such as density estimation and soft clustering. We’ll explore both perspectives and apply them to image segmentation, laying the groundwork for a broader understanding of generative models.
The Jupyter Notebook for this TP can be downloaded here: TP5-GMM_EM.
1. Gaussian Mixture Models
A. Perform GMM using GaussianMixture from sklearn.mixture on the Iris dataset with n_components=5.
Read this documentation and answer the following questions:
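The fitting step itself might look like the sketch below (the names Iris, X, gmm and y_pred are the ones reused in the code that follows; random_state=0 is an arbitrary choice):

from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# Load the Iris data: 150 observations described by 4 numerical features
Iris = load_iris()
X = Iris.data

# Fit a GMM with 5 components and keep the hard cluster assignments for plotting
gmm = GaussianMixture(n_components=5, covariance_type='full', random_state=0)
gmm.fit(X)
y_pred = gmm.predict(X)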
As the dataset is of dimension \(4\), we borrow a tool from the next chapter, PCA, to visualize these centers.
from sklearn.decomposition import PCA

# PCA with 3 components
pc = PCA(n_components=3)
X_proj = pc.fit_transform(X)

# Project the means
centroid_proj = pc.transform(gmm.means_)

import plotly.io as pio
pio.renderers.default = 'notebook'
import plotly.graph_objects as go

fig = go.Figure(go.Scatter3d(
    x=X_proj[:,0], y=X_proj[:,1], z=X_proj[:,2],
    name="Projected iris data", mode="markers",
    marker=dict(size=6, color=y_pred)))
fig.update_layout(title="Projection of Iris data onto the first 3 PCs", width=700, height=500)

# Add means
fig.add_trace(go.Scatter3d(
    x=centroid_proj[:,0], y=centroid_proj[:,1], z=centroid_proj[:,2],
    name="Centroids obtained from GMM", mode="markers",
    marker=dict(size=10, color="black")))
fig.show()
What does the score() method do in this module?
Mixture models (MMs) assume that the density \(f_X\) of the observations \(x_1,...,x_n\in\mathbb{R}^d\) consists of multiple components or clusters, where the number of components is denoted by \(K>1\), i.e., \[f_X(x_i)=\sum_{k=1}^K\pi_kf_k(x_i;\theta_k)\text{ for all }i\in\{1,2,\dots,n\},\] where
\(\pi_k\) is the proportion of data points in component \(k\)
\(f_k(x;\theta_k)\) is the density of data \(x\) in the \(k\)th component, with corresponding parameter \(\theta_k\) for \(k\in\{1,2,\dots,K\}\).
If we assume that within each cluster the observations are normally distributed, i.e., \(f_k(x;\theta_k)={\cal N}(x;\theta_k)\) with \(\theta_k=(\mu_k,\Sigma_k)\), then we obtain the Gaussian Mixture Model (GMM). In this case, the likelihood of the observations is written as \[L(\theta_1,\dots,\theta_K)=\prod_{i=1}^n\sum_{k=1}^K\pi_k{\cal N}(x_i;\theta_k).\]
At this point, our objective is to search for values of \((\theta_1,\dots,\theta_K)\) maximizing the above likelihood function or, equivalently, maximizing the following log-likelihood function: \[{\cal L}(\theta_1,\dots,\theta_K)=\sum_{i=1}^n\log\left[\sum_{k=1}^K\pi_k{\cal N}(x_i;\theta_k)\right].\]
Unfortunately, this log-likelihood cannot be maximized analytically in general, so we need a numerical method to solve the maximization problem. Back to our question: given a number of clusters \(K=5\) and values of the parameters \((\theta_1,\theta_2,\dots,\theta_5)\), the score function is the average log-likelihood of the observations evaluated at \(K=5\) and the estimated parameters:
\[\text{Score}=\frac{1}{n}\sum_{i=1}^n\log\left[\sum_{k=1}^K\pi_k{\cal N}(x_i;\theta_k)\right]=\frac{{\cal L}(\theta_1,\theta_2,\dots,\theta_K)}{n}.\] Simply put, if the number of components together with the estimated parameters fits the observed data \(x_1,\dots,x_n\) well, this \(\text{Score}\) should be large.
Remark: the score often increases with the number of components, because the Gaussian densities become larger when the data concentrate close to their component’s center, which is more likely when there are many components.
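As a quick check of this definition (a sketch, assuming the gmm and X from question A), score() should coincide with the mean of the per-observation log-densities returned by score_samples():

import numpy as np

# score(X) is the average log-likelihood; score_samples(X) gives per-sample log-densities
print(np.isclose(gmm.score(X), gmm.score_samples(X).mean()))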
Compute this score, AIC and BIC of the trained GMM.
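A minimal sketch, assuming the 5-component gmm fitted in question A is still in scope:

# Average log-likelihood, AIC and BIC of the fitted 5-component GMM
print("Score:", gmm.score(X))
print("AIC  :", gmm.aic(X))
print("BIC  :", gmm.bic(X))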
B. Perform GMM on Iris data but this time using n_components = 1,2,...,10.
Compute the score, AIC and BIC for each number of components.
What is the optimal number of components?
import plotly.io as pio
pio.renderers.default = 'notebook'

score = []
aic = []
bic = []
for k in range(2, 11):
    gmm = GaussianMixture(n_components=k, covariance_type='full')
    gmm.fit(X)
    score.append(gmm.score(X))
    aic.append(gmm.aic(X))
    bic.append(gmm.bic(X))

from plotly.subplots import make_subplots
fig_score = make_subplots(
    rows=1, cols=3,
    subplot_titles=("Score vs No. clusters", "AIC vs No. clusters", "BIC vs No. clusters"))
fig_score.add_trace(go.Scatter(x=list(range(2,11)), name="Score", y=score), row=1, col=1)
fig_score.add_trace(go.Scatter(x=list(range(2,11)), name="AIC", y=aic), row=1, col=2)
fig_score.add_trace(go.Scatter(x=list(range(2,11)), name="BIC", y=bic), row=1, col=3)
fig_score.update_layout(width=780, height=400)
Based on AIC, the optimal number of clusters is 8, while based on BIC it is 2.
C. With the optimal number of components from question B, perform GMM on Iris data using different options of covariance_type from the list [‘full’, ‘tied’, ‘diag’, ‘spherical’]. In each case, compute the score associated to each variance type. Comment.
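The construction of the comparison table df used by the plotting code below is not shown; a possible sketch, assuming K in {2, 8} (the BIC and AIC optima from question B) and long-format columns K, Cov, Criteria, Value chosen to match that code:

import pandas as pd

rows = []
for K in [2, 8]:                                   # assumed values: the BIC and AIC optima from B
    for cov in ['full', 'tied', 'diag', 'spherical']:
        gmm_cov = GaussianMixture(n_components=K, covariance_type=cov, random_state=0).fit(X)
        rows.append({"K": K, "Cov": cov, "Criteria": "Score", "Value": gmm_cov.score(X)})
        rows.append({"K": K, "Cov": cov, "Criteria": "AIC", "Value": gmm_cov.aic(X)})
        rows.append({"K": K, "Cov": cov, "Criteria": "BIC", "Value": gmm_cov.bic(X)})
df = pd.DataFrame(rows)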
import seaborn as sns
import matplotlib.pyplot as plt

_, axs = plt.subplots(1, 3, figsize=(15, 4))
i = 0
for v in ['Score', 'AIC', 'BIC']:
    sns.barplot(data=df[df.Criteria == v], x="K", y="Value", hue="Cov", ax=axs[i])
    axs[i].set_title(f"Criterion: {v}")
    # Annotate each group of bars (one container per covariance type)
    for container in axs[i].containers:
        axs[i].bar_label(container, fmt="%.2f")
    i += 1
plt.tight_layout()
plt.show()
In terms of the Score (likelihood fit) and AIC, 8 components with a full covariance matrix fit the data best, as indicated by the highest score and lowest AIC.
However, BIC suggests that 2 components with a full covariance matrix better fit the data.
D. Repeat questions B and C on simulated data from the previous TP4. How do GMM’s results compare to K-means or Hierarchical clustering?
Left as a practical exercise.
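A possible starting point (a sketch only, with make_blobs standing in for the TP4 simulated data): compare the three clusterings with the adjusted Rand index.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Simulated blob data as a stand-in for the TP4 dataset (assumption)
X_sim, y_sim = make_blobs(n_samples=500, centers=4, cluster_std=1.2, random_state=0)

clusterings = {
    "GMM": GaussianMixture(n_components=4, random_state=0).fit_predict(X_sim),
    "K-means": KMeans(n_clusters=4, random_state=0).fit_predict(X_sim),
    "Hierarchical": AgglomerativeClustering(n_clusters=4).fit_predict(X_sim),
}
for name, labels in clusterings.items():
    print(f"{name}: ARI = {adjusted_rand_score(y_sim, labels):.3f}")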
2. EM Algorithm
The EM algorithm is used to estimate the parameters of the GMM, ensuring that the model fits the data as closely as possible by iteratively refining the parameters. It leverages the concept of latent variables (responsibilities) to handle the fact that the actual class labels for the data points are unknown.
This iterative optimization makes GMMs a powerful tool for tasks like clustering and density estimation.
A. Recall the process of the EM algorithm for a GMM with \(K\) components.
For each observation \(x_i\), we introduce a latent indicator vector \(z_i=(z_{i1},\dots,z_{iK})\in\{0,1\}^K\) with exactly one nonzero component: if \(z_{ik}=1\), i.e., \(x_i\) belongs to component \(k\), then \(x_i\) is distributed according to \({\cal N}(\mu_k,\sigma_k^2)\). For any observation \(x_i\) and any component \(k\in\{1,\dots,K\}\), the prior probability of membership is \(p(z_{ik}=1)=\pi_k\).
This implies that \(\mathbb{E}[z_{ik}]=p(z_{ik}=1|x_i)=\frac{\pi_k{\cal N}(x_i;\mu_k,\sigma_k^2)}{\sum_{j=1}^K\pi_j{\cal N}(x_i;\mu_j,\sigma_j^2)}=\gamma_{ik}\) because \(z_{ik}|x_i\sim{\cal B}(\gamma_{ik})\) (given the value of \(x_i\), \(z_{ik}\) is a Bernoulli random variable with parameter \(\gamma_{ik}\)). This \(\gamma_{ik}\) is called the responsibility; it is the probability that a given observation \(x_i\) belongs to component \(k\).
We can now introduce the “complete-data log-likelihood” for the parameters \(\theta_k=(\pi_k,\mu_k,\sigma_k^2)\) at the complete data \((x_1,\dots,x_n,z_1,\dots,z_n)\): \[\log p(x,z;\theta)=\sum_{i=1}^n\sum_{k=1}^K z_{ik}\left[\log\pi_k+\log{\cal N}(x_i;\mu_k,\sigma_k^2)\right].\] E-step: taking the expectation of this quantity over the latent variables, i.e., replacing each \(z_{ik}\) by its responsibility \(\gamma_{ik}\), gives \[Q(\theta,\theta_{\text{old}})=\sum_{i=1}^n\sum_{k=1}^K\gamma_{ik}\left[\log\pi_k+\log{\cal N}(x_i;\mu_k,\sigma_k^2)\right],\] where
\(\theta=(\pi_1,\dots,\pi_K,\mu_1,\dots,\mu_K,\sigma_1,\dots,\sigma_K)\) are the parameters explicitly defined in \(Q\)
\(\theta_{\text{old}}=(\pi_1',\dots,\pi_K',\mu_1',\dots,\mu_K',\sigma_1',\dots,\sigma_K')\) are the parameters used for evaluating \((\gamma_{ik})\).
One may ask: why does maximizing this expectation of the complete-data log-likelihood with respect to the parameters \(\theta_1,\dots,\theta_K\) also maximize the observed log-likelihood?
Maximizing this expectation works because, up to an additive term that does not depend on \(\theta\), it is a lower bound of the objective “log observed-data likelihood”: by Jensen’s inequality, \[{\cal L}(\theta)=\sum_{i=1}^n\log\sum_{k=1}^K\gamma_{ik}\frac{\pi_k{\cal N}(x_i;\mu_k,\sigma_k^2)}{\gamma_{ik}}\geq\sum_{i=1}^n\sum_{k=1}^K\gamma_{ik}\log\frac{\pi_k{\cal N}(x_i;\mu_k,\sigma_k^2)}{\gamma_{ik}}=Q(\theta,\theta_{\text{old}})-\sum_{i=1}^n\sum_{k=1}^K\gamma_{ik}\log\gamma_{ik},\] with equality when \(\theta=\theta_{\text{old}}\). Increasing \(Q\) therefore increases a lower bound that touches \({\cal L}\) at the current parameters, so each EM iteration cannot decrease the observed log-likelihood.
M-step: Compute the maximizer \(\theta^{\text{new}}=\arg\max_{\theta}Q(\theta, \theta_{\text{old}})\). For GMM, we have explicit formulas for the parameters: \[\pi_k^{\text{new}}=\frac{1}{n}\sum_{i=1}^n\gamma_{ik},\qquad \mu_k^{\text{new}}=\frac{\sum_{i=1}^n\gamma_{ik}x_i}{\sum_{i=1}^n\gamma_{ik}},\qquad \sigma_k^{2,\text{new}}=\frac{\sum_{i=1}^n\gamma_{ik}(x_i-\mu_k^{\text{new}})^2}{\sum_{i=1}^n\gamma_{ik}},\qquad k\in\{1,\dots,K\}.\]
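As an aside, in scikit-learn the responsibilities \(\gamma_{ik}\) of a fitted mixture are returned by predict_proba; a quick sketch, assuming the gmm and X fitted in Section 1 are still in scope:

# Each row of gamma holds the responsibilities of one observation and sums to 1
gamma = gmm.predict_proba(X)
print(gamma.shape, gamma.sum(axis=1)[:5])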
Plot the density of the third column of the Iris dataset. From this density, what is the number of components?
sns.set(style="whitegrid")
ax = sns.kdeplot(X[:,2])
ax.set_title(f"Density of {Iris.feature_names[2]}")
ax.set_xlabel(f"{Iris.feature_names[2]}")
plt.show()
This density suggests that there are two distinct components of petal length among the Iris flowers in the dataset.
Write a function EM1d(x, K=3, max_iter = 100) that takes a 1D data array x, the number of components K, and the maximum number of EM iterations max_iter. The function should return the responsibility matrix \(\Gamma\), and the center and variance of each of the \(K\) components.
import numpy as np
from scipy.stats import norm

# Mixture density
def densityMixture(x, mu, sigma, prop):
    n = len(x)
    K = len(prop)
    d = np.zeros(n)
    for i in range(n):
        for k in range(K):
            d[i] += prop[k] * norm.pdf(x[i], mu[k], sigma[k])
    return d

# Log-likelihood function
def logLikehood(x, mu, sigma, prop):
    result = densityMixture(x, mu, sigma, prop)
    return np.sum(np.log(result))

# Compute responsibilities in E-step
def responsibilities(x, mu, sigma, prop):
    n = len(x)
    K = len(prop)
    gamma = np.zeros((n, K))
    for i in range(n):
        for k in range(K):
            gamma[i, k] = prop[k] * norm.pdf(x[i], mu[k], sigma[k])
    gamma = gamma / gamma.sum(axis=1, keepdims=True)
    return gamma

# Compute parameters in M-step
def params(x, gamma):
    n = gamma.shape[0]
    K = gamma.shape[1]
    prop = np.sum(gamma, axis=0) / n
    mu = (gamma.T @ x) / np.sum(gamma, axis=0)
    sigma = np.sqrt(np.diag((gamma.T @ (np.tile(x, (K, 1)).T - np.tile(mu, (n, 1)))**2) / np.sum(gamma, axis=0)))
    return {"prop": prop, "mu": mu, "sigma": sigma}

from sklearn.cluster import KMeans

# EM algorithm for 1D data
def EM1d(x, K=3, max_iter=100):
    n = len(x)
    # Initialization: hard assignments from K-means
    kmeans = KMeans(n_clusters=K).fit(x.reshape(-1, 1))
    cl = kmeans.labels_
    gamma = np.zeros((n, K))
    gamma[np.arange(n), cl] = 1
    # M-step at iteration 0
    old_param = params(x, gamma)
    loglik = [logLikehood(x, old_param['mu'], old_param['sigma'], old_param['prop'])]
    # E-step at iteration 1
    gamma = responsibilities(x, old_param['mu'], old_param['sigma'], old_param['prop'])
    # M-step at iteration 1
    param = params(x, gamma)
    loglik.append(logLikehood(x, param['mu'], param['sigma'], param['prop']))
    # Main loop
    for i in range(max_iter):
        # E-step
        gamma = responsibilities(x, param['mu'], param['sigma'], param['prop'])
        # M-step
        param = params(x, gamma)
        loglik.append(logLikehood(x, param['mu'], param['sigma'], param['prop']))
    return {"prop": param['prop'], "mu": param['mu'], "sigma": param['sigma'], "loglik": loglik, "gamma": gamma}
Apply your function to the third column of the iris data with \(K=2,...,10\).
import warnings
warnings.filterwarnings('ignore')

res = []
for k in range(2, 11):
    res.append(EM1d(X[:,2], K=k))
Compute score, AIC and BIC for each \(K\) (you may need to write your own function for that). What is the optimal number of components?
# AIC
def AIC(param):
    prop, mu, sigma, gamma, llik = param['prop'], param['mu'], param['sigma'], param["gamma"], param["loglik"][-1]
    d = 3*len(mu) - 1    # K means + K variances + (K - 1) free proportions
    return -2*llik + 2*d

# BIC
def BIC(param):
    prop, mu, sigma, gamma, llik = param['prop'], param['mu'], param['sigma'], param["gamma"], param["loglik"][-1]
    d = 3*len(mu) - 1
    return -2*llik + d*np.log(gamma.shape[0])

import plotly.io as pio
pio.renderers.default = 'notebook'

score = []
aic = []
bic = []
for k in range(9):          # res[k] corresponds to K = k + 2 components
    score.append(res[k]["loglik"][-1])
    aic.append(AIC(res[k]))
    bic.append(BIC(res[k]))

from plotly.subplots import make_subplots
fig_score = make_subplots(
    rows=1, cols=3,
    subplot_titles=("Score vs No. clusters", "AIC vs No. clusters", "BIC vs No. clusters"))
fig_score.add_trace(go.Scatter(x=list(range(2,11)), name="Score", y=score), row=1, col=1)
fig_score.add_trace(go.Scatter(x=list(range(2,11)), name="AIC", y=aic), row=1, col=2)
fig_score.add_trace(go.Scatter(x=list(range(2,11)), name="BIC", y=bic), row=1, col=3)
fig_score.update_layout(width=780, height=400, title=f"Score, AIC and BIC of {Iris.feature_names[2]} vs no. clusters")
This result suggests that the optimal number of clusters is \(2\).
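As a visual check (a sketch, assuming res, densityMixture and X from the cells above), the fitted two-component mixture can be overlaid on the data:

import numpy as np
import matplotlib.pyplot as plt

best = res[0]                  # the K = 2 solution
grid = np.linspace(X[:,2].min() - 0.5, X[:,2].max() + 0.5, 300)

plt.hist(X[:,2], bins=30, density=True, alpha=0.4, label=Iris.feature_names[2])
plt.plot(grid, densityMixture(grid, best['mu'], best['sigma'], best['prop']),
         color="red", label="Fitted 2-component mixture")
plt.legend()
plt.show()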
Load any image (not too high in resolution; it can also be an MNIST image from the previous TP).
from skimage import data

image = data.astronaut()
print(image.shape)
plt.imshow(image)
plt.axis('off')
plt.show()
(512, 512, 3)
Reshape it into a 1D array, then apply your EM1d function (or sklearn's GaussianMixture) to that 1D pixel array with your desired number of components.
Assign each pixel to a component and reshape the segmented image back into its original shape.
Display the original and segmented images side by side. Comment.
# Reduce resolution of the image for fast processing
image_gray = image[::5,::5,2]
print(image_gray.shape)

# Reshape image into a 1D pixel array
image1d = image_gray.reshape(-1)

plt.imshow(image_gray, cmap="gray")
plt.axis('off')
plt.show()
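A sketch of the remaining steps using scikit-learn's GaussianMixture (the choice K = 3 is arbitrary): fit on the 1D pixel array, assign each pixel to its most likely component, reshape back, and display both images side by side.

from sklearn.mixture import GaussianMixture

K = 3
gmm_img = GaussianMixture(n_components=K, random_state=0).fit(image1d.reshape(-1, 1))
labels = gmm_img.predict(image1d.reshape(-1, 1))

# Replace each pixel by the mean intensity of its component, then reshape back
segmented = gmm_img.means_[labels].reshape(image_gray.shape)

fig, axs = plt.subplots(1, 2, figsize=(8, 4))
axs[0].imshow(image_gray, cmap="gray")
axs[0].set_title("Original (reduced)")
axs[0].axis('off')
axs[1].imshow(segmented, cmap="gray")
axs[1].set_title(f"Segmented with {K} components")
axs[1].axis('off')
plt.tight_layout()
plt.show()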