TP5 - Gaussian Mixture Model (GMM) & EM Algorithm

Exploratory Data Analysis & Unsupervised Learning
Course: PHAUK Sokkey, PhD
TP: HAS Sothea, PhD


Objective: In this lab, we’ll dive into the fascinating world of Gaussian Mixture Models (GMMs) and the Expectation-Maximization (EM) algorithm, both fundamental concepts in unsupervised machine learning. GMMs can be viewed from various angles, such as density estimation and soft clustering. We’ll explore both perspectives and apply them to image segmentation, laying the groundwork for a broader understanding of generative models.


The Jupyter Notebook for this TP can be downloaded here: TP5-GMM_EM.


1. Gaussian Mixture Models

A. Perform GMM using GaussianMixture from sklearn.mixture on the Iris dataset with n_components=5.

Read this documentation and answer the following questions:

  • Print the estimated parameters of each component.
  • What does the score() method do in this module?
  • Compute this score, together with the AIC and BIC of the trained GMM.
# To do
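
A minimal sketch of one possible approach, relying on sklearn’s fitted-model attributes weights_, means_, and covariances_ and the score(), aic(), and bic() methods:

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data

# Fit a GMM with 5 components
gmm = GaussianMixture(n_components=5, random_state=0).fit(X)

# Estimated parameters of each component
print("Weights:\n", gmm.weights_)
print("Means:\n", gmm.means_)
print("Covariances:\n", gmm.covariances_)

# score() returns the average log-likelihood of the data under the model
print("Score (avg. log-likelihood):", gmm.score(X))
print("AIC:", gmm.aic(X))
print("BIC:", gmm.bic(X))
```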

B. Perform GMM on the Iris data, but this time with n_components = 1, 2, ..., 10.

  • Compute the score, AIC, and BIC for each number of components.
  • What is the optimal number of components?
# To do
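
One possible sketch: fit a GMM for each candidate number of components and track the three criteria. The average log-likelihood returned by score() keeps improving as components are added, so the penalized criteria (lower is better) are the ones used to pick the optimal number:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
ks = range(1, 11)
scores, aics, bics = [], [], []

for k in ks:
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    scores.append(gmm.score(X))  # average log-likelihood: increases with k
    aics.append(gmm.aic(X))      # penalized criteria: lower is better
    bics.append(gmm.bic(X))

best_k = list(ks)[int(np.argmin(bics))]
print("Optimal number of components (by BIC):", best_k)
```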

C. With the optimal number of components from question B, perform GMM on the Iris data using each option of covariance_type from the list ['full', 'tied', 'diag', 'spherical']. Compute the score associated with each covariance type. Comment.

# To do
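
A short sketch of one way to do this; best_k below is a placeholder to be replaced by the value found in question B:

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
best_k = 2  # placeholder: use the optimal K from question B

for cov_type in ['full', 'tied', 'diag', 'spherical']:
    gmm = GaussianMixture(n_components=best_k,
                          covariance_type=cov_type,
                          random_state=0).fit(X)
    print(f"{cov_type:>9}: score = {gmm.score(X):.4f}")
```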

D. Repeat questions B and C on simulated data from the previous TP4. How do GMM’s results compare to those of K-means or hierarchical clustering?

# To do
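
Since the TP4 simulated data is not reproduced here, the sketch below uses make_blobs as a stand-in (an assumption) and compares the three methods with the adjusted Rand index against the true labels:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Stand-in for the TP4 simulated data (assumption: 3 Gaussian blobs)
X, y_true = make_blobs(n_samples=500, centers=3, random_state=0)

gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Adjusted Rand Index: agreement with the true labels (1 = perfect)
for name, labels in [("GMM", gmm_labels), ("KMeans", km_labels),
                     ("Hierarchical", hc_labels)]:
    print(f"{name:>12}: ARI = {adjusted_rand_score(y_true, labels):.3f}")
```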

2. EM Algorithm

The EM algorithm is used to estimate the parameters of the GMM, ensuring that the model fits the data as closely as possible by iteratively refining the parameters. It leverages the concept of latent variables (responsibilities) to handle the fact that the actual class labels for the data points are unknown.

This iterative optimization makes GMMs a powerful tool for tasks like clustering and density estimation.

A. Recall the steps of the EM algorithm for a GMM with \(K\) components.

To do
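
For reference, one standard way to write the two steps for data \(x_1, \dots, x_n\), mixture weights \(\pi_k\), means \(\mu_k\), and covariances \(\Sigma_k\):

E-step (compute responsibilities):

\[
\gamma_{ik} = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},
\qquad i = 1, \dots, n,\; k = 1, \dots, K.
\]

M-step (update parameters):

\[
N_k = \sum_{i=1}^{n} \gamma_{ik}, \qquad
\pi_k = \frac{N_k}{n}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik}\, x_i, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^{\top}.
\]

The two steps alternate until the log-likelihood \(\ell = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\) stabilizes.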

B. 1D EM Algorithm:

  • Plot the density of the third column of the Iris dataset. From this density, what appears to be the number of components?
  • Write a function EM1d(x, K=3, max_iter=100) that takes a 1D data array x, the number of components K, and the maximum number of EM iterations max_iter. The function should return the responsibility matrix \(\Gamma\) and the centers and variances of all \(K\) components.
  • Apply your function to the third column of the Iris data with \(K = 1, 2, \dots, 10\).
  • Compute the score, AIC, and BIC for each \(K\) (you may need to write your own function for that). What is the optimal number of components?
  • Visualize your estimated density.
# To do
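
A compact sketch of what EM1d might look like; the quantile-based initialization and the small variance floor are assumptions, not part of the statement:

```python
import numpy as np

def EM1d(x, K=3, max_iter=100):
    """EM for a 1D Gaussian mixture.

    Returns the responsibility matrix Gamma (n x K),
    the K centers, and the K variances."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Initialization (assumption: quantile means, global variance, uniform weights)
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))
    var = np.full(K, x.var())
    pi = np.full(K, 1.0 / K)

    for _ in range(max_iter):
        # E-step: gamma_ik proportional to pi_k * N(x_i | mu_k, var_k)
        dens = (pi / np.sqrt(2 * np.pi * var)
                * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)))
        Gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: update weights, means, and variances
        Nk = Gamma.sum(axis=0)
        pi = Nk / n
        mu = (Gamma * x[:, None]).sum(axis=0) / Nk
        var = (Gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        var = np.maximum(var, 1e-6)  # guard against collapsing components

    return Gamma, mu, var
```

For the AIC/BIC part, note that a 1D GMM with \(K\) components has \(p = 3K - 1\) free parameters (\(K\) means, \(K\) variances, \(K - 1\) independent weights), so AIC \(= 2p - 2\log L\) and BIC \(= p\log n - 2\log L\).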

C. Image Segmentation.

  • Load any image (not too high a resolution; it can also be an MNIST image from the previous TP).
  • Reshape it into a 1D array, then apply your EM1d function (or sklearn’s GMM) to that 1D pixel array with your desired number of components.
  • Assign each pixel to a component and reshape the segmented image back into its original shape.
  • Display the original and segmented images side by side. Comment.
# To do
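
One possible pipeline, assuming a grayscale image loaded with matplotlib (the file name is a placeholder) and each pixel replaced by its component’s mean intensity:

```python
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Load an image (placeholder path); average the channels if it is RGB
img = plt.imread("your_image.png")
if img.ndim == 3:
    img = img.mean(axis=2)

# Flatten to a 1D pixel array and fit a GMM
pixels = img.reshape(-1, 1)
K = 3  # desired number of segments
gmm = GaussianMixture(n_components=K, random_state=0).fit(pixels)

# Assign each pixel to a component; replace it by the component mean
labels = gmm.predict(pixels)
segmented = gmm.means_[labels].reshape(img.shape)

# Display the original and segmented images side by side
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(img, cmap="gray"); axes[0].set_title("Original")
axes[1].imshow(segmented, cmap="gray"); axes[1].set_title(f"Segmented (K={K})")
for ax in axes:
    ax.axis("off")
plt.show()
```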

Further Readings