Lab 7 - Clustering Algorithms

Course: ITM 390 004 - Machine Learning
Lecturer: Sothea HAS, PhD


Objective: This lab asks you to apply clustering algorithms, such as KMeans and hierarchical clustering, to real-world datasets. You will look for a suitable number of clusters \(K\) and interpret the meaning of each cluster.

1. California Housing Price

The California Housing Price data contains information from the 1990 California census.


A. Import the dataset into your Python environment from the Kaggle repository: California Housing Price.

  • Check the dimensions of the dataset and the column types. Fix any columns that have an inappropriate type.

  • Clustering algorithms are sensitive to feature scale, so encode the categorical columns and normalize all columns before clustering.
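The preparation steps above can be sketched as follows. This is only one possible approach: the small DataFrame is a hypothetical stand-in for the Kaggle file (the column names are assumed to match it, and the file name `housing.csv` in the comment is an assumption), median imputation is one simple way to handle missing values, and `StandardScaler` is one of several reasonable scalers.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the Kaggle data. In the lab you would load the
# real file instead, e.g. df = pd.read_csv("housing.csv") (assumed file name).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "total_rooms": rng.integers(100, 5000, n).astype(float),
    "total_bedrooms": rng.integers(20, 1000, n).astype(float),
    "median_income": rng.uniform(0.5, 15.0, n),
    "median_house_value": rng.uniform(15000, 500000, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
# Simulate a few missing entries so the cleaning step has something to do.
df.loc[df.index[::25], "total_bedrooms"] = np.nan

# Inspect dimensions, column types, and missing values.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Fill missing numeric values with the column median (one simple choice).
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# One-hot encode the categorical column, then standardize every column
# to zero mean and unit variance.
encoded = pd.get_dummies(df, columns=["ocean_proximity"], dtype=float)
X = StandardScaler().fit_transform(encoded)
```

Note that standardizing one-hot columns gives them the same weight as the numeric features in the distance computation; whether that is desirable is a modeling choice worth discussing in your answer.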

# To do

B. KMeans with your choice of \(K\): Run the KMeans algorithm using the KMeans class from sklearn.cluster with a number of clusters \(K\) of your choice.

  • Visualize the clustering structure using columns: total_bedrooms and median_house_value.

  • Compute the within-cluster sum of squares (WSS).

  • Compute Silhouette Score of the resulting clustering structure.
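A minimal sketch of these three steps is given below. It assumes synthetic two-column data from `make_blobs` as a stand-in for the scaled total_bedrooms and median_house_value columns, and it picks \(K=3\) arbitrarily. In scikit-learn, the fitted model's `inertia_` attribute is the WSS; the manual sum is included only to show what that number means.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: two columns playing the roles of
# total_bedrooms and median_house_value.
X_raw, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
X = StandardScaler().fit_transform(X_raw)

K = 3  # arbitrary choice for illustration
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
labels = km.labels_

# WSS: sum of squared distances from each point to its assigned centroid.
wss = km.inertia_
wss_manual = sum(
    ((X[labels == k] - km.cluster_centers_[k]) ** 2).sum() for k in range(K)
)

# Silhouette score: mean silhouette coefficient over all points, in [-1, 1].
sil = silhouette_score(X, labels)
print(f"WSS = {wss:.3f}, silhouette = {sil:.3f}")

# Scatter plot of the two columns, colored by cluster, with centroids marked.
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=labels, s=10)
ax.scatter(*km.cluster_centers_.T, c="red", marker="x")
ax.set_xlabel("total_bedrooms (scaled)")
ax.set_ylabel("median_house_value (scaled)")
```

On the real data, you would fit KMeans on all scaled columns and then plot only the two requested columns, so clusters may overlap in the 2D view even when they are separated in the full feature space.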

# To do

C. Find a suitable \(K\): For \(K\in\{1,2,3,...,10\}\), run KMeans with each value of \(K\) and store the WSS and the Silhouette Coefficient associated with each \(K\). Note that the Silhouette Coefficient is only defined for \(K\geq 2\).

  • Plot WSS against \(K\) and look for the elbow of the curve.

  • Can you detect a suitable \(K\)?

  • Plot the Silhouette Coefficient against \(K\) and use it to decide on \(K\).
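The search over \(K\) could be organized as in the sketch below. The data is again a synthetic stand-in (four blobs at hand-picked centers, an assumption made purely for illustration), and choosing the \(K\) with the largest silhouette value is one common heuristic rather than the only valid rule.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the scaled housing data.
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
    cluster_std=0.5,
    random_state=1,
)

wss, sil = {}, {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = km.inertia_
    if k >= 2:  # the silhouette coefficient needs at least two clusters
        sil[k] = silhouette_score(X, km.labels_)

# Heuristic: take the K with the highest silhouette coefficient.
best_k = max(sil, key=sil.get)
print("K with highest silhouette:", best_k)

# Left: WSS curve for the elbow method. Right: silhouette curve.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(wss.keys()), list(wss.values()), marker="o")
ax1.set_xlabel("K")
ax1.set_ylabel("WSS")
ax2.plot(list(sil.keys()), list(sil.values()), marker="o")
ax2.set_xlabel("K")
ax2.set_ylabel("Silhouette Coefficient")
```

WSS always tends to decrease as \(K\) grows, so the elbow marks where adding clusters stops paying off, while the silhouette curve has an actual maximum to read off. The two criteria do not have to agree on real data.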

# To do

2. Cybersecurity Intrusion Detection Dataset (Optional/Exploration)

This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. The description of the data is available here: https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset.

A. Import the data into the environment.

# To do

B. Preprocess and clean the data.

# To do

C. Perform clustering algorithms and detect a suitable number of clusters \(K\) for this dataset.
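Since the lab objective also mentions hierarchical clustering, here is a hedged sketch of how it might be applied in this step. The data is synthetic (the real intrusion dataset's columns are not reproduced here), Ward linkage is an assumed choice, and the silhouette coefficient is used as the selection criterion in the same way as for KMeans.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the preprocessed intrusion-detection features.
X_raw, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.7,
    random_state=0,
)
X = StandardScaler().fit_transform(X_raw)

# Fit agglomerative (hierarchical) clustering for several K and score each.
scores = {}
for k in range(2, 8):
    model = AgglomerativeClustering(n_clusters=k, linkage="ward")
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("K with highest silhouette:", best_k)
```

Agglomerative clustering has no WSS attribute comparable to KMeans' `inertia_`, which is why only the silhouette coefficient is tracked here.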

# To do
