# Lab7 - Clustering Algorithm
Course: ITM 390 004 - Machine Learning
Lecturer: Sothea HAS, PhD
Objective: This lab is designed for you to apply clustering algorithms such as KMeans and hierarchical clustering to a real-world dataset. You will try to detect a suitable number of clusters \(K\) and interpret the meaning of each cluster.
- The notebook of this TP is available here: TP7_Clustering.ipynb.
1. California Housing Price
The California Housing Price data contains information from the 1990 California census.
A. Import the dataset into your Python environment from the Kaggle repository: California Housing Price.
Check the dimensions of the dataset and the column types. Fix any columns that have an inappropriate type.
Distance-based clustering algorithms such as KMeans are sensitive to the scale of the features; therefore, encode the categorical columns and normalize all columns in preparation for clustering.
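The preparation steps in part A could be sketched as follows. This is only one possible approach, not the expected solution: the helper name `prepare_for_clustering`, the median imputation, the `housing.csv` path, and the `ocean_proximity` column are assumptions, and the toy values are made up purely so the snippet runs without the Kaggle download.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare_for_clustering(df: pd.DataFrame) -> pd.DataFrame:
    """Impute, one-hot encode, and standardize a mixed-type DataFrame (hypothetical helper)."""
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.columns.difference(num_cols)

    # Assumption: missing numeric values are filled with the column median.
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())

    # One-hot encode the categorical columns (one 0/1 column per category).
    encoded = pd.get_dummies(df, columns=list(cat_cols), dtype=float)

    # Standardize every column to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(encoded)
    return pd.DataFrame(scaled, columns=encoded.columns, index=encoded.index)


# Toy stand-in with made-up values; in the lab you would load the real file instead,
# e.g. df = pd.read_csv("housing.csv")  # hypothetical local path to the Kaggle download
toy = pd.DataFrame({
    "total_bedrooms": [129.0, 1106.0, np.nan, 235.0, 280.0, 213.0],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0, 269700.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "<1H OCEAN", "INLAND"],
})

print(toy.shape)   # dimensions of the dataset
print(toy.dtypes)  # column types

X_scaled = prepare_for_clustering(toy)
print(X_scaled.round(2))
```

Standardizing the one-hot columns together with the numeric ones is a design choice; leaving the 0/1 columns unscaled, or using min-max scaling, are equally defensible alternatives to compare.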
B. KMeans with your choice of \(K\): Run the KMeans algorithm using the KMeans class from sklearn.cluster with a number of clusters \(K\) of your choice.
Visualize the clustering structure using the columns total_bedrooms and median_house_value.
Compute the within-cluster sum of squares (WSS).
Compute the Silhouette Score of the resulting clustering structure.
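As a rough illustration of part B, the sketch below fits KMeans for one arbitrary \(K\), computes WSS by hand (the sum of squared distances from each point to its assigned centroid), and checks it against the `inertia_` attribute that scikit-learn exposes. The data comes from `make_blobs` as a synthetic stand-in for the scaled housing matrix, and `plot_clusters` is a hypothetical helper, so treat this as a sketch under those assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled housing data (assumption: replace with your own matrix).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, cluster_std=0.8, random_state=0)

K = 3  # arbitrary choice for illustration
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
labels = km.labels_

# WSS: sum of squared distances from each point to its assigned centroid.
wss_manual = float(((X - km.cluster_centers_[labels]) ** 2).sum())
wss = float(km.inertia_)  # the same quantity as reported by scikit-learn

# Silhouette Score lies in [-1, 1]; higher values indicate better-separated clusters.
sil = float(silhouette_score(X, labels))

print(f"WSS (manual) = {wss_manual:.3f}, WSS (inertia_) = {wss:.3f}, silhouette = {sil:.3f}")


def plot_clusters(x, y, labels, xlabel, ylabel):
    """Scatter plot of two columns colored by cluster label (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily so the rest runs without matplotlib

    fig, ax = plt.subplots()
    ax.scatter(x, y, c=labels, s=10, cmap="viridis")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    return fig
```

In the notebook, something like `plot_clusters(df["total_bedrooms"], df["median_house_value"], labels, "total_bedrooms", "median_house_value")` would draw the requested view, assuming `df` holds the original (unscaled) columns in the same row order as the matrix used for fitting.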
C. Find a suitable \(K\): For \(K\in\{1,2,3,\dots,10\}\), perform KMeans with each value of \(K\) and store the WSS and the Silhouette Coefficient associated with each \(K\). Note that the Silhouette Coefficient is only defined for \(K\ge 2\).
Plot WSS against \(K\) to detect the elbow of the curve.
Can you detect a suitable \(K\)?
Visualize the Silhouette Coefficient vs \(K\) curve. Decide on \(K\).
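The search over \(K\) in part C might look like the sketch below. It stores WSS for every \(K\) and the Silhouette Coefficient only for \(K\ge 2\), since `silhouette_score` needs at least two clusters. The toy data has three well-separated groups by construction and stands in for the scaled housing data; `plot_curves` is a hypothetical helper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with three well-separated groups (assumption: stand-in for the real matrix).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

wss = {}  # K -> within-cluster sum of squares
sil = {}  # K -> Silhouette Coefficient (defined only for K >= 2)
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = float(km.inertia_)
    if k >= 2:
        sil[k] = float(silhouette_score(X, km.labels_))

# One simple rule: pick the K with the highest Silhouette Coefficient.
best_k = max(sil, key=sil.get)
print("K with the highest silhouette:", best_k)


def plot_curves(wss, sil):
    """Plot the WSS (elbow) curve and the silhouette curve side by side (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily so the loop above runs without matplotlib

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(list(wss.keys()), list(wss.values()), marker="o")
    ax1.set_xlabel("K")
    ax1.set_ylabel("WSS")
    ax2.plot(list(sil.keys()), list(sil.values()), marker="o")
    ax2.set_xlabel("K")
    ax2.set_ylabel("Silhouette Coefficient")
    return fig
```

On real data the elbow and the silhouette maximum do not always agree, so it is worth looking at both curves via `plot_curves(wss, sil)` before settling on \(K\).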
2. Cybersecurity Intrusion Detection Dataset (Optional/Exploration)
This Cybersecurity Intrusion Detection Dataset is designed for detecting cyber intrusions based on network traffic and user behavior. The description of the data is available here: https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset.
A. Import the data into the environment.
B. Preprocess and clean the data.
C. Perform clustering algorithms and detect a suitable number of clusters \(K\) for this dataset.
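Since the objective also mentions hierarchical clustering, one way to approach part C is to compare several values of \(K\) with agglomerative clustering and the Silhouette Coefficient. The sketch below is illustrative only: it uses synthetic two-group data rather than the intrusion dataset, and Ward linkage is an arbitrary choice. The real data would first need the cleaning, encoding, and scaling from part B.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with two well-separated groups (assumption: not the real dataset).
X, _ = make_blobs(
    n_samples=200,
    centers=[[-4, 0], [4, 0]],
    cluster_std=0.6,
    random_state=1,
)

scores = {}  # K -> Silhouette Coefficient of the hierarchical clustering cut at K clusters
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = float(silhouette_score(X, labels))

best_k = max(scores, key=scores.get)
print({k: round(s, 3) for k, s in scores.items()})
print("K with the highest silhouette:", best_k)
```

Other linkage options (`"complete"`, `"average"`, `"single"`) can give quite different structures, so comparing a few of them alongside KMeans is a reasonable exploration step.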