Introduction to Data Mining

CSCI-866-001: Data Mining & Knowledge Discovery

Lecturer: Dr. Sothea HAS

About the course

Objective: Equip you with essential Data Mining skills to uncover insights/knowledge from data and make informed decisions.
Grading Criteria

Criteria	Percentage
Attendance	10%
Participation & quiz	30%
Midterm Exam	30%
Final Project & Presentation / Practical labs	30%

Programming:

Language	Tools
Python	Jupyter Notebook, Google Colab, Matplotlib, Seaborn
R	R Markdown/Quarto, Posit Cloud, ggplot2, tidyverse

Where to visit

Canvas & Course webpage: https://hassothea.github.io/Data_Mining_AUPP/

Just want to get to know you a bit 👇

Let’s see 🫣

Introduction to
Data Mining

📋 Outline

Motivation & Introduction
Data Mining Tasks
KDD vs CRISP-DM Process
Data Mining vs Data Analytics
Applications & Challenges of Data Mining

Motivation & Introduction

Motivation

Amazon rating dataset (\(2\)M+ rows, \(4\) cols)

Code

import pandas as pd                 # Data loading and manipulation
import seaborn as sns               # Package for beautiful graphs

# Local folder of the Kaggle "amazon-ratings" download (adjust to your machine)
path = "C:/Users/hasso/.cache/kagglehub/datasets/skillsmuggler/amazon-ratings/versions/1/"
df = pd.read_csv(path + "ratings_Beauty.csv")
df.head()                           # Preview the first 5 rows

	UserId	ProductId	Rating	Timestamp
0	A39HTATAQ9V7YF	0205616461	5.0	1369699200
1	A3JM6GV9MNOF9X	0558925278	3.0	1355443200
2	A1Z513UWSAAO0F	0558925278	5.0	1404691200
3	A1WMRR494NWEWV	0733001998	4.0	1382572800
4	A3IAAVS479H7M7	0737104473	1.0	1274227200

What insights/patterns can we draw from this dataset?
How might this knowledge be useful for our work/business?
These lead to Data Mining.

Introduction

Data Mining: a process of extracting valuable insights and identifying patterns from massive datasets.
⚠️ Mining knowledge within the data rather than simply retrieving raw data!

It involves using:

Statistics:
- Descriptive
- Pattern recognition
- Validation techniques…

Software:
- Python
- SQL
- Power BI…

Techniques & algorithms:
- Machine learning
- Clustering
- Classification/regression
- Visualization methods…

Ex: It can identify popular products or individual preferences…

Data Mining Tasks

Predictive:
- Classification: Categorical target.
- Regression: Numerical target.
- Time series: Sequential numerical target.
Descriptive:
- Association rules:
  - Ex: watch anime \(\Rightarrow\) read manga.
- Clustering:
  - Ex: Customer segmentation.
- Sequential discovery:
  - Sequence of purchases and preferences.
- Summarization:
  - Reduction
  - Insightful visualization…
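To make the association-rule example concrete, here is a minimal sketch of how the support and confidence of a rule such as "watch anime ⇒ read manga" could be computed. The transactions and the helper names `support` and `confidence` are made up for illustration; they are not part of the course dataset.

```python
# Toy transactions (hypothetical data, invented for illustration).
transactions = [
    {"anime", "manga"},
    {"anime", "manga", "games"},
    {"anime"},
    {"manga"},
    {"games"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rule_support = support({"anime", "manga"}, transactions)
rule_confidence = confidence({"anime"}, {"manga"}, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

In this toy data, 2 of the 5 transactions contain both items and 3 contain anime, so the rule has support 0.40 and confidence of about 0.67.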

KDD Process

The Knowledge Discovery from Data (KDD) process gives the full picture of Data Mining.

1. Selection

Gather and select data from appropriate sources.
The data (warehouse) can be raw or secondary (already organized).
The data should be relevant for our analysis.
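As a rough sketch of the selection step, the snippet below keeps only the rows and columns relevant to one hypothetical question (ratings given from 2013 onwards). The three rows are copied from the dataset preview; the name `df_toy`, the cutoff date, and the assumption that `Timestamp` is Unix time in seconds are illustrative choices, not statements from the course.

```python
import pandas as pd

# Three rows copied from the preview of the ratings table.
df_toy = pd.DataFrame({
    "UserId": ["A39HTATAQ9V7YF", "A3JM6GV9MNOF9X", "A1Z513UWSAAO0F"],
    "ProductId": ["0205616461", "0558925278", "0558925278"],
    "Rating": [5.0, 3.0, 5.0],
    "Timestamp": [1369699200, 1355443200, 1404691200],
})

# Assumption: Timestamp holds Unix time in seconds.
df_toy["Date"] = pd.to_datetime(df_toy["Timestamp"], unit="s")

# Select the relevant rows (hypothetical cutoff) and columns.
cutoff = pd.Timestamp("2013-01-01")
selected = df_toy.loc[df_toy["Date"] >= cutoff,
                      ["UserId", "ProductId", "Rating", "Date"]]
print(selected)
```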

2. Preprocessing

Clean the data and remove what is irrelevant to the analysis.
This includes fixing data types and handling missing values, outliers, and inconsistent data…
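A minimal preprocessing sketch with pandas could look like the following. The messy table `raw` is invented for illustration, and the rule that valid ratings lie between 1 and 5 is an assumption based on the rating values seen in the dataset preview.

```python
import pandas as pd

# Hypothetical messy ratings table (values invented for illustration).
raw = pd.DataFrame({
    "UserId": ["u1", "u2", "u2", "u3", "u4"],
    "ProductId": ["p1", "p1", "p1", "p2", None],
    "Rating": ["5", "3", "3", "11", "4"],  # stored as text, one out of range
})

clean = (
    raw.drop_duplicates()                  # remove exact duplicate rows
       .dropna(subset=["ProductId"])       # drop rows with a missing product
       .assign(Rating=lambda d: pd.to_numeric(d["Rating"], errors="coerce"))
)

# Assumption: valid ratings lie on a 1-5 scale; drop anything else.
clean = clean[clean["Rating"].between(1, 5)].reset_index(drop=True)
print(clean)
```

Whether to drop or impute problematic rows depends on the analysis; dropping is used here only because it is the simplest option to show.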

3. Transformation

Convert data by normalizing, standardizing, encoding…
Organize data in a suitable way for the analysis.
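The three transformations named above can be sketched in a few lines of pandas. The ratings are the five values from the dataset preview; the `Category` column is hypothetical and only serves to show one-hot encoding.

```python
import pandas as pd

# The five ratings shown in the dataset preview.
ratings = pd.Series([5.0, 3.0, 5.0, 4.0, 1.0], name="Rating")

# Min-max normalization: rescale values to the interval [0, 1].
min_max = (ratings - ratings.min()) / (ratings.max() - ratings.min())

# Standardization (z-score): zero mean and unit standard deviation.
z_score = (ratings - ratings.mean()) / ratings.std(ddof=0)

# One-hot encoding of a categorical column (hypothetical categories).
category = pd.Series(["skin", "hair", "skin"], name="Category")
one_hot = pd.get_dummies(category)

print(min_max.tolist())
print(z_score.round(2).tolist())
print(one_hot)
```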

4. Data Mining

This is where ML algorithms such as clustering, prediction, or dimensionality reduction are applied.
Each method is chosen according to the objective: descriptive or predictive.
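As one possible illustration of a descriptive method, here is a small sketch of k-means clustering (plain Lloyd's algorithm) written with NumPy. The function `kmeans`, its parameters, and the six hypothetical customers are assumptions made for this example; in practice a library implementation would normally be preferred.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Hypothetical customers: (number of ratings, mean rating).
X = np.array([[2, 4.5], [3, 5.0], [1, 4.0],
              [300, 3.9], [320, 4.1], [280, 4.3]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

The features are left unscaled here, so the number of ratings dominates the distance; applying the transformation step first would give the mean rating more weight.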

5. Interpretation/Evaluation

Interpret the insights/knowledge from the previous step.
Summarize the results with comprehensible graphs and numbers.
Generate reports and visualizations for technical and non-technical audiences.

CRISP-DM Process

CRISP-DM (Cross-Industry Standard Process for Data Mining) offers another view of the Data Mining process, oriented toward industry and business.

Data Mining vs Data Analytics

Applications & Challenges

Applications of Data Mining

Data Mining is a process/method that can be applied to solve various types of problems according to the data and purpose.

Challenges in Data Mining

Application on Amazon dataset

Customer, Product & Rating Overview

Code

import plotly.graph_objects as go

# Number of rows per customer (value_counts already sorts in descending order)
users = df['UserId'].value_counts()
print(users.head(5).to_frame())
print("\n")
# Bar chart of the 10 customers with the highest counts
fig = go.Figure(go.Bar(x=users.index[:10], y=users.values[:10]))
fig.update_layout(title="10 customers with the most purchases",
    xaxis=dict(title="Customer ID"),
    yaxis=dict(title="Count"),
    height=350, width=300)
fig.update_xaxes(tickangle=-30)
fig.show()

                count
UserId               
A3KEZLJ59C1JVH    389
A281NPSIMI1C2R    336
A3M174IC0VXOS2    326
A2V5R832QCSOMX    278
A3LJLRIZL38GG3    276

Code

prods = df['ProductId'].value_counts()\
    .sort_values(ascending=False)
print(prods.head(5).to_frame())
print("\n")
fig1 = go.Figure(
    go.Bar(x=prods.index[:10], 
        y=prods.values[:10]))
fig1.update_layout(title="10 most purchased products",
    xaxis=dict(title="Product ID"),
    yaxis=dict(title="Count"),
    height=340, width=300)
fig1.update_xaxes(tickangle=-30)
fig1.show()

            count
ProductId        
B001MA0QY2   7533
B0009V1YR8   2869
B0043OYFKU   2477
B0000YUXI0   2143
B003V265QW   2088

Code

rate = df['Rating'].value_counts()\
    .sort_values(ascending=False)
print(rate.head(5).to_frame())
print("\n")
fig2 = go.Figure(
    go.Bar(x=rate.index[:10], 
        y=rate.values[:10]))
fig2.update_layout(title="Rating distribution",
    xaxis=dict(title="Rating"),
    yaxis=dict(title="Count"),
    height=300, width=300)
fig2.show()

          count
Rating         
5.0     1248721
4.0      307740
1.0      183784
3.0      169791
2.0      113034
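The counts printed above can be turned into a few summary numbers by hand. The short sketch below redoes that arithmetic in plain Python, using only the five counts from the table.

```python
# Rating counts as printed in the table above.
counts = {5.0: 1248721, 4.0: 307740, 1.0: 183784, 3.0: 169791, 2.0: 113034}

total = sum(counts.values())                               # number of ratings
mean_rating = sum(r * c for r, c in counts.items()) / total
share_five = counts[5.0] / total                           # fraction of 5-star ratings

print(f"total ratings   = {total:,}")
print(f"mean rating     = {mean_rating:.3f}")
print(f"share of 5-star = {share_five:.1%}")
```

This gives 2,023,070 ratings in total, a mean rating of about 4.149, and roughly 61.7% five-star ratings, so the distribution is strongly skewed toward high scores.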

Product preferences per customer

Code

from plotly.subplots import make_subplots

# One subplot per customer, for the 3 customers with the highest counts
fig = make_subplots(rows=3, cols=1)
for i, usr in enumerate(users.index[:3]):
    # Count how many times this customer appears with each product
    rate_ = df.loc[df['UserId'] == usr, 'ProductId'].value_counts()
    fig.add_trace(go.Bar(x=rate_.index[:10].astype(object),
        y=rate_.values[:10], showlegend=False), row=i+1, col=1)
    fig.update_yaxes(
        title=dict(
            text=f"Customer {usr}",
            font=dict(size=9)),
        col=1, row=i+1)
fig.update_layout(title="Product count per customer",
    height=500)
fig.show()

Customer preference level

Code

from plotly.subplots import make_subplots

# One subplot per customer, for the 3 customers with the highest counts
fig = make_subplots(rows=3, cols=1)
for i, usr in enumerate(users.index[:3]):
    # Count how many times this customer gave each rating value
    rate_ = df.loc[df['UserId'] == usr, 'Rating'].value_counts()
    fig.add_trace(go.Bar(x=rate_.index[:10].astype(object),
        y=rate_.values[:10], showlegend=False), row=i+1, col=1)
    fig.update_yaxes(
        title=dict(
            text=f"Customer {usr}",
            font=dict(size=9)),
        col=1, row=i+1)
fig.update_layout(title="Rating count per customer",
    height=500)
fig.show()
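The per-customer charts can also be condensed into one summary table. The following is only a sketch on invented data: the frame `toy` and the column names `n_ratings` and `mean_rating` are hypothetical, and the same `groupby` call would apply to the `UserId` and `Rating` columns of the full dataset.

```python
import pandas as pd

# Hypothetical ratings from two customers (invented for illustration).
toy = pd.DataFrame({
    "UserId": ["u1", "u1", "u1", "u2", "u2"],
    "Rating": [5.0, 4.0, 5.0, 1.0, 2.0],
})

# Number of ratings and average rating for each customer.
summary = toy.groupby("UserId")["Rating"].agg(n_ratings="count",
                                              mean_rating="mean")
print(summary)
```

A customer with a high mean rating tends to rate generously, while a low mean suggests a more critical customer; such differences matter when comparing ratings across customers.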

Summary

Data Mining aims to extract knowledge from vast datasets.
Core Techniques:
- Classification (e.g., decision trees, SVM).
- Clustering (e.g., k-means, hierarchical).
- Association Rules (e.g., market basket analysis).
- Regression & Outlier Detection.
Process: Data selection \(\Rightarrow\) Preprocessing \(\Rightarrow\) Transformation \(\Rightarrow\) Data Mining \(\Rightarrow\) Interpretation/Knowledge.
Challenges: Data quality, privacy/ethics, scalability, and interpretability of models.

🥳 Yeahhhh….

Party time… 🥂
