Introduction to Data Mining


CSCI-866-001: Data Mining & Knowledge Discovery



Lecturer: Dr. Sothea HAS

About the course

  • Objective: Equip you with essential Data Mining skills to uncover insights/knowledge from data and make informed decisions.

  • Grading Criteria

Criteria                                        Percentage
Attendance                                      10%
Participation & quiz                            30%
Midterm Exam                                    30%
Final Project & Presentation / Practical labs   30%
  • Programming:

Where to visit

📋 Outline

  • Motivation & Introduction

  • Data Mining Tasks

  • KDD vs CRISP-DM Process

  • Data Mining vs Data Analytics

  • Applications & Challenges of Data Mining

Motivation & Introduction

Motivation

Amazon rating dataset (\(2\)M+ rows, \(4\) cols)

Code
import pandas as pd                 # Import pandas package
# Folder where kagglehub cached the dataset on the lecturer's machine
path = "C:/Users/hasso/.cache/kagglehub/datasets/skillsmuggler/amazon-ratings/versions/1/"
df = pd.read_csv(path + "ratings_Beauty.csv")   # Load the Amazon Beauty ratings
df.head()                           # Preview the first five rows
           UserId   ProductId  Rating   Timestamp
0  A39HTATAQ9V7YF  0205616461     5.0  1369699200
1  A3JM6GV9MNOF9X  0558925278     3.0  1355443200
2  A1Z513UWSAAO0F  0558925278     5.0  1404691200
3  A1WMRR494NWEWV  0733001998     4.0  1382572800
4  A3IAAVS479H7M7  0737104473     1.0  1274227200
  • What insights/patterns can we draw from this dataset?
  • How might this knowledge be useful for our work/business?
  • These lead to Data Mining.

Introduction

  • Data Mining: a process of extracting valuable insights and identifying patterns from massive datasets.
  • ⚠️ Mining knowledge within the data rather than simply retrieving raw data!

It involves using:

  • Statistics:
    • Descriptive
    • Pattern recognition
    • Validation techniques…
  • Software:
    • Python
    • SQL
    • Power BI…
  • Techniques & algorithms:
    • Machine learning
    • Clustering
    • Classification/regression
    • Visualization methods…
  • Ex: It can identify popular products or individual preferences…

Data Mining Tasks

  • Predictive:
    • Classification: Categorical target.
    • Regression: Numerical target.
    • Time series: Sequential numerical target.
  • Descriptive:
    • Association rules:
      • Ex: watch anime \(\Rightarrow\) read manga.
    • Clustering:
      • Ex: Customer segmentation.
    • Sequential discovery:
      • Sequence of purchases and preferences.
    • Summarization:
      • Reduction
      • Insightful visualization…
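To make the association-rule example concrete, here is a minimal sketch of how the strength of a rule such as watch anime ⇒ read manga is usually measured, using support and confidence. The transactions are made-up toy data and the helper names `support` and `confidence` are chosen for illustration only.

```python
# Toy transactions (hypothetical data): the activities each person reported.
transactions = [
    {"watch anime", "read manga"},
    {"watch anime", "read manga", "play games"},
    {"watch anime"},
    {"read manga"},
    {"play games"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) = support(A and C) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


rule_support = support({"watch anime", "read manga"}, transactions)
rule_confidence = confidence({"watch anime"}, {"read manga"}, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

A rule is typically kept only when both values exceed thresholds chosen by the analyst.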

KDD Process

  • The Knowledge Discovery from Data (KDD) process gives the full picture of Data Mining.

1. Selection

  • Gather and select data from appropriate sources.
  • The data (warehouse) can be raw or secondary (already organized).
  • The data should be relevant for our analysis.
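A possible sketch of the selection step with pandas, assuming a small made-up extract that has the same four columns as the ratings file plus one irrelevant column (`PageColor` is hypothetical). The cut-off date is also just an assumption for illustration.

```python
import pandas as pd

# Hypothetical raw extract: the four rating columns plus an irrelevant one.
raw = pd.DataFrame({
    "UserId": ["U1", "U2", "U1", "U3"],
    "ProductId": ["P1", "P1", "P2", "P3"],
    "Rating": [5.0, 3.0, 4.0, 1.0],
    "Timestamp": [1369699200, 1355443200, 1404691200, 1274227200],
    "PageColor": ["red", "blue", "red", "green"],
})

# Keep only the columns that are relevant for the analysis.
relevant_cols = ["UserId", "ProductId", "Rating", "Timestamp"]
selected = raw[relevant_cols]

# Keep only the rows in the period of interest (here: 2013 onwards).
rated_at = pd.to_datetime(selected["Timestamp"], unit="s")
selected = selected[rated_at >= pd.Timestamp("2013-01-01")]
print(selected)
```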

2. Preprocessing

  • Clean the data and remove what is irrelevant to the analysis.
  • This includes encoding data types and handling missing values, outliers, and inconsistent data…
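The cleaning operations listed above could look roughly like this in pandas. The data frame is invented for illustration, and dropping problematic rows is only one possible policy (imputing missing values is another).

```python
import pandas as pd

# Hypothetical raw ratings with typical problems: a duplicated row,
# ratings stored as text, one missing rating and one impossible value.
raw = pd.DataFrame({
    "UserId": ["U1", "U2", "U2", "U3", "U4"],
    "ProductId": ["P1", "P1", "P1", "P2", "P3"],
    "Rating": ["5", "3", "3", None, "50"],
})

clean = raw.drop_duplicates()                      # remove exact duplicate rows
clean = clean.assign(                              # type encoding: text -> number
    Rating=pd.to_numeric(clean["Rating"], errors="coerce"))
clean = clean.dropna(subset=["Rating"])            # drop rows with a missing rating
clean = clean[clean["Rating"].between(1, 5)]       # drop ratings outside the 1-5 scale
print(clean)
```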

3. Transformation

  • Convert data by normalizing, standardizing, encoding…
  • Organize data in a suitable way for the analysis.
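As a small illustration of these transformations, the sketch below applies min-max normalization, z-score standardization and one-hot encoding to a made-up table. The `Category` column and its values are assumptions, not part of the Amazon dataset.

```python
import pandas as pd

# Hypothetical data: a numeric column and a categorical column.
data = pd.DataFrame({
    "Rating": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Category": ["skin", "hair", "skin", "nails", "hair"],
})

r = data["Rating"]
# Normalizing: rescale values to the interval [0, 1].
data["Rating_minmax"] = (r - r.min()) / (r.max() - r.min())
# Standardizing: zero mean and unit (population) standard deviation.
data["Rating_zscore"] = (r - r.mean()) / r.std(ddof=0)
# Encoding: one indicator column per category.
encoded = pd.get_dummies(data, columns=["Category"])
print(encoded.head())
```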

4. Data Mining

  • This is where ML algorithms (clustering, prediction, or dimensionality reduction) are applied.
  • Each method is chosen according to the objective: descriptive or predictive.
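As one example of a descriptive method, here is a bare-bones k-means clustering sketch in plain Python. The customer features are invented, the initialization (first k points) is a simplifying assumption, and in practice a library implementation with scaled features would normally be preferred.

```python
def kmeans(points, k, n_iter=10):
    """Minimal k-means (Lloyd's algorithm) on a list of 2-D points."""
    centers = list(points[:k])  # naive initialization: first k points (assumption)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k),
                key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2)
            for x, y in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centers.append((mean_x, mean_y))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        centers = new_centers
    return labels, centers


# Hypothetical customers: (number of ratings given, average rating).
customers = [(2, 4.5), (3, 5.0), (1, 4.0), (40, 2.0), (38, 2.5), (42, 1.5)]
labels, centers = kmeans(customers, k=2)
print(labels)
print(centers)
```

Here the two groups can be read as occasional, satisfied raters versus frequent, critical raters, which is the kind of customer segmentation mentioned under clustering.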

5. Interpretation/Evaluation

  • Interpret the insights/knowledge from the previous step.
  • Summarize results: comprehensible graphs and numbers.
  • Generate reports and visualizations for technical and non-technical audiences.
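For a predictive task, "comprehensible numbers" often means a few standard evaluation metrics. The sketch below computes accuracy, precision and recall by hand, assuming made-up true labels and predictions where 1 means "the customer liked the product".

```python
# Hypothetical ground truth and model predictions (1 = liked, 0 = did not like).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of correct predictions
precision = tp / (tp + fp)          # of predicted "liked", how many were right
recall = tp / (tp + fn)             # of actual "liked", how many were found
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```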

CRISP-DM Process

  • CRISP-DM is another view of the Data Mining process, widely used in industry/business domains.

Data Mining vs Data Analytics

Applications & Challenges

Applications of Data Mining

  • Data Mining is a process/method that can be applied to solve various types of problems according to the data and purpose.

Challenges in Data Mining

Application on Amazon dataset

Customer, Product & Rating Overview

Code
import plotly.graph_objects as go
# Number of ratings per customer (value_counts already sorts in descending order)
users = df['UserId'].value_counts()
print(users.head(5).to_frame())
print("\n")
# Bar chart of the 10 customers with the most ratings
fig = go.Figure(go.Bar(x=users.index[:10], y=users.values[:10]))
fig.update_layout(title="Top 10 customers by number of ratings",
    xaxis=dict(title="Customer ID"),
    yaxis=dict(title="Count"),
    height=350, width=300)
fig.update_xaxes(tickangle=-30)
fig.show()
                count
UserId               
A3KEZLJ59C1JVH    389
A281NPSIMI1C2R    336
A3M174IC0VXOS2    326
A2V5R832QCSOMX    278
A3LJLRIZL38GG3    276

Code
# Number of ratings per product (value_counts already sorts in descending order)
prods = df['ProductId'].value_counts()
print(prods.head(5).to_frame())
print("\n")
# Bar chart of the 10 products with the most ratings
fig1 = go.Figure(
    go.Bar(x=prods.index[:10],
        y=prods.values[:10]))
fig1.update_layout(title="10 most rated products",
    xaxis=dict(title="Product ID"),
    yaxis=dict(title="Count"),
    height=340, width=300)
fig1.update_xaxes(tickangle=-30)
fig1.show()
            count
ProductId        
B001MA0QY2   7533
B0009V1YR8   2869
B0043OYFKU   2477
B0000YUXI0   2143
B003V265QW   2088

Code
# Number of rows for each rating value, most frequent first
rate = df['Rating'].value_counts()
print(rate.head(5).to_frame())
print("\n")
# Bar chart of the rating distribution
fig2 = go.Figure(
    go.Bar(x=rate.index,
        y=rate.values))
fig2.update_layout(title="Rating distribution",
    xaxis=dict(title="Rating"),
    yaxis=dict(title="Count"),
    height=300, width=300)
fig2.show()
          count
Rating         
5.0     1248721
4.0      307740
1.0      183784
3.0      169791
2.0      113034
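The counts printed above can be turned into shares and an average rating with a few lines of arithmetic. This sketch copies the five counts from the output rather than reloading the file, so it can be run on its own.

```python
# Counts copied from the printed rating distribution above.
rating_counts = {5.0: 1248721, 4.0: 307740, 1.0: 183784, 3.0: 169791, 2.0: 113034}

total = sum(rating_counts.values())                           # total number of ratings
shares = {r: c / total for r, c in rating_counts.items()}     # share of each rating
mean_rating = sum(r * c for r, c in rating_counts.items()) / total

for r in sorted(shares, reverse=True):
    print(f"{r:.0f} stars: {shares[r]:.1%}")
print(f"total = {total}, mean rating = {mean_rating:.2f}")
```

Roughly 62% of all ratings are 5 stars and the mean is about 4.15, so the distribution is strongly skewed towards positive ratings.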

Application on Amazon dataset

Product preferences per customer

Code
from plotly.subplots import make_subplots
fig = make_subplots(rows=3, cols=1)
# One subplot per customer, for the three most active customers
for i, usr in enumerate(users.index[:3]):
    # Count how many times this customer rated each product
    rate_ = df.loc[df['UserId'] == usr, 'ProductId'].value_counts()
    fig.add_trace(go.Bar(x=rate_.index[:10].astype(object),
        y=rate_.values[:10], showlegend=False), row=i+1, col=1)
    fig.update_yaxes(
        title=dict(
            text=f"Customer {usr}",
            font=dict(size=9)),
        col=1, row=i+1)
fig.update_layout(title="Product count per customer",
    height = 500)
fig.show()

Application on Amazon dataset

Customer preference level

Code
from plotly.subplots import make_subplots
fig = make_subplots(rows=3, cols=1)
# One subplot per customer, for the three most active customers
for i, usr in enumerate(users.index[:3]):
    # Count how often this customer gave each rating value
    # (selecting with .loc avoids assigning to a filtered copy of df)
    rate_ = df.loc[df['UserId'] == usr, 'Rating'].value_counts()
    fig.add_trace(go.Bar(x=rate_.index[:10].astype(object),
        y=rate_.values[:10], showlegend=False), row=i+1, col=1)
    fig.update_yaxes(
        title=dict(
            text=f"Customer {usr}",
            font=dict(size=9)),
        col=1, row=i+1)
fig.update_layout(title="Rating count per customer",
    height = 500)
fig.show()

Summary

  • Data Mining aims to extract knowledge from vast datasets.

  • Core Techniques:

    • Classification (e.g., decision trees, SVM).
    • Clustering (e.g., k-means, hierarchical).
    • Association Rules (e.g., market basket analysis).
    • Regression & Outlier Detection.
  • Process: Data selection \(\Rightarrow\) Preprocessing \(\Rightarrow\) Transformation \(\Rightarrow\) Data Mining \(\Rightarrow\) Interpretation/Knowledge.

  • Challenges: Data quality, privacy/ethics, scalability, and interpretability of models.

🥳 Yeahhhh….

Party time… 🥂