👋 Introduction

Welcome to the Data Mining and Knowledge Discovery Course!

This course provides a comprehensive introduction to the principles, algorithms, and applications of data mining, equipping you with the tools to transform raw data into actionable knowledge. From preprocessing and exploratory analysis, through model development for classification and regression, to clustering and association rule mining, you’ll learn how to tackle real-world challenges in fields such as healthcare, finance, and marketing.

📋 Course Overview

🗝️ Key Topics Covered

1. Fundamentals of Data Mining

The knowledge discovery process (KDD)
Data types, quality, and preprocessing (cleaning, transformation)
Exploratory Data Analysis (EDA) for knowledge extraction

2. Core Techniques

Classification/Regression (kNN, Decision Trees, SVM, Naïve Bayes)
Clustering (k-Means, Hierarchical, DBSCAN)
Association Rule Mining
Anomaly Detection

3. Advanced Methods

Dimensionality reduction (PCA, t-SNE)
Text mining and NLP basics
Deep learning for data mining (introductory)

4. Practical Applications

Case studies in healthcare, e-commerce, and social networks
Ethical considerations and pitfalls (bias, privacy)

By the end of this course, you’ll have a solid grounding in data mining and its techniques, enabling you to extract insights from data and make data-driven decisions. If you are looking to enhance your data mining skills, this course is designed for you.

📝 Course Criteria

Criteria	Percentage
Attendance	10%
Participation & quiz	30%
Midterm Exam	30%
Final Project & Presentation / Practical labs	30%

💻 Programming:

You are free to use your favorite programming language, for example Python.

🗺️ Course progress

Note: The following table of contents will be updated progressively as the course advances.

Topic	Lab	Solution	Remark
Introduction to Data Mining	Lab1: Introduction	Solution1	…Loading
Data Comprehension & Preprocessing	Lab2: Preprocessing	Solution2	…Loading
Basic Data Analysis & Visualization	Lab3: Basic DA / Lab3: Visualization	Solution3	…Loading
Classification: NBC & Logistic Regression	Lab4: NBC & Logistic Regression	Solution4	…Loading
Classification: KNN & Decision Trees	Lab5: KNN & Trees	Solution5	…Loading
Classification: SVM & Ensemble Methods	Lab6: SVM & Ensemble Methods	Solution6	…Loading
Model Evaluation	--------	--------	--------
Dimensionality Reduction Methods	--------	--------	--------
k-Means & Hierarchical Clustering	Lab7: Clustering	Solution7	…Loading
Text Mining	--------	--------	--------

📄 Midterms, Exams and Projects

In this section, you will find all information related to the midterms, exams, and projects, including instructions, start dates, and deadlines.

📄 Midterm & Exam

A possible midterm date: ...Loading.

📄 Project:

Deadline for the report: ...Loading.
Where to submit: Canvas
Your report should be submitted as a PDF and include the following sections:

1. Introduction

Objective: Clearly define the problem (e.g., classification, clustering, pattern mining).
Dataset: Describe the source, size, and features (e.g., UCI Repository, Kaggle).
Relevance: Why is this problem interesting from a data mining perspective?

2. Data Preprocessing

Data Cleaning: Handling missing values, duplicates, noise (e.g., binning, interpolation).
Feature Transformation: Normalization, discretization, encoding (e.g., one-hot).
Feature Selection: Techniques used (e.g., PCA, correlation analysis, wrapper methods).
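
As an illustration of the cleaning and transformation steps listed above, here is a minimal sketch in plain Python. The column values and the helper names (`fill_missing_with_mean`, `min_max_normalize`, `one_hot_encode`) are hypothetical and chosen only for this example; mean imputation is just one of several possible ways to handle missing values.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot_encode(labels):
    """Map each categorical label to a 0/1 indicator vector."""
    categories = sorted(set(labels))
    vectors = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, vectors

# Hypothetical toy columns, not taken from any course dataset.
ages = [20, None, 40, 30]
colors = ["red", "blue", "red"]

filled = fill_missing_with_mean(ages)
scaled = min_max_normalize(filled)
categories, encoded = one_hot_encode(colors)
```

In a real project, a library such as pandas or scikit-learn would typically handle these steps; the point here is only to make the operations concrete.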

3. Exploratory Data Analysis (EDA)

Descriptive Statistics: Mean, variance, distributions (include tables/visualizations).
Visualizations: Heatmaps, histograms, scatter plots for feature relationships.
Insights: Uncover preliminary patterns (e.g., class imbalance, outliers).
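
The EDA items above can be sketched with the Python standard library alone. The data below is a hypothetical toy sample, the helper names are illustrative, and the 1.5 × IQR rule is one common convention for flagging outliers rather than a course requirement.

```python
from collections import Counter
from statistics import mean, pvariance, quantiles

def describe(values):
    """Basic descriptive statistics for one numeric feature."""
    return {
        "count": len(values),
        "mean": mean(values),
        "variance": pvariance(values),  # population variance
        "min": min(values),
        "max": max(values),
    }

def class_balance(labels):
    """Fraction of samples per class, useful for spotting imbalance."""
    total = len(labels)
    return {label: n / total for label, n in Counter(labels).items()}

def iqr_outliers(values, factor=1.5):
    """Values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [v for v in values if v < low or v > high]

# Hypothetical toy data.
feature = [10, 12, 11, 13, 12, 95]
labels = ["spam", "ham", "ham", "ham"]

summary = describe(feature)
balance = class_balance(labels)
outliers = iqr_outliers(feature)
```

A summary like this would normally sit next to the histograms and scatter plots in the report, not replace them.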

4. Data Mining Techniques Applied

(Split into subsections based on your project’s focus)

A. Model Development

Algorithms: Justify choices (e.g., Decision Trees for interpretability, SVM for high dimensions).
Training/Testing: Justify your split strategy (e.g., 80/20 hold-out, cross-validation).
Hyperparameter Tuning: Methods used (e.g., grid search, random search).
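
To make the split and tuning steps concrete, the following sketch combines an 80/20 hold-out split, k-fold cross-validation, and a small grid search over the number of neighbours of a toy 1-D kNN classifier. Everything here (the data, the helper names, the grid `[1, 3, 5]`) is an assumption made for illustration.

```python
import random
from collections import Counter

def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, folds[i]

def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(rows, k_neighbors, folds=5):
    """Cross-validated accuracy of kNN for one hyperparameter value."""
    correct = 0
    for train_idx, val_idx in k_fold_indices(len(rows), folds):
        fold_train = [rows[i] for i in train_idx]
        for i in val_idx:
            x, y = rows[i]
            correct += knn_predict(fold_train, x, k_neighbors) == y
    return correct / len(rows)

# Hypothetical, well-separated toy data: (feature, label).
data = [(x, "low") for x in range(10)] + [(x, "high") for x in range(20, 30)]
train, test = train_test_split(data, test_ratio=0.2, seed=42)

# Grid search: tune on the training set only, then score once on the test set.
grid = [1, 3, 5]
scores = {k: cv_accuracy(train, k) for k in grid}
best_k = max(scores, key=scores.get)
test_accuracy = sum(knn_predict(train, x, best_k) == y for x, y in test) / len(test)
```

The key design choice is that the test set is never used to pick `best_k`. scikit-learn offers `train_test_split` and `GridSearchCV` for the same workflow.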

B. Alternative Approaches

Compare at least 2 techniques (e.g., clustering with k-means vs. DBSCAN).
Mention ensemble methods (e.g., Random Forest, Boosting) if applicable.

5. Results & Evaluation

Metrics: Use domain-appropriate measures:
- Classification: Accuracy, Precision, Recall, ROC-AUC.
- Clustering: Silhouette Score, Dunn Index.
- Association Rules: Support, Confidence, Lift.
Visual Evidence: Confusion matrices, elbow plots, dendrograms.
Benchmarking: Compare against baselines (e.g., Naïve Bayes as a simple model).
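
Several of the metrics listed above follow directly from simple counts. The sketch below computes accuracy, precision, and recall for a binary classifier, and support, confidence, and lift for one association rule. The labels, transactions, and function names are hypothetical examples.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall from true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent.

    Assumes both itemsets occur in at least one transaction.
    """
    n = len(transactions)
    n_a = sum(antecedent <= t for t in transactions)
    n_c = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    confidence = n_both / n_a
    return {
        "support": n_both / n,
        "confidence": confidence,
        "lift": confidence / (n_c / n),
    }

# Hypothetical predictions from some binary classifier.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
clf = classification_metrics(y_true, y_pred)

# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule = rule_metrics(baskets, {"bread"}, {"milk"})
```

A lift below 1, as for this toy rule, suggests the two itemsets co-occur less often than they would if they were independent.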

6. Discussion & Challenges

Limitations: Data quality, computational constraints, assumptions.
Business/Research Implications: How do results translate to real-world solutions?

7. Conclusion & Future Work

Summarize key findings (e.g., “k=3 clusters best segmented our customer data”).
Suggest improvements (e.g., deeper feature engineering, alternative algorithms).

8. References

Cite datasets, libraries (e.g., scikit-learn, Weka), and papers (e.g., on novel algorithms).

9. Appendix (Optional)

Code Snippets: Critical steps (e.g., entropy calculation for decision trees).
Extended Results: Additional graphs/tables omitted from the main report.
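
As an example of the kind of short snippet that fits in an appendix, here is a minimal sketch of the entropy calculation used when choosing decision tree splits. The labels and the candidate split are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction achieved by splitting labels into groups."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Hypothetical labels and a candidate split that separates them perfectly.
labels = ["yes", "yes", "no", "no"]
split = [["yes", "yes"], ["no", "no"]]

parent_entropy = entropy(labels)
gain = information_gain(labels, split)
```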

Presentation:
- Possible dates: ...Loading.

📚 Resources and Further Reading

Here, you will find additional resources, including books, research papers, and online courses, to further your understanding of Data Mining.

CSCI 866 001
Data Mining and Knowledge Discovery