Lab2: Data Comprehension & Preprocessing

Course: CSCI-866-001: Data Mining & Knowledge Discovery
Lecturer: Sothea HAS, PhD

Objective: In this lab, you will delve deeper into assessing the quality of datasets and employing preprocessing techniques to properly clean them.

The notebook for this lab can be downloaded here: Lab2_Preprocessing.ipynb.
Or you can work directly with Google Colab here: Lab2_Preprocessing.ipynb.

0. `Titanic` Dataset (Optional)

You are encouraged to start by reproducing what you saw in the Real Example section of the lecture, using the Titanic dataset available here: https://www.kaggle.com/datasets/yasserh/titanic-dataset.

1. `Auto-MPG` Dataset

The Auto-MPG dataset contains the specifications of various cars and is available on Kaggle. For more, read here. The data can be downloaded as follows:

import kagglehub # To load the data
import pandas as pd # To handle and manipulate the data

# Download the latest version of the Auto-MPG dataset
path = kagglehub.dataset_download("uciml/autompg-dataset")
auto = pd.read_csv(path + '/auto-mpg.csv')
auto.head()

	mpg	cylinders	displacement	horsepower	weight	acceleration	model year	origin	car name
0	18.0	8	307.0	130	3504	12.0	70	1	chevrolet chevelle malibu
1	15.0	8	350.0	165	3693	11.5	70	1	buick skylark 320
2	18.0	8	318.0	150	3436	11.0	70	1	plymouth satellite
3	16.0	8	304.0	150	3433	12.0	70	1	amc rebel sst
4	17.0	8	302.0	140	3449	10.5	70	1	ford torino

A. Are the columns in the correct data type?

# To do
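
One possible starting point for this check is sketched below. It uses a small hand-built frame named `sample` instead of the full download, so the frame and its name are assumptions: three rows are copied from the preview above, and horsepower is deliberately stored as text for illustration. `dtypes` lists the type pandas inferred for each column, and `select_dtypes` narrows the view to columns that are not numeric.

```python
import pandas as pd

# Hypothetical stand-in for `auto`: a few rows copied from the preview above.
# Storing horsepower as text is an assumption made for this illustration.
sample = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0],
    "cylinders": [8, 8, 8],
    "horsepower": ["130", "165", "150"],
    "origin": [1, 1, 1],
})

# dtypes shows the type pandas inferred for every column
print(sample.dtypes)

# Columns that are not stored as numbers deserve a closer look
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

On the real data, the same two calls can be run on `auto` directly.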

B. Change the wrongly encoded columns into their suitable data type if there are any.

# To do
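
As a minimal sketch of a type conversion, the example below treats `origin` as a categorical code instead of a quantity. Whether `origin` (or any other column) should be converted is a judgment call for you to make. The `sample` frame is a hypothetical stand-in for `auto`, and the value 3 in it is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `auto`; the origin value 3 is invented for illustration
sample = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0],
    "origin": [1, 1, 1, 3],
})

# Integer codes that label groups are often better stored as a categorical type
sample["origin"] = sample["origin"].astype("category")

print(sample.dtypes)
print(sample["origin"].cat.categories.tolist())
```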

C. What seems to be the problem with horsepower? Solve the problem carefully.

# To do
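
If a numeric column has been read as text, a common cause is a non-numeric placeholder hiding among the values. The sketch below shows one way to find such a placeholder and handle it. The `sample` frame and the "?" entry in it are made up for the example, and median imputation is only one option (dropping the affected rows is another).

```python
import pandas as pd

# Hypothetical stand-in for `auto`; the "?" entry is invented for illustration
sample = pd.DataFrame({
    "horsepower": ["130", "165", "?", "150", "140"],
    "weight": [3504, 3693, 2046, 3433, 3449],
})

# Entries that cannot be parsed as numbers become NaN and reveal the placeholder
parsed = pd.to_numeric(sample["horsepower"], errors="coerce")
bad_values = sample.loc[parsed.isna(), "horsepower"].unique().tolist()
print(bad_values)

# Keep the parsed numbers, count the gaps, then impute them with the median
sample["horsepower"] = parsed
n_missing = int(sample["horsepower"].isna().sum())
sample["horsepower"] = sample["horsepower"].fillna(sample["horsepower"].median())
print(n_missing, sample["horsepower"].tolist())
```

Inspecting `bad_values` before coercing matters: it confirms that the unparseable entries really are missing-value markers and not typos that could be repaired.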

D. Make sure all the columns are in the correct data type.

# To do

E. Are there any duplicated rows? How would you handle them if there are any?

# To do

F. Compute descriptive statistics of the data and visualize their distributions.

# To do

G. Are there any outliers?

# To do
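
One widely used rule of thumb flags a value as an outlier when it lies more than 1.5 interquartile ranges below the first quartile or above the third. The sketch below applies that rule to a single column. The helper `iqr_outliers` is a hypothetical name, the multiplier 1.5 is a convention rather than a requirement, and the last value in the toy series is invented so that something gets flagged.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy acceleration values: five from the preview above plus one invented extreme
acceleration = pd.Series([12.0, 11.5, 11.0, 12.0, 10.5, 24.8])

mask = iqr_outliers(acceleration)
print(acceleration[mask])
```

A flagged value is not automatically an error. A box plot of the same column is a useful visual cross-check before deciding whether to keep, cap, or remove it.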
