Lab2: Data Comprehension & Preprocessing

Course: CSCI-866-001: Data Mining & Knowledge Discovery
Lecturer: Sothea HAS, PhD


Objective: In this lab, you will delve deeper into assessing the quality of datasets and employing preprocessing techniques to properly clean them.


0. Titanic Dataset (Optional)

You are encouraged to start by applying what you saw in the Real Example section of the lecture to the Titanic dataset, available here: https://www.kaggle.com/datasets/yasserh/titanic-dataset.

1. Auto-MPG Dataset

The Auto-MPG dataset contains the specifications of various cars and is available on Kaggle. For more, read here. The data can be downloaded as follows:

import os # To build the file path
import kagglehub # To load the data
import pandas as pd # To handle and manipulate the data

# Download the latest version; the call returns the local folder holding the files
path = kagglehub.dataset_download("uciml/autompg-dataset")
auto = pd.read_csv(os.path.join(path, "auto-mpg.csv"))
auto.head()
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin                   car name
0  18.0          8         307.0         130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0         165    3693          11.5          70       1          buick skylark 320
2  18.0          8         318.0         150    3436          11.0          70       1         plymouth satellite
3  16.0          8         304.0         150    3433          12.0          70       1              amc rebel sst
4  17.0          8         302.0         140    3449          10.5          70       1                ford torino

A. Are the columns in the correct data type?

# To do
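
Hint: one possible way to approach this check is to compare the types pandas inferred with what each column actually means. The sketch below uses a small toy table, not the Auto-MPG file; its column names and values are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): 'power' holds numbers that were stored as text.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 16.5],
    "power": ["130", "165", "150"],
    "origin": [1, 1, 3],
})

# dtypes lists the type pandas inferred for each column.
print(toy.dtypes)

# Columns that are not numeric deserve a closer look.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

A column that should be numeric but appears in this list is a candidate for cleaning.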

B. Convert any wrongly encoded columns to a suitable data type.

# To do
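
Hint: a minimal sketch of a type conversion, assuming a column stores group labels as integer codes. The table and the column name `region_code` are hypothetical; which Auto-MPG columns need converting is for you to decide.

```python
import pandas as pd

# Toy data (hypothetical): 'region_code' is an integer label, not a quantity.
toy = pd.DataFrame({
    "cylinders": [8, 4, 6, 4],
    "region_code": [1, 3, 2, 1],
})

# Integer codes that label groups are better stored as categories,
# so that they are not averaged or summed by mistake.
toy["region_code"] = toy["region_code"].astype("category")

print(toy.dtypes)
categories = toy["region_code"].cat.categories.tolist()
print(categories)
```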

C. What seems to be the problem with horsepower? Solve the problem carefully.

# To do
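
Hint: when a numeric column is read as text, a common cause is a handful of entries that cannot be parsed as numbers. The sketch below shows one way to find and handle such entries on a toy column. The placeholder "n/a" and the column name `power` are assumptions for illustration, and median imputation is only one of several reasonable choices (dropping the rows is another).

```python
import pandas as pd

# Toy data (hypothetical): one entry is a text placeholder instead of a number.
toy = pd.DataFrame({"power": ["130", "n/a", "150", "95"]})

# Try to parse every entry; anything unparseable becomes NaN.
parsed = pd.to_numeric(toy["power"], errors="coerce")

# Look at the offending raw entries before changing anything.
bad_entries = toy.loc[parsed.isna(), "power"].unique().tolist()
print(bad_entries)

# One option: fill the missing values with the median of the valid ones.
toy["power"] = parsed.fillna(parsed.median())
print(toy["power"].tolist())
```

Inspecting the bad entries first matters: it tells you whether they are genuine missing values or typos that could be corrected by hand.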

D. Make sure all the columns have the correct data type.

# To do

E. Are there any duplicated rows? How would you handle them if there are?

# To do

F. Compute descriptive statistics of the data and visualize their distribution.

# To do

G. Are there any outliers?

# To do
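
Hint: a minimal sketch of one common convention for flagging outliers, the 1.5 x IQR rule (Tukey's fences). This is an assumption about the method, not necessarily the one used in the lecture, and the values below are made up.

```python
import pandas as pd

# Toy measurements (hypothetical) with one suspiciously large value.
values = pd.Series([10, 12, 11, 13, 12, 11, 40], name="toy_measure")

# Interquartile range: spread of the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

A flagged point is not automatically an error. Check whether it is a plausible measurement before removing or correcting it.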

Further Reading