{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **TP3 - Data Preprocessing**\n",
"\n",
"Exploratory Data Analysis & Unsuperivsed Learning
\n",
"**Course: PHAUK Sokkey, PhD**
\n",
"**TP: HAS Sothea, PhD**\n",
"\n",
"-------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Objective**: Preprocessing is important in data related tasks. In this TP, you will explore different challanges you may encounted during when performing data preprocessing. We will discuss reasonable solution to these challanges.\n",
"\n",
"---------\n",
"\n",
"> **The `Jupyter Notebook` for this TP can be downloaded here: [TP3-Data-Preprocessing](https://hassothea.github.io/EDA_ITC/TPs/TP3_Data_Preprocessing.ipynb)**.\n",
"\n",
"-------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Missing Data\n",
"\n",
"We will begin with missing data which are very common within real-world datasets. One common question about missing values is \"Should we remove them\"? We will delve into possible proper solution to this problem.\n",
"\n",
"We will work with `Enfants` dataset available here: [Enfants dataset](https://github.com/hassothea/EDA_ITC/blob/main/data/Enfants.txt)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**a.** Import this data and name it `data`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"<>:3: SyntaxWarning: invalid escape sequence '\\d'\n",
"<>:3: SyntaxWarning: invalid escape sequence '\\d'\n",
"C:\\Users\\hasso\\AppData\\Local\\Temp\\ipykernel_18808\\610560642.py:3: SyntaxWarning: invalid escape sequence '\\d'\n",
" data = pd.read_table(\"D:/Sothea_PC/Teaching_ITC/EDA\\data/Enfants.txt\", sep=\"\\t\")\n"
]
},
{
"data": {
"text/html": [
"
\n", " | GENRE | \n", "AGE | \n", "TAILLE | \n", "MASSE | \n", "
---|---|---|---|---|
0 | \n", "F | \n", "68 | \n", "0 | \n", "20 | \n", "
1 | \n", "M | \n", "74 | \n", "116 | \n", "18 | \n", "
2 | \n", "M | \n", "69 | \n", "120 | \n", "23 | \n", "
3 | \n", "M | \n", "72 | \n", "121 | \n", "25 | \n", "
4 | \n", "M | \n", "73 | \n", "114 | \n", "17 | \n", "