Data collected directly from the source for a specific purpose.
Example:
Surveys or Questionnaires 🗳️
Interviews 🎙️
Observations 🧐
Experiments 🔬
Secondary
Data that has already been collected, processed, and made available by others.
Example:
Government publications or reports 📄
Books and articles 📚
Online databases and repositories 🌐
Industry/NGO reports 🏭
Data Sources
Format
Structured
Highly organized and easily searchable in databases using predefined schemas.
Structure: typically stored in tables with rows and columns.
Example:
Spreadsheets: Excel
CSV files
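As a minimal sketch of what "rows and columns with a predefined schema" means in practice, the snippet below loads a small CSV into a pandas DataFrame. The CSV content and the name `people` are invented for illustration and are not part of the course data.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented for illustration only.
csv_text = """name,age,city
Alice,30,Paris
Bob,25,Lyon
"""

# Each CSV line becomes a row; the header line supplies the column names.
people = pd.read_csv(io.StringIO(csv_text))

print(people.shape)          # (number of rows, number of columns)
print(list(people.columns))  # column names taken from the header line
print(people.dtypes)         # one inferred type per column
```

Because every row follows the same columns, the table can be filtered, sorted, and aggregated directly, which is what makes structured data easy to search.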
Unstructured
Lacks a predefined format or schema and is typically stored in its raw form.
Structure: free-form; can be text, images, videos…
Example:
Emails/Documents (e.g., Word files, PDFs)
Social media posts, images, audio, videos
Web pages…
Data Quality
Someone in the 1960s said: Garbage In, Garbage Out (GIGO)!
Data quality is the most important thing in Data Analysis.
Timeliness: how up-to-date the data is for its intended use.
Ex: Temperatures from the 1960s won’t help forecast tomorrow’s weather.
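One possible way to check timeliness is to keep only records that are recent enough for the task. This is a sketch under assumptions: the readings, the reference date, and the 7-day window are all made up for illustration.

```python
import pandas as pd

# Hypothetical temperature readings; dates and values are made up.
readings = pd.DataFrame({
    "date": pd.to_datetime(["1965-07-01", "2024-05-30", "2024-05-31"]),
    "temp_c": [21.0, 18.5, 19.2],
})

# Assumed rule: only readings from the 7 days before the reference date count as timely.
reference_date = pd.Timestamp("2024-06-01")
max_age = pd.Timedelta(days=7)

is_recent = (reference_date - readings["date"]) <= max_age
recent = readings[is_recent]

print(recent)
```

The right window depends on the intended use: a week may suit a weather forecast, while decades of history may be exactly what a climate study needs.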
Uniqueness: data should not be duplicated.
Ex: Recording the same female heart disease patient many times may lead to the conclusion that females have a higher likelihood of developing heart disease.
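The effect of duplicates can be sketched with a toy table in which one female patient is recorded three times. The data and column names below are hypothetical and chosen only to make the bias visible.

```python
import pandas as pd

# Hypothetical patient records; patient 1 was entered three times.
patients = pd.DataFrame({
    "patient_id":    [1, 1, 1, 2, 3, 4],
    "sex":           ["F", "F", "F", "F", "M", "M"],
    "heart_disease": [1, 1, 1, 0, 1, 0],
})

# Count fully identical rows beyond their first occurrence.
n_duplicates = patients.duplicated().sum()

# Keep one copy of each record.
deduplicated = patients.drop_duplicates()

# Share of patients with heart disease per sex, before and after deduplication.
rate_before = patients.groupby("sex")["heart_disease"].mean()
rate_after = deduplicated.groupby("sex")["heart_disease"].mean()

print(n_duplicates)
print(rate_before)
print(rate_after)
```

In this toy data the female rate looks higher than the male rate only because of the repeated rows; once they are dropped, both rates are equal.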
Validity: data should take values within its valid range.
Ex: Height & weight should not be 0.
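A validity check can be sketched as a range test per column. The measurements and the bounds in `valid_ranges` are assumptions made for illustration; real limits should come from domain knowledge.

```python
import pandas as pd

# Hypothetical body measurements; the zeros stand in for invalid entries.
body = pd.DataFrame({
    "height_cm": [172.0, 0.0, 181.5, 165.0],
    "weight_kg": [68.0, 75.0, 0.0, 54.5],
})

# Assumed plausible ranges (inclusive), for illustration only.
valid_ranges = {"height_cm": (50, 250), "weight_kg": (2, 300)}

# A row is valid only if every checked column falls inside its range.
is_valid = pd.Series(True, index=body.index)
for col, (low, high) in valid_ranges.items():
    is_valid &= body[col].between(low, high)

invalid_rows = body[~is_valid]
print(invalid_rows)
```

Flagging invalid rows first, instead of dropping them immediately, leaves room to decide whether to correct them, treat them as missing, or remove them.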
Consistency: data should be uniform and compatible (format, type…) across different datasets and over time.
Ex: 15/03/2004 & 03/15/2004, Male & M…
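A sketch of how the two inconsistencies above might be harmonized, assuming each source's date format is known in advance. The two small tables and the `sex_labels` mapping are hypothetical.

```python
import pandas as pd

# Two hypothetical sources describing the same person in different conventions.
source_a = pd.DataFrame({"birth_date": ["15/03/2004"], "sex": ["Male"]})  # day/month/year
source_b = pd.DataFrame({"birth_date": ["03/15/2004"], "sex": ["M"]})     # month/day/year

# Parse each source with its own explicit format so nothing is guessed.
source_a["birth_date"] = pd.to_datetime(source_a["birth_date"], format="%d/%m/%Y")
source_b["birth_date"] = pd.to_datetime(source_b["birth_date"], format="%m/%d/%Y")

combined = pd.concat([source_a, source_b], ignore_index=True)

# Map every spelling onto a single label per category.
sex_labels = {"Male": "M", "M": "M", "Female": "F", "F": "F"}
combined["sex"] = combined["sex"].map(sex_labels)

print(combined)
```

Stating the format explicitly matters because a date such as 03/04/2004 is ambiguous without knowing which convention the source follows.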
Accuracy: data should be accurate and reflect what it is meant to measure.
Ex: You entered ‘I like DA Course’ in my survey because it’s not anonymous.
Completeness: data should not contain missing values.
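Completeness is usually assessed by counting missing values per column. The sketch below uses a small hypothetical table named `sample`, not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a few missing entries, invented for illustration.
sample = pd.DataFrame({
    "Age":  [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex":  ["male", "female", "female", "male"],
})

# Number of missing values in each column.
missing_count = sample.isna().sum()

# Share of missing values in each column (between 0 and 1).
missing_share = sample.isna().mean()

print(missing_count)
print(missing_share)
```

The share of missing values helps choose a strategy: a column that is mostly empty may be dropped, while a column with only a few gaps may justify dropping or studying just those rows.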
col_to_be_converted = ['Survived', 'Pclass']
for col in col_to_be_converted:
    data[col] = data[col].astype(object)
data[col_to_be_converted].dtypes.to_frame().T
   Survived  Pclass
0    object  object
Drop column Cabin:
data.drop(columns=["Cabin"], inplace=True)
Drop 1 row with NA in Fare:
data.dropna(subset=['Fare'], inplace=True)
Study missing values in Age:
Impact on qualitative columns:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")
fig, axs = plt.subplots(2, 4, figsize=(6, 3.75))
col_qual = ['Survived', 'Pclass', 'Sex', 'Embarked']
for i, va in enumerate(col_qual):
    # Top row: category proportions using all rows
    sns.countplot(data, x=va, ax=axs[0, i], stat="proportion")
    axs[0, i].bar_label(axs[0, i].containers[0], fmt="%0.2f")
    # Bottom row: proportions after dropping rows with any missing value
    sns.countplot(data.dropna(), x=va, ax=axs[1, i], stat="proportion")
    axs[1, i].bar_label(axs[1, i].containers[0], fmt="%0.2f")
    # Label the y-axis only on the first column of plots
    if i == 0:
        axs[0, i].set_ylabel("Before remove NA")
        axs[1, i].set_ylabel("After remove NA")
    else:
        axs[0, i].set_ylabel("")
        axs[1, i].set_ylabel("")
plt.tight_layout()
plt.show()
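The same before/after comparison can also be sketched numerically. The table `toy` below is hypothetical and much smaller than the course dataset; it only illustrates how category proportions can shift when rows with a missing Age are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data: both rows with a missing Age happen to be male.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan, 35.0],
})

# Category proportions with all rows, then after dropping rows with missing Age.
before = toy["Sex"].value_counts(normalize=True)
after = toy.dropna(subset=["Age"])["Sex"].value_counts(normalize=True)

# Align both results on the category labels and compute the change.
comparison = pd.DataFrame({"before": before, "after": after})
comparison["shift"] = comparison["after"] - comparison["before"]

print(comparison)
```

A large shift suggests the values are not missing at random, so dropping those rows could bias later analysis; a shift near zero suggests dropping them changes little for that column.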