Univariate Analysis

INF-604: Data Analysis

Lecturer: Dr. Sothea HAS

🎯 Objectives

Identify quantitative & qualitative data/columns.
Use the right statistical values to summarize them.
Use the right statistical graphs to represent them.
Use the right tool to understand each individual column…

🗺️ Outline

Motivation
Data Types
Qualitative data
Quantitative data
Real examples

Motivation

Consider the Titanic dataset

import pandas as pd                 # Import pandas package
import seaborn as sns               # Package for beautiful graphs
import matplotlib.pyplot as plt     # Graph management
# path_titanic: folder containing the CSV file (assumed to be defined beforehand)
data = pd.read_csv(path_titanic + "/Titanic-Dataset.csv")  # Import it into Python
sns.set(style="whitegrid")          # Set grid background
data.drop(columns=['PassengerId']).head()   # Show the first 5 rows without the ID column

	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

The first step in analyzing data is understanding the nature of each individual column.
Some guiding questions:
- Were there more female or male passengers?
- Did many people survive?
- Were there more elderly or young passengers?

Data Types:
Qual. vs Quan.

Data Types

Quality vs Quantity

Consider our Titanic dataset

data[['Survived', 'Pclass', 'Age', 'Embarked']].head(5)    # Show 5 first rows

	Survived	Pclass	Age	Embarked
0	0	3	22.0	S
1	1	1	38.0	C
2	1	3	26.0	S
3	1	1	35.0	S
4	0	3	35.0	S

Column Embarked is clearly different:
- Performing \(+\), \(-\), \(\times\), \(\div\)… on its values doesn’t make any sense!
- Comparing them with \(<\), \(>\)… doesn’t make sense either!
Embarked is Qualitative (or Categorical) data.

Age, on the other hand, is numeric:
- Age \(50\) is older than \(30\).
- Age \(20\) is \(5\) years younger than \(25\), since \(25-20=5\).
Age is Quantitative (or Numerical) data.
Q1: How about the other two columns?

Data Types

Challenge

data[['Sex', 'SibSp', 'Parch', 'Fare']].head()

	Sex	SibSp	Parch	Fare
0	male	1	0	7.2500
1	female	1	0	71.2833
2	female	0	0	7.9250
3	female	1	0	53.1000
4	male	0	0	8.0500

Q2: Define the type of each of these columns.

Column	Quantitative: Discrete	Quantitative: Continuous	Qualitative: Nominal	Qualitative: Ordinal
`Sex`			✅	
`SibSp`	✅			
`Parch`	✅			
`Fare`		✅		
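
As a rough cross-check of this classification, the storage type that pandas assigns to a column gives a first guess at its statistical type. The sketch below is only a heuristic: `guess_type` is a hypothetical helper, and the small `df` is typed in by hand from the first rows shown earlier rather than loaded from the CSV.

```python
import pandas as pd

# First five rows of three of the columns, typed in by hand for illustration
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "SibSp": [1, 1, 0, 1, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

def guess_type(col: pd.Series) -> str:
    """Hypothetical helper: rough first guess based on the storage dtype only."""
    if pd.api.types.is_integer_dtype(col):
        return "quantitative (discrete)"
    if pd.api.types.is_float_dtype(col):
        return "quantitative (continuous)"
    return "qualitative"

guesses = {name: guess_type(df[name]) for name in df.columns}
print(guesses)
```

Integer-coded categories such as `Survived` or `Pclass` would be guessed as discrete quantities by this rule, so the final decision still needs a human reading of what the column means.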

Now, let’s take a closer look!

Qualitative Data

Statistical values

data[['Pclass', 'Survived', 'Embarked', 'Sex']].head()

	Pclass	Survived	Embarked	Sex
0	3	0	S	male
1	1	1	C	female
2	3	1	S	female
3	1	1	S	female
4	3	0	S	male

What values should we use to describe qualitative data?
Absolute Frequency: Number of occurrences/counts of each category.
Relative Frequency: proportion/percentage of each category.
Mode: Category with highest frequency/count.

Example:

freq_tab = data['Pclass'].value_counts().to_frame('count')   # absolute frequency
freq_tab['proportion'] = data['Pclass'].value_counts(normalize=True).round(2)   # relative frequency
freq_tab.T

Pclass	3	1	2
count	491.00	216.00	184.00
proportion	0.55	0.24	0.21

freq_tab = data[['Sex']].value_counts().to_frame()
freq_tab['proportion'] = data[['Sex']].value_counts(normalize=True).round(2)
freq_tab.T

Sex	male	female
count	577.00	314.00
proportion	0.65	0.35
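
The tables above show counts and proportions; the mode is simply the category with the largest count. A minimal sketch on a hand-typed sample (the first five `Sex` values, not the full dataset):

```python
import pandas as pd

# First five values of the Sex column, typed in by hand for illustration
sex = pd.Series(["male", "female", "female", "female", "male"], name="Sex")

counts = sex.value_counts()                 # absolute frequency
props = sex.value_counts(normalize=True)    # relative frequency
mode = sex.mode()[0]                        # most frequent category (first one if tied)

print(counts.to_dict(), props.to_dict(), mode)
```

On the full dataset the mode of `Sex` is male (577 of 891 passengers), as the frequency table shows; the five-row sample gives a different answer only because it is so small.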

Q3: I dare you to take care of the other two columns 😏!

Qualitative Data

Visualization

What graph should we use to present qualitative data?

Countplot/Barplot: Represent each count/proportion by a bar.

Example:

import matplotlib.pyplot as plt
import seaborn as sns  # For graph
sns.set(style="whitegrid") # set nice background
plt.figure(figsize=(5,3))
ax = sns.countplot(data, x="Survived") # create graph
ax.set_title("Barplot of Survived") # add title
ax.bar_label(ax.containers[0]) # add number to bars
plt.show() # Show graph

Example (proportions instead of counts):

import matplotlib.pyplot as plt
import seaborn as sns  # For graph
sns.set(style="whitegrid") # set nice background
plt.figure(figsize=(5,3))
ax = sns.countplot(data,x="Survived", stat="proportion")
ax.set_title("Barplot of Survived") # add title
ax.bar_label(ax.containers[0], fmt="%0.2f") # number
plt.show() # Show graph

Pie chart: Represent count/proportion by circular slices.

Example:

import matplotlib.pyplot as plt
import seaborn as sns  # For graph
sns.set(style="whitegrid") # set nice background
plt.figure(figsize=(6,4))
tab = data['Embarked'].value_counts() # Compute counts per category
plt.pie(tab, labels=tab.index, autopct='%0.2f%%') # graph
plt.title("Pie chart of Embarked") # add title
plt.show() # Show graph

⚠️ Pie charts can be challenging to read with numerous categories. They’re harder to perceive when many categories have similar proportions.


Qualitative Data

Summary

Quantitative Data

Statistical values

data[['Age', 'Fare', 'SibSp', 'Parch']].head()

	Age	Fare	SibSp	Parch
0	22.0	7.2500	1	0
1	38.0	71.2833	1	0
2	26.0	7.9250	0	0
3	35.0	53.1000	1	0
4	35.0	8.0500	0	0

What values should we use to describe quantitative data?
Quantiles: cut points that split the data, sorted in ascending order, into contiguous intervals that each contain a given proportion of the observations.

Different types of quantile:

Quartiles: The 25th (Q1), 50th (Q2 or median), and 75th (Q3) percentiles.

	min	25%	50%	75%	max
Fare	0.00	7.91	14.45	31.0	512.33
Age	0.42	20.12	28.00	38.0	80.00

Percentiles: the 99 cut points that divide the sorted data into 100 parts of equal size.
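
In pandas, quantiles at any level can be requested with `Series.quantile`. The sketch below uses five hand-typed fares rather than the full column, so its numbers differ from the table above; note that pandas interpolates linearly between neighbouring values by default.

```python
import pandas as pd

# Five fares typed in by hand (the first rows of the dataset), for illustration only
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

quartiles = fare.quantile([0.25, 0.50, 0.75])   # Q1, median, Q3
p90 = fare.quantile(0.90)                       # 90th percentile

print(quartiles)
print(p90)
```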

Method to find Quartiles:

Sort the data in ascending order: \(X_1\le\dots\le X_n\).

If \(n\) is odd: \(\color{red}{Q_2}=X_{(n+1)/2}\), the middle term.
If \(n\) is even: \(\color{red}{Q_2}=\frac{X_{(n/2)}+X_{(n/2)+1}}{2}\), the average of the two middle terms.
\(Q_1\): the median of the lower half of the data.
\(\color{green}{Q_3}\): the median of the upper half of the data.
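
The method above can be sketched in a few lines. One detail is not specified in the slides and is an assumption here: when \(n\) is odd, the middle value is left out of both halves. The function name `quartiles` is also just illustrative.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 by the median-of-halves method.

    Assumption: when n is odd, the middle value belongs to neither half.
    """
    x = sorted(values)
    half = len(x) // 2
    lower = x[:half]              # lower half of the sorted data
    upper = x[len(x) - half:]     # upper half of the sorted data
    return median(lower), median(x), median(upper)

print(quartiles([1, 2, 3, 4, 5, 6, 7]))
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))
```

Library routines such as `quantile` and `describe()` in pandas interpolate linearly by default, so they can return slightly different quartiles on small samples (2.5 and 5.5 instead of 2 and 6 for the first list).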

The median (Q2) is one measure of central tendency; the mean is another.

Mean: the average value of all data points:

\[\color{blue}{\overline{X}=\frac{1}{n}\sum_{i=1}^nX_i=\frac{X_1+\dots+X_n}{n}}.\]

Examples:

mean = data[['Age','Fare']].mean()\
                        .to_frame()
mean.columns = ['Mean']
mean.T

	Age	Fare
Mean	29.699118	32.204208

The average age of passengers was around \(30\) years.
On average, passengers spent approximately \(£32\) on their fare.
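
The formula can be checked by hand on a few values. The sketch below uses five hand-typed fares (not the full column) and also prints the median, which is less sensitive to a few large fares than the mean.

```python
import numpy as np

fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])   # hand-typed sample

mean_manual = fare.sum() / len(fare)   # (X_1 + ... + X_n) / n
mean_numpy = fare.mean()               # same quantity computed by NumPy
med = np.median(fare)                  # middle value of the sorted sample

print(mean_manual, mean_numpy, med)
```

The same effect is visible on the full data: the mean fare (about 32) is more than twice the median fare (14.45) because a few very expensive tickets pull the mean upwards.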

Two main measures of dispersion:

Sample variance: the average squared distance of the data points from the mean, with \(n-1\) instead of \(n\) in the denominator.

\[\color{blue}{\widehat{\sigma}^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X})^2}.\]

Examples:

var = data[['Age','Fare']].var()\
                        .to_frame()\
                        .round(3)
var.columns = ['Var']
var.T

	Age	Fare
Var	211.019	2469.437

Large variance means that data points are widely spread out from the Mean.
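
A small check of the variance formula on five hand-typed ages (a sketch, not the full column). It also shows a common pitfall: pandas divides by \(n-1\) by default, while NumPy divides by \(n\) unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0])   # hand-typed sample

n = len(age)
var_manual = ((age - age.mean()) ** 2).sum() / (n - 1)   # sample variance formula
var_pandas = age.var()                # pandas: divides by n-1 by default
var_numpy = np.var(age.to_numpy())    # NumPy: divides by n by default (ddof=0)

print(var_manual, var_pandas, var_numpy)
```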

Sample standard deviation: the square root of the sample variance.

\[\color{blue}{\widehat{\sigma}=\sqrt{\widehat{\sigma}^2}=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X})^2}}.\]

Examples:

std = data[['Age','Fare']]\
        .apply(['var', 'std'])
std

	Age	Fare
var	211.019125	2469.436846
std	14.526497	49.693429

A large standard deviation (Std) means data points are spread out widely from the mean.
Std has the same unit as \(X_i\), unlike the variance, whose unit is squared.

Quantitative Data

Statistical Summary

Statistical summary uses all key values to help us understand the data:

Where the data is concentrated (mean/median).
How spread out it is (var/std)…

Examples:

data[['Age','Fare']]\
        .describe()  # for summary

	Age	Fare
count	714.000000	891.000000
mean	29.699118	32.204208
std	14.526497	49.693429
min	0.420000	0.000000
25%	20.125000	7.910400
50%	28.000000	14.454200
75%	38.000000	31.000000
max	80.000000	512.329200


Quantitative Data

Visualization: Boxplot

import plotly.express as px
fig = px.box(data, x="Fare")
fig.update_layout(height=220,
                  width=530,
                  title="Boxplot of Fare")
fig.show()

Boxplots describe data using quartiles and the range within which data normally fall.

The lower and upper edges of the box are \(Q_1\) and \(\color{green}{Q_3}\). The median \(\color{red}{Q_2}\) is the line inside the box.
Interquartile range: \(\text{IQR}=\color{green}{Q_3}-Q_1\); it is the width of the box, which covers the central \(50\%\) of the data.
Range: \([Q_1-1.5\text{IQR},\color{green}{Q_3}+1.5\text{IQR}]\). If the data are normally distributed, about \(99.3\%\) of observations fall inside it.
Data points that fall outside this range can be considered outliers (observations that deviate from the usual ones).

This boxplot tells us that:
- Fares range from \(£0\) to a maximum of \(£512.33\).
- \(Q_1=£7.9\): around \(25\%\) of passengers spent less than \(£7.9\) on their ticket.
- \(\color{red}{Q_2}=£14.45\) (median): \(\approx 50\%\) spent less than \(£14.45\).
- \(\color{green}{Q_3}=£31\): \(\approx 75\%\) spent less than \(£31\).
- There are many outliers: passengers who spent more than the upper fence (about \(£65\)), up to the largest fare of \(£512.33\).
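
The limits \(Q_1-1.5\,\text{IQR}\) and \(Q_3+1.5\,\text{IQR}\) can be computed directly. The sketch below plugs in the Fare quartiles reported in the summary table; `fences` is an illustrative helper name, and `sample` is a handful of fares typed in by hand, not the full column.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of Fare as reported in the summary table
lower, upper = fences(7.9104, 31.0)

sample = [7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292]   # a few fares typed in by hand
outliers = [x for x in sample if x < lower or x > upper]

print(round(lower, 2), round(upper, 2), outliers)
```

The computed upper limit is about 65.63. Plotting libraries typically draw the whisker at the most extreme observation inside these limits, which would explain the slightly smaller value of about £65 read from the plot.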

Quantitative Data

Visualization: Histogram

import plotly.express as px
fig = px.histogram(data, x="Age")
fig.update_layout(height=220, 
                  width=530, 
                  title="Histogram of Age")
fig.show()

A histogram is constructed by:
- Defining a grid range of bins: \(B_1, \dots, B_N\).
- The height of each bar represents the count of \(X_i\) values that fall within the corresponding bin.
It describes the frequency of observations within each bin range.

Mathematical definition of histogram

Define bins: \(B_1,\dots, B_N\).
For any \(x\) that belongs to some bin \(B_k\):

\[\text{hist}(x)=\sum_{i=1}^n\mathbb{1}_{\{X_i\in B_k\}}.\]
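
The definition can be implemented literally as a sum of indicators. In this sketch the five ages and the two bins are chosen by hand for illustration, and bins are taken as half-open intervals \([a, b)\), which is an assumption the slides do not spell out.

```python
import numpy as np

ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0])   # hand-typed sample
edges = np.array([20.0, 30.0, 40.0])              # two bins: [20, 30) and [30, 40)

def hist(x):
    """Count of observations in the bin that contains x (0 if x is outside every bin)."""
    k = np.searchsorted(edges, x, side="right") - 1   # index of the bin containing x
    if k < 0 or k >= len(edges) - 1:
        return 0
    in_bin = (ages >= edges[k]) & (ages < edges[k + 1])   # indicator 1{X_i in B_k}
    return int(in_bin.sum())

counts, _ = np.histogram(ages, bins=edges)   # the same counts from NumPy
print(hist(25.0), hist(36.0), counts)
```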

For this example of Age:

Most passengers were between 16 and 52 years old.
There were more children younger than 10 years old than those around 10 years old.
There were fewer than 10 individuals in each age group older than 52 years old.

Quantitative Data

Visualization: Kernel Density Plot (KDE)

import plotly.figure_factory as ff
age = [data['Age'].dropna().to_numpy()]   # drop missing ages; create_distplot expects a list of arrays
group_labels = ['distplot']
fig = ff.create_distplot(age, group_labels=group_labels, bin_size=1.9)
fig.update_layout(height=220,
                  width=530,
                  title="Histogram and KDE of Age")
fig.show()

A Kernel Density Plot is a smooth, continuous version of a histogram.
It describes the relative frequency of observations over ranges of values.
It has nicer mathematical properties than a histogram: it is smooth and does not depend on where the bin edges are placed.

Mathematical definition of KDE

Let \(K\) be a smooth kernel function, for example the Gaussian kernel \(K(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\).
For a given bandwidth \(h>0\) and for any \(x\):

\[\text{kde}(x)=\frac{1}{nh}\sum_{i=1}^nK\Big(\frac{x-X_i}{h}\Big).\]

A kernel density plot conveys similar information to a histogram.
It’s often discussed in probability and statistics classes.
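
The KDE formula translates almost line by line into NumPy. This is a minimal sketch, assuming a Gaussian kernel normalised by \(1/\sqrt{2\pi}\) so that it integrates to 1; the five ages and the bandwidth `h = 5` are picked by hand for illustration and are not the settings used for the plot above.

```python
import numpy as np

ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0])   # hand-typed sample
h = 5.0                                           # bandwidth, chosen arbitrarily here

def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def kde(x):
    """kde(x) = 1/(n h) * sum_i K((x - X_i) / h), evaluated for every entry of x."""
    x = np.asarray(x, dtype=float)
    u = (x[:, None] - ages[None, :]) / h       # shape (len(x), n)
    return gaussian_kernel(u).sum(axis=1) / (len(ages) * h)

grid = np.arange(-30.0, 100.0, 0.1)
density = kde(grid)
area = density.sum() * 0.1                     # Riemann-sum approximation of the integral

print(round(area, 4))
```

The area under the curve comes out very close to 1, which is what allows the KDE to be read as a relative frequency (a density) rather than a raw count.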

Quantitative Data

Summary

Real examples

Qualitative columns of Titanic Dataset

qual_var = ['Survived', 'Pclass', 'Sex']
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(
    rows=3, cols=1,
    specs=[[{"type": "bar"}], [{"type": "bar"}], [{"type": "bar"}]],
    subplot_titles=("Barplot of Survived", "Barplot of Pclass", "Barplot of Sex"))
for i, va in enumerate(qual_var):
    cnt = data[va].value_counts()   # counts per category
    if i == 0:
        # Survived: cast 0/1 to strings so they are shown as category labels
        fig.add_trace(
            go.Bar(x=list(cnt.index.astype(str)), y=list(cnt.values), name=va), col=1, row=i+1)
    else:
        fig.add_trace(
            go.Bar(x=list(cnt.index.astype(object)), y=list(cnt.values), name=va), col=1, row=i+1)
fig.update_layout(height=450, width=450)
fig.show()

fig = make_subplots(
    rows=3, cols=1,
    specs=[[{"type": "pie"}], [{"type": "pie"}], [{"type": "pie"}]],
    subplot_titles=("Pie chart of Survived", "Pie chart of Pclass", "Pie chart of Sex"))
for i, va in enumerate(qual_var):
    cnt = data[va].value_counts()   # counts per category
    fig.add_trace(go.Pie(labels=list(cnt.index.astype(object)), values=list(cnt.values)), col=1, row=i+1)
fig.update_layout(height=450, width=450)
fig.show()

Real examples

Quantitative columns of Titanic Dataset

quan_var = ['Age', 'SibSp', 'Parch', 'Fare']
cols = ['#C96451', '#80C96F', '#6B7FDB', '#C07EDE']
# Subplot titles follow the order of quan_var (row 1: boxplots, row 2: histograms)
fig = make_subplots(
    rows=2, cols=4,
    subplot_titles=("Distribution of Age", "Distribution of SibSp", "Distribution of Parch", "Distribution of Fare", "","","",""))
for i, va in enumerate(quan_var):
    fig.add_trace(
        go.Box(x=data[va].values, name=va, marker_color = cols[i]), col=i+1, row=1)
    fig.add_trace(
        go.Histogram(x=data[va].values, name=va, marker_color = cols[i]), col=i+1, row=2)
fig.update_layout(height=450, width=1000)
fig.show()

Real examples

Quantitative columns of Titanic Dataset

quan_var = ['Age', 'SibSp', 'Parch', 'Fare']
cols = ['#C96451', '#80C96F', '#6B7FDB', '#C07EDE']
# Subplot titles follow the order of quan_var (row 1: boxplots, row 2: histograms)
fig = make_subplots(
    rows=2, cols=4,
    subplot_titles=("Distribution of Age", "Distribution of SibSp", "Distribution of Parch", "Distribution of Fare (log scale)", "","","",""))
for i, va in enumerate(quan_var):
    fig.add_trace(
        go.Box(x=data[va].values, name=va, marker_color = cols[i]), col=i+1, row=1)
    fig.add_trace(
        go.Histogram(x=data[va].values, name=va, marker_color = cols[i]), col=i+1, row=2)
fig.update_layout(height=450, width=1000)
fig.update_xaxes(type="log", row=1, col=4)   # Fare boxplot: log-scaled x-axis
fig.update_yaxes(type="log", row=2, col=4)   # Fare histogram: log-scaled counts
fig.show()
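
Since the summary table reports a minimum Fare of 0, those observations have no position on a logarithmic axis. One common workaround, shown here only as an alternative to the axis setting above and not as the method used in these slides, is to transform the values with \(\log(1+x)\) before plotting. The four fares below are typed in by hand for illustration.

```python
import numpy as np

fare = np.array([0.0, 7.25, 71.2833, 512.3292])   # hand-typed sample including a zero fare

log_fare = np.log1p(fare)     # log(1 + x): defined at 0 and increasing in x
back = np.expm1(log_fare)     # inverse transform recovers the original fares

print(log_fare.round(3))
```

The transformed values keep their order, so a histogram or boxplot of `log_fare` still shows which fares are low or high while compressing the long right tail.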

🥳 Yeahhhh….

Let’s Party… 🥂

The Party 👇🫠
