Introduction to Timeseries Analysis


INF-604: Data Analysis

Lecturer: Dr. Sothea HAS

Outline

  • Introduction & Motivation

  • Visualization & Statistical Values

  • Timeseries Main Components & Decompositions

  • Real-world Examples

Introduction & Motivation

Introduction & Motivation

Non- vs timeseries data (Gapminder)

Non-timeseries data (2002)

Code
import numpy as np
import pandas as pd
from gapminder import gapminder
gapminder.query("year == 2002").drop(columns=["year", "continent", "GDP_Category"], errors="ignore").head(3)
        country  lifeExp       pop    gdpPercap
10  Afghanistan   42.129  25268405   726.734055
22      Albania   75.651   3508512  4604.211737
34      Algeria   70.994  31287142  5288.040382

  • We are interested in:
    • Distribution of individual columns (barplot, boxplot, histogram…)
    • Relationship between columns (scatterplot, grouped barplots, color, shape, size…)
    • Statistical values: means, min, max…
  • We are not interested in trends or evolution over time.
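As a small illustration of these cross-sectional summaries, the sketch below uses only the three 2002 rows printed above (values copied from the table). The names `snapshot`, `summary` and `corr` are illustrative; three rows are far too few for real conclusions.

```python
import pandas as pd

# The three 2002 rows shown above (copied from the printed table).
snapshot = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "lifeExp": [42.129, 75.651, 70.994],
    "pop": [25268405, 3508512, 31287142],
    "gdpPercap": [726.734055, 4604.211737, 5288.040382],
})

# Statistical values of individual columns.
summary = snapshot[["lifeExp", "gdpPercap"]].agg(["mean", "min", "max"])

# Relationship between two columns (Pearson correlation).
corr = snapshot["lifeExp"].corr(snapshot["gdpPercap"])

print(summary.round(3))
print(round(corr, 3))
```

Nothing here depends on the order of the rows: shuffling the countries gives the same summary, which is exactly what changes with timeseries data.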

Timeseries data (Cambodia)

Code
gapminder.query("country == 'Cambodia'").drop(columns=["continent", "country", "GDP_Category"], errors="ignore").head(3)
     year  lifeExp      pop   gdpPercap
216  1952   39.417  4693836  368.469286
217  1957   41.366  5322536  434.038336
218  1962   43.415  6083619  496.913648

  • Individual columns and the relationships between columns are still important.
  • Main interest: how those columns and their relationships evolve over time.
  • Previous graphs and statistical values can still be used, but should be interpreted differently!
  • More tools (graphs, values…) are required to understand their tendency as time evolves.
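To make "evolution over time" concrete, here is a minimal sketch using only the three Cambodia rows printed above (values copied from the table). The column names `lifeExp_change` and `pop_growth_pct` are illustrative and not part of the original data.

```python
import pandas as pd

# The three Cambodia rows shown above (copied from the printed table).
cambodia = pd.DataFrame({
    "year": [1952, 1957, 1962],
    "lifeExp": [39.417, 41.366, 43.415],
    "pop": [4693836, 5322536, 6083619],
})

# Change between consecutive observations (5-year steps in these rows).
cambodia["lifeExp_change"] = cambodia["lifeExp"].diff()
# Relative population growth over each 5-year step, in percent.
cambodia["pop_growth_pct"] = cambodia["pop"].pct_change() * 100

print(cambodia.round(3))
```

Unlike a mean or a histogram, these quantities depend on the order of the rows, which is what distinguishes a timeseries view from a cross-sectional one.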

Introduction & Motivation

Non- vs timeseries data (Gapminder)

Non-timeseries data (2002)

  • How would you interpret this histogram?

Timeseries data (Cambodia)

  • How about this?

Introduction & Motivation

Definition

  • A time series is a sequence of data points organized in time order.
  • Usually, the time signal is sampled at equally spaced points in time.
  • Examples:
    • Climate: temperature, humidity…
    • Finance: stock prices, asset prices, exchange rate…
    • E-Commerce: page views, new users, searches…
    • Business: transactions, revenue, inventory levels…
    • Natural language: texts, sentences…
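The two properties in the definition (time order and equal spacing) can be checked directly on a pandas index. The sketch below uses a hypothetical daily series with an arbitrary start date; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 10 observations starting on an arbitrary date.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
series = pd.Series(np.arange(10, dtype=float), index=dates)

# Time-ordered: the index is monotonically increasing.
is_ordered = series.index.is_monotonic_increasing

# Equally spaced: all gaps between consecutive timestamps are identical.
gaps = series.index.to_series().diff().dropna()
is_regular = gaps.nunique() == 1

# Dropping two observations breaks the regular spacing.
irregular = series.drop(series.index[[3, 4]])
irregular_gaps = irregular.index.to_series().diff().dropna()
is_irregular_regular = irregular_gaps.nunique() == 1

print(is_ordered, is_regular, is_irregular_regular)
```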

Motivation

  • Understanding the nature and behavior of the timeseries.
  • Forecasting the future based on the historical data.

Visualization & Statistical Values

Visualization & Statistical values

Visualization

  • Quantitative: lineplot.
Code
import matplotlib.pyplot as plt
import seaborn as sns

# df_climate is assumed to be a DataFrame with columns date, meantemp, humidity, wind_speed.
sns.set(style="white")
df_sub = df_climate.iloc[::12, :]  # keep every 12th observation to lighten the plots
_, axs = plt.subplots(3, 1, figsize=(5, 4.5))

sns.lineplot(data=df_sub, x="date", y="meantemp", ax=axs[0])
axs[0].set_title("Mean temperature", fontsize=13)
axs[0].set_xticks([])
axs[0].set_ylabel("")
axs[0].set_xlabel("")

sns.lineplot(data=df_sub, x="date", y="humidity", ax=axs[1])
axs[1].set_title("Humidity", fontsize=13)
axs[1].set_xticks([])
axs[1].set_ylabel("")
axs[1].set_xlabel("")

sns.lineplot(data=df_sub, x="date", y="wind_speed", ax=axs[2])
axs[2].set_title("Wind speed", fontsize=13)
# Label every 10th date of the subsample on the bottom axis only.
plt.xticks(df_sub.date[::10], rotation=45, size=8)
plt.tight_layout()
plt.show()

  • Qualitative: stacked barplot over time.
Code
import plotly.express as px
def cat_gdp(yearly_data):
    return pd.qcut(yearly_data, q=3, labels=['Developing', 'Emerging', 'Developed'])
df = gapminder
# Apply the function to each year
df['GDP_Category'] = df.groupby('year').apply(lambda x: cat_gdp(x.gdpPercap)).reset_index(level=0, drop=True)

df_asia = df.query("continent == 'Asia'")
# Count the countries in each GDP category per year
df_agg = df_asia.groupby(['year', 'GDP_Category']).size().reset_index(name='Count')

# Create the stacked bar chart
fig = px.bar(
    df_agg, x='year', y='Count', 
    color='GDP_Category', barmode='stack',
    title="Evolution of Asian Countries' GDP from 1952 to 2007",
    labels={'Count': 'Number of Countries', 'year': 'Year'})

fig.update_layout(height=410, width=500)
fig.show()
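To see what `pd.qcut` with `q=3` does inside `cat_gdp`, here is a small sketch on nine made-up GDP values (the numbers are hypothetical, not Gapminder data). It cuts at the 1/3 and 2/3 quantiles, so each label receives about a third of the values.

```python
import pandas as pd

# Hypothetical GDP-per-capita values for nine countries in a single year.
gdp = pd.Series([400, 700, 900, 1500, 2300, 4600, 5300, 12000, 30000])

# q=3 cuts at the 1/3 and 2/3 quantiles of the values passed in.
category = pd.qcut(gdp, q=3, labels=["Developing", "Emerging", "Developed"])

print(category.value_counts().sort_index())
```

Because the function is applied year by year, a label is relative to the other countries in the same year: a country can change category over time even if its own GDP barely moves.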

Visualization & Statistical values

Statistical values: Autocorrelation

  • We’re interested in how the current value influences later values.
  • Consider the mean temperature & its lags:
     Temp    Lag1    Lag2    Lag3
0  10.000  15.833  12.250  16.667
1  15.833  12.250  16.667  15.600
2  12.250  16.667  15.600  19.000
3  16.667  15.600  19.000  22.333
4  15.600  19.000  22.333  24.143
  • Q1: If Temp and Lag1 are highly correlated, what does that mean?
  • A1: The current value is highly correlated with the next one.
  • Correlations of our example:
      Temp  Lag1  Lag2  Lag3
Temp   1.0  0.89  0.82  0.71
  • Autocorrelation at lag \(\color{blue}{k}\) of \((X_t)\): \[r_{\color{blue}{k}}=\frac{n}{n-\color{blue}{k}}\frac{\sum_{t=1}^{n-\color{blue}{k}}(X_t-\overline{X})(X_{t+\color{blue}{k}}-\overline{X})}{\sum_{t=1}^{n}(X_t-\overline{X})^2},\] where \(\overline{X}=\frac{1}{n}\sum_{t=1}^nX_t\) (average).
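  • Interpretation: \(r_k\) indicates the correlation between the original timeseries and its \(k\)-lag timeseries. Without the factor \(\frac{n}{n-k}\), \(-1\leq r_k\leq 1\) for any lag \(k\); with it, \(r_k\) stays close to this range when \(k\) is small compared to \(n\).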
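The lag table and the formula can be sketched as follows, using the first eight temperature values that can be read off the lag table. The name `df_lag` mirrors the one used in the plotting code, but how the original table was built is an assumption; `autocorr` is a hypothetical helper that follows the formula term by term.

```python
import numpy as np
import pandas as pd

# First eight values recoverable from the lag table above.
temp = pd.Series([10.000, 15.833, 12.250, 16.667, 15.600, 19.000, 22.333, 24.143])

# Column "Lagk" holds the value k steps later, matching the table layout.
df_lag = pd.DataFrame({"Temp": temp})
for k in (1, 2, 3):
    df_lag[f"Lag{k}"] = temp.shift(-k)

def autocorr(x, k):
    """Autocorrelation at lag k, following the formula above (with the n/(n-k) factor)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    num = np.sum((x[: n - k] - xbar) * (x[k:] - xbar))
    den = np.sum((x - xbar) ** 2)
    return n / (n - k) * num / den

print(df_lag.head())
print([round(autocorr(temp, k), 3) for k in (0, 1, 2, 3)])
```

With only eight values the results will not match the correlations reported for the full series; the point is the mechanics of shifting and of the formula.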

Visualization & Statistical values

Visualization: Correlogram/ACF plot

  • It shows the relation between the lag \(k\) and the autocorrelation \(r_k\).
  • In Python:
Code
from statsmodels.graphics.tsaplots import plot_acf
_, ax = plt.subplots(2,1,figsize=(5, 3.65))
plot_acf(df_lag.Temp, lags=60, ax=ax[0])
ax[0].set_title('Correlogram for Mean Temperature', fontsize=13)
ax[0].set_xlabel('Lag')
ax[0].set_ylabel('Autocorrelation')

sns.lineplot(data=df_climate.iloc[::12,:].iloc[:60,:], x="date", y="meantemp", ax=ax[1])
ax[1].set_title("Mean temperature", fontsize=13)
ax[1].set_ylabel("")
ax[1].set_xlabel("")
plt.xticks(df_climate.iloc[::12,:].iloc[:60,:].date[::10], rotation=45, size=8)
plt.tight_layout()
plt.show()

  • Interpretation:
    • The autocorrelation oscillates between positive and negative values, showing a periodic pattern in the temperature.
    • Peaks and troughs in the autocorrelation repeat approximately every 30 lags, indicating cycles in the data.
    • Values outside the shaded region indicate significant autocorrelation, i.e. a strong relationship between temperatures separated by those lags.
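The reading of the correlogram can be reproduced on a synthetic series. The sketch below assumes a noisy sine wave with period 30 (made-up data, not the climate series) and uses the common approximation ±1.96/√n for the band expected under white noise; the band drawn by `plot_acf` may be computed differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n, period = 600, 30
t = np.arange(n)
# Hypothetical series: a sine wave with period 30 plus noise.
x = 10 * np.sin(2 * np.pi * t / period) + rng.normal(0, 1, n)

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (standard form, without the n/(n-k) factor)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    den = np.sum(x ** 2)
    return np.array([np.sum(x[: len(x) - k] * x[k:]) / den for k in range(max_lag + 1)])

r = acf(x, 60)
band = 1.96 / np.sqrt(n)  # approximate 95% band under white noise

# Strongest positive lag near one period, strongest negative lag near half a period.
peak_lag = 20 + int(np.argmax(r[20:41]))
trough_lag = 5 + int(np.argmin(r[5:26]))
print(peak_lag, trough_lag, round(band, 3))
```

A peak near one full period and a trough near half a period are the signature of a cycle; both lie far outside the band here.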

Visualization & Statistical values

Visualization: Correlogram/ACF plot

  • Consider more examples:

  • Their correlograms:

Visualization & Statistical values

Visualization: Correlogram/ACF plot

          Date   Open   High    Low  Close
0   2012-05-18  42.05  45.00  38.00  38.23
1   2012-05-21  36.53  36.66  33.00  34.03
2   2012-05-22  32.61  33.59  30.94  31.00
3   2012-05-23  31.37  32.50  31.36  32.00
4   2012-05-24  32.95  33.21  31.77  33.03
5   2012-05-25  32.90  32.95  31.11  31.91
6   2012-05-29  31.48  31.69  28.65  28.84
7   2012-05-30  28.70  29.55  27.86  28.19
8   2012-05-31  28.55  29.67  26.83  29.60
9   2012-06-01  28.89  29.15  27.39  27.72
10  2012-06-04  27.20  27.65  26.44  26.90
11  2012-06-05  26.70  27.76  25.75  25.87

Timeseries Main Components & Decompositions

Main components & Decompositions

  • Three main components:
    • \(T_t\): Trend-cycle component
    • \(S_t\): Seasonal component
    • \(R_t\): Remainder
  • Two main decompositions:
    • Additive decomposition: \[X_t=T_t+S_t+R_t,\] with \(R_t\sim{\cal N}(0,\sigma^2),\sigma>0\).
    • Multiplicative decomposition: \[X_t=T_t\times S_t\times R_t,\] with \(R_t\sim{\cal N}(1,\sigma^2),\sigma>0\).
Code
from statsmodels.tsa.seasonal import seasonal_decompose
ts = df_fb['Close'].values[::10]  # every 10th closing price, as a 1-D array
decomposition = seasonal_decompose(
    ts, model='additive', 
    period=12)
seasonal, trend, residual = decomposition.seasonal, decomposition.trend, decomposition.resid

plt.figure(figsize=(5, 5))
plt.subplot(411)
plt.plot(ts, 'r', label='Original')
plt.legend()
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend()
plt.subplot(413)
plt.plot(seasonal, label='Seasonal')
plt.legend()
plt.subplot(414)
plt.plot(residual, label='Residual')
plt.legend()
plt.tight_layout()
plt.show()
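A rough idea of what an additive decomposition computes can be sketched by hand with NumPy. This is a simplified illustration on a synthetic series (hypothetical trend and seasonal values, even period assumed), in the spirit of a moving-average decomposition; it is not the exact `seasonal_decompose` implementation.

```python
import numpy as np

# Hypothetical series: linear trend + period-4 seasonal pattern (no noise, for clarity).
period, n = 4, 40
t = np.arange(n)
season_true = np.array([2.0, -1.0, 0.5, -1.5])  # sums to zero
x = 0.5 * t + season_true[t % period]

# 1) Trend: centered moving average over one full period
#    (for an even period, half weights are placed on both end points).
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
half = period // 2
trend = np.full(n, np.nan)  # undefined at the edges
trend[half : n - half] = np.convolve(x, weights, mode="valid")

# 2) Seasonal: average the detrended values at each position in the cycle, then center.
detrended = x - trend
seasonal_means = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
seasonal_means -= seasonal_means.mean()
seasonal = seasonal_means[t % period]

# 3) Remainder: what is left after removing trend and seasonality.
remainder = x - trend - seasonal

print(seasonal_means)
```

On this noise-free example the recovered seasonal pattern equals the one used to build the series and the remainder is zero wherever the trend is defined.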

Main components & Decompositions

  • Three main components:
    • \(T_t\): Trend-cycle component
    • \(S_t\): Seasonal component
    • \(R_t\): Remainder
  • Two main decompositions:
    • Additive decomposition: \[X_t=T_t+S_t+R_t,\] with \(R_t\sim{\cal N}(0,\sigma^2),\sigma>0\).
    • Multiplicative decomposition: \[X_t=T_t\times S_t\times R_t,\] with \(R_t\sim{\cal N}(1,\sigma^2),\sigma>0\) ✅.
Code
decomposition = seasonal_decompose(
    ts, model='multiplicative', 
    period=12)
seasonal, trend, residual = decomposition.seasonal, decomposition.trend, decomposition.resid

plt.figure(figsize=(5, 5))
plt.subplot(411)
plt.plot(ts, 'r', label='Original')
plt.legend()
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend()
plt.subplot(413)
plt.plot(seasonal, label='Seasonal')
plt.legend()
plt.subplot(414)
plt.plot(residual, label='Residual')
plt.legend()
plt.tight_layout()
plt.show()
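One way to see when the multiplicative model is the natural choice: the seasonal swing grows with the level of the series, and taking logarithms turns the product into a sum. The sketch below uses a synthetic positive series (hypothetical growth rate and seasonal factors) and leaves the remainder out for clarity.

```python
import numpy as np

# Hypothetical positive series: exponential trend times period-4 seasonal factors.
period, n = 4, 40
t = np.arange(n)
trend_true = 100 * 1.05 ** t
season_true = np.array([1.2, 0.9, 1.1, 0.8])
x = trend_true * season_true[t % period]

# Taking logs turns the product into a sum: log X = log T + log S.
log_x = np.log(x)
log_parts = np.log(trend_true) + np.log(season_true[t % period])

# In the original scale, the swing within one cycle grows with the level...
swing_early = x[:period].max() - x[:period].min()
swing_late = x[-period:].max() - x[-period:].min()
# ...but on the log scale it is the same in the first and the last cycle.
log_swing_early = log_x[:period].max() - log_x[:period].min()
log_swing_late = log_x[-period:].max() - log_x[-period:].min()

print(round(swing_early, 2), round(swing_late, 2))
print(round(log_swing_early, 4), round(log_swing_late, 4))
```

This is why a log-transformation followed by an additive decomposition is a common alternative to a multiplicative decomposition for strictly positive data such as prices.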

Real-world Examples

Real-world Examples

Nvidia stock price

Real-world Examples

Nvidia stock price

Log-transformation

Real-world Examples

Nvidia stock price

ACF Plot

  • A slow decay in the ACF (high autocorrelation even at large lags) is consistent with a strong trend (non-stationarity).
  • No repeating spikes at fixed lags (e.g., lag 12 for monthly data) means there is no evidence of seasonality.
  • Stock prices are usually very complex and cannot be described precisely.
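The "slow decay" pattern can be reproduced with a random walk, a common toy model for a trending price. The sketch below uses simulated data only (not Nvidia prices) and compares the ACF of the level with the ACF of its day-to-day changes.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
# Hypothetical "price": a random walk (cumulative sum of independent shocks).
shocks = rng.normal(0, 1, n)
price = 100 + np.cumsum(shocks)

def acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (standard form, without the n/(n-k) factor)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    den = np.sum(x ** 2)
    return np.array([np.sum(x[: len(x) - k] * x[k:]) / den for k in range(max_lag + 1)])

r_price = acf(price, 40)          # level: decays slowly
r_diff = acf(np.diff(price), 40)  # changes: close to zero after lag 0
band = 1.96 / np.sqrt(n)          # approximate 95% band under white noise

print(np.round(r_price[[1, 10, 20]], 3))
print(np.round(r_diff[[1, 10, 20]], 3), round(band, 3))
```

The level stays highly autocorrelated over many lags, while its differences behave like noise; this is the usual motivation for analysing changes or returns rather than raw prices.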

Real-world Examples

Nvidia stock price

Decompositions

🥳 Yeahhhh…

Let’s Party… 🥂