import pandas as pd              # Import pandas package
import seaborn as sns            # Package for beautiful graphs
import matplotlib.pyplot as plt  # Graph management

sns.set(style="whitegrid")       # Set grid background
# The data can also be found at this link:
# path = "https://gist.githubusercontent.com/curran/4b59d1046d9e66f2787780ad51a1cd87/raw/9ec906b78a98cf300947a37b56cfe70d01183200/data.tsv"
path = "./data/faithful.csv"     # Path where the data is stored
df0 = pd.read_csv(path)          # Import it into Python
df0.head(5)                      # Show the first 5 rows
|   | eruptions | waiting |
|---|-----------|---------|
| 0 | 3.600     | 79      |
| 1 | 1.800     | 54      |
| 2 | 3.333     | 74      |
| 3 | 2.283     | 62      |
| 4 | 4.533     | 85      |
plt.figure(figsize=(5, 3.2))  # Define figure size
sns.scatterplot(df0, x="waiting", y="eruptions")  # Create scatterplot
plt.title("Old Faithful data from Yellowstone National Park, US", fontsize=10)  # Title
plt.suptitle("Eruptions vs waiting times", fontsize=13, y=1)  # Subtitle
plt.show()
The longer the wait, the longer the duration of the eruption.
import plotly.express as px

# df1 is the marketing data
fig = px.scatter_3d(df1, x="youtube", y="facebook", z="sales",
                    size="newspaper", color="newspaper", size_max=40)
camera = dict(eye=dict(x=1, y=-1, z=1.2))
fig.update_layout(title="Sales as a function of all ads",
                  width=550, height=350, scene_camera=camera)
fig.show()
Increasing ads seems to boost sales!
Our Motivational Quote Today
“Where there is data smoke, there is business fire.” — Thomas Redman
Predict \(y\) using only a single input \(\text{x}\in\mathbb{R}\).
Model: \(\underbrace{\hat{y}}_{\text{predicted eruption}}=\beta_0+\beta_1\underbrace{\text{x}}_{\text{waiting}}\) for \(\beta_0,\beta_1\in\mathbb{R}\).
Simple Linear Regression (SLR)
Residual Sum of Squares:
\[\begin{align*}
\text{RSS}&=\sum_{i=1}^n(\color{red}{y_i-\hat{y}_i})^2\\
&=\sum_{i=1}^n(\color{red}{y_i-\beta_0-\beta_1x_i})^2
\end{align*}\]
Ordinary Least Squares (OLS): The best-fitted line minimizes the RSS.
Simple Linear Regression (SLR)
Optimal Least-Square Line
Optimal line: \(\hat{y}=\hat{\beta}_0+\hat{\beta}_1\text{x}\) where
\[\begin{align}
\hat{\beta}_1&=\frac{\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)}{\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2}=\frac{\text{Cov}(X,Y)}{\text{V}(X)}\\
\hat{\beta}_0&=\overline{y}_n-\hat{\beta}_1\overline{\text{x}}_n,\end{align}
\] with
\(\overline{\text{x}}_n=\frac{1}{n}\sum_{i=1}^n\text{x}_i\) and \(\overline{y}_n=\frac{1}{n}\sum_{i=1}^ny_i\) are the averages/means of \(X\) and \(Y\), respectively;
\(\text{Cov}(X,Y)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)\) is the “covariance” between \(X\) & \(Y\);
\(\text{V}(X)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2\) is the “variance” of \(X\).
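As a sanity check, these closed-form estimates can be computed directly with NumPy on the Old Faithful data loaded above (a minimal sketch; the variable names are mine):

```python
import numpy as np

x = df0["waiting"].to_numpy()    # input: waiting time
y = df0["eruptions"].to_numpy()  # target: eruption duration

beta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # Cov(X, Y) / V(X)
beta0 = y.mean() - beta1 * x.mean()                # intercept from the two means
print(f"eruptions = {beta0:.3f} + {beta1:.3f} * waiting")
```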
Simple Linear Regression (SLR)
Application to marketing data
import numpy as np
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression  # import model

lr = LinearRegression()                            # initiate model
x_train, y_train = df1[['youtube']], df1['sales']  # training input-target
lr = lr.fit(x_train, y_train)                      # build model = estimate coefficients

# Training data and fitted line
pred_train = lr.predict(x_train)

# Figures
fig_market2 = go.Figure(go.Scatter(x=x_train.youtube, y=y_train,
                                   mode="markers", name="Training data"))
fig_market2.add_trace(go.Scatter(
    x=x_train.youtube, y=pred_train, mode="lines+markers",
    name=f"<br>Train prediction<br> Sale={np.round(lr.coef_,2)[0]}youtube+{np.round(lr.intercept_,2)}"))
fig_market2.update_layout(title="Sales vs youtube", xaxis=dict(title="youtube"),
                          yaxis=dict(title="sales"), width=600, height=400)
fig_market2.show()
Simple Linear Regression (SLR)
Model Diagnostics (judging the model)
R-squared (coefficient of determination): \[R^2=1-\frac{\text{RSS}}{\text{TSS}}=1-\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n(y_i-\overline{y}_n)^2}=\frac{\text{V}(\hat{Y})}{\text{V}(Y)}.\]
Example: \(R^2=\) 0.612 in our model.
Interpretation: The model (youtube) can capture around 61.2% of the variation in the target (sales).
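The same value can be reproduced from the definition, reusing `lr`, `x_train`, and `y_train` from the code above (a quick sketch):

```python
import numpy as np

y_hat = lr.predict(x_train)                    # fitted values
rss = np.sum((y_train - y_hat) ** 2)           # residual sum of squares
tss = np.sum((y_train - y_train.mean()) ** 2)  # total sum of squares
print(1 - rss / tss)                           # R^2 from the definition
print(lr.score(x_train, y_train))              # the same value via scikit-learn
```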
Simple Linear Regression (SLR)
Model Diagnostics (judging the model)
Residuals: \(e_i=y_i-\hat{y}_i\sim{\cal N}(0,\sigma^2)\) for some \(\sigma>0\).
They should be symmetric around \(0\) and should NOT depend on \(\text{x}_i\) or \(y_i\), as the residual-plot sketch below illustrates.
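Here is a sketch of such a plot, reusing `lr` and the training data from above; any visible pattern in it would violate these assumptions:

```python
pred = lr.predict(x_train)                   # fitted values
residuals = y_train - pred                   # e_i = y_i - y_hat_i
plt.figure(figsize=(5, 3.2))
plt.scatter(pred, residuals, s=10)
plt.axhline(0, color="red", linestyle="--")  # residuals should hover around 0
plt.xlabel("Predicted sales")
plt.ylabel("Residuals")
plt.show()
```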
Coefficient \(\beta_1=\) 0.048 indicates that sales are expected to increase (or decrease) by 0.048 units for every \(1\)-unit increase (or decrease) in YouTube ads.
R-squared: Represents the proportion of the target’s variation captured by the model.
Residual: In a good model, the residuals should look like random noise, indicating that the model has captured most of the usable information about the target.
Marketing example:
The amount of ads on YouTube alone can explain around \(61\)% (R-squared) of the variation in sales.
However, the residuals still contain patterns (large errors at small and large predicted sales), suggesting the model can be improved.
Multiple Linear Regression
Correlation Matrix
Pearson’s correlation coefficient
Correlation between two columns \(X_1\) and \(X_2\): \[r=r_{X_1,X_2}=\frac{\sum_{i=1}^n(x_{i1}-\overline{x}_{1})(x_{i2}-\overline{x}_{2})}{\sqrt{\left(\sum_{i=1}^n(x_{i1}-\overline{x}_{1})^2\right)\left(\sum_{i=1}^n(x_{i2}-\overline{x}_{2})^2\right)}}\]
\(-1\leq r\leq 1\) for any pair \(X_1\) and \(X_2\).
If \(r\approx 1\), then \(X_1\) and \(X_2\) are positively correlated (one ↗️, another ↗️).
If \(r\approx -1\), then \(X_1\) and \(X_2\) are negatively correlated (one ↗️, another ↘️)
If \(r\approx 0\), then \(X_1\) and \(X_2\) are uncorrelated (no linear pattern or relation).
It helps identify informative/useful inputs for building models.
It also helps identify redundant (strongly correlated) inputs; see the sketch after this list.
Note: Correlation does not imply causation; it only indicates a relationship, not a cause-and-effect link.
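Pearson's \(r\) can be computed by hand and checked against pandas (a minimal sketch; `pearson_r` is a hypothetical helper, and `df1` is the marketing data):

```python
import numpy as np

def pearson_r(x1, x2):
    """Pearson's correlation coefficient, straight from the formula above."""
    d1, d2 = x1 - x1.mean(), x2 - x2.mean()
    return np.sum(d1 * d2) / np.sqrt(np.sum(d1**2) * np.sum(d2**2))

r = pearson_r(df1["youtube"].to_numpy(), df1["sales"].to_numpy())
print(r)  # should match df1["youtube"].corr(df1["sales"])
```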
Correlation Matrix
Example: marketing data
cor = df1.corr()  # df1 is the marketing data
cor.style.background_gradient()
|           | youtube  | facebook | newspaper | sales    |
|-----------|----------|----------|-----------|----------|
| youtube   | 1.000000 | 0.054809 | 0.056648  | 0.782224 |
| facebook  | 0.054809 | 1.000000 | 0.354104  | 0.576223 |
| newspaper | 0.056648 | 0.354104 | 1.000000  | 0.228299 |
| sales     | 0.782224 | 0.576223 | 0.228299  | 1.000000 |
YouTube is strongly correlated with target Sales and is most useful for building models, followed by Facebook and Newspaper.
Facebook and Newspaper have a significantly larger correlation with each other than with YouTube.
Multiple Linear Regression (MLR)
Predict \(y\) using more than one input \(\text{x}\in\mathbb{R}^d\).
Model: \(\underbrace{\hat{y}}_{\text{Sales}}=\beta_0+\beta_1\underbrace{x_1}_{\text{FB}}+\beta_2\underbrace{x_2}_{\text{NP}}\) for \(\beta_0,\beta_1,\beta_2\in\mathbb{R}\).
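A minimal sketch of fitting this model with scikit-learn, assuming `df1` is the marketing DataFrame used earlier (the names `X_train` and `mlr` are mine):

```python
from sklearn.linear_model import LinearRegression

X_train = df1[["facebook", "newspaper"]]  # two inputs: FB and NP ads
y_train = df1["sales"]                    # target
mlr = LinearRegression().fit(X_train, y_train)
print(mlr.intercept_, mlr.coef_)          # estimates of beta_0, (beta_1, beta_2)
```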
A better criterion, Adjusted R-squared: \[R^2_{\text{adj}}=1-\frac{n-1}{n-d-1}(1-R^2).\] Here, \(n\) is the number of observations and \(d\) is the number of inputs.
For our built model: \(R^2=\) 0.333 and \(R^2_{\text{adj}}=\) 0.326 (not so good!).
\(R^2_{\text{adj}}\) is always smaller than \(R^2\). A large \(R^2\) with only a slight decrease in \(R^2_{\text{adj}}\) indicates a good MLR model.
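scikit-learn does not report adjusted R-squared directly, but it follows in one line from the formula above (a sketch reusing the hypothetical `mlr` fit from the previous snippet):

```python
n, d = X_train.shape                           # n observations, d inputs
r2 = mlr.score(X_train, y_train)               # plain R^2
r2_adj = 1 - (n - 1) / (n - d - 1) * (1 - r2)  # adjusted R^2
print(round(r2, 3), round(r2_adj, 3))          # should be about 0.333 and 0.326
```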
Rough interpretation: \(\beta_1=\) 0.199 indicates that if Facebook ads are increased (or decreased) by \(1\) unit, sales are expected to increase (or decrease) by 0.199 units.
Similarly, \(\beta_2=\) 0.007 indicates that sales are expected to change by only 0.007 units per \(1\)-unit change in newspaper ads, a negligible effect.
\(R^2=\) 0.333 indicates that around \(33.3\)% of the variation in sales can be explained by ads on Facebook and Newspaper together, which is not enough for a good model!
A slight decrease in \(R^2_{\text{adj}}=\) 0.326 suggests that the information provided by both variables is not redundant for explaining sales.
There are some large negative residuals, indicating that the model overestimates some actual targets, especially around large sales.
What’s next?
Use standardized inputs: \(\text{x}_i\to \tilde{\text{x}}_i=\frac{\text{x}_i-\overline{\text{x}}_i}{\hat{\sigma}_{\text{x}_i}}\)
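A minimal sketch of this standardization, using either plain pandas or scikit-learn's StandardScaler (the column names assume the marketing data):

```python
from sklearn.preprocessing import StandardScaler

X = df1[["youtube", "facebook", "newspaper"]]
X_std = StandardScaler().fit_transform(X)  # (x - mean) / std, column by column
# Equivalent in plain pandas:
# X_std = (X - X.mean()) / X.std(ddof=0)
```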