Model Development


ITM-370: Data Analytics

Lecturer: Dr. Sothea Has

Getting to know you a bit…

Let’s see…

Motivation

Old Faithful dataset (\(272\) rows, \(2\) columns)

Code
import pandas as pd                 # Import pandas package
import seaborn as sns               # Package for beautiful graphs
import matplotlib.pyplot as plt     # Graph management
sns.set(style="whitegrid")          # Set grid background
# path = "https://gist.githubusercontent.com/curran/4b59d1046d9e66f2787780ad51a1cd87/raw/9ec906b78a98cf300947a37b56cfe70d01183200/data.tsv"                       # The data can be found in this link
path = "./data/faithful.csv"        # Path where the data is stored
df0 = pd.read_csv(path)              # Import it into Python
df0.head(5)                         # Show the first 5 rows
eruptions waiting
0 3.600 79
1 1.800 54
2 3.333 74
3 2.283 62
4 4.533 85

Code
plt.figure(figsize=(5,3.2))                          # Define figure size
sns.scatterplot(data=df0, x="waiting", y="eruptions")    # Create scatterplot
plt.title("Old Faithful data from Yellowstone National Park, US", fontsize=10)    # Title
plt.suptitle("Eruptions vs waiting times", fontsize=13, y=1)                 # Subtitle
plt.show()

  • The longer the wait, the longer the duration of the eruption.

Motivation (2)

Marketing data (\(200\) rows, \(4\) columns)

Code
import pyreadr
import pandas as pd
df1 = pyreadr.read_r("./data/marketing.rda")
df1 = df1['marketing']
df1.head(5)                        # Show the first 5 rows
youtube facebook newspaper sales
0 276.12 45.36 83.04 26.52
1 53.40 47.16 54.12 12.48
2 20.64 55.08 83.16 11.16
3 181.80 49.56 70.20 22.20
4 216.96 12.96 70.08 15.48

🧐 How would you represent everything in one graph?

Code
import plotly.express as px
fig = px.scatter_3d(df1, x="youtube", y="facebook", z="sales", 
                    size="newspaper", color="newspaper",
                    size_max=40)
camera = dict(eye=dict(x=1, y=-1, z=1.2))
fig.update_layout(title="Sales as a function of all ads",
                  width=550, height=350,
                  scene_camera=camera)


  • Increasing ads seems to boost sales!

Our Motivational Quote Today

“Where there is data smoke, there is business fire.” — Thomas Redman

Some Notation

Data: input-target

youtube facebook newspaper sales
0 276.12 45.36 83.04 26.52
1 53.40 47.16 54.12 12.48
2 20.64 55.08 83.16 11.16
3 181.80 49.56 70.20 22.20
4 216.96 12.96 70.08 15.48

\[{\cal D}=\begin{bmatrix} X_1 & \dots & X_d & Y\\ x_{11} & \dots & x_{1d} & y_1\\ x_{21} & \dots & x_{2d} & y_2\\ \vdots & \ddots & \vdots & \vdots\\ x_{n1} & \dots & x_{nd} & y_n \end{bmatrix}\]

  • Our marketing data: \(n=200\) and \(d=3\).
  • Input \(\text{x}_i=(x_{i1}, \dots,x_{id})\in\mathbb{R}^d\) with target \(y_i\).
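In code, this notation maps directly onto arrays. A minimal sketch, assuming df1 (the marketing data) is loaded as above:

Code
X = df1[["youtube", "facebook", "newspaper"]].to_numpy()  # inputs: rows are x_i in R^d
Y = df1["sales"].to_numpy()                               # targets y_i
print(X.shape, Y.shape)   # (200, 3) and (200,): n=200, d=3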

Model Development

Objective

Using input \(\text{x}\) to predict its corresponding target \(y\).

  • Simple Linear Regression \[\begin{bmatrix} X\\ x_1\\ \vdots\\ x_n\\ \end{bmatrix}\leadsto\begin{bmatrix} Y\\ y_1\\ \vdots\\ y_n\\ \end{bmatrix}\]
  • Multiple Linear Regression \[\begin{bmatrix} X_1 & \dots & X_d\\ x_{11} & \dots & x_{1d}\\ \vdots & \ddots & \vdots\\ x_{n1} & \dots & x_{nd}\\ \end{bmatrix}\leadsto\begin{bmatrix} Y\\ y_1\\ \vdots\\ y_n\\ \end{bmatrix}\]

Simple Linear Regression

Simple Linear Regression (SLR)

  • Predict \(y\) using only a single input \(\text{x}\in\mathbb{R}\).
  • Model: \(\underbrace{\hat{y}}_{\text{predicted eruption}}=\beta_0+\beta_1\underbrace{\text{x}}_{\text{waiting}}\) for \(\beta_0,\beta_1\in\mathbb{R}\).

Simple Linear Regression (SLR)

  • Residual Sum of Squares: \(\begin{align*} \text{RSS}&=\sum_{i=1}^n(\color{red}{y_i-\hat{y}_i})^2\\ &=\sum_{i=1}^n(\color{red}{y_i-\beta_0-\beta_1x_i})^2 \end{align*}\)

Ordinary Least Squares (OLS):
The best-fitted line minimizes RSS.
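As a concrete check, RSS can be written as a small function and evaluated for any candidate line. A minimal sketch, assuming df0 (the Old Faithful data) from the earlier code:

Code
import numpy as np

def rss(beta0, beta1, x, y):
    """Residual sum of squares of the candidate line y = beta0 + beta1*x."""
    return np.sum((y - beta0 - beta1 * x)**2)

x, y = df0["waiting"].to_numpy(), df0["eruptions"].to_numpy()
print(rss(0.0, 0.05, x, y))   # RSS of one arbitrary candidate line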

Simple Linear Regression (SLR)

Optimal Least-Square Line

Optimal line: \(\hat{y}=\hat{\beta}_0+\hat{\beta}_1\text{x}\) where

\[\begin{align} \hat{\beta}_1&=\frac{\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)}{\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2}=\frac{\text{Cov}(X,Y)}{\text{V}(X)}\\ \hat{\beta}_0&=\overline{y}_n-\hat{\beta}_1\overline{\text{x}}_n,\end{align}\] with

  • \(\overline{\text{x}}_n=\frac{1}{n}\sum_{i=1}^n\text{x}_i\) and \(\overline{y}_n=\frac{1}{n}\sum_{i=1}^ny_i\) the averages/means of \(X\) and \(Y\), respectively;
  • \(\text{Cov}(X,Y)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)\) the “covariance” between \(X\) & \(Y\);
  • \(\text{V}(X)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2\) the “variance” of \(X\).
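These closed-form estimates can be computed directly. A minimal sketch, assuming df0 is loaded as above:

Code
import numpy as np

x, y = df0["waiting"].to_numpy(), df0["eruptions"].to_numpy()
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()
print(f"eruptions = {beta0:.3f} + {beta1:.4f} * waiting")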

Simple Linear Regression (SLR)

Application to the marketing data

Code
from sklearn.linear_model import LinearRegression  # import model
import plotly.graph_objects as go                  # plotly graph objects
import numpy as np

lr = LinearRegression()                 # initiate model
x_train, y_train = df1[['youtube']], df1['sales']  # training input-target
lr = lr.fit(x_train, y_train)         # build model = estimate coefficients

# Training data and fitted line
pred_train = lr.predict(x_train)

# Figures
fig_market2 = go.Figure(go.Scatter(x=x_train.youtube, y=y_train, mode="markers", name="Training data"))
fig_market2.add_trace(go.Scatter(x=x_train.youtube, y=pred_train, mode="lines+markers", name=f"<br>Train prediction<br> Sale={np.round(lr.coef_,2)[0]}youtube+{np.round(lr.intercept_,2)}"))
fig_market2.update_layout(title="Sales vs youtube",
                          xaxis=dict(title="youtube"),
                          yaxis=dict(title="sales"),
                          width=600, height=400)
fig_market2.show()

Simple Linear Regression (SLR)

Model Diagnostics (judging the model)

  • R-squared (coefficient of determination) \[R^2=1-\frac{\text{RSS}}{\text{TSS}}=1-\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n(y_i-\overline{y}_n)^2}=\frac{\text{V}(\hat{Y})}{\text{V}(Y)}.\]
    • Example: \(R^2=\) 0.612 in our model.
    • Interpretation: The model (youtube) can capture around 61.2% of the variation of the target (sales).
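A minimal sketch of this computation, assuming lr, x_train, y_train, and pred_train from the SLR code above:

Code
import numpy as np

rss = np.sum((y_train - pred_train)**2)          # residual sum of squares
tss = np.sum((y_train - y_train.mean())**2)      # total sum of squares
print(round(1 - rss/tss, 3))                     # R-squared by hand, ~0.612
print(round(lr.score(x_train, y_train), 3))      # same quantity via sklearn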

Simple Linear Regression (SLR)

Model Diagnostics (judging the model)

  • Residuals: \(e_i=y_i-\hat{y}_i\sim{\cal N}(0,\sigma^2)\) for some \(\sigma>0\).
    • Symmetric around \(0\) & do NOT depend on \(\text{x}_i\) nor \(y_i\).
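Beyond the plots below, the normality assumption can also be checked numerically. A sketch assuming scipy is available and y_train, pred_train come from the SLR code above:

Code
from scipy import stats

res = y_train - pred_train            # residuals e_i = y_i - yhat_i
stat, p = stats.shapiro(res)          # Shapiro-Wilk test of normality
print(round(stat, 3), round(p, 4))    # a small p-value is evidence against normality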
Code
res = y_train - pred_train   # Compute residuals e_i = y_i - yhat_i

from plotly.subplots import make_subplots

fig_res = make_subplots(rows=1, cols=2, subplot_titles=("Residuals vs predicted sales", "Residual density"))

fig_res.add_trace(
    go.Scatter(x=pred_train, y=res, name="Residuals", mode="markers"), 
    row=1, col=1)
fig_res.add_trace(
    go.Scatter(x=[np.min(pred_train), np.max(pred_train)], y=[0,0], mode="lines", line=dict(color='red', dash="dash"), name="0"), 
    row=1, col=1)

fig_res.update_xaxes(title_text="Predicted Sales", row=1, col=1)
fig_res.update_yaxes(title_text="Residuals", row=1, col=1)


fig_res.add_trace(
    go.Histogram(x=res, name = "Residual histogram"), row=1, col=2
)
fig_res.update_xaxes(title_text="Residual", row=1, col=2)
fig_res.update_yaxes(title_text="Histogram", row=1, col=2)

fig_res.update_layout(width=950, height=250)
fig_res.show()

Simple Linear Regression (SLR)

SLR on Marketing Data

Summary

  • Obtained model: Sales = 0.048 YouTube + 8.439.
  • Coefficient \(\beta_1=\) 0.048 indicates that Sales is expected to increase (or decrease) by 0.048 units for every \(1\) unit increase (or decrease) in YouTube ads.
  • R-squared: Represents the proportion of the target’s variation captured by the model.
  • Residual: In a good model, the residuals should be random noise, indicating the model has captured most of the information from the target.
  • Marketing example:
    • The amount spent on YouTube ads alone can explain around \(61\)% (R-squared) of the variation in sales.
    • However, the residuals still contain patterns (large errors at small and large predicted sales), suggesting the model can be improved.

Multiple Linear Regression

Correlation Matrix

Pearson’s correlation coefficient

  • Correlation between two columns \(X_1\) and \(X_2\): \[r=r_{X_1,X_2}=\frac{\sum_{i=1}^n(x_{i1}-\overline{x}_{1})(x_{i2}-\overline{x}_{2})}{\sqrt{\left(\sum_{i=1}^n(x_{i1}-\overline{x}_{1})^2\right)\left(\sum_{i=1}^n(x_{i2}-\overline{x}_{2})^2\right)}}\]
    • \(-1\leq r\leq 1\) for any pair \(X_1\) and \(X_2\).
    • If \(r\approx 1\), then \(X_1\) and \(X_2\) are positively correlated (one ↗️, another ↗️).
    • If \(r\approx -1\), then \(X_1\) and \(X_2\) are negatively correlated (one ↗️, another ↘️)
    • If \(r\approx 0\), then \(X_1\) and \(X_2\) are uncorrelated (no pattern or relation).
  • It helps identify informative/useful inputs for building models.
  • It also helps identify redundant (strongly correlated) inputs.
  • Note: Correlation does not imply causation; it only indicates a relationship, not a cause-and-effect link.
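A minimal sketch of this formula, assuming df1 (the marketing data) is loaded:

Code
import numpy as np

def pearson_r(a, b):
    """Pearson's correlation coefficient between two columns."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    num = np.sum((a - a.mean()) * (b - b.mean()))
    den = np.sqrt(np.sum((a - a.mean())**2) * np.sum((b - b.mean())**2))
    return num / den

print(round(pearson_r(df1["youtube"], df1["sales"]), 6))   # ~0.782224, matching the matrix below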

Correlation Matrix

Example: marketing data

Code
cor = df1.corr()   # df1 is the marketing data
cor.style.background_gradient()
  youtube facebook newspaper sales
youtube 1.000000 0.054809 0.056648 0.782224
facebook 0.054809 1.000000 0.354104 0.576223
newspaper 0.056648 0.354104 1.000000 0.228299
sales 0.782224 0.576223 0.228299 1.000000
  • YouTube is strongly correlated with target Sales and is most useful for building models, followed by Facebook and Newspaper.
  • Facebook and Newspaper have a significantly larger correlation with each other than with YouTube.

Multiple Linear Regression (MLR)

  • Predict \(y\) using more than one input \(\text{x}\in\mathbb{R}^d\).
  • Model: \(\underbrace{\hat{y}}_{\text{Sales}}=\beta_0+\beta_1\underbrace{x_1}_{\text{FB}}+\beta_2\underbrace{x_2}_{\text{NP}}\) for \(\beta_0,\beta_1,\beta_2\in\mathbb{R}\).
  • Find the optimal \(\vec{\beta}=[\beta_0,\beta_1,\beta_2]\) by minimizing RSS: \[\text{RSS}=\sum_{i=1}^n(\color{red}{y_i-\hat{y}_i})^2=\Big\|\overbrace{\begin{bmatrix}\color{red}{y_1}\\ y_2\\ \vdots\\ y_n\end{bmatrix}}^{Y}-\overbrace{\begin{bmatrix} \color{red}{1} & \color{red}{x_{11}} & \color{red}{x_{12}}\\ 1 & x_{21} & x_{22}\\ \vdots & \ddots & \vdots\\ 1 & x_{n1} & x_{n2}\end{bmatrix}}^{X}\overbrace{\begin{bmatrix}\color{red}{\beta_0}\\ \color{red}{\beta_1}\\ \color{red}{\beta_2}\end{bmatrix}}^{\vec{\beta}}\Big\|^2\]

  • Minimizing the RSS above, we obtain the Normal Equation: \[\color{blue}{\hat{\beta}=(X^tX)^{-1}X^tY}\in\mathbb{R}^{d+1},\text{ with } d=2.\]

  • Prediction: \[\hat{Y}=X\color{blue}{\hat{\beta}}=\underbrace{\color{blue}{X(X^tX)^{-1}X^t}}_{Projection\ Matrix}Y.\]

The prediction \(\hat{Y}\) of \(Y\) by MLR is the projection of the target \(Y\) onto the subspace spanned by columns of \(X\).
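A minimal NumPy check of the normal equation and the projection view, assuming df1 is loaded as above:

Code
import numpy as np

X = np.column_stack([np.ones(len(df1)), df1["facebook"], df1["newspaper"]])  # add intercept column
Y = df1["sales"].to_numpy()

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solves (X^tX) beta = X^tY; more stable than inverting
print(beta_hat)                                # [intercept, facebook, newspaper]

Y_hat = X @ beta_hat                           # projection of Y onto the column span of X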

Multiple Linear Regression (MLR)

MLR in action

Code
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
x_train, y_train = df1[["facebook", "newspaper"]], df1.sales
mlr.fit(x_train, y_train)      # Fit model
y_hat = mlr.predict(x_train)   # Predict

x_surf0 = np.array([[np.min(x_train.facebook), np.min(x_train.newspaper)], 
                   [np.max(x_train.facebook), np.min(x_train.newspaper)],
                   [np.min(x_train.facebook), np.max(x_train.newspaper)],
                   [np.max(x_train.facebook), np.max(x_train.newspaper)]])
y_surf = mlr.predict(x_surf0).reshape(2,2)
x_surf = np.array([[np.min(x_train.facebook), np.min(x_train.newspaper)], 
                   [np.max(x_train.facebook), np.max(x_train.newspaper)]])
frames = []
frames.append(
    go.Frame(
        data=[go.Scatter3d(x=x_train.facebook,
                           y=x_train.newspaper,
                           z=y_train,
                           mode="markers",
                           marker=dict(size=5),
                           hovertemplate='facebook: %{x}<br>' + 
                                         'newspaper: %{y}<br>' +
                                         'sales: %{z}<extra></extra>'),
        go.Surface(x=x_surf[:,0],
                   y=x_surf[:,1],
                   z=y_surf,
                   hovertemplate='facebook: %{x}<br>' + 
                                         'newspaper: %{y}<br>' +
                                         'sales: %{z}<extra></extra>')],
        name=f"Level: 0"
    ))

alpha = np.linspace(0,1,10)
for alp in alpha[1:]:
    y_proj = y_train + alp * (y_hat - y_train)
    frames.append(go.Frame(
        data=[go.Scatter3d(x=x_train.facebook,
                            y=x_train.newspaper,
                            z=y_proj,
                            mode="markers",
                            marker=dict(size=5),
                            hovertemplate='facebook: %{x}<br>' + 
                                         'newspaper: %{y}<br>' +
                                         'sales: %{z}<extra></extra>'),
        go.Surface(x=x_surf[:,0],
                   y=x_surf[:,1],
                   z=y_surf,
                   hovertemplate='facebook: %{x}<br>' + 
                                         'newspaper: %{y}<br>' +
                                         'sales: %{z}<extra></extra>')],
        name=f"Level: {np.round(alp,3)}"
    ))

# Initial figure: training scatter plus the fitted regression plane

fig_mlr = go.Figure(
    data=[
        go.Scatter3d(x=x_train.facebook,
                               y=x_train.newspaper,
                               z=y_train,
                               mode="markers",
                               marker=dict(size=5),
                               hovertemplate='facebook: %{x}<br>' + 
                                         'newspaper: %{y}<br>' +
                                         'sales: %{z}<extra></extra>'),
        go.Surface(x=x_surf[:,0],
                   y=x_surf[:,1],
                   z=y_surf,
                   hovertemplate='facebook: %{x}<br>' + 
                                         'newspaper: %{y}<br>' +
                                         'sales: %{z}<extra></extra>')
    ],
    layout=go.Layout(
        title=f"Sales = {np.round(mlr.intercept_,3)}+{np.round(mlr.coef_[0],3)}Facebook+{np.round(mlr.coef_[1],3)}Newspaper",
        xaxis=dict(title="Facebook", range=[np.min(df1["facebook"])*0.9, np.max(df1["facebook"])*1.1]),
        yaxis=dict(title="Newspaper", range=[np.min(df1["newspaper"])*0.9, np.max(df1["newspaper"])*1.1]),
        updatemenus=[{
            "buttons": [
                {
                    "args": [None, {"frame": {"duration": 1000, "redraw": True}, "fromcurrent": True, "mode": "immediate"}],
                    "label": "Play",
                    "method": "animate"
                },
                {
                    "args": [[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate"}],
                    "label": "Stop",
                    "method": "animate"
                }
            ],
            "type": "buttons",
            "showactive": False,
            "x": 0.1,
            "y": 1.2,
            "pad": {"r": 3, "t": 50}
        }],
        sliders=[{
            "active": 0,
            "currentvalue": {"prefix": "Project "},
            "pad": {"t": 50},
            "steps": [{"label": f"Level: {np.round(alp,3)}",
                       "method": "animate",
                       "args": [[f"Level: {np.round(alp,3)}"], {"frame": {"duration": 1000, "redraw": True}, "mode": "immediate", 
                       "transition": {"duration": 10}}]}
                      for alp in alpha]
        }]
    ),
    frames=frames
)

fig_mlr.update_layout(height=450, width=800, scene_camera=camera,
                    scene=dict(
                            zaxis=dict(title="Sales", range=[np.min(df1.sales)*0.9, np.max(df1.sales)*1.1])
                        ))

fig_mlr.update_scenes(xaxis_title_text= "Facebook",  
                    yaxis_title_text= "Newspaper",  
                    zaxis_title_text="Sales"
)
fig_mlr.show()

Multiple Linear Regression (MLR)

Model Diagnostics

  • Look at R-squared just like in SLR.
  • A better criterion, Adjusted R-squared: \[R^2_{\text{adj}}=1-\frac{n-1}{n-d-1}(1-R^2).\] Here, \(n\) is the number of observations and \(d\) is the number of inputs.
  • For our built model: \(R^2=\) 0.333 and \(R^2_{\text{adj}}=\) 0.326 (not so good!).
  • \(R^2_{\text{adj}}\) is always smaller than \(R^2\). A large \(R^2\) with only a slight decrease in \(R^2_{\text{adj}}\) indicates a good MLR model.
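A minimal sketch of this adjustment, assuming mlr, x_train, and y_train from the MLR code above:

Code
n, d = x_train.shape                              # n=200 observations, d=2 inputs
r2 = mlr.score(x_train, y_train)                  # R-squared
r2_adj = 1 - (n - 1) / (n - d - 1) * (1 - r2)     # adjusted R-squared
print(round(r2, 3), round(r2_adj, 3))             # ~0.333, ~0.326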

Multiple Linear Regression (MLR)

Model Diagnostics (cont.)

  • Check Residuals
Code
resid = y_train - y_hat   # residuals

from plotly.subplots import make_subplots

fig_res = make_subplots(rows=1, cols=2, subplot_titles=("Residuals vs predicted sales", "Residual density"))

fig_res.add_trace(
    go.Scatter(x=y_hat, y=resid, name="Residuals", mode="markers"), 
    row=1, col=1)
fig_res.add_trace(
    go.Scatter(x=[np.min(y_hat), np.max(y_hat)], y=[0,0], mode="lines", line=dict(color='red', dash="dash"), name="0"), 
    row=1, col=1)

fig_res.update_xaxes(title_text="Predicted Sales", row=1, col=1)
fig_res.update_yaxes(title_text="Residuals", row=1, col=1)


fig_res.add_trace(
    go.Histogram(x=resid, name = "Residual histogram"), row=1, col=2
)
fig_res.update_xaxes(title_text="Residual", row=1, col=2)
fig_res.update_yaxes(title_text="Histogram", row=1, col=2)

fig_res.update_layout(width=950, height=350)
fig_res.show()

Multiple Linear Regression (MLR)

MLR on Marketing Data

Summary

  • Obtained model: Sales = 11.027 + 0.199 Facebook + 0.007 Newspaper.
  • Rough interpretation: \(\beta_1=\) 0.199 indicates that if Facebook ads are increased (or decreased) by \(1\) unit, sales is expected to increase (or decrease) by 0.199 units.
  • Similarly, \(\beta_2=\) 0.007 means each additional unit of Newspaper ads changes expected sales by only 0.007 units, a negligible effect.
  • \(R^2=\) 0.333 indicates that around \(33.3\)% of the variation in sales can be explained by ads on Facebook and Newspaper together, which is not enough to be a good model!
  • A slight decrease to \(R^2_{\text{adj}}=\) 0.326 suggests that the information provided by the two variables is not redundant for explaining sales.
  • There are some large negative residuals, indicating that the model overestimated the actual target, especially at large sales values.

What’s next?

  • Use standardized inputs: \(\text{x}_i\to \tilde{\text{x}}_i=\frac{\text{x}_i-\overline{\text{x}}_i}{\hat{\sigma}_{\text{x}_i}}\)
  • Transform the target using, for example, the Box-Cox transformation (see the sketch after this list):

\(y_i\to \tilde{y}=\begin{cases}\frac{y_i^{\lambda}-1}{\lambda}&\text{if }\lambda\neq 0\\ \log(y_i)&\text{if }\lambda=0.\end{cases}\)

  • It can reduce outliers
  • Improve normality
  • Smooth out the target
  • More in the next chapter…
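A minimal sketch of both transformations, assuming df1 is loaded and scikit-learn/scipy are available:

Code
from sklearn.preprocessing import StandardScaler
from scipy.stats import boxcox

# Standardize each input column to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(df1[["youtube", "facebook", "newspaper"]])

# Box-Cox transform of the target; lambda is chosen by maximum likelihood
y_bc, lam = boxcox(df1["sales"])   # requires strictly positive target values
print(round(lam, 3))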

🥳 Yeahhhh…. 🥂

Any questions?