import pandas as pd              # Import pandas package
import seaborn as sns            # Package for beautiful graphs
import matplotlib.pyplot as plt  # Graph management
sns.set(style="whitegrid")       # Set grid background

path = "https://gist.githubusercontent.com/curran/4b59d1046d9e66f2787780ad51a1cd87/raw/9ec906b78a98cf300947a37b56cfe70d01183200/data.tsv"  # The data can be found at this link
df0 = pd.read_csv(path)          # Import it into Python
df0.head(5)                      # Show the first 5 rows
   eruptions  waiting
0      3.600       79
1      1.800       54
2      3.333       74
3      2.283       62
4      4.533       85
plt.figure(figsize=(5, 3.2))                           # Define figure size
sns.scatterplot(data=df0, x="waiting", y="eruptions")  # Create scatterplot
plt.title("Old Faithful data from Yellowstone National Park, US", fontsize=10)  # Axes title
plt.suptitle("Eruptions vs waiting times", fontsize=13, y=1)                    # Figure-level title
plt.show()
The longer the wait, the longer the duration of the eruption.
import plotly.express as px

fig = px.scatter_3d(df1, x="youtube", y="facebook", z="sales",
                    size="newspaper", color="newspaper", size_max=40)
camera = dict(eye=dict(x=1, y=-1, z=1.2))
fig.update_layout(title="Sales as a function of all ads",
                  width=550, height=350, scene_camera=camera)
fig.show()
Increasing ad spending seems to boost sales!
Our Motivational Quote Today
“Where there is data smoke, there is business fire.” — Thomas Redman
Predict \(y\) using only a single input \(\text{x}\in\mathbb{R}\).
Model: \(\underbrace{\hat{y}}_{\text{predicted eruption}}=\beta_0+\beta_1\underbrace{\text{x}}_{\text{waiting}}\) for \(\beta_0,\beta_1\in\mathbb{R}\).
Simple Linear Regression (SLR)
Residual Sum of Squares: \(\begin{align*}
\text{RSS}&=\sum_{i=1}^n(\color{red}{y_i-\hat{y}_i})^2\\
&=\sum_{i=1}^n(\color{red}{y_i-\beta_0-\beta_1x_i})^2
\end{align*}\)
Ordinary Least Squares (OLS): The best-fitted line minimizes the RSS.
Simple Linear Regression (SLR)
Optimal Least-Squares Line
Optimal line: \(\hat{y}=\hat{\beta}_0+\hat{\beta}_1\text{x}\) where
\[\begin{align}
\hat{\beta}_1&=\frac{\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)}{\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2}=\frac{\text{Cov}(X,Y)}{\text{V}(X)}\\
\hat{\beta}_0&=\overline{y}_n-\hat{\beta}_1\overline{\text{x}}_n,\end{align}
\] with
\(\overline{\text{x}}_n=\frac{1}{n}\sum_{i=1}^n\text{x}_i\) and \(\overline{y}_n=\frac{1}{n}\sum_{i=1}^ny_i\) are the averages (means) of \(X\) and \(Y\), respectively,
\(\text{Cov}(X,Y)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)\) is the “covariance” between \(X\) & \(Y\), and
\(\text{V}(X)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2\) is the “variance” of \(X\).
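As a quick sanity check, these closed-form estimates can be computed directly with NumPy. The snippet below is a minimal sketch that assumes the Old Faithful data frame df0 loaded earlier (columns waiting and eruptions); the variable names are illustrative.

import numpy as np

# Assumes df0 from earlier: columns "waiting" (input x) and "eruptions" (target y)
x = df0["waiting"].to_numpy()
y = df0["eruptions"].to_numpy()

x_bar, y_bar = x.mean(), y.mean()            # sample means
cov_xy = np.mean((x - x_bar) * (y - y_bar))  # "covariance" (1/n version, as above)
var_x = np.mean((x - x_bar) ** 2)            # "variance"   (1/n version, as above)

beta1_hat = cov_xy / var_x                   # slope estimate
beta0_hat = y_bar - beta1_hat * x_bar        # intercept estimate
print(f"y_hat = {beta0_hat:.3f} + {beta1_hat:.3f} * waiting")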
Simple Linear Regression (SLR)
Application to the marketing data
Simple Linear Regression (SLR)
Application to the marketing data (cont.)
from sklearn.linear_model import LinearRegression  # import model
import plotly.graph_objects as go
import numpy as np

lr = LinearRegression()                             # initiate model
x_train, y_train = df1[['youtube']], df1['sales']   # training input-target
lr = lr.fit(x_train, y_train)                       # build model = estimate coefficients

# Training data and fitted line
pred_train = lr.predict(x_train)

# Figures
fig_market2 = go.Figure(go.Scatter(x=x_train.youtube, y=y_train,
                                   mode="markers", name="Training data"))
fig_market2.add_trace(go.Scatter(x=x_train.youtube, y=pred_train,
                                 mode="lines+markers",
                                 name=f"<br>Train prediction<br> Sale={np.round(lr.coef_,2)[0]}youtube+{np.round(lr.intercept_,2)}"))
fig_market2.update_layout(title="Sales vs youtube",
                          xaxis=dict(title="youtube"), yaxis=dict(title="sales"),
                          width=600, height=400)
fig_market2.show()
Simple Linear Regression (SLR)
Model Diagnostics (judging the model)
R-squared (coefficient of determination): \[R^2=1-\frac{\text{RSS}}{\text{TSS}}=1-\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n(y_i-\overline{y}_n)^2}=\frac{\text{V}(\hat{Y})}{\text{V}(Y)}.\]
Example: \(R^2=\) 0.612 in our model.
Interpretation: The model (youtube) can capture around 61.2% of the variation of the target (sales).
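The same number can be reproduced directly from the definition. A minimal sketch, assuming the fitted model lr and the training data x_train, y_train from the SLR code above:

import numpy as np

# Assumes lr, x_train, y_train from the SLR code above
y_hat = lr.predict(x_train)                    # fitted values
rss = np.sum((y_train - y_hat) ** 2)           # residual sum of squares
tss = np.sum((y_train - y_train.mean()) ** 2)  # total sum of squares
r2 = 1 - rss / tss
print(round(r2, 3))                            # should agree with lr.score(x_train, y_train)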
Simple Linear Regression (SLR)
Model Diagnostics (judging the model)
Residuals: \(e_i=y_i-\hat{y}_i\sim{\cal N}(0,\sigma^2)\) for some \(\sigma>0\).
They should be symmetric around \(0\) and should NOT depend on \(\text{x}_i\) or \(y_i\).
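These assumptions are typically checked with a residuals-vs-fitted plot. A minimal sketch, again assuming lr, x_train, and y_train from the SLR code above:

import matplotlib.pyplot as plt

# Assumes lr, x_train, y_train from the SLR code above
y_hat = lr.predict(x_train)
residuals = y_train - y_hat                  # e_i = y_i - y_hat_i

plt.figure(figsize=(5, 3.2))
plt.scatter(y_hat, residuals, s=10)
plt.axhline(0, color="red", linestyle="--")  # residuals should scatter randomly around 0
plt.xlabel("Fitted sales")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values", fontsize=10)
plt.show()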
Coefficient \(\beta_1=\) 0.048 indicates that Sales is expected to increase (or decrease) by 0.048 units for every \(1\) unit increase (or decrease) in YouTube ads.
R-squared: Represents the proportion of the target’s variation captured by the model.
Residual: In a good model, the residuals should be random noise, indicating the model has captured most of the information from the target.
Marketing example:
The amount spent on YouTube ads alone can explain around \(61\)% (R-squared) of the variation in sales.
However, the residuals still contain patterns (large errors at small and large predicted sales), suggesting the model can be improved.
Multiple Linear Regression
Correlation Matrix
Pearson’s correlation coefficient
Correlation between two columns \(X_1\) and \(X_2\): \[r=r_{X_1,X_2}=\frac{\sum_{i=1}^n(x_{i1}-\overline{x}_{1})(x_{i2}-\overline{x}_{2})}{\sqrt{\left(\sum_{i=1}^n(x_{i1}-\overline{x}_{1})^2\right)\left(\sum_{i=1}^n(x_{i2}-\overline{x}_{2})^2\right)}}\]
\(-1\leq r\leq 1\) for any pair \(X_1\) and \(X_2\).
If \(r\approx 1\), then \(X_1\) and \(X_2\) are positively correlated (one ↗️, another ↗️).
If \(r\approx -1\), then \(X_1\) and \(X_2\) are negatively correlated (one ↗️, another ↘️).
If \(r\approx 0\), then \(X_1\) and \(X_2\) are uncorrelated (no clear linear relationship).
It helps identify informative/useful inputs for building models.
It also helps identify redundant (strongly correlated) inputs.
Note: Correlation does not imply causation; it only indicates a relationship, not a cause-and-effect link.
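As a quick numerical check of the formula, Pearson's \(r\) can be computed by hand and compared with NumPy's built-in version. A small sketch, assuming the marketing data frame df1 used earlier:

import numpy as np

# Assumes df1 from earlier: marketing data with columns youtube, facebook, newspaper, sales
x1 = df1["youtube"].to_numpy()
x2 = df1["sales"].to_numpy()

num = np.sum((x1 - x1.mean()) * (x2 - x2.mean()))
den = np.sqrt(np.sum((x1 - x1.mean()) ** 2) * np.sum((x2 - x2.mean()) ** 2))
r_manual = num / den

print(round(r_manual, 6))                    # formula above
print(round(np.corrcoef(x1, x2)[0, 1], 6))   # NumPy built-in; the two should agree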
Correlation matrix
Examples
Correlation Matrix
Example: marketing data
cor = df1.corr()                 # df1 is the marketing data
cor.style.background_gradient()
           youtube  facebook  newspaper     sales
youtube   1.000000  0.054809   0.056648  0.782224
facebook  0.054809  1.000000   0.354104  0.576223
newspaper 0.056648  0.354104   1.000000  0.228299
sales     0.782224  0.576223   0.228299  1.000000
YouTube is strongly correlated with the target sales and is the most useful for building models, followed by Facebook and Newspaper.
Facebook and Newspaper have a significantly larger correlation with each other than with YouTube.
Multiple Linear Regression (MLR)
Predict \(y\) using more than one input \(\text{x}\in\mathbb{R}^d\).
Model: \(\underbrace{\hat{y}}_{\text{Sales}}=\beta_0+\beta_1\underbrace{x_1}_{\text{FB}}+\beta_2\underbrace{x_2}_{\text{NP}}\) for \(\beta_0,\beta_1,\beta_2\in\mathbb{R}\).
A better criterion, Adjusted R-squared: \[R^2_{\text{adj}}=1-\frac{n-1}{n-d-1}(1-R^2).\] Here, \(n\) is the number of observations and \(d\) is the number of inputs.
For our built model: \(R^2=\) 0.333 and \(R^2_{\text{adj}}=\) 0.326 (not so good!).
\(R^2_{\text{adj}}\) is always smaller than \(R^2\). A large \(R^2\) with only a slight decrease in \(R^2_{\text{adj}}\) indicates a good MLR model.
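Both quantities can be reproduced with a few lines. A minimal sketch, assuming the marketing data frame df1 and the Facebook + Newspaper model from this section:

from sklearn.linear_model import LinearRegression

# Assumes df1: marketing data with columns facebook, newspaper, sales
X_mlr = df1[["facebook", "newspaper"]]
y_mlr = df1["sales"]

mlr = LinearRegression().fit(X_mlr, y_mlr)
r2 = mlr.score(X_mlr, y_mlr)                   # plain R-squared

n, d = X_mlr.shape                             # n observations, d inputs
r2_adj = 1 - (n - 1) / (n - d - 1) * (1 - r2)  # adjusted R-squared, as defined above
print(round(r2, 3), round(r2_adj, 3))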
Rough interpretation: \(\beta_1=\) 0.199 indicates that if Facebook ads are increased (or decreased) by \(1\) unit, sales are expected to increase (or decrease) by 0.199 units.
Exercise: interpret \(\beta_2=\) 0.007 in the same way.
\(R^2=\) 0.333 indicates that around \(33.3\)% of the variation in sales can be explained by ads on Facebook and Newspaper together, which is not enough for a good model!
A slight decrease in \(R^2_{\text{adj}}=\) 0.326 suggests that the information provided by both variables is not redundant for explaining sales.
There are some large negative residuals, indicating that the model overestimates some actual targets, especially at large sales values.
What’s next?
Use standardized inputs: \(\text{x}_i\to \tilde{\text{x}}_i=\frac{\text{x}_i-\overline{\text{x}}_i}{\hat{\sigma}_{\text{x}_i}}\)
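In scikit-learn this standardization is usually done with StandardScaler. A minimal sketch on the marketing inputs (assuming df1):

from sklearn.preprocessing import StandardScaler

# Assumes df1: marketing data
X = df1[["youtube", "facebook", "newspaper"]]

scaler = StandardScaler()              # subtracts each column's mean and divides by its std
X_std = scaler.fit_transform(X)        # every column now has mean ~0 and std ~1

print(X_std.mean(axis=0).round(3))
print(X_std.std(axis=0).round(3))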
Overfitting happens when a model learns the training data too well, capturing noise and fluctuations rather than the underlying pattern.
It fits the training data almost perfectly, but fails to generalize to new, unseen data.
Complex models (e.g., high-degree polynomial features) often overfit the data.
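A quick way to see this is to compare training and test performance of a deliberately flexible model. The sketch below is illustrative only; it assumes the marketing data df1, and the polynomial degree and train/test split are arbitrary choices:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumes df1: marketing data with columns youtube and sales
X, y = df1[["youtube"]], df1["sales"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale, expand to a high-degree polynomial, then fit ordinary least squares
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=10), LinearRegression())
model.fit(X_tr, y_tr)

print("train R^2:", round(model.score(X_tr, y_tr), 3))  # tends to look optimistic
print("test  R^2:", round(model.score(X_te, y_te), 3))  # usually noticeably lower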
Overcoming overfitting
\(K\)-fold Cross-Validation
It assesses whether the model performs well on different subsets of the data.
The most common technique to overcome overfitting.
Tuning Polynomial Degree Using \(K\)-fold Cross-Validation
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import cross_val_score

# Data
X, y = market[["youtube"]], market['sales']

# List of all degrees to search over
degree = list(range(1, 11))

# List to store all losses
loss = []
for deg in degree:
    pf = PolynomialFeatures(degree=deg)
    X_poly = pf.fit_transform(X)
    model = LR()
    score = -cross_val_score(model, X_poly, y, cv=5,
                             scoring='neg_mean_squared_error').mean()
    loss.append(score)
Overcoming overfitting
Regularization
Another approach is to control the magnitude of the coefficients.
It often works well for SLR, MLR and Polynomial Regression…
Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\): \[{\cal L}_{\text{ridge}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}\beta_j^2}_{\text{Magnitude}}}.\]
Recall: SLR & MLR seek to minimize only RSS.
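To see the effect of the penalty, one can compare plain OLS and Ridge coefficients on the same polynomial features. A minimal sketch, assuming the market data used in the cross-validation code above and an arbitrary illustrative \(\alpha\):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

# Assumes market: marketing data, as in the cross-validation code above
X, y = market[["youtube"]], market["sales"]
X_poly = PolynomialFeatures(degree=8).fit_transform(X)

ols = LinearRegression().fit(X_poly, y)
ridge = Ridge(alpha=10.0).fit(X_poly, y)          # alpha chosen arbitrarily for illustration

# The penalty typically shrinks the overall coefficient magnitude
print("OLS   ||beta||:", round(np.linalg.norm(ols.coef_), 3))
print("Ridge ||beta||:", round(np.linalg.norm(ridge.coef_), 3))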
Overcoming overfitting
Regularization: Ridge Regression
Large \(\color{green}{\alpha}\Rightarrow\) strong penalty \(\Rightarrow\) small \(\vec{\beta}\).
Small \(\color{green}{\alpha}\Rightarrow\) weak penalty \(\Rightarrow\) freer \(\vec{\beta}\).
🔑 Objective: Learn the best \(\color{green}{\alpha}>0\).
How to find a suitable regularization strength \(\color{green}{\alpha}\)?
Overcoming overfitting
Regularization: Ridge Regression
Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Data
X, y = market[["youtube"]], market['sales']
poly = PolynomialFeatures(degree=8)
X_poly = poly.fit_transform(X)

# List of all regularization strengths to search over
alphas = list(np.linspace(0.01, 3, 30)) + list(np.linspace(3.1, 20000, 30))

# Lists to store all losses and coefficients
loss = []
coefficients = {f'alpha={alpha}': [] for alpha in alphas}
for alp in alphas:
    model = Ridge(alpha=alp)
    score = -cross_val_score(model, X_poly, y, cv=5,
                             scoring='neg_mean_squared_error').mean()
    loss.append(score)
    # Fit on the full data to record the coefficients
    model.fit(X_poly, y)
    coefficients[f'alpha={alp}'] = model.coef_
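The resulting losses are typically inspected on a plot of CV error against \(\color{green}{\alpha}\). A minimal plotting sketch, assuming alphas and loss from the loop above:

import matplotlib.pyplot as plt

# Assumes alphas and loss from the Ridge tuning loop above
plt.figure(figsize=(5, 3.2))
plt.plot(alphas, loss, marker=".")
plt.xscale("log")                      # the alphas span several orders of magnitude
plt.xlabel("alpha (log scale)")
plt.ylabel("5-fold CV mean squared error")
plt.title("Ridge: CV loss vs regularization strength", fontsize=10)
plt.show()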
Overcoming overfitting
Regularization: Ridge Regression
Pros
It works well when there are inputs that are approximately linearly related to the target.
It helps stabilize the estimates when inputs are highly correlated.
It can prevent overfitting.
It is effective when the number of inputs exceeds the number of observations.
Cons
It does not work well when the input-output relationships are highly non-linear.
It may introduce bias into the coefficient estimates.
Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\): \[{\cal L}_{\text{lasso}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}|\beta_j|}_{\text{Magnitude}}}.\]
Overcoming overfitting
Regularization: Lasso Regression
Large \(\color{green}{\alpha}\Rightarrow\) strong penalty \(\Rightarrow\) small \(\vec{\beta}\).
Small \(\color{green}{\alpha}\Rightarrow\) weak penalty \(\Rightarrow\) freer \(\vec{\beta}\).
🔑 Objective: Learn the best \(\color{green}{\alpha}>0\).
Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation
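A sketch of how this tuning could look for Lasso, mirroring the Ridge loop above; it assumes the same market data and a degree-8 polynomial expansion, with scaling added here as an extra, assumed preprocessing step to keep the penalty comparable across features:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumes market: marketing data, as in the Ridge tuning code above
X, y = market[["youtube"]], market["sales"]

alphas = np.logspace(-3, 2, 30)        # candidate regularization strengths
loss = []
for alp in alphas:
    model = make_pipeline(PolynomialFeatures(degree=8, include_bias=False),
                          StandardScaler(),
                          Lasso(alpha=alp, max_iter=50_000))
    score = -cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error").mean()
    loss.append(score)

best_alpha = alphas[int(np.argmin(loss))]
print("best alpha:", round(float(best_alpha), 4))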
Overcoming overfitting
Regularization: Lasso Regression
Pros
Lasso inherently performs feature selection as the regularization parameter \(\alpha\) increases (the coefficients of less important variables are forced to exactly \(0\)).
It works well when there are many inputs (high-dimensional data) and only some are highly correlated with the target.
It can handle collinearities (many redundant inputs).
It can prevent overfitting and offers high interpretability.
Cons
It does not work well when the input-output relationships are highly non-linear.
It may introduce bias into the coefficient estimates.
It is sensitive to the scale of the data, so proper scaling of predictors is crucial before applying the method.