Linear Regression


ITM 390 004: Machine Learning

Lecturer: Dr. Sothea HAS

Outline

  • Motivation

  • Preprocessing & Exploratory Data Analysis

  • Simple Linear Regression

  • Multiple Linear Regression

Motivation

Motivation

Auto-MPG Dataset (398, 9)

Code
import pandas as pd                 # Import pandas package
import numpy as np
import seaborn as sns               # Package for beautiful graphs
import matplotlib.pyplot as plt     # Graph management
sns.set(style="whitegrid")          # Set grid 
data = pd.read_csv(path2 + "auto-mpg.csv")   # Import the dataset into Python
data.head(5)                        # Display the first 5 rows
mpg cylinders displacement horsepower weight acceleration model year origin car name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino

  • mpg: Fuel efficiency (miles per gallon).
  • cylinders: Number of engine cylinders.
  • displacement: Engine displacement (cubic inches).
  • acceleration: Time to accelerate from 0 to 60 mph (seconds).
  • origin: 1 = USA, 2 = Europe, 3 = Asia.

Motivation

Auto-MPG Dataset (398, 9)


  • What factors affect fuel efficiency the most?
  • How much can we predict fuel efficiency of the cars using their characteristics?

Preprocessing & Exploratory Data Analysis (EDA)

Preprocessing

Auto-MPG Dataset (398, 9)

Column types

mpg: float64, cylinders: int64, displacement: float64, horsepower: object, weight: int64, acceleration: float64, model year: int64, origin: int64, car name: object
  • Q1: Is there anything wrong with column type?
  • A1: Two main problems:
    • origin is qualitative, therefore should be “category/object”.
    • ⚠️ horsepower is quantitative, therefore should be “float/int”.
  • Modifying the data types (a minimal conversion sketch is given below):
mpg: float64, cylinders: int64, displacement: float64, horsepower: int64, weight: int64, acceleration: float64, model year: int64, origin: category, car name: object

⚠️ When a quantitative column is encoded as qualitative, missing or inconsistent values may be present.
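A minimal conversion sketch, assuming (as in the UCI release of this dataset) that horsepower marks missing entries with a non-numeric placeholder; errors="coerce" turns such entries into NaN:

Code
data["horsepower"] = pd.to_numeric(data["horsepower"], errors="coerce")  # non-numeric entries become NaN
data = data.dropna(subset=["horsepower"])                 # drop the problematic rows (392 remain)
data["horsepower"] = data["horsepower"].astype("int64")   # back to integer, as in the table above
data["origin"] = data["origin"].astype("category")        # qualitative column
print(data.dtypes)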

EDA

Auto-MPG Dataset (398, 9)

Univariate analysis: Statistical summary

Code
data.describe().T.drop(columns=['count', '25%', '75%'])
df_car = data

EDA

Auto-MPG Dataset (398, 9)

Univariate analysis: Visualization

Code
quan_vars = data.select_dtypes(include="number").columns
fig, axs = plt.subplots(2, 4, figsize=(10,4.75))
for i, va in enumerate(data.columns):
    if va in quan_vars:
        sns.histplot(data, x=va, kde=True, ax=axs[i//4, i%4], stat="proportion")
    else:
        if va != "car name":
            sns.countplot(data, x=va, ax=axs[i//4, i%4], stat="proportion")
            axs[i//4, i%4].bar_label(axs[i//4, i%4].containers[0], fmt="%.2f")
plt.tight_layout()
plt.show()

Bivariate analysis: Correlation matrix

Code
pair_grid = sns.PairGrid(data=data[quan_vars], height=0.9, aspect=2)

# Map plots to the lower triangle only
pair_grid.map_lower(sns.scatterplot)  # Scatterplots in the lower triangle
pair_grid.map_diag(sns.histplot)      # Histograms on the diagonal

# pair_plot = sns.pairplot(data=data[quan_vars], height=0.9, aspect=2.5)
def corr_func(x, y, **kws): 
    r1 = np.corrcoef(x, y)[0, 1]
    plt.gca().annotate(f"{r1:.2f}", xy=(0.5, 0.5), 
                       xycoords='axes fraction', 
                       ha='center', fontsize=30, color='#1d69d1')

pair_grid.map_upper(corr_func)
for ax in pair_grid.axes[:, 0]:  # Access the first column of axes (y-axis labels)
    ax.set_ylabel(ax.get_ylabel(), rotation=45, labelpad=20)
plt.tight_layout()
plt.show()

Bivariate analysis: Visualization

  • Does fuel efficiency depend on the origin?
Code
_, axs = plt.subplots(1, 1, figsize=(8, 5))
sns.boxplot(data=data, x="origin", y="mpg", hue="origin", ax=axs)
plt.tight_layout()
plt.show()

Preprocessing & EDA

Summary

  • Weight shows the strongest negative correlation with mpg, followed by displacement, cylinders, and horsepower. These variables are significant in explaining variations in mpg.

  • These features are also highly correlated with each other, suggesting potential redundancy when included together in a predictive model.

  • Despite being a categorical variable, origin proves to be valuable for predicting mpg.

Simple Linear Regression (SLR)

Simple Linear Regression (SLR)

mpg vs weight

Code
data[['mpg', 'weight']].head(3)
mpg weight
0 18.0 3504
1 15.0 3693
2 18.0 3436
Code
import plotly.express as px
fig = px.scatter(data, x="weight", y="mpg", hover_name="car name")
fig.update_layout(title="mpg vs weight", height=290, width=450)
fig.show()
  • Simple Linear Model: \[\text{(prediction)}:\quad\widehat{\text{mpg}}_i=\color{blue}{a}\text{weight}_i+\color{blue}{b},\] for some \(\color{blue}{a},\color{blue}{b}\in\mathbb{R}\) to be chosen so that \(\color{red}{\widehat{\text{mpg}}_i\approx \text{mpg}_i}\) for all \(i=1,...,n.\)
  • In general, \(\hat{y}_i=\color{blue}{a}\text{x}_i+\color{blue}{b}\), with coefficients \(\color{blue}{a},\color{blue}{b}\), and observed data \((y_i,\text{x}_i),i=1,...,n\).

  • Objective: Find the best \(\color{blue}{a}\) and \(\color{blue}{b}\) so that (prediction) \(\color{red}{\hat{y}_i\approx y_i}\) (reality) for all \(i\).

  • What does \(\color{red}{\hat{y}_i\approx y_i}\) mean?

Simple Linear Regression (SLR)

mpg vs weight

  • What does \(\color{red}{\hat{y}_i\approx y_i}\) mean?
  • Q2: For \(y_0=20.312\), which one is the best prediction among: \(\color{red}{\hat{y}_0=18.2, 21.5}\) and \(\color{red}{19.73}\)?
  • A2: Consider the residuals:
\(\color{red}{\hat{y}_0}\) \(\color{red}{18.2}\) \(\color{red}{21.5}\) \(\color{blue}{19.73}\)
\(\color{red}{e_0=y_0-\hat{y}_0}\) \(\color{red}{2.112}\) \(\color{red}{-1.188}\) \(\color{blue}{0.582}\)
\(\color{red}{|e_0|}\) \(\color{red}{2.112}\) \(\color{red}{1.188}\) \(\color{blue}{0.582}\)
\(\color{red}{e_0^2}\) \(\color{red}{4.46}\) \(\color{red}{1.41}\) \(\color{blue}{0.34}\)

🔑 Small residual = good prediction.
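The table above can be reproduced in a few lines of NumPy (values taken from Q2):

Code
y0 = 20.312                              # observed value
y_hat0 = np.array([18.2, 21.5, 19.73])   # three candidate predictions
e0 = y0 - y_hat0                         # residuals
print(e0)           # [ 2.112 -1.188  0.582]
print(np.abs(e0))   # [2.112 1.188 0.582]
print(e0 ** 2)      # [4.46 1.41 0.34] (rounded)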

Simple Linear Regression (SLR)

mpg vs weight

Code
# Linear Regression
from sklearn.linear_model import LinearRegression
import plotly.graph_objects as go
lr = LinearRegression()
x_w, y_mpg = data[['weight']], data[['mpg']]
lr.fit(x_w, y_mpg)
a, b = lr.coef_[0][0], lr.intercept_[0]

x_w, y_mpg = data[['weight']].to_numpy(), data[['mpg']].to_numpy()
# Generate coefficients list for different line fits
coef_list = a * np.array([4, 0.5, 0.05, 2, 1.0, 0.25, 3])

x_min, x_max = np.min(x_w), np.max(x_w)
y_min, y_max = np.min(y_mpg), np.max(y_mpg)
x_fit = np.linspace(x_min * 0.8, x_max* 1.2, 2).reshape(-1, 1)

idx = 100
x_line = np.repeat(x_w[idx],2)
# Create frames for the different line fits

frames = []
for coef in coef_list:
    y_fit = x_fit * coef + b
    y_line = np.array([y_mpg[idx][0], x_w[idx][0] * coef + b])
    y_pred = x_w.flatten() * coef + b
    rss = np.sum((y_mpg.flatten()-y_pred) ** 2)
    frames.append(go.Frame(
        data=[go.Scatter(x=x_w.flatten(), y=y_mpg.flatten(), mode='markers', name='mpg vs weight', marker=dict(size=10)),
              go.Scatter(x=x_line.flatten(), y=y_line.flatten(), mode='lines+markers', name='Residual', line=dict(color="red", dash='dash'), visible="legendonly"),
              go.Scatter(x=x_fit.flatten(), y=y_fit.flatten(), mode='lines', line=dict(color="#b6531a"),
                         name='<br>y={:.3f}x+{:.3f}<br>RSS={:.3f}'.format(np.round(coef, 3), np.round(b, 3), np.round(rss,2)))],
        name=f'{np.round(coef, 3)}'
    ))

y_line = np.array([y_mpg[idx][0], x_w[idx][0] * coef_list[0]+ b])
y_pred0 = x_w.flatten() * coef_list[0] + b
rss0 = np.sum((y_mpg.flatten()-y_pred0) ** 2)

fig1 = go.Figure(
    data=[
        go.Scatter(x=x_w.flatten(), y=y_mpg.flatten(), mode='markers', name='mpg vs weight', marker=dict(size=10)),
        go.Scatter(x=x_line.flatten(), y=y_line.flatten(), mode='lines+markers', name='Residual', line=dict(color="red", dash='dash'), visible="legendonly"),
        go.Scatter(x=x_fit.flatten(), y=x_fit.flatten()* coef_list[0]+b, mode='lines', line=dict(color="#b6531a"),
                   name=f'<br>y={np.round(coef_list[0], 3)}x+{np.round(b, 3)}<br>RSS={np.round(rss0,2)}')
    ],
    layout=go.Layout(
        title="MPG vs Weight",
        xaxis=dict(title="Weight", range=[x_min*0.8, x_max*1.1]),
        yaxis=dict(title="MPG", range=[y_min*0.6, y_max*1.1]),
        updatemenus=[{
            "buttons": [
                {
                    "args": [None, {"frame": {"duration": 1000, "redraw": True}, "fromcurrent": True, "mode": "immediate"}],
                    "label": "Play",
                    "method": "animate"
                },
                {
                    "args": [[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate"}],
                    "label": "Stop",
                    "method": "animate"
                }
            ],
            "type": "buttons",
            "showactive": False,
            "x": -0.1,
            "y": 1.25,
            "pad": {"r": 11, "t": 50}
        }],
        sliders=[{
            "active": 0,
            "currentvalue": {"prefix": "Coefficient: "},
            "pad": {"t": 50},
            "steps": [{"label": f"{np.round(coef, 3)}",
                       "method": "animate",
                       "args": [[f'{np.round(coef, 3)}'], {"frame": {"duration": 1000, "redraw": True}, "mode": "immediate", 
                       "transition": {"duration": 10}}]}
                      for coef in coef_list]
        }]
    ),
    frames=frames
)
fig1.update_layout(title="Mpg vs weight", height=480, width=500)
fig1.show()
  • Residual Sum of Squares (RSS): \[\begin{align*}\color{red}{\text{RSS}=\sum_{i=1}^n e_i^2}&=\color{red}{\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2}\\ &=\color{red}{\sum_{i=1}^n(y_i}-\color{blue}{a}\text{x}_i-\color{blue}{b}\color{red}{)^2}.\end{align*}\]

  • Roughly, \(\color{red}{\text{RSS}}\) is the sum of the squared lengths of all the dashed lines (the residuals).

  • Objective: Find the coefficients \((\color{blue}{a,b})\) that produce the smallest \(\color{red}{\text{RSS}}\).

  • Can you spot the best fitted line 😎?

Simple Linear Regression (SLR)

mpg vs weight

Code
# Linear Regression

from plotly.subplots import make_subplots
lr = LinearRegression()

# Surface of the loss: RSS evaluated over a grid of (a, b) values.
# Note: the RSS helper and the grids below are assumed, as they are not defined elsewhere.
def RSS(x, y, a_, b_):
    return np.sum((y.flatten() - (a_ * x.flatten() + b_)) ** 2)

a_grid = np.linspace(4 * a, 0.05 * a, 30)    # slopes spanning the animated coefficients
b_grid = np.linspace(0.5 * b, 1.5 * b, 30)   # intercepts around the fitted b
sur_loss = np.array([[RSS(x_w, y_mpg, a_, b_) for a_ in a_grid] for b_ in b_grid])

fig_surf = make_subplots(rows=1, cols=2,
            specs=[[{'type': 'xy'}, {'type': 'surface'}]],
            subplot_titles=('Fitted Line', 'Loss surface as a function of (a,b)'))

y_line = np.array([y_mpg[idx][0], x_w[idx][0] * coef_list[0]+ b])
y_pred0 = x_w.flatten() * coef_list[0] + b
rss0 = np.sum((y_mpg.flatten()-y_pred0) ** 2)

fig_surf.add_trace(go.Scatter(x=x_w.flatten(), y=y_mpg.flatten(),
    mode='markers', name='mpg vs weight', marker=dict(size=10)), row=1, col=1)
fig_surf.add_trace(go.Scatter(x=x_line.flatten(), y=y_line.flatten(), 
    mode='lines+markers', name='Residual', line=dict(color="red", dash='dash'), visible="legendonly"), row=1, col=1)
fig_surf.add_trace(go.Scatter(x=x_fit.flatten(), y=x_fit.flatten()*coef_list[0]+b, 
    mode='lines', line=dict(color="#b6531a"),
    name=f'<br>y={np.round(coef_list[0], 3)}x+{np.round(b, 3)}<br>RSS={np.round(rss0,2)}'), row=1, col=1)

fig_surf.add_trace(go.Scatter3d(
    x=[coef_list[0]] * 2, 
    y=[b] * 2,
    z=[0, rss0],
    mode="markers+lines", marker=dict(color="red", size=6),
    line=dict(dash="dash", color="red"),
    name="Loss value"), row=1, col=2)

fig_surf.add_trace(go.Surface(
    x=a_grid, 
    y=b_grid,
    z=sur_loss,
    showscale=False,
    opacity=0.3,
    name="Loss surface"), row=1, col=2)

frames_loss = []
for coef in coef_list[1:]:
    y_fit = x_fit * coef + b
    y_line = np.array([y_mpg[idx][0], x_w[idx][0] * coef + b])
    y_pred = x_w.flatten() * coef + b
    rss = np.sum((y_mpg.flatten()-y_pred) ** 2)
    frames_loss.append(
        go.Frame(
            data=[
                go.Scatter(x=x_w.flatten(), y=y_mpg.flatten(), mode='markers', 
                    name='mpg vs weight', marker=dict(size=10)),
                go.Scatter(x=x_line.flatten(), y=y_line.flatten(), mode='lines+markers', 
                    name='Residual', line=dict(color="red", dash='dash'), visible="legendonly"),
                go.Scatter(x=x_fit.flatten(), y=y_fit.flatten(), mode='lines', 
                    line=dict(color="#b6531a"), name='<br>y={:.3f}x+{:.3f}<br>RSS={:.3f}'.format(np.round(coef, 3), np.round(b, 3), np.round(rss,2))),
                go.Scatter3d(
                    x=[coef] * 2, 
                    y=[b] * 2,
                    z=[0, rss],
                    mode="markers+lines", marker=dict(color="red", size=6),
                    line=dict(dash="dash", color="red"),
                    name="Loss value"),
                go.Surface(
                    x=a_grid, 
                    y=b_grid,
                    z=sur_loss,
                    showscale=False,
                    opacity=0.3,
                    name='<br>y={:.3f}x+{:.3f}<br>RSS={:.3f}'.format(np.round(coef, 3), np.round(b, 3), np.round(rss,2)))],
            name=f'{np.round(coef, 3)}'))

fig_surf.update_layout(
        title="Loss function at different coefficients (a,b)",
        height=480,
        updatemenus=[{
            "buttons": [
                {
                    "args": [None, {"frame": {"duration": 1000, "redraw": True}, "fromcurrent": True, "mode": "immediate"}],
                    "label": "Play",
                    "method": "animate"
                },
                {
                    "args": [[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate"}],
                    "label": "Stop",
                    "method": "animate"
                }
            ],
            "type": "buttons",
            "showactive": False,
            "x": -0.1,
            "y": 1.25,
            "pad": {"r": 11, "t": 50}
        }],
        sliders=[{
            "active": 0,
            "currentvalue": {"prefix": "Coefficient: "},
            "pad": {"t": 50},
            "steps": [{"label": f"{np.round(coef, 3)}",
                       "method": "animate",
                       "args": [[f'{np.round(coef, 3)}'], {"frame": {"duration": 1000, "redraw": True}, "mode": "immediate", 
                       "transition": {"duration": 10}}]}
                      for coef in coef_list[1:]]
        }]
    )
fig_surf.frames = frames_loss
fig_surf.update_xaxes(range=[x_min*0.8, x_max*1.1], title="Weight", row=1, col=1)
fig_surf.update_yaxes(range=[y_min*0.6, y_max*1.1], title="MPG", row=1, col=1)
fig_surf.update_scenes(
    xaxis_range=[np.min(a_grid), np.max(a_grid)],
    yaxis_range=[np.min(b_grid), np.max(b_grid)],
    zaxis_range=[0, rss0],
    # camera_eye=dict(x=1.5, y=1.5, z=1),
    # aspectmode='cube',
    row=1, col=2  # This references the first scene
)

fig_surf.show()

Simple Linear Regression (SLR)

mpg vs weight


Optimal Least-Square Line

  • The best fitted line: \(\hat{y}_i=\color{blue}{\hat{a}}\text{x}_i+\color{blue}{\hat{b}}\) where

\[\begin{align} \hat{\color{blue}{a}}&=\frac{\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)}{\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2}=\frac{\text{Cov}(X,Y)}{\text{V}(X)}\\ \hat{\color{blue}{b}}&=\overline{y}_n-\hat{\color{blue}{a}}\overline{\text{x}}_n,\quad\text{with}\end{align} \]

  • \(\overline{\text{x}}_n=\frac{1}{n}\sum_{i=1}^n\text{x}_i\) and \(\overline{y}_n=\frac{1}{n}\sum_{i=1}^ny_i\): the average/mean of \(X\) and \(Y\) respectively.
  • \(\text{Cov}(X,Y)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)\): the “covariance” between \(X\) & \(Y\).
  • \(\text{V}(X)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2\): the “variance” of \(X\).
  • Our example: \((\color{blue}{\hat{a}},\color{blue}{\hat{b}})=\) (-0.01, 46.22).
  • Interpretation: if weight increases by \(1\) unit, mpg is expected to decrease by around \(0.01\) unit.
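A quick sanity check of these formulas against the data (a sketch; it should reproduce the coefficients found above):

Code
x = data['weight'].to_numpy()
y = data['mpg'].to_numpy()
a_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # Cov(X, Y) / V(X)
b_hat = y.mean() - a_hat * x.mean()                 # b = ybar - a * xbar
print(round(a_hat, 4), round(b_hat, 2))             # approx. -0.0076 and 46.22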

Simple Linear Regression (SLR)

Model Diagnostics (judging the model)

R-squared (coefficient of determination)

\[R^2=1-\frac{\text{RSS}}{\text{TSS}}=1-\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n(y_i-\overline{y}_n)^2}=\frac{\color{red}{\text{V}(\hat{Y})}}{\color{blue}{\text{V}(Y)}}.\]


  • We always have \(0\leq R^2\leq 1\).
  • Example: For mpg vs weight: \(R^2=\) 0.693.
  • Interpretation: The model (weight) can explain around 69.3% of the variation of the target (mpg).
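R-squared can be computed directly from its definition (a self-contained sketch that refits the SLR model):

Code
y = data['mpg'].to_numpy()
slr = LinearRegression().fit(data[['weight']], y)   # refit the SLR model
y_hat = slr.predict(data[['weight']])               # fitted values
rss = np.sum((y - y_hat) ** 2)                      # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)                   # total sum of squares
print(1 - rss / tss)                                # approx. 0.693, same as slr.score(data[['weight']], y)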

Simple Linear Regression (SLR)

Model Diagnostics (judging the model)

Residual Analysis

  • Residuals: in a good model, \(\color{red}{e_i=y_i-\hat{y}_i}\sim{\cal N}(0,\sigma^2)\) for some \(\sigma>0\), i.e., they are
    symmetric around \(0\) & DO NOT DEPEND ON \(\text{x}_i\) nor \(y_i\).
Code
y_pred = a * data['weight'].to_numpy() + b   # fitted values (a, b estimated above)
res = data['mpg'].to_numpy() - y_pred        # compute residuals e_i = y_i - y_hat_i

from plotly.subplots import make_subplots
fig_res = make_subplots(rows=1, cols=2, 
    subplot_titles=("Residuals vs predicted mpg", 
                    "Residual desity"))
fig_res.add_trace(
    go.Scatter(x=y_pred.flatten(), y=res.flatten(), name="Residuals", mode="markers"), 
    row=1, col=1)
fig_res.add_trace(
    go.Scatter(x=[np.min(y_pred.flatten()), np.max(y_pred.flatten())], 
    y=[0,0], mode="lines", line=dict(color='red', dash="dash"), name="0"), 
    row=1, col=1)

fig_res.update_xaxes(title_text="Predicted MPG", row=1, col=1)
fig_res.update_yaxes(title_text="Residuals", row=1, col=1)


fig_res.add_trace(
    go.Histogram(x=res, name = "Residual histogram"), row=1, col=2
)
fig_res.update_xaxes(title_text="Residual", row=1, col=2)
fig_res.update_yaxes(title_text="Histogram", row=1, col=2)

fig_res.update_layout(width=950, height=300)
fig_res.show()

Simple Linear Regression (SLR)

T-test of Significance of Coefficient

  • The estimated coefficient \(\color{blue}{\hat{a}}\) and \(\color{blue}{\hat{b}}\) are computed based on a sample of data.
  • How can we be sure that the linear relation between \(\text{x}\) and \(y\) truly exists, i.e., \(y=\color{blue}{a}\text{x}+\color{blue}{b}\) with \(a\neq 0\)?
  • This is equivalent to testing \(H_0: \color{blue}{a}=0\) against \(H_1: \color{blue}{a}\neq 0\).
  • If \(n\) is large enough (\(n>30\)) or the residuals are Gaussian, then under \(H_0\) we have \(\color{blue}{t}=\frac{\color{blue}{\hat{a}}}{s_{\color{blue}{\hat{a}}}}\sim{\cal T}(n-2)\), where \(s_{\color{blue}{\hat{a}}}\) is the standard error of \(\color{blue}{\hat{a}}\).
  • Given \(0\leq\color{red}{\alpha}\leq 1\), let \(\color{red}{t_{\alpha/2}}\) be the critical value of the t-distribution satisfying \(\mathbb{P}(|{\cal T}(n-2)|\geq \color{red}{t_{\alpha/2}})=\color{red}{\alpha}\):
    • We can reject \(H_0\) if \(|\color{blue}{t}|\geq \color{red}{t_{\alpha/2}}\) (a linear relation between \(\text{x}\) & \(y\) truly exists) at confidence level \(1-\color{red}{\alpha}\).
    • Else, we cannot reject \(H_0\) (not enough evidence to support a linear relationship between \(y\) & \(\text{x}\)).
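A sketch of this test for the slope of mpg vs weight, using the classical standard-error formula (a and b are the fitted coefficients from earlier; scipy is assumed available):

Code
from scipy import stats

x = data['weight'].to_numpy()
y = data['mpg'].to_numpy()
n = len(y)
resid = y - (a * x + b)                             # residuals of the fitted line
s2 = np.sum(resid ** 2) / (n - 2)                   # residual variance estimate
s_a = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))     # standard error of a-hat
t_stat = a / s_a                                    # t-statistic
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)  # two-sided p-value
print(t_stat, p_value)                              # approx. -29.6 and ~0, matching the OLS table below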

Simple Linear Regression (SLR)

\(t\)-test for Coefficient

import statsmodels.api as sm
model = sm.OLS(data['mpg'], sm.add_constant(data[['weight']]))
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.693
Model:                            OLS   Adj. R-squared:                  0.692
Method:                 Least Squares   F-statistic:                     878.8
Date:                Mon, 08 Sep 2025   Prob (F-statistic):          6.02e-102
Time:                        22:14:42   Log-Likelihood:                -1130.0
No. Observations:                 392   AIC:                             2264.
Df Residuals:                     390   BIC:                             2272.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         46.2165      0.799     57.867      0.000      44.646      47.787
weight        -0.0076      0.000    -29.645      0.000      -0.008      -0.007
==============================================================================
Omnibus:                       41.682   Durbin-Watson:                   0.808
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               60.039
Skew:                           0.727   Prob(JB):                     9.18e-14
Kurtosis:                       4.251   Cond. No.                     1.13e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Simple Linear Regression (SLR)

Summary

  • Obtained model: mpg = -0.008\(\times\)weight + 46.217.
  • As we already rejected \(H_0:\color{blue}{a}=0\), the coefficient \(\color{blue}{\hat{a}}=\) -0.008 can be interpreted as follows: mpg is expected to decrease (or increase) by 0.008 units for every \(1\) unit increase (or decrease) in car weight.
  • R-squared: Represents the proportion of the target’s variation (mpg) captured by the model or explanatory variable weight alone.
  • Residual: In a good model, the residuals should behave like random noise, indicating that the model has captured most of the information/pattern from the target.
  • Our example:
    • The weight of cars alone can explain \(\approx 70\)% (R-squared) of the variation of mpg.
    • However, the residuals still contain patterns (large errors at large predicted mpg), suggesting the model can be improved.

Multiple Linear Regression (MLR)

Multiple Linear Regression (MLR)

mpg vs cylinders + year

  • Multiple Linear Regression: using more than 1 input, for example: \[\begin{align*}\widehat{\text{mpg}}_i&=\color{blue}{\beta_0} + \color{blue}{\beta_1}\text{cyl}_i+\color{blue}{\beta_2}\text{year}_i\\(\text{Maths:}\quad \hat{y}_i&=\color{blue}{\beta_0} + \color{blue}{\beta_1}\text{x}_{i1}+\color{blue}{\beta_2}\text{x}_{i2}),\end{align*}\] with \(\color{blue}{\beta_0,\beta_1,\beta_2}\in\mathbb{R}\) to be estimated.
  • We find \([\color{blue}{\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2}]\) minimizing \[\begin{align*}\color{red}{\text{RSS}}&=\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2\\ &=\sum_{i=1}^n(y_i-\color{blue}{\beta_0}-\color{blue}{\beta_1}\text{x}_{i1}-\color{blue}{\beta_2}\text{x}_{i2})^2.\end{align*}\]
mpg cylinders model year
0 18.0 8 70
1 15.0 8 70
2 18.0 8 70

Multiple Linear Regression (MLR)

mpg vs cylinders + year

  • We find \([\color{blue}{\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2}]\) minimizing \[\begin{align*}\color{red}{\text{RSS}}&=\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2\\ &=\sum_{i=1}^n(y_i-\color{blue}{\beta_0}-\color{blue}{\beta_1}\text{x}_{i1}-\color{blue}{\beta_2}\text{x}_{i2})^2\\ &=\|\underbrace{Y}_{\begin{bmatrix}y_1\\ \vdots\\ y_n\end{bmatrix}}-\underbrace{X}_{\begin{bmatrix}1 & \text{x}_{11} &\text{x}_{12}\\ \vdots & \vdots & \vdots\\ 1 & \text{x}_{n1} &\text{x}_{n2}\end{bmatrix}}\color{blue}{\underbrace{\vec{\beta}}_{\begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\end{bmatrix}}}\|^2.\end{align*}\]
  • Minimizing \(\color{red}{\text{RSS}}\Rightarrow \color{blue}{\vec{\beta}^*=(X^TX)^{-1}X^TY}.\)
  • Prediction: \(\color{blue}{\hat{Y}}=X\color{blue}{\vec{\beta}^*}\) (a NumPy sketch follows the table below).
mpg cylinders model year
0 18.0 8 70
1 15.0 8 70
2 18.0 8 70
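A NumPy sketch of the normal-equation solution for mpg vs cylinders + year ('model year' is the column name in the loaded data; np.linalg.solve is used instead of explicitly inverting \(X^TX\), which is numerically safer):

Code
X = np.column_stack([np.ones(len(data)),           # column of 1s for the intercept
                     data['cylinders'].to_numpy(),
                     data['model year'].to_numpy()])
Y = data['mpg'].to_numpy()
beta = np.linalg.solve(X.T @ X, X.T @ Y)   # solves (X^T X) beta = X^T Y
print(beta)                                # approx. [-17.15, -3.00, 0.75]
Y_hat = X @ beta                           # predictions Y_hat = X beta*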

Multiple Linear Regression (MLR)

Model Diagnostics

Adjusted R-squared

  • Normally, \(R^2\) increases along with the number of inputs, but a good model may not need so many variables.
  • A better criterion, Adjusted R-squared (balancing the number of inputs with the increment in \(R^2\)): \[R^2_{\text{adj}}=1-\frac{n-1}{n-d-1}(1-R^2).\] Here, \(n\) is the number of observations, \(d\) is the number of inputs.
  • Since \(\frac{n-1}{n-d-1}\geq 1\), we always have \(R^2_{\text{adj}}\leq R^2\).
  • For our model: \(R^2=\) 0.715 and \(R^2_{\text{adj}}=\) 0.714 (this is a good sign!).
  • A large \(R^2\) with a slight drop in \(R^2_{\text{adj}}\) indicates a good MLR model.
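Both criteria for the model above, computed from their definitions (a sketch; Y and Y_hat come from the normal-equation sketch earlier):

Code
n, d = len(Y), 2                                                 # 392 observations, 2 inputs
r2 = 1 - np.sum((Y - Y_hat) ** 2) / np.sum((Y - Y.mean()) ** 2)  # R-squared
r2_adj = 1 - (n - 1) / (n - d - 1) * (1 - r2)                    # adjusted R-squared
print(round(r2, 3), round(r2_adj, 3))                            # approx. 0.715 and 0.714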

Multiple Linear Regression (MLR)

Model Diagnostics (cont.)

Residual analysis

Code
# Fit the MLR model and compute residuals ('cylinders' and 'model year' assumed as the two inputs)
lr_mlr = LinearRegression().fit(data[['cylinders', 'model year']], data['mpg'])
y_hat = lr_mlr.predict(data[['cylinders', 'model year']])
resid = data['mpg'].to_numpy() - y_hat   # residuals

from plotly.subplots import make_subplots

fig_res = make_subplots(rows=1, cols=2, subplot_titles=("Residuals vs predicted mpg", "Residual density"))

fig_res.add_trace(
    go.Scatter(x=y_hat, y=resid, name="Residuals", mode="markers"), 
    row=1, col=1)
fig_res.add_trace(
    go.Scatter(x=[np.min(y_hat), np.max(y_hat)], y=[0,0], mode="lines", line=dict(color='red', dash="dash"), name="0"), 
    row=1, col=1)

fig_res.update_xaxes(title_text="Predicted Sales", row=1, col=1)
fig_res.update_yaxes(title_text="Residuals", row=1, col=1)


fig_res.add_trace(
    go.Histogram(x=resid, name = "Residual histogram"), row=1, col=2
)
fig_res.update_xaxes(title_text="Residual", row=1, col=2)
fig_res.update_yaxes(title_text="Histogram", row=1, col=2)

fig_res.update_layout(width=950, height=350)
fig_res.show()

Multiple Linear Regression (MLR)

\(t\)-test of coefficients

  • Just like in SLR, we can test \(H_0: \beta_j=0\) against \(H_1:\beta_j\neq 0\) using a \(t\)-test.
  • If at least one of the two assumptions holds:
    • there are enough observations (\(n>30\)),
    • or the residuals follow a Gaussian distribution with constant variance,
    then, under \(H_0\), \[t_j=\frac{\hat{\beta}_j}{s_{j}}\sim {\cal T}(n-d-1),\] where \(s_j\) is the standard error of \(\hat{\beta}_j\).
  • For a given level \(\alpha\), we CAN REJECT \(H_0:\beta_j=0\) if \(|t_j|>t_{\alpha/2}\).

Multiple Linear Regression (MLR)

\(t\)-test of coefficients

import statsmodels.api as sm
model = sm.OLS(data['mpg'], sm.add_constant(
    data[['cylinders', 'model year']].rename(columns={'model year': 'year'})))
results = model.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.715
Model:                            OLS   Adj. R-squared:                  0.714
Method:                 Least Squares   F-statistic:                     488.1
Date:                Mon, 08 Sep 2025   Prob (F-statistic):          8.84e-107
Time:                        22:14:42   Log-Likelihood:                -1115.1
No. Observations:                 392   AIC:                             2236.
Df Residuals:                     389   BIC:                             2248.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -17.1464      4.944     -3.468      0.001     -26.866      -7.426
cylinders     -2.9981      0.132    -22.718      0.000      -3.258      -2.739
year           0.7502      0.061     12.276      0.000       0.630       0.870
==============================================================================
Omnibus:                       24.502   Durbin-Watson:                   1.290
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.620
Skew:                           0.513   Prob(JB):                     1.36e-07
Kurtosis:                       3.940   Cond. No.                     1.79e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.79e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Multiple Linear Regression (MLR)

Summary

  • Obtained model: mpg = -17.146 - 2.998\(\times\)cylinders + 0.75\(\times\)year.
  • Rough interpretation: \(\beta_1=\) -2.998 indicates that if cylinders increases (or decreases) by \(1\) unit, mpg is expected to decrease (or increase) by 2.998 units, holding year fixed.
  • Exercise: interpret \(\beta_2=\) 0.75 in the same way.
  • \(R^2=\) 0.715 indicates that around 71.5% of the variation of mpg can be explained by cylinders and year together, which is better than weight alone.
  • The slight decrease to \(R^2_{\text{adj}}=\) 0.714 suggests that the information provided by the two variables is not redundant for explaining mpg.
  • The spread of residuals at large predicted mpg indicates that the model underestimates the actual target there.

🥳 Yeahhhh….

Let’s Party… 🥂