import pandas as pd               # data manipulation
import numpy as np                # numerical computing
import seaborn as sns             # statistical graphics
import matplotlib.pyplot as plt   # plotting
sns.set(style="whitegrid")        # white background with grid lines

data = pd.read_csv(path2 + "auto-mpg.csv")  # load the dataset into Python
data.head(4)                                # display the first 4 rows
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin                   car name
0  18.0          8         307.0         130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0         165    3693          11.5          70       1          buick skylark 320
2  18.0          8         318.0         150    3436          11.0          70       1         plymouth satellite
3  16.0          8         304.0         150    3433          12.0          70       1              amc rebel sst
What factors affect fuel efficiency the most?
How well can we predict the fuel efficiency of a car from its characteristics?
Weight shows the strongest negative correlation with mpg, followed by displacement, cylinders, and horsepower. These variables are significant in explaining variations in mpg.
These features are also highly correlated with each other, suggesting potential redundancy when included together in a predictive model.
Despite being a categorical variable, origin proves to be valuable for predicting mpg.
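These observations can be reproduced directly from the data. A minimal sketch, assuming `data` is the DataFrame loaded above (the `errors="coerce"` guard handles any non-numeric entries, such as the '?' placeholders sometimes found in horsepower):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

num = data.drop(columns=["car name"]).apply(pd.to_numeric, errors="coerce")
corr = num.corr()                         # pairwise Pearson correlations
print(corr["mpg"].sort_values())          # weight should be the most negative

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix (auto-mpg)")
plt.show()

The large off-diagonal entries among weight, displacement, cylinders, and horsepower make the redundancy mentioned above visible in the heatmap.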
Simple Linear Regression (SLR)
mpg vs weight
data[['mpg', 'weight']].head(3)
    mpg  weight
0  18.0    3504
1  15.0    3693
2  18.0    3436
import plotly.express as px

fig = px.scatter(data, x="weight", y="mpg", hover_name="car name")
fig.update_layout(title="mpg vs weight", height=290, width=450)
fig.show()
Simple Linear Model: \[\text{(prediction)}:\quad\widehat{\text{mpg}}_i=\color{blue}{a}\,\text{weight}_i+\color{blue}{b},\] for some \(\color{blue}{a},\color{blue}{b}\in\mathbb{R}\) to be chosen so that \(\color{red}{\widehat{\text{mpg}}_i\approx \text{mpg}_i}\) for all \(i=1,\dots,n.\)
In general, \(\hat{y}_i=\color{blue}{a}\text{x}_i+\color{blue}{b}\), with coefficients \(\color{blue}{a},\color{blue}{b}\) and observed data \((y_i,\text{x}_i),\ i=1,\dots,n\).
Objective: Find the best \(\color{blue}{a}\) and \(\color{blue}{b}\) so that (prediction) \(\color{red}{\hat{y}_i\approx y_i}\) (reality) for all \(i\).
What does \(\color{red}{\hat{y}_i\approx y_i}\) mean?
Q2: For \(y_0=20.312\), which is the best prediction among \(\color{red}{\hat{y}_0=18.2}\), \(\color{red}{21.5}\), and \(\color{red}{19.73}\)?
Residual Sum of Squares (RSS): \[\begin{align*}\color{red}{\text{RSS}=\sum_{i=1}^n e_i^2}&=\color{red}{\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2}\\ &=\color{red}{\sum_{i=1}^n(y_i}-\color{blue}{a}\text{x}_i-\color{blue}{b}\color{red}{)^2}.\end{align*}\]
Roughly, \(\color{red}{\text{RSS}}\) is the sum of the squared lengths of the dashed lines, i.e., the vertical distances between the observed points and the fitted line.
Objective: Find the coefficients \((\color{blue}{a,b})\) that produce the smallest \(\color{red}{\text{RSS}}\).
\(\overline{\text{x}}_n=\frac{1}{n}\sum_{i=1}^n\text{x}_i\) and \(\overline{y}_n=\frac{1}{n}\sum_{i=1}^ny_i\): the average/mean of \(X\) and \(Y\) respectively.
\(\text{Cov}(X,Y)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)\): the “covariance” between \(X\) & \(Y\).
\(\text{V}(X)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2\): the “variance” of \(X\).
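Setting the partial derivatives of \(\color{red}{\text{RSS}}\) with respect to \(\color{blue}{a}\) and \(\color{blue}{b}\) to zero gives the closed-form least-squares solution: \[\color{blue}{\hat{a}}=\frac{\text{Cov}(X,Y)}{\text{V}(X)},\qquad \color{blue}{\hat{b}}=\overline{y}_n-\color{blue}{\hat{a}}\,\overline{\text{x}}_n.\]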
Interpretation: the model (using weight alone) can explain around 69.3% of the variation of the target (mpg); this proportion is the \(R^2\) reported in the OLS summary below.
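As a sanity check, here is a minimal sketch that computes \(\color{blue}{\hat{a}}\), \(\color{blue}{\hat{b}}\), and \(R^2\) by hand from the formulas above (assuming `data` is the DataFrame loaded earlier):

import numpy as np

x = data["weight"].to_numpy(dtype=float)
y = data["mpg"].to_numpy(dtype=float)

a_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # Cov(X, Y) / V(X)
b_hat = y.mean() - a_hat * x.mean()                # b = mean(y) - a * mean(x)

y_hat = a_hat * x + b_hat                          # fitted values
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(a_hat, b_hat, r2)  # should roughly match the OLS summary below: -0.0076, 46.2, 0.693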
Simple Linear Regression (SLR)
Model Diagnostics (judging the model)
Residual Analysis
Residuals: ideally, \(\color{red}{e_i=y_i-\hat{y}_i}\sim{\cal N}(0,\sigma^2)\) for some \(\sigma>0\), i.e., the residuals are symmetric around \(0\) and do not depend on \(\text{x}_i\) or \(y_i\).
The estimated coefficients \(\color{blue}{\hat{a}}\) and \(\color{blue}{\hat{b}}\) are computed from a sample of data.
How can we be sure that the linear relation between \(\text{x}\) and \(y\) truly exists, i.e., that \(\hat{y}=\color{blue}{\hat{a}}\text{x}+\color{blue}{\hat{b}}\) with \(\color{blue}{a}\neq 0\)?
This is equivalent to testing \(H_0: \color{blue}{a}=0\) against \(H_1: \color{blue}{a}\neq 0\).
If \(n\) is large enough (\(n>30\)) or the residuals are Gaussian, then under \(H_0\) we have \(\color{blue}{t}=\frac{\color{blue}{\hat{a}}}{s_{\color{blue}{\hat{a}}}}\sim{\cal T}(n-2)\), where \(s_{\color{blue}{\hat{a}}}\) is the standard error of \(\color{blue}{\hat{a}}\).
Given \(0\leq\color{red}{\alpha}\leq 1\), let \(\color{red}{t_{\alpha/2}}\) be the \((1-\color{red}{\alpha}/2)\)-quantile of the t-distribution, i.e., the value satisfying \(\mathbb{P}(|{\cal T}(n-2)|\geq \color{red}{t_{\alpha/2}})=\color{red}{\alpha}\).
We can reject \(H_0\) if \(|\color{blue}{t}|\geq \color{red}{t_{\alpha/2}}\) (the linear relation between \(\text{x}\) & \(y\) truly exists) at confidence level \(1-\color{red}{\alpha}\).
Otherwise, we cannot reject \(H_0\) (not enough evidence to support a linear relationship between \(y\) & \(\text{x}\)).
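As an illustrative sketch (assuming scipy is available), the critical value \(\color{red}{t_{\alpha/2}}\) for \(\color{red}{\alpha}=0.05\) with \(n=392\) observations can be computed as:

from scipy import stats

alpha = 0.05
n = 392                                        # number of observations in our data
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # (1 - alpha/2)-quantile of T(n-2)
print(t_crit)  # ≈ 1.97: reject H0 whenever |t| exceeds this value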
Simple Linear Regression (SLR)
\(t\)-test for Coefficient
import statsmodels.api as sm

model = sm.OLS(data['mpg'], sm.add_constant(data[['weight']]))
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.693
Model: OLS Adj. R-squared: 0.692
Method: Least Squares F-statistic: 878.8
Date: Mon, 08 Sep 2025 Prob (F-statistic): 6.02e-102
Time: 22:14:42 Log-Likelihood: -1130.0
No. Observations: 392 AIC: 2264.
Df Residuals: 390 BIC: 2272.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 46.2165 0.799 57.867 0.000 44.646 47.787
weight -0.0076 0.000 -29.645 0.000 -0.008 -0.007
==============================================================================
Omnibus: 41.682 Durbin-Watson: 0.808
Prob(Omnibus): 0.000 Jarque-Bera (JB): 60.039
Skew: 0.727 Prob(JB): 9.18e-14
Kurtosis: 4.251 Cond. No. 1.13e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
As we have already rejected \(H_0:\color{blue}{a}=0\), the coefficient \(\color{blue}{\hat{a}}=\) -0.008 can be interpreted as follows: mpg is expected to decrease (or increase) by about 0.008 units for every \(1\)-unit increase (or decrease) in car weight. For example, a car that weighs 100 units more is predicted to get roughly 0.8 mpg less.
R-squared: represents the proportion of the target's variation (mpg) captured by the model, here the explanatory variable weight alone.
Residual: In a good model, the residuals should behave like random noise, indicating that the model has captured most of the information/pattern from the target.
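As a side note, the fitted model can also be used for prediction. A minimal sketch, reusing the `results` object above (the weight value 3000 is made up for illustration):

import pandas as pd
import statsmodels.api as sm

new_car = sm.add_constant(pd.DataFrame({"weight": [3000]}), has_constant="add")
print(results.predict(new_car))  # ≈ 46.2 - 0.0076 * 3000 ≈ 23.4 mpg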
Our example:
The weight of cars alone can explain \(\approx 70\)% (R-squared) of the variation of mpg.
However, the residuals still contain patterns (large errors at large predicted mpg), suggesting the model can be improved.
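A minimal sketch of such a residual check, reusing the `results` object fitted above:

import matplotlib.pyplot as plt

plt.scatter(results.fittedvalues, results.resid, s=12)  # residual vs predicted mpg
plt.axhline(0, color="red", linestyle="--")             # reference line at zero
plt.xlabel("Predicted mpg")
plt.ylabel("Residual")
plt.title("Residuals vs fitted values (mpg ~ weight)")
plt.show()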
Multiple Linear Regression (MLR)
mpg vs cylinders + year
Multiple Linear Regression: using more than 1 input, for example: \[\begin{align*}\widehat{\text{mpg}}_i&=\color{blue}{\beta_0} + \color{blue}{\beta_1}\text{cyl}_i+\color{blue}{\beta_2}\text{year}_i\\(\text{Maths:}\quad \hat{y}_i&=\color{blue}{\beta_0} + \color{blue}{\beta_1}\text{x}_{i1}+\color{blue}{\beta_2}\text{x}_{i2}),\end{align*}\] with \(\color{blue}{\beta_0,\beta_1,\beta_2}\in\mathbb{R}\) to be estimated.
We find \([\color{blue}{\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2}]\) minimizing \[\begin{align*}\color{red}{\text{RSS}}&=\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2\\ &=\sum_{i=1}^n(y_i-\color{blue}{\beta_0}-\color{blue}{\beta_1}\text{x}_{i1}-\color{blue}{\beta_2}\text{x}_{i2})^2.\end{align*}\]
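A minimal sketch of fitting this model with statsmodels (assuming `data` from above; the column name "model year" follows the head() output at the start of this section):

import statsmodels.api as sm

X = sm.add_constant(data[["cylinders", "model year"]])
mlr = sm.OLS(data["mpg"], X).fit()      # minimizes RSS over (β0, β1, β2)
print(mlr.params)                       # should roughly give β1 ≈ -2.998, β2 ≈ 0.75
print(mlr.rsquared, mlr.rsquared_adj)   # ≈ 0.715 and 0.714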
Normally, \(R^2\) increases with the number of inputs, but a good model may not need many variables.
A better criterion is Adjusted R-squared, which balances the number of inputs against the increase in \(R^2\): \[R^2_{\text{adj}}=1-\frac{n-1}{n-d-1}(1-R^2).\] Here, \(n\) is the number of observations and \(d\) is the number of inputs.
Usually, \(R^2_{\text{adj}}\leq R^2\).
For our model: \(R^2=\) 0.715 and \(R^2_{\text{adj}}=\) 0.714 (this is a good sign!).
A large \(R^2\) with a slight drop in \(R^2_{\text{adj}}\) indicates a good MLR model.
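As a quick check of the formula with the numbers above, assuming \(n=392\) observations and \(d=2\) inputs: \[R^2_{\text{adj}}=1-\frac{392-1}{392-2-1}(1-0.715)=1-\frac{391}{389}\times 0.285\approx 0.714,\] which matches the reported value.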
Rough interpretation: \(\beta_1=\) -2.998 indicates that if cylinders increases (or decreases) by \(1\) unit while year is held fixed, mpg is expected to decrease (or increase) by 2.998 units.
Explain: \(\beta_2=\) 0.75 (interpret the year coefficient in the same way).
\(R^2=\) 0.715 indicates that around 71.5% variation of mpg can be explained by cylinders and year together, which is better than weight alone.
A slight decrease in \(R^2_{\text{adj}}=\) 0.714 suggests that the information provided by both variables is not redundant for explaining mpg.
The spread of the residuals at large predicted mpg indicates that the model underestimates the actual target in that range, so there is still room for improvement.