import pandas as pd              # Import pandas package
import numpy as np
import seaborn as sns            # Package for beautiful graphs
import matplotlib.pyplot as plt  # Graph management
sns.set(style="whitegrid")       # Set grid style

data = pd.read_csv(path2 + "auto-mpg.csv")  # Import the data into Python
data.head(4)                                # Show the first 4 rows
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
What affects fuel efficiency the most?
Are newer models more fuel-efficient?
What influences speed or acceleration?
How did MPG values change across the years (model year)?
How do cars differ by origin (USA, Europe, Asia)?
What are the characteristics of cars from different origins?
Weight shows the strongest negative correlation with mpg, followed by displacement, cylinders, and horsepower. These variables are significant in explaining variations in mpg.
These features are also highly correlated with each other, suggesting potential redundancy when included together in a predictive model.
Despite being a categorical variable, origin proves to be valuable for predicting mpg.
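These statements can be checked directly from the data. A minimal sketch (assuming the `data` frame loaded above, with horsepower already cleaned to a numeric column):

import seaborn as sns
import matplotlib.pyplot as plt

corr = data.select_dtypes(include="number").corr()   # pairwise correlations of numeric columns
print(corr["mpg"].sort_values())                     # weight, displacement, cylinders, horsepower are most negative

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")  # visualize the full correlation matrix
plt.title("Correlation matrix of the Auto-MPG features")
plt.show()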
Simple Linear Regression (SLR)
mpg vs weight
data[['mpg', 'weight']].head(3)
| | mpg | weight |
|---|---|---|
| 0 | 18.0 | 3504 |
| 1 | 15.0 | 3693 |
| 2 | 18.0 | 3436 |
import plotly.express as px
fig = px.scatter(data, x="weight", y="mpg", hover_name="car name")
fig.update_layout(title="mpg vs weight", height=290, width=450)
fig.show()
Simple Linear Model:\[\text{(prediction)}:\quad\widehat{\text{mpg}}_i=\color{blue}{a}\text{weight}_i+\color{blue}{b},\] for some \(\color{blue}{a},\color{blue}{b}\in\mathbb{R}\) to be chosen so that \(\color{red}{\widehat{\text{mpg}}_i\approx \text{mpg}_i}\) for all \(i=1,...,n.\)
In general, \(\hat{y}_i=\color{blue}{a}\text{x}_i+\color{blue}{b}\), with key parameters \(\color{blue}{a},\color{blue}{b}\) and observed data \((y_i,\text{x}_i),i=1,\dots,n\).
Objective Find the best \(\color{blue}{a}\) and \(\color{blue}{b}\) so that (prediction) \(\color{red}{\hat{y}_i\approx y_i}\) (reality) for all \(i\).
What does \(\color{red}{\hat{y}_i\approx y_i}\) mean?
Q2: For \(y_0=20.312\), which is the best prediction among \(\color{red}{\hat{y}_0=18.2}\), \(\color{red}{21.5}\), and \(\color{red}{19.73}\)?
Residual Sum of Squares (RSS): \[\begin{align*}\color{red}{\text{RSS}=\sum_{i=1}^ne_i^2}&=\color{red}{\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2}\\ &=\color{red}{\sum_{i=1}^n(y_i}-\color{blue}{a}\text{x}_i-\color{blue}{b}\color{red}{)^2}.\end{align*}\]
Roughly, \(\color{red}{\text{RSS}}\) is the sum of the squared lengths of all the dashed lines (the residuals \(e_i\)).
Objective: Find the coefficients \((\color{blue}{a,b})\) that produce the smallest \(\color{red}{\text{RSS}}\).
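As an illustration, RSS can be evaluated for any candidate pair \((\color{blue}{a},\color{blue}{b})\); a sketch assuming the `data` frame from above (the first candidate uses the values found later by the OLS fit):

import numpy as np

def rss(a, b, x, y):
    residuals = y - (a * x + b)      # e_i = y_i - (a x_i + b)
    return np.sum(residuals ** 2)    # RSS = sum of squared residuals

x = data["weight"].to_numpy(dtype=float)
y = data["mpg"].to_numpy(dtype=float)
print(rss(-0.0076, 46.2, x, y))      # a near-optimal candidate (small RSS)
print(rss(0.0, 23.0, x, y))          # a flat candidate line (much larger RSS)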
\(\overline{\text{x}}_n=\frac{1}{n}\sum_{i=1}^n\text{x}_i\) and \(\overline{y}_n=\frac{1}{n}\sum_{i=1}^ny_i\): the average/mean of \(X\) and \(Y\) respectively.
\(\text{Cov}(X,Y)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)\): the “covariance” between \(X\) & \(Y\).
\(\text{V}(X)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2\): the “variance” of \(X\).
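Minimizing RSS over \((\color{blue}{a},\color{blue}{b})\) has a closed-form solution written with the quantities above (with \(\text{V}(Y)\) defined analogously to \(\text{V}(X)\)): \[\color{blue}{\hat{a}}=\frac{\text{Cov}(X,Y)}{\text{V}(X)},\qquad \color{blue}{\hat{b}}=\overline{y}_n-\color{blue}{\hat{a}}\,\overline{\text{x}}_n,\qquad R^2=\frac{\text{Cov}(X,Y)^2}{\text{V}(X)\text{V}(Y)}.\]
A quick numerical check (a sketch, again assuming the `data` frame above):

import numpy as np

x = data["weight"].to_numpy(dtype=float)
y = data["mpg"].to_numpy(dtype=float)

a_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # Cov(X, Y) / V(X)
b_hat = y.mean() - a_hat * x.mean()                  # b = mean(y) - a * mean(x)
r2 = np.corrcoef(x, y)[0, 1] ** 2                    # R^2 = squared correlation
print(a_hat, b_hat, r2)                              # roughly -0.0076, 46.2, 0.693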
Interpretation: The model (weight) can explain around 69.3% of the variation of the target (mpg).
Simple Linear Regression (SLR)
Model Diagnostics (judging the model)
Residual Analysis
Residuals: in a well-specified model, \(\color{red}{e_i=y_i-\hat{y}_i}\sim{\cal N}(0,\sigma^2)\) for some \(\sigma>0\), i.e., the residuals are symmetric around \(0\) and DO NOT DEPEND ON \(\text{x}_i\) nor \(y_i\).
The estimated coefficients \(\color{blue}{\hat{a}}\) and \(\color{blue}{\hat{b}}\) are computed from a sample of data.
How can we be sure that a linear relation between \(\text{x}\) and \(y\) truly exists, i.e., \(y=\color{blue}{a}\text{x}+\color{blue}{b}\) with \(\color{blue}{a}\neq 0\)?
This is equivalent to testing \(H_0: \color{blue}{a}=0\) against \(H_1: \color{blue}{a}\neq 0\).
If \(n\) is large enough (\(n>30\)) or the residuals are Gaussian, then under \(H_0\) we have \(\color{blue}{t}=\frac{\color{blue}{\hat{a}}}{s_{\color{blue}{\hat{a}}}}\sim{\cal T}(n-2)\), where \(s_{\color{blue}{\hat{a}}}\) is the standard error of \(\color{blue}{\hat{a}}\).
Given \(0\leq\color{red}{\alpha}\leq 1\), let \(\color{red}{t_{\alpha/2}}\) be the critical value of the t-distribution defined by \(\mathbb{P}(|{\cal T}(n-2)|\geq \color{red}{t_{\alpha/2}})=\color{red}{\alpha}\):
We can reject \(H_0\) if \(|\color{blue}{t}|\geq \color{red}{t_{\alpha/2}}\) (a linear relation between \(\text{x}\) & \(y\) truly exists) at confidence level \(1-\color{red}{\alpha}\).
Otherwise, we cannot reject \(H_0\) (not enough evidence to support a linear relationship between \(y\) & \(\text{x}\)).
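A sketch of this rejection rule in code, assuming \(\alpha=0.05\) and taking \(n=392\) and \(t=-29.645\) from the OLS summary shown below:

from scipy import stats

alpha, n = 0.05, 392
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # critical value: P(|T(n-2)| >= t_crit) = alpha
t_stat = -29.645                                # t = a_hat / s_a_hat (from the summary below)
print(t_crit, abs(t_stat) >= t_crit)            # ~1.966 and True, so H0 is rejected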
Simple Linear Regression (SLR)
\(t\)-test for Coefficient
import statsmodels.api as sm

model = sm.OLS(data['mpg'], sm.add_constant(data[['weight']]))
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.693
Model: OLS Adj. R-squared: 0.692
Method: Least Squares F-statistic: 878.8
Date: Mon, 21 Apr 2025 Prob (F-statistic): 6.02e-102
Time: 10:23:01 Log-Likelihood: -1130.0
No. Observations: 392 AIC: 2264.
Df Residuals: 390 BIC: 2272.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 46.2165 0.799 57.867 0.000 44.646 47.787
weight -0.0076 0.000 -29.645 0.000 -0.008 -0.007
==============================================================================
Omnibus: 41.682 Durbin-Watson: 0.808
Prob(Omnibus): 0.000 Jarque-Bera (JB): 60.039
Skew: 0.727 Prob(JB): 9.18e-14
Kurtosis: 4.251 Cond. No. 1.13e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
As we already rejected \(H_0:\color{blue}{a}=0\), the coefficient \(\color{blue}{\hat{a}}=\) -0.008 can be interpreted as follows: mpg is expected to decrease (or increase) by 0.008 units for every \(1\)-unit increase (or decrease) in car weight.
R-squared: Represents the proportion of the target’s variation (mpg) captured by the model or explanatory variable weight alone.
Residual: In a good model, the residuals should behave like random noise, indicating that the model has captured most of the information/pattern from the target.
Our example:
The weight of cars alone can explain \(\approx 70\)% (R-squared) of the variation of mpg.
However, the residuals still contain patterns (large errors at large predicted mpg), suggesting the model can be improved.
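One way to see this pattern is a residuals-vs-fitted plot; a minimal sketch using the `results` object fitted above:

import matplotlib.pyplot as plt

plt.scatter(results.fittedvalues, results.resid, alpha=0.5)   # residual e_i vs predicted mpg
plt.axhline(0, color="red", linestyle="--")                   # reference line at zero
plt.xlabel("Predicted mpg")
plt.ylabel("Residual")
plt.title("Residuals vs fitted values (SLR: mpg ~ weight)")
plt.show()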
Multiple Linear Regression (MLR)
mpg vs cylinders + year
Multiple Linear Regression: using more than 1 input, for example: \[\begin{align*}\widehat{\text{mpg}}_i&=\color{blue}{\beta_0} + \color{blue}{\beta_1}\text{cyl}_i+\color{blue}{\beta_2}\text{year}_i\\(\text{Maths:}\quad \hat{y}_i&=\color{blue}{\beta_0} + \color{blue}{\beta_1}\text{x}_{i1}+\color{blue}{\beta_2}\text{x}_{i2}),\end{align*}\] with \(\color{blue}{\beta_0,\beta_1,\beta_2}\in\mathbb{R}\) to be estimated.
We find \([\color{blue}{\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2}]\) minimizing \[\begin{align*}\color{red}{\text{RSS}}&=\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2\\ &=\sum_{i=1}^n(y_i-\color{blue}{\beta_0}-\color{blue}{\beta_1}\text{x}_{i1}-\color{blue}{\beta_2}\text{x}_{i2})^2.\end{align*}\]
Normally, \(R^2\) increases along with the number of inputs, but a good model may not need so many variables.
A better criterion, Adjusted R-squared (balancing the number of inputs against the increase in \(R^2\)): \[R^2_{\text{adj}}=1-\frac{n-1}{n-d-1}(1-R^2).\] Here, \(n\) is the number of observations and \(d\) is the number of inputs.
Usually, \(R^2_{\text{adj}}\leq R^2\).
For our model: \(R^2=\) 0.715 and \(R^2_{\text{adj}}=\) 0.714 (this is a good sign!).
A large \(R^2\) with a slight drop in \(R^2_{\text{adj}}\) indicates a good MLR model.
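As a quick check with our numbers (assuming the same \(n=392\) observations as in the SLR fit and \(d=2\) inputs): \[R^2_{\text{adj}}=1-\frac{392-1}{392-2-1}(1-0.715)=1-\frac{391}{389}\times 0.285\approx 0.714.\]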
Rough interpretation: \(\beta_1=\) -2.998 indicates that if cylinders increase (or decrease) by \(1\) unit, mpg is expected to decrease (or increase) by 2.998 units, holding year fixed.
Explain: what does \(\beta_2=\) 0.75 (the coefficient of year) mean?
\(R^2=\) 0.715 indicates that around 71.5% variation of mpg can be explained by cylinders and year together, which is better than weight alone.
A slight decrease in \(R^2_{\text{adj}}=\) 0.714 suggests that the information provided by both variables is not redundant for explaining mpg.
The spread of the residuals at large predicted mpg values indicates that the model underestimates the actual target there.
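For reference, a sketch of how such a fit can be produced with statsmodels, mirroring the SLR call above (the column names come from the data table at the top; that the original slides used exactly this call is an assumption):

import statsmodels.api as sm

X = sm.add_constant(data[["cylinders", "model year"]])   # two inputs: cylinders and model year
mlr = sm.OLS(data["mpg"], X).fit()
print(mlr.summary())                                     # coefficients, t-tests, R-squared
print(mlr.rsquared, mlr.rsquared_adj)                    # R^2 and adjusted R^2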
Logistic Regression (LR)
Linear Regression aims at predicting a quantitative target. Such a problem is called a Regression problem.
Logistic Regression aims at predicting a categorical target. It is a Classification method.
Objective: Given an input \(\text{x}_i\in\mathbb{R}^d\), predict its class \(y\in\{0,1\}\) (e.g., male or female).
Main idea: classify \(\Leftrightarrow\) identify decision boundary.
Main assumption: the Boundary (B) is linear.
Model: Given input \(\text{x}_i\), the chance that it belongs to class \(1\) is given by \[\mathbb{P}(Y_i=1|X=\text{x}_i)=\sigma(\color{blue}{\beta_0}+\sum_{j=1}^d\color{blue}{\beta_j}x_{ij}),\] where \(\color{blue}{\beta_0,\beta_1,\dots,\beta_d}\in\mathbb{R}\) are the key parameters to be estimated from the data, and \(\sigma(t)=1/(1+e^{-t}),\forall t\in\mathbb{R}\).
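A minimal numerical sketch of this model for a single input point, using the illustrative boundary \(1+x_1-2x_2\) that appears in the example below (the point coordinates are hypothetical):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))        # sigma(t) = 1 / (1 + e^{-t})

beta0, beta = 1.0, np.array([1.0, -2.0])   # boundary (B): 1 + x1 - 2*x2 = 0
x0 = np.array([2.0, 0.5])                  # a hypothetical input point

z0 = beta0 + x0 @ beta                     # relative (signed) distance to the boundary
p1 = sigmoid(z0)                           # P(Y = 1 | X = x0)
print(z0, p1)                              # z0 = 2.0, p1 ~ 0.88: above the boundary, likely class 1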
Binary Logistic Regression
Model intuition
Ex: Given \(\text{x}_0=[\text{h}_0,\text{w}_0]\in\mathbb{R}^2,\) for any candidate parameter \(\color{blue}{\vec{\beta}=[\beta_0,\beta_1,\beta_2]}\), \[\color{green}{z_0}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{h}_0+\color{blue}{\beta_2}\text{w}_0\text{ is the relative (signed) distance from }\text{x}_0\text{ to the Boundary (B)}.\]
That is to say:
\(\color{green}{z_0}>0\Leftrightarrow \text{x}_0\) is above the boundary.
\(|\color{green}{z_0}|\) is large \(\Leftrightarrow\) \(\text{x}_0\) is far from the boundary.
A good boundary should be such that:
\(|\color{green}{z_0}|\) large \(\Rightarrow\) “certain about its class”.
\(|\color{green}{z_0}|\) small \(\Rightarrow\) “less certain about its class”.
Interpretation: \(\text{x}_1,\text{x}_2\) are located above the line \((B):1+x_1-2x_2=0\) as \(z_1,z_2>0\) and are predicted to be in class \(\color{blue}{1}\). On the other hand, \(\text{x}_3\) is located below the line (\(z_3<0\)) and is predicted to be in class \(\color{red}{0}\).
Q4: Now, how do we find the best key parameters \(\color{blue}{\beta_0,\dots,\beta_d}\)?
We will build a criterion just like RSS in linear regression.
Objective: search for \(\color{blue}{\beta_0}\in\mathbb{R},\color{blue}{\vec{\beta}}\in\mathbb{R}^d\) such that the model is best aligned with the data \({\cal D}\): \[p(y_i|\text{x}_i)\text{ is large for all }i\in\{1,\dots,n\}.\]
Conditional Likelihood Function: If the data are iid, one has \[\begin{align*}{L}(\color{blue}{\beta_0},\color{blue}{\vec{\beta}})&=\mathbb{P}(Y_1=y_1,\dots,Y_n=y_n|X_1=\text{x}_1,\dots,X_n=\text{x}_n)\\
&=\prod_{i=1}^np(y_i|\text{x}_i)\\
&=\prod_{i=1}^n\Big[p(1|\text{x}_i)\Big]^{y_i}\Big[p(0|\text{x}_i)\Big]^{1-y_i}\\
&=\prod_{i=1}^n\Big[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\Big]^{y_i}\Big[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\Big]^{1-y_i}.
\end{align*}\]
Maximizing \(L\) is equivalent to minimizing its negative logarithm, the Cross-Entropy: we search for coefficients \(\color{blue}{\vec{\beta}}=[\color{blue}{\beta_0,\dots,\beta_d}]\) minimizing \[\text{CEn}(\color{blue}{\vec{\beta}})=-\sum_{i=1}^n\Big[y_i\log\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})+(1-y_i)\log\big(1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\big)\Big].\]
😭 Unfortunately, the minimizing values \((\color{blue}{\widehat{\beta}_0,\widehat{\vec{\beta}}})\) CANNOT be computed analytically.
😊 Fortunately, it can be numerically approximated!
We can use optimization algorithms such as Gradient Descent Algorithm to estimate the best \(\color{blue}{\hat{\beta}}\).
For more on Gradient Descent Algorithm for Logistic Regression, read here.
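A minimal gradient-descent sketch for this minimization (illustrative only; the learning rate, iteration count, and toy data are assumptions, not from the slides):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend a column of 1s for beta0
    beta = np.zeros(d + 1)                    # [beta0, beta1, ..., betad]
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)                # current P(Y=1 | x_i) for every row
        grad = Xb.T @ (p - y) / n             # gradient of the cross-entropy CEn
        beta -= lr * grad                     # one gradient-descent step
    return beta

# Toy usage: labels generated from the boundary 1 + x1 - 2*x2 used earlier
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (1 + X[:, 0] - 2 * X[:, 1] > 0).astype(float)
print(fit_logistic(X, y))                     # estimated [beta0, beta1, beta2]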
Binary Logistic Regression
Summary
Logistic Regression Model
Main model: \(p(1|\text{x})=1/(1+e^{-\color{green}{z}})=1/(1+e^{-(\color{blue}{\beta_0}+\text{x}^T\color{blue}{\vec{\beta}})})\).
Interpretation:
The decision boundary is linear, defined by the coefficients \(\color{blue}{\beta_0}\) and \(\color{blue}{\vec{\beta}}\).
Probability of being in each class depends on the relative distance of that point to the boundary.
Works well when classes are linearly separable.
Objective: building a Logistic Regression model is equivalent to searching for parameters \(\color{blue}{\beta_0}\) and \(\color{blue}{\vec{\beta}}\) that minimize the Cross-Entropy.
The loss cannot be minimized analytically but can be minimized numerically.
Logistic Regression
Application on Auto-MPG
For our Auto-MPG dataset, we aim at predicting origin using some characteristics of the cars.
Build intuition through visualization:
We predict origin using all quantitative columns.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Building the model
X_train, X_test, y_train, y_test = train_test_split(
    df_car.select_dtypes(include="number"), df_car[['origin']])
lgit = LogisticRegression()
lgit = lgit.fit(X_train, y_train)

# Prediction
y_pred = lgit.predict(X_test)

# Accuracy
acc = np.mean(y_pred.flatten() == y_test.to_numpy().flatten())
Accuracy = 0.745.
Here, accuracy is defined by \[\text{Accuracy}=\frac{\text{Num. correctly predicted}}{\text{Num. observations}}.\]
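The same number can be obtained with scikit-learn's built-in helper, mirroring the manual computation above:

from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test.to_numpy().flatten(), y_pred)   # fraction of correctly predicted origins
print(acc)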
Logistic Regression
Summary
We introduced the basic concept of the Logistic Regression Model: \[p(1|X=\text{x})=\frac{1}{1+e^{-\color{blue}{\beta_0}-\text{x}^T\color{blue}{\vec{\beta}}}}.\]
The intuition of the model: the probability of being in class \(1\) depends on the relative distance from \(\text{x}\) to a linear boundary defined by \(\color{blue}{[\beta_0,\beta_1,\dots,\beta_d]}\).
The linear-boundary assumption may be too restrictive in practice.
The performance of the model can be improved further by