import pandas as pd              # Import pandas package
import numpy as np
import seaborn as sns            # Package for beautiful graphs
import matplotlib.pyplot as plt  # Graph management
sns.set(style="whitegrid")       # Set grid style

data = pd.read_csv(path2 + "auto-mpg.csv")  # Import the data into Python
data.head(4)                                # Show the first 4 rows
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
What affects fuel efficiency the most?
Are newer models more fuel-efficient?
What influences speed or acceleration?
How did MPG values change across the years (model year)?
How do cars differ by origin (USA, Europe, Asia)?
What are the characteristics of cars from different origins?
Weight shows the strongest negative correlation with mpg, followed by displacement, cylinders, and horsepower. These variables are significant in explaining variations in mpg.
These features are also highly correlated with each other, suggesting potential redundancy when included together in a predictive model.
Despite being a categorical variable, origin proves to be valuable for predicting mpg.
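These statements can be checked directly from the data. A minimal sketch (assuming the `data` frame loaded above, with horsepower already cleaned to a numeric column):

import seaborn as sns
import matplotlib.pyplot as plt

corr = data.select_dtypes(include="number").corr()   # pairwise correlations of numeric columns
print(corr["mpg"].sort_values())                     # weight, displacement, cylinders, horsepower are most negative

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")  # visualize the full correlation matrix
plt.title("Correlation matrix of the Auto-MPG features")
plt.show()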
Simple Linear Regression (SLR)
mpg vs weight
data[['mpg', 'weight']].head(3)
| | mpg | weight |
|---|---|---|
| 0 | 18.0 | 3504 |
| 1 | 15.0 | 3693 |
| 2 | 18.0 | 3436 |
import plotly.express as px
fig = px.scatter(data, x="weight", y="mpg", hover_name="car name")
fig.update_layout(title="mpg vs weight", height=290, width=450)
fig.show()
Simple Linear Model:\[\text{(prediction)}:\quad\widehat{\text{mpg}}_i=\color{blue}{a}\text{weight}_i+\color{blue}{b},\] for some \(\color{blue}{a},\color{blue}{b}\in\mathbb{R}\) to be chosen so that \(\color{red}{\widehat{\text{mpg}}_i\approx \text{mpg}_i}\) for all \(i=1,...,n.\)
In general, \(\hat{y}_i=\color{blue}{a}\text{x}_i+\color{blue}{b}\), with key parameters \(\color{blue}{a},\color{blue}{b}\) and observed data \((y_i,\text{x}_i),i=1,\dots,n\).
Objective Find the best \(\color{blue}{a}\) and \(\color{blue}{b}\) so that (prediction) \(\color{red}{\hat{y}_i\approx y_i}\) (reality) for all \(i\).
What does \(\color{red}{\hat{y}_i\approx y_i}\) mean?
Q2: For \(y_0=20.312\), which is the best prediction among \(\color{red}{\hat{y}_0=18.2}\), \(\color{red}{21.5}\), and \(\color{red}{19.73}\)?
Residual Sum of Squares (RSS): \[\begin{align*}\color{red}{\text{RSS}=\sum_{i=1}^ne_i^2}&=\color{red}{\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2}\\ &=\color{red}{\sum_{i=1}^n(y_i}-\color{blue}{a}\text{x}_i-\color{blue}{b}\color{red}{)^2}.\end{align*}\]
Roughly, \(\color{red}{\text{RSS}}\) is the sum of the squared lengths of all the dashed lines (the residuals \(e_i\)).
Objective: Find the coefficients \((\color{blue}{a,b})\) that produce the smallest \(\color{red}{\text{RSS}}\).
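As an illustration, RSS can be evaluated for any candidate pair \((\color{blue}{a},\color{blue}{b})\); a sketch assuming the `data` frame from above (the first candidate uses the values found later by the OLS fit):

import numpy as np

def rss(a, b, x, y):
    residuals = y - (a * x + b)      # e_i = y_i - (a x_i + b)
    return np.sum(residuals ** 2)    # RSS = sum of squared residuals

x = data["weight"].to_numpy(dtype=float)
y = data["mpg"].to_numpy(dtype=float)
print(rss(-0.0076, 46.2, x, y))      # a near-optimal candidate (small RSS)
print(rss(0.0, 23.0, x, y))          # a flat candidate line (much larger RSS)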
\(\overline{\text{x}}_n=\frac{1}{n}\sum_{i=1}^n\text{x}_i\) and \(\overline{y}_n=\frac{1}{n}\sum_{i=1}^ny_i\): the average/mean of \(X\) and \(Y\) respectively.
\(\text{Cov}(X,Y)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)(y_i-\overline{y}_n)\): the “covariance” between \(X\) & \(Y\).
\(\text{V}(X)=\frac{1}{n}\sum_{i=1}^n(\text{x}_i-\overline{\text{x}}_n)^2\): the “variance” of \(X\).
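Minimizing RSS over \((\color{blue}{a},\color{blue}{b})\) has a closed-form solution written with the quantities above (with \(\text{V}(Y)\) defined analogously to \(\text{V}(X)\)): \[\color{blue}{\hat{a}}=\frac{\text{Cov}(X,Y)}{\text{V}(X)},\qquad \color{blue}{\hat{b}}=\overline{y}_n-\color{blue}{\hat{a}}\,\overline{\text{x}}_n,\qquad R^2=\frac{\text{Cov}(X,Y)^2}{\text{V}(X)\text{V}(Y)}.\]
A quick numerical check (a sketch, again assuming the `data` frame above):

import numpy as np

x = data["weight"].to_numpy(dtype=float)
y = data["mpg"].to_numpy(dtype=float)

a_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # Cov(X, Y) / V(X)
b_hat = y.mean() - a_hat * x.mean()                  # b = mean(y) - a * mean(x)
r2 = np.corrcoef(x, y)[0, 1] ** 2                    # R^2 = squared correlation
print(a_hat, b_hat, r2)                              # roughly -0.0076, 46.2, 0.693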
Interpretation: The model (weight) can explain around 69.3% of the variation of the target (mpg).
Simple Linear Regression (SLR)
Model Diagnostics (judging the model)
Residual Analysis
Residuals: in a well-specified model, \(\color{red}{e_i=y_i-\hat{y}_i}\sim{\cal N}(0,\sigma^2)\) for some \(\sigma>0\), i.e., the residuals are symmetric around \(0\) and DO NOT DEPEND ON \(\text{x}_i\) nor \(y_i\).
The estimated coefficients \(\color{blue}{\hat{a}}\) and \(\color{blue}{\hat{b}}\) are computed from a sample of data.
How can we be sure that a linear relation between \(\text{x}\) and \(y\) truly exists, i.e., \(y=\color{blue}{a}\text{x}+\color{blue}{b}\) with \(\color{blue}{a}\neq 0\)?
This is equivalent to testing \(H_0: \color{blue}{a}=0\) against \(H_1: \color{blue}{a}\neq 0\).
If \(n\) is large enough (\(n>30\)) or the residuals are Gaussian, then under \(H_0\) we have \(\color{blue}{t}=\frac{\color{blue}{\hat{a}}}{s_{\color{blue}{\hat{a}}}}\sim{\cal T}(n-2)\), where \(s_{\color{blue}{\hat{a}}}\) is the standard error of \(\color{blue}{\hat{a}}\).
Given \(0\leq\color{red}{\alpha}\leq 1\), let \(\color{red}{t_{\alpha/2}}\) be the critical value of the t-distribution defined by \(\mathbb{P}(|{\cal T}(n-2)|\geq \color{red}{t_{\alpha/2}})=\color{red}{\alpha}\):
We can reject \(H_0\) if \(|\color{blue}{t}|\geq \color{red}{t_{\alpha/2}}\) (a linear relation between \(\text{x}\) & \(y\) truly exists) at confidence level \(1-\color{red}{\alpha}\).
Otherwise, we cannot reject \(H_0\) (not enough evidence to support a linear relationship between \(y\) & \(\text{x}\)).
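A sketch of this rejection rule in code, assuming \(\alpha=0.05\) and taking \(n=392\) and \(t=-29.645\) from the OLS summary shown below:

from scipy import stats

alpha, n = 0.05, 392
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # critical value: P(|T(n-2)| >= t_crit) = alpha
t_stat = -29.645                                # t = a_hat / s_a_hat (from the summary below)
print(t_crit, abs(t_stat) >= t_crit)            # ~1.966 and True, so H0 is rejected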
Simple Linear Regression (SLR)
\(t\)-test for Coefficient
import statsmodels.api as sm

model = sm.OLS(data['mpg'], sm.add_constant(data[['weight']]))
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.693
Model: OLS Adj. R-squared: 0.692
Method: Least Squares F-statistic: 878.8
Date: Mon, 21 Apr 2025 Prob (F-statistic): 6.02e-102
Time: 10:23:01 Log-Likelihood: -1130.0
No. Observations: 392 AIC: 2264.
Df Residuals: 390 BIC: 2272.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 46.2165 0.799 57.867 0.000 44.646 47.787
weight -0.0076 0.000 -29.645 0.000 -0.008 -0.007
==============================================================================
Omnibus: 41.682 Durbin-Watson: 0.808
Prob(Omnibus): 0.000 Jarque-Bera (JB): 60.039
Skew: 0.727 Prob(JB): 9.18e-14
Kurtosis: 4.251 Cond. No. 1.13e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
As we already rejected \(H_0:\color{blue}{a}=0\), the coefficient \(\color{blue}{\hat{a}}=\) -0.008 can be interpreted as follows: mpg is expected to decrease (or increase) by 0.008 units for every \(1\)-unit increase (or decrease) in car weight.
R-squared: Represents the proportion of the target’s variation (mpg) captured by the model or explanatory variable weight alone.
Residual: In a good model, the residuals should behave like random noise, indicating that the model has captured most of the information/pattern from the target.
Our example:
The weight of cars alone can explain \(\approx 70\)% (R-squared) of the variation of mpg.
However, the residuals still contain patterns (large errors at large predicted mpg), suggesting the model can be improved.
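One way to see this pattern is a residuals-vs-fitted plot; a minimal sketch using the `results` object fitted above:

import matplotlib.pyplot as plt

plt.scatter(results.fittedvalues, results.resid, alpha=0.5)   # residual e_i vs predicted mpg
plt.axhline(0, color="red", linestyle="--")                   # reference line at zero
plt.xlabel("Predicted mpg")
plt.ylabel("Residual")
plt.title("Residuals vs fitted values (SLR: mpg ~ weight)")
plt.show()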
Multiple Linear Regression (MLR)
mpg vs cylinders + year
Multiple Linear Regression: using more than 1 input, for example: \[\begin{align*}\widehat{\text{mpg}}_i&=\color{blue}{\beta_0} + \color{blue}{\beta_1}\text{cyl}_i+\color{blue}{\beta_2}\text{year}_i\\(\text{Maths:}\quad \hat{y}_i&=\color{blue}{\beta_0} + \color{blue}{\beta_1}\text{x}_{i1}+\color{blue}{\beta_2}\text{x}_{i2}),\end{align*}\] with \(\color{blue}{\beta_0,\beta_1,\beta_2}\in\mathbb{R}\) to be estimated.
We find \([\color{blue}{\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2}]\) minimizing \[\begin{align*}\color{red}{\text{RSS}}&=\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2\\ &=\sum_{i=1}^n(y_i-\color{blue}{\beta_0}-\color{blue}{\beta_1}\text{x}_{i1}-\color{blue}{\beta_2}\text{x}_{i2})^2.\end{align*}\]
Normally, \(R^2\) increases along with the number of inputs, but a good model may not need so many variables.
A better criterion, Adjusted R-squared (balancing the number of inputs against the increase in \(R^2\)): \[R^2_{\text{adj}}=1-\frac{n-1}{n-d-1}(1-R^2).\] Here, \(n\) is the number of observations and \(d\) is the number of inputs.
Usually, \(R^2_{\text{adj}}\leq R^2\).
For our model: \(R^2=\) 0.715 and \(R^2_{\text{adj}}=\) 0.714 (this is a good sign!).
A large \(R^2\) with a slight drop in \(R^2_{\text{adj}}\) indicates a good MLR model.
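As a quick check with our numbers (assuming the same \(n=392\) observations as in the SLR fit and \(d=2\) inputs): \[R^2_{\text{adj}}=1-\frac{392-1}{392-2-1}(1-0.715)=1-\frac{391}{389}\times 0.285\approx 0.714.\]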
Rough interpretation: \(\beta_1=\) -2.998 indicates that if cylinders increase (or decrease) by \(1\) unit, mpg is expected to decrease (or increase) by 2.998 units, holding year fixed.
Explain: what does \(\beta_2=\) 0.75 (the coefficient of year) mean?
\(R^2=\) 0.715 indicates that around 71.5% variation of mpg can be explained by cylinders and year together, which is better than weight alone.
A slight decrease in \(R^2_{\text{adj}}=\) 0.714 suggests that the information provided by both variables is not redundant for explaining mpg.
The spread of the residuals at large predicted mpg values indicates that the model underestimates the actual target there.
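For reference, a sketch of how such a fit can be produced with statsmodels, mirroring the SLR call above (the column names come from the data table at the top; that the original slides used exactly this call is an assumption):

import statsmodels.api as sm

X = sm.add_constant(data[["cylinders", "model year"]])   # two inputs: cylinders and model year
mlr = sm.OLS(data["mpg"], X).fit()
print(mlr.summary())                                     # coefficients, t-tests, R-squared
print(mlr.rsquared, mlr.rsquared_adj)                    # R^2 and adjusted R^2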
Logistic Regression (LR)
Linear Regression aims at predicting a quantitative target. Such a problem is called a Regression problem.
Logistic Regression aims at predicting a categorical target. It is a Classification method.
Objective: Given an input \(\text{x}_i\in\mathbb{R}^d\), predict its class \(y\in\{0,1\}\) (e.g., male or female).
Main idea: classify \(\Leftrightarrow\) identify decision boundary.
Main assumption: the Boundary (B) is linear.
Model: Given input \(\text{x}_i\), the chance that it belongs to class \(1\) is given by \[\mathbb{P}(Y_i=1|X=\text{x}_i)=\sigma(\color{blue}{\beta_0}+\sum_{j=1}^d\color{blue}{\beta_j}x_{ij}),\] where \(\color{blue}{\beta_0,\beta_1,\dots,\beta_d}\in\mathbb{R}\) are the key parameters to be estimated from the data, and \(\sigma(t)=1/(1+e^{-t}),\forall t\in\mathbb{R}\).
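A minimal numerical sketch of this model for a single input point, using the illustrative boundary \(1+x_1-2x_2\) that appears in the example below (the point coordinates are hypothetical):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))        # sigma(t) = 1 / (1 + e^{-t})

beta0, beta = 1.0, np.array([1.0, -2.0])   # boundary (B): 1 + x1 - 2*x2 = 0
x0 = np.array([2.0, 0.5])                  # a hypothetical input point

z0 = beta0 + x0 @ beta                     # relative (signed) distance to the boundary
p1 = sigmoid(z0)                           # P(Y = 1 | X = x0)
print(z0, p1)                              # z0 = 2.0, p1 ~ 0.88: above the boundary, likely class 1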
Binary Logistic Regression
Model intuition
Ex: Given \(\text{x}_0=[\text{h}_0,\text{w}_0]\in\mathbb{R}^2,\) for any candidate parameter \(\color{blue}{\vec{\beta}=[\beta_0,\beta_1,\beta_2]}\), \[\color{green}{z_0}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{h}_0+\color{blue}{\beta_2}\text{w}_0\text{ is the relative (signed) distance from }\text{x}_0\text{ to the Boundary (B)}.\]
That is to say:
\(\color{green}{z_0}>0\Leftrightarrow \text{x}_0\) is above the boundary.
\(|\color{green}{z_0}|\) is large \(\Leftrightarrow\) \(\text{x}_0\) is far from the boundary.
A good boundary should be such that:
\(|\color{green}{z_0}|\) large \(\Rightarrow\) “certain about its class”.
\(|\color{green}{z_0}|\) small \(\Rightarrow\) “less certain about its class”.
Interpretation: \(\text{x}_1,\text{x}_2\) are located above the line \((B):1+x_1-2x_2=0\) as \(z_1,z_2>0\) and are predicted to be in class \(\color{blue}{1}\). On the other hand, \(\text{x}_3\) is located below the line (\(z_3<0\)) and is predicted to be in class \(\color{red}{0}\).
Q4: Now, how do we find the best key parameters \(\color{blue}{\beta_0,\dots,\beta_d}\)?
We will build a criterion just like RSS in linear regression.
Objective: search for \(\color{blue}{\beta_0}\in\mathbb{R},\color{blue}{\vec{\beta}}\in\mathbb{R}^d\) such that the model is best aligned with the data \({\cal D}\): \[p(y_i|\text{x}_i)\text{ is large for all }i\in\{1,\dots,n\}.\]
Conditional Likelihood Function: If the data are iid, one has \[\begin{align*}{L}(\color{blue}{\beta_0},\color{blue}{\vec{\beta}})&=\mathbb{P}(Y_1=y_1,\dots,Y_n=y_n|X_1=\text{x}_1,\dots,X_n=\text{x}_n)\\
&=\prod_{i=1}^np(y_i|\text{x}_i)\\
&=\prod_{i=1}^n\Big[p(1|\text{x}_i)\Big]^{y_i}\Big[p(0|\text{x}_i)\Big]^{1-y_i}\\
&=\prod_{i=1}^n\Big[\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\Big]^{y_i}\Big[1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\Big]^{1-y_i}.
\end{align*}\]
Maximizing \(L\) is equivalent to minimizing its negative logarithm, the Cross-Entropy: we search for coefficients \(\color{blue}{\vec{\beta}}=[\color{blue}{\beta_0,\dots,\beta_d}]\) minimizing \[\text{CEn}(\color{blue}{\vec{\beta}})=-\sum_{i=1}^n\Big[y_i\log\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})+(1-y_i)\log\big(1-\sigma(\color{blue}{\beta_0}+\text{x}_i^T\color{blue}{\vec{\beta}})\big)\Big].\]
😭 Unfortunately, the minimizing values \((\color{blue}{\widehat{\beta}_0,\widehat{\vec{\beta}}})\) CANNOT be computed analytically.
😊 Fortunately, it can be numerically approximated!
We can use optimization algorithms such as Gradient Descent Algorithm to estimate the best \(\color{blue}{\hat{\beta}}\).
For more on Gradient Descent Algorithm for Logistic Regression, read here.
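A minimal gradient-descent sketch for this minimization (illustrative only; the learning rate, iteration count, and toy data are assumptions, not from the slides):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend a column of 1s for beta0
    beta = np.zeros(d + 1)                    # [beta0, beta1, ..., betad]
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)                # current P(Y=1 | x_i) for every row
        grad = Xb.T @ (p - y) / n             # gradient of the cross-entropy CEn
        beta -= lr * grad                     # one gradient-descent step
    return beta

# Toy usage: labels generated from the boundary 1 + x1 - 2*x2 used earlier
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (1 + X[:, 0] - 2 * X[:, 1] > 0).astype(float)
print(fit_logistic(X, y))                     # estimated [beta0, beta1, beta2]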
Binary Logistic Regression
Summary
Logistic Regression Model
Main model: \(p(1|\text{x})=1/(1+e^{-\color{green}{z}})=1/(1+e^{-(\color{blue}{\beta_0}+\text{x}^T\color{blue}{\vec{\beta}})})\).
Interpretation:
The decision boundary is linear, defined by the coefficients \(\color{blue}{\beta_0}\) and \(\color{blue}{\vec{\beta}}\).
Probability of being in each class depends on the relative distance of that point to the boundary.
Works well when classes are linearly separable.
Objective: building a Logistic Regression model is equivalent to searching for parameters \(\color{blue}{\beta_0}\) and \(\color{blue}{\vec{\beta}}\) that minimize the Cross-Entropy.
The loss cannot be minimized analytically but can be minimized numerically.
Logistic Regression
Application on Auto-MPG
For our Auto-MPG dataset, we aim at predicting origin using some characteristics of the cars.
Build intuition through visualization:
We predict origin using all quantitative columns.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Building the model
X_train, X_test, y_train, y_test = train_test_split(
    df_car.select_dtypes(include="number"), df_car[['origin']])
lgit = LogisticRegression()
lgit = lgit.fit(X_train, y_train)

# Prediction
y_pred = lgit.predict(X_test)

# Accuracy
acc = np.mean(y_pred.flatten() == y_test.to_numpy().flatten())
Accuracy = 0.745.
Here, accuracy is defined by \[\text{Accuracy}=\frac{\text{Num. correctly predicted}}{\text{Num. observations}}.\]
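The same number can be obtained with scikit-learn's built-in helper, mirroring the manual computation above:

from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test.to_numpy().flatten(), y_pred)   # fraction of correctly predicted origins
print(acc)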
Logistic Regression
Summary
We introduced the basic concept of the Logistic Regression Model: \[p(1|X=\text{x})=\frac{1}{1+e^{-\color{blue}{\beta_0}-\text{x}^T\color{blue}{\vec{\beta}}}}.\]
The intuition of the model: the probability of being in class \(1\) depends on the relative distance from \(\text{x}\) to a linear boundary defined by \(\color{blue}{[\beta_0,\beta_1,\dots,\beta_d]}\).
The linear-boundary assumption may be too restrictive in practice.
The performance of the model can be improved further by