Model Refinement
(Beyond Linearity)


INF-604: Data Analysis

Lecturer: Dr. Sothea HAS

Continuing from linear models…

Interpretation of Linear Models

  • Linear Regression is an interpretable model where the influence of each input is explicit (coefficients \(\color{blue}{\beta_j}\)) in \[y=\color{blue}{\beta_0}+\color{blue}{\beta_1}x_1+\color{blue}{\beta_2}x_2+\dots+\color{blue}{\beta_d}x_d + \epsilon.\]
    • Numerical input: \(\color{blue}{\beta_j}\) is the change in \(y\) for a 1-unit increase in \(x_j\) (holding the other inputs fixed).
    • Categorical input: \(\color{blue}{\beta_j}\) represents the difference in mean of \(y\) when category \(x_j=1\) relative to the Reference Category (Baseline or the dropped category).
  • Comparing Influence (Feature Importance)
    • Raw Coefficients: Not comparable across different variables (e.g., Age vs Income).
    • Solution: In order to compare the effect of variables on the target, variables should be Standardized.
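A minimal sketch of the last point: refitting on standardized inputs makes the coefficients comparable across variables. The data below is synthetic (hypothetical Age and Income columns), not the course dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: Age (years) and Income (dollars), on very different scales
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, 200)
income = rng.uniform(20_000, 120_000, 200)
X = np.column_stack([age, income])
y = 0.5 * age + 0.0001 * income + rng.normal(0, 1, 200)

# Raw coefficients: not comparable, since their units differ
raw = LinearRegression().fit(X, y).coef_

# Standardized coefficients: change in y per 1 SD of each input, comparable
std = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_
print("raw:", raw, "standardized:", std)
```

The raw Income coefficient looks tiny only because Income is measured in dollars; after standardization both effects are on the same footing.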

Outline

  • Model Evaluation Techniques

  • Feature Engineering

  • Regularization

    • Ridge Regression
    • Lasso Regression

Model Evaluation

Out-of-sample MSE

  • A good model must not only perform well on the training data (used to build it), but also on unseen observations.

  • We should judge a model by how well it generalizes to new, unseen observations.

  • Out-of-sample Mean Squared Error (MSE): \[\color{green}{\frac{1}{n_{\text{new}}}\sum_{i=1}^{n_{\text{new}}}(y_i-\hat{y}_i)^2}.\]
  • In practice:
    • Train data \(\approx75\%-80\%\to\) for building the model.
    • Test data \(\approx20\%-25\%\to\) for testing the model.

Out-of-sample MSE

Code
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import plotly.express as px
import plotly.graph_objects as go

# `data` is the auto-mpg DataFrame loaded earlier
X = data[['weight', 'horsepower', 'acceleration', 'mpg']].copy()

# Randomly assign each row to train (~75%) or test (~25%)
shuffle_id = np.random.choice(['train', 'test'],
                              replace=True,
                              p=[0.75, 0.25],
                              size=X.shape[0])
X['type'] = shuffle_id

# Fit SLR on the training rows only
lr1 = LinearRegression().fit(X.loc[X.type == "train", ['weight']],
                             X.loc[X.type == "train", "mpg"])
y_hat = lr1.predict(X.loc[X.type == "train", ['weight']])

# Out-of-sample MSE on the held-out test rows
y_hat_test = lr1.predict(X.loc[X.type == "test", ['weight']])
test_mse = mean_squared_error(X.loc[X.type == "test", "mpg"], y_hat_test)

fig1 = px.scatter(data_frame=X,
                  x="weight",
                  y="mpg",
                  color="type",
                  color_discrete_map={
                      "train": "#e89927",
                      "test": "#3bbc35"
                  })
fig1.add_trace(go.Scatter(x=X.loc[X.type == "train", 'weight'],
                          y=y_hat,
                          name="Model built on train data",
                          line=dict(color="blue")))
fig1.update_layout(width=600, height=250, title="SLR Model: mpg vs weight")
fig1.show()

Cross-validation MSE

  • What if it’s our unlucky day?
  • Here, it’s great on training data but poor on testing data!
  • \(K\)-fold Cross-Validation MSE: \(\text{CV-MSE}=\frac{1}{K}\sum_{k=1}^K\text{MSE}_k.\)
  • Computing CV-MSE:


  • Doesn’t depend on bad splits!
  • It’s the average of different Test MSEs.
  • It estimates the MSE on new, unseen data\(^{\text{📚}}\).
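CV-MSE can be computed with scikit-learn's `cross_val_score`. A sketch on synthetic mpg-vs-weight-like data (the real auto dataset is assumed loaded elsewhere in the course code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the mpg-vs-weight data
rng = np.random.default_rng(1)
w = rng.uniform(1500, 5000, 300)
mpg = 46 - 0.0076 * w + rng.normal(0, 3, 300)
X = w.reshape(-1, 1)

# K-fold CV-MSE: the average of the K per-fold test MSEs
scores = cross_val_score(LinearRegression(), X, mpg, cv=5,
                         scoring='neg_mean_squared_error')
cv_mse = -scores.mean()
print(f"5-fold CV-MSE: {cv_mse:.3f}")
```

`scoring='neg_mean_squared_error'` returns negated MSEs (scikit-learn scores are "higher is better"), hence the minus sign.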

Cross-validation MSE

Summary

  • Cross-validation is a model evaluation technique.
  • It can be used with different metrics other than MSE, such as Accuracy, F1-score,…
  • It can prevent overfitting.
  • For SLR or MLR (without hyperparameter tuning), it can provide an estimate of Test Error.
  • For models with hyperparameters, it can be used to tune those hyperparameters (coming soon).
  • Our mpg vs weight example: 5-fold CV-MSE = 31.988 or CV-RMSE = 5.656.

Model Refinement

Feature engineering

Feature transformation

Z-score & Min-Max Scaling

  • Z-score of \(x_j\) is \(\tilde{x}_j=(x_j-\overline{x}_j)/\sigma_{x_j}\).
  • Min-Max scaling of \(x_j\) is \(\tilde{x}_j=\frac{x_j-\min_{x_j}}{\max_{x_j}-\min_{x_j}}\in [0,1]\).
  • When inputs are in different units (kg, km, dollars, …).
  • When the differences in scale are large; standardizing then allows us to compare the coefficients of the linear model.
  • When working with distance-based models or models sensitive to the scale of the data: SLR, MLR, KNN, SVM, Logistic Regression, PCA, Neural Networks…
  • Ex: Often used in image processing…
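Both transformations are available in scikit-learn; a small sketch on a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy feature on an arbitrary scale
x = np.array([[2.0], [4.0], [6.0], [8.0]])

# Z-score: subtract the mean, divide by the standard deviation
z = StandardScaler().fit_transform(x)

# Min-Max: rescale linearly into [0, 1]
m = MinMaxScaler().fit_transform(x)

print(z.ravel())  # mean 0, standard deviation 1
print(m.ravel())  # values in [0, 1]
```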

Feature engineering

Feature transformation

One-hot encoding

Code
import pandas as pd
import plotly.express as px
from gapminder import gapminder
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# One-hot encode the `continent` column (2007 rows only)
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(
    gapminder.loc[gapminder.year == 2007, ['continent']]).toarray()

# Encoded dataset with one 0/1 column per continent
X_encoded = pd.DataFrame(
    encoded_data,
    columns=[x.replace('continent_', '')
             for x in encoder.get_feature_names_out(['continent'])])

lr = LinearRegression()
lr.fit(X_encoded, gapminder.lifeExp.loc[gapminder.year == 2007])
R2 = r2_score(gapminder.lifeExp.loc[gapminder.year == 2007],
              lr.predict(X_encoded))

df_encoded = X_encoded.copy()
df_encoded['lifeExp'] = gapminder.lifeExp.loc[gapminder.year == 2007].values
fig_cont = px.box(data_frame=gapminder.loc[gapminder.year == 2007, :],
                  x="continent", y="lifeExp", color="continent")
fig_cont.update_layout(title="Life Expectancy vs Continent", height=250, width=500)
fig_cont.show()
  • Some categorical inputs are useful for building models.
  • They must be converted to numbers (e.g., one-hot encoded) before modeling.
  • Ex: continent is useful for predicting lifeExp.
    • R-squared: 0.635.

First row of the encoded dataset:

Africa    Americas  Asia      Europe    Oceania   lifeExp
0.000000  0.000000  1.000000  0.000000  0.000000  43.828000

Feature engineering

Feature Engineering

  • Predicting the target with a purely linear form of the inputs may be unrealistic!
  • More complex forms of the inputs might predict the target better!
  • Ex: mpg vs weight: \(R^2\approx 69.3\%\).
  • Now: \(\widehat{\text{mpg}}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{w}+\color{blue}{\beta_2}\text{w}^2.\)

Feature engineering

Feature Engineering

  • Now: \(\widehat{\text{mpg}}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{w}+\color{blue}{\beta_2}\text{w}^2.\)
Code
import pandas as pd

# X is the auto-mpg DataFrame slice from earlier (weight and mpg columns)
X2 = pd.concat([X.weight, X.weight ** 2, X.mpg], axis=1)
X2.columns = ["w", "w^2", "mpg"]
X2.iloc[:3, :]
      w       w^2   mpg
0  3504  12278016  18.0
1  3693  13638249  15.0
2  3436  11806096  18.0
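Fitting the quadratic model is then an ordinary linear regression on the two columns. A sketch on synthetic data shaped like the mpg-vs-weight relationship (the real fit uses the `X2` above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for weight (w) and mpg with a quadratic relationship
rng = np.random.default_rng(2)
w = rng.uniform(1500, 5000, 300)
mpg = 60 - 0.02 * w + 2.3e-6 * w**2 + rng.normal(0, 2, 300)

# Regression on [w, w^2]: still linear in the coefficients
X2 = pd.DataFrame({"w": w, "w^2": w**2})
lr2 = LinearRegression().fit(X2, mpg)
r2 = r2_score(mpg, lr2.predict(X2))
print(f"R^2 of the quadratic fit: {r2:.3f}")
```

The model stays "linear" because it is linear in \(\beta_0,\beta_1,\beta_2\); only the features are nonlinear in \(w\).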

Overfitting

Challenge in every model

  • Overfitting happens when a model learns the training data too well, capturing noise and fluctuations rather than the underlying pattern.
  • It fits the training data almost perfectly, but fails to generalize to new, unseen data.
  • Complex models (e.g., high-degree polynomial features) often overfit the data.

Overcoming overfitting

\(K\)-fold Cross-Validation

  • It ensures that the model performs well on different subsets.
  • The most common technique to overcome overfitting.

Tuning Polynomial Degree Using \(K\)-fold Cross-Validation

import pandas as pd
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import cross_val_score
# Data: `X` and `data` are the auto-mpg objects from earlier
X, y = X[["weight"]], data['mpg']
# List of all degrees to search over
degree = list(range(1, 11))
# List to store the CV-MSE of each degree
loss = []
for deg in degree:
    # Build polynomial features w, w^2, ..., w^deg by hand
    X_poly = pd.concat([X**d for d in range(1, deg+1)], axis=1)
    X_poly.columns = [f"w^{d}" for d in range(1, deg+1)]
    model = LR()
    score = -cross_val_score(model, X_poly, y, cv=20,
                             scoring='neg_mean_squared_error').mean()
    loss.append(score)
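After the loop, the best degree is simply the one minimizing the CV-MSE. A sketch, with hypothetical loss values standing in for the loop's output:

```python
import numpy as np

# Hypothetical CV-MSE values standing in for the `loss` list computed above
degree = list(range(1, 11))
loss = [24.0, 19.5, 19.7, 20.1, 21.0, 23.5, 30.2, 45.0, 80.3, 150.9]

# Select the degree with the smallest cross-validated MSE
best_deg = degree[int(np.argmin(loss))]
print(f"Best polynomial degree: {best_deg}")  # → Best polynomial degree: 2
```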

Overcoming overfitting

\(K\)-fold Cross-Validation

  • It ensures that the model performs well on different subsets.
  • The most common technique to overcome overfitting.

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation

Overcoming overfitting

Regularization

  • Another approach is to control the magnitude of the coefficients.
  • It often works well for SLR, MLR and Polynomial Regression…

Overcoming overfitting

Regularization: Ridge Regression

  • Model: \(\hat{y}=\beta_0+\beta_1x_1+\dots+\beta_dx_d\),

  • Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\) (the intercept \(\beta_0\) is left unpenalized): \[{\cal L}_{\text{ridge}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=1}^{d}\beta_j^2}_{\text{Magnitude}}}.\]

  • Recall: SLR & MLR seek to minimize only RSS.

Overcoming overfitting

Regularization: Ridge Regression

  • Large \(\color{green}{\alpha}\Rightarrow\) strong penalty \(\Rightarrow\) small \(\vec{\beta}\).
  • Small \(\color{green}{\alpha}\Rightarrow\) weak penalty \(\Rightarrow\) freer \(\vec{\beta}\).
  • 🔑 Objective: Learn the best \(\color{green}{\alpha}>0\).
  • Loss: \({\cal L}_{\text{ridge}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=1}^{d}\beta_j^2}_{\text{Magnitude}}}.\)

Overcoming overfitting

Regularization: Ridge Regression

  • How to find a suitable regularization strength \(\color{green}{\alpha}\)?

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Data: degree-5 polynomial features of weight
deg = 5
X, y = X[["weight"]], data['mpg']
X_poly = pd.concat([X**d for d in range(1, deg+1)], axis=1)
X_poly.columns = [f"w^{d}" for d in range(1, deg+1)]
# List of all regularization strengths to search over
alphas = list(np.linspace(0.01, 3, 30)) + list(np.linspace(3.1, 50000, 30))
# CV-MSE and fitted coefficients for each alpha
loss = []
coefficients = {f'alpha={alpha}': [] for alpha in alphas}
for alp in alphas:
    model = Ridge(alpha=alp)
    score = -cross_val_score(model, X_poly, y, cv=5,
                             scoring='neg_mean_squared_error').mean()
    loss.append(score)
    # Refit on the full data to record the coefficient path
    model.fit(X_poly, y)
    coefficients[f'alpha={alp}'] = model.coef_
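scikit-learn can also run this search internally with `RidgeCV`. A self-contained sketch on synthetic data (the real `X_poly` and `y` come from the auto data above):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data with a known linear signal
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0, 1, 200)

# RidgeCV performs the alpha search by cross-validation internally
alphas = np.logspace(-2, 3, 30)
ridge = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=5).fit(X, y)
print(f"Selected alpha: {ridge.alpha_:.4f}")
```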

Overcoming overfitting

Regularization: Ridge Regression

  • How to find a suitable regularization strength \(\color{green}{\alpha}\)?

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation

Overcoming overfitting

Regularization: Ridge Regression

Pros

  • It works well when there are inputs that are approximately linearly related with the target.
  • It helps stabilize the estimates when inputs are highly correlated.
  • It can prevent overfitting.
  • It is effective when the number of inputs exceeds the number of observations.

Cons

  • It does not work well when the input-output relationships are highly non-linear.
  • It may introduce bias into the coefficient estimates.
  • It does not perform feature selection.
  • It can be challenging for interpretation.

Overcoming overfitting

Regularization: Lasso Regression

  • Model: \(\hat{y}=\beta_0+\beta_1x_1+\dots+\beta_dx_d\),
  • Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\) (the intercept \(\beta_0\) is left unpenalized): \[{\cal L}_{\text{lasso}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=1}^{d}|\beta_j|}_{\text{Magnitude}}}.\]

Overcoming overfitting

Regularization: Lasso Regression

  • Large \(\color{green}{\alpha}\Rightarrow\) strong penalty \(\Rightarrow\) small \(\vec{\beta}\).
  • Small \(\color{green}{\alpha}\Rightarrow\) weak penalty \(\Rightarrow\) freer \(\vec{\beta}\).
  • 🔑 Objective: Learn the best \(\color{green}{\alpha}>0\).
  • Loss: \({\cal L}_{\text{lasso}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=1}^{d}|\beta_j|}_{\text{Magnitude}}}.\)

Overcoming overfitting

Regularization: Lasso Regression

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation
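A sketch of tuning \(\alpha\) with `LassoCV` on synthetic data; note the standardization step, since Lasso is scale-sensitive:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 inputs, only the first 3 actually matter
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
beta = np.array([4.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ beta + rng.normal(0, 1, 200)

# Lasso is sensitive to scale, so standardize the inputs first
X_std = StandardScaler().fit_transform(X)

# LassoCV tunes alpha by K-fold cross-validation internally
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
print(f"Selected alpha: {lasso.alpha_:.4f}")
print(f"Non-zero coefficients: {int(np.sum(lasso.coef_ != 0))}")
```

The three informative inputs keep large coefficients, while the penalty pushes the useless ones toward exactly zero.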

Overcoming overfitting

Regularization: Lasso Regression

Pros

  • Lasso inherently performs feature selection as the regularization parameter \(\alpha\) increases (coefficients of less important variables are forced to exactly \(0\)).
  • It works well when there are many inputs (high-dimensional data), some of which are highly correlated with the target.
  • It can handle collinearities (many redundant inputs).
  • It can prevent overfitting and offers high interpretability.

Cons

  • It does not work well when the input-output relationships are highly non-linear.
  • It may introduce bias into the coefficient estimates.
  • It is sensitive to the scale of the data, so proper scaling of predictors is crucial before applying the method.

🥳 Yeahhh! Party Time…. 🥂









Any questions?