Model Refinement
(Beyond Linearity)


INF-604: Data Analysis

Lecturer: Dr. Sothea HAS

Continuing from linear models…

Interpretation of Linear Models

  • Linear Regression is an interpretable model where the influence of each input is explicit (coefficients \(\color{blue}{\beta_j}\)) in \[y=\color{blue}{\beta_0}+\color{blue}{\beta_1}x_1+\color{blue}{\beta_2}x_2+\dots+\color{blue}{\beta_d}x_d + \epsilon.\]
    • Numerical input: \(\color{blue}{\beta_j}\) is the change in \(y\) for a 1-unit increase in \(x_j\) (holding the other inputs fixed).
    • Categorical input: \(\color{blue}{\beta_j}\) represents the difference in mean of \(y\) when category \(x_j=1\) relative to the Reference Category (Baseline or the dropped category).
  • Comparing Influence (Feature Importance)
    • Raw Coefficients: Not comparable across different variables (e.g., Age vs Income).
    • Solution: In order to compare the effect of variables on the target, variables should be Standardized.
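A minimal sketch of the last point: refitting on standardized inputs makes the coefficients comparable across variables. The data below is synthetic (hypothetical Age and Income columns), not the course dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: Age (years) and Income (dollars), on very different scales
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, 200)
income = rng.uniform(20_000, 120_000, 200)
X = np.column_stack([age, income])
y = 0.5 * age + 0.0001 * income + rng.normal(0, 1, 200)

# Raw coefficients: not comparable, since their units differ
raw = LinearRegression().fit(X, y).coef_

# Standardized coefficients: change in y per 1 SD of each input, comparable
std = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_
print("raw:", raw, "standardized:", std)
```

The raw Income coefficient looks tiny only because Income is measured in dollars; after standardization both effects are on the same footing.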

Outline

  • Model Evaluation Techniques

  • Feature Engineering

  • Regularization

    • Ridge Regression
    • Lasso Regression

Model Evaluation

Out-of-sample MSE

  • A good model must not only perform well on the training data (used to build it), but also on unseen observations.

  • We should judge a model by how well it generalizes to new, unseen observations.

  • Out-of-sample Mean Squared Error (MSE): \[\color{green}{\frac{1}{n_{\text{new}}}\sum_{i=1}^{n_{\text{new}}}(y_i-\hat{y}_i)^2}.\]
  • In practice:
    • Train data \(\approx75\%-80\%\to\) for building the model.
    • Test data \(\approx20\%-25\%\to\) for testing the model.

Out-of-sample MSE

Code
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import plotly.express as px
import plotly.graph_objects as go

# `data` is the auto-mpg DataFrame loaded earlier
X = data[['weight', 'horsepower', 'acceleration', 'mpg']].copy()

# Randomly assign each row to train (~75%) or test (~25%)
shuffle_id = np.random.choice(['train', 'test'],
                              replace=True,
                              p=[0.75, 0.25],
                              size=X.shape[0])
X['type'] = shuffle_id

# Fit SLR on the training rows only
lr1 = LinearRegression().fit(X.loc[X.type == "train", ['weight']],
                             X.loc[X.type == "train", "mpg"])
y_hat = lr1.predict(X.loc[X.type == "train", ['weight']])

# Out-of-sample MSE on the held-out test rows
y_hat_test = lr1.predict(X.loc[X.type == "test", ['weight']])
test_mse = mean_squared_error(X.loc[X.type == "test", "mpg"], y_hat_test)

fig1 = px.scatter(data_frame=X,
                  x="weight",
                  y="mpg",
                  color="type",
                  color_discrete_map={
                      "train": "#e89927",
                      "test": "#3bbc35"
                  })
fig1.add_trace(go.Scatter(x=X.loc[X.type == "train", 'weight'],
                          y=y_hat,
                          name="Model built on train data",
                          line=dict(color="blue")))
fig1.update_layout(width=600, height=250, title="SLR Model: mpg vs weight")
fig1.show()

Cross-validation MSE

  • What if it’s our unlucky day?
  • Here, it’s great on training data but poor on testing data!
  • \(K\)-fold Cross-Validation MSE: \(\text{CV-MSE}=\frac{1}{K}\sum_{k=1}^K\text{MSE}_k.\)
  • Computing CV-MSE:


  • Doesn’t depend on bad splits!
  • It’s the average of different Test MSEs.
  • It estimates the MSE on new, unseen data\(^{\text{📚}}\).
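CV-MSE can be computed with scikit-learn's `cross_val_score`. A sketch on synthetic mpg-vs-weight-like data (the real auto dataset is assumed loaded elsewhere in the course code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the mpg-vs-weight data
rng = np.random.default_rng(1)
w = rng.uniform(1500, 5000, 300)
mpg = 46 - 0.0076 * w + rng.normal(0, 3, 300)
X = w.reshape(-1, 1)

# K-fold CV-MSE: the average of the K per-fold test MSEs
scores = cross_val_score(LinearRegression(), X, mpg, cv=5,
                         scoring='neg_mean_squared_error')
cv_mse = -scores.mean()
print(f"5-fold CV-MSE: {cv_mse:.3f}")
```

`scoring='neg_mean_squared_error'` returns negated MSEs (scikit-learn scores are "higher is better"), hence the minus sign.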

Cross-validation MSE

Summary

  • Cross-validation is a model evaluation technique.
  • It can be used with different metrics other than MSE, such as Accuracy, F1-score,…
  • It can prevent overfitting.
  • For SLR or MLR (without hyperparameter tuning), it can provide an estimate of Test Error.
  • For models with hyperparameters, it can be used to tune those hyperparameters (coming soon).
  • Our mpg vs weight example: 5-fold CV-MSE = 31.988 or CV-RMSE = 5.656.

Model Refinement

Feature engineering

Feature transformation

Z-score & Min-Max Scaling

  • Z-score of \(x_j\) is \(\tilde{x}_j=(x_j-\overline{x}_j)/\sigma_{x_j}\).
  • Min-Max scaling of \(x_j\) is \(\tilde{x}_j=\frac{x_j-\min_{x_j}}{\max_{x_j}-\min_{x_j}}\in [0,1]\).
  • When inputs are in different units (kg, km, dollars, …).
  • When the differences in scale are large; standardizing then allows us to compare the coefficients of the linear model.
  • When working with distance-based models or models sensitive to the scale of the data: SLR, MLR, KNN, SVM, Logistic Regression, PCA, Neural Networks…
  • Ex: Often used in image processing…
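Both transformations are available in scikit-learn; a small sketch on a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy feature on an arbitrary scale
x = np.array([[2.0], [4.0], [6.0], [8.0]])

# Z-score: subtract the mean, divide by the standard deviation
z = StandardScaler().fit_transform(x)

# Min-Max: rescale linearly into [0, 1]
m = MinMaxScaler().fit_transform(x)

print(z.ravel())  # mean 0, standard deviation 1
print(m.ravel())  # values in [0, 1]
```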

Feature engineering

Feature transformation

One-hot encoding

Code
import pandas as pd
import plotly.express as px
from gapminder import gapminder
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# One-hot encode the `continent` column (2007 rows only)
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(
    gapminder.loc[gapminder.year == 2007, ['continent']]).toarray()

# Encoded dataset with one 0/1 column per continent
X_encoded = pd.DataFrame(
    encoded_data,
    columns=[x.replace('continent_', '')
             for x in encoder.get_feature_names_out(['continent'])])

lr = LinearRegression()
lr.fit(X_encoded, gapminder.lifeExp.loc[gapminder.year == 2007])
R2 = r2_score(gapminder.lifeExp.loc[gapminder.year == 2007],
              lr.predict(X_encoded))

df_encoded = X_encoded.copy()
df_encoded['lifeExp'] = gapminder.lifeExp.loc[gapminder.year == 2007].values
fig_cont = px.box(data_frame=gapminder.loc[gapminder.year == 2007, :],
                  x="continent", y="lifeExp", color="continent")
fig_cont.update_layout(title="Life Expectancy vs Continent", height=250, width=500)
fig_cont.show()
  • Some categorical inputs are useful for building models.
  • They must be converted to numbers (e.g., one-hot encoded) before modeling.
  • Ex: continent is useful for predicting lifeExp.
    • R-squared: 0.635.

First row of the encoded dataset:

Africa    Americas  Asia      Europe    Oceania   lifeExp
0.000000  0.000000  1.000000  0.000000  0.000000  43.828000

Feature engineering

Feature Engineering

  • Predicting the target with a purely linear form of the inputs may be unrealistic!
  • More complex forms of the inputs might predict the target better!
  • Ex: mpg vs weight: \(R^2\approx 69.3\%\).
  • Now: \(\widehat{\text{mpg}}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{w}+\color{blue}{\beta_2}\text{w}^2.\)

Feature engineering

Feature Engineering

  • Now: \(\widehat{\text{mpg}}=\color{blue}{\beta_0}+\color{blue}{\beta_1}\text{w}+\color{blue}{\beta_2}\text{w}^2.\)
Code
import pandas as pd

# X is the auto-mpg DataFrame slice from earlier (weight and mpg columns)
X2 = pd.concat([X.weight, X.weight ** 2, X.mpg], axis=1)
X2.columns = ["w", "w^2", "mpg"]
X2.iloc[:3, :]
      w       w^2   mpg
0  3504  12278016  18.0
1  3693  13638249  15.0
2  3436  11806096  18.0
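Fitting the quadratic model is then an ordinary linear regression on the two columns. A sketch on synthetic data shaped like the mpg-vs-weight relationship (the real fit uses the `X2` above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for weight (w) and mpg with a quadratic relationship
rng = np.random.default_rng(2)
w = rng.uniform(1500, 5000, 300)
mpg = 60 - 0.02 * w + 2.3e-6 * w**2 + rng.normal(0, 2, 300)

# Regression on [w, w^2]: still linear in the coefficients
X2 = pd.DataFrame({"w": w, "w^2": w**2})
lr2 = LinearRegression().fit(X2, mpg)
r2 = r2_score(mpg, lr2.predict(X2))
print(f"R^2 of the quadratic fit: {r2:.3f}")
```

The model stays "linear" because it is linear in \(\beta_0,\beta_1,\beta_2\); only the features are nonlinear in \(w\).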

Overfitting

Challenge in every model

  • Overfitting happens when a model learns the training data too well, capturing noise and fluctuations rather than the underlying pattern.
  • It fits the training data almost perfectly, but fails to generalize to new, unseen data.
  • Complex models (e.g., high-degree polynomial features) often overfit the data.

Overcoming overfitting

\(K\)-fold Cross-Validation

  • It ensures that the model performs well on different subsets.
  • The most common technique to overcome overfitting.

Tuning Polynomial Degree Using \(K\)-fold Cross-Validation

import pandas as pd
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import cross_val_score
# Data: `X` and `data` are the auto-mpg objects from earlier
X, y = X[["weight"]], data['mpg']
# List of all degrees to search over
degree = list(range(1, 11))
# List to store the CV-MSE of each degree
loss = []
for deg in degree:
    # Build polynomial features w, w^2, ..., w^deg by hand
    X_poly = pd.concat([X**d for d in range(1, deg+1)], axis=1)
    X_poly.columns = [f"w^{d}" for d in range(1, deg+1)]
    model = LR()
    score = -cross_val_score(model, X_poly, y, cv=20,
                             scoring='neg_mean_squared_error').mean()
    loss.append(score)
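After the loop, the best degree is simply the one minimizing the CV-MSE. A sketch, with hypothetical loss values standing in for the loop's output:

```python
import numpy as np

# Hypothetical CV-MSE values standing in for the `loss` list computed above
degree = list(range(1, 11))
loss = [24.0, 19.5, 19.7, 20.1, 21.0, 23.5, 30.2, 45.0, 80.3, 150.9]

# Select the degree with the smallest cross-validated MSE
best_deg = degree[int(np.argmin(loss))]
print(f"Best polynomial degree: {best_deg}")  # → Best polynomial degree: 2
```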

Overcoming overfitting

\(K\)-fold Cross-Validation

  • It ensures that the model performs well on different subsets.
  • The most common technique to overcome overfitting.

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation

Overcoming overfitting

Regularization

  • Another approach is to control the magnitude of the coefficients.
  • It often works well for SLR, MLR and Polynomial Regression…

Overcoming overfitting

Regularization: Ridge Regression

  • Model: \(\hat{y}=\beta_0+\beta_1x_1+\dots+\beta_dx_d\),

  • Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\) (the intercept \(\beta_0\) is left unpenalized): \[{\cal L}_{\text{ridge}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=1}^{d}\beta_j^2}_{\text{Magnitude}}}.\]

  • Recall: SLR & MLR seek to minimize only RSS.

Overcoming overfitting

Regularization: Ridge Regression

  • Large \(\color{green}{\alpha}\Rightarrow\) strong penalty \(\Rightarrow\) small \(\vec{\beta}\).
  • Small \(\color{green}{\alpha}\Rightarrow\) weak penalty \(\Rightarrow\) freer \(\vec{\beta}\).
  • 🔑 Objective: Learn the best \(\color{green}{\alpha}>0\).
  • Loss: \({\cal L}_{\text{ridge}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=1}^{d}\beta_j^2}_{\text{Magnitude}}}.\)

Overcoming overfitting

Regularization: Ridge Regression

  • How to find a suitable regularization strength \(\color{green}{\alpha}\)?

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Data: degree-5 polynomial features of weight
deg = 5
X, y = X[["weight"]], data['mpg']
X_poly = pd.concat([X**d for d in range(1, deg+1)], axis=1)
X_poly.columns = [f"w^{d}" for d in range(1, deg+1)]
# List of all regularization strengths to search over
alphas = list(np.linspace(0.01, 3, 30)) + list(np.linspace(3.1, 50000, 30))
# CV-MSE and fitted coefficients for each alpha
loss = []
coefficients = {f'alpha={alpha}': [] for alpha in alphas}
for alp in alphas:
    model = Ridge(alpha=alp)
    score = -cross_val_score(model, X_poly, y, cv=5,
                             scoring='neg_mean_squared_error').mean()
    loss.append(score)
    # Refit on the full data to record the coefficient path
    model.fit(X_poly, y)
    coefficients[f'alpha={alp}'] = model.coef_
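scikit-learn can also run this search internally with `RidgeCV`. A self-contained sketch on synthetic data (the real `X_poly` and `y` come from the auto data above):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data with a known linear signal
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0, 1, 200)

# RidgeCV performs the alpha search by cross-validation internally
alphas = np.logspace(-2, 3, 30)
ridge = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=5).fit(X, y)
print(f"Selected alpha: {ridge.alpha_:.4f}")
```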

Overcoming overfitting

Regularization: Ridge Regression

  • How to find a suitable regularization strength \(\color{green}{\alpha}\)?

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation

Overcoming overfitting

Regularization: Ridge Regression

Pros

  • It works well when there are inputs that are approximately linearly related with the target.
  • It helps stabilize the estimates when inputs are highly correlated.
  • It can prevent overfitting.
  • It is effective when the number of inputs exceeds the number of observations.

Cons

  • It does not work well when the input-output relationships are highly non-linear.
  • It may introduce bias into the coefficient estimates.
  • It does not perform feature selection.
  • It can be challenging for interpretation.

Overcoming overfitting

Regularization: Lasso Regression

  • Model: \(\hat{y}=\beta_0+\beta_1x_1+\dots+\beta_dx_d\),
  • Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\) (the intercept \(\beta_0\) is left unpenalized): \[{\cal L}_{\text{lasso}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=1}^{d}|\beta_j|}_{\text{Magnitude}}}.\]

Overcoming overfitting

Regularization: Lasso Regression

  • Large \(\color{green}{\alpha}\Rightarrow\) strong penalty \(\Rightarrow\) small \(\vec{\beta}\).
  • Small \(\color{green}{\alpha}\Rightarrow\) weak penalty \(\Rightarrow\) freer \(\vec{\beta}\).
  • 🔑 Objective: Learn the best \(\color{green}{\alpha}>0\).
  • Loss: \({\cal L}_{\text{lasso}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=1}^{d}|\beta_j|}_{\text{Magnitude}}}.\)

Overcoming overfitting

Regularization: Lasso Regression

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation
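A sketch of tuning \(\alpha\) with `LassoCV` on synthetic data; note the standardization step, since Lasso is scale-sensitive:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 inputs, only the first 3 actually matter
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
beta = np.array([4.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ beta + rng.normal(0, 1, 200)

# Lasso is sensitive to scale, so standardize the inputs first
X_std = StandardScaler().fit_transform(X)

# LassoCV tunes alpha by K-fold cross-validation internally
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
print(f"Selected alpha: {lasso.alpha_:.4f}")
print(f"Non-zero coefficients: {int(np.sum(lasso.coef_ != 0))}")
```

The three informative inputs keep large coefficients, while the penalty pushes the useless ones toward exactly zero.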

Overcoming overfitting

Regularization: Lasso Regression

Pros

  • Lasso inherently performs feature selection as the regularization parameter \(\alpha\) increases (coefficients of less important variables are forced to exactly \(0\)).
  • It works well when there are many inputs (high-dimensional data), some of which are highly correlated with the target.
  • It can handle collinearities (many redundant inputs).
  • It can prevent overfitting and offers high interpretability.

Cons

  • It does not work well when the input-output relationships are highly non-linear.
  • It may introduce bias into the coefficient estimates.
  • It is sensitive to the scale of the data, so proper scaling of predictors is crucial before applying the method.

🥳 Yeahhh! Party Time…. 🥂









Any questions?