Model Evaluation & Refinement


ITM-370: Data Analytics

Lecturer: Dr. Sothea Has

🗺️ Content

Model Evaluation

  • Out-of-sample MSE
  • \(K\)-fold Cross-Validation MSE

Model Refinement

  • Feature Engineering
  • Overfitting
  • Overcoming overfitting

Model Evaluation

Out-of-sample MSE

  • A good model must not only perform well on the training data (used to build it), but also on unseen observations.

  • We should judge a model based on how it generalizes on new unseen observations.

  • Out-of-sample Mean Squared Error (MSE): \[\color{green}{\frac{1}{n_{\text{new}}}\sum_{i=1}^{n_{\text{new}}}(y_i-\hat{y}_i)^2}.\]
  • In practice:
    • Train data \(\approx75\%-80\%\to\) for building the model.
    • Test data \(\approx20\%-25\%\to\) for testing the model.

Code
import pyreadr
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error

# Load the marketing data and randomly split it into train (~75%) and test (~25%)
market = pyreadr.read_r("./data/marketing.rda")
market = market['marketing']
shuffle_id = np.random.choice(['train', 'test'],
                              replace=True,
                              p=[0.75, 0.25],
                              size=market.shape[0])
market['type'] = shuffle_id

# Fit the SLR model on the training part only
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression().fit(market.loc[market.type == "train", ['youtube']],
                             market.loc[market.type == "train", "sales"])

# Predict the held-out observations and compute the out-of-sample MSE
y_hat = lr1.predict(market.loc[market.type == "test", ['youtube']])
test_mse = mean_squared_error(market.loc[market.type == "test", "sales"], y_hat)

import plotly.express as px
import plotly.graph_objects as go
fig1 = px.scatter(data_frame=market,
                  x="youtube",
                  y="sales",
                  color="type",
                  color_discrete_map={
                      "train": "#e89927",
                      "test": "#3bbc35"
                  })
fig1.add_trace(go.Scatter(x=market.loc[market.type == "test", 'youtube'],
                          y=y_hat,
                          mode="lines",
                          name="Model built on train data",
                          line=dict(color="#e89927")))

fig1.update_layout(width=600, height=250, title="SLR Model: Sales vs Youtube")
fig1.show()

Cross-validation MSE

  • What if it’s our unlucky day? A single random split can be misleading: the model may look great on the training data yet perform poorly on the test data.
  • \(K\)-fold Cross-Validation MSE averages the test MSEs over \(K\) different splits: \(\text{CV-MSE}=\frac{1}{K}\sum_{k=1}^K\text{MSE}_k,\) where \(\text{MSE}_k\) is the test MSE on the \(k\)-th fold.
  • Computing CV-MSE: see the sketch below.

  • It doesn’t depend on one unlucky split!
  • It’s the average of \(K\) different test MSEs.
  • It estimates the MSE on new unseen data\(^{\text{📚}}\).
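A minimal sketch of computing the 5-fold CV-MSE with scikit-learn, assuming the market DataFrame loaded earlier:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold CV-MSE for the SLR model sales ~ youtube
X, y = market[["youtube"]], market["sales"]
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
cv_mse = -scores.mean()      # average of the 5 per-fold test MSEs
cv_rmse = cv_mse ** 0.5
print(f"5-fold CV-MSE = {cv_mse:.2f}, CV-RMSE = {cv_rmse:.3f}")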

Cross-validation MSE

Summary

  • Cross-validation is a model evaluation technique.
  • It can be used with different metrics other than MSE, such as Accuracy, F1-score,…
  • It can prevent overfitting.
  • For SLR or MLR (without hyperparameter tuning), it can provide an estimate of Test Error.
  • For models with hyperparameters, it can be used to tune those hyperparameters (coming soon).
  • Our sales vs youtube example: 5-fold CV-MSE = 21.86 or CV-RMSE = 4.675.

Model Refinement

Feature engineering

Missing values & outliers

  • Data of \(4\)-\(7\) year-old kids (the ages are presumably recorded in months; the zeros stand in for missing Height and Weight measurements):

    Gender  Age  Height  Weight
    F       68      0      20
    F       68      0      18
    F       65    105       0
    F       63      0      15
    F       68    112       0
    F       66    106       0
  • Missing values are often represented by NA (nan in Python).

  • Question: how do we handle them?

  • Answer: we should at least know what kind of missing values they are: MCAR, MAR or MNAR?

Feature engineering

Missing values & outliers

Missing Completely At Random (MCAR)

  • They are randomly missing.
  • Easy to handle with imputation or dropping methods (see the sketch after this list).
  • They don’t introduce bias.
  • Ex: The values are just randomly missing due to human or technical errors.
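A minimal sketch of both options on a hypothetical toy frame (the column names are invented for illustration):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with Height values missing completely at random
df = pd.DataFrame({"Height": [105, np.nan, 112, 106, np.nan],
                   "Weight": [20, 18, 15, 17, 16]})

df_drop = df.dropna()                         # option 1: drop incomplete rows
imputer = SimpleImputer(strategy="mean")      # option 2: mean imputation
df_imp = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)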

Feature engineering

Missing values & outliers

Missing At Random (MAR)

  • The missingness is related to other variables.
  • Model-based imputation often works well: SLR, MLR, KNN (see the sketch after this list).
  • Ex: Weights are often missing among women in a survey if it’s optional.
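A minimal KNN-imputation sketch, assuming a numeric DataFrame df with MAR gaps (such as the toy frame above); each gap is filled from the most similar complete rows:

import pandas as pd
from sklearn.impute import KNNImputer

# Fill each missing entry from its 5 nearest rows (distances use the observed columns)
imputer = KNNImputer(n_neighbors=5)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)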

Feature engineering

Missing values & outliers

Missing Not At Random (MNAR)

  • These are the trickiest, as the missingness is related to the missing values themselves.
  • It may require domain-specific knowledge or advanced techniques (more data, external info…).
  • Ex: Very high or very low salaries are often missing from a survey if it’s optional.

Feature engineering

Missing values & outliers

Outliers

  • Data points that deviate significantly from the majority of observations in a dataset.
  • They can influence our analyses, in ways that are insightful or problematic!
  • We can hunt them down using:
    • Graphs: scatterplots, boxplots or histograms…
    • The IQR rule: outliers often fall outside \([\text{Q}_1-1.5\,\text{IQR},\text{Q}_3+1.5\,\text{IQR}]\) (see the sketch after this list).
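A minimal sketch of the IQR rule on a numeric pandas Series (the function name iqr_outliers is hypothetical):

import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    # Boolean mask of points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Ex: flag unusually large advertising budgets
# outliers = market.loc[iqr_outliers(market["youtube"])]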

Feature engineering

Feature transformation

Z-score & Min-Max Scaling

  • Z-score of \(x_j\) is \(\tilde{x}_j=(x_j-\overline{x}_j)/\sigma_{x_j}\).
  • Min-Max scaling of \(x_j\) is \(\tilde{x}_j=\frac{x_j-\min_{x_j}}{\max_{x_j}-\min_{x_j}}\in [0,1]\).
  • When inputs are of different units (kg, km, dollars…).
  • When the differences in scales are too large.
  • When working with distance-based models or models that are sensitive to the scale of the data: SLR, MLR, KNN, SVM, Logistic Regression, PCA, Neural Networks…
  • Ex: often used in image processing… (see the sketch after this list).
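A minimal scikit-learn sketch, assuming X is a numeric feature matrix such as market[["youtube"]]:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_z = StandardScaler().fit_transform(X)     # z-score: (x - mean) / std
X_mm = MinMaxScaler().fit_transform(X)      # min-max: maps each column to [0, 1]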

Feature engineering

Feature transformation

One-hot encoding

Code
import pandas as pd
import plotly.express as px
from gapminder import gapminder
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the continent column (2007 data only)
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(gapminder.loc[gapminder.year == 2007, ['continent']]).toarray()

# Encoded dataset: one 0/1 column per continent
X_encoded = pd.DataFrame(encoded_data,
                         columns=[x.replace('continent_', '') for x in encoder.get_feature_names_out(['continent'])])

# Regress life expectancy on the encoded continents
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
lr = LinearRegression()
lr.fit(X_encoded, gapminder.lifeExp.loc[gapminder.year == 2007])
R2 = r2_score(gapminder.lifeExp.loc[gapminder.year == 2007], lr.predict(X_encoded))

df_encoded = X_encoded.copy()
df_encoded['lifeExp'] = gapminder.lifeExp.loc[gapminder.year == 2007].values
fig_cont = px.box(data_frame=gapminder.loc[gapminder.year == 2007, :],
                  x="continent", y="lifeExp", color="continent")
fig_cont.update_layout(title="Life Expectancy vs Continent", height=250, width=500)
fig_cont.show()
  • Some categorical inputs are useful for building models.
  • They have to be converted to numbers first.
  • Ex: continent is useful for predicting lifeExp.
    • R-squared: 0.624.

First row of the encoded dataset:

Africa    Americas  Asia      Europe    Oceania   lifeExp
0.000000  0.000000  1.000000  0.000000  0.000000  43.828000

Feature engineering

Feature transformation

Polynomial features

  • Predicting the target with a purely linear form of the inputs may be unrealistic!
  • More complicated forms of the inputs might predict the target better!
  • Ex: sales vs youtube: \(R^2\approx 61\%\).
  • Now: \(\widehat{\text{sales}}=\beta_0+\beta_1\text{YT}+\beta_2\text{YT}^2\)

Code
# Add a squared youtube term as a second feature
market2 = pd.concat([market.youtube, market.youtube ** 2, market.sales], axis=1)
market2.columns = ["YT", "YT^2", "Sales"]
market2.iloc[:3, :]
       YT        YT^2  Sales
0  276.12  76242.2544  26.52
1   53.40   2851.5600  12.48
2   20.64    426.0096  11.16
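A minimal sketch of fitting the quadratic model and checking its in-sample fit, assuming market2 from the block above:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fit sales on YT and YT^2, then compare R^2 against the linear fit (~0.61)
lr2 = LinearRegression().fit(market2[["YT", "YT^2"]], market2["Sales"])
print(r2_score(market2["Sales"], lr2.predict(market2[["YT", "YT^2"]])))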


Overfitting

Challenge in every model

  • Overfitting happens when a model learns the training data too well, capturing noise and fluctuations rather than the underlying pattern.
  • It fits the training data almost perfectly, but fails to generalize to new, unseen data.
  • Complex models (high-degree poly. features) often overfit the data.

Overcoming overfitting

\(K\)-fold Cross-Validation

  • It checks that the model performs well across different subsets of the data.
  • It is the most common technique for overcoming overfitting.

Tuning Polynomial degree Using \(K\)-fold Cross-Validation

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import cross_val_score
# Data
X, y = market[["youtube"]], market['sales']
# List of all degrees to search over
degree = list(range(1, 11))
# List to store the CV-MSE of each degree
loss = []
for deg in degree:
    pf = PolynomialFeatures(degree=deg)
    X_poly = pf.fit_transform(X)
    model = LR()
    score = -cross_val_score(model, X_poly, y, cv=5, 
                scoring='neg_mean_squared_error').mean()
    loss.append(score)
# Pick the degree with the smallest CV-MSE
best_degree = degree[int(np.argmin(loss))]

Overcoming overfitting

Regularization

  • Another approach is to control the magnitude of the coefficients.
  • It often works well for SLR, MLR and Polynomial Regression…

Overcoming overfitting

Regularization: Ridge Regression

  • Model: \(\hat{y}=\beta_0+\beta_1x_1+\dots+\beta_dx_d\),

  • Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\): \[{\cal L}_{\text{ridge}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}\beta_j^2}_{\text{Magnitude}}}.\]

  • Recall: SLR & MLR seek to minimize only the RSS (a direct transcription of the ridge loss is sketched below).
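As a reading aid, a minimal NumPy transcription of the ridge loss above, assuming the design matrix X1 already carries a leading column of ones (so \(\beta_0\) is penalized, exactly as the sum from \(j=0\) indicates):

import numpy as np

def ridge_loss(beta, X1, y, alpha):
    # RSS plus alpha times the squared magnitude of all coefficients
    resid = y - X1 @ beta
    return resid @ resid + alpha * (beta @ beta)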

Overcoming overfitting

Regularization: Ridge Regression

  • Large \(\color{green}{\alpha}\Rightarrow\) strong penalty \(\Rightarrow\) small \(\vec{\beta}\).
  • Small \(\color{green}{\alpha}\Rightarrow\) weak penalty \(\Rightarrow\) freer \(\vec{\beta}\).
  • 🔑 Objective: Learn the best \(\color{green}{\alpha}>0\).
  • Loss: \({\cal L}_{\text{ridge}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}\beta_j^2}_{\text{Magnitude}}}.\)

Overcoming overfitting

Regularization: Ridge Regression

How to find a suitable regularization strength \(\color{green}{\alpha}\)?

Overcoming overfitting

Regularization: Ridge Regression

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Data: degree-8 polynomial features of youtube
X, y = market[["youtube"]], market['sales']
poly = PolynomialFeatures(degree=8)
X_poly = poly.fit_transform(X)
# List of all regularization strengths to search over
alphas = list(np.linspace(0.01, 3, 30)) + list(np.linspace(3.1, 20000, 30))
# Containers for the CV-MSE and coefficients at each alpha
loss = []
coefficients = {f'alpha={alpha}': [] for alpha in alphas}
for alp in alphas:
    model = Ridge(alpha=alp)
    score = -cross_val_score(model, X_poly, y, cv=5, 
                scoring='neg_mean_squared_error').mean()
    loss.append(score)
    # Refit on the full data to record the coefficient path
    model.fit(X_poly, y)
    coefficients[f'alpha={alp}'] = model.coef_


Overcoming overfitting

Regularization: Ridge Regression

Pros

  • It works well when there are inputs that are approximately linearly related to the target.
  • It helps stabilize the estimates when inputs are highly correlated.
  • It can prevent overfitting.
  • It is effective when the number of inputs exceeds the number of observations.

Cons

  • It does not work well when the input-output relationships are highly non-linear.
  • It may introduce bias into the coefficient estimates.
  • It does not perform feature selection.
  • It can be challenging for interpretation.

Overcoming overfitting

Regularization: Lasso Regression

  • Model: \(\hat{y}=\beta_0+\beta_1x_1+\dots+\beta_dx_d\),
  • Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\): \[{\cal L}_{\text{lasso}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}|\beta_j|}_{\text{Magnitude}}}.\]

Overcoming overfitting

Regularization: Lasso Regression

  • Large \(\color{green}{\alpha}\Rightarrow\) strong penalty \(\Rightarrow\) small \(\vec{\beta}\).
  • Small \(\color{green}{\alpha}\Rightarrow\) weak penalty \(\Rightarrow\) freer \(\vec{\beta}\).
  • 🔑 Objective: Learn the best \(\color{green}{\alpha}>0\).
  • Loss: \({\cal L}_{\text{lasso}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}|\beta_j|}_{\text{Magnitude}}}.\)

Overcoming overfitting

Regularization: Lasso Regression

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation
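No code accompanied this slide, so here is a minimal sketch mirroring the ridge loop above, with an illustrative alpha grid. The StandardScaler step is a deliberate choice: lasso is sensitive to feature scale (see the Cons below), and degree-8 powers of youtube span many orders of magnitude; max_iter is raised to help the coordinate-descent solver converge.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
# Data: degree-8 polynomial features of youtube
X, y = market[["youtube"]], market['sales']
X_poly = PolynomialFeatures(degree=8).fit_transform(X)
# Illustrative grid of regularization strengths
alphas = np.linspace(0.01, 10, 50)
loss = []
for alp in alphas:
    # Scale the features before lasso, then score by 5-fold CV-MSE
    model = make_pipeline(StandardScaler(), Lasso(alpha=alp, max_iter=50000))
    score = -cross_val_score(model, X_poly, y, cv=5,
                scoring='neg_mean_squared_error').mean()
    loss.append(score)
best_alpha = alphas[int(np.argmin(loss))]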

Overcoming overfitting

Regularization: Lasso Regression

Pros

  • Lasso inherently performs feature selection as the regularization parameter \(\alpha\) increases (less important coefficients are forced to exactly \(0\)).
  • It works well when there are many inputs (high-dimensional data), some of which are highly correlated with the target.
  • It can handle collinearity (many redundant inputs).
  • It can prevent overfitting and offers high interpretability.

Cons

  • It does not work well when the input-output relationships are highly non-linear.
  • It may introduce bias into the coefficient estimates.
  • It is sensitive to the scale of the data, so proper scaling of predictors is crucial before applying the method.

🥳 Yeahhhh……. 🥂

Any questions?