Model Evaluation & Refinement


ITM-370: Data Analytics

Lecturer: Dr. Sothea Has

🗺️ Content

Model Evaluation

  • Out-of-sample MSE
  • \(K\)-fold Cross-Validation MSE

Model Refinement

  • Feature Engineering
  • Overfitting
  • Overcoming overfitting

Model Evaluation

Out-of-sample MSE

  • A good model must not only perform well on the training data (used to build it), but also on unseen observations.

  • We should judge a model based on how it generalizes on new unseen observations.

  • Out-of-sample Mean Squared Error (MSE): \[\color{green}{\frac{1}{n_{\text{new}}}\sum_{i=1}^{n_{\text{new}}}(y_i-\hat{y}_i)^2}.\]
  • In practice:
    • Train data \(\approx75\%-80\%\to\) for building the model.
    • Test data \(\approx20\%-25\%\to\) for testing the model.

Code
import pyreadr
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error

# Load the marketing data and randomly split it into train (~75%) and test (~25%)
market = pyreadr.read_r("./data/marketing.rda")
market = market['marketing']
shuffle_id = np.random.choice(['train', 'test'],
                              replace=True,
                              p=[0.75, 0.25],
                              size=market.shape[0])
market['type'] = shuffle_id

# Fit the SLR model on the training part only
from sklearn.linear_model import LinearRegression
lr1 = LinearRegression().fit(market.loc[market.type == "train", ['youtube']],
                             market.loc[market.type == "train", "sales"])

# Predict the held-out observations and compute the out-of-sample MSE
y_hat = lr1.predict(market.loc[market.type == "test", ['youtube']])
test_mse = mean_squared_error(market.loc[market.type == "test", "sales"], y_hat)

import plotly.express as px
import plotly.graph_objects as go
fig1 = px.scatter(data_frame=market,
                  x="youtube",
                  y="sales",
                  color="type",
                  color_discrete_map={
                      "train": "#e89927",
                      "test": "#3bbc35"
                  })
fig1.add_trace(go.Scatter(x=market.loc[market.type == "test", 'youtube'],
                          y=y_hat,
                          mode="lines",
                          name="Model built on train data",
                          line=dict(color="#e89927")))

fig1.update_layout(width=600, height=250, title="SLR Model: Sales vs Youtube")
fig1.show()

Cross-validation MSE

  • What if it’s our unlucky day? A single random split can be misleading: the model may look great on the training data yet perform poorly on the test data.
  • \(K\)-fold Cross-Validation MSE averages the test MSEs over \(K\) different splits: \(\text{CV-MSE}=\frac{1}{K}\sum_{k=1}^K\text{MSE}_k,\) where \(\text{MSE}_k\) is the test MSE on the \(k\)-th fold.
  • Computing CV-MSE: see the sketch below.

  • It doesn’t depend on one unlucky split!
  • It’s the average of \(K\) different test MSEs.
  • It estimates the MSE on new unseen data\(^{\text{📚}}\).
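A minimal sketch of computing the 5-fold CV-MSE with scikit-learn, assuming the market DataFrame loaded earlier:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold CV-MSE for the SLR model sales ~ youtube
X, y = market[["youtube"]], market["sales"]
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
cv_mse = -scores.mean()      # average of the 5 per-fold test MSEs
cv_rmse = cv_mse ** 0.5
print(f"5-fold CV-MSE = {cv_mse:.2f}, CV-RMSE = {cv_rmse:.3f}")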

Cross-validation MSE

Summary

  • Cross-validation is a model evaluation technique.
  • It can be used with different metrics other than MSE, such as Accuracy, F1-score,…
  • It can prevent overfitting.
  • For SLR or MLR (without hyperparameter tuning), it can provide an estimate of Test Error.
  • For models with hyperparameters, it can be used to tune those hyperparameters (coming soon).
  • Our sales vs youtube example: 5-fold CV-MSE = 21.86 or CV-RMSE = 4.675.

Model Refinement

Feature engineering

Missing values & outliers

  • Data of \(4\)-\(7\) year-old kids (the ages are presumably recorded in months; the zeros stand in for missing Height and Weight measurements):

    Gender  Age  Height  Weight
    F       68      0      20
    F       68      0      18
    F       65    105       0
    F       63      0      15
    F       68    112       0
    F       66    106       0
  • Missing values are often represented by NA (nan in Python).

  • Question: how do we handle them?

  • Answer: we should at least know what kind of missing values they are: MCAR, MAR or MNAR?

Feature engineering

Missing values & outliers

Missing Completely At Random (MCAR)

  • They are randomly missing.
  • Easy to handle with imputation or dropping methods (see the sketch after this list).
  • They don’t introduce bias.
  • Ex: The values are just randomly missing due to human or technical errors.
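A minimal sketch of both options on a hypothetical toy frame (the column names are invented for illustration):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with Height values missing completely at random
df = pd.DataFrame({"Height": [105, np.nan, 112, 106, np.nan],
                   "Weight": [20, 18, 15, 17, 16]})

df_drop = df.dropna()                         # option 1: drop incomplete rows
imputer = SimpleImputer(strategy="mean")      # option 2: mean imputation
df_imp = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)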

Feature engineering

Missing values & outliers

Missing At Random (MAR)

  • The missingness is related to other variables.
  • Model-based imputation often works well: SLR, MLR, KNN (see the sketch after this list).
  • Ex: Weights are often missing among women in a survey if it’s optional.
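A minimal KNN-imputation sketch, assuming a numeric DataFrame df with MAR gaps (such as the toy frame above); each gap is filled from the most similar complete rows:

import pandas as pd
from sklearn.impute import KNNImputer

# Fill each missing entry from its 5 nearest rows (distances use the observed columns)
imputer = KNNImputer(n_neighbors=5)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)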

Feature engineering

Missing values & outliers

Missing Not At Random (MNAR)

  • These are the trickiest, as the missingness is related to the missing values themselves.
  • It may require domain-specific knowledge or advanced techniques (more data, external info…).
  • Ex: Very high or very low salaries are often missing from a survey if it’s optional.

Feature engineering

Missing values & outliers

Outliers

  • Data points that deviate significantly from the majority of observations in a dataset.
  • They can influence our analyses, in ways that are insightful or problematic!
  • We can hunt them down using:
    • Graphs: scatterplots, boxplots or histograms…
    • The IQR rule: outliers often fall outside \([\text{Q}_1-1.5\,\text{IQR},\text{Q}_3+1.5\,\text{IQR}]\) (see the sketch after this list).
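A minimal sketch of the IQR rule on a numeric pandas Series (the function name iqr_outliers is hypothetical):

import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    # Boolean mask of points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Ex: flag unusually large advertising budgets
# outliers = market.loc[iqr_outliers(market["youtube"])]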

Feature engineering

Feature transformation

Z-score & Min-Max Scaling

  • Z-score of \(x_j\) is \(\tilde{x}_j=(x_j-\overline{x}_j)/\sigma_{x_j}\).
  • Min-Max scaling of \(x_j\) is \(\tilde{x}_j=\frac{x_j-\min_{x_j}}{\max_{x_j}-\min_{x_j}}\in [0,1]\).
  • When inputs are of different units (kg, km, dollars…).
  • When the differences in scales are too large.
  • When working with distance-based models or models that are sensitive to the scale of the data: SLR, MLR, KNN, SVM, Logistic Regression, PCA, Neural Networks…
  • Ex: often used in image processing… (see the sketch after this list).
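A minimal scikit-learn sketch, assuming X is a numeric feature matrix such as market[["youtube"]]:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_z = StandardScaler().fit_transform(X)     # z-score: (x - mean) / std
X_mm = MinMaxScaler().fit_transform(X)      # min-max: maps each column to [0, 1]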

Feature engineering

Feature transformation

One-hot encoding

Code
import pandas as pd
import plotly.express as px
from gapminder import gapminder
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the continent column (2007 data only)
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(gapminder.loc[gapminder.year == 2007, ['continent']]).toarray()

# Encoded dataset: one 0/1 column per continent
X_encoded = pd.DataFrame(encoded_data,
                         columns=[x.replace('continent_', '') for x in encoder.get_feature_names_out(['continent'])])

# Regress life expectancy on the encoded continents
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
lr = LinearRegression()
lr.fit(X_encoded, gapminder.lifeExp.loc[gapminder.year == 2007])
R2 = r2_score(gapminder.lifeExp.loc[gapminder.year == 2007], lr.predict(X_encoded))

df_encoded = X_encoded.copy()
df_encoded['lifeExp'] = gapminder.lifeExp.loc[gapminder.year == 2007].values
fig_cont = px.box(data_frame=gapminder.loc[gapminder.year == 2007, :],
                  x="continent", y="lifeExp", color="continent")
fig_cont.update_layout(title="Life Expectancy vs Continent", height=250, width=500)
fig_cont.show()
  • Some categorical inputs are useful for building models.
  • They have to be converted to numbers first.
  • Ex: continent is useful for predicting lifeExp.
    • R-squared: 0.624.

First row of the encoded dataset:

Africa    Americas  Asia      Europe    Oceania   lifeExp
0.000000  0.000000  1.000000  0.000000  0.000000  43.828000

Feature engineering

Feature transformation

Polynomial features

  • Predicting the target with a purely linear form of the inputs may be unrealistic!
  • More complicated forms of the inputs might predict the target better!
  • Ex: sales vs youtube: \(R^2\approx 61\%\).
  • Now: \(\widehat{\text{sales}}=\beta_0+\beta_1\text{YT}+\beta_2\text{YT}^2\)

Code
# Add a squared youtube term as a second feature
market2 = pd.concat([market.youtube, market.youtube ** 2, market.sales], axis=1)
market2.columns = ["YT", "YT^2", "Sales"]
market2.iloc[:3, :]
       YT        YT^2  Sales
0  276.12  76242.2544  26.52
1   53.40   2851.5600  12.48
2   20.64    426.0096  11.16
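A minimal sketch of fitting the quadratic model and checking its in-sample fit, assuming market2 from the block above:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fit sales on YT and YT^2, then compare R^2 against the linear fit (~0.61)
lr2 = LinearRegression().fit(market2[["YT", "YT^2"]], market2["Sales"])
print(r2_score(market2["Sales"], lr2.predict(market2[["YT", "YT^2"]])))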


Overfitting

Challenge in every model

  • Overfitting happens when a model learns the training data too well, capturing noise and fluctuations rather than the underlying pattern.
  • It fits the training data almost perfectly, but fails to generalize to new, unseen data.
  • Complex models (high-degree poly. features) often overfit the data.

Overcoming overfitting

\(K\)-fold Cross-Validation

  • It checks that the model performs well across different subsets of the data.
  • It is the most common technique for overcoming overfitting.

Tuning Polynomial degree Using \(K\)-fold Cross-Validation

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import cross_val_score
# Data
X, y = market[["youtube"]], market['sales']
# List of all degrees to search over
degree = list(range(1, 11))
# List to store the CV-MSE of each degree
loss = []
for deg in degree:
    pf = PolynomialFeatures(degree=deg)
    X_poly = pf.fit_transform(X)
    model = LR()
    score = -cross_val_score(model, X_poly, y, cv=5, 
                scoring='neg_mean_squared_error').mean()
    loss.append(score)
# Pick the degree with the smallest CV-MSE
best_degree = degree[int(np.argmin(loss))]

Overcoming overfitting

Regularization

  • Another approach is to control the magnitude of the coefficients.
  • It often works well for SLR, MLR and Polynomial Regression…

Overcoming overfitting

Regularization: Ridge Regression

  • Model: \(\hat{y}=\beta_0+\beta_1x_1+\dots+\beta_dx_d\),

  • Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\): \[{\cal L}_{\text{ridge}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}\beta_j^2}_{\text{Magnitude}}}.\]

  • Recall: SLR & MLR seek to minimize only the RSS (a direct transcription of the ridge loss is sketched below).
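As a reading aid, a minimal NumPy transcription of the ridge loss above, assuming the design matrix X1 already carries a leading column of ones (so \(\beta_0\) is penalized, exactly as the sum from \(j=0\) indicates):

import numpy as np

def ridge_loss(beta, X1, y, alpha):
    # RSS plus alpha times the squared magnitude of all coefficients
    resid = y - X1 @ beta
    return resid @ resid + alpha * (beta @ beta)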

Overcoming overfitting

Regularization: Ridge Regression

  • Large \(\color{green}{\alpha}\Rightarrow\) strong penalty \(\Rightarrow\) small \(\vec{\beta}\).
  • Small \(\color{green}{\alpha}\Rightarrow\) weak penalty \(\Rightarrow\) freer \(\vec{\beta}\).
  • 🔑 Objective: Learn the best \(\color{green}{\alpha}>0\).
  • Loss: \({\cal L}_{\text{ridge}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}\beta_j^2}_{\text{Magnitude}}}.\)

Overcoming overfitting

Regularization: Ridge Regression

How to find a suitable regularization strength \(\color{green}{\alpha}\)?

Overcoming overfitting

Regularization: Ridge Regression

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Data: degree-8 polynomial features of youtube
X, y = market[["youtube"]], market['sales']
poly = PolynomialFeatures(degree=8)
X_poly = poly.fit_transform(X)
# List of all regularization strengths to search over
alphas = list(np.linspace(0.01, 3, 30)) + list(np.linspace(3.1, 20000, 30))
# Containers for the CV-MSE and coefficients at each alpha
loss = []
coefficients = {f'alpha={alpha}': [] for alpha in alphas}
for alp in alphas:
    model = Ridge(alpha=alp)
    score = -cross_val_score(model, X_poly, y, cv=5, 
                scoring='neg_mean_squared_error').mean()
    loss.append(score)
    # Refit on the full data to record the coefficient path
    model.fit(X_poly, y)
    coefficients[f'alpha={alp}'] = model.coef_


Overcoming overfitting

Regularization: Ridge Regression

Pros

  • It works well when there are inputs that are approximately linearly related to the target.
  • It helps stabilize the estimates when inputs are highly correlated.
  • It can prevent overfitting.
  • It is effective when the number of inputs exceeds the number of observations.

Cons

  • It does not work well when the input-output relationships are highly non-linear.
  • It may introduce bias into the coefficient estimates.
  • It does not perform feature selection.
  • It can be challenging for interpretation.

Overcoming overfitting

Regularization: Lasso Regression

  • Model: \(\hat{y}=\beta_0+\beta_1x_1+\dots+\beta_dx_d\),
  • Objective: Search for \(\vec{\beta}=[\beta_0,\dots,\beta_d]\) minimizing the following loss function for some \(\color{green}{\alpha}>0\): \[{\cal L}_{\text{lasso}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}|\beta_j|}_{\text{Magnitude}}}.\]

Overcoming overfitting

Regularization: Lasso Regression

  • Large \(\color{green}{\alpha}\Rightarrow\) strong penalty \(\Rightarrow\) small \(\vec{\beta}\).
  • Small \(\color{green}{\alpha}\Rightarrow\) weak penalty \(\Rightarrow\) freer \(\vec{\beta}\).
  • 🔑 Objective: Learn the best \(\color{green}{\alpha}>0\).
  • Loss: \({\cal L}_{\text{lasso}}(\vec{\beta})=\color{red}{\underbrace{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}_{\text{RSS}}}+\color{green}{\alpha}\color{blue}{\underbrace{\sum_{j=0}^{d}|\beta_j|}_{\text{Magnitude}}}.\)

Overcoming overfitting

Regularization: Lasso Regression

Tuning Regularization Strength \(\color{green}{\alpha}\) Using \(K\)-fold Cross-Validation
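No code accompanied this slide, so here is a minimal sketch mirroring the ridge loop above, with an illustrative alpha grid. The StandardScaler step is a deliberate choice: lasso is sensitive to feature scale (see the Cons below), and degree-8 powers of youtube span many orders of magnitude; max_iter is raised to help the coordinate-descent solver converge.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
# Data: degree-8 polynomial features of youtube
X, y = market[["youtube"]], market['sales']
X_poly = PolynomialFeatures(degree=8).fit_transform(X)
# Illustrative grid of regularization strengths
alphas = np.linspace(0.01, 10, 50)
loss = []
for alp in alphas:
    # Scale the features before lasso, then score by 5-fold CV-MSE
    model = make_pipeline(StandardScaler(), Lasso(alpha=alp, max_iter=50000))
    score = -cross_val_score(model, X_poly, y, cv=5,
                scoring='neg_mean_squared_error').mean()
    loss.append(score)
best_alpha = alphas[int(np.argmin(loss))]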

Overcoming overfitting

Regularization: Lasso Regression

Pros

  • Lasso inherently performs feature selection as the regularization parameter \(\alpha\) increases (less important coefficients are forced to exactly \(0\)).
  • It works well when there are many inputs (high-dimensional data), some of which are highly correlated with the target.
  • It can handle collinearity (many redundant inputs).
  • It can prevent overfitting and offers high interpretability.

Cons

  • It does not work well when the input-output relationships are highly non-linear.
  • It may introduce bias into the coefficient estimates.
  • It is sensitive to the scale of the data, so proper scaling of predictors is crucial before applying the method.

🥳 Yeahhhh……. 🥂

Any questions?