TP6 - Linear Regression & Regularization


Objective: Building a model (SLR or MLR) is rather simple, but building a good model that generalizes well to unseen observations is more challenging. This practical lab aims to strengthen your skills in improving the performance of your models on new, unseen data using the techniques introduced in class.

The Jupyter Notebook for this Lab can be downloaded here: TP6_Linear_Regression.ipynb.


1. Importing Abalone dataset

You need an internet connection to load the data by running the following code. For more information about this data, read Abalone dataset.

# %pip install kagglehub   # if you have not installed "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")
print("Path to dataset files:", path)

# Import data
import pandas as pd
data = pd.read_csv(path + "/abalone.csv")
data.head()
Path to dataset files: C:\Users\hasso\.cache\kagglehub\datasets\rodolfomendes\abalone-dataset\versions\3
   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7

A. What’s the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset?

  • Create a statistical summary and visualize the distributions of the quantitative columns, then of the qualitative one.

  • Identify and properly handle any apparent problems in this dataset.

  • Check whether there are any duplicated rows.

  • Study both correlation matrices of this dataset and comment on them.

  • Is the qualitative column useful for predicting the target Rings?
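
The checks listed above can be sketched with a few pandas calls. This is only an illustration: `sample` is a stand-in frame built from the five rows printed by `data.head()`, and treating non-positive measurements as suspicious is an assumption, not something those five rows show. On the real data, the same calls would be applied to `data`.

```python
import pandas as pd

# Stand-in for `data`: the five rows printed by data.head() above.
sample = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Length": [0.455, 0.350, 0.530, 0.440, 0.330],
    "Diameter": [0.365, 0.265, 0.420, 0.365, 0.255],
    "Height": [0.095, 0.090, 0.135, 0.125, 0.080],
    "Whole weight": [0.5140, 0.2255, 0.6770, 0.5160, 0.2050],
    "Shucked weight": [0.2245, 0.0995, 0.2565, 0.2155, 0.0895],
    "Viscera weight": [0.1010, 0.0485, 0.1415, 0.1140, 0.0395],
    "Shell weight": [0.150, 0.070, 0.210, 0.155, 0.055],
    "Rings": [15, 7, 9, 10, 7],
})

# Dimension and variable types
n_rows, n_cols = sample.shape
quant_cols = sample.select_dtypes(include="number").columns.tolist()
qual_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Statistical summaries: describe() for numbers, counts for categories
summary = sample[quant_cols].describe()
sex_counts = sample["Sex"].value_counts()

# Basic data-quality checks
n_missing = int(sample.isna().sum().sum())
n_duplicates = int(sample.duplicated().sum())
# Assumption: a physical measurement <= 0 would be suspicious
n_nonpositive = int((sample[quant_cols] <= 0).any(axis=1).sum())

# A first look at whether the qualitative column relates to the target
rings_by_sex = sample.groupby("Sex")["Rings"].mean()

print(f"{n_rows} rows, {n_cols} columns")
print("quantitative:", quant_cols)
print("qualitative:", qual_cols)
print("missing:", n_missing, "duplicates:", n_duplicates,
      "rows with non-positive values:", n_nonpositive)
```

Comparing the distribution of Rings across the levels of Sex (for example with boxplots) would give a fuller answer than the group means alone.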

2. Correlation matrix, SLR and MLR

A. Compute both types of correlation matrices for the quantitative variables. Describe what you observe.

  • Visualize the pairwise-scatterplot between the quantitative columns.

  • Prepare the data for building the model.
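
As a hedged illustration of why two correlation matrices can tell different stories, the sketch below assumes that "both types" means Pearson and Spearman correlation (check the course slides if another pair is intended). It uses a toy frame `toy`, not the abalone data, in which `y` is a monotone but nonlinear function of `x`.

```python
import numpy as np
import pandas as pd

# Toy data (not the abalone data): y grows monotonically but not linearly with x
x = np.linspace(0.1, 1.0, 50)
toy = pd.DataFrame({"x": x, "y": x ** 3})

# Pearson measures linear association; Spearman measures monotone association
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print("Pearson :", round(pearson.loc["x", "y"], 3))
print("Spearman:", round(spearman.loc["x", "y"], 3))
```

A Spearman coefficient clearly larger than the Pearson one, as in this toy case, suggests a relationship that is monotone but curved, which is worth keeping in mind before fitting a straight line.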

# To do

B. Draw the scatterplot of the target Rings vs the most correlated input.

# To do
  • Here, I split the data into two parts. You are allowed to use only the training part for building models. The testing part will be used to evaluate the models’ performance.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                        data.iloc[:, 1:8],  # columns 1 to 7: all quantitative inputs (Length through Shell weight)
                                        data['Rings'], 
                                        test_size=0.2, 
                                        random_state=42)

C. Fit an SLR model using the most correlated input to predict the Rings of abalone.

  • Draw the fitted line on the previous scatterplot.

  • Compute \(R^2\) and explain the observed value.

  • Compute the Mean Squared Error on the test data (Test MSE).

  • Analyze the residuals and conclude.
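
The steps above can be sketched as follows. This is a minimal example on synthetic data (the arrays `x` and `y` are made up for illustration, not taken from the abalone dataset), assuming scikit-learn's `LinearRegression` is the model used for SLR.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one input, linear signal plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(300, 1))
y = 5.0 + 10.0 * x[:, 0] + rng.normal(0.0, 2.0, size=300)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit on the training part only
slr = LinearRegression().fit(x_tr, y_tr)

# R^2 on the training data, MSE on the held-out test data
r2_train = r2_score(y_tr, slr.predict(x_tr))
test_mse = mean_squared_error(y_te, slr.predict(x_te))

# Training residuals: plot them against the fitted values to look for
# curvature or changing spread
residuals = y_tr - slr.predict(x_tr)

print(f"slope={slr.coef_[0]:.2f}, intercept={slr.intercept_:.2f}")
print(f"train R^2={r2_train:.3f}, test MSE={test_mse:.3f}")
```

With an intercept in the model, the training residuals always average to zero and are uncorrelated with the input, so the informative part of a residual analysis is their shape (trend, funnel, outliers), not their mean.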

# To do

D. Fit an MLR model using all inputs.

  • Compute \(R^2\) and adjusted \(R_{\text{adj}}^2\), and explain the observed values.

  • Compute the Mean Squared Error on the test data (Test MSE).

  • Analyze the residuals and conclude.
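
scikit-learn does not report the adjusted coefficient directly, so it is usually computed by hand from the standard definition \(R_{\text{adj}}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), where \(n\) is the number of observations and \(p\) the number of inputs. The helper below is a small sketch; the name `adjusted_r2` is hypothetical and chosen for this example.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 for a linear model with an intercept.

    r2         : ordinary R^2 computed on the same data
    n_samples  : number of observations n
    n_features : number of inputs p (intercept not counted)
    """
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Example: the same R^2 is penalized more when more inputs are used
print(adjusted_r2(0.50, n_samples=101, n_features=1))
print(adjusted_r2(0.50, n_samples=101, n_features=10))
```

Unlike \(R^2\), which never decreases on the training data when an input is added, the adjusted version can go down, which makes it more suitable for comparing models with different numbers of inputs.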

# To do

E. Select some columns then build an MLR model. Compare it to the previous models.

# To do

3. Polynomial Regression and Regularized Linear Models

A. Build polynomial regression models of different degrees \(n\in\{2,3,\dots,10\}\) using the most correlated input to predict Rings. Compute the Test MSE for each degree.

  • Perform \(10\)-fold Cross-validation to select the best degree \(n\). Hint: see slide 20.
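
One possible way to organize the degree selection is sketched below. It is not necessarily the procedure shown on the slide: it assumes a scikit-learn pipeline of `PolynomialFeatures` and `LinearRegression`, scored by 10-fold cross-validated MSE, and runs on synthetic data (`x`, `y`) whose true relationship is quadratic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in: a quadratic signal plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=(200, 1))
y = 3.0 + 8.0 * x[:, 0] - 6.0 * x[:, 0] ** 2 + rng.normal(0.0, 0.5, size=200)

# Same 10 shuffled folds for every degree, so the scores are comparable
cv = KFold(n_splits=10, shuffle=True, random_state=42)

degrees = list(range(2, 11))
cv_mse = {}
for d in degrees:
    model = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        LinearRegression(),
    )
    # scikit-learn maximizes scores, hence the negated MSE
    scores = cross_val_score(model, x, y, cv=cv,
                             scoring="neg_mean_squared_error")
    cv_mse[d] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
for d in degrees:
    print(f"degree {d:2d}: CV MSE = {cv_mse[d]:.4f}")
print("selected degree:", best_degree)
```

On the lab data, the cross-validation would be run on the training part only, keeping the test part for the final Test MSE of the selected degree.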
# To do

B. Perform \(10\)-fold Cross-validation to tune the penalization strength \(\alpha\) of a Ridge regression model for predicting Rings. Hint: see slide 24.

  • Compute Test MSE and compare it to the previous models.
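
A hedged sketch of tuning \(\alpha\) for Ridge is given below. It assumes a `GridSearchCV` over a log-spaced grid with standardized inputs; the grid bounds and the synthetic, deliberately collinear data are choices made for this example and are not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: six strongly correlated inputs
rng = np.random.default_rng(2)
base = rng.normal(size=(300, 1))
X = base + 0.1 * rng.normal(size=(300, 6))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 2.0, -1.0]) + rng.normal(0.0, 1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate penalization strengths (an assumed grid)
alphas = np.logspace(-3, 3, 13)

# Scaling sits inside the pipeline so it is re-fitted on each training fold
search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": alphas},
    cv=10,
    scoring="neg_mean_squared_error",
)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["ridge__alpha"]
test_mse = mean_squared_error(y_te, search.predict(X_te))
print(f"best alpha = {best_alpha:.4g}, test MSE = {test_mse:.3f}")
```

Standardizing before penalizing matters because the Ridge penalty acts on coefficient size, which depends on the scale of each input.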
# To do

C. Perform \(10\)-fold Cross-validation to tune the penalization strength \(\alpha\) of a Lasso regression model for predicting Rings. Hint: see slide 24.

  • Compute Test MSE and compare it to the previous models.

  • How many inputs are retained by Lasso?
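
Counting the inputs retained by Lasso amounts to counting its nonzero coefficients. The sketch below assumes `LassoCV` with 10 folds on standardized inputs and uses synthetic data in which only the first three of eight inputs carry signal; the data and the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 inputs, only the first 3 influence the target
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 8))
true_coef = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(0.0, 0.5, size=300)

# LassoCV builds its own grid of alphas and selects one by 10-fold CV
model = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
retained = np.flatnonzero(lasso.coef_)  # indices of nonzero coefficients
n_retained = retained.size

print(f"selected alpha = {lasso.alpha_:.4g}")
print(f"{n_retained} of {X.shape[1]} inputs retained, indices: {retained.tolist()}")
```

Inputs whose coefficient is exactly zero are dropped by the model, which is what distinguishes Lasso from Ridge, where coefficients shrink but generally stay nonzero.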

# To do

Further readings