TP5 - Linear Regression

Course: EDA & Unsupervised Learning
M1-DAS
Lecturer: HAS Sothea, PhD


Objective: In this TP, you will learn how to implement simple linear regression (SLR) and multiple linear regression (MLR), and you will practice improving the performance of your models on new, unseen data using the techniques introduced in the course.

The Jupyter Notebook for this Lab can be downloaded here: TP5_Linear_Regression.ipynb.


1. Importing Abalone dataset

You need an internet connection to load the data by running the following code. For more information about this data, see Abalone dataset.

# %pip install kagglehub   # if you have not installed "kagglehub" module yet
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

# Import data
import pandas as pd
data = pd.read_csv(path + "/abalone.csv")
data.head()
   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7

A. What’s the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset?

# To do
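
Hint: a minimal sketch of how the dimension and the variable types can be read off a DataFrame. The frame `toy` below is a hypothetical three-column extract of the rows shown above, not the full dataset; on the real data, the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical toy frame mimicking a few columns of the Abalone data
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Length": [0.455, 0.350, 0.530, 0.440, 0.330],
    "Rings": [15, 7, 9, 10, 7],
})

# Dimension: (number of rows, number of columns)
n_rows, n_cols = toy.shape

# Split the columns by type: numeric columns vs. everything else
quantitative = toy.select_dtypes(include="number").columns.tolist()
qualitative = toy.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print("quantitative:", quantitative)
print("qualitative:", qualitative)
```

Note that a numeric dtype does not always mean a quantitative variable (a numeric code can stand for a category), so the result should be checked against the data description.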

B. Perform univariate analysis (compute summary statistics and plot the distribution) on each variable of your dataset. The goal is to understand your data (the scale and the distribution of each column) and to detect missing values, outliers, etc.

# To do
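
Hint: one possible way to summarize a single column, count its missing values and flag outliers. The data below are synthetic (a made-up `Height` column with one planted outlier and one planted missing value), and the 1.5 x IQR rule is only one common convention for flagging outliers, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic column (not the real data) with one planted outlier and one missing value
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Height": rng.normal(0.14, 0.04, size=200)})
toy.loc[0, "Height"] = 1.13      # planted outlier
toy.loc[1, "Height"] = np.nan    # planted missing value

# Summary statistics: count, mean, std, min, quartiles, max
summary = toy["Height"].describe()

# Number of missing values in the column
n_missing = toy["Height"].isna().sum()

# 1.5 x IQR rule: flag points far outside the interquartile range
q1, q3 = toy["Height"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["Height"] < q1 - 1.5 * iqr) | (toy["Height"] > q3 + 1.5 * iqr)

print(summary)
print("missing:", n_missing, "| flagged as outliers:", int(is_outlier.sum()))
```

Histograms or boxplots of each column (for example with `toy["Height"].plot(kind="hist")`) complement these numbers.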

2. Correlation matrix, SLR and MLR

A. Compute the correlation matrix of this data using the DataFrame method corr() on the quantitative columns. Describe what you observe from this correlation matrix.

# To do
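
Hint: a sketch of the correlation step on synthetic data. The column names `x_strong` and `x_weak` are hypothetical. The argument `numeric_only=True` restricts the computation to numeric columns, which matters when the frame contains a text column such as Sex.

```python
import numpy as np
import pandas as pd

# Synthetic frame (not the real data): one text column, two inputs, one target
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "Sex": rng.choice(["M", "F", "I"], size=n),
    "x_strong": rng.normal(size=n),
    "x_weak": rng.normal(size=n),
})
toy["Rings"] = 2 * toy["x_strong"] + 0.1 * toy["x_weak"] + rng.normal(0, 0.5, size=n)

# Pearson correlation matrix of the numeric columns only
corr = toy.corr(numeric_only=True)

# Rank the inputs by absolute correlation with the target
ranked = corr["Rings"].drop("Rings").abs().sort_values(ascending=False)

print(corr.round(2))
print(ranked)
```

The first entry of `ranked` is the candidate input for a simple linear regression.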

B. Draw scatterplots of the target Rings against each of the quantitative inputs.

  • Comment on your findings and handle any problems you see in the graphs.
# To do
  • Here, I split the data into two parts. You are allowed to use only the training part for building models. The testing part will be used to evaluate the models’ performance.
from sklearn.model_selection import train_test_split
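X_train, X_test, y_train, y_test = train_test_split(
    data.iloc[:, 1:8],   # the seven quantitative inputs, Length through Shell weight
    data['Rings'],       # target
    test_size=0.2,       # hold out 20% of the rows for testing
    random_state=42)     # fixed seed for a reproducible split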

C. Fit an SLR model using the input most correlated with Rings to predict the Rings of abalone.

  • Draw the fitted line on the previous scatterplot.
  • Compute \(R^2\) and explain the observed value.
  • Compute the Mean Squared Error on the test data (Test MSE).
  • Analyze the residuals and conclude.
# To do
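
Hint: the steps listed in C can be sketched as follows with scikit-learn. The data are synthetic (generated from y = 5 + 3x + noise), so the numbers say nothing about the Abalone data; only the sequence of calls is meant to carry over.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): y = 5 + 3x + Gaussian noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=(500, 1))            # a single input, shape (n_samples, 1)
y = 5 + 3 * x[:, 0] + rng.normal(0, 1, size=500)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

# Fit the simple linear regression on the training part only
slr = LinearRegression().fit(x_train, y_train)

# R^2 on the training data and MSE on the held-out test data
r2_train = slr.score(x_train, y_train)
test_mse = mean_squared_error(y_test, slr.predict(x_test))

# Training residuals: observed minus fitted values
residuals = y_train - slr.predict(x_train)

print("slope:", slr.coef_[0], "intercept:", slr.intercept_)
print("R^2 (train):", r2_train, "Test MSE:", test_mse)
```

For the residual analysis, plotting `residuals` against the fitted values and looking at their histogram helps to check for non-linearity and non-constant variance.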

D. Fit an MLR model using all inputs.

  • Compute \(R^2\) and the adjusted \(R_{\text{adj}}^2\), and explain the observed values.
  • Compute the Mean Squared Error on the test data (Test MSE).
  • Analyze the residuals and conclude.
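
Hint: scikit-learn's `score` returns \(R^2\) but not its adjusted version, so the usual formula \(R_{\text{adj}}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), with \(n\) observations and \(p\) inputs, can be coded by hand. The helper `adjusted_r2` below is a hypothetical name and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n_samples, n_inputs):
    """Adjusted R^2 for a linear model with an intercept and n_inputs predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_inputs - 1)

# Synthetic data (an assumption): 6 inputs, only the first two matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 1, size=200)

mlr = LinearRegression().fit(X, y)
r2 = mlr.score(X, y)
r2_adj = adjusted_r2(r2, n_samples=X.shape[0], n_inputs=X.shape[1])

print("R^2:", r2, "adjusted R^2:", r2_adj)
```

The adjusted value is always below \(R^2\) when \(p > 0\) and \(R^2 < 1\); the gap grows as more inputs are added without improving the fit.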

3. Polynomial Regression and Regularized Linear Models

A. Build polynomial regression models of degree \(n\in\{2,3,\dots,10\}\) on the input most correlated with Rings. Compute the Test MSE for each degree.

  • Perform \(10\)-fold Cross-validation to select the best degree \(n\). Hint: see slide 47.
# To do
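
Hint: a possible sketch of degree selection by 10-fold cross-validation, using a scikit-learn pipeline. The data are synthetic (a noisy quadratic), and the shuffled `KFold` with a fixed seed is a choice made here for reproducibility, not something taken from the slides. In the lab, the cross-validation should be run on the training part only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption): noisy quadratic in one input
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=(300, 1))
y = 4 + 10 * x[:, 0] - 8 * x[:, 0] ** 2 + rng.normal(0, 0.5, size=300)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
degrees = list(range(2, 11))
cv_mse = []

for d in degrees:
    # Polynomial expansion of the input followed by ordinary least squares
    model = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        LinearRegression())
    # scikit-learn reports negated MSE, so flip the sign back
    scores = cross_val_score(model, x, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

best_degree = degrees[int(np.argmin(cv_mse))]
print(dict(zip(degrees, np.round(cv_mse, 4))))
print("degree with the lowest CV MSE:", best_degree)
```

Once a degree is selected, the corresponding model is refit on the whole training set and evaluated once on the test set.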

B. Perform \(10\)-fold Cross-validation to tune the penalization strength \(\alpha\) of a Ridge regression model for predicting Rings. Hint: see slide 51.

  • Compute Test MSE and compare it to the previous models.
# To do
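
Hint: one way to tune \(\alpha\) is a grid search with 10-fold cross-validation. The sketch below uses synthetic data; the grid of candidate values (`alphas`) and the standardization step are assumptions made here, the latter because the Ridge penalty depends on the scale of the inputs.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 6 inputs, only the first two matter
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize inside the pipeline so each CV fold is scaled on its own training part
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])

# Hypothetical grid of penalization strengths, evenly spaced on a log scale
alphas = np.logspace(-3, 3, 13)
search = GridSearchCV(
    pipe, {"ridge__alpha": alphas}, cv=10, scoring="neg_mean_squared_error")
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["ridge__alpha"]
# By default the best model is refit on the full training part before predicting
test_mse = mean_squared_error(y_te, search.predict(X_te))

print("best alpha:", best_alpha, "Test MSE:", test_mse)
```

If the selected value sits at an edge of the grid, the grid should be widened and the search repeated.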

C. Perform \(10\)-fold Cross-validation to tune the penalization strength \(\alpha\) of a Lasso regression model for predicting Rings. Hint: see slide 51.

  • Compute Test MSE and compare it to the previous models.
  • How many inputs are retained by Lasso?
# To do
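
Hint: scikit-learn's `LassoCV` chooses \(\alpha\) by cross-validation along its own automatically generated path of values. The sketch below uses synthetic data in which only two of the six inputs carry signal; counting the non-zero coefficients gives the number of inputs retained.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): only inputs 0 and 1 influence y
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, size=300)

# Standardize, then let LassoCV pick alpha with 10-fold cross-validation
model = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=0))
model.fit(X, y)

lasso = model[-1]                      # the fitted LassoCV step
best_alpha = lasso.alpha_
n_retained = int(np.sum(lasso.coef_ != 0))

print("best alpha:", best_alpha)
print("coefficients:", np.round(lasso.coef_, 3))
print("inputs retained:", n_retained)
```

Unlike Ridge, Lasso can set coefficients exactly to zero, which is why it doubles as a variable selection method.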

4. Auto-MPG Dataset

Apply what you have learned to predict the fuel efficiency (Miles Per Gallon, MPG) of the vehicles in the Kaggle Auto-MPG dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("uciml/autompg-dataset")

data = pd.read_csv(path + '/auto-mpg.csv')
data.head()
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
0  18.0          8         307.0         130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0         165    3693          11.5          70       1  buick skylark 320
2  18.0          8         318.0         150    3436          11.0          70       1  plymouth satellite
3  16.0          8         304.0         150    3433          12.0          70       1  amc rebel sst
4  17.0          8         302.0         140    3449          10.5          70       1  ford torino
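
Hint: before modelling, it is worth checking `data.dtypes`. A numeric-looking column can be read as text when missing values are coded with a placeholder; the sketch below assumes, as something to verify on the real file, that horsepower uses "?" for that purpose. The frame `toy` is made up for illustration, and dropping the incomplete rows is only one possible treatment.

```python
import pandas as pd

# Made-up toy frame (not the real file): horsepower stored as text with a "?" placeholder
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 21.0],
    "horsepower": ["130", "165", "?", "100"],
    "weight": [3504, 3693, 2046, 2587],
    "car name": ["car a", "car b", "car c", "car d"],
})

# Convert to numbers; entries that cannot be parsed become NaN
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")

# Drop the rows with missing horsepower and the free-text identifier column
clean = toy.dropna(subset=["horsepower"]).drop(columns=["car name"])

# Inputs and target for the regression models
X = clean.drop(columns=["mpg"])
y = clean["mpg"]

print(clean.dtypes)
print(X.shape, y.shape)
```

From here, the same workflow as for the Abalone data applies: train/test split, SLR and MLR, then polynomial and regularized models compared by Test MSE.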

Further readings