TP1 - Linear Regression

Course: ITM 390 004: Machine Learning

Lecturer: Sothea HAS, PhD


Objective: In this lab, you will reproduce the results we have obtained with linear regression. You will then go further: use multiple input variables, compare the performance of the resulting models, and apply feature engineering (polynomial features) to improve model performance.

The Google Colab notebook is available here: Lab1_Linear_Regression.ipynb.


We will be working with the Kaggle Auto-MPG dataset. First, find a way to import the dataset into your environment.

# To do
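One possible starting point for the import step is sketched below. It assumes the Kaggle file has been downloaded or uploaded as a CSV; the column names and the four sample rows are made up to imitate that layout, and `load_auto_mpg` is a hypothetical helper, not part of any library.

```python
import io
import pandas as pd

# Made-up sample imitating the assumed layout of the Kaggle CSV (not real data).
SAMPLE_CSV = """mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
18.0,8,307.0,130,3504,12.0,70,1,car a
15.0,8,350.0,165,3693,11.5,70,1,car b
25.0,4,98.0,?,2046,19.0,71,1,car c
24.0,4,113.0,95,2372,15.0,70,3,car d
"""

def load_auto_mpg(source):
    """Hypothetical helper: read the CSV from a path, URL, or file-like object."""
    return pd.read_csv(source)

# In Colab, replace the StringIO object with the path of the uploaded file.
df = load_auto_mpg(io.StringIO(SAMPLE_CSV))
print(df.shape)
print(df.dtypes)
```

Printing `df.dtypes` right after loading is a quick way to spot columns that were not read as numbers.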

1. Data Preprocessing and EDA

A. Univariate Analysis:

  • Check and fix the columns that have the wrong data type.
  • What’s wrong with the horsepower column? How would you solve this problem?
  • Make sure all columns have the correct data type, as shown in slide 7.
  • Compute descriptive statistics of the data and visualize their distributions, as illustrated in slide 9.
# To do
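The cleaning steps above could look roughly like the sketch below. It assumes the problem with horsepower is a non-numeric placeholder such as "?" and that origin should be categorical; both are assumptions to check against your own data and slide 7. The small frame is illustrative only.

```python
import pandas as pd

# Tiny made-up frame standing in for the real data (values are illustrative only).
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 24.0],
    "horsepower": ["130", "165", "?", "95"],
    "origin": [1, 1, 1, 3],
})

# Entries that cannot be parsed as numbers become NaN instead of raising an error.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")
n_missing = int(df["horsepower"].isna().sum())

# One possible fix: drop incomplete rows (imputing the median is another option).
df = df.dropna(subset=["horsepower"]).reset_index(drop=True)

# Assumption: origin is a region code, so it is stored as a category, not a number.
df["origin"] = df["origin"].astype("category")

print(n_missing)
print(df.dtypes)
print(df.describe())
```

Dropping rows is the simplest choice here; whether it is acceptable depends on how many rows are affected in the real data.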

B. Bivariate Analysis:

  • Compute the correlation matrix and draw the pairplot (see slide 10).
  • Also reproduce the conditional boxplot showing the influence of origin on mpg, as illustrated in slide 11.
  • Describe what you observe in these figures.
# To do
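A minimal numerical sketch of the bivariate step follows. The data are synthetic (generated with an arbitrary negative relation between weight and mpg), so the numbers say nothing about the real dataset; only the calls are meant to carry over.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the coefficients below are arbitrary assumptions.
rng = np.random.default_rng(0)
n = 200
weight = rng.uniform(1600, 5000, n)
mpg = 46 - 0.0075 * weight + rng.normal(0, 3, n)
origin = rng.choice([1, 2, 3], size=n)
df = pd.DataFrame({"mpg": mpg, "weight": weight, "origin": origin})

# Pearson correlation matrix of the quantitative columns.
corr = df[["mpg", "weight"]].corr()

# Numerical counterpart of a conditional boxplot: mpg summary per origin.
by_origin = df.groupby("origin")["mpg"].describe()

print(corr)
print(by_origin)
```

For the figures themselves, `seaborn.pairplot(df)` and `seaborn.boxplot(data=df, x="origin", y="mpg")` are the usual calls, assuming seaborn is installed in your environment.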

2. Simple Linear Regression (SLR)

  • Split the data into two parts \((X_{\text{train}}, y_{\text{train}})\) and \((X_{\text{test}}, y_{\text{test}})\) containing \(80\%\) and \(20\%\) of the total observations, respectively.
  • Pick one input other than weight and, using only the training data, build a linear regression model to predict mpg with the LinearRegression class of scikit-learn.
  • Draw a scatterplot of that input vs mpg, then add the fitted line to it.
  • Compute \(R^2\) on both the training and testing data, and check the training residuals of the model.
  • Explain the results and interpret the coefficient of your model.
# To do
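The SLR steps above can be sketched as follows. The choice of horsepower as the input, the synthetic data, and `random_state=42` are all assumptions made for illustration; substitute your own cleaned data and chosen input.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real Auto-MPG values).
rng = np.random.default_rng(0)
n = 300
hp = rng.uniform(50, 220, n)
mpg = 40 - 0.15 * hp + rng.normal(0, 3, n)
df = pd.DataFrame({"horsepower": hp, "mpg": mpg})

X = df[["horsepower"]]  # 2-D input, as scikit-learn expects
y = df["mpg"]

# 80% / 20% split; the seed only makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training data only.
slr = LinearRegression().fit(X_train, y_train)

r2_train = slr.score(X_train, y_train)  # score() returns R^2
r2_test = slr.score(X_test, y_test)
residuals = y_train - slr.predict(X_train)  # training residuals

print(slr.intercept_, slr.coef_[0])
print(r2_train, r2_test)
```

The slope `slr.coef_[0]` is the predicted change in mpg for a one-unit increase of the input. Plotting `residuals` against `slr.predict(X_train)` is one common way to check for leftover structure.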

3. Multiple Linear Regression (MLR)

A. All at once:

  • Build a multiple linear regression model on the training data using all inputs.
  • Compute \(R^2\) and \(R^2_{\text{adj}}\) on both the training and testing data, and explain the results.
  • Check the residuals and explain the result.
# To do
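scikit-learn does not report \(R^2_{\text{adj}}\) directly, so a small helper is sketched here using the usual definition \(R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), with \(n\) observations and \(p\) inputs. The helper name `adjusted_r2`, the three input columns, and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Synthetic stand-in for the cleaned numeric inputs (not the real values).
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "weight": rng.uniform(1600, 5000, n),
    "horsepower": rng.uniform(50, 220, n),
    "acceleration": rng.uniform(8, 25, n),
})
y = 50 - 0.005 * X["weight"] - 0.05 * X["horsepower"] + rng.normal(0, 3, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
mlr = LinearRegression().fit(X_train, y_train)

p = X_train.shape[1]  # number of inputs
r2_train = mlr.score(X_train, y_train)
r2_test = mlr.score(X_test, y_test)
r2_adj_train = adjusted_r2(r2_train, len(X_train), p)
r2_adj_test = adjusted_r2(r2_test, len(X_test), p)
residuals = y_train - mlr.predict(X_train)

print(r2_train, r2_adj_train)
print(r2_test, r2_adj_test)
```

With the real data, a categorical input such as origin would need to be encoded first (for example with `pandas.get_dummies`) before it can be passed to `LinearRegression`.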

B. Polynomial Features (PLR):

  • Return to your choice in SLR. Now create input data consisting of that input (\(X\)) and its square (\(X^2\)), and build a multiple linear regression model to predict mpg using \(X\) and \(X^2\).
  • Draw the scatterplot and the fitted curve.
  • Compute \(R^2\) and \(R^{2}_{\text{adj}}\) on both the training and testing data. Explain.
  • Check the residuals and explain.
  • Conclude your findings: which of the three models (SLR, MLR or PLR) do you think is best at predicting mpg on your test data?
# To do
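The polynomial step can be sketched as below: the squared column is added by hand and the same `LinearRegression` class is reused. The generic column names `x` and `x_sq`, the synthetic curved relation, and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some curvature (not the real Auto-MPG values).
rng = np.random.default_rng(2)
n = 300
x = rng.uniform(50, 220, n)
mpg = 57 - 0.47 * x + 0.0012 * x**2 + rng.normal(0, 3, n)
df = pd.DataFrame({"x": x, "mpg": mpg})
df["x_sq"] = df["x"] ** 2  # engineered feature X^2

# Split once so that both models see exactly the same rows.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

slr = LinearRegression().fit(train_df[["x"]], train_df["mpg"])
plr = LinearRegression().fit(train_df[["x", "x_sq"]], train_df["mpg"])

r2_slr_train = slr.score(train_df[["x"]], train_df["mpg"])
r2_plr_train = plr.score(train_df[["x", "x_sq"]], train_df["mpg"])
r2_slr_test = slr.score(test_df[["x"]], test_df["mpg"])
r2_plr_test = plr.score(test_df[["x", "x_sq"]], test_df["mpg"])

# Points of the fitted curve on an evenly spaced grid, ready for plotting.
x_grid = np.linspace(df["x"].min(), df["x"].max(), 100)
grid = pd.DataFrame({"x": x_grid, "x_sq": x_grid**2})
y_curve = plr.predict(grid)

print(r2_slr_train, r2_plr_train)
print(r2_slr_test, r2_plr_test)
```

On the training data the quadratic model can never score a lower \(R^2\) than the straight line, because the line is a special case of it. That is why the test scores and \(R^2_{\text{adj}}\) are the fairer basis for the final comparison.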

Further readings