Lab - Model Development

ITM-370: Data Analytics
Lecturer: HAS Sothea, PhD


Objective: This practical lab aims to enhance your skills in implementing simple and multiple linear regression using market data covered in the course.

The Jupyter Notebook for this Lab can be downloaded here: Lab_Model_development.ipynb.

Or you can work with this notebook in Google Colab here: Lab_Model_development.ipynb.


Importing Market Data

You need an internet connection to load the data. Run the following code; we will simply call the resulting DataFrame data.

import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/hassothea/Data_Analytics_AUPP/refs/heads/main/data/marketing.csv", sep=",")
data.head(5)
   youtube  facebook  newspaper  sales
0   276.12     45.36      83.04  26.52
1    53.40     47.16      54.12  12.48
2    20.64     55.08      83.16  11.16
3   181.80     49.56      70.20  22.20
4   216.96     12.96      70.08  15.48

1. Study correlation matrix

A. Compute the correlation matrix of this data using the data.corr() method. Explain this correlation matrix (see slide 21).

# To do
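
As a hint for the call itself, here is a minimal sketch on a hypothetical toy DataFrame (the name toy and its columns x, y, z are made up for illustration and are not the marketing data). It only shows the shape of the result you should expect: a square, symmetric table with ones on the diagonal.

```python
import pandas as pd

# Hypothetical toy data (not the marketing dataset): three short columns.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],  # rises almost linearly with x
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],  # tends to fall as x rises
})

# DataFrame.corr() returns pairwise Pearson correlations by default.
corr_matrix = toy.corr()
print(corr_matrix.round(2))
```

Each entry lies between -1 and 1: values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.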

B. Draw a scatterplot for each of the following pairs:
  • Facebook (x-axis) vs Sales (y-axis)
  • Newspaper (x-axis) vs Sales (y-axis)

Add a title and a proper label for each axis.

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# To do

Key remark: The correlation matrix tells us a lot about which inputs are useful for constructing the model. If we were to build a model using only one input, we would pick the one whose correlation with the target is largest in absolute value. On the other hand, including several inputs that are highly correlated with one another can result in a bad model because it leads to multicollinearity: the model has difficulty distinguishing the individual effects of each input variable, resulting in unstable and unreliable coefficient estimates. It also inflates the variance of the regression coefficients, making the model less interpretable and potentially prone to overfitting. Simply put, it muddies the waters.
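
To make the variance inflation concrete, the sketch below computes the variance inflation factor (VIF) on synthetic data. The VIF is a standard diagnostic that is not part of this lab's instructions; it is used here only as an illustration. The data are hypothetical (not the marketing data): x2 is built as a near copy of x1, while x3 is independent. The helper name vif is also an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic inputs (hypothetical, not the marketing data):
# x2 is almost a copy of x1, x3 is independent of both.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2), where R^2 comes from
    regressing column j on all the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)


vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two nearly identical inputs get very large VIFs, while the independent input stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a warning sign, though the exact threshold is a convention rather than a strict rule.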

2. Simple Linear Regression

A. We already used YouTube as an explanatory variable to predict Sales in the course.

  • Now, build an SLR model to predict sales using Facebook.
from sklearn.linear_model import LinearRegression

# Prepare data X and y
# To do

# Build model
# To do

# Fit the model on the data
# To do
  • Perform model diagnosis:
    • Compute \(R^2\) then explain the observed value.
    • Compute and plot residuals for this model. Conclude.
# Compute R-squared
# To do

# Graph
# To do
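
As a reminder of what these two diagnostics measure, here is a minimal sketch on hypothetical toy data (the arrays x and y are made up and are not the marketing data). It fits a least-squares line with NumPy rather than scikit-learn, then computes the residuals and \(R^2 = 1 - SS_{res}/SS_{tot}\) by hand.

```python
import numpy as np

# Hypothetical toy data (not the marketing dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.8, 6.1, 8.3, 9.7, 12.1])

# Least-squares line: y is approximated by slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# Residuals are the observed values minus the fitted values.
residuals = y - y_hat

# R^2 is the share of the variance of y explained by the fitted line.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"R^2 = {r_squared:.3f}")
print("residuals:", residuals.round(2))
```

With scikit-learn, the score method of a fitted LinearRegression reports the same \(R^2\). For the residual plot, a scatter of residuals against fitted values that shows no visible pattern suggests the linear form is adequate.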

B. Repeat question (A), this time using newspaper as the input for the SLR.

# Prepare data X and y
# To do

# Fit model
# To do

3. Multiple Linear Regression

We already built an MLR model with two inputs during the course. Now, you will build one using all three inputs.

A. Build an MLR model using the three inputs.

# Prepare data
# To do

# Build model
# To do

B. Perform model diagnosis as illustrated in the course (from slide 26). Interpret your findings and conclude.

# To do

Further readings