{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Lab - Model Development**\n", "\n", "ITM-370: Data Analytics
\n", "**Lecturer: HAS Sothea, PhD**\n", "\n", "-------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Objective:** This practical lab aims to enhance your skills in implementing simple and multiple linear regression using market data covered in the course.\n", "\n", "> **The `Jupyter Notebook` for this Lab can be downloaded here: [Lab_Model_development.ipynb](https://hassothea.github.io/Data_Analytics_AUPP/Labs/Lab_Model_development.ipynb)**.\n", "\n", "> **Or you can work with this notebook in `Google Colab` here: [Lab_Model_development.ipynb](https://colab.research.google.com/drive/1vOtcbNdVCJPZzuZA0JI-i7o8lk9BI2co?usp=sharing)**.\n", "\n", "-----------\n", "\n", "# Importing Market Data\n", "\n", "You need internet to load the data by running the following codes. We will simply call it `data`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
youtubefacebooknewspapersales
0276.1245.3683.0426.52
153.4047.1654.1212.48
220.6455.0883.1611.16
3181.8049.5670.2022.20
4216.9612.9670.0815.48
\n", "
" ], "text/plain": [ " youtube facebook newspaper sales\n", "0 276.12 45.36 83.04 26.52\n", "1 53.40 47.16 54.12 12.48\n", "2 20.64 55.08 83.16 11.16\n", "3 181.80 49.56 70.20 22.20\n", "4 216.96 12.96 70.08 15.48" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pyreadr\n", "import pandas as pd\n", "data = pd.read_csv(\"https://raw.githubusercontent.com/hassothea/Data_Analytics_AUPP/refs/heads/main/data/marketing.csv\", sep=\",\")\n", "data.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Study correlation matrix\n", "\n", "**A.** Compute correlation matrix of this data using `pd.corr()` function. Explain this correlation matrix (see [slide 21](https://hassothea.github.io/Data_Analytics_AUPP/slides/Model_development.html#/correlation-matrix-1))." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**B.** Plot scatterplot of the following pairs:\n", "- Facebook (x-axis) vs Sales (y-axis)\n", "- Newspaper (x-axis) vs Sales (y-axis)\n", "\n", "You should add title and using proper name for each axis." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import plotly.express as px\n", "\n", "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Key remark**: Correlation matrix tells us a lot about which inputs are useful for constructing the model. If we were to build a model using only one input, use the one having the highest correlation with the target. On the other hand, putting many highly correlated inputs together can result in a bad model because it can lead to multicollinearity. This means the model has difficulty distinguishing the individual effects of each input variable, resulting in unstable and unreliable coefficient estimates. Additionally, it can inflate the variance of the regression coefficients, making the model less interpretable and potentially overfitting the data. Simply put, it muddies the waters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Simple Linear Regression\n", "\n", "**A.** We already used `YouTube` as an explanatory variable to predict `Sales` in the course. \n", "\n", "- Now, build a SLR model to predict `sales` using `Facebook`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "# Prepare data X and y\n", "# To do\n", "\n", "# Build model\n", "# To do\n", "\n", "# Fit the model on the data\n", "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Perform model dignosis:\n", " - Compute $R^2$ then explain the observed value.\n", " - Compute and plot residuals for this model. Conclude." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Compute R-squared\n", "# To do\n", "\n", "# Graph\n", "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**B.** Repeat question **(A)** but using `newspaper` as an input for SLR instead." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Prepare data X and y\n", "# To do\n", "\n", "# Fit model\n", "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Multiple Linear Regression\n", "\n", "We already build a MLR with two inputs during the course. Now, you will do it using all three inputs.\n", "\n", "**A.** Build a MLR model using the three inputs." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Prepare data\n", "# To do\n", "\n", "# Build model\n", "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**B.** Perform model diagnosis as illustrated in the course (from [slide 26](https://hassothea.github.io/Data_Analytics_AUPP/slides/Model_development.html#/multiple-linear-regreesion-mlr)). Interpret your findings and conclude." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# To do" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Further readings\n", "- Graphical tools:\n", " - [`matplotlib`](https://matplotlib.org/stable/index.html)\n", " - [`seaborn`](https://seaborn.pydata.org/)\n", " - [`plotly`](https://plotly.com/python/)\n", "- [Linear regression, Wiki](https://en.wikipedia.org/wiki/Linear_regression)\n", "- [Linear Regression Diagnostics in Python, Jan Kirenz](https://www.kirenz.com/blog/posts/2021-11-14-linear-regression-diagnostics-in-python/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.5" } }, "nbformat": 4, "nbformat_minor": 2 }