{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Lab - Model Evaluation and Refinement**\n",
    "\n",
    "<b>ITM-370: Data Analytics</b><br>\n",
    "**Lecturer: HAS Sothea, PhD**\n",
    "\n",
    "-------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Objective:** Building a model is (SLR or MLR) is rather simple, but building a good model that can generalize well on unseen observations is more challenging. This practical lab aims at enhancing your practical skills in pushing the performance of your models on new unseen data using techniques introduced in the class.\n",
    "\n",
    "> **The `Jupyter Notebook` for this Lab can be downloaded here: [Lab_Model_Evaluation_Refinement.ipynb](https://hassothea.github.io/Data_Analytics_AUPP/Labs/Lab_Model_Evaluation_Refinement.ipynb)**.\n",
    "\n",
    "> **Or you can work with this notebook in `Google Colab` here: [Lab_Model_Evaluation_Refinement.ipynb](https://colab.research.google.com/drive/1olaCH9qshMuwVL3Lp-nqcB7F4nJswMYe?usp=sharing)**.\n",
    "\n",
    "-----------\n",
    "\n",
    "## 1. Importing `Abalone dataset`\n",
    "\n",
    "You need internet to load the data by running the following codes. For more information about this data, read [`Abalone dataset`](https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Path to dataset files: C:\\Users\\hasso\\.cache\\kagglehub\\datasets\\rodolfomendes\\abalone-dataset\\versions\\3\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sex</th>\n",
       "      <th>Length</th>\n",
       "      <th>Diameter</th>\n",
       "      <th>Height</th>\n",
       "      <th>Whole weight</th>\n",
       "      <th>Shucked weight</th>\n",
       "      <th>Viscera weight</th>\n",
       "      <th>Shell weight</th>\n",
       "      <th>Rings</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>M</td>\n",
       "      <td>0.455</td>\n",
       "      <td>0.365</td>\n",
       "      <td>0.095</td>\n",
       "      <td>0.5140</td>\n",
       "      <td>0.2245</td>\n",
       "      <td>0.1010</td>\n",
       "      <td>0.150</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>M</td>\n",
       "      <td>0.350</td>\n",
       "      <td>0.265</td>\n",
       "      <td>0.090</td>\n",
       "      <td>0.2255</td>\n",
       "      <td>0.0995</td>\n",
       "      <td>0.0485</td>\n",
       "      <td>0.070</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>F</td>\n",
       "      <td>0.530</td>\n",
       "      <td>0.420</td>\n",
       "      <td>0.135</td>\n",
       "      <td>0.6770</td>\n",
       "      <td>0.2565</td>\n",
       "      <td>0.1415</td>\n",
       "      <td>0.210</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>M</td>\n",
       "      <td>0.440</td>\n",
       "      <td>0.365</td>\n",
       "      <td>0.125</td>\n",
       "      <td>0.5160</td>\n",
       "      <td>0.2155</td>\n",
       "      <td>0.1140</td>\n",
       "      <td>0.155</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>I</td>\n",
       "      <td>0.330</td>\n",
       "      <td>0.255</td>\n",
       "      <td>0.080</td>\n",
       "      <td>0.2050</td>\n",
       "      <td>0.0895</td>\n",
       "      <td>0.0395</td>\n",
       "      <td>0.055</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  \\\n",
       "0   M   0.455     0.365   0.095        0.5140          0.2245          0.1010   \n",
       "1   M   0.350     0.265   0.090        0.2255          0.0995          0.0485   \n",
       "2   F   0.530     0.420   0.135        0.6770          0.2565          0.1415   \n",
       "3   M   0.440     0.365   0.125        0.5160          0.2155          0.1140   \n",
       "4   I   0.330     0.255   0.080        0.2050          0.0895          0.0395   \n",
       "\n",
       "   Shell weight  Rings  \n",
       "0         0.150     15  \n",
       "1         0.070      7  \n",
       "2         0.210      9  \n",
       "3         0.155     10  \n",
       "4         0.055      7  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# %pip install kagglehub   # if you have not installed \"kagglehub\" module yet\n",
    "import kagglehub\n",
    "\n",
    "# Download latest version\n",
    "path = kagglehub.dataset_download(\"rodolfomendes/abalone-dataset\")\n",
    "print(\"Path to dataset files:\", path)\n",
    "\n",
    "# Import data\n",
    "import pandas as pd\n",
    "data = pd.read_csv(path + \"/abalone.csv\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A.** What's the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**B.** Perform univariate analysis (compute statistical values and plot the distribution) on each variables of your dataset. The goal is to understand your data (the scale and how each column is distributed), detect missing values and outliers, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Correlation matrix, SLR and MLR\n",
    "\n",
    "**A.** Compute the correlation matrix of this data using `pd.corr()` function. Describe what you observed from this correlation matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**B.** Draw the scatterplot of the target `Rings` vs the most correlated input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Here, I split the data into two parts. You are allowed to use only the training part for building models. The testing part will be used to evaluate the models' performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "                                        data.iloc[:,1:7], \n",
    "                                        data['Rings'], \n",
    "                                        test_size=0.2, \n",
    "                                        random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**C.** Fit a SLR model using the most correlated input to predict the `Rings` of abalone. \n",
    "- Draw the fitted line on the previous scatterplot. \n",
    "- Compute $R^2$ and explain the observed value.\n",
    "- Compute Mean Suared Error on the test data (Test MSE).\n",
    "- Analyze the residuals and conclude."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**D.** Fit MLR using all inputs. Compute:\n",
    "- Compute $R^2$, adjusted $R_{\\text{adj}}^2$ and explain the observed value.\n",
    "- Compute Mean Suared Error on the test data (Test MSE).\n",
    "- Analyze the residuals and conclude."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Polynomial Regression and Regularized Linear Models\n",
    "\n",
    "**A.** Build polynomial regression with different degree $n\\in\\{2,3,...,10\\}$ of the best correlated input to predict `Rings`. Compute Test MSE for each case.\n",
    "- Perform $10$-fold Cross-validation to select the best degree $n$. Hint: see [slide 20](https://hassothea.github.io/Data_Analytics_AUPP/slides/Model_Evaluation_Refinement.html#/overcoming-overfitting)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**B.** Perform $10$-fold Cross-validation to tune the best penalization strength $\\alpha$ of Ridge regression model for predicting `Rings`. Hint: see [slide 24](https://hassothea.github.io/Data_Analytics_AUPP/slides/Model_Evaluation_Refinement.html#/overcoming-overfitting-5).\n",
    "- Compute Test MSE and compare it to the previous models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**C.** Perform $10$-fold Cross-validation to tune the best penalization strength $\\alpha$ of Lasso regression model for predicting `Rings`. Hint: see [slide 24](https://hassothea.github.io/Data_Analytics_AUPP/slides/Model_Evaluation_Refinement.html#/overcoming-overfitting-5).\n",
    "- Compute Test MSE and compare it to the previous models.\n",
    "- How many inputs are retained by Lasso?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To do"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Further readings\n",
    "- Graphical tools:\n",
    "    - [`matplotlib`](https://matplotlib.org/stable/index.html)\n",
    "    - [`seaborn`](https://seaborn.pydata.org/)\n",
    "    - [`plotly`](https://plotly.com/python/)\n",
    "- [Ridge, Sklearn](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.Ridge.html)\n",
    "- [Lasso, Sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)\n",
    "- [Chapter 5, Introduction to Statistical Learning, James et al. (2023)](https://www.statlearning.com/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}