# **Lab5: Logistic Regression**

**Course**: **M2-DAS: Advanced Machine Learning** <br>
**Lecturer**: **Dr. Sothea HAS**

-----

**Objective:** In this lab, you will learn how to build a binary logistic regression model to predict `heart failure` in patients. You will also learn to detect informative features so that the constructed models reach their full potential, and you will see that **quantitative features** are not always the most important ones in building a good predictive model. All types of data have to be treated carefully.

- The `notebook` of this `TP` can be downloaded here: [Lab5_Logistic_Regression.ipynb](https://hassothea.github.io/Advanced-Machine-Learning-ITC/TPs/Lab5_Logistic_Regression.ipynb){target="_blank"}.

-----


## **1. Heart Failure Prediction**

We will work with the [Kaggle Heart Failure Dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction) as introduced in [TP1](https://hassothea.github.io/Advanced-Machine-Learning-ITC/TPs/Lab4_NBC.html){target='_blank'}. You may reuse the preprocessing steps from the previous work.

In [2]:
import kagglehub
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("fedesoriano/heart-failure-prediction")
data = pd.read_csv(path + "/heart.csv")
data.head()

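|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |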


### **1.1. Binary Logistic Regression**
- Split the data into $80\%$-training and $20\%$-testing data.
- We start from **feature transformation:**
    - Perform one-hot encoding for all the qualitative variables.
    - Standardize all the inputs.
- Construct 4 binary logistic regression models on the $80\%$-training data using different sets of inputs:
    - `lg_quan`: logistic regression using only quantitative inputs.
    - `lg_qual`: logistic regression using only qualitative inputs (use one-hot encoding: `pd.get_dummies()`).
    - `lg_eda`: logistic regression using your selected inputs.
    - `lg_full`: logistic regression using all inputs.
- Measure their performance on the corresponding testing data. Compare the results to the result of NBC from the previous **TP**.
- Comment on what you observe.
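
The steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the expected solution: it uses a small synthetic table (`toy`, hypothetical) that borrows a few column names from the heart data so that it runs without downloading anything, and the choice of `drop_first=True` and `max_iter=1000` is ours. `lg_eda` is left out because it depends on the inputs you select from your own EDA.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the heart data (synthetic values, real column names).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Age": rng.integers(30, 80, n),
    "MaxHR": rng.integers(80, 200, n),
    "Sex": rng.choice(["M", "F"], n),
    "ST_Slope": rng.choice(["Up", "Flat", "Down"], n),
})
# Synthetic target that depends on both quantitative and qualitative inputs.
logit = 0.05 * (toy["Age"] - 55) - 0.02 * (toy["MaxHR"] - 140) + 1.5 * (toy["ST_Slope"] == "Flat")
toy["HeartDisease"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

quan_cols = ["Age", "MaxHR"]
qual_cols = ["Sex", "ST_Slope"]

# One-hot encode the qualitative variables, then split 80% / 20%.
X = pd.get_dummies(toy.drop(columns="HeartDisease"), columns=qual_cols,
                   drop_first=True, dtype=float)
y = toy["HeartDisease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

dummy_cols = [c for c in X.columns if c not in quan_cols]
feature_sets = {
    "lg_quan": quan_cols,
    "lg_qual": dummy_cols,
    "lg_full": list(X.columns),
}

# The scaler is fitted on the training data only, inside each pipeline.
test_accuracy = {}
for name, cols in feature_sets.items():
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train[cols], y_train)
    test_accuracy[name] = model.score(X_test[cols], y_test)

print(test_accuracy)
```

Wrapping `StandardScaler` and `LogisticRegression` in one pipeline keeps the test data out of the standardization step, so the reported test accuracy is not contaminated by information from the test set.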

In [None]:
# To do

### **1.2. Polynomial Features**

Based on the results of the EDA, one may try to further improve the performance of the model, either through feature engineering or by handling the issues recorded in your problem list. Here, we will try feature engineering.

**Tasks:**

- **Quadratic features:** Build a model by introducing quadratic features of some selected variables, i.e., $X_1, X_2, X_3\to X_1^2, X_2^2, X_3^2, X_1X_2, X_1X_3, X_2X_3$. Test its performance on the test data.
- **Penalty parameter C:** As more features are created, the model becomes more flexible and may overfit, so it is recommended to fine-tune the penalty parameter $C$ as well.
    - **Random choice:** Try varying $C$, for example $C=0.01$, as follows: `LogisticRegression(C=0.01)`. Fit the model to the training data, then measure its performance on the test data.
    - **Search for the best $C$:** Now search for the best $C$, and report the test performance of the model built with this optimal value.
- Compare the performance of all models. 
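
One possible way to combine quadratic features with a search over $C$ is sketched below. It is a sketch under assumptions rather than the required approach: the data come from `make_classification` (synthetic, three inputs standing in for $X_1, X_2, X_3$), and the grid of candidate values and the 5-fold cross-validation are arbitrary choices. Recall that in scikit-learn `C` is the inverse of the regularization strength, so smaller values penalize more strongly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with three inputs (an assumption made for illustration only).
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# degree=2 adds the squares and pairwise products to the original inputs.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# Candidate values of C, chosen arbitrarily on a log scale.
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # cross-validation uses the training data only

best_C = search.best_params_["logisticregression__C"]
test_acc = search.score(X_test, y_test)  # refitted best model on the test data
n_poly = search.best_estimator_.named_steps["polynomialfeatures"].n_output_features_

print(f"best C = {best_C}, test accuracy = {test_acc:.3f}, features = {n_poly}")
```

Because the search only sees the training folds, the test set is touched once, at the very end, which keeps the final accuracy an honest estimate.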

In [None]:
# To do

## **2. Logistic Regression on Email Spam Dataset**

The `spam` dataset contains the frequencies of some common words in each email together with the email's class ('spam' or 'nonspam'). The following code imports this data into our environment.

In [None]:
import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data = data.drop(columns=['Id'])
data.head(5)

- Inspect the dataset to find missing values and proportion of spam and nonspam emails.
- Split the data into training and testing parts.
- Apply the techniques you used in the previous part to identify spam emails.
- Evaluate model performance on the test data using suitable metrics: accuracy, recall, precision and F1 score.
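
The checks and metrics listed above might look like the sketch below. To stay runnable offline it builds a small synthetic table (`emails`, hypothetical, with made-up word columns and a made-up `label` column) instead of reading `spam.txt`, so the column names will differ from the real data; treating 'spam' as the positive class is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical word-frequency table standing in for the spam data.
rng = np.random.default_rng(1)
n = 600
label = rng.choice(["spam", "nonspam"], size=n, p=[0.4, 0.6])
is_spam = label == "spam"
emails = pd.DataFrame({
    "free": rng.exponential(0.2, n) + 0.6 * is_spam,
    "money": rng.exponential(0.1, n) + 0.4 * is_spam,
    "meeting": rng.exponential(0.3, n) + 0.5 * ~is_spam,
    "label": label,
})

# Inspect missing values and the proportion of each class.
n_missing = int(emails.isna().sum().sum())
class_share = emails["label"].value_counts(normalize=True)
print(n_missing)
print(class_share)

# Encode 'spam' as the positive class (1) and split the data.
X = emails.drop(columns="label")
y = (emails["label"] == "spam").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision, recall and F1 are computed for the positive class (spam).
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(metrics)
```

When the classes are imbalanced, accuracy alone can look good even if many spam emails are missed, which is why recall, precision and F1 are reported alongside it.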

# **Further Reading**

$^{\text{📚}}$  `Pandas` python library: [https://pandas.pydata.org/docs/getting_started/index.html#getting-started](https://pandas.pydata.org/docs/getting_started/index.html#getting-started) <br>
$^{\text{📚}}$  `Pandas Cheatsheet`: [https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) <br>
$^{\text{📚}}$  `10 Minutes to Pandas`: [https://pandas.pydata.org/docs/user_guide/10min.html](https://pandas.pydata.org/docs/user_guide/10min.html) <br>
$^{\text{📚}}$  `Some Pandas Lessons`: [https://www.kaggle.com/learn/pandas](https://www.kaggle.com/learn/pandas) <br>
$^{\text{📚}}$ [Chapter 4, *Introduction to Statistical Learning with R*, James et al. (2021).](https://www.sas.upenn.edu/~fdiebold/NoHesitations/BookAdvanced.pdf){target="_blank"}. <br>
$^{\text{📚}}$ [Chapter 2, *The Elements of Statistical Learning*, Hastie et al. (2008).](https://www.sas.upenn.edu/~fdiebold/NoHesitations/BookAdvanced.pdf){target="_blank"}. <br>
$^{\text{📚}}$ [Friedman (1989)](http://www.leg.ufpr.br/~eferreira/CE064/Regularized%20Discriminant%20Analysis.pdf){target="_blank"}. <br>
$^{\text{📚}}$ [Heart Disease Dataset](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset){target="_blank"}. <br>
$^{\text{📚}}$ [Different Type of Correlation Metrics Used by Data Scientists, Ashray](https://www.analyticsvidhya.com/blog/2021/09/different-type-of-correlation-metrics-used-by-data-scientist/){target="_blank"}.
