TP7 - Linear & Logistic Regression

Course: INF-604: Data Analysis
Lecturer: Sothea HAS, PhD


Objective: In this lab, you will reproduce the results we obtained in class for linear and logistic regression. You will then go further: use multiple input variables, compare model performance, and apply feature engineering to improve it.


I. Linear Regression

We will work with the Kaggle Auto-MPG dataset. First, import the dataset into your environment.

# To do

1. Data Preprocessing and EDA

A. Univariate Analysis:

  • Check for columns with the wrong data type and correct them.
  • What’s wrong with the horsepower column? How would you solve this problem?
  • Make sure all columns have the correct data type, as shown in slide 7.
  • Compute descriptive statistics of the data and visualize the distributions, as illustrated in slides 8 and 9.
# To do
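
As a hint for the data-type step, here is a minimal sketch of how a numeric column stored as text can be repaired with pandas. The toy values and the "?" placeholder are assumptions made for illustration; inspect the real column before deciding how to treat it, and note that median imputation is only one of several reasonable choices.

```python
import pandas as pd

# Toy frame with made-up values: a numeric column stored as text because
# of a non-numeric placeholder, plus a coded categorical column.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 27.0],
    "horsepower": ["130", "165", "?", "88"],   # "?" is an assumed placeholder
    "origin": [1, 1, 3, 2],
})

print(df.dtypes)  # horsepower is read as text, not as a number

# Convert to numeric; anything that cannot be parsed becomes NaN.
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

# One simple option: replace missing values with the column median.
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())

# Integer codes that stand for groups are better stored as categories.
df["origin"] = df["origin"].astype("category")

# Descriptive statistics for numeric and categorical columns together.
summary = df.describe(include="all")
print(summary)
```

Dropping the affected rows instead of imputing is equally defensible when only a few values are missing.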

B. Bivariate Analysis:

  • Compute the correlation matrix and draw a pairplot (see slide 10).
  • Reproduce the conditional boxplot showing the influence of origin on mpg, as illustrated in slide 11.
# To do
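
The sketch below shows one way to obtain the numbers behind these plots with pandas. The data are synthetic (generated here only so the example runs on its own), and the column names mpg, weight and origin simply mirror the ones mentioned in this lab.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, NOT the real dataset.
rng = np.random.default_rng(0)
n = 200
weight = rng.uniform(1600, 5000, n)
origin = rng.choice([1, 2, 3], size=n)
mpg = 46 - 0.0076 * weight + 2.0 * (origin - 1) + rng.normal(0, 3, n)
df = pd.DataFrame({"mpg": mpg, "weight": weight, "origin": origin})

# Pearson correlation matrix of the numeric columns.
corr = df[["mpg", "weight"]].corr()
print(corr.round(2))

# Quartiles of mpg within each origin group: the values a conditional
# boxplot displays as the box edges and the median line.
box_stats = df.groupby("origin")["mpg"].describe()[["25%", "50%", "75%"]]
print(box_stats)

# With seaborn installed, the plots themselves can be drawn with
# sns.pairplot(df) and sns.boxplot(data=df, x="origin", y="mpg").
```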

2. Simple Linear Regression (SLR)

  • Pick one input (other than weight) and build a linear regression model to predict mpg.
  • Draw a scatterplot of that input against mpg, then add the fitted line to it.
  • Compute \(R^2\) and examine the residuals of the model. Explain the result.
  • Test whether the slope coefficient of your model is significantly different from 0 at the 5% significance level (95% confidence).
  • Interpret the model.
# To do
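
For reference, the quantities asked for above can be computed by hand as in the following sketch. It uses synthetic data and generic names x and y (both assumptions), and relies on SciPy only for the t distribution; a statistics package would report the same numbers in its regression summary.

```python
import numpy as np
from scipy import stats

# Synthetic data standing in for one input x and the target y.
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(50, 200, n)
y = 40 - 0.15 * x + rng.normal(0, 4, n)

# Least-squares estimates of the slope b1 and intercept b0.
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
b0 = y_bar - b1 * x_bar

# Fitted values, residuals and R^2 = 1 - SS_res / SS_tot.
y_hat = b0 + b1 * x
resid = y - y_hat
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y - y_bar) ** 2)
r2 = 1 - ss_res / ss_tot

# t-test of H0: slope = 0, with n - 2 degrees of freedom.
se_b1 = np.sqrt(ss_res / (n - 2) / sxx)
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# 95% confidence interval for the slope.
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(f"slope={b1:.3f}, intercept={b0:.3f}, R^2={r2:.3f}, p={p_value:.3g}")
print(f"95% CI for the slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The slope is declared significant at the 5% level when the p-value is below 0.05, or equivalently when the 95% confidence interval excludes 0. Plotting resid against y_hat is a common way to check the residuals for patterns.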

3. Multiple Linear Regression (MLR)

A. All at once:

  • Build a multiple linear regression model using all inputs.
  • Compute \(R^2\) and \(R^2_{\text{adj}}\) and explain the result.
  • Examine the residuals. Explain the result.
  • Are all inputs significantly related to the target mpg?
# To do
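
scikit-learn reports \(R^2\) but not \(R^2_{\text{adj}}\), so the adjusted value has to be computed from its definition, \(R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), where \(n\) is the number of observations and \(p\) the number of inputs. A minimal sketch on synthetic data (the data and variable names are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 4 inputs, of which only the first two affect y.
rng = np.random.default_rng(2)
n, p = 150, 4
X = rng.normal(size=(n, p))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1, n)

model = LinearRegression().fit(X, y)

# R^2 on the training data, then the adjusted version that penalizes
# the number of inputs p.
r2 = model.score(X, y)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Residuals, for a residual-versus-fitted plot or a histogram.
resid = y - model.predict(X)

print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
```

Adding an input never lowers \(R^2\) on the training data, which is why \(R^2_{\text{adj}}\) is the fairer number when comparing models with different numbers of inputs.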

B. Polynomial Features:

  • Return to the input you chose in SLR (call it \(X\)) and create input data consisting of \(X\) and its square \(X^2\). Build a multiple linear regression model to predict mpg using \(X\) and \(X^2\).
  • Draw the scatterplot and the fitted curve.
  • Compute \(R^2\) and \(R^{2}_{\text{adj}}\). Explain.
  • Examine the residuals and explain.
  • Conclude: which of the three models (SLR, MLR with all inputs, polynomial) do you think is the best, and why?
# To do
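
The polynomial model is still linear in its coefficients, so it can be fitted by ordinary least squares on the design matrix \([1, X, X^2]\). The sketch below compares a straight-line fit with a quadratic fit on synthetic data; the helper fit_r2 and all variable names are hypothetical, introduced only for this illustration.

```python
import numpy as np

# Synthetic data with a genuinely curved relationship.
rng = np.random.default_rng(3)
n = 120
x = rng.uniform(0, 10, n)
y = 30 - 4.0 * x + 0.25 * x**2 + rng.normal(0, 1, n)


def fit_r2(design, y):
    """Least-squares fit; returns coefficients, R^2 and adjusted R^2.

    `design` must already contain a column of ones for the intercept,
    so the number of inputs is p = k - 1 and n - p - 1 = n - k.
    """
    n_obs, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    r2_adj = 1 - (1 - r2) * (n_obs - 1) / (n_obs - k)
    return beta, r2, r2_adj


ones = np.ones(n)
beta_lin, r2_lin, adj_lin = fit_r2(np.column_stack([ones, x]), y)
beta_quad, r2_quad, adj_quad = fit_r2(np.column_stack([ones, x, x**2]), y)

# Fitted curve evaluated on a sorted grid, ready to be drawn over the
# scatterplot of x against y.
grid = np.linspace(x.min(), x.max(), 200)
curve = beta_quad[0] + beta_quad[1] * grid + beta_quad[2] * grid**2

print(f"linear:    R^2={r2_lin:.3f}, adj={adj_lin:.3f}")
print(f"quadratic: R^2={r2_quad:.3f}, adj={adj_quad:.3f}")
```

Evaluating the curve on a sorted grid rather than on the raw x values avoids a zig-zag line when plotting.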

II. Logistic Regression

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year (WHO). CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. This study aims to identify the most relevant risk factors of heart disease and to predict the overall risk using logistic regression.

We will explore the Kaggle Heart Disease dataset. Load the dataset into your environment.

# To do

1. Data Preprocessing and EDA

A. Univariate Analysis

  • Check for columns with the wrong data type and correct them.
  • Are there any missing values in this dataset? If so, compute the percentage of missing values in each affected column and handle them appropriately.
  • Compute descriptive statistics of the data and visualize the distributions.
  • Pick a few columns and describe their distributions.
# To do
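
A possible way to approach the missing-value question is sketched below on a toy frame. The values are made up and the column names are only assumed to resemble those of the real dataset; whether to drop or impute, and how, is a decision you should justify from the percentages you find.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and assumed column names.
df = pd.DataFrame({
    "age": [39, 46, 48, 61, 46],
    "glucose": [77.0, np.nan, 70.0, np.nan, 85.0],
    "education": [4.0, 2.0, np.nan, 3.0, 3.0],
})

# Percentage of missing values, keeping only the affected columns.
missing_pct = df.isna().mean().mul(100)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)

# Option 1: drop every row that has at least one missing value.
df_drop = df.dropna()

# Option 2: impute, using the median for a continuous column and the
# most frequent value for a categorical-like column.
df_imp = df.copy()
df_imp["glucose"] = df_imp["glucose"].fillna(df_imp["glucose"].median())
df_imp["education"] = df_imp["education"].fillna(df_imp["education"].mode().iloc[0])
```

Dropping rows is simplest but loses data; imputation keeps every row at the cost of slightly distorting the column's distribution.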

B. Bivariate Analysis

  • Compute the correlation matrix of the numerical columns.
  • Visualize the relationship between each column and the target TenYearCHD.
  • Draft a list of the columns that you think are most related to the target.
# To do

2. Build logistic regression models

A. Simple Logistic Regression Model

  • Split the data into \(80\%-20\%\) training and testing sets using train_test_split from the sklearn.model_selection module.
  • Build a logistic regression model called lg0 using only one selected input to predict the target TenYearCHD.
  • Evaluate the performance of lg0 on the test data using metrics such as accuracy, precision, recall and f1_score (from the sklearn.metrics module).
  • Write down the explicit formula of your model.
# To do
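
The steps above can be sketched as follows. The data are synthetic, the input name age is an assumption, and the random_state and stratify settings are illustrative choices rather than requirements of the lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: one input and a binary target, NOT the real dataset.
rng = np.random.default_rng(4)
n = 1000
age = rng.uniform(30, 70, n)
p_true = 1 / (1 + np.exp(-(-6.0 + 0.1 * age)))
y = rng.binomial(1, p_true)
X = age.reshape(-1, 1)          # scikit-learn expects a 2-D input array

# 80%-20% split; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Note: LogisticRegression applies L2 regularization by default.
lg0 = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = lg0.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
print(metrics)

# Explicit formula of the fitted model.
b0, b1 = lg0.intercept_[0], lg0.coef_[0, 0]
print(f"P(y = 1 | x) = 1 / (1 + exp(-({b0:.3f} + {b1:.3f} * x)))")
```

The printed formula is the sigmoid applied to the linear score \(b_0 + b_1 x\); it reproduces the probabilities returned by lg0.predict_proba.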

B. Multiple Logistic Regression Model

  • Build a logistic regression model (called lg1) to predict the target using just the columns drafted in part 1.B.
  • Measure its performance on the test data using all metrics.
  • Build another logistic regression model using all the columns (called lg2).
  • Compute its performance on the test data using all the metrics.
  • Conclude.
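
To support the conclusion, it helps to put the metrics of all models side by side. The sketch below does this on synthetic data; the helper evaluate, the column names x1, x2 and noise, and the use of StandardScaler are assumptions made for this illustration, not part of the lab statement.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative inputs and one pure-noise input.
rng = np.random.default_rng(5)
n = 1500
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
logit = (-1.0 + 1.5 * X["x1"] - 1.0 * X["x2"]).to_numpy()
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)


def evaluate(columns):
    """Fit a logistic regression on `columns` and score it on the test set."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train[columns], y_train)
    pred = model.predict(X_test[columns])
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }


# One row per model, one column per metric.
results = pd.DataFrame({
    "lg0": evaluate(["x1"]),
    "lg1": evaluate(["x1", "x2"]),
    "lg2": evaluate(["x1", "x2", "noise"]),
}).T
print(results.round(3))
```

When the positive class is rare, accuracy alone can look good even if recall is poor, so compare the models on all four metrics before concluding.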

Further readings