Course: M2-DAS: Advanced Machine Learning Lecturer: Dr. Sothea HAS
Objective: In this lab, you will learn how to build Binary Logistic Regression model to predict heart failure patients. Not only that, you will learn to detect informative features for maximizing the potential of the constructed models. You will also see that quantitative features are not always the most important ones in building a good predictive model. You have to treat all types of data carefully.
Split the data into \(80\%\)-training and \(20\%\)-testing data.
We start from feature transformation:
Perform one-hot encoding for all the qualitative variables.
Standardize all the inputs.
Construct 4 Binary Logistic Regression models on the 80%-Training using different options of inputs:
lg_quan: logistic regression using only quantitative inputs.
lg_qual: logistic regression using only qualitative inputs (use one-hot encoding: pd.get_dummies()).
lg_eda: logistic regression using your selected inputs.
lg_full: logistic regression using all inputs.
Measure their performance on the corresponding testing data. Compare the results to the result of NBC from the previous TP.
Comment on what you observe.
# To do
1.2. Polynomial Features
Based on the result of EDA, one may try to further elevate the performance of the model using feature engineering or handle problems stored in your problem list. Here, we will try feature engineering.
Tasks:
Quadratic features: Build a model by introducing quadratic features of some selected variable i.e., \(X_1, X_2, X_3\to X_1^2, X_2^2, X_3^3, X_1X_2, X_1X_3, X_2X_3\). Test its performance on the test data.
Penalty parameter C: When more features are created, the model will naturally become too flexible, it’s recommended to fine-tune penalty parameter \(C\) in this case as well.
Random choice: Try varying parameter \(C\), for example, \(C=0.01\) as follow LogisticRegression(C=0.01). Fit the model to the training data then test its performance on the testing data. Measure its test performance.
Search for the best \(C\): Now, try to search for the best \(C\) and report the performance on the test data of the model built with the optimal value of \(C\). Measure the performance of the model on the test data.
Compare the performance of all models.
# To do
2. Logistic Regression on Email Spam Dataset
The spam dataset contains frequency of some common words and its class (‘spam’ or ‘nonspam’). The following code allows you to import this data into our environment.