Ensemble Learning



Abalone dataset

The dataset is available here: https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset.

import kagglehub

# Download latest version
path = kagglehub.dataset_download("rodolfomendes/abalone-dataset")

# Import data
import pandas as pd
data = pd.read_csv(path + "/abalone.csv")
data.head()
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.6), please consider upgrading to the latest version (0.3.7).
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
print(f"Number of rows: {data.shape[0]}")
print(f"Number of columns: {data.shape[1]}")
Number of rows: 4177
Number of columns: 9

1. EDA

# Data types
data.dtypes
Sex                object
Length            float64
Diameter          float64
Height            float64
Whole weight      float64
Shucked weight    float64
Viscera weight    float64
Shell weight      float64
Rings               int64
dtype: object
data.describe()
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
count 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000
mean 0.523992 0.407881 0.139516 0.828742 0.359367 0.180594 0.238831 9.933684
std 0.120093 0.099240 0.041827 0.490389 0.221963 0.109614 0.139203 3.224169
min 0.075000 0.055000 0.000000 0.002000 0.001000 0.000500 0.001500 1.000000
25% 0.450000 0.350000 0.115000 0.441500 0.186000 0.093500 0.130000 8.000000
50% 0.545000 0.425000 0.140000 0.799500 0.336000 0.171000 0.234000 9.000000
75% 0.615000 0.480000 0.165000 1.153000 0.502000 0.253000 0.329000 11.000000
max 0.815000 0.650000 1.130000 2.825500 1.488000 0.760000 1.005000 29.000000
  • There are missing values (abalone with 0 heights). We shall handle this.
data = data.loc[data.Height > 0]
data.describe()
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
count 4175.000000 4175.00000 4175.000000 4175.000000 4175.000000 4175.000000 4175.000000 4175.000000
mean 0.524065 0.40794 0.139583 0.829005 0.359476 0.180653 0.238834 9.935090
std 0.120069 0.09922 0.041725 0.490349 0.221954 0.109605 0.139212 3.224227
min 0.075000 0.05500 0.010000 0.002000 0.001000 0.000500 0.001500 1.000000
25% 0.450000 0.35000 0.115000 0.442250 0.186250 0.093500 0.130000 8.000000
50% 0.545000 0.42500 0.140000 0.800000 0.336000 0.171000 0.234000 9.000000
75% 0.615000 0.48000 0.165000 1.153500 0.502000 0.253000 0.328750 11.000000
max 0.815000 0.65000 1.130000 2.825500 1.488000 0.760000 1.005000 29.000000
print(f'Number of duplicated data: {data.duplicated().sum()}')
Number of duplicated data: 0

1.1 Univariate Analysis

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

quan_cols = data.select_dtypes(include="number").columns
_, axs = plt.subplots(2,4, figsize=(16, 8))
for i, va in enumerate(quan_cols):
  sns.boxplot(data=data, y=va, ax=axs[i//4, i % 4])
  axs[i//4, i % 4].set_title(f"Distribution of {va}")
plt.show()

sns.countplot(data=data, x="Sex")
plt.show()

1.2 Bivariate Analysis

data.select_dtypes(include='number').corr()
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
Length 1.000000 0.986802 0.828108 0.925217 0.897859 0.902960 0.898419 0.556464
Diameter 0.986802 1.000000 0.834298 0.925414 0.893108 0.899672 0.906084 0.574418
Height 0.828108 0.834298 1.000000 0.819886 0.775621 0.798908 0.819596 0.557625
Whole weight 0.925217 0.925414 0.819886 1.000000 0.969389 0.966354 0.955924 0.540151
Shucked weight 0.897859 0.893108 0.775621 0.969389 1.000000 0.931924 0.883129 0.420597
Viscera weight 0.902960 0.899672 0.798908 0.966354 0.931924 1.000000 0.908186 0.503562
Shell weight 0.898419 0.906084 0.819596 0.955924 0.883129 0.908186 1.000000 0.627928
Rings 0.556464 0.574418 0.557625 0.540151 0.420597 0.503562 0.627928 1.000000
data.select_dtypes(include='number').corr(method="spearman")
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
Length 1.000000 0.983299 0.888124 0.972600 0.956790 0.952600 0.948662 0.603954
Diameter 0.983299 1.000000 0.895651 0.971290 0.950422 0.948328 0.954896 0.622487
Height 0.888124 0.895651 1.000000 0.915963 0.874163 0.900535 0.922130 0.657403
Whole weight 0.972600 0.971290 0.915963 1.000000 0.977038 0.975222 0.970197 0.630440
Shucked weight 0.956790 0.950422 0.874163 0.977038 1.000000 0.947581 0.918465 0.538956
Viscera weight 0.952600 0.948328 0.900535 0.975222 0.947581 1.000000 0.938893 0.613930
Shell weight 0.948662 0.954896 0.922130 0.970197 0.918465 0.938893 1.000000 0.692991
Rings 0.603954 0.622487 0.657403 0.630440 0.538956 0.613930 0.692991 1.000000
plt.figure(figsize=(16,16))
sns.pairplot(data=data, hue="Sex")
plt.show()
<Figure size 1600x1600 with 0 Axes>

Height has low correlation with the target Rings but this is likely due to the 2 weird data point with large height but small rings. One can remove these data points. Moreover the target spread widely at large values of inputs. One can scale the target to improve the connection.

data1 = data
import numpy as np
data.drop(data[data['Height'] > 0.5].index, inplace=True)
data1['Root_Rings'] = data['Rings'].map(lambda x: np.sqrt(x))
sns.pairplot(data=data1, hue="Sex")
plt.show()

data1.drop(columns=['Sex']).corr()
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings Root_Rings
Length 1.000000 0.986794 0.900868 0.925328 0.898129 0.903033 0.898363 0.556572 0.609433
Diameter 0.986794 1.000000 0.907187 0.925499 0.893330 0.899716 0.906026 0.574551 0.626043
Height 0.900868 0.907187 1.000000 0.888850 0.837485 0.866757 0.891857 0.610107 0.651990
Whole weight 0.925328 0.925499 0.888850 1.000000 0.969370 0.966290 0.955954 0.540621 0.574118
Shucked weight 0.898129 0.893330 0.837485 0.969370 1.000000 0.931831 0.883194 0.421156 0.460780
Viscera weight 0.903033 0.899716 0.866757 0.966290 0.931831 1.000000 0.908133 0.503977 0.540612
Shell weight 0.898363 0.906026 0.891857 0.955954 0.883194 0.908133 1.000000 0.628169 0.654175
Rings 0.556572 0.574551 0.610107 0.540621 0.421156 0.503977 0.628169 1.000000 0.991914
Root_Rings 0.609433 0.626043 0.651990 0.574118 0.460780 0.540612 0.654175 0.991914 1.000000
data1.drop(columns=['Sex']).corr(method="spearman")
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings Root_Rings
Length 1.000000 0.983282 0.888776 0.972587 0.956831 0.952555 0.948606 0.603924 0.603924
Diameter 0.983282 1.000000 0.896281 0.971271 0.950446 0.948276 0.954847 0.622477 0.622477
Height 0.888776 0.896281 1.000000 0.916451 0.874402 0.901073 0.922828 0.658139 0.658139
Whole weight 0.972587 0.971271 0.916451 1.000000 0.977044 0.975198 0.970182 0.630489 0.630489
Shucked weight 0.956831 0.950446 0.874402 0.977044 1.000000 0.947578 0.918476 0.539061 0.539061
Viscera weight 0.952555 0.948276 0.901073 0.975198 0.947578 1.000000 0.938835 0.613942 0.613942
Shell weight 0.948606 0.954847 0.922828 0.970182 0.918476 0.938835 1.000000 0.693018 0.693018
Rings 0.603924 0.622477 0.658139 0.630489 0.539061 0.613942 0.693018 1.000000 1.000000
Root_Rings 0.603924 0.622477 0.658139 0.630489 0.539061 0.613942 0.693018 1.000000 1.000000

2. Models

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

2.1 Bagging: Random Forest & Extra-Trees

import numpy as np
x_cat = pd.get_dummies(data["Sex"], drop_first=True)
X = pd.concat([x_cat, data.iloc[:,1:-1]], axis=1)
y = data['Rings'].map(lambda x: np.sqrt(x))
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(2921, 10) (2921,)
(1252, 10) (1252,)
rf = RandomForestRegressor(n_estimators=300)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_rf))}")
RMSE: 0.004429570403170732
ext = ExtraTreesRegressor(n_estimators=300)
ext.fit(X_train, y_train)
y_pred_ex = ext.predict(X_test)
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_ex))}")
RMSE: 0.002914711157601805
  • GridsearchCV Random Forest
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [500, 800, 1000],
    'min_samples_leaf' : [2, 3, 5, 7, 10],
    'max_features' : np.linspace(5, X_train.shape[1], 3, dtype=int)
}

cv = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid=param_grid, cv=5, 
    scoring='neg_mean_squared_error')
cv.fit(X_train, y_train)
rf_cv = RandomForestRegressor(**cv.best_params_).fit(X_train, y_train)
y_pred_rf_cv = rf_cv.predict(X_test)
rmse_rf_cv = np.sqrt(mean_squared_error(y_test, y_pred_rf_cv))
df_rmse = pd.DataFrame(
    {'RF_CV': [rmse_rf_cv]}
)
df_rmse
RF_CV
0 0.007523
print(f'* Optimal parameter: {cv.best_params_}')
* Optimal parameter: {'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 500}
  • Extra-trees
cv = GridSearchCV(
    estimator=ExtraTreesRegressor(), 
    param_grid=param_grid, cv=5, 
    scoring='neg_mean_squared_error')
cv.fit(X_train, y_train)

ext_cv = ExtraTreesRegressor(**cv.best_params_).fit(X_train, y_train)
y_pred_ex_cv = ext_cv.predict(X_test)
rmse_ext_cv = np.sqrt(mean_squared_error(y_test, y_pred_ex_cv))
df_rmse = pd.concat([df_rmse, pd.DataFrame(
                    {'Extra-trees_CV': [rmse_ext_cv]})], 
                    axis=1)
df_rmse
RF_CV Extra-trees_CV
0 0.007523 0.005515
print(f'* Optimal parameter: {cv.best_params_}')
* Optimal parameter: {'max_features': 10, 'min_samples_leaf': 2, 'n_estimators': 500}

2.2 Boosting: XGBoost

from xgboost.sklearn import XGBRegressor

xgb_model = XGBRegressor(objective="reg:squarederror")
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
print(f"RMSE without CV: {np.sqrt(mean_squared_error(y_test, y_pred_xgb))}")
RMSE without CV: 0.005525182469075176

GridseachCV for XGBoost

from sklearn.model_selection import KFold
from itertools import product

param_grid = {
        'gamma': [0.01, 0.1, 0.2, 0.3],
        'subsample': [0.8, 0.9, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [2, 3, 4],
        'n_estimators': [100, 200]
}

n_cv = 5
kf = KFold(n_splits=n_cv, shuffle=True, random_state=42)
xgb = XGBRegressor(objective="reg:squarederror")
# Perform parameter search manually
list_params = list(product(*param_grid.values()))
mse_cv = np.zeros(shape=(len(list_params),))
j = 1
for train_index, test_index in kf.split(X_train):
  X_tr, X_te = X_train.iloc[train_index], X_train.iloc[test_index]
  y_tr, y_te = y_train.iloc[train_index], y_train.iloc[test_index]
  mse_scores = np.zeros(shape=(len(list_params),))
  for i,params in enumerate(list_params):
    param_dict = dict(zip(param_grid.keys(), params))
    model = XGBRegressor(**param_dict)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    mse = mean_squared_error(y_te, y_pred)
    mse_scores[i] = mse
  mse_cv = mse_cv + mse_scores
  print(f"* Fold: {j} / {n_cv}")
  j += 1
* Fold: 1 / 5
* Fold: 2 / 5
* Fold: 3 / 5
* Fold: 4 / 5
* Fold: 5 / 5
mse_cv /= n_cv
opt_param = dict(zip(param_grid.keys(), list_params[np.argmin(mse_cv)]))
print(opt_param)
model = XGBRegressor(**opt_param)
model = model.fit(X_train.to_numpy(), y_train.to_numpy())
y_pred_xgb_cv = model.predict(X_test.to_numpy())
rmse_xgb_cv = np.sqrt(mean_squared_error(y_test, y_pred_xgb_cv))
df_rmse = pd.concat([df_rmse, pd.DataFrame(
                    {'XGB_CV': [rmse_xgb_cv]})], 
                    axis=1)
df_rmse
{'gamma': 0.01, 'subsample': 1.0, 'colsample_bytree': 1.0, 'max_depth': 4, 'n_estimators': 200}
RF_CV Extra-trees_CV XGB_CV
0 0.007523 0.005515 0.008281

2.3 Stacking: Super Learner

# %pip install gradientcobra
from gradientcobra.superlearner import SuperLearner
from sklearn.preprocessing import StandardScaler
sl = SuperLearner()
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(X_train)
x_test_scaled = scaler.transform(X_test)
sl_fit = sl.fit(X_train, y_train)
sl_fit.draw_learning_curve()
print(f'The selected meta learner is {sl_fit.SuperLearner}')
The selected meta learner is LinearRegression()
y_pred_sl = sl_fit.predict(X_test)
rmse_sl_cv = np.sqrt(mean_squared_error(y_test, y_pred_sl))
df_rmse = pd.concat([df_rmse, pd.DataFrame(
                    {'SLearner_CV': [rmse_sl_cv]})], 
                    axis=1)
df_rmse
RF_CV Extra-trees_CV XGB_CV SLearner_CV
0 0.007523 0.005515 0.008281 0.003753

2.4 Consensual Aggregation: GradientCOBRA

from gradientcobra.gradientcobra import GradientCOBRA

gc = GradientCOBRA(opt_method="grid",
                   bandwidth_list=np.linspace(0.01, 100, 500))
gc_fit = gc.fit(X_train, y_train)
* Grid search progress: 100%|██████████| 500/500 [00:39<00:00, 12.64it/s]
print("Estimated bandwidth :" + str(gc_fit.optimization_outputs['opt_bandwidth'][0]))
Estimated bandwidth :53.71204408817634
import plotly.io as pio
pio.renderers.default = 'notebook'
gc_fit.draw_learning_curve()
y_pred_gc = gc_fit.predict(X_test)
rmse_gc_cv = np.sqrt(mean_squared_error(y_test, y_pred_gc))
df_rmse = pd.concat([df_rmse, pd.DataFrame(
                    {'': [rmse_gc_cv]})], 
                    axis=1)
df_rmse
RF_CV Extra-trees_CV XGB_CV SLearner_CV GradientCOBRA
0 0.007523 0.005515 0.008281 0.003753 0.001945

E. Deep Learning

# This is an example with Keras
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense, Input, Dropout

from keras.callbacks import Callback

# Input
d = x_train_scaled.shape[1]

model = Sequential()
model.add(Input(shape=(d,)))

# To do
model.add(Dense(16, activation="relu"))
model.add(Dense(8, activation="relu"))
model.add(Dense(1, activation="linear"))

model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mse'])

# I only print every N epochs
class custom_callback(Callback):
    def __init__(self, N):
        super(custom_callback, self).__init__()
        self.N = N

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.N == 0:
            print(f'Epoch {epoch + 1}: loss = {logs["loss"]}, MSE = {logs["mse"]}')
            
print_callback = custom_callback(500)
history = model.fit(x_train_scaled, y_train, epochs=3000, batch_size=256, validation_split=0.1, verbose=0, callbacks=[print_callback])

train_loss = history.history['loss']
val_loss = history.history['val_loss']

import plotly.io as pio
pio.renderers.default = 'notebook'
import plotly.graph_objs as go
# Plot the learning curves
epochs = list(range(1, len(train_loss) + 1))
fig1 = go.Figure(go.Scatter(x=epochs, y=train_loss, name="Training loss"))
fig1.add_trace(go.Scatter(x=epochs, y=val_loss, name="Training loss"))
fig1.update_layout(title="Training and Validation Loss",
                   width=800, height=500,
                   xaxis=dict(title="Epoch", type="log"),
                   yaxis=dict(title="Loss"))
fig1.show()
Epoch 500: loss = 0.00043223437387496233, MSE = 0.00043223437387496233
Epoch 1000: loss = 0.0001070614525815472, MSE = 0.0001070614525815472
Epoch 1500: loss = 6.664307875325903e-05, MSE = 6.664307875325903e-05
Epoch 2000: loss = 5.8001125580631196e-05, MSE = 5.8001125580631196e-05
Epoch 2500: loss = 5.7669451052788645e-05, MSE = 5.7669451052788645e-05
Epoch 3000: loss = 3.066043063881807e-05, MSE = 3.066043063881807e-05
y_pred_dnn = model.predict(x_test_scaled)
rmse_dnn = np.sqrt(mean_squared_error(y_test, y_pred_dnn))
rmse_dnn
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
0.005447119682704804
y_pred_dnn = model.predict(x_test_scaled)
rmse_dnn = np.sqrt(mean_squared_error(y_test, y_pred_dnn))
df_rmse = pd.concat([df_rmse, pd.DataFrame(
                    {'DNN': [rmse_dnn]})], 
                    axis=1)
df_rmse
40/40 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
RF_CV Extra-trees_CV XGB_CV SLearner_CV GradientCOBRA DNN
0 0.007523 0.005515 0.008281 0.003753 0.001945 0.005447

2. Kaggle Stroke Dataset

Stroke, also known as a cerebrovascular accident (CVA), occurs when blood flow to a part of the brain is interrupted or reduced, depriving brain tissue of oxygen and nutrients. This dataset contains information such as age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, and body mass index (BMI). The goal is to use this data to build predictive models that can help identify individuals at high risk of stroke, enabling early intervention and potentially saving lives. It is a very highly imbalanced dataset, you may face challenges in building a model. Random sampling and weighting methods may be considered. For more information, see: Kaggle Stroke Dataset.

path = kagglehub.dataset_download("fedesoriano/stroke-prediction-dataset")

data = pd.read_csv(path + '/healthcare-dataset-stroke-data.csv')
data.head()
Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.6), please consider upgrading to the latest version (0.3.7).
id gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke
0 9046 Male 67.0 0 1 Yes Private Urban 228.69 36.6 formerly smoked 1
1 51676 Female 61.0 0 0 Yes Self-employed Rural 202.21 NaN never smoked 1
2 31112 Male 80.0 0 1 Yes Private Rural 105.92 32.5 never smoked 1
3 60182 Female 49.0 0 0 Yes Private Urban 171.23 34.4 smokes 1
4 1665 Female 79.0 1 0 Yes Self-employed Rural 174.12 24.0 never smoked 1

Reader should try to implement ensemble learning methods to predict this stroke condition of this dataset.

References

\(^{\text{📚}}\) Bagging predictors, Breiman (1996).
\(^{\text{📚}}\) The strength of weak learnability, Robert E. Schapire (1990)..
\(^{\text{📚}}\) COBRA: A combined regression strategy, Beau et al. (2016).
\(^{\text{📚}}\) Gradient COBRA: A kernel-based consensual aggregation for regression, Has (2023).
\(^{\text{📚}}\) Aggregation using input–output trade-off, Fischer & Mougeot (2019).
\(^{\text{📚}}\) Super Learner, M. J. Van der Laan (2007).