Objective: We've delved into advanced classification techniques such as Support Vector Machines (SVM) and Ensemble Methods, which combine high-variance or weak learners to create more robust models. Now, we'll focus on applying these techniques to real-world datasets. More importantly, we'll learn how to fine-tune each method's hyperparameters to optimize performance. In addition to classification, ensemble methods offer valuable insights into feature importance, an essential step in enhancing and refining our models.
In this section, we work with simulated datasets. We define the following functions for generating a dataset, visualizing it, and drawing a model's decision boundary.
Function 1: generate_2d_dataset
Code
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.datasets import make_classification, make_circles, make_moons

# Generate 2D binary classification datasets
def generate_2d_dataset(n_samples=500, boundary_type='linear', noise=0.1, random_state=42):
    """
    Generate a 2D binary classification dataset with linear or non-linear decision boundary.

    Parameters:
    -----------
    n_samples : int, default=500
        Number of samples to generate
    boundary_type : str, default='linear'
        Type of decision boundary: 'linear', 'circular', 'moons', 'spiral'
    noise : float, default=0.1
        Amount of noise to add to the data (0.0 to 1.0)
    random_state : int, default=42
        Random state for reproducibility

    Returns:
    --------
    X : ndarray of shape (n_samples, 2)
        Feature matrix
    y : ndarray of shape (n_samples,)
        Target labels (0 or 1)
    """
    np.random.seed(random_state)
    if boundary_type == 'linear':
        # Generate well-separated linearly separable data
        X, y = make_classification(
            n_samples=n_samples, n_features=2, n_redundant=0,
            n_informative=2, n_clusters_per_class=1,
            random_state=random_state,
            class_sep=2.5,  # Increased separation
            flip_y=noise    # Very low label noise
        )
        # Add minimal noise
        X += np.random.normal(0, noise, X.shape)
    elif boundary_type == 'circular':
        # Generate well-separated circular decision boundary
        X, y = make_circles(
            n_samples=n_samples,
            noise=noise,   # Reduced noise
            factor=0.4,    # Better separation between circles
            random_state=random_state
        )
    elif boundary_type == 'moons':
        # Generate well-separated moon-shaped decision boundary
        X, y = make_moons(
            n_samples=n_samples,
            noise=noise,   # Reduced noise
            random_state=random_state
        )
    elif boundary_type == 'spiral':
        # Generate better separated spiral decision boundary
        n_per_class = n_samples // 2
        theta = np.linspace(0, 3 * np.pi, n_per_class)  # Reduced spiral length
        # First spiral (class 0)
        r1 = theta / (1.5 * np.pi)  # Slower growth rate
        x1 = r1 * np.cos(theta)
        y1 = r1 * np.sin(theta)
        # Second spiral (class 1) - better separated
        x2 = r1 * np.cos(theta + np.pi)
        y2 = r1 * np.sin(theta + np.pi)
        # Combine data
        X = np.vstack([np.column_stack([x1, y1]), np.column_stack([x2, y2])])
        y = np.hstack([np.zeros(n_per_class), np.ones(n_per_class)])
        # Add minimal noise
        X += np.random.normal(0, noise * 0.3, X.shape)
        # Shuffle the data
        indices = np.random.permutation(len(X))
        X, y = X[indices], y[indices]
    else:
        raise ValueError("boundary_type must be one of: 'linear', 'circular', 'moons', 'spiral'")
    return X, y.astype(int)
Function 2: plot_decision_boundary
Code
import warnings
warnings.filterwarnings('ignore')

# Plot decision boundary and data
def plot_decision_boundary(X, y, model=None, title="Decision Boundary",
                           resolution=100, alpha_contour=0.7, alpha_points=0.8,
                           colorscale='RdYlBu', point_size=8, show_mesh=True):
    """
    Create an interactive decision boundary plot using Plotly with color gradients.

    Parameters:
    -----------
    X : array-like of shape (n_samples, 2)
        Feature matrix (must be 2D)
    y : array-like of shape (n_samples,)
        Target labels
    model : sklearn estimator or None, default=None
        Trained model that implements predict() and predict_proba() or
        decision_function(). If None, only plots the dataset without decision boundary.
    title : str, default="Decision Boundary"
        Title for the plot
    resolution : int, default=100
        Resolution of the decision boundary mesh
    alpha_contour : float, default=0.7
        Transparency of the decision boundary
    alpha_points : float, default=0.8
        Transparency of the data points
    colorscale : str, default='RdYlBu'
        Plotly colorscale for the decision boundary
    point_size : int, default=8
        Size of the scatter points
    show_mesh : bool, default=True
        Whether to show the decision boundary mesh (ignored if model is None)

    Returns:
    --------
    fig : plotly.graph_objects.Figure
        Interactive Plotly figure
    """
    # Create the figure
    fig = go.Figure()

    # Add decision boundary contour only if model is provided and show_mesh is True
    if model is not None and show_mesh:
        # Ensure model is fitted
        if not hasattr(model, 'predict'):
            raise ValueError("Model must be fitted and have a predict method")

        # Create mesh grid
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.linspace(x_min, x_max, resolution),
                             np.linspace(y_min, y_max, resolution))
        mesh_points = np.c_[xx.ravel(), yy.ravel()]

        # Get predictions for the mesh
        try:
            # Try to get probability predictions for smoother boundaries
            if hasattr(model, 'predict_proba'):
                Z = model.predict_proba(mesh_points)[:, 1]  # Probability of class 1
            elif hasattr(model, 'decision_function'):
                Z = model.decision_function(mesh_points)
                # Normalize decision function output to [0, 1] range
                Z = (Z - Z.min()) / (Z.max() - Z.min())
            else:
                Z = model.predict(mesh_points).astype(float)
        except Exception as e:
            print(f"Error getting model predictions: {e}")
            Z = model.predict(mesh_points).astype(float)
        Z = Z.reshape(xx.shape)

        # Add decision boundary contour
        fig.add_trace(go.Contour(
            x=np.linspace(x_min, x_max, resolution),
            y=np.linspace(y_min, y_max, resolution),
            z=Z,
            colorscale=colorscale,
            opacity=alpha_contour,
            showscale=True,
            colorbar=dict(
                title="Decision<br>Confidence",
                titleside="right",
                tickmode="linear",
                tick0=0,
                dtick=0.2
            ),
            contours=dict(
                start=0,
                end=1,
                size=0.1,
            ),
            name="Decision Boundary"
        ))

    # Add data points
    unique_labels = np.unique(y)
    colors = px.colors.qualitative.Set1[:len(unique_labels)]
    for i, label in enumerate(unique_labels):
        mask = y == label
        fig.add_trace(go.Scatter(
            x=X[mask, 0],
            y=X[mask, 1],
            mode='markers',
            marker=dict(
                size=point_size,
                color=colors[i],
                opacity=alpha_points,
                line=dict(width=1, color='black')
            ),
            name=f'Class {label}',
            hovertemplate=f'<b>Class {label}</b><br>' +
                          'Feature 1: %{x:.2f}<br>' +
                          'Feature 2: %{y:.2f}<br>' +
                          '<extra></extra>'
        ))

    # Update layout
    fig.update_layout(
        title=dict(
            text=title,
            x=0.5,
            font=dict(size=18, family="Arial Black")
        ),
        xaxis_title="Feature 1",
        yaxis_title="Feature 2",
        width=700,
        height=600,
        showlegend=True,
        legend=dict(
            yanchor="top",
            y=0.99,
            xanchor="left",
            x=0.01,
            bgcolor="rgba(255,255,255,0.8)",
            bordercolor="black",
            borderwidth=1
        ),
        plot_bgcolor='white',
        paper_bgcolor='white'
    )

    # Make axes equal
    fig.update_xaxes(scaleanchor="y", scaleratio=1)
    fig.update_yaxes(scaleanchor="x", scaleratio=1)

    # Add information text if no model is provided
    if model is None:
        fig.add_annotation(
            xref="paper", yref="paper",
            x=0.02, y=0.02,
            text="Dataset visualization<br>(No model provided)",
            showarrow=False,
            font=dict(size=12, color="gray"),
            bgcolor="rgba(255,255,255,0.8)",
            bordercolor="gray",
            borderwidth=1
        )
    return fig
A. Simulate four datasets of size 1000, one for each option of boundary_type ('linear', 'circular', 'moons', 'spiral').
Visualize each dataset using plot_decision_boundary as follows.
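A minimal sketch of this step, assuming the two helper functions above have been defined:
Code
# Generate one dataset of size 1000 per boundary type and visualize it
for btype in ['linear', 'circular', 'moons', 'spiral']:
    X, y = generate_2d_dataset(n_samples=1000, boundary_type=btype)
    fig = plot_decision_boundary(X, y, model=None,
                                 title=f"{btype.capitalize()} dataset")
    fig.show()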
B. Linear Dataset
Randomly split the data into \(80\%\) training and \(20\%\) testing parts.
Build an SVM model with default parameters on the training data and evaluate its performance on the testing part using the following metrics (a sketch is given after this list):
Accuracy
Recall
Precision
F1-score
AUC
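One possible sketch of this step, where X and y are assumed to be the linear dataset generated in part A. Note that probability=True is added only so that predict_proba is available for the AUC computation:
Code
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score)

# 80%-20% train-test split of the linear dataset (X, y from part A)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# SVM with default parameters (probability=True only enables predict_proba)
svm = SVC(probability=True, random_state=42).fit(X_train, y_train)
y_pred = svm.predict(X_test)
y_score = svm.predict_proba(X_test)[:, 1]  # probability of class 1

print(f"Accuracy : {accuracy_score(y_test, y_pred):.3f}")
print(f"Recall   : {recall_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"F1-score : {f1_score(y_test, y_pred):.3f}")
print(f"AUC      : {roc_auc_score(y_test, y_score):.3f}")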
Repeat the previous question for each of the following models:
Random Forest
Extra Trees
AdaBoost
XGBoost
Plot the decision boundaries of all models and compare them.
# To do
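One possible starting point for the To do above. It assumes the xgboost package is installed and reuses the train-test split from the SVM step; make_models is a small helper introduced here (not part of the original code) so that fresh, unfitted models can be recreated for the later datasets:
Code
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier)
from xgboost import XGBClassifier  # assumption: xgboost is installed

def make_models():
    # Fresh, unfitted copies of the five classifiers compared in this section
    return {
        "SVM": SVC(probability=True, random_state=42),
        "Random Forest": RandomForestClassifier(random_state=42),
        "Extra Trees": ExtraTreesClassifier(random_state=42),
        "AdaBoost": AdaBoostClassifier(random_state=42),
        "XGBoost": XGBClassifier(random_state=42),
    }

for name, model in make_models().items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.3f}, "
          f"recall={recall_score(y_test, y_pred):.3f}, "
          f"precision={precision_score(y_test, y_pred):.3f}, "
          f"F1={f1_score(y_test, y_pred):.3f}, "
          f"AUC={roc_auc_score(y_test, y_score):.3f}")
    # Compare decision boundaries on the full linear dataset
    plot_decision_boundary(X, y, model=model,
                           title=f"{name} (linear dataset)").show()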
C. Other Datasets
For each of the remaining datasets (circular, moons, and spiral), apply the same splitting scheme as in the previous case.
Build the previous models on each dataset, then evaluate their performance.
Visualize the decision boundary in each case and compare them.
# To do
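One possible starting point for this To do, reusing train_test_split, the metric functions, and the make_models helper from the previous sketches:
Code
for btype in ['circular', 'moons', 'spiral']:
    X_b, y_b = generate_2d_dataset(n_samples=1000, boundary_type=btype)
    X_train, X_test, y_train, y_test = train_test_split(
        X_b, y_b, test_size=0.2, random_state=42)
    for name, model in make_models().items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print(f"{btype} - {name}: accuracy={accuracy_score(y_test, y_pred):.3f}, "
              f"F1={f1_score(y_test, y_pred):.3f}")
        plot_decision_boundary(X_b, y_b, model=model,
                               title=f"{name} ({btype} dataset)").show()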
D. Hyperparameter Tuning on Moons Dataset
Import GridSearchCV from sklearn.model_selection.
Perform \(5\)-fold cross-validation to search for the best possible hyperparameters for each model.
Evaluate their performance and visualize the decision boundary of each model.
# To do
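One possible starting point for this To do. The parameter grids below are illustrative choices rather than definitive ones, and only two of the five models are shown; the others can be added to the dictionary in the same way:
Code
from sklearn.model_selection import GridSearchCV

X_m, y_m = generate_2d_dataset(n_samples=1000, boundary_type='moons')
X_train, X_test, y_train, y_test = train_test_split(
    X_m, y_m, test_size=0.2, random_state=42)

# Illustrative search spaces; extend with Extra Trees, AdaBoost and XGBoost
param_grids = {
    "SVM": (SVC(probability=True, random_state=42),
            {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}),
    "Random Forest": (RandomForestClassifier(random_state=42),
                      {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}),
}

for name, (model, grid) in param_grids.items():
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")  # 5-fold CV
    search.fit(X_train, y_train)
    best = search.best_estimator_
    print(f"{name}: best params = {search.best_params_}, "
          f"test accuracy = {best.score(X_test, y_test):.3f}")
    plot_decision_boundary(X_m, y_m, model=best,
                           title=f"Tuned {name} (moons dataset)").show()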
2. Email Spam Dataset
Let's start by exploring the email spam dataset introduced in the previous chapter. The data can be imported as follows.
import pandas as pd
path = "https://raw.githubusercontent.com/hassothea/MLcourses/main/data/spam.txt"
data = pd.read_csv(path, sep=" ")
data.head(5)
Apply SVM and ensemble learning models to this dataset, fine-tuning their hyperparameters using \(80\%\) of the data for training.
Evaluate their performance on the remaining \(20\%\) testing data.
Plot feature importances from the ensemble learning methods (a sketch is given below).
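A minimal sketch of the feature-importance plot. It assumes rf is a RandomForestClassifier already fitted on the spam training data and feature_names holds the predictor column names used to fit it; both depend on how you preprocess the label column:
Code
import numpy as np
import plotly.express as px

# Assumptions: rf is a fitted RandomForestClassifier, and feature_names
# are the predictor column names it was trained on
importances = rf.feature_importances_
order = np.argsort(importances)[::-1][:20]  # top 20 features

fig = px.bar(x=importances[order],
             y=np.array(feature_names)[order],
             orientation="h",
             labels={"x": "Importance", "y": "Feature"},
             title="Random Forest feature importances (top 20)")
fig.update_yaxes(autorange="reversed")  # most important feature at the top
fig.show()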