Deep Neural Network


ITM 390 004: Machine Learning

Lecturer: Dr. Sothea HAS

Content

  • Introduction

  • Multilayer Perceptrons

  • Training and Learning Curves

  • Applications

Introduction

Deep Neural Networks (DNNs), also called Multilayer Perceptrons (MLPs), are a type of ML model built to simulate the complex decision-making power of the human brain 🧠.

They are the backbone that powers the recent development of Artificial Intelligence (AI) applications in our lives today.

Model: Multilayer perceptron (MLP)

  • Deep Neural Networks (DNNs)/Multilayer Perceptrons (MLPs) are computational models inspired by the human brain.

  • Input layer: vector of individual inputs \(\color{green}{\text{x}_i}\in\mathbb{R}^d\).
    • It takes the inputs from the dataset.
    • The inputs should be preprocessed (scaled, encoded, transformed, etc.) before being passed to this layer.
  • Hidden layers: governed by the equations below (a minimal NumPy sketch follows this list):
    \[\begin{align*}\color{green}{z_0}&=\color{green}{\text{x}}\in\mathbb{R}^d\\ \color{green}{z_k}&=\sigma_k(\color{blue}{W_k}\color{green}{z_{k-1}}+\color{blue}{b_k})\text{ for }k=1,...,L-1, \end{align*}\] where
    • \(\color{blue}{W_k}\) is a weight matrix of size \(\ell_{k}\times\ell_{k-1}\),
    • \(\color{blue}{b_k}\) is a bias vector of size \(\ell_k\),
    • \(\sigma_k\) is a point-wise nonlinear activation function.
  • Output layer: returns the predictions: \[\color{blue}{\hat{y}}=\sigma_L(\color{blue}{W_L}\color{green}{z_{L-1}}+\color{blue}{b_L}).\]
  • Loss function: measures the difference between predictions and the real targets.
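To make these equations concrete, here is a minimal NumPy sketch of the feedforward pass for a hypothetical tiny network (4 inputs, one hidden ReLU layer of 3 units, one sigmoid output); the weights are random placeholders, not trained values.

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical layer sizes: d = 4 inputs, 3 hidden units, 1 output
d, l1, l2 = 4, 3, 1

# Randomly initialized (untrained) weights and biases
W1, b1 = rng.normal(size=(l1, d)), np.zeros(l1)
W2, b2 = rng.normal(size=(l2, l1)), np.zeros(l2)

x = rng.normal(size=d)          # one preprocessed input vector
z1 = relu(W1 @ x + b1)          # hidden layer: z1 = sigma_1(W1 x + b1)
y_hat = sigmoid(W2 @ z1 + b2)   # output layer: y_hat = sigma_2(W2 z1 + b2)
print(y_hat)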

Model: Multilayer perceptron (MLP)

Input Layer: sensory organs of the network

  • It plays the role of the senses: 👀, 👂, 👃, 👅, 👊 …
  • The input data are fed directly into the input layer.
  • Let’s take a look at the MNIST dataset.
import matplotlib.pyplot as plt
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
_, axs = plt.subplots(1,3, figsize=(6,2))
print(f"Train image dimension: {X_train.shape}")
for i in range(3):
    axs[i].imshow(X_train[i,:,:])
    axs[i].set_title(f"Number {y_train[i]}")  # use the training label, not y_test
    axs[i].axis("off")
plt.tight_layout()
plt.show()
Train image dimension: (60000, 28, 28)

Preprocessing:

  • Scaling: pixel \(\in [0,1]\)
  • Reshaping: image dim: \(28\times 28\to 784\).
  • Target one-hot encoding: \[y=2\to y_{\text{one-hot}}=[0,0,\color{red}{1},0,0,0,0,0,0,0].\]
X_train = X_train.reshape((-1,28*28)).astype("float32")/255
X_test = X_test.reshape((-1,28*28)).astype("float32")/255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)
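A quick sanity check of this preprocessing (shapes, pixel range, and one one-hot label):

# Verify shapes, pixel range, and the one-hot encoding of the first label
print(X_train.shape, test_labels.shape)   # (60000, 784) (10000, 10)
print(X_train.min(), X_train.max())       # 0.0 1.0
print(y_train[0], train_labels[0])        # 5 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]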

Model: Multilayer perceptron (MLP)

Input Layer: sensory organs of the network

  • Let’s build an MLP using Keras.
  • We first create an Input layer of size \(d=784\) (the flattened MNIST images).
from keras.models import Sequential 
from keras.layers import Dense, Input

# Dimension of the data
n, d = X_train.shape   # rows & columns

# Initiate the MLP model
model = Sequential()
# Add an input layer
model.add(Input(shape=(d,)))
  • Given trainable weights \(\color{blue}{W_1}\) of size \(\ell_1\times d\) and bias \(\color{blue}{b_1}\in\mathbb{R}^{\ell_1}\), the input \(\color{green}{\text{x}}\in\mathbb{R}^d\) is transformed by the first hidden layer as \[\begin{align*} \color{green}{z_1}&=\sigma_1(\color{blue}{W_1}\color{green}{\text{x}} + \color{blue}{b_1})\\ &=\sigma_1\begin{pmatrix} \color{blue}{\begin{bmatrix} w_{11} & w_{12} & \dots & w_{1d}\\ \vdots & \vdots & \ddots & \vdots\\ w_{\ell_1 1} & w_{\ell_1 2} & \dots & w_{\ell_1 d}\\ \end{bmatrix}}\color{green}{\begin{bmatrix} x_1\\ \vdots\\ x_d \end{bmatrix}}+ \color{blue}{\begin{bmatrix} b_1\\ \vdots\\ b_{\ell_1} \end{bmatrix}} \end{pmatrix} \end{align*}\]

Model: Multilayer perceptron (MLP)

Hidden/output Layer: brain 🧠 / Action 🏃🏻‍♂️‍➡️

  • Let’s add two hidden layers of size \(128\) to our existing network.
  • Then add an output layer of size \(10\) with softmax activation to predict the digit class probabilities \(\color{blue}{\hat{y}}\).
# Add hidden layer of size 128
model.add(Dense(128, activation="relu"))

# Add another hidden layer of size 128
model.add(Dense(128, activation="relu"))

# Add one last layer (output) of size 10
model.add(Dense(10, activation="softmax"))
  • With trainable weights \(\color{blue}{W_2, W_3}\) and biases \(\color{blue}{b_2,b_3}\), the feedforward path: \[\begin{align*} \color{green}{z_2}&=\sigma_2(\color{blue}{W_2}\color{green}{z_1} + \color{blue}{b_2})\in\mathbb{R}^{128}\\ \color{blue}{\hat{y}}&=\sigma_3(\color{blue}{W_3}\color{green}{z_2} + \color{blue}{b_3})\in\mathbb{R}^{10} \end{align*}\]
  • What is the dimension of each parameter?
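One way to check the answer is to print the shape of each layer’s weights (note that Keras stores the kernel as \((\ell_{k-1},\ell_k)\), i.e. the transpose of \(W_k\) in our notation):

# Print the kernel and bias shapes of every Dense layer
for layer in model.layers:
    W, b = layer.get_weights()
    print(layer.name, W.shape, b.shape)
# Expected: (784, 128) (128,), (128, 128) (128,), (128, 10) (10,)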

Model: Multilayer perceptron (MLP)

Activation functions: \(\sigma(.)\)

  • In the feedforward path, we use matrix multiplications (\(\color{blue}{W_j}\)’s) and additions (\(\color{blue}{b_j}\)’s).
  • These operations are linear.
  • Without non-linear components, the whole network would collapse into a single linear model (essentially a linear regression).
  • The non-linear functions applied after each linear step are called activation functions.
  • They are the key ingredient that makes neural networks so powerful!
  • Types of activation functions \(\sigma_j(.)\):

\[\begin{align*} \text{Sigmoid}(z)&=1/(1+e^{-z})\text{ for }z\in\mathbb{R}\\ \text{Softmax}(z)&=(e^{z_1},\dots,e^{z_d})/\sum_{k=1}^de^{z_k},\text{ for }z\in\mathbb{R}^d\\ \color{red}{\text{ReLU}(z)}&=\color{red}{\max(0,z)\text{ for }z\in\mathbb{R}}\\ \text{Tanh}(z)&=\tanh(z)\text{ for }z\in\mathbb{R}\\ \text{Leaky ReLU}(z)&=\begin{cases}z,&\mbox{if }z>0\\ \alpha z,&\mbox{if }z\leq 0\end{cases}. \end{align*}\]
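A minimal NumPy sketch of these activation functions (here \(\alpha=0.01\) is an assumed slope for Leaky ReLU; the max-shift in Softmax is only for numerical stability):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting max(z) avoids overflow
    return e / e.sum()

def relu(z):
    return np.maximum(0, z)

def tanh(z):
    return np.tanh(z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), softmax(z))

In Keras these are selected by name, e.g. activation="relu" or activation="softmax" as in the code above.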

Ex: Multiple Logistic Regression:

👉 Notebook: Feedforward NN by hand.

Model: Multilayer perceptron (MLP)

Loss function: true \(y\) vs prediction \(\color{blue}{\hat{y}}\)

  • Given the weights \(\color{blue}{W_j}\)’s and biases \(\color{blue}{b_j}\)’s, the feedforward network produces a prediction \(\color{blue}{\hat{y}}\).
  • To measure how good the network is, we compare the prediction \(\color{blue}{\hat{y}}\) to the real target \(y\).
  • The loss function quantifies the difference between the predicted output and the actual target (see the NumPy sketch after this list).
  • Regression losses:
    • \(\ell_2(y_i,\color{blue}{\hat{y}_i})=(y_i-\color{blue}{\hat{y}_i})^2\): Squared loss.
    • \(\ell_1(y_i,\color{blue}{\hat{y}_i})=|y_i-\color{blue}{\hat{y}_i}|\): Absolute loss.
    • \(\ell_{\text{rel}}(y_i,\color{blue}{\hat{y}_i})=|\frac{y_i-\color{blue}{\hat{y}_i}}{y_i}|\): Relative loss.
  • Classification losses:
    • \(\text{CEn}(y_i,\color{blue}{\hat{y}_i})=-\sum_{j=1}^My_{ij}\log(\color{blue}{\hat{y}_{ij}})\): Cross-Entropy.
    • \(\text{Hinge}(y_i,\color{blue}{\hat{y}_i})=\max\{0,1-\sum_{j=1}^My_{ij}\color{blue}{\hat{y}_{ij}}\}\): Hinge loss.
    • \(\text{KL}(y_i,\color{blue}{\hat{y}_i})=\sum_{j=1}^My_{ij}\log(y_{ij}/\color{blue}{\hat{y}_{ij}})\): Kullback-Leibler (KL) Divergence.
  • Q1: What are the key parameters of the network?
  • A1: All weights \(\color{blue}{W_j}\)’s and biases \(\color{blue}{b_j}\)’s.
  • Q2: How to find the suitable values of these parameters?
  • A2: The loss function guides the network toward better and better states! In other words, we use the loss (the mistakes) to adjust all the key parameters, leading the network to a better state.
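A minimal NumPy sketch of the squared loss and the cross-entropy, assuming a one-hot target and a softmax output as in the MNIST example:

import numpy as np

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def cross_entropy(y, y_hat, eps=1e-12):
    # y is one-hot, y_hat is a vector of predicted class probabilities
    return -np.sum(y * np.log(y_hat + eps))

print(squared_loss(3.0, 2.5))          # regression example: 0.25

y = np.array([0.0, 0.0, 1.0])          # true class is 2
y_hat = np.array([0.1, 0.2, 0.7])      # predicted probabilities
print(cross_entropy(y, y_hat))         # -log(0.7) ≈ 0.357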

Model: Multilayer perceptron (MLP)

Feedforward Neural Networks By Hand

👉 Jupyter notebook: Feedforward NN by hand.

Model: Multilayer perceptron (MLP)

Why is it powerful?

  • Roughly speaking, it can approximate any reasonably complex input-output relationship to any desired level of precision! (For more, read UAT, Deepmind).

Model: Multilayer perceptron (MLP)

Why is it powerful?

Let’s see what it means: 👉 Jupyter notebook: Universal Approximation Theorem.
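As a small self-contained illustration (a sketch, not the notebook itself): a network with a single hidden layer can already fit a nonlinear 1D function such as \(\sin(x)\).

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Input

# Toy 1D data: approximate sin(x) on [-3, 3]
x = np.linspace(-3, 3, 400).reshape(-1, 1)
y = np.sin(x)

# One hidden layer with enough units can fit this curve closely
uat_model = Sequential([
    Input(shape=(1,)),
    Dense(64, activation="relu"),
    Dense(1)                      # linear output for regression
])
uat_model.compile(optimizer="adam", loss="mse")
uat_model.fit(x, y, epochs=300, verbose=0)
print(uat_model.evaluate(x, y, verbose=0))   # small mean squared error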

Backpropagation: Gradient-based

Optimization in Keras

  • We set up the optimization method for our existing network as follows:
# We use Adam optimizer
from keras.optimizers import Adam, SGD
# Set up optimizer for our model
model.compile(
    optimizer='adam', 
    loss='categorical_crossentropy', 
    metrics=['accuracy'])
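Here "gradient-based" means each parameter is updated against the gradient of the loss, \(\color{blue}{w}\leftarrow\color{blue}{w}-\eta\,\partial\text{Loss}/\partial\color{blue}{w}\), with backpropagation computing these gradients layer by layer. Keras handles this internally; the following is only an illustrative sketch of the update rule for a single linear neuron with squared loss.

# A single linear neuron y_hat = w*x + b trained on one data point
x, y = 2.0, 5.0
w, b, eta = 0.0, 0.0, 0.05      # initial parameters and learning rate

for step in range(100):
    y_hat = w * x + b
    # Gradients of the squared loss (y - y_hat)^2 w.r.t. w and b (chain rule)
    grad_w = -2 * (y - y_hat) * x
    grad_b = -2 * (y - y_hat)
    # Gradient-descent update
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b, w * x + b)   # w*x + b is now very close to y = 5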
  • Let’s have a look at our model:
model.summary()
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
β”‚ dense_3 (Dense)                 β”‚ (None, 128)            β”‚       100,480 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ dense_4 (Dense)                 β”‚ (None, 128)            β”‚        16,512 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ dense_5 (Dense)                 β”‚ (None, 10)             β”‚         1,290 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
 Total params: 118,282 (462.04 KB)
 Trainable params: 118,282 (462.04 KB)
 Non-trainable params: 0 (0.00 B)
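The parameter counts in the summary can be checked by hand: a Dense layer mapping \(\ell_{k-1}\) inputs to \(\ell_k\) units has \(\ell_{k-1}\times\ell_k\) weights plus \(\ell_k\) biases.

# Verify the parameter counts shown in model.summary()
print(784 * 128 + 128)           # 100,480 (first hidden layer)
print(128 * 128 + 128)           # 16,512  (second hidden layer)
print(128 * 10 + 10)             # 1,290   (output layer)
print(100480 + 16512 + 1290)     # 118,282 total trainable parameters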

Training & Learning Curves

  • Important hyperparameters:
    • activation functions: non-linear functions for each layer.
    • batch_size: size \(b\) of each minibatch.
    • learning_rate: step size \(\eta\) for each update.
    • epochs: number of passes over the entire training data.
    • validation_split: a fraction of the training data for tracking model state during training.
    • architecture: number of layers and neurons per layer…
  • Choosing the right architecture requires experience and exploration.
  • In this case, the network yields a test accuracy of \(0.954\) (95.4% of test images correctly classified).
  • Tuning the hyperparameters would push its performance even further.
# Training the network
history = model.fit(
    X_train[:10000,:], train_labels[:10000], 
    epochs=50, batch_size=64, verbose=0,
    validation_split=0.1)
# evaluation
loss, accuracy = model.evaluate(
    X_test, test_labels, verbose=0)
# Extract loss values 
train_loss = history.history['loss']
val_loss = history.history['val_loss'] 
# Plot the learning curves 
import plotly.graph_objects as go
epochs = list(range(1, len(train_loss) + 1))
fig1 = go.Figure(go.Scatter(
    x=epochs, y=train_loss, name="Training loss"))
fig1.add_trace(
    go.Scatter(x=epochs, y=val_loss, 
    name="Validation loss"))
fig1.update_layout(
    title="Training and Validation Loss", 
    width=510, height=250,
    xaxis=dict(title="Epoch", type="log"),
        yaxis=dict(title="Loss"))
fig1.show()

Diagnostics with Learning Curves

  • The above learning curve can be used to assess the state of our model during and after training.
    • The training loss always decreases as it’s measured using the training data.
    • The drop of validation loss indicates the generalization capability of the model at that state.
    • The model starts to overfit the training data when the validation curve starts to increase.
    • We should stop the training process when we observe this change in the validation curve (see the EarlyStopping sketch below).
  • The learning curves can also reveal other aspects of the network and the data including:
    • When the model underfits the data or requires more training epochs
    • When the learning rate (\(\eta\)) is too large
    • When the model cannot generalize well to validation set
    • When it converges properly
    • When the validation data is not representative enough
    • When the validation data is too easy to predict…
  • These are helpful resources for understanding the above properties:

Neural Network Playground
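Regarding the stopping point discussed above, Keras provides an EarlyStopping callback that halts training once the validation loss stops improving; a minimal sketch reusing the model and data from earlier:

from keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 5 consecutive epochs,
# and restore the weights of the best epoch seen so far
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)

history = model.fit(
    X_train[:10000,:], train_labels[:10000],
    epochs=50, batch_size=64, verbose=0,
    validation_split=0.1,
    callbacks=[early_stop])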

Summary

Pros

  • Versatility: DNNs can be used for a wide range of tasks including classification, regression, and even function approximation.
  • Non-linear Problem Solving: They can model complex relationships and capture non-linear patterns in data, thanks to their non-linear activation functions.
  • Flexibility: MLPs can have multiple layers and neurons, making them highly adaptable to various problem complexities.
  • Training Efficiency: With advancements like backpropagation, training MLPs has become efficient and effective.
  • Feature Learning: MLPs can automatically learn features from raw data, reducing the need for manual feature extraction.

Cons

  • Computational Complexity: They can be computationally intensive, especially with large datasets and complex architectures, requiring significant processing power and memory.
  • Overfitting: MLPs can easily overfit to training data, especially if they have too many parameters relative to the amount of training data.
  • Black Box Nature: The internal workings of an MLP are not easily interpretable, making it difficult to understand how specific decisions are made.
  • Requires Large Datasets: Effective training of MLPs often requires large amounts of data, which might not always be available.
  • Hyperparameter Tuning: MLPs have several hyperparameters (e.g., learning rate, number of hidden layers, number of neurons per layer) that need careful tuning, which can be time-consuming and challenging.
  • Architecture: Designing the right architecture can be challenging as well.

🥳 It’s party time 🥂