Deep Neural Network


ITM 390 004: Machine Learning

Lecturer: Dr. Sothea HAS

Content

  • Introduction

  • Multilayer Perceptrons

  • Training and Learning Curves

  • Applications

Introduction

Deep Neural Networks (DNNs), also called Multilayer Perceptrons (MLPs), are a type of ML model built to simulate the complex decision-making power of the human brain 🧠.

They are the backbone that powers the recent development of Artificial Intelligence (AI) applications in our lives today.

Model: Multilayer perceptron (MLP)

  • Deep Neural Networks (DNNs)/Multilayer Perceptrons (MLPs) are computational models inspired by the human brain.

  • Input layer: vector of individual inputs \(\color{green}{\text{x}_i}\in\mathbb{R}^d\).
    • It takes the inputs from the dataset.
    • The inputs should be preprocessed (scaled, encoded, transformed, etc.) before being passed to this layer.
  • Hidden layers: governed by the equations below (a minimal NumPy sketch follows this list):
    \[\begin{align*}\color{green}{z_0}&=\color{green}{\text{x}}\in\mathbb{R}^d\\ \color{green}{z_k}&=\sigma_k(\color{blue}{W_k}\color{green}{z_{k-1}}+\color{blue}{b_k})\text{ for }k=1,...,L-1, \end{align*}\] where
    • \(\color{blue}{W_k}\) is a weight matrix of size \(\ell_{k}\times\ell_{k-1}\),
    • \(\color{blue}{b_k}\) is a bias vector of size \(\ell_k\),
    • \(\sigma_k\) is a point-wise nonlinear activation function.
  • Output layer: returns the predictions: \[\color{blue}{\hat{y}}=\sigma_L(\color{blue}{W_L}\color{green}{z_{L-1}}+\color{blue}{b_L}).\]
  • Loss function: measures the difference between predictions and the real targets.
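To make these equations concrete, here is a minimal NumPy sketch of the feedforward pass for a hypothetical tiny network (4 inputs, one hidden ReLU layer of 3 units, one sigmoid output); the weights are random placeholders, not trained values.

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical layer sizes: d = 4 inputs, 3 hidden units, 1 output
d, l1, l2 = 4, 3, 1

# Randomly initialized (untrained) weights and biases
W1, b1 = rng.normal(size=(l1, d)), np.zeros(l1)
W2, b2 = rng.normal(size=(l2, l1)), np.zeros(l2)

x = rng.normal(size=d)          # one preprocessed input vector
z1 = relu(W1 @ x + b1)          # hidden layer: z1 = sigma_1(W1 x + b1)
y_hat = sigmoid(W2 @ z1 + b2)   # output layer: y_hat = sigma_2(W2 z1 + b2)
print(y_hat)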

Model: Multilayer perceptron (MLP)

Input Layer: sensory organs of the network

  • It plays the role of the senses: 👀, 👂, 👃, 👅, 👊 …
  • The input data are fed directly into the input layer.
  • Let’s take a look at the MNIST dataset.
import matplotlib.pyplot as plt
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
_, axs = plt.subplots(1,3, figsize=(6,2))
print(f"Train image dimension: {X_train.shape}")
for i in range(3):
    axs[i].imshow(X_train[i,:,:])
    axs[i].set_title(f"Number {y_train[i]}")  # use the training label, not y_test
    axs[i].axis("off")
plt.tight_layout()
plt.show()
Train image dimension: (60000, 28, 28)

Preprocessing:

  • Scaling: pixel \(\in [0,1]\)
  • Reshaping: image dim: \(28\times 28\to 784\).
  • Target one-hot encoding: \[y=2\to y_{\text{one-hot}}=[0,0,\color{red}{1},0,0,0,0,0,0,0].\]
X_train = X_train.reshape((-1,28*28)).astype("float32")/255
X_test = X_test.reshape((-1,28*28)).astype("float32")/255
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)
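A quick sanity check of this preprocessing (shapes, pixel range, and one one-hot label):

# Verify shapes, pixel range, and the one-hot encoding of the first label
print(X_train.shape, test_labels.shape)   # (60000, 784) (10000, 10)
print(X_train.min(), X_train.max())       # 0.0 1.0
print(y_train[0], train_labels[0])        # 5 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]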

Model: Multilayer perceptron (MLP)

Input Layer: sensory organs of the network

  • Let’s build an MLP using Keras.
  • We first create an Input layer of size \(d=784\) (the flattened MNIST images).
from keras.models import Sequential 
from keras.layers import Dense, Input

# Dimension of the data
n, d = X_train.shape   # rows & columns

# Initiate the MLP model
model = Sequential()
# Add an input layer
model.add(Input(shape=(d,)))
  • Given trainable weights \(\color{blue}{W_1}\) of size \(\ell_1\times d\) and bias \(\color{blue}{b_1}\in\mathbb{R}^{\ell_1}\), the input \(\color{green}{\text{x}}\in\mathbb{R}^d\) is transformed by the first hidden layer as \[\begin{align*} \color{green}{z_1}&=\sigma_1(\color{blue}{W_1}\color{green}{\text{x}} + \color{blue}{b_1})\\ &=\sigma_1\begin{pmatrix} \color{blue}{\begin{bmatrix} w_{11} & w_{12} & \dots & w_{1d}\\ \vdots & \vdots & \ddots & \vdots\\ w_{\ell_1 1} & w_{\ell_1 2} & \dots & w_{\ell_1 d}\\ \end{bmatrix}}\color{green}{\begin{bmatrix} x_1\\ \vdots\\ x_d \end{bmatrix}}+ \color{blue}{\begin{bmatrix} b_1\\ \vdots\\ b_{\ell_1} \end{bmatrix}} \end{pmatrix} \end{align*}\]

Model: Multilayer perceptron (MLP)

Hidden/output Layer: brain 🧠 / Action 🏃🏻‍♂️‍➡️

  • Let’s add two hidden layers of size \(128\) to our existing network.
  • Then add an output layer of size \(10\) with softmax activation to predict the digit class probabilities \(\color{blue}{\hat{y}}\).
# Add hidden layer of size 128
model.add(Dense(128, activation="relu"))

# Add another hidden layer of size 128
model.add(Dense(128, activation="relu"))

# Add one last layer (output) of size 10
model.add(Dense(10, activation="softmax"))
  • With trainable weights \(\color{blue}{W_2, W_3}\) and biases \(\color{blue}{b_2,b_3}\), the feedforward path: \[\begin{align*} \color{green}{z_2}&=\sigma_2(\color{blue}{W_2}\color{green}{z_1} + \color{blue}{b_2})\in\mathbb{R}^{128}\\ \color{blue}{\hat{y}}&=\sigma_3(\color{blue}{W_3}\color{green}{z_2} + \color{blue}{b_3})\in\mathbb{R}^{10} \end{align*}\]
  • What is the dimension of each parameter?
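One way to check the answer is to print the shape of each layer’s weights (note that Keras stores the kernel as \((\ell_{k-1},\ell_k)\), i.e. the transpose of \(W_k\) in our notation):

# Print the kernel and bias shapes of every Dense layer
for layer in model.layers:
    W, b = layer.get_weights()
    print(layer.name, W.shape, b.shape)
# Expected: (784, 128) (128,), (128, 128) (128,), (128, 10) (10,)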

Model: Multilayer perceptron (MLP)

Activation functions: \(\sigma(.)\)

  • In the feedforward path, we use matrix multiplications (\(\color{blue}{W_j}\)’s) and additions (\(\color{blue}{b_j}\)’s).
  • These operations are linear.
  • Without non-linear components, the whole network would collapse into a single linear model (essentially a linear regression).
  • The non-linear functions applied after each linear step are called activation functions.
  • They are the key ingredient that makes neural networks so powerful!
  • Types of activation functions \(\sigma_j(.)\):

\[\begin{align*} \text{Sigmoid}(z)&=1/(1+e^{-z})\text{ for }z\in\mathbb{R}\\ \text{Softmax}(z)&=(e^{z_1},\dots,e^{z_d})/\sum_{k=1}^de^{z_k},\text{ for }z\in\mathbb{R}^d\\ \color{red}{\text{ReLU}(z)}&=\color{red}{\max(0,z)\text{ for }z\in\mathbb{R}}\\ \text{Tanh}(z)&=\tanh(z)\text{ for }z\in\mathbb{R}\\ \text{Leaky ReLU}(z)&=\begin{cases}z,&\mbox{if }z>0\\ \alpha z,&\mbox{if }z\leq 0\end{cases}. \end{align*}\]
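A minimal NumPy sketch of these activation functions (here \(\alpha=0.01\) is an assumed slope for Leaky ReLU; the max-shift in Softmax is only for numerical stability):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting max(z) avoids overflow
    return e / e.sum()

def relu(z):
    return np.maximum(0, z)

def tanh(z):
    return np.tanh(z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), softmax(z))

In Keras these are selected by name, e.g. activation="relu" or activation="softmax" as in the code above.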

Ex: Multiple Logistic Regression:

👉 Notebook: Feedforward NN by hand.

Model: Multilayer perceptron (MLP)

Loss function: true \(y\) vs prediction \(\color{blue}{\hat{y}}\)

  • Given the weights \(\color{blue}{W_j}\)’s and biases \(\color{blue}{b_j}\)’s, the feedforward network produces a prediction \(\color{blue}{\hat{y}}\).
  • To measure how good the network is, we compare the prediction \(\color{blue}{\hat{y}}\) to the real target \(y\).
  • The loss function quantifies the difference between the predicted output and the actual target (see the NumPy sketch after this list).
  • Regression losses:
    • \(\ell_2(y_i,\color{blue}{\hat{y}_i})=(y_i-\color{blue}{\hat{y}_i})^2\): Squared loss.
    • \(\ell_1(y_i,\color{blue}{\hat{y}_i})=|y_i-\color{blue}{\hat{y}_i}|\): Absolute loss.
    • \(\ell_{\text{rel}}(y_i,\color{blue}{\hat{y}_i})=|\frac{y_i-\color{blue}{\hat{y}_i}}{y_i}|\): Relative loss.
  • Classification losses:
    • \(\text{CEn}(y_i,\color{blue}{\hat{y}_i})=-\sum_{j=1}^My_{ij}\log(\color{blue}{\hat{y}_{ij}})\): Cross-Entropy.
    • \(\text{Hinge}(y_i,\color{blue}{\hat{y}_i})=\max\{0,1-\sum_{j=1}^My_{ij}\color{blue}{\hat{y}_{ij}}\}\): Hinge loss.
    • \(\text{KL}(y_i,\color{blue}{\hat{y}_i})=\sum_{j=1}^My_{ij}\log(y_{ij}/\color{blue}{\hat{y}_{ij}})\): Kullback-Leibler (KL) Divergence.
  • Q1: What are the key parameters of the network?
  • A1: All weights \(\color{blue}{W_j}\)’s and biases \(\color{blue}{b_j}\)’s.
  • Q2: How to find the suitable values of these parameters?
  • A2: The loss function guides the network toward better and better states! In other words, we use the loss (the mistakes) to adjust all the key parameters, leading the network to a better state.
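A minimal NumPy sketch of the squared loss and the cross-entropy, assuming a one-hot target and a softmax output as in the MNIST example:

import numpy as np

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def cross_entropy(y, y_hat, eps=1e-12):
    # y is one-hot, y_hat is a vector of predicted class probabilities
    return -np.sum(y * np.log(y_hat + eps))

print(squared_loss(3.0, 2.5))          # regression example: 0.25

y = np.array([0.0, 0.0, 1.0])          # true class is 2
y_hat = np.array([0.1, 0.2, 0.7])      # predicted probabilities
print(cross_entropy(y, y_hat))         # -log(0.7) ≈ 0.357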

Model: Multilayer perceptron (MLP)

Feedforward Neural Networks By Hand

👉 Jupyter notebook: Feedforward NN by hand.

Model: Multilayer perceptron (MLP)

Why is it powerful?

  • Roughly speaking, it can approximate any reasonably complex input-output relationship to any desired level of precision! (For more, read UAT, Deepmind).

Model: Multilayer perceptron (MLP)

Why is it powerful?

Let’s see what it means: 👉 Jupyter notebook: Universal Approximation Theorem.
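As a small self-contained illustration (a sketch, not the notebook itself): a network with a single hidden layer can already fit a nonlinear 1D function such as \(\sin(x)\).

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Input

# Toy 1D data: approximate sin(x) on [-3, 3]
x = np.linspace(-3, 3, 400).reshape(-1, 1)
y = np.sin(x)

# One hidden layer with enough units can fit this curve closely
uat_model = Sequential([
    Input(shape=(1,)),
    Dense(64, activation="relu"),
    Dense(1)                      # linear output for regression
])
uat_model.compile(optimizer="adam", loss="mse")
uat_model.fit(x, y, epochs=300, verbose=0)
print(uat_model.evaluate(x, y, verbose=0))   # small mean squared error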

Backpropagation: Gradient-based

Optimization in Keras

  • We set up the optimization method for our existing network as follows:
# We use Adam optimizer
from keras.optimizers import Adam, SGD
# Set up optimizer for our model
model.compile(
    optimizer='adam', 
    loss='categorical_crossentropy', 
    metrics=['accuracy'])
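Here "gradient-based" means each parameter is updated against the gradient of the loss, \(\color{blue}{w}\leftarrow\color{blue}{w}-\eta\,\partial\text{Loss}/\partial\color{blue}{w}\), with backpropagation computing these gradients layer by layer. Keras handles this internally; the following is only an illustrative sketch of the update rule for a single linear neuron with squared loss.

# A single linear neuron y_hat = w*x + b trained on one data point
x, y = 2.0, 5.0
w, b, eta = 0.0, 0.0, 0.05      # initial parameters and learning rate

for step in range(100):
    y_hat = w * x + b
    # Gradients of the squared loss (y - y_hat)^2 w.r.t. w and b (chain rule)
    grad_w = -2 * (y - y_hat) * x
    grad_b = -2 * (y - y_hat)
    # Gradient-descent update
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b, w * x + b)   # w*x + b is now very close to y = 5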
  • Let’s have a look at our model:
model.summary()
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
β”‚ dense_3 (Dense)                 β”‚ (None, 128)            β”‚       100,480 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ dense_4 (Dense)                 β”‚ (None, 128)            β”‚        16,512 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ dense_5 (Dense)                 β”‚ (None, 10)             β”‚         1,290 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
 Total params: 118,282 (462.04 KB)
 Trainable params: 118,282 (462.04 KB)
 Non-trainable params: 0 (0.00 B)
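The parameter counts in the summary can be checked by hand: a Dense layer mapping \(\ell_{k-1}\) inputs to \(\ell_k\) units has \(\ell_{k-1}\times\ell_k\) weights plus \(\ell_k\) biases.

# Verify the parameter counts shown in model.summary()
print(784 * 128 + 128)           # 100,480 (first hidden layer)
print(128 * 128 + 128)           # 16,512  (second hidden layer)
print(128 * 10 + 10)             # 1,290   (output layer)
print(100480 + 16512 + 1290)     # 118,282 total trainable parameters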

Training & Learning Curves

  • Important hyperparameters:
    • activation functions: non-linear functions for each layer.
    • batch_size: size \(b\) of each minibatch.
    • learning_rate: step size \(\eta\) for each update.
    • epochs: number of passes over the entire training data.
    • validation_split: a fraction of the training data for tracking model state during training.
    • architecture: number of layers and neurons per layer…
  • Choosing the right architecture requires experience and exploration.
  • In this case, the network yields a test accuracy of \(0.954\) (95.4% of test images correctly classified).
  • Tuning the hyperparameters would push its performance even further.
# Training the network
history = model.fit(
    X_train[:10000,:], train_labels[:10000], 
    epochs=50, batch_size=64, verbose=0,
    validation_split=0.1)
# evaluation
loss, accuracy = model.evaluate(
    X_test, test_labels, verbose=0)
# Extract loss values 
train_loss = history.history['loss']
val_loss = history.history['val_loss'] 
# Plot the learning curves 
import plotly.graph_objects as go
epochs = list(range(1, len(train_loss) + 1))
fig1 = go.Figure(go.Scatter(
    x=epochs, y=train_loss, name="Training loss"))
fig1.add_trace(
    go.Scatter(x=epochs, y=val_loss, 
    name="Validation loss"))
fig1.update_layout(
    title="Training and Validation Loss", 
    width=510, height=250,
    xaxis=dict(title="Epoch", type="log"),
        yaxis=dict(title="Loss"))
fig1.show()

Diagnostics with Learning Curves

  • The above learning curve can be used to assess the state of our model during and after training.
    • The training loss always decreases as it’s measured using the training data.
    • The drop of validation loss indicates the generalization capability of the model at that state.
    • The model starts to overfit the training data when the validation curve starts to increase.
    • We should stop the training process when we observe this change in the validation curve (see the EarlyStopping sketch below).
  • The learning curves can also reveal other aspects of the network and the data including:
    • When the model underfits the data or requires more training epochs
    • When the learning rate (\(\eta\)) is too large
    • When the model cannot generalize well to validation set
    • When it converges properly
    • When the validation data is not representative enough
    • When the validation data is too easy to predict…
  • These are helpful resources for understanding the above properties:

Neural Network Playground
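Regarding the stopping point discussed above, Keras provides an EarlyStopping callback that halts training once the validation loss stops improving; a minimal sketch reusing the model and data from earlier:

from keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 5 consecutive epochs,
# and restore the weights of the best epoch seen so far
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)

history = model.fit(
    X_train[:10000,:], train_labels[:10000],
    epochs=50, batch_size=64, verbose=0,
    validation_split=0.1,
    callbacks=[early_stop])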

Summary

Pros

  • Versatility: DNNs can be used for a wide range of tasks including classification, regression, and even function approximation.
  • Non-linear Problem Solving: They can model complex relationships and capture non-linear patterns in data, thanks to their non-linear activation functions.
  • Flexibility: MLPs can have multiple layers and neurons, making them highly adaptable to various problem complexities.
  • Training Efficiency: With advancements like backpropagation, training MLPs has become efficient and effective.
  • Feature Learning: MLPs can automatically learn features from raw data, reducing the need for manual feature extraction.

Cons

  • Computational Complexity: They can be computationally intensive, especially with large datasets and complex architectures, requiring significant processing power and memory.
  • Overfitting: MLPs can easily overfit to training data, especially if they have too many parameters relative to the amount of training data.
  • Black Box Nature: The internal workings of an MLP are not easily interpretable, making it difficult to understand how specific decisions are made.
  • Requires Large Datasets: Effective training of MLPs often requires large amounts of data, which might not always be available.
  • Hyperparameter Tuning: MLPs have several hyperparameters (e.g., learning rate, number of hidden layers, number of neurons per layer) that need careful tuning, which can be time-consuming and challenging.
  • Architecture: Designing the right architecture can be challenging as well.

🥳 It’s party time 🥂