Deep Neural Network


Advanced Machine Learning

     

Lecturer: Dr. HAS Sothea

Content

  • Introduction

  • Multilayer Perceptrons

  • Training and Learning Curves

  • Applications

A bit of history

History

Early Foundations
  • 1943: Walter Pitts and Warren McCulloch created the first computer model based on neural networks, using “threshold logic” to mimic the thought process.
  • 1960s: Henry J. Kelley developed the basics of a continuous backpropagation model, and Stuart Dreyfus simplified it using the chain rule.

Development of Algorithms

  • 1965: Alexey Ivakhnenko and Valentin Lapa developed early deep learning algorithms using polynomial activation functions.
  • 1980s: Geoffrey Hinton and colleagues revived neural networks by demonstrating effective training using backpropagation.

AI Winters and Resurgence

  • 1970s: The first AI winter occurred due to unmet expectations, leading to reduced funding and research.
  • 1980s: Despite the AI winter, research continued, leading to significant advancements in neural networks and deep learning.

Modern Era

  • 1990s: Development of convolutional neural networks (CNNs) by Yann LeCun and others for image recognition.
  • 2006: Geoffrey Hinton and colleagues introduced deep belief networks, which further advanced deep learning techniques.
  • 2012: AlexNet, a deep convolutional neural network, won the ImageNet competition, showcasing the power of deep learning in computer vision.
  • 2016: AlphaGo by DeepMind defeated a human Go champion, demonstrating the potential of deep learning in complex games.
  • Present: Deep learning continues to evolve, with applications in natural language processing, speech recognition, autonomous vehicles, and more.
Key Milestones
  • 1943: Pitts and McCulloch’s neural network model.
  • 1960s: Kelley’s backpropagation model and Dreyfus’s chain-rule simplification.
  • 1980s: Hinton’s backpropagation revival & Recurrent Neural Networks (RNNs).
  • 1990s: LeCun’s Convolutional Neural Networks (CNNs).
  • 2006: Deep belief networks.
  • 2012: AlexNet’s ImageNet win.
  • 2016: AlphaGo’s victory.
  • 2017: “Attention Is All You Need” introduced the Transformer, the key architecture behind ChatGPT.

Introduction

Deep Neural Networks (DNNs), also known as Multilayer Perceptrons (MLPs), are a type of ML model built to simulate the complex decision-making power of the human brain 🧠.

They are the backbone powering the recent wave of Artificial Intelligence (AI) applications in our lives today.

Model: Multilayer perceptron (MLP)

  • Deep Neural Networks (DNNs)/Multilayer Perceptrons (MLP) are computational models inspired by the human brain.

  • Input layer: vector of individual inputs \(\color{green}{\text{x}_i}\in\mathbb{R}^d\).
    • It takes the inputs from the dataset.
    • The inputs should be preprocessed: scaled, encoded, transformed, etc, before passing to this layer.
  • Hidden layer: Governed by the equations:
    \[\begin{align*}\color{green}{z_0}&=\color{green}{\text{x}}\in\mathbb{R}^d\\ \color{green}{z_k}&=\sigma_k(\color{blue}{W_k}\color{green}{z_{k-1}}+\color{blue}{b_k})\text{ for }k=1,...,L-1. \end{align*}\] where,
    • \(\color{blue}{W_k}\) is a weight matrix of size \(\ell_{k}\times\ell_{k-1}\),
    • \(\color{blue}{b_k}\) is a bias vector of size \(\ell_k\),
    • \(\sigma_k\) is a point-wise (element-wise) nonlinear activation function.
  • Output layer: Returns the predictions: \[\color{blue}{\hat{y}}=\sigma_L(\color{blue}{W_L}\color{green}{z_{L-1}}+\color{blue}{b_L}).\]
  • Loss function: measures the difference between predictions and the real targets.
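  • To make these equations concrete, below is a minimal NumPy sketch of one feedforward pass (the layer sizes and random weights are arbitrary choices, for illustration only):
import numpy as np

rng = np.random.default_rng(0)

def relu(z):                           # a common choice of sigma for hidden layers
    return np.maximum(0.0, z)

def softmax(z):                        # a common choice of sigma_L for classification
    e = np.exp(z - z.max())            # shift by the max for numerical stability
    return e / e.sum()

d, l1, l2, out = 3, 4, 4, 2            # toy layer sizes
W1, b1 = rng.normal(size=(l1, d)),   np.zeros(l1)
W2, b2 = rng.normal(size=(l2, l1)),  np.zeros(l2)
W3, b3 = rng.normal(size=(out, l2)), np.zeros(out)

x = rng.normal(size=d)                 # one (preprocessed) input vector
z1 = relu(W1 @ x + b1)                 # hidden layer 1
z2 = relu(W2 @ z1 + b2)                # hidden layer 2
y_hat = softmax(W3 @ z2 + b3)          # output layer: class probabilities
print(y_hat, y_hat.sum())              # the probabilities sum to 1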

Model: Multilayer perceptron (MLP)

Input Layer: sensory organs of the network

  • It plays the role of the senses: 👀, 👂, 👃, 👅, 👊 …
  • The input data are fed directly into the input layer.
  • Let’s take a look at the MNIST dataset.
import matplotlib.pyplot as plt
from keras.datasets import mnist

# Load MNIST: 60,000 training and 10,000 test images of size 28x28
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(f"Train image dimension: {X_train.shape}")
_, axs = plt.subplots(1, 3, figsize=(6, 2))
for i in range(3):
    axs[i].imshow(X_train[i, :, :])
    axs[i].set_title(f"Number {y_train[i]}")   # label of the i-th training image
    axs[i].axis("off")
plt.tight_layout()
plt.show()
Train image dimension: (60000, 28, 28)

Preprocessing:

  • Scaling: pixel \(\in [0,1]\)
  • Reshaping: image dim: \(28\times 28\to 784\).
  • Target one-hot encoding: \[y=2\to y_{\text{one-hot}}=[0,0,\color{red}{1},0,0,0,0,0,0,0].\]
# Flatten 28x28 images into 784-dimensional vectors and scale pixels to [0, 1]
X_train = X_train.reshape((-1,28*28)).astype("float32")/255
X_test = X_test.reshape((-1,28*28)).astype("float32")/255
# One-hot encode the digit labels (10 classes)
from tensorflow.keras.utils import to_categorical
train_labels = to_categorical(y_train)
test_labels = to_categorical(y_test)
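A quick sanity check of the preprocessing above (a minimal sketch; the printed shapes follow from the reshaping and encoding steps):
# Sanity check: pixel range, flattened dimension, and one-hot labels
print(X_train.min(), X_train.max())    # pixels now lie in [0, 1]
print(X_train.shape, X_test.shape)     # (60000, 784) and (10000, 784)
print(y_train[0], train_labels[0])     # a digit and its one-hot vector of length 10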

Model: Multilayer perceptron (MLP)

Input Layer: sensory organs of the network

  • Let’s build an MLP using Keras.
  • We first create the Input layer of size \(d=784\) (the dimension of the flattened MNIST images).
from keras.models import Sequential
from keras.layers import Dense, Input

# Dimension of the data
n, d = X_train.shape   # n = 60000 rows, d = 784 columns

# Initiate the MLP model
model = Sequential()
# Add an input layer of size d = 784
model.add(Input(shape=(d,)))
  • Given trainable weights \(\color{blue}{W_1}\) of size \(\ell_1\times d\) and bias \(\color{blue}{b_1}\in\mathbb{R}^{\ell_1}\), the input \(\color{green}{\text{x}}\in\mathbb{R}^d\) is taken at the input layer: \[\begin{align*} \color{green}{z_1}&=\sigma_1(\color{blue}{W_1}\color{green}{\text{x}} + \color{blue}{b_1})\\ &=\sigma_1\begin{pmatrix} \color{blue}{\begin{bmatrix} w_{11} & w_{12} & \dots & w_{1d}\\ \vdots & \vdots & \ddots & \vdots\\ w_{\ell_11} & w_{\ell_12} & \dots & w_{\ell_1d}\\ \end{bmatrix}}\color{green}{\begin{bmatrix} x_1\\ \vdots\\ x_d \end{bmatrix}}+ \color{blue}{\begin{bmatrix} b_1\\ \vdots\\ b_{\ell_1} \end{bmatrix}} \end{pmatrix} \end{align*}\]

Model: Multilayer perceptron (MLP)

Hidden/output Layer: brain 🧠/Action 🏃🏻‍♂️‍➡️

  • Let’s add two hidden layers of size \(128\) to our existing network.
  • Then add an output layer of size \(10\) with softmax activation to predict the digit class \(\color{blue}{\hat{y}}\).
# Add hidden layer of size 128
model.add(Dense(128, activation="relu"))

# Add another hidden layer of size 128
model.add(Dense(128, activation="relu"))

# Add one last layer (output) of size 10
model.add(Dense(10, activation="softmax"))
  • With trainable weights \(\color{blue}{W_2, W_3}\) and biases \(\color{blue}{b_2,b_3}\), the feedforward path: \[\begin{align*} \color{green}{z_2}&=\sigma_2(\color{blue}{W_2}\color{green}{z_1} + \color{blue}{b_2})\in\mathbb{R}^{128}\\ \color{blue}{\hat{y}}&=\sigma_3(\color{blue}{W_3}\color{green}{z_2} + \color{blue}{b_3})\in\mathbb{R}^{10}. \end{align*}\]
  • What is the dimension of each parameter?
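  • One way to check the answer in code (a small sketch using the model built above; note that Keras stores each Dense kernel as (inputs, units), i.e. the transpose of the slides’ \(\color{blue}{W_k}\)):
# Print the shape of every trainable parameter of each layer
for layer in model.layers:
    print(layer.name, [w.shape for w in layer.get_weights()])
# Expected: (784, 128) & (128,), (128, 128) & (128,), (128, 10) & (10,)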

Model: Multilayer perceptron (MLP)

Activation functions: \(\sigma(.)\)

  • In the feedforward path, we use matrix multiplications (\(\color{blue}{W_j}\)’s) and additions (\(\color{blue}{b_j}\)’s).
  • These operations are linear (affine).
  • Without non-linear components, the network would collapse into a single linear model, no matter how many layers it has.
  • These non-linear functions are called activation functions.
  • They are the key ingredient that makes neural networks powerful!
  • Types of activation functions \(\sigma_j(.)\):

\[\begin{align*} \text{Sigmoid}(z)&=1/(1+e^{-z})\text{ for }z\in\mathbb{R}\\ \text{Softmax}(z)&=(e^{z_1},\dots,e^{z_d})/\sum_{k=1}^de^{z_k},\text{ for }z\in\mathbb{R}^d\\ \color{red}{\text{ReLU}(z)}&\color{red}{=\max(0,z)\text{ for }z\in\mathbb{R}}\\ \text{Tanh}(z)&=\tanh(z)\text{ for }z\in\mathbb{R}\\ \text{Leaky ReLU}(z)&=\begin{cases}z,&\mbox{if }z>0\\ \alpha z,&\mbox{if }z\leq 0\end{cases}. \end{align*}\]
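A minimal NumPy sketch of these activations (the Leaky ReLU slope \(\alpha=0.01\) is an assumed default):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))          # shift by the max for numerical stability
    return e / np.sum(e)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), np.tanh(z), relu(z), leaky_relu(z), softmax(z))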

Ex: An MLP with no hidden layer and a softmax output trained with the cross-entropy loss is exactly multiple (multinomial) logistic regression.

Model: Multilayer perceptron (MLP)

Loss function: true \(y\) vs prediction \(\color{blue}{\hat{y}}\)

  • Given weights \(\color{blue}{W_j}\)’s and biases \(\color{blue}{b_j}\)’s of the network, the feedforward pass produces a prediction \(\color{blue}{\hat{y}}\).
  • To measure how good the network is, we compare the prediction \(\color{blue}{\hat{y}}\) to the real target \(y\).
  • The loss function quantifies the difference between the predicted output and the actual target (a small numeric sketch follows this list).
  • Regression losses:
    • \(\ell_2(y_i,\color{blue}{\hat{y}_i})=(y_i-\color{blue}{\hat{y}_i})^2\): Squared loss.
    • \(\ell_1(y_i,\color{blue}{\hat{y}_i})=|y_i-\color{blue}{\hat{y}_i}|\): Absolute loss.
    • \(\ell_{\text{rel}}(y_i,\color{blue}{\hat{y}_i})=|\frac{y_i-\color{blue}{\hat{y}_i}}{y_i}|\): Relative loss.
  • Classification losses:
    • \(\text{CEn}(y_i,\color{blue}{\hat{y}_i})=-\sum_{j=1}^My_{ij}\log(\color{blue}{\hat{y}_{ij}})\): Cross-Entropy.
    • \(\text{Hinge}(y_i,\color{blue}{\hat{y}_i})=\max\{0,1-\sum_{j=1}^My_{ij}\color{blue}{\hat{y}_{ij}}\}\): Hinge loss.
    • \(\text{KL}(y_i,\color{blue}{\hat{y}_i})=\sum_{j=1}^My_{ij}\log(y_{ij}/\color{blue}{\hat{y}_{ij}})\): Kullback-Leibler (KL) Divergence.
  • Q1: What are the key parameters of the network?
  • A1: All weights \(\color{blue}{W_j}\)’s and biases \(\color{blue}{b_j}\)’s.
  • Q2: How to find the suitable values of these parameters?
  • A2: The loss function can guide the network toward better and better states! In other words, we can use the loss/mistake to adjust all key parameters (via gradients, as we will see next), leading to a better state of the network.
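A small numeric sketch of a few of these losses (the toy values are arbitrary, for illustration only):
import numpy as np

# Regression losses on toy targets and predictions
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5,  0.0, 2.0])
print(np.mean((y_true - y_pred) ** 2))      # mean squared (l2) loss
print(np.mean(np.abs(y_true - y_pred)))     # mean absolute (l1) loss

# Cross-entropy on one toy example with 3 classes
y_onehot = np.array([0.0, 0.0, 1.0])        # the true class is the third one
p_hat = np.array([0.1, 0.2, 0.7])           # predicted class probabilities
print(-np.sum(y_onehot * np.log(p_hat)))    # cross-entropy = -log(0.7)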

Model: Multilayer perceptron (MLP)

Feedforward Neural Networks By Hand

👉 Jupyter notebook: Feedforward NN by hand.

Model: Multilayer perceptron (MLP)

Why is it powerful?

  • Roughly speaking, it can approximate any reasonably complex input-output relationship to any desired level of precision! (For more, read UAT, Deepmind).

Model: Multilayer perceptron (MLP)

Why is it powerful?

Let’s see what it means: 👉 Jupyter notebook: Universal Approximation Theorem.
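As a tiny taste of the theorem (this is not the notebook’s code; the width 64 and the epoch count are arbitrary assumptions), a single hidden layer can already fit a smooth curve such as \(\sin(x)\) quite well:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Input

# Approximate sin(x) on [-3, 3] with one hidden layer
x = np.linspace(-3, 3, 500).reshape(-1, 1)
y = np.sin(x)
uat_model = Sequential([Input(shape=(1,)),
                        Dense(64, activation="relu"),   # a wider layer gives a finer fit
                        Dense(1)])                      # linear output for regression
uat_model.compile(optimizer="adam", loss="mse")
uat_model.fit(x, y, epochs=200, verbose=0)
print(uat_model.evaluate(x, y, verbose=0))              # small mean squared error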

Backpropagation: Gradient-based

Optimization in Keras

  • We set up the optimization method for our existing network as follows (a manual gradient step is also sketched after the model summary below):
# We use Adam optimizer
from keras.optimizers import Adam, SGD
# Set up optimizer for our model
model.compile(
    optimizer='adam', 
    loss='categorical_crossentropy', 
    metrics=['accuracy'])
  • Let’s have a look at our model:
model.summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_12 (Dense)                │ (None, 128)            │       100,480 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 128)            │        16,512 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 10)             │         1,290 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 118,282 (462.04 KB)
 Trainable params: 118,282 (462.04 KB)
 Non-trainable params: 0 (0.00 B)
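To connect this with the gradient-based idea above, here is a minimal sketch of one manual backpropagation/update step, assuming a TensorFlow backend and the X_train/train_labels arrays defined earlier (model.fit automates this loop over many minibatches and epochs):
import tensorflow as tf

loss_fn = tf.keras.losses.CategoricalCrossentropy()
sgd = tf.keras.optimizers.SGD(learning_rate=0.01)       # learning rate is an assumed value

# One minibatch of 64 examples
xb = tf.convert_to_tensor(X_train[:64], dtype=tf.float32)
yb = tf.convert_to_tensor(train_labels[:64], dtype=tf.float32)

with tf.GradientTape() as tape:
    y_hat = model(xb, training=True)                    # feedforward pass
    loss = loss_fn(yb, y_hat)                           # loss on the minibatch
grads = tape.gradient(loss, model.trainable_variables)            # backpropagation
sgd.apply_gradients(zip(grads, model.trainable_variables))        # parameter update
print(float(loss))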

Training & Learning Curves

  • A few important hyperparameters:
    • batch_size: the size \(b\) of each minibatch.
    • epochs: the number of times the network passes through the entire training dataset.
    • validation_split: the fraction of the training data held out for validation during training. We can keep track of the model state during training by measuring the loss on this validation data, which is especially useful for preventing overfitting.
  • Choosing the right architecture requires experience and tuning.
  • In this case, the network yields Test Accuracy \(=\) 0.954 (95.4% of test digits correctly predicted).
  • Tuning the hyperparameters would push its performance even further.
# Training the network
history = model.fit(
    X_train[:10000,:], train_labels[:10000],
    epochs=50, batch_size=64,
    validation_split=0.1, verbose=0)
# Evaluation on the test set
loss, accuracy = model.evaluate(
    X_test, test_labels, verbose=0)
# Extract loss values
train_loss = history.history['loss']
val_loss = history.history['val_loss']
# Plot the learning curves
import plotly.graph_objects as go
epochs = list(range(1, len(train_loss) + 1))
fig1 = go.Figure(go.Scatter(
    x=epochs, y=train_loss, name="Training loss"))
fig1.add_trace(
    go.Scatter(x=epochs, y=val_loss,
    name="Validation loss"))
fig1.update_layout(
    title="Training and Validation Loss",
    width=510, height=250,
    xaxis=dict(title="Epoch", type="log"),
    yaxis=dict(title="Loss"))
fig1.show()

Diagnostics with Learning Curves

  • The above learning curves can be used to assess the state of our model during and after training.
    • The training loss keeps decreasing, since it is measured on the very data the model is fitted to.
    • A drop in the validation loss indicates the generalization capability of the model at that state.
    • The model starts to overfit the training data when the validation curve starts to increase.
    • We should stop the training process when we observe this change in the validation curve (this can be automated with early stopping; see the sketch below).
  • The learning curves can also reveal other aspects of the network and the data, including:
    • when the model underfits the data or requires more training epochs,
    • when the learning rate (\(\eta\)) is too large,
    • when the model cannot generalize well to the validation set,
    • when it converges properly,
    • when the validation data is not representative enough,
    • when the validation data is too easy to predict…
  • These are helpful resources for understanding the above properties:

Neural Network Playground
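In Keras, the “stop when the validation curve turns up” rule can be automated with a callback (a minimal sketch; the patience value is an assumed choice):
from keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for 5 consecutive epochs,
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
history = model.fit(
    X_train[:10000, :], train_labels[:10000],
    epochs=50, batch_size=64,
    validation_split=0.1, verbose=0,
    callbacks=[early_stop])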

Summary

Pros

  • Versatility: DNNs can be used for a wide range of tasks including classification, regression, and even function approximation.
  • Non-linear Problem Solving: They can model complex relationships and capture non-linear patterns in data, thanks to their non-linear activation functions.
  • Flexibility: MLPs can have multiple layers and neurons, making them highly adaptable to various problem complexities.
  • Training Efficiency: With advancements like backpropagation, training MLPs has become efficient and effective.
  • Feature Learning: MLPs can automatically learn features from raw data, reducing the need for manual feature extraction.

Cons

  • Computational Complexity: They can be computationally intensive, especially with large datasets and complex architectures, requiring significant processing power and memory.
  • Overfitting: MLPs can easily overfit to training data, especially if they have too many parameters relative to the amount of training data.
  • Black Box Nature: The internal workings of an MLP are not easily interpretable, making it difficult to understand how specific decisions are made.
  • Requires Large Datasets: Effective training of MLPs often requires large amounts of data, which might not always be available.
  • Hyperparameter Tuning: MLPs have several hyperparameters (e.g., learning rate, number of hidden layers, number of neurons per layer) that need careful tuning, which can be time-consuming and challenging.
  • Architecture: Designing the right architecture can be challenging as well.

🥳 It’s party time 🥂