Deep Learning


ITM-370: Data Analytics

Lecturer: Dr. Sothea Has

Content

  • Introduction & Brief History

  • World of Approximation

  • Neural Networks

  • Optimization

  • Applications

Introduction & Brief History

Introduction

Deep Learning (DL) is a subset of Machine Learning (ML) that uses Multilayer Neural Networks, called Deep Neural Networks (DNN), to simulate the complex decision-making power of the human brain 🧠. Some form of deep learning powers most of the Artificial Intelligence (AI) applications in our lives today.

History

Early Foundations
Year Development
1943 Walter Pitts and Warren McCulloch created the first computer model based on neural networks, using “threshold logic” to mimic the thought process.
1960s Henry J. Kelley developed the basics of a continuous backpropagation model, and Stuart Dreyfus simplified it using the chain rule.
Development of Algorithms
1965 Alexey Ivakhnenko and Valentin Lapa developed early deep learning algorithms using polynomial activation functions.
1980s Geoffrey Hinton and colleagues revived neural networks by demonstrating effective training using backpropagation.
AI Winters and Resurgence
1970s The first AI winter occurred due to unmet expectations, leading to reduced funding and research.
1980s Despite the AI winter, research continued, leading to significant advancements in neural networks and deep learning.
Modern Era
1990s Development of convolutional neural networks (CNNs) by Yann LeCun and others for image recognition.
2006 Geoffrey Hinton and colleagues introduced deep belief networks, which further advanced deep learning techniques.
2012 AlexNet, a deep convolutional neural network, won the ImageNet competition, showcasing the power of deep learning in computer vision.
2016 AlphaGo by DeepMind defeated a human Go champion, demonstrating the potential of deep learning in complex games.
Present Deep learning continues to evolve, with applications in natural language processing, speech recognition, autonomous vehicles, and more.
Key Milestones
Year Key Model Development
1943 Pitts and McCulloch’s neural network model.
1960s Kelley’s backpropagation model and Dreyfus’s chain rule simplification.
1980s Hinton’s backpropagation revival & Recurrent Neural Networks (RNNs).
1990s LeCun’s Convolutional Neural Networks (CNNs).
2006 Deep belief networks.
2012 AlexNet’s ImageNet win.
2016 AlphaGo’s victory.
2017 “Attention Is All You Need” introduces the Transformer (the key architecture behind ChatGPT).

World of Approximations

Approximation

  • Approximation is the process of finding a value that is close to the true value of a quantity, but not exactly equal to it. It is often used when an exact value is difficult to obtain or not necessary.
  • In 1683, Jacob Bernoulli discovered Euler’s number \(e=2.718...\) when he was studying compound interest and trying to determine what would happen if interest were compounded more and more frequently.
  • Suppose I put \(\$ 1\) into a savings account:
Interest Per Year N Compound Total
\(100\%\) 1 \(1+1\)
\(100\%\) 2 \((1+1/2)^2\)
\(100\%\) 3 \((1+1/3)^3\)
\(\vdots\) \(\vdots\) \(\vdots\)
\(100\%\) n \((1+1/n)^n\)
  • The compounded interest \(\to e\) as \(n\) becomes very large i.e., \[\lim_{n\to \infty}\Big(1+\frac{1}{n}\Big)^n=e=2.71828182...\]
  • With \(100\%\) interest per year compounded every second, my \(\$ 1\) grows to nearly \(\$ e=\$ 2.71828...\) at the end of the year.
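The convergence in the table above can be checked numerically. The snippet below is a minimal sketch; the function name `compound_total` and the chosen values of `n` are illustrative assumptions, not part of the original slides.

```python
import math

def compound_total(n: int, rate: float = 1.0) -> float:
    """Value of $1 after one year at annual `rate`, compounded n times."""
    return (1.0 + rate / n) ** n

# Compounding yearly, monthly, daily, and every second (31,536,000 s per year)
for n in (1, 12, 365, 31_536_000):
    total = compound_total(n)
    print(f"n = {n:>10,}: total = {total:.8f}, gap to e = {math.e - total:.2e}")
```

The gap to \(e\) shrinks as \(n\) grows, which is the limit statement above seen from the numerical side.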

Approximation

The Role of Models

  • Our world is full of approximations.
  • In mathematics and science, models are tools that help us approximate complex systems and phenomena.
  • A good model can capture the behavior of a system and help us understand reality.
  • Simple models may oversimplify reality but are often easy to interpret (LR & MLR). Conversely, complex models can provide a more accurate approximation of reality, though they may lack interpretability (Neural Networks).
  • In this class, models are used to approximate the relationships between inputs (ads) and targets (sales).

Artificial Neural Networks (ANNs)

Model: Multilayer perceptron

  • Artificial Neural Networks (ANNs) are computational models inspired by the human brain.

  • Input layer: vector of individual inputs \(\color{green}{\text{x}_i}\in\mathbb{R}^d\).
    • It takes the inputs from the dataset.
    • The inputs should be preprocessed (scaled, encoded, transformed, etc.) before being passed to this layer.
  • Hidden layer: Governed by the equations:
    \[\begin{align*}\color{green}{z_0}&=\color{green}{\text{x}}\in\mathbb{R}^d\\ \color{green}{z_k}&=\sigma_k(\color{blue}{W_k}\color{green}{z_{k-1}}+\color{blue}{b_k})\text{ for }k=1,...,L-1. \end{align*}\] where,
    • \(\color{blue}{W_k}\) is a matrix of size \(\ell_{k}\times\ell_{k-1}\)
    • \(\color{blue}{b_k}\) is a bias vector of size \(\ell_k\)
    • \(\sigma_k\) is a point-wise nonlinear activation function.
  • Output layer: Returns the predictions: \[\color{blue}{\hat{y}}=\sigma_L(\color{blue}{W_L}\color{green}{z_{L-1}}+\color{blue}{b_L}).\]
  • Loss function: measures the difference between predictions and the real targets.
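The layer equations above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the layer sizes (9, 32, 32, 1), the random weights, and the helper names `forward`, `relu`, and `identity` are hypothetical choices, not the trained network from these slides.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def forward(x, weights, biases, activations):
    """Feedforward pass: z_k = sigma_k(W_k z_{k-1} + b_k), with z_0 = x."""
    z = x
    for W, b, sigma in zip(weights, biases, activations):
        z = sigma(W @ z + b)
    return z

# Assumed layer sizes: d = 9 inputs, two hidden layers of 32, one output
rng = np.random.default_rng(0)
sizes = [9, 32, 32, 1]
# W_k has shape (l_k, l_{k-1}); b_k has length l_k
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
activations = [relu, relu, identity]

x = rng.normal(size=9)            # one (already preprocessed) input vector
y_hat = forward(x, weights, biases, activations)
print(y_hat.shape)                # (1,)
```

Each pass through the loop is one layer: a matrix product, a bias shift, and a point-wise activation.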

Model: Multilayer perceptron

Why is it powerful?

  • Roughly speaking, with enough hidden units it can approximate any reasonably regular (e.g., continuous) input-output relationship to any desired level of precision! This property is known as universal approximation.

Model: Multilayer perceptron

Input Layer: sensory organs of the network

  • It plays the role of the senses: 👀, 👂, 👃, 👅, 👊 …
  • The input data are fed directly into the input layer.
  • Let’s use our Kaggle Abalone dataset.
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
  • Input: \(\text{x}_1=\) ['M', 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15].
  • Target: \(y_1=\) 15.
  • Q1: What should be done in the preprocessing step?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# `data` is the Abalone DataFrame shown above
data = data.loc[data.Height > 0,:]   # drop rows with an invalid (zero) height
data = pd.get_dummies(data, columns=['Sex'], drop_first=True)  # one-hot encoding
X = data.drop('Rings', axis=1)   # inputs
y = data['Rings']                # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # train-test split
scaler = StandardScaler() # scaling inputs
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training data only
X_test = scaler.transform(X_test)        # reuse the training statistics on the test data
  • Input: \(\text{x}_1=\) [-0.418, -0.289, -0.464, -0.795, -0.82, -0.844, -0.637, 1.454, -0.75].
  • Target: \(y_1=\) 15.
  • Q2: What’s the size of the input layer?
    • A2: \(d=9\).

Model: Multilayer perceptron

Input Layer: sensory organs of the network

  • Let’s build an MLP using Keras.
  • We first create an input layer of size \(d=9\).
from sklearn.metrics import mean_squared_error 
from keras.models import Sequential 
from keras.layers import Dense, Input

# Dimension of the data
n, d = X_train.shape   # rows & columns

# Initiate the MLP model
model = Sequential()

# Add an input layer
model.add(Input(shape=(d,)))
  • Given trainable weights \(\color{blue}{W_1}\) of size \(\ell_1\times d\) and bias \(\color{blue}{b_1}\in\mathbb{R}^{\ell_1}\), the input \(\color{green}{\text{x}}\in\mathbb{R}^d\) is passed from the input layer to the first hidden layer, which computes \[\begin{align*} \color{green}{z_1}&=\sigma_1(\color{blue}{W_1}\color{green}{\text{x}} + \color{blue}{b_1})\\ &=\sigma_1\begin{pmatrix} \color{blue}{\begin{bmatrix} w_{11} & w_{12} & \dots & w_{1d}\\ \vdots & \vdots & \ddots & \vdots\\ w_{\ell_11} & w_{\ell_12} & \dots & w_{\ell_1d}\\ \end{bmatrix}}\color{green}{\begin{bmatrix} x_1\\ \vdots\\ x_d \end{bmatrix}}+ \color{blue}{\begin{bmatrix} b_1\\ \vdots\\ b_{\ell_1} \end{bmatrix}} \end{pmatrix} \end{align*}\]

Model: Multilayer perceptron

Hidden/output Layer: brain 🧠 / action 🏃

  • Let’s add two hidden layers of size \(32\) each to our existing network.
  • Then add an output layer to make real-valued prediction \(\color{blue}{\hat{y}}\) of Rings.
# Add hidden layer of size 32
model.add(Dense(32, activation='relu'))

# Add another hidden layer of size 32
model.add(Dense(32, activation='relu'))

# Add one last layer (output) of size 1
model.add(Dense(1, activation='linear'))
  • With trainable weights \(\color{blue}{W_2, W_3}\) and biases \(\color{blue}{b_2,b_3}\), the feedforward path continues as: \[\begin{align*} \color{green}{z_2}&=\sigma_2(\color{blue}{W_2}\color{green}{z_1} + \color{blue}{b_2})\in\mathbb{R}^{32}\\ \color{blue}{\hat{y}}&=\sigma_3(\color{blue}{W_3}\color{green}{z_2} + \color{blue}{b_3})\in\mathbb{R} \end{align*}\]
  • What is the dimension of each parameter?
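One way to answer this is to compute the shapes directly from the layer sizes. The sketch below assumes the architecture used in these slides (9 inputs, two hidden layers of 32, one output) and follows the slides' convention that \(W_k\) has size \(\ell_k\times\ell_{k-1}\); the helper name `dense_param_shapes` is hypothetical.

```python
def dense_param_shapes(layer_sizes):
    """Return (W shape, b shape, parameter count) for each dense layer."""
    shapes = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # W_k is (n_out x n_in), b_k has n_out entries
        shapes.append(((n_out, n_in), (n_out,), n_out * n_in + n_out))
    return shapes

layer_sizes = [9, 32, 32, 1]  # d = 9 inputs, two hidden layers of 32, one output
shapes = dense_param_shapes(layer_sizes)
for k, (w_shape, b_shape, count) in enumerate(shapes, start=1):
    print(f"W_{k}: {w_shape}, b_{k}: {b_shape}, parameters: {count}")

total_params = sum(count for _, _, count in shapes)
print("Total:", total_params)
```

Note that Keras stores each `Dense` kernel with shape (inputs, units), i.e. the transpose of the convention used here; the parameter counts are the same either way.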

Model: Multilayer perceptron

Activation functions: \(\sigma(.)\) ╭╯

  • In the feedforward path, we use matrix multiplications (\(\color{blue}{W_j}\)’s) and additions (\(\color{blue}{b_j}\)’s).
  • These operations are linear.
  • Without non-linear components, the network is just a linear regression.
  • The non-linear functions applied after each layer are called activation functions.
  • They are an important component that makes the networks powerful!
  • Types of activation functions \(\sigma_j(.)\):

\[\begin{align*} \text{Sigmoid}(z)&=1/(1+e^{-z})\text{ for }z\in\mathbb{R}\\ \text{Softmax}(z)&=(e^{z_1},\dots,e^{z_d})/\sum_{k=1}^de^{z_k}\text{ for }z\in\mathbb{R}^d\\ \color{red}{\text{ReLU}(z)}&\color{red}{=\max(0,z)\text{ for }z\in\mathbb{R}}\\ \text{Tanh}(z)&=\tanh(z)\text{ for }z\in\mathbb{R}\\ \text{Leaky ReLU}(z)&=\begin{cases}z,&\text{if }z>0\\ \alpha z,&\text{if }z\leq 0\end{cases}\text{ for }z\in\mathbb{R}\text{ and a small }\alpha>0. \end{align*}\]
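These formulas translate almost directly into NumPy. The following is a minimal sketch; the default `alpha=0.01` for Leaky ReLU and the max-subtraction trick in `softmax` are common choices assumed here, not something specified in the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting max(z) leaves the result unchanged but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):  # alpha = 0.01 is an assumed default
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print("sigmoid   :", sigmoid(z))
print("softmax   :", softmax(z))
print("relu      :", relu(z))
print("tanh      :", np.tanh(z))
print("leaky relu:", leaky_relu(z))
```

Sigmoid, ReLU, Tanh and Leaky ReLU act on each coordinate separately, whereas Softmax acts on the whole vector and returns entries that sum to one.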

Model: Multilayer perceptron

Loss function: true \(y\) vs prediction \(\color{blue}{\hat{y}}\)

  • Given weights \(\color{blue}{W_j}\)’s and biases \(\color{blue}{b_j}\)’s of the network, the feedforward path produces a prediction \(\color{blue}{\hat{y}}\).
  • To measure how good the network is, we compare the prediction \(\color{blue}{\hat{y}}\) to the real target \(y\).
  • A loss function quantifies the difference between the predicted output and the actual target.
  • Regression losses:
    • \(\ell_2(y_i,\color{blue}{\hat{y}_i})=(y_i-\color{blue}{\hat{y}_i})^2\): Squared loss.
    • \(\ell_1(y_i,\color{blue}{\hat{y}_i})=|y_i-\color{blue}{\hat{y}_i}|\): Absolute loss.
    • \(\ell_{\text{rel}}(y_i,\color{blue}{\hat{y}_i})=|\frac{y_i-\color{blue}{\hat{y}_i}}{y_i}|\): Relative loss.
  • Classification losses:
    • \(\text{CEn}(y_i,\color{blue}{\hat{y}_i})=-\sum_{j=1}^My_{ij}\log(\color{blue}{\hat{y}_{ij}})\): Cross-Entropy.
    • \(\text{Hinge}(y_i,\color{blue}{\hat{y}_i})=\max\{0,1-\sum_{j=1}^My_{ij}\color{blue}{\hat{y}_{ij}}\}\): Hinge loss.
    • \(\text{KL}(y_i,\color{blue}{\hat{y}_i})=\sum_{j=1}^My_{ij}\log(y_{ij}/\color{blue}{\hat{y}_{ij}})\): Kullback-Leibler (KL) Divergence.
  • Q3: What are the key parameters of the network?
  • A3: All weights \(\color{blue}{W_j}\)’s and biases \(\color{blue}{b_j}\)’s.
  • Q4: How do we find suitable values of these parameters?
  • A4: The loss function can guide the network toward better and better states! In other words, we can use the loss (the mistakes) to adjust all key parameters, leading to a better state of the network.
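To make the regression and classification losses listed above concrete, here is a small NumPy sketch. The prediction values are made up for illustration, and the `eps` guard inside `cross_entropy` is an assumption added to avoid \(\log(0)\).

```python
import numpy as np

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    return np.abs(y - y_hat)

def relative_loss(y, y_hat):
    return np.abs((y - y_hat) / y)  # assumes y != 0

def cross_entropy(y_onehot, p_hat, eps=1e-12):
    # eps is an assumed guard against log(0)
    return -np.sum(y_onehot * np.log(p_hat + eps))

# Regression: true Rings = 15, hypothetical prediction 12
print(squared_loss(15.0, 12.0), absolute_loss(15.0, 12.0), relative_loss(15.0, 12.0))

# Classification with M = 3 classes: the true class is the second one
y_onehot = np.array([0.0, 1.0, 0.0])
p_hat = np.array([0.2, 0.7, 0.1])   # hypothetical predicted probabilities
print(cross_entropy(y_onehot, p_hat))
```

With a one-hot target, the cross-entropy reduces to minus the log of the probability assigned to the true class.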

Optimization

Gradient of a function

Gradient

  • Let \(L:\mathbb{R}^d\to\mathbb{R}, y=L(x_1,...,x_d)\), be a real-valued function of \(d\) real variables. For any point \(a\in\mathbb{R}^d\), the gradient of \(L\) at point \(a\) is a \(d\)-dimensional vector defined by \[\nabla L(a)=\begin{bmatrix}\frac{\partial L}{\partial x_1}(a)\\ \vdots\\ \frac{\partial L}{\partial x_d}(a)\end{bmatrix}.\]
  • It is the direction (from the point \(a\)) with the fastest increasing rate of the function \(L\).
  • Ex: \(f(x,y)=0.1x^4+0.5y^2\). Compute \(\nabla f(1,-1)\).
    • One has \(\frac{\partial f}{\partial x}=0.4x^3\) and \(\frac{\partial f}{\partial y}=y\).
    • Therefore, \(\nabla f(1,-1)=(0.4,-1).\)
    • Meaning: from point \((1,-1)\), function \(f\) increases fastest along the direction \((0.4, -1)\).
  • Around 1847, Augustin-Louis Cauchy used this idea to propose an ingenious algorithm, gradient descent, that makes the seemingly impossible possible!
  • Optimizing the loss function means searching for the weights \(W_j\)’s and biases \(b_j\)’s that minimize it.
  • Q5: Knowing this, how would you search for the lowest position of the loss function?
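The worked example above can be verified numerically. The sketch below compares the analytic gradient of \(f(x,y)=0.1x^4+0.5y^2\) with a central finite-difference estimate; the helper `numerical_grad` and the step size `h` are illustrative assumptions.

```python
import numpy as np

def f(p):
    x, y = p
    return 0.1 * x**4 + 0.5 * y**2

def grad_f(p):
    x, y = p
    return np.array([0.4 * x**3, y])   # partial derivatives from the example

def numerical_grad(func, p, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        g[i] = (func(p + step) - func(p - step)) / (2 * h)
    return g

a = np.array([1.0, -1.0])
print(grad_f(a))             # analytic gradient: [0.4, -1.0]
print(numerical_grad(f, a))  # should be close to the analytic gradient

# A small step against the gradient lowers the function value
print(f(a), f(a - 0.1 * grad_f(a)))
```

The last line hints at the answer to Q5: since the gradient points uphill, stepping in the opposite direction moves towards lower values of the function.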

Gradient-based Optimization

Gradient Descent Algorithm

GD Algorithm: Moving opposite to the gradient

  • Objective: find parameters \(\color{blue}{\theta^*=(W_1,..,W_L,b_1,...,b_L)}\) that minimize the average loss over the full training data: \(L(\color{blue}{\theta})=\frac{1}{n}\sum_{i=1}^n\ell(y_i,\color{blue}{\hat{y}_i}).\)
    • For example, on the Abalone dataset: \(L(\color{blue}{\theta})=\frac{1}{n}\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2.\)
  • Initialization:
    • learning rate: \(\eta>0\) (small \(\approx 0.001\) or \(0.01\))
    • initial value: \(\theta_0\)
    • Maximum iteration: \(N\)
    • Stopping threshold: \(\delta>0\) (small \(< 10^{-5}\)).
  • Update: for t=1,2,...,N: \[\color{blue}{\theta_{t}}:=\color{blue}{\theta_{t-1}}-\eta\nabla L(\color{blue}{\theta_{t-1}}).\]
    • if \(\|\nabla L(\color{blue}{\theta_{t^*}})\|<\delta\) at some step \(t^*\): return \(\color{blue}{\theta_{t^*}}\).
    • else: return \(\color{blue}{\theta_{N}}\).
  • Here’s GD in action 👉
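The update rule above can be sketched on the toy function \(f(x,y)=0.1x^4+0.5y^2\) from the gradient example. This is a minimal illustration, not a training routine for the network: the learning rate `eta=0.1` and the iteration cap `N=50_000` are assumptions chosen so that the toy problem converges quickly.

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, N=50_000, delta=1e-5):
    """Repeat theta <- theta - eta * grad(theta) until the gradient is small."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(1, N + 1):
        theta = theta - eta * grad(theta)
        if np.linalg.norm(grad(theta)) < delta:
            return theta, t          # stopping threshold reached at step t
    return theta, N                  # maximum number of iterations reached

def f(p):
    return 0.1 * p[0]**4 + 0.5 * p[1]**2

def grad_f(p):
    return np.array([0.4 * p[0]**3, p[1]])

theta_star, steps = gradient_descent(grad_f, theta0=[1.0, -1.0])
print(theta_star, f(theta_star), steps)
```

The function returns as soon as the gradient norm drops below `delta`, and otherwise gives back the last iterate after `N` steps, mirroring the two return cases in the algorithm.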

Gradient-based Optimization

Stochastic Gradient Descent Algorithm

SGD Algorithm: replace full data by small random subsets

  • Objective: find parameters \(\color{blue}{\theta^*=(W_1,..,W_L,b_1,...,b_L)}\) that minimize \(L(\color{blue}{\theta})=\frac{1}{n}\sum_{i=1}^n\ell(y_i,\color{blue}{\hat{y}_i}).\)
  • But here, we approximate the full loss using small random subsets of size \(b\ll n\) called minibatches: \(\hat{L}_b(\color{blue}{\theta})=\frac{1}{b}\sum_{i=1}^b(y_i^B-\color{blue}{\hat{y}_i^B})^2.\)
  • Initialization:
    • learning rate: \(\eta>0\) (small \(\approx 0.001\) or \(0.01\))
    • initial value: \(\theta_0\)
    • Maximum iteration: \(N\)
    • Stopping threshold: \(\delta>0\) (small \(< 10^{-5}\)).
  • Update: for t=1,2,...,N: \[\color{blue}{\theta_{t}}:=\color{blue}{\theta_{t-1}}-\eta\nabla \hat{L}_b(\color{blue}{\theta_{t-1}}).\]
    • if \(\|\nabla \hat{L}_b(\color{blue}{\theta_{t^*}})\|<\delta\) at some step \(t^*\): return \(\color{blue}{\theta_{t^*}}\).
    • else: return \(\color{blue}{\theta_{N}}\).
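The minibatch update above can be illustrated on a linear model, where the gradient has a simple closed form. Everything in this sketch is an assumption made for illustration: the synthetic data, the true weights `w_true`, and the values of `eta`, `b` and `N`. For brevity it runs a fixed number of steps and omits the stopping threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = X w_true + small noise
n, d = 1000, 3
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

def minibatch_grad(w, X_b, y_b):
    """Gradient of the minibatch squared loss (1/b) * sum_i (y_i - x_i . w)^2."""
    residual = X_b @ w - y_b
    return 2.0 * X_b.T @ residual / len(y_b)

eta, b, N = 0.05, 32, 2000
w = np.zeros(d)                                  # initial value theta_0
for t in range(N):
    idx = rng.choice(n, size=b, replace=False)   # draw a random minibatch
    w = w - eta * minibatch_grad(w, X[idx], y[idx])

print(w)  # should be close to w_true
```

Each step looks at only `b` of the `n` rows, so the gradient is a noisy but cheap estimate of the full-data gradient.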

Gradient-based Optimization

GD vs SGD

Gradient Descent (GD)

  • Batch Processing: Uses the entire dataset to compute gradients and update weights in one go.
  • Stability: More stable with smoother convergence towards the minimum.
  • Speed: Slower, especially with large datasets, as it requires processing all data points each time.
  • Memory Usage: Requires a lot of memory to store the entire dataset.
  • Key points: Entire dataset, smooth convergence, slower, higher memory.

Stochastic Gradient Descent (SGD)

  • Incremental Processing: Updates weights using one data point (or one small minibatch) at a time.
  • Fluctuations: More fluctuation in the path towards the minimum, but this noise can help it escape local minima.
  • Speed: Faster and more efficient with large datasets since it updates weights more frequently.
  • Memory Usage: Lower memory requirement as it only needs one data point or minibatch at a time.
  • Key points: One data point or minibatch at a time, fluctuating path, faster, lower memory.
  • There are many more powerful algorithms: Adam (Adaptive Moment Estimation), RMSprop (Root Mean Square Propagation), Adagrad (Adaptive Gradient Algorithm)…

Training/Backpropagation

Optimization in Keras

  • We set up the optimization method for our existing network as follows:
# Import the optimizers (SGD is used below; Adam is a common alternative)
from keras.optimizers import Adam, SGD

# Set up the optimizer and the loss for our model
model.compile(optimizer=SGD(learning_rate=0.001), loss='mean_squared_error')
  • Let’s have a look at the model:
Model: "sequential_7"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_21 (Dense)                │ (None, 32)             │           320 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_22 (Dense)                │ (None, 32)             │         1,056 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_23 (Dense)                │ (None, 1)              │            33 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,409 (5.50 KB)
 Trainable params: 1,409 (5.50 KB)
 Non-trainable params: 0 (0.00 B)

Training & Learning Curves

  • A few important hyperparameters:
    • batch_size: the size \(b\) of each minibatch.
    • epochs: number of times that the network passes through the entire training dataset.
    • validation_split: a fraction of the training data for validation during model training. We can keep track of the model state during training by measuring the loss on this validation data, especially for preventing overfitting.
  • The network yields Test MSE \(=\) 4.416.
  • This is better than all the methods used in linear models (see our Lab2).
# Training the network
import plotly.graph_objects as go

history = model.fit(X_train, y_train, epochs=120, batch_size=32, validation_split=0.1, verbose=0)

# Extract loss values 
train_loss = history.history['loss']
val_loss = history.history['val_loss'] 

# Plot the learning curves (training and validation loss per epoch)
epochs = list(range(1, len(train_loss) + 1))
fig1 = go.Figure(go.Scatter(x=epochs, y=train_loss, name="Training loss"))
fig1.add_trace(go.Scatter(x=epochs, y=val_loss, name="Validation loss"))
fig1.update_layout(title="Training and Validation Loss", 
                   width=510, height=200,
                   xaxis=dict(title="Epoch", type="log"),
                   yaxis=dict(title="Loss"))
fig1.show()
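Since the validation loss is tracked to guard against overfitting, a simple way to use it is to find the epoch after which it stops improving. The helper below is a hypothetical sketch in plain Python (the name `best_epoch`, the `patience` rule, and the loss values are made up for illustration); Keras also ships an `EarlyStopping` callback that can apply a rule of this kind during training.

```python
def best_epoch(val_loss, patience=10):
    """Return the 1-based epoch with the lowest validation loss, stopping the
    search once `patience` consecutive epochs bring no improvement."""
    best, best_ep, waited = float("inf"), 0, 0
    for ep, loss in enumerate(val_loss, start=1):
        if loss < best:
            best, best_ep, waited = loss, ep, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:
                break                             # no improvement for too long
    return best_ep

# Made-up validation losses: improving at first, then rising (overfitting)
val_loss_example = [9.0, 6.5, 5.2, 4.8, 4.6, 4.7, 4.9, 5.1, 5.4]
print(best_epoch(val_loss_example, patience=3))  # 5
```

Applied to the `val_loss` list extracted above, the same function would suggest how many epochs were actually useful.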

Diagnostics with Learning Curves

  • The above learning curves can be used to assess the state of our model during and after training.
    • The training loss typically keeps decreasing, as it is measured on the same data used for training.
    • The drop of validation loss indicates the generalization capability of the model at that state.
    • The model starts to overfit the training data when the validation curve starts to increase.
    • We should stop the training process when we observe this change in validation curve.
  • The learning curves can also reveal other aspects of the network and the data including:
    • When the model underfits the data or requires more training epochs
    • When the learning rate (\(\eta\)) is too large
    • When the model cannot generalize well to validation set
    • When it converges properly
    • When the validation data is not representative enough
    • When the validation data is too easy to predict…
  • These are helpful resources for understanding the above properties:

Applications & Examples

Neural Network Playground

CNN Explainer

Summary

Pros

  • Versatility: MLPs can be used for a wide range of tasks including classification, regression, and even function approximation.
  • Non-linear Problem Solving: They can model complex relationships and capture non-linear patterns in data, thanks to their non-linear activation functions.
  • Flexibility: MLPs can have multiple layers and neurons, making them highly adaptable to various problem complexities.
  • Training Efficiency: With advancements like backpropagation, training MLPs has become efficient and effective.
  • Feature Learning: MLPs can automatically learn features from raw data, reducing the need for manual feature extraction.

Cons

  • Computational Complexity: They can be computationally intensive, especially with large datasets and complex architectures, requiring significant processing power and memory.
  • Overfitting: MLPs can easily overfit to training data, especially if they have too many parameters relative to the amount of training data.
  • Black Box Nature: The internal workings of an MLP are not easily interpretable, making it difficult to understand how specific decisions are made.
  • Requires Large Datasets: Effective training of MLPs often requires large amounts of data, which might not always be available.
  • Hyperparameter Tuning: MLPs have several hyperparameters (e.g., learning rate, number of hidden layers, number of neurons per layer) that need careful tuning, which can be time-consuming and challenging.
  • Architecture: Designing the right architecture can be challenging as well.

🥳 Yeahhhh… 🥂