Deep Learning

ITM-370: Data Analytics

Lecturer: Dr. Sothea Has

Content

Introduction & Brief History
World of Approximations
Neural Networks
Optimization
Applications

Introduction & Brief History

Introduction

Deep Learning (DL) is a subset of Machine Learning (ML) that uses Multilayer Neural Networks, called Deep Neural Networks (DNN), to simulate the complex decision-making power of the human brain 🧠. Some form of deep learning powers most of the Artificial Intelligence (AI) applications in our lives today.

History

**Early Foundations**
`Year`	`Development`
1943	Walter Pitts and Warren McCulloch created the first computer model based on neural networks, using “threshold logic” to mimic the thought process.
1960s	Henry J. Kelley developed the basics of a continuous backpropagation model, and Stuart Dreyfus simplified it using the chain rule.

**Development of Algorithms**
1965	Alexey Ivakhnenko and Valentin Lapa developed early deep learning algorithms using polynomial activation functions.
1980s	Geoffrey Hinton and colleagues revived neural networks by demonstrating effective training using backpropagation.

**AI Winters and Resurgence**
1970s	The first AI winter occurred due to unmet expectations, leading to reduced funding and research.
1980s	Despite the AI winter, research continued, leading to significant advancements in neural networks and deep learning.

**Modern Era**
1990s	Development of convolutional neural networks (CNNs) by Yann LeCun and others for image recognition.
2006	Geoffrey Hinton and colleagues introduced deep belief networks, which further advanced deep learning techniques.
2012	AlexNet, a deep convolutional neural network, won the ImageNet competition, showcasing the power of deep learning in computer vision.
2016	AlphaGo by DeepMind defeated a human Go champion, demonstrating the potential of deep learning in complex games.
Present	Deep learning continues to evolve, with applications in natural language processing, speech recognition, autonomous vehicles, and more.

**Key Milestones**
`Year`	`Key Model Development`
1943	Pitts and McCulloch’s neural network model.
1960s	Kelley’s backpropagation model and Dreyfus’s chain rule simplification.
1980s	Hinton’s backpropagation revival & Recurrent Neural Networks (RNNs).
1990s	LeCun’s Convolutional Neural Networks (CNNs).
2006	Deep belief networks.
2012	AlexNet’s ImageNet win.
2016	AlphaGo’s victory.
2017	“Attention Is All You Need” introduced the Transformer architecture (the basis of models such as ChatGPT).

World of Approximations

Approximation

Approximation is the process of finding a value that is close to the true value of a quantity, but not exactly equal to it. It is often used when an exact value is difficult to obtain or not necessary.
In 1683, Jacob Bernoulli discovered Euler’s number $e=2.718...$ when he was studying compound interest and trying to determine what would happen if interest were compounded more and more frequently.

Suppose I put $\$ 1$ into a savings account:

`Interest Per Year`	`N Compound`	`Total`
$100\%$	1	$1+1$
$100\%$	2	$(1+1/2)^2$
$100\%$	3	$(1+1/3)^3$
$\vdots$	$\vdots$	$\vdots$
$100\%$	n	$(1+1/n)^n$

The total tends to $e$ as $n$ becomes very large, i.e., \[\lim_{n\to \infty}\Big(1+\frac{1}{n}\Big)^n=e=2.71828182...\]
With $100\%$ interest per year calculated every second, my $\$ 1$ yields nearly $\$ e=\$ 2.71828...$ at the end of the year.
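The limit above can be checked numerically. The following is a minimal sketch; the function name `compound_total` and the chosen values of $n$ (yearly, monthly, daily, and per-second compounding) are illustrative assumptions, not part of the original slides.

```python
import math

def compound_total(n: int) -> float:
    """Value of $1 after one year at 100% interest compounded n times."""
    return (1 + 1 / n) ** n

# Yearly, monthly, daily, and per-second compounding
for n in (1, 12, 365, 31_536_000):
    total = compound_total(n)
    print(f"n = {n:>10,d}  total = {total:.6f}  gap to e = {math.e - total:.2e}")
```

The gap to $e$ shrinks as $n$ grows, which is the approximation idea used throughout this lecture.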

Approximation

The Role of Models

Our world is full of approximations.
In mathematics and science, models are tools that help us approximate complex systems and phenomena.
A good model captures the behavior of a system and helps us understand reality.
Simple models may oversimplify reality but are often easy to interpret (LR & MLR). Conversely, complex models can approximate reality more accurately, though they may lack interpretability (Neural Networks).
In this class, models are used to approximate the relationships between inputs (ads) and targets (sales).

Artificial Neural Networks (ANNs)

Model: Multilayer perceptron

Artificial Neural Networks (ANNs) are computational models inspired by the human brain.

Input layer: vector of individual inputs $\color{green}{\text{x}_i}\in\mathbb{R}^d$.
- It takes the inputs from the dataset.
- The inputs should be preprocessed (scaled, encoded, transformed, etc.) before being passed to this layer.

Hidden layer: Governed by the equations:
\[\begin{align*}\color{green}{z_0}&=\color{green}{\text{x}}\in\mathbb{R}^d\\ \color{green}{z_k}&=\sigma_k(\color{blue}{W_k}\color{green}{z_{k-1}}+\color{blue}{b_k})\text{ for }k=1,...,L-1. \end{align*}\] where,
- $\color{blue}{W_k}$ is a weight matrix of size $\ell_{k}\times\ell_{k-1}$, where $\ell_k$ is the number of neurons in layer $k$ (with $\ell_0=d$)
- $\color{blue}{b_k}$ is a bias vector of size $\ell_k$
- $\sigma_k$ is a pointwise nonlinear activation function.
Output layer: Returns the predictions: \[\color{blue}{\hat{y}}=\sigma_L(\color{blue}{W_L}\color{green}{z_{L-1}}+\color{blue}{b_L}).\]
Loss function: measures the difference between the predictions and the real targets.
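The layer equations above can be sketched in a few lines of code. This is a minimal illustration in plain Python lists (so it runs without any dependency), not the way Keras implements it; the layer sizes, the random weights, and the helper names `affine` and `forward` are assumptions made for the example.

```python
import random

def relu(v):
    """Pointwise ReLU activation."""
    return [max(0.0, t) for t in v]

def identity(v):
    """Linear (identity) activation, as used for a regression output."""
    return list(v)

def affine(W, z, b):
    """Compute W z + b for a matrix W (list of rows) and vectors z, b."""
    return [sum(w * t for w, t in zip(row, z)) + b_i for row, b_i in zip(W, b)]

def forward(x, weights, biases, activations):
    """z_0 = x, then z_k = sigma_k(W_k z_{k-1} + b_k); the last z is y_hat."""
    z = list(x)
    for W, b, sigma in zip(weights, biases, activations):
        z = sigma(affine(W, z, b))
    return z

# Hypothetical sizes: d = 3 inputs, one hidden layer of 4 neurons, 1 output
random.seed(0)
sizes = [3, 4, 1]
weights = [[[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[0.0] * n_out for n_out in sizes[1:]]

y_hat = forward([0.5, -1.0, 2.0], weights, biases, [relu, identity])
print(y_hat)  # a list holding one real-valued prediction
```

Each weight matrix has one row per neuron of the current layer and one column per neuron of the previous layer, which is exactly the $\ell_k\times\ell_{k-1}$ shape stated above.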

Model: Multilayer perceptron

Why is it powerful?

Roughly speaking, given enough hidden neurons, it can approximate any reasonably complex input-output relationship to any desired level of precision! (This is the idea behind the universal approximation theorem.)

Model: Multilayer perceptron

Input Layer: sensory organs of the network

It plays a role as senses: 👀, 👂, 👃, 👅, 👊 …
The input data are fed directly into the input layer.
Let’s use our Kaggle Abalone dataset.

	Sex	Length	Diameter	Height	Whole weight	Shucked weight	Viscera weight	Shell weight	Rings
0	M	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15
1	M	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7
2	F	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9
3	M	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10
4	I	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7

Input: $\text{x}_1=$ [‘M’, 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15].
Target: $y_1=$ 15.

Q1: What should be done in the preprocessing step?

A1: In the preprocessing step, we should:
- Remove missing values (here, rows with zero Height are treated as missing).
- Encode the variable Sex: OneHotEncoder.
- Scale the inputs: MinMaxScaler or StandardScaler.

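import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# `data` is assumed to be the Abalone DataFrame shown above, already loaded with pandas
data = data.loc[data.Height > 0, :]   # drop rows with zero Height (treated as missing values)
data = pd.get_dummies(data, columns=['Sex'], drop_first=True)  # one-hot encode Sex, dropping the first level
X = data.drop('Rings', axis=1)   # inputs
y = data['Rings']                # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # train-test split
scaler = StandardScaler()                 # standardize the inputs
X_train = scaler.fit_transform(X_train)   # fit the scaler on the training data only
X_test = scaler.transform(X_test)         # reuse the training statistics on the test data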

Input: $\text{x}_1=$ [-0.418, -0.289, -0.464, -0.795, -0.82, -0.844, -0.637, 1.454, -0.75].
Target: $y_1=$ 15.

Q2: What’s the size of the input layer?
- A2: $d=9$.

Model: Multilayer perceptron

Input Layer: sensory organs of the network

Let’s build an MLP using Keras.
We first create an input layer of size $d=9$.

from sklearn.metrics import mean_squared_error 
from keras.models import Sequential 
from keras.layers import Dense, Input

# Dimension of the data
n, d = X_train.shape   # rows & columns

# Initiate the MLP model
model = Sequential()

# Add an input layer
model.add(Input(shape=(d,)))

Given trainable weights $\color{blue}{W_1}$ of size $\ell_1\times d$ and bias $\color{blue}{b_1}\in\mathbb{R}^{\ell_1}$, the input $\color{green}{\text{x}}\in\mathbb{R}^d$ is transformed by the first layer as \[\begin{align*} \color{green}{z_1}&=\sigma_1(\color{blue}{W_1}\color{green}{\text{x}} + \color{blue}{b_1})\\ &=\sigma_1\begin{pmatrix} \color{blue}{\begin{bmatrix} w_{11} & w_{12} & \dots & w_{1d}\\ \vdots & \vdots & \ddots & \vdots\\ w_{\ell_11} & w_{\ell_12} & \dots & w_{\ell_1d}\\ \end{bmatrix}}\color{green}{\begin{bmatrix} x_1\\ \vdots\\ x_d \end{bmatrix}}+ \color{blue}{\begin{bmatrix} b_1\\ \vdots\\ b_{\ell_1} \end{bmatrix}} \end{pmatrix} \end{align*}\]

Model: Multilayer perceptron

Hidden/output Layer: brain 🧠/Action 🏃🏻‍♂️‍➡️

Let’s add two hidden layers of size $32$ each to our existing network.
Then add an output layer to make a real-valued prediction $\color{blue}{\hat{y}}$ of Rings.

# Add hidden layer of size 32
model.add(Dense(32, activation='relu'))

# Add another hidden layer of size 32
model.add(Dense(32, activation='relu'))

# Add one last layer (output) of size 1
model.add(Dense(1, activation='linear'))

With trainable weights $\color{blue}{W_2, W_3}$ and biases $\color{blue}{b_2,b_3}$, the feedforward path continues as: \[\begin{align*} \color{green}{z_2}&=\sigma_2(\color{blue}{W_2}\color{green}{z_1} + \color{blue}{b_2})\in\mathbb{R}^{32}\\ \color{blue}{\hat{y}}&=\sigma_3(\color{blue}{W_3}\color{green}{z_2} + \color{blue}{b_3})\in\mathbb{R} \end{align*}\]
What is the dimension of each parameter?
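One way to answer this question is to compute the shapes directly from the layer sizes. The sketch below assumes the architecture built above (input $d=9$, two hidden layers of size 32, one output); the helper name `dense_shapes` is hypothetical.

```python
def dense_shapes(layer_sizes):
    """Return (W shape, b shape, parameter count) for each dense layer."""
    shapes = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        shapes.append(((n_out, n_in), (n_out,), n_out * n_in + n_out))
    return shapes

# Input d = 9, two hidden layers of size 32, output of size 1
layers = dense_shapes([9, 32, 32, 1])
for k, (w_shape, b_shape, n_params) in enumerate(layers, start=1):
    print(f"W_{k}: {w_shape}, b_{k}: {b_shape}, parameters: {n_params}")
print("total:", sum(p for _, _, p in layers))
```

The resulting counts (320, 1,056 and 33, for a total of 1,409) agree with the `Param #` column of the Keras model summary shown in the "Optimization in Keras" part.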

Model: Multilayer perceptron

Activation functions: $\sigma(.)$ ╭╯

In the feedforward path, we use matrix multiplications ($\color{blue}{W_j}$’s) and additions ($\color{blue}{b_j}$’s).
These operations are linear.
Without non-linear components, the network is just a linear regression.
The non-linear functions applied after each linear step are called activation functions.
They are an important component that makes the networks powerful!
Types of activation functions $\sigma_j(.)$:

\[\begin{align*} \text{Sigmoid}(z)&=1/(1+e^{-z})\text{ for }z\in\mathbb{R}\\ \text{Softmax}(z)&=(e^{z_1},\dots,e^{z_d})/\sum_{k=1}^de^{z_k},\text{ for }z\in\mathbb{R}^d\\ \color{red}{\text{ReLU}(z)}&\color{red}{=\max(0,z)\text{ for }z\in\mathbb{R}}\\ \text{Tanh}(z)&=\tanh(z)\text{ for }z\in\mathbb{R}\\ \text{Leaky ReLU}(z)&=\begin{cases}z,&\mbox{if }z>0\\ \alpha z,&\mbox{if }z\leq 0\end{cases}. \end{align*}\]
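The activation functions listed above translate almost directly into code. This is a minimal sketch using only the standard library; the default slope `alpha=0.01` for Leaky ReLU is an assumed value (the slides do not fix $\alpha$), and the softmax subtracts the maximum only for numerical stability, which does not change the result.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def relu(z: float) -> float:
    return max(0.0, z)

def leaky_relu(z: float, alpha: float = 0.01) -> float:
    # alpha = 0.01 is an assumption; any small positive slope works
    return z if z > 0 else alpha * z

def softmax(z):
    """Map a vector of scores to probabilities that sum to 1."""
    m = max(z)                              # shift for numerical stability
    exps = [math.exp(t - m) for t in z]
    total = sum(exps)
    return [e / total for e in exps]

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), math.tanh(z), relu(z), leaky_relu(z))
print(softmax([1.0, 2.0, 3.0]))
```

Sigmoid, Tanh, ReLU, and Leaky ReLU act on one number at a time, whereas Softmax acts on a whole vector and is typically used in the output layer of a classifier.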

Model: Multilayer perceptron

Loss function: true $y$ vs prediction $\color{blue}{\hat{y}}$

Given the weights $\color{blue}{W_j}$’s and biases $\color{blue}{b_j}$’s of the network, the feedforward pass produces a prediction $\color{blue}{\hat{y}}$.
To measure how good the network is, we compare the prediction $\color{blue}{\hat{y}}$ to the real target $y$.
A loss function quantifies the difference between the predicted output and the actual target.
Regression losses:
- $\ell_2(y_i,\color{blue}{\hat{y}_i})=(y_i-\color{blue}{\hat{y}_i})^2$: Squared loss.
- $\ell_1(y_i,\color{blue}{\hat{y}_i})=|y_i-\color{blue}{\hat{y}_i}|$: Absolute loss.
- $\ell_{\text{rel}}(y_i,\color{blue}{\hat{y}_i})=|\frac{y_i-\color{blue}{\hat{y}_i}}{y_i}|$: Relative loss.
Classification losses:
- $\text{CEn}(y_i,\color{blue}{\hat{y}_i})=-\sum_{j=1}^My_{ij}\log(\color{blue}{\hat{y}_{ij}})$: Cross-Entropy.
- $\text{Hinge}(y_i,\color{blue}{\hat{y}_i})=\max\{0,1-\sum_{j=1}^My_{ij}\color{blue}{\hat{y}_{ij}}\}$: Hinge loss.
- $\text{KL}(y_i,\color{blue}{\hat{y}_i})=\sum_{j=1}^My_{ij}\log(y_{ij}/\color{blue}{\hat{y}_{ij}})$: Kullback-Leibler (KL) Divergence.
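A few of the losses above can be written as one-line functions. The sketch below is illustrative only: the prediction `12.5` and the three-class probabilities are made-up values, and the cross-entropy skips the terms with $y_{ij}=0$, which contribute nothing to the sum.

```python
import math

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    return abs(y - y_hat)

def relative_loss(y, y_hat):
    return abs((y - y_hat) / y)          # requires y != 0

def cross_entropy(y, y_hat):
    """y: one-hot list of true labels, y_hat: predicted class probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

# Regression: true Rings = 15, hypothetical prediction 12.5
print(squared_loss(15, 12.5), absolute_loss(15, 12.5), relative_loss(15, 12.5))

# Classification: hypothetical 3 classes, the true class is the second one
print(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

The squared loss penalizes large errors more heavily than the absolute loss, while the cross-entropy only depends on the probability assigned to the true class.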

Q3: What are the key parameters of the network?
A3: All weights $\color{blue}{W_j}$’s and biases $\color{blue}{b_j}$’s.
Q4: How to find the suitable values of these parameters?
A4: Loss function can guide the network to its better and better state! In other words, we can use the loss/mistake to adjust all key parameters, leading to a better state of the network.

Optimization

Gradient of a function

Gradient

Let $L:\mathbb{R}^d\to\mathbb{R}, y=L(x_1,...,x_d)$ be a real-valued function of $d$ real variables. For any point $a\in\mathbb{R}^d$, the gradient of $L$ at point $a$ is a $d$-dimensional vector defined by \[\nabla L(a)=\begin{bmatrix}\frac{\partial L}{\partial x_1}(a)\\ \vdots\\ \frac{\partial L}{\partial x_d}(a)\end{bmatrix}.\]
It is the direction (from the point $a$) with the fastest increasing rate of the function $L$.
Ex: $f(x,y)=0.1x^4+0.5y^2$. Compute $\nabla f(1,-1)$.
- One has $\frac{\partial f}{\partial x}=0.4x^3$ and $\frac{\partial f}{\partial y}=y$.
- Therefore, $\nabla f(1,-1)=(0.4,-1).$
- Meaning: from point $(1,-1)$, function $f$ increases fastest along the direction $(0.4, -1)$.
Around 1847, Augustin-Louis Cauchy used this idea to propose an ingenious algorithm, gradient descent, that makes the seemingly impossible possible!
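The gradient computed in the example, $\nabla f(1,-1)=(0.4,-1)$, can be checked numerically with central finite differences. This is a minimal sketch; the helper name `numerical_grad` and the step size `h=1e-6` are arbitrary choices made for illustration.

```python
def f(x, y):
    return 0.1 * x**4 + 0.5 * y**2

def grad_f(x, y):
    """Analytic gradient of f: (0.4 x^3, y)."""
    return (0.4 * x**3, y)

def numerical_grad(func, x, y, h=1e-6):
    """Approximate the gradient with central finite differences."""
    dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (dx, dy)

print(grad_f(1, -1))             # analytic gradient at (1, -1)
print(numerical_grad(f, 1, -1))  # should be very close to the line above
```

Comparing an analytic gradient with a finite-difference estimate is a common sanity check before using the gradient in an optimization algorithm.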

Optimizing the loss function means searching for the weights $W_j$’s and biases $b_j$’s that minimize it.
Q5: Knowing this, how would you search for the lowest point of the loss function?

Gradient-based Optimization

Gradient Descent Algorithm

GD Algorithm: Moving opposite to the gradient

Objective: find parameters $\color{blue}{\theta^*=(W_1,..,W_L,b_1,...,b_L)}$ that minimize the average loss over the full training data: $L(\color{blue}{\theta})=\frac{1}{n}\sum_{i=1}^n\ell(y_i,\color{blue}{\hat{y}_i}).$
- For example, on the Abalone dataset: $L(\color{blue}{\theta})=\frac{1}{n}\sum_{i=1}^n(y_i-\color{blue}{\hat{y}_i})^2.$
Initialization:
- learning rate: $\eta>0$ (small $\approx 0.001$ or $0.01$)
- initial value: $\theta_0$
- Maximum number of iterations: $N$
- Stopping threshold: $\delta>0$ (small $< 10^{-5}$).
Update: for $t=1,2,\dots,N$: \[\color{blue}{\theta_{t}}:=\color{blue}{\theta_{t-1}}-\eta\nabla L(\color{blue}{\theta_{t-1}}).\]
- if $\|\nabla L(\color{blue}{\theta_{t^*}})\|<\delta$ at some step $t^*$: return $\color{blue}{\theta_{t^*}}$.
- else: return $\color{blue}{\theta_{N}}$.
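The update rule above can be tried on the earlier toy function $f(x,y)=0.1x^4+0.5y^2$, whose minimum is at $(0,0)$. This is a minimal sketch, not a training routine for a neural network; the learning rate `eta=0.1` is deliberately larger than the values suggested above so that the toy example finishes quickly, and the starting point $(1,-1)$ is an arbitrary choice.

```python
import math

def f(theta):
    x, y = theta
    return 0.1 * x**4 + 0.5 * y**2

def grad_f(theta):
    x, y = theta
    return (0.4 * x**3, y)

def gradient_descent(grad, theta0, eta=0.1, max_iter=100_000, delta=1e-5):
    """Move opposite to the gradient until its norm drops below delta."""
    theta = tuple(theta0)
    for t in range(1, max_iter + 1):
        g = grad(theta)
        theta = tuple(p - eta * g_i for p, g_i in zip(theta, g))
        if math.hypot(*grad(theta)) < delta:   # stopping threshold reached
            return theta, t
    return theta, max_iter                      # budget of N steps exhausted

theta_star, steps = gradient_descent(grad_f, (1.0, -1.0))
print(theta_star, steps, f(theta_star))
```

The flat bottom of the quartic term in $x$ makes the last part of the descent slow, which illustrates why the maximum number of iterations $N$ is needed alongside the stopping threshold $\delta$.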

Gradient-based Optimization

Stochastic Gradient Descent Algorithm

SGD Algorithm: replace full data by small random subsets

Objective: find parameters $\color{blue}{\theta^*=(W_1,..,W_L,b_1,...,b_L)}$ that minimize $L(\color{blue}{\theta})=\frac{1}{n}\sum_{i=1}^n\ell(y_i,\color{blue}{\hat{y}_i}).$
But here, we approximate the full loss using small random subsets of size $b\ll n$ called minibatches: $\hat{L}_b(\color{blue}{\theta})=\frac{1}{b}\sum_{i=1}^b(y_i^B-\color{blue}{\hat{y}_i^B})^2.$
Initialization:
- learning rate: $\eta>0$ (small $\approx 0.001$ or $0.01$)
- initial value: $\theta_0$
- Maximum iteration: $N$
- Stopping threshold: $\delta>0$ (small $< 10^{-5}$).
Update: for t=1,2,...,N: \[\color{blue}{\theta_{t}}:=\color{blue}{\theta_{t-1}}-\eta\nabla \hat{L}_b(\color{blue}{\theta_{t-1}}).\]
- if $\|\nabla \hat{L}_b(\color{blue}{\theta_{t^*}})\|<\delta$ at some step $t^*$: return $\color{blue}{\theta_{t^*}}$.
- else: return $\color{blue}{\theta_{N}}$.
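The minibatch idea can be illustrated on a one-variable linear model $\hat{y}=wx+c$ with the squared loss. This is a minimal sketch on synthetic data: the data-generating line $y=3x+2$, the noise level, the learning rate, the batch size, and the number of steps are all assumptions, and the stopping threshold is omitted for brevity.

```python
import random

random.seed(0)

# Synthetic data (hypothetical): y = 3x + 2 + small noise
n = 200
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [3 * x + 2 + random.gauss(0, 0.1) for x in xs]

def minibatch_grad(w, c, batch):
    """Gradient of the mean squared loss of y_hat = w*x + c over one minibatch."""
    b = len(batch)
    gw = sum(-2 * (y - (w * x + c)) * x for x, y in batch) / b
    gc = sum(-2 * (y - (w * x + c)) for x, y in batch) / b
    return gw, gc

def sgd(data, eta=0.05, batch_size=16, n_steps=2000):
    w, c = 0.0, 0.0                                  # initial value theta_0
    for _ in range(n_steps):
        batch = random.sample(data, batch_size)      # random minibatch of size b
        gw, gc = minibatch_grad(w, c, batch)
        w, c = w - eta * gw, c - eta * gc            # move opposite to the gradient
    return w, c

w, c = sgd(list(zip(xs, ys)))
print(w, c)   # should land near the slope 3 and the intercept 2
```

Each step only touches `batch_size` points instead of all `n`, so the gradient is noisier but much cheaper to compute than in full gradient descent.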

Gradient-based Optimization

GD vs SGD

Gradient Descent (GD)

Batch Processing: Uses the entire dataset to compute gradients and update weights in one go.
Stability: More stable with smoother convergence towards the minimum.
Speed: Slower, especially with large datasets, as it requires processing all data points each time.
Memory Usage: Requires a lot of memory to store the entire dataset.
Key points: Entire dataset, smooth convergence, slower, higher memory.

Stochastic Gradient Descent (SGD)

Incremental Processing: Updates weights using one data point (or one small minibatch) at a time.
Fluctuations: More fluctuation in the path towards the minimum, but can escape local minima more effectively.
Speed: Faster and more efficient with large datasets since it updates weights more frequently.
Memory Usage: Lower memory requirement as it processes only a few data points at a time.
Key points: One data point or minibatch at a time, fluctuating path, faster, lower memory.

There are many more powerful algorithms: Adam (Adaptive Moment Estimation), RMSprop (Root Mean Square Propagation), Adagrad (Adaptive Gradient Algorithm)…

Training/Backpropagation

Optimization in Keras

We set up the optimization method for our existing network as follows:

# Import the optimizers (Adam is available as an alternative to SGD)
from keras.optimizers import Adam, SGD

# Set up the SGD optimizer and the squared loss for our model
model.compile(optimizer=SGD(learning_rate=0.001), loss='mean_squared_error')

Let’s have a look at our model (the output of `model.summary()`):

Model: "sequential_7"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_21 (Dense)                │ (None, 32)             │           320 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_22 (Dense)                │ (None, 32)             │         1,056 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_23 (Dense)                │ (None, 1)              │            33 │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 1,409 (5.50 KB)

 Trainable params: 1,409 (5.50 KB)

 Non-trainable params: 0 (0.00 B)

Training & Learning Curves

A few important hyperparameters:
- batch_size: size $b$ of each minibatch.
- epochs: number of times that the network passes through the entire training dataset.
- validation_split: a fraction of the training data for validation during model training. We can keep track of the model state during training by measuring the loss on this validation data, especially for preventing overfitting.
The network yields Test MSE $=$ 4.416.
This is better than all the methods used in linear models (see our Lab2).

import plotly.graph_objects as go

# Training the network
history = model.fit(X_train, y_train, epochs=120, batch_size=32, validation_split=0.1, verbose=0)

# Extract loss values recorded at the end of each epoch
train_loss = history.history['loss']
val_loss = history.history['val_loss']

# Plot the learning curves (epoch axis on a log scale)
epochs = list(range(1, len(train_loss) + 1))
fig1 = go.Figure(go.Scatter(x=epochs, y=train_loss, name="Training loss"))
fig1.add_trace(go.Scatter(x=epochs, y=val_loss, name="Validation loss"))
fig1.update_layout(title="Training and Validation Loss",
                   width=510, height=200,
                   xaxis=dict(title="Epoch", type="log"),
                   yaxis=dict(title="Loss"))
fig1.show()

Diagnostics with Learning Curves

The above learning curves can be used to assess the state of our model during and after training.
- The training loss typically keeps decreasing, as it is measured on the training data.
- A drop in the validation loss indicates the generalization capability of the model at that state.
- The model starts to overfit the training data when the validation curve starts to increase.
- We should stop the training process when we observe this change in the validation curve.
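The stopping rule in the last bullet can be sketched as a small function that scans a list of validation losses, such as `history.history['val_loss']`. This is only an illustration: the function name `early_stopping_epoch`, the `patience` parameter, and the loss values below are hypothetical. Keras also provides an `EarlyStopping` callback that serves a similar purpose during training.

```python
def early_stopping_epoch(val_loss, patience=5):
    """Return the 1-based epoch with the best validation loss, scanning until
    the loss has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_loss, start=1):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break            # validation loss stopped improving
    return best_epoch

# Hypothetical validation losses: decreasing first, then rising (overfitting)
val_loss_demo = [9.0, 6.5, 5.2, 4.8, 4.6, 4.7, 4.9, 5.0, 5.3, 5.6]
print(early_stopping_epoch(val_loss_demo, patience=3))
```

A larger `patience` tolerates more fluctuation in the validation curve before stopping, which matters because minibatch training makes the curve noisy.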

The learning curves can also reveal other aspects of the network and the data including:
- When the model underfits the data or requires more training epochs
- When the learning rate ($\eta$) is too large
- When the model cannot generalize well to the validation set
- When it converges properly
- When the validation data is not representative enough
- When the validation data is too easy to predict…
These are helpful resources for understanding the above properties:
- A deep Dive Into Learning Curves in ML, Mostafa Ibrahim
- Diagnosing Model Performance with Learning Curves.

Applications & Examples

Neural Network Playground

CNN Explainer

Summary

Pros

Versatility: MLPs can be used for a wide range of tasks including classification, regression, and even function approximation.
Non-linear Problem Solving: They can model complex relationships and capture non-linear patterns in data, thanks to their non-linear activation functions.
Flexibility: MLPs can have multiple layers and neurons, making them highly adaptable to various problem complexities.
Training Efficiency: With advancements like backpropagation, training MLPs has become efficient and effective.
Feature Learning: MLPs can automatically learn features from raw data, reducing the need for manual feature extraction.

Cons

Computational Complexity: They can be computationally intensive, especially with large datasets and complex architectures, requiring significant processing power and memory.
Overfitting: MLPs can easily overfit to training data, especially if they have too many parameters relative to the amount of training data.
Black Box Nature: The internal workings of an MLP are not easily interpretable, making it difficult to understand how specific decisions are made.
Requires Large Datasets: Effective training of MLPs often requires large amounts of data, which might not always be available.
Hyperparameter Tuning: MLPs have several hyperparameters (e.g., learning rate, number of hidden layers, number of neurons per layer) that need careful tuning, which can be time-consuming and challenging.
Architecture: Designing the right architecture can be challenging as well.
