AMSI61AML

Content
- Introduction & Brief History
- World of Approximation
- Neural Networks
- Optimization
- Applications
A Deep Neural Network (DNN), or Multilayer Perceptron (MLP), is a type of ML model built to simulate the complex decision-making power of the human brain 🧠.
It is the backbone that powers the recent development of Artificial Intelligence (AI) applications in our lives today.
| Year | Development |
|---|---|
| 1943 | Walter Pitts and Warren McCulloch created the first computer model based on neural networks, using "threshold logic" to mimic the thought process. |
| 1960s | Henry J. Kelley developed the basics of a continuous backpropagation model, and Stuart Dreyfus simplified it using the chain rule. |
| 1965 | Alexey Ivakhnenko and Valentin Lapa developed early deep learning algorithms using polynomial activation functions. |
| 1970s | The first AI winter occurred due to unmet expectations, leading to reduced funding and research. |
| 1980s | Despite the AI winter, research continued, leading to significant advancements in neural networks and deep learning. |
| 1980s | Geoffrey Hinton and colleagues revived neural networks by demonstrating effective training using backpropagation. |
| 1990s | Yann LeCun and others developed convolutional neural networks (CNNs) for image recognition. |
| 2006 | Geoffrey Hinton and colleagues introduced deep belief networks, which further advanced deep learning techniques. |
| 2012 | AlexNet, a deep convolutional neural network, won the ImageNet competition, showcasing the power of deep learning in computer vision. |
| 2016 | AlphaGo by DeepMind defeated a human Go champion, demonstrating the potential of deep learning in complex games. |
| Present | Deep learning continues to evolve, with applications in natural language processing, speech recognition, autonomous vehicles, and more. |
| Year | Key Model Development |
|---|---|
| 1943 | Pitts and McCulloch's neural network model. |
| 1960s | Kelley's backpropagation model and Dreyfus's chain-rule simplification. |
| 1980s | Hinton's backpropagation revival & recurrent neural networks (RNNs). |
| 1990s | LeCun's convolutional neural networks (CNNs). |
| 2006 | Deep belief networks. |
| 2012 | AlexNet's ImageNet win. |
| 2016 | AlphaGo's victory. |
| 2017 | "Attention Is All You Need" introduced the Transformer, the key architecture behind ChatGPT. |
Approximation is the process of finding a value that is close to the true value of a quantity, but not exactly equal to it. It is often used when an exact value is difficult to obtain or unnecessary.

Suppose I put \(\$1\) into a savings account:
| Interest per Year | Number of Compoundings \(n\) | Total |
|---|---|---|
| \(100\%\) | \(1\) | \(1+1\) |
| \(100\%\) | \(2\) | \((1+1/2)^2\) |
| \(100\%\) | \(3\) | \((1+1/3)^3\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(100\%\) | \(n\) | \((1+1/n)^n\) |
The compounded total \(\to e\) as \(n\) becomes very large, i.e., \[\lim_{n\to \infty}\Big(1+\frac{1}{n}\Big)^n=e=2.71828182\dots\] With \(100\%\) interest per year compounded every second, my \(\$1\) yields nearly \(\$e=\$2.71828\dots\) at the end of the year.
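We can check this limit numerically (a minimal sketch; the chosen values of \(n\) are illustrative, the last being roughly the number of seconds in a year):

```python
# Numerical check that (1 + 1/n)^n converges to e as n grows
import math

for n in [1, 2, 10, 100, 10_000, 31_536_000]:  # 31,536,000 s = one year
    print(f"n = {n:>10,}:  (1 + 1/n)^n = {(1 + 1/n) ** n:.8f}")

print(f"e = {math.e:.8f}")
```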
If \(f:\mathbb{R}\to\mathbb{R}\) is infinitely differentiable (\(f\in C^{\infty}\)), i.e., \(f',f'',f''',\dots\) all exist, then for \(x,a\in\mathbb{R}\),
\[f(x)=\sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x-a)^n,\]
with equality holding whenever \(f\) is analytic (as is the case for \(e^x\), \(\sin x\), \(\cos x\), and most functions used in practice).
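For instance, truncating the series of \(e^x\) around \(a=0\) at order \(N\) gives an increasingly accurate approximation (a minimal sketch):

```python
# Truncated Taylor series of e^x at a = 0: sum_{n=0}^{N} x^n / n!
import math

x = 1.0
for N in [1, 2, 4, 8, 16]:
    approx = sum(x**n / math.factorial(n) for n in range(N + 1))
    print(f"N = {N:2d}: approx = {approx:.10f}, error = {abs(math.e - approx):.2e}")
```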
The goal of supervised learning is to approximate the unknown relationship between the input \(X\) and the target \(y\), called \(\color{red}{f}\).
\[\underbrace{\begin{bmatrix}x_{11} & x_{12} & \dots & x_{1d}\\ x_{21} & x_{22} & \dots & x_{2d}\\ x_{31} & x_{32} & \dots & x_{3d}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \dots & x_{nd}\\ \end{bmatrix}}_{\text{Input }X}\xrightarrow[]{\color{red}{f}} \underbrace{\begin{bmatrix}y_1\\ y_2\\ y_3\\ \vdots\\ y_n \end{bmatrix}}_{\text{target }y}\]
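As a toy illustration, we can generate data from a known \(f\) and recover it from \((X, y)\) alone (a minimal sketch; the linear \(f\) and the noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                 # input matrix, shape (n, d)
true_w = np.array([2.0, -1.0, 0.5])         # the "unknown" relationship f(x) = x @ true_w
y = X @ true_w + 0.1 * rng.normal(size=n)   # noisy targets, shape (n,)

# Approximate f by least squares, using only (X, y)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # close to [ 2.0, -1.0, 0.5 ]
```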
Each hidden layer of an MLP applies a nonlinear activation function to an affine transformation of its inputs; without this nonlinearity, the stacked layers would collapse into a single linear map.
Example: a sample from the Heart Disease dataset.

|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 392 | 51 | 1 | 2 | 110 | 175 | 0 | 1 | 123 | 0 | 0.6 | 2 | 0 | 2 | 1 |
| 960 | 52 | 0 | 2 | 136 | 196 | 0 | 0 | 169 | 0 | 0.1 | 1 | 0 | 2 | 1 |
| 888 | 60 | 0 | 0 | 150 | 258 | 0 | 0 | 157 | 0 | 2.6 | 1 | 2 | 3 | 0 |
| 741 | 41 | 0 | 2 | 112 | 268 | 0 | 0 | 172 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 287 | 71 | 0 | 1 | 160 | 302 | 0 | 1 | 162 | 0 | 0.4 | 2 | 2 | 2 | 1 |
Categorical variables are encoded with `OneHotEncoder` (here via the equivalent `pd.get_dummies`), and quantitative variables are scaled with `MinMaxScaler` or `StandardScaler`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# `data` is the Heart Disease DataFrame loaded beforehand
data = data.dropna()  # drop missing values
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal']
for col in quan_vars:
    data[col] = data[col].astype('float')
for col in qual_vars:
    data[col] = data[col].astype('category')
data = pd.get_dummies(data, columns=qual_vars, drop_first=True)  # one-hot encoding

y = data['target']
X = data.drop('target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # train-test split

scaler = StandardScaler()  # scale inputs to zero mean and unit variance
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse training-set statistics on the test set
```
Next, we build the MLP using Keras. Common choices of nonlinear activation function include:
\[\begin{align*} \text{Sigmoid}(z)&=1/(1+e^{-z})\text{ for }z\in\mathbb{R}\\ \text{Softmax}(z)&=(e^{z_1},\dots,e^{z_d})\Big/\sum_{k=1}^d e^{z_k}\text{ for }z\in\mathbb{R}^d\\ \color{red}{\text{ReLU}(z)}&\color{red}{=\max(0,z)\text{ for }z\in\mathbb{R}}\\ \text{Tanh}(z)&=\tanh(z)\text{ for }z\in\mathbb{R}\\ \text{Leaky ReLU}(z)&=\begin{cases}z,&\text{if }z>0\\ \alpha z,&\text{if }z\leq 0.\end{cases} \end{align*}\]
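These activations are straightforward to implement directly (a minimal NumPy sketch; the Leaky ReLU slope `alpha` is a hyperparameter, commonly around 0.01):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(z))           # [0. 0. 0. 1. 3.]
print(softmax(z).sum())  # ≈ 1.0
```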
📒 Jupyter notebook: Feedforward NN by hand.

Let's see what it means: 📒 Jupyter notebook: Universal Approximation Theorem.
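A minimal sketch of a model consistent with the summary below; the layer sizes and parameter counts match the summary, while the ReLU hidden activations and the sigmoid output (for binary classification) are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),  # 22 features after one-hot encoding
    layers.Dense(32, activation='relu'),     # (22 + 1) * 32 = 736 params
    layers.Dense(32, activation='relu'),     # (32 + 1) * 32 = 1,056 params
    layers.Dense(1, activation='sigmoid'),   # (32 + 1) * 1  = 33 params
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```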
Model: "sequential"
βββββββββββββββββββββββββββββββββββ³βββββββββββββββββββββββββ³ββββββββββββββββ β Layer (type) β Output Shape β Param # β β‘βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ© β dense (Dense) β (None, 32) β 736 β βββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββΌββββββββββββββββ€ β dense_1 (Dense) β (None, 32) β 1,056 β βββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββΌββββββββββββββββ€ β dense_2 (Dense) β (None, 1) β 33 β βββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββ΄ββββββββββββββββ
Total params: 1,825 (7.13 KB)
Trainable params: 1,825 (7.13 KB)
Non-trainable params: 0 (0.00 B)
Key arguments of `model.fit`:

- `batch_size`: the minibatch size \(b\).
- `epochs`: the number of times the network passes through the entire training dataset.
- `validation_split`: the fraction of the training data held out for validation during training. Tracking the loss on this validation set lets us monitor the model's state, especially for preventing overfitting.

```python
# Training the network
import plotly.graph_objects as go  # for plotting the learning curves

history = model.fit(X_train, y_train, epochs=200, batch_size=32, validation_split=0.1, verbose=0)

# Extract loss values
train_loss = history.history['loss']
val_loss = history.history['val_loss']

# Plot the learning curves
epochs = list(range(1, len(train_loss) + 1))
fig1 = go.Figure(go.Scatter(x=epochs, y=train_loss, name="Training loss"))
fig1.add_trace(go.Scatter(x=epochs, y=val_loss, name="Validation loss"))
fig1.update_layout(title="Training and Validation Loss",
                   width=510, height=250,
                   xaxis=dict(title="Epoch", type="log"),
                   yaxis=dict(title="Loss"))
fig1.show()
```
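Finally, a sketch of checking generalization on the held-out test set (assuming the model was compiled with `metrics=['accuracy']`, as above):

```python
# Measure loss and accuracy on data the network never saw during training
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test loss: {test_loss:.4f}, test accuracy: {test_acc:.4f}")
```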
Pros
Cons