AMSI61AML

Content
- Introduction & Brief History
- World of Approximation
- Neural Networks
- Optimization
- Applications
A Deep Neural Network (DNN), or Multilayer Perceptron (MLP), is a type of ML model built to simulate the complex decision-making power of the human brain 🧠.
It is the backbone that powers the recent development of Artificial Intelligence (AI) applications in our lives today.
| Year | Development |
|---|---|
| 1943 | Walter Pitts and Warren McCulloch created the first computer model based on neural networks, using "threshold logic" to mimic the thought process. |
| 1960s | Henry J. Kelley developed the basics of a continuous backpropagation model, and Stuart Dreyfus simplified it using the chain rule. |
| 1965 | Alexey Ivakhnenko and Valentin Lapa developed early deep learning algorithms using polynomial activation functions. |
| 1970s | The first AI winter occurred due to unmet expectations, leading to reduced funding and research. |
| 1980s | Despite the AI winter, research continued, leading to significant advancements in neural networks and deep learning. |
| 1980s | Geoffrey Hinton and colleagues revived neural networks by demonstrating effective training using backpropagation. |
| 1990s | Yann LeCun and others developed convolutional neural networks (CNNs) for image recognition. |
| 2006 | Geoffrey Hinton and colleagues introduced deep belief networks, which further advanced deep learning techniques. |
| 2012 | AlexNet, a deep convolutional neural network, won the ImageNet competition, showcasing the power of deep learning in computer vision. |
| 2016 | AlphaGo by DeepMind defeated a human Go champion, demonstrating the potential of deep learning in complex games. |
| Present | Deep learning continues to evolve, with applications in natural language processing, speech recognition, autonomous vehicles, and more. |
| Year | Key Model Development |
|---|---|
| 1943 | Pitts and McCulloch's neural network model. |
| 1960s | Kelley's backpropagation model and Dreyfus's chain-rule simplification. |
| 1980s | Hinton's backpropagation revival & recurrent neural networks (RNNs). |
| 1990s | LeCun's convolutional neural networks (CNNs). |
| 2006 | Deep belief networks. |
| 2012 | AlexNet's ImageNet win. |
| 2016 | AlphaGo's victory. |
| 2017 | "Attention Is All You Need" introduced the Transformer, the key architecture behind ChatGPT. |
Approximation is the process of finding a value that is close to the true value of a quantity, but not exactly equal to it. It is often used when an exact value is difficult to obtain or unnecessary.

Suppose I put \(\$1\) into a savings account:
| Interest per Year | Number of Compoundings \(n\) | Total |
|---|---|---|
| \(100\%\) | \(1\) | \(1+1\) |
| \(100\%\) | \(2\) | \((1+1/2)^2\) |
| \(100\%\) | \(3\) | \((1+1/3)^3\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(100\%\) | \(n\) | \((1+1/n)^n\) |
The compounded total \(\to e\) as \(n\) becomes very large, i.e., \[\lim_{n\to \infty}\Big(1+\frac{1}{n}\Big)^n=e=2.71828182\dots\] With \(100\%\) interest per year compounded every second, my \(\$1\) yields nearly \(\$e=\$2.71828\dots\) at the end of the year.
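We can check this limit numerically (a minimal sketch; the chosen values of \(n\) are illustrative, the last being roughly the number of seconds in a year):

```python
# Numerical check that (1 + 1/n)^n converges to e as n grows
import math

for n in [1, 2, 10, 100, 10_000, 31_536_000]:  # 31,536,000 s = one year
    print(f"n = {n:>10,}:  (1 + 1/n)^n = {(1 + 1/n) ** n:.8f}")

print(f"e = {math.e:.8f}")
```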
If \(f:\mathbb{R}\to\mathbb{R}\) is infinitely differentiable (\(f\in C^{\infty}\)), i.e., \(f',f'',f''',\dots\) all exist, then for \(x,a\in\mathbb{R}\),
\[f(x)=\sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x-a)^n,\]
with equality holding whenever \(f\) is analytic (as is the case for \(e^x\), \(\sin x\), \(\cos x\), and most functions used in practice).
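For instance, truncating the series of \(e^x\) around \(a=0\) at order \(N\) gives an increasingly accurate approximation (a minimal sketch):

```python
# Truncated Taylor series of e^x at a = 0: sum_{n=0}^{N} x^n / n!
import math

x = 1.0
for N in [1, 2, 4, 8, 16]:
    approx = sum(x**n / math.factorial(n) for n in range(N + 1))
    print(f"N = {N:2d}: approx = {approx:.10f}, error = {abs(math.e - approx):.2e}")
```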
The goal of supervised learning is to approximate the unknown relationship between the input \(X\) and the target \(y\), called \(\color{red}{f}\).
\[\underbrace{\begin{bmatrix}x_{11} & x_{12} & \dots & x_{1d}\\ x_{21} & x_{22} & \dots & x_{2d}\\ x_{31} & x_{32} & \dots & x_{3d}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \dots & x_{nd}\\ \end{bmatrix}}_{\text{Input }X}\xrightarrow[]{\color{red}{f}} \underbrace{\begin{bmatrix}y_1\\ y_2\\ y_3\\ \vdots\\ y_n \end{bmatrix}}_{\text{target }y}\]
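As a toy illustration, we can generate data from a known \(f\) and recover it from \((X, y)\) alone (a minimal sketch; the linear \(f\) and the noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))                 # input matrix, shape (n, d)
true_w = np.array([2.0, -1.0, 0.5])         # the "unknown" relationship f(x) = x @ true_w
y = X @ true_w + 0.1 * rng.normal(size=n)   # noisy targets, shape (n,)

# Approximate f by least squares, using only (X, y)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # close to [ 2.0, -1.0, 0.5 ]
```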
Each hidden layer of an MLP applies a nonlinear activation function to an affine transformation of its inputs; without this nonlinearity, the stacked layers would collapse into a single linear map.
Example: a sample from the Heart Disease dataset.

|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 392 | 51 | 1 | 2 | 110 | 175 | 0 | 1 | 123 | 0 | 0.6 | 2 | 0 | 2 | 1 |
| 960 | 52 | 0 | 2 | 136 | 196 | 0 | 0 | 169 | 0 | 0.1 | 1 | 0 | 2 | 1 |
| 888 | 60 | 0 | 0 | 150 | 258 | 0 | 0 | 157 | 0 | 2.6 | 1 | 2 | 3 | 0 |
| 741 | 41 | 0 | 2 | 112 | 268 | 0 | 0 | 172 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 287 | 71 | 0 | 1 | 160 | 302 | 0 | 1 | 162 | 0 | 0.4 | 2 | 2 | 2 | 1 |
Categorical variables are encoded with `OneHotEncoder` (here via the equivalent `pd.get_dummies`), and quantitative variables are scaled with `MinMaxScaler` or `StandardScaler`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# `data` is the Heart Disease DataFrame loaded beforehand
data = data.dropna()  # drop missing values
quan_vars = ['age','trestbps','chol','thalach','oldpeak']
qual_vars = ['sex','cp','fbs','restecg','exang','slope','ca','thal']
for col in quan_vars:
    data[col] = data[col].astype('float')
for col in qual_vars:
    data[col] = data[col].astype('category')
data = pd.get_dummies(data, columns=qual_vars, drop_first=True)  # one-hot encoding

y = data['target']
X = data.drop('target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # train-test split

scaler = StandardScaler()  # scale inputs to zero mean and unit variance
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse training-set statistics on the test set
```
Next, we build the MLP using Keras. Common choices of nonlinear activation function include:
\[\begin{align*} \text{Sigmoid}(z)&=1/(1+e^{-z})\text{ for }z\in\mathbb{R}\\ \text{Softmax}(z)&=(e^{z_1},\dots,e^{z_d})\Big/\sum_{k=1}^d e^{z_k}\text{ for }z\in\mathbb{R}^d\\ \color{red}{\text{ReLU}(z)}&\color{red}{=\max(0,z)\text{ for }z\in\mathbb{R}}\\ \text{Tanh}(z)&=\tanh(z)\text{ for }z\in\mathbb{R}\\ \text{Leaky ReLU}(z)&=\begin{cases}z,&\text{if }z>0\\ \alpha z,&\text{if }z\leq 0.\end{cases} \end{align*}\]
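These activations are straightforward to implement directly (a minimal NumPy sketch; the Leaky ReLU slope `alpha` is a hyperparameter, commonly around 0.01):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(z))           # [0. 0. 0. 1. 3.]
print(softmax(z).sum())  # ≈ 1.0
```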
📒 Jupyter notebook: Feedforward NN by hand.

Let's see what it means: 📒 Jupyter notebook: Universal Approximation Theorem.
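A minimal sketch of a model consistent with the summary below; the layer sizes and parameter counts match the summary, while the ReLU hidden activations and the sigmoid output (for binary classification) are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),  # 22 features after one-hot encoding
    layers.Dense(32, activation='relu'),     # (22 + 1) * 32 = 736 params
    layers.Dense(32, activation='relu'),     # (32 + 1) * 32 = 1,056 params
    layers.Dense(1, activation='sigmoid'),   # (32 + 1) * 1  = 33 params
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```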
Model: "sequential"
βββββββββββββββββββββββββββββββββββ³βββββββββββββββββββββββββ³ββββββββββββββββ β Layer (type) β Output Shape β Param # β β‘βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ© β dense (Dense) β (None, 32) β 736 β βββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββΌββββββββββββββββ€ β dense_1 (Dense) β (None, 32) β 1,056 β βββββββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββΌββββββββββββββββ€ β dense_2 (Dense) β (None, 1) β 33 β βββββββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββ΄ββββββββββββββββ
Total params: 1,825 (7.13 KB)
Trainable params: 1,825 (7.13 KB)
Non-trainable params: 0 (0.00 B)
Key arguments of `model.fit`:

- `batch_size`: the minibatch size \(b\).
- `epochs`: the number of times the network passes through the entire training dataset.
- `validation_split`: the fraction of the training data held out for validation during training. Tracking the loss on this validation set lets us monitor the model's state, especially for preventing overfitting.

```python
# Training the network
import plotly.graph_objects as go  # for plotting the learning curves

history = model.fit(X_train, y_train, epochs=200, batch_size=32, validation_split=0.1, verbose=0)

# Extract loss values
train_loss = history.history['loss']
val_loss = history.history['val_loss']

# Plot the learning curves
epochs = list(range(1, len(train_loss) + 1))
fig1 = go.Figure(go.Scatter(x=epochs, y=train_loss, name="Training loss"))
fig1.add_trace(go.Scatter(x=epochs, y=val_loss, name="Validation loss"))
fig1.update_layout(title="Training and Validation Loss",
                   width=510, height=250,
                   xaxis=dict(title="Epoch", type="log"),
                   yaxis=dict(title="Loss"))
fig1.show()
```
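Finally, a sketch of checking generalization on the held-out test set (assuming the model was compiled with `metrics=['accuracy']`, as above):

```python
# Measure loss and accuracy on data the network never saw during training
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test loss: {test_loss:.4f}, test accuracy: {test_acc:.4f}")
```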
Pros
Cons