name: layout-general layout: true class: left, top <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: rgba(2, 70, 79, 0.874); } .remark-slide-number .numeric { position: absolute; bottom: 4%; height: 4px; display: block; right: 2.5%; font-weight: bold; } .remark-slide-number .numeric-out{ color: rgba(2, 70, 79, 0.874); } </style> <!-- pacman::p_load(tidyverse) --> <!-- pacman::p_load(readr) --> <!-- pacman::p_load(latex2exp) --> <!-- pacman::p_load(kableExtra) --> <!-- pacman::p_load(gt) --> <!-- # old_theme <-theme_set(theme_xaringan()) --> <!-- #old_theme <-theme_set(theme_minimal(base_size=14, base_family = "Helvetica")) --> <!-- pacman::p_load("knitr") --> <!-- opts_chunk$set(warning = FALSE, --> <!-- message = FALSE, --> <!-- cache = TRUE, --> <!-- autodep = TRUE, --> <!-- tidy = FALSE, --> <!-- fig.retina = 4) --> <!-- require(data.table, quietly = TRUE, warn.conflicts = FALSE) --> <!-- pacman::p_load(readr) --> --- count: false class: top, center, title-slide <img src="img/UniversiteParisCite_logo.jpg" align="left" height="100"> <img src="img/cnrs.png" align="middle" height="90"> .white[aaaaaaa] <img src="img/X.png" align="right" height="110" width="90"> .white[aaaaaaa] <br><hbr> <hr class="L1"> ## Introduction to Machine Learning <hr class="L2"> #### Institude de Technologie du Cambodge <img src="img/itc.png" align="middle" height="120"> ### Sothea .textsc[Has], Ph.D --- ## What's Machine Learning (ML)?<hbr> .pull-left[<hbr> - [Arthur Samuel (1959)](https://en.wikipedia.org/wiki/Arthur_Samuel_(computer_scientist) .center[<img src="./img/samuel.PNG" width="140px" height ="180px"/>] ] .pull-right[ <br><br><br><br> "The field of study that gives computers the ability to learn without being explicitly programmed." ] --- count:false ## What's Machine Learning (ML)?<hbr> .pull-left[<hbr> - [Arthur Samuel (1959)](https://en.wikipedia.org/wiki/Arthur_Samuel_(computer_scientist) .center[<img src="./img/samuel.PNG" width="140px" height ="180px"/>] - [Tom M. Mitchell (1997)](https://en.wikipedia.org/wiki/Tom_M._Mitchell) .center[<img src="./img/mitchell.PNG" width="140px" height ="180px"/>] ] .pull-right[ <br><br><br><br> "The field of study that gives computers the ability to learn without being explicitly programmed." <br><br><br><br><br><hbr> "A computer program is said to learn from experience .stress[E] with respect to some class of tasks .stress[T] and performance measure .stress[P], if its performance at tasks in .stress[T], as measured by .stress[P], improves with experience .stress[E]." ] --- ## Why do we care?<hbr> .subtitle[ [Applications of Machine Learning](https://www.javatpoint.com/applications-of-machine-learning)]<hbr> .center[<img src="./img/appli_ml.png" width="430px" height ="380px"/>] Because the world is now driven by .stress[Data], and .stress[Machine Learning] is a powerful tool. --- .pull-left[ ### Traditional programming .center[<img src="./img/trad_program.png" width="320px" height ="350px"/>] - .stress[Programming]'s rules are observed and designed by human. - Example:
An email containing '&' or '!' more than `\(20\)` times should be flagged as spam.
]

--

.pull-right[
### Machine Learning
.center[<img src="./img/ML.jpg" width="320px" height ="350px"/>]

- .stress[Algorithms] are designed to capture the rules from the training data.

- Example:
An email .stress[similar] to the existing spams should also be flagged as spam.
]

---
template: inter-slide
class: left, middle
count: false

##
.bold-blue[Outline] <br> .hhead[I. Some basic elements of Machine Learning] <br> .hhead[II. Hands-on Linear Regression Models] <br> .hhead[III. Logistic Regression] <br> .hhead[IV. Deep Neural Networks & Conclusion] --- template: inter-slide class: left, middle count: false ##
.bold-blue[Outline]
<br>
.section[I. Some basic elements of Machine Learning]
<br>
.hhead[II. Hands-on Linear Regression Models]
<br>
.hhead[III. Logistic Regression]
<br>
.hhead[IV. Deep Neural Networks & Conclusion]

---
## I. Some Elements of Machine Learning<h0br>
### 1. Data (the fuel
)<h0br> - Data `\(\approx\)` information, experiences, ... -- - Structured data: predefined type (string, double, ...) and dimension (rows, columns, channels)... - Example: phone numbers
, zip codes
, identity numbers
, gender
...<hbr> -- .pull-left[ ```r spam <- read_delim(file=str_c(fpath,"spam.txt", sep = ''), show_col_types = FALSE) spam[c(2,4061),c("receive","address","num000","type")] %>% knitr::kable(format = "markdown") ``` ] .pull-right[ | receive| address| num000|type | |-------:|-------:|------:|:-------| | 0.21| 0.28| 0.43|spam | | 0.00| 0.00| 0.00|nonspam | ] -- - Unstructured data: everything else (no predefined type nor dimension)... - Example: emails
, social media posts
, videos
, sensor
data...

--

- Data pre-processing: data types, removing outliers, .stress[imputing] missing values (`NA`), scaling (.stress[standardization] or .stress[normalization]),...

--

- .stress[GIGO]: "garbage in, garbage out!"

---
## I. Some Elements of Machine Learning<h0br>
### 1. Data (the fuel
)<h0br>

.pull-left[

```r
timeSeries %>%
  ggplot(aes(x = time)) +
  geom_line(aes(y = win),  color = '#4109F0') +
  geom_line(aes(y = win1), color = '#E71F1F') +
  geom_line(aes(y = win2), color = '#F7D606') +
  geom_line(aes(y = win3), color = '#1EC826') +
  geom_line(aes(y = win4), color = '#06B6D9') +
  geom_line(aes(y = win5), color = '#B40EE5') +
  theme(text = element_text(size = 18))
```

.center[<img src="./img/digit_7.png" width="480px" height ="200px"/>]
]

.pull-right[
![](itc_linear_files/figure-html/time_series-out-1.png)<!-- -->
<hbr>
.center[<img src="./img/digit.gif" width="360px" height ="200px"/>]
]

---
## I. Some Elements of Machine Learning<h0br>
### 2. Task (with fuel, where do you want to go
?)<h0br> .center[<img src="./img/ML_branches.png" width="630px" height ="450px"/><hbr> .caption[Source: [https://askdatascience.com/13/what-are-the-main-branches-of-machine-learning](https://askdatascience.com/13/what-are-the-main-branches-of-machine-learning)]] --- exclude:true ## I. Some Elements of Machine Learning<h0br> ### 2. Task (With fuel, where do you want go?)<h0br> .pull-left[ - Supervised learning: - Identify email (spam or not) - Weather forecast - Health care system - Increase of income from ads... - Unsupervised learning: - City planning - Clustering customers or users - Organizing products - Gene representation... ] .pull-right[ .center[<img src="./img/ML_branches.png" width="300px" height ="220px"/>] - Reinforcement learning: - Self-driving car, - Gaming ([Alpha GO](https://www.deepmind.com/research/highlighted-research/alphago)) - Robotic - Natural Language Processing (NLP)... ] --- ## I. Some Elements of Machine Learning<h0br> ### 3. Model (
?) <h0br>

.center[<img src="./img/input_output.jpg" width="620px" height ="220px"/>]

- Model: a function `\(f\)` such that `\(f(\text{input})\approx\text{output}\)`.

--

- For example:
`\(f(\text{email}_i)\approx \text{type}_i, \text{ for almost every }i\)` <!-- (\text{Make}_i,\text{Spend}_i,...,\text{&}_i,...,\text{Cap}_i) -->

--

- More examples:
  - Our jean model: `\(\text{waist}=a\times\text{neck}+b\)`, where `\(a,b\)` are the keys.

--

  - Exponential decay law (number of nuclei): `\(N(t)=N_0e^{-\lambda t}\)`, where `\(\lambda\)` is the key.

--

  - [Jim Rohn](https://www.jimrohn.com/): You `\(\approx\)` average( `\(5\)` people you spend the most time with ).

---
## I. Some Elements of Machine Learning<h0br>
### 4. Loss (
?)<h0br>

.pull-left[
- We want `\(f:f(\text{input})\approx\text{output}\)`.

- What does ' `\(\approx\)` ' mean?

- .stress[Loss function] quantifies the quality of the model.

- Regression loss: `\(\hat{y}=f(\text{input})\)` and `\(y=\)` output ( `\(\in\mathbb{R}\)` ),
  - Quadratic: `\(\ell_2(y,\hat{y})=(y-\hat{y})^2\)`
  - Absolute: `\(\ell_1(y,\hat{y})=|y-\hat{y}|\)`
  - Relative: `\(R\ell_1(y,\hat{y})=\Big|\frac{y-\hat{y}}{y}\Big|\)`
]

--

.pull-right[
- Classification loss:
  - `\(0\text{-}1\)` loss: `\(y,\hat{y}\in\{1,...,M\}\)`, `$$\ell_{0,1}(y,\hat{y})=\mathbb{1}_{\{y\neq \hat{y}\}}\in\{0,1\}$$`
  - Cross-entropy: `\(p,\hat{p}\in \mathcal{S}_{M-1}\)`, `$$\text{CEn}(p,\hat{p})=-\sum_{j=1}^Mp_j\log(\hat{p}_j)$$`
  - KL divergence: `\(p,\hat{p}\in \mathcal{S}_{M-1}\)`, `$$\text{KL}(p,\hat{p})=\sum_{j=1}^Mp_j\log(p_j/\hat{p}_j)$$`
<h0br>
]

--

.center[
<h0br>
.stress[Hopefully, a small (average) loss means a good model!]
]

---
## I. Some Elements of Machine Learning<h0br>
### 5. Learning (
?)<hbr>

--

- Learning `\(f^*\in\mathcal{F}=\{f:\text{Input space }\mathcal{X}\to\text{ Output space }\mathcal{Y}\}\)` means
`$$f^*=\arg\min_{f\in\mathcal{F}}\mathbb{E}[\ell(Y,f(X))]\ \ \ \ \ \ (1)$$`
where:
  - `\(\ell\)`: some loss function.
  - `\((X,Y)\in\mathcal{X}\times\mathcal{Y}\)`: generic input-output couple.
  - `\(\mathbb{E}\)`: expectation w.r.t `\((X,Y)\)`.

--

- Solving `\((1)\)` is usually .red[NOT] possible analytically!

--

- Optimization algorithms: Gradient descent (GD), Stochastic GD, Adagrad, RMSProp...

--

- Learning: optimization `\(\Rightarrow\)` best key `\(\Rightarrow\)` best model (hopefully).

--

- E.g., the jean model: `\((a^*,b^*)=\arg\min_{(a,b)\in\mathbb{R}^2}\mathbb{E}[(\text{waist}-(a\times\text{neck}+b))^2]\)`

---
count:false
## I. Some Elements of Machine Learning<h0br>
### 5. Learning (
?)<h0br> .pull-left[ <h0br> - Misclassification error of `\(f\)`: $$ `\begin{aligned} \mathbb{E}[\ell_{0,1}(Y,f(X))]&=\mathbb{E}[\mathbb{1}_{\{Y\neq f(X)\}}]\\ &=\mathbb{P}(Y\neq f(X))\\ &=1-\mathbb{P}(Y = f(X)) \end{aligned}` $$ ] .pull-right[ <br> <br> <br> .center[.stress[Misclassification error = 1 - Accuracy]] <br> <bR> ] .pull-left[ <h0br>
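- A quick illustration in R (toy labels, not from the course data):

```r
y_true <- c(1, 0, 1, 1, 0)     # true labels (toy example)
y_hat  <- c(1, 0, 0, 1, 1)     # predicted labels
mean(y_true != y_hat)          # misclassification error: 0.4
mean(y_true == y_hat)          # accuracy: 0.6 = 1 - error
```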
] .pull-right[ <h0br>
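- With the toy labels on the left, `\(2\)` of the `\(5\)` predictions are wrong, so the misclassification error is `\(0.4\)` and the accuracy is `\(0.6\)`.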
] --- ## I. Some Elements of Machine Learning<h0br> ### 6. Evaluation (how good is the model?)<h0br> - .stress[Good model] means small (average) loss, but on which data? - The model should predict fairly well on new observation. - Model evaluation: computing the loss on new (unseen) observations. -- ### In practice - We split the data into `\(2\)` parts: Training ( `\(\approx80\%\)` ) & testing ( `\(\approx20\%\)` ). - Training data: to build the model (estimate the good key). - Testing data: to access the performance. --- template: inter-slide class: left, middle count: false ##
.bold-blue[Outline] <br> .hhead[I. Some basic elements of Machine Learning] <br> .section[II. Hands-on Linear Regression Models] <br> .hhead[III. Logistic Regression] <br> .hhead[IV. Deep Neural Networks & Conclusion] --- ## II. Hands-on Linear Regression Models<h0br> ### `Tips` dataset <h0br> - Dimension: `\(244\times 7\)`. ```r tip[sample(nrow(tip), 7),] |> knitr::kable(format = "markdown") ``` | total_bill| tip|sex |smoker |day |time | size| |----------:|----:|:------|:------|:----|:------|----:| | 16.29| 3.71|Male |No |Sun |Dinner | 3| | 27.20| 4.00|Male |No |Thur |Lunch | 4| | 28.15| 3.00|Male |Yes |Sat |Dinner | 5| | 17.07| 3.00|Female |No |Sat |Dinner | 3| | 21.01| 3.50|Male |No |Sun |Dinner | 3| | 31.27| 5.00|Male |No |Sat |Dinner | 3| | 18.35| 2.50|Male |No |Sat |Dinner | 4| -- - Objective: build a linear model to predict the .stress[tip]. --- ## II. Hands-on Linear Regression Models<h0br> ### `Tips` dataset <h0br> - Pearson's correlation coefficient: `$$\rho(X,Y)=\frac{\sum_{i=1}^n(X_i-\overline{X}_n)(Y_i-\overline{Y}_n)}{\sqrt{\sum_{i=1}^n(X_i-\overline{X}_n)^2}\sqrt{\sum_{i=1}^n(Y_i-\overline{Y}_n)^2}}.$$` - `\(-1\leq \rho(X,Y)\leq 1\)` for any `\(X,Y\)`. - `\(|\rho(X,Y)|\approx 1\Leftrightarrow Y\approx aX+b\)` for some `\(a\neq 0\)`. - On **Tips** dataset: | | total_bill| tip| size| |:----------|----------:|---------:|---------:| |total_bill | 1.0000000| 0.6757341| 0.5983151| |tip | 0.6757341| 1.0000000| 0.4892988| |size | 0.5983151| 0.4892988| 1.0000000| --- ## II. Hands-on Linear Regression Models<h0br> ### `Tips` dataset <h0br> .pull-left[ ```r tip |> ggplot(aes(x = total_bill, y = tip, color = sex, size = size, shape = time)) + geom_point() + labs(title = "Tip VS total bill", x = "Total bill", y = "Tip", size = '', color = '', shape = '') + theme(rect = element_rect(colour = "white"), axis.text.x = element_text(size = 15), axis.text.y = element_text(size = 15), legend.position="bottom", legend.box = "horizontal") ``` <img src="itc_linear_files/figure-html/tip1-1.png" style="display: block; margin: auto;" /> ] -- .pull-right[ ```r tip |> ggplot(aes(x = day, y = tip, fill = smoker))+ geom_violin() + labs(title = "Tip VS Day and Smoke", x = "Day", y = "Tip") + theme(rect = element_rect(colour = "white"), axis.text.x = element_text(size = 15), axis.text.y = element_text(size = 15), legend.position="bottom", legend.box = "horizontal") ``` <img src="itc_linear_files/figure-html/tip2-1.png" style="display: block; margin: auto;" /> ] --- ## II. Hands-on Linear Regression Models<h0br> ### Simple LR: `Tip ~ Total_bill` <h0br> - Simple start: **Total bill** `\(\rightarrow\)` **Tip**. - Model: `\(\hat{y}=ax +b\)` - Criterion: `\(\text{MSE}(a,b)=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2=\frac{1}{n}\sum_{i=1}^n(y_i-ax_i-b)^2\)` -- <img src="itc_linear_files/figure-html/mse-out-1.png" style="display: block; margin: auto;" /> --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Simple LR: `Tip ~ Total_bill` <h0br> - Simple start: **Total bill** `\(\rightarrow\)` **Tip**. - Model: `\(\hat{y}=ax +b\)` - Criterion: `\(\text{MSE}(a,b)=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2=\frac{1}{n}\sum_{i=1}^n(y_i-ax_i-b)^2\)`
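- This criterion is easy to write down directly in R (a sketch, using the `tip` data frame loaded above):

```r
# MSE of the line tip ≈ a * total_bill + b, evaluated on the Tips data
mse <- function(a, b, x = tip$total_bill, y = tip$tip) {
  mean((y - a * x - b)^2)
}
mse(a = 0.1, b = 1)              # an arbitrary line
mse(a = 0, b = mean(tip$tip))    # predicting the average tip for everyone
```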
--- count:false ## II. Hands-on Linear Regression Models<h0br> ### Simple LR: `Tip ~ Total_bill` <h0br> - Simple start: **Total bill** `\(\rightarrow\)` **Tip**. - Model: `\(\hat{y}=ax +b\)` - Criterion: `\(\text{MSE}(a,b)=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2=\frac{1}{n}\sum_{i=1}^n(y_i-ax_i-b)^2\)` .center[<hbr> ``` Call: lm(formula = tip ~ total_bill, data = tip[mask, ]) Residuals: Min 1Q Median 3Q Max -3.0255 -0.6008 -0.1207 0.4884 3.3558 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.127194 0.182384 6.18 3.72e-09 *** total_bill 0.093461 0.008437 11.08 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1.024 on 193 degrees of freedom Multiple R-squared: 0.3887, Adjusted R-squared: 0.3855 F-statistic: 122.7 on 1 and 193 DF, p-value: < 2.2e-16 ``` ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Simple LR: `Tip ~ Total_bill` <h0br> - Simple start: **Total bill** `\(\rightarrow\)` **Tip**. - Model: `\(\hat{y}=ax +b\)` - Criterion: `\(\text{MSE}(a,b)=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2=\frac{1}{n}\sum_{i=1}^n(y_i-ax_i-b)^2\)` .pull-left[ - Minimizers of MSE: `$$\hat{a}=\frac{\langle x-\overline{x},y-\overline{y}\rangle}{\|x-\overline{x}\|^2}=\frac{\rho(x,y)}{\mathbb{V}(x)},\hspace{0.3cm}\\ \hat{b}=\overline{y}-\hat{a}\overline{x}\hspace{5.3cm}$$` ] .pull-right[ <hbr> ```r p1 + geom_point(data =tibble(total_bill = tip[!mask,][['total_bill']], tip = predict(mod0, newdata = tip[!mask,])), color = "#791AD3", shape = "x", size = 4) + geom_abline(slope = mod0$coefficients[2], intercept = mod0$coefficients[1], color = "blue", linewidth = 1)+ labs(title = "Fitting linear line") + annotate('text', x = 30, y = 8.5, label = paste("RMSE =", round(er, 3), "at a =",round(mod0$coefficients[2], 3), "and b =", round(mod0$coefficients[1], 3), "\n R-squared:", round(summary(mod0)$r.squared, 3))) -> p3; p3 ``` <img src="itc_linear_files/figure-html/fit1-1.png" style="display: block; margin: auto;" /> ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Simple LR: `Tip ~ Total_bill` <h0br> - Simple start: **Total bill** `\(\rightarrow\)` **Tip**. - Model: `\(\hat{y}=ax +b\)` - Criterion: `\(\text{MSE}(a,b)=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2=\frac{1}{n}\sum_{i=1}^n(y_i-ax_i-b)^2\)` .pull-left[ - Minimizers of MSE: `$$\hat{a}=\frac{\langle x-\overline{x},y-\overline{y}\rangle}{\|x-\overline{x}\|^2}=\frac{\rho(x,y)}{\mathbb{V}(x)},\\ \hat{b}=\overline{y}-\hat{a}\overline{x}\hspace{5.3cm}$$` - .stress[R-squared]: `\(R^2=\mathbb{V}(\hat{y})/\mathbb{V}(y)\)`. ] .pull-right[ <hbr> ```r p1 + geom_point(data =tibble(total_bill = tip[!mask,][['total_bill']], tip = predict(mod0, newdata = tip[!mask,])), color = "red", shape = "x", size = 4) + geom_abline(slope = mod0$coefficients[2], intercept = mod0$coefficients[1], color = "blue", linewidth = 1)+ labs(title = "Fitting linear line") + annotate('text', x = 30, y = 8.5, label = paste("RMSE:", round(er, 3), "at a =",round(mod0$coefficients[2], 3), "and b =", round(mod0$coefficients[1], 3))) -> p3; p3 ``` ![](itc_linear_files/figure-html/fit1-out-1.png)<!-- --> ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Simple LR: `Tip ~ Total_bill` <h0br> - Simple start: **Total bill** `\(\rightarrow\)` **Tip**. 
- Model: `\(\hat{y}=ax +b\)` - Criterion: `\(\text{MSE}(a,b)=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2=\frac{1}{n}\sum_{i=1}^n(y_i-ax_i-b)^2\)` .pull-left[ - Minimizers of MSE: `$$\hat{a}=\frac{\langle x-\overline{x},y-\overline{y}\rangle}{\|x-\overline{x}\|^2}=\frac{\rho(x,y)}{\mathbb{V}(x)},\\ \hat{b}=\overline{y}-\hat{a}\overline{x}\hspace{5.3cm}$$` - .stress[R-squared]: `\(R^2=\mathbb{V}(\hat{y})/\mathbb{V}(y)\)`. - .red[Not possible] in general! ] .pull-right[ <hbr> ```r p1 + geom_point(data =tibble(total_bill = tip[!mask,][['total_bill']], tip = predict(mod0, newdata = tip[!mask,])), color = "red", shape = "x", size = 4) + geom_abline(slope = mod0$coefficients[2], intercept = mod0$coefficients[1], color = "blue", linewidth = 1)+ labs(title = "Fitting linear line") + annotate('text', x = 30, y = 8.5, label = paste("RMSE:", round(er, 3), "at a =",round(mod0$coefficients[2], 3), "and b =", round(mod0$coefficients[1], 3))) -> p3; p3 ``` ![](itc_linear_files/figure-html/fit1-out-1.png)<!-- --> ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Simple LR: `Tip ~ Total_bill` <h0br> - Simple start: **Total bill** `\(\rightarrow\)` **Tip**. - Model: `\(\hat{y}=ax +b\)` - Criterion: `\(\text{MSE}(a,b)=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2=\frac{1}{n}\sum_{i=1}^n(y_i-ax_i-b)^2\)` .pull-left[ - Minimizers of MSE: `$$\hat{a}=\frac{\langle x-\overline{x},y-\overline{y}\rangle}{\|x-\overline{x}\|^2}=\frac{\rho(x,y)}{\mathbb{V}(x)},\\ \hat{b}=\overline{y}-\hat{a}\overline{x}\hspace{5.3cm}$$` - .stress[R-squared]: `\(R^2=\mathbb{V}(\hat{y})/\mathbb{V}(y)\)`. - .red[Not possible] in general! - .stress[Numerical methods] are needed! ] .pull-right[ <hbr> ```r p1 + geom_point(data =tibble(total_bill = tip[!mask,][['total_bill']], tip = predict(mod0, newdata = tip[!mask,])), color = "red", shape = "x", size = 4) + geom_abline(slope = mod0$coefficients[2], intercept = mod0$coefficients[1], color = "blue", linewidth = 1)+ labs(title = "Fitting linear line") + annotate('text', x = 30, y = 8.5, label = paste("RMSE:", round(er, 3), "at a =",round(mod0$coefficients[2], 3), "and b =", round(mod0$coefficients[1], 3))) -> p3; p3 ``` ![](itc_linear_files/figure-html/fit1-out-1.png)<!-- --> ] --- ## II. Hands-on Linear Regression Models<h0br> ### Optimization algorithm: Gradient Descent (GD) <h0br> - We consider the optimization problem: <hbr> `$$x^*=\arg\min_{x\in O}f(x),$$` <hbr> where `\(f:\mathbb{R}^d\to\mathbb{R}\)` is **differentiable** on some open subset `\(O\subset\mathbb{R}^d\)`. .center[<img src="./img/min.png" width="400px"/>] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Optimization algorithm: Gradient Descent (GD) <h0br> - We consider the optimization problem: <hbr> `$$x^*=\arg\min_{x\in O}f(x),$$` <hbr> where `\(f:\mathbb{R}^d\to\mathbb{R}\)` is **differentiable** on some open subset `\(O\subset\mathbb{R}^d\)`. - .stress[Key idea]: for any `\(x_0, x_0+h\in O\)` and `\(\alpha>0\)`: `\(f(x_0+h)\approx f(x_0)+h^t\nabla f(x_0)\)`. .center[<img src="./img/min.png" width="400px"/>] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Optimization algorithm: Gradient Descent (GD) <h0br> - We consider the optimization problem: <hbr> `$$x^*=\arg\min_{x\in O}f(x),$$` <hbr> where `\(f:\mathbb{R}^d\to\mathbb{R}\)` is **differentiable** on some open subset `\(O\subset\mathbb{R}^d\)`. - .stress[Key idea]: for any `\(x_0, x_0+h\in O\)` and `\(\alpha>0\)`: `\(f(x_0+h)\approx f(x_0)+h^t\nabla f(x_0)\)`. 
If `\(h=-\alpha \nabla f(x_0)\)`, thus `\(f(x_0-\alpha \nabla f(x_0))\approx f(x_0)-\alpha\|\nabla f(x_0)\|^2\leq f(x_0)\)`. -- > .stress[Gradient descent algorithm]: <br> - initialize: `\(x_0\)`, learning rate `\(\alpha>0\)`, threshold `\(\delta>0\)` - `for t = 0, 1, ..., T:` `$$x_{t+1}=x_t-\alpha\nabla f(x_t)$$` - stop when `\(\|x^{t+1}-x^t\|<\delta\)` or other stopping criterion is met. --- ## II. Hands-on Linear Regression Models<h0br> ### Optimization algorithm: Gradient Descent (GD) <h0br> > .stress[Gradient descent algorithm]: <br> - initialize: `\(x_0\)`, learning rate `\(\alpha>0\)`, threshold `\(\delta>0\)` - `for t = 0, 1, ..., T:` `$$x_{t+1}=x_t-\alpha\nabla f(x_t)$$` - stop when `\(\|x^{t+1}-x^t\|<\delta\)` or other stopping criterion is met. .pull-left-60[<hbr> - .stress[Ideal case]: if `\(f\)` is **convex** & `\(\nabla f\)` is `\(L\)`-**Lipschitz continous**, i.e., `$$\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|,$$` thus **GD** algorithm converges `\(\Leftrightarrow \alpha< 2/L\)`. ] .pull-left-40[ .center[<img src="./img/convex.png" width="270px"/>] ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Optimization algorithm: Gradient Descent (GD) <h0br> > .stress[Gradient descent algorithm]: <br> - initialize: `\(x_0\)`, learning rate `\(\alpha>0\)`, threshold `\(\delta>0\)` - `for t = 0, 1, ..., T:` `$$x_{t+1}=x_t-\alpha\nabla f(x_t)$$` - stop when `\(\|x^{t+1}-x^t\|<\delta\)` or other stopping criterion is met. .pull-left-60[<hbr> - .stress[Ideal case]: if `\(f\)` is **convex** & `\(\nabla f\)` is `\(L\)`-**Lipschitz continous**, i.e., `$$\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|,$$` thus **GD** algorithm converges `\(\Leftrightarrow \alpha< 2/L\)`. - Main issue in ML: computing full `\(\nabla f\)`. ] .pull-left-40[ .center[<img src="./img/convex.png" width="270px"/>] ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Optimization algorithm: Gradient Descent (GD) <h0br> > .stress[Gradient descent algorithm]: <br> - initialize: `\(x_0\)`, learning rate `\(\alpha>0\)`, threshold `\(\delta>0\)` - `for t = 0, 1, ..., T:` `$$x_{t+1}=x_t-\alpha\nabla f(x_t)$$` - stop when `\(\|x^{t+1}-x^t\|<\delta\)` or other stopping criterion is met. .pull-left-60[<hbr> - .stress[Ideal case]: if `\(f\)` is **convex** & `\(\nabla f\)` is `\(L\)`-**Lipschitz continous**, i.e., `$$\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|,$$` thus **GD** algorithm converges `\(\Leftrightarrow \alpha< 2/L\)`. - Main issue in ML: computing full `\(\nabla f\)`. - A common approach: numerical method using sub-samples (mini-batch). ] .pull-left-40[ .center[<img src="./img/convex.png" width="270px"/>] ] --- ## II. Hands-on Linear Regression Models<h0br> ### Optimization algorithm: Gradient Descent (GD) <h0br>
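- A bare-bones R version of the update rule, on a toy convex function (an illustration only, not the original animation):

```r
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
gd <- function(x0, alpha = 0.1, delta = 1e-8, max_iter = 1000) {
  x <- x0
  for (t in 1:max_iter) {
    x_new <- x - alpha * 2 * (x - 3)      # x_{t+1} = x_t - alpha * grad f(x_t)
    if (abs(x_new - x) < delta) break     # stopping criterion
    x <- x_new
  }
  x
}
gd(x0 = 10)   # converges to the minimizer x* = 3 (here L = 2, so any alpha < 1 works)
```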
--- ## II. Hands-on Linear Regression Models<h0br> ### `Tip ~ Total_bill` using GD <h0br> .pull-left[
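- One way to write the updates for `\(\text{MSE}(a,b)\)` (a sketch; the original chunk behind the animation is not shown):

```r
# Gradient descent on MSE(a, b) for the model tip ≈ a * total_bill + b
x <- tip$total_bill; y <- tip$tip
a <- 0; b <- 0; alpha <- 0.001               # small learning rate
for (t in 1:10000) {
  r <- y - a * x - b                         # residuals
  a <- a + 2 * alpha * mean(r * x)           # a - alpha * dMSE/da
  b <- b + 2 * alpha * mean(r)               # b - alpha * dMSE/db
}
c(a = a, b = b)                              # should approach the lm() estimates
```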
] .pull-right[ <img src="itc_linear_files/figure-html/unnamed-chunk-13-.gif" width="450" height="350" style="display: block; margin: auto 0 auto auto;" /> - And there you have it! ] --- ## II. Hands-on Linear Regression Models<h0br> ### Multiple LR: `Tip ~ Total_bill + Size` <h0br> .pull-left-60[ - Pearson's correlations between variables: | | total_bill| tip| size| |:----------|----------:|---------:|---------:| |total_bill | 1.0000000| 0.6757341| 0.5983151| |tip | 0.6757341| 1.0000000| 0.4892988| |size | 0.5983151| 0.4892988| 1.0000000| - General model: $$ \hat{Y}=X\beta\hspace{5.35cm} $$ $$ `\begin{bmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_n \\ \end{bmatrix}=\begin{bmatrix} 1 & x_{11} & ... & x_{1d} \\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & ... & x_{nd} \\ \end{bmatrix}\begin{bmatrix} \beta_0 \\ \vdots \\ \beta_d \\ \end{bmatrix}` $$ ] .pull-right-40[ - .stress[Optimization]: `$$\hat{ \beta}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2.$$` - .stress[Normal equation]: `$$\hat{\beta}=(X^tX)^{-1}X^tY,$$` - .stress[Prediction]: `$$\hat{Y}=X\hat{\beta}={\cal P}Y,$$` with `\({\cal P}=X(X^tX)^{-1}X^t\)`. - `\({\cal P}=\text{proj}_{\{X_j\}}\)` O.P matrix. ] --- count:true ## II. Hands-on Linear Regression Models<h0br> ### Multiple LR: `Tip ~ Total_bill + Size` <h0br> .pull-left-70[ ```r mod1 <- lm(tip ~ total_bill + size, data = tip[mask,]); summary(mod1) ``` ``` Call: lm(formula = tip ~ total_bill + size, data = tip[mask, ]) Residuals: Min 1Q Median 3Q Max -2.6593 -0.6091 -0.1066 0.5016 3.4849 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.81673 0.21295 3.835 0.00017 *** total_bill 0.07656 0.01039 7.372 4.87e-12 *** size 0.24950 0.09213 2.708 0.00738 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1.008 on 192 degrees of freedom Multiple R-squared: 0.4112, Adjusted R-squared: 0.405 F-statistic: 67.03 on 2 and 192 DF, p-value: < 2.2e-16 ``` ] .pull-right-30[ <br><br><br><br><br><br> |Model | RMSE| |:-----|-----:| |1 | 1.039| |2 | 1.070| ] --- ## II. Hands-on Linear Regression Models<h0br> ### Multiple LR: `Tip ~ All` <h0br> .pull-left-70[<hbr> ``` Call: lm(formula = tip ~ ., data = tip[mask, ]) Residuals: Min 1Q Median 3Q Max -2.5336 -0.5943 -0.1326 0.5160 3.5293 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.852677 0.298846 2.853 0.00482 ** total_bill 0.079678 0.010909 7.304 7.94e-12 *** sexMale -0.044752 0.156404 -0.286 0.77509 smokerYes -0.075187 0.162533 -0.463 0.64419 day.L 0.035309 0.390170 0.090 0.92799 day.Q -0.003551 0.248848 -0.014 0.98863 day.C 0.166458 0.200445 0.830 0.40735 timeLunch 0.157262 0.496369 0.317 0.75173 size 0.228495 0.096436 2.369 0.01884 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1.018 on 186 degrees of freedom Multiple R-squared: 0.4181, Adjusted R-squared: 0.3931 F-statistic: 16.71 on 8 and 186 DF, p-value: < 2.2e-16 ``` ] .pull-right-30[ <br><br><br><br><br><br> |Model | RMSE| |:-----|-----:| |1 | 1.039| |2 | 1.070| |3 | 1.092| ] --- ## II. 
Hands-on Linear Regression Models<h0br> ### Regularization <h0br> .pull-left[ <img src="itc_linear_files/figure-html/line0-1.png" style="display: block; margin: auto;" /> ] -- .pull-right[ - Regularization formulation: `$$(\hat{a},\hat{b})=\arg\min_{(a,b)\in\mathbb{R}^2}\text{MSE}+\lambda{\cal R}(a,b)$$` where `\({\cal R}\)` is a .stress[regularization] term: - `\({\cal R}(a,b)=\|(a,b)\|_1=|a|+|b|\)` - `\({\cal R}(a,b)=\|(a,b)\|_2^2=a^2+b^2.\)` - `\(\lambda\)` to be tuned, $$ `\begin{aligned} \text{Large } \lambda\hspace{1.8cm} \\ \Updownarrow \hspace{2.5cm}\\ \text{Strong regularization } \\ \Updownarrow \hspace{2.5cm}\\ \text{Small parameters} \hspace{0.5cm} \end{aligned}` $$ ] --- ## II. Hands-on Linear Regression Models<h0br> ### `\(\ell_2\)` regularization <h0br>
`$$(\hat{a},\hat{b})=\arg\min_{(a,b)\in\mathbb{R}^2}\sum_{i=1}^n(y_i-ax_i-b)^2+\lambda \|(a,b)\|_2^2$$`

---
## II. Hands-on Linear Regression Models<h0br>
### `\(\ell_1\)` regularization <h0br>
`$$(\hat{a},\hat{b})=\arg\min_{(a,b)\in\mathbb{R}^2}\sum_{i=1}^n(y_i-ax_i+b)^2+\lambda \|(a,b)\|_1$$` --- ## II. Hands-on Linear Regression Models<h0br> ### Ridge regression ( `\(\ell_2\)`-norm ) <h0br> .pull-left[ - In general, for any `\(\lambda>0\)`: `$$\hat{\beta}^{\text{ridge}}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2+\lambda\|\beta\|_2^2$$` - Solution: `$$\hat{\beta}^{\text{ridge}}=(X^tX+\lambda I)^{-1}X^tY$$` ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Ridge regression ( `\(\ell_2\)`-norm ) <h0br> .pull-left[ - In general, for any `\(\lambda>0\)`: `$$\hat{\beta}^{\text{ridge}}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2+\lambda\|\beta\|_2^2$$` - Solution: `$$\hat{\beta}^{\text{ridge}}=(X^tX+\lambda I)^{-1}X^tY$$` - .stress[K-fold cross validation]: - `For each` `\(\lambda\in\{\lambda_1, \lambda_2, ...,\lambda_J\}\)` .center[<img src="./img/cv1.png" width="400px"/>] ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Ridge regression ( `\(\ell_2\)`-norm ) <h0br> .pull-left[ - In general, for any `\(\lambda>0\)`: `$$\hat{\beta}^{\text{ridge}}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2+\lambda\|\beta\|_2^2$$` - Solution: `$$\hat{\beta}^{\text{ridge}}=(X^tX+\lambda I)^{-1}X^tY$$` - .stress[K-fold cross validation]: - `For each` `\(\lambda\in\{\lambda_1, \lambda_2, ...,\lambda_J\}\)` .center[<img src="./img/cv2.png" width="400px"/>] ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Ridge regression ( `\(\ell_2\)`-norm ) <h0br> .pull-left[ - In general, for any `\(\lambda>0\)`: `$$\hat{\beta}^{\text{ridge}}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2+\lambda\|\beta\|_2^2$$` - Solution: `$$\hat{\beta}^{\text{ridge}}=(X^tX+\lambda I)^{-1}X^tY$$` - .stress[K-fold cross validation]: - `For each` `\(\lambda\in\{\lambda_1, \lambda_2, ...,\lambda_J\}\)` .center[<img src="./img/cv3.png" width="400px"/>] ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Ridge regression ( `\(\ell_2\)`-norm ) <h0br> .pull-left[ - In general, for any `\(\lambda>0\)`: `$$\hat{\beta}^{\text{ridge}}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2+\lambda\|\beta\|_2^2$$` - Solution: `$$\hat{\beta}^{\text{ridge}}=(X^tX+\lambda I)^{-1}X^tY$$` - .stress[K-fold cross validation]: - `For each` `\(\lambda\in\{\lambda_1, \lambda_2, ...,\lambda_J\}\)` .center[<img src="./img/cv4.png" width="400px"/>] ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Ridge regression ( `\(\ell_2\)`-norm ) <h0br> .pull-left[ - In general, for any `\(\lambda>0\)`: `$$\hat{\beta}^{\text{ridge}}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2+\lambda\|\beta\|_2^2$$` - Solution: `$$\hat{\beta}^{\text{ridge}}=(X^tX+\lambda I)^{-1}X^tY$$` - .stress[K-fold cross validation]: - `For each` `\(\lambda\in\{\lambda_1, \lambda_2, ...,\lambda_J\}\)` .center[<img src="./img/cv5.png" width="400px"/>] ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Ridge regression ( `\(\ell_2\)`-norm ) <h0br> .pull-left[ - In general, for any `\(\lambda>0\)`: `$$\hat{\beta}^{\text{ridge}}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2+\lambda\|\beta\|_2^2$$` - Solution: `$$\hat{\beta}^{\text{ridge}}=(X^tX+\lambda I)^{-1}X^tY$$` - .stress[K-fold cross validation]: - `For each` `\(\lambda\in\{\lambda_1, \lambda_2, ...,\lambda_J\}\)` .center[<img src="./img/cv5.png" width="400px"/>] ] .pull-right[ - Cross-validation error: `$$\text{CVE}(\lambda)=\frac{1}{K}\sum_{k=1}^K\text{error}_k$$` ] --- count:false ## II. 
Hands-on Linear Regression Models<h0br> ### Ridge regression ( `\(\ell_2\)`-norm ) <h0br> .pull-left[ - In general, for any `\(\lambda>0\)`: `$$\hat{\beta}^{\text{ridge}}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2+\lambda\|\beta\|_2^2$$` - Solution: `$$\hat{\beta}^{\text{ridge}}=(X^tX+\lambda I)^{-1}X^tY$$` - .stress[K-fold cross validation]: - `For each` `\(\lambda\in\{\lambda_1, \lambda_2, ...,\lambda_J\}\)` .center[<img src="./img/cv5.png" width="400px"/>] ] .pull-right[ - Cross-validation error: `$$\text{CVE}(\lambda)=\frac{1}{K}\sum_{k=1}^K\text{error}_k$$` <img src="itc_linear_files/figure-html/pen1-1.png" style="display: block; margin: auto;" /> ] --- count:false ## II. Hands-on Linear Regression Models<h0br> ### Ridge regression ( `\(\ell_2\)`-norm ) <h0br> .pull-left[ - Now, look! |Model | RMSE| Inter| total_bill| size| |:-----|-----:|-----:|----------:|-----:| |1 | 1.039| 1.127| 0.093| NA| |2 | 1.070| 0.817| 0.077| 0.249| |3 | 1.092| 0.853| 0.080| 0.228| |4 | 1.033| 1.089| 0.095| NA| ] .pull-right[ <img src="itc_linear_files/figure-html/fit4-1.png" style="display: block; margin: auto;" /> ] -- ### .subtitle[Other regularized linear models:] <h0br> - Lasso regression: `\(\hat{\beta}^{\text{lasso}}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2+\lambda\|\beta\|_1.\)` - Elastic net: `\(\hat{\beta}^{\text{elas}}=\arg\min_{\beta\in\mathbb{R}^{d+1}}\|Y-X\beta\|^2+\lambda_1\|\beta\|_1+\lambda_2\|\beta\|_2.\)` --- template: inter-slide class: left, middle count: false ##
.bold-blue[Outline] <br> .hhead[I. Some basic elements of Machine Learning] <br> .hhead[II. Hands-on Linear Regression Models] <br> .section[III. Logistic Regression] <br> .hhead[IV. Deep Neural Networks & Conclusion] --- ## III. Logistic Regression<h0br> ### Perceptron: 1st step towards Neural Networks <h0br> .pull-left-60[ - For binary or multi-class classification problems. - Start with simple case: .stress[Binary classification]. - Assumptions: - Data : `\((X,Y)\in\mathbb{R}^d\times\{0, 1\}\)`. - Sigmoid function : `\(\sigma(t)=1/(1+e^{-t})\)`. - Model : `\(\mathbb{P}(Y=1|X=x)=\sigma(x^t\beta)\)` for some `\(\beta\in\mathbb{R}^{d+1}\)` and `\(x\in\{1\}\times\mathbb{R}^d\)`. - Prediction : `\(\hat{y}=1\Leftrightarrow \mathbb{P}(Y=1|X=x)\geq 0.5\)` ] .pull-right-40[ <img src="itc_linear_files/figure-html/sigmoid-1.png" style="display: block; margin: auto;" /> <img src="itc_linear_files/figure-html/sigmoid1-1.png" style="display: block; margin: auto;" /> ] --- count:false ## III. Logistic Regression<h0br> ### Perceptron: 1st step towards Neural Networks <h0br> .pull-left-60[ - For binary or multi-class classification problems. - Start with simple case: .stress[Binary classification]. - Assumptions: - Data : `\((X,Y)\in\mathbb{R}^d\times\{0, 1\}\)`. - Sigmoid function : `\(\sigma(t)=1/(1+e^{-t})\)`. - Model : `\(\mathbb{P}(Y=1|X=x)=\sigma(x^t\beta)\)` for some `\(\beta\in\mathbb{R}^{d+1}\)` and `\(x\in\{1\}\times\mathbb{R}^d\)`. - Prediction : `\(\hat{y}=1\Leftrightarrow \mathbb{P}(Y=1|X=x)\geq 0.5\)` ] .pull-right-40[ .center[<img src="./img/logR0.png" width="280px", height ="200.3px"/>] <img src="itc_linear_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> ] -- - What are the keys? -- - .stress[YES! TO ESTIMATE THE BEST PARAMETER] `\(\beta\)`. --- ## III. Logistic Regression<h0br> ### Loss function <h0br> -- - .stress[Log conditional likelihood:] `$$L(\beta|{\cal D}_{\text{train}})=\sum_{i=1}^{n_{\text{train}}}\left[y_i\log\left(\sigma(x_i^t\beta)\right)+(1-y_i)\log\left(1-\sigma(x_i^t\beta)\right)\right]$$` -- - .stress[Log loss or cross-entropy:] `$${\cal L}(\beta)=-\sum_{i=1}^{n_{\text{train}}}\left[y_i\log\left(\sigma(x_i^t\beta)\right)+(1-y_i)\log\left(1-\sigma(x_i^t\beta)\right)\right]$$` -- - .stress[Challenge] : `\({\cal L}\)` is **smooth** and **convex** ([D. Jurafsky and J. H. Martin (2023)](https://web.stanford.edu/~jurafsky/slp3/)). -- - .stress[Regularized loss :] for any `\(\lambda=(\lambda_1,\lambda_2)\in\mathbb{R}_+^2\)`, `$${\cal L}_{\lambda}(\beta)=-\sum_{i=1}^{n_{\text{train}}}\left[y_i\log\left(\sigma(x_i^t\beta)\right)+(1-y_i)\log\left(1-\sigma(x_i^t\beta)\right)\right]+\lambda_1\|\beta\|_1+\lambda_2\|\beta\|_2^2$$` --- ## III. Logistic Regression<h0br> ### Optimization in regularized case<h0br> - Optimization formulation: given `\(\lambda>0,\)` `$$\beta^*_\lambda=\arg\min_{\beta}{\cal L}_{\lambda}(\beta).$$` -- - .red[CAN'T] be solved analytically! -- - `\(\tilde{\beta}\leftarrow\text{GDOptimization}(D,\lambda)\)` : perform GD on a subsample `\(D\)` at fixed `\(\lambda>0\)`. 
<h0br> .pull-left-70[ <h0br> - .stress[Algorithm :] <hbr> > - `for` `\(\tilde{\lambda}\in\{\lambda_\min,...,\lambda_\max\}\)`: - `for` `\(k\in\{1,...,K\}\)`: - `\(\text{GDOptimization}(F_{-k}, \tilde{\lambda})\rightarrow\tilde{\beta}_\lambda\)` - `Compute error on` `\(F_k\)` - `Compute CVE` for `\(\tilde{\lambda}\)` - `\(\lambda^*=\arg\min_{\tilde{\lambda}}\text{CVE}(\tilde{\lambda})\)` - `Use` `\(\lambda^*\rightarrow\text{GDOptimization}(D_{\text{train}}, \lambda^*)\rightarrow \beta^*\)`. ] .pull-right-30[ .center[<img src="./img/cv6.png" width="200px"/>] ] --- ## III. Logistic Regression<h0br> ### `Spam` dataset ([Hopkins et al. (1999)](https://archive.ics.uci.edu/dataset/94/spambase)) <h0br> - `Spam` dataset : `\((x_i,y_i)\in\mathbb{R}^{58}\times\{\text{spam},\text{non-spam}\}\)` for `\(i=1,...,4601\)`. | id| make| data| receive| address| num000| charHash| charDollar|type | |----:|----:|----:|-------:|-------:|------:|--------:|----------:|:-------| | 1| 0.00| 0.00| 0.00| 0.64| 0.00| 0.000| 0.000|spam | | 100| 1.24| 0.00| 0.00| 0.41| 0.00| 0.000| 0.527|spam | | 1000| 0.45| 0.00| 0.00| 0.90| 0.00| 0.081| 0.162|spam | | 2000| 0.00| 0.46| 0.00| 0.00| 0.00| 0.000| 0.000|nonspam | | 3000| 0.27| 0.00| 0.00| 0.00| 0.00| 0.000| 0.000|nonspam | | 4000| 0.07| 0.00| 0.15| 0.00| 0.23| 0.000| 0.044|nonspam | | 4061| 0.00| 0.00| 0.00| 0.00| 0.00| 0.000| 0.000|nonspam | - Encode : **spam** `\(=1\)` & **nonspam** `\(=0\)`. -- - .stress[Objective]: building `spam` filter using .stress[Logistic Regression]. --- ## III. Logistic Regression<h0br> ### `Spam` dataset ([Hopkins et al. (1999)](https://archive.ics.uci.edu/dataset/94/spambase)) <h0br> .pull-left[ <img src="itc_linear_files/figure-html/inputs_spam0-1.png" width="100%" style="display: block; margin: auto auto auto 0;" /><img src="itc_linear_files/figure-html/inputs_spam0-2.png" width="100%" style="display: block; margin: auto auto auto 0;" /> ] .pull-right[ <img src="itc_linear_files/figure-html/inputs_spam1-1.png" width="100%" style="display: block; margin: auto auto auto 0;" /> ] --- ## III. Logistic Regression<h0br> ### `Spam` dataset : Results <h0br> .pull-left-60[<h0br> - **CVE** as a function on penalty parameters:
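The original figure is not reproduced here; a sketch of how such a curve can be obtained with `glmnet` (the package choice is an assumption — the slides do not say which library was used):

```r
library(glmnet)
X <- as.matrix(spam[, setdiff(names(spam), "type")])   # all numeric predictors
y <- as.numeric(spam$type == "spam")                   # spam = 1, nonspam = 0
cv_fit <- cv.glmnet(X, y, family = "binomial",
                    alpha = 0, nfolds = 10)            # ridge-type penalty
plot(cv_fit)            # CVE (binomial deviance) against log(lambda)
cv_fit$lambda.min       # penalty value with the smallest CVE
```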
] .pull-right-40[<h0br> - Accuracy as a function of penalty parameter on CV folds. <img src="itc_linear_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto 0 auto auto;" /> - Performance on test data: | Not_penalized| Penalized| |-------------:|---------:| | 0.9336957| 0.9369565| ] --- ## III. Logistic Regression<h0br> ### Confusion matrix <h0br> .pull-left[ <h0br> - .red[Imbalanced data] `\(=\)` nightmare! ] --- count:false ## III. Logistic Regression<h0br> ### Confusion matrix <h0br> .pull-left[ <h0br> - .red[Imbalanced data] `\(=\)` nightmare! - .red[Accuracy] `\(\approx\)` right for .red[wrong reasons]! ] --- count:false ## III. Logistic Regression<h0br> ### Confusion matrix <h0br> .pull-left[ <h0br> - .red[Imbalanced data] `\(=\)` nightmare! - .red[Accuracy] `\(\approx\)` right for .red[wrong reasons]! - .stress[Confusion matrix]: .center[<img src="./img/conf_mat.png" width="300px" height="180px"/>] - .stress[Recall/TPR] `\(\displaystyle=\frac{TP}{P}=\frac{TP}{TP+FN}.\)` - .stress[Precision] `\(\displaystyle=\frac{TP}{TP+FP}.\)` ] .pull-right[ <h0br> - .stress[F1 score] `\(\displaystyle=\frac{2\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}.\)` - Confusion matrix without penalization: | | nonspam| spam| |:-------|-------:|----:| |nonspam | 549| 27| |spam | 34| 310| - Confusion matrix with penalization: | | nonspam| spam| |:-------|-------:|----:| |nonspam | 551| 25| |spam | 33| 311| ] --- ## III. Logistic Regression<h0br> ### Receiver operating characteristic (ROC) curves <h0br> .pull-left[ <h0br> - .stress[TPR (power)] `\(\displaystyle=\frac{TP}{P}=\frac{TP}{TP+FN}.\)` - .stress[FPR (error I)] `\(\displaystyle=\frac{FP}{N}=\frac{FP}{FP+TN}.\)` - .stress[ROC curve] `\(=(\text{FPR}(\delta), \text{TPR}(\delta))\)` for `\(\delta\in (0,1)\)` s.t `$$\hat{y}_i=1\Leftrightarrow\hat{\mathbb{P}}(Y=1|X=x_i)\geq\delta.$$` - Is `\(\delta=1/2\)` always good? - How much you want to control .red[FPR]? ] .pull-right[ <h0br> ![](itc_linear_files/figure-html/roc-1.png)<!-- --> ] -- <h0br> - AUC = .stress[Area Under ROC Curve]. -- - Which part is for fine-tuning `\(\delta\)`? --- ## III. Logistic Regression<h0br> ### Multinomial case ( `\(M\)` ) <h0br> - Multiple classes: `\(Y\in\{y_1,y_2,...,y_M\}\)`. - Example: Hand-written digit recognition i.e., `\(Y\in\{0,1,...,9\}\)`. -- - `\(\text{Softmax}: \mathbb{R}^M\to{\cal S}_M=\{(p_1,...,p_M)\in[0,1]^M:\sum p_m=1\}\)` defined for any `\(z=(z_1,...,z_M)\in\mathbb{R}^M\)` by: `$$\text{softmax}(z)=\left(\frac{e^{z_1}}{\sum_m e^{z_m}}, \ldots,\frac{e^{z_M}}{\sum_m e^{z_m}}\right).$$` -- .pull-left[<h0br> - Recall binary model: <br> .left[<img src="./img/logR0.png" width="300px" height ="200px"/>] ] .pull-right[<h0br> - Multi-class model: <br> .left[<img src="./img/logRM.png" width="330px" height ="190px"/>] ] --- ## III. Logistic Regression<h0br> ### Multinomial case ( `\(M\)` ) <h0br> .pull-left-40[ - Multi-class model: .left[<img src="./img/logRM.png" width="300px" height ="150px"/>] ] .pull-right-60[ - For any training data `\(x_i\in\{1\}\times\mathbb{R}^d\)`, `$$\hspace{1.1cm}\hat{p}^i=\text{softmax}(x_i^tW)\hspace{5.5cm}$$` .center[<img src="./img/matrix_form_logR.png" width="400px" height ="125px"/>] ] -- - .stress[Cross-entropy:] `$${\cal L}(W)=-\sum_{i=1}^{n_{\text{train}}}\sum_{m=1}^Mt_{im}\log\left(\frac{e^{x_i^tw^m}}{\sum_{j=1}^Me^{x_i^tw^j}}\right),$$` where `\(t_{i}=(0,...,0,\underbrace{1}_{m\text{-th}},0,...,0)\Leftrightarrow y_i=y_m\)` (one-hot encoding). 
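A minimal R sketch of these two ingredients on a toy score vector (values are illustrative only):

```r
softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }    # stabilised softmax
cross_entropy <- function(t_i, p) -sum(t_i * log(p))           # t_i: one-hot, p: probabilities

z   <- c(2, 1, -1)        # toy scores x_i^t W for M = 3 classes
p   <- softmax(z)         # predicted probabilities (sum to 1)
t_i <- c(1, 0, 0)         # one-hot target: the true class is the first one
cross_entropy(t_i, p)
```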
-- - .stress[Optimization:] `\(W^*=\arg\min_{W}{\cal L}(W)+\lambda{\cal R}(W)\)`. --- ## III. Logistic Regression<h0br> ### Mnist dataset (Tensorflow)<h0br> .pull-left[ <h0br> - Dimension: - Train: `\(60\ 000\times 28\times28\)`. - Test: `\(10\ 000\times 28\times28\)`. - Examples: .center[<img src="./img/digits.png" width="370px"/>] ] -- .pull-right[ <h0br> - Model summary: ```r summary(model1) ``` .center[<img src="./img/summary_mod1.png" width="370px"/>] ] --- count:false ## III. Logistic Regression<h0br> ### Mnist dataset (Tensorflow)<h0br> .pull-left[ <h0br> - Dimension: - Train: `\(60\ 000\times 28\times28\)`. - Test: `\(10\ 000\times 28\times28\)`. - Examples: .center[<img src="./img/digits.png" width="370px"/>] ] .pull-right[ <h0br> - Model summary: ```r summary(model1) ``` .center[<img src="./img/summary_mod1.png" width="370px"/>] <h0br> .center[<img src="./img/multi_logit.gif" width="370px" height ="200px"/>] ] --- ## III. Logistic Regression<h0br> ### Mnist dataset (Tensorflow)<h0br> .pull-left-60[ <h0br> - Validation `\(=0.2\)`: <img src="itc_linear_files/figure-html/train_digit1-1.png" style="display: block; margin: auto;" /> | | Multi_Logistic_Regression| |:--------|-------------------------:| |loss | 0.2689023| |accuracy | 0.9272000| ] -- .pull-right-40[ - When it works: <img src="itc_linear_files/figure-html/train_digit2-1.png" style="display: block; margin: auto;" /> - And when it doesn't: <img src="itc_linear_files/figure-html/train_digit3-1.png" style="display: block; margin: auto;" /> ] --- template: inter-slide class: left, middle count: false ##
.bold-blue[Outline] <br> .hhead[I. Some basic elements of Machine Learning] <br> .hhead[II. Hands-on Linear Regression Models] <br> .hhead[III. Logistic Regression] <br> .section[IV. Deep Neural Networks & Conclusion] --- ## IV. Deep Neural Networks (DNN)<h0br> ### <img src="./img/neural.png" width="32px"/> Key to a bigger world<h0br> .pull-left[ <h0br> - .stress[Synapse] links between .stress[neurons]: .center[<img src="./img/brain.png" width="280px" height = "175px"/> <hbr> .caption[[Source: Perceptron: An introduction to artificial neural networks](https://matt.might.net/articles/hello-perceptron/)]] ] -- .pull-right[ <h0br> - Artificial model: .center[<img src="./img/NN0.png" width="270px" height = "180px"/> <hbr> .caption[[Source: Neural Networks from scratch](https://github.com/skawy/Neural-Network-from-scratch)]] ] -- .pull-left-40[ <hbr> - Governing equation: `$$\begin{aligned} z_0 &= x\\ z_{\ell} &= f_{\ell}(z_{\ell-1}^tW^{\ell-1})\\ \hat{y} &= O(z_{L}^tW^L),\\ \text{for }\ell &= 1,...,L, \end{aligned}$$` ] -- .pull-left-60[ <hbr> - `\(L\)` : depth. - `\(W^{\ell}\)` : weights. - `\(f_\ell\)` & `\(O\)` : activation functions, - `\(f(x)=\max\{x,0\}\)`: `Relu` - `\(f(x)=\tanh(x)\)`: `tangent hyperbolic` - `\(f(x)=\sigma(x)\)`: `sigmoid` - `\(O(x)=\text{softmax}(x)\)` ... ] --- ## IV. Deep Neural Networks (DNN)<h0br> ### <img src="./img/neural.png" width="32px"/> Feedforward NN<h0br> .pull-left[ <h0br> - Feedforward NN model: .center[<img src="./img/ffNN.gif" width="280px"/>] ] .pull-right[ <h0br> - [Universal approximation theorems](https://en.wikipedia.org/wiki/Universal_approximation_theorem): .center[<img src="./img/UAT.png" width="280px"/>] ] --- count:false ## IV. Deep Neural Networks (DNN)<h0br> ### <img src="./img/neural.png" width="32px"/> Feedforward NN<h0br> .pull-left[ <h0br> - Feedforward NN model: .center[<img src="./img/ffNN.gif" width="280px"/>] - Training: - Loss: `\(\sum_{B}\ell(f(x,W),y)/|B|\)` - Metric: `\(\sum_{V}\ell'(f(x,W),y)/|V|\)` - Opt: SGD, Momentum, Adam, ... `$$\text{Opt. + Loss}_t\rightarrow W_t\rightarrow \text{Metric}_t\rightarrow \text{Stop}.$$` ] .pull-right[ <h0br> - [Universal approximation theorems](https://en.wikipedia.org/wiki/Universal_approximation_theorem): .center[<img src="./img/UAT.png" width="280px"/>] - Backpropagation: .center[<img src="./img/reverse_ffNN.gif" width="280px"/>] ] --- ## IV. Deep Neural Networks (DNN)<h0br> ### <img src="./img/neural.png" width="32px"/> Challenges in DNN <h0br> .center[<img src="./img/cost_dnn.jpg" width="400px" height ="250px"/> <hbr> .caption[Source: [https://reconsider.news/2018/05/09/ai-researchers-allege-machine-learning-alchemy/](https://reconsider.news/2018/05/09/ai-researchers-allege-machine-learning-alchemy/)]] -- - Cost function is never convex! -- - Hungry for data! -- - Optimal architecture of DNN is still an open question! -- - Many combinations of hyperparameters to be tuned or considered ... --- ## IV. 
Deep Neural Networks (DNN)<h0br> ### <img src="./img/neural.png" width="32px"/> Learning diagnosis and tips<h0br> .pull-left[ <h0br> - [Learning curve diagnosis](https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/): .center[ <hbr> .caption[ Too simple Need more training Overfitting] <br> <img src="./img/diagnose1.png" width="110px" height ="92px"/> <img src="./img/diagnose2.png" width="110px" height ="92px"/> <img src="./img/diagnose3.png" width="110px" height ="92px"/> <br> .caption[ Easy validation Unrepresentative training data Unrepresentative validation data] <br> <img src="./img/diagnose7.png" width="110px" height ="92px"/> <img src="./img/diagnose5.png" width="110px" height ="92px"/> <img src="./img/diagnose6.png" width="110px" height ="92px"/> <br> .caption[Seem well fitted] <br> <img src="./img/diagnose4.png" width="110px" height ="92px"/>] ] .pull-right[ <h0br> - Start : - Weight initialization - Regularization - Dropout - Validation/minibatch size - Batch normalization ... - During training: `$$\text{Start small}\\ \Downarrow\\ \text{Diagnosing}\\ \Updownarrow\\ \text{Adjusting}$$` - Stop: diagnosing or early stopping ... ] --- ## IV. Deep Neural Networks (DNN)<h0br> ### <img src="./img/neural.png" width="32px"/> Some common types of ANN<h0br> .pull-left[ <h0br> - `\(L=1\)` (without hidden layer): .left[<img src="./img/logRM.png" width="300px" height ="150px"/>] <h0br> - .stress[Convenlutional NN]: Images, <br> .left[<img src="./img/cnn.gif" width="350px"/>] ] .pull-left[ <h0br> - .stress[Recurrent NN (LSTM)]: TS or NLP, .left[<img src="./img/LSTM.png" width="230px" height ="150px"/>] <h0br> - .stress[Transformers (attention)]: NLP, <h0br> .center[<img src="./img/attention.png" width="150px" height ="230px"/> <hbr> .caption[[Source: Attention is all you need.](https://arxiv.org/abs/1706.03762)]] ] --- ## IV. Deep Neural Networks (DNN)<h0br> ### <img src="./img/neural.png" width="32px"/> Comparison: Spam filtering <h0br> .pull-left[ <h0br> - .stress[Spam filtering]: .center[<img src="./img/dnn_spam.png" width="320px"/>] <h0br> <img src="itc_linear_files/figure-html/spam_dnn-1.png" style="display: block; margin: auto;" /> | Not_penalized| Penalized| DNN| |-------------:|---------:|---------:| | 0.9336957| 0.9369565| 0.9413043| ] -- .pull-right[ <h0br> - .stress[Mnist digit recognition]: .center[<img src="./img/dnn_mnist.png" width="300px"/>] <h0br> <img src="itc_linear_files/figure-html/mnist_dnn-1.png" style="display: block; margin: auto;" /> | MLP| DNN| |------:|------:| | 0.9272| 0.9785| ] --- ## Conclusion <h0br> - .stress[Summary]: - Linear Models: regression and classification. - Some intuitions of gradient-based optimization: gradient descent. - Regularization: constraint the magnitude of parameters, ovoid overfitting. - DNN & comparisons. -- - .stress[Some perspectives]: - ML is not only optimization, quantity and quality of data matter! - Interpretability is sometimes more important than predictability. - Universal method doesn't exist! It all comes down to trying. - Other (nonparametric) methods: trees, random forest, adaboost and xgboost... --- count: false template: inter-slide class: left, middle count: false .center[# References]<hbr> 📚 [I. Goodfellow, Y. Bengio, A. Courville. Deep Learning, MIT Press, 2016.](http://www.deeplearningbook.org) 📚 [T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning. Springer, 2009. 
ISBN: 978-0-387-84858-7.](https://link.springer.com/book/10.1007/978-0-387-84858-7) 📚 [A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems 30 (NIPS 2017)](https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html) 📚 [Java T Point: https://www.javatpoint.com/applications-of-machine-learning](https://www.javatpoint.com/applications-of-machine-learning) 📚 [Java T Point: https://www.javatpoint.com/basic-concepts-in-machine-learning](https://www.javatpoint.com/basic-concepts-in-machine-learning) 📚 [D. Jurafsky and J. H. Martin. Speech and Language Processing. 2023](https://web.stanford.edu/~jurafsky/slp3/)
📚 [Xaringan Rmarkdown: https://bookdown.org/yihui/rmarkdown/xaringan-format.html](https://bookdown.org/yihui/rmarkdown/xaringan-format.html)

<h0br>

.pull-right[
# Thank you 🤓
]

---
## Lab Class: Jupyter Notebook <h0br>

- Go to the following GitHub repository for the notebooks:

.center[.large[
[https://github.com/hassothea/TeachingML](https://github.com/hassothea/TeachingML)]]
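- As a warm-up, the multinomial logistic regression used on MNIST in Section III can be written in a few lines of `keras` (a sketch under the assumption that `keras`/TensorFlow is installed; the notebooks may set the model up differently):

```r
library(keras)
mnist <- dataset_mnist()                            # 60,000 train / 10,000 test images
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%        # 28 x 28 image -> vector of 784 pixels
  layer_dense(units = 10, activation = "softmax")   # one probability per digit
model %>% compile(optimizer = "adam",
                  loss = "sparse_categorical_crossentropy",
                  metrics = c("accuracy"))
model %>% fit(mnist$train$x / 255, mnist$train$y,
              epochs = 5, validation_split = 0.2)
```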