class: center, middle, inverse, title-slide

# Deep Learning
### Bradley C. Boehmke and Brandon M. Greenwell
### 2018/05/12

---
class: center, middle, inverse
background-image: url(https://mir-s3-cdn-cf.behance.net/project_modules/disp/b2ac1a19467255.562dae98a2b23.jpg)
background-size: cover

# Deep Learning

???
Image credit: [Behance](https://www.behance.net/gallery/19467255/Thinking-David-)

---

## Why deep learning?

Neural networks originated in the computer science field to answer questions that traditional statistical approaches were not designed to answer.

<img src="Images/digits.png" width="519" style="display: block; margin: auto;" />

.full-width[.content-box-blue[.bolder[.center[
We humans interpret the many features of each digit (e.g., angles, edges, thickness, circles). 🤔
]]]]

---

## Why deep learning?

In essence, neural networks perform the same task, albeit in a far simpler manner than our brains.

<img src="Images/digit_4.jpg" width="65%" height="65%" style="display: block; margin: auto;" />

.full-width[.content-box-blue[.bolder[.center[
But the computational demands were a barrier.
]]]]

---

## Overcoming challenges

Recent advancements have created new energy around neural networks:

* advancements in computer hardware (off-the-shelf CPUs became faster and GPUs became widely available) made computation more practical,
* growth in data collection made them more relevant,
* advancements in the underlying algorithms made the depth (number of hidden layers) of neural nets less of a constraint

<br>

<img src="Images/deep_nn2.png" width="1255" style="display: block; margin: auto;" />

---

## The result...

Deeper, more complex neural networks

.green[__Pros:__]

* handle high-dimensional, unstructured, feature-rich data
* automate feature engineering
* capture complex, non-linear relationships

.red[__Cons:__]

* computationally demanding
* _as the feature space shrinks, traditional machine learning approaches tend to perform just as well, if not better, and are more efficient_

.full-width[.content-box-blue[.bolder[.center[
⚠️ DNNs are not a panacea!
]]]]

---

## Feedforward DNNs

Multiple DNN models exist:

* __convolutional neural networks__ (CNN or ConvNet) have wide applications in image and video recognition,
* __recurrent neural networks__ (RNN) are used with speech recognition,
* __long short-term memory neural networks__ (LSTM) are advancing automated robotics and machine translation.

Fundamental to all these methods is the ___feedforward neural net___ (aka multilayer perceptron)

<img src="Images/mlp_network.png" width="381" style="display: block; margin: auto;" />

---

## Key components

To build a feedforward DNN we need 4 key components:

1. input data,
2. a defined network architecture,
3. a feedback mechanism to help our model learn,
4. a model training approach.

<br>

<img src="Images/ready.png" width="299" style="display: block; margin: auto;" />

---
class: center, middle, inverse
background-image: url(Images/prerequisites.png)
background-position: center
background-size: contain

---

## Package requirement

We'll use the CPU-based version of `keras` and `TensorFlow`. Install with the following:

```r
install.packages("keras")
keras::install_keras()

# install_keras() may require you to execute the following at a terminal
$ sudo /usr/bin/easy_install pip
$ sudo /usr/local/bin/pip install --upgrade virtualenv

# if you get the above notification, re-run keras::install_keras()
```

```r
library(keras)
```

.full-width[.content-box-blue[.bolder[.center[
See [keras.rstudio.com](https://keras.rstudio.com) for details.
]]]]
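Once the install finishes, a quick sanity check (a minimal sketch, assuming the CPU install above completed) is to ask keras whether it can find its backend:

```r
library(keras)

# should return TRUE once TensorFlow is available to R
is_keras_available()
```

If this returns `FALSE`, re-run `keras::install_keras()` before proceeding.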
---

## Data requirement

.scrollable[

1. Feedforward DNNs require all feature inputs to be numeric. Consequently, we one-hot encode with `model.matrix`.

2. Due to the data transformation process that DNNs perform, they are highly sensitive to the individual scale of the feature values. Consequently, we standardize our feature sets. Also note that we standardize the test features using the mean and standard deviation of the training features to minimize data leakage.

3. When one-hot encoding, some variable levels have little or no variance. We remove these variables.

```r
# one-hot encode --> we use model.matrix(...)[, -1] to discard the intercept
data_onehot <- model.matrix(~ ., AmesHousing::make_ames())[, -1] %>% as.data.frame()

# Create training (70%) and test (30%) sets for the AmesHousing::make_ames() data.
# Use set.seed for reproducibility
set.seed(123)
split <- rsample::initial_split(data_onehot, prop = .7, strata = "Sale_Price")
train <- rsample::training(split)
test <- rsample::testing(split)

# Create & standardize feature sets
# training features
train_x <- train %>% dplyr::select(-Sale_Price)
mean <- colMeans(train_x)
std <- apply(train_x, 2, sd)
train_x <- scale(train_x, center = mean, scale = std)

# testing features
test_x <- test %>% dplyr::select(-Sale_Price)
test_x <- scale(test_x, center = mean, scale = std)

# Create & transform response sets
train_y <- log(train$Sale_Price)
test_y <- log(test$Sale_Price)

# zero-variance variables (after one-hot encoding) produce NaN when scaled, so we remove them
zv <- which(colSums(is.na(train_x)) > 0, useNames = FALSE)
train_x <- train_x[, -zv]
test_x <- test_x[, -zv]

# check dimensions
dim(train_x)
## [1] 2054  299
dim(test_x)
## [1] 876 299
```
]

---
class: center, middle, inverse
background-image: url(http://video.epfl.ch/EPFLTV//Images/Channel/LogoSAR.png)

# Network Architecture

???
Image credit: [EPFLTV](http://video.epfl.ch/EPFLTV//Images/Channel/LogoSAR.png)

---

## Layers & nodes

Layers and nodes are the building blocks of your model; they determine how complex the network will be.

* Layers are considered dense (fully connected) when every node in one layer is connected to every node in the next layer
* The more layers and nodes you add, the more opportunities for new features to be learned (commonly referred to as the model's capacity)
* Beyond the input layer, which is just our predictor variables:
    - hidden layers
        - no well-defined rules for how many to use
        - rectangular data `\(\rightarrow\)` 2-5 layers are sufficient
        - the number of hidden layers and nodes in your network drives the computational burden of your model
    - output layer
        - driven by the type of modeling you are performing
        - Regression `\(\rightarrow\)` 1 node
        - Binary classification `\(\rightarrow\)` 1 node `\(\rightarrow\)` probability of success
        - Multinomial `\(\rightarrow\)` one node per class `\(\rightarrow\)` probability of each class

---

## Implementation

`keras` uses a layering approach with `%>%`

* Two hidden layers
    - `\(1^{st}\)`: 10 nodes
    - `\(2^{nd}\)`: 5 nodes

```r
model <- keras_model_sequential() %>%
  layer_dense(units = 10, input_shape = ncol(train_x)) %>%  # hidden layer 1
  layer_dense(units = 5) %>%                                # hidden layer 2
  layer_dense(units = 1)                                    # output layer
```

<br><br>

.full-width[.content-box-blue[.bolder[.center[
Note: you must tell the first hidden layer how many features to expect with `input_shape`!
]]]]
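To sanity-check the architecture, `summary()` prints each layer's output shape and parameter count (a sketch; the exact counts depend on your feature set):

```r
# inspect layers, output shapes, and parameter counts
summary(model)
```

With the 299 standardized features created earlier, the first hidden layer alone has `\(299 \times 10 + 10 = 3000\)` parameters (one weight per feature per node, plus a bias per node).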
---

## Activation functions

__Human body__

* biologic neurons receive inputs from many adjacent neurons
* when these inputs accumulate beyond a certain threshold, the neuron is ___activated___, suggesting there is a signal

__Neural nets__

* we use activation functions to perform the same role in neural nets

<img src="Images/perceptron_node.png" width="491" style="display: block; margin: auto;" />

---

## Activation functions

Multiple activation functions to choose from:

<img src="Images/activation_functions2.png" width="60%" height="60%" style="display: block; margin: auto;" />

.full-width[.content-box-blue[.bolder[.center[
[https://en.wikipedia.org/wiki/Activation_function](https://en.wikipedia.org/wiki/Activation_function)
]]]]

---

## Activation functions

Rules of thumb:

__Hidden layers__

* Rectified linear unit (ReLU) is the most popular
* Sigmoid
* TanH

__Output layers__

* Regression: linear (identity)
* Binary classification: sigmoid
* Multinomial: softmax

---

## Implementation

We specify activation with...wait for it...`activation`

```r
model <- keras_model_sequential() %>%
  layer_dense(units = 10, activation = "relu", input_shape = ncol(train_x)) %>%
  layer_dense(units = 5, activation = "relu") %>%
  layer_dense(units = 1)
```

---
class: center, middle, inverse
background-image: url(Images/sgd_icon.gif)
background-position: center
background-size: contain

# Backpropagation

---

## Mini-batch stochastic gradient descent

The primary ___learning___ mechanism in neural networks

Step 1: sample observations (*mini-batch*)

Step 2: assign weights and perform a *forward pass*

Step 3: compute the *loss function*

Step 4: work backwards through each layer, computing *partial derivatives*

Step 5: adjust the weights a little in the opposite direction of the gradient, scaled by the *learning rate* (see the update rule below)

Step 6: grab another mini-batch, rinse and repeat until the loss function is minimized
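In symbols, step 5 applies the same update to every weight (a sketch of plain SGD; the mini-batch variants noted below differ mainly in how they adapt this step):

`$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$`

where `\(\eta\)` is the learning rate and `\(L\)` is the loss computed on the current mini-batch.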
<br>

.full-width[.content-box-blue[.bolder[.center[
There are many mini-batch SGD algorithms to choose from.
]]]]

---

## Implementation

To be able to perform backpropagation, we add a `compile` step to our model where we specify:

* the mini-batch SGD algorithm (e.g., `rmsprop` (default), `adadelta`, `adam`)
* the loss metric (regression: MSE, MAE, MAPE; classification: binary/categorical crossentropy)
* additional metrics to track

```r
model <- keras_model_sequential() %>%
  layer_dense(units = 10, activation = "relu", input_shape = ncol(train_x)) %>%
  layer_dense(units = 5, activation = "relu") %>%
  layer_dense(units = 1) %>%
* compile(
*   optimizer = "rmsprop",
*   loss = "mse",
*   metrics = c("mae", "mape")
* )
```

<center>
<bold>
<font color="red">
We're finally ready to train!
</font>
</bold>
</center>

---
class: center, middle, inverse
background-image: url(Images/training_icon.jpg)
background-size: cover

# Model Training

---

## Model Training

To train a model, we use `fit`

* `batch_size`: anywhere from 1 to *n* `\(\rightarrow\)` typically 32, 64, 128, 256
* `epochs`: one epoch is one complete pass through the training data (i.e., enough mini-batches of the specified `batch_size` to cover all *n* samples)
* `validation_split`: the fraction of the training data set aside for an out-of-sample error estimate

```r
learn <- model %>% fit(
  x = train_x,
  y = train_y,
  batch_size = 32,
  epochs = 25,
  validation_split = .2,
  verbose = FALSE
)
```

.full-width[.content-box-yellow[.bolder[.center[
Let's put it all together!
]]]]

---

## Putting it all together

.scrollable[

```r
model <- keras_model_sequential() %>%
  # network architecture
  layer_dense(units = 10, activation = "relu", input_shape = ncol(train_x)) %>%
  layer_dense(units = 5, activation = "relu") %>%
  layer_dense(units = 1) %>%
  # backpropagation
  compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mae", "mape")
  )

# train our model
learn <- model %>% fit(
  x = train_x,
  y = train_y,
  epochs = 25,
  batch_size = 32,
  validation_split = .2,
  verbose = FALSE
)

learn
## Trained on 1,643 samples, validated on 411 samples (batch_size=32, epochs=25)
## Final epoch (plot to see history):
## loss: 0.6108
## val_mean_absolute_percentage_error: 7.819
## val_mean_absolute_error: 0.9355
## mean_absolute_percentage_error: 4.077
## val_loss: 3.488
## mean_absolute_error: 0.4848

plot(learn)
```

<img src="Figures/05-Figures/train-1.svg" style="display: block; margin: auto;" />
]
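Before tuning, it helps to have a single number for comparing runs. The history object stores per-epoch metrics in `learn$metrics`, so a minimal comparison (a sketch based on the model trained above) is:

```r
# best (lowest) validation loss and the epoch where it occurred
min(learn$metrics$val_loss)
which.min(learn$metrics$val_loss)
```

Tracking this value across the tuning steps that follow makes it easy to see whether a change actually helped.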
---
class: center, middle, inverse
background-image: url(Images/tuning-your-guitar.jpg)
background-size: cover

# Tuning

---

## General tuning process

* There are ___many___ ways to tune a DNN
* Typically, the tuning process follows these general steps; however, there is often a lot of iteration among them:

1. Adjust model capacity (layers & nodes)
2. Increase epochs if you do not see a flatlined loss function
3. Add batch normalization
4. Add dropout
5. Add weight regularization
6. Adjust learning rate

---

## Adjust model capacity (layers & nodes)

.scrollable[

Purposely overfit, then reduce layers/nodes until the errors stabilize.

```r
model <- keras_model_sequential() %>%
  # network architecture
* layer_dense(units = 500, activation = "relu", input_shape = ncol(train_x)) %>%
* layer_dense(units = 250, activation = "relu") %>%
* layer_dense(units = 125, activation = "relu") %>%
  layer_dense(units = 1) %>%
  # backpropagation
  compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mae", "mape")
  )

# train our model
learn <- model %>% fit(
  x = train_x,
  y = train_y,
  epochs = 25,
  batch_size = 32,
  validation_split = .2,
  verbose = FALSE
)

plot(learn)
```

<img src="Figures/05-Figures/overfit-1.svg" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---

## Your Turn!

Reduce the layers and nodes until you find stable errors.

---

## Solution

.scrollable[

```r
model <- keras_model_sequential() %>%
  # network architecture
* layer_dense(units = 100, activation = "relu", input_shape = ncol(train_x)) %>%
* layer_dense(units = 50, activation = "relu") %>%
  layer_dense(units = 1) %>%
  # backpropagation
  compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mae", "mape")
  )

# train our model
learn <- model %>% fit(
  x = train_x,
  y = train_y,
  epochs = 25,
  batch_size = 32,
  validation_split = .2,
  verbose = FALSE
)

learn
## Trained on 1,643 samples, validated on 411 samples (batch_size=32, epochs=25)
## Final epoch (plot to see history):
## loss: 0.1193
## val_mean_absolute_percentage_error: 5.388
## val_mean_absolute_error: 0.6457
## mean_absolute_percentage_error: 2.266
## val_loss: 1.171
## mean_absolute_error: 0.2714

plot(learn)
```

<img src="Figures/05-Figures/dnn_solution1-1.svg" style="display: block; margin: auto;" />
]

---

## Adjust epochs

.scrollable[

* If you notice your loss function is still decreasing in the last epoch, then you will want to increase the number of epochs.
* Alternatively, if your epochs flatline early, then there is no reason to run so many epochs, as you are just using extra computational energy with no gain.
* We can add a `callback` function inside of `fit` to help with this (a checkpoint variant is sketched after the plot below).

```r
# train our model
learn <- model %>% fit(
  x = train_x,
  y = train_y,
  epochs = 25,
  batch_size = 32,
  validation_split = .2,
  verbose = FALSE,
* callbacks = list(
*   callback_early_stopping(patience = 2)
* )
)

learn
## Trained on 1,643 samples, validated on 411 samples (batch_size=32, epochs=4)
## Final epoch (plot to see history):
## loss: 0.1191
## val_mean_absolute_percentage_error: 5.46
## val_mean_absolute_error: 0.654
## mean_absolute_percentage_error: 2.211
## val_loss: 1.101
## mean_absolute_error: 0.2645

plot(learn)
```

<img src="Figures/05-Figures/epochs-1.svg" style="display: block; margin: auto;" />
]
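Early stopping can also be paired with a checkpoint callback so you keep the weights from the best epoch rather than the last one. A sketch (the file name is illustrative):

```r
# train our model, saving the best weights seen so far
learn <- model %>% fit(
  x = train_x,
  y = train_y,
  epochs = 25,
  batch_size = 32,
  validation_split = .2,
  verbose = FALSE,
  callbacks = list(
    callback_early_stopping(patience = 2),
    # "best_model.h5" is an illustrative path; monitors val_loss by default
    callback_model_checkpoint("best_model.h5", save_best_only = TRUE)
  )
)
```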
---

## Add batch normalization

.scrollable[

As we add more layers, it becomes important that we continue to renormalize to help with gradient propagation.

```r
model <- keras_model_sequential() %>%
  # network architecture
  layer_dense(units = 100, activation = "relu", input_shape = ncol(train_x)) %>%
* layer_batch_normalization() %>%
  layer_dense(units = 50, activation = "relu") %>%
* layer_batch_normalization() %>%
  layer_dense(units = 1) %>%
  # backpropagation
  compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mae", "mape")
  )

# train our model
learn <- model %>% fit(
  x = train_x,
  y = train_y,
  epochs = 25,
  batch_size = 32,
  validation_split = .2,
  verbose = FALSE,
  callbacks = list(
    callback_early_stopping(patience = 2)
  )
)

learn
## Trained on 1,643 samples, validated on 411 samples (batch_size=32, epochs=22)
## Final epoch (plot to see history):
## loss: 0.06041
## val_mean_absolute_percentage_error: 2.201
## val_mean_absolute_error: 0.2608
## mean_absolute_percentage_error: 1.561
## val_loss: 0.1353
## mean_absolute_error: 0.1865

plot(learn)
```

<img src="Figures/05-Figures/batch_renorm-1.svg" style="display: block; margin: auto;" />
]

---

## Add dropout

.scrollable[

* ___Dropout___ is one of the most effective and commonly used approaches to prevent overfitting in neural networks.
* Dropout randomly drops out (sets to zero) a number of output features in a layer during training.
* By randomly removing different nodes, we help prevent the model from fitting to happenstance patterns (noise) that are not significant.

```r
model <- keras_model_sequential() %>%
  # network architecture
  layer_dense(units = 100, activation = "relu", input_shape = ncol(train_x)) %>%
  layer_batch_normalization() %>%
* layer_dropout(rate = 0.2) %>%
  layer_dense(units = 50, activation = "relu") %>%
  layer_batch_normalization() %>%
* layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1) %>%
  # backpropagation
  compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mae", "mape")
  )

# train our model
learn <- model %>% fit(
  x = train_x,
  y = train_y,
  epochs = 25,
  batch_size = 32,
  validation_split = .2,
  verbose = FALSE,
  callbacks = list(
    callback_early_stopping(patience = 2)
  )
)

learn
## Trained on 1,643 samples, validated on 411 samples (batch_size=32, epochs=25)
## Final epoch (plot to see history):
## loss: 1.609
## val_mean_absolute_percentage_error: 2.367
## val_mean_absolute_error: 0.2822
## mean_absolute_percentage_error: 8.344
## val_loss: 0.1481
## mean_absolute_error: 0.9999

plot(learn)
```

<img src="Figures/05-Figures/dropout-1.svg" style="display: block; margin: auto;" />
]

---

## Add weight regularization

.scrollable[

We can add regularization just as we saw in an earlier tutorial:

* `\(L_2\)` *norm*: most common `\(\rightarrow\)` ridge `\(\rightarrow\)` called ___weight decay___ in the context of neural nets (see the penalty form sketched below)
* `\(L_1\)` *norm*: lasso
* Combination: elastic net

```r
model <- keras_model_sequential() %>%
  # network architecture
  layer_dense(units = 100, activation = "relu", input_shape = ncol(train_x),
*             kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 50, activation = "relu",
*             kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 1) %>%
  # backpropagation
  compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mae", "mape")
  )

# train our model
learn <- model %>% fit(
  x = train_x,
  y = train_y,
  epochs = 25,
  batch_size = 32,
  validation_split = .2,
  verbose = FALSE,
  callbacks = list(
    callback_early_stopping(patience = 2)
  )
)

learn
## Trained on 1,643 samples, validated on 411 samples (batch_size=32, epochs=22)
## Final epoch (plot to see history):
## loss: 0.2211
## val_mean_absolute_percentage_error: 2.559
## val_mean_absolute_error: 0.3068
## mean_absolute_percentage_error: 1.511
## val_loss: 0.2986
## mean_absolute_error: 0.1806

plot(learn)
```

<img src="Figures/05-Figures/dnn_regularize-1.svg" style="display: block; margin: auto;" />
]
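For reference, `regularizer_l2(0.001)` adds a weight-decay penalty of the standard form to the loss being minimized (a sketch; here `\(\lambda = 0.001\)`):

`$$L_{reg} = L + \lambda \sum_j w_j^2$$`

so large weights are only retained if they buy a meaningful reduction in the unpenalized loss `\(L\)`.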
---

## Adjust learning rate

* The different optimizers (e.g., RMSProp, Adam, Adagrad) have different algorithmic approaches for deciding the learning rate.

<br>

<img src="Images/minimums.jpg" width="468" style="display: block; margin: auto;" />

---

## Adjust learning rate

.scrollable[

* The different optimizers (e.g., RMSProp, Adam, Adagrad) have different algorithmic approaches for deciding the learning rate.
* We can automatically reduce the learning rate (typically by a factor of 2-10) once the validation loss has stopped improving.

```r
model <- keras_model_sequential() %>%
  # network architecture
  layer_dense(units = 100, activation = "relu", input_shape = ncol(train_x),
*             kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 50, activation = "relu",
*             kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 1) %>%
  # backpropagation
  compile(
*   optimizer = "adadelta",
    loss = "mse",
    metrics = c("mae", "mape")
  )

# train our model
learn <- model %>% fit(
  x = train_x,
  y = train_y,
  epochs = 25,
  batch_size = 32,
  validation_split = .2,
  verbose = FALSE,
  callbacks = list(
    callback_early_stopping(patience = 2),
*   callback_reduce_lr_on_plateau()
  )
)

learn
## Trained on 1,643 samples, validated on 411 samples (batch_size=32, epochs=25)
## Final epoch (plot to see history):
## loss: 0.1876
## lr: 1
## mean_absolute_error: 0.1388
## val_mean_absolute_error: 0.2819
## mean_absolute_percentage_error: 1.162
## val_loss: 0.2799
## val_mean_absolute_percentage_error: 2.36

plot(learn)
```

<img src="Figures/05-Figures/adjust_learning-1.svg" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---

## Your Turn!

Adjust the tuning parameters and see if you can further reduce the validation loss metric:

* Adjust layers/nodes
* Increase epochs if you do not see a flatlined loss function
* Add batch normalization
* Add dropout
* Add weight regularization
* Adjust learning rate

---

## Solution

.scrollable[

```r
final_model <- keras_model_sequential() %>%
  # network architecture
  layer_dense(units = 250, activation = "relu", input_shape = ncol(train_x)) %>%
  layer_batch_normalization() %>%
  layer_dense(units = 125, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 50, activation = "relu") %>%
  layer_batch_normalization() %>%
  layer_dense(units = 1) %>%
  # backpropagation
  compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mae", "mape")
  )

# train our model
final_results <- final_model %>% fit(
  x = train_x,
  y = train_y,
  epochs = 25,
  batch_size = 32,
  validation_split = .2,
  verbose = FALSE,
  callbacks = list(
    callback_early_stopping(patience = 2),
    callback_reduce_lr_on_plateau()
  )
)

final_results
## Trained on 1,643 samples, validated on 411 samples (batch_size=32, epochs=16)
## Final epoch (plot to see history):
## loss: 0.03791
## lr: 0.001
## mean_absolute_error: 0.1523
## val_mean_absolute_error: 0.2472
## mean_absolute_percentage_error: 1.273
## val_loss: 0.09164
## val_mean_absolute_percentage_error: 2.064

plot(final_results)
```

<img src="Figures/05-Figures/final_dnn_tuning-1.svg" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse
background-image: url(Images/prediction_icon.jpg)
background-size: cover

<br><br><br><br>
# Predicting

---

## Evaluate new data set

We can call `evaluate()` on our test data:

```r
(results <- final_model %>% evaluate(test_x, test_y))
## $loss
## [1] 0.08834792
##
## $mean_absolute_error
## [1] 0.2478597
##
## $mean_absolute_percentage_error
## [1] 2.068246
```

<br>

.full-width[.content-box-yellow[.bolder[.center[
So, how do we interpret these results? 🤔
]]]]
---

## Predicting on new data

```r
final_model %>%
  predict(test_x) %>%
  broom::tidy() %>%
  dplyr::mutate(
    truth = test_y,
    pred_tran = exp(x),
    truth_tran = exp(truth)
  ) %>%
  yardstick::rmse(truth_tran, pred_tran)
## [1] 61789.22
```

* On average, our estimates are about $62K off from the actual sales price.
* Considering the mean sales price is $180K, this is a sizeable error.

---
class: center, middle, inverse
background-image: url(Images/learn_more.jpg)
background-size: cover

# Learning More

---

## Learning resources

* Great resources to learn more:

<img src="Images/deep_learning_resources.png" width="1596" style="display: block; margin: auto;" />

---

<img src="Images/what_a_day.png" width="531" style="display: block; margin: auto;" />