class: center, middle, inverse, title-slide

# Regularization
### Bradley C. Boehmke and Brandon M. Greenwell
### 2018/05/12

---
class: center, middle, inverse
background-image: url(Images/overview_icon.jpg)
background-size: cover

# Overview

---

## OLS Regression

<img src="Images/sq.errors-1.png" width="933" style="display: block; margin: auto;" />

* Model form: `\(y_i = \beta_0 + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{p}x_{ip} + \epsilon_i\)`

* Objective function: `\(\text{minimize} \bigg \{ SSE = \sum^n_{i=1} (y_i - \hat{y}_i)^2 \bigg \} \equiv \text{minimize MSE}\)`

---

## OLS Regression

Some key assumptions we make when working with OLS regression:

* Linear relationship
* Multivariate normality
* No autocorrelation
* Homoscedasticity (constant variance in residuals)
* `\(p < n\)` (there is no unique solution when `\(p > n\)`)
* Little or no multicollinearity

<br><br>

.full-width[.content-box-blue[.bolder[.center[
Under standard assumptions, the coefficients produced by OLS are unbiased and, of all unbiased linear techniques, have the lowest variance. 😄
]]]]

---

## OLS Regression

__However__, as `\(p\)` grows, there are three main issues we commonly run into:

1. Multicollinearity 🤬

2. Insufficient solution 😕

3. Interpretability 🤷

---

## Multicollinearity

As *p* increases `\(\rightarrow\)` multicollinearity `\(\rightarrow\)` high variability in our coefficient terms.


```r
# train models with strongly correlated variables
m1 <- lm(Sale_Price ~ Gr_Liv_Area + TotRms_AbvGrd, data = ames_train)
m2 <- lm(Sale_Price ~ Gr_Liv_Area, data = ames_train)
m3 <- lm(Sale_Price ~ TotRms_AbvGrd, data = ames_train)

*coef(m1)
##   (Intercept)   Gr_Liv_Area TotRms_AbvGrd
##    46264.9749      137.8144   -11191.4972

*coef(m2)
## (Intercept) Gr_Liv_Area
##   15796.516     110.059

*coef(m3)
##   (Intercept) TotRms_AbvGrd
##      19713.35      25014.02
```

.full-width[.content-box-blue[.bolder[.center[
Causes overfitting, which means we have high variance in the bias-variance tradeoff space. 🤬
]]]]

---

## Insufficient Solution

When `\(p > n\)` `\(\rightarrow\)` the OLS solution matrix is *not* invertible `\(\rightarrow\)`:

1. there are infinitely many solution sets, most of which overfit the data,

2. computing them is often infeasible.

<br><br><br><br><br><br><br>

.full-width[.content-box-blue[.bolder[.center[
Leads to more frustration and confusion! 😕
]]]]

---

## Interpretability

With a large number of features, we often would like to identify a smaller subset of these features that exhibit the strongest effects.

* Approach 1: model selection
    - computationally inefficient (Ames data: `\(2^{80} \approx 1.208926 \times 10^{24}\)` models to evaluate)
    - simply assumes a feature is in or out `\(\rightarrow\)` *hard thresholding*

* Approach 2: regularization
    - retains all coefficients
    - slowly pushes a feature's effect towards zero `\(\rightarrow\)` *soft thresholding*

<br><br><br>

.full-width[.content-box-blue[.bolder[.center[
Without interpretability we just have accuracy! 🤷
]]]]

---

## Regularized Regression

One alternative to OLS regression is to use regularized regression (aka *penalized* models or *shrinkage* methods)

.large[

`$$\text{minimize } \big \{ SSE + P \big \}$$`

]

* Constrains the magnitude of the coefficients

* Progressively shrinks coefficients to zero

* Reduces variability of coefficients (pulls correlated coefficients together)

<br><br><br><br><br>

.full-width[.content-box-blue[.bolder[.center[
Reduces the variance of the model, which can also reduce error! 🎉
]]]]
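---

## Why a penalty helps: a quick sketch

A minimal illustration of the invertibility point, using simulated data and base R only. This is a sketch of the closed-form ridge solution `\(\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty\)` (no intercept, features assumed centered); the variable names are made up for the example:


```r
set.seed(123)
n <- 20; p <- 30                 # more features than observations
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

# OLS normal equations: X'X is singular when p > n, so this fails
try(solve(crossprod(X)))

# adding the ridge penalty lambda * I restores invertibility,
# giving a unique, computable solution
lambda <- 2
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
head(beta_ridge)
```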
---
class: center, middle, inverse
background-image: url(Images/ridge_icon.jpg)
background-size: cover

# Ridge Regression

---

## Ridge regression: the idea

`$$\text{Objective function: minimize } \bigg \{ SSE + \lambda \sum^p_{j=1} \beta_j^2 \bigg \}$$`

---

## Ridge regression: the idea

`$$\text{Objective function: minimize } \bigg \{ SSE + \lambda \sum^p_{j=1} \beta_j^2 \bigg \}$$`

<br><br><br><br>

<img src="Images/lambda.001.png" width="1753" style="display: block; margin: auto;" />

---

## Ridge regression: the idea

`$$\text{Objective function: minimize } \bigg \{ SSE + \lambda \sum^p_{j=1} \beta_j^2 \bigg \}$$`

<img src="03-Regularization_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />

---

## Ridge regression: implementation

**Packages used:**


```r
library(rsample)  # data splitting & resampling
library(tidyr)    # data manipulation
library(dplyr)    # data manipulation
library(ggplot2)  # visualization
library(caret)    # data prep
*library(glmnet)   # implementing regularized regression approaches
```

**Data used:**


```r
boston <- pdp::boston              # example data
ames   <- AmesHousing::make_ames() # exercise data
```

---

## Ridge regression: implementation

**Data prep:**

.scrollable[


```r
# create sample splits
set.seed(123)
data_split   <- initial_split(boston, prop = .7, strata = "cmedv")
boston_train <- training(data_split)
boston_test  <- testing(data_split)

# create feature sets
*one_hot <- caret::dummyVars(cmedv ~ ., data = boston_train, fullRank = TRUE)
*train_x <- predict(one_hot, boston_train)
*train_y <- boston_train$cmedv
test_x <- predict(one_hot, boston_test)
test_y <- boston_test$cmedv

# dimension of training feature set
dim(train_x)
## [1] 356  15
```

.full-width[.content-box-blue[.bolder[.center[
`glmnet` does not use the formula method (<code>y ~ x</code>), so prior to modeling we need to create our feature and target sets.
]]]]

]

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---

## Your Turn!

1. Create training (70%) and test (30%) sets for the `AmesHousing::make_ames()` data. Use `set.seed(123)` to match my output.

2. Create training and testing feature model matrices and response vectors.

3. What is the dimension of your feature matrix?

---

## Solution: Preparing data


```r
# Create training (70%) and test (30%) sets for the AmesHousing::make_ames() data.
# Use set.seed(123)
set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7, strata = "Sale_Price")
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# Create training and testing feature model matrices and response vectors.
ames_one_hot <- caret::dummyVars(Sale_Price ~ ., data = ames_train, fullRank = TRUE)
ames_train_x <- predict(ames_one_hot, ames_train)
ames_train_y <- log(ames_train$Sale_Price)
ames_test_x  <- predict(ames_one_hot, ames_test)
ames_test_y  <- log(ames_test$Sale_Price)

# What is the dimension of your feature matrix?
dim(ames_train_x)
## [1] 2054  307
```
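---

## Solution: an aside

If you prefer to skip `caret` for this step, base R's `model.matrix()` also builds the kind of numeric feature matrix `glmnet` expects. A sketch (the `ames_train_x2`/`ames_test_x2` names are made up here, and the exact column count can differ slightly from the `dummyVars` output depending on how factor contrasts are encoded):


```r
# build the feature matrix from a formula; [, -1] drops the intercept column
ames_train_x2 <- model.matrix(Sale_Price ~ ., ames_train)[, -1]
ames_test_x2  <- model.matrix(Sale_Price ~ ., ames_test)[, -1]
dim(ames_train_x2)
```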
---

## Ridge regression: implementation

To apply a ridge model we can use the `glmnet::glmnet` function

- Ridge: <font color = "red"><code>alpha = 0</code></font>, lasso: `alpha = 1`, elastic net: <code>0 `\(\leq\)` alpha `\(\leq\)` 1</code>
- it is essential that predictor variables are standardized (`standardize = TRUE`)
- `glmnet` performs ridge regression across a wide range of `\(\lambda\)` values

.scrollable[

.pull-left[


```r
## fit ridge regression
boston_ridge <- glmnet(
  x = train_x,
  y = train_y,
*  alpha = 0
)

plot(boston_ridge, xvar = "lambda")
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />

]

.pull-right[


```r
## lambdas applied
boston_ridge$lambda
##   [1] 6963.2514683 6344.6553994 5781.0137003 5267.4443763 4799.4991356
##   [6] 4373.1248604 3984.6285006 3630.6450867 3308.1085837 3014.2253346
##  [11] 2746.4498635 2502.4628271 2280.1509266 2077.5886027 1893.0213573
##  [16] 1724.8505573 1571.6195877 1432.0012351 1304.7861921 1188.8725829
##  [21] 1083.2564193  987.0229046  899.3385101  819.4437556  746.6466308
##  [26]  680.3166020  619.8791501  564.8107948  514.6345605  468.9158446
##  [31]  427.2586533  389.3021721  354.7176401  323.2055026  294.4928165
##  [36]  268.3308864  244.4931100  222.7730159  202.9824752  184.9500715
##  [41]  168.5196169  153.5487986  139.9079466  127.4789102  116.1540351
##  [46]  105.8352308   96.4331206   87.8662679   80.0604709   72.9481193
##  [51]   66.4676094   60.5628102   55.1825771   50.2803090   45.8135449
##  [56]   41.7435959   38.0352099   34.6562666   31.5774994   28.7722414
##  [61]   26.2161948   23.8872203   21.7651455   19.8315899   18.0698062
##  [66]   16.4645344   15.0018705   13.6691457   12.4548165   11.3483649
##  [71]   10.3402074    9.4216119    8.5846219    7.8219877    7.1271039
##  [76]    6.4939516    5.9170469    5.3913927    4.9124363    4.4760290
##  [81]    4.0783909    3.7160779    3.3859518    3.0851531    2.8110766
##  [86]    2.5613483    2.3338052    2.1264764    1.9375661    1.7654381
##  [91]    1.6086014    1.4656977    1.3354891    1.2168480    1.1087465
##  [96]    1.0102486    0.9205009    0.8387261    0.7642160    0.6963251
```

]

<br><br>

]

---

## Ridge regression: implementation

We can also directly access the coefficients for a model using `coef`:

.scrollable[

.pull-left[


```r
# small lambda = big coefficients
tidy(coef(boston_ridge)[, 100])
## # A tibble: 16 x 2
##    names                x
##    <chr>            <dbl>
##  1 (Intercept) -574
##  2 lon           -  6.31
##  3 lat              3.52
##  4 crim          -  0.0830
##  5 zn               0.0366
##  6 indus         -  0.0725
##  7 chas.1           2.55
##  8 nox           - 10.8
##  9 rm               4.29
## 10 age           -  0.0105
## 11 dis           -  1.20
## 12 rad              0.131
## 13 tax           -  0.00602
## 14 ptratio       -  0.698
## 15 b                0.0107
## 16 lstat         -  0.422
```

]

.pull-right[


```r
# big lambda = small coefficients
tidy(coef(boston_ridge)[, 1])
## # A tibble: 16 x 2
##    names             x
##    <chr>         <dbl>
##  1 (Intercept) 22.6
##  2 lon         - 0.0000000000000000000000000000000000405
##  3 lat           0.00000000000000000000000000000000000113
##  4 crim        - 0.000000000000000000000000000000000000530
##  5 zn            0.000000000000000000000000000000000000150
##  6 indus       - 0.000000000000000000000000000000000000685
##  7 chas.1        0.00000000000000000000000000000000000622
##  8 nox         - 0.0000000000000000000000000000000000365
##  9 rm            0.00000000000000000000000000000000000953
## 10 age         - 0.000000000000000000000000000000000000129
## 11 dis           0.00000000000000000000000000000000000113
## 12 rad         - 0.000000000000000000000000000000000000417
## 13 tax         - 0.0000000000000000000000000000000000000263
## 14 ptratio     - 0.00000000000000000000000000000000000208
## 15 b             0.0000000000000000000000000000000000000353
## 16 lstat       - 0.000000000000000000000000000000000000947
```

]
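A quick way to see this shrinkage in aggregate (a sketch using the fitted object above; `beta_mat` and `l2_norm` are names made up for the example):


```r
# L2 norm of the coefficient vector at each lambda -- it shrinks
# steadily toward zero as the penalty grows
beta_mat <- as.matrix(boston_ridge$beta)   # p x length(lambda)
l2_norm  <- sqrt(colSums(beta_mat^2))
plot(log(boston_ridge$lambda), l2_norm, type = "l",
     xlab = "log(lambda)", ylab = "L2 norm of coefficients")
```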
<center>
<bold>
<font color="red">
What's the best `\(\lambda\)` value? How much improvement are we experiencing with our model?
</font>
</bold>
</center>

<br>

]

---

## Ridge regression: tuning

* `\(\lambda\)`: tuning parameter that helps prevent our model from over-fitting to the training data
* to identify the optimal `\(\lambda\)` value we need to perform cross-validation
* `cv.glmnet` provides a built-in option to perform k-fold CV

.scrollable[


```r
## fit CV ridge regression
boston_ridge <- cv.glmnet(
  x = train_x,
  y = train_y,
  alpha = 0,
*  nfolds = 10
)

## plot CV MSE
plot(boston_ridge)
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

<br>

]

---

## Ridge regression: tuning

.scrollable[


```r
# minimum MSE and respective lambda
min(boston_ridge$cvm)
## [1] 24.561
boston_ridge$lambda.min
## [1] 0.764216

# MSE and respective lambda within 1 standard error of minimum MSE
boston_ridge$cvm[boston_ridge$lambda == boston_ridge$lambda.1se]
## [1] 28.60901
boston_ridge$lambda.1se
## [1] 5.391393
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />

]

---

## Ridge regression: <font color = "green">pros</font>

The ridge regression model:

* pushes many of the correlated features towards each other rather than allowing one to be wildly positive and another wildly negative

* pushes non-important features closer to zero, minimizing noise

* <font color = "green">provides us more clarity in identifying the true signals in our model.</font>

---

## Ridge regression: <font color = "green">pros</font>

.scrollable[


```r
coef(boston_ridge, s = "lambda.1se") %>%
  tidy() %>%
  filter(row != "(Intercept)") %>%
  ggplot(aes(value, reorder(row, value))) +
  geom_point() +
  ggtitle("Rank-order of variable influence") +
  xlab("Coefficient") +
  ylab(NULL)
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

]

---

## Ridge regression: <font color = "red">cons</font>

The ridge regression model:

* retains <bold><font color="red">all</font></bold> variables

* does not perform feature selection
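We can verify this directly (a quick sketch using the CV fit from earlier):


```r
# every coefficient remains non-zero, even at the fairly large
# lambda.1se -- ridge shrinks features but never removes them
ridge_coefs <- coef(boston_ridge, s = "lambda.1se")[-1, ]
sum(ridge_coefs != 0)   # all 15 features are retained
```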
---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---

## Your Turn!

1. Apply a 10-fold CV ridge regression model to the ames data.

2. What is the `\(\lambda\)` value with the lowest MSE?

3. What is the `\(\lambda\)` value within 1 standard error of the lowest MSE?

---

## Solution: Ridge regression

.scrollable[


```r
## Apply CV ridge regression to ames data
ames_ridge <- cv.glmnet(
  x = ames_train_x,
  y = ames_train_y,
  alpha = 0
)

plot(ames_ridge)
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />

]

---

## Solution: Ridge regression

.scrollable[


```r
# minimum MSE and respective lambda
min(ames_ridge$cvm)
## [1] 0.02216293
ames_ridge$lambda.min
## [1] 0.1357169

# MSE and respective lambda within 1 standard error of minimum MSE
ames_ridge$cvm[ames_ridge$lambda == ames_ridge$lambda.1se]
## [1] 0.02635513
ames_ridge$lambda.1se
## [1] 0.7948967
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-19-1.svg" style="display: block; margin: auto;" />

]

---
class: center, middle, inverse
background-image: url(Images/lasso_icon.jpg)
background-size: cover

# Lasso Regression
## least absolute shrinkage and selection operator

---

## Lasso regression: the idea

.scrollable[

`$$\text{Objective function: minimize } \bigg \{ SSE + \lambda \sum^p_{j=1} | \beta_j | \bigg \}$$`

<img src="03-Regularization_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" />

.full-width[.content-box-blue[.bolder[.center[
Will actually push coefficients to zero...great for automated feature selection!
]]]]

]

---

## Lasso: implementation

.scrollable[

To apply a lasso model we can use the `glmnet::glmnet` function

- Ridge: `alpha = 0`, lasso: <font color = "red"><code>alpha = 1</code></font>, elastic net: <code>0 `\(\leq\)` alpha `\(\leq\)` 1</code>
- it is essential that predictor variables are standardized
- `glmnet` performs the lasso across a wide range of `\(\lambda\)` values


```r
## fit lasso regression
boston_lasso <- glmnet(
  x = train_x,
  y = train_y,
*  alpha = 1
)

plot(boston_lasso, xvar = "lambda")
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-21-1.svg" style="display: block; margin: auto;" />

]

---

## Lasso: tuning

.full-width[.content-box-yellow[.bolder[.center[
We can use the same operations as we did with ridge regression to perform CV.
]]]]

.scrollable[


```r
## fit CV lasso regression
boston_lasso <- cv.glmnet(
  x = train_x,
  y = train_y,
  alpha = 1
)

## plot CV MSE
plot(boston_lasso)
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-22-1.svg" style="display: block; margin: auto;" />

]
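---

## Lasso: feature selection

A quick count (a sketch; the exact numbers depend on the CV folds) shows the lasso's built-in feature selection at work, in contrast to ridge:


```r
# number of non-zero coefficients (excluding the intercept)
sum(coef(boston_lasso, s = "lambda.min")[-1, ] != 0)  # features kept at lambda.min
sum(coef(boston_lasso, s = "lambda.1se")[-1, ] != 0)  # typically fewer at lambda.1se
```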
---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---

## Your Turn!

1. Apply a 10-fold CV lasso regression model to the ames data.

2. What is the `\(\lambda\)` value with the lowest MSE?

3. What is the `\(\lambda\)` value within 1 standard error of the lowest MSE?

---

## Solution: Lasso regression

.scrollable[


```r
## Apply CV lasso regression to ames data
ames_lasso <- cv.glmnet(
  x = ames_train_x,
  y = ames_train_y,
  alpha = 1
)

plot(ames_lasso)
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-23-1.svg" style="display: block; margin: auto;" />

]

---

## Solution: Lasso regression

.scrollable[


```r
# minimum MSE and respective lambda
min(ames_lasso$cvm)
## [1] 0.02344511
ames_lasso$lambda.min
## [1] 0.003865266

# MSE and respective lambda within 1 standard error of minimum MSE
ames_lasso$cvm[ames_lasso$lambda == ames_lasso$lambda.1se]
## [1] 0.0275068
ames_lasso$lambda.1se
## [1] 0.01560415
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-25-1.svg" style="display: block; margin: auto;" />

]

---
class: center, middle, inverse
background-image: url(Images/elastic_net_icon.jpg)
background-size: cover

# Elastic nets

---

## Elastic nets: the idea

`$$\text{Objective function: minimize } \bigg \{ SSE + \lambda_1 \sum^p_{j=1} \beta_j^2 + \lambda_2 \sum^p_{j=1} | \beta_j | \bigg \}$$`

<br><br><br><br>

.full-width[.content-box-yellow[.bolder[.center[
Enables effective regularization via the ridge penalty with the feature selection characteristics of the lasso penalty!
]]]]

---

## Elastic nets: implementation

.scrollable[


```r
lasso    <- glmnet(train_x, train_y, alpha = 1.0)
*elastic1 <- glmnet(train_x, train_y, alpha = 0.25)
*elastic2 <- glmnet(train_x, train_y, alpha = 0.75)
ridge    <- glmnet(train_x, train_y, alpha = 0.0)
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-27-1.svg" style="display: block; margin: auto;" />

]

---

## Elastic nets: tuning

Two tuning parameters:

* `\(\lambda\)`
* `alpha`

.scrollable[


```r
fold_id <- sample(1:10, size = length(train_y), replace = TRUE)

cv_lasso    <- cv.glmnet(train_x, train_y, alpha = 1.0, foldid = fold_id)
*cv_elastic1 <- cv.glmnet(train_x, train_y, alpha = 0.3, foldid = fold_id)
*cv_elastic2 <- cv.glmnet(train_x, train_y, alpha = 0.6, foldid = fold_id)
cv_ridge    <- cv.glmnet(train_x, train_y, alpha = 0.0, foldid = fold_id)
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-29-1.svg" style="display: block; margin: auto;" />

]

---

## Elastic nets: tuning

.scrollable[


```r
# tuning grid
tuning_grid <- tibble::tibble(
  alpha   = seq(0, 1, by = .1),
  mse_min = NA,
  mse_1se = NA
)

for(i in seq_along(tuning_grid$alpha)) {
  # fit CV model for each alpha value
  fit <- cv.glmnet(train_x, train_y, alpha = tuning_grid$alpha[i], foldid = fold_id)

  # extract MSE and lambda values
  tuning_grid$mse_min[i] <- fit$cvm[fit$lambda == fit$lambda.min]
  tuning_grid$mse_1se[i] <- fit$cvm[fit$lambda == fit$lambda.1se]
}

tuning_grid %>%
  mutate(se = mse_1se - mse_min) %>%
  ggplot(aes(alpha, mse_min)) +
  geom_line(size = 2) +
  geom_ribbon(aes(ymax = mse_min + se, ymin = mse_min - se), alpha = .25) +
  ggtitle("MSE ± one standard error")
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-30-1.svg" style="display: block; margin: auto;" />

]
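---

## Elastic nets: tuning

Once the grid has been filled in, pulling out the best-performing `alpha` is a one-liner (a sketch using the `tuning_grid` built above):


```r
# alpha value whose CV MSE at lambda.min is lowest
tuning_grid %>%
  filter(mse_min == min(mse_min))
```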
---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---

## Your Turn!

1. Apply an elastic net model to the ames data.

2. Which value of `alpha` performs best?

3. Can you identify the most influential features?

---

## Solution: Compare performance

.scrollable[


```r
# reproducible CV splits
fold_id <- sample(1:10, size = length(ames_train_y), replace = TRUE)

# tuning grid
tuning_grid <- tibble::tibble(
  alpha      = seq(0, 1, by = .1),
  mse_min    = NA,
  mse_1se    = NA,
  lambda_min = NA,
  lambda_1se = NA
)

# modeling
for(i in seq_along(tuning_grid$alpha)) {
  fit <- cv.glmnet(ames_train_x, ames_train_y, alpha = tuning_grid$alpha[i], foldid = fold_id)

  tuning_grid$mse_min[i]    <- fit$cvm[fit$lambda == fit$lambda.min]
  tuning_grid$mse_1se[i]    <- fit$cvm[fit$lambda == fit$lambda.1se]
  tuning_grid$lambda_min[i] <- fit$lambda.min
  tuning_grid$lambda_1se[i] <- fit$lambda.1se
}

# compare optimal MSEs
tuning_grid %>%
  mutate(se = mse_1se - mse_min) %>%
  ggplot(aes(alpha, mse_min)) +
  geom_line(size = 2) +
  geom_ribbon(aes(ymax = mse_min + se, ymin = mse_min - se), alpha = .25) +
  ggtitle("MSE ± one standard error")
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-31-1.svg" style="display: block; margin: auto;" />

]

---

## Solution: Identify influential features

.scrollable[


```r
# get the coefficients
best_fit <- glmnet(
  x = ames_train_x,
  y = ames_train_y,
  alpha = 1,
  lambda = subset(tuning_grid, alpha == 1)$lambda_min
)

best_fit %>%
  coef() %>%
  tidy() %>%
  filter(row != "(Intercept)") %>%
  ggplot(aes(value, reorder(row, value), color = value >= 0)) +
  geom_point(show.legend = FALSE)
```

<img src="03-Regularization_files/figure-html/unnamed-chunk-32-1.svg" style="display: block; margin: auto;" />

]

---
class: center, middle, inverse
background-image: url(Images/prediction_icon.jpg)
background-size: cover

<br><br><br><br>
# Predicting

---

## Making predictions on new data

* Use `predict` with the best model and new data
* Caveat: you must supply the `s` parameter with your preferred `\(\lambda\)` value


```r
# some best model
cv_lasso <- cv.glmnet(ames_train_x, ames_train_y, alpha = 1.0)
min(cv_lasso$cvm)
```

```
## [1] 0.02101189
```

```r
# predict
pred <- predict(cv_lasso, s = cv_lasso$lambda.min, ames_test_x)

# re-transform predicted values back to the dollar scale
pred_tran <- exp(pred)
caret::RMSE(pred_tran, ames_test$Sale_Price)
```

```
## [1] 24740.36
```
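---

## Making predictions on new data

For comparison, the same steps at the more conservative `lambda.1se` (a sketch; the exact RMSE will vary with the CV folds, and `pred_1se` is a name made up for the example):


```r
# predict with the 1-SE rule lambda and compare test RMSE
pred_1se <- predict(cv_lasso, s = cv_lasso$lambda.1se, newx = ames_test_x)
caret::RMSE(exp(pred_1se), ames_test$Sale_Price)
```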
---
class: center, middle, inverse
background-image: url(Images/alternative_pkg_icon.jpg)
background-size: cover

# Alternative Packages

---

## `caret`

.scrollable[


```r
library(caret)

train_control <- trainControl(method = "cv", number = 10)

caret_mod <- train(
  x = ames_train_x,
  y = ames_train_y,
  method = "glmnet",
  preProc = c("center", "scale", "zv", "nzv"),
  trControl = train_control,
*  tuneLength = 10
)

caret_mod
```

]

---

## `h2o`

.full-width[.content-box-yellow[.bolder[.center[
I only show the code but do not run it due to the excess output that `h2o` kicks out!
]]]]

.scrollable[


```r
library(h2o)
h2o.init()

# convert data to h2o object
ames_h2o <- ames_train %>%
  mutate(Sale_Price_log = log(Sale_Price)) %>%
  as.h2o()

# set the response column to Sale_Price_log
response <- "Sale_Price_log"

# set the predictor names
predictors <- setdiff(colnames(ames_train), "Sale_Price")

# try using the `alpha` parameter:
# train your model, where you specify alpha
ames_glm <- h2o.glm(
  x = predictors,
  y = response,
  training_frame = ames_h2o,
  nfolds = 10,
  keep_cross_validation_predictions = TRUE,
  alpha = .25
)

# print the rmse for the cross-validated data
print(h2o.rmse(ames_glm, xval = TRUE))

# grid over `alpha`
# select the values for `alpha` to grid over
hyper_params <- list(
  alpha  = seq(0, 1, by = .1),
  lambda = seq(0.0001, 10, length.out = 10)
)

# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space,
# use random grid search instead: list(strategy = "RandomDiscrete")

# build grid search with previously selected hyperparameters
grid <- h2o.grid(
  x = predictors,
  y = response,
  training_frame = ames_h2o,
  nfolds = 10,
  keep_cross_validation_predictions = TRUE,
  algorithm = "glm",
  grid_id = "ames_grid",
  hyper_params = hyper_params,
  search_criteria = list(strategy = "Cartesian")
)

# sort the grid models by mse
sorted_grid <- h2o.getGrid("ames_grid", sort_by = "mse", decreasing = FALSE)
sorted_grid

# grab top model id
best_h2o_model <- sorted_grid@model_ids[[1]]
best_model <- h2o.getModel(best_h2o_model)
```

]

---
class: center, middle, inverse
background-image: url(Images/learn_more.jpg)
background-size: cover

# Learning More

---

## Additional Resources

* Regularization has been extended to *many* other machine learning algorithms

* Great resources to learn more (listed in order of complexity):

<img src="Images/additional_regularization_resources.png" width="1431" style="display: block; margin: auto;" />