class: center, middle, inverse, title-slide

# Fundamentals of Machine Learning
### Bradley C. Boehmke and Brandon M. Greenwell
### 2018/05/12

---

## Modeling Process

Before introducing specific algorithms, this module introduces concepts that are useful for any type of machine learning model:

- data splitting
- feature engineering
- basic model formulation
- model evaluation
- model tuning

<img src="Images/modeling_process2.png" width="2391" style="display: block; margin: auto;" />

---
class: center, middle, inverse
background-image: url(Images/data_spend_icon.jpg)
background-size: cover

# Data Spend

---

## Data splitting

How do we "spend" the data to find an optimal model? We _typically_ split data into training and test data sets:

* ***Training Set***: these data are used to estimate the model and tune its parameters.

* ***Test Set***: having chosen a final model, these data are used to estimate its prediction error (generalization error). These data should _not be used during model training_. <img src="Images/no_bitme.png" style="width:3%;height:3%;">

<br>

<img src="Images/data_split.png" width="300" style="display: block; margin: auto;" />

---

## Data splitting

How we spend our data can determine how well our models perform. Given a fixed amount of data:

* too much spent in training won't allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but does not generalize (overfitting)

* too much spent in testing won't allow us to get a good assessment of model parameters

Guidelines...

* Typical splits you hear are 60-40, 70-30, or 80-20.

* You often hear that the size of the data determines the split. That's partially true; however, it also depends on...

    * the signal-to-noise ratio

    * the business decision being answered

---

## Mechanics of Data Splitting

There are a few different ways to do the split: _simple random sampling_ and _stratified sampling_ based on the outcome are the most typical.

For stratification:

* **classification**: sample within the classes so as to preserve the distribution of the outcome in the training and test sets

* **regression**: determine the quartiles of the data set and sample within those artificial groups

Other types of sampling approaches:

* Down/Up sampling
* Snowball sampling
* Synthetic Minority Over-sampling Technique (SMOTE)
* Etc.

---

## Implementation of Data Splitting

```r
library(rsample)

# make sure that you get the same random numbers
set.seed(123)
data_split <- initial_split(pdp::boston, prop = .7, strata = "cmedv")

boston_train <- training(data_split)
boston_test  <- testing(data_split)

nrow(boston_train)/nrow(pdp::boston)
## [1] 0.7035573
```

---

## Outcome Distribution

```r
library(ggplot2)

# how do the distributions differ?
ggplot(boston_train, aes(x = cmedv)) +
  geom_density(trim = TRUE) +
  geom_density(data = boston_test, trim = TRUE, col = "red")
```

<img src="02-Fundamentals_files/figure-html/sampling_dist-1.svg" style="display: block; margin: auto;" />

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---

## Your Turn!

Create training (70%) and test (30%) sets for the `AmesHousing::make_ames()` data. Stratify on the `Sale_Price` variable and visualize this variable for the training and testing sets to see if their distributions are similar.
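<br>

__Hint__: `initial_split()` handles the stratification for you. For intuition only, a minimal by-hand sketch of stratified splitting (assuming quartile-based strata; the `dplyr` helpers `ntile()` and `sample_frac()` are used purely for illustration) might look like:

```r
library(dplyr)

set.seed(123)
ames <- AmesHousing::make_ames() %>% mutate(id = row_number())

# bin the outcome into quartile-based strata, then sample 70% within each bin
train_manual <- ames %>%
  mutate(stratum = ntile(Sale_Price, 4)) %>%
  group_by(stratum) %>%
  sample_frac(size = 0.7) %>%
  ungroup()

# rows not sampled into training form the test set
test_manual <- anti_join(ames, train_manual, by = "id")
```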
---

## Solution

```r
# Ames housing data
ames <- AmesHousing::make_ames()

# create reproducible splits
set.seed(123)
data_split <- initial_split(ames, prop = .7, strata = "Sale_Price")

ames_train <- training(data_split)
ames_test  <- testing(data_split)

nrow(ames_train)/nrow(ames)
## [1] 0.7010239

# compare distributions
ggplot(ames_train, aes(x = Sale_Price)) +
  geom_density(trim = TRUE) +
  geom_density(data = ames_test, trim = TRUE, col = "red")
```

<img src="02-Fundamentals_files/figure-html/solution1-1.svg" style="display: block; margin: auto;" />

---
class: center, middle, inverse
background-image: url(Images/engineering_icon.jpg)
background-size: cover

# Feature Engineering

---

## Feature Engineering

Pre-processing our features is often a requirement for many algorithms...plus it can greatly improve performance 🙌

Feature engineering can include:

* transformations of variables
* alternate encodings of a variable
* elimination of low/zero variance predictors
* collapsing of correlated variables

<br><br><br>

.full-width[.content-box-blue[.bolder[.center[Feature engineering can include a wide variety of pre-processing steps; this section covers just a few that you will see repeatedly throughout this training.]]]]

---

## One-hot encoding

* Some models require all variables to be numeric.
* Some packages automate this process (e.g., `caret`, `h2o`) while others do not (e.g., `glmnet`, `keras`).
* We can one-hot encode with `caret::dummyVars` (provides both full-rank encoding and 1-to-1 dummy encoding).

```r
# full rank
full_rank <- dummyVars(~ ., boston_train, fullRank = TRUE)
boston_train <- predict(full_rank, boston_train) %>% as.data.frame()
boston_test  <- predict(full_rank, boston_test) %>% as.data.frame()

# 1-to-1 dummy encode (less than full rank)
*one_hot <- dummyVars(~ ., boston_train, fullRank = FALSE)
boston_train <- predict(one_hot, boston_train) %>% as.data.frame()
boston_test  <- predict(one_hot, boston_test) %>% as.data.frame()
```

.full-width[.content-box-blue[.bolder[.center[For classification problems we do not want to one-hot encode our response variable! 😨]]]]

---

## Response Transformation

Adjusting the distribution of variables by using a **transformation** can lead to a big improvement.

__Normalizing response variable__

```r
# log transformation
train_y <- log(boston_train$cmedv)
test_y  <- log(boston_test$cmedv)

# Box Cox transformation
*lambda  <- forecast::BoxCox.lambda(boston_train$cmedv)
train_y <- forecast::BoxCox(boston_train$cmedv, lambda)
test_y  <- forecast::BoxCox(boston_test$cmedv, lambda)

# Inverse Box Cox
inv_box_cox <- function(x, lambda) {
  if (lambda == 0) exp(x) else (lambda*x + 1)^(1/lambda)
}
```

.full-width[.content-box-blue[.bolder[.center[We use the training set lambda to minimize data leakage! 💧]]]]

---

## Feature Transformation

Adjusting the distribution of variables by using a **transformation** can lead to a big improvement.

__Normalizing features__

* Some models (_K_-NN, SVMs, PLS, neural networks) require that the features have the same units. **Centering** and **scaling** can be used for this purpose.

* It's important to standardize the test feature set based on the mean and standard deviation of the *training* features to minimize data leakage (a by-hand sketch follows the code below).

```r
features <- setdiff(names(boston_train), "cmedv")

# pre-process estimation based on training features
pre_process <- caret::preProcess(
  x = boston_train[, features],
* method = c("BoxCox", "center", "scale")
  )

# apply to both training & test
train_x <- predict(pre_process, boston_train[, features])
test_x  <- predict(pre_process, boston_test[, features])
```
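For intuition, here is a minimal by-hand version of the centering and scaling step (an illustrative sketch in base R; it assumes `boston_train`/`boston_test` are fully numeric at this point and skips the Box Cox piece):

```r
# estimate centering/scaling statistics from the TRAINING features only
train_means <- sapply(boston_train[, features], mean)
train_sds   <- sapply(boston_train[, features], sd)

# apply the SAME training statistics to both sets; computing them from
# the test set would leak test information into the pre-processing
train_x_manual <- scale(boston_train[, features], center = train_means, scale = train_sds)
test_x_manual  <- scale(boston_test[, features],  center = train_means, scale = train_sds)
```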
---

## Alternative Feature Transformation

* Removing near-zero or zero variance variables
* Collapsing highly correlated variables with PCA
* etc.

```r
features <- setdiff(names(boston_train), "cmedv")

# pre-process estimation based on training features
pre_process <- caret::preProcess(
  x = boston_train[, features],
* method = c("BoxCox", "center", "scale", "pca", "nzv", "zv")
  )

# apply to both training & test
train_x <- predict(pre_process, boston_train[, features])
test_x  <- predict(pre_process, boston_test[, features])
```

.full-width[.content-box-blue[.bolder[.center[Check out `?caret::preProcess` for the many options it offers.]]]]

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---

## Your Turn!

Perform the following feature engineering on your Ames data:

* __Response variable__
    - Normalize

* __Predictor variables__
    - One-hot encode
    - Standardize (center & scale)
    - Remove zero-variance variables

---

## Solution

__Normalize response variable__

```r
# transform response
lambda  <- forecast::BoxCox.lambda(ames_train$Sale_Price)
train_y <- forecast::BoxCox(ames_train$Sale_Price, lambda)
test_y  <- forecast::BoxCox(ames_test$Sale_Price, lambda)

# compare
par(mfrow = c(1, 2))
hist(ames_train$Sale_Price, breaks = 30, main = "Untransformed")
hist(train_y, breaks = 30, main = "Transformed")
```

<img src="02-Fundamentals_files/figure-html/solution2a-1.svg" style="display: block; margin: auto;" />

---

## Solution

__Predictor variables__

```r
# one-hot encode
one_hot <- dummyVars(~ ., ames_train, fullRank = FALSE)
train_x <- predict(one_hot, ames_train) %>% as.data.frame()
test_x  <- predict(one_hot, ames_test) %>% as.data.frame()

# pre-processing steps
pre_process <- caret::preProcess(
  train_x,
  method = c("center", "scale", "zv")
  )

# apply pre-processing
train_x <- predict(pre_process, train_x)
test_x  <- predict(pre_process, test_x)

# common dimensions
dim(train_x)
## [1] 2054  346
dim(test_x)
## [1] 876 346
```

---
class: center, middle, inverse
background-image: url(Images/model_formulation.jpg)
background-size: cover

# Model Formulation

---

## Packages used

* There are ___many___ packages to perform machine learning, and there is almost always more than one package implementing each algorithm.
* Each package has its own pros and cons.
* This training will try to expose you to many of the more common packages for each algorithm, but realize there are *more ways than one to skin a* 🙀

.scrollable[

.pull-left[

**Built-in `lm` approach:**

```r
m1.base <- lm(cmedv ~ ., data = boston_train)

summary(m1.base)
## 
## Call:
## lm(formula = cmedv ~ ., data = boston_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.673  -2.736  -0.593   1.943  25.502 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.079e+02  3.710e+02  -0.830 0.407125    
## lon         -2.847e+00  4.027e+00  -0.707 0.480073    
## lat          3.319e+00  4.647e+00   0.714 0.475549    
## crim        -1.131e-01  5.042e-02  -2.242 0.025575 *  
## zn           5.430e-02  1.687e-02   3.220 0.001407 ** 
## indus       -1.992e-02  7.639e-02  -0.261 0.794483    
## chas.0      -2.425e+00  1.061e+00  -2.285 0.022946 *  
## chas.1              NA         NA      NA       NA    
## nox         -1.646e+01  4.742e+00  -3.472 0.000584 ***
## rm           4.128e+00  5.227e-01   7.896 4.01e-14 ***
## age         -1.266e-02  1.596e-02  -0.794 0.428037    
## dis         -1.668e+00  2.564e-01  -6.504 2.80e-10 ***
## rad          2.856e-01  8.352e-02   3.420 0.000704 ***
## tax         -1.257e-02  4.629e-03  -2.714 0.006979 ** 
## ptratio     -8.114e-01  1.661e-01  -4.884 1.60e-06 ***
## b            1.134e-02  3.246e-03   3.495 0.000536 ***
## lstat       -4.627e-01  6.112e-02  -7.570 3.54e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.815 on 340 degrees of freedom
## Multiple R-squared:  0.7512, Adjusted R-squared:  0.7403 
## F-statistic: 68.45 on 15 and 340 DF,  p-value: < 2.2e-16
```

]

.pull-right[

**`caret` approach:**

```r
library(caret)
m1.caret <- train(cmedv ~ ., data = boston_train, method = "lm")

summary(m1.caret)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.673  -2.736  -0.593   1.943  25.502 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.079e+02  3.710e+02  -0.830 0.407125    
## lon         -2.847e+00  4.027e+00  -0.707 0.480073    
## lat          3.319e+00  4.647e+00   0.714 0.475549    
## crim        -1.131e-01  5.042e-02  -2.242 0.025575 *  
## zn           5.430e-02  1.687e-02   3.220 0.001407 ** 
## indus       -1.992e-02  7.639e-02  -0.261 0.794483    
## chas.0      -2.425e+00  1.061e+00  -2.285 0.022946 *  
## chas.1              NA         NA      NA       NA    
## nox         -1.646e+01  4.742e+00  -3.472 0.000584 ***
## rm           4.128e+00  5.227e-01   7.896 4.01e-14 ***
## age         -1.266e-02  1.596e-02  -0.794 0.428037    
## dis         -1.668e+00  2.564e-01  -6.504 2.80e-10 ***
## rad          2.856e-01  8.352e-02   3.420 0.000704 ***
## tax         -1.257e-02  4.629e-03  -2.714 0.006979 ** 
## ptratio     -8.114e-01  1.661e-01  -4.884 1.60e-06 ***
## b            1.134e-02  3.246e-03   3.495 0.000536 ***
## lstat       -4.627e-01  6.112e-02  -7.570 3.54e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.815 on 340 degrees of freedom
## Multiple R-squared:  0.7512, Adjusted R-squared:  0.7403 
## F-statistic: 68.45 on 15 and 340 DF,  p-value: < 2.2e-16
```

]

]

---

## Model Specification

* There are two primary ways we'll specify model formulation
* Typically, the non-formula (matrix) approach is more computationally efficient

.pull-left[

**Formula Interface:**

```r
lm(y ~ x1 + x2, data = train)
lm(y ~ ., data = train)
lm(log(y) ~ ., data = train)
```

]

.pull-right[

**Non-formula ("matrix") Interface:**

```r
glmnet(x = train_x, y = train_y)
```

]

<br><br>

.full-width[.content-box-blue[.bolder[.center[Not all machine learning functions have both interfaces. 🤬]]]]

---
class: center, middle, inverse
background-image: url(Images/tuning-your-guitar.jpg)
background-size: cover

# Model Tuning

---

## Tuning Parameters

* Hyperparameters control the level of model complexity.
.pull-left[

* Regularized regression
    * `alpha` (mixing %)
    * `lambda` (regularization)
* MARS
    * `nprune` (# terms)
    * `degree` (product degree)
* Neural Nets
    * `size` (# hidden units)
    * `decay` (weight decay)

]

.pull-right[

* Discriminant Analysis
    * `model` (type)
    * `dimen` (# discriminant functions)
* SVMs
    * `sigma`
    * `cost`
    * `weight`
* Trees
    * `mtry` (# randomly selected variables)
    * `ntrees` (# trees)
    * `max_depth` (tree depth)

]

---

## Tuning Parameters

* Hyperparameters control the level of model complexity.
* This is good: it allows us to adapt our model to better align with patterns in our data.

<br><br>

<img src="02-Fundamentals_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto auto auto 0;" />

---

## Tuning Parameters

* Hyperparameters control the level of model complexity.
* This is good: it allows us to adapt our model to better align with patterns in our data.
* However, it can be bad: we can overfit the model to the training data, and an overfit model will not generalize well.

<img src="02-Fundamentals_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto auto auto 0;" />

---

## Tuning Parameters

* Hyperparameters control the level of model complexity.
* This is good: it allows us to adapt our model to better align with patterns in our data.
* However, it can be bad: we can overfit the model to the training data, and an overfit model will not generalize well.

<center>
<img src="Images/bias_var.png" style="width:70%;height:70%;">
</center>

---
class: center, middle, inverse
background-image: url(Images/resampling_icon.jpg)
background-size: cover

# Resampling

---

## Cross Validation for Generalization

* Our goal is not only to find a model that performs well on training data but...
* to find one that performs well on _future unseen data_.

<br>

<img src="Images/cv.png" width="1908" style="display: block; margin: auto;" />

---

## Custom CV

* Some packages/functions do not have built-in CV capabilities
* `lm` does not have built-in CV capabilities, but we can use `rsample`

```r
set.seed(123)
*cv_split <- vfold_cv(pdp::boston, v = 10, strata = "cmedv")

cv_split
## #  10-fold cross-validation using stratification 
## # A tibble: 10 x 2
##    splits       id    
##    <list>       <chr> 
##  1 <S3: rsplit> Fold01
##  2 <S3: rsplit> Fold02
##  3 <S3: rsplit> Fold03
##  4 <S3: rsplit> Fold04
##  5 <S3: rsplit> Fold05
##  6 <S3: rsplit> Fold06
##  7 <S3: rsplit> Fold07
##  8 <S3: rsplit> Fold08
##  9 <S3: rsplit> Fold09
## 10 <S3: rsplit> Fold10
```

---

## Custom CV

* Some packages/functions do not have built-in CV capabilities
* `lm` does not have built-in CV capabilities, but we can use `rsample`

```r
cv_split <- vfold_cv(boston_train, v = 10)

# create empty vector to store error metric
rmse <- vector(mode = "numeric", length = nrow(cv_split))

# iterate through each fold: model, predict, score
for (i in 1:nrow(cv_split)) {
  m <- lm(cmedv ~ ., data = analysis(cv_split$splits[[i]]))
  p <- predict(m, assessment(cv_split$splits[[i]]))
  rmse[i] <- caret::RMSE(p, assessment(cv_split$splits[[i]])$cmedv)
}

mean(rmse)
## [1] 4.953962
```

---

## Built-in CV

Many modeling functions, such as `caret::train()`, have built-in CV capabilities

.scrollable[

```r
m2.caret <- train(
  cmedv ~ ., 
  data = boston_train, 
  method = "lm",
* trControl = trainControl(method = "cv", number = 10)
  )

m2.caret
## Linear Regression 
## 
## 356 samples
##  16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 320, 322, 320, 321, 320, 321, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.884355  0.7279148  3.431138
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

]
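---

## Combining CV with Tuning

The same `trainControl()` machinery extends to models with tuning parameters. As a hedged sketch (assuming the `glmnet` package is installed; the grid values are illustrative, not recommendations), a cross-validated grid search over regularized regression's `alpha` and `lambda` might look like:

```r
# candidate hyperparameter values to evaluate with 10-fold CV
tune_grid <- expand.grid(
  alpha  = c(0, .5, 1),                     # mixing % (ridge ... lasso)
  lambda = 10^seq(-4, 0, length.out = 10)   # regularization strength
)

m3.caret <- train(
  cmedv ~ .,
  data = boston_train,
  method = "glmnet",
  tuneGrid = tune_grid,
  trControl = trainControl(method = "cv", number = 10)
)

# caret reports CV performance for every grid combination and
# retains the best combination in m3.caret$bestTune
```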
---
class: center, middle, inverse
background-image: url(Images/evaluation_icon.jpg)
background-size: cover

# Model Evaluation

---

## Performance Metrics

.pull-left[

**Regression:**

* Mean Square Error (MSE)
* Mean Absolute Error (MAE)
* Mean Absolute Percent Error (MAPE)
* Root Mean Squared Logarithmic Error (RMSLE)

]

.pull-right[

**Classification:**

* Classification Accuracy
* Recall vs. Specificity
* `\(F_1\)` Score
* Log Loss

]

---

## Performance Metrics

.pull-left[

**Regression:**

* **Mean Square Error (MSE)**
* Mean Absolute Error (MAE)
* Mean Absolute Percent Error (MAPE)
* Root Mean Squared Logarithmic Error (RMSLE)

<br><br>

$$ MSE = \frac{1}{n} \sum^n_{i=1}(y_i - \hat y_i)^2 $$

]

.pull-right[

**Classification:**

* **Classification Accuracy**
* Recall vs. Specificity
* `\(F_1\)` Score
* Log Loss

<img src="https://rasbt.github.io/mlxtend/user_guide/evaluate/confusion_matrix_files/confusion_matrix_1.png" style="display: block; margin: auto;" />

]

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---

## Your Turn!

Starting with the raw `AmesHousing::make_ames()` data:

1. split your data into 70% (training) / 30% (testing)
2. normalize the response variable
3. standardize the features (don't one-hot encode, as `lm` will automate the dummy encoding)
4. apply a linear regression model with 10-fold cross validation
5. what is your average CV RMSE?

---

## Solution

.scrollable[

```r
# 1. 70/30 split
data_split <- initial_split(AmesHousing::make_ames(), prop = .7, strata = "Sale_Price")
ames_train <- training(data_split)
ames_test  <- testing(data_split)

# 2. normalize response variable
lambda  <- forecast::BoxCox.lambda(ames_train$Sale_Price)
train_y <- forecast::BoxCox(ames_train$Sale_Price, lambda)
test_y  <- forecast::BoxCox(ames_test$Sale_Price, lambda)

# 3. standardize features
features <- setdiff(names(ames_train), "Sale_Price")

pre_process <- preProcess(
  x = ames_train[, features],
  method = c("YeoJohnson", "center", "scale", "zv")
  )

train_x <- predict(pre_process, ames_train[, features])
test_x  <- predict(pre_process, ames_test[, features])

# 4. 10-fold CV linear regression
lm.cv <- train(
  x = train_x, 
  y = train_y,
  method = "lm", 
  trControl = trainControl(method = "cv", number = 10)
  )

# 5. out-of-sample avg CV RMSE
lm.cv
## Linear Regression 
## 
## 2054 samples
##   80 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1849, 1849, 1849, 1848, 1847, 1847, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE       
##   0.1662439  0.8563245  0.08603729
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
```

]

---

## So, can we do better?

<img src="Images/dunno.png" width="531" style="display: block; margin: auto;" />

---

## Peeking inside the black box

.full-width[.content-box-yellow[.center[What insights do we generally want to extract from our ML models?]]]

--

* Which variables are important? ([`vip`](https://github.com/koalaverse/vip))

--

* How does each feature functionally relate to the outcome of interest? ([`pdp`](https://github.com/bgreenwell/pdp), [`plotmo`](https://CRAN.R-project.org/package=plotmo), [`ICEbox`](https://CRAN.R-project.org/package=ICEbox))

--

* How do the features interact? ([`pdp`](https://github.com/bgreenwell/pdp), [`plotmo`](https://CRAN.R-project.org/package=plotmo), [`ICEbox`](https://CRAN.R-project.org/package=ICEbox))

--

<img src="02-Fundamentals_files/figure-html/peeking-demo-1.svg" style="display: block; margin: auto;" />

---

## Variable importance plots

.pull-left[

* Some ML algorithms provide variable importance measures, but not all R implementations provide variable importance plots (VIPs)

* Enter...[`vip`](https://github.com/koalaverse/vip)

    - Provides a consistent framework for extracting and plotting variable importance scores from many types of ML models

    - `vi()` always returns a [tibble](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)

    - `vip()` uses [`ggplot2`](http://ggplot2.org/)

]

.pull-right[

```r
# Not yet on CRAN (experimental)
devtools::install_github("koalaverse/vip")
```

<img src="Images/vip-logo.svg" width="80%" style="display: block; margin: auto;" />

]

---

## Variable importance plots

```r
# Load required packages
library(ranger)
library(vip)

# Load the Ames housing data
ames <- AmesHousing::make_ames()

# Fit a random forest
set.seed(2024)
rf <- ranger(Sale_Price ~ ., data = ames, importance = "impurity")

# Variable importance plot
*vip(rf, num_features = 25)
```

.full-width[.content-box-blue[.bolder[.center[One function to rule them all!]]]]

---

## Variable importance plots

<img src="02-Fundamentals_files/figure-html/vip-output-1.svg" width="60%" style="display: block; margin: auto;" />

---

## Partial dependence plots

.scrollable[

* [Partial dependence plots (PDPs)](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) help visualize the relationship between a subset of the features (typically 1-3) and the response

* Let `\(x_1\)` be the predictor variable of interest with unique values `\(\left\{x_{11}, x_{12}, \dots, x_{1k}\right\}\)`

* The partial dependence of the response on `\(x_1\)` can be constructed as follows (a by-hand sketch follows on the next slide):

    1. For `\(i \in \left\{1, 2, \dots, k\right\}\)`:

        a. Copy the training data and replace the original values of `\(x_1\)` with the constant `\(x_{1i}\)`

        b. Compute the vector of predicted values from the modified copy of the training data

        c. Compute the average prediction to obtain `\(\bar{f}_1\left(x_{1i}\right)\)`

    2. Plot the pairs `\(\left\{x_{1i}, \bar{f}_1\left(x_{1i}\right)\right\}\)` for `\(i = 1, 2, \dotsc, k\)`

]
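---

## Partial dependence plots: by hand

To make the algorithm concrete, here is a minimal sketch of those steps in base R (an illustration, not the `pdp` package's actual implementation). It assumes the `rf` ranger model and `ames` data from the variable importance example above; `Gr_Liv_Area` is an arbitrary choice of feature:

```r
# grid of values for the feature of interest (quantiles keep it fast)
x_grid <- quantile(ames$Gr_Liv_Area, probs = seq(.05, .95, by = .05))

# step 1: for each grid value, overwrite the feature, predict, and average
pd <- sapply(x_grid, function(x) {
  temp <- ames
  temp$Gr_Liv_Area <- x                            # 1a. replace with constant
  preds <- predict(rf, data = temp)$predictions    # 1b. predict on the copy
  mean(preds)                                      # 1c. average prediction
})

# step 2: plot the pairs
plot(x_grid, pd, type = "l", xlab = "Gr_Liv_Area",
     ylab = "Average predicted Sale_Price")
```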
---

## Partial dependence plots

.pull-left[

* Very few ML packages in R provide support for constructing PDPs

* Enter...[`pdp`](https://journal.r-project.org/archive/2017/RJ-2017-016/index.html)

    - Provides a consistent way of constructing PDPs (and more) from many types of ML models (not just RFs)

    - Allows for multivariate displays (i.e., interactions) and so much more!!

    - Includes options for **parallel processing** and **progress bars** 😎

]

.pull-right[

```r
# Install from CRAN
install.packages("pdp")

# Install from GitHub
devtools::install_github("bgreenwell/pdp")
```

<img src="Images/pdp-logo.png" width="65%" style="display: block; margin: auto;" />

]

---

## A handy flow chart

<img src="Images/flow-chart.png" width="100%" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# Questions?