class: center, middle, inverse, title-slide

# The Naïve Bayes Classifier

### Bradley C. Boehmke and Brandon M. Greenwell

### 2018/05/12

---
class: center, middle, inverse

# Overview

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Bayes%27_Theorem_MMB_01.jpg/1200px-Bayes%27_Theorem_MMB_01.jpg" width="100%" style="display: block; margin: auto;" />

---
## The Idea

* Founded on Bayesian probability theory
* Incorporates the concept of *conditional probability*, the probability of event *A* given that event *B* has occurred [denoted as `\(P(A \vert B)\)`]
* Let us assume we have a classification problem where we are asked to predict which employees are expected to churn (a problem of _attrition_)
* Hence, we are seeking the probability of an employee belonging to attrition class `\(C_k\)` (where `\(C_{yes} = \texttt{attrit}\)` and `\(C_{no} = \texttt{non-attrit}\)`) given some predictor variables ( `\(x_1, x_2, \dots, x_p\)` )
* This can be written as `\(P(C_k \vert x_1, \dots, x_p)\)`
* Bayes' theorem allows us to compute this probability with:

$$ P(C_k \vert X) = \frac{P(C_k) \cdot P(X \vert C_k)}{P(X)} $$

<br>
<center>
<bold>
<font color="red">
🤨 I know, let's examine this equation a little more closely.
</font>
</bold>
</center>

---
## The Idea

$$ P(C_k \vert X) = \frac{\color{red}P\color{red}({\color{red}C_k}\color{red}) \cdot P(X \vert C_k)}{P(X)} $$

* `\(\color{red}P\color{red}(\color{red}C_\color{red}k\color{red})\)` <font color="red">is the ___prior probability___ of the outcome. Essentially, based on the historical data, what is the probability of an employee attriting or not?</font>

---
## The Idea

$$ P(C_k \vert X) = \frac{P(C_k) \cdot \color{red}P\color{red}(\color{red}X \color{red}\vert \color{red}C_k\color{red})}{P(X)} $$

* `\(P(C_k)\)` is the prior probability of the outcome. Essentially, based on the historical data, what is the probability of an employee attriting or not?
* `\(\color{red}P\color{red}(\color{red}X \color{red}\vert \color{red}C_\color{red}k\color{red})\)` <font color="red">is the ___conditional probability___ or ___likelihood___. Essentially, for each class of the response variable (i.e., attrit or not attrit), what is the probability of observing the predictor values?</font>

---
## The Idea

$$ P(C_k \vert X) = \frac{P(C_k) \cdot P(X \vert C_k)}{\color{red}P\color{red}(\color{red}X\color{red})} $$

* `\(P(C_k)\)` is the prior probability of the outcome. Essentially, based on the historical data, what is the probability of an employee attriting or not?
* `\(P(X \vert C_k)\)` is the conditional probability or likelihood. Essentially, for each class of the response variable (i.e., attrit or not attrit), what is the probability of observing the predictor values?
* `\(\color{red}P\color{red}(\color{red}X\color{red})\)` <font color="red">is the probability of the predictor variables. Essentially, based on the historical data, what is the probability of each observed combination of predictor variables? When new data comes in, this becomes our ___evidence___.</font>

---
## The Idea

$$ \color{red}P\color{red}(\color{red}C_k \color{red}\vert \color{red}X\color{red}) = \frac{P(C_k) \cdot P(X \vert C_k)}{P(X)} $$

* `\(P(C_k)\)` is the prior probability of the outcome. Essentially, based on the historical data, what is the probability of an employee attriting or not?
* `\(P(X \vert C_k)\)` is the conditional probability or likelihood. Essentially, for each class of the response variable (i.e., attrit or not attrit), what is the probability of observing the predictor values?
* `\(P(X)\)` is the probability of the predictor variables. Essentially, based on the historical data, what is the probability of each observed combination of predictor variables? When new data comes in, this becomes our evidence.
* `\(\color{red}P\color{red}(\color{red}C_\color{red}k \color{red}\vert \color{red}X\color{red})\)` <font color="red">is called our ___posterior probability___. By combining our observed information, we are updating our _a priori_ information on probabilities to compute a posterior probability that an observation has class `\(\color{red}C_\color{red}k\)`.</font>

---
## The Idea

In plain English...

<br>
<br>

$$\texttt{posterior} = \frac{\texttt{prior} \times \texttt{likelihood}}{\texttt{evidence}} $$

<br>
<br>

<img src="Images/i_see.png" width="200" height="200" style="display: block; margin: auto;" />
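---
## The Idea

To see the pieces combine, here is a minimal sketch with purely hypothetical numbers (a made-up prior, likelihood, and evidence — not values from the attrition data):

```r
prior      <- 0.16  # P(C_k): historical attrition rate (hypothetical)
likelihood <- 0.40  # P(X | C_k): prob. of these predictor values given attrition (hypothetical)
evidence   <- 0.10  # P(X): overall prob. of these predictor values (hypothetical)

# posterior: P(C_k | X)
prior * likelihood / evidence
## [1] 0.64
```

Observing these predictor values updates our belief that the employee will attrit from 16% to 64%.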
---
## But we have a major problem

* As the number of features grows, computing `\(P(C_k \vert X)\)` becomes intractable
* A response variable with _m_ classes and _p_ predictors requires computing `\(m^p\)` probabilities

<img src="Figures/07-Figures/07-exponential_probabilities-1.svg" style="display: block; margin: auto;" />

<center>
<bold>
<font color="red">
And just when you thought you had it 😠
</font>
</bold>
</center>

---
## A Simplified Classifier

The ___naïve Bayes classifier___ makes a simplifying assumption:

* predictor variables are _conditionally independent_ of one another given the response value,
* this allows us to simplify the computation so that the posterior probability is simply proportional to the prior times the product of the probability distributions for each individual variable conditioned on the response category (a toy example follows on the next slide)

`$$P(C_k \vert X) \propto P(C_k) \prod^p_{i=1} P(x_i \vert C_k)$$`

* now we are only required to compute `\(m \times p\)` probabilities

<br>

<img src="Images/easy_peasy.png" width="150" height="150" style="display: block; margin: auto;" />
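---
## A Simplified Classifier

To make the simplification concrete, here is a hand-rolled sketch on a toy data frame (made-up feature names and values, not the attrition data) that multiplies the prior by the per-feature conditional probabilities:

```r
# toy data: m = 2 classes and p = 2 categorical features
toy <- data.frame(
  churn    = c("Yes", "Yes", "No", "No", "No", "No"),
  overtime = c("Yes", "Yes", "No", "Yes", "No", "No"),
  travel   = c("High", "Low", "Low", "Low", "High", "Low")
)

prior      <- prop.table(table(toy$churn))                   # P(C_k)
p_overtime <- prop.table(table(toy$churn, toy$overtime), 1)  # P(overtime | C_k)
p_travel   <- prop.table(table(toy$churn, toy$travel), 1)    # P(travel | C_k)

# unnormalized posterior for a new employee with overtime = "Yes", travel = "High"
post <- prior * p_overtime[, "Yes"] * p_travel[, "High"]
post / sum(post)  # normalize so the class probabilities sum to 1
```

Each conditional probability table is one of the `\(m \times p\)` pieces we need — no joint combinations required. For this toy employee, roughly 80% of the normalized posterior lands on churn = "Yes".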
---
## Advantages and Shortcomings

.pull-left[

__<font color="green">Pros</font>__

* Simple (intuitively & computationally)
* Fast
* Performs well on small data
* Scales well to large data

]

.pull-right[

__<font color="red">Cons</font>__

* Assumes equally important & independent features
* Faulty assumption
* The more this is violated, the less we can rely on the exact posterior probability
* ___But___, the rank ordering of propensities will often still be correct

]

<br><br>

<center>
<bold>
<font color="red">
Naïve Bayes is often a surprisingly accurate algorithm; however, on average it rarely competes with the accuracy of advanced tree-based methods (random forests & gradient boosting machines). It is definitely a 🔧 worth having in your toolkit, though.
</font>
</bold>
</center>

---
class: center, middle, inverse

# Implementation

---
## Packages

Several packages can be used to apply naïve Bayes:

* `e1071`
* `klaR`
* `naivebayes`
* `bnclassify`
* `h2o`
* <mark>`caret`: an aggregator package</mark>

---
## Prerequisites

__Packages used__

```r
library(rsample)  # data splitting
library(dplyr)    # data transformation
library(ggplot2)  # data visualization
library(caret)    # implementing with caret
```

__Data used__

```r
# convert some numeric variables to factors
attrition <- attrition %>%
  mutate(
    JobLevel = factor(JobLevel),
    StockOptionLevel = factor(StockOptionLevel),
    TrainingTimesLastYear = factor(TrainingTimesLastYear)
  )

# Create training (70%) and test (30%) sets for the attrition data.
# Use set.seed for reproducibility
set.seed(123)
split <- initial_split(attrition, prop = .7, strata = "Attrition")
train <- training(split)
test  <- testing(split)
```

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---
## Your Turn!

How well does our assumption of ___conditional independence___ between the features hold up?

---
## Solution: assessing conditional independence

```r
train %>%
  filter(Attrition == "Yes") %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot::corrplot()
```

<img src="Figures/07-Figures/07-solution1-1.svg" style="display: block; margin: auto;" />

---
## Default Settings

.scrollable[

```r
# create response and feature data
features <- setdiff(names(train), "Attrition")
x <- train[, features]
y <- train$Attrition

# set up 10-fold cross validation procedure
train_control <- trainControl(
  method = "cv",
  number = 10
)

# train model
nb.m1 <- train(
  x = x,
  y = y,
* method = "nb",
  trControl = train_control
)

# results
confusionMatrix(nb.m1)
```

```
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  75.3  8.3
##        Yes  8.5  7.8
##                             
##  Accuracy (average) : 0.8311
```

]

<center>
<bold>
<font color="red">
Thoughts? 🤔
</font>
</bold>
</center>

---
## Benchmark comparison

.pull-left[

```r
# initial naive results
confusionMatrix(nb.m1)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  75.3  8.3
##        Yes  8.5  7.8
##                             
##  Accuracy (average) : 0.8311
```

]

.pull-right[

```r
# distribution of Attrition rates across train & test set
table(train$Attrition) %>% prop.table()
## 
##       No      Yes 
## 0.838835 0.161165

table(test$Attrition) %>% prop.table()
## 
##        No       Yes 
## 0.8386364 0.1613636
```

]

<br><br><br><br><br><br>

<center>
<bold>
<font color="red">
The goal is to improve predictive accuracy over and above our 83% benchmark.
</font>
</bold>
</center>

---
class: center, middle, inverse

# Tuning

---
## What Can We Tune?

__Continuous variable density estimator__

.scrollable[

.pull-left[

* assumes a normal distribution
* can normalize with a Box-Cox transformation
* can use non-parametric kernel density estimators
* or a combination of the two

]

.pull-right[

```r
library(tidyr)  # for gather()

train %>%
  select_if(is.numeric) %>%
  gather(metric, value) %>%
  ggplot(aes(value, fill = metric)) +
  geom_density(show.legend = FALSE) +
  facet_wrap(~ metric, scales = "free", ncol = 2)
```

<img src="Figures/07-Figures/07-nb-cv-est-1.svg" style="display: block; margin: auto;" />

]

]

---
## What Can We Tune?

__Laplace smoother__

* naïve Bayes uses the product of feature probabilities conditioned on each class.
* if unseen data includes a feature `\(\leftrightarrow\)` response combination not seen in the training data, then the probability `\(P(x_i \vert C_k) = 0\)` will ripple through the entire multiplication of all features and force the posterior probability to be zero for that class.
* a Laplace smoother adds a small constant to every feature `\(\leftrightarrow\)` response combination so that each feature has a nonzero probability of occurring for each class (see the sketch on the next slide).
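---
## What Can We Tune?

A quick sketch of why this matters, using toy counts (hypothetical numbers, not the attrition data): a single unseen feature/class combination zeroes out the whole product unless we smooth.

```r
count_level <- 0   # times this feature level co-occurred with class C_k in training
count_class <- 40  # training observations in class C_k
n_levels    <- 3   # distinct levels of this feature

# without smoothing: the probability is exactly 0, so the entire product collapses
count_level / count_class
## [1] 0

# with a Laplace smoother (fL = 1): every level keeps a nonzero probability
fL <- 1
(count_level + fL) / (count_class + fL * n_levels)
## [1] 0.02325581
```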
---
## Implementation

We can tune these hyperparameters for our naïve Bayes model with:

* the `usekernel` parameter allows us to use a kernel density estimate for continuous variables instead of a Gaussian density estimate,
* `adjust` allows us to adjust the bandwidth of the kernel density (larger numbers mean a smoother, less flexible density estimate),
* `fL` allows us to incorporate the Laplace smoother.

---
## Implementation

```r
# set up tuning grid
search_grid <- expand.grid(
  usekernel = c(TRUE, FALSE),
  adjust = 0:3,
  fL = 0:2
)

nb.m2 <- train(
  x = x,
  y = y,
  method = "nb",
  trControl = train_control,
* tuneGrid = search_grid
)

# top 3 models
nb.m2$results %>%
  top_n(3, wt = Accuracy) %>%
  arrange(desc(Accuracy))
```

```
##   usekernel adjust fL  Accuracy     Kappa AccuracySD    KappaSD
## 1      TRUE      3  1 0.8592233 0.4199747 0.02867321 0.09849057
## 2      TRUE      3  0 0.8582524 0.3647326 0.02637094 0.10345730
## 3      TRUE      3  2 0.8572816 0.4468411 0.02967833 0.08369331
```

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---
## Your Turn!

* Tune this model some more. Do you gain any improvement?
* Can you think of anything else you could do (e.g., pre-processing)?

---
## Solution

.scrollable[

```r
# set up tuning grid
search_grid <- expand.grid(
  usekernel = c(TRUE, FALSE),
  fL = 0:5,
  adjust = seq(0, 5, by = 1)
)

# train model
nb.m3 <- train(
  x = x,
  y = y,
  method = "nb",
  trControl = train_control,
  tuneGrid = search_grid,
  preProc = c("BoxCox", "center", "scale", "pca")
)

# top 5 models
nb.m3$results %>%
  top_n(5, wt = Accuracy) %>%
  arrange(desc(Accuracy))
```

```
##   usekernel fL adjust  Accuracy     Kappa AccuracySD   KappaSD
## 1      TRUE  1      3 0.8758178 0.4387361 0.02682762 0.1264859
## 2      TRUE  0      2 0.8738763 0.4504429 0.03163167 0.1371136
## 3      TRUE  3      4 0.8689837 0.4494678 0.02252035 0.1174482
## 4      TRUE  2      3 0.8670798 0.4620638 0.03127524 0.1266924
## 5      TRUE  0      3 0.8661184 0.3495805 0.02354238 0.1363180
```

```r
# plot search grid results
plot(nb.m3)
```

<img src="Figures/07-Figures/07-solution2-1.svg" style="display: block; margin: auto;" />

```r
# results for best model
confusionMatrix(nb.m3)
```

```
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  81.3  9.8
##        Yes  2.6  6.3
##                             
##  Accuracy (average) : 0.8757
```

]

---
## Predicting

Once we have found an optimal model, we can assess its accuracy on our final holdout test set:

```r
pred <- predict(nb.m3, newdata = test)
confusionMatrix(pred, test$Attrition)
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  349  41
##        Yes  20  30
##                                           
##                Accuracy : 0.8614          
##                  95% CI : (0.8255, 0.8923)
##     No Information Rate : 0.8386          
##     P-Value [Acc > NIR] : 0.10756         
##                                           
##                   Kappa : 0.4183          
##  Mcnemar's Test P-Value : 0.01045         
##                                           
##             Sensitivity : 0.9458          
##             Specificity : 0.4225          
##          Pos Pred Value : 0.8949          
##          Neg Pred Value : 0.6000          
##              Prevalence : 0.8386          
##          Detection Rate : 0.7932          
##    Detection Prevalence : 0.8864          
##       Balanced Accuracy : 0.6842          
##                                           
##        'Positive' Class : No              
## 
```
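---
## Predicting

If the rank ordering of propensities is what we care about (e.g., flagging the employees most at risk of attriting), we can also ask caret for the posterior class probabilities rather than hard labels; a quick sketch:

```r
# posterior class probabilities for the holdout set
prob <- predict(nb.m3, newdata = test, type = "prob")
head(prob)
```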
---
# Time to Wake Up!

---
## Prior algorithms as classifiers

All the algorithms we saw yesterday can be used as classifiers:

.scrollable[

```r
# regularized classification --> type.measure can also be "class" or "auc"
cv.glmnet(
  x, y,
* family = "binomial",
* type.measure = "deviance"
)

# MARS
earth(
  x, y,
* glm = list(family = binomial)
)

# feedforward neural network
model <- keras_model_sequential() %>%
  layer_dense(units = 10, activation = "relu", input_shape = ncol(x)) %>%
  layer_dense(units = 5, activation = "relu") %>%
* layer_dense(units = 1, activation = "sigmoid") %>%
  compile(
    optimizer = "rmsprop",
* loss = "binary_crossentropy",
* metrics = c("accuracy")
  )

model %>% fit(x = x, y = y)
```

]

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---
## Your Turn!

* Spend the next 30 minutes practicing implementing these other classifiers.
* Remember, some algorithms require different feature preprocessing.
* Can you compare performance to the naïve Bayes classifier?

---
class: center, middle, inverse

# Learning More

---
## Additional Naïve Bayes Resources

* [Andrew Moore's tutorials](http://www.cs.cmu.edu/~./awm/tutorials/naive.html)
* [Naive Bayes classifiers by Kevin Murphy](https://datajobsboard.com/wp-content/uploads/2017/01/Naive-Bayes-Kevin-Murphy.pdf)
* [Data Mining and Predictive Analytics, Ch. 14](https://www.amazon.com/Mining-Predictive-Analytics-Daniel-Chantal/dp/8126559136/ref=sr_1_1?ie=UTF8&qid=1524231609&sr=8-1&keywords=data+mining+and+predictive+analytics+2nd+edition+%2C+by+larose)