class: center, middle, inverse, title-slide

# The Naïve Bayes Classifier

### Bradley C. Boehmke and Brandon M. Greenwell

### 2018/05/12

---
class: center, middle, inverse

# Overview

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Bayes%27_Theorem_MMB_01.jpg/1200px-Bayes%27_Theorem_MMB_01.jpg" width="100%" style="display: block; margin: auto;" />

---
## The Idea

* Founded on Bayesian probability theory
* Incorporates the concept of *conditional probability*, the probability of event *A* given that event *B* has occurred [denoted as `\(P(A \vert B)\)`]
* Let us assume we have a classification problem where we are asked to predict which employees are expected to churn (a problem of _attrition_)
* Hence, we are seeking the probability of an employee belonging to attrition class `\(C_k\)` (where `\(C_{yes} = \texttt{attrit}\)` and `\(C_{no} = \texttt{non-attrit}\)`) given some predictor variables ( `\(x_1, x_2, \dots, x_p\)` )
* This can be written as `\(P(C_k \vert x_1, \dots, x_p)\)`
* Bayes' theorem allows us to compute this probability with:

$$ P(C_k \vert X) = \frac{P(C_k) \cdot P(X \vert C_k)}{P(X)} $$

<br>
<center>
<bold>
<font color="red">
🤨 I know, let's examine this equation a little more closely.
</font>
</bold>
</center>

---
## The Idea

$$ P(C_k \vert X) = \frac{\color{red}P\color{red}({\color{red}C_k}\color{red}) \cdot P(X \vert C_k)}{P(X)} $$

* `\(\color{red}P\color{red}(\color{red}C_\color{red}k\color{red})\)` <font color="red">is the ___prior probability___ of the outcome. Essentially, based on the historical data, what is the probability of an employee attriting or not?</font>

---
## The Idea

$$ P(C_k \vert X) = \frac{P(C_k) \cdot \color{red}P\color{red}(\color{red}X \color{red}\vert \color{red}C_k\color{red})}{P(X)} $$

* `\(P(C_k)\)` is the prior probability of the outcome. Essentially, based on the historical data, what is the probability of an employee attriting or not?
* `\(\color{red}P\color{red}(\color{red}X \color{red}\vert \color{red}C_\color{red}k\color{red})\)` <font color="red">is the ___conditional probability___ or ___likelihood___. Essentially, for each class of the response variable (i.e., attrit or not attrit), what is the probability of observing the predictor values?</font>

---
## The Idea

$$ P(C_k \vert X) = \frac{P(C_k) \cdot P(X \vert C_k)}{\color{red}P\color{red}(\color{red}X\color{red})} $$

* `\(P(C_k)\)` is the prior probability of the outcome. Essentially, based on the historical data, what is the probability of an employee attriting or not?
* `\(P(X \vert C_k)\)` is the conditional probability or likelihood. Essentially, for each class of the response variable (i.e., attrit or not attrit), what is the probability of observing the predictor values?
* `\(\color{red}P\color{red}(\color{red}X\color{red})\)` <font color="red">is the probability of the predictor variables. Essentially, based on the historical data, what is the probability of each observed combination of predictor variables? When new data comes in, this becomes our ___evidence___.</font>

---
## The Idea

$$ \color{red}P\color{red}(\color{red}C_k \color{red}\vert \color{red}X\color{red}) = \frac{P(C_k) \cdot P(X \vert C_k)}{P(X)} $$

* `\(P(C_k)\)` is the prior probability of the outcome. Essentially, based on the historical data, what is the probability of an employee attriting or not?
* `\(P(X \vert C_k)\)` is the conditional probability or likelihood. Essentially, for each class of the response variable (i.e., attrit or not attrit), what is the probability of observing the predictor values?
* `\(P(X)\)` is the probability of the predictor variables. Essentially, based on the historical data, what is the probability of each observed combination of predictor variables? When new data comes in, this becomes our evidence.
* `\(\color{red}P\color{red}(\color{red}C_\color{red}k \color{red}\vert \color{red}X\color{red})\)` <font color="red">is called our ___posterior probability___. By combining our observed information, we are updating our _a priori_ information on probabilities to compute a posterior probability that an observation has class `\(\color{red}C_\color{red}k\)`.</font>

---
## The Idea

In plain English...

<br>
<br>

$$\texttt{posterior} = \frac{\texttt{prior} \times \texttt{likelihood}}{\texttt{evidence}} $$

<br>
<br>

<img src="Images/i_see.png" width="200" height="200" style="display: block; margin: auto;" />
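---
## The Idea

To see the pieces combine, here is a minimal sketch with purely hypothetical numbers (a made-up prior, likelihood, and evidence — not values from the attrition data):

```r
prior      <- 0.16  # P(C_k): historical attrition rate (hypothetical)
likelihood <- 0.40  # P(X | C_k): prob. of these predictor values given attrition (hypothetical)
evidence   <- 0.10  # P(X): overall prob. of these predictor values (hypothetical)

# posterior: P(C_k | X)
prior * likelihood / evidence
## [1] 0.64
```

Observing these predictor values updates our belief that the employee will attrit from 16% to 64%.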
---
## But we have a major problem

* As the number of features grows, computing `\(P(C_k \vert X)\)` becomes intractable
* A response variable with _m_ classes and _p_ predictors requires computing `\(m^p\)` probabilities

<img src="Figures/07-Figures/07-exponential_probabilities-1.svg" style="display: block; margin: auto;" />

<center>
<bold>
<font color="red">
And just when you thought you had it 😠
</font>
</bold>
</center>

---
## A Simplified Classifier

The ___naïve Bayes classifier___ makes a simplifying assumption:

* predictor variables are _conditionally independent_ of one another given the response value,
* this allows us to simplify the computation so that the posterior probability is simply proportional to the prior times the product of the probability distributions for each individual variable conditioned on the response category (a toy example follows on the next slide)

`$$P(C_k \vert X) \propto P(C_k) \prod^p_{i=1} P(x_i \vert C_k)$$`

* now we are only required to compute `\(m \times p\)` probabilities

<br>

<img src="Images/easy_peasy.png" width="150" height="150" style="display: block; margin: auto;" />
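---
## A Simplified Classifier

To make the simplification concrete, here is a hand-rolled sketch on a toy data frame (made-up feature names and values, not the attrition data) that multiplies the prior by the per-feature conditional probabilities:

```r
# toy data: m = 2 classes and p = 2 categorical features
toy <- data.frame(
  churn    = c("Yes", "Yes", "No", "No", "No", "No"),
  overtime = c("Yes", "Yes", "No", "Yes", "No", "No"),
  travel   = c("High", "Low", "Low", "Low", "High", "Low")
)

prior      <- prop.table(table(toy$churn))                   # P(C_k)
p_overtime <- prop.table(table(toy$churn, toy$overtime), 1)  # P(overtime | C_k)
p_travel   <- prop.table(table(toy$churn, toy$travel), 1)    # P(travel | C_k)

# unnormalized posterior for a new employee with overtime = "Yes", travel = "High"
post <- prior * p_overtime[, "Yes"] * p_travel[, "High"]
post / sum(post)  # normalize so the class probabilities sum to 1
```

Each conditional probability table is one of the `\(m \times p\)` pieces we need — no joint combinations required. For this toy employee, roughly 80% of the normalized posterior lands on churn = "Yes".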
---
## Advantages and Shortcomings

.pull-left[

__<font color="green">Pros</font>__

* Simple (intuitively & computationally)
* Fast
* Performs well on small data
* Scales well to large data

]

.pull-right[

__<font color="red">Cons</font>__

* Assumes equally important & independent features
* Faulty assumption
* The more this is violated, the less we can rely on the exact posterior probability
* ___But___, the rank ordering of propensities will often still be correct

]

<br><br>

<center>
<bold>
<font color="red">
Naïve Bayes is often a surprisingly accurate algorithm; however, on average it rarely competes with the accuracy of advanced tree-based methods (random forests & gradient boosting machines). It is definitely a 🔧 worth having in your toolkit, though.
</font>
</bold>
</center>

---
class: center, middle, inverse

# Implementation

---
## Packages

Several packages can be used to apply naïve Bayes:

* `e1071`
* `klaR`
* `naivebayes`
* `bnclassify`
* `h2o`
* <mark>`caret`: an aggregator package</mark>

---
## Prerequisites

__Packages used__

```r
library(rsample)  # data splitting
library(dplyr)    # data transformation
library(ggplot2)  # data visualization
library(caret)    # implementing with caret
```

__Data used__

```r
# convert some numeric variables to factors
attrition <- attrition %>%
  mutate(
    JobLevel = factor(JobLevel),
    StockOptionLevel = factor(StockOptionLevel),
    TrainingTimesLastYear = factor(TrainingTimesLastYear)
  )

# Create training (70%) and test (30%) sets for the attrition data.
# Use set.seed for reproducibility
set.seed(123)
split <- initial_split(attrition, prop = .7, strata = "Attrition")
train <- training(split)
test  <- testing(split)
```

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---
## Your Turn!

How well does our assumption of ___conditional independence___ between the features hold up?

---
## Solution: assessing conditional independence

```r
train %>%
  filter(Attrition == "Yes") %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot::corrplot()
```

<img src="Figures/07-Figures/07-solution1-1.svg" style="display: block; margin: auto;" />

---
## Default Settings

.scrollable[

```r
# create response and feature data
features <- setdiff(names(train), "Attrition")
x <- train[, features]
y <- train$Attrition

# set up 10-fold cross validation procedure
train_control <- trainControl(
  method = "cv",
  number = 10
)

# train model
nb.m1 <- train(
  x = x,
  y = y,
* method = "nb",
  trControl = train_control
)

# results
confusionMatrix(nb.m1)
```

```
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  75.3  8.3
##        Yes  8.5  7.8
##                             
##  Accuracy (average) : 0.8311
```

]

<center>
<bold>
<font color="red">
Thoughts? 🤔
</font>
</bold>
</center>

---
## Benchmark comparison

.pull-left[

```r
# initial naive results
confusionMatrix(nb.m1)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  75.3  8.3
##        Yes  8.5  7.8
##                             
##  Accuracy (average) : 0.8311
```

]

.pull-right[

```r
# distribution of Attrition rates across train & test set
table(train$Attrition) %>% prop.table()
## 
##       No      Yes 
## 0.838835 0.161165

table(test$Attrition) %>% prop.table()
## 
##        No       Yes 
## 0.8386364 0.1613636
```

]

<br><br><br><br><br><br>

<center>
<bold>
<font color="red">
The goal is to improve predictive accuracy over and above our 83% benchmark.
</font>
</bold>
</center>

---
class: center, middle, inverse

# Tuning

---
## What Can We Tune?

__Continuous variable density estimator__

.scrollable[

.pull-left[

* assumes a normal distribution
* can normalize with a Box-Cox transformation
* can use non-parametric kernel density estimators
* or a combination of the two

]

.pull-right[

```r
library(tidyr)  # for gather()

train %>%
  select_if(is.numeric) %>%
  gather(metric, value) %>%
  ggplot(aes(value, fill = metric)) +
  geom_density(show.legend = FALSE) +
  facet_wrap(~ metric, scales = "free", ncol = 2)
```

<img src="Figures/07-Figures/07-nb-cv-est-1.svg" style="display: block; margin: auto;" />

]

]

---
## What Can We Tune?

__Laplace smoother__

* naïve Bayes uses the product of feature probabilities conditioned on each class.
* if unseen data includes a feature `\(\leftrightarrow\)` response combination not seen in the training data, then the probability `\(P(x_i \vert C_k) = 0\)` will ripple through the entire multiplication of all features and force the posterior probability to be zero for that class.
* a Laplace smoother adds a small constant to every feature `\(\leftrightarrow\)` response combination so that each feature has a nonzero probability of occurring for each class (see the sketch on the next slide).
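---
## What Can We Tune?

A quick sketch of why this matters, using toy counts (hypothetical numbers, not the attrition data): a single unseen feature/class combination zeroes out the whole product unless we smooth.

```r
count_level <- 0   # times this feature level co-occurred with class C_k in training
count_class <- 40  # training observations in class C_k
n_levels    <- 3   # distinct levels of this feature

# without smoothing: the probability is exactly 0, so the entire product collapses
count_level / count_class
## [1] 0

# with a Laplace smoother (fL = 1): every level keeps a nonzero probability
fL <- 1
(count_level + fL) / (count_class + fL * n_levels)
## [1] 0.02325581
```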
---
## Implementation

We can tune these hyperparameters for our naïve Bayes model with:

* the `usekernel` parameter allows us to use a kernel density estimate for continuous variables instead of a Gaussian density estimate,
* `adjust` allows us to adjust the bandwidth of the kernel density (larger numbers mean a smoother, less flexible density estimate),
* `fL` allows us to incorporate the Laplace smoother.

---
## Implementation

```r
# set up tuning grid
search_grid <- expand.grid(
  usekernel = c(TRUE, FALSE),
  adjust = 0:3,
  fL = 0:2
)

nb.m2 <- train(
  x = x,
  y = y,
  method = "nb",
  trControl = train_control,
* tuneGrid = search_grid
)

# top 3 models
nb.m2$results %>%
  top_n(3, wt = Accuracy) %>%
  arrange(desc(Accuracy))
```

```
##   usekernel adjust fL  Accuracy     Kappa AccuracySD    KappaSD
## 1      TRUE      3  1 0.8592233 0.4199747 0.02867321 0.09849057
## 2      TRUE      3  0 0.8582524 0.3647326 0.02637094 0.10345730
## 3      TRUE      3  2 0.8572816 0.4468411 0.02967833 0.08369331
```

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---
## Your Turn!

* Tune this model some more. Do you gain any improvement?
* Can you think of anything else you could do (e.g., pre-processing)?

---
## Solution

.scrollable[

```r
# set up tuning grid
search_grid <- expand.grid(
  usekernel = c(TRUE, FALSE),
  fL = 0:5,
  adjust = seq(0, 5, by = 1)
)

# train model
nb.m3 <- train(
  x = x,
  y = y,
  method = "nb",
  trControl = train_control,
  tuneGrid = search_grid,
  preProc = c("BoxCox", "center", "scale", "pca")
)

# top 5 models
nb.m3$results %>%
  top_n(5, wt = Accuracy) %>%
  arrange(desc(Accuracy))
```

```
##   usekernel fL adjust  Accuracy     Kappa AccuracySD   KappaSD
## 1      TRUE  1      3 0.8758178 0.4387361 0.02682762 0.1264859
## 2      TRUE  0      2 0.8738763 0.4504429 0.03163167 0.1371136
## 3      TRUE  3      4 0.8689837 0.4494678 0.02252035 0.1174482
## 4      TRUE  2      3 0.8670798 0.4620638 0.03127524 0.1266924
## 5      TRUE  0      3 0.8661184 0.3495805 0.02354238 0.1363180
```

```r
# plot search grid results
plot(nb.m3)
```

<img src="Figures/07-Figures/07-solution2-1.svg" style="display: block; margin: auto;" />

```r
# results for best model
confusionMatrix(nb.m3)
```

```
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction   No  Yes
##        No  81.3  9.8
##        Yes  2.6  6.3
##                             
##  Accuracy (average) : 0.8757
```

]

---
## Predicting

Once we have found an optimal model, we can assess its accuracy on our final holdout test set:

```r
pred <- predict(nb.m3, newdata = test)
confusionMatrix(pred, test$Attrition)
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  349  41
##        Yes  20  30
##                                           
##                Accuracy : 0.8614          
##                  95% CI : (0.8255, 0.8923)
##     No Information Rate : 0.8386          
##     P-Value [Acc > NIR] : 0.10756         
##                                           
##                   Kappa : 0.4183          
##  Mcnemar's Test P-Value : 0.01045         
##                                           
##             Sensitivity : 0.9458          
##             Specificity : 0.4225          
##          Pos Pred Value : 0.8949          
##          Neg Pred Value : 0.6000          
##              Prevalence : 0.8386          
##          Detection Rate : 0.7932          
##    Detection Prevalence : 0.8864          
##       Balanced Accuracy : 0.6842          
##                                           
##        'Positive' Class : No              
## 
```
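---
## Predicting

If the rank ordering of propensities is what we care about (e.g., flagging the employees most at risk of attriting), we can also ask caret for the posterior class probabilities rather than hard labels; a quick sketch:

```r
# posterior class probabilities for the holdout set
prob <- predict(nb.m3, newdata = test, type = "prob")
head(prob)
```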
---
# Time to Wake Up!

---
## Prior algorithms as classifiers

All the algorithms we saw yesterday can be used as classifiers:

.scrollable[

```r
# regularized classification --> type.measure can also be "class" or "auc"
cv.glmnet(
  x, y,
* family = "binomial",
* type.measure = "deviance"
)

# MARS
earth(
  x, y,
* glm = list(family = binomial)
)

# feedforward neural network
model <- keras_model_sequential() %>%
  layer_dense(units = 10, activation = "relu", input_shape = ncol(x)) %>%
  layer_dense(units = 5, activation = "relu") %>%
* layer_dense(units = 1, activation = "sigmoid") %>%
  compile(
    optimizer = "rmsprop",
* loss = "binary_crossentropy",
* metrics = c("accuracy")
  )

model %>% fit(x = x, y = y)
```

]

---
class: center, middle, inverse
background-image: url(http://amsterdammakerfestival.nl/wp-content/uploads/2016/08/the-challenge.png)

---
## Your Turn!

* Spend the next 30 minutes practicing implementing these other classifiers.
* Remember, some algorithms require different feature preprocessing.
* Can you compare performance to the naïve Bayes classifier?

---
class: center, middle, inverse

# Learning More

---
## Additional Naïve Bayes Resources

* [Andrew Moore's tutorials](http://www.cs.cmu.edu/~./awm/tutorials/naive.html)
* [Naive Bayes classifiers by Kevin Murphy](https://datajobsboard.com/wp-content/uploads/2017/01/Naive-Bayes-Kevin-Murphy.pdf)
* [Data Mining and Predictive Analytics, Ch. 14](https://www.amazon.com/Mining-Predictive-Analytics-Daniel-Chantal/dp/8126559136/ref=sr_1_1?ie=UTF8&qid=1524231609&sr=8-1&keywords=data+mining+and+predictive+analytics+2nd+edition+%2C+by+larose)