class: center, middle, inverse, title-slide

# Support Vector Machines
### Bradley C. Boehmke and Brandon M. Greenwell
### May 15, 2018

---
class: inverse, center, middle
background-image: url(Images/i_svm.jpg)
background-position: center
background-size: contain

---
class: inverse, center, middle

# Introduction

---

## Support vector machines (SVMs)

* .large[A direct approach to .red[binary] classification]:

    - Try to find a **hyperplane** in feature space that "best" separates the classes

* .large[In practice, it is difficult (if not impossible) to find a hyperplane to separate the classes in the original feature space]

--

.pull-left[

* .large[SVMs extend this idea in two ways:]

    - Relax/soften what we mean by "separates"

    - Use the [*kernel trick*](https://en.wikipedia.org/wiki/Kernel_method) to enlarge the feature space to the point that separation is (more) possible

]

.pull-right[

<img src="Figures/08-Figures/08-smiley-1.svg" width="100%" style="display: block; margin: auto;" />

]

---

## Hyperplanes

.medium[

* A hyperplane in *p*-dimensional feature space is defined by the (linear) equation

`$$f\left(X\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p = 0$$`

    - For `\(p = 2\)` the hyperplane is just a .red[line] in 2-D space

    - For `\(p = 3\)` the hyperplane is a .red[plane] in 3-D space

* `\(f\left(X\right) > 0\)` for points on one side of the hyperplane, and `\(f\left(X\right) < 0\)` for points on the other side

* It is (mathematically) convenient to code the binary outcome using {-1, 1} so that `\(Y_i \times f\left(X_i\right) > 0\)` for points on the correct side of the decision boundary!

]

---
class: inverse, center, middle
background-image: url(Images/separable.jpg)
background-position: center
background-size: contain

# Separable data

---

## Separating hyperplanes

.larger[

* For two separable classes, there are an infinite number of separating hyperplanes! 😱

* What we want is a separating hyperplane with .red[good generalization performance]!
* The .red[*hard margin classifier*] (HMC) is an "optimal" separating hyperplane and the simplest type of SVM

]

---
class: center, middle

## The .red[*optimal separating hyperplane*] separates the two classes .red[and] maximizes the distance to the closest point from either class

.right[.large[-Vladimir Vapnik]]

---

<img src="Figures/08-Figures/08-separating-hyperplanes-01-1.svg" width="80%" style="display: block; margin: auto;" />

---

<img src="Figures/08-Figures/08-separating-hyperplanes-02-1.svg" width="80%" style="display: block; margin: auto;" />

---
class: inverse, center, middle
background-image: url(Images/explain.png)
background-position: center
background-size: contain

---

<img src="Figures/08-Figures/08-plot-1.svg" width="80%" style="display: block; margin: auto;" />

---

<img src="Figures/08-Figures/08-chull-02-1.svg" width="80%" style="display: block; margin: auto;" />

---

<img src="Figures/08-Figures/08-chull-03-1.svg" width="80%" style="display: block; margin: auto;" />

---

<img src="Figures/08-Figures/08-chull-04-1.svg" width="80%" style="display: block; margin: auto;" />

---

<img src="Figures/08-Figures/08-chull-05-1.svg" width="80%" style="display: block; margin: auto;" />

---

<img src="Figures/08-Figures/08-chull-06-1.svg" width="80%" style="display: block; margin: auto;" />

---

<img src="Figures/08-Figures/08-chull-07-1.svg" width="80%" style="display: block; margin: auto;" />

---

## The hard margin classifier

<br/>

.large[

`$$\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} \quad M$$`

`$$\text{subject to:} \quad \begin{cases} \sum_{j = 1}^p \beta_j^2 = 1,\\ y_i\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\right) \ge M,\\ \quad \forall i = 1, 2, \dots, n \end{cases}$$`

]

--

<br/>

.red[.center[.large[This is a quadratic programming problem with linear inequality constraints!!]]]

---
class: inverse, center, middle

# Sometimes perfect separation is achievable, but not desirable!

---

<img src="Figures/08-Figures/08-noisy-01-1.svg" width="80%" style="display: block; margin: auto;" />

---

<img src="Figures/08-Figures/08-noisy-02-1.svg" width="80%" style="display: block; margin: auto;" />

---

## The soft margin classifier

<br/>

.large[

`$$\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} \quad M$$`

`$$\text{subject to:} \quad \begin{cases} \sum_{j = 1}^p \beta_j^2 = 1,\\ y_i\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\right) \ge M\left(1 - \xi_i\right),\\ \quad \forall i = 1, 2, \dots, n\\ \xi_i \ge 0, \\ \sum_{i = 1}^n \xi_i \le C\end{cases}$$`

]

--

<br/>

.red[.center[.large[This is a quadratic programming problem with linear inequality constraints!!]]]

---
class: center, middle

<img src="Images/svm-classifiers.png" width="100%" style="display: block; margin: auto;" />

---
class: center, middle

## `\(C\)` is a regularization parameter

<img src="Figures/08-Figures/08-noisy-path-01-1.svg" width="100%" style="display: block; margin: auto;" />

---
class: center, middle

<img src="GIFs/svmpath.gif" alt="drawing" style="width: 600px;"/>

---
class: center, middle

### As it turns out, you can fit the entire regularization path with essentially the same cost as a single SVM fit!
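Below is a minimal, hedged sketch (not from the original deck) of what this looks like with the `svmpath` package; the simulated data and the `fit$lambda` component referenced in the comments are assumptions for illustration rather than the deck's own example.

```r
# Minimal sketch: trace the entire SVM regularization path in one fit
# (illustrative data; svmpath expects a two-class outcome coded as -1/+1)
library(svmpath)  # for svmpath()

set.seed(101)
x <- matrix(rnorm(100 * 2), ncol = 2)                 # two numeric features
y <- ifelse(x[, 1] + x[, 2] + rnorm(100) > 0, 1, -1)  # -1/+1 class labels

fit <- svmpath(x, y)  # one fit returns the solution at every value of the cost
length(fit$lambda)    # number of breakpoints visited along the path (assumed component)
```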
<img src="Figures/08-Figures/hell-yeah-1.svg" width="40%" style="display: block; margin: auto;" />

---
class: inverse, center, middle
background-image: url(Images/non-separable.jpg)
background-position: center
background-size: contain

# Non-separable data

---

## The support vector machine

.large[

* The objective function for the soft margin classifier can be re-expressed as

`$$L_D = \sum_{i = 1}^n \alpha_i - \frac{1}{2}\sum_{i = 1}^n\sum_{i' = 1}^n \alpha_i\alpha_{i'}y_iy_{i'}x_i^{\top}x_{i'}$$`

]

--

.large[

* In the SVM, we replace the .red[*dot product*] with a .red[*kernel function*]

`$$L_D = \sum_{i = 1}^n \alpha_i - \frac{1}{2}\sum_{i = 1}^n\sum_{i' = 1}^n \alpha_i\alpha_{i'}y_iy_{i'}K\left(x_i, x_{i'}\right)$$`

]

---
class: center, middle

.larger[Replacing the dot product with a kernel function is similar in spirit to .red[enlarging the feature space using basis functions] (e.g., like in MARS!)]

---

## Popular kernel functions

.larger[

* .magenta[*d*-th degree polynomial:] `$$K\left(x, x'\right) = \left(1 + \langle x, x' \rangle\right) ^ d$$`

* .magenta[Radial basis function:] `$$K\left(x, x'\right) = \exp\left(-\gamma \lVert x - x'\rVert ^ 2\right)$$`

* .magenta[Hyperbolic tangent:] `$$K\left(x, x'\right) = \tanh\left(k_1 \langle x, x' \rangle + k_2\right)$$`

]

---
class: center, middle

<img src="Figures/08-Figures/08-circle-01-1.svg" width="80%" style="display: block; margin: auto;" />

---
class: center, middle

.larger[

Enlarge the feature space by adding a third variable

`$$X_3 = X_1^2 + X_2^2$$`

]
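---
class: middle

A quick, hedged sketch (not from the original deck; the simulated data below are made up for illustration): points that are not linearly separable in `\((X_1, X_2)\)` become separable by a single plane once we add `\(X_3 = X_1^2 + X_2^2\)`.

```r
# Simulate two classes defined by a circle (illustrative data)
set.seed(1105)
x1 <- runif(500, min = -1, max = 1)
x2 <- runif(500, min = -1, max = 1)
y  <- factor(ifelse(x1 ^ 2 + x2 ^ 2 < 0.5, "inside", "outside"))

# Enlarge the feature space with a third variable
x3 <- x1 ^ 2 + x2 ^ 2

# In the enlarged space the classes are split perfectly by the plane x3 = 0.5
table(y, linearly_separated = x3 < 0.5)
```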
---
class: center, middle

## The kernel trick

<iframe width="560" height="315" src="https://www.youtube.com/embed/3liCbRZPrZA" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

---

<img src="Figures/08-Figures/08-circle-boundaries-1.svg" width="80%" style="display: block; margin: auto;" />

---

<img src="Figures/08-Figures/08-circle-mars-1.svg" width="80%" style="display: block; margin: auto;" />

---

## Support vector machines

.large[

.pull-left[

#### .green[Advantages:]

* Maximization of generalizability

* Global optimum (.green[convex optimization problem])

* Robustness to outliers

* Incredibly flexible

]

.pull-right[

#### .red[Disadvantages:]

* Extension to more than two classes

* Doesn't scale well to large `\(N\)` 😞

* Predicting .red[class probabilities] is not automatic

]

]

---

## More than two classes

.large[

* The SVM, as introduced, is .red[applicable to only two classes!]

* What do we do when we have more than two classes?

    - .orange.bold[*One-versus-all* (OVA):] Fit an SVM for each class (one class versus the rest); classify to the class for which the margin is the largest

    - .orange.bold[*One-versus-one* (OVO):] Fit all `\(\binom{\# \ classes}{2}\)` pairwise SVMs; classify to the class that wins the most pairwise competitions

]

---

## Support vector regression

<img src="Figures/08-Figures/08-svr-1.svg" width="80%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Fitting SVMs in R

---
class: center, middle

<img src="Images/apm-summary.png" width="100%" style="display: block; margin: auto;" />

---

## Lots of R packages available!!!

* [`e1071`](https://cran.r-project.org/package=e1071) - Provides an interface to the award-winning `libsvm` C++ library, a very efficient implementation of SVMs

* [`kernlab`](https://cran.r-project.org/package=kernlab) - More flexible implementation of SVMs with basic kernel functionality (**kern**el **lab**oratory), but still makes use of the `libsvm` optimizers

* [`svmpath`](https://cran.r-project.org/package=svmpath) - Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit!
* [`LiblineaR`](https://cran.r-project.org/package=LiblineaR) - A wrapper around the `LIBLINEAR` C/C++ library for machine learning

---

## Two spirals benchmark problem

```r
# Load required packages
library(kernlab)   # for fitting SVMs
*library(mlbench)  # for ML benchmark data sets

# Simulate train and test sets
set.seed(0841)
trn <- as.data.frame(
  mlbench.spirals(300, cycles = 2, sd = 0.09)
)
tst <- as.data.frame(
  mlbench.spirals(10000, cycles = 2, sd = 0.09)
)
names(trn) <- names(tst) <- c("x1", "x2", "classes")
```

---

<img src="Figures/08-Figures/08-example-spirals-02-1.svg" width="80%" style="display: block; margin: auto;" />

---

## Two spirals benchmark problem

```r
# Fit an SVM using a radial basis function kernel
spirals_rbf <- ksvm(
  classes ~ x1 + x2, 
  data = trn, 
* kernel = "rbfdot",
* C = 500,  # I just picked a value
* prob.model = TRUE
)
```

---

<img src="Figures/08-Figures/08-example-spirals-04-1.svg" width="80%" style="display: block; margin: auto;" />

---

## Two spirals benchmark problem

```r
# Test set confusion matrix
(tab <- table(
  pred = predict(spirals_rbf, newdata = tst),  # predicted outcome
  obs = tst$classes                            # observed outcome
))
```

```
##     obs
## pred    1    2
##    1 4580  424
##    2  420 4576
```

```r
# Test set error
1 - sum(diag(tab)) / nrow(tst)
```

```
## [1] 0.0844
```

---
class: center, middle, inverse
background-image: url(Images/tuning.jpg)

# Model tuning

???

Image credit: [imgflip](http://www.learntoplaymusic.com/blog/tune-guitar/)

---

## Using caret::getModelInfo()

```r
# Linear (i.e., ordinary inner product)
caret::getModelInfo("svmLinear")$svmLinear$parameters
```

```
##   parameter   class label
## 1         C numeric  Cost
```

```r
# Polynomial kernel
caret::getModelInfo("svmPoly")$svmPoly$parameters
```

```
##   parameter   class             label
## 1    degree numeric Polynomial Degree
## 2     scale numeric             Scale
## 3         C numeric              Cost
```

```r
# Radial basis kernel
caret::getModelInfo("svmRadial")$svmRadial$parameters
```

```
##   parameter   class label
## 1     sigma numeric Sigma
## 2         C numeric  Cost
```

---
class: center, middle

.larger[

Not so .red[`purrr`]fect, now be gone!

```r
detach(package:purrr)
```

]

---

## Job attrition example

```r
# Load required packages
library(caret)      # for classification and regression training
library(dplyr)      # for data wrangling
library(ggplot2)    # for more awesome plotting
library(gridExtra)  # for grid.arrange() (used later to arrange PDPs)
library(rsample)    # for attrition data and data splitting
library(pdp)        # for PDPs

*# Same setup as Naive Bayes module!

# Load the attrition data
attrition <- attrition %>%
  mutate(  # convert some numeric features to factors
    JobLevel = factor(JobLevel),
    StockOptionLevel = factor(StockOptionLevel),
    TrainingTimesLastYear = factor(TrainingTimesLastYear)
  )

# Train and test splits
set.seed(123)  # for reproducibility
split <- initial_split(attrition, prop = 0.7, strata = "Attrition")
trn <- training(split)
tst <- testing(split)
```

---
class: center, middle

## Choosing an evaluation metric is one of the most important tasks in any ML project!
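A hedged aside (not from the original deck): `caret::confusionMatrix()` is one quick way to see several candidate metrics (accuracy, kappa, sensitivity, specificity, etc.) at once; the predicted/observed vectors below are made up for illustration.

```r
# Hypothetical predicted and observed class labels (illustration only)
pred <- factor(c("Yes", "No", "No", "Yes", "No", "No"), levels = c("Yes", "No"))
obs  <- factor(c("Yes", "No", "Yes", "Yes", "No", "No"), levels = c("Yes", "No"))
caret::confusionMatrix(data = pred, reference = obs, positive = "Yes")
```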
<img src="Figures/08-Figures/08-what-1.svg" width="50%" style="display: block; margin: auto;" />

---

## Job attrition example

```r
# Control params for SVM
ctrl <- trainControl(
  method = "cv", 
  number = 5, 
* classProbs = TRUE,
* summaryFunction = twoClassSummary
)

# Tune an SVM
set.seed(1854)  # for reproducibility
attr_svm <- train(
  Attrition ~ ., 
  data = trn, 
* method = "svmRadial",
* preProcess = c("center", "scale"),
* metric = "ROC",
  trControl = ctrl, 
  tuneGrid = data.frame(
    sigma = 0.008071434, 
    C = seq(from = 0.1, to = 5, length = 30)
  )
)
```

---

## Job attrition example

```r
# Plot tuning results
ggplot(attr_svm) + theme_light()
```

<img src="Figures/08-Figures/08-attrition-03-1.svg" width="80%" style="display: block; margin: auto;" />

---

## Job attrition example

.pull-left[

* .large[`caret`'s `varImp()` function provides a .red[*filter-based variable importance*] measure:]

* .large[`vip` has something similar, but it is still experimental!]

```r
# Filter-based variable 
# importance scores
plot(varImp(attr_svm))
```

]

.pull-right[

<img src="Figures/08-Figures/08-attrition-05-1.svg" width="100%" style="display: block; margin: auto;" />

]

---

## Job attrition example

.pull-left[

```r
# Partial dependence plots
features <- c(
  "MonthlyIncome", "TotalWorkingYears", 
  "OverTime", "YearsAtCompany"
)
pdfs <- lapply(features, function(x) {
  autoplot(partial(attr_svm, pred.var = x, prob = TRUE)) +
    theme_light()
})
grid.arrange(
* grobs = pdfs, 
  ncol = 2
)
```

]

.pull-right[

<img src="Figures/08-Figures/08-attrition-07-1.svg" width="100%" style="display: block; margin: auto;" />

]

---

## Job attrition example

.pull-left[

```r
# Load required packages
*library(pROC)

# Plot train and test ROC curves
roc_trn <- roc(  # train AUC: 0.9717
  predictor = predict(attr_svm, newdata = trn, type = "prob")$Yes, 
  response = trn$Attrition, 
  levels = rev(levels(trn$Attrition))
)
roc_tst <- roc(  # test AUC: 0.8567
  predictor = predict(attr_svm, newdata = tst, type = "prob")$Yes, 
  response = tst$Attrition, 
  levels = rev(levels(tst$Attrition))
)
plot(roc_trn)
lines(roc_tst, col = "dodgerblue2")
legend("bottomright", legend = c("Train", "Test"), bty = "n", cex = 2.5,
       col = c("black", "dodgerblue2"), inset = 0.01, lwd = 2)
```

]

.pull-right[

<img src="Figures/08-Figures/08-attrition-09-1.svg" width="100%" style="display: block; margin: auto;" />

]

---
class: center, middle, inverse
background-image: url(Images/your-turn.jpg)
background-position: center
background-size: contain

???

Image credit: [imgflip](https://imgflip.com/)

---
class: middle

.large[

Retune the previous model using a polynomial kernel. What is the test ROC for your final model? Which features seem to be the most important?

**Hint:** Use .red[`method = "svmPoly"`] in the call to .red[`caret::train()`] and be sure to update the .red[`tuneGrid`] argument accordingly.
]

---
class: middle

```r
# Retune model using polynomial kernel
*set.seed(1720)  # for reproducibility
attr_svm_poly <- train(
  Attrition ~ ., 
  data = trn, 
* method = "svmPoly",
* preProcess = c("center", "scale"),
* metric = "ROC",
  trControl = ctrl, 
  tuneGrid = data.frame(
    degree = 1:3, 
    scale = 1, 
    C = seq(from = 0.1, to = 5, length = 30)
  )
)
```

---
class: middle

```r
# Updated ROC curve
roc_tst_poly <- roc(  # test AUC for the polynomial fit
  predictor = predict(attr_svm_poly, newdata = tst, type = "prob")$Yes, 
  response = tst$Attrition, 
  levels = rev(levels(tst$Attrition))
)
plot(roc_tst_poly, col = set1[1L])
lines(roc_tst, col = "dodgerblue2")
legend(
  x = "bottomright", 
  legend = c("Test (Poly)", "Test (RBF)"), 
  bty = "n", 
  cex = 1.5, 
  col = c(set1[1L], "dodgerblue2"),  # match the plotted line colors
  inset = 0.01, 
  lwd = 2
)
```

---
class: middle, center

<img src="Figures/08-Figures/08-solution-03-1.svg" width="80%" style="display: block; margin: auto;" />

---

## Additional resources

<img src="Images/svm-books.png" width="100%" style="display: block; margin: auto;" />

<br/>

* .larger[[Fitting the entire regularization path](http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf)]

---

## Questions?

.center[

Insert clever Chuck Norris joke here!

]