In the era of “big data”, it is becoming more of a challenge not only to build state-of-the-art predictive models, but also to gain an understanding of what’s really going on in the data. For example, it is often of interest to know which, if any, of the predictors in a fitted model are relatively influential on the predicted outcome. Some modern algorithms—like random forests and gradient boosted decision trees—have a natural way of quantifying the importance or relative influence of each feature. Other algorithms—like naive Bayes classifiers and support vector machines—are not capable of doing so, and model-agnostic approaches are generally used to measure each predictor’s importance. Enter vip, an R package for constructing variable importance (VI) scores/plots for many types of supervised learning algorithms using model-based and novel model-agnostic approaches.

Example usage

For illustration, we use one of the regression problems described in Friedman (1991) and Breiman (1996). These data are available in the mlbench package. The inputs consist of 10 independent variables uniformly distributed on the interval \(\left[0, 1\right]\); however, only 5 out of these 10 are actually used in the true model. Outputs are created according to the formula described in ?mlbench::mlbench.friedman1. The code chunk below simulates 500 observations from this model using the default standard deviation.
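A minimal sketch of the simulation is shown below; the training data are stored in a data frame called trn, the name referenced by later code chunks.

# Load required packages
library(mlbench)  # for the mlbench.friedman1() simulation function

# Simulate 500 observations from the Friedman 1 benchmark problem
set.seed(101)  # for reproducibility
trn <- as.data.frame(mlbench.friedman1(500, sd = 1))  # default standard deviation
head(trn)  # inspect the simulated training data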

Model-based VI scores

Some machine learning algorithms have their own way of quantifying variable importance. We describe some of these in the subsections that follow. The issue with model-specific VI scores is that they are not necessarily comparable across different types of models. For example, directly comparing the impurity-based VI scores from tree-based models to the \(t\)-statistic from linear models would not make sense.

Trees and tree ensembles

Decision trees probably offer the most natural model-based approach to quantifying the importance of each feature. In a binary decision tree, at each node \(t\), a single predictor is used to partition the data into two homogeneous groups. The chosen predictor is the one that maximizes some measure of improvement \(\widehat{i}_t\). The relative importance of predictor \(x\) is the sum of the squared improvements over all internal nodes of the tree for which \(x\) was chosen as the partitioning variable; see Breiman, Friedman, Olshen, and Stone (1984) for details. This idea also extends to ensembles of decision trees, such as RFs and GBMs, where the improvement score for each predictor is averaged across all the trees in the ensemble. Fortunately, due to the stabilizing effect of averaging, the improvement-based VI metric is often more reliable in large ensembles (see Hastie, Tibshirani, and Friedman 2009, pg. 368). RFs offer an additional method for computing VI scores: use the leftover out-of-bag (OOB) data to construct validation-set errors for each tree, then randomly shuffle each predictor in the OOB data and compute the error again. The idea is that if variable \(x\) is important, then the validation error will go up when \(x\) is perturbed in the OOB data. The difference between the two errors is recorded for each tree and then averaged across all trees in the forest.
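In symbols, the single-tree and ensemble versions of this improvement-based importance can be sketched as follows, where \(v\left(t\right)\) denotes the predictor chosen to split internal node \(t\) and \(M\) is the number of trees in the ensemble:

\[
\mathrm{Imp}\left(x\right) = \sum_{t \,:\, v\left(t\right) = x} \widehat{i}_t^{\,2}, \qquad
\mathrm{Imp}_{\mathrm{ens}}\left(x\right) = \frac{1}{M} \sum_{m = 1}^{M} \mathrm{Imp}_m\left(x\right).
\]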

To illustrate, we fit a CART-like regression tree, RF, and GBM to the simulated training data. (Note: there are a number of different packages available for fitting these types of models; we just picked popular and efficient implementations for illustration.)
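One possible setup is sketched below; the object names tree, rfo, and bst (reused in later chunks) and the specific tuning parameters are illustrative choices.

# Load required packages (one possible choice of implementations)
library(rpart)         # for fitting CART-like decision trees
library(randomForest)  # for fitting random forests
library(gbm)           # for fitting generalized boosted regression models

# Fit a single regression tree, an RF, and a GBM to the simulated data
set.seed(101)  # for reproducibility
tree <- rpart(y ~ ., data = trn)
rfo <- randomForest(y ~ ., data = trn, importance = TRUE)
bst <- gbm(y ~ ., data = trn, distribution = "gaussian", n.trees = 100,
           interaction.depth = 2, shrinkage = 0.1)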

Each of the above packages includes the ability to compute VI scores for all the features in the model; however, the implementation is rather package-specific, as shown in the code chunk below.
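For instance, with the fitted objects from the sketch above, the package-specific VI scores could be extracted along the following lines:

# Package-specific VI scores (each package uses its own function and format)
tree$variable.importance      # rpart: named numeric vector
importance(rfo)               # randomForest: matrix of importance measures
summary(bst, plotit = FALSE)  # gbm: data frame of relative influence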

As we would expect, all three methods rank the variables x.1–x.5 as more important than the others. While this is good news, it is unfortunate that we have to remember the different functions and ways of extracting and plotting VI scores from various model fitting functions. This is where vip can help…one function to rule them all! Once vip is loaded, we can use vi() to extract a tibble of VI scores.
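For example, with the objects from the earlier sketch:

# Load required packages
library(vip)

# Extract VI scores using a single consistent interface
vi(tree)  # CART-like decision tree
vi(rfo)   # RF
vi(bst)   # GBM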

Notice how the vi() function always returns a tibble with two columns: Variable and Importance [1]. Also, by default, vi() always orders the VI scores from highest to lowest; this, among other options, can be controlled by the user (see ?vip::vi for details). Plotting VI scores with vip is just as straightforward.
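For example, the following produces a bar plot for the single regression tree (assumed to be stored in tree, as in the earlier sketch); the RF and GBM plots below are constructed analogously.

(p1 <- vip(tree))   # CART-like decision tree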

(p2 <- vip(rfo, width = 0.5, fill = "green3"))   # RF

(p3 <- vip(bst, col = "purple2"))   # GBM
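The three plots can then be displayed side by side using grid.arrange() from the gridExtra package:

# Display all three plots side by side
library(gridExtra)  # for grid.arrange()
grid.arrange(p1, p2, p3, ncol = 3)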

Notice how the vip() function always returns a "ggplot" object (by default, this will be a bar plot). For large models with many features, a dot plot is more effective (in fact, a number of useful plotting options can be fiddled with).

library(ggplot2)  # for theme_light() function
vip(bst, num_features = 5, bar = FALSE, horizontal = FALSE, 
    color = "red", shape = 17, size = 4) +
  theme_light()

Linear models

In multiple linear regression, or linear models (LMs), the absolute value of the \(t\)-statistic is commonly used as a measure of VI. The same idea also extends to generalized linear models (GLMs). In the code chunk below, we fit an LM to the simulated trn data set allowing for all main and two-way interaction effects, then use the step() function to perform backward elimination.
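A sketch of this step is given below; the object names lin and bwd are illustrative.

# Fit an LM with all main effects and two-way interactions, then perform
# backward elimination (object names are illustrative)
lin <- lm(y ~ .^2, data = trn)
bwd <- step(lin, direction = "backward", trace = 0)

# VI scores based on the absolute value of the t-statistic
vi(bwd)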

One issue with computing VI scores for LMs using the \(t\)-statistic approach is that a score is assigned to each term in the model, rather than to just each feature! We can solve this problem using one of the model-agnostic approaches discussed later.

Multivariate adaptive regression splines (MARS), introduced in Friedman (1991), is an automatic regression technique that can be seen as a generalization of multiple linear regression and generalized linear models. In the MARS algorithm, the contribution (or VI score) for each predictor is determined using a generalized cross-validation (GCV) statistic. An example using the earth package is given below:
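The example is a sketch; the interaction degree and the object name mars are illustrative choices.

# Load required packages
library(earth)  # for fitting MARS models

# Fit a MARS model allowing for up to two-way interactions
mars <- earth(y ~ ., data = trn, degree = 2)

# Extract GCV-based VI scores
vi(mars)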

Neural networks

For NNs, two popular methods for constructing VI scores are the Garson algorithm (Garson 1991), later modified by Goh (1995), and the Olden algorithm (Olden, Joy, and Death 2004). For both algorithms, the basis of these importance scores is the network’s connection weights. The Garson algorithm determines VI by identifying all weighted connections between the nodes of interest. Olden’s algorithm, on the other hand, uses the product of the raw connection weights between each input and output neuron and sums the product across all hidden neurons. This has been shown to outperform the Garson method in various simulations. For DNNs, a similar method due to Gedeon (1997) considers the weights connecting the input features to the first two hidden layers (for simplicity and speed); but this method can be slow for large networks.
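For a network with a single hidden layer, Olden's measure can be sketched as follows, where \(w_{ij}^{(1)}\) denotes the weight from input \(i\) to hidden neuron \(j\), \(w_{j}^{(2)}\) the weight from hidden neuron \(j\) to the output, and \(H\) the number of hidden neurons (notation introduced here for illustration):

\[
\mathrm{Imp}\left(x_i\right) = \sum_{j = 1}^{H} w_{ij}^{(1)} w_{j}^{(2)}.
\]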

These methods will be implemented in vip soon!

Model-agnostic VI scores

Model-agnostic interpretability separates interpretation from the model. Compared to model-specific approaches, model-agnostic VI methods are more flexible (since they can be applied to any supervised learning algorithm). In this section, we discuss model-agnostic methods for quantifying global feature importance using three different approaches: 1) PDPs, 2) ICE curves, and 3) permutation. For details on approaches 1)–2), see Greenwell, Boehmke, and McCarthy (2018).

PDP method

Our first model-agnostic approach is based on quantifying the “flatness” of the PDPs of each feature. PDPs help visualize the effect of low-cardinality subsets of the feature space on the estimated prediction surface (e.g., main effects and two/three-way interaction effects). PDPs provide model-agnostic interpretations and can be constructed in the same way for any supervised learning algorithm. Below, we fit a projection pursuit regression (PPR) model and a neural network (NN), and construct PDPs for each feature of the PPR model using the pdp package (Greenwell 2017).
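A sketch of one possible setup is given below; the object names pp and nn match later chunks, while the NN architecture and other tuning parameters are illustrative assumptions.

# Load required packages
library(ggplot2)    # for autoplot() support and theme_light()
library(gridExtra)  # for grid.arrange()
library(nnet)       # for fitting neural networks
library(pdp)        # for partial dependence plots

# Fit a PPR model and a single-hidden-layer NN to the simulated data
set.seed(101)  # for reproducibility
pp <- ppr(y ~ ., data = trn, nterms = 2)
nn <- nnet(y ~ ., data = trn, size = 10, linout = TRUE, decay = 0.01,
           maxit = 500, trace = FALSE)

# PDPs for all 10 features of the PPR model
features <- paste0("x.", 1:10)
pdps <- lapply(features, FUN = function(feature) {
  pd <- partial(pp, pred.var = feature, train = trn)
  autoplot(pd) +
    ylim(range(trn$y)) +
    theme_light()
})
grid.arrange(grobs = pdps, ncol = 5)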

Next, we compute PDP-based VI scores for the PPR and NN models. The PDP method constructs VI scores that quantify the “flatness” of each PDP (by default, this is defined by computing the standard deviation of the \(y\)-axis values for each PDP). To use the PDP method, specify method = "pdp" in the call to vi() or vip().
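For example (a sketch; depending on the model type, additional arguments, such as the training data or feature names, may need to be supplied):

# PDP-based VI scores for the PPR and NN models
p1 <- vip(pp, method = "pdp") + ggtitle("PPR")
p2 <- vip(nn, method = "pdp") + ggtitle("NN")
grid.arrange(p1, p2, ncol = 2)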

ICE curve method

The ICE curve method is similar to the PDP method. The only difference is that we measure the “flatness” of each ICE curve and then aggregate the results (e.g., by averaging) [2]. If there are no (substantial) interaction effects, using method = "ice" will produce results similar to using method = "pdp". However, if strong interaction effects are present, they can obfuscate the main effects and render the PDP-based approach less useful (since the PDPs for important features can be relatively flat when certain interactions are present; see Goldstein et al. (2015) for details). In fact, it is probably safest to always use method = "ice".

Below, we display the ICE curves for each feature using the same \(y\)-axis scale. Again, there is a clear difference between the ICE curves for features x.1–x.5 and x.6–x.10; the latter are relatively flat by comparison. Also, notice how the ICE curves within each feature are relatively parallel (if the ICE curves within each feature were perfectly parallel, the standard deviation for each curve would be the same and the results would be identical to the PDP method). In this example, the interaction term between x.1 and x.2 does not obfuscate the PDPs for the main effects and the results are not much different. To use the ICE curve method, specify method = "ice" in the call to vi() or vip().

# ICE curves for all 10 features
ice_curves <- lapply(features, FUN = function(feature) {
  ice <- partial(pp, pred.var = feature, ice = TRUE)
  autoplot(ice, alpha = 0.1) + 
    ylim(range(trn$y)) +
    theme_light()
})
grid.arrange(grobs = ice_curves, ncol = 5)
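The corresponding ICE-based VI scores can then be computed in the same way (a sketch):

# ICE-based VI scores for the PPR and NN models
p1 <- vip(pp, method = "ice") + ggtitle("PPR")
p2 <- vip(nn, method = "ice") + ggtitle("NN")
grid.arrange(p1, p2, ncol = 2)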

Permutation method

The permutation method exists in various forms and was made popular in Breiman (2001) for random forests. A more general approach to the permutation method is described in Assessing Variable Importance for Predictive Models of Arbitrary Type, an R package vignette by DataRobot. The permutation approach used in vip is quite simple. The idea is that if we randomly permute the values of an important feature in the training data, the training performance would degrade (since permuting the values of a feature effectively destroys any relationship between that feature and the target variable). This of course assumes that the model has been properly tuned (e.g., using cross-validation) and is not overfitting. The permutation approach uses the difference between some baseline performance measure (e.g., training \(R^2\) or RMSE) and the same performance measure obtained after permuting the values of a particular feature in the training data. (Note: the model is NOT refit to the training data after randomly permuting the values of a feature.) To use the permutation approach, specify method = "permute" in the call to vi() or vip(). Note that using method = "permute" requires specifying a few additional arguments; see ?vi_permute for details.

An example is given below for the previously fitted PPR and NN models.

# Plot VI scores
set.seed(2021)  # for reproducibility
p1 <- vip(pp, method = "permute", target = "y", metric = "rsquared",
          pred_fun = predict) + ggtitle("PPR")
#> Warning: Setting `method = "permute"` is experimental, use at your own
#> risk!
p2 <- vip(nn, method = "permute", target = "y", metric = "rsquared",
          pred_fun = predict) + ggtitle("NN")
#> Warning: Setting `method = "permute"` is experimental, use at your own
#> risk!
grid.arrange(p1, p2, ncol = 2)

The Pima Indians diabetes data

As a final example, we’ll consider the well-known Pima Indians diabetes data; see ?pdp::pima for details. These data contain diabetes test results collected by the US National Institute of Diabetes and Digestive and Kidney Diseases from a population of women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix, Arizona. The target variable, diabetes, is a factor indicating the diabetes test result (pos/neg). In the code chunk below, we fit a random forest to the Pima Indians data using the fantastic ranger package. Note that we fit two different random forests: rfo1 and rfo2. The only difference is that we would use rfo1 if we wanted predicted class labels and we would use rfo2 for predicted class probabilities. The distinction is important when using method = "permute" since the performance metric being used requires the predicted outcome to be either the class labels (e.g., metric = "error" for classification error) or the predicted class probabilities (e.g., metric = "auc" for area under the ROC curve). We’ll illustrate both below. We should point out that there is more built-in support for "ranger" objects, so it is not necessary to supply pred_fun or specify a specific metric (the default is metric = "auto"), but for completeness, we explicitly specify all the options.

# Load required packages
library(ranger)

# Load the Pima Indians diabetes data
data(pima, package = "pdp")
pima <- na.omit(pima)  # remove records with missing values

# Fit a random forest
set.seed(1322)  # for reproducibility
rfo1 <- ranger(diabetes ~ ., data = pima, importance = "permutation")
rfo2 <- ranger(diabetes ~ ., data = pima, importance = "permutation",
               probability = TRUE)  # for predicted probabilities

# Plot VI scores
p1 <- vip(rfo1)  # model-specific
p2 <- vip(rfo2)  # model-specific
set.seed(1329)  # for reproducibility
pfun <- function(object, newdata) predict(object, data = newdata)$predictions
p3 <- vip(rfo1, method = "permute", metric = "error", pred_fun = pfun, 
          target = "diabetes")
#> Warning: Setting `method = "permute"` is experimental, use at your own
#> risk!
p4 <- vip(rfo2, method = "permute", metric = "auc", pred_fun = pfun,
          target = "diabetes", reference_class = "neg")
#> Warning: Setting `method = "permute"` is experimental, use at your own
#> risk!
grid.arrange(p1, p2, p3, p4, ncol = 2)

Use sparklines to characterize feature effects

Starting with vip v0.1.3, we have included a new function vit() for constructing HTML-based variable importance tables. The primary difference between vi() and vit() is that the latter includes an Effect column that displays a sparkline representation of the partial dependence function for each feature. This is a concise way to display both feature importance and feature effect information in a single table. See ?vip::vit for details. We illustrate the basic use of vit() in the code chunk below.
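A minimal sketch, assuming vit() accepts a fitted model in the same way as vi() (additional arguments may be required; see ?vip::vit):

# Construct an HTML-based VI table with sparkline effect plots for the
# Pima Indians random forest (assumed call pattern; see ?vip::vit)
vit(rfo2)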

References

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2): 123–40. https://doi.org/10.1023/A:1018054314350.

———. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Breiman, Leo, Jerome Friedman, Richard A. Olshen, and Charles J. Stone. 1984. Classification and Regression Trees. The Wadsworth and Brooks-Cole Statistics-Probability Series. Taylor & Francis.

Friedman, Jerome H. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics 19 (1): 1–67. https://doi.org/10.1214/aos/1176347963.

Garson, David G. 1991. “Interpreting Neural-Network Connection Weights.” Artificial Intelligence Expert 6 (4): 46–51.

Gedeon, T.D. 1997. “Data Mining of Inputs: Analysing Magnitude and Functional Measures.” International Journal of Neural Systems 8 (2): 209–18.

Goh, A.T.C. 1995. “Back-Propagation Neural Networks for Modeling Complex Systems.” Artificial Intelligence in Engineering 9 (3): 143–51. https://doi.org/10.1016/0954-1810(94)00011-S.

Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. “Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.” Journal of Computational and Graphical Statistics 24 (1): 44–65. https://doi.org/10.1080/10618600.2014.907095.

Greenwell, Brandon. 2017. pdp: Partial Dependence Plots. https://CRAN.R-project.org/package=pdp.

Greenwell, Brandon M., Bradley C. Boehmke, and Andrew J. McCarthy. 2018. “A Simple and Effective Model-Based Variable Importance Measure.” arXiv Preprint arXiv:1805.04755.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer-Verlag.

Olden, Julian D., Michael K. Joy, and Russell G. Death. 2004. “An Accurate Comparison of Methods for Quantifying Variable Importance in Artificial Neural Networks Using Simulated Data.” Ecological Modelling 178 (3): 389–97. https://doi.org/10.1016/j.ecolmodel.2004.03.013.


  1. The exception is GLM-like models (e.g., LMs and GLMs), described in the next section, which include an additional column called Sign containing the sign of the original coefficients.

  2. There is also the potential to use the individual ICE curves to quantify feature importance at the observation level, thereby providing local VI scores.