Warning This vignette is still under construction!

As it turns out, the PDP-based variable importance (VI) measure discussed in Greenwell, Boehmke, and McCarthy (2018) (and provided by method = "pdp") can also be used to quantify the strength of potential interaction effects. Let \(i\left(x_i, x_j\right)\) \(\left(i \ne j\right)\) be the standard deviation of the joint partial dependence values \(\bar{f}_{ij}\left(x_{ii'}, x_{jj'}\right)\) for \(i' = 1, 2, \dots, k_i\) and \(j' = 1, 2, \dots, k_j\). Essentially, a weak interaction effect of \(x_i\) and \(x_j\) on \(Y\) would suggest that \(i\left(x_i, x_j\right)\) has little variation when either \(x_i\) or \(x_j\) is held constant while the other varies.

Let \(\boldsymbol{z}_s = \left(x_i, x_j\right)\), \(i \neq j\), be any two predictors in the feature space \(\boldsymbol{x}\). Construct the partial dependence function \(\bar{f}_s\left(x_i, x_j\right)\) and compute \(i\left(x_i\right)\) for each unique value of \(x_j\), denoted \(\i\left(x_i | x_j\right)\), and take the standard deviation of the resulting importance scores. The same can be done for \(x_j\) and the results averaged together. Large values (relative to each other) would be indicative of possible interaction effects.

Prerequisites

# Load required packages
library(dplyr)           # for data wrangling
library(gbm)             # for fitting generalized boosted models
library(ggplot2)         # for general visualization
library(lattice)         # for general visualization
library(mlbench)         # for machine learning benchmark data sets
library(NeuralNetTools)  # for various tools for neural networks
library(nnet)            # for fitting neural networks w/ a single hidden layer
library(vip)             # for visualizing feature importance

Friedman’s regression problem

To illustrate, we’ll use one of the regression problems described in Friedman (1991) and Breiman (1996). The feature space consists of ten independent \(\mathcal{U}\left(0, 1\right)\) random variables; however, only five out of these ten actually appear in the true model. The response is related to the features according to the formula \[ Y = 10 \sin\left(\pi x_1 x_2\right) + 20 \left(x_3 - 0.5\right) ^ 2 + 10 x_4 + 5 x_5 + \epsilon, \] where \(\epsilon \sim \mathcal{N}\left(0, \sigma^2\right)\). Using the R package nnet (Venables and Ripley 2002), we fit a NN with one hidden layer containing eight units and a weight decay of 0.01 (these parameters were chosen using 5-fold cross-validation) to 500 observations simulated from the above model with \(\sigma = 1\). The cross-validated \(R^2\) value was 0.94.

VIPs for the fitted network are displayed in Figure 1. Notice how the Garson and Olden algorithms incorrectly label some of the features not in the true model as “important”. For example, the Garson algorithm incorrectly labels \(x_8\) (which is not included in the true model) as more important than \(x_5\) (which is in the true model). Similarly, Olden’s method incorrectly labels \(x_{10}\) as being more important than \(x_2\). Our method, on the other hand, clearly labels all five of the predictors in the true model as the most important features in the fitted NN.

# Simulate data
set.seed(101)  # for reproducibility
trn <- as.data.frame(mlbench.friedman1(n = 500, sd = 1))

# Fit a neural network to simulated data
set.seed(103)
nn <- nnet(y ~ ., data = trn, size = 8, linout = TRUE, decay = 0.01,
           maxit = 1000, trace = FALSE)

# VIP: partial dependence algorithm
p1 <- vip(nn, method = "pdp", feature_names = paste0("x.", 1:10)) +
  labs(x = "", y = "Importance", title = "PDP method") +
  theme_light()

# VIP: Garson's algorithm
nn_garson <- garson(nn, bar_plot = FALSE) %>%
  tibble::rownames_to_column("Variable") %>%
  select(Variable, Importance = rel_imp) %>%
  mutate(Importance = Importance)
p2 <- ggplot(nn_garson, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col() +
  labs(x = "", y = "Importance", title = "Garson's method") +
  coord_flip() +
  theme_light()

# VIP: Olden's algorithm
nn_olden <- olden(nn, bar_plot = FALSE) %>%
  tibble::rownames_to_column("Variable") %>%
  select(Variable, Importance = importance) %>%
  mutate(Importance = Importance)
p3 <- ggplot(nn_olden, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col() +
  labs(x = "", y = "Importance", title = "Olden's method") +
  coord_flip() +
  theme_light()

# Display plots side by side
grid.arrange(p1, p2, p3, ncol = 3)
**Figure 1** VIPs for the NN fit to Friedman's regression data. *Left:* Model-agnostic: partial dependence. *Middle:* Model-specific: Garson's method. *Right:* Model-specific: Olden's method.

Figure 1 VIPs for the NN fit to Friedman’s regression data. Left: Model-agnostic: partial dependence. Middle: Model-specific: Garson’s method. Right: Model-specific: Olden’s method.

We also constructed the partial dependence functions for all pairwise interactions and computed the interaction statistic discussed in Greenwell, Boehmke, and McCarthy (2018). The top ten interaction statistics are displayed in Figure 2. There is a clear indication of an interaction effect between features \(x_1\) and \(x_2\), the only interaction effect present in the true model.

**Figure 2** Top 10 VI-based interaction statistics for the NN fit to Friedman's regression data.

Figure 2 Top 10 VI-based interaction statistics for the NN fit to Friedman’s regression data.

In fact, since we know the distributions of the predictors in the true model, we can work out the true partial dependence functions. For example, for the pairs \(\left(x_1, x_2\right)\) and \(\left(x_1, x_4\right)\), we have \[ f\left(x_1, x_2\right) = 10 \sin \left(\pi x_1 x_2\right) + 55 / 6, \] and \[ f\left(x_1, x_4\right) = \frac{5 \pi x_1 \left(12 x_4 + 5\right) - 12 \cos \left(\pi x_1\right) + 12}{6 \pi x_1}. \] Next, we simulated the standard deviation of \(f\left(x_1, x_2\right)\) for a wide range of fixed values of \(x_2\); this is what \(i\left(x_1 | x_2\right)\) is trying to estimate. The results from doing this for both predictors in each model are displayed in Figure 3. The top row of Figure Figure 3 illustrates that the importance of \(x_1\) (i.e., the strength of its relationship to the predicted outcome) heavily depends on the value of \(x_2\) and vice versa (i.e., an interaction effect between \(x_1\) and \(x_2\)). In the bottom row, on the other hand, we see that the importance of \(x_1\) does not depend on the value of \(x_4\) and vice versa (i.e., no interaction effect between \(x_1\) and \(x_4\)).

# Simulation function
sim_fun <- function(pred.var, pd.fun, xlabs = c("", ""), ylabs = c("", "")) {
  x <- y <- seq(from = 0, to = 1, length = 100)
  xy <- expand.grid(x, y)
  z <- apply(xy, MARGIN = 1, FUN = function(x) {
    pd.fun(x[1L], x[2L])
  })
  res <- as.data.frame(cbind(xy, z))
  names(res) <- c(pred.var, "yhat")
  form <- as.formula(paste("yhat ~", paste(paste(pred.var, collapse = "*"))))
  p1 <- levelplot(form, data = res, col.regions = viridis::viridis,
                  xlab = expression(x[1]), ylab = expression(x[4]))
  approxVar.x <- function(x = 0.5, n = 100000) {
    y <- runif(n, min = 0, max = 1)
    sd(pd.fun(x, y))
  }
  approxVar.y <- function(y = 0.5, n = 100000) {
    x <- runif(n, min = 0, max = 1)
    sd(pd.fun(x, y))
  }
  x <- seq(from = 0, to = 1, length = 100)
  y1 <- sapply(x, approxVar.x)
  y2 <- sapply(x, approxVar.y)
  p2 <- xyplot(SD ~ x, data = data.frame(x = x, SD = y1), type = "l",
               lwd = 1, col = "black",
               ylim = c(min(y1, na.rm = TRUE) - 1, max(y1, na.rm = TRUE) + 1),
               xlab = expression(x[1]),
               ylab = expression(imp ~ (x[4]*" | "*x[1])))
  p3 <- xyplot(SD ~ x, data = data.frame(x = x, SD = y2), type = "l",
               lwd = 1, col = "black",
               ylim = c(min(y2, na.rm = TRUE) - 1, max(y2, na.rm = TRUE) + 1),
               xlab = expression(x[4]),
               ylab = expression(imp ~ (x[1]*" | "*x[4])))
  list(p1, p2, p3)
}

# Run simulations
sim1 <- sim_fun(
  pred.var = c("x.1", "x.2"), 
  xlabs = c(expression(x[1]), expression(x[2])),
  ylabs = c(expression(imp ~ (x[2]*" | "*x[1])), 
            expression(imp ~ (x[1]*" | "*x[2]))),
  pd.fun = function(x1, x2) {
    5 * (pi * x1 * (12 * x2 + 5) - 12 * cos(pi * x1) + 12) / (6 * pi * x1)
  })
sim2 <- sim_fun(
  pred.var = c("x.1", "x.4"), 
  xlabs = c(expression(x[1]), expression(x[4])),
  ylabs = c(expression(imp ~ (x[4]*" | "*x[1])), 
            expression(imp ~ (x[1]*" | "*x[4]))),
  pd.fun = function(x1, x2) {
    10 * sin(pi * x1 * x2) + 55 / 6
})

# Display plots side by side
grid.arrange(
  sim1[[1]], sim1[[2]], sim1[[3]], 
  sim2[[1]], sim2[[2]], sim2[[3]], 
  nrow = 2
)
**Figure 3** Results from a small Monte Carlo simulation on the interaction effects between $x_1$ and $x_2$ (top row), and $x_1$ and $x_4$ (bottom row).

Figure 3 Results from a small Monte Carlo simulation on the interaction effects between \(x_1\) and \(x_2\) (top row), and \(x_1\) and \(x_4\) (bottom row).

Friedman’s \(H\)-statistic

An alternative measure for the strength of interaction effects is known as Friedman’s \(H\)-statistic (Friedman and Popescu 2008). Coincidentally, this method is also based on the estimated partial dependence functions of the corresponding predictors, but uses a different approach.

For comparison, we fit a GBM to the Friedman regression data from the previous section. The parameters were chosen using 5-fold cross-validation. We used the R package gbm (Ridgeway 2017) which has built-in support for computing Friedman’s \(H\)-statistic for any combination of predictors. The results are displayed in Figure 4. To our surprise, the \(H\)-statistic did not seem to catch the true interaction between \(x_1\) and \(x_2\). Instead, the \(H\)-statistic ranked the pairs \(\left(x_8, x_9\right)\) and \(\left(x_7, x_{10}\right)\) as having the strongest interaction effects, even though these predictors do not appear in the true model. Our VI-based interaction statistic, on the other hand, clearly suggests the pair \(\left(x_1, x_2\right)\) as having the strongest interaction effect.

# Fit a GBM
set.seed(937)
trn.gbm <- gbm(y ~ ., data = trn, distribution = "gaussian", n.trees = 25000,
               shrinkage = 0.01, interaction.depth = 2, bag.fraction = 1,
               train.fraction = 0.8, cv.folds = 5, verbose = FALSE)
best.iter <- gbm.perf(trn.gbm, method = "cv", plot.it = FALSE)

# Friedman's H-statistic
combns <- t(combn(paste0("x.", 1:10), m = 2))
int.h <- numeric(nrow(combns))
for (i in 1:nrow(combns)) {
  # print(paste("iter", i, "of", nrow(combns)))
  int.h[i] <- interact.gbm(trn.gbm, data = trn, i.var = combns[i, ], 
                           n.trees = best.iter)
}
int.h <- data.frame(x = paste0(combns[, 1L], "*", combns[, 2L]), y = int.h)
int.h <- int.h[order(int.h$y, decreasing = TRUE), ]

# VI-based interaction statistic
int.i <- vint(
  object = trn.gbm,                    # fitted model object
  feature_names = paste0("x.", 1:10),  # features for which to compute pairwise interactions statistics
  n.trees = best.iter,                 # needed if object is of class "gbm"
  parallel = TRUE
)

# Plot Friedman's H-statistics
p1 <- ggplot(int.h[1:10, ], aes(reorder(x, y), y)) +
  geom_col(width = 0.75) +
  labs(x = "", y = "Interaction strength", title = "Friedman's H-statistic") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1)) +
  theme_light() +
  coord_flip()

# Plot PDP-based interaction statistics
p2 <- ggplot(int.i[1:10, ], aes(reorder(Variables, Interaction), Interaction)) +
  geom_col(width = 0.75) +
  labs(x = "", y = "Interaction strength", title = "Partial dependence") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1)) +
  theme_light() +
  coord_flip()

# Display plots side by side
grid.arrange(p1, p2, ncol = 2)
**Figure 4** Interaction statistics for the GBM model fit to Friedman's regression data. \textit{Left:} Friedman's $H$-statistic. \textit{Right:} Our VI-based interaction statistic.

Figure 4 Interaction statistics for the GBM model fit to Friedman’s regression data. Friedman’s \(H\)-statistic. Our VI-based interaction statistic.

References

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 8 (2): 209–18. https://doi.org/10.1023/A:1018054314350.

Friedman, Jerome H. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics 19 (1): 1–67. https://doi.org/10.1214/aos/1176347963.

Friedman, Jerome H., and Bogdan E. Popescu. 2008. “Predictive Learning via Rule Ensembles.” Annals of Applied Statistics 2 (3): 916–54. https://doi.org/10.1214/07-aoas148.

Greenwell, Brandon M., Bradley C. Boehmke, and Andrew J. McCarthy. 2018. “A Simple and Effective Model-Based Variable Importance Measure.” arXiv Preprint arXiv:1805.04755.

Ridgeway, Greg. 2017. Gbm: Generalized Boosted Regression Models. https://CRAN.R-project.org/package=gbm.

Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. 4th ed. New York: Springer-Verlag. http://www.stats.ox.ac.uk/pub/MASS4.