We present vivid, an R package for visualizing variable importance and variable interactions in machine learning models. The package provides heatmap and graph-based displays for viewing variable importance and interaction jointly, and partial dependence plots in both a matrix layout and an alternative layout emphasizing important variable subsets. With the intention of increasing machine learning models’ interpretability and making the work applicable to a wider readership, we discuss the design choices behind our implementation by focusing on the package structure and providing an in-depth look at the package functions and key features. We also provide a practical illustration of the software in use on a data set.
Our motivation behind the creation of the vivid package is to investigate machine learning models in a way that is simple to understand while also offering helpful insights into how variables affect the fit. We do this through the use of heatmaps, network graphs, and both a generalized pairs plot style partial dependence plot (PDP) (Friedman 2000) and a space-saving PDP based on key variable subsets. While the techniques and fundamental goals of these visualizations have been discussed in Inglis et al. (2022b), we focus here on the implementation details of the package by providing a complete listing of the functions and arguments included in the vivid package, with further examples indicating advanced usage beyond that previously shown. In this work we examine the decisions made when designing the package and provide an in-depth look at the package functions and features with the intention of making the work applicable to a larger readership. This article outlines the general architectural principles implemented in vivid, such as the data structures we use and data formatting, function design, filtering techniques, and more. We illustrate each function by way of a practical example. Our package is available on the Comprehensive R Archive Network at https://cran.r-project.org/web/packages/vivid or on GitHub at https://github.com/AlanInglis/vivid.
In recent years machine learning (ML) algorithms have emerged as a valuable tool for both industry and science. However, due to the black-box nature of many of these algorithms it can be challenging to communicate the reasoning behind the algorithm’s decision-making processes. With the need for transparency in ML growing it is important to gain understanding and clarity about how these algorithms are making predictions (Antunes et al. 2018; Felzmann et al. 2019). Many R packages are now available that aid in creating interpretable machine learning (IML) models such as iml (Molnar et al. 2018), DALEX (Biecek 2018), and lime (Hvitfeldt et al. 2022). For a comprehensive review of IML, see Molnar (2022) and Biecek and Burzykowski (2021).
How we choose to visualize aspects of the model output is of vital importance to how a researcher can interpret and communicate their findings. Consequently, model summaries such as variable importance and variable interactions (VImp and VInt; together we term these VIVI) are frequently used in various fields to comprehend and explain the hidden structure in an ML fit. In ecology they are employed to determine the causes of ecological phenomena (e.g. Murray and Conner 2009); in meteorology VImp measures and partial dependence plots are used to examine air quality (e.g. Grange et al. 2018); in bioinformatics, understanding gene-environment interactions have made these measures an important tool for genomic analysis (e.g. Chen and Ishwaran 2012).
In Table 1 we summarize VIVI measures and visualizations provided by a selection of R packages. VIVI measures are categorized as global or local, depending on whether they refer to the entire predictor space or a specific sub-region. For instance, Individual Conditional Expectation (ICE) curves (Goldstein et al. 2015) and Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al. 2016) are considered local methods (implementations of which can be found in the ICEbox package (Goldstein et al. 2015) and the lime package (Hvitfeldt et al. 2022)), whereas partial dependence and permutation importance represent global methods. Packages that incorporate local or global methods can be found under the column ‘Method’ in Table 1. Additionally, VIVI measures can be broken down into model-specific (embedded) methods or model-agnostic methods. The packages listed in Table 1 are grouped by whether they can compute model-specific or agnostic measures and can be found under the column ‘Measure’. In embedded methods the variable importance is incorporated into the ML algorithm. For example, random forests (RF) (Breiman 2001) and gradient boosting machines (GBM) (Friedman 2000) use the tree structure to derive importance scores. Bayesian additive regression tree models (BART) (Chipman et al. 2010) also use an embedded method to obtain VIVI measures by looking at the proportion of splitting rules for variables or variable pairs used in the trees. The package randomForestExplainer (Paluszynska et al. 2020) provides a set of tools to understand what is happening inside a random forest and uses the concept of minimal depth (Ishwaran et al. 2010) to assess both importance and interaction strength by examining the position of a variable within the trees. Additionally the varImp (Probst 2020) package can be used to compute importance scores for the conditional inference random forest of the party package (Strobl et al. 2008).
For gradient boosted machines, the EIX (Maksymiuk et al. 2021) package can be used to measure and identify VIVI and visualize the results.
Model-agnostic methods are techniques that can, in principle, be applied to any ML algorithm. Agnostic methods not only provide flexibility in relation to model selection but are also useful for comparing different fitted ML models. An example of a model agnostic approach for evaluating VImps is permutation importance (Breiman 2001). This method calculates the difference in a model’s predictive performance following a variable’s permutation; implementations are available in the packages iml, flashlight (Mayer 2023), DALEX and vip (Greenwell and Boehmke 2020). Each package provides options for specifying the performance metric to use in computing the model performance as well as providing options to select the number of permutations, with flashlight and DALEX packages additionally providing an option to select the number of observations that should be sampled for the calculation of variable importance. The vip package also provides a PDP/ICE-based variable importance method, which quantifies the variability in PDP and ICE plots. For VInts, Friedman’s \(H\)-statistic (Friedman and Popescu 2008) is an agnostic interaction measure derived from the partial dependence by comparing a pair of variables’ joint effects with the sum of their marginal effects. Packages iml and flashlight provide implementations.
Partial dependence plots (PDPs) were first introduced by Friedman (2000) as a model agnostic way to visualize the relationship between a specified predictor variable and the fit, averaging over other predictors’ effects. Similar to PDPs, individual conditional expectation curves (ICE) (Goldstein et al. 2015) show the relationship between a specified predictor and the fit, fixing the levels of other predictors at those of a particular observation. PDP curves are then the average of the ICE curves over all observations in the dataset. R packages offering PDPs include iml, DALEX (which calls the ingredients package (Biecek and Baniecki 2023)), flashlight, and pdp (Greenwell 2017); the package ICEbox (Goldstein et al. 2015) provides ICE curves and variations. Each package provides options for selecting the size of the grid for evaluating the predictions, with flashlight and DALEX providing options to select the number of observations to consider in calculating the partial dependence.
In vivid we provide a suite of functions (see Table 2) for calculating and visualizing variable importance, interactions and the partial dependence. Our displays conveniently show (either model-specific or agnostic) VImp and VInt jointly using heatmaps and network graphs. Through the use of seriation techniques, we group together variables with the greatest impact on the response. For our network plot, additional filtering to remove less influential variables and clustering options to group highly interacting variables are provided, giving a more informative picture that identifies relevant features. Additionally, our implementation makes it possible to apply our methods to subsets of data, which leads to locally-based importance measures. Our generalized PDP (GPDP) displays partial dependence plots in a matrix layout combining univariate and bivariate partial dependence plots with variable scatterplots. This layout, coupled with seriation, allows for quick assessment of how pairs of variables have an impact on the model fit. We furthermore provide a more compact version of the GPDP, the so-called zen partial dependence plot (ZPDP), consisting only of those bivariate partial dependence plots with high VInt. All of our displays are designed to quickly identify how variables, both singly and jointly, affect the fitted response and can be used for regression or classification fits. As the outputs of our displays are ggplot2 objects (Wickham 2016), they are easily customizable and provide the flexibility to create custom VIVI visualizations.
This paper is structured as follows. First we introduce a dataset and fit models that will be used as examples throughout this paper. Following this, we describe vivid functionality for calculating VIVI. We then move on to visualizations, focusing on the functionality provided by the two functions viviHeatmap and viviNetwork for displaying VIVI, and the two functions for displaying PDPs, namely pdpPairs and pdpZen. Finally we provide some concluding discussion.
The well-known Boston housing data (Harrison Jr and Rubinfeld 1978) from the R package MASS (Venables and Ripley 2002) concerns prices of 506 houses and 13 predictor variables including property attributes such as number of rooms and social attributes including crime rate and pollution levels. The response is the median value of owner-occupied homes in $1000s (medv).
We first fit a random forest (using the randomForest package). In order to access all the available embedded variable importance scores, the importance argument must be set to TRUE when calling the randomForest function. This allows any of the provided importance metrics to be used in vivid. However, in our following examples, we use an agnostic VImp measure supplied by the vivid package, which allows us to directly compare VImp values across different fits.
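For concreteness, the two fits used in our examples can be created along the following lines. This is a sketch: the seed, the xgboost settings, and the object names rf and gbst are our own illustrative choices.

```r
library(MASS)          # Boston housing data
library(randomForest)
library(xgboost)

set.seed(1701)

# Random forest; importance = TRUE makes all embedded
# importance metrics available for later use in vivid.
rf <- randomForest(medv ~ ., data = Boston, importance = TRUE)

# GBM via xgboost; predictors must be supplied as a numeric
# matrix, with the response (column 14, medv) passed separately.
gbst <- xgboost(data = as.matrix(Boston[, -14]),
                label = Boston$medv,
                nrounds = 100, verbose = 0)
```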
In the following sections we will explain how aspects of the two fits can be compared with vivid software. We will also explain aspects of our software design with reference to these fits.
The first step in using vivid is to calculate variable importance and interactions for a model fit. The vivi function calculates both of these, creating a square, symmetric matrix containing variable importance on the diagonal and variable interactions on the off-diagonal. Required inputs are a fitted ML model, a data frame on which the model was trained, and the name of the response variable for the fit. The returned matrix has importance and interaction values for all variables in the supplied data frame, excluding the response. Our visualization functions viviHeatmap and viviNetwork are designed to show the results of a vivi calculation, but will work equally well for any square matrix with identical row and column names. (Note, the symmetry assumption is not required for viviHeatmap; viviNetwork uses interaction values from the lower-triangular part of the matrix only.)
The code snippet below shows the creation of a vivid matrix for the random forest fit. For clarity, we include all of the vivi function arguments for the random forest fit, though only the first three are required. Other inputs will be described in the section on vivi function additional arguments.
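A call of the following form creates the vivid matrix for the random forest fit. This is a sketch: the first three arguments are required, and the values shown for the remaining arguments reflect the defaults and recommendations described in this paper (reorder = FALSE keeps the variables in dataset order, as used in our heatmap comparisons); consult the package documentation for the authoritative argument list.

```r
library(vivid)

# vivid matrix for the random forest fit; rf and Boston as above.
viviRf <- vivi(fit = rf,
               data = Boston,
               response = "medv",           # first three arguments required
               gridSize = 50,               # grid for evaluating predictions
               importanceType = "agnostic", # permutation importance
               nmax = 500,                  # max rows used in calculations
               reorder = FALSE,             # keep dataset variable order
               normalized = FALSE,          # unnormalized H-statistic
               predictFun = NULL)           # use the default CVpredict
```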
In the absence of any model-specific importance measure we use an agnostic permutation method described by Fisher et al. (2019) to obtain the variable importance scores. In this method a model error score (root mean square error) is calculated, then each feature is randomly permuted and the model error is re-calculated. The difference in performance is taken as the variable importance score for that feature. By default the permutation is replicated four times to account for variability, though we provide an option to select the desired number of permutations via an argument to vivi.

The vivi function calculates both the importance and interactions using S3 methods. By default, the agnostic importance and interaction scores in vivid are computed using the generalized predict function from the condvis2 package (Hurley et al. 2022). Consequently, vivid can be used out-of-the-box with any model type that works with condvis2 predict (see CVpredict from condvis2 for a full list of compatible model types). To allow vivid to run with other model fits, a custom predict function must be passed to the predictFun argument (as discussed below).
The S3 method used to obtain the importance is called vividImportance. For agnostic importance, vivid relies on the flashlight package via flashlight::light_importance, which currently works for numeric and numeric binary responses only. For model-specific variable importance, we provide individual methods to access importance scores for some of the most popular model fitting R packages, namely ranger (Wright and Ziegler 2017), randomForest, mlr (Bischl et al. 2016), mlr3 (Lang et al. 2019), and parsnip (Kuhn and Vaughan 2022) (more could be added via additional methods). To access any available model-specific variable importance from the aforementioned packages, the importanceType argument must be set to the selected importance metric, named as it is called within the fitting package. For example, to select the percent increase in mean square error importance score from the randomForest package, set importanceType = "%IncMSE". If importanceType is not set, or is set to "agnostic", the agnostic importance is calculated. It should be noted that when comparing different model fits, using model-specific variable importance will result in importance measures that are not directly comparable (although comparing model-specific scores can be useful when comparing the same fitting procedure evaluated with different parameters). When a practitioner wishes to use our visualizations to compare different model fits, we recommend the agnostic permutation approach supplied by vivid, which allows a direct comparison of the importance measures.
For variable interactions, we use the model-agnostic Friedman’s \(H\)-statistic to identify any pairwise interactions. As discussed in Inglis et al. (2022b), we recommend the unnormalized version of the \(H\)-statistic, which prevents the detection of spurious interactions that can occur when the bivariate partial dependence function (used in the construction of the \(H\)-statistic) is flat. In the case of a binary response classification model, we follow Hastie et al. (2009) and compute the \(H\)-statistic and partial dependence on the logit scale.
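As a reminder of the quantities involved, a sketch of the two versions of the pairwise statistic in our notation is given below, where \(PD_{jk}\), \(PD_j\) and \(PD_k\) denote the (mean-centered, as in Friedman and Popescu (2008)) bivariate and univariate partial dependence functions evaluated at the \(n\) observed predictor values:

\[
H^2_{jk} = \frac{\sum_{i=1}^{n}\left[ PD_{jk}(x_{ij}, x_{ik}) - PD_{j}(x_{ij}) - PD_{k}(x_{ik}) \right]^2}{\sum_{i=1}^{n} PD_{jk}^2(x_{ij}, x_{ik})},
\qquad
H_{jk} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[ PD_{jk}(x_{ij}, x_{ik}) - PD_{j}(x_{ij}) - PD_{k}(x_{ik}) \right]^2}.
\]

The left-hand (normalized) version scales the joint-minus-marginal discrepancy by the size of the bivariate partial dependence, which is why it can flag spurious interactions when that denominator is near zero; the right-hand (unnormalized) version avoids this.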
The vivi function calculates interactions using an S3 method called vividInteraction, which again relies on the flashlight package to calculate Friedman’s \(H\)-statistic via flashlight::light_interaction. Friedman’s \(H\)-statistic is the only interaction measure currently available in vivid, though the method of Greenwell et al. (2018) could also be used for this purpose. Embedded interaction measures could easily be incorporated via S3 methods in future.

Relying on flashlight simplifies the calculation of VIVI values, as it allows a custom predict function to be supplied for the calculation of agnostic importance and the \(H\)-statistic; this flexibility means importance and the \(H\)-statistic can be calculated for any ML model. Note that as the flashlight importance and interaction functions act in a model-agnostic way, they will give a VIVI of zero for variables in the dataset (other than the response) that are not used by the supplied ML fit.
We supply an internal custom predict function called CVpredictfun to both the importance and interaction methods. CVpredictfun is a wrapper around CVpredict from the condvis2 package which, for classification, adds an option to select (via the class argument to vivi) the class to be used for prediction, and calculates predictions on the logit scale by default. CVpredict accepts a broad range of fit classes, thus streamlining the process of calculating VIVI.
In situations where the fit class is not handled by CVpredict (as is the case for the GBM model created from xgboost), supplying a custom predict function to the vivi function by way of the predictFun argument allows the agnostic VIVI values to be calculated.
In the code snippet below, we build the vivid matrix for the GBM fit using a custom predict function, which must be of the form given in the code snippet. For brevity we omit some of the optional vivi function arguments. By default, the agnostic variable importance is used, allowing a direct comparison of the importance measures for both model fits.
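A sketch of such a call follows, assuming the GBM fit object is called gbst and that vivi supplies the custom predict function with the fit and a data frame of the training data; the exact signature required by predictFun is documented in the package.

```r
# Custom predict function: xgboost requires a numeric matrix of
# predictors, so we drop the response column before predicting.
pFun <- function(fit, data, ...) {
  predict(fit, newdata = as.matrix(data[, -which(names(data) == "medv")]))
}

# vivid matrix for the GBM fit, using the custom predict function.
viviGbst <- vivi(fit = gbst,
                 data = Boston,
                 response = "medv",
                 reorder = FALSE,
                 predictFun = pFun)
```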
vivi function additional arguments
The vivi function has 11 arguments. Some of these have been discussed above, including predictFun. Here we provide a summary of the remaining arguments. First, the normalized argument determines if Friedman’s \(H\)-statistic should be normalized or not (see Inglis et al. (2022b) for the pros and cons of each version). The arguments gridSize and nmax set the size of the grid for evaluating the predictions and the maximum number of data rows to consider, respectively, in the calculation of the \(H\)-statistic. Lowering the grid size can provide a significant speed boost, though at the expense of predictive accuracy. Additionally, sampling the data via nmax offers a speed boost. The default values for gridSize and nmax are 50 and 500, respectively.
A drawback of using Friedman’s \(H\)-statistic as a measure of interaction is that it is computationally expensive, and may be especially time-consuming for models where prediction is slow. To quantify this, we recorded the build time (rounded to the nearest second), averaged over five runs, for the creation of a vivid matrix with default parameters for different ML algorithms using the Boston housing data. As the Boston housing data has 13 predictor variables, Friedman’s \(H\)-statistic is computed for 78 predictor pairs. The ML algorithms are: GBM, random forest, support vector machine (SVM), neural network (NN), and k-nearest neighbors (KNN). The SVM, NN, and KNN were built using the e1071 (Meyer et al. 2021), nnet (Venables and Ripley 2002), and kknn (Schliep and Hechenbichler 2016) packages, with the KNN being built through the mlr3 (Lang et al. 2019) framework. Each of the models was built using its default settings and, for each model fit, the agnostic VIVI was measured. The speed tests were performed on both a 2017 MacBook Pro 2.3 GHz Dual-Core Intel Core i5 with 8GB of RAM and a 2021 32GB MacBook M1 Pro. Here we are essentially comparing predict times for the various fits. The NN fit created using the nnet package was the fastest, followed by the GBM. The two random forest fits were the slowest and, surprisingly, the older Mac beats its higher-spec cousin for the randomForest fit.
A vivid matrix may also be obtained from variable importance and interaction values calculated elsewhere. The package vip offers these, and evaluates interactions using a method called the feature importance ranking measure (FIRM; see Greenwell et al. (2018) for more details). The vip2vivid function we provide in vivid takes VIVI values created in vip and turns them into a vivid matrix that can subsequently be used with our plotting tools. For example, in the code below, model-specific VImp and FIRM VInt scores are calculated for the random forest fit and subsequently arranged into a vivid matrix with the VImps on the diagonal and VInts on the off-diagonal.
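A sketch of such a workflow is shown below, assuming the random forest fit object is called rf; we pass the vip results to vip2vivid positionally, as the exact argument names are documented in the vivid package.

```r
library(vip)

# Model-specific importance scores from vip for the random forest fit.
vimp <- vi(rf, method = "model")

# FIRM-based pairwise interaction scores for all predictors.
vint <- vint(rf, feature_names = setdiff(names(Boston), "medv"))

# Arrange the vip output into a vivid matrix for use with our plots.
viviVip <- vip2vivid(vimp, vint)
```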
The viviHeatmap function constructs a heatmap displaying both importance and interactions, with importance on the diagonal and interactions on the off-diagonals. A vivid matrix is the only required input, and it need not be symmetric (for example, the interaction measures from the randomForestExplainer package are asymmetric and could be visualized using our heatmap). Color palettes for the importance and interactions are optionally provided via palette arguments. For the default color palette we choose single-hue, color-blind friendly sequential palettes from Zeileis et al. (2020), where low and high VIVI values are represented by low and high luminance color values respectively, aiding in highlighting values of interest.
The ordering of the heatmap is taken from the ordering of the input matrix. As reorder was set to FALSE when building the vivid matrix for both the random forest and GBM fits, the ordering of the heatmaps matches the variable order in the dataset. This is useful for directly comparing multiple heatmaps, but it does not necessarily lend itself to easy identification of the largest VIVI values. If we were to seriate both vivi matrices separately, we would end up with different optimal orderings for each matrix. An alternative is to create a common ordering by averaging the two vivid matrix objects and applying the vividReorder function to the result. (This function uses a seriation algorithm based on the techniques of Earle and Hurley (2015), designed to place high-interaction variables adjacently and to pull high-VIVI variables towards the top-left; see Inglis et al. (2022b) for details.) Both VIVI matrices are then re-ordered using the newly obtained variable order.
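This common-ordering step can be sketched as follows, assuming the two matrices are called viviRf and viviGbst and share the same variable names; note that plain matrix subsetting may drop the vivid class attribute, in which case the result should be treated as an ordinary matrix input to the plotting functions.

```r
# Average the two vivid matrices and derive one shared ordering.
viviAvg <- (viviRf + viviGbst) / 2
viviAvgReordered <- vividReorder(viviAvg)

# Apply the common variable order to each matrix.
ord <- rownames(viviAvgReordered)
viviRf   <- viviRf[ord, ord]
viviGbst <- viviGbst[ord, ord]
```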
The arguments impLims and intLims specify the range of importance and interaction values to be mapped to colors. Default values are calculated from the maximum and minimum VIVI values in the vivid matrix. Importance and interaction values falling outside the supplied limits are squished to the closest limit. It can be useful to specify these limits when there is an extremely large VIVI value that dominates the display, or where we wish two or more plots to have the same limits for comparison purposes, as in the example below. The angle argument is used to rotate the x-axis labels.
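For instance, shared limits for the two fits can be set along the following lines; the matrix names, the impLims argument name (chosen by symmetry with intLims), and the limit values here are illustrative.

```r
# Heatmaps for both fits with identical color limits,
# so importance and interaction colors are directly comparable.
viviHeatmap(mat = viviRf,
            impLims = c(0, 80), intLims = c(0, 0.2), angle = 45)
viviHeatmap(mat = viviGbst,
            impLims = c(0, 80), intLims = c(0, 0.2), angle = 45)
```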
Figure 2 shows our improved ordering, with variables having high VIVI values pushed to the top left of the plots. Filtering can also be applied to the input matrix to display a subset of variables. When compared to the GBM fit in (b), the random forest fit in (a) has weaker interactions and lower importance scores. Both plots identify \(lstat\) as the most important variable. Both fits also show that \(lstat\) (the percentage of lower status of the population) interacts with several other variables, though the interactions are much stronger for the GBM. Notably, the strongest interactions in the two fits differ: \(lstat:crim\) (where \(crim\) is the per capita crime rate by town) for the random forest fit and \(lstat:nox\) (where \(nox\) is the nitrogen oxides concentration in parts per 10 million) for the GBM fit.