Random forests are one of the most popular statistical learning algorithms, and a variety of methods for fitting random forests and related recursive partitioning approaches is available in R. This paper points out two important features of the random forest implementation cforest
available in the party package: The resulting forests are unbiased and thus preferable to the randomForest
implementation available in randomForest if predictor variables are of different types. Moreover, a conditional permutation importance measure has recently been added to the party package, which can help evaluate the importance of correlated predictor variables. The rationale of this new measure is illustrated and hands-on advice is given for the usage of recursive partitioning tools in R.
Recursive partitioning methods are amongst the most popular and widely used statistical learning tools for nonparametric regression and classification. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, are being applied successfully in many scientific fields
(see, e.g., Lunetta, Hayward, Segal, and Van Eerdewegh 2004; Strobl, Malley, and Tutz 2009, and the references therein, for applications in genetics and the social sciences).
Thus, it is not surprising that there is a variety of recursive partitioning tools available in R (see http://CRAN.R-project.org/view=MachineLearning for an overview).
The scope of recursive partitioning methods in R ranges from the standard classification and regression trees available in rpart (Therneau, Atkinson, and Ripley 2008) to the reference implementation of random forests (Breiman 2001) available in randomForest (Liaw and Wiener 2002, 2008). Both methods are popular in applied research, and several extensions and refinements have been suggested in the statistical literature in recent years.
One particularly important improvement was the introduction of unbiased tree algorithms, which overcome the major weak spot of the classical approaches available in rpart and randomForest: variable-selection bias.
The term variable-selection bias refers to the fact that in standard tree algorithms variable selection is biased in favor of variables offering many potential cut-points, so that variables with many categories and continuous variables are artificially preferred (see, e.g., Kim and Loh 2001; Shih 2002; Hothorn, Hornik, and Zeileis 2006; Strobl, Boulesteix, and Augustin 2007 for details).
To overcome this weakness of the early tree algorithms, new algorithms
have been developed that do not artificially favor splits in variables
with many categories or continuous variables. In R such an unbiased tree
algorithm is available in the ctree
function for conditional inference
trees in the party package (Hothorn, Hornik, and Zeileis 2006). The package also provides
a random forest implementation cforest
based on unbiased trees, which
enables learning unbiased forests (Strobl, Boulesteix, Zeileis, and Hothorn 2007).
Unbiased variable selection is the key to reliable prediction and interpretability in both individual trees and forests. However, while a single tree’s interpretation is straightforward, in random forests an extra effort is necessary to assess the importance of each predictor in the complex ensemble of trees.
This issue is typically addressed by means of variable-importance measures such as Gini importance and the “mean decrease in accuracy” or “permutation” importance, which are available in randomForest via the importance() function (with type = 2 and type = 1, respectively). Similarly, a permutation-importance measure for cforest is available via varimp() in party.
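As a brief illustration (a sketch that is not part of the original article; the iris data merely serve as an example), both randomForest measures can be extracted from a fitted object:

library("randomForest")
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 100)
importance(rf, type = 1)   # permutation ("mean decrease in accuracy") importance
importance(rf, type = 2)   # Gini importance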
Unfortunately, variable-importance measures in random forests are subject to the same bias in favor of variables with many categories and continuous variables that affects variable selection in single trees, and also to a new source of bias induced by the resampling scheme (Strobl, Boulesteix, Zeileis, and Hothorn 2007). Both problems can be addressed in party to guarantee unbiased variable selection and variable importance for predictor variables of different types.
Even though this refined approach can provide reliable variable-importance measures in many applications, the original permutation importance can be misleading in the case of correlated predictors. Therefore, Strobl, Boulesteix, Kneib, Augustin, and Zeileis (2008) suggested a solution for this problem in the form of a new, conditional permutation-importance measure. Starting from version 0.9-994, this new measure is available in the party package.
The rationale and usage of this new measure is outlined in the following sections and illustrated by means of a toy example.
Permutation importance, which is available in randomForest and party, is based on a random permutation of the predictor variables, as described in more detail below.
The alternative variable-importance measure available in randomForest, Gini importance, is based on the Gini gain criterion employed in most traditional classification tree algorithms. However, the Gini importance has been shown to carry forward the bias of the underlying Gini-gain splitting criterion (see, e.g., Kim and Loh 2001; Hothorn, Hornik, and Zeileis 2006; Strobl, Boulesteix, and Augustin 2007) when predictor variables vary in their number of categories or scale of measurement (Strobl, Boulesteix, Zeileis, and Hothorn 2007). Therefore, it is not recommended in these situations.
Permutation importance, on the other hand, is a reliable measure of
variable importance for uncorrelated predictors when sub-sampling
without replacement — instead of bootstrap sampling — and unbiased
trees are used in the construction of the forest (Strobl, Boulesteix, Zeileis, and Hothorn 2007).
Accordingly, the default settings of the control parameters in cforest_control have been pre-defined to the default version cforest_unbiased, so that sub-sampling without replacement and unbiased individual trees are guaranteed when fitting random forests with the party package.
The rationale of the original permutation-importance measure is the following: By randomly permuting the predictor variable \(X_j\), its original association with the response \(Y\) is broken. When the permuted variable \(X_j\), together with the remaining non-permuted predictor variables, is used to predict the response for the out-of-bag observations, the prediction accuracy (i.e., the proportion of correctly classified observations in classification) decreases substantially if the original variable \(X_j\) was associated with the response; in regression, the mean squared error increases accordingly. Thus, Breiman (2001) suggests the difference in prediction accuracy before and after permuting \(X_j\), averaged over all trees, as a measure of variable importance.
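The following simplified sketch (not the exact algorithm used inside the packages; it uses a hold-out set instead of the per-tree out-of-bag observations, and iris is only an example data set) illustrates this rationale:

## Permute one predictor in a hold-out set and record the drop in accuracy;
## the actual measure permutes within each tree's out-of-bag observations
## and averages the accuracy differences over all trees.
library("randomForest")
set.seed(1)
train <- sample(nrow(iris), 100)
rf <- randomForest(Species ~ ., data = iris[train, ])
test <- iris[-train, ]
acc <- mean(predict(rf, test) == test$Species)
test.perm <- test
test.perm$Petal.Width <- sample(test.perm$Petal.Width)   # break the association with the response
acc.perm <- mean(predict(rf, test.perm) == test.perm$Species)
acc - acc.perm   # a large difference indicates that Petal.Width is important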
In standard implementations of random forests, such as randomForest in R, an additional scaled version of the permutation importance (often called the \(z\)-score), computed by dividing the raw importance by its standard error, is provided (for example by importance(obj, type = 1, scale = TRUE) in randomForest). Note, however, that the results of Strobl and Zeileis (2008) show that the \(z\)-score is not suitable for significance tests and that the raw importance has better statistical properties.
The original permutation-importance measure can, for reasons outlined below, be considered as a marginal measure of importance. In this sense, it has the same property as, e.g., a marginal correlation coefficient: A variable that has no effect of its own, but is correlated with a relevant predictor variable, can receive a high importance score. Accordingly, empirical results (see Strobl, Boulesteix, Kneib, Augustin, and Zeileis 2008 and the references therein) suggest that the original permutation-importance measure often assigns higher scores to correlated predictors.
In contrast, partial correlation coefficients (like the coefficients in linear regression models) measure the importance of a variable given the other predictor variables in the model. The advantage of such a partial, or conditional, approach is illustrated by means of a toy example:
The data set readingSkills
is an artificial data set generated by
means of a linear model. The response variable contains hypothetical
scores on a test of reading skills for 200 school children. Potential
predictor variables in the data set are the age of the child, whether
the child is a native speaker of the test language and the shoe size of
the child.
Obviously, the latter is not a sensible predictor of reading skills (and was actually simulated not to have any effect on the response) — but with respect to marginal (as opposed to partial) correlations, shoe size is highly correlated with the test score. Of course this spurious correlation is only due to the fact that both shoe size and test score are associated with the underlying variable age.
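A quick look at the marginal correlations (a sketch, assuming the readingSkills data as shipped with party) illustrates this:

## Shoe size appears strongly related to the test score only because both
## depend on age.
library("party")
data("readingSkills", package = "party")
cor(readingSkills$shoeSize, readingSkills$score)   # high marginal correlation
cor(readingSkills$shoeSize, readingSkills$age)     # ... driven by age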
In this simple problem, a linear model would be perfectly capable of
identifying the original coefficients of the predictor variables
(including the fact that shoe size has no effect on reading skills once
the truly relevant predictor variable age is included in the model).
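As a point of reference (a minimal sketch, not taken from the original article, assuming readingSkills has been loaded as above), such a model could be fit as follows:

## In the linear model, shoe size should receive a coefficient close to zero
## once age and native speaker are included.
summary(lm(score ~ nativeSpeaker + age + shoeSize, data = readingSkills))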
However, the cforest permutation-importance measure is misled by the spurious correlation and assigns a rather high importance value to the nonsense variable shoe size:
> library("party")
> set.seed(42)
> readingSkills.cf <- cforest(score ~ .,
+ data = readingSkills, control =
+ cforest_unbiased(mtry = 2, ntree = 50))
> set.seed(42)
> varimp(readingSkills.cf)
nativeSpeaker           age      shoeSize 
     12.62036      74.52034      17.97287 
The reason for this odd behavior can be found in the way the predictor variables are permuted in the computation of the importance measure: Strobl, Boulesteix, Kneib, Augustin, and Zeileis (2008) show that the original approach, where one predictor variable \(X_j\) is permuted against both the response \(Y\) and the remaining (one or more) predictor variables \(Z=X_{1}, \dots, X_{j-1},X_{j+1}, \dots, X_{p}\), as illustrated in Figure 1, corresponds to a pattern of independence between \(X_j\) and both \(Y\) and \(Z\).
From a theoretical point of view, this means that a high value of the importance measure can be caused by a violation either of the independence between \(X_j\) and \(Y\) or of the independence between \(X_j\) and \(Z\), even though the latter is not of interest here.
For practical applications, this means that a variable \(X_j\) that is correlated with an important predictor \(Z\) can appear more important than an uncorrelated variable, even if \(X_j\) has no effect of its own.
\(Y\) | \(X_j\) | \(Z\) |
---|---|---|
\(y_1\) | \(x_{\pi_j(1),j}\) | \(z_1\) |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
\(y_i\) | \(x_{\pi_j(i),j}\) | \(z_i\) |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
\(y_n\) | \(x_{\pi_j(n),j}\) | \(z_n\) |

Figure 1: Permutation scheme of the original (unconditional) permutation importance: \(X_j\) is permuted against both \(Y\) and \(Z\).
\(Y\) | \(X_j\) | \(Z\) |
---|---|---|
\(y_1\) | \(x_{\pi_{j \mid Z=a}(1),j}\) | \(z_1 = a\) |
\(y_3\) | \(x_{\pi_{j \mid Z=a}(3),j}\) | \(z_3 = a\) |
\(y_{27}\) | \(x_{\pi_{j \mid Z=a}(27),j}\) | \(z_{27} = a\) |
\(y_6\) | \(x_{\pi_{j \mid Z=b}(6),j}\) | \(z_6 = b\) |
\(y_{14}\) | \(x_{\pi_{j \mid Z=b}(14),j}\) | \(z_{14} = b\) |
\(y_{21}\) | \(x_{\pi_{j \mid Z=b}(21),j}\) | \(z_{21} = b\) |
\(\vdots\) | \(\vdots\) | \(\vdots\) |

Figure 2: Conditional permutation scheme: \(X_j\) is permuted only within groups of observations with \(Z = z\) (here within groups defined by \(Z = a\) and \(Z = b\)).
The aim to reflect only the impact of \(X_j\) itself in predicting the response \(Y\), rather than its correlations with other predictor variables, can be better achieved by means of a conditional importance measure in the spirit of a partial correlation: We want to measure the association between \(X_j\) and \(Y\) given the correlation structure between \(X_j\) and the other predictor variables in the data set.
To meet this aim, Strobl, Boulesteix, Kneib, Augustin, and Zeileis (2008) suggest a conditional permutation scheme, where \(X_j\) is permuted only within groups of observations with \(Z=z\) in order to preserve the correlation structure between \(X_j\) and the other predictor variables, as illustrated in Figure 2.
With this new, conditional permutation scheme, the importance measure reveals that the association between shoe size and reading skills is spurious:
> set.seed(42)
> varimp(readingSkills.cf, conditional =
+ TRUE)
nativeSpeaker           age      shoeSize 
    11.161887     44.388450      1.087162 
Only by means of the conditional importance does it become clear that the covariate native speaker is actually more relevant for predicting the test score than shoe size, whose conditional effect is negligible.
Thus, the conditional importance mimics the behavior of partial correlations or linear regression coefficients, with whose interpretation many readers may be more familiar. However, whether a marginal or a conditional importance measure is to be preferred depends on the actual research question.
Current research also investigates the impact that the tuning parameter mtry, which regulates the number of randomly preselected predictor variables that can be chosen in each split (cf. Strobl, Boulesteix, Kneib, Augustin, and Zeileis 2008), and the parameters regulating the depth of the trees have on variable importance.
Conditioning is straightforward whenever the variables to be conditioned on, \(Z\), are categorical (cf., e.g., Nason, Emerson, and Leblanc 2004). However, conditioning on continuous variables, which may entail as many different values as observations in the sample, would produce cells with very sparse counts, which would make permuting the values of \(X_j\) within each cell rather pointless. Thus, in order to create cells of reasonable size for conditioning, continuous variables need to be discretized.
As a straightforward discretization strategy for random forests, Strobl, Boulesteix, Kneib, Augustin, and Zeileis (2008) suggest defining the conditioning grid by means of the partition of the feature space induced by each individual tree. This grid can be used to conditionally permute the values of \(X_j\) within cells defined by combinations of \(Z\), where \(Z\) can contain potentially large sets of covariates of different scales of measurement.
The main advantages of this approach are that this partition has already
been learned from the data during model fitting, that it can contain
splits in categorical, ordered and continuous predictor variables, and
that it can thus serve as an internally available means for discretizing
the feature space. For ease of computation, the conditioning grid
employed in varimp
uses all cut-points as bisectors of the sample
space (the same approach is followed by Nason, Emerson, and Leblanc 2004).
The set of variables \(Z\) to be conditioned on should contain all
variables that are correlated with the current variable of interest
\(X_j\). In the varimp
function, this is assured by the small default
value 0.2 of the threshold
argument: By default, all variables whose
correlation with \(X_j\) meets the condition
\(1 - p\mathrm{-value} > 0.2\) are used for conditioning. A larger value
of threshold
would have the effect that only those variables that are
strongly correlated with \(X_j\) would be used for conditioning, but would
also lower the computational burden.
Note that the same permutation tests that are used for split selection in the tree building process (Hothorn, Hornik, and Zeileis 2006) are used here to measure the association between \(X_j\) and the remaining covariates.
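For illustration (a sketch building on the readingSkills.cf forest fitted above; the threshold value is arbitrary), a larger threshold restricts conditioning to strongly associated covariates:

## Only covariates whose association with X_j satisfies 1 - p-value > 0.9
## are used for conditioning; this is cheaper to compute than the default 0.2.
set.seed(42)
varimp(readingSkills.cf, conditional = TRUE, threshold = 0.9)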
To conclude, we would like to summarize the application of conditional variable importance and general issues in fitting random forests with R. Depending on certain characteristics of your data set, we suggest the following approaches (collected in a short code sketch after the list):
If all predictor variables are of the same type (for example: all
continuous or all unordered categorical with the same number of
categories), use either randomForest
(randomForest) or cforest
(party). While randomForest
is computationally faster, cforest
is safe even for variables of different types.
For predictor variables of the same type, Gini importance,
importance(obj, type = 2)
, or permutation importance,
importance(obj, type = 1)
, available for randomForest
, and
permutation importance, varimp(obj)
, available for cforest
, are
all adequate importance measures.
If the predictor variables are of different types (for example:
different scales of measurement, different numbers of categories),
use cforest
(party) with the default option
controls = cforest_unbiased
and permutation importance
varimp(obj)
.
If the predictor variables are correlated, depending on your
research question, conditional importance, available via
varimp(obj, conditional = TRUE)
for cforest
(party), can add
to the understanding of your data.
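These suggestions are collected in the following short sketch (not taken from the original article; the iris data merely stand in for a data set in which all predictors are of the same type, and readingSkills again serves as an example with predictors of different types):

library("randomForest")
library("party")

## All predictors of the same type: randomForest or cforest are both adequate.
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(rf, type = 1)         # permutation importance
importance(rf, type = 2)         # Gini importance

## Predictors of different types: use cforest with cforest_unbiased.
set.seed(1)
cf <- cforest(score ~ ., data = readingSkills,
              controls = cforest_unbiased(mtry = 2, ntree = 50))
varimp(cf)                       # (marginal) permutation importance

## Correlated predictors: conditional importance can add insight.
varimp(cf, conditional = TRUE)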
General remarks:
Note that the default settings for mtry
differ in randomForest
and cforest
: In randomForest
the default setting for
classification, e.g., is floor(sqrt(ncol(x)))
, while in cforest
it is fixed to the value 5 for technical reasons.
Always check whether you get the same results with a different random seed before interpreting the variable importance ranking!
If the ranking of even the top-scoring predictor variables depends on the choice of the random seed, increase the number of trees (argument ntree in randomForest and cforest_control), as in the sketch below.
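A minimal stability check along these lines (a sketch; the seeds and the number of trees are arbitrary) could look as follows:

## Refit the forest with two different random seeds and a larger number of
## trees; the ranking of the top-scoring variables should agree between fits.
library("party")
set.seed(290875)
cf1 <- cforest(score ~ ., data = readingSkills,
               controls = cforest_unbiased(mtry = 2, ntree = 500))
set.seed(111)
cf2 <- cforest(score ~ ., data = readingSkills,
               controls = cforest_unbiased(mtry = 2, ntree = 500))
sort(varimp(cf1), decreasing = TRUE)
sort(varimp(cf2), decreasing = TRUE)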