RFpredInterval: An R Package for Prediction Intervals with Random Forests and Boosted Forests

Like many predictive models, random forests provide point predictions for new observations. Besides the point prediction, it is important to quantify the uncertainty in the prediction. Prediction intervals provide information about the reliability of the point predictions. We have developed a comprehensive R package, RFpredInterval, that integrates 16 methods to build prediction intervals with random forests and boosted forests. The set of methods implemented in the package includes a new method to build prediction intervals with boosted forests (PIBF) and 15 method variations to produce prediction intervals with random forests, as proposed by Roy and Larocque (2020). We perform an extensive simulation study and apply real data analyses to compare the performance of the proposed method to ten existing methods for building prediction intervals with random forests. The results show that the proposed method is very competitive and, globally, outperforms competing methods.


Introduction
Predictive modelling is the general concept of building a model that describes how a group of covariates can be used to predict a response variable. The objective is to predict the unknown responses of observations given their covariates. For example, predictive models could be used to predict the sale price of houses given house characteristics (De Cock, 2011). In its simplest form, a predictive model aims to provide a point prediction for a new observation. However, a point prediction does not contain information about its precision, that is, how close to the true response we can expect the prediction to be, which is often important in a decision-making context. Hence, although the point prediction is often the main goal of predictive analysis, assessing its reliability is equally important, and this can be achieved with a prediction interval (PI). A PI contains a set of likely values for the true response with an associated level of confidence, usually 90% or 95%. Given that shorter PIs are more informative, developing predictive models that can produce short PIs along with the point predictions is crucial in assessing and quantifying the prediction error. In real-world applications, knowing the prediction error alongside the point prediction increases the practical value of the prediction.
Regression analysis is a form of predictive modelling that examines the relationship between a response variable and a group of covariates. In this paper, we consider the general regression model

Y = g(X) + ϵ,   (1)

where Y is a univariate continuous response variable, X is a p-dimensional vector of predictors, and ϵ is an error term. We assume g(.) is an unknown smooth function from ℜ^p to ℜ and E[Y | X = x] = g(x).
A confidence interval for the prediction is a range likely to contain the true conditional mean of the response variable. A prediction interval for a new observation, in contrast, is wider than the corresponding confidence interval and provides a range likely to contain the response value of this new observation.
In the past decade, random forests have increased in popularity and provide an efficient way to generate point predictions for model (1). A random forest is an ensemble method composed of many decision trees, which can be described with a simple algorithm (Breiman, 2001). For each tree b = 1, ..., B, a bootstrap sample of observations is drawn and a fully grown tree is built such that a set of predictors is randomly selected at each node and the best split is selected among all possible splits with those predictors only. The random forest prediction for a new observation is the average of the B tree predictions, ŷ_new = (1/B) Σ_{b=1}^{B} ŷ_new^b, where ŷ_new^b is the tree prediction for the new observation in the bth tree, i.e., the average of the observations in the terminal node corresponding to the new observation. Besides this traditional description, the modern view also considers random forests as data-driven weight generators (Hothorn et al., 2004; Lin and Jeon, 2006; Moradian et al., 2017, 2019; Athey et al., 2019; Roy and Larocque, 2020; Tabib and Larocque, 2020; Alakuş et al., 2021).
The R Journal Vol.14/1, March 2022 ISSN 2073-4859

Although random forests limit over-fitting by combining many trees, which reduces the variance of the estimator, final predictions can be biased (Mentch and Hooker, 2016; Wager and Athey, 2018). Since each tree is built under the same random process, all trees focus on the same part of the response signal, usually the strongest. Therefore, some parts of the response signal may be left untargeted, which can result in biased point predictions. Wager and Athey (2018) provide bounds on the extent of the bias of random forests under some assumptions about the tree growing process. Following their work, Ghosal and Hooker (2021) proposed a bias correction method in a regression framework, called one-step boosted forest, which builds on ideas introduced in Breiman (2001) and Zhang and Lu (2012). The main idea of the method is to sum the predictions of two random forests, where the first is a regression forest fitted on the target data set and the second is fitted on the out-of-bag residuals of the former. Empirical studies show that this method provides a significant bias reduction when compared to a simple random forest.
The current paper proposes an R package providing, among other features, an extension of the one-step boosted forest method described above (Ghosal and Hooker, 2021). The literature on prediction intervals for random forests consists mostly of recent studies. The first method is the Quantile Regression Forests (QRF) method proposed by Meinshausen (2006). The aim of QRF is to estimate conditional quantiles of the response variable, instead of conditional means, using an estimated cumulative distribution function obtained with the nearest neighbour forest weights introduced by Hothorn et al. (2004). Prediction intervals can be built directly from the estimated conditional quantiles. The method is implemented in the CRAN package quantregForest (Meinshausen, 2017).
In a more recent study, Athey et al. (2019) proposed Generalized Random Forests (GRF), a very general framework to estimate any quantity identified by local moment equations, such as conditional means, quantiles or average partial effects. Trees are grown with splitting rules designed to maximize heterogeneity with respect to the quantity of interest. Quantile regression forest is one of the applications of GRF. Similar to QRF, the GRF method uses the neighbourhood information from different trees to compute a weighted set of neighbours for each test point. Unlike QRF, which grows trees with the least-squares criterion, GRF uses a splitting rule designed to capture heterogeneity in conditional quantiles. An implementation of quantile regression forest with GRF is available in the function quantile_forest of the CRAN package grf (Tibshirani et al., 2021). Vovk et al. (2005, 2009) introduced a general distribution-free conformal prediction interval framework. Any predictive model, including random forests, can be used within the proposed methodology. The idea is to fit the model on an augmented data set that includes the new observation to be predicted, and to apply a set of hypothesis tests to provide an error bound around the point prediction for the new observation. Although this method does not require any distributional assumptions, it is computationally intensive. Lei et al. (2018) proposed a variant of this method, called Split Conformal (SC) prediction, which splits the data into two subsets, one to fit the model and one to compute the quantiles of the residual distribution. We note that, while the original full conformal prediction interval framework produces shorter intervals, SC is computationally more efficient. The R package conformalInference (Tibshirani, 2019), available on GitHub, implements this method.
Roy and Larocque (2020) proposed 20 distinct method variations to improve the performance of prediction intervals with random forests. These approaches differ according to 1) the method used to build the forest and 2) the method used to build the prediction interval. Four methods can be used to build the forest: three from the classification and regression tree (CART) paradigm (Breiman et al., 1984) and the transformation forest method (TRF) proposed by Hothorn and Zeileis (2021). Within the CART paradigm, in addition to the default least-squares (LS) splitting criterion, two alternative splitting criteria, L1 and shortest prediction interval (SPI), are considered. Prediction intervals are built using the Bag of Observations for Prediction (BOP), which is the set of nearest neighbour observations previously used in Moradian et al. (2017, 2019). In addition to the type of forest chosen, there are five methods to build prediction intervals: the classical method (LM), the quantile method (Quant), the shortest prediction interval (SPI), the highest density region (HDR), and the contiguous HDR (CHDR). LM is computed from an intercept-only linear model using the BOP as the sample, and produces a symmetric PI around the point prediction. The quantile method, similar to the QRF method, is based on the quantiles of the BOP. SPI corresponds to the shortest interval among the intervals that contain at least (1 − α)100% of the observations in the BOP. As an alternative to SPI, HDR is the smallest region in the BOP with the desired coverage (1 − α). Note that HDR is not necessarily a single interval; if the distribution is multimodal, it can be formed by multiple intervals. Finally, CHDR is a way to obtain a single prediction interval from the HDR intervals by building an interval with the minimum and maximum bounds of the HDR intervals.
Zhang et al. (2020) proposed a forest-based prediction interval method, called Out-of-Bag (OOB) prediction intervals, which estimates prediction intervals using the empirical quantiles of the out-of-bag prediction errors. The method assumes that prediction errors are identically distributed and that their distribution can be well approximated by the out-of-bag prediction errors obtained from all training observations. The resulting prediction intervals have the same width for all test observations. The method is implemented in the CRAN package rfinterval (Zhang, 2019). Lu and Hardin (2021) proposed a very interesting and useful method to estimate the conditional prediction error distribution of a random forest. The main idea is to use a random forest to compute the out-of-bag residuals for the training observations and to form a set of out-of-bag neighbours for each test point. Then, the conditional prediction error distribution for each test point is estimated with the out-of-bag residuals in the neighbourhood. Estimating the prediction error distribution enables the estimation of conditional quantiles, conditional biases and conditional mean squared prediction errors. The prediction interval for a test point x, denoted PI_α(x), is formed by adding the α/2 and 1 − α/2 quantiles of the conditional prediction error distribution to the random forest point prediction. The estimators are implemented in the CRAN package forestError (Lu and Hardin, 2020).
Note that conformal inference, the OOB approach of Zhang et al. (2020) and the PI_α(x) method of Lu and Hardin (2021) all use prediction errors to build the prediction intervals. Instead of using the training responses directly to estimate quantiles, using prediction errors provides better predictive power. However, unlike conformal inference and the OOB approach, the PI_α(x) method uses the nearest neighbour observations to estimate the prediction error distribution. This idea is very similar to the BOP idea (Roy and Larocque, 2020), but instead of using in-bag observations, Lu and Hardin (2021) use out-of-bag observations to form the neighbourhoods. This approach allows local information for the test observations to be extracted.
In this paper, we introduce the R package RFpredInterval (Alakuş et al., 2022), which implements 16 methods to build prediction intervals with random forests and boosted forests. The set of methods implemented in the package includes a new method to build prediction intervals with boosted forests, and 15 method variations (three splitting rules within the CART paradigm, namely LS, L1 and SPI, combined with five methods to build prediction intervals, namely LM, SPI, Quant, HDR and CHDR) proposed by Roy and Larocque (2020). These 15 methods were thoroughly investigated through simulation studies and real data sets in Roy and Larocque (2020); however, they were not easily available for use. One of the main contributions of our package is the implementation of these competitive methods and the ability for users to compare various prediction interval methods within the same package. The other main contribution of our paper is a new method to build prediction intervals. Contrary to the 15 methods proposed by Roy and Larocque (2020), the newly introduced method has not been tested before, which is why we place greater emphasis on investigating the new method through extensive simulation studies and real data. For performance comparison purposes, we compared the new method to 10 existing methods:
• 5 of the 15 implemented method variations of Roy and Larocque (2020); see the Competing methods subsection for details.
• 5 existing methods from the literature described above: QRF, GRF, SC, OOB, and PI_α(x).
The newly proposed method to build Prediction Intervals with Boosted Forests is called PIBF. This approach integrates the idea of using the nearest neighbour out-of-bag observations to estimate the conditional prediction error distribution, presented in Lu and Hardin (2021), into the one-step boosted forest proposed by Ghosal and Hooker (2021). We will show in this paper that PIBF significantly improves the performance of prediction intervals with random forests when compared with 10 existing methods using a variety of simulated and real benchmark data sets.
The rest of the paper is organized as follows. In the next section, we describe the algorithm implemented in PIBF. We then present the details of the package and provide a practical and reproducible example. We also perform a simulation study to compare the performance of our proposed method to existing competing methods, and we investigate the performance of the proposed method with real data sets. Lastly, we conclude with a discussion of the results.

Method and implementation
The proposed method is based on the one-step boosted forest method of Ghosal and Hooker (2021). It consists in fitting two regression random forests: the first is fitted to get point predictions and out-of-bag (OOB) residuals using the given data set, whereas the second is fitted to predict those residuals using the original covariates. As empirical studies demonstrate, the one-step boosted forest provides point predictions with reduced bias compared to the simple random forest. Although random forests were originally described with bootstrap samples, Ghosal and Hooker (2021) use subsampling for their theoretical investigations and obtain notable performance improvements. They have also investigated the effect of using bootstrapping with the one-step boosted forest on the bias estimations. From the results presented in their appendix, the use of bootstrapping yields the best performance and reduces the bias the most, in exchange for an increase in their proposed variance estimator, which is defined under asymptotic normality. In this paper, we use bootstrapping for the one-step boosted forest method, following these better results on bias reduction. The final prediction for a new observation x_new is the sum of the predictions from the two random forests,

ŷ*_new = ŷ_new + ε̂_new,   (2)

where ŷ_new is the point prediction obtained from the first random forest and ε̂_new is the bias estimation from the second forest.
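As a minimal sketch of this two-forest construction, the following code fits a first forest with the ranger package, computes its OOB residuals, fits a second forest on those residuals, and sums the two predictions. The toy data set and hyperparameters are illustrative and not taken from the paper.

```r
library(ranger)

# Toy data set (illustrative only)
set.seed(1)
train <- data.frame(x = runif(300))
train$y <- sin(2 * pi * train$x) + rnorm(300, sd = 0.3)

# First forest: for regression, $predictions holds the OOB predictions
rf1 <- ranger(y ~ x, data = train, num.trees = 500)
oob_resid <- train$y - rf1$predictions

# Second forest: predict the OOB residuals from the same covariates
rf2 <- ranger(y ~ x, data = data.frame(x = train$x, y = oob_resid),
              num.trees = 500)

# Bias-corrected prediction: sum of the two forests' predictions
newdata <- data.frame(x = c(0.25, 0.75))
y_hat  <- predict(rf1, data = newdata)$predictions
e_hat  <- predict(rf2, data = newdata)$predictions
y_star <- y_hat + e_hat
```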
Besides bias correction, we use the second random forest as a way to construct a prediction interval, by finding the nearest neighbour observations that are close to the one we want to predict. The idea of finding the nearest neighbour observations, a concept very similar to the 'nearest neighbour forest weights' (Hothorn et al., 2004; Lin and Jeon, 2006), was introduced in Moradian et al. (2017) and later used in Moradian et al. (2019), Roy and Larocque (2020), Tabib and Larocque (2020) and Alakuş et al. (2021). For a new observation, the set of in-bag training observations that are in the same terminal nodes as the new observation forms the set of nearest neighbour observations. Roy and Larocque (2020) called this set of observations the Bag of Observations for Prediction (BOP). We can define the BOP for a new observation x_new as

BOP(x_new) = ∪_{b=1}^{B} I_b(x_new),   (3)

where I_b(x_new) is the set of in-bag training observations, i.e., observations in the bootstrap sample, that are in the same terminal node as x_new in the bth tree. I_b(.) consists only of the training observations that are in the bootstrap sample of the bth tree.
Instead of forming the set of nearest neighbour observations with the in-bag training observations, we can use the out-of-bag observations, which are not in the bootstrap sample, as in Lu and Hardin (2021). We can define the out-of-bag equivalent of the BOP (3) for a new observation x_new as

BOP*(x_new) = ∪_{b=1}^{B} O_b(x_new),   (4)

where O_b(x_new) is the set of out-of-bag observations that are in the same terminal node as x_new in the bth tree. O_b(.) consists of the training observations that are not in the bootstrap sample of the bth tree.
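The out-of-bag BOPs can be assembled from any forest implementation that exposes terminal node ids and in-bag counts. A sketch with ranger follows; the toy data and settings are illustrative, and this is our own reconstruction of the idea, not the package's internal code.

```r
library(ranger)

# Toy data set (illustrative only)
set.seed(1)
train <- data.frame(x = runif(200))
train$y <- train$x^2 + rnorm(200, sd = 0.1)

rf <- ranger(y ~ x, data = train, num.trees = 100, keep.inbag = TRUE)

# Terminal node ids of the training data and of one new observation, per tree
nodes_train <- predict(rf, data = train, type = "terminalNodes")$predictions
nodes_new   <- predict(rf, data = data.frame(x = 0.5),
                       type = "terminalNodes")$predictions
inbag <- simplify2array(rf$inbag.counts)   # n x B matrix of in-bag counts

# BOP*(x_new): out-of-bag training observations sharing x_new's terminal node
bop_star <- unlist(lapply(seq_len(rf$num.trees), function(b) {
  which(inbag[, b] == 0 & nodes_train[, b] == nodes_new[1, b])
}))
```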
Out-of-bag observations are not used in the tree growing process. Thus, for the trees where the training observations are out-of-bag, they are like unobserved test observations for those trees. The only difference is that, for a new observation, we use all the trees in the forest, whereas for an out-of-bag observation we have only a subset of the forest's trees. By using the out-of-bag equivalent of the BOP for a new observation, we can make use of the analogy between out-of-bag observations and test observations. The out-of-bag neighbours of a new observation represent the new observation better than the in-bag neighbours.
Any desired measure can be obtained from the constructed BOPs. In this paper, we use the BOP idea to build a prediction interval for a test observation. For a new observation with covariates x_new, we first form BOP*(x_new) (4) using the out-of-bag neighbours. Then, as proposed by Lu and Hardin (2021), we estimate the conditional prediction error distribution, F̂(x_new), but now with the bias-corrected out-of-bag residuals of the observations in BOP*(x_new). Lastly, we build a prediction interval for the new observation as

PI_α(x_new) = [ ŷ*_new + SPI_α^l(F̂(x_new)),  ŷ*_new + SPI_α^u(F̂(x_new)) ],   (5)

where ŷ*_new is the bias-corrected prediction, and SPI_α^l(F̂(x_new)) and SPI_α^u(F̂(x_new)) are the lower and upper bounds of SPI_α(F̂(x_new)), the shortest interval that contains at least (1 − α)100% of the residuals in BOP*(x_new). By using the bias-corrected residuals to form the prediction error distribution and picking the shortest interval among the qualified intervals, we can expect narrower prediction intervals.
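The shortest-interval computation itself is simple order-statistics arithmetic. The base R helper below is our own illustrative implementation, not the package's internal code: after sorting, every window of k consecutive points that covers at least (1 − α)100% of the sample is a candidate, and the narrowest one wins.

```r
# Shortest interval containing at least (1 - alpha) * 100% of a sample
shortest_pi <- function(x, alpha = 0.05) {
  x <- sort(x)
  n <- length(x)
  k <- ceiling((1 - alpha) * n)          # number of points to cover
  widths <- x[k:n] - x[1:(n - k + 1)]    # width of each candidate window
  i <- which.min(widths)
  c(lower = x[i], upper = x[i + k - 1])
}
```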
We can summarize the steps of the proposed method as follows:
1. Train the first regression RF with covariates X to predict the response variable Y, and get the OOB predictions, Ŷ_oob.
2. Compute the OOB residuals, ε̂ = Y − Ŷ_oob.
3. Train the second regression RF with covariates X to predict the OOB residuals ε̂, and get the OOB bias estimations.
4. Compute the bias-corrected OOB predictions, Ŷ*_oob, by adding the OOB bias estimations to Ŷ_oob.
5. Compute the bias-corrected OOB residuals, ε̂* = Y − Ŷ*_oob.
6. For a new observation x_new, compute the bias-corrected prediction ŷ*_new by summing the predictions of the two forests.
7. Form a BOP for x_new with the OOB neighbours using the second RF, BOP*(x_new) (4), and estimate the conditional prediction error distribution for x_new with the bias-corrected residuals ε̂* of the observations in BOP*(x_new).
8. Build the prediction interval for x_new with the shortest interval of the estimated distribution, as described above.

Calibration
The principal goal of any prediction interval method is to ensure the desired coverage level. In order to attain the desired coverage level (1 − α), we may need a calibration procedure. The goal of the calibration is to find the value of α_w, called the working level in Roy and Larocque (2020), such that the coverage level of the PIs for the training observations is closest to the desired coverage level. Roy and Larocque (2020) presented a calibration procedure that uses BOPs built using only the trees where the training observation x_i is OOB. The idea is to find the value of α_w using these OOB-BOPs.
In this paper, we call this procedure OOB calibration.
We also include a cross-validation-based calibration procedure with the proposed method to attain the desired (1 − α) coverage level. In this calibration, we apply k-fold cross-validation to form prediction intervals for the training observations. In each fold, we split the original training data set into training and testing sets. For the training set, we go through steps 1-5 defined above. Then, for each observation in the testing set, we apply steps 6-8 and build a PI. After completing the cross-validation, we compute the coverage level with the constructed PIs, and if the coverage is not within the acceptable coverage range, we apply a grid search to find the α_w whose coverage level is closest to the target (1 − α) among the set of candidate α_w's. Once we find α_w, we use this level to build the PIs for the new observations.
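The grid-search step can be sketched in base R. Here coverage_fun stands in for the expensive cross-validated coverage computation, which we abstract away as a user-supplied function; the grid bounds and step are illustrative assumptions, not the package defaults.

```r
# Grid search for the working level alpha_w: return alpha unchanged if its
# cross-validated coverage already falls in the acceptable range, otherwise
# pick the grid value whose coverage is closest to the target (1 - alpha).
calibrate_alpha <- function(coverage_fun, alpha = 0.05,
                            range_ok = c(0.945, 0.955),
                            grid = seq(0.001, 0.10, by = 0.001)) {
  cov0 <- coverage_fun(alpha)
  if (cov0 >= range_ok[1] && cov0 <= range_ok[2]) return(alpha)
  covs <- vapply(grid, coverage_fun, numeric(1))
  grid[which.min(abs(covs - (1 - alpha)))]
}
```

For instance, with an idealized coverage_fun of 1 − α the nominal level is kept, while an under-covering procedure (say, coverage 1 − 1.5α) is calibrated to a smaller working level.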

The RFpredInterval package
In our package, we implement 16 methods that apply random forest training. Ten of these methods have specialized splitting rules in the random forest growing process. These methods are the ones with the L1 and shortest prediction interval (SPI) splitting rules proposed by Roy and Larocque (2020). To implement these methods, we have utilised the custom split feature of the randomForestSRC package (Ishwaran and Kogalur, 2021).
The randomForestSRC package allows users to define a custom splitting rule for the tree growing process. The user needs to define the customized splitting rule in the splitCustom.c file in C. After modifying the splitCustom.c file, all C source code files in the package's src folder must be recompiled. Finally, the package must be re-installed for the custom split rule to become available.
In our package development process, we froze the version of randomForestSRC to the latest one available at the time, version 2.11.0, to apply the specialized splitting rules. After defining the L1 and SPI splitting rules, all C files were re-compiled. Finally, all package files, including our R files for the prediction interval methods, were re-built to make the package ready for user installation.
The RFpredInterval package has two main R functions:
• pibf(): Constructs prediction intervals with the proposed method, PIBF.
• rfpi(): Constructs prediction intervals with the method variations proposed by Roy and Larocque (2020).
Table 1 presents the list of functions and methods implemented in RFpredInterval. For pibf(), RFpredInterval uses the CRAN package ranger (Wright et al., 2020) to fit the random forests. For rfpi(), RFpredInterval uses the randomForestSRC package. For the least-squares splitting rule, both the randomForestSRC and ranger packages are applicable.
rfpi()

LS split rule, PI methods LM, SPI, Quant, HDR, CHDR
The splitting rule is the least-squares (LS) criterion from the CART paradigm. "ls" is used for the "split_method" argument. A vector of characters c("lm","spi","quant","hdr","chdr") is used for the "pi_method" argument to apply all or a subset of the PI methods. ranger or randomForestSRC can be used to fit the random forest. Returns a list of constructed PIs for the selected PI methods and point predictions for the test data.

L1 split rule, PI methods LM, SPI, Quant, HDR, CHDR
The splitting rule is the L1 criterion from the CART paradigm (Roy and Larocque, 2020). "l1" is used for the "split_method" argument. A vector of characters c("lm","spi","quant","hdr","chdr") is used for the "pi_method" argument to apply all or a subset of the PI methods. Only randomForestSRC can be used to fit the random forest, since the split rule is implemented with the custom split feature of that package. Returns a list of constructed PIs for the selected PI methods and point predictions for the test data.

SPI split rule, PI methods LM, SPI, Quant, HDR, CHDR
The splitting rule is the shortest PI (SPI) criterion from the CART paradigm (Roy and Larocque, 2020). "spi" is used for the "split_method" argument. A vector of characters c("lm","spi","quant","hdr","chdr") is used for the "pi_method" argument to apply all or a subset of the PI methods. Only randomForestSRC can be used to fit the random forest, since the split rule is implemented with the custom split feature of that package. Returns a list of constructed PIs for the selected PI methods and point predictions for the test data.

piall()
All methods. Builds prediction intervals with all of the implemented PI methods. The ranger package is used to fit the random forests for PIBF and the methods with the LS split rule. The randomForestSRC package is used for the methods with the L1 and SPI split rules. Default values are assigned to the function arguments of pibf() and rfpi(). Returns an object of class "piall" containing a list of constructed PIs with the 16 methods, and point predictions obtained with the PIBF method and the LS, L1 and SPI split rules for the test data.

plot()
Plots the 16 constructed PIs obtained with piall() function for a test observation.

print()
Prints the summary output of the pibf(), rfpi() and piall() functions.

LM: Classical method, SPI: Shortest PI, Quant: Quantiles, HDR: Highest density region, CHDR: Contiguous HDR

In this section, we illustrate the usage of the RFpredInterval package with the Ames Housing data set (De Cock, 2011). The data set was introduced as a modern alternative to the well-known Boston Housing data set. It contains many explanatory variables on the quality and quantity of physical attributes of houses in Ames, Iowa, sold from 2006 to 2010. Most of the variables give information that a typical home buyer would like to know about a house (e.g., number of bedrooms and bathrooms, square footage, heating type, lot size, etc.).
The AmesHousing (Kuhn, 2020) package contains the raw data and processed versions of the Ames Housing data set. The raw data contains 2930 observations and 82 variables, which include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables, involved in assessing house values. The processed version of the data set has 2930 observations and 81 variables, including the target variable Sale_Price representing the value of houses in US dollars. The usual goal for this data set is to predict the sale price of each house given the covariates.
We load the processed version of the Ames Housing data set from the AmesHousing package and prepare the data set that we will use for the analyses. The preprocessing steps are presented in the Supplementary Material. This version of the data set contains 22 factors and 59 numeric variables, including the response variable Sale_Price, for 2929 observations. We split the data set into training and testing samples.
set.seed(3456)
n <- nrow(AmesHousing)
trainindex <- sample(1:n, size = round(0.7 * n), replace = FALSE)
traindata <- AmesHousing[trainindex, ]
testdata <- AmesHousing[-trainindex, ]

We fit a random forest with 1000 trees using the training data and construct 95% prediction intervals for the observations in the testing data with the proposed method. We apply 5-fold cross-validation based calibration and set the acceptable coverage range to [.945, .955]. We can pass the list of random forest parameters for the ranger package. We can then analyze the constructed PIs and bias-corrected random forest predictions for the testing data, as shown below. The PI output is a list containing lower and upper bounds. For example, we can print the point prediction and prediction interval for the tenth observation in the testing data.
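The pibf() call producing the object out inspected below appears to have been lost in extraction; a call consistent with the surrounding description (1000 trees via ranger, 5-fold CV calibration, acceptable range [.945, .955], OOB output) might look as follows. The argument names calibration, numfolds, coverage_range and params_ranger are taken from the package's documented interface; consult ?pibf to confirm.

```r
# Reconstructed pibf() fit matching the description in the text
library(RFpredInterval)
out <- pibf(formula = Sale_Price ~ ., traindata = traindata,
            testdata = testdata, alpha = 0.05,
            calibration = "cv", numfolds = 5,
            coverage_range = c(0.945, 0.955),
            params_ranger = list(num.trees = 1000),
            oob = TRUE)
```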
c(out$pred_interval$lower[10], out$test_pred[10], out$pred_interval$upper[10])
[1] 133.8629 160.2426 194.5804

We can also print the summary output. In the summary output, we always see the mean PI length over the test data set. If calibration is applied, we also see the working level of α. If the test data set has the true response information, as in our example, coverage and prediction errors for the test set are also printed. Moreover, since we have entered oob = TRUE in the function arguments, the summary output shows the mean PI length and coverage measures, along with the prediction errors, for the training set. These prediction intervals are built with the out-of-bag (OOB) predictions and prediction errors. Next, we construct 95% prediction intervals using the variations proposed by Roy and Larocque (2020). In the following example, the splitting rule is set to L1 and we apply the LM, Quant and SPI methods for building prediction intervals. We apply OOB calibration and set the acceptable coverage range to [.945, .955]. We can pass the list of random forest parameters for the randomForestSRC package.
out2 <- rfpi(formula = Sale_Price ~ ., traindata = traindata,
             testdata = testdata, alpha = 0.05, calibration = TRUE,
             split_rule = "l1", pi_method = c("lm", "quant", "spi"),
             params_rfsrc = list(ntree = 1000),
             params_calib = list(range = c(0.945, 0.955)), oob = FALSE)

We can analyze the constructed PIs for the testing data as below. Each PI output is a list containing lower and upper bounds. For instance, we can print the point prediction and the LM prediction interval for the tenth observation in the testing data. We then print the summary output. In the summary output, we can see the splitting rule selected in the first row. Since the test data set has the true responses in our example, we can see the coverage information for the selected PI methods, besides the mean PI length and α_w, in the printed table. Below the table, we have the mean prediction errors for the test set.
out3 <- piall(formula = Sale_Price ~ ., traindata = traindata,
              testdata = testdata, alpha = 0.05, num.trees = 1000)

The output is a list of constructed prediction intervals with the 16 methods and point predictions obtained with the PIBF method and the LS, L1, and SPI split rules. Hence, the output includes 16 prediction intervals and 4 point predictions, a list of 20 items in total.
We print the summary output.

plot(out3, test_id = 15)
Figure 1 presents the prediction intervals and point predictions for the test observation. The methods are ordered on the y-axis by their resulting PI length. For each method, the red point shows the point prediction and the blue lines show the constructed prediction interval(s) for the test observation. If the true response of the test observation is known, it is shown with a dashed vertical line. Note that we may have multiple prediction intervals with the HDR PI method. As we can see from the figure, we may have four different point predictions for the same test observation: the PIBF method and the three splitting rules LS, L1 and SPI can each produce a different point prediction, but all PI method variations for the same splitting rule share the same point prediction.

Simulation study
In this section, we compare the predictive performance of the prediction intervals constructed with our proposed method to the existing methods presented in the Introduction using a variety of simulated and real benchmark data sets.

Simulation design
We conduct a simulation study based on seven simulated data sets from the literature. The first three are Friedman's benchmark regression problems described in Friedman (1991) and Breiman (1996). We use the CRAN package mlbench (Leisch and Dimitriadou, 2021) to generate these data sets.
In Friedman Problem 1, the inputs are 10 independent variables uniformly distributed on the interval [0, 1]. The first five covariates are used to generate the response:

y = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5 + ϵ,

where ϵ ~ N(0, σ^2) and the default standard deviation of ϵ is 1, which yields a signal-to-noise ratio (SNR) (i.e., the ratio of the standard deviation of the signal to the standard deviation of the error) of 4.8:1.
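Friedman Problem 1 can be generated directly from the formula above in a few lines of base R (mlbench.friedman1() from the mlbench package produces the same design); the sample size here is illustrative.

```r
# Friedman Problem 1: 10 uniform covariates, response from the first five
set.seed(42)
n <- 200
x <- matrix(runif(n * 10), ncol = 10)
y <- 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
  10 * x[, 4] + 5 * x[, 5] + rnorm(n, sd = 1)   # default error sd = 1
```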
In Friedman Problem 2, the response is generated as

y = ( x1^2 + (x2 x3 − 1/(x2 x4))^2 )^{1/2} + ϵ,

where the inputs are four independent variables uniformly distributed over the ranges

0 ≤ x1 ≤ 100,  40π ≤ x2 ≤ 560π,  0 ≤ x3 ≤ 1,  1 ≤ x4 ≤ 11,

and ϵ ~ N(0, σ^2). The default value of 125, which yields an SNR of 3:1, is used for the standard deviation of ϵ.
In Friedman Problem 3, the inputs are four independent variables uniformly distributed over the same ranges as in Friedman Problem 2. The response is generated as

y = arctan((x2 x3 − 1/(x2 x4)) / x1) + ϵ,

where ϵ is N(0, σ^2), and the default value of 0.01 for the standard deviation of ϵ is used, which yields an SNR of 3:1.
The fourth data set is the Peak Benchmark Problem, also from the mlbench package. Let r = 3u, where u is uniform on [0, 1], and let x be uniformly distributed on the d-dimensional sphere of radius r. The response is y = 25 exp(−0.5 r^2). The default value of d = 20 dimensions is used.
The fifth is a modification of Friedman Problem 1, used in Hothorn and Zeileis (2021) in their H2c setup. This data set was designed to have heteroscedasticity. The inputs are 10 independent variables uniformly distributed on the interval [0, 1]. The first five covariates are used in the mean function, and the unscaled mean function is defined as

μ = 10 sin(π x1 x2) + 20 (x3 − 0.5)^2 + 10 x4 + 5 x5.

The scaled mean function on the interval [−1.5, 1.5] is

μS = −1.5 + 3 (μ − μmin) / (μmax − μmin),

where μmin and μmax are the minimum and maximum values of μ over the sample. The last five covariates are used in the standard deviation function, and the unscaled standard deviation function is

σ = 10 sin(π x6 x7) + 20 (x8 − 0.5)^2 + 10 x9 + 5 x10.

The standard deviation is min-max scaled in the same way, using σmin and σmax, the minimum and maximum values of σ over the sample. The response is generated as a normal random variable with mean μS and standard deviation σS.
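The scaling of the mean (and, analogously, the standard deviation) onto a target interval is a standard min-max transform over the sample. A minimal Python sketch, with the helper name `minmax_scale` being ours:

```python
def minmax_scale(values, lo, hi):
    """Min-max scale a sample onto the interval [lo, hi]."""
    v_min, v_max = min(values), max(values)
    return [lo + (hi - lo) * (v - v_min) / (v_max - v_min) for v in values]

# Toy sample standing in for the unscaled means over a generated sample.
mu_scaled = minmax_scale([2.0, 5.0, 8.0], -1.5, 1.5)  # → [-1.5, 0.0, 1.5]
```

The smallest sample value always maps to `lo` and the largest to `hi`, so the scaled means span exactly [−1.5, 1.5] over the sample.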
The last two data sets, which were used in Roy and Larocque (2020), have a tree-based response variable. The inputs are seven independent variables generated from the standard normal distribution. The response is generated from the seven covariates according to a tree model of depth three with eight terminal nodes,

y = Σ_{j=1}^{8} u_j I(x ∈ A_j) + ϵ,

where A_1, ..., A_8 are the terminal nodes, the terminal node means are u = (5, 10, 15, 20, 25, 30, 35, 40), and I is the indicator function. The two data sets differ in the distribution of the error: in the first, ϵ is generated from a standard normal distribution, and in the second from an exponential distribution with mean 1. The signal-to-noise ratio is 11.5:1 for both data sets.
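A tree-based response of this form can be mimicked as follows. Note that the paper's actual split rules on the seven covariates are not reproduced here; the sketch below uses illustrative splits at zero on the first three covariates merely to route each observation to one of the eight terminal node means.

```python
import random

MEANS = (5, 10, 15, 20, 25, 30, 35, 40)  # terminal node means u

def tree_response(x, means=MEANS):
    # Depth-3 tree: the terminal node index is driven by the signs of
    # three covariates (illustrative splits; not the paper's exact tree).
    idx = 4 * (x[0] > 0) + 2 * (x[1] > 0) + (x[2] > 0)
    return means[idx]

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(7)]  # seven N(0,1) covariates
y = tree_response(x) + rng.gauss(0, 1)   # normal-error variant
```

For the second data set one would replace the last term with `rng.expovariate(1.0)` to obtain exponential errors with mean 1.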
We use training sample sizes of ntrain = {200, 500, 1000, 5000}, resulting in 28 scenarios. Each scenario is repeated 500 times. In each run, we generate an independent test set of ntest = 1000 new observations.

Competing methods
We compare our proposed prediction interval estimator with the 10 competing methods presented in the Introduction. The first is the PIα method: we fit the random forest with the ranger package and use the forestError package to build the PIs. The second is the OOB method, for which the rfinterval package is used. The third is the split conformal method, using the conformalInference package. The fourth is the QRF method, using the quantregForest package. The fifth is the GRF method, using the quantile_forest function in the grf package.
The last five are variations of Roy and Larocque (2020). To compare the performance of the method variations, a comprehensive simulation study and real data analyses were performed in Roy and Larocque (2020). A key conclusion from these comparison studies was that, among the three alternative splitting criteria within the CART paradigm, the impact of the choice of splitting rule on the performance of the prediction intervals was moderate, whereas the selection of the PI method had a much greater impact. Hence, in this simulation study, we set the splitting rule to least squares (LS) and compare only the five PI methods: LM, Quant, SPI, HDR, and CHDR. For these methods, the rfpi() function of the RFpredInterval package is used, and we fit the random forest with the ranger package.

Parameter settings
For the simulations, we use the following parameters. For all methods, we set the number of trees to 2000. Letting p be the number of covariates, the number of candidate covariates for splitting at each node, mtry, is set to max{⌊p/3⌋, 1} (except for the GRF method). For the GRF method, following Athey et al. (2019), mtry is set to min{⌈√p + 20⌉, p}. Also, we use the honest splitting regime with the default fraction of 0.5 for the GRF method. The minimum node size parameter for all forests is set to 5. The desired coverage is set to 95% for all methods. For the proposed method, we perform the cross-validation-based calibration as the primary calibration procedure, but we also investigate the OOB calibration. For the method variations of Roy and Larocque (2020), we perform the OOB calibration procedure as they proposed. For both calibration procedures, the acceptable range of coverage is set to [.945, .955]. Calibration is not performed for the competing methods since no option for calibration is offered in their CRAN packages.
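The two mtry rules above can be written out explicitly. This is a small Python sketch of the formulas only, not code from any of the packages:

```python
import math

def mtry_default(p):
    # mtry for most forests: max(floor(p/3), 1)
    return max(p // 3, 1)

def mtry_grf(p):
    # mtry for GRF, following Athey et al. (2019): min(ceil(sqrt(p) + 20), p)
    return min(math.ceil(math.sqrt(p) + 20), p)
```

For example, with p = 10 covariates, `mtry_default(10)` gives 3, while `mtry_grf(10)` is capped at p = 10; the GRF rule only falls below p once p exceeds roughly 25 covariates.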

Performance with the simulated data sets
We evaluate the performance of the competing methods with two measures: the mean coverage and the mean prediction interval length. Table 2 presents the average coverage rate of each method on the test set over 500 replications for a given simulated data set and sample size, with average mean prediction interval lengths shown in parentheses. The principal goal of any prediction interval method is to ensure the desired coverage level, which in this simulation study is set to 95% for all methods. The left plot in Figure 2 shows the mean coverages over the 28 scenarios for all methods. Overall, all of the methods, except the QRF and GRF methods which tend to be conservative, provide a mean coverage close to the desired level. The QRF and GRF methods have average mean coverages over all scenarios of 0.975 and 0.974, respectively. Although the PIα method has an average mean coverage of 0.957, close to the desired level, its variability is large. Over the 308 (11 methods × 28 scenarios) average coverage values, there is only one case where the mean coverage is below 0.94: the PIα method in Friedman Problem 2 with ntrain = 5000.
Once the prediction intervals provide the desired coverage level, the next goal of any PI method is to provide the shortest PI length. Prior to a detailed comparison of interval lengths, we can globally compare the interval lengths over all scenarios with the percentage increase in mean PI length of a method with respect to the best method for a given run. For a given run, define ml_i as the mean PI length of method i and ml* as the shortest mean PI length over all methods. The percentage increase in PI length for method i is computed as 100 × (ml_i − ml*) / ml*. Smaller values of this measure indicate better performance. The right plot in Figure 2 presents the relative lengths of the methods across 14,000 runs (28 scenarios × 500 replications). The prediction intervals with the GRF method are the widest, followed by the QRF method. However, as we saw in the left plot in Figure 2, GRF and QRF produce conservative prediction intervals, so their PI lengths cannot be fairly compared to the methods with a coverage closer to 0.95. Based on the global results, the proposed method, PIBF, performs the best. Following PIBF, the PIα, OOB, LM, SPI, HDR, and CHDR methods perform similarly well, with PIα being slightly better. Among the variations of Roy and Larocque (2020), the PIs with quantiles are longer than those of the other four variations.

Now, we investigate the performance of the methods separately for each scenario. See figures S1 to S7 in the Supplementary Material for the mean PI length results of each method for each simulated data set. Each figure has four facets corresponding to the four levels of the training sample size. For all methods and data sets, the mean PI lengths and their variability decrease as the sample size increases (except for the GRF method on the tree-based data sets). From Figure S1, we see that for Friedman Problem 1, PIBF consistently outperforms the 10 competing methods in terms of mean PI length for all sample sizes while ensuring the desired coverage level (see Table 2 for the mean coverage results). QRF and GRF provide the widest prediction intervals for all sample sizes; however, as presented in Table 2, these methods heavily over-cover and are therefore not comparable with the other methods. While the PIα method also slightly over-covers, it has shorter PIs than the other methods.
For Friedman Problem 2 (see Figure S2 in the Supplementary Material), the proposed method has the shortest mean PI length for the smallest sample size, and as the sample size increases, PIα provides shorter PIs. However, we should also take into account the mean coverages presented in Table 2. The proposed method has smaller coverage levels than PIα for ntrain = 200, but as the sample size increases, the coverage levels of the PIα method decrease (down to 0.937 for ntrain = 5000), whereas PIBF keeps its coverage around 0.945. For ntrain = 5000, the OOB method builds shorter PIs while ensuring the desired coverage level. Again, QRF and GRF have the widest PIs for all sample sizes because of their conservative coverage.
The performance of PIBF and PIα is very similar for Friedman Problem 3 (see Figure S3 in the Supplementary Material): both methods provide the shortest PIs with similar coverage levels.

For the H2c setup (see Figure S5 in the Supplementary Material), which is the modification of Friedman Problem 1, all methods are comparable since QRF and GRF do not over-cover. In this setting, all methods also perform fairly well with respect to PI length. Overall, the LM prediction interval method with the LS splitting rule provides slightly shorter PIs.
For the tree-based data sets (see figures S6 and S7 in the Supplementary Material), the distribution of the error does not appear to have a significant effect on the results. Again, QRF and GRF have conservative PIs. Unlike for the other data sets, the mean PI lengths of the GRF method decrease very slowly as the sample size increases. For ntrain = 5000, all methods (except QRF and GRF) perform similarly. For the smallest sample size, the HDR-based PI methods and the proposed method perform slightly better than the other methods. As the sample size increases, PIBF produces the shortest prediction intervals.
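The relative-length measure used throughout these comparisons, 100 × (ml_i − ml*)/ml*, is simple to compute for a single run. A minimal Python sketch with made-up lengths (the function name is ours):

```python
def pct_increase_in_length(mean_lengths):
    # Percentage increase of each method's mean PI length over the
    # shortest mean PI length across methods, for a single run.
    best = min(mean_lengths.values())
    return {m: 100 * (ml - best) / best for m, ml in mean_lengths.items()}

# Toy mean PI lengths for three methods in one run.
rel = pct_increase_in_length({"PIBF": 4.0, "OOB": 5.0, "QRF": 6.0})
# → {"PIBF": 0.0, "OOB": 25.0, "QRF": 50.0}
```

The best method in a run always scores 0, so the boxplots in the right panels of figures 2 and 4 summarize how far each method typically falls behind the per-run winner.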

Effect of calibration on the performance of prediction intervals
In this section, we investigate the effect of the proposed calibration on the performance of the prediction intervals. We apply the same simulation study using the seven simulated data sets and compare the results of the proposed method without calibration, with OOB calibration, and with cross-validation-based calibration. The desired coverage level is set to 95%. In Figure 3, the left plot presents the mean coverages over the 28 scenarios for the three variants, and the right plot shows the percentage increase in mean PI length of each of the three calibration variants across 14,000 runs (28 scenarios × 500 replications). Although we obtain the shortest prediction intervals without calibration, the variability of the mean coverage level is then larger, and the coverage sometimes falls below 0.94. Looking at the left plot, the variability of the mean coverage level decreases with both calibration procedures. However, OOB calibration produces conservative PIs: the median of the mean coverage levels is above 0.96, and the PIs with OOB calibration are the widest. Calibration with CV produces slightly longer PIs than no calibration, but these PIs have coverage levels closer to the desired level.
In the package, both calibration procedures are implemented for the PIBF method. The simulation study results show that, compared to OOB calibration, calibration with CV produces shorter PIs while maintaining the desired coverage level. Therefore, the default calibration procedure in the pibf() function is set to CV. In terms of computational time (see tables 4 and 5), calibration with k-fold CV is slower than OOB calibration since it needs to fit two additional random forests for each fold.

Performance with real data sets
To further explore the performance of the prediction intervals built with the proposed method, we use 11 real data sets. Since two of the data sets have two response variables, we effectively analyze 13 real data sets. The Boston housing and Ames housing data sets are from the R packages mlbench and AmesHousing, respectively. The other data sets are obtained from the UCI Machine Learning Repository (Dua and Graff, 2017).
For each data set, we repeat 10-fold cross-validation 100 times for each method. Hence, for each fold, the training and testing sets correspond to 90% and 10% of the whole data set, respectively. The desired coverage level is set to 95% for all methods. Table 3 presents the results of the real data analyses (n is the number of observations and p is the number of predictors), with the average coverage rate of each method over the 100 repetitions, and the mean prediction interval lengths, averaged over the 100 repetitions, shown in parentheses. Figure 4 illustrates the global results of the analyses across data sets. The left plot in Figure 4 shows the mean coverages over the 13 real data sets for all methods. Similar to what we observed with the simulated data sets, the QRF and GRF methods produce conservative prediction intervals, whereas the other methods provide a mean coverage close to the desired level. Again, the PIα method maintains the target level on average, with 0.959, but its variability is the highest among all methods. Across all data sets, there are three cases where the mean coverage is below 0.94: the proposed method for Concrete slump, and the PIα method for Auto MPG and Computer hardware.
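The mean coverage reported in these tables is simply the fraction of held-out responses that fall inside their prediction intervals. A minimal sketch (the function name is ours):

```python
def mean_coverage(y_true, intervals):
    # Fraction of test responses that fall inside their prediction intervals.
    hits = sum(lo <= y <= hi for y, (lo, hi) in zip(y_true, intervals))
    return hits / len(y_true)

# Toy test responses and (lower, upper) interval bounds for one fold.
cov = mean_coverage([1.0, 2.0, 3.0], [(0.5, 1.5), (2.5, 3.0), (2.0, 4.0)])
```

In the real data analyses, this quantity is computed on each held-out fold and then averaged over the 100 repetitions of 10-fold cross-validation.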
The right plot in Figure 4 presents the relative lengths of the methods across the 13 real data sets. For each method, there are 13 points in the boxplot, each corresponding to the percentage increase in mean PI length compared to the best method for a single real data set. Again, the prediction intervals with GRF and QRF are the widest among the eleven methods. Among the other nine methods, the proposed method performs the best, followed by PIα.
For each real data set, we analyze the performance of each method through the mean PI lengths presented in figures S8 to S10 in the Supplementary Material. For the Abalone data set, the HDR method produces the shortest prediction intervals, followed by the SPI and Quant methods. While the QRF method over-covers, its PIs are no wider than those of most of the other methods. The proposed method, PIBF, is distinctly the best prediction interval method, yielding the shortest PI lengths for the Air quality data set with the absolute and relative humidity response variables, Airfoil self-noise, Ames housing, Boston housing, Concrete compression, the Energy efficiency data set with the cooling and heating load response variables, and Servo. In the Auto MPG data set, PIα has the shortest mean PI length, but with a mean coverage of 0.929. Among the other methods, the PIBF, OOB, LM, Quant, and SPI methods show similarly good performance while maintaining the desired coverage level. For the Computer hardware data set, the PIBF, PIα, and SPI methods perform better than the other methods.
They have similar mean PI lengths. For the Concrete slump data set, the proposed method has the shortest mean PI length, but with a slightly smaller coverage of 0.939. This is the only one of the simulated and real data sets where the proposed method has a mean coverage below 0.94. After PIBF, PIα and LM show good performance with a mean coverage close to the target level. Overall, we can conclude that the proposed method performs better than the 10 competing methods for almost all of the real data sets.

Comparison of the computational times
All simulations and real data analyses were conducted in R version 3.6.0 on a Linux machine with an Intel(R) Xeon(R) E5-2667 v3 @ 3.20GHz and 396 GB of memory. The average computational times of each method for the simulated and real data sets are presented in tables 4 and 5. For the proposed method (PIBF), the computational times for both calibration methods, cross-validation and OOB, are presented. For most of the data sets, calibration with cross-validation has longer running times than OOB calibration, which is expected since with k-fold cross-validation we fit 2k more random forests than with OOB calibration.
For the variations of Roy and Larocque (2020), since we can build prediction intervals with the five PI methods by fitting only a single random forest with a selected splitting rule, we present the total computational time for building the five variations under RFPI. To be clear, for a given splitting rule, the rfpi() function fits a random forest and then applies the set of PI methods requested by the user to the output of that random forest. In our simulations, we choose to return all five PI methods for the selected splitting rule, i.e., when we measure the running time of the rfpi() function, we get the total running time of building five prediction intervals. Therefore, the values for RFPI should be interpreted with care when comparing the computational times of the methods. Although not all PI methods have similar computational complexities, even the average time of building prediction intervals with one of these variations, assuming they have similar running times, is reasonable. Since an optimal bandwidth has to be chosen for the HDR-based PI methods, which is a time-consuming process, the slowest of the five PI methods are HDR and CHDR. Among the remaining variations, the classical method, LM, is the fastest, followed by the Quant and SPI methods.
For both the simulations and the real data analyses, the OOB and GRF methods have the smallest running times. For most of the methods, the increase in sample size has a mild effect on the increase in running times. However, for the split conformal method with the simulated data sets, the running times increase more than proportionally with the sample size. Similarly, the QRF method is also affected by the training sample size as it grows to 5000.

Concluding remarks
In this paper, we have introduced an R package named RFpredInterval. This package implements 16 methods to build prediction intervals with random forests: a new method to build Prediction Intervals with Boosted Forests (PIBF) and 15 variations to produce prediction intervals with random forests proposed by Roy and Larocque (2020). PIBF provides bias-corrected point predictions obtained with a one-step boosted forest, and prediction intervals obtained by using the nearest-neighbour out-of-bag observations to estimate the conditional prediction error distribution.
We performed an extensive simulation study with a variety of simulated data sets and applied real data analyses to investigate the performance of the proposed method. The performance was evaluated based on the coverage level and the length of the prediction intervals. We compared the performance of the proposed method to 10 existing methods for building prediction intervals with random forests. The proposed method was able to maintain the desired coverage level with both the simulated and real data sets. In terms of PI length, globally, the proposed method provided the shortest prediction intervals among all methods. The conclusions drawn from the analysis of the real data sets were very similar to those from the simulated data sets, which provides evidence for the reliability of the proposed method. All results indicate that the proposed method can be used with confidence for a variety of regression problems.
Note that the coverage rate of prediction intervals for new observations can have several interpretations. An interesting discussion of this issue is given in Mayr et al. (2012). In that paper, the authors presented two interpretations of coverage: sample coverage and conditional coverage. Sample coverage means that if we draw a new sample from the same population as the training sample and build PIs with a desired coverage level of (1 − α), then the global coverage rate over this sample will be (1 − α). Conditional coverage means that if we sample many new observations always having the same set of covariates and build PIs for them with a desired coverage level of (1 − α), then about (1 − α)100% of these prediction intervals will contain the true value of the response. To achieve a desired level of conditional coverage, the predictive method needs to provide the desired coverage level over the entire covariate space, whereas sample coverage only requires maintaining the desired coverage level over the new sample, on average. Therefore, if conditional coverage holds, then sample coverage also holds. In practice, predictive models are mostly evaluated by their global predictive performance. Hence, ensuring that the sample coverage level is achieved should be sufficient for most applications. The proposed calibration method with cross-validation is designed to ensure the sample coverage property, and from the simulation study and the real data analyses, we can see that sample coverage is attained with the proposed calibration method.

Availability
The package is available from CRAN at https://cran.r-project.org/package=RFpredInterval.

2. Compute the OOB residuals as εoob = Y − Ŷoob.
3. Train the second regression RF with covariates X to predict the OOB residuals εoob, and get the OOB predictions of the residuals, εoob.
4. Update the OOB predictions as Ŷ*oob = Ŷoob + εoob.
5. Compute the updated OOB residuals after bias correction as ε*oob = Y − Ŷ*oob.
6. For a new observation xnew, get the point prediction ŷnew from the first RF and the predicted residual εnew from the second RF; the final prediction for the new observation is ŷ*new = ŷnew + εnew, where εnew is the estimated bias.
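Steps 2 to 6 above can be sketched with the two forests' outputs treated as inputs. This is a minimal Python sketch in which the forests themselves are not fitted; all names are ours, and the forest predictions are stand-in toy numbers:

```python
def pibf_point_predictions(y, yhat_oob, resid_hat_oob, yhat_new, resid_hat_new):
    # Step 2: OOB residuals of the first forest.
    eps_oob = [yi - fi for yi, fi in zip(y, yhat_oob)]
    # Step 4: update the OOB predictions with the predicted residuals
    # (the second forest is trained on eps_oob; its OOB predictions
    # are passed in here as resid_hat_oob).
    yhat_star_oob = [f + r for f, r in zip(yhat_oob, resid_hat_oob)]
    # Step 5: updated OOB residuals after bias correction.
    eps_star_oob = [yi - fi for yi, fi in zip(y, yhat_star_oob)]
    # Step 6: bias-corrected prediction for a new observation.
    yhat_star_new = yhat_new + resid_hat_new
    return eps_oob, eps_star_oob, yhat_star_new

eps, eps_star, pred = pibf_point_predictions(
    y=[10.0, 12.0], yhat_oob=[9.0, 13.0],
    resid_hat_oob=[0.5, -0.5], yhat_new=11.0, resid_hat_new=0.25)
```

In the package, the updated OOB residuals ε*oob are what the nearest-neighbour out-of-bag machinery uses to estimate the conditional prediction error distribution around the bias-corrected prediction.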

Figure 1 :
Figure 1: Prediction intervals for the 15th test observation in the test data. The x-axis represents the sale price of houses in thousands. For each method, red dots represent the point prediction and blue lines show the prediction interval(s). The vertical dashed line shows the true response value for the test observation. PIBF: Prediction intervals with boosted forests (the proposed method). The notation for the other 15 methods is splitting rule-PI method. Splitting rules are LS: Least squares, L1: L1, SPI: Shortest PI splitting rule. PI methods are LM: Classical method, Quant: Quantiles, SPI: Shortest PI, HDR: Highest density region, CHDR: Contiguous HDR.

Figure 2 :
Figure 2: (Left) Boxplots of the mean coverage over all scenarios. All methods, except the QRF and GRF methods, are able to provide a mean coverage close to the desired coverage level of 0.95. Each white circle is the average of the mean coverages over the 28 scenarios. (Right) Boxplots of the percentage increase in mean PI length of each method compared to the shortest PI length for a given run across the 14,000 runs. The smaller the percentage increase, the better the method. Each white circle is the average of the relative lengths over the 14,000 runs. Outlier values are removed from the graph since they distort the scales. PIBF: Prediction intervals with boosted forests (the proposed method), PIα: Conditional α-level prediction interval, OOB: Out-of-bag approach, LM: Classical method, Quant: Quantiles, SPI: Shortest PI, HDR: Highest density region, CHDR: Contiguous HDR, SC: Split conformal, QRF: Quantile regression forest, GRF: Generalized random forest.

Figure 3 :
Figure 3: Global performance results of the proposed method for the simulated data sets with different calibration procedures. NC: No calibration, OOB: OOB calibration, CV: Calibration with cross-validation. (Left) Boxplots of the mean coverage over 500 replications across all scenarios. (Right) Boxplots of the percentage increase in mean PI length of each calibration procedure compared to the shortest PI length for a given run across the 14,000 runs. Smaller values are better. Outlier values are removed since they distort the scales.

Figure 4 :
Figure 4: (Left) Boxplots of the mean coverage over all real data sets. All methods except the QRF and GRF methods are able to provide a mean coverage close to the desired coverage level of 0.95. Each white circle is the average of the mean coverages over the 13 real data sets. (Right) Boxplots of the percentage increase in mean PI length of each method compared to the shortest PI length for a given real data set across the 13 data sets. The smaller the percentage increase, the better the method. Each white circle is the average of the relative lengths over the 13 real data sets. One outlier value for GRF, with a percentage increase of 1244%, is removed from the graph since it distorts the scales. PIBF: Prediction intervals with boosted forests (the proposed method), PIα: Conditional α-level prediction interval, OOB: Out-of-bag approach, LM: Classical method, Quant: Quantiles, SPI: Shortest PI, HDR: Highest density region, CHDR: Contiguous HDR, SC: Split conformal, QRF: Quantile regression forest, GRF: Generalized random forest.

Table 1 :
List of functions and methods with characteristics. PIBF: Builds prediction intervals with boosted forests. The ranger package is used to fit the random forests. Calibration options are "cv" and "oob". Returns the constructed PIs and bias-corrected point predictions for the test data.

Table 2 :
Results of the simulation study. Average coverage rates of each method in each simulation, with average mean PI lengths shown in parentheses. The desired coverage level is 0.95. The shortest average mean PI lengths are emphasized in bold. PIBF: Prediction intervals with boosted forests (the proposed method), PIα: Conditional α-level prediction interval, OOB: Out-of-bag approach, LM: Classical method, Quant: Quantiles, SPI: Shortest PI, HDR: Highest density region, CHDR: Contiguous HDR, SC: Split conformal, QRF: Quantile regression forest, GRF: Generalized random forest. The results for the Peak Benchmark Problem, presented in Figure S4, are very similar to those of Friedman Problem 1: for all sample sizes, PIBF consistently outperforms the 10 competing methods in terms of mean PI length, but this time the SPI method with the LS splitting rule comes in second place.

Table 3 :
Results of the real data analysis. Average coverage rates of each method for each real data set, with average mean PI lengths shown in parentheses. The desired coverage level is 0.95. The shortest average mean PI lengths are shown in bold. PIBF: Prediction intervals with boosted forests (the proposed method), PIα: Conditional α-level prediction interval, OOB: Out-of-bag approach, LM: Classical method, Quant: Quantiles, SPI: Shortest PI, HDR: Highest density region, CHDR: Contiguous HDR, SC: Split conformal, QRF: Quantile regression forest, GRF: Generalized random forest.

Table 4 :
Average computational time (in seconds) of each method over 500 replications for each simulated data set. The values represent the average time to build prediction intervals for a test set with 1000 observations. PIBF-CV: The proposed method with cross-validation calibration, PIBF-OOB: The proposed method with OOB calibration, RFPI: The average total running time of the five PI methods, i.e., LM + Quant + SPI + HDR + CHDR.

Table 5 :
Average computational time (in seconds) of each method over 100 repetitions of 10-fold cross-validation for each real data set. PIBF-CV: The proposed method with cross-validation calibration, PIBF-OOB: The proposed method with OOB calibration, RFPI: The average total running time of the five PI methods, i.e., LM + Quant + SPI + HDR + CHDR.