Like many predictive models, random forests provide point predictions for new observations. Besides the point prediction, it is important to quantify the uncertainty in the prediction. Prediction intervals provide information about the reliability of the point predictions. We have developed a comprehensive R package, RFpredInterval, that integrates 16 methods to build prediction intervals with random forests and boosted forests. The set of methods implemented in the package includes a new method to build prediction intervals with boosted forests (PIBF) and 15 method variations to produce prediction intervals with random forests, as proposed by (Roy and Larocque 2020). We perform an extensive simulation study and apply real data analyses to compare the performance of the proposed method to ten existing methods for building prediction intervals with random forests. The results show that the proposed method is very competitive and, globally, outperforms competing methods.
Predictive modelling is the general concept of building a model that describes how a group of covariates can be used to predict a response variable. The objective is to predict the unknown responses of observations given their covariates. For example, predictive models could be used to predict the sale price of houses given house characteristics (De Cock 2011). In its simplest form, a predictive model aims to provide a point prediction for a new observation. However, a point prediction does not contain information about its precision that can tell us how close to the true response we can expect the prediction to be, which is often important in decisionmaking context. Hence, although the point prediction is often the main goal of predictive analysis, assessing its reliability is equally important, and this can be achieved with a prediction interval (PI). A PI contains a set of likely values for the true response with an associated level of confidence, usually, \(90\%\) or \(95\%\). Given that shorter PIs are more informative, developing predictive models that can produce shorter PIs along with the point predictions is crucial in assessing and quantifying the prediction error. In realworld applications, knowing the prediction error alongside the point prediction increases the practical value of the prediction.
Regression analysis is a form of predictive modelling technique that examines the relationship between a response variable and a group of covariates. In this paper, we consider a general regression model \[\label{eq:regression} Y = g\left(X\right)+\epsilon \tag{1}\] where \(Y\) is a univariate continuous response variable, \(X\) is a pdimensional vector of predictors, and \(\epsilon\) is an error term. We assume \(g\left(.\right)\) is an unknown smooth function \(\Re^p \rightarrow \Re\) and \(E\left[YX=x\right] = g\left(X\right)\). A confidence interval of the prediction is a range likely to contain the location of the response variable’s true population mean. However, a prediction interval for a new observation is wider than its corresponding confidence interval and provides a range likely to contain this new observation’s response value.
In the past decade, random forests have increased in popularity and provide an efficient way to generate point predictions for model (1). A random forest is an ensemble method composed of many decision trees, which can be described with a simple algorithm (Breiman 2001). For each tree \(b=\left \{1,..., B\right \}\), a bootstrap sample of observations is drawn and a fully grown tree is built such that a set of predictors is randomly selected at each node and the best split is selected among all possible splits with those predictors only. The random forest prediction for a new observation is the average of the \(B\) trees \[\hat{y}^{}_{new} = \frac{1}{B} \sum_{b=1}^{B}\hat{y}^b_{new}\] where \(\hat{y}^b_{new}\) is the tree prediction for the new observation in the \(b\)th tree, i.e. the average of observations in the terminal node corresponding to the new observation. Besides this traditional description, the modern view also considers random forests as datadriven weight generators (Hothorn et al. 2004; Lin and Jeon 2006; Moradian et al. 2017, 2019; Athey et al. 2019; Roy and Larocque 2020; Tabib and Larocque 2020; Alakuş et al. 2021).
Although random forests limit overfitting by combining many trees, which reduces the variance of the estimator, final predictions can be biased (Mentch and Hooker 2016; Wager and Athey 2018). Since each tree is built under the same random process, all trees focus on the same part of the response signal, usually the strongest. Therefore, some parts of the response signal may be left untargeted, which could result in biased point predictions. Wager and Athey (2018) provide bounds for the extent of the bias of random forests under some assumptions about the tree growing process. Following their work, Ghosal and Hooker (2021) proposed a bias correction method in a regression framework called onestep boosted forest, which is introduced in Breiman (2001) and Zhang and Lu (2012). The main idea of the proposed method is to sum the predictions of two random forests, where the first is a regression forest fitted on the target data set and the second is fitted on the outofbag residuals of the former. Empirical studies show that this method provides a significant bias reduction when compared to a simple random forest.
The current paper proposes an R package providing, among other features, an extension of the onestep boosted forest method described above (Ghosal and Hooker 2021). The literature on prediction intervals for random forests consists mostly of recent studies. The first method is the Quantile Regression Forests (QRF) method proposed by Meinshausen (2006). The aim of QRF is to estimate conditional quantiles of the response variable, instead of conditional means, using an estimated cumulative distribution function obtained with the nearest neighbour forest weights introduced by Hothorn et al. (2004). Prediction intervals can be built directly from the estimated conditional quantiles. The method is implemented in the CRAN package quantregForest (Meinshausen 2017).
In a more recent study, Athey et al. (2019) proposed Generalized
Random Forests (GRF), a very general framework to estimate any quantity,
such as conditional means, quantiles or average partial effects,
identified by local moment equations. Trees are grown with splitting
rules designed to maximize heterogeneity with respect to the quantity of
interest. Quantile regression forest is one of the applications of GRF.
Similar to the QRF, the GRF method uses the neighbourhood information
from different trees to compute a weighted set of neighbours for each
test point. Unlike QRF, which grows trees with the leastsquares
criterion, GRF uses a splitting rule designed to capture heterogeneity
in conditional quantiles. An implementation of quantile regression
forest with GRF is available in the function quantile_forest
of the
CRAN package grf (Tibshirani et al. 2021).
Vovk et al. (2005, 2009) introduced a general distributionfree conformal prediction interval framework. Any predictive model, including random forests, can be used within the proposed methodology. The idea is to use an augmented data set that includes the new observation to be predicted to fit the model, and apply a set of hypothesis tests to provide an error bound around the point prediction for the new observation. Although this method does not require any distribution assumptions, it is computationally intensive. Lei et al. (2018) proposed a variant of this method, called Split Conformal (SC) prediction, which splits the data into two subsets, one to fit the model, and one to compute the quantiles of the residual distribution. We note that, while the original full conformal prediction interval framework produces shorter intervals, SC is computationally more efficient. The R package conformalInference (Tibshirani 2019), available on GitHub, implements this method.
Roy and Larocque (2020) proposed 20 distinct variations of methods to improve the performance of prediction intervals with random forests. These approaches differ according to 1) the method used to build the forest and 2) the method used to build the prediction interval. Four methods can be used to build the forest: three from the classification and regression tree (CART) paradigm (Breiman and Breiman 1984) and the transformation forest method (TRF) proposed by (Hothorn and Zeileis 2021). Within the CART paradigm, in addition to the default leastsquares (LS) splitting criterion, two alternative splitting criteria, \(L_1\) and shortest prediction interval (SPI), are considered. Prediction intervals are built using the Bag of Observations for Prediction (BOP), which is the set of nearest neighbour observations previously used in Moradian et al. (2017, 2019). In addition to the type of forest chosen, there are also five methods to build prediction intervals: the classical method (LM), the quantile method (Quant), the shortest prediction interval (SPI), the highest density region (HDR), and the contiguous HDR (CHDR). LM is computed based on an interceptonly linear model using the BOP as the sample, and produces a symmetric PI around the point prediction. The quantile method, similar to the QRF method, is based on the quantiles of the BOP. SPI corresponds to the shortest interval among the intervals that contain at least \(\left(1\alpha\right)100\%\) of the observations in the BOP. As an alternative to SPI, HDR is the smallest region in the BOP, with the desired coverage \(\left(1\alpha\right)\). Note that HDR is not necessarily a single interval. If the distribution is multimodal, it can be formed by multiple intervals. Finally, CHDR is a way to obtain a single prediction interval from HDR intervals by building an interval with the minimum and maximum bounds of the HDR intervals.
Zhang et al. (2020) proposed a forestbased prediction interval method, called OutofBag (OOB) prediction intervals, to estimate prediction intervals using the empirical quantiles of the outofbag prediction errors. The method assumes that OOB prediction errors are identically distributed and that their distribution can be well approximated by the outofbag prediction errors obtained from all training observations. The resulting prediction intervals have the same width for all test observations. The method is implemented in the CRAN package rfinterval (Zhang 2019).
Lu and Hardin (2021) proposed a very interesting and useful method to estimate the conditional prediction error distribution of a random forest. The main idea of the proposed method is to use a random forest to compute the outofbag residuals for the training observations and to form a set of outofbag neighbours for each test point. Then, the conditional prediction error distribution for each test point is determined with the outofbag residuals in the neighbourhood. Estimating the prediction error distribution enables the estimation of conditional quantiles, conditional biases and conditional mean squared prediction errors. The prediction interval for a test point \(x\), defined by \(\widehat {PI}_{\alpha}\left(x\right)\), is formed by adding the \(\alpha/2\) and \(1\alpha/2\) quantiles of the conditional prediction error distribution to the random forest point prediction. The estimators are implemented in the CRAN package forestError (Lu and Hardin 2020).
Note that the conformal inference, OOB approach of Zhang et al. (2020) and the \(\widehat {PI}_{\alpha}\left(x\right)\) method of (Lu and Hardin 2021) all use the prediction errors to build the prediction intervals. Instead of using the training responses directly to estimate quantiles, using prediction errors provides a better predictive power. However, unlike conformal inference and the OOB approach, the \(\widehat {PI}_{\alpha}\left(x\right)\) method uses the nearest neighbour observations to estimate the prediction error distribution. This idea is very similar to the BOP idea (Roy and Larocque 2020), but instead of using inbag observations, Lu and Hardin (2021) use outofbag observations to form the neighbourhoods. This approach allows the local information for the test observations to be extracted.
In this paper, we introduce the R package RFpredInterval (Alakus et al. 2022), which is the novel implementation of 16 methods to build prediction intervals with random forests and boosted forests. The set of methods implemented in the package includes a new method to build prediction intervals with boosted forests and 15 method variations (three splitting rules with the CART paradigm which are LS, \(L_1\) and SPI, and five methods to build prediction intervals which are LM, SPI, Quant, HDR and CHDR) proposed by Roy and Larocque (2020). These 15 methods had been thoroughly investigated before through simulation studies and with real data sets in Roy and Larocque (2020). However, they are not easily available to use. One of the main contributions of our package is the implementation of these competitive methods and the ability for users to compare various prediction interval methods within the same package. The other main contribution of our paper is a new method to build prediction intervals. Contrary to the 15 methods proposed by Roy and Larocque (2020), the newly introduced method was not tested before. That is why in the paper we placed a greater emphasis on investigating the new method through extensive simulation studies and with real data. For performance comparison purposes, we compared the new method to 10 existing methods which include:
The new proposed method to build Prediction Intervals with Boosted Forests is called PIBF. This approach integrates the idea of using the nearest neighbour outofbag observations to estimate the conditional prediction error distribution presented in Lu and Hardin (2021) to the onestep boosted forest proposed by Ghosal and Hooker (2021). We will show in this paper that PIBF significantly improves the performance of prediction intervals with random forests when compared with 10 existing methods using a variety of simulated and real benchmark data sets.
The rest of the paper is organized as follows. In the next section, we describe the algorithm implemented in PIBF. We then present the details of the package and provide a practical and reproducible example. We also perform a simulation study to compare the performance of our proposed method to existing competing methods, and we investigate the performance of the proposed method with real data sets. Lastly, we conclude with a discussion of the results.
The proposed method is based on the onestep boosted forest method proposed by Ghosal and Hooker (2021). It consists in fitting two regression random forests: the first is fitted to get point predictions and outofbag (OOB) residuals using the given data set, whereas the second is fitted to predict those residuals using the original covariates. As empirical studies demonstrate, the onestep boosted forest provides point predictions with reduced bias compared to the simple random forest. Ghosal and Hooker (2021) use subsampling for their theoretical investigations, even though random forests were originally described with bootstrap samples and obtain notable performance improvements. They have also investigated the effect of using bootstrapping with the onestep boosted forest on the bias estimations. From the results presented in their appendix, the use of bootstrapping yields the best performance and reduces the bias the most in exchange for an increase in their proposed variance estimator, which is defined under asymptotic normality. In this paper, we use bootstrapping for the onestep boosted forest method following the better performance results on bias reduction. The final prediction for a new observation, \(x_{new}\), is the sum of the predictions from the two random forests \[\hat y^{*}_{new} = \hat y^{}_{new} + \hat \epsilon^{}_{new}\] where \(\hat y^{}_{new}\) is the point prediction obtained from the first random forest and \(\hat \epsilon^{}_{new}\) is the bias estimation from the second forest.
Besides bias correction, we use the second random forest as a way to construct a prediction interval by finding the nearest neighbour observations that are close to the one we want to predict. The idea of finding the nearest neighbour observations, a concept very similar to the ‘nearest neighbour forest weights’ (Hothorn et al. 2004; Lin and Jeon 2006), was introduced in (Moradian et al. 2017) and later used in (Moradian et al. 2019), (Roy and Larocque 2020), (Tabib and Larocque 2020) and (Alakuş et al. 2021). For a new observation, the set of inbag training observations that are in the same terminal nodes as the new observation forms the set of nearest neighbour observations. (Roy and Larocque 2020) called this set of observations the Bag of Observations for Prediction (BOP). We can define the BOP for a new observation \(x_{new}\) as \[\label{eq:bop} BOP\left(x_{new}\right) = \bigcup\limits_{b=1}^{B} I_b\left(x_{new}\right) \tag{2}\] where \(I_b\left(x_{new}\right)\) is the set of inbag training observations, i.e., observations in the bootstrap sample that are in the same terminal node as \(x_{new}\) in the \(b^{th}\) tree. \(I_b\left(.\right)\) consists of the training observations that are in the bootstrap sample of the \(b^{th}\) tree.
Instead of forming the set of nearest neighbour observations with the inbag training observations, we can use the outofbag observations which are not in the bootstrap sample, as used in Lu and Hardin (2021). We can define the outofbag equivalent of the BOP for a new observation \(x_{new}\) (2) as \[\label{eq:bop2} BOP^{*}\left(x_{new}\right) = \bigcup\limits_{b=1}^{B} O_b\left(x_{new}\right) \tag{3}\] where \(O_b\left(x_{new}\right)\) is the set of outofbag observations that are in the same terminal node as \(x_{new}\) in the \(b^{th}\) tree. \(O_b\left(.\right)\) consists of the training observations that are not in the bootstrap sample of the \(b^{th}\) tree.
Outofbag observations are not used in the tree growing process. Thus, for the trees where the training observations are outofbag, they are like the unobserved test observations for those trees. The only difference is that, for a new observation, we use all the trees in the forest whereas for an outofbag observation we have only a subset of the forest trees. By using the outofbag equivalent of the BOP for a new observation, we can make use of the analogy between the outofbag observations and test observations. The outofbag neighbours of a new observation represent the new observation better than the inbag neighbours.
Any desired measure can be obtained by using the constructed BOPs. In this paper, we use the BOP idea to build a prediction interval for a test observation. For a new observation with covariates \(x_{new}\), we firstly form \(BOP^{*}\left(x_{new}\right)\) (3) using the outofbag neighbours. Then, as proposed by Lu and Hardin (2021), we estimate the conditional prediction error distribution, \(\hat{F}\left(x_{new}\right)\), but now with the biascorrected outofbag residuals of the observations in \(BOP^{*}\left(x_{new}\right)\). Lastly, we build a prediction interval for the new observation as \[PI\left(x_{new}\right) = \left[\hat y^{*}_{new} + SPI^l_{\alpha}\left(\hat{F}\left(x_{new}\right)\right), \hat y^{*}_{new} + SPI^u_{\alpha}\left(\hat{F}\left(x_{new}\right)\right)\right]\] where \(\hat y^{*}_{new}\) is the biascorrected prediction, \(SPI^l_{\alpha}\left(\hat{F}\left(x_{new}\right)\right)\) and \(SPI^u_{\alpha}\left(\hat{F}\left(x_{new}\right)\right)\) are the lower and upper bounds of the \(SPI_{\alpha}\left(\hat{F}\left(x_{new}\right)\right)\), which is the shortest interval formed by the observations in \(BOP^{*}\left(x_{new}\right)\) that contains at least \(\left(1\alpha\right)100\%\) of the observations. By using the biascorrected residuals to form prediction error distribution and picking the shortest interval among the qualified intervals, we can expect narrower prediction intervals.
We can summarize the steps of the proposed method as follows:
Train the first regression RF with covariates \(X\) to predict the response variable \(Y\), and get the OOB predictions \(\hat Y_{oob}\)
Compute the OOB residuals as \(\hat \epsilon_{oob}^{} = Y\hat Y_{oob}^{}\)
Train the second regression RF with covariates \(X\) to predict the OOB residuals \(\hat \epsilon_{oob}^{}\), and get the OOB predictions for residuals \(\hat{\hat \epsilon}_{oob}^{}\)
Update the OOB predictions as \(\hat Y_{oob}^* = \hat Y_{oob}^{} + \hat{\hat \epsilon}_{oob}^{}\)
Compute the updated OOB residuals after biascorrection as \(\hat \epsilon_{oob}^{*} = Y  \hat Y_{oob}^*\)
For a new observation \(x_{new}\), get the point predictions \(\hat y_{new}\) from the first RF, and get the predicted residuals \(\hat \epsilon_{new}\) from the second RF, then the final prediction for the new observation is \[\hat y_{new}^{*} = \hat y_{new}^{} + \hat \epsilon_{new}^{}\] where \(\hat \epsilon_{new}^{}\) is the estimated bias.
Form a BOP for \(x_{new}\) with the OOB neighbours using the second RF, \(BOP^{*}\left(x_{new}\right) \eqref{eq:bop2}\) and estimate the conditional prediction error distribution for \(x_{new}\) as, \[\hat{F}\left(x_{new}^{}\right) = \left \{\hat \epsilon_{oob,i}^{*}  i \in BOP^{*}\left(x_{new}\right)\right \}\]
Build a PI for \(x_{new}\) as \(PI\left(x_{new}\right) = \left[\hat y^{*}_{new} + SPI^l_{\alpha}\left(\hat{F}\left(x_{new}\right)\right), \hat y^{*}_{new} + SPI^u_{\alpha}\left(\hat{F}\left(x_{new}\right)\right)\right]\)
The principal goal of any prediction interval method is to ensure the desired coverage level. In order to attain the desired coverage level \(\left(1\alpha\right)\), we may need a calibration procedure. The goal of the calibration is to find the value of \(\alpha_w\), called the working level in Roy and Larocque (2020), such that the coverage level of the PIs for the training observations is closest to the desired coverage level. Roy and Larocque (2020) presented a calibration procedure that uses the BOPs that are built using only the trees where the training observation \(x_i\) is OOB. The idea is to find the value of \(\alpha_w\) using the OOBBOPs. In this paper, we call this procedure OOB calibration.
We also include a crossvalidationbased calibration procedure with the proposed method to acquire the desired \(\left(1\alpha\right)\) coverage level. In this calibration, we apply \(k\)fold crossvalidation to form prediction intervals for the training observations. In each fold, we split the original training data set into training and testing sets. For the training set, we go through the steps 15 defined above. Then, for each observation in the testing set, we apply steps 68 and build a PI. After completing CV, we compute the coverage level with the constructed PIs and if the coverage is not within the acceptable coverage range, then we apply a grid search to find the \(\alpha_w\) such that \(\alpha_w\) is the closest to the target \(\alpha\) among the set of \(\alpha_w\)’s. Once we find the \(\alpha_w\), we use this level to build the PI for the new observations.
In our package, we implement 16 methods that apply random forest training. Ten of these methods have specialized splitting rules in the random forest growing process. These methods are the ones with \(L_1\) and shortest prediction interval (SPI) splitting rules proposed by Roy and Larocque (2020). To implement these methods, we have utilised the custom split feature of the randomForestSRC package (Ishwaran and Kogalur 2021).
The randomForestSRC package allows users to define a custom splitting
rule for the tree growing process. The user needs to define the
customized splitting rule in the splitCustom.c
file with
Cprogramming. After modifying the splitCustom.c
file, all C source
code files in the package’s src
folder must be recompiled. Finally,
the package must be reinstalled for the custom split rule to become
available.
In our package development process, we froze the version of randomForestSRC to the latest one available at the time, which is version 2.11.0, to apply specialized splitting rules. After defining the \(L_1\) and SPI splitting rules, all C files were recompiled. Finally, all package files including our R files for prediction interval methods were rebuilt to make the package ready for the user installation.
The RFpredInterval package has two main R functions as below:
pibf()
: Constructs prediction intervals with the proposed method,
PIBF.rfpi()
: Constructs prediction intervals with 15 distinct
variations proposed by Roy and Larocque (2020).Table 1 presents the list of functions and methods
implemented in RFpredInterval. For pibf()
, RFpredInterval uses the
CRAN package ranger
(Wright et al. 2020) to fit the random forests. For rfpi()
, RFpredInterval
uses randomForestSRC package. For the leastsquares splitting rule,
both randomForestSRC and ranger packages are applicable.
Function  Method  Details  

pibf() 
PIBF  Builds prediction intervals with boosted forests. The ranger package is used to fit the random forests. Calibration options are "cv" and "oob" . Returns constructed PIs and biascorrected point predictions for the test data. 

rfpi() 
Split method  PI method  
LS  LM  Splitting rule is the leastsquares (LS) from the CART paradigm. "ls" is used for the "split_method" argument. A vector of characters c("lm","spi","quant","hdr","chdr") is used for the "pi_method" argument to apply all or a subset of the PI methods. ranger or randomForestSRC can be used to fit the random forest. Returns a list of constructed PIs for the selected PI methods and point predictions for the test data. 

SPI  
Quant  
HDR  
CHDR  
\(L_1\)  LM  Splitting rule is the \(L_1\) from the CART paradigm (Roy and Larocque 2020). "l1" is used for the "split_method" argument. A vector of characters c("lm","spi","quant","hdr","chdr") is used for the "pi_method" argument to apply all or a subset of the PI methods. Only randomForestSRC can be used to fit the random forest since the split rule is implemented with the custom split feature of that package. Returns a list of constructed PIs for the selected PI methods and point predictions for the test data. 

SPI  
Quant  
HDR  
CHDR  
SPI  LM  Splitting rule is the shortest PI (SPI) from the CART paradigm (Roy and Larocque 2020). "spi" is used for the "split_method" argument. A vector of characters c("lm","spi","quant","hdr","chdr") is used for the "pi_method" argument to apply all or a subset of the PI methods. Only randomForestSRC can be used to fit the random forest since the split rule is implemented with the custom split feature of that package. Returns a list of constructed PIs for the selected PI methods and point predictions for the test data. 

SPI  
Quant  
HDR  
CHDR  
piall() 
All methods  Builds prediction intervals with all of the implemented PI methods. The ranger package is used to fit the random forests for the PIBF and methods with LS split rule. randomForestSRC package is used for the methods with \(L_1\) and SPI split rules. Default values are assigned to the function arguments of pibf() and rfpi() . Returns an object of class "piall" containing a list of constructed PIs with 16 methods, and point predictions obtained with the PIBF method, LS, \(L_1\) and SPI split rules for the test data. 

plot() 
Plots the 16 constructed PIs obtained with piall() function for a test observation. 

print() 
Prints the summary output of pibf() , rfpi() and piall() functions. 

LM: Classical method, SPI: Shortest PI, Quant: Quantiles, HDR: Highest density region, CHDR: Contiguous HDR 
In this section, we illustrate the usage of the RFpredInterval package with the Ames Housing data set (De Cock 2011). The data set was introduced as a modern alternative to the wellknown Boston Housing data set. The data set contains many explanatory variables on the quality and quantity of physical attributes of houses in Ames, IA sold from 2006 to 2010. Most of the variables give information to a typical home buyer who would like to know about a house (e.g. number of bedrooms and bathrooms, square footage, heating type, lot size, etc.).
The AmesHousing
(Kuhn 2020) package contains the raw data and processed versions of
the Ames Housing data set. The raw data contains 2930 observations and
82 variables, which include 23 nominal, 23 ordinal, 14 discrete, and 20
continuous variables, involved in assessing house values. The processed
version of the data set has 2330 observations and 81 variables,
including the target variable Sale_Price
representing the value of
houses in US dollars. The usual goal for this data set is to predict the
sale price of each house given covariates.
We load the processed version of the Ames Housing data set from the
AmesHousing package and prepare the data set that we will use for the
analyses. The preprocessing steps are presented in the Supplementary
Material. This version of the data set contains 22 factors and 59
numeric variables, including 1 response variable Sale_Price
, for 2929
observations. We split the data set into training and testing samples.
set.seed(3456)
< nrow(AmesHousing)
n < sample(1:n, size = round(0.7*n), replace = FALSE)
trainindex < AmesHousing[trainindex, ]
traindata < AmesHousing[trainindex, ] testdata
We fit a random forest with 1000 trees using the training data and construct \(95\%\) prediction intervals for the observations in the testing data with the proposed method. We apply 5fold crossvalidation based calibration and set the acceptable coverage range to \(\left[.945,.955\right]\). We can pass the list of random forest parameters for ranger package.
< pibf(formula = Sale_Price ~ .,
out traindata = traindata,
testdata = testdata,
alpha = 0.05,
calibration = "cv",
numfolds = 5,
coverage_range = c(0.945, 0.955),
params_ranger = list(num.trees = 1000),
oob = TRUE)
We can then analyze the constructed PIs and biascorrected random forest predictions for the testing data, as shown below. The PI output is a list containing lower and upper bounds. For example, we can print the point prediction and prediction interval for the tenth observation in the testing data.
$pred_interval
out$test_pred
outc(out$pred_interval$lower[10], out$test_pred[10], out$pred_interval$upper[10])
1] 133.8629 160.2426 194.5804 [
We can also print the summary output. In the summary output, we can
always see the mean PI length over the test data set. If calibration is
applied, we can see the working level of \(\alpha\). If the test data set
has true response information, as in our example, coverage and
prediction errors for the test set are also printed. Moreover, since we
have entered oob = TRUE
in the function arguments, in the summary
output we can see the mean PI length and coverage measures along with
the prediction errors for the training set. The prediction intervals are
built with the outofbag (OOB) predictions and prediction errors.
print(out)
> alpha_w: 0.050
> Mean PI length: 73.081
> Coverage: 96.8%
> MAE of test predictions: 12.773
> RMSE of test predictions: 19.545
>
> Mean PI length (OOB PIs): 74.823
> Coverage (OOB PIs): 94.7%
> MAE of OOB train predictions: 14.179
> RMSE of OOB train predictions: 23.875
Next, we construct \(95\%\) prediction intervals using the variations proposed by Roy and Larocque (2020). In the following example, the splitting is rule is set to \(L_1\) and we want to apply LM, Quant and SPI methods for building prediction intervals. We apply OOB calibration and set the acceptable coverage range to \(\left[.945,.955\right]\). We can pass the the list of random forest parameters for randomForestSRC package.
< rfpi(formula = Sale_Price ~ .,
out2 traindata = traindata,
testdata = testdata,
alpha = 0.05,
calibration = TRUE,
split_rule = "l1",
pi_method = c("lm", "quant", "spi"),
params_rfsrc = list(ntree = 1000),
params_calib = list(range = c(0.945, 0.955)),
oob = FALSE)
We can analyze the constructed PIs for the testing data as below. Each PI output is a list containing lower and upper bounds. For instance, we can print the point prediction and LM prediction interval for the tenth observation in the testing data.
$lm_interval
out2$quant_interval
out2$spi_interval
out2c(out2$lm_interval$lower[10], out2$test_pred[10], out2$lm_interval$upper[10])
1] 129.9474 154.2098 176.8429 [
We print the summary output. In the summary output, we can see the splitting rule selected in the first row. Since the test data set has true responses in our example, we can see the coverage information for the selected PI methods besides the mean PI length and \(\alpha_w\) in the printed table. Below the table, we have the mean prediction errors for the test set.
print(out2)
> Split rule: L1
> 
> Mean PI length Coverage alpha_w
> Classical method (LM) 81.641 96.2% 0.140
> Shortest prediction interval (SPI) 81.953 96.1% 0.100
> Quantile method (Quant) 80.893 95.8% 0.120
> 
> MAE of test predictions: 14.605
> RMSE of test predictions: 22.603
Although, with the pibf()
and rfpi()
functions, we have more
flexibility to set the arguments for the methods, we can build
prediction intervals with all 16 methods implemented in the package with
the piall()
function. We will build 95% prediction intervals for the
test set.
< piall(formula = Sale_Price ~ .,
out3 traindata = traindata,
testdata = testdata,
alpha = 0.05,
num.trees = 1000)
The output is a list of constructed prediction intervals with 16 methods and point predictions obtained with the PIBF method, LS, \(L_1\), and SPI split rules. Hence, the output includes 16 prediction intervals and 4 point predictions which is a list of 20 items in total.
We print the summary output.
print(out3)
> 
> Mean PI length Coverage
> PIBF 72.752 96.6%
> LSLM 81.800 96.5%
> LSSPI 82.662 95.4%
> LSQuant 81.523 95.2%
> LSHDR 81.733 96.0%
> LSCHDR 83.025 96.1%
> L1LM 81.649 96.1%
> L1SPI 81.805 96.1%
> L1Quant 80.836 95.3%
> L1HDR 83.032 96.4%
> L1CHDR 82.482 96.1%
> SPILM 81.578 96.0%
> SPISPI 81.960 96.2%
> SPIQuant 81.036 95.4%
> SPIHDR 84.404 96.6%
> SPICHDR 82.927 96.1%
> 
> MAE RMSE
> PIBF 12.774 19.428
> LS split 14.412 22.413
> L1 split 14.596 22.569
> SPI split 14.558 22.600
Lastly, we plot the constructed prediction intervals with all 16 methods, for the 15th observation in the test set.
plot(out3, test_id = 15)
Figure 1 presents the prediction intervals and point predictions for the test observation. The methods are ordered in the yaxis based on their resulting PI length. For each method, the red point presents the point prediction and blue lines show the constructed prediction interval(s) for the test observation. If the true response of the test observation is known, it is demonstrated with a dashed vertical line. Note that we may have multiple prediction intervals with the HDR PI method. As we can see from the figure, we may have four different point predictions for the same test observation. The PIBF method and the three splitting rules LS, \(L_1\) and SPI can produce different point predictions. But all PI method variations for the same splitting rule have the same point prediction.
In this section, we compare the predictive performance of the prediction intervals constructed with our proposed method to the existing methods presented in the Introduction using a variety of simulated and real benchmark data sets.
We apply a simulation study based on seven simulated data sets from the literature. The first three of the data sets are Friedman’s benchmark regression problems described in Jerome H. Friedman (1991) and Breiman (1996). We use the CRAN package mlbench (Leisch and Dimitriadou 2021) to generate these data sets.
In Friedman Problem 1, the inputs are 10 independent variables uniformly distributed on the interval \(\left[0,1\right]\). The first five covariates are used to generate the response: \[y = 10 \sin\left(\pi x_1 x_2\right) + 20 \left(x_3  0.5\right)^2 + 10 x_4 + 5 x_5 + \epsilon\] where \(\epsilon\) is \(N\left(0,\sigma^2\right)\) and the default standard deviation of \(\epsilon\) is 1 which yields a signaltonoise ratio (SNR) (i.e., the ratio of the standard deviation of signal to the standard deviation of error) of 4.8:1.
In Friedman Problem 2, the response is generated as \[y = \left(x_1^2 + \left(x_2 x_3  \frac{1}{x_2 x_4}\right)^2\right)^{0.5} + \epsilon\] where the inputs are four independent variables uniformly distributed over the ranges \[\begin{aligned} 0 \leq &x_1 \leq 100 \\ 40 \pi \leq &x_2 \leq 560 \pi \\ 0 \leq &x_3 \leq 1\\ 1 \leq &x_4 \leq 11 \end{aligned}\] and \(\epsilon\) is \(N\left(0,\sigma^2\right)\). The default value of 125, which yields a SNR of 3:1, is used for the standard deviation of \(\epsilon\).
In Friedman Problem 3, the inputs are four independent variables uniformly distributed over the same ranges as Friedman Problem 2. The response is generated as \[y = \arctan{\left(\frac{x_2 x_3  \frac{1}{x_2 x_4}}{x_1}\right)} + \epsilon\] where \(\epsilon\) is \(N\left(0,\sigma^2\right)\) and the default value of 0.01 for the standard deviation of \(\epsilon\) is used, which yields a SNR of 3:1.
The fourth data set is the Peak Benchmark Problem which is also from the mlbench package. Let \(r=3u\) where \(u\) is uniform on \(\left[0,1\right]\) and let \(x\) be uniformly distributed on the ddimensional sphere of radius \(r\). The response is \(y=25\exp\left(0.5r^2\right)\). The default value of \(d=20\) dimensions is used.
The fifth one is a modification of Friedman Problem 1, which was used in Hothorn and Zeileis (2021) in their H2c setup. This data set was designed to have heteroscedasticity. The inputs are 10 independent variables uniformly distributed on the interval \(\left[0,1\right]\). The first five covariates are used in the mean function and the unscaled mean function is defined as \[\mu = 10 \sin\left(\pi x_1 x_2\right) + 20 \left(x_3  0.5\right)^2 + 10 x_4 + 5 x_5\] Then, the scaled mean function on the interval \(\left[1.5,1.5\right]\) is \[\mu^{S} = \frac{3\left(\mu  \mu_{min}\right)}{\mu_{max}  \mu_{min}}1.5\] where \(\mu_{min}\) and \(\mu_{max}\) are the minimum and maximum values of \(\mu\) over the sample. The last five covariates are used in the standard deviation function and the unscaled standard deviation function is \[\sigma = 10 \sin\left(\pi x_6 x_7\right) + 20 \left(x_8  0.5\right)^2 + 10 x_9 + 5 x_{10}\] The standard deviation is scaled as \[\sigma^{S}=\exp\left(\frac{3\left(\sigma\sigma_{min}\right)}{\sigma_{max}\sigma_{min}}1.5\right)\] where \(\sigma_{min}\) and \(\sigma_{max}\) are the minimum and maximum values of \(\sigma\) over the sample. The response is generated as a normal random variable with mean \(\mu^{S}\) and standard deviation \(\sigma^{S}\).
The last two data sets, which were used in Roy and Larocque (2020), have a treebased response variable. The inputs are seven independent variables generated from the standard normal distribution. The response is generated with the seven covariates according to a tree model with a depth of three, with eight terminal nodes: \[\begin{aligned} y = & u_1 I\left(x_1<0, x_2<0, x_4<0\right)\\ & + u_2 I\left(x_1<0, x_2<0, x_4 \geq 0\right)\\ & + u_3 I\left(x_1<0, x_2 \geq 0, x_5<0\right)\\ & + u_4 I\left(x_1<0, x_2 \geq 0, x_5 \geq 0\right)\\ & + u_5 I\left(x_1 \geq 0, x_3<0, x_6<0\right)\\ & + u_6 I\left(x_1 \geq 0, x_3<0, x_6 \geq 0\right)\\ & + u_7 I\left(x_1 \geq 0, x_3 \geq 0, x_7<0\right)\\ & + u_8 I\left(x_1 \geq 0, x_3 \geq 0, x_7 \geq 0\right) + \epsilon \end{aligned}\] where the terminal node means are \(u=\left(5, 10, 15, 20, 25, 30, 35, 40\right)\) and \(I\) is the indicator function. The difference in the two data sets is the distribution of the error. In the first one, \(\epsilon\) is generated from a standard normal distribution and in the other it is from an exponential distribution with mean 1. The signaltonoise ratio is 11.5:1 for both data sets.
We use training sample sizes of \(n_{train}=\left \{200,500,1000,5000\right \}\), resulting in 28 scenarios. Each scenario is repeated 500 times. In each run, we generate an independent test set of new observations with \(n_{test}= 1000\).
We compare our proposed prediction interval estimator with 10 competing
methods which were presented in the Introduction. The first is the
\(\widehat{PI}_\alpha\) method. We fit the random forest with the ranger
package and use the forestError package to build PIs. The second is
the OOB method. The rfinterval package is used. The third is the split
conformal method. The conformalInference package is used. The fourth
is the QRF method and the quantregForest package is used. The fifth is
the GRF method. The function quantile_forest
in the grf package is
used.
The last five are variations of Roy and Larocque (2020). To compare the
performance of the method variations, a comprehensive simulation study
and real data analyses were performed in (Roy and Larocque 2020). One of
the biggest conclusions from these comparison studies was that, among
the three alternative splitting criteria within the CART paradigm, the
impact of the choice of the splitting rule on the performance of the
prediction intervals was moderate whereas the selection of the PI method
had a much greater impact on the performance. Hence, in this simulation
study, we set up the splitting rule to the leastsquares (LS) and only
compare the five PI methods: LM, Quant, SPI, HDR, and CHDR. For those
methods, the rfpi()
function of the RFpredInterval package is used.
We fit the random forest with the ranger package.
For the simulations, we use the following parameters. For all methods, we set the number of trees to 2000. Letting \(p\) be the number of covariates, then the number of covariates to randomly split at each node, mtry, is set to \(\max\left \{\lfloor p/3 \rfloor,1\right \}\) (except for the GRF method). For the GRF method, following Athey et al. (2019), \(mtry\) is set to \(\min\left \{\lceil \sqrt{p}+20 \rceil,p\right \}\). Also, we use the honest splitting regime with the default fraction of 0.5 for the GRF method. The minimum node size parameter for all forests is set to 5. The desired coverage is set to \(95\%\) for all methods. For the proposed method, we perform the crossvalidationbased calibration as the primary calibration procedure, but we also investigate the OOB calibration. For the method variations in Roy and Larocque (2020), we perform the OOB calibration procedure as they proposed. For both calibration procedures, the acceptable range of coverage is set to \(\left[.945,.955\right]\). Calibration is not performed for the competing methods since no option for calibration is offered in their CRAN packages.
We can evaluate the performance of the competing methods with two measures: the mean coverage and the mean prediction interval length. Table 2 presents the average coverage rate of each method on the test set over 500 replications for a given simulated data set and sample size, with average mean prediction interval lengths shown in parentheses. The principal goal of any prediction interval method is to ensure the desired coverage level. In this simulation study, the desired coverage level is set to \(95\%\) for all methods. The left plot in Figure 2 shows the mean coverages over the 28 scenarios for all methods. Overall, all of the methods, except the QRF and GRF methods which tend to be conservative, provide a mean coverage close to the desired level. QRF and GRF methods have an average mean coverage of 0.975 and 0.974 over all scenarios, respectively. Although the \(\widehat{PI}_\alpha\) method has an average of the mean coverages of 0.957, close to the desired level, its variability is large. Over 308 (11 methods \(\times\) 28 scenarios) average coverage values, there is only one case where the mean coverage is below 0.94. It corresponds to the \(\widehat{PI}_\alpha\) method in Friedman Problem 2 with \(n_{train}=5000\).
Once the prediction intervals provide the desired coverage level, the next goal of any PI method is to provide the shortest PI length. Prior to carrying out a detailed comparison of interval lengths, we can globally compare the interval lengths over all scenarios with the percentage increase in mean PI length of a method with respect to the best method for a given run. For a given run, define \(ml_{i}\) as the mean PI length of method \(i\) and \(ml^{*}\) as the shortest mean PI length over the 10 competing methods. The percentage increase in PI length for method \(i\) is computed as \(100 \times \left(ml_i  ml^{*}\right)/ml^{*}\). Smaller values for this measure indicate better performances. The right plot in Figure 2 presents the relative lengths of the methods across 14,000 runs (28 scenarios \(\times\) 500 replications). The prediction intervals with the GRF method are the widest, followed by the QRF method. However, as we saw in the left plot in Figure 2, GRF and QRF produce conservative prediction intervals, so their PI lengths cannot be fairly compared to the other methods with a coverage closer to 0.95. Based on the global results, the proposed method, PIBF, performs the best. Following PIBF, \(\widehat{PI}_\alpha\), OOB, LM, SPI, HDR and CHDR perform similarly well, with \(\widehat{PI}_\alpha\) being slightly better. Among the variations of Roy and Larocque (2020), the PIs with quantiles produce longer prediction intervals than the other four variations.
Now, we investigate the performance of the methods separately for each scenario. See figures S1 to S7 in the Supplementary Material for the mean PI length results of each method for each simulated data set. Each figure has four facets corresponding to the four levels of the training sample size. For all methods and data sets, the mean PI lengths and their variability decrease as the sample size increases (except the GRF method for the treebased data sets). We see that for Friedman Problem 1, from Figure S1, for all sample sizes, PIBF consistently outperforms the 10 competing methods in terms of mean PI length while ensuring the desired coverage level (see Table 2 for the mean coverage results). QRF and GRF provide the widest prediction intervals for all sample sizes. However, as presented in Table 2, these methods heavily overcover and are therefore not comparable with the other methods. While the \(\widehat{PI}_\alpha\) method also slightly overcovers, it has shorter PIs than other methods.
For Friedman Problem 2 (see Figure S2 in the Supplementary Material), we see that the proposed method has the shortest mean PI length for the smallest sample size, and as the sample size increases \(\widehat{PI}_\alpha\) provides shorter PIs. However, we should also take into account the mean coverages presented in Table 2. The proposed method has smaller coverage levels for \(n_{train}=200\) compared to \(\widehat{PI}_\alpha\), but as the sample size increases the coverage levels decrease for the \(\widehat{PI}_\alpha\) method (up to 0.937 for \(n_{train}=5000\)) whereas PIBF keeps it around 0.945. For \(n_{train}=5000\), the OOB method builds shorter PIs while ensuring the desired coverage level. Again, QRF and GRF have the widest PIs for all sample sizes due to their conservative PIs.
The performance of PIBF and \(\widehat{PI}_\alpha\) is very similar for Friedman Problem 3 (see Figure S3 in the Supplementary Material). Both methods provide the shortest PIs with similar coverage levels and mean PI lengths. Results for the Peak Benchmark Problem presented in Figure S4 are very similar to those of Friedman Problem 1. For all sample sizes, PIBF consistently outperforms the 10 competing methods in terms of mean PI length. But this time, SPI method with LS splitting rule comes in second place.
For the H2c setup (see Figure S5 in the Supplementary Material), which is the modification of Friedman Problem 1, we can see that all methods are comparable since QRF and GRF do not overcover. In this setting, all methods also perform fairly well with respect to PI length. Overall, the LM prediction interval method with the LS splitting rule provides slightly shorter PIs.
For the treebased data sets (see figures S6 and S7 in the Supplementary Material), overall, it seems that the distribution of the error does not have a significant effect on the results. Again, QRF and GRF have conservative PIs. Unlike the other data sets, we see here that the mean PI lengths of the GRF method decrease very slowly as the sample size increases. For \(n_{train}=5000\), all methods (except QRF and GRF) perform similarly. For the smallest sample size, HDR PI building methods and the proposed method perform slightly better than other methods. As the sample size increases, PIBF produces the shortest prediction intervals.
\(n_{train}\)  Data  PIBF  \(\widehat{PI}_\alpha\)  OOB  LM  Quant  SPI  HDR  CHDR  SC  QRF  GRF 

200  Friedman 1  0.952 (7.69)  0.959 (9.7)  0.949 (10.3)  0.955 (11.2)  0.956 (12.5)  0.956 (12.1)  0.955 (12.2)  0.956 (11.5)  0.952 (12)  0.978 (15.3)  0.968 (17.7) 
Friedman 2  0.942 (635)  0.951 (679)  0.947 (762)  0.956 (772)  0.955 (990)  0.956 (898)  0.955 (793)  0.954 (800)  0.951 (912)  0.969 (1103)  0.972 (1241)  
Friedman 3  0.944 (0.57)  0.948 (0.62)  0.949 (0.75)  0.956 (0.69)  0.953 (0.83)  0.952 (0.74)  0.951 (0.88)  0.954 (0.76)  0.953 (0.93)  0.961 (0.89)  0.968 (0.88)  
Peak  0.958 (9.06)  0.956 (12.4)  0.949 (13.3)  0.955 (12)  0.956 (17.9)  0.957 (10.5)  0.955 (15.4)  0.955 (11.1)  0.951 (15.5)  0.972 (20.6)  0.978 (20.2)  
H2c  0.954 (6.5)  0.954 (6.3)  0.955 (6.26)  0.955 (5.86)  0.956 (6.42)  0.957 (6.47)  0.956 (6.61)  0.955 (6.45)  0.959 (6.68)  0.952 (6.09)  0.957 (6.35)  
TreeN  0.949 (9.12)  0.967 (12.3)  0.947 (14.2)  0.956 (11.9)  0.956 (22.3)  0.958 (15.4)  0.955 (9.08)  0.955 (11.9)  0.951 (18.9)  0.979 (29.6)  0.960 (36)  
Treeexp  0.949 (9.31)  0.967 (12.4)  0.947 (14.2)  0.955 (11.9)  0.956 (22.1)  0.959 (15.1)  0.956 (10.4)  0.956 (12.5)  0.950 (18.9)  0.979 (29.5)  0.960 (35.8)  
500  Friedman 1  0.952 (6.32)  0.962 (8.25)  0.948 (8.81)  0.953 (9.37)  0.952 (10.1)  0.953 (9.87)  0.952 (9.81)  0.953 (9.47)  0.951 (10.1)  0.986 (14)  0.978 (16.9) 
Friedman 2  0.945 (586)  0.946 (585)  0.948 (650)  0.952 (676)  0.952 (820)  0.952 (744)  0.951 (679)  0.951 (690)  0.952 (751)  0.975 (998)  0.979 (1120)  
Friedman 3  0.943 (0.5)  0.946 (0.53)  0.948 (0.62)  0.952 (0.61)  0.951 (0.71)  0.951 (0.65)  0.951 (0.7)  0.951 (0.67)  0.952 (0.75)  0.965 (0.79)  0.970 (0.77)  
Peak  0.957 (6.44)  0.967 (9.97)  0.951 (11)  0.953 (9.75)  0.954 (15)  0.954 (8.22)  0.956 (11.6)  0.953 (8.93)  0.952 (12.8)  0.980 (18.7)  0.984 (17.8)  
H2c  0.953 (5.81)  0.956 (5.88)  0.951 (5.89)  0.952 (5.43)  0.953 (5.83)  0.953 (5.81)  0.954 (5.92)  0.952 (5.84)  0.955 (6.17)  0.954 (5.79)  0.959 (5.94)  
TreeN  0.948 (6.16)  0.969 (8.17)  0.949 (9.85)  0.952 (8.23)  0.953 (15.5)  0.953 (10)  0.954 (7.31)  0.952 (9.6)  0.952 (13.3)  0.984 (25.8)  0.969 (35.7)  
Treeexp  0.947 (6.31)  0.966 (8.3)  0.949 (9.93)  0.952 (8.19)  0.953 (15.4)  0.954 (9.67)  0.954 (7.93)  0.953 (9.96)  0.953 (13.4)  0.984 (25.8)  0.970 (35.5)  
1000  Friedman 1  0.953 (5.67)  0.964 (7.42)  0.949 (7.95)  0.954 (8.37)  0.954 (8.89)  0.954 (8.75)  0.954 (8.65)  0.953 (8.4)  0.950 (8.85)  0.990 (12.9)  0.985 (15.9) 
Friedman 2  0.945 (565)  0.942 (546)  0.950 (594)  0.953 (635)  0.953 (735)  0.953 (682)  0.952 (637)  0.953 (645)  0.951 (658)  0.977 (910)  0.983 (1022)  
Friedman 3  0.944 (0.47)  0.945 (0.48)  0.948 (0.55)  0.954 (0.58)  0.952 (0.65)  0.952 (0.61)  0.951 (0.63)  0.952 (0.63)  0.949 (0.63)  0.967 (0.73)  0.971 (0.72)  
Peak  0.957 (5)  0.972 (8.44)  0.950 (9.54)  0.952 (8.51)  0.953 (13.2)  0.953 (7.21)  0.956 (9.35)  0.952 (7.82)  0.951 (11)  0.983 (17.5)  0.986 (16.7)  
H2c  0.952 (5.48)  0.955 (5.64)  0.949 (5.69)  0.948 (5.17)  0.949 (5.49)  0.950 (5.51)  0.950 (5.49)  0.949 (5.49)  0.949 (5.78)  0.954 (5.61)  0.958 (5.71)  
TreeN  0.946 (5.11)  0.965 (6.2)  0.950 (7.52)  0.954 (6.67)  0.953 (12.1)  0.954 (7.91)  0.954 (6.21)  0.953 (8.33)  0.950 (9.94)  0.985 (21.6)  0.976 (35.1)  
Treeexp  0.946 (5.16)  0.964 (6.39)  0.949 (7.64)  0.954 (6.62)  0.953 (11.9)  0.954 (7.45)  0.955 (6.5)  0.953 (8.45)  0.950 (10)  0.985 (21.6)  0.975 (34.9)  
5000  Friedman 1  0.950 (4.71)  0.967 (6.04)  0.950 (6.47)  0.953 (6.67)  0.953 (6.97)  0.953 (6.91)  0.954 (6.78)  0.953 (6.67)  0.950 (7.03)  0.995 (10.8)  0.992 (13.2) 
Friedman 2  0.945 (543)  0.937 (507)  0.950 (529)  0.952 (574)  0.951 (610)  0.951 (592)  0.951 (570)  0.951 (573)  0.950 (549)  0.975 (717)  0.982 (781)  
Friedman 3  0.945 (0.44)  0.941 (0.43)  0.950 (0.46)  0.953 (0.53)  0.952 (0.56)  0.952 (0.55)  0.951 (0.54)  0.952 (0.56)  0.950 (0.49)  0.966 (0.62)  0.971 (0.62)  
Peak  0.955 (2.84)  0.978 (5.92)  0.952 (7.11)  0.952 (6.54)  0.954 (10.1)  0.952 (5.68)  0.957 (6.46)  0.951 (6.04)  0.950 (8)  0.988 (14.9)  0.990 (15.2)  
H2c  0.949 (5)  0.953 (5.31)  0.943 (5.39)  0.941 (4.82)  0.943 (5.01)  0.944 (5.06)  0.944 (5.03)  0.942 (4.98)  0.942 (5.41)  0.953 (5.33)  0.958 (5.31)  
TreeN  0.945 (4.33)  0.949 (4.37)  0.951 (4.77)  0.957 (5.25)  0.951 (7.03)  0.950 (5.4)  0.953 (4.78)  0.951 (5.89)  0.950 (5.55)  0.979 (10.5)  0.986 (32.1)  
Treeexp  0.945 (4.08)  0.959 (4.39)  0.950 (4.91)  0.955 (5.12)  0.951 (6.87)  0.952 (4.78)  0.954 (4.36)  0.949 (5.49)  0.950 (5.73)  0.978 (10.5)  0.986 (31.8) 
In this section, we investigate the effect of the proposed calibration on performance of the prediction intervals. We apply the same simulation study using the seven simulated data sets. We compare the results of the proposed method without calibration, with OOB calibration and calibration with crossvalidation. The desired coverage level is set to 95%. In Figure 3, the left plot presents the mean coverages over 28 scenarios for the three variations, and the right plot shows the percentage increase in mean PI length of each of the three calibration variants across 14,000 runs (28 scenarios \(\times\) 500 replications). Although we obtain the shortest prediction intervals without calibration, the variability of the mean coverage level is larger and sometimes the coverage falls below 0.94. Looking at the left plot, we can say that the variability of the mean coverage level decreases with both calibration procedures. However, applying OOB calibration provides conservative PIs. The median of the mean coverage level is more than 0.96 and the PIs with the OOB calibration are the widest. Applying calibration with CV produces slightly longer PIs than those with no calibration, but these PIs have coverage levels closer to the desired level.
In the package, both calibration procedures are implemented for the PIBF
method. Simulation study results show that, compared to the OOB
calibration, calibration with CV produces shorter PIs while maintaining
the desired coverage level. Therefore, the default calibration procedure
is set to CV in the pibf()
function. In terms of computational time
(see tables 4 and 5), calibration
with \(k\)fold CV is slower than OOB calibration since it needs to fit
two additional random forests for each fold.
To further explore the performance of the prediction intervals built with the proposed method, we use 11 real data sets. Since two of the data sets have two response variables, we consider that we are analyzing 13 real data sets. Boston housing and Ames housing data sets are from the R packages mlbench and AmesHousing, respectively. The other data sets are obtained through the UCI Machine Learning Repository (Dua and Graff 2017).
For each data set, we apply 100 times 10fold crossvalidation for each method. Hence, for each fold, the training and testing sets correspond to 90% and 10% of the whole data set, respectively. The desired coverage level is set to 95% for all methods. Table 3 presents the results of the real data analyses (n is the number of observations and p is the number of predictors) with the average coverage rate of each method over 100 repetitions, and mean prediction interval lengths averaged over 100 repetitions shown in parentheses. Figure 4 illustrates the global results of the analyses across datasets. The left plot in Figure 4 shows the mean coverages over the 13 real data sets for all methods. Similar to what we have seen with the simulated data sets, the QRF and GRF methods produce conservative prediction intervals, whereas the other methods provide a mean coverage close to the desired level. Again, the \(\widehat{PI}_\alpha\) method maintains the target level on average with 0.959, but its variability is the highest among all methods. Across all data sets, there are three cases where the mean coverage is below 0.94: the proposed method for Concrete slump, and the \(\widehat{PI}_\alpha\) method for Auto MPG and Computer hardware.
The right plot in Figure 4 presents the relative lengths of methods across 13 real data sets. For each method, there are 13 points in the boxplot, and each point corresponds to the percentage increase in mean PI length compared to the best method for a single real data set. Again, the prediction intervals with GRF and QRF are the widest among eleven methods. Among the other nine methods, the proposed method performs the best, followed by \(\widehat{PI}_\alpha\).
For each real data set, we analyze the performance of each method through the mean PI lengths presented in figures S8 to S10 in the Supplementary Material. For the Abalone data set, the HDR method produces the shortest prediction intervals, followed by the SPI and Quant methods. While the QRF method overcovers, its PIs are no wider than those of most of the other methods. The proposed method, PIBF, is distinctly the best prediction interval method yielding the shortest PI lengths for the Air quality data set with absolute and relative humidity response variables, Airfoil selfnoise, Ames housing, Boston housing, Concrete compression, Energy efficiency data set with cooling and heating load response variables, and Servo data sets. In the Auto MPG data set, \(\widehat{PI}_\alpha\) has the shortest mean PI length but with a mean coverage of 0.929. Among the other methods, PIBF, OOB, LM, Quant and SPI methods show similarly good performances while maintaining the desired coverage level. For the Computer hardware data set, PIBF, \(\widehat{PI}_\alpha\), and SPI methods perform better than the other methods. They have similar mean PI lengths. For the Concrete slump data set, the proposed method has the shortest mean PI length, but with a slightly smaller coverage of 0.939. This data set is the only one of the simulated and real data sets where the proposed method has a mean coverage below 0.94. After PIBF, \(\widehat{PI}_\alpha\) and LM show a good performance with a mean coverage close to the target level. Overall, we can conclude that the proposed method shows better performance than the 10 competing methods for almost all of the real data sets.
Data  n  p  PIBF  \(\widehat{PI}_\alpha\)  OOB  LM  Quant  SPI  HDR  CHDR  SC  QRF  GRF 

Abalone  4177  8  0.947 (8.18)  0.946 (8.05)  0.950 (8.9)  0.949 (8.09)  0.953 (6.81)  0.951 (6.73)  0.946 (5.58)  0.951 (8.12)  0.950 (9.04)  0.971 (8.03)  0.978 (8.82) 
Air quality (AH)  6941  13  0.948 (0.22)  0.966 (0.29)  0.950 (0.34)  0.954 (0.29)  0.954 (0.33)  0.954 (0.31)  0.954 (0.34)  0.954 (0.29)  0.950 (0.4)  0.992 (0.55)  0.994 (0.78) 
Air quality (RH)  0.949 (12.9)  0.971 (18.7)  0.950 (21.6)  0.953 (19.6)  0.953 (22.5)  0.954 (20.6)  0.954 (21.1)  0.953 (19.3)  0.950 (25.5)  0.989 (35)  0.989 (44.3)  
Airfoil selfnoise  1503  5  0.946 (8.47)  0.972 (12.9)  0.950 (13.3)  0.952 (18.3)  0.952 (19.8)  0.953 (18.9)  0.952 (19.5)  0.953 (17.9)  0.951 (14.7)  0.987 (20.4)  0.988 (18.1) 
Ames housing  2929  80  0.955 (69.3)  0.961 (81.6)  0.950 (90.3)  0.954 (79.3)  0.953 (80.5)  0.953 (80.5)  0.952 (80.3)  0.952 (80.4)  0.951 (96.4)  0.987 (118)  0.993 (172) 
Auto MPG  392  7  0.945 (10)  0.929 (9.76)  0.947 (11.1)  0.953 (10.5)  0.953 (10.9)  0.953 (10.5)  0.954 (13.2)  0.952 (11.5)  0.955 (13.1)  0.967 (12)  0.977 (16.1) 
Boston housing  506  13  0.942 (10.5)  0.948 (11.2)  0.949 (12.3)  0.954 (11.7)  0.953 (11.8)  0.953 (11.4)  0.952 (12.2)  0.952 (12)  0.951 (15)  0.982 (15.7)  0.990 (25.9) 
Computer hardware  209  6  0.942 (139)  0.939 (140)  0.946 (171)  0.952 (163)  0.951 (163)  0.950 (144)  0.951 (157)  0.950 (156)  0.957 (248)  0.965 (182)  0.985 (365) 
Concrete compression  1030  8  0.948 (14.8)  0.957 (18)  0.950 (19.6)  0.954 (20.2)  0.953 (27.2)  0.954 (22.8)  0.953 (27.7)  0.953 (21.5)  0.950 (26)  0.982 (34.5)  0.989 (46.6) 
Concrete slump  103  7  0.939 (12.2)  0.947 (14.2)  0.948 (15.3)  0.949 (14.9)  0.949 (20.6)  0.944 (19.4)  0.945 (16.3)  0.951 (15.4)  0.958 (22.1)  0.956 (22.3)  0.970 (28.3) 
Energy efficiency (CL)  768  8  0.953 (3.71)  0.977 (5)  0.950 (8.01)  0.952 (7.37)  0.952 (7.9)  0.952 (7.04)  0.951 (8.03)  0.953 (6.46)  0.951 (8.54)  0.985 (8.18)  0.986 (20.4) 
Energy efficiency (HL)  0.951 (1.48)  0.987 (3.43)  0.950 (4.16)  0.953 (6.99)  0.952 (7.17)  0.954 (6.35)  0.951 (5.52)  0.952 (4.98)  0.952 (4.84)  0.988 (7.79)  0.989 (19.9)  
Servo  167  4  0.957 (18.4)  0.966 (24.2)  0.948 (25.5)  0.953 (26.6)  0.957 (32.9)  0.955 (27.2)  0.954 (28.2)  0.955 (26.8)  0.959 (30.4)  0.983 (37.9)  0.977 (42.9) 
All simulations and real data analyses were conducted in R version 3.6.0 on a Linux machine with Intel(R) Xeon(R) E52667 v3 @ 3.20GHz with 396 GB of memory. The average computational time of each method for the simulated and real data sets are presented in tables 4 and 5. For the proposed method (PIBF), the computational times for both calibration methods, crossvalidation and OOB, are presented in the tables. We can see from the tables that, for most of the data sets, calibration with crossvalidation has longer running times than OOB calibration, which is expected since with the kfold crossvalidation, we fit \(2k\) more random forests than applying OOB calibration.
For the variations of (Roy and Larocque 2020), since we can build
prediction intervals with the five PI methods by only fitting a single
random forest with a selected splitting rule, we present the total
computational time for building the five variations under RFPI. To be
clear, for a given splitting rule, the rfpi()
function fits a random
forest and then the set of PI methods requested by the user are applied
to the output of the random forest. In our simulations, we choose to
return all five PI methods for the selected splitting rule, i.e. when
we measure the running time of the rfpi()
function, we get the total
running time of building five prediction intervals. Therefore, we should
interpret the values for RFPI with care while comparing the
computational times of the methods. Although not all PI methods have
similar computational complexities, we can say that even the average
time of building prediction intervals with one of these variations,
assuming they have similar running times, is reasonable. Since, for the
HDRbased PI methods, an optimal bandwidth has to be chosen, which is a
timeconsuming process, among the five PI methods, the slowest ones are
the HDR and CHDR. From the remaining variations, the classical method,
LM, is the fastest, followed by the Quant and SPI methods.
For both the simulations and real data analyses, the OOB and GRF methods have the smallest running times. For most of the methods, the increase in the sample size has a mild effect on the ratio of increase in running times. However, for the split conformal method with the simulated data sets, running times increase more than the proportional increase in sample sizes. Similarly, we can see that the QRF method is also affected from the training sample size as it rises to 5000.
\(n_{train}\)  Data  PIBFCV  PIBFOOB  \(\widehat{PI}_\alpha\)  OOB  RFPI  SC  QRF  GRF 

200  Friedman 1  21.36  13.44  9.62  0.25  87.79  0.61  3.05  0.27 
Friedman 2  21.83  13.96  9.62  0.23  59.50  0.44  2.61  0.26  
Friedman 3  28.84  16.03  11.46  0.20  75.47  0.41  2.50  0.22  
Peak  18.76  10.95  11.91  0.28  73.15  0.89  4.05  0.43  
H2c  12.92  9.83  7.99  0.27  36.94  0.79  4.05  0.30  
TreeN  22.75  21.17  12.21  0.23  100.02  0.50  2.81  0.26  
Treeexp  16.28  15.62  15.65  0.23  87.49  0.53  2.85  0.25  
500  Friedman 1  38.37  14.63  8.50  0.44  96.18  2.45  9.29  0.66 
Friedman 2  27.07  13.64  7.64  0.43  84.19  1.80  6.79  0.47  
Friedman 3  18.85  17.49  7.98  0.47  70.78  1.86  4.96  0.45  
Peak  29.16  18.14  9.72  0.70  212.23  2.29  7.49  1.26  
H2c  14.84  11.29  11.20  0.44  58.60  1.74  4.95  0.78  
TreeN  17.80  20.31  7.70  0.51  183.92  2.21  4.28  0.51  
Treeexp  18.05  19.05  7.74  0.51  170.24  1.72  4.45  0.49  
1000  Friedman 1  32.92  18.07  12.11  0.64  181.30  3.35  9.54  1.50 
Friedman 2  23.07  15.94  10.25  0.45  135.64  2.14  6.26  0.90  
Friedman 3  23.24  16.07  7.37  0.44  96.76  2.27  6.40  0.69  
Peak  51.93  33.12  7.95  0.83  374.89  5.35  19.71  1.65  
H2c  26.81  12.56  8.07  0.59  80.38  3.68  13.11  0.87  
TreeN  22.84  16.70  7.11  0.47  288.35  2.72  9.53  0.78  
Treeexp  21.06  17.26  7.03  0.48  272.52  2.62  10.38  0.69  
5000  Friedman 1  155.81  139.88  28.70  3.72  928.84  40.79  134.97  4.84 
Friedman 2  155.62  101.58  35.25  2.30  430.80  33.43  60.99  2.65  
Friedman 3  106.70  91.72  34.41  2.36  330.81  28.47  73.18  2.72  
Peak  287.67  245.17  22.76  6.84  1914.56  57.96  123.19  11.80  
H2c  283.53  107.57  24.12  4.31  271.18  42.69  79.19  5.01  
TreeN  106.17  71.78  22.97  3.19  753.32  26.88  74.81  3.86  
Treeexp  95.85  68.90  21.76  3.32  779.42  26.09  64.02  3.85 
Data  n  p  PIBFCV  PIBFOOB  \(\widehat{PI}_\alpha\)  OOB  RFPI  SC  QRF  GRF 

Abalone  4177  8  102.37  64.39  17.65  5.20  422.10  20.42  45.39  6.32 
Air quality (AH)  6941  13  383.60  126.95  26.63  11.55  1256.56  45.45  126.36  16.07 
Air quality (RH)  383.53  124.71  26.60  11.55  1111.16  46.13  127.02  15.95  
Airfoil selfnoise  1503  5  256.92  16.61  3.57  0.31  271.35  1.80  3.65  0.65 
Ames housing  2929  80  195.07  28.56  15.64  8.73  333.28  42.95  226.49  11.34 
Auto MPG  392  7  11.02  7.08  1.82  0.35  47.84  0.58  1.73  0.40 
Boston housing  506  13  14.77  8.89  2.32  0.49  63.20  1.31  3.63  0.70 
Computer hardware  209  6  6.53  3.80  0.93  0.22  26.49  0.28  0.98  0.24 
Concrete compression  1030  8  28.28  17.80  3.38  0.38  257.19  2.03  5.86  0.66 
Concrete slump  103  7  9.52  2.10  0.65  0.14  26.60  0.13  0.62  0.11 
Energy efficiency (CL)  768  8  85.97  15.93  2.51  0.34  341.73  0.94  2.39  0.52 
Energy efficiency (HL)  99.95  20.31  2.47  0.34  380.97  0.92  2.42  0.42  
Servo  167  4  12.05  3.99  0.72  0.15  53.14  0.12  0.65  0.15 
In this paper, we have introduced an R package named RFpredInterval. This package implements 16 methods to build prediction intervals with random forests: a new method to build Prediction Intervals with Boosted Forests (PIBF) and 15 different variations to produce prediction intervals with random forests proposed by Roy and Larocque (2020). PIBF provides biascorrected point predictions obtained with the onestep boosted forest and prediction intervals by using the nearest neighbour outofbag observations to estimate the conditional prediction error distribution.
We performed an extensive simulation study with a variety of simulated data sets and applied real data analyses to investigate the performance of the proposed method. The performance was evaluated based on the coverage level and length of the prediction intervals. We compared the performance of the proposed method to 10 existing methods for building prediction intervals with random forests. The proposed method was able to maintain the desired coverage level with both the simulated and real data sets. In terms of the PI lengths, globally, the proposed method provided the shortest prediction intervals among all methods. The conclusions drawn from the analysis of real data sets were very similar to those with the simulated data sets. This provides evidence for the reliability of the proposed method. All results obtained indicate that the proposed method can be used with confidence for a variety of regression problems.
Note that the coverage rate of prediction intervals for new observations can have several interpretations. An interesting discussion about this issue is given in Mayr et al. (2012). In that paper, the authors presented two interpretation of coverage: sample coverage and conditional coverage. Sample coverage means that if we draw a new sample from the same population as the training sample and build PIs with a desired coverage level of \(\left(1\alpha\right)\), then the global coverage rate over this sample will be \(\left(1\alpha\right)\). The conditional coverage means that if we sample many new observations always having the same set of covariates and build PIs for them with a desired coverage level of \(\left(1\alpha\right)\), then about \(\left(1\alpha\right)100\%\) of these prediction intervals will contain the true value of the response. To hold a desired level of conditional coverage, the predictive method needs to provide the desired coverage level over the entire covariate space. On the other hand, sample coverage needs only maintain the desired coverage level over the new sample, on average. Therefore, if the conditional coverage holds, then the sample coverage also holds. In practice, predictive models are mostly evaluated with their global predictive performance. Hence, ensuring that the sample coverage level is achieved should be sufficient for most applications. The proposed calibration method with crossvalidation is designed to ensure the sample coverage property. From the simulation study and real data analyses, we can see that the sample coverage is attained with the proposed calibration method.
The package is available from CRAN at https://cran.rproject.org/package=RFpredInterval. The development version is available at https://github.com/calakus/RFpredInterval.
This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and by Fondation HEC Montréal.
RFpredInterval, quantregForest, grf, rfinterval, forestError, randomForestSRC, ranger, AmesHousing, mlbench
CausalInference, Econometrics, HighPerformanceComputing, MachineLearning, MissingData, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Alakus, et al., "RFpredInterval: An R Package for Prediction Intervals with Random Forests and Boosted Forests", The R Journal, 2022
BibTeX citation
@article{RJ2022012, author = {Alakus, Cansu and Larocque, Denis and Labbe, Aurélie}, title = {RFpredInterval: An R Package for Prediction Intervals with Random Forests and Boosted Forests}, journal = {The R Journal}, year = {2022}, note = {https://rjournal.github.io/}, volume = {14}, issue = {1}, issn = {20734859}, pages = {300320} }