Prediction Error Estimation in Random Forests

Ian Krupkin, Johanna Hardin

Department of Mathematics & Statistics, Pomona College, 610 N. College Ave., Claremont, CA 91711, USA

Abstract

In this paper, error estimates of classification Random Forests are quantitatively assessed. Based on the initial theoretical framework built by Bates et al. [1], the true error rate and expected error rate are theoretically and empirically investigated in the context of a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests’ estimates of prediction error are closer on average to the true error rate than to the average prediction error. This is the opposite of the findings of Bates et al. [1], which are given for logistic regression. We further show that our result holds across different error estimation strategies such as cross-validation, bagging, and data splitting.

Keywords: true error rate, expected error rate, cross-validation, bootstrapping, bagging, data-splitting

MSC Classification: 62J20, 62F40

1 Introduction

As evidenced by the 2017 State of Data Science and Machine Learning report by Kaggle, almost half of data scientists use Random Forests at work [11]. Random Forests [2] are a popular classification tool in a variety of fields, especially because of their excellent performance in very complex data settings. Out-of-bag (OOB) errors are a theoretically and computationally simple improvement over a train-test split, which has led to their ubiquity. When deploying a predictive model, it is important to understand its prediction accuracy on future test points; both good point estimates and an understanding of the variability of the estimates are essential. When Random Forests are implemented, the OOB error is a widely used approach for point and interval estimation, but in spite of OOB’s seeming simplicity, its properties remain opaque. In the past, the OOB error has been affirmed to be an unbiased estimate of the true error rate [16, 6]. Nonetheless, it has been shown that for two-class classification problems the OOB error can overestimate the true prediction error [3, 13]. It was later argued that the use of stratified subsampling with sampling fractions that are proportional to response class sizes of the training data yielded almost unbiased error rates [9]. The present work is primarily concerned with OOB errors, but also addresses other common methods such as data splitting and cross-validation, as well as their combination with OOB errors.

Despite the apparent straightforwardness of data splitting, cross-validation, and bagging, the formal properties of these modeling techniques are subtle. When calculating an error rate, the question of “what are we estimating?” rightfully reappears often. In an innovative paper, Bates et al. [1] provide a framework for understanding and measuring prediction error in the case of least squares linear models. They prove that cross-validation does not estimate the true error rate (see Equation (1)) for the model at hand, fit to the training data. Instead, the empirical cross-validation error “estimates the average prediction error of models fit on other unseen training sets drawn from the same population,” i.e., the expected error rate (see Equation (2)). They expand their work in simulations to show that the same trend holds for logistic regression.

To our knowledge, in the case of Random Forests, ours is the first study investigating the different types of prediction error as a metric for the accuracy of the model on training, cross-validated, out-of-bag, test, and future data. We illustrate and extend the work of Bates et al. [1] to the setting of Random Forests, where empirical evidence shows the opposite of what Bates et al. [1] proved for (generalized) linear models. For Random Forests, the empirical errors (found via data splitting, cross-validation, or bagging) are closer, on average, to the true error rate (see Equation (1)) than to the expected error rate (see Equation (2)). The distinction between the two error rates is quite subtle. Section 2 explains the differences between the two theoretical error types in detail.

The main contribution of our work is two-fold: (i) investigation of the switch in direction of the proximity of Random Forests’ error estimates to the error targets from that presented by Bates et al. [1] on logistic regression, and (ii) exploration of the performance of an expanded set of error estimation strategies.

The paper is structured as follows: In Section 2, we set up notation and introduce the two different error targets: true error rate ($\text{Err}_{XY}$) and expected error rate (Err). Subsequently, Section 3 introduces simulation-based studies. The descriptions include an outline of the simulated data, the considered settings, and several model building workflows that will be investigated. In Section 4, we present the results of the studies. The results are discussed in Section 5 alongside recommendations.

All code used for the simulations is available at:
https://github.com/iankrupkin/Prediction-Error-Estimation-in-Random-Forests

2 What Prediction Error are We Estimating?

Before turning to our main method in the next section, we introduce our notation and review topics related to error targets. We consider the supervised learning setting where we have $p$ features and $n$ observations, denoted by $\mathbf{X}_i = (X_{i,1}, \dots, X_{i,p})$, the $i^{th}$ row of $\mathbf{X} \in \mathbb{R}^{n} \times \mathbb{R}^{p}$, and response $\mathbf{Y} = (Y_1, \dots, Y_n) \in \mathbb{R}^{n}$. We assume that the data points $(\mathbf{X}_i, Y_i)$ for $i = 1, \dots, n$ are independent and identically distributed from some underlying distribution $P$ on $\mathbb{R}^{p+1}$. We wish to understand the performance of our fitted model when generalized to unseen data points, which can be formalized by a loss function on an observed response $y$ compared to a predicted response $\hat{y}$:

\[ \ell(\hat{y}, y) \geq 0 \]

such that $\ell(y, y) = 0$ for all $y$. The form of $\ell$ need not be specified and could be squared error loss, misclassification error, cross-entropy, etc. Now consider a model $f(\cdot)$ parameterized by $\theta$. Let $\hat{f}(\mathbf{x}, \theta)$ be the function that predicts $y$ from $\mathbf{x} \in \mathbb{R}^{p}$ using the model with true parameters $\theta$, which take values in the space $\Theta$. Let $\mathcal{A}$ be a model-fitting algorithm that takes any number of data points and returns an estimated parameter vector $\hat{\theta} \in \Theta$. Hence, $\hat{\theta} = \mathcal{A}(\mathbf{X}, Y)$ is the fitted value of the parameter based on the observed data $\mathbf{X}$ and $Y$. Let $(\widetilde{\mathbf{X}}, \widetilde{Y}) \sim P$ be another independent test point (or dataset) from the same distribution. Using the training data, we are interested in finding the function $\hat{f}(\mathbf{x}, \hat{\theta})$ that minimizes the loss $\ell(\hat{f}(\widetilde{\mathbf{X}}, \hat{\theta}), \widetilde{Y})$. Note that $\ell(\hat{f}(\widetilde{\mathbf{X}}, \hat{\theta}), \widetilde{Y})$ is a random and unknown object, and our target is one of two quantities:

\begin{align}
\text{True Error Rate: } \text{Err}_{XY} &:= E[\ell(\hat{f}(\widetilde{\mathbf{X}}, \hat{\theta}), \widetilde{Y}) \mid (\mathbf{X}, Y)] \tag{1} \\
\text{Expected Error Rate: } \text{Err} &:= E[\ell(\hat{f}(\widetilde{\mathbf{X}}, \hat{\theta}), \widetilde{Y})] = E[\text{Err}_{XY}] \tag{2}
\end{align}

Equations (1) and (2) are the two most natural quantities of interest to the analyst. Known as the true error rate, $\text{Err}_{XY}$ is the expected test error of the model that is fit on our actual training set. Err is the expected average error of the fitting algorithm run on same-sized datasets drawn from the underlying distribution $P$ and is called the expected error rate. It is important to note that Err is a constant with respect to $(\mathbf{X}, Y)$, while $\text{Err}_{XY}$ is a function of $(\mathbf{X}, Y)$ and is therefore a random variable [1, 14]. In the Random Forest context, there is randomness associated with the model building itself (unlike OLS, which is deterministic given a dataset). However, the expected values in Equations (1) and (2) average over the randomness associated with the model building. There has been some work to formalize the distributional results associated with Random Forests (e.g., see Mentch and Hooker [12]), but the formal distributional structure is outside the scope of our work.

Depending on the context of the statistical analysis, one may prefer to estimate either $\text{Err}_{XY}$ or Err. The former quantity is of the most interest to a practitioner deploying a specific model, whereas the latter may be of interest to a researcher comparing different fitting algorithms. To illustrate this difference, consider the following examples.

Suppose Statistician A is trying to estimate the average height of penguins found in the wild, based on a sample of 100 researched penguins. Statistician A will use the original sample to build a model to estimate the average height of the next sample of penguins. When presenting the model, they will be interested in the true error rate ($\text{Err}_{XY}$) of their specific model because they will want to know how the specific model they built will perform on the next dataset. That is, the research question focuses on the single model and its ability to predict future data. As Bates et al. [1] describe it, $\text{Err}_{XY}$ “is of the most interest to a practitioner deploying a specific model.”

However, if Statistician B is trying to accomplish a similar task of estimating the average height of penguins, based on a finite sample, but is unsure of the structure of the model to utilize, they will be interested in a slightly different error metric. Statistician B will run a variety of fitting algorithms to build numerous models based on the sample available and will need to compare the models. They will want to estimate the expected error rate (Err) to know the average error of the process by which the fitting algorithm is run on same-sized datasets drawn from the underlying distribution. Statistician B is less interested in the performance of the single model which was built using the sample at hand (which is Statistician A’s target); instead, their interest is in the performance of the process used to arrive at the model. As Bates et al. [1] describe it, Err is of “interest to a researcher comparing different fitting algorithms.”

The estimation of the theoretical quantities, $\text{Err}_{XY}$ and Err, deepens the understanding of the difference between the two. Both error quantities are expected values and unknown in any given Random Forest model. However, they can be approximated using simulations where the underlying population is known and available for repeated sampling. As mentioned above, the true error rate, $\text{Err}_{XY}$, is the test error of the model that was fit on our actual training set. Hence, the estimate of this quantity is the error produced by a single model on a new, theoretically infinitely large, test set. As illustrated by the left-hand column in Figure 1, the sample is used to create the model, and then the model is used to predict a large test set from the underlying population. The entire left column is repeated many times to estimate the true error rate of the given study set-up. The misclassification rate on the test set will be the true error rate.

Subsequently, Err is the average error of the fitting algorithm run on same-sized datasets drawn from the underlying distribution $P$, and is called the expected error rate. To calculate the expected error rate, the true error rate is averaged over models, with a new model fit each time. As seen by the entire box in Figure 1, first the entire model fitting process is repeated to obtain each $\text{Err}_{XY}$ from all possible samples. Then, the estimate of Err is obtained by averaging many repeated values of $\text{Err}_{XY}$.

In other words, the difference between the estimation of $\text{Err}_{XY}$ and Err is that the former uses one model, while the latter averages over many models. As seen from Equation (1), $\text{Err}_{XY}$ is conditional on the data, while in Equation (2), Err is unconditional. Note that Err averages over everything that is random in the model building process, including the randomness in the training set that produced the model and the bootstrapping inherent in a Random Forest model.

Figure 1: Empirical estimation of $\text{Err}_{XY}$ and Err. In the simulation study, the left column is repeated many times to estimate $\text{Err}_{XY}$, and the entire box is repeated many times to get Err.
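The two estimation loops described around Figure 1 can be sketched in a few lines. The following is a minimal illustration under loud assumptions: it uses a single predictor and a toy midpoint classifier as a stand-in for a Random Forest, and the function names (`draw_sample`, `fit`, `err_xy`, `err`) are hypothetical, not taken from our simulation code. It is meant only to show which quantity conditions on one fitted model and which averages over refits.

```python
import random
import statistics

def draw_sample(n, rng, shift=0.75):
    """Balanced sample on one predictor: class 0 ~ N(0, 1), class 1 ~ N(shift, 1)."""
    half = n // 2
    sample = [(rng.gauss(0.0, 1.0), 0) for _ in range(half)]
    sample += [(rng.gauss(shift, 1.0), 1) for _ in range(n - half)]
    return sample

def fit(train):
    """Toy stand-in for the fitting algorithm: cut at the midpoint of the class means."""
    m0 = statistics.fmean(x for x, y in train if y == 0)
    m1 = statistics.fmean(x for x, y in train if y == 1)
    return (m0 + m1) / 2.0, m1 >= m0  # (threshold, does class 1 lie above it?)

def predict(model, x):
    threshold, class1_above = model
    return int((x > threshold) == class1_above)

def err_xy(model, rng, n_test=2000):
    """Approximate Err_XY: misclassification rate of ONE fitted model on a large test set."""
    test = draw_sample(n_test, rng)
    return statistics.fmean(predict(model, x) != y for x, y in test)

def err(n_train, rng, n_models=100):
    """Approximate Err: average Err_XY over many models, each fit to a fresh training set."""
    values = [err_xy(fit(draw_sample(n_train, rng)), rng) for _ in range(n_models)]
    return statistics.fmean(values), values

rng = random.Random(0)
one_model = fit(draw_sample(50, rng))
print("Err_XY for one fitted model:", err_xy(one_model, rng))
print("Err averaged over refits:   ", err(50, rng)[0])
```

The individual `err_xy` values vary from one training set to the next, while their average is a single number for a given sample size, which mirrors the conditional versus unconditional distinction in Equations (1) and (2).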

While it may initially appear that the quantity $\text{Err}_{XY}$ is easier to estimate, since it concerns the model at hand, it has been observed that, in some settings, the cross-validation empirical estimate of error is weakly correlated with $\text{Err}_{XY}$ (given a particular population at hand) [15]. The disassociation issue is mainly attributed to data re-usage.

Let $(\mathbf{X}_{\infty}, Y_{\infty})$ represent a dataset of unlimited size enabling the best possible model $f(\cdot)$ to be chosen; it can be thought of as a black box with the ability to generate datasets of unlimited size and quantity without defining the true underlying model generating mechanism. Theoretically, $\text{Err}_{XY}$ can be decomposed into four parts (see Equations (3) - (6)), as seen in Faraway [4]. The true model $f$ is given in contrast to the model found from the original dataset, $\hat{f}$. The parameter is either given by the infinite data, $\theta_{\infty}$, or estimated using the original data, $\hat{\theta}$.

\begin{align}
\text{Err}_{XY} &= E[\ell(f(\mathbf{X}_{\infty}, \theta_{\infty}), Y_{\infty}) \mid (\mathbf{X}, Y)] && \text{best performance} \tag{3} \\
&+ E[\ell(\hat{f}(\mathbf{X}_{\infty}, \theta_{\infty}), Y_{\infty}) - \ell(f(\mathbf{X}_{\infty}, \theta_{\infty}), Y_{\infty}) \mid (\mathbf{X}, Y)] && \text{model select cost} \tag{4} \\
&+ E[\ell(\hat{f}(\mathbf{X}, \hat{\theta}), Y) - \ell(\hat{f}(\mathbf{X}_{\infty}, \theta_{\infty}), Y_{\infty}) \mid (\mathbf{X}, Y)] && \text{parameter est cost} \tag{5} \\
&+ E[\ell(\hat{f}(\widetilde{\mathbf{X}}, \hat{\theta}), \widetilde{Y}) - \ell(\hat{f}(\mathbf{X}, \hat{\theta}), Y) \mid (\mathbf{X}, Y)] && \text{data re-use cost} \tag{6}
\end{align}

Term (3) represents the best performance of the prediction error; it is the expected loss on the correct model using all the possible data. Term (4) represents the difference between the loss for the true model on infinite data and the loss for the selected model on infinite data. Term (5) represents the difference in loss using the data model but estimating the parameter with a single (independent) dataset versus infinite data. Term (6) represents the difference in loss between the data model and the parameter estimates using the original data versus an independent dataset.
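As a quick check on the decomposition, the four terms can be added directly. The shorthand $a, b, c, d$ below is introduced here only for this worked step and is not notation from Faraway [4]; the argument uses nothing beyond linearity of conditional expectation.

```latex
\begin{align*}
a &= \ell(f(\mathbf{X}_{\infty}, \theta_{\infty}), Y_{\infty}), &
b &= \ell(\hat{f}(\mathbf{X}_{\infty}, \theta_{\infty}), Y_{\infty}), \\
c &= \ell(\hat{f}(\mathbf{X}, \hat{\theta}), Y), &
d &= \ell(\hat{f}(\widetilde{\mathbf{X}}, \hat{\theta}), \widetilde{Y}),
\end{align*}
\begin{align*}
\underbrace{E[a \mid (\mathbf{X}, Y)]}_{(3)}
+ \underbrace{E[b - a \mid (\mathbf{X}, Y)]}_{(4)}
+ \underbrace{E[c - b \mid (\mathbf{X}, Y)]}_{(5)}
+ \underbrace{E[d - c \mid (\mathbf{X}, Y)]}_{(6)}
= E[d \mid (\mathbf{X}, Y)] = \text{Err}_{XY}.
\end{align*}
```

The intermediate losses $a$, $b$, and $c$ cancel in pairs, so the sum collapses to the right-hand side of Equation (1). Each term therefore isolates one step on the path from the ideal model to the fitted model evaluated on new data.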

The most interesting component is the final term (6), caused by data re-usage, which has a non-zero expectation when the same data points are used for both model selection and parameter estimation. If one uses a validation set approach (for model selection and parameter estimation, see SDT in Table 2), term (6) will have an expectation of zero because each observation is only used once. However, as estimated empirically, when using a full data approach, term (6) can be large and easily cancel out any advantages the full data has in model selection and parameter estimation [4]. Thus, the full data strategy will have lower model selection and parameter estimation costs than the validation set strategy due to the higher number of observations used to complete the model selection and parameter estimation processes, but the advantages can be swamped by the data re-use cost.
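To make the sign and possible size of term (6) concrete, the sketch below compares the loss of a fitted model on the data used to fit it with its loss on independent data. It is an illustration under stated assumptions only: a one-predictor toy population and a 1-nearest-neighbour classifier stand in for a Random Forest, and all function names are hypothetical. The 1-nearest-neighbour rule is chosen because it reproduces its own training labels, which makes the re-use effect easy to see.

```python
import random
import statistics

def draw_sample(n, rng, shift=0.75):
    """Balanced sample on one predictor: class 0 ~ N(0, 1), class 1 ~ N(shift, 1)."""
    half = n // 2
    return ([(rng.gauss(0.0, 1.0), 0) for _ in range(half)]
            + [(rng.gauss(shift, 1.0), 1) for _ in range(n - half)])

def predict_1nn(train, x):
    """Label of the training point whose predictor value is nearest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def error_rate(train, data):
    """Misclassification rate on `data` of the 1-NN rule built from `train`."""
    return statistics.fmean(predict_1nn(train, x) != y for x, y in data)

rng = random.Random(0)
train = draw_sample(50, rng)
fresh = draw_sample(1000, rng)

reuse_error = error_rate(train, train)   # same points used to fit and to assess
fresh_error = error_rate(train, fresh)   # independent points from the same population
print("error on re-used data:", reuse_error)
print("error on fresh data:  ", fresh_error)
print("estimated data re-use cost:", fresh_error - reuse_error)
```

The difference printed on the last line is an empirical analogue of term (6) for this toy model: loss on new data minus loss on the data that built the model.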

The difference between the full data and validation set strategies, seen in comparing terms (4) and (5), is bounded and well understood as an effect of sample size [4]. Despite suffering in model selection and parameter estimation costs, the validation set strategy will have a lower data re-use cost than the full data strategy, and we know the data re-use cost term (6) could be very large. Therefore, we would like to investigate the trade-offs in the data re-use cost versus the model selection and parameter estimation costs, in the Random Forest context. Our study comes in the form of analyzing various model fitting approaches and the resulting estimates of error. Specifically, the use of OOB errors compared to validation set and cross-validation strategies is investigated.

3 Methods

Simulated data are used to study the behavior of modeling strategies in the setting in which all predictor variables are uncorrelated. The results provide insight into the mechanisms which lead to different targets in error estimates. We follow the set-up of Janitza and Hornung [9] who use simulation to demonstrate that the OOB error overestimates the true error rate. We recognize that a larger simulation study would provide more nuance of the observed behavior. However, the simulations at hand address our main goals of the paper: (i) to document the switch in direction of error proximity when comparing Random Forests to logistic regression, and (ii) to explore the performance of an expanded set of error estimation strategies.

3.1 Data Generation and Settings

The bias of error estimates in different data settings with numeric predictor variables is systematically investigated by means of simulation studies on data with a balanced binary response variable. The settings considered are:

  • Different number of predictors, $p \in \{10, 100\}$.

  • Different number of observations such that $n < p$, $n > p$, $n \gg p$.

As done when modeling real data, several Random Forests with different $mtry$ values are constructed for each setting ($mtry$ is the parameter that determines the number of randomly chosen variables to be considered for each split on a tree). In the parameter estimation process, the possible values for $mtry$ range from $mtry = 1$ all the way up to $mtry = p$. Note that for $mtry = 1$ there is no selection of an optimal predictor variable for a split, while for $mtry = p$ the Random Forest method coincides with the bagging procedure which selects the best predictor variable from the entire set of predictors for binary splitting at each internal node.
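The role of $mtry$ at a single split can be sketched as follows. This is an illustrative fragment only, with a hypothetical helper name (`candidate_predictors`); it shows the random choice of candidate variables and not the subsequent search for the best split among them, which a Random Forest implementation performs internally.

```python
import random

def candidate_predictors(p, mtry, rng):
    """Indices of the predictors considered at one split (drawn without replacement)."""
    if not 1 <= mtry <= p:
        raise ValueError("mtry must lie between 1 and p")
    return sorted(rng.sample(range(p), mtry))

rng = random.Random(0)
p = 10
print(candidate_predictors(p, 1, rng))   # one random predictor: no choice among variables
print(candidate_predictors(p, 3, rng))   # three candidates compete for the split
print(candidate_predictors(p, p, rng))   # all predictors: the bagging special case
```

With `mtry = p` every predictor is always a candidate, so the only remaining randomness across trees comes from the bootstrap samples, which is why that setting coincides with bagging.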

Generally, the number of trees chosen is a trade-off between accuracy and computational speed. More trees are necessary when using a large number of predictor variables. The OOB error stabilizes at around 250 trees in convergence studies [5], which concluded that 1000 trees might be sufficiently large for a genome-wide dataset of more than 300,000 predictor variables. Also in high-dimensional settings, Random Forests with 500 trees and 1000 trees yield very similar OOB errors [7]. In accordance with these findings in the literature, the number of trees is set to 500 in all of our Random Forest simulations. Each dataset described in Table 1 is repeated 1000 times to obtain stable results (all reported empirical errors are the average of 1000 replicates).

Only numeric predictor variables are considered in the studies. Both predictors associated with the response and predictors not associated with the response are considered, with all predictors distributed independently of one another. The predictors not associated with the response follow a standard normal distribution. The distribution of predictors with association is different for each response class. The predictor values for observations from class 1 are always drawn from a standard normal distribution. The predictor values for observations from class 2 are drawn from a normal distribution with variance 1 and a mean different from zero. Table 1 gives an overview of the distribution of predictors in the response classes. Let us consider the setting with $p = 10$ as an example. The first two predictors $X_1$ and $X_2$ are associated with the response, while the other predictors $X_3, \dots, X_{10}$ are noise. Hence, $X_3, \dots, X_{10}$ follow a standard normal distribution, while the distributions of $X_1$ and $X_2$ depend on the class to which the observations belong. If the observation comes from class 1, the distribution of $X_1$ and $X_2$ is $N(0, 1)$; for class 2, $X_1$ and $X_2$ are distributed $N(\mu_2, 1)$, where the mean $\mu_2$ is itself drawn from $N(0.75, 1)$ (see Table 1). Randomly drawing the mean separately for $X_1$ and $X_2$ and for each repetition of the study ensures that predictors with different effect strengths are considered.

\begin{tabular}{llll}
\hline
Number of Predictors & Predictors & class 1: $N(\mu_1, 1)$ & class 2: $N(\mu_2, 1)$ \\
\hline
$p = 10$ & $X_1$ & $\mu_1 = 0$ & $\mu_2 \sim N(0.75, 1)$ \\
 & $X_2$ & $\mu_1 = 0$ & $\mu_2 \sim N(0.75, 1)$ \\
 & $X_3, \dots, X_{10}$ & $\mu_1 = 0$ & $\mu_2 = 0$ \\
\hline
$p = 100$ & $X_1$ & $\mu_1 = 0$ & $\mu_2 \sim N(0.75, 1)$ \\
 & $\vdots$ & $\vdots$ & $\vdots$ \\
 & $X_{10}$ & $\mu_1 = 0$ & $\mu_2 \sim N(0.75, 1)$ \\
 & $X_{11}, \dots, X_{100}$ & $\mu_1 = 0$ & $\mu_2 = 0$ \\
\hline
\end{tabular}
Table 1: Simulation setup describing the distribution of predictors in class 1 and class 2, as in Janitza and Hornung [9]. Note that class 2 is shifted from class 1 for only a few of the $p$ variables. Also note that all predictor variables are independent and normally distributed.
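One way to read Table 1 as a data-generating recipe is sketched below. This is a minimal illustration rather than the code used for the reported simulations (that code is in the repository linked in Section 1). It assumes exactly $n/2$ observations per class as a literal reading of "balanced", and the function name `simulate_dataset` and its arguments are hypothetical.

```python
import random

def simulate_dataset(n, p, n_informative, rng):
    """Return (X, y, mu2) for one simulated dataset in the style of Table 1."""
    # Class-2 means: one draw from N(0.75, 1) per informative predictor, 0 for noise predictors.
    mu2 = [rng.gauss(0.75, 1.0) for _ in range(n_informative)]
    mu2 += [0.0] * (p - n_informative)
    X, y = [], []
    for i in range(n):
        label = 1 if i < n // 2 else 2               # assumed: first half class 1, second half class 2
        means = [0.0] * p if label == 1 else mu2
        X.append([rng.gauss(m, 1.0) for m in means])  # independent predictors with variance 1
        y.append(label)
    return X, y, mu2

rng = random.Random(0)
X, y, mu2 = simulate_dataset(n=20, p=10, n_informative=2, rng=rng)
print(len(X), len(X[0]), y.count(1), y.count(2))  # 20 10 10 10
```

Calling the function again draws new class-2 means, which corresponds to redrawing the effect strengths for each repetition of the study. For the $p = 100$ setting, `n_informative=10` matches the ten shifted predictors in Table 1.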

It is important to note that all predictors are uncorrelated. Although assuming no correlations between any of the predictors is not necessarily realistic, such settings are important to investigate in order to understand the mechanisms which lead to different targets in error estimation.

3.2 Strategies for Error Estimation

The modeling process consists of parameter estimation followed by error estimation. An important point of consideration when completing the two estimation steps is the choice of which subset of observations will be used in each operation. Often, data for parameter estimation and data for error estimation are collected at the same time, thus resulting in a single sample that needs to be apportioned to both parameter and error estimation. Finding the optimal model complexity requires an external test dataset [8]. In an ideal world, to avoid “data snooping”, one needs one dataset for model building, one for parameter estimation, and then after a model is accepted, another dataset for error estimation. However, rarely are three independent datasets available, so one may need to do the best one can with the data available. Hence, when modeling it is important to outline the strategy that will be used to construct the model and then estimate its error.

We chose a variety of strategies to separately target the parameter estimation and error estimation steps in the modeling process, and thus each strategy consists of three parts. In the descriptions that follow, in-fold represents the observations used to fit the model using cross-validation; out-of-fold represents the observations that are held out of the model fitting using cross-validation. In-bag are the observations that are used to fit the model using Random Forests; out-of-bag (OOB) represents the observations that are held out of the model fitting when using Random Forests. A more complete description of cross-validation, bagging, and Random Forests can be found in James et al. [10].
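As a minimal sketch of the four terms just defined, the following Python code generates in-fold/out-of-fold index sets for cross-validation and in-bag/OOB index sets for one bootstrap resample. The function names, the seed, and the sample size of 20 are illustrative assumptions and do not come from the simulations. In a Random Forest, each tree draws its own bootstrap resample, so the in-bag/OOB split shown here would be repeated once per tree.

```python
import random


def kfold_indices(n, k, rng):
    """Yield (in_fold, out_of_fold) index lists for k-fold cross-validation."""
    order = list(range(n))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k roughly equal folds
    for held_out in range(k):
        out_of_fold = sorted(folds[held_out])
        in_fold = sorted(
            i for j, fold in enumerate(folds) if j != held_out for i in fold
        )
        yield in_fold, out_of_fold


def bootstrap_indices(n, rng):
    """Return (in_bag, oob) indices for one bootstrap resample of size n."""
    in_bag = [rng.randrange(n) for _ in range(n)]  # sampled with replacement
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]  # never drawn: out-of-bag
    return in_bag, oob


rng = random.Random(0)  # arbitrary seed, for reproducibility only
splits = list(kfold_indices(20, 4, rng))  # 4-fold CV, as in the strategies below
in_bag, oob = bootstrap_indices(20, rng)
```

Every observation is out-of-fold exactly once across the folds, whereas an observation may be in-bag several times or not at all within a single bootstrap resample.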

For each of the considered strategies, there are three aspects to consider (see Figure 2 and Table 2). The original data is divided differently into three datasets: Data1, Data2, and Data3. See Table 2 for details on how each strategy divides up the original dataset. Sometimes, e.g., SDT, the three datasets are completely independent. Sometimes, e.g., FDO, the same observations show up throughout each of the three datasets.

  1. Model Fitting: Data1 is used to fit a series of models to consider. For example, with Random Forests, there might be 5 models fit ($\hat{f}_{1},\hat{f}_{2},\hat{f}_{3},\hat{f}_{4},\hat{f}_{5}$), each with a different value of $mtry$.

  2. Parameter Estimation: Data2 is used to select the value of the parameter. In the above case, the value of $mtry$ is selected by assessing Data2 on each of $\hat{f}_{1},\hat{f}_{2},\hat{f}_{3},\hat{f}_{4},\hat{f}_{5}$.

  3. Error Estimation: Once the model and parameter have been selected, Data3 is used to estimate the prediction error of the selected model.
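The three steps can be sketched generically in Python. This is an illustration under stated assumptions, not the simulation code: `select_and_estimate` is a hypothetical helper, and the toy model is a one-dimensional k-nearest-neighbour classifier standing in for a Random Forest, with k playing the role of $mtry$. The data values are made up.

```python
def select_and_estimate(candidates, fit, error, data1, data2, data3):
    """Fit on data1, select a candidate on data2, estimate error on data3."""
    # Step 1 (model fitting): one fitted model per candidate parameter value.
    models = {c: fit(c, data1) for c in candidates}
    # Step 2 (parameter estimation): keep the candidate with lowest error on data2.
    best = min(candidates, key=lambda c: error(models[c], data2))
    # Step 3 (error estimation): report the selected model's error on data3.
    return best, error(models[best], data3)


# Toy stand-in (NOT a Random Forest): 1-D k-nearest-neighbour classifier.
def fit_knn(k, data):
    def predict(x):
        nearest = sorted(data, key=lambda row: abs(row[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return predict


def misclassification(model, data):
    """Fraction of (x, y) pairs that the model labels incorrectly."""
    return sum(model(x) != y for x, y in data) / len(data)


# Made-up (x, label) pairs playing the roles of Data1, Data2, and Data3.
train = [(0.1, 0), (0.4, 0), (0.5, 1), (0.9, 0), (1.2, 1), (1.5, 1)]
valid = [(0.2, 0), (0.8, 0), (1.1, 1), (1.4, 1)]
test = [(0.0, 0), (0.3, 0), (1.0, 1), (1.6, 1)]

best_k, test_error = select_and_estimate(
    [1, 3, 5], fit_knn, misclassification, train, valid, test
)
```

Passing three disjoint datasets corresponds to a fully split strategy; passing overlapping datasets corresponds to strategies in which the same observations are reused across steps.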

Figure 2: For each strategy, the original data are divided differently into three datasets: Data1, Data2, and Data3. See Table 2 for details on how each strategy divides up the original dataset. Sometimes, e.g., SDT, the three datasets are completely independent. Sometimes, e.g., FDO, the same observations show up throughout each of the three datasets.

Note that in some of the error estimation strategies, there is no parameter estimation (see Table 2). In that case, there is only a need for two distinct datasets, not three (i.e., Data2 is not used in the strategy). The following error estimation strategies are considered:

  • Logistic Regression CV Error (LGCV): The logistic model is built on the in-fold dataset and the error of the model is estimated via the out-of-fold data using 4-fold cross-validation. Logistic regression models are run only for $n\geq p$.

  • Full Data Set CV Error (FDCV): Parameters are set prior to model building with $mtry=\sqrt{p}$. The Random Forest is built on the in-fold dataset and the error of the model is estimated via out-of-fold data using 4-fold cross-validation.

  • Full Data Set OOB Error (FDO): Parameter and error estimation are done on the same dataset. $mtry$ is chosen by using the OOB error rate. Hence, the Random Forest (and corresponding $mtry$) with the lowest OOB error rate is chosen and its OOB error is returned as the error estimate.

  • Split Data Set OOB Error (SDO): The sample is divided into training and testing sets. Parameter estimation is done on the (in-bag) training set, using the OOB error rate to select $mtry$. The error of the Random Forest, built on the entire training set, is then estimated by predicting on the testing set and reporting the test error. The error estimation uses only test observations that are not part of the set of observations that are considered for constructing the Random Forest.

  • Split Data Set CV Error (SDCV): The sample is divided into training and testing sets. Parameter estimation is done on the training set, using 4-fold cross-validated error estimates to select $mtry$. The error of the Random Forest, built on the entire training set (using the cross-validated $mtry$), is then estimated by predicting on the testing set and reporting the test error. The error estimation uses only test observations that are not part of the set of observations that are considered for constructing the Random Forest.

  • Split Data Set Test Error (SDT): The sample is divided into three independent training, validation, and testing sets. Parameter estimation of $mtry$ is done by finding the minimum validation error across training models with differing values of $mtry$. The error of the Random Forest, built solely on the training set (using the validation set $mtry$), is estimated by predicting on the testing set and reporting the test error. The error estimation uses only test observations that are not part of the set of observations that are considered for constructing the Random Forest.
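The data-splitting step used by the split strategies can be sketched as follows. The helper `three_way_split`, the seed, the sample size, and the 50/25/25 proportions are assumptions made for illustration; the split proportions are not stated in the strategy descriptions above.

```python
import random


def three_way_split(n, frac_train, frac_valid, rng):
    """Shuffle indices 0..n-1 and cut them into train/validation/test lists."""
    order = list(range(n))
    rng.shuffle(order)
    n_train = int(frac_train * n)
    n_valid = int(frac_valid * n)
    train = order[:n_train]
    valid = order[n_train:n_train + n_valid]
    test = order[n_train + n_valid:]  # whatever remains is the test set
    return train, valid, test


rng = random.Random(1)  # arbitrary seed, for reproducibility only
# Assumed proportions (50% / 25% / 25%); not taken from the paper.
train, valid, test = three_way_split(100, 0.5, 0.25, rng)
```

Setting `frac_valid` to 0 gives the two-way train/test split used when parameters are tuned inside the training set (by OOB error or cross-validation) rather than on a separate validation set.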

Strategy | Model Fitting (Data1) | Parameter Estimation (Data2) | Error Estimation (Data3)
LGCV | In-Fold observations from full dataset | N/A | Out-of-Fold observations from full dataset
FDCV | In-Fold observations from full dataset | N/A | Out-of-Fold observations from full dataset
FDO | In-Bag observations from full dataset | OOB observations from full dataset | OOB observations from full dataset
SDO | In-Bag observations from training dataset | OOB observations from training dataset | Test dataset
SDCV | In-Fold observations from training dataset | Out-of-Fold observations from training dataset | Test dataset
SDT | Training dataset | Validation dataset | Test dataset
Table 2: Data used in each step of the error estimation strategies. The three columns across each strategy are pictured in Figure 2 and include model fitting, parameter estimation, and error estimation. Each row describes a different method for partitioning the data into the three groups.

4 Results

4.1 Distance to Target Errors

Ideally, the estimated error rate of a statistical model should be both close to the truth and low. But that raises the question: close to which truth, $\text{Err}_{XY}$ or Err? In an effort to compare the results of our simulations to those of Bates et al. [1], Figure 3 shows the distance of $\widehat{\text{Err}}^{\text{(LGCV)}}$ from Err compared to its distance from $\text{Err}_{XY}$. Similar to Figure 8 in Bates et al. [1], we can see that in the logistic regression model,

\[
|\widehat{\text{Err}}^{\text{(LGCV)}}-\text{Err}_{XY}| > |\widehat{\text{Err}}^{\text{(LGCV)}}-\text{Err}|.
\]

The difference lessens as $n\rightarrow\infty$. Regardless, repeated simulations consistently confirm that $\widehat{\text{Err}}^{\text{(LGCV)}}$ is, on average, closer to Err than to $\text{Err}_{XY}$.
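The comparison of distances can be sketched in Python as a mean absolute deviation to each target, computed over simulated datasets. This is an illustration under assumptions: the function names are hypothetical, the numbers are made up and are not results from the simulations, and Err is approximated by the average of $\text{Err}_{XY}$ over the simulated datasets.

```python
from statistics import mean


def mean_abs_dev(estimates, targets):
    """Mean absolute deviation between paired estimates and targets."""
    return mean(abs(e - t) for e, t in zip(estimates, targets))


def distance_to_targets(err_hat, err_xy):
    """Return (MAD to Err_XY, MAD to Err) over simulated datasets.

    Assumption: Err is approximated by the average of Err_XY across the
    simulated datasets, since Err is the expectation of Err_XY.
    """
    err = mean(err_xy)
    mad_xy = mean_abs_dev(err_hat, err_xy)
    mad_err = mean_abs_dev(err_hat, [err] * len(err_hat))
    return mad_xy, mad_err


# Made-up numbers for illustration only; they are not results from the paper.
err_xy = [0.20, 0.34, 0.26, 0.28]   # one true error rate per simulated dataset
err_hat = [0.27, 0.28, 0.26, 0.27]  # the matching error estimates
mad_xy, mad_err = distance_to_targets(err_hat, err_xy)
```

With these made-up inputs the estimates sit nearer to Err than to $\text{Err}_{XY}$, which is the pattern of the displayed inequality; other inputs can reverse the ordering.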

Figure 3: Simulation results comparing the error of cross-validated estimates of a logistic regression model to both Err and $\text{Err}_{XY}$: the mean absolute deviation between $\widehat{\text{Err}}^{\text{(LGCV)}}$ and Err (orange dashed line) or $\text{Err}_{XY}$ (green solid line). The dark grey vertical line in each panel is where $n=p$. Notice that across all $n$ and both values of $p$, $\widehat{\text{Err}}^{\text{(LGCV)}}$ is closer to Err than to $\text{Err}_{XY}$. Logistic regression models are run only for $n\geq p$.
Figure 4: Simulation results comparing the error of CV estimates of a Random Forest model to both Err and $\text{Err}_{XY}$: the mean absolute deviation between $\widehat{\text{Err}}^{\text{(FDCV)}}$ and Err (orange dashed line) or $\text{Err}_{XY}$ (green solid line). The dark grey vertical line in each panel is where $n=p$. Notice that across $n$ and both values of $p$, $\widehat{\text{Err}}^{\text{(FDCV)}}$ is closer to $\text{Err}_{XY}$ than to Err.

In Figure 4 we see that the error proximity relationship has flipped between logistic regression models and Random Forests. $\widehat{\text{Err}}^{\text{(LGCV)}}$ is closer to Err, but $\widehat{\text{Err}}^{\text{(FDCV)}}$ is closer to $\text{Err}_{XY}$. As a reminder, the difference between $\widehat{\text{Err}}^{\text{(LGCV)}}$ and $\widehat{\text{Err}}^{\text{(FDCV)}}$ is that the former is an error estimate for a logistic model while the latter is an error estimate for a Random Forest model. Both are cross-validated estimates on the in-fold dataset with no parameter tuning. Thus, the relationship highlighted by Bates et al. [1] seems to be specific to generalized linear models, as they investigated only linear and logistic regression models.

The flip in relationship may be attributed to the difference in the way each model utilizes the data. In logistic regression, the coefficients are estimated via maximum likelihood estimation, thus possibly leading to over-fitting and biased estimates of error due to the model optimizing for the specific dataset. On the other hand, bagging and other resampling techniques can be used to reduce the variance in model predictions. In Random Forests, the bias of the full forest is equivalent to the bias of a single decision tree (which itself has low bias and high variance) [8]. However, by creating many trees and then averaging them, the variance of the final forest can be greatly reduced over that of a single tree. In practice, the only limitation on the size of the forest is computing time, as arbitrarily many trees could be trained without ever increasing bias and with a continual (if diminishing) decrease in the variance. As a result, the logistic regression model may be less informative about the “next” sample than a Random Forest. Hence, $\widehat{\text{Err}}$ is closer to $\text{Err}_{XY}$ than to Err for Random Forests because their resampling methods build the model on data more akin to wild data.
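The variance argument can be made concrete with a standard identity for averages of correlated variables. As an assumption for this sketch, let the $B$ tree predictions $T_{b}(x)$ at a point $x$ be identically distributed with variance $\sigma^{2}$ and pairwise correlation $\rho$ (the symbols $T_{b}$, $B$, $\sigma^{2}$, and $\rho$ are introduced here only for illustration). Then

\[
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}T_{b}(x)\right)=\rho\sigma^{2}+\frac{1-\rho}{B}\sigma^{2}.
\]

The second term shrinks toward zero as $B$ grows, so the variance decreases toward $\rho\sigma^{2}$ with diminishing returns, while the expectation of the average equals that of a single tree, leaving the bias unchanged.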

We further explore the difference between $\widehat{\text{Err}}$ for logistic regression and Random Forests in an experiment with $n=50$ observations and $p=10$ features (repeated 20 times); see Figure 5. In the right plot, there seems to exist a pairing between $\widehat{\text{Err}}^{\text{(FDCV)}}$ and $\text{Err}_{XY}$ (linking the two values from the same dataset), where high estimates of $\widehat{\text{Err}}^{\text{(FDCV)}}$ are paired with high values of $\text{Err}_{XY}$ and low estimates of $\widehat{\text{Err}}^{\text{(FDCV)}}$ are paired with low values of $\text{Err}_{XY}$ (i.e., very few of the linking lines cross). In a logistic regression model, this pairing does not seem to exist, as seen in the left plot (i.e., most of the linking lines cross). In other words, $\widehat{\text{Err}}^{\text{(LGCV)}}$ is seen to be neither closer to Err nor farther from Err than it is from $\text{Err}_{XY}$, but $\widehat{\text{Err}}^{\text{(FDCV)}}$ tends to be closer to $\text{Err}_{XY}$ than to Err.
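The visual criterion of crossing lines can be quantified with a short Python sketch. This is an illustration rather than an analysis performed in the paper: `crossing_fraction` is a hypothetical helper and all values are made up. Two linking lines cross exactly when the two datasets are ordered one way by $\widehat{\text{Err}}$ and the opposite way by $\text{Err}_{XY}$.

```python
from itertools import combinations


def crossing_fraction(left, right):
    """Fraction of dataset pairs whose linking lines cross.

    Two lines cross when the datasets are ordered one way on the left axis
    and the opposite way on the right axis (a discordant pair).
    """
    pairs = list(combinations(range(len(left)), 2))
    crossings = sum(
        (left[i] - left[j]) * (right[i] - right[j]) < 0 for i, j in pairs
    )
    return crossings / len(pairs)


# Made-up values for illustration only; they are not results from the paper.
err_hat = [0.20, 0.25, 0.30, 0.35]
err_xy_aligned = [0.22, 0.24, 0.31, 0.33]   # same ordering as err_hat
err_xy_shuffled = [0.33, 0.22, 0.31, 0.24]  # ordering unrelated to err_hat

few = crossing_fraction(err_hat, err_xy_aligned)
many = crossing_fraction(err_hat, err_xy_shuffled)
```

When there are no ties, one minus twice this fraction is Kendall's rank correlation, so few crossings correspond to a strong positive rank correlation between the two quantities.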

Figure 5: Random sample of 20 iterations of $\widehat{\text{Err}}$ linked with the corresponding $\text{Err}_{XY}$ for logistic regression (left plot) as compared to Random Forests (right plot). Each of the 20 datasets has $n=50$ and $p=10$. The lines connecting the boxplots represent the $\widehat{\text{Err}}$ and $\text{Err}_{XY}$ values for a given dataset and analysis method. The crossing lines on the left plot (logistic regression) indicate that $\widehat{\text{Err}}$ and $\text{Err}_{XY}$ are not correlated; the non-crossing lines on the right plot (Random Forest) indicate that $\widehat{\text{Err}}$ and $\text{Err}_{XY}$ are correlated. The solid black horizontal line in each plot is Err.

Returning to the remaining strategies, all of which use Random Forests, Figures 6-9 show that across the error estimation strategies, $\widehat{\text{Err}}$ is closer to $\text{Err}_{XY}$ than to Err on average. Despite this relationship, the differences in mean absolute deviations, from both $\text{Err}_{XY}$ and Err, tend to be quite small.

Figure 6: Simulation results comparing the error of OOB estimates of a Random Forest model to both Err and $\text{Err}_{XY}$: the mean absolute deviation between $\widehat{\text{Err}}^{\text{(FDO)}}$ and Err (orange dashed line) or $\text{Err}_{XY}$ (green solid line). The dark grey vertical line in each panel is where $n=p$. Notice that across $n$ and both values of $p$, $\widehat{\text{Err}}^{\text{(FDO)}}$ is closer to $\text{Err}_{XY}$ than to Err.
Figure 7: Simulation results comparing the error of OOB estimates of a Random Forest model to both Err and $\text{Err}_{XY}$: the mean absolute deviation between $\widehat{\text{Err}}^{\text{(SDO)}}$ and Err (orange dashed line) or $\text{Err}_{XY}$ (green solid line). The dark grey vertical line in each panel is where $n=p$. Notice that across $n$ and both values of $p$, $\widehat{\text{Err}}^{\text{(SDO)}}$ is closer to $\text{Err}_{XY}$ than to Err.
Figure 8: Simulation results comparing the error of CV estimates of a Random Forest model to both Err and $\text{Err}_{XY}$: the mean absolute deviation between $\widehat{\text{Err}}^{\text{(SDCV)}}$ and Err (orange dashed line) or $\text{Err}_{XY}$ (green solid line). The dark grey vertical line in each panel is where $n=p$. Notice that across $n$ and both values of $p$, $\widehat{\text{Err}}^{\text{(SDCV)}}$ is closer to $\text{Err}_{XY}$ than to Err.
Figure 9: Simulation results comparing the error of validation set estimates of a Random Forest model to both Err and $\text{Err}_{XY}$: the mean absolute deviation between $\widehat{\text{Err}}^{\text{(SDT)}}$ and Err (orange dashed line) or $\text{Err}_{XY}$ (green solid line). The dark grey vertical line in each panel is where $n=p$. Notice that across $n$ and both values of $p$, $\widehat{\text{Err}}^{\text{(SDT)}}$ is closer to $\text{Err}_{XY}$ than to Err.

4.2 Distance Across Error Estimation Strategies

Section 4.1 details our investigation of the distance of the empirical estimates of error both to $\text{Err}_{XY}$ and to $\text{Err}$. We show that in Random Forests, the error estimates are closer to $\text{Err}_{XY}$ than to $\text{Err}$, which is the opposite of the result of Bates et al. [1] on generalized linear models. Now we assess a follow-up question: how close is $\widehat{\text{Err}}$ to $\text{Err}_{XY}$? Figure 10 compares the strategies according to the expected value of $|\widehat{\text{Err}}-\text{Err}_{XY}|$.

In the case of $p=10$ features, the strategies that utilize the in-fold or in-bag dataset to train the model (LGCV, FDCV, and FDO) outperform the split data approaches (SDCV, SDO, SDT). The strategies without independent validation sets do not seem to over-fit, whereas the strategies with train/test splits suffer a drop in performance, likely due to the smaller sample sizes available for model and parameter fitting. It is important to mention that FDO is the only strategy of the three (LGCV, FDCV, and FDO) that tunes parameters (here, $mtry$).

In the case of $p=100$ features, FDCV and FDO are the best candidates. In contrast to $p=10$ features, FDCV and FDO perform better than LGCV when $n\geq p$ (LGCV is not fit when $n<p$). As with $p=10$, the data-splitting methods (SDCV, SDO, and SDT) have worse performance. Once again, it is important to note that, out of the strategies that build the model on the entire dataset, FDO tunes the model’s parameters, compared to FDCV which does not.

Figure 10: Mean absolute deviation between $\widehat{\text{Err}}$ and $\text{Err}_{XY}$ across error estimation strategies. The dark grey vertical line in each panel is where $n=p$; note that logistic regression models are run only for $n\geq p$.

5 Discussion

The Random Forest error rate investigation in our paper has two main components. First, we discuss the difference in error targets presented by Bates et al. [1]. In their work, they find that in the special case of the generalized linear model using unregularized OLS for model-fitting, common estimates of prediction error (cross-validation, bootstrap, data splitting, and covariance penalties) should be viewed as estimates of the expected error rate, averaged across other hypothetical datasets from the same distribution. Our primary result is that, in the classification case, Random Forests’ empirical estimates of prediction error can be taken as an estimate of the true error rate ($\text{Err}_{XY}$) instead of as an estimate of the expected error rate (Err), which is the opposite of the result of Bates et al. [1], whose work includes logistic regression. In simulations, the result held across error estimation strategies such as cross-validation, bagging, and data splitting (see Figures 4-9). The result is present for a variety of choices of sample size, $n$, and number of features, $p$. Nonetheless, we wish to be clear that the estimates of prediction error were a good approximation of both the true error rate ($\text{Err}_{XY}$) and the expected error rate (Err) in the data splitting cases.

A fundamental open question is to understand the size of the gap between estimates of prediction error and the true and expected error rates. The present work focuses on determining to which target the estimate is closer. Moreover, it is necessary to understand under what conditions the gap is large, since a large gap would make it necessary to modify the method of error estimation depending on the target. Roughly speaking, we expect the gap between any of the error rates to be small when $n/p$ is large. In our simulations, the difference between estimates of prediction error and the true and expected error rates is always smaller than 0.01; as $n$ increases the difference decreases. Other future directions include the investigation of the relationship among $\widehat{\text{Err}}$, $\text{Err}_{XY}$, and Err in correlated and/or imbalanced data.

Second, we discussed the performance of a variety of error estimation strategies. The error estimates from models built on the entire sample (e.g., using CV or OOB strategies) are closer to the true error rate than those from models built on a training set with error estimates obtained from a testing set. Therefore, the data strategies that do not use a holdout set seem to be a more appealing choice for model building, regardless of whether parameter tuning is to be performed. Empirically, the strategies that use resampling techniques as opposed to a holdout set are favorable.

Whereas Bates et al. [1] show $\widehat{\text{Err}}$ to be closer to Err than to $\text{Err}_{XY}$ in generalized linear models, we show $\widehat{\text{Err}}$ to be closer to $\text{Err}_{XY}$ than to Err in Random Forests. Additionally, resampling techniques seem to outperform data splitting models in Random Forests.

References

  • Bates et al. [2023] Bates, S., Hastie, T., Tibshirani, R.: Cross-validation: what does it estimate and how well does it do it? Journal of the American Statistical Association (2023) https://doi.org/10.1080/01621459.2023.2197686
  • Breiman [2001] Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001) https://doi.org/10.1023/A:1010950718922
  • Bylander [2002] Bylander, T.: Estimating generalization error on two-class datasets using out-of-bag estimates. Machine Learning 48(1-3), 287–297 (2002)
  • Faraway [2014] Faraway, J.J.: Does data splitting improve prediction? Statistics and Computing 26(1–2), 49–60 (2014) https://doi.org/10.1007/s11222-014-9522-9
  • Goldstein et al. [2010] Goldstein, B., Hubbard, A., Cutler, A., Barcellos, L.: An application of random forests to a genome-wide association dataset: Methodological considerations & new findings. BMC genetics 11, 49 (2010)
  • Goldstein et al. [2011] Goldstein, B.A., Polley, E.C., Briggs, F.B.S.: Random forests for genetic association studies. Statistical Applications in Genetics and Molecular Biology 10(1) (2011) https://doi.org/10.2202/1544-6115.1691
  • Genuer et al. [2008] Genuer, R., Poggi, J.-M., Tuleau, C.: Random Forests: some methodological insights. arXiv (2008). https://doi.org/10.48550/ARXIV.0811.3619 . https://arxiv.longhoe.net/abs/0811.3619
  • Hastie et al. [2001] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA (2001)
  • Janitza and Hornung [2018] Janitza, S., Hornung, R.: On the overestimation of random forest’s out-of-bag error. PLOS ONE 13(8), 1–31 (2018) https://doi.org/10.1371/journal.pone.0201904
  • James et al. [2013] James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning: with Applications in R. Springer, ??? (2013). https://faculty.marshall.usc.edu/gareth-james/ISL/
  • Kaggle [2017] Kaggle: The State of Data Science & Machine Learning (2017). https://ailab-ua.github.io/courses/resources/the_state_of_data_science_machine_learning_-_kaggle_2017_survey.pdf
  • Mentch and Hooker [2016] Mentch, L., Hooker, G.: Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research 17(26), 1–41 (2016)
  • Mitchell [2011] Mitchell, M.: Bias of the random forest out-of-bag (oob) error for certain input parameters. Open Journal of Statistics 01, 205–211 (2011)
  • Rajanala et al. [2022] Rajanala, S., Bates, S., Hastie, T., Tibshirani, R.: Confidence Intervals for the Generalisation Error of Random Forests (2022)
  • Yousef [2019] Yousef, W.A.: A Leisurely Look at Versions and Variants of the Cross Validation Estimator. arXiv (2019). https://doi.org/10.48550/ARXIV.1907.13413 . https://arxiv.longhoe.net/abs/1907.13413
  • Zhang et al. [2010] Zhang, G.-Y., Zhang, C.-X., Zhang, J.-S.: Out-of-bag estimation of the optimal hyperparameter in subbag ensemble method. Communications in Statistics - Simulation and Computation 39(10), 1877–1892 (2010) https://doi.org/10.1080/03610918.2010.521277