Prediction Error Estimation in Random Forests

Ian Krupkin, Johanna Hardin

Department of Mathematics & Statistics, Pomona College, 610 N. College Ave., Claremont, CA 91711, USA

Abstract

In this paper, error estimates of classification Random Forests are quantitatively assessed. Based on the initial theoretical framework built by Bates et al. [1], the true error rate and expected error rate are theoretically and empirically investigated in the context of a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests’ estimates of prediction error are closer on average to the true error rate than to the average prediction error. This is the opposite of the findings of Bates et al. [1], which are given for logistic regression. We further show that our result holds across different error estimation strategies such as cross-validation, bagging, and data splitting.

Keywords: true error rate, expected error rate, cross-validation, bootstrapping, bagging, data-splitting

MSC Classification: 62J20, 62F40

1 Introduction

As evidenced by the 2017 State of Data Science and Machine Learning report by Kaggle, almost half of data scientists use Random Forests at work [11]. Random Forests [2] are a popular classification tool in a variety of fields, especially because of their excellent performance in very complex data settings. Out-of-bag (OOB) errors are theoretically and computationally simple improvements over a train-test split, which has led to their ubiquity. When deploying a predictive model, it is important to understand its prediction accuracy on future test points; both good point estimates and an understanding of the variability of the estimates are essential. When Random Forests are implemented, the OOB error is a widely used approach for point and interval estimate tasks, but in spite of OOB’s seeming simplicity, its properties remain opaque. In the past, the OOB error has been affirmed to be an unbiased estimate of the true error rate [16, 6]. Nonetheless, it has been shown that for two-class classification problems the OOB error can overestimate the true prediction error [3, 13]. It was later argued that the use of stratified subsampling with sampling fractions that are proportional to response class sizes of the training data yielded almost unbiased error rates [9]. The present work is primarily concerned with OOB errors, but also addresses other common methods such as data splitting and cross-validation, as well as their combination with OOB errors.

Despite the apparent straightforwardness of data splitting, cross-validation, and bagging, the formal properties of these modeling techniques are subtle. When calculating an error rate, the question of “what are we estimating?” rightfully reappears often. In an innovative paper, Bates et al. [1] provide a framework for understanding and measuring prediction error in the case of least squares linear models. They prove that cross-validation does not estimate the true error rate (see Equation (1)) for the model at hand, fit to the training data. Instead, the empirical cross-validation error “estimates the average prediction error of models fit on other unseen training sets drawn from the same population,” i.e., the expected error rate (see Equation (2)). They expand their work in simulations to show that the same trend holds for logistic regression.

To our knowledge, in the case of Random Forests, ours is the first study investigating the different types of prediction error as a metric for the accuracy of the model on training, cross-validated, out-of-bag, test, and future data. We illustrate and extend the work of Bates et al. [1] to the setting of Random Forests, where empirical evidence for Random Forests shows the opposite of what Bates et al. [1] proved for (generalized) linear models. For Random Forests, the empirical errors (found via data splitting, cross-validation, or bagging) are closer, on average, to the true error rate (see Equation (1)) than to the expected error rate (see Equation (2)). The distinction between the two error rates is quite subtle. Section 2 goes into great detail explaining the differences between the two theoretical error types.

The main contribution of our work is two-fold: (i) investigation of the switch in direction of the proximity of Random Forests’ error estimates to the error targets from that presented by Bates et al. [1] on logistic regression, and (ii) exploration of the performance of an expanded set of error estimation strategies.

The paper is structured as follows: In Section 2, we set up notation and introduce the two different error targets: true error rate ( $\text{Err}_{XY}$ ) and expected error rate (Err). Section 3 then introduces the simulation-based studies. The descriptions include an outline of the simulated data, the considered settings, and several model building workflows that will be investigated. In Section 4, we present the results of the studies. The results are discussed in Section 5 alongside recommendations.

All code used for the simulations is available at:
https://github.com/iankrupkin/Prediction-Error-Estimation-in-Random-Forests

2 What Prediction Error are We Estimating?

Before turning to our main method in the next section, we introduce our notation and review topics related to error targets. We consider the supervised learning setting where we have $p$ features and $n$ observations, denoted by ${\bf X}_{i}=(X_{i,1},\dots,X_{i,p})$ , the $i^{th}$ row of ${\bf X}\in\mathbb{R}^{n\times p}$ , and response ${\bf Y}=(Y_{1},\dots,Y_{n})\in\mathbb{R}^{n}$ . We assume that the data points $({\bf X}_{i},Y_{i})$ for $i=1,\dots,n$ are independent and identically distributed from some underlying distribution $P$ on $\mathbb{R}^{p+1}$ . We wish to understand the performance of our fitted model when generalized to unseen data points, which can be formalized by a loss function on an observed response $y$ compared to a predicted response $\hat{y}$ :

$$\ell(\hat{y},y)\geq 0$$

such that $\ell(y,y)=0$ for all $y$ . The form of $\ell$ need not be specified and could be squared error loss, misclassification error, cross-entropy, etc. Now consider a model $f(\cdot)$ parameterized by $\theta$ . Let $\hat{f}({\bf x},\theta)$ be the function that predicts $y$ from ${\bf x}\in\mathbb{R}^{p}$ using the model with true parameters $\theta$ , which take values in the space $\Theta$ . Let $\mathcal{A}$ be a model-fitting algorithm that takes any number of data points and returns an estimated parameter vector $\hat{\theta}\in\Theta$ . Hence, $\hat{\theta}=\mathcal{A}({\bf X},Y)$ is the fitted value of the parameter based on the observed data ${\bf X}$ and $Y$ . Let $(\widetilde{\bf X},\widetilde{Y})\sim P$ be another independent test point (or dataset) from the same distribution. Using the training data, we are interested in finding the function $\hat{f}({\bf x},\hat{\theta})$ that minimizes the loss $\ell(\hat{f}(\widetilde{\bf X},\hat{\theta}),\widetilde{Y})$ . Note that $\ell(\hat{f}(\widetilde{\bf X},\hat{\theta}),\widetilde{Y})$ is a random and unknown object, and our target is one of two quantities:

$$\text{True Error Rate:}\quad\text{Err}_{XY}:=E\big[\ell(\hat{f}(\widetilde{\bf X},\hat{\theta}),\widetilde{Y})\mid({\bf X},Y)\big]\tag{1}$$

$$\text{Expected Error Rate:}\quad\text{Err}:=E\big[\ell(\hat{f}(\widetilde{\bf X},\hat{\theta}),\widetilde{Y})\big]=E[\text{Err}_{XY}]\tag{2}$$

Equations (1) and (2) are the two most natural quantities of interest to the analyst. Known as the true error rate, $\text{Err}_{XY}$ is the expected test error of the model that is fit on our actual training set. Err is the expected average error of the fitting algorithm run on the same-sized datasets drawn from the underlying distribution $P$ and is called the expected error rate. It is important to note that Err is a constant with respect to $({\bf X},Y)$ , while $\text{Err}_{XY}$ is a random variable that is a function of $({\bf X},Y)$ [1, 14]. In the Random Forest context, there is randomness associated with the model building itself (unlike OLS, which is deterministic given a dataset). However, the expected values in Equations (1) and (2) average over the randomness associated with the model building. There has been some work to formalize the distributional results associated with Random Forests (e.g., see Mentch and Hooker [12]), but the formal distributional structure is outside the scope of our work.
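For the classification setting studied in this paper, a worked instance may help. Assuming misclassification (0-1) loss, which the general framework permits but does not require, Equations (1) and (2) reduce to conditional and unconditional misclassification probabilities:

```latex
\begin{align*}
\ell(\hat{y},y) &= \mathbf{1}\{\hat{y}\neq y\},\\
\text{Err}_{XY} &= P\big(\hat{f}(\widetilde{\bf X},\hat{\theta})\neq\widetilde{Y}\mid({\bf X},Y)\big),\\
\text{Err} &= E[\text{Err}_{XY}] = P\big(\hat{f}(\widetilde{\bf X},\hat{\theta})\neq\widetilde{Y}\big).
\end{align*}
```

The last equality is the law of total expectation: averaging the conditional error over all training sets of the same size gives the unconditional error.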

Depending on the context of the statistical analysis, one may prefer to estimate either $\text{Err}_{XY}$ or Err. The former quantity is of the most interest to a practitioner deploying a specific model, whereas the latter may be of interest to a researcher comparing different fitting algorithms. To illustrate this difference, consider the following examples.

Suppose Statistician A is trying to estimate the average height of penguins, found in the wild, based on a sample of 100 researched penguins. Statistician A will use the original sample to build a model to estimate the average height of the next sample of penguins. When presenting the model, they will be interested in the true error rate ( $\text{Err}_{XY}$ ) of their specific model because they will want to know how the specific model they built will perform on the next dataset. That is, the research question focuses on the single model and its ability to predict future data. As Bates et al. [1] describe it, $\text{Err}_{XY}$ “is of the most interest to a practitioner deploying a specific model.”

However, if Statistician B is trying to accomplish a similar task of estimating the average height of penguins, based on a finite sample, but is unsure of the structure of the model to utilize, they will be interested in a slightly different error metric. Statistician B will run a variety of fitting algorithms to build numerous models based on the sample available and will need to compare the models. They will want to estimate the expected error rate (Err) to know the average error of the process by which the fitting algorithm is run on same-sized datasets drawn from the underlying distribution. Statistician B is less interested in the performance of the single model which was built using the sample at hand (which is Statistician A’s target); instead, their interest is in the performance of the process used to arrive at the model. As Bates et al. [1] describe it, Err is of “interest to a researcher comparing different fitting algorithms.”

The estimation of the theoretical quantities, $\text{Err}_{XY}$ and Err, deepens the understanding of the difference between the two. Both error quantities are expected values and unknown in any given Random Forest model. However, they can be approximated using simulations where the underlying population is known and available for repeated sampling. As mentioned above, the true error rate, $\text{Err}_{XY}$ , is the test error of the model that was fit on our actual training set. Hence, this quantity is estimated by the error that a single model produces on a new, theoretically infinitely large, test set. As illustrated by the left-hand column in Figure 1, the sample is used to create the model, and then the model is used to predict a large test set from the underlying population. The entire left column is repeated many times to estimate the true error rate of the given study set-up. The misclassification rate on the test set will be the true error rate.

Subsequently, Err is the average error of the fitting algorithm run on the same-sized datasets drawn from the underlying distribution $P$ , and is called the expected error rate. The expected error rate is calculated by averaging the true error rate, using a new model every time. As seen by the entire box in Figure 1, first the entire model fitting process is repeated to obtain each $\text{Err}_{XY}$ from all possible samples. Then, the estimate of Err is obtained by averaging many repeated values of $\text{Err}_{XY}$ .

In other words, the difference between the estimation of $\text{Err}_{XY}$ and Err is that the former uses one model, while the latter averages over many models. As seen from Equation (1), $\text{Err}_{XY}$ is conditional on the data, while in Equation (2), Err is unconditional. Note that Err averages over everything that is random in the model building process, including the randomness in the training set that produced the model and the bootstrapping inherent in a Random Forest model.

Figure 1: Empirical estimation of $\text{Err}_{XY}$ and Err. In the simulation study, the left column is repeated many times to estimate $\text{Err}_{XY}$ , and the entire box is repeated many times to get Err.
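The procedure in Figure 1 can be sketched in a few lines of Python. The sketch is illustrative only: it assumes a one-feature, two-class population and replaces the Random Forest with a simple midpoint-threshold classifier so that it runs with the standard library alone. The function names (`draw_sample`, `fit_threshold`, `err_xy`, `simulate`) and all settings are hypothetical and are not taken from the authors' code.

```python
import random
import statistics

def draw_sample(n, rng, shift=0.75):
    """Draw n labelled points: class 0 ~ N(0, 1), class 1 ~ N(shift, 1)."""
    data = []
    for i in range(n):
        y = i % 2  # alternate labels so the classes are balanced
        x = rng.gauss(shift if y == 1 else 0.0, 1.0)
        data.append((x, y))
    return data

def fit_threshold(train):
    """Stand-in for the fitting algorithm: midpoint of the two class means."""
    m0 = statistics.mean(x for x, y in train if y == 0)
    m1 = statistics.mean(x for x, y in train if y == 1)
    return (m0 + m1) / 2.0, m1 >= m0  # threshold and orientation

def predict(model, x):
    threshold, class1_above = model
    return int((x > threshold) == class1_above)

def err_xy(model, rng, n_test=2000):
    """Approximate Err_XY: error of ONE fitted model on a large fresh test set."""
    test = draw_sample(n_test, rng)
    return sum(predict(model, x) != y for x, y in test) / n_test

def simulate(n=50, reps=200, seed=1):
    """Repeat the left column of Figure 1; the average of the results approximates Err."""
    rng = random.Random(seed)
    values = [err_xy(fit_threshold(draw_sample(n, rng)), rng) for _ in range(reps)]
    return values, statistics.mean(values)

err_xy_values, err = simulate()
```

Each entry of `err_xy_values` belongs to one training set and therefore varies from replicate to replicate, whereas `err` is a single number for the whole study set-up.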

While it may initially appear that the quantity $\text{Err}_{XY}$ is easier to estimate, since it concerns the model at hand, it has been observed that, in some settings, the cross-validation empirical estimate of error is weakly correlated with $\text{Err}_{XY}$ (given a particular population at hand) [15]. The disassociation issue is mainly attributed to data re-usage.

Theoretically, $\text{Err}_{XY}$ can be decomposed into four parts (see Equations (3) - (6)), as seen in Faraway [4]. Let $({\bf X}_{\infty},Y_{\infty})$ represent a black box with the ability to generate datasets of unlimited size and quantity, without defining the true underlying model generating mechanism; data of unlimited size enable the best possible model $f(\cdot)$ to be chosen. The true model $f$ is given in contrast to the model found from the original dataset, $\hat{f}$ . The parameter is either given by the infinite data, $\theta_{\infty}$ , or estimated using the original data, $\hat{\theta}$ .

$$\begin{aligned}
\text{Err}_{XY} &= E[\ell(f({\bf X}_{\infty},\theta_{\infty}),Y_{\infty})\mid({\bf X},Y)] && \text{best performance} && (3)\\
&\quad+E[\ell(\hat{f}({\bf X}_{\infty},\theta_{\infty}),Y_{\infty})-\ell(f({\bf X}_{\infty},\theta_{\infty}),Y_{\infty})\mid({\bf X},Y)] && \text{model selection cost} && (4)\\
&\quad+E[\ell(\hat{f}({\bf X},\hat{\theta}),Y)-\ell(\hat{f}({\bf X}_{\infty},\theta_{\infty}),Y_{\infty})\mid({\bf X},Y)] && \text{parameter estimation cost} && (5)\\
&\quad+E[\ell(\hat{f}(\widetilde{\bf X},\hat{\theta}),\widetilde{Y})-\ell(\hat{f}({\bf X},\hat{\theta}),Y)\mid({\bf X},Y)] && \text{data re-use cost} && (6)
\end{aligned}$$

Term (3) represents the best performance of the prediction error; it is the expected loss on the correct model using all the possible data. Term (4) represents the difference between the loss for the true model on infinite data and the loss for the selected model on infinite data. Term (5) represents the difference in loss using the data model but estimating the parameter with a single (independent) dataset versus infinite data. Term (6) represents the difference in loss between the data model and the parameter estimates using the original data versus an independent dataset.
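That terms (3) - (6) add up to $\text{Err}_{XY}$ can be checked by telescoping. The shorthand $A$, $B$, $C$, $D$ below is introduced only for this check and does not appear in Faraway [4]:

```latex
\begin{align*}
A &= \ell(f({\bf X}_{\infty},\theta_{\infty}),Y_{\infty}), &
B &= \ell(\hat{f}({\bf X}_{\infty},\theta_{\infty}),Y_{\infty}),\\
C &= \ell(\hat{f}({\bf X},\hat{\theta}),Y), &
D &= \ell(\hat{f}(\widetilde{\bf X},\hat{\theta}),\widetilde{Y}),\\[4pt]
\underbrace{A}_{(3)}+\underbrace{(B-A)}_{(4)}&+\underbrace{(C-B)}_{(5)}+\underbrace{(D-C)}_{(6)} = D, &
E[D\mid({\bf X},Y)] &= \text{Err}_{XY}.
\end{align*}
```

Every intermediate loss cancels, so by linearity of conditional expectation the four terms sum to the right-hand side of Equation (1).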

The most interesting component is the final term (6), caused by data re-usage, which has a non-zero expectation when the same data points are used for both model selection and parameter estimation. If one uses a validation set approach (for model selection and parameter estimation, see SDT in Table 2), term (6) will have an expectation of zero because each observation is only used once. However, as estimated empirically, when using a full data approach, term (6) can be large and easily cancel out any advantages the full data has in model selection and parameter estimation [4]. Thus, the full data strategy will have lower model selection and parameter estimation costs than the validation set strategy due to the higher number of observations used to complete the model selection and parameter estimation processes, but the advantages can be swamped by the data re-use cost.

The difference between the full data and validation set strategies, seen in comparing terms (4) & (5), is bounded and well understood as an effect of sample size [4]. Despite suffering in model selection and parameter estimation costs, the validation set strategy will have a lower data re-use cost than the full data strategy, and we know the data re-use cost term (6) could be very large. Therefore, we would like to investigate the trade-offs in the data re-use cost versus the model selection and parameter estimation costs, in the Random Forest context. Our study comes in the form of analyzing various model fitting approaches and the resulting estimates of error. Specifically the use of OOB errors compared to validation set and cross-validation strategies is investigated.

3 Methods

Simulated data are used to study the behavior of modeling strategies in the setting in which all predictor variables are uncorrelated. The results provide insight into the mechanisms which lead to different targets in error estimates. We follow the set-up of Janitza and Hornung [9] who use simulation to demonstrate that the OOB error overestimates the true error rate. We recognize that a larger simulation study would provide more nuance of the observed behavior. However, the simulations at hand address our main goals of the paper: (i) to document the switch in direction of error proximity when comparing Random Forests to logistic regression, and (ii) to explore the performance of an expanded set of error estimation strategies.

3.1 Data Generation and Settings

The bias of error estimates in different data settings with numeric predictor variables is systematically investigated by means of simulation studies on data with a balanced binary response variable. The settings considered are:

• Different number of predictors, $p\in\{10,100\}$ .

• Different number of observations such that $n<p$ , $n>p$ , $n\gg p$ .

As done when modeling real data, several Random Forests with different $mtry$ values are constructed for each setting ( $mtry$ is the parameter that determines the number of randomly chosen variables to be considered for each split on a tree). In the parameter estimation process, the possible values for $mtry$ range from $mtry=1$ all the way up to $mtry=p$ . Note that for $mtry=1$ there is no selection of an optimal predictor variable for a split, while for $mtry=p$ the Random Forest method coincides with the bagging procedure which selects the best predictor variable from the entire set of predictors for binary splitting at each internal node.
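The role of $mtry$ at a single split can be sketched as follows. This is a minimal illustration using only the Python standard library; `candidate_features` is a hypothetical helper, not a function from any Random Forest package.

```python
import random

def candidate_features(p, mtry, rng):
    """Return the indices of the mtry features considered at one split."""
    if not 1 <= mtry <= p:
        raise ValueError("mtry must lie between 1 and p")
    return sorted(rng.sample(range(p), mtry))

rng = random.Random(0)
p = 10
one = candidate_features(p, 1, rng)    # mtry = 1: one random feature, nothing to choose among
all_p = candidate_features(p, p, rng)  # mtry = p: every feature is a candidate, as in bagging
```

With `mtry = 1` the split variable is effectively chosen at random, while with `mtry = p` the candidate set is always the full set of predictors.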

Generally, the number of trees chosen is a trade-off between accuracy and computational speed. More trees are necessary when using a large number of predictor variables. The OOB error stabilizes at around 250 trees in convergence studies [5], which concluded that 1000 trees might be sufficiently large for a genome-wide dataset of more than 300,000 predictor variables. Also in high-dimensional settings, Random Forests with 500 trees and 1000 trees yield very similar OOB errors [7]. In accordance with these findings in the literature, the number of trees is set to 500 in all of our Random Forest simulations. Each dataset described in Table 1 is repeated 1000 times to obtain stable results (all reported empirical errors are the average of 1000 replicates).

Only numeric predictor variables are considered in the studies. Both predictors associated with the response and predictors not associated with the response are considered, with all predictors distributed independently of one another. The predictors not associated with the response follow a standard normal distribution. The distribution of predictors with association is different for each response class. The predictor values for observations from class 1 are always drawn from a standard normal distribution. The predictor values for observations from class 2 are drawn from a normal distribution with variance 1 and a mean different from zero. Table 1 gives an overview of the distribution of predictors in the response classes. Let us consider the setting with $p=10$ as an example. The first two predictors $X_{1}$ and $X_{2}$ are associated with the response, while the other predictors $X_{3},\dots,X_{10}$ are noise. Hence, $X_{3},\dots,X_{10}$ follow a standard normal distribution, while the distributions of $X_{1}$ and $X_{2}$ depend on the class to which the observations belong. If the observation comes from class 1, the distribution of $X_{1}$ and $X_{2}$ is $N(0,1)$ ; for class 2, $X_{1}$ and $X_{2}$ are distributed $N(\mu_{2},1)$ , where the mean $\mu_{2}$ is itself drawn from $N(0.75,1)$ (see Table 1). Randomly drawing the mean separately for $X_{1}$ and $X_{2}$ and for each repetition of the study ensures that predictors with different effect strengths are considered.

Number of Predictors	Predictors	class 1: $N(\mu_{1},1)$	class 2: $N(\mu_{2},1)$
$p=10$	$X_{1}$	$\mu_{1}=0$	$\mu_{2}\sim N(0.75,1)$
	$X_{2}$	$\mu_{1}=0$	$\mu_{2}\sim N(0.75,1)$
	$X_{3},\dots,X_{10}$	$\mu_{1}=0$	$\mu_{2}=0$
$p=100$	$X_{1}$	$\mu_{1}=0$	$\mu_{2}\sim N(0.75,1)$
	$\vdots$	$\vdots$	$\vdots$
	$X_{10}$	$\mu_{1}=0$	$\mu_{2}\sim N(0.75,1)$
	$X_{11},\dots,X_{100}$	$\mu_{1}=0$	$\mu_{2}=0$

Table 1: Simulation set up describing the distribution of predictors in class 1 and class 2, as in Janitza and Hornung [9]. Note that class 2 is shifted from class 1 for only a few of the $p$ variables. Also note that all predictor variables are independent and normally distributed.

It is important to note that all predictors are uncorrelated. Although assuming no correlations between any of the predictors is not necessarily realistic, such settings are important to investigate in order to understand the mechanisms which lead to different targets in error estimation.
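The data-generating mechanism of Table 1 can be sketched with the Python standard library. The function name `simulate_dataset`, the argument `n_signal`, and the alternating class assignment are assumptions made for illustration; the text specifies only that the classes are balanced.

```python
import random

def simulate_dataset(n, p, n_signal, rng):
    """Simulate one balanced two-class dataset following Table 1.

    n_signal is the number of predictors associated with the response
    (2 when p = 10 and 10 when p = 100 in Table 1).
    """
    # One class-2 mean per signal predictor, drawn from N(0.75, 1);
    # noise predictors have mean 0 in both classes.
    mu2 = [rng.gauss(0.75, 1.0) for _ in range(n_signal)] + [0.0] * (p - n_signal)
    X, y = [], []
    for i in range(n):
        label = 1 + (i % 2)  # assumed: alternate classes 1 and 2 for balance
        means = mu2 if label == 2 else [0.0] * p
        X.append([rng.gauss(m, 1.0) for m in means])  # independent predictors, variance 1
        y.append(label)
    return X, y, mu2

X, y, mu2 = simulate_dataset(n=50, p=10, n_signal=2, rng=random.Random(3))
```

Calling the function once per repetition redraws the class 2 means, so the effect strengths differ from one simulated dataset to the next.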

3.2 Strategies for Error Estimation

The modeling process consists of parameter estimation followed by error estimation. An important point of consideration when completing the two estimation steps is the choice of which subset of observations will be used in each operation. Often, data for parameter estimation and data for error estimation are collected at the same time, thus resulting in a single sample that needs to be apportioned to both parameter and error estimation. Finding the optimal model complexity requires an external test dataset [8]. In an ideal world, to avoid “data snooping”, one needs one dataset for model building, one for parameter estimation, and then after a model is accepted, another dataset for error estimation. However, rarely are three independent datasets available, so one may need to do the best one can with the data available. Hence, when modeling it is important to outline the strategy that will be used to construct the model and then estimate its error.

We chose a variety of strategies to separately target the parameter estimation and error estimation steps in the modeling process, and thus each strategy consists of three parts. In the descriptions that follow, in-fold represents the observations used to fit the model using cross-validation; out-of-fold represents the observations that are held out of the model fitting using cross-validation. In-bag are the observations that are used to fit the model using Random Forests; out-of-bag (OOB) represents the observations that are held out of the model fitting when using Random Forests. A more complete description of cross-validation, bagging, and Random Forests can be found in James et al. [10].
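The in-bag and out-of-bag sets for a single tree can be sketched as follows, assuming the usual bootstrap resampling with replacement; some implementations subsample without replacement instead. The helper `bootstrap_split` is hypothetical.

```python
import random

def bootstrap_split(n, rng):
    """Draw one bootstrap sample of size n; return in-bag and out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]  # sampled with replacement
    oob = sorted(set(range(n)) - set(in_bag))      # never drawn for this tree
    return in_bag, oob

rng = random.Random(42)
n = 1000
in_bag, oob = bootstrap_split(n, rng)
oob_fraction = len(oob) / n  # expected value is (1 - 1/n)^n, roughly 0.37 for large n
```

Because each observation is out-of-bag for roughly a third of the trees, the OOB error for an observation is computed from only those trees that did not see it during fitting.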

For each of the considered strategies, there are three aspects to consider (see Figure 2 and Table 2). The original data is divided differently into three datasets: Data₁, Data₂, and Data₃. See Table 2 for details on how each strategy divides up the original dataset. Sometimes, e.g., SDT, the three datasets are completely independent. Sometimes, e.g., FDO, the same observations show up throughout each of the three datasets.

1. Model Fitting: Data₁ is used to fit a series of models to consider. For example, with Random Forests, there might be 5 models fit ( $\hat{f}_{1},\hat{f}_{2},\hat{f}_{3},\hat{f}_{4},\hat{f}_{5}$ ), each with a different value of $mtry$ .

2. Parameter Estimation: Data₂ is used to select the value of the parameter. In the above case, the value of $mtry$ is selected by assessing Data₂ on each of $\hat{f}_{1},\hat{f}_{2},\hat{f}_{3},\hat{f}_{4},\hat{f}_{5}$ .

3. Error Estimation: Once the model and parameter have been selected, Data₃ is used to estimate the prediction error of the selected model.

Note that in some of the error estimation strategies, there is no parameter estimation (see Table 2). In that case, there is only a need for two distinct datasets, not three (i.e., Data₂ is not used in the strategy). The following error estimation strategies are considered:

• Logistic Regression CV Error (LGCV): The logistic model is built on the in-fold dataset and the error of the model is estimated via the out-of-fold data using 4-fold cross-validation. Logistic regression models are run only for $n\geq p$ .

• Full Data Set CV Error (FDCV): Parameters are set prior to model building with $mtry=\sqrt{p}$ . The Random Forest is built on the in-fold dataset and the error of the model is estimated via out-of-fold data using 4-fold cross-validation.

• Full Data Set OOB Error (FDO): Parameter and error estimation are done on the same dataset. $mtry$ is chosen by using the OOB error rate. Hence, the Random Forest (and corresponding $mtry$ ) with the lowest OOB error rate is chosen and the OOB error is returned as the error estimate.

• Split Data Set OOB Error (SDO): The sample is divided into training and testing sets. Parameter estimation is done on the (in-bag) training set, using the OOB error rate to select $mtry$ . The error of the Random Forest, built on the entire training set, is then estimated by predicting on the testing set and reporting the test error. The error estimation uses only test observations that are not part of the set of observations that are considered for constructing the Random Forest.

• Split Data Set CV Error (SDCV): The sample is divided into training and testing sets. Parameter estimation is done on the training set, using 4-fold cross-validated error estimates to select $mtry$ . The error of the Random Forest, built on the entire training set (using the cross-validated $mtry$ ), is then estimated by predicting on the testing set and reporting the test error. The error estimation uses only test observations that are not part of the set of observations that are considered for constructing the Random Forest.

• Split Data Set Test Error (SDT): The sample is divided into three independent training, validation, and testing sets. Parameter estimation of $mtry$ is done by finding the minimum validation error across training models with differing values of $mtry$ . The error of the Random Forest, built solely on the training set (using the validation set $mtry$ ), is estimated by predicting on the testing set and reporting the test error. The error estimation uses only test observations that are not part of the set of observations that are considered for constructing the Random Forest.

Strategy	Model Fitting (Data₁)	Parameter Estimation (Data₂)	Error Estimation (Data₃)
LGCV	In-Fold observations from full dataset	N/A	Out-of-Fold observations from full dataset
FDCV	In-Fold observations from full dataset	N/A	Out-of-Fold observations from full dataset
FDO	In-Bag observations from full dataset	OOB observations from full dataset	OOB observations from full dataset
SDO	In-Bag observations from training dataset	OOB observations from training dataset	Test dataset
SDCV	In-Fold observations from training dataset	Out-of-Fold observations from training dataset	Test dataset
SDT	Training dataset	Validation dataset	Test dataset

Table 2: Data used in each step of the error estimation strategies. The three columns across each strategy are pictured in Figure 2 and include model fitting, parameter estimation, and error estimation. Each row describes a different method for partitioning the data into the three groups.
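The partitions in Table 2 can be sketched as index sets. The example below shows a three-way split of the kind SDT uses and the in-fold and out-of-fold sets of 4-fold cross-validation. The split fractions are hypothetical, since the text does not state them, and the helper names are illustrative.

```python
import random

def three_way_split(n, rng, fractions=(0.5, 0.25, 0.25)):
    """SDT-style split into disjoint training, validation, and test index sets."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def k_fold_indices(n, k, rng):
    """Yield (in_fold, out_of_fold) index lists for k-fold cross-validation."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering every index
    for i in range(k):
        in_fold = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield in_fold, folds[i]

train, val, test = three_way_split(100, random.Random(0))
cv_folds = list(k_fold_indices(100, 4, random.Random(0)))
```

In the three-way split every observation is used in exactly one step, whereas in cross-validation each observation is out-of-fold once and in-fold for the remaining fits.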

4 Results

4.1 Distance to Target Errors

Ideally, the estimated error rate of a statistical model should be both close to the truth and low. But that raises the question: close to which truth, $\text{Err}_{XY}$ or Err? In an effort to compare the results of our simulations to those of Bates et al. [1], Figure 3 shows the distance of $\widehat{\text{Err}}^{\text{(LGCV)}}$ from Err compared to its distance from $\text{Err}_{XY}$ . Similar to Figure 8 in Bates et al. [1], we can see that in the logistic regression model,

$$|\widehat{\text{Err}}^{\text{(LGCV)}}-\text{Err}_{XY}|>|\widehat{\text{Err}}^{\text{(LGCV)}}-\text{Err}|.$$

The difference lessens as $n\rightarrow\infty$ . Regardless, repeated simulations consistently confirm that $\widehat{\text{Err}}^{\text{(LGCV)}}$ is, on average, closer to Err than to $\text{Err}_{XY}$ .

In Figure 4 we see that the error proximity relationship has flipped between logistic regression models and Random Forests. $\widehat{\text{Err}}^{\text{(LGCV)}}$ is closer to Err, but $\widehat{\text{Err}}^{\text{(FDCV)}}$ is closer to $\text{Err}_{XY}$ . As a reminder, the difference between $\widehat{\text{Err}}^{\text{(LGCV)}}$ and $\widehat{\text{Err}}^{\text{(FDCV)}}$ is that the former is an error estimate for a logistic model while the latter is an error estimate for a Random Forest model. Both are cross-validated estimates from models built on the in-fold data with no parameter tuning. Thus, the relationship highlighted by Bates et al. [1] seems to be specific to generalized linear models, as they investigated only linear and logistic regression models.

The flip in relationship may be attributed to the difference in the way each model utilizes the data. In logistic regression, the coefficients are estimated via maximum likelihood estimation, thus possibly leading to over-fitting and biased estimates of error due to the model optimizing for the specific dataset. On the other hand, bagging and other resampling techniques can be used to reduce the variance in model predictions. In Random Forests, the bias of the full forest is equivalent to the bias of a single decision tree (which itself has low bias and high variance) [8]. However, by creating many trees and then averaging them, the variance of the final forest can be greatly reduced over that of a single tree. In practice, the only limitation on the size of the forest is computing time, as an infinite number of trees could be trained without ever increasing bias and with a continual (if asymptotically declining) decrease in the variance. As a result, the logistic regression model may be less informative on the “next” sample than a Random Forest. Hence, $\widehat{\text{Err}}$ is closer to $\text{Err}_{XY}$ than to Err for Random Forests because their resampling methods build the model on data more akin to wild data.

We further explore the difference between $\widehat{\text{Err}}$ for logistic regression and Random Forests in an experiment with $n=50$ observations and $p=10$ features (repeated 20 times); see Figure 5. In the right plot, there seems to exist a pairing between $\widehat{\text{Err}}^{\text{(FDCV)}}$ and $\text{Err}_{XY}$ (linking the two estimates from the same dataset), where high estimates of $\widehat{\text{Err}}^{\text{(FDCV)}}$ are paired with high estimates of $\text{Err}_{XY}$ and low estimates of $\widehat{\text{Err}}^{\text{(FDCV)}}$ are paired with low estimates of $\text{Err}_{XY}$ (i.e., very few of the linking lines cross). In a logistic regression model, this pairing does not seem to exist, as seen in the left plot (i.e., most of the linking lines cross). In other words, $\widehat{\text{Err}}^{\text{(LGCV)}}$ is seen to be neither closer to nor farther from Err than from $\text{Err}_{XY}$ , but $\widehat{\text{Err}}^{(\text{FDCV})}$ tends to be closer to $\text{Err}_{XY}$ than to Err.

Returning to the remaining strategies that all use Random Forests, Figures 6 - 9 show that across the error estimation strategies, $\widehat{\text{Err}}$ is closer to $\text{Err}_{XY}$ than to Err on average. Despite this relationship, the differences in mean absolute deviations, from both $\text{Err}_{XY}$ and Err, tend to be quite small.
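The comparison of mean absolute deviations can be sketched as a small helper. The function name and the three replicate values are hypothetical and serve only to show the computation; Err is approximated by the average of the per-replicate true error rates, as in Equation (2).

```python
from statistics import mean

def mean_abs_distances(err_hat, err_xy):
    """Compare estimates with both targets across simulation replicates.

    err_hat[i] is the empirical estimate and err_xy[i] the true error
    rate for replicate i. Err is approximated by the mean of err_xy.
    """
    if len(err_hat) != len(err_xy) or not err_hat:
        raise ValueError("need two non-empty lists of equal length")
    err = mean(err_xy)
    to_err_xy = mean(abs(h - t) for h, t in zip(err_hat, err_xy))
    to_err = mean(abs(h - err) for h in err_hat)
    return to_err_xy, to_err

# Hypothetical numbers for three replicates, for illustration only.
to_err_xy, to_err = mean_abs_distances([0.30, 0.40, 0.35], [0.31, 0.39, 0.35])
```

In this made-up example the estimates track the per-dataset true error rates, so `to_err_xy` is smaller than `to_err`, which is the pattern reported for Random Forests.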

4.2 Distance Across Error Estimation Strategies

Section 4.1 details our investigation of the distance of the empirical estimates of error both to $\text{Err}_{XY}$ and to $\text{Err}$. We show that in Random Forests, the error estimates are closer to $\text{Err}_{XY}$ than to $\text{Err}$, which is the opposite of the result of Bates et al. [1] for generalized linear models. Now we assess a follow-up question: how close is $\widehat{\text{Err}}$ to $\text{Err}_{XY}$? Figure 10 compares the strategies according to the expected value of $|\widehat{\text{Err}}-\text{Err}_{XY}|$.

In the case of $p=10$ features, the strategies that utilize the in-fold or in-bag dataset to train the model (LGCV, FDCV, and FDO) outperform the split data approaches (SDCV, SDO, SDT). The strategies without independent validation sets do not seem to over-fit, whereas the error estimates from train/test splits suffer a drop in performance, likely due to the smaller sample sizes available for fitting the model and its parameters. It is important to mention that FDO is the only strategy of the three (LGCV, FDCV, and FDO) that tunes parameters (here, $mtry$).

In the case of $p=100$ features, FDCV and FDO are the best candidates. In contrast to $p=10$ features, FDCV and FDO perform better than LGCV when $n\geq p$ (LGCV is not fit when $n<p$). As with $p=10$, the data-splitting methods (SDCV, SDO, and SDT) have worse performance. Once again, it is important to note that, of the strategies that build the model on the entire dataset, FDO tunes the model’s parameters, whereas FDCV does not.

5 Discussion

The Random Forest error rate investigation in our paper has two main components. First, we discuss the difference in error targets presented by Bates et al. [1]. In their work, they find that in the special case of the generalized linear model using unregularized OLS for model-fitting, common estimates of prediction error (cross-validation, bootstrap, data splitting, and covariance penalties) should be viewed as estimates of the expected error rate, averaged across other hypothetical datasets from the same distribution. Our primary result is that, in the classification case, Random Forests’ empirical estimates of prediction error can be taken as an estimate of the true error rate ($\text{Err}_{XY}$) instead of as an estimate of the expected error rate (Err), which is the opposite of the result of Bates et al. [1], whose work includes logistic regression. In simulations, the result held across error estimation strategies such as cross-validation, bagging, and data splitting (see Figures 4–9). The result is present for a variety of choices of sample size, $n$, and number of features, $p$. Nonetheless, we wish to be clear that the estimates of prediction error were a good approximation of both the true error rate ($\text{Err}_{XY}$) and the expected error rate (Err) in the data splitting cases.

A fundamental open question is to understand the size of the gap between estimates of prediction error and the true and expected error rates. The present work focuses on determining to which target the estimate is closer. It remains to understand under what conditions the gap is large enough that the method of error estimation should be modified depending on the target. Roughly speaking, we expect the gap between any of the error rates to be small when $n/p$ is large. In our simulations, the gap between estimates of prediction error and the true and expected error rates is always smaller than 0.01; as $n$ increases, the gap decreases. Other future directions include the investigation of the relationship among $\widehat{\text{Err}}$, $\text{Err}_{XY}$, and Err in correlated and/or imbalanced data.

Second, we discussed the performance of a variety of error estimation strategies. Error estimates from models built on the entire sample (e.g., using CV or OOB strategies) are closer to the true error rate than those from models built on a training set and evaluated on a testing set. Therefore, the strategies that do not use a holdout set seem to be a more appealing choice for model building, regardless of whether parameter tuning is performed. Empirically, the strategies that use resampling techniques rather than a holdout set are preferable.

Whereas Bates et al. [1] show $\widehat{\text{Err}}$ to be closer to Err than to $\text{Err}_{XY}$ in generalized linear models, we show $\widehat{\text{Err}}$ to be closer to $\text{Err}_{XY}$ than to Err in Random Forests. Additionally, resampling techniques seem to outperform data splitting approaches in Random Forests.

References

Bates et al. [2023] Bates, S., Hastie, T., Tibshirani, R.: Cross-validation: what does it estimate and how well does it do it? Journal of the American Statistical Association (2023) https://doi.org/10.1080/01621459.2023.2197686
Breiman [2001] Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001) https://doi.org/10.1023/A:1010950718922
Bylander [2002] Bylander, T.: Estimating generalization error on two-class datasets using out-of-bag estimates. Machine Learning 48(1-3), 287–297 (2002)
Faraway [2014] Faraway, J.J.: Does data splitting improve prediction? Statistics and Computing 26(1–2), 49–60 (2014) https://doi.org/10.1007/s11222-014-9522-9
Goldstein et al. [2010] Goldstein, B., Hubbard, A., Cutler, A., Barcellos, L.: An application of random forests to a genome-wide association dataset: Methodological considerations & new findings. BMC genetics 11, 49 (2010)
Goldstein et al. [2011] Goldstein, B.A., Polley, E.C., Briggs, F.B.S.: Random forests for genetic association studies. Statistical Applications in Genetics and Molecular Biology 10(1) (2011) https://doi.org/10.2202/1544-6115.1691
Genuer et al. [2008] Genuer, R., Poggi, J.-M., Tuleau, C.: Random Forests: some methodological insights. arXiv (2008). https://doi.org/10.48550/ARXIV.0811.3619 . https://arxiv.longhoe.net/abs/0811.3619
Hastie et al. [2001] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA (2001)
Janitza and Hornung [2018] Janitza, S., Hornung, R.: On the overestimation of random forest’s out-of-bag error. PLOS ONE 13(8), 1–31 (2018) https://doi.org/10.1371/journal.pone.0201904
James et al. [2013] James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning: with Applications in R. Springer (2013). https://faculty.marshall.usc.edu/gareth-james/ISL/
Kaggle [2017] Kaggle: The State of Data Science & Machine Learning (2017). https://ailab-ua.github.io/courses/resources/the_state_of_data_science_machine_learning_-_kaggle_2017_survey.pdf
Mentch and Hooker [2016] Mentch, L., Hooker, G.: Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research 17(26), 1–41 (2016)
Mitchell [2011] Mitchell, M.: Bias of the random forest out-of-bag (oob) error for certain input parameters. Open Journal of Statistics 01, 205–211 (2011)
Rajanala et al. [2022] Rajanala, S., Bates, S., Hastie, T., Tibshirani, R.: Confidence Intervals for the Generalisation Error of Random Forests (2022)
Yousef [2019] Yousef, W.A.: A Leisurely Look at Versions and Variants of the Cross Validation Estimator. arXiv (2019). https://doi.org/10.48550/ARXIV.1907.13413 . https://arxiv.longhoe.net/abs/1907.13413
Zhang et al. [2010] Zhang, G.-Y., Zhang, C.-X., Zhang, J.-S.: Out-of-bag estimation of the optimal hyperparameter in subbag ensemble method. Communications in Statistics - Simulation and Computation 39(10), 1877–1892 (2010) https://doi.org/10.1080/03610918.2010.521277