Johanna Hardin
Department of Mathematics & Statistics, Pomona College, 610 N. College Ave., Claremont, CA 91711, USA
Prediction Error Estimation in Random Forests
Abstract
In this paper, error estimates of classification Random Forests are quantitatively assessed. Based on the initial theoretical framework built by Bates et al. [1], the true error rate and expected error rate are theoretically and empirically investigated in the context of a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests’ estimates of prediction error are closer on average to the true error rate than to the average prediction error. This is the opposite of the findings of Bates et al. [1], which are given for logistic regression. We further show that our result holds across different error estimation strategies such as cross-validation, bagging, and data splitting.
Keywords: true error rate, expected error rate, cross-validation, bootstrapping, bagging, data-splitting
MSC Classification: 62J20, 62F40
1 Introduction
As evidenced by the 2017 State of Data Science and Machine Learning report by Kaggle, almost half of data scientists use Random Forests at work [11]. Random Forests [2] are a popular classification tool in a variety of fields, especially because of their excellent performance in very complex data settings. Because out-of-bag (OOB) errors are theoretically and computationally simple improvements over a train-test split, they have become ubiquitous. When deploying a predictive model, it is important to understand its prediction accuracy on future test points; both good point estimates and an understanding of the variability of the estimates are essential. When Random Forests are implemented, the OOB error is a widely-used approach for point and interval estimate tasks, but in spite of OOB’s seeming simplicity, its properties remain opaque. In the past, the OOB error has been affirmed to be an unbiased estimate of the true error rate [16, 6]. Nonetheless, it has been shown that for two-class classification problems the OOB error can overestimate the true prediction error [3, 13]. It was later argued that the use of stratified subsampling with sampling fractions that are proportional to response class sizes of the training data yielded almost unbiased error rates [9]. The present work is primarily concerned with OOB errors, but also addresses other common methods such as data splitting and cross-validation, as well as their combination with OOB errors.
Despite the apparent straightforwardness of data splitting, cross-validation, and bagging, the formal properties of these modeling techniques are subtle. When calculating an error rate, the question of “what are we estimating?” rightfully reappears often. In an innovative paper, Bates et al. [1] provide a framework for understanding and measuring prediction error in the case of least squares linear models. They prove that cross-validation does not estimate the true error rate (see Equation (1)) for the model at hand, fit to the training data. Instead, the empirical cross-validation error “estimates the average prediction error of models fit on other unseen training sets drawn from the same population,” i.e., the expected error rate (see Equation (2)). They expand their work in simulations to show that the same trend holds for logistic regression.
To our knowledge, in the case of Random Forests, ours is the first study investigating the different types of prediction error as a metric for the accuracy of the model on training, cross-validated, out of bag, test, and future data. We illustrate and extend the work of Bates et al. [1] to the setting of Random Forests, where empirical evidence for Random Forests shows the opposite of what Bates et al. [1] proved for (generalized) linear models. For Random Forests, the empirical errors (found via data splitting, cross-validation, or bagging) are closer, on average, to the true error rate (see Equation (1)) than to the expected error rate (see Equation (2)). The distinction between the true error rate (see Equation (1)) and the expected error rate (see Equation (2)) is quite subtle. Section 2 goes into great detail explaining the differences between the two theoretical error types.
The main contribution of our work is two-fold: (i) investigation of the switch in direction of the proximity of Random Forests’ error estimates to the error targets from that presented by Bates et al. [1] on logistic regression, and (ii) exploration of the performance of an expanded set of error estimation strategies.
The paper is structured as follows: In Section 2, we set up notation and introduce the two different error targets: true error rate ($\text{Err}_{XY}$) and expected error rate (Err). Subsequently, Section 3 introduces simulation-based studies. The descriptions include an outline of the simulated data, the considered settings, and several model building workflows that will be investigated. In Section 4, we present the results of the studies. The results are discussed in Section 5 alongside recommendations.
All code used for the simulations is available at:
https://github.com/iankrupkin/Prediction-Error-Estimation-in-Random-Forests
2 What Prediction Error are We Estimating?
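Before turning to our main method in the next section, we introduce our notation and review topics related to error targets. We consider the supervised learning setting where we have $p$ features and $n$ observations, denoted by $X \in \mathbb{R}^{n \times p}$, with $x_i$ the $i^{th}$ row of $X$, and response $Y = (y_1, \ldots, y_n)$. We assume that the data points $(x_i, y_i)$ for $i = 1, \ldots, n$ are independent and identically distributed from some underlying distribution $P$ on $\mathbb{R}^p \times \mathbb{R}$. We wish to understand the performance of our fitted model when generalized to unseen data points, which can be formalized by a loss function on an observed response $y$ compared to a predicted response $\hat{y}$:
$$\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_{\geq 0}$$
such that $\ell(y, y) = 0$ for all $y$. The form of $\ell$ need not be specified and could be squared error loss, misclassification error, cross-entropy, etc. Now consider a model parameterized by $\theta$. Let $\hat{f}(x, \theta)$ be the function that predicts $y$ from $x \in \mathbb{R}^p$ using the model with true parameters $\theta$, which take values in the space $\Theta$. Let $\mathcal{A}$ be a model-fitting algorithm that takes any number of data points and returns an estimated parameter vector $\hat{\theta} \in \Theta$. Hence, $\hat{\theta} = \mathcal{A}(X, Y)$ is the fitted value of the parameter based on the observed data $X$ and $Y$. Let $(x_0, y_0)$ be another independent test point (or dataset) from the same distribution. Using the training data, we are interested in finding the function $\hat{f}$ that minimizes the loss $\ell(\hat{f}(x_0, \hat{\theta}), y_0)$. Note that this loss is a random and unknown object, and our target is one of two quantities:
$$\text{True Error Rate: } \text{Err}_{XY} = \mathbb{E}\left[\ell\big(\hat{f}(x_0, \hat{\theta}), y_0\big) \mid (X, Y)\right] \qquad (1)$$
$$\text{Expected Error Rate: } \text{Err} = \mathbb{E}\left[\text{Err}_{XY}\right] \qquad (2)$$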
Equations (1) and (2) are the two most natural quantities of interest to the analyst. Known as the true error rate, $\text{Err}_{XY}$ is the expected test error of the model that is fit on our actual training set. Err is the expected average error of the fitting algorithm run on the same-sized datasets drawn from the underlying distribution and is called the expected error rate. It is important to note that Err is a constant with respect to $(X, Y)$, while $\text{Err}_{XY}$ is a function of $(X, Y)$ [1, 14]. In the Random Forest context, there is randomness associated with the model building itself (unlike OLS, which is deterministic given a dataset). However, the expected values in Equations (1) and (2) average over the randomness associated with the model building. There has been some work to formalize the distributional results associated with Random Forests (e.g., see Mentch and Hooker [12]), but the formal distributional structure is outside the scope of our work.
Depending on the context of the statistical analysis, one may prefer to estimate either $\text{Err}_{XY}$ or Err. The former quantity is of the most interest to a practitioner deploying a specific model, whereas the latter may be of interest to a researcher comparing different fitting algorithms. To illustrate this difference, consider the following examples.
Suppose Statistician A is trying to estimate the average height of penguins, found in the wild, based on a sample of 100 researched penguins. Statistician A will use the original sample to build a model to estimate the average height of the next sample of penguins. When presenting the model, they will be interested in the true error rate ($\text{Err}_{XY}$) of their specific model because they will want to know how the specific model they built will perform on the next dataset. That is, the research question focuses on the single model and its ability to predict future data. As Bates et al. [1] describe it, $\text{Err}_{XY}$ “is of the most interest to a practitioner deploying a specific model.”
However, if Statistician B is trying to accomplish a similar task of estimating the average height of penguins, based on a finite sample, but is unsure of the structure of the model to utilize, they will be interested in a slightly different error metric. Statistician B will run a variety of fitting algorithms to build numerous models based on the sample available and will need to compare the models. They will want to estimate the expected error rate (Err) to know the average error of the process by which the fitting algorithm is run on same-sized datasets drawn from the underlying distribution. Statistician B is less interested in the performance of the single model which was built using the sample at hand (which is Statistician A’s target), instead, their interest is in the performance of the process used to arrive at the model. As Bates et al. [1] describe it, Err is of “interest to a researcher comparing different fitting algorithms.”
The estimation of the theoretical quantities, $\text{Err}_{XY}$ and Err, deepens the understanding of the difference between the two. Both error quantities are expected values and unknown in any given Random Forest model. However, they can be approximated using simulations where the underlying population is known and available for repeated sampling. As mentioned above, the true error rate, $\text{Err}_{XY}$, is the test error of the model that was fit on our actual training set. Hence, the estimation of this quantity is the error produced by a single model on a new theoretically infinitely large test set. As illustrated by the left-hand column in Figure 1, the sample is used to create the model, and then the model is used to predict a large test set from the underlying population. The entire left column is repeated many times to estimate the true error rate of the given study set-up. The misclassification rate on the test set will be the true error rate.
Subsequently, Err is the average error of the fitting algorithm run on the same-sized datasets drawn from the underlying distribution $P$, and is called the expected error rate. When calculating the expected error rate, the average of the true error rate is taken, using a new model every time. As seen by the entire box in Figure 1, first the entire model fitting process is repeated to obtain each $\text{Err}_{XY}$ from all possible samples. Then, the estimate of Err is obtained by averaging many repeated values of $\text{Err}_{XY}$.
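The averaging step can be written out as a Monte Carlo approximation. In this sketch, $\text{Err}_{XY}$ denotes the true error rate (the notation of Bates et al. [1]); the number of simulated training sets $B$, the superscript $(b)$, and the hat on Err are notation introduced here for illustration and are not taken from the original text.

```latex
\widehat{\text{Err}}
  \;=\; \frac{1}{B} \sum_{b=1}^{B} \text{Err}_{XY}^{(b)}
  \;\approx\; \mathbb{E}\left[\text{Err}_{XY}\right]
  \;=\; \text{Err},
\qquad
\text{Err}_{XY}^{(b)} \text{ computed from the model fit to the } b\text{-th simulated training set.}
```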
In other words, the difference between the estimation of $\text{Err}_{XY}$ and Err is that the former uses one model, while the latter averages over many models. As seen from Equation (1), $\text{Err}_{XY}$ is conditional on the data, while in Equation (2), Err is unconditional. Note that Err averages over everything that is random in the model building process, including the randomness in the training set that produced the model and the bootstrapping inherent in a Random Forest model.
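The following is a minimal sketch of how the two targets could be approximated by simulation when the population is known. It is not the authors' code: to stay self-contained it uses a one-predictor midpoint-threshold classifier as a stand-in for a Random Forest, and the function names, sample sizes, and mean shift `mu` are all assumptions made for illustration.

```python
import random
import statistics


def draw_sample(n, mu, rng):
    """Draw n labeled points with balanced classes: class 0 ~ N(0, 1), class 1 ~ N(mu, 1)."""
    data = []
    for i in range(n):
        label = i % 2
        x = rng.gauss(mu if label == 1 else 0.0, 1.0)
        data.append((x, label))
    return data


def fit_midpoint_rule(train):
    """Stand-in learner (NOT a Random Forest): threshold halfway between the class means."""
    mean0 = statistics.fmean(x for x, y in train if y == 0)
    mean1 = statistics.fmean(x for x, y in train if y == 1)
    threshold = (mean0 + mean1) / 2.0
    upper_label = 1 if mean1 >= mean0 else 0
    return threshold, upper_label


def predict(model, x):
    threshold, upper_label = model
    return upper_label if x > threshold else 1 - upper_label


def approx_err_xy(model, mu, rng, n_test=5000):
    """Approximate the true error rate: error of ONE fitted model on a large fresh test set."""
    test = draw_sample(n_test, mu, rng)
    wrong = sum(predict(model, x) != y for x, y in test)
    return wrong / n_test


def simulate(n=50, mu=1.0, n_reps=200, seed=1):
    """Return the per-training-set true error rates and their average (the expected error rate)."""
    rng = random.Random(seed)
    err_xy_values = []
    for _ in range(n_reps):
        train = draw_sample(n, mu, rng)          # a new training set each repetition
        model = fit_midpoint_rule(train)         # a new model each repetition
        err_xy_values.append(approx_err_xy(model, mu, rng))
    err = statistics.fmean(err_xy_values)        # averaging over training sets
    return err_xy_values, err
```

Calling `simulate()` returns one true-error value per simulated training set together with their mean; the spread of the individual values around that mean is what distinguishes the conditional target from the unconditional one.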
While it may initially appear that the quantity $\text{Err}_{XY}$ is easier to estimate, since it concerns the model at hand, it has been observed that, in some settings, the cross-validation empirical estimate of error is only weakly correlated with $\text{Err}_{XY}$ (given a particular population at hand) [15]. The disassociation issue is mainly attributed to data re-usage.
Let represent a dataset of unlimited size enabling the best possible model to be chosen. Theoretically, can be decomposed into four parts (see Equations (3) - (6)), as seen in Faraway [4]. Let represent a black box with the ability to generate datasets of unlimited size and quantity without defining the true underlying model generating mechanism. The true model is given in contrast to the model found from the original dataset, . The parameter is either given by the infinite data, , or estimated using the original data, .
(3) best performance
(4) model selection cost
(5) parameter estimation cost
(6) data re-use cost
Term (3) represents the best performance of the prediction error; it is the expected loss on the correct model using all the possible data. Term (4) represents the difference between the loss for the true model on infinite data and the loss for the selected model on infinite data. Term (5) represents the difference in loss using the data model but estimating the parameter with a single (independent) dataset versus infinite data. Term (6) represents the difference in loss between the data model and the parameter estimates using the original data versus an independent dataset.
The most interesting component is the final term (6), caused by data re-usage, which has a non-zero expectation when the same data points are used for both model selection and parameter estimation. If one uses a validation set approach (for model selection and parameter estimation, see SDT in Table 2), term (6) will have an expectation of zero because each observation is only used once. However, as estimated empirically, when using a full data approach, term (6) can be large and easily cancel out any advantages the full data has in model selection and parameter estimation [4]. Thus, the full data strategy will have lower model selection and parameter estimation costs than the validation set strategy due to the higher number of observations used to complete the model selection and parameter estimation processes, but the advantages can be swamped by the data re-use cost.
The difference between the full data and validation set strategies, seen in comparing terms (4) & (5), is bounded and well understood as an effect of sample size [4]. Despite suffering in model selection and parameter estimation costs, the validation set strategy will have a lower data re-use cost than the full data strategy, and we know the data re-use cost term (6) could be very large. Therefore, we would like to investigate the trade-offs in the data re-use cost versus the model selection and parameter estimation costs, in the Random Forest context. Our study comes in the form of analyzing various model fitting approaches and the resulting estimates of error. Specifically the use of OOB errors compared to validation set and cross-validation strategies is investigated.
3 Methods
Simulated data are used to study the behavior of modeling strategies in the setting in which all predictor variables are uncorrelated. The results provide insight into the mechanisms which lead to different targets in error estimates. We follow the set-up of Janitza and Hornung [9] who use simulation to demonstrate that the OOB error overestimates the true error rate. We recognize that a larger simulation study would provide more nuance of the observed behavior. However, the simulations at hand address our main goals of the paper: (i) to document the switch in direction of error proximity when comparing Random Forests to logistic regression, and (ii) to explore the performance of an expanded set of error estimation strategies.
3.1 Data Generation and Settings
The bias of error estimates in different data settings with numeric predictor variables is systematically investigated by means of simulation studies on data with a balanced two-class response variable. The settings considered are:
• Different number of predictors, .
• Different number of observations such that .
As done when modeling real data, several Random Forests with different $mtry$ values are constructed for each setting ($mtry$ is the parameter that determines the number of randomly chosen variables to be considered for each split on a tree). In the parameter estimation process, the possible values for $mtry$ range from 1 all the way up to $p$. Note that for $mtry = 1$ there is no selection of an optimal predictor variable for a split, while for $mtry = p$ the Random Forest method coincides with the bagging procedure, which selects the best predictor variable from the entire set of predictors for binary splitting at each internal node.
Generally, the number of trees chosen is a trade-off between accuracy and computational speed. More trees are necessary when using a large number of predictor variables. The OOB error stabilizes at around 250 trees in convergence studies [5], which concluded that 1000 trees might be sufficiently large for a genome-wide dataset of more than 300,000 predictor variables. Also in high-dimensional settings, Random Forests with 500 trees and 1000 trees yield very similar OOB errors [7]. In accordance with these findings in the literature, the number of trees is set to 500 in all of our Random Forest simulations. Each dataset described in Table 1 is repeated 1000 times to obtain stable results (all reported empirical errors are the average of 1000 replicates).
Only numeric predictor variables are considered in the studies. Both predictors associated with the response and predictors not associated with the response are considered, with all predictors distributed independently of one another. The predictors not associated with the response follow a standard normal distribution. The distribution of predictors with association is different for each response class. The predictor values for observations from class 1 are always drawn from a standard normal distribution. The predictor values for observations from class 2 are drawn from a normal distribution with variance 1 and a mean different from zero. Table 1 gives an overview of the distribution of predictors in the response classes. Let us consider the setting with as an example. The first two predictors and are associated with the response, while the other predictors are noise. Hence, follow a standard normal distribution, while the distributions of and depend on the class to which the observations belong. If the observation comes from class 1, the distribution of and is , and and are distributed for class 2. Randomly drawing the mean separately for and and for each repetition of the study ensures that predictors with different effect strengths are considered.
Number of Predictors | Predictors | class 1: | class 2: |
---|---|---|---|
It is important to note that all predictors are uncorrelated. Although assuming no correlations between any of the predictors is not necessarily realistic, such settings are important to investigate in order to understand the mechanisms which lead to different targets in error estimation.
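As an illustration of the data-generating scheme described above, the sketch below draws a balanced two-class dataset with independent predictors: noise predictors are standard normal in both classes, and associated predictors are standard normal in class 1 and shifted (variance 1) in class 2. The function name, the argument names, and in particular the uniform range used to draw the class-2 means are assumptions; the distribution actually used for the means is not recoverable from the text here.

```python
import random


def simulate_dataset(n, p, n_signal, rng, mean_low=0.5, mean_high=1.0):
    """Return (X, y): n rows, p independent numeric predictors, balanced classes 1 and 2.

    The first n_signal predictors are associated with the response; the rest are noise.
    mean_low / mean_high bound the class-2 mean shift and are ASSUMED values.
    """
    # One mean per associated predictor, redrawn on every call (i.e., every repetition).
    shifts = [rng.uniform(mean_low, mean_high) for _ in range(n_signal)]
    X, y = [], []
    for i in range(n):
        label = 1 if i < n // 2 else 2
        row = []
        for j in range(p):
            # Only associated predictors in class 2 get a non-zero mean.
            mean = shifts[j] if (label == 2 and j < n_signal) else 0.0
            row.append(rng.gauss(mean, 1.0))
        X.append(row)
        y.append(label)
    return X, y
```

Passing a seeded `random.Random` instance keeps each repetition reproducible while still redrawing the effect strengths on every call.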
3.2 Strategies for Error Estimation
The modeling process consists of parameter estimation followed by error estimation. An important point of consideration when completing the two estimation steps is the choice of which subset of observations will be used in each operation. Often, data for parameter estimation and data for error estimation are collected at the same time, thus resulting in a single sample that needs to be apportioned to both parameter and error estimation. Finding the optimal model complexity requires an external test dataset [8]. In an ideal world, to avoid “data snooping”, one needs one dataset for model building, one for parameter estimation, and then after a model is accepted, another dataset for error estimation. However, rarely are three independent datasets available, so one may need to do the best one can with the data available. Hence, when modeling it is important to outline the strategy that will be used to construct the model and then estimate its error.
We chose a variety of strategies to separately target the parameter estimation and error estimation steps in the modeling process, and thus each strategy consists of three parts. In the descriptions that follow, in-fold represents the observations used to fit the model using cross-validation; out-of-fold represents the observations that are held out of the model fitting using cross-validation. In-bag are the observations that are used to fit the model using Random Forests; out-of-bag (OOB) represents the observations that are held out of the model fitting when using Random Forests. A more complete description of cross-validation, bagging, and Random Forests can be found in James et al. [10].
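To make the in-bag / out-of-bag distinction concrete, here is a small sketch of one bootstrap draw and of an OOB misclassification rate computed from per-observation votes. The function names and the vote data structure are hypothetical and chosen for illustration; a real Random Forest implementation tracks this bookkeeping internally.

```python
import random


def bootstrap_split(n, rng):
    """Draw one bootstrap sample of size n.

    Returns (in_bag, oob): in_bag is a list of n indices drawn with replacement,
    oob lists the indices that were never drawn.
    """
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob


def oob_error(y_true, tree_votes):
    """OOB misclassification rate by majority vote.

    tree_votes[i] holds the predicted labels for observation i from only those
    trees whose bootstrap sample did not contain i. Observations with no votes
    are skipped; ties are broken arbitrarily.
    """
    wrong = counted = 0
    for truth, votes in zip(y_true, tree_votes):
        if not votes:
            continue
        majority = max(set(votes), key=votes.count)
        wrong += majority != truth
        counted += 1
    if counted == 0:
        raise ValueError("no observation has any out-of-bag votes")
    return wrong / counted
```

For large `n`, roughly a third of the observations (about 36.8%) fall out-of-bag in any single bootstrap draw, which is why every observation typically accumulates votes once a few hundred trees are grown.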
For each of the considered strategies, there are three aspects to consider (see Figure 2 and Table 2). The original data is divided differently into three datasets: Data1, Data2, and Data3. See Table 2 for details on how each strategy divides up the original dataset. Sometimes, e.g., SDT, the three datasets are completely independent. Sometimes, e.g., FDO, the same observations show up throughout each of the three datasets.
1. Model Fitting: Data1 is used to fit a series of models to consider. For example, with Random Forests, there might be 5 models fit, each with a different value of $mtry$.
2. Parameter Estimation: Data2 is used to select the value of the parameter. In the above case, the value of $mtry$ is selected by assessing Data2 on each of the 5 fitted models.
3. Error Estimation: Once the model and parameter have been selected, Data3 is used to estimate the prediction error of the selected model.
Note that in some of the error estimation strategies, there is no parameter estimation (see Table 2). In that case, there is only a need for two distinct datasets, not three (i.e., Data2 is not used in the strategy). The following error estimation strategies are considered:
• Logistic Regression CV Error (LGCV): The logistic model is built on the in-fold dataset and the error of the model is estimated via the out-of-fold data using 4-fold cross-validation. Logistic regression models are run only for
• Full Data Set CV Error (FDCV): Parameters are set prior to model building with . The Random Forest is built on the in-fold dataset and the error of the model is estimated via out-of-fold data using 4-fold cross-validation.
• Full Data Set OOB Error (FDO): Parameter and error estimation are done on the same dataset. $mtry$ is chosen by using the OOB error rate. Hence, the Random Forest (and corresponding $mtry$) with the lowest OOB error rate is chosen and the OOB error is returned as the error estimate.
• Split Data Set OOB Error (SDO): The sample is divided into training and testing sets. Parameter estimation is done on the (in-bag) training set, using the OOB error rate to select $mtry$. The error of the Random Forest, built on the entire training set, is then estimated by predicting on the testing set and reporting the test error. The error estimation uses only test observations that are not part of the set of observations that are considered for constructing the Random Forest.
• Split Data Set CV Error (SDCV): The sample is divided into training and testing sets. Parameter estimation is done on the training set, using 4-fold cross-validated error estimates to select $mtry$. The error of the Random Forest, built on the entire training set (using the cross-validated $mtry$), is then estimated by predicting on the testing set and reporting the test error. The error estimation uses only test observations that are not part of the set of observations that are considered for constructing the Random Forest.
• Split Data Set Test Error (SDT): The sample is divided into three independent training, validation, and testing sets. Parameter estimation of $mtry$ is done by finding the minimum validation error across training models with differing values of $mtry$. The error of the Random Forest, built solely on the training set (using the $mtry$ selected on the validation set), is estimated by predicting on the testing set and reporting the test error. The error estimation uses only test observations that are not part of the set of observations that are considered for constructing the Random Forest.
Strategy | Model Fitting (Data1) | Parameter Estimation (Data2) | Error Estimation (Data3) |
---|---|---|---|
LGCV | In-Fold observations from full dataset | N/A | Out-of-Fold observations from full dataset |
FDCV | In-Fold observations from full dataset | N/A | Out-of-Fold observations from full dataset |
FDO | In-Bag observations from full dataset | OOB observations from full dataset | OOB observations from full dataset |
SDO | In-Bag observations from training dataset | OOB observations from training dataset | Test dataset |
SDCV | In-Fold observations from training dataset | Out-of-Fold observations from training dataset | Test dataset |
SDT | Training dataset | Validation dataset | Test dataset |
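The partitions that the strategies rely on can be sketched with two index helpers: a three-way split in the style of SDT and a 4-fold in-fold / out-of-fold generator in the style of the cross-validated strategies. The function names and the 50/25/25 split fractions are assumptions for illustration; the split proportions used in the simulations are not stated in this section.

```python
import random


def three_way_split(n, rng, fractions=(0.5, 0.25, 0.25)):
    """Shuffle indices 0..n-1 and cut them into disjoint training / validation / test sets.

    The fractions are ASSUMED for illustration; the remainder after the first
    two cuts goes to the test set.
    """
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def k_fold_indices(n, rng, k=4):
    """Yield (in_fold, out_of_fold) index lists for k-fold cross-validation."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all indices
    for held_out in range(k):
        in_fold = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield in_fold, folds[held_out]
```

In the three-way split each observation is used exactly once, whereas in the fold-based scheme every observation is out-of-fold once and in-fold `k - 1` times, which is the data re-use that the full-data strategies accept in exchange for larger fitting sets.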
4 Results
4.1 Distance to Target Errors
Ideally, the estimated error rate of a statistical model should be both close to the truth and low. But that raises the question: close to which truth, $\text{Err}_{XY}$ or Err? In an effort to compare the results of our simulations to those of Bates et al. [1], Figure 3 shows the distance of the LGCV error estimate from Err compared to its distance from $\text{Err}_{XY}$. Similar to Figure 8 in Bates et al., we can see that in the logistic regression model, the mean absolute deviation of the estimate from Err is smaller than its mean absolute deviation from $\text{Err}_{XY}$.
The difference lessens as $n$ grows. Regardless, repeated simulations consistently confirm that the LGCV estimate is, on average, closer to Err than to $\text{Err}_{XY}$.
In Figure 4 we see that the error proximity relationship has flipped between logistic regression models and Random Forests. The LGCV estimate is closer to Err, but the FDCV estimate is closer to $\text{Err}_{XY}$. As a reminder, the difference between the LGCV and FDCV estimates is that the former is an error estimate for a logistic model while the latter is an error estimate for a Random Forest model. Both are cross-validated estimates on the in-fold dataset with no parameter tuning. Thus, the relationship highlighted by Bates et al. [1] seems to be specific to generalized linear models as they investigated only linear and logistic regression models.
The flip in relationship may be attributed to the difference in the way each model utilizes the data. In logistic regression, the coefficients are estimated via maximum likelihood estimation, thus possibly leading to over-fitting and biased estimates of error due to the model optimizing for the specific dataset. On the other hand, bagging and other resampling techniques can be used to reduce the variance in model predictions. In Random Forests, the bias of the full forest is equivalent to the bias of a single decision tree (which itself has low bias and high variance) [8]. However, by creating many trees and then averaging them, the variance of the final forest can be greatly reduced over that of a single tree. In practice, the only limitation on the size of the forest is computing time as an infinite number of trees could be trained without ever increasing bias and with a continual (if asymptotically declining) decrease in the variance. As a result, the logistic regression model may be less informative on the “next” sample than a Random Forest. Hence, the error estimate is closer to $\text{Err}_{XY}$ than to Err for Random Forests because its resampling methods build the model on data more akin to wild data.
We further explore the difference between for logistic regression and Random Forests in an experiment with observations and features (repeated 20 times); see Figure 5. In the right plot, there seems to exist a pairing between and (linking the two estimates from the same dataset), where high estimates of are paired with high estimates of and low estimates of are paired with low estimates of (i.e., very few of the linking lines cross). In a logistic regression model, there does not seem to exist this pairing as seen in the left plot (i.e., most of the linking lines cross). In other words, is seen to be neither closer nor farther to Err than to , but tends to be closer to than to Err.
Returning to the remaining strategies that all use Random Forests, Figures 6 - 9 show that across the error estimation strategies, the error estimate is closer to $\text{Err}_{XY}$ than to Err on average. Despite this relationship, the differences in mean absolute deviations, from both $\text{Err}_{XY}$ and Err, tend to be quite small.
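The comparison of mean absolute deviations can be sketched as follows, assuming each simulation replicate yields one empirical error estimate and one approximated true error rate, and approximating the expected error rate by the average of the latter. The function and argument names are hypothetical.

```python
from statistics import fmean


def mean_abs_deviations(estimates, true_errors):
    """Return (MAD to the true error rate, MAD to the expected error rate).

    estimates[b]   : empirical error estimate from replicate b
    true_errors[b] : approximated true error rate of the model from replicate b
    The expected error rate is approximated by the average of true_errors.
    """
    if len(estimates) != len(true_errors) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    expected_error = fmean(true_errors)
    # Each estimate is compared with the true error of the model from the SAME replicate.
    mad_to_true = fmean(abs(e - t) for e, t in zip(estimates, true_errors))
    # Every estimate is compared with the single averaged target.
    mad_to_expected = fmean(abs(e - expected_error) for e in estimates)
    return mad_to_true, mad_to_expected
```

An estimator is "closer to the true error rate" in the sense used above when the first returned value is smaller than the second.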
4.2 Distance Across Error Estimation Strategies
Section 4.1 details our investigation of the distance of the empirical estimates of error both to $\text{Err}_{XY}$ and to Err. We show that in Random Forests, the error estimates are closer to $\text{Err}_{XY}$ than to Err, which is the opposite result of Bates et al. [1] on generalized linear models. Now we assess a follow-up question: how close is the error estimate to $\text{Err}_{XY}$? Figure 10 compares the strategies according to the expected value of that distance.
In the case of features, the strategies that utilize the in-fold or in-bag dataset to train the model (LGCV, FDCV, and FDO) outperform the split data approaches (SDCV, SDO, SDT). The strategies without independent validation sets do not seem to over-fit compared to the strategies with train/test splits, whose error estimates suffer a drop in performance, likely due to smaller sample sizes for model and parameter fit. It is important to mention that FDO is the only strategy of the three (LGCV, FDCV, and FDO) that tunes parameters (here, $mtry$).
In the case of features, FDCV and FDO are the best candidates. In contrast to features, FDCV and FDO perform better than LGCV when (LGCV is not fit when As with , the data-splitting methods (SDCV, SDO, and SDT) have worse performance. Once again, it is important to note that, out of the strategies that build the model on the entire dataset, FDO tunes the model’s parameters, compared to FDCV which does not.
5 Discussion
The Random Forest error rate investigation in our paper has two main components. First, we discuss the difference in error targets presented by Bates et al. [1]. In their work, they find that in the special case of the generalized linear model using unregularized OLS for model-fitting, common estimates of prediction error (cross-validation, bootstrap, data splitting, and covariance penalties) should be viewed as estimates of the expected error rate, averaged across other hypothetical datasets from the same distribution. Our primary result is that, in the classification case, Random Forests’ empirical estimates of prediction error can be taken as an estimate of the true error rate ($\text{Err}_{XY}$) instead of as an estimate of the expected error rate (Err), which is the opposite of the result of Bates et al. [1] whose work includes logistic regression. In simulations the result held across error estimation strategies such as cross-validation, bagging, and data splitting (see Figures 4 - 9). The result is present for a variety of choices of sample size, $n$, and feature number, $p$. Nonetheless, we wish to be clear that the estimates of prediction error were a good approximation of both the true error rate ($\text{Err}_{XY}$) and expected error rate (Err) in the data splitting cases.
A fundamental open question is to understand the size of the gap between estimates of prediction error and the true and expected error rates. The present work focuses on determining to which target the estimate is closer. Moreover, it is necessary to understand under what conditions the gap is large, making it necessary to modify the method of error estimation depending on the target. Roughly speaking, we expect the gap between any of the error rates to be small when $n$ is large. In our simulations, the difference between estimates of prediction error and the true and expected error rates is always smaller than 0.01; as $n$ increases the difference decreases. Other future directions include the investigation of the relationship among the error estimate, $\text{Err}_{XY}$, and Err in correlated and/or imbalanced data.
Second, we discussed the performance of a variety of error estimation strategies. The error estimates from models built on the entire sample (e.g., using CV or OOB strategies) are closer to the true error rate than those from models built on a training set with error estimates obtained from a testing set. Therefore, the data strategies that do not use a holdout set seem to be a more appealing choice for model building, regardless of whether parameter tuning is to be performed. Empirically, the strategies that use resampling techniques as opposed to a holdout set are favorable.
Whereas Bates et al. [1] show the error estimate to be closer to Err than to $\text{Err}_{XY}$ in generalized linear models, we show it to be closer to $\text{Err}_{XY}$ than to Err in Random Forests. Additionally, resampling techniques seem to outperform data splitting models in Random Forests.
References
[1] Bates, S., Hastie, T., Tibshirani, R.: Cross-validation: what does it estimate and how well does it do it? Journal of the American Statistical Association (2023) https://doi.org/10.1080/01621459.2023.2197686
[2] Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001) https://doi.org/10.1023/A:1010950718922
[3] Bylander, T.: Estimating generalization error on two-class datasets using out-of-bag estimates. Machine Learning 48(1-3), 287–297 (2002)
[4] Faraway, J.J.: Does data splitting improve prediction? Statistics and Computing 26(1–2), 49–60 (2014) https://doi.org/10.1007/s11222-014-9522-9
[5] Goldstein, B., Hubbard, A., Cutler, A., Barcellos, L.: An application of random forests to a genome-wide association dataset: Methodological considerations & new findings. BMC Genetics 11, 49 (2010)
[6] Goldstein, B.A., Polley, E.C., Briggs, F.B.S.: Random forests for genetic association studies. Statistical Applications in Genetics and Molecular Biology 10(1) (2011) https://doi.org/10.2202/1544-6115.1691
[7] Genuer, R., Poggi, J.-M., Tuleau, C.: Random Forests: some methodological insights. arXiv (2008). https://doi.org/10.48550/ARXIV.0811.3619 . https://arxiv.longhoe.net/abs/0811.3619
[8] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA (2001)
[9] Janitza, S., Hornung, R.: On the overestimation of random forest’s out-of-bag error. PLOS ONE 13(8), 1–31 (2018) https://doi.org/10.1371/journal.pone.0201904
[10] James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning: with Applications in R. Springer (2013). https://faculty.marshall.usc.edu/gareth-james/ISL/
[11] Kaggle: The State of Data Science & Machine Learning (2017). https://ailab-ua.github.io/courses/resources/the_state_of_data_science_machine_learning_-_kaggle_2017_survey.pdf
[12] Mentch, L., Hooker, G.: Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research 17(26), 1–41 (2016)
[13] Mitchell, M.: Bias of the random forest out-of-bag (oob) error for certain input parameters. Open Journal of Statistics 01, 205–211 (2011)
[14] Rajanala, S., Bates, S., Hastie, T., Tibshirani, R.: Confidence Intervals for the Generalisation Error of Random Forests (2022)
[15] Yousef, W.A.: A Leisurely Look at Versions and Variants of the Cross Validation Estimator. arXiv (2019). https://doi.org/10.48550/ARXIV.1907.13413 . https://arxiv.longhoe.net/abs/1907.13413
[16] Zhang, G.-Y., Zhang, C.-X., Zhang, J.-S.: Out-of-bag estimation of the optimal hyperparameter in subbag ensemble method. Communications in Statistics - Simulation and Computation 39(10), 1877–1892 (2010) https://doi.org/10.1080/03610918.2010.521277