-
Harnessing pattern-by-pattern linear classifiers for prediction with missing data
Authors:
Angel D Reyero Lobo,
Alexis Ayme,
Claire Boyer,
Erwan Scornet
Abstract:
Missing values have been thoroughly analyzed in the context of linear models, where the final aim is to build coefficient estimates. However, estimating coefficients does not directly solve the problem of prediction with missing entries: a manner to address empty components must be designed. Major approaches to deal with prediction with missing values are empirically driven and can be decomposed i…
▽ More
Missing values have been thoroughly analyzed in the context of linear models, where the final aim is to build coefficient estimates. However, estimating coefficients does not directly solve the problem of prediction with missing entries: a manner to address empty components must be designed. Major approaches to deal with prediction with missing values are empirically driven and can be decomposed into two families: imputation (filling in empty fields) and pattern-by-pattern prediction, where a predictor is built on each missing pattern. Unfortunately, most simple imputation techniques used in practice (as constant imputation) are not consistent when combined with linear models. In this paper, we focus on the more flexible pattern-by-pattern approaches and study their predictive performances on Missing Completely At Random (MCAR) data. We first show that a pattern-by-pattern logistic regression model is intrinsically ill-defined, implying that even classical logistic regression is impossible to apply to missing data. We then analyze the perceptron model and show how the linear separability property extends to partially-observed inputs. Finally, we use the Linear Discriminant Analysis to prove that pattern-by-pattern LDA is consistent in a high-dimensional regime. We refine our analysis to more complex MNAR data.
△ Less
Submitted 15 May, 2024;
originally announced May 2024.
-
What Is a Good Imputation Under MAR Missingness?
Authors:
Jeffrey Näf,
Erwan Scornet,
Julie Josse
Abstract:
Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis. Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first develop an id…
▽ More
Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis. Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first develop an identification result, showing that the widely used Multiple Imputation by Chained Equations (MICE) approach indeed identifies the right conditional distributions. Building on this analysis, we propose three essential properties a successful imputation method should meet, thus enabling a more principled evaluation of existing methods and more targeted development of new methods. In particular, we introduce a new imputation method, denoted mice-DRF, that meets two out of the three criteria. We then discuss and refine ways to rank imputation methods, develo** a powerful, easy-to-use scoring algorithm to rank missing value imputations.
△ Less
Submitted 7 June, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
-
Random features models: a way to study the success of naive imputation
Authors:
Alexis Ayme,
Claire Boyer,
Aymeric Dieuleveut,
Erwan Scornet
Abstract:
Constant (naive) imputation is still widely used in practice as this is a first easy-to-use technique to deal with missing data. Yet, this simple method could be expected to induce a large bias for prediction purposes, as the imputed input may strongly differ from the true underlying data. However, recent works suggest that this bias is low in the context of high-dimensional linear predictors when…
▽ More
Constant (naive) imputation is still widely used in practice as this is a first easy-to-use technique to deal with missing data. Yet, this simple method could be expected to induce a large bias for prediction purposes, as the imputed input may strongly differ from the true underlying data. However, recent works suggest that this bias is low in the context of high-dimensional linear predictors when data is supposed to be missing completely at random (MCAR). This paper completes the picture for linear predictors by confirming the intuition that the bias is negligible and that surprisingly naive imputation also remains relevant in very low dimension.To this aim, we consider a unique underlying random features model, which offers a rigorous framework for studying predictive performances, whilst the dimension of the observed features varies.Building on these theoretical results, we establish finite-sample bounds on stochastic gradient (SGD) predictors applied to zero-imputed data, a strategy particularly well suited for large-scale learning.If the MCAR assumption appears to be strong, we show that similar favorable behaviors occur for more complex missing data scenarios.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants
Authors:
Abdoulaye Sakho,
Emmanuel Malherbe,
Erwan Scornet
Abstract:
Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we prove that SMOTE (with default parameter) simply copies the original minority samples asymptotically. We also prove that SMOTE exhibits boundary artifacts, thus justifying existing SMOTE variants. Then we int…
▽ More
Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we prove that SMOTE (with default parameter) simply copies the original minority samples asymptotically. We also prove that SMOTE exhibits boundary artifacts, thus justifying existing SMOTE variants. Then we introduce two new SMOTE-related strategies, and compare them with state-of-the-art rebalancing procedures. Surprisingly, for most data sets, we observe that applying no rebalancing strategy is competitive in terms of predictive performances, with tuned random forests. For highly imbalanced data sets, our new method, named Multivariate Gaussian SMOTE, is competitive. Besides, our analysis sheds some lights on the behavior of common rebalancing strategies, when used in conjunction with random forests.
△ Less
Submitted 3 June, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Risk ratio, odds ratio, risk difference... Which causal measure is easier to generalize?
Authors:
Bénédicte Colnet,
Julie Josse,
Gaël Varoquaux,
Erwan Scornet
Abstract:
There are many measures to report so-called treatment or causal effects: absolute difference, ratio, odds ratio, number needed to treat, and so on. The choice of a measure, eg absolute versus relative, is often debated because it leads to different impressions of the benefit or risk of a treatment. Besides, different causal measures may lead to various treatment effect heterogeneity: some input va…
▽ More
There are many measures to report so-called treatment or causal effects: absolute difference, ratio, odds ratio, number needed to treat, and so on. The choice of a measure, eg absolute versus relative, is often debated because it leads to different impressions of the benefit or risk of a treatment. Besides, different causal measures may lead to various treatment effect heterogeneity: some input variables may have an influence on some causal measures and no effect at all on others. In addition some measures - but not all - have appealing properties such as collapsibility, matching the intuition of a population summary. In this paper, we first review common causal measures and their pros and cons typically brought forward. Doing so, we clarify the notions of collapsibility and treatment effect heterogeneity, unifying existing definitions. Then, we show that for any causal measures there exists a generative model such that the conditional average treatment effect (CATE) captures the treatment effect. However, only the risk difference can disentangle the treatment effect from the baseline at both population and strata levels, regardless of the outcome type (continuous or binary). As our primary goal is the generalization of causal measures, we show that different sets of covariates are needed to generalize an effect to a target population depending on (i) the causal measure of interest, and (ii) the identification method chosen, that is generalizing either conditional outcome or local effects.
△ Less
Submitted 30 March, 2024; v1 submitted 28 March, 2023;
originally announced March 2023.
-
Naive imputation implicitly regularizes high-dimensional linear models
Authors:
Alexis Ayme,
Claire Boyer,
Aymeric Dieuleveut,
Erwan Scornet
Abstract:
Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithms, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) use, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits goo…
▽ More
Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithms, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) use, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging on this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the d $\sqrt$ n regime. Experiments illustrate our findings.
△ Less
Submitted 31 January, 2023;
originally announced January 2023.
-
Sparse tree-based initialization for neural networks
Authors:
Patrick Lutz,
Ludovic Arnould,
Claire Boyer,
Erwan Scornet
Abstract:
Dedicated neural network (NN) architectures have been designed to handle specific data types (such as CNN for images or RNN for text), which ranks them among state-of-the-art methods for dealing with these data. Unfortunately, no architecture has been found for dealing with tabular data yet, for which tree ensemble methods (tree boosting, random forests) usually show the best predictive performanc…
▽ More
Dedicated neural network (NN) architectures have been designed to handle specific data types (such as CNN for images or RNN for text), which ranks them among state-of-the-art methods for dealing with these data. Unfortunately, no architecture has been found for dealing with tabular data yet, for which tree ensemble methods (tree boosting, random forests) usually show the best predictive performances. In this work, we propose a new sparse initialization technique for (potentially deep) multilayer perceptrons (MLP): we first train a tree-based procedure to detect feature interactions and use the resulting information to initialize the network, which is subsequently trained via standard stochastic gradient strategies. Numerical experiments on several tabular data sets show that this new, simple and easy-to-use method is a solid concurrent, both in terms of generalization capacity and computation time, to default MLP initialization and even to existing complex deep learning solutions. In fact, this wise MLP initialization raises the resulting NN methods to the level of a valid competitor to gradient boosting when dealing with tabular data. Besides, such initializations are able to preserve the sparsity of weights introduced in the first layers of the network through training. This fact suggests that this new initializer operates an implicit regularization during the NN training, and emphasizes that the first layers act as a sparse feature extractor (as for convolutional layers in CNN).
△ Less
Submitted 30 September, 2022;
originally announced September 2022.
-
Reweighting the RCT for generalization: finite sample error and variable selection
Authors:
Bénédicte Colnet,
Julie Josse,
Gaël Varoquaux,
Erwan Scornet
Abstract:
Randomized Controlled Trials (RCTs) may suffer from limited scope. In particular, samples may be unrepresentative: some RCTs over- or under- sample individuals with certain characteristics compared to the target population, for which one wants conclusions on treatment effectiveness. Re-weighting trial individuals to match the target population can improve the treatment effect estimation. In this w…
▽ More
Randomized Controlled Trials (RCTs) may suffer from limited scope. In particular, samples may be unrepresentative: some RCTs over- or under- sample individuals with certain characteristics compared to the target population, for which one wants conclusions on treatment effectiveness. Re-weighting trial individuals to match the target population can improve the treatment effect estimation. In this work, we establish the exact expressions of the bias and variance of such reweighting procedures -- also called Inverse Propensity of Sampling Weighting (IPSW) -- in presence of categorical covariates for any sample size. Such results allow us to compare the theoretical performance of different versions of IPSW estimates. Besides, our results show how the performance (bias, variance, and quadratic risk) of IPSW estimates depends on the two sample sizes (RCT and target population). A by-product of our work is the proof of consistency of IPSW estimates. Results also reveal that IPSW performances are improved when the trial probability to be treated is estimated (rather than using its oracle counterpart). In addition, we study choice of variables: how including covariates that are not necessary for identifiability of the causal effect may impact the asymptotic variance. Including covariates that are shifted between the two samples but not treatment effect modifiers increases the variance while non-shifted but treatment effect modifiers do not. We illustrate all the takeaways in a didactic example, and on a semi-synthetic simulation inspired from critical care medicine.
△ Less
Submitted 13 March, 2024; v1 submitted 16 August, 2022;
originally announced August 2022.
-
Is interpolation benign for random forest regression?
Authors:
Ludovic Arnould,
Claire Boyer,
Erwan Scornet
Abstract:
Statistical wisdom suggests that very complex models, interpolating training data, will be poor at predicting unseen examples.Yet, this aphorism has been recently challenged by the identification of benign overfitting regimes, specially studied in the case of parametric models: generalization capabilities may be preserved despite model high complexity.While it is widely known that fully-grown deci…
▽ More
Statistical wisdom suggests that very complex models, interpolating training data, will be poor at predicting unseen examples.Yet, this aphorism has been recently challenged by the identification of benign overfitting regimes, specially studied in the case of parametric models: generalization capabilities may be preserved despite model high complexity.While it is widely known that fully-grown decision trees interpolate and, in turn, have bad predictive performances, the same behavior is yet to be analyzed for Random Forests (RF).In this paper, we study the trade-off between interpolation and consistency for several types of RF algorithms. Theoretically, we prove that interpolation regimes and consistency cannot be achieved simultaneously for several non-adaptive RF.Since adaptivity seems to be the cornerstone to bring together interpolation and consistency, we study interpolating Median RF which are proved to be consistent in the interpolating regime. This is the first result conciliating interpolation and consistency for RF, highlighting that the averaging effect introduced by feature randomization is a key mechanism, sufficient to ensure the consistency in the interpolation regime and beyond.Numerical experiments show that Breiman's RF are consistent while exactly interpolating, when no bootstrap step is involved.We theoretically control the size of the interpolation area, which converges fast enough to zero, giving a necessary condition for exact interpolation and consistency to occur in conjunction.
△ Less
Submitted 9 February, 2023; v1 submitted 8 February, 2022;
originally announced February 2022.
-
Minimax rate of consistency for linear models with missing values
Authors:
Alexis Ayme,
Claire Boyer,
Aymeric Dieuleveut,
Erwan Scornet
Abstract:
Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively-studied linear models, but in presence of missing values, which tu…
▽ More
Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively-studied linear models, but in presence of missing values, which turns out to be quite a challenging task. Indeed, the Bayes rule can be decomposed as a sum of predictors corresponding to each missing pattern. This eventually requires to solve a number of learning tasks, exponential in the number of input features, which makes predictions impossible for current real-world datasets. First, we propose a rigorous setting to analyze a least-square type estimator and establish a bound on the excess risk which increases exponentially in the dimension. Consequently, we leverage the missing data distribution to propose a new algorithm, andderive associated adaptive risk bounds that turn out to be minimax optimal. Numerical experiments highlight the benefits of our method compared to state-of-the-art algorithms used for predictions with missing values.
△ Less
Submitted 3 February, 2022;
originally announced February 2022.
-
What's a good imputation to predict with missing values?
Authors:
Marine Le Morvan,
Julie Josse,
Erwan Scornet,
Gaël Varoquaux
Abstract:
How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all mi…
▽ More
How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all missing-values mechanisms, in contrast with the classic statistical results that require missing-at-random settings to use imputation in probabilistic modeling. Moreover, it implies that perfect conditional imputation is not needed for good prediction asymptotically. In fact, we show that on perfectly imputed data the best regression function will generally be discontinuous, which makes it hard to learn. Crafting instead the imputation so as to leave the regression function unchanged simply shifts the problem to learning discontinuous imputations. Rather, we suggest that it is easier to learn imputation and regression jointly. We propose such a procedure, adapting NeuMiss, a neural network capturing the conditional links across observed and unobserved variables whatever the missing-value pattern. Experiments confirm that joint imputation and regression through NeuMiss is better than various two step procedures in our experiments with finite number of samples.
△ Less
Submitted 30 November, 2021; v1 submitted 1 June, 2021;
originally announced June 2021.
-
SHAFF: Fast and consistent SHApley eFfect estimates via random Forests
Authors:
Clément Bénard,
Gérard Biau,
Sébastien da Veiga,
Erwan Scornet
Abstract:
Interpretability of learning algorithms is crucial for applications involving critical decisions, and variable importance is one of the main interpretation tools. Shapley effects are now widely used to interpret both tree ensembles and neural networks, as they can efficiently handle dependence and interactions in the data, as opposed to most other variable importance measures. However, estimating…
▽ More
Interpretability of learning algorithms is crucial for applications involving critical decisions, and variable importance is one of the main interpretation tools. Shapley effects are now widely used to interpret both tree ensembles and neural networks, as they can efficiently handle dependence and interactions in the data, as opposed to most other variable importance measures. However, estimating Shapley effects is a challenging task, because of the computational complexity and the conditional expectation estimates. Accordingly, existing Shapley algorithms have flaws: a costly running time, or a bias when input variables are dependent. Therefore, we introduce SHAFF, SHApley eFfects via random Forests, a fast and accurate Shapley effect estimate, even when input variables are dependent. We show SHAFF efficiency through both a theoretical analysis of its consistency, and the practical performance improvements over competitors with extensive experiments. An implementation of SHAFF in C++ and R is available online.
△ Less
Submitted 2 February, 2022; v1 submitted 25 May, 2021;
originally announced May 2021.
-
Causal effect on a target population: a sensitivity analysis to handle missing covariates
Authors:
Bénédicte Colnet,
Julie Josse,
Erwan Scornet,
Gaël Varoquaux
Abstract:
Randomized Controlled Trials (RCTs) are often considered the gold standard for estimating causal effect, but they may lack external validity when the population eligible to the RCT is substantially different from the target population. Having at hand a sample of the target population of interest allows us to generalize the causal effect. Identifying the treatment effect in the target population re…
▽ More
Randomized Controlled Trials (RCTs) are often considered the gold standard for estimating causal effect, but they may lack external validity when the population eligible to the RCT is substantially different from the target population. Having at hand a sample of the target population of interest allows us to generalize the causal effect. Identifying the treatment effect in the target population requires covariates to capture all treatment effect modifiers that are shifted between the two sets. Standard estimators then use either weighting (IPSW), outcome modeling (G-formula), or combine the two in doubly robust approaches (AIPSW). However such covariates are often not available in both sets. In this paper, after proving L1-consistency of these three estimators, we compute the expected bias induced by a missing covariate, assuming a Gaussian distribution, a continuous outcome, and a semi-parametric model. Under this setting, we perform a sensitivity analysis for each missing covariate pattern and compute the sign of the expected bias. We also show that there is no gain in linearly imputing a partially-unobserved covariate. Finally we study the substitution of a missing covariate by a proxy. We illustrate all these results on simulations, as well as semi-synthetic benchmarks using data from the Tennessee Student/Teacher Achievement Ratio (STAR), and a real-world example from critical care medicine.
△ Less
Submitted 10 January, 2023; v1 submitted 13 May, 2021;
originally announced May 2021.
-
MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA
Authors:
Clément Bénard,
Sébastien da Veiga,
Erwan Scornet
Abstract:
Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of MDA varies across the main random forest software. In this article, our objective is to r…
▽ More
Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of MDA varies across the main random forest software. In this article, our objective is to rigorously analyze the behavior of the main MDA implementations. Consequently, we mathematically formalize the various implemented MDA algorithms, and then establish their limits when the sample size increases. This asymptotic analysis reveals that these MDA versions differ as importance measures, since they converge towards different quantities. More importantly, we break down these limits into three components: the first two terms are related to Sobol indices, which are well-defined measures of a covariate contribution to the response variance, widely used in the sensitivity analysis field, as opposed to the third term, whose value increases with dependence within covariates. Thus, we theoretically demonstrate that the MDA does not target the right quantity to detect influential covariates in a dependent setting, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-MDA, which fixes the flaws of the original MDA, and consistently estimates the accuracy decrease of the forest retrained without a given covariate, but with an efficient computational cost. The Sobol-MDA empirically outperforms its competitors on both simulated and real data for variable selection. An open source implementation in R and C++ is available online.
△ Less
Submitted 1 March, 2022; v1 submitted 26 February, 2021;
originally announced February 2021.
-
Analyzing the tree-layer structure of Deep Forests
Authors:
Ludovic Arnould,
Claire Boyer,
Erwan Scornet,
Sorbonne Lpsm
Abstract:
Random forests on the one hand, and neural networks on the other hand, have met great success in the machine learning community for their predictive performance. Combinations of both have been proposed in the literature, notably leading to the so-called deep forests (DF) (Zhou \& Feng,2019). In this paper, our aim is not to benchmark DF performances but to investigate instead their underlying mech…
▽ More
Random forests on the one hand, and neural networks on the other hand, have met great success in the machine learning community for their predictive performance. Combinations of both have been proposed in the literature, notably leading to the so-called deep forests (DF) (Zhou \& Feng,2019). In this paper, our aim is not to benchmark DF performances but to investigate instead their underlying mechanisms. Additionally, DF architecture can be generally simplified into more simple and computationally efficient shallow forest networks. Despite some instability, the latter may outperform standard predictive tree-based methods. We exhibit a theoretical framework in which a shallow tree network is shown to enhance the performance of classical decision trees. In such a setting, we provide tight theoretical lower and upper bounds on its excess risk. These theoretical results show the interest of tree-network architectures for well-structured data provided that the first layer, acting as a data encoder, is rich enough.
△ Less
Submitted 14 October, 2021; v1 submitted 29 October, 2020;
originally announced October 2020.
-
NeuMiss networks: differentiable programming for supervised learning with missing values
Authors:
Marine Le Morvan,
Julie Josse,
Thomas Moreau,
Erwan Scornet,
Gaël Varoquaux
Abstract:
The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing pattern…
▽ More
The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing patterns, which can be exponential in the number of dimensions. In this work, we derive the analytical form of the optimal predictor under a linearity assumption and various missing data mechanisms including Missing at Random (MAR) and self-masking (Missing Not At Random). Based on a Neumann-series approximation of the optimal predictor, we propose a new principled architecture, named NeuMiss networks. Their originality and strength come from the use of a new type of non-linearity: the multiplication by the missingness indicator. We provide an upper bound on the Bayes risk of NeuMiss networks, and show that they have good predictive accuracy with both a number of parameters and a computational complexity independent of the number of missing data patterns. As a result they scale well to problems with many features, and remain statistically efficient for medium-sized samples. Moreover, we show that, contrary to procedures using EM or imputation, they are robust to the missing data mechanism, including difficult MNAR settings such as self-masking.
△ Less
Submitted 4 November, 2020; v1 submitted 3 July, 2020;
originally announced July 2020.
-
Interpretable Random Forests via Rule Extraction
Authors:
Clément Bénard,
Gérard Biau,
Sébastien da Veiga,
Erwan Scornet
Abstract:
We introduce SIRUS (Stable and Interpretable RUle Set) for regression, a stable rule learning algorithm which takes the form of a short and simple list of rules. State-of-the-art learning algorithms are often referred to as "black boxes" because of the high number of operations involved in their prediction process. Despite their powerful predictivity, this lack of interpretability may be highly re…
▽ More
We introduce SIRUS (Stable and Interpretable RUle Set) for regression, a stable rule learning algorithm which takes the form of a short and simple list of rules. State-of-the-art learning algorithms are often referred to as "black boxes" because of the high number of operations involved in their prediction process. Despite their powerful predictivity, this lack of interpretability may be highly restrictive for applications with critical decisions at stake. On the other hand, algorithms with a simple structure-typically decision trees, rule algorithms, or sparse linear models-are well known for their instability. This undesirable feature makes the conclusions of the data analysis unreliable and turns out to be a strong operational limitation. This motivates the design of SIRUS, which combines a simple structure with a remarkable stable behavior when data is perturbed. The algorithm is based on random forests, the predictive accuracy of which is preserved. We demonstrate the efficiency of the method both empirically (through experiments) and theoretically (with the proof of its asymptotic stability). Our R/C++ software implementation sirus is available from CRAN.
△ Less
Submitted 10 February, 2021; v1 submitted 29 April, 2020;
originally announced April 2020.
-
Linear predictor on linearly-generated data with missing values: non consistency and solutions
Authors:
Marine Le Morvan,
Nicolas Prost,
Julie Josse,
Erwan Scornet,
Gaël Varoquaux
Abstract:
We consider building predictors when the data have missing values. We study the seemingly-simple case where the target to predict is a linear function of the fully-observed data and we show that, in the presence of missing values, the optimal predictor may not be linear. In the particular Gaussian case, it can be written as a linear function of multiway interactions between the observed data and t…
▽ More
We consider building predictors when the data have missing values. We study the seemingly-simple case where the target to predict is a linear function of the fully-observed data and we show that, in the presence of missing values, the optimal predictor may not be linear. In the particular Gaussian case, it can be written as a linear function of multiway interactions between the observed data and the various missing-value indicators. Due to its intrinsic complexity, we study a simple approximation and prove generalization bounds with finite samples, highlighting regimes for which each method performs best. We then show that multilayer perceptrons with ReLU activation functions can be consistent, and can explore good trade-offs between the true model and approximations. Our study highlights the interesting family of models that are beneficial to fit with missing values depending on the amount of data available.
△ Less
Submitted 12 May, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Trees, forests, and impurity-based variable importance
Authors:
Erwan Scornet
Abstract:
Tree ensemble methods such as random forests [Breiman, 2001] are very popular to handle high-dimensional tabular data sets, notably because of their good predictive accuracy. However, when machine learning is used for decision-making problems, settling for the best predictive procedures may not be reasonable since enlightened decisions require an in-depth comprehension of the algorithm prediction…
▽ More
Tree ensemble methods such as random forests [Breiman, 2001] are very popular to handle high-dimensional tabular data sets, notably because of their good predictive accuracy. However, when machine learning is used for decision-making problems, settling for the best predictive procedures may not be reasonable since enlightened decisions require an in-depth comprehension of the algorithm prediction process. Unfortunately, random forests are not intrinsically interpretable since their prediction results from averaging several hundreds of decision trees. A classic approach to gain knowledge on this so-called black-box algorithm is to compute variable importances, that are employed to assess the predictive impact of each input variable. Variable importances are then used to rank or select variables and thus play a great role in data analysis. Nevertheless, there is no justification to use random forest variable importances in such way: we do not even know what these quantities estimate. In this paper, we analyze one of the two well-known random forest variable importances, the Mean Decrease Impurity (MDI). We prove that if input variables are independent and in absence of interactions, MDI provides a variance decomposition of the output, where the contribution of each variable is clearly identified. We also study models exhibiting dependence between input variables or interaction, for which the variable importance is intrinsically ill-defined. Our analysis shows that there may exist some benefits to use a forest compared to a single tree.
△ Less
Submitted 24 December, 2021; v1 submitted 13 January, 2020;
originally announced January 2020.
-
SIRUS: Stable and Interpretable RUle Set for Classification
Authors:
Clément Bénard,
Gérard Biau,
Sébastien da Veiga,
Erwan Scornet
Abstract:
State-of-the-art learning algorithms, such as random forests or neural networks, are often qualified as "black-boxes" because of the high number and complexity of operations involved in their prediction mechanism. This lack of interpretability is a strong limitation for applications involving critical decisions, typically the analysis of production processes in the manufacturing industry. In such…
▽ More
State-of-the-art learning algorithms, such as random forests or neural networks, are often qualified as "black-boxes" because of the high number and complexity of operations involved in their prediction mechanism. This lack of interpretability is a strong limitation for applications involving critical decisions, typically the analysis of production processes in the manufacturing industry. In such critical contexts, models have to be interpretable, i.e., simple, stable, and predictive. To address this issue, we design SIRUS (Stable and Interpretable RUle Set), a new classification algorithm based on random forests, which takes the form of a short list of rules. While simple models are usually unstable with respect to data perturbation, SIRUS achieves a remarkable stability improvement over cutting-edge methods. Furthermore, SIRUS inherits a predictive accuracy close to random forests, combined with the simplicity of decision trees. These properties are assessed both from a theoretical and empirical point of view, through extensive numerical experiments based on our R/C++ software implementation sirus available from CRAN.
△ Less
Submitted 16 December, 2020; v1 submitted 19 August, 2019;
originally announced August 2019.
-
AMF: Aggregated Mondrian Forests for Online Learning
Authors:
Jaouad Mourtada,
Stéphane Gaïffas,
Erwan Scornet
Abstract:
Random Forests (RF) is one of the algorithms of choice in many supervised learning applications, be it classification or regression. The appeal of such tree-ensemble methods comes from a combination of several characteristics: a remarkable accuracy in a variety of tasks, a small number of parameters to tune, robustness with respect to features scaling, a reasonable computational cost for training…
▽ More
Random Forests (RF) is one of the algorithms of choice in many supervised learning applications, be it classification or regression. The appeal of such tree-ensemble methods comes from a combination of several characteristics: a remarkable accuracy in a variety of tasks, a small number of parameters to tune, robustness with respect to features scaling, a reasonable computational cost for training and prediction, and their suitability in high-dimensional settings. The most commonly used RF variants however are "offline" algorithms, which require the availability of the whole dataset at once. In this paper, we introduce AMF, an online random forest algorithm based on Mondrian Forests. Using a variant of the Context Tree Weighting algorithm, we show that it is possible to efficiently perform an exact aggregation over all prunings of the trees; in particular, this enables to obtain a truly online parameter-free algorithm which is competitive with the optimal pruning of the Mondrian tree, and thus adaptive to the unknown regularity of the regression function. Numerical experiments show that AMF is competitive with respect to several strong baselines on a large number of datasets for multi-class classification.
△ Less
Submitted 15 May, 2020; v1 submitted 25 June, 2019;
originally announced June 2019.
-
On the consistency of supervised learning with missing values
Authors:
Julie Josse,
Jacob M. Chen,
Nicolas Prost,
Erwan Scornet,
Gaël Varoquaux
Abstract:
In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two appr…
▽ More
In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely-used method of imputing with a constant, such as the mean prior to learning is consistent when missing values are not informative. This contrasts with inferential settings where mean imputation is pointed at for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data, through multiple imputation. Finally, to compare imputation with learning directly with a model that accounts for missing values, we analyze further decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing theoretically and empirically different missing values strategies in trees, we recommend using the "missing incorporated in attribute" method as it can handle both non-informative and informative missing values.
△ Less
Submitted 21 March, 2024; v1 submitted 19 February, 2019;
originally announced February 2019.
-
Minimax optimal rates for Mondrian trees and forests
Authors:
Jaouad Mourtada,
Stéphane Gaïffas,
Erwan Scornet
Abstract:
Introduced by Breiman, Random Forests are widely used classification and regression algorithms. While being initially designed as batch algorithms, several variants have been proposed to handle online learning. One particular instance of such forests is the \emph{Mondrian Forest}, whose trees are built using the so-called Mondrian process, therefore allowing to easily update their construction in…
▽ More
Introduced by Breiman, Random Forests are widely used classification and regression algorithms. While being initially designed as batch algorithms, several variants have been proposed to handle online learning. One particular instance of such forests is the \emph{Mondrian Forest}, whose trees are built using the so-called Mondrian process, therefore allowing to easily update their construction in a streaming fashion. In this paper, we provide a thorough theoretical study of Mondrian Forests in a batch learning setting, based on new results about Mondrian partitions. Our results include consistency and convergence rates for Mondrian Trees and Forests, that turn out to be minimax optimal on the set of $s$-Hölder function with $s \in (0,1]$ (for trees and forests) and $s \in (1,2]$ (for forests only), assuming a proper tuning of their complexity parameter in both cases. Furthermore, we prove that an adaptive procedure (to the unknown $s \in (0, 2]$) can be constructed by combining Mondrian Forests with a standard model aggregation algorithm. These results are the first demonstrating that some particular random forests achieve minimax rates \textit{in arbitrary dimension}. Owing to their remarkably simple distributional properties, which lead to minimax rates, Mondrian trees are a promising basis for more sophisticated yet theoretically sound random forests variants.
△ Less
Submitted 9 April, 2019; v1 submitted 15 March, 2018;
originally announced March 2018.
-
Universal consistency and minimax rates for online Mondrian Forests
Authors:
Jaouad Mourtada,
Stéphane Gaïffas,
Erwan Scornet
Abstract:
We establish the consistency of an algorithm of Mondrian Forests, a randomized classification algorithm that can be implemented online. First, we amend the original Mondrian Forest algorithm, that considers a fixed lifetime parameter. Indeed, the fact that this parameter is fixed hinders the statistical consistency of the original procedure. Our modified Mondrian Forest algorithm grows trees with…
▽ More
We establish the consistency of an algorithm of Mondrian Forests, a randomized classification algorithm that can be implemented online. First, we amend the original Mondrian Forest algorithm, that considers a fixed lifetime parameter. Indeed, the fact that this parameter is fixed hinders the statistical consistency of the original procedure. Our modified Mondrian Forest algorithm grows trees with increasing lifetime parameters $λ_n$, and uses an alternative updating rule, allowing to work also in an online fashion. Second, we provide a theoretical analysis establishing simple conditions for consistency. Our theoretical analysis also exhibits a surprising fact: our algorithm achieves the minimax rate (optimal rate) for the estimation of a Lipschitz regression function, which is a strong extension of previous results to an arbitrary dimension.
△ Less
Submitted 8 November, 2017;
originally announced November 2017.
-
Neural Random Forests
Authors:
Gérard Biau,
Erwan Scornet,
Johannes Welbl
Abstract:
Given an ensemble of randomized regression trees, it is possible to restructure them as a collection of multilayered neural networks with particular connection weights. Following this principle, we reformulate the random forest method of Breiman (2001) into a neural network setting, and in turn propose two new hybrid procedures that we call neural random forests. Both predictors exploit prior know…
▽ More
Given an ensemble of randomized regression trees, it is possible to restructure them as a collection of multilayered neural networks with particular connection weights. Following this principle, we reformulate the random forest method of Breiman (2001) into a neural network setting, and in turn propose two new hybrid procedures that we call neural random forests. Both predictors exploit prior knowledge of regression trees for their architecture, have less parameters to tune than standard networks, and less restrictions on the geometry of the decision boundaries than trees. Consistency results are proved, and substantial numerical evidence is provided on both synthetic and real data sets to assess the excellent performance of our methods in a large variety of prediction problems.
△ Less
Submitted 3 April, 2018; v1 submitted 25 April, 2016;
originally announced April 2016.
-
Impact of subsampling and pruning on random forests
Authors:
Roxane Duroux,
Erwan Scornet
Abstract:
Random forests are ensemble learning methods introduced by Breiman (2001) that operate by averaging several decision trees built on a randomly selected subspace of the data set. Despite their widespread use in practice, the respective roles of the different mechanisms at work in Breiman's forests are not yet fully understood, neither is the tuning of the corresponding parameters. In this paper, we…
▽ More
Random forests are ensemble learning methods introduced by Breiman (2001) that operate by averaging several decision trees built on a randomly selected subspace of the data set. Despite their widespread use in practice, the respective roles of the different mechanisms at work in Breiman's forests are not yet fully understood, neither is the tuning of the corresponding parameters. In this paper, we study the influence of two parameters , namely the subsampling rate and the tree depth, on Breiman's forests performance. More precisely, we show that fully developed sub-sampled forests and pruned (without subsampling) forests have similar performances, as long as respective parameters are well chosen. Moreover , experiments show that a proper tuning of subsampling or pruning lead in most cases to an improvement of Breiman's original forests errors.
△ Less
Submitted 14 March, 2016;
originally announced March 2016.
-
A Random Forest Guided Tour
Authors:
Gérard Biau,
Erwan Scornet
Abstract:
The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is ve…
▽ More
The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is versatile enough to be applied to large-scale problems, is easily adapted to various ad-hoc learning tasks, and returns measures of variable importance. The present article reviews the most recent theoretical and methodological developments for random forests. Emphasis is placed on the mathematical forces driving the algorithm, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures. This review is intended to provide non-experts easy access to the main ideas.
△ Less
Submitted 18 November, 2015;
originally announced November 2015.
-
Random forests and kernel methods
Authors:
Erwan Scornet
Abstract:
Random forests are ensemble methods which grow trees as base learners and combine their predictions by averaging. Random forests are known for their good practical performance, particularly in high dimensional set-tings. On the theoretical side, several studies highlight the potentially fruitful connection between random forests and kernel methods. In this paper, we work out in full details this c…
▽ More
Random forests are ensemble methods which grow trees as base learners and combine their predictions by averaging. Random forests are known for their good practical performance, particularly in high dimensional set-tings. On the theoretical side, several studies highlight the potentially fruitful connection between random forests and kernel methods. In this paper, we work out in full details this connection. In particular, we show that by slightly modifying their definition, random forests can be rewrit-ten as kernel methods (called KeRF for Kernel based on Random Forests) which are more interpretable and easier to analyze. Explicit expressions of KeRF estimates for some specific random forest models are given, together with upper bounds on their rate of consistency. We also show empirically that KeRF estimates compare favourably to random forest estimates.
△ Less
Submitted 17 September, 2015; v1 submitted 12 February, 2015;
originally announced February 2015.
-
On the asymptotics of random forests
Authors:
Erwan Scornet
Abstract:
The last decade has witnessed a growing interest in random forest models which are recognized to exhibit good practical performance, especially in high-dimensional settings. On the theoretical side, however, their predictive power remains largely unexplained, thereby creating a gap between theory and practice. The aim of this paper is twofold. Firstly, we provide theoretical guarantees to link fin…
▽ More
The last decade has witnessed a growing interest in random forest models which are recognized to exhibit good practical performance, especially in high-dimensional settings. On the theoretical side, however, their predictive power remains largely unexplained, thereby creating a gap between theory and practice. The aim of this paper is twofold. Firstly, we provide theoretical guarantees to link finite forests used in practice (with a finite number M of trees) to their asymptotic counterparts. Using empirical process theory, we prove a uniform central limit theorem for a large class of random forest estimates, which holds in particular for Breiman's original forests. Secondly, we show that infinite forest consistency implies finite forest consistency and thus, we state the consistency of several infinite forests. In particular, we prove that q quantile forests---close in spirit to Breiman's forests but easier to study---are able to combine inconsistent trees to obtain a final consistent prediction, thus highlighting the benefits of random forests compared to single trees.
△ Less
Submitted 7 September, 2014;
originally announced September 2014.
-
Consistency of random forests
Authors:
Erwan Scornet,
Gérard Biau,
Jean-Philippe Vert
Abstract:
Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45 (2001) 5--32] that combines several randomized decision trees and aggregates their predictions by averaging. Despite its wide usage and outstanding practical performance, little is known about the mathematical properties of the procedure. This disparity between theory and practice originates in the difficulty to simultane…
▽ More
Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45 (2001) 5--32] that combines several randomized decision trees and aggregates their predictions by averaging. Despite its wide usage and outstanding practical performance, little is known about the mathematical properties of the procedure. This disparity between theory and practice originates in the difficulty to simultaneously analyze both the randomization process and the highly data-dependent tree structure. In the present paper, we take a step forward in forest exploration by proving a consistency result for Breiman's [Mach. Learn. 45 (2001) 5--32] original algorithm in the context of additive regression models. Our analysis also sheds an interesting light on how random forests can nicely adapt to sparsity. 1. Introduction. Random forests are an ensemble learning method for classification and regression that constructs a number of randomized decision trees during the training phase and predicts by averaging the results. Since its publication in the seminal paper of Breiman (2001), the procedure has become a major data analysis tool, that performs well in practice in comparison with many standard methods. What has greatly contributed to the popularity of forests is the fact that they can be applied to a wide range of prediction problems and have few parameters to tune. Aside from being simple to use, the method is generally recognized for its accuracy and its ability to deal with small sample sizes, high-dimensional feature spaces and complex data structures. The random forest methodology has been successfully involved in many practical problems, including air quality prediction (winning code of the EMC data science global hackathon in 2012, see http://www.kaggle.com/c/dsg-hackathon), chemoinformatics [Svetnik et al. (2003)], ecology [Prasad, Iverson and Liaw (2006), Cutler et al. (2007)], 3D
△ Less
Submitted 8 August, 2015; v1 submitted 12 May, 2014;
originally announced May 2014.