Showing 1–30 of 30 results for author: Scornet, E

  1. arXiv:2405.09196 [pdf, ps, other]

    math.ST

    Harnessing pattern-by-pattern linear classifiers for prediction with missing data

    Authors: Angel D Reyero Lobo, Alexis Ayme, Claire Boyer, Erwan Scornet

    Abstract: Missing values have been thoroughly analyzed in the context of linear models, where the final aim is to build coefficient estimates. However, estimating coefficients does not directly solve the problem of prediction with missing entries: a manner to address empty components must be designed. Major approaches to deal with prediction with missing values are empirically driven and can be decomposed i…

    Submitted 15 May, 2024; originally announced May 2024.

  2. arXiv:2403.19196 [pdf, other]

    math.ST

    What Is a Good Imputation Under MAR Missingness?

    Authors: Jeffrey Näf, Erwan Scornet, Julie Josse

    Abstract: Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis. Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first develop an id…

    Submitted 7 June, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

  3. arXiv:2402.03839 [pdf, other]

    math.ST stat.ML

    Random features models: a way to study the success of naive imputation

    Authors: Alexis Ayme, Claire Boyer, Aymeric Dieuleveut, Erwan Scornet

    Abstract: Constant (naive) imputation is still widely used in practice as this is a first easy-to-use technique to deal with missing data. Yet, this simple method could be expected to induce a large bias for prediction purposes, as the imputed input may strongly differ from the true underlying data. However, recent works suggest that this bias is low in the context of high-dimensional linear predictors when…

    Submitted 6 February, 2024; originally announced February 2024.

  4. arXiv:2402.03819 [pdf, other]

    stat.ML cs.LG

    Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants

    Authors: Abdoulaye Sakho, Emmanuel Malherbe, Erwan Scornet

    Abstract: Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we prove that SMOTE (with default parameter) simply copies the original minority samples asymptotically. We also prove that SMOTE exhibits boundary artifacts, thus justifying existing SMOTE variants. Then we int…

    Submitted 3 June, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  5. arXiv:2303.16008 [pdf, other]

    stat.ME

    Risk ratio, odds ratio, risk difference... Which causal measure is easier to generalize?

    Authors: Bénédicte Colnet, Julie Josse, Gaël Varoquaux, Erwan Scornet

    Abstract: There are many measures to report so-called treatment or causal effects: absolute difference, ratio, odds ratio, number needed to treat, and so on. The choice of a measure, e.g., absolute versus relative, is often debated because it leads to different impressions of the benefit or risk of a treatment. Besides, different causal measures may lead to various treatment effect heterogeneity: some input va…

    Submitted 30 March, 2024; v1 submitted 28 March, 2023; originally announced March 2023.

  6. arXiv:2301.13585 [pdf, other]

    math.ST

    Naive imputation implicitly regularizes high-dimensional linear models

    Authors: Alexis Ayme, Claire Boyer, Aymeric Dieuleveut, Erwan Scornet

    Abstract: Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithms, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits goo…

    Submitted 31 January, 2023; originally announced January 2023.

  7. arXiv:2209.15283 [pdf, other]

    stat.ML cs.LG

    Sparse tree-based initialization for neural networks

    Authors: Patrick Lutz, Ludovic Arnould, Claire Boyer, Erwan Scornet

    Abstract: Dedicated neural network (NN) architectures have been designed to handle specific data types (such as CNN for images or RNN for text), which ranks them among state-of-the-art methods for dealing with these data. Unfortunately, no architecture has been found for dealing with tabular data yet, for which tree ensemble methods (tree boosting, random forests) usually show the best predictive performanc…

    Submitted 30 September, 2022; originally announced September 2022.

  8. arXiv:2208.07614 [pdf, other]

    stat.ME

    Reweighting the RCT for generalization: finite sample error and variable selection

    Authors: Bénédicte Colnet, Julie Josse, Gaël Varoquaux, Erwan Scornet

    Abstract: Randomized Controlled Trials (RCTs) may suffer from limited scope. In particular, samples may be unrepresentative: some RCTs over- or under- sample individuals with certain characteristics compared to the target population, for which one wants conclusions on treatment effectiveness. Re-weighting trial individuals to match the target population can improve the treatment effect estimation. In this w…

    Submitted 13 March, 2024; v1 submitted 16 August, 2022; originally announced August 2022.

  9. arXiv:2202.03688 [pdf, other]

    math.ST

    Is interpolation benign for random forest regression?

    Authors: Ludovic Arnould, Claire Boyer, Erwan Scornet

    Abstract: Statistical wisdom suggests that very complex models, interpolating training data, will be poor at predicting unseen examples. Yet, this aphorism has been recently challenged by the identification of benign overfitting regimes, specially studied in the case of parametric models: generalization capabilities may be preserved despite model high complexity. While it is widely known that fully-grown deci…

    Submitted 9 February, 2023; v1 submitted 8 February, 2022; originally announced February 2022.

  10. arXiv:2202.01463 [pdf, other]

    stat.ML cs.LG

    Minimax rate of consistency for linear models with missing values

    Authors: Alexis Ayme, Claire Boyer, Aymeric Dieuleveut, Erwan Scornet

    Abstract: Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively-studied linear models, but in presence of missing values, which tu…

    Submitted 3 February, 2022; originally announced February 2022.

  11. arXiv:2106.00311 [pdf, other]

    stat.ML cs.AI cs.LG

    What's a good imputation to predict with missing values?

    Authors: Marine Le Morvan, Julie Josse, Erwan Scornet, Gaël Varoquaux

    Abstract: How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all mi…

    Submitted 30 November, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

  12. arXiv:2105.11724 [pdf, other]

    stat.ML cs.LG

    SHAFF: Fast and consistent SHApley eFfect estimates via random Forests

    Authors: Clément Bénard, Gérard Biau, Sébastien da Veiga, Erwan Scornet

    Abstract: Interpretability of learning algorithms is crucial for applications involving critical decisions, and variable importance is one of the main interpretation tools. Shapley effects are now widely used to interpret both tree ensembles and neural networks, as they can efficiently handle dependence and interactions in the data, as opposed to most other variable importance measures. However, estimating…

    Submitted 2 February, 2022; v1 submitted 25 May, 2021; originally announced May 2021.

  13. Causal effect on a target population: a sensitivity analysis to handle missing covariates

    Authors: Bénédicte Colnet, Julie Josse, Erwan Scornet, Gaël Varoquaux

    Abstract: Randomized Controlled Trials (RCTs) are often considered the gold standard for estimating causal effect, but they may lack external validity when the population eligible to the RCT is substantially different from the target population. Having at hand a sample of the target population of interest allows us to generalize the causal effect. Identifying the treatment effect in the target population re…

    Submitted 10 January, 2023; v1 submitted 13 May, 2021; originally announced May 2021.

  14. arXiv:2102.13347 [pdf, other]

    stat.ML cs.LG

    MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA

    Authors: Clément Bénard, Sébastien da Veiga, Erwan Scornet

    Abstract: Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of MDA varies across the main random forest software. In this article, our objective is to r…

    Submitted 1 March, 2022; v1 submitted 26 February, 2021; originally announced February 2021.

  15. arXiv:2010.15690 [pdf, other]

    cs.LG math.ST

    Analyzing the tree-layer structure of Deep Forests

    Authors: Ludovic Arnould, Claire Boyer, Erwan Scornet, Sorbonne Lpsm

    Abstract: Random forests on the one hand, and neural networks on the other hand, have met great success in the machine learning community for their predictive performance. Combinations of both have been proposed in the literature, notably leading to the so-called deep forests (DF) (Zhou & Feng, 2019). In this paper, our aim is not to benchmark DF performances but to investigate instead their underlying mech…

    Submitted 14 October, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

  16. arXiv:2007.01627 [pdf, other]

    cs.LG cs.AI stat.ML

    NeuMiss networks: differentiable programming for supervised learning with missing values

    Authors: Marine Le Morvan, Julie Josse, Thomas Moreau, Erwan Scornet, Gaël Varoquaux

    Abstract: The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing pattern…

    Submitted 4 November, 2020; v1 submitted 3 July, 2020; originally announced July 2020.

    Journal ref: Advances in Neural Information Processing Systems 33, Dec 2020, Vancouver, Canada

  17. arXiv:2004.14841 [pdf, other]

    stat.ML cs.AI cs.LG

    Interpretable Random Forests via Rule Extraction

    Authors: Clément Bénard, Gérard Biau, Sébastien da Veiga, Erwan Scornet

    Abstract: We introduce SIRUS (Stable and Interpretable RUle Set) for regression, a stable rule learning algorithm which takes the form of a short and simple list of rules. State-of-the-art learning algorithms are often referred to as "black boxes" because of the high number of operations involved in their prediction process. Despite their powerful predictivity, this lack of interpretability may be highly re…

    Submitted 10 February, 2021; v1 submitted 29 April, 2020; originally announced April 2020.

  18. arXiv:2002.00658 [pdf, other]

    cs.LG cs.AI stat.ML

    Linear predictor on linearly-generated data with missing values: non consistency and solutions

    Authors: Marine Le Morvan, Nicolas Prost, Julie Josse, Erwan Scornet, Gaël Varoquaux

    Abstract: We consider building predictors when the data have missing values. We study the seemingly-simple case where the target to predict is a linear function of the fully-observed data and we show that, in the presence of missing values, the optimal predictor may not be linear. In the particular Gaussian case, it can be written as a linear function of multiway interactions between the observed data and t…

    Submitted 12 May, 2020; v1 submitted 3 February, 2020; originally announced February 2020.

    Journal ref: Proceedings of Machine Learning Research, PMLR, In press

  19. arXiv:2001.04295 [pdf, other]

    math.ST stat.ML

    Trees, forests, and impurity-based variable importance

    Authors: Erwan Scornet

    Abstract: Tree ensemble methods such as random forests [Breiman, 2001] are very popular to handle high-dimensional tabular data sets, notably because of their good predictive accuracy. However, when machine learning is used for decision-making problems, settling for the best predictive procedures may not be reasonable since enlightened decisions require an in-depth comprehension of the algorithm prediction…

    Submitted 24 December, 2021; v1 submitted 13 January, 2020; originally announced January 2020.

  20. arXiv:1908.06852 [pdf, other]

    stat.ML cs.LG math.ST

    SIRUS: Stable and Interpretable RUle Set for Classification

    Authors: Clément Bénard, Gérard Biau, Sébastien da Veiga, Erwan Scornet

    Abstract: State-of-the-art learning algorithms, such as random forests or neural networks, are often qualified as "black-boxes" because of the high number and complexity of operations involved in their prediction mechanism. This lack of interpretability is a strong limitation for applications involving critical decisions, typically the analysis of production processes in the manufacturing industry. In such…

    Submitted 16 December, 2020; v1 submitted 19 August, 2019; originally announced August 2019.

  21. arXiv:1906.10529 [pdf, other]

    stat.ML cs.LG math.ST

    AMF: Aggregated Mondrian Forests for Online Learning

    Authors: Jaouad Mourtada, Stéphane Gaïffas, Erwan Scornet

    Abstract: Random Forests (RF) is one of the algorithms of choice in many supervised learning applications, be it classification or regression. The appeal of such tree-ensemble methods comes from a combination of several characteristics: a remarkable accuracy in a variety of tasks, a small number of parameters to tune, robustness with respect to features scaling, a reasonable computational cost for training…

    Submitted 15 May, 2020; v1 submitted 25 June, 2019; originally announced June 2019.

  22. arXiv:1902.06931 [pdf, other]

    stat.ML cs.LG math.ST

    On the consistency of supervised learning with missing values

    Authors: Julie Josse, Jacob M. Chen, Nicolas Prost, Erwan Scornet, Gaël Varoquaux

    Abstract: In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two appr…

    Submitted 21 March, 2024; v1 submitted 19 February, 2019; originally announced February 2019.

  23. arXiv:1803.05784 [pdf, ps, other]

    stat.ML math.ST

    Minimax optimal rates for Mondrian trees and forests

    Authors: Jaouad Mourtada, Stéphane Gaïffas, Erwan Scornet

    Abstract: Introduced by Breiman, Random Forests are widely used classification and regression algorithms. While being initially designed as batch algorithms, several variants have been proposed to handle online learning. One particular instance of such forests is the Mondrian Forest, whose trees are built using the so-called Mondrian process, therefore allowing to easily update their construction in…

    Submitted 9 April, 2019; v1 submitted 15 March, 2018; originally announced March 2018.

  24. arXiv:1711.02887 [pdf, other]

    stat.ML

    Universal consistency and minimax rates for online Mondrian Forests

    Authors: Jaouad Mourtada, Stéphane Gaïffas, Erwan Scornet

    Abstract: We establish the consistency of an algorithm of Mondrian Forests, a randomized classification algorithm that can be implemented online. First, we amend the original Mondrian Forest algorithm, that considers a fixed lifetime parameter. Indeed, the fact that this parameter is fixed hinders the statistical consistency of the original procedure. Our modified Mondrian Forest algorithm grows trees with…

    Submitted 8 November, 2017; originally announced November 2017.

    Comments: NIPS 2017

  25. arXiv:1604.07143 [pdf, other]

    stat.ML cs.LG math.ST

    Neural Random Forests

    Authors: Gérard Biau, Erwan Scornet, Johannes Welbl

    Abstract: Given an ensemble of randomized regression trees, it is possible to restructure them as a collection of multilayered neural networks with particular connection weights. Following this principle, we reformulate the random forest method of Breiman (2001) into a neural network setting, and in turn propose two new hybrid procedures that we call neural random forests. Both predictors exploit prior know…

    Submitted 3 April, 2018; v1 submitted 25 April, 2016; originally announced April 2016.

  26. arXiv:1603.04261 [pdf, other]

    math.ST

    Impact of subsampling and pruning on random forests

    Authors: Roxane Duroux, Erwan Scornet

    Abstract: Random forests are ensemble learning methods introduced by Breiman (2001) that operate by averaging several decision trees built on a randomly selected subspace of the data set. Despite their widespread use in practice, the respective roles of the different mechanisms at work in Breiman's forests are not yet fully understood, nor is the tuning of the corresponding parameters. In this paper, we…

    Submitted 14 March, 2016; originally announced March 2016.

  27. arXiv:1511.05741 [pdf, other]

    math.ST stat.ML

    A Random Forest Guided Tour

    Authors: Gérard Biau, Erwan Scornet

    Abstract: The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is ve…

    Submitted 18 November, 2015; originally announced November 2015.

  28. arXiv:1502.03836 [pdf, other]

    math.ST

    Random forests and kernel methods

    Authors: Erwan Scornet

    Abstract: Random forests are ensemble methods which grow trees as base learners and combine their predictions by averaging. Random forests are known for their good practical performance, particularly in high dimensional settings. On the theoretical side, several studies highlight the potentially fruitful connection between random forests and kernel methods. In this paper, we work out in full detail this c…

    Submitted 17 September, 2015; v1 submitted 12 February, 2015; originally announced February 2015.

  29. arXiv:1409.2090 [pdf, other]

    math.ST

    On the asymptotics of random forests

    Authors: Erwan Scornet

    Abstract: The last decade has witnessed a growing interest in random forest models which are recognized to exhibit good practical performance, especially in high-dimensional settings. On the theoretical side, however, their predictive power remains largely unexplained, thereby creating a gap between theory and practice. The aim of this paper is twofold. Firstly, we provide theoretical guarantees to link fin…

    Submitted 7 September, 2014; originally announced September 2014.

  30. arXiv:1405.2881 [pdf, ps, other]

    math.ST stat.ML

    Consistency of random forests

    Authors: Erwan Scornet, Gérard Biau, Jean-Philippe Vert

    Abstract: Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45 (2001) 5–32] that combines several randomized decision trees and aggregates their predictions by averaging. Despite its wide usage and outstanding practical performance, little is known about the mathematical properties of the procedure. This disparity between theory and practice originates in the difficulty to simultane…

    Submitted 8 August, 2015; v1 submitted 12 May, 2014; originally announced May 2014.

    Journal ref: Annals of Statistics, Institute of Mathematical Statistics (IMS), 2015, 43 (4), pp.1716-1741