-
What Is a Good Imputation Under MAR Missingness?
Authors:
Jeffrey Näf,
Erwan Scornet,
Julie Josse
Abstract:
Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis. Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first develop an identification result, showing that the widely used Multiple Imputation by Chained Equations (MICE) approach indeed identifies the right conditional distributions. Building on this analysis, we propose three essential properties a successful imputation method should meet, thus enabling a more principled evaluation of existing methods and more targeted development of new methods. In particular, we introduce a new imputation method, denoted mice-DRF, that meets two out of the three criteria. We then discuss and refine ways to rank imputation methods, developing a powerful, easy-to-use scoring algorithm to rank missing value imputations.
Submitted 7 June, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
-
MMD-based Variable Importance for Distributional Random Forest
Authors:
Clément Bénard,
Jeffrey Näf,
Julie Josse
Abstract:
Distributional Random Forest (DRF) is a flexible forest-based method to estimate the full conditional distribution of a multivariate output of interest given input variables. In this article, we introduce a variable importance algorithm for DRFs, based on the well-established drop and relearn principle and MMD distance. While traditional importance measures only detect variables with an influence on the output mean, our algorithm detects variables impacting the output distribution more generally. We show that the introduced importance measure is consistent, exhibits high empirical performance on both real and simulated data, and outperforms competitors. In particular, our algorithm is highly efficient to select variables through recursive feature elimination, and can therefore provide small sets of variables to build accurate estimates of conditional output distributions.
Submitted 14 February, 2024; v1 submitted 18 October, 2023;
originally announced October 2023.
-
Confidence and Uncertainty Assessment for Distributional Random Forests
Authors:
Jeffrey Näf,
Corinne Emmenegger,
Peter Bühlmann,
Nicolai Meinshausen
Abstract:
The Distributional Random Forest (DRF) is a recently introduced Random Forest algorithm to estimate multivariate conditional distributions. Due to its general estimation procedure, it can be employed to estimate a wide range of targets such as conditional average treatment effects, conditional quantiles, and conditional correlations. However, only results about the consistency and convergence rate of the DRF prediction are available so far. We characterize the asymptotic distribution of DRF and develop a bootstrap approximation of it. This allows us to derive inferential tools for quantifying standard errors and the construction of confidence regions that have asymptotic coverage guarantees. In simulation studies, we empirically validate the developed theory for inference of low-dimensional targets and for testing distributional differences between two populations.
Submitted 19 December, 2023; v1 submitted 11 February, 2023;
originally announced February 2023.
-
R-NL: Covariance Matrix Estimation for Elliptical Distributions based on Nonlinear Shrinkage
Authors:
Simon Hediger,
Jeffrey Näf,
Michael Wolf
Abstract:
We combine Tyler's robust estimator of the dispersion matrix with nonlinear shrinkage. This approach delivers a simple and fast estimator of the dispersion matrix in elliptical models that is robust against both heavy tails and high dimensions. We prove convergence of the iterative part of our algorithm and demonstrate the favorable performance of the estimator in a wide range of simulation scenarios. Finally, an empirical application demonstrates its state-of-the-art performance on real data.
Submitted 3 May, 2023; v1 submitted 26 October, 2022;
originally announced October 2022.
-
PKLM: A flexible MCAR test using Classification
Authors:
Meta-Lina Spohn,
Jeffrey Näf,
Loris Michel,
Nicolai Meinshausen
Abstract:
We develop a fully non-parametric, easy-to-use, and powerful test for the missing completely at random (MCAR) assumption on the missingness mechanism of a dataset. The test compares distributions of different missing patterns on random projections in the variable space of the data. The distributional differences are measured with the Kullback-Leibler Divergence, using probability Random Forests. We thus refer to it as "Projected Kullback-Leibler MCAR" (PKLM) test. The use of random projections makes it applicable even if very few or no fully observed observations are available or if the number of dimensions is large. An efficient permutation approach guarantees the level for any finite sample size, resolving a major shortcoming of most other available tests. Moreover, the test can be used on both discrete and continuous data. We show empirically on a range of simulated data distributions and real datasets that our test has consistently high power and is able to avoid inflated type-I errors. Finally, we provide an R-package PKLMtest with an implementation of our test.
Submitted 30 November, 2022; v1 submitted 21 September, 2021;
originally announced September 2021.
-
Imputation Scores
Authors:
Jeffrey Näf,
Meta-Lina Spohn,
Loris Michel,
Nicolai Meinshausen
Abstract:
Given the prevalence of missing data in modern statistical research, a broad range of methods is available for any given imputation task. How does one choose the 'best' imputation method in a given application? The standard approach is to select some observations, set their status to missing, and compare the prediction accuracy of the methods under consideration on these observations. Besides having to somewhat artificially mask observations, a shortcoming of this approach is that imputations based on the conditional mean will rank highest if predictive accuracy is measured with quadratic loss. In contrast, we want to rank highest an imputation that can sample from the true conditional distributions. In this paper, we develop a framework called "Imputation Scores" (I-Scores) for assessing missing value imputations. We provide a specific I-Score based on density ratios and projections, that is applicable to discrete and continuous data. It does not require masking additional observations for evaluation and is also applicable if there are no complete observations. The population version is shown to be proper in the sense that the highest rank is assigned to an imputation method that samples from the correct conditional distribution. The propriety is shown under the missing completely at random (MCAR) assumption but is also shown to be valid under missing at random (MAR) with slightly more restrictive assumptions. We show empirically on a range of data sets and imputation methods that our score consistently ranks true data high(est) and is able to avoid pitfalls usually associated with performance measures such as RMSE. Finally, we provide the R-package Iscores available on CRAN with an implementation of our method.
Submitted 30 November, 2022; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression
Authors:
Domagoj Ćevid,
Loris Michel,
Jeffrey Näf,
Nicolai Meinshausen,
Peter Bühlmann
Abstract:
Random Forest (Breiman, 2001) is a successful and widely used regression and classification algorithm. Part of its appeal and reason for its versatility is its (implicit) construction of a kernel-type weighting function on training data, which can also be used for targets other than the original mean estimation. We propose a novel forest construction for multivariate responses based on their joint conditional distribution, independent of the estimation target and the data model. It uses a new splitting criterion based on the MMD distributional metric, which is suitable for detecting heterogeneity in multivariate distributions. The induced weights define an estimate of the full conditional distribution, which in turn can be used for arbitrary and potentially complicated targets of interest. The method is very versatile and convenient to use, as we illustrate on a wide range of examples. The code is available as Python and R packages drf.
Submitted 12 October, 2022; v1 submitted 29 May, 2020;
originally announced May 2020.
-
High Probability Lower Bounds for the Total Variation Distance
Authors:
Loris Michel,
Jeffrey Näf,
Nicolai Meinshausen
Abstract:
The statistics and machine learning communities have recently seen a growing interest in classification-based approaches to two-sample testing. The outcome of a classification-based two-sample test remains a rejection decision, which is not always informative since the null hypothesis is seldom strictly true. Therefore, when a test rejects, it would be beneficial to provide an additional quantity serving as a refined measure of distributional difference. In this work, we introduce a framework for the construction of high-probability lower bounds on the total variation distance. These bounds are based on a one-dimensional projection, such as a classification or regression method, and can be interpreted as the minimal fraction of samples pointing towards a distributional difference. We further derive asymptotic power and detection rates of two proposed estimators and discuss potential uses through an application to a reanalysis climate dataset.
Submitted 14 November, 2022; v1 submitted 12 May, 2020;
originally announced May 2020.
-
On the Use of Random Forest for Two-Sample Testing
Authors:
Simon Hediger,
Loris Michel,
Jeffrey Näf
Abstract:
Following the line of classification-based two-sample testing, tests based on the Random Forest classifier are proposed. The developed tests are easy to use, require almost no tuning, and are applicable for any distribution on $\mathbb{R}^d$. Furthermore, the built-in variable importance measure of the Random Forest gives potential insights into which variables account for the difference in distribution. An asymptotic power analysis for the proposed tests is developed. Finally, two real-world applications illustrate the usefulness of the introduced methodology. To simplify the use of the method, the R-package "hypoRF" is provided.
Submitted 6 May, 2021; v1 submitted 14 March, 2019;
originally announced March 2019.
-
Probing the dark matter issue in f(R)-gravity via gravitational lensing
Authors:
M. Lubini,
C. Tortora,
J. Näf,
Ph. Jetzer,
S. Capozziello
Abstract:
For a general class of analytic f(R)-gravity theories, we discuss the weak field limit in view of gravitational lensing. Though an additional Yukawa term in the gravitational potential modifies dynamics with respect to the standard Newtonian limit of General Relativity, the motion of massless particles remains unaffected thanks to suitable cancellations in the post-Newtonian limit. Thus, all the lensing observables are equal to the ones known from General Relativity. Since f(R)-gravity is claimed, among other things, to be a possible way to overcome the need for dark matter in virialized systems, we discuss the impact of our results on the dynamical and gravitational lensing analyses. In this framework, dynamics could, in principle, be able to reproduce the astrophysical observations without recourse to dark matter, but in the case of gravitational lensing we find that dark matter is an unavoidable ingredient. Another important implication is that gravitational lensing, in the post-Newtonian limit, is not able to constrain these extended theories, since their predictions do not differ from General Relativity.
Submitted 2 December, 2011; v1 submitted 14 April, 2011;
originally announced April 2011.
-
On Gravitational Radiation in Quadratic $f(R)$ Gravity
Authors:
Joachim Näf,
Philippe Jetzer
Abstract:
We investigate the gravitational radiation emitted by an isolated system for gravity theories with Lagrange density $f(R) = R + aR^2$. As a formal result we obtain leading order corrections to the quadrupole formula in General Relativity. We make use of the analogy of $f(R)$ theories with scalar-tensor theories, which in contrast to General Relativity feature an additional scalar degree of freedom. Unlike General Relativity, where the leading order gravitational radiation is produced by quadrupole moments, the additional degree of freedom predicts gravitational radiation of all multipoles, in particular monopoles and dipoles, as is the case for most alternative gravity theories known today. An application to a hypothetical binary pulsar moving in a circular orbit yields the rough limit $a \lesssim 1.7\cdot10^{17}\,\mathrm{m}^2$ by constraining the dipole power to account for at most 1% of the quadrupole power as predicted by General Relativity.
Submitted 12 July, 2011; v1 submitted 12 April, 2011;
originally announced April 2011.
-
On the 1/c Expansion of f(R) Gravity
Authors:
Joachim Näf,
Philippe Jetzer
Abstract:
For applications to isolated systems on the scale of the Solar System, we derive the first relativistic terms in the $1/c$ expansion of the spacetime metric $g_{μν}$ for metric $f(R)$ gravity theories, where $f$ is assumed to be analytic at $R=0$. For our purpose it suffices to take into account terms up to quadratic order in the expansion of $f(R)$, thus we can approximate $f(R) = R + aR^2$ with a positive dimensional parameter $a$. In the non-relativistic limit, we get an additional Yukawa correction with coupling strength $G/3$ and Compton wavelength $\sqrt{6a}$ to the Newtonian potential, which is a known result in the literature. As an application, we derive to the same order the correction to the geodetic precession of a gyroscope in a gravitational field and the precession of binary pulsars. The result of the Gravity Probe B experiment yields the limit $a \lesssim 5 \times 10^{11} \, \mathrm{m}^2$, whereas for the pulsar B in the PSR J0737-3039 system we get a bound which is about $10^4$ times larger. On the other hand the Eöt-Wash experiment provides the best laboratory bound $a \lesssim 10^{-10} \, \mathrm{m}^2$. Although the former bounds from geodetic precession are much larger than the laboratory ones, they are still meaningful in case some type of chameleon effect is present and thus the effective values could be different at different length scales.
Submitted 12 April, 2010;
originally announced April 2010.
-
On Gravitational Waves in Spacetimes with a Nonvanishing Cosmological Constant
Authors:
J. Näf,
P. Jetzer,
M. Sereno
Abstract:
We study the effect of a cosmological constant $Λ$ on the propagation and detection of gravitational waves. To this end we investigate the linearised Einstein equations with terms up to linear order in $Λ$ in a de Sitter and an anti-de Sitter background spacetime. In this framework the cosmological term does not induce changes in the polarization states of the waves, whereas the amplitude gets modified with terms depending on $Λ$. Moreover, if a source emits a periodic waveform, its periodicity as measured by a distant observer gets modified. These effects are, however, extremely tiny and thus some twenty orders of magnitude below the detectability of present gravitational wave detectors such as LIGO or planned future ones such as LISA.
Submitted 30 October, 2008;
originally announced October 2008.