Search | arXiv e-print repository

arXiv:2403.19196 [pdf, other]

What Is a Good Imputation Under MAR Missingness?

Authors: Jeffrey Näf, Erwan Scornet, Julie Josse

Abstract: Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis. Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first develop an identification result, showing that the widely used Multiple Imputation by Chained Equations (MICE) approach indeed identifies the right conditional distributions. Building on this analysis, we propose three essential properties a successful imputation method should meet, thus enabling a more principled evaluation of existing methods and more targeted development of new methods. In particular, we introduce a new imputation method, denoted mice-DRF, that meets two out of the three criteria. We then discuss and refine ways to rank imputation methods, developing a powerful, easy-to-use scoring algorithm to rank missing value imputations.

Submitted 7 June, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

arXiv:2310.12115 [pdf, other]

MMD-based Variable Importance for Distributional Random Forest

Authors: Clément Bénard, Jeffrey Näf, Julie Josse

Abstract: Distributional Random Forest (DRF) is a flexible forest-based method to estimate the full conditional distribution of a multivariate output of interest given input variables. In this article, we introduce a variable importance algorithm for DRFs, based on the well-established drop and relearn principle and MMD distance. While traditional importance measures only detect variables with an influence on the output mean, our algorithm detects variables impacting the output distribution more generally. We show that the introduced importance measure is consistent, exhibits high empirical performance on both real and simulated data, and outperforms competitors. In particular, our algorithm is highly efficient to select variables through recursive feature elimination, and can therefore provide small sets of variables to build accurate estimates of conditional output distributions.

Submitted 14 February, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

arXiv:2302.05761 [pdf, other]

Confidence and Uncertainty Assessment for Distributional Random Forests

Authors: Jeffrey Näf, Corinne Emmenegger, Peter Bühlmann, Nicolai Meinshausen

Abstract: The Distributional Random Forest (DRF) is a recently introduced Random Forest algorithm to estimate multivariate conditional distributions. Due to its general estimation procedure, it can be employed to estimate a wide range of targets such as conditional average treatment effects, conditional quantiles, and conditional correlations. However, only results about the consistency and convergence rate of the DRF prediction are available so far. We characterize the asymptotic distribution of DRF and develop a bootstrap approximation of it. This allows us to derive inferential tools for quantifying standard errors and the construction of confidence regions that have asymptotic coverage guarantees. In simulation studies, we empirically validate the developed theory for inference of low-dimensional targets and for testing distributional differences between two populations.

Submitted 19 December, 2023; v1 submitted 11 February, 2023; originally announced February 2023.

arXiv:2210.14854 [pdf, other]

doi 10.1109/TSP.2023.3270742

R-NL: Covariance Matrix Estimation for Elliptical Distributions based on Nonlinear Shrinkage

Authors: Simon Hediger, Jeffrey Näf, Michael Wolf

Abstract: We combine Tyler's robust estimator of the dispersion matrix with nonlinear shrinkage. This approach delivers a simple and fast estimator of the dispersion matrix in elliptical models that is robust against both heavy tails and high dimensions. We prove convergence of the iterative part of our algorithm and demonstrate the favorable performance of the estimator in a wide range of simulation scenarios. Finally, an empirical application demonstrates its state-of-the-art performance on real data.

Submitted 3 May, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

arXiv:2109.10150 [pdf, other]

PKLM: A flexible MCAR test using Classification

Authors: Meta-Lina Spohn, Jeffrey Näf, Loris Michel, Nicolai Meinshausen

Abstract: We develop a fully non-parametric, easy-to-use, and powerful test for the missing completely at random (MCAR) assumption on the missingness mechanism of a dataset. The test compares distributions of different missing patterns on random projections in the variable space of the data. The distributional differences are measured with the Kullback-Leibler Divergence, using probability Random Forests. We thus refer to it as "Projected Kullback-Leibler MCAR" (PKLM) test. The use of random projections makes it applicable even if very few or no fully observed observations are available or if the number of dimensions is large. An efficient permutation approach guarantees the level for any finite sample size, resolving a major shortcoming of most other available tests. Moreover, the test can be used on both discrete and continuous data. We show empirically on a range of simulated data distributions and real datasets that our test has consistently high power and is able to avoid inflated type-I errors. Finally, we provide an R-package PKLMtest with an implementation of our test.

Submitted 30 November, 2022; v1 submitted 21 September, 2021; originally announced September 2021.
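
The abstract above states that a permutation approach guarantees the level of the test for any finite sample size. The general principle can be sketched as follows. This is a generic two-sample permutation test, not the PKLM procedure or the PKLMtest package: the statistic `mean_difference`, the function `permutation_p_value`, and the simulated groups are all hypothetical choices made for illustration.

```python
import random


def mean_difference(x, y):
    """Absolute difference of sample means (an illustrative test statistic)."""
    return abs(sum(x) / len(x) - sum(y) / len(y))


def permutation_p_value(x, y, statistic, n_permutations=199, seed=0):
    """Permutation p-value for the null hypothesis that x and y are exchangeable."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    pooled = list(x) + list(y)
    exceed = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled sample.
        rng.shuffle(pooled)
        if statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            exceed += 1
    # Counting the observed statistic itself (the "+1" terms) keeps the
    # p-value valid at any finite sample size.
    return (1 + exceed) / (1 + n_permutations)


rng = random.Random(1)
group_a = [rng.gauss(0.0, 1.0) for _ in range(30)]
group_b = [rng.gauss(0.0, 1.0) for _ in range(30)]  # same distribution as group_a
group_c = [rng.gauss(3.0, 1.0) for _ in range(30)]  # mean shifted by 3

p_same = permutation_p_value(group_a, group_b, mean_difference)
p_shifted = permutation_p_value(group_a, group_c, mean_difference)
print(p_same, p_shifted)
```

The smallest attainable p-value is 1 / (1 + n_permutations), so the number of permutations bounds the resolution of the test.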

arXiv:2106.03742 [pdf, other]

Imputation Scores

Authors: Jeffrey Näf, Meta-Lina Spohn, Loris Michel, Nicolai Meinshausen

Abstract: Given the prevalence of missing data in modern statistical research, a broad range of methods is available for any given imputation task. How does one choose the 'best' imputation method in a given application? The standard approach is to select some observations, set their status to missing, and compare prediction accuracy of the methods under consideration of these observations. Besides having to somewhat artificially mask observations, a shortcoming of this approach is that imputations based on the conditional mean will rank highest if predictive accuracy is measured with quadratic loss. In contrast, we want to rank highest an imputation that can sample from the true conditional distributions. In this paper, we develop a framework called "Imputation Scores" (I-Scores) for assessing missing value imputations. We provide a specific I-Score based on density ratios and projections, that is applicable to discrete and continuous data. It does not require to mask additional observations for evaluations and is also applicable if there are no complete observations. The population version is shown to be proper in the sense that the highest rank is assigned to an imputation method that samples from the correct conditional distribution. The propriety is shown under the missing completely at random (MCAR) assumption but is also shown to be valid under missing at random (MAR) with slightly more restrictive assumptions. We show empirically on a range of data sets and imputation methods that our score consistently ranks true data high(est) and is able to avoid pitfalls usually associated with performance measures such as RMSE. Finally, we provide the R-package Iscores available on CRAN with an implementation of our method.

Submitted 30 November, 2022; v1 submitted 7 June, 2021; originally announced June 2021.
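
The abstract above notes that conditional-mean imputations rank highest under quadratic loss even though they do not sample from the true conditional distribution. A small simulation can illustrate the point. It is only a sketch under assumed settings (a toy model Y = X + noise with standard normal X and noise, and hypothetical helper names `rmse` and `variance`); it does not use the Iscores package or compute an I-Score.

```python
import math
import random


def rmse(truth, imputed):
    """Root mean squared error between masked true values and imputations."""
    return math.sqrt(sum((t - i) ** 2 for t, i in zip(truth, imputed)) / len(truth))


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
n = 5000

# Assumed toy model: X ~ N(0, 1) is observed, Y = X + N(0, 1) is masked.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y_true = [xi + rng.gauss(0.0, 1.0) for xi in x]

# Imputation 1: the conditional mean E[Y | X = x] = x.
mean_imputed = list(x)
# Imputation 2: a fresh draw from the true conditional distribution N(x, 1).
draw_imputed = [xi + rng.gauss(0.0, 1.0) for xi in x]

rmse_mean = rmse(y_true, mean_imputed)
rmse_draw = rmse(y_true, draw_imputed)
print("RMSE, conditional mean:", rmse_mean)
print("RMSE, conditional draw:", rmse_draw)
print("variance of true / mean-imputed / draw-imputed values:",
      variance(y_true), variance(mean_imputed), variance(draw_imputed))
```

In this setup the conditional mean attains the lower RMSE, while only the draws reproduce the spread of the masked values, which is the distortion that a distribution-aware score is meant to penalize.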

arXiv:2005.14458 [pdf, other]

Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression

Authors: Domagoj Ćevid, Loris Michel, Jeffrey Näf, Nicolai Meinshausen, Peter Bühlmann

Abstract: Random Forest (Breiman, 2001) is a successful and widely used regression and classification algorithm. Part of its appeal and reason for its versatility is its (implicit) construction of a kernel-type weighting function on training data, which can also be used for targets other than the original mean estimation. We propose a novel forest construction for multivariate responses based on their joint conditional distribution, independent of the estimation target and the data model. It uses a new splitting criterion based on the MMD distributional metric, which is suitable for detecting heterogeneity in multivariate distributions. The induced weights define an estimate of the full conditional distribution, which in turn can be used for arbitrary and potentially complicated targets of interest. The method is very versatile and convenient to use, as we illustrate on a wide range of examples. The code is available as Python and R packages drf.

Submitted 12 October, 2022; v1 submitted 29 May, 2020; originally announced May 2020.
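
The MMD distributional metric mentioned in the abstract above can be illustrated with a plain sample estimate. The sketch below computes the biased (V-statistic) estimate of the squared MMD with a Gaussian kernel. It is not the splitting criterion as implemented in the drf packages; the kernel choice, the bandwidth, and the function names `gaussian_kernel`, `mean_kernel`, and `mmd_squared` are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(u, v, bandwidth=1.0):
    """Gaussian kernel between two points given as equal-length tuples."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(u, v, bandwidth) for u in xs for v in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


rng = random.Random(0)
n = 100
sample_a = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
sample_b = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
sample_c = [(rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)) for _ in range(n)]

mmd_same = mmd_squared(sample_a, sample_b)     # same distribution
mmd_shifted = mmd_squared(sample_a, sample_c)  # mean shifted in both coordinates
print(mmd_same, mmd_shifted)
```

Because the estimate compares whole samples rather than means alone, it also reacts to differences in spread or dependence between coordinates, which is what makes an MMD-type criterion suitable for multivariate responses.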

arXiv:2005.06006 [pdf, other]

High Probability Lower Bounds for the Total Variation Distance

Authors: Loris Michel, Jeffrey Näf, Nicolai Meinshausen

Abstract: The statistics and machine learning communities have recently seen a growing interest in classification-based approaches to two-sample testing. The outcome of a classification-based two-sample test remains a rejection decision, which is not always informative since the null hypothesis is seldom strictly true. Therefore, when a test rejects, it would be beneficial to provide an additional quantity serving as a refined measure of distributional difference. In this work, we introduce a framework for the construction of high-probability lower bounds on the total variation distance. These bounds are based on a one-dimensional projection, such as a classification or regression method, and can be interpreted as the minimal fraction of samples pointing towards a distributional difference. We further derive asymptotic power and detection rates of two proposed estimators and discuss potential uses through an application to a reanalysis climate dataset.

Submitted 14 November, 2022; v1 submitted 12 May, 2020; originally announced May 2020.

arXiv:1903.06287 [pdf, ps, other]

On the Use of Random Forest for Two-Sample Testing

Authors: Simon Hediger, Loris Michel, Jeffrey Näf

Abstract: Following the line of classification-based two-sample testing, tests based on the Random Forest classifier are proposed. The developed tests are easy to use, require almost no tuning, and are applicable for any distribution on $\mathbb{R}^d$. Furthermore, the built-in variable importance measure of the Random Forest gives potential insights into which variables make out the difference in distribution. An asymptotic power analysis for the proposed tests is developed. Finally, two real-world applications illustrate the usefulness of the introduced methodology. To simplify the use of the method, the R-package "hypoRF" is provided.

Submitted 6 May, 2021; v1 submitted 14 March, 2019; originally announced March 2019.

arXiv:1104.2851 [pdf, ps, other]

doi 10.1140/epjc/s10052-011-1834-8

Probing the dark matter issue in f(R)-gravity via gravitational lensing

Authors: M. Lubini, C. Tortora, J. Näf, Ph. Jetzer, S. Capozziello

Abstract: For a general class of analytic f(R)-gravity theories, we discuss the weak field limit in view of gravitational lensing. Though an additional Yukawa term in the gravitational potential modifies dynamics with respect to the standard Newtonian limit of General Relativity, the motion of massless particles results unaffected thanks to suitable cancellations in the post-Newtonian limit. Thus, all the lensing observables are equal to the ones known from General Relativity. Since f(R)-gravity is claimed, among other things, to be a possible solution to overcome for the need of dark matter in virialized systems, we discuss the impact of our results on the dynamical and gravitational lensing analyses. In this framework, dynamics could, in principle, be able to reproduce the astrophysical observations without recurring to dark matter, but in the case of gravitational lensing we find that dark matter is an unavoidable ingredient. Another important implication is that gravitational lensing, in the post-Newtonian limit, is not able to constrain these extended theories, since their predictions do not differ from General Relativity.

Submitted 2 December, 2011; v1 submitted 14 April, 2011; originally announced April 2011.

Comments: 7 pages, accepted for publication in EPJC

Journal ref: Eur. Phys. J. C (2011) 71, 1834

arXiv:1104.2200 [pdf, ps, other]

doi 10.1103/PhysRevD.84.024027

On Gravitational Radiation in Quadratic $f(R)$ Gravity

Authors: Joachim Näf, Philippe Jetzer

Abstract: We investigate the gravitational radiation emitted by an isolated system for gravity theories with Lagrange density $f(R) = R + aR^2$. As a formal result we obtain leading order corrections to the quadrupole formula in General Relativity. We make use of the analogy of $f(R)$ theories with scalar--tensor theories, which in contrast to General Relativity feature an additional scalar degree of freedom. Unlike General Relativity, where the leading order gravitational radiation is produced by quadrupole moments, the additional degree of freedom predicts gravitational radiation of all multipoles, in particular monopoles and dipoles, as this is the case for the most alternative gravity theories known today. An application to a hypothetical binary pulsar moving in a circular orbit yields the rough limit $a \lesssim 1.7\cdot10^{17}\,\mathrm{m}^2$ by constraining the dipole power to account at most for 1% of the quadrupole power as predicted by General Relativity.

Submitted 12 July, 2011; v1 submitted 12 April, 2011; originally announced April 2011.

Comments: 14 Pages, 1 Figure

Journal ref: Phys.Rev.D84:024027,2011

arXiv:1004.2014 [pdf, ps, other]

doi 10.1103/PhysRevD.81.104003

On the 1/c Expansion of f(R) Gravity

Authors: Joachim Näf, Philippe Jetzer

Abstract: We derive for applications to isolated systems - on the scale of the Solar System - the first relativistic terms in the $1/c$ expansion of the space time metric $g_{μν}$ for metric $f(R)$ gravity theories, where $f$ is assumed to be analytic at $R=0$. For our purpose it suffices to take into account up to quadratic terms in the expansion of $f(R)$, thus we can approximate $f(R) = R + aR^2$ with a positive dimensional parameter $a$. In the non-relativistic limit, we get an additional Yukawa correction with coupling strength $G/3$ and Compton wave length $\sqrt{6a}$ to the Newtonian potential, which is a known result in the literature. As an application, we derive to the same order the correction to the geodetic precession of a gyroscope in a gravitational field and the precession of binary pulsars. The result of the Gravity Probe B experiment yields the limit $a \lesssim 5 \times 10^{11} \, \mathrm{m}^2$, whereas for the pulsar B in the PSR J0737-3039 system we get a bound which is about $10^4$ times larger. On the other hand the Eöt-Wash experiment provides the best laboratory bound $a \lesssim 10^{-10} \, \mathrm{m}^2$. Although the former bounds from geodesic precession are much larger than the laboratory ones, they are still meaningful in the case some type of chameleon effect is present and thus the effective values could be different at different length scales.

Submitted 12 April, 2010; originally announced April 2010.

Comments: 11 pages, accepted for publication in Physical Review D

Journal ref: Phys.Rev.D81:104003,2010
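
The Yukawa-corrected Newtonian potential described in the abstract above can be written out for a point mass. The expression below is a sketch reconstructed from the stated coupling strength $G/3$ and length scale $\sqrt{6a}$; the symbols $\Phi$, $M$, and $r$, as well as the overall sign convention, are assumptions not given in the abstract.

```latex
\Phi(r) = -\frac{GM}{r}\left(1 + \frac{1}{3}\, e^{-r/\sqrt{6a}}\right)
```

Under this form, the exponential term vanishes for $r \gg \sqrt{6a}$ and the Newtonian potential is recovered, while for $r \ll \sqrt{6a}$ the effective coupling approaches $4G/3$.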

arXiv:0810.5426 [pdf, ps, other]

doi 10.1103/PhysRevD.79.024014

On Gravitational Waves in Spacetimes with a Nonvanishing Cosmological Constant

Authors: J. Näf, P. Jetzer, M. Sereno

Abstract: We study the effect of a cosmological constant $Λ$ on the propagation and detection of gravitational waves. To this purpose we investigate the linearised Einstein's equations with terms up to linear order in $Λ$ in a de Sitter and an anti-de Sitter background spacetime. In this framework the cosmological term does not induce changes in the polarization states of the waves, whereas the amplitude gets modified with terms depending on $Λ$. Moreover, if a source emits a periodic waveform, its periodicity as measured by a distant observer gets modified. These effects are, however, extremely tiny and thus well below the detectability by some twenty orders of magnitude within present gravitational wave detectors such as LIGO or future planned ones such as LISA.

Submitted 30 October, 2008; originally announced October 2008.

Comments: 8 pages, 4 figures, accepted for publication in Physical Review D

Journal ref: Phys.Rev.D79:024014,2009

Showing 1–13 of 13 results for author: Näf, J