-
Unraveling overoptimism and publication bias in ML-driven science
Authors:
Pouria Saidi,
Gautam Dasarathy,
Visar Berisha
Abstract:
Machine Learning (ML) is increasingly used across many disciplines with impressive reported results across many domain areas. However, recent studies suggest that the published performance of ML models are often overoptimistic. Validity concerns are underscored by findings of an inverse relationship between sample size and reported accuracy in published ML models, contrasting with the theory of le…
▽ More
Machine Learning (ML) is increasingly used across many disciplines with impressive reported results across many domain areas. However, recent studies suggest that the published performance of ML models are often overoptimistic. Validity concerns are underscored by findings of an inverse relationship between sample size and reported accuracy in published ML models, contrasting with the theory of learning curves where accuracy should improve or remain stable with increasing sample size. This paper investigates factors contributing to overoptimistic accuracy reports in ML-driven science, focusing on data leakage and publication bias. We introduce a novel stochastic model for observed accuracy, integrating parametric learning curves and the aforementioned biases. We then construct an estimator that corrects for these biases in observed data. Theoretical and empirical results show that our framework can estimate the underlying learning curve, providing realistic performance assessments from published results. Applying the model to meta-analyses in ML-driven science, including neuroimaging-based and speech-based classifications of neurological conditions, we find prevalent overoptimism and estimate the inherent limits of ML-based prediction in each domain.
△ Less
Submitted 10 June, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Active Sequential Two-Sample Testing
Authors:
Weizhi Li,
Prad Kadambi,
Pouria Saidi,
Karthikeyan Natesan Ramamurthy,
Gautam Dasarathy,
Visar Berisha
Abstract:
A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address the problem, we devise the first \emph{active sequential two…
▽ More
A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address the problem, we devise the first \emph{active sequential two-sample testing framework} that not only sequentially but also \emph{actively queries}. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict where the (unlabelled) features have a high dependency on labels; labeling the ``high-dependency'' features leads to the increased power of the proposed testing framework. In theory, we provide the proof that our framework produces an \emph{anytime-valid} $p$-value. In addition, we characterize the proposed framework's gain in testing power by analyzing the mutual information between the feature and label variables in asymptotic and finite-sample scenarios. In practice, we introduce an instantiation of our framework and evaluate it using several experiments; the experiments on the synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test significantly increases while the Type I error is under control.
△ Less
Submitted 27 June, 2024; v1 submitted 29 January, 2023;
originally announced January 2023.
-
Support Recovery of Periodic Mixtures with Nested Periodic Dictionaries
Authors:
Pouria Saidi,
George K. Atia
Abstract:
Periodic signals composed of periodic mixtures admit sparse representations in nested periodic dictionaries (NPDs). Therefore, their underlying hidden periods can be estimated by recovering the exact support of said representations. In this paper, support recovery guarantees of such signals are derived both in noise-free and noisy settings. While exact recovery conditions have been studied in the…
▽ More
Periodic signals composed of periodic mixtures admit sparse representations in nested periodic dictionaries (NPDs). Therefore, their underlying hidden periods can be estimated by recovering the exact support of said representations. In this paper, support recovery guarantees of such signals are derived both in noise-free and noisy settings. While exact recovery conditions have been studied in the theory of compressive sensing, existing conditions fall short of yielding meaningful achievability regions in the context of periodic signals with sparse representations in NPDs, in part since existing bounds do not capture structures intrinsic to these dictionaries. We leverage known properties of NPDs to derive several conditions for exact sparse recovery of periodic mixtures in the noise-free setting. These conditions rest on newly introduced notions of nested periodic coherence and restricted coherence, which can be efficiently computed. In the presence of noise, we obtain improved conditions for recovering the exact support set of the sparse representation of the periodic mixture via orthogonal matching pursuit based on the introduced notions of coherence. The theoretical findings are corroborated using numerical experiments for different families of NPDs. Our results show significant improvement over generic recovery bounds as the conditions hold over a larger range of sparsity levels.
△ Less
Submitted 3 June, 2024; v1 submitted 25 October, 2021;
originally announced October 2021.