
Direction Preferring Confidence Intervals

Authors: Tzviel Frostig, Yoav Benjamini, Ruth Heller

Abstract: Confidence intervals (CIs) are instrumental in statistical analysis, providing a range estimate of the parameters. In modern statistics, selective inference is common, where only certain parameters are highlighted. However, this selective approach can bias the inference, leading some to advocate for the use of CIs over p-values. To increase the flexibility of confidence intervals, we introduce direction-preferring CIs, enabling analysts to focus on parameters trending in a particular direction. We present these types of CIs in two settings: First, when there is no selection of parameters; and second, for situations involving parameter selection, where we offer a conditional version of the direction-preferring CIs. Both of these methods build upon the foundations of Modified Pratt CIs, which rely on non-equivariant acceptance regions to achieve longer intervals in exchange for improved sign exclusions. We show that for selected parameters out of m > 1 initial parameters of interest, CIs aimed at controlling the false coverage rate have higher power to determine the sign compared to conditional CIs. We also show that conditional confidence intervals control the marginal false coverage rate (mFCR) under any dependency.

Submitted 30 March, 2024; originally announced April 2024.

Comments: 11 figures, 45 pages

MSC Class: 62P10

arXiv:2311.18575 [pdf, other]

Class Distribution Shifts in Zero-Shot Learning: Learning Robust Representations

Authors: Yuli Slavutsky, Yuval Benjamini

Abstract: Zero-shot learning methods typically assume that the new, unseen classes that are encountered at deployment, come from the same distribution as training classes. However, real-world scenarios often involve class distribution shifts (e.g., in age or gender for person identification), posing challenges for zero-shot classifiers that rely on learned representations from training classes. In this work, we propose a model that assumes that the attribute responsible for the shift is unknown in advance, and show that standard training may lead to non-robust representations. To mitigate this, we propose an algorithm for learning robust representations by (a) constructing synthetic data environments via hierarchical sampling and (b) applying environment balancing penalization, inspired by out-of-distribution problems. We show that our approach improves generalization on diverse class distributions in both simulations and real-world datasets.

Submitted 27 May, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

arXiv:2307.15361 [pdf, other]

Confident Feature Ranking

Authors: Bitya Neuhof, Yuval Benjamini

Abstract: Machine learning models are widely applied in various fields. Stakeholders often use post-hoc feature importance methods to better understand the input features' contribution to the models' predictions. The interpretation of the importance values provided by these methods is frequently based on the relative order of the features (their ranking) rather than the importance values themselves. Since the order may be unstable, we present a framework for quantifying the uncertainty in global importance values. We propose a novel method for the post-hoc interpretation of feature importance values that is based on the framework and pairwise comparisons of the feature importance values. This method produces simultaneous confidence intervals for the features' ranks, which include the ``true'' (infinite sample) ranks with high probability, and enables the selection of the set of the top-k important features.

Submitted 18 April, 2024; v1 submitted 28 July, 2023; originally announced July 2023.

arXiv:2303.13330 [pdf, other]

Logistic Regression Equivalence: A Framework for Comparing Logistic Regression Models Across Populations

Authors: Guy Ashiri-Prossner, Yuval Benjamini

Abstract: In this paper we discuss how to evaluate the differences between fitted logistic regression models across sub-populations. Our motivating example is in studying computerized diagnosis for learning disabilities, where sub-populations based on gender may or may not require separate models. In this context, significance tests for hypotheses of no difference between populations may provide perverse incentives, as larger variances and smaller samples increase the probability of not-rejecting the null. We argue that equivalence testing for a prespecified tolerance level on population differences incentivizes accuracy in the inference. We develop a cascading set of equivalence tests, in which each test addresses a different aspect of the model: the way the phenomenon is coded in the regression coefficients, the individual predictions in the per example log odds ratio and the overall accuracy in the mean square prediction error. For each equivalence test, we propose a strategy for setting the equivalence thresholds. The large-sample approximations are validated using simulations. For diagnosis data, we show examples for equivalent and non-equivalent models.

Submitted 23 March, 2023; originally announced March 2023.

arXiv:2111.07444 [pdf, other]

Detecting Differences Between Correlation-Matrix Populations due to Single-variable Perturbations, with Application to Resting State fMRI

Authors: Itamar Faran, Michael Peer, Shahar Arzy, Yuval Benjamini

Abstract: Correlation matrices provide a useful way to characterize variable dependencies in many real-world problems. Often, a perturbation in a few variables can lead to small differences in multiple correlation coefficients related to these variables. In this paper we propose a low-dimensional representation of these differences as a product of single-variable perturbations that can efficiently characterize such effects; we develop methods for point estimation, confidence intervals and hypothesis tests for this model. Importantly, our methods are tailored for comparing samples of correlation matrices, in that they account for both the inherent variability in correlation matrices and for the variation between matrices in each sample. In simulations, our model shows a substantial increase in power compared to mass univariate approaches. As a test case, we analyze correlation matrices of resting state functional-MRI (RS-fMRI) in patients with a rare neurological condition, transient global amnesia (TGA), and healthy controls. TGA is characterized by a lesion to a specific brain area and the connectivity matrices supposedly represent changes in only a few variables, as in the assumption of our model. In this dataset, our model identifies substantially decreased synchronization in several brain regions within the patient population, which could not be detected using previous methods without prior knowledge.
Our framework shows the advantage of adding informed mean-structure for detecting differences in high-dimensional correlation matrices, and can be adapted for new differential structures. Our methods are available in the open-source package corrpops.

Submitted 14 November, 2021; originally announced November 2021.

arXiv:2010.15011 [pdf, other]

Predicting Classification Accuracy When Adding New Unobserved Classes

Authors: Yuli Slavutsky, Yuval Benjamini

Abstract: Multiclass classifiers are often designed and evaluated only on a sample from the classes on which they will eventually be applied. Hence, their final accuracy remains unknown. In this work we study how a classifier's performance over the initial class sample can be used to extrapolate its expected accuracy on a larger, unobserved set of classes. For this, we define a measure of separation between correct and incorrect classes that is independent of the number of classes: the "reversed ROC" (rROC), which is obtained by replacing the roles of classes and data-points in the common ROC. We show that the classification accuracy is a function of the rROC in multiclass classifiers, for which the learned representation of data from the initial class sample remains unchanged when new classes are added. Using these results we formulate a robust neural-network-based algorithm, "CleaneX", which learns to estimate the accuracy of such classifiers on arbitrarily large sets of classes. Unlike previous methods, our method uses both the observed accuracies of the classifier and densities of classification scores, and therefore achieves remarkably better predictions than current state-of-the-art methods on both simulations and real datasets of object detection, face recognition, and brain decoding.

Submitted 9 March, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

Journal ref: International Conference on Learning Representations (ICLR), 2021

arXiv:2006.11585 [pdf]

Ignored evident multiplicity harms replicability -- adjusting for it offers a remedy

Authors: Yoav Zeevi, Sofi Astashenko, Yoav Benjamini

Abstract: It is a central dogma in science that a result of a study should be replicable. Only 90 of the 190 replication attempts were successful. We attribute a substantial part of the problem to selective inference evident in the paper, which is the practice of selecting some of the results from the many. 100 papers in the Reproducibility Project in Psychology were analyzed. It was evident that the reporting of many results is common (77.7 per paper on average). It was further found that the selection from those multiple results is not adjusted for. We propose to account for selection using the hierarchical false discovery rate (FDR) controlling procedure TreeBH of Bogomolov et al. (2020), which exploits hierarchical structures to gain power. Results that were statistically significant after adjustment were 97% of the replicable results (31 of 32). Additionally, only 1 of the 21 non-significant results after adjustment was replicated. Given the easy deployment of adjustment tools and the minor loss of power involved, we argue that addressing multiplicity is an essential missing component in experimental psychology. It should become a required component in the arsenal of replicability enhancing methodologies in the field.

Submitted 19 May, 2021; v1 submitted 20 June, 2020; originally announced June 2020.

Comments: 28 pages, 2 figures, 1 table

arXiv:1912.10472 [pdf, other]

Testing the equality of multivariate means when $p>n$ by combining the Hotelling and Simes tests

Authors: Tzviel Frostig, Yoav Benjamini

Abstract: We propose a method of testing the shift between mean vectors of two multivariate Gaussian random variables in a high-dimensional setting incorporating the possible dependency and allowing $p > n$. This method is a combination of two well-known tests: the Hotelling test and the Simes test. The tests are integrated by sampling several dimensions at each iteration, testing each using the Hotelling test, and combining their results using the Simes test. We prove that this procedure is valid asymptotically. This procedure can be extended to handle non-equal covariance matrices by plugging in the appropriate extension of the Hotelling test. Using a simulation study, we show that the proposed test is advantageous over state-of-the-art tests in many scenarios and robust to violation of the Gaussian assumption.

Submitted 22 December, 2019; originally announced December 2019.
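
The combining step described in the abstract above can be illustrated with a minimal sketch of the standard Simes test. It assumes the per-subset Hotelling p-values have already been computed elsewhere; the function name `simes_p_value` is a hypothetical helper for illustration, and the sketch covers neither the dimension sampling nor the Hotelling test itself.

```python
def simes_p_value(p_values):
    """Combine p-values with the Simes test.

    The combined p-value is min over i of n * p_(i) / i, where
    p_(1) <= ... <= p_(n) are the sorted input p-values.
    The inputs are assumed to come from per-subset tests computed elsewhere.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    ordered = sorted(p_values)
    n = len(ordered)
    combined = min(n * p / rank for rank, p in enumerate(ordered, start=1))
    return min(1.0, combined)
```

The global null is rejected at level alpha when the returned value is at most alpha.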

arXiv:1907.06856 [pdf, other]

Quantifying replicability and consistency in systematic reviews

Authors: Iman Jaljuli, Yoav Benjamini, Liat Shenhav, Orestis Panagiotou, Ruth Heller

Abstract: Systematic reviews of interventions are important tools for synthesizing evidence from multiple studies. They serve to increase power and improve precision, in the same way that larger studies can do, but also to establish the consistency of effects and replicability of results across studies which are not identical. In this work we suggest incorporating replicability analysis tools to quantify the consistency and conflict. These are offered both for the fixed-effect and for the random-effects meta-analyses. We motivate and demonstrate our approach and its implications by examples from systematic reviews from the Cochrane library, and offer a way to incorporate our suggestions in their standard reporting system.

Submitted 18 April, 2021; v1 submitted 16 July, 2019; originally announced July 2019.

arXiv:1906.00505 [pdf, other]

Confidence Intervals for Selected Parameters

Authors: Yoav Benjamini, Yotam Hechtlinger, Philip B. Stark

Abstract: Practical or scientific considerations often lead to selecting a subset of parameters as ``important.'' Inferences about those parameters often are based on the same data used to select them in the first place. That can make the reported uncertainties deceptively optimistic: confidence intervals that ignore selection generally have less than their nominal coverage probability. Controlling the probability that one or more intervals for selected parameters do not cover---the ``simultaneous over the selected'' (SoS) error rate---is crucial in many scientific problems. Intervals that control the SoS error rate can be constructed in ways that take advantage of knowledge of the selection rule. We construct SoS-controlling confidence intervals for parameters deemed the most ``important'' $k$ of $m$ shift parameters because they are estimated (by independent estimators) to be the largest. The new intervals improve substantially over Šidák intervals when $k$ is small compared to $m$, and approach the standard Bonferroni-corrected intervals when $k \approx m$. Standard, unadjusted confidence intervals for location parameters have the correct coverage probability for $k=1$, $m=2$ if, when the true parameters are zero, the estimators are exchangeable and symmetric.

Submitted 2 June, 2019; originally announced June 2019.

Comments: 36 pages, 11 figures
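
For readers unfamiliar with the two baselines named in the abstract above, the following sketch computes the per-interval miscoverage level each one uses when $m$ intervals must cover simultaneously with probability $1 - \alpha$. These are the textbook Bonferroni and Šidák corrections, not the SoS-controlling intervals the paper constructs; the function names are hypothetical.

```python
def bonferroni_level(alpha, m):
    """Per-interval miscoverage level under the Bonferroni correction."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return alpha / m


def sidak_level(alpha, m):
    """Per-interval miscoverage level under the Sidak correction.

    Exact for m independent intervals: (1 - level) ** m == 1 - alpha.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 - (1.0 - alpha) ** (1.0 / m)
```

The Šidák level is never smaller than the Bonferroni level, so Šidák intervals are slightly shorter when the estimators are independent.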

arXiv:1712.09713 [pdf, other]

Extrapolating Expected Accuracies for Large Multi-Class Problems

Authors: Charles Zheng, Rakesh Achanta, Yuval Benjamini

Abstract: The difficulty of multi-class classification generally increases with the number of classes. Using data from a subset of the classes, can we predict how well a classifier will scale with an increased number of classes? Under the assumptions that the classes are sampled identically and independently from a population, and that the classifier is based on independently learned scoring functions, we show that the expected accuracy when the classifier is trained on k classes is the (k-1)st moment of a certain distribution that can be estimated from data. We present an unbiased estimation method based on the theory, and demonstrate its application on a facial recognition example.

Submitted 27 December, 2017; originally announced December 2017.

Comments: Submitted to JMLR
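
The moment identity stated in the abstract above can be made concrete with a small sketch. It assumes that draws from the relevant distribution are already available as a list `u_samples` of values in [0, 1]; how those draws are obtained is the paper's estimation problem and is not shown. The plain sample moment below is an illustrative plug-in, not the paper's unbiased estimator, and the function name is hypothetical.

```python
def extrapolated_accuracy(u_samples, k):
    """Plug-in estimate of the expected accuracy with k classes.

    Uses the identity accuracy(k) = E[U ** (k - 1)], replacing the
    expectation with an average over the supplied draws of U.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not u_samples:
        raise ValueError("need at least one draw")
    return sum(u ** (k - 1) for u in u_samples) / len(u_samples)
```

Because each draw lies in [0, 1], the estimate is non-increasing in k, matching the observation that classification gets harder as classes are added.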

arXiv:1705.07529 [pdf, other]

Testing hypotheses on a tree: new error rates and controlling strategies

Authors: Marina Bogomolov, Christine B. Peterson, Yoav Benjamini, Chiara Sabatti

Abstract: We introduce a multiple testing procedure (TreeBH) which addresses the challenge of controlling error rates at multiple levels of resolution. Conceptually, we frame this problem as the selection of hypotheses which are organized hierarchically in a tree structure. We describe a fast algorithm for the proposed sequential procedure, and prove that it controls relevant error rates given certain assumptions on the dependence among the p-values. Through simulations, we demonstrate that TreeBH offers the desired guarantees under a range of dependency structures (including one similar to that encountered in genome-wide association studies) and that it has the potential of gaining power over alternative methods. We also introduce a modified version of TreeBH which we prove to control the relevant error rates under any dependency structure. We conclude with two case studies: we first analyze data collected as part of the Genotype-Tissue Expression (GTEx) project, which aims to characterize the genetic regulation of gene expression across multiple tissues in the human body, and secondly, data examining the relationship between the gut microbiome and colorectal cancer.

Submitted 23 October, 2018; v1 submitted 21 May, 2017; originally announced May 2017.

arXiv:1608.08873 [pdf, other]

doi 10.1093/biostatistics/kxz035

Better-Than-Chance Classification for Signal Detection

Authors: Jonathan D. Rosenblatt, Yuval Benjamini, Roee Gilron, Roy Mukamel, Jelle J. Goeman

Abstract: The estimated accuracy of a classifier is a random quantity with variability. A common practice in supervised machine learning is thus to test if the estimated accuracy is significantly better than chance level. This method of signal detection is particularly popular in neuroimaging and genetics. We provide evidence that using a classifier's accuracy as a test statistic can be an underpowered strategy for finding differences between populations, compared to a bona-fide statistical test. It is also computationally more demanding than a statistical test. Via simulation, we compare test statistics that are based on classification accuracy, to others based on multivariate test statistics. We find that the probability of detecting differences between two distributions is lower for accuracy based statistics. We examine several candidate causes for the low power of accuracy tests. These causes include: the discrete nature of the accuracy test statistic, the type of signal accuracy tests are designed to detect, their inefficient use of the data, and their regularization. When the purpose of the analysis is not signal detection, but rather, the evaluation of a particular classifier, we suggest several improvements to increase power. In particular, to replace V-fold cross validation with the Leave-One-Out Bootstrap.

Submitted 14 December, 2017; v1 submitted 31 August, 2016; originally announced August 2016.

arXiv:1606.05229 [pdf, other]

Estimating mutual information in high dimensions via classification error

Authors: Charles Y. Zheng, Yuval Benjamini

Abstract: Multivariate pattern analysis approaches in neuroimaging are fundamentally concerned with investigating the quantity and type of information processed by various regions of the human brain; typically, estimates of classification accuracy are used to quantify information. While an extensive and powerful library of methods can be applied to train and assess classifiers, it is not always clear how to use the resulting measures of classification performance to draw scientific conclusions: e.g. for the purpose of evaluating redundancy between brain regions. An additional confound for interpreting classification performance is the dependence of the error rate on the number and choice of distinct classes obtained for the classification task. In contrast, mutual information is a quantity defined independently of the experimental design, and has ideal properties for comparative analyses. Unfortunately, estimating the mutual information based on observations becomes statistically infeasible in high dimensions without some kind of assumption or prior. In this paper, we construct a novel classification-based estimator of mutual information based on high-dimensional asymptotics. We show that in a particular limiting regime, the mutual information is an invertible function of the expected $k$-class Bayes error. While the theory is based on a large-sample, high-dimensional limit, we demonstrate through simulations that our proposed estimator has superior performance to the alternatives in problems of moderate dimensionality.

Submitted 10 October, 2016; v1 submitted 16 June, 2016; originally announced June 2016.

arXiv:1606.05228 [pdf, other]

How many faces can be recognized? Performance extrapolation for multi-class classification

Authors: Charles Y. Zheng, Rakesh Achanta, Yuval Benjamini

Abstract: The difficulty of multi-class classification generally increases with the number of classes. Using data from a subset of the classes, can we predict how well a classifier will scale with an increased number of classes? Under the assumption that the classes are sampled exchangeably, and under the assumption that the classifier is generative (e.g. QDA or Naive Bayes), we show that the expected accuracy when the classifier is trained on $k$ classes is the $k-1$st moment of a \emph{conditional accuracy distribution}, which can be estimated from data. This provides the theoretical foundation for performance extrapolation based on pseudolikelihood, unbiased estimation, and high-dimensional asymptotics. We investigate the robustness of our methods to non-generative classifiers in simulations and one optical character recognition example.

Submitted 16 June, 2016; originally announced June 2016.

Comments: Submitted to NIPS 2016

arXiv:1507.07270 [pdf]

Searching for behavioral homologies: Shared generative rules for expansion and narrowing down of the locomotor repertoire in Arthropods and Vertebrates

Authors: A. Gomez-Marin, E. Oron, A. Gakamsky, D. Valente, Y. Benjamini, I. Golani

Abstract: We use immobility as an origin and reference for the measurement of locomotor behavior; speed, the direction of walking and the direction of facing as the three degrees of freedom shaping fly locomotor behavior, and cocaine as the parameter inducing a progressive transition in and out of immobility. In this way we expose and quantify the generative rules that shape fruit fly locomotor behavior, which consist of a gradual narrowing down of the fly's locomotor freedom of movement during the transition into immobility and a precisely opposite expansion of freedom during the transition from immobility to normal behavior. The same generative rules of narrowing down and expansion apply to vertebrate behavior in a variety of contexts. Recent claims for deep homology between the vertebrate basal ganglia and the arthropod central complex, and neurochemical processes explaining the expansion of locomotor behavior in vertebrates could guide the search for equivalent neurochemical processes that mediate locomotor narrowing down and expansion in arthropods. We argue that a methodology for isolating relevant measures and quantifying generative rules having a potential for discovering candidate behavioral homologies is already available and we specify some of its essential features.

Submitted 26 July, 2015; originally announced July 2015.

arXiv:1506.08391 [pdf]

doi 10.1371/journal.pone.0140207

Coping with Space Neophobia in Drosophila melanogaster: The Asymmetric Dynamics of Crossing a Doorway to the Untrodden

Authors: Shay Cohen, Yoav Benjamini, Ilan Golani

Abstract: Insects exhibit remarkable cognitive skills in the field and several cognitive abilities have been demonstrated in Drosophila in the laboratory. By devising an ethologically relevant experimental setup that also allows comparison of behavior across remote taxonomic groups we sought to reduce the gap between the field and the laboratory, and reveal as yet undiscovered ethological phenomena within a wider phylogenetic perspective. We tracked individual flies that eclosed in a small (45mm) arena containing a piece of fruit, connected to a larger (130mm) arena by a wide (5mm) doorway. Using this setup we show that the widely open doorway initially functions as a barrier: the likelihood of entering the large arena increases gradually, requiring repeated approaches to the doorway, and even after entering the flies immediately return. Gradually the flies acquire the option to avoid returning, spending more relative time and performing relatively longer excursions in the large arena. The entire process may take up to three successive days. This behavior constitutes coping with space neophobia, the avoidance of untrodden space. It appears to be the same as the neophobic doorway-crossing reported in mouse models of anxiety. In both mice and flies the moment-to-moment developmental dynamics of transition between trodden and untrodden terrain appear to be the same, and in mice it is taken to imply memory and, therefore, cognition.
Recent claims have been made for a deep homology between the arthropod central complex and the vertebrate basal ganglia, two structures involved in navigation. The shared dynamics of space occupancy in flies and mice might indicate the existence of cognitive exploration also in the flies or else a convergent structure exhibiting the same developmental dynamics.

Submitted 28 June, 2015; originally announced June 2015.

arXiv:1504.00701 [pdf, other]

Many Phenotypes without Many False Discoveries: Error Controlling Strategies for Multi-Traits Association Studies

Authors: Christine Peterson, Marina Bogomolov, Yoav Benjamini, Chiara Sabatti

Abstract: The genetic basis of multiple phenotypes such as gene expression, metabolite levels, or imaging features is often investigated by testing a large collection of hypotheses, probing the existence of association between each of the traits and hundreds of thousands of genotyped variants. Appropriate multiplicity adjustment is crucial to guarantee replicability of findings, and False Discovery Rate (FDR) is frequently adopted as a measure of global error. In the interest of interpretability, results are often summarized so that reporting focuses on variants discovered to be associated to some phenotypes. We show that applying FDR-controlling procedures on the entire collection of hypotheses fails to control the rate of false discovery of associated variants as well as the average rate of false discovery of phenotypes influenced by such variants. We propose a simple hierarchical testing procedure which allows control of both these error rates and provides a more reliable basis for the identification of variants with functional effects. We demonstrate the utility of this approach through simulation studies comparing various error rates and measures of power for genetic association studies of multiple traits. Finally, we apply the proposed method to identify genetic variants which impact flowering phenotypes in Arabidopsis thaliana, expanding the set of discoveries.

Submitted 2 April, 2015; originally announced April 2015.

arXiv:1503.02278 [pdf, ps, other]

Testing for replicability in a follow-up study when the primary study hypotheses are two-sided

Authors: Ruth Heller, Marina Bogomolov, Yoav Benjamini, Tamar Sofer

Abstract: When testing for replication of results from a primary study with two-sided hypotheses in a follow-up study, we are usually interested in discovering the features with discoveries in the same direction in the two studies. The direction of testing in the follow-up study for each feature can therefore be decided by the primary study. We prove that in this case the methods suggested in Heller, Bogomolov, and Benjamini (2014) for control over false replicability claims are valid. Specifically, we prove that if we input into the procedures in Heller, Bogomolov, and Benjamini (2014) the one-sided p-values in the directions favoured by the primary study, then we achieve directional control over the desired error measure (family-wise error rate or false discovery rate).

Submitted 8 March, 2015; originally announced March 2015.

Comments: arXiv admin note: text overlap with arXiv:1310.0606

arXiv:1502.00088 [pdf, other]

Quantifying replicability in systematic reviews: the r-value

Authors: Liat Shenhav, Ruth Heller, Yoav Benjamini

Abstract: In order to assess the effect of a health care intervention, it is useful to look at an ensemble of relevant studies. The Cochrane Collaboration's admirable goal is to provide systematic reviews of all relevant clinical studies, in order to establish whether or not there is conclusive evidence about a specific intervention. This is done mainly by conducting a meta-analysis: a statistical synthesis of results from a series of systematically collected studies. Health practitioners often interpret a significant meta-analysis summary effect as a statement that the treatment effect is consistent across a series of studies. However, the meta-analysis significance may be driven by an effect in only one of the studies. Indeed, in an analysis of two domains of Cochrane reviews we show that in a non-negligible fraction of reviews, the removal of a single study from the meta-analysis of primary endpoints makes the conclusion non-significant. Therefore, reporting the evidence towards replicability of the effect across studies in addition to the significant meta-analysis summary effect will provide credibility to the interpretation that the effect was replicated across studies. We suggest an objective, easily computed quantity, which we term the r-value, that quantifies the extent of this reliance on single studies. We suggest adding the r-values to the main results and to the forest plots of systematic reviews.

Submitted 10 May, 2015; v1 submitted 31 January, 2015; originally announced February 2015.

arXiv:1412.3242 [pdf, other]

Selective Correlations - the conditional estimators

Authors: Yoav Benjamini, Amit Meir

Abstract: The problem of Voodoo correlations is recognized in neuroimaging as the problem of estimating quantities of interest from the same data that was used to select them as interesting. In statistical terminology, the problem of inference following selection from the same data is that of selective inference. Motivated by the unwelcome side-effects of the recommended remedy, splitting the data, a method for constructing confidence intervals based on the correct post-selection distribution of the observations has been suggested recently. We utilize a similar approach in order to provide point estimates that account for a large part of the selection bias. We show via extensive simulations that the proposed estimator has favorable properties, namely, that it is likely to reduce estimation bias and the mean squared error compared to the direct estimator without sacrificing power to detect non-zero correlation as in the case of the data splitting approach. We show that both point estimates and confidence intervals are needed in order to get a full assessment of the uncertainty in the point estimates as both are integrated into the Confidence Calibration Plots proposed recently. The computation of the estimators is implemented in an accompanying software package.

Submitted 10 December, 2014; originally announced December 2014.

Comments: 18 pages, 10 figures

arXiv:1401.2722 [pdf, ps, other]

doi 10.1214/13-AOAS681

The shuffle estimator for explainable variance in fMRI experiments

Authors: Yuval Benjamini, Bin Yu

Abstract: In computational neuroscience, it is important to estimate well the proportion of signal variance in the total variance of neural activity measurements. This explainable variance measure helps neuroscientists assess the adequacy of predictive models that describe how images are encoded in the brain. Complicating the estimation problem are strong noise correlations, which may confound the neural responses corresponding to the stimuli. If not properly taken into account, the correlations could inflate the explainable variance estimates and suggest false possible prediction accuracies. We propose a novel method to estimate the explainable variance in functional MRI (fMRI) brain activity measurements when there are strong correlations in the noise. Our shuffle estimator is nonparametric, unbiased, and built upon the random effect model reflecting the randomization in the fMRI data collection process. Leveraging symmetries in the measurements, our estimator is obtained by appropriately permuting the measurement vector in such a way that the noise covariance structure is intact but the explainable variance is changed after the permutation. This difference is then used to estimate the explainable variance. We validate the properties of the proposed method in simulation experiments. For the image-fMRI data, we show that the shuffle estimates can explain the variation in prediction accuracy for voxels within the primary visual cortex (V1) better than alternative parametric methods.

Submitted 13 January, 2014; originally announced January 2014.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/13-AOAS681

Report number: IMS-AOAS-AOAS681

Journal ref: Annals of Applied Statistics 2013, Vol. 7, No. 4, 2007-2033

arXiv:1311.5354 [pdf, ps, other]

doi 10.1080/00031305.2017.1360795

Another Argument in Favour of Wilcoxon's Signed Rank Test

Authors: Jonathan Rosenblatt, Yoav Benjamini

Abstract: The Wilcoxon Signed Rank test is typically called upon when testing whether a symmetric distribution has a specified centre and the Gaussianity is in question. As with all insurance policies it comes with a cost, even if small, in terms of power versus a t-test, when the distribution is indeed Gaussian. In this note we further show that even when the distribution tested is Gaussian there need not be power loss at all, if the alternative is of a mixture type rather than a shift. The signed rank test may turn out to be more powerful than the t-test, and the supposedly conservative strategy might actually be the more powerful one. Drug testing and functional magnetic imaging are two such scenarios. Wilcoxon's signed rank test will typically be called upon by a researcher when testing for the location of a single population, using a small sample, when Gaussianity is dubious. As with all insurance policies, it will come with a cost: power. It is well known that under a Gaussian setup, the signed rank test is less powerful than, say, a t-test. The works of Pitman and others have reassured us that this power loss is surprisingly small. In this note we argue that the power loss might actually be smaller than typically assumed. In particular, if the deviation from the null Gaussian distribution is of a mixture type and not a shift type, the signed rank test is no longer dominated by the t-test and can actually be more powerful.

Submitted 21 November, 2013; originally announced November 2013.

arXiv:1310.0606 [pdf, ps, other]

doi 10.1073/pnas.1314814111

Deciding whether follow-up studies have replicated findings in a preliminary large-scale "omics" study

Authors: Ruth Heller, Marina Bogomolov, Yoav Benjamini

Abstract: We propose a formal method to declare that findings from a primary study have been replicated in a follow-up study. Our proposal is appropriate for primary studies that involve large-scale searches for rare true positives (i.e. needles in a haystack). Our proposal assigns an $r$-value to each finding; this is the lowest false discovery rate at which the finding can be called replicated. Examples are given and software is available.

Submitted 10 June, 2014; v1 submitted 2 October, 2013; originally announced October 2013.

Journal ref: Proceedings of the National Academy of Sciences of the United States of America (PNAS), 2014 vol. 111 no. 46, 16262-16267

arXiv:1212.3436 [pdf, other]

doi 10.1016/j.neuroimage.2013.08.025

Revisiting Multi-Subject Random Effects in fMRI: Advocating Prevalence Estimation

Authors: Jonathan D. Rosenblatt, Matthijs Vink, Yoav Benjamini

Abstract: Random Effects analysis has been introduced into fMRI research in order to generalize findings from the study group to the whole population. Generalizing findings is obviously harder than detecting activation in the study group since in order to be significant, an activation has to be larger than the inter-subject variability. Indeed, detected regions are smaller when using random effect analysis versus fixed effects. The statistical assumptions behind the classic random effects model are that the effect in each location is normally distributed over subjects, and "activation" refers to a non-null mean effect. We argue this model is unrealistic compared to the true population variability, where, due to functional plasticity and registration anomalies, at each brain location some of the subjects are active and some are not. We propose a finite-Gaussian-mixture random-effect model that amortizes between-subject spatial disagreement and quantifies it using the "prevalence" of activation at each location. This measure has several desirable properties: (a) It is more informative than the typical active/inactive paradigm. (b) In contrast to the hypothesis testing approach (thus t-maps), whose null hypotheses are trivially rejected for large sample sizes, the larger the sample size, the more informative the prevalence statistic becomes. In this work we present a formal definition and an estimation procedure of this prevalence. The end result of the proposed analysis is a map of the prevalence at locations with significant activation, highlighting activation regions that are common over many brains.

Submitted 31 March, 2013; v1 submitted 14 December, 2012; originally announced December 2012.

arXiv:1106.3670 [pdf, ps, other]

Adjusting for selection bias in testing multiple families of hypotheses

Authors: Yoav Benjamini, Marina Bogomolov

Abstract: In many large multiple testing problems the hypotheses are divided into families. Given the data, families with evidence for true discoveries are selected, and hypotheses within them are tested. Neither controlling the error-rate in each family separately nor controlling the error-rate over all hypotheses together can assure that an error-rate is controlled in the selected families. We formulate this concern about selective inference in its generality, for a very wide class of error-rates and for any selection criterion, and present an adjustment of the testing level inside the selected families that retains the average error-rate over the selected families.

Submitted 18 June, 2011; originally announced June 2011.

arXiv:1011.1987 [pdf, ps, other]

doi 10.1214/09-AOAS304

High-throughput data analysis in behavior genetics

Authors: Anat Sakov, Ilan Golani, Dina Lipkind, Yoav Benjamini

Abstract: In recent years, a growing need has arisen in different fields for the development of computational systems for automated analysis of large amounts of data (high-throughput). Dealing with nonstandard noise structure and outliers, that could have been detected and corrected in manual analysis, must now be built into the system with the aid of robust methods. We discuss such problems and present insights and solutions in the context of behavior genetics, where data consists of a time series of locations of a mouse in a circular arena. In order to estimate the location, velocity and acceleration of the mouse, and identify stops, we use a nonstandard mix of robust and resistant methods: LOWESS and repeated running median. In addition, we argue that protection against small deviations from experimental protocols can be handled automatically using statistical methods. In our case, it is of biological interest to measure a rodent's distance from the arena's wall, but this measure is corrupted if the arena is not a perfect circle, as required in the protocol. The problem is addressed by estimating robustly the actual boundary of the arena and its center using a nonparametric regression quantile of the behavioral data, with the aid of a fast algorithm developed for that purpose.

Submitted 9 November, 2010; originally announced November 2010.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/09-AOAS304

Report number: IMS-AOAS-AOAS304

Journal ref: Annals of Applied Statistics 2010, Vol. 4, No. 2, 743-763

arXiv:0905.2819 [pdf, ps, other]

doi 10.1214/08-AOAS194

A simple forward selection procedure based on false discovery rate control

Authors: Yoav Benjamini, Yulia Gavrilov

Abstract: We propose the use of a new false discovery rate (FDR) controlling procedure as a model selection penalized method, and compare its performance to that of other penalized methods over a wide range of realistic settings: nonorthogonal design matrices, moderate and large pool of explanatory variables, and both sparse and nonsparse models, in the sense that they may include a small and large fraction of the potential variables (and even all). The comparison is done by a comprehensive simulation study, using a quantitative framework for performance comparisons in the form of empirical minimaxity relative to a "random oracle": the oracle model selection performance on data dependent forward selected family of potential models. We show that FDR based procedures have good performance, and in particular the newly proposed method emerges as having empirical minimax performance. Interestingly, using FDR level of 0.05 is a global best.

Submitted 18 May, 2009; originally announced May 2009.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/08-AOAS194

Report number: IMS-AOAS-AOAS194

Journal ref: Annals of Applied Statistics 2009, Vol. 3, No. 1, 179-198

arXiv:0903.5373 [pdf, ps, other]

doi 10.1214/07-AOS586

An adaptive step-down procedure with proven FDR control under independence

Authors: Yulia Gavrilov, Yoav Benjamini, Sanat K. Sarkar

Abstract: In this work we study an adaptive step-down procedure for testing $m$ hypotheses. It stems from the repeated use of the false discovery rate controlling linear step-up procedure (sometimes called BH), and makes use of the critical constants $iq/[m+1-i(1-q)]$, $i=1,...,m$. Motivated by its success as a model selection procedure, as well as by its asymptotic optimality, we are interested in its false discovery rate (FDR) controlling properties for a finite number of hypotheses. We prove this step-down procedure controls the FDR at level $q$ for independent test statistics. We then numerically compare it with two other procedures with proven FDR control under independence, both in terms of power under independence and FDR control under positive dependence.

Submitted 31 March, 2009; originally announced March 2009.

Comments: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/07-AOS586

Report number: IMS-AOS-AOS586

MSC Class: 62J15 (Primary)

Journal ref: Annals of Statistics 2009, Vol. 37, No. 2, 619-629
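
The step-down rule described in the abstract above can be illustrated with a minimal sketch. It assumes the usual step-down convention (walk through the sorted p-values from the smallest, reject while $p_{(i)} \le iq/[m+1-i(1-q)]$, and stop at the first failure); the function name `adaptive_step_down` and the example p-values are hypothetical and are not taken from the paper or its software.

```python
def adaptive_step_down(p_values, q):
    """Sketch of a step-down test with critical constants i*q / (m + 1 - i*(1 - q)).

    Returns a list of booleans, in the input order, marking rejected hypotheses.
    The function name and interface are illustrative assumptions.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda j: p_values[j])
    rejected = [False] * m
    for i, idx in enumerate(order, start=1):
        critical = i * q / (m + 1 - i * (1 - q))
        if p_values[idx] > critical:
            break  # step-down: stop at the first p-value above its constant
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values, for illustration only.
    print(adaptive_step_down([0.001, 0.5, 0.02, 0.9], q=0.05))
```

Because the rule steps down, a smallest p-value that exceeds the first constant $q/(m+q)$ prevents every rejection, even if later p-values fall below their own constants.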

arXiv:0808.0582 [pdf, ps, other]

doi 10.1214/07-STS236B

Comment: Microarrays, Empirical Bayes and the Two-Groups Model

Authors: Yoav Benjamini

Abstract: Comment on ``Microarrays, Empirical Bayes and the Two-Groups Model'' [arXiv:0808.0572]

Submitted 5 August, 2008; originally announced August 2008.

Comments: Published in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/07-STS236B

Report number: IMS-STS-STS236B

Journal ref: Statistical Science 2008, Vol. 23, No. 1, 23-28

arXiv:math/0505374 [pdf, ps, other]

Adapting to Unknown Sparsity by controlling the False Discovery Rate

Authors: Felix Abramovich, Yoav Benjamini, David L. Donoho, Iain M. Johnstone

Abstract: We attempt to recover an $n$-dimensional vector observed in white noise, where $n$ is large and the vector is known to be sparse, but the degree of sparsity is unknown. We consider three different ways of defining sparsity of a vector: using the fraction of nonzero terms; imposing power-law decay bounds on the ordered entries; and controlling the $\ell_p$ norm for $p$ small. We obtain a procedure which is asymptotically minimax for $\ell^r$ loss, simultaneously throughout a range of such sparsity classes. The optimal procedure is a data-adaptive thresholding scheme, driven by control of the False Discovery Rate (FDR). FDR control is a relatively recent innovation in simultaneous testing, ensuring that at most a certain fraction of the rejected null hypotheses will correspond to false rejections. In our treatment, the FDR control parameter $q_n$ also plays a determining role in asymptotic minimaxity. If $q = \lim q_n \in [0,1/2]$ and also $q_n > \gamma/\log(n)$ we get sharp asymptotic minimaxity, simultaneously, over a wide range of sparse parameter spaces and loss functions. On the other hand, $q = \lim q_n \in (1/2,1]$ forces the risk to exceed the minimax risk by a factor growing with $q$. To our knowledge, this relation between ideas in simultaneous inference and asymptotic decision theory is new. Our work provides a new perspective on a class of model selection rules which has been introduced recently by several authors. These new rules impose complexity penalization of the form $2 \cdot \log(\text{potential model size} / \text{actual model size})$. We exhibit a close connection with FDR-controlling procedures under stringent control of the false discovery rate.

Submitted 18 May, 2005; originally announced May 2005.

Comments: This is a complete version of a paper to appear in Annals of Statistics. The paper in AoS has certain proofs abbreviated that are given here in detail.

MSC Class: 62F10; 62G12

Showing 1–31 of 31 results for author: Benjamini, Y