Search | arXiv e-print repository

When Knockoffs fail: diagnosing and fixing non-exchangeability of Knockoffs

Authors: Alexandre Blain, Bertrand Thirion, Julia Linhart, Pierre Neuvial

Abstract: Knockoffs are a popular statistical framework that addresses the challenging problem of conditional variable selection in high-dimensional settings with statistical control. Such statistical control is essential for the reliability of inference. However, knockoff guarantees rely on an exchangeability assumption that is difficult to test in practice, and there is little discussion in the literature on how to deal with unfulfilled hypotheses. This assumption is related to the ability to generate data similar to the observed data. To maintain reliable inference, we introduce a diagnostic tool based on Classifier Two-Sample Tests. Using simulations and real data, we show that violations of this assumption occur in common settings for classical Knockoffs generators, especially when the data have a strong dependence structure. We show that the diagnostic tool correctly detects such behavior. To fix knockoff generation, we propose a nonparametric, computationally-efficient alternative knockoff construction, which is based on constructing a predictor of each variable based on all others. We show that this approach achieves asymptotic exchangeability with the original variables under standard assumptions on the predictive model. We show empirically that the proposed approach restores error control on simulated data.

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2310.11822 [pdf, other]

Post-clustering Inference under Dependency

Authors: Javier González-Delgado, Juan Cortés, Pierre Neuvial

Abstract: Recent work by Gao et al. has laid the foundations for post-clustering inference. For the first time, the authors established a theoretical framework for testing differences between the means of estimated clusters. Additionally, they studied the estimation of unknown parameters while controlling the selective type I error. However, their theory was developed for independent observations identically distributed as $p$-dimensional Gaussian variables with a spherical covariance matrix. Here, we aim at extending this framework to a more convenient scenario for practical applications, where arbitrary dependence structures between observations and features are allowed. We show that a $p$-value for post-clustering inference under general dependency can be defined, and we assess the theoretical conditions that allow the compatible estimation of a covariance matrix. The theory is developed for hierarchical agglomerative clustering algorithms with several types of linkages, and for the $k$-means algorithm. We illustrate our method with synthetic data and real data of protein structures.

Submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.10373 [pdf, other]

False Discovery Proportion control for aggregated Knockoffs

Authors: Alexandre Blain, Bertrand Thirion, Olivier Grisel, Pierre Neuvial

Abstract: Controlled variable selection is an important analytical step in various scientific fields, such as brain imaging or genomics. In these high-dimensional data settings, considering too many variables leads to poor models and high costs, hence the need for statistical guarantees on false positives. Knockoffs are a popular statistical tool for conditional variable selection in high dimension. However, they control for the expected proportion of false discoveries (FDR) and not their actual proportion (FDP). We present a new method, KOPI, that controls the proportion of false discoveries for Knockoff-based inference. The proposed method also relies on a new type of aggregation to address the undesirable randomness associated with classical Knockoff inference. We demonstrate FDP control and substantial power gains over existing Knockoff-based methods in various simulation settings and achieve good sensitivity/specificity tradeoffs on brain imaging and genomic data.

Submitted 16 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023

arXiv:2309.01492 [pdf, other]

Selective inference after convex clustering with $\ell_1$ penalization

Authors: François Bachoc, Cathy Maugis-Rabusseau, Pierre Neuvial

Abstract: Classical inference methods notoriously fail when applied to data-driven test hypotheses or inference targets. Instead, dedicated methodologies are required to obtain statistical guarantees for these selective inference problems. Selective inference is particularly relevant post-clustering, typically when testing a difference in mean between two clusters. In this paper, we address convex clustering with $\ell_1$ penalization, by leveraging related selective inference tools for regression, based on Gaussian vectors conditioned to polyhedral sets. In the one-dimensional case, we prove a polyhedral characterization of obtaining given clusters, that enables us to suggest a test procedure with statistical guarantees. This characterization also allows us to provide a computationally efficient regularization path algorithm. Then, we extend the above test procedure and guarantees to multi-dimensional clustering with $\ell_1$ penalization, and also to more general multi-dimensional clusterings that aggregate one-dimensional ones. With various numerical experiments, we validate our statistical guarantees and we demonstrate the power of our methods to detect differences in mean between clusters. Our methods are implemented in the R package poclin.

Submitted 4 September, 2023; originally announced September 2023.

Comments: 40 pages, 8 figures

MSC Class: 62F03; 62H30

arXiv:2208.13724 [pdf, other]

FDP control in multivariate linear models using the bootstrap

Authors: Samuel Davenport, Bertrand Thirion, Pierre Neuvial

Abstract: In this article we develop a method for performing post hoc inference of the False Discovery Proportion (FDP) over multiple contrasts of interest in the multivariate linear model. To do so we use the bootstrap to simulate from the distribution of the null contrasts. We combine the bootstrap with the post hoc inference bounds of Blanchard (2020) and prove that doing so provides simultaneous asymptotic control of the FDP over all subsets of hypotheses. This requires us to demonstrate consistency of the multivariate bootstrap in the linear model, which we do via the Lindeberg Central Limit Theorem, providing a simpler proof of this result than that of Eck (2018). We demonstrate, via simulations, that our approach provides simultaneous control of the FDP over all subsets and is typically more powerful than existing state-of-the-art parametric methods. We illustrate our approach on functional Magnetic Resonance Imaging data from the Human Connectome Project and on a transcriptomic dataset of chronic obstructive pulmonary disease.

Submitted 20 September, 2022; v1 submitted 29 August, 2022; originally announced August 2022.

arXiv:2204.10572 [pdf, other]

doi 10.1016/j.neuroimage.2022.119492

Notip: Non-parametric True Discovery Proportion control for brain imaging

Authors: Alexandre Blain, Bertrand Thirion, Pierre Neuvial

Abstract: Cluster-level inference procedures are widely used for brain mapping. These methods compare the size of clusters obtained by thresholding brain maps to an upper bound under the global null hypothesis, computed using Random Field Theory or permutations. However, the guarantees obtained by this type of inference (i.e., at least one voxel is truly activated in the cluster) are not informative with regard to the strength of the signal therein. There is thus a need for methods to assess the amount of signal within clusters; yet such methods have to take into account that clusters are defined based on the data, which creates circularity in the inference scheme. This has motivated the use of post hoc estimates that allow statistically valid estimation of the proportion of activated voxels in clusters. In the context of fMRI data, the All-Resolutions Inference framework introduced in [25] provides post hoc estimates of the proportion of activated voxels. However, this method relies on parametric threshold families, which results in conservative inference. In this paper, we leverage randomization methods to adapt to data characteristics and obtain tighter false discovery control. We obtain Notip, for Non-parametric True Discovery Proportion control: a powerful, non-parametric method that yields statistically valid guarantees on the proportion of activated voxels in data-derived clusters. Numerical experiments demonstrate substantial gains in number of detections compared with state-of-the-art methods on 36 fMRI datasets. The conditions under which the proposed method brings benefits are also discussed.

Submitted 21 July, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

Comments: NeuroImage (2022)

Journal ref: NeuroImage (2022), 119492

arXiv:2108.00165 [pdf, other]

doi 10.1214/23-EJS2135

Two-sample goodness-of-fit tests on the flat torus based on Wasserstein distance and their relevance to structural biology

Authors: Javier González-Delgado, Alberto González-Sanz, Juan Cortés, Pierre Neuvial

Abstract: This work is motivated by the study of local protein structure, which is defined by two variable dihedral angles that take values from probability distributions on the flat torus. Our goal is to provide the space $\mathcal{P}(\mathbb{R}^2/\mathbb{Z}^2)$ with a metric that quantifies local structural modifications due to changes in the protein sequence, and to define associated two-sample goodness-of-fit testing approaches. Due to its adaptability to the space geometry, we focus on the Wasserstein distance as a metric between distributions. We extend existing results of the theory of Optimal Transport to the $d$-dimensional flat torus $\mathbb{T}^d=\mathbb{R}^d/\mathbb{Z}^d$, in particular a Central Limit Theorem. Moreover, we propose different approaches for two-sample goodness-of-fit testing for the one- and two-dimensional cases, based on the Wasserstein distance. We prove their validity and consistency. We provide an implementation of these tests in R. Their performance is assessed by numerical experiments on synthetic data and illustrated by an application to protein structure data.

Submitted 11 September, 2023; v1 submitted 31 July, 2021; originally announced August 2021.

Journal ref: J. González-Delgado, A. González-Sanz, J. Cortés, P. Neuvial. Two-sample goodness-of-fit tests on the flat torus based on Wasserstein distance and their relevance to structural biology. Electron. J. Statist., 17(1) 1547-1586, 2023

arXiv:2105.00288 [pdf, other]

Post hoc false discovery proportion inference under a Hidden Markov Model

Authors: Marie Perrot-Dockès, Gilles Blanchard, Pierre Neuvial, Etienne Roquain

Abstract: We address the multiple testing problem under the assumption that the true/false hypotheses are driven by a Hidden Markov Model (HMM), which is recognized as a fundamental setting to model multiple testing under dependence since the seminal work of Sun and Cai (2009). While previous work has concentrated on deriving specific procedures with a controlled False Discovery Rate (FDR) under this model, following a recent trend in selective inference, we consider the problem of establishing confidence bounds on the false discovery proportion (FDP), for a user-selected set of hypotheses that can depend on the observed data in an arbitrary way. We develop a methodology to construct such confidence bounds first when the HMM model is known, then when its parameters are unknown and estimated, including the data distribution under the null and the alternative, using a nonparametric approach. In the latter case, we propose a bootstrap-based methodology to take into account the effect of parameter estimation error. We show that taking advantage of the assumed HMM structure allows for a substantial improvement of confidence bound sharpness over existing agnostic (structure-free) methods, as witnessed both via numerical experiments and real data examples.

Submitted 1 May, 2021; originally announced May 2021.

arXiv:2004.08312 [pdf, other]

doi 10.29007/v7qj

Identification of deregulated transcription factors involved in subtypes of cancers

Authors: Magali Champion, Julien Chiquet, Pierre Neuvial, Mohamed Elati, François Radvanyi, Etienne Birmelé

Abstract: We propose a methodology for the identification of transcription factors involved in the deregulation of genes in tumoral cells. This strategy is based on the inference of a reference gene regulatory network that connects transcription factors to their downstream targets using gene expression data. The behavior of genes in tumor samples is then carefully compared to this network of reference to detect deregulated target genes. A linear model is finally used to measure the ability of each transcription factor to explain those deregulations. We assess the performance of our method by numerical experiments on a breast cancer data set. We show that the information about deregulation is complementary to the expression data as the combination of the two improves the supervised classification performance of samples into cancer subtypes.

Submitted 17 April, 2020; originally announced April 2020.

Journal ref: Proceedings of the 12th International Conference on Bioinformatics and Computational Biology, vol. 70, pages 1-10

arXiv:1910.11575 [pdf, other]

On agnostic post hoc approaches to false positive control

Authors: Gilles Blanchard, Pierre Neuvial, Etienne Roquain

Abstract: This document is a book chapter which gives a partial survey on post hoc approaches to false positive control.

Submitted 25 October, 2019; originally announced October 2019.

arXiv:1909.10923 [pdf, other]

Applicability and Interpretability of Hierarchical Agglomerative Clustering With or Without Contiguity Constraints

Authors: Nathanaël Randriamihamison, Nathalie Vialaneix, Pierre Neuvial

Abstract: Hierarchical Agglomerative Classification (HAC) with Ward's linkage has been widely used since its introduction in Ward (1963). The present article reviews the different extensions of the method to various input data and the constrained framework, while providing applicability conditions. In addition, various versions of the graphical representation of the results as a dendrogram are also presented and their properties are clarified. While some of these results can sometimes be found in a heteroclite literature, we clarify and complete them all using a uniform background. In particular, this study reveals an important distinction between a consistency property of the dendrogram and the absence of crossover within it. Finally, a simulation study shows that the constrained version of HAC can sometimes provide more relevant results than its unconstrained version despite the fact that the latter optimizes the objective criterion on a reduced set of solutions at each step. Overall, the article provides comprehensive recommendations for the use of HAC and constrained HAC depending on the input data as well as for the representation of the results.

Submitted 24 September, 2019; originally announced September 2019.

arXiv:1902.01596 [pdf, other]

Adjacency-constrained hierarchical clustering of a band similarity matrix with application to Genomics

Authors: Christophe Ambroise, Alia Dehman, Pierre Neuvial, Guillem Rigaill, Nathalie Vialaneix

Abstract: Motivation: Genomic data analyses such as Genome-Wide Association Studies (GWAS) or Hi-C studies are often faced with the problem of partitioning chromosomes into successive regions based on a similarity matrix of high-resolution, locus-level measurements. An intuitive way of doing this is to perform a modified Hierarchical Agglomerative Clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) are allowed to be merged. A major practical drawback of this method is its quadratic time and space complexity in the number of loci, which is typically of the order of $10^4$ to $10^5$ for each chromosome. Results: By assuming that the similarity between physically distant objects is negligible, we propose an implementation of this adjacency-constrained HAC with quasi-linear complexity. Our illustrations on GWAS and Hi-C datasets demonstrate the relevance of this assumption, and show that this method highlights biologically meaningful signals. Thanks to its small time and memory footprint, the method can be run on a standard laptop in minutes or even seconds. Availability and Implementation: Software and sample data are available as an R package, adjclust, that can be downloaded from the Comprehensive R Archive Network (CRAN).

Submitted 5 February, 2019; originally announced February 2019.

arXiv:1807.01470 [pdf, other]

Post hoc false positive control for spatially structured hypotheses

Authors: Guillermo Durand, Gilles Blanchard, Pierre Neuvial, Etienne Roquain

Abstract: In a high-dimensional multiple testing framework, we present new confidence bounds on the false positives contained in subsets S of selected null hypotheses. The coverage probability holds simultaneously over all subsets S, which means that the obtained confidence bounds are post hoc. Therefore, S can be chosen arbitrarily, possibly by using the data set several times. We focus in this paper specifically on the case where the null hypotheses are spatially structured. Our method is based on recent advances in post hoc inference and particularly on the general methodology of Blanchard et al. (2017); we build confidence bounds for some pre-specified forest-structured subsets $\{R_k, k \in K\}$, called the reference family, and then we deduce a bound for any subset S by interpolation. The proposed bounds are shown to substantially improve on previous ones when the signal is locally structured. Our findings are supported both by theoretical results and numerical experiments. Moreover, we show that our bound can be obtained by a low-complexity algorithm, which makes our approach completely operational for practical use. The proposed bounds are implemented in the open-source R package sansSouci.

Submitted 19 September, 2018; v1 submitted 4 July, 2018; originally announced July 2018.

arXiv:1804.07566 [pdf, ps, other]

doi 10.1214/18-ejs1490

On the Post Selection Inference constant under Restricted Isometry Properties

Authors: François Bachoc, Gilles Blanchard, Pierre Neuvial

Abstract: Uniformly valid confidence intervals post model selection in regression can be constructed based on Post-Selection Inference (PoSI) constants. PoSI constants are minimal for orthogonal design matrices, and can be upper bounded as a function of the sparsity of the set of models under consideration, for generic design matrices. In order to improve on these generic sparse upper bounds, we consider design matrices satisfying a Restricted Isometry Property (RIP) condition. We provide a new upper bound on the PoSI constant in this setting. This upper bound is an explicit function of the RIP constant of the design matrix, thereby giving an interpolation between the orthogonal setting and the generic sparse setting. We show that this upper bound is asymptotically optimal in many settings by constructing a matching lower bound.

Submitted 22 November, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

Comments: Electronic Journal of Statistics, Shaker Heights, OH: Institute of Mathematical Statistics, 2018

arXiv:1703.02307 [pdf, other]

Post hoc inference via joint family-wise error rate control

Authors: Gilles Blanchard, Pierre Neuvial, Etienne Roquain

Abstract: We introduce a general methodology for post hoc inference in a large-scale multiple testing framework. The approach is called "user-agnostic" in the sense that the statistical guarantee on the number of correct rejections holds for any set of candidate items selected by the user (after having seen the data). This task is investigated by defining a suitable criterion, named the joint-family-wise-error rate (JER for short). We propose several procedures for controlling the JER, with a special focus on incorporating dependencies while adapting to the unknown quantity of signal (via a step-down approach). We show that our proposed setting incorporates as particular cases a version of the higher criticism as well as the closed testing based approach of Goeman and Solari (2011). Our theoretical statements are supported by numerical experiments.

Submitted 8 January, 2018; v1 submitted 7 March, 2017; originally announced March 2017.

arXiv:1505.05705 [pdf, other]

doi 10.1186/1752-0509-9-S6-S6

A model for gene deregulation detection using expression data

Authors: Thomas Picchetti, Julien Chiquet, Mohamed Elati, Pierre Neuvial, Rémy Nicolle, Etienne Birmelé

Abstract: In tumoral cells, gene regulation mechanisms are severely altered, and these modifications in the regulations may be characteristic of different subtypes of cancer. However, these alterations do not necessarily induce differential expression between the subtypes. To address this issue, we propose a statistical methodology to identify the misregulated genes given a reference network and gene expression data. Our model is based on a regulatory process in which all genes are allowed to be deregulated. We derive an EM algorithm where the hidden variables correspond to the status (under/over/normally expressed) of the genes and where the E-step is solved thanks to a message passing algorithm. Our procedure provides posterior probabilities of deregulation in a given sample for each gene. We assess the performance of our method by numerical experiments on simulations and on a bladder cancer data set.

Submitted 8 January, 2016; v1 submitted 21 May, 2015; originally announced May 2015.

Report number: MAP5 2015-17

arXiv:1402.7203 [pdf, other]

doi 10.1093/bib/bbu026

Performance evaluation of DNA copy number segmentation methods

Authors: Morgane Pierre-Jean, Guillem Rigaill, Pierre Neuvial

Abstract: A number of bioinformatic or biostatistical methods are available for analyzing DNA copy number profiles measured from microarray or sequencing technologies. In the absence of rich enough gold standard data sets, the performance of these methods is generally assessed using unrealistic simulation studies, or based on small real data analyses. We have designed and implemented a framework to generate realistic DNA copy number profiles of cancer samples with known truth. These profiles are generated by resampling real SNP microarray data from genomic regions with known copy-number state. The original real data have been extracted from dilution series of tumor cell lines with matched blood samples at several concentrations. Therefore, the signal-to-noise ratio of the generated profiles can be controlled through the (known) percentage of tumor cells in the sample. In this paper, we describe this framework and illustrate some of the benefits of the proposed data generation approach on a practical use case: a comparison study between methods for segmenting DNA copy number profiles from SNP microarrays. This study indicates that no single method is uniformly better than all others. It also helps identify pros and cons for the compared methods as a function of biologically informative parameters, such as the fraction of tumor cells in the sample and the proportion of heterozygous markers. Availability: R package jointSeg: http://r-forge.r-project.org/R/?group_id=1562

Submitted 5 November, 2015; v1 submitted 28 February, 2014; originally announced February 2014.

Journal ref: Briefings in Bioinformatics, Oxford University Press (OUP), 2015, 16 (4)

arXiv:1206.6980 [pdf, ps, other]

doi 10.1214/11-AOAS528

More power via graph-structured tests for differential expression of gene networks

Authors: Laurent Jacob, Pierre Neuvial, Sandrine Dudoit

Abstract: We consider multivariate two-sample tests of means, where the location shift between the two populations is expected to be related to a known graph structure. An important application of such tests is the detection of differentially expressed genes between two patient populations, as shifts in expression levels are expected to be coherent with the structure of graphs reflecting gene properties such as biological process, molecular function, regulation or metabolism. For a fixed graph of interest, we demonstrate that accounting for graph structure can yield more powerful tests under the assumption of smooth distribution shift on the graph. We also investigate the identification of nonhomogeneous subgraphs of a given large graph, which poses both computational and multiple hypothesis testing problems. The relevance and benefits of the proposed approach are illustrated on synthetic data and on breast and bladder cancer gene expression data analyzed in the context of KEGG and NCI pathways.

Submitted 29 June, 2012; originally announced June 2012.

Comments: Published at http://dx.doi.org/10.1214/11-AOAS528 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org). arXiv admin note: substantial text overlap with arXiv:1009.5173

Report number: IMS-AOAS-AOAS528

Journal ref: Annals of Applied Statistics 2012, Vol. 6, No. 2, 561-600

arXiv:1106.6147 [pdf, other]

doi 10.1214/12-AOS1042

On false discovery rate thresholding for classification under sparsity

Authors: Pierre Neuvial, Etienne Roquain

Abstract: We study the properties of false discovery rate (FDR) thresholding, viewed as a classification procedure. The "0"-class (null) is assumed to have a known density while the "1"-class (alternative) is obtained from the "0"-class either by translation or by scaling. Furthermore, the "1"-class is assumed to have a small number of elements w.r.t. the "0"-class (sparsity). We focus on densities of the Subbotin family, including Gaussian and Laplace models. Nonasymptotic oracle inequalities are derived for the excess risk of FDR thresholding. These inequalities lead to explicit rates of convergence of the excess risk to zero, as the number m of items to be classified tends to infinity and in a regime where the power of the Bayes rule is away from 0 and 1. Moreover, these theoretical investigations suggest an explicit choice for the target level $α_m$ of FDR thresholding, as a function of m. Our oracle inequalities show theoretically that the resulting FDR thresholding adapts to the unknown sparsity regime contained in the data. This property is illustrated with numerical experiments.

Submitted 5 March, 2013; v1 submitted 30 June, 2011; originally announced June 2011.

Journal ref: Annals of Statistics (2012) Vol. 40, No. 5, 2572-2600

arXiv:1009.5173 [pdf, ps, other]

doi 10.1214/11-AOAS528

Gains in Power from Structured Two-Sample Tests of Means on Graphs

Authors: Laurent Jacob, Pierre Neuvial, Sandrine Dudoit

Abstract: We consider multivariate two-sample tests of means, where the location shift between the two populations is expected to be related to a known graph structure. An important application of such tests is the detection of differentially expressed genes between two patient populations, as shifts in expression levels are expected to be coherent with the structure of graphs reflecting gene properties such as biological process, molecular function, regulation, or metabolism. For a fixed graph of interest, we demonstrate that accounting for graph structure can yield more powerful tests under the assumption of smooth distribution shift on the graph. We also investigate the identification of non-homogeneous subgraphs of a given large graph, which poses both computational and multiple testing problems. The relevance and benefits of the proposed approach are illustrated on synthetic data and on breast cancer gene expression data analyzed in the context of KEGG pathways.

Submitted 27 September, 2010; originally announced September 2010.

Journal ref: Annals of Applied Statistics 2012, Vol. 6, No. 2, 561-600

arXiv:1003.0747 [pdf, other]

Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

Authors: Pierre Neuvial

Abstract: The False Discovery Rate (FDR) is a commonly used type I error rate in multiple testing problems. It is defined as the expected False Discovery Proportion (FDP), that is, the expected fraction of false positives among rejected hypotheses. When the hypotheses are independent, the Benjamini-Hochberg procedure achieves FDR control at any pre-specified level. By construction, FDR control offers no guarantee in terms of power, or type II error. A number of alternative procedures have been developed, including plug-in procedures that aim at gaining power by incorporating an estimate of the proportion of true null hypotheses. In this paper, we study the asymptotic behavior of a class of plug-in procedures based on kernel estimators of the density of the $p$-values, as the number $m$ of tested hypotheses grows to infinity. In a setting where the hypotheses tested are independent, we prove that these procedures are asymptotically more powerful in two respects: (i) a tighter asymptotic FDR control for any target FDR level and (ii) a broader range of target levels yielding positive asymptotic power. We also show that this increased asymptotic power comes at the price of slower, non-parametric convergence rates for the FDP. These rates are of the form $m^{-k/(2k+1)}$, where $k$ is determined by the regularity of the density of the $p$-value distribution, or, equivalently, of the test statistics distribution. These results are applied to one- and two-sided test statistics for Gaussian and Laplace location models, and for the Student model.

Submitted 20 April, 2013; v1 submitted 3 March, 2010; originally announced March 2010.

Journal ref: Journal of Machine Learning Research 14 (2013) 1423-1459

arXiv:0803.2111 [pdf, ps, other]

doi 10.1214/08-EJS207

Asymptotic properties of false discovery rate controlling procedures under independence

Authors: Pierre Neuvial

Abstract: We investigate the performance of a family of multiple comparison procedures for strong control of the False Discovery Rate ($\mathsf{FDR}$). The $\mathsf{FDR}$ is the expected False Discovery Proportion ($\mathsf{FDP}$), that is, the expected fraction of false rejections among all rejected hypotheses. A number of refinements to the original Benjamini-Hochberg procedure [1] have been proposed, to increase power by estimating the proportion of true null hypotheses, either implicitly, leading to one-stage adaptive procedures [4, 7] or explicitly, leading to two-stage adaptive (or plug-in) procedures [2, 21]. We use a variant of the stochastic process approach proposed by Genovese and Wasserman [11] to study the fluctuations of the $\mathsf{FDP}$ achieved with each of these procedures around its expectation, for independent tested hypotheses. We introduce a framework for the derivation of generic Central Limit Theorems for the $\mathsf{FDP}$ of these procedures, characterizing the associated regularity conditions, and comparing the asymptotic power of the various procedures. We interpret recently proposed one-stage adaptive procedures [4, 7] as fixed points in the iteration of well-known two-stage adaptive procedures [2, 21].

Submitted 21 November, 2008; v1 submitted 14 March, 2008; originally announced March 2008.

Comments: Published at http://dx.doi.org/10.1214/08-EJS207 in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-EJS-EJS_2008_207

MSC Class: 62G10; 62H15; 60F05 (Primary)

Journal ref: Electronic Journal of Statistics 2008, Vol. 2, 1065-1110

Showing 1–22 of 22 results for author: Neuvial, P