-
Easily Computed Marginal Likelihoods from Posterior Simulation Using the THAMES Estimator
Authors:
Martin Metodiev,
Marie Perrot-Dockès,
Sarah Ouadah,
Nicholas J. Irons,
Adrian E. Raftery
Abstract:
We propose an easily computed estimator of marginal likelihoods from posterior simulation output, via reciprocal importance sampling, combining earlier proposals of DiCiccio et al (1997) and Robert and Wraith (2009). This involves only the unnormalized posterior densities from the sampled parameter values, and does not involve additional simulations beyond the main posterior simulation, or additional complicated calculations. It is unbiased for the reciprocal of the marginal likelihood, consistent, has finite variance, and is asymptotically normal. It involves one user-specified control parameter, and we derive an optimal way of specifying this. We illustrate it with several numerical examples.
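The reciprocal importance sampling idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy conjugate normal model, the function names, the uniform density on an interval around the posterior mean, and the `radius_in_sd` setting are all assumptions made for illustration, and the paper's optimal choice of the control parameter is not reproduced here. The sketch also centres the interval using the same draws it averages over, which is a simplification. The exact marginal likelihood is available in closed form for this model, so the estimate can be checked.

```python
import math
import random


def log_unnormalized_posterior(theta, y):
    """Log of likelihood times prior for y_i ~ N(theta, 1), theta ~ N(0, 1)."""
    log_lik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y)
    log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * theta ** 2
    return log_lik + log_prior


def ris_log_marginal(draws, y, radius_in_sd=1.0):
    """Reciprocal importance sampling estimate of the log marginal likelihood.

    Uses 1/Z = E_posterior[h(theta) / (L(theta) * prior(theta))], with h taken
    to be (as an assumption of this sketch) the uniform density on an interval
    centred at the posterior sample mean.
    """
    n_draws = len(draws)
    mean = sum(draws) / n_draws
    sd = math.sqrt(sum((d - mean) ** 2 for d in draws) / (n_draws - 1))
    half_width = radius_in_sd * sd
    log_h = -math.log(2 * half_width)  # log density of the uniform on the interval
    # Draws outside the interval have h = 0 and contribute nothing to the average.
    terms = [
        log_h - log_unnormalized_posterior(d, y)
        for d in draws
        if abs(d - mean) <= half_width
    ]
    # Log-sum-exp for numerical stability, then divide by the total number of draws.
    largest = max(terms)
    log_inv_z = largest + math.log(sum(math.exp(t - largest) for t in terms)) - math.log(n_draws)
    return -log_inv_z


def exact_log_marginal(y):
    """Closed form for this toy model: y ~ N(0, I + 11^T)."""
    n = len(y)
    s = sum(y)
    ss = sum(yi * yi for yi in y)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1) - 0.5 * (ss - s * s / (n + 1))


rng = random.Random(0)
y = [rng.gauss(1.0, 1.0) for _ in range(20)]
n = len(y)
# The posterior is N(sum(y) / (n + 1), 1 / (n + 1)); sample it directly in place of MCMC.
post_mean = sum(y) / (n + 1)
post_sd = math.sqrt(1.0 / (n + 1))
draws = [rng.gauss(post_mean, post_sd) for _ in range(20000)]

estimate = ris_log_marginal(draws, y)
exact = exact_log_marginal(y)
print(f"estimated log marginal likelihood: {estimate:.4f}")
print(f"exact log marginal likelihood:     {exact:.4f}")
```

Only the unnormalized posterior density at the sampled values enters the estimate, which matches the abstract's description; restricting h to a high-density region keeps the ratio bounded and hence the variance finite.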
Submitted 15 May, 2023;
originally announced May 2023.
-
Post hoc false discovery proportion inference under a Hidden Markov Model
Authors:
Marie Perrot-Dockès,
Gilles Blanchard,
Pierre Neuvial,
Etienne Roquain
Abstract:
We address the multiple testing problem under the assumption that the true/false hypotheses are driven by a Hidden Markov Model (HMM), which has been recognized as a fundamental setting for modeling multiple testing under dependence since the seminal work of Sun and Cai (2009). While previous work has concentrated on deriving specific procedures with a controlled False Discovery Rate (FDR) under this model, we follow a recent trend in selective inference and consider the problem of establishing confidence bounds on the false discovery proportion (FDP) for a user-selected set of hypotheses that can depend on the observed data in an arbitrary way. We develop a methodology to construct such confidence bounds, first when the HMM is known, then when its parameters, including the data distribution under the null and the alternative, are unknown and estimated using a nonparametric approach. In the latter case, we propose a bootstrap-based methodology to take into account the effect of parameter estimation error. We show that taking advantage of the assumed HMM structure allows for a substantial improvement in confidence bound sharpness over existing agnostic (structure-free) methods, as shown by both numerical experiments and real data examples.
Submitted 1 May, 2021;
originally announced May 2021.
-
Some detection tests for low complexity data models and unknown background distribution
Authors:
D. Mary,
S. Bourguignon,
E. Roquain,
S. Sulis,
M. Perrot-Dockes
Abstract:
We consider several detection situations where, under the alternative hypothesis, the signal admits a low complexity model and, under both the null and the alternative hypotheses, the distribution of the background noise is unknown. We present several detection strategies for such cases, whose design relies on exogenous or on endogenous data. These testing procedures have been inspired by and are applied to two specific problems in astrophysics, namely the detection of exoplanets from radial velocity curves and of distant galaxies in hyperspectral datacubes.
Submitted 7 December, 2020;
originally announced December 2020.
-
Estimation of large block structured covariance matrices: Application to "multi-omic" approaches to study seed quality
Authors:
Marie Perrot-Dockès,
Céline Lévy-Leduc,
Loïc Rajjou
Abstract:
Motivated by an application in high-throughput genomics and metabolomics, we propose a novel, efficient and fully data-driven approach for estimating large block structured sparse covariance matrices when the number of variables is much larger than the number of samples, without limiting ourselves to block diagonal matrices. Our approach consists in approximating such a covariance matrix by the sum of a low-rank sparse matrix and a diagonal matrix. Our methodology can also deal with matrices for which the block structure appears only if the columns and rows are permuted according to an unknown permutation. Our technique is implemented in the R package BlockCov, which is available from the Comprehensive R Archive Network (CRAN) and from GitHub. To illustrate the statistical and numerical performance of our package, we provide numerical experiments as well as a thorough comparison with alternative methods. Finally, our approach is applied to the use of "multi-omic" approaches for studying seed quality.
Submitted 6 December, 2019; v1 submitted 26 June, 2018;
originally announced June 2018.
-
Variable selection in multivariate linear models with high-dimensional covariance matrix estimation
Authors:
Marie Perrot-Dockès,
Céline Lévy-Leduc,
Laure Sansonnet,
Julien Chiquet
Abstract:
In this paper, we propose a novel variable selection approach in the framework of multivariate linear models taking into account the dependence that may exist between the responses. It consists in first estimating the covariance matrix of the responses and then plugging this estimator into a Lasso criterion, in order to obtain a sparse estimator of the coefficient matrix. The properties of our approach are investigated from both a theoretical and a numerical point of view. More precisely, we give general conditions that the estimators of the covariance matrix and its inverse have to satisfy in order to recover the positions of the null and non-null entries of the coefficient matrix when the size of the covariance matrix is not fixed and can tend to infinity. We prove that these conditions are satisfied in the particular case of some Toeplitz matrices. Our approach is implemented in the R package MultiVarSel, available from the Comprehensive R Archive Network (CRAN), and is very attractive since it benefits from a low computational load. We also assess the performance of our methodology using synthetic data and compare it with alternative approaches. Our numerical experiments show that including the estimation of the covariance matrix in the Lasso criterion dramatically improves the variable selection performance in many cases.
Submitted 13 July, 2017;
originally announced July 2017.
-
A multivariate variable selection approach for analyzing LC-MS metabolomics data
Authors:
M. Perrot-Dockès,
C. Lévy-Leduc,
J. Chiquet,
L. Sansonnet,
M. Brégère,
M. -P. Étienne,
S. Robin,
G. Genta-Jouve
Abstract:
Omic data are characterized by the presence of strong dependence structures that result either from data acquisition or from some underlying biological processes. In metabolomics, for instance, data resulting from Liquid Chromatography-Mass Spectrometry (LC-MS), a technique which gives access to a large coverage of metabolites, exhibit such patterns. These data sets are typically used to find the metabolites characterizing a phenotype of interest associated with the samples. However, applying statistical procedures that do not adjust the variable selection step to the dependence pattern may result in a loss of power and the selection of spurious variables. The goal of this paper is to propose a variable selection procedure in the multivariate linear model that accounts for the dependence structure of the multiple outputs, which, in the LC-MS framework, may lead to the selection of more relevant metabolites. We propose a novel Lasso-based approach in the multivariate framework of the general linear model that takes the dependence structure into account by using various models of the covariance matrix of the residuals. Our numerical experiments show that including the estimation of the covariance matrix of the residuals in the Lasso criterion dramatically improves the variable selection performance. Our approach is also successfully applied to an LC-MS data set made of African copal samples, for which it is able to provide a small list of metabolites without altering the phenotype discrimination. Our methodology is implemented in the R package MultiVarSel, which is available from the Comprehensive R Archive Network (CRAN).
Submitted 31 March, 2017;
originally announced April 2017.