Search | arXiv e-print repository

Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression

Authors: Zhaomeng Chen, Zihuai He, Benjamin B. Chu, Jiaqi Gu, Tim Morrison, Chiara Sabatti, Emmanuel Candès

Abstract: Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al. [2022]) and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease, and evidence a significant improvement in power.

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2310.15069 [pdf, other]

Second-order group knockoffs with applications to GWAS

Authors: Benjamin B Chu, Jiaqi Gu, Zhaomeng Chen, Tim Morrison, Emmanuel Candes, Zihuai He, Chiara Sabatti

Abstract: Conditional testing via the knockoff framework allows one to identify -- among a large number of possible explanatory variables -- those that carry unique information about an outcome of interest, and also provides a false discovery rate guarantee on the selection. This approach is particularly well suited to the analysis of genome-wide association studies (GWAS), which have the goal of identifying genetic variants that influence traits of medical relevance. While conditional testing can be both more powerful and precise than traditional GWAS analysis methods, its vanilla implementation encounters a difficulty common to all multivariate analysis methods: it is challenging to distinguish among multiple, highly correlated regressors. This impasse can be overcome by shifting the object of inference from single variables to groups of correlated variables. To achieve this, it is necessary to construct "group knockoffs." While successful examples are already documented in the literature, this paper substantially expands the set of algorithms and software for group knockoffs. We focus in particular on second-order knockoffs, for which we describe correlation matrix approximations that are appropriate for GWAS data and that result in considerable computational savings. We illustrate the effectiveness of the proposed methods with simulations and with the analysis of albuminuria data from the UK Biobank. The described algorithms are implemented in an open-source Julia package Knockoffs.jl, for which both R and Python wrappers are available.

Submitted 3 March, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: 46 pages, 10 figures, 2 tables, 3 algorithms

arXiv:2306.09976 [pdf, other]

Catch me if you can: Signal localization with knockoff e-values

Authors: Paula Gablenz, Chiara Sabatti

Abstract: We consider problems where many, somewhat redundant, hypotheses are tested and we are interested in reporting the most precise rejections, with false discovery rate (FDR) control. This is the case, for example, when researchers are interested both in individual hypotheses as well as group hypotheses corresponding to intersections of sets of the original hypotheses, at several resolution levels. A concrete application is in genome-wide association studies, where, depending on the signal strengths, it might be possible to resolve the influence of individual genetic variants on a phenotype with greater or lower precision. To adapt to the unknown signal strength, analyses are conducted at multiple resolutions and researchers are most interested in the more precise discoveries. Assuring FDR control on the reported findings with these adaptive searches is, however, often impossible. To design a multiple comparison procedure that allows for an adaptive choice of resolution with FDR control, we leverage e-values and linear programming. We adapt this approach to problems where knockoffs and group knockoffs have been successfully applied to test conditional independence hypotheses. We demonstrate its efficacy by analyzing data from the UK Biobank.

Submitted 19 April, 2024; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: 48 pages, 12 figures; text edits (incl. abstract, appendix, additional remarks), added references

arXiv:2211.08637 [pdf, other]

Near-peer mentoring in data science: Two experiences at Stanford University

Authors: Chiara Sabatti, Qian Zhao

Abstract: Universities have been expanding the data science programs for undergraduate students, with the simultaneous goal of reaching and retaining students from underrepresented groups in the data science workforce. The set of new programs also offers opportunities to involve graduate students, fostering their growth as future leaders in data science education. We describe two programs that use the near-peer mentoring structure to provide pathways for graduate students to develop teaching and mentoring skills, while providing research and learning opportunities for undergraduate students from diverse backgrounds. In the Data Science for Social Good Summer program, graduate students mentor a group of undergraduate fellows as they tackle a data science project with positive social impact. In the Inclusive Mentoring in Data Science course, graduate students participate in workshops on effective and inclusive mentorship strategies. In an experiential learning framework, they are paired with undergraduate students from non-R1 schools, whom they mentor through weekly one-on-one online meetings. These initiatives offer a prototype of future programs that serve the dual goal of providing both hands-on mentoring experience for graduate students and research opportunities for undergraduate students, in a high-touch inclusive and encouraging environment.

Submitted 8 June, 2024; v1 submitted 15 November, 2022; originally announced November 2022.

arXiv:2108.08813 [pdf, other]

Transfer learning in genome-wide association studies with knockoffs

Authors: Shuangning Li, Zhimei Ren, Chiara Sabatti, Matteo Sesia

Abstract: This paper presents and compares alternative transfer learning methods that can increase the power of conditional testing via knockoffs by leveraging prior information in external data sets collected from different populations or measuring related outcomes. The relevance of this methodology is explored in particular within the context of genome-wide association studies, where it can be helpful to address the pressing need for principled ways to suitably account for, and efficiently learn from the genetic variation associated to diverse ancestries. Finally, we apply these methods to analyze several phenotypes in the UK Biobank data set, demonstrating that transfer learning helps knockoffs discover more numerous associations in the data collected from minority populations, potentially opening the way to the development of more accurate polygenic risk scores.

Submitted 19 August, 2021; originally announced August 2021.

arXiv:2106.04118 [pdf, other]

Searching for consistent associations with a multi-environment knockoff filter

Authors: Shuangning Li, Matteo Sesia, Yaniv Romano, Emmanuel Candès, Chiara Sabatti

Abstract: This paper develops a method based on model-X knockoffs to find conditional associations that are consistent across diverse environments, controlling the false discovery rate. The motivation for this problem is that large data sets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associations consistently replicated under different conditions may be more interesting. In fact, consistency sometimes provably leads to valid causal inferences even if conditional associations do not. While the proposed method is flexible and can be deployed in a wide range of applications, this paper highlights its relevance to genome-wide association studies, in which consistency across populations with diverse ancestries mitigates confounding due to unmeasured variants. The effectiveness of this approach is demonstrated by simulations and applications to the UK Biobank data.

Submitted 8 June, 2021; originally announced June 2021.

Comments: 41 pages, 21 figures, 8 tables

arXiv:2002.09644 [pdf, other]

doi 10.1073/pnas.2007743117

Causal Inference in Genetic Trio Studies

Authors: Stephen Bates, Matteo Sesia, Chiara Sabatti, Emmanuel Candes

Abstract: We introduce a method to rigorously draw causal inferences---inferences immune to all possible confounding---from genetic data that include parents and offspring. Causal conclusions are possible with these data because the natural randomness in meiosis can be viewed as a high-dimensional randomized experiment. We make this observation actionable by developing a novel conditional independence test that identifies regions of the genome containing distinct causal variants. The proposed Digital Twin Test compares an observed offspring to carefully constructed synthetic offspring from the same parents in order to determine statistical significance, and it can leverage any black-box multivariate model and additional non-trio genetic data in order to increase power. Crucially, our inferences are based only on a well-established mathematical description of the rearrangement of genetic material during meiosis and make no assumptions about the relationship between the genotypes and phenotypes.

Submitted 22 February, 2020; originally announced February 2020.

Journal ref: Proc. Natl. Acad. Sci. U.S.A. 117 (2020) 24117-24126

arXiv:1908.05428 [pdf, other]

With Malice Towards None: Assessing Uncertainty via Equalized Coverage

Authors: Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, Emmanuel J. Candès

Abstract: An important factor to guarantee a fair use of data-driven recommendation systems is that we should be able to communicate their uncertainty to decision makers. This can be accomplished by constructing prediction intervals, which provide an intuitive measure of the limits of predictive performance. To support equitable treatment, we force the construction of such intervals to be unbiased in the sense that their coverage must be equal across all protected groups of interest. We present an operational methodology that achieves this goal by offering rigorous distribution-free coverage guarantees holding in finite samples. Our methodology, equalized coverage, is flexible as it can be viewed as a wrapper around any predictive algorithm. We test the applicability of the proposed framework on real data, demonstrating that equalized coverage constructs unbiased prediction intervals, unlike competitive methods.

Submitted 15 August, 2019; originally announced August 2019.

Comments: 14 pages, 1 figure, 1 table

arXiv:1903.05701 [pdf, other]

doi 10.1093/biomet/asy075

Rejoinder: "Gene Hunting with Hidden Markov Model Knockoffs"

Authors: Matteo Sesia, Chiara Sabatti, Emmanuel J. Candès

Abstract: In this paper we deepen and enlarge the reflection on the possible advantages of a knockoff approach to genome wide association studies (Sesia et al., 2018), starting from the discussions in Bottolo & Richardson (2019); Jewell & Witten (2019); Rosenblatt et al. (2019) and Marchini (2019). The discussants bring up a number of important points, either related to the knockoffs methodology in general, or to its specific application to genetic studies. In the following we offer some clarifications, mention relevant recent developments and highlight some of the still open problems.

Submitted 13 March, 2019; originally announced March 2019.

Comments: 12 pages, 4 figures

Journal ref: Biometrika, Volume 106, Issue 1, 1 March 2019, Pages 35-45

arXiv:1809.01792 [pdf, other]

Filtering the rejection set while preserving false discovery rate control

Authors: Eugene Katsevich, Chiara Sabatti, Marina Bogomolov

Abstract: Scientific hypotheses in a variety of applications have domain-specific structures, such as the tree structure of the International Classification of Diseases (ICD), the directed acyclic graph structure of the Gene Ontology (GO), or the spatial structure in genome-wide association studies. In the context of multiple testing, the resulting relationships among hypotheses can create redundancies among rejections that hinder interpretability. This leads to the practice of filtering rejection sets obtained from multiple testing procedures, which may in turn invalidate their inferential guarantees. We propose Focused BH, a simple, flexible, and principled methodology to adjust for the application of any pre-specified filter. We prove that Focused BH controls the false discovery rate under various conditions, including when the filter satisfies an intuitive monotonicity property and the p-values are positively dependent. We demonstrate in simulations that Focused BH performs well across a variety of settings, and illustrate this method's practical utility via analyses of real datasets based on ICD and GO.

Submitted 10 April, 2020; v1 submitted 5 September, 2018; originally announced September 2018.

arXiv:1801.08686 [pdf, other]

Selection-adjusted inference: an application to confidence intervals for cis-eQTL effect sizes

Authors: Snigdha Panigrahi, Junjie Zhu, Chiara Sabatti

Abstract: The goal of eQTL studies is to identify the genetic variants that influence the expression levels of the genes in an organism. High throughput technology has made such studies possible: in a given tissue sample, it enables us to quantify the expression levels of approximately 20,000 genes and to record the alleles present at millions of genetic polymorphisms. While obtaining this data is relatively cheap once a specimen is at hand, obtaining human tissue remains a costly endeavor. Thus, eQTL studies continue to be based on relatively small sample sizes, with this limitation particularly serious for tissues of most immediate medical relevance. Given the high-dimensional nature of these datasets and the large number of hypotheses tested, the scientific community adopted early on multiplicity adjustment procedures, which primarily control the false discovery rate for the identification of genetic variants with influence on the expression levels. In contrast, a problem that has not received much attention to date is that of providing estimates of the effect sizes associated to these variants, in a way that accounts for the considerable amount of selection. We illustrate how the recently developed conditional inference approach can be deployed to obtain confidence intervals for the eQTL effect sizes with reliable coverage.
The procedure we propose is based on a randomized hierarchical strategy that both reflects the steps typically adopted in state of the art investigations and introduces the use of randomness instead of data splitting to maximize the use of available data. Analysis of the GTEx Liver dataset (v6) suggests that naively obtained confidence intervals would likely not cover the true values of effect sizes and that the number of local genetic polymorphisms influencing the expression level of genes might be underestimated.

Submitted 6 June, 2018; v1 submitted 26 January, 2018; originally announced January 2018.

arXiv:1706.09375 [pdf, other]

Multilayer Knockoff Filter: Controlled variable selection at multiple resolutions

Authors: Eugene Katsevich, Chiara Sabatti

Abstract: We tackle the problem of selecting from among a large number of variables those that are 'important' for an outcome. We consider situations where groups of variables are also of interest in their own right. For example, each variable might be a genetic polymorphism and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. Or, variables might quantify various aspects of the functioning of individual internet servers owned by a company, and we might be interested in assessing the importance of each server as a whole on the average download speed for the company's customers. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful and reproducible results, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candes (2015) and the multilayer testing framework of Barber and Ramdas (2016), we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.

Submitted 9 August, 2018; v1 submitted 28 June, 2017; originally announced June 2017.

arXiv:1706.04677 [pdf, other]

doi 10.1093/biomet/asy033

Gene Hunting with Knockoffs for Hidden Markov Models

Authors: Matteo Sesia, Chiara Sabatti, Emmanuel J. Candès

Abstract: Modern scientific studies often require the identification of a subset of relevant explanatory variables, in the attempt to understand an interesting phenomenon. Several statistical methods have been developed to automate this task, but only recently has the framework of model-free knockoffs proposed a general solution that can perform variable selection under rigorous type-I error control, without relying on strong modeling assumptions. In this paper, we extend the methodology of model-free knockoffs to a rich family of problems where the distribution of the covariates can be described by a hidden Markov model (HMM). We develop an exact and efficient algorithm to sample knockoff copies of an HMM. We then argue that, combined with the knockoffs selective framework, they provide a natural and powerful tool for performing principled inference in genome-wide association studies with guaranteed FDR control. Finally, we apply our methodology to several datasets aimed at studying Crohn's disease and several continuous phenotypes, e.g. levels of cholesterol.

Submitted 14 June, 2017; originally announced June 2017.

Comments: 35 pages, 13 figures, 9 tables

Journal ref: Biometrika, Volume 106, Issue 1, 1 March 2019, Pages 1-18

arXiv:1705.07529 [pdf, other]

Testing hypotheses on a tree: new error rates and controlling strategies

Authors: Marina Bogomolov, Christine B. Peterson, Yoav Benjamini, Chiara Sabatti

Abstract: We introduce a multiple testing procedure (TreeBH) which addresses the challenge of controlling error rates at multiple levels of resolution. Conceptually, we frame this problem as the selection of hypotheses which are organized hierarchically in a tree structure. We describe a fast algorithm for the proposed sequential procedure, and prove that it controls relevant error rates given certain assumptions on the dependence among the p-values. Through simulations, we demonstrate that TreeBH offers the desired guarantees under a range of dependency structures (including one similar to that encountered in genome-wide association studies) and that it has the potential of gaining power over alternative methods. We also introduce a modified version of TreeBH which we prove to control the relevant error rates under any dependency structure. We conclude with two case studies: we first analyze data collected as part of the Genotype-Tissue Expression (GTEx) project, which aims to characterize the genetic regulation of gene expression across multiple tissues in the human body, and secondly, data examining the relationship between the gut microbiome and colorectal cancer.

Submitted 23 October, 2018; v1 submitted 21 May, 2017; originally announced May 2017.

arXiv:1610.03330 [pdf, other]

Detecting Multiple Replicating Signals using Adaptive Filtering Procedures

Authors: Jingshu Wang, Lin Gui, Weijie J. Su, Chiara Sabatti, Art B. Owen

Abstract: Replicability is a fundamental quality of scientific discoveries: we are interested in those signals that are detectable in different laboratories, study populations, across time etc. Unlike meta-analysis which accounts for experimental variability but does not guarantee replicability, testing a partial conjunction (PC) null aims specifically to identify the signals that are discovered in multiple studies. In many contemporary applications, e.g., comparing multiple high-throughput genetic experiments, a large number $M$ of PC nulls need to be tested simultaneously, calling for a multiple comparisons correction. However, standard multiple testing adjustments on the $M$ PC $p$-values can be severely conservative, especially when $M$ is large and the signals are sparse. We introduce AdaFilter, a new multiple testing procedure that increases power by adaptively filtering out unlikely candidates of PC nulls. We prove that AdaFilter can control FWER and FDR as long as data across studies are independent, and has much higher power than other existing methods. We illustrate the application of AdaFilter with three examples: microarray studies of Duchenne muscular dystrophy, single-cell RNA sequencing of T cells in lung cancer tumors and GWAS for metabolomics.

Submitted 18 November, 2021; v1 submitted 11 October, 2016; originally announced October 2016.

arXiv:1504.00946 [pdf, other]

doi 10.1534/genetics.115.184572

Genetic variant selection: learning across traits and sites

Authors: Laurel Stell, Chiara Sabatti

Abstract: We consider resequencing studies of associated loci and the problem of prioritizing sequence variants for functional follow-up. Working within the multivariate linear regression framework helps us to account for correlation across variants, and adopting a Bayesian approach naturally leads to posterior probabilities that incorporate all information about the variants' function. We describe two novel prior distributions that facilitate learning the role of each variant by borrowing evidence across phenotypes and across mutations in the same gene. We illustrate their potential advantages with simulations and re-analyzing a dataset of sequencing variants.

Submitted 4 April, 2016; v1 submitted 3 April, 2015; originally announced April 2015.

Comments: Published at http://www.genetics.org/content/202/2/439 in GENETICS (http://www.genetics.org)

Journal ref: Genetics 2016, vol. 202, no. 2, 439-455

arXiv:1504.00701 [pdf, other]

Many Phenotypes without Many False Discoveries: Error Controlling Strategies for Multi-Traits Association Studies

Authors: Christine Peterson, Marina Bogomolov, Yoav Benjamini, Chiara Sabatti

Abstract: The genetic basis of multiple phenotypes such as gene expression, metabolite levels, or imaging features is often investigated by testing a large collection of hypotheses, probing the existence of association between each of the traits and hundreds of thousands of genotyped variants. Appropriate multiplicity adjustment is crucial to guarantee replicability of findings, and False Discovery Rate (FDR) is frequently adopted as a measure of global error. In the interest of interpretability, results are often summarized so that reporting focuses on variants discovered to be associated to some phenotypes. We show that applying FDR-controlling procedures on the entire collection of hypotheses fails to control the rate of false discovery of associated variants as well as the average rate of false discovery of phenotypes influenced by such variants. We propose a simple hierarchical testing procedure which allows control of both these error rates and provides a more reliable basis for the identification of variants with functional effects. We demonstrate the utility of this approach through simulation studies comparing various error rates and measures of power for genetic association studies of multiple traits. Finally, we apply the proposed method to identify genetic variants which impact flowering phenotypes in Arabidopsis thaliana, expanding the set of discoveries.

Submitted 2 April, 2015; originally announced April 2015.

arXiv:1407.3824 [pdf, ps, other]

doi 10.1214/15-AOAS842

SLOPE - Adaptive variable selection via convex optimization

Authors: Małgorzata Bogdan, Ewout van den Berg, Chiara Sabatti, Weijie Su, Emmanuel J. Candès

Abstract: We introduce a new estimator for the vector of coefficients $β$ in the linear model $y=Xβ+z$, where $X$ has dimensions $n\times p$ with $p$ possibly larger than $n$. SLOPE, short for Sorted L-One Penalized Estimation, is the solution to \[\min_{b\in\mathbb{R}^p}\frac{1}{2}\Vert y-Xb\Vert _{\ell_2}^2+λ_1\vert b\vert _{(1)}+λ_2\vert b\vert_{(2)}+\cdots+λ_p\vert b\vert_{(p)},\] where $λ_1\geλ_2\ge\cdots\geλ_p\ge0$ and $\vert b\vert_{(1)}\ge\vert b\vert_{(2)}\ge\cdots\ge\vert b\vert_{(p)}$ are the decreasing absolute values of the entries of $b$. This is a convex program and we demonstrate a solution algorithm whose computational complexity is roughly comparable to that of classical $\ell_1$ procedures such as the Lasso. Here, the regularizer is a sorted $\ell_1$ norm, which penalizes the regression coefficients according to their rank: the higher the rank - that is, stronger the signal - the larger the penalty. This is similar to the Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289-300] procedure (BH) which compares more significant $p$-values with more stringent thresholds. One notable choice of the sequence $\{λ_i\}$ is given by the BH critical values $λ_{\mathrm {BH}}(i)=z(1-i\cdot q/2p)$, where $q\in(0,1)$ and $z(α)$ is the quantile of a standard normal distribution. SLOPE aims to provide finite sample guarantees on the selected model; of special interest is the false discovery rate (FDR), defined as the expected proportion of irrelevant regressors among all selected predictors.
Under orthogonal designs, SLOPE with $λ_{\mathrm{BH}}$ provably controls FDR at level $q$. Moreover, it also appears to have appreciable inferential properties under more general designs $X$ while having substantial power, as demonstrated in a series of experiments running on both simulated and real data.
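
As an illustration of the formulas in this abstract, the following minimal Python sketch computes the BH critical values $λ_{\mathrm{BH}}(i)$ and evaluates the SLOPE objective for a given coefficient vector. It is not the authors' implementation and it does not solve the optimization problem. The function names (`bh_lambdas`, `sorted_l1_penalty`, `slope_objective`) are hypothetical, and the sketch assumes that $q/2p$ means $q/(2p)$ and that $z(α)$ is the $α$-quantile of the standard normal distribution.

```python
from statistics import NormalDist


def bh_lambdas(p, q):
    """BH critical values lambda_BH(i) = z(1 - i*q/(2p)) for i = 1..p.

    Assumption: z(alpha) is the alpha-quantile of N(0, 1) and q lies in (0, 1).
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    return [z(1 - i * q / (2 * p)) for i in range(1, p + 1)]


def sorted_l1_penalty(b, lambdas):
    """Sorted L1 norm: sum_i lambda_i * |b|_(i), with |b|_(1) >= ... >= |b|_(p)."""
    abs_sorted = sorted((abs(x) for x in b), reverse=True)
    return sum(lam * a for lam, a in zip(lambdas, abs_sorted))


def slope_objective(y, X, b, lambdas):
    """Evaluate 0.5 * ||y - Xb||_2^2 plus the sorted L1 penalty (no minimization)."""
    residuals = [
        yi - sum(xij * bj for xij, bj in zip(row, b))
        for yi, row in zip(y, X)
    ]
    return 0.5 * sum(r * r for r in residuals) + sorted_l1_penalty(b, lambdas)


if __name__ == "__main__":
    lambdas = bh_lambdas(p=5, q=0.1)
    print(lambdas)  # a nonincreasing sequence of nonnegative thresholds
    print(sorted_l1_penalty([1.0, -3.0, 2.0], [3.0, 2.0, 1.0]))
```

Because the largest $λ$ is paired with the largest $\vert b\vert_{(i)}$, the penalty depends on the rank of each coefficient rather than on its position, which is the property the abstract compares to the BH procedure.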

Submitted 4 November, 2015; v1 submitted 14 July, 2014; originally announced July 2014.

Comments: Published at http://dx.doi.org/10.1214/15-AOAS842 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS842

Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 3, 1103-1140

arXiv:1202.5064 [pdf, other]

Reconstructing DNA copy number by joint segmentation of multiple sequences

Authors: Zhongyang Zhang, Kenneth Lange, Chiara Sabatti

Abstract: The variation in DNA copy number carries information on the modalities of genome evolution and misregulation of DNA replication in cancer cells; its study can be helpful to localize tumor suppressor genes, distinguish different populations of cancerous cells, as well as identify genomic variations responsible for disease phenotypes. A number of different high throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand: this encompasses the cases of copy number polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual. We present an algorithm based on regularization approaches with significant computational advantages and competitive accuracy. We illustrate its applicability with simulated and real data sets.

Submitted 14 March, 2012; v1 submitted 22 February, 2012; originally announced February 2012.

Comments: 54 pages, 5 figures

arXiv:1011.1798 [pdf, ps, other]

doi 10.1214/10-AOAS350

Sparse regulatory networks

Authors: Gareth M. James, Chiara Sabatti, Nengfeng Zhou, Ji Zhu

Abstract: In many organisms the expression levels of each gene are controlled by the activation levels of known "Transcription Factors" (TF). A problem of considerable interest is that of estimating the "Transcription Regulation Networks" (TRN) relating the TFs and genes. While the expression levels of genes can be observed, the activation levels of the corresponding TFs are usually unknown, greatly increasing the difficulty of the problem. Based on previous experimental work, it is often the case that partial information about the TRN is available. For example, certain TFs may be known to regulate a given gene or in other cases a connection may be predicted with a certain probability. In general, the biology of the problem indicates there will be very few connections between TFs and genes. Several methods have been proposed for estimating TRNs. However, they all suffer from problems such as unrealistic assumptions about prior knowledge of the network structure or computational limitations. We propose a new approach that can directly utilize prior information about the network structure in conjunction with observed gene expression data to estimate the TRN. Our approach uses $L_1$ penalties on the network to ensure a sparse structure. This has the advantage of being computationally efficient as well as making many fewer assumptions about the network structure. We use our methodology to construct the TRN for E. coli and show that the estimate is biologically sensible and compares favorably with previous estimates.

Submitted 8 November, 2010; originally announced November 2010.

Comments: Published at http://dx.doi.org/10.1214/10-AOAS350 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS350

Journal ref: Annals of Applied Statistics 2010, Vol. 4, No. 2, 663-686

arXiv:0906.2234 [pdf, ps, other]

doi 10.1214/10-AOAS357

Reconstructing DNA copy number by penalized estimation and imputation

Authors: Zhongyang Zhang, Kenneth Lange, Roel Ophoff, Chiara Sabatti

Abstract: Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variation (CNV). Fortunately, modern genotyping platforms also detect CNVs with fairly high reliability. Hidden Markov models and algorithms have played a dominant role in the interpretation of CNV data. Here we explore CNV reconstruction via estimation with a fused-lasso penalty as suggested by Tibshirani and Wang [Biostatistics 9 (2008) 18--29]. We mount a fresh attack on this difficult optimization problem by the following: (a) changing the penalty terms slightly by substituting a smooth approximation to the absolute value function, (b) designing and implementing a new MM (majorization--minimization) algorithm, and (c) applying a fast version of Newton's method to jointly update all model parameters. Together these changes enable us to minimize the fused-lasso criterion in a highly effective way. We also reframe the reconstruction problem in terms of imputation via discrete optimization. This approach is easier and more accurate than parameter estimation because it relies on the fact that only a handful of possible copy number states exist at each SNP. The dynamic programming framework has the added bonus of exploiting information that the current fused-lasso approach ignores. The accuracy of our imputations is comparable to that of hidden Markov models at a substantially lower computational cost.

Submitted 10 January, 2011; v1 submitted 11 June, 2009; originally announced June 2009.

Comments: Published at http://dx.doi.org/10.1214/10-AOAS357 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS357

Journal ref: Annals of Applied Statistics 2010, Vol. 4, No. 4, 1749-1773

Showing 1–21 of 21 results for author: Sabatti, C