-
Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression
Authors:
Zhaomeng Chen,
Zihuai He,
Benjamin B. Chu,
Jiaqi Gu,
Tim Morrison,
Chiara Sabatti,
Emmanuel Candès
Abstract:
Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs (He et al. [2022]) and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease, and evidence a significant improvement in power.
Submitted 20 February, 2024;
originally announced February 2024.
-
Second-order group knockoffs with applications to GWAS
Authors:
Benjamin B Chu,
Jiaqi Gu,
Zhaomeng Chen,
Tim Morrison,
Emmanuel Candes,
Zihuai He,
Chiara Sabatti
Abstract:
Conditional testing via the knockoff framework allows one to identify -- among a large number of possible explanatory variables -- those that carry unique information about an outcome of interest, and also provides a false discovery rate guarantee on the selection. This approach is particularly well suited to the analysis of genome wide association studies (GWAS), which have the goal of identifying genetic variants which influence traits of medical relevance.
While conditional testing can be both more powerful and precise than traditional GWAS analysis methods, its vanilla implementation encounters a difficulty common to all multivariate analysis methods: it is challenging to distinguish among multiple, highly correlated regressors. This impasse can be overcome by shifting the object of inference from single variables to groups of correlated variables. To achieve this, it is necessary to construct "group knockoffs." While successful examples are already documented in the literature, this paper substantially expands the set of algorithms and software for group knockoffs. We focus in particular on second-order knockoffs, for which we describe correlation matrix approximations that are appropriate for GWAS data and that result in considerable computational savings. We illustrate the effectiveness of the proposed methods with simulations and with the analysis of albuminuria data from the UK Biobank.
The described algorithms are implemented in an open-source Julia package Knockoffs.jl, for which both R and Python wrappers are available.
Submitted 3 March, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Catch me if you can: Signal localization with knockoff e-values
Authors:
Paula Gablenz,
Chiara Sabatti
Abstract:
We consider problems where many, somewhat redundant, hypotheses are tested and we are interested in reporting the most precise rejections, with false discovery rate (FDR) control. This is the case, for example, when researchers are interested both in individual hypotheses as well as group hypotheses corresponding to intersections of sets of the original hypotheses, at several resolution levels. A concrete application is in genome-wide association studies, where, depending on the signal strengths, it might be possible to resolve the influence of individual genetic variants on a phenotype with greater or lower precision. To adapt to the unknown signal strength, analyses are conducted at multiple resolutions and researchers are most interested in the more precise discoveries. Assuring FDR control on the reported findings with these adaptive searches is, however, often impossible. To design a multiple comparison procedure that allows for an adaptive choice of resolution with FDR control, we leverage e-values and linear programming. We adapt this approach to problems where knockoffs and group knockoffs have been successfully applied to test conditional independence hypotheses. We demonstrate its efficacy by analyzing data from the UK Biobank.
Submitted 19 April, 2024; v1 submitted 16 June, 2023;
originally announced June 2023.
-
Near-peer mentoring in data science: Two experiences at Stanford University
Authors:
Chiara Sabatti,
Qian Zhao
Abstract:
Universities have been expanding the data science programs for undergraduate students, with the simultaneous goal of reaching and retaining students from underrepresented groups in the data science workforce. The set of new programs also offers opportunities to involve graduate students, fostering their growth as future leaders in data science education. We describe two programs that use the near-peer mentoring structure to provide pathways for graduate students to develop teaching and mentoring skills, while providing research and learning opportunities for undergraduate students from diverse backgrounds. In the Data Science for Social Good Summer program, graduate students mentor a group of undergraduate fellows as they tackle a data science project with positive social impact. In the Inclusive Mentoring in Data Science course, graduate students participate in workshops on effective and inclusive mentorship strategies. In an experiential learning framework, they are paired with undergraduate students from non-R1 schools, whom they mentor through weekly one-on-one on-line meetings. These initiatives offer a prototype of future programs that serve the dual goal of providing both hands-on mentoring experience for graduate students and research opportunities for undergraduate students, in a high-touch inclusive and encouraging environment.
Submitted 8 June, 2024; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Transfer learning in genome-wide association studies with knockoffs
Authors:
Shuangning Li,
Zhimei Ren,
Chiara Sabatti,
Matteo Sesia
Abstract:
This paper presents and compares alternative transfer learning methods that can increase the power of conditional testing via knockoffs by leveraging prior information in external data sets collected from different populations or measuring related outcomes. The relevance of this methodology is explored in particular within the context of genome-wide association studies, where it can be helpful to address the pressing need for principled ways to suitably account for, and efficiently learn from the genetic variation associated to diverse ancestries. Finally, we apply these methods to analyze several phenotypes in the UK Biobank data set, demonstrating that transfer learning helps knockoffs discover more numerous associations in the data collected from minority populations, potentially opening the way to the development of more accurate polygenic risk scores.
Submitted 19 August, 2021;
originally announced August 2021.
-
Searching for consistent associations with a multi-environment knockoff filter
Authors:
Shuangning Li,
Matteo Sesia,
Yaniv Romano,
Emmanuel Candès,
Chiara Sabatti
Abstract:
This paper develops a method based on model-X knockoffs to find conditional associations that are consistent across diverse environments, controlling the false discovery rate. The motivation for this problem is that large data sets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associations consistently replicated under different conditions may be more interesting. In fact, consistency sometimes provably leads to valid causal inferences even if conditional associations do not. While the proposed method is flexible and can be deployed in a wide range of applications, this paper highlights its relevance to genome-wide association studies, in which consistency across populations with diverse ancestries mitigates confounding due to unmeasured variants. The effectiveness of this approach is demonstrated by simulations and applications to the UK Biobank data.
Submitted 8 June, 2021;
originally announced June 2021.
-
Causal Inference in Genetic Trio Studies
Authors:
Stephen Bates,
Matteo Sesia,
Chiara Sabatti,
Emmanuel Candes
Abstract:
We introduce a method to rigorously draw causal inferences---inferences immune to all possible confounding---from genetic data that include parents and offspring. Causal conclusions are possible with these data because the natural randomness in meiosis can be viewed as a high-dimensional randomized experiment. We make this observation actionable by developing a novel conditional independence test that identifies regions of the genome containing distinct causal variants. The proposed Digital Twin Test compares an observed offspring to carefully constructed synthetic offspring from the same parents in order to determine statistical significance, and it can leverage any black-box multivariate model and additional non-trio genetic data in order to increase power. Crucially, our inferences are based only on a well-established mathematical description of the rearrangement of genetic material during meiosis and make no assumptions about the relationship between the genotypes and phenotypes.
Submitted 22 February, 2020;
originally announced February 2020.
-
With Malice Towards None: Assessing Uncertainty via Equalized Coverage
Authors:
Yaniv Romano,
Rina Foygel Barber,
Chiara Sabatti,
Emmanuel J. Candès
Abstract:
An important factor to guarantee a fair use of data-driven recommendation systems is that we should be able to communicate their uncertainty to decision makers. This can be accomplished by constructing prediction intervals, which provide an intuitive measure of the limits of predictive performance. To support equitable treatment, we force the construction of such intervals to be unbiased in the sense that their coverage must be equal across all protected groups of interest. We present an operational methodology that achieves this goal by offering rigorous distribution-free coverage guarantees holding in finite samples. Our methodology, equalized coverage, is flexible as it can be viewed as a wrapper around any predictive algorithm. We test the applicability of the proposed framework on real data, demonstrating that equalized coverage constructs unbiased prediction intervals, unlike competitive methods.
Submitted 15 August, 2019;
originally announced August 2019.
-
Rejoinder: "Gene Hunting with Hidden Markov Model Knockoffs"
Authors:
Matteo Sesia,
Chiara Sabatti,
Emmanuel J. Candès
Abstract:
In this paper we deepen and enlarge the reflection on the possible advantages of a knockoff approach to genome wide association studies (Sesia et al., 2018), starting from the discussions in Bottolo & Richardson (2019); Jewell & Witten (2019); Rosenblatt et al. (2019) and Marchini (2019). The discussants bring up a number of important points, either related to the knockoffs methodology in general, or to its specific application to genetic studies. In the following we offer some clarifications, mention relevant recent developments and highlight some of the still open problems.
Submitted 13 March, 2019;
originally announced March 2019.
-
Filtering the rejection set while preserving false discovery rate control
Authors:
Eugene Katsevich,
Chiara Sabatti,
Marina Bogomolov
Abstract:
Scientific hypotheses in a variety of applications have domain-specific structures, such as the tree structure of the International Classification of Diseases (ICD), the directed acyclic graph structure of the Gene Ontology (GO), or the spatial structure in genome-wide association studies. In the context of multiple testing, the resulting relationships among hypotheses can create redundancies among rejections that hinder interpretability. This leads to the practice of filtering rejection sets obtained from multiple testing procedures, which may in turn invalidate their inferential guarantees. We propose Focused BH, a simple, flexible, and principled methodology to adjust for the application of any pre-specified filter. We prove that Focused BH controls the false discovery rate under various conditions, including when the filter satisfies an intuitive monotonicity property and the p-values are positively dependent. We demonstrate in simulations that Focused BH performs well across a variety of settings, and illustrate this method's practical utility via analyses of real datasets based on ICD and GO.
Submitted 10 April, 2020; v1 submitted 5 September, 2018;
originally announced September 2018.
-
Selection-adjusted inference: an application to confidence intervals for cis-eQTL effect sizes
Authors:
Snigdha Panigrahi,
Junjie Zhu,
Chiara Sabatti
Abstract:
The goal of eQTL studies is to identify the genetic variants that influence the expression levels of the genes in an organism. High throughput technology has made such studies possible: in a given tissue sample, it enables us to quantify the expression levels of approximately 20,000 genes and to record the alleles present at millions of genetic polymorphisms. While obtaining this data is relatively cheap once a specimen is at hand, obtaining human tissue remains a costly endeavor. Thus, eQTL studies continue to be based on relatively small sample sizes, with this limitation particularly serious for tissues of most immediate medical relevance. Given the high dimensional nature of these datasets and the large number of hypotheses tested, the scientific community adopted multiplicity adjustment procedures early on, which primarily control the false discovery rate for the identification of genetic variants with influence on the expression levels. In contrast, a problem that has not received much attention to date is that of providing estimates of the effect sizes associated to these variants, in a way that accounts for the considerable amount of selection. We illustrate how the recently developed conditional inference approach can be deployed to obtain confidence intervals for the eQTL effect sizes with reliable coverage. The procedure we propose is based on a randomized hierarchical strategy that both reflects the steps typically adopted in state of the art investigations and introduces the use of randomness instead of data splitting to maximize the use of available data. Analysis of the GTEx Liver dataset (v6) suggests that naively obtained confidence intervals would likely not cover the true values of effect sizes and that the number of local genetic polymorphisms influencing the expression level of genes might be underestimated.
Submitted 6 June, 2018; v1 submitted 26 January, 2018;
originally announced January 2018.
-
Multilayer Knockoff Filter: Controlled variable selection at multiple resolutions
Authors:
Eugene Katsevich,
Chiara Sabatti
Abstract:
We tackle the problem of selecting from among a large number of variables those that are 'important' for an outcome. We consider situations where groups of variables are also of interest in their own right. For example, each variable might be a genetic polymorphism and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. Or, variables might quantify various aspects of the functioning of individual internet servers owned by a company, and we might be interested in assessing the importance of each server as a whole on the average download speed for the company's customers. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful and reproducible results, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candes (2015) and the multilayer testing framework of Barber and Ramdas (2016), we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.
Submitted 9 August, 2018; v1 submitted 28 June, 2017;
originally announced June 2017.
-
Gene Hunting with Knockoffs for Hidden Markov Models
Authors:
Matteo Sesia,
Chiara Sabatti,
Emmanuel J. Candès
Abstract:
Modern scientific studies often require the identification of a subset of relevant explanatory variables, in the attempt to understand an interesting phenomenon. Several statistical methods have been developed to automate this task, but only recently has the framework of model-free knockoffs proposed a general solution that can perform variable selection under rigorous type-I error control, without relying on strong modeling assumptions. In this paper, we extend the methodology of model-free knockoffs to a rich family of problems where the distribution of the covariates can be described by a hidden Markov model (HMM). We develop an exact and efficient algorithm to sample knockoff copies of an HMM. We then argue that combined with the knockoffs selective framework, they provide a natural and powerful tool for performing principled inference in genome-wide association studies with guaranteed FDR control. Finally, we apply our methodology to several datasets aimed at studying Crohn's disease and several continuous phenotypes, e.g. levels of cholesterol.
Submitted 14 June, 2017;
originally announced June 2017.
-
Testing hypotheses on a tree: new error rates and controlling strategies
Authors:
Marina Bogomolov,
Christine B. Peterson,
Yoav Benjamini,
Chiara Sabatti
Abstract:
We introduce a multiple testing procedure (TreeBH) which addresses the challenge of controlling error rates at multiple levels of resolution. Conceptually, we frame this problem as the selection of hypotheses which are organized hierarchically in a tree structure. We describe a fast algorithm for the proposed sequential procedure, and prove that it controls relevant error rates given certain assumptions on the dependence among the p-values. Through simulations, we demonstrate that TreeBH offers the desired guarantees under a range of dependency structures (including one similar to that encountered in genome-wide association studies) and that it has the potential of gaining power over alternative methods. We also introduce a modified version of TreeBH which we prove to control the relevant error rates under any dependency structure.
We conclude with two case studies: we first analyze data collected as part of the Genotype-Tissue Expression (GTEx) project, which aims to characterize the genetic regulation of gene expression across multiple tissues in the human body, and secondly, data examining the relationship between the gut microbiome and colorectal cancer.
Submitted 23 October, 2018; v1 submitted 21 May, 2017;
originally announced May 2017.
-
Detecting Multiple Replicating Signals using Adaptive Filtering Procedures
Authors:
Jingshu Wang,
Lin Gui,
Weijie J. Su,
Chiara Sabatti,
Art B. Owen
Abstract:
Replicability is a fundamental quality of scientific discoveries: we are interested in those signals that are detectable in different laboratories, study populations, across time etc. Unlike meta-analysis which accounts for experimental variability but does not guarantee replicability, testing a partial conjunction (PC) null aims specifically to identify the signals that are discovered in multiple studies. In many contemporary applications, e.g., comparing multiple high-throughput genetic experiments, a large number $M$ of PC nulls need to be tested simultaneously, calling for a multiple comparisons correction. However, standard multiple testing adjustments on the $M$ PC $p$-values can be severely conservative, especially when $M$ is large and the signals are sparse. We introduce AdaFilter, a new multiple testing procedure that increases power by adaptively filtering out unlikely candidates of PC nulls. We prove that AdaFilter can control FWER and FDR as long as data across studies are independent, and has much higher power than other existing methods. We illustrate the application of AdaFilter with three examples: microarray studies of Duchenne muscular dystrophy, single-cell RNA sequencing of T cells in lung cancer tumors and GWAS for metabolomics.
Submitted 18 November, 2021; v1 submitted 11 October, 2016;
originally announced October 2016.
-
Genetic variant selection: learning across traits and sites
Authors:
Laurel Stell,
Chiara Sabatti
Abstract:
We consider resequencing studies of associated loci and the problem of prioritizing sequence variants for functional follow-up. Working within the multivariate linear regression framework helps us to account for correlation across variants, and adopting a Bayesian approach naturally leads to posterior probabilities that incorporate all information about the variants' function. We describe two novel prior distributions that facilitate learning the role of each variant by borrowing evidence across phenotypes and across mutations in the same gene. We illustrate their potential advantages with simulations and re-analyzing a dataset of sequencing variants.
Submitted 4 April, 2016; v1 submitted 3 April, 2015;
originally announced April 2015.
-
Many Phenotypes without Many False Discoveries: Error Controlling Strategies for Multi-Traits Association Studies
Authors:
Christine Peterson,
Marina Bogomolov,
Yoav Benjamini,
Chiara Sabatti
Abstract:
The genetic basis of multiple phenotypes such as gene expression, metabolite levels, or imaging features is often investigated by testing a large collection of hypotheses, probing the existence of association between each of the traits and hundreds of thousands of genotyped variants. Appropriate multiplicity adjustment is crucial to guarantee replicability of findings, and False Discovery Rate (FDR) is frequently adopted as a measure of global error. In the interest of interpretability, results are often summarized so that reporting focuses on variants discovered to be associated to some phenotypes.
We show that applying FDR-controlling procedures on the entire collection of hypotheses fails to control the rate of false discovery of associated variants as well as the average rate of false discovery of phenotypes influenced by such variants. We propose a simple hierarchical testing procedure which allows control of both these error rates and provides a more reliable basis for the identification of variants with functional effects. We demonstrate the utility of this approach through simulation studies comparing various error rates and measures of power for genetic association studies of multiple traits. Finally, we apply the proposed method to identify genetic variants which impact flowering phenotypes in Arabidopsis thaliana, expanding the set of discoveries.
Submitted 2 April, 2015;
originally announced April 2015.
-
SLOPE - Adaptive variable selection via convex optimization
Authors:
Małgorzata Bogdan,
Ewout van den Berg,
Chiara Sabatti,
Weijie Su,
Emmanuel J. Candès
Abstract:
We introduce a new estimator for the vector of coefficients $β$ in the linear model $y=Xβ+z$, where $X$ has dimensions $n\times p$ with $p$ possibly larger than $n$. SLOPE, short for Sorted L-One Penalized Estimation, is the solution to \[\min_{b\in\mathbb{R}^p}\frac{1}{2}\Vert y-Xb\Vert _{\ell_2}^2+λ_1\vert b\vert _{(1)}+λ_2\vert b\vert_{(2)}+\cdots+λ_p\vert b\vert_{(p)},\] where $λ_1\geλ_2\ge\cdots\geλ_p\ge0$ and $\vert b\vert_{(1)}\ge\vert b\vert_{(2)}\ge\cdots\ge\vert b\vert_{(p)}$ are the decreasing absolute values of the entries of $b$. This is a convex program and we demonstrate a solution algorithm whose computational complexity is roughly comparable to that of classical $\ell_1$ procedures such as the Lasso. Here, the regularizer is a sorted $\ell_1$ norm, which penalizes the regression coefficients according to their rank: the higher the rank - that is, stronger the signal - the larger the penalty. This is similar to the Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289-300] procedure (BH) which compares more significant $p$-values with more stringent thresholds. One notable choice of the sequence $\{λ_i\}$ is given by the BH critical values $λ_{\mathrm {BH}}(i)=z(1-i\cdot q/2p)$, where $q\in(0,1)$ and $z(α)$ is the quantile of a standard normal distribution. SLOPE aims to provide finite sample guarantees on the selected model; of special interest is the false discovery rate (FDR), defined as the expected proportion of irrelevant regressors among all selected predictors. Under orthogonal designs, SLOPE with $λ_{\mathrm{BH}}$ provably controls FDR at level $q$. Moreover, it also appears to have appreciable inferential properties under more general designs $X$ while having substantial power, as demonstrated in a series of experiments running on both simulated and real data.
Submitted 4 November, 2015; v1 submitted 14 July, 2014;
originally announced July 2014.
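The sorted $\ell_1$ penalty and the BH sequence described in the SLOPE abstract above can be illustrated with a minimal sketch. It only evaluates the objective and does not implement the authors' solution algorithm. The function names (`bh_lambdas`, `sorted_l1_norm`, `slope_objective`) and the list-of-rows layout for $X$ are assumptions made for illustration.

```python
from statistics import NormalDist


def bh_lambdas(p, q):
    """BH critical values lambda_BH(i) = z(1 - i*q/(2p)) for i = 1..p.

    z is the standard normal quantile function; the result is a
    non-increasing sequence, as the SLOPE penalty requires.
    """
    z = NormalDist().inv_cdf
    return [z(1.0 - i * q / (2.0 * p)) for i in range(1, p + 1)]


def sorted_l1_norm(b, lambdas):
    """Sum of lambda_i * |b|_(i), with |b|_(1) >= |b|_(2) >= ... >= |b|_(p)."""
    abs_sorted = sorted((abs(v) for v in b), reverse=True)
    return sum(lam * a for lam, a in zip(lambdas, abs_sorted))


def slope_objective(y, X, b, lambdas):
    """0.5 * ||y - Xb||_2^2 + sorted L1 penalty (X given as a list of rows)."""
    fitted = [sum(x_ij * b_j for x_ij, b_j in zip(row, b)) for row in X]
    rss = sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, fitted))
    return 0.5 * rss + sorted_l1_norm(b, lambdas)


if __name__ == "__main__":
    lambdas = bh_lambdas(p=4, q=0.1)
    print(lambdas)  # largest weight first
    print(slope_objective([1.0, 2.0], [[1, 0], [0, 1]], [0.5, 1.5], lambdas[:2]))
```

Because the largest weight is paired with the largest absolute coefficient, the penalty depends only on the ranks of the entries of $b$, not on their positions.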
-
Reconstructing DNA copy number by joint segmentation of multiple sequences
Authors:
Zhongyang Zhang,
Kenneth Lange,
Chiara Sabatti
Abstract:
The variation in DNA copy number carries information on the modalities of genome evolution and misregulation of DNA replication in cancer cells; its study can be helpful to localize tumor suppressor genes, distinguish different populations of cancerous cells, as well as identify genomic variations responsible for disease phenotypes. A number of different high-throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand: this encompasses the cases of copy number polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual. We present an algorithm based on regularization approaches with significant computational advantages and competitive accuracy. We illustrate its applicability with simulated and real data sets.
Submitted 14 March, 2012; v1 submitted 22 February, 2012;
originally announced February 2012.
-
Sparse regulatory networks
Authors:
Gareth M. James,
Chiara Sabatti,
Nengfeng Zhou,
Ji Zhu
Abstract:
In many organisms the expression levels of each gene are controlled by the activation levels of known "Transcription Factors" (TF). A problem of considerable interest is that of estimating the "Transcription Regulation Networks" (TRN) relating the TFs and genes. While the expression levels of genes can be observed, the activation levels of the corresponding TFs are usually unknown, greatly increasing the difficulty of the problem. Based on previous experimental work, it is often the case that partial information about the TRN is available. For example, certain TFs may be known to regulate a given gene or in other cases a connection may be predicted with a certain probability. In general, the biology of the problem indicates there will be very few connections between TFs and genes. Several methods have been proposed for estimating TRNs. However, they all suffer from problems such as unrealistic assumptions about prior knowledge of the network structure or computational limitations. We propose a new approach that can directly utilize prior information about the network structure in conjunction with observed gene expression data to estimate the TRN. Our approach uses $L_1$ penalties on the network to ensure a sparse structure. This has the advantage of being computationally efficient as well as making many fewer assumptions about the network structure. We use our methodology to construct the TRN for E. coli and show that the estimate is biologically sensible and compares favorably with previous estimates.
Submitted 8 November, 2010;
originally announced November 2010.
-
Reconstructing DNA copy number by penalized estimation and imputation
Authors:
Zhongyang Zhang,
Kenneth Lange,
Roel Ophoff,
Chiara Sabatti
Abstract:
Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variation (CNV). Fortunately, modern genotyping platforms also detect CNVs with fairly high reliability. Hidden Markov models and algorithms have played a dominant role in the interpretation of CNV data. Here we explore CNV reconstruction via estimation with a fused-lasso penalty as suggested by Tibshirani and Wang [Biostatistics 9 (2008) 18--29]. We mount a fresh attack on this difficult optimization problem by the following: (a) changing the penalty terms slightly by substituting a smooth approximation to the absolute value function, (b) designing and implementing a new MM (majorization--minimization) algorithm, and (c) applying a fast version of Newton's method to jointly update all model parameters. Together these changes enable us to minimize the fused-lasso criterion in a highly effective way. We also reframe the reconstruction problem in terms of imputation via discrete optimization. This approach is easier and more accurate than parameter estimation because it relies on the fact that only a handful of possible copy number states exist at each SNP. The dynamic programming framework has the added bonus of exploiting information that the current fused-lasso approach ignores. The accuracy of our imputations is comparable to that of hidden Markov models at a substantially lower computational cost.
Submitted 10 January, 2011; v1 submitted 11 June, 2009;
originally announced June 2009.
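Step (a) of the fused-lasso abstract above, replacing the absolute value with a smooth approximation, can be sketched as follows. The abstract does not state the exact criterion or approximation, so two assumptions are made here: the objective is the standard one-dimensional fused-lasso form (squared error, plus $λ_1$ times a sparsity penalty, plus $λ_2$ times a penalty on successive differences), and the approximation is $\sqrt{x^2+ε}$, one common choice. The names `smooth_abs` and `fused_lasso_criterion` are hypothetical, and the MM and Newton steps are not shown.

```python
import math


def smooth_abs(x, eps=1e-10):
    """Differentiable stand-in for |x| (assumed form: sqrt(x^2 + eps))."""
    return math.sqrt(x * x + eps)


def fused_lasso_criterion(y, b, lam1, lam2, eps=1e-10):
    """Smoothed 1-D fused-lasso objective (assumed standard form).

    0.5 * sum (y_i - b_i)^2
      + lam1 * sum smooth_abs(b_i)            # pulls b_i toward zero
      + lam2 * sum smooth_abs(b_i - b_{i-1})  # favors piecewise-constant b
    """
    fit = 0.5 * sum((y_i - b_i) ** 2 for y_i, b_i in zip(y, b))
    sparsity = lam1 * sum(smooth_abs(b_i, eps) for b_i in b)
    jumps = lam2 * sum(smooth_abs(b[i] - b[i - 1], eps) for i in range(1, len(b)))
    return fit + sparsity + jumps


if __name__ == "__main__":
    signal = [0.1, -0.2, 1.9, 2.1, 2.0, 0.0]
    flat_fit = [0.0, 0.0, 2.0, 2.0, 2.0, 0.0]
    print(fused_lasso_criterion(signal, flat_fit, lam1=0.1, lam2=0.5))
```

With `eps > 0` the criterion is differentiable everywhere, which is what makes Newton-type updates applicable; setting `eps=0.0` recovers the exact absolute-value penalties.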