Search | arXiv e-print repository

DeepROCK: Error-controlled interaction detection in deep neural networks

Authors: Winston Chen, William Stafford Noble, Yang Young Lu

Abstract: The complexity of deep neural networks (DNNs) makes them powerful but also makes them challenging to interpret, hindering their applicability in error-intolerant domains. Existing methods attempt to reason about the internal mechanism of DNNs by identifying feature interactions that influence prediction outcomes. However, such methods typically lack a systematic strategy to prioritize interactions while controlling confidence levels, making them difficult to apply in practice for scientific discovery and hypothesis validation. In this paper, we introduce a method, called DeepROCK, to address this limitation by using knockoffs, which are dummy variables that are designed to mimic the dependence structure of a given set of features while being conditionally independent of the response. Together with a novel DNN architecture involving a pairwise-coupling layer, DeepROCK jointly controls the false discovery rate (FDR) and maximizes statistical power. In addition, we identify a challenge in correctly controlling FDR using off-the-shelf feature interaction importance measures. DeepROCK overcomes this challenge by proposing a calibration procedure applied to existing interaction importance measures to make the FDR under control at a target level. Finally, we validate the effectiveness of DeepROCK through extensive experiments on simulated and real datasets.

Submitted 26 September, 2023; originally announced September 2023.

arXiv:2302.11837

Bounding the FDP in competition-based control of the FDR

Authors: Arya Ebadi, Dong Luo, Jack Freestone, William Stafford Noble, Uri Keich

Abstract: Competition-based approach to controlling the false discovery rate (FDR) recently rose to prominence when, generalizing it to sequential hypothesis testing, Barber and Candès used it as part of their knockoff-filter. Control of the FDR implies that the, arguably more important, false discovery proportion is only controlled in an average sense. We present TDC-SB and TDC-UB that provide upper prediction bounds on the FDP in the list of discoveries generated when controlling the FDR using competition. Using simulated and real data we show that, overall, our new procedures offer significantly tighter upper bounds than ones obtained using the recently published approach of Katsevich and Ramdas, even when the latter is further improved using the interpolation concept of Goeman et al.

Submitted 23 February, 2023; originally announced February 2023.

Comments: The original version of this paper appeared as arxiv:2011.11939v1. That version was split into two: one branch continuing as v2 & v3 of that original submission, and the other branch is now added here as a new submission

arXiv:2011.11939

Competition-based control of the false discovery proportion

Authors: Dong Luo, Arya Ebadi, Yilun He, Kristen Emery, William Stafford Noble, Uri Keich

Abstract: Recently, Barber and Candès laid the theoretical foundation for a general framework for false discovery rate (FDR) control based on the notion of "knockoffs." A closely related FDR control methodology has long been employed in the analysis of mass spectrometry data, referred to there as "target-decoy competition" (TDC). However, any approach that aims to control the FDR, which is defined as the expected value of the false discovery proportion (FDP), suffers from a problem. Specifically, even when successfully controlling the FDR at level $α$, the FDP in the list of discoveries can significantly exceed $α$. We offer FDP-SD, a new procedure that rigorously controls the FDP in the competition (knockoff / TDC) setup by guaranteeing that the FDP is bounded by $α$ at any desired confidence level. Compared with the just-published general framework of Katsevich and Ramdas, FDP-SD generally delivers more power and often substantially so in simulated as well as real data.

Submitted 14 March, 2022; v1 submitted 24 November, 2020; originally announced November 2020.

Comments: This revision focuses only on FDP-SD described in the original submission. A later submission will further develop the procedures for simultaneous bounds on the FDP
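
The competition setup that the two abstracts above build on can be illustrated with a minimal sketch. It shows the standard target-decoy (knockoff+ style) FDR threshold with the usual +1 correction, not the FDP-SD, TDC-SB, or TDC-UB procedures themselves. The function name `tdc_discoveries` and its arguments are hypothetical, chosen only for this illustration.

```python
import random


def tdc_discoveries(target_scores, decoy_scores, alpha):
    """Sketch of competition-based FDR control (illustrative, not FDP-SD).

    Each hypothesis has one target score and one decoy/knockoff score.
    Returns the indices of target-winning hypotheses accepted at level alpha.
    """
    # Head-to-head competition: keep the larger score and record the winner.
    competed = []
    for i, (t, d) in enumerate(zip(target_scores, decoy_scores)):
        if t > d:
            competed.append((t, True, i))
        elif d > t:
            competed.append((d, False, i))
        else:
            # Assumption: ties are broken by a fair coin flip.
            competed.append((t, random.random() < 0.5, i))

    # Rank the winning scores from best to worst.
    competed.sort(key=lambda item: item[0], reverse=True)

    # Find the longest prefix whose estimated FDR, (decoys + 1) / targets,
    # does not exceed alpha.
    targets = decoys = best_k = 0
    for k, (_, is_target, _) in enumerate(competed, start=1):
        if is_target:
            targets += 1
        else:
            decoys += 1
        if (decoys + 1) / max(targets, 1) <= alpha:
            best_k = k

    # Report only the target wins inside that prefix.
    return [i for _, is_target, i in competed[:best_k] if is_target]
```

A threshold chosen this way controls the FDR only in expectation, so the realized FDP of a given discovery list can still exceed alpha; this is the gap the abstracts above set out to address.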

arXiv:2002.00526

DANCE: Enhancing saliency maps using decoys

Authors: Yang Lu, Wenbo Guo, Xinyu Xing, William Stafford Noble

Abstract: Saliency methods can make deep neural network predictions more interpretable by identifying a set of critical features in an input sample, such as pixels that contribute most strongly to a prediction made by an image classifier. Unfortunately, recent evidence suggests that many saliency methods poorly perform, especially in situations where gradients are saturated, inputs contain adversarial perturbations, or predictions rely upon inter-feature dependence. To address these issues, we propose a framework that improves the robustness of saliency methods by following a two-step procedure. First, we introduce a perturbation mechanism that subtly varies the input sample without changing its intermediate representations. Using this approach, we can gather a corpus of perturbed data samples while ensuring that the perturbed and original input samples follow the same distribution. Second, we compute saliency maps for the perturbed samples and propose a new method to aggregate saliency maps. With this design, we offset the gradient saturation influence upon interpretation. From a theoretical perspective, we show the aggregated saliency map could not only capture inter-feature dependence but, more importantly, robustify interpretation against previously described adversarial perturbation methods. Following our theoretical analysis, we present experimental results suggesting that, both qualitatively and quantitatively, our saliency method outperforms existing methods.

Submitted 14 June, 2021; v1 submitted 2 February, 2020; originally announced February 2020.

arXiv:1907.01458

Multiple competition-based FDR control for peptide detection

Authors: Kristen Emery, Syamand Hasam, William Stafford Noble, Uri Keich

Abstract: Competition-based FDR control has been commonly used for over a decade in the computational mass spectrometry community (Elias and Gygi, 2007). Recently, the approach has gained significant popularity in other fields after Barber and Candes (2015) laid its theoretical foundation in a more general setting that included the feature selection problem. In both cases, the competition is based on a head-to-head comparison between an observed score and a corresponding decoy / knockoff. Keich and Noble (2017b) recently demonstrated some advantages of using multiple rather than a single decoy when addressing the problem of assigning peptide sequences to observed mass spectra. In this work, we consider a related problem -- detecting peptides based on a collection of mass spectra -- and we develop a new framework for competition-based FDR control using multiple null scores. Within this framework, we offer several methods, all of which are based on a novel procedure that rigorously controls the FDR in the finite sample setting. Using real data to study the peptide detection problem we show that, relative to existing single-decoy methods, our approach can increase the number of discovered peptides by up to 50% at small FDR thresholds.

Submitted 13 November, 2019; v1 submitted 2 July, 2019; originally announced July 2019.

Comments: Numerous changes from the initial submission including an expanded section on peptide detection (context/motivation and results), refocused and streamlined methods development section, revised and more selective figures reflecting the most recent analysis

arXiv:1906.03543

apricot: Submodular selection for data summarization in Python

Authors: Jacob Schreiber, Jeffrey Bilmes, William Stafford Noble

Abstract: We present apricot, an open source Python package for selecting representative subsets from large data sets using submodular optimization. The package implements an efficient greedy selection algorithm that offers strong theoretical guarantees on the quality of the selected set. Two submodular set functions are implemented in apricot: facility location, which is broadly applicable but requires memory quadratic in the number of examples in the data set, and a feature-based function that is less broadly applicable but can scale to millions of examples. Apricot is extremely efficient, using both algorithmic speedups such as the lazy greedy algorithm and code optimizers such as numba. We demonstrate the use of subset selection by training machine learning models to comparable accuracy using either the full data set or a representative subset thereof. This paper presents an explanation of submodular selection, an overview of the features in apricot, and an application to several data sets. The code and tutorial Jupyter notebooks are available at https://github.com/jmschrei/apricot

Submitted 8 June, 2019; originally announced June 2019.
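
The facility location objective mentioned in the abstract above can be sketched in plain Python. This is a naive greedy loop for illustration only; it does not use apricot's API and omits the lazy greedy and numba speedups the abstract describes. The function name `facility_location_greedy` is hypothetical, and the sketch assumes a precomputed square matrix of nonnegative similarities with `k` no larger than the number of examples.

```python
def facility_location_greedy(similarity, k):
    """Naive greedy maximization of the facility location function.

    similarity[i][j] is a nonnegative similarity between examples i and j
    (this is the quadratic-memory input noted in the abstract).
    Returns the indices of the k selected examples, in selection order.
    """
    n = len(similarity)
    # best[i] is the similarity of example i to its closest selected example.
    best = [0.0] * n
    selected = []

    for _ in range(k):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves each example's coverage.
            gain = sum(max(similarity[i][j] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        # Update coverage now that best_j is in the selected set.
        best = [max(best[i], similarity[i][best_j]) for i in range(n)]

    return selected
```

With two well-separated clusters, the first pick covers one cluster and the second pick moves to the other, since a second example from the same cluster adds almost no gain.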

arXiv:1809.01185

DeepPINK: reproducible feature selection in deep neural networks

Authors: Yang Young Lu, Yingying Fan, Jinchi Lv, William Stafford Noble

Abstract: Deep learning has become increasingly popular in both supervised and unsupervised machine learning thanks to its outstanding empirical performance. However, because of their intrinsic complexity, most deep learning methods are largely treated as black box tools with little interpretability. Even though recent attempts have been made to facilitate the interpretability of deep neural networks (DNNs), existing methods are susceptible to noise and lack of robustness. Therefore, scientists are justifiably cautious about the reproducibility of the discoveries, which is often related to the interpretability of the underlying statistical models. In this paper, we describe a method to increase the interpretability and reproducibility of DNNs by incorporating the idea of feature selection with controlled error rate. By designing a new DNN architecture and integrating it with the recently proposed knockoffs framework, we perform feature selection with a controlled error rate, while maintaining high power. This new method, DeepPINK (Deep feature selection using Paired-Input Nonlinear Knockoffs), is applied to both simulated and real data sets to demonstrate its empirical utility.

Submitted 6 September, 2018; v1 submitted 4 September, 2018; originally announced September 2018.

arXiv:1410.7875

Faster graphical model identification of tandem mass spectra using peptide word lattices

Authors: Shengjie Wang, John T. Halloran, Jeff A. Bilmes, William S. Noble

Abstract: Liquid chromatography coupled with tandem mass spectrometry, also known as shotgun proteomics, is a widely-used high-throughput technology for identifying proteins in complex biological samples. Analysis of the tens of thousands of fragmentation spectra produced by a typical shotgun proteomics experiment begins by assigning to each observed spectrum the peptide hypothesized to be responsible for generating the spectrum, typically done by searching each spectrum against a database of peptides. We have recently described a machine learning method, Dynamic Bayesian Network for Rapid Identification of Peptides (DRIP), that not only achieves state-of-the-art spectrum identification performance on a variety of datasets but also provides a trainable model capable of returning valuable auxiliary information regarding specific peptide-spectrum matches. In this work, we present two significant improvements to DRIP. First, we describe how to use word lattices, which are widely used in natural language processing, to significantly speed up DRIP's computations. To our knowledge, all existing shotgun proteomics search engines compute independent scores between a given observed spectrum and each possible candidate peptide from the database. The key idea of the word lattice is to represent the set of candidate peptides in a single data structure, thereby allowing sharing of redundant computations among the different candidates. We demonstrate that using lattices in conjunction with DRIP leads to speedups on the order of tens across yeast and worm data sets. Second, we introduce a variant of DRIP that uses a discriminative training framework, performing maximum mutual entropy estimation rather than maximum likelihood estimation. This modification improves DRIP's statistical power, enabling us to increase the number of identified spectra at a 1% false discovery rate on yeast and worm data sets.

Submitted 29 October, 2014; originally announced October 2014.

arXiv:1210.4904

Spectrum Identification using a Dynamic Bayesian Network Model of Tandem Mass Spectra

Authors: Ajit P. Singh, John Halloran, Jeff A. Bilmes, Katrin Kirchoff, William S. Noble

Abstract: Shotgun proteomics is a high-throughput technology used to identify unknown proteins in a complex mixture. At the heart of this process is a prediction task, the spectrum identification problem, in which each fragmentation spectrum produced by a shotgun proteomics experiment must be mapped to the peptide (protein subsequence) which generated the spectrum. We propose a new algorithm for spectrum identification, based on dynamic Bayesian networks, which significantly outperforms the de-facto standard tools for this task: SEQUEST and Mascot.

Submitted 16 October, 2012; originally announced October 2012.

Comments: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

Report number: UAI-P-2012-PG-775-785

arXiv:1207.5848

On the feasibility and utility of exploiting real time database search to improve adaptive peak selection

Authors: Benjamin J. Diament, Michael J. MacCoss, William Stafford Noble

Abstract: Rationale: In a shotgun proteomics experiment with data-dependent acquisition, real-time analysis of a precursor scan results in selection of a handful of peaks for subsequent isolation, fragmentation and secondary scanning. This peak selection protocol typically focuses on the most abundant peaks in the precursor scan, while attempting to avoid re-sampling the same m/z values in rapid succession. The protocol does not, however, incorporate analysis of previous fragmentation scans into the peak selection procedure. Methods: In this work, we investigate the feasibility and utility of incorporating analysis of previous fragmentation scans into the peak selection protocol. We demonstrate that real-time identification of fragmentation spectra is feasible in principle, and we investigate, via simulations, several strategies to make use of the resulting peptide identifications during peak selection. Results: Our simulations fail to provide evidence that peptide identifications can provide a large improvement in the total number of peptides identified by a shotgun proteomics experiment. Conclusions: These results are significant because they point out the feasibility of using peptide identifications during peak selection, and because our experiments may provide a starting point for others working in this direction.

Submitted 24 July, 2012; originally announced July 2012.

arXiv:q-bio/0610040

Metric learning pairwise kernel for graph inference

Authors: Jean-Philippe Vert, Jian Qiu, William Stafford Noble

Abstract: Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel (MLPK). We demonstrate, using several real biological networks, that this direct approach often improves upon the state-of-the-art SVM for indirect inference with the tensor product pairwise kernel.

Submitted 21 October, 2006; originally announced October 2006.

Showing 1–11 of 11 results for author: Noble, W S