-
Not all tickets are equal and we know it: Guiding pruning with domain-specific knowledge
Authors:
Intekhab Hossain,
Jonas Fischer,
Rebekka Burkholz,
John Quackenbush
Abstract:
Neural structure learning is of paramount importance for scientific discovery and interpretability. Yet, contemporary pruning algorithms that focus on computational resource efficiency face algorithmic barriers to select a meaningful model that aligns with domain expertise. To mitigate this challenge, we propose DASH, which guides pruning by available domain-specific structural information. In the context of learning dynamic gene regulatory network models, we show that DASH combined with existing general knowledge on interaction partners provides data-specific insights aligned with biology. For this task, we show on synthetic data with ground truth information and two real world applications the effectiveness of DASH, which outperforms competing methods by a large margin and provides more meaningful biological insights. Our work shows that domain specific structural information bears the potential to improve model-derived scientific insights.
Submitted 5 March, 2024;
originally announced March 2024.
-
Scaling up Continuous-Time Markov Chains Helps Resolve Underspecification
Authors:
Alkis Gotovos,
Rebekka Burkholz,
John Quackenbush,
Stefanie Jegelka
Abstract:
Modeling the time evolution of discrete sets of items (e.g., genetic mutations) is a fundamental problem in many biomedical applications. We approach this problem through the lens of continuous-time Markov chains, and show that the resulting learning task is generally underspecified in the usual setting of cross-sectional data. We explore a perhaps surprising remedy: including a number of additional independent items can help determine time order, and hence resolve underspecification. This is in sharp contrast to the common practice of limiting the analysis to a small subset of relevant items, which is followed largely due to poor scaling of existing methods. To put our theoretical insight into practice, we develop an approximate likelihood maximization method for learning continuous-time Markov chains, which can scale to hundreds of items and is orders of magnitude faster than previous methods. We demonstrate the effectiveness of our approach on synthetic and real cancer data.
Submitted 6 July, 2021;
originally announced July 2021.
-
DRAGON: Determining Regulatory Associations using Graphical models on multi-Omic Networks
Authors:
Katherine H. Shutta,
Deborah Weighill,
Rebekka Burkholz,
Marouen Ben Guebila,
Dawn L. DeMeo,
Helena U. Zacharias,
John Quackenbush,
Michael Altenbuchinger
Abstract:
The increasing quantity of multi-omics data, such as methylomic and transcriptomic profiles, collected on the same specimen, or even on the same cell, provides a unique opportunity to explore the complex interactions that define cell phenotype and govern cellular responses to perturbations. We propose a network approach based on Gaussian Graphical Models (GGMs) that facilitates the joint analysis of paired omics data. This method, called DRAGON (Determining Regulatory Associations using Graphical models on multi-Omic Networks), calibrates its parameters to achieve an optimal trade-off between the network's complexity and estimation accuracy, while explicitly accounting for the characteristics of each of the assessed omics "layers." In simulation studies, we show that DRAGON adapts to edge density and feature size differences between omics layers, improving model inference and edge recovery compared to state-of-the-art methods. We further demonstrate in an analysis of joint transcriptome-methylome data from TCGA breast cancer specimens that DRAGON can identify key molecular mechanisms such as gene regulation via promoter methylation. In particular, we identify Transcription Factor AP-2 Beta (TFAP2B) as a potential multi-omic biomarker for basal-type breast cancer. DRAGON is available as open-source code in Python through the Network Zoo package (netZooPy v0.8; netzoo.github.io).
Submitted 21 September, 2022; v1 submitted 4 April, 2021;
originally announced April 2021.
-
Gene targeting in disease networks
Authors:
Deborah Weighill,
Marouen Ben Guebila,
Kimberly Glass,
John Platig,
Jen Jen Yeh,
John Quackenbush
Abstract:
Profiling of whole transcriptomes has become a cornerstone of molecular biology and an invaluable tool for the characterization of clinical phenotypes and the identification of disease subtypes. Analyses of these data are becoming ever more sophisticated as we move beyond simple comparisons to consider networks of higher-order interactions and associations. Gene regulatory networks model the regulatory relationships of transcription factors and genes and have allowed the identification of differentially regulated processes in disease systems. In this perspective we discuss gene targeting scores, which measure changes in inferred regulatory network interactions, and their use in identifying disease-relevant processes. In addition, we present an example analysis of pancreatic ductal adenocarcinoma demonstrating the power of gene targeting scores to identify differential processes between complex phenotypes; processes which would have been missed by only performing differential expression analysis. This example demonstrates that gene targeting scores are an invaluable addition to gene expression analysis in the characterization of diseases and other complex phenotypes.
Submitted 11 January, 2021;
originally announced January 2021.
-
The importance of transparency and reproducibility in artificial intelligence research
Authors:
Benjamin Haibe-Kains,
George Alexandru Adam,
Ahmed Hosny,
Farnoosh Khodakarami,
MAQC Society Board,
Levi Waldron,
Bo Wang,
Chris McIntosh,
Anshul Kundaje,
Casey S. Greene,
Michael M. Hoffman,
Jeffrey T. Leek,
Wolfgang Huber,
Alvis Brazma,
Joelle Pineau,
Robert Tibshirani,
Trevor Hastie,
John P. A. Ioannidis,
John Quackenbush,
Hugo J. W. L. Aerts
Abstract:
In their study, McKinney et al. showed the high potential of artificial intelligence for breast cancer screening. However, the lack of detailed methods and computer code undermines its scientific value. We identify obstacles hindering transparent and reproducible AI research as faced by McKinney et al. and provide solutions with implications for the broader field.
Submitted 7 March, 2020; v1 submitted 28 February, 2020;
originally announced March 2020.
-
Cascade Size Distributions: Why They Matter and How to Compute Them Efficiently
Authors:
Rebekka Burkholz,
John Quackenbush
Abstract:
Cascade models are central to understanding, predicting, and controlling epidemic spreading and information propagation. Related optimization, including influence maximization, model parameter inference, or the development of vaccination strategies, relies heavily on sampling from a model. This is either inefficient or inaccurate. As an alternative, we present an efficient message passing algorithm that computes the probability distribution of the cascade size for the Independent Cascade Model on weighted directed networks and generalizations. Our approach is exact on trees but can be applied to any network topology. It approximates locally tree-like networks well, scales to large networks, and can lead to surprisingly good performance on denser networks, as we also exemplify on real-world data.
Submitted 16 December, 2020; v1 submitted 9 September, 2019;
originally announced September 2019.
-
Network-based Distance Metric with Application to Discover Disease Subtypes in Cancer
Authors:
Jipeng Qiang,
Wei Ding,
John Quackenbush,
Ping Chen
Abstract:
While we once thought of cancer as a single monolithic disease affecting a specific organ site, we now understand that there are many subtypes of cancer defined by unique patterns of gene mutations. These gene mutational data, which can be more reliably obtained than gene expression data, help to determine how the subtypes develop, evolve, and respond to therapies. Unlike the dense, continuous-valued gene expression data that most existing cancer subtype discovery algorithms use, somatic mutational data are extremely sparse and heterogeneous: fewer than 0.5% of the 20,000 human protein-coding genes are mutated (encoded as discrete 1/0 values), and identical mutated genes are rarely shared by cancer patients.
Our focus is to search for cancer subtypes from extremely sparse and high dimensional gene mutational data in discrete 1 and 0 values using unsupervised learning. We propose a new network-based distance metric. We project cancer patients' mutational profile into their gene network structure and measure the distance between two patients using the similarity between genes and between the gene vertexes of the patients in the network. Experimental results in synthetic data and real-world data show that our approach outperforms the top competitors in cancer subtype discovery. Furthermore, our approach can identify cancer subtypes that cannot be detected by other clustering algorithms in real cancer data.
Submitted 28 February, 2017;
originally announced March 2017.
-
PyPanda: a Python Package for Gene Regulatory Network Reconstruction
Authors:
David G. P. van IJzendoorn,
Kimberly Glass,
John Quackenbush,
Marieke L. Kuijjer
Abstract:
PANDA (Passing Attributes between Networks for Data Assimilation) is a gene regulatory network inference method that uses message-passing to integrate multiple sources of 'omics data. PANDA was originally coded in C++. In this application note we describe PyPanda, the Python version of PANDA. PyPanda runs considerably faster than the C++ version and includes additional features for network analysis. Availability: The open source PyPanda Python package is freely available at https://github.com/davidvi/pypanda. Contact: d.g.p.van [email protected]
Submitted 12 July, 2016; v1 submitted 22 April, 2016;
originally announced April 2016.
-
Bipartite Community Structure of eQTLs
Authors:
John Platig,
Peter Castaldi,
Dawn DeMeo,
John Quackenbush
Abstract:
Genome Wide Association Studies (GWAS) and eQTL analyses have produced a large and growing number of genetic associations linked to a wide range of human phenotypes. As of 2013, there were more than 11,000 SNPs associated with a trait as reported in the NHGRI GWAS Catalog. However, interpreting the functional roles played by these SNPs remains a challenge. Here we describe an approach that uses the inherent bipartite structure of eQTL networks to place SNPs into a functional context.
Using genotyping and gene expression data from 163 lung tissue samples in a study of Chronic Obstructive Pulmonary Disease (COPD) we calculated eQTL associations between SNPs and genes and cast significant associations (FDR $< 0.1$) as links in a bipartite network. To our surprise, we discovered that the highly-connected "hub" SNPs within the network were devoid of disease-associations. However, within the network we identified 35 highly modular communities, which comprise groups of SNPs associated with groups of genes; 13 of these communities were significantly enriched for distinct biological functions ($P < 5 \times 10^{-4}$) including COPD-related functions. Further, we found that GWAS-significant SNPs were enriched at the cores of these communities, including previously identified GWAS associations for COPD, asthma, and pulmonary function, among others. These results speak to our intuition: rather than single SNPs influencing single genes, we see groups of SNPs associated with the expression of families of functionally related genes and that disease SNPs are associated with the perturbation of those functions. These methods are not limited in their application to COPD and can be used in the analysis of a wide variety of disease processes and other phenotypic traits.
Submitted 9 September, 2015;
originally announced September 2015.
-
High Performance Computing of Gene Regulatory Networks using a Message-Passing Model
Authors:
Kimberly Glass,
John Quackenbush,
Jeremy Kepner
Abstract:
Gene regulatory network reconstruction is a fundamental problem in computational biology. We recently developed an algorithm, called PANDA (Passing Attributes Between Networks for Data Assimilation), that integrates multiple sources of 'omics data and estimates regulatory network models. This approach was initially implemented in the C++ programming language and has since been applied to a number of biological systems. In our current research we are beginning to expand the algorithm to incorporate larger and more diverse data-sets, to reconstruct networks that contain increasing numbers of elements, and to build not only single network models, but sets of networks. In order to accomplish these "Big Data" applications, it has become critical that we increase the computational efficiency of the PANDA implementation. In this paper we show how to recast PANDA's similarity equations as matrix operations. This allows us to implement a highly readable version of the algorithm using the MATLAB/Octave programming language. We find that the resulting M-code is much shorter (103 compared to 1128 lines) and more easily modifiable for potential future applications. The new implementation also runs significantly faster, with increasing efficiency as the network models increase in size. Tests comparing the C-code and M-code versions of PANDA demonstrate that the M-code is on the order of 20-80 times faster for networks of similar dimensions to those we find in current biological applications.
Submitted 24 July, 2015;
originally announced July 2015.
-
Estimating sample-specific regulatory networks
Authors:
Marieke Lydia Kuijjer,
Matthew Tung,
GuoCheng Yuan,
John Quackenbush,
Kimberly Glass
Abstract:
Biological systems are driven by intricate interactions among the complex array of molecules that comprise the cell. Many methods have been developed to reconstruct network models of those interactions. These methods often draw on large numbers of samples with measured gene expression profiles to infer connections between genes (or gene products). The result is an aggregate network model representing a single estimate for the likelihood of each interaction, or "edge," in the network. While informative, aggregate models fail to capture the heterogeneity that is represented in any population. Here we propose a method to reverse engineer sample-specific networks from aggregate network models. We demonstrate the accuracy and applicability of our approach in several data sets, including simulated data, microarray expression data from synchronized yeast cells, and RNA-seq data collected from human lymphoblastoid cell lines. We show that these sample-specific networks can be used to study changes in network topology across time and to characterize shifts in gene regulation that may not be apparent in expression data. We believe the ability to generate sample-specific networks will greatly facilitate the application of network methods to the increasingly large, complex, and heterogeneous multi-omic data sets that are currently being generated, and ultimately support the emerging field of precision network medicine.
Submitted 28 June, 2018; v1 submitted 24 May, 2015;
originally announced May 2015.