-
Not all tickets are equal and we know it: Guiding pruning with domain-specific knowledge
Authors:
Intekhab Hossain,
Jonas Fischer,
Rebekka Burkholz,
John Quackenbush
Abstract:
Neural structure learning is of paramount importance for scientific discovery and interpretability. Yet, contemporary pruning algorithms that focus on computational resource efficiency face algorithmic barriers to select a meaningful model that aligns with domain expertise. To mitigate this challenge, we propose DASH, which guides pruning by available domain-specific structural information. In the…
▽ More
Neural structure learning is of paramount importance for scientific discovery and interpretability. Yet, contemporary pruning algorithms that focus on computational resource efficiency face algorithmic barriers to select a meaningful model that aligns with domain expertise. To mitigate this challenge, we propose DASH, which guides pruning by available domain-specific structural information. In the context of learning dynamic gene regulatory network models, we show that DASH combined with existing general knowledge on interaction partners provides data-specific insights aligned with biology. For this task, we show on synthetic data with ground truth information and two real world applications the effectiveness of DASH, which outperforms competing methods by a large margin and provides more meaningful biological insights. Our work shows that domain specific structural information bears the potential to improve model-derived scientific insights.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
Scaling up Continuous-Time Markov Chains Helps Resolve Underspecification
Authors:
Alkis Gotovos,
Rebekka Burkholz,
John Quackenbush,
Stefanie Jegelka
Abstract:
Modeling the time evolution of discrete sets of items (e.g., genetic mutations) is a fundamental problem in many biomedical applications. We approach this problem through the lens of continuous-time Markov chains, and show that the resulting learning task is generally underspecified in the usual setting of cross-sectional data. We explore a perhaps surprising remedy: including a number of addition…
▽ More
Modeling the time evolution of discrete sets of items (e.g., genetic mutations) is a fundamental problem in many biomedical applications. We approach this problem through the lens of continuous-time Markov chains, and show that the resulting learning task is generally underspecified in the usual setting of cross-sectional data. We explore a perhaps surprising remedy: including a number of additional independent items can help determine time order, and hence resolve underspecification. This is in sharp contrast to the common practice of limiting the analysis to a small subset of relevant items, which is followed largely due to poor scaling of existing methods. To put our theoretical insight into practice, we develop an approximate likelihood maximization method for learning continuous-time Markov chains, which can scale to hundreds of items and is orders of magnitude faster than previous methods. We demonstrate the effectiveness of our approach on synthetic and real cancer data.
△ Less
Submitted 6 July, 2021;
originally announced July 2021.
-
Cascade Size Distributions: Why They Matter and How to Compute Them Efficiently
Authors:
Rebekka Burkholz,
John Quackenbush
Abstract:
Cascade models are central to understanding, predicting, and controlling epidemic spreading and information propagation. Related optimization, including influence maximization, model parameter inference, or the development of vaccination strategies, relies heavily on sampling from a model. This is either inefficient or inaccurate. As alternative, we present an efficient message passing algorithm t…
▽ More
Cascade models are central to understanding, predicting, and controlling epidemic spreading and information propagation. Related optimization, including influence maximization, model parameter inference, or the development of vaccination strategies, relies heavily on sampling from a model. This is either inefficient or inaccurate. As alternative, we present an efficient message passing algorithm that computes the probability distribution of the cascade size for the Independent Cascade Model on weighted directed networks and generalizations. Our approach is exact on trees but can be applied to any network topology. It approximates locally tree-like networks well, scales to large networks, and can lead to surprisingly good performance on more dense networks, as we also exemplify on real world data.
△ Less
Submitted 16 December, 2020; v1 submitted 9 September, 2019;
originally announced September 2019.
-
Network-based Distance Metric with Application to Discover Disease Subtypes in Cancer
Authors:
Jipeng Qiang,
Wei Ding,
John Quackenbush,
** Chen
Abstract:
While we once thought of cancer as single monolithic diseases affecting a specific organ site, we now understand that there are many subtypes of cancer defined by unique patterns of gene mutations. These gene mutational data, which can be more reliably obtained than gene expression data, help to determine how the subtypes develop, evolve, and respond to therapies. Different from dense continuous-v…
▽ More
While we once thought of cancer as single monolithic diseases affecting a specific organ site, we now understand that there are many subtypes of cancer defined by unique patterns of gene mutations. These gene mutational data, which can be more reliably obtained than gene expression data, help to determine how the subtypes develop, evolve, and respond to therapies. Different from dense continuous-value gene expression data, which most existing cancer subtype discovery algorithms use, somatic mutational data are extremely sparse and heterogeneous, because there are less than 0.5\% mutated genes in discrete value 1/0 out of 20,000 human protein-coding genes, and identical mutated genes are rarely shared by cancer patients.
Our focus is to search for cancer subtypes from extremely sparse and high dimensional gene mutational data in discrete 1 and 0 values using unsupervised learning. We propose a new network-based distance metric. We project cancer patients' mutational profile into their gene network structure and measure the distance between two patients using the similarity between genes and between the gene vertexes of the patients in the network. Experimental results in synthetic data and real-world data show that our approach outperforms the top competitors in cancer subtype discovery. Furthermore, our approach can identify cancer subtypes that cannot be detected by other clustering algorithms in real cancer data.
△ Less
Submitted 28 February, 2017;
originally announced March 2017.