Search | arXiv e-print repository

Bayesian compositional regression with flexible microbiome feature aggregation and selection

Authors: Satabdi Saha, Liangliang Zhang, Kim-Anh Do, Christine B. Peterson

Abstract: Ongoing advances in microbiome profiling have allowed unprecedented insights into the molecular activities of microbial communities. This has fueled a strong scientific interest in understanding the critical role the microbiome plays in governing human health, by identifying microbial features associated with clinical outcomes of interest. Several aspects of microbiome data limit the applicability… ▽ More Ongoing advances in microbiome profiling have allowed unprecedented insights into the molecular activities of microbial communities. This has fueled a strong scientific interest in understanding the critical role the microbiome plays in governing human health, by identifying microbial features associated with clinical outcomes of interest. Several aspects of microbiome data limit the applicability of existing variable selection approaches. In particular, microbiome data are high-dimensional, extremely sparse, and compositional. Importantly, many of the observed features, although categorized as different taxa, may play related functional roles. To address these challenges, we propose a novel compositional regression approach that leverages the data-adaptive clustering and variable selection properties of the spiked Dirichlet process to identify taxa that exhibit similar functional roles. Our proposed method, Bayesian Regression with Agglomerated Compositional Effects using a dirichLET process (BRACElet), enables the identification of a sparse set of features with shared impacts on the outcome, facilitating dimension reduction and model interpretation. We demonstrate that BRACElet outperforms existing approaches for microbiome variable selection through simulation studies and an application elucidating the impact of oral microbiome composition on insulin resistance. △ Less

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2309.08109 [pdf, other]

CAT: a conditional association test for microbiome data using a leave-out approach

Authors: Yushu Shi, Liangliang Zhang, Kim-Anh Do, Robert R. Jenq, Christine B. Peterson

Abstract: In microbiome analysis, researchers often seek to identify taxonomic features associated with an outcome of interest. However, microbiome features are intercorrelated and linked by phylogenetic relationships, making it challenging to assess the association between an individual feature and an outcome. Researchers have developed global tests for the association of microbiome profiles with outcomes… ▽ More In microbiome analysis, researchers often seek to identify taxonomic features associated with an outcome of interest. However, microbiome features are intercorrelated and linked by phylogenetic relationships, making it challenging to assess the association between an individual feature and an outcome. Researchers have developed global tests for the association of microbiome profiles with outcomes using beta diversity metrics which offer robustness to extreme values and can incorporate information on the phylogenetic tree structure. Despite the popularity of global association testing, most existing methods for follow-up testing of individual features only consider the marginal effect and do not provide relevant information for the design of microbiome interventions. This paper proposes a novel conditional association test, CAT, which can account for other features and phylogenetic relatedness when testing the association between a feature and an outcome. CAT adopts a leave-out method, measuring the importance of a feature in predicting the outcome by removing that feature from the data and quantifying how much the association with the outcome is weakened through the change in the coefficient of determination. By leveraging global tests including PERMANOVA and MiRKAT-based methods, CAT allows association testing for continuous, binary, categorical, count, survival, and correlated outcomes. Our simulation and real data application results illustrate the potential of CAT to inform the design of microbiome interventions aimed at improving clinical outcomes. △ Less

Submitted 14 September, 2023; originally announced September 2023.

arXiv:2308.13737 [pdf, other]

survivalContour: Visualizing predicted survival via colored contour plots

Authors: Yushu Shi, Liangliang Zhang, Kim-Anh Do, Robert R. Jenq, Christine B. Peterson

Abstract: Advances in survival analysis have facilitated unprecedented flexibility in data modeling, yet there remains a lack of tools for graphically illustrating the influence of continuous covariates on predicted survival outcomes. We propose the utilization of a colored contour plot to depict the predicted survival probabilities over time, and provide a Shiny app and R package as implementations of this… ▽ More Advances in survival analysis have facilitated unprecedented flexibility in data modeling, yet there remains a lack of tools for graphically illustrating the influence of continuous covariates on predicted survival outcomes. We propose the utilization of a colored contour plot to depict the predicted survival probabilities over time, and provide a Shiny app and R package as implementations of this tool. Our approach is capable of supporting conventional models, including the Cox and Fine-Gray models. However, its capability shines when coupled with cutting-edge machine learning models such as random survival forests and deep neural networks. △ Less

Submitted 12 January, 2024; v1 submitted 25 August, 2023; originally announced August 2023.

arXiv:2207.14753 [pdf, other]

Estimating Causal Effects with Hidden Confounding using Instrumental Variables and Environments

Authors: James P. Long, Hongxu Zhu, Kim-Anh Do, Min ** Ha

Abstract: Recent works have proposed regression models which are invariant across data collection environments. These estimators often have a causal interpretation under conditions on the environments and type of invariance imposed. One recent example, the Causal Dantzig (CD), is consistent under hidden confounding and represents an alternative to classical instrumental variable estimators such as Two Stage… ▽ More Recent works have proposed regression models which are invariant across data collection environments. These estimators often have a causal interpretation under conditions on the environments and type of invariance imposed. One recent example, the Causal Dantzig (CD), is consistent under hidden confounding and represents an alternative to classical instrumental variable estimators such as Two Stage Least Squares (TSLS). In this work we derive the CD as a generalized method of moments (GMM) estimator. The GMM representation leads to several practical results, including 1) creation of the Generalized Causal Dantzig (GCD) estimator which can be applied to problems with continuous environments where the CD cannot be fit 2) a Hybrid (GCD-TSLS combination) estimator which has properties superior to GCD or TSLS alone 3) straightforward asymptotic results for all methods using GMM theory. We compare the CD, GCD, TSLS, and Hybrid estimators in simulations and an application to a Flow Cytometry data set. The newly proposed GCD and Hybrid estimators have superior performance to existing methods in many settings. △ Less

Submitted 9 November, 2023; v1 submitted 29 July, 2022; originally announced July 2022.

Comments: 32 pages, 7 figures, 4 tables

arXiv:2207.09991 [pdf, other]

Causal Models, Prediction, and Extrapolation in Cell Line Perturbation Experiments

Authors: James P. Long, Yumeng Yang, Kim-Anh Do

Abstract: In cell line perturbation experiments, a collection of cells is perturbed with external agents (e.g. drugs) and responses such as protein expression measured. Due to cost constraints, only a small fraction of all possible perturbations can be tested in vitro. This has led to the development of computational (in silico) models which can predict cellular responses to perturbations. Perturbations wit… ▽ More In cell line perturbation experiments, a collection of cells is perturbed with external agents (e.g. drugs) and responses such as protein expression measured. Due to cost constraints, only a small fraction of all possible perturbations can be tested in vitro. This has led to the development of computational (in silico) models which can predict cellular responses to perturbations. Perturbations with clinically interesting predicted responses can be prioritized for in vitro testing. In this work, we compare causal and non-causal regression models for perturbation response prediction in a Melanoma cancer cell line. The current best performing method on this data set is Cellbox which models how proteins causally effect each other using a system of ordinary differential equations (ODEs). We derive a closed form solution to the Cellbox system of ODEs in the linear case. These analytic results facilitate comparison of Cellbox to regression approaches. We show that causal models such as Cellbox, while requiring more assumptions, enable extrapolation in ways that non-causal regression models cannot. For example, causal models can predict responses for never before tested drugs. We illustrate these strengths and weaknesses in simulations. In an application to the Melanoma cell line data, we find that regression models outperform the Cellbox causal model. △ Less

Submitted 20 July, 2022; originally announced July 2022.

Comments: 13 pages, 4 figures

arXiv:2011.06061 [pdf, other]

A Framework for Mediation Analysis with Multiple Exposures, Multivariate Mediators, and Non-Linear Response Models

Authors: James P. Long, Ehsan Irajizad, James D. Doecke, Kim-Anh Do, Min ** Ha

Abstract: Mediation analysis seeks to identify and quantify the paths by which an exposure affects an outcome. Intermediate variables which are effected by the exposure and which effect the outcome are known as mediators. There exists extensive work on mediation analysis in the context of models with a single mediator and continuous and binary outcomes. However these methods are often not suitable for multi… ▽ More Mediation analysis seeks to identify and quantify the paths by which an exposure affects an outcome. Intermediate variables which are effected by the exposure and which effect the outcome are known as mediators. There exists extensive work on mediation analysis in the context of models with a single mediator and continuous and binary outcomes. However these methods are often not suitable for multi-omic data that include highly interconnected variables measuring biological mechanisms and various types of outcome variables such as censored survival responses. In this article, we develop a general framework for causal mediation analysis with multiple exposures, multivariate mediators, and continuous, binary, and survival responses. We estimate mediation effects on several scales including the mean difference, odds ratio, and restricted mean scale as appropriate for various outcome models. Our estimation method avoids imposing constraints on model parameters such as the rare disease assumption while accommodating continuous exposures. We evaluate the framework and compare it to other methods in extensive simulation studies by assessing bias, type I error and power at a range of sample sizes, disease prevalences, and number of false mediators. Using Kidney Renal Clear Cell Carcinoma data from The Cancer Genome Atlas, we identify proteins which mediate the effect of metabolic gene expression on survival. Software for implementing this unified framework is made available in an R package (https://github.com/longjp/mediateR). △ Less

Submitted 11 November, 2020; originally announced November 2020.

Comments: 17 pages, 5 figures

arXiv:2007.15812 [pdf, other]

Sparse tree-based clustering of microbiome data to characterize microbiome heterogeneity in pancreatic cancer

Authors: Yushu Shi, Liangliang Zhang, Kim-Anh Do, Robert Jenq, Christine Peterson

Abstract: There is a keen interest in characterizing variation in the microbiome across cancer patients, given increasing evidence of its important role in determining treatment outcomes. Here our goal is to discover subgroups of patients with similar microbiome profiles. We propose a novel unsupervised clustering approach in the Bayesian framework that innovates over existing model-based clustering approac… ▽ More There is a keen interest in characterizing variation in the microbiome across cancer patients, given increasing evidence of its important role in determining treatment outcomes. Here our goal is to discover subgroups of patients with similar microbiome profiles. We propose a novel unsupervised clustering approach in the Bayesian framework that innovates over existing model-based clustering approaches, such as the Dirichlet multinomial mixture model, in three key respects: we incorporate feature selection, learn the appropriate number of clusters from the data, and integrate information on the tree structure relating the observed features. We compare the performance of our proposed method to existing methods on simulated data designed to mimic real microbiome data. We then illustrate results obtained for our motivating data set, a clinical study aimed at characterizing the tumor microbiome of pancreatic cancer patients. △ Less

Submitted 2 December, 2022; v1 submitted 30 July, 2020; originally announced July 2020.

arXiv:2005.01169 [pdf, ps, other]

ProgPermute: Progressive permutation for a dynamic representation of the robustness of microbiome discoveries

Authors: Liangliang Zhang, Yushu Shi, Kim-Anh Do, Christine B. Peterson, Robert R. Jenq

Abstract: Identification of features is a critical task in microbiome studies that is complicated by the fact that microbial data are high dimensional and heterogeneous. Masked by the complexity of the data, the problem of separating signals from noise becomes challenging and troublesome. For instance, when performing differential abundance tests, multiple testing adjustments tend to be overconservative, as… ▽ More Identification of features is a critical task in microbiome studies that is complicated by the fact that microbial data are high dimensional and heterogeneous. Masked by the complexity of the data, the problem of separating signals from noise becomes challenging and troublesome. For instance, when performing differential abundance tests, multiple testing adjustments tend to be overconservative, as the probability of a type I error (false positive) increases dramatically with the large numbers of hypotheses. Moreover, the grou** effect of interest can be obscured by heterogeneity. These factors can incorrectly lead to the conclusion that there are no differences in the microbiome compositions. We translate and represent the problem of identifying differential features as a dynamic layout of separating the signal from its random background. We propose progressive permutation as a method to achieve this process and show converging patterns. More specifically, we progressively permute the grou** factor labels of the microbiome samples and perform multiple differential abundance tests in each scenario. We then compare the signal strength of the top features from the original data with their performance in permutations, and observe an apparent decreasing trend if these top features are true positives identified from the data. We have developed this into a user-friendly RShiny tool and R package, which consist of functions that can convey the overall association between the microbiome and the grou** factor, rank the robustness of the discovered microbes, and list the discoveries, their effect sizes, and individual abundances. △ Less

Submitted 17 September, 2020; v1 submitted 3 May, 2020; originally announced May 2020.

Comments: 8 pages, 6 figures

arXiv:1908.09961 [pdf, other]

Theory and Evaluation Metrics for Learning Disentangled Representations

Authors: Kien Do, Truyen Tran

Abstract: We make two theoretical contributions to disentanglement learning by (a) defining precise semantics of disentangled representations, and (b) establishing robust metrics for evaluation. First, we characterize the concept "disentangled representations" used in supervised and unsupervised methods along three dimensions-informativeness, separability and interpretability - which can be expressed and qu… ▽ More We make two theoretical contributions to disentanglement learning by (a) defining precise semantics of disentangled representations, and (b) establishing robust metrics for evaluation. First, we characterize the concept "disentangled representations" used in supervised and unsupervised methods along three dimensions-informativeness, separability and interpretability - which can be expressed and quantified explicitly using information-theoretic constructs. This helps explain the behaviors of several well-known disentanglement learning models. We then propose robust metrics for measuring informativeness, separability and interpretability. Through a comprehensive suite of experiments, we show that our metrics correctly characterize the representations learned by different methods and are consistent with qualitative (visual) results. Thus, the metrics allow disentanglement learning methods to be compared on a fair ground. We also empirically uncovered new interesting properties of VAE-based methods and interpreted them with our formulation. These findings are promising and hopefully will encourage the design of more theoretically driven models for learning disentangled representations. △ Less

Submitted 18 March, 2021; v1 submitted 26 August, 2019; originally announced August 2019.

arXiv:1811.05405 [pdf, ps, other]

doi 10.1093/bioinformatics/btz636

NExUS: Bayesian simultaneous network estimation across unequal sample sizes

Authors: Priyam Das, Christine Peterson, Kim-Anh Do, Rehan Akbani, Veerabhadran Baladandayuthapani

Abstract: Network-based analyses of high-throughput genomics data provide a holistic, systems-level understanding of various biological mechanisms for a common population. However, when estimating multiple networks across heterogeneous sub-populations, varying sample sizes pose a challenge in the estimation and inference, as network differences may be driven by differences in power. We are particularly inte… ▽ More Network-based analyses of high-throughput genomics data provide a holistic, systems-level understanding of various biological mechanisms for a common population. However, when estimating multiple networks across heterogeneous sub-populations, varying sample sizes pose a challenge in the estimation and inference, as network differences may be driven by differences in power. We are particularly interested in addressing this challenge in the context of proteomic networks for related cancers, as the number of subjects available for rare cancer (sub-)types is often limited. We develop NExUS (Network Estimation across Unequal Sample sizes), a Bayesian method that enables joint learning of multiple networks while avoiding artefactual relationship between sample size and network sparsity. We demonstrate through simulations that NExUS outperforms existing network estimation methods in this context, and apply it to learn network similarity and shared pathway activity for groups of cancers with related origins represented in The Cancer Genome Atlas (TCGA) proteomic data. △ Less

Submitted 6 November, 2018; originally announced November 2018.

Comments: 8 pages, 8 figues

arXiv:1804.00293 [pdf, other]

Attentional Multilabel Learning over Graphs: A Message Passing Approach

Authors: Kien Do, Truyen Tran, Thin Nguyen, Svetha Venkatesh

Abstract: We address a largely open problem of multilabel classification over graphs. Unlike traditional vector input, a graph has rich variable-size substructures which are related to the labels in some ways. We believe that uncovering these relations might hold the key to classification performance and explainability. We introduce GAML (Graph Attentional Multi-Label learning), a novel graph neural network… ▽ More We address a largely open problem of multilabel classification over graphs. Unlike traditional vector input, a graph has rich variable-size substructures which are related to the labels in some ways. We believe that uncovering these relations might hold the key to classification performance and explainability. We introduce GAML (Graph Attentional Multi-Label learning), a novel graph neural network that can handle this problem effectively. GAML regards labels as auxiliary nodes and models them in conjunction with the input graph. By applying message passing and attention mechanisms to both the label nodes and the input nodes iteratively, GAML can capture the relations between the labels and the input subgraphs at various resolution scales. Moreover, our model can take advantage of explicit label dependencies. It also scales linearly with the number of labels and graph size thanks to our proposed hierarchical attention. We evaluate GAML on an extensive set of experiments with both graph-structured inputs and classical unstructured inputs. The results show that GAML significantly outperforms other competing methods. Importantly, GAML enables intuitive visualizations for better understanding of the label-substructure relations and explanation of the model behaviors. △ Less

Submitted 11 April, 2018; v1 submitted 1 April, 2018; originally announced April 2018.

arXiv:1608.04830 [pdf, other]

Outlier Detection on Mixed-Type Data: An Energy-based Approach

Authors: Kien Do, Truyen Tran, Dinh Phung, Svetha Venkatesh

Abstract: Outlier detection amounts to finding data points that differ significantly from the norm. Classic outlier detection methods are largely designed for single data type such as continuous or discrete. However, real world data is increasingly heterogeneous, where a data point can have both discrete and continuous attributes. Handling mixed-type data in a disciplined way remains a great challenge. In t… ▽ More Outlier detection amounts to finding data points that differ significantly from the norm. Classic outlier detection methods are largely designed for single data type such as continuous or discrete. However, real world data is increasingly heterogeneous, where a data point can have both discrete and continuous attributes. Handling mixed-type data in a disciplined way remains a great challenge. In this paper, we propose a new unsupervised outlier detection method for mixed-type data based on Mixed-variate Restricted Boltzmann Machine (Mv.RBM). The Mv.RBM is a principled probabilistic method that models data density. We propose to use \emph{free-energy} derived from Mv.RBM as outlier score to detect outliers as those data points lying in low density regions. The method is fast to learn and compute, is scalable to massive datasets. At the same time, the outlier score is identical to data negative log-density up-to an additive constant. We evaluate the proposed method on synthetic and real-world datasets and demonstrate that (a) a proper handling mixed-types is necessary in outlier detection, and (b) free-energy of Mv.RBM is a powerful and efficient outlier scoring method, which is highly competitive against state-of-the-arts. △ Less

Submitted 16 August, 2016; originally announced August 2016.

Showing 1–12 of 12 results for author: Do, K