-
Unsupervised Learning of Phylogenetic Trees via Split-Weight Embedding
Authors:
Yibo Kong,
George P. Tiley,
Claudia Solis-Lemus
Abstract:
Unsupervised learning has become a staple in classical machine learning, successfully identifying clustering patterns in data across a broad range of domain applications. Surprisingly, despite its accuracy and elegant simplicity, unsupervised learning has not been sufficiently exploited in the realm of phylogenetic tree inference. The main reason for the delay in adoption of unsupervised learning in phylogenetics is the lack of a meaningful, yet simple, way of embedding phylogenetic trees into a vector space. Here, we propose the simple yet powerful split-weight embedding which allows us to fit standard clustering algorithms to the space of phylogenetic trees. We show that our split-weight embedded clustering is able to recover meaningful evolutionary relationships in simulated and real (Adansonia baobabs) data.
Submitted 3 May, 2024; v1 submitted 26 December, 2023;
originally announced December 2023.
-
Human Limits in Machine Learning: Prediction of Plant Phenotypes Using Soil Microbiome Data
Authors:
Rosa Aghdam,
Xudong Tang,
Shan Shan,
Richard Lankau,
Claudia Solís-Lemus
Abstract:
The preservation of soil health is a critical challenge in the 21st century due to its significant impact on agriculture, human health, and biodiversity. We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes. We investigate an integrative framework performing accurate machine learning-based prediction of plant phenotypes from biological, chemical, and physical properties of the soil via two models: random forest and Bayesian neural network. We show that prediction is improved when incorporating environmental features like soil physicochemical properties and microbial population density into the models, in addition to the microbiome information. Exploring various data preprocessing strategies confirms the significant impact of human decisions on predictive performance. We show that the naive total sum scaling normalization that is commonly used in microbiome research is not the optimal strategy to maximize predictive power. Also, we find that accurately defined labels are more important than normalization, taxonomic level or model characteristics. In cases where humans are unable to classify samples accurately, machine learning model performance is limited. Lastly, we provide domain scientists with a full model selection decision tree to identify the human choices that optimize model prediction power. Our work is accompanied by open source reproducible scripts (https://github.com/solislemuslab/soil-microbiome-nn) for maximum outreach among the microbiome research community.
Submitted 16 February, 2024; v1 submitted 19 June, 2023;
originally announced June 2023.
-
Ultrafast learning of 4-node hybridization cycles in phylogenetic networks using algebraic invariants
Authors:
Zhaoxing Wu,
Claudia Solis-Lemus
Abstract:
Motivation: The abundance of gene flow in the Tree of Life challenges the notion that evolution can be represented with a fully bifurcating process, as this process cannot capture important biological realities like hybridization, introgression, or horizontal gene transfer. Coalescent-based network methods are increasingly popular, yet not scalable for big data, because they need to perform a heuristic search in the space of networks as well as numerical optimization that can be NP-hard.
Results: Here, we introduce a novel method to reconstruct phylogenetic networks based on algebraic invariants. While there is a long tradition of using algebraic invariants in phylogenetics, our work is the first to define phylogenetic invariants on concordance factors (frequencies of 4-taxon splits in the input gene trees) to identify level-1 phylogenetic networks under the multispecies coalescent model. Our novel inference methodology is optimization-free as it only requires the evaluation of polynomial equations, and as such, it bypasses the traversal of network space, yielding a computational speed at least 10 times faster than the fastest-to-date network methods. We illustrate the accuracy and speed of our new method on a variety of simulated scenarios as well as in the estimation of a phylogenetic network for the genus Canis.
Availability and Implementation: We implement our novel theory in an open-source Julia package, PhyloDiamond.jl, publicly available at https://github.com/solislemuslab/PhyloDiamond.jl, with broad applicability within the evolutionary biology community.
Contact: [email protected]
Submitted 9 November, 2023; v1 submitted 29 November, 2022;
originally announced November 2022.
-
BioKlustering: a web app for semi-supervised learning of maximally imbalanced genomic data
Authors:
Samuel Ozminkowski,
Yuke Wu,
Liule Yang,
Zhiwen Xu,
Luke Selberg,
Chunrong Huang,
Claudia Solis-Lemus
Abstract:
Summary: Accurate phenotype prediction from genomic sequences is a highly coveted task in biological and medical research. While machine-learning holds the key to accurate prediction in a variety of fields, the complexity of biological data can render many methodologies inapplicable. We introduce BioKlustering, a user-friendly open-source and publicly available web app for unsupervised and semi-supervised learning specialized for cases when sequence alignment and/or experimental phenotyping of all classes are not possible. Among its main advantages, BioKlustering 1) allows for maximally imbalanced settings of partially observed labels including cases when only one class is observed, which is currently prohibited in most semi-supervised methods, 2) takes unaligned sequences as input and thus, allows learning for widely diverse sequences (impossible to align) such as virus and bacteria, 3) is easy to use for anyone with little or no programming expertise, and 4) works well with small sample sizes.
Availability and Implementation: BioKlustering (https://bioklustering.wid.wisc.edu) is a freely available web app implemented with Django, a Python-based framework, with all major browsers supported. The web app does not need any installation, and it is publicly available and open-source (https://github.com/solislemuslab/bioklustering).
Submitted 26 September, 2022; v1 submitted 23 September, 2022;
originally announced September 2022.
-
Identifying microbial drivers in biological phenotypes with a Bayesian Network Regression model
Authors:
Samuel Ozminkowski,
Claudia Solis-Lemus
Abstract:
1. In Bayesian Network Regression models, networks are considered the predictors of continuous responses. These models have been successfully used in brain research to identify regions in the brain that are associated with specific human traits, yet their potential to elucidate microbial drivers in biological phenotypes for microbiome research remains unknown. In particular, microbial networks are challenging due to their high dimension and high sparsity compared to brain networks. Furthermore, unlike in brain connectome research, in microbiome research, it is usually expected that the presence of microbes has an effect on the response (main effects), not just the interactions.
2. Here, we develop the first thorough investigation of whether Bayesian Network Regression models are suitable for microbial datasets on a variety of synthetic and real data under diverse biological scenarios. We test whether the Bayesian Network Regression model that accounts only for interaction effects (edges in the network) is able to identify key drivers (microbes) in phenotypic variability.
3. We show that this model is indeed able to identify influential nodes and edges in the microbial networks that drive changes in the phenotype for most biological settings, but we also identify scenarios where this method performs poorly which allows us to provide practical advice for domain scientists aiming to apply these tools to their datasets.
4. BNR models provide a framework for microbiome researchers to identify connections between microbes and measured phenotypes. We enable the use of this statistical model by providing an easy-to-use implementation as a publicly available Julia package at https://github.com/solislemuslab/BayesianNetworkRegression.jl.
Submitted 20 January, 2024; v1 submitted 10 August, 2022;
originally announced August 2022.
-
Estimating sparse direct effects in multivariate regression with the spike-and-slab LASSO
Authors:
Yunyi Shen,
Claudia Solís-Lemus,
Sameer K. Deshpande
Abstract:
The multivariate regression interpretation of the Gaussian chain graph model simultaneously parametrizes (i) the direct effects of $p$ predictors on $q$ outcomes and (ii) the residual partial covariances between pairs of outcomes. We introduce a new method for fitting sparse Gaussian chain graph models with spike-and-slab LASSO (SSL) priors. We develop an Expectation Conditional Maximization algorithm to obtain sparse estimates of the $p \times q$ matrix of direct effects and the $q \times q$ residual precision matrix. Our algorithm iteratively solves a sequence of penalized maximum likelihood problems with self-adaptive penalties that gradually filter out negligible regression coefficients and partial covariances. Because it adaptively penalizes individual model parameters, our method is seen to outperform fixed-penalty competitors on simulated data. We establish the posterior contraction rate for our model, buttressing our method's excellent empirical performance with strong theoretical guarantees. Using our method, we estimated the direct effects of diet and residence type on the composition of the gut microbiome of elderly adults.
Submitted 26 March, 2024; v1 submitted 14 July, 2022;
originally announced July 2022.
-
CARlasso: An R package for the estimation of sparse microbial networks with predictors
Authors:
Yunyi Shen,
Claudia Solis-Lemus
Abstract:
Microbiome data analyses require statistical tools that can simultaneously decode microbes' reactions to the environment and interactions among microbes. We introduce CARlasso, the first user-friendly open-source and publicly available R package to fit a chain graph model for the inference of sparse microbial networks that represent both interactions among nodes and effects of a set of predictors. Unlike in standard regression approaches, the edges represent the correct conditional structure among responses and predictors that allows the incorporation of prior knowledge from controlled experiments. In addition, CARlasso 1) enforces sparsity in the network via LASSO; 2) allows for an adaptive extension to include different shrinkage to different edges; 3) is computationally inexpensive through an efficient Gibbs sampling algorithm so it can equally handle small and big data; 4) allows for continuous, binary, counting and compositional responses via proper hierarchical structure, and 5) has a similar syntax to lm for ease of use. The package also supports Bayesian graphical LASSO and several of its hierarchical models as well as lower level one-step sampling functions of the CAR-LASSO model for users to extend.
Submitted 29 July, 2021;
originally announced July 2021.
-
The Effect of the Prior and the Experimental Design on the Inference of the Precision Matrix in Gaussian Chain Graph Models
Authors:
Yunyi Shen,
Claudia Solis-Lemus
Abstract:
Here, we investigate whether (and how) experimental design could aid in the estimation of the precision matrix in a Gaussian chain graph model, especially the interplay between the design, the effect of the experiment and prior knowledge about the effect. Estimation of the precision matrix is a fundamental task to infer biological graphical structures like microbial networks. We compare the marginal posterior precision of the precision matrix under four priors: flat, conjugate Normal-Wishart, Normal-MGIG and a general independent. Under the flat and conjugate priors, the Laplace-approximated posterior precision is not a function of the design matrix rendering useless any efforts to find an optimal experimental design to infer the precision matrix. In contrast, the Normal-MGIG and general independent priors do allow for the search of optimal experimental designs, yet there is a sharp upper bound on the information that can be extracted from a given experiment. We confirm our theoretical findings via a simulation study comparing i) the KL divergence between prior and posterior and ii) the Stein's loss difference of MAPs between random and no experiment. Our findings provide practical advice for domain scientists conducting experiments to better infer the precision matrix as a representation of a biological network.
Submitted 29 November, 2023; v1 submitted 2 July, 2021;
originally announced July 2021.
-
Bayesian Chain Graph LASSO Models to Learn Sparse Microbial Networks with Predictors
Authors:
Yunyi Shen,
Claudia Solis-Lemus
Abstract:
Microbiome data require statistical models that can simultaneously decode microbes' reaction to the environment and interactions among microbes. While a multiresponse linear regression model seems like a straightforward solution, we argue that treating it as a graphical model is flawed given that the regression coefficient matrix does not encode the conditional dependence structure between response and predictor nodes as it does not represent the adjacency matrix. This observation is especially important in biological settings when we have prior knowledge on the edges from specific experimental interventions that can only be properly encoded under a conditional dependence model. Here, we propose a chain graph model with two sets of nodes (predictors and responses) whose solution yields a graph with edges that indeed represent conditional dependence and thus, agrees with the experimenter's intuition on the average behavior of nodes under treatment. The solution to our model is sparse via Bayesian LASSO. In addition, we propose an adaptive extension so that different shrinkage can be applied to different edges to incorporate edge-specific prior knowledge. Our model is computationally inexpensive through an efficient Gibbs sampling algorithm and can account for binary, counting and compositional responses via appropriate hierarchical structure. We apply our model to human gut and soil microbial compositional datasets, and we highlight that CG-LASSO can estimate biologically meaningful network structures in the data. The CG-LASSO software is available as an R package at https://github.com/YunyiShen/CAR-LASSO.
Submitted 23 July, 2022; v1 submitted 15 December, 2020;
originally announced December 2020.
-
Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting
Authors:
Claudia Solís-Lemus,
Cécile Ané
Abstract:
Phylogenetic networks are necessary to represent the tree of life expanded by edges to represent events such as horizontal gene transfers, hybridizations or gene flow. Not all species follow the paradigm of vertical inheritance of their genetic material. While a great deal of research has flourished into the inference of phylogenetic trees, statistical methods to infer phylogenetic networks are still limited and under development. The main disadvantage of existing methods is a lack of scalability. Here, we present a statistical method to infer phylogenetic networks from multi-locus genetic data in a pseudolikelihood framework. Our model accounts for incomplete lineage sorting through the coalescent model, and for horizontal inheritance of genes through reticulation nodes in the network. Computation of the pseudolikelihood is fast and simple, and it avoids the burdensome calculation of the full likelihood which can be intractable with many species. Moreover, estimation at the quartet-level has the added computational benefit that it is easily parallelizable. Simulation studies comparing our method to a full likelihood approach show that our pseudolikelihood approach is much faster without compromising accuracy. We applied our method to reconstruct the evolutionary relationships among swordtails and platyfishes ($Xiphophorus$: Poeciliidae), a group characterized by widespread hybridizations.
Submitted 12 February, 2016; v1 submitted 20 September, 2015;
originally announced September 2015.