-
Representing Molecules as Random Walks Over Interpretable Grammars
Authors:
Michael Sun,
Minghao Guo,
Weize Yuan,
Veronika Thost,
Crystal Elaine Owens,
Aristotle Franklin Grosz,
Sharvaa Selvan,
Katelyn Zhou,
Hassan Mohiuddin,
Benjamin J Pedretti,
Zachary P Smith,
Jie Chen,
Wojciech Matusik
Abstract:
Recent research in molecular discovery has primarily been devoted to small, drug-like molecules, leaving many similarly important applications in material design without adequate technology. These applications often rely on more complex molecular structures with fewer examples that are carefully designed using known substructures. We propose a data-efficient and interpretable model for representing and reasoning over such molecules in terms of graph grammars that explicitly describe the hierarchical design space, with motifs as the design basis. We present a novel representation in the form of random walks over the design space, which facilitates both molecule generation and property prediction. We demonstrate clear advantages over existing methods in terms of performance, efficiency, and synthesizability of predicted molecules, and we provide detailed insights into the method's chemical interpretability.
Submitted 2 June, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
MakeSBML: A tool for converting between Antimony and SBML
Authors:
Bartholomew E. Jardine,
Lucian P. Smith,
Herbert M. Sauro
Abstract:
We describe a web-based tool, MakeSBML (https://sys-bio.github.io/makesbml/), that provides an installation-free application for creating and editing SBML-based models and for searching the BioModels repository for them. MakeSBML is a client-based web application that translates models expressed in human-readable Antimony to the Systems Biology Markup Language (SBML) and vice versa. Since MakeSBML is a web-based application, it requires no installation on the user's part. Currently, MakeSBML is hosted on a GitHub page, and the client-based design makes it trivial to move to other hosts. This model for software deployment also reduces maintenance costs since an active server is not required. The SBML modeling language is often used in systems biology research to describe complex biochemical networks and makes reproducing models much easier. However, SBML is designed to be computer-readable, not human-readable. We therefore employ the human-readable Antimony language to make it easy to create and edit SBML models.
Submitted 6 September, 2023;
originally announced September 2023.
-
Adapting Modeling and Simulation Credibility Standards to Computational Systems Biology
Authors:
Lillian T. Tatka,
Lucian P. Smith,
Joseph L. Hellerstein,
Herbert M. Sauro
Abstract:
Computational models are increasingly used in high-impact decision making in science, engineering, and medicine. The National Aeronautics and Space Administration (NASA) uses computational models to perform complex experiments that are otherwise prohibitively expensive or require a microgravity environment. Similarly, the Food and Drug Administration (FDA) and European Medicines Agency (EMA) have begun accepting models and simulations as a form of evidence for pharmaceutical and medical device approval. It is crucial that computational models meet a standard of credibility when they are used in high-stakes decision making. For this reason, institutes including NASA, the FDA, and the EMA have developed standards to promote and assess the credibility of computational models and simulations. However, due to the breadth of models these institutes assess, these credibility standards are mostly qualitative and avoid making specific recommendations. On the other hand, modeling and simulation in systems biology is a narrow domain and several standards are already in place. As systems biology models increase in complexity and influence, the development of a credibility assessment system is crucial. Here we review existing standards in systems biology and credibility standards in other science, engineering, and medical fields, and propose the development of a credibility standard for systems biology models.
Submitted 14 January, 2023;
originally announced January 2023.
-
BioSimulators: a central registry of simulation engines and services for recommending specific tools
Authors:
Bilal Shaikh,
Lucian P. Smith,
Dan Vasilescu,
Gnaneswara Marupilla,
Michael Wilson,
Eran Agmon,
Henry Agnew,
Steven S. Andrews,
Azraf Anwar,
Moritz E. Beber,
Frank T. Bergmann,
David Brooks,
Lutz Brusch,
Laurence Calzone,
Kiri Choi,
Joshua Cooper,
John Detloff,
Brian Drawert,
Michel Dumontier,
G. Bard Ermentrout,
James R. Faeder,
Andrew P. Freiburger,
Fabian Fröhlich,
Akira Funahashi,
Alan Garny, et al. (46 additional authors not shown)
Abstract:
Computational models have great potential to accelerate bioscience, bioengineering, and medicine. However, it remains challenging to reproduce and reuse simulations, in part, because the numerous formats and methods for simulating various subsystems and scales remain siloed by different software tools. For example, each tool must be executed through a distinct interface. To help investigators find and use simulation tools, we developed BioSimulators (https://biosimulators.org), a central registry of the capabilities of simulation tools and consistent Python, command-line, and containerized interfaces to each version of each tool. The foundation of BioSimulators is standards, such as CellML, SBML, SED-ML, and the COMBINE archive format, and validation tools for simulation projects and simulation tools that ensure these standards are used consistently. To help modelers find tools for particular projects, we have also used the registry to develop recommendation services. We anticipate that BioSimulators will help modelers exchange, reproduce, and combine simulations.
Submitted 13 March, 2022;
originally announced March 2022.
-
Human genetic admixture through the lens of population genomics
Authors:
Shyamalika Gopalan,
Samuel Patillo Smith,
Katharine Korunes,
Iman Hamid,
Sohini Ramachandran,
Amy Goldberg
Abstract:
Over the last fifty years, geneticists have made great strides in understanding how our species' evolutionary history gave rise to current patterns of human genetic diversity classically summarized by Lewontin in his 1972 paper, 'The Apportionment of Human Diversity'. One evolutionary process that requires special attention in both population genetics and statistical genetics is admixture: gene flow between two or more previously separated source populations to form a new admixed population. The admixture process introduces unique patterns of genetic variation within and between populations, which in turn influence the inference of demographic histories, the identification of genetic targets of selection, and the prediction of phenotypes. In this review, we highlight recent studies and methodological advances that have leveraged genomic signatures of admixture to gain insights into human history, natural selection, and complex trait architecture. We also outline some challenges for admixture population genetics, including limitations of applying methods designed for single-ancestry populations to the study of admixed populations.
Submitted 11 February, 2022; v1 submitted 24 September, 2021;
originally announced September 2021.
-
SED-ML Validator: tool for debugging simulation experiments
Authors:
Bilal Shaikh,
Andrew Philip Freiburger,
Matthias König,
Frank T. Bergmann,
David P. Nickerson,
Herbert M. Sauro,
Michael L. Blinov,
Lucian P. Smith,
Ion I. Moraru,
Jonathan R. Karr
Abstract:
Summary: More sophisticated models are needed to address problems in bioscience, synthetic biology, and precision medicine. To help facilitate the collaboration needed for such models, the community developed the Simulation Experiment Description Markup Language (SED-ML), a common format for describing simulations. However, the utility of SED-ML has been hampered by limited support for SED-ML among modeling software tools and by different interpretations of SED-ML among the tools that support the format. To help modelers debug their simulations and to push the community to use SED-ML consistently, we developed a tool for validating SED-ML files. We have used the validator to correct the official SED-ML example files. We plan to use the validator to correct the files in the BioModels database so that they can be simulated. We anticipate that the validator will be a valuable tool for developing more predictive simulations and that the validator will help increase the adoption and interoperability of SED-ML.
Availability: The validator is freely available as a webform, HTTP API, command-line program, and Python package at https://run.biosimulations.org/utils/validate and https://pypi.org/project/biosimulators-utils. The validator is also embedded into interfaces to 11 simulation tools. The source code is openly available as described in the Supplementary data.
Contact: [email protected]
Submitted 1 June, 2021;
originally announced June 2021.
-
Time to revisit the endpoint dilution assay and to replace TCID$_{50}$ and PFU as measures of a virus sample's infection concentration
Authors:
Daniel Cresta,
Donald C. Warren,
Christian Quirouette,
Amanda P. Smith,
Lindey C. Lane,
Amber M. Smith,
Catherine A. A. Beauchemin
Abstract:
The infectivity of a virus sample is measured by the infections it causes, via a plaque or focus forming assay (PFU or FFU) or an endpoint dilution (ED) assay (TCID$_{50}$, CCID$_{50}$, EID$_{50}$, etc., hereafter collectively ID$_{50}$). The counting of plaques or foci at a given dilution intuitively and directly provides the concentration of infectious doses in the undiluted sample. However, it has many technical and experimental limitations. For example, it relies on one's judgement in distinguishing between two merged plaques and a larger one, or between small plaques and staining artifacts. In this regard, ED assays are more robust because one need only determine whether infection occurred. The output of the ED assay, the 50% infectious dose (ID$_{50}$), is calculated using either the Spearman-Karber (SK, 1908, 1931) or Reed-Muench (RM, 1938) mathematical approximations. However, these are often miscalculated and their ID$_{50}$ approximation is biased. We propose that the PFU and FFU assays be abandoned, and that the measured output of the ED assay, the ID$_{50}$, be replaced by a more useful measure we coined Specific INfections (SIN). We introduce a free, open-source web-application, midSIN, that computes the SIN concentration in a virus sample from a standard ED assay, requiring no changes to current experimental protocols. We demonstrate that the SIN/mL of a sample reliably corresponds to the number of infections the sample will cause per unit volume, and directly relates to the multiplicity of infection. midSIN estimates are shown to be more accurate and robust than those using the RM and SK approximations. The impact of ED plate design choices (dilution factor, replicates per dilution) on measurement accuracy is also explored. The simplicity of SIN as a measure and the greater accuracy of midSIN make them an easy, superior replacement for the PFU, FFU, and ID$_{50}$ measures.
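For readers unfamiliar with the Spearman-Karber approximation named in this abstract, the following is a minimal sketch of one common textbook form of it. It is not the midSIN method and is not taken from the paper. The function name `spearman_karber_log10_endpoint` is hypothetical, and the sketch assumes equally spaced log10 dilutions listed from most to least concentrated, with every well infected at the first dilution and none at the last.

```python
def spearman_karber_log10_endpoint(log10_dilutions, infected, total):
    """Return log10 of the 50% endpoint dilution (textbook Spearman-Karber form).

    log10_dilutions: equally spaced values, most concentrated first,
                     e.g. [-1, -2, -3, -4, -5] for a ten-fold series.
    infected, total: infected wells and inoculated wells at each dilution.
    """
    if len(log10_dilutions) < 2:
        raise ValueError("need at least two dilutions")
    if not (len(log10_dilutions) == len(infected) == len(total)):
        raise ValueError("inputs must have the same length")

    # log10 of the dilution factor between consecutive steps (positive).
    step = log10_dilutions[0] - log10_dilutions[1]

    # Fraction of wells infected at each dilution.
    fractions = [i / n for i, n in zip(infected, total)]

    # Assumption of this sketch: the series spans 100% down to 0% infection.
    if fractions[0] != 1.0 or fractions[-1] != 0.0:
        raise ValueError("series must run from all wells infected to none")

    # Endpoint = first dilution + half a step - step * (sum of fractions).
    return log10_dilutions[0] + step / 2 - step * sum(fractions)


# Ten-fold series, 4 wells per dilution: 4/4, 4/4, 2/4, 0/4, 0/4 infected.
endpoint = spearman_karber_log10_endpoint([-1, -2, -3, -4, -5],
                                          [4, 4, 2, 0, 0],
                                          [4, 4, 4, 4, 4])
print(endpoint)  # log10 of the dilution at which half the wells are infected
```

Under these assumptions the estimate depends only on the sum of the infected fractions, which is part of why the result is sensitive to how the dilution series is laid out.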
Submitted 27 January, 2021;
originally announced January 2021.
-
A scalable method for molecular network reconstruction identifies properties of targets and mutations in acute myeloid leukemia
Authors:
Edison Ong,
Anthony Szedlak,
Yunyi Kang,
Peyton Smith,
Nicholas Smith,
Madison McBride,
Darren Finlay,
Kristiina Vuori,
James Mason,
Edward D. Ball,
Carlo Piermarocchi,
Giovanni Paternostro
Abstract:
A key aim of systems biology is the reconstruction of molecular networks; however, we do not yet have networks that integrate information from all datasets available for a particular clinical condition. This is in part due to the limited scalability, in terms of required computational time and power, of existing algorithms. Network reconstruction methods should also be scalable in the sense of allowing scientists from different backgrounds to efficiently integrate additional data. We present a network model of acute myeloid leukemia (AML). In the current version (AML 2.1) we have used gene expression data (both microarray and RNA-seq) from five different studies comprising a total of 771 AML samples and a protein-protein interactions dataset. Our scalable network reconstruction method is in part based on the well-known property of gene expression correlation among interacting molecules. The difficulty of distinguishing between direct and indirect interactions is addressed by optimizing the coefficient of variation of gene expression, using a validated gold standard dataset of direct interactions. Computational time is much reduced compared to other network reconstruction methods. A key feature is the study of the reproducibility of interactions found in independent clinical datasets. An analysis of the most significant clusters, and of the network properties (intraset efficiency, degree, betweenness centrality and PageRank) of common AML mutations demonstrated the biological significance of the network. A statistical analysis of the response of blast cells from eleven AML patients to a library of kinase inhibitors provided an experimental validation of the network. A combination of network and experimental data identified CDK1, CDK2, CDK4, CDK6 and other kinases as potential therapeutic targets in AML.
Submitted 25 July, 2014;
originally announced July 2014.
-
Coarse-graining DNA for simulations of DNA nanotechnology
Authors:
Jonathan P. K. Doye,
Thomas E. Ouldridge,
Ard A. Louis,
Flavio Romano,
Petr Sulc,
Christian Matek,
Benedict E. K. Snodin,
Lorenzo Rovigatti,
John S. Schreck,
Ryan M. Harrison,
William P. J. Smith
Abstract:
To simulate long time and length scale processes involving DNA it is necessary to use a coarse-grained description. Here we provide an overview of different approaches to such coarse graining, focussing on those at the nucleotide level that allow the self-assembly processes associated with DNA nanotechnology to be studied. OxDNA, our recently-developed coarse-grained DNA model, is particularly suited to this task, and has opened up this field to systematic study by simulations. We illustrate some of the range of DNA nanotechnology systems to which the model is being applied, as well as the insights it can provide into fundamental biophysical properties of DNA.
Submitted 18 August, 2013;
originally announced August 2013.
-
Reducing assembly complexity of microbial genomes with single-molecule sequencing
Authors:
Sergey Koren,
Gregory P Harhay,
Timothy PL Smith,
James L Bono,
Dayna M Harhay,
D. Scott Mcvey,
Diana Radune,
Nicholas H Bergman,
Adam M Phillippy
Abstract:
Background: The short reads output by first- and second-generation DNA sequencing instruments cannot completely reconstruct microbial chromosomes. Therefore, most genomes have been left unfinished due to the significant resources required to manually close gaps in draft assemblies. Third-generation, single-molecule sequencing addresses this problem by greatly increasing sequencing read length, which simplifies the assembly problem.
Results: To measure the benefit of single-molecule sequencing on microbial genome assembly, we sequenced and assembled the genomes of six bacteria and analyzed the repeat complexity of 2,267 complete bacteria and archaea. Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library. These single-library assemblies are also more accurate than typical short-read assemblies and hybrid assemblies of short and long reads.
Conclusions: Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to $1,000 for most genomes, and future advances in this technology are expected to drive the cost lower. This is expected to increase the number of completed genomes, improve the quality of microbial genome databases, and enable high-fidelity, population-scale studies of pan-genomes and chromosomal organization.
Submitted 15 November, 2013; v1 submitted 12 April, 2013;
originally announced April 2013.