-
Innovations in Integrating Machine Learning and Agent-Based Modeling of Biomedical Systems
Authors:
Nikita Sivakumar,
Cameron Mura,
Shayn M. Peirce
Abstract:
Agent-based modeling (ABM) is a well-established paradigm for simulating complex systems via interactions between constituent entities. Machine learning (ML) refers to approaches whereby statistical algorithms 'learn' from data on their own, without imposing a priori theories of system behavior. Biological systems -- from molecules, to cells, to entire organisms -- consist of vast numbers of entities, governed by complex webs of interactions that span many spatiotemporal scales and exhibit nonlinearity, stochasticity and intricate coupling between entities. The macroscopic properties and collective dynamics of such systems are difficult to capture via continuum modelling and mean-field formalisms. ABM takes a 'bottom-up' approach that obviates these difficulties by enabling one to easily propose and test a set of well-defined 'rules' to be applied to the individual entities (agents) in a system. Evaluating a system and propagating its state over discrete time-steps effectively simulates the system, allowing observables to be computed and system properties to be analyzed. Because the rules that govern an ABM can be difficult to abstract and formulate from experimental data, there is an opportunity to use ML to help infer optimal, system-specific ABM rules. Once such rule-sets are devised, ABM calculations can generate a wealth of data, and ML can be applied there too -- e.g., to probe statistical measures that meaningfully describe a system's stochastic properties. As an example of synergy in the other direction (from ABM to ML), ABM simulations can generate realistic datasets for training ML algorithms (e.g., for regularization, to mitigate overfitting). In these ways, one can envision various synergistic ABM$\rightleftharpoons$ML loops. This review summarizes how ABM and ML have been integrated in contexts that span spatiotemporal scales, from cellular to population-level epidemiology.
Submitted 9 November, 2022; v1 submitted 2 June, 2022;
originally announced June 2022.
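The rule-based, discrete time-step scheme described in the abstract above can be illustrated with a toy example. The sketch below is not taken from the review: it assumes a hypothetical susceptible/infected/recovered (S/I/R) rule-set on a ring of agents, and the names `step`, `ring_neighbors`, `simulate`, `p_infect` and `p_recover` are all illustrative choices.

```python
import random


def ring_neighbors(i, n):
    """Return the two neighbors of agent i on a ring of n agents (assumed topology)."""
    return ((i - 1) % n, (i + 1) % n)


def step(states, p_infect, p_recover, rng):
    """Apply the per-agent rules once and return the next list of states."""
    n = len(states)
    new_states = list(states)  # synchronous update: rules read the old states only
    for i, state in enumerate(states):
        if state == "S":
            # Rule 1 (assumed): each infected neighbor infects with probability p_infect.
            for j in ring_neighbors(i, n):
                if states[j] == "I" and rng.random() < p_infect:
                    new_states[i] = "I"
                    break
        elif state == "I":
            # Rule 2 (assumed): an infected agent recovers with probability p_recover.
            if rng.random() < p_recover:
                new_states[i] = "R"
    return new_states


def simulate(n_agents=50, n_steps=30, p_infect=0.3, p_recover=0.1, seed=0):
    """Propagate the system over discrete time-steps; return per-step S/I/R counts."""
    rng = random.Random(seed)
    states = ["S"] * n_agents
    states[0] = "I"  # a single initially infected agent
    history = [{k: states.count(k) for k in "SIR"}]
    for _ in range(n_steps):
        states = step(states, p_infect, p_recover, rng)
        history.append({k: states.count(k) for k in "SIR"})
    return history
```

Calling `simulate()` returns one dictionary of counts per time-step, which is the kind of observable that could then be passed to a statistical or ML analysis.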
-
Exploration of Dark Chemical Genomics Space via Portal Learning: Applied to Targeting the Undruggable Genome and COVID-19 Anti-Infective Polypharmacology
Authors:
Tian Cai,
Li Xie,
Muge Chen,
Yang Liu,
Di He,
Shuo Zhang,
Cameron Mura,
Philip E. Bourne,
Lei Xie
Abstract:
Advances in biomedicine are largely fueled by exploring uncharted territories of human biology. Machine learning can both enable and accelerate discovery, but faces a fundamental hurdle when applied to unseen data with distributions that differ from previously observed ones -- a common dilemma in scientific inquiry. We have developed a new deep learning framework, called Portal Learning, to explore dark chemical and biological space. Three key, novel components of our approach include: (i) end-to-end, step-wise transfer learning, in recognition of biology's sequence-structure-function paradigm, (ii) out-of-cluster meta-learning, and (iii) stress model selection. Portal Learning provides a practical solution to the out-of-distribution (OOD) problem in statistical machine learning. Here, we have implemented Portal Learning to predict chemical-protein interactions on a genome-wide scale. Systematic studies demonstrate that Portal Learning can effectively assign ligands to unexplored gene families (unknown functions), versus existing state-of-the-art methods, thereby allowing us to target previously "undruggable" proteins and design novel polypharmacological agents for disrupting interactions between SARS-CoV-2 and human proteins. Portal Learning is general-purpose and can be further applied to other areas of scientific inquiry.
Submitted 23 November, 2021;
originally announced November 2021.
-
Walk2Map: Extracting Floor Plans from Indoor Walk Trajectories
Authors:
Claudio Mura,
Renato Pajarola,
Konrad Schindler,
Niloy Mitra
Abstract:
Recent years have seen a proliferation of new digital products for the efficient management of indoor spaces, with important applications like emergency management, virtual property showcasing and interior design. These products rely on accurate 3D models of the environments considered, including information on both architectural and non-permanent elements. These models must be created from measured data such as RGB-D images or 3D point clouds, whose capture and consolidation involve lengthy data workflows. This strongly limits the rate at which 3D models can be produced, preventing the adoption of many digital services for indoor space management. We provide an alternative to such data-intensive procedures by presenting Walk2Map, a data-driven approach to generate floor plans only from trajectories of a person walking inside the rooms. Thanks to recent advances in data-driven inertial odometry, such minimalistic input data can be acquired from the IMU readings of consumer-level smartphones, which allows for an effortless and scalable mapping of real-world indoor spaces. Our work is based on learning the latent relation between an indoor walk trajectory and the information represented in a floor plan: interior space footprint, portals, and furniture. We distinguish between recovering area-related (interior footprint, furniture) and wall-related (doors) information and use two different neural architectures for the two tasks: an image-based Encoder-Decoder and a Graph Convolutional Network, respectively. We train our networks using scanned 3D indoor models and apply them in a cascaded fashion on an indoor walk trajectory at inference time. We perform a qualitative and quantitative evaluation using both simulated and measured, real-world trajectories, and compare against a baseline method for image-to-image translation. The experiments confirm the feasibility of our approach.
Submitted 27 February, 2021;
originally announced March 2021.
-
Machine Learning for Classification of Protein Helix Capping Motifs
Authors:
Sean Mullane,
Ruoyan Chen,
Sri Vaishnavi Vemulapalli,
Eli J. Draizen,
Ke Wang,
Cameron Mura,
Philip E. Bourne
Abstract:
The biological function of a protein stems from its 3-dimensional structure, which is thermodynamically determined by the energetics of interatomic forces between its amino acid building blocks (the order of amino acids, known as the sequence, defines a protein). Given the costs (time, money, human resources) of determining protein structures via experimental means such as X-ray crystallography, can we better describe and compare protein 3D structures in a robust and efficient manner, so as to gain meaningful biological insights? We begin by considering a relatively simple problem, limiting ourselves to just protein secondary structural elements. Historically, many computational methods have been devised to classify amino acid residues in a protein chain into one of several discrete secondary structures, of which the most well-characterized are the geometrically regular $α$-helix and $β$-sheet; irregular structural patterns, such as 'turns' and 'loops', are less understood. Here, we present a study of Deep Learning techniques to classify the loop-like end cap structures which delimit $α$-helices. Previous work used highly empirical and heuristic methods to manually classify helix capping motifs. Instead, we use structural data directly--including (i) backbone torsion angles computed from 3D structures, (ii) macromolecular feature sets (e.g., physicochemical properties), and (iii) helix cap classification data (from CAPS-DB)--as the ground truth to train a bidirectional long short-term memory (BiLSTM) model to classify helix cap residues. We tried different network architectures and scanned hyperparameters in order to train and assess several models; we also trained a Support Vector Classifier (SVC) to use as a baseline. Ultimately, we achieved 85% class-balanced accuracy with a deep BiLSTM model.
Submitted 1 May, 2019;
originally announced May 2019.
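The abstract above reports a class-balanced accuracy without defining it. One common reading, assumed here and not confirmed by the source, is the mean of the per-class recalls, which keeps a rare class (such as cap residues) from being swamped by a frequent one. The function name `balanced_accuracy` and the labels `"cap"` and `"non"` are hypothetical.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall (one assumed definition of class-balanced accuracy)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    # Recall is computed per true class, then averaged with equal class weights.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy labels: 2 cap residues and 8 non-cap residues; one cap residue is missed.
y_true = ["cap", "cap"] + ["non"] * 8
y_pred = ["cap", "non"] + ["non"] * 8
score = balanced_accuracy(y_true, y_pred)  # (0.5 + 1.0) / 2 = 0.75
```

In this toy case plain accuracy would be 0.9, whereas the balanced score is 0.75 because the missed cap residue counts for half of its class.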
-
A Robust Feature-aware Sparse Mesh Representation
Authors:
Lizeth J. Fuentes Perez,
Luciano A. Romero Calla,
Anselmo A. Montenegro,
Claudio Mura,
Renato Pajarola
Abstract:
The sparse representation of signals defined on Euclidean domains has been successfully applied in signal processing. Bringing the power of sparse representations to non-regular domains is still a challenge, but promising approaches have started emerging recently. In this paper, we investigate the problem of sparsely representing discrete surfaces and propose a new representation that is capable of providing tools for solving different geometry processing problems. The sparse discrete surface representation is obtained by combining innovative approaches into an integrated method. First, to deal with irregular mesh domains, we devised a new way to subdivide discrete meshes into a set of patches using a feature-aware seed sampling. Second, we achieve good surface approximation with over-fitting control by combining the power of a continuous global dictionary representation with a modified Orthogonal Matching Pursuit. The discrete surface approximation results produced were able to preserve the shape features while being robust to over-fitting. Our results show that the method is quite promising for applications like surface re-sampling and mesh compression.
Submitted 24 November, 2020; v1 submitted 18 October, 2018;
originally announced October 2018.
-
Ten Quick Tips for Using a Raspberry Pi
Authors:
Anthony C Fletcher,
Cameron Mura
Abstract:
Much of biology (and, indeed, all of science) is becoming increasingly computational. We tend to think of this in regards to algorithmic approaches and software tools, as well as increased computing power. There has also been a shift towards slicker, packaged solutions--which mirrors everyday life, from smart phones to smart homes. As a result, it's all too easy to be detached from the fundamental elements that power these changes, and to see solutions as "black boxes". The major goal of this piece is to use the example of the Raspberry Pi--a small, general-purpose computer--as the central component in a highly developed ecosystem that brings together elements like external hardware, sensors and controllers, state-of-the-art programming practices, and basic electronics and physics, all in an approachable and useful way. External devices and inputs are easily connected to the Pi, and it can, in turn, control attached devices very simply. So whether you want to use it to manage laboratory equipment, sample the environment, teach bioinformatics, control your home security or make a model lunar lander, it's all built from the same basic principles. To quote Richard Feynman, "What I cannot create, I do not understand".
Submitted 8 February, 2019; v1 submitted 28 September, 2018;
originally announced October 2018.
-
An Introduction to Programming for Bioscientists: A Python-based Primer
Authors:
Berk Ekmekci,
Charles E. McAnany,
Cameron Mura
Abstract:
Computing has revolutionized the biological sciences over the past several decades, such that virtually all contemporary research in the biosciences utilizes computer programs. The computational advances have come on many fronts, spurred by fundamental developments in hardware, software, and algorithms. These advances have influenced, and even engendered, a phenomenal array of bioscience fields, including molecular evolution and bioinformatics; genome-, proteome-, transcriptome- and metabolome-wide experimental studies; structural genomics; and atomistic simulations of cellular-scale molecular assemblies as large as ribosomes and intact viruses. In short, much of post-genomic biology is increasingly becoming a form of computational biology. The ability to design and write computer programs is among the most indispensable skills that a modern researcher can cultivate. Python has become a popular programming language in the biosciences, largely because (i) its straightforward semantics and clean syntax make it a readily accessible first language; (ii) it is expressive and well-suited to object-oriented programming, as well as other modern paradigms; and (iii) the many available libraries and third-party toolkits extend the functionality of the core language into virtually every biological domain (sequence and structure analyses, phylogenomics, workflow management systems, etc.). This primer offers a basic introduction to coding, via Python, and it includes concrete examples and exercises to illustrate the language's usage and capabilities; the main text culminates with a final project in structural bioinformatics. A suite of Supplemental Chapters is also provided. Starting with basic concepts, such as that of a 'variable', the Chapters methodically advance the reader to the point of writing a graphical user interface to compute the Hamming distance between two DNA sequences.
Submitted 17 May, 2016;
originally announced May 2016.
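The Hamming distance mentioned at the end of the abstract above is the number of positions at which two equal-length sequences differ. The following is a minimal sketch of that calculation only; it is not the primer's own code, and the function name `hamming_distance` and the example sequences are illustrative.

```python
def hamming_distance(seq_a, seq_b):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires sequences of equal length")
    # Each mismatched pair contributes True, which sums as 1.
    return sum(a != b for a, b in zip(seq_a, seq_b))


# The two sequences differ at the third and sixth positions.
distance = hamming_distance("GATTACA", "GACTATA")  # 2
```

Raising an error for unequal lengths is a design choice: the distance is undefined in that case, and silently truncating to the shorter sequence would hide alignment problems.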
-
Abstractions, Algorithms and Data Structures for Structural Bioinformatics in PyCogent
Authors:
Marcin Cieslik,
Zygmunt Derewenda,
Cameron Mura
Abstract:
To facilitate flexible and efficient structural bioinformatics analyses, new functionality for three-dimensional structure processing and analysis has been introduced into PyCogent -- a popular feature-rich framework for sequence-based bioinformatics, but one which has lacked equally powerful tools for handling structural/coordinate-based data. Extensible Python modules have been developed, which provide object-oriented abstractions (based on a hierarchical representation of macromolecules), efficient data structures (e.g. kD-trees), fast implementations of common algorithms (e.g. surface-area calculations), read/write support for Protein Data Bank-related file formats and wrappers for external command-line applications (e.g. Stride). Integration of this code into PyCogent is symbiotic, allowing sequence-based work to benefit from structure-derived data and, reciprocally, enabling structural studies to leverage PyCogent's versatile tools for phylogenetic and evolutionary analyses.
Submitted 19 July, 2014;
originally announced July 2014.
-
Development & Implementation of a PyMOL 'putty' Representation
Authors:
Cameron Mura
Abstract:
The PyMOL molecular graphics program has been modified to introduce a new 'putty' cartoon representation, akin to the 'sausage'-style representation of the MOLMOL molecular visualization (MolVis) software package. This document outlines the development and implementation of the putty representation.
Submitted 19 July, 2014;
originally announced July 2014.
-
PaPy: Parallel and Distributed Data-processing Pipelines in Python
Authors:
Marcin Cieslik,
Cameron Mura
Abstract:
PaPy, which stands for parallel pipelines in Python, is a highly flexible framework that enables the construction of robust, scalable workflows for either generating or processing voluminous datasets. A workflow is created from user-written Python functions (nodes) connected by 'pipes' (edges) into a directed acyclic graph. These functions are arbitrarily definable, and can make use of any Python modules or external binaries. Given a user-defined topology and collection of input data, functions are composed into nested higher-order maps, which are transparently and robustly evaluated in parallel on a single computer or on remote hosts. Local and remote computational resources can be flexibly pooled and assigned to functional nodes, thereby allowing facile load-balancing and pipeline optimization to maximize computational throughput. Input items are processed by nodes in parallel, and traverse the graph in batches of adjustable size -- a trade-off between lazy-evaluation, parallelism, and memory consumption. The processing of a single item can be parallelized in a scatter/gather scheme. The simplicity and flexibility of distributed workflows using PaPy bridge the gap between desktop and grid computing, enabling this new computing paradigm to be leveraged in the processing of large scientific datasets.
Submitted 14 July, 2014;
originally announced July 2014.
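The batched, parallel evaluation described in the abstract above can be sketched with the Python standard library alone. This is not PaPy's API: it assumes the simplest topology (a linear chain of functions, which is a special case of a directed acyclic graph) and a local thread pool, and the names `batched`, `run_pipeline`, `add_one` and `double` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(items, size):
    """Lazily yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run_pipeline(stages, inputs, batch_size=4, workers=2):
    """Push inputs through a linear chain of functions, one batch at a time.

    Items inside a batch are mapped in parallel; only one batch is held in
    memory at once, so batch_size trades laziness against parallelism.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(inputs, batch_size):
            for stage in stages:
                # pool.map preserves input order within the batch.
                batch = list(pool.map(stage, batch))
            yield from batch


def add_one(x):
    return x + 1


def double(x):
    return x * 2


results = list(run_pipeline([add_one, double], range(10), batch_size=3))
```

A small `batch_size` keeps memory use low and results flowing early, while a larger one gives the pool more items to work on concurrently, which mirrors the trade-off the abstract names.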
-
Ten Simple Rules for Creating Biomolecular Graphics
Authors:
Cameron Mura
Abstract:
One need only compare the number of three-dimensional molecular illustrations in the first (1990) and third (2004) editions of Voet & Voet's "Biochemistry" in order to appreciate this field's profound communicative value in modern biological sciences -- ranging from medicine, physiology, and cell biology, to pharmaceutical chemistry and drug design, to structural and computational biology. The cliché about a picture being worth a thousand words is quite poignant here: The information 'content' of an effectively-constructed piece of molecular graphics can be immense. Because biological function arises from structure, it is difficult to overemphasize the utility of visualization and graphics in molding our current understanding of the molecular nature of biological systems. Nevertheless, creating effective molecular graphics is not easy -- neither conceptually, nor in terms of effort required. The present collection of Rules is meant as a guide for those embarking upon their first molecular illustrations.
Submitted 15 July, 2014;
originally announced July 2014.