-
Learning Multi-view Molecular Representations with Structured and Unstructured Knowledge
Authors:
Yizhen Luo,
Kai Yang,
Massimo Hong,
Xing Yi Liu,
Zikun Nie,
Hao Zhou,
Zaiqing Nie
Abstract:
Capturing molecular knowledge with representation learning approaches holds significant potential in vast scientific fields such as chemistry and life science. An effective and generalizable molecular representation is expected to capture the consensus and complementary molecular expertise from diverse views and perspectives. However, existing works fall short in learning multi-view molecular representations, due to challenges in explicitly incorporating view information and handling molecular knowledge from heterogeneous sources. To address these issues, we present MV-Mol, a molecular representation learning model that harvests multi-view molecular expertise from chemical structures, unstructured knowledge from biomedical texts, and structured knowledge from knowledge graphs. We utilize text prompts to model view information and design a fusion architecture to extract view-based molecular representations. We develop a two-stage pre-training procedure, exploiting heterogeneous data of varying quality and quantity. Through extensive experiments, we show that MV-Mol provides improved representations that substantially benefit molecular property prediction. Additionally, MV-Mol exhibits state-of-the-art performance in multi-modal comprehension of molecular structures and texts. Code and data are available at https://github.com/PharMolix/OpenBioMed.
Submitted 14 June, 2024;
originally announced June 2024.
-
Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA
Authors:
Kaiyuan Yang,
Fabio Musio,
Yihui Ma,
Norman Juchler,
Johannes C. Paetzold,
Rami Al-Maskari,
Luciano Höher,
Hongwei Bran Li,
Ibrahim Ethem Hamamci,
Anjany Sekuboyina,
Suprosanna Shit,
Hou**g Huang,
Chinmay Prabhakar,
Ezequiel de la Rosa,
Diana Waldmannstetter,
Florian Kofler,
Fernando Navarro,
Martin Menten,
Ivan Ezhov,
Daniel Rueckert,
Iris Vos,
Ynte Ruigrok,
Birgitta Velthuis,
Hugo Kuijf,
Julien Hämmerli
, et al. (59 additional authors not shown)
Abstract:
The Circle of Willis (CoW) is an important network of arteries connecting major circulations of the brain. Its vascular architecture is believed to affect the risk, severity, and clinical outcome of serious neuro-vascular diseases. However, characterizing the highly variable CoW anatomy is still a manual and time-consuming expert task. The CoW is usually imaged by two angiographic imaging modalities, magnetic resonance angiography (MRA) and computed tomography angiography (CTA), but there exist limited public datasets with annotations on CoW anatomy, especially for CTA. Therefore we organized the TopCoW Challenge in 2023 with the release of an annotated CoW dataset. The TopCoW dataset was the first public dataset with voxel-level annotations for thirteen possible CoW vessel components, enabled by virtual-reality (VR) technology. It was also the first large dataset with paired MRA and CTA from the same patients. TopCoW challenge formalized the CoW characterization problem as a multiclass anatomical segmentation task with an emphasis on topological metrics. We invited submissions worldwide for the CoW segmentation task, which attracted over 140 registered participants from four continents. The top performing teams managed to segment many CoW components to Dice scores around 90%, but with lower scores for communicating arteries and rare variants. There were also topological mistakes for predictions with high Dice scores. Additional topological analysis revealed further areas for improvement in detecting certain CoW components and matching CoW variant topology accurately. TopCoW represented a first attempt at benchmarking the CoW anatomical segmentation task for MRA and CTA, both morphologically and topologically.
Submitted 29 April, 2024; v1 submitted 29 December, 2023;
originally announced December 2023.
-
MolFM: A Multimodal Molecular Foundation Model
Authors:
Yizhen Luo,
Kai Yang,
Massimo Hong,
Xing Yi Liu,
Zaiqing Nie
Abstract:
Molecular knowledge resides within three different modalities of information sources: molecular structures, biomedical documents, and knowledge bases. Effective incorporation of molecular knowledge from these modalities holds paramount significance in facilitating biomedical research. However, existing multimodal molecular foundation models exhibit limitations in capturing intricate connections between molecular structures and texts, and more importantly, none of them attempt to leverage a wealth of molecular expertise derived from knowledge graphs. In this study, we introduce MolFM, a multimodal molecular foundation model designed to facilitate joint representation learning from molecular structures, biomedical texts, and knowledge graphs. We propose cross-modal attention between atoms of molecular structures, neighbors of molecule entities and semantically related texts to facilitate cross-modal comprehension. We provide theoretical analysis that our cross-modal pre-training captures local and global molecular knowledge by minimizing the distance in the feature space between different modalities of the same molecule, as well as molecules sharing similar structures or functions. MolFM achieves state-of-the-art performance on various downstream tasks. On cross-modal retrieval, MolFM outperforms existing models with 12.13% and 5.04% absolute gains under the zero-shot and fine-tuning settings, respectively. Furthermore, qualitative analysis showcases MolFM's implicit ability to provide grounding from molecular substructures and knowledge graphs. Code and models are available on https://github.com/BioFM/OpenBioMed.
Submitted 21 July, 2023; v1 submitted 6 June, 2023;
originally announced July 2023.
-
Machine Learning for Protein Engineering
Authors:
Kadina E. Johnston,
Clara Fannjiang,
Bruce J. Wittmann,
Brian L. Hie,
Kevin K. Yang,
Zachary Wu
Abstract:
Directed evolution of proteins has been the most effective method for protein engineering. However, a new paradigm is emerging, fusing the library generation and screening approaches of traditional directed evolution with computation through the training of machine learning models on protein sequence fitness data. This chapter highlights successful applications of machine learning to protein engineering and directed evolution, organized by the improvements that have been made with respect to each step of the directed evolution cycle. Additionally, we provide an outlook for the future based on the current direction of the field, namely in the development of calibrated models and in incorporating other modalities, such as protein structure.
Submitted 26 May, 2023;
originally announced May 2023.
-
Applying Deep Reinforcement Learning to the HP Model for Protein Structure Prediction
Authors:
Kaiyuan Yang,
Hou**g Huang,
Olafs Vandans,
Adithya Murali,
Fujia Tian,
Roland H. C. Yap,
Liang Dai
Abstract:
A central problem in computational biophysics is protein structure prediction, i.e., finding the optimal folding of a given amino acid sequence. This problem has been studied in a classical abstract model, the HP model, where the protein is modeled as a sequence of H (hydrophobic) and P (polar) amino acids on a lattice. The objective is to find conformations maximizing H-H contacts. It is known that even in this reduced setting, the problem is intractable (NP-hard). In this work, we apply deep reinforcement learning (DRL) to the two-dimensional HP model. We can obtain the conformations of best known energies for benchmark HP sequences with lengths from 20 to 50. Our DRL is based on a deep Q-network (DQN). We find that a DQN based on long short-term memory (LSTM) architecture greatly enhances the RL learning ability and significantly improves the search process. DRL can sample the state space efficiently, without the need of manual heuristics. Experimentally we show that it can find multiple distinct best-known solutions per trial. This study demonstrates the effectiveness of deep reinforcement learning in the HP model for protein folding.
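The HP-model objective described in this abstract can be illustrated with a short sketch. It scores a given 2D lattice conformation by counting H-H contacts between residues that are lattice neighbours but not chain neighbours. The function name `hp_energy` and the coordinate-list input format are assumptions made for illustration; this is not the authors' code and it contains no reinforcement learning.

```python
def hp_energy(sequence, coords):
    """Return the HP-model energy (minus the number of H-H contacts).

    sequence: string of 'H' and 'P' characters (hypothetical input format).
    coords:   list of (x, y) integer lattice positions, one per residue.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    # Consecutive residues must sit on adjacent lattice sites.
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        if abs(x0 - x1) + abs(y0 - y1) != 1:
            raise ValueError("chain is broken between consecutive residues")

    index_at = {pos: i for i, pos in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each lattice pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


# A four-residue chain folded into a unit square brings its two ends together.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
line = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(hp_energy("HPPH", square), hp_energy("HPPH", line))
```

A search method, whether a heuristic or a learned policy, would use a score like this as the quantity to minimize over self-avoiding walks.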
Submitted 9 December, 2022; v1 submitted 27 November, 2022;
originally announced November 2022.
-
Protein structure generation via folding diffusion
Authors:
Kevin E. Wu,
Kevin K. Yang,
Rianne van den Berg,
James Y. Zou,
Alex X. Lu,
Ava P. Amini
Abstract:
The ability to computationally generate novel yet physically foldable protein structures could lead to new biological discoveries and new treatments targeting yet incurable diseases. Despite recent advances in protein structure prediction, directly generating diverse, novel protein structures from neural networks remains difficult. In this work, we present a new diffusion-based generative model that designs protein backbone structures via a procedure that mirrors the native folding process. We describe protein backbone structure as a series of consecutive angles capturing the relative orientation of the constituent amino acid residues, and generate new structures by denoising from a random, unfolded state towards a stable folded structure. Not only does this mirror how proteins biologically twist into energetically favorable conformations, the inherent shift and rotational invariance of this representation crucially alleviates the need for complex equivariant networks. We train a denoising diffusion probabilistic model with a simple transformer backbone and demonstrate that our resulting model unconditionally generates highly realistic protein structures with complexity and structural patterns akin to those of naturally-occurring proteins. As a useful resource, we release the first open-source codebase and trained models for protein structure diffusion.
Submitted 23 November, 2022; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Dominant Eigenvalue-Eigenvector Pair Estimation via Graph Infection
Authors:
Kaiyuan Yang,
Li Xia,
Y. C. Tay
Abstract:
We present a novel method to estimate the dominant eigenvalue and eigenvector pair of any non-negative real matrix via graph infection. The key idea in our technique lies in approximating the solution to the first-order matrix ordinary differential equation (ODE) with the Euler method. A graph, which can be weighted, directed, and with loops, is first converted to its adjacency matrix A. Then by a naive infection model for graphs, we establish the corresponding first-order matrix ODE, through which A's dominant eigenvalue is revealed by the fastest growing term. When there are multiple dominant eigenvalues of the same magnitude, the classical power iteration method can fail. In contrast, our method can converge to the dominant eigenvalue even when same-magnitude counterparts exist, be they complex or opposite in sign. We conduct several experiments comparing the convergence between our method and power iteration. Our results show clear advantages over power iteration for tree graphs, bipartite graphs, directed graphs with periods, and Markov chains with spider-traps. To our knowledge, this is the first work that estimates the dominant eigenvalue and eigenvector pair from the perspective of a dynamical system and matrix ODE. We believe our method can be adopted as an alternative to power iteration, especially for graphs.
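The idea in this abstract can be sketched in a few lines, under the assumption that the matrix ODE is dx/dt = Ax and that one Euler step is x ← x + hAx. The per-step growth factor then approaches 1 + hλ for the dominant eigenvalue λ. The function name, the step size `step`, and the starting vector are illustrative choices, not the authors' algorithm or parameters.

```python
import math


def euler_dominant_eigenpair(A, step=0.1, iters=500):
    """Estimate the dominant eigenpair of a non-negative square matrix A.

    Applies Euler steps x <- x + step * A x to dx/dt = A x and renormalizes
    after each step; the growth factor tends to 1 + step * lambda.
    """
    n = len(A)
    x = [1.0 + i for i in range(n)]  # positive, deliberately non-uniform start
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    lam = 0.0
    for _ in range(iters):
        Ax = [sum(A[r][c] * x[c] for c in range(n)) for r in range(n)]
        x_new = [x[r] + step * Ax[r] for r in range(n)]  # one Euler step
        growth = math.sqrt(sum(v * v for v in x_new))    # x has unit norm
        lam = (growth - 1.0) / step
        x = [v / growth for v in x_new]
    return lam, x


# A single undirected edge is bipartite: its eigenvalues are +1 and -1, so
# plain power iteration oscillates from this kind of starting vector.
lam, vec = euler_dominant_eigenpair([[0.0, 1.0], [1.0, 0.0]])
print(round(lam, 6), [round(v, 6) for v in vec])
```

In this sketch the Euler step amounts to iterating with I + hA, which moves the eigenvalue -λ to 1 - hλ and so separates it in magnitude from 1 + hλ.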
Submitted 7 May, 2023; v1 submitted 1 August, 2022;
originally announced August 2022.
-
Exploring evolution-aware & -free protein language models as protein function predictors
Authors:
Mingyang Hu,
Fajie Yuan,
Kevin K. Yang,
Fusong Ju,
** Su,
Hui Wang,
Fei Yang,
Qiuyang Ding
Abstract:
Large-scale Protein Language Models (PLMs) have improved performance in protein prediction tasks, ranging from 3D structure prediction to various function predictions. In particular, AlphaFold, a ground-breaking AI system, could potentially reshape structural biology. However, the utility of the PLM module in AlphaFold, Evoformer, has not been explored beyond structure prediction. In this paper, we investigate the representation ability of three popular PLMs: ESM-1b (single sequence), MSA-Transformer (multiple sequence alignment) and Evoformer (structural), with a special focus on Evoformer. Specifically, we aim to answer the following key questions: (i) Does the Evoformer trained as part of AlphaFold produce representations amenable to predicting protein function? (ii) If yes, can Evoformer replace ESM-1b and MSA-Transformer? (iii) How much do these PLMs rely on evolution-related protein data? In this regard, are they complementary to each other? We compare these models by empirical study along with new insights and conclusions. All code and datasets for reproducibility are available at https://github.com/elttaes/Revisiting-PLMs.
Submitted 16 October, 2022; v1 submitted 13 June, 2022;
originally announced June 2022.
-
Magnetoelectric Bio-Implants Powered and Programmed by a Single Transmitter for Coordinated Multisite Stimulation
Authors:
Zhanghao Yu,
Joshua C. Chen,
Yan He,
Fatima T. Alrashdan,
Benjamin W. Avants,
Amanda Singer,
Jacob T. Robinson,
Kaiyuan Yang
Abstract:
This article presents a hardware platform including stimulating implants wirelessly powered and controlled by a shared transmitter (TX) for coordinated leadless multisite stimulation. The adopted novel single-TX, multiple-implant structure can flexibly deploy stimuli, improve system efficiency, easily scale stimulating channel quantity, and relieve efforts in device synchronization. In the proposed system, a wireless link leveraging the magnetoelectric (ME) effect is co-designed with a robust and efficient system-on-chip (SoC) to enable reliable operation and individual programming of every implant. Each implant integrates a 0.8-mm$^2$ chip, a 6-mm$^2$ ME film, and an energy storage capacitor within a 6.2-mm$^3$ size. ME power transfer is capable of safely transmitting milliwatt power to devices placed several centimeters away from the TX coil, maintaining good efficiency with size constraints, and tolerating 60-degree, 1.5-cm misalignment in angular and lateral movement. The SoC robustly operates with 2-V source amplitude variations that span a 40-mm TX-implant distance change, realizes individual addressability through physical unclonable function (PUF) IDs, and achieves 90% efficiency for 1.5-3.5-V stimulation with fully programmable stimulation parameters.
Submitted 31 December, 2021;
originally announced December 2021.
-
Machine learning modeling of family wide enzyme-substrate specificity screens
Authors:
Samuel Goldman,
Ria Das,
Kevin K. Yang,
Connor W. Coley
Abstract:
Biocatalysis is a promising approach to sustainably synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale. However, the adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates. While machine learning and in silico directed evolution are well-posed for this predictive modeling challenge, efforts to date have primarily aimed to increase activity against a single known substrate, rather than to identify enzymes capable of acting on new substrates of interest. To address this need, we curate 6 different high-quality enzyme family screens from the literature that each measure multiple enzymes against multiple substrates. We compare machine learning-based compound-protein interaction (CPI) modeling approaches from the literature used for predicting drug-target interactions. Surprisingly, comparing these interaction-based models against collections of independent (single task) enzyme-only or substrate-only models reveals that current CPI approaches are incapable of learning interactions between compounds and proteins in the current family level data regime. We further validate this observation by demonstrating that our no-interaction baseline can outperform CPI-based models from the literature used to guide the discovery of kinase inhibitors. Given the high performance of non-interaction based models, we introduce a new structure-based strategy for pooling residue representations across a protein sequence. Altogether, this work motivates a principled path forward in order to build and evaluate meaningful predictive models for biocatalysis and other drug discovery applications.
Submitted 8 September, 2021;
originally announced September 2021.
-
MagNI: A Magnetoelectrically Powered and Controlled Wireless Neurostimulating Implant
Authors:
Zhanghao Yu,
Joshua C. Chen,
Fatima T. Alrashdan,
Benjamin W. Avants,
Yan He,
Amanda Singer,
Jacob T. Robinson,
Kaiyuan Yang
Abstract:
This paper presents the first wireless and programmable neural stimulator leveraging magnetoelectric (ME) effects for power and data transfer. Thanks to low tissue absorption, low misalignment sensitivity and high power transfer efficiency, the ME effect enables safe delivery of high power levels (a few milliwatts) at low resonant frequencies (~250 kHz) to mm-sized implants deep inside the body (30-mm depth). The presented MagNI (Magnetoelectric Neural Implant) consists of a 1.5-mm$^2$ 180-nm CMOS chip, an in-house built 4x2 mm ME film, an energy storage capacitor, and on-board electrodes on a flexible polyimide substrate with a total volume of 8.2 mm$^3$. The chip with a power consumption of 23.7 $μ$W includes robust system control and data recovery mechanisms under source amplitude variations (1-V variation tolerance). The system delivers fully-programmable bi-phasic current-controlled stimulation with patterns covering 0.05-to-1.5-mA amplitude, 64-to-512-$μ$s pulse width, and 0-to-200-Hz repetition frequency for neurostimulation.
Submitted 6 July, 2021;
originally announced July 2021.
-
Adaptive machine learning for protein engineering
Authors:
Brian L. Hie,
Kevin K. Yang
Abstract:
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
Submitted 6 July, 2021; v1 submitted 9 June, 2021;
originally announced June 2021.
-
Protein sequence design with deep generative models
Authors:
Zachary Wu,
Kadina E. Johnston,
Frances H. Arnold,
Kevin K. Yang
Abstract:
Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.
Submitted 9 April, 2021;
originally announced April 2021.
-
Emulation of Astrocyte Induced Neural Phase Synchrony in Spin-Orbit Torque Oscillator Neurons
Authors:
Umang Garg,
Kezhou Yang,
Abhronil Sengupta
Abstract:
Astrocytes play a central role in inducing concerted phase-synchronized neural-wave patterns inside the brain. In this article, we demonstrate that an injected radio-frequency signal in the underlying heavy metal layer of spin-orbit torque oscillator neurons mimics the neuron phase synchronization effect realized by glial cells. A potential application of such phase coupling effects is illustrated in the context of a temporal "binding problem". We also present the design of a coupled neuron-synapse-astrocyte network enabled by compact neuromimetic devices by combining the concepts of local spike-timing dependent plasticity and astrocyte-induced neural phase synchrony.
Submitted 16 September, 2021; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Improved Conditional Flow Models for Molecule to Image Synthesis
Authors:
Karren Yang,
Samuel Goldman,
Wengong **,
Alex Lu,
Regina Barzilay,
Tommi Jaakkola,
Caroline Uhler
Abstract:
In this paper, we aim to synthesize cell microscopy images under different molecular interventions, motivated by practical applications to drug development. Building on the recent success of graph neural networks for learning molecular embeddings and flow-based models for image generation, we propose Mol2Image: a flow-based generative model for molecule to cell image synthesis. To generate cell features at different resolutions and scale to high-resolution images, we develop a novel multi-scale flow architecture based on a Haar wavelet image pyramid. To maximize the mutual information between the generated images and the molecular interventions, we devise a training strategy based on contrastive learning. To evaluate our model, we propose a new set of metrics for biological image generation that are robust, interpretable, and relevant to practitioners. We show quantitatively that our method learns a meaningful embedding of the molecular intervention, which is translated into an image representation reflecting the biological effects of the intervention.
Submitted 15 June, 2020;
originally announced June 2020.
-
Causal Network Models of SARS-CoV-2 Expression and Aging to Identify Candidates for Drug Repurposing
Authors:
Anastasiya Belyaeva,
Louis Cammarata,
Adityanarayanan Radhakrishnan,
Chandler Squires,
Karren Dai Yang,
G. V. Shivashankar,
Caroline Uhler
Abstract:
Given the severity of the SARS-CoV-2 pandemic, a major challenge is to rapidly repurpose existing approved drugs for clinical interventions. While a number of data-driven and experimental approaches have been suggested in the context of drug repurposing, a platform that systematically integrates available transcriptomic, proteomic and structural data is missing. More importantly, given that SARS-CoV-2 pathogenicity is highly age-dependent, it is critical to integrate aging signatures into drug discovery platforms. We here take advantage of large-scale transcriptional drug screens combined with RNA-seq data of the lung epithelium with SARS-CoV-2 infection as well as the aging lung. To identify robust druggable protein targets, we propose a principled causal framework that makes use of multiple data modalities. Our analysis highlights the importance of serine/threonine and tyrosine kinases as potential targets that intersect the SARS-CoV-2 and aging pathways. By integrating transcriptomic, proteomic and structural data that is available for many diseases, our drug discovery platform is broadly applicable. Rigorous in vitro experiments as well as clinical trials are needed to validate the identified candidate drugs.
Submitted 5 June, 2020;
originally announced June 2020.
-
Uncertainty Quantification Using Neural Networks for Molecular Property Prediction
Authors:
Lior Hirschfeld,
Kyle Swanson,
Kevin Yang,
Regina Barzilay,
Connor W. Coley
Abstract:
Uncertainty quantification (UQ) is an important component of molecular property prediction, particularly for drug discovery applications where model predictions direct experimental design and where unanticipated imprecision wastes valuable time and resources. The need for UQ is especially acute for neural models, which are becoming increasingly standard yet are challenging to interpret. While several approaches to UQ have been proposed in the literature, there is no clear consensus on the comparative performance of these models. In this paper, we study this question in the context of regression tasks. We systematically evaluate several methods on five benchmark datasets using multiple complementary performance metrics. Our experiments show that none of the methods we tested is unequivocally superior to all others, and none produces a particularly reliable ranking of errors across multiple datasets. While we believe these results show that existing UQ methods are not sufficient for all common use-cases and demonstrate the benefits of further research, we conclude with a practical recommendation as to which existing techniques seem to perform well relative to others.
Submitted 20 May, 2020;
originally announced May 2020.
-
Preliminary prediction of the basic reproduction number of the Wuhan novel coronavirus 2019-nCoV
Authors:
Tao Zhou,
Quanhui Liu,
Zimo Yang,
**gyi Liao,
Kexin Yang,
Wei Bai,
Xin Lü,
Wei Zhang
Abstract:
Objectives.--To estimate the basic reproduction number of the Wuhan novel coronavirus (2019-nCoV). Methods.--Based on the susceptible-exposed-infected-removed (SEIR) compartment model and the assumption that the symptomatic infectious cases that occurred before January 25, 2020 resulted from free propagation without intervention, we estimate the basic reproduction number of 2019-nCoV according to the reported confirmed cases and suspected cases, as well as the theoretically estimated number of infected cases by other research teams, together with some epidemiological determinants learned from the severe acute respiratory syndrome (SARS). Results.--The basic reproduction number falls between 2.8 and 3.3 using the real-time reports on the number of 2019-nCoV infected cases from People's Daily in China, and between 3.2 and 3.9 on the basis of the predicted number of infected cases from colleagues. Conclusions.--The early transmission ability of 2019-nCoV is close to or slightly higher than that of SARS. It is a controllable disease with moderate-to-high transmissibility. Timely and effective control measures are needed to suppress further transmission. Notes Added.--Using newly reported epidemiological determinants for early 2019-nCoV, the estimated basic reproduction number is in the range [2.2,3.0].
Submitted 31 January, 2020; v1 submitted 28 January, 2020;
originally announced January 2020.
-
Batched Stochastic Bayesian Optimization via Combinatorial Constraints Design
Authors:
Kevin K. Yang,
Yuxin Chen,
Alycia Lee,
Yisong Yue
Abstract:
In many high-throughput experimental design settings, such as those common in biochemical engineering, batched queries are more cost effective than one-by-one sequential queries. Furthermore, it is often not possible to directly choose items to query. Instead, the experimenter specifies a set of constraints that generates a library of possible items, which are then selected stochastically. Motivated by these considerations, we investigate Batched Stochastic Bayesian Optimization (BSBO), a novel Bayesian optimization scheme for choosing the constraints in order to guide exploration towards items with greater utility. We focus on site-saturation mutagenesis, a prototypical setting of BSBO in biochemical engineering, and propose a natural objective function for this problem. Importantly, we show that our objective function can be efficiently decomposed as a difference of submodular functions (DS), which allows us to employ DS optimization tools to greedily identify sets of constraints that increase the likelihood of finding items with high utility. Our experimental results show that our algorithm outperforms common heuristics on both synthetic and two real protein datasets.
Submitted 17 April, 2019;
originally announced April 2019.
-
Machine learning-guided directed evolution for protein engineering
Authors:
Kevin K. Yang,
Zachary Wu,
Frances H. Arnold
Abstract:
Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function.
Submitted 19 April, 2019; v1 submitted 26 November, 2018;
originally announced November 2018.
-
Niche inheritance: a cooperative pathway to enhance cancer cell fitness through ecosystem engineering
Authors:
Kimberline R. Yang,
Steven Mooney,
Jelani C. Zarif,
Donald S. Coffey,
Russell S. Taichman,
Kenneth J. Pienta
Abstract:
Cancer cells can be described as an invasive species that is able to establish itself in a new environment. The concept of niche construction can be utilized to describe the process by which cancer cells terraform their environment, thereby engineering an ecosystem that promotes the genetic fitness of the species. Ecological dispersion theory can then be utilized to describe and model the steps and barriers involved in a successful diaspora as the cancer cells leave the original host organ and migrate to new host organs to successfully establish a new metastatic community. These ecological concepts can be further utilized to define new diagnostic and therapeutic areas for lethal cancers.
Submitted 28 March, 2014;
originally announced March 2014.