Search | arXiv e-print repository

Learning Multi-view Molecular Representations with Structured and Unstructured Knowledge

Authors: Yizhen Luo, Kai Yang, Massimo Hong, Xing Yi Liu, Zikun Nie, Hao Zhou, Zaiqing Nie

Abstract: Capturing molecular knowledge with representation learning approaches holds significant potential in vast scientific fields such as chemistry and life science. An effective and generalizable molecular representation is expected to capture the consensus and complementary molecular expertise from diverse views and perspectives. However, existing works fall short in learning multi-view molecular representations, due to challenges in explicitly incorporating view information and handling molecular knowledge from heterogeneous sources. To address these issues, we present MV-Mol, a molecular representation learning model that harvests multi-view molecular expertise from chemical structures, unstructured knowledge from biomedical texts, and structured knowledge from knowledge graphs. We utilize text prompts to model view information and design a fusion architecture to extract view-based molecular representations. We develop a two-stage pre-training procedure, exploiting heterogeneous data of varying quality and quantity. Through extensive experiments, we show that MV-Mol provides improved representations that substantially benefit molecular property prediction. Additionally, MV-Mol exhibits state-of-the-art performance in multi-modal comprehension of molecular structures and texts. Code and data are available at https://github.com/PharMolix/OpenBioMed.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 12 pages, 4 figures

arXiv:2312.17670 [pdf, other]

Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA

Authors: Kaiyuan Yang, Fabio Musio, Yihui Ma, Norman Juchler, Johannes C. Paetzold, Rami Al-Maskari, Luciano Höher, Hongwei Bran Li, Ibrahim Ethem Hamamci, Anjany Sekuboyina, Suprosanna Shit, Hou**g Huang, Chinmay Prabhakar, Ezequiel de la Rosa, Diana Waldmannstetter, Florian Kofler, Fernando Navarro, Martin Menten, Ivan Ezhov, Daniel Rueckert, Iris Vos, Ynte Ruigrok, Birgitta Velthuis, Hugo Kuijf, Julien Hämmerli, et al. (59 additional authors not shown)

Abstract: The Circle of Willis (CoW) is an important network of arteries connecting major circulations of the brain. Its vascular architecture is believed to affect the risk, severity, and clinical outcome of serious neuro-vascular diseases. However, characterizing the highly variable CoW anatomy is still a manual and time-consuming expert task. The CoW is usually imaged by two angiographic imaging modalities, magnetic resonance angiography (MRA) and computed tomography angiography (CTA), but there exist limited public datasets with annotations on CoW anatomy, especially for CTA. Therefore we organized the TopCoW Challenge in 2023 with the release of an annotated CoW dataset. The TopCoW dataset was the first public dataset with voxel-level annotations for thirteen possible CoW vessel components, enabled by virtual-reality (VR) technology. It was also the first large dataset with paired MRA and CTA from the same patients. The TopCoW challenge formalized the CoW characterization problem as a multiclass anatomical segmentation task with an emphasis on topological metrics. We invited submissions worldwide for the CoW segmentation task, which attracted over 140 registered participants from four continents. The top performing teams managed to segment many CoW components to Dice scores around 90%, but with lower scores for communicating arteries and rare variants. There were also topological mistakes for predictions with high Dice scores. Additional topological analysis revealed further areas for improvement in detecting certain CoW components and matching CoW variant topology accurately. TopCoW represented a first attempt at benchmarking the CoW anatomical segmentation task for MRA and CTA, both morphologically and topologically.

Submitted 29 April, 2024; v1 submitted 29 December, 2023; originally announced December 2023.

Comments: 24 pages, 11 figures, 9 tables. Summary Paper for the MICCAI TopCoW 2023 Challenge

arXiv:2307.09484 [pdf, other]

MolFM: A Multimodal Molecular Foundation Model

Authors: Yizhen Luo, Kai Yang, Massimo Hong, Xing Yi Liu, Zaiqing Nie

Abstract: Molecular knowledge resides within three different modalities of information sources: molecular structures, biomedical documents, and knowledge bases. Effective incorporation of molecular knowledge from these modalities holds paramount significance in facilitating biomedical research. However, existing multimodal molecular foundation models exhibit limitations in capturing intricate connections between molecular structures and texts, and more importantly, none of them attempt to leverage a wealth of molecular expertise derived from knowledge graphs. In this study, we introduce MolFM, a multimodal molecular foundation model designed to facilitate joint representation learning from molecular structures, biomedical texts, and knowledge graphs. We propose cross-modal attention between atoms of molecular structures, neighbors of molecule entities and semantically related texts to facilitate cross-modal comprehension. We provide theoretical analysis that our cross-modal pre-training captures local and global molecular knowledge by minimizing the distance in the feature space between different modalities of the same molecule, as well as molecules sharing similar structures or functions. MolFM achieves state-of-the-art performance on various downstream tasks. On cross-modal retrieval, MolFM outperforms existing models with 12.13% and 5.04% absolute gains under the zero-shot and fine-tuning settings, respectively. Furthermore, qualitative analysis showcases MolFM's implicit ability to provide grounding from molecular substructures and knowledge graphs. Code and models are available on https://github.com/BioFM/OpenBioMed.

Submitted 21 July, 2023; v1 submitted 6 June, 2023; originally announced July 2023.

Comments: 31 pages, 15 figures, and 15 tables

arXiv:2305.16634 [pdf, other]

Machine Learning for Protein Engineering

Authors: Kadina E. Johnston, Clara Fannjiang, Bruce J. Wittmann, Brian L. Hie, Kevin K. Yang, Zachary Wu

Abstract: Directed evolution of proteins has been the most effective method for protein engineering. However, a new paradigm is emerging, fusing the library generation and screening approaches of traditional directed evolution with computation through the training of machine learning models on protein sequence fitness data. This chapter highlights successful applications of machine learning to protein engineering and directed evolution, organized by the improvements that have been made with respect to each step of the directed evolution cycle. Additionally, we provide an outlook for the future based on the current direction of the field, namely in the development of calibrated models and in incorporating other modalities, such as protein structure.

Submitted 26 May, 2023; originally announced May 2023.

Comments: Initial book chapter submission on February 28, 2022, to be published by Springer Nature

arXiv:2211.14939 [pdf, other]

doi 10.1016/j.physa.2022.128395

Applying Deep Reinforcement Learning to the HP Model for Protein Structure Prediction

Authors: Kaiyuan Yang, Hou**g Huang, Olafs Vandans, Adithya Murali, Fujia Tian, Roland H. C. Yap, Liang Dai

Abstract: A central problem in computational biophysics is protein structure prediction, i.e., finding the optimal folding of a given amino acid sequence. This problem has been studied in a classical abstract model, the HP model, where the protein is modeled as a sequence of H (hydrophobic) and P (polar) amino acids on a lattice. The objective is to find conformations maximizing H-H contacts. It is known that even in this reduced setting, the problem is intractable (NP-hard). In this work, we apply deep reinforcement learning (DRL) to the two-dimensional HP model. We can obtain the conformations of best known energies for benchmark HP sequences with lengths from 20 to 50. Our DRL is based on a deep Q-network (DQN). We find that a DQN based on long short-term memory (LSTM) architecture greatly enhances the RL learning ability and significantly improves the search process. DRL can sample the state space efficiently, without the need of manual heuristics. Experimentally we show that it can find multiple distinct best-known solutions per trial. This study demonstrates the effectiveness of deep reinforcement learning in the HP model for protein folding.

Submitted 9 December, 2022; v1 submitted 27 November, 2022; originally announced November 2022.

Comments: Published at Physica A: Statistical Mechanics and its Applications, available online 7 December 2022. Extended abstract accepted by the Machine Learning and the Physical Sciences workshop, NeurIPS 2022
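The objective named in the abstract above (maximizing H-H contacts on a 2D lattice) can be illustrated with a small scoring function. This is a minimal sketch, not the authors' code: it assumes the usual HP-model convention that a contact is a pair of H residues on adjacent lattice sites that are not consecutive in the chain, and the name `hh_contacts` and the coordinate-list input format are hypothetical.

```python
def hh_contacts(sequence, coords):
    """Count H-H contacts of a 2D square-lattice conformation.

    sequence: string over {"H", "P"}.
    coords:   list of (x, y) integer lattice sites, one per residue.
    Assumed convention: a contact is two H residues on neighbouring
    sites that are not neighbours along the chain.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must occupy adjacent sites")

    index_at = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Looking only right and up counts every neighbouring pair once.
        for site in ((x + 1, y), (x, y + 1)):
            j = index_at.get(site)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


# A 4-residue chain folded into a unit square brings residues 0 and 3 together.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
line = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(hh_contacts("HHHH", square), hh_contacts("HHHH", line))
```

Under this convention the energy of a conformation is usually taken as the negative of the contact count, so a search procedure such as the DQN described above would seek conformations that make this number as large as possible.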

arXiv:2209.15611 [pdf, other]

Protein structure generation via folding diffusion

Authors: Kevin E. Wu, Kevin K. Yang, Rianne van den Berg, James Y. Zou, Alex X. Lu, Ava P. Amini

Abstract: The ability to computationally generate novel yet physically foldable protein structures could lead to new biological discoveries and new treatments targeting yet incurable diseases. Despite recent advances in protein structure prediction, directly generating diverse, novel protein structures from neural networks remains difficult. In this work, we present a new diffusion-based generative model that designs protein backbone structures via a procedure that mirrors the native folding process. We describe protein backbone structure as a series of consecutive angles capturing the relative orientation of the constituent amino acid residues, and generate new structures by denoising from a random, unfolded state towards a stable folded structure. Not only does this mirror how proteins biologically twist into energetically favorable conformations, the inherent shift and rotational invariance of this representation crucially alleviates the need for complex equivariant networks. We train a denoising diffusion probabilistic model with a simple transformer backbone and demonstrate that our resulting model unconditionally generates highly realistic protein structures with complexity and structural patterns akin to those of naturally-occurring proteins. As a useful resource, we release the first open-source codebase and trained models for protein structure diffusion.

Submitted 23 November, 2022; v1 submitted 30 September, 2022; originally announced September 2022.

ACM Class: I.2.0; J.3

arXiv:2208.00982 [pdf, other]

Dominant Eigenvalue-Eigenvector Pair Estimation via Graph Infection

Authors: Kaiyuan Yang, Li Xia, Y. C. Tay

Abstract: We present a novel method to estimate the dominant eigenvalue and eigenvector pair of any non-negative real matrix via graph infection. The key idea in our technique lies in approximating the solution to the first-order matrix ordinary differential equation (ODE) with the Euler method. Graphs, which can be weighted, directed, and with loops, are first converted to their adjacency matrix A. Then by a naive infection model for graphs, we establish the corresponding first-order matrix ODE, through which A's dominant eigenvalue is revealed by the fastest growing term. When there are multiple dominant eigenvalues of the same magnitude, the classical power iteration method can fail. In contrast, our method can converge to the dominant eigenvalue even when same-magnitude counterparts exist, be they complex or opposite in sign. We conduct several experiments comparing the convergence between our method and power iteration. Our results show clear advantages over power iteration for tree graphs, bipartite graphs, directed graphs with periods, and Markov chains with spider-traps. To our knowledge, this is the first work that estimates dominant eigenvalue and eigenvector pair from the perspective of a dynamical system and matrix ODE. We believe our method can be adopted as an alternative to power iteration, especially for graphs.

Submitted 7 May, 2023; v1 submitted 1 August, 2022; originally announced August 2022.

Comments: Research paper accepted by Proc. 16th International Conference on Graph Transformation (ICGT 2023), Leicester, UK. Extended abstract accepted by the Graph Signal Processing (GSP) Workshop 2023, Oxford, UK. GitHub source code: https://github.com/FeynmanDNA/Dominant_EigenPair_Est_Graph_Infection
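The idea summarized in the abstract above can be sketched in a few lines. This is one reading of the abstract, not the authors' implementation (their code is at the GitHub link above): it assumes the infection model is dx/dt = Ax, so a forward Euler step is x_{k+1} = x_k + h·A·x_k, and the per-step growth of the total infection approaches 1 + h·λ, where λ is the dominant eigenvalue. The names `euler_dominant_pair` and `power_iteration`, the step size `h`, the step count, and the single-infected-node start are all hypothetical choices.

```python
def matvec(A, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def euler_dominant_pair(A, h=0.1, steps=2000):
    """Estimate the dominant eigenpair of a non-negative matrix A.

    Iterates the Euler step x <- x + h*A*x for dx/dt = A x and reads the
    eigenvalue off the growth factor (about 1 + h*lambda). Assumes the
    start node reaches the part of the graph carrying that eigenvalue.
    """
    x = [1.0] + [0.0] * (len(A) - 1)  # one infected node, total mass 1
    lam = 0.0
    for _ in range(steps):
        y = [xi + h * axi for xi, axi in zip(x, matvec(A, x))]
        total = sum(y)               # x sums to 1, so this is the growth factor
        lam = (total - 1.0) / h
        x = [yi / total for yi in y]  # renormalize to avoid overflow
    return lam, x


def power_iteration(A, steps):
    """Plain power iteration from the same start, for comparison."""
    x = [1.0] + [0.0] * (len(A) - 1)
    for _ in range(steps):
        y = matvec(A, x)
        total = sum(y)
        x = [yi / total for yi in y]
    return x


# Two-node bipartite graph: eigenvalues +1 and -1 share the same magnitude.
A = [[0.0, 1.0], [1.0, 0.0]]
lam, vec = euler_dominant_pair(A)
print(round(lam, 6), [round(v, 6) for v in vec])
print(power_iteration(A, 2001))  # keeps flipping between the two nodes
```

On this bipartite example power iteration never settles, while the Euler iteration converges because the eigenvalues 1 + h and 1 - h of I + hA no longer have equal magnitude. In this discretized form the scheme amounts to power iteration on the shifted matrix I + hA.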

arXiv:2206.06583 [pdf, other]

Exploring evolution-aware & -free protein language models as protein function predictors

Authors: Mingyang Hu, Fajie Yuan, Kevin K. Yang, Fusong Ju, ** Su, Hui Wang, Fei Yang, Qiuyang Ding

Abstract: Large-scale Protein Language Models (PLMs) have improved performance in protein prediction tasks, ranging from 3D structure prediction to various function predictions. In particular, AlphaFold, a ground-breaking AI system, could potentially reshape structural biology. However, the utility of the PLM module in AlphaFold, Evoformer, has not been explored beyond structure prediction. In this paper, we investigate the representation ability of three popular PLMs: ESM-1b (single sequence), MSA-Transformer (multiple sequence alignment) and Evoformer (structural), with a special focus on Evoformer. Specifically, we aim to answer the following key questions: (i) Does the Evoformer trained as part of AlphaFold produce representations amenable to predicting protein function? (ii) If yes, can Evoformer replace ESM-1b and MSA-Transformer? (iii) How much do these PLMs rely on evolution-related protein data? In this regard, are they complementary to each other? We compare these models by empirical study along with new insights and conclusions. All code and datasets for reproducibility are available at https://github.com/elttaes/Revisiting-PLMs.

Submitted 16 October, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

arXiv:2112.15552 [pdf, other]

doi 10.1109/JSSC.2021.3129993

Magnetoelectric Bio-Implants Powered and Programmed by a Single Transmitter for Coordinated Multisite Stimulation

Authors: Zhanghao Yu, Joshua C. Chen, Yan He, Fatima T. Alrashdan, Benjamin W. Avants, Amanda Singer, Jacob T. Robinson, Kaiyuan Yang

Abstract: This article presents a hardware platform including stimulating implants wirelessly powered and controlled by a shared transmitter (TX) for coordinated leadless multisite stimulation. The adopted novel single-TX, multiple-implant structure can flexibly deploy stimuli, improve system efficiency, easily scale stimulating channel quantity, and relieve efforts in device synchronization. In the proposed system, a wireless link leveraging magnetoelectric (ME) effect is co-designed with a robust and efficient system-on-chip (SoC) to enable reliable operation and individual programming of every implant. Each implant integrates a 0.8-mm$^2$ chip, a 6-mm$^2$ ME film, and an energy storage capacitor within a 6.2-mm$^3$ size. ME power transfer is capable of safely transmitting milliwatt power to devices placed several centimeters away from the TX coil, maintaining good efficiency with size constraints, and tolerating 60-degree, 1.5-cm misalignment in angular and lateral movement. The SoC robustly operates with 2-V source amplitude variations that span a 40-mm TX-implant distance change, realizes individual addressability through physical unclonable function (PUF) IDs, and achieves 90% efficiency for 1.5-3.5-V stimulation with fully programmable stimulation parameters.

Submitted 31 December, 2021; originally announced December 2021.

Comments: This paper has been published in IEEE Journal of Solid-State Circuits, 2021

Journal ref: IEEE Journal of Solid-State Circuits, 2021

arXiv:2109.03900 [pdf, other]

doi 10.1371/journal.pcbi.1009853

Machine learning modeling of family wide enzyme-substrate specificity screens

Authors: Samuel Goldman, Ria Das, Kevin K. Yang, Connor W. Coley

Abstract: Biocatalysis is a promising approach to sustainably synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale. However, the adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates. While machine learning and in silico directed evolution are well-posed for this predictive modeling challenge, efforts to date have primarily aimed to increase activity against a single known substrate, rather than to identify enzymes capable of acting on new substrates of interest. To address this need, we curate 6 different high-quality enzyme family screens from the literature that each measure multiple enzymes against multiple substrates. We compare machine learning-based compound-protein interaction (CPI) modeling approaches from the literature used for predicting drug-target interactions. Surprisingly, comparing these interaction-based models against collections of independent (single task) enzyme-only or substrate-only models reveals that current CPI approaches are incapable of learning interactions between compounds and proteins in the current family level data regime. We further validate this observation by demonstrating that our no-interaction baseline can outperform CPI-based models from the literature used to guide the discovery of kinase inhibitors. Given the high performance of non-interaction based models, we introduce a new structure-based strategy for pooling residue representations across a protein sequence. Altogether, this work motivates a principled path forward in order to build and evaluate meaningful predictive models for biocatalysis and other drug discovery applications.

Submitted 8 September, 2021; originally announced September 2021.

arXiv:2107.02995 [pdf, other]

doi 10.1109/TBCAS.2020.3037862

MagNI: A Magnetoelectrically Powered and Controlled Wireless Neurostimulating Implant

Authors: Zhanghao Yu, Joshua C. Chen, Fatima T. Alrashdan, Benjamin W. Avants, Yan He, Amanda Singer, Jacob T. Robinson, Kaiyuan Yang

Abstract: This paper presents the first wireless and programmable neural stimulator leveraging magnetoelectric (ME) effects for power and data transfer. Thanks to low tissue absorption, low misalignment sensitivity and high power transfer efficiency, the ME effect enables safe delivery of high power levels (a few milliwatts) at low resonant frequencies (~250 kHz) to mm-sized implants deep inside the body (30-mm depth). The presented MagNI (Magnetoelectric Neural Implant) consists of a 1.5-mm$^2$ 180-nm CMOS chip, an in-house built 4x2 mm ME film, an energy storage capacitor, and on-board electrodes on a flexible polyimide substrate with a total volume of 8.2 mm$^3$. The chip with a power consumption of 23.7 $μ$W includes robust system control and data recovery mechanisms under source amplitude variations (1-V variation tolerance). The system delivers fully-programmable bi-phasic current-controlled stimulation with patterns covering 0.05-to-1.5-mA amplitude, 64-to-512-$μ$s pulse width, and 0-to-200-Hz repetition frequency for neurostimulation.

Submitted 6 July, 2021; originally announced July 2021.

Comments: This work has been accepted to 2020 IEEE Transactions on Biomedical Circuits and Systems (TBioCAS)

Journal ref: IEEE Transactions on Biomedical Circuits and Systems (TBioCAS), Volume: 14, Issue: 6, Pages: 1241-1252, Dec. 2020

arXiv:2106.05466 [pdf, other]

Adaptive machine learning for protein engineering

Authors: Brian L. Hie, Kevin K. Yang

Abstract: Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.

Submitted 6 July, 2021; v1 submitted 9 June, 2021; originally announced June 2021.

Comments: 9 pages, 2 figures

arXiv:2104.04457 [pdf, other]

doi 10.1016/j.cbpa.2021.04.004

Protein sequence design with deep generative models

Authors: Zachary Wu, Kadina E. Johnston, Frances H. Arnold, Kevin K. Yang

Abstract: Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.

Submitted 9 April, 2021; originally announced April 2021.

Comments: 11 pages, 2 figures

arXiv:2007.00776 [pdf, other]

Emulation of Astrocyte Induced Neural Phase Synchrony in Spin-Orbit Torque Oscillator Neurons

Authors: Umang Garg, Kezhou Yang, Abhronil Sengupta

Abstract: Astrocytes play a central role in inducing concerted phase synchronized neural-wave patterns inside the brain. In this article, we demonstrate that an injected radio-frequency signal in the underlying heavy metal layer of spin-orbit torque oscillator neurons mimics the neuron phase synchronization effect realized by glial cells. Potential application of such phase coupling effects is illustrated in the context of a temporal "binding problem". We also present the design of a coupled neuron-synapse-astrocyte network enabled by compact neuromimetic devices by combining the concepts of local spike-timing dependent plasticity and astrocyte induced neural phase synchrony.

Submitted 16 September, 2021; v1 submitted 1 July, 2020; originally announced July 2020.

arXiv:2006.08532 [pdf, other]

Improved Conditional Flow Models for Molecule to Image Synthesis

Authors: Karren Yang, Samuel Goldman, Wengong **, Alex Lu, Regina Barzilay, Tommi Jaakkola, Caroline Uhler

Abstract: In this paper, we aim to synthesize cell microscopy images under different molecular interventions, motivated by practical applications to drug development. Building on the recent success of graph neural networks for learning molecular embeddings and flow-based models for image generation, we propose Mol2Image: a flow-based generative model for molecule to cell image synthesis. To generate cell features at different resolutions and scale to high-resolution images, we develop a novel multi-scale flow architecture based on a Haar wavelet image pyramid. To maximize the mutual information between the generated images and the molecular interventions, we devise a training strategy based on contrastive learning. To evaluate our model, we propose a new set of metrics for biological image generation that are robust, interpretable, and relevant to practitioners. We show quantitatively that our method learns a meaningful embedding of the molecular intervention, which is translated into an image representation reflecting the biological effects of the intervention.

Submitted 15 June, 2020; originally announced June 2020.

MSC Class: 92-08

arXiv:2006.03735 [pdf, other]

doi 10.1038/s41467-021-21056-z

Causal Network Models of SARS-CoV-2 Expression and Aging to Identify Candidates for Drug Repurposing

Authors: Anastasiya Belyaeva, Louis Cammarata, Adityanarayanan Radhakrishnan, Chandler Squires, Karren Dai Yang, G. V. Shivashankar, Caroline Uhler

Abstract: Given the severity of the SARS-CoV-2 pandemic, a major challenge is to rapidly repurpose existing approved drugs for clinical interventions. While a number of data-driven and experimental approaches have been suggested in the context of drug repurposing, a platform that systematically integrates available transcriptomic, proteomic and structural data is missing. More importantly, given that SARS-CoV-2 pathogenicity is highly age-dependent, it is critical to integrate aging signatures into drug discovery platforms. We here take advantage of large-scale transcriptional drug screens combined with RNA-seq data of the lung epithelium with SARS-CoV-2 infection as well as the aging lung. To identify robust druggable protein targets, we propose a principled causal framework that makes use of multiple data modalities. Our analysis highlights the importance of serine/threonine and tyrosine kinases as potential targets that intersect the SARS-CoV-2 and aging pathways. By integrating transcriptomic, proteomic and structural data that is available for many diseases, our drug discovery platform is broadly applicable. Rigorous in vitro experiments as well as clinical trials are needed to validate the identified candidate drugs.

Submitted 5 June, 2020; originally announced June 2020.

arXiv:2005.10036 [pdf, other]

Uncertainty Quantification Using Neural Networks for Molecular Property Prediction

Authors: Lior Hirschfeld, Kyle Swanson, Kevin Yang, Regina Barzilay, Connor W. Coley

Abstract: Uncertainty quantification (UQ) is an important component of molecular property prediction, particularly for drug discovery applications where model predictions direct experimental design and where unanticipated imprecision wastes valuable time and resources. The need for UQ is especially acute for neural models, which are becoming increasingly standard yet are challenging to interpret. While several approaches to UQ have been proposed in the literature, there is no clear consensus on the comparative performance of these models. In this paper, we study this question in the context of regression tasks. We systematically evaluate several methods on five benchmark datasets using multiple complementary performance metrics. Our experiments show that none of the methods we tested is unequivocally superior to all others, and none produces a particularly reliable ranking of errors across multiple datasets. While we believe these results show that existing UQ methods are not sufficient for all common use-cases and demonstrate the benefits of further research, we conclude with a practical recommendation as to which existing techniques seem to perform well relative to others.

Submitted 20 May, 2020; originally announced May 2020.

arXiv:2001.10530 [pdf]

doi 10.1111/jebm.12376

Preliminary prediction of the basic reproduction number of the Wuhan novel coronavirus 2019-nCoV

Authors: Tao Zhou, Quanhui Liu, Zimo Yang, Jingyi Liao, Kexin Yang, Wei Bai, Xin Lü, Wei Zhang

Abstract: Objectives.--To estimate the basic reproduction number of the Wuhan novel coronavirus (2019-nCoV). Methods.--Based on the susceptible-exposed-infected-removed (SEIR) compartment model and the assumption that the infectious cases with symptoms that occurred before January 25, 2020 resulted from free propagation without intervention, we estimate the basic reproduction number of 2019-nCoV according to the reported confirmed cases and suspected cases, as well as the theoretically estimated number of infected cases by other research teams, together with some epidemiological determinants learned from the severe acute respiratory syndrome. Results.--The basic reproduction number falls between 2.8 and 3.3 by using the real-time reports on the number of 2019-nCoV infected cases from People's Daily in China, and falls between 3.2 and 3.9 on the basis of the predicted number of infected cases from colleagues. Conclusions.--The early transmission ability of 2019-nCoV is close to or slightly higher than that of SARS. It is a controllable disease with moderate-high transmissibility. Timely and effective control measures are needed to suppress further transmission. Notes Added.--Using newly reported epidemiological determinants for early 2019-nCoV, the estimated basic reproduction number is in the range [2.2, 3.0].

Submitted 31 January, 2020; v1 submitted 28 January, 2020; originally announced January 2020.

Comments: 8 pages, 1 table and 1 figure

Journal ref: Journal of Evidence Based Medicine (2020) 1

arXiv:1904.08102 [pdf, other]

Batched Stochastic Bayesian Optimization via Combinatorial Constraints Design

Authors: Kevin K. Yang, Yuxin Chen, Alycia Lee, Yisong Yue

Abstract: In many high-throughput experimental design settings, such as those common in biochemical engineering, batched queries are more cost effective than one-by-one sequential queries. Furthermore, it is often not possible to directly choose items to query. Instead, the experimenter specifies a set of constraints that generates a library of possible items, which are then selected stochastically. Motivated by these considerations, we investigate Batched Stochastic Bayesian Optimization (BSBO), a novel Bayesian optimization scheme for choosing the constraints in order to guide exploration towards items with greater utility. We focus on site-saturation mutagenesis, a prototypical setting of BSBO in biochemical engineering, and propose a natural objective function for this problem. Importantly, we show that our objective function can be efficiently decomposed as a difference of submodular functions (DS), which allows us to employ DS optimization tools to greedily identify sets of constraints that increase the likelihood of finding items with high utility. Our experimental results show that our algorithm outperforms common heuristics on both synthetic and two real protein datasets.

Submitted 17 April, 2019; originally announced April 2019.

arXiv:1811.10775 [pdf, other]

Machine learning-guided directed evolution for protein engineering

Authors: Kevin K. Yang, Zachary Wu, Frances H. Arnold

Abstract: Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function.

Submitted 19 April, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

Comments: Made significant revisions to focus on aspects most relevant to applying machine learning to speed up directed evolution

arXiv:1403.7413 [pdf]

Niche inheritance: a cooperative pathway to enhance cancer cell fitness through ecosystem engineering

Authors: Kimberline R. Yang, Steven Mooney, Jelani C. Zarif, Donald S. Coffey, Russell S. Taichman, Kenneth J. Pienta

Abstract: Cancer cells can be described as an invasive species that is able to establish itself in a new environment. The concept of niche construction can be utilized to describe the process by which cancer cells terraform their environment, thereby engineering an ecosystem that promotes the genetic fitness of the species. Ecological dispersion theory can then be utilized to describe and model the steps and barriers involved in a successful diaspora as the cancer cells leave the original host organ and migrate to new host organs to successfully establish a new metastatic community. These ecological concepts can be further utilized to define new diagnostic and therapeutic areas for lethal cancers.

Submitted 28 March, 2014; originally announced March 2014.

Comments: 8 pages, 1 Table, 4 Figures

Showing 1–21 of 21 results for author: Yang, K