-
Full-Atom Peptide Design based on Multi-modal Flow Matching
Authors:
Jiahan Li,
Chaoran Cheng,
Zuofan Wu,
Ruihan Guo,
Shitong Luo,
Zhizhou Ren,
Jian Peng,
Jianzhu Ma
Abstract:
Peptides, short chains of amino acid residues, play a vital role in numerous biological processes by interacting with other target molecules, offering substantial potential in drug discovery. In this work, we present PepFlow, the first multi-modal deep generative model grounded in the flow-matching framework for the design of full-atom peptides that target specific protein receptors. Drawing inspiration from the crucial roles of residue backbone orientations and side-chain dynamics in protein-peptide interactions, we characterize the peptide structure using rigid backbone frames within the $\mathrm{SE}(3)$ manifold and side-chain angles on high-dimensional tori. Furthermore, we represent discrete residue types in the peptide sequence as categorical distributions on the probability simplex. By learning the joint distributions of each modality using derived flows and vector fields on corresponding manifolds, our method excels in the fine-grained design of full-atom peptides. Harnessing the multi-modal paradigm, our approach adeptly tackles various tasks such as fixed-backbone sequence design and side-chain packing through partial sampling. Through meticulously crafted experiments, we demonstrate that PepFlow exhibits superior performance in comprehensive benchmarks, highlighting its significant potential in computational peptide design and analysis.
Submitted 2 June, 2024;
originally announced June 2024.
-
ProtFAD: Introducing function-aware domains as implicit modality towards protein function perception
Authors:
Mingqing Wang,
Zhiwei Nie,
Yonghong He,
Zhixiang Ren
Abstract:
Protein function prediction is currently achieved by encoding its sequence or structure, where the sequence-to-function transcendence and high-quality structural data scarcity lead to obvious performance bottlenecks. Protein domains are "building blocks" of proteins that are functionally independent, and their combinations determine the diverse biological functions. However, most existing studies have yet to thoroughly explore the intricate functional information contained in the protein domains. To fill this gap, we propose a synergistic integration approach for a function-aware domain representation, and a domain-joint contrastive learning strategy to distinguish different protein functions while aligning the modalities. Specifically, we associate domains with the GO terms as function priors to pre-train domain embeddings. Furthermore, we partition proteins into multiple sub-views based on continuous joint domains for contrastive training under the supervision of a novel triplet InfoNCE loss. Our approach significantly and comprehensively outperforms the state-of-the-art methods on various benchmarks, and clearly differentiates proteins carrying distinct functions compared to the competitor.
Submitted 23 May, 2024;
originally announced May 2024.
-
A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions
Authors:
Pengfei Liu,
Jun Tao,
Zhixiang Ren
Abstract:
The task of chemical reaction predictions (CRPs) plays a pivotal role in advancing drug discovery and material science. However, its effectiveness is constrained by the vast and uncertain chemical reaction space and challenges in capturing reaction selectivity, particularly due to existing methods' limitations in exploiting the data's inherent knowledge. To address these challenges, we introduce a data-curated self-feedback knowledge elicitation approach. This method starts from iterative optimization of molecular representations and facilitates the extraction of knowledge on chemical reaction types (RTs). Then, we employ adaptive prompt learning to infuse the prior knowledge into the large language model (LLM). As a result, we achieve significant enhancements: a 14.2% increase in retrosynthesis prediction accuracy, a 74.2% rise in reagent prediction accuracy, and an expansion in the model's capability for handling multi-task chemical reactions. This research offers a novel paradigm for knowledge elicitation in scientific research and showcases the untapped potential of LLMs in CRPs.
Submitted 15 April, 2024;
originally announced April 2024.
-
Impact of Domain Knowledge and Multi-Modality on Intelligent Molecular Property Prediction: A Systematic Survey
Authors:
Taojie Kuang,
Pengfei Liu,
Zhixiang Ren
Abstract:
The precise prediction of molecular properties is essential for advancements in drug development, particularly in virtual screening and compound optimization. The recent introduction of numerous deep learning-based methods has shown remarkable potential in enhancing molecular property prediction (MPP), especially improving accuracy and insights into molecular structures. Yet, two critical questions arise: does the integration of domain knowledge augment the accuracy of molecular property prediction, and does employing multi-modal data fusion yield more precise results than single-data-source methods? To explore these matters, we comprehensively review and quantitatively analyze recent deep learning methods based on various benchmarks. We discover that integrating molecular information significantly improves MPP for both regression and classification tasks. Specifically, regression improvements, measured by reductions in root mean square error (RMSE), are up to 4.0%, while classification enhancements, measured by the area under the receiver operating characteristic curve (ROC-AUC), are up to 1.7%. We also discover that enriching 2D graphs with 1D SMILES boosts multi-modal learning performance for regression tasks by up to 9.1%, and augmenting 2D graphs with 3D information increases performance for classification tasks by up to 13.2%, with both enhancements measured using ROC-AUC. The two consolidated insights offer crucial guidance for future advancements in drug discovery.
Submitted 27 June, 2024; v1 submitted 11 February, 2024;
originally announced February 2024.
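The two metrics named in the abstract above can be computed as in the following minimal sketch. The function names are hypothetical, and the helper `relative_rmse_reduction` assumes that the reported percentages are relative changes against a baseline; the abstract does not state how they were derived.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between regression targets and predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def roc_auc(labels, scores):
    """ROC-AUC computed as the probability that a randomly chosen positive
    example receives a higher score than a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_rmse_reduction(baseline, improved):
    """Percentage reduction of RMSE relative to a baseline (an assumed
    definition, used here only for illustration)."""
    return 100.0 * (baseline - improved) / baseline
```

Lower RMSE is better while higher ROC-AUC is better, so an "improvement" is a reduction for the former and an increase for the latter.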
-
3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information
Authors:
Taojie Kuang,
Yiming Ren,
Zhixiang Ren
Abstract:
Molecular property prediction, crucial for early drug candidate screening and optimization, has seen advancements with deep learning-based methods. Although these methods have advanced considerably, they often fall short in fully leveraging 3D spatial information. Specifically, current molecular encoding techniques tend to inadequately extract spatial information, leading to ambiguous representations where a single one might represent multiple distinct molecules. Moreover, existing molecular modeling methods focus predominantly on the most stable 3D conformations, neglecting other viable conformations present in reality. To address these issues, we propose 3D-Mol, a novel approach designed for more accurate spatial structure representation. It deconstructs molecules into three hierarchical graphs to better extract geometric information. Additionally, 3D-Mol leverages contrastive learning for pretraining on 20 million unlabeled samples, treating their conformations with identical topological structures as weighted positive pairs and contrasting ones as negatives, based on the similarity of their 3D conformation descriptors and fingerprints. We compare 3D-Mol with various state-of-the-art baselines on 7 benchmarks and demonstrate its outstanding performance.
Submitted 27 June, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text
Authors:
Pengfei Liu,
Yiming Ren,
Jun Tao,
Zhixiang Ren
Abstract:
Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information in complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates Graph, Image, and Text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture that is capable of aligning all modalities into a unified latent space. We achieve a 5%-10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity compared to the baselines. With the any-to-language molecular translation strategy, our model has the potential to perform more downstream tasks, such as compound name recognition and chemical reaction prediction.
Submitted 6 February, 2024; v1 submitted 13 August, 2023;
originally announced August 2023.
-
Dynamic traversal of large gaps by insects and legged robots reveals a template
Authors:
Sean W. Gart,
Changxin Yan,
Ratan Othayoth,
Zhiyi Ren,
Chen Li
Abstract:
It is well known that animals can use neural and sensory feedback via vision, tactile sensing, and echolocation to negotiate obstacles. Similarly, most robots use deliberate or reactive planning to avoid obstacles, which relies on prior knowledge or high-fidelity sensing of the environment. However, during dynamic locomotion in complex, novel, 3-D terrains such as forest floor and building rubble, sensing and planning suffer from bandwidth limitations and large noise and are sometimes even impossible. Here, we study rapid locomotion over a large gap, a simple, ubiquitous obstacle, to begin to discover general principles of dynamic traversal of large 3-D obstacles. We challenged the discoid cockroach and an open-loop six-legged robot to traverse a large gap of varying length. Both the animal and the robot could dynamically traverse a gap as large as 1 body length by bridging the gap with their heads, but traversal probability decreased with gap length. Based on these observations, we developed a template that captured body dynamics well and quantitatively predicted traversal performance. Our template revealed that high approach speed, initial body pitch, and initial body pitch angular velocity facilitated dynamic traversal, and successfully predicted a new strategy of using body pitch control that increased the robot's maximal traversal gap length by 50%. Our study established the first template of dynamic locomotion beyond planar surfaces and is an important step in expanding terradynamics into complex 3-D terrains.
Submitted 1 November, 2019;
originally announced November 2019.
-
Competitive Exclusion in a DAE Model for Microbial Electrolysis Cells
Authors:
Harry J. Dudley,
Zhiyong Jason Ren,
David M. Bortz
Abstract:
Microbial electrolysis cells (MECs) employ electroactive bacteria to perform extracellular electron transfer, enabling hydrogen generation from biodegradable substrates. In previous work, we developed and analyzed a differential-algebraic equation (DAE) model for MECs. The model resembles a chemostat with ordinary differential equations (ODEs) for concentrations of substrate, microorganisms, and an extracellular mediator involved in electron transfer. There is also an algebraic constraint for electric current and hydrogen production. Our goal is to determine the outcome of competition between methanogenic archaea and electroactive bacteria, because only the latter contribute to electric current and resulting hydrogen production. We investigate asymptotic stability in two industrially relevant versions of the model. An important aspect of chemostat models is the principle of competitive exclusion -- only microbes which grow at the lowest substrate concentration will survive as $t\to\infty$. We show that if methanogens grow at the lowest substrate concentration, then the equilibrium corresponding to competitive exclusion by methanogens is globally asymptotically stable. The analogous result for electroactive bacteria is not necessarily true. We show that local asymptotic stability of exclusion by electroactive bacteria is not guaranteed, even in a simplified version of the model. In this case, even if electroactive bacteria can grow at the lowest substrate concentration, a few additional conditions are required to guarantee local asymptotic stability. We also provide numerical simulations supporting these arguments. Our results suggest operating conditions that are most conducive to success of electroactive bacteria and the resulting current and hydrogen production in MECs. This will help identify when methane production or electricity and hydrogen production are favored.
Submitted 6 July, 2020; v1 submitted 5 June, 2019;
originally announced June 2019.
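The competitive exclusion principle mentioned in the abstract above can be illustrated for a classical chemostat, which is a simplification and not the DAE model of the paper. The sketch assumes Monod growth, $\mu(S) = \mu_{\max} S / (K + S)$, and a common dilution rate $D$; the function names and all parameter values are hypothetical.

```python
def break_even(mu_max, K, D):
    """Substrate concentration at which Monod growth equals the dilution
    rate D, i.e. the solution of mu_max * S / (K + S) = D."""
    if mu_max <= D:
        # Growth can never match dilution: the species washes out.
        return float("inf")
    return K * D / (mu_max - D)


def predicted_winner(species, D):
    """species maps a name to (mu_max, K). Returns the species with the
    lowest break-even concentration, plus all break-even values.
    The winner is None if every species washes out."""
    lam = {name: break_even(mu, K, D) for name, (mu, K) in species.items()}
    winner = min(lam, key=lam.get)
    if lam[winner] == float("inf"):
        winner = None
    return winner, lam


# Illustrative, made-up parameters (per-day rates, arbitrary concentration units).
species = {"methanogen": (0.5, 2.0), "electroactive": (0.8, 1.0)}
winner, lam = predicted_winner(species, D=0.2)
```

In this classical setting the species with the lowest break-even concentration excludes the others. The abstract reports that in the MEC model this prediction carries over for methanogens, whereas exclusion by electroactive bacteria needs additional conditions, so the sketch should not be read as a prediction for that model.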
-
Nestedness in complex networks: Observation, emergence, and implications
Authors:
Manuel Sebastian Mariani,
Zhuo-Ming Ren,
Jordi Bascompte,
Claudio Juan Tessone
Abstract:
The observed architecture of ecological and socio-economic networks differs significantly from that of random networks. From a network science standpoint, non-random structural patterns observed in real networks call for an explanation of their emergence and an understanding of their potential systemic consequences. This article focuses on one of these patterns: nestedness. Given a network of interacting nodes, nestedness can be described as the tendency for nodes to interact with subsets of the interaction partners of better-connected nodes. Known for more than $80$ years in biogeography, nestedness has been found in systems as diverse as ecological mutualistic organizations, world trade, and inter-organizational relations, among many others. This review article focuses on three main pillars: the existing methodologies to observe nestedness in networks; the main theoretical mechanisms conceived to explain the emergence of nestedness in ecological and socio-economic networks; the implications of a nested topology of interactions for the stability and feasibility of a given interacting system. We survey results from various disciplines, including statistical physics, graph theory, ecology, and theoretical economics. Nestedness was found to emerge both in bipartite networks and, more recently, in unipartite ones; this review is the first comprehensive attempt to unify both streams of studies, usually disconnected from each other. We believe that this truly interdisciplinary endeavour -- while rooted in a complex systems perspective -- may inspire new models and algorithms whose realm of application will undoubtedly transcend disciplinary boundaries.
Submitted 18 May, 2019;
originally announced May 2019.
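The verbal definition of nestedness in the abstract above can be made concrete with a small sketch. It assumes one side of a bipartite network is given as a mapping from each node to its set of interaction partners. Both functions are hypothetical illustrations of the subset idea and are not one of the nestedness metrics surveyed in the review.

```python
from itertools import combinations


def is_perfectly_nested(adjacency):
    """True if, after sorting nodes by number of partners (descending),
    each node's partner set is a subset of the previous node's set.
    Checking consecutive pairs suffices because the subset relation
    is transitive."""
    ordered = sorted(adjacency.values(), key=len, reverse=True)
    return all(smaller <= larger for larger, smaller in zip(ordered, ordered[1:]))


def nested_pair_fraction(adjacency):
    """Fraction of node pairs in which one partner set contains the other;
    equals 1.0 for a perfectly nested network."""
    pairs = list(combinations(adjacency.values(), 2))
    if not pairs:
        return 1.0
    nested = sum(1 for a, b in pairs if a <= b or b <= a)
    return nested / len(pairs)


# Toy examples: node -> set of interaction partners.
nested = {"A": {1, 2, 3}, "B": {1, 2}, "C": {1}}
not_nested = {"A": {1, 2}, "B": {2, 3}, "C": {2}}
```

In `nested`, every less-connected node interacts only with partners of the better-connected ones, which is the pattern the definition describes; in `not_nested`, nodes A and B each have a partner the other lacks.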
-
Sensitivity and Bifurcation Analysis of a DAE Model for a Microbial Electrolysis Cell
Authors:
Harry J. Dudley,
Lu Lu,
Zhiyong Jason Ren,
David M. Bortz
Abstract:
Microbial electrolysis cells (MECs) are a promising new technology for producing hydrogen cheaply, efficiently, and sustainably. However, to scale up this technology, we need a better understanding of the processes in the devices. In this effort, we present a differential-algebraic equation (DAE) model of a microbial electrolysis cell with an algebraic constraint on current. We then perform sensitivity and bifurcation analysis for the DAE system. The model can be applied either to batch-cycle MECs or to continuous-flow MECs. We conduct differential-algebraic sensitivity analysis after fitting simulations to current density data for a batch-cycle MEC. The sensitivity analysis suggests which parameters have the greatest influence on the current density at particular times during the experiment. In particular, growth and consumption parameters for exoelectrogenic bacteria have a strong effect prior to the peak current density. An alternative strategy to maximizing peak current density is maintaining a long-term stable equilibrium with non-zero current density in a continuous-flow MEC. We characterize the minimum dilution rate required for a stable nonzero current equilibrium and demonstrate transcritical bifurcations in the dilution rate parameter that exchange stability between several curves of equilibria. Specifically, increasing the dilution rate transitions the system through three regimes where the stable equilibrium exhibits (i) competitive exclusion by methanogens, (ii) coexistence, and (iii) competitive exclusion by exoelectrogens. Positive long-term current production is only feasible in the final two regimes. These results suggest how to modify system parameters to increase peak current density in a batch-cycle MEC or to increase the long-term current density equilibrium value in a continuous-flow MEC.
Submitted 17 February, 2018;
originally announced February 2018.