-
ReactXT: Understanding Molecular "Reaction-ship" via Reaction-Contextualized Molecule-Text Pretraining
Authors:
Zhiyuan Liu,
Yaorui Shi,
An Zhang,
Sihang Li,
Enzhi Zhang,
Xiang Wang,
Kenji Kawaguchi,
Tat-Seng Chua
Abstract:
Molecule-text modeling, which aims to facilitate molecule-relevant tasks with a textual interface and textual knowledge, is an emerging research direction. Beyond single molecules, studying reaction-text modeling holds promise for helping the synthesis of new materials and drugs. However, previous works mostly neglect reaction-text modeling: they primarily focus on modeling individual molecule-text pairs or learning chemical reactions without texts in context. Additionally, one key task of reaction-text modeling -- experimental procedure prediction -- is less explored due to the absence of an open-source dataset. The task is to predict step-by-step actions of conducting chemical experiments and is crucial to automating chemical synthesis. To resolve the challenges above, we propose a new pretraining method, ReactXT, for reaction-text modeling, and a new dataset, OpenExp, for experimental procedure prediction. Specifically, ReactXT features three types of input contexts to incrementally pretrain LMs. Each of the three input contexts corresponds to a pretraining task to improve the text-based understanding of either reactions or single molecules. ReactXT demonstrates consistent improvements in experimental procedure prediction and molecule captioning and offers competitive results in retrosynthesis. Our code is available at https://github.com/syr-cn/ReactXT.
Submitted 23 May, 2024;
originally announced May 2024.
-
ProtT3: Protein-to-Text Generation for Text-based Protein Understanding
Authors:
Zhiyuan Liu,
An Zhang,
Hao Fei,
Enzhi Zhang,
Xiang Wang,
Kenji Kawaguchi,
Tat-Seng Chua
Abstract:
Language Models (LMs) excel in understanding textual descriptions of proteins, as evident in biomedical question-answering tasks. However, their capability falters with raw protein data, such as amino acid sequences, due to a deficit in pretraining on such data. Conversely, Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts. To address their limitations, we introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding. ProtT3 empowers an LM to understand protein sequences of amino acids by incorporating a PLM as its protein understanding module, enabling effective protein-to-text generation. This collaboration between PLM and LM is facilitated by a cross-modal projector (i.e., Q-Former) that bridges the modality gap between the PLM's representation space and the LM's input space. Unlike previous studies focusing on protein property prediction and protein-text retrieval, we delve into the largely unexplored field of protein-to-text generation. To facilitate comprehensive benchmarks and promote future research, we establish quantitative evaluations for protein-text modeling tasks, including protein captioning, protein question-answering, and protein-text retrieval. Our experiments show that ProtT3 substantially surpasses current baselines, with ablation studies further highlighting the efficacy of its core components. Our code is available at https://github.com/acharkq/ProtT3.
Submitted 21 May, 2024;
originally announced May 2024.
-
Towards 3D Molecule-Text Interpretation in Language Models
Authors:
Sihang Li,
Zhiyuan Liu,
Yanchen Luo,
Xiang Wang,
Xiangnan He,
Kenji Kawaguchi,
Tat-Seng Chua,
Qi Tian
Abstract:
Language Models (LMs) have greatly influenced diverse domains. However, their inherent limitation in comprehending 3D molecular structures has considerably constrained their potential in the biomolecular domain. To bridge this gap, we focus on 3D molecule-text interpretation, and propose 3D-MoLM: 3D-Molecular Language Modeling. Specifically, 3D-MoLM enables an LM to interpret and analyze 3D molecules by equipping the LM with a 3D molecular encoder. This integration is achieved by a 3D molecule-text projector, bridging the 3D molecular encoder's representation space and the LM's input space. Moreover, to enhance 3D-MoLM's ability of cross-modal molecular understanding and instruction following, we meticulously curated a 3D molecule-centric instruction tuning dataset -- 3D-MoIT. Through 3D molecule-text alignment and 3D molecule-centric instruction tuning, 3D-MoLM establishes an integration of 3D molecular encoder and LM. It significantly surpasses existing baselines on downstream tasks, including molecule-text retrieval, molecule captioning, and more challenging open-text molecular QA tasks, especially focusing on 3D-dependent properties. We release our codes and datasets at https://github.com/lsh0520/3D-MoLM.
Submitted 17 March, 2024; v1 submitted 24 January, 2024;
originally announced January 2024.
-
Predicting heteropolymer interactions: demixing and hypermixing of disordered protein sequences
Authors:
Kyosuke Adachi,
Kyogo Kawaguchi
Abstract:
Cells contain multiple condensates which spontaneously form due to the heterotypic interactions between their components. Although the proteins and disordered region sequences that are responsible for condensate formation have been extensively studied, the rule of interactions between the components that allow demixing, i.e., the coexistence of multiple condensates, is yet to be elucidated. Here we construct an effective theory of the interaction between heteropolymers by fitting it to the molecular dynamics simulation results obtained for more than 200 sequences sampled from the disordered regions of human proteins. We find that the sum of amino acid pair interactions across two heteropolymers predicts the Boyle temperature qualitatively well, which can be quantitatively improved by the dimer pair approximation, where we incorporate the effect of neighboring amino acids in the sequences. The improved theory, combined with the finding of a metric that captures the effective interaction strength between distinct sequences, allowed the selection of up to three disordered region sequences that demix with each other in multicomponent simulations, as well as the generation of artificial sequences that demix with a given sequence. The theory points to a generic sequence design strategy to demix or hypermix thanks to the low dimensional nature of the space of the interactions that we identify. As a consequence of the geometric arguments in the space of interactions, we find that the number of distinct sequences that can demix with each other is strongly constrained, irrespective of the choice of the coarse-grained model. Altogether, we construct a theoretical basis for methods to estimate the effective interaction between heteropolymers, which can be utilized in predicting phase separation properties as well as rules of assignment in the localization and functions of disordered proteins.
Submitted 22 June, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
-
Universal voter model emergence in genetically labeled homeostatic tissues
Authors:
Hiroki Yamaguchi,
Kyogo Kawaguchi
Abstract:
Recent experiments in adult mammalian tissues have found scaling relations of the voter model in the dynamics of the genetically labeled population of stem cells. Yet, the reason for this seemingly robust appearance of the voter model remains unexplained. Here we show that the voter model kinetics is indeed a generic behavior that arises at macroscale in a linearly stable homeostatic tissue undergoing turnover. Starting from the continuum model of a multicellular system, we show that the dynamics of the labeled cell population converges to the voter model kinetics at large spatio-temporal scales of observation. We present a method to calculate the length scale and time scale of coarse-graining that are required to obtain the effective voter model dynamics, and apply it to the growth factor competition model and the pairwise mechanical interaction model.
Submitted 7 March, 2019;
originally announced March 2019.
-
Topological defect launches 3D mound in the active nematic sheet of neural progenitors
Authors:
Kyogo Kawaguchi,
Ryoichiro Kageyama,
Masaki Sano
Abstract:
Cultured stem cells have become a standard platform not only for regenerative medicine and developmental biology but also for biophysical studies. Yet, the characterization of cultured stem cells at the level of morphology and macroscopic patterns resulting from cell-to-cell interactions remains largely qualitative, even though they are the simplest features observed in everyday experiments. Here we report that neural progenitor cells (NPCs), which are multipotent stem cells that give rise to cells in the central nervous system, rapidly glide and stochastically reverse their velocity while locally aligning with neighboring cells, thus showing features of an active nematic system. Within the two-dimensional nematic pattern, we find interspaced topological defects with +1/2 and -1/2 charges. Remarkably, we identified rapid cell accumulation leading to three-dimensional mounds at the +1/2 topological defects. Single-cell level imaging around the defects allowed quantification of the evolving cell density, clarifying that cells not only concentrate at +1/2 defects but also escape from -1/2 defects. We propose the mechanism of instability around the defects as the interplay between the anisotropic friction and the active force field, thus addressing a novel universal mechanism for local cell density control.
Submitted 20 May, 2016;
originally announced May 2016.
-
Dynamical Crossover in a Stochastic Model of Cell Fate Decision
Authors:
Hiroki Yamaguchi,
Kyogo Kawaguchi,
Takahiro Sagawa
Abstract:
We study the asymptotic behaviors of stochastic cell fate decision between proliferation and differentiation. We propose a model of a self-replicating Langevin system, where cells choose their fate (i.e., proliferation or differentiation) depending on local cell density. Based on this model, we propose a scenario for multi-cellular organisms to maintain the density of cells (i.e., homeostasis) through finite-ranged cell-cell interactions. Furthermore, we numerically show that the distribution of the number of descendant cells changes over time, thus unifying the two previously proposed models regarding homeostasis: the critical birth-death process and the voter model. Our results provide a general platform for the study of stochastic cell fate decision in terms of nonequilibrium statistical mechanics.
Submitted 22 May, 2017; v1 submitted 12 April, 2016;
originally announced April 2016.