Search | arXiv e-print repository

GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning

Authors: Haiteng Zhao, Shengchao Liu, Chang Ma, Hannan Xu, Jie Fu, Zhi-Hong Deng, Lingpeng Kong, Qi Liu

Abstract: Molecule property prediction has gained significant attention in recent years. The main bottleneck is the label insufficiency caused by expensive lab experiments. In order to alleviate this issue and to better leverage textual knowledge for tasks, this study investigates the feasibility of employing natural language instructions to accomplish molecule-related tasks in a zero-shot setting. We discover that existing molecule-text models perform poorly in this setting due to inadequate treatment of instructions and limited capacity for graphs. To overcome these issues, we propose GIMLET, which unifies language models for both graph and text data. By adopting generalized position embedding, our model is extended to encode both graph structures and instruction text without additional graph encoding modules. GIMLET also decouples encoding of the graph from task instructions in the attention mechanism, enhancing the generalization of graph features across novel tasks. We construct a dataset consisting of more than two thousand molecule tasks with corresponding instructions derived from task descriptions. We pretrain GIMLET on the molecule tasks along with instructions, enabling the model to transfer effectively to a broad range of tasks. Experimental results demonstrate that GIMLET significantly outperforms molecule-text baselines in instruction-based zero-shot learning, even achieving results close to those of supervised GNN models on tasks such as ToxCast and MUV.

Submitted 22 October, 2023; v1 submitted 28 May, 2023; originally announced June 2023.

arXiv:2303.10657 [pdf]

STGIC: a graph and image convolution-based method for spatial transcriptomic clustering

Authors: Chen Zhang, Junhui Gao, Lingxin Kong, Guangshuo Cao, Xiangyu Guo, Wei Liu

Abstract: Spatial transcriptomic (ST) clustering employs spatial and transcription information to group spots that are spatially coherent and transcriptionally similar into the same spatial domain. Graph convolution network (GCN) and graph attention network (GAT), fed with an adjacency matrix derived from spatial coordinates and a feature matrix derived from transcription profiles, are often used to solve the problem. Our proposed method STGIC (spatial transcriptomic clustering with graph and image convolution) utilizes an adaptive graph convolution (AGC) to get high-quality pseudo-labels and then resorts to a dilated convolution framework (DCF) for a virtual image converted from gene expression information and spatial coordinates of spots. The dilation rates and kernel sizes are set appropriately, and the updating of weight values in the kernels is made subject to the spatial distance from the position of corresponding elements to kernel centers, so that feature extraction of each spot is better guided by spatial distance to neighbor spots. Self-supervision realized by KL-divergence, spatial continuity loss and cross entropy calculated among spots with high-confidence pseudo-labels make up the training objective of DCF. STGIC attains state-of-the-art (SOTA) clustering performance on the benchmark dataset of human dorsolateral prefrontal cortex (DLPFC). In addition, it is capable of depicting fine structures of other tissues from other species as well as guiding the identification of marker genes. STGIC is also expandable to Stereo-seq data with high spatial resolution.

Submitted 23 October, 2023; v1 submitted 19 March, 2023; originally announced March 2023.

Comments: Major revision has been made to generate the current version as follows: 1. Writing style has been thoroughly changed. 2. Four more datasets have been added. 3. Contrastive learning has been removed since it doesn't make a significant difference to the performance. 4. Two more authors have been added.

arXiv:2302.12563 [pdf, other]

Retrieved Sequence Augmentation for Protein Representation Learning

Authors: Chang Ma, Haiteng Zhao, Lin Zheng, Jiayi Xin, Qintong Li, Lijun Wu, Zhihong Deng, Yang Lu, Qi Liu, Lingpeng Kong

Abstract: Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in functions and structures, and current state-of-the-art models including the latest version of AlphaFold rely on Multiple Sequence Alignments (MSA) to feed in the evolutionary knowledge. Despite their success, heavy computational overheads, as well as the de novo and orphan proteins, remain great challenges in protein representation learning. In this work, we show that MSA-augmented models inherently belong to retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links query protein sequences to a set of sequences with similar structures or properties in the database and combines these sequences for downstream prediction. We show that protein language models benefit from the retrieval enhancement on both structure prediction and property prediction tasks, with a 5% improvement on MSA Transformer on average while being 373 times faster. In addition, we show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available on https://github.com/HKUNLP/RSA.

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2011.10597 [pdf, other]

Synchronization within synchronization: transients and intermittency in ecological networks

Authors: Huawei Fan, Ling-Wei Kong, Xingang Wang, Alan Hastings, Ying-Cheng Lai

Abstract: Transients are fundamental to ecological systems with significant implications for management, conservation, and biological control. We uncover a type of transient synchronization behavior in spatial ecological networks whose local dynamics are of the chaotic, predator-prey type. In the parameter regime where there is phase synchronization among all the patches, complete synchronization (i.e., synchronization in both phase and amplitude) can arise in certain pairs of patches as determined by the network symmetry - henceforth the phenomenon of "synchronization within synchronization." Distinct patterns of complete synchronization coexist but, due to intrinsic instability or noise, each pattern is a transient and there is random, intermittent switching among the patterns in the course of time evolution. The probability distribution of the transient time is found to follow an algebraic scaling law with a divergent average transient lifetime. Based on symmetry considerations, we develop a stability analysis to understand these phenomena. The general principle of symmetry can also be exploited to explain previously discovered, counterintuitive synchronization behaviors in ecological networks.

Submitted 20 November, 2020; originally announced November 2020.

Comments: 17 pages, 7 figures

Showing 1–4 of 4 results for author: Kong, L