
Benchmarking Large Language Models for Molecule Prediction Tasks

Authors: Zhiqiang Zhong, Kuangyu Zhou, Davide Mottin

Abstract: Large Language Models (LLMs) stand at the forefront of a number of Natural Language Processing (NLP) tasks. Despite the widespread adoption of LLMs in NLP, much of their potential in broader fields remains largely unexplored, and significant limitations persist in their design and implementation. Notably, LLMs struggle with structured data, such as graphs, and often falter when tasked with answering domain-specific questions requiring deep expertise, such as those in biology and chemistry. In this paper, we explore a fundamental question: Can LLMs effectively handle molecule prediction tasks? Rather than pursuing top-tier performance, our goal is to assess how LLMs can contribute to diverse molecule tasks. We identify several classification and regression prediction tasks across six standard molecule datasets. Subsequently, we carefully design a set of prompts to query LLMs on these tasks and compare their performance with existing Machine Learning (ML) models, which include text-based models and those specifically designed for analysing the geometric structure of molecules. Our investigation reveals several key insights: Firstly, LLMs generally lag behind ML models in achieving competitive performance on molecule tasks, particularly when compared to models adept at capturing the geometric structure of molecules, highlighting the constrained ability of LLMs to comprehend graph data. Secondly, LLMs show promise in enhancing the performance of ML models when used collaboratively. Lastly, we engage in a discourse regarding the challenges and promising avenues to harness LLMs for molecule prediction tasks. The code and models are available at https://github.com/zhiqiangzhongddu/LLMaMol.

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2402.13418 [pdf, other]

Efficiently Predicting Mutational Effect on Homologous Proteins by Evolution Encoding

Authors: Zhiqiang Zhong, Davide Mottin

Abstract: Predicting protein properties is paramount for biological and medical advancements. Current protein engineering mutates on a typical protein, called the wild-type, to construct a family of homologous proteins and study their properties. Yet, existing methods easily neglect subtle mutations, failing to capture the effect on the protein properties. To this end, we propose EvolMPNN, Evolution-aware Message Passing Neural Network, an efficient model to learn evolution-aware protein embeddings. EvolMPNN samples sets of anchor proteins, computes evolutionary information by means of residues and employs a differentiable evolution-aware aggregation scheme over these sampled anchors. This way, EvolMPNN can efficiently utilise a novel message-passing method to capture the mutation effect on proteins with respect to the anchor proteins. Afterwards, the aggregated evolution-aware embeddings are integrated with sequence embeddings to generate final comprehensive protein embeddings. Our model shows up to 6.4% better than state-of-the-art methods and attains 36X inference speedup in comparison with large pre-trained models. Code and models are available at https://github.com/zhiqiangzhongddu/EvolMPNN.

Submitted 25 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

arXiv:2309.09984 [pdf]

BDEC:Brain Deep Embedded Clustering model

Authors: Xiaoxiao Ma, Chunzhi Yi, Zhicai Zhong, Hui Zhou, Baichun Wei, Haiqi Zhu, Feng Jiang

Abstract: An essential premise for neuroscience brain network analysis is the successful segmentation of the cerebral cortex into functionally homogeneous regions. Resting-state functional magnetic resonance imaging (rs-fMRI), capturing the spontaneous activities of the brain, provides the potential for cortical parcellation. Previous parcellation methods can be roughly categorized into three groups, mainly employing either local gradient, global similarity, or a combination of both. The traditional clustering algorithms, such as "K-means" and "Spectral clustering" may affect the reproducibility or the biological interpretation of parcellations; The region growing-based methods influence the expression of functional homogeneity in the brain at a large scale; The parcellation method based on probabilistic graph models inevitably introduce model assumption biases. In this work, we develop an assumption-free model called as BDEC, which leverages the robust data fitting capability of deep learning. To the best of our knowledge, this is the first study that uses deep learning algorithm for rs-fMRI-based parcellation. By comparing with nine commonly used brain parcellation methods, the BDEC model demonstrates significantly superior performance in various functional homogeneity indicators. Furthermore, it exhibits favorable results in terms of validity, network analysis, task homogeneity, and generalization capability. These results suggest that the BDEC parcellation captures the functional characteristics of the brain and holds promise for future voxel-wise brain network analysis in the dimensionality reduction of fMRI data.

Submitted 11 September, 2023; originally announced September 2023.

arXiv:2306.07652 [pdf]

Inactivated COVID-19 Vaccination did not affect In vitro fertilization (IVF) / Intra-Cytoplasmic Sperm Injection (ICSI) cycle outcomes

Authors: Qi Wan, Ying Ling Yao, XingYu Lv, Li Hong Geng, Yue Wang, Enoch Appiah Adu-Gyamfi, Xue Jiao Wang, Yue Qian, Juan Yang, Ming Xing Chend, Zhao Hui Zhong, Yuan Li, Yu Bin Ding

Abstract: Background: The objective of this study is to evaluate the impact of COVID-19 inactivated vaccine administration on the outcomes of in vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI) cycles in infertile couples in China. Methods: We collected data from the CYART prospective cohort, which included couples undergoing IVF treatment from January 2021 to September 2022 at Sichuan **xin Xinan Women & Children's Hospital. Based on whether they received vaccination before ovarian stimulation, the couples were divided into the vaccination group and the non-vaccination group. We compared the laboratory parameters and pregnancy outcomes between the two groups. Findings: After performing propensity score matching (PSM), the analysis demonstrated similar clinical pregnancy rates, biochemical pregnancy and ongoing pregnancy rates between vaccinated and unvaccinated women. No significant disparities were found in terms of embryo development and laboratory parameters among the groups. Moreover, male vaccination had no impact on patient performance or pregnancy outcomes in assisted reproductive technology treatments. Additionally, there were no significant differences observed in the effects of vaccination on embryo development and pregnancy outcomes among couples undergoing ART. Interpretation: The findings suggest that COVID-19 vaccination did not have a significant effect on patients undergoing IVF/ICSI with fresh embryo transfer. Therefore, it is recommended that couples should receive COVID-19 vaccination as scheduled to help mitigate the COVID-19 pandemic.

Submitted 13 June, 2023; originally announced June 2023.

Comments: 26 pages, 4 figures and 5 tables

arXiv:2301.05864 [pdf, other]

Recent advances in artificial intelligence for retrosynthesis

Authors: Zipeng Zhong, Jie Song, Zunlei Feng, Tiantao Liu, Lingxiang Jia, Shaolun Yao, Tingjun Hou, Mingli Song

Abstract: Retrosynthesis is the cornerstone of organic chemistry, providing chemists in material and drug manufacturing access to poorly available and brand-new molecules. Conventional rule-based or expert-based computer-aided synthesis has obvious limitations, such as high labor costs and limited search space. In recent years, dramatic breakthroughs driven by artificial intelligence have revolutionized retrosynthesis. Here we aim to present a comprehensive review of recent advances in AI-based retrosynthesis. For single-step and multi-step retrosynthesis both, we first list their goal and provide a thorough taxonomy of existing methods. Afterwards, we analyze these methods in terms of their mechanism and performance, and introduce popular evaluation metrics for them, in which we also provide a detailed comparison among representative methods on several public datasets. In the next part we introduce popular databases and established platforms for retrosynthesis. Finally, this review concludes with a discussion about promising research directions in this field.

Submitted 14 January, 2023; originally announced January 2023.

Comments: 27 pages, 6 figures, 4 tables

arXiv:2203.11444 [pdf, other]

doi 10.1039/D2SC02763A

Root-aligned SMILES: A Tight Representation for Chemical Reaction Prediction

Authors: Zipeng Zhong, Jie Song, Zunlei Feng, Tiantao Liu, Lingxiang Jia, Shaolun Yao, Min Wu, Tingjun Hou, Mingli Song

Abstract: Chemical reaction prediction, involving forward synthesis and retrosynthesis prediction, is a fundamental problem in organic synthesis. A popular computational paradigm formulates synthesis prediction as a sequence-to-sequence translation problem, where the typical SMILES is adopted for molecule representations. However, the general-purpose SMILES neglects the characteristics of chemical reactions, where the molecular graph topology is largely unaltered from reactants to products, resulting in the suboptimal performance of SMILES if straightforwardly applied. In this article, we propose the root-aligned SMILES (R-SMILES), which specifies a tightly aligned one-to-one mapping between the product and the reactant SMILES for more efficient synthesis prediction. Due to the strict one-to-one mapping and reduced edit distance, the computational model is largely relieved from learning the complex syntax and dedicated to learning the chemical knowledge for reactions. We compare the proposed R-SMILES with various state-of-the-art baselines and show that it significantly outperforms them all, demonstrating the superiority of the proposed method.

Submitted 12 August, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

Comments: Chemical Science 2022. Main paper: 16 pages, 5 figures, and 6 tables; supplementary information: 8 pages, 5 figures and 3 tables. Code repository: https://github.com/otori-bird/retrosynthesis

arXiv:1903.06917 [pdf, other]

Molecular Polar Belief Propagation Decoder and Successive Cancellation Decoder

Authors: Zhiwei Zhong, Lulu Ge, Zaichen Zhang, Xiaohu You, Chuan Zhang

Abstract: By constructing chemical reaction networks (CRNs), this paper proposes a method of synthesizing polar decoder using belief propagation (BP) algorithm and successive cancellation (SC) algorithm, respectively. Theoretical analysis and simulation results have validated the feasibility of the method. Reactions in the proposed design could be experimentally implemented with DNA strand displacement reactions, making the proposed polar decoders promising for wide application in nanoscale devices.

Submitted 16 March, 2019; originally announced March 2019.

Comments: This paper was first submitted to GLOBECOM 2018

arXiv:1807.02010 [pdf, other]

DNA Computing for Combinational Logic

Authors: Chuan Zhang, Lulu Ge, Yuchen Zhuang, Ziyuan Shen, Zhiwei Zhong, Zaichen Zhang, Xiaohu You

Abstract: With the progressive scale-down of semiconductor's feature size, people are looking forward to More Moore and More than Moore. In order to offer a possible alternative implementation process, people are trying to figure out a feasible transfer from silicon to molecular computing. Such transfer lies on bio-based modules programming with computer-like logic, aiming at realizing the Turing machine. To accomplish this, the DNA-based combinational logic is inevitably the first step we have taken care of. This timely overview paper introduces combinational logic synthesized in DNA computing from both analog and digital perspectives separately. State-of-the-art research progress is summarized for interested readers to quick understand DNA computing, initiate discussion on existing techniques and inspire innovation solutions. We hope this paper can pave the way for the future DNA computing synthesis.

Submitted 5 July, 2018; originally announced July 2018.

arXiv:1710.04173 [pdf, other]

Structural Stability of Lexical Semantic Spaces: Nouns in Chinese and French

Authors: Sabine Ploux, Rui Wang, ZhengFeng Zhong, Hai Zhao, Yang Xin, Bao-Liang Lu

Abstract: Many studies in the neurosciences have dealt with the semantic processing of words or categories, but few have looked into the semantic organization of the lexicon thought as a system. The present study was designed to try to move towards this goal, using both electrophysiological and corpus-based data, and to compare two languages from different families: French and Mandarin Chinese. We conducted an EEG-based semantic-decision experiment using 240 words from eight categories (clothing, parts of a house, tools, vehicles, fruits/vegetables, animals, body parts, and people) as the material. A data-analysis method (correspondence analysis) commonly used in computational linguistics was applied to the electrophysiological signals. The present cross-language comparison indicated stability for the following aspects of the languages' lexical semantic organizations: (1) the living/nonliving distinction, which showed up as a main factor for both languages; (2) greater dispersion of the living categories as compared to the nonliving ones; (3) prototypicality of the \emph{animals} category within the living categories, and with respect to the living/nonliving distinction; and (4) the existence of a person-centered reference gradient. Our electrophysiological analysis indicated stability of the networks at play in each of these processes. Stability was also observed in the data taken from word usage in the languages (synonyms and associated words obtained from textual corpora).

Submitted 11 October, 2017; originally announced October 2017.

Comments: 17 pages, 4 figures

arXiv:1607.01384 [pdf]

SMISS: A protein function prediction server by integrating multiple sources

Authors: Renzhi Cao, Zhaolong Zhong, Jianlin Cheng

Abstract: SMISS is a novel web server for protein function prediction. Three different predictors can be selected for different usage. It integrates different sources to improve the protein function prediction accuracy, including the query protein sequence, protein-protein interaction network, gene-gene interaction network, and the rules mined from protein function associations. SMISS automatically switch to ab initio protein function prediction based on the query sequence when there is no homologs in the database. It takes fasta format sequences as input, and several sequences can submit together without influencing the computation speed too much. PHP and Perl are two primary programming language used in the server. The CodeIgniter MVC PHP web framework and Bootstrap front-end framework are used for building the server. It can be used in different platforms in standard web browser, such as Windows, Mac OS X, Linux, and iOS. No plugins are needed for our website. Availability: http://tulip.rnet.missouri.edu/profunc/.

Submitted 21 March, 2016; originally announced July 2016.

Comments: 13 pages, 7 figures

arXiv:1601.00891 [pdf, other]

doi 10.1186/s13059-016-1037-6

An expanded evaluation of protein function prediction methods shows an improvement in accuracy

Authors: Yuxiang Jiang, Tal Ronnen Oron, Wyatt T Clark, Asma R Bankapur, Daniel D'Andrea, Rosalba Lepore, Christopher S Funk, Indika Kahanda, Karin M Verspoor, Asa Ben-Hur, Emily Koo, Duncan Penfold-Brown, Dennis Shasha, Noah Youngs, Richard Bonneau, Alexandra Lin, Sayed ME Sahraeian, Pier Luigi Martelli, Giuseppe Profiti, Rita Casadio, Renzhi Cao, Zhaolong Zhong, Jianlin Cheng, Adrian Altenhoff, Nives Skunca, et al. (122 additional authors not shown)

Abstract: Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to the discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our ability to understand the molecular underpinnings of life is the assignment of function to biological macromolecules, especially proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, accurately assessing methods for protein function prediction and tracking progress in the field remain challenging. Methodology: We have conducted the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2. Conclusions: The top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. This increased accuracy can be attributed to the combined effect of the growing number of experimental annotations and improved methods for function prediction.

Submitted 2 January, 2016; originally announced January 2016.

Comments: Submitted to Genome Biology

Showing 1–11 of 11 results for author: Zhong, Z