
EnzChemRED, a rich enzyme chemistry relation extraction dataset

Authors: Po-Ting Lai, Elisabeth Coudert, Lucila Aimo, Kristian Axelsen, Lionel Breuza, Edouard de Castro, Marc Feuermann, Anne Morgat, Lucille Pourcel, Ivo Pedruzzi, Sylvain Poux, Nicole Redaschi, Catherine Rivoire, Anastasia Sveshnikova, Chih-Hsuan Wei, Robert Leaman, Ling Luo, Zhiyong Lu, Alan Bridge

Abstract: Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 scores of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2201.09948 [pdf, other]

ReLSO: A Transformer-based Model for Latent Space Optimization and Generation of Proteins

Authors: Egbert Castro, Abhinav Godavarthi, Julian Rubinfien, Kevin B. Givechian, Dhananjay Bhaskar, Smita Krishnaswamy

Abstract: The development of powerful natural language models has increased the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed for the accumulation of large amounts of labeled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which features a highly structured latent space that is trained to jointly generate sequences as well as predict fitness. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal. Using ReLSO, we explicitly model the sequence-function landscape of large labeled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly available protein datasets, including variant sets of anti-ranibizumab and GFP. We observe a greater sequence optimization efficiency (increase in fitness per optimization step) by ReLSO compared to other approaches, where ReLSO more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly trained ReLSO models provide a potential avenue towards sequence-level fitness attribution information.

Submitted 31 May, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

arXiv:2006.06885 [pdf, other]

Uncovering the Folding Landscape of RNA Secondary Structure with Deep Graph Embeddings

Authors: Egbert Castro, Andrew Benz, Alexander Tong, Guy Wolf, Smita Krishnaswamy

Abstract: Biomolecular graph analysis has recently gained much attention in the emerging field of geometric deep learning. Here we focus on organizing biomolecular graphs in ways that expose meaningful relations and variations between them. We propose a geometric scattering autoencoder (GSAE) network for learning such graph embeddings. Our embedding network first extracts rich graph features using the recently proposed geometric scattering transform. Then, it leverages a semi-supervised variational autoencoder to extract a low-dimensional embedding that retains the information in these features that enables prediction of molecular properties as well as characterization of graphs. We show that GSAE organizes RNA graphs both by structure and energy, accurately reflecting bistable RNA structures. Also, the model is generative and can sample new folding trajectories.

Submitted 28 March, 2022; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: 10 pages, 10 figures, 4 tables, Presented at IEEE Big Data 2020

arXiv:1911.07565 [pdf, other]

Effects of Visualizing Technical Debts on a Software Maintenance Project

Authors: Ronivon Dias, Pedro Neto, Irvayne Ibiapina, Guilherme Avelino, Otavio Castro

Abstract: The technical debt (TD) metaphor is widely used to encapsulate numerous software quality problems. It describes the trade-off between the short-term benefit of taking a shortcut during the design or implementation phase of a software product (for example, in order to meet a deadline) and the long-term consequences of taking said shortcut, which may affect the quality of the software product. TDs must be managed to guarantee software quality and to reduce maintenance and evolution costs. However, tools for TD detection usually report results only from the file perspective (classes and methods), which is not the usual perspective in project management. In this work, a technique is proposed to identify and visualize TD from a new perspective: software features. The proposed technique adopts Mining Software Repository (MRS) tools to identify the software features and then the technical debts that affect these features. We also propose an approach to support maintenance tasks guided by TD visualization at the feature level, aiming to evaluate its applicability to real software projects. The results indicate that the approach can be useful for decreasing existing TDs, as well as for avoiding the introduction of new TDs.

Submitted 18 November, 2019; originally announced November 2019.

Comments: in Portuguese, accepted at the XVIII Brazilian Symposium on Software Quality (SBQS'19), October 28-November 1, 2019, Fortaleza, Brazil

arXiv:1506.08133 [pdf, ps, other]

A Study on the Effect of Exit Widths and Crowd Sizes in the Formation of Arch in Clogged Crowds

Authors: Francisco Enrique Vicente G. Castro, Jaderick P. Pabico

Abstract: The arching phenomenon is an emergent pattern formed by a $c$-sized crowd of intelligent, goal-oriented, autonomous, heterogeneous individuals moving towards a $w$-wide exit along a long $W$-wide corridor, where $W>w$. We collected empirical data from microsimulations to identify the combined effects of $c$ and $w$ on the time $T$ of the onset of the arch and on the size $S$ of the arch. The arch takes on the form of the perimeter of a half ellipse halved along the minor axis. We measured $S$ with respect to the lengths of the major $M$ and minor $m$ axes of the ellipse, respectively. The mathematical description of the formation of this phenomenon will be important information in the design of walkways to control and easily direct the flow of large crowds, especially during panic egress conditions.

Submitted 25 June, 2015; originally announced June 2015.

Comments: 9 pages, 5 figures, originally appeared in H.N. Adorna and A.L. Sioson (eds.) Proceedings of the 6th National Symposium on Mathematical Aspects of Computer Science (SMACS 2012), La Carmela de Boracay Convention Center, Boracay Island, Malay, Aklan, 04-08 December 2012, pp. 66-74. arXiv admin note: text overlap with arXiv:1506.07781

Journal ref: Philippine Computing Journal 8(1):21-29

arXiv:1506.07781 [pdf, ps, other]

Microsimulations of Arching, Clogging, and Bursty Exit Phenomena in Crowd Dynamics

Authors: Francisco Enrique Vicente G. Castro, Jaderick P. Pabico

Abstract: We present in this paper the behavior of an artificial agent who is a member of a crowd. The behavior is based on the social comparison theory, as well as the trajectory mapping towards an agent's goal considering the agent's field of vision. The crowd of artificial agents was able to exhibit arching, clogging, and bursty exit rates. We were also able to observe a new phenomenon we called double arching, which happens towards the end of the simulation, and whose onset is exhibited by a "calm" density graph within the exit passage. The density graph is usually bursty in this area. Because of these exhibited phenomena, we can use these agents with high confidence to perform microsimulation studies for modeling the behavior of humans and objects in very realistic ways.

Submitted 25 June, 2015; originally announced June 2015.

Comments: 6 pages, 6 figures, original paper appeared in Proceedings of the 10th National Conference on Information Technology Education (NCITE 2012), Laoag City, Ilocos Norte, Philippines, 18-20 October 2012. (ISSN 2012-0761)

Journal ref: Philippine Information Technology Journal 6(1):11-16

arXiv:1401.4603 [pdf]

doi 10.1613/jair.3612

Semantic Similarity Measures Applied to an Ontology for Human-Like Interaction

Authors: Esperanza Albacete, Javier Calle, Elena Castro, Dolores Cuadra

Abstract: The focus of this paper is the calculation of similarity between two concepts from an ontology for a Human-Like Interaction system. In order to facilitate this calculation, a similarity function is proposed based on five dimensions (sort, compositional, essential, restrictive and descriptive) constituting the structure of ontological knowledge. The paper includes a proposal for computing a similarity function for each dimension of knowledge. Later on, the similarity values obtained are weighted and aggregated to obtain a global similarity measure. In order to calculate the weights associated with each dimension, four training methods have been proposed. The training methods differ in the element to fit: the user, concepts or pairs of concepts, and a hybrid approach. For evaluating the proposal, the knowledge base was fed from WordNet and extended by using a knowledge editing toolkit (Cognos). The evaluation of the proposal is carried out through the comparison of system responses with those given by human test subjects, both providing a measure of the soundness of the procedure and revealing ways in which the proposal may be improved.

Submitted 18 January, 2014; originally announced January 2014.

Journal ref: Journal Of Artificial Intelligence Research, Volume 44, pages 397-421, 2012

Showing 1–7 of 7 results for author: Castro, E