-
EnzChemRED, a rich enzyme chemistry relation extraction dataset
Authors:
Po-Ting Lai,
Elisabeth Coudert,
Lucila Aimo,
Kristian Axelsen,
Lionel Breuza,
Edouard de Castro,
Marc Feuermann,
Anne Morgat,
Lucille Pourcel,
Ivo Pedruzzi,
Sylvain Poux,
Nicole Redaschi,
Catherine Rivoire,
Anastasia Sveshnikova,
Chih-Hsuan Wei,
Robert Leaman,
Ling Luo,
Zhiyong Lu,
Alan Bridge
Abstract:
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 scores of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best-performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.
Submitted 22 April, 2024;
originally announced April 2024.
-
ReLSO: A Transformer-based Model for Latent Space Optimization and Generation of Proteins
Authors:
Egbert Castro,
Abhinav Godavarthi,
Julian Rubinfien,
Kevin B. Givechian,
Dhananjay Bhaskar,
Smita Krishnaswamy
Abstract:
The development of powerful natural language models has increased the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed for the accumulation of large amounts of labeled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which features a highly structured latent space that is trained to jointly generate sequences as well as predict fitness. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal. Using ReLSO, we explicitly model the sequence-function landscape of large labeled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly available protein datasets, including variant sets of anti-ranibizumab and GFP. We observe a greater sequence optimization efficiency (increase in fitness per optimization step) by ReLSO compared to other approaches, where ReLSO more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly trained ReLSO models provide a potential avenue towards sequence-level fitness attribution information.
Submitted 31 May, 2022; v1 submitted 24 January, 2022;
originally announced January 2022.
-
Uncovering the Folding Landscape of RNA Secondary Structure with Deep Graph Embeddings
Authors:
Egbert Castro,
Andrew Benz,
Alexander Tong,
Guy Wolf,
Smita Krishnaswamy
Abstract:
Biomolecular graph analysis has recently gained much attention in the emerging field of geometric deep learning. Here we focus on organizing biomolecular graphs in ways that expose meaningful relations and variations between them. We propose a geometric scattering autoencoder (GSAE) network for learning such graph embeddings. Our embedding network first extracts rich graph features using the recently proposed geometric scattering transform. Then, it leverages a semi-supervised variational autoencoder to extract a low-dimensional embedding that retains the information in these features that enables prediction of molecular properties as well as characterization of graphs. We show that GSAE organizes RNA graphs both by structure and energy, accurately reflecting bistable RNA structures. Also, the model is generative and can sample new folding trajectories.
Submitted 28 March, 2022; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Effects of Visualizing Technical Debts on a Software Maintenance Project
Authors:
Ronivon Dias,
Pedro Neto,
Irvayne Ibiapina,
Guilherme Avelino,
Otavio Castro
Abstract:
The technical debt (TD) metaphor is widely used to encapsulate numerous software quality problems. It describes the trade-off between the short-term benefit of taking a shortcut during the design or implementation phase of a software product (for example, in order to meet a deadline) and the long-term consequences of taking said shortcut, which may affect the quality of the software product. TDs must be managed to guarantee software quality and to reduce maintenance and evolution costs. However, tools for TD detection usually report results only from the file perspective (classes and methods), which is not the usual view in project management. In this work, a technique is proposed to identify and visualize TD from a new perspective: software features. The proposed technique adopts Mining Software Repositories (MSR) tools to identify the software features and then the technical debts that affect these features. Additionally, we propose an approach to support maintenance tasks guided by TD visualization at the feature level, with the aim of evaluating its applicability to real software projects. The results indicate that the approach can be useful to reduce existing TDs, as well as to avoid the introduction of new TDs.
Submitted 18 November, 2019;
originally announced November 2019.
-
A Study on the Effect of Exit Widths and Crowd Sizes in the Formation of Arch in Clogged Crowds
Authors:
Francisco Enrique Vicente G. Castro,
Jaderick P. Pabico
Abstract:
The arching phenomenon is an emergent pattern formed by a $c$-sized crowd of intelligent, goal-oriented, autonomous, heterogeneous individuals moving towards a $w$-wide exit along a long $W$-wide corridor, where $W>w$. We collected empirical data from microsimulations to identify the combined effects of $c$ and $w$ on the time $T$ of the onset of the arch and on the size $S$ of its formation. The arch takes on the form of the perimeter of a half ellipse halved along the minor axis. We measured $S$ with respect to the lengths of the major ($M$) and minor ($m$) axes of the ellipse. The mathematical description of the formation of this phenomenon will be important information for the design of walkways to control and easily direct the flow of large crowds, especially during panic egress conditions.
Submitted 25 June, 2015;
originally announced June 2015.
-
Microsimulations of Arching, Clogging, and Bursty Exit Phenomena in Crowd Dynamics
Authors:
Francisco Enrique Vicente G. Castro,
Jaderick P. Pabico
Abstract:
We present in this paper the behavior of an artificial agent that is a member of a crowd. The behavior is based on social comparison theory, as well as on trajectory mapping towards an agent's goal considering the agent's field of vision. The crowd of artificial agents was able to exhibit arching, clogging, and bursty exit rates. We were also able to observe a new phenomenon we call double arching, which happens towards the end of the simulation, and whose onset is exhibited by a "calm" density graph within the exit passage. The density graph is usually bursty in this area. Because of these exhibited phenomena, we can use these agents with high confidence to perform microsimulation studies for modeling the behavior of humans and objects in very realistic ways.
Submitted 25 June, 2015;
originally announced June 2015.
-
Semantic Similarity Measures Applied to an Ontology for Human-Like Interaction
Authors:
Esperanza Albacete,
Javier Calle,
Elena Castro,
Dolores Cuadra
Abstract:
The focus of this paper is the calculation of similarity between two concepts from an ontology for a Human-Like Interaction system. In order to facilitate this calculation, a similarity function is proposed based on five dimensions (sort, compositional, essential, restrictive and descriptive) constituting the structure of ontological knowledge. The paper includes a proposal for computing a similarity function for each dimension of knowledge. The similarity values obtained are then weighted and aggregated to obtain a global similarity measure. In order to calculate the weights associated with each dimension, four training methods are proposed. The training methods differ in the element to fit: the user, concepts or pairs of concepts, and a hybrid approach. For evaluating the proposal, the knowledge base was fed from WordNet and extended by using a knowledge editing toolkit (Cognos). The evaluation of the proposal is carried out through the comparison of system responses with those given by human test subjects, both providing a measure of the soundness of the procedure and revealing ways in which the proposal may be improved.
Submitted 18 January, 2014;
originally announced January 2014.
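
The last abstract describes weighting and aggregating per-dimension similarity values into a global measure. A minimal sketch of one way this could look is given here, assuming a normalized weighted sum; the abstract does not state the exact aggregation formula, and the function name `global_similarity` and the dictionary layout are hypothetical, chosen only for illustration.

```python
from typing import Mapping

# The five dimensions of ontological knowledge named in the abstract.
DIMENSIONS = ("sort", "compositional", "essential", "restrictive", "descriptive")


def global_similarity(per_dimension: Mapping[str, float],
                      weights: Mapping[str, float]) -> float:
    """Aggregate per-dimension similarities into a single score.

    Assumption: the aggregation is a weighted sum normalized by the total
    weight. The paper may use a different formula.
    """
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[d] * per_dimension[d] for d in DIMENSIONS)
    return weighted / total_weight


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    sims = {"sort": 0.2, "compositional": 0.4, "essential": 0.6,
            "restrictive": 0.8, "descriptive": 1.0}
    equal = {d: 1.0 for d in DIMENSIONS}
    print(global_similarity(sims, equal))
```

In this sketch the weights are plain inputs; the four training methods mentioned in the abstract would be what produces them, and they are not modeled here.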