-
Graph Neural Networks for Microbial Genome Recovery
Authors:
Andre Lamurias,
Alessandro Tibo,
Katja Hose,
Mads Albertsen,
Thomas Dyhre Nielsen
Abstract:
Microbes have a profound impact on our health and environment, but our understanding of the diversity and function of microbial communities is severely limited. Through DNA sequencing of microbial communities (metagenomics), DNA fragments (reads) of the individual microbes can be obtained, which through assembly graphs can be combined into long contiguous DNA sequences (contigs). Given the complex…
▽ More
Microbes have a profound impact on our health and environment, but our understanding of the diversity and function of microbial communities is severely limited. Through DNA sequencing of microbial communities (metagenomics), DNA fragments (reads) of the individual microbes can be obtained, which through assembly graphs can be combined into long contiguous DNA sequences (contigs). Given the complexity of microbial communities, single contig microbial genomes are rarely obtained. Instead, contigs are eventually clustered into bins, with each bin ideally making up a full genome. This process is referred to as metagenomic binning.
Current state-of-the-art techniques for metagenomic binning rely only on the local features for the individual contigs. These techniques therefore fail to exploit the similarities between contigs as encoded by the assembly graph, in which the contigs are organized. In this paper, we propose to use Graph Neural Networks (GNNs) to leverage the assembly graph when learning contig representations for metagenomic binning. Our method, VaeG-Bin, combines variational autoencoders for learning latent representations of the individual contigs, with GNNs for refining these representations by taking into account the neighborhood structure of the contigs in the assembly graph. We explore several types of GNNs and demonstrate that VaeG-Bin recovers more high-quality genomes than other state-of-the-art binners on both simulated and real-world datasets.
△ Less
Submitted 26 April, 2022;
originally announced April 2022.
-
Priberam Labs at the NTCIR-15 SHINRA2020-ML: Classification Task
Authors:
Ruben Cardoso,
Afonso Mendes,
Andre Lamurias
Abstract:
Wikipedia is an online encyclopedia available in 285 languages. It composes an extremely relevant Knowledge Base (KB), which could be leveraged by automatic systems for several purposes. However, the structure and organisation of such information are not prone to automatic parsing and understanding and it is, therefore, necessary to structure this knowledge. The goal of the current SHINRA2020-ML t…
▽ More
Wikipedia is an online encyclopedia available in 285 languages. It composes an extremely relevant Knowledge Base (KB), which could be leveraged by automatic systems for several purposes. However, the structure and organisation of such information are not prone to automatic parsing and understanding and it is, therefore, necessary to structure this knowledge. The goal of the current SHINRA2020-ML task is to leverage Wikipedia pages in order to categorise their corresponding entities across 268 hierarchical categories, belonging to the Extended Named Entity (ENE) ontology. In this work, we propose three distinct models based on the contextualised embeddings yielded by Multilingual BERT. We explore the performances of a linear layer with and without explicit usage of the ontology's hierarchy, and a Gated Recurrent Units (GRU) layer. We also test several pooling strategies to leverage BERT's embeddings and selection criteria based on the labels' scores. We were able to achieve good performance across a large variety of languages, including those not seen during the fine-tuning process (zero-shot languages).
△ Less
Submitted 12 May, 2021;
originally announced May 2021.
-
Generating Biomedical Question Answering Corpora from Q&A forums
Authors:
Andre Lamurias,
Diana Sousa,
Francisco M. Couto
Abstract:
Question Answering (QA) is a natural language processing task that aims at obtaining relevant answers to user questions. While some progress has been made in this area, biomedical questions are still a challenge to most QA approaches, due to the complexity of the domain and limited availability of training sets. We present a method to automatically extract question-article pairs from Q\&A web foru…
▽ More
Question Answering (QA) is a natural language processing task that aims at obtaining relevant answers to user questions. While some progress has been made in this area, biomedical questions are still a challenge to most QA approaches, due to the complexity of the domain and limited availability of training sets. We present a method to automatically extract question-article pairs from Q\&A web forums, which can be used for document retrieval, a crucial step of most QA systems. The proposed framework extracts from selected forums the questions and the respective answers that contain citations. This way, QA systems based on document retrieval can be developed and evaluated using the question-article pairs annotated by users of these forums. We generated the BiQA corpus by applying our framework to three forums, obtaining 7,453 questions and 14,239 question-article pairs. We evaluated how the number of articles associated with each question and the number of votes on each answer affects the performance of baseline document retrieval approaches. Also, we demonstrated that the articles given as answers are significantly similar to the questions and trained a state-of-the-art deep learning model that obtained similar performance to using a dataset manually annotated by experts. The proposed framework can be used to update the BiQA corpus from the same forums as new posts are made, and from other forums that support their answers with documents. The BiQA corpus and the framework used to generate it are available at \url{https://github.com/lasigeBioTM/BiQA}.
△ Less
Submitted 22 December, 2020; v1 submitted 6 February, 2020;
originally announced February 2020.
-
Using Neural Networks for Relation Extraction from Biomedical Literature
Authors:
Diana Sousa,
Andre Lamurias,
Francisco M. Couto
Abstract:
Using different sources of information to support automated extracting of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, namely, usin…
▽ More
Using different sources of information to support automated extracting of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, namely, using neural networks algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead us to even higher evaluation scores in relation extraction tasks. Thus, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been proved to enhance previous state-of-the-art results.
△ Less
Submitted 18 September, 2020; v1 submitted 27 May, 2019;
originally announced May 2019.
-
A Silver Standard Corpus of Human Phenotype-Gene Relations
Authors:
Diana Sousa,
Andre Lamurias,
Francisco M. Couto
Abstract:
Human phenotype-gene relations are fundamental to fully understand the origin of some phenotypic abnormalities and their associated diseases. Biomedical literature is the most comprehensive source of these relations, however, we need Relation Extraction tools to automatically recognize them. Most of these tools require an annotated corpus and to the best of our knowledge, there is no corpus availa…
▽ More
Human phenotype-gene relations are fundamental to fully understand the origin of some phenotypic abnormalities and their associated diseases. Biomedical literature is the most comprehensive source of these relations, however, we need Relation Extraction tools to automatically recognize them. Most of these tools require an annotated corpus and to the best of our knowledge, there is no corpus available annotated with human phenotype-gene relations. This paper presents the Phenotype-Gene Relations (PGR) corpus, a silver standard corpus of human phenotype and gene annotations and their relations. The corpus consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 relations. We generated this corpus using Named-Entity Recognition tools, whose results were partially evaluated by eight curators, obtaining a precision of 87.01%. By using the corpus we were able to obtain promising results with two state-of-the-art deep learning tools, namely 78.05% of precision. The PGR corpus was made publicly available to the research community.
△ Less
Submitted 13 April, 2020; v1 submitted 26 March, 2019;
originally announced March 2019.
-
WS4A: a Biomedical Question and Answering System based on public Web Services and Ontologies
Authors:
Miguel J. Rodrigues,
Miguel Falé,
Andre Lamurias,
Francisco M. Couto
Abstract:
This paper describes our system, dubbed WS4A (Web Services for All), that participated in the fourth edition of the BioASQ challenge (2016). We used WS4A to perform the Question and Answering (QA) task 4b, which consisted on the retrieval of relevant concepts, documents, snippets, RDF triples, exact answers and ideal answers for each given question. The novelty in our approach consists on the maxi…
▽ More
This paper describes our system, dubbed WS4A (Web Services for All), that participated in the fourth edition of the BioASQ challenge (2016). We used WS4A to perform the Question and Answering (QA) task 4b, which consisted on the retrieval of relevant concepts, documents, snippets, RDF triples, exact answers and ideal answers for each given question. The novelty in our approach consists on the maximum exploitation of existing web services in each step of WS4A, such as the annotation of text, and the retrieval of metadata for each annotation. The information retrieved included concept identifiers, ontologies, ancestors, and most importantly, PubMed identifiers. The paper describes the WS4A pipeline and also presents the precision, recall and f-measure values obtained in task 4b. Our system achieved two second places in two subtasks on one of the five batches.
△ Less
Submitted 17 November, 2016; v1 submitted 27 September, 2016;
originally announced September 2016.