-
SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation
Authors:
Andrés García-Silva,
Cristian Berrío,
José Manuel Gómez-Pérez
Abstract:
Detecting salient parts in text using natural language processing has been widely used to mitigate the effects of information overflow. Nevertheless, most of the datasets available for this task are derived mainly from academic publications. We introduce SPACE-IDEAS, a dataset for salient information detection from innovation ideas related to the Space domain. The text in SPACE-IDEAS varies greatly and includes informal, technical, academic and business-oriented writing styles. In addition to a manually annotated dataset we release an extended version that is annotated using a large generative language model. We train different sentence and sequential sentence classifiers, and show that the automatically annotated dataset can be leveraged using multitask learning to train better classifiers.
Submitted 25 March, 2024;
originally announced March 2024.
-
Textual Entailment for Effective Triple Validation in Object Prediction
Authors:
Andrés García-Silva,
Cristian Berrío,
José Manuel Gómez-Pérez
Abstract:
Knowledge base population seeks to expand knowledge graphs with facts that are typically extracted from a text corpus. Recently, language models pretrained on large corpora have been shown to contain factual knowledge that can be retrieved using cloze-style strategies. Such an approach enables zero-shot recall of facts, showing competitive results in object prediction compared to supervised baselines. However, prompt-based fact retrieval can be brittle and heavily depend on the prompts and context used, which may produce results that are unintended or hallucinatory. We propose to use textual entailment to validate facts extracted from language models through cloze statements. Our results show that triple validation based on textual entailment improves language model predictions in different training regimes. Furthermore, we show that entailment-based triple validation is also effective for validating candidate facts extracted from other sources, including existing knowledge graphs and text passages where named entities are recognized.
Submitted 29 January, 2024;
originally announced January 2024.
-
Capturing Pertinent Symbolic Features for Enhanced Content-Based Misinformation Detection
Authors:
Flavio Merenda,
José Manuel Gómez-Pérez
Abstract:
Preventing the spread of misinformation is challenging. The detection of misleading content presents a significant hurdle due to its extreme linguistic and domain variability. Content-based models have managed to identify deceptive language by learning representations from textual data such as social media posts and web articles. However, aggregating representative samples of this heterogeneous phenomenon and implementing effective real-world applications is still elusive. Based on analytical work on the language of misinformation, this paper analyzes the linguistic attributes that characterize this phenomenon and how representative of such features some of the most popular misinformation datasets are. We demonstrate that the appropriate use of pertinent symbolic knowledge in combination with neural language models is helpful in detecting misleading content. Our results achieve state-of-the-art performance in misinformation datasets across the board, showing that our approach offers a valid and robust alternative to multi-task transfer learning without requiring any additional training data. Furthermore, our results show evidence that structured knowledge can provide the extra boost required to address a complex and unpredictable real-world problem like misinformation detection, not only in terms of accuracy but also time efficiency and resource utilization.
Submitted 29 January, 2024;
originally announced January 2024.
-
Towards Language-driven Scientific AI
Authors:
José Manuel Gómez-Pérez
Abstract:
Inspired by recent and revolutionary developments in AI, particularly in language understanding and generation, we set about designing AI systems that are able to address complex scientific tasks that challenge human capabilities to make new discoveries. Central to our approach is the notion of natural language as the core representation, reasoning, and exchange format between scientific AI and human scientists. In this paper, we identify and discuss some of the main research challenges to accomplish such a vision.
Submitted 31 October, 2022; v1 submitted 27 October, 2022;
originally announced October 2022.
-
Artificial Intelligence and Natural Language Processing and Understanding in Space: A Methodological Framework and Four ESA Case Studies
Authors:
José Manuel Gómez-Pérez,
Andrés García-Silva,
Rosemarie Leone,
Mirko Albani,
Moritz Fontaine,
Charles Poncet,
Leopold Summerer,
Alessandro Donati,
Ilaria Roma,
Stefano Scaglioni
Abstract:
The European Space Agency is well known as a powerful force for scientific discovery in numerous areas related to Space. The amount and depth of the knowledge produced throughout the different missions carried out by ESA and their contribution to scientific progress is enormous, involving large collections of documents like scientific publications, feasibility studies, technical reports, and quality management procedures, among many others. Through initiatives like the Open Space Innovation Platform, ESA also acts as a hub for new ideas coming from the wider community across different challenges, contributing to a virtuous circle of scientific discovery and innovation. Handling such a wealth of information, a large part of which is unstructured text, is a colossal task that goes beyond human capabilities, hence requiring automation. In this paper, we present a methodological framework based on artificial intelligence and natural language processing and understanding to automatically extract information from Space documents, generating value from it, and illustrate this framework through several case studies implemented across different functional areas of ESA, including Mission Design, Quality Assurance, Long-Term Data Preservation, and the Open Space Innovation Platform. In doing so, we demonstrate the value of these technologies in several tasks ranging from effortlessly searching and recommending Space information to automatically determining how innovative an idea can be, answering questions about Space, and generating quizzes regarding quality procedures. Each of these accomplishments represents a step forward in the application of increasingly intelligent AI systems in Space, from structuring and facilitating information access to intelligent systems capable of understanding and reasoning with such information.
Submitted 24 October, 2022; v1 submitted 7 October, 2022;
originally announced October 2022.
-
Generating Quizzes to Support Training on Quality Management and Assurance in Space Science and Engineering
Authors:
Andrés García-Silva,
Cristian Berrío,
José Manuel Gómez-Pérez
Abstract:
Quality management and assurance is key for space agencies to guarantee the success of space missions, which are high-risk and extremely costly. In this paper, we present a system to generate quizzes, a common resource to evaluate the effectiveness of training sessions, from documents about quality assurance procedures in the Space domain. Our system leverages state of the art auto-regressive models like T5 and BART to generate questions, and a RoBERTa model to extract answers for such questions, thus verifying their suitability.
Submitted 4 November, 2022; v1 submitted 7 October, 2022;
originally announced October 2022.
-
SpaceQA: Answering Questions about the Design of Space Missions and Space Craft Concepts
Authors:
Andrés García-Silva,
Cristian Berrío,
José Manuel Gómez-Pérez,
José Antonio Martínez-Heras,
Alessandro Donati,
Ilaria Roma
Abstract:
We present SpaceQA, to the best of our knowledge the first open-domain QA system in Space mission design. SpaceQA is part of an initiative by the European Space Agency (ESA) to facilitate the access, sharing and reuse of information about Space mission design within the agency and with the public. We adopt a state-of-the-art architecture consisting of a dense retriever and a neural reader and opt for an approach based on transfer learning rather than fine-tuning due to the lack of domain-specific annotated data. Our evaluation on a test set produced by ESA is largely consistent with the results originally reported by the evaluated retrievers and confirms the need of fine tuning for reading comprehension. As of writing this paper, ESA is piloting SpaceQA internally.
Submitted 4 November, 2022; v1 submitted 7 October, 2022;
originally announced October 2022.
-
On the Impact of Knowledge-based Linguistic Annotations in the Quality of Scientific Embeddings
Authors:
Andres Garcia-Silva,
Ronald Denaux,
Jose Manuel Gomez-Perez
Abstract:
In essence, embedding algorithms work by optimizing the distance between a word and its usual context in order to generate an embedding space that encodes the distributional representation of words. In addition to single words or word pieces, other features which result from the linguistic analysis of text, including lexical, grammatical and semantic information, can be used to improve the quality of embedding spaces. However, until now we did not have a precise understanding of the impact that such individual annotations and their possible combinations may have in the quality of the embeddings. In this paper, we conduct a comprehensive study on the use of explicit linguistic annotations to generate embeddings from a scientific corpus and quantify their impact in the resulting representations. Our results show how the effect of such annotations in the embeddings varies depending on the evaluation task. In general, we observe that learning embeddings using linguistic annotations contributes to achieve better evaluation results.
Submitted 13 April, 2021;
originally announced April 2021.
-
Understanding Transformers for Bot Detection in Twitter
Authors:
Andres Garcia-Silva,
Cristian Berrio,
Jose Manuel Gomez-Perez
Abstract:
In this paper we shed light on the impact of fine-tuning over social media data on the internal representations of neural language models. We focus on bot detection in Twitter, a key task to mitigate and counteract the automatic spreading of disinformation and bias in social media. We investigate the use of pre-trained language models to tackle the detection of tweets generated by a bot or a human account based exclusively on its content. Unlike the general trend in benchmarks like GLUE, where BERT generally outperforms generative transformers like GPT and GPT-2 for most classification tasks on regular text, we observe that fine-tuning generative transformers on a bot detection task produces higher accuracies. We analyze the architectural components of each transformer and study the effect of fine-tuning on their hidden states and output representations. Among our findings, we show that part of the syntactical information and distributional properties captured by BERT during pre-training is lost upon fine-tuning while the generative pre-training approach manages to preserve these properties.
Submitted 13 April, 2021;
originally announced April 2021.
-
Classifying Scientific Publications with BERT -- Is Self-Attention a Feature Selection Method?
Authors:
Andres Garcia-Silva,
Jose Manuel Gomez-Perez
Abstract:
We investigate the self-attention mechanism of BERT in a fine-tuning scenario for the classification of scientific articles over a taxonomy of research disciplines. We observe how self-attention focuses on words that are highly related to the domain of the article. Particularly, a small subset of vocabulary words tends to receive most of the attention. We compare and evaluate the subset of the most attended words with feature selection methods normally used for text classification in order to characterize self-attention as a possible feature selection approach. Using ConceptNet as ground truth, we also find that attended words are more related to the research fields of the articles. However, conventional feature selection methods are still a better option to learn classifiers from scratch. This result suggests that, while self-attention identifies domain-relevant terms, the discriminatory information in BERT is encoded in the contextualized outputs and the classification layer. It also raises the question whether injecting feature selection methods in the self-attention mechanism could further optimize single sequence classification using transformers.
Submitted 20 January, 2021;
originally announced January 2021.
-
ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention
Authors:
Jose Manuel Gomez-Perez,
Raul Ortega
Abstract:
Textbook Question Answering is a complex task in the intersection of Machine Comprehension and Visual Question Answering that requires reasoning with multimodal information from text and diagrams. For the first time, this paper taps on the potential of transformer language models and bottom-up and top-down attention to tackle the language and visual understanding challenges this task entails. Rather than training a language-visual transformer from scratch we rely on pre-trained transformers, fine-tuning and ensembling. We add bottom-up and top-down attention to identify regions of interest corresponding to diagram constituents and their relationships, improving the selection of relevant visual information for each question and answer options. Our system ISAAQ reports unprecedented success in all TQA question types, with accuracies of 81.36%, 71.11% and 55.12% on true/false, text-only and diagram multiple choice questions. ISAAQ also demonstrates its broad applicability, obtaining state-of-the-art results in other demanding datasets.
Submitted 1 October, 2020;
originally announced October 2020.
-
Linked Credibility Reviews for Explainable Misinformation Detection
Authors:
Ronald Denaux,
Jose Manuel Gomez-Perez
Abstract:
In recent years, misinformation on the Web has become increasingly rampant. The research community has responded by proposing systems and challenges, which are beginning to be useful for (various subtasks of) detecting misinformation. However, most proposed systems are based on deep learning techniques which are fine-tuned to specific domains, are difficult to interpret and produce results which are not machine readable. This limits their applicability and adoption as they can only be used by a select expert audience in very specific settings. In this paper we propose an architecture based on a core concept of Credibility Reviews (CRs) that can be used to build networks of distributed bots that collaborate for misinformation detection. The CRs serve as building blocks to compose graphs of (i) web content, (ii) existing credibility signals --fact-checked claims and reputation reviews of websites--, and (iii) automatically computed reviews. We implement this architecture on top of lightweight extensions to Schema.org and services providing generic NLP tasks for semantic similarity and stance detection. Evaluations on existing datasets of social-media posts, fake news and political speeches demonstrate several advantages over existing systems: extensibility, domain-independence, composability, explainability and transparency via provenance. Furthermore, we obtain competitive results without requiring fine-tuning and establish a new state of the art on the Clef'18 CheckThat! Factuality task.
Submitted 28 August, 2020;
originally announced August 2020.
-
Assessing the Lexico-Semantic Relational Knowledge Captured by Word and Concept Embeddings
Authors:
Ronald Denaux,
Jose Manuel Gomez-Perez
Abstract:
Deep learning currently dominates the benchmarks for various NLP tasks and, at the basis of such systems, words are frequently represented as embeddings --vectors in a low dimensional space-- learned from large text corpora and various algorithms have been proposed to learn both word and concept embeddings. One of the claimed benefits of such embeddings is that they capture knowledge about semantic relations. Such embeddings are most often evaluated through tasks such as predicting human-rated similarity and analogy which only test a few, often ill-defined, relations. In this paper, we propose a method for (i) reliably generating word and concept pair datasets for a wide number of relations by using a knowledge graph and (ii) evaluating to what extent pre-trained embeddings capture those relations. We evaluate the approach against a proprietary and a public knowledge graph and analyze the results, showing which lexico-semantic relational knowledge is captured by current embedding learning approaches.
Submitted 24 September, 2019;
originally announced September 2019.
-
Look, Read and Enrich. Learning from Scientific Figures and their Captions
Authors:
Jose Manuel Gomez-Perez,
Raul Ortega
Abstract:
Compared to natural images, understanding scientific figures is particularly hard for machines. However, there is a valuable source of information in scientific literature that until now has remained untapped: the correspondence between a figure and its caption. In this paper we investigate what can be learnt by looking at a large number of figures and reading their captions, and introduce a figure-caption correspondence learning task that makes use of our observations. Training visual and language networks without supervision other than pairs of unconstrained figures and captions is shown to successfully solve this task. We also show that transferring lexical and semantic knowledge from a knowledge graph significantly enriches the resulting features. Finally, we demonstrate the positive impact of such features in other tasks involving scientific text and figures, like multi-modal classification and machine comprehension for question answering, outperforming supervised baselines and ad-hoc approaches.
Submitted 19 September, 2019;
originally announced September 2019.
-
Enabling FAIR Research in Earth Science through Research Objects
Authors:
Andres Garcia-Silva,
Jose Manuel Gomez-Perez,
Raul Palma,
Marcin Krystek,
Simone Mantovani,
Federica Foglini,
Valentina Grande,
Francesco De Leo,
Stefano Salvi,
Elisa Trasati,
Vito Romaniello,
Mirko Albani,
Cristiano Silvagni,
Rosemarie Leone,
Fulvio Marelli,
Sergio Albani,
Michele Lazzarini,
Hazel J. Napier,
Helen M. Glaves,
Timothy Aldridge,
Charles Meertens,
Fran Boler,
Henry W. Loescher,
Christine Laney,
Melissa A Genazzio,
et al. (2 additional authors not shown)
Abstract:
Data-intensive science communities are progressively adopting FAIR practices that enhance the visibility of scientific breakthroughs and enable reuse. At the core of this movement, research objects contain and describe scientific information and resources in a way compliant with the FAIR principles and sustain the development of key infrastructure and tools. This paper provides an account of the challenges, experiences and solutions involved in the adoption of FAIR around research objects over several Earth Science disciplines. During this journey, our work has been comprehensive, with outcomes including: an extended research object model adapted to the needs of earth scientists; the provisioning of digital object identifiers (DOI) to enable persistent identification and to give due credit to authors; the generation of content-based, semantically rich, research object metadata through natural language processing, enhancing visibility and reuse through recommendation systems and third-party search engines; and various types of checklists that provide a compact representation of research object quality as a key enabler of scientific reuse. All these results have been integrated in ROHub, a platform that provides research object management functionality to a wealth of applications and interfaces across different scientific communities. To monitor and quantify the community uptake of research objects, we have defined indicators and obtained measures via ROHub that are also discussed herein.
Submitted 27 September, 2018;
originally announced September 2018.
-
Indexing Execution Patterns in Workflow Provenance Graphs through Generalized Trie Structures
Authors:
Esteban García-Cuesta,
José M. Gómez-Pérez
Abstract:
Over the last years, scientific workflows have become mature enough to be used in a production style. However, despite the increasing maturity, there is still a shortage of tools for searching, adapting, and reusing workflows that hinders a more generalized adoption by the scientific communities. Indeed, due to the limited availability of machine-readable scientific metadata and the heterogeneity of workflow specification formats and representations, new ways to leverage alternative sources of information that complement existing approaches are needed. In this paper we address such limitations by applying statistically enriched generalized trie structures to exploit workflow execution provenance information in order to assist the analysis, indexing and search of scientific workflows. Our method bridges the gap between the description of what a workflow is supposed to do according to its specification and related metadata and what it actually does as recorded in its provenance execution trace. In doing so, we also prove that the proposed method outperforms SPARQL 1.1 Property Paths for querying provenance graphs.
Submitted 19 July, 2018;
originally announced July 2018.
-
Not just about size - A Study on the Role of Distributed Word Representations in the Analysis of Scientific Publications
Authors:
Andres Garcia,
Jose Manuel Gomez-Perez
Abstract:
The emergence of knowledge graphs in the scholarly communication domain and recent advances in artificial intelligence and natural language processing bring us closer to a scenario where intelligent systems can assist scientists over a range of knowledge-intensive tasks. In this paper we present experimental results about the generation of word embeddings from scholarly publications for the intelligent processing of scientific texts extracted from SciGraph. We compare the performance of domain-specific embeddings with existing pre-trained vectors generated from very large and general purpose corpora. Our results suggest that there is a trade-off between corpus specificity and volume. Embeddings from domain-specific scientific corpora effectively capture the semantics of the domain. On the other hand, comparable results can also be obtained with general corpora, but only in the presence of very large corpora of well-formed text. Furthermore, we show that the degree of overlapping between knowledge areas is directly related to the performance of embeddings in domain evaluation tasks.
Submitted 5 April, 2018;
originally announced April 2018.
-
Collaboration Spheres: a Visual Metaphor to Share and Reuse Research Objects
Authors:
Mariano Rico,
José Manuel Gómez-Pérez,
Rafael Gonzalez,
Aleix Garrido,
Oscar Corcho
Abstract:
Research Objects (ROs) are semantically enhanced aggregations of resources associated with scientific experiments, such as data, provenance of these data, the scientific workflow used to run the experiment, intermediate results, logs and the interpretation of the results. As the number of ROs increases, it is becoming difficult to find ROs to be used, reused or re-purposed. New search and retrieval techniques are required to find the most appropriate ROs for a given researcher, with attention to providing an intuitive user interface. In this paper we show CollabSpheres, a user interface that provides a new visual metaphor to find ROs by means of a recommendation system that takes advantage of the social aspects of ROs. The experimental evaluation of this tool shows that users perceive high values of usability, user satisfaction, usefulness and ease of use. From the analysis of these results we argue that users perceive the simplicity, intuitiveness and cleanness of this tool, and that the tool increases collaboration and reuse of research objects.
Submitted 16 October, 2017;
originally announced October 2017.