-
HyperBERT: Mixing Hypergraph-Aware Layers with Language Models for Node Classification on Text-Attributed Hypergraphs
Authors:
Adrián Bazaga,
Pietro Liò,
Gos Micklem
Abstract:
Hypergraphs are characterized by complex topological structure, representing higher-order interactions among multiple entities through hyperedges. Lately, hypergraph-based deep learning methods to learn informative data representations for the problem of node classification on text-attributed hypergraphs have garnered increasing research attention. However, existing methods struggle to simultaneously capture the full extent of hypergraph structural information and the rich linguistic attributes inherent in the node attributes, which largely hampers their effectiveness and generalizability. To overcome these challenges, we explore ways to further augment a pretrained BERT model with specialized hypergraph-aware layers for the task of node classification. Such layers introduce higher-order structural inductive bias into the language model, thus improving the model's capacity to harness both higher-order context information from the hypergraph structure and semantic information present in text. In this paper, we propose a new architecture, HyperBERT, a mixed text-hypergraph model which simultaneously models hypergraph relational structure while maintaining the high-quality text encoding capabilities of a pretrained BERT. Notably, HyperBERT presents results that achieve a new state-of-the-art on five challenging text-attributed hypergraph node classification benchmarks.
Submitted 25 May, 2024; v1 submitted 11 February, 2024;
originally announced February 2024.
-
Language Model Knowledge Distillation for Efficient Question Answering in Spanish
Authors:
Adrián Bazaga,
Pietro Liò,
Gos Micklem
Abstract:
Recent advances in the development of pre-trained Spanish language models have led to significant progress in many Natural Language Processing (NLP) tasks, such as question answering. However, the lack of efficient models imposes a barrier for the adoption of such models in resource-constrained environments. Therefore, smaller distilled models for the Spanish language could prove highly scalable and facilitate their further adoption on a variety of tasks and scenarios. In this work, we take one step in this direction by developing SpanishTinyRoBERTa, a compressed language model based on RoBERTa for efficient question answering in Spanish. To achieve this, we employ knowledge distillation from a large model onto a lighter model that allows for a wider implementation, even in areas with limited computational resources, with negligible loss in performance. Our experiments show that the dense distilled model can still preserve the performance of its larger counterpart, while significantly increasing inference speed. This work serves as a starting point for further research and investigation of model compression efforts for Spanish language models across various NLP tasks.
Submitted 16 March, 2024; v1 submitted 7 December, 2023;
originally announced December 2023.
-
SQLformer: Deep Auto-Regressive Query Graph Generation for Text-to-SQL Translation
Authors:
Adrián Bazaga,
Pietro Liò,
Gos Micklem
Abstract:
In recent years, the task of text-to-SQL translation, which converts natural language questions into executable SQL queries, has gained significant attention for its potential to democratize data access. Despite its promise, challenges such as adapting to unseen databases and aligning natural language with SQL syntax have hindered widespread adoption. To overcome these issues, we introduce SQLformer, a novel Transformer architecture specifically crafted to perform text-to-SQL translation tasks. Our model predicts SQL queries as abstract syntax trees (ASTs) in an autoregressive way, incorporating structural inductive bias in the encoder and decoder layers. This bias, guided by database table and column selection, aids the decoder in generating SQL query ASTs represented as graphs in a Breadth-First Search canonical order. Our experiments demonstrate that SQLformer achieves state-of-the-art performance across six prominent text-to-SQL benchmarks.
Submitted 27 May, 2024; v1 submitted 26 October, 2023;
originally announced October 2023.
-
Unsupervised Pretraining for Fact Verification by Language Model Distillation
Authors:
Adrián Bazaga,
Pietro Liò,
Gos Micklem
Abstract:
Fact verification aims to verify a claim using evidence from a trustworthy knowledge base. To address this challenge, algorithms must produce features for every claim that are both semantically meaningful and compact enough to find a semantic alignment with the source information. In contrast to previous work, which tackled the alignment problem by learning over annotated corpora of claims and their corresponding labels, we propose SFAVEL (Self-supervised Fact Verification via Language Model Distillation), a novel unsupervised pretraining framework that leverages pre-trained language models to distil self-supervised features into high-quality claim-fact alignments without the need for annotations. This is enabled by a novel contrastive loss function that encourages features to attain high-quality claim and evidence alignments whilst preserving the semantic relationships across the corpora. Notably, we present results that achieve a new state-of-the-art on FB15k-237 (+5.3% Hits@1) and FEVER (+8% accuracy) with linear evaluation.
Submitted 6 March, 2024; v1 submitted 28 September, 2023;
originally announced September 2023.
-
Learning from learning machines: a new generation of AI technology to meet the needs of science
Authors:
Luca Pion-Tonachini,
Kristofer Bouchard,
Hector Garcia Martin,
Sean Peisert,
W. Bradley Holtz,
Anil Aswani,
Dipankar Dwivedi,
Haruko Wainwright,
Ghanshyam Pilania,
Benjamin Nachman,
Babetta L. Marrone,
Nicola Falco,
Prabhat,
Daniel Arnold,
Alejandro Wolf-Yadlin,
Sarah Powers,
Sharlee Climer,
Quinn Jackson,
Ty Carlson,
Michael Sohn,
Petrus Zwart,
Neeraj Kumar,
Amy Justice,
Claire Tomlin,
Daniel Jacobson
, et al. (11 additional authors not shown)
Abstract:
We outline emerging opportunities and challenges to enhance the utility of AI for scientific discovery. The distinct goals of AI for industry versus the goals of AI for science create tension between identifying patterns in data versus discovering patterns in the world from data. If we address the fundamental challenges associated with "bridging the gap" between domain-driven scientific models and data-driven AI learning machines, then we expect that these AI models can transform hypothesis generation, scientific discovery, and the scientific process itself.
Submitted 26 November, 2021;
originally announced November 2021.
-
Translating synthetic natural language to database queries: a polyglot deep learning framework
Authors:
Adrián Bazaga,
Nupur Gunwant,
Gos Micklem
Abstract:
The number of databases as well as their size and complexity is increasing. This creates a barrier to use especially for non-experts, who have to come to grips with the nature of the data, the way it has been represented in the database, and the specific query languages or user interfaces by which data are accessed. These difficulties worsen in research settings, where it is common to work with many different databases. One approach to improving this situation is to allow users to pose their queries in natural language.
In this work we describe a machine learning framework, Polyglotter, that in a general way supports the mapping of natural language searches to database queries. Importantly, it does not require the creation of manually annotated data for training and therefore can be applied easily to multiple domains. The framework is polyglot in the sense that it supports multiple different database engines that are accessed with a variety of query languages, including SQL and Cypher. Furthermore, Polyglotter also supports multi-class queries.
Our results indicate that our framework performs well on both synthetic and real databases, and may provide opportunities for database maintainers to improve accessibility to their resources.
Submitted 14 April, 2021;
originally announced April 2021.
-
Understanding the Systems Biology of Pathogen Virulence Using Semantic Methodologies
Authors:
David Rhee,
Kevin Shieh,
Julie Sullivan,
Gos Micklem,
Kami Kim,
Aaron Golden
Abstract:
Systems biology approaches to the integrative study of cells, organs and organisms offer the best means of understanding in a holistic manner the diversity of molecular assays that can now be implemented in a high throughput manner. Such assays can sample the genome, epigenome, proteome, metabolome and microbiome contemporaneously, allowing us for the first time to perform a complete analysis of physiological activity. The central problem remains empowering the scientific community to actually implement such an integration, across seemingly diverse data types and measurements. One promising solution is to apply semantic techniques on a self-consistent and implicitly correct ontological representation of these data types. In this paper we describe how we have applied one such solution, based around the InterMine data warehouse platform which uses as its basis the Sequence Ontology, to facilitate a systems biology analysis of virulence in the apicomplexan pathogen Toxoplasma gondii, a common parasite that infects up to half the world's population, with acute pathogenic risks for immuno-compromised individuals or pregnant mothers. Our solution, which we named 'toxoMine', has provided both a platform for our collaborators to perform such integrative analyses and also opportunities for such cyberinfrastructure to be further developed, particularly to take advantage of possible semantic similarities of value to knowledge discovery in the Omics enterprise. We discuss these opportunities in the context of further enhancing the capabilities of this powerful integrative platform.
Submitted 18 April, 2016;
originally announced April 2016.