Search | arXiv e-print repository

Meta-prompting Optimized Retrieval-augmented Generation

Abstract: Retrieval-augmented generation resorts to content retrieved from external sources in order to leverage the performance of large language models in downstream tasks. The excessive volume of retrieved content, the possible dispersion of its parts, or their out of focus range may happen nevertheless to eventually have a detrimental rather than an incremental effect. To mitigate this issue and improve… ▽ More Retrieval-augmented generation resorts to content retrieved from external sources in order to leverage the performance of large language models in downstream tasks. The excessive volume of retrieved content, the possible dispersion of its parts, or their out of focus range may happen nevertheless to eventually have a detrimental rather than an incremental effect. To mitigate this issue and improve retrieval-augmented generation, we propose a method to refine the retrieved content before it is included in the prompt by resorting to meta-prompting optimization. Put to empirical test with the demanding multi-hop question answering task from the StrategyQA dataset, the evaluation results indicate that this method outperforms a similar retrieval-augmented system but without this method by over 30%. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2404.05333 [pdf, ps, other]

PORTULAN ExtraGLUE Datasets and Models: Kick-starting a Benchmark for the Neural Processing of Portuguese

Authors: Tomás Osório, Bernardo Leite, Henrique Lopes Cardoso, Luís Gomes, João Rodrigues, Rodrigo Santos, António Branco

Abstract: Leveraging research on the neural modelling of Portuguese, we contribute a collection of datasets for an array of language processing tasks and a corresponding collection of fine-tuned neural language models on these downstream tasks. To align with mainstream benchmarks in the literature, originally developed in English, and to kick start their Portuguese counterparts, the datasets were machine-tr… ▽ More Leveraging research on the neural modelling of Portuguese, we contribute a collection of datasets for an array of language processing tasks and a corresponding collection of fine-tuned neural language models on these downstream tasks. To align with mainstream benchmarks in the literature, originally developed in English, and to kick start their Portuguese counterparts, the datasets were machine-translated from English with a state-of-the-art translation engine. The resulting PORTULAN ExtraGLUE benchmark is a basis for research on Portuguese whose improvement can be pursued in future work. Similarly, the respective fine-tuned neural language models, developed with a low-rank adaptation approach, are made available as baselines that can stimulate future work on the neural processing of Portuguese. All datasets and models have been developed and are made available for two variants of Portuguese: European and Brazilian. △ Less

Submitted 8 May, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

Comments: Preprint - Paper accepted for BUCC 2024

arXiv:2403.08004 [pdf, other]

Pix2Pix-OnTheFly: Leveraging LLMs for Instruction-Guided Image Editing

Authors: Rodrigo Santos, João Silva, António Branco

Abstract: The combination of language processing and image processing keeps attracting increased interest given recent impressive advances that leverage the combined strengths of both domains of research. Among these advances, the task of editing an image on the basis solely of a natural language instruction stands out as a most challenging endeavour. While recent approaches for this task resort, in one way… ▽ More The combination of language processing and image processing keeps attracting increased interest given recent impressive advances that leverage the combined strengths of both domains of research. Among these advances, the task of editing an image on the basis solely of a natural language instruction stands out as a most challenging endeavour. While recent approaches for this task resort, in one way or other, to some form of preliminary preparation, training or fine-tuning, this paper explores a novel approach: We propose a preparation-free method that permits instruction-guided image editing on the fly. This approach is organized along three steps properly orchestrated that resort to image captioning and DDIM inversion, followed by obtaining the edit direction embedding, followed by image editing proper. While dispensing with preliminary preparation, our approach demonstrates to be effective and competitive, outperforming recent, state of the art models for this task when evaluated on the MAGICBRUSH dataset. △ Less

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.01897 [pdf, other]

Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family

Authors: Rodrigo Santos, João Rodrigues, Luís Gomes, João Silva, António Branco, Henrique Lopes Cardoso, Tomás Freitas Osório, Bernardo Leite

Abstract: To foster the neural encoding of Portuguese, this paper contributes foundation encoder models that represent an expansion of the still very scarce ecosystem of large language models specifically developed for this language that are fully open, in the sense that they are open source and openly distributed for free under an open license for any purpose, thus including research and commercial usages.… ▽ More To foster the neural encoding of Portuguese, this paper contributes foundation encoder models that represent an expansion of the still very scarce ecosystem of large language models specifically developed for this language that are fully open, in the sense that they are open source and openly distributed for free under an open license for any purpose, thus including research and commercial usages. Like most languages other than English, Portuguese is low-resourced in terms of these foundational language resources, there being the inaugural 900 million parameter Albertina and 335 million Bertimbau. Taking this couple of models as an inaugural set, we present the extension of the ecosystem of state-of-the-art open encoders for Portuguese with a larger, top performance-driven model with 1.5 billion parameters, and a smaller, efficiency-driven model with 100 million parameters. While achieving this primary goal, further results that are relevant for this ecosystem were obtained as well, namely new datasets for Portuguese based on the SuperGLUE benchmark, which we also distribute openly. △ Less

Submitted 5 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.18766 [pdf, other]

Advancing Generative AI for Portuguese with Open Decoder Gervásio PT*

Authors: Rodrigo Santos, João Silva, Luís Gomes, João Rodrigues, António Branco

Abstract: To advance the neural decoding of Portuguese, in this paper we present a fully open Transformer-based, instruction-tuned decoder model that sets a new state of the art in this respect. To develop this decoder, which we named Gervásio PT*, a strong LLaMA~2 7B model was used as a starting point, and its further improvement through additional training was done over language resources that include new… ▽ More To advance the neural decoding of Portuguese, in this paper we present a fully open Transformer-based, instruction-tuned decoder model that sets a new state of the art in this respect. To develop this decoder, which we named Gervásio PT*, a strong LLaMA~2 7B model was used as a starting point, and its further improvement through additional training was done over language resources that include new instruction data sets of Portuguese prepared for this purpose, which are also contributed in this paper. All versions of Gervásio are open source and distributed for free under an open license, including for either research or commercial usage, and can be run on consumer-grade hardware, thus seeking to contribute to the advancement of research and innovation in language technology for Portuguese. △ Less

Submitted 5 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

arXiv:2312.14029 [pdf, other]

Phylogenetic tree distance computation over succinct representations

Authors: António Pedro Branco, Cátia Vaz, Alexandre P. Francisco

Abstract: There are several tools available to infer phylogenetic trees, which depict the evolutionary relationships among biological entities such as viral and bacterial strains in infectious outbreaks, or cancerous cells in tumor progression trees. These tools rely on several inference methods available to produce phylogenetic trees, with resulting trees not being unique. Thus, methods for comparing phylo… ▽ More There are several tools available to infer phylogenetic trees, which depict the evolutionary relationships among biological entities such as viral and bacterial strains in infectious outbreaks, or cancerous cells in tumor progression trees. These tools rely on several inference methods available to produce phylogenetic trees, with resulting trees not being unique. Thus, methods for comparing phylogenies that are capable of revealing where two phylogenetic trees agree or differ are required. An approach is then to compute a similarity or dissimilarity measure between trees, with the Robinson- Foulds distance being one of the most used, and which can be computed in linear time and space. Nevertheless, given the large and increasing volume of phylogenetic data, phylogenetic trees are becoming very large with hundreds of thousands of leafs. In this context, space requirements become an issue both while computing tree distances and while storing trees. We propose then an efficient implementation of the Robinson-Foulds distance over trees succinct representations. Our implementation generalizes also the Robinson-Foulds distances to labelled phylogenetic trees, i.e., trees containing labels on all nodes, instead of only on leaves. Experimental results show that we are able to still achieve linear time while requiring less space. Our implementation is available as an open-source tool at https://github.com/pedroparedesbranco/TreeDiff. △ Less

Submitted 21 December, 2023; originally announced December 2023.

arXiv:2305.06721 [pdf, other]

doi 10.1007/978-3-031-49008-8_35

Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*

Authors: João Rodrigues, Luís Gomes, João Silva, António Branco, Rodrigo Santos, Henrique Lopes Cardoso, Tomás Osório

Abstract: To advance the neural encoding of Portuguese (PT), and a fortiori the technological preparation of this language for the digital age, we developed a Transformer-based foundation model that sets a new state of the art in this respect for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR). To develop this encoder, which we named Albert… ▽ More To advance the neural encoding of Portuguese (PT), and a fortiori the technological preparation of this language for the digital age, we developed a Transformer-based foundation model that sets a new state of the art in this respect for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR). To develop this encoder, which we named Albertina PT-*, a strong model was used as a starting point, DeBERTa, and its pre-training was done over data sets of Portuguese, namely over data sets we gathered for PT-PT and PT-BR, and over the brWaC corpus for PT-BR. The performance of Albertina and competing models was assessed by evaluating them on prominent downstream language processing tasks adapted for Portuguese. Both Albertina PT-PT and PT-BR versions are distributed free of charge and under the most permissive license possible and can be run on consumer-grade hardware, thus seeking to contribute to the advancement of research and innovation in language technology for Portuguese. △ Less

Submitted 20 June, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

arXiv:2209.02495 [pdf, other]

Transfer Learning of Lexical Semantic Families for Argumentative Discourse Units Identification

Authors: João Rodrigues, Ruben Branco, António Branco

Abstract: Argument mining tasks require an informed range of low to high complexity linguistic phenomena and commonsense knowledge. Previous work has shown that pre-trained language models are highly effective at encoding syntactic and semantic linguistic phenomena when applied with transfer learning techniques and built on different pre-training objectives. It remains an issue of how much the existing pre-… ▽ More Argument mining tasks require an informed range of low to high complexity linguistic phenomena and commonsense knowledge. Previous work has shown that pre-trained language models are highly effective at encoding syntactic and semantic linguistic phenomena when applied with transfer learning techniques and built on different pre-training objectives. It remains an issue of how much the existing pre-trained language models encompass the complexity of argument mining tasks. We rely on experimentation to shed light on how language models obtained from different lexical semantic families leverage the performance of the identification of argumentative discourse units task. Experimental results show that transfer learning techniques are beneficial to the task and that current methods may be insufficient to leverage commonsense knowledge from different lexical semantic families. △ Less

Submitted 24 October, 2022; v1 submitted 6 September, 2022; originally announced September 2022.

arXiv:2103.06924 [pdf, other]

Anaphoric Binding: an integrated overview

Authors: António Branco

Abstract: The interpretation of anaphors depends on their antecedents as the semantic value that an anaphor eventually conveys is co-specified by the value of its antecedent. Interestingly, when occurring in a given syntactic position, different anaphors may have different sets of admissible antecedents. Such differences are the basis for the categorization of anaphoric expressions according to their anapho… ▽ More The interpretation of anaphors depends on their antecedents as the semantic value that an anaphor eventually conveys is co-specified by the value of its antecedent. Interestingly, when occurring in a given syntactic position, different anaphors may have different sets of admissible antecedents. Such differences are the basis for the categorization of anaphoric expressions according to their anaphoric capacity, being important to determine what are the sets of admissible antecedents and how to represent and process this anaphoric capacity for each type of anaphor. From an empirical perspective, these constraints stem from what appears as quite cogent generalisations and exhibit a universal character, given their cross linguistic validity. From a conceptual point of view, in turn, the relations among binding constraints involve non-trivial cross symmetry, which lends them a modular nature and provides further strength to the plausibility of their universal character. This kind of anaphoric binding constraints appears thus as a most significant subset of natural language knowledge, usually referred to as binding theory. This paper provides an integrated overview of these constraints holding on the pairing of nominal anaphors with their admissible antecedents that are based on grammatical relations and structure. Along with the increasing interest on neuro-symbolic approaches to natural language, this paper seeks to contribute to revive the interest on this most intriguing research topic. △ Less

Submitted 11 March, 2021; originally announced March 2021.

arXiv:2011.07997 [pdf, other]

Comparative Probing of Lexical Semantics Theories for Cognitive Plausibility and Technological Usefulness

Authors: António Branco, João Rodrigues, Małgorzata Salawa, Ruben Branco, Chakaveh Saedi

Abstract: Lexical semantics theories differ in advocating that the meaning of words is represented as an inference graph, a feature map** or a vector space, thus raising the question: is it the case that one of these approaches is superior to the others in representing lexical semantics appropriately? Or in its non antagonistic counterpart: could there be a unified account of lexical semantics where these… ▽ More Lexical semantics theories differ in advocating that the meaning of words is represented as an inference graph, a feature map** or a vector space, thus raising the question: is it the case that one of these approaches is superior to the others in representing lexical semantics appropriately? Or in its non antagonistic counterpart: could there be a unified account of lexical semantics where these approaches seamlessly emerge as (partial) renderings of (different) aspects of a core semantic knowledge base? In this paper, we contribute to these research questions with a number of experiments that systematically probe different lexical semantics theories for their levels of cognitive plausibility and of technological usefulness. The empirical findings obtained from these experiments advance our insight on lexical semantics as the feature-based approach emerges as superior to the other ones, and arguably also move us closer to finding answers to the research questions above. △ Less

Submitted 16 November, 2020; originally announced November 2020.

arXiv:2003.13833 [pdf]

The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe

Authors: Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajič, Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin , et al. (22 additional authors not shown)

Abstract: Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitu… ▽ More Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI, including many opportunities, synergies but also misconceptions, has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions. △ Less

Submitted 30 March, 2020; originally announced March 2020.

Comments: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). To appear

arXiv:1912.00567 [pdf, ps, other]

Merging External Bilingual Pairs into Neural Machine Translation

Authors: Tao Wang, Shaohui Kuang, Deyi Xiong, António Branco

Abstract: As neural machine translation (NMT) is not easily amenable to explicit correction of errors, incorporating pre-specified translations into NMT is widely regarded as a non-trivial challenge. In this paper, we propose and explore three methods to endow NMT with pre-specified bilingual pairs. Instead, for instance, of modifying the beam search algorithm during decoding or making complex modifications… ▽ More As neural machine translation (NMT) is not easily amenable to explicit correction of errors, incorporating pre-specified translations into NMT is widely regarded as a non-trivial challenge. In this paper, we propose and explore three methods to endow NMT with pre-specified bilingual pairs. Instead, for instance, of modifying the beam search algorithm during decoding or making complex modifications to the attention mechanism --- mainstream approaches to tackling this challenge ---, we experiment with the training data being appropriately pre-processed to add information about pre-specified translations. Extra embeddings are also used to distinguish pre-specified tokens from the other tokens. Extensive experimentation and analysis indicate that over 99% of the pre-specified phrases are successfully translated (given a 85% baseline) and that there is also a substantive improvement in translation quality with the methods explored here. △ Less

Submitted 1 December, 2019; originally announced December 2019.

Comments: 7 pages, 3 figures, 5 tables

MSC Class: 68T50 ACM Class: I.2.7

arXiv:1711.05380 [pdf, other]

Attention Focusing for Neural Machine Translation by Bridging Source and Target Embeddings

Authors: Shaohui Kuang, Junhui Li, António Branco, Weihua Luo, Deyi Xiong

Abstract: In neural machine translation, a source sequence of words is encoded into a vector from which a target sequence is generated in the decoding phase. Differently from statistical machine translation, the associations between source words and their possible target counterparts are not explicitly stored. Source and target words are at the two ends of a long information processing procedure, mediated b… ▽ More In neural machine translation, a source sequence of words is encoded into a vector from which a target sequence is generated in the decoding phase. Differently from statistical machine translation, the associations between source words and their possible target counterparts are not explicitly stored. Source and target words are at the two ends of a long information processing procedure, mediated by hidden states at both the source encoding and the target decoding phases. This makes it possible that a source word is incorrectly translated into a target word that is not any of its admissible equivalent counterparts in the target language. In this paper, we seek to somewhat shorten the distance between source and target words in that procedure, and thus strengthen their association, by means of a method we term bridging source and target word embeddings. We experiment with three strategies: (1) a source-side bridging model, where source word embeddings are moved one step closer to the output target sequence; (2) a target-side bridging model, which explores the more relevant source word embeddings for the prediction of the target sequence; and (3) a direct bridging model, which directly connects source and target word embeddings seeking to minimize errors in the translation of ones by the others. Experiments and analysis presented in this paper demonstrate that the proposed bridging models are able to significantly improve quality of both sentence translation, in general, and alignment and translation of individual source words with target words, in particular. △ Less

Submitted 10 May, 2018; v1 submitted 14 November, 2017; originally announced November 2017.

Comments: 9 pages, 6 figures. Accepted by ACL2018

Showing 1–13 of 13 results for author: Branco, A