Search | arXiv e-print repository

ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language

Authors: Marcos Piau, Roberto Lotufo, Rodrigo Nogueira

Abstract: Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model development. Continued pretraining on language-specific corpora provides a practical solution for adapting models to other languages. However, the impact of different pretraining settings on downstream tasks remains underexplored. This… ▽ More Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model development. Continued pretraining on language-specific corpora provides a practical solution for adapting models to other languages. However, the impact of different pretraining settings on downstream tasks remains underexplored. This work introduces $\texttt{ptt5-v2}$, investigating the continued pretraining of T5 models for Portuguese. We first develop a baseline set of settings and pretrain models with sizes up to 3B parameters. Finetuning on three Portuguese downstream tasks (assin2 STS, assin2 RTE, and TweetSentBR) yields SOTA results on the latter two. We then explore the effects of different pretraining configurations, including quality filters, optimization strategies, and multi-epoch pretraining. Perhaps surprisingly, their impact remains subtle compared to our baseline. We release $\texttt{ptt5-v2}$ pretrained checkpoints and the finetuned MonoT5 rerankers on HuggingFace at https://huggingface.co/collections/unicamp-dl/ptt5-v2-666538a650188ba00aa8d2d0 and https://huggingface.co/collections/unicamp-dl/monoptt5-66653981877df3ea727f720d. △ Less

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2404.08191 [pdf, other]

Measuring Cross-lingual Transfer in Bytes

Authors: Leandro Rodrigues de Souza, Thales Sales Almeida, Roberto Lotufo, Rodrigo Nogueira

Abstract: Multilingual pretraining has been a successful solution to the challenges posed by the lack of resources for languages. These models can transfer knowledge to target languages with minimal or no examples. Recent research suggests that monolingual models also have a similar capability, but the mechanisms behind this transfer remain unclear. Some studies have explored factors like language contamina… ▽ More Multilingual pretraining has been a successful solution to the challenges posed by the lack of resources for languages. These models can transfer knowledge to target languages with minimal or no examples. Recent research suggests that monolingual models also have a similar capability, but the mechanisms behind this transfer remain unclear. Some studies have explored factors like language contamination and syntactic similarity. An emerging line of research suggests that the representations learned by language models contain two components: a language-specific and a language-agnostic component. The latter is responsible for transferring a more universal knowledge. However, there is a lack of comprehensive exploration of these properties across diverse target languages. To investigate this hypothesis, we conducted an experiment inspired by the work on the Scaling Laws for Transfer. We measured the amount of data transferred from a source language to a target language and found that models initialized from diverse languages perform similarly to a target language in a cross-lingual setting. This was surprising because the amount of data transferred to 10 diverse target languages, such as Spanish, Korean, and Finnish, was quite similar. We also found evidence that this transfer is not related to language contamination or language proximity, which strengthens the hypothesis that the model also relies on language-agnostic knowledge. Our experiments have opened up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: NAACL 2024

arXiv:2404.06976 [pdf, other]

Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers

Authors: Mirelle Bueno, Eduardo Seiti de Oliveira, Rodrigo Nogueira, Roberto A. Lotufo, Jayr Alencar Pereira

Abstract: Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portugues… ▽ More Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are frequented more likely by real users compared to those randomly scraped, ensuring a more representative and relevant corpus. To label the query-document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to human performance in our assessments. We provide a detailed description of our annotation methodology to enable others to create similar datasets for other languages, providing a cost-effective way of creating high-quality IR datasets with an arbitrary number of labeled documents per query. Finally, we evaluate a diverse range of open-source and commercial retrievers to serve as baseline systems. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati and all scripts at https://github.com/unicamp-dl/quati . △ Less

Submitted 10 April, 2024; originally announced April 2024.

Comments: 22 pages

arXiv:2402.07859 [pdf, other]

Lissard: Long and Simple Sequential Reasoning Datasets

Authors: Mirelle Bueno, Roberto Lotufo, Rodrigo Nogueira

Abstract: Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens. However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training. For example, state-of-the-art LLMs can find common items in two lists with up to 20 items but fail when lists h… ▽ More Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens. However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training. For example, state-of-the-art LLMs can find common items in two lists with up to 20 items but fail when lists have 80 items. In this paper, we introduce Lissard, a benchmark comprising seven tasks whose goal is to assess the ability of models to process and generate wide-range sequence lengths, requiring repetitive procedural execution. Our evaluation of open-source (Mistral-7B and Mixtral-8x7B) and proprietary models (GPT-3.5 and GPT-4) show a consistent decline in performance across all models as the complexity of the sequence increases. The datasets and code are available at https://github.com/unicamp-dl/Lissard △ Less

Submitted 20 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2402.06334 [pdf, other]

ExaRanker-Open: Synthetic Explanation for IR using Open-Source LLMs

Authors: Fernando Ferraretto, Thiago Laitz, Roberto Lotufo, Rodrigo Nogueira

Abstract: ExaRanker recently introduced an approach to training information retrieval (IR) models, incorporating natural language explanations as additional labels. The method addresses the challenge of limited labeled examples, leading to improvements in the effectiveness of IR models. However, the initial results were based on proprietary language models such as GPT-3.5, which posed constraints on dataset… ▽ More ExaRanker recently introduced an approach to training information retrieval (IR) models, incorporating natural language explanations as additional labels. The method addresses the challenge of limited labeled examples, leading to improvements in the effectiveness of IR models. However, the initial results were based on proprietary language models such as GPT-3.5, which posed constraints on dataset size due to its cost and data privacy. In this paper, we introduce ExaRanker-Open, where we adapt and explore the use of open-source language models to generate explanations. The method has been tested using different LLMs and datasets sizes to better comprehend the effective contribution of data augmentation. Our findings reveal that incorporating explanations consistently enhances neural rankers, with benefits escalating as the LLM size increases. Notably, the data augmentation method proves advantageous even with large datasets, as evidenced by ExaRanker surpassing the target baseline by 0.6 nDCG@10 points in our study. To encourage further advancements by the research community, we have open-sourced both the code and datasets at https://github.com/unicamp-dl/ExaRanker. △ Less

Submitted 9 February, 2024; originally announced February 2024.

arXiv:2401.06910 [pdf, other]

InRanker: Distilled Rankers for Zero-shot Information Retrieval

Authors: Thiago Laitz, Konstantinos Papakostas, Roberto Lotufo, Rodrigo Nogueira

Abstract: Despite multi-billion parameter neural rankers being common components of state-of-the-art information retrieval pipelines, they are rarely used in production due to the enormous amount of compute required for inference. In this work, we propose a new method for distilling large rankers into their smaller versions focusing on out-of-domain effectiveness. We introduce InRanker, a version of monoT5… ▽ More Despite multi-billion parameter neural rankers being common components of state-of-the-art information retrieval pipelines, they are rarely used in production due to the enormous amount of compute required for inference. In this work, we propose a new method for distilling large rankers into their smaller versions focusing on out-of-domain effectiveness. We introduce InRanker, a version of monoT5 distilled from monoT5-3B with increased effectiveness on out-of-domain scenarios. Our key insight is to use language models and rerankers to generate as much as possible synthetic "in-domain" training data, i.e., data that closely resembles the data that will be seen at retrieval time. The pipeline consists of two distillation phases that do not require additional user queries or manual annotations: (1) training on existing supervised soft teacher labels, and (2) training on teacher soft labels for synthetic queries generated using a large language model. Consequently, models like monoT5-60M and monoT5-220M improved their effectiveness by using the teacher's knowledge, despite being 50x and 13x smaller, respectively. Models and code are available at https://github.com/unicamp-dl/InRanker. △ Less

Submitted 12 January, 2024; originally announced January 2024.

arXiv:2401.05273 [pdf, other]

INACIA: Integrating Large Language Models in Brazilian Audit Courts: Opportunities and Challenges

Authors: Jayr Pereira, Andre Assumpcao, Julio Trecenti, Luiz Airosa, Caio Lente, Jhonatan Cléto, Guilherme Dobins, Rodrigo Nogueira, Luis Mitchell, Roberto Lotufo

Abstract: This paper introduces INACIA (Instrução Assistida com Inteligência Artificial), a groundbreaking system designed to integrate Large Language Models (LLMs) into the operational framework of Brazilian Federal Court of Accounts (TCU). The system automates various stages of case analysis, including basic information extraction, admissibility examination, Periculum in mora and Fumus boni iuris analyses… ▽ More This paper introduces INACIA (Instrução Assistida com Inteligência Artificial), a groundbreaking system designed to integrate Large Language Models (LLMs) into the operational framework of Brazilian Federal Court of Accounts (TCU). The system automates various stages of case analysis, including basic information extraction, admissibility examination, Periculum in mora and Fumus boni iuris analyses, and recommendations generation. Through a series of experiments, we demonstrate INACIA's potential in extracting relevant information from case documents, evaluating its legal plausibility, and formulating propositions for judicial decision-making. Utilizing a validation dataset alongside LLMs, our evaluation methodology presents a novel approach to assessing system performance, correlating highly with human judgment. These results underscore INACIA's potential in complex legal task handling while also acknowledging the current limitations. This study discusses possible improvements and the broader implications of applying AI in legal contexts, suggesting that INACIA represents a significant step towards integrating AI in legal systems globally, albeit with cautious optimism grounded in the empirical findings. △ Less

Submitted 26 February, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

arXiv:2312.06907 [pdf, other]

w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training

Authors: Orlem Lima dos Santos, Karen Rosero, Roberto de Alencar Lotufo

Abstract: Sound Event Detection and Localization (SELD) constitutes a complex task that depends on extensive multichannel audio recordings with annotated sound events and their respective locations. In this paper, we introduce a self-supervised approach for SELD adapted from the pre-training methodology of wav2vec 2.0, which learns representations directly from raw audio data, eliminating the need for super… ▽ More Sound Event Detection and Localization (SELD) constitutes a complex task that depends on extensive multichannel audio recordings with annotated sound events and their respective locations. In this paper, we introduce a self-supervised approach for SELD adapted from the pre-training methodology of wav2vec 2.0, which learns representations directly from raw audio data, eliminating the need for supervision. By applying this approach to SELD, we can leverage a substantial amount of unlabeled 3D audio data to learn robust representations of sound events and their locations. Our method comprises two primary stages: pre-training and fine-tuning. In the pre-training phase, unlabeled 3D audio datasets are utilized to train our w2v-SELD model, capturing intricate high-level features and contextual information inherent in audio signals. Subsequently, in the fine-tuning stage, a smaller dataset with labeled SELD data fine-tunes the pre-trained model. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed self-supervised approach for SELD. The model surpasses baseline systems provided with the datasets and achieves competitive performance comparable to state-of-the-art supervised methods. The code and pre-trained parameters of our w2v-SELD model are available in this repository. △ Less

Submitted 29 December, 2023; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: 17 pages, 5 figures

arXiv:2312.02365 [pdf, other]

MEDPSeg: Hierarchical polymorphic multitask learning for the segmentation of ground-glass opacities, consolidation, and pulmonary structures on computed tomography

Authors: Diedre S. Carmo, Jean A. Ribeiro, Alejandro P. Comellas, Joseph M. Reinhardt, Sarah E. Gerard, Letícia Rittner, Roberto A. Lotufo

Abstract: The COVID-19 pandemic response highlighted the potential of deep learning methods in facilitating the diagnosis, prognosis and understanding of lung diseases through automated segmentation of pulmonary structures and lesions in chest computed tomography (CT). Automated separation of lung lesion into ground-glass opacity (GGO) and consolidation is hindered due to the labor-intensive and subjective… ▽ More The COVID-19 pandemic response highlighted the potential of deep learning methods in facilitating the diagnosis, prognosis and understanding of lung diseases through automated segmentation of pulmonary structures and lesions in chest computed tomography (CT). Automated separation of lung lesion into ground-glass opacity (GGO) and consolidation is hindered due to the labor-intensive and subjective nature of this task, resulting in scarce availability of ground truth for supervised learning. To tackle this problem, we propose MEDPSeg. MEDPSeg learns from heterogeneous chest CT targets through hierarchical polymorphic multitask learning (HPML). HPML explores the hierarchical nature of GGO and consolidation, lung lesions, and the lungs, with further benefits achieved through multitasking airway and pulmonary artery segmentation. Over 6000 volumetric CT scans from different partially labeled sources were used for training and testing. Experiments show PML enabling new state-of-the-art performance for GGO and consolidation segmentation tasks. In addition, MEDPSeg simultaneously performs segmentation of the lung parenchyma, airways, pulmonary artery, and lung lesions, all in a single forward prediction, with performance comparable to state-of-the-art methods specialized in each of those targets. Finally, we provide an open-source implementation with a graphical user interface at https://github.com/MICLab-Unicamp/medpseg. △ Less

Submitted 25 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: This manuscript is under review and might change in the future

arXiv:2310.09446 [pdf, other]

Automatic segmentation of lung findings in CT and application to Long COVID

Authors: Diedre S. Carmo, Rosarie A. Tudas, Alejandro P. Comellas, Leticia Rittner, Roberto A. Lotufo, Joseph M. Reinhardt, Sarah E. Gerard

Abstract: Automated segmentation of lung abnormalities in computed tomography is an important step for diagnosing and characterizing lung disease. In this work, we improve upon a previous method and propose S-MEDSeg, a deep learning based approach for accurate segmentation of lung lesions in chest CT images. S-MEDSeg combines a pre-trained EfficientNet backbone, bidirectional feature pyramid network, and mo… ▽ More Automated segmentation of lung abnormalities in computed tomography is an important step for diagnosing and characterizing lung disease. In this work, we improve upon a previous method and propose S-MEDSeg, a deep learning based approach for accurate segmentation of lung lesions in chest CT images. S-MEDSeg combines a pre-trained EfficientNet backbone, bidirectional feature pyramid network, and modern network advancements to achieve improved segmentation performance. A comprehensive ablation study was performed to evaluate the contribution of the proposed network modifications. The results demonstrate modifications introduced in S-MEDSeg significantly improves segmentation performance compared to the baseline approach. The proposed method is applied to an independent dataset of long COVID inpatients to study the effect of post-acute infection vaccination on extent of lung findings. Open-source code, graphical user interface and pip package are available at https://github.com/MICLab-Unicamp/medseg. △ Less

Submitted 13 October, 2023; originally announced October 2023.

Comments: Versao em português brasileiro submetida para o XV EADCA 2023. Brazilian portuguese verson submitted to XV EADCA 2023

arXiv:2307.04601 [pdf, ps, other]

InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval

Authors: Hugo Abonizio, Luiz Bonifacio, Vitor Jeronymo, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

Abstract: Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models have enabled the creation of synthetic in-domain data by providing instructions and a few examples on a prompt. InPars and Promptagator have pioneered this approach and both methods have demonstrated the potential of using LL… ▽ More Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models have enabled the creation of synthetic in-domain data by providing instructions and a few examples on a prompt. InPars and Promptagator have pioneered this approach and both methods have demonstrated the potential of using LLMs as synthetic data generators for IR tasks. This makes them an attractive solution for IR tasks that suffer from a lack of annotated data. However, the reproducibility of these methods was limited, because InPars' training scripts are based on TPUs -- which are not widely accessible -- and because the code for Promptagator was not released and its proprietary LLM is not publicly accessible. To fully realize the potential of these methods and make their impact more widespread in the research community, the resources need to be accessible and easy to reproduce by researchers and practitioners. Our main contribution is a unified toolkit for end-to-end reproducible synthetic data generation research, which includes generation, filtering, training and evaluation. Additionally, we provide an interface to IR libraries widely used by the community and support for GPU. Our toolkit not only reproduces the InPars method and partially reproduces Promptagator, but also provides a plug-and-play functionality allowing the use of different LLMs, exploring filtering methods and finetuning various reranker models on the generated data. We also made available all the synthetic data generated in this work for the 18 different datasets in the BEIR benchmark which took more than 2,000 GPU hours to be generated as well as the reranker models finetuned on the synthetic data. Code and data are available at https://github.com/zetaalphavector/InPars △ Less

Submitted 10 July, 2023; originally announced July 2023.

arXiv:2304.05901 [pdf, other]

Automated computed tomography and magnetic resonance imaging segmentation using deep learning: a beginner's guide

Authors: Diedre Carmo, Gustavo Pinheiro, Lívia Rodrigues, Thays Abreu, Roberto Lotufo, Letícia Rittner

Abstract: Medical image segmentation is an increasingly popular area of research in medical imaging processing and analysis. However, many researchers who are new to the field struggle with basic concepts. This tutorial paper aims to provide an overview of the fundamental concepts of medical imaging, with a focus on Magnetic Resonance and Computerized Tomography. We will also discuss deep learning algorithm… ▽ More Medical image segmentation is an increasingly popular area of research in medical imaging processing and analysis. However, many researchers who are new to the field struggle with basic concepts. This tutorial paper aims to provide an overview of the fundamental concepts of medical imaging, with a focus on Magnetic Resonance and Computerized Tomography. We will also discuss deep learning algorithms, tools, and frameworks used for segmentation tasks, and suggest best practices for method development and image analysis. Our tutorial includes sample tasks using public data, and accompanying code is available on GitHub (https://github.com/MICLab-Unicamp/Medical-ImagingTutorial). By sharing our insights gained from years of experience in the field and learning from relevant literature, we hope to assist researchers in overcoming the initial challenges they may encounter in this exciting and important area of research. △ Less

Submitted 12 April, 2023; originally announced April 2023.

Comments: Equal contribution from Diedre Carmo, Gustavo Pinheiro, and Lívia Rodrigues

arXiv:2303.17003 [pdf, other]

Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams

Authors: Desnes Nunes, Ricardo Primi, Ramon Pires, Roberto Lotufo, Rodrigo Nogueira

Abstract: The present study aims to explore the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests, represented here by the Exame Nacional do Ensino Médio (ENEM), a multidisciplinary entrance examination widely adopted by Brazilian universities. This exam poses challenging tasks for LMs, since its questions may span into multiple fields of knowledge, requiring understanding… ▽ More The present study aims to explore the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests, represented here by the Exame Nacional do Ensino Médio (ENEM), a multidisciplinary entrance examination widely adopted by Brazilian universities. This exam poses challenging tasks for LMs, since its questions may span into multiple fields of knowledge, requiring understanding of information from diverse domains. For instance, a question may require comprehension of both statistics and biology to be solved. This work analyzed responses generated by GPT-3.5 and GPT-4 models for questions presented in the 2009-2017 exams, as well as for questions of the 2022 exam, which were made public after the training of the models was completed. Furthermore, different prompt strategies were tested, including the use of Chain-of-Thought (CoT) prompts to generate explanations for answers. On the 2022 edition, the best-performing model, GPT-4 with CoT, achieved an accuracy of 87%, largely surpassing GPT-3.5 by 11 points. The code and data used on experiments are available at https://github.com/piresramon/gpt-4-enem. △ Less

Submitted 29 March, 2023; originally announced March 2023.

arXiv:2303.16145 [pdf, other]

NeuralMind-UNICAMP at 2022 TREC NeuCLIR: Large Boring Rerankers for Cross-lingual Retrieval

Authors: Vitor Jeronymo, Roberto Lotufo, Rodrigo Nogueira

Abstract: This paper reports on a study of cross-lingual information retrieval (CLIR) using the mT5-XXL reranker on the NeuCLIR track of TREC 2022. Perhaps the biggest contribution of this study is the finding that despite the mT5 model being fine-tuned only on query-document pairs of the same language it proved to be viable for CLIR tasks, where query-document pairs are in different languages, even in the… ▽ More This paper reports on a study of cross-lingual information retrieval (CLIR) using the mT5-XXL reranker on the NeuCLIR track of TREC 2022. Perhaps the biggest contribution of this study is the finding that despite the mT5 model being fine-tuned only on query-document pairs of the same language it proved to be viable for CLIR tasks, where query-document pairs are in different languages, even in the presence of suboptimal first-stage retrieval performance. The results of the study show outstanding performance across all tasks and languages, leading to a high number of winning positions. Finally, this study provides valuable insights into the use of mT5 in CLIR tasks and highlights its potential as a viable solution. For reproduction refer to https://github.com/unicamp-dl/NeuCLIR22-mT5 △ Less

Submitted 28 March, 2023; originally announced March 2023.

arXiv:2301.10521 [pdf, other]

ExaRanker: Explanation-Augmented Neural Ranker

Authors: Fernando Ferraretto, Thiago Laitz, Roberto Lotufo, Rodrigo Nogueira

Abstract: Recent work has shown that inducing a large language model (LLM) to generate explanations prior to outputting an answer is an effective strategy to improve performance on a wide range of reasoning tasks. In this work, we show that neural rankers also benefit from explanations. We use LLMs such as GPT-3.5 to augment retrieval datasets with explanations and train a sequence-to-sequence ranking model… ▽ More Recent work has shown that inducing a large language model (LLM) to generate explanations prior to outputting an answer is an effective strategy to improve performance on a wide range of reasoning tasks. In this work, we show that neural rankers also benefit from explanations. We use LLMs such as GPT-3.5 to augment retrieval datasets with explanations and train a sequence-to-sequence ranking model to output a relevance label and an explanation for a given query-document pair. Our model, dubbed ExaRanker, finetuned on a few thousand examples with synthetic explanations performs on par with models finetuned on 3x more examples without explanations. Furthermore, the ExaRanker model incurs no additional computational cost during ranking and allows explanations to be requested on demand. △ Less

Submitted 3 June, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

arXiv:2301.01820 [pdf, ps, other]

InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

Authors: Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

Abstract: Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such dataset… ▽ More Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu △ Less

Submitted 26 May, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

arXiv:2212.09656 [pdf, other]

Visconde: Multi-document QA with GPT-3 and Neural Reranking

Authors: Jayr Pereira, Robson Fidalgo, Roberto Lotufo, Rodrigo Nogueira

Abstract: This paper proposes a question-answering system that can answer questions whose supporting evidence is spread over multiple (potentially long) documents. The system, called Visconde, uses a three-step pipeline to perform the task: decompose, retrieve, and aggregate. The first step decomposes the question into simpler questions using a few-shot large language model (LLM). Then, a state-of-the-art s… ▽ More This paper proposes a question-answering system that can answer questions whose supporting evidence is spread over multiple (potentially long) documents. The system, called Visconde, uses a three-step pipeline to perform the task: decompose, retrieve, and aggregate. The first step decomposes the question into simpler questions using a few-shot large language model (LLM). Then, a state-of-the-art search engine is used to retrieve candidate passages from a large collection for each decomposed question. In the final step, we use the LLM in a few-shot setting to aggregate the contents of the passages into the final answer. The system is evaluated on three datasets: IIRC, Qasper, and StrategyQA. Results suggest that current retrievers are the main bottleneck and that readers are already performing at the human level as long as relevant passages are provided. The system is also shown to be more effective when the model is induced to give explanations before answering a question. Code is available at \url{https://github.com/neuralmind-ai/visconde}. △ Less

Submitted 19 December, 2022; originally announced December 2022.

arXiv:2212.06121 [pdf, other]

In Defense of Cross-Encoders for Zero-Shot Retrieval

Authors: Guilherme Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

Abstract: Bi-encoders and cross-encoders are widely used in many state-of-the-art retrieval pipelines. In this work we study the generalization ability of these two types of architectures on a wide range of parameter count on both in-domain and out-of-domain scenarios. We find that the number of parameters and early query-document interactions of cross-encoders play a significant role in the generalization… ▽ More Bi-encoders and cross-encoders are widely used in many state-of-the-art retrieval pipelines. In this work we study the generalization ability of these two types of architectures on a wide range of parameter count on both in-domain and out-of-domain scenarios. We find that the number of parameters and early query-document interactions of cross-encoders play a significant role in the generalization ability of retrieval models. Our experiments show that increasing model size results in marginal gains on in-domain test sets, but much larger gains in new domains never seen during fine-tuning. Furthermore, we show that cross-encoders largely outperform bi-encoders of similar size in several tasks. In the BEIR benchmark, our largest cross-encoder surpasses a state-of-the-art bi-encoder by more than 4 average points. Finally, we show that using bi-encoders as first-stage retrievers provides no gains in comparison to a simpler retriever such as BM25 on out-of-domain tasks. The code is available at https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git △ Less

Submitted 12 December, 2022; originally announced December 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2206.02873

arXiv:2210.14837 [pdf, other]

NeuralSearchX: Serving a Multi-billion-parameter Reranker for Multilingual Metasearch at a Low Cost

Authors: Thales Sales Almeida, Thiago Laitz, João Seródio, Luiz Henrique Bonifacio, Roberto Lotufo, Rodrigo Nogueira

Abstract: The widespread availability of search API's (both free and commercial) brings the promise of increased coverage and quality of search results for metasearch engines, while decreasing the maintenance costs of the crawling and indexing infrastructures. However, merging strategies frequently comprise complex pipelines that require careful tuning, which is often overlooked in the literature. In this w… ▽ More The widespread availability of search API's (both free and commercial) brings the promise of increased coverage and quality of search results for metasearch engines, while decreasing the maintenance costs of the crawling and indexing infrastructures. However, merging strategies frequently comprise complex pipelines that require careful tuning, which is often overlooked in the literature. In this work, we describe NeuralSearchX, a metasearch engine based on a multi-purpose large reranking model to merge results and highlight sentences. Due to the homogeneity of our architecture, we could focus our optimization efforts on a single component. We compare our system with Microsoft's Biomedical Search and show that our design choices led to a much cost-effective system with competitive QPS while having close to state-of-the-art results on a wide range of public benchmarks. Human evaluation on two domain-specific tasks shows that our retrieval system outperformed Google API by a large margin in terms of nDCG@10 scores. By describing our architecture and implementation in detail, we hope that the community will build on our design choices. The system is available at https://neuralsearchx.nsx.ai. △ Less

Submitted 26 October, 2022; originally announced October 2022.

Comments: published as a full paper at the DESIRES 2022 Conference. 13 pages

Journal ref: DESIRES 2022-3rd International Conference on Design of Experimental Search and Information REtrieval Systems, 30-31,August 2022, San Jose, CA, USA

arXiv:2209.15094 [pdf, other]

Open-source tool for Airway Segmentation in Computed Tomography using 2.5D Modified EfficientDet: Contribution to the ATM22 Challenge

Authors: Diedre Carmo, Leticia Rittner, Roberto Lotufo

Abstract: Airway segmentation in computed tomography images can be used to analyze pulmonary diseases, however, manual segmentation is labor intensive and relies on expert knowledge. This manuscript details our contribution to MICCAI's 2022 Airway Tree Modelling challenge, a competition of fully automated methods for airway segmentation. We employed a previously developed deep learning architecture based on… ▽ More Airway segmentation in computed tomography images can be used to analyze pulmonary diseases, however, manual segmentation is labor intensive and relies on expert knowledge. This manuscript details our contribution to MICCAI's 2022 Airway Tree Modelling challenge, a competition of fully automated methods for airway segmentation. We employed a previously developed deep learning architecture based on a modified EfficientDet (MEDSeg), training from scratch for binary airway segmentation using the provided annotations. Our method achieved 90.72 Dice in internal validation, 95.52 Dice on external validation, and 93.49 Dice in the final test phase, while not being specifically designed or tuned for airway segmentation. Open source code and a pip package for predictions with our model and trained weights are in https://github.com/MICLab-Unicamp/medseg. △ Less

Submitted 3 October, 2022; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: Open source code, graphical user interface, and a pip package for predictions with our model and trained weights are in https://github.com/MICLab-Unicamp/medseg

arXiv:2209.13738 [pdf, ps, other]

mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark

Authors: Vitor Jeronymo, Mauricio Nascimento, Roberto Lotufo, Rodrigo Nogueira

Abstract: Robust 2004 is an information retrieval benchmark whose large number of judgments per query make it a reliable evaluation dataset. In this paper, we present mRobust04, a multilingual version of Robust04 that was translated to 8 languages using Google Translate. We also provide results of three different multilingual retrievers on this dataset. The dataset is available at https://huggingface.co/dat… ▽ More Robust 2004 is an information retrieval benchmark whose large number of judgments per query make it a reliable evaluation dataset. In this paper, we present mRobust04, a multilingual version of Robust04 that was translated to 8 languages using Google Translate. We also provide results of three different multilingual retrievers on this dataset. The dataset is available at https://huggingface.co/datasets/unicamp-dl/mrobust △ Less

Submitted 27 September, 2022; originally announced September 2022.

Comments: 4 pages

arXiv:2209.11035 [pdf, other]

MonoByte: A Pool of Monolingual Byte-level Language Models

Authors: Hugo Abonizio, Leandro Rodrigues de Souza, Roberto Lotufo, Rodrigo Nogueira

Abstract: The zero-shot cross-lingual ability of models pretrained on multilingual and even monolingual corpora has spurred many hypotheses to explain this intriguing empirical result. However, due to the costs of pretraining, most research uses public models whose pretraining methodology, such as the choice of tokenization, corpus size, and computational budget, might differ drastically. When researchers p… ▽ More The zero-shot cross-lingual ability of models pretrained on multilingual and even monolingual corpora has spurred many hypotheses to explain this intriguing empirical result. However, due to the costs of pretraining, most research uses public models whose pretraining methodology, such as the choice of tokenization, corpus size, and computational budget, might differ drastically. When researchers pretrain their own models, they often do so under a constrained budget, and the resulting models might underperform significantly compared to SOTA models. These experimental differences led to various inconsistent conclusions about the nature of the cross-lingual ability of these models. To help further research on the topic, we released 10 monolingual byte-level models rigorously pretrained under the same configuration with a large compute budget (equivalent to 420 days on a V100) and corpora that are 4 times larger than the original BERT's. Because they are tokenizer-free, the problem of unseen token embeddings is eliminated, thus allowing researchers to try a wider range of cross-lingual experiments in languages with different scripts. Additionally, we release two models pretrained on non-natural language texts that can be used in sanity-check experiments. Experiments on QA and NLI tasks show that our monolingual models achieve competitive performance to the multilingual one, and hence can be served to strengthen our understanding of cross-lingual transferability in language models. △ Less

Submitted 27 September, 2022; v1 submitted 22 September, 2022; originally announced September 2022.

arXiv:2208.11445 [pdf, other]

Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models

Authors: Mirelle Bueno, Carlos Gemmell, Jeffrey Dalton, Roberto Lotufo, Rodrigo Nogueira

Abstract: The ability to extrapolate, i.e., to make predictions on sequences that are longer than those presented as training examples, is a challenging problem for current deep learning models. Recent work shows that this limitation persists in state-of-the-art Transformer-based models. Most solutions to this problem use specific architectures or training methods that do not generalize to other tasks. We d… ▽ More The ability to extrapolate, i.e., to make predictions on sequences that are longer than those presented as training examples, is a challenging problem for current deep learning models. Recent work shows that this limitation persists in state-of-the-art Transformer-based models. Most solutions to this problem use specific architectures or training methods that do not generalize to other tasks. We demonstrate that large language models can succeed in extrapolation without modifying their architecture or training procedure. Our experimental results show that generating step-by-step rationales and introducing marker tokens are both required for effective extrapolation. First, we induce a language model to produce step-by-step rationales before outputting the answer to effectively communicate the task to the model. However, as sequences become longer, we find that current models struggle to keep track of token positions. To address this issue, we interleave output tokens with markup tokens that act as explicit positional and counting symbols. Our findings show how these two complementary approaches enable remarkable sequence extrapolation and highlight a limitation of current architectures to effectively generalize without explicit surface form guidance. Code available at https://github.com/MirelleB/induced-rationales-markup-tokens △ Less

Submitted 28 November, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

arXiv:2208.06264 [pdf, ps, other]

A Boring-yet-effective Approach for the Product Ranking Task of the Amazon KDD Cup 2022

Authors: Vitor Jeronymo, Guilherme Rosa, Surya Kallumadi, Roberto Lotufo, Rodrigo Nogueira

Abstract: In this work we describe our submission to the product ranking task of the Amazon KDD Cup 2022. We rely on a receipt that showed to be effective in previous competitions: we focus our efforts towards efficiently training and deploying large language odels, such as mT5, while reducing to a minimum the number of task-specific adaptations. Despite the simplicity of our approach, our best model was le… ▽ More In this work we describe our submission to the product ranking task of the Amazon KDD Cup 2022. We rely on a receipt that showed to be effective in previous competitions: we focus our efforts towards efficiently training and deploying large language odels, such as mT5, while reducing to a minimum the number of task-specific adaptations. Despite the simplicity of our approach, our best model was less than 0.004 nDCG@20 below the top submission. As the top 20 teams achieved an nDCG@20 close to .90, we argue that we need more difficult e-Commerce evaluation datasets to discriminate retrieval methods. △ Less

Submitted 9 August, 2022; originally announced August 2022.

arXiv:2206.02873 [pdf, other]

No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval

Authors: Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

Abstract: Recent work has shown that small distilled language models are strong competitors to models that are orders of magnitude larger and slower in a wide range of information retrieval tasks. This has made distilled and dense models, due to latency constraints, the go-to choice for deployment in real-world retrieval applications. In this work, we question this practice by showing that the number of par… ▽ More Recent work has shown that small distilled language models are strong competitors to models that are orders of magnitude larger and slower in a wide range of information retrieval tasks. This has made distilled and dense models, due to latency constraints, the go-to choice for deployment in real-world retrieval applications. In this work, we question this practice by showing that the number of parameters and early query-document interaction play a significant role in the generalization ability of retrieval models. Our experiments show that increasing model size results in marginal gains on in-domain test sets, but much larger gains in new domains never seen during fine-tuning. Furthermore, we show that rerankers largely outperform dense ones of similar size in several tasks. Our largest reranker reaches the state of the art in 12 of the 18 datasets of the Benchmark-IR (BEIR) and surpasses the previous state of the art by 3 average points. Finally, we confirm that in-domain effectiveness is not a good indicator of zero-shot effectiveness. Code is available at https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git △ Less

Submitted 12 December, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

arXiv:2205.15172 [pdf, ps, other]

Billions of Parameters Are Worth More Than In-domain Training Data: A case study in the Legal Case Entailment Task

Authors: Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Roberto Lotufo, Rodrigo Nogueira

Abstract: Recent work has shown that language models scaled to billions of parameters, such as GPT-3, perform remarkably well in zero-shot and few-shot scenarios. In this work, we experiment with zero-shot models in the legal case entailment task of the COLIEE 2022 competition. Our experiments show that scaling the number of parameters in a language model improves the F1 score of our previous zero-shot resu… ▽ More Recent work has shown that language models scaled to billions of parameters, such as GPT-3, perform remarkably well in zero-shot and few-shot scenarios. In this work, we experiment with zero-shot models in the legal case entailment task of the COLIEE 2022 competition. Our experiments show that scaling the number of parameters in a language model improves the F1 score of our previous zero-shot result by more than 6 points, suggesting that stronger zero-shot capability may be a characteristic of larger models, at least for this task. Our 3B-parameter zero-shot model outperforms all models, including ensembles, in the COLIEE 2021 test set and also achieves the best performance of a single model in the COLIEE 2022 competition, second only to the ensemble composed of the 3B model itself and a smaller version of the same model. Despite the challenges posed by large language models, mainly due to latency constraints in real-time applications, we provide a demonstration of our zero-shot monoT5-3b model being used in production as a search engine, including for legal documents. The code for our submission and the demo of our system are available at https://github.com/neuralmind-ai/coliee and https://neuralsearchx.neuralmind.ai, respectively. △ Less

Submitted 30 May, 2022; originally announced May 2022.

arXiv:2202.03120 [pdf, other]

doi 10.1145/3462757.3466103

To Tune or Not To Tune? Zero-shot Models for Legal Case Entailment

Authors: Guilherme Moraes Rosa, Ruan Chaves Rodrigues, Roberto de Alencar Lotufo, Rodrigo Nogueira

Abstract: There has been mounting evidence that pretrained language models fine-tuned on large and diverse supervised datasets can transfer well to a variety of out-of-domain tasks. In this work, we investigate this transfer ability to the legal domain. For that, we participated in the legal case entailment task of COLIEE 2021, in which we use such models with no adaptations to the target domain. Our submis… ▽ More There has been mounting evidence that pretrained language models fine-tuned on large and diverse supervised datasets can transfer well to a variety of out-of-domain tasks. In this work, we investigate this transfer ability to the legal domain. For that, we participated in the legal case entailment task of COLIEE 2021, in which we use such models with no adaptations to the target domain. Our submissions achieved the highest scores, surpassing the second-best team by more than six percentage points. Our experiments confirm a counter-intuitive result in the new paradigm of pretrained language models: given limited labeled data, models with little or no adaptation to the target task can be more robust to changes in the data distribution than models fine-tuned on it. Code is available at https://github.com/neuralmind-ai/coliee. △ Less

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2201.05658 [pdf, other]

Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

Authors: Ramon Pires, Fábio C. de Souza, Guilherme Rosa, Roberto A. Lotufo, Rodrigo Nogueira

Abstract: A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change, with classes being added and removed, which leads to nontrivial modifications to the source code and the possible introduction of bugs. In this work, we evaluate sequence-to-sequence models a… ▽ More A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change, with classes being added and removed, which leads to nontrivial modifications to the source code and the possible introduction of bugs. In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction of legal and registration documents. We finetune models that jointly extract the information and generate the output already in a structured format. Post-processing steps are learned during training, thus eliminating the need for rule-based methods and simplifying the pipeline. Furthermore, we propose a novel method to align the output with the input text, thus facilitating system inspection and auditing. Our experiments on four real-world datasets show that the proposed method is an alternative to classical pipelines. △ Less

Submitted 14 January, 2022; originally announced January 2022.

arXiv:2109.01942 [pdf, other]

On the ability of monolingual models to learn language-agnostic representations

Authors: Leandro Rodrigues de Souza, Rodrigo Nogueira, Roberto Lotufo

Abstract: Pretrained multilingual models have become a de facto default approach for zero-shot cross-lingual transfer. Previous work has shown that these models are able to achieve cross-lingual representations when pretrained on two or more languages with shared parameters. In this work, we provide evidence that a model can achieve language-agnostic representations even when pretrained on a single language… ▽ More Pretrained multilingual models have become a de facto default approach for zero-shot cross-lingual transfer. Previous work has shown that these models are able to achieve cross-lingual representations when pretrained on two or more languages with shared parameters. In this work, we provide evidence that a model can achieve language-agnostic representations even when pretrained on a single language. That is, we find that monolingual models pretrained and finetuned on different languages achieve competitive performance compared to the ones that use the same target language. Surprisingly, the models show a similar performance on a same task regardless of the pretraining language. For example, models pretrained on distant languages such as German and Portuguese perform similarly on English tasks. △ Less

Submitted 25 October, 2021; v1 submitted 4 September, 2021; originally announced September 2021.

arXiv:2108.13897 [pdf, other]

mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset

Authors: Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

Abstract: The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 13 languages that was created using machine translat… ▽ More The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 13 languages that was created using machine translation. We evaluated mMARCO by finetuning monolingual and multilingual reranking models, as well as a multilingual dense retrieval model on this dataset. We also evaluated models finetuned using the mMARCO dataset in a zero-shot scenario on Mr. TyDi dataset, demonstrating that multilingual models finetuned on our translated dataset achieve superior effectiveness to models finetuned on the original English version alone. Our experiments also show that a distilled multilingual reranker is competitive with non-distilled models while having 5.4 times fewer parameters. Lastly, we show a positive correlation between translation quality and retrieval effectiveness, providing evidence that improvements in translation methods might lead to improvements in multilingual information retrieval. The translated datasets and finetuned models are available at https://github.com/unicamp-dl/mMARCO. △ Less

Submitted 17 August, 2022; v1 submitted 31 August, 2021; originally announced August 2021.

arXiv:2105.06813 [pdf, other]

A cost-benefit analysis of cross-lingual transfer methods

Authors: Guilherme Moraes Rosa, Luiz Henrique Bonifacio, Leandro Rodrigues de Souza, Roberto Lotufo, Rodrigo Nogueira

Abstract: An effective method for cross-lingual transfer is to fine-tune a bilingual or multilingual model on a supervised dataset in one language and evaluating it on another language in a zero-shot manner. Translating examples at training time or inference time are also viable alternatives. However, there are costs associated with these methods that are rarely addressed in the literature. In this work, we… ▽ More An effective method for cross-lingual transfer is to fine-tune a bilingual or multilingual model on a supervised dataset in one language and evaluating it on another language in a zero-shot manner. Translating examples at training time or inference time are also viable alternatives. However, there are costs associated with these methods that are rarely addressed in the literature. In this work, we analyze cross-lingual methods in terms of their effectiveness (e.g., accuracy), development and deployment costs, as well as their latencies at inference time. Our experiments on three tasks indicate that the best cross-lingual method is highly task-dependent. Finally, by combining zero-shot and translation methods, we achieve the state-of-the-art in two of the three datasets used in this work. Based on these results, we question the need for manually labeled training data in a target language. Code and translated datasets are available at https://github.com/unicamp-dl/cross-lingual-analysis △ Less

Submitted 14 December, 2021; v1 submitted 14 May, 2021; originally announced May 2021.

arXiv:2105.05686 [pdf, ps, other]

Yes, BM25 is a Strong Baseline for Legal Case Retrieval

Authors: Guilherme Moraes Rosa, Ruan Chaves Rodrigues, Roberto Lotufo, Rodrigo Nogueira

Abstract: We describe our single submission to task 1 of COLIEE 2021. Our vanilla BM25 got second place, well above the median of submissions. Code is available at https://github.com/neuralmind-ai/coliee. We describe our single submission to task 1 of COLIEE 2021. Our vanilla BM25 got second place, well above the median of submissions. Code is available at https://github.com/neuralmind-ai/coliee. △ Less

Submitted 25 October, 2021; v1 submitted 26 April, 2021; originally announced May 2021.

arXiv:2009.09290 [pdf, other]

Can questions summarize a corpus? Using question generation for characterizing COVID-19 research

Authors: Gabriela Surita, Rodrigo Nogueira, Roberto Lotufo

Abstract: What are the latent questions on some textual data? In this work, we investigate using question generation models for exploring a collection of documents. Our method, dubbed corpus2question, consists of applying a pre-trained question generation model over a corpus and aggregating the resulting questions by frequency and time. This technique is an alternative to methods such as topic modelling and… ▽ More What are the latent questions on some textual data? In this work, we investigate using question generation models for exploring a collection of documents. Our method, dubbed corpus2question, consists of applying a pre-trained question generation model over a corpus and aggregating the resulting questions by frequency and time. This technique is an alternative to methods such as topic modelling and word cloud for summarizing large amounts of textual data. Results show that applying corpus2question on a corpus of scientific articles related to COVID-19 yields relevant questions about the topic. The most frequent questions are "what is covid 19" and "what is the treatment for covid". Among the 1000 most frequent questions are "what is the threshold for herd immunity" and "what is the role of ace2 in viral entry". We show that the proposed method generated similar questions for 13 of the 27 expert-made questions from the CovidQA question answering dataset. The code to reproduce our experiments and the generated questions are available at: https://github.com/unicamp-dl/corpus2question △ Less

Submitted 19 September, 2020; originally announced September 2020.

Comments: 11 pages, 5 figures

ACM Class: H.3.3

arXiv:2008.09144 [pdf, other]

PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data

Authors: Diedre Carmo, Marcos Piau, Israel Campiotti, Rodrigo Nogueira, Roberto Lotufo

Abstract: In natural language processing (NLP), there is a need for more resources in Portuguese, since much of the data used in the state-of-the-art research is in other languages. In this paper, we pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese, and evaluate its performance against other Portuguese pretrained models and multilingual models on three different ta… ▽ More In natural language processing (NLP), there is a need for more resources in Portuguese, since much of the data used in the state-of-the-art research is in other languages. In this paper, we pretrain a T5 model on the BrWac corpus, an extensive collection of web pages in Portuguese, and evaluate its performance against other Portuguese pretrained models and multilingual models on three different tasks. We show that our Portuguese pretrained models have significantly better performance over the original T5 models. Moreover, we demonstrate the positive impact of using a Portuguese vocabulary. Our code and models are available at https://github.com/unicamp-dl/PTT5. △ Less

Submitted 8 October, 2020; v1 submitted 20 August, 2020; originally announced August 2020.

arXiv:2008.08769 [pdf, other]

Lite Training Strategies for Portuguese-English and English-Portuguese Translation

Authors: Alexandre Lopes, Rodrigo Nogueira, Roberto Lotufo, Helio Pedrini

Abstract: Despite the widespread adoption of deep learning for machine translation, it is still expensive to develop high-quality translation models. In this work, we investigate the use of pre-trained models, such as T5 for Portuguese-English and English-Portuguese translation tasks using low-cost hardware. We explore the use of Portuguese and English pre-trained language models and propose an adaptation o… ▽ More Despite the widespread adoption of deep learning for machine translation, it is still expensive to develop high-quality translation models. In this work, we investigate the use of pre-trained models, such as T5 for Portuguese-English and English-Portuguese translation tasks using low-cost hardware. We explore the use of Portuguese and English pre-trained language models and propose an adaptation of the English tokenizer to represent Portuguese characters, such as diaeresis, acute and grave accents. We compare our models to the Google Translate API and MarianMT on a subset of the ParaCrawl dataset, as well as to the winning submission to the WMT19 Biomedical Translation Shared Task. We also describe our submission to the WMT20 Biomedical Translation Shared Task. Our results show that our models have a competitive performance to state-of-the-art models while being trained on modest hardware (a single 8GB gaming GPU for nine days). Our data, models and code are available at https://github.com/unicamp-dl/Lite-T5-Translation. △ Less

Submitted 20 August, 2020; originally announced August 2020.

Comments: for code and weights, visit https://github.com/unicamp-dl/Lite-T5-Translation

arXiv:2002.06219 [pdf, other]

Electricity Theft Detection with self-attention

Authors: Paulo Finardi, Israel Campiotti, Gustavo Plensack, Rafael Derradi de Souza, Rodrigo Nogueira, Gustavo Pinheiro, Roberto Lotufo

Abstract: In this work we propose a novel self-attention mechanism model to address electricity theft detection on an imbalanced realistic dataset that presents a daily electricity consumption provided by State Grid Corporation of China. Our key contribution is the introduction of a multi-head self-attention mechanism concatenated with dilated convolutions and unified by a convolution of kernel size $1$. Mo… ▽ More In this work we propose a novel self-attention mechanism model to address electricity theft detection on an imbalanced realistic dataset that presents a daily electricity consumption provided by State Grid Corporation of China. Our key contribution is the introduction of a multi-head self-attention mechanism concatenated with dilated convolutions and unified by a convolution of kernel size $1$. Moreover, we introduce a binary input channel (Binary Mask) to identify the position of the missing values, allowing the network to learn how to deal with these values. Our model achieves an AUC of $0.926$ which is an improvement in more than $17\%$ with respect to previous baseline work. The code is available on GitHub at https://github.com/neuralmind-ai/electricity-theft-detection-with-self-attention. △ Less

Submitted 14 February, 2020; originally announced February 2020.

arXiv:2001.05058 [pdf, other]

doi 10.1016/j.heliyon.2021.e06226

Hippocampus Segmentation on Epilepsy and Alzheimer's Disease Studies with Multiple Convolutional Neural Networks

Authors: Diedre Carmo, Bruna Silva, Clarissa Yasuda, Letícia Rittner, Roberto Lotufo

Abstract: Hippocampus segmentation on magnetic resonance imaging is of key importance for the diagnosis, treatment decision and investigation of neuropsychiatric disorders. Automatic segmentation is an active research field, with many recent models using deep learning. Most current state-of-the art hippocampus segmentation methods train their methods on healthy or Alzheimer's disease patients from public da… ▽ More Hippocampus segmentation on magnetic resonance imaging is of key importance for the diagnosis, treatment decision and investigation of neuropsychiatric disorders. Automatic segmentation is an active research field, with many recent models using deep learning. Most current state-of-the art hippocampus segmentation methods train their methods on healthy or Alzheimer's disease patients from public datasets. This raises the question whether these methods are capable of recognizing the hippocampus on a different domain, that of epilepsy patients with hippocampus resection. In this paper we present a state-of-the-art, open source, ready-to-use, deep learning based hippocampus segmentation method. It uses an extended 2D multi-orientation approach, with automatic pre-processing and orientation alignment. The methodology was developed and validated using HarP, a public Alzheimer's disease hippocampus segmentation dataset. We test this methodology alongside other recent deep learning methods, in two domains: The HarP test set and an in-house epilepsy dataset, containing hippocampus resections, named HCUnicamp. We show that our method, while trained only in HarP, surpasses others from the literature in both the HarP test set and HCUnicamp in Dice. Additionally, Results from training and testing in HCUnicamp volumes are also reported separately, alongside comparisons between training and testing in epilepsy and Alzheimer's data and vice versa. Although current state-of-the-art methods, including our own, achieve upwards of 0.9 Dice in HarP, all tested methods, including our own, produced false positives in HCUnicamp resection regions, showing that there is still room for improvement for hippocampus segmentation methods when resection is involved. △ Less

Submitted 10 February, 2021; v1 submitted 14 January, 2020; originally announced January 2020.

Comments: Code is available at https://github.com/dscarmo/e2dhipseg Published in Heliyon: https://www.sciencedirect.com/science/article/pii/S2405844021003315

Journal ref: Heliyon, Volume 7, Issue 2, 2021

arXiv:1909.10649 [pdf, other]

Portuguese Named Entity Recognition using BERT-CRF

Authors: Fábio Souza, Rodrigo Nogueira, Roberto Lotufo

Abstract: Recent advances in language representation using neural networks have made it viable to transfer the learned internal states of a trained model to downstream natural language processing tasks, such as named entity recognition (NER) and question answering. It has been shown that the leverage of pre-trained language models improves the overall performance on many tasks and is highly beneficial when… ▽ More Recent advances in language representation using neural networks have made it viable to transfer the learned internal states of a trained model to downstream natural language processing tasks, such as named entity recognition (NER) and question answering. It has been shown that the leverage of pre-trained language models improves the overall performance on many tasks and is highly beneficial when labeled data is scarce. In this work, we train Portuguese BERT models and employ a BERT-CRF architecture to the NER task on the Portuguese language, combining the transfer capabilities of BERT with the structured predictions of CRF. We explore feature-based and fine-tuning training strategies for the BERT model. Our fine-tuning approach obtains new state-of-the-art results on the HAREM I dataset, improving the F1-score by 1 point on the selective scenario (5 NE classes) and by 4 points on the total scenario (10 NE classes). △ Less

Submitted 27 February, 2020; v1 submitted 23 September, 2019; originally announced September 2019.

arXiv:1902.04487 [pdf, other]

Extended 2D Consensus Hippocampus Segmentation

Authors: Diedre Carmo, Bruna Silva, Clarissa Yasuda, Letícia Rittner, Roberto Lotufo

Abstract: Hippocampus segmentation plays a key role in diagnosing various brain disorders such as Alzheimer's disease, epilepsy, multiple sclerosis, cancer, depression and others. Nowadays, segmentation is still mainly performed manually by specialists. Segmentation done by experts is considered to be a gold-standard when evaluating automated methods, buts it is a time consuming and arduos task, requiring s… ▽ More Hippocampus segmentation plays a key role in diagnosing various brain disorders such as Alzheimer's disease, epilepsy, multiple sclerosis, cancer, depression and others. Nowadays, segmentation is still mainly performed manually by specialists. Segmentation done by experts is considered to be a gold-standard when evaluating automated methods, buts it is a time consuming and arduos task, requiring specialized personnel. In recent years, efforts have been made to achieve reliable automated segmentation. For years the best performing authomatic methods were multi atlas based with around 80-85% Dice coefficient and very time consuming, but machine learning methods are recently rising with promising time and accuracy performance. A method for volumetric hippocampus segmentation is presented, based on the consensus of tri-planar U-Net inspired fully convolutional networks (FCNNs), with some modifications, including residual connections, VGG weight transfers, batch normalization and a patch extraction technique employing data from neighbor patches. A study on the impact of our modifications to the classical U-Net architecture was performed. Our method achieves cutting edge performance in our dataset, with around 96% volumetric Dice accuracy in our test data. In a public validation dataset, HARP, we achieve 87.48% DICE. GPU execution time is in the order of seconds per volume, and source code is publicly available. Also, masks are shown to be similar to other recent state-of-the-art hippocampus segmentation methods in a third dataset, without manual annotations. △ Less

Submitted 13 May, 2020; v1 submitted 12 February, 2019; originally announced February 2019.

Comments: This was published as an extended abstract in MIDL 2019 [arXiv:1907.08612]. An alpha version of the code is available at https://github.com/dscarmo/e2dhipseg. More experiments on improvements to the method and code are ongoing. Future updates are to be expected. A new, more complete paper is published in arXiv:2001.05058

Report number: MIDL/2019/ExtendedAbstract/Sygx97DaKV

arXiv:1804.04988 [pdf, other]

Convolutional Neural Networks for Skull-strip** in Brain MR Imaging using Consensus-based Silver standard Masks

Authors: Oeslle Lucena, Roberto Souza, Leticia Rittner, Richard Frayne, Roberto Lotufo

Abstract: Convolutional neural networks (CNN) for medical imaging are constrained by the number of annotated data required in the training stage. Usually, manual annotation is considered to be the "gold standard". However, medical imaging datasets that include expert manual segmentation are scarce as this step is time-consuming, and therefore expensive. Moreover, single-rater manual annotation is most often… ▽ More Convolutional neural networks (CNN) for medical imaging are constrained by the number of annotated data required in the training stage. Usually, manual annotation is considered to be the "gold standard". However, medical imaging datasets that include expert manual segmentation are scarce as this step is time-consuming, and therefore expensive. Moreover, single-rater manual annotation is most often used in data-driven approaches making the network optimal with respect to only that single expert. In this work, we propose a CNN for brain extraction in magnetic resonance (MR) imaging, that is fully trained with what we refer to as silver standard masks. Our method consists of 1) develo** a dataset with "silver standard" masks as input, and implementing both 2) a tri-planar method using parallel 2D U-Net-based CNNs (referred to as CONSNet) and 3) an auto-context implementation of CONSNet. The term CONSNet refers to our integrated approach, i.e., training with silver standard masks and using a 2D U-Net-based architecture. Our results showed that we outperformed (i.e., larger Dice coefficients) the current state-of-the-art SS methods. Our use of silver standard masks reduced the cost of manual annotation, decreased inter-intra-rater variability, and avoided CNN segmentation super-specialization towards one specific manual annotation guideline that can occur when gold standard masks are used. Moreover, the usage of silver standard masks greatly enlarges the volume of input annotated data because we can relatively easily generate labels for unlabeled data. In addition, our method has the advantage that, once trained, it takes only a few seconds to process a typical brain image volume using modern hardware, such as a high-end graphics processing unit. In contrast, many of the other competitive methods have processing times in the order of minutes. △ Less

Submitted 13 April, 2018; originally announced April 2018.

arXiv:1508.00537 [pdf]

doi 10.1109/BIOMS.2014.6951531

Evaluating software-based fingerprint liveness detection using Convolutional Networks and Local Binary Patterns

Authors: Rodrigo Frassetto Nogueira, Roberto de Alencar Lotufo, Rubens Campos Machado

Abstract: With the growing use of biometric authentication systems in the past years, spoof fingerprint detection has become increasingly important. In this work, we implement and evaluate two different feature extraction techniques for software-based fingerprint liveness detection: Convolutional Networks with random weights and Local Binary Patterns. Both techniques were used in conjunction with a Support… ▽ More With the growing use of biometric authentication systems in the past years, spoof fingerprint detection has become increasingly important. In this work, we implement and evaluate two different feature extraction techniques for software-based fingerprint liveness detection: Convolutional Networks with random weights and Local Binary Patterns. Both techniques were used in conjunction with a Support Vector Machine (SVM) classifier. Dataset Augmentation was used to increase classifier's performance and a variety of preprocessing operations were tested, such as frequency filtering, contrast equalization, and region of interest filtering. The experiments were made on the datasets used in The Liveness Detection Competition of years 2009, 2011 and 2013, which comprise almost 50,000 real and fake fingerprints' images. Our best method achieves an overall rate of 95.2% of correctly classified samples - an improvement of 35% in test error when compared with the best previously published results. △ Less

Submitted 3 August, 2015; originally announced August 2015.

Comments: arXiv admin note: text overlap with arXiv:1301.3557 by other authors

Journal ref: Biometric Measurements and Systems for Security and Medical Applications (BIOMS) Proceedings, 2014 IEEE Workshop on

Showing 1–41 of 41 results for author: Lotufo, R