Skip to main content

Showing 1–11 of 11 results for author: Abonizio, H

.
  1. arXiv:2403.09887  [pdf, other

    cs.CL cs.AI

    Sabiá-2: A New Generation of Portuguese Large Language Models

    Authors: Thales Sales Almeida, Hugo Abonizio, Rodrigo Nogueira, Ramon Pires

    Abstract: We introduce Sabiá-2, a family of large language models trained on Portuguese texts. The models are evaluated on a diverse range of exams, including entry-level tests for Brazilian universities, professional certification exams, and graduate-level exams for various disciplines such as accounting, economics, engineering, law and medicine. Our results reveal that our best model so far, Sabiá-2 Mediu… ▽ More

    Submitted 26 March, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  2. arXiv:2311.14169  [pdf, other

    cs.CL cs.AI cs.LG

    Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams

    Authors: Ramon Pires, Thales Sales Almeida, Hugo Abonizio, Rodrigo Nogueira

    Abstract: Recent advancements in language models have showcased human-comparable performance in academic entrance exams. However, existing studies often overlook questions that require the integration of visual comprehension, thus compromising the full spectrum and complexity inherent in real-world scenarios. To address this gap, we present a comprehensive framework to evaluate language models on entrance e… ▽ More

    Submitted 23 November, 2023; originally announced November 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2303.17003

  3. arXiv:2307.04601  [pdf, ps, other

    cs.IR

    InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval

    Authors: Hugo Abonizio, Luiz Bonifacio, Vitor Jeronymo, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

    Abstract: Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models have enabled the creation of synthetic in-domain data by providing instructions and a few examples on a prompt. InPars and Promptagator have pioneered this approach and both methods have demonstrated the potential of using LL… ▽ More

    Submitted 10 July, 2023; originally announced July 2023.

  4. Sabiá: Portuguese Large Language Models

    Authors: Ramon Pires, Hugo Abonizio, Thales Sales Almeida, Rodrigo Nogueira

    Abstract: As the capabilities of language models continue to advance, it is conceivable that "one-size-fits-all" model will remain as the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demo… ▽ More

    Submitted 9 November, 2023; v1 submitted 16 April, 2023; originally announced April 2023.

  5. arXiv:2301.01820  [pdf, ps, other

    cs.IR cs.AI

    InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

    Authors: Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

    Abstract: Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such dataset… ▽ More

    Submitted 26 May, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

  6. arXiv:2212.06121  [pdf, other

    cs.IR cs.CL

    In Defense of Cross-Encoders for Zero-Shot Retrieval

    Authors: Guilherme Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

    Abstract: Bi-encoders and cross-encoders are widely used in many state-of-the-art retrieval pipelines. In this work we study the generalization ability of these two types of architectures on a wide range of parameter count on both in-domain and out-of-domain scenarios. We find that the number of parameters and early query-document interactions of cross-encoders play a significant role in the generalization… ▽ More

    Submitted 12 December, 2022; originally announced December 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2206.02873

  7. arXiv:2209.11035  [pdf, other

    cs.CL

    MonoByte: A Pool of Monolingual Byte-level Language Models

    Authors: Hugo Abonizio, Leandro Rodrigues de Souza, Roberto Lotufo, Rodrigo Nogueira

    Abstract: The zero-shot cross-lingual ability of models pretrained on multilingual and even monolingual corpora has spurred many hypotheses to explain this intriguing empirical result. However, due to the costs of pretraining, most research uses public models whose pretraining methodology, such as the choice of tokenization, corpus size, and computational budget, might differ drastically. When researchers p… ▽ More

    Submitted 27 September, 2022; v1 submitted 22 September, 2022; originally announced September 2022.

  8. arXiv:2206.02873  [pdf, other

    cs.IR cs.CL cs.PF

    No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval

    Authors: Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

    Abstract: Recent work has shown that small distilled language models are strong competitors to models that are orders of magnitude larger and slower in a wide range of information retrieval tasks. This has made distilled and dense models, due to latency constraints, the go-to choice for deployment in real-world retrieval applications. In this work, we question this practice by showing that the number of par… ▽ More

    Submitted 12 December, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

  9. arXiv:2205.15172  [pdf, ps, other

    cs.CL

    Billions of Parameters Are Worth More Than In-domain Training Data: A case study in the Legal Case Entailment Task

    Authors: Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Roberto Lotufo, Rodrigo Nogueira

    Abstract: Recent work has shown that language models scaled to billions of parameters, such as GPT-3, perform remarkably well in zero-shot and few-shot scenarios. In this work, we experiment with zero-shot models in the legal case entailment task of the COLIEE 2022 competition. Our experiments show that scaling the number of parameters in a language model improves the F1 score of our previous zero-shot resu… ▽ More

    Submitted 30 May, 2022; originally announced May 2022.

  10. arXiv:2202.05144  [pdf, other

    cs.CL

    InPars: Data Augmentation for Information Retrieval using Large Language Models

    Authors: Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Rodrigo Nogueira

    Abstract: The information retrieval community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity has enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit from one single dataset equally. Extensive research in various NLP tasks has show… ▽ More

    Submitted 10 February, 2022; originally announced February 2022.

  11. arXiv:2108.13897  [pdf, other

    cs.CL cs.AI

    mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset

    Authors: Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

    Abstract: The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 13 languages that was created using machine translat… ▽ More

    Submitted 17 August, 2022; v1 submitted 31 August, 2021; originally announced August 2021.