Showing 1–10 of 10 results for author: Jeronymo, V

Searching in archive cs.
  1. arXiv:2307.04601  [pdf, ps, other]

    cs.IR

    InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval

    Authors: Hugo Abonizio, Luiz Bonifacio, Vitor Jeronymo, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

    Abstract: Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models have enabled the creation of synthetic in-domain data by providing instructions and a few examples on a prompt. InPars and Promptagator have pioneered this approach and both methods have demonstrated the potential of using LL…

    Submitted 10 July, 2023; originally announced July 2023.

  2. arXiv:2304.01019  [pdf, other]

    cs.IR cs.CL

    Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

    Authors: Jimmy Lin, David Alfonso-Hermelo, Vitor Jeronymo, Ehsan Kamalloo, Carlos Lassance, Rodrigo Nogueira, Odunayo Ogundepo, Mehdi Rezagholizadeh, Nandan Thakur, Jheng-Hong Yang, Xinyu Zhang

    Abstract: The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), which is the task of searching documents in one language with queries from another. However, the rapid pace of progress has led to a confusing panoply of methods and reproducibility has lagged behind the state of the art. In this context, our work makes two important con…

    Submitted 3 April, 2023; originally announced April 2023.

  3. arXiv:2303.16145  [pdf, other]

    cs.IR

    NeuralMind-UNICAMP at 2022 TREC NeuCLIR: Large Boring Rerankers for Cross-lingual Retrieval

    Authors: Vitor Jeronymo, Roberto Lotufo, Rodrigo Nogueira

    Abstract: This paper reports on a study of cross-lingual information retrieval (CLIR) using the mT5-XXL reranker on the NeuCLIR track of TREC 2022. Perhaps the biggest contribution of this study is the finding that, despite the mT5 model being fine-tuned only on query-document pairs of the same language, it proved to be viable for CLIR tasks, where query-document pairs are in different languages, even in the…

    Submitted 28 March, 2023; originally announced March 2023.

  4. arXiv:2301.01820  [pdf, ps, other]

    cs.IR cs.AI

    InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

    Authors: Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

    Abstract: Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such dataset…

    Submitted 26 May, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

  5. arXiv:2212.06121  [pdf, other]

    cs.IR cs.CL

    In Defense of Cross-Encoders for Zero-Shot Retrieval

    Authors: Guilherme Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

    Abstract: Bi-encoders and cross-encoders are widely used in many state-of-the-art retrieval pipelines. In this work we study the generalization ability of these two types of architectures across a wide range of parameter counts in both in-domain and out-of-domain scenarios. We find that the number of parameters and early query-document interactions of cross-encoders play a significant role in the generalization…

    Submitted 12 December, 2022; originally announced December 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2206.02873

  6. arXiv:2209.13738  [pdf, ps, other]

    cs.CL cs.AI

    mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark

    Authors: Vitor Jeronymo, Mauricio Nascimento, Roberto Lotufo, Rodrigo Nogueira

    Abstract: Robust 2004 is an information retrieval benchmark whose large number of judgments per query makes it a reliable evaluation dataset. In this paper, we present mRobust04, a multilingual version of Robust04 that was translated to 8 languages using Google Translate. We also provide results of three different multilingual retrievers on this dataset. The dataset is available at https://huggingface.co/dat…

    Submitted 27 September, 2022; originally announced September 2022.

    Comments: 4 pages

  7. arXiv:2208.06264  [pdf, ps, other]

    cs.IR

    A Boring-yet-effective Approach for the Product Ranking Task of the Amazon KDD Cup 2022

    Authors: Vitor Jeronymo, Guilherme Rosa, Surya Kallumadi, Roberto Lotufo, Rodrigo Nogueira

    Abstract: In this work we describe our submission to the product ranking task of the Amazon KDD Cup 2022. We rely on a recipe that proved effective in previous competitions: we focus our efforts on efficiently training and deploying large language models, such as mT5, while reducing to a minimum the number of task-specific adaptations. Despite the simplicity of our approach, our best model was le…

    Submitted 9 August, 2022; originally announced August 2022.

  8. arXiv:2206.02873  [pdf, other]

    cs.IR cs.CL cs.PF

    No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval

    Authors: Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

    Abstract: Recent work has shown that small distilled language models are strong competitors to models that are orders of magnitude larger and slower in a wide range of information retrieval tasks. This has made distilled and dense models, due to latency constraints, the go-to choice for deployment in real-world retrieval applications. In this work, we question this practice by showing that the number of par…

    Submitted 12 December, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

  9. arXiv:2205.15172  [pdf, ps, other]

    cs.CL

    Billions of Parameters Are Worth More Than In-domain Training Data: A case study in the Legal Case Entailment Task

    Authors: Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Roberto Lotufo, Rodrigo Nogueira

    Abstract: Recent work has shown that language models scaled to billions of parameters, such as GPT-3, perform remarkably well in zero-shot and few-shot scenarios. In this work, we experiment with zero-shot models in the legal case entailment task of the COLIEE 2022 competition. Our experiments show that scaling the number of parameters in a language model improves the F1 score of our previous zero-shot resu…

    Submitted 30 May, 2022; originally announced May 2022.

  10. arXiv:2108.13897  [pdf, other]

    cs.CL cs.AI

    mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset

    Authors: Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

    Abstract: The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 13 languages that was created using machine translat…

    Submitted 17 August, 2022; v1 submitted 31 August, 2021; originally announced August 2021.