Showing 1–17 of 17 results for author: Dejean, H

  1. arXiv:2407.09252  [pdf, other]

    cs.CL cs.IR

    Context Embeddings for Efficient Answer Generation in RAG

    Authors: David Rau, Shuai Wang, Hervé Déjean, Stéphane Clinchant

    Abstract: Retrieval-Augmented Generation (RAG) allows overcoming the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer which slows down decoding time directly translating to the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method, reducin…

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: 10 pages

  2. arXiv:2407.01463  [pdf, other]

    cs.CL cs.AI

    Retrieval-augmented generation in multilingual settings

    Authors: Nadezhda Chirkova, David Rau, Hervé Déjean, Thibault Formal, Stéphane Clinchant, Vassilina Nikoulina

    Abstract: Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e. with user queries and the datastore in 13 languages, and investigate w…

    Submitted 1 July, 2024; originally announced July 2024.

  3. arXiv:2407.01102  [pdf, other]

    cs.CL cs.IR

    BERGEN: A Benchmarking Library for Retrieval-Augmented Generation

    Authors: David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang, Vassilina Nikoulina, Stéphane Clinchant

    Abstract: Retrieval-Augmented Generation allows to enhance Large Language Models with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, which involve an intricate number of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approache…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 29 pages

  4. arXiv:2404.13950  [pdf, other]

    cs.IR

    SPLATE: Sparse Late Interaction Retrieval

    Authors: Thibault Formal, Stéphane Clinchant, Hervé Déjean, Carlos Lassance

    Abstract: The late interaction paradigm introduced with ColBERT stands out in the neural Information Retrieval space, offering a compelling effectiveness-efficiency trade-off across many benchmarks. Efficient late interaction retrieval is based on an optimized multi-step strategy, where an approximate search first identifies a set of candidate documents to re-rank exactly. In this work, we introduce SPLATE,…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: To appear at SIGIR'24 (short paper track)

  5. arXiv:2404.13357  [pdf, other]

    cs.IR

    Two-Step SPLADE: Simple, Efficient and Effective Approximation of SPLADE

    Authors: Carlos Lassance, Hervé Dejean, Stéphane Clinchant, Nicola Tonellotto

    Abstract: Learned sparse models such as SPLADE have successfully shown how to incorporate the benefits of state-of-the-art neural information retrieval models into the classical inverted index data structure. Despite their improvements in effectiveness, learned sparse models are not as efficient as classical sparse model such as BM25. The problem has been investigated and addressed by recently developed str…

    Submitted 20 April, 2024; originally announced April 2024.

    Comments: published in Findings at ECIR'24

  6. arXiv:2403.10407  [pdf, ps, other]

    cs.IR

    A Thorough Comparison of Cross-Encoders and LLMs for Reranking SPLADE

    Authors: Hervé Déjean, Stéphane Clinchant, Thibault Formal

    Abstract: We present a comparative study between cross-encoder and LLMs rerankers in the context of re-ranking effective SPLADE retrievers. We conduct a large evaluation on TREC Deep Learning datasets and out-of-domain datasets such as BEIR and LoTTE. In the first set of experiments, we show how cross-encoder rerankers are hard to distinguish when it comes to re-rerank SPLADE on MS MARCO. Observations shift…

    Submitted 15 March, 2024; originally announced March 2024.

  7. arXiv:2403.06789  [pdf, other]

    cs.IR cs.CL

    SPLADE-v3: New baselines for SPLADE

    Authors: Carlos Lassance, Hervé Déjean, Thibault Formal, Stéphane Clinchant

    Abstract: A companion to the release of the latest version of the SPLADE library. We describe changes to the training structure and present our latest series of models -- SPLADE-v3. We compare this new version to BM25, SPLADE++, as well as re-rankers, and showcase its effectiveness via a meta-analysis over more than 40 query sets. SPLADE-v3 further pushes the limit of SPLADE models: it is statistically sign…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: Technical report

  8. arXiv:2306.02867  [pdf, other]

    cs.IR

    Benchmarking Middle-Trained Language Models for Neural Search

    Authors: Hervé Déjean, Stéphane Clinchant, Carlos Lassance, Simon Lupart, Thibault Formal

    Abstract: Middle training methods aim to bridge the gap between the Masked Language Model (MLM) pre-training and the final finetuning for retrieval. Recent models such as CoCondenser, RetroMAE, and LexMAE argue that the MLM task is not sufficient enough to pre-train a transformer network for retrieval and hence propose various tasks to do so. Intrigued by those novel methods, we noticed that all these model…

    Submitted 5 June, 2023; originally announced June 2023.

  9. arXiv:2304.12702  [pdf, other]

    cs.IR

    A Static Pruning Study on Sparse Neural Retrievers

    Authors: Carlos Lassance, Simon Lupart, Hervé Dejean, Stéphane Clinchant, Nicola Tonellotto

    Abstract: Sparse neural retrievers, such as DeepImpact, uniCOIL and SPLADE, have been introduced recently as an efficient and effective way to perform retrieval with inverted indexes. They aim to learn term importance and, in some cases, document expansions, to provide a more effective document ranking compared to traditional bag-of-words retrieval models such as BM25. However, these sparse neural retriever…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: Short accepted at SIGIR 2023. Has extra content in the Appendix that is not present in the ACM version

  10. arXiv:2303.13220  [pdf, other]

    cs.IR cs.CL

    Parameter-Efficient Sparse Retrievers and Rerankers using Adapters

    Authors: Vaishali Pal, Carlos Lassance, Hervé Déjean, Stéphane Clinchant

    Abstract: Parameter-Efficient transfer learning with Adapters have been studied in Natural Language Processing (NLP) as an alternative to full fine-tuning. Adapters are memory-efficient and scale well with downstream tasks by training small bottle-neck layers added between transformer layers while keeping the large pretrained language model (PLMs) frozen. In spite of showing promising results in NLP, these…

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: accepted at ECIR'23

  11. arXiv:2301.10444  [pdf, other]

    cs.IR cs.CL

    An Experimental Study on Pretraining Transformers from Scratch for IR

    Authors: Carlos Lassance, Hervé Déjean, Stéphane Clinchant

    Abstract: Finetuning Pretrained Language Models (PLM) for IR has been de facto the standard practice since their breakthrough effectiveness few years ago. But, is this approach well understood? In this paper, we study the impact of the pretraining collection on the final IR effectiveness. In particular, we challenge the current hypothesis that PLM shall be trained on a large enough generic collection and we…

    Submitted 25 January, 2023; originally announced January 2023.

    Comments: Accepted at ECIR 2023

  12. arXiv:2206.10304  [pdf, other]

    cs.CL cs.AI

    LayoutXLM vs. GNN: An Empirical Evaluation of Relation Extraction for Documents

    Authors: Hervé Déjean, Stéphane Clinchant, Jean-Luc Meunier

    Abstract: This paper investigates the Relation Extraction task in documents by benchmarking two different neural network models: a multi-modal language model (LayoutXLM) and a Graph Neural Network: Edge Convolution Network (ECN). For this benchmark, we use the XFUND dataset, released along with LayoutXLM. While both models reach similar results, they both exhibit very different characteristics. This raises…

    Submitted 9 May, 2022; originally announced June 2022.

  13. arXiv:1906.11901  [pdf]

    cs.CV cs.IR cs.LG stat.ML

    Comparing Machine Learning Approaches for Table Recognition in Historical Register Books

    Authors: Stéphane Clinchant, Hervé Déjean, Jean-Luc Meunier, Eva Lang, Florian Kleber

    Abstract: We present in this paper experiments on Table Recognition in hand-written registry books. We first explain how the problem of row and column detection is modeled, and then compare two Machine Learning approaches (Conditional Random Field and Graph Convolutional Network) for detecting these table elements. Evaluation was conducted on death records provided by the Archive of the Diocese of Passau. B…

    Submitted 14 June, 2019; originally announced June 2019.

    Comments: DAS 2018

  14. arXiv:1807.06270  [pdf, other]

    cs.CV cs.CL

    Bench-Marking Information Extraction in Semi-Structured Historical Handwritten Records

    Authors: Animesh Prasad, Hervé Déjean, Jean-Luc Meunier, Max Weidemann, Johannes Michael, Gundram Leifert

    Abstract: In this report, we present our findings from benchmarking experiments for information extraction on historical handwritten marriage records Esposalles from IEHHR - ICDAR 2017 robust reading competition. The information extraction is modeled as semantic labeling of the sequence across 2 set of labels. This can be achieved by sequentially or jointly applying handwritten text recognition (HTR) and na…

    Submitted 17 July, 2018; originally announced July 2018.

  15. arXiv:cs/0107017  [pdf, ps, other]

    cs.CL

    Learning Computational Grammars

    Authors: John Nerbonne, Anja Belz, Nicola Cancedda, Herve Dejean, James Hammerton, Rob Koeling, Stasinos Konstantopoulos, Miles Osborne, Franck Thollard, Erik F. Tjong Kim Sang

    Abstract: This paper reports on the "Learning Computational Grammars" (LCG) project, a postdoc network devoted to studying the application of machine learning techniques to grammars suitable for computational use. We were interested in a more systematic survey to understand the relevance of many factors to the success of learning, esp. the availability of annotated data, the kind of dependencies in the da…

    Submitted 15 July, 2001; originally announced July 2001.

    ACM Class: I.2.7

    Journal ref: In: Walter Daelemans and Remi Zajac (eds.), Proceedings of CoNLL-2001, Toulouse, France, 2001, pp. 97-104

  16. arXiv:cs/0107016  [pdf, ps, other]

    cs.CL

    Introduction to the CoNLL-2001 Shared Task: Clause Identification

    Authors: Erik F. Tjong Kim Sang, Herve Dejean

    Abstract: We describe the CoNLL-2001 shared task: dividing text into clauses. We give background information on the data sets, present a general overview of the systems that have taken part in the shared task and briefly discuss their performance.

    Submitted 15 July, 2001; originally announced July 2001.

    ACM Class: I.2.7

    Journal ref: In: Walter Daelemans and Remi Zajac (eds.), Proceedings of CoNLL-2001, Toulouse, France, 2001, pp. 53-57

  17. arXiv:cs/0008012  [pdf, ps, other]

    cs.CL

    Applying System Combination to Base Noun Phrase Identification

    Authors: Erik F. Tjong Kim Sang, Walter Daelemans, Herve Dejean, Rob Koeling, Yuval Krymolowski, Vasin Punyakanok, Dan Roth

    Abstract: We use seven machine learning algorithms for one task: identifying base noun phrases. The results have been processed by different system combination methods and all of these outperformed the best individual result. We have applied the seven learners with the best combinator, a majority vote of the top five systems, to a standard data set and managed to improve the best published result for this…

    Submitted 17 August, 2000; originally announced August 2000.

    Comments: 7 pages

    ACM Class: I.2.7

    Journal ref: Proceedings of COLING 2000, Saarbruecken, Germany