Showing 1–15 of 15 results for author: Fröbe, M

  1. arXiv:2405.07920

    cs.IR

    A Systematic Investigation of Distilling Large Language Models into Cross-Encoders for Passage Re-ranking

    Authors: Ferdinand Schlatt, Maik Fröbe, Harrisen Scells, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Benno Stein, Martin Potthast, Matthias Hagen

    Abstract: Cross-encoders distilled from large language models (LLMs) are often more effective re-rankers than cross-encoders fine-tuned on manually labeled data. However, the distilled models usually do not reach their teacher LLM's effectiveness. To investigate whether best practices for fine-tuning cross-encoders on manually labeled data (e.g., hard-negative sampling, deep sampling, and listwise loss func…

    Submitted 16 June, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

  2. arXiv:2404.06912

    cs.IR

    Set-Encoder: Permutation-Invariant Inter-Passage Attention for Listwise Passage Re-Ranking with Cross-Encoders

    Authors: Ferdinand Schlatt, Maik Fröbe, Harrisen Scells, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Benno Stein, Martin Potthast, Matthias Hagen

    Abstract: Existing cross-encoder re-rankers can be categorized as pointwise, pairwise, or listwise models. Pair- and listwise models allow passage interactions, which usually makes them more effective than pointwise models but also less efficient and less robust to input order permutations. To enable efficient permutation-invariant passage interactions during re-ranking, we propose a new cross-encoder archi…

    Submitted 16 June, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

  3. arXiv:2403.07654

    cs.IR

    Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models

    Authors: Andrew Parry, Maik Fröbe, Sean MacAvaney, Martin Potthast, Matthias Hagen

    Abstract: Modern sequence-to-sequence relevance models like monoT5 can effectively capture complex textual interactions between queries and documents through cross-encoding. However, the use of natural language tokens in prompts, such as Query, Document, and Relevant for monoT5, opens an attack vector for malicious documents to manipulate their relevance score through prompt injection, e.g., by adding targe…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 13 pages, 3 figures, Accepted at ECIR 2024 as a Full Paper

  4. Investigating the Effects of Sparse Attention on Cross-Encoders

    Authors: Ferdinand Schlatt, Maik Fröbe, Matthias Hagen

    Abstract: Cross-encoders are effective passage and document re-rankers but less efficient than other neural or classic retrieval models. A few previous studies have applied windowed self-attention to make cross-encoders more efficient. However, these studies did not investigate the potential and limits of different attention patterns or window sizes. We close this gap and systematically analyze how token in…

    Submitted 20 March, 2024; v1 submitted 29 December, 2023; originally announced December 2023.

    Comments: Accepted at ECIR'24

  5. Evaluating Generative Ad Hoc Information Retrieval

    Authors: Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe, Guido Zuccon, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: Recent advances in large language models have enabled the development of viable generative retrieval systems. Instead of a traditional document ranking, generative retrieval systems often directly return a grounded generated text as a response to a query. Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval. Yet, the establishe…

    Submitted 22 May, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: 14 pages, 6 figures, 1 table. Published at SIGIR'24 perspective paper track

  6. The Information Retrieval Experiment Platform

    Authors: Maik Fröbe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and even blinded retrieval experiments. Standardization is achieved when a retrieval approach implements PyTerrier's interfaces and the input and output of an experiment are compatible with ir_datasets and ir_measures. However…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: 11 pages. To be published in the proceedings of SIGIR 2023

  7. The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

    Authors: Jan Heinrich Reimer, Sebastian Schmidt, Maik Fröbe, Lukas Gienapp, Harrisen Scells, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish the…

    Submitted 31 July, 2023; v1 submitted 1 April, 2023; originally announced April 2023.

    Comments: SIGIR 2023 resource paper, 13 pages

  8. arXiv:2212.07476

    cs.IR cs.CL cs.CV

    The Infinite Index: Information Retrieval on Generative Text-To-Image Models

    Authors: Niklas Deckers, Maik Fröbe, Johannes Kiesel, Gianluca Pandolfo, Christopher Schröder, Benno Stein, Martin Potthast

    Abstract: Conditional generative models such as DALL-E and Stable Diffusion generate images based on a user-defined text, the prompt. Finding and refining prompts that produce a desired image has become the art of prompt engineering. Generative models do not provide a built-in retrieval model for a user's information need expressed through prompts. In light of an extensive literature review, we reframe prom…

    Submitted 21 January, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

    Comments: Final version for CHIIR 2023

  9. Sparse Pairwise Re-ranking with Pre-trained Transformers

    Authors: Lukas Gienapp, Maik Fröbe, Matthias Hagen, Martin Potthast

    Abstract: Pairwise re-ranking models predict which of two documents is more relevant to a query and then aggregate a final ranking from such preferences. This is often more effective than pointwise re-ranking models that directly predict a relevance value for each document. However, the high inference overhead of pairwise models limits their practical application: usually, for a set of $k$ documents to be r…

    Submitted 10 July, 2022; originally announced July 2022.

    Comments: Accepted at ICTIR 2022

  10. arXiv:2206.14759

    cs.IR

    How Train-Test Leakage Affects Zero-shot Retrieval

    Authors: Maik Fröbe, Christopher Akiki, Martin Potthast, Matthias Hagen

    Abstract: Neural retrieval models are often trained on (subsets of) the millions of queries of the MS MARCO / ORCAS datasets and then tested on the 250 Robust04 queries or other TREC benchmarks with often only 50 queries. In such setups, many of the few test queries can be very similar to queries from the huge training data -- in fact, 69% of the Robust04 queries have near-duplicates in MS MARCO / ORCAS. We…

    Submitted 30 August, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

    Comments: To appear at the 29th International Symposium on String Processing and Information Retrieval (SPIRE 2022)

  11. arXiv:2203.10282

    cs.CL

    Clickbait Spoiling via Question Answering and Passage Retrieval

    Authors: Matthias Hagen, Maik Fröbe, Artur Jurk, Martin Potthast

    Abstract: We introduce and study the task of clickbait spoiling: generating a short text that satisfies the curiosity induced by a clickbait post. Clickbait links to a web page and advertises its contents by arousing curiosity instead of providing an informative summary. Our contributions are approaches to classify the type of spoiler needed (i.e., a phrase or a passage), and to generate appropriate spoiler…

    Submitted 19 March, 2022; originally announced March 2022.

    Comments: Accepted at ACL 2022

  12. arXiv:2201.11094

    cs.IR cs.CL

    SCAI-QReCC Shared Task on Conversational Question Answering

    Authors: Svitlana Vakulenko, Johannes Kiesel, Maik Fröbe

    Abstract: Search-Oriented Conversational AI (SCAI) is an established venue that regularly puts a spotlight upon the recent work advancing the field of conversational search. SCAI'21 was organised as an independent on-line event and featured a shared task on conversational question answering. Since all of the participant teams experimented with answer generation models for this task, we identified evaluation…

    Submitted 26 January, 2022; originally announced January 2022.

    Comments: 10 pages

  13. arXiv:2111.10864

    cs.IR

    The Impact of Main Content Extraction on Near-Duplicate Detection

    Authors: Maik Fröbe, Matthias Hagen, Janek Bevendorff, Michael Völske, Benno Stein, Christopher Schröder, Robby Wagner, Lukas Gienapp, Martin Potthast

    Abstract: Commercial web search engines employ near-duplicate detection to ensure that users see each relevant result only once, albeit the underlying web crawls typically include (near-)duplicates of many web pages. We revisit the risks and potential of near-duplicates with an information retrieval focus, motivating that current efforts toward an open and independent European web search infrastructure shou…

    Submitted 21 November, 2021; originally announced November 2021.

  14. arXiv:2107.00893

    cs.DL cs.NI cs.SI

    Web Archive Analytics

    Authors: Michael Völske, Janek Bevendorff, Johannes Kiesel, Benno Stein, Maik Fröbe, Matthias Hagen, Martin Potthast

    Abstract: Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes -- to the extent organizationally possible for researchers. In order to better understand the complexity of this task, the first part of this paper puts the entirety of the world's captured, created, and replicated data (the "Global Datasphere") in relation to other important data s…

    Submitted 2 July, 2021; originally announced July 2021.

    Comments: 12 pages, 5 figures. Published in the proceedings of INFORMATIK 2020

    Journal ref: INFORMATIK 2020. Gesellschaft für Informatik, Bonn. (pp. 61-72)

  15. Towards Axiomatic Explanations for Neural Ranking Models

    Authors: Michael Völske, Alexander Bondarenko, Maik Fröbe, Matthias Hagen, Benno Stein, Jaspreet Singh, Avishek Anand

    Abstract: Recently, neural networks have been successfully employed to improve upon state-of-the-art performance in ad-hoc retrieval tasks via machine-learned ranking functions. While neural retrieval models grow in complexity and impact, little is understood about their correspondence with well-studied IR principles. Recent work on interpretability in machine learning has provided tools and techniques to u…

    Submitted 11 July, 2021; v1 submitted 15 June, 2021; originally announced June 2021.

    Comments: 10 pages, 2 figures. Published in the proceedings of ICTIR 2021