Showing 1–8 of 8 results for author: Bevendorff, J

  1. Detecting Generated Native Ads in Conversational Search

    Authors: Sebastian Schmidt, Ines Zelch, Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: Conversational search engines such as YouChat and Microsoft Copilot use large language models (LLMs) to generate responses to queries. It is only a small step to also let the same technology insert ads within the generated responses - instead of separately placing ads next to a response. Inserted ads would be reminiscent of native advertising and product placement, both of which are very effective…

    Submitted 30 April, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

    Comments: WWW'24 Short Papers Track; 4 pages

  2. Evaluating Generative Ad Hoc Information Retrieval

    Authors: Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe, Guido Zuccon, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: Recent advances in large language models have enabled the development of viable generative retrieval systems. Instead of a traditional document ranking, generative retrieval systems often directly return a grounded generated text as a response to a query. Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval. Yet, the establishe…

    Submitted 22 May, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: 14 pages, 6 figures, 1 table. Published at SIGIR'24 perspective paper track

  3. The Information Retrieval Experiment Platform

    Authors: Maik Fröbe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and even blinded retrieval experiments. Standardization is achieved when a retrieval approach implements PyTerrier's interfaces and the input and output of an experiment are compatible with ir_datasets and ir_measures. However…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: 11 pages. To be published in the proceedings of SIGIR 2023

  4. arXiv:2211.02477  [pdf, other]

    cs.CL cs.DL

    SMAuC -- The Scientific Multi-Authorship Corpus

    Authors: Janek Bevendorff, Philipp Sauer, Lukas Gienapp, Wolfgang Kircheis, Erik Körner, Benno Stein, Martin Potthast

    Abstract: The rapidly growing volume of scientific publications offers an interesting challenge for research on methods for analyzing the authorship of documents with one or more authors. However, most existing datasets lack scientific documents or the necessary metadata for constructing new experiments and test cases. We introduce SMAuC, a comprehensive, metadata-rich corpus tailored to scientific authorsh…

    Submitted 10 May, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

  5. arXiv:2112.03103  [pdf, other]

    cs.IR

    FastWARC: Optimizing Large-Scale Web Archive Analytics

    Authors: Janek Bevendorff, Martin Potthast, Benno Stein

    Abstract: Web search and other large-scale web data analytics rely on processing archives of web pages stored in a standardized and efficient format. Since its introduction in 2008, the IIPC's Web ARCive (WARC) format has become the standard format for this purpose. As a list of individually compressed records of HTTP requests and responses, it allows for constant-time random access to all kinds of web data…

    Submitted 22 November, 2021; originally announced December 2021.

    Journal ref: OSSYM 2021 - 3rd International Open Search Symposium

  6. arXiv:2111.10864  [pdf, other]

    cs.IR

    The Impact of Main Content Extraction on Near-Duplicate Detection

    Authors: Maik Fröbe, Matthias Hagen, Janek Bevendorff, Michael Völske, Benno Stein, Christopher Schröder, Robby Wagner, Lukas Gienapp, Martin Potthast

    Abstract: Commercial web search engines employ near-duplicate detection to ensure that users see each relevant result only once, albeit the underlying web crawls typically include (near-)duplicates of many web pages. We revisit the risks and potential of near-duplicates with an information retrieval focus, motivating that current efforts toward an open and independent European web search infrastructure shou…

    Submitted 21 November, 2021; originally announced November 2021.

  7. arXiv:2107.00893  [pdf, other]

    cs.DL cs.NI cs.SI

    Web Archive Analytics

    Authors: Michael Völske, Janek Bevendorff, Johannes Kiesel, Benno Stein, Maik Fröbe, Matthias Hagen, Martin Potthast

    Abstract: Web archive analytics is the exploitation of publicly accessible web pages and their evolution for research purposes -- to the extent organizationally possible for researchers. In order to better understand the complexity of this task, the first part of this paper puts the entirety of the world's captured, created, and replicated data (the "Global Datasphere") in relation to other important data s…

    Submitted 2 July, 2021; originally announced July 2021.

    Comments: 12 pages, 5 figures. Published in the proceedings of INFORMATIK 2020

    Journal ref: INFORMATIK 2020. Gesellschaft für Informatik, Bonn. (pp. 61-72)

  8. arXiv:1702.05638  [pdf, other]

    cs.CL

    A Stylometric Inquiry into Hyperpartisan and Fake News

    Authors: Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, Benno Stein

    Abstract: This paper reports on a writing style analysis of hyperpartisan (i.e., extremely one-sided) news in connection to fake news. It presents a large corpus of 1,627 articles that were manually fact-checked by professional journalists from BuzzFeed. The articles originated from 9 well-known political publishers, 3 each from the mainstream, the hyperpartisan left-wing, and the hyperpartisan right-wing.…

    Submitted 18 February, 2017; originally announced February 2017.

    Comments: 10 pages, 3 figures, 6 tables, submitted to ACL 2017