Showing 1–22 of 22 results for author: Clarke, C L A

  1. arXiv:2405.02178  [pdf, other]

    cs.CL cs.AI

    Assessing and Verifying Task Utility in LLM-Powered Applications

    Authors: Negar Arabzadeh, Siqing Huo, Nikhil Mehta, Qingyun Wu, Chi Wang, Ahmed Awadallah, Charles L. A. Clarke, Julia Kiseleva

    Abstract: The rapid development of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple agents, assisting humans in their daily tasks. However, a significant gap remains in assessing to what extent LLM-powered applications genuinely enhance user experience and task execution efficiency. This highlights the need to verify utility of LLM-powered applicat…

    Submitted 12 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

    Comments: arXiv admin note: text overlap with arXiv:2402.09015

  2. arXiv:2404.08137  [pdf, other]

    cs.IR

    Generative Information Retrieval Evaluation

    Authors: Marwah Alaofi, Negar Arabzadeh, Charles L. A. Clarke, Mark Sanderson

    Abstract: This paper is a draft of a chapter intended to appear in a forthcoming book on generative information retrieval, co-edited by Chirag Shah and Ryen White. In this chapter, we consider generative information retrieval evaluation from two distinct but interrelated perspectives. First, large language models (LLMs) themselves are rapidly becoming tools for evaluation, with current research indicating t…

    Submitted 16 April, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

    Comments: Draft of a chapter intended to appear in a forthcoming book on generative information retrieval, co-edited by Chirag Shah and Ryen White

  3. arXiv:2404.04044  [pdf, other]

    cs.IR

    A Comparison of Methods for Evaluating Generative IR

    Authors: Negar Arabzadeh, Charles L. A. Clarke

    Abstract: Information retrieval systems increasingly incorporate generative components. For example, in a retrieval augmented generation (RAG) system, a retrieval component might provide a source of ground truth, while a generative component summarizes and augments its responses. In other systems, a large language model (LLM) might directly generate responses without consulting a retrieval component. While…

    Submitted 9 April, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

  4. arXiv:2401.17543  [pdf, other]

    cs.IR

    Fréchet Distance for Offline Evaluation of Information Retrieval Systems with Sparse Labels

    Authors: Negar Arabzadeh, Charles L. A. Clarke

    Abstract: The rapid advancement of natural language processing, information retrieval (IR), computer vision, and other technologies has presented significant challenges in evaluating the performance of these systems. One of the main challenges is the scarcity of human-labeled data, which hinders the fair and accurate assessment of these systems. In this work, we specifically focus on evaluating IR systems w…

    Submitted 18 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

  5. arXiv:2401.04842  [pdf, other]

    cs.IR

    Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers

    Authors: Negar Arabzadeh, Amin Bigdeli, Charles L. A. Clarke

    Abstract: Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these answers, for comparing the performance of one model to another, or for comparing one prompt to another. In addition, the quality of generated answers is rarely…

    Submitted 9 January, 2024; originally announced January 2024.

  6. Retrieving Supporting Evidence for Generative Question Answering

    Authors: Siqing Huo, Negar Arabzadeh, Charles L. A. Clarke

    Abstract: Current large language models (LLMs) can exhibit near-human levels of performance on many natural language-based tasks, including open-domain question answering. Unfortunately, at this time, they also convincingly hallucinate incorrect answers, so that responses to questions must be verified against external sources before they can be accepted at face value. In this paper, we report two simple exp…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: arXiv admin note: text overlap with arXiv:2306.13781

    Journal ref: Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP '23), November 26--28, 2023, Beijing, China

  7. arXiv:2306.13781  [pdf, other]

    cs.IR

    Retrieving Supporting Evidence for LLMs Generated Answers

    Authors: Siqing Huo, Negar Arabzadeh, Charles L. A. Clarke

    Abstract: Current large language models (LLMs) can exhibit near-human levels of performance on many natural language tasks, including open-domain question answering. Unfortunately, they also convincingly hallucinate incorrect answers, so that responses to questions must be verified against external sources before they can be accepted at face value. In this paper, we report a simple experiment to automatical…

    Submitted 23 June, 2023; originally announced June 2023.

  8. arXiv:2305.06984  [pdf, other]

    cs.CL

    Evaluating Open-Domain Question Answering in the Era of Large Language Models

    Authors: Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei

    Abstract: Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures…

    Submitted 6 July, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: ACL 2023; code and data released at https://github.com/ehsk/OpenQA-eval

  9. arXiv:2208.04887  [pdf, other]

    cs.IR

    Early Stage Sparse Retrieval with Entity Linking

    Authors: Dahlia Shehata, Negar Arabzadeh, Charles L. A. Clarke

    Abstract: Despite the advantages of their low-resource settings, traditional sparse retrievers depend on exact matching approaches between high-dimensional bag-of-words (BoW) representations of both the queries and the collection. As a result, retrieval performance is restricted by semantic discrepancies and vocabulary gaps. On the other hand, transformer-based dense retrievers introduce significant improve…

    Submitted 10 August, 2022; v1 submitted 9 August, 2022; originally announced August 2022.

  10. arXiv:2208.04882  [pdf, other]

    cs.IR

    Unsupervised Question Clarity Prediction Through Retrieved Item Coherency

    Authors: Negar Arabzadeh, Mahsa Seifikar, Charles L. A. Clarke

    Abstract: Despite recent progress on conversational systems, they still do not perform smoothly and coherently when faced with ambiguous requests. When questions are unclear, conversational systems should have the ability to ask clarifying questions, rather than assuming a particular interpretation or simply responding that they do not understand. Previous studies have shown that users are more satisfied wh…

    Submitted 9 August, 2022; originally announced August 2022.

  11. Human Preferences as Dueling Bandits

    Authors: Xinyi Yan, Chengxi Luo, Charles L. A. Clarke, Nick Craswell, Ellen M. Voorhees, Pablo Castells

    Abstract: The dramatic improvements in core information retrieval tasks engendered by neural rankers create a need for novel evaluation methods. If every ranker returns highly relevant items in the top ranks, it becomes difficult to recognize meaningful differences between them and to build reusable test collections. Several recent papers explore pairwise preference judgments as an alternative to traditiona…

    Submitted 21 April, 2022; originally announced April 2022.

    Comments: To appear in 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. Associated code and data is available at https://github.com/XinyiYan/duelingBandits

  12. arXiv:2112.11481  [pdf, other]

    cs.CL cs.AI

    Translating Human Mobility Forecasting through Natural Language Generation

    Authors: Hao Xue, Flora D. Salim, Yongli Ren, Charles L. A. Clarke

    Abstract: Existing human mobility forecasting models follow the standard design of the time-series prediction model which takes a series of numerical values as input to generate a numerical value as a prediction. Although treating this as a regression problem seems straightforward, incorporating various contextual information such as the semantic category information of each Place-of-Interest (POI) is a nec…

    Submitted 13 December, 2021; originally announced December 2021.

    Comments: Accepted at WSDM2022

  13. arXiv:2109.10739  [pdf, other]

    cs.IR

    Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection

    Authors: Negar Arabzadeh, Xinyi Yan, Charles L. A. Clarke

    Abstract: Over the last few years, contextualized pre-trained transformer models such as BERT have provided substantial improvements on information retrieval tasks. Recent approaches based on pre-trained transformer models such as BERT, fine-tune dense low-dimensional contextualized representations of queries and documents in embedding space. While these dense retrievers enjoy substantial retrieval effectiv…

    Submitted 22 September, 2021; originally announced September 2021.

  14. arXiv:2109.00062  [pdf, other]

    cs.IR

    Shallow pooling for sparse labels

    Authors: Negar Arabzadeh, Alexandra Vtyurina, Xinyi Yan, Charles L. A. Clarke

    Abstract: Recent years have seen enormous gains in core IR tasks, including document and passage ranking. Datasets and leaderboards, and in particular the MS MARCO datasets, illustrate the dramatic improvements achieved by modern neural rankers. When compared with traditional test collections, the MS MARCO datasets employ substantially more queries with substantially fewer known relevant items per query. Gi…

    Submitted 28 February, 2022; v1 submitted 31 August, 2021; originally announced September 2021.

  15. arXiv:2007.11682  [pdf, other]

    cs.IR

    Assessing top-$k$ preferences

    Authors: Charles L. A. Clarke, Alexandra Vtyurina, Mark D. Smucker

    Abstract: Assessors make preference judgments faster and more consistently than graded judgments. Preference judgments can also recognize distinctions between items that appear equivalent under graded judgments. Unfortunately, preference judgments can require more than linear effort to fully order a pool of items, and evaluation measures for preference judgments are not as well established as those for grad…

    Submitted 12 February, 2021; v1 submitted 22 July, 2020; originally announced July 2020.

  16. Efficient and Effective Tail Latency Minimization in Multi-Stage Retrieval Systems

    Authors: Joel Mackenzie, J. Shane Culpepper, Roi Blanco, Matt Crane, Charles L. A. Clarke, Jimmy Lin

    Abstract: Scalable web search systems typically employ multi-stage retrieval architectures, where an initial stage generates a set of candidate documents that are then pruned and re-ranked. Since subsequent stages typically exploit a multitude of features of varying costs using machine-learned models, reducing the number of documents that are considered at each stage improves latency. In this work, we propo…

    Submitted 20 April, 2017; v1 submitted 12 April, 2017; originally announced April 2017.

    Comments: Update 1: Edited email address

  17. arXiv:1610.06468  [pdf, other]

    cs.IR cs.DL cs.NI

    Ten Blue Links on Mars

    Authors: Charles L. A. Clarke, Gordon V. Cormack, Jimmy Lin, Adam Roegiest

    Abstract: This paper explores a simple question: How would we provide a high-quality search experience on Mars, where the fundamental physical limit is speed-of-light propagation delays on the order of tens of minutes? On Earth, users are accustomed to nearly instantaneous response times from search engines. Is it possible to overcome orders-of-magnitude longer latency to provide a tolerable user experience…

    Submitted 20 October, 2016; originally announced October 2016.

  18. arXiv:1610.02502  [pdf, other]

    cs.IR

    Dynamic Trade-Off Prediction in Multi-Stage Retrieval Systems

    Authors: J. Shane Culpepper, Charles L. A. Clarke, Jimmy Lin

    Abstract: Modern multi-stage retrieval systems are comprised of a candidate generation stage followed by one or more reranking stages. In such an architecture, the quality of the final ranked list may not be sensitive to the quality of initial candidate pool, especially in terms of early precision. This provides several opportunities to increase retrieval efficiency without significantly sacrificing effecti…

    Submitted 8 October, 2016; originally announced October 2016.

    ACM Class: H.3.3; H.3.4

  19. arXiv:1606.03066  [pdf, other]

    cs.IR

    The Effects of Latency Penalties in Evaluating Push Notification Systems

    Authors: Luchen Tan, Jimmy Lin, Adam Roegiest, Charles L. A. Clarke

    Abstract: We examine the effects of different latency penalties in the evaluation of push notification systems, as operationalized in the TREC 2015 Microblog track evaluation. The purpose of this study is to inform the design of metrics for the TREC 2016 Real-Time Summarization track, which is largely modeled after the TREC 2015 evaluation design.

    Submitted 9 June, 2016; originally announced June 2016.

  20. arXiv:1506.00717  [pdf, other]

    cs.IR

    Assessing Efficiency-Effectiveness Tradeoffs in Multi-Stage Retrieval Systems Without Using Relevance Judgments

    Authors: Charles L. A. Clarke, J. Shane Culpepper, Alistair Moffat

    Abstract: Large-scale retrieval systems are often implemented as a cascading sequence of phases -- a first filtering step, in which a large set of candidate documents are extracted using a simple technique such as Boolean matching and/or static document scores; and then one or more ranking steps, in which the pool of documents retrieved by the filter is scored more precisely using dozens or perhaps hundreds…

    Submitted 1 June, 2015; originally announced June 2015.

    ACM Class: H.3.1; H.3.3

  21. arXiv:1408.3587  [pdf, other]

    cs.IR

    A Family of Rank Similarity Measures based on Maximized Effectiveness Difference

    Authors: Luchen Tan, Charles L. A. Clarke

    Abstract: Rank similarity measures provide a method for quantifying differences between search engine results without the need for relevance judgments. For example, the providers of a search service might use such measures to estimate the impact of a proposed algorithmic change across a large number of queries - perhaps millions - identifying those queries where the impact is greatest. In this paper, we pro…

    Submitted 15 August, 2014; originally announced August 2014.

  22. arXiv:1004.5168  [pdf, ps, other]

    cs.IR

    Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets

    Authors: Gordon V. Cormack, Mark D. Smucker, Charles L. A. Clarke

    Abstract: The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general Web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam --- pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of the TREC 2009 web ad h…

    Submitted 28 April, 2010; originally announced April 2010.