Showing 1–26 of 26 results for author: Lawrie, D

Searching in archive cs.
  1. arXiv:2406.17186  [pdf, other]

    cs.CL cs.CY

    CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation

    Authors: Abe Bohan Hou, Orion Weller, Guanghui Qin, Eugene Yang, Dawn Lawrie, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme

    Abstract: Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligent systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with…

    Submitted 27 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

  2. On the Evaluation of Machine-Generated Reports

    Authors: James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibbler

    Abstract: Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of…

    Submitted 9 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: 12 pages, 4 figures, accepted at SIGIR 2024 as perspective paper

  3. Language Fairness in Multilingual Information Retrieval

    Authors: Eugene Yang, Thomas Jänich, James Mayfield, Dawn Lawrie

    Abstract: Multilingual information retrieval (MLIR) considers the problem of ranking documents in several languages for a query expressed in a language that may differ from any of those languages. Recent work has observed that approaches such as combining ranked lists representing a single document language each or using multilingual pretrained language models demonstrate a preference for one language over…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 5 pages, 1 figure, accepted at SIGIR 2024 as short paper

  4. Distillation for Multilingual Information Retrieval

    Authors: Eugene Yang, Dawn Lawrie, James Mayfield

    Abstract: Recent work in cross-language information retrieval (CLIR), where queries and documents are in different languages, has shown the benefit of the Translate-Distill framework that trains a cross-language neural dual-encoder model using translation and distillation. However, Translate-Distill only supports a single document language. Multilingual information retrieval (MLIR), which ranks a multilingu…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 6 pages, 1 figure, accepted at SIGIR 2024 as short paper

  5. PLAID SHIRTTT for Large-Scale Streaming Dense Retrieval

    Authors: Dawn Lawrie, Efsun Kayi, Eugene Yang, James Mayfield, Douglas W. Oard

    Abstract: PLAID, an efficient implementation of the ColBERT late interaction bi-encoder using pretrained language models for ranking, consistently achieves state-of-the-art performance in monolingual, cross-language, and multilingual retrieval. PLAID differs from ColBERT by assigning terms to clusters and representing those terms as cluster centroids plus compressed residual vectors. While PLAID is effectiv…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 5 pages, 1 figure, accepted at SIGIR 2024 as short paper

  6. arXiv:2404.18797  [pdf, other]

    cs.IR

    Efficiency-Effectiveness Tradeoff of Probabilistic Structured Queries for Cross-Language Information Retrieval

    Authors: Eugene Yang, Suraj Nair, Dawn Lawrie, James Mayfield, Douglas W. Oard, Kevin Duh

    Abstract: Probabilistic Structured Queries (PSQ) is a cross-language information retrieval (CLIR) method that uses translation probabilities statistically derived from aligned corpora. PSQ is a strong baseline for efficient CLIR using sparse indexing. It is, therefore, useful as the first stage in a cascaded neural CLIR system whose second stage is more effective but too inefficient to be used on its own to…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: 11 pages, 5 figures

  7. arXiv:2404.08134  [pdf, other]

    cs.IR cs.CL

    Extending Translate-Train for ColBERT-X to African Language CLIR

    Authors: Eugene Yang, Dawn J. Lawrie, Paul McNamee, James Mayfield

    Abstract: This paper describes the submission runs from the HLTCOE team at the CIRAL CLIR tasks for African languages at FIRE 2023. Our submissions use machine translation models to translate the documents and the training passages, and ColBERT-X as the retrieval model. Additionally, we present a set of unofficial runs that use an alternative training procedure with a similar training setting.

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: 10 pages, 2 figures. System description paper for HLTCOE's participation in CIRAL@FIRE 2023

  8. arXiv:2404.08118  [pdf, ps, other]

    cs.CL cs.IR

    HLTCOE at TREC 2023 NeuCLIR Track

    Authors: Eugene Yang, Dawn Lawrie, James Mayfield

    Abstract: The HLTCOE team applied PLAID, an mT5 reranker, and document translation to the TREC 2023 NeuCLIR track. For PLAID we included a variety of models and training techniques: the English model released with ColBERT v2, translate-train (TT), Translate-Distill (TD), and multilingual translate-train (MTT). TT trains a ColBERT model with English queries and passages automatically translated into the doc…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: 6 pages. Part of TREC 2023 Proceedings

  9. arXiv:2404.08071  [pdf, other]

    cs.IR

    Overview of the TREC 2023 NeuCLIR Track

    Authors: Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang

    Abstract: The principal goal of the TREC Neural Cross-Language Information Retrieval (NeuCLIR) track is to study the impact of neural approaches to cross-language information retrieval. The track has created four collections: large collections of Chinese, Persian, and Russian newswire and a smaller collection of Chinese scientific abstracts. The principal tasks are ranked retrieval of news in one of the thr…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: 27 pages, 17 figures. Part of the TREC 2023 Proceedings

  10. arXiv:2403.15246  [pdf, other]

    cs.IR cs.CL cs.LG

    FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

    Authors: Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini

    Abstract: Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, w…

    Submitted 7 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

  11. arXiv:2403.12958  [pdf, other]

    cs.CL

    Dated Data: Tracing Knowledge Cutoffs in Large Language Models

    Authors: Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, Benjamin Van Durme

    Abstract: Released Large Language Models (LLMs) are often paired with a claimed knowledge cutoff date, or the dates at which training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information. However, this statement only scratches the surface: do all resources in the training data share the same knowledge cutoff date? Does the model's demonstrated kno…

    Submitted 19 March, 2024; originally announced March 2024.

  12. arXiv:2401.04810  [pdf, other]

    cs.IR cs.CL

    Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation

    Authors: Eugene Yang, Dawn Lawrie, James Mayfield, Douglas W. Oard, Scott Miller

    Abstract: Prior work on English monolingual retrieval has shown that a cross-encoder trained using a large number of relevance judgments for query-document pairs can be used as a teacher to train more efficient, but similarly effective, dual-encoder student models. Applying a similar knowledge distillation approach to training an efficient dual-encoder model for Cross-Language Information Retrieval (CLIR),…

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: 17 pages, 1 figure, accepted at ECIR 2024

  13. arXiv:2309.08541  [pdf, other]

    cs.IR cs.AI cs.CL

    When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets

    Authors: Orion Weller, Kyle Lo, David Wadden, Dawn Lawrie, Benjamin Van Durme, Arman Cohan, Luca Soldaini

    Abstract: Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find t…

    Submitted 26 February, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: EACL 2024 camera ready

  14. arXiv:2305.13252  [pdf, other]

    cs.CL cs.AI

    "According to ...": Prompting Language Models Improves Quoting from Pre-Training Data

    Authors: Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, Benjamin Van Durme

    Abstract: Large Language Models (LLMs) may hallucinate and generate fake information, despite pre-training on factual data. Inspired by the journalistic device of "according to sources", we propose according-to prompting: directing LLMs to ground responses against previously observed text. To quantify this grounding, we propose a novel evaluation metric (QUIP-Score) that measures the extent to which model-p…

    Submitted 26 February, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted to EACL 2024

  15. arXiv:2305.07614  [pdf, other]

    cs.IR cs.CL

    NevIR: Negation in Neural Information Retrieval

    Authors: Orion Weller, Dawn Lawrie, Benjamin Van Durme

    Abstract: Negation is a common everyday phenomenon and has been a consistent area of weakness for language models (LMs). Although the Information Retrieval (IR) community has adopted LMs as the backbone of modern IR architectures, there has been little to no research in understanding how negation impacts neural IR. We therefore construct a straightforward benchmark on this theme: asking IR models to rank two…

    Submitted 26 February, 2024; v1 submitted 12 May, 2023; originally announced May 2023.

    Comments: Accepted to EACL 2024

  16. arXiv:2305.00331  [pdf, other]

    cs.IR

    Synthetic Cross-language Information Retrieval Training Data

    Authors: James Mayfield, Eugene Yang, Dawn Lawrie, Samuel Barham, Orion Weller, Marc Mason, Suraj Nair, Scott Miller

    Abstract: A key stumbling block for neural cross-language information retrieval (CLIR) systems has been the paucity of training data. The appearance of the MS MARCO monolingual training set led to significant advances in the state of the art in neural monolingual retrieval. By translating the MS MARCO documents into other languages using machine translation, this resource has been made useful to the CLIR co…

    Submitted 29 April, 2023; originally announced May 2023.

    Comments: 11 pages, 4 figures

  17. arXiv:2304.12367  [pdf, other]

    cs.IR

    Overview of the TREC 2022 NeuCLIR Track

    Authors: Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang

    Abstract: This is the first year of the TREC Neural CLIR (NeuCLIR) track, which aims to study the impact of neural approaches to cross-language information retrieval. The main task in this year's track was ad hoc ranked retrieval of Chinese, Persian, or Russian newswire documents using queries expressed in English. Topics were developed using standard TREC processes, except that topics developed by an annot…

    Submitted 24 September, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: 22 pages, 13 figures, 10 tables. Part of the Thirty-First Text REtrieval Conference (TREC 2022) Proceedings. Replace the misplaced Russian result table

  18. arXiv:2212.10448  [pdf, other]

    cs.IR cs.CL

    Parameter-efficient Zero-shot Transfer for Cross-Language Dense Retrieval with Adapters

    Authors: Eugene Yang, Suraj Nair, Dawn Lawrie, James Mayfield, Douglas W. Oard

    Abstract: A popular approach to creating a zero-shot cross-language retrieval model is to substitute a monolingual pretrained language model in the retrieval model with a multilingual pretrained language model such as Multilingual BERT. This multilingual model is fine-tuned to the retrieval task with monolingual data such as English MS MARCO using the same training recipe as the monolingual retrieval model…

    Submitted 20 December, 2022; originally announced December 2022.

    Comments: 15 pages, 1 figure

  19. arXiv:2212.10019  [pdf, other]

    cs.CL

    When Do Decompositions Help for Machine Reading?

    Authors: Kangda Wei, Dawn Lawrie, Benjamin Van Durme, Yunmo Chen, Orion Weller

    Abstract: Answering complex questions often requires multi-step reasoning in order to obtain the final answer. Most research into decompositions of complex questions involves open-domain systems, which have shown success in using these decompositions for improved retrieval. In the machine reading setting, however, work to understand when decompositions are helpful is understudied. We conduct experiments on…

    Submitted 20 December, 2022; originally announced December 2022.

  20. arXiv:2212.10002  [pdf, other]

    cs.CL cs.IR

    Defending Against Disinformation Attacks in Open-Domain Question Answering

    Authors: Orion Weller, Aleem Khan, Nathaniel Weir, Dawn Lawrie, Benjamin Van Durme

    Abstract: Recent work in open-domain question answering (ODQA) has shown that adversarial poisoning of the search collection can cause large drops in accuracy for production systems. However, little to no work has proposed methods to defend against these attacks. To do so, we rely on the intuition that redundant information often exists in large corpora. To find it, we introduce a method that uses query aug…

    Submitted 26 February, 2024; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Accepted to EACL 2024

  21. arXiv:2209.01335  [pdf, other]

    cs.IR cs.CL

    Neural Approaches to Multilingual Information Retrieval

    Authors: Dawn Lawrie, Eugene Yang, Douglas W. Oard, James Mayfield

    Abstract: Providing access to information across languages has been a goal of Information Retrieval (IR) for decades. While progress has been made on Cross Language IR (CLIR) where queries are expressed in one language and documents in another, the multilingual (MLIR) task to create a single ranked list of documents across many languages is considerably more challenging. This paper investigates whether adva…

    Submitted 9 February, 2023; v1 submitted 3 September, 2022; originally announced September 2022.

    Comments: 17 pages, 3 figures, accepted at ECIR 2023

  22. arXiv:2206.02291  [pdf, other]

    cs.CL

    Pretrained Models for Multilingual Federated Learning

    Authors: Orion Weller, Marc Marone, Vladimir Braverman, Dawn Lawrie, Benjamin Van Durme

    Abstract: Since the advent of Federated Learning (FL), research has applied these methods to natural language processing (NLP) tasks. Despite a plethora of papers in FL for NLP, no previous works have studied how multilingual text impacts FL algorithms. Furthermore, multilingual text provides an interesting avenue to examine the impact of non-IID text (e.g. different languages) on FL in naturally occurring…

    Submitted 5 June, 2022; originally announced June 2022.

    Comments: NAACL 2022

  23. arXiv:2201.09996  [pdf, ps, other]

    cs.IR

    Patapsco: A Python Framework for Cross-Language Information Retrieval Experiments

    Authors: Cash Costello, Eugene Yang, Dawn Lawrie, James Mayfield

    Abstract: While there are high-quality software frameworks for information retrieval experimentation, they do not explicitly support cross-language information retrieval (CLIR). To fill this gap, we have created Patapsco, a Python CLIR framework. This framework specifically addresses the complexity that comes with running experiments in multiple languages. Patapsco is designed to be extensible to many langu…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: 5 pages, accepted at ECIR 2022 as a demo paper

  24. arXiv:2201.09992  [pdf, other]

    cs.IR cs.CL

    HC4: A New Suite of Test Collections for Ad Hoc CLIR

    Authors: Dawn Lawrie, James Mayfield, Douglas Oard, Eugene Yang

    Abstract: HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and Russian, topics in English and in the document languages, and graded relevance judgments. New test collections are needed because existing CLIR test collections built using pooling of traditional CLIR runs have systematic gaps in their relevance j…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: 16 pages, 2 figures, accepted at ECIR 2022

  25. arXiv:2201.08471  [pdf, other]

    cs.IR cs.CL

    Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models

    Authors: Suraj Nair, Eugene Yang, Dawn Lawrie, Kevin Duh, Paul McNamee, Kenton Murray, James Mayfield, Douglas W. Oard

    Abstract: The advent of transformer-based models such as BERT has led to the rise of neural ranking models. These models have improved the effectiveness of retrieval systems well beyond that of lexical term matching models such as BM25. While monolingual retrieval tasks have benefited from large-scale training collections such as MS MARCO and advances in neural architectures, cross-language retrieval tasks…

    Submitted 20 January, 2022; originally announced January 2022.

    Comments: Accepted at ECIR 2022 (Full paper)

  26. arXiv:2003.03072  [pdf, other]

    cs.CL cs.LG

    Improving Neural Named Entity Recognition with Gazetteers

    Authors: Chan Hee Song, Dawn Lawrie, Tim Finin, James Mayfield

    Abstract: The goal of this work is to improve the performance of a neural named entity recognition system by adding input features that indicate a word is part of a name included in a gazetteer. This article describes how to generate gazetteers from the Wikidata knowledge graph as well as how to integrate the information into a neural NER system. Experiments reveal that the approach yields performance gains…

    Submitted 6 March, 2020; originally announced March 2020.

    Comments: Short version accepted to the 33rd FLAIRS conference