Showing 1–16 of 16 results for author: Oard, D W

  1. On the Evaluation of Machine-Generated Reports

    Authors: James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibbler

    Abstract: Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of…

    Submitted 9 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: 12 pages, 4 figures, accepted at SIGIR 2024 as perspective paper

  2. PLAID SHIRTTT for Large-Scale Streaming Dense Retrieval

    Authors: Dawn Lawrie, Efsun Kayi, Eugene Yang, James Mayfield, Douglas W. Oard

    Abstract: PLAID, an efficient implementation of the ColBERT late interaction bi-encoder using pretrained language models for ranking, consistently achieves state-of-the-art performance in monolingual, cross-language, and multilingual retrieval. PLAID differs from ColBERT by assigning terms to clusters and representing those terms as cluster centroids plus compressed residual vectors. While PLAID is effectiv…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 5 pages, 1 figure, accepted at SIGIR 2024 as short paper

  3. arXiv:2404.18797  [pdf, other]

    cs.IR

    Efficiency-Effectiveness Tradeoff of Probabilistic Structured Queries for Cross-Language Information Retrieval

    Authors: Eugene Yang, Suraj Nair, Dawn Lawrie, James Mayfield, Douglas W. Oard, Kevin Duh

    Abstract: Probabilistic Structured Queries (PSQ) is a cross-language information retrieval (CLIR) method that uses translation probabilities statistically derived from aligned corpora. PSQ is a strong baseline for efficient CLIR using sparse indexing. It is, therefore, useful as the first stage in a cascaded neural CLIR system whose second stage is more effective but too inefficient to be used on its own to…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: 11 pages, 5 figures

  4. arXiv:2404.08071  [pdf, other]

    cs.IR

    Overview of the TREC 2023 NeuCLIR Track

    Authors: Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang

    Abstract: The principal goal of the TREC Neural Cross-Language Information Retrieval (NeuCLIR) track is to study the impact of neural approaches to cross-language information retrieval. The track has created four collections, large collections of Chinese, Persian, and Russian newswire and a smaller collection of Chinese scientific abstracts. The principal tasks are ranked retrieval of news in one of the thr…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: 27 pages, 17 figures. Part of the TREC 2023 Proceedings

  5. arXiv:2401.04810  [pdf, other]

    cs.IR cs.CL

    Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation

    Authors: Eugene Yang, Dawn Lawrie, James Mayfield, Douglas W. Oard, Scott Miller

    Abstract: Prior work on English monolingual retrieval has shown that a cross-encoder trained using a large number of relevance judgments for query-document pairs can be used as a teacher to train more efficient, but similarly effective, dual-encoder student models. Applying a similar knowledge distillation approach to training an efficient dual-encoder model for Cross-Language Information Retrieval (CLIR),…

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: 17 pages, 1 figure, accepted at ECIR 2024

  6. arXiv:2305.18683  [pdf, other]

    cs.DL cs.IR

    Known by the Company it Keeps: Proximity-Based Indexing for Physical Content in Archival Repositories

    Authors: Douglas W. Oard

    Abstract: Despite the plethora of born-digital content, vast troves of important content remain accessible only on physical media such as paper or microfilm. The traditional approach to indexing undigitized content is using manually created metadata that describes it at some level of aggregation (e.g., folder, box, or collection). Searchers led in this way to some subset of the content often must then manua…

    Submitted 8 July, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: To be published in Theory and Practice of Digital Libraries (TPDL) 2023

  7. arXiv:2304.12367  [pdf, other]

    cs.IR

    Overview of the TREC 2022 NeuCLIR Track

    Authors: Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang

    Abstract: This is the first year of the TREC Neural CLIR (NeuCLIR) track, which aims to study the impact of neural approaches to cross-language information retrieval. The main task in this year's track was ad hoc ranked retrieval of Chinese, Persian, or Russian newswire documents using queries expressed in English. Topics were developed using standard TREC processes, except that topics developed by an annot…

    Submitted 24 September, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: 22 pages, 13 figures, 10 tables. Part of the Thirty-First Text REtrieval Conference (TREC 2022) Proceedings. Replace the misplaced Russian result table

  8. arXiv:2212.10448  [pdf, other]

    cs.IR cs.CL

    Parameter-efficient Zero-shot Transfer for Cross-Language Dense Retrieval with Adapters

    Authors: Eugene Yang, Suraj Nair, Dawn Lawrie, James Mayfield, Douglas W. Oard

    Abstract: A popular approach to creating a zero-shot cross-language retrieval model is to substitute a monolingual pretrained language model in the retrieval model with a multilingual pretrained language model such as Multilingual BERT. This multilingual model is fine-tuned to the retrieval task with monolingual data such as English MS MARCO using the same training recipe as the monolingual retrieval model…

    Submitted 20 December, 2022; originally announced December 2022.

    Comments: 15 pages, 1 figure

  9. arXiv:2209.01335  [pdf, other]

    cs.IR cs.CL

    Neural Approaches to Multilingual Information Retrieval

    Authors: Dawn Lawrie, Eugene Yang, Douglas W. Oard, James Mayfield

    Abstract: Providing access to information across languages has been a goal of Information Retrieval (IR) for decades. While progress has been made on Cross Language IR (CLIR) where queries are expressed in one language and documents in another, the multilingual (MLIR) task to create a single ranked list of documents across many languages is considerably more challenging. This paper investigates whether adva…

    Submitted 9 February, 2023; v1 submitted 3 September, 2022; originally announced September 2022.

    Comments: 17 pages, 3 figures, accepted at ECIR 2023

  10. C3: Continued Pretraining with Contrastive Weak Supervision for Cross Language Ad-Hoc Retrieval

    Authors: Eugene Yang, Suraj Nair, Ramraj Chandradevan, Rebecca Iglesias-Flores, Douglas W. Oard

    Abstract: Pretrained language models have improved effectiveness on numerous tasks, including ad-hoc retrieval. Recent work has shown that continuing to pretrain a language model with auxiliary objectives before fine-tuning on the retrieval task can further improve retrieval effectiveness. Unlike monolingual retrieval, designing an appropriate auxiliary task for cross-language mappings is challenging. To ad…

    Submitted 25 April, 2022; originally announced April 2022.

    Comments: 6 pages, 2 figures, accepted as a SIGIR 2022 Short Paper

  11. arXiv:2201.08471  [pdf, other]

    cs.IR cs.CL

    Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models

    Authors: Suraj Nair, Eugene Yang, Dawn Lawrie, Kevin Duh, Paul McNamee, Kenton Murray, James Mayfield, Douglas W. Oard

    Abstract: The advent of transformer-based models such as BERT has led to the rise of neural ranking models. These models have improved the effectiveness of retrieval systems well beyond that of lexical term matching models such as BM25. While monolingual retrieval tasks have benefited from large-scale training collections such as MS MARCO and advances in neural architectures, cross-language retrieval tasks…

    Submitted 20 January, 2022; originally announced January 2022.

    Comments: Accepted at ECIR 2022 (Full paper)

  12. arXiv:2111.10504  [pdf, other]

    cs.IR

    Effects of context, complexity, and clustering on evaluation for math formula retrieval

    Authors: Behrooz Mansouri, Douglas W. Oard, Anurag Agarwal, Richard Zanibbi

    Abstract: There are now several test collections for the formula retrieval task, in which a system's goal is to identify useful mathematical formulae to show in response to a query posed as a formula. These test collections differ in query format, query complexity, number of queries, content source, and relevance definition. Comparisons among six formula retrieval test collections illustrate that defining r…

    Submitted 19 November, 2021; originally announced November 2021.

  13. arXiv:2111.05988  [pdf, ps, other]

    cs.IR cs.AI cs.CL

    Cross-language Information Retrieval

    Authors: Petra Galuščáková, Douglas W. Oard, Suraj Nair

    Abstract: Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assu…

    Submitted 8 June, 2022; v1 submitted 10 November, 2021; originally announced November 2021.

    Comments: 49 pages, 0 figures

  14. arXiv:2106.02293  [pdf, other]

    cs.CL cs.IR

    Cross-language Sentence Selection via Data Augmentation and Rationale Training

    Authors: Yanda Chen, Chris Kedzie, Suraj Nair, Petra Galuščáková, Rui Zhang, Douglas W. Oard, Kathleen McKeown

    Abstract: This paper proposes an approach to cross-language sentence selection in a low-resource setting. It uses data augmentation and negative sampling techniques on noisy parallel sentence data to directly learn a cross-lingual embedding-based query relevance model. Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval sys…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: ACL 2021 main conference

  15. arXiv:2102.01757  [pdf, other]

    cs.CL

    The Multilingual TEDx Corpus for Speech Recognition and Translation

    Authors: Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, Matt Post

    Abstract: We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 source languages. We segment transcripts into sentences and align them to the source-language audio and target-language translations. The corpus is released along with op…

    Submitted 14 June, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

    Comments: Accepted to Interspeech 2021

  16. arXiv:2011.07203  [pdf, ps, other]

    cs.IR

    Providing More Efficient Access To Government Records: A Use Case Involving Application of Machine Learning to Improve FOIA Review for the Deliberative Process Privilege

    Authors: Jason R. Baron, Mahmoud F. Sayed, Douglas W. Oard

    Abstract: At present, the review process for material that is exempt from disclosure under the Freedom of Information Act (FOIA) in the United States of America, and under many similar government transparency regimes worldwide, is entirely manual. Public access to the records of their government is thus inhibited by the long backlogs of material awaiting such reviews. This paper studies one aspect of that p…

    Submitted 13 November, 2020; originally announced November 2020.