-
Refocusing on Relevance: Personalization in NLG
Authors:
Shiran Dudy,
Steven Bedrick,
Bonnie Webber
Abstract:
Many NLG tasks such as summarization, dialogue response, or open domain question answering focus primarily on a source text in order to generate a target response. This standard approach falls short, however, when a user's intent or context of work is not easily recoverable based solely on that source text -- a scenario that we argue is more of the rule than the exception. In this work, we argue t…
▽ More
Many NLG tasks such as summarization, dialogue response, or open domain question answering focus primarily on a source text in order to generate a target response. This standard approach falls short, however, when a user's intent or context of work is not easily recoverable based solely on that source text -- a scenario that we argue is more of the rule than the exception. In this work, we argue that NLG systems in general should place a much higher level of emphasis on making use of additional context, and suggest that relevance (as used in Information Retrieval) be thought of as a crucial tool for designing user-oriented text-generating tasks. We further discuss possible harms and hazards around such personalization, and argue that value-sensitive design represents a crucial path forward through these challenges.
△ Less
Submitted 10 September, 2021;
originally announced September 2021.
-
Searching for Scientific Evidence in a Pandemic: An Overview of TREC-COVID
Authors:
Kirk Roberts,
Tasmeer Alam,
Steven Bedrick,
Dina Demner-Fushman,
Kyle Lo,
Ian Soboroff,
Ellen Voorhees,
Lucy Lu Wang,
William R Hersh
Abstract:
We present an overview of the TREC-COVID Challenge, an information retrieval (IR) shared task to evaluate search on scientific literature related to COVID-19. The goals of TREC-COVID include the construction of a pandemic search test collection and the evaluation of IR methods for COVID-19. The challenge was conducted over five rounds from April to July, 2020, with participation from 92 unique tea…
▽ More
We present an overview of the TREC-COVID Challenge, an information retrieval (IR) shared task to evaluate search on scientific literature related to COVID-19. The goals of TREC-COVID include the construction of a pandemic search test collection and the evaluation of IR methods for COVID-19. The challenge was conducted over five rounds from April to July, 2020, with participation from 92 unique teams and 556 individual submissions. A total of 50 topics (sets of related queries) were used in the evaluation, starting at 30 topics for Round 1 and adding 5 new topics per round to target emerging topics at that state of the still-emerging pandemic. This paper provides a comprehensive overview of the structure and results of TREC-COVID. Specifically, the paper provides details on the background, task structure, topic structure, corpus, participation, pooling, assessment, judgments, results, top-performing systems, lessons learned, and benchmark datasets.
△ Less
Submitted 19 April, 2021;
originally announced April 2021.
-
Are Some Words Worth More than Others?
Authors:
Shiran Dudy,
Steven Bedrick
Abstract:
Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy only captures one aspect of a language model's behavior, and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics dir…
▽ More
Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy only captures one aspect of a language model's behavior, and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language, as the majority of the prediction attempts will occur with frequently-occurring types. A model's performance may vary greatly between high- and low-frequency words, which in practice could lead to failure modes such as repetitive and dull generated text being produced by a downstream consumer of a language model. To address this, we propose two new intrinsic evaluation measures within the framework of a simple word prediction task that are designed to give a more holistic picture of a language model's performance. We evaluate several commonly-used large English language models using our proposed metrics, and demonstrate that our approach reveals functional differences in performance between the models that are obscured by more traditional metrics.
△ Less
Submitted 13 October, 2020; v1 submitted 12 October, 2020;
originally announced October 2020.
-
TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection
Authors:
Ellen Voorhees,
Tasmeer Alam,
Steven Bedrick,
Dina Demner-Fushman,
William R Hersh,
Kyle Lo,
Kirk Roberts,
Ian Soboroff,
Lucy Lu Wang
Abstract:
TREC-COVID is a community evaluation designed to build a test collection that captures the information needs of biomedical researchers using the scientific literature during a pandemic. One of the key characteristics of pandemic search is the accelerated rate of change: the topics of interest evolve as the pandemic progresses and the scientific literature in the area explodes. The COVID-19 pandemi…
▽ More
TREC-COVID is a community evaluation designed to build a test collection that captures the information needs of biomedical researchers using the scientific literature during a pandemic. One of the key characteristics of pandemic search is the accelerated rate of change: the topics of interest evolve as the pandemic progresses and the scientific literature in the area explodes. The COVID-19 pandemic provides an opportunity to capture this progression as it happens. TREC-COVID, in creating a test collection around COVID-19 literature, is building infrastructure to support new research and technologies in pandemic search.
△ Less
Submitted 9 May, 2020;
originally announced May 2020.
-
CREATE: Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records using OMOP Common Data Model
Authors:
Sijia Liu,
Yanshan Wang,
Andrew Wen,
Liwei Wang,
Na Hong,
Feichen Shen,
Steven Bedrick,
William Hersh,
Hongfang Liu
Abstract:
Background: Widespread adoption of electronic health records (EHRs) has enabled secondary use of EHR data for clinical research and healthcare delivery. Natural language processing (NLP) techniques have shown promise in their capability to extract the embedded information in unstructured clinical data, and information retrieval (IR) techniques provide flexible and scalable solutions that can augme…
▽ More
Background: Widespread adoption of electronic health records (EHRs) has enabled secondary use of EHR data for clinical research and healthcare delivery. Natural language processing (NLP) techniques have shown promise in their capability to extract the embedded information in unstructured clinical data, and information retrieval (IR) techniques provide flexible and scalable solutions that can augment the NLP systems for retrieving and ranking relevant records. Methods: In this paper, we present the implementation of Cohort Retrieval Enhanced by Analysis of Text from EHRs (CREATE), a cohort retrieval system that can execute textual cohort selection queries on both structured and unstructured EHR data. CREATE is a proof-of-concept system that leverages a combination of structured queries and IR techniques on NLP results to improve cohort retrieval performance while adopting the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) to enhance model portability. The NLP component empowered by cTAKES is used to extract CDM concepts from textual queries. We design a hierarchical index in Elasticsearch to support CDM concept search utilizing IR techniques and frameworks. Results: Our case study on 5 cohort identification queries evaluated using the IR metric, P@5 (Precision at 5) at both the patient-level and document-level, demonstrates that CREATE achieves an average P@5 of 0.90, which outperforms systems using only structured data or only unstructured data with average P@5s of 0.54 and 0.74, respectively.
△ Less
Submitted 22 January, 2019;
originally announced January 2019.