-
SEMQA: Semi-Extractive Multi-Source Question Answering
Authors:
Tal Schuster,
Adam D. Lelkes,
Haitian Sun,
Jai Gupta,
Jonathan Berant,
William W. Cohen,
Donald Metzler
Abstract:
Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge.
In this work, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer, while mixing factual quoted spans -- copied verbatim from given input sources -- and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate.
To study this task, we create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of QuoteSum for developing and studying such consolidation capabilities.
Submitted 30 June, 2024; v1 submitted 8 November, 2023;
originally announced November 2023.
-
SDOH-NLI: a Dataset for Inferring Social Determinants of Health from Clinical Notes
Authors:
Adam D. Lelkes,
Eric Loreaux,
Tal Schuster,
Ming-Jun Chen,
Alvin Rajkomar
Abstract:
Social and behavioral determinants of health (SDOH) play a significant role in shaping health outcomes, and extracting these determinants from clinical notes is a first step to help healthcare providers systematically identify opportunities to provide appropriate care and address disparities. Progress on using NLP methods for this task has been hindered by the lack of high-quality publicly available labeled data, largely due to the privacy and regulatory constraints on the use of real patients' information. This paper introduces a new dataset, SDOH-NLI, that is based on publicly available notes and which we release publicly. We formulate SDOH extraction as a natural language inference (NLI) task, and provide binary textual entailment labels obtained from human raters for a cross product of a set of social history snippets as premises and SDOH factors as hypotheses. Our dataset differs from standard NLI benchmarks in that our premises and hypotheses are obtained independently. We evaluate both "off-the-shelf" entailment models as well as models fine-tuned on our data, and highlight the ways in which our dataset appears more challenging than commonly used NLI datasets.
Submitted 27 October, 2023;
originally announced October 2023.
-
How Does Generative Retrieval Scale to Millions of Passages?
Authors:
Ronak Pradeep,
Kai Hui,
Jai Gupta,
Adam D. Lelkes,
Honglei Zhuang,
Jimmy Lin,
Donald Metzler,
Vinh Q. Tran
Abstract:
Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages; notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. We believe these findings will be valuable for the community to clarify the current state of generative retrieval, highlight the unique challenges, and inspire new research directions.
Submitted 19 May, 2023;
originally announced May 2023.
-
Instability in clinical risk stratification models using deep learning
Authors:
Daniel Lopez-Martinez,
Alex Yakubovich,
Martin Seneviratne,
Adam D. Lelkes,
Akshit Tyagi,
Jonas Kemp,
Ethan Steinberg,
N. Lance Downing,
Ron C. Li,
Keith E. Morse,
Nigam H. Shah,
Ming-Jun Chen
Abstract:
While it has been well known in the ML community that deep learning models suffer from instability, the consequences for healthcare deployments are under-characterised. We study the stability of different model architectures trained on electronic health records, using a set of outpatient prediction tasks as a case study. We show that repeated training runs of the same deep learning model on the same training data can result in significantly different outcomes at a patient level even though global performance metrics remain stable. We propose two stability metrics for measuring the effect of randomness of model training, as well as mitigation strategies for improving model stability.
Submitted 19 November, 2022;
originally announced November 2022.
-
All Birds with One Stone: Multi-task Text Classification for Efficient Inference with One Forward Pass
Authors:
Jiaxin Huang,
Tianqi Liu,
Jialu Liu,
Adam D. Lelkes,
Cong Yu,
Jiawei Han
Abstract:
Multi-Task Learning (MTL) models have shown their robustness, effectiveness, and efficiency for transferring learned knowledge across tasks. In real industrial applications such as web content classification, multiple classification tasks are predicted from the same input text such as a web article. However, at the serving time, the existing multitask transformer models such as prompt or adaptor based approaches need to conduct N forward passes for N tasks with O(N) computation cost. To tackle this problem, we propose a scalable method that can achieve stronger performance with close to O(1) computation cost via only one forward pass. To illustrate real application usage, we release a multitask dataset on news topic and style classification. Our experiments show that our proposed method outperforms strong baselines on both the GLUE benchmark and our news dataset. Our code and dataset are publicly available at https://bit.ly/mtop-code.
Submitted 22 May, 2022;
originally announced May 2022.
-
AgreeSum: Agreement-Oriented Multi-Document Summarization
Authors:
Richard Yuanzhe Pang,
Adam D. Lelkes,
Vinh Q. Tran,
Cong Yu
Abstract:
We aim to renew interest in a particular multi-document summarization (MDS) task which we call AgreeSum: agreement-oriented multi-document summarization. Given a cluster of articles, the goal is to provide abstractive summaries that represent information common and faithful to all input articles. Given the lack of existing datasets, we create a dataset for AgreeSum, and provide annotations on article-summary entailment relations for a subset of the clusters in the dataset. We aim to create strong baselines for the task by applying the top-performing pretrained single-document summarization model PEGASUS onto AgreeSum, leveraging both annotated clusters by supervised losses, and unannotated clusters by T5-based entailment-related and language-related losses. Compared to other baselines, both automatic evaluation and human evaluation show better article-summary and cluster-summary entailment in generated summaries. On a separate note, we hope that our article-summary entailment annotations contribute to the community's effort in improving abstractive summarization faithfulness.
Submitted 4 June, 2021;
originally announced June 2021.
-
Quiz-Style Question Generation for News Stories
Authors:
Adam D. Lelkes,
Vinh Q. Tran,
Cong Yu
Abstract:
A large majority of American adults get at least some of their news from the Internet. Even though many online news products have the goal of informing their users about the news, they lack scalable and reliable tools for measuring how well they are achieving this goal, and therefore have to resort to noisy proxy metrics (e.g., click-through rates or reading time) to track their performance.
As a first step towards measuring news informedness at a scale, we study the problem of quiz-style multiple-choice question generation, which may be used to survey users about their knowledge of recent news. In particular, we formulate the problem as two sequence-to-sequence tasks: question-answer generation (QAG) and distractor, or incorrect answer, generation (DG). We introduce NewsQuizQA, the first dataset intended for quiz-style question-answer generation, containing 20K human written question-answer pairs from 5K news article summaries. Using this dataset, we propose a series of novel techniques for applying large pre-trained Transformer encoder-decoder models, namely PEGASUS and T5, to the tasks of question-answer generation and distractor generation.
We show that our models outperform strong baselines using both automated metrics and human raters. We provide a case study of running weekly quizzes on real-world users via the Google Surveys platform over the course of two months. We found that users generally found the automatically generated questions to be educational and enjoyable. Finally, to serve the research community, we are releasing the NewsQuizQA dataset.
Submitted 17 February, 2021;
originally announced February 2021.
-
A Confidence-Based Approach for Balancing Fairness and Accuracy
Authors:
Benjamin Fish,
Jeremy Kun,
Ádám D. Lelkes
Abstract:
We study three classical machine learning algorithms in the context of algorithmic fairness: adaptive boosting, support vector machines, and logistic regression. Our goal is to maintain the high accuracy of these learning algorithms while reducing the degree to which they discriminate against individuals because of their membership in a protected group.
Our first contribution is a method for achieving fairness by shifting the decision boundary for the protected group. The method is based on the theory of margins for boosting. Our method performs comparably to or outperforms previous algorithms in the fairness literature in terms of accuracy and low discrimination, while simultaneously allowing for a fast and transparent quantification of the trade-off between bias and error.
Our second contribution addresses the shortcomings of the bias-error trade-off studied in most of the algorithmic fairness literature. We demonstrate that even hopelessly naive modifications of a biased algorithm, which cannot be reasonably said to be fair, can still achieve low bias and high accuracy. To help to distinguish between these naive algorithms and more sensible algorithms we propose a new measure of fairness, called resilience to random bias (RRB). We demonstrate that RRB distinguishes well between our naive and sensible fairness algorithms. RRB together with bias and accuracy provides a more complete picture of the fairness of an algorithm.
Submitted 21 January, 2016;
originally announced January 2016.
-
Network installation and recovery: approximation lower bounds and faster exact formulations
Authors:
Alexander Gutfraind,
Jeremy Kun,
Ádám D. Lelkes,
Lev Reyzin
Abstract:
We study the Neighbor Aided Network Installation Problem (NANIP) introduced previously which asks for a minimal cost ordering of the vertices of a graph, where the cost of visiting a node is a function of the number of neighbors that have already been visited. This problem has applications in resource management and disaster recovery. In this paper we analyze the computational hardness of NANIP. In particular we show that this problem is NP-hard even when restricted to convex decreasing cost functions, give a linear approximation lower bound for the greedy algorithm, and prove a general sub-constant approximation lower bound. Then we give a new integer programming formulation of NANIP and empirically observe its speedup over the original integer program.
Submitted 13 November, 2014;
originally announced November 2014.
-
On the Computational Complexity of MapReduce
Authors:
Benjamin Fish,
Jeremy Kun,
Ádám Dániel Lelkes,
Lev Reyzin,
György Turán
Abstract:
In this paper we study MapReduce computations from a complexity-theoretic perspective. First, we formulate a uniform version of the MRC model of Karloff et al. (2010). We then show that the class of regular languages, and moreover all of sublogarithmic space, lies in constant round MRC. This result also applies to the MPC model of Andoni et al. (2014). In addition, we prove that, conditioned on a variant of the Exponential Time Hypothesis, there are strict hierarchies within MRC so that increasing the number of rounds or the amount of time per processor increases the power of MRC. To the best of our knowledge we are the first to approach the MapReduce model with complexity-theoretic techniques, and our work lays the foundation for further analysis relating MapReduce to established complexity classes.
Submitted 6 October, 2015; v1 submitted 1 October, 2014;
originally announced October 2014.
-
Biclique coverings, rectifier networks and the cost of $\varepsilon$-removal
Authors:
Szabolcs Iván,
Ádám Dániel Lelkes,
Judit Nagy-György,
Balázs Szörényi,
György Turán
Abstract:
We relate two complexity notions of bipartite graphs: the minimal weight biclique covering number $\mathrm{Cov}(G)$ and the minimal rectifier network size $\mathrm{Rect}(G)$ of a bipartite graph $G$. We show that there exist graphs with $\mathrm{Cov}(G)\geq \mathrm{Rect}(G)^{3/2-\varepsilon}$. As a corollary, we establish that there exist nondeterministic finite automata (NFAs) with $\varepsilon$-transitions, having $n$ transitions total such that the smallest equivalent $\varepsilon$-free NFA has $\Omega(n^{3/2-\varepsilon})$ transitions. We also formulate a version of previous bounds for the weighted set cover problem and discuss its connections to giving upper bounds for the possible blow-up.
Submitted 30 May, 2014;
originally announced June 2014.