Showing 1–50 of 63 results for author: Khashabi, D

Searching in archive cs.
  1. arXiv:2406.20092  [pdf, other]

    cs.CV

    LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression

    Authors: Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille

    Abstract: While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in large multi-modal models (LMMs) has remained a largely overlooked area. In this work, we present the study on the analysis of redundancy concerning visual tokens and efficient training within these models. Our initial experiments show…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Code is available at https://github.com/Beckschen/LLaVolta

  2. arXiv:2406.14673  [pdf, other]

    cs.CL

    Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell

    Authors: Taiming Lu, Muhan Gao, Kuai Yu, Adam Byerly, Daniel Khashabi

    Abstract: Large Language Models (LLMs) exhibit positional bias, struggling to utilize information from the middle or end of long contexts. Our study explores LLMs' long-context reasoning by probing their hidden representations. We find that while LLMs encode the position of target information, they often fail to leverage this in generating accurate responses. This reveals a disconnect between information re…

    Submitted 20 June, 2024; originally announced June 2024.

  3. arXiv:2405.13274  [pdf, other]

    cs.CL

    DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation

    Authors: Weiting Tan, Jingyu Zhang, Lingfeng Shen, Daniel Khashabi, Philipp Koehn

    Abstract: Non-autoregressive Transformers (NATs) are recently applied in direct speech-to-speech translation systems, which convert speech across different languages without intermediate text data. Although NATs generate high-quality outputs and offer faster inference than autoregressive models, they tend to produce incoherent and repetitive results due to complex data distribution (e.g., acoustic and lingu…

    Submitted 21 May, 2024; originally announced May 2024.

  4. arXiv:2404.04298  [pdf, other]

    cs.AI cs.CL cs.LG

    SELF-[IN]CORRECT: LLMs Struggle with Refining Self-Generated Responses

    Authors: Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, Daniel Khashabi

    Abstract: Can LLMs continually improve their previous outputs for better results? An affirmative answer would require LLMs to be better at discriminating among previously-generated alternatives, than generating initial responses. We explore the validity of this hypothesis in practice. We first introduce a unified framework that allows us to compare the generative and discriminative capability of any model o…

    Submitted 4 April, 2024; originally announced April 2024.

  5. arXiv:2404.03862  [pdf, other]

    cs.CL

    Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data

    Authors: Jingyu Zhang, Marc Marone, Tianjian Li, Benjamin Van Durme, Daniel Khashabi

    Abstract: For humans to trust the fluent generations of large language models (LLMs), they must be able to verify their correctness against trusted, external sources. Recent efforts aim to increase verifiability through citations of retrieved documents or post-hoc provenance. However, such citations are prone to mistakes that further complicate their verifiability. To address these limitations, we tackle th…

    Submitted 4 April, 2024; originally announced April 2024.

  6. arXiv:2403.12958  [pdf, other]

    cs.CL

    Dated Data: Tracing Knowledge Cutoffs in Large Language Models

    Authors: Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, Benjamin Van Durme

    Abstract: Released Large Language Models (LLMs) are often paired with a claimed knowledge cutoff date, or the dates at which training data was gathered. Such information is crucial for applications where the LLM must provide up to date information. However, this statement only scratches the surface: do all resources in the training data share the same knowledge cutoff date? Does the model's demonstrated kno…

    Submitted 19 March, 2024; originally announced March 2024.

  7. arXiv:2403.11905  [pdf, other]

    cs.AI cs.CL cs.CV cs.HC

    Tur[k]ingBench: A Challenge Benchmark for Web Agents

    Authors: Kevin Xu, Yeganeh Kordi, Kate Sanders, Yizhong Wang, Adam Byerly, Jack Zhang, Benjamin Van Durme, Daniel Khashabi

    Abstract: Recent chatbots have demonstrated impressive ability to understand and communicate in raw-text form. However, there is more to the world than raw text. For example, humans spend long hours of their time on web pages, where text is intertwined with other modalities and tasks are accomplished in the form of various complex interactions. Can state-of-the-art multi-modal models generalize to such comp…

    Submitted 21 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

  8. arXiv:2402.18678  [pdf, other]

    cs.CL

    RORA: Robust Free-Text Rationale Evaluation

    Authors: Zhengping Jiang, Yining Lu, Hanjie Chen, Daniel Khashabi, Benjamin Van Durme, Anqi Liu

    Abstract: Free-text rationales play a pivotal role in explainable NLP, bridging the knowledge and reasoning gaps behind a model's decision-making. However, due to the diversity of potential reasoning paths and a corresponding lack of definitive ground truth, their evaluation remains a challenge. Existing evaluation metrics rely on the degree to which a rationale supports a target label, but we find these fa…

    Submitted 14 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

  9. arXiv:2402.12370  [pdf, other]

    cs.CL cs.AI

    AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies

    Authors: Xiao Ye, Andrew Wang, Jacob Choi, Yining Lu, Shreya Sharma, Lingfeng Shen, Vijay Tiyyala, Nicholas Andrews, Daniel Khashabi

    Abstract: Humans regularly engage in analogical thinking, relating personal experiences to current situations ($X$ is analogous to $Y$ because of $Z$). Analogical thinking allows humans to solve problems in creative ways, grasp difficult concepts, and articulate ideas more effectively. Can language models (LMs) do the same? To answer this question, we propose ANALOBENCH, a benchmark to determine analogical…

    Submitted 19 February, 2024; originally announced February 2024.

  10. arXiv:2402.11399  [pdf, other]

    cs.CL cs.CR cs.CY cs.LG

    k-SemStamp: A Clustering-Based Semantic Watermark for Detection of Machine-Generated Text

    Authors: Abe Bohan Hou, Jingyu Zhang, Yichen Wang, Daniel Khashabi, Tianxing He

    Abstract: Recent watermarked generation algorithms inject detectable signatures during language generation to facilitate post-hoc detection. While token-level watermarks are vulnerable to paraphrase attacks, SemStamp (Hou et al., 2023) applies watermark on the semantic representation of sentences and demonstrates promising robustness. SemStamp employs locality-sensitive hashing (LSH) to partition the semant…

    Submitted 8 June, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

    Comments: Accepted to ACL 24 Findings

  11. arXiv:2401.13136  [pdf, other]

    cs.CL cs.AI

    The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts

    Authors: Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi

    Abstract: As the influence of large language models (LLMs) spans across global communities, their safety challenges in multilingual settings become paramount for alignment research. This paper examines the variations in safety challenges faced by LLMs across different languages and discusses approaches to alleviating such concerns. By comparing how state-of-the-art LLMs respond to the same set of malicious…

    Submitted 23 January, 2024; originally announced January 2024.

  12. arXiv:2310.08540  [pdf, other]

    cs.CL cs.AI cs.LG

    Do pretrained Transformers Learn In-Context by Gradient Descent?

    Authors: Lingfeng Shen, Aayush Mishra, Daniel Khashabi

    Abstract: The emergence of In-Context Learning (ICL) in LLMs remains a remarkable phenomenon that is partially understood. To explain ICL, recent studies have created theoretical connections to Gradient Descent (GD). We ask, do such connections hold up in actual pre-trained language models? We highlight the limiting assumptions in prior works that make their setup considerably different from the practical s…

    Submitted 3 June, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

  13. arXiv:2310.03991  [pdf, other]

    cs.CL

    SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation

    Authors: Abe Bohan Hou, Jingyu Zhang, Tianxing He, Yichen Wang, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, Yulia Tsvetkov

    Abstract: Existing watermarking algorithms are vulnerable to paraphrase attacks because of their token-level design. To address this issue, we propose SemStamp, a robust sentence-level semantic watermarking algorithm based on locality-sensitive hashing (LSH), which partitions the semantic space of sentences. The algorithm encodes and LSH-hashes a candidate sentence generated by an LLM, and conducts sentence…

    Submitted 22 April, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Comments: Accepted to NAACL 24 Main

  14. arXiv:2310.00840  [pdf, other]

    cs.CL

    Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models

    Authors: Tianjian Li, Haoran Xu, Philipp Koehn, Daniel Khashabi, Kenton Murray

    Abstract: Text generation models are notoriously vulnerable to errors in the training data. With the wide-spread availability of massive amounts of web-crawled data becoming more commonplace, how can we enhance the robustness of models trained on a massive amount of noisy web-crawled text? In our work, we propose Error Norm Truncation (ENT), a robust enhancement method to the standard training objective tha…

    Submitted 18 March, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  15. arXiv:2309.16155  [pdf, other]

    cs.CL cs.LG

    The Trickle-down Impact of Reward (In-)consistency on RLHF

    Authors: Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng, Haitao Mi, Daniel Khashabi, Dong Yu

    Abstract: Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves optimizing against a Reward Model (RM), which itself is trained to reflect human preferences for desirable generations. A notable subject that is understudied is the (in-)consistency of RMs -- whether they can recognize the semantic changes to different prompts and appropriately adapt their reward assignments -- an…

    Submitted 28 September, 2023; originally announced September 2023.

  16. arXiv:2307.08775  [pdf, other]

    cs.AI

    GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution

    Authors: Yining Lu, Haoping Yu, Daniel Khashabi

    Abstract: Augmenting large language models (LLM) to use external tools enhances their performance across a variety of tasks. However, prior works over-rely on task-specific demonstration of tool use that limits their generalizability and computational cost due to making many calls to large-scale LLMs. We introduce GEAR, a computationally efficient query-tool grounding algorithm that is generalizable to vari…

    Submitted 30 January, 2024; v1 submitted 17 July, 2023; originally announced July 2023.

  17. arXiv:2305.13252  [pdf, other]

    cs.CL cs.AI

    "According to ...": Prompting Language Models Improves Quoting from Pre-Training Data

    Authors: Orion Weller, Marc Marone, Nathaniel Weir, Dawn Lawrie, Daniel Khashabi, Benjamin Van Durme

    Abstract: Large Language Models (LLMs) may hallucinate and generate fake information, despite pre-training on factual data. Inspired by the journalistic device of "according to sources", we propose according-to prompting: directing LLMs to ground responses against previously observed text. To quantify this grounding, we propose a novel evaluation metric (QUIP-Score) that measures the extent to which model-p…

    Submitted 26 February, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted to EACL 2024

  18. arXiv:2305.10713  [pdf, other]

    cs.CL cs.LG

    Flatness-Aware Prompt Selection Improves Accuracy and Sample Efficiency

    Authors: Lingfeng Shen, Weiting Tan, Boyuan Zheng, Daniel Khashabi

    Abstract: With growing capabilities of large language models, prompting them has become the dominant way to access them. This has motivated the development of strategies for automatically selecting effective language prompts. In this paper, we introduce prompt flatness, a new metric to quantify the expected utility of a language prompt. This metric is inspired by flatness regularization in statistical learn…

    Submitted 22 October, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

  19. arXiv:2212.10560  [pdf, other]

    cs.CL cs.AI

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Authors: Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi

    Abstract: Large "instruction-tuned" language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model. We introduce Self-Instruct, a framework for improvi…

    Submitted 25 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023 camera ready, 23 pages, 9 figures, 11 tables

  20. arXiv:2212.10511  [pdf, other]

    cs.CL cs.AI cs.LG

    When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

    Authors: Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, Hannaneh Hajishirzi

    Abstract: Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the limitations of relying solely on their parameters to encode a wealth of world knowledge. This paper aims to understand LMs' strengths and limitations in memorizing factual knowledge, by conducting large-scale knowledge probing experiments of 10 m…

    Submitted 2 July, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023; Code and data available at https://github.com/AlexTMallen/adaptive-retrieval

  21. arXiv:2211.00053  [pdf, other]

    cs.CL

    Generating Sequences by Learning to Self-Correct

    Authors: Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, Yejin Choi

    Abstract: Sequence generation applications require satisfying semantic constraints, such as ensuring that programs are correct, using certain keywords, or avoiding undesirable content. Language models, whether fine-tuned or prompted with few-shot demonstrations, frequently violate these constraints, and lack a mechanism to iteratively revise their outputs. Moreover, some powerful language models are of extr…

    Submitted 31 October, 2022; originally announced November 2022.

  22. arXiv:2210.10040  [pdf, other]

    cs.CL cs.CY cs.LG cs.SI

    The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks

    Authors: Nikil Roashan Selvam, Sunipa Dev, Daniel Khashabi, Tushar Khot, Kai-Wei Chang

    Abstract: How reliably can we trust the scores obtained from social bias benchmarks as faithful indicators of problematic social biases in a given language model? In this work, we study this question by contrasting social biases with non-social biases stemming from choices made during dataset construction that might not even be discernible to the human eye. To do so, we empirically simulate various alternat…

    Submitted 16 June, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: ACL 2023

  23. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  24. arXiv:2205.12688  [pdf, other]

    cs.CL

    ProsocialDialog: A Prosocial Backbone for Conversational Agents

    Authors: Hyunwoo Kim, Youngjae Yu, Liwei Jiang, Ximing Lu, Daniel Khashabi, Gunhee Kim, Yejin Choi, Maarten Sap

    Abstract: Most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or passively agreeing with them. To address this issue, we introduce ProsocialDialog, the first large-scale multi-turn dialogue dataset to teach conversational agents to respond to problematic content following social norms. Covering diverse unethical, problematic, biased, and toxic sit…

    Submitted 25 October, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022 camera ready; Dataset and model can be found at https://hyunw.kim/prosocial-dialog/

  25. arXiv:2205.11603  [pdf, other]

    cs.CL

    Representation Projection Invariance Mitigates Representation Collapse

    Authors: Anastasia Razdaibiedina, Ashish Khetan, Zohar Karnin, Daniel Khashabi, Vishaal Kapoor, Vivek Madan

    Abstract: Fine-tuning contextualized representations learned by pre-trained language models remains a prevalent practice in NLP. However, fine-tuning can lead to representation degradation (also known as representation collapse), which may result in instability, sub-optimal performance, and weak generalization. In this paper, we propose Representation Projection Invariance (REPINA), a novel regularization…

    Submitted 21 November, 2023; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: 41 pages, 6 figures

  26. arXiv:2204.07705  [pdf, other]

    cs.CL cs.AI

    Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

    Authors: Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, et al. (15 additional authors not shown)

    Abstract: How well can NLP models generalize to a variety of unseen tasks when provided with task instructions? To address this question, we first introduce Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. Our collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting,…

    Submitted 24 October, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: Accepted to EMNLP 2022, 25 pages

  27. arXiv:2202.12359  [pdf, other]

    cs.CL cs.AI

    UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training

    Authors: Daniel Khashabi, Yeganeh Kordi, Hannaneh Hajishirzi

    Abstract: We present UnifiedQA-v2, a QA model built with the same process as UnifiedQA, except that it utilizes more supervision -- roughly 3x the number of datasets used for UnifiedQA. This generally leads to better in-domain and cross-domain results.

    Submitted 23 February, 2022; originally announced February 2022.

  28. arXiv:2202.11705  [pdf, other]

    cs.CL cs.AI cs.LG

    COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics

    Authors: Lianhui Qin, Sean Welleck, Daniel Khashabi, Yejin Choi

    Abstract: Many applications of text generation require incorporating different constraints to control the semantics or style of generated text. These constraints can be hard (e.g., ensuring certain keywords are included in the output) and soft (e.g., contextualizing the output with the left- or right-hand context). In this paper, we present Energy-based Constrained Decoding with Langevin Dynamics (COLD), a…

    Submitted 13 October, 2022; v1 submitted 23 February, 2022; originally announced February 2022.

    Comments: NeurIPS 2022. code: https://github.com/qkaren/COLD_decoding

  29. arXiv:2112.08726  [pdf, other]

    cs.CL

    NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics

    Authors: Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, Noah A. Smith, Yejin Choi

    Abstract: The dominant paradigm for neural text generation is left-to-right decoding from autoregressive language models. Constrained or controllable generation under complex lexical constraints, however, requires foresight to plan ahead feasible future paths. Drawing inspiration from the A* search algorithm, we propose NeuroLogic A*esque, a decoding algorithm that incorporates heuristic estimates of futu…

    Submitted 16 December, 2021; originally announced December 2021.

  30. arXiv:2112.08348  [pdf, other]

    cs.CL

    Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts

    Authors: Daniel Khashabi, Shane Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, Yejin Choi

    Abstract: Fine-tuning continuous prompts for target tasks has recently emerged as a compact alternative to full model fine-tuning. Motivated by these promising results, we investigate the feasibility of extracting a discrete (textual) interpretation of continuous prompts that is faithful to the problem they solve. In practice, we observe a "wayward" behavior between the task solved by continuous prompts and…

    Submitted 4 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

    Comments: NAACL 2022

  31. arXiv:2111.07408  [pdf, other]

    cs.CL

    Time Waits for No One! Analysis and Challenges of Temporal Misalignment

    Authors: Kelvin Luu, Daniel Khashabi, Suchin Gururangan, Karishma Mandyam, Noah A. Smith

    Abstract: When an NLP model is trained on text data from one time period and tested or deployed on data from another, the resulting temporal misalignment can degrade end-task performance. In this work, we establish a suite of eight diverse tasks across different domains (social media, science papers, news, and reviews) and periods of time (spanning five years or more) to quantify the effects of temporal mis…

    Submitted 1 July, 2022; v1 submitted 14 November, 2021; originally announced November 2021.

    Comments: 9 pages, 6 figures, 3 tables

    Journal ref: NAACL 2022

  32. arXiv:2110.08542  [pdf, other]

    cs.CL

    Hey AI, Can You Solve Complex Tasks by Talking to Agents?

    Authors: Tushar Khot, Kyle Richardson, Daniel Khashabi, Ashish Sabharwal

    Abstract: Training giant models from scratch for each complex task is resource- and data-inefficient. To help develop models that can leverage existing systems, we propose a new challenge: Learning to solve complex tasks by communicating with existing agents (or models) in natural language. We design a synthetic benchmark, CommaQA, with three complex reasoning tasks (explicit, implicit, numeric) designed to…

    Submitted 9 May, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Comments: Accepted to Findings of ACL 2022

  33. arXiv:2109.07830  [pdf, other]

    cs.CL cs.AI cs.LG

    Reframing Instructional Prompts to GPTk's Language

    Authors: Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, Hannaneh Hajishirzi

    Abstract: What kinds of instructional prompts are easier to follow for Language Models (LMs)? We study this question by conducting extensive empirical analysis that shed light on important features of successful instructional prompts. Specifically, we study several classes of reframing techniques for manual reformulation of prompts into more effective ones. Some examples include decomposing a complex task i…

    Submitted 15 March, 2022; v1 submitted 16 September, 2021; originally announced September 2021.

    Comments: ACL 2022 Findings

  34. arXiv:2106.01465  [pdf, other]

    cs.CL cs.AI cs.LG

    Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?

    Authors: Jieyu Zhao, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Kai-Wei Chang

    Abstract: Is it possible to use natural language to intervene in a model's behavior and alter its prediction in a desired way? We investigate the effectiveness of natural language interventions for reading-comprehension systems, studying this in the context of social stereotypes. Specifically, we propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: 9 pages, Findings of ACL-IJCNLP 2021

  35. arXiv:2104.08773  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Cross-Task Generalization via Natural Language Crowdsourcing Instructions

    Authors: Swaroop Mishra, Daniel Khashabi, Chitta Baral, Hannaneh Hajishirzi

    Abstract: Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. Despite the success of the conventional supervised learning on individual datasets, such models often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing…

    Submitted 14 March, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

    Comments: ACL 2022

  36. arXiv:2104.08727  [pdf, other]

    cs.CL cs.AI

    GooAQ: Open Question Answering with Diverse Answer Types

    Authors: Daniel Khashabi, Amos Ng, Tushar Khot, Ashish Sabharwal, Hannaneh Hajishirzi, Chris Callison-Burch

    Abstract: While day-to-day questions come with a variety of answer types, the current question-answering (QA) literature has failed to adequately address the answer diversity of questions. To this end, we present GooAQ, a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatic…

    Submitted 10 September, 2021; v1 submitted 18 April, 2021; originally announced April 2021.

    Comments: EMNLP-Findings 2021

  37. arXiv:2102.03315  [pdf, other]

    cs.CL cs.AI

    Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge

    Authors: Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Peter Clark

    Abstract: We present the ARC-DA dataset, a direct-answer ("open response", "freeform") version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset. While ARC has been influential in the community, its multiple-choice format is unrepresentative of real-world questions, and multiple choice formats can be particularly susceptible to artifacts. The ARC-DA dataset addresses these concerns by converting…

    Submitted 5 February, 2021; originally announced February 2021.

  38. arXiv:2101.06561  [pdf, other]

    cs.CL cs.AI

    GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation

    Authors: Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, Daniel S. Weld

    Abstract: While often assumed a gold standard, effective human evaluation of text generation remains an important, open area for research. We revisit this problem with a focus on producing consistent evaluations that are reproducible -- over time and across different populations. We study this goal in different stages of the human evaluation pipeline. In particular, we consider design choices for the annota…

    Submitted 31 October, 2022; v1 submitted 16 January, 2021; originally announced January 2021.

    Comments: Accepted to EMNLP 2022 main conference, visit our project page at: https://genie.apps.allenai.org

  39. arXiv:2101.02235  [pdf, other]

    cs.CL

    Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

    Authors: Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, Jonathan Berant

    Abstract: A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce StrategyQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative que…

    Submitted 6 January, 2021; originally announced January 2021.

    Comments: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2021. Author's final version

  40. arXiv:2012.06154  [pdf, other]

    cs.CL cs.AI

    ParsiNLU: A Suite of Language Understanding Challenges for Persian

    Authors: Daniel Khashabi, Arman Cohan, Siamak Shakeri, Pedram Hosseini, Pouya Pezeshkpour, Malihe Alikhani, Moin Aminnaseri, Marzieh Bitaab, Faeze Brahman, Sarik Ghazarian, Mozhdeh Gheini, Arman Kabiri, Rabeeh Karimi Mahabadi, Omid Memarrast, Ahmadreza Mosallanezhad, Erfan Noury, Shahab Raji, Mohammad Sadegh Rasooli, Sepideh Sadeghi, Erfan Sadeqi Azer, Niloofar Safi Samghabadi, Mahsa Shafaei, Saber Sheybani, Ali Tazarv, Yadollah Yaghoobzadeh

    Abstract: Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains to be concentrated on resource-rich languages like English. This work focuses on Persian language, one of the widely spoken languages in the world, and yet there are few NLU datasets available for this rich language. The availability of high-quality evaluat…

    Submitted 13 July, 2021; v1 submitted 11 December, 2020; originally announced December 2020.

    Comments: To appear on Transactions of the Association for Computational Linguistics (TACL), 2021

  41. arXiv:2010.02428  [pdf, other]

    cs.CL

    UnQovering Stereotyping Biases via Underspecified Questions

    Authors: Tao Li, Tushar Khot, Daniel Khashabi, Ashish Sabharwal, Vivek Srikumar

    Abstract: While language embeddings have been shown to have stereotyping biases, how these biases affect downstream question answering (QA) models remains unexplored. We present UNQOVER, a general framework to probe and quantify biases through underspecified questions. We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors: positional dependence an…

    Submitted 9 October, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: Accepted at Findings of EMNLP 2020

  42. arXiv:2009.00751  [pdf, other]

    cs.CL cs.AI

    Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models

    Authors: Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, Ashish Sabharwal

    Abstract: We propose a general framework called Text Modular Networks (TMNs) for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models. To ensure solvability of simpler tasks, TMNs learn the textual input-output behavior (i.e., language) of existing models through their datasets. This differs from prior decomposition-based approache…

    Submitted 12 April, 2021; v1 submitted 1 September, 2020; originally announced September 2020.

    Comments: Accepted to NAACL 2021

  43. arXiv:2005.04304  [pdf, other]

    cs.CL

    Temporal Common Sense Acquisition with Minimal Supervision

    Authors: Ben Zhou, Qiang Ning, Daniel Khashabi, Dan Roth

    Abstract: Temporal common sense (e.g., duration and frequency of events) is crucial for understanding natural language. However, its acquisition is challenging, partly because such information is often not expressed explicitly in text, and human annotation on such concepts is costly. This work proposes a novel sequence modeling approach that exploits explicit and implicit mentions of temporal common sense,…

    Submitted 8 May, 2020; originally announced May 2020.

    Comments: Accepted by ACL 2020

  44. arXiv:2005.00700  [pdf, other]

    cs.CL cs.AI

    UnifiedQA: Crossing Format Boundaries With a Single QA System

    Authors: Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, Hannaneh Hajishirzi

    Abstract: Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the…

    Submitted 6 October, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: EMNLP 2020 (Findings)

  45. arXiv:2005.00206  [pdf, other]

    cs.AI cs.CL

    TransOMCS: From Linguistic Graphs to Commonsense Knowledge

    Authors: Hongming Zhang, Daniel Khashabi, Yangqiu Song, Dan Roth

    Abstract: Commonsense knowledge acquisition is a key problem for artificial intelligence. Conventional methods of acquiring commonsense knowledge generally require laborious and costly human annotations, which are not feasible on a large scale. In this paper, we explore a practical way of mining commonsense knowledge from linguistic graphs, with the goal of transferring cheap knowledge obtained with linguis…

    Submitted 1 May, 2020; originally announced May 2020.

    Comments: Accepted by IJCAI 2020

  46. arXiv:2004.04849  [pdf, other]

    cs.CL cs.AI cs.LG

    More Bang for Your Buck: Natural Perturbation for Robust Question Answering

    Authors: Daniel Khashabi, Tushar Khot, Ashish Sabharwal

    Abstract: While recent models have achieved human-level scores on many NLP datasets, we observe that they are considerably sensitive to small changes in input. As an alternative to the standard approach of addressing this issue by constructing training sets of completely new examples, we propose doing so via minimal perturbation of examples. Specifically, our approach involves first collecting a set of seed…

    Submitted 6 October, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

    Comments: EMNLP 2020

  47. arXiv:2004.02709  [pdf, other]

    cs.CL

    Evaluating Models' Local Decision Boundaries via Contrast Sets

    Authors: Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, et al. (1 additional author not shown)

    Abstract: Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systemati…

    Submitted 1 October, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

  48. arXiv:1911.03850  [pdf, other]

    cs.CL cs.AI

    Not All Claims are Created Equal: Choosing the Right Statistical Approach to Assess Hypotheses

    Authors: Erfan Sadeqi Azer, Daniel Khashabi, Ashish Sabharwal, Dan Roth

    Abstract: Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community. We address this gap by contrasting various hypothesis as…

    Submitted 4 May, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

    Comments: ACL 2020

  49. arXiv:1909.03065  [pdf, other]

    cs.CL

    "Going on a vacation" takes longer than "Going for a walk": A Study of Temporal Commonsense Understanding

    Authors: Ben Zhou, Daniel Khashabi, Qiang Ning, Dan Roth

    Abstract: Understanding time is crucial for understanding events expressed in natural language. Because people rarely say the obvious, it is often necessary to have commonsense knowledge about various temporal aspects of events, such as duration, frequency, and temporal order. However, this important problem has so far received limited attention. This paper systematically studies this temporal commonsense p…

    Submitted 6 September, 2019; originally announced September 2019.

    Comments: EMNLP 2019 (short paper). arXiv admin note: text overlap with arXiv:1908.04926

  50. arXiv:1909.01958  [pdf, other]

    cs.CL cs.AI

    From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project

    Authors: Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, Michal Guerquin, Michael Schmitz

    Abstract: AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy, but the rich variety of standardized exams has remained a landmark challenge. Even in 2016, the best AI system achieved merely 59.3% on an 8th Grade science exam challenge. This paper reports unprecedented success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more…

    Submitted 1 February, 2021; v1 submitted 4 September, 2019; originally announced September 2019.

    Comments: AI Magazine 41 (4) Winter 2020. New analysis sections added