Showing 1–26 of 26 results for author: Ladhak, F

  1. arXiv:2406.02756  [pdf, other]

    cs.CL cs.AI cs.LG

    Aligning Large Language Models via Fine-grained Supervision

    Authors: Dehong Xu, Liang Qiu, Minseok Kim, Faisal Ladhak, Jaeyoung Do

    Abstract: Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learn…

    Submitted 4 June, 2024; originally announced June 2024.

  2. arXiv:2310.17623  [pdf, other]

    cs.CL cs.LG

    Proving Test Set Contamination in Black Box Language Models

    Authors: Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, Tatsunori B. Hashimoto

    Abstract: Large language models are trained on vast amounts of internet data, prompting concerns and speculation that they have memorized public benchmarks. Going from speculation to proof of contamination is challenging, as the pretraining data used by proprietary models are often not publicly accessible. We show that it is possible to provide provable guarantees of test set contamination in language model…

    Submitted 23 November, 2023; v1 submitted 26 October, 2023; originally announced October 2023.

  3. arXiv:2309.04269  [pdf, other]

    cs.CL

    From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

    Authors: Griffin Adams, Alexander Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad

    Abstract: Selecting the "right" amount of information to include in a summary is a difficult task. A good summary should be detailed and entity-centric without being overly dense and hard to follow. To better understand this tradeoff, we solicit increasingly dense GPT-4 summaries with what we refer to as a "Chain of Density" (CoD) prompt. Specifically, GPT-4 generates an initial entity-sparse summary be…

    Submitted 8 September, 2023; originally announced September 2023.

    Comments: preprint

  4. arXiv:2305.17779  [pdf, other]

    cs.CL

    Generating EDU Extracts for Plan-Guided Summary Re-Ranking

    Authors: Griffin Adams, Alexander R. Fabbri, Faisal Ladhak, Kathleen McKeown, Noémie Elhadad

    Abstract: Two-step approaches, in which summary candidates are generated-then-reranked to return a single summary, can improve ROUGE scores over the standard single-step approach. Yet, standard decoding methods (i.e., beam search, nucleus sampling, and diverse beam search) produce candidates with redundant, and often low quality, content. In this paper, we design a novel method to generate candidates for re…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: ACL 2023

  5. arXiv:2303.17548  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG

    Whose Opinions Do Language Models Reflect?

    Authors: Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto

    Abstract: Language models (LMs) are increasingly being used in open-ended contexts, where the opinions reflected by LMs in response to subjective queries can have a profound impact, both on user satisfaction, as well as shaping the views of society at large. In this work, we put forth a quantitative framework to investigate the opinions reflected by LMs -- by leveraging high-quality public opinion polls and…

    Submitted 30 March, 2023; originally announced March 2023.

  6. arXiv:2301.13848  [pdf, other]

    cs.CL cs.AI cs.LG

    Benchmarking Large Language Models for News Summarization

    Authors: Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, Tatsunori B. Hashimoto

    Abstract: Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. S…

    Submitted 31 January, 2023; originally announced January 2023.

  7. arXiv:2212.10722  [pdf, other]

    cs.CL

    Contrastive Error Attribution for Finetuned Language Models

    Authors: Faisal Ladhak, Esin Durmus, Tatsunori Hashimoto

    Abstract: Recent work has identified noisy and misannotated data as a core cause of hallucinations and unfaithful outputs in Natural Language Generation (NLG) tasks. Consequently, identifying and removing these examples is a key open challenge in creating reliable NLG systems. In this work, we introduce a framework to identify and remove low-quality training instances that lead to undesirable outputs, such…

    Submitted 11 July, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023

  8. arXiv:2212.09746  [pdf, other]

    cs.CL

    Evaluating Human-Language Model Interaction

    Authors: Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael Bernstein, Percy Liang

    Abstract: Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive…

    Submitted 5 January, 2024; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI)

  9. arXiv:2211.09110  [pdf, other]

    cs.CL cs.AI cs.LG

    Holistic Evaluation of Language Models

    Authors: Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, et al. (25 additional authors not shown)

    Abstract: Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest fo…

    Submitted 1 October, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

    Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Project page: https://crfm.stanford.edu/helm/v1.0

    Journal ref: Published in Transactions on Machine Learning Research (TMLR), 2023

  10. arXiv:2211.05886  [pdf, ps, other]

    cs.CL

    CREATIVESUMM: Shared Task on Automatic Summarization for Creative Writing

    Authors: Divyansh Agarwal, Alexander R. Fabbri, Simeng Han, Wojciech Kryściński, Faisal Ladhak, Bryan Li, Kathleen McKeown, Dragomir Radev, Tianyi Zhang, Sam Wiseman

    Abstract: This paper introduces the shared task of summarizing documents in several creative domains, namely literary texts, movie scripts, and television scripts. Summarizing these creative documents requires making complex literary interpretations, as well as understanding non-trivial temporal dependencies in texts containing varied styles of plot development and narrative structure. This poses unique cha…

    Submitted 6 December, 2022; v1 submitted 10 November, 2022; originally announced November 2022.

    Comments: 4 pages + 3 for references and appendix

  11. arXiv:2211.04903  [pdf, other]

    cs.CL

    Novel Chapter Abstractive Summarization using Spinal Tree Aware Sub-Sentential Content Selection

    Authors: Hardy Hardy, Miguel Ballesteros, Faisal Ladhak, Muhammad Khalifa, Vittorio Castelli, Kathleen McKeown

    Abstract: Summarizing novel chapters is a difficult task due to the input length and the fact that sentences that appear in the desired summaries draw content from multiple places throughout the chapter. We present a pipelined extractive-abstractive approach where the extractive step filters the content that is passed to the abstractive component. Extremely lengthy input also results in a highly skewed data…

    Submitted 9 November, 2022; originally announced November 2022.

  12. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale

    Authors: Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Zou, Aylin Caliskan

    Abstract: Machine learning models that convert user-written text descriptions into images are now widely available online and used by millions of users to generate millions of images a day. We investigate the potential for these models to amplify dangerous and complex stereotypes. We find a broad range of ordinary prompts produce stereotypes, including prompts simply mentioning traits, descriptors, occupati…

    Submitted 7 June, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

    Comments: FAccT 2023 paper. The published version is available at 10.1145/3593013.3594095

  13. arXiv:2206.11249  [pdf, other]

    cs.CL cs.AI cs.LG

    GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

    Authors: Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, et al. (52 additional authors not shown)

    Abstract: Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, an…

    Submitted 24 June, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

  14. arXiv:2205.12495  [pdf, other]

    cs.CL

    ToKen: Task Decomposition and Knowledge Infusion for Few-Shot Hate Speech Detection

    Authors: Badr AlKhamissi, Faisal Ladhak, Srini Iyer, Ves Stoyanov, Zornitsa Kozareva, Xian Li, Pascale Fung, Lambert Mathias, Asli Celikyilmaz, Mona Diab

    Abstract: Hate speech detection is complex; it relies on commonsense reasoning, knowledge of stereotypes, and an understanding of social nuance that differs from one culture to the next. It is also difficult to collect a large-scale hate speech annotated dataset. In this work, we frame this problem as a few-shot learning task, and show significant gains with decomposing the task into its "constituent" parts…

    Submitted 20 May, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: Accepted at EMNLP 2022

    Journal ref: In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2109-2120, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

  15. arXiv:2204.09890  [pdf, other]

    cs.CL

    Spurious Correlations in Reference-Free Evaluation of Text Generation

    Authors: Esin Durmus, Faisal Ladhak, Tatsunori Hashimoto

    Abstract: Model-based, reference-free evaluation metrics have been proposed as a fast and cost-effective approach to evaluate Natural Language Generation (NLG) systems. Despite promising recent results, we find evidence that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We further o…

    Submitted 21 April, 2022; originally announced April 2022.

    Comments: Published in ACL 2022 main conference

  16. arXiv:2108.13684  [pdf, other]

    cs.CL

    Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization

    Authors: Faisal Ladhak, Esin Durmus, He He, Claire Cardie, Kathleen McKeown

    Abstract: Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs as one naive way to improve faithfulness is to make summarization models more extractive. In this work, we present a framework f…

    Submitted 21 April, 2022; v1 submitted 31 August, 2021; originally announced August 2021.

    Comments: Published in ACL 2022 main conference

  17. arXiv:2108.07258  [pdf, other]

    cs.LG cs.AI cs.CY

    On the Opportunities and Risks of Foundation Models

    Authors: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, et al. (89 additional authors not shown)

    Abstract: AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their cap…

    Submitted 12 July, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

    Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Report page with citation guidelines: https://crfm.stanford.edu/report.html

  18. arXiv:2104.07868  [pdf, other]

    cs.CL

    Segmenting Subtitles for Correcting ASR Segmentation Errors

    Authors: David Wan, Chris Kedzie, Faisal Ladhak, Elsbeth Turcan, Petra Galuščáková, Elena Zotkina, Zhengping Jiang, Peter Bell, Kathleen McKeown

    Abstract: Typical ASR systems segment the input audio into utterances using purely acoustic information, which may not resemble the sentence-like units that are expected by conventional machine translation (MT) systems for Spoken Language Translation. In this work, we propose a model for correcting the acoustic segmentation of ASR models for low-resource languages to improve performance on downstream tasks.…

    Submitted 15 April, 2021; originally announced April 2021.

  19. arXiv:2102.01672  [pdf, other]

    cs.CL cs.AI cs.LG

    The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

    Authors: Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, et al. (31 additional authors not shown)

    Abstract: We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it…

    Submitted 1 April, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

  20. arXiv:2010.14042  [pdf, other]

    cs.CL

    To BERT or Not to BERT: Comparing Task-specific and Task-agnostic Semi-Supervised Approaches for Sequence Tagging

    Authors: Kasturi Bhattacharjee, Miguel Ballesteros, Rishita Anubhai, Smaranda Muresan, Jie Ma, Faisal Ladhak, Yaser Al-Onaizan

    Abstract: Leveraging large amounts of unlabeled data using Transformer-like architectures, like BERT, has gained popularity in recent times owing to their effectiveness in learning general representations that can then be further fine-tuned for downstream tasks to much success. However, training these models can be costly both from an economic and environmental standpoint. In this work, we investigate how t…

    Submitted 27 October, 2020; originally announced October 2020.

    Comments: Accepted in the Proceedings of 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020) (https://2020.emnlp.org/papers/main)

  21. arXiv:2010.09608  [pdf, other]

    cs.CL

    Incorporating Terminology Constraints in Automatic Post-Editing

    Authors: David Wan, Chris Kedzie, Faisal Ladhak, Marine Carpuat, Kathleen McKeown

    Abstract: Users of machine translation (MT) may want to ensure the use of specific lexical terminologies. While there exist techniques for incorporating terminology constraints during inference for MT, current APE approaches cannot ensure that they will appear in the final translation. In this paper, we present both autoregressive and non-autoregressive models for lexically constrained APE, demonstrating th…

    Submitted 19 October, 2020; originally announced October 2020.

    Comments: To appear in WMT, 2020

  22. arXiv:2010.03093  [pdf, other]

    cs.CL

    WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

    Authors: Faisal Ladhak, Esin Durmus, Claire Cardie, Kathleen McKeown

    Abstract: We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of crosslingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images th…

    Submitted 6 October, 2020; originally announced October 2020.

    Comments: Findings of EMNLP 2020

  23. arXiv:2005.01840  [pdf, other]

    cs.CL

    Exploring Content Selection in Summarization of Novel Chapters

    Authors: Faisal Ladhak, Bryan Li, Yaser Al-Onaizan, Kathleen McKeown

    Abstract: We present a new summarization task, generating summaries of novel chapters using summary/chapter pairs from online study guides. This is a harder task than the news summarization task, given the chapter length as well as the extreme paraphrasing and generalization found in the summaries. We focus on extractive summarization, which requires the creation of a gold-standard set of extractive summari…

    Submitted 29 March, 2021; v1 submitted 4 May, 2020; originally announced May 2020.

    Comments: ACL 2020

  24. The Role of Pragmatic and Discourse Context in Determining Argument Impact

    Authors: Esin Durmus, Faisal Ladhak, Claire Cardie

    Abstract: Research in the social sciences and psychology has shown that the persuasiveness of an argument depends not only on the language employed, but also on attributes of the source/communicator, the audience, and the appropriateness and strength of the argument's claims given the pragmatic and discourse context of the argument. Among these characteristics of persuasive arguments, prior work in NLP does no…

    Submitted 6 April, 2020; originally announced April 2020.

    Comments: EMNLP 2019

  25. Determining Relative Argument Specificity and Stance for Complex Argumentative Structures

    Authors: Esin Durmus, Faisal Ladhak, Claire Cardie

    Abstract: Systems for automatic argument generation and debate require the ability to (1) determine the stance of any claims employed in the argument and (2) assess the specificity of each claim relative to the argument context. Existing work on understanding claim specificity and stance, however, has been limited to the study of argumentative structures that are relatively shallow, most often consisting of…

    Submitted 26 June, 2019; originally announced June 2019.

  26. arXiv:1804.08198  [pdf, other]

    cs.CL

    A neural interlingua for multilingual machine translation

    Authors: Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, Jason Sun

    Abstract: We incorporate an explicit neural interlingua into a multilingual encoder-decoder neural machine translation (NMT) architecture. We demonstrate that our model learns a language-independent representation by performing direct zero-shot translation (without using pivot translation), and by using the source sentence embeddings to create an English Yelp review classifier that, through the mediation of…

    Submitted 16 October, 2018; v1 submitted 22 April, 2018; originally announced April 2018.

    Comments: Accepted in WMT 18