Search | arXiv e-print repository

Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation

Authors: Ruixin Yang, Dheeraj Rajagopal, Shirley Anugrah Hayati, Bin Hu, Dongyeop Kang

Abstract: Uncertainty estimation is a significant issue for current large language models (LLMs) that are generally poorly calibrated and over-confident, especially with reinforcement learning from human feedback (RLHF). Unlike humans, whose decisions and confidences not only stem from intrinsic beliefs but can also be adjusted through daily observations, existing calibration methods for LLMs focus on estim… ▽ More Uncertainty estimation is a significant issue for current large language models (LLMs) that are generally poorly calibrated and over-confident, especially with reinforcement learning from human feedback (RLHF). Unlike humans, whose decisions and confidences not only stem from intrinsic beliefs but can also be adjusted through daily observations, existing calibration methods for LLMs focus on estimating or eliciting individual confidence without taking full advantage of the "Collective Wisdom": the interaction among multiple LLMs that can collectively improve both accuracy and calibration. In this work, we propose Collaborative Calibration, a post-hoc training-free calibration strategy that leverages the collaborative and expressive capabilities of multiple tool-augmented LLM agents in a simulated group deliberation process. We demonstrate the effectiveness of Collaborative Calibration on generative QA tasks across various domains, showing its potential in harnessing the rationalization of collectively calibrated confidence assessments and improving the reliability of model predictions. △ Less

Submitted 10 May, 2024; v1 submitted 13 April, 2024; originally announced April 2024.

Comments: Accepted at ICLR 2024 Workshop on Reliable and Responsible Foundation Models

arXiv:2402.11532 [pdf, other]

Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models

Authors: Shirley Anugrah Hayati, Taehee Jung, Tristan Bodding-Long, Sudipta Kar, Abhinav Sethy, Joo-Kyung Kim, Dongyeop Kang

Abstract: Fine-tuning large language models (LLMs) with a collection of large and diverse instructions has improved the model's generalization to different tasks, even for unseen tasks. However, most existing instruction datasets include only single instructions, and they struggle to follow complex instructions composed of multiple subtasks. In this work, we propose a novel concept of compositional instruct… ▽ More Fine-tuning large language models (LLMs) with a collection of large and diverse instructions has improved the model's generalization to different tasks, even for unseen tasks. However, most existing instruction datasets include only single instructions, and they struggle to follow complex instructions composed of multiple subtasks. In this work, we propose a novel concept of compositional instructions called chain-of-instructions (CoI), where the output of one instruction becomes an input for the next like a chain. Unlike the conventional practice of solving single instruction tasks, our proposed method encourages a model to solve each subtask step by step until the final answer is reached. CoI-tuning (i.e., fine-tuning with CoI instructions) improves the model's ability to handle instructions composed of multiple subtasks as well as unseen composite tasks such as multilingual summarization. Overall, our study find that simple CoI tuning of existing instruction data can provide consistent generalization to solve more complex, unseen, and longer chains of instructions. △ Less

Submitted 24 June, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

arXiv:2401.14698 [pdf, other]

Under the Surface: Tracking the Artifactuality of LLM-Generated Data

Authors: Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, Dongyeop Kang

Abstract: This work delves into the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. As these forms of LLM-generated data often intersect in their application, they exert mutual influence on each other and raise significant c… ▽ More This work delves into the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. As these forms of LLM-generated data often intersect in their application, they exert mutual influence on each other and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, leading to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from more tightly constrained data like "task labels" to more lightly constrained "free-form text". We then stress test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Despite artificial data's capability to match human performance, this paper reveals significant hidden disparities, especially in complex tasks where LLMs often miss the nuanced understanding of intrinsic human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices in data creation and when using LLMs. It highlights the LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing biases and artifacts produced in LLM-generated content for future research and development. All data and code are available on our project page. △ Less

Submitted 30 January, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

Comments: Core Authors: Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee and Zae Myung Kim | Project lead : Debarati Das | PI : Dongyeop Kang

arXiv:2311.09799 [pdf, other]

How Far Can We Extract Diverse Perspectives from Large Language Models?

Authors: Shirley Anugrah Hayati, Minhwa Lee, Dheeraj Rajagopal, Dongyeop Kang

Abstract: Collecting diverse human opinions is costly and challenging. This leads to a recent trend in collaborative efforts between humans and Large Language Models (LLMs) for generating diverse data, offering potential scalable and efficient solutions. However, the extent of LLMs' capability to generate diverse perspectives on subjective topics remains an unexplored question. In this study, we investigate… ▽ More Collecting diverse human opinions is costly and challenging. This leads to a recent trend in collaborative efforts between humans and Large Language Models (LLMs) for generating diverse data, offering potential scalable and efficient solutions. However, the extent of LLMs' capability to generate diverse perspectives on subjective topics remains an unexplored question. In this study, we investigate LLMs' capacity for generating diverse perspectives and rationales on subjective topics, such as social norms and argumentative texts. We formulate a new problem of maximum diversity extraction from LLMs. Motivated by how humans develop their opinions through their values, we propose a criteria-based prompting technique to ground diverse opinions. To see how far we can extract diverse perspectives from LLMs, or called diversity coverage, we employ a step-by-step recall prompting for generating more outputs from the model in an iterative manner. As we apply our methods to various tasks, indeed we find that LLMs can generate diverse opinions according to the degree of task subjectivity △ Less

Submitted 18 February, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

arXiv:2212.08279 [pdf, other]

Werewolf Among Us: A Multimodal Dataset for Modeling Persuasion Behaviors in Social Deduction Games

Authors: Bolin Lai, Hongxin Zhang, Miao Liu, Aryan Pariani, Fiona Ryan, Wenqi Jia, Shirley Anugrah Hayati, James M. Rehg, Diyi Yang

Abstract: Persuasion modeling is a key building block for conversational agents. Existing works in this direction are limited to analyzing textual dialogue corpus. We argue that visual signals also play an important role in understanding human persuasive behaviors. In this paper, we introduce the first multimodal dataset for modeling persuasion behaviors. Our dataset includes 199 dialogue transcriptions and… ▽ More Persuasion modeling is a key building block for conversational agents. Existing works in this direction are limited to analyzing textual dialogue corpus. We argue that visual signals also play an important role in understanding human persuasive behaviors. In this paper, we introduce the first multimodal dataset for modeling persuasion behaviors. Our dataset includes 199 dialogue transcriptions and videos captured in a multi-player social deduction game setting, 26,647 utterance level annotations of persuasion strategy, and game level annotations of deduction game outcomes. We provide extensive experiments to show how dialogue context and visual signals benefit persuasion strategy prediction. We also explore the generalization ability of language models for persuasion modeling and the role of persuasion strategies in predicting social deduction game outcomes. Our dataset, code, and models can be found at https://persuasion-deductiongame.socialai-data.org. △ Less

Submitted 15 December, 2022; originally announced December 2022.

Comments: 17 pages

arXiv:2211.05182 [pdf, ps, other]

doi 10.1145/3555640

Modeling Motivational Interviewing Strategies On An Online Peer-to-Peer Counseling Platform

Authors: Raj Sanjay Shah, Faye Holt, Shirley Anugrah Hayati, Aastha Agarwal, Yi-Chia Wang, Robert E. Kraut, Diyi Yang

Abstract: Millions of people participate in online peer-to-peer support sessions, yet there has been little prior research on systematic psychology-based evaluations of fine-grained peer-counselor behavior in relation to client satisfaction. This paper seeks to bridge this gap by map** peer-counselor chat-messages to motivational interviewing (MI) techniques. We annotate 14,797 utterances from 734 chat co… ▽ More Millions of people participate in online peer-to-peer support sessions, yet there has been little prior research on systematic psychology-based evaluations of fine-grained peer-counselor behavior in relation to client satisfaction. This paper seeks to bridge this gap by map** peer-counselor chat-messages to motivational interviewing (MI) techniques. We annotate 14,797 utterances from 734 chat conversations using 17 MI techniques and introduce four new interviewing codes such as chit-chat and inappropriate to account for the unique conversational patterns observed on online platforms. We automate the process of labeling peer-counselor responses to MI techniques by fine-tuning large domain-specific language models and then use these automated measures to investigate the behavior of the peer counselors via correlational studies. Specifically, we study the impact of MI techniques on the conversation ratings to investigate the techniques that predict clients' satisfaction with their counseling sessions. When counselors use techniques such as reflection and affirmation, clients are more satisfied. Examining volunteer counselors' change in usage of techniques suggest that counselors learn to use more introduction and open questions as they gain experience. This work provides a deeper understanding of the use of motivational interviewing techniques on peer-to-peer counselor platforms and sheds light on how to build better training programs for volunteer counselors on online platforms. △ Less

Submitted 9 November, 2022; originally announced November 2022.

Comments: Accepted at CSCW 2022

arXiv:2210.07469 [pdf, other]

StyLEx: Explaining Style Using Human Lexical Annotations

Authors: Shirley Anugrah Hayati, Kyumin Park, Dheeraj Rajagopal, Lyle Ungar, Dongyeop Kang

Abstract: Large pre-trained language models have achieved impressive results on various style classification tasks, but they often learn spurious domain-specific words to make predictions (Hayati et al., 2021). While human explanation highlights stylistic tokens as important features for this task, we observe that model explanations often do not align with them. To tackle this issue, we introduce StyLEx, a… ▽ More Large pre-trained language models have achieved impressive results on various style classification tasks, but they often learn spurious domain-specific words to make predictions (Hayati et al., 2021). While human explanation highlights stylistic tokens as important features for this task, we observe that model explanations often do not align with them. To tackle this issue, we introduce StyLEx, a model that learns from human-annotated explanations of stylistic features and jointly learns to perform the task and predict these features as model explanations. Our experiments show that StyLEx can provide human-like stylistic lexical explanations without sacrificing the performance of sentence-level style prediction on both in-domain and out-of-domain datasets. Explanations from StyLEx show significant improvements in explanation metrics (sufficiency, plausibility) and when evaluated with human annotations. They are also more understandable by human judges compared to the widely-used saliency-based explanation baseline. △ Less

Submitted 14 April, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: EACL 2023

arXiv:2112.06050 [pdf, other]

Real-Time Detection of Crowded Buses via Mobile Phones

Authors: Alex Haig, Shirley Anugrah Hayati, Anthony Tomasic

Abstract: Automated passenger counting (APC) technology is central to many aspects of the public transit experience. APC information informs public transit planners about utilization in a public transit system and operations about dynamic fluctuations in demand. Perhaps most importantly, APC information provides one metric to the rider experience - standing during a long ride because of a crowded vehicle is… ▽ More Automated passenger counting (APC) technology is central to many aspects of the public transit experience. APC information informs public transit planners about utilization in a public transit system and operations about dynamic fluctuations in demand. Perhaps most importantly, APC information provides one metric to the rider experience - standing during a long ride because of a crowded vehicle is an unpleasant experience. Several technologies have been successfully used for APC including light beam sensing and video image analysis. However, these technologies are expensive and must be installed in buses. In this paper, we analyze a new source of data using statistical models: rider smartphone accelerometers. Smartphones are ubiquitous in society and accelerometers have been shown to accurately model user states such as walking and sitting. We extend these models to use accelerometers to detect if the rider is standing or sitting on a bus. Standing riders are a signal that the bus is crowded. This paper provides evidence that user smartphones are a valid source of participatory sensing and thus a new source of automated passenger counting data. △ Less

Submitted 11 December, 2021; originally announced December 2021.

Journal ref: Poster Session, The Transportation Research Board (TRB) 98th Annual Meeting, 2019

arXiv:2109.02738 [pdf, other]

Does BERT Learn as Humans Perceive? Understanding Linguistic Styles through Lexica

Authors: Shirley Anugrah Hayati, Dongyeop Kang, Lyle Ungar

Abstract: People convey their intention and attitude through linguistic styles of the text that they write. In this study, we investigate lexicon usages across styles throughout two lenses: human perception and machine word importance, since words differ in the strength of the stylistic cues that they provide. To collect labels of human perception, we curate a new dataset, Hummingbird, on top of benchmarkin… ▽ More People convey their intention and attitude through linguistic styles of the text that they write. In this study, we investigate lexicon usages across styles throughout two lenses: human perception and machine word importance, since words differ in the strength of the stylistic cues that they provide. To collect labels of human perception, we curate a new dataset, Hummingbird, on top of benchmarking style datasets. We have crowd workers highlight the representative words in the text that makes them think the text has the following styles: politeness, sentiment, offensiveness, and five emotion types. We then compare these human word labels with word importance derived from a popular fine-tuned style classifier like BERT. Our results show that the BERT often finds content words not relevant to the target style as important words used in style prediction, but humans do not perceive the same way even though for some styles (e.g., positive sentiment and joy) human- and machine-identified words share significant overlap for some styles. △ Less

Submitted 12 November, 2021; v1 submitted 6 September, 2021; originally announced September 2021.

Comments: Accepted at EMNLP 2021 Main Conference, updated typos and Appendix

arXiv:2105.00825 [pdf, other]

DEUX: An Attribute-Guided Framework for Sociable Recommendation Dialog Systems

Authors: Yu Li, Shirley Anugrah Hayati, Weiyan Shi, Zhou Yu

Abstract: It is important for sociable recommendation dialog systems to perform as both on-task content and social content to engage users and gain their favor. In addition to understand the user preferences and provide a satisfying recommendation, such systems must be able to generate coherent and natural social conversations to the user. Traditional dialog state tracking cannot be applied to such systems… ▽ More It is important for sociable recommendation dialog systems to perform as both on-task content and social content to engage users and gain their favor. In addition to understand the user preferences and provide a satisfying recommendation, such systems must be able to generate coherent and natural social conversations to the user. Traditional dialog state tracking cannot be applied to such systems because it does not track the attributes in the social content. To address this challenge, we propose DEUX, a novel attribute-guided framework to create better user experiences while accomplishing a movie recommendation task. DEUX has a module that keeps track of the movie attributes (e.g., favorite genres, actors,etc.) in both user utterances and system responses. This allows the system to introduce new movie attributes in its social content. Then, DEUX has multiple values for the same attribute type which suits the recommendation task since a user may like multiple genres, for instance. Experiments suggest that DEUX outperforms all the baselines on being more consistent, fitting the user preferences better, and providing a more engaging chat experience. Our approach can be used for any similar problems of sociable task-oriented dialog system. △ Less

Submitted 16 April, 2021; originally announced May 2021.

arXiv:2009.14306 [pdf, other]

INSPIRED: Toward Sociable Recommendation Dialog Systems

Authors: Shirley Anugrah Hayati, Dongyeop Kang, Qingxiaoyang Zhu, Weiyan Shi, Zhou Yu

Abstract: In recommendation dialogs, humans commonly disclose their preference and make recommendations in a friendly manner. However, this is a challenge when develo** a sociable recommendation dialog system, due to the lack of dialog dataset annotated with such sociable strategies. Therefore, we present INSPIRED, a new dataset of 1,001 human-human dialogs for movie recommendation with measures for succe… ▽ More In recommendation dialogs, humans commonly disclose their preference and make recommendations in a friendly manner. However, this is a challenge when develo** a sociable recommendation dialog system, due to the lack of dialog dataset annotated with such sociable strategies. Therefore, we present INSPIRED, a new dataset of 1,001 human-human dialogs for movie recommendation with measures for successful recommendations. To better understand how humans make recommendations in communication, we design an annotation scheme related to recommendation strategies based on social science theories and annotate these dialogs. Our analysis shows that sociable recommendation strategies, such as sharing personal opinions or communicating with encouragement, more frequently lead to successful recommendations. Based on our dataset, we train end-to-end recommendation dialog systems with and without our strategy labels. In both automatic and human evaluation, our model with strategy incorporation outperforms the baseline model. This work is a first step for building sociable recommendation dialog systems with a basis of social science theories. △ Less

Submitted 8 October, 2020; v1 submitted 29 September, 2020; originally announced September 2020.

Comments: Accepted as a long paper at EMNLP 2020, corrected typos

arXiv:2004.13203 [pdf, other]

A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization

Authors: Graham Neubig, Shruti Rijhwani, Alexis Palmer, Jordan MacKenzie, Hilaria Cruz, Xinjian Li, Matthew Lee, Aditi Chaudhary, Luke Gessler, Steven Abney, Shirley Anugrah Hayati, Antonios Anastasopoulos, Olga Zamaraeva, Emily Prud'hommeaux, Jennette Child, Sara Child, Rebecca Knowles, Sarah Moeller, Jeffrey Micher, Yiyuan Li, Sydney Zink, Mengzhou Xia, Roshan S Sharma, Patrick Littell

Abstract: Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and cr… ▽ More Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw'ida, Kwak'wala, Ojibwe, San Juan Quiahije Chatino, and Seneca. △ Less

Submitted 27 April, 2020; originally announced April 2020.

Comments: Accepted at SLTU-CCURL 2020

arXiv:1808.10025 [pdf, other]

Retrieval-Based Neural Code Generation

Authors: Shirley Anugrah Hayati, Raphael Olivier, Pravalika Avvaru, Pengcheng Yin, Anthony Tomasic, Graham Neubig

Abstract: In models to generate program source code from natural language, representing this code in a tree structure has been a common approach. However, existing methods often fail to generate complex code correctly due to a lack of ability to memorize large and complex structures. We introduce ReCode, a method based on subtree retrieval that makes it possible to explicitly reference existing code example… ▽ More In models to generate program source code from natural language, representing this code in a tree structure has been a common approach. However, existing methods often fail to generate complex code correctly due to a lack of ability to memorize large and complex structures. We introduce ReCode, a method based on subtree retrieval that makes it possible to explicitly reference existing code examples within a neural code generation model. First, we retrieve sentences that are similar to input sentences using a dynamic-programming-based sentence similarity scoring method. Next, we extract n-grams of action sequences that build the associated abstract syntax tree. Finally, we increase the probability of actions that cause the retrieved n-gram action subtree to be in the predicted code. We show that our approach improves the performance on two code generation tasks by up to +2.6 BLEU. △ Less

Submitted 29 August, 2018; originally announced August 2018.

Comments: This paper is accepted in EMNLP 2018. It has 6 pages

Showing 1–13 of 13 results for author: Hayati, S A