Search | arXiv e-print repository

The Calibration Gap between Model and Human Confidence in Large Language Models

Authors: Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas Mayer, Padhraic Smyth

Abstract: For large language models (LLMs) to be trusted by humans they need to be well-calibrated in the sense that they can accurately assess and communicate how likely it is that their predictions are correct. Recent work has focused on the quality of internal LLM confidence assessments, but the question remains of how well LLMs can communicate this internal model confidence to human users. This paper ex… ▽ More For large language models (LLMs) to be trusted by humans they need to be well-calibrated in the sense that they can accurately assess and communicate how likely it is that their predictions are correct. Recent work has focused on the quality of internal LLM confidence assessments, but the question remains of how well LLMs can communicate this internal model confidence to human users. This paper explores the disparity between external human confidence in an LLM's responses and the internal confidence of the model. Through experiments involving multiple-choice questions, we systematically examine human users' ability to discern the reliability of LLM outputs. Our study focuses on two key areas: (1) assessing users' perception of true LLM confidence and (2) investigating the impact of tailored explanations on this perception. The research highlights that default explanations from LLMs often lead to user overestimation of both the model's confidence and its' accuracy. By modifying the explanations to more accurately reflect the LLM's internal confidence, we observe a significant shift in user perception, aligning it more closely with the model's actual confidence levels. This adjustment in explanatory approach demonstrates potential for enhancing user trust and accuracy in assessing LLM outputs. The findings underscore the importance of transparent communication of confidence levels in LLMs, particularly in high-stakes applications where understanding the reliability of AI-generated information is essential. △ Less

Submitted 24 January, 2024; originally announced January 2024.

Comments: 27 pages, 10 figures

arXiv:2312.07679 [pdf, other]

Bayesian Online Learning for Consensus Prediction

Authors: Sam Showalter, Alex Boyd, Padhraic Smyth, Mark Steyvers

Abstract: Given a pre-trained classifier and multiple human experts, we investigate the task of online classification where model predictions are provided for free but querying humans incurs a cost. In this practical but under-explored setting, oracle ground truth is not available. Instead, the prediction target is defined as the consensus vote of all experts. Given that querying full consensus can be costl… ▽ More Given a pre-trained classifier and multiple human experts, we investigate the task of online classification where model predictions are provided for free but querying humans incurs a cost. In this practical but under-explored setting, oracle ground truth is not available. Instead, the prediction target is defined as the consensus vote of all experts. Given that querying full consensus can be costly, we propose a general framework for online Bayesian consensus estimation, leveraging properties of the multivariate hypergeometric distribution. Based on this framework, we propose a family of methods that dynamically estimate expert consensus from partial feedback by producing a posterior over expert and model beliefs. Analyzing this posterior induces an interpretable trade-off between querying cost and classification performance. We demonstrate the efficacy of our framework against a variety of baselines on CIFAR-10H and ImageNet-16H, two large-scale crowdsourced datasets. △ Less

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2305.09064 [pdf, other]

doi 10.1145/3593013.3594111

Capturing Humans' Mental Models of AI: An Item Response Theory Approach

Authors: Markelle Kelly, Aakriti Kumar, Padhraic Smyth, Mark Steyvers

Abstract: Improving our understanding of how humans perceive AI teammates is an important foundation for our general understanding of human-AI teams. Extending relevant work from cognitive science, we propose a framework based on item response theory for modeling these perceptions. We apply this framework to real-world experiments, in which each participant works alongside another person or an AI agent in a… ▽ More Improving our understanding of how humans perceive AI teammates is an important foundation for our general understanding of human-AI teams. Extending relevant work from cognitive science, we propose a framework based on item response theory for modeling these perceptions. We apply this framework to real-world experiments, in which each participant works alongside another person or an AI agent in a question-answering setting, repeatedly assessing their teammate's performance. Using this experimental data, we demonstrate the use of our framework for testing research questions about people's perceptions of both AI agents and other people. We contrast mental models of AI teammates with those of human teammates as we characterize the dimensionality of these mental models, their development over time, and the influence of the participants' own self-perception. Our results indicate that people expect AI agents' performance to be significantly better on average than the performance of other humans, with less variation across different types of problems. We conclude with a discussion of the implications of these findings for human-AI interaction. △ Less

Submitted 15 May, 2023; originally announced May 2023.

Comments: FAccT 2023

arXiv:2301.11916 [pdf, other]

Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning

Authors: Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, William Yang Wang

Abstract: In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. Current understandings of the underlying mechanisms by which this capability arises fro… ▽ More In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. Current understandings of the underlying mechanisms by which this capability arises from regular language model pretraining objectives remain disconnected from the real-world LLMs. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models. On this premise, we propose an algorithm to select optimal demonstrations from a set of annotated data with a small LM, and then directly generalize the selected demonstrations to larger LMs. We demonstrate significant improvement over baselines, averaged over eight GPT models on eight real-world text classification datasets. We also demonstrate the real-world usefulness of our algorithm on GSM8K, a math word problem dataset. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information. △ Less

Submitted 12 February, 2024; v1 submitted 27 January, 2023; originally announced January 2023.

Comments: code at: https://github.com/WANGXinyiLinda/concept-based-demonstration-selection Accepted to NeurIPS 2023

arXiv:2109.14591 [pdf, other]

Combining Human Predictions with Model Probabilities via Confusion Matrices and Calibration

Authors: Gavin Kerrigan, Padhraic Smyth, Mark Steyvers

Abstract: An increasingly common use case for machine learning models is augmenting the abilities of human decision makers. For classification tasks where neither the human or model are perfectly accurate, a key step in obtaining high performance is combining their individual predictions in a manner that leverages their relative strengths. In this work, we develop a set of algorithms that combine the probab… ▽ More An increasingly common use case for machine learning models is augmenting the abilities of human decision makers. For classification tasks where neither the human or model are perfectly accurate, a key step in obtaining high performance is combining their individual predictions in a manner that leverages their relative strengths. In this work, we develop a set of algorithms that combine the probabilistic output of a model with the class-level output of a human. We show theoretically that the accuracy of our combination model is driven not only by the individual human and model accuracies, but also by the model's confidence. Empirical results on image classification with CIFAR-10 and a subset of ImageNet demonstrate that such human-model combinations consistently have higher accuracies than the model or human alone, and that the parameters of the combination method can be estimated effectively with as few as ten labeled datapoints. △ Less

Submitted 1 October, 2021; v1 submitted 29 September, 2021; originally announced September 2021.

Comments: NeurIPS 2021

arXiv:2010.09851 [pdf, other]

Can I Trust My Fairness Metric? Assessing Fairness with Unlabeled Data and Bayesian Inference

Authors: Disi Ji, Padhraic Smyth, Mark Steyvers

Abstract: We investigate the problem of reliably assessing group fairness when labeled examples are few but unlabeled examples are plentiful. We propose a general Bayesian framework that can augment labeled data with unlabeled data to produce more accurate and lower-variance estimates compared to methods based on labeled data alone. Our approach estimates calibrated scores for unlabeled examples in each gro… ▽ More We investigate the problem of reliably assessing group fairness when labeled examples are few but unlabeled examples are plentiful. We propose a general Bayesian framework that can augment labeled data with unlabeled data to produce more accurate and lower-variance estimates compared to methods based on labeled data alone. Our approach estimates calibrated scores for unlabeled examples in each group using a hierarchical latent variable model conditioned on labeled examples. This in turn allows for inference of posterior distributions with associated notions of uncertainty for a variety of group fairness metrics. We demonstrate that our approach leads to significant and consistent reductions in estimation error across multiple well-known fairness datasets, sensitive attributes, and predictive models. The results show the benefits of using both unlabeled data and Bayesian inference in terms of assessing whether a prediction model is fair or not. △ Less

Submitted 19 October, 2020; originally announced October 2020.

Comments: 27 pages

arXiv:2002.06532 [pdf, other]

Active Bayesian Assessment for Black-Box Classifiers

Authors: Disi Ji, Robert L. Logan IV, Padhraic Smyth, Mark Steyvers

Abstract: Recent advances in machine learning have led to increased deployment of black-box classifiers across a wide variety of applications. In many such situations there is a critical need to both reliably assess the performance of these pre-trained models and to perform this assessment in a label-efficient manner (given that labels may be scarce and costly to collect). In this paper, we introduce an act… ▽ More Recent advances in machine learning have led to increased deployment of black-box classifiers across a wide variety of applications. In many such situations there is a critical need to both reliably assess the performance of these pre-trained models and to perform this assessment in a label-efficient manner (given that labels may be scarce and costly to collect). In this paper, we introduce an active Bayesian approach for assessment of classifier performance to satisfy the desiderata of both reliability and label-efficiency. We begin by develo** inference strategies to quantify uncertainty for common assessment metrics such as accuracy, misclassification cost, and calibration error. We then propose a general framework for active Bayesian assessment using inferred uncertainty to guide efficient selection of instances for labeling, enabling better performance assessment with fewer labels. We demonstrate significant gains from our proposed active Bayesian approach via a series of systematic empirical experiments assessing the performance of modern neural classifiers (e.g., ResNet and BERT) on several standard image and text classification datasets. △ Less

Submitted 15 March, 2021; v1 submitted 16 February, 2020; originally announced February 2020.

arXiv:1808.02157 [pdf, other]

Experimental Design Modulates Variance in BOLD Activation: The Variance Design General Linear Model

Authors: Garren Gaut, Xiangrui Li, Zhong-Lin Lu, Mark Steyvers

Abstract: Typical fMRI studies have focused on either the mean trend in the blood-oxygen-level-dependent (BOLD) time course or functional connectivity (FC). However, other statistics of the neuroimaging data may contain important information. Despite studies showing links between the variance in the BOLD time series (BV) and age and cognitive performance, a formal framework for testing these effects has not… ▽ More Typical fMRI studies have focused on either the mean trend in the blood-oxygen-level-dependent (BOLD) time course or functional connectivity (FC). However, other statistics of the neuroimaging data may contain important information. Despite studies showing links between the variance in the BOLD time series (BV) and age and cognitive performance, a formal framework for testing these effects has not yet been developed. We introduce the Variance Design General Linear Model (VDGLM), a novel framework that facilitates the detection of variance effects. We designed the framework for general use in any fMRI study by modeling both mean and variance in BOLD activation as a function of experimental design. The flexibility of this approach allows the VDGLM to i) simultaneously make inferences about a mean or variance effect while controlling for the other and ii) test for variance effects that could be associated with multiple conditions and/or noise regressors. We demonstrate the use of the VDGLM in a working memory application and show that engagement in a working memory task is associated with whole-brain decreases in BOLD variance. △ Less

Submitted 6 August, 2018; originally announced August 2018.

Comments: 18 pages, 7 figures

arXiv:1807.04745 [pdf, other]

Predicting Task and Subject Differences with Functional Connectivity and BOLD Variability

Authors: Garren Gaut, Xiangrui Li, Brandon Turner, William A. Cunningham, Zhong-Lin Lu, Mark Steyvers

Abstract: Previous research has found that functional connectivity (FC) can accurately predict the identity of a subject performing a task and the type of task being performed. We replicate these results using a large dataset collected at the OSU Center for Cognitive and Behavioral Brain Imaging. We also introduce a novel perspective on task and subject identity prediction: BOLD Variability (BV). Conceptual… ▽ More Previous research has found that functional connectivity (FC) can accurately predict the identity of a subject performing a task and the type of task being performed. We replicate these results using a large dataset collected at the OSU Center for Cognitive and Behavioral Brain Imaging. We also introduce a novel perspective on task and subject identity prediction: BOLD Variability (BV). Conceptually, BV is a region-specific measure based on the variance within each brain region. BV is simple to compute, interpret, and visualize. We show that both FC and BV are predictive of task and subject, even across scanning sessions separated by multiple years. Subject differences rather than task differences account for the majority of changes in BV and FC. Similar to results in FC, we show that BV is reduced during cognitive tasks relative to rest. △ Less

Submitted 12 July, 2018; originally announced July 2018.

arXiv:1207.4169 [pdf]

The Author-Topic Model for Authors and Documents

Authors: Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, Padhraic Smyth

Abstract: We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that… ▽ More We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output. △ Less

Submitted 11 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Report number: UAI-P-2004-PG-487-494

arXiv:1107.2462 [pdf, other]

Statistical Topic Models for Multi-Label Document Classification

Authors: Timothy N. Rubin, America Chambers, Padhraic Smyth, Mark Steyvers

Abstract: Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed… ▽ More Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as the total number of labels and the number of labels per document increase. This problem is amplified when the label frequencies exhibit the type of highly skewed distributions that are often observed in real-world datasets. In this paper we investigate a class of generative statistical topic models for multi-label documents that associate individual word tokens with different labels. We investigate the advantages of this approach relative to discriminative models, particularly with respect to classification problems involving large numbers of relatively rare labels. We compare the performance of generative and discriminative approaches on document labeling tasks ranging from datasets with several thousand labels to datasets with tens of labels. The experimental results indicate that probabilistic generative models can achieve competitive multi-label classification performance compared to discriminative methods, and have advantages for datasets with many labels and skewed label frequencies. △ Less

Submitted 9 November, 2011; v1 submitted 13 July, 2011; originally announced July 2011.

Comments: 44 Pages (Including Appendices). To be published in: The Machine Learning Journal, special issue on Learning from Multi-Label Data. Version 2 corrects some typos, updates some of the notation used in the paper for clarification of some equations, and incorporates several relatively minor changes to the text throughout the paper

arXiv:0808.0973 [pdf, ps, other]

Text Modeling using Unsupervised Topic Models and Concept Hierarchies

Authors: Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers

Abstract: Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. While topic models can potentially discover a broad range of themes in a data set, the interpretability of the learned topics is not always ideal. Human-defined concepts, on the other hand, tend to be semantically richer due to careful selecti… ▽ More Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. While topic models can potentially discover a broad range of themes in a data set, the interpretability of the learned topics is not always ideal. Human-defined concepts, on the other hand, tend to be semantically richer due to careful selection of words to define concepts but they tend not to cover the themes in a data set exhaustively. In this paper, we propose a probabilistic framework to combine a hierarchy of human-defined semantic concepts with statistical topic models to seek the best of both worlds. Experimental results using two different sources of concept hierarchies and two collections of text documents indicate that this combination leads to systematic improvements in the quality of the associated language models as well as enabling new techniques for inferring and visualizing the semantics of a document. △ Less

Submitted 7 August, 2008; originally announced August 2008.

arXiv:cond-mat/0110012 [pdf]

The large-scale structure of semantic networks: statistical analyses and a model for semantic growth

Authors: Mark Steyvers, Joshua B. Tenenbaum

Abstract: We present statistical analyses of the large-scale structure of three types of semantic networks: word associations, WordNet, and Roget's thesaurus. We show that they have a small-world structure, characterized by sparse connectivity, short average path-lengths between words, and strong local clustering. In addition, the distributions of the number of connections follow power laws that indicate… ▽ More We present statistical analyses of the large-scale structure of three types of semantic networks: word associations, WordNet, and Roget's thesaurus. We show that they have a small-world structure, characterized by sparse connectivity, short average path-lengths between words, and strong local clustering. In addition, the distributions of the number of connections follow power laws that indicate a scale-free pattern of connectivity, with most nodes having relatively few connections joined together through a small number of hubs with many connections. These regularities have also been found in certain other complex natural networks, such as the world wide web, but they are not consistent with many conventional models of semantic organization, based on inheritance hierarchies, arbitrarily structured networks, or high-dimensional vector spaces. We propose that these structures reflect the mechanisms by which semantic networks grow. We describe a simple model for semantic growth, in which each new word or concept is connected to an existing network by differentiating the connectivity pattern of an existing node. This model generates appropriate small-world statistics and power-law connectivity distributions, and also suggests one possible mechanistic basis for the effects of learning history variables (age-of-acquisition, usage frequency) on behavioral performance in semantic processing tasks. △ Less

Submitted 1 October, 2001; originally announced October 2001.

Comments: 25 pages, 9 figures, submitted paper to Cognitive Science

Showing 1–13 of 13 results for author: Steyvers, M