Search | arXiv e-print repository

doi 10.1145/3630106.3658941

"I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust

Authors: Sunnie S. Y. Kim, Q. Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, Jennifer Wortman Vaughan

Abstract: Widely deployed large language models (LLMs) can produce convincing yet incorrect outputs, potentially misleading users who may rely on them as if they were correct. To reduce such overreliance, there have been calls for LLMs to communicate their uncertainty to end users. However, there has been little empirical work examining how users perceive and act upon LLMs' expressions of uncertainty. We ex… ▽ More Widely deployed large language models (LLMs) can produce convincing yet incorrect outputs, potentially misleading users who may rely on them as if they were correct. To reduce such overreliance, there have been calls for LLMs to communicate their uncertainty to end users. However, there has been little empirical work examining how users perceive and act upon LLMs' expressions of uncertainty. We explore this question through a large-scale, pre-registered, human-subject experiment (N=404) in which participants answer medical questions with or without access to responses from a fictional LLM-infused search engine. Using both behavioral and self-reported measures, we examine how different natural language expressions of uncertainty impact participants' reliance, trust, and overall task performance. We find that first-person expressions (e.g., "I'm not sure, but...") decrease participants' confidence in the system and tendency to agree with the system's answers, while increasing participants' accuracy. An exploratory analysis suggests that this increase can be attributed to reduced (but not fully eliminated) overreliance on incorrect answers. While we observe similar effects for uncertainty expressed from a general perspective (e.g., "It's not clear, but..."), these effects are weaker and not statistically significant. Our findings suggest that using natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters. This highlights the importance of user testing before deploying LLMs at scale. △ Less

Submitted 15 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

Comments: Accepted to FAccT 2024. This version includes the appendix

arXiv:2401.09051 [pdf, other]

Canvil: Designerly Adaptation for LLM-Powered User Experiences

Authors: K. J. Kevin Feng, Q. Vera Liao, Ziang Xiao, Jennifer Wortman Vaughan, Amy X. Zhang, David W. McDonald

Abstract: Advancements in large language models (LLMs) are poised to spark a proliferation of LLM-powered user experiences. In product teams, designers are often tasked with crafting user experiences that align with user needs. To involve designers and leverage their user-centered perspectives to create effective and responsible LLM-powered products, we introduce the practice of designerly adaptation for en… ▽ More Advancements in large language models (LLMs) are poised to spark a proliferation of LLM-powered user experiences. In product teams, designers are often tasked with crafting user experiences that align with user needs. To involve designers and leverage their user-centered perspectives to create effective and responsible LLM-powered products, we introduce the practice of designerly adaptation for engaging with LLMs as an adaptable design material. We first identify key characteristics of designerly adaptation through a formative study with designers experienced in designing for LLM-powered products (N=12). These characteristics are 1) have a low technical barrier to entry, 2) leverage designers' unique perspectives bridging users and technology, and 3) encourage model tinkering. Based on this characterization, we build Canvil, a Figma widget that operationalizes designerly adaptation. Canvil supports structured authoring of system prompts to adapt LLM behavior, testing of adapted models on diverse user inputs, and integration of model outputs into interface designs. We use Canvil as a technology probe in a group-based design study (6 groups, N=17) to investigate the implications of integrating designerly adaptation into design workflows. We find that designers are able to iteratively tinker with different adaptation approaches and reason about interface affordances to enhance end-user interaction with LLMs. Furthermore, designers identified promising collaborative workflows for designerly adaptation. Our work opens new avenues for collaborative processes and tools that foreground designers' user-centered expertise in the crafting and deployment of LLM-powered user experiences. △ Less

Submitted 17 January, 2024; originally announced January 2024.

arXiv:2312.06153 [pdf, other]

Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments

Authors: Anthony Cintron Roman, Jennifer Wortman Vaughan, Valerie See, Steph Ballard, Jehu Torres, Caleb Robinson, Juan M. Lavista Ferres

Abstract: This paper introduces a no-code, machine-readable documentation framework for open datasets, with a focus on responsible AI (RAI) considerations. The framework aims to improve comprehensibility, and usability of open datasets, facilitating easier discovery and use, better understanding of content and context, and evaluation of dataset quality and accuracy. The proposed framework is designed to str… ▽ More This paper introduces a no-code, machine-readable documentation framework for open datasets, with a focus on responsible AI (RAI) considerations. The framework aims to improve comprehensibility, and usability of open datasets, facilitating easier discovery and use, better understanding of content and context, and evaluation of dataset quality and accuracy. The proposed framework is designed to streamline the evaluation of datasets, hel** researchers, data scientists, and other open data users quickly identify datasets that meet their needs and organizational policies or regulations. The paper also discusses the implementation of the framework and provides recommendations to maximize its potential. The framework is expected to enhance the quality and reliability of data used in research and decision-making, fostering the development of more responsible and trustworthy AI systems. △ Less

Submitted 27 March, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2306.03262 [pdf, other]

Has the Machine Learning Review Process Become More Arbitrary as the Field Has Grown? The NeurIPS 2021 Consistency Experiment

Authors: Alina Beygelzimer, Yann N. Dauphin, Percy Liang, Jennifer Wortman Vaughan

Abstract: We present the NeurIPS 2021 consistency experiment, a larger-scale variant of the 2014 NeurIPS experiment in which 10% of conference submissions were reviewed by two independent committees to quantify the randomness in the review process. We observe that the two committees disagree on their accept/reject recommendations for 23% of the papers and that, consistent with the results from 2014, approxi… ▽ More We present the NeurIPS 2021 consistency experiment, a larger-scale variant of the 2014 NeurIPS experiment in which 10% of conference submissions were reviewed by two independent committees to quantify the randomness in the review process. We observe that the two committees disagree on their accept/reject recommendations for 23% of the papers and that, consistent with the results from 2014, approximately half of the list of accepted papers would change if the review process were randomly rerun. Our analysis suggests that making the conference more selective would increase the arbitrariness of the process. Taken together with previous research, our results highlight the inherent difficulty of objectively measuring the quality of research, and suggest that authors should not be excessively discouraged by rejected work. △ Less

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2306.01941 [pdf, other]

AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap

Authors: Q. Vera Liao, Jennifer Wortman Vaughan

Abstract: The rise of powerful large language models (LLMs) brings about tremendous opportunities for innovation but also looming risks for individuals and society at large. We have reached a pivotal moment for ensuring that LLMs and LLM-infused applications are developed and deployed responsibly. However, a central pillar of responsible AI -- transparency -- is largely missing from the current discourse ar… ▽ More The rise of powerful large language models (LLMs) brings about tremendous opportunities for innovation but also looming risks for individuals and society at large. We have reached a pivotal moment for ensuring that LLMs and LLM-infused applications are developed and deployed responsibly. However, a central pillar of responsible AI -- transparency -- is largely missing from the current discourse around LLMs. It is paramount to pursue new approaches to provide transparency for LLMs, and years of research at the intersection of AI and human-computer interaction (HCI) highlight that we must do so with a human-centered perspective: Transparency is fundamentally about supporting appropriate human understanding, and this understanding is sought by different stakeholders with different goals in different contexts. In this new era of LLMs, we must develop and design approaches to transparency by considering the needs of stakeholders in the emerging LLM ecosystem, the novel types of LLM-infused applications being built, and the new usage patterns and challenges around LLMs, all while building on lessons learned about how people process, interact with, and make use of information. We reflect on the unique challenges that arise in providing transparency for LLMs, along with lessons learned from HCI and responsible AI research that has taken a human-centered perspective on AI transparency. We then lay out four common approaches that the community has taken to achieve transparency -- model reporting, publishing evaluation results, providing explanations, and communicating uncertainty -- and call out open questions around how these approaches may or may not be applied to LLMs. We hope this provides a starting point for discussion and a useful roadmap for future research. △ Less

Submitted 7 August, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

arXiv:2302.14165 [pdf, other]

doi 10.1145/3544548.3580816

GAM Coach: Towards Interactive and User-centered Algorithmic Recourse

Authors: Zijie J. Wang, Jennifer Wortman Vaughan, Rich Caruana, Duen Horng Chau

Abstract: Machine learning (ML) recourse techniques are increasingly used in high-stakes domains, providing end users with actions to alter ML predictions, but they assume ML developers understand what input variables can be changed. However, a recourse plan's actionability is subjective and unlikely to match developers' expectations completely. We present GAM Coach, a novel open-source system that adapts i… ▽ More Machine learning (ML) recourse techniques are increasingly used in high-stakes domains, providing end users with actions to alter ML predictions, but they assume ML developers understand what input variables can be changed. However, a recourse plan's actionability is subjective and unlikely to match developers' expectations completely. We present GAM Coach, a novel open-source system that adapts integer linear programming to generate customizable counterfactual explanations for Generalized Additive Models (GAMs), and leverages interactive visualizations to enable end users to iteratively generate recourse plans meeting their needs. A quantitative user study with 41 participants shows our tool is usable and useful, and users prefer personalized recourse plans over generic plans. Through a log analysis, we explore how users discover satisfactory recourse plans, and provide empirical evidence that transparency can lead to more opportunities for everyday users to discover counterintuitive patterns in ML models. GAM Coach is available at: https://poloclub.github.io/gam-coach/. △ Less

Submitted 28 February, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

Comments: Accepted to CHI 2023. 20 pages, 12 figures. For a demo video, see https://youtu.be/ubacP34H9XE. For a live demo, visit https://poloclub.github.io/gam-coach/

arXiv:2302.10395 [pdf, other]

Designerly Understanding: Information Needs for Model Transparency to Support Design Ideation for AI-Powered User Experience

Authors: Q. Vera Liao, Hariharan Subramonyam, Jennifer Wang, Jennifer Wortman Vaughan

Abstract: Despite the widespread use of artificial intelligence (AI), designing user experiences (UX) for AI-powered systems remains challenging. UX designers face hurdles understanding AI technologies, such as pre-trained language models, as design materials. This limits their ability to ideate and make decisions about whether, where, and how to use AI. To address this problem, we bridge the literature on… ▽ More Despite the widespread use of artificial intelligence (AI), designing user experiences (UX) for AI-powered systems remains challenging. UX designers face hurdles understanding AI technologies, such as pre-trained language models, as design materials. This limits their ability to ideate and make decisions about whether, where, and how to use AI. To address this problem, we bridge the literature on AI design and AI transparency to explore whether and how frameworks for transparent model reporting can support design ideation with pre-trained models. By interviewing 23 UX practitioners, we find that practitioners frequently work with pre-trained models, but lack support for UX-led ideation. Through a scenario-based design task, we identify common goals that designers seek model understanding for and pinpoint their model transparency information needs. Our study highlights the pivotal role that UX designers can play in Responsible AI and calls for supporting their understanding of AI limitations through model transparency and interrogation. △ Less

Submitted 20 February, 2023; originally announced February 2023.

Comments: Accepted at ACM CHI Conference on Human Factors in Computing Systems (CHI 2023)

arXiv:2302.07248 [pdf, other]

Generation Probabilities Are Not Enough: Exploring the Effectiveness of Uncertainty Highlighting in AI-Powered Code Completions

Authors: Helena Vasconcelos, Gagan Bansal, Adam Fourney, Q. Vera Liao, Jennifer Wortman Vaughan

Abstract: Large-scale generative models enabled the development of AI-powered code completion tools to assist programmers in writing code. However, much like other AI-powered tools, AI-powered code completions are not always accurate, potentially introducing bugs or even security vulnerabilities into code if not properly detected and corrected by a human programmer. One technique that has been proposed and… ▽ More Large-scale generative models enabled the development of AI-powered code completion tools to assist programmers in writing code. However, much like other AI-powered tools, AI-powered code completions are not always accurate, potentially introducing bugs or even security vulnerabilities into code if not properly detected and corrected by a human programmer. One technique that has been proposed and implemented to help programmers identify potential errors is to highlight uncertain tokens. However, there have been no empirical studies exploring the effectiveness of this technique-- nor investigating the different and not-yet-agreed-upon notions of uncertainty in the context of generative models. We explore the question of whether conveying information about uncertainty enables programmers to more quickly and accurately produce code when collaborating with an AI-powered code completion tool, and if so, what measure of uncertainty best fits programmers' needs. Through a mixed-methods study with 30 programmers, we compare three conditions: providing the AI system's code completion alone, highlighting tokens with the lowest likelihood of being generated by the underlying generative model, and highlighting tokens with the highest predicted likelihood of being edited by a programmer. We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits, and is subjectively preferred by study participants. In contrast, highlighting tokens according to their probability of being generated does not provide any benefit over the baseline with no highlighting. We further explore the design space of how to convey uncertainty in AI-powered code completion tools, and find that programmers prefer highlights that are granular, informative, interpretable, and not overwhelming. △ Less

Submitted 14 February, 2023; originally announced February 2023.

arXiv:2301.07255 [pdf, other]

Understanding the Role of Human Intuition on Reliance in Human-AI Decision-Making with Explanations

Authors: Valerie Chen, Q. Vera Liao, Jennifer Wortman Vaughan, Gagan Bansal

Abstract: AI explanations are often mentioned as a way to improve human-AI decision-making, but empirical studies have not found consistent evidence of explanations' effectiveness and, on the contrary, suggest that they can increase overreliance when the AI system is wrong. While many factors may affect reliance on AI support, one important factor is how decision-makers reconcile their own intuition -- beli… ▽ More AI explanations are often mentioned as a way to improve human-AI decision-making, but empirical studies have not found consistent evidence of explanations' effectiveness and, on the contrary, suggest that they can increase overreliance when the AI system is wrong. While many factors may affect reliance on AI support, one important factor is how decision-makers reconcile their own intuition -- beliefs or heuristics, based on prior knowledge, experience, or pattern recognition, used to make judgments -- with the information provided by the AI system to determine when to override AI predictions. We conduct a think-aloud, mixed-methods study with two explanation types (feature- and example-based) for two prediction tasks to explore how decision-makers' intuition affects their use of AI predictions and explanations, and ultimately their choice of when to rely on AI. Our results identify three types of intuition involved in reasoning about AI predictions and explanations: intuition about the task outcome, features, and AI limitations. Building on these, we summarize three observed pathways for decision-makers to apply their own intuition and override AI predictions. We use these pathways to explain why (1) the feature-based explanations we used did not improve participants' decision outcomes and increased their overreliance on AI, and (2) the example-based explanations we used improved decision-makers' performance over feature-based explanations and helped achieve complementary human-AI performance. Overall, our work identifies directions for further development of AI decision-support systems and explanation methods that help decision-makers effectively apply their intuition to achieve appropriate reliance on AI. △ Less

Submitted 14 June, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

Comments: To appear in CSCW 2023

arXiv:2211.12966 [pdf, other]

How do Authors' Perceptions of their Papers Compare with Co-authors' Perceptions and Peer-review Decisions?

Authors: Charvi Rastogi, Ivan Stelmakh, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, Jennifer Wortman Vaughan, Zhenyu Xue, Hal Daumé III, Emma Pierson, Nihar B. Shah

Abstract: How do author perceptions match up to the outcomes of the peer-review process and perceptions of others? In a top-tier computer science conference (NeurIPS 2021) with more than 23,000 submitting authors and 9,000 submitted papers, we survey the authors on three questions: (i) their predicted probability of acceptance for each of their papers, (ii) their perceived ranking of their own papers based… ▽ More How do author perceptions match up to the outcomes of the peer-review process and perceptions of others? In a top-tier computer science conference (NeurIPS 2021) with more than 23,000 submitting authors and 9,000 submitted papers, we survey the authors on three questions: (i) their predicted probability of acceptance for each of their papers, (ii) their perceived ranking of their own papers based on scientific contribution, and (iii) the change in their perception about their own papers after seeing the reviews. The salient results are: (1) Authors have roughly a three-fold overestimate of the acceptance probability of their papers: The median prediction is 70% for an approximately 25% acceptance rate. (2) Female authors exhibit a marginally higher (statistically significant) miscalibration than male authors; predictions of authors invited to serve as meta-reviewers or reviewers are similarly calibrated, but better than authors who were not invited to review. (3) Authors' relative ranking of scientific contribution of two submissions they made generally agree (93%) with their predicted acceptance probabilities, but there is a notable 7% responses where authors think their better paper will face a worse outcome. (4) The author-provided rankings disagreed with the peer-review decisions about a third of the time; when co-authors ranked their jointly authored papers, co-authors disagreed at a similar rate -- about a third of the time. (5) At least 30% of respondents of both accepted and rejected papers said that their perception of their own paper improved after the review process. The stakeholders in peer review should take these findings into account in setting their expectations from peer review. △ Less

Submitted 22 November, 2022; originally announced November 2022.

arXiv:2208.02896 [pdf, other]

Interpretable Distribution Shift Detection using Optimal Transport

Authors: Neha Hulkund, Nicolo Fusi, Jennifer Wortman Vaughan, David Alvarez-Melis

Abstract: We propose a method to identify and characterize distribution shifts in classification datasets based on optimal transport. It allows the user to identify the extent to which each class is affected by the shift, and retrieves corresponding pairs of samples to provide insights on its nature. We illustrate its use on synthetic and natural shift examples. While the results we present are preliminary,… ▽ More We propose a method to identify and characterize distribution shifts in classification datasets based on optimal transport. It allows the user to identify the extent to which each class is affected by the shift, and retrieves corresponding pairs of samples to provide insights on its nature. We illustrate its use on synthetic and natural shift examples. While the results we present are preliminary, we hope that this inspires future work on interpretable methods for analyzing distribution shifts. △ Less

Submitted 4 August, 2022; originally announced August 2022.

Comments: Presented at ICML 2022 DataPerf Workshop

arXiv:2206.15465 [pdf, other]

doi 10.1145/3534678.3539074

Interpretability, Then What? Editing Machine Learning Models to Reflect Human Knowledge and Values

Authors: Zijie J. Wang, Alex Kale, Harsha Nori, Peter Stella, Mark E. Nunnally, Duen Horng Chau, Mihaela Vorvoreanu, Jennifer Wortman Vaughan, Rich Caruana

Abstract: Machine learning (ML) interpretability techniques can reveal undesirable patterns in data that models exploit to make predictions--potentially causing harms once deployed. However, how to take action to address these patterns is not always clear. In a collaboration between ML and human-computer interaction researchers, physicians, and data scientists, we develop GAM Changer, the first interactive… ▽ More Machine learning (ML) interpretability techniques can reveal undesirable patterns in data that models exploit to make predictions--potentially causing harms once deployed. However, how to take action to address these patterns is not always clear. In a collaboration between ML and human-computer interaction researchers, physicians, and data scientists, we develop GAM Changer, the first interactive system to help domain experts and data scientists easily and responsibly edit Generalized Additive Models (GAMs) and fix problematic patterns. With novel interaction techniques, our tool puts interpretability into action--empowering users to analyze, validate, and align model behaviors with their knowledge and values. Physicians have started to use our tool to investigate and fix pneumonia and sepsis risk prediction models, and an evaluation with 7 data scientists working in diverse domains highlights that our tool is easy to use, meets their model editing needs, and fits into their current workflows. Built with modern web technologies, our tool runs locally in users' web browsers or computational notebooks, lowering the barrier to use. GAM Changer is available at the following public demo link: https://interpret.ml/gam-changer. △ Less

Submitted 30 June, 2022; originally announced June 2022.

Comments: Accepted at KDD 2022. 11 pages, 19 figures. For a demo video, see https://youtu.be/D6whtfInqTc. For a live demo, visit https://interpret.ml/gam-changer

arXiv:2206.02923 [pdf, ps, other]

Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

Authors: Amy K. Heger, Liz B. Marquis, Mihaela Vorvoreanu, Hanna Wallach, Jennifer Wortman Vaughan

Abstract: Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice through more deliberate reflection on datasets and transparency around the processes by which they are created, researchers and practitioners have begun to advoca… ▽ More Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice through more deliberate reflection on datasets and transparency around the processes by which they are created, researchers and practitioners have begun to advocate for increased data documentation and have proposed several data documentation frameworks. However, there is little research on whether these data documentation frameworks meet the needs of ML practitioners, who both create and consume datasets. To address this gap, we set out to understand ML practitioners' data documentation perceptions, needs, challenges, and desiderata, with the goal of deriving design requirements that can inform future data documentation frameworks. We conducted a series of semi-structured interviews with 14 ML practitioners at a single large, international technology company. We had them answer a list of questions taken from datasheets for datasets (Gebru, 2021). Our findings show that current approaches to data documentation are largely ad hoc and myopic in nature. Participants expressed needs for data documentation frameworks to be adaptable to their contexts, integrated into their existing tools and workflows, and automated wherever possible. Despite the fact that data documentation frameworks are often motivated from the perspective of responsible AI, participants did not make the connection between the questions that they were asked to answer and their responsible AI implications. In addition, participants often had difficulties prioritizing the needs of dataset consumers and providing information that someone unfamiliar with their datasets might need to know. Based on these findings, we derive seven design requirements for future data documentation frameworks. △ Less

Submitted 24 August, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

Comments: Camera-ready preprint of paper accepted to CSCW 2022

arXiv:2205.08363 [pdf, other]

doi 10.1145/3531146.3533122

REAL ML: Recognizing, Exploring, and Articulating Limitations of Machine Learning Research

Authors: Jessie J. Smith, Saleema Amershi, Solon Barocas, Hanna Wallach, Jennifer Wortman Vaughan

Abstract: Transparency around limitations can improve the scientific rigor of research, help ensure appropriate interpretation of research findings, and make research claims more credible. Despite these benefits, the machine learning (ML) research community lacks well-developed norms around disclosing and discussing limitations. To address this gap, we conduct an iterative design process with 30 ML and ML-a… ▽ More Transparency around limitations can improve the scientific rigor of research, help ensure appropriate interpretation of research findings, and make research claims more credible. Despite these benefits, the machine learning (ML) research community lacks well-developed norms around disclosing and discussing limitations. To address this gap, we conduct an iterative design process with 30 ML and ML-adjacent researchers to develop and test REAL ML, a set of guided activities to help ML researchers recognize, explore, and articulate the limitations of their research. Using a three-stage interview and survey study, we identify ML researchers' perceptions of limitations, as well as the challenges they face when recognizing, exploring, and articulating limitations. We develop REAL ML to address some of these practical challenges, and highlight additional cultural challenges that will require broader shifts in community norms to address. We hope our study and REAL ML help move the ML research community toward more active and appropriate engagement with limitations. △ Less

Submitted 5 May, 2022; originally announced May 2022.

Comments: This work appears in the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22)

arXiv:2112.05675 [pdf, ps, other]

Assessing the Fairness of AI Systems: AI Practitioners' Processes, Challenges, and Needs for Support

Authors: Michael Madaio, Lisa Egede, Hariharan Subramonyam, Jennifer Wortman Vaughan, Hanna Wallach

Abstract: Various tools and practices have been developed to support practitioners in identifying, assessing, and mitigating fairness-related harms caused by AI systems. However, prior research has highlighted gaps between the intended design of these tools and practices and their use within particular contexts, including gaps caused by the role that organizational factors play in sha** fairness work. In… ▽ More Various tools and practices have been developed to support practitioners in identifying, assessing, and mitigating fairness-related harms caused by AI systems. However, prior research has highlighted gaps between the intended design of these tools and practices and their use within particular contexts, including gaps caused by the role that organizational factors play in sha** fairness work. In this paper, we investigate these gaps for one such practice: disaggregated evaluations of AI systems, intended to uncover performance disparities between demographic groups. By conducting semi-structured interviews and structured workshops with thirty-three AI practitioners from ten teams at three technology companies, we identify practitioners' processes, challenges, and needs for support when designing disaggregated evaluations. We find that practitioners face challenges when choosing performance metrics, identifying the most relevant direct stakeholders and demographic groups on which to focus, and collecting datasets with which to conduct disaggregated evaluations. More generally, we identify impacts on fairness work stemming from a lack of engagement with direct stakeholders or domain experts, business imperatives that prioritize customers over marginalized groups, and the drive to deploy AI systems at scale. △ Less

Submitted 10 February, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

Comments: Camera-ready preprint of paper accepted to the CSCW conference

arXiv:2112.03245 [pdf, other]

GAM Changer: Editing Generalized Additive Models with Interactive Visualization

Authors: Zijie J. Wang, Alex Kale, Harsha Nori, Peter Stella, Mark Nunnally, Duen Horng Chau, Mihaela Vorvoreanu, Jennifer Wortman Vaughan, Rich Caruana

Abstract: Recent strides in interpretable machine learning (ML) research reveal that models exploit undesirable patterns in the data to make predictions, which potentially causes harms in deployment. However, it is unclear how we can fix these models. We present our ongoing work, GAM Changer, an open-source interactive system to help data scientists and domain experts easily and responsibly edit their Gener… ▽ More Recent strides in interpretable machine learning (ML) research reveal that models exploit undesirable patterns in the data to make predictions, which potentially causes harms in deployment. However, it is unclear how we can fix these models. We present our ongoing work, GAM Changer, an open-source interactive system to help data scientists and domain experts easily and responsibly edit their Generalized Additive Models (GAMs). With novel visualization techniques, our tool puts interpretability into action -- empowering human users to analyze, validate, and align model behaviors with their knowledge and values. Built using modern web technologies, our tool runs locally in users' computational notebooks or web browsers without requiring extra compute resources, lowering the barrier to creating more responsible ML models. GAM Changer is available at https://interpret.ml/gam-changer. △ Less

Submitted 6 December, 2021; originally announced December 2021.

Comments: 7 pages, 15 figures, accepted to the Research2Clinics workshop at NeurIPS 2021. For a demo video, see https://youtu.be/2gVSoPoSeJ8. For a live demo, visit https://interpret.ml/gam-changer/

arXiv:2104.13299 [pdf, other]

From Human Explanation to Model Interpretability: A Framework Based on Weight of Evidence

Authors: David Alvarez-Melis, Harmanpreet Kaur, Hal Daumé III, Hanna Wallach, Jennifer Wortman Vaughan

Abstract: We take inspiration from the study of human explanation to inform the design and evaluation of interpretability methods in machine learning. First, we survey the literature on human explanation in philosophy, cognitive science, and the social sciences, and propose a list of design principles for machine-generated explanations that are meaningful to humans. Using the concept of weight of evidence f… ▽ More We take inspiration from the study of human explanation to inform the design and evaluation of interpretability methods in machine learning. First, we survey the literature on human explanation in philosophy, cognitive science, and the social sciences, and propose a list of design principles for machine-generated explanations that are meaningful to humans. Using the concept of weight of evidence from information theory, we develop a method for generating explanations that adhere to these principles. We show that this method can be adapted to handle high-dimensional, multi-class settings, yielding a flexible framework for generating explanations. We demonstrate that these explanations can be estimated accurately from finite samples and are robust to small perturbations of the inputs. We also evaluate our method through a qualitative user study with machine learning practitioners, where we observe that the resulting explanations are usable despite some participants struggling with background concepts like prior class probabilities. Finally, we conclude by surfacing~design~implications for interpretability tools in general. △ Less

Submitted 20 September, 2021; v1 submitted 27 April, 2021; originally announced April 2021.

Comments: HCOMP 2021

arXiv:2103.06076 [pdf, other]

Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs

Authors: Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, Duncan Wadsworth, Hanna Wallach

Abstract: Disaggregated evaluations of AI systems, in which system performance is assessed and reported separately for different groups of people, are conceptually simple. However, their design involves a variety of choices. Some of these choices influence the results that will be obtained, and thus the conclusions that can be drawn; others influence the impacts -- both beneficial and harmful -- that a disa… ▽ More Disaggregated evaluations of AI systems, in which system performance is assessed and reported separately for different groups of people, are conceptually simple. However, their design involves a variety of choices. Some of these choices influence the results that will be obtained, and thus the conclusions that can be drawn; others influence the impacts -- both beneficial and harmful -- that a disaggregated evaluation will have on people, including the people whose data is used to conduct the evaluation. We argue that a deeper understanding of these choices will enable researchers and practitioners to design careful and conclusive disaggregated evaluations. We also argue that better documentation of these choices, along with the underlying considerations and tradeoffs that have been made, will help others when interpreting an evaluation's results and conclusions. △ Less

Submitted 1 December, 2021; v1 submitted 10 March, 2021; originally announced March 2021.

arXiv:2101.01816 [pdf, ps, other]

Incentive-Compatible Forecasting Competitions

Authors: Jens Witkowski, Rupert Freeman, Jennifer Wortman Vaughan, David M. Pennock, Andreas Krause

Abstract: We initiate the study of incentive-compatible forecasting competitions in which multiple forecasters make predictions about one or more events and compete for a single prize. We have two objectives: (1) to incentivize forecasters to report truthfully and (2) to award the prize to the most accurate forecaster. Proper scoring rules incentivize truthful reporting if all forecasters are paid according… ▽ More We initiate the study of incentive-compatible forecasting competitions in which multiple forecasters make predictions about one or more events and compete for a single prize. We have two objectives: (1) to incentivize forecasters to report truthfully and (2) to award the prize to the most accurate forecaster. Proper scoring rules incentivize truthful reporting if all forecasters are paid according to their scores. However, incentives become distorted if only the best-scoring forecaster wins a prize, since forecasters can often increase their probability of having the highest score by reporting more extreme beliefs. In this paper, we introduce two novel forecasting competition mechanisms. Our first mechanism is incentive compatible and guaranteed to select the most accurate forecaster with probability higher than any other forecaster. Moreover, we show that in the standard single-event, two-forecaster setting and under mild technical conditions, no other incentive-compatible mechanism selects the most accurate forecaster with higher probability. Our second mechanism is incentive compatible when forecasters' beliefs are such that information about one event does not lead to belief updates on other events, and it selects the best forecaster with probability approaching 1 as the number of events grows. Our notion of incentive compatibility is more general than previous definitions of dominant strategy incentive compatibility in that it allows for reports to be correlated with the event outcomes. Moreover, our mechanisms are easy to implement and can be generalized to the related problems of outputting a ranking over forecasters and hiring a forecaster with high accuracy on future events. △ Less

Submitted 7 September, 2021; v1 submitted 5 January, 2021; originally announced January 2021.

Comments: 38 pages. Relative to the previous version Appendix A and Theorem 5 are new. This version additionally contains some expanded exposition

arXiv:2007.03661 [pdf]

Mathematical Foundations for Social Computing

Authors: Yiling Chen, Arpita Ghosh, Michael Kearns, Tim Roughgarden, Jennifer Wortman Vaughan

Abstract: Social computing encompasses the mechanisms through which people interact with computational systems: crowdsourcing systems, ranking and recommendation systems, online prediction markets, citizen science projects, and collaboratively edited wikis, to name a few. These systems share the common feature that humans are active participants, making choices that determine the input to, and therefore the… ▽ More Social computing encompasses the mechanisms through which people interact with computational systems: crowdsourcing systems, ranking and recommendation systems, online prediction markets, citizen science projects, and collaboratively edited wikis, to name a few. These systems share the common feature that humans are active participants, making choices that determine the input to, and therefore the output of, the system. The output of these systems can be viewed as a joint computation between machine and human, and can be richer than what either could produce alone. The term social computing is often used as a synonym for several related areas, such as "human computation" and subsets of "collective intelligence"; we use it in its broadest sense to encompass all of these things. Social computing is blossoming into a rich research area of its own, with contributions from diverse disciplines including computer science, economics, and other social sciences. Yet a broad mathematical foundation for social computing is yet to be established, with a plethora of under-explored opportunities for mathematical research to impact social computing. As in other fields, there is great potential for mathematical work to influence and shape the future of social computing. However, we are far from having the systematic and principled understanding of the advantages, limitations, and potentials of social computing required to match the impact on applications that has occurred in other fields. In June 2015, we brought together roughly 25 experts in related fields to discuss the promise and challenges of establishing mathematical foundations for social computing. This document captures several of the key ideas discussed. △ Less

Submitted 7 July, 2020; originally announced July 2020.

Comments: A Computing Community Consortium (CCC) workshop report, 15 pages

Report number: ccc2014report_5

arXiv:2005.10624 [pdf, ps, other]

Greedy Algorithm almost Dominates in Smoothed Contextual Bandits

Authors: Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, Zhiwei Steven Wu

Abstract: Online learning algorithms, widely used to power search and content optimization on the web, must balance exploration and exploitation, potentially sacrificing the experience of current users in order to gain information that will lead to better decisions in the future. While necessary in the worst case, explicit exploration has a number of disadvantages compared to the greedy algorithm that alway… ▽ More Online learning algorithms, widely used to power search and content optimization on the web, must balance exploration and exploitation, potentially sacrificing the experience of current users in order to gain information that will lead to better decisions in the future. While necessary in the worst case, explicit exploration has a number of disadvantages compared to the greedy algorithm that always "exploits" by choosing an action that currently looks optimal. We ask under what conditions inherent diversity in the data makes explicit exploration unnecessary. We build on a recent line of work on the smoothed analysis of the greedy algorithm in the linear contextual bandits model. We improve on prior results to show that a greedy approach almost matches the best possible Bayesian regret rate of any other algorithm on the same problem instance whenever the diversity conditions hold, and that this regret is at most $\tilde O(T^{1/3})$. △ Less

Submitted 27 December, 2021; v1 submitted 19 May, 2020; originally announced May 2020.

Comments: Results in this paper, without any proofs, have been announced in an extended abstract (Raghavan et al., 2018a), and fleshed out in the technical report (Raghavan et al., 2018b [arXiv:1806.00543]). This manuscript covers a subset of results from Raghavan et al. (2018a,b), focusing on the greedy algorithm, and is streamlined accordingly

arXiv:2002.08837 [pdf, other]

No-Regret and Incentive-Compatible Online Learning

Authors: Rupert Freeman, David M. Pennock, Chara Podimata, Jennifer Wortman Vaughan

Abstract: We study online learning settings in which experts act strategically to maximize their influence on the learning algorithm's predictions by potentially misreporting their beliefs about a sequence of binary events. Our goal is twofold. First, we want the learning algorithm to be no-regret with respect to the best fixed expert in hindsight. Second, we want incentive compatibility, a guarantee that e… ▽ More We study online learning settings in which experts act strategically to maximize their influence on the learning algorithm's predictions by potentially misreporting their beliefs about a sequence of binary events. Our goal is twofold. First, we want the learning algorithm to be no-regret with respect to the best fixed expert in hindsight. Second, we want incentive compatibility, a guarantee that each expert's best strategy is to report his true beliefs about the realization of each event. To achieve this goal, we build on the literature on wagering mechanisms, a type of multi-agent scoring rule. We provide algorithms that achieve no regret and incentive compatibility for myopic experts for both the full and partial information settings. In experiments on datasets from FiveThirtyEight, our algorithms have regret comparable to classic no-regret algorithms, which are not incentive-compatible. Finally, we identify an incentive-compatible algorithm for forward-looking strategic agents that exhibits diminishing regret in practice. △ Less

Submitted 30 June, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

Comments: Appears in ICML2020

arXiv:1910.13503 [pdf, other]

Weight of Evidence as a Basis for Human-Oriented Explanations

Authors: David Alvarez-Melis, Hal Daumé III, Jennifer Wortman Vaughan, Hanna Wallach

Abstract: Interpretability is an elusive but highly sought-after characteristic of modern machine learning methods. Recent work has focused on interpretability via $\textit{explanations}$, which justify individual model predictions. In this work, we take a step towards reconciling machine explanations with those that humans produce and prefer by taking inspiration from the study of explanation in philosophy… ▽ More Interpretability is an elusive but highly sought-after characteristic of modern machine learning methods. Recent work has focused on interpretability via $\textit{explanations}$, which justify individual model predictions. In this work, we take a step towards reconciling machine explanations with those that humans produce and prefer by taking inspiration from the study of explanation in philosophy, cognitive science, and the social sciences. We identify key aspects in which these human explanations differ from current machine explanations, distill them into a list of desiderata, and formalize them into a framework via the notion of $\textit{weight of evidence}$ from information theory. Finally, we instantiate this framework in two simple applications and show it produces intuitive and comprehensible explanations. △ Less

Submitted 29 October, 2019; originally announced October 2019.

Comments: Human-Centric Machine Learning (HCML) Workshop @ NeurIPS 2019

arXiv:1907.02227 [pdf, other]

Toward Fairness in AI for People with Disabilities: A Research Roadmap

Authors: Anhong Guo, Ece Kamar, Jennifer Wortman Vaughan, Hanna Wallach, Meredith Ringel Morris

Abstract: AI technologies have the potential to dramatically impact the lives of people with disabilities (PWD). Indeed, improving the lives of PWD is a motivator for many state-of-the-art AI systems, such as automated speech recognition tools that can caption videos for people who are deaf and hard of hearing, or language prediction algorithms that can augment communication for people with speech or cognit… ▽ More AI technologies have the potential to dramatically impact the lives of people with disabilities (PWD). Indeed, improving the lives of PWD is a motivator for many state-of-the-art AI systems, such as automated speech recognition tools that can caption videos for people who are deaf and hard of hearing, or language prediction algorithms that can augment communication for people with speech or cognitive disabilities. However, widely deployed AI systems may not work properly for PWD, or worse, may actively discriminate against them. These considerations regarding fairness in AI for PWD have thus far received little attention. In this position paper, we identify potential areas of concern regarding how several AI technology categories may impact particular disability constituencies if care is not taken in their design, development, and testing. We intend for this risk assessment of how various classes of AI might interact with various classes of disability to provide a roadmap for future research that is needed to gather data, test these hypotheses, and build more inclusive algorithms. △ Less

Submitted 2 August, 2019; v1 submitted 4 July, 2019; originally announced July 2019.

Comments: ACM ASSETS 2019 Workshop on AI Fairness for People with Disabilities

arXiv:1905.00457 [pdf, other]

doi 10.1016/j.jet.2021.105234

Truthful Aggregation of Budget Proposals

Authors: Rupert Freeman, David M. Pennock, Dominik Peters, Jennifer Wortman Vaughan

Abstract: We consider a participatory budgeting problem in which each voter submits a proposal for how to divide a single divisible resource (such as money or time) among several possible alternatives (such as public projects or activities) and these proposals must be aggregated into a single aggregate division. Under $\ell_1$ preferences -- for which a voter's disutility is given by the $\ell_1$ distance b… ▽ More We consider a participatory budgeting problem in which each voter submits a proposal for how to divide a single divisible resource (such as money or time) among several possible alternatives (such as public projects or activities) and these proposals must be aggregated into a single aggregate division. Under $\ell_1$ preferences -- for which a voter's disutility is given by the $\ell_1$ distance between the aggregate division and the division he or she most prefers -- the social welfare-maximizing mechanism, which minimizes the average $\ell_1$ distance between the outcome and each voter's proposal, is incentive compatible (Goel et al. 2016). However, it fails to satisfy the natural fairness notion of proportionality, placing too much weight on majority preferences. Leveraging a connection between market prices and the generalized median rules of Moulin (1980), we introduce the independent markets mechanism, which is both incentive compatible and proportional. We unify the social welfare-maximizing mechanism and the independent markets mechanism by defining a broad class of moving phantom mechanisms that includes both. We show that every moving phantom mechanism is incentive compatible. Finally, we characterize the social welfare-maximizing mechanism as the unique Pareto-optimal mechanism in this class, suggesting an inherent tradeoff between Pareto optimality and proportionality. △ Less

Submitted 21 January, 2022; v1 submitted 1 May, 2019; originally announced May 2019.

Comments: 28 pages, final journal version

Journal ref: Journal of Economic Theory, Volume 193, April 2021, 105234

arXiv:1812.05239 [pdf, other]

doi 10.1145/3290605.3300830

Improving fairness in machine learning systems: What do industry practitioners need?

Authors: Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé III, Miro Dudík, Hanna Wallach

Abstract: The potential for machine learning (ML) systems to amplify social inequities and unfairness is receiving increasing popular and academic attention. A surge of recent work has focused on the development of algorithmic tools to assess and mitigate such unfairness. If these tools are to have a positive impact on industry practice, however, it is crucial that their design be informed by an understandi… ▽ More The potential for machine learning (ML) systems to amplify social inequities and unfairness is receiving increasing popular and academic attention. A surge of recent work has focused on the development of algorithmic tools to assess and mitigate such unfairness. If these tools are to have a positive impact on industry practice, however, it is crucial that their design be informed by an understanding of real-world needs. Through 35 semi-structured interviews and an anonymous survey of 267 ML practitioners, we conduct the first systematic investigation of commercial product teams' challenges and needs for support in develo** fairer ML systems. We identify areas of alignment and disconnect between the challenges faced by industry practitioners and solutions proposed in the fair ML research literature. Based on these findings, we highlight directions for future ML and HCI research that will better address industry practitioners' needs. △ Less

Submitted 7 January, 2019; v1 submitted 12 December, 2018; originally announced December 2018.

Comments: To appear in the 2019 ACM CHI Conference on Human Factors in Computing Systems (CHI 2019)

arXiv:1808.08646 [pdf, other]

doi 10.1145/3287560.3287597

The Disparate Effects of Strategic Manipulation

Authors: Lily Hu, Nicole Immorlica, Jennifer Wortman Vaughan

Abstract: When consequential decisions are informed by algorithmic input, individuals may feel compelled to alter their behavior in order to gain a system's approval. Models of agent responsiveness, termed "strategic manipulation," analyze the interaction between a learner and agents in a world where all agents are equally able to manipulate their features in an attempt to "trick" a published classifier. In… ▽ More When consequential decisions are informed by algorithmic input, individuals may feel compelled to alter their behavior in order to gain a system's approval. Models of agent responsiveness, termed "strategic manipulation," analyze the interaction between a learner and agents in a world where all agents are equally able to manipulate their features in an attempt to "trick" a published classifier. In cases of real world classification, however, an agent's ability to adapt to an algorithm is not simply a function of her personal interest in receiving a positive classification, but is bound up in a complex web of social factors that affect her ability to pursue certain action responses. In this paper, we adapt models of strategic manipulation to capture dynamics that may arise in a setting of social inequality wherein candidate groups face different costs to manipulation. We find that whenever one group's costs are higher than the other's, the learner's equilibrium strategy exhibits an inequality-reinforcing phenomenon wherein the learner erroneously admits some members of the advantaged group, while erroneously excluding some members of the disadvantaged group. We also consider the effects of interventions in which a learner subsidizes members of the disadvantaged group, lowering their costs in order to improve her own classification performance. Here we encounter a paradoxical result: there exist cases in which providing a subsidy improves only the learner's utility while actually making both candidate groups worse-off--even the group receiving the subsidy. Our results reveal the potentially adverse social ramifications of deploying tools that attempt to evaluate an individual's "quality" when agents' capacities to adaptively respond differ. △ Less

Submitted 10 May, 2019; v1 submitted 26 August, 2018; originally announced August 2018.

Comments: 29 pages, 4 figures

arXiv:1806.05740 [pdf, other]

Using Search Queries to Understand Health Information Needs in Africa

Authors: Rediet Abebe, Shawndra Hill, Jennifer Wortman Vaughan, Peter M. Small, H. Andrew Schwartz

Abstract: The lack of comprehensive, high-quality health data in develo** nations creates a roadblock for combating the impacts of disease. One key challenge is understanding the health information needs of people in these nations. Without understanding people's everyday needs, concerns, and misconceptions, health organizations and policymakers lack the ability to effectively target education and programm… ▽ More The lack of comprehensive, high-quality health data in develo** nations creates a roadblock for combating the impacts of disease. One key challenge is understanding the health information needs of people in these nations. Without understanding people's everyday needs, concerns, and misconceptions, health organizations and policymakers lack the ability to effectively target education and programming efforts. In this paper, we propose a bottom-up approach that uses search data from individuals to uncover and gain insight into health information needs in Africa. We analyze Bing searches related to HIV/AIDS, malaria, and tuberculosis from all 54 African nations. For each disease, we automatically derive a set of common search themes or topics, revealing a wide-spread interest in various types of information, including disease symptoms, drugs, concerns about breastfeeding, as well as stigma, beliefs in natural cures, and other topics that may be hard to uncover through traditional surveys. We expose the different patterns that emerge in health information needs by demographic groups (age and sex) and country. We also uncover discrepancies in the quality of content returned by search engines to users by topic. Combined, our results suggest that search data can help illuminate health information needs in Africa and inform discussions on health policy and targeted education efforts both on- and offline. △ Less

Submitted 17 April, 2019; v1 submitted 14 June, 2018; originally announced June 2018.

Comments: Extended version of an ICWSM 2019 paper

arXiv:1806.00543 [pdf, ps, other]

The Externalities of Exploration and How Data Diversity Helps Exploitation

Authors: Manish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, Zhiwei Steven Wu

Abstract: Online learning algorithms, widely used to power search and content optimization on the web, must balance exploration and exploitation, potentially sacrificing the experience of current users for information that will lead to better decisions in the future. Recently, concerns have been raised about whether the process of exploration could be viewed as unfair, placing too much burden on certain ind… ▽ More Online learning algorithms, widely used to power search and content optimization on the web, must balance exploration and exploitation, potentially sacrificing the experience of current users for information that will lead to better decisions in the future. Recently, concerns have been raised about whether the process of exploration could be viewed as unfair, placing too much burden on certain individuals or groups. Motivated by these concerns, we initiate the study of the externalities of exploration - the undesirable side effects that the presence of one party may impose on another - under the linear contextual bandits model. We introduce the notion of a group externality, measuring the extent to which the presence of one population of users impacts the rewards of another. We show that this impact can in some cases be negative, and that, in a certain sense, no algorithm can avoid it. We then study externalities at the individual level, interpreting the act of exploration as an externality imposed on the current user of a system by future users. This drives us to ask under what conditions inherent diversity in the data makes explicit exploration unnecessary. We build on a recent line of work on the smoothed analysis of the greedy algorithm that always chooses the action that currently looks optimal, improving on prior results to show that a greedy approach almost matches the best possible Bayesian regret rate of any other algorithm on the same problem instance whenever the diversity conditions hold, and that this regret is at most $\tilde{O}(T^{1/3})$. Returning to group-level effects, we show that under the same conditions, negative group externalities essentially vanish under the greedy algorithm. Together, our results uncover a sharp contrast between the high externalities that exist in the worst case, and the ability to remove all externalities if the data is sufficiently diverse. △ Less

Submitted 2 July, 2018; v1 submitted 1 June, 2018; originally announced June 2018.

arXiv:1803.09010 [pdf, other]

Datasheets for Datasets

Authors: Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford

Abstract: The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, we propose datasheets for datasets. In the electronics industry, every component, no matter how simple or complex, is accompanied with a datasheet that describes its operating characteristics, test results, recommended use… ▽ More The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, we propose datasheets for datasets. In the electronics industry, every component, no matter how simple or complex, is accompanied with a datasheet that describes its operating characteristics, test results, recommended uses, and other information. By analogy, we propose that every dataset be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on. Datasheets for datasets will facilitate better communication between dataset creators and dataset consumers, and encourage the machine learning community to prioritize transparency and accountability. △ Less

Submitted 1 December, 2021; v1 submitted 23 March, 2018; originally announced March 2018.

Comments: Published in CACM in December, 2021

arXiv:1802.07810 [pdf, other]

Manipulating and Measuring Model Interpretability

Authors: Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wortman Vaughan, Hanna Wallach

Abstract: With machine learning models being increasingly used to aid decision making even in high-stakes domains, there has been a growing interest in develo** interpretable models. Although many supposedly interpretable models have been proposed, there have been relatively few experimental studies investigating whether these models achieve their intended effects, such as making people more closely follo… ▽ More With machine learning models being increasingly used to aid decision making even in high-stakes domains, there has been a growing interest in develo** interpretable models. Although many supposedly interpretable models have been proposed, there have been relatively few experimental studies investigating whether these models achieve their intended effects, such as making people more closely follow a model's predictions when it is beneficial for them to do so or enabling them to detect when a model has made a mistake. We present a sequence of pre-registered experiments (N=3,800) in which we showed participants functionally identical models that varied only in two factors commonly thought to make machine learning models more or less interpretable: the number of features and the transparency of the model (i.e., whether the model internals are clear or black box). Predictably, participants who saw a clear model with few features could better simulate the model's predictions. However, we did not find that participants more closely followed its predictions. Furthermore, showing participants a clear model meant that they were less able to detect and correct for the model's sizable mistakes, seemingly due to information overload. These counterintuitive findings emphasize the importance of testing over intuition when develo** interpretable models. △ Less

Submitted 15 August, 2021; v1 submitted 21 February, 2018; originally announced February 2018.

ACM Class: I.2

arXiv:1702.07810 [pdf, other]

A Decomposition of Forecast Error in Prediction Markets

Authors: Miroslav Dudík, Sébastien Lahaie, Ryan Rogers, Jennifer Wortman Vaughan

Abstract: We analyze sources of error in prediction market forecasts in order to bound the difference between a security's price and the ground truth it estimates. We consider cost-function-based prediction markets in which an automated market maker adjusts security prices according to the history of trade. We decompose the forecasting error into three components: sampling error, arising because traders onl… ▽ More We analyze sources of error in prediction market forecasts in order to bound the difference between a security's price and the ground truth it estimates. We consider cost-function-based prediction markets in which an automated market maker adjusts security prices according to the history of trade. We decompose the forecasting error into three components: sampling error, arising because traders only possess noisy estimates of ground truth; market-maker bias, resulting from the use of a particular market maker (i.e., cost function) to facilitate trade; and convergence error, arising because, at any point in time, market prices may still be in flux. Our goal is to make explicit the tradeoffs between these error components, influenced by design decisions such as the functional form of the cost function and the amount of liquidity in the market. We consider a specific model in which traders have exponential utility and exponential-family beliefs representing noisy estimates of ground truth. In this setting, sampling error vanishes as the number of traders grows, but there is a tradeoff between the other two components. We provide both upper and lower bounds on market-maker bias and convergence error, and demonstrate via numerical simulations that these bounds are tight. Our results yield new insights into the question of how to set the market's liquidity parameter and into the forecasting benefits of enforcing coherent prices across securities. △ Less

Submitted 20 February, 2018; v1 submitted 24 February, 2017; originally announced February 2017.

Journal ref: Advances in Neural Information Processing Systems 30 (NIPS 2017)

arXiv:1611.01688 [pdf, other]

Oracle-Efficient Online Learning and Auction Design

Authors: Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, Jennifer Wortman Vaughan

Abstract: We consider the design of computationally efficient online learning algorithms in an adversarial setting in which the learner has access to an offline optimization oracle. We present an algorithm called Generalized Follow-the-Perturbed-Leader and provide conditions under which it is oracle-efficient while achieving vanishing regret. Our results make significant progress on an open problem raised b… ▽ More We consider the design of computationally efficient online learning algorithms in an adversarial setting in which the learner has access to an offline optimization oracle. We present an algorithm called Generalized Follow-the-Perturbed-Leader and provide conditions under which it is oracle-efficient while achieving vanishing regret. Our results make significant progress on an open problem raised by Hazan and Koren, who showed that oracle-efficient algorithms do not exist in general and asked whether one can identify properties under which oracle-efficient online learning may be possible. Our auction-design framework considers an auctioneer learning an optimal auction for a sequence of adversarially selected valuations with the goal of achieving revenue that is almost as good as the optimal auction in hindsight, among a class of auctions. We give oracle-efficient learning results for: (1) VCG auctions with bidder-specific reserves in single-parameter settings, (2) envy-free item pricing in multi-item auctions, and (3) s-level auctions of Morgenstern and Roughgarden for single-item settings. The last result leads to an approximation of the overall optimal Myerson auction when bidders' valuations are drawn according to a fast-mixing Markov process, extending prior work that only gave such guarantees for the i.i.d. setting. Finally, we derive various extensions, including: (1) oracle-efficient algorithms for the contextual learning setting in which the learner has access to side information (such as bidder demographics), (2) learning with approximate oracles such as those based on Maximal-in-Range algorithms, and (3) no-regret bidding in simultaneous auctions, resolving an open problem of Daskalakis and Syrgkanis. △ Less

Submitted 5 August, 2019; v1 submitted 5 November, 2016; originally announced November 2016.

Comments: An earlier version of this paper appeared in FOCS 2017

arXiv:1602.07362 [pdf, other]

The Possibilities and Limitations of Private Prediction Markets

Authors: Rachel Cummings, David M. Pennock, Jennifer Wortman Vaughan

Abstract: We consider the design of private prediction markets, financial markets designed to elicit predictions about uncertain events without revealing too much information about market participants' actions or beliefs. Our goal is to design market mechanisms in which participants' trades or wagers influence the market's behavior in a way that leads to accurate predictions, yet no single participant has t… ▽ More We consider the design of private prediction markets, financial markets designed to elicit predictions about uncertain events without revealing too much information about market participants' actions or beliefs. Our goal is to design market mechanisms in which participants' trades or wagers influence the market's behavior in a way that leads to accurate predictions, yet no single participant has too much influence over what others are able to observe. We study the possibilities and limitations of such mechanisms using tools from differential privacy. We begin by designing a private one-shot wagering mechanism in which bettors specify a belief about the likelihood of a future event and a corresponding monetary wager. Wagers are redistributed among bettors in a way that more highly rewards those with accurate predictions. We provide a class of wagering mechanisms that are guaranteed to satisfy truthfulness, budget balance in expectation, and other desirable properties while additionally guaranteeing epsilon-joint differential privacy in the bettors' reported beliefs, and analyze the trade-off between the achievable level of privacy and the sensitivity of a bettor's payment to her own report. We then ask whether it is possible to obtain privacy in dynamic prediction markets, focusing our attention on the popular cost-function framework in which securities with payments linked to future events are bought and sold by an automated market maker. We show that under general conditions, it is impossible for such a market maker to simultaneously achieve bounded worst-case loss and epsilon-differential privacy without allowing the privacy guarantee to degrade extremely quickly as the number of trades grows, making such markets impractical in settings in which privacy is valued. We conclude by suggesting several avenues for potentially circumventing this lower bound. △ Less

Submitted 23 February, 2016; originally announced February 2016.

arXiv:1503.05897 [pdf, other]

doi 10.1145/2736277.2741102

Incentivizing High Quality Crowdwork

Authors: Chien-Ju Ho, Aleksandrs Slivkins, Siddharth Suri, Jennifer Wortman Vaughan

Abstract: We study the causal effects of financial incentives on the quality of crowdwork. We focus on performance-based payments (PBPs), bonus payments awarded to workers for producing high quality work. We design and run randomized behavioral experiments on the popular crowdsourcing platform Amazon Mechanical Turk with the goal of understanding when, where, and why PBPs help, identifying properties of the… ▽ More We study the causal effects of financial incentives on the quality of crowdwork. We focus on performance-based payments (PBPs), bonus payments awarded to workers for producing high quality work. We design and run randomized behavioral experiments on the popular crowdsourcing platform Amazon Mechanical Turk with the goal of understanding when, where, and why PBPs help, identifying properties of the payment, payment structure, and the task itself that make them most effective. We provide examples of tasks for which PBPs do improve quality. For such tasks, the effectiveness of PBPs is not too sensitive to the threshold for quality required to receive the bonus, while the magnitude of the bonus must be large enough to make the reward salient. We also present examples of tasks for which PBPs do not improve quality. Our results suggest that for PBPs to improve quality, the task must be effort-responsive: the task must allow workers to produce higher quality work by exerting more effort. We also give a simple method to determine if a task is effort-responsive a priori. Furthermore, our experiments suggest that all payments on Mechanical Turk are, to some degree, implicitly performance-based in that workers believe their work may be rejected if their performance is sufficiently poor. Finally, we propose a new model of worker behavior that extends the standard principal-agent model from economics to include a worker's subjective beliefs about his likelihood of being paid, and show that the predictions of this model are in line with our experimental findings. This model may be useful as a foundation for theoretical studies of incentives in crowdsourcing markets. △ Less

Submitted 19 March, 2015; originally announced March 2015.

Comments: This is a preprint of an Article accepted for publication in WWW \c{opyright} 2015 International World Wide Web Conference Committee

arXiv:1407.8161 [pdf, ps, other]

Market Making with Decreasing Utility for Information

Authors: Miroslav Dudík, Rafael Frongillo, Jennifer Wortman Vaughan

Abstract: We study information elicitation in cost-function-based combinatorial prediction markets when the market maker's utility for information decreases over time. In the sudden revelation setting, it is known that some piece of information will be revealed to traders, and the market maker wishes to prevent guaranteed profits for trading on the sure information. In the gradual decrease setting, the mark… ▽ More We study information elicitation in cost-function-based combinatorial prediction markets when the market maker's utility for information decreases over time. In the sudden revelation setting, it is known that some piece of information will be revealed to traders, and the market maker wishes to prevent guaranteed profits for trading on the sure information. In the gradual decrease setting, the market maker's utility for (partial) information decreases continuously over time. We design adaptive cost functions for both settings which: (1) preserve the information previously gathered in the market; (2) eliminate (or diminish) rewards to traders for the publicly revealed information; (3) leave the reward structure unaffected for other information; and (4) maintain the market maker's worst-case loss. Our constructions utilize mixed Bregman divergence, which matches our notion of utility for information. △ Less

Submitted 30 July, 2014; originally announced July 2014.

Journal ref: M. Dudik, R. Frongillo, and J. Wortman Vaughan. Market Making with Decreasing Utility for Information. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pages 152-161, 2014

arXiv:1405.2875 [pdf, ps, other]

Adaptive Contract Design for Crowdsourcing Markets: Bandit Algorithms for Repeated Principal-Agent Problems

Authors: Chien-Ju Ho, Aleksandrs Slivkins, Jennifer Wortman Vaughan

Abstract: Crowdsourcing markets have emerged as a popular platform for matching available workers with tasks to complete. The payment for a particular task is typically set by the task's requester, and may be adjusted based on the quality of the completed work, for example, through the use of "bonus" payments. In this paper, we study the requester's problem of dynamically adjusting quality-contingent paymen… ▽ More Crowdsourcing markets have emerged as a popular platform for matching available workers with tasks to complete. The payment for a particular task is typically set by the task's requester, and may be adjusted based on the quality of the completed work, for example, through the use of "bonus" payments. In this paper, we study the requester's problem of dynamically adjusting quality-contingent payments for tasks. We consider a multi-round version of the well-known principal-agent model, whereby in each round a worker makes a strategic choice of the effort level which is not directly observable by the requester. In particular, our formulation significantly generalizes the budget-free online task pricing problems studied in prior work. We treat this problem as a multi-armed bandit problem, with each "arm" representing a potential contract. To cope with the large (and in fact, infinite) number of arms, we propose a new algorithm, AgnosticZooming, which discretizes the contract space into a finite number of regions, effectively treating each region as a single arm. This discretization is adaptively refined, so that more promising regions of the contract space are eventually discretized more finely. We analyze this algorithm, showing that it achieves regret sublinear in the time horizon and substantially improves over non-adaptive discretization (which is the only competing approach in the literature). Our results advance the state of art on several different topics: the theory of crowdsourcing markets, principal-agent problems, multi-armed bandits, and dynamic pricing. △ Less

Submitted 2 September, 2015; v1 submitted 12 May, 2014; originally announced May 2014.

Comments: This is the full version of a paper in the ACM Conference on Economics and Computation (ACM-EC), 2014

arXiv:1308.1746 [pdf, ps, other]

Online Decision Making in Crowdsourcing Markets: Theoretical Challenges (Position Paper)

Authors: Aleksandrs Slivkins, Jennifer Wortman Vaughan

Abstract: Over the past decade, crowdsourcing has emerged as a cheap and efficient method of obtaining solutions to simple tasks that are difficult for computers to solve but possible for humans. The popularity and promise of crowdsourcing markets has led to both empirical and theoretical research on the design of algorithms to optimize various aspects of these markets, such as the pricing and assignment of… ▽ More Over the past decade, crowdsourcing has emerged as a cheap and efficient method of obtaining solutions to simple tasks that are difficult for computers to solve but possible for humans. The popularity and promise of crowdsourcing markets has led to both empirical and theoretical research on the design of algorithms to optimize various aspects of these markets, such as the pricing and assignment of tasks. Much of the existing theoretical work on crowdsourcing markets has focused on problems that fall into the broad category of online decision making; task requesters or the crowdsourcing platform itself make repeated decisions about prices to set, workers to filter out, problems to assign to specific workers, or other things. Often these decisions are complex, requiring algorithms that learn about the distribution of available tasks or workers over time and take into account the strategic (or sometimes irrational) behavior of workers. As human computation grows into its own field, the time is ripe to address these challenges in a principled way. However, it appears very difficult to capture all pertinent aspects of crowdsourcing markets in a single coherent model. In this paper, we reflect on the modeling issues that inhibit theoretical research on online decision making for crowdsourcing, and identify some steps forward. This paper grew out of the authors' own frustration with these issues, and we hope it will encourage the community to attempt to understand, debate, and ultimately address them. The authors welcome feedback for future revisions of this paper. △ Less

Submitted 26 November, 2013; v1 submitted 7 August, 2013; originally announced August 2013.

arXiv:1210.4837 [pdf]

Designing Informative Securities

Authors: Yiling Chen, Mike Ruberry, Jennifer Wortman Vaughan

Abstract: We create a formal framework for the design of informative securities in prediction markets. These securities allow a market organizer to infer the likelihood of events of interest as well as if he knew all of the traders' private signals. We consider the design of markets that are always informative, markets that are informative for a particular signal structure of the participants, and informati… ▽ More We create a formal framework for the design of informative securities in prediction markets. These securities allow a market organizer to infer the likelihood of events of interest as well as if he knew all of the traders' private signals. We consider the design of markets that are always informative, markets that are informative for a particular signal structure of the participants, and informative markets constructed from a restricted selection of securities. We find that to achieve informativeness, it can be necessary to allow participants to express information that may not be directly of interest to the market organizer, and that understanding the participants' signal structure is important for designing informative prediction markets. △ Less

Submitted 16 October, 2012; originally announced October 2012.

Comments: Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

Report number: UAI-P-2012-PG-185-195

arXiv:1205.2646 [pdf]

Censored Exploration and the Dark Pool Problem

Authors: Kuzman Ganchev, Michael Kearns, Yuriy Nevmyvaka, Jennifer Wortman Vaughan

Abstract: We introduce and analyze a natural algorithm for multi-venue exploration from censored data, which is motivated by the Dark Pool Problem of modern quantitative finance. We prove that our algorithm converges in polynomial time to a near-optimal allocation policy; prior results for similar problems in stochastic inventory control guaranteed only asymptotic convergence and examined variants in which… ▽ More We introduce and analyze a natural algorithm for multi-venue exploration from censored data, which is motivated by the Dark Pool Problem of modern quantitative finance. We prove that our algorithm converges in polynomial time to a near-optimal allocation policy; prior results for similar problems in stochastic inventory control guaranteed only asymptotic convergence and examined variants in which each venue could be treated independently. Our analysis bears a strong resemblance to that of efficient exploration/ exploitation schemes in the reinforcement learning literature. We describe an extensive experimental evaluation of our algorithm on the Dark Pool Problem using real trading data. △ Less

Submitted 9 May, 2012; originally announced May 2012.

Comments: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)

Report number: UAI-P-2009-PG-185-194

arXiv:1011.1941 [pdf, ps, other]

An Optimization-Based Framework for Automated Market-Making

Authors: Jacob Abernethy, Yiling Chen, Jennifer Wortman Vaughan

Abstract: Building on ideas from online convex optimization, we propose a general framework for the design of efficient securities markets over very large outcome spaces. The challenge here is computational. In a complete market, in which one security is offered for each outcome, the market institution can not efficiently keep track of the transaction history or calculate security prices when the outcome sp… ▽ More Building on ideas from online convex optimization, we propose a general framework for the design of efficient securities markets over very large outcome spaces. The challenge here is computational. In a complete market, in which one security is offered for each outcome, the market institution can not efficiently keep track of the transaction history or calculate security prices when the outcome space is large. The natural solution is to restrict the space of securities to be much smaller than the outcome space in such a way that securities can be priced efficiently. Recent research has focused on searching for spaces of securities that can be priced efficiently by existing mechanisms designed for complete markets. While there have been some successes, much of this research has led to hardness results. In this paper, we take a drastically different approach. We start with an arbitrary space of securities with bounded payoff, and establish a framework to design markets tailored to this space. We prove that any market satisfying a set of intuitive conditions must price securities via a convex potential function and that the space of reachable prices must be precisely the convex hull of the security payoffs. We then show how the convex potential function can be defined in terms of an optimization over the convex hull of the security payoffs. The optimal solution to the optimization problem gives the security prices. Using this framework, we provide an efficient market for predicting the landing location of an object on a sphere. In addition, we show that we can relax our "no-arbitrage" condition to design a new efficient market maker for pair betting, which is known to be #P-hard to price using existing mechanisms. This relaxation also allows the market maker to charge transaction fees so that the depth of the market can be dynamically increased as the number of trades increases. △ Less

Submitted 8 November, 2010; originally announced November 2010.

arXiv:1005.3566 [pdf, ps, other]

Evolution with Drifting Targets

Authors: Varun Kanade, Leslie G. Valiant, Jennifer Wortman Vaughan

Abstract: We consider the question of the stability of evolutionary algorithms to gradual changes, or drift, in the target concept. We define an algorithm to be resistant to drift if, for some inverse polynomial drift rate in the target function, it converges to accuracy 1 -- ε, with polynomial resources, and then stays within that accuracy indefinitely, except with probability ε, at any one time. We show t… ▽ More We consider the question of the stability of evolutionary algorithms to gradual changes, or drift, in the target concept. We define an algorithm to be resistant to drift if, for some inverse polynomial drift rate in the target function, it converges to accuracy 1 -- ε, with polynomial resources, and then stays within that accuracy indefinitely, except with probability ε, at any one time. We show that every evolution algorithm, in the sense of Valiant (2007; 2009), can be converted using the Correlational Query technique of Feldman (2008), into such a drift resistant algorithm. For certain evolutionary algorithms, such as for Boolean conjunctions, we give bounds on the rates of drift that they can resist. We develop some new evolution algorithms that are resistant to significant drift. In particular, we give an algorithm for evolving linear separators over the spherically symmetric distribution that is resistant to a drift rate of O(ε/n), and another algorithm over the more general product normal distributions that resists a smaller drift rate. The above translation result can be also interpreted as one on the robustness of the notion of evolvability itself under changes of definition. As a second result in that direction we show that every evolution algorithm can be converted to a quasi-monotonic one that can evolve from any starting point without the performance ever dip** significantly below that of the starting point. This permits the somewhat unnatural feature of arbitrary performance degradations to be removed from several known robustness translations. △ Less

Submitted 19 May, 2010; originally announced May 2010.

arXiv:1003.0034 [pdf, ps, other]

A New Understanding of Prediction Markets Via No-Regret Learning

Authors: Yiling Chen, Jennifer Wortman Vaughan

Abstract: We explore the striking mathematical connections that exist between market scoring rules, cost function based prediction markets, and no-regret learning. We show that any cost function based prediction market can be interpreted as an algorithm for the commonly studied problem of learning from expert advice by equating trades made in the market with losses observed by the learning algorithm. If t… ▽ More We explore the striking mathematical connections that exist between market scoring rules, cost function based prediction markets, and no-regret learning. We show that any cost function based prediction market can be interpreted as an algorithm for the commonly studied problem of learning from expert advice by equating trades made in the market with losses observed by the learning algorithm. If the loss of the market organizer is bounded, this bound can be used to derive an O(sqrt(T)) regret bound for the corresponding learning algorithm. We then show that the class of markets with convex cost functions exactly corresponds to the class of Follow the Regularized Leader learning algorithms, with the choice of a cost function in the market corresponding to the choice of a regularizer in the learning problem. Finally, we show an equivalence between market scoring rules and prediction markets with convex cost functions. This implies that market scoring rules can also be interpreted naturally as Follow the Regularized Leader algorithms, and may be of independent interest. These connections provide new insight into how it is that commonly studied markets, such as the Logarithmic Market Scoring Rule, can aggregate opinions into accurate estimates of the likelihood of future events. △ Less

Submitted 26 February, 2010; originally announced March 2010.

Showing 1–43 of 43 results for author: Vaughan, J W