Search | arXiv e-print repository

Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions

Authors: Hua Shen, Tiffany Knearem, Reshmi Ghosh, Kenan Alkiek, Kundan Krishna, Yachuan Liu, Ziqiao Ma, Savvas Petridis, Yi-Hao Peng, Li Qiwei, Sushrita Rakshit, Chenglei Si, Yutong Xie, Jeffrey P. Bigham, Frank Bentley, Joyce Chai, Zachary Lipton, Qiaozhu Mei, Rada Mihalcea, Michael Terry, Diyi Yang, Meredith Ringel Morris, Paul Resnick, David Jurgens

Abstract: Recent advancements in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve th… ▽ More Recent advancements in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve this alignment. In particular, ML- and philosophy-oriented alignment research often views AI alignment as a static, unidirectional process (i.e., aiming to ensure that AI systems' objectives match humans) rather than an ongoing, mutual alignment problem [429]. This perspective largely neglects the long-term interaction and dynamic changes of alignment. To understand these gaps, we introduce a systematic review of over 400 papers published between 2019 and January 2024, spanning multiple domains such as Human-Computer Interaction (HCI), Natural Language Processing (NLP), Machine Learning (ML), and others. We characterize, define and scope human-AI alignment. From this, we present a conceptual framework of "Bidirectional Human-AI Alignment" to organize the literature from a human-centered perspective. This framework encompasses both 1) conventional studies of aligning AI to humans that ensures AI produces the intended outcomes determined by humans, and 2) a proposed concept of aligning humans to AI, which aims to help individuals and society adjust to AI advancements both cognitively and behaviorally. Additionally, we articulate the key findings derived from literature analysis, including discussions about human values, interaction techniques, and evaluations. To pave the way for future studies, we envision three key challenges for future directions and propose examples of potential future solutions. △ Less

Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: 56 pages

arXiv:2406.07739 [pdf, other]

UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback

Authors: Jason Wu, Eldon Schoop, Alan Leung, Titus Barik, Jeffrey P. Bigham, Jeffrey Nichols

Abstract: Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an… ▽ More Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset using an original model, applying automated tools to aggressively filter, score, and de-duplicate the data into a refined higher quality dataset. The original LLM is improved by finetuning on this refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our evaluation shows the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted to NAACL 2024

arXiv:2405.15037 [pdf, other]

doi 10.1145/3643834.3660710

"This really lets us see the entire world:" Designing a conversational telepresence robot for homebound older adults

Authors: Yaxin Hu, Laura Stegner, Yasmine Kotturi, Caroline Zhang, Yi-Hao Peng, Faria Huq, Yuhang Zhao, Jeffrey P. Bigham, Bilge Mutlu

Abstract: In this paper, we explore the design and use of conversational telepresence robots to help homebound older adults interact with the external world. An initial needfinding study (N=8) using video vignettes revealed older adults' experiential needs for robot-mediated remote experiences such as exploration, reminiscence and social participation. We then designed a prototype system to support these go… ▽ More In this paper, we explore the design and use of conversational telepresence robots to help homebound older adults interact with the external world. An initial needfinding study (N=8) using video vignettes revealed older adults' experiential needs for robot-mediated remote experiences such as exploration, reminiscence and social participation. We then designed a prototype system to support these goals and conducted a technology probe study (N=11) to garner a deeper understanding of user preferences for remote experiences. The study revealed user interactive patterns in each desired experience, highlighting the need of robot guidance, social engagements with the robot and the remote bystanders. Our work identifies a novel design space where conversational telepresence robots can be used to foster meaningful interactions in the remote physical environment. We offer design insights into the robot's proactive role in providing guidance and using dialogue to create personalized, contextualized and meaningful experiences. △ Less

Submitted 23 May, 2024; originally announced May 2024.

Comments: In proceedings of ACM Designing Interactive Systems (DIS) 2024

MSC Class: 68-06

arXiv:2404.12500 [pdf, other]

UIClip: A Data-driven Model for Assessing User Interface Design

Authors: Jason Wu, Yi-Hao Peng, Amanda Li, Amanda Swearngin, Jeffrey P. Bigham, Jeffrey Nichols

Abstract: User interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmenta… ▽ More User interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmentation, and human ratings to construct a large-scale dataset of UIs, collated by description and ranked by design quality. Through training on the dataset, UIClip implicitly learns properties of good and bad designs by i) assigning a numerical score that represents a UI design's relevance and quality and ii) providing design suggestions. In an evaluation that compared the outputs of UIClip and other baselines to UIs rated by 12 human designers, we found that UIClip achieved the highest agreement with ground-truth rankings. Finally, we present three example applications that demonstrate how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality: i) UI code generation, ii) UI design tips generation, and iii) quality-aware UI example search. △ Less

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.03085 [pdf, other]

doi 10.1145/3613904.3642628

Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference

Authors: Fred Hohman, Chaoqun Wang, **mook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, Xiaoyi Zhang

Abstract: On-device machine learning (ML) moves computation from the cloud to personal devices, protecting user privacy and enabling intelligent user experiences. However, fitting models on devices with limited resources presents a major technical challenge: practitioners need to optimize models and balance hardware metrics such as model size, latency, and power. To help practitioners create efficient ML mo… ▽ More On-device machine learning (ML) moves computation from the cloud to personal devices, protecting user privacy and enabling intelligent user experiences. However, fitting models on devices with limited resources presents a major technical challenge: practitioners need to optimize models and balance hardware metrics such as model size, latency, and power. To help practitioners create efficient ML models, we designed and developed Talaria: a model visualization and optimization system. Talaria enables practitioners to compile models to hardware, interactively visualize model statistics, and simulate optimizations to test the impact on inference metrics. Since its internal deployment two years ago, we have evaluated Talaria using three methodologies: (1) a log analysis highlighting its growth of 800+ practitioners submitting 3,600+ models; (2) a usability survey with 26 users assessing the utility of 20 Talaria features; and (3) a qualitative interview with the 7 most active users about their experience using Talaria. △ Less

Submitted 3 April, 2024; originally announced April 2024.

Comments: Proceedings of the 2024 ACM CHI Conference on Human Factors in Computing Systems

arXiv:2402.17082 [pdf, other]

doi 10.1145/3613904.3642191

Deconstructing the Veneer of Simplicity: Co-Designing Introductory Generative AI Workshops with Local Entrepreneurs

Authors: Yasmine Kotturi, Angel Anderson, Glenn Ford, Michael Skirpan, Jeffrey P. Bigham

Abstract: Generative AI platforms and features are permeating many aspects of work. Entrepreneurs from lean economies in particular are well positioned to outsource tasks to generative AI given limited resources. In this paper, we work to address a growing disparity in use of these technologies by building on a four-year partnership with a local entrepreneurial hub dedicated to equity in tech and entreprene… ▽ More Generative AI platforms and features are permeating many aspects of work. Entrepreneurs from lean economies in particular are well positioned to outsource tasks to generative AI given limited resources. In this paper, we work to address a growing disparity in use of these technologies by building on a four-year partnership with a local entrepreneurial hub dedicated to equity in tech and entrepreneurship. Together, we co-designed an interactive workshops series aimed to onboard local entrepreneurs to generative AI platforms. Alongside four community-driven and iterative workshops with entrepreneurs across five months, we conducted interviews with 15 local entrepreneurs and community providers. We detail the importance of communal and supportive exposure to generative AI tools for local entrepreneurs, scaffolding actionable use (and supporting non-use), demystifying generative AI technologies by emphasizing entrepreneurial power, while simultaneously deconstructing the veneer of simplicity to address the many operational skills needed for successful application. △ Less

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.12566 [pdf, other]

GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Authors: Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C. Wallace, Zachary C. Lipton, Jeffrey P. Bigham

Abstract: LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that ar… ▽ More LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in 8 different LLM outputs when summarizing documents from diverse domains. To ensure that most errors are flagged by the system, we propose a method that can increase the error recall while minimizing impact on precision. We release our tool (GenAudit) and fact-checking model for public use. △ Less

Submitted 16 March, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

Comments: Code and models available at https://genaudit.org

arXiv:2312.06147 [pdf, other]

"What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces

Authors: Faria Huq, Jeffrey P. Bigham, Nikolas Martelaro

Abstract: Large language models (LLMs) that have been trained on a corpus that includes large amount of code exhibit a remarkable ability to understand HTML code. As web interfaces are primarily constructed using HTML, we design an in-depth study to see how LLMs can be used to retrieve and locate important elements for a user given query (i.e. task description) in a web interface. In contrast with prior wor… ▽ More Large language models (LLMs) that have been trained on a corpus that includes large amount of code exhibit a remarkable ability to understand HTML code. As web interfaces are primarily constructed using HTML, we design an in-depth study to see how LLMs can be used to retrieve and locate important elements for a user given query (i.e. task description) in a web interface. In contrast with prior works, which primarily focused on autonomous web navigation, we decompose the problem as an even atomic operation - Can LLMs identify the important information in the web page for a user given query? This decomposition enables us to scrutinize the current capabilities of LLMs and uncover the opportunities and challenges they present. Our empirical experiments show that while LLMs exhibit a reasonable level of performance in retrieving important UI elements, there is still a substantial room for improvement. We hope our investigation will inspire follow-up works in overcoming the current challenges in this domain. △ Less

Submitted 11 December, 2023; originally announced December 2023.

Comments: Accepted to NeurIPS 2023 R0-FoMo Workshop

arXiv:2310.00091 [pdf, other]

Towards Automated Accessibility Report Generation for Mobile Apps

Authors: Amanda Swearngin, Jason Wu, Xiaoyi Zhang, Esteban Gomez, Jen Coughenour, Rachel Stukenborg, Bhavya Garg, Greg Hughes, Adriana Hilliard, Jeffrey P. Bigham, Jeffrey Nichols

Abstract: Many apps have basic accessibility issues, like missing labels or low contrast. Automated tools can help app developers catch basic issues, but can be laborious or require writing dedicated tests. We propose a system, motivated by a collaborative process with accessibility stakeholders at a large technology company, to generate whole app accessibility reports by combining varied data collection me… ▽ More Many apps have basic accessibility issues, like missing labels or low contrast. Automated tools can help app developers catch basic issues, but can be laborious or require writing dedicated tests. We propose a system, motivated by a collaborative process with accessibility stakeholders at a large technology company, to generate whole app accessibility reports by combining varied data collection methods (e.g., app crawling, manual recording) with an existing accessibility scanner. Many such scanners are based on single-screen scanning, and a key problem in whole app accessibility reporting is to effectively de-duplicate and summarize issues collected across an app. To this end, we developed a screen grou** model with 96.9% accuracy (88.8% F1-score) and UI element matching heuristics with 97% accuracy (98.2% F1-score). We combine these technologies in a system to report and summarize unique issues across an app, and enable a unique pixel-based ignore feature to help engineers and testers better manage reported issues across their app's lifetime. We conducted a qualitative evaluation with 18 accessibility-focused engineers and testers which showed this system can enhance their existing accessibility testing toolkit and address key limitations in current accessibility scanning tools. △ Less

Submitted 16 October, 2023; v1 submitted 29 September, 2023; originally announced October 2023.

Comments: 24 pages, 8 figures

arXiv:2308.08726 [pdf, other]

Never-ending Learning of User Interfaces

Authors: Jason Wu, Rebecca Krosnick, Eldon Schoop, Amanda Swearngin, Jeffrey P. Bigham, Jeffrey Nichols

Abstract: Machine learning models have been trained to predict semantic information about user interfaces (UIs) to make apps more accessible, easier to test, and to automate. Currently, most models rely on datasets that are collected and labeled by human crowd-workers, a process that is costly and surprisingly error-prone for certain tasks. For example, it is possible to guess if a UI element is "tappable"… ▽ More Machine learning models have been trained to predict semantic information about user interfaces (UIs) to make apps more accessible, easier to test, and to automate. Currently, most models rely on datasets that are collected and labeled by human crowd-workers, a process that is costly and surprisingly error-prone for certain tasks. For example, it is possible to guess if a UI element is "tappable" from a screenshot (i.e., based on visual signifiers) or from potentially unreliable metadata (e.g., a view hierarchy), but one way to know for certain is to programmatically tap the UI element and observe the effects. We built the Never-ending UI Learner, an app crawler that automatically installs real apps from a mobile app store and crawls them to discover new and challenging training examples to learn from. The Never-ending UI Learner has crawled for more than 5,000 device-hours, performing over half a million actions on 6,000 apps to train three computer vision models for i) tappability prediction, ii) draggability prediction, and iii) screen similarity. △ Less

Submitted 16 August, 2023; originally announced August 2023.

arXiv:2306.05446 [pdf, other]

Latent Phrase Matching for Dysarthric Speech

Authors: Colin Lea, Dianna Yee, Jaya Narain, Zifang Huang, Lauren Tooley, Jeffrey P. Bigham, Leah Findlater

Abstract: Many consumer speech recognition systems are not tuned for people with speech disabilities, resulting in poor recognition and user experience, especially for severe speech differences. Recent studies have emphasized interest in personalized speech models from people with atypical speech patterns. We propose a query-by-example-based personalized phrase recognition system that is trained using small… ▽ More Many consumer speech recognition systems are not tuned for people with speech disabilities, resulting in poor recognition and user experience, especially for severe speech differences. Recent studies have emphasized interest in personalized speech models from people with atypical speech patterns. We propose a query-by-example-based personalized phrase recognition system that is trained using small amounts of speech, is language agnostic, does not assume a traditional pronunciation lexicon, and generalizes well across speech difference severities. On an internal dataset collected from 32 people with dysarthria, this approach works regardless of severity and shows a 60% improvement in recall relative to a commercial speech recognition system. On the public EasyCall dataset of dysarthric speech, our approach improves accuracy by 30.5%. Performance degrades as the number of phrases increases, but consistently outperforms ASR systems when trained with 50 unique phrases. △ Less

Submitted 8 June, 2023; originally announced June 2023.

arXiv:2305.14296 [pdf, other]

USB: A Unified Summarization Benchmark Across Tasks and Domains

Authors: Kundan Krishna, Prakhar Gupta, Sanjana Ramprasad, Byron C. Wallace, Jeffrey P. Bigham, Zachary C. Lipton

Abstract: While the NLP community has produced numerous summarization benchmarks, none provide the rich annotations required to simultaneously address many important problems related to control and reliability. We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks: (i) extractive summarization; (ii) abstractive summarization… ▽ More While the NLP community has produced numerous summarization benchmarks, none provide the rich annotations required to simultaneously address many important problems related to control and reliability. We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks: (i) extractive summarization; (ii) abstractive summarization; (iii) topic-based summarization; (iv) compressing selected sentences into a one-line summary; (v) surfacing evidence for a summary sentence; (vi) predicting the factual accuracy of a summary sentence; (vii) identifying unsubstantiated spans in a summary sentence; (viii) correcting factual errors in summaries. We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality-related tasks, we also evaluate existing heuristics to create training data and find that training on them results in worse performance than training on $20\times$ less human-labeled data. Our articles draw from $6$ domains, facilitating cross-domain analysis. On some tasks, the amount of training data matters more than the domain where it comes from, while for other tasks training specifically on data from the target domain, even if limited, is more beneficial. △ Less

Submitted 4 December, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: EMNLP Findings 2023 Camera Ready

arXiv:2302.09044 [pdf, other]

From User Perceptions to Technical Improvement: Enabling People Who Stutter to Better Use Speech Recognition

Authors: Colin Lea, Zifang Huang, Lauren Tooley, Jaya Narain, Dianna Yee, Panayiotis Georgiou, Tien Dung Tran, Jeffrey P. Bigham, Leah Findlater

Abstract: Consumer speech recognition systems do not work as well for many people with speech diferences, such as stuttering, relative to the rest of the general population. However, what is not clear is the degree to which these systems do not work, how they can be improved, or how much people want to use them. In this paper, we frst address these questions using results from a 61-person survey from people… ▽ More Consumer speech recognition systems do not work as well for many people with speech diferences, such as stuttering, relative to the rest of the general population. However, what is not clear is the degree to which these systems do not work, how they can be improved, or how much people want to use them. In this paper, we frst address these questions using results from a 61-person survey from people who stutter and fnd participants want to use speech recognition but are frequently cut of, misunderstood, or speech predictions do not represent intent. In a second study, where 91 people who stutter recorded voice assistant commands and dictation, we quantify how dysfuencies impede performance in a consumer-grade speech recognition system. Through three technical investigations, we demonstrate how many common errors can be prevented, resulting in a system that cuts utterances of 79.1% less often and improves word error rate from 25.4% to 9.9%. △ Less

Submitted 27 February, 2023; v1 submitted 17 February, 2023; originally announced February 2023.

Comments: CHI 2023

arXiv:2301.13280 [pdf, other]

WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics

Authors: Jason Wu, Siyan Wang, Siman Shen, Yi-Hao Peng, Jeffrey Nichols, Jeffrey P. Bigham

Abstract: Modeling user interfaces (UIs) from visual information allows systems to make inferences about the functionality and semantics needed to support use cases in accessibility, app automation, and testing. Current datasets for training machine learning models are limited in size due to the costly and time-consuming process of manually collecting and annotating UIs. We crawled the web to construct WebU… ▽ More Modeling user interfaces (UIs) from visual information allows systems to make inferences about the functionality and semantics needed to support use cases in accessibility, app automation, and testing. Current datasets for training machine learning models are limited in size due to the costly and time-consuming process of manually collecting and annotating UIs. We crawled the web to construct WebUI, a large dataset of 400,000 rendered web pages associated with automatically extracted metadata. We analyze the composition of WebUI and show that while automatically extracted data is noisy, most examples meet basic criteria for visual UI modeling. We applied several strategies for incorporating semantics found in web pages to increase the performance of visual UI understanding models in the mobile domain, where less labeled data is available: (i) element detection, (ii) screen classification and (iii) screen similarity. △ Less

Submitted 30 January, 2023; originally announced January 2023.

Comments: Accepted to CHI 2023. Dataset, code, and models release coming soon

arXiv:2301.08372 [pdf, other]

Screen Correspondence: Map** Interchangeable Elements between UIs

Authors: Jason Wu, Amanda Swearngin, Xiaoyi Zhang, Jeffrey Nichols, Jeffrey P. Bigham

Abstract: Understanding user interface (UI) functionality is a useful yet challenging task for both machines and people. In this paper, we investigate a machine learning approach for screen correspondence, which allows reasoning about UIs by map** their elements onto previously encountered examples with known functionality and properties. We describe and implement a model that incorporates element semanti… ▽ More Understanding user interface (UI) functionality is a useful yet challenging task for both machines and people. In this paper, we investigate a machine learning approach for screen correspondence, which allows reasoning about UIs by map** their elements onto previously encountered examples with known functionality and properties. We describe and implement a model that incorporates element semantics, appearance, and text to support correspondence computation without requiring any labeled examples. Through a comprehensive performance evaluation, we show that our approach improves upon baselines by incorporating multi-modal properties of UIs. Finally, we show three example applications where screen correspondence facilitates better UI understanding for humans and machines: (i) instructional overlay generation, (ii) semantic UI element search, and (iii) automated interface testing. △ Less

Submitted 19 January, 2023; originally announced January 2023.

arXiv:2209.14389 [pdf, other]

Downstream Datasets Make Surprisingly Good Pretraining Corpora

Authors: Kundan Krishna, Saurabh Garg, Jeffrey P. Bigham, Zachary C. Lipton

Abstract: For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a la… ▽ More For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream classification datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus (despite using around $10\times$--$500\times$ less data), outperforming the latter on $7$ and $5$ datasets, respectively. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Besides classification tasks, self-pretraining also provides benefits on structured output prediction tasks such as span based question answering and commonsense inference, often providing more than $50\%$ of the performance boosts provided by pretraining on the BookWiki corpus. Our results hint that in many scenarios, performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the use of external pretraining data in massive amounts. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data. △ Less

Submitted 26 May, 2023; v1 submitted 28 September, 2022; originally announced September 2022.

Comments: ACL2023 Camera Ready

arXiv:2207.12551 [pdf, other]

DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit

Authors: Jessica Huynh, Ting-Rui Chiang, Jeffrey Bigham, Maxine Eskenazi

Abstract: Dialog system developers need high-quality data to train, fine-tune and assess their systems. They often use crowdsourcing for this since it provides large quantities of data from many workers. However, the data may not be of sufficiently good quality. This can be due to the way that the requester presents a task and how they interact with the workers. This paper introduces DialCrowd 2.0 to help r… ▽ More Dialog system developers need high-quality data to train, fine-tune and assess their systems. They often use crowdsourcing for this since it provides large quantities of data from many workers. However, the data may not be of sufficiently good quality. This can be due to the way that the requester presents a task and how they interact with the workers. This paper introduces DialCrowd 2.0 to help requesters obtain higher quality data by, for example, presenting tasks more clearly and facilitating effective communication with workers. DialCrowd 2.0 guides developers in creating improved Human Intelligence Tasks (HITs) and is directly applicable to the workflows used currently by developers and researchers. △ Less

Submitted 25 July, 2022; originally announced July 2022.

Comments: Published at LREC 2022

arXiv:2207.07712 [pdf, other]

Reflow: Automatically Improving Touch Interactions in Mobile Applications through Pixel-based Refinements

Authors: Jason Wu, Titus Barik, Xiaoyi Zhang, Colin Lea, Jeffrey Nichols, Jeffrey P. Bigham

Abstract: Touch is the primary way that users interact with smartphones. However, building mobile user interfaces where touch interactions work well for all users is a difficult problem, because users have different abilities and preferences. We propose a system, Reflow, which automatically applies small, personalized UI adaptations, called refinements -- to mobile app screens to improve touch efficiency. R… ▽ More Touch is the primary way that users interact with smartphones. However, building mobile user interfaces where touch interactions work well for all users is a difficult problem, because users have different abilities and preferences. We propose a system, Reflow, which automatically applies small, personalized UI adaptations, called refinements -- to mobile app screens to improve touch efficiency. Reflow uses a pixel-based strategy to work with existing applications, and improves touch efficiency while minimally disrupting the design intent of the original application. Our system optimizes a UI by (i) extracting its layout from its screenshot, (ii) refining its layout, and (iii) re-rendering the UI to reflect these modifications. We conducted a user study with 10 participants and a heuristic evaluation with 6 experts and found that applications optimized by Reflow led to, on average, 9% faster selection time with minimal layout disruption. The results demonstrate that Reflow's refinements useful UI adaptations to improve touch interactions. △ Less

Submitted 15 July, 2022; originally announced July 2022.

arXiv:2205.12673 [pdf, other]

InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning

Authors: Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, Jeffrey P. Bigham

Abstract: Instruction tuning is an emergent paradigm in NLP wherein natural language instructions are leveraged with language models to induce zero-shot performance on unseen tasks. Instructions have been shown to enable good performance on unseen tasks and datasets in both large and small language models. Dialogue is an especially interesting area to explore instruction tuning because dialogue systems perf… ▽ More Instruction tuning is an emergent paradigm in NLP wherein natural language instructions are leveraged with language models to induce zero-shot performance on unseen tasks. Instructions have been shown to enable good performance on unseen tasks and datasets in both large and small language models. Dialogue is an especially interesting area to explore instruction tuning because dialogue systems perform multiple kinds of tasks related to language (e.g., natural language understanding and generation, domain-specific interaction), yet instruction tuning has not been systematically explored for dialogue-related tasks. We introduce InstructDial, an instruction tuning framework for dialogue, which consists of a repository of 48 diverse dialogue tasks in a unified text-to-text format created from 59 openly available dialogue datasets. Next, we explore cross-task generalization ability on models tuned on InstructDial across diverse dialogue tasks. Our analysis reveals that InstructDial enables good zero-shot performance on unseen datasets and tasks such as dialogue evaluation and intent detection, and even better performance in a few-shot setting. To ensure that models adhere to instructions, we introduce novel meta-tasks. We establish benchmark zero-shot and few-shot performance of models trained using the proposed framework on multiple dialogue tasks. △ Less

Submitted 26 October, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: EMNLP 2022

arXiv:2205.09314 [pdf, other]

Target-Guided Dialogue Response Generation Using Commonsense and Data Augmentation

Authors: Prakhar Gupta, Harsh Jhamtani, Jeffrey P. Bigham

Abstract: Target-guided response generation enables dialogue systems to smoothly transition a conversation from a dialogue context toward a target sentence. Such control is useful for designing dialogue systems that direct a conversation toward specific goals, such as creating non-obtrusive recommendations or introducing new topics in the conversation. In this paper, we introduce a new technique for target-… ▽ More Target-guided response generation enables dialogue systems to smoothly transition a conversation from a dialogue context toward a target sentence. Such control is useful for designing dialogue systems that direct a conversation toward specific goals, such as creating non-obtrusive recommendations or introducing new topics in the conversation. In this paper, we introduce a new technique for target-guided response generation, which first finds a bridging path of commonsense knowledge concepts between the source and the target, and then uses the identified bridging path to generate transition responses. Additionally, we propose techniques to re-purpose existing dialogue datasets for target-guided generation. Experiments reveal that the proposed techniques outperform various baselines on this task. Finally, we observe that the existing automated metrics for this task correlate poorly with human judgement ratings. We propose a novel evaluation metric that we demonstrate is more reliable for target-guided response evaluation. Our work generally enables dialogue system designers to exercise more control over the conversations that their systems produce. △ Less

Submitted 19 May, 2022; originally announced May 2022.

Comments: Accepted at NAACL 2022 (Findings)

arXiv:2202.07750 [pdf, other]

Nonverbal Sound Detection for Disordered Speech

Authors: Colin Lea, Zifang Huang, Dhruv Jain, Lauren Tooley, Zeinab Liaghat, Shrinath Thelapurath, Leah Findlater, Jeffrey P. Bigham

Abstract: Voice assistants have become an essential tool for people with various disabilities because they enable complex phone- or tablet-based interactions without the need for fine-grained motor control, such as with touchscreens. However, these systems are not tuned for the unique characteristics of individuals with speech disorders, including many of those who have a motor-speech disorder, are deaf or… ▽ More Voice assistants have become an essential tool for people with various disabilities because they enable complex phone- or tablet-based interactions without the need for fine-grained motor control, such as with touchscreens. However, these systems are not tuned for the unique characteristics of individuals with speech disorders, including many of those who have a motor-speech disorder, are deaf or hard of hearing, have a severe stutter, or are minimally verbal. We introduce an alternative voice-based input system which relies on sound event detection using fifteen nonverbal mouth sounds like "pop," "click," or "eh." This system was designed to work regardless of ones' speech abilities and allows full access to existing technology. In this paper, we describe the design of a dataset, model considerations for real-world deployment, and efforts towards model personalization. Our fully-supervised model achieves segment-level precision and recall of 88.6% and 88.4% on an internal dataset of 710 adults, while achieving 0.31 false positives per hour on aggressors such as speech. Five-shot personalization enables satisfactory performance in 84.5% of cases where the generic model fails. △ Less

Submitted 15 February, 2022; originally announced February 2022.

Comments: Accepted at ICASSP 2022

arXiv:2112.09544 [pdf]

It's Time to Do Something: Mitigating the Negative Impacts of Computing Through a Change to the Peer Review Process

Authors: Brent Hecht, Lauren Wilcox, Jeffrey P. Bigham, Johannes Schöning, Ehsan Hoque, Jason Ernst, Yonatan Bisk, Luigi De Russis, Lana Yarosh, Bushra Anjum, Danish Contractor, Cathy Wu

Abstract: The computing research community needs to work much harder to address the downsides of our innovations. Between the erosion of privacy, threats to democracy, and automation's effect on employment (among many other issues), we can no longer simply assume that our research will have a net positive impact on the world. While bending the arc of computing innovation towards societal benefit may at firs… ▽ More The computing research community needs to work much harder to address the downsides of our innovations. Between the erosion of privacy, threats to democracy, and automation's effect on employment (among many other issues), we can no longer simply assume that our research will have a net positive impact on the world. While bending the arc of computing innovation towards societal benefit may at first seem intractable, we believe we can achieve substantial progress with a straightforward step: making a small change to the peer review process. As we explain below, we hypothesize that our recommended change will force computing researchers to more deeply consider the negative impacts of their work. We also expect that this change will incentivize research and policy that alleviates computing's negative impacts. △ Less

Submitted 17 December, 2021; originally announced December 2021.

Comments: First published on the ACM Future of Computing Academy blog on March 29, 2018. This is the archival version

arXiv:2111.05241 [pdf, other]

A Survey of NLP-Related Crowdsourcing HITs: what works and what does not

Authors: Jessica Huynh, Jeffrey Bigham, Maxine Eskenazi

Abstract: Crowdsourcing requesters on Amazon Mechanical Turk (AMT) have raised questions about the reliability of the workers. The AMT workforce is very diverse and it is not possible to make blanket assumptions about them as a group. Some requesters now reject work en mass when they do not get the results they expect. This has the effect of giving each worker (good or bad) a lower Human Intelligence Task (… ▽ More Crowdsourcing requesters on Amazon Mechanical Turk (AMT) have raised questions about the reliability of the workers. The AMT workforce is very diverse and it is not possible to make blanket assumptions about them as a group. Some requesters now reject work en mass when they do not get the results they expect. This has the effect of giving each worker (good or bad) a lower Human Intelligence Task (HIT) approval score, which is unfair to the good workers. It also has the effect of giving the requester a bad reputation on the workers' forums. Some of the issues causing the mass rejections stem from the requesters not taking the time to create a well-formed task with complete instructions and/or not paying a fair wage. To explore this assumption, this paper describes a study that looks at the crowdsourcing HITs on AMT that were available over a given span of time and records information about those HITs. This study also records information from a crowdsourcing forum on the worker perspective on both those HITs and on their corresponding requesters. Results reveal issues in worker payment and presentation issues such as missing instructions or HITs that are not doable. △ Less

Submitted 9 November, 2021; originally announced November 2021.

arXiv:2109.08763 [pdf]

doi 10.1145/3472749.3474763

Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots

Authors: Jason Wu, Xiaoyi Zhang, Jeff Nichols, Jeffrey P. Bigham

Abstract: Automated understanding of user interfaces (UIs) from their pixels can improve accessibility, enable task automation, and facilitate interface design without relying on developers to comprehensively provide metadata. A first step is to infer what UI elements exist on a screen, but current approaches are limited in how they infer how those elements are semantically grouped into structured interface… ▽ More Automated understanding of user interfaces (UIs) from their pixels can improve accessibility, enable task automation, and facilitate interface design without relying on developers to comprehensively provide metadata. A first step is to infer what UI elements exist on a screen, but current approaches are limited in how they infer how those elements are semantically grouped into structured interface definitions. In this paper, we motivate the problem of screen parsing, the task of predicting UI elements and their relationships from a screenshot. We describe our implementation of screen parsing and provide an effective training procedure that optimizes its performance. In an evaluation comparing the accuracy of the generated output, we find that our implementation significantly outperforms current systems (up to 23%). Finally, we show three example applications that are facilitated by screen parsing: (i) UI similarity search, (ii) accessibility enhancement, and (iii) code generation from UI screenshots. △ Less

Submitted 17 September, 2021; originally announced September 2021.

arXiv:2109.04953 [pdf, other]

Does Pretraining for Summarization Require Knowledge Transfer?

Authors: Kundan Krishna, Jeffrey Bigham, Zachary C. Lipton

Abstract: Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining's benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that pretraining on documents consisting of chara… ▽ More Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining's benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that pretraining on documents consisting of character n-grams selected at random, we can nearly match the performance of models pretrained on real corpora. This work holds the promise of eliminating upstream corpora, which may alleviate some concerns over offensive language, bias, and copyright issues. To see whether the small residual benefit of using real data could be accounted for by the structure of the pretraining task, we design several tasks motivated by a qualitative study of summarization corpora. However, these tasks confer no appreciable benefit, leaving open the possibility of a small role for knowledge transfer. △ Less

Submitted 10 September, 2021; originally announced September 2021.

Comments: Camera-ready for Findings of EMNLP 2021

arXiv:2106.11759 [pdf, other]

Analysis and Tuning of a Voice Assistant System for Dysfluent Speech

Authors: Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis Georgiou, Sachin Kajarekar, Jefferey Bigham

Abstract: Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and as a consequence do not generalize well to speech with dysfluencies such as sound or word repetition… ▽ More Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and as a consequence do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks. The focus of this work is on quantitative analysis of a consumer speech recognition system on individuals who stutter and production-oriented approaches for improving performance for common voice assistant tasks (i.e., "what is the weather?"). At baseline, this system introduces a significant number of insertion and substitution errors resulting in intended speech Word Error Rates (isWER) that are 13.64\% worse (absolute) for individuals with fluency disorders. We show that by simply tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24\% (relative) for individuals with fluency disorders. Tuning these parameters translates to 3.6\% better domain recognition and 1.7\% better intent recognition relative to the default setup for the 18 study participants across all stuttering severities. △ Less

Submitted 18 June, 2021; originally announced June 2021.

Comments: 5 pages, 1 page reference, 2 figures

arXiv:2106.05894 [pdf, other]

Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation

Authors: Prakhar Gupta, Yulia Tsvetkov, Jeffrey P. Bigham

Abstract: Open-domain neural dialogue models have achieved high performance in response ranking and evaluation tasks. These tasks are formulated as a binary classification of responses given in a dialogue context, and models generally learn to make predictions based on context-response content similarity. However, over-reliance on content similarity makes the models less sensitive to the presence of inconsi… ▽ More Open-domain neural dialogue models have achieved high performance in response ranking and evaluation tasks. These tasks are formulated as a binary classification of responses given in a dialogue context, and models generally learn to make predictions based on context-response content similarity. However, over-reliance on content similarity makes the models less sensitive to the presence of inconsistencies, incorrect time expressions and other factors important for response appropriateness and coherence. We propose approaches for automatically creating adversarial negative training data to help ranking and evaluation models learn features beyond content similarity. We propose mask-and-fill and keyword-guided approaches that generate negative examples for training more robust dialogue systems. These generated adversarial responses have high content similarity with the contexts but are either incoherent, inappropriate or not fluent. Our approaches are fully data-driven and can be easily incorporated in existing models and datasets. Experiments on classification, ranking and evaluation tasks across multiple datasets demonstrate that our approaches outperform strong baselines in providing informative negative examples for training dialogue systems. △ Less

Submitted 10 June, 2021; originally announced June 2021.

Comments: Accepted to Findings of ACL 2021

arXiv:2105.01734 [pdf]

doi 10.1145/3430263.3452434

When Can Accessibility Help?: An Exploration of Accessibility Feature Recommendation on Mobile Devices

Authors: Jason Wu, Gabriel Reyes, Sam C. White, Xiaoyi Zhang, Jeffrey P. Bigham

Abstract: Numerous accessibility features have been developed and included in consumer operating systems to provide people with a variety of disabilities additional ways to access computing devices. Unfortunately, many users, especially older adults who are more likely to experience ability changes, are not aware of these features or do not know which combination to use. In this paper, we first quantify thi… ▽ More Numerous accessibility features have been developed and included in consumer operating systems to provide people with a variety of disabilities additional ways to access computing devices. Unfortunately, many users, especially older adults who are more likely to experience ability changes, are not aware of these features or do not know which combination to use. In this paper, we first quantify this problem via a survey with 100 participants, demonstrating that very few people are aware of built-in accessibility features on their phones. These observations led us to investigate accessibility recommendation as a way to increase awareness and adoption. We developed four prototype recommenders that span different accessibility categories, which we used to collect insights from 20 older adults. Our work demonstrates the need to increase awareness of existing accessibility features on mobile devices, and shows that automated recommendation could help people find beneficial accessibility features. △ Less

Submitted 4 May, 2021; originally announced May 2021.

Comments: Accepted to Web4All 2021 (W4A '21)

arXiv:2103.14491 [pdf]

Say It All: Feedback for Improving Non-Visual Presentation Accessibility

Authors: Yi-Hao Peng, JiWoong Jang, Jeffrey P. Bigham, Amy Pavel

Abstract: Presenters commonly use slides as visual aids for informative talks. When presenters fail to verbally describe the content on their slides, blind and visually impaired audience members lose access to necessary content, making the presentation difficult to follow. Our analysis of 90 presentation videos revealed that 72% of 610 visual elements (e.g., images, text) were insufficiently described. To h… ▽ More Presenters commonly use slides as visual aids for informative talks. When presenters fail to verbally describe the content on their slides, blind and visually impaired audience members lose access to necessary content, making the presentation difficult to follow. Our analysis of 90 presentation videos revealed that 72% of 610 visual elements (e.g., images, text) were insufficiently described. To help presenters create accessible presentations, we introduce Presentation A11y, a system that provides real-time and post-presentation accessibility feedback. Our system analyzes visual elements on the slide and the transcript of the verbal presentation to provide element-level feedback on what visual content needs to be further described or even removed. Presenters using our system with their own slide-based presentations described more of the content on their slides, and identified 3.26 times more accessibility problems to fix after the talk than when using a traditional slide-based presentation interface. Integrating accessibility feedback into content creation tools will improve the accessibility of informational content for all. △ Less

Submitted 26 March, 2021; originally announced March 2021.

arXiv:2102.12394 [pdf, other]

SEP-28k: A Dataset for Stuttering Event Detection From Podcasts With People Who Stutter

Authors: Colin Lea, Vikramjit Mitra, Aparna Joshi, Sachin Kajarekar, Jeffrey P. Bigham

Abstract: The ability to automatically detect stuttering events in speech could help speech pathologists track an individual's fluency over time or help improve speech recognition systems for people with atypical speech patterns. Despite increasing interest in this area, existing public datasets are too small to build generalizable dysfluency detection systems and lack sufficient annotations. In this work,… ▽ More The ability to automatically detect stuttering events in speech could help speech pathologists track an individual's fluency over time or help improve speech recognition systems for people with atypical speech patterns. Despite increasing interest in this area, existing public datasets are too small to build generalizable dysfluency detection systems and lack sufficient annotations. In this work, we introduce Stuttering Events in Podcasts (SEP-28k), a dataset containing over 28k clips labeled with five event types including blocks, prolongations, sound repetitions, word repetitions, and interjections. Audio comes from public podcasts largely consisting of people who stutter interviewing other people who stutter. We benchmark a set of acoustic models on SEP-28k and the public FluencyBank dataset and highlight how simply increasing the amount of training data improves relative detection performance by 28\% and 24\% F1 on each. Annotations from over 32k clips across both datasets will be publicly released. △ Less

Submitted 24 February, 2021; originally announced February 2021.

Comments: Accepted to ICASSP 2021

arXiv:2101.04893 [pdf, other]

Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels

Authors: Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, Aaron Everitt, Jeffrey P. Bigham

Abstract: Many accessibility features available on mobile platforms require applications (apps) to provide complete and accurate metadata describing user interface (UI) components. Unfortunately, many apps do not provide sufficient metadata for accessibility features to work as expected. In this paper, we explore inferring accessibility metadata for mobile apps from their pixels, as the visual interfaces of… ▽ More Many accessibility features available on mobile platforms require applications (apps) to provide complete and accurate metadata describing user interface (UI) components. Unfortunately, many apps do not provide sufficient metadata for accessibility features to work as expected. In this paper, we explore inferring accessibility metadata for mobile apps from their pixels, as the visual interfaces often best reflect an app's full functionality. We trained a robust, fast, memory-efficient, on-device model to detect UI elements using a dataset of 77,637 screens (from 4,068 iPhone apps) that we collected and annotated. To further improve UI detections and add semantic information, we introduced heuristics (e.g., UI grou** and ordering) and additional models (e.g., recognize UI content, state, interactivity). We built Screen Recognition to generate accessibility metadata to augment iOS VoiceOver. In a study with 9 screen reader users, we validated that our approach improves the accessibility of existing mobile apps, enabling even previously inaccessible apps to be used. △ Less

Submitted 13 January, 2021; originally announced January 2021.

arXiv:2012.15211 [pdf, other]

The Challenges of Crowd Workers in Rural and Urban America

Authors: Claudia Flores-Saviaga, Yuwen Li, Benjamin V. Hanrahan, Jeffrey Bigham, Saiph Savage

Abstract: Crowd work has the potential of hel** the financial recovery of regions traditionally plagued by a lack of economic opportunities, e.g., rural areas. However, we currently have limited information about the challenges facing crowd work-ers from rural and super rural areas as they struggle to make a living through crowd work sites. This paper examines the challenges and advantages of rural and su… ▽ More Crowd work has the potential of hel** the financial recovery of regions traditionally plagued by a lack of economic opportunities, e.g., rural areas. However, we currently have limited information about the challenges facing crowd work-ers from rural and super rural areas as they struggle to make a living through crowd work sites. This paper examines the challenges and advantages of rural and super rural AmazonMechanical Turk (MTurk) crowd workers and contrasts them with those of workers from urban areas. Based on a survey of421 crowd workers from differing geographic regions in theU.S., we identified how across regions, people struggled with being onboarded into crowd work. We uncovered that despite the inequalities and barriers, rural workers tended to be striving more in micro-tasking than their urban counterparts. We also identified cultural traits, relating to time dimension and individualism, that offer us an insight into crowd workers and the necessary qualities for them to succeed on gig platforms. We finish by providing design implications based on our findings to create more inclusive crowd work platforms and tools △ Less

Submitted 30 December, 2020; originally announced December 2020.

Journal ref: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 2020

arXiv:2012.02748 [pdf, other]

Challenging common interpretability assumptions in feature attribution explanations

Authors: Jonathan Dinu, Jeffrey Bigham, J. Zico Kolter

Abstract: As machine learning and algorithmic decision making systems are increasingly being leveraged in high-stakes human-in-the-loop settings, there is a pressing need to understand the rationale of their predictions. Researchers have responded to this need with explainable AI (XAI), but often proclaim interpretability axiomatically without evaluation. When these systems are evaluated, they are often tes… ▽ More As machine learning and algorithmic decision making systems are increasingly being leveraged in high-stakes human-in-the-loop settings, there is a pressing need to understand the rationale of their predictions. Researchers have responded to this need with explainable AI (XAI), but often proclaim interpretability axiomatically without evaluation. When these systems are evaluated, they are often tested through offline simulations with proxy metrics of interpretability (such as model complexity). We empirically evaluate the veracity of three common interpretability assumptions through a large scale human-subjects experiment with a simple "placebo explanation" control. We find that feature attribution explanations provide marginal utility in our task for a human decision maker and in certain cases result in worse decisions due to cognitive and contextual confounders. This result challenges the assumed universal benefit of applying these methods and we hope this work will underscore the importance of human evaluation in XAI research. Supplemental materials -- including anonymized data from the experiment, code to replicate the study, an interactive demo of the experiment, and the models used in the analysis -- can be found at: https://doi.pizza/challenging-xai. △ Less

Submitted 4 December, 2020; originally announced December 2020.

Comments: Presented at the NeurIPS 2020 ML-Retrospectives, Surveys & Meta-Analyses Workshop

ACM Class: J.4; I.5.1; K.4

arXiv:2010.06035 [pdf]

doi 10.1145/3373625.3417006

Making Mobile Augmented Reality Applications Accessible

Authors: Jaylin Herskovitz, Jason Wu, Samuel White, Amy Pavel, Gabriel Reyes, Anhong Guo, Jeffrey P. Bigham

Abstract: Augmented Reality (AR) technology creates new immersive experiences in entertainment, games, education, retail, and social media. AR content is often primarily visual and it is challenging to enable access to it non-visually due to the mix of virtual and real-world content. In this paper, we identify common constituent tasks in AR by analyzing existing mobile AR applications for iOS, and character… ▽ More Augmented Reality (AR) technology creates new immersive experiences in entertainment, games, education, retail, and social media. AR content is often primarily visual and it is challenging to enable access to it non-visually due to the mix of virtual and real-world content. In this paper, we identify common constituent tasks in AR by analyzing existing mobile AR applications for iOS, and characterize the design space of tasks that require accessible alternatives. For each of the major task categories, we create prototype accessible alternatives that we evaluate in a study with 10 blind participants to explore their perceptions of accessible AR. Our study demonstrates that these prototypes make AR possible to use for blind users and reveals a number of insights to move forward. We believe our work sets forth not only exemplars for developers to create accessible AR applications, but also a roadmap for future research to make AR comprehensively accessible. △ Less

Submitted 12 October, 2020; originally announced October 2020.

Comments: 14 pages. 6 figures. Published in The 22nd International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '20)

arXiv:2010.03667 [pdf]

Rescribe: Authoring and Automatically Editing Audio Descriptions

Authors: Amy Pavel, Gabriel Reyes, Jeffrey P. Bigham

Abstract: Audio descriptions make videos accessible to those who cannot see them by describing visual content in audio. Producing audio descriptions is challenging due to the synchronous nature of the audio description that must fit into gaps of other video content. An experienced audio description author will produce content that fits narration necessary to understand, enjoy, or experience the video conten… ▽ More Audio descriptions make videos accessible to those who cannot see them by describing visual content in audio. Producing audio descriptions is challenging due to the synchronous nature of the audio description that must fit into gaps of other video content. An experienced audio description author will produce content that fits narration necessary to understand, enjoy, or experience the video content into the time available. This can be especially tricky for novices to do well. In this paper, we introduce a tool, Rescribe, that helps authors create and refine their audio descriptions. Using Rescribe, authors first create a draft of all the content they would like to include in the audio description. Rescribe then uses a dynamic programming approach to optimize between the length of the audio description, available automatic shortening approaches, and source track lengthening approaches. Authors can iteratively visualize and refine the audio descriptions produced by Rescribe, working in concert with the tool. We evaluate the effectiveness of Rescribe through interviews with blind and visually impaired audio description users who preferred Rescribe-edited descriptions to extended descriptions. In addition, we invite novice users to create audio descriptions with Rescribe and another tool, finding that users produce audio descriptions with fewer placement errors using Rescribe. △ Less

Submitted 7 October, 2020; originally announced October 2020.

arXiv:2008.09075 [pdf]

Controlling Dialogue Generation with Semantic Exemplars

Authors: Prakhar Gupta, Jeffrey P. Bigham, Yulia Tsvetkov, Amy Pavel

Abstract: Dialogue systems pretrained with large language models generate locally coherent responses, but lack the fine-grained control over responses necessary to achieve specific goals. A promising method to control response generation is exemplar-based generation, in which models edit exemplar responses that are retrieved from training data, or hand-written to strategically address discourse-level goals,… ▽ More Dialogue systems pretrained with large language models generate locally coherent responses, but lack the fine-grained control over responses necessary to achieve specific goals. A promising method to control response generation is exemplar-based generation, in which models edit exemplar responses that are retrieved from training data, or hand-written to strategically address discourse-level goals, to fit new dialogue contexts. But, current exemplar-based approaches often excessively copy words from the exemplar responses, leading to incoherent replies. We present an Exemplar-based Dialogue Generation model, EDGE, that uses the semantic frames present in exemplar responses to guide generation. We show that controlling dialogue generation based on the semantic frames of exemplars, rather than words in the exemplar itself, improves the coherence of generated responses, while preserving semantic meaning and conversation goals present in exemplar responses. △ Less

Submitted 25 March, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

Comments: Accepted at NAACL 2021

arXiv:2007.07151 [pdf, other]

Extracting Structured Data from Physician-Patient Conversations By Predicting Noteworthy Utterances

Authors: Kundan Krishna, Amy Pavel, Benjamin Schloss, Jeffrey P. Bigham, Zachary C. Lipton

Abstract: Despite diverse efforts to mine various modalities of medical data, the conversations between physicians and patients at the time of care remain an untapped source of insights. In this paper, we leverage this data to extract structured information that might assist physicians with post-visit documentation in electronic health records, potentially lightening the clerical burden. In this exploratory… ▽ More Despite diverse efforts to mine various modalities of medical data, the conversations between physicians and patients at the time of care remain an untapped source of insights. In this paper, we leverage this data to extract structured information that might assist physicians with post-visit documentation in electronic health records, potentially lightening the clerical burden. In this exploratory study, we describe a new dataset consisting of conversation transcripts, post-visit summaries, corresponding supporting evidence (in the transcript), and structured labels. We focus on the tasks of recognizing relevant diagnoses and abnormalities in the review of organ systems (RoS). One methodological challenge is that the conversations are long (around 1500 words), making it difficult for modern deep-learning models to use them as input. To address this challenge, we extract noteworthy utterances---parts of the conversation likely to be cited as evidence supporting some summary sentence. We find that by first filtering for (predicted) noteworthy utterances, we can significantly boost predictive performance for recognizing both diagnoses and RoS abnormalities. △ Less

Submitted 14 July, 2020; originally announced July 2020.

arXiv:2005.06021 [pdf, other]

doi 10.1145/3366423.3380199

Becoming the Super Turker: Increasing Wages via a Strategy from High Earning Workers

Authors: Saiph Savage, Chun-Wei Chiang, Susumu Saito, Carlos Toxtli, Jeffrey Bigham

Abstract: Crowd markets have traditionally limited workers by not providing transparency information concerning which tasks pay fairly or which requesters are unreliable. Researchers believe that a key reason why crowd workers earn low wages is due to this lack of transparency. As a result, tools have been developed to provide more transparency within crowd markets to help workers. However, while most worke… ▽ More Crowd markets have traditionally limited workers by not providing transparency information concerning which tasks pay fairly or which requesters are unreliable. Researchers believe that a key reason why crowd workers earn low wages is due to this lack of transparency. As a result, tools have been developed to provide more transparency within crowd markets to help workers. However, while most workers use these tools, they still earn less than minimum wage. We argue that the missing element is guidance on how to use transparency information. In this paper, we explore how novice workers can improve their earnings by following the transparency criteria of Super Turkers, i.e., crowd workers who earn higher salaries on Amazon Mechanical Turk (MTurk). We believe that Super Turkers have developed effective processes for using transparency information. Therefore, by having novices follow a Super Turker criteria (one that is simple and popular among Super Turkers), we can help novices increase their wages. For this purpose, we: (i) conducted a survey and data analysis to computationally identify a simple yet common criteria that Super Turkers use for handling transparency tools; (ii) deployed a two-week field experiment with novices who followed this Super Turker criteria to find better work on MTurk. Novices in our study viewed over 25,000 tasks by 1,394 requesters. We found that novices who utilized this Super Turkers' criteria earned better wages than other novices. Our results highlight that tool development to support crowd workers should be paired with educational opportunities that teach workers how to effectively use the tools and their related metrics (e.g., transparency values). We finish with design recommendations for empowering crowd workers to earn higher salaries. △ Less

Submitted 7 May, 2020; originally announced May 2020.

Comments: 12.pages, 8 figures, The Web Conference 202, ACM WWW 2020

ACM Class: I.2.7

arXiv:2005.01795 [pdf, other]

Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques

Authors: Kundan Krishna, Sopan Khosla, Jeffrey P. Bigham, Zachary C. Lipton

Abstract: Following each patient visit, physicians draft long semi-structured clinical summaries called SOAP notes. While invaluable to clinicians and researchers, creating digital SOAP notes is burdensome, contributing to physician burnout. In this paper, we introduce the first complete pipelines to leverage deep summarization models to generate these notes based on transcripts of conversations between phy… ▽ More Following each patient visit, physicians draft long semi-structured clinical summaries called SOAP notes. While invaluable to clinicians and researchers, creating digital SOAP notes is burdensome, contributing to physician burnout. In this paper, we introduce the first complete pipelines to leverage deep summarization models to generate these notes based on transcripts of conversations between physicians and patients. After exploring a spectrum of methods across the extractive-abstractive spectrum, we propose Cluster2Sent, an algorithm that (i) extracts important utterances relevant to each summary section; (ii) clusters together related utterances; and then (iii) generates one summary sentence per cluster. Cluster2Sent outperforms its purely abstractive counterpart by 8 ROUGE-1 points, and produces significantly more factual and coherent sentences as assessed by expert human evaluators. For reproducibility, we demonstrate similar benefits on the publicly available AMI dataset. Our results speak to the benefits of structuring summaries into sections and annotating supporting evidence when constructing summarization corpora. △ Less

Submitted 2 June, 2021; v1 submitted 4 May, 2020; originally announced May 2020.

Comments: Published at ACL 2021 Main Conference

arXiv:1909.05725 [pdf, other]

doi 10.15346/hc.v6i1.7

InstructableCrowd: Creating IF-THEN Rules for Smartphones via Conversations with the Crowd

Authors: Ting-Hao 'Kenneth' Huang, Amos Azaria, Oscar J. Romero, Jeffrey P. Bigham

Abstract: Natural language interfaces have become a common part of modern digital life. Chatbots utilize text-based conversations to communicate with users; personal assistants on smartphones such as Google Assistant take direct speech commands from their users; and speech-controlled devices such as Amazon Echo use voice as their only input mode. In this paper, we introduce InstructableCrowd, a crowd-powere… ▽ More Natural language interfaces have become a common part of modern digital life. Chatbots utilize text-based conversations to communicate with users; personal assistants on smartphones such as Google Assistant take direct speech commands from their users; and speech-controlled devices such as Amazon Echo use voice as their only input mode. In this paper, we introduce InstructableCrowd, a crowd-powered system that allows users to program their devices via conversation. The user verbally expresses a problem to the system, in which a group of crowd workers collectively respond and program relevant multi-part IF-THEN rules to help the user. The IF-THEN rules generated by InstructableCrowd connect relevant sensor combinations (e.g., location, weather, device acceleration, etc.) to useful effectors (e.g., text messages, device alarms, etc.). Our study showed that non-programmers can use the conversational interface of InstructableCrowd to create IF-THEN rules that have similar quality compared with the rules created manually. InstructableCrowd generally illustrates how users may converse with their devices, not only to trigger simple voice commands, but also to personalize their increasingly powerful and complicated devices. △ Less

Submitted 12 September, 2019; originally announced September 2019.

Comments: Published at Human Computation (2019) 6:1:113-146

Journal ref: Human Computation (2019) 6:1:113-146

arXiv:1908.07144 [pdf, other]

doi 10.1145/3332165.3347873

StateLens: A Reverse Engineering Solution for Making Existing Dynamic Touchscreens Accessible

Authors: Anhong Guo, Junhan Kong, Michael Rivera, Frank F. Xu, Jeffrey P. Bigham

Abstract: Blind people frequently encounter inaccessible dynamic touchscreens in their everyday lives that are difficult, frustrating, and often impossible to use independently. Touchscreens are often the only way to control everything from coffee machines and payment terminals, to subway ticket machines and in-flight entertainment systems. Interacting with dynamic touchscreens is difficult non-visually bec… ▽ More Blind people frequently encounter inaccessible dynamic touchscreens in their everyday lives that are difficult, frustrating, and often impossible to use independently. Touchscreens are often the only way to control everything from coffee machines and payment terminals, to subway ticket machines and in-flight entertainment systems. Interacting with dynamic touchscreens is difficult non-visually because the visual user interfaces change, interactions often occur over multiple different screens, and it is easy to accidentally trigger interface actions while exploring the screen. To solve these problems, we introduce StateLens - a three-part reverse engineering solution that makes existing dynamic touchscreens accessible. First, StateLens reverse engineers the underlying state diagrams of existing interfaces using point-of-view videos found online or taken by users using a hybrid crowd-computer vision pipeline. Second, using the state diagrams, StateLens automatically generates conversational agents to guide blind users through specifying the tasks that the interface can perform, allowing the StateLens iOS application to provide interactive guidance and feedback so that blind users can access the interface. Finally, a set of 3D-printed accessories enable blind people to explore capacitive touchscreens without the risk of triggering accidental touches on the interface. Our technical evaluation shows that StateLens can accurately reconstruct interfaces from stationary, hand-held, and web videos; and, a user study of the complete system demonstrates that StateLens successfully enables blind users to access otherwise inaccessible dynamic touchscreens. △ Less

Submitted 19 August, 2019; originally announced August 2019.

Comments: ACM UIST 2019

arXiv:1907.10568 [pdf, other]

Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

Authors: Prakhar Gupta, Shikib Mehri, Tiancheng Zhao, Amy Pavel, Maxine Eskenazi, Jeffrey P. Bigham

Abstract: The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation. Existing metrics have been shown to correlate poorly with human judgement, particularly in open-domain dialog. One alternative is to collect human annotations for evaluation, which can be expensive and time consuming. To demonstrate the effectiveness of mu… ▽ More The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation. Existing metrics have been shown to correlate poorly with human judgement, particularly in open-domain dialog. One alternative is to collect human annotations for evaluation, which can be expensive and time consuming. To demonstrate the effectiveness of multi-reference evaluation, we augment the test set of DailyDialog with multiple references. A series of experiments show that the use of multiple references results in improved correlation between several automatic metrics and human judgement for both the quality and the diversity of system output. △ Less

Submitted 8 September, 2019; v1 submitted 24 July, 2019; originally announced July 2019.

Comments: SIGDIAL 2019

arXiv:1906.03168 [pdf, other]

doi 10.1371/journal.pone.0241687

Predicting risk of dyslexia with an online gamified test

Authors: Luz Rello, Ricardo Baeza-Yates, Abdullah Ali, Jeffrey P. Bigham, Miquel Serra

Abstract: Dyslexia is a specific learning disorder related to school failure. Detection is both crucial and challenging, especially in languages with transparent orthographies, such as Spanish. To make detecting dyslexia easier, we designed an online gamified test and a predictive machine learning model. In a study with more than 3,600 participants, our model correctly detected over 80% of the participants… ▽ More Dyslexia is a specific learning disorder related to school failure. Detection is both crucial and challenging, especially in languages with transparent orthographies, such as Spanish. To make detecting dyslexia easier, we designed an online gamified test and a predictive machine learning model. In a study with more than 3,600 participants, our model correctly detected over 80% of the participants with dyslexia. To check the robustness of the method we tested our method using a new data set with over 1,300 participants with age customized tests in a different environment -- a tablet instead of a desktop computer -- reaching a recall of over 72% for the class with dyslexia for children 9 years old or older. Our work shows that dyslexia can be screened using a machine learning approach. An online screening tool based on our methods has already been used by more than 200,000 people. △ Less

Submitted 9 December, 2019; v1 submitted 7 June, 2019; originally announced June 2019.

arXiv:1903.07032 [pdf, other]

doi 10.1145/3308558.3313716

TurkScanner: Predicting the Hourly Wage of Microtasks

Authors: Susumu Saito, Chun-Wei Chiang, Saiph Savage, Teppei Nakano, Tetsunori Kobayashi, Jeffrey Bigham

Abstract: Workers in crowd markets struggle to earn a living. One reason for this is that it is difficult for workers to accurately gauge the hourly wages of microtasks, and they consequently end up performing labor with little pay. In general, workers are provided with little information about tasks, and are left to rely on noisy signals, such as textual description of the task or rating of the requester.… ▽ More Workers in crowd markets struggle to earn a living. One reason for this is that it is difficult for workers to accurately gauge the hourly wages of microtasks, and they consequently end up performing labor with little pay. In general, workers are provided with little information about tasks, and are left to rely on noisy signals, such as textual description of the task or rating of the requester. This study explores various computational methods for predicting the working times (and thus hourly wages) required for tasks based on data collected from other workers completing crowd work. We provide the following contributions. (i) A data collection method for gathering real-world training data on crowd-work tasks and the times required for workers to complete them; (ii) TurkScanner: a machine learning approach that predicts the necessary working time to complete a task (and can thus implicitly provide the expected hourly wage). We collected 9,155 data records using a web browser extension installed by 84 Amazon Mechanical Turk workers, and explored the challenge of accurately recording working times both automatically and by asking workers. TurkScanner was created using ~150 derived features, and was able to predict the hourly wages of 69.6% of all the tested microtasks within a 75% error. Directions for future research include observing the effects of tools on people's working practices, adapting this approach to a requester tool for better price setting, and predicting other elements of work (e.g., the acceptance likelihood and worker task preferences.) △ Less

Submitted 17 March, 2019; originally announced March 2019.

Comments: Proceedings of the 28th International Conference on World Wide Web (WWW '19), San Francisco, CA, USA, May 13-17, 2019

arXiv:1802.08218 [pdf, other]

VizWiz Grand Challenge: Answering Visual Questions from Blind People

Authors: Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, Jeffrey P. Bigham

Abstract: The study of algorithms to automatically answer visual questions currently is motivated by visual question answering (VQA) datasets constructed in artificial VQA settings. We propose VizWiz, the first goal-oriented VQA dataset arising from a natural VQA setting. VizWiz consists of over 31,000 visual questions originating from blind people who each took a picture using a mobile phone and recorded a… ▽ More The study of algorithms to automatically answer visual questions currently is motivated by visual question answering (VQA) datasets constructed in artificial VQA settings. We propose VizWiz, the first goal-oriented VQA dataset arising from a natural VQA setting. VizWiz consists of over 31,000 visual questions originating from blind people who each took a picture using a mobile phone and recorded a spoken question about it, together with 10 crowdsourced answers per visual question. VizWiz differs from the many existing VQA datasets because (1) images are captured by blind photographers and so are often poor quality, (2) questions are spoken and so are more conversational, and (3) often visual questions cannot be answered. Evaluation of modern algorithms for answering visual questions and deciding if a visual question is answerable reveals that VizWiz is a challenging dataset. We introduce this dataset to encourage a larger community to develop more generalized algorithms that can assist blind people. △ Less

Submitted 9 May, 2018; v1 submitted 22 February, 2018; originally announced February 2018.

arXiv:1801.02668 [pdf, other]

doi 10.1145/3173574.3173869

Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time

Authors: Ting-Hao 'Kenneth' Huang, Joseph Chee Chang, Jeffrey P. Bigham

Abstract: Crowd-powered conversational assistants have been shown to be more robust than automated systems, but do so at the cost of higher response latency and monetary costs. A promising direction is to combine the two approaches for high quality, low latency, and low cost solutions. In this paper, we introduce Evorus, a crowd-powered conversational assistant built to automate itself over time by (i) allo… ▽ More Crowd-powered conversational assistants have been shown to be more robust than automated systems, but do so at the cost of higher response latency and monetary costs. A promising direction is to combine the two approaches for high quality, low latency, and low cost solutions. In this paper, we introduce Evorus, a crowd-powered conversational assistant built to automate itself over time by (i) allowing new chatbots to be easily integrated to automate more scenarios, (ii) reusing prior crowd answers, and (iii) learning to automatically approve response candidates. Our 5-month-long deployment with 80 participants and 281 conversations shows that Evorus can automate itself without compromising conversation quality. Crowd-AI architectures have long been proposed as a way to reduce cost and latency for crowd-powered systems; Evorus demonstrates how automation can be introduced successfully in a deployed system. Its architecture allows future researchers to make further innovation on the underlying automated components in the context of a deployed open domain dialog system. △ Less

Submitted 9 January, 2018; v1 submitted 8 January, 2018; originally announced January 2018.

Comments: 10 pages. To appear in the Proceedings of the Conference on Human Factors in Computing Systems 2018 (CHI'18)

ACM Class: H.5.m

arXiv:1712.05796 [pdf]

A Data-Driven Analysis of Workers' Earnings on Amazon Mechanical Turk

Authors: Kotaro Hara, Abi Adams, Kristy Milland, Saiph Savage, Chris Callison-Burch, Jeffrey Bigham

Abstract: A growing number of people are working as part of on-line crowd work, which has been characterized by its low wages; yet, we know little about wage distribution and causes of low/high earnings. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~\$2/h, and only 4% earned more than \$7.2… ▽ More A growing number of people are working as part of on-line crowd work, which has been characterized by its low wages; yet, we know little about wage distribution and causes of low/high earnings. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~\$2/h, and only 4% earned more than \$7.25/h. The average requester pays more than \$11/h, although lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is included in our wage calculations, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs future platform design and worker tools to create a more positive future for crowd work. △ Less

Submitted 28 December, 2017; v1 submitted 14 December, 2017; originally announced December 2017.

Comments: Conditionally accepted for inclusion in the 2018 ACM Conference on Human Factors in Computing Systems (CHI'18) Papers program

arXiv:1708.03044 [pdf, other]

"Is there anything else I can help you with?": Challenges in Deploying an On-Demand Crowd-Powered Conversational Agent

Authors: Ting-Hao Kenneth Huang, Walter S. Lasecki, Amos Azaria, Jeffrey P. Bigham

Abstract: Intelligent conversational assistants, such as Apple's Siri, Microsoft's Cortana, and Amazon's Echo, have quickly become a part of our digital life. However, these assistants have major limitations, which prevents users from conversing with them as they would with human dialog partners. This limits our ability to observe how users really want to interact with the underlying system. To address this… ▽ More Intelligent conversational assistants, such as Apple's Siri, Microsoft's Cortana, and Amazon's Echo, have quickly become a part of our digital life. However, these assistants have major limitations, which prevents users from conversing with them as they would with human dialog partners. This limits our ability to observe how users really want to interact with the underlying system. To address this problem, we developed a crowd-powered conversational assistant, Chorus, and deployed it to see how users and workers would interact together when mediated by the system. Chorus sophisticatedly converses with end users over time by recruiting workers on demand, which in turn decide what might be the best response for each user sentence. Up to the first month of our deployment, 59 users have held conversations with Chorus during 320 conversational sessions. In this paper, we present an account of Chorus' deployment, with a focus on four challenges: (i) identifying when conversations are over, (ii) malicious users and workers, (iii) on-demand recruiting, and (iv) settings in which consensus is not enough. Our observations could assist the deployment of crowd-powered conversation systems and crowd-powered systems in general. △ Less

Submitted 9 August, 2017; originally announced August 2017.

Comments: 10 pages. In Proceedings of Conference on Human Computation & Crowdsourcing (HCOMP 2016), 2016, Austin, TX, USA

arXiv:1704.03627 [pdf, other]

Real-time On-Demand Crowd-powered Entity Extraction

Authors: Ting-Hao 'Kenneth' Huang, Yun-Nung Chen, Jeffrey P. Bigham

Abstract: Output-agreement mechanisms such as ESP Game have been widely used in human computation to obtain reliable human-generated labels. In this paper, we argue that a "time-limited" output-agreement mechanism can be used to create a fast and robust crowd-powered component in interactive systems, particularly dialogue systems, to extract key information from user utterances on the fly. Our experiments o… ▽ More Output-agreement mechanisms such as ESP Game have been widely used in human computation to obtain reliable human-generated labels. In this paper, we argue that a "time-limited" output-agreement mechanism can be used to create a fast and robust crowd-powered component in interactive systems, particularly dialogue systems, to extract key information from user utterances on the fly. Our experiments on Amazon Mechanical Turk using the Airline Travel Information System (ATIS) dataset showed that the proposed approach achieves high-quality results with an average response time shorter than 9 seconds. △ Less

Submitted 6 December, 2017; v1 submitted 12 April, 2017; originally announced April 2017.

Comments: Accepted by the 5th Edition Of The Collective Intelligence Conference (CI 2017) as an oral presentation. Interface code and data are available at: https://github.com/windx0303/dialogue-esp-game

arXiv:1508.02982 [pdf]

WearWrite: Orchestrating the Crowd to Complete Complex Tasks from Wearables (We Wrote This Paper on a Watch)

Authors: Michael Nebeling, Anhong Guo, Kyle Murray, Annika Tostengard, Angelos Giannopoulos, Martin Mihajlov, Steven Dow, Jaime Teevan, Jeffrey P. Bigham

Abstract: In this paper we introduce a paradigm for completing complex tasks from wearable devices by leveraging crowdsourcing, and demonstrate its validity for academic writing. We explore this paradigm using a collaborative authoring system, called WearWrite, which is designed to enable authors and crowd workers to work together using an Android smartwatch and Google Docs to produce academic papers, inclu… ▽ More In this paper we introduce a paradigm for completing complex tasks from wearable devices by leveraging crowdsourcing, and demonstrate its validity for academic writing. We explore this paradigm using a collaborative authoring system, called WearWrite, which is designed to enable authors and crowd workers to work together using an Android smartwatch and Google Docs to produce academic papers, including this one. WearWrite allows expert authors who do not have access to large devices to contribute bits of expertise and big picture direction from their watch, while freeing them of the obligation of integrating their contributions into the overall document. Crowd workers on desktop computers actually write the document. We used this approach to write several simple papers, and found it was effective at producing reasonable drafts. However, the workers often needed more structure and the authors more context. WearWrite addresses these issues by focusing workers on specific tasks and providing select context to authors on the watch. We demonstrate the system's feasibility by writing this paper using it. △ Less

Submitted 25 July, 2015; originally announced August 2015.

Showing 1–50 of 55 results for author: Bigham, J