-
Generating Educational Materials with Different Levels of Readability using LLMs
Authors:
Chieh-Yang Huang,
Jing Wei,
Ting-Hao 'Kenneth' Huang
Abstract:
This study introduces the leveled-text generation task, aiming to rewrite educational materials to specific readability levels while preserving meaning. We assess the capability of GPT-3.5, LLaMA-2 70B, and Mixtral 8x7B to generate content at various readability levels through zero-shot and few-shot prompting. Evaluating 100 processed educational materials reveals that few-shot prompting significantly improves performance in readability manipulation and information preservation. LLaMA-2 70B performs better in achieving the desired difficulty range, while GPT-3.5 maintains the original meaning. However, manual inspection highlights concerns such as misinformation introduction and inconsistent edit distribution. These findings emphasize the need for further research to ensure the quality of generated educational content.
Submitted 18 June, 2024;
originally announced June 2024.
-
How Does Conversation Length Impact User's Satisfaction? A Case Study of Length-Controlled Conversations with LLM-Powered Chatbots
Authors:
Shih-Hong Huang,
Ya-Fang Lin,
Zeyu He,
Chieh-Yang Huang,
Ting-Hao 'Kenneth' Huang
Abstract:
Users can discuss a wide range of topics with large language models (LLMs), but they do not always prefer solving problems or getting information through lengthy conversations. This raises an intriguing HCI question: How does instructing LLMs to engage in longer or shorter conversations affect conversation quality? In this paper, we developed two Slack chatbots using GPT-4 with the ability to vary conversation lengths and conducted a user study. Participants asked the chatbots both highly and less conversable questions, engaging in dialogues with 0, 3, 5, and 7 conversational turns. We found that the conversation quality does not differ drastically across different conditions, while participants had mixed reactions. Our study demonstrates LLMs' ability to change conversation length and the potential benefits for users resulting from such changes, but we caution that changes in text form may not necessarily imply changes in quality or content.
Submitted 25 April, 2024;
originally announced April 2024.
-
SciCapenter: Supporting Caption Composition for Scientific Figures with Machine-Generated Captions and Ratings
Authors:
Ting-Yao Hsu,
Chieh-Yang Huang,
Shih-Hong Huang,
Ryan Rossi,
Sungchul Kim,
Tong Yu,
C. Lee Giles,
Ting-Hao K. Huang
Abstract:
Crafting effective captions for figures is important. Readers heavily depend on these captions to grasp the figure's message. However, despite a well-developed set of AI technologies for figures and captions, these have rarely been tested for usefulness in aiding caption writing. This paper introduces SciCapenter, an interactive system that puts together cutting-edge AI technologies for scientific figure captions to aid caption composition. SciCapenter generates a variety of captions for each figure in a scholarly article, providing scores and a comprehensive checklist to assess caption quality across multiple critical aspects, such as helpfulness, OCR mention, key takeaways, and visual properties reference. Users can directly edit captions in SciCapenter, resubmit for revised evaluations, and iteratively refine them. A user study with Ph.D. students indicates that SciCapenter significantly lowers the cognitive load of caption writing. Participants' feedback further offers valuable design insights for future systems aiming to enhance caption writing.
Submitted 26 March, 2024;
originally announced March 2024.
-
If in a Crowdsourced Data Annotation Pipeline, a GPT-4
Authors:
Zeyu He,
Chieh-Yang Huang,
Chien-Kuang Cornelia Ding,
Shaurya Rohatgi,
Ting-Hao 'Kenneth' Huang
Abstract:
Recent studies indicated GPT-4 outperforms online crowd workers in data labeling accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies were criticized for deviating from standard crowdsourcing practices and emphasizing individual workers' performances over the whole data-annotation process. This paper compared GPT-4 and an ethical and well-executed MTurk pipeline, with 415 workers labeling 3,177 sentence segments from 200 scholarly articles using the CODA-19 scheme. Two worker interfaces yielded 127,080 labels, which were then used to infer the final labels through eight label-aggregation algorithms. Our evaluation showed that despite best practices, the MTurk pipeline's highest accuracy was 81.5%, whereas GPT-4 achieved 83.6%. Interestingly, when combining GPT-4's labels with crowd labels collected via an advanced worker interface for aggregation, 2 out of the 8 algorithms achieved an even higher accuracy (87.5%, 87.0%). Further analysis suggested that, when the crowd's and GPT-4's labeling strengths are complementary, aggregating them could increase labeling accuracy.
Submitted 28 June, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Inspo: Writing Stories with a Flock of AIs and Humans
Authors:
Chieh-Yang Huang,
Sanjana Gautam,
Shannon McClellan Brooks,
Ya-Fang Lin,
Ting-Hao 'Kenneth' Huang
Abstract:
Large Language Models (LLMs) have advanced automated writing assistance, enabling complex tasks like co-writing novels and poems. However, real-world writing typically requires various support and collaboration across stages and scenarios. Existing research mainly examines how writers utilize single text generators, neglecting this broader context. This paper introduces Inspo, a web-based editor that incorporates various text generators and online crowd workers. Through a three-phase user study, we examine writers' interactions with Inspo for novel writing. Quantitative analyses of writing logs highlight changes in participants' writing progress and the influence of various text-generation models. Complementing this with qualitative insights from semi-structured interviews, we illustrate participants' perceptions of these models and the crowd. Based on the findings, we provide design recommendations for the next generation of intelligent writing tools and discuss the potential sociocultural implications of integrating AI and human input in the writing process.
Submitted 12 March, 2024; v1 submitted 28 November, 2023;
originally announced November 2023.
-
GPT-4 as an Effective Zero-Shot Evaluator for Scientific Figure Captions
Authors:
Ting-Yao Hsu,
Chieh-Yang Huang,
Ryan Rossi,
Sungchul Kim,
C. Lee Giles,
Ting-Hao K. Huang
Abstract:
There is growing interest in systems that generate captions for scientific figures. However, assessing these systems' output poses a significant challenge. Human evaluation requires academic expertise and is costly, while automatic evaluation depends on often low-quality author-written captions. This paper investigates using large language models (LLMs) as a cost-effective, reference-free method for evaluating figure captions. We first constructed SCICAP-EVAL, a human evaluation dataset that contains human judgments for 3,600 scientific figure captions, both original and machine-made, for 600 arXiv figures. We then prompted LLMs like GPT-4 and GPT-3 to score (1-6) each caption based on its potential to aid reader understanding, given relevant context such as figure-mentioning paragraphs. Results show that GPT-4, used as a zero-shot evaluator, outperformed all other models and even surpassed assessments made by Computer Science and Informatics undergraduates, achieving a Kendall correlation score of 0.401 with Ph.D. students' rankings.
Submitted 23 October, 2023;
originally announced October 2023.
-
Location-Aware Visual Question Generation with Lightweight Models
Authors:
Nicholas Collin Suwono,
Justin Chih-Yao Chen,
Tun Min Hung,
Ting-Hao Kenneth Huang,
I-Bin Liao,
Yung-Hui Li,
Lun-Wei Ku,
Shao-Hua Sun
Abstract:
This work introduces a novel task, location-aware visual question generation (LocaVQG), which aims to generate engaging questions from data relevant to a particular geographical location. Specifically, we represent such location-aware information with surrounding images and a GPS coordinate. To tackle this task, we present a dataset generation pipeline that leverages GPT-4 to produce diverse and sophisticated questions. Then, we aim to learn a lightweight model that can address the LocaVQG task and fit on an edge device, such as a mobile phone. To this end, we propose a method which can reliably generate engaging questions from location-aware information. Our proposed method outperforms baselines regarding human evaluation (e.g., engagement, grounding, coherence) and automatic evaluation metrics (e.g., BERTScore, ROUGE-2). Moreover, we conduct extensive ablation studies to justify our proposed techniques for both generating the dataset and solving the task.
Submitted 23 October, 2023;
originally announced October 2023.
-
Automated Layout Design and Control of Robust Cooperative Grasped-Load Aerial Transportation Systems
Authors:
Carlo Bosio,
Jerry Tang,
Ting-Hao Wang,
Mark W. Mueller
Abstract:
We present a novel approach to cooperative aerial transportation through a team of drones, using optimal control theory and a hierarchical control strategy. We assume the drones are connected to the payload through rigid attachments, essentially transforming the whole system into a larger flying object with "thrust modules" at the attachment locations of the drones. We investigate the optimal arrangement of the thrust modules around the payload, so that the resulting system is robust to disturbances. We choose the $\mathcal{H}_2$ norm as a measure of robustness, and propose an iterative optimization routine to compute the optimal layout of the vehicles around the object. We experimentally validate our approach using four drones and comparing the disturbance rejection performances achieved by two different layouts (the optimal one and a sub-optimal one), and observe that the results match our predictions.
Submitted 28 February, 2024; v1 submitted 11 October, 2023;
originally announced October 2023.
-
Unmasking Nationality Bias: A Study of Human Perception of Nationalities in AI-Generated Articles
Authors:
Pranav Narayanan Venkit,
Sanjana Gautam,
Ruchi Panchanadikar,
Ting-Hao 'Kenneth' Huang,
Shomir Wilson
Abstract:
We investigate the potential for nationality biases in natural language processing (NLP) models using human evaluation methods. Biased NLP models can perpetuate stereotypes and lead to algorithmic discrimination, posing a significant challenge to the fairness and justice of AI systems. Our study employs a two-step mixed-methods approach that includes both quantitative and qualitative analysis to identify and understand the impact of nationality bias in a text generation model. Through our human-centered quantitative analysis, we measure the extent of nationality bias in articles generated by AI sources. We then conduct open-ended interviews with participants, performing qualitative coding and thematic analysis to understand the implications of these biases on human readers. Our findings reveal that biased NLP models tend to replicate and amplify existing societal biases, which can translate to harm if used in a sociotechnical setting. The qualitative analysis from our interviews offers insights into the experience readers have when encountering such articles, highlighting the potential to shift a reader's perception of a country. These findings emphasize the critical role of public perception in shaping AI's impact on society and the need to correct biases in AI systems.
Submitted 8 August, 2023;
originally announced August 2023.
-
Good Data, Large Data, or No Data? Comparing Three Approaches in Developing Research Aspect Classifiers for Biomedical Papers
Authors:
Shreya Chandrasekhar,
Chieh-Yang Huang,
Ting-Hao 'Kenneth' Huang
Abstract:
The rapid growth of scientific publications, particularly during the COVID-19 pandemic, emphasizes the need for tools to help researchers efficiently comprehend the latest advancements. One essential part of understanding scientific literature is research aspect classification, which categorizes sentences in abstracts into Background, Purpose, Method, and Finding. In this study, we investigate the impact of different datasets on model performance for the crowd-annotated CODA-19 research aspect classification task. Specifically, we explore the potential benefits of using the large, automatically curated PubMed 200K RCT dataset and evaluate the effectiveness of large language models (LLMs), such as LLaMA, GPT-3, ChatGPT, and GPT-4. Our results indicate that using the PubMed 200K RCT dataset does not improve performance for the CODA-19 task. We also observe that while GPT-4 performs well, it does not outperform the SciBERT model fine-tuned on the CODA-19 dataset, emphasizing the importance of a dedicated, task-aligned dataset for the target task. Our code is available at https://github.com/Crowd-AI-Lab/CODA-19-exp.
Submitted 7 June, 2023;
originally announced June 2023.
-
ConvXAI: Delivering Heterogeneous AI Explanations via Conversations to Support Human-AI Scientific Writing
Authors:
Hua Shen,
Chieh-Yang Huang,
Tongshuang Wu,
Ting-Hao 'Kenneth' Huang
Abstract:
Despite a surge in XAI methods, users still struggle to obtain the AI explanations they need. Previous research suggests chatbots as dynamic solutions, but the effective design of conversational XAI agents for practical human needs remains under-explored. This paper focuses on Conversational XAI for AI-assisted scientific writing tasks. Drawing from human linguistic theories and formative studies, we identify four design rationales: "multifaceted", "controllability", "mix-initiative", "context-aware drill-down". We incorporate them into an interactive prototype, ConvXAI, which facilitates heterogeneous AI explanations for scientific writing through dialogue. In two studies with 21 users, ConvXAI outperforms a GUI-based baseline on improving human-perceived understanding and writing improvement. The paper further discusses the practical human usage patterns in interacting with ConvXAI for scientific co-writing.
Submitted 27 October, 2023; v1 submitted 16 May, 2023;
originally announced May 2023.
-
Does Human Collaboration Enhance the Accuracy of Identifying LLM-Generated Deepfake Texts?
Authors:
Adaku Uchendu,
Jooyoung Lee,
Hua Shen,
Thai Le,
Ting-Hao 'Kenneth' Huang,
Dongwon Lee
Abstract:
Advances in Large Language Models (e.g., GPT-4, LLaMA) have improved the generation of coherent sentences resembling human writing on a large scale, resulting in the creation of so-called deepfake texts. However, this progress poses security and privacy concerns, necessitating effective solutions for distinguishing deepfake texts from human-written ones. Although prior works studied humans' ability to detect deepfake texts, none has examined whether "collaboration" among humans improves the detection of deepfake texts. In this study, to address this gap of understanding on deepfake texts, we conducted experiments with two groups: (1) nonexpert individuals from the AMT platform and (2) writing experts from the Upwork platform. The results demonstrate that collaboration among humans can potentially improve the detection of deepfake texts for both groups, increasing detection accuracies by 6.36% for non-experts and 12.76% for experts, respectively, compared to individuals' detection accuracies. We further analyze the explanations that humans used for detecting a piece of text as deepfake text, and find that the strongest indicator of deepfake texts is their lack of coherence and consistency. Our study provides useful insights for future tools and framework designs to facilitate the collaborative human detection of deepfake texts. The experiment datasets and AMT implementations are available at: https://github.com/huashen218/llm-deepfake-human-study.git
Submitted 9 October, 2023; v1 submitted 3 April, 2023;
originally announced April 2023.
-
What Types of Questions Require Conversation to Answer? A Case Study of AskReddit Questions
Authors:
Shih-Hong Huang,
Chieh-Yang Huang,
Ya-Fang Lin,
Ting-Hao 'Kenneth' Huang
Abstract:
The proliferation of automated conversational systems such as chatbots, spoken-dialogue systems, and smart speakers, has significantly impacted modern digital life. However, these systems are primarily designed to provide answers to well-defined questions rather than to support users in exploring complex, ill-defined questions. In this paper, we aim to push the boundaries of conversational systems by examining the types of nebulous, open-ended questions that can best be answered through conversation. We first sampled 500 questions from one million open-ended requests posted on AskReddit, and then recruited online crowd workers to answer eight inquiries about these questions. We also performed open coding to categorize the questions into 27 different domains. We found that the issues people believe require conversation to resolve satisfactorily are highly social and personal. Our work provides insights into how future research could be geared to align with users' needs.
Submitted 3 April, 2023; v1 submitted 30 March, 2023;
originally announced March 2023.
-
Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization
Authors:
Chieh-Yang Huang,
Ting-Yao Hsu,
Ryan Rossi,
Ani Nenkova,
Sungchul Kim,
Gromit Yeuk-Yin Chan,
Eunyee Koh,
Clyde Lee Giles,
Ting-Hao 'Kenneth' Huang
Abstract:
Good figure captions help paper readers understand complex scientific figures. Unfortunately, even published papers often have poorly written captions. Automatic caption generation could aid paper writers by providing good starting captions that can be refined for better quality. Prior work often treated figure caption generation as a vision-to-language task. In this paper, we show that it can be more effectively tackled as a text summarization task in scientific documents. We fine-tuned PEGASUS, a pre-trained abstractive summarization model, to specifically summarize figure-referencing paragraphs (e.g., "Figure 3 shows...") into figure captions. Experiments on large-scale arXiv figures show that our method outperforms prior vision methods in both automatic and human evaluations. We further conducted an in-depth investigation focused on two key challenges: (i) the common presence of low-quality author-written captions and (ii) the lack of clear standards for good captions. Our code and data are available at: https://github.com/Crowd-AI-Lab/Generating-Figure-Captions-as-a-Text-Summarization-Task.
Submitted 11 August, 2023; v1 submitted 23 February, 2023;
originally announced February 2023.
-
Conveying the Predicted Future to Users: A Case Study of Story Plot Prediction
Authors:
Chieh-Yang Huang,
Saniya Naphade,
Kavya Laalasa Karanam,
Ting-Hao 'Kenneth' Huang
Abstract:
Creative writing is hard: Novelists struggle with writer's block daily. While automatic story generation has advanced recently, it is treated as a "toy task" for advancing artificial intelligence rather than helping people. In this paper, we create a system that produces a short description that narrates a predicted plot using existing story generation approaches. Our goal is to assist writers in crafting a consistent and compelling story arc. We conducted experiments on Amazon Mechanical Turk (AMT) to examine the quality of the generated story plots in terms of consistency and storiability. The results show that short descriptions produced by our frame-enhanced GPT-2 (FGPT-2) were rated as the most consistent and storiable among all models; FGPT-2's outputs even beat some random story snippets written by humans. Next, we conducted a preliminary user study using a story continuation task where AMT workers were given access to machine-generated story plots and asked to write a follow-up story. FGPT-2 could positively affect the writing process, though people favor other baselines more. Our study sheds some light on the possibilities of future creative writing support systems beyond the scope of completing sentences. Our code is available at: https://github.com/appleternity/Story-Plot-Generation.
Submitted 17 February, 2023;
originally announced February 2023.
-
Nationality Bias in Text Generation
Authors:
Pranav Narayanan Venkit,
Sanjana Gautam,
Ruchi Panchanadikar,
Ting-Hao 'Kenneth' Huang,
Shomir Wilson
Abstract:
Little attention is placed on analyzing nationality bias in language models, especially when nationality is highly used as a factor in increasing the performance of social NLP models. This paper examines how a text generation model, GPT-2, accentuates pre-existing societal biases about country-based demonyms. We generate stories using GPT-2 for various nationalities and use sensitivity analysis to explore how the number of internet users and the country's economic status impact the sentiment of the stories. To reduce the propagation of biases through large language models (LLMs), we explore the debiasing method of adversarial triggering. Our results show that GPT-2 demonstrates significant bias against countries with fewer internet users, and adversarial triggering effectively reduces this bias.
Submitted 14 February, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
Too Slow to Be Useful? On Incorporating Humans in the Loop of Smart Speakers
Authors:
Shih-Hong Huang,
Chieh-Yang Huang,
Yuxin Deng,
Hua Shen,
Szu-Chi Kuan,
Ting-Hao 'Kenneth' Huang
Abstract:
Real-time crowd-powered systems, such as Chorus/Evorus, VizWiz, and Apparition, have shown how incorporating humans into automated systems could supplement where the automatic solutions fall short. However, one unspoken bottleneck of applying such architectures to more scenarios is the longer latency of including humans in the loop of automated systems. For the applications that have hard constraints in turnaround times, human-operated components' longer latency and large speed variation seem to be apparent deal breakers. This paper explicates and quantifies these limitations by using a human-powered text-based backend to hold conversations with users through a voice-only smart speaker. Smart speakers must respond to users' requests within seconds, so the workers behind the scenes only have a few seconds to compose answers. We measured the end-to-end system latency and the conversation quality with eight pairs of participants, showing the challenges and superiority of such systems.
Submitted 7 December, 2022;
originally announced December 2022.
-
Multi-VQG: Generating Engaging Questions for Multiple Images
Authors:
Min-Hsuan Yeh,
Vicent Chen,
Ting-Hao 'Kenneth' Huang,
Lun-Wei Ku
Abstract:
Generating engaging content has drawn much recent attention in the NLP community. Asking questions is a natural way to respond to photos and promote awareness. However, most answers to questions in traditional question-answering (QA) datasets are factoids, which reduce individuals' willingness to answer. Furthermore, traditional visual question generation (VQG) confines the source data for question generation to single images, resulting in a limited ability to comprehend time-series information of the underlying event. In this paper, we propose generating engaging questions from multiple images. We present MVQG, a new dataset, and establish a series of baselines, including both end-to-end and dual-stage architectures. Results show that building stories behind the image sequence enables models to generate engaging questions, which confirms our assumption that people typically construct a picture of the event in their minds before asking questions. These results open up an exciting challenge for visual-and-language models to implicitly construct a story behind a series of photos to allow for creativity and experience sharing and hence draw attention to downstream applications.
Submitted 17 November, 2022; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Let's Talk! Striking Up Conversations via Conversational Visual Question Generation
Authors:
Shih-Han Chan,
Tsai-Lun Yang,
Yun-Wei Chu,
Chi-Yang Hsu,
Ting-Hao Huang,
Yu-Shian Chiu,
Lun-Wei Ku
Abstract:
An engaging and provocative question can open up a great conversation. In this work, we explore a novel scenario: a conversation agent views a set of the user's photos (for example, from social media platforms) and asks an engaging question to initiate a conversation with the user. The existing vision-to-question models mostly generate tedious and obvious questions, which might not be ideal conversation starters. This paper introduces a two-phase framework that first generates a visual story for the photo set and then uses the story to produce an interesting question. The human evaluation shows that our framework generates more response-provoking questions for starting conversations than other vision-to-question baselines.
Submitted 19 May, 2022;
originally announced May 2022.
-
Empathy-Centric Design At Scale
Authors:
Andrea Mauri,
Yen-Chia Hsu,
Marco Brambilla,
Aisling Ann O'Kane,
Ting-Hao 'Kenneth' Huang,
Himanshu Verma
Abstract:
EmpathiCH aims at bringing together and blending different expertise to develop a new research agenda in the context of "Empathy-Centric Design at Scale". The main research question is to investigate how new technologies can contribute to the elicitation of empathy across and within multiple stakeholders at scale; and how empathy can be used to design solutions to societal problems that are not only effective but also balanced, inclusive, and aware of their effect on society. Through presentations, participatory sessions, and a living experiment -- where data about people's interactions is collected throughout the event -- we aim to make this workshop the ideal venue to foster collaboration, build networks, and shape the future direction of "Empathy-Centric Design at Scale".
Submitted 13 April, 2022;
originally announced April 2022.
-
Are Shortest Rationales the Best Explanations for Human Understanding?
Authors:
Hua Shen,
Tongshuang Wu,
Wenbo Guo,
Ting-Hao 'Kenneth' Huang
Abstract:
Existing self-explaining models typically favor extracting the shortest possible rationales - snippets of an input text "responsible for" corresponding output - to explain the model prediction, with the assumption that shorter rationales are more intuitive to humans. However, this assumption has yet to be validated. Is the shortest rationale indeed the most human-understandable? To answer this question, we design a self-explaining model, LimitedInk, which allows users to extract rationales at any target length. Compared to existing baselines, LimitedInk achieves compatible end-task performance and human-annotated rationale agreement, making it a suitable representation of the recent class of self-explaining models. We use LimitedInk to conduct a user study on the impact of rationale length, where we ask human judges to predict the sentiment label of documents based only on LimitedInk-generated rationales with different lengths. We show rationales that are too short do not help humans predict labels better than randomly masked text, suggesting the need for more careful design of the best human rationales.
Submitted 16 March, 2022;
originally announced March 2022.
-
SciCap: Generating Captions for Scientific Figures
Authors:
Ting-Yao Hsu,
C. Lee Giles,
Ting-Hao 'Kenneth' Huang
Abstract:
Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions commonly occur in scientific articles and may decrease understanding. In this paper, we propose an end-to-end neural framework to automatically generate informative, high-quality captions for scientific figures. To this end, we introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. After pre-processing - including figure-type classification, sub-figure identification, text normalization, and caption text selection - SCICAP contained more than two million figures extracted from over 290,000 papers. We then established baseline models that caption graph plots, the dominant (19.2%) figure type. The experimental results showed both opportunities and steep challenges of generating captions for scientific figures.
Submitted 25 October, 2021; v1 submitted 22 October, 2021;
originally announced October 2021.
-
Empowering Local Communities Using Artificial Intelligence
Authors:
Yen-Chia Hsu,
Ting-Hao 'Kenneth' Huang,
Himanshu Verma,
Andrea Mauri,
Illah Nourbakhsh,
Alessandro Bozzon
Abstract:
Artificial Intelligence (AI) is increasingly used to analyze large amounts of data in various practices, such as object recognition. We are specifically interested in using AI-powered systems to engage local communities in developing plans or solutions for pressing societal and environmental concerns. Such local contexts often involve multiple stakeholders with different and even contradictory agendas, resulting in mismatched expectations of these systems' behaviors and desired outcomes. There is a need to investigate if AI models and pipelines can work as expected in different contexts through co-creation and field deployment. Based on case studies in co-creating AI-powered systems with local people, we explain challenges that require more attention and provide viable paths to bridge AI research with citizen needs. We advocate for developing new collaboration approaches and mindsets that are needed to co-create AI-powered systems in multi-stakeholder contexts to address local concerns.
Submitted 26 April, 2022; v1 submitted 5 October, 2021;
originally announced October 2021.
-
FinQA: A Dataset of Numerical Reasoning over Financial Data
Authors:
Zhiyu Chen,
Wenhu Chen,
Charese Smiley,
Sameena Shah,
Iana Borova,
Dylan Langdon,
Reema Moussa,
Matt Beane,
Ting-Hao Huang,
Bryan Routledge,
William Yang Wang
Abstract:
The sheer volume of financial statements makes it difficult for humans to access and analyze a business's financials. Robust numerical reasoning likewise faces unique challenges in this domain. In this work, we focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks in the general domain, the finance domain includes complex numerical reasoning and understanding of heterogeneous representations. To facilitate analytical progress, we propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts. We also annotate the gold reasoning programs to ensure full explainability. We further introduce baselines and conduct comprehensive experiments on our dataset. The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge and in complex multi-step numerical reasoning on that knowledge. Our dataset -- the first of its kind -- should therefore enable significant, new community research into complex application domains. The dataset and code are publicly available at https://github.com/czyssrs/FinQA.
Submitted 7 May, 2022; v1 submitted 31 August, 2021;
originally announced September 2021.
-
ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences
Authors:
Yanjun Gao,
Ting-hao Huang,
Rebecca J. Passonneau
Abstract:
Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves performance comparable to two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.
Submitted 22 June, 2021;
originally announced June 2021.
-
Plot and Rework: Modeling Storylines for Visual Storytelling
Authors:
Chi-Yang Hsu,
Yun-Wei Chu,
Ting-Hao 'Kenneth' Huang,
Lun-Wei Ku
Abstract:
Writing a coherent and engaging story is not easy. Creative writers use their knowledge and worldview to put disjointed elements together to form a coherent storyline, and work and rework iteratively toward perfection. Automated visual storytelling (VIST) models, however, make poor use of external knowledge and iterative generation when attempting to create stories. This paper introduces PR-VIST, a framework that represents the input image sequence as a story graph in which it finds the best path to form a storyline. PR-VIST then takes this path and learns to generate the final story via an iterative training process. This framework produces stories that are superior in terms of diversity, coherence, and humanness, per both automatic and human evaluations. An ablation study shows that both plotting and reworking contribute to the model's superiority.
Submitted 7 July, 2021; v1 submitted 14 May, 2021;
originally announced May 2021.
-
Semantic Frame Forecast
Authors:
Chieh-Yang Huang,
Ting-Hao 'Kenneth' Huang
Abstract:
This paper introduces semantic frame forecast, a task that predicts the semantic frames that will occur in the next 10, 100, or even 1,000 sentences in a running story. Prior work focused on predicting the immediate future of a story, such as one to a few sentences ahead. However, when novelists write long stories, generating a few sentences is not enough to help them gain high-level insight to develop the follow-up story. In this paper, we formulate a long story as a sequence of "story blocks," where each block contains a fixed number of sentences (e.g., 10, 100, or 200). This formulation allows us to predict the follow-up story arc beyond the scope of a few sentences. We represent a story block using the term frequencies (TF) of semantic frames in it, normalized by each frame's inverse document frequency (IDF). We conduct semantic frame forecast experiments on 4,794 books from the Bookcorpus and 7,962 scientific abstracts from CODA-19, with block sizes ranging from 5 to 1,000 sentences. The results show that automated models can forecast the follow-up story blocks better than the random, prior, and replay baselines, indicating the task's feasibility. We also learn that the models using the frame representation as features outperform all the existing approaches when the block size is over 150 sentences. The human evaluation also shows that the proposed frame representation, when visualized as word clouds, is comprehensible, representative, and specific to humans. Our code is available at https://github.com/appleternity/FrameForecasting.
Submitted 12 April, 2021;
originally announced April 2021.
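The block representation described in the Semantic Frame Forecast abstract (term frequencies of semantic frames, weighted by inverse document frequency) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the frame labels, the toy data, and the exact weighting (relative TF multiplied by log-IDF) are assumptions made for illustration only.

```python
import math
from collections import Counter

def idf_weights(blocks):
    """Compute an IDF weight per frame over a collection of story blocks.

    Each block is a list of frame names (hypothetical labels here).
    """
    n_blocks = len(blocks)
    doc_freq = Counter()
    for block in blocks:
        doc_freq.update(set(block))  # count each frame once per block
    return {frame: math.log(n_blocks / df) for frame, df in doc_freq.items()}

def block_vector(block, idf):
    """Represent one block as relative frame frequency times IDF."""
    tf = Counter(block)
    total = len(block)
    return {frame: (count / total) * idf.get(frame, 0.0)
            for frame, count in tf.items()}

# Toy story blocks; the frame names are illustrative only.
blocks = [
    ["Motion", "Arriving", "Motion"],
    ["Statement", "Motion"],
    ["Statement", "Questioning"],
]
idf = idf_weights(blocks)
vec = block_vector(blocks[0], idf)
```

Frames that occur in many blocks receive a low weight, so each block's vector emphasizes the frames that distinguish it from the rest of the story.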
-
Explaining the Road Not Taken
Authors:
Hua Shen,
Ting-Hao 'Kenneth' Huang
Abstract:
It is unclear if existing interpretations of deep neural network models respond effectively to the needs of users. This paper summarizes the common forms of explanations (such as feature attribution, decision rules, or probes) used in over 200 recent papers about natural language processing (NLP), and compares them against user questions collected in the XAI Question Bank. We found that although users are interested in explanations for the road not taken -- namely, why the model chose one result and not a well-defined, seemingly similar legitimate counterpart -- most model interpretations cannot answer these questions.
Submitted 30 March, 2021; v1 submitted 27 March, 2021;
originally announced March 2021.
-
Assessing the Helpfulness of Learning Materials with Inference-Based Learner-Like Agent
Authors:
Yun-Hsuan Jen,
Chieh-Yang Huang,
Mei-Hua Chen,
Ting-Hao 'Kenneth' Huang,
Lun-Wei Ku
Abstract:
Many English-as-a-second-language learners have trouble using near-synonym words (e.g., small vs. little; briefly vs. shortly) correctly, and often look for example sentences to learn how two nearly synonymous terms differ. Prior work uses hand-crafted scores to recommend sentences but has difficulty adapting such scores to all near-synonyms, as near-synonyms differ in various ways. We notice that the helpfulness of the learning material would be reflected in the learners' performance. Thus, we propose the inference-based learner-like agent to mimic learner behavior and identify good learning materials by examining the agent's performance. To enable the agent to behave like a learner, we leverage entailment modeling's capability of inferring answers from the provided materials. Experimental results show that the proposed agent is equipped with good learner-like behavior to achieve the best performance in both fill-in-the-blank (FITB) and good example sentence selection tasks. We further conduct a classroom user study with college ESL learners. The results of the user study show that the proposed agent can find example sentences that help students learn more easily and efficiently. Compared to other models, the proposed agent improves the score of more than 17% of students after learning.
Submitted 5 October, 2020;
originally announced October 2020.
-
How Useful Are the Machine-Generated Interpretations to General Users? A Human Evaluation on Guessing the Incorrectly Predicted Labels
Authors:
Hua Shen,
Ting-Hao Kenneth Huang
Abstract:
Explaining to users why automated systems make certain mistakes is important and challenging. Researchers have proposed ways to automatically produce interpretations for deep neural network models. However, it is unclear how useful these interpretations are in helping users figure out why they are getting an error. If an interpretation effectively explains to users how the underlying deep neural network model works, people who were presented with the interpretation should be better at predicting the model's outputs than those who were not. This paper presents an investigation on whether or not showing machine-generated visual interpretations helps users understand the incorrectly predicted labels produced by image classifiers. We showed the images and the correct labels to 150 online crowd workers and asked them to select the incorrectly predicted labels with or without showing them the machine-generated visual interpretations. The results demonstrated that displaying the visual interpretations did not increase, but rather decreased, the average guessing accuracy by roughly 10%.
Submitted 27 August, 2020; v1 submitted 26 August, 2020;
originally announced August 2020.
-
Project RISE: Recognizing Industrial Smoke Emissions
Authors:
Yen-Chia Hsu,
Ting-Hao 'Kenneth' Huang,
Ting-Yao Hu,
Paul Dille,
Sean Prendi,
Ryan Hoffman,
Anastasia Tsuhlares,
Jessica Pachuta,
Randy Sargent,
Illah Nourbakhsh
Abstract:
Industrial smoke emissions pose a significant concern to human health. Prior works have shown that using Computer Vision (CV) techniques to identify smoke as visual evidence can influence the attitude of regulators and empower citizens to pursue environmental justice. However, existing datasets are not of sufficient quality or quantity to train the robust CV models needed to support air quality advocacy. We introduce RISE, the first large-scale video dataset for Recognizing Industrial Smoke Emissions. We adopted a citizen science approach to collaborate with local community members to annotate whether a video clip has smoke emissions. Our dataset contains 12,567 clips from 19 distinct views from cameras that monitored three industrial facilities. These daytime clips span 30 days over two years, including all four seasons. We ran experiments using deep neural networks to establish a strong performance baseline and reveal smoke recognition challenges. Our survey study discussed community feedback, and our data analysis displayed opportunities for integrating citizen scientists and crowd workers into the application of Artificial Intelligence for Social Impact.
Submitted 29 April, 2024; v1 submitted 12 May, 2020;
originally announced May 2020.
-
CODA-19: Using a Non-Expert Crowd to Annotate Research Aspects on 10,000+ Abstracts in the COVID-19 Open Research Dataset
Authors:
Ting-Hao 'Kenneth' Huang,
Chieh-Yang Huang,
Chien-Kuang Cornelia Ding,
Yen-Chia Hsu,
C. Lee Giles
Abstract:
This paper introduces CODA-19, a human-annotated dataset that codes the Background, Purpose, Method, Finding/Contribution, and Other sections of 10,966 English abstracts in the COVID-19 Open Research Dataset. CODA-19 was created by 248 crowd workers from Amazon Mechanical Turk within 10 days, and achieved labeling quality comparable to that of experts. Each abstract was annotated by nine different workers, and the final labels were acquired by majority vote. The inter-annotator agreement (Cohen's kappa) between the crowd and the biomedical expert (0.741) is comparable to inter-expert agreement (0.788). CODA-19's labels have an accuracy of 82.2% when compared to the biomedical expert's labels, while the accuracy between experts was 85.0%. Reliable human annotations help scientists access and integrate the rapidly accelerating coronavirus literature, and also serve as the battery of AI/NLP research, but obtaining expert annotations can be slow. We demonstrated that a non-expert crowd can be rapidly employed at scale to join the fight against COVID-19.
Submitted 17 September, 2020; v1 submitted 5 May, 2020;
originally announced May 2020.
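The aggregation step mentioned in the CODA-19 abstract (nine worker labels per item, final label by majority vote, then accuracy against an expert's labels) can be sketched roughly as follows. This is an illustrative sketch, not the CODA-19 pipeline: the function names and the toy votes are hypothetical, and the tie-breaking rule (the label seen first among tied labels wins) is an assumption, since the abstract does not say how ties were resolved.

```python
from collections import Counter

def majority_label(votes):
    """Return the most frequent label among one item's worker votes.

    Ties go to the label that appears first in `votes`
    (an assumption; the source does not specify tie handling).
    """
    return Counter(votes).most_common(1)[0][0]

def accuracy(predicted, gold):
    """Fraction of items where the aggregated label matches the gold label."""
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

# Toy data: nine hypothetical worker votes for each of two text segments.
votes_per_item = [
    ["Background"] * 5 + ["Purpose"] * 4,
    ["Method"] * 3 + ["Finding"] * 6,
]
crowd_labels = [majority_label(v) for v in votes_per_item]
expert_labels = ["Background", "Method"]  # hypothetical expert labels
crowd_accuracy = accuracy(crowd_labels, expert_labels)
```

With an odd number of workers, a two-way split always has a clear winner, but ties remain possible when votes spread over three or more labels, so a real pipeline would need an explicit tie-breaking policy.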
-
Heteroglossia: In-Situ Story Ideation with the Crowd
Authors:
Chieh-Yang Huang,
Shih-Hong Huang,
Ting-Hao 'Kenneth' Huang
Abstract:
Ideation is essential for creative writing. Many authors struggle to come up with ideas throughout the writing process, yet modern writing tools fail to provide on-the-spot assistance for writers when they get stuck. This paper introduces Heteroglossia, an add-on for Google Docs that allows writers to elicit story ideas from the online crowd using their text editors. Writers can share snippets of their working drafts and ask the crowd to provide follow-up story ideas based on them. Heteroglossia employs a strategy called "role play", where each worker is assigned a fictional character in a story and asked to brainstorm plot ideas from that character's perspective. Our deployment with two experienced story writers shows that Heteroglossia is easy to use and can generate interesting ideas. Heteroglossia allows us to gain insight into how future technologies can be developed to support ideation in creative writing.
Submitted 15 January, 2020; v1 submitted 13 January, 2020;
originally announced January 2020.
-
Smell Pittsburgh: Engaging Community Citizen Science for Air Quality
Authors:
Yen-Chia Hsu,
Jennifer Cross,
Paul Dille,
Michael Tasota,
Beatrice Dias,
Randy Sargent,
Ting-Hao 'Kenneth' Huang,
Illah Nourbakhsh
Abstract:
Urban air pollution has been linked to various human health concerns, including cardiopulmonary diseases. Communities who suffer from poor air quality often rely on experts to identify pollution sources due to the lack of accessible tools. Taking this into account, we developed Smell Pittsburgh, a system that enables community members to report odors and track where these odors are frequently concentrated. All smell report data are publicly accessible online. These reports are also sent to the local health department and visualized on a map along with air quality data from monitoring stations. This visualization provides a comprehensive overview of the local pollution landscape. Additionally, with these reports and air quality data, we developed a model to predict upcoming smell events and send push notifications to inform communities. We also applied regression analysis to identify statistically significant effects of push notifications on user engagement. Our evaluation of this system demonstrates that engaging residents in documenting their experiences with pollution odors can help identify local air pollution patterns, and can empower communities to advocate for better air quality. All citizen-contributed smell data are publicly accessible and can be downloaded from https://smellpgh.org.
Submitted 20 November, 2020; v1 submitted 26 December, 2019;
originally announced December 2019.
-
Knowledge-Enriched Visual Storytelling
Authors:
Chao-Chun Hsu,
Zi-Yuan Chen,
Chi-Yang Hsu,
Chih-Chia Li,
Tzu-Yuan Lin,
Ting-Hao 'Kenneth' Huang,
Lun-Wei Ku
Abstract:
Stories are diverse and highly personalized, resulting in a large possible output space for story generation. Existing end-to-end approaches produce monotonous stories because they are limited to the vocabulary and knowledge in a single training dataset. This paper introduces KG-Story, a three-stage framework that allows the story generation model to take advantage of external Knowledge Graphs to produce interesting stories. KG-Story distills a set of representative words from the input prompts, enriches the word set by using external knowledge graphs, and finally generates stories based on the enriched word set. This distill-enrich-generate framework allows the use of external resources not only for the enrichment phase, but also for the distillation and generation phases. In this paper, we show the superiority of KG-Story for visual storytelling, where the input prompt is a sequence of five photos and the output is a short story. Per the human ranking evaluation, stories generated by KG-Story are on average ranked better than those of the state-of-the-art systems. Our code and output stories are available at https://github.com/zychen423/KE-VIST.
Submitted 3 December, 2019;
originally announced December 2019.
-
On Automating Conversations
Authors:
Ting-Hao 'Kenneth' Huang
Abstract:
From 2016 to 2018, we developed and deployed Chorus, a system that blends real-time human computation with artificial intelligence (AI) and has real-world, open conversations with users. We took a top-down approach that started with a working crowd-powered system, Chorus, and then created a framework, Evorus, that enables Chorus to automate itself over time. Over our two-year deployment, more than 420 users talked with Chorus, having over 2,200 conversation sessions. This line of work demonstrated how a crowd-powered conversational assistant can be automated over time, and more importantly, how such a system can be deployed to talk with real users to help them with their everyday tasks. This position paper discusses two sets of challenges that we explored during the development and deployment of Chorus and Evorus: the challenges that come from being an "agent" and those that arise from the subset of conversations that are more difficult to automate.
Submitted 24 October, 2019; v1 submitted 21 October, 2019;
originally announced October 2019.
-
On Using Chatbots to Promote Smoking Cessation Among Adolescents of Low Socioeconomic Status
Authors:
Patricia Simon,
Suchitra Krishnan-Sarin,
Ting-Hao 'Kenneth' Huang
Abstract:
Reducing youth tobacco use is critical for improving child health since tobacco use is associated with respiratory problems, and nicotine may interfere with healthy brain development. While tobacco regulation has contributed to declines in cigarette use among youth, these declines have occurred more quickly for youth of high socioeconomic status (SES) compared to youth of low SES. A major barrier to smoking cessation for adolescents of low SES is coordination of access and transportation to in-person treatment sessions. Low-SES youth may have family obligations that limit their ability to access in-person treatment. At the same time, mobile use among adolescents is high: 85% have smartphones. Additionally, adolescents engage in texting at high rates, suggesting that they are well-suited for mobile instant messaging interventions. Mobile interventions have shown promise for youth, but their use remains low. Thus, more research is needed to develop effective and engaging mobile interventions to increase quit rates. In this paper, we provide a brief review of approaches to adolescent smoking cessation and describe the promise of chatbots for smoking cessation.
Submitted 19 October, 2019;
originally announced October 2019.
-
InstructableCrowd: Creating IF-THEN Rules for Smartphones via Conversations with the Crowd
Authors:
Ting-Hao 'Kenneth' Huang,
Amos Azaria,
Oscar J. Romero,
Jeffrey P. Bigham
Abstract:
Natural language interfaces have become a common part of modern digital life. Chatbots utilize text-based conversations to communicate with users; personal assistants on smartphones such as Google Assistant take direct speech commands from their users; and speech-controlled devices such as Amazon Echo use voice as their only input mode. In this paper, we introduce InstructableCrowd, a crowd-powered system that allows users to program their devices via conversation. The user verbally expresses a problem to the system, in which a group of crowd workers collectively respond and program relevant multi-part IF-THEN rules to help the user. The IF-THEN rules generated by InstructableCrowd connect relevant sensor combinations (e.g., location, weather, device acceleration, etc.) to useful effectors (e.g., text messages, device alarms, etc.). Our study showed that non-programmers can use the conversational interface of InstructableCrowd to create IF-THEN rules that have similar quality compared with the rules created manually. InstructableCrowd generally illustrates how users may converse with their devices, not only to trigger simple voice commands, but also to personalize their increasingly powerful and complicated devices.
Submitted 12 September, 2019;
originally announced September 2019.
-
Visual Story Post-Editing
Authors:
Ting-Yao Hsu,
Chieh-Yang Huang,
Yen-Chia Hsu,
Ting-Hao 'Kenneth' Huang
Abstract:
We introduce the first dataset for human edits of machine-generated visual stories and explore how these collected edits may be used for the visual story post-editing task. The dataset, VIST-Edit, includes 14,905 human edited versions of 2,981 machine-generated visual stories. The stories were generated by two state-of-the-art visual storytelling models, each aligned to 5 human-edited versions. We establish baselines for the task, showing how a relatively small set of human edits can be leveraged to boost the performance of large visual storytelling models. We also discuss the weak correlation between automatic evaluation scores and human ratings, motivating the need for new automatic metrics.
Submitted 4 June, 2019;
originally announced June 2019.
-
Dixit: Interactive Visual Storytelling via Term Manipulation
Authors:
Chao-Chun Hsu,
Yu-Hua Chen,
Zi-Yuan Chen,
Hsin-Yu Lin,
Ting-Hao 'Kenneth' Huang,
Lun-Wei Ku
Abstract:
In this paper, we introduce Dixit, an interactive visual storytelling system that the user interacts with iteratively to compose a short story for a photo sequence. The user initiates the process by uploading a sequence of photos. Dixit first extracts text terms from each photo which describe the objects (e.g., boy, bike) or actions (e.g., sleep) in the photo, and then allows the user to add new terms or remove existing terms. Dixit then generates a short story based on these terms. Behind the scenes, Dixit uses an LSTM-based model trained on image caption data and FrameNet to distill terms from each image and utilizes a transformer decoder to compose a context-coherent story. Users change images or terms iteratively with Dixit to create the most ideal story. Dixit also allows users to manually edit and rate stories. The proposed procedure opens up possibilities for interpretable and controllable visual storytelling, allowing users to understand the story formation rationale and to intervene in the generation process.
Submitted 31 May, 2019; v1 submitted 6 March, 2019;
originally announced March 2019.
-
On How Users Edit Computer-Generated Visual Stories
Authors:
Ting-Yao Hsu,
Yen-Chia Hsu,
Ting-Hao 'Kenneth' Huang
Abstract:
A significant body of research in Artificial Intelligence (AI) has focused on generating stories automatically, either based on prior story plots or input images. However, the literature has little to say about how users would receive and use these stories. Given the quality of stories generated by modern AI algorithms, users will nearly inevitably have to edit these stories before putting them to real use. In this paper, we present the first analysis of how human users edit machine-generated stories. We obtained 962 short stories generated by one of the state-of-the-art visual storytelling models. For each story, we recruited five crowd workers from Amazon Mechanical Turk to edit it. Our analysis of these edits shows that, on average, users (i) slightly shortened machine-generated stories, (ii) increased lexical diversity in these stories, and (iii) often replaced nouns and their determiners/articles with pronouns. Our study provides a better understanding of how users receive and edit machine-generated stories, informing future researchers who aim to create more usable and helpful story generation systems.
Submitted 8 March, 2019; v1 submitted 21 February, 2019;
originally announced February 2019.
-
Smell Pittsburgh: Community-Empowered Mobile Smell Reporting System
Authors:
Yen-Chia Hsu,
Jennifer Cross,
Paul Dille,
Michael Tasota,
Beatrice Dias,
Randy Sargent,
Ting-Hao 'Kenneth' Huang,
Illah Nourbakhsh
Abstract:
Urban air pollution has been linked to various human health considerations, including cardiopulmonary diseases. Communities who suffer from poor air quality often rely on experts to identify pollution sources due to the lack of accessible tools. Taking this into account, we developed Smell Pittsburgh, a system that enables community members to report odors and track where these odors are frequently concentrated. All smell report data are publicly accessible online. These reports are also sent to the local health department and visualized on a map along with air quality data from monitoring stations. This visualization provides a comprehensive overview of the local pollution landscape. Additionally, with these reports and air quality data, we developed a model to predict upcoming smell events and send push notifications to inform communities. Our evaluation of this system demonstrates that engaging residents in documenting their experiences with pollution odors can help identify local air pollution patterns, and can empower communities to advocate for better air quality.
Submitted 1 July, 2020; v1 submitted 25 October, 2018;
originally announced October 2018.
-
EmotionLines: An Emotion Corpus of Multi-Party Conversations
Authors:
Sheng-Yeh Chen,
Chao-Chun Hsu,
Chuan-Chun Kuo,
Ting-Hao Huang,
Lun-Wei Ku
Abstract:
Feeling emotion is a critical characteristic that distinguishes people from machines. Among all the multi-modal resources for emotion detection, textual datasets are those containing the least information beyond semantics, and hence are widely adopted for testing the developed systems. However, most textual emotion datasets provide emotion labels only for individual words, sentences, or documents, which makes it challenging to discuss the contextual flow of emotions. In this paper, we introduce EmotionLines, the first dataset with emotion labels on all utterances in each dialogue, based only on their textual content. Dialogues in EmotionLines are collected from Friends TV scripts and private Facebook Messenger dialogues. Each utterance is then labeled by 5 Amazon MTurkers with one of seven emotions: Ekman's six basic emotions plus the neutral emotion. A total of 29,245 utterances from 2,000 dialogues are labeled in EmotionLines. We also provide several strong baselines for emotion detection models on EmotionLines in this paper.
Submitted 30 May, 2018; v1 submitted 22 February, 2018;
originally announced February 2018.
-
Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time
Authors:
Ting-Hao 'Kenneth' Huang,
Joseph Chee Chang,
Jeffrey P. Bigham
Abstract:
Crowd-powered conversational assistants have been shown to be more robust than automated systems, but do so at the cost of higher response latency and monetary costs. A promising direction is to combine the two approaches for high quality, low latency, and low cost solutions. In this paper, we introduce Evorus, a crowd-powered conversational assistant built to automate itself over time by (i) allowing new chatbots to be easily integrated to automate more scenarios, (ii) reusing prior crowd answers, and (iii) learning to automatically approve response candidates. Our 5-month-long deployment with 80 participants and 281 conversations shows that Evorus can automate itself without compromising conversation quality. Crowd-AI architectures have long been proposed as a way to reduce cost and latency for crowd-powered systems; Evorus demonstrates how automation can be introduced successfully in a deployed system. Its architecture allows future researchers to make further innovation on the underlying automated components in the context of a deployed open domain dialog system.
Submitted 9 January, 2018; v1 submitted 8 January, 2018;
originally announced January 2018.
-
"Is there anything else I can help you with?": Challenges in Deploying an On-Demand Crowd-Powered Conversational Agent
Authors:
Ting-Hao Kenneth Huang,
Walter S. Lasecki,
Amos Azaria,
Jeffrey P. Bigham
Abstract:
Intelligent conversational assistants, such as Apple's Siri, Microsoft's Cortana, and Amazon's Echo, have quickly become a part of our digital life. However, these assistants have major limitations, which prevent users from conversing with them as they would with human dialog partners. This limits our ability to observe how users really want to interact with the underlying system. To address this problem, we developed a crowd-powered conversational assistant, Chorus, and deployed it to see how users and workers would interact together when mediated by the system. Chorus holds sophisticated conversations with end users over time by recruiting workers on demand, who in turn decide what might be the best response for each user sentence. Up to the first month of our deployment, 59 users have held conversations with Chorus during 320 conversational sessions. In this paper, we present an account of Chorus' deployment, with a focus on four challenges: (i) identifying when conversations are over, (ii) malicious users and workers, (iii) on-demand recruiting, and (iv) settings in which consensus is not enough. Our observations could assist the deployment of crowd-powered conversation systems and crowd-powered systems in general.
Submitted 9 August, 2017;
originally announced August 2017.
-
MoodSwipe: A Soft Keyboard that Suggests Messages Based on User-Specified Emotions
Authors:
Chieh-Yang Huang,
Tristan Labetoulle,
Ting-Hao Kenneth Huang,
Yi-Pei Chen,
Hung-Chen Chen,
Vallari Srivastava,
Lun-Wei Ku
Abstract:
We present MoodSwipe, a soft keyboard that suggests text messages given user-specified emotions, using real dialog data. The aim of MoodSwipe is to create a convenient user interface for enjoying the technology of emotion classification and text suggestion, and at the same time to collect labeled data automatically for developing more advanced technologies. When users select the MoodSwipe keyboard, they can type as usual but also sense the emotion conveyed by their text and receive suggestions for their message as a benefit. In MoodSwipe, the detected emotions serve as the medium for suggested texts, where viewing the latter is the incentive for correcting the former. We conduct several experiments to show the superiority of the emotion classification models trained on the dialog data, and further to verify that good emotion cues are important context for text suggestion.
Submitted 22 July, 2017;
originally announced July 2017.
-
Real-time On-Demand Crowd-powered Entity Extraction
Authors:
Ting-Hao 'Kenneth' Huang,
Yun-Nung Chen,
Jeffrey P. Bigham
Abstract:
Output-agreement mechanisms such as ESP Game have been widely used in human computation to obtain reliable human-generated labels. In this paper, we argue that a "time-limited" output-agreement mechanism can be used to create a fast and robust crowd-powered component in interactive systems, particularly dialogue systems, to extract key information from user utterances on the fly. Our experiments on Amazon Mechanical Turk using the Airline Travel Information System (ATIS) dataset showed that the proposed approach achieves high-quality results with an average response time shorter than 9 seconds.
Submitted 6 December, 2017; v1 submitted 12 April, 2017;
originally announced April 2017.
-
Challenges in Providing Automatic Affective Feedback in Instant Messaging Applications
Authors:
Chieh-Yang Huang,
Ting-Hao Huang,
Lun-Wei Ku
Abstract:
Instant messaging is one of the major channels of computer-mediated communication. However, humans are known to be very limited in understanding others' emotions via text-based communication. Aiming to introduce emotion-sensing technologies to instant messaging, we developed EmotionPush, a system that automatically detects the emotions of the messages end-users received on Facebook Messenger and provides colored cues on their smartphones accordingly. We conducted a deployment study with 20 participants over a span of two weeks. In this paper, we describe five challenges, along with examples, that we observed in our study based on both users' feedback and chat logs: (i) the continuum of emotions, (ii) multi-user conversations, (iii) different dynamics between different users, (iv) misclassification of emotions, and (v) unconventional content. We believe this discussion will benefit the future exploration of affective computing for instant messaging, and also shed light on research on conversational emotion sensing.
Submitted 9 February, 2017;
originally announced February 2017.
-
Sensing Emotions in Text Messages: An Application and Deployment Study of EmotionPush
Authors:
Shih-Ming Wang,
Chun-Hui Li,
Yu-Chun Lo,
Ting-Hao K. Huang,
Lun-Wei Ku
Abstract:
Instant messaging and push notifications play important roles in modern digital life. To enable robust sense-making and rich context awareness in computer mediated communications, we introduce EmotionPush, a system that automatically conveys the emotion of received text with a colored push notification on mobile devices. EmotionPush is powered by state-of-the-art emotion classifiers and is deployed for Facebook Messenger clients on Android. The study showed that the system is able to help users prioritize interactions.
Submitted 15 October, 2016;
originally announced October 2016.
-
Visual Storytelling
Authors:
Ting-Hao Huang,
Francis Ferraro,
Nasrin Mostafazadeh,
Ishan Misra,
Aishwarya Agrawal,
Jacob Devlin,
Ross Girshick,
Xiaodong He,
Pushmeet Kohli,
Dhruv Batra,
C. Lawrence Zitnick,
Devi Parikh,
Lucy Vanderwende,
Michel Galley,
Margaret Mitchell
Abstract:
We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
Submitted 13 April, 2016;
originally announced April 2016.