-
Chat, Shift and Perform: Bridging the Gap between Task-oriented and Non-task-oriented Dialog Systems
Authors:
Teppei Yoshino,
Yosuke Fukuchi,
Shoya Matsumori,
Michita Imai
Abstract:
We propose CASPER (ChAt, Shift and PERform), a novel dialog system consisting of three types of dialog models: chatter, shifter, and performer. Shifter, which is designed for topic switching, enables a seamless flow of dialog from open-domain chat to task-oriented dialog. In a user study, CASPER gave a better impression in terms of naturalness of response, lack of forced topic switching, and satisfaction compared with a baseline dialog system trained in an end-to-end manner. In an ablation study, we found that naturalness of response, dialog satisfaction, and task-elicitation rate improved compared with when shifter was removed from CASPER, indicating that topic shift with shifter supports the introduction of natural task-oriented dialog.
Submitted 5 June, 2022;
originally announced June 2022.
-
Mask and Cloze: Automatic Open Cloze Question Generation using a Masked Language Model
Authors:
Shoya Matsumori,
Kohei Okuoka,
Ryoichi Shibata,
Minami Inoue,
Yosuke Fukuchi,
Michita Imai
Abstract:
Open cloze questions (OCQs) have been attracting attention for both measuring the ability and facilitating the learning of L2 English learners. In spite of its benefits, the open cloze test has been introduced only sporadically on the educational front, largely because it is burdensome for teachers to manually create the questions. Unlike the more commonly used multiple choice questions (MCQ), open cloze questions are in free form and thus teachers have to ensure that only a ground truth answer and no additional words will be accepted in the blank. To help ease this burden, we developed CLOZER, an automatic open cloze question generator. In this work, we evaluate CLOZER through quantitative experiments on 1,600 answers and show statistically that it can successfully generate open cloze questions that only accept the ground truth answer. A comparative experiment with human-generated questions also reveals that CLOZER can generate OCQs better than the average non-native English teacher. Additionally, we conduct a field study at a local high school to clarify the benefits and hurdles when introducing CLOZER. The results demonstrate that students found the application useful for their language learning. Finally, on the basis of our findings, we propose several design improvements.
Submitted 15 May, 2022;
originally announced May 2022.
-
LatteGAN: Visually Guided Language Attention for Multi-Turn Text-Conditioned Image Manipulation
Authors:
Shoya Matsumori,
Yuki Abe,
Kosuke Shingyouchi,
Komei Sugiura,
Michita Imai
Abstract:
Text-guided image manipulation tasks have recently gained attention in the vision-and-language community. While most of the prior studies focused on single-turn manipulation, our goal in this paper is to address the more challenging multi-turn image manipulation (MTIM) task. Previous models for this task successfully generate images iteratively, given a sequence of instructions and a previously generated image. However, this approach suffers from under-generation and a lack of generated quality of the objects that are described in the instructions, which consequently degrades the overall performance. To overcome these problems, we present a novel architecture called a Visually Guided Language Attention GAN (LatteGAN). Here, we address the limitations of the previous approaches by introducing a Visually Guided Language Attention (Latte) module, which extracts fine-grained text representations for the generator, and a Text-Conditioned U-Net discriminator architecture, which discriminates both the global and local representations of fake or real images. Extensive experiments on two distinct MTIM datasets, CoDraw and i-CLEVR, demonstrate the state-of-the-art performance of the proposed model.
Submitted 2 June, 2022; v1 submitted 27 December, 2021;
originally announced December 2021.
-
Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue
Authors:
Shoya Matsumori,
Kosuke Shingyouchi,
Yuki Abe,
Yosuke Fukuchi,
Komei Sugiura,
Michita Imai
Abstract:
Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems. In particular, goal-oriented visual dialogue, where the aim of the agent is to seek information by asking questions during a turn-taking dialogue, has been gaining scholarly attention recently. While several existing models based on the GuessWhat?! dataset have been proposed, the Questioner typically asks simple category-based questions or absolute spatial questions. This might be problematic for complex scenes where the objects share attributes or in cases where descriptive questions are required to distinguish objects. In this paper, we propose a novel Questioner architecture, called Unified Questioner Transformer (UniQer), for descriptive question generation with referring expressions. In addition, we build a goal-oriented visual dialogue task called CLEVR Ask. It synthesizes complex scenes that require the Questioner to generate descriptive questions. We train our model with two variants of CLEVR Ask datasets. The results of the quantitative and qualitative evaluations show that UniQer outperforms the baseline.
Submitted 29 June, 2021;
originally announced June 2021.
-
SLAM-Inspired Simultaneous Contextualization and Interpreting for Incremental Conversation Sentences
Authors:
Yusuke Takimoto,
Yosuke Fukuchi,
Shoya Matsumori,
Michita Imai
Abstract:
Distributed representation of words has improved the performance for many natural language tasks. In many methods, however, only one meaning is considered for one label of a word, and multiple meanings of polysemous words depending on the context are rarely handled. Although research works have dealt with polysemous words, they determine the meanings of such words according to a batch of large documents. Hence, there are two problems with applying these methods to sequential sentences, as in a conversation that contains ambiguous expressions. The first problem is that the methods cannot sequentially deal with the interdependence between context and word interpretation, in which context is decided by word interpretations and the word interpretations are decided by the context. Context estimation must thus be performed in parallel to pursue multiple interpretations. The second problem is that the previous methods use large-scale sets of sentences for offline learning of new interpretations, and the steps of learning and inference are clearly separated. Such methods using offline learning cannot obtain new interpretations during a conversation. Hence, to dynamically estimate the conversation context and interpretations of polysemous words in sequential sentences, we propose a method of Simultaneous Contextualization And INterpreting (SCAIN) based on the traditional Simultaneous Localization And Mapping (SLAM) algorithm. By using the SCAIN algorithm, we can sequentially optimize the interdependence between context and word interpretation while obtaining new interpretations online. For experimental evaluation, we created two datasets: one from Wikipedia's disambiguation pages and the other from real conversations. For both datasets, the results confirmed that SCAIN could effectively achieve sequential optimization of the interdependence and acquisition of new interpretations.
Submitted 29 May, 2020;
originally announced May 2020.