Search | arXiv e-print repository

Inspecting Spoken Language Understanding from Kids for Basic Math Learning at Home

Authors: Eda Okur, Roddy Fuentes Alba, Saurav Sahay, Lama Nachman

Abstract: Enriching the quality of early childhood education with interactive math learning at home systems, empowered by recent advances in conversational AI technologies, is slowly becoming a reality. With this motivation, we implement a multimodal dialogue system to support play-based learning experiences at home, guiding kids to master basic math concepts. This work explores Spoken Language Understandin… ▽ More Enriching the quality of early childhood education with interactive math learning at home systems, empowered by recent advances in conversational AI technologies, is slowly becoming a reality. With this motivation, we implement a multimodal dialogue system to support play-based learning experiences at home, guiding kids to master basic math concepts. This work explores Spoken Language Understanding (SLU) pipeline within a task-oriented dialogue system developed for Kid Space, with cascading Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) components evaluated on our home deployment data with kids going through gamified math learning activities. We validate the advantages of a multi-task architecture for NLU and experiment with a diverse set of pretrained language representations for Intent Recognition and Entity Extraction tasks in the math learning domain. To recognize kids' speech in realistic home environments, we investigate several ASR systems, including the commercial Google Cloud and the latest open-source Whisper solutions with varying model sizes. We evaluate the SLU pipeline by testing our best-performing NLU models on noisy ASR output to inspect the challenges of understanding children for math learning in authentic homes. △ Less

Submitted 1 June, 2023; originally announced June 2023.

Comments: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA) at ACL 2023

arXiv:2302.05888 [pdf, other]

Position Matters! Empirical Study of Order Effect in Knowledge-grounded Dialogue

Authors: Hsuan Su, Shachi H Kumar, Sahisnu Mazumder, Wenda Chen, Ramesh Manuvinakurike, Eda Okur, Saurav Sahay, Lama Nachman, Shang-Tse Chen, Hung-yi Lee

Abstract: With the power of large pretrained language models, various research works have integrated knowledge into dialogue systems. The traditional techniques treat knowledge as part of the input sequence for the dialogue system, prepending a set of knowledge statements in front of dialogue history. However, such a mechanism forces knowledge sets to be concatenated in an ordered manner, making models impl… ▽ More With the power of large pretrained language models, various research works have integrated knowledge into dialogue systems. The traditional techniques treat knowledge as part of the input sequence for the dialogue system, prepending a set of knowledge statements in front of dialogue history. However, such a mechanism forces knowledge sets to be concatenated in an ordered manner, making models implicitly pay imbalanced attention to the sets during training. In this paper, we first investigate how the order of the knowledge set can influence autoregressive dialogue systems' responses. We conduct experiments on two commonly used dialogue datasets with two types of transformer-based models and find that models view the input knowledge unequally. To this end, we propose a simple and novel technique to alleviate the order effect by modifying the position embeddings of knowledge input in these models. With the proposed position embedding method, the experimental results show that each knowledge statement is uniformly considered to generate responses. △ Less

Submitted 12 February, 2023; originally announced February 2023.

arXiv:2211.03511 [pdf, other]

End-to-End Evaluation of a Spoken Dialogue System for Learning Basic Mathematics

Authors: Eda Okur, Saurav Sahay, Roddy Fuentes Alba, Lama Nachman

Abstract: The advances in language-based Artificial Intelligence (AI) technologies applied to build educational applications can present AI for social-good opportunities with a broader positive impact. Across many disciplines, enhancing the quality of mathematics education is crucial in building critical thinking and problem-solving skills at younger ages. Conversational AI systems have started maturing to… ▽ More The advances in language-based Artificial Intelligence (AI) technologies applied to build educational applications can present AI for social-good opportunities with a broader positive impact. Across many disciplines, enhancing the quality of mathematics education is crucial in building critical thinking and problem-solving skills at younger ages. Conversational AI systems have started maturing to a point where they could play a significant role in hel** students learn fundamental math concepts. This work presents a task-oriented Spoken Dialogue System (SDS) built to support play-based learning of basic math concepts for early childhood education. The system has been evaluated via real-world deployments at school while the students are practicing early math concepts with multimodal interactions. We discuss our efforts to improve the SDS pipeline built for math learning, for which we explore utilizing MathBERT representations for potential enhancement to the Natural Language Understanding (NLU) module. We perform an end-to-end evaluation using real-world deployment outputs from the Automatic Speech Recognition (ASR), Intent Recognition, and Dialogue Manager (DM) components to understand how error propagation affects the overall performance in real-world scenarios. △ Less

Submitted 7 November, 2022; originally announced November 2022.

Comments: Proceedings of the 1st Workshop on Mathematical Natural Language Processing (MathNLP) at EMNLP 2022

arXiv:2205.13754 [pdf, other]

NLU for Game-based Learning in Real: Initial Evaluations

Authors: Eda Okur, Saurav Sahay, Lama Nachman

Abstract: Intelligent systems designed for play-based interactions should be contextually aware of the users and their surroundings. Spoken Dialogue Systems (SDS) are critical for these interactive agents to carry out effective goal-oriented communication with users in real-time. For the real-world (i.e., in-the-wild) deployment of such conversational agents, improving the Natural Language Understanding (NL… ▽ More Intelligent systems designed for play-based interactions should be contextually aware of the users and their surroundings. Spoken Dialogue Systems (SDS) are critical for these interactive agents to carry out effective goal-oriented communication with users in real-time. For the real-world (i.e., in-the-wild) deployment of such conversational agents, improving the Natural Language Understanding (NLU) module of the goal-oriented SDS pipeline is crucial, especially with limited task-specific datasets. This study explores the potential benefits of a recently proposed transformer-based multi-task NLU architecture, mainly to perform Intent Recognition on small-size domain-specific educational game datasets. The evaluation datasets were collected from children practicing basic math concepts via play-based interactions in game-based learning settings. We investigate the NLU performances on the initial proof-of-concept game datasets versus the real-world deployment datasets and observe anticipated performance drops in-the-wild. We have shown that compared to the more straightforward baseline approaches, Dual Intent and Entity Transformer (DIET) architecture is robust enough to handle real-world data to a large extent for the Intent Recognition task on these domain-specific in-the-wild game datasets. △ Less

Submitted 26 May, 2022; originally announced May 2022.

Comments: Proceedings of the Games and Natural Language Processing Workshop at LREC 2022

arXiv:2205.04006 [pdf, other]

Data Augmentation with Paraphrase Generation and Entity Extraction for Multimodal Dialogue System

Authors: Eda Okur, Saurav Sahay, Lama Nachman

Abstract: Contextually aware intelligent agents are often required to understand the users and their surroundings in real-time. Our goal is to build Artificial Intelligence (AI) systems that can assist children in their learning process. Within such complex frameworks, Spoken Dialogue Systems (SDS) are crucial building blocks to handle efficient task-oriented communication with children in game-based learni… ▽ More Contextually aware intelligent agents are often required to understand the users and their surroundings in real-time. Our goal is to build Artificial Intelligence (AI) systems that can assist children in their learning process. Within such complex frameworks, Spoken Dialogue Systems (SDS) are crucial building blocks to handle efficient task-oriented communication with children in game-based learning settings. We are working towards a multimodal dialogue system for younger kids learning basic math concepts. Our focus is on improving the Natural Language Understanding (NLU) module of the task-oriented SDS pipeline with limited datasets. This work explores the potential benefits of data augmentation with paraphrase generation for the NLU models trained on small task-specific datasets. We also investigate the effects of extracting entities for conceivably further data expansion. We have shown that paraphrasing with model-in-the-loop (MITL) strategies using small seed data is a promising approach yielding improved performance results for the Intent Recognition task. △ Less

Submitted 8 May, 2022; originally announced May 2022.

Comments: Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022)

arXiv:2104.13406 [pdf, other]

Semi-supervised Interactive Intent Labeling

Authors: Saurav Sahay, Eda Okur, Nagib Hakim, Lama Nachman

Abstract: Building the Natural Language Understanding (NLU) modules of task-oriented Spoken Dialogue Systems (SDS) involves a definition of intents and entities, collection of task-relevant data, annotating the data with intents and entities, and then repeating the same process over and over again for adding any functionality/enhancement to the SDS. In this work, we showcase an Intent Bulk Labeling system w… ▽ More Building the Natural Language Understanding (NLU) modules of task-oriented Spoken Dialogue Systems (SDS) involves a definition of intents and entities, collection of task-relevant data, annotating the data with intents and entities, and then repeating the same process over and over again for adding any functionality/enhancement to the SDS. In this work, we showcase an Intent Bulk Labeling system where SDS developers can interactively label and augment training data from unlabeled utterance corpora using advanced clustering and visual labeling methods. We extend the Deep Aligned Clustering work with a better backbone BERT model, explore techniques to select the seed data for labeling, and develop a data balancing method using an oversampling technique that utilizes paraphrasing models. We also look at the effect of data augmentation on the clustering process. Our results show that we can achieve over 10% gain in clustering accuracy on some datasets using the combination of the above techniques. Finally, we extract utterance embeddings from the clustering model and plot the data to interactively bulk label the samples, reducing the time and effort for data labeling of the whole dataset significantly. △ Less

Submitted 11 May, 2021; v1 submitted 27 April, 2021; originally announced April 2021.

Comments: NAACL 2021 - Workshop on Data Science with Human-in-the-loop: Language Advances (DaSH-LA)

arXiv:2007.03876 [pdf, other]

Audio-Visual Understanding of Passenger Intents for In-Cabin Conversational Agents

Authors: Eda Okur, Shachi H Kumar, Saurav Sahay, Lama Nachman

Abstract: Building multimodal dialogue understanding capabilities situated in the in-cabin context is crucial to enhance passenger comfort in autonomous vehicle (AV) interaction systems. To this end, understanding passenger intents from spoken interactions and vehicle vision systems is a crucial component for develo** contextual and visually grounded conversational agents for AV. Towards this goal, we exp… ▽ More Building multimodal dialogue understanding capabilities situated in the in-cabin context is crucial to enhance passenger comfort in autonomous vehicle (AV) interaction systems. To this end, understanding passenger intents from spoken interactions and vehicle vision systems is a crucial component for develo** contextual and visually grounded conversational agents for AV. Towards this goal, we explore AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling multimodal passenger-vehicle interactions. In this work, we discuss the benefits of a multimodal understanding of in-cabin utterances by incorporating verbal/language input together with the non-verbal/acoustic and visual clues from inside and outside the vehicle. Our experimental results outperformed text-only baselines as we achieved improved performances for intent detection with a multimodal approach. △ Less

Submitted 7 July, 2020; originally announced July 2020.

Comments: ACL 2020 - Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)

arXiv:2007.02038 [pdf, other]

Low Rank Fusion based Transformers for Multimodal Sequences

Authors: Saurav Sahay, Eda Okur, Shachi H Kumar, Lama Nachman

Abstract: Our senses individually work in a coordinated fashion to express our emotional intentions. In this work, we experiment with modeling modality-specific sensory signals to attend to our latent multimodal emotional intentions and vice versa expressed via low-rank multimodal fusion and multimodal transformers. The low-rank factorization of multimodal fusion amongst the modalities helps represent appro… ▽ More Our senses individually work in a coordinated fashion to express our emotional intentions. In this work, we experiment with modeling modality-specific sensory signals to attend to our latent multimodal emotional intentions and vice versa expressed via low-rank multimodal fusion and multimodal transformers. The low-rank factorization of multimodal fusion amongst the modalities helps represent approximate multiplicative latent signal interactions. Motivated by the work of~\cite{tsai2019MULT} and~\cite{Liu_2018}, we present our transformer-based cross-fusion architecture without any over-parameterization of the model. The low-rank fusion helps represent the latent signal interactions while the modality-specific attention helps focus on relevant parts of the signal. We present two methods for the Multimodal Sentiment and Emotion Recognition results on CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets and show that our models have lesser parameters, train faster and perform comparably to many larger fusion-based architectures. △ Less

Submitted 4 July, 2020; originally announced July 2020.

Comments: ACL 2020 workshop on Second Grand Challenge and Workshop on Multimodal Language

arXiv:2002.02334 [pdf, other]

Self-recognition in conversational agents

Authors: Yigit Oktar, Erdem Okur, Mehmet Turkan

Abstract: In a standard Turing test, a machine has to prove its humanness to the judges. By successfully imitating a thinking entity such as a human, this machine then proves that it can also think. Some objections claim that Turing test is not a tool to demonstrate the existence of general intelligence or thinking activity. A compelling alternative is the Lovelace test, in which the agent must originate a… ▽ More In a standard Turing test, a machine has to prove its humanness to the judges. By successfully imitating a thinking entity such as a human, this machine then proves that it can also think. Some objections claim that Turing test is not a tool to demonstrate the existence of general intelligence or thinking activity. A compelling alternative is the Lovelace test, in which the agent must originate a product that the agent's creator cannot explain. Therefore, the agent must be the owner of an original product. However, for this to happen the agent must exhibit the idea of self and distinguish oneself from others. Sustaining the idea of self within the Turing test is still possible if the judge decides to act as a textual mirror. Self-recognition tests applied on animals through mirrors appear to be viable tools to demonstrate the existence of a type of general intelligence. Methodology here constructs a textual version of the mirror test by placing the agent as the one and only judge to figure out whether the contacted one is an other, a mimicker, or oneself in an unsupervised manner. This textual version of the mirror test is objective, self-contained, and devoid of humanness. Any agent passing this textual mirror test should have or can acquire a thought mechanism that can be referred to as the inner-voice, answering the original and long lasting question of Turing "Can machines think?" in a constructive manner still within the bounds of the Turing test. Moreover, it is possible that a successful self-recognition might pave way to stronger notions of self-awareness in artificial beings. △ Less

Submitted 5 September, 2021; v1 submitted 6 February, 2020; originally announced February 2020.

arXiv:1912.10132 [pdf, ps, other]

Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog

Authors: Shachi H Kumar, Eda Okur, Saurav Sahay, Jonathan Huang, Lama Nachman

Abstract: We are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. Recent progress in visual grounding techniques and Audio Understanding are enabling machines to understand shared semantic concepts and listen to the var… ▽ More We are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. Recent progress in visual grounding techniques and Audio Understanding are enabling machines to understand shared semantic concepts and listen to the various sensory events in the environment. With audio and visual grounding methods, end-to-end multimodal SDS are trained to meaningfully communicate with us in natural language about the real dynamic audio-visual sensory world around us. In this work, we explore the role of `topics' as the context of the conversation along with multimodal attention into such an end-to-end audio-visual scene-aware dialog system architecture. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We develop and test our approaches on the Audio Visual Scene-Aware Dialog (AVSD) dataset released as a part of the DSTC7. We present the analysis of our experiments and show that some of our model variations outperform the baseline system released for AVSD. △ Less

Submitted 20 December, 2019; originally announced December 2019.

Comments: Presented at the Visual Question Answering and Dialog Workshop, CVPR 2019, Long Beach, USA. arXiv admin note: substantial text overlap with arXiv:1912.10131

arXiv:1912.10131 [pdf, other]

Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog

Authors: Shachi H Kumar, Eda Okur, Saurav Sahay, Jonathan Huang, Lama Nachman

Abstract: With the recent advancements in Artificial Intelligence (AI), Intelligent Virtual Assistants (IVA) such as Alexa, Google Home, etc., have become a ubiquitous part of many homes. Currently, such IVAs are mostly audio-based, but going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances. Thi… ▽ More With the recent advancements in Artificial Intelligence (AI), Intelligent Virtual Assistants (IVA) such as Alexa, Google Home, etc., have become a ubiquitous part of many homes. Currently, such IVAs are mostly audio-based, but going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances. This will enable agents to have conversations with users about the objects, activities and events surrounding them. In this work, we present three main architectural explorations for the Audio Visual Scene-Aware Dialog (AVSD): 1) investigating `topics' of the dialog as an important contextual feature for the conversation, 2) exploring several multimodal attention mechanisms during response generation, 3) incorporating an end-to-end audio classification ConvNet, AclNet, into our architecture. We discuss detailed analysis of the experimental results and show that our model variations outperform the baseline system presented for the AVSD task. △ Less

Submitted 20 December, 2019; originally announced December 2019.

Comments: Presented at the 3rd Visually Grounded Interaction and Language (ViGIL) Workshop, NeurIPS 2019, Vancouver, Canada. arXiv admin note: substantial text overlap with arXiv:1812.08407, arXiv:1912.10132

arXiv:1912.10130 [pdf, other]

Modeling Intent, Dialog Policies and Response Adaptation for Goal-Oriented Interactions

Authors: Saurav Sahay, Shachi H Kumar, Eda Okur, Haroon Syed, Lama Nachman

Abstract: Building a machine learning driven spoken dialog system for goal-oriented interactions involves careful design of intents and data collection along with development of intent recognition models and dialog policy learning algorithms. The models should be robust enough to handle various user distractions during the interaction flow and should steer the user back into an engaging interaction for succ… ▽ More Building a machine learning driven spoken dialog system for goal-oriented interactions involves careful design of intents and data collection along with development of intent recognition models and dialog policy learning algorithms. The models should be robust enough to handle various user distractions during the interaction flow and should steer the user back into an engaging interaction for successful completion of the interaction. In this work, we have designed a goal-oriented interaction system where children can engage with agents for a series of interactions involving `Meet \& Greet' and `Simon Says' game play. We have explored various feature extractors and models for improved intent recognition and looked at leveraging previous user and system interactions in novel ways with attention models. We have also looked at dialog adaptation methods for entrained response selection. Our bootstrapped models from limited training data perform better than many baseline approaches we have looked at for intent recognition and dialog action prediction. △ Less

Submitted 20 December, 2019; originally announced December 2019.

Comments: Presented as a full-paper at the 23rd Workshop on the Semantics and Pragmatics of Dialogue (SemDial 2019 - LondonLogue), Sep 4-6, 2019, London, UK

Journal ref: Proceedings of the 23rd Workshop on the Semantics and Pragmatics of Dialogue (SEMDIAL), pp. 146-155, London, United Kingdom, September 2019

arXiv:1909.13714 [pdf, ps, other]

Towards Multimodal Understanding of Passenger-Vehicle Interactions in Autonomous Vehicles: Intent/Slot Recognition Utilizing Audio-Visual Data

Authors: Eda Okur, Shachi H Kumar, Saurav Sahay, Lama Nachman

Abstract: Understanding passenger intents from spoken interactions and car's vision (both inside and outside the vehicle) are important building blocks towards develo** contextual dialog systems for natural interactions in autonomous vehicles (AV). In this study, we continued exploring AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling certain multimodal p… ▽ More Understanding passenger intents from spoken interactions and car's vision (both inside and outside the vehicle) are important building blocks towards develo** contextual dialog systems for natural interactions in autonomous vehicles (AV). In this study, we continued exploring AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling certain multimodal passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly considering available three modalities (language/text, audio, video) and trigger the appropriate functionality of the AV system. We had collected a multimodal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme via realistic scavenger hunt game. In our previous explorations, we experimented with various RNN-based models to detect utterance-level intents (set destination, change route, go faster, go slower, stop, park, pull over, drop off, open door, and others) along with intent keywords and relevant slots (location, position/direction, object, gesture/gaze, time-guidance, person) associated with the action to be performed in our AV scenarios. In this recent work, we propose to discuss the benefits of multimodal understanding of in-cabin utterances by incorporating verbal/language input (text and speech embeddings) together with the non-verbal/acoustic and visual input from inside and outside the vehicle (i.e., passenger gestures and gaze from in-cabin video stream, referred objects outside of the vehicle from the road view camera stream). Our experimental results outperformed text-only baselines and with multimodality, we achieved improved performances for utterance-level intent detection and slot filling. △ Less

Submitted 19 September, 2019; originally announced September 2019.

Comments: Presented as a short-paper at the 23rd Workshop on the Semantics and Pragmatics of Dialogue (SemDial 2019 - LondonLogue), Sep 4-6, 2019, London, UK

Journal ref: Proceedings of the 23rd Workshop on the Semantics and Pragmatics of Dialogue (SEMDIAL), pp. 213-215, London, United Kingdom, September 2019

arXiv:1904.10500 [pdf, other]

Natural Language Interactions in Autonomous Vehicles: Intent Detection and Slot Filling from Passenger Utterances

Authors: Eda Okur, Shachi H Kumar, Saurav Sahay, Asli Arslan Esme, Lama Nachman

Abstract: Understanding passenger intents and extracting relevant slots are important building blocks towards develo** contextual dialogue systems for natural interactions in autonomous vehicles (AV). In this work, we explored AMIE (Automated-vehicle Multi-modal In-cabin Experience), the in-cabin agent responsible for handling certain passenger-vehicle interactions. When the passengers give instructions t… ▽ More Understanding passenger intents and extracting relevant slots are important building blocks towards develo** contextual dialogue systems for natural interactions in autonomous vehicles (AV). In this work, we explored AMIE (Automated-vehicle Multi-modal In-cabin Experience), the in-cabin agent responsible for handling certain passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly and trigger the appropriate functionality of the AV system. In our current explorations, we focused on AMIE scenarios describing usages around setting or changing the destination and route, updating driving behavior or speed, finishing the trip and other use-cases to support various natural commands. We collected a multi-modal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme via a realistic scavenger hunt game activity. After exploring various recent Recurrent Neural Networks (RNN) based techniques, we introduced our own hierarchical joint models to recognize passenger intents along with relevant slots associated with the action to be performed in AV scenarios. Our experimental results outperformed certain competitive baselines and achieved overall F1 scores of 0.91 for utterance-level intent detection and 0.96 for slot filling tasks. In addition, we conducted initial speech-to-text explorations by comparing intent/slot models trained and tested on human transcriptions versus noisy Automatic Speech Recognition (ASR) outputs. Finally, we compared the results with single passenger rides versus the rides with multiple passengers. △ Less

Submitted 23 April, 2019; originally announced April 2019.

Comments: Accepted and presented as a full paper at 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2019), April 7-13, 2019, La Rochelle, France

Journal ref: Springer LNCS Proceedings for CICLing 2019

arXiv:1901.06291 [pdf, ps, other]

Detecting Behavioral Engagement of Students in the Wild Based on Contextual and Visual Data

Authors: Eda Okur, Nese Alyuz, Sinem Aslan, Utku Genc, Cagri Tanriover, Asli Arslan Esme

Abstract: To investigate the detection of students' behavioral engagement (On-Task vs. Off-Task), we propose a two-phase approach in this study. In Phase 1, contextual logs (URLs) are utilized to assess active usage of the content platform. If there is active use, the appearance information is utilized in Phase 2 to infer behavioral engagement. Incorporating the contextual information improved the overall F… ▽ More To investigate the detection of students' behavioral engagement (On-Task vs. Off-Task), we propose a two-phase approach in this study. In Phase 1, contextual logs (URLs) are utilized to assess active usage of the content platform. If there is active use, the appearance information is utilized in Phase 2 to infer behavioral engagement. Incorporating the contextual information improved the overall F1-scores from 0.77 to 0.82. Our cross-classroom and cross-platform experiments showed the proposed generic and multi-modal behavioral engagement models' applicability to a different set of students or different subject areas. △ Less

Submitted 15 January, 2019; originally announced January 2019.

Comments: 12th Women in Machine Learning Workshop (WiML 2017), co-located with the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA

arXiv:1901.05835 [pdf, ps, other]

Unobtrusive and Multimodal Approach for Behavioral Engagement Detection of Students

Authors: Nese Alyuz, Eda Okur, Utku Genc, Sinem Aslan, Cagri Tanriover, Asli Arslan Esme

Abstract: We propose a multimodal approach for detection of students' behavioral engagement states (i.e., On-Task vs. Off-Task), based on three unobtrusive modalities: Appearance, Context-Performance, and Mouse. Final behavioral engagement states are achieved by fusing modality-specific classifiers at the decision level. Various experiments were conducted on a student dataset collected in an authentic class… ▽ More We propose a multimodal approach for detection of students' behavioral engagement states (i.e., On-Task vs. Off-Task), based on three unobtrusive modalities: Appearance, Context-Performance, and Mouse. Final behavioral engagement states are achieved by fusing modality-specific classifiers at the decision level. Various experiments were conducted on a student dataset collected in an authentic classroom. △ Less

Submitted 15 January, 2019; originally announced January 2019.

Comments: 12th Women in Machine Learning Workshop (WiML 2017), co-located with the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA

arXiv:1901.04899 [pdf, ps, other]

Conversational Intent Understanding for Passengers in Autonomous Vehicles

Authors: Eda Okur, Shachi H Kumar, Saurav Sahay, Asli Arslan Esme, Lama Nachman

Abstract: Understanding passenger intents and extracting relevant slots are important building blocks towards develo** a contextual dialogue system responsible for handling certain vehicle-passenger interactions in autonomous vehicles (AV). When the passengers give instructions to AMIE (Automated-vehicle Multimodal In-cabin Experience), the agent should parse such commands properly and trigger the appropr… ▽ More Understanding passenger intents and extracting relevant slots are important building blocks towards develo** a contextual dialogue system responsible for handling certain vehicle-passenger interactions in autonomous vehicles (AV). When the passengers give instructions to AMIE (Automated-vehicle Multimodal In-cabin Experience), the agent should parse such commands properly and trigger the appropriate functionality of the AV system. In our AMIE scenarios, we describe usages and support various natural commands for interacting with the vehicle. We collected a multimodal in-cabin data-set with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz scheme. We explored various recent Recurrent Neural Networks (RNN) based techniques and built our own hierarchical models to recognize passenger intents along with relevant slots associated with the action to be performed in AV scenarios. Our experimental results achieved F1-score of 0.91 on utterance-level intent recognition and 0.96 on slot extraction models. △ Less

Submitted 13 December, 2018; originally announced January 2019.

arXiv:1901.03793 [pdf, other]

The Importance of Socio-Cultural Differences for Annotating and Detecting the Affective States of Students

Authors: Eda Okur, Sinem Aslan, Nese Alyuz, Asli Arslan Esme, Ryan S. Baker

Abstract: The development of real-time affect detection models often depends upon obtaining annotated data for supervised learning by employing human experts to label the student data. One open question in annotating affective data for affect detection is whether the labelers (i.e., human experts) need to be socio-culturally similar to the students being labeled, as this impacts the cost feasibility of obta… ▽ More The development of real-time affect detection models often depends upon obtaining annotated data for supervised learning by employing human experts to label the student data. One open question in annotating affective data for affect detection is whether the labelers (i.e., human experts) need to be socio-culturally similar to the students being labeled, as this impacts the cost feasibility of obtaining the labels. In this study, we investigate the following research questions: For affective state annotation, how does the socio-cultural background of human expert labelers, compared to the subjects, impact the degree of consensus and distribution of affective states obtained? Secondly, how do differences in labeler background impact the performance of affect detection models that are trained using these labels? △ Less

Submitted 11 January, 2019; originally announced January 2019.

Comments: 13th Women in Machine Learning Workshop (WiML 2018), co-located with the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada

arXiv:1812.08407 [pdf, other]

Context, Attention and Audio Feature Explorations for Audio Visual Scene-Aware Dialog

Authors: Shachi H Kumar, Eda Okur, Saurav Sahay, Juan Jose Alvarado Leanos, Jonathan Huang, Lama Nachman

Abstract: With the recent advancements in AI, Intelligent Virtual Assistants (IVA) have become a ubiquitous part of every home. Going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. As a part of the 7th… ▽ More With the recent advancements in AI, Intelligent Virtual Assistants (IVA) have become a ubiquitous part of every home. Going forward, we are witnessing a confluence of vision, speech and dialog system technologies that are enabling the IVAs to learn audio-visual groundings of utterances and have conversations with users about the objects, activities and events surrounding them. As a part of the 7th Dialog System Technology Challenges (DSTC7), for Audio Visual Scene-Aware Dialog (AVSD) track, We explore `topics' of the dialog as an important contextual feature into the architecture along with explorations around multimodal Attention. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We present detailed analysis of the experiments and show that some of our model variations outperform the baseline system presented for this task. △ Less

Submitted 20 December, 2018; originally announced December 2018.

Comments: 7 pages, 2 figures, DSTC7 workshop at AAAI 2019

arXiv:1810.08732 [pdf, ps, other]

Named Entity Recognition on Twitter for Turkish using Semi-supervised Learning with Word Embeddings

Authors: Eda Okur, Hakan Demir, Arzucan Özgür

Abstract: Recently, due to the increasing popularity of social media, the necessity for extracting information from informal text types, such as microblog texts, has gained significant attention. In this study, we focused on the Named Entity Recognition (NER) problem on informal text types for Turkish. We utilized a semi-supervised learning approach based on neural networks. We applied a fast unsupervised m… ▽ More Recently, due to the increasing popularity of social media, the necessity for extracting information from informal text types, such as microblog texts, has gained significant attention. In this study, we focused on the Named Entity Recognition (NER) problem on informal text types for Turkish. We utilized a semi-supervised learning approach based on neural networks. We applied a fast unsupervised method for learning continuous representations of words in vector space. We made use of these obtained word embeddings, together with language independent features that are engineered to work better on informal text types, for generating a Turkish NER system on microblog texts. We evaluated our Turkish NER system on Twitter messages and achieved better F-score performances than the published results of previously proposed NER systems on Turkish tweets. Since we did not employ any language dependent features, we believe that our method can be easily adapted to microblog texts in other morphologically rich languages. △ Less

Submitted 19 October, 2018; originally announced October 2018.

Comments: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Showing 1–20 of 20 results for author: Okur, E