-
Automatic UAV-based Airport Pavement Inspection Using Mixed Real and Virtual Scenarios
Authors:
Pablo Alonso,
Jon Ander Iñiguez de Gordoa,
Juan Diego Ortega,
Sara García,
Francisco Javier Iriarte,
Marcos Nieto
Abstract:
Runway and taxiway pavements are exposed to high stress during their projected lifetime, which inevitably leads to a decrease in their condition over time. To ensure that airport pavements support uninterrupted and resilient operations, it is of utmost importance to monitor their condition and conduct regular inspections. UAV-based inspection has recently been gaining importance due to its wide-range monitoring capabilities and reduced cost. In this work, we propose a vision-based approach to automatically identify pavement distress using images captured by UAVs. The proposed method is based on Deep Learning (DL) to segment defects in the image. The DL architecture accommodates the low computational capacity of embedded systems in UAVs by using an optimised implementation of EfficientNet feature extraction and Feature Pyramid Network segmentation. To deal with the lack of annotated data for training, we have developed a synthetic dataset generation methodology to extend available distress datasets. We demonstrate that the use of a mixed dataset composed of synthetic and real training images yields better results when testing the trained models in real application scenarios.
Submitted 11 January, 2024;
originally announced January 2024.
-
Oh, Jeez! or Uh-huh? A Listener-aware Backchannel Predictor on ASR Transcriptions
Authors:
Daniel Ortega,
Chia-Yu Li,
Ngoc Thang Vu
Abstract:
This paper presents our latest investigation on modeling backchannels in conversations. Motivated by a proactive backchanneling theory, we aim at developing a system which acts as a proactive listener by inserting backchannels, such as continuers and assessments, to influence speakers. Our model not only takes into account lexical and acoustic cues, but also introduces the simple and novel idea of using listener embeddings to mimic different backchanneling behaviours. Our experimental results on the Switchboard benchmark dataset reveal that acoustic cues are more important than lexical cues in this task and that their combination with listener embeddings works best on both manual transcriptions and automatically generated transcriptions.
Submitted 10 April, 2023;
originally announced April 2023.
-
Modeling Speaker-Listener Interaction for Backchannel Prediction
Authors:
Daniel Ortega,
Sarina Meyer,
Antje Schweitzer,
Ngoc Thang Vu
Abstract:
We present our latest findings on backchannel modeling, newly motivated by the canonical use of the minimal responses Yeah and Uh-huh in English and their corresponding tokens in German, and the effect of encoding the speaker-listener interaction. Backchanneling theories emphasize the active and continuous role of the listener in the course of the conversation, their effects on the speaker's subsequent talk, and the consequent dynamic speaker-listener interaction. Therefore, we propose a neural-based acoustic backchannel classifier on minimal responses by processing acoustic features from the speaker's speech, capturing and imitating listeners' backchanneling behavior, and encoding speaker-listener interaction. Our experimental results on the Switchboard and GECO datasets reveal that in almost all tested scenarios the speaker or listener behavior embeddings help the model make more accurate backchannel predictions. More importantly, a proper interaction encoding strategy, i.e., combining the speaker and listener embeddings, leads to the best performance on both datasets in terms of F1-score.
Submitted 10 April, 2023;
originally announced April 2023.
-
Virtual passengers for real car solutions: synthetic datasets
Authors:
Paola Natalia Canas,
Juan Diego Ortega,
Marcos Nieto,
Oihana Otaegui
Abstract:
Strategies that include the generation of synthetic data are beginning to be viable, as obtaining real data can be logistically complicated, very expensive or slow. Not only the capture of the data but also its annotation can lead to complications. To achieve high-fidelity data for training intelligent systems, we have built a 3D scenario and set-up to resemble reality as closely as possible. With our approach, it is possible to configure and vary parameters to add randomness to the scene and, in this way, allow variation in data, which is so important in the construction of a dataset. Besides, the annotation task is already included in the data generation exercise, rather than being a post-capture task, which can save a lot of resources. We present the process and concept of synthetic data generation in an automotive context, specifically for driver and passenger monitoring purposes, as an alternative to real data capturing.
Submitted 13 May, 2022;
originally announced May 2022.
-
2D Grid Map Generation for Deep-Learning-based Navigation Approaches
Authors:
Gabriel O. Flores-Aquino,
Jheison Duvier Díaz Ortega,
Ricardo Yahir Almazan Arvizu,
Raúl López Muñoz,
O. Octavio Gutierrez-Frias,
J. Irving Vasquez-Gomez
Abstract:
In the last decade, autonomous navigation for robotics has been leveraged by deep learning and other approaches based on machine learning. These approaches have demonstrated significant advantages in robotics performance. But they have the disadvantage that they require a lot of data to infer knowledge. In this paper, we present an algorithm for building 2D maps with attributes that make them useful for training and testing machine-learning-based approaches. The maps are based on dungeon environments where several random rooms are built and then those rooms are connected. In addition, we provide a dataset with 10,000 maps produced by the proposed algorithm and a description with extensive information for algorithm evaluation. Such information includes validation of path existence, the best path, distances, among other attributes. We believe that these maps and their related information can be very useful for robotics enthusiasts and researchers who want to test deep learning approaches. The dataset is available at https://github.com/gbriel21/map2D_dataSet.git
Submitted 4 December, 2021; v1 submitted 25 October, 2021;
originally announced October 2021.
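Illustrative note: the room-then-corridor idea described in the abstract above can be sketched roughly as follows. This is a minimal sketch and not the authors' algorithm; the function names `generate_map` and `shortest_path_length`, the room size range, the L-shaped corridors, and the 4-connected breadth-first search used to check path existence and distance are all assumptions made for illustration.

```python
import random
from collections import deque

FREE, WALL = 0, 1  # hypothetical cell encoding, not taken from the dataset


def generate_map(width=32, height=32, n_rooms=5, seed=0):
    """Carve random rectangular rooms into a walled grid, then connect them."""
    rng = random.Random(seed)
    grid = [[WALL] * width for _ in range(height)]
    centers = []
    for _ in range(n_rooms):
        w, h = rng.randint(3, 6), rng.randint(3, 6)
        # Keep a one-cell wall border around the whole map.
        x = rng.randint(1, width - w - 1)
        y = rng.randint(1, height - h - 1)
        for r in range(y, y + h):
            for c in range(x, x + w):
                grid[r][c] = FREE
        centers.append((y + h // 2, x + w // 2))
    # Connect consecutive room centers with L-shaped corridors.
    for (r1, c1), (r2, c2) in zip(centers, centers[1:]):
        for c in range(min(c1, c2), max(c1, c2) + 1):
            grid[r1][c] = FREE
        for r in range(min(r1, r2), max(r1, r2) + 1):
            grid[r][c2] = FREE
    return grid, centers


def shortest_path_length(grid, start, goal):
    """Breadth-first search over 4-connected free cells; None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == FREE and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


if __name__ == "__main__":
    grid, centers = generate_map(seed=1)
    print(shortest_path_length(grid, centers[0], centers[-1]))
```

Because every room is chained to the next one, a path between any two room centers exists by construction in this sketch; the search then supplies the kind of per-map attribute (path existence, shortest distance) that the abstract lists.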
-
DMD: A Large-Scale Multi-Modal Driver Monitoring Dataset for Attention and Alertness Analysis
Authors:
Juan Diego Ortega,
Neslihan Kose,
Paola Cañas,
Min-An Chao,
Alexander Unnervik,
Marcos Nieto,
Oihana Otaegui,
Luis Salgado
Abstract:
Vision is the richest and most cost-effective technology for Driver Monitoring Systems (DMS), especially after the recent success of Deep Learning (DL) methods. The lack of sufficiently large and comprehensive datasets is currently a bottleneck for the progress of DMS development, crucial for the transition of automated driving from SAE Level-2 to SAE Level-3. In this paper, we introduce the Driver Monitoring Dataset (DMD), an extensive dataset which includes real and simulated driving scenarios: distraction, gaze allocation, drowsiness, hands-wheel interaction and context data, in 41 hours of RGB, depth and IR videos from 3 cameras capturing face, body and hands of 37 drivers. A comparison with existing similar datasets is included, which shows the DMD is more extensive, diverse, and multi-purpose. The usage of the DMD is illustrated by extracting a subset of it, the dBehaviourMD dataset, containing 13 distraction activities, prepared to be used in DL training processes. Furthermore, we propose a robust and real-time driver behaviour recognition system targeting a real-world application that can run on cost-efficient CPU-only platforms, based on the dBehaviourMD. Its performance is evaluated with different types of fusion strategies, all of which reach enhanced accuracy while still providing real-time response.
Submitted 27 August, 2020;
originally announced August 2020.
-
ADVISER: A Toolkit for Developing Multi-modal, Multi-domain and Socially-engaged Conversational Agents
Authors:
Chia-Yu Li,
Daniel Ortega,
Dirk Väth,
Florian Lux,
Lindsey Vanderlyn,
Maximilian Schmidt,
Michael Neumann,
Moritz Völkel,
Pavel Denisov,
Sabrina Jenne,
Zorica Kacarevic,
Ngoc Thang Vu
Abstract:
We present ADVISER - an open-source, multi-domain dialog system toolkit that enables the development of multi-modal (incorporating speech, text and vision), socially-engaged (e.g. emotion recognition, engagement level prediction and backchanneling) conversational agents. The final Python-based implementation of our toolkit is flexible, easy to use, and easy to extend not only for technically experienced users, such as machine learning researchers, but also for less technically experienced users, such as linguists or cognitive scientists, thereby providing a flexible platform for collaborative research. Link to open-source code: https://github.com/DigitalPhonetics/adviser
Submitted 4 May, 2020;
originally announced May 2020.
-
Multimodal Fusion with Deep Neural Networks for Audio-Video Emotion Recognition
Authors:
Juan D. S. Ortega,
Mohammed Senoussaoui,
Eric Granger,
Marco Pedersoli,
Patrick Cardinal,
Alessandro L. Koerich
Abstract:
This paper presents a novel deep neural network (DNN) for multimodal fusion of audio, video and text modalities for emotion recognition. The proposed DNN architecture has independent and shared layers which aim to learn the representation for each modality, as well as the best combined representation to achieve the best prediction. Experimental results on the AVEC Sentiment Analysis in the Wild dataset indicate that the proposed DNN can achieve a higher Concordance Correlation Coefficient (CCC) than other state-of-the-art systems that perform early fusion of modalities at feature-level (i.e., concatenation) and late fusion at score-level (i.e., weighted average). The proposed DNN has achieved CCCs of 0.606, 0.534, and 0.170 on the development partition of the dataset for predicting arousal, valence and liking, respectively.
Submitted 6 July, 2019;
originally announced July 2019.
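Illustrative note: the Concordance Correlation Coefficient reported in the abstract above is commonly defined as 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The sketch below assumes that standard definition with population (divide-by-n) statistics; the function name `ccc` is hypothetical, and the exact evaluation script used by the authors may differ.

```python
def ccc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    Uses population (divide-by-n) variance and covariance. Raises
    ZeroDivisionError if both sequences are constant and equal.
    """
    n = len(pred)
    if n == 0 or n != len(gold):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    return 2 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


if __name__ == "__main__":
    print(ccc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect agreement
    print(ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # same trend, shifted by 1
```

Unlike Pearson correlation, the mean-difference term in the denominator penalises predictions that follow the right trend but are offset from the gold values, which is why the shifted example scores below 1.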
-
Emotion Recognition Using Fusion of Audio and Video Features
Authors:
Juan D. S. Ortega,
Patrick Cardinal,
Alessandro L. Koerich
Abstract:
In this paper we propose a fusion approach to continuous emotion recognition that combines visual and auditory modalities in their representation spaces to predict the arousal and valence levels. The proposed approach employs a pre-trained convolutional neural network and transfer learning to extract features from video frames that capture the emotional content. For the auditory content, a minimalistic set of parameters such as prosodic, excitation, vocal tract, and spectral descriptors is used as features. The fusion of these two modalities is carried out at a feature level, before training a single support vector regressor (SVR), or at a prediction level, after training one SVR for each modality. The proposed approach also includes preprocessing and post-processing techniques which contribute favorably to improving the concordance correlation coefficient (CCC). Experimental results for predicting spontaneous and natural emotions on the RECOLA dataset have shown that the proposed approach takes advantage of the complementary information of visual and auditory modalities and provides CCCs of 0.749 and 0.565 for arousal and valence, respectively.
Submitted 25 June, 2019;
originally announced June 2019.
-
Context-aware Neural-based Dialog Act Classification on Automatically Generated Transcriptions
Authors:
Daniel Ortega,
Chia-Yu Li,
Gisela Vallejo,
Pavel Denisov,
Ngoc Thang Vu
Abstract:
This paper presents our latest investigations on dialog act (DA) classification on automatically generated transcriptions. We propose a novel approach that combines convolutional neural networks (CNNs) and conditional random fields (CRFs) for context modeling in DA classification. We explore the impact of transcriptions generated from different automatic speech recognition systems such as hybrid TDNN/HMM and End-to-End systems on the final performance. Experimental results on two benchmark datasets (MRDA and SwDA) show that the combination of CNN and CRF consistently improves the accuracy. Furthermore, they show that although the word error rates are comparable, End-to-End ASR systems seem to be more suitable for DA classification.
Submitted 28 February, 2019;
originally announced February 2019.
-
Lexico-acoustic Neural-based Models for Dialog Act Classification
Authors:
Daniel Ortega,
Ngoc Thang Vu
Abstract:
Recent works have proposed neural models for dialog act classification in spoken dialogs. However, they have not explored the role and the usefulness of acoustic information. We propose a neural model that processes both lexical and acoustic features for classification. Our results on two benchmark datasets reveal that acoustic features are helpful in improving the overall accuracy. Finally, a deeper analysis shows that acoustic features are valuable in three cases: when a dialog act has sufficient data, when lexical information is limited and when strong lexical cues are not present.
Submitted 2 March, 2018;
originally announced March 2018.
-
Neural-based Context Representation Learning for Dialog Act Classification
Authors:
Daniel Ortega,
Ngoc Thang Vu
Abstract:
We explore context representation learning methods in neural-based models for dialog act classification. We propose and extensively compare different methods which combine recurrent neural network architectures and attention mechanisms (AMs) at different context levels. Our experimental results on two benchmark datasets show consistent improvements compared to the models without contextual information and reveal that the most suitable AM in the architecture depends on the nature of the dataset.
Submitted 8 August, 2017;
originally announced August 2017.