Search | arXiv e-print repository

doi 10.1117/12.2679734

Automatic UAV-based Airport Pavement Inspection Using Mixed Real and Virtual Scenarios

Authors: Pablo Alonso, Jon Ander Iñiguez de Gordoa, Juan Diego Ortega, Sara García, Francisco Javier Iriarte, Marcos Nieto

Abstract: Runway and taxiway pavements are exposed to high stress during their projected lifetime, which inevitably leads to a decrease in their condition over time. To make sure airport pavement condition ensure uninterrupted and resilient operations, it is of utmost importance to monitor their condition and conduct regular inspections. UAV-based inspection is recently gaining importance due to its wide ra… ▽ More Runway and taxiway pavements are exposed to high stress during their projected lifetime, which inevitably leads to a decrease in their condition over time. To make sure airport pavement condition ensure uninterrupted and resilient operations, it is of utmost importance to monitor their condition and conduct regular inspections. UAV-based inspection is recently gaining importance due to its wide range monitoring capabilities and reduced cost. In this work, we propose a vision-based approach to automatically identify pavement distress using images captured by UAVs. The proposed method is based on Deep Learning (DL) to segment defects in the image. The DL architecture leverages the low computational capacities of embedded systems in UAVs by using an optimised implementation of EfficientNet feature extraction and Feature Pyramid Network segmentation. To deal with the lack of annotated data for training we have developed a synthetic dataset generation methodology to extend available distress datasets. We demonstrate that the use of a mixed dataset composed of synthetic and real training images yields better results when testing the training models in real application scenarios. △ Less

Submitted 11 January, 2024; originally announced January 2024.

Comments: 12 pages, 6 figures, published in proceedings of 15th International Conference on Machine Vision (ICMV)

Journal ref: Proc. SPIE 12701, Fifteenth International Conference on Machine Vision (ICMV 2022), 1270118

arXiv:2304.04478 [pdf, other]

Oh, Jeez! or Uh-huh? A Listener-aware Backchannel Predictor on ASR Transcriptions

Authors: Daniel Ortega, Chia-Yu Li, Ngoc Thang Vu

Abstract: This paper presents our latest investigation on modeling backchannel in conversations. Motivated by a proactive backchanneling theory, we aim at develo** a system which acts as a proactive listener by inserting backchannels, such as continuers and assessment, to influence speakers. Our model takes into account not only lexical and acoustic cues, but also introduces the simple and novel idea of u… ▽ More This paper presents our latest investigation on modeling backchannel in conversations. Motivated by a proactive backchanneling theory, we aim at develo** a system which acts as a proactive listener by inserting backchannels, such as continuers and assessment, to influence speakers. Our model takes into account not only lexical and acoustic cues, but also introduces the simple and novel idea of using listener embeddings to mimic different backchanneling behaviours. Our experimental results on the Switchboard benchmark dataset reveal that acoustic cues are more important than lexical cues in this task and their combination with listener embeddings works best on both, manual transcriptions and automatically generated transcriptions. △ Less

Submitted 10 April, 2023; originally announced April 2023.

Comments: Published in ICASSP 2020

arXiv:2304.04472 [pdf, other]

Modeling Speaker-Listener Interaction for Backchannel Prediction

Authors: Daniel Ortega, Sarina Meyer, Antje Schweitzer, Ngoc Thang Vu

Abstract: We present our latest findings on backchannel modeling novelly motivated by the canonical use of the minimal responses Yeah and Uh-huh in English and their correspondent tokens in German, and the effect of encoding the speaker-listener interaction. Backchanneling theories emphasize the active and continuous role of the listener in the course of the conversation, their effects on the speaker's subs… ▽ More We present our latest findings on backchannel modeling novelly motivated by the canonical use of the minimal responses Yeah and Uh-huh in English and their correspondent tokens in German, and the effect of encoding the speaker-listener interaction. Backchanneling theories emphasize the active and continuous role of the listener in the course of the conversation, their effects on the speaker's subsequent talk, and the consequent dynamic speaker-listener interaction. Therefore, we propose a neural-based acoustic backchannel classifier on minimal responses by processing acoustic features from the speaker speech, capturing and imitating listeners' backchanneling behavior, and encoding speaker-listener interaction. Our experimental results on the Switchboard and GECO datasets reveal that in almost all tested scenarios the speaker or listener behavior embeddings help the model make more accurate backchannel predictions. More importantly, a proper interaction encoding strategy, i.e., combining the speaker and listener embeddings, leads to the best performance on both datasets in terms of F1-score. △ Less

Submitted 10 April, 2023; originally announced April 2023.

Comments: Published in IWSDS 2023

arXiv:2205.06556 [pdf]

Virtual passengers for real car solutions: synthetic datasets

Authors: Paola Natalia Canas, Juan Diego Ortega, Marcos Nieto, Oihana Otaegui

Abstract: Strategies that include the generation of synthetic data are beginning to be viable as obtaining real data can be logistically complicated, very expensive or slow. Not only the capture of the data can lead to complications, but also its annotation. To achieve high-fidelity data for training intelligent systems, we have built a 3D scenario and set-up to resemble reality as closely as possible. With… ▽ More Strategies that include the generation of synthetic data are beginning to be viable as obtaining real data can be logistically complicated, very expensive or slow. Not only the capture of the data can lead to complications, but also its annotation. To achieve high-fidelity data for training intelligent systems, we have built a 3D scenario and set-up to resemble reality as closely as possible. With our approach, it is possible to configure and vary parameters to add randomness to the scene and, in this way, allow variation in data, which is so important in the construction of a dataset. Besides, the annotation task is already included in the data generation exercise, rather than being a post-capture task, which can save a lot of resources. We present the process and concept of synthetic data generation in an automotive context, specifically for driver and passenger monitoring purposes, as an alternative to real data capturing. △ Less

Submitted 13 May, 2022; originally announced May 2022.

Comments: 9 pages, 6 figures, 14th ITS European Congress

arXiv:2110.13242 [pdf, other]

2D Grid Map Generation for Deep-Learning-based Navigation Approaches

Authors: Gabriel O. Flores-Aquino, Jheison Duvier Díaz Ortega, Ricardo Yahir Almazan Arvizu, Raúl López Muñoz, O. Octavio Gutierrez-Frias, J. Irving Vasquez-Gomez

Abstract: In the last decade, autonomous navigation for roboticshas been leveraged by deep learning and other approachesbased on machine learning. These approaches have demon-strated significant advantages in robotics performance. Butthey have the disadvantage that they require a lot of data toinfer knowledge. In this paper, we present an algorithm forbuilding 2D maps with attributes that make them useful f… ▽ More In the last decade, autonomous navigation for roboticshas been leveraged by deep learning and other approachesbased on machine learning. These approaches have demon-strated significant advantages in robotics performance. Butthey have the disadvantage that they require a lot of data toinfer knowledge. In this paper, we present an algorithm forbuilding 2D maps with attributes that make them useful fortraining and testing machine-learning-based approaches.The maps are based on dungeons environments where sev-eral random rooms are built and then those rooms are con-nected. In addition, we provide a dataset with 10,000 mapsproduced by the proposed algorithm and a description withextensive information for algorithm evaluation. Such infor-mation includes validation of path existence, the best path,distances, among other attributes. We believe that thesemaps and their related information can be very useful forrobotics enthusiasts and researchers who want to test deeplearning approaches. The dataset is available athttps://github.com/gbriel21/map2D_dataSet.git △ Less

Submitted 4 December, 2021; v1 submitted 25 October, 2021; originally announced October 2021.

Comments: 6 pages, 4 figures, conference, dataset

arXiv:2008.12085 [pdf, ps, other]

doi 10.1007/978-3-030-66823-5_23

DMD: A Large-Scale Multi-Modal Driver Monitoring Dataset for Attention and Alertness Analysis

Authors: Juan Diego Ortega, Neslihan Kose, Paola Cañas, Min-An Chao, Alexander Unnervik, Marcos Nieto, Oihana Otaegui, Luis Salgado

Abstract: Vision is the richest and most cost-effective technology for Driver Monitoring Systems (DMS), especially after the recent success of Deep Learning (DL) methods. The lack of sufficiently large and comprehensive datasets is currently a bottleneck for the progress of DMS development, crucial for the transition of automated driving from SAE Level-2 to SAE Level-3. In this paper, we introduce the Drive… ▽ More Vision is the richest and most cost-effective technology for Driver Monitoring Systems (DMS), especially after the recent success of Deep Learning (DL) methods. The lack of sufficiently large and comprehensive datasets is currently a bottleneck for the progress of DMS development, crucial for the transition of automated driving from SAE Level-2 to SAE Level-3. In this paper, we introduce the Driver Monitoring Dataset (DMD), an extensive dataset which includes real and simulated driving scenarios: distraction, gaze allocation, drowsiness, hands-wheel interaction and context data, in 41 hours of RGB, depth and IR videos from 3 cameras capturing face, body and hands of 37 drivers. A comparison with existing similar datasets is included, which shows the DMD is more extensive, diverse, and multi-purpose. The usage of the DMD is illustrated by extracting a subset of it, the dBehaviourMD dataset, containing 13 distraction activities, prepared to be used in DL training processes. Furthermore, we propose a robust and real-time driver behaviour recognition system targeting a real-world application that can run on cost-efficient CPU-only platforms, based on the dBehaviourMD. Its performance is evaluated with different types of fusion strategies, which all reach enhanced accuracy still providing real-time response. △ Less

Submitted 27 August, 2020; originally announced August 2020.

Comments: Accepted to ECCV 2020 workshop - Assistive Computer Vision and Robotics

arXiv:2005.01777 [pdf, other]

ADVISER: A Toolkit for Develo** Multi-modal, Multi-domain and Socially-engaged Conversational Agents

Authors: Chia-Yu Li, Daniel Ortega, Dirk Väth, Florian Lux, Lindsey Vanderlyn, Maximilian Schmidt, Michael Neumann, Moritz Völkel, Pavel Denisov, Sabrina Jenne, Zorica Kacarevic, Ngoc Thang Vu

Abstract: We present ADVISER - an open-source, multi-domain dialog system toolkit that enables the development of multi-modal (incorporating speech, text and vision), socially-engaged (e.g. emotion recognition, engagement level prediction and backchanneling) conversational agents. The final Python-based implementation of our toolkit is flexible, easy to use, and easy to extend not only for technically exper… ▽ More We present ADVISER - an open-source, multi-domain dialog system toolkit that enables the development of multi-modal (incorporating speech, text and vision), socially-engaged (e.g. emotion recognition, engagement level prediction and backchanneling) conversational agents. The final Python-based implementation of our toolkit is flexible, easy to use, and easy to extend not only for technically experienced users, such as machine learning researchers, but also for less technically experienced users, such as linguists or cognitive scientists, thereby providing a flexible platform for collaborative research. Link to open-source code: https://github.com/DigitalPhonetics/adviser △ Less

Submitted 4 May, 2020; originally announced May 2020.

Comments: All authors contributed equally. Accepted to be presented at ACL - System demonstrations - 2020

arXiv:1907.03196 [pdf, other]

Multimodal Fusion with Deep Neural Networks for Audio-Video Emotion Recognition

Authors: Juan D. S. Ortega, Mohammed Senoussaoui, Eric Granger, Marco Pedersoli, Patrick Cardinal, Alessandro L. Koerich

Abstract: This paper presents a novel deep neural network (DNN) for multimodal fusion of audio, video and text modalities for emotion recognition. The proposed DNN architecture has independent and shared layers which aim to learn the representation for each modality, as well as the best combined representation to achieve the best prediction. Experimental results on the AVEC Sentiment Analysis in the Wild da… ▽ More This paper presents a novel deep neural network (DNN) for multimodal fusion of audio, video and text modalities for emotion recognition. The proposed DNN architecture has independent and shared layers which aim to learn the representation for each modality, as well as the best combined representation to achieve the best prediction. Experimental results on the AVEC Sentiment Analysis in the Wild dataset indicate that the proposed DNN can achieve a higher level of Concordance Correlation Coefficient (CCC) than other state-of-the-art systems that perform early fusion of modalities at feature-level (i.e., concatenation) and late fusion at score-level (i.e., weighted average) fusion. The proposed DNN has achieved CCCs of 0.606, 0.534, and 0.170 on the development partition of the dataset for predicting arousal, valence and liking, respectively. △ Less

Submitted 6 July, 2019; originally announced July 2019.

arXiv:1906.10623 [pdf, other]

Emotion Recognition Using Fusion of Audio and Video Features

Authors: Juan D. S. Ortega, Patrick Cardinal, Alessandro L. Koerich

Abstract: In this paper we propose a fusion approach to continuous emotion recognition that combines visual and auditory modalities in their representation spaces to predict the arousal and valence levels. The proposed approach employs a pre-trained convolution neural network and transfer learning to extract features from video frames that capture the emotional content. For the auditory content, a minimalis… ▽ More In this paper we propose a fusion approach to continuous emotion recognition that combines visual and auditory modalities in their representation spaces to predict the arousal and valence levels. The proposed approach employs a pre-trained convolution neural network and transfer learning to extract features from video frames that capture the emotional content. For the auditory content, a minimalistic set of parameters such as prosodic, excitation, vocal tract, and spectral descriptors are used as features. The fusion of these two modalities is carried out at a feature level, before training a single support vector regressor (SVR) or at a prediction level, after training one SVR for each modality. The proposed approach also includes preprocessing and post-processing techniques which contribute favorably to improving the concordance correlation coefficient (CCC). Experimental results for predicting spontaneous and natural emotions on the RECOLA dataset have shown that the proposed approach takes advantage of the complementary information of visual and auditory modalities and provides CCCs of 0.749 and 0.565 for arousal and valence, respectively. △ Less

Submitted 25 June, 2019; originally announced June 2019.

arXiv:1902.11060 [pdf, other]

Context-aware Neural-based Dialog Act Classification on Automatically Generated Transcriptions

Authors: Daniel Ortega, Chia-Yu Li, Gisela Vallejo, Pavel Denisov, Ngoc Thang Vu

Abstract: This paper presents our latest investigations on dialog act (DA) classification on automatically generated transcriptions. We propose a novel approach that combines convolutional neural networks (CNNs) and conditional random fields (CRFs) for context modeling in DA classification. We explore the impact of transcriptions generated from different automatic speech recognition systems such as hybrid T… ▽ More This paper presents our latest investigations on dialog act (DA) classification on automatically generated transcriptions. We propose a novel approach that combines convolutional neural networks (CNNs) and conditional random fields (CRFs) for context modeling in DA classification. We explore the impact of transcriptions generated from different automatic speech recognition systems such as hybrid TDNN/HMM and End-to-End systems on the final performance. Experimental results on two benchmark datasets (MRDA and SwDA) show that the combination CNN and CRF improves consistently the accuracy. Furthermore, they show that although the word error rates are comparable, End-to-End ASR system seems to be more suitable for DA classification. △ Less

Submitted 28 February, 2019; originally announced February 2019.

Comments: 5 pages, 1 figure, ICASSP 2019, dialog act classification, automatic speech recognition

arXiv:1803.00831 [pdf, other]

Lexico-acoustic Neural-based Models for Dialog Act Classification

Authors: Daniel Ortega, Ngoc Thang Vu

Abstract: Recent works have proposed neural models for dialog act classification in spoken dialogs. However, they have not explored the role and the usefulness of acoustic information. We propose a neural model that processes both lexical and acoustic features for classification. Our results on two benchmark datasets reveal that acoustic features are helpful in improving the overall accuracy. Finally, a dee… ▽ More Recent works have proposed neural models for dialog act classification in spoken dialogs. However, they have not explored the role and the usefulness of acoustic information. We propose a neural model that processes both lexical and acoustic features for classification. Our results on two benchmark datasets reveal that acoustic features are helpful in improving the overall accuracy. Finally, a deeper analysis shows that acoustic features are valuable in three cases: when a dialog act has sufficient data, when lexical information is limited and when strong lexical cues are not present. △ Less

Submitted 2 March, 2018; originally announced March 2018.

Comments: 5 pages, 1 figure, 2018 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2018)

arXiv:1708.02561 [pdf, other]

Neural-based Context Representation Learning for Dialog Act Classification

Authors: Daniel Ortega, Ngoc Thang Vu

Abstract: We explore context representation learning methods in neural-based models for dialog act classification. We propose and compare extensively different methods which combine recurrent neural network architectures and attention mechanisms (AMs) at different context levels. Our experimental results on two benchmark datasets show consistent improvements compared to the models without contextual informa… ▽ More We explore context representation learning methods in neural-based models for dialog act classification. We propose and compare extensively different methods which combine recurrent neural network architectures and attention mechanisms (AMs) at different context levels. Our experimental results on two benchmark datasets show consistent improvements compared to the models without contextual information and reveal that the most suitable AM in the architecture depends on the nature of the dataset. △ Less

Submitted 8 August, 2017; originally announced August 2017.

Comments: 5 pages, 1 figure, SIGDIAL 2017

Showing 1–12 of 12 results for author: Ortega, D