-
The Right Losses for the Right Gains: Improving the Semantic Consistency of Deep Text-to-Image Generation with Distribution-Sensitive Losses
Authors:
Mahmoud Ahmed,
Omer Moussa,
Ismail Shaheen,
Mohamed Abdelfattah,
Amr Abdalla,
Marwan Eid,
Hesham Eraqi,
Mohamed Moustafa
Abstract:
One of the major challenges in training deep neural networks for text-to-image generation is the significant linguistic discrepancy between ground-truth captions of each image in most popular datasets. The large difference in the choice of words in such captions results in synthesizing images that are semantically dissimilar to each other and to their ground-truth counterparts. Moreover, existing…
▽ More
One of the major challenges in training deep neural networks for text-to-image generation is the significant linguistic discrepancy between ground-truth captions of each image in most popular datasets. The large difference in the choice of words in such captions results in synthesizing images that are semantically dissimilar to each other and to their ground-truth counterparts. Moreover, existing models either fail to generate the fine-grained details of the image or require a huge number of parameters that renders them inefficient for text-to-image synthesis. To fill this gap in the literature, we propose using the contrastive learning approach with a novel combination of two loss functions: fake-to-fake loss to increase the semantic consistency between generated images of the same caption, and fake-to-real loss to reduce the gap between the distributions of real images and fake ones. We test this approach on two baseline models: SSAGAN and AttnGAN (with style blocks to enhance the fine-grained details of the images.) Results show that our approach improves the qualitative results on AttnGAN with style blocks on the CUB dataset. Additionally, on the challenging COCO dataset, our approach achieves competitive results against the state-of-the-art Lafite model, outperforms the FID score of SSAGAN model by 44.
△ Less
Submitted 17 December, 2023;
originally announced December 2023.
-
RBPGAN: Recurrent Back-Projection GAN for Video Super Resolution
Authors:
Marwah Sulaiman,
Zahraa Shehabeldin,
Israa Fahmy,
Mohammed Barakat,
Mohammed El-Naggar,
Dareen Hussein,
Moustafa Youssef,
Hesham M. Eraqi
Abstract:
Recently, video super resolution (VSR) has become a very impactful task in the area of Computer Vision due to its various applications. In this paper, we propose Recurrent Back-Projection Generative Adversarial Network (RBPGAN) for VSR in an attempt to generate temporally coherent solutions while preserving spatial details. RBPGAN integrates two state-of-the-art models to get the best in both worl…
▽ More
Recently, video super resolution (VSR) has become a very impactful task in the area of Computer Vision due to its various applications. In this paper, we propose Recurrent Back-Projection Generative Adversarial Network (RBPGAN) for VSR in an attempt to generate temporally coherent solutions while preserving spatial details. RBPGAN integrates two state-of-the-art models to get the best in both worlds without compromising the accuracy of produced video. The generator of the model is inspired by RBPN system, while the discriminator is inspired by TecoGAN. We also utilize **-Pong loss to increase temporal consistency over time. Our contribution together results in a model that outperforms earlier work in terms of temporally consistent details, as we will demonstrate qualitatively and quantitatively using different datasets.
△ Less
Submitted 10 December, 2023; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Dynamic Conditional Imitation Learning for Autonomous Driving
Authors:
Hesham M. Eraqi,
Mohamed N. Moustafa,
Jens Honer
Abstract:
Conditional imitation learning (CIL) trains deep neural networks, in an end-to-end manner, to mimic human driving. This approach has demonstrated suitable vehicle control when following roads, avoiding obstacles, or taking specific turns at intersections to reach a destination. Unfortunately, performance dramatically decreases when deployed to unseen environments and is inconsistent against varyin…
▽ More
Conditional imitation learning (CIL) trains deep neural networks, in an end-to-end manner, to mimic human driving. This approach has demonstrated suitable vehicle control when following roads, avoiding obstacles, or taking specific turns at intersections to reach a destination. Unfortunately, performance dramatically decreases when deployed to unseen environments and is inconsistent against varying weather conditions. Most importantly, the current CIL fails to avoid static road blockages. In this work, we propose a solution to those deficiencies. First, we fuse the laser scanner with the regular camera streams, at the features level, to overcome the generalization and consistency challenges. Second, we introduce a new efficient Occupancy Grid Map** (OGM) method along with new algorithms for road blockages avoidance and global route planning. Consequently, our proposed method dynamically detects partial and full road blockages, and guides the controlled vehicle to another route to reach the destination. Following the original CIL work, we demonstrated the effectiveness of our proposal on CARLA simulator urban driving benchmark. Our experiments showed that our model improved consistency against weather conditions by four times and autonomous driving success rate generalization by 52%. Furthermore, our global route planner improved the driving success rate by 37%. Our proposed road blockages avoidance algorithm improved the driving success rate by 27%. Finally, the average kilometers traveled before a collision with a static object increased by 1.5 times. The main source code can be reached at https://heshameraqi.github.io/dynamic_cil_autonomous_driving.
△ Less
Submitted 16 November, 2022;
originally announced November 2022.
-
Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models
Authors:
Hadeel Mabrouk,
Omar Abugabal,
Nourhan Sakr,
Hesham M. Eraqi
Abstract:
In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training. Impressive progress in the domain of speech recognition has been exhibited by audio and audio-visual systems. Nevertheless, there is still much to be explored with regards to vi…
▽ More
In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training. Impressive progress in the domain of speech recognition has been exhibited by audio and audio-visual systems. Nevertheless, there is still much to be explored with regards to visual speech recognition systems due to the visual ambiguity of some phonemes. To this end, the development of visual speech recognition models is crucial given the instability of audio models. The main contributions of this work are i) building on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) to their systems; ii) leveraging audio data during training visual models, a feat which has not been utilized in prior word-based work; iii) proposing the Gaussian-shaped averaging in frame-level KD, as an efficient technique that aids the model in distilling knowledge at the sequence model encoder. This work proposes a novel and competitive architecture for lip-reading, as we demonstrate a noticeable improvement in performance, setting a new benchmark equals to 88.64% on the LRW dataset.
△ Less
Submitted 5 June, 2022;
originally announced July 2022.
-
Collision-Free Navigation using Evolutionary Symmetrical Neural Networks
Authors:
Hesham M. Eraqi,
Mena Nagiub,
Peter Sidra
Abstract:
Collision avoidance systems play a vital role in reducing the number of vehicle accidents and saving human lives. This paper extends the previous work using evolutionary neural networks for reactive collision avoidance. We are proposing a new method we have called symmetric neural networks. The method improves the model's performance by enforcing constraints between the network weights which reduc…
▽ More
Collision avoidance systems play a vital role in reducing the number of vehicle accidents and saving human lives. This paper extends the previous work using evolutionary neural networks for reactive collision avoidance. We are proposing a new method we have called symmetric neural networks. The method improves the model's performance by enforcing constraints between the network weights which reduces the model optimization search space and hence, learns more accurate control of the vehicle steering for improved maneuvering. The training and validation processes are carried out using a simulation environment - the codebase is publicly available. Extensive experiments are conducted to analyze the proposed method and evaluate its performance. The method is tested in several simulated driving scenarios. In addition, we have analyzed the effect of the rangefinder sensor resolution and noise on the overall goal of reactive collision avoidance. Finally, we have tested the generalization of the proposed method. The results are encouraging; the proposed method has improved the model's learning curve for training scenarios and generalization to the new test scenarios. Using constrained weights has significantly improved the number of generations required for the Genetic Algorithm optimization.
△ Less
Submitted 11 April, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
-
ASMDD: Arabic Speech Mispronunciation Detection Dataset
Authors:
Salah A. Aly,
Abdelrahman Salah,
Hesham M. Eraqi
Abstract:
The largest dataset of Arabic speech mispronunciation detections in Egyptian dialogues is introduced. The dataset is composed of annotated audio files representing the top 100 words that are most frequently used in the Arabic language, pronounced by 100 Egyptian children (aged between 2 and 8 years old). The dataset is collected and annotated on segmental pronunciation error detections by expert l…
▽ More
The largest dataset of Arabic speech mispronunciation detections in Egyptian dialogues is introduced. The dataset is composed of annotated audio files representing the top 100 words that are most frequently used in the Arabic language, pronounced by 100 Egyptian children (aged between 2 and 8 years old). The dataset is collected and annotated on segmental pronunciation error detections by expert listeners.
△ Less
Submitted 1 November, 2021;
originally announced November 2021.
-
An Evaluation of RGB and LiDAR Fusion for Semantic Segmentation
Authors:
Amr S. Mohamed,
Ali Abdelkader,
Mohamed Anany,
Omar El-Behady,
Muhammad Faisal,
Asser Hangal,
Hesham M. Eraqi,
Mohamed N. Moustafa
Abstract:
LiDARs and cameras are the two main sensors that are planned to be included in many announced autonomous vehicles prototypes. Each of the two provides a unique form of data from a different perspective to the surrounding environment. In this paper, we explore and attempt to answer the question: is there an added benefit by fusing those two forms of data for the purpose of semantic segmentation wit…
▽ More
LiDARs and cameras are the two main sensors that are planned to be included in many announced autonomous vehicles prototypes. Each of the two provides a unique form of data from a different perspective to the surrounding environment. In this paper, we explore and attempt to answer the question: is there an added benefit by fusing those two forms of data for the purpose of semantic segmentation within the context of autonomous driving? We also attempt to show at which level does said fusion prove to be the most useful. We evaluated our algorithms on the publicly available SemanticKITTI dataset. All fusion models show improvements over the base model, with the mid-level fusion showing the highest improvement of 2.7% in terms of mean Intersection over Union (mIoU) metric.
△ Less
Submitted 17 August, 2021;
originally announced August 2021.
-
Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading
Authors:
Shahd Elashmawy,
Marian Ramsis,
Hesham M. Eraqi,
Farah Eldeshnawy,
Hadeel Mabrouk,
Omar Abugabal,
Nourhan Sakr
Abstract:
Despite the advancement in the domain of audio and audio-visual speech recognition, visual speech recognition systems are still quite under-explored due to the visual ambiguity of some phonemes. In this work, we propose a new lip-reading model that combines three contributions. First, the model front-end adopts a spatio-temporal attention mechanism to help extract the informative data from the inp…
▽ More
Despite the advancement in the domain of audio and audio-visual speech recognition, visual speech recognition systems are still quite under-explored due to the visual ambiguity of some phonemes. In this work, we propose a new lip-reading model that combines three contributions. First, the model front-end adopts a spatio-temporal attention mechanism to help extract the informative data from the input visual frames. Second, the model back-end utilizes a sequence-level and frame-level Knowledge Distillation (KD) techniques that allow leveraging audio data during the visual model training. Third, a data preprocessing pipeline is adopted that includes facial landmarks detection-based lip-alignment. On LRW lip-reading dataset benchmark, a noticeable accuracy improvement is demonstrated; the spatio-temporal attention, Knowledge Distillation, and lip-alignment contributions achieved 88.43%, 88.64%, and 88.37% respectively.
△ Less
Submitted 7 August, 2021;
originally announced August 2021.
-
Pervasive Hand Gesture Recognition for Smartphones using Non-audible Sound and Deep Learning
Authors:
Ahmed Ibrahim,
Ayman El-Refai,
Sara Ahmed,
Mariam Aboul-Ela,
Hesham M. Eraqi,
Mohamed Moustafa
Abstract:
Due to the mass advancement in ubiquitous technologies nowadays, new pervasive methods have come into the practice to provide new innovative features and stimulate the research on new human-computer interactions. This paper presents a hand gesture recognition method that utilizes the smartphone's built-in speakers and microphones. The proposed system emits an ultrasonic sonar-based signal (inaudib…
▽ More
Due to the mass advancement in ubiquitous technologies nowadays, new pervasive methods have come into the practice to provide new innovative features and stimulate the research on new human-computer interactions. This paper presents a hand gesture recognition method that utilizes the smartphone's built-in speakers and microphones. The proposed system emits an ultrasonic sonar-based signal (inaudible sound) from the smartphone's stereo speakers, which is then received by the smartphone's microphone and processed via a Convolutional Neural Network (CNN) for Hand Gesture Recognition. Data augmentation techniques are proposed to improve the detection accuracy and three dual-channel input fusion methods are compared. The first method merges the dual-channel audio as a single input spectrogram image. The second method adopts early fusion by concatenating the dual-channel spectrograms. The third method adopts late fusion by having two convectional input branches processing each of the dual-channel spectrograms and then the outputs are merged by the last layers. Our experimental results demonstrate a promising detection accuracy for the six gestures presented in our publicly available dataset with an accuracy of 93.58\% as a baseline.
△ Less
Submitted 4 August, 2021;
originally announced August 2021.
-
End-to-end sensor modeling for LiDAR Point Cloud
Authors:
Khaled Elmadawi,
Moemen Abdelrazek,
Mohamed Elsobky,
Hesham M. Eraqi,
Mohamed Zahran
Abstract:
Advanced sensors are a key to enable self-driving cars technology. Laser scanner sensors (LiDAR, Light Detection And Ranging) became a fundamental choice due to its long-range and robustness to low light driving conditions. The problem of designing a control software for self-driving cars is a complex task to explicitly formulate in rule-based systems, thus recent approaches rely on machine learni…
▽ More
Advanced sensors are a key to enable self-driving cars technology. Laser scanner sensors (LiDAR, Light Detection And Ranging) became a fundamental choice due to its long-range and robustness to low light driving conditions. The problem of designing a control software for self-driving cars is a complex task to explicitly formulate in rule-based systems, thus recent approaches rely on machine learning that can learn those rules from data. The major problem with such approaches is that the amount of training data required for generalizing a machine learning model is big, and on the other hand LiDAR data annotation is very costly compared to other car sensors. An accurate LiDAR sensor model can cope with such problem. Moreover, its value goes beyond this because existing LiDAR development, validation, and evaluation platforms and processes are very costly, and virtual testing and development environments are still immature in terms of physical properties representation. In this work we propose a novel Deep Learning-based LiDAR sensor model. This method models the sensor echos, using a Deep Neural Network to model echo pulse widths learned from real data using Polar Grid Maps (PGM). We benchmark our model performance against comprehensive real sensor data and very promising results are achieved that sets a baseline for future works.
△ Less
Submitted 17 July, 2019;
originally announced July 2019.
-
Driver Distraction Identification with an Ensemble of Convolutional Neural Networks
Authors:
Hesham M. Eraqi,
Yehya Abouelnaga,
Mohamed H. Saad,
Mohamed N. Moustafa
Abstract:
The World Health Organization (WHO) reported 1.25 million deaths yearly due to road traffic accidents worldwide and the number has been continuously increasing over the last few years. Nearly fifth of these accidents are caused by distracted drivers. Existing work of distracted driver detection is concerned with a small set of distractions (mostly, cell phone usage). Unreliable ad-hoc methods are…
▽ More
The World Health Organization (WHO) reported 1.25 million deaths yearly due to road traffic accidents worldwide and the number has been continuously increasing over the last few years. Nearly fifth of these accidents are caused by distracted drivers. Existing work of distracted driver detection is concerned with a small set of distractions (mostly, cell phone usage). Unreliable ad-hoc methods are often used.In this paper, we present the first publicly available dataset for driver distraction identification with more distraction postures than existing alternatives. In addition, we propose a reliable deep learning-based solution that achieves a 90% accuracy. The system consists of a genetically-weighted ensemble of convolutional neural networks, we show that a weighted ensemble of classifiers using a genetic algorithm yields in a better classification confidence. We also study the effect of different visual elements in distraction detection by means of face and hand localizations, and skin segmentation. Finally, we present a thinned version of our ensemble that could achieve 84.64% classification accuracy and operate in a real-time environment.
△ Less
Submitted 22 January, 2019;
originally announced January 2019.
-
Static Free Space Detection with Laser Scanner using Occupancy Grid Maps
Authors:
Hesham M. Eraqi,
Jens Honer,
Sebastian Zuther
Abstract:
Drivable free space information is vital for autonomous vehicles that have to plan evasive maneuvers in real-time. In this paper, we present a new efficient method for environmental free space detection with laser scanner based on 2D occupancy grid maps (OGM) to be used for Advanced Driving Assistance Systems (ADAS) and Collision Avoidance Systems (CAS). Firstly, we introduce an enhanced inverse s…
▽ More
Drivable free space information is vital for autonomous vehicles that have to plan evasive maneuvers in real-time. In this paper, we present a new efficient method for environmental free space detection with laser scanner based on 2D occupancy grid maps (OGM) to be used for Advanced Driving Assistance Systems (ADAS) and Collision Avoidance Systems (CAS). Firstly, we introduce an enhanced inverse sensor model tailored for high-resolution laser scanners for building OGM. It compensates the unreflected beams and deals with the ray casting to grid cells accuracy and computational effort problems. Secondly, we introduce the 'vehicle on a circle for grid maps' map alignment algorithm that allows building more accurate local maps by avoiding the computationally expensive inaccurate operations of image sub-pixel shifting and rotation. The resulted grid map is more convenient for ADAS features than existing methods, as it allows using less memory sizes, and hence, results into a better real-time performance. Thirdly, we present an algorithm to detect what we call the 'in-sight edges'. These edges guarantee modeling the free space area with a single polygon of a fixed number of vertices regardless the driving situation and map complexity. The results from real world experiments show the effectiveness of our approach.
△ Less
Submitted 30 June, 2020; v1 submitted 2 January, 2018;
originally announced January 2018.
-
End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies
Authors:
Hesham M. Eraqi,
Mohamed N. Moustafa,
Jens Honer
Abstract:
Steering a car through traffic is a complex task that is difficult to cast into algorithms. Therefore, researchers turn to training artificial neural networks from front-facing camera data stream along with the associated steering angles. Nevertheless, most existing solutions consider only the visual camera frames as input, thus ignoring the temporal relationship between frames. In this work, we p…
▽ More
Steering a car through traffic is a complex task that is difficult to cast into algorithms. Therefore, researchers turn to training artificial neural networks from front-facing camera data stream along with the associated steering angles. Nevertheless, most existing solutions consider only the visual camera frames as input, thus ignoring the temporal relationship between frames. In this work, we propose a Convolutional Long Short-Term Memory Recurrent Neural Network (C-LSTM), that is end-to-end trainable, to learn both visual and dynamic temporal dependencies of driving. Additionally, We introduce posing the steering angle regression problem as classification while imposing a spatial relationship between the output layer neurons. Such method is based on learning a sinusoidal function that encodes steering angles. To train and validate our proposed methods, we used the publicly available Comma.ai dataset. Our solution improved steering root mean square error by 35% over recent methods, and led to a more stable steering by 87%.
△ Less
Submitted 22 November, 2017; v1 submitted 10 October, 2017;
originally announced October 2017.
-
Real-time Distracted Driver Posture Classification
Authors:
Yehya Abouelnaga,
Hesham M. Eraqi,
Mohamed N. Moustafa
Abstract:
In this paper, we present a new dataset for "distracted driver" posture estimation. In addition, we propose a novel system that achieves 95.98% driving posture estimation classification accuracy. The system consists of a genetically-weighted ensemble of Convolutional Neural Networks (CNNs). We show that a weighted ensemble of classifiers using a genetic algorithm yields in better classification co…
▽ More
In this paper, we present a new dataset for "distracted driver" posture estimation. In addition, we propose a novel system that achieves 95.98% driving posture estimation classification accuracy. The system consists of a genetically-weighted ensemble of Convolutional Neural Networks (CNNs). We show that a weighted ensemble of classifiers using a genetic algorithm yields in better classification confidence. We also study the effect of different visual elements (i.e. hands and face) in distraction detection and classification by means of face and hand localizations. Finally, we present a thinned version of our ensemble that could achieve a 94.29% classification accuracy and operate in a realtime environment.
△ Less
Submitted 29 November, 2018; v1 submitted 28 June, 2017;
originally announced June 2017.
-
Reactive Collision Avoidance using Evolutionary Neural Networks
Authors:
Hesham Eraqi,
Youssef EmadEldin,
Mohamed Moustafa
Abstract:
Collision avoidance systems can play a vital role in reducing the number of accidents and saving human lives. In this paper, we introduce and validate a novel method for vehicles reactive collision avoidance using evolutionary neural networks (ENN). A single front-facing rangefinder sensor is the only input required by our method. The training process and the proposed method analysis and validatio…
▽ More
Collision avoidance systems can play a vital role in reducing the number of accidents and saving human lives. In this paper, we introduce and validate a novel method for vehicles reactive collision avoidance using evolutionary neural networks (ENN). A single front-facing rangefinder sensor is the only input required by our method. The training process and the proposed method analysis and validation are carried out using simulation. Extensive experiments are conducted to analyse the proposed method and evaluate its performance. Firstly, we experiment the ability to learn collision avoidance in a static free track. Secondly, we analyse the effect of the rangefinder sensor resolution on the learning process. Thirdly, we experiment the ability of a vehicle to individually and simultaneously learn collision avoidance. Finally, we test the generality of the proposed method. We used a more realistic and powerful simulation environment (CarMaker), a camera as an alternative input sensor, and lane kee** as an extra feature to learn. The results are encouraging; the proposed method successfully allows vehicles to learn collision avoidance in different scenarios that are unseen during training. It also generalizes well if any of the input sensor, the simulator, or the task to be learned is changed.
△ Less
Submitted 27 September, 2016;
originally announced September 2016.