Search | arXiv e-print repository

arXiv:2312.10854 [pdf, other]

The Right Losses for the Right Gains: Improving the Semantic Consistency of Deep Text-to-Image Generation with Distribution-Sensitive Losses

Authors: Mahmoud Ahmed, Omer Moussa, Ismail Shaheen, Mohamed Abdelfattah, Amr Abdalla, Marwan Eid, Hesham Eraqi, Mohamed Moustafa

Abstract: One of the major challenges in training deep neural networks for text-to-image generation is the significant linguistic discrepancy between ground-truth captions of each image in most popular datasets. The large difference in the choice of words in such captions results in synthesizing images that are semantically dissimilar to each other and to their ground-truth counterparts. Moreover, existing… ▽ More One of the major challenges in training deep neural networks for text-to-image generation is the significant linguistic discrepancy between ground-truth captions of each image in most popular datasets. The large difference in the choice of words in such captions results in synthesizing images that are semantically dissimilar to each other and to their ground-truth counterparts. Moreover, existing models either fail to generate the fine-grained details of the image or require a huge number of parameters that renders them inefficient for text-to-image synthesis. To fill this gap in the literature, we propose using the contrastive learning approach with a novel combination of two loss functions: fake-to-fake loss to increase the semantic consistency between generated images of the same caption, and fake-to-real loss to reduce the gap between the distributions of real images and fake ones. We test this approach on two baseline models: SSAGAN and AttnGAN (with style blocks to enhance the fine-grained details of the images.) Results show that our approach improves the qualitative results on AttnGAN with style blocks on the CUB dataset. Additionally, on the challenging COCO dataset, our approach achieves competitive results against the state-of-the-art Lafite model, outperforms the FID score of SSAGAN model by 44. △ Less

Submitted 17 December, 2023; originally announced December 2023.

arXiv:2311.09178 [pdf, other]

RBPGAN: Recurrent Back-Projection GAN for Video Super Resolution

Authors: Marwah Sulaiman, Zahraa Shehabeldin, Israa Fahmy, Mohammed Barakat, Mohammed El-Naggar, Dareen Hussein, Moustafa Youssef, Hesham M. Eraqi

Abstract: Recently, video super resolution (VSR) has become a very impactful task in the area of Computer Vision due to its various applications. In this paper, we propose Recurrent Back-Projection Generative Adversarial Network (RBPGAN) for VSR in an attempt to generate temporally coherent solutions while preserving spatial details. RBPGAN integrates two state-of-the-art models to get the best in both worl… ▽ More Recently, video super resolution (VSR) has become a very impactful task in the area of Computer Vision due to its various applications. In this paper, we propose Recurrent Back-Projection Generative Adversarial Network (RBPGAN) for VSR in an attempt to generate temporally coherent solutions while preserving spatial details. RBPGAN integrates two state-of-the-art models to get the best in both worlds without compromising the accuracy of produced video. The generator of the model is inspired by RBPN system, while the discriminator is inspired by TecoGAN. We also utilize **-Pong loss to increase temporal consistency over time. Our contribution together results in a model that outperforms earlier work in terms of temporally consistent details, as we will demonstrate qualitatively and quantitatively using different datasets. △ Less

Submitted 10 December, 2023; v1 submitted 15 November, 2023; originally announced November 2023.

arXiv:2211.11579 [pdf, other]

doi 10.1109/TITS.2022.3214079

Dynamic Conditional Imitation Learning for Autonomous Driving

Authors: Hesham M. Eraqi, Mohamed N. Moustafa, Jens Honer

Abstract: Conditional imitation learning (CIL) trains deep neural networks, in an end-to-end manner, to mimic human driving. This approach has demonstrated suitable vehicle control when following roads, avoiding obstacles, or taking specific turns at intersections to reach a destination. Unfortunately, performance dramatically decreases when deployed to unseen environments and is inconsistent against varyin… ▽ More Conditional imitation learning (CIL) trains deep neural networks, in an end-to-end manner, to mimic human driving. This approach has demonstrated suitable vehicle control when following roads, avoiding obstacles, or taking specific turns at intersections to reach a destination. Unfortunately, performance dramatically decreases when deployed to unseen environments and is inconsistent against varying weather conditions. Most importantly, the current CIL fails to avoid static road blockages. In this work, we propose a solution to those deficiencies. First, we fuse the laser scanner with the regular camera streams, at the features level, to overcome the generalization and consistency challenges. Second, we introduce a new efficient Occupancy Grid Map** (OGM) method along with new algorithms for road blockages avoidance and global route planning. Consequently, our proposed method dynamically detects partial and full road blockages, and guides the controlled vehicle to another route to reach the destination. Following the original CIL work, we demonstrated the effectiveness of our proposal on CARLA simulator urban driving benchmark. Our experiments showed that our model improved consistency against weather conditions by four times and autonomous driving success rate generalization by 52%. Furthermore, our global route planner improved the driving success rate by 37%. Our proposed road blockages avoidance algorithm improved the driving success rate by 27%. Finally, the average kilometers traveled before a collision with a static object increased by 1.5 times. The main source code can be reached at https://heshameraqi.github.io/dynamic_cil_autonomous_driving. △ Less

Submitted 16 November, 2022; originally announced November 2022.

Comments: 14 pages, 11 figures, 7 tables

Journal ref: IEEE Transactions on Intelligent Transportation Systems

arXiv:2207.05692 [pdf, other]

Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

Authors: Hadeel Mabrouk, Omar Abugabal, Nourhan Sakr, Hesham M. Eraqi

Abstract: In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training. Impressive progress in the domain of speech recognition has been exhibited by audio and audio-visual systems. Nevertheless, there is still much to be explored with regards to vi… ▽ More In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training. Impressive progress in the domain of speech recognition has been exhibited by audio and audio-visual systems. Nevertheless, there is still much to be explored with regards to visual speech recognition systems due to the visual ambiguity of some phonemes. To this end, the development of visual speech recognition models is crucial given the instability of audio models. The main contributions of this work are i) building on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) to their systems; ii) leveraging audio data during training visual models, a feat which has not been utilized in prior word-based work; iii) proposing the Gaussian-shaped averaging in frame-level KD, as an efficient technique that aids the model in distilling knowledge at the sequence model encoder. This work proposes a novel and competitive architecture for lip-reading, as we demonstrate a noticeable improvement in performance, setting a new benchmark equals to 88.64% on the LRW dataset. △ Less

Submitted 5 June, 2022; originally announced July 2022.

Comments: arXiv admin note: text overlap with arXiv:2108.03543

arXiv:2203.15522 [pdf, other]

Collision-Free Navigation using Evolutionary Symmetrical Neural Networks

Authors: Hesham M. Eraqi, Mena Nagiub, Peter Sidra

Abstract: Collision avoidance systems play a vital role in reducing the number of vehicle accidents and saving human lives. This paper extends the previous work using evolutionary neural networks for reactive collision avoidance. We are proposing a new method we have called symmetric neural networks. The method improves the model's performance by enforcing constraints between the network weights which reduc… ▽ More Collision avoidance systems play a vital role in reducing the number of vehicle accidents and saving human lives. This paper extends the previous work using evolutionary neural networks for reactive collision avoidance. We are proposing a new method we have called symmetric neural networks. The method improves the model's performance by enforcing constraints between the network weights which reduces the model optimization search space and hence, learns more accurate control of the vehicle steering for improved maneuvering. The training and validation processes are carried out using a simulation environment - the codebase is publicly available. Extensive experiments are conducted to analyze the proposed method and evaluate its performance. The method is tested in several simulated driving scenarios. In addition, we have analyzed the effect of the rangefinder sensor resolution and noise on the overall goal of reactive collision avoidance. Finally, we have tested the generalization of the proposed method. The results are encouraging; the proposed method has improved the model's learning curve for training scenarios and generalization to the new test scenarios. Using constrained weights has significantly improved the number of generations required for the Genetic Algorithm optimization. △ Less

Submitted 11 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: arXiv admin note: text overlap with arXiv:1609.08414

Journal ref: 2022 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS 2022), Larnaca, Cyprus

arXiv:2111.01136 [pdf]

ASMDD: Arabic Speech Mispronunciation Detection Dataset

Authors: Salah A. Aly, Abdelrahman Salah, Hesham M. Eraqi

Abstract: The largest dataset of Arabic speech mispronunciation detections in Egyptian dialogues is introduced. The dataset is composed of annotated audio files representing the top 100 words that are most frequently used in the Arabic language, pronounced by 100 Egyptian children (aged between 2 and 8 years old). The dataset is collected and annotated on segmental pronunciation error detections by expert l… ▽ More The largest dataset of Arabic speech mispronunciation detections in Egyptian dialogues is introduced. The dataset is composed of annotated audio files representing the top 100 words that are most frequently used in the Arabic language, pronounced by 100 Egyptian children (aged between 2 and 8 years old). The dataset is collected and annotated on segmental pronunciation error detections by expert listeners. △ Less

Submitted 1 November, 2021; originally announced November 2021.

Comments: 3 pages, 2 tables, 2 figures, dataset link: https://drive.google.com/drive/folders/1dhlp-L0n6_RAzoosVK4bRa7hxBnzebqs

arXiv:2108.07661 [pdf, other]

An Evaluation of RGB and LiDAR Fusion for Semantic Segmentation

Authors: Amr S. Mohamed, Ali Abdelkader, Mohamed Anany, Omar El-Behady, Muhammad Faisal, Asser Hangal, Hesham M. Eraqi, Mohamed N. Moustafa

Abstract: LiDARs and cameras are the two main sensors that are planned to be included in many announced autonomous vehicles prototypes. Each of the two provides a unique form of data from a different perspective to the surrounding environment. In this paper, we explore and attempt to answer the question: is there an added benefit by fusing those two forms of data for the purpose of semantic segmentation wit… ▽ More LiDARs and cameras are the two main sensors that are planned to be included in many announced autonomous vehicles prototypes. Each of the two provides a unique form of data from a different perspective to the surrounding environment. In this paper, we explore and attempt to answer the question: is there an added benefit by fusing those two forms of data for the purpose of semantic segmentation within the context of autonomous driving? We also attempt to show at which level does said fusion prove to be the most useful. We evaluated our algorithms on the publicly available SemanticKITTI dataset. All fusion models show improvements over the base model, with the mid-level fusion showing the highest improvement of 2.7% in terms of mean Intersection over Union (mIoU) metric. △ Less

Submitted 17 August, 2021; originally announced August 2021.

arXiv:2108.03543 [pdf, other]

Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading

Authors: Shahd Elashmawy, Marian Ramsis, Hesham M. Eraqi, Farah Eldeshnawy, Hadeel Mabrouk, Omar Abugabal, Nourhan Sakr

Abstract: Despite the advancement in the domain of audio and audio-visual speech recognition, visual speech recognition systems are still quite under-explored due to the visual ambiguity of some phonemes. In this work, we propose a new lip-reading model that combines three contributions. First, the model front-end adopts a spatio-temporal attention mechanism to help extract the informative data from the inp… ▽ More Despite the advancement in the domain of audio and audio-visual speech recognition, visual speech recognition systems are still quite under-explored due to the visual ambiguity of some phonemes. In this work, we propose a new lip-reading model that combines three contributions. First, the model front-end adopts a spatio-temporal attention mechanism to help extract the informative data from the input visual frames. Second, the model back-end utilizes a sequence-level and frame-level Knowledge Distillation (KD) techniques that allow leveraging audio data during the visual model training. Third, a data preprocessing pipeline is adopted that includes facial landmarks detection-based lip-alignment. On LRW lip-reading dataset benchmark, a noticeable accuracy improvement is demonstrated; the spatio-temporal attention, Knowledge Distillation, and lip-alignment contributions achieved 88.43%, 88.64%, and 88.37% respectively. △ Less

Submitted 7 August, 2021; originally announced August 2021.

arXiv:2108.02148 [pdf, other]

Pervasive Hand Gesture Recognition for Smartphones using Non-audible Sound and Deep Learning

Authors: Ahmed Ibrahim, Ayman El-Refai, Sara Ahmed, Mariam Aboul-Ela, Hesham M. Eraqi, Mohamed Moustafa

Abstract: Due to the mass advancement in ubiquitous technologies nowadays, new pervasive methods have come into the practice to provide new innovative features and stimulate the research on new human-computer interactions. This paper presents a hand gesture recognition method that utilizes the smartphone's built-in speakers and microphones. The proposed system emits an ultrasonic sonar-based signal (inaudib… ▽ More Due to the mass advancement in ubiquitous technologies nowadays, new pervasive methods have come into the practice to provide new innovative features and stimulate the research on new human-computer interactions. This paper presents a hand gesture recognition method that utilizes the smartphone's built-in speakers and microphones. The proposed system emits an ultrasonic sonar-based signal (inaudible sound) from the smartphone's stereo speakers, which is then received by the smartphone's microphone and processed via a Convolutional Neural Network (CNN) for Hand Gesture Recognition. Data augmentation techniques are proposed to improve the detection accuracy and three dual-channel input fusion methods are compared. The first method merges the dual-channel audio as a single input spectrogram image. The second method adopts early fusion by concatenating the dual-channel spectrograms. The third method adopts late fusion by having two convectional input branches processing each of the dual-channel spectrograms and then the outputs are merged by the last layers. Our experimental results demonstrate a promising detection accuracy for the six gestures presented in our publicly available dataset with an accuracy of 93.58\% as a baseline. △ Less

Submitted 4 August, 2021; originally announced August 2021.

arXiv:1907.07748 [pdf, other]

End-to-end sensor modeling for LiDAR Point Cloud

Authors: Khaled Elmadawi, Moemen Abdelrazek, Mohamed Elsobky, Hesham M. Eraqi, Mohamed Zahran

Abstract: Advanced sensors are a key to enable self-driving cars technology. Laser scanner sensors (LiDAR, Light Detection And Ranging) became a fundamental choice due to its long-range and robustness to low light driving conditions. The problem of designing a control software for self-driving cars is a complex task to explicitly formulate in rule-based systems, thus recent approaches rely on machine learni… ▽ More Advanced sensors are a key to enable self-driving cars technology. Laser scanner sensors (LiDAR, Light Detection And Ranging) became a fundamental choice due to its long-range and robustness to low light driving conditions. The problem of designing a control software for self-driving cars is a complex task to explicitly formulate in rule-based systems, thus recent approaches rely on machine learning that can learn those rules from data. The major problem with such approaches is that the amount of training data required for generalizing a machine learning model is big, and on the other hand LiDAR data annotation is very costly compared to other car sensors. An accurate LiDAR sensor model can cope with such problem. Moreover, its value goes beyond this because existing LiDAR development, validation, and evaluation platforms and processes are very costly, and virtual testing and development environments are still immature in terms of physical properties representation. In this work we propose a novel Deep Learning-based LiDAR sensor model. This method models the sensor echos, using a Deep Neural Network to model echo pulse widths learned from real data using Polar Grid Maps (PGM). We benchmark our model performance against comprehensive real sensor data and very promising results are achieved that sets a baseline for future works. △ Less

Submitted 17 July, 2019; originally announced July 2019.

Comments: Accepted in IEEE Intelligent Transportation Systems Conference - ITSC 2019

arXiv:1901.09097 [pdf, other]

Driver Distraction Identification with an Ensemble of Convolutional Neural Networks

Authors: Hesham M. Eraqi, Yehya Abouelnaga, Mohamed H. Saad, Mohamed N. Moustafa

Abstract: The World Health Organization (WHO) reported 1.25 million deaths yearly due to road traffic accidents worldwide and the number has been continuously increasing over the last few years. Nearly fifth of these accidents are caused by distracted drivers. Existing work of distracted driver detection is concerned with a small set of distractions (mostly, cell phone usage). Unreliable ad-hoc methods are… ▽ More The World Health Organization (WHO) reported 1.25 million deaths yearly due to road traffic accidents worldwide and the number has been continuously increasing over the last few years. Nearly fifth of these accidents are caused by distracted drivers. Existing work of distracted driver detection is concerned with a small set of distractions (mostly, cell phone usage). Unreliable ad-hoc methods are often used.In this paper, we present the first publicly available dataset for driver distraction identification with more distraction postures than existing alternatives. In addition, we propose a reliable deep learning-based solution that achieves a 90% accuracy. The system consists of a genetically-weighted ensemble of convolutional neural networks, we show that a weighted ensemble of classifiers using a genetic algorithm yields in a better classification confidence. We also study the effect of different visual elements in distraction detection by means of face and hand localizations, and skin segmentation. Finally, we present a thinned version of our ensemble that could achieve 84.64% classification accuracy and operate in a real-time environment. △ Less

Submitted 22 January, 2019; originally announced January 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1706.09498

Journal ref: Journal of Advanced Transportation, Machine Learning in Transportation (MLT) Issue, 2019

arXiv:1801.00600 [pdf]

Static Free Space Detection with Laser Scanner using Occupancy Grid Maps

Authors: Hesham M. Eraqi, Jens Honer, Sebastian Zuther

Abstract: Drivable free space information is vital for autonomous vehicles that have to plan evasive maneuvers in real-time. In this paper, we present a new efficient method for environmental free space detection with laser scanner based on 2D occupancy grid maps (OGM) to be used for Advanced Driving Assistance Systems (ADAS) and Collision Avoidance Systems (CAS). Firstly, we introduce an enhanced inverse s… ▽ More Drivable free space information is vital for autonomous vehicles that have to plan evasive maneuvers in real-time. In this paper, we present a new efficient method for environmental free space detection with laser scanner based on 2D occupancy grid maps (OGM) to be used for Advanced Driving Assistance Systems (ADAS) and Collision Avoidance Systems (CAS). Firstly, we introduce an enhanced inverse sensor model tailored for high-resolution laser scanners for building OGM. It compensates the unreflected beams and deals with the ray casting to grid cells accuracy and computational effort problems. Secondly, we introduce the 'vehicle on a circle for grid maps' map alignment algorithm that allows building more accurate local maps by avoiding the computationally expensive inaccurate operations of image sub-pixel shifting and rotation. The resulted grid map is more convenient for ADAS features than existing methods, as it allows using less memory sizes, and hence, results into a better real-time performance. Thirdly, we present an algorithm to detect what we call the 'in-sight edges'. These edges guarantee modeling the free space area with a single polygon of a fixed number of vertices regardless the driving situation and map complexity. The results from real world experiments show the effectiveness of our approach. △ Less

Submitted 30 June, 2020; v1 submitted 2 January, 2018; originally announced January 2018.

Journal ref: IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, JAPAN, October 2017

arXiv:1710.03804 [pdf, other]

End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies

Authors: Hesham M. Eraqi, Mohamed N. Moustafa, Jens Honer

Abstract: Steering a car through traffic is a complex task that is difficult to cast into algorithms. Therefore, researchers turn to training artificial neural networks from front-facing camera data stream along with the associated steering angles. Nevertheless, most existing solutions consider only the visual camera frames as input, thus ignoring the temporal relationship between frames. In this work, we p… ▽ More Steering a car through traffic is a complex task that is difficult to cast into algorithms. Therefore, researchers turn to training artificial neural networks from front-facing camera data stream along with the associated steering angles. Nevertheless, most existing solutions consider only the visual camera frames as input, thus ignoring the temporal relationship between frames. In this work, we propose a Convolutional Long Short-Term Memory Recurrent Neural Network (C-LSTM), that is end-to-end trainable, to learn both visual and dynamic temporal dependencies of driving. Additionally, We introduce posing the steering angle regression problem as classification while imposing a spatial relationship between the output layer neurons. Such method is based on learning a sinusoidal function that encodes steering angles. To train and validate our proposed methods, we used the publicly available Comma.ai dataset. Our solution improved steering root mean square error by 35% over recent methods, and led to a more stable steering by 87%. △ Less

Submitted 22 November, 2017; v1 submitted 10 October, 2017; originally announced October 2017.

Comments: 31st Conference on Neural Information Processing Systems (NIPS), Machine Learning for Intelligent Transportation Systems Workshop, Long Beach, CA, USA, 2017

arXiv:1706.09498 [pdf, other]

Real-time Distracted Driver Posture Classification

Authors: Yehya Abouelnaga, Hesham M. Eraqi, Mohamed N. Moustafa

Abstract: In this paper, we present a new dataset for "distracted driver" posture estimation. In addition, we propose a novel system that achieves 95.98% driving posture estimation classification accuracy. The system consists of a genetically-weighted ensemble of Convolutional Neural Networks (CNNs). We show that a weighted ensemble of classifiers using a genetic algorithm yields in better classification co… ▽ More In this paper, we present a new dataset for "distracted driver" posture estimation. In addition, we propose a novel system that achieves 95.98% driving posture estimation classification accuracy. The system consists of a genetically-weighted ensemble of Convolutional Neural Networks (CNNs). We show that a weighted ensemble of classifiers using a genetic algorithm yields in better classification confidence. We also study the effect of different visual elements (i.e. hands and face) in distraction detection and classification by means of face and hand localizations. Finally, we present a thinned version of our ensemble that could achieve a 94.29% classification accuracy and operate in a realtime environment. △ Less

Submitted 29 November, 2018; v1 submitted 28 June, 2017; originally announced June 2017.

Journal ref: 32nd Conference on Neural Information Processing Systems (NIPS 2018), Workshop on Machine Learning for Intelligent Transportation Systems

arXiv:1609.08414 [pdf]

Reactive Collision Avoidance using Evolutionary Neural Networks

Authors: Hesham Eraqi, Youssef EmadEldin, Mohamed Moustafa

Abstract: Collision avoidance systems can play a vital role in reducing the number of accidents and saving human lives. In this paper, we introduce and validate a novel method for vehicles reactive collision avoidance using evolutionary neural networks (ENN). A single front-facing rangefinder sensor is the only input required by our method. The training process and the proposed method analysis and validatio… ▽ More Collision avoidance systems can play a vital role in reducing the number of accidents and saving human lives. In this paper, we introduce and validate a novel method for vehicles reactive collision avoidance using evolutionary neural networks (ENN). A single front-facing rangefinder sensor is the only input required by our method. The training process and the proposed method analysis and validation are carried out using simulation. Extensive experiments are conducted to analyse the proposed method and evaluate its performance. Firstly, we experiment the ability to learn collision avoidance in a static free track. Secondly, we analyse the effect of the rangefinder sensor resolution on the learning process. Thirdly, we experiment the ability of a vehicle to individually and simultaneously learn collision avoidance. Finally, we test the generality of the proposed method. We used a more realistic and powerful simulation environment (CarMaker), a camera as an alternative input sensor, and lane kee** as an extra feature to learn. The results are encouraging; the proposed method successfully allows vehicles to learn collision avoidance in different scenarios that are unseen during training. It also generalizes well if any of the input sensor, the simulator, or the task to be learned is changed. △ Less

Submitted 27 September, 2016; originally announced September 2016.

Comments: ECTA 2016. Final paper is at SCITEPRESS digital library

Showing 1–15 of 15 results for author: Eraqi, H