Search | arXiv e-print repository

LetsMap: Unsupervised Representation Learning for Semantic BEV Map**

Authors: Nikhil Gosala, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Paulo Drews-Jr, Wolfram Burgard, Abhinav Valada

Abstract: Semantic Bird's Eye View (BEV) maps offer a rich representation with strong occlusion reasoning for various decision making tasks in autonomous driving. However, most BEV map** approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation lea… ▽ More Semantic Bird's Eye View (BEV) maps offer a rich representation with strong occlusion reasoning for various decision making tasks in autonomous driving. However, most BEV map** approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner. Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner and then finetunes it for the task of semantic BEV map** using only a small fraction of labels in the BEV. We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene geometry while relying on a novel temporal masked autoencoder formulation to encode the scene representation. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach performs on par with the existing state-of-the-art approaches while using only 1% of BEV labels and no additional labeled data. △ Less

Submitted 29 May, 2024; originally announced May 2024.

Comments: 23 pages, 5 figures

arXiv:2404.13443 [pdf, other]

FisheyeDetNet: 360° Surround view Fisheye Camera based Object Detection System for Autonomous Driving

Authors: Ganesh Sistu, Senthil Yogamani

Abstract: Object detection is a mature problem in autonomous driving with pedestrian detection being one of the first deployed algorithms. It has been comprehensively studied in the literature. However, object detection is relatively less explored for fisheye cameras used for surround-view near field sensing. The standard bounding box representation fails in fisheye cameras due to heavy radial distortion, p… ▽ More Object detection is a mature problem in autonomous driving with pedestrian detection being one of the first deployed algorithms. It has been comprehensively studied in the literature. However, object detection is relatively less explored for fisheye cameras used for surround-view near field sensing. The standard bounding box representation fails in fisheye cameras due to heavy radial distortion, particularly in the periphery. To mitigate this, we explore extending the standard object detection output representation of bounding box. We design rotated bounding boxes, ellipse, generic polygon as polar arc/angle representations and define an instance segmentation mIOU metric to analyze these representations. The proposed model FisheyeDetNet with polygon outperforms others and achieves a mAP score of 49.5 % on Valeo fisheye surround-view dataset for automated driving applications. This dataset has 60K images captured from 4 surround-view cameras across Europe, North America and Asia. To the best of our knowledge, this is the first detailed study on object detection on fisheye cameras for autonomous driving scenarios. △ Less

Submitted 27 April, 2024; v1 submitted 20 April, 2024; originally announced April 2024.

Comments: arXiv admin note: text overlap with arXiv:2206.05542 by other authors

arXiv:2404.06352 [pdf, other]

DaF-BEVSeg: Distortion-aware Fisheye Camera based Bird's Eye View Segmentation with Occlusion Reasoning

Authors: Senthil Yogamani, David Unger, Venkatraman Narayanan, Varun Ravi Kumar

Abstract: Semantic segmentation is an effective way to perform scene understanding. Recently, segmentation in 3D Bird's Eye View (BEV) space has become popular as its directly used by drive policy. However, there is limited work on BEV segmentation for surround-view fisheye cameras, commonly used in commercial vehicles. As this task has no real-world public dataset and existing synthetic datasets do not han… ▽ More Semantic segmentation is an effective way to perform scene understanding. Recently, segmentation in 3D Bird's Eye View (BEV) space has become popular as its directly used by drive policy. However, there is limited work on BEV segmentation for surround-view fisheye cameras, commonly used in commercial vehicles. As this task has no real-world public dataset and existing synthetic datasets do not handle amodal regions due to occlusion, we create a synthetic dataset using the Cognata simulator comprising diverse road types, weather, and lighting conditions. We generalize the BEV segmentation to work with any camera model; this is useful for mixing diverse cameras. We implement a baseline by applying cylindrical rectification on the fisheye images and using a standard LSS-based BEV segmentation model. We demonstrate that we can achieve better performance without undistortion, which has the adverse effects of increased runtime due to pre-processing, reduced field-of-view, and resampling artifacts. Further, we introduce a distortion-aware learnable BEV pooling strategy that is more effective for the fisheye cameras. We extend the model with an occlusion reasoning module, which is critical for estimating in BEV space. Qualitative performance of DaF-BEVSeg is showcased in the video at https://streamable.com/ge4v51. △ Less

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2403.16338 [pdf, other]

Impact of Video Compression Artifacts on Fisheye Camera Visual Perception Tasks

Authors: Madhumitha Sakthi, Louis Kerofsky, Varun Ravi Kumar, Senthil Yogamani

Abstract: Autonomous driving systems require extensive data collection schemes to cover the diverse scenarios needed for building a robust and safe system. The data volumes are in the order of Exabytes and have to be stored for a long period of time (i.e., more than 10 years of the vehicle's life cycle). Lossless compression doesn't provide sufficient compression ratios, hence, lossy video compression has b… ▽ More Autonomous driving systems require extensive data collection schemes to cover the diverse scenarios needed for building a robust and safe system. The data volumes are in the order of Exabytes and have to be stored for a long period of time (i.e., more than 10 years of the vehicle's life cycle). Lossless compression doesn't provide sufficient compression ratios, hence, lossy video compression has been explored. It is essential to prove that lossy video compression artifacts do not impact the performance of the perception algorithms. However, there is limited work in this area to provide a solid conclusion. In particular, there is no such work for fisheye cameras, which have high radial distortion and where compression may have higher artifacts. Fisheye cameras are commonly used in automotive systems for 3D object detection task. In this work, we provide the first analysis of the impact of standard video compression codecs on wide FOV fisheye camera images. We demonstrate that the achievable compression with negligible impact depends on the dataset and temporal prediction of the video codec. We propose a radial distortion-aware zonal metric to evaluate the performance of artifacts in fisheye images. In addition, we present a novel method for estimating affine mode parameters of the latest VVC codec, and suggest some areas for improvement in video codecs for the application to fisheye imagery. △ Less

Submitted 24 March, 2024; originally announced March 2024.

arXiv:2403.11761 [pdf, other]

BEVCar: Camera-Radar Fusion for BEV Map and Object Segmentation

Authors: Jonas Schramm, Niclas Vödisch, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Wolfram Burgard, Abhinav Valada

Abstract: Semantic scene segmentation from a bird's-eye-view (BEV) perspective plays a crucial role in facilitating planning and decision-making for mobile robots. Although recent vision-only methods have demonstrated notable advancements in performance, they often struggle under adverse illumination conditions such as rain or nighttime. While active sensors offer a solution to this challenge, the prohibiti… ▽ More Semantic scene segmentation from a bird's-eye-view (BEV) perspective plays a crucial role in facilitating planning and decision-making for mobile robots. Although recent vision-only methods have demonstrated notable advancements in performance, they often struggle under adverse illumination conditions such as rain or nighttime. While active sensors offer a solution to this challenge, the prohibitively high cost of LiDARs remains a limiting factor. Fusing camera data with automotive radars poses a more inexpensive alternative but has received less attention in prior research. In this work, we aim to advance this promising avenue by introducing BEVCar, a novel approach for joint BEV object and map segmentation. The core novelty of our approach lies in first learning a point-based encoding of raw radar data, which is then leveraged to efficiently initialize the lifting of image features into the BEV space. We perform extensive experiments on the nuScenes dataset and demonstrate that BEVCar outperforms the current state of the art. Moreover, we show that incorporating radar information significantly enhances robustness in challenging environmental conditions and improves segmentation performance for distant objects. To foster future research, we provide the weather split of the nuScenes dataset used in our experiments, along with our code and trained models at http://bevcar.cs.uni-freiburg.de. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.06826 [pdf, other]

Neural Rendering based Urban Scene Reconstruction for Autonomous Driving

Authors: Shihao Shen, Louis Kerofsky, Varun Ravi Kumar, Senthil Yogamani

Abstract: Dense 3D reconstruction has many applications in automated driving including automated annotation validation, multimodal data augmentation, providing ground truth annotations for systems lacking LiDAR, as well as enhancing auto-labeling accuracy. LiDAR provides highly accurate but sparse depth, whereas camera images enable estimation of dense depth but noisy particularly at long ranges. In this pa… ▽ More Dense 3D reconstruction has many applications in automated driving including automated annotation validation, multimodal data augmentation, providing ground truth annotations for systems lacking LiDAR, as well as enhancing auto-labeling accuracy. LiDAR provides highly accurate but sparse depth, whereas camera images enable estimation of dense depth but noisy particularly at long ranges. In this paper, we harness the strengths of both sensors and propose a multimodal 3D scene reconstruction using a framework combining neural implicit surfaces and radiance fields. In particular, our method estimates dense and accurate 3D structures and creates an implicit map representation based on signed distance fields, which can be further rendered into RGB images, and depth maps. A mesh can be extracted from the learned signed distance field and culled based on occlusion. Dynamic objects are efficiently filtered on the fly during sampling using 3D object detection models. We demonstrate qualitative and quantitative results on challenging automotive scenes. △ Less

Submitted 9 February, 2024; originally announced February 2024.

Comments: Accepted for publication in Electronic Imaging, Autonomous Vehicles and Machines 2024. Qualitative results are shared in https://youtu.be/EK47fYJiY3M

arXiv:2309.09080 [pdf, other]

Multi-camera Bird's Eye View Perception for Autonomous Driving

Authors: David Unger, Nikhil Gosala, Varun Ravi Kumar, Shubhankar Borse, Abhinav Valada, Senthil Yogamani

Abstract: Most automated driving systems comprise a diverse sensor set, including several cameras, Radars, and LiDARs, ensuring a complete 360°coverage in near and far regions. Unlike Radar and LiDAR, which measure directly in 3D, cameras capture a 2D perspective projection with inherent depth ambiguity. However, it is essential to produce perception outputs in 3D to enable the spatial reasoning of other ag… ▽ More Most automated driving systems comprise a diverse sensor set, including several cameras, Radars, and LiDARs, ensuring a complete 360°coverage in near and far regions. Unlike Radar and LiDAR, which measure directly in 3D, cameras capture a 2D perspective projection with inherent depth ambiguity. However, it is essential to produce perception outputs in 3D to enable the spatial reasoning of other agents and structures for optimal path planning. The 3D space is typically simplified to the BEV space by omitting the less relevant Z-coordinate, which corresponds to the height dimension.The most basic approach to achieving the desired BEV representation from a camera image is IPM, assuming a flat ground surface. Surround vision systems that are pretty common in new vehicles use the IPM principle to generate a BEV image and to show it on display to the driver. However, this approach is not suited for autonomous driving since there are severe distortions introduced by this too-simplistic transformation method. More recent approaches use deep neural networks to output directly in BEV space. These methods transform camera images into BEV space using geometric constraints implicitly or explicitly in the network. As CNN has more context information and a learnable transformation can be more flexible and adapt to image content, the deep learning-based methods set the new benchmark for BEV transformation and achieve state-of-the-art performance. First, this chapter discusses the contemporary trends of multi-camera-based DNN (deep neural network) models outputting object representations directly in the BEV space. Then, we discuss how this approach can extend to effective sensor fusion and coupling downstream tasks like situation analysis and prediction. Finally, we show challenges and open problems in BEV perception. △ Less

Submitted 19 September, 2023; v1 submitted 16 September, 2023; originally announced September 2023.

Comments: Taylor & Francis (CRC Press) book chapter. Book title: Computer Vision: Challenges, Trends, and Opportunities

arXiv:2307.08850 [pdf, other]

LiDAR-BEVMTN: Real-Time LiDAR Bird's-Eye View Multi-Task Perception Network for Autonomous Driving

Authors: Sambit Mohapatra, Senthil Yogamani, Varun Ravi Kumar, Stefan Milz, Heinrich Gotzig, Patrick Mäder

Abstract: LiDAR is crucial for robust 3D scene perception in autonomous driving. LiDAR perception has the largest body of literature after camera perception. However, multi-task learning across tasks like detection, segmentation, and motion estimation using LiDAR remains relatively unexplored, especially on automotive-grade embedded platforms. We present a real-time multi-task convolutional neural network f… ▽ More LiDAR is crucial for robust 3D scene perception in autonomous driving. LiDAR perception has the largest body of literature after camera perception. However, multi-task learning across tasks like detection, segmentation, and motion estimation using LiDAR remains relatively unexplored, especially on automotive-grade embedded platforms. We present a real-time multi-task convolutional neural network for LiDAR-based object detection, semantics, and motion segmentation. The unified architecture comprises a shared encoder and task-specific decoders, enabling joint representation learning. We propose a novel Semantic Weighting and Guidance (SWAG) module to transfer semantic features for improved object detection selectively. Our heterogeneous training scheme combines diverse datasets and exploits complementary cues between tasks. The work provides the first embedded implementation unifying these key perception tasks from LiDAR point clouds achieving 3ms latency on the embedded NVIDIA Xavier platform. We achieve state-of-the-art results for two tasks, semantic and motion segmentation, and close to state-of-the-art performance for 3D object detection. By maximizing hardware efficiency and leveraging multi-task synergies, our method delivers an accurate and efficient solution tailored for real-world automated driving deployment. Qualitative results can be seen at https://youtu.be/H-hWRzv2lIY. △ Less

Submitted 17 July, 2023; originally announced July 2023.

arXiv:2306.03810 [pdf, other]

X-Align++: cross-modal cross-view alignment for Bird's-eye-view segmentation

Authors: Shubhankar Borse, Senthil Yogamani, Marvin Klingner, Varun Ravi, Hong Cai, Abdulaziz Almuzairee, Fatih Porikli

Abstract: Bird's-eye-view (BEV) grid is a typical representation of the perception of road components, e.g., drivable area, in autonomous driving. Most existing approaches rely on cameras only to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. The latest works leverage both camera and LiDAR modalities but suboptimally fuse their features us… ▽ More Bird's-eye-view (BEV) grid is a typical representation of the perception of road components, e.g., drivable area, in autonomous driving. Most existing approaches rely on cameras only to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. The latest works leverage both camera and LiDAR modalities but suboptimally fuse their features using simple, concatenation-based mechanisms. In this paper, we address these problems by enhancing the alignment of the unimodal features in order to aid feature fusion, as well as enhancing the alignment between the cameras' perspective view (PV) and BEV representations. We propose X-Align, a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation consisting of the following components: (i) a novel Cross-Modal Feature Alignment (X-FA) loss, (ii) an attention-based Cross-Modal Feature Fusion (X-FF) module to align multi-modal BEV features implicitly, and (iii) an auxiliary PV segmentation branch with Cross-View Segmentation Alignment (X-SA) losses to improve the PV-to-BEV transformation. We evaluate our proposed method across two commonly used benchmark datasets, i.e., nuScenes and KITTI-360. Notably, X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes. We also provide extensive ablation studies to demonstrate the effectiveness of the individual components. △ Less

Submitted 6 June, 2023; originally announced June 2023.

Comments: Accepted for publication at Springer Machine Vision and Applications Journal. The Version of Record of this article is published in Machine Vision and Applications Journal, and is available online at https://doi.org/10.1007/s00138-023-01400-7. arXiv admin note: substantial text overlap with arXiv:2210.06778

arXiv:2303.02203 [pdf, other]

X$^3$KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection

Authors: Marvin Klingner, Shubhankar Borse, Varun Ravi Kumar, Behnaz Rezaei, Venkatraman Narayanan, Senthil Yogamani, Fatih Porikli

Abstract: Recent advances in 3D object detection (3DOD) have obtained remarkably strong results for LiDAR-based models. In contrast, surround-view 3DOD models based on multiple camera images underperform due to the necessary view transformation of features from perspective view (PV) to a 3D world representation which is ambiguous due to missing depth information. This paper introduces X$^3$KD, a comprehensi… ▽ More Recent advances in 3D object detection (3DOD) have obtained remarkably strong results for LiDAR-based models. In contrast, surround-view 3DOD models based on multiple camera images underperform due to the necessary view transformation of features from perspective view (PV) to a 3D world representation which is ambiguous due to missing depth information. This paper introduces X$^3$KD, a comprehensive knowledge distillation framework across different modalities, tasks, and stages for multi-camera 3DOD. Specifically, we propose cross-task distillation from an instance segmentation teacher (X-IS) in the PV feature extraction stage providing supervision without ambiguous error backpropagation through the view transformation. After the transformation, we apply cross-modal feature distillation (X-FD) and adversarial training (X-AT) to improve the 3D world representation of multi-camera features through the information contained in a LiDAR-based 3DOD teacher. Finally, we also employ this teacher for cross-modal output distillation (X-OD), providing dense supervision at the prediction stage. We perform extensive ablations of knowledge distillation at different stages of multi-camera 3DOD. Our final X$^3$KD model outperforms previous state-of-the-art approaches on the nuScenes and Waymo datasets and generalizes to RADAR-based 3DOD. Qualitative results video at https://youtu.be/1do9DPFmr38. △ Less

Submitted 3 March, 2023; originally announced March 2023.

Comments: Accepted to CVPR 2023

arXiv:2301.04422 [pdf, other]

Optical Flow for Autonomous Driving: Applications, Challenges and Improvements

Authors: Shihao Shen, Louis Kerofsky, Senthil Yogamani

Abstract: Optical flow estimation is a well-studied topic for automated driving applications. Many outstanding optical flow estimation methods have been proposed, but they become erroneous when tested in challenging scenarios that are commonly encountered. Despite the increasing use of fisheye cameras for near-field sensing in automated driving, there is very limited literature on optical flow estimation wi… ▽ More Optical flow estimation is a well-studied topic for automated driving applications. Many outstanding optical flow estimation methods have been proposed, but they become erroneous when tested in challenging scenarios that are commonly encountered. Despite the increasing use of fisheye cameras for near-field sensing in automated driving, there is very limited literature on optical flow estimation with strong lens distortion. Thus we propose and evaluate training strategies to improve a learning-based optical flow algorithm by leveraging the only existing fisheye dataset with optical flow ground truth. While trained with synthetic data, the model demonstrates strong capabilities to generalize to real world fisheye data. The other challenge neglected by existing state-of-the-art algorithms is low light. We propose a novel, generic semi-supervised framework that significantly boosts performances of existing methods in such conditions. To the best of our knowledge, this is the first approach that explicitly handles optical flow estimation in low light. △ Less

Submitted 11 January, 2023; originally announced January 2023.

Comments: Accepted for publication in Electronic Imaging, Autonomous Vehicles and Machines 2023

arXiv:2210.06778 [pdf, other]

X-Align: Cross-Modal Cross-View Alignment for Bird's-Eye-View Segmentation

Authors: Shubhankar Borse, Marvin Klingner, Varun Ravi Kumar, Hong Cai, Abdulaziz Almuzairee, Senthil Yogamani, Fatih Porikli

Abstract: Bird's-eye-view (BEV) grid is a common representation for the perception of road components, e.g., drivable area, in autonomous driving. Most existing approaches rely on cameras only to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. Latest works leverage both camera and LiDAR modalities, but sub-optimally fuse their features usin… ▽ More Bird's-eye-view (BEV) grid is a common representation for the perception of road components, e.g., drivable area, in autonomous driving. Most existing approaches rely on cameras only to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. Latest works leverage both camera and LiDAR modalities, but sub-optimally fuse their features using simple, concatenation-based mechanisms. In this paper, we address these problems by enhancing the alignment of the unimodal features in order to aid feature fusion, as well as enhancing the alignment between the cameras' perspective view (PV) and BEV representations. We propose X-Align, a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation consisting of the following components: (i) a novel Cross-Modal Feature Alignment (X-FA) loss, (ii) an attention-based Cross-Modal Feature Fusion (X-FF) module to align multi-modal BEV features implicitly, and (iii) an auxiliary PV segmentation branch with Cross-View Segmentation Alignment (X-SA) losses to improve the PV-to-BEV transformation. We evaluate our proposed method across two commonly used benchmark datasets, i.e., nuScenes and KITTI-360. Notably, X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes. We also provide extensive ablation studies to demonstrate the effectiveness of the individual components. △ Less

Submitted 31 October, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: Accepted to WACV 2023

arXiv:2206.12912 [pdf, other]

Woodscape Fisheye Object Detection for Autonomous Driving -- CVPR 2022 OmniCV Workshop Challenge

Authors: Saravanabalagi Ramachandran, Ganesh Sistu, Varun Ravi Kumar, John McDonald, Senthil Yogamani

Abstract: Object detection is a comprehensively studied problem in autonomous driving. However, it has been relatively less explored in the case of fisheye cameras. The strong radial distortion breaks the translation invariance inductive bias of Convolutional Neural Networks. Thus, we present the WoodScape fisheye object detection challenge for autonomous driving which was held as part of the CVPR 2022 Work… ▽ More Object detection is a comprehensively studied problem in autonomous driving. However, it has been relatively less explored in the case of fisheye cameras. The strong radial distortion breaks the translation invariance inductive bias of Convolutional Neural Networks. Thus, we present the WoodScape fisheye object detection challenge for autonomous driving which was held as part of the CVPR 2022 Workshop on Omnidirectional Computer Vision (OmniCV). This is one of the first competitions focused on fisheye camera object detection. We encouraged the participants to design models which work natively on fisheye images without rectification. We used CodaLab to host the competition based on the publicly available WoodScape fisheye dataset. In this paper, we provide a detailed analysis on the competition which attracted the participation of 120 global teams and a total of 1492 submissions. We briefly discuss the details of the winning methods and analyze their qualitative and quantitative results. △ Less

Submitted 26 June, 2022; originally announced June 2022.

Comments: Workshop on Omnidirectional Computer Vision (OmniCV) at Conference on Computer Vision and Pattern Recognition (CVPR) 2021

arXiv:2206.12738 [pdf, other]

Self-Supervised 3D Monocular Object Detection by Recycling Bounding Boxes

Authors: Sugirtha T, Sridevi M, Khailash Santhakumar, Hao Liu, B Ravi Kiran, Thomas Gauthier, Senthil Yogamani

Abstract: Modern object detection architectures are moving towards employing self-supervised learning (SSL) to improve performance detection with related pretext tasks. Pretext tasks for monocular 3D object detection have not yet been explored yet in literature. The paper studies the application of established self-supervised bounding box recycling by labeling random windows as the pretext task. The classif… ▽ More Modern object detection architectures are moving towards employing self-supervised learning (SSL) to improve performance detection with related pretext tasks. Pretext tasks for monocular 3D object detection have not yet been explored yet in literature. The paper studies the application of established self-supervised bounding box recycling by labeling random windows as the pretext task. The classifier head of the 3D detector is trained to classify random windows containing different proportions of the ground truth objects, thus handling the foreground-background imbalance. We evaluate the pretext task using the RTM3D detection model as baseline, with and without the application of data augmentation. We demonstrate improvements of between 2-3 % in mAP 3D and 0.9-1.5 % BEV scores using SSL over the baseline scores. We propose the inverse class frequency re-weighted (ICFW) mAP score that highlights improvements in detection for low frequency classes in a class imbalanced dataset with long tails. We demonstrate improvements in ICFW both mAP 3D and BEV scores to take into account the class imbalance in the KITTI validation dataset. We see 4-5 % increase in ICFW metric with the pretext task. △ Less

Submitted 25 June, 2022; originally announced June 2022.

Comments: Published at ICCVW-SSLAD 2021. arXiv admin note: substantial text overlap with arXiv:2104.10786

arXiv:2206.02876 [pdf, other]

SpikiLi: A Spiking Simulation of LiDAR based Real-time Object Detection for Autonomous Driving

Authors: Sambit Mohapatra, Thomas Mesquida, Mona Hodaei, Senthil Yogamani, Heinrich Gotzig, Patrick Mader

Abstract: Spiking Neural Networks are a recent and new neural network design approach that promises tremendous improvements in power efficiency, computation efficiency, and processing latency. They do so by using asynchronous spike-based data flow, event-based signal generation, processing, and modifying the neuron model to resemble biological neurons closely. While some initial works have shown significant… ▽ More Spiking Neural Networks are a recent and new neural network design approach that promises tremendous improvements in power efficiency, computation efficiency, and processing latency. They do so by using asynchronous spike-based data flow, event-based signal generation, processing, and modifying the neuron model to resemble biological neurons closely. While some initial works have shown significant initial evidence of applicability to common deep learning tasks, their applications in complex real-world tasks has been relatively low. In this work, we first illustrate the applicability of spiking neural networks to a complex deep learning task namely Lidar based 3D object detection for automated driving. Secondly, we make a step-by-step demonstration of simulating spiking behavior using a pre-trained convolutional neural network. We closely model essential aspects of spiking neural networks in simulation and achieve equivalent run-time and accuracy on a GPU. When the model is realized on a neuromorphic hardware, we expect to have significantly improved power efficiency. △ Less

Submitted 6 June, 2022; originally announced June 2022.

Comments: Accepted at Workshop on Event Sensing and Neuromorphic Engineering - 8th International Conference on Event-based Control, Communication, and Signal Processing

arXiv:2205.15667 [pdf, other]

ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation

Authors: Pramit Dutta, Ganesh Sistu, Senthil Yogamani, Edgar Galván, John McDonald

Abstract: Generating a detailed near-field perceptual model of the environment is an important and challenging problem in both self-driving vehicles and autonomous mobile robotics. A Bird Eye View (BEV) map, providing a panoptic representation, is a commonly used approach that provides a simplified 2D representation of the vehicle surroundings with accurate semantic level segmentation for many downstream ta… ▽ More Generating a detailed near-field perceptual model of the environment is an important and challenging problem in both self-driving vehicles and autonomous mobile robotics. A Bird Eye View (BEV) map, providing a panoptic representation, is a commonly used approach that provides a simplified 2D representation of the vehicle surroundings with accurate semantic level segmentation for many downstream tasks. Current state-of-the art approaches to generate BEV-maps employ a Convolutional Neural Network (CNN) backbone to create feature-maps which are passed through a spatial transformer to project the derived features onto the BEV coordinate frame. In this paper, we evaluate the use of vision transformers (ViT) as a backbone architecture to generate BEV maps. Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image. The resulting representation is then provided as an input to a spatial transformer decoder module which outputs segmentation maps in the BEV grid. We evaluate our approach on the nuScenes dataset demonstrating a considerable improvement in the performance relative to state-of-the-art approaches. △ Less

Submitted 31 May, 2022; originally announced May 2022.

Comments: Accepted for 2022 IEEE World Congress on Computational Intelligence (Track: IJCNN)

arXiv:2205.13281 [pdf, other]

Surround-view Fisheye Camera Perception for Automated Driving: Overview, Survey and Challenges

Authors: Varun Ravi Kumar, Ciaran Eising, Christian Witt, Senthil Yogamani

Abstract: Surround-view fisheye cameras are commonly used for near-field sensing in automated driving. Four fisheye cameras on four sides of the vehicle are sufficient to cover 360° around the vehicle capturing the entire near-field region. Some primary use cases are automated parking, traffic jam assist, and urban driving. There are limited datasets and very little work on near-field perception tasks as th… ▽ More Surround-view fisheye cameras are commonly used for near-field sensing in automated driving. Four fisheye cameras on four sides of the vehicle are sufficient to cover 360° around the vehicle capturing the entire near-field region. Some primary use cases are automated parking, traffic jam assist, and urban driving. There are limited datasets and very little work on near-field perception tasks as the focus in automotive perception is on far-field perception. In contrast to far-field, surround-view perception poses additional challenges due to high precision object detection requirements of 10cm and partial visibility of objects. Due to the large radial distortion of fisheye cameras, standard algorithms cannot be extended easily to the surround-view use case. Thus, we are motivated to provide a self-contained reference for automotive fisheye camera perception for researchers and practitioners. Firstly, we provide a unified and taxonomic treatment of commonly used fisheye camera models. Secondly, we discuss various perception tasks and existing literature. Finally, we discuss the challenges and future direction. △ Less

Submitted 5 January, 2023; v1 submitted 26 May, 2022; originally announced May 2022.

Comments: Accepted for publication at IEEE Transactions on Intelligent Transportation Systems

arXiv:2205.02105 [pdf, ps, other]

Neuroevolutionary Multi-objective approaches to Trajectory Prediction in Autonomous Vehicles

Authors: Fergal Stapleton, Edgar Galván, Ganesh Sistu, Senthil Yogamani

Abstract: The incentive for using Evolutionary Algorithms (EAs) for the automated optimization and training of deep neural networks (DNNs), a process referred to as neuroevolution, has gained momentum in recent years. The configuration and training of these networks can be posed as optimization problems. Indeed, most of the recent works on neuroevolution have focused their attention on single-objective opti… ▽ More The incentive for using Evolutionary Algorithms (EAs) for the automated optimization and training of deep neural networks (DNNs), a process referred to as neuroevolution, has gained momentum in recent years. The configuration and training of these networks can be posed as optimization problems. Indeed, most of the recent works on neuroevolution have focused their attention on single-objective optimization. Moreover, from the little research that has been done at the intersection of neuroevolution and evolutionary multi-objective optimization (EMO), all the research that has been carried out has focused predominantly on the use of one type of DNN: convolutional neural networks (CNNs), using well-established standard benchmark problems such as MNIST. In this work, we make a leap in the understanding of these two areas (neuroevolution and EMO), regarded in this work as neuroevolutionary multi-objective, by using and studying a rich DNN composed of a CNN and Long-short Term Memory network. Moreover, we use a robust and challenging vehicle trajectory prediction problem. By using the well-known Non-dominated Sorting Genetic Algorithm-II, we study the effects of five different objectives, tested in categories of three, allowing us to show how these objectives have either a positive or detrimental effect in neuroevolution for trajectory prediction in autonomous vehicles. △ Less

Submitted 6 May, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

Comments: Accepted in Genetic and Evolutionary Computation Conference Companion (GECCO '22 Companion), July 9--13, 2022, Boston, MA, USA, 4 pages, 1 figure, 6 tables

arXiv:2203.15441 [pdf, other]

doi 10.1109/ACCESS.2023.3305576

UnShadowNet: Illumination Critic Guided Contrastive Learning For Shadow Removal

Authors: Subhrajyoti Dasgupta, Arindam Das, Senthil Yogamani, Sudip Das, Ciaran Eising, Andrei Bursuc, Ujjwal Bhattacharya

Abstract: Shadows are frequently encountered natural phenomena that significantly hinder the performance of computer vision perception systems in practical settings, e.g., autonomous driving. A solution to this would be to eliminate shadow regions from the images before the processing of the perception system. Yet, training such a solution requires pairs of aligned shadowed and non-shadowed images which are… ▽ More Shadows are frequently encountered natural phenomena that significantly hinder the performance of computer vision perception systems in practical settings, e.g., autonomous driving. A solution to this would be to eliminate shadow regions from the images before the processing of the perception system. Yet, training such a solution requires pairs of aligned shadowed and non-shadowed images which are difficult to obtain. We introduce a novel weakly supervised shadow removal framework UnShadowNet trained using contrastive learning. It is composed of a DeShadower network responsible for the removal of the extracted shadow under the guidance of an Illumination network which is trained adversarially by the illumination critic and a Refinement network to further remove artefacts. We show that UnShadowNet can be easily extended to a fully-supervised set-up to exploit the ground-truth when available. UnShadowNet outperforms existing state-of-the-art approaches on three publicly available shadow datasets (ISTD, adjusted ISTD, SRD) in both the weakly and fully supervised setups. △ Less

Submitted 24 August, 2023; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: Accepted for publication at IEEE Access, vol. 11, pp. 87760-87774, 2023

arXiv:2203.05056 [pdf, other]

doi 10.1109/LRA.2022.3188106

SynWoodScape: Synthetic Surround-view Fisheye Camera Dataset for Autonomous Driving

Authors: Ahmed Rida Sekkat, Yohan Dupuis, Varun Ravi Kumar, Hazem Rashed, Senthil Yogamani, Pascal Vasseur, Paul Honeine

Abstract: Surround-view cameras are a primary sensor for automated driving, used for near-field perception. It is one of the most commonly used sensors in commercial vehicles primarily used for parking visualization and automated parking. Four fisheye cameras with a 190° field of view cover the 360° around the vehicle. Due to its high radial distortion, the standard algorithms do not extend easily. Previous… ▽ More Surround-view cameras are a primary sensor for automated driving, used for near-field perception. It is one of the most commonly used sensors in commercial vehicles primarily used for parking visualization and automated parking. Four fisheye cameras with a 190° field of view cover the 360° around the vehicle. Due to its high radial distortion, the standard algorithms do not extend easily. Previously, we released the first public fisheye surround-view dataset named WoodScape. In this work, we release a synthetic version of the surround-view dataset, covering many of its weaknesses and extending it. Firstly, it is not possible to obtain ground truth for pixel-wise optical flow and depth. Secondly, WoodScape did not have all four cameras annotated simultaneously in order to sample diverse frames. However, this means that multi-camera algorithms cannot be designed to obtain a unified output in birds-eye space, which is enabled in the new dataset. We implemented surround-view fisheye geometric projections in CARLA Simulator matching WoodScape's configuration and created SynWoodScape. We release 80k images from the synthetic dataset with annotations for 10+ tasks. We also release the baseline code and supporting scripts. △ Less

Submitted 2 January, 2023; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: IEEE Robotics and Automation Letters (RA-L) and IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022). An initial sample of the dataset is released in https://drive.google.com/drive/folders/1N5rrySiw1uh9kLeBuOblMbXJ09YsqO7I

Journal ref: IEEE Robotics and Automation Letters ( Volume: 7, Issue: 3, July 2022)

arXiv:2203.01177 [pdf, other]

Detecting Adversarial Perturbations in Multi-Task Perception

Authors: Marvin Klingner, Varun Ravi Kumar, Senthil Yogamani, Andreas Bär, Tim Fingscheidt

Abstract: While deep neural networks (DNNs) achieve impressive performance on environment perception tasks, their sensitivity to adversarial perturbations limits their use in practical applications. In this paper, we (i) propose a novel adversarial perturbation detection scheme based on multi-task perception of complex vision tasks (i.e., depth estimation and semantic segmentation). Specifically, adversaria… ▽ More While deep neural networks (DNNs) achieve impressive performance on environment perception tasks, their sensitivity to adversarial perturbations limits their use in practical applications. In this paper, we (i) propose a novel adversarial perturbation detection scheme based on multi-task perception of complex vision tasks (i.e., depth estimation and semantic segmentation). Specifically, adversarial perturbations are detected by inconsistencies between extracted edges of the input image, the depth output, and the segmentation output. To further improve this technique, we (ii) develop a novel edge consistency loss between all three modalities, thereby improving their initial consistency which in turn supports our detection scheme. We verify our detection scheme's effectiveness by employing various known attacks and image noises. In addition, we (iii) develop a multi-task adversarial attack, aiming at fooling both tasks as well as our detection scheme. Experimental evaluation on the Cityscapes and KITTI datasets shows that under an assumption of a 5% false positive rate up to 100% of images are correctly detected as adversarially perturbed, depending on the strength of the perturbation. Code is available at https://github.com/ifnspaml/AdvAttackDet. A short video at https://youtu.be/KKa6gOyWmH4 provides qualitative results. △ Less

Submitted 11 September, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

Comments: Accepted at IROS 2022

arXiv:2111.04875 [pdf, other]

LiMoSeg: Real-time Bird's Eye View based LiDAR Motion Segmentation

Authors: Sambit Mohapatra, Mona Hodaei, Senthil Yogamani, Stefan Milz, Heinrich Gotzig, Martin Simon, Hazem Rashed, Patrick Maeder

Abstract: Moving object detection and segmentation is an essential task in the Autonomous Driving pipeline. Detecting and isolating static and moving components of a vehicle's surroundings are particularly crucial in path planning and localization tasks. This paper proposes a novel real-time architecture for motion segmentation of Light Detection and Ranging (LiDAR) data. We use three successive scans of Li… ▽ More Moving object detection and segmentation is an essential task in the Autonomous Driving pipeline. Detecting and isolating static and moving components of a vehicle's surroundings are particularly crucial in path planning and localization tasks. This paper proposes a novel real-time architecture for motion segmentation of Light Detection and Ranging (LiDAR) data. We use three successive scans of LiDAR data in 2D Bird's Eye View (BEV) representation to perform pixel-wise classification as static or moving. Furthermore, we propose a novel data augmentation technique to reduce the significant class imbalance between static and moving objects. We achieve this by artificially synthesizing moving objects by cutting and pasting static vehicles. We demonstrate a low latency of 8 ms on a commonly used automotive embedded platform, namely Nvidia Jetson Xavier. To the best of our knowledge, this is the first work directly performing motion segmentation in LiDAR BEV space. We provide quantitative results on the challenging SemanticKITTI dataset, and qualitative results are provided in https://youtu.be/2aJ-cL8b0LI. △ Less

Submitted 22 January, 2022; v1 submitted 8 November, 2021; originally announced November 2021.

Comments: Accepted for Presentation at International Conference on Computer Vision Theory and Applications (VISAPP 2022)

arXiv:2108.07736 [pdf, other]

A Hybrid Sparse-Dense Monocular SLAM System for Autonomous Driving

Authors: Louis Gallagher, Varun Ravi Kumar, Senthil Yogamani, John B. McDonald

Abstract: In this paper, we present a system for incrementally reconstructing a dense 3D model of the geometry of an outdoor environment using a single monocular camera attached to a moving vehicle. Dense models provide a rich representation of the environment facilitating higher-level scene understanding, perception, and planning. Our system employs dense depth prediction with a hybrid map** architecture… ▽ More In this paper, we present a system for incrementally reconstructing a dense 3D model of the geometry of an outdoor environment using a single monocular camera attached to a moving vehicle. Dense models provide a rich representation of the environment facilitating higher-level scene understanding, perception, and planning. Our system employs dense depth prediction with a hybrid map** architecture combining state-of-the-art sparse features and dense fusion-based visual SLAM algorithms within an integrated framework. Our novel contributions include design of hybrid sparse-dense camera tracking and loop closure, and scale estimation improvements in dense depth prediction. We use the motion estimates from the sparse method to overcome the large and variable inter-frame displacement typical of outdoor vehicle scenarios. Our system then registers the live image with the dense model using whole-image alignment. This enables the fusion of the live frame and dense depth prediction into the model. Global consistency and alignment between the sparse and dense models are achieved by applying pose constraints from the sparse method directly within the deformation of the dense model. We provide qualitative and quantitative results for both trajectory estimation and surface reconstruction accuracy, demonstrating competitive performance on the KITTI dataset. Qualitative results of the proposed approach are illustrated in https://youtu.be/Pn2uaVqjskY. Source code for the project is publicly available at the following repository https://github.com/robotvisionmu/DenseMonoSLAM. △ Less

Submitted 17 August, 2021; originally announced August 2021.

Comments: 8 pages, 5 figures. To be published in the proceedings of the 10th European Conference on Mobile Robotics 2021

ACM Class: I.2.10

arXiv:2107.08246 [pdf, other]

Woodscape Fisheye Semantic Segmentation for Autonomous Driving -- CVPR 2021 OmniCV Workshop Challenge

Authors: Saravanabalagi Ramachandran, Ganesh Sistu, John McDonald, Senthil Yogamani

Abstract: We present the WoodScape fisheye semantic segmentation challenge for autonomous driving which was held as part of the CVPR 2021 Workshop on Omnidirectional Computer Vision (OmniCV). This challenge is one of the first opportunities for the research community to evaluate the semantic segmentation techniques targeted for fisheye camera perception. Due to strong radial distortion standard models don't… ▽ More We present the WoodScape fisheye semantic segmentation challenge for autonomous driving which was held as part of the CVPR 2021 Workshop on Omnidirectional Computer Vision (OmniCV). This challenge is one of the first opportunities for the research community to evaluate the semantic segmentation techniques targeted for fisheye camera perception. Due to strong radial distortion standard models don't generalize well to fisheye images and hence the deformations in the visual appearance of objects and entities needs to be encoded implicitly or as explicit knowledge. This challenge served as a medium to investigate the challenges and new methodologies to handle the complexities with perception on fisheye images. The challenge was hosted on CodaLab and used the recently released WoodScape dataset comprising of 10k samples. In this paper, we provide a summary of the competition which attracted the participation of 71 global teams and a total of 395 submissions. The top teams recorded significantly improved mean IoU and accuracy scores over the baseline PSPNet with ResNet-50 backbone. We summarize the methods of winning algorithms and analyze the failure cases. We conclude by providing future directions for the research. △ Less

Submitted 17 July, 2021; originally announced July 2021.

Comments: Workshop on Omnidirectional Computer Vision (OmniCV) at Conference on Computer Vision and Pattern Recognition (CVPR) 2021. Presentation video is available at https://youtu.be/xa7Fl2mD4CA?t=12253

arXiv:2107.07449 [pdf, other]

Adversarial Attacks on Multi-task Visual Perception for Autonomous Driving

Authors: Ibrahim Sobh, Ahmed Hamed, Varun Ravi Kumar, Senthil Yogamani

Abstract: Deep neural networks (DNNs) have accomplished impressive success in various applications, including autonomous driving perception tasks, in recent years. On the other hand, current deep neural networks are easily fooled by adversarial attacks. This vulnerability raises significant concerns, particularly in safety-critical applications. As a result, research into attacking and defending DNNs has ga… ▽ More Deep neural networks (DNNs) have accomplished impressive success in various applications, including autonomous driving perception tasks, in recent years. On the other hand, current deep neural networks are easily fooled by adversarial attacks. This vulnerability raises significant concerns, particularly in safety-critical applications. As a result, research into attacking and defending DNNs has gained much coverage. In this work, detailed adversarial attacks are applied on a diverse multi-task visual perception deep network across distance estimation, semantic segmentation, motion detection, and object detection. The experiments consider both white and black box attacks for targeted and un-targeted cases, while attacking a task and inspecting the effect on all the others, in addition to inspecting the effect of applying a simple defense method. We conclude this paper by comparing and discussing the experimental results, proposing insights and future work. The visualizations of the attacks are available at https://youtu.be/6AixN90budY. △ Less

Submitted 7 November, 2021; v1 submitted 15 July, 2021; originally announced July 2021.

Comments: Accepted for publication at Journal of Imaging Science and Technology

arXiv:2107.04937 [pdf, other]

BEV-MODNet: Monocular Camera based Bird's Eye View Moving Object Detection for Autonomous Driving

Authors: Hazem Rashed, Mariam Essam, Maha Mohamed, Ahmad El Sallab, Senthil Yogamani

Abstract: Detection of moving objects is a very important task in autonomous driving systems. After the perception phase, motion planning is typically performed in Bird's Eye View (BEV) space. This would require projection of objects detected on the image plane to top view BEV plane. Such a projection is prone to errors due to lack of depth information and noisy map** in far away areas. CNNs can leverage… ▽ More Detection of moving objects is a very important task in autonomous driving systems. After the perception phase, motion planning is typically performed in Bird's Eye View (BEV) space. This would require projection of objects detected on the image plane to top view BEV plane. Such a projection is prone to errors due to lack of depth information and noisy map** in far away areas. CNNs can leverage the global context in the scene to project better. In this work, we explore end-to-end Moving Object Detection (MOD) on the BEV map directly using monocular images as input. To the best of our knowledge, such a dataset does not exist and we create an extended KITTI-raw dataset consisting of 12.9k images with annotations of moving object masks in BEV space for five classes. The dataset is intended to be used for class agnostic motion cue based object detection and classes are provided as meta-data for better tuning. We design and implement a two-stream RGB and optical flow fusion architecture which outputs motion segmentation directly in BEV space. We compare it with inverse perspective map** of state-of-the-art motion segmentation predictions on the image plane. We observe a significant improvement of 13% in mIoU using the simple baseline implementation. This demonstrates the ability to directly learn motion segmentation output in BEV space. Qualitative results of our baseline and the dataset annotations can be found in https://sites.google.com/view/bev-modnet. △ Less

Submitted 10 July, 2021; originally announced July 2021.

Comments: Accepted for Oral Presentation at IEEE Intelligent Transportation Systems Conference (ITSC) 2021

arXiv:2105.12763 [pdf, other]

An Online Learning System for Wireless Charging Alignment using Surround-view Fisheye Cameras

Authors: Ashok Dahal, Varun Ravi Kumar, Senthil Yogamani, Ciaran Eising

Abstract: Electric Vehicles are increasingly common, with inductive chargepads being considered a convenient and efficient means of charging electric vehicles. However, drivers are typically poor at aligning the vehicle to the necessary accuracy for efficient inductive charging, making the automated alignment of the two charging plates desirable. In parallel to the electrification of the vehicular fleet, au… ▽ More Electric Vehicles are increasingly common, with inductive chargepads being considered a convenient and efficient means of charging electric vehicles. However, drivers are typically poor at aligning the vehicle to the necessary accuracy for efficient inductive charging, making the automated alignment of the two charging plates desirable. In parallel to the electrification of the vehicular fleet, automated parking systems that make use of surround-view camera systems are becoming increasingly popular. In this work, we propose a system based on the surround-view camera architecture to detect, localize, and automatically align the vehicle with the inductive chargepad. The visual design of the chargepads is not standardized and not necessarily known beforehand. Therefore, a system that relies on offline training will fail in some situations. Thus, we propose a self-supervised online learning method that leverages the driver's actions when manually aligning the vehicle with the chargepad and combine it with weak supervision from semantic segmentation and depth to learn a classifier to auto-annotate the chargepad in the video for further training. In this way, when faced with a previously unseen chargepad, the driver needs only manually align the vehicle a single time. As the chargepad is flat on the ground, it is not easy to detect it from a distance. Thus, we propose using a Visual SLAM pipeline to learn landmarks relative to the chargepad to enable alignment from a greater range. We demonstrate the working system on an automated vehicle as illustrated in the video at https://youtu.be/_cLCmkW4UYo. To encourage further research, we will share a chargepad dataset used in this work. △ Less

Submitted 21 December, 2022; v1 submitted 26 May, 2021; originally announced May 2021.

Comments: Accepted for publication at IEEE Transactions on Intelligent Transportation Systems. Chargepad Dataset is shared at https://drive.google.com/drive/folders/1KeLFIqOnhU2CGsD0vbiN9UqKmBSyHERd

arXiv:2105.12713 [pdf, other]

Spatio-Contextual Deep Network Based Multimodal Pedestrian Detection For Autonomous Driving

Authors: Kinjal Dasgupta, Arindam Das, Sudip Das, Ujjwal Bhattacharya, Senthil Yogamani

Abstract: Pedestrian Detection is the most critical module of an Autonomous Driving system. Although a camera is commonly used for this purpose, its quality degrades severely in low-light night time driving scenarios. On the other hand, the quality of a thermal camera image remains unaffected in similar conditions. This paper proposes an end-to-end multimodal fusion model for pedestrian detection using RGB… ▽ More Pedestrian Detection is the most critical module of an Autonomous Driving system. Although a camera is commonly used for this purpose, its quality degrades severely in low-light night time driving scenarios. On the other hand, the quality of a thermal camera image remains unaffected in similar conditions. This paper proposes an end-to-end multimodal fusion model for pedestrian detection using RGB and thermal images. Its novel spatio-contextual deep network architecture is capable of exploiting the multimodal input efficiently. It consists of two distinct deformable ResNeXt-50 encoders for feature extraction from the two modalities. Fusion of these two encoded features takes place inside a multimodal feature embedding module (MuFEm) consisting of several groups of a pair of Graph Attention Network and a feature fusion unit. The output of the last feature fusion unit of MuFEm is subsequently passed to two CRFs for their spatial refinement. Further enhancement of the features is achieved by applying channel-wise attention and extraction of contextual information with the help of four RNNs traversing in four different directions. Finally, these feature maps are used by a single-stage decoder to generate the bounding box of each pedestrian and the score map. We have performed extensive experiments of the proposed framework on three publicly available multimodal pedestrian detection benchmark datasets, namely KAIST, CVC-14, and UTokyo. The results on each of them improved the respective state-of-the-art performance. A short video giving an overview of this work along with its qualitative results can be seen at https://youtu.be/FDJdSifuuCs. Our source code will be released upon publication of the paper. △ Less

Submitted 24 January, 2022; v1 submitted 26 May, 2021; originally announced May 2021.

Comments: To be published at IEEE Transactions on Intelligent Transportation Systems

arXiv:2105.07930 [pdf, other]

Ensemble-based Semi-supervised Learning to Improve Noisy Soiling Annotations in Autonomous Driving

Authors: Michal Uricar, Ganesh Sistu, Lucie Yahiaoui, Senthil Yogamani

Abstract: Manual annotation of soiling on surround view cameras is a very challenging and expensive task. The unclear boundary for various soiling categories like water drops or mud particles usually results in a large variance in the annotation quality. As a result, the models trained on such poorly annotated data are far from being optimal. In this paper, we focus on handling such noisy annotations via ps… ▽ More Manual annotation of soiling on surround view cameras is a very challenging and expensive task. The unclear boundary for various soiling categories like water drops or mud particles usually results in a large variance in the annotation quality. As a result, the models trained on such poorly annotated data are far from being optimal. In this paper, we focus on handling such noisy annotations via pseudo-label driven ensemble model which allow us to quickly spot problematic annotations and in most cases also sufficiently fixing them. We train a soiling segmentation model on both noisy and refined labels and demonstrate significant improvements using the refined annotations. It also illustrates that it is possible to effectively refine lower cost coarse annotations. △ Less

Submitted 11 July, 2021; v1 submitted 17 May, 2021; originally announced May 2021.

Comments: Accepted for Oral Presentation at IEEE Intelligent Transportation Systems Conference (ITSC) 2021

arXiv:2104.14042 [pdf, other]

Weather and Light Level Classification for Autonomous Driving: Dataset, Baseline and Active Learning

Authors: Mahesh M Dhananjaya, Varun Ravi Kumar, Senthil Yogamani

Abstract: Autonomous driving is rapidly advancing, and Level 2 functions are becoming a standard feature. One of the foremost outstanding hurdles is to obtain robust visual perception in harsh weather and low light conditions where accuracy degradation is severe. It is critical to have a weather classification model to decrease visual perception confidence during these scenarios. Thus, we have built a new d… ▽ More Autonomous driving is rapidly advancing, and Level 2 functions are becoming a standard feature. One of the foremost outstanding hurdles is to obtain robust visual perception in harsh weather and low light conditions where accuracy degradation is severe. It is critical to have a weather classification model to decrease visual perception confidence during these scenarios. Thus, we have built a new dataset for weather (fog, rain, and snow) classification and light level (bright, moderate, and low) classification. Furthermore, we provide street type (asphalt, grass, and cobblestone) classification, leading to 9 labels. Each image has three labels corresponding to weather, light level, and street type. We recorded the data utilizing an industrial front camera of RCCC (red/clear) format with a resolution of $1024\times1084$. We collected 15k video sequences and sampled 60k images. We implement an active learning framework to reduce the dataset's redundancy and find the optimal set of frames for training a model. We distilled the 60k images further to 1.1k images, which will be shared publicly after privacy anonymization. There is no public dataset for weather and light level classification focused on autonomous driving to the best of our knowledge. The baseline ResNet18 network used for weather classification achieves state-of-the-art results in two non-automotive weather classification public datasets but significantly lower accuracy on our proposed dataset, demonstrating it is not saturated and needs further research. △ Less

Submitted 29 November, 2021; v1 submitted 28 April, 2021; originally announced April 2021.

Comments: Accepted for Oral Presentation at IEEE Intelligent Transportation Systems Conference (ITSC) 2021. Dataset is released in https://drive.google.com/drive/folders/1t3hwbPCfbokUaaWROr6PBA4WTW0GuQJi

arXiv:2104.12583 [pdf]

doi 10.1109/ITSC.2015.329

Vision-based Driver Assistance Systems: Survey, Taxonomy and Advances

Authors: Jonathan Horgan, Ciarán Hughes, John McDonald, Senthil Yogamani

Abstract: Vision-based driver assistance systems is one of the rapidly growing research areas of ITS, due to various factors such as the increased level of safety requirements in automotive, computational power in embedded systems, and desire to get closer to autonomous driving. It is a cross disciplinary area encompassing specialised fields like computer vision, machine learning, robotic navigation, embedd… ▽ More Vision-based driver assistance systems is one of the rapidly growing research areas of ITS, due to various factors such as the increased level of safety requirements in automotive, computational power in embedded systems, and desire to get closer to autonomous driving. It is a cross disciplinary area encompassing specialised fields like computer vision, machine learning, robotic navigation, embedded systems, automotive electronics and safety critical software. In this paper, we survey the list of vision based advanced driver assistance systems with a consistent terminology and propose a taxonomy. We also propose an abstract model in an attempt to formalize a top-down view of application development to scale towards autonomous driving system. △ Less

Submitted 26 April, 2021; originally announced April 2021.

Journal ref: 2015 IEEE 18th International Conference on Intelligent Transportation Systems

arXiv:2104.12537 [pdf, other]

doi 10.1016/j.imavis.2017.07.002

Computer vision in automated parking systems: Design, implementation and challenges

Authors: Markus Heimberger, Jonathan Horgan, Ciaran Hughes, John McDonald, Senthil Yogamani

Abstract: Automated driving is an active area of research in both industry and academia. Automated Parking, which is automated driving in a restricted scenario of parking with low speed manoeuvring, is a key enabling product for fully autonomous driving systems. It is also an important milestone from the perspective of a higher end system built from the previous generation driver assistance systems comprisi… ▽ More Automated driving is an active area of research in both industry and academia. Automated Parking, which is automated driving in a restricted scenario of parking with low speed manoeuvring, is a key enabling product for fully autonomous driving systems. It is also an important milestone from the perspective of a higher end system built from the previous generation driver assistance systems comprising of collision warning, pedestrian detection, etc. In this paper, we discuss the design and implementation of an automated parking system from the perspective of computer vision algorithms. Designing a low-cost system with functional safety is challenging and leads to a large gap between the prototype and the end product, in order to handle all the corner cases. We demonstrate how camera systems are crucial for addressing a range of automated parking use cases and also, to add robustness to systems based on active distance measuring sensors, such as ultrasonics and radar. The key vision modules which realize the parking use cases are 3D reconstruction, parking slot marking recognition, freespace and vehicle/pedestrian detection. We detail the important parking use cases and demonstrate how to combine the vision modules to form a robust parking system. To the best of the authors' knowledge, this is the first detailed discussion of a systemic view of a commercial automated parking system. △ Less

Submitted 26 April, 2021; originally announced April 2021.

Journal ref: Image and Vision Computing, Volume 68, December 2017, Pages 88-101

arXiv:2104.10985 [pdf, other]

VM-MODNet: Vehicle Motion aware Moving Object Detection for Autonomous Driving

Authors: Hazem Rashed, Ahmad El Sallab, Senthil Yogamani

Abstract: Moving object Detection (MOD) is a critical task in autonomous driving as moving agents around the ego-vehicle need to be accurately detected for safe trajectory planning. It also enables appearance agnostic detection of objects based on motion cues. There are geometric challenges like motion-parallax ambiguity which makes it a difficult problem. In this work, we aim to leverage the vehicle motion… ▽ More Moving object Detection (MOD) is a critical task in autonomous driving as moving agents around the ego-vehicle need to be accurately detected for safe trajectory planning. It also enables appearance agnostic detection of objects based on motion cues. There are geometric challenges like motion-parallax ambiguity which makes it a difficult problem. In this work, we aim to leverage the vehicle motion information and feed it into the model to have an adaptation mechanism based on ego-motion. The motivation is to enable the model to implicitly perform ego-motion compensation to improve performance. We convert the six degrees of freedom vehicle motion into a pixel-wise tensor which can be fed as input to the CNN model. The proposed model using Vehicle Motion Tensor (VMT) achieves an absolute improvement of 5.6% in mIoU over the baseline architecture. We also achieve state-of-the-art results on the public KITTI_MoSeg_Extended dataset even compared to methods which make use of LiDAR and additional input frames. Our model is also lightweight and runs at 85 fps on a TitanX GPU. Qualitative results are provided in https://youtu.be/ezbfjti-kTk. △ Less

Submitted 10 July, 2021; v1 submitted 22 April, 2021; originally announced April 2021.

Comments: Accepted for Oral Presentation at IEEE Intelligent Transportation Systems Conference (ITSC) 2021

arXiv:2104.10786 [pdf, other]

Exploring 2D Data Augmentation for 3D Monocular Object Detection

Authors: Sugirtha T, Sridevi M, Khailash Santhakumar, B Ravi Kiran, Thomas Gauthier, Senthil Yogamani

Abstract: Data augmentation is a key component of CNN based image recognition tasks like object detection. However, it is relatively less explored for 3D object detection. Many standard 2D object detection data augmentation techniques do not extend to 3D box. Extension of these data augmentations for 3D object detection requires adaptation of the 3D geometry of the input scene and synthesis of new viewpoint… ▽ More Data augmentation is a key component of CNN based image recognition tasks like object detection. However, it is relatively less explored for 3D object detection. Many standard 2D object detection data augmentation techniques do not extend to 3D box. Extension of these data augmentations for 3D object detection requires adaptation of the 3D geometry of the input scene and synthesis of new viewpoints. This requires accurate depth information of the scene which may not be always available. In this paper, we evaluate existing 2D data augmentations and propose two novel augmentations for monocular 3D detection without a requirement for novel view synthesis. We evaluate these augmentations on the RTM3D detection model firstly due to the shorter training times . We obtain a consistent improvement by 4% in the 3D AP (@IoU=0.7) for cars, ~1.8% scores 3D AP (@IoU=0.25) for pedestrians & cyclists, over the baseline on KITTI car detection dataset. We also demonstrate a rigorous evaluation of the mAP scores by re-weighting them to take into account the class imbalance in the KITTI validation dataset. △ Less

Submitted 21 April, 2021; originally announced April 2021.

arXiv:2104.10780 [pdf, other]

BEVDetNet: Bird's Eye View LiDAR Point Cloud based Real-time 3D Object Detection for Autonomous Driving

Authors: Sambit Mohapatra, Senthil Yogamani, Heinrich Gotzig, Stefan Milz, Patrick Mader

Abstract: 3D object detection based on LiDAR point clouds is a crucial module in autonomous driving particularly for long range sensing. Most of the research is focused on achieving higher accuracy and these models are not optimized for deployment on embedded systems from the perspective of latency and power efficiency. For high speed driving scenarios, latency is a crucial parameter as it provides more tim… ▽ More 3D object detection based on LiDAR point clouds is a crucial module in autonomous driving particularly for long range sensing. Most of the research is focused on achieving higher accuracy and these models are not optimized for deployment on embedded systems from the perspective of latency and power efficiency. For high speed driving scenarios, latency is a crucial parameter as it provides more time to react to dangerous situations. Typically a voxel or point-cloud based 3D convolution approach is utilized for this module. Firstly, they are inefficient on embedded platforms as they are not suitable for efficient parallelization. Secondly, they have a variable runtime due to level of sparsity of the scene which is against the determinism needed in a safety system. In this work, we aim to develop a very low latency algorithm with fixed runtime. We propose a novel semantic segmentation architecture as a single unified model for object center detection using key points, box predictions and orientation prediction using binned classification in a simpler Bird's Eye View (BEV) 2D representation. The proposed architecture can be trivially extended to include semantic segmentation classes like road without any additional computation. The proposed model has a latency of 4 ms on the embedded Nvidia Xavier platform. The model is 5X faster than other top accuracy models with a minimal accuracy degradation of 2% in Average Precision at IoU=0.5 on KITTI dataset. △ Less

Submitted 10 July, 2021; v1 submitted 21 April, 2021; originally announced April 2021.

Comments: Accepted for Oral Presentation at IEEE Intelligent Transportation Systems Conference (ITSC) 2021

arXiv:2104.04420 [pdf, other]

SVDistNet: Self-Supervised Near-Field Distance Estimation on Surround View Fisheye Cameras

Authors: Varun Ravi Kumar, Marvin Klingner, Senthil Yogamani, Markus Bach, Stefan Milz, Tim Fingscheidt, Patrick Mäder

Abstract: A 360° perception of scene geometry is essential for automated driving, notably for parking and urban driving scenarios. Typically, it is achieved using surround-view fisheye cameras, focusing on the near-field area around the vehicle. The majority of current depth estimation approaches focus on employing just a single camera, which cannot be straightforwardly generalized to multiple cameras. The… ▽ More A 360° perception of scene geometry is essential for automated driving, notably for parking and urban driving scenarios. Typically, it is achieved using surround-view fisheye cameras, focusing on the near-field area around the vehicle. The majority of current depth estimation approaches focus on employing just a single camera, which cannot be straightforwardly generalized to multiple cameras. The depth estimation model must be tested on a variety of cameras equipped to millions of cars with varying camera geometries. Even within a single car, intrinsics vary due to manufacturing tolerances. Deep learning models are sensitive to these changes, and it is practically infeasible to train and test on each camera variant. As a result, we present novel camera-geometry adaptive multi-scale convolutions which utilize the camera parameters as a conditional input, enabling the model to generalize to previously unseen fisheye cameras. Additionally, we improve the distance estimation by pairwise and patchwise vector-based self-attention encoder networks. We evaluate our approach on the Fisheye WoodScape surround-view dataset, significantly improving over previous approaches. We also show a generalization of our approach across different camera viewing angles and perform extensive experiments to support our contributions. To enable comparison with other approaches, we evaluate the front camera data on the KITTI dataset (pinhole camera images) and achieve state-of-the-art performance among self-supervised monocular methods. An overview video with qualitative results is provided at https://youtu.be/bmX0UcU9wtA. Baseline code and dataset will be made public. △ Less

Submitted 9 April, 2021; originally announced April 2021.

Comments: To be published at IEEE Transactions on Intelligent Transportation Systems

arXiv:2103.17001 [pdf, other]

Near-field Perception for Low-Speed Vehicle Automation using Surround-view Fisheye Cameras

Authors: Ciaran Eising, Jonathan Horgan, Senthil Yogamani

Abstract: Cameras are the primary sensor in automated driving systems. They provide high information density and are optimal for detecting road infrastructure cues laid out for human vision. Surround-view camera systems typically comprise of four fisheye cameras with 190°+ field of view covering the entire 360° around the vehicle focused on near-field sensing. They are the principal sensors for low-speed, h… ▽ More Cameras are the primary sensor in automated driving systems. They provide high information density and are optimal for detecting road infrastructure cues laid out for human vision. Surround-view camera systems typically comprise of four fisheye cameras with 190°+ field of view covering the entire 360° around the vehicle focused on near-field sensing. They are the principal sensors for low-speed, high accuracy, and close-range sensing applications, such as automated parking, traffic jam assistance, and low-speed emergency braking. In this work, we provide a detailed survey of such vision systems, setting up the survey in the context of an architecture that can be decomposed into four modular components namely Recognition, Reconstruction, Relocalization, and Reorganization. We jointly call this the 4R Architecture. We discuss how each component accomplishes a specific aspect and provide a positional argument that they can be synergized to form a complete perception system for low-speed automation. We support this argument by presenting results from previous works and by presenting architecture proposals for such a system. Qualitative results are presented in the video at https://youtu.be/ae8bCOF77uY. △ Less

Submitted 6 June, 2023; v1 submitted 31 March, 2021; originally announced March 2021.

Comments: Accepted for publication at IEEE Transactions on Intelligent Transportation Systems

arXiv:2103.00191 [pdf, other]

FisheyeSuperPoint: Keypoint Detection and Description Network for Fisheye Images

Authors: Anna Konrad, Ciarán Eising, Ganesh Sistu, John McDonald, Rudi Villing, Senthil Yogamani

Abstract: Keypoint detection and description is a commonly used building block in computer vision systems particularly for robotics and autonomous driving. However, the majority of techniques to date have focused on standard cameras with little consideration given to fisheye cameras which are commonly used in urban driving and automated parking. In this paper, we propose a novel training and evaluation pipe… ▽ More Keypoint detection and description is a commonly used building block in computer vision systems particularly for robotics and autonomous driving. However, the majority of techniques to date have focused on standard cameras with little consideration given to fisheye cameras which are commonly used in urban driving and automated parking. In this paper, we propose a novel training and evaluation pipeline for fisheye images. We make use of SuperPoint as our baseline which is a self-supervised keypoint detector and descriptor that has achieved state-of-the-art results on homography estimation. We introduce a fisheye adaptation pipeline to enable training on undistorted fisheye images. We evaluate the performance on the HPatches benchmark, and, by introducing a fisheye based evaluation method for detection repeatability and descriptor matching correctness, on the Oxford RobotCar dataset. △ Less

Submitted 29 November, 2021; v1 submitted 27 February, 2021; originally announced March 2021.

Comments: Accepted for Presentation at International Conference on Computer Vision Theory and Applications (VISAPP 2022)

arXiv:2102.07448 [pdf, other]

OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving

Authors: Varun Ravi Kumar, Senthil Yogamani, Hazem Rashed, Ganesh Sistu, Christian Witt, Isabelle Leang, Stefan Milz, Patrick Mäder

Abstract: Surround View fisheye cameras are commonly deployed in automated driving for 360° near-field sensing around the vehicle. This work presents a multi-task visual perception network on unrectified fisheye images to enable the vehicle to sense its surrounding environment. It consists of six primary tasks necessary for an autonomous driving system: depth estimation, visual odometry, semantic segmentati… ▽ More Surround View fisheye cameras are commonly deployed in automated driving for 360° near-field sensing around the vehicle. This work presents a multi-task visual perception network on unrectified fisheye images to enable the vehicle to sense its surrounding environment. It consists of six primary tasks necessary for an autonomous driving system: depth estimation, visual odometry, semantic segmentation, motion segmentation, object detection, and lens soiling detection. We demonstrate that the jointly trained model performs better than the respective single task versions. Our multi-task model has a shared encoder providing a significant computational advantage and has synergized decoders where tasks support each other. We propose a novel camera geometry based adaptation mechanism to encode the fisheye distortion model both at training and inference. This was crucial to enable training on the WoodScape dataset, comprised of data from different parts of the world collected by 12 different cameras mounted on three different cars with different intrinsics and viewpoints. Given that bounding boxes is not a good representation for distorted fisheye images, we also extend object detection to use a polygon with non-uniformly sampled vertices. We additionally evaluate our model on standard automotive datasets, namely KITTI and Cityscapes. We obtain the state-of-the-art results on KITTI for depth estimation and pose estimation tasks and competitive performance on the other tasks. We perform extensive ablation studies on various architecture choices and task weighting methodologies. A short video at https://youtu.be/xbSjZ5OfPes provides qualitative results. △ Less

Submitted 6 June, 2023; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: Best Robot Vision paper award finalist (top 4). Camera ready version accepted for RA-L and ICRA 2021 publication

arXiv:2012.02124 [pdf, other]

Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline

Authors: Hazem Rashed, Eslam Mohamed, Ganesh Sistu, Varun Ravi Kumar, Ciaran Eising, Ahmad El-Sallab, Senthil Yogamani

Abstract: Object detection is a comprehensively studied problem in autonomous driving. However, it has been relatively less explored in the case of fisheye cameras. The standard bounding box fails in fisheye cameras due to the strong radial distortion, particularly in the image's periphery. We explore better representations like oriented bounding box, ellipse, and generic polygon for object detection in fis… ▽ More Object detection is a comprehensively studied problem in autonomous driving. However, it has been relatively less explored in the case of fisheye cameras. The standard bounding box fails in fisheye cameras due to the strong radial distortion, particularly in the image's periphery. We explore better representations like oriented bounding box, ellipse, and generic polygon for object detection in fisheye images in this work. We use the IoU metric to compare these representations using accurate instance segmentation ground truth. We design a novel curved bounding box model that has optimal properties for fisheye distortion models. We also design a curvature adaptive perimeter sampling method for obtaining polygon vertices, improving relative mAP score by 4.9% compared to uniform sampling. Overall, the proposed polygon model improves mIoU relative accuracy by 40.3%. It is the first detailed study on object detection on fisheye cameras for autonomous driving scenarios to the best of our knowledge. The dataset comprising of 10,000 images along with all the object representations ground truth will be made public to encourage further research. We summarize our work in a short video with qualitative results at https://youtu.be/iLkOzvJpL-A. △ Less

Submitted 21 December, 2022; v1 submitted 3 December, 2020; originally announced December 2020.

Comments: Camera ready version. Accepted for presentation at Winter Conference on Applications of Computer Vision 2021. Dataset is shared at https://drive.google.com/drive/folders/1bobmY2wlIBozeU5ZgPfYPqVAnpPw4QrM

arXiv:2010.11681 [pdf, other]

Learning Panoptic Segmentation from Instance Contours

Authors: Sumanth Chennupati, Venkatraman Narayanan, Ganesh Sistu, Senthil Yogamani, Samir A Rawashdeh

Abstract: Panoptic Segmentation aims to provide an understanding of background (stuff) and instances of objects (things) at a pixel level. It combines the separate tasks of semantic segmentation (pixel level classification) and instance segmentation to build a single unified scene understanding task. Typically, panoptic segmentation is derived by combining semantic and instance segmentation tasks that are l… ▽ More Panoptic Segmentation aims to provide an understanding of background (stuff) and instances of objects (things) at a pixel level. It combines the separate tasks of semantic segmentation (pixel level classification) and instance segmentation to build a single unified scene understanding task. Typically, panoptic segmentation is derived by combining semantic and instance segmentation tasks that are learned separately or jointly (multi-task networks). In general, instance segmentation networks are built by adding a foreground mask estimation layer on top of object detectors or using instance clustering methods that assign a pixel to an instance center. In this work, we present a fully convolution neural network that learns instance segmentation from semantic segmentation and instance contours (boundaries of things). Instance contours along with semantic segmentation yield a boundary aware semantic segmentation of things. Connected component labeling on these results produces instance segmentation. We merge semantic and instance segmentation results to output panoptic segmentation. We evaluate our proposed method on the CityScapes dataset to demonstrate qualitative and quantitative performances along with several ablation studies. Our overview video can be accessed from url:https://youtu.be/wBtcxRhG3e0. △ Less

Submitted 5 April, 2021; v1 submitted 15 October, 2020; originally announced October 2020.

Comments: Accepted at ICRA 2021. Overview Video: https://youtu.be/wBtcxRhG3e0

arXiv:2008.07008 [pdf, other]

Monocular Instance Motion Segmentation for Autonomous Driving: KITTI InstanceMotSeg Dataset and Multi-task Baseline

Authors: Eslam Mohamed, Mahmoud Ewaisha, Mennatullah Siam, Hazem Rashed, Senthil Yogamani, Waleed Hamdy, Muhammad Helmi, Ahmad El-Sallab

Abstract: Moving object segmentation is a crucial task for autonomous vehicles as it can be used to segment objects in a class agnostic manner based on their motion cues. It enables the detection of unseen objects during training (e.g., moose or a construction truck) based on their motion and independent of their appearance. Although pixel-wise motion segmentation has been studied in autonomous driving lite… ▽ More Moving object segmentation is a crucial task for autonomous vehicles as it can be used to segment objects in a class agnostic manner based on their motion cues. It enables the detection of unseen objects during training (e.g., moose or a construction truck) based on their motion and independent of their appearance. Although pixel-wise motion segmentation has been studied in autonomous driving literature, it has been rarely addressed at the instance level, which would help separate connected segments of moving objects leading to better trajectory planning. As the main issue is the lack of large public datasets, we create a new InstanceMotSeg dataset comprising of 12.9K samples improving upon our KITTIMoSeg dataset. In addition to providing instance level annotations, we have added 4 additional classes which is crucial for studying class agnostic motion segmentation. We adapt YOLACT and implement a motion-based class agnostic instance segmentation model which would act as a baseline for the dataset. We also extend it to an efficient multi-task model which additionally provides semantic instance segmentation sharing the encoder. The model then learns separate prototype coefficients within the class agnostic and semantic heads providing two independent paths of object detection for redundant safety. To obtain real-time performance, we study different efficient encoders and obtain 39 fps on a Titan Xp GPU using MobileNetV2 with an improvement of 10% mAP relative to the baseline. Our model improves the previous state of the art motion segmentation method by 3.3%. The dataset and qualitative results video are shared in our website at https://sites.google.com/view/instancemotseg/. △ Less

Submitted 26 May, 2021; v1 submitted 16 August, 2020; originally announced August 2020.

Comments: Accepted for presentation at IEEE IV 2021 (Intelligent Vehicles Symposium) conference

arXiv:2008.04017 [pdf, other]

SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving

Authors: Varun Ravi Kumar, Marvin Klingner, Senthil Yogamani, Stefan Milz, Tim Fingscheidt, Patrick Maeder

Abstract: State-of-the-art self-supervised learning approaches for monocular depth estimation usually suffer from scale ambiguity. They do not generalize well when applied on distance estimation for complex projection models such as in fisheye and omnidirectional cameras. This paper introduces a novel multi-task learning strategy to improve self-supervised monocular distance estimation on fisheye and pinhol… ▽ More State-of-the-art self-supervised learning approaches for monocular depth estimation usually suffer from scale ambiguity. They do not generalize well when applied on distance estimation for complex projection models such as in fisheye and omnidirectional cameras. This paper introduces a novel multi-task learning strategy to improve self-supervised monocular distance estimation on fisheye and pinhole camera images. Our contribution to this work is threefold: Firstly, we introduce a novel distance estimation network architecture using a self-attention based encoder coupled with robust semantic feature guidance to the decoder that can be trained in a one-stage fashion. Secondly, we integrate a generalized robust loss function, which improves performance significantly while removing the need for hyperparameter tuning with the reprojection loss. Finally, we reduce the artifacts caused by dynamic objects violating static world assumptions using a semantic masking strategy. We significantly improve upon the RMSE of previous work on fisheye by 25% reduction in RMSE. As there is little work on fisheye cameras, we evaluated the proposed method on KITTI using a pinhole model. We achieved state-of-the-art performance among self-supervised methods without requiring an external scale estimation. △ Less

Submitted 14 November, 2020; v1 submitted 10 August, 2020; originally announced August 2020.

Comments: Camera ready version + supplementary. Accepted for presentation at Winter Conference on Applications of Computer Vision 2021

arXiv:2007.09746 [pdf, other]

Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation

Authors: Gabriel L. Oliveira, Senthil Yogamani, Wolfram Burgard, Thomas Brox

Abstract: Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers. To address these limitations, we propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content. The new decoder has a new topology of skip connections, namely backward and stacked res… ▽ More Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers. To address these limitations, we propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content. The new decoder has a new topology of skip connections, namely backward and stacked residual connections. In order to further improve the architecture we introduce a weight function which aims to re-balance classes to increase the attention of the networks to under-represented objects. We carried out an extensive set of experiments that yielded state-of-the-art results for the CamVid, Gatech and Freiburg Forest datasets. Moreover, to further prove the effectiveness of our decoder, we conducted a set of experiments studying the impact of our decoder to state-of-the-art segmentation techniques. Additionally, we present a set of experiments augmenting semantic segmentation with optical flow information, showing that motion clues can boost pure image based semantic segmentation approaches. △ Less

Submitted 19 July, 2020; originally announced July 2020.

arXiv:2007.06676 [pdf, other]

UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models

Authors: Varun Ravi Kumar, Senthil Yogamani, Markus Bach, Christian Witt, Stefan Milz, Patrick Mader

Abstract: In classical computer vision, rectification is an integral part of multi-view depth estimation. It typically includes epipolar rectification and lens distortion correction. This process simplifies the depth estimation significantly, and thus it has been adopted in CNN approaches. However, rectification has several side effects, including a reduced field of view (FOV), resampling distortion, and se… ▽ More In classical computer vision, rectification is an integral part of multi-view depth estimation. It typically includes epipolar rectification and lens distortion correction. This process simplifies the depth estimation significantly, and thus it has been adopted in CNN approaches. However, rectification has several side effects, including a reduced field of view (FOV), resampling distortion, and sensitivity to calibration errors. The effects are particularly pronounced in case of significant distortion (e.g., wide-angle fisheye cameras). In this paper, we propose a generic scale-aware self-supervised pipeline for estimating depth, euclidean distance, and visual odometry from unrectified monocular videos. We demonstrate a similar level of precision on the unrectified KITTI dataset with barrel distortion comparable to the rectified KITTI dataset. The intuition being that the rectification step can be implicitly absorbed within the CNN model, which learns the distortion model without increasing complexity. Our approach does not suffer from a reduced field of view and avoids computational costs for rectification at inference time. To further illustrate the general applicability of the proposed framework, we apply it to wide-angle fisheye cameras with 190$^\circ$ horizontal field of view. The training framework UnRectDepthNet takes in the camera distortion model as an argument and adapts projection and unprojection functions accordingly. The proposed algorithm is evaluated further on the KITTI rectified dataset, and we achieve state-of-the-art results that improve upon our previous work FisheyeDistanceNet. Qualitative results on a distorted test scene video sequence indicate excellent performance https://youtu.be/K6pbx3bU4Ss. △ Less

Submitted 6 June, 2023; v1 submitted 13 July, 2020; originally announced July 2020.

Comments: Minor fixes added after IROS 2020 Camera ready submission. IROS 2020 presentation video - https://www.youtube.com/watch?v=3Br2KSWZRrY

arXiv:2007.00801 [pdf, other]

TiledSoilingNet: Tile-level Soiling Detection on Automotive Surround-view Cameras Using Coverage Metric

Authors: Arindam Das, Pavel Krizek, Ganesh Sistu, Fabian Burger, Sankaralingam Madasamy, Michal Uricar, Varun Ravi Kumar, Senthil Yogamani

Abstract: Automotive cameras, particularly surround-view cameras, tend to get soiled by mud, water, snow, etc. For higher levels of autonomous driving, it is necessary to have a soiling detection algorithm which will trigger an automatic cleaning system. Localized detection of soiling in an image is necessary to control the cleaning system. It is also necessary to enable partial functionality in unsoiled ar… ▽ More Automotive cameras, particularly surround-view cameras, tend to get soiled by mud, water, snow, etc. For higher levels of autonomous driving, it is necessary to have a soiling detection algorithm which will trigger an automatic cleaning system. Localized detection of soiling in an image is necessary to control the cleaning system. It is also necessary to enable partial functionality in unsoiled areas while reducing confidence in soiled areas. Although this can be solved using a semantic segmentation task, we explore a more efficient solution targeting deployment in low power embedded system. We propose a novel method to regress the area of each soiling type within a tile directly. We refer to this as coverage. The proposed approach is better than learning the dominant class in a tile as multiple soiling types occur within a tile commonly. It also has the advantage of dealing with coarse polygon annotation, which will cause the segmentation task. The proposed soiling coverage decoder is an order of magnitude faster than an equivalent segmentation decoder. We also integrated it into an object detection and semantic segmentation multi-task model using an asynchronous back-propagation algorithm. A portion of the dataset used will be released publicly as part of our WoodScape dataset to encourage further research. △ Less

Submitted 1 July, 2020; originally announced July 2020.

Comments: Accepted for Oral Presentation at IEEE Intelligent Transportation Systems Conference (ITSC) 2020

arXiv:2002.00444 [pdf, other]

Deep Reinforcement Learning for Autonomous Driving: A Survey

Authors: B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Yogamani, Patrick Pérez

Abstract: With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computat… ▽ More With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed. △ Less

Submitted 23 January, 2021; v1 submitted 2 February, 2020; originally announced February 2020.

Comments: Accepted for publication at IEEE Transactions on Intelligent Transportation Systems

arXiv:2001.02223 [pdf, other]

Dynamic Task Weighting Methods for Multi-task Networks in Autonomous Driving Systems

Authors: Isabelle Leang, Ganesh Sistu, Fabian Burger, Andrei Bursuc, Senthil Yogamani

Abstract: Deep multi-task networks are of particular interest for autonomous driving systems. They can potentially strike an excellent trade-off between predictive performance, hardware constraints and efficient use of information from multiple types of annotations and modalities. However, training such models is non-trivial and requires balancing learning over all tasks as their respective losses display d… ▽ More Deep multi-task networks are of particular interest for autonomous driving systems. They can potentially strike an excellent trade-off between predictive performance, hardware constraints and efficient use of information from multiple types of annotations and modalities. However, training such models is non-trivial and requires balancing learning over all tasks as their respective losses display different scales, ranges and dynamics across training. Multiple task weighting methods that adjust the losses in an adaptive way have been proposed recently on different datasets and combinations of tasks, making it difficult to compare them. In this work, we review and systematically evaluate nine task weighting strategies on common grounds on three automotive datasets (KITTI, Cityscapes and WoodScape). We then propose a novel method combining evolutionary meta-learning and task-based selective backpropagation, for computing task weights leading to reliable network training. Our method outperforms state-of-the-art methods by a significant margin on a two-task application. △ Less

Submitted 27 June, 2020; v1 submitted 7 January, 2020; originally announced January 2020.

Comments: Accepted for Oral Presentation at IEEE Intelligent Transportation Systems Conference (ITSC) 2020

arXiv:2001.02161 [pdf, other]

Trained Trajectory based Automated Parking System using Visual SLAM on Surround View Cameras

Authors: Nivedita Tripathi, Senthil Yogamani

Abstract: Automated Parking is becoming a standard feature in modern vehicles. Existing parking systems build a local map to be able to plan for maneuvering towards a detected slot. Next generation parking systems have an use case where they build a persistent map of the environment where the car is frequently parked, say for example, home parking or office parking. The pre-built map helps in re-localizing… ▽ More Automated Parking is becoming a standard feature in modern vehicles. Existing parking systems build a local map to be able to plan for maneuvering towards a detected slot. Next generation parking systems have an use case where they build a persistent map of the environment where the car is frequently parked, say for example, home parking or office parking. The pre-built map helps in re-localizing the vehicle better when its trying to park the next time. This is achieved by augmenting the parking system with a Visual SLAM pipeline and the feature is called trained trajectory parking in the automotive industry. In this paper, we discuss the use cases, design and implementation of a trained trajectory automated parking system. The proposed system is deployed on commercial vehicles and the consumer application is illustrated in \url{https://youtu.be/nRWF5KhyJZU}. The focus of this paper is on the application and the details of vision algorithms are kept at high level. △ Less

Submitted 19 May, 2021; v1 submitted 7 January, 2020; originally announced January 2020.

Comments: Accepted for presentation at CVPR 2021 Workshop on Women in Computer Vision

arXiv:1912.11066 [pdf, other]

FisheyeMultiNet: Real-time Multi-task Learning Architecture for Surround-view Automated Parking System

Authors: Pullarao Maddu, Wayne Doherty, Ganesh Sistu, Isabelle Leang, Michal Uricar, Sumanth Chennupati, Hazem Rashed, Jonathan Horgan, Ciaran Hughes, Senthil Yogamani

Abstract: Automated Parking is a low speed manoeuvring scenario which is quite unstructured and complex, requiring full 360° near-field sensing around the vehicle. In this paper, we discuss the design and implementation of an automated parking system from the perspective of camera based deep learning algorithms. We provide a holistic overview of an industrial system covering the embedded system, use cases a… ▽ More Automated Parking is a low speed manoeuvring scenario which is quite unstructured and complex, requiring full 360° near-field sensing around the vehicle. In this paper, we discuss the design and implementation of an automated parking system from the perspective of camera based deep learning algorithms. We provide a holistic overview of an industrial system covering the embedded system, use cases and the deep learning architecture. We demonstrate a real-time multi-task deep learning network called FisheyeMultiNet, which detects all the necessary objects for parking on a low-power embedded system. FisheyeMultiNet runs at 15 fps for 4 cameras and it has three tasks namely object detection, semantic segmentation and soiling detection. To encourage further research, we release a partial dataset of 5,000 images containing semantic segmentation and bounding box detection ground truth via WoodScape project \cite{yogamani2019woodscape}. △ Less

Submitted 23 December, 2019; originally announced December 2019.

Comments: Accepted for publication at Irish Machine Vision and Image Processing (IMVIP) 2019

Showing 1–50 of 79 results for author: Yogamani, S