doi 10.1007/978-3-031-45725-8_14

Transformers in Unsupervised Structure-from-Motion

Authors: Hemang Chawla, Arnav Varma, Elahe Arani, Bahram Zonooz

Abstract: Transformers have revolutionized deep learning based computer vision with improved performance as well as robustness to natural corruptions and adversarial attacks. Transformers are used predominantly for 2D vision tasks, including image classification, semantic segmentation, and object detection. However, robots and advanced driver assistance systems also require 3D scene understanding for decision making by extracting structure-from-motion (SfM). We propose a robust transformer-based monocular SfM method that learns to predict monocular pixel-wise depth, ego vehicle's translation and rotation, as well as camera's focal length and principal point, simultaneously. With experiments on KITTI and DDAD datasets, we demonstrate how to adapt different vision transformers and compare them against contemporary CNN-based methods. Our study shows that transformer-based architecture, though lower in run-time efficiency, achieves comparable performance while being more robust against natural corruptions, as well as untargeted and targeted attacks.

Submitted 16 December, 2023; originally announced December 2023.

Comments: International Joint Conference on Computer Vision, Imaging and Computer Graphics. Cham: Springer Nature Switzerland, 2022. Published at "Communications in Computer and Information Science, vol 1815. Springer Nature". arXiv admin note: text overlap with arXiv:2202.03131

arXiv:2311.02393 [pdf, other]

Continual Learning of Unsupervised Monocular Depth from Videos

Authors: Hemang Chawla, Arnav Varma, Elahe Arani, Bahram Zonooz

Abstract: Spatial scene understanding, including monocular depth estimation, is an important problem in various applications, such as robotics and autonomous driving. While improvements in unsupervised monocular depth estimation have potentially allowed models to be trained on diverse crowdsourced videos, this remains underexplored as most methods utilize the standard training protocol, wherein the models are trained from scratch on all data after new data is collected. Instead, continual training of models on sequentially collected data would significantly reduce computational and memory costs. Nevertheless, naive continual training leads to catastrophic forgetting, where the model performance deteriorates on older domains as it learns on newer domains, highlighting the trade-off between model stability and plasticity. While several techniques have been proposed to address this issue in image classification, the high-dimensional and spatiotemporally correlated outputs of depth estimation make it a distinct challenge. To the best of our knowledge, no framework or method currently exists focusing on the problem of continual learning in depth estimation. Thus, we introduce a framework that captures the challenges of continual unsupervised depth estimation (CUDE), and define the necessary metrics to evaluate model performance. We propose a rehearsal-based dual-memory method, MonoDepthCL, which utilizes spatiotemporal consistency for continual learning in depth estimation, even when the camera intrinsics are unknown.

Submitted 4 November, 2023; originally announced November 2023.

Comments: Accepted at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2024)

arXiv:2210.03570 [pdf]

AI-Driven Road Maintenance Inspection v2: Reducing Data Dependency & Quantifying Road Damage

Authors: Haris Iqbal, Hemang Chawla, Arnav Varma, Terence Brouns, Ahmed Badar, Elahe Arani, Bahram Zonooz

Abstract: Road infrastructure maintenance inspection is typically a labor-intensive and critical task to ensure the safety of all road users. Existing state-of-the-art techniques in Artificial Intelligence (AI) for object detection and segmentation help automate a huge chunk of this task given adequate annotated data. However, annotating videos from scratch is cost-prohibitive. For instance, it can take an annotator several days to annotate a 5-minute video recorded at 30 FPS. Hence, we propose an automated labelling pipeline by leveraging techniques like few-shot learning and out-of-distribution detection to generate labels for road damage detection. In addition, our pipeline includes a risk factor assessment for each damage by instance quantification to prioritize locations for repairs which can lead to optimal deployment of road maintenance machinery. We show that the AI models trained with these techniques can not only generalize better to unseen real-world data with reduced requirement for human annotation but also provide an estimate of maintenance urgency, thereby leading to safer roads.

Submitted 7 October, 2022; originally announced October 2022.

Comments: Accepted at IRF Global R2T Conference & Exhibition 2022

arXiv:2210.02357 [pdf, other]

Image Masking for Robust Self-Supervised Monocular Depth Estimation

Authors: Hemang Chawla, Kishaan Jeeveswaran, Elahe Arani, Bahram Zonooz

Abstract: Self-supervised monocular depth estimation is a salient task for 3D scene understanding. Learned jointly with monocular ego-motion estimation, several methods have been proposed to predict accurate pixel-wise depth without using labeled data. Nevertheless, these methods focus on improving performance under ideal conditions without natural or digital corruptions. The general absence of occlusions is assumed even for object-specific depth estimation. These methods are also vulnerable to adversarial attacks, which is a pertinent concern for their reliable deployment in robots and autonomous driving systems. We propose MIMDepth, a method that adapts masked image modeling (MIM) for self-supervised monocular depth estimation. While MIM has been used to learn generalizable features during pre-training, we show how it could be adapted for direct training of monocular depth estimation. Our experiments show that MIMDepth is more robust to noise, blur, weather conditions, digital artifacts, occlusions, as well as untargeted and targeted adversarial attacks.

Submitted 1 February, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: Accepted at 2023 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2207.07032 [pdf, other]

doi 10.1109/IROS47612.2022.9982154

Adversarial Attacks on Monocular Pose Estimation

Authors: Hemang Chawla, Arnav Varma, Elahe Arani, Bahram Zonooz

Abstract: Advances in deep learning have resulted in steady progress in computer vision with improved accuracy on tasks such as object detection and semantic segmentation. Nevertheless, deep neural networks are vulnerable to adversarial attacks, thus presenting a challenge in reliable deployment. Two of the prominent tasks in 3D scene-understanding for robotics and advanced driver assistance systems are monocular depth and pose estimation, often learned together in an unsupervised manner. While studies evaluating the impact of adversarial attacks on monocular depth estimation exist, a systematic demonstration and analysis of adversarial perturbations against pose estimation are lacking. We show how additive imperceptible perturbations can not only change predictions to increase the trajectory drift but also catastrophically alter its geometry. We also study the relation between adversarial perturbations targeting monocular depth and pose estimation networks, as well as the transferability of perturbations to other networks with different architectures and losses. Our experiments show how the generated perturbations lead to notable errors in relative rotation and translation predictions and elucidate vulnerabilities of the networks.

Submitted 14 July, 2022; originally announced July 2022.

Comments: Accepted at the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022)

arXiv:2202.03131 [pdf, other]

doi 10.5220/0010884000003124

Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics

Authors: Arnav Varma, Hemang Chawla, Bahram Zonooz, Elahe Arani

Abstract: The advent of autonomous driving and advanced driver assistance systems necessitates continuous developments in computer vision for 3D scene understanding. Self-supervised monocular depth estimation, a method for pixel-wise distance estimation of objects from a single camera without the use of ground truth labels, is an important task in 3D scene understanding. However, existing methods for this task are limited to convolutional neural network (CNN) architectures. In contrast with CNNs that use localized linear operations and lose feature resolution across the layers, vision transformers process at constant resolution with a global receptive field at every stage. While recent works have compared transformers against their CNN counterparts for tasks such as image classification, no study exists that investigates the impact of using transformers for self-supervised monocular depth estimation. Here, we first demonstrate how to adapt vision transformers for self-supervised monocular depth estimation. Thereafter, we compare the transformer and CNN-based architectures for their performance on KITTI depth prediction benchmarks, as well as their robustness to natural corruptions and adversarial attacks, including when the camera intrinsics are unknown. Our study demonstrates how transformer-based architecture, though lower in run-time efficiency, achieves comparable performance while being more robust and generalizable.

Submitted 7 February, 2022; originally announced February 2022.

Comments: Published in 17th International Conference on Computer Vision Theory and Applications (VISAPP, 2022)

arXiv:2112.10028 [pdf, other]

doi 10.1145/3627106.3627199

Attack of the Knights: A Non Uniform Cache Side-Channel Attack

Authors: Farabi Mahmud, Sungkeun Kim, Harpreet Singh Chawla, Chia-Che Tsai, Eun Jung Kim, Abdullah Muzahid

Abstract: For a distributed last-level cache (LLC) in a large multicore chip, the access time to one LLC bank can significantly differ from that to another due to the difference in physical distance. In this paper, we successfully demonstrated a new distance-based side-channel attack by timing the AES decryption operation and extracting part of an AES secret key on an Intel Knights Landing CPU. We introduce several techniques to overcome the challenges of the attack, including the use of multiple attack threads to ensure LLC hits, to detect vulnerable memory locations, and to obtain fine-grained timing of the victim operations. While operating as a covert channel, this attack can reach a bandwidth of 205 kbps with an error rate of only 0.02%. We also observed that the side-channel attack can extract 4 bytes of an AES key with 100% accuracy with only 4000 trial rounds of encryption.

Submitted 31 May, 2023; v1 submitted 18 December, 2021; originally announced December 2021.

Journal ref: Annual Computer Security Applications Conference ACSAC 2023

arXiv:2103.02451 [pdf, other]

doi 10.1109/ICRA48506.2021.9561441

Multimodal Scale Consistency and Awareness for Monocular Self-Supervised Depth Estimation

Authors: Hemang Chawla, Arnav Varma, Elahe Arani, Bahram Zonooz

Abstract: Dense depth estimation is essential to scene-understanding for autonomous driving. However, recent self-supervised approaches on monocular videos suffer from scale-inconsistency across long sequences. Utilizing data from the ubiquitously copresent global positioning systems (GPS), we tackle this challenge by proposing a dynamically-weighted GPS-to-Scale (g2s) loss to complement the appearance-based losses. We emphasize that the GPS is needed only during the multimodal training, and not at inference. The relative distance between frames captured through the GPS provides a scale signal that is independent of the camera setup and scene distribution, resulting in richer learned feature representations. Through extensive evaluation on multiple datasets, we demonstrate scale-consistent and -aware depth estimation during inference, improving the performance even when training with low-frequency GPS data.

Submitted 3 March, 2021; originally announced March 2021.

Comments: Accepted at 2021 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2012.08375 [pdf, other]

doi 10.5220/0010255808690880

Practical Auto-Calibration for Spatial Scene-Understanding from Crowdsourced Dashcamera Videos

Authors: Hemang Chawla, Matti Jukola, Shabbir Marzban, Elahe Arani, Bahram Zonooz

Abstract: Spatial scene-understanding, including dense depth and ego-motion estimation, is an important problem in computer vision for autonomous vehicles and advanced driver assistance systems. Thus, it is beneficial to design perception modules that can utilize crowdsourced videos collected from arbitrary vehicular onboard or dashboard cameras. However, the intrinsic parameters corresponding to such cameras are often unknown or change over time. Typical manual calibration approaches require objects such as a chessboard or additional scene-specific information. On the other hand, automatic camera calibration does not have such requirements. Yet, the automatic calibration of dashboard cameras is challenging as forward and planar navigation results in critical motion sequences with reconstruction ambiguities. Structure reconstruction of complete visual-sequences that may contain tens of thousands of images is also computationally untenable. Here, we propose a system for practical monocular onboard camera auto-calibration from crowdsourced videos. We show the effectiveness of our proposed system on the KITTI raw, Oxford RobotCar, and the crowdsourced D$^2$-City datasets in varying conditions. Finally, we demonstrate its application for accurate monocular dense depth and ego-motion estimation on uncalibrated videos.

Submitted 15 December, 2020; originally announced December 2020.

Comments: Accepted at 16th International Conference on Computer Vision Theory and Applications (VISAPP, 2021)

arXiv:2007.12918 [pdf, other]

doi 10.1109/IROS45743.2020.9341243

Crowdsourced 3D Mapping: A Combined Multi-View Geometry and Self-Supervised Learning Approach

Authors: Hemang Chawla, Matti Jukola, Terence Brouns, Elahe Arani, Bahram Zonooz

Abstract: The ability to efficiently utilize crowdsourced visual data carries immense potential for the domains of large scale dynamic mapping and autonomous driving. However, state-of-the-art methods for crowdsourced 3D mapping assume prior knowledge of camera intrinsics. In this work, we propose a framework that estimates the 3D positions of semantically meaningful landmarks such as traffic signs without assuming known camera intrinsics, using only monocular color camera and GPS. We utilize multi-view geometry as well as deep learning based self-calibration, depth, and ego-motion estimation for traffic sign positioning, and show that combining their strengths is important for increasing the map coverage. To facilitate research on this task, we construct and make available a KITTI based 3D traffic sign ground truth positioning dataset. Using our proposed framework, we achieve an average single-journey relative and absolute positioning accuracy of 39 cm and 1.26 m respectively, on this dataset.

Submitted 25 July, 2020; originally announced July 2020.

Comments: Accepted at 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2007.04592 [pdf, other]

doi 10.1109/ITSC45102.2020.9294445

Monocular Vision based Crowdsourced 3D Traffic Sign Positioning with Unknown Camera Intrinsics and Distortion Coefficients

Authors: Hemang Chawla, Matti Jukola, Elahe Arani, Bahram Zonooz

Abstract: Autonomous vehicles and driver assistance systems utilize maps of 3D semantic landmarks for improved decision making. However, scaling the mapping process as well as regularly updating such maps come with a huge cost. Crowdsourced mapping of these landmarks such as traffic sign positions provides an appealing alternative. The state-of-the-art approaches to crowdsourced mapping use ground truth camera parameters, which may not always be known or may change over time. In this work, we demonstrate an approach to computing 3D traffic sign positions without knowing the camera focal lengths, principal point, and distortion coefficients a priori. We validate our proposed approach on a public dataset of traffic signs in KITTI. Using only a monocular color camera and GPS, we achieve an average single journey relative and absolute positioning accuracy of 0.26 m and 1.38 m, respectively.

Submitted 9 July, 2020; originally announced July 2020.

Comments: Accepted at 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC)

arXiv:1601.01398 [pdf]

A Proof-of-Concept Device-to-Device Communication Testbed

Authors: Vibhutesh Kumar Singh, Hardik Chawla, Vivek Ashok Bohara

Abstract: This paper presents the design and development of a proof-of-concept Device-to-Device (D2D) communication testbed. This testbed also seeks to address the design issues involved in the implementation of a D2D network in a realistic scenario. The performance of this testbed has been validated by emulating a cellular network consisting of a Base Station (BTS) and many D2D devices in its proximity. The devices and the BTS coordinate and communicate with each other to select the optimum communication range, mode of communication and transmit parameters. The experimental results show that the proposed testbed has a communication radius of 120 m and a D2D communication range of 62 m with over 90% efficiency.

Submitted 6 January, 2016; originally announced January 2016.

Comments: 8th International Conference on COMmunication Systems & NETworkS (COMSNETS 2016), Demos & Exhibits Session

Showing 1–12 of 12 results for author: Chawla, H