Search | arXiv e-print repository

Human-Inspired Topological Representations for Visual Object Recognition in Unseen Environments

Authors: Ekta U. Samani, Ashis G. Banerjee

Abstract: Visual object recognition in unseen and cluttered indoor environments is a challenging problem for mobile robots. Toward this goal, we extend our previous work to propose the TOPS2 descriptor, and an accompanying recognition framework, THOR2, inspired by a human reasoning mechanism known as object unity. We interleave color embeddings obtained using the Mapper algorithm for topological soft cluste… ▽ More Visual object recognition in unseen and cluttered indoor environments is a challenging problem for mobile robots. Toward this goal, we extend our previous work to propose the TOPS2 descriptor, and an accompanying recognition framework, THOR2, inspired by a human reasoning mechanism known as object unity. We interleave color embeddings obtained using the Mapper algorithm for topological soft clustering with the shape-based TOPS descriptor to obtain the TOPS2 descriptor. THOR2, trained using synthetic data, achieves substantially higher recognition accuracy than the shape-based THOR framework and outperforms RGB-D ViT on two real-world datasets: the benchmark OCID dataset and the UW-IS Occluded dataset. Therefore, THOR2 is a promising step toward achieving robust recognition in low-cost robots. △ Less

Submitted 15 September, 2023; originally announced September 2023.

Comments: Accepted for presentation at the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop on Robotic Perception and Map**: Frontier Vision & Learning Techniques

arXiv:2305.03815 [pdf, other]

doi 10.1109/TRO.2023.3343994

Persistent Homology Meets Object Unity: Object Recognition in Clutter

Authors: Ekta U. Samani, Ashis G. Banerjee

Abstract: Recognition of occluded objects in unseen and unstructured indoor environments is a challenging problem for mobile robots. To address this challenge, we propose a new descriptor, TOPS, for point clouds generated from depth images and an accompanying recognition framework, THOR, inspired by human reasoning. The descriptor employs a novel slicing-based approach to compute topological features from f… ▽ More Recognition of occluded objects in unseen and unstructured indoor environments is a challenging problem for mobile robots. To address this challenge, we propose a new descriptor, TOPS, for point clouds generated from depth images and an accompanying recognition framework, THOR, inspired by human reasoning. The descriptor employs a novel slicing-based approach to compute topological features from filtrations of simplicial complexes using persistent homology, and facilitates reasoning-based recognition using object unity. Apart from a benchmark dataset, we report performance on a new dataset, the UW Indoor Scenes (UW-IS) Occluded dataset, curated using commodity hardware to reflect real-world scenarios with different environmental conditions and degrees of object occlusion. THOR outperforms state-of-the-art methods on both the datasets and achieves substantially higher recognition accuracy for all the scenarios of the UW-IS Occluded dataset. Therefore, THOR, is a promising step toward robust recognition in low-cost robots, meant for everyday use in indoor settings. △ Less

Submitted 21 December, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

Comments: This work has been accepted for publication in the IEEE Transactions on Robotics

arXiv:2303.03651 [pdf, other]

F2BEV: Bird's Eye View Generation from Surround-View Fisheye Camera Images for Automated Driving

Authors: Ekta U. Samani, Feng Tao, Harshavardhan R. Dasari, Sihao Ding, Ashis G. Banerjee

Abstract: Bird's Eye View (BEV) representations are tremendously useful for perception-related automated driving tasks. However, generating BEVs from surround-view fisheye camera images is challenging due to the strong distortions introduced by such wide-angle lenses. We take the first step in addressing this challenge and introduce a baseline, F2BEV, to generate discretized BEV height maps and BEV semantic… ▽ More Bird's Eye View (BEV) representations are tremendously useful for perception-related automated driving tasks. However, generating BEVs from surround-view fisheye camera images is challenging due to the strong distortions introduced by such wide-angle lenses. We take the first step in addressing this challenge and introduce a baseline, F2BEV, to generate discretized BEV height maps and BEV semantic segmentation maps from fisheye images. F2BEV consists of a distortion-aware spatial cross attention module for querying and consolidating spatial information from fisheye image features in a transformer-style architecture followed by a task-specific head. We evaluate single-task and multi-task variants of F2BEV on our synthetic FB-SSEM dataset, all of which generate better BEV height and segmentation maps (in terms of the IoU) than a state-of-the-art BEV generation method operating on undistorted fisheye images. We also demonstrate discretized height map generation from real-world fisheye images using F2BEV. Our dataset is publicly available at https://github.com/volvo-cars/FB-SSEM-dataset △ Less

Submitted 1 August, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: Accepted for publication in the proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems 2023

arXiv:2205.07479 [pdf, other]

Topologically Persistent Features-based Object Recognition in Cluttered Indoor Environments

Authors: Ekta U. Samani, Ashis G. Banerjee

Abstract: Recognition of occluded objects in unseen indoor environments is a challenging problem for mobile robots. This work proposes a new slicing-based topological descriptor that captures the 3D shape of object point clouds to address this challenge. It yields similarities between the descriptors of the occluded and the corresponding unoccluded objects, enabling object unity-based recognition using a li… ▽ More Recognition of occluded objects in unseen indoor environments is a challenging problem for mobile robots. This work proposes a new slicing-based topological descriptor that captures the 3D shape of object point clouds to address this challenge. It yields similarities between the descriptors of the occluded and the corresponding unoccluded objects, enabling object unity-based recognition using a library of trained models. The descriptor is obtained by partitioning an object's point cloud into multiple 2D slices and constructing filtrations (nested sequences of simplicial complexes) on the slices to mimic further slicing of the slices, thereby capturing detailed shapes through persistent homology-generated features. We use nine different sequences of cluttered scenes from a benchmark dataset for performance evaluation. Our method outperforms two state-of-the-art deep learning-based point cloud classification methods, namely, DGCNN and SimpleView. △ Less

Submitted 16 May, 2022; originally announced May 2022.

Comments: Accepted for presentation in the IEEE International Conference on Robotics and Automation (ICRA) 2022 Workshop on Robotic Perception and Map**: Emerging Techniques

arXiv:2010.03196 [pdf, other]

doi 10.1109/LRA.2021.3099460

Visual Object Recognition in Indoor Environments Using Topologically Persistent Features

Authors: Ekta U. Samani, Xingjian Yang, Ashis G. Banerjee

Abstract: Object recognition in unseen indoor environments remains a challenging problem for visual perception of mobile robots. In this letter, we propose the use of topologically persistent features, which rely on the objects' shape information, to address this challenge. In particular, we extract two kinds of features, namely, sparse persistence image (PI) and amplitude, by applying persistent homology t… ▽ More Object recognition in unseen indoor environments remains a challenging problem for visual perception of mobile robots. In this letter, we propose the use of topologically persistent features, which rely on the objects' shape information, to address this challenge. In particular, we extract two kinds of features, namely, sparse persistence image (PI) and amplitude, by applying persistent homology to multi-directional height function-based filtrations of the cubical complexes representing the object segmentation maps. The features are then used to train a fully connected network for recognition. For performance evaluation, in addition to a widely used shape dataset and a benchmark indoor scenes dataset, we collect a new dataset, comprising scene images from two different environments, namely, a living room and a mock warehouse. The scenes are captured using varying camera poses under different illumination conditions and include up to five different objects from a given set of fourteen objects. On the benchmark indoor scenes dataset, sparse PI features show better recognition performance in unseen environments than the features learned using the widely used ResNetV2-56 and EfficientNet-B4 models. Further, they provide slightly higher recall and accuracy values than Faster R-CNN, an end-to-end object detection method, and its state-of-the-art variant, Domain Adaptive Faster R-CNN. The performance of our methods also remains relatively unchanged from the training environment (living room) to the unseen environment (mock warehouse) in the new dataset. In contrast, the performance of the object detection methods drops substantially. We also implement the proposed method on a real-world robot to demonstrate its usefulness. △ Less

Submitted 28 July, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: This work has been accepted for publication in the IEEE Robotics And Automation Letters

arXiv:1907.03576 [pdf, other]

Deep Learning-Based Semantic Segmentation of Microscale Objects

Authors: Ekta U. Samani, Wei Guo, Ashis G. Banerjee

Abstract: Accurate estimation of the positions and shapes of microscale objects is crucial for automated imaging-guided manipulation using a non-contact technique such as optical tweezers. Perception methods that use traditional computer vision algorithms tend to fail when the manipulation environments are crowded. In this paper, we present a deep learning model for semantic segmentation of the images repre… ▽ More Accurate estimation of the positions and shapes of microscale objects is crucial for automated imaging-guided manipulation using a non-contact technique such as optical tweezers. Perception methods that use traditional computer vision algorithms tend to fail when the manipulation environments are crowded. In this paper, we present a deep learning model for semantic segmentation of the images representing such environments. Our model successfully performs segmentation with a high mean Intersection Over Union score of 0.91. △ Less

Submitted 3 July, 2019; originally announced July 2019.

Comments: A condensed version of the paper is published in the Proceedings of the 2019 International Conference on Manipulation, Automation and Robotics at Small Scales

arXiv:1902.05176 [pdf, other]

doi 10.1109/LRA.2019.2925305

Toward Ergonomic Risk Prediction via Segmentation of Indoor Object Manipulation Actions Using Spatiotemporal Convolutional Networks

Authors: Behnoosh Parsa, Ekta U. Samani, Rose Hendrix, Cameron Devine, Shashi M. Singh, Santosh Devasia, Ashis G. Banerjee

Abstract: Automated real-time prediction of the ergonomic risks of manipulating objects is a key unsolved challenge in develo** effective human-robot collaboration systems for logistics and manufacturing applications. We present a foundational paradigm to address this challenge by formulating the problem as one of action segmentation from RGB-D camera videos. Spatial features are first learned using a dee… ▽ More Automated real-time prediction of the ergonomic risks of manipulating objects is a key unsolved challenge in develo** effective human-robot collaboration systems for logistics and manufacturing applications. We present a foundational paradigm to address this challenge by formulating the problem as one of action segmentation from RGB-D camera videos. Spatial features are first learned using a deep convolutional model from the video frames, which are then fed sequentially to temporal convolutional networks to semantically segment the frames into a hierarchy of actions, which are either ergonomically safe, require monitoring, or need immediate attention. For performance evaluation, in addition to an open-source kitchen dataset, we collected a new dataset comprising twenty individuals picking up and placing objects of varying weights to and from cabinet and table locations at various heights. Results show very high (87-94)\% F1 overlap scores among the ground truth and predicted frame labels for videos lasting over two minutes and consisting of a large number of actions. △ Less

Submitted 26 June, 2019; v1 submitted 13 February, 2019; originally announced February 2019.

Showing 1–7 of 7 results for author: Samani, E U