Search | arXiv e-print repository

TrackAgent: 6D Object Tracking via Reinforcement Learning

Authors: Konstantin Röhrl, Dominik Bauer, Timothy Patten, Markus Vincze

Abstract: Tracking an object's 6D pose, while either the object itself or the observing camera is moving, is important for many robotics and augmented reality applications. While exploiting temporal priors eases this problem, object-specific knowledge is required to recover when tracking is lost. Under the tight time constraints of the tracking task, RGB(D)-based methods are often conceptionally complex or… ▽ More Tracking an object's 6D pose, while either the object itself or the observing camera is moving, is important for many robotics and augmented reality applications. While exploiting temporal priors eases this problem, object-specific knowledge is required to recover when tracking is lost. Under the tight time constraints of the tracking task, RGB(D)-based methods are often conceptionally complex or rely on heuristic motion models. In comparison, we propose to simplify object tracking to a reinforced point cloud (depth only) alignment task. This allows us to train a streamlined approach from scratch with limited amounts of sparse 3D point clouds, compared to the large datasets of diverse RGBD sequences required in previous works. We incorporate temporal frame-to-frame registration with object-based recovery by frame-to-model refinement using a reinforcement learning (RL) agent that jointly solves for both objectives. We also show that the RL agent's uncertainty and a rendering-based mask propagation are effective reinitialization triggers. △ Less

Submitted 28 July, 2023; originally announced July 2023.

Comments: International Conference on Computer Vision Systems (ICVS) 2023

arXiv:2208.08807 [pdf, other]

COPE: End-to-end trainable Constant Runtime Object Pose Estimation

Authors: Stefan Thalhammer, Timothy Patten, Markus Vincze

Abstract: State-of-the-art object pose estimation handles multiple instances in a test image by using multi-model formulations: detection as a first stage and then separately trained networks per object for 2D-3D geometric correspondence prediction as a second stage. Poses are subsequently estimated using the Perspective-n-Points algorithm at runtime. Unfortunately, multi-model formulations are slow and do… ▽ More State-of-the-art object pose estimation handles multiple instances in a test image by using multi-model formulations: detection as a first stage and then separately trained networks per object for 2D-3D geometric correspondence prediction as a second stage. Poses are subsequently estimated using the Perspective-n-Points algorithm at runtime. Unfortunately, multi-model formulations are slow and do not scale well with the number of object instances involved. Recent approaches show that direct 6D object pose estimation is feasible when derived from the aforementioned geometric correspondences. We present an approach that learns an intermediate geometric representation of multiple objects to directly regress 6D poses of all instances in a test image. The inherent end-to-end trainability overcomes the requirement of separately processing individual object instances. By calculating the mutual Intersection-over-Unions, pose hypotheses are clustered into distinct instances, which achieves negligible runtime overhead with respect to the number of object instances. Results on multiple challenging standard datasets show that the pose estimation performance is superior to single-model state-of-the-art approaches despite being more than ~35 times faster. We additionally provide an analysis showing real-time applicability (>24 fps) for images where more than 90 object instances are present. Further results show the advantage of supervising geometric-correspondence-based object pose estimation with the 6D pose. △ Less

Submitted 22 August, 2022; v1 submitted 18 August, 2022; originally announced August 2022.

arXiv:2203.01977 [pdf, other]

Audio-Visual Object Classification for Human-Robot Collaboration

Authors: A. Xompero, Y. L. Pang, T. Patten, A. Prabhakar, B. Calli, A. Cavallaro

Abstract: Human-robot collaboration requires the contactless estimation of the physical properties of containers manipulated by a person, for example while pouring content in a cup or moving a food box. Acoustic and visual signals can be used to estimate the physical properties of such objects, which may vary substantially in shape, material and size, and also be occluded by the hands of the person. To faci… ▽ More Human-robot collaboration requires the contactless estimation of the physical properties of containers manipulated by a person, for example while pouring content in a cup or moving a food box. Acoustic and visual signals can be used to estimate the physical properties of such objects, which may vary substantially in shape, material and size, and also be occluded by the hands of the person. To facilitate comparisons and stimulate progress in solving this problem, we present the CORSMAL challenge and a dataset to assess the performance of the algorithms through a set of well-defined performance scores. The tasks of the challenge are the estimation of the mass, capacity, and dimensions of the object (container), and the classification of the type and amount of its content. A novel feature of the challenge is our real-to-simulation framework for visualising and assessing the impact of estimation errors in human-to-robot handovers. △ Less

Submitted 3 March, 2022; originally announced March 2022.

Comments: 5 pages, 2 figures, 1 table; accepted at ICASSP 2022; Challenge webpage, see https://corsmal.eecs.qmul.ac.uk/challenge.html

arXiv:2201.00239 [pdf, other]

SporeAgent: Reinforced Scene-level Plausibility for Object Pose Refinement

Authors: Dominik Bauer, Timothy Patten, Markus Vincze

Abstract: Observational noise, inaccurate segmentation and ambiguity due to symmetry and occlusion lead to inaccurate object pose estimates. While depth- and RGB-based pose refinement approaches increase the accuracy of the resulting pose estimates, they are susceptible to ambiguity in the observation as they consider visual alignment. We propose to leverage the fact that we often observe static, rigid scen… ▽ More Observational noise, inaccurate segmentation and ambiguity due to symmetry and occlusion lead to inaccurate object pose estimates. While depth- and RGB-based pose refinement approaches increase the accuracy of the resulting pose estimates, they are susceptible to ambiguity in the observation as they consider visual alignment. We propose to leverage the fact that we often observe static, rigid scenes. Thus, the objects therein need to be under physically plausible poses. We show that considering plausibility reduces ambiguity and, in consequence, allows poses to be more accurately predicted in cluttered environments. To this end, we extend a recent RL-based registration approach towards iterative refinement of object poses. Experiments on the LINEMOD and YCB-VIDEO datasets demonstrate the state-of-the-art performance of our depth-based refinement approach. △ Less

Submitted 1 January, 2022; originally announced January 2022.

Comments: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2022

arXiv:2104.05334 [pdf, ps, other]

Risk-Averse Biased Human Policies in Assistive Multi-Armed Bandit Settings

Authors: Michael Koller, Timothy Patten, Markus Vincze

Abstract: Assistive multi-armed bandit problems can be used to model team situations between a human and an autonomous system like a domestic service robot. To account for human biases such as the risk-aversion described in the Cumulative Prospect Theory, the setting is expanded to using observable rewards. When robots leverage knowledge about the risk-averse human model they eliminate the bias and make mor… ▽ More Assistive multi-armed bandit problems can be used to model team situations between a human and an autonomous system like a domestic service robot. To account for human biases such as the risk-aversion described in the Cumulative Prospect Theory, the setting is expanded to using observable rewards. When robots leverage knowledge about the risk-averse human model they eliminate the bias and make more rational choices. We present an algorithm that increases the utility value of such human-robot teams. A brief evaluation indicates that arbitrary reward functions can be handled. △ Less

Submitted 12 April, 2021; originally announced April 2021.

Comments: in TRAITS Workshop Proceedings (arXiv:2103.12679) held in conjunction with Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, March 2021, Pages 709-711

Report number: TRAITS/2021/10

arXiv:2103.15231 [pdf, other]

ReAgent: Point Cloud Registration using Imitation and Reinforcement Learning

Authors: Dominik Bauer, Timothy Patten, Markus Vincze

Abstract: Point cloud registration is a common step in many 3D computer vision tasks such as object pose estimation, where a 3D model is aligned to an observation. Classical registration methods generalize well to novel domains but fail when given a noisy observation or a bad initialization. Learning-based methods, in contrast, are more robust but lack in generalization capacity. We propose to consider iter… ▽ More Point cloud registration is a common step in many 3D computer vision tasks such as object pose estimation, where a 3D model is aligned to an observation. Classical registration methods generalize well to novel domains but fail when given a noisy observation or a bad initialization. Learning-based methods, in contrast, are more robust but lack in generalization capacity. We propose to consider iterative point cloud registration as a reinforcement learning task and, to this end, present a novel registration agent (ReAgent). We employ imitation learning to initialize its discrete registration policy based on a steady expert policy. Integration with policy optimization, based on our proposed alignment reward, further improves the agent's registration performance. We compare our approach to classical and learning-based registration methods on both ModelNet40 (synthetic) and ScanObjectNN (real data) and show that our ReAgent achieves state-of-the-art accuracy. The lightweight architecture of the agent, moreover, enables reduced inference time as compared to related approaches. In addition, we apply our method to the object pose estimation task on real data (LINEMOD), outperforming state-of-the-art pose refinement approaches. △ Less

Submitted 28 March, 2021; originally announced March 2021.

Comments: Accepted at CVPR 2021

arXiv:2103.06134 [pdf, other]

Sim2Real 3D Object Classification using Spherical Kernel Point Convolution and a Deep Center Voting Scheme

Authors: Jean-Baptiste Weibel, Timothy Patten, Markus Vincze

Abstract: While object semantic understanding is essential for most service robotic tasks, 3D object classification is still an open problem. Learning from artificial 3D models alleviates the cost of annotation necessary to approach this problem, but most methods still struggle with the differences existing between artificial and real 3D data. We conjecture that the cause of those issue is the fact that man… ▽ More While object semantic understanding is essential for most service robotic tasks, 3D object classification is still an open problem. Learning from artificial 3D models alleviates the cost of annotation necessary to approach this problem, but most methods still struggle with the differences existing between artificial and real 3D data. We conjecture that the cause of those issue is the fact that many methods learn directly from point coordinates, instead of the shape, as the former is hard to center and to scale under variable occlusions reliably. We introduce spherical kernel point convolutions that directly exploit the object surface, represented as a graph, and a voting scheme to limit the impact of poor segmentation on the classification results. Our proposed approach improves upon state-of-the-art methods by up to 36% when transferring from artificial objects to real objects. △ Less

Submitted 10 March, 2021; originally announced March 2021.

arXiv:2010.16117 [pdf, other]

PyraPose: Feature Pyramids for Fast and Accurate Object Pose Estimation under Domain Shift

Authors: Stefan Thalhammer, Markus Leitner, Timothy Patten, Markus Vincze

Abstract: Object pose estimation enables robots to understand and interact with their environments. Training with synthetic data is necessary in order to adapt to novel situations. Unfortunately, pose estimation under domain shift, i.e., training on synthetic data and testing in the real world, is challenging. Deep learning-based approaches currently perform best when using encoder-decoder networks but typi… ▽ More Object pose estimation enables robots to understand and interact with their environments. Training with synthetic data is necessary in order to adapt to novel situations. Unfortunately, pose estimation under domain shift, i.e., training on synthetic data and testing in the real world, is challenging. Deep learning-based approaches currently perform best when using encoder-decoder networks but typically do not generalize to new scenarios with different scene characteristics. We argue that patch-based approaches, instead of encoder-decoder networks, are more suited for synthetic-to-real transfer because local to global object information is better represented. To that end, we present a novel approach based on a specialized feature pyramid network to compute multi-scale features for creating pose hypotheses on different feature map resolutions in parallel. Our single-shot pose estimation approach is evaluated on multiple standard datasets and outperforms the state of the art by up to 35%. We also perform gras** experiments in the real world to demonstrate the advantage of using synthetic data to generalize to novel environments. △ Less

Submitted 30 October, 2020; originally announced October 2020.

arXiv:2005.03717 [pdf, other]

doi 10.1007/978-3-030-58548-8_38

Neural Object Learning for 6D Pose Estimation Using a Few Cluttered Images

Authors: Kiru Park, Timothy Patten, Markus Vincze

Abstract: Recent methods for 6D pose estimation of objects assume either textured 3D models or real images that cover the entire range of target poses. However, it is difficult to obtain textured 3D models and annotate the poses of objects in real scenarios. This paper proposes a method, Neural Object Learning (NOL), that creates synthetic images of objects in arbitrary poses by combining only a few observa… ▽ More Recent methods for 6D pose estimation of objects assume either textured 3D models or real images that cover the entire range of target poses. However, it is difficult to obtain textured 3D models and annotate the poses of objects in real scenarios. This paper proposes a method, Neural Object Learning (NOL), that creates synthetic images of objects in arbitrary poses by combining only a few observations from cluttered images. A novel refinement step is proposed to align inaccurate poses of objects in source images, which results in better quality images. Evaluations performed on two public datasets show that the rendered images created by NOL lead to state-of-the-art performance in comparison to methods that use 13 times the number of real images. Evaluations on our new dataset show multiple objects can be trained and recognized simultaneously using a sequence of a fixed scene. △ Less

Submitted 21 August, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

Comments: ECCV 2020 (Spotlight)

arXiv:2001.05279 [pdf, other]

doi 10.3389/frobt.2020.00120

DGCM-Net: Dense Geometrical Correspondence Matching Network for Incremental Experience-based Robotic Gras**

Authors: Timothy Patten, Kiru Park, Markus Vincze

Abstract: This article presents a method for gras** novel objects by learning from experience. Successful attempts are remembered and then used to guide future grasps such that more reliable gras** is achieved over time. To generalise the learned experience to unseen objects, we introduce the dense geometric correspondence matching network (DGCM-Net). This applies metric learning to encode objects with… ▽ More This article presents a method for gras** novel objects by learning from experience. Successful attempts are remembered and then used to guide future grasps such that more reliable gras** is achieved over time. To generalise the learned experience to unseen objects, we introduce the dense geometric correspondence matching network (DGCM-Net). This applies metric learning to encode objects with similar geometry nearby in feature space. Retrieving relevant experience for an unseen object is thus a nearest neighbour search with the encoded feature maps. DGCM-Net also reconstructs 3D-3D correspondences using the view-dependent normalised object coordinate space to transform grasp configurations from retrieved samples to unseen objects. In comparison to baseline methods, our approach achieves an equivalent grasp success rate. However, the baselines are significantly improved when fusing the knowledge from experience with their grasp proposal strategy. Offline experiments with a gras** dataset highlight the capability to generalise within and between object classes as well as to improve success rate over time from increasing experience. Lastly, by learning task-relevant grasps, our approach can prioritise grasps that enable the functional use of objects. △ Less

Submitted 17 September, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

Comments: Published in the journal "Frontiers in Robotics and AI"

Journal ref: Frontiers in Robotics and AI 7:120, 2020

arXiv:1910.12585 [pdf, other]

Addressing the Sim2Real Gap in Robotic 3D Object Classification

Authors: Jean-Baptiste Weibel, Timothy Patten, Markus Vincze

Abstract: Object classification with 3D data is an essential component of any scene understanding method. It has gained significant interest in a variety of communities, most notably in robotics and computer graphics. While the advent of deep learning has progressed the field of 3D object classification, most work using this data type are solely evaluated on CAD model datasets. Consequently, current work do… ▽ More Object classification with 3D data is an essential component of any scene understanding method. It has gained significant interest in a variety of communities, most notably in robotics and computer graphics. While the advent of deep learning has progressed the field of 3D object classification, most work using this data type are solely evaluated on CAD model datasets. Consequently, current work does not address the discrepancies existing between real and artificial data. In this work, we examine this gap in a robotic context by specifically addressing the problem of classification when transferring from artificial CAD models to real reconstructed objects. This is performed by training on ModelNet (CAD models) and evaluating on ScanNet (reconstructed objects). We show that standard methods do not perform well in this task. We thus introduce a method that carefully samples object parts that are reproducible under various transformations and hence robust. Using graph convolution to classify the composed graph of parts, our method significantly improves upon the baseline. △ Less

Submitted 28 October, 2019; originally announced October 2019.

arXiv:1909.05730 [pdf, other]

VeREFINE: Integrating Object Pose Verification with Physics-guided Iterative Refinement

Authors: Dominik Bauer, Timothy Patten, Markus Vincze

Abstract: Accurate and robust object pose estimation for robotics applications requires verification and refinement steps. In this work, we propose to integrate hypotheses verification with object pose refinement guided by physics simulation. This allows the physical plausibility of individual object pose estimates and the stability of the estimated scene to be considered in a unified optimization. The prop… ▽ More Accurate and robust object pose estimation for robotics applications requires verification and refinement steps. In this work, we propose to integrate hypotheses verification with object pose refinement guided by physics simulation. This allows the physical plausibility of individual object pose estimates and the stability of the estimated scene to be considered in a unified optimization. The proposed method is able to adapt to scenes of multiple objects and efficiently focuses on refining the most promising object poses in multi-hypotheses scenarios. We call this integrated approach VeREFINE and evaluate it on three datasets with varying scene complexity. The generality of the approach is shown by using three state-of-the-art pose estimators and three baseline refiners. Results show improvements over all baselines and on all datasets. Furthermore, our approach is applied in real-world gras** experiments and outperforms competing methods in terms of grasp success rate. Code is publicly available at github.com/dornik/verefine. △ Less

Submitted 18 May, 2020; v1 submitted 12 September, 2019; originally announced September 2019.

Comments: Revised version

arXiv:1908.07433 [pdf, other]

doi 10.1109/ICCV.2019.00776

Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation

Authors: Kiru Park, Timothy Patten, Markus Vincze

Abstract: Estimating the 6D pose of objects using only RGB images remains challenging because of problems such as occlusion and symmetries. It is also difficult to construct 3D models with precise texture without expert knowledge or specialized scanning devices. To address these problems, we propose a novel pose estimation method, Pix2Pose, that predicts the 3D coordinates of each object pixel without textu… ▽ More Estimating the 6D pose of objects using only RGB images remains challenging because of problems such as occlusion and symmetries. It is also difficult to construct 3D models with precise texture without expert knowledge or specialized scanning devices. To address these problems, we propose a novel pose estimation method, Pix2Pose, that predicts the 3D coordinates of each object pixel without textured models. An auto-encoder architecture is designed to estimate the 3D coordinates and expected errors per pixel. These pixel-wise predictions are then used in multiple stages to form 2D-3D correspondences to directly compute poses with the PnP algorithm with RANSAC iterations. Our method is robust to occlusion by leveraging recent achievements in generative adversarial training to precisely recover occluded parts. Furthermore, a novel loss function, the transformer loss, is proposed to handle symmetric objects by guiding predictions to the closest symmetric pose. Evaluations on three different benchmark datasets containing symmetric and occluded objects show our method outperforms the state of the art using only RGB images. △ Less

Submitted 20 August, 2019; originally announced August 2019.

Comments: Accepted at ICCV 2019 (Oral)

arXiv:1902.01626 [pdf, other]

EasyLabel: A Semi-Automatic Pixel-wise Object Annotation Tool for Creating Robotic RGB-D Datasets

Authors: Markus Suchi, Timothy Patten, David Fischinger, Markus Vincze

Abstract: Develo** robot perception systems for recognizing objects in the real-world requires computer vision algorithms to be carefully scrutinized with respect to the expected operating domain. This demands large quantities of ground truth data to rigorously evaluate the performance of algorithms. This paper presents the EasyLabel tool for easily acquiring high quality ground truth annotation of object… ▽ More Develo** robot perception systems for recognizing objects in the real-world requires computer vision algorithms to be carefully scrutinized with respect to the expected operating domain. This demands large quantities of ground truth data to rigorously evaluate the performance of algorithms. This paper presents the EasyLabel tool for easily acquiring high quality ground truth annotation of objects at the pixel-level in densely cluttered scenes. In a semi-automatic process, complex scenes are incrementally built and EasyLabel exploits depth change to extract precise object masks at each step. We use this tool to generate the Object Cluttered Indoor Dataset (OCID) that captures diverse settings of objects, background, context, sensor to scene distance, viewpoint angle and lighting conditions. OCID is used to perform a systematic comparison of existing object segmentation methods. The baseline comparison supports the need for pixel- and object-wise annotation to progress robot vision towards realistic applications. This insight reveals the usefulness of EasyLabel and OCID to better understand the challenges that robots face in the real-world. Copyright 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. △ Less

Submitted 1 March, 2019; v1 submitted 5 February, 2019; originally announced February 2019.

Comments: 7 pages, 8 figures, ICRA2019, Draft

arXiv:1704.08716 [pdf, other]

Artificial Intelligence Based Malware Analysis

Authors: Avi Pfeffer, Brian Ruttenberg, Lee Kellogg, Michael Howard, Catherine Call, Alison O'Connor, Glenn Takata, Scott Neal Reilly, Terry Patten, Jason Taylor, Robert Hall, Arun Lakhotia, Craig Miles, Dan Scofield, Jared Frank

Abstract: Artificial intelligence methods have often been applied to perform specific functions or tasks in the cyber-defense realm. However, as adversary methods become more complex and difficult to divine, piecemeal efforts to understand cyber-attacks, and malware-based attacks in particular, are not providing sufficient means for malware analysts to understand the past, present and future characteristics… ▽ More Artificial intelligence methods have often been applied to perform specific functions or tasks in the cyber-defense realm. However, as adversary methods become more complex and difficult to divine, piecemeal efforts to understand cyber-attacks, and malware-based attacks in particular, are not providing sufficient means for malware analysts to understand the past, present and future characteristics of malware. In this paper, we present the Malware Analysis and Attributed using Genetic Information (MAAGI) system. The underlying idea behind the MAAGI system is that there are strong similarities between malware behavior and biological organism behavior, and applying biologically inspired methods to corpora of malware can help analysts better understand the ecosystem of malware attacks. Due to the sophistication of the malware and the analysis, the MAAGI system relies heavily on artificial intelligence techniques to provide this capability. It has already yielded promising results over its development life, and will hopefully inspire more integration between the artificial intelligence and cyber--defense communities. △ Less

Submitted 27 April, 2017; originally announced April 2017.

Showing 1–15 of 15 results for author: Patten, T