Search | arXiv e-print repository

RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics

Authors: Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox

Abstract: From rearranging objects on a table to putting groceries into shelves, robots must plan precise action points to perform tasks accurately and reliably. In spite of the recent adoption of vision language models (VLMs) to control robot behavior, VLMs struggle to precisely articulate robot actions using language. We introduce an automatic synthetic data generation pipeline that instruction-tunes VLMs… ▽ More From rearranging objects on a table to putting groceries into shelves, robots must plan precise action points to perform tasks accurately and reliably. In spite of the recent adoption of vision language models (VLMs) to control robot behavior, VLMs struggle to precisely articulate robot actions using language. We introduce an automatic synthetic data generation pipeline that instruction-tunes VLMs to robotic domains and needs. Using the pipeline, we train RoboPoint, a VLM that predicts image keypoint affordances given language instructions. Compared to alternative approaches, our method requires no real-world data collection or human demonstration, making it much more scalable to diverse environments and viewpoints. In addition, RoboPoint is a general model that enables several downstream applications such as robot navigation, manipulation, and augmented reality (AR) assistance. Our experiments demonstrate that RoboPoint outperforms state-of-the-art VLMs (GPT-4o) and visual prompting techniques (PIVOT) by 21.8% in the accuracy of predicting spatial affordance and by 30.5% in the success rate of downstream tasks. Project website: https://robo-point.github.io. △ Less

Submitted 15 June, 2024; originally announced June 2024.

arXiv:2311.00926 [pdf, other]

M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place

Authors: Wentao Yuan, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox

Abstract: With the advent of large language models and large-scale robotic datasets, there has been tremendous progress in high-level decision-making for object manipulation. These generic models are able to interpret complex tasks using language commands, but they often have difficulties generalizing to out-of-distribution objects due to the inability of low-level action primitives. In contrast, existing t… ▽ More With the advent of large language models and large-scale robotic datasets, there has been tremendous progress in high-level decision-making for object manipulation. These generic models are able to interpret complex tasks using language commands, but they often have difficulties generalizing to out-of-distribution objects due to the inability of low-level action primitives. In contrast, existing task-specific models excel in low-level manipulation of unknown objects, but only work for a single type of action. To bridge this gap, we present M2T2, a single model that supplies different types of low-level actions that work robustly on arbitrary objects in cluttered scenes. M2T2 is a transformer model which reasons about contact points and predicts valid gripper poses for different action modes given a raw point cloud of the scene. Trained on a large-scale synthetic dataset with 128K scenes, M2T2 achieves zero-shot sim2real transfer on the real robot, outperforming the baseline system with state-of-the-art task-specific models by about 19% in overall performance and 37.5% in challenging scenes where the object needs to be re-oriented for collision-free placement. M2T2 also achieves state-of-the-art results on a subset of language conditioned tasks in RLBench. Videos of robot experiments on unseen objects in both real world and simulation are available on our project website https://m2-t2.github.io. △ Less

Submitted 1 November, 2023; originally announced November 2023.

Comments: 12 pages, 8 figures, accepted by CoRL 2023

arXiv:2304.09302 [pdf, other]

CabiNet: Scaling Neural Collision Detection for Object Rearrangement with Procedural Scene Generation

Authors: Adithyavairavan Murali, Arsalan Mousavian, Clemens Eppner, Adam Fishman, Dieter Fox

Abstract: We address the important problem of generalizing robotic rearrangement to clutter without any explicit object models. We first generate over 650K cluttered scenes - orders of magnitude more than prior work - in diverse everyday environments, such as cabinets and shelves. We render synthetic partial point clouds from this data and use it to train our CabiNet model architecture. CabiNet is a collisi… ▽ More We address the important problem of generalizing robotic rearrangement to clutter without any explicit object models. We first generate over 650K cluttered scenes - orders of magnitude more than prior work - in diverse everyday environments, such as cabinets and shelves. We render synthetic partial point clouds from this data and use it to train our CabiNet model architecture. CabiNet is a collision model that accepts object and scene point clouds, captured from a single-view depth observation, and predicts collisions for SE(3) object poses in the scene. Our representation has a fast inference speed of 7 microseconds per query with nearly 20% higher performance than baseline approaches in challenging environments. We use this collision model in conjunction with a Model Predictive Path Integral (MPPI) planner to generate collision-free trajectories for picking and placing in clutter. CabiNet also predicts waypoints, computed from the scene's signed distance field (SDF), that allows the robot to navigate tight spaces during rearrangement. This improves rearrangement performance by nearly 35% compared to baselines. We systematically evaluate our approach, procedurally generate simulated experiments, and demonstrate that our approach directly transfers to the real world, despite training exclusively in simulation. Robot experiment demos in completely unknown scenes and objects can be found at this http https://cabinet-object-rearrangement.github.io △ Less

Submitted 18 April, 2023; originally announced April 2023.

arXiv:2302.10745 [pdf, other]

Constrained Generative Sampling of 6-DoF Grasps

Authors: Jens Lundell, Francesco Verdoja, Tran Nguyen Le, Arsalan Mousavian, Dieter Fox, Ville Kyrki

Abstract: Most state-of-the-art data-driven grasp sampling methods propose stable and collision-free grasps uniformly on the target object. For bin-picking, executing any of those reachable grasps is sufficient. However, for completing specific tasks, such as squeezing out liquid from a bottle, we want the grasp to be on a specific part of the object's body while avoiding other locations, such as the cap. T… ▽ More Most state-of-the-art data-driven grasp sampling methods propose stable and collision-free grasps uniformly on the target object. For bin-picking, executing any of those reachable grasps is sufficient. However, for completing specific tasks, such as squeezing out liquid from a bottle, we want the grasp to be on a specific part of the object's body while avoiding other locations, such as the cap. This work presents a generative grasp sampling network, VCGS, capable of constrained 6 Degrees of Freedom (DoF) grasp sampling. In addition, we also curate a new dataset designed to train and evaluate methods for constrained gras**. The new dataset, called CONG, consists of over 14 million training samples of synthetically rendered point clouds and grasps at random target areas on 2889 objects. VCGS is benchmarked against GraspNet, a state-of-the-art unconstrained grasp sampler, in simulation and on a real robot. The results demonstrate that VCGS achieves a 10-15% higher grasp success rate than the baseline while being 2-3 times as sample efficient. Supplementary material is available on our project website. △ Less

Submitted 17 August, 2023; v1 submitted 21 February, 2023; originally announced February 2023.

Comments: Accepted at the International Conference on Intelligent Robots and Systems (IROS 2023)

arXiv:2212.06870 [pdf, other]

MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare

Authors: Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, Josef Sivic

Abstract: We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which c… ▽ More We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/. △ Less

Submitted 13 December, 2022; originally announced December 2022.

Comments: CoRL 2022

arXiv:2210.13638 [pdf, other]

Learning Robust Real-World Dexterous Gras** Policies via Implicit Shape Augmentation

Authors: Zoey Qiuyu Chen, Karl Van Wyk, Yu-Wei Chao, Wei Yang, Arsalan Mousavian, Abhishek Gupta, Dieter Fox

Abstract: Dexterous robotic hands have the capability to interact with a wide variety of household objects to perform tasks like gras**. However, learning robust real world gras** policies for arbitrary objects has proven challenging due to the difficulty of generating high quality training data. In this work, we propose a learning system (ISAGrasp) for leveraging a small number of human demonstrations… ▽ More Dexterous robotic hands have the capability to interact with a wide variety of household objects to perform tasks like gras**. However, learning robust real world gras** policies for arbitrary objects has proven challenging due to the difficulty of generating high quality training data. In this work, we propose a learning system (ISAGrasp) for leveraging a small number of human demonstrations to bootstrap the generation of a much larger dataset containing successful grasps on a variety of novel objects. Our key insight is to use a correspondence-aware implicit generative model to deform object meshes and demonstrated human grasps in order to generate a diverse dataset of novel objects and successful grasps for supervised learning, while maintaining semantic realism. We use this dataset to train a robust gras** policy in simulation which can be deployed in the real world. We demonstrate gras** performance with a four-fingered Allegro hand in both simulation and the real world, and show this method can handle entirely new semantic classes and achieve a 79% success rate on gras** unseen objects in the real world. △ Less

Submitted 24 October, 2022; originally announced October 2022.

Comments: Accepted by CoRL2022

arXiv:2209.14284 [pdf, other]

DexTransfer: Real World Multi-fingered Dexterous Gras** with Minimal Human Demonstrations

Authors: Zoey Qiuyu Chen, Karl Van Wyk, Yu-Wei Chao, Wei Yang, Arsalan Mousavian, Abhishek Gupta, Dieter Fox

Abstract: Teaching a multi-fingered dexterous robot to grasp objects in the real world has been a challenging problem due to its high dimensional state and action space. We propose a robot-learning system that can take a small number of human demonstrations and learn to grasp unseen object poses given partially occluded observations. Our system leverages a small motion capture dataset and generates a large… ▽ More Teaching a multi-fingered dexterous robot to grasp objects in the real world has been a challenging problem due to its high dimensional state and action space. We propose a robot-learning system that can take a small number of human demonstrations and learn to grasp unseen object poses given partially occluded observations. Our system leverages a small motion capture dataset and generates a large dataset with diverse and successful trajectories for a multi-fingered robot gripper. By adding domain randomization, we show that our dataset provides robust gras** trajectories that can be transferred to a policy learner. We train a dexterous gras** policy that takes the point clouds of the object as input and predicts continuous actions to grasp objects from different initial robot states. We evaluate the effectiveness of our system on a 22-DoF floating Allegro Hand in simulation and a 23-DoF Allegro robot hand with a KUKA arm in real world. The policy learned from our dataset can generalize well on unseen object poses in both simulation and the real world △ Less

Submitted 28 September, 2022; originally announced September 2022.

arXiv:2209.11302 [pdf, other]

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

Authors: Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg

Abstract: Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumeratin… ▽ More Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state of the art success rates in VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks. Website at progprompt.github.io △ Less

Submitted 22 September, 2022; originally announced September 2022.

arXiv:2207.02556 [pdf, other]

Deep Learning Approaches to Grasp Synthesis: A Review

Authors: Rhys Newbury, Morris Gu, Lachlan Chumbley, Arsalan Mousavian, Clemens Eppner, Jürgen Leitner, Jeannette Bohg, Antonio Morales, Tamim Asfour, Danica Kragic, Dieter Fox, Akansel Cosgun

Abstract: Gras** is the process of picking up an object by applying forces and torques at a set of contacts. Recent advances in deep-learning methods have allowed rapid progress in robotic object gras**. In this systematic review, we surveyed the publications over the last decade, with a particular interest in gras** an object using all 6 degrees of freedom of the end-effector pose. Our review found f… ▽ More Gras** is the process of picking up an object by applying forces and torques at a set of contacts. Recent advances in deep-learning methods have allowed rapid progress in robotic object gras**. In this systematic review, we surveyed the publications over the last decade, with a particular interest in gras** an object using all 6 degrees of freedom of the end-effector pose. Our review found four common methodologies for robotic gras**: sampling-based approaches, direct regression, reinforcement learning, and exemplar approaches. Additionally, we found two `supporting methods` around gras** that use deep-learning to support the gras** process, shape approximation, and affordances. We have distilled the publications found in this systematic review (85 papers) into ten key takeaways we consider crucial for future robotic gras** and manipulation research. An online version of the survey is available at https://rhys-newbury.github.io/projects/6dof/ △ Less

Submitted 4 May, 2023; v1 submitted 6 July, 2022; originally announced July 2022.

Comments: 20 pages. Accepted to T-RO

arXiv:2202.00732 [pdf, other]

IFOR: Iterative Flow Minimization for Robotic Object Rearrangement

Authors: Ankit Goyal, Arsalan Mousavian, Chris Paxton, Yu-Wei Chao, Brian Okorn, Jia Deng, Dieter Fox

Abstract: Accurate object rearrangement from vision is a crucial problem for a wide variety of real-world robotics applications in unstructured environments. We propose IFOR, Iterative Flow Minimization for Robotic Object Rearrangement, an end-to-end method for the challenging problem of object rearrangement for unknown objects given an RGBD image of the original and final scenes. First, we learn an optical… ▽ More Accurate object rearrangement from vision is a crucial problem for a wide variety of real-world robotics applications in unstructured environments. We propose IFOR, Iterative Flow Minimization for Robotic Object Rearrangement, an end-to-end method for the challenging problem of object rearrangement for unknown objects given an RGBD image of the original and final scenes. First, we learn an optical flow model based on RAFT to estimate the relative transformation of the objects purely from synthetic data. This flow is then used in an iterative minimization algorithm to achieve accurate positioning of previously unseen objects. Crucially, we show that our method applies to cluttered scenes, and in the real world, while training only on synthetic data. Videos are available at https://imankgoyal.github.io/ifor.html. △ Less

Submitted 1 February, 2022; originally announced February 2022.

arXiv:2106.15711 [pdf, other]

RICE: Refining Instance Masks in Cluttered Environments with Graph Neural Networks

Authors: Christopher Xie, Arsalan Mousavian, Yu Xiang, Dieter Fox

Abstract: Segmenting unseen object instances in cluttered environments is an important capability that robots need when functioning in unstructured environments. While previous methods have exhibited promising results, they still tend to provide incorrect results in highly cluttered scenes. We postulate that a network architecture that encodes relations between objects at a high-level can be beneficial. Thu… ▽ More Segmenting unseen object instances in cluttered environments is an important capability that robots need when functioning in unstructured environments. While previous methods have exhibited promising results, they still tend to provide incorrect results in highly cluttered scenes. We postulate that a network architecture that encodes relations between objects at a high-level can be beneficial. Thus, in this work, we propose a novel framework that refines the output of such methods by utilizing a graph-based representation of instance masks. We train deep networks capable of sampling smart perturbations to the segmentations, and a graph neural network, which can encode relations between objects, to evaluate the perturbed segmentations. Our proposed method is orthogonal to previous works and achieves state-of-the-art performance when combined with them. We demonstrate an application that uses uncertainty estimates generated by our method to guide a manipulator, leading to efficient understanding of cluttered scenes. Code, models, and video can be found at https://github.com/chrisdxie/rice . △ Less

Submitted 29 June, 2021; originally announced June 2021.

arXiv:2106.01352 [pdf, other]

NeRP: Neural Rearrangement Planning for Unknown Objects

Authors: Ahmed H. Qureshi, Arsalan Mousavian, Chris Paxton, Michael C. Yip, Dieter Fox

Abstract: Robots will be expected to manipulate a wide variety of objects in complex and arbitrary ways as they become more widely used in human environments. As such, the rearrangement of objects has been noted to be an important benchmark for AI capabilities in recent years. We propose NeRP (Neural Rearrangement Planning), a deep learning based approach for multi-step neural object rearrangement planning… ▽ More Robots will be expected to manipulate a wide variety of objects in complex and arbitrary ways as they become more widely used in human environments. As such, the rearrangement of objects has been noted to be an important benchmark for AI capabilities in recent years. We propose NeRP (Neural Rearrangement Planning), a deep learning based approach for multi-step neural object rearrangement planning which works with never-before-seen objects, that is trained on simulation data, and generalizes to the real world. We compare NeRP to several naive and model-based baselines, demonstrating that our approach is measurably better and can efficiently arrange unseen objects in fewer steps and with less planning time. Finally, we demonstrate it on several challenging rearrangement problems in the real world. △ Less

Submitted 4 June, 2021; v1 submitted 2 June, 2021; originally announced June 2021.

Comments: Please refer to our supplementary video: https://youtu.be/CJb1IzH94eo

arXiv:2104.13542 [pdf, other]

STORM: An Integrated Framework for Fast Joint-Space Model-Predictive Control for Reactive Manipulation

Authors: Mohak Bhardwaj, Balakumar Sundaralingam, Arsalan Mousavian, Nathan Ratliff, Dieter Fox, Fabio Ramos, Byron Boots

Abstract: Sampling-based model-predictive control (MPC) is a promising tool for feedback control of robots with complex, non-smooth dynamics, and cost functions. However, the computationally demanding nature of sampling-based MPC algorithms has been a key bottleneck in their application to high-dimensional robotic manipulation problems in the real world. Previous methods have addressed this issue by running… ▽ More Sampling-based model-predictive control (MPC) is a promising tool for feedback control of robots with complex, non-smooth dynamics, and cost functions. However, the computationally demanding nature of sampling-based MPC algorithms has been a key bottleneck in their application to high-dimensional robotic manipulation problems in the real world. Previous methods have addressed this issue by running MPC in the task space while relying on a low-level operational space controller for joint control. However, by not using the joint space of the robot in the MPC formulation, existing methods cannot directly account for non-task space related constraints such as avoiding joint limits, singular configurations, and link collisions. In this paper, we develop a system for fast, joint space sampling-based MPC for manipulators that is efficiently parallelized using GPUs. Our approach can handle task and joint space constraints while taking less than 8ms~(125Hz) to compute the next control command. Further, our method can tightly integrate perception into the control problem by utilizing learned cost functions from raw sensor data. We validate our approach by deploying it on a Franka Panda robot for a variety of dynamic manipulation tasks. We study the effect of different cost formulations and MPC parameters on the synthesized behavior and provide key insights that pave the way for the application of sampling-based MPC for manipulators in a principled manner. We also provide highly optimized, open-source code to be used by the wider robot learning and control community. Videos of experiments can be found at: https://sites.google.com/view/manipulation-mpc △ Less

Submitted 14 September, 2021; v1 submitted 27 April, 2021; originally announced April 2021.

Comments: Accepted for oral presentation at the Conference on Robot Learning (CoRL), 2021. Code available at: https://github.com/NVlabs/storm

Journal ref: 5th Annual Conference on Robot Learning, 2021

arXiv:2104.00622 [pdf, other]

RGB-D Local Implicit Function for Depth Completion of Transparent Objects

Authors: Luyang Zhu, Arsalan Mousavian, Yu Xiang, Hammad Mazhar, Jozef van Eenbergen, Shoubhik Debnath, Dieter Fox

Abstract: Majority of the perception methods in robotics require depth information provided by RGB-D cameras. However, standard 3D sensors fail to capture depth of transparent objects due to refraction and absorption of light. In this paper, we introduce a new approach for depth completion of transparent objects from a single RGB-D image. Key to our approach is a local implicit neural representation built o… ▽ More Majority of the perception methods in robotics require depth information provided by RGB-D cameras. However, standard 3D sensors fail to capture depth of transparent objects due to refraction and absorption of light. In this paper, we introduce a new approach for depth completion of transparent objects from a single RGB-D image. Key to our approach is a local implicit neural representation built on ray-voxel pairs that allows our method to generalize to unseen objects and achieve fast inference speed. Based on this representation, we present a novel framework that can complete missing depth given noisy RGB-D input. We further improve the depth estimation iteratively using a self-correcting refinement model. To train the whole pipeline, we build a large scale synthetic dataset with transparent objects. Experiments demonstrate that our method performs significantly better than the current state-of-the-art methods on both synthetic and real world data. In addition, our approach improves the inference speed by a factor of 20 compared to the previous best method, ClearGrasp. Code and dataset will be released at https://research.nvidia.com/publication/2021-03_RGB-D-Local-Implicit. △ Less

Submitted 1 April, 2021; originally announced April 2021.

Comments: CVPR 2021

arXiv:2103.16747 [pdf, other]

Sim-to-Real for Robotic Tactile Sensing via Physics-Based Simulation and Learned Latent Projections

Authors: Yashraj Narang, Balakumar Sundaralingam, Miles Macklin, Arsalan Mousavian, Dieter Fox

Abstract: Tactile sensing is critical for robotic gras** and manipulation of objects under visual occlusion. However, in contrast to simulations of robot arms and cameras, current simulations of tactile sensors have limited accuracy, speed, and utility. In this work, we develop an efficient 3D finite element method (FEM) model of the SynTouch BioTac sensor using an open-access, GPU-based robotics simulato… ▽ More Tactile sensing is critical for robotic gras** and manipulation of objects under visual occlusion. However, in contrast to simulations of robot arms and cameras, current simulations of tactile sensors have limited accuracy, speed, and utility. In this work, we develop an efficient 3D finite element method (FEM) model of the SynTouch BioTac sensor using an open-access, GPU-based robotics simulator. Our simulations closely reproduce results from an experimentally-validated model in an industry-standard, CPU-based simulator, but at 75x the speed. We then learn latent representations for simulated BioTac deformations and real-world electrical output through self-supervision, as well as projections between the latent spaces using a small supervised dataset. Using these learned latent projections, we accurately synthesize real-world BioTac electrical output and estimate contact patches, both for unseen contact interactions. This work contributes an efficient, freely-accessible FEM model of the BioTac and comprises one of the first efforts to combine self-supervision, cross-modal transfer, and sim-to-real transfer for tactile sensors. △ Less

Submitted 30 March, 2021; originally announced March 2021.

Comments: To be published in Proc. IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2103.14127 [pdf, other]

Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes

Authors: Martin Sundermeyer, Arsalan Mousavian, Rudolph Triebel, Dieter Fox

Abstract: Gras** unseen objects in unconstrained, cluttered environments is an essential skill for autonomous robotic manipulation. Despite recent progress in full 6-DoF grasp learning, existing approaches often consist of complex sequential pipelines that possess several potential failure points and run-times unsuitable for closed-loop gras**. Therefore, we propose an end-to-end network that efficientl… ▽ More Gras** unseen objects in unconstrained, cluttered environments is an essential skill for autonomous robotic manipulation. Despite recent progress in full 6-DoF grasp learning, existing approaches often consist of complex sequential pipelines that possess several potential failure points and run-times unsuitable for closed-loop gras**. Therefore, we propose an end-to-end network that efficiently generates a distribution of 6-DoF parallel-jaw grasps directly from a depth recording of a scene. Our novel grasp representation treats 3D points of the recorded point cloud as potential grasp contacts. By rooting the full 6-DoF grasp pose and width in the observed point cloud, we can reduce the dimensionality of our grasp representation to 4-DoF which greatly facilitates the learning process. Our class-agnostic approach is trained on 17 million simulated grasps and generalizes well to real world sensor data. In a robotic gras** study of unseen objects in structured clutter we achieve over 90% success rate, cutting the failure rate in half compared to a recent state-of-the-art method. △ Less

Submitted 25 March, 2021; originally announced March 2021.

Comments: ICRA 2021. Video of the real world experiments and code are available at https://research.nvidia.com/publication/2021-03_Contact-GraspNet%3A--Efficient

arXiv:2101.05452 [pdf, other]

Interpreting and Predicting Tactile Signals for the SynTouch BioTac

Authors: Yashraj S. Narang, Balakumar Sundaralingam, Karl Van Wyk, Arsalan Mousavian, Dieter Fox

Abstract: In the human hand, high-density contact information provided by afferent neurons is essential for many human gras** and manipulation capabilities. In contrast, robotic tactile sensors, including the state-of-the-art SynTouch BioTac, are typically used to provide low-density contact information, such as contact location, center of pressure, and net force. Although useful, these data do not convey… ▽ More In the human hand, high-density contact information provided by afferent neurons is essential for many human gras** and manipulation capabilities. In contrast, robotic tactile sensors, including the state-of-the-art SynTouch BioTac, are typically used to provide low-density contact information, such as contact location, center of pressure, and net force. Although useful, these data do not convey or leverage the rich information content that some tactile sensors naturally measure. This research extends robotic tactile sensing beyond reduced-order models through 1) the automated creation of a precise experimental tactile dataset for the BioTac over a diverse range of physical interactions, 2) a 3D finite element (FE) model of the BioTac, which complements the experimental dataset with high-density, distributed contact data, 3) neural-network-based map**s from raw BioTac signals to not only low-dimensional experimental data, but also high-density FE deformation fields, and 4) map**s from the FE deformation fields to the raw signals themselves. The high-density data streams can provide a far greater quantity of interpretable information for gras** and manipulation algorithms than previously accessible. △ Less

Submitted 13 January, 2021; originally announced January 2021.

Comments: Submitted to International Journal of Robotics Research (IJRR)

arXiv:2011.10726 [pdf, other]

Object Rearrangement Using Learned Implicit Collision Functions

Authors: Michael Danielczuk, Arsalan Mousavian, Clemens Eppner, Dieter Fox

Abstract: Robotic object rearrangement combines the skills of picking and placing objects. When object models are unavailable, typical collision-checking models may be unable to predict collisions in partial point clouds with occlusions, making generation of collision-free gras** or placement trajectories challenging. We propose a learned collision model that accepts scene and query object point clouds an… ▽ More Robotic object rearrangement combines the skills of picking and placing objects. When object models are unavailable, typical collision-checking models may be unable to predict collisions in partial point clouds with occlusions, making generation of collision-free gras** or placement trajectories challenging. We propose a learned collision model that accepts scene and query object point clouds and predicts collisions for 6DOF object poses within the scene. We train the model on a synthetic set of 1 million scene/object point cloud pairs and 2 billion collision queries. We leverage the learned collision model as part of a model predictive path integral (MPPI) policy in a tabletop rearrangement task and show that the policy can plan collision-free grasps and placements for objects unseen in training in both simulated and physical cluttered scenes with a Franka Panda robot. The learned model outperforms both traditional pipelines and learned ablations by 9.8% in accuracy on a dataset of simulated collision queries and is 75x faster than the best-performing baseline. Videos and supplementary material are available at https://research.nvidia.com/publication/2021-03_Object-Rearrangement-Using. △ Less

Submitted 26 March, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

Comments: First two authors contributed equally. 2021 IEEE International Conference on Robotics and Automation. 8 pages, 4 figures, 3 tables

arXiv:2011.09584 [pdf, other]

ACRONYM: A Large-Scale Grasp Dataset Based on Simulation

Authors: Clemens Eppner, Arsalan Mousavian, Dieter Fox

Abstract: We introduce ACRONYM, a dataset for robot grasp planning based on physics simulation. The dataset contains 17.7M parallel-jaw grasps, spanning 8872 objects from 262 different categories, each labeled with the grasp result obtained from a physics simulator. We show the value of this large and diverse dataset by using it to train two state-of-the-art learning-based grasp planning algorithms. Grasp p… ▽ More We introduce ACRONYM, a dataset for robot grasp planning based on physics simulation. The dataset contains 17.7M parallel-jaw grasps, spanning 8872 objects from 262 different categories, each labeled with the grasp result obtained from a physics simulator. We show the value of this large and diverse dataset by using it to train two state-of-the-art learning-based grasp planning algorithms. Grasp performance improves significantly when compared to the original smaller dataset. Data and tools can be accessed at https://sites.google.com/nvidia.com/graspdataset. △ Less

Submitted 18 November, 2020; originally announced November 2020.

arXiv:2011.08961 [pdf, other]

Reactive Human-to-Robot Handovers of Arbitrary Objects

Authors: Wei Yang, Chris Paxton, Arsalan Mousavian, Yu-Wei Chao, Maya Cakmak, Dieter Fox

Abstract: Human-robot object handovers have been an actively studied area of robotics over the past decade; however, very few techniques and systems have addressed the challenge of handing over diverse objects with arbitrary appearance, size, shape, and rigidity. In this paper, we present a vision-based system that enables reactive human-to-robot handovers of unknown objects. Our approach combines closed-lo… ▽ More Human-robot object handovers have been an actively studied area of robotics over the past decade; however, very few techniques and systems have addressed the challenge of handing over diverse objects with arbitrary appearance, size, shape, and rigidity. In this paper, we present a vision-based system that enables reactive human-to-robot handovers of unknown objects. Our approach combines closed-loop motion planning with real-time, temporally-consistent grasp generation to ensure reactivity and motion smoothness. Our system is robust to different object positions and orientations, and can grasp both rigid and non-rigid objects. We demonstrate the generalizability, usability, and robustness of our approach on a novel benchmark set of 26 diverse household objects, a user study with naive users (N=6) handing over a subset of 15 objects, and a systematic evaluation examining different ways of handing objects. More results and videos can be found at https://sites.google.com/nvidia.com/handovers-of-arbitrary-objects. △ Less

Submitted 3 June, 2021; v1 submitted 17 November, 2020; originally announced November 2020.

Comments: Accepted to the International Conference on Robotics and Automation (ICRA) 2021

arXiv:2011.08694 [pdf, other]

Reactive Long Horizon Task Execution via Visual Skill and Precondition Models

Authors: Shohin Mukherjee, Chris Paxton, Arsalan Mousavian, Adam Fishman, Maxim Likhachev, Dieter Fox

Abstract: Zero-shot execution of unseen robotic tasks is important to allowing robots to perform a wide variety of tasks in human environments, but collecting the amounts of data necessary to train end-to-end policies in the real-world is often infeasible. We describe an approach for sim-to-real training that can accomplish unseen robotic tasks using models learned in simulation to ground components of a si… ▽ More Zero-shot execution of unseen robotic tasks is important to allowing robots to perform a wide variety of tasks in human environments, but collecting the amounts of data necessary to train end-to-end policies in the real-world is often infeasible. We describe an approach for sim-to-real training that can accomplish unseen robotic tasks using models learned in simulation to ground components of a simple task planner. We learn a library of parameterized skills, along with a set of predicates-based preconditions and termination conditions, entirely in simulation. We explore a block-stacking task because it has a clear structure, where multiple skills must be chained together, but our methods are applicable to a wide range of other problems and domains, and can transfer from simulation to the real-world with no fine tuning. The system is able to recognize failures and accomplish long-horizon tasks from perceptual input, which is critical for real-world execution. We evaluate our proposed approach in both simulation and in the real-world, showing an increase in success rate from 91.6% to 98% in simulation and from 10% to 80% success rate in the real-world as compared with naive baselines. For experiment videos including both real-world and simulation, see: https://www.youtube.com/playlist?list=PL-oD0xHUngeLfQmpngYkGFZarstfPOXqX △ Less

Submitted 14 July, 2021; v1 submitted 17 November, 2020; originally announced November 2020.

Comments: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2021

arXiv:2010.00824 [pdf, other]

Goal-Auxiliary Actor-Critic for 6D Robotic Gras** with Point Clouds

Authors: Lirui Wang, Yu Xiang, Wei Yang, Arsalan Mousavian, Dieter Fox

Abstract: 6D robotic gras** beyond top-down bin-picking scenarios is a challenging task. Previous solutions based on 6D grasp synthesis with robot motion planning usually operate in an open-loop setting, which are sensitive to grasp synthesis errors. In this work, we propose a new method for learning closed-loop control policies for 6D gras**. Our policy takes a segmented point cloud of an object from a… ▽ More 6D robotic gras** beyond top-down bin-picking scenarios is a challenging task. Previous solutions based on 6D grasp synthesis with robot motion planning usually operate in an open-loop setting, which are sensitive to grasp synthesis errors. In this work, we propose a new method for learning closed-loop control policies for 6D gras**. Our policy takes a segmented point cloud of an object from an egocentric camera as input, and outputs continuous 6D control actions of the robot gripper for gras** the object. We combine imitation learning and reinforcement learning and introduce a goal-auxiliary actor-critic algorithm for policy learning. We demonstrate that our learned policy can be integrated into a tabletop 6D gras** system and a human-robot handover system to improve the gras** performance of unseen objects. Our videos and code can be found at https://sites.google.com/view/gaddpg . △ Less

Submitted 30 June, 2021; v1 submitted 2 October, 2020; originally announced October 2020.

Journal ref: The Conference on Robot Learning (CoRL 2021)

arXiv:2009.01949 [pdf, ps, other]

doi 10.1364/JOSAB.409224

Strong-Field Terahertz Control of Plasmon Induced Opacity in Photoexcited Metamaterial

Authors: Ali Mousavian, Zachary J. Thompson, Byounghwak Lee, Alden N. Bradley, Milo X. Sprague, Yun-Shik Lee

Abstract: A terahertz metamaterial consisting of radiative slot antennas and subradiant complementary split-ring resonators exhibits plasmon induced opacity in a narrow spectral range due to the destructive interference between the bright and dark modes of the coupled oscillators. Femtosecond optical excitations instantly quench the mode coupling and plasmon oscillations, injecting photocarriers into the me… ▽ More A terahertz metamaterial consisting of radiative slot antennas and subradiant complementary split-ring resonators exhibits plasmon induced opacity in a narrow spectral range due to the destructive interference between the bright and dark modes of the coupled oscillators. Femtosecond optical excitations instantly quench the mode coupling and plasmon oscillations, injecting photocarriers into the metamaterial. The plasmon resonances in the coupled metamaterial are transiently restored by intense terahertz pulses. The strong terahertz fields induce intervalley scattering and interband tunneling of the photocarries, and achieve significant reduction of the photocarrier mobility. The ultrafast dynamics of the nonlinear THz interactions reveals intricate interplay between photocarriers and plasmon oscillations. The high-field THz control of the plasmon oscillations implies potential applications to ultrahigh-speed plasmonics. △ Less

Submitted 3 September, 2020; originally announced September 2020.

Comments: 5 pages, 4 figures

arXiv:2007.15157 [pdf, other]

Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation

Authors: Yu Xiang, Christopher Xie, Arsalan Mousavian, Dieter Fox

Abstract: Segmenting unseen objects in cluttered scenes is an important skill that robots need to acquire in order to perform tasks in new environments. In this work, we propose a new method for unseen object instance segmentation by learning RGB-D feature embeddings from synthetic data. A metric learning loss function is utilized to learn to produce pixel-wise feature embeddings such that pixels from the s… ▽ More Segmenting unseen objects in cluttered scenes is an important skill that robots need to acquire in order to perform tasks in new environments. In this work, we propose a new method for unseen object instance segmentation by learning RGB-D feature embeddings from synthetic data. A metric learning loss function is utilized to learn to produce pixel-wise feature embeddings such that pixels from the same object are close to each other and pixels from different objects are separated in the embedding space. With the learned feature embeddings, a mean shift clustering algorithm can be applied to discover and segment unseen objects. We further improve the segmentation accuracy with a new two-stage clustering algorithm. Our method demonstrates that non-photorealistic synthetic RGB and depth images can be used to learn feature embeddings that transfer well to real-world images for unseen object instance segmentation. △ Less

Submitted 3 March, 2021; v1 submitted 29 July, 2020; originally announced July 2020.

Comments: CoRL 2020

arXiv:2007.08073 [pdf, other]

Unseen Object Instance Segmentation for Robotic Environments

Authors: Christopher Xie, Yu Xiang, Arsalan Mousavian, Dieter Fox

Abstract: In order to function in unstructured environments, robots need the ability to recognize unseen objects. We take a step in this direction by tackling the problem of segmenting unseen object instances in tabletop environments. However, the type of large-scale real-world dataset required for this task typically does not exist for most robotic settings, which motivates the use of synthetic data. Our p… ▽ More In order to function in unstructured environments, robots need the ability to recognize unseen objects. We take a step in this direction by tackling the problem of segmenting unseen object instances in tabletop environments. However, the type of large-scale real-world dataset required for this task typically does not exist for most robotic settings, which motivates the use of synthetic data. Our proposed method, UOIS-Net, separately leverages synthetic RGB and synthetic depth for unseen object instance segmentation. UOIS-Net is comprised of two stages: first, it operates only on depth to produce object instance center votes in 2D or 3D and assembles them into rough initial masks. Secondly, these initial masks are refined using RGB. Surprisingly, our framework is able to learn from synthetic RGB-D data where the RGB is non-photorealistic. To train our method, we introduce a large-scale synthetic dataset of random objects on tabletops. We show that our method can produce sharp and accurate segmentation masks, outperforming state-of-the-art methods on unseen object instance segmentation. We also show that our method can segment unseen objects for robot gras**. △ Less

Submitted 13 October, 2021; v1 submitted 15 July, 2020; originally announced July 2020.

Comments: Journal version of arXiv:1907.13236

arXiv:2006.03777 [pdf, other]

Interpreting and Predicting Tactile Signals via a Physics-Based and Data-Driven Framework

Authors: Yashraj S. Narang, Karl Van Wyk, Arsalan Mousavian, Dieter Fox

Abstract: High-density afferents in the human hand have long been regarded as essential for human gras** and manipulation abilities. In contrast, robotic tactile sensors are typically used to provide low-density contact data, such as center-of-pressure and resultant force. Although useful, this data does not exploit the rich information content that some tactile sensors (e.g., the SynTouch BioTac) natural… ▽ More High-density afferents in the human hand have long been regarded as essential for human gras** and manipulation abilities. In contrast, robotic tactile sensors are typically used to provide low-density contact data, such as center-of-pressure and resultant force. Although useful, this data does not exploit the rich information content that some tactile sensors (e.g., the SynTouch BioTac) naturally provide. This research extends robotic tactile sensing beyond reduced-order models through 1) the automated creation of a precise tactile dataset for the BioTac over diverse physical interactions, 2) a 3D finite element (FE) model of the BioTac, which complements the experimental dataset with high-resolution, distributed contact data, and 3) neural-network-based map**s from raw BioTac signals to low-dimensional experimental data, and more importantly, high-density FE deformation fields. These data streams can provide a far greater quantity of interpretable information for gras** and manipulation algorithms than previously accessible. △ Less

Submitted 6 June, 2020; originally announced June 2020.

Comments: To be published in Proc. Robotics: Science and Systems (RSS)

arXiv:1912.05604 [pdf, other]

A Billion Ways to Grasp: An Evaluation of Grasp Sampling Schemes on a Dense, Physics-based Grasp Data Set

Authors: Clemens Eppner, Arsalan Mousavian, Dieter Fox

Abstract: Robot gras** is often formulated as a learning problem. With the increasing speed and quality of physics simulations, generating large-scale gras** data sets that feed learning algorithms is becoming more and more popular. An often overlooked question is how to generate the grasps that make up these data sets. In this paper, we review, classify, and compare different grasp sampling strategies.… ▽ More Robot gras** is often formulated as a learning problem. With the increasing speed and quality of physics simulations, generating large-scale gras** data sets that feed learning algorithms is becoming more and more popular. An often overlooked question is how to generate the grasps that make up these data sets. In this paper, we review, classify, and compare different grasp sampling strategies. Our evaluation is based on a fine-grained discretization of SE(3) and uses physics-based simulation to evaluate the quality and robustness of the corresponding parallel-jaw grasps. Specifically, we consider more than 1 billion grasps for each of the 21 objects from the YCB data set. This dense data set lets us evaluate existing sampling schemes w.r.t. their bias and efficiency. Our experiments show that some popular sampling schemes contain significant bias and do not cover all possible ways an object can be grasped. △ Less

Submitted 11 December, 2019; originally announced December 2019.

Comments: For associated web page, see https://sites.google.com/view/abillionwaystograsp . 19th International Symposium of Robotics Research (ISRR)

arXiv:1912.03628 [pdf, other]

6-DOF Gras** for Target-driven Object Manipulation in Clutter

Authors: Adithyavairavan Murali, Arsalan Mousavian, Clemens Eppner, Chris Paxton, Dieter Fox

Abstract: Gras** in cluttered environments is a fundamental but challenging robotic skill. It requires both reasoning about unseen object parts and potential collisions with the manipulator. Most existing data-driven approaches avoid this problem by limiting themselves to top-down planar grasps which is insufficient for many real-world scenarios and greatly limits possible grasps. We present a method that… ▽ More Gras** in cluttered environments is a fundamental but challenging robotic skill. It requires both reasoning about unseen object parts and potential collisions with the manipulator. Most existing data-driven approaches avoid this problem by limiting themselves to top-down planar grasps which is insufficient for many real-world scenarios and greatly limits possible grasps. We present a method that plans 6-DOF grasps for any desired object in a cluttered scene from partial point cloud observations. Our method achieves a grasp success of 80.3%, outperforming baseline approaches by 17.6% and clearing 9 cluttered table scenes (which contain 23 unknown objects and 51 picks in total) on a real robotic platform. By using our learned collision checking module, we can even reason about effective grasp sequences to retrieve objects that are not immediately accessible. Supplementary video can be found at https://youtu.be/w0B5S-gCsJk. △ Less

Submitted 20 May, 2020; v1 submitted 8 December, 2019; originally announced December 2019.

Comments: Accepted to the International Conference on Robotics and Automation (ICRA) 2020

arXiv:1912.00416 [pdf, other]

LatentFusion: End-to-End Differentiable Reconstruction and Rendering for Unseen Object Pose Estimation

Authors: Keunhong Park, Arsalan Mousavian, Yu Xiang, Dieter Fox

Abstract: Current 6D object pose estimation methods usually require a 3D model for each object. These methods also require additional training in order to incorporate new objects. As a result, they are difficult to scale to a large number of objects and cannot be directly applied to unseen objects. We propose a novel framework for 6D pose estimation of unseen objects. We present a network that reconstruct… ▽ More Current 6D object pose estimation methods usually require a 3D model for each object. These methods also require additional training in order to incorporate new objects. As a result, they are difficult to scale to a large number of objects and cannot be directly applied to unseen objects. We propose a novel framework for 6D pose estimation of unseen objects. We present a network that reconstructs a latent 3D representation of an object using a small number of reference views at inference time. Our network is able to render the latent 3D representation from arbitrary views. Using this neural renderer, we directly optimize for pose given an input image. By training our network with a large number of 3D shapes for reconstruction and rendering, our network generalizes well to unseen objects. We present a new dataset for unseen object pose estimation--MOPED. We evaluate the performance of our method for unseen object pose estimation on MOPED as well as the ModelNet and LINEMOD datasets. Our method performs competitively to supervised methods that are trained on those objects. Code and data is available at https://keunhong.com/publications/latentfusion/. △ Less

Submitted 11 June, 2020; v1 submitted 1 December, 2019; originally announced December 2019.

Comments: CVPR 2020, Project Page: https://keunhong.com/publications/latentfusion/ , Video: https://youtu.be/tlzcq1KYXd8 , Code: https://github.com/NVlabs/latentfusion . We have added experiments for LINEMOD and have updated the experiments on MOPED. We've also added more technical and implementation details to the methods section

arXiv:1909.10159 [pdf, other]

Self-supervised 6D Object Pose Estimation for Robot Manipulation

Authors: Xinke Deng, Yu Xiang, Arsalan Mousavian, Clemens Eppner, Timothy Bretl, Dieter Fox

Abstract: To teach robots skills, it is crucial to obtain data with supervision. Since annotating real world data is time-consuming and expensive, enabling robots to learn in a self-supervised way is important. In this work, we introduce a robot system for self-supervised 6D object pose estimation. Starting from modules trained in simulation, our system is able to label real world images with accurate 6D ob… ▽ More To teach robots skills, it is crucial to obtain data with supervision. Since annotating real world data is time-consuming and expensive, enabling robots to learn in a self-supervised way is important. In this work, we introduce a robot system for self-supervised 6D object pose estimation. Starting from modules trained in simulation, our system is able to label real world images with accurate 6D object poses for self-supervised learning. In addition, the robot interacts with objects in the environment to change the object configuration by gras** or pushing objects. In this way, our system is able to continuously collect data and improve its pose estimation modules. We show that the self-supervised learning improves object segmentation and 6D pose estimation performance, and consequently enables the system to grasp objects more reliably. A video showing the experiments can be found at https://youtu.be/W1Y0Mmh1Gd8. △ Less

Submitted 6 March, 2020; v1 submitted 23 September, 2019; originally announced September 2019.

Comments: Accepted to International Conference on Robotics and Automation (ICRA), 2020

arXiv:1907.13236 [pdf, other]

The Best of Both Modes: Separately Leveraging RGB and Depth for Unseen Object Instance Segmentation

Authors: Christopher Xie, Yu Xiang, Arsalan Mousavian, Dieter Fox

Abstract: In order to function in unstructured environments, robots need the ability to recognize unseen novel objects. We take a step in this direction by tackling the problem of segmenting unseen object instances in tabletop environments. However, the type of large-scale real-world dataset required for this task typically does not exist for most robotic settings, which motivates the use of synthetic data.… ▽ More In order to function in unstructured environments, robots need the ability to recognize unseen novel objects. We take a step in this direction by tackling the problem of segmenting unseen object instances in tabletop environments. However, the type of large-scale real-world dataset required for this task typically does not exist for most robotic settings, which motivates the use of synthetic data. We propose a novel method that separately leverages synthetic RGB and synthetic depth for unseen object instance segmentation. Our method is comprised of two stages where the first stage operates only on depth to produce rough initial masks, and the second stage refines these masks with RGB. Surprisingly, our framework is able to learn from synthetic RGB-D data where the RGB is non-photorealistic. To train our method, we introduce a large-scale synthetic dataset of random objects on tabletops. We show that our method, trained on this dataset, can produce sharp and accurate masks, outperforming state-of-the-art methods on unseen object instance segmentation. We also show that our method can segment unseen objects for robot gras**. Code, models and video can be found at https://rse-lab.cs.washington.edu/projects/unseen-object-instance-segmentation/. △ Less

Submitted 15 July, 2020; v1 submitted 30 July, 2019; originally announced July 2019.

arXiv:1905.10520 [pdf, other]

6-DOF GraspNet: Variational Grasp Generation for Object Manipulation

Authors: Arsalan Mousavian, Clemens Eppner, Dieter Fox

Abstract: Generating grasp poses is a crucial component for any robot object manipulation task. In this work, we formulate the problem of grasp generation as sampling a set of grasps using a variational autoencoder and assess and refine the sampled grasps using a grasp evaluator model. Both Grasp Sampler and Grasp Refinement networks take 3D point clouds observed by a depth camera as input. We evaluate our… ▽ More Generating grasp poses is a crucial component for any robot object manipulation task. In this work, we formulate the problem of grasp generation as sampling a set of grasps using a variational autoencoder and assess and refine the sampled grasps using a grasp evaluator model. Both Grasp Sampler and Grasp Refinement networks take 3D point clouds observed by a depth camera as input. We evaluate our approach in simulation and real-world robot experiments. Our approach achieves 88\% success rate on various commonly used objects with diverse appearances, scales, and weights. Our model is trained purely in simulation and works in the real world without any extra steps. The video of our experiments can be found at: https://research.nvidia.com/publication/2019-10_6-DOF-GraspNet\%3A-Variational △ Less

Submitted 16 August, 2019; v1 submitted 25 May, 2019; originally announced May 2019.

Comments: Accepted to ICCV 2019. Extended camera ready version with additional experiments

arXiv:1905.09304 [pdf, other]

PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking

Authors: Xinke Deng, Arsalan Mousavian, Yu Xiang, Fei Xia, Timothy Bretl, Dieter Fox

Abstract: Tracking 6D poses of objects from videos provides rich information to a robot in performing different tasks such as manipulation and navigation. In this work, we formulate the 6D object pose tracking problem in the Rao-Blackwellized particle filtering framework, where the 3D rotation and the 3D translation of an object are decoupled. This factorization allows our approach, called PoseRBPF, to effi… ▽ More Tracking 6D poses of objects from videos provides rich information to a robot in performing different tasks such as manipulation and navigation. In this work, we formulate the 6D object pose tracking problem in the Rao-Blackwellized particle filtering framework, where the 3D rotation and the 3D translation of an object are decoupled. This factorization allows our approach, called PoseRBPF, to efficiently estimate the 3D translation of an object along with the full distribution over the 3D rotation. This is achieved by discretizing the rotation space in a fine-grained manner, and training an auto-encoder network to construct a codebook of feature embeddings for the discretized rotations. As a result, PoseRBPF can track objects with arbitrary symmetries while still maintaining adequate posterior distributions. Our approach achieves state-of-the-art results on two 6D pose estimation benchmarks. A video showing the experiments can be found at https://youtu.be/lE5gjzRKWuA △ Less

Submitted 22 May, 2019; originally announced May 2019.

Comments: Accepted to RSS 2019

arXiv:1805.06066 [pdf, other]

Visual Representations for Semantic Target Driven Navigation

Authors: Arsalan Mousavian, Alexander Toshev, Marek Fiser, Jana Kosecka, Ayzaan Wahid, James Davidson

Abstract: What is a good visual representation for autonomous agents? We address this question in the context of semantic visual navigation, which is the problem of a robot finding its way through a complex environment to a target object, e.g. go to the refrigerator. Instead of acquiring a metric semantic map of an environment and using planning for navigation, our approach learns navigation policies on top… ▽ More What is a good visual representation for autonomous agents? We address this question in the context of semantic visual navigation, which is the problem of a robot finding its way through a complex environment to a target object, e.g. go to the refrigerator. Instead of acquiring a metric semantic map of an environment and using planning for navigation, our approach learns navigation policies on top of representations that capture spatial layout and semantic contextual cues. We propose to using high level semantic and contextual features including segmentation and detection masks obtained by off-the-shelf state-of-the-art vision as observations and use deep network to learn the navigation policy. This choice allows using additional data, from orthogonal sources, to better train different parts of the model the representation extraction is trained on large standard vision datasets while the navigation component leverages large synthetic environments for training. This combination of real and synthetic is possible because equitable feature representations are available in both (e.g., segmentation and detection masks), which alleviates the need for domain adaptation. Both the representation and the navigation policy can be readily applied to real non-synthetic environments as demonstrated on the Active Vision Dataset [1]. Our approach gets successfully to the target in 54% of the cases in unexplored environments, compared to 46% for non-learning based approach, and 28% for the learning-based baseline. △ Less

Submitted 2 July, 2019; v1 submitted 15 May, 2018; originally announced May 2018.

Comments: Accepted to ICRA 2019 and ECCV 2018 Workshop on Visual Learning and Embodied Agents in Simulation Environments

arXiv:1702.07836 [pdf, other]

Synthesizing Training Data for Object Detection in Indoor Scenes

Authors: Georgios Georgakis, Arsalan Mousavian, Alexander C. Berg, Jana Kosecka

Abstract: Detection of objects in cluttered indoor environments is one of the key enabling functionalities for service robots. The best performing object detection approaches in computer vision exploit deep Convolutional Neural Networks (CNN) to simultaneously detect and categorize the objects of interest in cluttered scenes. Training of such models typically requires large amounts of annotated training dat… ▽ More Detection of objects in cluttered indoor environments is one of the key enabling functionalities for service robots. The best performing object detection approaches in computer vision exploit deep Convolutional Neural Networks (CNN) to simultaneously detect and categorize the objects of interest in cluttered scenes. Training of such models typically requires large amounts of annotated training data which is time consuming and costly to obtain. In this work we explore the ability of using synthetically generated composite images for training state-of-the-art object detectors, especially for object instance detection. We superimpose 2D images of textured object models into images of real environments at variety of locations and scales. Our experiments evaluate different superimposition strategies ranging from purely image-based blending all the way to depth and semantics informed positioning of the object models into real scenes. We demonstrate the effectiveness of these object detector training strategies on two publicly available datasets, the GMU-Kitchens and the Washington RGB-D Scenes v2. As one observation, augmenting some hand-labeled training data with synthetic examples carefully composed onto scenes yields object detectors with comparable performance to using much more hand-labeled data. Broadly, this work charts new opportunities for training detectors for new objects by exploiting existing object model repositories in either a purely automatic fashion or with only a very small number of human-annotated examples. △ Less

Submitted 7 September, 2017; v1 submitted 25 February, 2017; originally announced February 2017.

Comments: Added more experiments and link to project webpage

arXiv:1612.00496 [pdf, other]

3D Bounding Box Estimation Using Deep Learning and Geometry

Authors: Arsalan Mousavian, Dragomir Anguelov, John Flynn, Jana Kosecka

Abstract: We present a method for 3D object detection and pose estimation from a single image. In contrast to current techniques that only regress the 3D orientation of an object, our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D… ▽ More We present a method for 3D object detection and pose estimation from a single image. In contrast to current techniques that only regress the 3D orientation of an object, our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box. The first network output estimates the 3D object orientation using a novel hybrid discrete-continuous loss, which significantly outperforms the L2 loss. The second output regresses the 3D object dimensions, which have relatively little variance compared to alternatives and can often be predicted for many object types. These estimates, combined with the geometric constraints on translation imposed by the 2D bounding box, enable us to recover a stable and accurate 3D object pose. We evaluate our method on the challenging KITTI object detection benchmark both on the official metric of 3D orientation estimation and also on the accuracy of the obtained 3D bounding boxes. Although conceptually simple, our method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance level segmentation and flat ground priors and sub-category detection. Our discrete-continuous loss also produces state of the art results for 3D viewpoint estimation on the Pascal 3D+ dataset. △ Less

Submitted 10 April, 2017; v1 submitted 1 December, 2016; originally announced December 2016.

Comments: To appear in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017

arXiv:1609.07826 [pdf, other]

Multiview RGB-D Dataset for Object Instance Detection

Authors: Georgios Georgakis, Md Alimoor Reza, Arsalan Mousavian, Phi-Hung Le, Jana Kosecka

Abstract: This paper presents a new multi-view RGB-D dataset of nine kitchen scenes, each containing several objects in realistic cluttered environments including a subset of objects from the BigBird dataset. The viewpoints of the scenes are densely sampled and objects in the scenes are annotated with bounding boxes and in the 3D point cloud. Also, an approach for detection and recognition is presented, whi… ▽ More This paper presents a new multi-view RGB-D dataset of nine kitchen scenes, each containing several objects in realistic cluttered environments including a subset of objects from the BigBird dataset. The viewpoints of the scenes are densely sampled and objects in the scenes are annotated with bounding boxes and in the 3D point cloud. Also, an approach for detection and recognition is presented, which is comprised of two parts: i) a new multi-view 3D proposal generation method and ii) the development of several recognition baselines using AlexNet to score our proposals, which is trained either on crops of the dataset or on synthetically composited training images. Finally, we compare the performance of the object proposals and a detection baseline to the Washington RGB-D Scenes (WRGB-D) dataset and demonstrate that our Kitchen scenes dataset is more challenging for object detection and recognition. The dataset is available at: http://cs.gmu.edu/~robot/gmu-kitchens.html. △ Less

Submitted 25 September, 2016; originally announced September 2016.

arXiv:1609.00278 [pdf, other]

Semantic Image Based Geolocation Given a Map

Authors: Arsalan Mousavian, Jana Kosecka

Abstract: The problem visual place recognition is commonly used strategy for localization. Most successful appearance based methods typically rely on a large database of views endowed with local or global image descriptors and strive to retrieve the views of the same location. The quality of the results is often affected by the density of the reference views and the robustness of the image representation wi… ▽ More The problem visual place recognition is commonly used strategy for localization. Most successful appearance based methods typically rely on a large database of views endowed with local or global image descriptors and strive to retrieve the views of the same location. The quality of the results is often affected by the density of the reference views and the robustness of the image representation with respect to viewpoint variations, clutter and seasonal changes. In this work we present an approach for geo-locating a novel view and determining camera location and orientation using a map and a sparse set of geo-tagged reference views. We propose a novel technique for detection and identification of building facades from geo-tagged reference view using the map and geometry of the building facades. We compute the likelihood of camera location and orientation of the query images using the detected landmark (building) identities from reference views, 2D map of the environment, and geometry of building facades. We evaluate our approach for building identification and geo-localization on a new challenging outdoors urban dataset exhibiting large variations in appearance and viewpoint. △ Less

Submitted 1 September, 2016; originally announced September 2016.

arXiv:1604.07480 [pdf, other]

Joint Semantic Segmentation and Depth Estimation with Deep Convolutional Networks

Authors: Arsalan Mousavian, Hamed Pirsiavash, Jana Kosecka

Abstract: Multi-scale deep CNNs have been used successfully for problems map** each pixel to a label, such as depth estimation and semantic segmentation. It has also been shown that such architectures are reusable and can be used for multiple tasks. These networks are typically trained independently for each task by varying the output layer(s) and training objective. In this work we present a new model fo… ▽ More Multi-scale deep CNNs have been used successfully for problems map** each pixel to a label, such as depth estimation and semantic segmentation. It has also been shown that such architectures are reusable and can be used for multiple tasks. These networks are typically trained independently for each task by varying the output layer(s) and training objective. In this work we present a new model for simultaneous depth estimation and semantic segmentation from a single RGB image. Our approach demonstrates the feasibility of training parts of the model for each task and then fine tuning the full, combined model on both tasks simultaneously using a single loss function. Furthermore we couple the deep CNN with fully connected CRF, which captures the contextual relationships and interactions between the semantic and depth cues improving the accuracy of the final results. The proposed model is trained and evaluated on NYUDepth V2 dataset outperforming the state of the art methods on semantic segmentation and achieving comparable results on the task of depth estimation. △ Less

Submitted 19 September, 2016; v1 submitted 25 April, 2016; originally announced April 2016.

arXiv:1509.06033 [pdf, other]

Deep Convolutional Features for Image Based Retrieval and Scene Categorization

Authors: Arsalan Mousavian, Jana Kosecka

Abstract: Several recent approaches showed how the representations learned by Convolutional Neural Networks can be repurposed for novel tasks. Most commonly it has been shown that the activation features of the last fully connected layers (fc7 or fc6) of the network, followed by a linear classifier outperform the state-of-the-art on several recognition challenge datasets. Instead of recognition, this paper… ▽ More Several recent approaches showed how the representations learned by Convolutional Neural Networks can be repurposed for novel tasks. Most commonly it has been shown that the activation features of the last fully connected layers (fc7 or fc6) of the network, followed by a linear classifier outperform the state-of-the-art on several recognition challenge datasets. Instead of recognition, this paper focuses on the image retrieval problem and proposes a examines alternative pooling strategies derived for CNN features. The presented scheme uses the features maps from an earlier layer 5 of the CNN architecture, which has been shown to preserve coarse spatial information and is semantically meaningful. We examine several pooling strategies and demonstrate superior performance on the image retrieval task (INRIA Holidays) at the fraction of the computational cost, while using a relatively small memory requirements. In addition to retrieval, we see similar efficiency gains on the SUN397 scene categorization dataset, demonstrating wide applicability of this simple strategy. We also introduce and evaluate a novel GeoPlaces5K dataset from different geographical locations in the world for image retrieval that stresses more dramatic changes in appearance and viewpoint. △ Less

Submitted 20 September, 2015; originally announced September 2015.

arXiv:1311.7215 [pdf]

Solving Minimum Vertex Cover Problem Using Learning Automata

Authors: Aylin Mousavian, Alireza Rezvanian, Mohammad Reza Meybodi

Abstract: Minimum vertex cover problem is an NP-Hard problem with the aim of finding minimum number of vertices to cover graph. In this paper, a learning automaton based algorithm is proposed to find minimum vertex cover in graph. In the proposed algorithm, each vertex of graph is equipped with a learning automaton that has two actions in the candidate or non-candidate of the corresponding vertex cover set.… ▽ More Minimum vertex cover problem is an NP-Hard problem with the aim of finding minimum number of vertices to cover graph. In this paper, a learning automaton based algorithm is proposed to find minimum vertex cover in graph. In the proposed algorithm, each vertex of graph is equipped with a learning automaton that has two actions in the candidate or non-candidate of the corresponding vertex cover set. Due to characteristics of learning automata, this algorithm significantly reduces the number of covering vertices of graph. The proposed algorithm based on learning automata iteratively minimize the candidate vertex cover through the update its action probability. As the proposed algorithm proceeds, a candidate solution nears to optimal solution of the minimum vertex cover problem. In order to evaluate the proposed algorithm, several experiments conducted on DIMACS dataset which compared to conventional methods. Experimental results show the major superiority of the proposed algorithm over the other methods. △ Less

Submitted 28 November, 2013; originally announced November 2013.

Comments: 5 pages, 5 figures, conference

arXiv:1307.7171 [pdf, other]

doi 10.1088/1367-2630/16/1/013009

The Resource Theory of Stabilizer Computation

Authors: Victor Veitch, Seyed Ali Hamed Mousavian, Daniel Gottesman, Joseph Emerson

Abstract: Recent results on the non-universality of fault-tolerant gate sets underline the critical role of resource states, such as magic states, to power scalable, universal quantum computation. Here we develop a resource theory, analogous to the theory of entanglement, for resources for stabilizer codes. We introduce two quantitative measures - monotones - for the amount of non-stabilizer resource. As an… ▽ More Recent results on the non-universality of fault-tolerant gate sets underline the critical role of resource states, such as magic states, to power scalable, universal quantum computation. Here we develop a resource theory, analogous to the theory of entanglement, for resources for stabilizer codes. We introduce two quantitative measures - monotones - for the amount of non-stabilizer resource. As an application we give absolute bounds on the efficiency of magic state distillation. One of these monotones is the sum of the negative entries of the discrete Wigner representation of a quantum state, thereby resolving a long-standing open question of whether the degree of negativity in a quasi-probability representation is an operationally meaningful indicator of quantum behaviour. △ Less

Submitted 26 July, 2013; originally announced July 2013.

Journal ref: New J. Phys. 16 (2014) 013009

Showing 1–42 of 42 results for author: Mousavian, A