-
Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation
Authors:
Suraj Nair,
Eric Mitchell,
Kevin Chen,
Brian Ichter,
Silvio Savarese,
Chelsea Finn
Abstract:
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction. In order to accomplish this, humans need easy and effective ways of specifying tasks to the robot. Goal images are one popular form of task specification, as they are already grounded in the robot's observation space. However, goal images also have a number of drawbacks: t…
▽ More
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction. In order to accomplish this, humans need easy and effective ways of specifying tasks to the robot. Goal images are one popular form of task specification, as they are already grounded in the robot's observation space. However, goal images also have a number of drawbacks: they are inconvenient for humans to provide, they can over-specify the desired behavior leading to a sparse reward signal, or under-specify task information in the case of non-goal reaching tasks. Natural language provides a convenient and flexible alternative for task specification, but comes with the challenge of grounding language in the robot's observation space. To scalably learn this grounding we propose to leverage offline robot datasets (including highly sub-optimal, autonomously collected data) with crowd-sourced natural language labels. With this data, we learn a simple classifier which predicts if a change in state completes a language instruction. This provides a language-conditioned reward function that can then be used for offline multi-task RL. In our experiments, we find that on language-conditioned manipulation tasks our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%, and is able to perform visuomotor tasks from natural language, such as "open the right drawer" and "move the stapler", on a Franka Emika Panda robot.
△ Less
Submitted 31 October, 2021; v1 submitted 2 September, 2021;
originally announced September 2021.
-
Co-GAIL: Learning Diverse Strategies for Human-Robot Collaboration
Authors:
Chen Wang,
Claudia Pérez-D'Arpino,
Danfei Xu,
Li Fei-Fei,
C. Karen Liu,
Silvio Savarese
Abstract:
We present a method for learning a human-robot collaboration policy from human-human collaboration demonstrations. An effective robot assistant must learn to handle diverse human behaviors shown in the demonstrations and be robust when the humans adjust their strategies during online task execution. Our method co-optimizes a human policy and a robot policy in an interactive learning process: the h…
▽ More
We present a method for learning a human-robot collaboration policy from human-human collaboration demonstrations. An effective robot assistant must learn to handle diverse human behaviors shown in the demonstrations and be robust when the humans adjust their strategies during online task execution. Our method co-optimizes a human policy and a robot policy in an interactive learning process: the human policy learns to generate diverse and plausible collaborative behaviors from demonstrations while the robot policy learns to assist by estimating the unobserved latent strategy of its human collaborator. Across a 2D strategy game, a human-robot handover task, and a multi-step collaborative manipulation task, our method outperforms the alternatives in both simulated evaluations and when executing the tasks with a real human operator in-the-loop. Supplementary materials and videos at https://sites.google.com/view/co-gail-web/home
△ Less
Submitted 20 September, 2023; v1 submitted 12 August, 2021;
originally announced August 2021.
-
BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments
Authors:
Sanjana Srivastava,
Chengshu Li,
Michael Lingelbach,
Roberto Martín-Martín,
Fei Xia,
Kent Vainio,
Zheng Lian,
Cem Gokmen,
Shyamal Buch,
C. Karen Liu,
Silvio Savarese,
Hyowon Gweon,
Jiajun Wu,
Li Fei-Fei
Abstract:
We introduce BEHAVIOR, a benchmark for embodied AI with 100 activities in simulation, spanning a range of everyday household chores such as cleaning, maintenance, and food preparation. These activities are designed to be realistic, diverse, and complex, aiming to reproduce the challenges that agents must face in the real world. Building such a benchmark poses three fundamental difficulties for eac…
▽ More
We introduce BEHAVIOR, a benchmark for embodied AI with 100 activities in simulation, spanning a range of everyday household chores such as cleaning, maintenance, and food preparation. These activities are designed to be realistic, diverse, and complex, aiming to reproduce the challenges that agents must face in the real world. Building such a benchmark poses three fundamental difficulties for each activity: definition (it can differ by time, place, or person), instantiation in a simulator, and evaluation. BEHAVIOR addresses these with three innovations. First, we propose an object-centric, predicate logic-based description language for expressing an activity's initial and goal conditions, enabling generation of diverse instances for any activity. Second, we identify the simulator-agnostic features required by an underlying environment to support BEHAVIOR, and demonstrate its realization in one such simulator. Third, we introduce a set of metrics to measure task progress and efficiency, absolute and relative to human demonstrators. We include 500 human demonstrations in virtual reality (VR) to serve as the human ground truth. Our experiments demonstrate that even state of the art embodied AI solutions struggle with the level of realism, diversity, and complexity imposed by the activities in our benchmark. We make BEHAVIOR publicly available at behavior.stanford.edu to facilitate and calibrate the development of new embodied AI solutions.
△ Less
Submitted 6 August, 2021;
originally announced August 2021.
-
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
Authors:
Ajay Mandlekar,
Danfei Xu,
Josiah Wong,
Soroush Nasiriany,
Chen Wang,
Rohun Kulkarni,
Li Fei-Fei,
Silvio Savarese,
Yuke Zhu,
Roberto Martín-Martín
Abstract:
Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities. While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods make assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learni…
▽ More
Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities. While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods make assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality. Our study analyzes the most critical challenges when learning from offline human data for manipulation. Based on the study, we derive a series of lessons including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stop** criteria due to the different objectives in training and evaluation. We also highlight opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods, and the ability to easily scale to natural, real-world manipulation scenarios where only raw sensory signals are available. We have open-sourced our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data. Codebase, datasets, trained models, and more available at https://arise-initiative.github.io/robomimic-web/
△ Less
Submitted 24 September, 2021; v1 submitted 6 August, 2021;
originally announced August 2021.
-
iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks
Authors:
Chengshu Li,
Fei Xia,
Roberto Martín-Martín,
Michael Lingelbach,
Sanjana Srivastava,
Bokui Shen,
Kent Vainio,
Cem Gokmen,
Gokul Dharan,
Tanish Jain,
Andrey Kurenkov,
C. Karen Liu,
Hyowon Gweon,
Jiajun Wu,
Li Fei-Fei,
Silvio Savarese
Abstract:
Recent research in embodied AI has been boosted by the use of simulation environments to develop and train robot learning approaches. However, the use of simulation has skewed the attention to tasks that only require what robotics simulators can simulate: motion and physical contact. We present iGibson 2.0, an open-source simulation environment that supports the simulation of a more diverse set of…
▽ More
Recent research in embodied AI has been boosted by the use of simulation environments to develop and train robot learning approaches. However, the use of simulation has skewed the attention to tasks that only require what robotics simulators can simulate: motion and physical contact. We present iGibson 2.0, an open-source simulation environment that supports the simulation of a more diverse set of household tasks through three key innovations. First, iGibson 2.0 supports object states, including temperature, wetness level, cleanliness level, and toggled and sliced states, necessary to cover a wider range of tasks. Second, iGibson 2.0 implements a set of predicate logic functions that map the simulator states to logic states like Cooked or Soaked. Additionally, given a logic state, iGibson 2.0 can sample valid physical states that satisfy it. This functionality can generate potentially infinite instances of tasks with minimal effort from the users. The sampling mechanism allows our scenes to be more densely populated with small objects in semantically meaningful locations. Third, iGibson 2.0 includes a virtual reality (VR) interface to immerse humans in its scenes to collect demonstrations. As a result, we can collect demonstrations from humans on these new types of tasks, and use them for imitation learning. We evaluate the new capabilities of iGibson 2.0 to enable robot learning of novel tasks, in the hope of demonstrating the potential of this new simulator to support new research in embodied AI. iGibson 2.0 and its new dataset are publicly available at http://svl.stanford.edu/igibson/.
△ Less
Submitted 3 November, 2021; v1 submitted 6 August, 2021;
originally announced August 2021.
-
Discovering Generalizable Skills via Automated Generation of Diverse Tasks
Authors:
Kuan Fang,
Yuke Zhu,
Silvio Savarese,
Li Fei-Fei
Abstract:
The learning efficiency and generalization ability of an intelligent agent can be greatly improved by utilizing a useful set of skills. However, the design of robot skills can often be intractable in real-world applications due to the prohibitive amount of effort and expertise that it requires. In this work, we introduce Skill Learning In Diversified Environments (SLIDE), a method to discover gene…
▽ More
The learning efficiency and generalization ability of an intelligent agent can be greatly improved by utilizing a useful set of skills. However, the design of robot skills can often be intractable in real-world applications due to the prohibitive amount of effort and expertise that it requires. In this work, we introduce Skill Learning In Diversified Environments (SLIDE), a method to discover generalizable skills via automated generation of a diverse set of tasks. As opposed to prior work on unsupervised discovery of skills which incentivizes the skills to produce different outcomes in the same environment, our method pairs each skill with a unique task produced by a trainable task generator. To encourage generalizable skills to emerge, our method trains each skill to specialize in the paired task and maximizes the diversity of the generated tasks. A task discriminator defined on the robot behaviors in the generated tasks is jointly trained to estimate the evidence lower bound of the diversity objective. The learned skills can then be composed in a hierarchical reinforcement learning algorithm to solve unseen target tasks. We demonstrate that the proposed method can effectively learn a variety of robot skills in two tabletop manipulation domains. Our results suggest that the learned skills can effectively improve the robot's performance in various unseen target tasks compared to existing reinforcement learning and skill learning methods.
△ Less
Submitted 25 June, 2021;
originally announced June 2021.
-
JRDB-Act: A Large-scale Dataset for Spatio-temporal Action, Social Group and Activity Detection
Authors:
Mahsa Ehsanpour,
Fatemeh Saleh,
Silvio Savarese,
Ian Reid,
Hamid Rezatofighi
Abstract:
The availability of large-scale video action understanding datasets has facilitated advances in the interpretation of visual scenes containing people. However, learning to recognise human actions and their social interactions in an unconstrained real-world environment comprising numerous people, with potentially highly unbalanced and long-tailed distributed action labels from a stream of sensory d…
▽ More
The availability of large-scale video action understanding datasets has facilitated advances in the interpretation of visual scenes containing people. However, learning to recognise human actions and their social interactions in an unconstrained real-world environment comprising numerous people, with potentially highly unbalanced and long-tailed distributed action labels from a stream of sensory data captured from a mobile robot platform remains a significant challenge, not least owing to the lack of a reflective large-scale dataset. In this paper, we introduce JRDB-Act, as an extension of the existing JRDB, which is captured by a social mobile manipulator and reflects a real distribution of human daily-life actions in a university campus environment. JRDB-Act has been densely annotated with atomic actions, comprises over 2.8M action labels, constituting a large-scale spatio-temporal action detection dataset. Each human bounding box is labeled with one pose-based action label and multiple~(optional) interaction-based action labels. Moreover JRDB-Act provides social group annotation, conducive to the task of grou** individuals based on their interactions in the scene to infer their social activities~(common activities in each social group). Each annotated label in JRDB-Act is tagged with the annotators' confidence level which contributes to the development of reliable evaluation strategies. In order to demonstrate how one can effectively utilise such annotations, we develop an end-to-end trainable pipeline to learn and infer these tasks, i.e. individual action and social group detection. The data and the evaluation code is publicly available at https://jrdb.erc.monash.edu/.
△ Less
Submitted 23 November, 2021; v1 submitted 16 June, 2021;
originally announced June 2021.
-
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild
Authors:
Vida Adeli,
Mahsa Ehsanpour,
Ian Reid,
Juan Carlos Niebles,
Silvio Savarese,
Ehsan Adeli,
Hamid Rezatofighi
Abstract:
Joint forecasting of human trajectory and pose dynamics is a fundamental building block of various applications ranging from robotics and autonomous driving to surveillance systems. Predicting body dynamics requires capturing subtle information embedded in the humans' interactions with each other and with the objects present in the scene. In this paper, we propose a novel TRajectory and POse Dynam…
▽ More
Joint forecasting of human trajectory and pose dynamics is a fundamental building block of various applications ranging from robotics and autonomous driving to surveillance systems. Predicting body dynamics requires capturing subtle information embedded in the humans' interactions with each other and with the objects present in the scene. In this paper, we propose a novel TRajectory and POse Dynamics (nicknamed TRiPOD) method based on graph attentional networks to model the human-human and human-object interactions both in the input space and the output space (decoded future output). The model is supplemented by a message passing interface over the graphs to fuse these different levels of interactions efficiently. Furthermore, to incorporate a real-world challenge, we propound to learn an indicator representing whether an estimated body joint is visible/invisible at each frame, e.g. due to occlusion or being outside the sensor field of view. Finally, we introduce a new benchmark for this joint task based on two challenging datasets (PoseTrack and 3DPW) and propose evaluation metrics to measure the effectiveness of predictions in the global space, even when there are invisible cases of joints. Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
△ Less
Submitted 27 August, 2021; v1 submitted 8 April, 2021;
originally announced April 2021.
-
LASER: Learning a Latent Action Space for Efficient Reinforcement Learning
Authors:
Arthur Allshire,
Roberto Martín-Martín,
Charles Lin,
Shawn Manuel,
Silvio Savarese,
Animesh Garg
Abstract:
The process of learning a manipulation task depends strongly on the action space used for exploration: posed in the incorrect action space, solving a task with reinforcement learning can be drastically inefficient. Additionally, similar tasks or instances of the same task family impose latent manifold constraints on the most effective action space: the task family can be best solved with actions i…
▽ More
The process of learning a manipulation task depends strongly on the action space used for exploration: posed in the incorrect action space, solving a task with reinforcement learning can be drastically inefficient. Additionally, similar tasks or instances of the same task family impose latent manifold constraints on the most effective action space: the task family can be best solved with actions in a manifold of the entire action space of the robot. Combining these insights we present LASER, a method to learn latent action spaces for efficient reinforcement learning. LASER factorizes the learning problem into two sub-problems, namely action space learning and policy learning in the new action space. It leverages data from similar manipulation task instances, either from an offline expert or online during policy learning, and learns from these trajectories a map** from the original to a latent action space. LASER is trained as a variational encoder-decoder model to map raw actions into a disentangled latent action space while maintaining action reconstruction and latent space dynamic consistency. We evaluate LASER on two contact-rich robotic tasks in simulation, and analyze the benefit of policy learning in the generated latent action space. We show improved sample efficiency compared to the original action space from better alignment of the action space to the task space, as we observe with visualizations of the learned action space manifold. Additional details: https://www.pair.toronto.edu/laser
△ Less
Submitted 30 March, 2021; v1 submitted 29 March, 2021;
originally announced March 2021.
-
Generalization Through Hand-Eye Coordination: An Action Space for Learning Spatially-Invariant Visuomotor Control
Authors:
Chen Wang,
Rui Wang,
Ajay Mandlekar,
Li Fei-Fei,
Silvio Savarese,
Danfei Xu
Abstract:
Imitation Learning (IL) is an effective framework to learn visuomotor skills from offline demonstration data. However, IL methods often fail to generalize to new scene configurations not covered by training data. On the other hand, humans can manipulate objects in varying conditions. Key to such capability is hand-eye coordination, a cognitive ability that enables humans to adaptively direct their…
▽ More
Imitation Learning (IL) is an effective framework to learn visuomotor skills from offline demonstration data. However, IL methods often fail to generalize to new scene configurations not covered by training data. On the other hand, humans can manipulate objects in varying conditions. Key to such capability is hand-eye coordination, a cognitive ability that enables humans to adaptively direct their movements at task-relevant objects and be invariant to the objects' absolute spatial location. In this work, we present a learnable action space, Hand-eye Action Networks (HAN), that can approximate human's hand-eye coordination behaviors by learning from human teleoperated demonstrations. Through a set of challenging multi-stage manipulation tasks, we show that a visuomotor policy equipped with HAN is able to inherit the key spatial invariance property of hand-eye coordination and achieve zero-shot generalization to new scene configurations. Additional materials available at https://sites.google.com/stanford.edu/han
△ Less
Submitted 16 August, 2021; v1 submitted 27 February, 2021;
originally announced March 2021.
-
Local Calibration: Metrics and Recalibration
Authors:
Rachel Luo,
Aadyot Bhatnagar,
Yu Bai,
Shengjia Zhao,
Huan Wang,
Caiming Xiong,
Silvio Savarese,
Stefano Ermon,
Edward Schmerling,
Marco Pavone
Abstract:
Probabilistic classifiers output confidence scores along with their predictions, and these confidence scores should be calibrated, i.e., they should reflect the reliability of the prediction. Confidence scores that minimize standard metrics such as the expected calibration error (ECE) accurately measure the reliability on average across the entire population. However, it is in general impossible t…
▽ More
Probabilistic classifiers output confidence scores along with their predictions, and these confidence scores should be calibrated, i.e., they should reflect the reliability of the prediction. Confidence scores that minimize standard metrics such as the expected calibration error (ECE) accurately measure the reliability on average across the entire population. However, it is in general impossible to measure the reliability of an individual prediction. In this work, we propose the local calibration error (LCE) to span the gap between average and individual reliability. For each individual prediction, the LCE measures the average reliability of a set of similar predictions, where similarity is quantified by a kernel function on a pretrained feature space and by a binning scheme over predicted model confidences. We show theoretically that the LCE can be estimated sample-efficiently from data, and empirically find that it reveals miscalibration modes that are more fine-grained than the ECE can detect. Our key result is a novel local recalibration method LoRe, to improve confidence scores for individual predictions and decrease the LCE. Experimentally, we show that our recalibration method produces more accurate confidence scores, which improves downstream fairness and decision making on classification tasks with both image and tabular data.
△ Less
Submitted 18 August, 2022; v1 submitted 22 February, 2021;
originally announced February 2021.
-
Embodied Intelligence via Learning and Evolution
Authors:
Agrim Gupta,
Silvio Savarese,
Surya Ganguli,
Li Fei-Fei
Abstract:
The intertwined processes of learning and evolution in complex environmental niches have resulted in a remarkable diversity of morphological forms. Moreover, many aspects of animal intelligence are deeply embodied in these evolved morphologies. However, the principles governing relations between environmental complexity, evolved morphology, and the learnability of intelligent control, remain elusi…
▽ More
The intertwined processes of learning and evolution in complex environmental niches have resulted in a remarkable diversity of morphological forms. Moreover, many aspects of animal intelligence are deeply embodied in these evolved morphologies. However, the principles governing relations between environmental complexity, evolved morphology, and the learnability of intelligent control, remain elusive, partially due to the substantial challenge of performing large-scale in silico experiments on evolution and learning. We introduce Deep Evolutionary Reinforcement Learning (DERL): a novel computational framework which can evolve diverse agent morphologies to learn challenging locomotion and manipulation tasks in complex environments using only low level egocentric sensory information. Leveraging DERL we demonstrate several relations between environmental complexity, morphological intelligence and the learnability of control. First, environmental complexity fosters the evolution of morphological intelligence as quantified by the ability of a morphology to facilitate the learning of novel tasks. Second, evolution rapidly selects morphologies that learn faster, thereby enabling behaviors learned late in the lifetime of early ancestors to be expressed early in the lifetime of their descendants. In agents that learn and evolve in complex environments, this result constitutes the first demonstration of a long-conjectured morphological Baldwin effect. Third, our experiments suggest a mechanistic basis for both the Baldwin effect and the emergence of morphological intelligence through the evolution of morphologies that are more physically stable and energy efficient, and can therefore facilitate learning and control.
△ Less
Submitted 3 February, 2021;
originally announced February 2021.
-
Towards new servo control algorithms at the TNG telescope
Authors:
P. Schipani,
M. Gonzalez,
F. Perrotta,
S. Savarese,
M. Colapietro,
A. Ghedina,
M. Hernandez Diaz,
H. Ventura
Abstract:
The servo control algorithms of the TNG, developed in the nineties, have been working for more than 20 years with no major updates. The original hardware was based on a VME-bus based platform running a real time operating system, a rather popular choice for similar applications at the time. Recently, the obsolescence of the hardware and the lack of spares pushed the observatory towards a complete…
▽ More
The servo control algorithms of the TNG, developed in the nineties, have been working for more than 20 years with no major updates. The original hardware was based on a VME-bus based platform running a real time operating system, a rather popular choice for similar applications at the time. Recently, the obsolescence of the hardware and the lack of spares pushed the observatory towards a complete replacement of the electronics that is now being implemented in steps, respecting the basic requirement of never stop** the observatory night operations. Within the framework of this major hardware work, we are taking the opportunity to review and update the existing control schemes. This servo control update, crucial for the telescope performance, envisages a new study from scratch of the controlled plant, including a re-identification of the main axes transfer functions and a re-design of the control filters in the two nested position and speed loops. The ongoing work is described, including preliminary results in the case study of the azimuth axis and our plans for possible further improvements.
△ Less
Submitted 2 January, 2021;
originally announced January 2021.
-
An Automated Pipeline for the VST Data Log Analysis
Authors:
Salvatore Savarese,
Pietro Schipani,
Giulio Capasso,
Mirko Colapietro,
Sergio D'Orsi,
Laurent Marty,
Francesco Perrotta
Abstract:
The VST Telescope Control Software logs continuously detailed information about the telescope and instrument operations. Commands, telemetries, errors, weather conditions and anything may be relevant for the instrument maintenance and the identification of problem sources is regularly saved. All information are recorded in textual form. These log files are often examined individually by the observ…
▽ More
The VST Telescope Control Software logs continuously detailed information about the telescope and instrument operations. Commands, telemetries, errors, weather conditions and anything may be relevant for the instrument maintenance and the identification of problem sources is regularly saved. All information are recorded in textual form. These log files are often examined individually by the observatory personnel for specific issues and for tackling problems raised during the night. Thus, only a minimal part of the information is normally used for daily maintenance. Nevertheless, the analysis of the archived information collected over a long time span can be exploited to reveal useful trends and statistics about the telescope, which would otherwise be overlooked. Given the large size of the archive, a manual inspection and handling of the logs is cumbersome. An automated tool with an adequate user interface has been developed to scrape specific entries within the log files, process the data and display it in a comprehensible way. This pipeline has been used to scan the information collected over 5 years of telescope activity.
△ Less
Submitted 4 January, 2021; v1 submitted 31 December, 2020;
originally announced December 2020.
-
Progress and tests on the Instrument Control Electronics for SOXS
Authors:
M. Colapietro,
G. Capasso,
S. D'Orsi,
P. Schipani,
L. Marty,
S. Savarese,
I. Coretti,
S. Campana,
R. Claudi,
M. Aliverti,
A. Baruffolo,
S. Ben-Ami,
F. Biondi,
R. Cosentino,
F. D'Alessio,
P. D'Avanzo,
O. Hershko,
H. Kuncarayakti,
M. Landoni,
M. Munari,
G. Pignata,
A. Rubin,
S. Scuderi,
F. Vitali,
D. Young
, et al. (24 additional authors not shown)
Abstract:
The forthcoming SOXS (Son Of X-Shooter) will be a new spectroscopic facility for the ESO New Technology Telescope in La Silla, focused on transient events and able to cover both the UV-VIS and NIR bands. The instrument passed the Final Design Review in 2018 and is currently in manufacturing and integration phase. This paper is focused on the assembly and testing of the instrument control electroni…
▽ More
The forthcoming SOXS (Son Of X-Shooter) will be a new spectroscopic facility for the ESO New Technology Telescope in La Silla, focused on transient events and able to cover both the UV-VIS and NIR bands. The instrument passed the Final Design Review in 2018 and is currently in manufacturing and integration phase. This paper is focused on the assembly and testing of the instrument control electronics, which will manage all the motorized functions, alarms, sensors, and electric interlocks. The electronics is hosted in two main control cabinets, divided in several subracks that are assembled to ensure easy accessibility and transportability, to simplify test, integration and maintenance. Both racks are equipped with independent power supply distribution and have their own integrated cooling systems. This paper shows the assembly strategy, reports on the development status and describes the tests performed to verify the system before the integration into the whole instrument.
△ Less
Submitted 23 December, 2020;
originally announced December 2020.
-
Development status of the SOXS spectrograph for the ESO-NTT telescope
Authors:
P. Schipani,
S. Campana,
R. Claudi,
M. Aliverti,
A. Baruffolo,
S. Ben-Ami,
F. Biondi,
G. Capasso,
R. Cosentino,
F. D'Alessio,
P. D'Avanzo,
O. Hershko,
H. Kuncarayakti,
M. Landoni,
M. Munari,
G. Pignata,
A. Rubin,
S. Scuderi,
F. Vitali,
D. Young,
J. Achren,
J. A. Araiza-Duran,
I. Arcavi,
A. Brucalassi,
R. Bruch
, et al. (29 additional authors not shown)
Abstract:
SOXS (Son Of X-Shooter) is a single object spectrograph, characterized by offering a wide simultaneous spectral coverage from U- to H-band, built by an international consortium for the 3.6-m ESO New Technology Telescope at the La Silla Observatory, in the Southern part of the Chilean Atacama Desert. The consortium is focussed on a clear scientific goal: the spectrograph will observe all kind of tr…
▽ More
SOXS (Son Of X-Shooter) is a single object spectrograph, characterized by offering a wide simultaneous spectral coverage from U- to H-band, built by an international consortium for the 3.6-m ESO New Technology Telescope at the La Silla Observatory, in the Southern part of the Chilean Atacama Desert. The consortium is focussed on a clear scientific goal: the spectrograph will observe all kind of transient and variable sources discovered by different surveys with a highly flexible schedule, updated daily, based on the Target of Opportunity concept. It will provide a key spectroscopic partner to any kind of imaging survey, becoming one of the premier transient follow-up instruments in the Southern hemisphere. SOXS will study a mixture of transients encompassing all distance scales and branches of astronomy, including fast alerts (such as gamma-ray bursts and gravitational waves), mid-term alerts (such as supernovae and X-ray transients), and fixed-time events (such as the close-by passage of a minor planet or exoplanets). It will also have the scope to observe active galactic nuclei and blazars, tidal disruption events, fast radio bursts, and more. Besides of the consortium programs on guaranteed time, the instrument is offered to the ESO community for any kind of astrophysical target. The project has passed the Final Design Review and is currently in manufacturing and integration phase. This paper describes the development status of the project.
△ Less
Submitted 23 December, 2020;
originally announced December 2020.
-
Learning Multi-Arm Manipulation Through Collaborative Teleoperation
Authors:
Albert Tung,
Josiah Wong,
Ajay Mandlekar,
Roberto Martín-Martín,
Yuke Zhu,
Li Fei-Fei,
Silvio Savarese
Abstract:
Imitation Learning (IL) is a powerful paradigm to teach robots to perform manipulation tasks by allowing them to learn from human demonstrations collected via teleoperation, but has mostly been limited to single-arm manipulation. However, many real-world tasks require multiple arms, such as lifting a heavy object or assembling a desk. Unfortunately, applying IL to multi-arm manipulation tasks has…
▽ More
Imitation Learning (IL) is a powerful paradigm to teach robots to perform manipulation tasks by allowing them to learn from human demonstrations collected via teleoperation, but has mostly been limited to single-arm manipulation. However, many real-world tasks require multiple arms, such as lifting a heavy object or assembling a desk. Unfortunately, applying IL to multi-arm manipulation tasks has been challenging -- asking a human to control more than one robotic arm can impose significant cognitive burden and is often only possible for a maximum of two robot arms. To address these challenges, we present Multi-Arm RoboTurk (MART), a multi-user data collection platform that allows multiple remote users to simultaneously teleoperate a set of robotic arms and collect demonstrations for multi-arm tasks. Using MART, we collected demonstrations for five novel two and three-arm tasks from several geographically separated users. From our data we arrived at a critical insight: most multi-arm tasks do not require global coordination throughout its full duration, but only during specific moments. We show that learning from such data consequently presents challenges for centralized agents that directly attempt to model all robot actions simultaneously, and perform a comprehensive study of different policy architectures with varying levels of centralization on our tasks. Finally, we propose and evaluate a base-residual policy framework that allows trained policies to better adapt to the mixed coordination setting common in multi-arm manipulation, and show that a centralized policy augmented with a decentralized residual model outperforms all other models on our set of benchmark tasks. Additional results and videos at https://roboturk.stanford.edu/multiarm .
△ Less
Submitted 12 December, 2020;
originally announced December 2020.
-
Human-in-the-Loop Imitation Learning using Remote Teleoperation
Authors:
Ajay Mandlekar,
Danfei Xu,
Roberto Martín-Martín,
Yuke Zhu,
Li Fei-Fei,
Silvio Savarese
Abstract:
Imitation Learning is a promising paradigm for learning complex robot manipulation skills by reproducing behavior from human demonstrations. However, manipulation tasks often contain bottleneck regions that require a sequence of precise actions to make meaningful progress, such as a robot inserting a pod into a coffee machine to make coffee. Trained policies can fail in these regions because small…
▽ More
Imitation Learning is a promising paradigm for learning complex robot manipulation skills by reproducing behavior from human demonstrations. However, manipulation tasks often contain bottleneck regions that require a sequence of precise actions to make meaningful progress, such as a robot inserting a pod into a coffee machine to make coffee. Trained policies can fail in these regions because small deviations in actions can lead the policy into states not covered by the demonstrations. Intervention-based policy learning is an alternative that can address this issue -- it allows human operators to monitor trained policies and take over control when they encounter failures. In this paper, we build a data collection system tailored to 6-DoF manipulation settings, that enables remote human operators to monitor and intervene on trained policies. We develop a simple and effective algorithm to train the policy iteratively on new data collected by the system that encourages the policy to learn how to traverse bottlenecks through the interventions. We demonstrate that agents trained on data collected by our intervention-based system and algorithm outperform agents trained on an equivalent number of samples collected by non-interventional demonstrators, and further show that our method outperforms multiple state-of-the-art baselines for learning from the human interventions on a challenging robot threading task and a coffee making task. Additional results and videos at https://sites.google.com/stanford.edu/iwr .
△ Less
Submitted 12 December, 2020;
originally announced December 2020.
-
Topological Planning with Transformers for Vision-and-Language Navigation
Authors:
Kevin Chen,
Junshen K. Chen,
Jo Chuang,
Marynel Vázquez,
Silvio Savarese
Abstract:
Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan…
▽ More
Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g. forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.
△ Less
Submitted 9 December, 2020;
originally announced December 2020.
-
Semantic and Geometric Modeling with Neural Message Passing in 3D Scene Graphs for Hierarchical Mechanical Search
Authors:
Andrey Kurenkov,
Roberto Martín-Martín,
Jeff Ichnowski,
Ken Goldberg,
Silvio Savarese
Abstract:
Searching for objects in indoor organized environments such as homes or offices is part of our everyday activities. When looking for a target object, we jointly reason about the rooms and containers the object is likely to be in; the same type of container will have a different probability of having the target depending on the room it is in. We also combine geometric and semantic information to in…
▽ More
Searching for objects in indoor organized environments such as homes or offices is part of our everyday activities. When looking for a target object, we jointly reason about the rooms and containers the object is likely to be in; the same type of container will have a different probability of having the target depending on the room it is in. We also combine geometric and semantic information to infer what container is best to search, or what other objects are best to move, if the target object is hidden from view. We propose to use a 3D scene graph representation to capture the hierarchical, semantic, and geometric aspects of this problem. To exploit this representation in a search process, we introduce Hierarchical Mechanical Search (HMS), a method that guides an agent's actions towards finding a target object specified with a natural language description. HMS is based on a novel neural network architecture that uses neural message passing of vectors with visual, geometric, and linguistic information to allow HMS to reason across layers of the graph while combining semantic and geometric cues. HMS is evaluated on a novel dataset of 500 3D scene graphs with dense placements of semantically related objects in storage locations, and is shown to be significantly better than several baselines at finding objects and close to the oracle policy in terms of the median number of actions required. Additional qualitative results can be found at https://ai.stanford.edu/mech-search/hms.
△ Less
Submitted 23 May, 2021; v1 submitted 7 December, 2020;
originally announced December 2020.
-
iGibson 1.0: a Simulation Environment for Interactive Tasks in Large Realistic Scenes
Authors:
Bokui Shen,
Fei Xia,
Chengshu Li,
Roberto Martín-Martín,
Linxi Fan,
Guanzhi Wang,
Claudia Pérez-D'Arpino,
Shyamal Buch,
Sanjana Srivastava,
Lyne P. Tchapmi,
Micael E. Tchapmi,
Kent Vainio,
Josiah Wong,
Li Fei-Fei,
Silvio Savarese
Abstract:
We present iGibson 1.0, a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes. Our environment contains 15 fully interactive home-sized scenes with 108 rooms populated with rigid and articulated objects. The scenes are replicas of real-world homes, with distribution and the layout of objects aligned to those of the real world. iGibson 1.0…
▽ More
We present iGibson 1.0, a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes. Our environment contains 15 fully interactive home-sized scenes with 108 rooms populated with rigid and articulated objects. The scenes are replicas of real-world homes, with distribution and the layout of objects aligned to those of the real world. iGibson 1.0 integrates several key features to facilitate the study of interactive tasks: i) generation of high-quality virtual sensor signals (RGB, depth, segmentation, LiDAR, flow and so on), ii) domain randomization to change the materials of the objects (both visual and physical) and/or their shapes, iii) integrated sampling-based motion planners to generate collision-free trajectories for robot bases and arms, and iv) intuitive human-iGibson interface that enables efficient collection of human demonstrations. Through experiments, we show that the full interactivity of the scenes enables agents to learn useful visual representations that accelerate the training of downstream manipulation tasks. We also show that iGibson 1.0 features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of human demonstrated (mobile) manipulation behaviors. iGibson 1.0 is open-source, equipped with comprehensive examples and documentation. For more information, visit our project website: http://svl.stanford.edu/igibson/
△ Less
Submitted 10 August, 2021; v1 submitted 4 December, 2020;
originally announced December 2020.
-
Deep Affordance Foresight: Planning Through What Can Be Done in the Future
Authors:
Danfei Xu,
Ajay Mandlekar,
Roberto Martín-Martín,
Yuke Zhu,
Silvio Savarese,
Li Fei-Fei
Abstract:
Planning in realistic environments requires searching in large planning spaces. Affordances are a powerful concept to simplify this search, because they model what actions can be successful in a given situation. However, the classical notion of affordance is not suitable for long horizon planning because it only informs the robot about the immediate outcome of actions instead of what actions are b…
▽ More
Planning in realistic environments requires searching in large planning spaces. Affordances are a powerful concept to simplify this search, because they model what actions can be successful in a given situation. However, the classical notion of affordance is not suitable for long horizon planning because it only informs the robot about the immediate outcome of actions instead of what actions are best for achieving a long-term goal. In this paper, we introduce a new affordance representation that enables the robot to reason about the long-term effects of actions through modeling what actions are afforded in the future, thereby informing the robot the best actions to take next to achieve a task goal. Based on the new representation, we develop a learning-to-plan method, Deep Affordance Foresight (DAF), that learns partial environment models of affordances of parameterized motor skills through trial-and-error. We evaluate DAF on two challenging manipulation domains and show that it can effectively learn to carry out multi-step tasks, share learned affordance representations among different tasks, and learn to plan with high-dimensional image inputs. Additional material is available at https://sites.google.com/stanford.edu/daf
△ Less
Submitted 23 June, 2021; v1 submitted 17 November, 2020;
originally announced November 2020.
-
Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation
Authors:
Bryan Chen,
Alexander Sax,
Gene Lewis,
Iro Armeni,
Silvio Savarese,
Amir Zamir,
Jitendra Malik,
Lerrel Pinto
Abstract:
Vision-based robotics often separates the control loop into one module for perception and a separate module for control. It is possible to train the whole system end-to-end (e.g. with deep RL), but doing it "from scratch" comes with a high sample complexity cost and the final result is often brittle, failing unexpectedly if the test environment differs from that of training.
We study the effects…
▽ More
Vision-based robotics often separates the control loop into one module for perception and a separate module for control. It is possible to train the whole system end-to-end (e.g. with deep RL), but doing it "from scratch" comes with a high sample complexity cost and the final result is often brittle, failing unexpectedly if the test environment differs from that of training.
We study the effects of using mid-level visual representations (features learned asynchronously for traditional computer vision objectives), as a generic and easy-to-decode perceptual state in an end-to-end RL framework. Mid-level representations encode invariances about the world, and we show that they aid generalization, improve sample complexity, and lead to a higher final performance. Compared to other approaches for incorporating invariances, such as domain randomization, asynchronously trained mid-level representations scale better: both to harder problems and to larger domain shifts. In practice, this means that mid-level representations could be used to successfully train policies for tasks where domain randomization and learning-from-scratch failed. We report results on both manipulation and navigation tasks, and for navigation include zero-shot sim-to-real experiments on real robots.
△ Less
Submitted 12 November, 2020;
originally announced November 2020.
-
Multimodal Sensor Fusion with Differentiable Filters
Authors:
Michelle A. Lee,
Brent Yi,
Roberto Martín-Martín,
Silvio Savarese,
Jeannette Bohg
Abstract:
Leveraging multimodal information with recursive Bayesian filters improves performance and robustness of state estimation, as recursive filters can combine different modalities according to their uncertainties. Prior work has studied how to optimally fuse different sensor modalities with analytical state estimation algorithms. However, deriving the dynamics and measurement models along with their…
▽ More
Leveraging multimodal information with recursive Bayesian filters improves performance and robustness of state estimation, as recursive filters can combine different modalities according to their uncertainties. Prior work has studied how to optimally fuse different sensor modalities with analytical state estimation algorithms. However, deriving the dynamics and measurement models along with their noise profile can be difficult or lead to intractable models. Differentiable filters provide a way to learn these models end-to-end while retaining the algorithmic structure of recursive filters. This can be especially helpful when working with sensor modalities that are high dimensional and have very different characteristics. In contact-rich manipulation, we want to combine visual sensing (which gives us global information) with tactile sensing (which gives us local information). In this paper, we study new differentiable filtering architectures to fuse heterogeneous sensor information. As case studies, we evaluate three tasks: two in planar pushing (simulated and real) and one in manipulating a kinematically constrained door (simulated). In extensive evaluations, we find that differentiable filters that leverage crossmodal sensor information reach comparable accuracies to unstructured LSTM models, while presenting interpretability benefits that may be important for safety-critical systems. We also release an open-source library for creating and training differentiable Bayesian filters in PyTorch, which can be found on our project website: https://sites.google.com/view/multimodalfilter
△ Less
Submitted 23 December, 2020; v1 submitted 24 October, 2020;
originally announced October 2020.
-
Robot Navigation in Constrained Pedestrian Environments using Reinforcement Learning
Authors:
Claudia Pérez-D'Arpino,
Can Liu,
Patrick Goebel,
Roberto Martín-Martín,
Silvio Savarese
Abstract:
Navigating fluently around pedestrians is a necessary capability for mobile robots deployed in human environments, such as buildings and homes. While research on social navigation has focused mainly on the scalability with the number of pedestrians in open spaces, typical indoor environments present the additional challenge of constrained spaces such as corridors and doorways that limit maneuverab…
▽ More
Navigating fluently around pedestrians is a necessary capability for mobile robots deployed in human environments, such as buildings and homes. While research on social navigation has focused mainly on the scalability with the number of pedestrians in open spaces, typical indoor environments present the additional challenge of constrained spaces such as corridors and doorways that limit maneuverability and influence patterns of pedestrian interaction. We present an approach based on reinforcement learning (RL) to learn policies capable of dynamic adaptation to the presence of moving pedestrians while navigating between desired locations in constrained environments. The policy network receives guidance from a motion planner that provides waypoints to follow a globally planned trajectory, whereas RL handles the local interactions. We explore a compositional principle for multi-layout training and find that policies trained in a small set of geometrically simple layouts successfully generalize to more complex unseen layouts that exhibit composition of the structural elements available during training. Going beyond walls-world like domains, we show transfer of the learned policy to unseen 3D reconstructions of two real environments. These results support the applicability of the compositional principle to navigation in real-world buildings and indicate promising usage of multi-agent simulation within reconstructed environments for tasks that involve interaction.
△ Less
Submitted 16 November, 2020; v1 submitted 16 October, 2020;
originally announced October 2020.
-
Privacy Preserving Recalibration under Domain Shift
Authors:
Rachel Luo,
Shengjia Zhao,
Jiaming Song,
Jonathan Kuck,
Stefano Ermon,
Silvio Savarese
Abstract:
Classifiers deployed in high-stakes real-world applications must output calibrated confidence scores, i.e. their predicted probabilities should reflect empirical frequencies. Recalibration algorithms can greatly improve a model's probability estimates; however, existing algorithms are not applicable in real-world situations where the test data follows a different distribution from the training dat…
▽ More
Classifiers deployed in high-stakes real-world applications must output calibrated confidence scores, i.e. their predicted probabilities should reflect empirical frequencies. Recalibration algorithms can greatly improve a model's probability estimates; however, existing algorithms are not applicable in real-world situations where the test data follows a different distribution from the training data, and privacy preservation is paramount (e.g. protecting patient records). We introduce a framework that abstracts out the properties of recalibration problems under differential privacy constraints. This framework allows us to adapt existing recalibration algorithms to satisfy differential privacy while remaining effective for domain-shift situations. Guided by our framework, we also design a novel recalibration algorithm, accuracy temperature scaling, that outperforms prior work on private datasets. In an extensive empirical study, we find that our algorithm improves calibration on domain-shift benchmarks under the constraints of differential privacy. On the 15 highest severity perturbations of the ImageNet-C dataset, our method achieves a median ECE of 0.029, over 2x better than the next best recalibration method and almost 5x better than without recalibration.
△ Less
Submitted 21 August, 2020;
originally announced August 2020.
-
ReLMoGen: Leveraging Motion Generation in Reinforcement Learning for Mobile Manipulation
Authors:
Fei Xia,
Chengshu Li,
Roberto Martín-Martín,
Or Litany,
Alexander Toshev,
Silvio Savarese
Abstract:
Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, torques) as action space for continuous control tasks. We propose to lift the action space to a higher level in the form of subgoals for a motion generator (a combination of motion planner and trajectory executor). We argue that, by lifting the action space and by leveraging sampling-based motion planners…
▽ More
Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, torques) as action space for continuous control tasks. We propose to lift the action space to a higher level in the form of subgoals for a motion generator (a combination of motion planner and trajectory executor). We argue that, by lifting the action space and by leveraging sampling-based motion planners, we can efficiently use RL to solve complex, long-horizon tasks that could not be solved with existing RL methods in the original action space. We propose ReLMoGen -- a framework that combines a learned policy to predict subgoals and a motion generator to plan and execute the motion needed to reach these subgoals. To validate our method, we apply ReLMoGen to two types of tasks: 1) Interactive Navigation tasks, navigation problems where interactions with the environment are required to reach the destination, and 2) Mobile Manipulation tasks, manipulation tasks that require moving the robot base. These problems are challenging because they are usually long-horizon, hard to explore during training, and comprise alternating phases of navigation and interaction. Our method is benchmarked on a diverse set of seven robotics tasks in photo-realistic simulation environments. In all settings, ReLMoGen outperforms state-of-the-art Reinforcement Learning and Hierarchical Reinforcement Learning baselines. ReLMoGen also shows outstanding transferability between different motion generators at test time, indicating a great potential to transfer to real robots.
△ Less
Submitted 26 March, 2021; v1 submitted 18 August, 2020;
originally announced August 2020.
-
Visuomotor Mechanical Search: Learning to Retrieve Target Objects in Clutter
Authors:
Andrey Kurenkov,
Joseph Taglic,
Rohun Kulkarni,
Marcus Dominguez-Kuhne,
Animesh Garg,
Roberto Martín-Martín,
Silvio Savarese
Abstract:
When searching for objects in cluttered environments, it is often necessary to perform complex interactions in order to move occluding objects out of the way and fully reveal the object of interest and make it graspable. Due to the complexity of the physics involved and the lack of accurate models of the clutter, planning and controlling precise predefined interactions with accurate outcome is ext…
▽ More
When searching for objects in cluttered environments, it is often necessary to perform complex interactions in order to move occluding objects out of the way and fully reveal the object of interest and make it graspable. Due to the complexity of the physics involved and the lack of accurate models of the clutter, planning and controlling precise predefined interactions with accurate outcome is extremely hard, when not impossible. In problems where accurate (forward) models are lacking, Deep Reinforcement Learning (RL) has shown to be a viable solution to map observations (e.g. images) to good interactions in the form of close-loop visuomotor policies. However, Deep RL is sample inefficient and fails when applied directly to the problem of unoccluding objects based on images. In this work we present a novel Deep RL procedure that combines i) teacher-aided exploration, ii) a critic with privileged information, and iii) mid-level representations, resulting in sample efficient and effective learning for the problem of uncovering a target object occluded by a heap of unknown objects. Our experiments show that our approach trains faster and converges to more efficient uncovering solutions than baselines and ablations, and that our uncovering policies lead to an average improvement in the graspability of the target object, facilitating downstream retrieval applications.
△ Less
Submitted 13 August, 2020;
originally announced August 2020.
-
How Trustworthy are Performance Evaluations for Basic Vision Tasks?
Authors:
Tran Thien Dat Nguyen,
Hamid Rezatofighi,
Ba-Ngu Vo,
Ba-Tuong Vo,
Silvio Savarese,
Ian Reid
Abstract:
This paper examines performance evaluation criteria for basic vision tasks involving sets of objects namely, object detection, instance-level segmentation and multi-object tracking. The rankings of algorithms by an existing criterion can fluctuate with different choices of parameters, e.g. Intersection over Union (IoU) threshold, making their evaluations unreliable. More importantly, there is no m…
▽ More
This paper examines performance evaluation criteria for basic vision tasks involving sets of objects namely, object detection, instance-level segmentation and multi-object tracking. The rankings of algorithms by an existing criterion can fluctuate with different choices of parameters, e.g. Intersection over Union (IoU) threshold, making their evaluations unreliable. More importantly, there is no means to verify whether we can trust the evaluations of a criterion. This work suggests a notion of trustworthiness for performance criteria, which requires (i) robustness to parameters for reliability, (ii) contextual meaningfulness in sanity tests, and (iii) consistency with mathematical requirements such as the metric properties. We observe that these requirements were overlooked by many widely-used criteria, and explore alternative criteria using metrics for sets of shapes. We also assess all these criteria based on the suggested requirements for trustworthiness.
△ Less
Submitted 22 July, 2022; v1 submitted 8 August, 2020;
originally announced August 2020.
-
Goal-Aware Prediction: Learning to Model What Matters
Authors:
Suraj Nair,
Silvio Savarese,
Chelsea Finn
Abstract:
Learned dynamics models combined with both planning and policy learning algorithms have shown promise in enabling artificial agents to learn to perform many diverse tasks with limited supervision. However, one of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model (future state reconstruction), and that of the downstream p…
▽ More
Learned dynamics models combined with both planning and policy learning algorithms have shown promise in enabling artificial agents to learn to perform many diverse tasks with limited supervision. However, one of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model (future state reconstruction), and that of the downstream planner or policy (completing a specified task). This issue is exacerbated by vision-based control tasks in diverse real-world environments, where the complexity of the real world dwarfs model capacity. In this paper, we propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space, resulting in a learning objective that more closely matches the downstream task. Further, we do so in an entirely self-supervised manner, without the need for a reward function or image labels. We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
△ Less
Submitted 10 August, 2020; v1 submitted 14 July, 2020;
originally announced July 2020.
-
Adaptive Procedural Task Generation for Hard-Exploration Problems
Authors:
Kuan Fang,
Yuke Zhu,
Silvio Savarese,
Li Fei-Fei
Abstract:
We introduce Adaptive Procedural Task Generation (APT-Gen), an approach to progressively generate a sequence of tasks as curricula to facilitate reinforcement learning in hard-exploration problems. At the heart of our approach, a task generator learns to create tasks from a parameterized task space via a black-box procedural generation module. To enable curriculum learning in the absence of a dire…
▽ More
We introduce Adaptive Procedural Task Generation (APT-Gen), an approach to progressively generate a sequence of tasks as curricula to facilitate reinforcement learning in hard-exploration problems. At the heart of our approach, a task generator learns to create tasks from a parameterized task space via a black-box procedural generation module. To enable curriculum learning in the absence of a direct indicator of learning progress, we propose to train the task generator by balancing the agent's performance in the generated tasks and the similarity to the target tasks. Through adversarial training, the task similarity is adaptively estimated by a task discriminator defined on the agent's experiences, allowing the generated tasks to approximate target tasks of unknown parameterization or outside of the predefined task space. Our experiments on the grid world and robotic manipulation task domains show that APT-Gen achieves substantially better performance than various existing baselines by generating suitable tasks of rich variations.
△ Less
Submitted 18 March, 2021; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Generative Sparse Detection Networks for 3D Single-shot Object Detection
Authors:
JunYoung Gwak,
Christopher Choy,
Silvio Savarese
Abstract:
3D object detection has been widely studied due to its potential applicability to many promising areas such as robotics and augmented reality. Yet, the sparse nature of the 3D data poses unique challenges to this task. Most notably, the observable surface of the 3D point clouds is disjoint from the center of the instance to ground the bounding box prediction on. To this end, we propose Generative…
▽ More
3D object detection has been widely studied due to its potential applicability to many promising areas such as robotics and augmented reality. Yet, the sparse nature of the 3D data poses unique challenges to this task. Most notably, the observable surface of the 3D point clouds is disjoint from the center of the instance to ground the bounding box prediction on. To this end, we propose Generative Sparse Detection Network (GSDN), a fully-convolutional single-shot sparse detection network that efficiently generates the support for object proposals. The key component of our model is a generative sparse tensor decoder, which uses a series of transposed convolutions and pruning layers to expand the support of sparse tensors while discarding unlikely object centers to maintain minimal runtime and memory footprint. GSDN can process unprecedentedly large-scale inputs with a single fully-convolutional feed-forward pass, thus does not require the heuristic post-processing stage that stitches results from sliding windows as other previous methods have. We validate our approach on three 3D indoor datasets including the large-scale 3D indoor reconstruction dataset where our method outperforms the state-of-the-art methods by a relative improvement of 7.14% while being 3.78 times faster than the best prior work.
△ Less
Submitted 22 June, 2020;
originally announced June 2020.
-
Probabilistic Visual Navigation with Bidirectional Image Prediction
Authors:
Noriaki Hirose,
Shun Taguchi,
Fei Xia,
Roberto Martin-Martin,
Kosuke Tahara,
Masanori Ishigaki,
Silvio Savarese
Abstract:
Humans can robustly follow a visual trajectory defined by a sequence of images (i.e. a video) regardless of substantial changes in the environment or the presence of obstacles. We aim at endowing similar visual navigation capabilities to mobile robots solely equipped with a RGB fisheye camera. We propose a novel probabilistic visual navigation system that learns to follow a sequence of images with…
▽ More
Humans can robustly follow a visual trajectory defined by a sequence of images (i.e. a video) regardless of substantial changes in the environment or the presence of obstacles. We aim at endowing similar visual navigation capabilities to mobile robots solely equipped with a RGB fisheye camera. We propose a novel probabilistic visual navigation system that learns to follow a sequence of images with bidirectional visual predictions conditioned on possible navigation velocities. By predicting bidirectionally (from start towards goal and vice versa) our method extends its predictive horizon enabling the robot to go around unseen large obstacles that are not visible in the video trajectory. Learning how to react to obstacles and potential risks in the visual field is achieved by imitating human teleoperators. Since the human teleoperation commands are diverse, we propose a probabilistic representation of trajectories that we can sample to find the safest path. Integrated into our navigation system, we present a novel localization approach that infers the current location of the robot based on the virtual predicted trajectories required to reach different images in the visual trajectory. We evaluate our navigation system quantitatively and qualitatively in multiple simulated and real environments and compare to state-of-the-art baselines.Our approach outperforms the most recent visual navigation methods with a large margin with regard to goal arrival rate, subgoal coverage rate, and success weighted by path length (SPL). Our method also generalizes to new robot embodiments never used during training.
△ Less
Submitted 18 February, 2022; v1 submitted 20 March, 2020;
originally announced March 2020.
-
Learning to Generalize Across Long-Horizon Tasks from Human Demonstrations
Authors:
Ajay Mandlekar,
Danfei Xu,
Roberto Martín-Martín,
Silvio Savarese,
Li Fei-Fei
Abstract:
Imitation learning is an effective and safe technique to train robot policies in the real world because it does not depend on an expensive random exploration process. However, due to the lack of exploration, learning policies that generalize beyond the demonstrated behaviors is still an open challenge. We present a novel imitation learning framework to enable robots to 1) learn complex real world…
▽ More
Imitation learning is an effective and safe technique to train robot policies in the real world because it does not depend on an expensive random exploration process. However, due to the lack of exploration, learning policies that generalize beyond the demonstrated behaviors is still an open challenge. We present a novel imitation learning framework to enable robots to 1) learn complex real world manipulation tasks efficiently from a small number of human demonstrations, and 2) synthesize new behaviors not contained in the collected demonstrations. Our key insight is that multi-task domains often present a latent structure, where demonstrated trajectories for different tasks intersect at common regions of the state space. We present Generalization Through Imitation (GTI), a two-stage offline imitation learning algorithm that exploits this intersecting structure to train goal-directed policies that generalize to unseen start and goal state combinations. In the first stage of GTI, we train a stochastic policy that leverages trajectory intersections to have the capacity to compose behaviors from different demonstration trajectories together. In the second stage of GTI, we collect a small set of rollouts from the unconditioned stochastic policy of the first stage, and train a goal-directed agent to generalize to novel start and goal configurations. We validate GTI in both simulated domains and a challenging long-horizon robotic manipulation domain in the real world. Additional results and videos are available at https://sites.google.com/view/gti2020/ .
△ Less
Submitted 23 June, 2021; v1 submitted 12 March, 2020;
originally announced March 2020.
-
JRMOT: A Real-Time 3D Multi-Object Tracker and a New Large-Scale Dataset
Authors:
Abhijeet Shenoi,
Mihir Patel,
JunYoung Gwak,
Patrick Goebel,
Amir Sadeghian,
Hamid Rezatofighi,
Roberto Martín-Martín,
Silvio Savarese
Abstract:
Robots navigating autonomously need to perceive and track the motion of objects and other agents in its surroundings. This information enables planning and executing robust and safe trajectories. To facilitate these processes, the motion should be perceived in 3D Cartesian space. However, most recent multi-object tracking (MOT) research has focused on tracking people and moving objects in 2D RGB v…
▽ More
Robots navigating autonomously need to perceive and track the motion of objects and other agents in its surroundings. This information enables planning and executing robust and safe trajectories. To facilitate these processes, the motion should be perceived in 3D Cartesian space. However, most recent multi-object tracking (MOT) research has focused on tracking people and moving objects in 2D RGB video sequences. In this work we present JRMOT, a novel 3D MOT system that integrates information from RGB images and 3D point clouds to achieve real-time, state-of-the-art tracking performance. Our system is built with recent neural networks for re-identification, 2D and 3D detection and track description, combined into a joint probabilistic data-association framework within a multi-modal recursive Kalman architecture. As part of our work, we release the JRDB dataset, a novel large scale 2D+3D dataset and benchmark, annotated with over 2 million boxes and 3500 time consistent 2D+3D trajectories across 54 indoor and outdoor scenes. JRDB contains over 60 minutes of data including 360 degree cylindrical RGB video and 3D pointclouds in social settings that we use to develop, train and evaluate JRMOT. The presented 3D MOT system demonstrates state-of-the-art performance against competing methods on the popular 2D tracking KITTI benchmark and serves as first 3D tracking solution for our benchmark. Real-robot tests on our social robot JackRabbot indicate that the system is capable of tracking multiple pedestrians fast and reliably. We provide the ROS code of our tracker at https://sites.google.com/view/jrmot.
△ Less
Submitted 22 July, 2020; v1 submitted 19 February, 2020;
originally announced February 2020.
-
Learning to Navigate Using Mid-Level Visual Priors
Authors:
Alexander Sax,
Jeffrey O. Zhang,
Bradley Emi,
Amir Zamir,
Silvio Savarese,
Leonidas Guibas,
Jitendra Malik
Abstract:
How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. navigating a complex environment)? What are the consequences of not utilizing such visual priors in learning? We study these questions by integrating a generic perceptual skill set (a distance estimator, an edge detector, etc.) within a reinforcement le…
▽ More
How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. navigating a complex environment)? What are the consequences of not utilizing such visual priors in learning? We study these questions by integrating a generic perceptual skill set (a distance estimator, an edge detector, etc.) within a reinforcement learning framework (see Fig. 1). This skill set ("mid-level vision") provides the policy with a more processed state of the world compared to raw images.
Our large-scale study demonstrates that using mid-level vision results in policies that learn faster, generalize better, and achieve higher final performance, when compared to learning from scratch and/or using state-of-the-art visual and non-visual representation learning methods. We show that conventional computer vision objectives are particularly effective in this regard and can be conveniently integrated into reinforcement learning frameworks. Finally, we found that no single visual representation was universally useful for all downstream tasks, hence we computationally derive a task-agnostic set of representations optimized to support arbitrary downstream tasks.
△ Less
Submitted 23 December, 2019;
originally announced December 2019.
-
IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data
Authors:
Ajay Mandlekar,
Fabio Ramos,
Byron Boots,
Silvio Savarese,
Li Fei-Fei,
Animesh Garg,
Dieter Fox
Abstract:
Learning from offline task demonstrations is a problem of great interest in robotics. For simple short-horizon manipulation tasks with modest variation in task instances, offline learning from a small set of demonstrations can produce controllers that successfully solve the task. However, leveraging a fixed batch of data can be problematic for larger datasets and longer-horizon tasks with greater…
▽ More
Learning from offline task demonstrations is a problem of great interest in robotics. For simple short-horizon manipulation tasks with modest variation in task instances, offline learning from a small set of demonstrations can produce controllers that successfully solve the task. However, leveraging a fixed batch of data can be problematic for larger datasets and longer-horizon tasks with greater variations. The data can exhibit substantial diversity and consist of suboptimal solution approaches. In this paper, we propose Implicit Reinforcement without Interaction at Scale (IRIS), a novel framework for learning from large-scale demonstration datasets. IRIS factorizes the control problem into a goal-conditioned low-level controller that imitates short demonstration sequences and a high-level goal selection mechanism that sets goals for the low-level and selectively combines parts of suboptimal solutions leading to more successful task completions. We evaluate IRIS across three datasets, including the RoboTurk Cans dataset collected by humans via crowdsourcing, and show that performant policies can be learned from purely offline learning. Additional results at https://sites.google.com/stanford.edu/iris/ .
△ Less
Submitted 22 February, 2020; v1 submitted 13 November, 2019;
originally announced November 2019.
-
Scaling Robot Supervision to Hundreds of Hours with RoboTurk: Robotic Manipulation Dataset through Human Reasoning and Dexterity
Authors:
Ajay Mandlekar,
Jonathan Booher,
Max Spero,
Albert Tung,
Anchit Gupta,
Yuke Zhu,
Animesh Garg,
Silvio Savarese,
Li Fei-Fei
Abstract:
Large, richly annotated datasets have accelerated progress in fields such as computer vision and natural language processing, but replicating these successes in robotics has been challenging. While prior data collection methodologies such as self-supervision have resulted in large datasets, the data can have poor signal-to-noise ratio. By contrast, previous efforts to collect task demonstrations w…
▽ More
Large, richly annotated datasets have accelerated progress in fields such as computer vision and natural language processing, but replicating these successes in robotics has been challenging. While prior data collection methodologies such as self-supervision have resulted in large datasets, the data can have poor signal-to-noise ratio. By contrast, previous efforts to collect task demonstrations with humans provide better quality data, but they cannot reach the same data magnitude. Furthermore, neither approach places guarantees on the diversity of the data collected, in terms of solution strategies. In this work, we leverage and extend the RoboTurk platform to scale up data collection for robotic manipulation using remote teleoperation. The primary motivation for our platform is two-fold: (1) to address the shortcomings of prior work and increase the total quantity of manipulation data collected through human supervision by an order of magnitude without sacrificing the quality of the data and (2) to collect data on challenging manipulation tasks across several operators and observe a diverse set of emergent behaviors and solutions. We collected over 111 hours of robot manipulation data across 54 users and 3 challenging manipulation tasks in 1 week, resulting in the largest robot dataset collected via remote teleoperation. We evaluate the quality of our platform, the diversity of demonstrations in our dataset, and the utility of our dataset via quantitative and qualitative analysis. For additional results, supplementary videos, and to download our dataset, visit http://roboturk.stanford.edu/realrobotdataset .
△ Less
Submitted 10 November, 2019;
originally announced November 2019.
-
Interactive Gibson Benchmark (iGibson 0.5): A Benchmark for Interactive Navigation in Cluttered Environments
Authors:
Fei Xia,
William B. Shen,
Chengshu Li,
Priya Kasimbeg,
Micael Tchapmi,
Alexander Toshev,
Li Fei-Fei,
Roberto Martín-Martín,
Silvio Savarese
Abstract:
We present Interactive Gibson Benchmark, the first comprehensive benchmark for training and evaluating Interactive Navigation: robot navigation strategies where physical interaction with objects is allowed and even encouraged to accomplish a task. For example, the robot can move objects if needed in order to clear a path leading to the goal location. Our benchmark comprises two novel elements: 1)…
▽ More
We present Interactive Gibson Benchmark, the first comprehensive benchmark for training and evaluating Interactive Navigation: robot navigation strategies where physical interaction with objects is allowed and even encouraged to accomplish a task. For example, the robot can move objects if needed in order to clear a path leading to the goal location. Our benchmark comprises two novel elements: 1) a new experimental setup, the Interactive Gibson Environment (iGibson 0.5), which simulates high fidelity visuals of indoor scenes, and high fidelity physical dynamics of the robot and common objects found in these scenes; 2) a set of Interactive Navigation metrics which allows one to study the interplay between navigation and physical interaction. We present and evaluate multiple learning-based baselines in Interactive Gibson, and provide insights into regimes of navigation with different trade-offs between navigation path efficiency and disturbance of surrounding objects. We make our benchmark publicly available(https://sites.google.com/view/interactivegibsonenv) and encourage researchers from all disciplines in robotics (e.g. planning, learning, control) to propose, evaluate, and compare their Interactive Navigation solutions in Interactive Gibson.
△ Less
Submitted 9 August, 2021; v1 submitted 29 October, 2019;
originally announced October 2019.
-
Dynamics Learning with Cascaded Variational Inference for Multi-Step Manipulation
Authors:
Kuan Fang,
Yuke Zhu,
Animesh Garg,
Silvio Savarese,
Li Fei-Fei
Abstract:
The fundamental challenge of planning for multi-step manipulation is to find effective and plausible action sequences that lead to the task goal. We present Cascaded Variational Inference (CAVIN) Planner, a model-based method that hierarchically generates plans by sampling from latent spaces. To facilitate planning over long time horizons, our method learns latent representations that decouple the…
▽ More
The fundamental challenge of planning for multi-step manipulation is to find effective and plausible action sequences that lead to the task goal. We present Cascaded Variational Inference (CAVIN) Planner, a model-based method that hierarchically generates plans by sampling from latent spaces. To facilitate planning over long time horizons, our method learns latent representations that decouple the prediction of high-level effects from the generation of low-level motions through cascaded variational inference. This enables us to model dynamics at two different levels of temporal resolutions for hierarchical planning. We evaluate our approach in three multi-step robotic manipulation tasks in cluttered tabletop environments given high-dimensional observations. Empirical results demonstrate that the proposed method outperforms state-of-the-art model-based methods by strategically interacting with multiple objects.
△ Less
Submitted 17 March, 2020; v1 submitted 29 October, 2019;
originally announced October 2019.
-
KETO: Learning Keypoint Representations for Tool Manipulation
Authors:
Zengyi Qin,
Kuan Fang,
Yuke Zhu,
Li Fei-Fei,
Silvio Savarese
Abstract:
We aim to develop an algorithm for robots to manipulate novel objects as tools for completing different task goals. An efficient and informative representation would facilitate the effectiveness and generalization of such algorithms. For this purpose, we present KETO, a framework of learning keypoint representations of tool-based manipulation. For each task, a set of task-specific keypoints is joi…
▽ More
We aim to develop an algorithm for robots to manipulate novel objects as tools for completing different task goals. An efficient and informative representation would facilitate the effectiveness and generalization of such algorithms. For this purpose, we present KETO, a framework of learning keypoint representations of tool-based manipulation. For each task, a set of task-specific keypoints is jointly predicted from 3D point clouds of the tool object by a deep neural network. These keypoints offer a concise and informative description of the object to determine grasps and subsequent manipulation actions. The model is learned from self-supervised robot interactions in the task environment without the need for explicit human annotations. We evaluate our framework in three manipulation tasks with tool use. Our model consistently outperforms state-of-the-art methods in terms of task success rates. Qualitative results of keypoint prediction and tool generation are shown to visualize the learned representations.
△ Less
Submitted 29 October, 2019; v1 submitted 25 October, 2019;
originally announced October 2019.
-
JRDB: A Dataset and Benchmark of Egocentric Robot Visual Perception of Humans in Built Environments
Authors:
Roberto Martín-Martín,
Mihir Patel,
Hamid Rezatofighi,
Abhijeet Shenoi,
JunYoung Gwak,
Eric Frankel,
Amir Sadeghian,
Silvio Savarese
Abstract:
We present JRDB, a novel egocentric dataset collected from our social mobile manipulator JackRabbot. The dataset includes 64 minutes of annotated multimodal sensor data including stereo cylindrical 360$^\circ$ RGB video at 15 fps, 3D point clouds from two Velodyne 16 Lidars, line 3D point clouds from two Sick Lidars, audio signal, RGB-D video at 30 fps, 360$^\circ$ spherical image from a fisheye c…
▽ More
We present JRDB, a novel egocentric dataset collected from our social mobile manipulator JackRabbot. The dataset includes 64 minutes of annotated multimodal sensor data including stereo cylindrical 360$^\circ$ RGB video at 15 fps, 3D point clouds from two Velodyne 16 Lidars, line 3D point clouds from two Sick Lidars, audio signal, RGB-D video at 30 fps, 360$^\circ$ spherical image from a fisheye camera and encoder values from the robot's wheels. Our dataset incorporates data from traditionally underrepresented scenes such as indoor environments and pedestrian areas, all from the ego-perspective of the robot, both stationary and navigating. The dataset has been annotated with over 2.3 million bounding boxes spread over 5 individual cameras and 1.8 million associated 3D cuboids around all people in the scenes totaling over 3500 time consistent trajectories. Together with our dataset and the annotations, we launch a benchmark and metrics for 2D and 3D person detection and tracking. With this dataset, which we plan on extending with further types of annotation in the future, we hope to provide a new source of data and a test-bench for research in the areas of egocentric robot vision, autonomous navigation, and all perceptual tasks around social robotics in human environments.
△ Less
Submitted 24 April, 2021; v1 submitted 25 October, 2019;
originally announced October 2019.
-
HRL4IN: Hierarchical Reinforcement Learning for Interactive Navigation with Mobile Manipulators
Authors:
Chengshu Li,
Fei Xia,
Roberto Martin-Martin,
Silvio Savarese
Abstract:
Most common navigation tasks in human environments require auxiliary arm interactions, e.g. opening doors, pressing buttons and pushing obstacles away. This type of navigation tasks, which we call Interactive Navigation, requires the use of mobile manipulators: mobile bases with manipulation capabilities. Interactive Navigation tasks are usually long-horizon and composed of heterogeneous phases of…
▽ More
Most common navigation tasks in human environments require auxiliary arm interactions, e.g. opening doors, pressing buttons and pushing obstacles away. This type of navigation tasks, which we call Interactive Navigation, requires the use of mobile manipulators: mobile bases with manipulation capabilities. Interactive Navigation tasks are usually long-horizon and composed of heterogeneous phases of pure navigation, pure manipulation, and their combination. Using the wrong part of the embodiment is inefficient and hinders progress. We propose HRL4IN, a novel Hierarchical RL architecture for Interactive Navigation tasks. HRL4IN exploits the exploration benefits of HRL over flat RL for long-horizon tasks thanks to temporally extended commitments towards subgoals. Different from other HRL solutions, HRL4IN handles the heterogeneous nature of the Interactive Navigation task by creating subgoals in different spaces in different phases of the task. Moreover, HRL4IN selects different parts of the embodiment to use for each phase, improving energy efficiency. We evaluate HRL4IN against flat PPO and HAC, a state-of-the-art HRL algorithm, on Interactive Navigation in two environments - a 2D grid-world environment and a 3D environment with physics simulation. We show that HRL4IN significantly outperforms its baselines in terms of task performance and energy efficiency. More information is available at https://sites.google.com/view/hrl4in.
△ Less
Submitted 24 October, 2019;
originally announced October 2019.
-
6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints
Authors:
Chen Wang,
Roberto Martín-Martín,
Danfei Xu,
Jun Lv,
Cewu Lu,
Li Fei-Fei,
Silvio Savarese,
Yuke Zhu
Abstract:
We present 6-PACK, a deep learning approach to category-level 6D object pose tracking on RGB-D data. Our method tracks in real-time novel object instances of known object categories such as bowls, laptops, and mugs. 6-PACK learns to compactly represent an object by a handful of 3D keypoints, based on which the interframe motion of an object instance can be estimated through keypoint matching. Thes…
▽ More
We present 6-PACK, a deep learning approach to category-level 6D object pose tracking on RGB-D data. Our method tracks in real-time novel object instances of known object categories such as bowls, laptops, and mugs. 6-PACK learns to compactly represent an object by a handful of 3D keypoints, based on which the interframe motion of an object instance can be estimated through keypoint matching. These keypoints are learned end-to-end without manual supervision in order to be most effective for tracking. Our experiments show that our method substantially outperforms existing methods on the NOCS category-level 6D pose estimation benchmark and supports a physical robot to perform simple vision-based closed-loop manipulation tasks. Our code and video are available at https://sites.google.com/view/6packtracking.
△ Less
Submitted 23 October, 2019;
originally announced October 2019.
-
3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera
Authors:
Iro Armeni,
Zhi-Yang He,
JunYoung Gwak,
Amir R. Zamir,
Martin Fischer,
Jitendra Malik,
Silvio Savarese
Abstract:
A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, texture, etc.) be grounded and what should be its structure? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph.…
▽ More
A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, texture, etc.) be grounded and what should be its structure? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph. Given a 3D mesh and registered panoramic images, we construct a graph that spans the entire building and includes semantics on objects (e.g., class, material, and other attributes), rooms (e.g., scene category, volume, etc.) and cameras (e.g., location, etc.), as well as the relationships among these entities.
However, this process is prohibitively labor heavy if done manually. To alleviate this we devise a semi-automatic framework that employs existing detection methods and enhances them using two main constraints: I. framing of query images sampled on panoramas to maximize the performance of 2D detectors, and II. multi-view consistency enforcement across 2D detections that originate in different camera locations.
△ Less
Submitted 6 October, 2019;
originally announced October 2019.
-
Causal Induction from Visual Observations for Goal Directed Tasks
Authors:
Suraj Nair,
Yuke Zhu,
Silvio Savarese,
Li Fei-Fei
Abstract:
Causal reasoning has been an indispensable capability for humans and other intelligent animals to interact with the physical world. In this work, we propose to endow an artificial agent with the capability of causal reasoning for completing goal-directed tasks. We develop learning-based approaches to inducing causal knowledge in the form of directed acyclic graphs, which can be used to contextuali…
▽ More
Causal reasoning has been an indispensable capability for humans and other intelligent animals to interact with the physical world. In this work, we propose to endow an artificial agent with the capability of causal reasoning for completing goal-directed tasks. We develop learning-based approaches to inducing causal knowledge in the form of directed acyclic graphs, which can be used to contextualize a learned goal-conditional policy to perform tasks in novel environments with latent causal structures. We leverage attention mechanisms in our causal induction model and goal-conditional policy, enabling us to incrementally generate the causal graph from the agent's visual observations and to selectively use the induced graph for determining actions. Our experiments show that our method effectively generalizes towards completing new tasks in novel environments with previously unseen causal structures.
△ Less
Submitted 3 October, 2019;
originally announced October 2019.
-
Regression Planning Networks
Authors:
Danfei Xu,
Roberto Martín-Martín,
De-An Huang,
Yuke Zhu,
Silvio Savarese,
Li Fei-Fei
Abstract:
Recent learning-to-plan methods have shown promising results on planning directly from observation space. Yet, their ability to plan for long-horizon tasks is limited by the accuracy of the prediction model. On the other hand, classical symbolic planners show remarkable capabilities in solving long-horizon tasks, but they require predefined symbolic rules and symbolic states, restricting their rea…
▽ More
Recent learning-to-plan methods have shown promising results on planning directly from observation space. Yet, their ability to plan for long-horizon tasks is limited by the accuracy of the prediction model. On the other hand, classical symbolic planners show remarkable capabilities in solving long-horizon tasks, but they require predefined symbolic rules and symbolic states, restricting their real-world applicability. In this work, we combine the benefits of these two paradigms and propose a learning-to-plan method that can directly generate a long-term symbolic plan conditioned on high-dimensional observations. We borrow the idea of regression (backward) planning from classical planning literature and introduce Regression Planning Networks (RPN), a neural network architecture that plans backward starting at a task goal and generates a sequence of intermediate goals that reaches the current observation. We show that our model not only inherits many favorable traits from symbolic planning, e.g., the ability to solve previously unseen tasks but also can learn from visual inputs in an end-to-end manner. We evaluate the capabilities of RPN in a grid world environment and a simulated 3D kitchen environment featuring complex visual scenes and long task horizons, and show that it achieves near-optimal performance in completely new task instances.
△ Less
Submitted 28 September, 2019;
originally announced September 2019.
-
SURREAL-System: Fully-Integrated Stack for Distributed Deep Reinforcement Learning
Authors:
Linxi Fan,
Yuke Zhu,
Jiren Zhu,
Zihua Liu,
Orien Zeng,
Anchit Gupta,
Joan Creus-Costa,
Silvio Savarese,
Li Fei-Fei
Abstract:
We present an overview of SURREAL-System, a reproducible, flexible, and scalable framework for distributed reinforcement learning (RL). The framework consists of a stack of four layers: Provisioner, Orchestrator, Protocol, and Algorithms. The Provisioner abstracts away the machine hardware and node pools across different cloud providers. The Orchestrator provides a unified interface for scheduling…
▽ More
We present an overview of SURREAL-System, a reproducible, flexible, and scalable framework for distributed reinforcement learning (RL). The framework consists of a stack of four layers: Provisioner, Orchestrator, Protocol, and Algorithms. The Provisioner abstracts away the machine hardware and node pools across different cloud providers. The Orchestrator provides a unified interface for scheduling and deploying distributed algorithms by high-level description, which is capable of deploying to a wide range of hardware from a personal laptop to full-fledged cloud clusters. The Protocol provides network communication primitives optimized for RL. Finally, the SURREAL algorithms, such as Proximal Policy Optimization (PPO) and Evolution Strategies (ES), can easily scale to 1000s of CPU cores and 100s of GPUs. The learning performances of our distributed algorithms establish new state-of-the-art on OpenAI Gym and Robotics Suites tasks.
△ Less
Submitted 11 October, 2019; v1 submitted 27 September, 2019;
originally announced September 2019.
-
AC-Teach: A Bayesian Actor-Critic Method for Policy Learning with an Ensemble of Suboptimal Teachers
Authors:
Andrey Kurenkov,
Ajay Mandlekar,
Roberto Martin-Martin,
Silvio Savarese,
Animesh Garg
Abstract:
The exploration mechanism used by a Deep Reinforcement Learning (RL) agent plays a key role in determining its sample efficiency. Thus, improving over random exploration is crucial to solve long-horizon tasks with sparse rewards. We propose to leverage an ensemble of partial solutions as teachers that guide the agent's exploration with action suggestions throughout training. While the setup of lea…
▽ More
The exploration mechanism used by a Deep Reinforcement Learning (RL) agent plays a key role in determining its sample efficiency. Thus, improving over random exploration is crucial to solve long-horizon tasks with sparse rewards. We propose to leverage an ensemble of partial solutions as teachers that guide the agent's exploration with action suggestions throughout training. While the setup of learning with teachers has been previously studied, our proposed approach - Actor-Critic with Teacher Ensembles (AC-Teach) - is the first to work with an ensemble of suboptimal teachers that may solve only part of the problem or contradict other each other, forming a unified algorithmic solution that is compatible with a broad range of teacher ensembles. AC-Teach leverages a probabilistic representation of the expected outcome of the teachers' and student's actions to direct exploration, reduce dithering, and adapt to the dynamically changing quality of the learner. We evaluate a variant of AC-Teach that guides the learning of a Bayesian DDPG agent on three tasks - path following, robotic pick and place, and robotic cube swee** using a hook - and show that it improves largely on sampling efficiency over a set of baselines, both for our target scenario of unconstrained suboptimal teachers and for easier setups with optimal or single teachers. Additional results and videos at https://sites.google.com/view/acteach/home.
△ Less
Submitted 12 December, 2019; v1 submitted 9 September, 2019;
originally announced September 2019.
-
Situational Fusion of Visual Representation for Visual Navigation
Authors:
Bokui Shen,
Danfei Xu,
Yuke Zhu,
Leonidas J. Guibas,
Li Fei-Fei,
Silvio Savarese
Abstract:
A complex visual navigation task puts an agent in different situations which call for a diverse range of visual perception abilities. For example, to "go to the nearest chair", the agent might need to identify a chair in a living room using semantics, follow along a hallway using vanishing point cues, and avoid obstacles using depth. Therefore, utilizing the appropriate visual perception abilities…
▽ More
A complex visual navigation task puts an agent in different situations which call for a diverse range of visual perception abilities. For example, to "go to the nearest chair", the agent might need to identify a chair in a living room using semantics, follow along a hallway using vanishing point cues, and avoid obstacles using depth. Therefore, utilizing the appropriate visual perception abilities based on a situational understanding of the visual environment can empower these navigation models in unseen visual environments. We propose to train an agent to fuse a large set of visual representations that correspond to diverse visual perception abilities. To fully utilize each representation, we develop an action-level representation fusion scheme, which predicts an action candidate from each representation and adaptively consolidate these action candidates into the final action. Furthermore, we employ a data-driven inter-task affinity regularization to reduce redundancies and improve generalization. Our approach leads to a significantly improved performance in novel environments over ImageNet-pretrained baseline and other fusion methods.
△ Less
Submitted 3 August, 2021; v1 submitted 23 August, 2019;
originally announced August 2019.