Search | arXiv e-print repository

From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control

Authors: Yide Shentu, Philipp Wu, Aravind Rajeswaran, Pieter Abbeel

Abstract: Hierarchical control for robotics has long been plagued by the need to have a well defined interface layer to communicate between high-level task planners and low-level policies. With the advent of LLMs, language has been emerging as a prospective interface layer. However, this has several limitations. Not all tasks can be decomposed into steps that are easily expressible in natural language (e.g. performing a dance routine). Further, it makes end-to-end finetuning on embodied data challenging due to domain shift and catastrophic forgetting. We introduce our method -- Learnable Latent Codes as Bridges (LCB) -- as an alternate architecture to overcome these limitations. LCB uses a learnable latent code to act as a bridge between LLMs and low-level policies. This enables LLMs to flexibly communicate goals in the task plan without being entirely constrained by language limitations. Additionally, it enables end-to-end finetuning without destroying the embedding space of word tokens learned during pre-training. Through experiments on Language Table and Calvin, two common language based benchmarks for embodied agents, we find that LCB outperforms baselines (including those with GPT-4V) that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2310.06828 [pdf, other]

RoboHive: A Unified Framework for Robot Learning

Authors: Vikash Kumar, Rutav Shah, Gaoyue Zhou, Vincent Moens, Vittorio Caggiano, Jay Vakil, Abhishek Gupta, Aravind Rajeswaran

Abstract: We present RoboHive, a comprehensive software platform and ecosystem for research in the field of Robot Learning and Embodied Artificial Intelligence. Our platform encompasses a diverse range of pre-existing and novel environments, including dexterous manipulation with the Shadow Hand, whole-arm manipulation tasks with Franka and Fetch robots, and quadruped locomotion, among others. Included environments are organized within and cover multiple domains such as hand manipulation, locomotion, multi-task, multi-agent, muscles, etc. In comparison to prior works, RoboHive offers a streamlined and unified task interface that depends on only a minimal set of well-maintained packages, features tasks with high physics fidelity and rich visual diversity, and supports common hardware drivers for real-world deployment. The unified interface of RoboHive offers a convenient and accessible abstraction for algorithmic research in imitation, reinforcement, multi-task, and hierarchical learning. Furthermore, RoboHive includes expert demonstrations and baseline results for most environments, providing a standard for benchmarking and comparisons. Details: https://sites.google.com/view/robohive

Submitted 10 October, 2023; originally announced October 2023.

Comments: Accepted at 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

arXiv:2310.02219 [pdf, other]

What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?

Authors: Sneha Silwal, Karmesh Yadav, Tingfan Wu, Jay Vakil, Arjun Majumdar, Sergio Arnaud, Claire Chen, Vincent-Pierre Berges, Dhruv Batra, Aravind Rajeswaran, Mrinal Kalakrishnan, Franziska Meier, Oleksandr Maksymets

Abstract: We present a large empirical investigation on the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study spans five different PVRs, two different policy-learning paradigms (imitation and reinforcement learning), and three different robots for 5 distinct manipulation and indoor navigation tasks. From this effort, we arrive at three insights: 1) the performance trends of PVRs in simulation are generally indicative of their trends in the real world, 2) the use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits from variations in PVRs, primarily data-augmentation and fine-tuning, also transfer to real-world performance. See project website for additional details and visuals.

Submitted 3 October, 2023; originally announced October 2023.

Comments: Project website https://pvrs-sim2real.github.io/

MSC Class: 68T45 (Primary) 68T40; 68T05 (Secondary)

ACM Class: I.2.9; I.2.6; I.4.8; I.5.4

arXiv:2309.14236 [pdf, other]

MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation

Authors: Patrick Lancaster, Nicklas Hansen, Aravind Rajeswaran, Vikash Kumar

Abstract: Robotic systems that aspire to operate in uninstrumented real-world environments must perceive the world directly via onboard sensing. Vision-based learning systems aim to eliminate the need for environment instrumentation by building an implicit understanding of the world based on raw pixels, but navigating the contact-rich high-dimensional search space from solely sparse visual reward signals significantly exacerbates the challenge of exploration. The applicability of such systems is thus typically restricted to simulated or heavily engineered environments since agent exploration in the real-world without the guidance of explicit state estimation and dense rewards can lead to unsafe behavior and safety faults that are catastrophic. In this study, we isolate the root causes behind these limitations to develop a system, called MoDem-V2, capable of learning contact-rich manipulation directly in the uninstrumented real world. Building on the latest algorithmic advancements in model-based reinforcement learning (MBRL), demo-bootstrapping, and effective exploration, MoDem-V2 can acquire contact-rich dexterous manipulation skills directly in the real world. We identify key ingredients for leveraging demonstrations in model learning while respecting real-world safety considerations -- exploration centering, agency handover, and actor-critic ensembles. We empirically demonstrate the contribution of these ingredients in four complex visuo-motor manipulation problems in both simulation and the real world. To the best of our knowledge, our work presents the first successful system for demonstration-augmented visual MBRL trained directly in the real world. Visit https://sites.google.com/view/modem-v2 for videos and more details.

Submitted 12 May, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: 10 pages, 8 figures

arXiv:2306.00942 [pdf, other]

Train Offline, Test Online: A Real Robot Learning Benchmark

Authors: Gaoyue Zhou, Victoria Dean, Mohan Kumar Srirama, Aravind Rajeswaran, Jyothish Pari, Kyle Hatch, Aryan Jain, Tianhe Yu, Pieter Abbeel, Lerrel Pinto, Chelsea Finn, Abhinav Gupta

Abstract: Three challenges limit the progress of robot learning research: robots are expensive (few labs can participate), everyone uses different robots (findings do not generalize across labs), and we lack internet-scale robotics data. We take on these challenges via a new benchmark: Train Offline, Test Online (TOTO). TOTO provides remote users with access to shared robotic hardware for evaluating methods on common tasks and an open-source dataset of these tasks for offline training. Its manipulation task suite requires challenging generalization to unseen objects, positions, and lighting. We present initial results on TOTO comparing five pretrained visual representations and four offline policy learning baselines, remotely contributed by five institutions. The real promise of TOTO, however, lies in the future: we release the benchmark for additional submissions from any user, enabling easy, direct comparison to several methods without the need to obtain hardware or collect data.

Submitted 30 June, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: Accepted to ICRA 2023

arXiv:2305.02968 [pdf, other]

Masked Trajectory Models for Prediction, Representation, and Control

Authors: Philipp Wu, Arjun Majumdar, Kevin Stone, Yixin Lin, Igor Mordatch, Pieter Abbeel, Aravind Rajeswaran

Abstract: We introduce Masked Trajectory Models (MTM) as a generic abstraction for sequential decision making. MTM takes a trajectory, such as a state-action sequence, and aims to reconstruct the trajectory conditioned on random subsets of the same trajectory. By training with a highly randomized masking pattern, MTM learns versatile networks that can take on different roles or capabilities, by simply choosing appropriate masks at inference time. For example, the same MTM network can be used as a forward dynamics model, inverse dynamics model, or even an offline RL agent. Through extensive experiments in several continuous control tasks, we show that the same MTM network -- i.e. same weights -- can match or outperform specialized networks trained for the aforementioned capabilities. Additionally, we find that state representations learned by MTM can significantly accelerate the learning speed of traditional RL algorithms. Finally, in offline RL benchmarks, we find that MTM is competitive with specialized offline RL algorithms, despite MTM being a generic self-supervised learning method without any explicit RL components. Code is available at https://github.com/facebookresearch/mtm

Submitted 4 May, 2023; originally announced May 2023.

Comments: Accepted for publication at ICML 2023. Project webpage: https://wuphilipp.github.io/mtm/

arXiv:2303.18240 [pdf, other]

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

Authors: Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, Dhruv Batra, Yixin Lin, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier

Abstract: We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data size and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 4.3M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Next, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving performance competitive with or superior to the best known results on all of the benchmarks in CortexBench. Finally, we present real-world hardware experiments, in which VC-1 and VC-1 (adapted) outperform the strongest pre-existing PVR. Overall, this paper presents no new techniques but a rigorous systematic evaluation, a broad set of findings about PVRs (that, in some cases, refute those made in narrow domains in prior work), and open-sourced code and models (that required over 10,000 GPU-hours to train) for the benefit of the research community.

Submitted 1 February, 2024; v1 submitted 31 March, 2023; originally announced March 2023.

Comments: Project website: https://eai-vc.github.io

arXiv:2212.05749 [pdf, other]

On Pre-Training for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline

Authors: Nicklas Hansen, Zhecheng Yuan, Yanjie Ze, Tongzhou Mu, Aravind Rajeswaran, Hao Su, Huazhe Xu, Xiaolong Wang

Abstract: In this paper, we examine the effectiveness of pre-training for visuo-motor control tasks. We revisit a simple Learning-from-Scratch (LfS) baseline that incorporates data augmentation and a shallow ConvNet, and find that this baseline is surprisingly competitive with recent approaches (PVR, MVP, R3M) that leverage frozen visual representations trained on large-scale vision datasets -- across a variety of algorithms, task domains, and metrics in simulation and on a real robot. Our results demonstrate that these methods are hindered by a significant domain gap between the pre-training datasets and current benchmarks for visuo-motor control, which is alleviated by finetuning. Based on our findings, we provide recommendations for future research in pre-training for control and hope that our simple yet strong baseline will aid in accurately benchmarking progress in this area.

Submitted 15 June, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: Code: https://github.com/gemcollector/learning-from-scratch

arXiv:2212.05711 [pdf, other]

CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning

Authors: Zhao Mandi, Homanga Bharadhwaj, Vincent Moens, Shuran Song, Aravind Rajeswaran, Vikash Kumar

Abstract: Large-scale training has propelled significant progress in various sub-fields of AI such as computer vision and natural language processing. However, building robot learning systems at a comparable scale remains challenging. To develop robots that can perform a wide range of skills and adapt to new scenarios, efficient methods for collecting vast and diverse amounts of data on physical robot systems are required, as well as the capability to train high-capacity policies using such datasets. In this work, we propose a framework for scaling robot learning, with specific focus on multi-task and multi-scene manipulation in kitchen environments, both in simulation and in the real world. Our proposed framework, CACTI, comprises four stages that separately handle data collection, data augmentation, visual representation learning, and imitation policy training, to enable scalability in robot learning. We make use of state-of-the-art generative models as part of the data augmentation stage, and use pre-trained out-of-domain visual representations to improve training efficiency. Experimental results demonstrate the effectiveness of our approach. On a real robot setup, CACTI enables efficient training of a single policy that can perform 10 manipulation tasks involving kitchen objects, and is robust to varying layouts of distractors. In a simulated kitchen environment, CACTI trains a single policy to perform 18 semantic tasks across 100 layout variations for each individual task. We will release the simulation task benchmark and augmented datasets in both real and simulated environments to facilitate future research.

Submitted 16 February, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

arXiv:2212.05698 [pdf, other]

MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations

Authors: Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, Aravind Rajeswaran

Abstract: Poor sample efficiency continues to be the primary challenge for deployment of deep Reinforcement Learning (RL) algorithms for real-world applications, and in particular for visuo-motor control. Model-based RL has the potential to be highly sample efficient by concurrently learning a world model and using synthetic rollouts for planning and policy improvement. However, in practice, sample-efficient learning with model-based RL is bottlenecked by the exploration challenge. In this work, we find that leveraging just a handful of demonstrations can dramatically improve the sample-efficiency of model-based RL. Simply appending demonstrations to the interaction dataset, however, does not suffice. We identify key ingredients for leveraging demonstrations in model learning -- policy pretraining, targeted exploration, and oversampling of demonstration data -- which form the three phases of our model-based RL framework. We empirically study three complex visuo-motor control domains and find that our method is 150%-250% more successful in completing sparse reward tasks compared to prior approaches in the low data regime (100K interaction steps, 5 demonstrations). Code and videos are available at: https://nicklashansen.github.io/modemrl

Submitted 11 December, 2022; originally announced December 2022.

arXiv:2210.06479 [pdf, other]

Real World Offline Reinforcement Learning with Realistic Data Source

Authors: Gaoyue Zhou, Liyiming Ke, Siddhartha Srinivasa, Abhinav Gupta, Aravind Rajeswaran, Vikash Kumar

Abstract: Offline reinforcement learning (ORL) holds great promise for robot learning due to its ability to learn from arbitrary pre-generated experience. However, current ORL benchmarks are almost entirely in simulation and utilize contrived datasets like replay buffers of online RL agents or sub-optimal trajectories, and thus hold limited relevance for real-world robotics. In this work (Real-ORL), we posit that data collected from safe operations of closely related tasks are more practical data sources for real-world robot learning. Under these settings, we perform an extensive (6500+ trajectories collected over 800+ robot hours and 270+ human labor hours) empirical study evaluating generalization and transfer capabilities of representative ORL methods on four real-world tabletop manipulation tasks. Our study finds that ORL and imitation learning prefer different action spaces, and that ORL algorithms can generalize from leveraging offline heterogeneous data sources and outperform imitation learning. We release our dataset and implementations at URL: https://sites.google.com/view/real-orl

Submitted 12 October, 2022; originally announced October 2022.

Comments: Project website: https://sites.google.com/view/real-orl

arXiv:2204.11134 [pdf, other]

Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?

Authors: Yuchen Cui, Scott Niekum, Abhinav Gupta, Vikash Kumar, Aravind Rajeswaran

Abstract: Task specification is at the core of programming autonomous robots. A low-effort modality for task specification is critical for engagement of non-expert end-users and ultimate adoption of personalized robot agents. A widely studied approach to task specification is through goals, using either compact state vectors or goal images from the same robot scene. The former is hard to interpret for non-experts and necessitates detailed state estimation and scene understanding. The latter requires the generation of a desired goal image, which often requires a human to complete the task, defeating the purpose of having autonomous robots. In this work, we explore alternate and more general forms of goal specification that are expected to be easier for humans to specify and use, such as images obtained from the internet, hand sketches that provide a visual description of the desired task, or simple language descriptions. As a preliminary step towards this, we investigate the capabilities of large scale pre-trained models (foundation models) for zero-shot goal specification, and find promising results in a collection of simulated robot manipulation tasks and real-world datasets.

Submitted 23 April, 2022; originally announced April 2022.

Comments: 30 pages with appendix, published as a conference paper at L4DC 2022

arXiv:2203.12601 [pdf, other]

R3M: A Universal Visual Representation for Robot Manipulation

Authors: Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, Abhinav Gupta

Abstract: We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.

Submitted 18 November, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

Comments: Conference on Robot Learning (CoRL) 2022

arXiv:2203.05960 [pdf, other]

Policy Architectures for Compositional Generalization in Control

Authors: Allan Zhou, Vikash Kumar, Chelsea Finn, Aravind Rajeswaran

Abstract: Many tasks in control, robotics, and planning can be specified using desired goal configurations for various entities in the environment. Learning goal-conditioned policies is a natural paradigm to solve such tasks. However, current approaches struggle to learn and generalize as task complexity increases, such as variations in number of environment entities or compositions of goals. In this work, we introduce a framework for modeling entity-based compositional structure in tasks, and create suitable policy designs that can leverage this structure. Our policies, which utilize architectures like Deep Sets and Self Attention, are flexible and can be trained end-to-end without requiring any action primitives. When trained using standard reinforcement and imitation learning methods on a suite of simulated robot manipulation tasks, we find that these architectures achieve significantly higher success rates with less data. We also find these architectures enable broader and compositional generalization, producing policies that extrapolate to different numbers of entities than seen in training, and stitch together (i.e. compose) learned skills in novel ways. Videos of the results can be found at https://sites.google.com/view/comp-gen-rl.

Submitted 10 March, 2022; originally announced March 2022.

arXiv:2203.03580 [pdf, other]

The Unsurprising Effectiveness of Pre-Trained Vision Models for Control

Authors: Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, Abhinav Gupta

Abstract: Recent years have seen the emergence of pre-trained representations as a powerful abstraction for AI applications in computer vision, natural language, and speech. However, policy learning for control is still dominated by a tabula-rasa learning paradigm, with visuo-motor policies often trained from scratch using data from deployment environments. In this context, we revisit and study the role of pre-trained visual representations for control, and in particular representations trained on large-scale computer vision datasets. Through extensive empirical evaluation in diverse control domains (Habitat, DeepMind Control, Adroit, Franka Kitchen), we isolate and study the importance of different representation training methods, data augmentations, and feature hierarchies. Overall, we find that pre-trained visual representations can be competitive with or even better than ground-truth state representations to train control policies. This is in spite of using only out-of-domain data from standard vision datasets, without any in-domain data from the deployment environments. Source code and more at https://sites.google.com/view/pvr-control.

Submitted 8 August, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: First two authors contributed equally

Journal ref: International Conference on Machine Learning (ICML), 2022, 162:17359-17371

arXiv:2202.00161 [pdf, other]

CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery

Authors: Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, Pieter Abbeel

Abstract: We introduce Contrastive Intrinsic Control (CIC), an algorithm for unsupervised skill discovery that maximizes the mutual information between state-transitions and latent skill vectors. CIC utilizes contrastive learning between state-transitions and skills to learn behavior embeddings and maximizes the entropy of these embeddings as an intrinsic reward to encourage behavioral diversity. We evaluate our algorithm on the Unsupervised Reinforcement Learning Benchmark, which consists of a long reward-free pre-training phase followed by a short adaptation phase to downstream tasks with extrinsic rewards. CIC substantially improves over prior methods in terms of adaptation efficiency, outperforming prior unsupervised skill discovery methods by 1.79x and the next leading overall exploration algorithm by 1.18x.

Submitted 29 March, 2022; v1 submitted 31 January, 2022; originally announced February 2022.

Comments: Project website: https://sites.google.com/view/cicrl/

arXiv:2107.08829 [pdf, other]

Visual Adversarial Imitation Learning using Variational Models

Authors: Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, Chelsea Finn

Abstract: Reward function specification, which requires considerable human effort and iteration, remains a major impediment for learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors often presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges including representation learning for visual observations, sample complexity due to high dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. Towards addressing these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results including videos can be found online at https://sites.google.com/view/variational-mail.

Submitted 27 June, 2022; v1 submitted 15 July, 2021; originally announced July 2021.

arXiv:2106.09119 [pdf, other]

Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL

Authors: Catherine Cang, Aravind Rajeswaran, Pieter Abbeel, Michael Laskin

Abstract: Offline Reinforcement Learning (RL) aims to extract near-optimal policies from imperfect offline data without additional environment interactions. Extracting policies from diverse offline datasets has the potential to expand the range of applicability of RL by making the training process safer, faster, and more streamlined. We investigate how to improve the performance of offline RL algorithms, their robustness to the quality of offline data, and their generalization capabilities. To this end, we introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE). Our algorithm is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary. When combined, they substantially improve the performance and generalization of offline RL policies. In the widely studied D4RL offline RL benchmark, we find that MABE achieves higher average performance compared to prior model-free and model-based algorithms. In experiments that require cross-domain generalization, we find that MABE outperforms prior methods. Our website is available at https://sites.google.com/berkeley.edu/mabe.

Submitted 18 June, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

arXiv:2106.01345 [pdf, other]

Decision Transformer: Reinforcement Learning via Sequence Modeling

Authors: Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch

Abstract: We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.

Submitted 24 June, 2021; v1 submitted 2 June, 2021; originally announced June 2021.

Comments: First two authors contributed equally. Last two authors advised equally
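The conditioning described in the Decision Transformer abstract (desired return, past states, and actions fed to an autoregressive model) can be illustrated with a minimal sketch. The abstract does not give implementation details, so the helper names `returns_to_go` and `interleave_tokens` and the `(kind, value)` token layout are assumptions made here for illustration, not the authors' code.

```python
import numpy as np

def returns_to_go(rewards):
    """Suffix sums of rewards: rtg[t] = r[t] + r[t+1] + ... + r[T-1].

    Hypothetical helper: the 'desired return' at step t is taken to be
    the sum of rewards still to be collected from t onward.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # Reverse, accumulate, reverse back to get suffix sums.
    return np.cumsum(rewards[::-1])[::-1]

def interleave_tokens(rtg, states, actions):
    """Build a flat (return, state, action) sequence in time order.

    Each element is a (kind, value) pair; a real model would embed
    these before passing them to a causally masked Transformer.
    """
    tokens = []
    for t in range(len(rtg)):
        tokens.append(("R", rtg[t]))
        tokens.append(("s", states[t]))
        tokens.append(("a", actions[t]))
    return tokens

rtg = returns_to_go([1.0, 0.0, 2.0])
tokens = interleave_tokens(rtg, states=["s0", "s1", "s2"], actions=["a0", "a1", "a2"])
```

At evaluation time, a model trained on such sequences would be prompted with a target return in place of the first `R` entry, and the remaining target would be reduced by each reward actually received.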

arXiv:2102.08363 [pdf, other]

COMBO: Conservative Offline Model-Based Policy Optimization

Authors: Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, Chelsea Finn

Abstract: Model-based algorithms, which learn a dynamics model from logged experience and perform some sort of pessimistic planning under the learned model, have emerged as a promising paradigm for offline reinforcement learning (offline RL). However, practical variants of such model-based algorithms rely on explicit uncertainty quantification for incorporating pessimism. Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We overcome this limitation by developing a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action tuples generated via rollouts under the learned model. This results in a conservative estimate of the value function for out-of-support state-action tuples, without requiring explicit uncertainty estimation. We theoretically show that our method optimizes a lower bound on the true policy value, that this bound is tighter than that of prior methods, and that our approach satisfies a policy improvement guarantee in the offline setting. Through experiments, we find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods on widely studied offline RL benchmarks, including image-based tasks.

Submitted 26 January, 2022; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: NeurIPS 2021

arXiv:2101.01857 [pdf, other]

Reinforcement Learning with Latent Flow

Authors: Wenling Shang, Xiaofei Wang, Aravind Srinivas, Aravind Rajeswaran, Yang Gao, Pieter Abbeel, Michael Laskin

Abstract: Temporal information is essential to learning effective policies with Reinforcement Learning (RL). However, current state-of-the-art RL algorithms either assume that such information is given as part of the state space or, when learning from pixels, use the simple heuristic of frame-stacking to implicitly capture temporal information present in the image observations. This heuristic is in contrast to the current paradigm in video classification architectures, which utilize explicit encodings of temporal information through methods such as optical flow and two-stream architectures to achieve state-of-the-art performance. Inspired by leading video classification architectures, we introduce the Flow of Latents for Reinforcement Learning (Flare), a network architecture for RL that explicitly encodes temporal information through latent vector differences. We show that Flare (i) recovers optimal performance in state-based RL without explicit access to the state velocity, solely with positional state information, (ii) achieves state-of-the-art performance on challenging pixel-based continuous control tasks within the DeepMind control benchmark suite, namely quadruped walk, hopper hop, finger turn hard, pendulum swing, and walker run, (iii) is the most sample efficient model-free pixel-based RL algorithm, outperforming the prior model-free state-of-the-art by 1.9X and 1.5X on the 500k and 1M step benchmarks, respectively, and (iv) when augmented over Rainbow DQN, outperforms this state-of-the-art level baseline on 5 of 8 challenging Atari games at the 100M time step benchmark.

Submitted 5 January, 2021; originally announced January 2021.
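The phrase "latent vector differences" in the Flare abstract can be made concrete with a small sketch. It assumes per-frame latent vectors are already available (in practice they would come from an image encoder) and simply concatenates them with their consecutive differences; the function name `flare_features` and this particular fusion step are illustrative assumptions, not the published architecture.

```python
import numpy as np

def flare_features(latents):
    """Concatenate per-frame latents with their temporal differences.

    latents: array of shape (T, D), one D-dimensional latent per frame.
    Returns a flat vector of length T*D + (T-1)*D.
    Hypothetical helper: stands in for the fusion step that would feed
    a policy or value network.
    """
    latents = np.asarray(latents, dtype=np.float64)
    # Difference between consecutive latents acts as an explicit
    # velocity-like signal in latent space.
    diffs = latents[1:] - latents[:-1]
    return np.concatenate([latents.reshape(-1), diffs.reshape(-1)])

# Three frames with 2-dimensional latents.
features = flare_features([[0.0, 0.0], [1.0, 2.0], [3.0, 5.0]])
```

Compared with frame-stacking, the temporal signal is computed explicitly rather than left for the downstream network to infer from stacked inputs.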

arXiv:2012.11547 [pdf, other]

Offline Reinforcement Learning from Images with Latent Space Models

Authors: Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, Chelsea Finn

Abstract: Offline reinforcement learning (RL) refers to the problem of learning policies from a static dataset of environment interactions. Offline RL enables extensive use and re-use of historical datasets, while also alleviating safety concerns associated with online exploration, thereby expanding the real-world applicability of RL. Most prior work in offline RL has focused on tasks with compact state representations. However, the ability to learn directly from rich observation spaces like images is critical for real-world applications such as robotics. In this work, we build on recent advances in model-based algorithms for offline RL, and extend them to high-dimensional visual observation spaces. Model-based offline RL algorithms have achieved state-of-the-art results in state-based tasks and have strong theoretical guarantees. However, they rely crucially on the ability to quantify uncertainty in the model predictions, which is particularly challenging with image observations. To overcome this challenge, we propose to learn a latent-state dynamics model, and represent the uncertainty in the latent space. Our approach is both tractable in practice and corresponds to maximizing a lower bound of the ELBO in the unknown POMDP. In experiments on a range of challenging image-based locomotion and manipulation tasks, we find that our algorithm significantly outperforms previous offline model-free RL methods as well as state-of-the-art online visual model-based RL methods. Moreover, we also find that our approach excels on an image-based drawer closing task on a real robot using a pre-existing dataset. All results including videos can be found online at https://sites.google.com/view/lompo/.

Submitted 21 December, 2020; originally announced December 2020.

arXiv:2005.05951 [pdf, other]

MOReL : Model-Based Offline Reinforcement Learning

Authors: Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, Thorsten Joachims

Abstract: In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline can greatly expand the applicability of RL, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; and (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and overcome common pitfalls of model-based RL like model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g. generative modeling, uncertainty estimation, planning etc.) to directly translate into advances for offline RL.

Submitted 1 March, 2021; v1 submitted 12 May, 2020; originally announced May 2020.

Comments: First two authors contributed equally. Published at NeurIPS 2020. After publication at NeurIPS 2020, (1) D4RL benchmark results have been added; (2) hyper-parameter ablation studies have been added; (3) scope of Lemma 3 has been extended

arXiv:2004.07804 [pdf, other]

A Game Theoretic Framework for Model Based Reinforcement Learning

Authors: Aravind Rajeswaran, Igor Mordatch, Vikash Kumar

Abstract: Model-based reinforcement learning (MBRL) has recently gained immense interest due to its potential for sample efficiency and ability to incorporate off-policy data. However, designing stable and efficient MBRL algorithms using rich function approximators has remained challenging. To help expose the practical challenges in MBRL and simplify algorithm design from the lens of abstraction, we develop a new framework that casts MBRL as a game between: (1) a policy player, which attempts to maximize rewards under the learned model; (2) a model player, which attempts to fit the real-world data collected by the policy player. For algorithm development, we construct a Stackelberg game between the two players, and show that it can be solved with approximate bi-level optimization. This gives rise to two natural families of algorithms for MBRL based on which player is chosen as the leader in the Stackelberg game. Together, they encapsulate, unify, and generalize many previous MBRL algorithms. Furthermore, our framework is consistent with and provides a clear basis for heuristics known to be important in practice from prior works. Finally, through experiments we validate that our proposed algorithms are highly sample efficient, match the asymptotic performance of model-free policy gradient, and scale gracefully to high-dimensional tasks like dexterous hand manipulation. Additional details and code can be obtained from the project page at https://sites.google.com/view/mbrl-game

Submitted 11 March, 2021; v1 submitted 16 April, 2020; originally announced April 2020.

Comments: ICML 2020. This version contains expanded discussion, hyperparameter configurations, and ablation studies

arXiv:2001.07343 [pdf, other]

Lyceum: An efficient and scalable ecosystem for robot learning

Authors: Colin Summers, Kendall Lowrey, Aravind Rajeswaran, Siddhartha Srinivasa, Emanuel Todorov

Abstract: We introduce Lyceum, a high-performance computational ecosystem for robot learning. Lyceum is built on top of the Julia programming language and the MuJoCo physics simulator, combining the ease-of-use of a high-level programming language with the performance of native C. In addition, Lyceum has a straightforward API to support parallel computation across multiple cores and machines. Overall, depending on the complexity of the environment, Lyceum is 5-30x faster compared to other popular abstractions like OpenAI's Gym and DeepMind's dm-control. This substantially reduces training time for various reinforcement learning algorithms; and is also fast enough to support real-time model predictive control through MuJoCo. The code, tutorials, and demonstration videos can be found at: www.lyceum.ml.

Submitted 21 January, 2020; originally announced January 2020.

arXiv:1909.04630 [pdf, other]

Meta-Learning with Implicit Gradients

Authors: Aravind Rajeswaran, Chelsea Finn, Sham Kakade, Sergey Levine

Abstract: A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner-loop, by using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner level optimization and not the path taken by the inner loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner loop optimizer. As a result, our approach is agnostic to the choice of inner loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint that is, up to small constant factors, no more than that which is required to compute a single inner loop gradient and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.

Submitted 10 September, 2019; originally announced September 2019.

Comments: NeurIPS 2019. First two authors contributed equally

arXiv:1902.08438 [pdf, other]

Online Meta-Learning

Authors: Chelsea Finn, Aravind Rajeswaran, Sham Kakade, Sergey Levine

Abstract: A central capability of intelligent systems is the ability to continuously build upon previous experiences to speed up and enhance learning of new tasks. Two distinct research paradigms have studied this question. Meta-learning views this problem as learning a prior over model parameters that is amenable for fast adaptation on a new task, but typically assumes the set of tasks are available together as a batch. In contrast, online (regret based) learning considers a sequential setting in which problems are revealed one after the other, but conventionally trains only a single model without any task-specific adaptation. This work introduces an online meta-learning setting, which merges ideas from both the aforementioned paradigms to better capture the spirit and practice of continual lifelong learning. We propose the follow the meta leader algorithm which extends the MAML algorithm to this setting. Theoretically, this work provides an $\mathcal{O}(\log T)$ regret guarantee with only one additional higher order smoothness assumption in comparison to the standard online setting. Our experimental evaluation on three different large-scale tasks suggests that the proposed algorithm significantly outperforms alternatives based on traditional online learning approaches.

Submitted 3 July, 2019; v1 submitted 22 February, 2019; originally announced February 2019.

Comments: ICML 2019. The first two authors contributed equally. Expanded Appendix

arXiv:1811.01848 [pdf, other]

Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control

Authors: Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, Igor Mordatch

Abstract: We propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world. Our work builds on the synergistic relationship between local model-based control, global value function learning, and exploration. We study how local trajectory optimization can cope with approximation errors in the value function, and can stabilize and accelerate value function learning. Conversely, we also study how approximate value functions can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we also demonstrate how trajectory optimization can be used to perform temporally coordinated exploration in conjunction with estimating uncertainty in value function approximation. This exploration is critical for fast and stable learning of the value function. Combining these components enables solutions to complex simulated control tasks, like humanoid locomotion and dexterous in-hand manipulation, in the equivalent of a few minutes of experience in the real world.

Submitted 28 January, 2019; v1 submitted 5 November, 2018; originally announced November 2018.

Comments: The first two authors contributed equally. Accepted at ICLR 2019. Supplementary videos available at: https://sites.google.com/view/polo-mpc

arXiv:1810.06045 [pdf, other]

Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost

Authors: Henry Zhu, Abhishek Gupta, Aravind Rajeswaran, Sergey Levine, Vikash Kumar

Abstract: Dexterous multi-fingered robotic hands can perform a wide range of manipulation skills, making them an appealing component for general-purpose robotic manipulators. However, such hands pose a major challenge for autonomous control, due to the high dimensionality of their configuration space and complex intermittent contact interactions. In this work, we propose deep reinforcement learning (deep RL) as a scalable solution for learning complex, contact rich behaviors with multi-fingered hands. Deep RL provides an end-to-end approach to directly map sensor readings to actions, without the need for task specific models or policy classes. We show that contact-rich manipulation behavior with multi-fingered hands can be learned by directly training with model-free deep RL algorithms in the real world, with minimal additional assumptions and without the aid of simulation. We learn a variety of complex behaviors on two different low-cost hardware platforms. We show that each task can be learned entirely from scratch, and further study how the learning process can be accelerated by using a small number of human demonstrations to bootstrap learning. Our experiments demonstrate that complex multi-fingered manipulation skills can be learned in the real world in about 4-7 hours for most tasks, and that demonstrations can decrease this to 2-3 hours, indicating that direct deep RL training in the real world is a viable and practical alternative to simulation and model-based control. https://sites.google.com/view/deeprl-handmanipulation

Submitted 14 October, 2018; originally announced October 2018.

Comments: https://sites.google.com/view/deeprl-handmanipulation

arXiv:1803.10371 [pdf, other]

Reinforcement learning for non-prehensile manipulation: Transfer from simulation to physical system

Authors: Kendall Lowrey, Svetoslav Kolev, Jeremy Dao, Aravind Rajeswaran, Emanuel Todorov

Abstract: Reinforcement learning has emerged as a promising methodology for training robot controllers. However, most results have been limited to simulation due to the need for a large number of samples and the lack of automated-yet-safe data collection methods. Model-based reinforcement learning methods provide an avenue to circumvent these challenges, but the traditional concern has been the mismatch between the simulator and the real world. Here, we show that control policies learned in simulation can successfully transfer to a physical system, composed of three Phantom robots pushing an object to various desired target positions. We use a modified form of the natural policy gradient algorithm for learning, applied to a carefully identified simulation model. The resulting policies, trained entirely in simulation, work well on the physical system without additional training. In addition, we show that training with an ensemble of models makes the learned policies more robust to modeling errors, thus compensating for difficulties in system identification.

Submitted 27 March, 2018; originally announced March 2018.

Comments: Accepted at IEEE SIMPAR 2018. Project page: https://sites.google.com/view/phantomsim2real

arXiv:1803.07246 [pdf, other]

Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines

Authors: Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, Pieter Abbeel

Abstract: Policy gradient methods have enjoyed great success in deep reinforcement learning but suffer from high variance of gradient estimates. The high variance problem is particularly exacerbated in problems with long horizons or high-dimensional action spaces. To mitigate this issue, we derive a bias-free action-dependent baseline for variance reduction which fully exploits the structural form of the stochastic policy itself and does not make any additional assumptions about the MDP. We demonstrate and quantify the benefit of the action-dependent baseline through both theoretical analysis and numerical results, including an analysis of the suboptimality of the optimal state-dependent baseline. The result is a computationally efficient policy gradient algorithm, which scales to high-dimensional control problems, as demonstrated by a synthetic 2000-dimensional target matching task. Our experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks and high-dimensional hand manipulation and synthetic tasks. Finally, we show that the general idea of including additional information in baselines for improved variance reduction can be extended to partially observed and multi-agent tasks.

Submitted 19 March, 2018; originally announced March 2018.

Comments: Accepted to ICLR 2018, Oral (2%)

arXiv:1711.09874 [pdf, other]

Divide-and-Conquer Reinforcement Learning

Authors: Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, Sergey Levine

Abstract: Standard model-free deep reinforcement learning (RL) algorithms sample a new initial state for each trial, allowing them to optimize policies that can perform well even in highly stochastic environments. However, problems that exhibit considerable initial state variation typically produce high-variance gradient estimates for model-free RL, making direct policy or value function optimization challenging. In this paper, we develop a novel algorithm that instead partitions the initial state space into "slices", and optimizes an ensemble of policies, each on a different slice. The ensemble is gradually unified into a single policy that can succeed on the whole state space. This approach, which we term divide-and-conquer RL, is able to solve complex tasks where conventional deep RL methods are ineffective. Our results show that divide-and-conquer RL greatly outperforms conventional policy gradient methods on challenging grasping, manipulation, and locomotion tasks, and exceeds the performance of a variety of prior methods. Videos of policies learned by our algorithm can be viewed at http://bit.ly/dnc-rl

Submitted 27 April, 2018; v1 submitted 27 November, 2017; originally announced November 2017.

Comments: Presented at ICLR 2018. Videos and supporting materials are located at http://bit.ly/dnc-rl

arXiv:1709.10087 [pdf, other]

Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

Authors: Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, Sergey Levine

Abstract: Dexterous multi-fingered hands are extremely versatile and provide a generic way to perform a multitude of tasks in human-centric environments. However, effectively controlling them remains challenging due to their high dimensionality and large number of potential contacts. Deep reinforcement learning (DRL) provides a model-agnostic approach to control complex dynamical systems, but has not been shown to scale to high-dimensional dexterous manipulation. Furthermore, deployment of DRL on physical systems remains challenging due to sample inefficiency. Consequently, the success of DRL in robotics has thus far been limited to simpler manipulators and tasks. In this work, we show that model-free DRL can effectively scale up to complex manipulation tasks with a high-dimensional 24-DoF hand, and solve them from scratch in simulated experiments. Furthermore, with the use of a small number of human demonstrations, the sample complexity can be significantly reduced, which enables learning with sample sizes equivalent to a few hours of robot experience. The use of demonstrations results in policies that exhibit very natural movements and, surprisingly, are also substantially more robust.

Submitted 26 June, 2018; v1 submitted 28 September, 2017; originally announced September 2017.

Comments: Accepted for presentation at Robotics: Science and Systems (RSS) 2018. Project page: https://sites.google.com/view/deeprl-dexterous-manipulation

arXiv:1703.02660 [pdf, other]

Towards Generalization and Simplicity in Continuous Control

Authors: Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, Sham Kakade

Abstract: This work shows that policies with simple linear and RBF parameterizations can be trained to solve a variety of continuous control tasks, including the OpenAI gym benchmarks. The performance of these trained policies is competitive with state-of-the-art results obtained with more elaborate parameterizations such as fully connected neural networks. Furthermore, existing training and testing scenarios are shown to be very limited and prone to over-fitting, thus giving rise to only trajectory-centric policies. Training with a diverse initial state distribution is shown to produce more global policies with better generalization. This allows for interactive control scenarios where the system recovers from large on-line perturbations, as shown in the supplementary video.

Submitted 20 March, 2018; v1 submitted 7 March, 2017; originally announced March 2017.

Comments: NIPS 2017, Project page: https://sites.google.com/view/simple-pol

arXiv:1610.01283 [pdf, other]

EPOpt: Learning Robust Neural Network Policies Using Model Ensembles

Authors: Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, Sergey Levine

Abstract: Sample complexity and safety are major challenges when learning policies with reinforcement learning for real-world tasks, especially when the policies are represented using rich function approximators like deep neural networks. Model-based methods where the real-world target domain is approximated using a simulated source domain provide an avenue to tackle the above challenges by augmenting real data with simulated data. However, discrepancies between the simulated source domain and the target domain pose a challenge for simulated training. We introduce the EPOpt algorithm, which uses an ensemble of simulated source domains and a form of adversarial training to learn policies that are robust and generalize to a broad range of possible target domains, including unmodeled effects. Further, the probability distribution over source domains in the ensemble can be adapted using data from the target domain and approximate Bayesian methods, to progressively make it a better approximation. Thus, learning on a model ensemble, along with source domain adaptation, provides the benefit of both robustness and learning/adaptation.

Submitted 3 March, 2017; v1 submitted 5 October, 2016; originally announced October 2016.

Comments: Accepted for publication at the International Conference on Learning Representations (ICLR) 2017. Supplementary video: https://youtu.be/w1YJ9vwaoto

arXiv:1609.02678 [pdf, ps, other]

Identifying Topology of Power Distribution Networks Based on Smart Meter Data

Authors: Jayadev P Satya, Nirav Bhatt, Ramkrishna Pasumarthy, Aravind Rajeswaran

Abstract: In a power distribution network, the network topology information is essential for efficient operation of the network. This network connectivity information is not accurately available at the low voltage level, due to uninformed changes that happen from time to time. In this paper, we propose a novel data-driven approach to identify the underlying network topology, including the load phase connectivity, from time series of energy measurements. The proposed method involves the application of Principal Component Analysis (PCA) and its graph-theoretic interpretation to infer the topology from smart meter energy measurements. The method is demonstrated through simulation on randomly generated networks and also on the IEEE-recognized Roy Billinton distribution test system.

Submitted 9 September, 2016; originally announced September 2016.

Comments: Submitted to IEEE Transactions on Smart Grid

arXiv:1606.01754 [pdf, other]

A Graph Partitioning Algorithm for Leak Detection in Water Distribution Networks

Authors: Aravind Rajeswaran, Sridharakumar Narasimhan, Shankar Narasimhan

Abstract: Leak detection in urban water distribution networks (WDNs) is challenging given their scale, complexity, and limited instrumentation. We present an algorithm for leak detection in WDNs, which involves making additional flow measurements on-demand, and repeated use of water balance. Graph partitioning is used to determine the location of flow measurements, with the objective to minimize the measurement cost. We follow a multi-stage divide and conquer approach. In every stage, a section of the WDN identified to contain the leak is partitioned into two or more sub-networks, and water balance is used to trace the leak to one of these sub-networks. This process is recursively continued until the desired resolution is achieved. We investigate different methods for solving the arising graph partitioning problem like integer linear programming (ILP) and spectral bisection. The proposed methods are tested on large scale benchmark networks, and our results indicate that on average, less than 3% of the pipes need to be measured for finding the leak in large networks.

Submitted 3 June, 2016; originally announced June 2016.

arXiv:1511.06063 [pdf, ps, other]

A Novel Approach for Phase Identification in Smart Grids Using Graph Theory and Principal Component Analysis

Authors: P Satya Jayadev, Aravind Rajeswaran, Nirav P Bhatt, Ramkrishna Pasumarthy

Abstract: Consumers with low demand, like households, are generally supplied single-phase power by connecting their service mains to one of the phases of a distribution transformer. The distribution companies face the problem of keeping a record of consumer connectivity to a phase due to uninformed changes that happen. The exact phase connectivity information is important for the efficient operation and control of the distribution system. We propose a new data-driven approach to the problem based on Principal Component Analysis (PCA) and its graph-theoretic interpretations, using energy measurements in equally timed short intervals, generated from smart meters. We propose an algorithm for inferring phase connectivity from noisy measurements. The algorithm is demonstrated using simulated data for phase connectivities in distribution networks.

Submitted 7 June, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

Comments: Accepted for presentation at ACC 2016

arXiv:1506.00438 [pdf, other]

Network Topology Identification using PCA and its Graph Theoretic Interpretations

Authors: Aravind Rajeswaran, Shankar Narasimhan

Abstract: We solve the problem of identifying (reconstructing) network topology from steady state network measurements. Concretely, given only a data matrix $\mathbf{X}$ where the $X_{ij}$ entry corresponds to flow in edge $i$ in configuration (steady-state) $j$, we wish to find a network structure for which flow conservation is obeyed at all the nodes. This models many network problems involving conserved quantities like water, power, and metabolic networks. We show that identification is equivalent to learning a model $\mathbf{A_n}$ which captures the approximate linear relationships between the different variables comprising $\mathbf{X}$ (i.e. of the form $\mathbf{A_n X \approx 0}$) such that $\mathbf{A_n}$ is full rank (highest possible) and consistent with a network node-edge incidence structure. The problem is solved through a sequence of steps like estimating approximate linear relationships using Principal Component Analysis, obtaining f-cut-sets from these approximate relationships, and graph realization from f-cut-sets (or equivalently f-circuits). Each step and the overall process is polynomial time. The method is illustrated by identifying topology of a water distribution network. We also study the extent of identifiability from steady-state data.

Submitted 21 January, 2016; v1 submitted 1 June, 2015; originally announced June 2015.

Comments: Structure of paper is changed to improve presentation. Methods and results are unchanged. A more detailed literature survey has been added

arXiv:cs/0508095 [pdf, ps, other]

Capacity of Ultra Wide Band Wireless Ad Hoc Networks

Authors: Rohit Negi, Arjunan Rajeswaran

Abstract: Throughput capacity is a critical parameter for the design and evaluation of ad-hoc wireless networks. Consider $n$ identical randomly located nodes, on a unit area, forming an ad-hoc wireless network. Assuming a fixed per node transmission capability of $T$ bits per second at a fixed range, it has been shown that the uniform throughput capacity per node $r(n)$ is $\Theta\left(T/\sqrt{n \log n}\right)$, a decreasing function of node density $n$. However an alternate communication model may also be considered, with each node constrained to a maximum transmit power $P_0$ and capable of utilizing $W$ Hz of bandwidth. Under the limiting case $W \rightarrow \infty$, such as in Ultra Wide Band (UWB) networks, the uniform throughput per node is $O\left((n \log n)^{(\alpha-1)/2}\right)$ (upper bound) and $\Omega\left(n^{(\alpha-1)/2}/(\log n)^{(\alpha+1)/2}\right)$ (achievable lower bound). These bounds demonstrate that throughput increases with node density $n$, in contrast to previously published results. This is the result of the large bandwidth, and the assumed power and rate adaptation, which alleviate interference. Thus, the effect of physical layer properties on the capacity of ad hoc wireless networks is demonstrated. Further, the promise of UWB as a physical layer technology for ad-hoc networks is justified.

Submitted 22 August, 2005; originally announced August 2005.

Comments: 12 pages, 2 figures

Showing 1–40 of 40 results for author: Rajeswaran, A