Search | arXiv e-print repository

Discovering Multiple Solutions from a Single Task in Offline Reinforcement Learning

Abstract: Recent studies on online reinforcement learning (RL) have demonstrated the advantages of learning multiple behaviors from a single task, as in the case of few-shot adaptation to a new environment. Although this approach is expected to yield similar benefits in offline RL, appropriate methods for learning multiple solutions have not been fully investigated in previous studies. In this study, we the… ▽ More Recent studies on online reinforcement learning (RL) have demonstrated the advantages of learning multiple behaviors from a single task, as in the case of few-shot adaptation to a new environment. Although this approach is expected to yield similar benefits in offline RL, appropriate methods for learning multiple solutions have not been fully investigated in previous studies. In this study, we therefore addressed the problem of finding multiple solutions from a single task in offline RL. We propose algorithms that can learn multiple solutions in offline RL, and empirically investigate their performance. Our experimental results show that the proposed algorithm learns multiple qualitatively and quantitatively distinctive solutions in offline RL. △ Less

Submitted 9 June, 2024; originally announced June 2024.

Comments: ICML 2024, 21 pages

arXiv:2406.04896 [pdf, other]

Stabilizing Extreme Q-learning by Maclaurin Expansion

Authors: Motoki Omura, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada

Abstract: In Extreme Q-learning (XQL), Gumbel Regression is performed with an assumed Gumbel distribution for the error distribution. This allows learning of the value function without sampling out-of-distribution actions and has shown excellent performance mainly in Offline RL. However, issues remained, including the exponential term in the loss function causing instability and the potential for an error d… ▽ More In Extreme Q-learning (XQL), Gumbel Regression is performed with an assumed Gumbel distribution for the error distribution. This allows learning of the value function without sampling out-of-distribution actions and has shown excellent performance mainly in Offline RL. However, issues remained, including the exponential term in the loss function causing instability and the potential for an error distribution diverging from the Gumbel distribution. Therefore, we propose Maclaurin Expanded Extreme Q-learning to enhance stability. In this method, applying Maclaurin expansion to the loss function in XQL enhances stability against large errors. It also allows adjusting the error distribution assumption from normal to Gumbel based on the expansion order. Our method significantly stabilizes learning in Online RL tasks from DM Control, where XQL was previously unstable. Additionally, it improves performance in several Offline RL tasks from D4RL, where XQL already showed excellent results. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted at RLC 2024: The first Reinforcement Learning Conference

arXiv:2405.14114 [pdf, other]

Offline Reinforcement Learning from Datasets with Structured Non-Stationarity

Authors: Johannes Ackermann, Takayuki Osa, Masashi Sugiyama

Abstract: Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy. Offline RL aims to solve this issue by using transitions collected by a different behavior policy. We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within ea… ▽ More Current Reinforcement Learning (RL) is often limited by the large amount of data needed to learn a successful policy. Offline RL aims to solve this issue by using transitions collected by a different behavior policy. We address a novel Offline RL problem setting in which, while collecting the dataset, the transition and reward functions gradually change between episodes but stay constant within each episode. We propose a method based on Contrastive Predictive Coding that identifies this non-stationarity in the offline dataset, accounts for it when training a policy, and predicts it during evaluation. We analyze our proposed method and show that it performs well in simple continuous control tasks and challenging, high-dimensional locomotion tasks. We show that our method often achieves the oracle performance and performs better than baselines. △ Less

Submitted 27 May, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

Comments: Accepted for Reinforcement Learning Conference (RLC) 2024

arXiv:2403.07704 [pdf, other]

doi 10.1609/aaai.v38i13.29362

Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning

Authors: Motoki Omura, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada

Abstract: In deep reinforcement learning, estimating the value function to evaluate the quality of states and actions is essential. The value function is often trained using the least squares method, which implicitly assumes a Gaussian error distribution. However, a recent study suggested that the error distribution for training the value function is often skewed because of the properties of the Bellman ope… ▽ More In deep reinforcement learning, estimating the value function to evaluate the quality of states and actions is essential. The value function is often trained using the least squares method, which implicitly assumes a Gaussian error distribution. However, a recent study suggested that the error distribution for training the value function is often skewed because of the properties of the Bellman operator, and violates the implicit assumption of normal error distribution in the least squares method. To address this, we proposed a method called Symmetric Q-learning, in which the synthetic noise generated from a zero-mean distribution is added to the target values to generate a Gaussian error distribution. We evaluated the proposed method on continuous control benchmark tasks in MuJoCo. It improved the sample efficiency of a state-of-the-art reinforcement learning method by reducing the skewness of the error distribution. △ Less

Submitted 12 March, 2024; originally announced March 2024.

Comments: Accepted at AAAI 2024: The 38th Annual AAAI Conference on Artificial Intelligence (Main Tech Track)

arXiv:2403.00344 [pdf, other]

Robustifying a Policy in Multi-Agent RL with Diverse Cooperative Behaviors and Adversarial Style Sampling for Assistive Tasks

Authors: Takayuki Osa, Tatsuya Harada

Abstract: Autonomous assistance of people with motor impairments is one of the most promising applications of autonomous robotic systems. Recent studies have reported encouraging results using deep reinforcement learning (RL) in the healthcare domain. Previous studies showed that assistive tasks can be formulated as multi-agent RL, wherein there are two agents: a caregiver and a care-receiver. However, poli… ▽ More Autonomous assistance of people with motor impairments is one of the most promising applications of autonomous robotic systems. Recent studies have reported encouraging results using deep reinforcement learning (RL) in the healthcare domain. Previous studies showed that assistive tasks can be formulated as multi-agent RL, wherein there are two agents: a caregiver and a care-receiver. However, policies trained in multi-agent RL are often sensitive to the policies of other agents. In such a case, a trained caregiver's policy may not work for different care-receivers. To alleviate this issue, we propose a framework that learns a robust caregiver's policy by training it for diverse care-receiver responses. In our framework, diverse care-receiver responses are autonomously learned through trials and errors. In addition, to robustify the care-giver's policy, we propose a strategy for sampling a care-receiver's response in an adversarial manner during the training. We evaluated the proposed method using tasks in an Assistive Gym. We demonstrate that policies trained with a popular deep RL method are vulnerable to changes in policies of other agents and that the proposed framework improves the robustness against such changes. △ Less

Submitted 1 April, 2024; v1 submitted 1 March, 2024; originally announced March 2024.

Comments: 7 pages, accepted for ICRA 2024

arXiv:2310.08864 [pdf, other]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, A**kya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie , et al. (267 additional authors not shown)

Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method… ▽ More Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io. △ Less

Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Project website: https://robotics-transformer-x.github.io

arXiv:2107.05842 [pdf, other]

Motion Planning by Learning the Solution Manifold in Trajectory Optimization

Authors: Takayuki Osa

Abstract: The objective function used in trajectory optimization is often non-convex and can have an infinite set of local optima. In such cases, there are diverse solutions to perform a given task. Although there are a few methods to find multiple solutions for motion planning, they are limited to generating a finite set of solutions. To address this issue, we presents an optimization method that learns an… ▽ More The objective function used in trajectory optimization is often non-convex and can have an infinite set of local optima. In such cases, there are diverse solutions to perform a given task. Although there are a few methods to find multiple solutions for motion planning, they are limited to generating a finite set of solutions. To address this issue, we presents an optimization method that learns an infinite set of solutions in trajectory optimization. In our framework, diverse solutions are obtained by learning latent representations of solutions. Our approach can be interpreted as training a deep generative model of collision-free trajectories for motion planning. The experimental results indicate that the trained model represents an infinite set of homotopic solutions for motion planning problems. △ Less

Submitted 13 July, 2021; originally announced July 2021.

Comments: 24 pages, to appear in the International Journal of Robotics Research

arXiv:2103.07084 [pdf, other]

Discovering Diverse Solutions in Deep Reinforcement Learning by Maximizing State-Action-Based Mutual Information

Authors: Takayuki Osa, Voot Tangkaratt, Masashi Sugiyama

Abstract: Reinforcement learning algorithms are typically limited to learning a single solution for a specified task, even though diverse solutions often exist. Recent studies showed that learning a set of diverse solutions is beneficial because diversity enables robust few-shot adaptation. Although existing methods learn diverse solutions by using the mutual information as unsupervised rewards, such an app… ▽ More Reinforcement learning algorithms are typically limited to learning a single solution for a specified task, even though diverse solutions often exist. Recent studies showed that learning a set of diverse solutions is beneficial because diversity enables robust few-shot adaptation. Although existing methods learn diverse solutions by using the mutual information as unsupervised rewards, such an approach often suffers from the bias of the gradient estimator induced by value function approximation. In this study, we propose a novel method that can learn diverse solutions without suffering the bias problem. In our method, a policy conditioned on a continuous or discrete latent variable is trained by directly maximizing the variational lower bound of the mutual information, instead of using the mutual information as unsupervised rewards as in previous studies. Through extensive experiments on robot locomotion tasks, we demonstrate that the proposed method successfully learns an infinite set of diverse solutions by learning continuous latent variables, which is more challenging than learning a finite number of solutions. Subsequently, we show that our method enables more effective few-shot adaptation compared with existing methods. △ Less

Submitted 12 April, 2022; v1 submitted 11 March, 2021; originally announced March 2021.

Comments: 35 pages

arXiv:2007.12397 [pdf, other]

Learning the Solution Manifold in Optimization and Its Application in Motion Planning

Authors: Takayuki Osa

Abstract: Optimization is an essential component for solving problems in wide-ranging fields. Ideally, the objective function should be designed such that the solution is unique and the optimization problem can be solved stably. However, the objective function used in a practical application is usually non-convex, and sometimes it even has an infinite set of solutions. To address this issue, we propose to l… ▽ More Optimization is an essential component for solving problems in wide-ranging fields. Ideally, the objective function should be designed such that the solution is unique and the optimization problem can be solved stably. However, the objective function used in a practical application is usually non-convex, and sometimes it even has an infinite set of solutions. To address this issue, we propose to learn the solution manifold in optimization. We train a model conditioned on the latent variable such that the model represents an infinite set of solutions. In our framework, we reduce this problem to density estimation by using importance sampling, and the latent representation of the solutions is learned by maximizing the variational lower bound. We apply the proposed algorithm to motion-planning problems, which involve the optimization of high-dimensional parameters. The experimental results indicate that the solution manifold can be learned with the proposed algorithm, and the trained model represents an infinite set of homotopic solutions for motion-planning problems. △ Less

Submitted 24 July, 2020; originally announced July 2020.

arXiv:2006.02608 [pdf, ps, other]

Meta-Model-Based Meta-Policy Optimization

Authors: Takuya Hiraoka, Takahisa Imagawa, Voot Tangkaratt, Takayuki Osa, Takashi Onishi, Yoshimasa Tsuruoka

Abstract: Model-based meta-reinforcement learning (RL) methods have recently been shown to be a promising approach to improving the sample efficiency of RL in multi-task settings. However, the theoretical understanding of those methods is yet to be established, and there is currently no theoretical guarantee of their performance in a real-world environment. In this paper, we analyze the performance guarante… ▽ More Model-based meta-reinforcement learning (RL) methods have recently been shown to be a promising approach to improving the sample efficiency of RL in multi-task settings. However, the theoretical understanding of those methods is yet to be established, and there is currently no theoretical guarantee of their performance in a real-world environment. In this paper, we analyze the performance guarantee of model-based meta-RL methods by extending the theorems proposed by Janner et al. (2019). On the basis of our theoretical results, we propose Meta-Model-Based Meta-Policy Optimization (M3PO), a model-based meta-RL method with a performance guarantee. We demonstrate that M3PO outperforms existing meta-RL methods in continuous-control benchmarks. △ Less

Submitted 11 October, 2021; v1 submitted 3 June, 2020; originally announced June 2020.

Comments: ACML 2021. Video demo: https://drive.google.com/file/d/1DRA-pmIWnHGNv5G_gFrml8YzKCtMcGnu/view?usp=sharing URL Source code: https://github.com/TakuyaHiraoka/Meta-Model-Based-Meta-Policy-Optimization

arXiv:2003.07054 [pdf, other]

doi 10.1177/0278364920918296

Multimodal Trajectory Optimization for Motion Planning

Authors: Takayuki Osa

Abstract: Existing motion planning methods often have two drawbacks: 1) goal configurations need to be specified by a user, and 2) only a single solution is generated under a given condition. In practice, multiple possible goal configurations exist to achieve a task. Although the choice of the goal configuration significantly affects the quality of the resulting trajectory, it is not trivial for a user to s… ▽ More Existing motion planning methods often have two drawbacks: 1) goal configurations need to be specified by a user, and 2) only a single solution is generated under a given condition. In practice, multiple possible goal configurations exist to achieve a task. Although the choice of the goal configuration significantly affects the quality of the resulting trajectory, it is not trivial for a user to specify the optimal goal configuration. In addition, the objective function used in the trajectory optimization is often non-convex, and it can have multiple solutions that achieve comparable costs. In this study, we propose a framework that determines multiple trajectories that correspond to the different modes of the cost function. We reduce the problem of identifying the modes of the cost function to that of estimating the density induced by a distribution based on the cost function. The proposed framework enables users to select a preferable solution from multiple candidate trajectories, thereby making it easier to tune the cost function and obtain a satisfactory solution. We evaluated our proposed method with motion planning tasks in 2D and 3D space. Our experiments show that the proposed algorithm is capable of determining multiple solutions for those tasks. △ Less

Submitted 23 September, 2020; v1 submitted 16 March, 2020; originally announced March 2020.

Comments: 17 pages, International Journal of Robotics Research

arXiv:1912.04063 [pdf, other]

doi 10.1007/s42979-020-00324-7

Goal-Conditioned Variational Autoencoder Trajectory Primitives with Continuous and Discrete Latent Codes

Authors: Takayuki Osa, Shuhei Ikemoto

Abstract: Imitation learning is an intuitive approach for teaching motion to robotic systems. Although previous studies have proposed various methods to model demonstrated movement primitives, one of the limitations of existing methods is that the shape of the trajectories are encoded in high dimensional space. The high dimensionality of the trajectory representation can be a bottleneck in the subsequent pr… ▽ More Imitation learning is an intuitive approach for teaching motion to robotic systems. Although previous studies have proposed various methods to model demonstrated movement primitives, one of the limitations of existing methods is that the shape of the trajectories are encoded in high dimensional space. The high dimensionality of the trajectory representation can be a bottleneck in the subsequent process such as planning a sequence of primitive motions. We address this problem by learning the latent space of the robot trajectory. If the latent variable of the trajectories can be learned, it can be used to tune the trajectory in an intuitive manner even when the user is not an expert. We propose a framework for modeling demonstrated trajectories with a neural network that learns the low-dimensional latent space. Our neural network structure is built on the variational autoencoder (VAE) with discrete and continuous latent variables. We extend the structure of the existing VAE to obtain the decoder that is conditioned on the goal position of the trajectory for generalization to different goal positions. Although the inference performed by VAE is not accurate, the positioning error at the generalized goal position can be reduced to less than 1~mm by incorporating the projection onto the solution space. To cope with requirement of the massive training data, we use a trajectory augmentation technique inspired by the data augmentation commonly used in the computer vision community. In the proposed framework, the latent variables that encodes the multiple types of trajectories are learned in an unsupervised manner, although existing methods usually require label information to model diverse behaviors. The learned decoder can be used as a motion planner in which the user can specify the goal position and the trajectory types by setting the latent variables. △ Less

Submitted 23 September, 2020; v1 submitted 9 December, 2019; originally announced December 2019.

Comments: 8 pages, SN Computer Science

arXiv:1910.01465 [pdf, other]

Reducing Overestimation Bias in Multi-Agent Domains Using Double Centralized Critics

Authors: Johannes Ackermann, Volker Gabler, Takayuki Osa, Masashi Sugiyama

Abstract: Many real world tasks require multiple agents to work together. Multi-agent reinforcement learning (RL) methods have been proposed in recent years to solve these tasks, but current methods often fail to efficiently learn policies. We thus investigate the presence of a common weakness in single-agent RL, namely value function overestimation bias, in the multi-agent setting. Based on our findings, w… ▽ More Many real world tasks require multiple agents to work together. Multi-agent reinforcement learning (RL) methods have been proposed in recent years to solve these tasks, but current methods often fail to efficiently learn policies. We thus investigate the presence of a common weakness in single-agent RL, namely value function overestimation bias, in the multi-agent setting. Based on our findings, we propose an approach that reduces this bias by using double centralized critics. We evaluate it on six mixed cooperative-competitive tasks, showing a significant advantage over current methods. Finally, we investigate the application of multi-agent methods to high-dimensional robotic tasks and show that our approach can be used to learn decentralized policies in this domain. △ Less

Submitted 2 December, 2019; v1 submitted 3 October, 2019; originally announced October 2019.

Comments: Accepted for the Deep RL Workshop at NeurIPS 2019; Changes for v2: Changed Figures 3,4, due to an error in the implementation of MATD3. Please refer to this version for fair evaluation

arXiv:1901.01365 [pdf, other]

Hierarchical Reinforcement Learning via Advantage-Weighted Information Maximization

Authors: Takayuki Osa, Voot Tangkaratt, Masashi Sugiyama

Abstract: Real-world tasks are often highly structured. Hierarchical reinforcement learning (HRL) has attracted research interest as an approach for leveraging the hierarchical structure of a given task in reinforcement learning (RL). However, identifying the hierarchical policy structure that enhances the performance of RL is not a trivial task. In this paper, we propose an HRL method that learns a latent… ▽ More Real-world tasks are often highly structured. Hierarchical reinforcement learning (HRL) has attracted research interest as an approach for leveraging the hierarchical structure of a given task in reinforcement learning (RL). However, identifying the hierarchical policy structure that enhances the performance of RL is not a trivial task. In this paper, we propose an HRL method that learns a latent variable of a hierarchical policy using mutual information maximization. Our approach can be interpreted as a way to learn a discrete and latent representation of the state-action space. To learn option policies that correspond to modes of the advantage function, we introduce advantage-weighted importance sampling. In our HRL method, the gating policy learns to select option policies based on an option-value function, and these option policies are optimized based on the deterministic policy gradient method. This framework is derived by leveraging the analogy between a monolithic policy in standard RL and a hierarchical policy in HRL by using a deterministic option policy. Experimental results indicate that our HRL approach can learn a diversity of options and that it can enhance the performance of RL in continuous control tasks. △ Less

Submitted 7 March, 2019; v1 submitted 4 January, 2019; originally announced January 2019.

Comments: 16 pages, ICLR 2019

arXiv:1811.06711 [pdf]

doi 10.1561/2300000053

An Algorithmic Perspective on Imitation Learning

Authors: Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, Jan Peters

Abstract: As robots and other intelligent agents move from simple environments and problems to more complex, unstructured settings, manually programming their behavior has become increasingly challenging and expensive. Often, it is easier for a teacher to demonstrate a desired behavior rather than attempt to manually engineer it. This process of learning from demonstrations, and the study of algorithms to d… ▽ More As robots and other intelligent agents move from simple environments and problems to more complex, unstructured settings, manually programming their behavior has become increasingly challenging and expensive. Often, it is easier for a teacher to demonstrate a desired behavior rather than attempt to manually engineer it. This process of learning from demonstrations, and the study of algorithms to do so, is called imitation learning. This work provides an introduction to imitation learning. It covers the underlying assumptions, approaches, and how they relate; the rich set of algorithms developed to tackle the problem; and advice on effective tools and implementation. We intend this paper to serve two audiences. First, we want to familiarize machine learning experts with the challenges of imitation learning, particularly those arising in robotics, and the interesting theoretical and practical distinctions between it and more familiar frameworks like statistical supervised learning theory and reinforcement learning. Second, we want to give roboticists and experts in applied artificial intelligence a broader appreciation for the frameworks and tools available for imitation learning. △ Less

Submitted 16 November, 2018; originally announced November 2018.

Comments: 187 pages. Published in Foundations and Trends in Robotics

arXiv:1711.10173 [pdf, other]

Hierarchical Policy Search via Return-Weighted Density Estimation

Authors: Takayuki Osa, Masashi Sugiyama

Abstract: Learning an optimal policy from a multi-modal reward function is a challenging problem in reinforcement learning (RL). Hierarchical RL (HRL) tackles this problem by learning a hierarchical policy, where multiple option policies are in charge of different strategies corresponding to modes of a reward function and a gating policy selects the best option for a given context. Although HRL has been dem… ▽ More Learning an optimal policy from a multi-modal reward function is a challenging problem in reinforcement learning (RL). Hierarchical RL (HRL) tackles this problem by learning a hierarchical policy, where multiple option policies are in charge of different strategies corresponding to modes of a reward function and a gating policy selects the best option for a given context. Although HRL has been demonstrated to be promising, current state-of-the-art methods cannot still perform well in complex real-world problems due to the difficulty of identifying modes of the reward function. In this paper, we propose a novel method called hierarchical policy search via return-weighted density estimation (HPSDE), which can efficiently identify the modes through density estimation with return-weighted importance sampling. Our proposed method finds option policies corresponding to the modes of the return function and automatically determines the number and the location of option policies, which significantly reduces the burden of hyper-parameters tuning. Through experiments, we demonstrate that the proposed HPSDE successfully learns option policies corresponding to modes of the return function and that it can be successfully applied to a challenging motion planning problem of a redundant robotic manipulator. △ Less

Submitted 30 November, 2017; v1 submitted 28 November, 2017; originally announced November 2017.

Comments: The 32nd AAAI Conference on Artificial Intelligence (AAAI 2018), 9 pages

Showing 1–16 of 16 results for author: Osa, T