Tackling Long-Horizon Tasks
with Model-based Offline Reinforcement Learning
Abstract
Model-based offline reinforcement learning (RL) is a compelling approach that addresses the challenge of learning from limited, static data by generating imaginary trajectories using learned models. However, it falls short in solving long-horizon tasks due to high bias in value estimation from model rollouts. In this paper, we introduce a novel model-based offline RL method, Lower Expectile Q-learning (LEQ), which enhances long-horizon task performance by mitigating the high bias in model-based value estimation via expectile regression of $\lambda$-returns. Our empirical results show that LEQ significantly outperforms previous model-based offline RL methods on long-horizon tasks, such as the D4RL AntMaze tasks, matching or surpassing the performance of model-free approaches. Our experiments demonstrate that expectile regression, $\lambda$-returns, and critic training on offline data are all crucial for addressing long-horizon tasks. Additionally, LEQ achieves performance comparable to the state-of-the-art model-based and model-free offline RL methods on the NeoRL benchmark and the D4RL MuJoCo Gym tasks.
1 Introduction
One of the major challenges in offline reinforcement learning (RL) is the overestimation of values for out-of-distribution actions due to the lack of environment interactions [21, 19]. Model-based offline RL addresses this issue by generating additional (imaginary) training data using a learned model, thereby augmenting the given offline data with synthetic experiences that cover out-of-distribution states and actions [34, 16, 35, 2, 30]. While these approaches have demonstrated strong performance in simple, short-horizon tasks, they struggle with noisy model predictions and value estimations, particularly in long-horizon tasks [23]. This challenge is evident in their poor performances (i.e. near zero) on the D4RL AntMaze tasks [6, 15].
Typical model-based offline RL methods alleviate the inaccurate value estimation problem (mostly overestimation) by penalizing Q-values estimated from model rollouts with uncertainties in model predictions [34, 16] or value predictions [30, 14]. While these penalization terms prevent a policy from exploiting erroneous value estimations, the policy now does not maximize the true value, but maximizes the value penalized by heuristically estimated uncertainties, which can lead to sub-optimal behaviors. This is especially problematic in long-horizon, sparse-reward tasks, where Q-values are similar across nearby states [23].
Another way to reduce bias in value estimates is by using multi-step returns [31, 12]. CBOP [14] constructs an explicit distribution of multi-step Q-values from thousands of model rollouts and uses this value as a target for training the Q-function. However, CBOP is computationally expensive for estimating a target value and uses multi-step returns solely for Q-learning, which provides insufficient learning signals for obtaining long-horizon behaviors.
To tackle long-horizon tasks with model-based offline RL, we introduce a simple yet effective model-based offline RL algorithm, Lower Expectile Q-learning (LEQ). As illustrated in Figure 1, LEQ uses expectile regression with a small $\tau$ for both policy and Q-function training, providing an efficient and elegant way to achieve conservative Q-value estimates. Moreover, to better handle long-horizon tasks, we propose to optimize a policy and Q-function using $\lambda$-returns (i.e. TD($\lambda$) targets) of long ($H$-step) model rollouts, allowing the policy to directly learn from low-bias multi-step returns [28].
The experiments on the D4RL AntMaze and MuJoCo Gym tasks [6], as well as the NeoRL benchmark [26], demonstrate that our proposed conservative policy optimization with $\lambda$-return and critic training on offline data significantly improves offline RL policies in long-horizon tasks while achieving comparable performance in short-horizon, dense-reward tasks. Specifically, to the best of our knowledge, LEQ is the first model-based offline RL algorithm capable of matching or outperforming the performance of model-free offline RL algorithms on the long-horizon AntMaze tasks [6, 15].
2 Related Work
Offline RL [21] aims to solve a reinforcement learning problem using only pre-collected datasets, and to outperform behavioral cloning policies [25]. One can simply apply off-policy RL algorithms on top of the fixed dataset. However, off-policy RL methods suffer from the overestimation of Q-values for actions unseen in the offline dataset [8, 18, 19], since an overestimated value function cannot be corrected through online environment interactions in offline RL.
Model-free offline RL algorithms have addressed this value overestimation problem on out-of-distribution actions by (1) regularizing a policy to only output actions in the offline data [24, 17, 7] or (2) adopting a conservative value estimation for executing actions different from the dataset [19, 1]. Despite their strong performances on the standard offline RL benchmarks, model-free offline RL policies tend to be constrained to the support of the data (i.e. state-action pairs in the offline dataset), which may lead to limited generalization capability.
Model-based offline RL approaches have tried to overcome this limitation by suggesting a better use of the limited offline data – learning a world model and generating imaginary data with the learned model that covers out-of-distribution actions. Similar to Dyna-style online model-based RL [32, 9, 10, 11], an offline model-based RL policy can be trained on both offline data and model rollouts. But, again, learned models may be inaccurate on states and actions outside the data support, making a policy easily exploit the learned models.
Recent model-based offline RL algorithms have adopted the conservatism idea from model-free offline RL, penalizing policies incurring (1) uncertain transition dynamics [34, 16, 35] or (2) uncertain value estimation [30, 14]. This conservative use of model-generated data enables model-based offline RL to outperform model-free offline RL in widely used offline RL benchmarks [30]. However, uncertainty estimation is difficult and often inaccurate [35]. Instead of relying on such heuristic [34, 16, 30] or expensive [14] uncertainty estimation, we propose to learn a conservative value function via expectile regression with a small $\tau$, which is simple, efficient, yet effective.
3 Preliminaries
Problem setup.
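We formulate our problem as a Markov Decision Process (MDP) defined as a tuple, $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, p, \rho, \gamma)$ [33]. $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively. $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ denotes the reward function. $p: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$ denotes the transition dynamics, where $\Delta(\mathcal{X})$ denotes the set of probability distributions over $\mathcal{X}$. $\rho \in \Delta(\mathcal{S})$ denotes the initial state distribution and $\gamma$ is a discounting factor. The goal of reinforcement learning (RL) is to find a policy, $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$, that maximizes the expected return, $\mathbb{E}_{\mathcal{T} \sim p(\cdot \mid \pi, s_0)}\big[\sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)\big]$, where $\mathcal{T}$ is a sequence of transitions with a finite horizon $T$, $\mathcal{T} = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T)$, following $\pi(a_t \mid s_t)$ and $p(s_{t+1} \mid s_t, a_t)$ and starting from $s_0 \sim \rho(\cdot)$.
In this paper, we consider the offline RL setup [21], where a policy is trained with a fixed given offline dataset, $\mathcal{D}_{env} = \{(s, a, r, s')\}$, without any additional online interactions.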
Model-based offline RL.
As an offline RL policy is trained from a fixed dataset, one of the major challenges in offline RL is the limited data support and the resulting lack of generalization to out-of-distribution states and actions. Model-based offline RL [16, 34, 35, 27, 30, 14] tackles this problem by augmenting the training data with imaginary training data (i.e. model rollouts) generated from the learned transition dynamics and reward model, $p_\psi(s', r \mid s, a)$.
The typical process of model-based offline RL is as follows: (1) pretrain a model (or an ensemble of models) and an initial policy from the offline data, (2) generate short imaginary rollouts using the pretrained model and add them to the training dataset $\mathcal{D}_{model}$, (3) perform an offline RL algorithm on the augmented dataset $\mathcal{D}_{env} \cup \mathcal{D}_{model}$, and repeat (2) and (3).
Expectile regression.
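Expectile is a generalization of the expectation of a distribution $X$. While the expectation of $X$, $\mathbb{E}[X]$, can be viewed as a minimizer of the least-square objective, $\mathbb{E}_{x \sim X}[(y - x)^2]$, the $\tau$-expectile of $X$, $\mathbb{E}^\tau[X]$, can be defined as a minimizer of the asymmetric least-square objective:
$$\mathbb{E}^\tau[X] = \arg\min_{y}\ \mathbb{E}_{x \sim X}\big[\,|\tau - \mathbb{1}(y > x)| \cdot (y - x)^2\,\big], \tag{1}$$
where $|\tau - \mathbb{1}(y > x)|$ is an asymmetric weighting of the squared error: $1 - \tau$ for samples below the estimate ($x < y$) and $\tau$ for samples at or above it ($x \ge y$).
We refer to a $\tau$-expectile with $\tau < 0.5$ as a lower expectile of $X$. When $\tau < 0.5$, the objective assigns a high weight $1 - \tau$ for smaller $x$ and a low weight $\tau$ for bigger $x$. Thus, minimizing the objective with $\tau < 0.5$ leads to a conservative statistical estimate compared to the expectation.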
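The definition above can be checked numerically. The following is a minimal sketch rather than part of the method: it assumes the samples are given as a plain list of floats, and the function name `expectile` and the fixed-point iteration are illustrative choices. Setting the derivative of the asymmetric least-square objective to zero shows that the minimizer is a weighted mean of the samples, with weights that depend on the minimizer itself, so the sketch iterates until the estimate stops moving.

```python
def expectile(samples, tau, iters=100):
    """Return the tau-expectile of `samples` by fixed-point iteration.

    The minimizer y of sum_i |tau - 1(y > x_i)| * (y - x_i)^2 satisfies
    y = sum_i w_i * x_i / sum_i w_i, with w_i = 1 - tau if x_i < y else tau.
    """
    y = sum(samples) / len(samples)  # start from the mean (the 0.5-expectile)
    for _ in range(iters):
        # Samples below the current estimate get weight 1 - tau, others tau.
        weights = [(1.0 - tau) if x < y else tau for x in samples]
        y_new = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
        if abs(y_new - y) < 1e-12:
            break
        y = y_new
    return y


if __name__ == "__main__":
    data = [0.0, 1.0, 2.0, 10.0]
    print(expectile(data, 0.5))  # the mean
    print(expectile(data, 0.1))  # a lower expectile, below the mean
```

With a small `tau`, the single large sample has little influence, which is the conservative behavior the text relies on.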
4 Approach
The primary limitation for model-based offline RL in solving long-horizon tasks is inherent errors in a world model and critic outside the offline data support. Conservative value estimation can effectively handle such (falsely optimistic) errors. Prior approaches estimate conservative values through diverse uncertainty penalties; but they are either unreliable [35] or computationally expensive [14].
In this paper, we introduce Lower Expectile Q-learning (LEQ), an efficient model-based offline RL method that achieves conservative value estimation via expectile regression of Q-values with lower expectiles when learning from model-generated data (Section 4.1). Additionally, we address the noisy value estimation problem in long-horizon tasks [23] using $\lambda$-returns on $H$-step imaginary rollouts (Section 4.2). Finally, we train a deterministic policy conservatively by maximizing the lower expectile of $\lambda$-returns (Section 4.3). The overview of LEQ is described in Algorithm 1.
4.1 Lower expectile Q-learning
Most offline RL algorithms primarily focus on learning a conservative value function for out-of-distribution actions. In this paper, we propose Lower Expectile Q-learning (LEQ), which learns a conservative Q-function via expectile regression with small $\tau$, avoiding unreliable uncertainty estimation and exhaustive Q-value estimation.
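As illustrated in Figure 1, the target value for $Q_\phi(s, a)$, where $a = \pi_\theta(s)$, can be estimated by rolling out an ensemble of world models $\{p_{\psi_1}, \dots, p_{\psi_M}\}$ and averaging $r + \gamma Q_\phi(s', \pi_\theta(s'))$ over all possible $s'$:
$$\hat{y}_{model} = \mathbb{E}_{\psi \sim \{\psi_1, \dots, \psi_M\}}\, \mathbb{E}_{(s', r) \sim p_\psi(\cdot \mid s, a)}\big[r + \gamma Q_\phi(s', \pi_\theta(s'))\big]. \tag{2}$$
This target value has three error sources: the predicted future state $s'$ and reward $r$, and the future Q-value $Q_\phi(s', \pi_\theta(s'))$. Thus, the target value computed from model-generated data, $\hat{y}_{model}$, is more prone to overestimation than the original target Q-value, $\hat{y}_{env}$, computed from $(s, a, r, s') \in \mathcal{D}_{env}$:
$$\hat{y}_{env} = r + \gamma Q_\phi(s', \pi_\theta(s')). \tag{3}$$
To mitigate the overestimation problem in estimating the true Q-value from $H$-step inaccurate world model rollouts, we propose to use expectile regression on target Q-value estimation with small $\tau$; as illustrated in Figure 1, expectile regression with small $\tau$ tends to choose a target Q-value that is lower than the expectation, effectively providing a conservative estimate of the target Q-value. Another advantage of using expectile regression is that we do not have to exhaustively evaluate Q-values to get $\tau$-expectiles as in Jeong et al. [14]; instead, we can do conservative estimation using sampling:
$$\mathcal{L}_{Q,model}(\phi) = \mathbb{E}_{s \in \mathcal{D}_{model},\ \psi \sim \{\psi_1, \dots, \psi_M\},\ (s', r) \sim p_\psi(\cdot \mid s, a)}\Big[\,\big|\tau - \mathbb{1}\big(Q_\phi(s, a) > y\big)\big| \cdot \big(Q_\phi(s, a) - y\big)^2\Big], \quad y = r + \gamma Q_\phi(s', \pi_\theta(s')). \tag{4}$$
In addition, the Q-function is also trained on the offline data with the standard Bellman update:
$$\mathcal{L}_{Q,env}(\phi) = \mathbb{E}_{(s, a, r, s') \in \mathcal{D}_{env}}\Big[\tfrac{1}{2}\big(Q_\phi(s, a) - \hat{y}_{env}\big)^2\Big]. \tag{5}$$
To stabilize training of the Q-function, we adopt EMA regularization [11], which prevents drastic change of Q-values by regularizing the difference between the Q-predictions and the ones from the exponential moving average:
$$\mathcal{L}_{Q,EMA}(\phi) = \mathbb{E}_{(s, a) \in \mathcal{D}_{env}}\Big[\big(Q_\phi(s, a) - Q_{\bar{\phi}}(s, a)\big)^2\Big], \tag{6}$$
where $\bar{\phi}$ is an exponential moving average of $\phi$. Note that by using EMA regularization, we do not use the target Q-network for Equations 2 and 3.
Finally, by combining the three losses above, we define the critic loss as follows:
$$\mathcal{L}_{Q}(\phi) = \beta\, \mathcal{L}_{Q,model}(\phi) + (1 - \beta)\, \mathcal{L}_{Q,env}(\phi) + \omega_{EMA}\, \mathcal{L}_{Q,EMA}(\phi), \tag{7}$$
where $\beta$ balances the losses from imaginary rollouts and from dataset transitions, and $\omega_{EMA}$ is the coefficient of the EMA regularization.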
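To make the combination of the three critic terms concrete, here is a small sketch on plain Python floats. It is an illustration under stated assumptions, not the authors' implementation: the names `beta` (mixing weight between the model and dataset terms), `w_ema` (EMA regularization coefficient), and `decay`, as well as the factor 0.5 in the dataset term, are assumptions made for this example.

```python
def asymmetric_sq_loss(pred, target, tau):
    """|tau - 1(pred > target)| * (pred - target)^2 for scalar inputs."""
    weight = (1.0 - tau) if pred > target else tau
    return weight * (pred - target) ** 2


def critic_loss(q_model, y_model, q_env, y_env, q_env_ema, tau, beta, w_ema):
    """Combine the model, dataset, and EMA terms over small batches of floats.

    q_model / y_model: Q-predictions and sampled targets on model rollouts.
    q_env / y_env:     Q-predictions and Bellman targets on dataset transitions.
    q_env_ema:         predictions of the EMA copy on the same dataset inputs.
    beta, w_ema:       hypothetical weighting coefficients (assumptions).
    """
    def mean(xs):
        return sum(xs) / len(xs)

    # Lower-expectile regression on model-generated targets.
    l_model = mean([asymmetric_sq_loss(q, y, tau) for q, y in zip(q_model, y_model)])
    # Standard squared Bellman error on offline data.
    l_env = mean([0.5 * (q - y) ** 2 for q, y in zip(q_env, y_env)])
    # Penalize drift from the exponential-moving-average critic.
    l_ema = mean([(q - q_bar) ** 2 for q, q_bar in zip(q_env, q_env_ema)])
    return beta * l_model + (1.0 - beta) * l_env + w_ema * l_ema


def ema_update(ema_params, params, decay):
    """Move the EMA parameters a small step toward the current parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

With `tau < 0.5`, an over-prediction on model data (prediction above the sampled target) is penalized more heavily than an under-prediction of the same size, which is what pushes the critic toward a lower expectile.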
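4.2 Lower expectile Q-learning with $\lambda$-return
To further improve LEQ for long-horizon tasks, we use $\lambda$-return instead of $1$-step return for Q-learning. $\lambda$-return allows a Q-function and policy to learn from low-bias multi-step returns [28]. Reducing bias in value estimation with $\lambda$-return is especially important on long-horizon tasks where values for nearby states are similar to each other, as illustrated in Figure 2.
We first define the $\lambda$-return of a trajectory $\mathcal{T}$ at timestep $t$, $Q^\lambda_t(\mathcal{T})$, using the $n$-step return, $G_{t:t+n}(\mathcal{T})$:
$$G_{t:t+n}(\mathcal{T}) = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n Q_\phi(s_{t+n}, a_{t+n}), \tag{8}$$
$$Q^\lambda_t(\mathcal{T}) = \frac{1 - \lambda}{1 - \lambda^{H-t}} \sum_{n=1}^{H-t} \lambda^{n-1}\, G_{t:t+n}(\mathcal{T}). \tag{9}$$
Our $\lambda$-return is slightly different from [31, 11], which put a high weight on the last $n$-step return, $G_{t:H}(\mathcal{T})$.
Then, we can rewrite the Q-learning loss in Equation 4 with $1$-step return to the one with $\lambda$-return:
$$\mathcal{L}^\lambda_{Q,model}(\phi) = \mathbb{E}_{s_0 \in \mathcal{D}_{model},\ \mathcal{T} \sim p_\psi, \pi_\theta}\Big[\sum_{t=0}^{H-1} \big|\tau - \mathbb{1}\big(Q_\phi(s_t, a_t) > Q^\lambda_t(\mathcal{T})\big)\big| \cdot \big(Q_\phi(s_t, a_t) - Q^\lambda_t(\mathcal{T})\big)^2\Big]. \tag{10}$$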
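The $n$-step and $\lambda$-returns described in this subsection can be sketched in a few lines of Python. This is one plausible reading, not a definitive implementation: it assumes the geometric weights over the available $n$-step returns are normalized to sum to one (so that no extra mass is placed on the last return), that `rewards` holds the rewards of an imagined rollout, and that `q_values` holds the critic's value at every visited state including the final one. All function and argument names are hypothetical.

```python
def n_step_return(rewards, q_values, t, n, gamma):
    """n discounted rewards starting at step t, then bootstrap with Q at t + n."""
    g = sum(gamma ** i * rewards[t + i] for i in range(n))
    return g + gamma ** n * q_values[t + n]


def lambda_return(rewards, q_values, t, gamma, lam):
    """Normalized geometric mix of the 1..(H - t)-step returns from step t.

    `rewards` has length H (one per imagined transition) and `q_values`
    has length H + 1 (one per visited state). The normalization of the
    weights is an assumption of this sketch.
    """
    horizon = len(rewards)
    max_n = horizon - t
    weights = [lam ** (n - 1) for n in range(1, max_n + 1)]
    total = sum(weights)
    return sum(
        (w / total) * n_step_return(rewards, q_values, t, n, gamma)
        for n, w in zip(range(1, max_n + 1), weights)
    )


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    q_values = [0.0, 0.0, 0.0, 0.0]
    # lam = 0 recovers the 1-step return; larger lam leans on longer returns.
    print(lambda_return(rewards, q_values, 0, gamma=1.0, lam=0.0))
    print(lambda_return(rewards, q_values, 0, gamma=1.0, lam=0.5))
```

Setting `lam` close to 1 weights long returns more heavily, which reduces reliance on the bootstrapped critic at the cost of relying more on the model's predicted rewards.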
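4.3 Lower expectile policy learning with $\lambda$-return
For policy optimization, we can use a deterministic policy $\pi_\theta$ and update the policy using the deterministic policy gradients similar to DDPG [22]. (LEQ also works with a stochastic policy, but a deterministic policy is easier to train in an offline setup.) Instead of maximizing the immediate Q-value, $Q_\phi(s, a)$, we propose to directly maximize the lower expectile of $\lambda$-return, which is a more accurate learning target for a policy, analogous to the conservative critic target in Section 4.2:
$$\mathcal{L}^\lambda_{\pi}(\theta) = -\mathbb{E}_{s_0 \in \mathcal{D}_{model}}\Big[\sum_{t=0}^{H-1} \mathbb{E}^\tau_{\mathcal{T} \sim p_\psi, \pi_\theta}\big[Q^\lambda_t(\mathcal{T})\big]\Big]. \tag{11}$$
However, due to the expectile term in Equation 11, computing the gradient of $\mathcal{L}^\lambda_{\pi}(\theta)$ is not trivial. To estimate this gradient, we propose a differentiable surrogate loss, approximating $\mathbb{E}^\tau_{\mathcal{T}}[Q^\lambda_t(\mathcal{T})]$ in Equation 11 with $Q_\phi(s_t, a_t)$:
$$\hat{\mathcal{L}}^\lambda_{\pi}(\theta) = -\mathbb{E}_{s_0 \in \mathcal{D}_{model},\ \mathcal{T} \sim p_\psi, \pi_\theta}\Big[\sum_{t=0}^{H-1} \big|\tau - \mathbb{1}\big(Q_\phi(s_t, a_t) > Q^\lambda_t(\mathcal{T})\big)\big| \cdot Q^\lambda_t(\mathcal{T})\Big]. \tag{12}$$
Intuitively, this surrogate loss sets a higher weight ($1 - \tau$) on a conservative $\lambda$-return estimation (i.e. $Q_\phi(s_t, a_t) > Q^\lambda_t(\mathcal{T})$), encouraging a policy to optimize for this conservative $\lambda$-return. On the other hand, an optimistic $\lambda$-return estimation (i.e. $Q_\phi(s_t, a_t) < Q^\lambda_t(\mathcal{T})$) has less impact on the policy with a smaller weight ($\tau$). Thus, optimizing this surrogate loss leads to the policy maximizing the lower expectile of $\lambda$-return. We provide a proof in Appendix B showing that the proposed surrogate loss is a better approximation of Equation 11 than directly maximizing Q-values.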
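The weighting described above can be illustrated with a short sketch on plain floats. It only computes the value of a surrogate of this form; in an actual training setup the weights would be treated as constants and gradients would flow through the $\lambda$-returns, which this example does not model. The function name, the averaging over the batch, and the argument names are assumptions made for illustration.

```python
def surrogate_policy_loss(q_preds, lam_returns, tau):
    """Negative weighted lambda-return, averaged over the given samples.

    q_preds:     critic predictions at the visited state-action pairs.
    lam_returns: lambda-returns computed from the same imagined rollouts.
    A sample whose lambda-return falls below the critic's prediction is
    treated as conservative and gets the larger weight 1 - tau (tau < 0.5).
    """
    total = 0.0
    for q, g in zip(q_preds, lam_returns):
        weight = (1.0 - tau) if q > g else tau  # treated as a constant
        total += weight * g
    return -total / len(q_preds)


if __name__ == "__main__":
    q_preds = [1.0, 1.0]
    lam_returns = [0.5, 2.0]  # one pessimistic and one optimistic rollout
    print(surrogate_policy_loss(q_preds, lam_returns, tau=0.1))
    print(surrogate_policy_loss(q_preds, lam_returns, tau=0.5))
```

With `tau=0.1`, the optimistic rollout contributes far less to the objective than with `tau=0.5`, so a policy trained on it gains little from exploiting an over-optimistic model prediction.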
4.4 Expanding dataset with model rollouts
One problem of offline RL is that the data distribution is limited to the offline dataset $\mathcal{D}_{env}$. To tackle this problem, we can simulate the current policy inside the model and use the generated trajectories to expand the dataset, similar to prior works [35, 30]. However, the state coverage will be identical when the policy converges, which might lead to catastrophic forgetting.
To prevent this issue, we expand the dataset using the model rollouts from a noisy exploration policy. Specifically, we execute an exploration policy, which simply adds noise to the current policy $\pi_\theta$, and generate a trajectory of length ( in this paper). We refer to this expanded dataset as $\mathcal{D}_{model}$. Note that we do not use off-policy actions and rewards for critic and policy updates; instead, we generate $H$-step model rollouts starting from the states in $\mathcal{D}_{model}$ and use them for training the policy and Q-function. Thus, we need to store only the states from the rollouts.
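The expansion step can be sketched as follows. This is a toy illustration with scalar states and actions, not the authors' code: `policy` and `model_step` stand in for the learned policy and world model, Gaussian noise is assumed for the exploration policy, and all names and values are hypothetical. Only the visited states are kept, matching the remark that actions and rewards from these rollouts are not reused.

```python
import random


def expand_dataset(start_states, policy, model_step, length, noise_std, rng):
    """Roll out a noisy version of `policy` in the model; keep states only.

    start_states: states to start rollouts from (e.g. sampled from data).
    policy:       callable state -> action (the current policy, a stub here).
    model_step:   callable (state, action) -> next state (learned model stub).
    length:       number of model steps per rollout (hypothetical value).
    noise_std:    std of the Gaussian exploration noise (an assumption).
    """
    visited = []
    for state in start_states:
        for _ in range(length):
            action = policy(state) + rng.gauss(0.0, noise_std)
            state = model_step(state, action)
            visited.append(state)  # actions and rewards are not stored
    return visited


if __name__ == "__main__":
    rng = random.Random(0)
    states = expand_dataset(
        start_states=[2.0, -1.0],
        policy=lambda s: -0.5 * s,        # toy policy that moves toward 0
        model_step=lambda s, a: s + a,    # toy dynamics
        length=3,
        noise_std=0.1,
        rng=rng,
    )
    print(len(states))  # 2 start states x 3 steps
```

Training rollouts for the critic and policy would then start from these stored states using the current policy, rather than replaying the noisy actions.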
5 Experiments
In this paper, we propose a novel model-based offline RL method with simple and efficient yet accurate conservative value estimation. Through our experiments, we aim to answer the following questions: (1) Can LEQ solve long-horizon tasks? (2) How does LEQ perform in widely used offline RL benchmarks? (3) Which component enables model-based offline RL to learn the AntMaze tasks?
5.1 Tasks
To show the strength of LEQ in solving long-horizon tasks, we use the AntMaze tasks, which aim to navigate an 8-DOF ant robot to the desired goal position, as shown in Figure 4. Specifically, we use the umaze, medium, and large datasets from D4RL [6], and the ultra dataset from Jiang et al. [15]. Moreover, we evaluate our method on MuJoCo locomotion tasks (Figure 4) with dense rewards, using the D4RL [6] and NeoRL [26] datasets. Please refer to Appendix A for more experimental details.
5.2 Compared offline RL algorithms
We compare the performance of LEQ with the state-of-the-art offline RL algorithms. Please note that LEQ uses the same hyperparameters across all tasks, except the expectile parameter, $\tau$.
Model-free offline RL.
We consider behavioral cloning (BC) [25]; TD3+BC [7], which adds a BC loss to TD3; CQL [19], which penalizes the actions out of data distribution; and IQL [17], which utilizes expectile regression to estimate the value function. For locomotion tasks, we also compare with EDAC [1], which penalizes the Q-values according to the uncertainty of Q-functions.
Model-based offline RL.
We consider MOPO [34] and MOBILE [30], which penalize Q-values according to the transition uncertainty and the Bellman uncertainty of a world model, respectively; COMBO [35], which combines CQL with MBPO; RAMBO [27], which trains an adversarial world model against the policy; and CBOP [14], which utilizes multi-step returns for critic updates.
5.3 Results on long-horizon AntMaze tasks
As shown in Table 1, LEQ significantly outperforms the prior model-based approaches for all datasets. LEQ achieves and for antmaze-large-play and antmaze-large-diverse, while the second best method, RAMBO [27], scores only and , respectively. We believe these performance gains come from our conservative value estimation, which is more stable than the uncertainty-based penalization of prior works.
Moreover, LEQ even significantly outperforms the model-free approaches in antmaze-umaze, antmaze-large, and antmaze-ultra. Despite its superior performance, LEQ often shows high variance during training, resulting in worse performance on antmaze-medium. Over the course of training, LEQ mostly achieves high success rates, but the evaluation results sometimes drop to as shown in Figure 5 in the Appendix. We leave the problem of reducing the high variance of our method in certain environments as future work.
Model-free | Model-based | |||||||||
Dataset | BC | TD3+BC | CQL | IQL | MOPO | COMBO | RAMBO | MOBILE† | CBOP† | LEQ (ours) |
antmaze-umaze | ||||||||||
antmaze-umaze-diverse | ||||||||||
antmaze-medium-play | ||||||||||
antmaze-medium-diverse | ||||||||||
antmaze-large-play | ||||||||||
antmaze-large-diverse | ||||||||||
antmaze-ultra-play | ||||||||||
antmaze-ultra-diverse | ||||||||||
Total w/o antmaze-ultra | ||||||||||
Total |
† We use the official implementations of MOBILE and CBOP.
5.4 Results on MuJoCo Gym locomotion tasks
For D4RL MuJoCo Gym tasks in Table 3, LEQ achieves comparable results with the best score of prior works in out of tasks. Furthermore, in Table 2, LEQ outperforms most of the prior works in the NeoRL benchmark, especially in the Hopper and Walker2d domains. These results show that LEQ serves as a general offline RL algorithm, not limited to long-horizon tasks.
Similar to antmaze-medium, LEQ also suffers from the high variance problem. During training, LEQ often achieves high performance, but then, suddenly falls back to , as shown in Appendix, Figure 5. This is mainly because the learned models sometimes fail to capture failures (e.g. hopper and walker falling off) and predict an optimistic future (e.g. hopper and walker walking forward).
Model-free | Model-based | |||||||
Dataset | BC | TD3+BC | CQL | EDAC | IQL | MOPO∗ | MOBILE | LEQ (ours) |
Hopper-L | ||||||||
Hopper-M | ||||||||
Hopper-H | ||||||||
Walker2d-L | ||||||||
Walker2d-M | ||||||||
Walker2d-H | ||||||||
HalfCheetah-L | ||||||||
HalfCheetah-M | ||||||||
HalfCheetah-H | ||||||||
Total |
Model-free | Model-based | ||||||||||
Dataset | BC | TD3+BC | CQL | EDAC | IQL | MOPO∗ | COMBO | RAMBO | MOBILE | CBOP | LEQ (ours) |
hopper-r | |||||||||||
hopper-m | |||||||||||
hopper-mr | |||||||||||
hopper-me | |||||||||||
walker2d-r | |||||||||||
walker2d-m | |||||||||||
walker2d-mr | |||||||||||
walker2d-me | |||||||||||
halfcheetah-r | |||||||||||
halfcheetah-m | |||||||||||
halfcheetah-mr | |||||||||||
halfcheetah-me | |||||||||||
Total |
5.5 Ablation studies
To understand why LEQ (LEQ-$\lambda$) works well in long-horizon tasks, we conduct ablation studies and answer the following four questions: (1) Does using $\lambda$-return help? (2) Is LEQ better than prior uncertainty-based penalization methods? (3) Which factor enables LEQ to work in AntMaze? and (4) How do imagination length and data expansion length affect the performance?
(1) $\lambda$-returns.
To verify the effect of $\lambda$-return, we compare our method (LEQ-$\lambda$) with the versions with $H$-step return (LEQ-$H$) and $1$-step return (LEQ-$1$). Table 4 shows that using $\lambda$-return drastically improves the performance on AntMaze compared to using $H$-step return or $1$-step return. This result is consistent with the observations in prior online RL methods [28, 11].
(2) Lower expectile Q-learning.
We compare our lower expectile Q-learning with another conservative value estimator, MOBIP, used in MOBILE [30], which penalizes Q-values with the standard deviation of Q-ensemble networks. The only difference between LEQ and MOBIP is their target Q-value computation for both critic and policy updates. Table 4 shows that using MOBIP not only deteriorates the success rates (in MOBIP-1) but also does not benefit from $\lambda$-return (in MOBIP-$\lambda$).
(3) What makes offline model-based RL work in AntMaze?
Prior to LEQ, no offline model-based RL method worked in AntMaze, whereas our method even outperforms model-free methods. Thus, we investigate which changes in LEQ enable offline model-based RL to work in AntMaze.
Dataset | umaze | medium | large | ultra | Total | ||||
---|---|---|---|---|---|---|---|---|---|
umaze | diverse | play | diverse | play | diverse | play | diverse | ||
LEQ- (ours) | |||||||||
LEQ- | |||||||||
LEQ- | |||||||||
MOBIP- | |||||||||
MOBIP-1 | |||||||||
MOBILE∗ | |||||||||
MOBILE∗ ( = 0.25) | |||||||||
MOBILE∗ ( = 0.997) | |||||||||
MOBILE∗ ( = 10) |
We first re-implement MOBILE with some technical tricks used in LEQ: LayerNorm [3], SymLog [11], a single Q-network, and no target Q-value clipping; however, MOBILE∗ achieves a barely non-zero score, . We found that the key to making MOBILE∗ work is reducing $\beta$, the ratio between the loss calculated from imaginary rollouts and the loss from dataset transitions. When we lower $\beta$ from to (used in LEQ), MOBILE∗ shows meaningful performance in the umaze and medium mazes, and achieves in total. This suggests that utilizing the true transitions from the dataset is important in long-horizon tasks, which was undervalued in prior works.
(4) Imagination length and dataset expansion length .
As shown in Table 5, the performance increases when it goes to from , but it drops when . This result shows the trade-off of using the world model: the further the agent imagines, the more robust it becomes to errors of the critic, but the more prone it becomes to errors from the model prediction.
We also evaluate LEQ without the dataset expansion (). In AntMaze, the results with and without the dataset expansion are similar, as shown in Table 5. On the other hand, the dataset expansion makes the policy more stable and better in the D4RL MuJoCo tasks (Table 13).
Dataset | (ours) | ||||
---|---|---|---|---|---|
antmaze-umaze | |||||
antmaze-umaze-diverse | |||||
antmaze-medium-play | |||||
antmaze-medium-diverse | |||||
antmaze-large-play | |||||
antmaze-large-diverse | |||||
antmaze-ultra-play | |||||
antmaze-ultra-diverse | |||||
Total |
6 Conclusion
In this paper, we propose a novel offline model-based reinforcement learning method, LEQ, which uses expectile regression to get a conservative evaluation of a policy from model-generated trajectories. Expectile regression eases the pain of constructing the whole distribution of Q-targets and allows for estimating the conservative value via sampling. Combined with $\lambda$-returns in both critic and policy updates for the imaginary rollouts, the policy can receive learning signals that are more robust to both model errors and critic errors. We empirically show that LEQ improves the performance in various tasks – especially, achieving the state-of-the-art performance in the long-horizon AntMaze tasks.
6.1 Limitations
Following prior work on model-based offline RL [30, 14], we assume access to the ground-truth termination function of a task, unlike online model-based RL approaches, which learn a termination function from interactions. However, since this termination function is conditioned on a state, the model needs to plan in a state space (or an observation space), which could be challenging in a high-dimensional state space (e.g. pixel observations). Extending the proposed approach to complex environments with high-dimensional observations would be an immediate next step.
6.2 Broader Impacts
Our method aims to increase the ability of autonomous agents, such as robots and self-driving cars, to learn from static, offline data without interacting with the world. This enables autonomous agents to utilize data with diverse qualities (not necessarily from experts). We believe that this paper does not have any immediate negative societal impact.
Acknowledgments and Disclosure of Funding
We would like to thank Junik Bae for helpful discussion. This work was supported in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (RS-2020-II201361, Artificial Intelligence Graduate School Program (Yonsei University)) and the National Research Foundation of Korea (NRF) grant (RS-2024-00333634) funded by the Korean Government (MSIT). Kwanyoung Park was supported by Electronics and Telecommunications Research Institute (ETRI).
References
- An et al. [2021] Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble. Advances in neural information processing systems, 34:7436–7447, 2021.
- Argenson and Dulac-Arnold [2021] Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. In International Conference on Learning Representations, 2021.
- Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Ball et al. [2023] Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023.
- Feinberg et al. [2018] Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
- Fu et al. [2020] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
- Fujimoto and Gu [2021] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. In Neural Information Processing Systems, volume 34, pages 20132–20145, 2021.
- Fujimoto et al. [2019] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019.
- Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2019.
- Hafner et al. [2021] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In International Conference on Learning Representations, 2021.
- Hafner et al. [2023] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
- Hessel et al. [2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Association for the Advancement of Artificial Intelligence, 2018.
- Janner et al. [2019] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Neural Information Processing Systems, volume 32, 2019.
- Jeong et al. [2023] Jihwan Jeong, Xiaoyu Wang, Michael Gimelfarb, Hyunwoo Kim, Baher Abdulhai, and Scott Sanner. Conservative bayesian model-based value expansion for offline policy optimization. In International Conference on Learning Representations, 2023.
- Jiang et al. [2023] Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, and Yuandong Tian. Efficient planning in a compact latent action space. In International Conference on Learning Representations, 2023.
- Kidambi et al. [2020] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. In Neural Information Processing Systems, volume 33, pages 21810–21823, 2020.
- Kostrikov et al. [2022] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022.
- Kumar et al. [2019] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Neural Information Processing Systems, volume 32, 2019.
- Kumar et al. [2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In Neural Information Processing Systems, volume 33, pages 1179–1191, 2020.
- Le et al. [2019] Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712. PMLR, 2019.
- Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- Lillicrap et al. [2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
- Park et al. [2024] Seohong Park, Dibya Ghosh, Benjamin Eysenbach, and Sergey Levine. Hiql: Offline goal-conditioned rl with latent states as actions. In Neural Information Processing Systems, volume 36, 2024.
- Peng et al. [2019] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
- Pomerleau [1989] Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.
- Qin et al. [2022] Rong-Jun Qin, Xingyuan Zhang, Songyi Gao, Xiong-Hui Chen, Zewen Li, Weinan Zhang, and Yang Yu. NeoRL: A near real-world benchmark for offline reinforcement learning. In Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=jNdLszxdtra.
- Rigter et al. [2022] Marc Rigter, Bruno Lacerda, and Nick Hawes. Rambo-rl: Robust adversarial model-based offline reinforcement learning. In Neural Information Processing Systems, volume 35, pages 16082–16097, 2022.
- Schulman et al. [2016] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016.
- Sun [2023] Yihao Sun. Offlinerl-kit: An elegant pytorch offline reinforcement learning library. https://github.com/yihaosun1124/OfflineRL-Kit, 2023.
- Sun et al. [2023] Yihao Sun, Jiaji Zhang, Chengxing Jia, Haoxin Lin, Junyin Ye, and Yang Yu. Model-bellman inconsistency for model-based offline reinforcement learning. In International Conference on Machine Learning, pages 33177–33194. PMLR, 2023.
- Sutton [1988] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3:9–44, 1988.
- Sutton [1991] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991.
- Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Yu et al. [2020] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. In Neural Information Processing Systems, volume 33, pages 14129–14142, 2020.
- Yu et al. [2021] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. In Neural Information Processing Systems, volume 34, pages 28954–28967, 2021.
Appendix A Training Details
Computing resources.
All experiments are done on a single RTX 4090 GPU and AMD EPYC 9354 CPU cores. We use different random seeds for each experiment and report the mean and standard deviation. Each offline RL experiment takes hours for ours, hours for MOBILE, and hours for CBOP.
Environment details.
For the locomotion tasks, we use the dataset provided by D4RL [6] and NeoRL [26]. Following IQL [17], we normalize rewards using the maximum and minimum return of all trajectories. We use the true termination functions of the environments, implemented in MOBILE [30].
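The return-based reward normalization mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `normalize_rewards` and the optional `scale` multiplier are hypothetical, and each trajectory is assumed to be a plain list of per-step rewards.

```python
from typing import List


def normalize_rewards(trajectories: List[List[float]],
                      scale: float = 1.0) -> List[List[float]]:
    """Divide every reward by (max trajectory return - min trajectory return).

    `scale` is a hypothetical constant multiplier applied after normalization.
    """
    returns = [sum(traj) for traj in trajectories]
    span = max(returns) - min(returns)
    if span == 0:
        raise ValueError("all trajectories have the same return")
    return [[r * scale / span for r in traj] for traj in trajectories]


if __name__ == "__main__":
    # Two toy trajectories with returns 2 and 6, so the span is 4.
    data = [[1.0, 1.0], [3.0, 3.0]]
    print(normalize_rewards(data))
```

After normalization, the gap between the best and worst trajectory returns equals `scale`, which keeps the reward magnitude comparable across datasets.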
For the AntMaze tasks, we use the dataset provided by D4RL [6]. Following IQL [17], we subtract from the rewards in the datasets so that the agent receives for each step and on termination. We use the true termination functions of the environments. The termination functions of the AntMaze tasks are not deterministic because the goal of the maze is randomized every time the environment is reset. Nevertheless, we follow the implementation of CBOP [14], where the termination region is set to a circle around the mean of the goal distribution with the radius .
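A deterministic, circle-shaped termination region of the kind described above could look like the sketch below. The function names, the 2D position representation, and the numeric values in the usage example are hypothetical; the reward offset and the radius are left as explicit arguments because their values are not restated here.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def shift_reward(reward: float, offset: float) -> float:
    """Subtract a constant offset from a dataset reward."""
    return reward - offset


def is_terminal(position: Point, goal_mean: Point, radius: float) -> bool:
    """Terminate when the agent is inside a circle centered at the mean goal."""
    dx = position[0] - goal_mean[0]
    dy = position[1] - goal_mean[1]
    return math.hypot(dx, dy) <= radius


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the experiments.
    goal = (0.0, 0.0)
    print(is_terminal((3.0, 4.0), goal, radius=5.5))  # distance 5 is inside
    print(is_terminal((3.0, 4.0), goal, radius=4.5))  # distance 5 is outside
```

Using the mean of the goal distribution makes the termination check a pure function of the state, so it can be applied to imagined model rollouts.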
Method implementation details.
For all compared methods, we use the results from their corresponding papers when available. For IQL [17], we run the official implementation with seeds to reproduce the results for the random datasets in D4RL and NeoRL. For the AntMaze tasks, we run the official implementation of MOBILE and CBOP with random seeds. Please note that the original MOBILE implementation does not use the true termination function, so we replace it with our termination function. For MOPO, COMBO, and RAMBO, we use the results reported in RAMBO [27].
World models.
For training world models, we use the architecture and training script from OfflineRL-Kit [29], matching the implementation of MOBILE [30]. Each world model is implemented as a -layer MLP with a hidden layer size of . We construct an ensemble of world models by selecting out of models with the best validation scores. We pretrain the ensemble of world models for each of random seeds (i.e. training in total world models and using models), which takes approximately hours on average.
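The ensemble construction above, which keeps only the models with the best validation scores, can be sketched as a simple ranking step. The function name `select_elites` is hypothetical, and the sketch assumes that the validation score is a loss, so lower is better.

```python
from typing import List, Sequence


def select_elites(validation_losses: Sequence[float],
                  num_elites: int) -> List[int]:
    """Return the indices of the `num_elites` models with the lowest loss."""
    if not 0 < num_elites <= len(validation_losses):
        raise ValueError("num_elites must be in [1, number of models]")
    ranked = sorted(range(len(validation_losses)),
                    key=lambda i: validation_losses[i])
    return ranked[:num_elites]


if __name__ == "__main__":
    # Toy validation losses for four trained models (illustrative values).
    losses = [0.3, 0.1, 0.5, 0.2]
    print(select_elites(losses, num_elites=2))
```

Only the selected indices are then used when sampling a model for each imagined transition.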
Policy and critic networks.
Pretraining.
For some environments, we found that a randomly initialized policy can lead to abnormal reward or transition predictions from the world models in the early stage, leading to unstable training. Following CBOP [14], we pretrain a policy and a critic using behavioral cloning and FQE [20], respectively. We use a slightly different implementation of FQE from the original implementation, where the operation is approximated with mini-batch gradient descent, similar to standard Q-learning as shown in Algorithm 2.
Comparisons with prior methods.
We provide a comparison of LEQ with the prior model-based approaches and the baseline methods used in our ablation studies in Table 6.
Components | CBOP | MOBILE | MOBILE∗ | MOBIP | LEQ (ours) |
---|---|---|---|---|---|
Training scheme | MVE [5] | MBPO [13] | MBPO [13] | Dyna [32] | Dyna [32] |
Conservatism | Lower-confidence bound | Lower-confidence bound | Lower-confidence bound | Lower-confidence bound | Lower expectile |
Policy | Stochastic | Stochastic | Stochastic | Deterministic | Deterministic |
Policy objective | | | | λ-returns | λ-returns |
Policy pretraining | BC | – | – | BC | BC |
# of critics | 20-50 | 2 | 1 | 1 | 1 |
Critic objective | Multi-step (adaptive weighting) | One-step | One-step | λ-returns | λ-returns |
Critic pretraining | FQE [20] | – | – | FQE [20] | FQE [20] |
Horizon length () | 10 | 1 | 1 | 10 | 10 |
Rollout length () | – | 1 or 5 | 10 | 5 | 5 |
Discount rate (γ) | 0.99 | 0.99 | 0.997 | 0.997 | 0.997 |
in Equation 7 | 1.0 | 0.95 | 0.25 | 0.25 | 0.25 |
Impl. tricks | – | Clip Q-values with | LayerNorm + Symlog | LayerNorm + Symlog | LayerNorm + Symlog |
Running time | 24h | 12h | 40m | 4h | 2h |
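Several rows of the comparison above refer to λ-returns. As a reminder of what this target looks like, the sketch below computes λ-returns over one imagined rollout using the standard backward recursion. It is a generic illustration under our own assumptions: the function name is hypothetical, episode termination inside the rollout is ignored, and the indexing may differ from the equations in the main text.

```python
from typing import List, Sequence


def lambda_returns(rewards: Sequence[float], values: Sequence[float],
                   gamma: float, lam: float) -> List[float]:
    """Compute lambda-returns for a rollout of length H.

    rewards[t] is the reward at step t (t = 0..H-1).
    values[t] is the value estimate at step t (t = 0..H); values[H] bootstraps.
    """
    horizon = len(rewards)
    if len(values) != horizon + 1:
        raise ValueError("values must have one more entry than rewards")
    out = [0.0] * horizon
    next_return = values[horizon]
    for t in reversed(range(horizon)):
        # Mix the one-step bootstrap with the longer-horizon return.
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return)
        out[t] = next_return
    return out


if __name__ == "__main__":
    r = [1.0, 1.0, 1.0]
    v = [0.0, 0.0, 0.0, 10.0]
    print(lambda_returns(r, v, gamma=0.5, lam=1.0))  # [3.0, 4.0, 6.0]
    print(lambda_returns(r, v, gamma=0.5, lam=0.0))  # [1.0, 1.0, 6.0]
```

With `lam=0` the target reduces to the one-step return, and with `lam=1` it becomes the full rollout return bootstrapped only at the final step.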
Hyperparameters of LEQ.
We report task-agnostic hyperparameters of our method in Table 7. We use the same hyperparameters across all tasks, except for the expectile parameter. We search the value of the expectile in {0.1, 0.3, 0.4, 0.5} and report the best value for the main experimental results. In addition, we report the exhaustive results in Tables 11 and 12, and summarize the values used in the main results in Table 8.
Hyperparameters | Value | Description |
---|---|---|
 | 3e-5 | Learning rate of actor |
 | 1e-4 | Learning rate of critic |
Optimizer | Adam | Optimizer |
 | 5000 | Interval of expanding dataset |
 | 50000 | Number of data for each expansion of dataset |
 | 5 | Rollout length for dataset expansion |
 | 1.0 | Exploration noise for dataset expansion |
 | 1M | Total number of gradient steps |
 | 256 | Batch size from original dataset |
 | 256 | Batch size from expanded dataset |
γ | 0.997 | Discount factor |
λ | 0.95 | λ value for λ-return |
 | 10 | Imagination length |
 | 1 | Weight for critic EMA regularization |
 | 0.995 | Critic EMA decay |
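For readers who want to reuse these settings, the task-agnostic values above can be collected in a single configuration object. The sketch below is one possible layout: the class name and all field names are hypothetical, and only the numeric values come from the table. The task-specific expectile is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LEQConfig:
    """Task-agnostic hyperparameters (field names are illustrative)."""
    actor_lr: float = 3e-5
    critic_lr: float = 1e-4
    optimizer: str = "adam"
    expansion_interval: int = 5000       # gradient steps between expansions
    expansion_size: int = 50000          # transitions added per expansion
    expansion_rollout_length: int = 5
    exploration_noise: float = 1.0
    total_gradient_steps: int = 1_000_000
    batch_size_original: int = 256
    batch_size_expanded: int = 256
    discount: float = 0.997
    lam: float = 0.95
    imagination_length: int = 10
    ema_reg_weight: float = 1.0
    ema_decay: float = 0.995

    def num_expansions(self) -> int:
        """Number of expansion intervals that fit in the training budget."""
        return self.total_gradient_steps // self.expansion_interval


if __name__ == "__main__":
    cfg = LEQConfig()
    print(cfg.num_expansions())
```

Freezing the dataclass prevents accidental mutation of shared hyperparameters during a sweep over the expectile.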
Domain | Task | |
AntMaze | umaze | |
umaze-diverse | ||
medium-play | ||
medium-diverse | ||
large-play | ||
large-diverse | ||
ultra-play | ||
ultra-diverse | ||
MuJoCo | hopper-r | |
hopper-m | ||
hopper-mr | ||
hopper-me | ||
walker2d-r | ||
walker2d-m | ||
walker2d-mr | ||
walker2d-me | ||
halfcheetah-r | ||
halfcheetah-m | ||
halfcheetah-mr | ||
halfcheetah-me | ||
NeoRL | Hopper-L | |
Hopper-M | ||
Hopper-H | ||
Walker2d-L | ||
Walker2d-M | ||
Walker2d-H | ||
HalfCheetah-L | ||
HalfCheetah-M | ||
HalfCheetah-H |
Task-specific hyperparameters of the compared methods.
We report the best hyperparameters of MOBILE∗ for the AntMaze tasks in Tables 9 and 10. For MOBILE and MOBILE∗, we search the value of within , as suggested in MOBILE [30], where is the coefficient of the penalized Bellman operator:
(13) |
For CBOP, we conduct hyperparameter search for in , as suggested in the original paper, where is an LCB coefficient of CBOP. We do not report the best hyperparameter for MOBILE and CBOP because both methods score zero points for all hyperparameters in AntMaze.
Domain | Task | |
---|---|---|
AntMaze | umaze | |
umaze-diverse | ||
medium-play | ||
medium-diverse | ||
large-play | ||
large-diverse | ||
ultra-play | ||
ultra-diverse |
Domain | Task | |
---|---|---|
AntMaze | umaze | |
umaze-diverse | ||
medium-play | ||
medium-diverse | ||
large-play | ||
large-diverse | ||
ultra-play | ||
ultra-diverse |
Appendix B Proof of the Policy Objective
We show that the surrogate loss in Equation 12 leads to a better approximation for the expectile of -returns in Equation 11 than maximizing . In other words, we show that optimizing the following policy objective:
(14) |
leads to optimizing a lower-bias estimator of than .
To show this, we first prove that is closer to than . For deriving the proof, we generalize this situation to have an arbitrary distribution and estimation , which corresponds to .
Theorem 1.
Let be a distribution and be a lower expectile of (i.e. ). Let be an arbitrary estimation of , and define . If we let be a new estimation of , then .
Proof.
Without loss of generality, we assume . Then, we have . Thus,
∎
Note that this theorem shows that the bias of the new estimation is always smaller than the original estimation, since and . If we plug in the distribution of to and , then , and we can show the desired result using the theorem: is closer to than .
Here, the normalizing factor is non-differentiable with . Specifically, the gradient is 0 everywhere (except ). Thus, if we calculate the gradient of , the gradient for the normalizing factor disappears. Therefore, we can omit the normalizing factor and get an equivalent formula for gradient-based optimization.
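The intuition behind the theorem can be checked numerically on a toy sample. The sketch below is written in our own notation and is only an illustration: it assumes the standard expectile weighting, in which samples below the current estimate receive weight 1 − τ and the rest receive weight τ, and it only checks the case where the initial estimate overestimates the lower expectile. The function names are hypothetical.

```python
from typing import Sequence


def reweighted_mean(samples: Sequence[float], estimate: float,
                    tau: float) -> float:
    """Weighted mean with weights |tau - 1(estimate > x)|."""
    weights = [abs(tau - (1.0 if estimate > x else 0.0)) for x in samples]
    total = sum(w * x for w, x in zip(weights, samples))
    return total / sum(weights)


def expectile(samples: Sequence[float], tau: float,
              max_iters: int = 100) -> float:
    """Find the tau-expectile as a fixed point of the reweighted mean."""
    y = sum(samples) / len(samples)
    for _ in range(max_iters):
        y_next = reweighted_mean(samples, y, tau)
        if y_next == y:
            break
        y = y_next
    return y


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 10.0]
    tau = 0.1
    target = expectile(xs, tau)          # lower expectile of the sample
    old = 5.0                            # an overestimated guess
    new = reweighted_mean(xs, old, tau)  # one reweighting step
    print(target, old, new)
    print(abs(new - target) < abs(old - target))
```

In this example a single reweighting step moves the overestimated guess toward the lower expectile, which mirrors the role of the reweighted objective in the argument above.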
Appendix C More Results
High variance in locomotion tasks.
When we train LEQ in locomotion tasks, we observe that our method often achieves % success rates and then falls back to %, as shown in Figure 5. This is mainly because the learned models sometimes fail to capture failures (e.g. hopper and walker falling off) and predict an optimistic future (e.g. hopper and walker walking forward).
Results for all expectiles .
To give insight into how the expectile parameter affects the performance of LEQ, we report the performance of LEQ with all expectile values . The expectile parameter involves a trade-off: a high expectile makes the model’s predictions less conservative but makes it easier for the policy to exploit the model. We recommend first trying , which works well for most of the tasks, and then increasing the expectile until the performance starts to drop.
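The conservatism trade-off can be seen directly on a toy set of return samples: lower expectiles sit closer to the pessimistic outcomes, while an expectile of 0.5 recovers the mean. The sketch below fits the expectile by bisection on the gradient of the asymmetric squared loss. The function names and the sample values are hypothetical, and this is a generic illustration rather than the training code.

```python
from typing import Sequence


def expectile_loss(residual: float, tau: float) -> float:
    """Asymmetric squared loss on residual = target - prediction."""
    weight = tau if residual > 0 else 1.0 - tau
    return weight * residual * residual


def fit_expectile(targets: Sequence[float], tau: float,
                  steps: int = 200) -> float:
    """Minimize the summed expectile loss over a scalar prediction."""
    lo, hi = min(targets), max(targets)
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        # Gradient of the loss w.r.t. the prediction (up to a factor of 2).
        grad = -sum((tau if t - mid > 0 else 1.0 - tau) * (t - mid)
                    for t in targets)
        if grad > 0:
            hi = mid  # loss is increasing, so the minimum lies to the left
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    returns = [0.0, 1.0, 2.0, 3.0, 10.0]
    for tau in (0.1, 0.3, 0.4, 0.5):
        print(tau, round(fit_expectile(returns, tau), 4))
```

On such samples the fitted value increases monotonically with the expectile, so sweeping it from low to high moves the estimate from pessimistic toward the plain average.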
Expectile | 0.1 | 0.3 | 0.4 | 0.5 |
---|---|---|---|---|
antmaze-umaze | ||||
antmaze-umaze-diverse | ||||
antmaze-medium-play | ||||
antmaze-medium-diverse | ||||
antmaze-large-play | ||||
antmaze-large-diverse | ||||
antmaze-ultra-play | ||||
antmaze-ultra-diverse |
Expectile | 0.1 | 0.3 | 0.4 | 0.5 |
---|---|---|---|---|
hopper-r | ||||
hopper-m | ||||
hopper-mr | ||||
hopper-me | ||||
walker2d-r | ||||
walker2d-m | ||||
walker2d-mr | ||||
walker2d-me | ||||
halfcheetah-r | ||||
halfcheetah-m | ||||
halfcheetah-mr | ||||
halfcheetah-me |
Ablation study on dataset expansion.
Table 13 shows the ablation results on the dataset expansion in D4RL MuJoCo tasks. The results show that the dataset expansion generally improves the performance, especially in Hopper environments.
Dataset | LEQ (ours) | LEQ w/o Dataset Expansion |
---|---|---|
hopper-r | ||
hopper-m | ||
hopper-mr | ||
hopper-me | ||
walker2d-r | ||
walker2d-m | ||
walker2d-mr | ||
walker2d-me | ||
halfcheetah-r | ||
halfcheetah-m | ||
halfcheetah-mr | ||
halfcheetah-me | ||
Total |
Ablation study on MOBILE∗ in AntMaze.
In Table 14, we report the performance of MOBILE∗ for all possible combinations of the hyperparameters between the values of MOBILE and LEQ. Specifically, we use , , and . The result shows that is crucial. In addition, the configuration of LEQ yields the best result among all configurations for MOBILE∗.
Hyperparams. | umaze | medium | large | ultra | Total | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
umaze | diverse | play | diverse | play | diverse | play | diverse | ||||