Tackling Long-Horizon Tasks
with Model-based Offline Reinforcement Learning

Kwanyoung Park, Youngwoon Lee
Yonsei University
https://kwanyoungpark.github.io/LEQ/

Abstract

Model-based offline reinforcement learning (RL) is a compelling approach that addresses the challenge of learning from limited, static data by generating imaginary trajectories using learned models. However, it falls short in solving long-horizon tasks due to high bias in value estimation from model rollouts. In this paper, we introduce a novel model-based offline RL method, Lower Expectile Q-learning (LEQ), which enhances long-horizon task performance by mitigating the high bias in model-based value estimation via expectile regression of $\lambda$ -returns. Our empirical results show that LEQ significantly outperforms previous model-based offline RL methods on long-horizon tasks, such as the D4RL AntMaze tasks, matching or surpassing the performance of model-free approaches. Our experiments demonstrate that expectile regression, $\lambda$ -returns, and critic training on offline data are all crucial for addressing long-horizon tasks. Additionally, LEQ achieves performance comparable to the state-of-the-art model-based and model-free offline RL methods on the NeoRL benchmark and the D4RL MuJoCo Gym tasks.

1 Introduction

One of the major challenges in offline reinforcement learning (RL) is the overestimation of values for out-of-distribution actions due to the lack of environment interactions [21, 19]. Model-based offline RL addresses this issue by generating additional (imaginary) training data using a learned model, thereby augmenting the given offline data with synthetic experiences that cover out-of-distribution states and actions [34, 16, 35, 2, 30]. While these approaches have demonstrated strong performance in simple, short-horizon tasks, they struggle with noisy model predictions and value estimations, particularly in long-horizon tasks [23]. This challenge is evident in their poor performance (i.e., near-zero success rates) on the D4RL AntMaze tasks [6, 15].

Typical model-based offline RL methods alleviate the inaccurate value estimation problem (mostly overestimation) by penalizing Q-values estimated from model rollouts with uncertainties in model predictions [34, 16] or value predictions [30, 14]. While these penalization terms prevent a policy from exploiting erroneous value estimations, the policy now does not maximize the true value, but maximizes the value penalized by heuristically estimated uncertainties, which can lead to sub-optimal behaviors. This is especially problematic in long-horizon, sparse-reward tasks, where Q-values are similar across nearby states [23].

Another way to reduce bias in value estimates is by using multi-step returns [31, 12]. CBOP [14] constructs an explicit distribution of multi-step Q-values from thousands of model rollouts and uses this value as a target for training the Q-function. However, CBOP is computationally expensive for estimating a target value and uses multi-step returns solely for Q-learning, which provides insufficient learning signals for obtaining long-horizon behaviors.

To tackle long-horizon tasks with model-based offline RL, we introduce a simple yet effective model-based offline RL algorithm, Lower Expectile Q-learning (LEQ). As illustrated in Figure 1, LEQ uses expectile regression with a small $\tau$ for both policy and Q-function training, providing an efficient and elegant way to achieve conservative Q-value estimates. Moreover, to better handle long-horizon tasks, we propose to optimize a policy and Q-function using $\lambda$ -returns (i.e. TD( $\lambda$ ) targets) of long ( $10$ -step) model rollouts, allowing the policy to directly learn from low-bias multi-step returns [28].

The experiments on the D4RL AntMaze and MuJoCo Gym tasks [6], as well as the NeoRL benchmark [26], demonstrate that our proposed conservative policy optimization with $\lambda$ -return and critic training on offline data significantly improves offline RL policies in long-horizon tasks while achieving comparable performance in short-horizon, dense-reward tasks. Specifically, to the best of our knowledge, LEQ is the first model-based offline RL algorithm capable of matching or outperforming the performance of model-free offline RL algorithms on the long-horizon AntMaze tasks [6, 15].

Figure 1: Lower Expectile Q-learning (LEQ). (left) In offline model-based RL, an agent can generate imaginary trajectories using a world model. (right) For conservative Q-evaluation of the policy, LEQ learns the lower expectile of the target $Q$ -distribution from a sampled individual rollout $\mathcal{T}_{i}$ , without estimating the entire Q-distribution with exhaustive rollouts.

2 Related Work

Offline RL [21] aims to solve a reinforcement learning problem using only pre-collected datasets, ideally outperforming behavioral cloning policies [25]. One can simply apply off-policy RL algorithms on top of the fixed dataset. However, off-policy RL methods suffer from the overestimation of Q-values for actions unseen in the offline dataset [8, 18, 19], since an overestimated value function cannot be corrected through online environment interactions in offline RL.

Model-free offline RL algorithms have addressed this value overestimation problem on out-of-distribution actions by (1) regularizing a policy to only output actions in the offline data [24, 17, 7] or (2) adopting a conservative value estimation for executing actions different from the dataset [19, 1]. Despite their strong performances on the standard offline RL benchmarks, model-free offline RL policies tend to be constrained to the support of the data (i.e. state-action pairs in the offline dataset), which may lead to limited generalization capability.

Model-based offline RL approaches have tried to overcome this limitation by suggesting a better use of the limited offline data – learning a world model and generating imaginary data with the learned model that covers out-of-distribution actions. Similar to Dyna-style online model-based RL [32, 9, 10, 11], an offline model-based RL policy can be trained on both offline data and model rollouts. But, again, learned models may be inaccurate on states and actions outside the data support, making a policy easily exploit the learned models.

Recent model-based offline RL algorithms have adopted the conservatism idea from model-free offline RL, penalizing policies incurring (1) uncertain transition dynamics [34, 16, 35] or (2) uncertain value estimation [30, 14]. This conservative use of model-generated data enables model-based offline RL to outperform model-free offline RL in widely used offline RL benchmarks [30]. However, uncertainty estimation is difficult and often inaccurate [35]. Instead of relying on such heuristic [34, 16, 30] or expensive [14] uncertainty estimation, we propose to learn a conservative value function via expectile regression with a small $\tau$ , which is simple, efficient, yet effective.

3 Preliminaries

Problem setup.

We formulate our problem as a Markov Decision Process (MDP) defined as a tuple, $\mathcal{M}=(\mathcal{S},\mathcal{A},r,p,\rho,\gamma)$ [33]. $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively. $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ denotes the reward function. $p:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ denotes the transition dynamics, where $\Delta(\mathcal{X})$ denotes the set of probability distributions over $\mathcal{X}$ . $\rho(\mathbf{s}_{0})\in\Delta(\mathcal{S})$ denotes the initial state distribution and $\gamma$ is a discounting factor. The goal of reinforcement learning (RL) is to find a policy, $\pi:\mathcal{S}\rightarrow\Delta(\mathcal{A})$ , that maximizes the expected return, $\mathbb{E}_{\tau\sim p(\cdot\mid\pi,\mathbf{s}_{0}\sim\rho)}\left[\sum_{t=0}^{T-1}\gamma^{t}r(\mathbf{s}_{t},\mathbf{a}_{t})\right]$ , where $\tau$ is a sequence of transitions with a finite horizon $T$ , $\tau=(\mathbf{s}_{0},\mathbf{a}_{0},r_{0},\mathbf{s}_{1},\mathbf{a}_{1},r_{1},...,\mathbf{s}_{T})$ , following $\pi(\mathbf{a}_{t}\mid\mathbf{s}_{t})$ and $p(\mathbf{s}_{t+1}\mid\mathbf{s}_{t},\mathbf{a}_{t})$ starting from $\mathbf{s}_{0}\sim\rho(\cdot)$ .

In this paper, we consider the offline RL setup [21], where a policy $\pi$ is trained with a fixed given offline dataset, $\mathcal{D}_{\text{env}}=\{\tau_{1},\tau_{2},...,\tau_{N}\}$ , without any additional online interactions.

Model-based offline RL.

As an offline RL policy is trained from a fixed dataset, one of the major challenges in offline RL is the limited data support and the resulting lack of generalization to out-of-distribution states and actions. Model-based offline RL [16, 34, 35, 27, 30, 14] tackles this problem by augmenting the training data with imaginary training data (i.e. model rollouts) generated from the learned transition dynamics and reward model, $p_{\psi}(\mathbf{s}_{t+1},r\mid\mathbf{s}_{t},\mathbf{a}_{t})$ .

The typical process of model-based offline RL is as follows: (1) pretrain a model (or an ensemble of models) and an initial policy from the offline data, (2) generate short imaginary rollouts $\{\tau\}$ using the pretrained model and add them to the training dataset $\mathcal{D}_{\text{model}}\leftarrow\mathcal{D}_{\text{model}}\cup\{\tau\}$ , (3) perform an offline RL algorithm on the augmented dataset $\mathcal{D}_{\text{model}}\cup\mathcal{D}_{\text{env}}$ , and repeat (2) and (3).

Expectile regression.

The expectile is a generalization of the expectation of a distribution $X$ . While the expectation of $X$ , $\mathbb{E}[X]$ , can be viewed as the minimizer of the least-squares objective, $L_{2}(y)=\mathbb{E}_{x\sim X}[(y-x)^{2}]$ , the $\tau$ -expectile of $X$ , $\mathbb{E}^{\tau}[X]$ , is defined as the minimizer of the asymmetric least-squares objective:

$$L_{2}^{\tau}(y)=\mathbb{E}_{x\sim X}\left[\lvert\tau-\mathbbm{1}(y>x)\rvert\cdot(y-x)^{2}\right], \tag{1}$$

where $\lvert\tau-\mathbbm{1}(y>x)\rvert$ is an asymmetric weighting of $L_{2}$ and $0\leq\tau\leq 1$ .

We refer to a $\tau$ -expectile with $\tau<0.5$ as a lower expectile of $X$ . When $\tau<0.5$ , the objective assigns a high weight $1-\tau$ to samples $x$ smaller than $y$ and a low weight $\tau$ to samples $x$ larger than $y$ . Thus, minimizing the objective with $\tau<0.5$ yields an estimate below the expectation, i.e., a conservative statistical estimate.
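To make the definition concrete, the following minimal Python sketch evaluates the asymmetric least-squares objective of Equation 1 on a finite sample and finds its minimizer. It is an illustration only and not part of LEQ: the function names `expectile_loss` and `expectile`, the fixed-point iteration, and the toy sample are assumptions of this sketch, which also assumes $0<\tau<1$ .

```python
def expectile_loss(y, samples, tau):
    """Empirical asymmetric least-squares loss L_2^tau(y) from Equation 1."""
    total = 0.0
    for x in samples:
        # Weight is (1 - tau) when y > x and tau otherwise.
        weight = abs(tau - (1.0 if y > x else 0.0))
        total += weight * (y - x) ** 2
    return total / len(samples)


def expectile(samples, tau, iters=200):
    """Tau-expectile of an empirical distribution, assuming 0 < tau < 1.

    Setting the derivative of the loss to zero shows that the expectile is a
    weighted mean of the samples, where the weights depend on the current
    estimate. Iterating this weighted mean is the (hypothetical) solver here.
    """
    y = sum(samples) / len(samples)  # start from the ordinary mean
    for _ in range(iters):
        weights = [abs(tau - (1.0 if y > x else 0.0)) for x in samples]
        y = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return y


samples = [0.0, 1.0, 2.0, 3.0, 10.0]
mean_estimate = expectile(samples, tau=0.5)   # coincides with the mean
lower_estimate = expectile(samples, tau=0.1)  # pulled toward the lower tail
```

For this toy sample, the $0.5$ -expectile equals the mean ( $3.2$ ), whereas the $0.1$ -expectile is $8/7\approx 1.14$ , which shows how a small $\tau$ discounts the single large outcome.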

4 Approach

The primary limitation of model-based offline RL in solving long-horizon tasks is the inherent error of the world model and critic outside the offline data support. Conservative value estimation can effectively handle such (falsely optimistic) errors. Prior approaches estimate conservative values through various uncertainty penalties, but these are either unreliable [35] or computationally expensive [14].

In this paper, we introduce Lower Expectile Q-learning (LEQ), an efficient model-based offline RL method that achieves conservative value estimation via expectile regression of Q-values with lower expectiles when learning from model-generated data (Section 4.1). Additionally, we address the noisy value estimation problem in long-horizon tasks [23] using $\lambda$ -returns on $10$ -step imaginary rollouts (Section 4.2). Finally, we train a deterministic policy conservatively by maximizing the lower expectile of $\lambda$ -returns (Section 4.3). The overview of LEQ is described in Algorithm 1.

Algorithm 1 LEQ: Lower Expectile Q-learning with $\lambda$ -returns

Require: Offline dataset $\mathcal{D}_{\text{env}}$ , expectile $\tau\leq 0.5$ , imagination length $H$ , dataset expansion length $R$
1: Initialize world models $\{p_{\psi_{1}},\cdots,p_{\psi_{M}}\}$ , policy $\pi_{\theta}$ , and Q-function $Q_{\phi}$
2: Pretrain $\{p_{\psi_{1}},\cdots,p_{\psi_{M}}\}$ on $\mathcal{D}_{\text{env}}$ $\triangleright$ $\mathcal{L}_{\text{wm}}(\psi)=-\mathbb{E}_{(\mathbf{s},\mathbf{a},r,\mathbf{s}^{\prime})\in\mathcal{D}_{\text{env}}}\log p_{\psi}(\mathbf{s}^{\prime},r\mid\mathbf{s},\mathbf{a})$
3: Pretrain $\pi_{\theta}$ and $Q_{\phi}$ on $\mathcal{D}_{\text{env}}$ $\triangleright$ using BC for $\pi_{\theta}$ and FQE [20] for $Q_{\phi}$
4: $\mathcal{D}_{\text{model}}\leftarrow\emptyset$
5: while not converged do
6:     // Expand dataset using model rollouts
7:     $\mathbf{s}_{0}\sim\mathcal{D}_{\text{env}}$ $\triangleright$ start dataset expansion from any state in $\mathcal{D}_{\text{env}}$
8:     for $t=0,\ldots,R-1$ do
9:         $\mathcal{D}_{\text{model}}\leftarrow\mathcal{D}_{\text{model}}\cup\{\mathbf{s}_{t}\}$
10:         $\mathbf{a}_{t}=\pi_{\theta}(\mathbf{s}_{t})$
11:         $\mathbf{s}_{t+1},r_{t}\sim p_{\psi}(\cdot\mid\mathbf{s}_{t},\mathbf{a}_{t})$ , where $p_{\psi}\sim\{p_{\psi_{1}},\cdots,p_{\psi_{M}}\}$
12:     // Generate imaginary data, $\tau=\{(\mathbf{s}_{0},\mathbf{a}_{0},r_{0},\cdots,\mathbf{s}_{H-1},\mathbf{a}_{H-1},r_{H-1},\mathbf{s}_{H})_{i}\}$
13:     $\mathbf{s}_{0}\sim\mathcal{D}_{\text{model}}$ $\triangleright$ start imaginary rollout from any state in $\mathcal{D}_{\text{model}}$
14:     for $t=0,\ldots,H-1$ do
15:         $\mathbf{a}_{t}=\pi_{\theta}(\mathbf{s}_{t})$
16:         $\mathbf{s}_{t+1},r_{t}\sim p_{\psi}(\cdot\mid\mathbf{s}_{t},\mathbf{a}_{t})$ , where $p_{\psi}\sim\{p_{\psi_{1}},\cdots,p_{\psi_{M}}\}$
17:     // Update critic using both offline and model-generated data
18:     Update critic $Q_{\phi}$ to minimize $\mathcal{L}^{\lambda}_{Q}(\phi)$ in Eq. 7 using $\tau$ and $\{\mathbf{s},\mathbf{a},r,\mathbf{s}^{\prime}\}\sim\mathcal{D}_{\text{env}}$
19:     // Update actor using only model-generated data
20:     Update actor $\pi_{\theta}$ to minimize $\hat{\mathcal{L}}^{\lambda}_{\pi}(\theta)$ in Eq. 12 using $\tau$


4.1 Lower expectile Q-learning

Most offline RL algorithms primarily focus on learning a conservative value function for out-of-distribution actions. In this paper, we propose Lower Expectile Q-learning (LEQ), which learns a conservative Q-function via expectile regression with small $\tau$ , avoiding unreliable uncertainty estimation and exhaustive Q-value estimation.

As illustrated in Figure 1, the target value for $Q_{\phi}(\mathbf{s},\mathbf{a})$ , where $\mathbf{a}\leftarrow\pi_{\theta}(\mathbf{s})$ , can be estimated by rolling out an ensemble of world models and averaging $r(\mathbf{s},\mathbf{a})+\gamma Q_{\phi}(\mathbf{s}^{\prime},\mathbf{a}^{\prime})$ over all possible $\mathbf{s}^{\prime}$ :

$$\hat{y}_{\text{model}}=\mathbb{E}_{\psi\sim\{\psi_{1},...,\psi_{M}\}}\mathbb{E}_{(\mathbf{s}^{\prime},r)\sim p_{\psi}(\cdot\mid\mathbf{s},\mathbf{a})}\left[r+\gamma Q_{\phi}(\mathbf{s}^{\prime},\pi_{\theta}(\mathbf{s}^{\prime}))\right]. \tag{2}$$

This target value has three error sources: the predicted future state and reward, $\mathbf{s}^{\prime},r\sim p_{\psi}(\cdot\mid\mathbf{s},\mathbf{a})$ , and the future Q-value, $Q_{\phi}(\mathbf{s}^{\prime},\pi_{\theta}(\mathbf{s}^{\prime}))$ . Thus, the target value computed from model-generated data, $\hat{y}_{\text{model}}$ , is more prone to overestimation than the original target Q-value, $\hat{y}_{\text{env}}$ , computed from $(\mathbf{s},\mathbf{a},r,\mathbf{s}^{\prime})\sim\mathcal{D}_{\text{env}}$ :

$$\hat{y}_{\text{env}}=r+\gamma Q_{\phi}(\mathbf{s}^{\prime},\pi_{\theta}(\mathbf{s}^{\prime})). \tag{3}$$

To mitigate the overestimation problem in estimating the true Q-value from $H$ -step inaccurate world model rollouts, we propose to use expectile regression on target Q-value estimation with small $\tau$ ; as illustrated in Figure 1, expectile regression with small $\tau$ tends to choose a target Q-value that is lower than the expectation, effectively providing a conservative estimate of the target Q-value. Another advantage of using expectile regression is that we do not have to exhaustively evaluate Q-values to get $\tau$ -expectiles as in Jeong et al. [14]; instead, we can perform conservative estimation using sampling:

$$\mathcal{L}_{Q,\text{model}}(\phi)=\mathbb{E}_{\mathbf{s}_{0}\in\mathcal{D}_{\text{model}},\tau\sim p_{\psi},\pi_{\theta}}\left[\frac{1}{H}\sum_{t=0}^{H-1}L_{2}^{\tau}(Q_{\phi}(\mathbf{s}_{t},\pi_{\theta}(\mathbf{s}_{t}))-\hat{y}_{\text{model}})\right]. \tag{4}$$
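The sampling argument can be illustrated with a toy Python sketch in which the critic is a single scalar for one fixed state-action pair and each update sees only one sampled target, standing in for $r+\gamma Q_{\phi}(\mathbf{s}^{\prime},\pi_{\theta}(\mathbf{s}^{\prime}))$ from one randomly chosen ensemble member. The function name `lower_expectile_sgd`, the step size, the number of steps, and the toy target values are all assumptions of this sketch, not settings from our experiments.

```python
import random


def lower_expectile_sgd(sample_target, tau, steps=20000, lr=0.01, seed=0):
    """Stochastic gradient descent on the asymmetric squared loss.

    Each step draws a single target and never builds the full target
    distribution. Returns the average estimate over the second half of
    training to smooth out step-to-step noise.
    """
    rng = random.Random(seed)
    q = 0.0
    tail_sum, tail_count = 0.0, 0
    for step in range(steps):
        y = sample_target(rng)
        # Overestimates (q > y) are penalized with weight 1 - tau,
        # underestimates with weight tau.
        weight = abs(tau - (1.0 if q > y else 0.0))
        q -= lr * 2.0 * weight * (q - y)  # gradient of weight * (q - y)^2
        if step >= steps // 2:
            tail_sum += q
            tail_count += 1
    return tail_sum / tail_count


def sample_toy_target(rng):
    """Hypothetical one-step targets predicted by five ensemble members."""
    return rng.choice([0.0, 1.0, 2.0, 3.0, 10.0])


q_conservative = lower_expectile_sgd(sample_toy_target, tau=0.1)
q_expectation = lower_expectile_sgd(sample_toy_target, tau=0.5)
```

With $\tau=0.5$ the estimate settles near the mean of the targets ( $3.2$ ), while $\tau=0.1$ settles near their $0.1$ -expectile ( $8/7$ ), so the single optimistic target of $10$ has little influence on the learned value.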

In addition, the Q-function is also trained on the offline data $\mathcal{D}_{\text{env}}$ with the standard Bellman update:

$$\mathcal{L}_{Q,\text{env}}(\phi)=\mathbb{E}_{(\mathbf{s},\mathbf{a},r,\mathbf{s}^{\prime})\in\mathcal{D}_{\text{env}}}\left[\frac{1}{2}(Q_{\phi}(\mathbf{s},\mathbf{a})-\hat{y}_{\text{env}})^{2}\right]. \tag{5}$$

To stabilize training of the Q-function, we adopt EMA regularization [11], which prevents drastic changes of Q-values by penalizing the difference between the Q-predictions and those of an exponential moving average of the Q-function:

$$\mathcal{L}_{Q,\text{EMA}}(\phi)=\mathbb{E}_{(\mathbf{s},\mathbf{a})\in\mathcal{D}_{\text{env}}}\left[(Q_{\phi}(\mathbf{s},\mathbf{a})-Q_{\bar{\phi}}(\mathbf{s},\mathbf{a}))^{2}\right], \tag{6}$$

where $\bar{\phi}$ is an exponential moving average of $\phi$ . Note that by using EMA regularization, we do not use the target Q-network for Equations 2 and 3.

Finally, by combining the three losses above, we define the critic loss as follows:

$$\mathcal{L}_{Q}(\phi)=\beta\mathcal{L}_{Q,\text{model}}(\phi)+(1-\beta)\mathcal{L}_{Q,\text{env}}(\phi)+\beta_{\text{EMA}}\mathcal{L}_{Q,\text{EMA}}(\phi). \tag{7}$$
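As a rough sketch of how the three terms fit together, the Python snippet below combines already-computed scalar losses as in Equation 7 and shows one way to maintain the averaged parameters $\bar{\phi}$ . The helper names, the list-of-floats parameter representation, and the numeric values of `beta`, `beta_ema`, and `decay` are hypothetical and chosen only for illustration.

```python
def ema_regularizer(q_pred, q_ema_pred):
    """Mean squared gap between current and EMA Q-predictions (Equation 6)."""
    gaps = [(q - q_bar) ** 2 for q, q_bar in zip(q_pred, q_ema_pred)]
    return sum(gaps) / len(gaps)


def critic_loss(loss_model, loss_env, loss_ema, beta, beta_ema):
    """Weighted combination of the three critic losses (Equation 7)."""
    return beta * loss_model + (1.0 - beta) * loss_env + beta_ema * loss_ema


def ema_update(ema_params, params, decay):
    """Move the averaged parameters a small step toward the current ones."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Hypothetical numbers, only to show how the pieces are wired together.
loss_ema = ema_regularizer([1.0, 2.0], [1.5, 2.0])
total_loss = critic_loss(2.0, 4.0, loss_ema, beta=0.25, beta_ema=1.0)
phi_bar = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9)
```

Here $\beta$ trades off model-generated against offline transitions, and the regularizer in Equation 6 takes over the stabilizing role that a separate target network would otherwise play.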

4.2 Lower expectile Q-learning with $\lambda$ -return

To further improve LEQ for long-horizon tasks, we use $\lambda$ -return instead of $1$ -step return for Q-learning. $\lambda$ -return allows a Q-function and policy to learn from low-bias multi-step returns [28]. Reducing bias in value estimation with $\lambda$ -return is especially important on long-horizon tasks where values for nearby states are similar to each other, as illustrated in Figure 2.

We first define the $\lambda$ -return of a trajectory $\tau$ at timestep $t$ , $Q_{t}^{\lambda}(\tau)$ , using the $N$ -step return, $G_{t:t+N}(\tau)$ (our $\lambda$ -return differs slightly from that of [31, 11], which puts a high weight on the last $N$ -step return, $G_{t:H}(\tau)$ ):

$$G_{t:t+N}(\tau)=\sum_{i=0}^{N-1}\gamma^{i}r(\mathbf{s}_{t+i},\mathbf{a}_{t+i})+\gamma^{N}Q_{\phi}(\mathbf{s}_{t+N},\mathbf{a}_{t+N}), \tag{8}$$

$$Q_{t}^{\lambda}(\tau)=\frac{1-\lambda}{1-\lambda^{H-t}}\sum_{i=1}^{H-t}\lambda^{i-1}G_{t:t+i}(\tau). \tag{9}$$

Then, we can replace the $1$ -step return in the Q-learning loss of Equation 4 with the $\lambda$ -return:

$$\mathcal{L}^{\lambda}_{Q,\text{model}}(\phi)=\mathbb{E}_{\mathbf{s}_{0}\in\mathcal{D}_{\text{model}},\tau\sim p_{\psi},\pi_{\theta}}\left[\sum_{t=0}^{H-1}L_{2}^{\tau}(Q_{\phi}(\mathbf{s}_{t},\pi_{\theta}(\mathbf{s}_{t}))-Q^{\lambda}_{t}(\tau))\right]. \tag{10}$$
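The $\lambda$ -return of a single imagined rollout can be sketched in a few lines of Python. The sketch assumes `rewards` holds $r_{0},\ldots,r_{H-1}$ and `q_values` holds $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})$ for $t=0,\ldots,H$ ; the function names are hypothetical, and the weights $\lambda^{i-1}$ are normalized by dividing by their sum, which is an assumption of this sketch about the intended normalization in Equation 9.

```python
def n_step_return(rewards, q_values, t, n, gamma):
    """G_{t:t+n}: n discounted rewards plus a bootstrapped Q-value (Equation 8)."""
    discounted = sum(gamma ** i * rewards[t + i] for i in range(n))
    return discounted + gamma ** n * q_values[t + n]


def lambda_return(rewards, q_values, t, gamma, lam):
    """Normalized lambda-weighted average of all n-step returns from step t."""
    horizon = len(rewards)  # imagination length H
    steps = range(1, horizon - t + 1)
    weights = [lam ** (i - 1) for i in steps]
    returns = [n_step_return(rewards, q_values, t, i, gamma) for i in steps]
    return sum(w * g for w, g in zip(weights, returns)) / sum(weights)


# Toy rollout with H = 3: constant reward and a constant critic prediction.
rewards = [1.0, 1.0, 1.0]
q_values = [5.0, 5.0, 5.0, 5.0]
targets = [lambda_return(rewards, q_values, t, gamma=1.0, lam=0.5) for t in range(3)]
```

Setting `lam=0.0` recovers the $1$ -step target, and larger values shift weight toward longer returns. When the critic is already consistent with the rewards, every $N$ -step return agrees and the $\lambda$ -return equals that common value for any $\lambda$ .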

4.3 Lower expectile policy learning with $\lambda$ -return

For policy optimization, we use a deterministic policy $\mathbf{a}=\pi_{\theta}(\mathbf{s})$ and update the policy using deterministic policy gradients, similar to DDPG [22] (LEQ also works with a stochastic policy, but a deterministic policy is easier to train in an offline setup). Instead of maximizing the immediate Q-value, $Q_{\phi}(\mathbf{s},\mathbf{a})$ , we propose to directly maximize the lower expectile of $\lambda$ -return, which is a more accurate learning target for a policy, analogous to the conservative critic target in Section 4.2:

$$\mathcal{L}_{\pi}^{\lambda}(\theta)=-\mathbb{E}_{\mathbf{s}_{0}\in\mathcal{D}_{\text{model}},\tau\sim p_{\psi},\pi_{\theta}}\left[\sum_{t=0}^{H-1}\mathbb{E}^{\tau}_{\tau\sim p_{\psi},\pi_{\theta}}\left[Q_{t}^{\lambda}(\tau)\right]\right]. \tag{11}$$

However, due to the expectile term in Equation 11, computing the gradient of $\mathcal{L}_{\pi}^{\lambda}(\theta)$ is not trivial. To estimate this gradient, we propose a differentiable surrogate loss, approximating $\mathbb{E}^{\tau}_{\tau\sim p_{\psi},\pi_{\theta}}\left[Q^{\lambda}_{t}(\tau)\right]$ in Equation 11 with $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})$ :

$$\hat{\mathcal{L}}_{\pi}^{\lambda}(\theta)=-\mathbb{E}_{\mathbf{s}_{0}\in\mathcal{D}_{\text{model}},\tau\sim p_{\psi},\pi_{\theta}}\left[\sum_{t=0}^{H-1}\lvert\tau-\mathbbm{1}\left(Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q^{\lambda}_{t}(\tau)\right)\rvert\cdot Q_{t}^{\lambda}(\tau)\right]. \tag{12}$$

Intuitively, this surrogate loss sets a higher weight ( $1-\tau$ ) on a conservative $\lambda$ -return estimate (i.e. $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q^{\lambda}_{t}(\tau)$ ), encouraging the policy to optimize for this conservative $\lambda$ -return. On the other hand, an optimistic $\lambda$ -return estimate (i.e. $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})<Q^{\lambda}_{t}(\tau)$ ) has less impact on the policy because of its smaller weight ( $\tau$ ). Thus, optimizing this surrogate loss leads to the policy maximizing the lower expectile of the $\lambda$ -return. We provide a proof in Appendix B that the proposed surrogate loss is a better approximation of Equation 11 than directly maximizing Q-values, $Q_{\phi}(\mathbf{s},\mathbf{a})$ .
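The weighting in the surrogate loss can be sketched for a single rollout as follows. The inputs are plain Python lists of critic predictions $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})$ and $\lambda$ -returns $Q^{\lambda}_{t}(\tau)$ ; the function name and the toy numbers are hypothetical. In an automatic-differentiation framework, one would presumably treat the weights as constants and let gradients flow only through the $\lambda$ -returns, which is an assumption of this sketch rather than a detail stated above.

```python
def surrogate_policy_loss(q_values, lambda_returns, tau):
    """Negative expectile-weighted sum of lambda-returns for one rollout.

    q_values[t] is used only to decide which weight a step receives; the
    quantity being maximized is the lambda-return itself.
    """
    weights = []
    total = 0.0
    for q, ret in zip(q_values, lambda_returns):
        # Conservative return (below the critic): weight 1 - tau.
        # Optimistic return (above the critic): weight tau.
        weight = abs(tau - (1.0 if q > ret else 0.0))
        weights.append(weight)
        total += weight * ret
    return -total, weights


# Step 0 has a conservative return (0.5 < 1.0); step 1 an optimistic one (2.0 > 1.0).
loss, weights = surrogate_policy_loss([1.0, 1.0], [0.5, 2.0], tau=0.1)
```

In this toy example the conservative step receives weight $0.9$ and the optimistic step weight $0.1$ , so improving the pessimistic outcome contributes nine times more per unit of return to the policy objective.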

4.4 Expanding dataset with model rollouts

One of the problems of offline RL is that the data distribution is limited to the offline dataset $\mathcal{D}_{\text{env}}$ . To tackle this problem, we can simulate the current policy inside the model and use the generated trajectories to expand the dataset, similar to prior works [35, 30]. However, once the policy converges, the generated trajectories cover nearly identical states, which might lead to catastrophic forgetting.

To prevent this issue, we expand the dataset using the model rollouts from a noisy exploration policy. Specifically, we execute the exploration policy $\pi_{\exp}(\cdot\mid\mathbf{s})$ , which simply adds noise $\epsilon\sim N(0,\sigma_{\text{exp}}^{2})$ to the current policy $\pi_{\theta}(\mathbf{s})$ , and generate a trajectory of length $R$ ( $R=5$ in this paper). We refer to this expanded dataset as $\mathcal{D}_{\text{model}}$ . Note that we do not use off-policy actions and rewards for critic and policy updates; instead, we generate $H$ -step model rollouts starting from $\mathbf{s}\sim\mathcal{D}_{\text{model}}$ and use them for training the policy and Q-function. Thus, we need to store only the states from the rollouts.
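The expansion step can be sketched on a toy one-dimensional system. The scalar policy, the additive dynamics, the noise scale, and the function names below are hypothetical stand-ins for $\pi_{\theta}$ , the learned model $p_{\psi}$ , and $\sigma_{\text{exp}}$ ; only the visited states are kept, as described above.

```python
import random


def expand_dataset(start_state, policy, model_step, rollout_len, noise_std, rng):
    """Roll out a noisy version of the policy in the model; return visited states."""
    visited = []
    state = start_state
    for _ in range(rollout_len):
        visited.append(state)  # store the state only, not the action or reward
        action = policy(state) + rng.gauss(0.0, noise_std)  # exploration noise
        state = model_step(state, action)
    return visited


def toy_policy(state):
    """Hypothetical deterministic policy that pushes the state toward zero."""
    return -0.5 * state


def toy_model_step(state, action):
    """Hypothetical learned dynamics: the next state is state plus action."""
    return state + action


rng = random.Random(0)
d_model = []
d_model.extend(expand_dataset(1.0, toy_policy, toy_model_step,
                              rollout_len=5, noise_std=0.1, rng=rng))
```

Each call contributes $R$ states (here `rollout_len=5`) that later serve as starting points for the $H$ -step imaginary rollouts; with `noise_std=0.0` the same call would simply retrace the current policy's trajectory.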

5 Experiments

In this paper, we propose a novel model-based offline RL method with simple and efficient yet accurate conservative value estimation. Through our experiments, we aim to answer the following questions: (1) Can LEQ solve long-horizon tasks? (2) How does LEQ perform in widely used offline RL benchmarks? (3) Which component enables model-based offline RL to learn the AntMaze tasks?

5.1 Tasks

To show the strength of LEQ in solving long-horizon tasks, we use the AntMaze tasks, which aim to navigate an $8$ -DOF ant robot to the desired goal position, as shown in Figure 4. Specifically, we use the umaze, medium, and large datasets from D4RL [6], and the ultra dataset from Jiang et al. [15]. Moreover, we evaluate our method on MuJoCo locomotion tasks (Figure 4) with dense rewards, using the D4RL [6] and NeoRL [26] datasets. Please refer to Appendix A for more experimental details.

5.2 Compared offline RL algorithms

We compare the performance of LEQ with the state-of-the-art offline RL algorithms. Please note that LEQ uses the same hyperparameters across all tasks, except the expectile parameter, $\tau$ .

Model-free offline RL.

We consider behavioral cloning (BC) [25]; TD3+BC [7], which adds a BC loss to TD3; CQL [19], which penalizes actions outside the data distribution; and IQL [17], which utilizes expectile regression to estimate the value function. For locomotion tasks, we also compare with EDAC [1], which penalizes the Q-values according to the uncertainty of Q-functions.

Model-based offline RL.

We consider MOPO [34] and MOBILE [30], which penalize Q-values according to the transition uncertainty and the Bellman uncertainty of a world model, respectively; COMBO [35], which combines CQL with MBPO; RAMBO [27], which trains an adversarial world model against the policy; and CBOP [14], which utilizes multi-step returns for critic updates.

5.3 Results on long-horizon AntMaze tasks

As shown in Table 1, LEQ significantly outperforms the prior model-based approaches for all $8$ datasets. LEQ achieves $58.6$ and $60.2$ for antmaze-large-play and antmaze-large-diverse, while the second best method, RAMBO [27], scores only $0.0$ and $2.4$ , respectively. We believe these performance gains come from our conservative value estimation, which is more stable than the uncertainty-based penalization of prior works.

Moreover, LEQ even significantly outperforms the model-free approaches in antmaze-umaze, antmaze-large, and antmaze-ultra. Despite its superior performance, LEQ often shows high variance during training, resulting in worse performance on antmaze-medium. Over the course of training, LEQ mostly achieves high success rates, but the evaluation results sometimes drop to $0\%$ , as shown in Appendix, Figure 5. We leave reducing the high variance of our method in certain environments as future work.

Table 1: AntMaze results. Each number represents the average success rate on $100$ trials over different seeds. The results for LEQ, MOBILE, and CBOP are averaged over $5$ seeds. The results for other methods are reported following their respective papers.

	Model-free				Model-based
Dataset	BC	TD3+BC	CQL	IQL	MOPO	COMBO	RAMBO	MOBILE^†	CBOP^†	LEQ (ours)
antmaze-umaze	$65.0$	$78.6$	$74.0$	$87.5$	$0.0$	$80.3$	$25.0$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$\mathbf{94.4}$ $\mathbf{\pm 6.3}$
antmaze-umaze-diverse	$55.0$	$71.4$	$84.0$	$62.2$	$0.0$	$57.3$	$0.0$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$\mathbf{71.0}$ $\mathbf{\pm 12.3}$
antmaze-medium-play	$0.0$	$3.0$	$61.2$	$\mathbf{71.2}$	$0.0$	$0.0$	$16.4$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$58.8$ $\pm 33.0$
antmaze-medium-diverse	$0.0$	$10.6$	$53.7$	$\mathbf{70.0}$	$0.0$	$0.0$	$23.2$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$46.2$ $\pm 23.2$
antmaze-large-play	$0.0$	$0.0$	$15.8$	$39.6$	$0.0$	$0.0$	$0.0$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$\mathbf{58.6}$ $\mathbf{\pm 9.1}$
antmaze-large-diverse	$0.0$	$0.2$	$14.9$	$47.5$	$0.0$	$0.0$	$2.4$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$\mathbf{60.2}$ $\mathbf{\pm 18.3}$
antmaze-ultra-play	$-$	$-$	$-$	$8.3$	$-$	$-$	$-$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$\mathbf{25.8}$ $\mathbf{\pm 18.2}$
antmaze-ultra-diverse	$-$	$-$	$-$	$15.6$	$-$	$-$	$-$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$\mathbf{55.8}$ $\mathbf{\pm 18.3}$
Total w/o antmaze-ultra	$120.0$	$163.8$	$303.6$	$354.1$	$0.0$	$137.6$	$67.0$	$0.0$	$0.0$	$\mathbf{388.8}$
Total	$-$	$-$	$-$	$378.0$	$-$	$-$	$-$	$0.0$	$0.0$	$\mathbf{470.4}$

^† We use the official implementations of MOBILE and CBOP.

5.4 Results on MuJoCo Gym locomotion tasks

For D4RL MuJoCo Gym tasks in Table 3, LEQ achieves comparable results with the best score of prior works in $6$ out of $12$ tasks. Furthermore, in Table 2, LEQ outperforms most of the prior works in the NeoRL benchmark, especially in the Hopper and Walker2d domains. These results show that LEQ serves as a general offline RL algorithm, not limited to long-horizon tasks.

Similar to antmaze-medium, LEQ also suffers from the high variance problem. During training, LEQ often achieves high performance, but then, suddenly falls back to $0$ , as shown in Appendix, Figure 5. This is mainly because the learned models sometimes fail to capture failures (e.g. hopper and walker falling off) and predict an optimistic future (e.g. hopper and walker walking forward).

Table 2: NeoRL results. LEQ and IQL results are averaged over $5$ seeds. The results for prior works are reported following Sun et al. [30] and Qin et al. [26]. MOPO^∗ is an improved version of MOPO presented in Sun et al. [30]. We highlight the results that are better than $95\%$ of the best score.

	Model-free					Model-based
Dataset	BC	TD3+BC	CQL	EDAC	IQL	MOPO^∗	MOBILE	LEQ (ours)
Hopper-L	$15.1$	$15.8$	$16.0$	$18.3$	$16.7$	$6.2$	$17.4$	$\mathbf{24.2}$ $\pm 2.3$
Hopper-M	$51.3$	$70.3$	$64.5$	$44.9$	$28.4$	$1.0$	$51.1$	$\mathbf{104.3}$ $\pm 5.2$
Hopper-H	$43.1$	$75.3$	$76.6$	$52.5$	$22.3$	$11.5$	$87.8$	$\mathbf{95.5}$ $\pm 13.9$
Walker2d-L	$28.5$	$43.0$	$44.7$	$40.2$	$30.7$	$11.6$	$37.6$	$\mathbf{65.1}$ $\pm 2.3$
Walker2d-M	$48.7$	$58.5$	$57.3$	$57.6$	$51.8$	$39.9$	$\mathbf{62.2}$	$45.2$ $\pm 19.4$
Walker2d-H	$\mathbf{72.6}$	$69.6$	$\mathbf{75.3}$	$\mathbf{75.5}$	$\mathbf{76.3}$	$18.0$	$\mathbf{74.9}$	$\mathbf{73.7}$ $\pm 1.1$
HalfCheetah-L	$29.1$	$30.0$	$38.2$	$31.3$	$30.7$	$40.1$	$\mathbf{54.7}$	$33.4$ $\pm 1.6$
HalfCheetah-M	$49.0$	$52.3$	$54.6$	$54.9$	$51.8$	$62.3$	$\mathbf{77.8}$	$59.2$ $\pm 3.9$
HalfCheetah-H	$71.4$	$75.3$	$77.4$	$\mathbf{81.4}$	$76.3$	$65.9$	$\mathbf{83.0}$	$71.8$ $\pm 8.0$
Total	$408.8$	$490.1$	$504.6$	$456.6$	$385.0$	$256.5$	$\mathbf{546.5}$	$\mathbf{572.4}$

Table 3: D4RL MuJoCo Gym results. Each number is a normalized score averaged over $100$ trials [6]. Our results are averaged over $5$ seeds. The results for prior works are reported following their respective papers. MOPO^∗ is an improved version of MOPO, introduced in Sun et al. [30]. We highlight the results that are better than $95\%$ of the best score.

	Model-free					Model-based
Dataset	BC	TD3+BC	CQL	EDAC	IQL	MOPO^∗	COMBO	RAMBO	MOBILE	CBOP	LEQ (ours)
hopper-r	$3.7$	$8.5$	$5.3$	$25.3$	$7.6$	$31.7$	$17.9$	$25.4$	$\mathbf{31.9}$	$\mathbf{32.8}$	$\mathbf{32.4}$ $\pm 0.3$
hopper-m	$54.1$	$59.3$	$61.9$	$\mathbf{101.6}$	$66.3$	$62.8$	$97.2$	$87.0$	$\mathbf{106.6}$	$\mathbf{102.6}$	$\mathbf{103.4}$ $\pm 0.3$
hopper-mr	$16.6$	$60.9$	$86.3$	$\mathbf{101.0}$	$94.7$	$\mathbf{99.4}$	$\mathbf{103.5}$	$89.5$	$99.5$	$\mathbf{104.3}$	$\mathbf{103.9}$ $\pm 1.3$
hopper-me	$53.9$	$98.0$	$96.9$	$\mathbf{110.7}$	$91.5$	$81.6$	$\mathbf{111.1}$	$88.2$	$\mathbf{112.6}$	$\mathbf{111.6}$	$\mathbf{109.4}$ $\pm 1.8$
walker2d-r	$1.3$	$1.6$	$5.4$	$16.6$	$5.2$	$7.4$	$7.0$	$0.0$	$17.9$	$17.8$	$\mathbf{21.5}$ $\pm 0.1$
walker2d-m	$70.9$	$83.7$	$79.5$	$\mathbf{92.5}$	$78.3$	$81.3$	$84.1$	$81.9$	$84.9$	$\mathbf{87.7}$	$74.9$ $\pm 26.9$
walker2d-mr	$20.3$	$81.8$	$76.8$	$87.1$	$73.9$	$85.6$	$56.0$	$89.2$	$89.9$	$92.7$	$\mathbf{98.7}$ $\pm 6.0$
walker2d-me	$90.1$	$110.1$	$109.1$	$\mathbf{114.7}$	$109.6$	$\mathbf{112.9}$	$103.3$	$56.7$	$\mathbf{115.2}$	$\mathbf{117.2}$	$108.2$ $\pm 1.3$
halfcheetah-r	$2.2$	$11.0$	$31.3$	$28.4$	$11.8$	$\mathbf{38.5}$	$\mathbf{38.8}$	$\mathbf{39.5}$	$\mathbf{39.3}$	$32.8$	$30.8$ $\pm 3.3$
halfcheetah-m	$43.2$	$48.3$	$46.9$	$65.9$	$47.4$	$73.0$	$54.2$	$\mathbf{77.9}$	$\mathbf{74.6}$	$\mathbf{74.3}$	$71.7$ $\pm 4.4$
halfcheetah-mr	$37.6$	$44.6$	$45.3$	$61.3$	$44.2$	$\mathbf{72.1}$	$55.1$	$\mathbf{68.7}$	$\mathbf{71.7}$	$66.4$	$65.5$ $\pm 1.1$
halfcheetah-me	$44.0$	$90.7$	$95.0$	$\mathbf{106.3}$	$86.7$	$90.8$	$90.0$	$95.4$	$\mathbf{108.2}$	$\mathbf{105.4}$	$\mathbf{102.8}$ $\pm 0.4$
Total	$437.9$	$698.5$	$739.7$	$911.4$	$717.2$	$844.0$	$802.0$	$812.4$	$\mathbf{959.5}$	$\mathbf{953.4}$	$\mathbf{923.2}$

5.5 Ablation studies

To understand why LEQ (LEQ- $\lambda$ ) works well in long-horizon tasks, we conduct ablation studies and answer the following four questions: (1) Does using $\lambda$ -return help? (2) Is LEQ better than prior uncertainty-based penalization methods? (3) Which factor enables LEQ to work in AntMaze? and (4) How do imagination length $H$ and data expansion length $R$ affect the performance?

(1) $\lambda$ -returns.

To verify the effect of $\lambda$ -return, we compare our method (LEQ- $\lambda$ ) with the versions with $1$ -step return (LEQ- $1$ ) and $H$ -step return (LEQ- $H$ ). Table 4 shows that using $\lambda$ -return drastically improves the performance on AntMaze compared to using $1$ -step return or $H$ -step return. This result is consistent with the observations in prior online RL methods [28, 11].

(2) Lower expectile Q-learning.

We compare our lower expectile Q-learning with another conservative value estimator, MOBIP used in MOBILE [30], which penalizes Q-values with the standard deviation of Q-ensemble networks. The only difference between LEQ and MOBIP is their target Q-value computation for both critic and policy updates. Table 4 shows that using MOBIP not only deteriorates the success rates (in MOBIP- $1$ ) but also does not benefit from $\lambda$ -return (in MOBIP- $\lambda$ ).

(3) What makes offline model-based RL work in AntMaze?

Prior to LEQ, no offline model-based RL method worked in AntMaze, whereas our method even outperforms model-free methods. Thus, we investigate which changes in LEQ enable offline model-based RL to work in AntMaze.

Table 4: Ablation study results on the AntMaze tasks. (1) We compare different Q-targets, LEQ-$\lambda$, LEQ-$1$, and LEQ-$H$. (2) We compare our lower expectile Q-learning strategy with another conservative Q-value estimation, MOBIP-$1$ and MOBIP-$\lambda$. (3) MOBILE^∗ is our re-implementation of MOBILE using MOBILE’s default hyperparameters, i.e., $\beta=0.95$, $\gamma=0.99$, and $R=5$. We note that lowering $\beta$ to $0.25$ is crucial for MOBILE^∗ to achieve meaningful scores in AntMaze.

Method	umaze	umaze-diverse	medium-play	medium-diverse	large-play	large-diverse	ultra-play	ultra-diverse	Total
LEQ- $\lambda$ (ours)	$\mathbf{94.4}$ $\pm 6.3$	$\mathbf{71.0}$ $\pm 12.3$	$58.8$ $\pm 33.0$	$46.2$ $\pm 23.2$	$\mathbf{58.6}$ $\pm 9.1$	$\mathbf{60.2}$ $\pm 18.3$	$25.8$ $\pm 18.2$	$\mathbf{55.8}$ $\pm 18.3$	$\mathbf{470.4}$
LEQ- $H$	$\mathbf{93.0}$ $\pm 3.4$	$60.7$ $\pm 10.4$	$46.3$ $\pm 32.4$	$0.0$ $\pm 0.0$	$\mathbf{57.0}$ $\pm 25.6$	$33.3$ $\pm 43.0$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$290.3$
LEQ- $1$	$89.6$ $\pm 4.8$	$37.0$ $\pm 32.8$	$55.8$ $\pm 28.7$	$29.8$ $\pm 24.5$	$34.2$ $\pm 13.4$	$49.3$ $\pm 9.0$	$\mathbf{42.2}$ $\pm 13.2$	$35.6$ $\pm 13.0$	$373.5$
MOBIP- $\lambda$	$84.3$ $\pm 3.5$	$40.3$ $\pm 20.4$	$51.3$ $\pm 9.0$	$39.7$ $\pm 12.5$	$28.3$ $\pm 21.5$	$33.7$ $\pm 10.0$	$38.0$ $\pm 27.1$	$23.3$ $\pm 4.9$	$338.9$
MOBIP-1	$59.5$ $\pm 3.5$	$46.5$ $\pm 1.5$	$57.0$ $\pm 11.0$	$\mathbf{54.0}$ $\pm 9.0$	$23.5$ $\pm 19.5$	$38.5$ $\pm 1.5$	$39.5$ $\pm 11.5$	$20.5$ $\pm 20.5$	$339.0$
MOBILE^∗	$1.0$ $\pm 2.0$	$0.0$ $\pm 0.0$	$6.4$ $\pm 5.5$	$5.0$ $\pm 5.0$	$0.8$ $\pm 1.6$	$0.8$ $\pm 1.2$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$14.0$
MOBILE^∗ ( $\mathbf{\beta}$ = 0.25)	$77.0$ $\pm 6.4$	$20.4$ $\pm 15.7$	$\mathbf{64.6}$ $\pm 11.1$	$31.6$ $\pm 16.9$	$2.6$ $\pm 2.8$	$7.2$ $\pm 8.9$	$4.6$ $\pm 3.0$	$5.0$ $\pm 4.6$	$213.0$
MOBILE^∗ ( $\mathbf{\gamma}$ = 0.997)	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$7.2$ $\pm 4.1$	$1.6$ $\pm 2.1$	$9.6$ $\pm 7.1$	$5.4$ $\pm 4.9$	$0.0$ $\pm 0.0$	$1.8$ $\pm 2.7$	$25.6$
MOBILE^∗ ( $R$ = 10)	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$5.0$ $\pm 5.1$	$0.6$ $\pm 1.2$	$7.4$ $\pm 14.8$	$1.6$ $\pm 3.2$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$14.6$

We first re-implement MOBILE with some technical tricks used in LEQ: LayerNorm [3], SymLog [11], a single Q-network, and no target Q-value clipping. However, this re-implementation, MOBILE^∗, achieves a barely non-zero total score of $14.0$. We found that the key to making MOBILE^∗ work is reducing $\beta$, the mixing ratio between the loss calculated from imaginary rollouts and the loss calculated from dataset transitions. When we lower $\beta$ from $0.95$ to $0.25$ (the value used in LEQ), MOBILE^∗ shows meaningful performance in the umaze and medium mazes and achieves $213.0$ in total. We suggest that utilizing the true transitions from the dataset is important in long-horizon tasks, which was undervalued in prior work.

(4) Imagination length $H$ and dataset expansion length $R$ .

As shown in Table 5, the performance increases when $H$ goes from $5$ to $10$, but it drops when $H=15$. This result shows the trade-off of using the world model: the further the agent imagines, the more robust it becomes to errors of the critic, but the more prone it becomes to errors in the model predictions.

We also evaluate LEQ without the dataset expansion ( $H=10,R=1$ ). In AntMaze, the results with and without the dataset expansion are similar, as shown in Table 5. On the other hand, the dataset expansion makes the policy more stable and better in the D4RL MuJoCo tasks (Table 13).

Table 5: LEQ with different imagination length $H$ and data expansion length $R$. A longer $H$ can mitigate critic biases, while increasing model errors, which leads to poor performance. Each number is averaged over $5$ random seeds.

Dataset	$\mathbf{H=10,R=5}$ (ours)	$H=5,R=5$	$H=15,R=5$	$H=10,R=1$
antmaze-umaze	$\mathbf{94.4}$ $\pm 6.3$	$\mathbf{95.2}$ $\pm 1.7$	$\mathbf{98.6}$ $\pm 0.5$	$\mathbf{97.4}$ $\pm 1.4$
antmaze-umaze-diverse	$\mathbf{71.0}$ $\pm 12.3$	$\mathbf{67.2}$ $\pm 9.1$	$\mathbf{70.7}$ $\pm 15.2$	$63.0$ $\pm 23.2$
antmaze-medium-play	$58.8$ $\pm 33.0$	$46.4$ $\pm 31.9$	$\mathbf{76.3}$ $\pm 17.2$	$58.2$ $\pm 28.0$
antmaze-medium-diverse	$\mathbf{46.2}$ $\pm 23.2$	$18.6$ $\pm 28.7$	$30.3$ $\pm 40.1$	$28.6$ $\pm 33.7$
antmaze-large-play	$58.6$ $\pm 9.1$	$48.6$ $\pm 15.4$	$\mathbf{62.0}$ $\pm 9.9$	$56.0$ $\pm 9.8$
antmaze-large-diverse	$\mathbf{60.2}$ $\pm 18.3$	$35.2$ $\pm 8.7$	$33.0$ $\pm 3.2$	$\mathbf{57.0}$ $\pm 4.5$
antmaze-ultra-play	$25.8$ $\pm 18.2$	$\mathbf{54.2}$ $\pm 10.8$	$0.0$ $\pm 0.0$	$39.2$ $\pm 15.1$
antmaze-ultra-diverse	$\mathbf{55.8}$ $\pm 18.3$	$39.4$ $\pm 6.1$	$0.0$ $\pm 0.0$	$36.0$ $\pm 12.0$
Total	$\mathbf{470.4}$	$404.8$	$371.0$	$435.4$

6 Conclusion

In this paper, we propose a novel offline model-based reinforcement learning method, LEQ, which uses expectile regression to get a conservative evaluation of a policy from model-generated trajectories. Expectile regression eases the pain of constructing the whole distribution of Q-targets and allows for estimating the conservative value via sampling. Combined with $\lambda$ -returns in both critic and policy updates for the imaginary rollouts, the policy can receive learning signals that are more robust to both model errors and critic errors. We empirically show that LEQ improves the performance in various tasks – especially, achieving the state-of-the-art performance in the long-horizon AntMaze tasks.

6.1 Limitations

Following prior work on model-based offline RL [30, 14], we assume access to the ground-truth termination function of a task, unlike online model-based RL approaches, which learn a termination function from interactions. However, since this termination function is conditioned on a state, the model must plan in a state space (or an observation space), which could be challenging when the state space is high-dimensional (e.g. pixel observations). Extending the proposed approach to complex environments with high-dimensional observations would be an immediate next step.

6.2 Broader Impacts

Our method aims to increase the ability of autonomous agents, such as robots and self-driving cars, to learn from static, offline data without interacting with the world. This enables autonomous agents to utilize data with diverse qualities (not necessarily from experts). We believe that this paper does not have any immediate negative societal impact.

Acknowledgments and Disclosure of Funding

We would like to thank Junik Bae for helpful discussion. This work was supported in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (RS-2020-II201361, Artificial Intelligence Graduate School Program (Yonsei University)) and the National Research Foundation of Korea (NRF) grant (RS-2024-00333634) funded by the Korean Government (MSIT). Kwanyoung Park was supported by Electronics and Telecommunications Research Institute (ETRI).

References

An et al. [2021] Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble. Advances in neural information processing systems, 34:7436–7447, 2021.
Argenson and Dulac-Arnold [2021] Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. In International Conference on Learning Representations, 2021.
Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Ball et al. [2023] Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023.
Feinberg et al. [2018] Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
Fu et al. [2020] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Fujimoto and Gu [2021] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. In Neural Information Processing Systems, volume 34, pages 20132–20145, 2021.
Fujimoto et al. [2019] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019.
Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2019.
Hafner et al. [2021] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In International Conference on Learning Representations, 2021.
Hafner et al. [2023] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
Hessel et al. [2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Association for the Advancement of Artificial Intelligence, 2018.
Janner et al. [2019] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Neural Information Processing Systems, volume 32, 2019.
Jeong et al. [2023] Jihwan Jeong, Xiaoyu Wang, Michael Gimelfarb, Hyunwoo Kim, Baher Abdulhai, and Scott Sanner. Conservative bayesian model-based value expansion for offline policy optimization. In International Conference on Learning Representations, 2023.
Jiang et al. [2023] Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, and Yuandong Tian. Efficient planning in a compact latent action space. In International Conference on Learning Representations, 2023.
Kidambi et al. [2020] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. In Neural Information Processing Systems, volume 33, pages 21810–21823, 2020.
Kostrikov et al. [2022] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022.
Kumar et al. [2019] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Neural Information Processing Systems, volume 32, 2019.
Kumar et al. [2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In Neural Information Processing Systems, volume 33, pages 1179–1191, 2020.
Le et al. [2019] Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712. PMLR, 2019.
Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
Lillicrap et al. [2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
Park et al. [2024] Seohong Park, Dibya Ghosh, Benjamin Eysenbach, and Sergey Levine. Hiql: Offline goal-conditioned rl with latent states as actions. In Neural Information Processing Systems, volume 36, 2024.
Peng et al. [2019] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
Pomerleau [1989] Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.
Qin et al. [2022] Rong-Jun Qin, Xingyuan Zhang, Songyi Gao, Xiong-Hui Chen, Zewen Li, Weinan Zhang, and Yang Yu. NeoRL: A near real-world benchmark for offline reinforcement learning. In Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=jNdLszxdtra.
Rigter et al. [2022] Marc Rigter, Bruno Lacerda, and Nick Hawes. Rambo-rl: Robust adversarial model-based offline reinforcement learning. In Neural Information Processing Systems, volume 35, pages 16082–16097, 2022.
Schulman et al. [2016] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016.
Sun [2023] Yihao Sun. Offlinerl-kit: An elegant pytorch offline reinforcement learning library. https://github.com/yihaosun1124/OfflineRL-Kit, 2023.
Sun et al. [2023] Yihao Sun, Jiaji Zhang, Chengxing Jia, Haoxin Lin, Junyin Ye, and Yang Yu. Model-bellman inconsistency for model-based offline reinforcement learning. In International Conference on Machine Learning, pages 33177–33194. PMLR, 2023.
Sutton [1988] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3:9–44, 1988.
Sutton [1991] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991.
Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
Yu et al. [2020] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. In Neural Information Processing Systems, volume 33, pages 14129–14142, 2020.
Yu et al. [2021] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. In Neural Information Processing Systems, volume 34, pages 28954–28967, 2021.

Appendix A Training Details

Computing resources.

All experiments are done on a single RTX 4090 GPU and $4$ AMD EPYC 9354 CPU cores. We use $5$ different random seeds for each experiment and report the mean and standard deviation. Each offline RL experiment takes $2$ hours for ours, $12$ hours for MOBILE, and $24$ hours for CBOP.

Environment details.

For the locomotion tasks, we use the datasets provided by D4RL [6] and NeoRL [26]. Following IQL [17], we normalize rewards using the maximum and minimum return of all trajectories. We use the true termination functions of the environments, as implemented in MOBILE [30].

For the AntMaze tasks, we use the datasets provided by D4RL [6]. Following IQL [17], we subtract $1$ from the rewards in the datasets so that the agent receives $-1$ for each step and $0$ on termination. We use the true termination functions of the environments. The termination functions of the AntMaze tasks are not deterministic because the goal of a maze is randomized every time the environment is reset. Nevertheless, we follow the implementation of CBOP [14], where the termination region is set to a circle of radius $0.5$ around the mean of the goal distribution.
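The reward shift and the fixed termination region described above can be sketched as follows. This is a minimal illustration, not the CBOP or LEQ implementation: the function names, the use of a 2-D $(x, y)$ position, and the choice of an inclusive boundary ($\leq$) are our assumptions.

```python
import math

GOAL_RADIUS = 0.5  # radius of the termination region stated in the text


def shift_reward(reward):
    """Subtract 1 so that each step costs -1 and reaching the goal gives 0."""
    return reward - 1.0


def is_terminal(xy, goal_mean_xy, radius=GOAL_RADIUS):
    """Return True if the (x, y) position lies inside the circular goal region.

    goal_mean_xy is the mean of the goal distribution (hypothetical input);
    treating the boundary as inside the region is an assumption.
    """
    dx = xy[0] - goal_mean_xy[0]
    dy = xy[1] - goal_mean_xy[1]
    return math.hypot(dx, dy) <= radius
```

Using the mean of the goal distribution makes the termination function deterministic, so it can be evaluated on imagined states without knowing the goal sampled in a particular episode.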

Method implementation details.

For all compared methods, we use the results from their corresponding papers when available. For IQL [17], we run the official implementation with $5$ seeds to reproduce the results for the random datasets in D4RL and NeoRL. For the AntMaze tasks, we run the official implementation of MOBILE and CBOP with $5$ random seeds. Please note that the original MOBILE implementation does not use the true termination function, so we replace it with our termination function. For MOPO, COMBO, and RAMBO, we use the results reported in RAMBO [27].

World models.

For training world models, we use the architecture and training script from OfflineRL-Kit [29], matching the implementation of MOBILE [30]. Each world model is implemented as a $4$-layer MLP with a hidden layer size of $200$. We construct an ensemble of world models by selecting $5$ out of $7$ models with the best validation scores. We pretrain the ensemble of world models for each of $5$ random seeds (i.e. training $35$ world models in total and using $25$ of them), which takes approximately $5$ hours on average.

Policy and critic networks.

We use $3$ -layer MLPs with size of $256$ both for the policy network and the critic network. We use layer normalization [3] to prevent catastrophic over/underestimation [4], and squash the state inputs using symlog to keep training stable from outliers in long-horizon model rollouts [11].
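For reference, the symlog transform from [11] is $\text{symlog}(x)=\text{sign}(x)\ln(|x|+1)$. The short sketch below applies it to a scalar; the inverse `symexp` is included only to show that the squashing is lossless, and whether LEQ uses the inverse anywhere is not stated here.

```python
import math


def symlog(x):
    """sign(x) * ln(|x| + 1): compresses large magnitudes, keeps the sign."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


# An outlier state value of 1000 is squashed to below 7,
# while values near zero are left almost unchanged.
squashed = symlog(1000.0)
small = symlog(0.01)
```

Because the transform is approximately linear near zero and logarithmic for large inputs, occasional extreme states produced by long model rollouts have a bounded effect on the network inputs.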

Pretraining.

For some environments, we found that a randomly initialized policy can lead to abnormal rewards or transition prediction from the world models in the early stage, leading to unstable training. Following CBOP [14], we pretrain a policy $\pi_{\theta}$ and a critic $Q_{\phi}$ using behavioral cloning and FQE [20], respectively. We use a slightly different implementation of FQE from the original implementation, where the $\arg\min$ operation is approximated with mini-batch gradient descent, similar to standard Q-learning as shown in Algorithm 2.

Algorithm 2 FQE: Fitted Q Evaluation [20]

0: Offline dataset $\mathcal{D}_{\text{env}}$, policy $\pi_{\theta}$
1: Randomly initialize Q-function $Q_{\phi}$
2: while not converged do
3:   $\{\mathbf{s}_{i},\mathbf{a}_{i},r_{i},\mathbf{s}^{\prime}_{i}\}_{i=1}^{N}\sim\mathcal{D}_{\text{env}}$
4:   $y_{i}=\texttt{sg}(Q_{\phi}(\mathbf{s}^{\prime}_{i},\pi_{\theta}(\mathbf{s}^{\prime}_{i})))$   $\triangleright$ $\texttt{sg}(\cdot)$ is stop-gradient operator
5:   $L_{\text{FQE}}(\phi)=\frac{1}{N}\sum_{i=1}^{N}(Q_{\phi}(\mathbf{s}_{i},\mathbf{a}_{i})-y_{i})^{2}$
6:   Update $Q_{\phi}$ using gradient descent to minimize $L_{\text{FQE}}(\phi)$
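The mini-batch loop of Algorithm 2 can be sketched with a tabular Q-function standing in for $Q_{\phi}$. This is a hedged illustration rather than the authors' code: the function `fqe_tabular`, its arguments, and the toy MDP are hypothetical, and the sketch uses the textbook FQE target $r_{i}+\gamma Q(\mathbf{s}^{\prime}_{i},\pi(\mathbf{s}^{\prime}_{i}))$, which is our assumption about how the reward and discount enter the target.

```python
import random


def fqe_tabular(dataset, policy, n_states, n_actions,
                gamma=0.99, lr=0.1, batch_size=32, n_steps=2000, seed=0):
    """Fitted Q evaluation with mini-batch gradient descent on a Q-table.

    dataset: list of (s, a, r, s_next) tuples with integer states/actions.
    policy:  function mapping a state index to an action index.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_steps):
        batch = [rng.choice(dataset) for _ in range(batch_size)]
        grad = [[0.0] * n_actions for _ in range(n_states)]
        for s, a, r, s_next in batch:
            # Target is treated as a constant (the stop-gradient in Algorithm 2).
            y = r + gamma * q[s_next][policy(s_next)]
            # Gradient of the mean squared error with respect to q[s][a].
            grad[s][a] += 2.0 * (q[s][a] - y) / batch_size
        for s in range(n_states):
            for a in range(n_actions):
                q[s][a] -= lr * grad[s][a]
    return q


# Toy check: one state, one action, reward 1, gamma 0.5 => Q = 1 / (1 - 0.5) = 2.
toy_q = fqe_tabular([(0, 0, 1.0, 0)], lambda s: 0, n_states=1, n_actions=1, gamma=0.5)
```

Replacing the exact $\arg\min$ of the original FQE with repeated gradient steps, as described above, makes the procedure look like standard Q-learning with the policy's action in the bootstrap term.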

Comparisons with prior methods.

We provide a comparison of LEQ with the prior model-based approaches and the baseline methods used in our ablation studies in Table 6.

Table 6: Comparisons with the prior model-based approaches and the baseline methods introduced for our ablation studies. We color the hyperparameters in blue if they are the same as in LEQ. Otherwise, we color them in red.

Components	CBOP	MOBILE	MOBILE^∗	MOBIP	LEQ (ours)
Training scheme	MVE [5]	MBPO [13]	MBPO [13]	Dyna [32]	Dyna [32]
Conservatism	Lower-confidence bound	Lower-confidence bound	Lower-confidence bound	Lower-confidence bound	Lower expectile
Policy	Stochastic	Stochastic	Stochastic	Deterministic	Deterministic
Policy objective	$Q(\mathbf{s},\mathbf{a})$	$Q(\mathbf{s},\mathbf{a})$	$Q(\mathbf{s},\mathbf{a})$	$\lambda$ -returns	$\lambda$ -returns
Policy pretraining	BC	–	–	BC	BC
# of critics	20-50	2	1	1	1
Critic objective	Multi-step (adaptive weighting)	One-step	One-step	$\lambda$ -returns	$\lambda$ -returns
Critic pretraining	FQE [20]	–	–	FQE [20]	FQE [20]
Horizon length ( $H$ )	10	1	1	10	10
Rollout length ( $R$ )	–	1 or 5	10	5	5
Discount rate ( $\gamma$ )	0.99	0.99	0.997	0.997	0.997
$\beta$ in Equation 7	1.0	0.95	0.25	0.25	0.25
Impl. tricks	–	Clip Q-values with $0$	LayerNorm + Symlog	LayerNorm + Symlog	LayerNorm + Symlog
Running time	24h	12h	40m	4h	2h

Hyperparameters of LEQ.

We report task-agnostic hyperparameters of our method in Table 7. We note that we use the same hyperparameters across all tasks, except $\tau$ . We search the value of $\tau$ in $\{0.1,0.3,0.4,0.5\}$ and report the best value for the main experimental results. In addition, we report the exhaustive results in Tables 11 and 12, and summarize $\tau$ used in the main results in Table 8.

Table 7: Shared hyperparameters of LEQ.

Hyperparameters	Value	Description
$lr_{\text{actor}}$	3e-5	Learning rate of actor
$lr_{\text{critic}}$	1e-4	Learning rate of critic
Optimizer	Adam	Optimizer
$T_{\text{expand}}$	5000	Interval of expanding dataset
$N_{\text{expand}}$	50000	Number of data for each expansion of dataset
$R$	5	Rollout length for dataset expansion
$\sigma_{\text{exp}}$	1.0	Exploration noise for dataset expansion
$N_{\text{iter}}$	1M	Total number of gradient steps.
$B_{\text{env}}$	256	Batch size from original dataset
$B_{\text{model}}$	256	Batch size from expanded dataset
$\gamma$	0.997	Discount factor
$\lambda$	0.95	$\lambda$ value for $\lambda$ -return
$H$	10	Imagination length
$\beta_{\text{EMA}}$	1	Weight for critic EMA regularization
$\epsilon_{\text{EMA}}$	0.995	Critic EMA decay

Table 8: Task-specific hyperparameter $\tau$ of LEQ.

Domain	Task	$\tau$
AntMaze	umaze	$0.1$
	umaze-diverse	$0.1$
	medium-play	$0.3$
	medium-diverse	$0.1$
	large-play	$0.3$
	large-diverse	$0.3$
	ultra-play	$0.1$
	ultra-diverse	$0.1$
MuJoCo	hopper-r	$0.1$
	hopper-m	$0.1$
	hopper-mr	$0.3$
	hopper-me	$0.1$
	walker2d-r	$0.1$
	walker2d-m	$0.3$
	walker2d-mr	$0.5$
	walker2d-me	$0.1$
	halfcheetah-r	$0.3$
	halfcheetah-m	$0.3$
	halfcheetah-mr	$0.4$
	halfcheetah-me	$0.1$
NeoRL	Hopper-L	$0.1$
	Hopper-M	$0.1$
	Hopper-H	$0.1$
	Walker2d-L	$0.3$
	Walker2d-M	$0.1$
	Walker2d-H	$0.1$
	HalfCheetah-L	$0.1$
	HalfCheetah-M	$0.3$
	HalfCheetah-H	$0.3$

Task-specific hyperparameters of the compared methods.

We report the best hyperparameters of MOBILE^∗ for the AntMaze tasks in Tables 9 and 10. For MOBILE and MOBILE^∗, we search the value of $c$ within $\{0.1,0.5,1.0,1.5\}$, as suggested in MOBILE [30], where $c$ is the coefficient of the penalized Bellman operator:

$$T\hat{Q}(\mathbf{s},\mathbf{a})=r(\mathbf{s},\mathbf{a})+\gamma Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime})-c\cdot\text{Std}(Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime})). \tag{13}$$

For CBOP, we conduct hyperparameter search for $\psi$ in $\{0.5,2.0,3.0,5.0\}$ , as suggested in the original paper, where $\psi$ is an LCB coefficient of CBOP. We do not report the best hyperparameter for MOBILE and CBOP because both methods score zero points for all hyperparameters in AntMaze.
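As a rough illustration of the penalized Bellman operator in Equation 13, the sketch below computes the target from a list of next-state Q estimates. How the ensemble of estimates is produced is outside the scope of this sketch; using the ensemble mean for the $Q(\mathbf{s}^{\prime},\mathbf{a}^{\prime})$ term, the population standard deviation for $\text{Std}(\cdot)$, and the name `penalized_target` are our assumptions.

```python
import statistics


def penalized_target(reward, next_q_ensemble, gamma=0.99, c=1.0):
    """r + gamma * Q(s', a') - c * Std(Q(s', a')) for one transition.

    next_q_ensemble: list of Q(s', a') estimates (hypothetical input).
    The mean is used as the point estimate of Q(s', a') (an assumption).
    """
    mean_q = statistics.fmean(next_q_ensemble)
    std_q = statistics.pstdev(next_q_ensemble)
    return reward + gamma * mean_q - c * std_q


# When the estimates agree, there is no penalty; disagreement lowers the target.
agree = penalized_target(1.0, [2.0, 2.0, 2.0], gamma=0.5, c=1.0)
disagree = penalized_target(1.0, [1.0, 3.0], gamma=0.5, c=1.0)
```

A larger $c$ makes the target more pessimistic wherever the estimates disagree, which is why $c$ is searched per task.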

Table 9: Task-specific hyperparameters in MOBILE^∗.

Domain	Task	$c$
AntMaze	umaze	$1.0$
	umaze-diverse	$1.0$
	medium-play	$1.0$
	medium-diverse	$0.1$
	large-play	$0.1$
	large-diverse	$0.1$
	ultra-play	$1.0$
	ultra-diverse	$1.0$

Table 10: Task-specific hyperparameters in MOBILE^∗ with $\lambda$-returns.

Domain	Task	$c$
AntMaze	umaze	$1.0$
	umaze-diverse	$0.5$
	medium-play	$0.1$
	medium-diverse	$0.1$
	large-play	$0.1$
	large-diverse	$0.1$
	ultra-play	$1.0$
	ultra-diverse	$0.5$

Appendix B Proof of the Policy Objective

We show that the surrogate loss in Equation 12 leads to a better approximation for the expectile of $\lambda$ -returns in Equation 11 than maximizing $Q_{\phi}(s,a)$ . In other words, we show that optimizing the following policy objective:

$$\hat{J}_{\lambda}(\theta)=\mathbb{E}_{\tau\sim p_{\psi},\pi_{\theta}}[W^{\tau}(Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q^{\lambda}_{t}(\tau))\cdot Q_{t}^{\lambda}(\tau)], \tag{14}$$

leads to optimizing a lower-bias estimator of $\mathbb{E}^{\tau}_{\tau\sim p_{\psi},\pi_{\theta}}[Q_{t}^{\lambda}(\tau)]$ than $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})$ .

To show this, we first prove that $\hat{Y}_{\text{new}}=\frac{\mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q_{t}^{\lambda}(\tau))\cdot Q_{t}^{\lambda}(\tau)]}{\mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q_{t}^{\lambda}(\tau))]}$ is closer to $\mathbb{E}^{\tau}_{\tau\sim p_{\psi},\pi_{\theta}}[Q_{t}^{\lambda}(\tau)]$ than $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})$. For the proof, we generalize this setting to an arbitrary distribution $X$ and an estimate $\hat{Y}$, which correspond to $X=Q_{t}^{\lambda}(\tau)$ and $\hat{Y}=Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})$.

Theorem 1.

Let $X$ be a distribution and $Y=\mathbb{E}^{\tau}[X]$ be a lower expectile of $X$ (i.e. $0<\tau\leq 0.5$). Let $\hat{Y}$ be an arbitrary estimate of $Y$, and define $W^{\tau}(\cdot)=|\tau-\mathbbm{1}(\cdot)|$. If we let $\hat{Y}_{\text{new}}=\frac{\mathbb{E}[W^{\tau}(\hat{Y}>X)\cdot X]}{\mathbb{E}[W^{\tau}(\hat{Y}>X)]}$ be a new estimate of $Y$, then $|Y-\hat{Y}_{\text{new}}|\leq\frac{1-2\tau}{1-\tau}\cdot p(Y\leq X\leq\hat{Y})\cdot|Y-\hat{Y}|$.

Proof.

Without loss of generality, we assume $\hat{Y}\geq Y$ . Then, we have $\hat{Y}_{\text{new}}\geq Y$ . Thus,

$$
\begin{aligned}
|\hat{Y}_{\text{new}}-Y| &= \hat{Y}_{\text{new}}-Y \\
&= \frac{\mathbb{E}[W^{\tau}(\hat{Y}>X)\cdot X]}{\mathbb{E}[W^{\tau}(\hat{Y}>X)]}-\frac{\mathbb{E}[W^{\tau}(Y>X)\cdot X]}{\mathbb{E}[W^{\tau}(Y>X)]} && (\because \text{Def. of } \hat{Y}_{\text{new}} \text{ and } Y) \\
&= \frac{\mathbb{E}[W^{\tau}(Y>X)\cdot X]+\mathbb{E}[(1-2\tau)\cdot\mathbbm{1}(Y\leq X\leq\hat{Y})X]}{\mathbb{E}[W^{\tau}(Y>X)]+\mathbb{E}[(1-2\tau)\cdot\mathbbm{1}(Y\leq X\leq\hat{Y})]}-\frac{\mathbb{E}[W^{\tau}(Y>X)\cdot X]}{\mathbb{E}[W^{\tau}(Y>X)]} && (\because \text{Def. of } W^{\tau}(\cdot)) \\
&= \frac{\mathbb{E}[W^{\tau}(Y>X)]\mathbb{E}[(1-2\tau)\cdot\mathbbm{1}(Y\leq X\leq\hat{Y})X]-\mathbb{E}[W^{\tau}(Y>X)X]\mathbb{E}[(1-2\tau)\cdot\mathbbm{1}(Y\leq X\leq\hat{Y})]}{\mathbb{E}[W^{\tau}(Y>X)](\mathbb{E}[W^{\tau}(Y>X)]+\mathbb{E}[(1-2\tau)\cdot\mathbbm{1}(Y\leq X\leq\hat{Y})])} \\
&\leq \frac{1-2\tau}{\mathbb{E}[W^{\tau}(Y>X)]^{2}}\cdot(\mathbb{E}[W^{\tau}(Y>X)]\mathbb{E}[\mathbbm{1}(Y\leq X\leq\hat{Y})X]-\mathbb{E}[W^{\tau}(Y>X)X]\mathbb{E}[\mathbbm{1}(Y\leq X\leq\hat{Y})]) \\
&= \frac{1-2\tau}{\mathbb{E}[W^{\tau}(Y>X)]}\cdot(\mathbb{E}[\mathbbm{1}(Y\leq X\leq\hat{Y})X]-Yp(Y\leq X\leq\hat{Y})) \\
&= \frac{(1-2\tau)p(Y\leq X\leq\hat{Y})}{\mathbb{E}[W^{\tau}(Y>X)]}\cdot(\mathbb{E}_{Y\leq X\leq\hat{Y}}[X]-Y) \\
&\leq \frac{(1-2\tau)p(Y\leq X\leq\hat{Y})}{\mathbb{E}[W^{\tau}(Y>X)]}\cdot(\hat{Y}-Y) \\
&\leq \frac{1-2\tau}{1-\tau}\cdot p(Y\leq X\leq\hat{Y})\cdot|\hat{Y}-Y|
\end{aligned}
$$

∎

Note that this theorem shows that the bias of the new estimate is always smaller than that of the original estimate, since $\frac{1-2\tau}{1-\tau}<1$ and $p(Y\leq X\leq\hat{Y})\leq 1$. If we plug in the distribution of $Q_{t}^{\lambda}(\tau)$ for $X$ and set $\hat{Y}=Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})$, then $Y=\mathbb{E}^{\tau}[X]=\mathbb{E}^{\tau}_{\tau\sim p_{\psi},\pi_{\theta}}[Q_{t}^{\lambda}(\tau)]$, and we obtain the desired result from the theorem: $\hat{Y}_{\text{new}}=\frac{\mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q_{t}^{\lambda}(\tau))\cdot Q_{t}^{\lambda}(\tau)]}{\mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q_{t}^{\lambda}(\tau))]}$ is closer to $\mathbb{E}^{\tau}_{\tau\sim p_{\psi},\pi_{\theta}}[Q_{t}^{\lambda}(\tau)]$ than $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})$.

Here, the normalizing factor $\mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q_{t}^{\lambda}(\tau))]$ is piecewise constant with respect to $\tau$: its gradient is $0$ everywhere except at $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})=Q_{t}^{\lambda}(\tau)$, where it is not differentiable. Thus, if we calculate the gradient of $\hat{Y}_{\text{new}}$, the gradient for the normalizing factor disappears. Therefore, we can omit the normalizing factor and get an equivalent formula $\mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q_{t}^{\lambda}(\tau))\cdot Q_{t}^{\lambda}(\tau)]$ for gradient-based optimization.
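The reweighted estimate $\hat{Y}_{\text{new}}$ from Theorem 1 can be checked numerically on a small sample. The sketch below is only an illustration under our own assumptions: the samples stand in for $\lambda$-return values, the function names are hypothetical, and repeating the reweighting step to approximate the expectile is a standard fixed-point iteration rather than a procedure taken from LEQ.

```python
def weight(tau, y_hat, x):
    """W^tau(y_hat > x) = |tau - 1(y_hat > x)|, as defined in Theorem 1."""
    return abs(tau - (1.0 if y_hat > x else 0.0))


def reweighted_estimate(samples, y_hat, tau):
    """One step: Y_new = E[W * X] / E[W], with weights evaluated at y_hat."""
    ws = [weight(tau, y_hat, x) for x in samples]
    return sum(w * x for w, x in zip(ws, samples)) / sum(ws)


def expectile(samples, tau, n_iter=100):
    """Approximate the tau-expectile by repeating the reweighting step."""
    y = sum(samples) / len(samples)  # start from the mean (tau = 0.5 case)
    for _ in range(n_iter):
        y = reweighted_estimate(samples, y, tau)
    return y


# Toy example with tau = 0.1: the lower expectile sits well below the mean (2.0).
samples = [0.0, 1.0, 2.0, 3.0, 4.0]
y_true = expectile(samples, tau=0.1)
y_new = reweighted_estimate(samples, y_hat=2.0, tau=0.1)
```

In this toy example, a single reweighting step from the estimate $\hat{Y}=2.0$ already lands much closer to the $0.1$-expectile, which is the behavior the surrogate objective relies on.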

Appendix C More Results

High variance in locomotion tasks.

When we train LEQ in locomotion tasks, we observe that our method often achieves $100$ % success rates and then falls back to $0$ %, as shown in Figure 5. This is mainly because the learned models sometimes fail to capture failures (e.g. hopper and walker falling off) and predict an optimistic future (e.g. hopper and walker walking forward).

Results for all expectiles $\tau$ .

To give insight into how the expectile parameter $\tau$ affects the performance of LEQ, we report the performance of LEQ with all expectile values $\{0.1,0.3,0.4,0.5\}$. The expectile parameter $\tau$ has a trade-off: a high expectile makes the model’s predictions less conservative but makes it easier for the policy to exploit the model. We recommend first trying $\tau=0.1$, which works well for most of the tasks, and then increasing $\tau$ until the performance starts to drop.

Table 11: AntMaze results of LEQ with different expectiles. We report the results on the AntMaze tasks with expectile values of 0.1, 0.3, 0.4, and 0.5. The best value is highlighted.

Expectile	0.1	0.3	0.4	0.5
antmaze-umaze	$\mathbf{94.4}$ $\pm 6.3$	$39.0$ $\pm 28.1$	$0.2$ $\pm 0.4$	$3.0$ $\pm 5.5$
antmaze-umaze-diverse	$\mathbf{71.0}$ $\pm 12.2$	$23.6$ $\pm 21.7$	$4.0$ $\pm 4.2$	$0.0$ $\pm 0.0$
antmaze-medium-play	$50.2$ $\pm 39.9$	$\mathbf{58.8}$ $\pm 33.0$	$36.0$ $\pm 21.8$	$0.6$ $\pm 1.2$
antmaze-medium-diverse	$\mathbf{46.2}$ $\pm 23.2$	$13.2$ $\pm 13.3$	$11.6$ $\pm 14.8$	$10.6$ $\pm 13.3$
antmaze-large-play	$42.0$ $\pm 30.6$	$\mathbf{58.6}$ $\pm 9.1$	$52.2$ $\pm 15.8$	$42.2$ $\pm 7.3$
antmaze-large-diverse	$60.6$ $\pm 32.1$	$\mathbf{60.2}$ $\pm 18.3$	$48.8$ $\pm 5.8$	$36.8$ $\pm 9.7$
antmaze-ultra-play	$\mathbf{25.8}$ $\pm 18.2$	$10.8$ $\pm 8.8$	$11.6$ $\pm 12.5$	$9.2$ $\pm 11.5$
antmaze-ultra-diverse	$\mathbf{55.8}$ $\pm 18.3$	$4.6$ $\pm 3.4$	$7.6$ $\pm 7.3$	$0.6$ $\pm 1.2$

Table 12: D4RL MuJoCo results of LEQ with different expectiles. We report the results on the D4RL MuJoCo tasks with expectile values of 0.1, 0.3, 0.4, and 0.5. The best value is highlighted.

Expectile	0.1	0.3	0.4	0.5
hopper-r	$\mathbf{32.4}$ $\pm 0.3$	$13.7$ $\pm 9.1$	$16.4$ $\pm 9.3$	$12.5$ $\pm 10.1$
hopper-m	$\mathbf{103.4}$ $\pm 0.3$	$102.7$ $\pm 1.7$	$81.4$ $\pm 24.8$	$38.6$ $\pm 29.2$
hopper-mr	$103.2$ $\pm 1.0$	$\mathbf{103.9}$ $\pm 1.3$	$71.5$ $\pm 34.7$	$103.8$ $\pm 1.9$
hopper-me	$\mathbf{109.4}$ $\pm 1.8$	$108.0$ $\pm 8.7$	$64.2$ $\pm 35.8$	$33.7$ $\pm 0.5$
walker2d-r	$\mathbf{21.5}$ $\pm 0.1$	$21.5$ $\pm 0.5$	$14.0$ $\pm 8.8$	$8.7$ $\pm 6.7$
walker2d-m	$26.3$ $\pm 37.4$	$\mathbf{74.9}$ $\pm 26.9$	$60.3$ $\pm 40.9$	$34.8$ $\pm 34.3$
walker2d-mr	$48.6$ $\pm 19.5$	$60.5$ $\pm 27.4$	$88.5$ $\pm 3.5$	$\mathbf{98.7}$ $\pm 6.0$
walker2d-me	$\mathbf{108.2}$ $\pm 1.3$	$98.8$ $\pm 28.8$	$105.8$ $\pm 25.9$	$33.7$ $\pm 31.9$
halfcheetah-r	$23.8$ $\pm 1.8$	$\mathbf{30.8}$ $\pm 3.3$	$29.0$ $\pm 2.9$	$30.2$ $\pm 2.5$
halfcheetah-m	$65.3$ $\pm 2.0$	$\mathbf{71.7}$ $\pm 4.4$	$58.5$ $\pm 23.8$	$55.5$ $\pm 16.7$
halfcheetah-mr	$60.6$ $\pm 1.4$	$55.4$ $\pm 27.3$	$\mathbf{65.5}$ $\pm 1.1$	$52.4$ $\pm 26.7$
halfcheetah-me	$\mathbf{102.8}$ $\pm 0.4$	$81.5$ $\pm 19.6$	$58.1$ $\pm 26.1$	$46.3$ $\pm 17.7$

Ablation study on dataset expansion.

Table 13 shows the ablation results on the dataset expansion in D4RL MuJoCo tasks. The results show that the dataset expansion generally improves the performance, especially in Hopper environments.

Table 13: D4RL MuJoCo ablation results for dataset expansion. Results are averaged over $5$ random seeds. The dataset expansion generally improves the performance of LEQ.

Dataset	LEQ (ours)	LEQ w/o Dataset Expansion
hopper-r	$\mathbf{32.4}$ $\pm 0.3$	$17.6$ $\pm 8.6$
hopper-m	$\mathbf{103.4}$ $\pm 0.3$	$52.7$ $\pm 45.3$
hopper-mr	$\mathbf{103.9}$ $\pm 1.3$	$\mathbf{103.7}$ $\pm 1.3$
hopper-me	$\mathbf{109.4}$ $\pm 1.8$	$79.7$ $\pm 42.4$
walker2d-r	$\mathbf{21.5}$ $\pm 0.1$	$\mathbf{20.5}$ $\pm 2.2$
walker2d-m	$74.9$ $\pm 26.9$	$\mathbf{87.2}$ $\pm 4.3$
walker2d-mr	$\mathbf{98.7}$ $\pm 6.0$	$78.7$ $\pm 35.5$
walker2d-me	$\mathbf{108.2}$ $\pm 1.3$	$\mathbf{110.4}$ $\pm 0.8$
halfcheetah-r	$\mathbf{30.8}$ $\pm 3.3$	$\mathbf{27.7}$ $\pm 2.2$
halfcheetah-m	$\mathbf{71.7}$ $\pm 4.4$	$\mathbf{71.6}$ $\pm 3.8$
halfcheetah-mr	$\mathbf{65.5}$ $\pm 1.1$	$54.4$ $\pm 26.3$
halfcheetah-me	$\mathbf{102.8}$ $\pm 0.4$	$83.9$ $\pm 28.0$
Total	$\mathbf{923.2}$	$788.2$
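
The Total row and the claim about Hopper can be checked directly from the per-dataset means in Table 13. The following is a minimal sketch in Python; the names `TABLE_13`, `column_totals`, and `gain_by_environment` are hypothetical and not part of the LEQ code. Because it sums the rounded means shown in the table, a recomputed total may differ from the reported one by about 0.1, presumably because the reported totals were computed from unrounded means (an assumption).

```python
# Per-dataset mean scores copied from Table 13:
# (LEQ, LEQ w/o dataset expansion). Standard deviations are omitted.
TABLE_13 = {
    "hopper-r": (32.4, 17.6),
    "hopper-m": (103.4, 52.7),
    "hopper-mr": (103.9, 103.7),
    "hopper-me": (109.4, 79.7),
    "walker2d-r": (21.5, 20.5),
    "walker2d-m": (74.9, 87.2),
    "walker2d-mr": (98.7, 78.7),
    "walker2d-me": (108.2, 110.4),
    "halfcheetah-r": (30.8, 27.7),
    "halfcheetah-m": (71.7, 71.6),
    "halfcheetah-mr": (65.5, 54.4),
    "halfcheetah-me": (102.8, 83.9),
}


def column_totals(table):
    """Sum each column of rounded means over all datasets."""
    total_with = sum(with_exp for with_exp, _ in table.values())
    total_without = sum(without_exp for _, without_exp in table.values())
    return total_with, total_without


def gain_by_environment(table):
    """Sum (with - without) per environment, keyed by the dataset-name prefix."""
    gains = {}
    for name, (with_exp, without_exp) in table.items():
        env = name.rsplit("-", 1)[0]  # "hopper-mr" -> "hopper"
        gains[env] = gains.get(env, 0.0) + (with_exp - without_exp)
    return gains


total_with, total_without = column_totals(TABLE_13)
gains = gain_by_environment(TABLE_13)

print(f"total with expansion:    {total_with:.1f}")
print(f"total without expansion: {total_without:.1f}")
for env, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{env:12s} summed gain from expansion: {gain:+.1f}")
```

Grouping by environment shows where the gap between the two totals comes from; it says nothing about statistical significance, since the standard deviations are ignored here.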

Ablation study on MOBILE^∗ in AntMaze.

In Table 14, we report the performance of MOBILE^∗ for all possible combinations of the hyperparameter values used by MOBILE and LEQ. Specifically, we use $\beta\in\{0.25,0.95\}$ , $\gamma\in\{0.99,0.997\}$ , and $R\in\{5,10\}$ . The results show that $\beta=0.25$ is crucial: every configuration with $\beta=0.95$ obtains a total score below 30. In addition, the configuration of LEQ yields the best total score among all configurations for MOBILE^∗.

Table 14: Complete hyperparameter search results of MOBILE^∗ on AntMaze. MOBILE^∗ uses the hyperparameters from MOBILE: $\beta=0.95$ , $\gamma=0.99$ , and $R=5$ , whereas LEQ uses $\beta=0.25$ , $\gamma=0.997$ , and $R=10$ . The results show that $\beta$ is the most critical hyperparameter that makes MOBILE^∗ work in AntMaze.

$\beta$	$\gamma$	$R$	umaze	umaze-diverse	medium-play	medium-diverse	large-play	large-diverse	ultra-play	ultra-diverse	Total
$0.25$	$0.997$	$10$	$53.8$ $\pm 26.8$	$\mathbf{22.5}$ $\pm 22.2$	$54.0$ $\pm 5.8$	$\mathbf{49.5}$ $\pm 6.2$	$\mathbf{28.3}$ $\pm 6.0$	$\mathbf{28.0}$ $\pm 11.4$	$\mathbf{25.5}$ $\pm 6.9$	$\mathbf{23.8}$ $\pm 15.8$	$\mathbf{285.3}$
$0.25$	$0.997$	$5$	$\mathbf{74.0}$ $\pm 6.9$	$3.7$ $\pm 2.6$	$54.7$ $\pm 27.9$	$28.0$ $\pm 9.6$	$18.7$ $\pm 18.6$	$8.0$ $\pm 9.3$	$9.7$ $\pm 8.2$	$9.0$ $\pm 3.7$	$205.7$
$0.25$	$0.99$	$10$	$39.7$ $\pm 23.4$	$5.0$ $\pm 7.1$	$39.3$ $\pm 27.9$	$38.0$ $\pm 15.0$	$0.0$ $\pm 0.0$	$3.7$ $\pm 5.2$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$125.7$
$0.25$	$0.99$	$5$	$\mathbf{77.0}$ $\pm 6.4$	$20.4$ $\pm 15.7$	$\mathbf{64.6}$ $\pm 11.1$	$31.6$ $\pm 16.9$	$2.6$ $\pm 2.8$	$7.2$ $\pm 8.9$	$4.6$ $\pm 3.0$	$5.0$ $\pm 4.6$	$213.0$
$0.95$	$0.997$	$10$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$1.8$ $\pm 3.0$	$0.5$ $\pm 0.9$	$0.2$ $\pm 0.4$	$2.2$ $\pm 2.3$	$1.0$ $\pm 1.7$	$0.0$ $\pm 0.0$	$5.7$
$0.95$	$0.997$	$5$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$7.2$ $\pm 4.1$	$1.6$ $\pm 2.1$	$9.6$ $\pm 7.1$	$5.4$ $\pm 4.9$	$0.0$ $\pm 0.0$	$1.8$ $\pm 2.7$	$25.6$
$0.95$	$0.99$	$10$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$5.0$ $\pm 5.1$	$0.6$ $\pm 1.2$	$7.4$ $\pm 14.8$	$1.6$ $\pm 3.2$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$14.6$
$0.95$	$0.99$	$5$	$1.0$ $\pm 2.0$	$0.0$ $\pm 0.0$	$6.4$ $\pm 5.5$	$5.0$ $\pm 5.0$	$0.8$ $\pm 1.6$	$0.8$ $\pm 1.2$	$0.0$ $\pm 0.0$	$0.0$ $\pm 0.0$	$14.0$
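
The eight rows of Table 14 are the Cartesian product of the two candidate values for each of $\beta$, $\gamma$, and $R$. The following is a minimal sketch in Python of how such a grid could be enumerated in the row order of the table; the names `build_grid`, `LEQ_CONFIG`, and `MOBILE_CONFIG` are hypothetical and do not come from the LEQ or MOBILE code.

```python
from itertools import product

# Candidate values stated in the text, ordered to match the rows of Table 14.
BETAS = (0.25, 0.95)
GAMMAS = (0.997, 0.99)
ROLLOUT_LENGTHS = (10, 5)  # dataset expansion length R

# Reference configurations named in the Table 14 caption.
LEQ_CONFIG = {"beta": 0.25, "gamma": 0.997, "R": 10}
MOBILE_CONFIG = {"beta": 0.95, "gamma": 0.99, "R": 5}


def build_grid():
    """Enumerate every (beta, gamma, R) combination; the last factor varies fastest."""
    return [
        {"beta": beta, "gamma": gamma, "R": rollout_length}
        for beta, gamma, rollout_length in product(BETAS, GAMMAS, ROLLOUT_LENGTHS)
    ]


grid = build_grid()
for config in grid:
    if config == LEQ_CONFIG:
        label = "LEQ values"
    elif config == MOBILE_CONFIG:
        label = "MOBILE values"
    else:
        label = "mixed"
    print(config, label)
```

With two values per hyperparameter the grid has $2^3 = 8$ configurations, so an exhaustive search is cheap here; each configuration would still need its own training runs, which this sketch does not perform.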
