Tackling Long-Horizon Tasks
with Model-based Offline Reinforcement Learning

Kwanyoung Park   Youngwoon Lee
Yonsei University
https://kwanyoungpark.github.io/LEQ/
Abstract

Model-based offline reinforcement learning (RL) is a compelling approach that addresses the challenge of learning from limited, static data by generating imaginary trajectories using learned models. However, it falls short in solving long-horizon tasks due to high bias in value estimation from model rollouts. In this paper, we introduce a novel model-based offline RL method, Lower Expectile Q-learning (LEQ), which enhances long-horizon task performance by mitigating the high bias in model-based value estimation via expectile regression of $\lambda$-returns. Our empirical results show that LEQ significantly outperforms previous model-based offline RL methods on long-horizon tasks, such as the D4RL AntMaze tasks, matching or surpassing the performance of model-free approaches. Our experiments demonstrate that expectile regression, $\lambda$-returns, and critic training on offline data are all crucial for addressing long-horizon tasks. Additionally, LEQ achieves performance comparable to the state-of-the-art model-based and model-free offline RL methods on the NeoRL benchmark and the D4RL MuJoCo Gym tasks.

1 Introduction

One of the major challenges in offline reinforcement learning (RL) is the overestimation of values for out-of-distribution actions due to the lack of environment interactions [21, 19]. Model-based offline RL addresses this issue by generating additional (imaginary) training data using a learned model, thereby augmenting the given offline data with synthetic experiences that cover out-of-distribution states and actions [34, 16, 35, 2, 30]. While these approaches have demonstrated strong performance in simple, short-horizon tasks, they struggle with noisy model predictions and value estimations, particularly in long-horizon tasks [23]. This challenge is evident in their poor performances (i.e. near zero) on the D4RL AntMaze tasks [6, 15].

Typical model-based offline RL methods alleviate the inaccurate value estimation problem (mostly overestimation) by penalizing Q-values estimated from model rollouts with uncertainties in model predictions [34, 16] or value predictions [30, 14]. While these penalization terms prevent a policy from exploiting erroneous value estimations, the policy now does not maximize the true value, but maximizes the value penalized by heuristically estimated uncertainties, which can lead to sub-optimal behaviors. This is especially problematic in long-horizon, sparse-reward tasks, where Q-values are similar across nearby states [23].

Another way to reduce bias in value estimates is by using multi-step returns [31, 12]. CBOP [14] constructs an explicit distribution of multi-step Q-values from thousands of model rollouts and uses this value as a target for training the Q-function. However, CBOP is computationally expensive for estimating a target value and uses multi-step returns solely for Q-learning, which provides insufficient learning signals for obtaining long-horizon behaviors.

To tackle long-horizon tasks with model-based offline RL, we introduce a simple yet effective model-based offline RL algorithm, Lower Expectile Q-learning (LEQ). As illustrated in Figure 1, LEQ uses expectile regression with a small $\tau$ for both policy and Q-function training, providing an efficient and elegant way to achieve conservative Q-value estimates. Moreover, to better handle long-horizon tasks, we propose to optimize a policy and Q-function using $\lambda$-returns (i.e. TD($\lambda$) targets) of long ($10$-step) model rollouts, allowing the policy to directly learn from low-bias multi-step returns [28].

The experiments on the D4RL AntMaze and MuJoCo Gym tasks [6], as well as the NeoRL benchmark [26], demonstrate that our proposed conservative policy optimization with $\lambda$-return and critic training on offline data significantly improves offline RL policies in long-horizon tasks while achieving comparable performance in short-horizon, dense-reward tasks. Specifically, to the best of our knowledge, LEQ is the first model-based offline RL algorithm capable of matching or outperforming the performance of model-free offline RL algorithms on the long-horizon AntMaze tasks [6, 15].

Figure 1: Lower Expectile Q-learning (LEQ). (left) In offline model-based RL, an agent can generate imaginary trajectories using a world model. (right) For conservative Q-evaluation of the policy, LEQ learns the lower expectile of the target $Q$-distribution from a sampled individual rollout $\mathcal{T}_i$, without estimating the entire Q-distribution with exhaustive rollouts.

2 Related Work

Offline RL [21] aims to solve a reinforcement learning problem using only pre-collected datasets, ideally yielding policies better than behavioral cloning policies [25]. One can simply apply off-policy RL algorithms on top of the fixed dataset. However, off-policy RL methods suffer from the overestimation of Q-values for actions unseen in the offline dataset [8, 18, 19], since an overestimated value function cannot be corrected through online environment interactions in offline RL.

Model-free offline RL algorithms have addressed this value overestimation problem on out-of-distribution actions by (1) regularizing a policy to only output actions in the offline data [24, 17, 7] or (2) adopting a conservative value estimation for executing actions different from the dataset [19, 1]. Despite their strong performances on the standard offline RL benchmarks, model-free offline RL policies tend to be constrained to the support of the data (i.e. state-action pairs in the offline dataset), which may lead to limited generalization capability.

Model-based offline RL approaches have tried to overcome this limitation by suggesting a better use of the limited offline data – learning a world model and generating imaginary data with the learned model that covers out-of-distribution actions. Similar to Dyna-style online model-based RL [32, 9, 10, 11], an offline model-based RL policy can be trained on both offline data and model rollouts. But, again, learned models may be inaccurate on states and actions outside the data support, making a policy easily exploit the learned models.

Recent model-based offline RL algorithms have adopted the conservatism idea from model-free offline RL, penalizing policies incurring (1) uncertain transition dynamics [34, 16, 35] or (2) uncertain value estimation [30, 14]. This conservative use of model-generated data enables model-based offline RL to outperform model-free offline RL in widely used offline RL benchmarks [30]. However, uncertainty estimation is difficult and often inaccurate [35]. Instead of relying on such heuristic [34, 16, 30] or expensive [14] uncertainty estimation, we propose to learn a conservative value function via expectile regression with a small $\tau$, which is simple, efficient, yet effective.

3 Preliminaries

Problem setup.

We formulate our problem as a Markov Decision Process (MDP) defined as a tuple, $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, p, \rho, \gamma)$ [33]. $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively. $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ denotes the reward function. $p: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$ denotes the transition dynamics, where $\Delta(\mathcal{X})$ denotes the set of probability distributions over $\mathcal{X}$. $\rho(\mathbf{s}_0) \in \Delta(\mathcal{S})$ denotes the initial state distribution and $\gamma$ is a discounting factor.
The goal of reinforcement learning (RL) is to find a policy, $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$, that maximizes the expected return, $\mathbb{E}_{\tau \sim p(\cdot \mid \pi, \mathbf{s}_0 \sim \rho)}\left[\sum_{t=0}^{T-1} \gamma^t r(\mathbf{s}_t, \mathbf{a}_t)\right]$, where $\tau$ is a sequence of transitions with a finite horizon $T$, $\tau = (\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \mathbf{a}_1, r_1, \ldots, \mathbf{s}_T)$, following $\pi(\mathbf{a}_t \mid \mathbf{s}_t)$ and $p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$ starting from $\mathbf{s}_0 \sim \rho(\cdot)$.

In this paper, we consider the offline RL setup [21], where a policy $\pi$ is trained with a fixed given offline dataset, $\mathcal{D}_{\text{env}} = \{\tau_1, \tau_2, \ldots, \tau_N\}$, without any additional online interactions.
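As a small worked example of the return inside the expectation in the problem setup, the sketch below computes $\sum_{t=0}^{T-1} \gamma^t r_t$ for a single trajectory. The function name `discounted_return` and the toy reward sequence are illustrative assumptions and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one finite-horizon trajectory."""
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Sparse-reward toy trajectory (T = 3): reward only at the final step.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```

With a sparse reward like this, the return shrinks geometrically with the distance to the rewarding step, which is one way to see why values of nearby states are close to each other in long-horizon tasks.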

Model-based offline RL.

As an offline RL policy is trained from a fixed dataset, one of the major challenges in offline RL is the limited data support and, consequently, the lack of generalization to out-of-distribution states and actions. Model-based offline RL [16, 34, 35, 27, 30, 14] tackles this problem by augmenting the training data with imaginary training data (i.e. model rollouts) generated from the learned transition dynamics and reward model, $p_\psi(\mathbf{s}_{t+1}, r \mid \mathbf{s}_t, \mathbf{a}_t)$.

The typical process of model-based offline RL is as follows: (1) pretrain a model (or an ensemble of models) and an initial policy from the offline data, (2) generate short imaginary rollouts $\{\tau\}$ using the pretrained model and add them to the training dataset, $\mathcal{D}_{\text{model}} \leftarrow \mathcal{D}_{\text{model}} \cup \{\tau\}$, (3) perform an offline RL algorithm on the augmented dataset $\mathcal{D}_{\text{model}} \cup \mathcal{D}_{\text{env}}$, and repeat (2) and (3).

Expectile regression.

Expectile is a generalization of the expectation of a distribution $X$. While the expectation of $X$, $\mathbb{E}[X]$, can be viewed as a minimizer of the least-square objective, $L_2(y) = \mathbb{E}_{x \sim X}[(y - x)^2]$, the $\tau$-expectile of $X$, $\mathbb{E}^{\tau}[X]$, can be defined as a minimizer of the asymmetric least-square objective:

$$L_2^{\tau}(y) = \mathbb{E}_{x \sim X}\left[\lvert \tau - \mathbbm{1}(y > x) \rvert \cdot (y - x)^2\right], \tag{1}$$

where $\lvert \tau - \mathbbm{1}(y > x) \rvert$ is an asymmetric weighting of $L_2$ and $0 \leq \tau \leq 1$.

We refer to a $\tau$-expectile with $\tau < 0.5$ as a lower expectile of $X$. When $\tau < 0.5$, the objective assigns a high weight $1 - \tau$ to samples $x$ smaller than $y$ and a low weight $\tau$ to samples $x$ larger than $y$. Thus, minimizing the objective with $\tau < 0.5$ leads to a conservative statistical estimate compared to the expectation.
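To make the definition concrete, the following sketch computes the $\tau$-expectile of a small empirical distribution. It relies on the fact that setting the derivative of the asymmetric least-square objective to zero gives a weighted mean, $y = \sum_i w_i x_i / \sum_i w_i$ with $w_i = \lvert \tau - \mathbbm{1}(y > x_i) \rvert$, which can be iterated to a fixed point. The function name `expectile`, the iteration count, and the sample values are assumptions made for illustration, and the sketch assumes $0 < \tau < 1$ so that the weights never sum to zero.

```python
def expectile(samples, tau, iters=100):
    """tau-expectile of an empirical distribution via fixed-point iteration.

    Assumes 0 < tau < 1. Each iteration recomputes the asymmetric weights
    w_i = |tau - 1(y > x_i)| and replaces y by the weighted mean of the samples.
    """
    y = sum(samples) / len(samples)  # the mean is the tau = 0.5 solution
    for _ in range(iters):
        weights = [abs(tau - (1.0 if y > x else 0.0)) for x in samples]
        y = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return y


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
lower = expectile(xs, tau=0.1)
mean = expectile(xs, tau=0.5)
upper = expectile(xs, tau=0.9)
print(lower, mean, upper)  # lower < mean < upper
```

For these five samples the lower expectile with $\tau = 0.1$ settles at $10/13 \approx 0.77$, well below the mean of $2$, which is the sense in which a lower expectile is a conservative estimate.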

4 Approach

The primary limitation for model-based offline RL in solving long-horizon tasks is inherent errors in a world model and critic outside the offline data support. Conservative value estimation can effectively handle such (falsely optimistic) errors. Prior approaches estimate conservative values through various uncertainty penalties, but these are either unreliable [35] or computationally expensive [14].

In this paper, we introduce Lower Expectile Q-learning (LEQ), an efficient model-based offline RL method that achieves conservative value estimation via expectile regression of Q-values with lower expectiles when learning from model-generated data (Section 4.1). Additionally, we address the noisy value estimation problem in long-horizon tasks [23] using $\lambda$-returns on $10$-step imaginary rollouts (Section 4.2). Finally, we train a deterministic policy conservatively by maximizing the lower expectile of $\lambda$-returns (Section 4.3). The overview of LEQ is described in Algorithm 1.

Algorithm 1 LEQ: Lower Expectile Q-learning with $\lambda$-returns
Require: Offline dataset $\mathcal{D}_{\text{env}}$, expectile $\tau \leq 0.5$, imagination length $H$, dataset expansion length $R$.
Initialize world models $\{p_{\psi_1}, \cdots, p_{\psi_M}\}$, policy $\pi_\theta$, and Q-function $Q_\phi$
Pretrain $\{p_{\psi_1}, \cdots, p_{\psi_M}\}$ on $\mathcal{D}_{\text{env}}$  $\triangleright$ $\mathcal{L}_{\text{wm}}(\psi) = -\mathbb{E}_{(\mathbf{s}, \mathbf{a}, r, \mathbf{s}') \in \mathcal{D}_{\text{env}}} \log p_\psi(\mathbf{s}', r \mid \mathbf{s}, \mathbf{a})$
Pretrain $\pi_\theta$ and $Q_\phi$ on $\mathcal{D}_{\text{env}}$  $\triangleright$ using BC for $\pi_\theta$ and FQE [20] for $Q_\phi$
$\mathcal{D}_{\text{model}} \leftarrow \emptyset$
while not converged do
    // Expand dataset using model rollouts
    $\mathbf{s}_0 \sim \mathcal{D}_{\text{env}}$  $\triangleright$ start dataset expansion from any state in $\mathcal{D}_{\text{env}}$
    for $t = 0, \ldots, R-1$ do
        $\mathcal{D}_{\text{model}} \leftarrow \mathcal{D}_{\text{model}} \cup \{\mathbf{s}_t\}$
        $\mathbf{a}_t = \pi_\theta(\mathbf{s}_t)$
        $\mathbf{s}_{t+1}, r_t \sim p_\psi(\cdot \mid \mathbf{s}_t, \mathbf{a}_t)$, where $p_\psi \sim \{p_{\psi_1}, \cdots, p_{\psi_M}\}$
    // Generate imaginary data, $\tau = \{(\mathbf{s}_0, \mathbf{a}_0, r_0, \cdots, \mathbf{s}_{H-1}, \mathbf{a}_{H-1}, r_{H-1}, \mathbf{s}_H)_i\}$
    $\mathbf{s}_0 \sim \mathcal{D}_{\text{model}}$  $\triangleright$ start imaginary rollout from any state in $\mathcal{D}_{\text{model}}$
    for $t = 0, \ldots, H-1$ do
        $\mathbf{a}_t = \pi_\theta(\mathbf{s}_t)$
        $\mathbf{s}_{t+1}, r_t \sim p_\psi(\cdot \mid \mathbf{s}_t, \mathbf{a}_t)$, where $p_\psi \sim \{p_{\psi_1}, \cdots, p_{\psi_M}\}$
    // Update critic using both offline and model-generated data
    Update critic $Q_\phi$ to minimize $\mathcal{L}^{\lambda}_{Q}(\phi)$ in Eq. 7 using $\tau$ and $\{\mathbf{s}, \mathbf{a}, r, \mathbf{s}'\} \sim \mathcal{D}_{\text{env}}$
    // Update actor using only model-generated data
    Update actor $\pi_\theta$ to minimize $\hat{\mathcal{L}}^{\lambda}_{\pi}(\theta)$ in Eq. 12 using $\tau$
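The data-generation part of Algorithm 1 (dataset expansion followed by an imaginary rollout) can be sketched as below. This is a toy illustration under stated assumptions, not the authors' implementation: the helper names `rollout` and `generate_data`, the scalar states, the two hand-written "learned" models, the linear policy, and the value $R = 5$ are all hypothetical. Only the structure follows the algorithm: a deterministic policy, one ensemble member sampled per step, and $H = 10$ imagination steps.

```python
import random


def rollout(start_state, policy, models, length, rng):
    """Roll out a deterministic policy, sampling one ensemble member per step."""
    states, actions, rewards = [start_state], [], []
    for _ in range(length):
        action = policy(states[-1])
        model = rng.choice(models)  # p_psi ~ {p_psi_1, ..., p_psi_M}
        next_state, reward = model(states[-1], action)
        states.append(next_state)
        actions.append(action)
        rewards.append(reward)
    return states, actions, rewards


def generate_data(env_states, d_model, policy, models, R, H, rng):
    """One loop iteration: expand D_model, then imagine one H-step trajectory."""
    # Dataset expansion: start from a state in D_env and store s_0 .. s_{R-1}.
    s0 = rng.choice(env_states)
    expanded, _, _ = rollout(s0, policy, models, R, rng)
    d_model.extend(expanded[:R])
    # Imagination: start from any state in D_model and roll out H steps.
    s0 = rng.choice(d_model)
    return rollout(s0, policy, models, H, rng)


# Hypothetical 1-D stand-ins for an ensemble of learned models and a policy.
def model_a(s, a):
    return s + a, -abs(s + a)


def model_b(s, a):
    return s + 0.9 * a, -abs(s + 0.9 * a)


def policy(s):
    return -0.5 * s


rng = random.Random(0)
d_model = []
states, actions, rewards = generate_data(
    [1.0, -2.0, 3.0], d_model, policy, [model_a, model_b], R=5, H=10, rng=rng
)
print(len(d_model), len(states), len(actions), len(rewards))  # 5 11 10 10
```

Because imaginary rollouts start from states in $\mathcal{D}_{\text{model}}$ rather than only from $\mathcal{D}_{\text{env}}$, the imagined trajectories can begin from states the current policy reaches under the model, not just states in the offline data.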

4.1 Lower expectile Q-learning

Most offline RL algorithms primarily focus on learning a conservative value function for out-of-distribution actions. In this paper, we propose Lower Expectile Q-learning (LEQ), which learns a conservative Q-function via expectile regression with small $\tau$, avoiding unreliable uncertainty estimation and exhaustive Q-value estimation.

As illustrated in Figure 1, the target value for $Q_\phi(\mathbf{s}, \mathbf{a})$, where $\mathbf{a} \leftarrow \pi_\theta(\mathbf{s})$, can be estimated by rolling out an ensemble of world models and averaging $r(\mathbf{s}, \mathbf{a}) + \gamma Q_\phi(\mathbf{s}', \mathbf{a}')$ over all possible $\mathbf{s}'$:

$$\hat{y}_{\text{model}} = \mathbb{E}_{\psi \sim \{\psi_1, \ldots, \psi_M\}} \mathbb{E}_{(\mathbf{s}', r) \sim p_\psi(\cdot \mid \mathbf{s}, \mathbf{a})}\left[r + \gamma Q_\phi(\mathbf{s}', \pi_\theta(\mathbf{s}'))\right]. \tag{2}$$

This target value has three error sources: the predicted future state and reward $\mathbf{s}', r \sim p_\psi(\cdot \mid \mathbf{s}, \mathbf{a})$ and the future Q-value $Q_\phi(\mathbf{s}', \pi_\theta(\mathbf{s}'))$. Thus, the target value computed from model-generated data, $\hat{y}_{\text{model}}$, is more prone to overestimation than the original target Q-value, $\hat{y}_{\text{env}}$, computed from $(\mathbf{s}, \mathbf{a}, r, \mathbf{s}') \sim \mathcal{D}_{\text{env}}$:

$$\hat{y}_{\text{env}} = r + \gamma Q_\phi(\mathbf{s}', \pi_\theta(\mathbf{s}')). \tag{3}$$
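The two targets in Equations 2 and 3 can be sketched as follows. To keep the example short it assumes deterministic ensemble members, so the inner expectation over $(\mathbf{s}', r)$ collapses and only the average over the $M$ models remains; a stochastic model would also need an average over sampled next states. The names `target_env`, `target_model`, and all `toy_*` functions are hypothetical stand-ins, not part of the paper.

```python
def target_env(reward, next_state, q_fn, policy, gamma):
    """Eq. 3: one-step target from a real transition (s, a, r, s')."""
    return reward + gamma * q_fn(next_state, policy(next_state))


def target_model(state, action, models, q_fn, policy, gamma):
    """Eq. 2 for deterministic models: average the one-step target over the ensemble."""
    targets = []
    for model in models:
        next_state, reward = model(state, action)
        targets.append(reward + gamma * q_fn(next_state, policy(next_state)))
    return sum(targets) / len(targets)


# Hypothetical toy critic, policy, and two disagreeing ensemble members.
def toy_q(state, action):
    return -abs(state)


def toy_policy(state):
    return 0.0


def toy_model_1(s, a):
    return s + a, 1.0


def toy_model_2(s, a):
    return s + 2.0 * a, 0.0


print(target_env(1.0, 1.0, toy_q, toy_policy, gamma=0.5))  # 0.5
print(target_model(0.0, 1.0, [toy_model_1, toy_model_2], toy_q, toy_policy, gamma=0.5))  # -0.25
```

In the toy call, the two models disagree on both the next state and the reward, so the per-model targets ($0.5$ and $-1.0$) differ; this spread is what a conservative estimate has to account for.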

To mitigate the overestimation problem in estimating the true Q-value from $H$-step inaccurate world model rollouts, we propose to use expectile regression on target Q-value estimation with small $\tau$; as illustrated in Figure 1, expectile regression with small $\tau$ tends to choose the target Q-value that is lower than the expectation, effectively providing a conservative estimate of target Q-value. Another advantage of using expectile regression is that we do not have to exhaustively evaluate Q-values to get $\tau$-expectiles as in Jeong et al. [14]; instead, we can do conservative estimation using sampling:

$$\mathcal{L}_{Q,\text{model}}(\phi) = \mathbb{E}_{\mathbf{s}_0 \in \mathcal{D}_{\text{model}},\, \tau \sim p_\psi, \pi_\theta}\left[\frac{1}{H} \sum_{t=0}^{H} L_2^{\tau}\left(Q_\phi(\mathbf{s}_t, \pi_\theta(\mathbf{s}_t)) - \hat{y}_{\text{model}}\right)\right]. \tag{4}$$
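The point that sampling suffices can be illustrated with a scalar example: instead of enumerating all targets to compute their expectile, a single Q-value is updated by stochastic gradient descent on the asymmetric squared error, seeing one sampled target per step. This is a minimal sketch under assumptions; the function names, the learning rate, the step count, and the fixed list of targets (cycled through in place of random sampling) are illustrative choices, not settings from the paper.

```python
def expectile_loss(pred, target, tau):
    """Asymmetric squared error |tau - 1(pred > target)| * (pred - target)^2."""
    weight = abs(tau - (1.0 if pred > target else 0.0))
    return weight * (pred - target) ** 2


def expectile_grad(pred, target, tau):
    """Derivative of expectile_loss with respect to pred."""
    weight = abs(tau - (1.0 if pred > target else 0.0))
    return 2.0 * weight * (pred - target)


def fit_scalar_q(sampled_targets, tau, lr=0.01, steps=5000):
    """Fit one scalar Q-value by SGD, using a single sampled target per step."""
    q = 0.0
    for step in range(steps):
        target = sampled_targets[step % len(sampled_targets)]
        q -= lr * expectile_grad(q, target, tau)
    return q


# With tau = 0.1, overestimating a target is weighted 1 - tau = 0.9,
# while underestimating it is weighted only tau = 0.1.
print(expectile_loss(1.0, 0.0, tau=0.1), expectile_loss(0.0, 1.0, tau=0.1))

targets = [0.0, 1.0, 2.0, 3.0, 4.0]
print(fit_scalar_q(targets, tau=0.1))  # settles near the lower expectile, below the mean of 2
print(fit_scalar_q(targets, tau=0.5))  # settles near the mean
```

The fitted value for $\tau = 0.1$ hovers around the lower expectile of the targets without ever forming the full target distribution, mirroring how a single sampled rollout per update is enough for conservative estimation.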

In addition, the Q-function is also trained on the offline data $\mathcal{D}_{\text{env}}$ with the standard Bellman update:

$$\mathcal{L}_{Q,\text{env}}(\phi) = \mathbb{E}_{(\mathbf{s}, \mathbf{a}, r, \mathbf{s}') \in \mathcal{D}_{\text{env}}}\left[\frac{1}{2}\left(Q_\phi(\mathbf{s}, \mathbf{a}) - \hat{y}_{\text{env}}\right)^2\right]. \tag{5}$$

To stabilize training of the Q-function, we adopt EMA regularization [11], which prevents drastic changes in Q-values by regularizing the difference between the Q-predictions and those of an exponential moving average of the Q-function:

$$\mathcal{L}_{Q,\text{EMA}}(\phi) = \mathbb{E}_{(\mathbf{s}, \mathbf{a}) \in \mathcal{D}_{\text{env}}}\left[\left(Q_\phi(\mathbf{s}, \mathbf{a}) - Q_{\bar{\phi}}(\mathbf{s}, \mathbf{a})\right)^2\right], \tag{6}$$

where $\bar{\phi}$ is an exponential moving average of $\phi$. Note that by using EMA regularization, we do not use the target Q-network for Equations 2 and 3.
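A minimal sketch of the two ingredients of EMA regularization follows: tracking $\bar{\phi}$ as a moving average of $\phi$, and penalizing the squared gap between the two critics' predictions as in Equation 6. The function names and the decay value are assumptions for illustration; the text above does not specify the decay rate.

```python
def ema_update(ema_params, params, decay):
    """Move each EMA parameter a fraction (1 - decay) toward the live parameter."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


def ema_penalty(q_values, q_ema_values):
    """Eq. 6 on a batch: mean squared gap between Q_phi and Q_phi_bar predictions."""
    gaps = [(q - q_bar) ** 2 for q, q_bar in zip(q_values, q_ema_values)]
    return sum(gaps) / len(gaps)


phi = [1.0, 2.0]       # live critic parameters (toy values)
phi_bar = [0.0, 2.0]   # their exponential moving average
phi_bar = ema_update(phi_bar, phi, decay=0.9)  # decay=0.9 is an assumed value
print(phi_bar)  # first entry moves 10% of the way from 0.0 toward 1.0

print(ema_penalty([1.0, 3.0], [0.0, 1.0]))  # 2.5
```

The penalty grows quadratically as the live critic drifts away from its slowly moving average, which discourages abrupt changes in Q-values between updates.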

Finally, by combining the three losses above, we define the critic loss as follows:

$$\mathcal{L}_{Q}(\phi) = \beta \mathcal{L}_{Q,\text{model}}(\phi) + (1 - \beta) \mathcal{L}_{Q,\text{env}}(\phi) + \beta_{\text{EMA}} \mathcal{L}_{Q,\text{EMA}}(\phi). \tag{7}$$
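Putting Equations 4 to 7 together, the combined critic loss on mini-batches of scalar predictions might look like the sketch below. Each batch is a list of (prediction, reference) pairs, where the reference is $\hat{y}_{\text{model}}$, $\hat{y}_{\text{env}}$, or the EMA critic's prediction. The function name `critic_loss`, the batch layout, and the values chosen for $\tau$, $\beta$, and $\beta_{\text{EMA}}$ are assumptions for illustration only.

```python
def critic_loss(model_batch, env_batch, ema_batch, tau, beta, beta_ema):
    """Eq. 7 on mini-batches of (prediction, reference) pairs of scalars."""

    def mean(values):
        return sum(values) / len(values)

    # Eq. 4: lower-expectile regression toward model-based targets.
    loss_model = mean(
        [abs(tau - (1.0 if q > y else 0.0)) * (q - y) ** 2 for q, y in model_batch]
    )
    # Eq. 5: standard squared Bellman error on offline transitions.
    loss_env = mean([0.5 * (q - y) ** 2 for q, y in env_batch])
    # Eq. 6: squared gap to the EMA critic's predictions.
    loss_ema = mean([(q - q_bar) ** 2 for q, q_bar in ema_batch])
    return beta * loss_model + (1.0 - beta) * loss_env + beta_ema * loss_ema


loss = critic_loss(
    model_batch=[(1.0, 0.0)],  # Q overestimates the model target by 1
    env_batch=[(2.0, 0.0)],    # Q overestimates the offline target by 2
    ema_batch=[(1.0, 0.0)],    # Q differs from the EMA critic by 1
    tau=0.1,
    beta=0.5,
    beta_ema=1.0,
)
print(loss)  # 0.5 * 0.9 + 0.5 * 2.0 + 1.0 * 1.0, approximately 2.45
```

Note the asymmetry between the terms: only the model-generated batch uses the expectile weighting, while the offline batch, whose transitions are real, uses the ordinary symmetric squared error.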

4.2 Lower expectile Q-learning with $\lambda$-return

To further improve LEQ for long-horizon tasks, we use the $\lambda$-return instead of the $1$-step return for Q-learning. The $\lambda$-return allows a Q-function and policy to learn from low-bias multi-step returns [28]. Reducing bias in value estimation with the $\lambda$-return is especially important on long-horizon tasks where values for nearby states are similar to each other, as illustrated in Figure 2.

We first define the $\lambda$-return of a trajectory $\tau$ at timestep $t$, $Q_t^{\lambda}(\tau)$, using the $N$-step return, $G_{t:t+N}(\tau)$ (footnote 2: our $\lambda$-return is slightly different from that of [31, 11], which puts a high weight on the last $N$-step return, $G_{t:H}(\tau)$):

$$G_{t:t+N}(\tau)=\sum_{i=0}^{N-1}\gamma^{i}r(\mathbf{s}_{t+i},\mathbf{a}_{t+i})+\gamma^{N}Q_{\phi}(\mathbf{s}_{t+N},\mathbf{a}_{t+N}), \tag{8}$$
$$Q_{t}^{\lambda}(\tau)=\frac{1-\lambda}{1-\lambda^{H-t-1}}\sum_{i=1}^{H-t}\lambda^{i-1}G_{t:t+i}(\tau). \tag{9}$$
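The two definitions above can be sketched in a few lines of plain Python for a single imagined rollout. Here `rewards` holds $r(\mathbf{s}_i,\mathbf{a}_i)$ for $i=0,\dots,H-1$ and `q_values` holds $Q_{\phi}(\mathbf{s}_i,\mathbf{a}_i)$ for $i=0,\dots,H$; both names and the list layout are assumptions of this example. The sketch divides by the sum of the weights $\lambda^{i-1}$ so that they add up to one. This is a choice made for the example: it corresponds to an exponent of $H-t$ in the normalizer, whereas Equation 9 as printed has $H-t-1$.

```python
def n_step_return(rewards, q_values, t, n, gamma):
    """Equation 8: n discounted rewards from step t, then bootstrap with Q at t+n."""
    ret = sum(gamma ** i * rewards[t + i] for i in range(n))
    return ret + gamma ** n * q_values[t + n]


def lambda_return(rewards, q_values, t, gamma, lam):
    """Weighted average of the 1..(H-t)-step returns with weights lam**(i-1).

    Weights are normalized by their sum (an assumption of this sketch).
    """
    horizon = len(rewards)  # H
    steps = range(1, horizon - t + 1)
    weights = [lam ** (i - 1) for i in steps]
    returns = [n_step_return(rewards, q_values, t, i, gamma) for i in steps]
    return sum(w * g for w, g in zip(weights, returns)) / sum(weights)
```

Setting `lam=0.0` recovers the $1$-step return, and `lam=1.0` gives a uniform average over all available $N$-step returns.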

Then, we can rewrite the Q-learning loss in Equation 4 with $1$-step return to the one with $\lambda$-return:

$$\mathcal{L}^{\lambda}_{Q,\text{model}}(\phi)=\mathbb{E}_{\mathbf{s}_{0}\in\mathcal{D}_{\text{model}},\tau\sim p_{\psi},\pi_{\theta}}\left[\sum_{t=0}^{H-1}L_{2}^{\tau}(Q_{\phi}(\mathbf{s}_{t},\pi_{\theta}(\mathbf{s}_{t}))-Q^{\lambda}_{t}(\tau))\right]. \tag{10}$$
Figure 2: $\lambda$-return with noisy critic. For long-horizon tasks, value estimates (blue for high values and white for low values) are likely to be flat for the states far from the reward signal shown with the yellow star. Since Q-values are small and similar to each other in this region, Q-values are easily dominated by noise, leading to a wrong learning signal for the policy, as illustrated in the middle row. $\lambda$-return can effectively reduce the critic noise by considering both short-term and long-term value estimates, leading to a better learning signal for the policy.
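The critic objective in Equation 10 can be sketched for a single rollout as follows. The sketch assumes that $L_2^{\tau}$ is the asymmetric squared loss of expectile regression, $L_2^{\tau}(u)=\lvert\tau-\mathbb{1}(u>0)\rvert\,u^{2}$, which is not restated in this section; the function names are hypothetical, and the expectation over start states and rollouts is left to the caller.

```python
def expectile_loss(u, tau):
    """Asymmetric squared loss: weight |tau - 1(u > 0)| times u squared."""
    weight = abs(tau - (1.0 if u > 0 else 0.0))
    return weight * u ** 2


def critic_model_loss(q_preds, lambda_returns, tau):
    """Sum over rollout steps of the expectile loss on Q(s_t, a_t) - Q_t^lambda."""
    return sum(
        expectile_loss(q - target, tau)
        for q, target in zip(q_preds, lambda_returns)
    )
```

For $\tau<0.5$, an overestimate ($u>0$) is weighted by $1-\tau$ and an underestimate by $\tau$, so the minimizer sits below the mean of the targets; $\tau=0.5$ gives the ordinary squared loss up to a constant factor.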

4.3 Lower expectile policy learning with $\lambda$-return

For policy optimization, we can use a deterministic policy $\mathbf{a}=\pi_{\theta}(\mathbf{s})$ and update the policy using the deterministic policy gradients similar to DDPG [22] (footnote 3: LEQ also works with a stochastic policy, but a deterministic policy is easier to train in an offline setup). Instead of maximizing the immediate Q-value, $Q_{\phi}(\mathbf{s},\mathbf{a})$, we propose to directly maximize the lower expectile of $\lambda$-return, which is a more accurate learning target for a policy, analogous to the conservative critic target in Section 4.2:

$$\mathcal{L}_{\pi}^{\lambda}(\theta)=-\mathbb{E}_{\mathbf{s}_{0}\in\mathcal{D}_{\text{model}},\tau\sim p_{\psi},\pi_{\theta}}\left[\sum_{t=0}^{H}\mathbb{E}^{\tau}_{\tau\sim p_{\psi},\pi_{\theta}}\left[Q_{t}^{\lambda}(\tau)\right]\right]. \tag{11}$$

However, due to the expectile term in Equation 11, computing the gradient of $\mathcal{L}_{\pi}^{\lambda}(\theta)$ is not trivial. To estimate this gradient, we propose a differentiable surrogate loss, approximating $\mathbb{E}^{\tau}_{\tau\sim p_{\psi},\pi_{\theta}}\left[Q^{\lambda}_{t}(\tau)\right]$ in Equation 11 with $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})$:

$$\hat{\mathcal{L}}_{\pi}^{\lambda}(\theta)=-\mathbb{E}_{\mathbf{s}_{0}\in\mathcal{D}_{\text{model}},\tau\sim p_{\psi},\pi_{\theta}}\left[\sum_{t=0}^{H}\lvert\tau-\mathbbm{1}\left(Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q^{\lambda}_{t}(\tau)\right)\rvert\cdot Q_{t}^{\lambda}(\tau)\right]. \tag{12}$$

Intuitively, this surrogate loss sets a higher weight ($1-\tau$) on a conservative $\lambda$-return estimation (i.e. $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})>Q^{\lambda}_{t}(\tau)$), encouraging a policy to optimize for this conservative $\lambda$-return. On the other hand, an optimistic $\lambda$-return estimation (i.e. $Q_{\phi}(\mathbf{s}_{t},\mathbf{a}_{t})<Q^{\lambda}_{t}(\tau)$) has less impact on the policy because of its smaller weight ($\tau$). Thus, optimizing this surrogate loss leads to the policy maximizing the lower expectile of $\lambda$-return. We prove in Appendix B that the proposed surrogate loss is a better approximation of Equation 11 than directly maximizing Q-values, $Q_{\phi}(\mathbf{s},\mathbf{a})$.
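The weighting in Equation 12 can be illustrated with a short plain-Python sketch that evaluates the surrogate loss for one rollout. The function name and the list-based inputs are assumptions of this example. The sketch only computes the loss value; in a training setup the indicator is piecewise constant, so gradients would flow only through the $\lambda$-return term.

```python
def surrogate_policy_loss(q_preds, lambda_returns, tau):
    """Negative weighted sum of lambda-returns over one rollout (Equation 12).

    A step where the critic exceeds the lambda-return (a conservative
    estimate) gets weight 1 - tau; otherwise the weight is tau.
    """
    total = 0.0
    for q, ret in zip(q_preds, lambda_returns):
        weight = abs(tau - (1.0 if q > ret else 0.0))
        total += weight * ret
    return -total
```

For example, with $\tau=0.1$, a step whose $\lambda$-return falls below the critic prediction contributes nine times the weight of a step whose $\lambda$-return lies above it.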

4.4 Expanding dataset with model rollouts

One of the problems of offline RL is that the data distribution is limited to the offline dataset $\mathcal{D}_{\text{env}}$. To tackle this problem, we can simulate the current policy inside the model and use the generated trajectories to expand the dataset, similar to prior works [35, 30]. However, the state coverage will be identical when the policy converges, which might lead to catastrophic forgetting.

To prevent this issue, we expand the dataset using the model rollouts from a noisy exploration policy. Specifically, we execute the exploration policy $\pi_{\exp}(\cdot\mid\mathbf{s})$, which simply adds noise $\epsilon\sim N(0,\sigma_{\text{exp}}^{2})$ to the current policy $\pi_{\theta}(\mathbf{s})$, and generate a trajectory of length $R$ ($R=5$ in this paper). We refer to this expanded dataset as $\mathcal{D}_{\text{model}}$. Note that we do not use off-policy actions and rewards for critic and policy updates; instead, we generate $H$-step model rollouts starting from $\mathbf{s}\sim\mathcal{D}_{\text{model}}$ and use them for training the policy and Q-function. Thus, we need to store only the states from the rollouts.
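The expansion step can be sketched as below, under several assumptions: `policy` and `model_step` are hypothetical callables standing in for $\pi_{\theta}$ and the learned dynamics model, states and actions are plain lists of floats, and the start states are assumed to be supplied by the caller (for instance, sampled from the offline dataset). Only $R=5$ comes from the text.

```python
import random


def expand_dataset(start_states, policy, model_step, sigma_exp,
                   rollout_len=5, rng=None):
    """Roll out a noisy version of the policy inside the model, keeping only states.

    Gaussian noise with standard deviation sigma_exp is added to every
    action dimension; actions and rewards are discarded.
    """
    if rng is None:
        rng = random.Random(0)
    new_states = []
    for state in start_states:
        for _ in range(rollout_len):
            action = [a + rng.gauss(0.0, sigma_exp) for a in policy(state)]
            state = model_step(state, action)
            new_states.append(state)
    return new_states
```

The returned states would then be added to $\mathcal{D}_{\text{model}}$ and serve as start states for the $H$-step rollouts used in the critic and policy updates.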

5 Experiments

In this paper, we propose a novel model-based offline RL method with simple and efficient yet accurate conservative value estimation. Through our experiments, we aim to answer the following questions: (1) Can LEQ solve long-horizon tasks? (2) How does LEQ perform in widely used offline RL benchmarks? (3) Which component enables model-based offline RL to learn the AntMaze tasks?

Figure 3: Locomotion tasks. Panels: (a) Hopper, (b) Walker2d, (c) HalfCheetah.

Figure 4: AntMaze tasks. Panels: (d) Umaze, (e) Medium, (f) Large, (g) Ultra.

5.1 Tasks

To show the strength of LEQ in solving long-horizon tasks, we use the AntMaze tasks, which aim to navigate an $8$-DOF ant robot to the desired goal position, as shown in Figure 4. Specifically, we use the umaze, medium, and large datasets from D4RL [6], and the ultra dataset from Jiang et al. [15]. Moreover, we evaluate our method on MuJoCo locomotion tasks (Figure 3) with dense rewards, using the D4RL [6] and NeoRL [26] datasets. Please refer to Appendix A for more experimental details.

5.2 Compared offline RL algorithms

We compare the performance of LEQ with the state-of-the-art offline RL algorithms. Please note that LEQ uses the same hyperparameters across all tasks, except the expectile parameter, $\tau$.

Model-free offline RL.

We consider behavioral cloning (BC) [25]; TD3+BC [7], which adds a BC loss to TD3; CQL [19], which penalizes out-of-distribution actions; and IQL [17], which utilizes expectile regression to estimate the value function. For locomotion tasks, we also compare with EDAC [1], which penalizes the Q-values according to the uncertainty of Q-functions.

Model-based offline RL.

We consider MOPO [34] and MOBILE [30], which penalize Q-values according to the transition uncertainty and the Bellman uncertainty of a world model, respectively; COMBO [35], which combines CQL with MBPO; RAMBO [27], which trains an adversarial world model against the policy; and CBOP [14], which utilizes multi-step returns for critic updates.

5.3 Results on long-horizon AntMaze tasks

As shown in Table 1, LEQ significantly outperforms the prior model-based approaches on all $8$ datasets. LEQ achieves $58.6$ and $60.2$ on antmaze-large-play and antmaze-large-diverse, while the second best method, RAMBO [27], scores only $0.0$ and $2.4$, respectively. We believe these performance gains come from our conservative value estimation, which is more stable than the uncertainty-based penalization of prior works.

Moreover, LEQ even significantly outperforms the model-free approaches in antmaze-umaze, antmaze-large, and antmaze-ultra. Despite its superior performance, LEQ often shows high variance during training, resulting in worse performance on antmaze-medium. Over the course of training, LEQ mostly achieves high success rates, but the evaluation results sometimes drop to $0\%$, as shown in Appendix, Figure 5. We leave the problem of reducing the high variance of our method in certain environments as future work.

Table 1: AntMaze results. Each number represents the average success rate on $100$ trials over different seeds. The results for LEQ, MOBILE, and CBOP are averaged over $5$ seeds. The results for other methods are reported following their respective papers. Model-free: BC, TD3+BC, CQL, IQL. Model-based: MOPO, COMBO, RAMBO, MOBILE, CBOP, LEQ (ours).

| Dataset | BC | TD3+BC | CQL | IQL | MOPO | COMBO | RAMBO | MOBILE | CBOP | LEQ (ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| antmaze-umaze | 65.0 | 78.6 | 74.0 | 87.5 | 0.0 | 80.3 | 25.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | **94.4 ± 6.3** |
| antmaze-umaze-diverse | 55.0 | 71.4 | 84.0 | 62.2 | 0.0 | 57.3 | 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | **71.0 ± 12.3** |
| antmaze-medium-play | 0.0 | 3.0 | 61.2 | **71.2** | 0.0 | 0.0 | 16.4 | 0.0 ± 0.0 | 0.0 ± 0.0 | 58.8 ± 33.0 |
| antmaze-medium-diverse | 0.0 | 10.6 | 53.7 | **70.0** | 0.0 | 0.0 | 23.2 | 0.0 ± 0.0 | 0.0 ± 0.0 | 46.2 ± 23.2 |
| antmaze-large-play | 0.0 | 0.0 | 15.8 | 39.6 | 0.0 | 0.0 | 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | **58.6 ± 9.1** |
| antmaze-large-diverse | 0.0 | 0.2 | 14.9 | 47.5 | 0.0 | 0.0 | 2.4 | 0.0 ± 0.0 | 0.0 ± 0.0 | **60.2 ± 18.3** |
| antmaze-ultra-play | -- | -- | -- | 8.3 | -- | -- | -- | 0.0 ± 0.0 | 0.0 ± 0.0 | **25.8 ± 18.2** |
| antmaze-ultra-diverse | -- | -- | -- | 15.6 | -- | -- | -- | 0.0 ± 0.0 | 0.0 ± 0.0 | **55.8 ± 18.3** |
| Total w/o antmaze-ultra | 120.0 | 163.8 | 303.6 | 354.1 | 0.0 | 137.6 | 67.0 | 0.0 | 0.0 | **388.8** |
| Total | -- | -- | -- | 378.0 | -- | -- | -- | 0.0 | 0.0 | **470.4** |

† We use the official implementation of MOBILE and CBOP.

5.4 Results on MuJoCo Gym locomotion tasks

For D4RL MuJoCo Gym tasks in Table 3, LEQ achieves comparable results with the best score of prior works in $6$ out of $12$ tasks. Furthermore, in Table 2, LEQ outperforms most of the prior works in the NeoRL benchmark, especially in the Hopper and Walker2d domains. These results show that LEQ serves as a general offline RL algorithm, not limited to long-horizon tasks.

Similar to antmaze-medium, LEQ also suffers from the high variance problem on these tasks. During training, LEQ often achieves high performance but then suddenly falls back to $0$, as shown in Appendix, Figure 5. This is mainly because the learned models sometimes fail to capture failures (e.g. hopper and walker falling off) and predict an optimistic future (e.g. hopper and walker walking forward).

Table 2: NeoRL results. LEQ and IQL results are averaged over $5$ seeds. The results for prior works are reported following Sun et al. [30] and Qin et al. [26]. The MOPO column reports the improved version of MOPO presented in Sun et al. [30]. We highlight the results that are better than $95\%$ of the best score. Model-free: BC, TD3+BC, CQL, EDAC, IQL. Model-based: MOPO, MOBILE, LEQ (ours).

| Dataset | BC | TD3+BC | CQL | EDAC | IQL | MOPO | MOBILE | LEQ (ours) |
|---|---|---|---|---|---|---|---|---|
| Hopper-L | 15.1 | 15.8 | 16.0 | 18.3 | 16.7 | 6.2 | 17.4 | **24.2** ± 2.3 |
| Hopper-M | 51.3 | 70.3 | 64.5 | 44.9 | 28.4 | 1.0 | 51.1 | **104.3** ± 5.2 |
| Hopper-H | 43.1 | 75.3 | 76.6 | 52.5 | 22.3 | 11.5 | 87.8 | **95.5** ± 13.9 |
| Walker2d-L | 28.5 | 43.0 | 44.7 | 40.2 | 30.7 | 11.6 | 37.6 | **65.1** ± 2.3 |
| Walker2d-M | 48.7 | 58.5 | 57.3 | 57.6 | 51.8 | 39.9 | **62.2** | 45.2 ± 19.4 |
| Walker2d-H | **72.6** | 69.6 | **75.3** | **75.5** | **76.3** | 18.0 | **74.9** | **73.7** ± 1.1 |
| HalfCheetah-L | 29.1 | 30.0 | 38.2 | 31.3 | 30.7 | 40.1 | **54.7** | 33.4 ± 1.6 |
| HalfCheetah-M | 49.0 | 52.3 | 54.6 | 54.9 | 51.8 | 62.3 | **77.8** | 59.2 ± 3.9 |
| HalfCheetah-H | 71.4 | 75.3 | 77.4 | **81.4** | 76.3 | 65.9 | **83.0** | 71.8 ± 8.0 |
| Total | 408.8 | 490.1 | 504.6 | 456.6 | 385.0 | 256.5 | **546.5** | **572.4** |

Table 3: D4RL MuJoCo Gym results. Each number is a normalized score averaged over $100$ trials [6]. Our results are averaged over $5$ seeds. The results for prior works are reported following their respective papers. The MOPO column reports the improved version of MOPO introduced in Sun et al. [30]. We highlight the results that are better than $95\%$ of the best score. Model-free: BC, TD3+BC, CQL, EDAC, IQL. Model-based: MOPO, COMBO, RAMBO, MOBILE, CBOP, LEQ (ours).

| Dataset | BC | TD3+BC | CQL | EDAC | IQL | MOPO | COMBO | RAMBO | MOBILE | CBOP | LEQ (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| hopper-r | 3.7 | 8.5 | 5.3 | 25.3 | 7.6 | 31.7 | 17.9 | 25.4 | **31.9** | **32.8** | **32.4** ± 0.3 |
| hopper-m | 54.1 | 59.3 | 61.9 | **101.6** | 66.3 | 62.8 | 97.2 | 87.0 | **106.6** | **102.6** | **103.4** ± 0.3 |
| hopper-mr | 16.6 | 60.9 | 86.3 | **101.0** | 94.7 | **99.4** | **103.5** | 89.5 | 99.5 | **104.3** | **103.9** ± 1.3 |
| hopper-me | 53.9 | 98.0 | 96.9 | **110.7** | 91.5 | 81.6 | **111.1** | 88.2 | **112.6** | **111.6** | **109.4** ± 1.8 |
| walker2d-r | 1.3 | 1.6 | 5.4 | 16.6 | 5.2 | 7.4 | 7.0 | 0.0 | 17.9 | 17.8 | **21.5** ± 0.1 |
| walker2d-m | 70.9 | 83.7 | 79.5 | **92.5** | 78.3 | 81.3 | 84.1 | 81.9 | 84.9 | **87.7** | 74.9 ± 26.9 |
| walker2d-mr | 20.3 | 81.8 | 76.8 | 87.1 | 73.9 | 85.6 | 56.0 | 89.2 | 89.9 | 92.7 | **98.7** ± 6.0 |
| walker2d-me | 90.1 | 110.1 | 109.1 | **114.7** | 109.6 | **112.9** | 103.3 | 56.7 | **115.2** | **117.2** | 108.2 ± 1.3 |
| halfcheetah-r | 2.2 | 11.0 | 31.3 | 28.4 | 11.8 | **38.5** | **38.8** | **39.5** | **39.3** | 32.8 | 30.8 ± 3.3 |
| halfcheetah-m | 43.2 | 48.3 | 46.9 | 65.9 | 47.4 | 73.0 | 54.2 | **77.9** | **74.6** | **74.3** | 71.7 ± 4.4 |
| halfcheetah-mr | 37.6 | 44.6 | 45.3 | 61.3 | 44.2 | **72.1** | 55.1 | **68.7** | **71.7** | 66.4 | 65.5 ± 1.1 |
| halfcheetah-me | 44.0 | 90.7 | 95.0 | **106.3** | 86.7 | 90.8 | 90.0 | 95.4 | **108.2** | **105.4** | **102.8** ± 0.4 |
| Total | 437.9 | 698.5 | 739.7 | 911.4 | 717.2 | 844.0 | 802.0 | 812.4 | **959.5** | **953.4** | **923.2** |

5.5 Ablation studies

To understand why LEQ (LEQ-$\lambda$) works well in long-horizon tasks, we conduct ablation studies and answer the following four questions: (1) Does using $\lambda$-return help? (2) Is LEQ better than prior uncertainty-based penalization methods? (3) Which factor enables LEQ to work in AntMaze? and (4) How do imagination length $H$ and data expansion length $R$ affect the performance?

(1) $\lambda$-returns.

To verify the effect of $\lambda$-return, we compare our method (LEQ-$\lambda$) with the versions using the $1$-step return (LEQ-$1$) and the $H$-step return (LEQ-$H$). Table 4 shows that using $\lambda$-return drastically improves the performance on AntMaze compared to using $1$-step return or $H$-step return. This result is consistent with the observations in prior online RL methods [28, 11].

(2) Lower expectile Q-learning.

We compare our lower expectile Q-learning with another conservative value estimator, MOBIP used in MOBILE [30], which penalizes Q-values with the standard deviation of Q-ensemble networks. The only difference between LEQ and MOBIP is their target Q-value computation for both critic and policy updates. Table 4 shows that using MOBIP not only deteriorates the success rates (in MOBIP-$1$) but also does not benefit from $\lambda$-return (in MOBIP-$\lambda$).

(3) What makes offline model-based RL work in AntMaze?

Prior to LEQ, none of the offline model-based RL methods worked in AntMaze, whereas our method even outperforms model-free methods. Thus, we investigate which changes in LEQ enable offline model-based RL to work in AntMaze.

Table 4: Ablation study results on the AntMaze tasks. (1) We compare different Q-targets, LEQ-$\lambda$, LEQ-$1$, and LEQ-$H$. (2) We compare our lower expectile Q-learning strategy with another conservative Q-value estimation, MOBIP-$1$ and MOBIP-$\lambda$. (3) The MOBILE rows report our re-implementation of MOBILE; the plain MOBILE row uses MOBILE's default hyperparameters, i.e., $\beta=0.95$, $\gamma=0.99$, and $R=5$, and the remaining rows change one hyperparameter each. We note that lowering $\beta$ to $0.25$ is crucial for MOBILE to achieve meaningful scores in AntMaze.

| Method | umaze | umaze-diverse | medium-play | medium-diverse | large-play | large-diverse | ultra-play | ultra-diverse | Total |
|---|---|---|---|---|---|---|---|---|---|
| LEQ-$\lambda$ (ours) | **94.4** ± 6.3 | **71.0** ± 12.3 | 58.8 ± 33.0 | 46.2 ± 23.2 | **58.6** ± 9.1 | **60.2** ± 18.3 | 25.8 ± 18.2 | **55.8** ± 18.3 | **470.4** |
| LEQ-$H$ | **93.0** ± 3.4 | 60.7 ± 10.4 | 46.3 ± 32.4 | 0.0 ± 0.0 | **57.0** ± 25.6 | 33.3 ± 43.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 290.3 |
| LEQ-$1$ | 89.6 ± 4.8 | 37.0 ± 32.8 | 55.8 ± 28.7 | 29.8 ± 24.5 | 34.2 ± 13.4 | 49.3 ± 9.0 | **42.2** ± 13.2 | 35.6 ± 13.0 | 373.5 |
| MOBIP-$\lambda$ | 84.3 ± 3.5 | 40.3 ± 20.4 | 51.3 ± 9.0 | 39.7 ± 12.5 | 28.3 ± 21.5 | 33.7 ± 10.0 | 38.0 ± 27.1 | 23.3 ± 4.9 | 338.9 |
| MOBIP-$1$ | 59.5 ± 3.5 | 46.5 ± 1.5 | 57.0 ± 11.0 | **54.0** ± 9.0 | 23.5 ± 19.5 | 38.5 ± 1.5 | 39.5 ± 11.5 | 20.5 ± 20.5 | 339.0 |
| MOBILE | 1.0 ± 2.0 | 0.0 ± 0.0 | 6.4 ± 5.5 | 5.0 ± 5.0 | 0.8 ± 1.6 | 0.8 ± 1.2 | 0.0 ± 0.0 | 0.0 ± 0.0 | 14.0 |
| MOBILE ($\beta$ = 0.25) | 77.0 ± 6.4 | 20.4 ± 15.7 | **64.6** ± 11.1 | 31.6 ± 16.9 | 2.6 ± 2.8 | 7.2 ± 8.9 | 4.6 ± 3.0 | 5.0 ± 4.6 | 213.0 |
| MOBILE ($\gamma$ = 0.997) | 0.0 ± 0.0 | 0.0 ± 0.0 | 7.2 ± 4.1 | 1.6 ± 2.1 | 9.6 ± 7.1 | 5.4 ± 4.9 | 0.0 ± 0.0 | 1.8 ± 2.7 | 25.6 |
| MOBILE ($R$ = 10) | 0.0 ± 0.0 | 0.0 ± 0.0 | 5.0 ± 5.1 | 0.6 ± 1.2 | 7.4 ± 14.8 | 1.6 ± 3.2 | 0.0 ± 0.0 | 0.0 ± 0.0 | 14.6 |

We first re-implement MOBILE with several implementation techniques used in LEQ: LayerNorm [3], SymLog [11], a single Q-network, and no target Q-value clipping. Even so, MOBILE achieves a barely non-zero total score of $14.0$. We found that the key to making MOBILE work is reducing $\beta$, the ratio between the loss computed from imaginary rollouts and the loss computed from dataset transitions. When we lower $\beta$ from $0.95$ to $0.25$ (the value used in LEQ), MOBILE achieves meaningful performance in the umaze and medium mazes and a total score of $213.0$. This suggests that using the true transitions from the dataset is important in long-horizon tasks, a point that was undervalued in prior work.
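The effect of $\beta$ can be illustrated with a small sketch. It assumes that $\beta$ weights the loss on imagined rollouts and $1-\beta$ weights the loss on dataset transitions, which is how the description above reads; the function name and the toy loss values are hypothetical and are not taken from the LEQ or MOBILE code.

```python
def mixed_critic_loss(loss_model: float, loss_env: float, beta: float) -> float:
    """Blend two critic losses, assuming the form beta * model + (1 - beta) * env."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * loss_model + (1.0 - beta) * loss_env


# Toy loss values (hypothetical) showing how beta shifts the emphasis
# between imagined rollouts and real dataset transitions.
for beta in (0.95, 0.25):
    print(beta, mixed_critic_loss(loss_model=2.0, loss_env=1.0, beta=beta))
```

With $\beta = 0.95$ the dataset term contributes only 5% of the total, whereas $\beta = 0.25$ lets the real transitions dominate.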

(4) Imagination length $H$ and dataset expansion length $R$.

As shown in Table 5, performance improves when $H$ increases from $5$ to $10$, but drops at $H=15$. This result shows the trade-off of using the world model: the further the agent imagines, the more robust it becomes to critic errors, but the more prone it becomes to model prediction errors.
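To make the role of the imagination length concrete, the sketch below computes $\lambda$-returns over an $H$-step imagined rollout. It assumes one common backward recursion, $Q^\lambda_t = r_t + \gamma[(1-\lambda)\,Q(\mathbf{s}_{t+1}) + \lambda\,Q^\lambda_{t+1}]$, with a critic bootstrap at the final step and no termination handling; the exact definition used by LEQ may differ, and the function name and array layout are hypothetical.

```python
import numpy as np


def lambda_returns(rewards, next_values, gamma=0.997, lam=0.95):
    """Backward recursion for lambda-returns over an H-step rollout.

    rewards[t]     : reward predicted by the model at step t (length H)
    next_values[t] : critic estimate for the state reached after step t (length H)
    """
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    horizon = len(rewards)
    returns = np.empty(horizon)
    # At the last step there is nothing further to roll out, so the
    # recursion starts from the critic's estimate of the final state.
    running = next_values[-1]
    for t in reversed(range(horizon)):
        bootstrap = (1.0 - lam) * next_values[t] + lam * running
        running = rewards[t] + gamma * bootstrap
        returns[t] = running
    return returns


# lam = 0 relies only on the critic (one-step targets);
# lam = 1 relies only on model-predicted rewards plus one final bootstrap.
print(lambda_returns([1.0, 1.0], [0.0, 0.0], gamma=0.5, lam=1.0))
```

A larger $H$ or $\lambda$ pushes more weight onto model-predicted rewards and less onto the critic, which mirrors the trade-off described above.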

We also evaluate LEQ without the dataset expansion ($H=10, R=1$). In AntMaze, the results with and without the dataset expansion are similar, as shown in Table 5. On the other hand, the dataset expansion makes the policy more stable and better in the D4RL MuJoCo tasks (Table 13).

Table 5: LEQ with different imagination lengths $H$ and data expansion lengths $R$. A longer $H$ can mitigate critic biases but increases model errors, which leads to poor performance. Each number is averaged over $5$ random seeds.
Dataset $\mathbf{H=10, R=5}$ (ours) $H=5, R=5$ $H=15, R=5$ $H=10, R=1$
antmaze-umaze $\mathbf{94.4} \pm 6.3$ $\mathbf{95.2} \pm 1.7$ $\mathbf{98.6} \pm 0.5$ $\mathbf{97.4} \pm 1.4$
antmaze-umaze-diverse $\mathbf{71.0} \pm 12.3$ $\mathbf{67.2} \pm 9.1$ $\mathbf{70.7} \pm 15.2$ $63.0 \pm 23.2$
antmaze-medium-play $58.8 \pm 33.0$ $46.4 \pm 31.9$ $\mathbf{76.3} \pm 17.2$ $58.2 \pm 28.0$
antmaze-medium-diverse $\mathbf{46.2} \pm 23.2$ $18.6 \pm 28.7$ $30.3 \pm 40.1$ $28.6 \pm 33.7$
antmaze-large-play $58.6 \pm 9.1$ $48.6 \pm 15.4$ $\mathbf{62.0} \pm 9.9$ $56.0 \pm 9.8$
antmaze-large-diverse $\mathbf{60.2} \pm 18.3$ $35.2 \pm 8.7$ $33.0 \pm 3.2$ $\mathbf{57.0} \pm 4.5$
antmaze-ultra-play $25.8 \pm 18.2$ $\mathbf{54.2} \pm 10.8$ $0.0 \pm 0.0$ $39.2 \pm 15.1$
antmaze-ultra-diverse $\mathbf{55.8} \pm 18.3$ $39.4 \pm 6.1$ $0.0 \pm 0.0$ $36.0 \pm 12.0$
Total $\mathbf{470.4}$ $404.8$ $371.0$ $435.4$

6 Conclusion

In this paper, we propose a novel offline model-based reinforcement learning method, LEQ, which uses expectile regression to obtain a conservative evaluation of a policy from model-generated trajectories. Expectile regression avoids the need to construct the whole distribution of Q-targets and allows the conservative value to be estimated via sampling. Combined with $\lambda$-returns in both critic and policy updates for the imaginary rollouts, the policy receives learning signals that are more robust to both model errors and critic errors. We empirically show that LEQ improves performance on various tasks and, in particular, achieves state-of-the-art performance on the long-horizon AntMaze tasks.

6.1 Limitations

Following prior work on model-based offline RL [30, 14], we assume access to the ground-truth termination function of a task, unlike online model-based RL approaches, which learn a termination function from interactions. However, since this termination function is conditioned on a state, the model must plan in a state space (or an observation space), which could be challenging when that space is high-dimensional (e.g. pixel observations). Extending the proposed approach to complex environments with high-dimensional observations would be an immediate next step.

6.2 Broader Impacts

Our method aims to increase the ability of autonomous agents, such as robots and self-driving cars, to learn from static, offline data without interacting with the world. This enables autonomous agents to utilize data with diverse qualities (not necessarily from experts). We believe that this paper does not have any immediate negative societal impact.

Acknowledgments and Disclosure of Funding

We would like to thank Junik Bae for helpful discussion. This work was supported in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (RS-2020-II201361, Artificial Intelligence Graduate School Program (Yonsei University)) and the National Research Foundation of Korea (NRF) grant (RS-2024-00333634) funded by the Korean Government (MSIT). Kwanyoung Park was supported by Electronics and Telecommunications Research Institute (ETRI).

References

  • An et al. [2021] Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble. Advances in neural information processing systems, 34:7436–7447, 2021.
  • Argenson and Dulac-Arnold [2021] Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. In International Conference on Learning Representations, 2021.
  • Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Ball et al. [2023] Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023.
  • Feinberg et al. [2018] Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
  • Fu et al. [2020] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Fujimoto and Gu [2021] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. In Neural Information Processing Systems, volume 34, pages 20132–20145, 2021.
  • Fujimoto et al. [2019] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019.
  • Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2019.
  • Hafner et al. [2021] Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In International Conference on Learning Representations, 2021.
  • Hafner et al. [2023] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
  • Hessel et al. [2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Association for the Advancement of Artificial Intelligence, 2018.
  • Janner et al. [2019] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Neural Information Processing Systems, volume 32, 2019.
  • Jeong et al. [2023] Jihwan Jeong, Xiaoyu Wang, Michael Gimelfarb, Hyunwoo Kim, Baher Abdulhai, and Scott Sanner. Conservative bayesian model-based value expansion for offline policy optimization. In International Conference on Learning Representations, 2023.
  • Jiang et al. [2023] Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, and Yuandong Tian. Efficient planning in a compact latent action space. In International Conference on Learning Representations, 2023.
  • Kidambi et al. [2020] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. In Neural Information Processing Systems, volume 33, pages 21810–21823, 2020.
  • Kostrikov et al. [2022] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022.
  • Kumar et al. [2019] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In Neural Information Processing Systems, volume 32, 2019.
  • Kumar et al. [2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In Neural Information Processing Systems, volume 33, pages 1179–1191, 2020.
  • Le et al. [2019] Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pages 3703–3712. PMLR, 2019.
  • Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Lillicrap et al. [2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Park et al. [2024] Seohong Park, Dibya Ghosh, Benjamin Eysenbach, and Sergey Levine. Hiql: Offline goal-conditioned rl with latent states as actions. In Neural Information Processing Systems, volume 36, 2024.
  • Peng et al. [2019] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • Pomerleau [1989] Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305–313, 1989.
  • Qin et al. [2022] Rong-Jun Qin, Xingyuan Zhang, Songyi Gao, Xiong-Hui Chen, Zewen Li, Weinan Zhang, and Yang Yu. NeoRL: A near real-world benchmark for offline reinforcement learning. In Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=jNdLszxdtra.
  • Rigter et al. [2022] Marc Rigter, Bruno Lacerda, and Nick Hawes. Rambo-rl: Robust adversarial model-based offline reinforcement learning. In Neural Information Processing Systems, volume 35, pages 16082–16097, 2022.
  • Schulman et al. [2016] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016.
  • Sun [2023] Yihao Sun. Offlinerl-kit: An elegant pytorch offline reinforcement learning library. https://github.com/yihaosun1124/OfflineRL-Kit, 2023.
  • Sun et al. [2023] Yihao Sun, Jiaji Zhang, Chengxing Jia, Haoxin Lin, Junyin Ye, and Yang Yu. Model-bellman inconsistency for model-based offline reinforcement learning. In International Conference on Machine Learning, pages 33177–33194. PMLR, 2023.
  • Sutton [1988] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3:9–44, 1988.
  • Sutton [1991] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Yu et al. [2020] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. In Neural Information Processing Systems, volume 33, pages 14129–14142, 2020.
  • Yu et al. [2021] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. In Neural Information Processing Systems, volume 34, pages 28954–28967, 2021.

Appendix A Training Details

Computing resources.

All experiments are run on a single RTX 4090 GPU and $4$ AMD EPYC 9354 CPU cores. We use $5$ different random seeds for each experiment and report the mean and standard deviation. Each offline RL experiment takes $2$ hours for our method, $12$ hours for MOBILE, and $24$ hours for CBOP.

Environment details.

For the locomotion tasks, we use the dataset provided by D4RL [6] and NeoRL [26]. Following IQL [17], we normalize rewards using the maximum and minimum return of all trajectories. We use the true termination functions of the environments, implemented in MOBILE [30].
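As a rough sketch of the reward normalization described above, the snippet below divides every reward by the gap between the largest and smallest trajectory return. It assumes undiscounted returns and trajectories that are already split into separate reward arrays; the `scale` argument and the function name are hypothetical, and any constant rescaling used in the IQL or LEQ code is not reproduced here.

```python
import numpy as np


def normalize_rewards(trajectory_rewards, scale=1.0):
    """Divide rewards by (max return - min return) over all trajectories."""
    # Undiscounted return of each trajectory (an assumption of this sketch).
    returns = [float(np.sum(r)) for r in trajectory_rewards]
    span = max(returns) - min(returns)
    if span <= 0.0:
        raise ValueError("all trajectories have the same return")
    return [scale * np.asarray(r, dtype=float) / span for r in trajectory_rewards]


# Two toy trajectories with returns 2 and 6, so rewards are divided by 4.
print(normalize_rewards([[1.0, 1.0], [3.0, 3.0]]))
```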

For the AntMaze tasks, we use the datasets provided by D4RL [6]. Following IQL [17], we subtract $1$ from the rewards in the datasets so that the agent receives $-1$ for each step and $0$ on termination. We use the true termination functions of the environments. The termination functions of the AntMaze tasks are not deterministic because the goal of a maze is randomized every time the environment is reset. Nevertheless, we follow the implementation of CBOP [14], where the termination region is set to a circle of radius $0.5$ around the mean of the goal distribution.
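The reward shift and the circular termination region can be sketched as follows. This is a minimal illustration, assuming that the agent's planar position is available as an $(x, y)$ pair and that the boundary of the circle counts as terminal; the goal coordinates in the example and the function names are hypothetical.

```python
import numpy as np

GOAL_RADIUS = 0.5  # radius of the termination region stated in the text


def shift_rewards(rewards):
    """Subtract 1 so that each step costs -1 and reaching the goal gives 0."""
    return np.asarray(rewards, dtype=float) - 1.0


def is_terminal(xy_positions, goal_mean_xy, radius=GOAL_RADIUS):
    """Flag positions inside the circle centered at the mean goal location."""
    xy = np.atleast_2d(np.asarray(xy_positions, dtype=float))
    goal = np.asarray(goal_mean_xy, dtype=float)
    distance = np.linalg.norm(xy - goal, axis=-1)
    return distance <= radius


print(shift_rewards([0.0, 1.0]))
print(is_terminal([[0.0, 0.0], [1.0, 1.0]], goal_mean_xy=(0.3, 0.0)))
```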

Method implementation details.

For all compared methods, we use the results from their corresponding papers when available. For IQL [17], we run the official implementation with $5$ seeds to reproduce the results for the random datasets in D4RL and NeoRL. For the AntMaze tasks, we run the official implementations of MOBILE and CBOP with $5$ random seeds. Please note that the original MOBILE implementation does not use the true termination function, so we replace it with our termination function. For MOPO, COMBO, and RAMBO, we use the results reported in RAMBO [27].

World models.

For training world models, we use the architecture and training script from OfflineRL-Kit [29], matching the implementation of MOBILE [30]. Each world model is implemented as a $4$-layer MLP with a hidden layer size of $200$. We construct an ensemble of world models by selecting the $5$ models with the best validation scores out of $7$. We pretrain the ensemble of world models for each of $5$ random seeds (i.e. training $35$ world models in total and using $25$ of them), which takes approximately $5$ hours on average.
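The ensemble selection step can be sketched in a few lines. The snippet assumes that the validation score is a loss where lower is better; the function name and the example losses are hypothetical.

```python
import numpy as np


def select_elites(validation_losses, num_elites=5):
    """Return the indices of the models with the lowest validation loss."""
    losses = np.asarray(validation_losses, dtype=float)
    if num_elites > losses.size:
        raise ValueError("cannot select more elites than trained models")
    # argsort orders indices from the lowest to the highest loss.
    return np.argsort(losses)[:num_elites]


# Seven trained models (hypothetical losses); keep the best five.
print(select_elites([0.3, 0.1, 0.5, 0.2, 0.9, 0.4, 0.05]))
```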

Policy and critic networks.

We use $3$-layer MLPs with a size of $256$ for both the policy network and the critic network. We use layer normalization [3] to prevent catastrophic over/underestimation [4], and squash the state inputs using symlog to keep training stable from outliers in long-horizon model rollouts [11].
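For reference, the symlog squashing from [11] is $\text{symlog}(x) = \text{sign}(x)\,\ln(1 + |x|)$. The sketch below shows the transform and its inverse in NumPy; applying it element-wise to the state vector before the first network layer is an assumption of this example.

```python
import numpy as np


def symlog(x):
    """Symmetric log squashing: sign(x) * ln(1 + |x|)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))


def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.expm1(np.abs(y))


# Large outliers are compressed, while values near zero are almost unchanged.
states = np.array([-1000.0, -1.0, 0.0, 0.01, 1000.0])
print(symlog(states))
```

The transform is odd and monotone, so it preserves the sign and ordering of each state dimension while limiting the influence of extreme values.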

Pretraining.

For some environments, we found that a randomly initialized policy can lead to abnormal rewards or transition predictions from the world models in the early stage, leading to unstable training. Following CBOP [14], we pretrain a policy $\pi_\theta$ and a critic $Q_\phi$ using behavioral cloning and FQE [20], respectively. We use a slightly different implementation of FQE from the original, where the $\arg\min$ operation is approximated with mini-batch gradient descent, similar to standard Q-learning, as shown in Algorithm 2.

Algorithm 2 FQE: Fitted Q Evaluation [20]
0:  Offline dataset $\mathcal{D}_{\text{env}}$, policy $\pi_\theta$
1:  Randomly initialize Q-function $Q_\phi$
2:  while not converged do
3:     $\{\mathbf{s}_i, \mathbf{a}_i, r_i, \mathbf{s}'_i\}_{i=1}^{N} \sim \mathcal{D}_{\text{env}}$
4:     $y_i = r_i + \gamma\,\texttt{sg}(Q_\phi(\mathbf{s}'_i, \pi_\theta(\mathbf{s}'_i)))$ $\triangleright$ $\texttt{sg}(\cdot)$ is the stop-gradient operator
5:     $L_{\text{FQE}}(\phi) = \frac{1}{N}\sum_{i=1}^{N}(Q_\phi(\mathbf{s}_i, \mathbf{a}_i) - y_i)^2$
6:     Update $Q_\phi$ using gradient descent to minimize $L_{\text{FQE}}(\phi)$
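The mini-batch variant of FQE can be sketched in NumPy as follows. This is an illustration rather than the authors' implementation: it assumes the standard FQE target $r_i + \gamma Q_\phi(\mathbf{s}'_i, \pi_\theta(\mathbf{s}'_i))$, replaces the MLP critic with a linear Q-function over a hypothetical feature map, and runs for a fixed number of steps instead of testing convergence.

```python
import numpy as np


def features(states, actions):
    """Hypothetical feature map: concatenate state, action, and a bias term."""
    ones = np.ones((states.shape[0], 1))
    return np.concatenate([states, actions, ones], axis=1)


def fqe_linear(states, actions, rewards, next_states, policy,
               gamma=0.9, lr=0.1, batch_size=32, steps=2000, seed=0):
    """Mini-batch FQE with a linear Q-function Q(s, a) = features(s, a) @ w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(states.shape[1] + actions.shape[1] + 1)
    num_samples = states.shape[0]
    for _ in range(steps):
        idx = rng.integers(0, num_samples, size=batch_size)
        # Target built from the current Q; it is a plain array here, so no
        # gradient flows through it (the role of the stop-gradient operator).
        next_q = features(next_states[idx], policy(next_states[idx])) @ w
        y = rewards[idx] + gamma * next_q
        phi = features(states[idx], actions[idx])
        td_error = phi @ w - y
        # Gradient of the mean squared error with respect to w.
        w -= lr * 2.0 * phi.T @ td_error / batch_size
    return w


# Toy check: a single self-looping state with reward 1 at every step.
s = np.zeros((8, 1))
a = np.zeros((8, 1))
r = np.ones(8)
w = fqe_linear(s, a, r, s, policy=lambda x: np.zeros((x.shape[0], 1)))
print(w[-1])  # close to 1 / (1 - 0.9) = 10
```

In the toy problem only the bias feature is active, so the learned value approaches the discounted sum of rewards, $1 / (1 - \gamma)$.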

Comparisons with prior methods.

We provide a comparison of LEQ with the prior model-based approaches and the baseline methods used in our ablation studies in Table 6.

Table 6: Comparisons with the prior model-based approaches and the baseline methods introduced for our ablation studies. We color the hyperparameters in blue if they are the same as in LEQ. Otherwise, we color them in red.
Components CBOP MOBILE MOBILE MOBIP LEQ (ours)
Training scheme MVE [5] MBPO [13] MBPO [13] Dyna [32] Dyna [32]
Conservatism Lower-confidence bound Lower-confidence bound Lower-confidence bound Lower-confidence bound Lower expectile
Policy Stochastic Stochastic Stochastic Deterministic Deterministic
Policy objective $Q(\mathbf{s},\mathbf{a})$ $Q(\mathbf{s},\mathbf{a})$ $Q(\mathbf{s},\mathbf{a})$ $\lambda$-returns $\lambda$-returns
Policy pretraining BC BC BC
# of critics 20-50 2 1 1 1
Critic objective Multi-step (adaptive weighting) One-step One-step $\lambda$-returns $\lambda$-returns
Critic pretraining FQE [20] FQE [20] FQE [20]
Horizon length ($H$) 10 1 1 10 10
Rollout length ($R$) 1 or 5 10 5 5
Discount rate ($\gamma$) 0.99 0.99 0.997 0.997 0.997
$\beta$ in Equation 7 1.0 0.95 0.25 0.25 0.25
Impl. tricks Clip Q-values with $0$ LayerNorm + Symlog LayerNorm + Symlog LayerNorm + Symlog
Running time 24h 12h 40m 4h 2h

Hyperparameters of LEQ.

We report task-agnostic hyperparameters of our method in Table 7. We note that we use the same hyperparameters across all tasks, except $\tau$. We search the value of $\tau$ in $\{0.1, 0.3, 0.4, 0.5\}$ and report the best value for the main experimental results. In addition, we report the exhaustive results in Tables 11 and 12, and summarize the $\tau$ used in the main results in Table 8.

Table 7: Shared hyperparameters of LEQ.
Hyperparameters Value Description
$lr_{\text{actor}}$ 3e-5 Learning rate of actor
$lr_{\text{critic}}$ 1e-4 Learning rate of critic
Optimizer Adam Optimizer
$T_{\text{expand}}$ 5000 Interval of expanding dataset
$N_{\text{expand}}$ 50000 Number of data for each expansion of dataset
$R$ 5 Rollout length for dataset expansion
$\sigma_{\text{exp}}$ 1.0 Exploration noise for dataset expansion
$N_{\text{iter}}$ 1M Total number of gradient steps
$B_{\text{env}}$ 256 Batch size from original dataset
$B_{\text{model}}$ 256 Batch size from expanded dataset
$\gamma$ 0.997 Discount factor
$\lambda$ 0.95 $\lambda$ value for $\lambda$-return
$H$ 10 Imagination length
$\beta_{\text{EMA}}$ 1 Weight for critic EMA regularization
$\epsilon_{\text{EMA}}$ 0.995 Critic EMA decay

Table 8: Task-specific hyperparameter $\tau$ of LEQ.
Domain Task $\tau$
AntMaze umaze 0.1
umaze-diverse 0.1
medium-play 0.3
medium-diverse 0.1
large-play 0.3
large-diverse 0.3
ultra-play 0.1
ultra-diverse 0.1
MuJoCo hopper-r 0.1
hopper-m 0.1
hopper-mr 0.3
hopper-me 0.1
walker2d-r 0.1
walker2d-m 0.3
walker2d-mr 0.5
walker2d-me 0.1
halfcheetah-r 0.3
halfcheetah-m 0.3
halfcheetah-mr 0.4
halfcheetah-me 0.1
NeoRL Hopper-L 0.1
Hopper-M 0.1
Hopper-H 0.1
Walker2d-L 0.3
Walker2d-M 0.1
Walker2d-H 0.1
HalfCheetah-L 0.1
HalfCheetah-M 0.3
HalfCheetah-H 0.3

Task-specific hyperparameters of the compared methods.

We report the best hyperparameters of MOBILE for the AntMaze tasks in Tables 9 and 10. For MOBILE and MOBILE, we search the value of $c$ within $\{0.1, 0.5, 1.0, 1.5\}$, as suggested in MOBILE [30], where $c$ is the coefficient of the penalized Bellman operator:

$$T\hat{Q}(\mathbf{s},\mathbf{a}) = r(\mathbf{s},\mathbf{a}) + \gamma Q(\mathbf{s}',\mathbf{a}') - c \cdot \text{Std}(Q(\mathbf{s}',\mathbf{a}')). \tag{13}$$
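The penalized target in Equation 13 can be sketched numerically as below. The sketch assumes that the next-state values come from an ensemble of $K$ predictions, that $Q(\mathbf{s}',\mathbf{a}')$ is taken as their mean, and that $\text{Std}$ is the population standard deviation across the ensemble; these choices and the function name are assumptions of this example and are not taken from the MOBILE code.

```python
import numpy as np


def penalized_target(reward, next_q_ensemble, gamma=0.99, c=1.0):
    """Compute r + gamma * mean_k(Q_k) - c * std_k(Q_k) for each batch element.

    next_q_ensemble has shape (K, batch): one row per ensemble prediction.
    """
    next_q = np.asarray(next_q_ensemble, dtype=float)
    reward = np.asarray(reward, dtype=float)
    return reward + gamma * next_q.mean(axis=0) - c * next_q.std(axis=0)


# Two ensemble predictions that disagree (1 and 3) for one transition.
print(penalized_target([1.0], [[1.0], [3.0]], gamma=0.5, c=1.0))
```

A larger $c$ subtracts more of the ensemble disagreement, which makes the target more conservative where the predictions are uncertain.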

For CBOP, we conduct a hyperparameter search for $\psi$ in $\{0.5, 2.0, 3.0, 5.0\}$, as suggested in the original paper, where $\psi$ is the LCB coefficient of CBOP. We do not report the best hyperparameter for MOBILE and CBOP because both methods score zero points for all hyperparameters in AntMaze.

Table 9: Task-specific hyperparameters in MOBILE.
Domain Task $c$
AntMaze umaze 1.0
umaze-diverse 1.0
medium-play 1.0
medium-diverse 0.1
large-play 0.1
large-diverse 0.1
ultra-play 1.0
ultra-diverse 1.0

Table 10: Task-specific hyperparameters in MOBILE with $\lambda$-returns.
Domain Task $c$
AntMaze umaze 1.0
umaze-diverse 0.5
medium-play 0.1
medium-diverse 0.1
large-play 0.1
large-diverse 0.1
ultra-play 1.0
ultra-diverse 0.5

Appendix B Proof of the Policy Objective

We show that the surrogate loss in Equation 12 leads to a better approximation for the expectile of $\lambda$-returns in Equation 11 than maximizing $Q_\phi(s,a)$. In other words, we show that optimizing the following policy objective:

$$\hat{J}_\lambda(\theta) = \mathbb{E}_{\tau \sim p_\psi, \pi_\theta}\left[ W^\tau\big(Q_\phi(\mathbf{s}_t,\mathbf{a}_t) > Q^\lambda_t(\tau)\big)\, Q_t^\lambda(\tau) \right], \tag{14}$$

leads to optimizing a lower-bias estimator of $\mathbb{E}^\tau_{\tau \sim p_\psi, \pi_\theta}[Q_t^\lambda(\tau)]$ than $Q_\phi(\mathbf{s}_t,\mathbf{a}_t)$.

To show this, we first prove that $\hat{Y}_{\text{new}} = \frac{\mathbb{E}[W^\tau(Q_\phi(\mathbf{s}_t,\mathbf{a}_t) > Q_t^\lambda(\tau)) \cdot Q_t^\lambda(\tau)]}{\mathbb{E}[W^\tau(Q_\phi(\mathbf{s}_t,\mathbf{a}_t) > Q_t^\lambda(\tau))]}$ is closer to $\mathbb{E}^\tau_{\tau \sim p_\psi, \pi_\theta}[Q_t^\lambda(\tau)]$ than $Q_\phi(\mathbf{s}_t,\mathbf{a}_t)$. For the proof, we generalize this setting to an arbitrary distribution $X$ and an estimation $\hat{Y}$, which corresponds to $X = Q_t^\lambda(\tau)$, $\hat{Y} = Q_\phi(\mathbf{s},\mathbf{a})$.

Theorem 1.

Let $X$ be a distribution and $Y = \mathbb{E}^{\tau}[X]$ be a lower expectile of $X$ (i.e. $0 < \tau \leq 0.5$). Let $\hat{Y}$ be an arbitrary estimation of $Y$, and define $W^{\tau}(\cdot) = |\tau - \mathbbm{1}(\cdot)|$. If we let $\hat{Y}_{\text{new}} = \frac{\mathbb{E}[W^{\tau}(\hat{Y} > X) \cdot X]}{\mathbb{E}[W^{\tau}(\hat{Y} > X)]}$ be a new estimation of $Y$, then $|Y - \hat{Y}_{\text{new}}| \leq \frac{1-2\tau}{1-\tau} \cdot p(Y \leq X \leq \hat{Y}) \cdot |Y - \hat{Y}|$.

Proof.

Without loss of generality, we assume $\hat{Y} \geq Y$. Then, we have $\hat{Y}_{\text{new}} \geq Y$. Thus,

$$
\begin{aligned}
|\hat{Y}_{\text{new}} - Y|
&= \hat{Y}_{\text{new}} - Y \\
&= \frac{\mathbb{E}[W^{\tau}(\hat{Y} > X) \cdot X]}{\mathbb{E}[W^{\tau}(\hat{Y} > X)]} - \frac{\mathbb{E}[W^{\tau}(Y > X) \cdot X]}{\mathbb{E}[W^{\tau}(Y > X)]} && (\because \text{Def. of } \hat{Y}_{\text{new}} \text{ and } Y) \\
&= \frac{\mathbb{E}[W^{\tau}(Y > X) \cdot X] + \mathbb{E}[(1-2\tau) \cdot \mathbbm{1}(Y \leq X \leq \hat{Y}) X]}{\mathbb{E}[W^{\tau}(Y > X)] + \mathbb{E}[(1-2\tau) \cdot \mathbbm{1}(Y \leq X \leq \hat{Y})]} - \frac{\mathbb{E}[W^{\tau}(Y > X) \cdot X]}{\mathbb{E}[W^{\tau}(Y > X)]} && (\because \text{Def. of } W^{\tau}(\cdot)) \\
&= \frac{\mathbb{E}[W^{\tau}(Y > X)]\, \mathbb{E}[(1-2\tau) \cdot \mathbbm{1}(Y \leq X \leq \hat{Y}) X] - \mathbb{E}[W^{\tau}(Y > X) X]\, \mathbb{E}[(1-2\tau) \cdot \mathbbm{1}(Y \leq X \leq \hat{Y})]}{\mathbb{E}[W^{\tau}(Y > X)] \left( \mathbb{E}[W^{\tau}(Y > X)] + \mathbb{E}[(1-2\tau) \cdot \mathbbm{1}(Y \leq X \leq \hat{Y})] \right)} \\
&\leq \frac{1-2\tau}{\mathbb{E}[W^{\tau}(Y > X)]^{2}} \cdot \left( \mathbb{E}[W^{\tau}(Y > X)]\, \mathbb{E}[\mathbbm{1}(Y \leq X \leq \hat{Y}) X] - \mathbb{E}[W^{\tau}(Y > X) X]\, \mathbb{E}[\mathbbm{1}(Y \leq X \leq \hat{Y})] \right) \\
&= \frac{1-2\tau}{\mathbb{E}[W^{\tau}(Y > X)]} \cdot \left( \mathbb{E}[\mathbbm{1}(Y \leq X \leq \hat{Y}) X] - Y p(Y \leq X \leq \hat{Y}) \right) \\
&= \frac{(1-2\tau)\, p(Y \leq X \leq \hat{Y})}{\mathbb{E}[W^{\tau}(Y > X)]} \cdot \left( \mathbb{E}_{Y \leq X \leq \hat{Y}}[X] - Y \right) \\
&\leq \frac{(1-2\tau)\, p(Y \leq X \leq \hat{Y})}{\mathbb{E}[W^{\tau}(Y > X)]} \cdot (\hat{Y} - Y) \\
&\leq \frac{1-2\tau}{1-\tau} \cdot p(Y \leq X \leq \hat{Y}) \cdot |\hat{Y} - Y|
\end{aligned}
$$
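Two identities carry most of the derivation above. They are spelled out here as a reading aid, under the assumption (implicit in the proof) that the boundary event $X = \hat{Y}$ has probability zero. The first gives the equality justified by the definition of $W^{\tau}(\cdot)$; the second is the definition of the expectile $Y$ rearranged, and it is what turns $\mathbb{E}[W^{\tau}(Y > X) X]$ into $Y \cdot \mathbb{E}[W^{\tau}(Y > X)]$ after the first inequality.

```latex
% Reading aid only; assumes P(X = \hat{Y}) = 0 and \hat{Y} \geq Y.
\begin{align*}
W^{\tau}(\hat{Y} > X) &= W^{\tau}(Y > X) + (1 - 2\tau) \cdot \mathbbm{1}(Y \leq X \leq \hat{Y}), \\
\mathbb{E}[W^{\tau}(Y > X) \cdot X] &= Y \cdot \mathbb{E}[W^{\tau}(Y > X)].
\end{align*}
```

For the first identity, both weights equal $1-\tau$ when $X < Y$ and both equal $\tau$ when $X > \hat{Y}$; in between, $\hat{Y} > X$ holds but $Y > X$ does not, so the weights are $1-\tau$ and $\tau$ and differ by $1-2\tau$.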

Note that this theorem shows that the bias of the new estimation is always smaller than that of the original estimation, since $\frac{1-2\tau}{1-\tau} < 1$ and $p(Y \leq X \leq \hat{Y}) \leq 1$. If we plug in the distribution of $Q_t^{\lambda}(\tau)$ for $X$ and $\hat{Y} = Q_{\phi}(\mathbf{s}_t, \mathbf{a}_t)$, then $Y = \mathbb{E}^{\tau}[X] = \mathbb{E}^{\tau}_{\tau \sim p_{\psi}, \pi_{\theta}}[Q_t^{\lambda}(\tau)]$, and the theorem gives the desired result: $\hat{Y}_{\text{new}} = \frac{\mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_t, \mathbf{a}_t) > Q_t^{\lambda}(\tau)) \cdot Q_t^{\lambda}(\tau)]}{\mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_t, \mathbf{a}_t) > Q_t^{\lambda}(\tau))]}$ is closer to $\mathbb{E}^{\tau}_{\tau \sim p_{\psi}, \pi_{\theta}}[Q_t^{\lambda}(\tau)]$ than $Q_{\phi}(\mathbf{s}_t, \mathbf{a}_t)$.
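The update can be tried numerically. The following is a minimal sketch, assuming a finite sample stands in for the distribution of $Q_t^{\lambda}(\tau)$; the function names are hypothetical and do not come from the LEQ code. It covers only the overestimate case $\hat{Y} \geq Y$ treated in the proof and shows the reweighted estimate moving from $\hat{Y}$ toward the expectile $Y$; it does not check the constant in the bound.

```python
import random


def weight(y_hat, x, tau):
    """W^tau(y_hat > x) = |tau - 1(y_hat > x)|: 1 - tau if y_hat > x, else tau."""
    return abs(tau - (1.0 if y_hat > x else 0.0))


def reweighted_estimate(samples, y_hat, tau):
    """One update: Y_new = E[W(y_hat > X) * X] / E[W(y_hat > X)]."""
    weights = [weight(y_hat, x, tau) for x in samples]
    return sum(w * x for w, x in zip(weights, samples)) / sum(weights)


def expectile(samples, tau, max_iters=1000, tol=1e-12):
    """Empirical tau-expectile, computed as the fixed point of the update."""
    y = sum(samples) / len(samples)  # start from the mean (the 0.5-expectile)
    for _ in range(max_iters):
        y_next = reweighted_estimate(samples, y, tau)
        if abs(y_next - y) <= tol:
            return y_next
        y = y_next
    return y


# Gaussian samples are an illustrative stand-in for sampled lambda-returns.
rng = random.Random(0)
returns = [rng.gauss(0.0, 1.0) for _ in range(2000)]
tau = 0.1
y = expectile(returns, tau)
y_hat = y + 1.0  # an overestimate of the lower expectile
y_new = reweighted_estimate(returns, y_hat, tau)
print(f"Y={y:.3f}  Y_hat={y_hat:.3f}  Y_new={y_new:.3f}")
```

Running the sketch with several values of `tau` also shows that a smaller `tau` yields a smaller (more conservative) expectile of the same samples.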

Here, the normalizing factor $\mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_t, \mathbf{a}_t) > Q_t^{\lambda}(\tau))]$ is non-differentiable with respect to $\tau$. Specifically, the gradient is 0 everywhere (except at $Q_{\phi}(\mathbf{s}_t, \mathbf{a}_t) = Q_t^{\lambda}(\tau)$). Thus, if we calculate the gradient of $\hat{Y}_{\text{new}}$, the gradient for the normalizing factor disappears. Therefore, we can omit the normalizing factor and get an equivalent formula $\mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_t, \mathbf{a}_t) > Q_t^{\lambda}(\tau)) \cdot Q_t^{\lambda}(\tau)]$ for gradient-based optimization.
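The argument can be written out with the quotient rule. This is a sketch under the same assumption as the text, namely that the weight $W^{\tau}$ is treated as constant under differentiation because its pointwise gradient is zero almost everywhere. The shorthand $N$, $D$, and $\theta$ (whichever parameters the gradient is taken with respect to) is introduced here for illustration and is not the paper's notation.

```latex
% Shorthand (hypothetical): N = numerator, D = normalizing factor.
\begin{align*}
N &= \mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_t, \mathbf{a}_t) > Q_t^{\lambda}(\tau)) \cdot Q_t^{\lambda}(\tau)], \qquad
D = \mathbb{E}[W^{\tau}(Q_{\phi}(\mathbf{s}_t, \mathbf{a}_t) > Q_t^{\lambda}(\tau))], \qquad
\hat{Y}_{\text{new}} = \frac{N}{D}, \\
\nabla_{\theta} \hat{Y}_{\text{new}} &= \frac{\nabla_{\theta} N}{D} - \frac{N \, \nabla_{\theta} D}{D^{2}}
= \frac{\nabla_{\theta} N}{D} \qquad (\because \nabla_{\theta} D = 0).
\end{align*}
```

Since $W^{\tau}$ only takes the values $\tau$ and $1-\tau$, we have $\tau \leq D \leq 1-\tau$, so $D$ is a positive scalar. The gradient of $N$ therefore points in the same direction as the gradient of $\hat{Y}_{\text{new}}$ and differs only by the positive factor $1/D$, which is the sense in which the two formulas are equivalent for gradient-based optimization.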

Appendix C More Results

High variance in locomotion tasks.

When we train LEQ in locomotion tasks, we observe that our method often achieves 100% success rates and then falls back to 0%, as shown in Figure 5. This is mainly because the learned models sometimes fail to capture failures (e.g. hopper and walker falling off) and predict an optimistic future (e.g. hopper and walker walking forward).

Figure 5: High variance during training. Our algorithm's performance oscillates due to optimistic imaginations near the initial states.

Results for all expectiles $\tau$.

To give insight into how the expectile parameter $\tau$ affects the performance of LEQ, we report the performance of LEQ with all expectile values $\{0.1, 0.3, 0.4, 0.5\}$. The expectile parameter $\tau$ involves a trade-off: a high expectile makes the model's predictions less conservative, but also makes it easier for the policy to exploit the model. We recommend first trying $\tau = 0.1$, which works well for most of the tasks, and then increasing $\tau$ until the performance starts to drop.
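The tuning recommendation above can be sketched as a simple search loop. This is an illustrative reading of the recommendation, not a procedure from the paper: `select_expectile` and `evaluate` are hypothetical names, and `evaluate(tau)` stands for "train LEQ with this expectile and return its score". In the usage example, the mean scores for antmaze-large-play reported in Table 11 stand in for real training runs.

```python
def select_expectile(evaluate, candidates=(0.1, 0.3, 0.4, 0.5)):
    """Start from the smallest expectile and increase it until the score drops.

    `evaluate` is a hypothetical callable mapping an expectile to a score
    (for example, the mean success rate of a trained agent).
    """
    ordered = sorted(candidates)
    best_tau = ordered[0]
    best_score = evaluate(best_tau)
    for tau in ordered[1:]:
        score = evaluate(tau)
        if score < best_score:
            break  # performance started to drop; keep the previous expectile
        best_tau, best_score = tau, score
    return best_tau, best_score


# Mean scores for antmaze-large-play copied from Table 11 (stand-in for training).
large_play_scores = {0.1: 42.0, 0.3: 58.6, 0.4: 52.2, 0.5: 42.2}
best_tau, best_score = select_expectile(large_play_scores.__getitem__)
print(best_tau, best_score)  # -> 0.3 58.6
```

Because the loop stops at the first drop, it evaluates fewer configurations than a full sweep, at the cost of missing a later recovery in score.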

Table 11: AntMaze results of LEQ with different expectiles. We report results on the AntMaze tasks with expectile values of 0.1, 0.3, 0.4, and 0.5. The best value is highlighted in bold.

| Dataset | $\tau = 0.1$ | $\tau = 0.3$ | $\tau = 0.4$ | $\tau = 0.5$ |
|---|---|---|---|---|
| antmaze-umaze | $\mathbf{94.4} \pm 6.3$ | $39.0 \pm 28.1$ | $0.2 \pm 0.4$ | $3.0 \pm 5.5$ |
| antmaze-umaze-diverse | $\mathbf{71.0} \pm 12.2$ | $23.6 \pm 21.7$ | $4.0 \pm 4.2$ | $0.0 \pm 0.0$ |
| antmaze-medium-play | $50.2 \pm 39.9$ | $\mathbf{58.8} \pm 33.0$ | $36.0 \pm 21.8$ | $0.6 \pm 1.2$ |
| antmaze-medium-diverse | $\mathbf{46.2} \pm 23.2$ | $13.2 \pm 13.3$ | $11.6 \pm 14.8$ | $10.6 \pm 13.3$ |
| antmaze-large-play | $42.0 \pm 30.6$ | $\mathbf{58.6} \pm 9.1$ | $52.2 \pm 15.8$ | $42.2 \pm 7.3$ |
| antmaze-large-diverse | $60.6 \pm 32.1$ | $\mathbf{60.2} \pm 18.3$ | $48.8 \pm 5.8$ | $36.8 \pm 9.7$ |
| antmaze-ultra-play | $\mathbf{25.8} \pm 18.2$ | $10.8 \pm 8.8$ | $11.6 \pm 12.5$ | $9.2 \pm 11.5$ |
| antmaze-ultra-diverse | $\mathbf{55.8} \pm 18.3$ | $4.6 \pm 3.4$ | $7.6 \pm 7.3$ | $0.6 \pm 1.2$ |

Table 12: D4RL MuJoCo results of LEQ with different expectiles. We report results on the D4RL MuJoCo tasks with expectile values of 0.1, 0.3, 0.4, and 0.5. The best value is highlighted in bold.

| Dataset | $\tau = 0.1$ | $\tau = 0.3$ | $\tau = 0.4$ | $\tau = 0.5$ |
|---|---|---|---|---|
| hopper-r | $\mathbf{32.4} \pm 0.3$ | $13.7 \pm 9.1$ | $16.4 \pm 9.3$ | $12.5 \pm 10.1$ |
| hopper-m | $\mathbf{103.4} \pm 0.3$ | $102.7 \pm 1.7$ | $81.4 \pm 24.8$ | $38.6 \pm 29.2$ |
| hopper-mr | $103.2 \pm 1.0$ | $\mathbf{103.9} \pm 1.3$ | $71.5 \pm 34.7$ | $103.8 \pm 1.9$ |
| hopper-me | $\mathbf{109.4} \pm 1.8$ | $108.0 \pm 8.7$ | $64.2 \pm 35.8$ | $33.7 \pm 0.5$ |
| walker2d-r | $\mathbf{21.5} \pm 0.1$ | $21.5 \pm 0.5$ | $14.0 \pm 8.8$ | $8.7 \pm 6.7$ |
| walker2d-m | $26.3 \pm 37.4$ | $\mathbf{74.9} \pm 26.9$ | $60.3 \pm 40.9$ | $34.8 \pm 34.3$ |
| walker2d-mr | $48.6 \pm 19.5$ | $60.5 \pm 27.4$ | $88.5 \pm 3.5$ | $\mathbf{98.7} \pm 6.0$ |
| walker2d-me | $\mathbf{108.2} \pm 1.3$ | $98.8 \pm 28.8$ | $105.8 \pm 25.9$ | $33.7 \pm 31.9$ |
| halfcheetah-r | $23.8 \pm 1.8$ | $\mathbf{30.8} \pm 3.3$ | $29.0 \pm 2.9$ | $30.2 \pm 2.5$ |
| halfcheetah-m | $65.3 \pm 2.0$ | $\mathbf{71.7} \pm 4.4$ | $58.5 \pm 23.8$ | $55.5 \pm 16.7$ |
| halfcheetah-mr | $60.6 \pm 1.4$ | $55.4 \pm 27.3$ | $\mathbf{65.5} \pm 1.1$ | $52.4 \pm 26.7$ |
| halfcheetah-me | $\mathbf{102.8} \pm 0.4$ | $81.5 \pm 19.6$ | $58.1 \pm 26.1$ | $46.3 \pm 17.7$ |

Ablation study on dataset expansion.

Table 13 shows the ablation results on the dataset expansion in D4RL MuJoCo tasks. The results show that the dataset expansion generally improves the performance, especially in Hopper environments.

Table 13: D4RL MuJoCo ablation results for dataset expansion. Results are averaged over 5 random seeds. The dataset expansion generally improves the performance of LEQ.

| Dataset | LEQ (ours) | LEQ w/o Dataset Expansion |
|---|---|---|
| hopper-r | $\mathbf{32.4} \pm 0.3$ | $17.6 \pm 8.6$ |
| hopper-m | $\mathbf{103.4} \pm 0.3$ | $52.7 \pm 45.3$ |
| hopper-mr | $\mathbf{103.9} \pm 1.3$ | $\mathbf{103.7} \pm 1.3$ |
| hopper-me | $\mathbf{109.4} \pm 1.8$ | $79.7 \pm 42.4$ |
| walker2d-r | $\mathbf{21.5} \pm 0.1$ | $\mathbf{20.5} \pm 2.2$ |
| walker2d-m | $74.9 \pm 26.9$ | $\mathbf{87.2} \pm 4.3$ |
| walker2d-mr | $\mathbf{98.7} \pm 6.0$ | $78.7 \pm 35.5$ |
| walker2d-me | $\mathbf{108.2} \pm 1.3$ | $\mathbf{110.4} \pm 0.8$ |
| halfcheetah-r | $\mathbf{30.8} \pm 3.3$ | $\mathbf{27.7} \pm 2.2$ |
| halfcheetah-m | $\mathbf{71.7} \pm 4.4$ | $\mathbf{71.6} \pm 3.8$ |
| halfcheetah-mr | $\mathbf{65.5} \pm 1.1$ | $54.4 \pm 26.3$ |
| halfcheetah-me | $\mathbf{102.8} \pm 0.4$ | $83.9 \pm 28.0$ |
| Total | $\mathbf{923.2}$ | $788.2$ |

Ablation study on MOBILE in AntMaze.

In Table 14, we report the performance of MOBILE for all possible combinations of the hyperparameter values used by MOBILE and LEQ. Specifically, we use $\beta \in \{0.25, 0.95\}$, $\gamma \in \{0.99, 0.997\}$, and $R \in \{5, 10\}$. The results show that $\beta = 0.25$ is crucial. In addition, the configuration of LEQ yields the best result among all configurations for MOBILE.
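The search space is the Cartesian product of the three value sets, giving $2 \times 2 \times 2 = 8$ configurations. A minimal sketch of enumerating it follows; the function name and dictionary keys are hypothetical, and the meaning of $\beta$, $\gamma$, and $R$ is as defined in the main text.

```python
from itertools import product


def hyperparameter_grid(betas=(0.25, 0.95), gammas=(0.997, 0.99), rs=(10, 5)):
    """Enumerate every (beta, gamma, R) combination in the search space."""
    return [
        {"beta": beta, "gamma": gamma, "R": r}
        for beta, gamma, r in product(betas, gammas, rs)
    ]


grid = hyperparameter_grid()
for config in grid:
    print(config)  # each configuration would be passed to a training run
```

With the default argument order, the first entry is the LEQ configuration ($\beta = 0.25$, $\gamma = 0.997$, $R = 10$) and the last is the original MOBILE configuration ($\beta = 0.95$, $\gamma = 0.99$, $R = 5$).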

Table 14: Complete hyperparameter search results of MOBILE on AntMaze. MOBILE's original hyperparameters are $\beta = 0.95$, $\gamma = 0.99$, and $R = 5$, whereas LEQ uses $\beta = 0.25$, $\gamma = 0.997$, and $R = 10$. The results show that $\beta$ is the most critical hyperparameter that makes MOBILE work in AntMaze.

| $\beta$ | $\gamma$ | $R$ | umaze | umaze-diverse | medium-play | medium-diverse | large-play | large-diverse | ultra-play | ultra-diverse | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $0.25$ | $0.997$ | $10$ | $53.8 \pm 26.8$ | $\mathbf{22.5} \pm 22.2$ | $54.0 \pm 5.8$ | $\mathbf{49.5} \pm 6.2$ | $\mathbf{28.3} \pm 6.0$ | $\mathbf{28.0} \pm 11.4$ | $\mathbf{25.5} \pm 6.9$ | $\mathbf{23.8} \pm 15.8$ | $\mathbf{285.3}$ |
| $0.25$ | $0.997$ | $5$ | $\mathbf{74.0} \pm 6.9$ | $3.7 \pm 2.6$ | $54.7 \pm 27.9$ | $28.0 \pm 9.6$ | $18.7 \pm 18.6$ | $8.0 \pm 9.3$ | $9.7 \pm 8.2$ | $9.0 \pm 3.7$ | $205.7$ |
| $0.25$ | $0.99$ | $10$ | $39.7 \pm 23.4$ | $5.0 \pm 7.1$ | $39.3 \pm 27.9$ | $38.0 \pm 15.0$ | $0.0 \pm 0.0$ | $3.7 \pm 5.2$ | $0.0 \pm 0.0$ | $0.0 \pm 0.0$ | $125.7$ |
| $0.25$ | $0.99$ | $5$ | $\mathbf{77.0} \pm 6.4$ | $20.4 \pm 15.7$ | $\mathbf{64.6} \pm 11.1$ | $31.6 \pm 16.9$ | $2.6 \pm 2.8$ | $7.2 \pm 8.9$ | $4.6 \pm 3.0$ | $5.0 \pm 4.6$ | $213.0$ |
| $0.95$ | $0.997$ | $10$ | $0.0 \pm 0.0$ | $0.0 \pm 0.0$ | $1.8 \pm 3.0$ | $0.5 \pm 0.9$ | $0.2 \pm 0.4$ | $2.2 \pm 2.3$ | $1.0 \pm 1.7$ | $0.0 \pm 0.0$ | $5.7$ |
| $0.95$ | $0.997$ | $5$ | $0.0 \pm 0.0$ | $0.0 \pm 0.0$ | $7.2 \pm 4.1$ | $1.6 \pm 2.1$ | $9.6 \pm 7.1$ | $5.4 \pm 4.9$ | $0.0 \pm 0.0$ | $1.8 \pm 2.7$ | $25.6$ |
| $0.95$ | $0.99$ | $10$ | $0.0 \pm 0.0$ | $0.0 \pm 0.0$ | $5.0 \pm 5.1$ | $0.6 \pm 1.2$ | $7.4 \pm 14.8$ | $1.6 \pm 3.2$ | $0.0 \pm 0.0$ | $0.0 \pm 0.0$ | $14.6$ |
| $0.95$ | $0.99$ | $5$ | $1.0 \pm 2.0$ | $0.0 \pm 0.0$ | $6.4 \pm 5.5$ | $5.0 \pm 5.0$ | $0.8 \pm 1.6$ | $0.8 \pm 1.2$ | $0.0 \pm 0.0$ | $0.0 \pm 0.0$ | $14.0$ |