License: CC BY-NC-ND 4.0
arXiv:2306.03552v4 [cs.LG] 22 Feb 2024

State Regularized Policy Optimization
on Data with Dynamics Shift

Zhenghai Xue¹    Qingpeng Cai²    Shuchang Liu²    Dong Zheng²
Peng Jiang²    Kun Gai³    Bo An¹
¹Nanyang Technological University, Singapore
²Kuaishou Technology  ³Unaffiliated
[email protected][email protected][email protected]
{caiqingpeng,liushuchang,zhengdong,jiangpeng}@kuaishou.com
Abstract

In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., with different underlying environment dynamics. Most current methods address this issue by training context encoders to identify environment parameters. Data with dynamics shift are separated according to their environment parameters to train the corresponding policy. However, these methods can be sample inefficient as data are used ad hoc, and policies trained for one dynamics cannot benefit from data collected in all other environments with different dynamics. In this paper, we find that in many environments with similar structures and different dynamics, optimal policies have similar stationary state distributions. We exploit this property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. This distribution is used to regularize the policy trained in a new environment, leading to the SRPO (State Regularized Policy Optimization) algorithm. To conduct theoretical analyses, the intuition of similar environment structures is characterized by the notion of homomorphous MDPs. We then demonstrate a lower-bound performance guarantee on policies regularized by the stationary state distribution. In practice, SRPO can be an add-on module to context-based algorithms in both online and offline RL settings. Experimental results show that SRPO can make several context-based algorithms far more data efficient and significantly improve their overall performance.

1 Introduction

Reinforcement Learning (RL) has achieved great success in solving challenging sequential decision-making problems [1, 2]. Unfortunately, existing RL methods usually assume that agents are trained and evaluated in exactly the same environment, which is often not the case in real-world applications where environment dynamics can vary a lot. For example, the recommendation engine of social apps may need to deal with time-varying and heterogeneous user preferences [3, 4]. A robot arm may operate in different scenarios with different joint frictions and medium densities [5]. In these cases, the agent has to work with the trajectory data from different environment dynamics, i.e., data with dynamics shift, which will bias the learning process and lead to poor performance. In fact, some empirical studies [6, 5] demonstrate that general RL algorithms [7, 8] can easily be misled by different environment dynamics and fail to train a good policy.

In recent years, considerable research efforts have been devoted to addressing the dynamics shift and learning generalizable policies for environments with changing dynamics. One common practice is to train a context encoder [9, 10, 11] to associate the environment dynamics with a latent variable. The policy is then trained with the latent variable as an additional input [12]. One issue with this practice is that policies conditioned on a specific latent variable can only learn from data collected in the environment corresponding to that latent variable. In other words, data with different dynamics are used in an ad hoc manner. The generalizability of context encoders relies on the expressive power of neural networks. However, neural networks are prone to overfitting and behave poorly when extrapolating. As an example, we benchmarked CaDM [9], one of the context-based algorithms, in Ant environments with different gravities and display the results in Fig. 1. Although it can outperform PPO [7] due to the adaptability provided by its context encoder, CaDM fails to keep improving its performance with more data from different environment dynamics. To mitigate the problem of inefficient data use, some attempts leverage Importance Sampling (IS) [13, 5, 14]. Given the dynamics of the target environment, samples from the source environments are assigned larger importance weights if they are more likely to occur in the target environment, and vice versa. Compared to training context encoders, IS-based methods manage to proactively exploit the data from other dynamics. However, such methods require prior knowledge about the dynamics of the target environment. Also, it is notoriously hard to balance the bias and variance when calculating the IS weights.

Figure 1: Performance comparison of PPO [7], CaDM [9] and CaDM+SRPO in the Ant environment, where SRPO is our proposed state regularized policy optimization method. Details of the experiment setup are in Sec. 5.1.

This paper proposes a new RL paradigm that can explicitly leverage data with dynamics shift. It is also free of the aforementioned drawbacks of IS-based methods. We find that the stationary state distribution induced by optimal policies (later termed optimal state distribution) is similar across a set of environments with similar structures and different environment dynamics. For example, given heterogeneous preferences of users, a video recommendation system may choose different videos to recommend, but the optimal states are the same: users keep pressing the “like” or “save” button and continue watching for a long time. More concretely, the optimal state distribution in one environment dynamics can be informative for training policies in all other different dynamics. We therefore propose a constrained policy optimization (CPO) [15] formulation that requires the policy not only to optimize the cumulative return, but also to generate a stationary state distribution close to the optimal state distribution. By relating optimality to high-reward states [16], we are able to approximate the optimal state distribution from trajectory data regardless of the underlying dynamics, providing a unified and efficient approach to exploiting these data.

Summarizing these ideas, we propose the SRPO (State Regularized Policy Optimization) algorithm. SRPO works as an add-on module in both online and offline context-based RL algorithms such as CaDM [9] and MAPLE [12] to increase their sample efficiency, leading to the CaDM+SRPO and MAPLE+SRPO algorithms. We provide a lower-bound performance guarantee on policies in one dynamics regularized by the optimal state distribution in other dynamics. This theoretically demonstrates the effectiveness of the SRPO algorithm in using data with dynamics shift. Empirical results in both online and offline settings show that SRPO can significantly improve both the data efficiency and the overall performance of several state-of-the-art context-based RL algorithms. We also perform ablation studies to demonstrate the effectiveness of each component in the SRPO algorithm.

2 Background

2.1 Preliminaries

A Markov Decision Process (MDP) can be defined by a tuple $(\mathcal{S},\mathcal{A},T,r,\gamma,\rho_{0})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the bounded action space with actions $a\in(-1,1)$, $T(s^{\prime}|s,a)\in[0,1]$ and $r(s,a,s^{\prime})\in[-R_{\max},R_{\max}]$ are the transition and reward functions. $\gamma\in(0,1)$ is the discount factor and $\rho_{0}(s)$ is the initial state distribution. In MDPs with deterministic transitions, we denote $T(s,a)$ as the transition function with a slight abuse of notation, and $(T,\varepsilon)$ as $\{T^{\prime}\mid|T(s,a)-T^{\prime}(s,a)|<\varepsilon,~\forall s\in\mathcal{S},a\in\mathcal{A}\}$, which is the $\varepsilon$-neighbourhood of $T$.
RL aims at maximizing the accumulated return of policy $\pi$: $\eta_{T}(\pi)=E_{\pi,T}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\right]$, where the expectation is computed with $s_{0}\sim\rho_{0}$, $a_{t}\sim\pi(\cdot|s_{t})$, and $s_{t+1}\sim T(\cdot|s_{t},a_{t})$. The optimal policy $\pi^{*}$ is defined as $\pi_{T}^{*}=\operatorname*{arg\,max}_{\pi}\eta_{T}(\pi)$.
In an MDP with a policy $\pi$, the Q-value $Q_{T}^{\pi}(s,a)$ denotes the expected return after taking action $a$ at state $s$: $Q_{T}^{\pi}(s,a)=E_{\pi,T}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,|\,s_{0}=s,a_{0}=a\right]$.
The value function is defined as $V_{T}^{\pi}(s)=\mathbb{E}_{a\sim\pi(\cdot|s)}Q_{T}^{\pi}(s,a)$, with $V^{*}_{T}(s)$ being the shorthand for $V^{\pi_{T}^{*}}_{T}(s)$. It satisfies the optimal Bellman Equation $V^{*}_{T}(s)=\max_{a}~r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a)}V^{*}_{T}(s^{\prime})$.
We can also define the stationary state distribution (also known as state occupation function) as $d_{T}^{\pi}(s):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}P_{T}\left(s_{t}=s\mid\pi\right)$, with $d^{*}_{T}(s)$ being the shorthand for $d^{\pi_{T}^{*}}_{T}(s)$.
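To make the definition of the stationary state distribution concrete, the following is a minimal sketch for a small tabular MDP. The function name `discounted_state_occupancy`, the nested-list layout of `P` and `pi`, the truncation `horizon`, and the two-state toy problem are all assumptions made for illustration; they do not come from the paper.

```python
def discounted_state_occupancy(P, pi, rho0, gamma, horizon=2000):
    """Approximate d(s) = (1 - gamma) * sum_t gamma^t * P(s_t = s | pi).

    P[s][a][s2] : transition probability T(s2 | s, a)
    pi[s][a]    : policy probability pi(a | s)
    rho0[s]     : initial state distribution
    The infinite sum is truncated after `horizon` steps (an assumption).
    """
    n_states = len(rho0)
    n_actions = len(pi[0])
    # State-to-state kernel under pi: P_pi[s][s2] = sum_a pi(a|s) T(s2|s,a).
    P_pi = [[sum(pi[s][a] * P[s][a][s2] for a in range(n_actions))
             for s2 in range(n_states)]
            for s in range(n_states)]
    p_t = list(rho0)          # distribution of s_t, starting at t = 0
    d = [0.0] * n_states
    weight = 1.0 - gamma      # (1 - gamma) * gamma^t at t = 0
    for _ in range(horizon):
        for s in range(n_states):
            d[s] += weight * p_t[s]
        # Push the state distribution one step forward.
        p_t = [sum(p_t[s] * P_pi[s][s2] for s in range(n_states))
               for s2 in range(n_states)]
        weight *= gamma
    return d


# Hypothetical two-state toy MDP: action 0 stays, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
pi = [[0.0, 1.0],   # in state 0, always switch
      [1.0, 0.0]]   # in state 1, always stay
d = discounted_state_occupancy(P, pi, rho0=[1.0, 0.0], gamma=0.9)
```

In this toy case the agent spends only the first step in state 0, so `d` is approximately `[0.1, 0.9]`: the factor $(1-\gamma)$ normalizes the discounted visitation counts into a probability distribution.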

The Hidden Parameter Markov Decision Process (HiP-MDP) captures a class of MDPs with different transition functions and the same reward function by introducing a set of hidden parameters. Specifically, an HiP-MDP is defined by a tuple $(\mathcal{S},\mathcal{A},\Theta,T,r,\gamma,\rho_{0})$, where $\Theta$ is the space of hidden parameters. The transition function $T_{\theta}(s^{\prime}|s,a,\theta)$ is parameterized not only by states and actions, but also by a hidden parameter $\theta$ sampled from $\Theta$. The action gap of an HiP-MDP is defined as $\Delta=\min_{\theta\in\Theta}\min_{s\in\mathcal{S}}\min_{a\neq\pi^{*}(s)}V_{T_{\theta}}^{*}(s)-Q_{T_{\theta}}^{*}(s,a)$, which reflects the minimum gap between an optimal action and all other sub-optimal actions.
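As an illustration of the action gap, the sketch below evaluates $\Delta$ on a tiny tabular HiP-MDP with a finite set of hidden parameters. It assumes a reward of the form $r(s,a)$, uses plain value iteration to obtain $Q^{*}$, and treats the first greedy action as $\pi^{*}(s)$. The helper names and the two toy dynamics (`stay`, `swap`) are hypothetical.

```python
def optimal_q_values(P, r, gamma, iters=1000):
    """Value iteration for one transition function P[s][a][s2] and reward r[s][a]."""
    n_states, n_actions = len(r), len(r[0])
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        Q = [[r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V = [max(q_s) for q_s in Q]
    return Q


def action_gap(dynamics, r, gamma):
    """Minimum of V*(s) - Q*(s, a) over dynamics, states and non-greedy actions.

    `dynamics` holds one transition tensor per hidden parameter theta.
    If two actions tie for optimal, this sketch reports a gap of 0.
    """
    gap = float("inf")
    for P in dynamics:
        for q_s in optimal_q_values(P, r, gamma):
            v_star = max(q_s)
            greedy = q_s.index(v_star)
            for a, q in enumerate(q_s):
                if a != greedy:
                    gap = min(gap, v_star - q)
    return gap


# Two hypothetical dynamics that share one reward function.
stay = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]  # every action stays
swap = [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]  # every action switches
r = [[0.0, 1.0], [0.0, 1.0]]  # action 1 earns reward 1 in both states
gap = action_gap([stay, swap], r, gamma=0.9)
```

Because the reward here depends only on the action, both dynamics share the same optimal values, and the suboptimal action loses exactly one unit of immediate reward, so `gap` is approximately 1.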

2.2 Related Work

MDPs with Different Dynamics

The setting of HiP-MDP [17] was proposed to model a set of variations in the environment dynamics. The problem has been intensively investigated in recent years [18], and this research falls into three categories, i.e., encoder-based, Importance Sampling (IS)-based, and meta-RL-based algorithms. Encoder-based methods extract the hidden parameters from trajectories with variational inference [19] or auxiliary loss [9]. These hidden parameters are used as inputs to the transition function [9] or policy network [20, 12]. Unfortunately, these methods train dynamics-specific policies from the trajectory data of each hidden parameter independently, which leads to poor sample efficiency. Instead, our method uses the data from all dynamics to learn an optimal state distribution that facilitates policy learning. IS-based methods compute the importance ratio between transition probabilities under different dynamics and modify the replay buffer [13, 14, 5] according to the transition probabilities in the test environments, which are often not available in real-world scenarios. Finally, meta-RL algorithms [21, 22] can adapt to environments with new dynamics through fine-tuning on a small amount of data from the test environment. In contrast, our method can be directly applied to new environment dynamics by making a zero-shot transfer.

Behavior Regularized Methods

The idea of constrained policy optimization (CPO) [15] is widely used in RL. Most research focuses on behavior regularized methods, i.e., adding policy constraints based on another policy distribution, as shown in the following optimization problem:

$$\max_{\pi}~\mathbb{E}_{s,a\sim\mathcal{D}}\left[\mathbb{E}_{a^{\prime}\sim\pi(\cdot|s)}Q(s,a^{\prime})\right]\qquad\text{ s.t. }~~\mathbb{E}_{s\sim\mathcal{D}}\left[\hat{D}(\pi(\cdot|s)\|\hat{\pi}(\cdot|s))\right]<\varepsilon, \tag{1}$$

where $\mathcal{D}$ is the replay buffer, $\hat{\pi}$ is the regularizing policy, and $\hat{D}$ is a certain distance measure. Maximum-Entropy RL [23, 8] can be considered as CPO with a uniform policy distribution. The sparse action tasks [24] can be solved by CPO with a sparse policy distribution. Besides, many offline RL algorithms [25, 26, 27] are based on the idea of constraining the current policy distribution to be close to the dataset's policy distribution. However, data sampled from environments with different dynamics can have distinct optimal policies, as illustrated in Sec. 3.1. In such cases, $\hat{\pi}$ in Eq. (1) may include policies that do not match the current environment, and can therefore be misleading. Behavior regularization as in Eq. (1) would therefore fail on data with dynamics shift. In contrast, our proposed method is based on state regularization, which is more suitable when learning from data with dynamics shift.

Leveraging Stationary State Distributions

The stationary state distribution $d^{\pi}_{T}(s)$ of policy $\pi$ and dynamics $T$ is an important feature that can measure the differences in policies and transition functions. It has already been exploited in many studies. In Off-Policy RL, Islam et al. [28] estimate the stationary state distributions of both the current policy and the mixed buffer policy. They then compute the off-policy policy gradient with the constraint that the two distributions should be close. Some Off-Policy Policy Evaluation (OPE) algorithms [29, 30] use the steady-state property of stationary Markov processes to estimate the stationary state distributions. In Imitation Learning (IL), state-only IL algorithms [31, 32] require the stationary state distribution of the current policy to be close to that of the expert policy. In Inverse RL (IRL), [33] learns a stationary reward function by computing the gradient of the distance between agent and expert state distribution w.r.t. reward parameters. In Offline RL, [34] requires the stationary state distributions of the learning policy and the behavior policy to be close and performs conservative updates. The use of such distributions in our paper is similar to some studies on sim-to-real [35, 36]. They propose to match the next state distribution in the imperfect simulator and the real environment with an inverse dynamics model. They implicitly rely on the idea that the same state distribution should generate similar returns in environments with different dynamics. We formulate the idea in this paper with theorems and quantitatively analyse such similarity in various conditions.

3 State Regularized Policy Optimization

In this section, we first give motivating examples on why the optimal state distribution in one environment dynamics can be informative in all other different dynamics. A constrained policy optimization formulation is then proposed in Sec. 3.2 based on the optimal state distribution. Solving this optimization problem gives rise to our State Regularized Policy Optimization (SRPO) algorithm that can leverage data with dynamics shift to improve the policy performance.

Figure 2: Visualization of state and action densities in data sampled from the Inverted Pendulum environment with gravity 5 and 10. Under both gravities, the state distribution has high density with low pendulum speed and small pendulum angle. Meanwhile, the action distribution has different peaks in density under different gravities.

3.1 Motivating Example

The key intuition behind SRPO is that the optimal state distribution is similar across environments. Consider an example of the Inverted Pendulum environment in Fig. 2. We train two policies in the environments with gravities of 5 and 10 until convergence. Then the kernel density estimation [37] technique is employed to estimate the state and action density of the data collected by the two policies in different areas. It can be observed from the figure that the collected data share the same high state density region with low pendulum speed and small pendulum angle, while the action distribution has different density peaks. This demonstrates that the state distribution of data generated by the optimal policy can be similar regardless of the environment dynamics, and can therefore serve as a reference distribution to regularize the training policy in environments with new dynamics. More examples can be found in Appendix B.3.
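The density comparison described above can be sketched with a one-dimensional Gaussian kernel density estimator. This is only an illustration of the technique: the fixed `bandwidth`, the helper names, and in particular the four small sample lists are made-up placeholders, not data or settings from the paper's experiments.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density, using Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in samples)

    return density


def density_peak(samples, bandwidth, lo=-1.0, hi=1.0, num=401):
    """Grid location where the estimated density is highest."""
    density = gaussian_kde(samples, bandwidth)
    grid = [lo + (hi - lo) * i / (num - 1) for i in range(num)]
    return max(grid, key=density)


# Placeholder samples (NOT experimental data): pendulum angles and actions
# that two converged policies might produce under gravity 5 and gravity 10.
angles_g5 = [-0.02, 0.0, 0.01, 0.03, -0.01]
angles_g10 = [-0.03, 0.01, 0.0, 0.02, 0.0]
actions_g5 = [-0.35, -0.30, -0.28, -0.32, -0.30]
actions_g10 = [0.38, 0.40, 0.42, 0.41, 0.39]

state_peak_shift = abs(density_peak(angles_g5, 0.05) - density_peak(angles_g10, 0.05))
action_peak_shift = abs(density_peak(actions_g5, 0.05) - density_peak(actions_g10, 0.05))
```

With samples of this shape, `state_peak_shift` stays close to zero while `action_peak_shift` is large, which is the qualitative pattern the motivating example relies on.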

3.2 State Regularized Policy Optimization

Based on the intuition of informative optimal state distribution, we develop a novel technique that regulates RL algorithms to generate a stationary state distribution that is close to the optimal one. Specifically, we propose the following constrained policy optimization formulation:

$$\max_{\pi}~\mathbb{E}_{s_{t},a_{t}\sim\tau_{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r\left(s_{t},a_{t}\right)\right]\qquad\text{ s.t. }~~D_{\mathrm{KL}}\left(d_{\pi}(\cdot)\|\zeta(\cdot)\right)<\varepsilon, \tag{2}$$

where $\zeta(s)$ is the optimal state distribution in other environment dynamics. By introducing the stationary state distribution, the optimization problem defined in Eq. (2) extends the regularization of in-distribution data to data with distribution shift. A similar form of Eq. (2) (see Eq. (1) in Sec. 2.2) is employed in Offline RL algorithms to ensure conservative policy updates. But it restricts the training data to be sampled from the same environment.
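For a discrete state space, the constraint in Eq. (2) can be checked directly. The sketch below is an assumption-laden illustration: it represents $d_{\pi}$ and $\zeta$ as probability lists over the same finite set of states, and the names `kl_divergence`, `satisfies_state_constraint` and the clamp `eps` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists."""
    # Terms with p_i = 0 contribute nothing; q_i is clamped to avoid log(0).
    return sum(p_i * math.log(p_i / max(q_i, eps))
               for p_i, q_i in zip(p, q) if p_i > 0.0)


def satisfies_state_constraint(d_pi, zeta, epsilon):
    """True if D_KL(d_pi || zeta) < epsilon, i.e. the constraint in Eq. (2) holds."""
    return kl_divergence(d_pi, zeta) < epsilon


# Hypothetical distributions over two states.
d_pi = [0.9, 0.1]
zeta = [0.5, 0.5]
kl = kl_divergence(d_pi, zeta)
```

Note that the constraint involves only states, so it can be evaluated on trajectories from any dynamics, whereas the behavior constraint in Eq. (1) compares action distributions.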

We solve Eq. (2) by casting it to the following unconstrained optimization problem via Lagrange multipliers:

$$L=-\mathbb{E}_{s_{t},a_{t}\sim\tau}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_{t},a_{t})+\lambda\log\frac{\zeta(s_{t})}{d_{\pi}(s_{t})}\right)\right]-\frac{\lambda\varepsilon}{1-\gamma}, \tag{3}$$

where $\lambda>0$ is the Lagrange multiplier. The detailed derivations of the Lagrangian can be found in Appendix A.1. It is noteworthy that in addition to the multiplier term, the only difference between Eq. (3) and the reward-maximization objective of RL is that the logarithm of the probability density ratio, $\lambda\log\frac{\zeta(s_{t})}{d_{\pi}(s_{t})}$, is added to the reward term $r(s_{t},a_{t})$. Therefore, one can easily apply our scheme to a wide range of RL algorithms by augmenting the reward function with the density ratio.
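The reward augmentation implied by Eq. (3) can be sketched as a small post-processing step on a batch of transitions. This is a minimal illustration, assuming the density values $\zeta(s_t)$ and $d_{\pi}(s_t)$ are already available for each sampled state; the function name `augment_rewards` and the stabilizing constant `eps` are hypothetical.

```python
import math


def augment_rewards(rewards, zeta_vals, d_pi_vals, lam, eps=1e-8):
    """Return r_t + lam * log(zeta(s_t) / d_pi(s_t)) for each transition.

    rewards   : environment rewards r(s_t, a_t)
    zeta_vals : reference (optimal) state density at each s_t
    d_pi_vals : current policy's state density at each s_t
    lam       : Lagrange multiplier, treated here as a fixed coefficient
    eps       : small constant to keep the logarithms finite (an assumption)
    """
    return [r + lam * (math.log(z + eps) - math.log(d + eps))
            for r, z, d in zip(rewards, zeta_vals, d_pi_vals)]


# States that the reference distribution favors get a bonus; others a penalty.
augmented = augment_rewards(
    rewards=[1.0, 1.0, 1.0],
    zeta_vals=[0.2, 0.1, 0.05],
    d_pi_vals=[0.1, 0.1, 0.1],
    lam=0.1,
)
```

The augmented rewards can then be passed to the underlying RL algorithm in place of the original ones, which is what makes the scheme usable as an add-on.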

3.3 Data-based Surrogate of the Density Ratio

The main challenge in solving Eq. (3) is to compute the density ratio $\frac{\zeta(s)}{d_{\pi}(s)}$, because obtaining the optimal state distribution during online training or from a suboptimal offline dataset is infeasible. Also, $d_{\pi}(s)$ is intractable if the state space is continuous. Motivated by recent advances in adversarial training [38, 39], we propose a sample-based surrogate for the density ratio $\frac{\zeta(s)}{d_{\pi}(s)}$.

Proposition 3.1.

In a GAN, when the real data distribution is $\zeta(s)$ and the generated data distribution is $d_{\pi}(s)$, the output of the discriminator $D(s)$ follows

$$\frac{D(s)}{1-D(s)}=\frac{\zeta(s)}{d_{\pi}(s)}. \tag{4}$$
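Proposition 3.1 can be checked numerically. Rearranging Eq. (4) gives $D(s)=\zeta(s)/(\zeta(s)+d_{\pi}(s))$, which is the standard optimal-discriminator form for GANs. The sketch below assumes the discriminator is at this optimum and, for the last helper, that it ends in a sigmoid; all function names and the clamp `eps` are hypothetical.

```python
import math


def optimal_discriminator(zeta_s, d_pi_s):
    """Discriminator output at optimum: probability that s came from zeta."""
    return zeta_s / (zeta_s + d_pi_s)


def density_ratio(D_s, eps=1e-8):
    """Recover zeta(s) / d_pi(s) from a discriminator output via Eq. (4)."""
    D_s = min(max(D_s, eps), 1.0 - eps)   # keep the ratio finite
    return D_s / (1.0 - D_s)


def log_density_ratio_from_logit(logit):
    """If D(s) = sigmoid(logit), then log(D / (1 - D)) is the logit itself."""
    return logit


ratio = density_ratio(optimal_discriminator(zeta_s=0.3, d_pi_s=0.1))
```

Here `ratio` is approximately 3, matching $\zeta(s)/d_{\pi}(s)=0.3/0.1$. When the discriminator outputs a logit, the log-ratio needed in Eq. (3) can be read off without computing $D(s)$ explicitly, which avoids dividing by values near zero.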
Figure 3: HMM in MDP with optimality variables $\mathcal{O}_{t}$.

We discuss the relation of this sample-based surrogate with f-divergences and Off-Policy RL in Appendix A.3. To train the discriminator $D(s)$, we need to generate samples that are close to the optimal state distribution $\zeta(s)$ and away from $d_{\pi}(s)$, which is roughly the average state distribution. Motivated by [16], we model state optimality by a variable $\mathcal{O}_{t}$. As shown in Fig. 3, we regard the state $s_{t}$ in the MDP as a hidden state in a Hidden Markov Model (HMM), and introduce the binary observation state $\mathcal{O}_{t}$. $\mathcal{O}_{t}=1$ denotes that $s_{t}$ is the optimal state at timestep $t$. The observation model is given by

$$p(\mathcal{O}_t \mid s_t) = \max_{a_t} \exp\left[\gamma^{t}\left(r(s_t, a_t) - R_{\max}\right)\right]. \tag{5}$$

We can therefore compute the state density ratio $\frac{\zeta(s)}{d_{\pi}(s)}$ as

$$\frac{\zeta(s)}{d_{\pi}(s)} = \frac{d_{\pi}(s \mid \mathcal{O}_{0:\infty})}{d_{\pi}(s)} = \frac{p(\mathcal{O}_{0:\infty} \mid s, \pi)\, d_{\pi}(s)}{p(\mathcal{O}_{0:\infty} \mid \pi)\, d_{\pi}(s)} = \frac{\mathbb{E}_{t}\left[p(\mathcal{O}_{0:t-1} \mid s_t, \pi)\, p(\mathcal{O}_{t:\infty} \mid s_t, \pi)\right]}{p(\mathcal{O}_{0:\infty} \mid \pi)}, \tag{6}$$

where the second equality follows from Bayes' rule. The last term is related to the forward probability $\alpha_t(s_t) = p(\mathcal{O}_{0:t-1} \mid s_t, \pi)$ and the backward probability $\beta_t(s_t) = p(\mathcal{O}_{t:\infty} \mid s_t, \pi)$ in the HMM. We discuss in Appendix A.2 that $\beta_t(s_t)$ is positively related to a soft version of the MDP's state value $V_{\pi}(s_t)$. Also, $\alpha_t(s_t)$ has little influence on the overall density ratio. Therefore, the input $s$ is more likely to be sampled from the distribution $\zeta(s)$ if it has a higher-than-average state value $V(s)$. With this idea, we are able to build training samples for the discriminator $D(s)$.
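The sample-construction idea above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `partition_by_value` and the use of NumPy arrays are assumptions made here, and the value estimates are taken as given.

```python
import numpy as np

def partition_by_value(states, values, rho):
    """Split states into a high-value set and a low-value set.

    Hypothetical helper: the top `rho` fraction of states, ranked by
    their estimated value, is returned first; the rest is returned second.
    """
    states = np.asarray(states)
    values = np.asarray(values, dtype=float)
    n_high = int(round(rho * len(values)))
    # argsort is ascending, so reverse it to rank from highest to lowest value.
    order = np.argsort(values)[::-1]
    high_idx, low_idx = order[:n_high], order[n_high:]
    return states[high_idx], states[low_idx]
```

States in the first returned set play the role of samples close to $\zeta(s)$, and the remaining states stand in for samples from $d_{\pi}(s)$.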

Algorithm 1 The workflow of SRPO on top of MAPLE [12].
1:  Input: Context encoder $\phi_{\varphi}$ parameterized by $\varphi$; adaptable policy network $\pi_{\theta}$ parameterized by $\theta$; adaptable value network $V_{\psi}$ parameterized by $\psi$; offline dataset $\mathcal{D}_{\text{off}}$; rollout horizon $H$; state partition ratio $\rho$; state discriminator $D_{\delta}$ parameterized by $\delta$; regularization coefficient $\lambda$.
2:  for 1, 2, 3, $\dots$ do
3:     for $t = 1, 2, \dots, H$ do
4:        Sample $z_t$ from $\phi_{\varphi}(z \mid s_t, a_{t-1}, z_{t-1})$ and then sample $a_t$ from $\pi_{\theta}(a \mid s_t, z_t)$.
5:        Roll out and get transition data $(s_{t+1}, r_t, d_{t+1}, s_t, a_t, z_t)$. Then add it to $\mathcal{D}_{\text{rollout}}$.
6:     end for
7:     Update the context encoder $\phi_{\varphi}$ according to MAPLE.
8:     Sample a batch $\mathcal{D}_{\text{batch}}$ from $\mathcal{D}_{\text{off}}$ and $\mathcal{D}_{\text{rollout}}$ and rank its states by their state values estimated by $V_{\psi}$; add the $\rho|\mathcal{D}_{\text{batch}}|$ states with the highest state values to $\mathcal{D}_{\text{real}}$ and the others to $\mathcal{D}_{\text{fake}}$.
9:     Train the discriminator $D_{\delta}$ with the negative log-likelihood (NLL) loss.
10:     For each one-step transition $(s_{t+1}, r_t, d_{t+1}, s_t, a_t, z_t)$ in $\mathcal{D}_{\text{batch}}$, update $r_t$ with $r_t + \lambda \frac{D_{\delta}(s_t)}{1 - D_{\delta}(s_t)}$.
11:     Use the updated $\mathcal{D}_{\text{batch}}$ and SAC to update the policy and value network parameters $\theta$ and $\psi$.
12:  end for
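Steps 9 and 10 of Algorithm 1 can be illustrated with a small NumPy sketch. It is not the authors' code: a linear logistic model stands in for the discriminator network, and the function names, learning rate, step count, and clipping constant `eps` are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_discriminator(real, fake, lr=0.5, steps=500):
    """Fit D(s) = sigmoid(w . s + b) with the binary NLL loss (label 1 = real)."""
    x = np.vstack([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(x @ w + b)
        grad = p - y  # gradient of the NLL with respect to the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def augment_rewards(rewards, states, w, b, lam, eps=1e-6):
    """Apply the update r_t + lam * D(s_t) / (1 - D(s_t)) from step 10."""
    # Clipping keeps the ratio finite when D saturates (an added safeguard).
    d = np.clip(sigmoid(states @ w + b), eps, 1.0 - eps)
    return rewards + lam * d / (1.0 - d)
```

In this sketch, states that the discriminator assigns to the high-value set receive a larger reward bonus, which is the qualitative behavior the algorithm relies on.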

3.4 Practical Algorithm

Summarizing the previous derivations, we obtain a practical reward regularization algorithm, termed SRPO (State Regularized Policy Optimization), to leverage data with dynamics shift. We select the MAPLE [12] algorithm, which is one of the SOTA algorithms in context-based Offline RL, as the base algorithm. The detailed procedure of MAPLE+SRPO is shown in Alg. 1. After preparing the dataset in a model-based Offline RL style [40, 12], we sample a batch of data from the dataset, obtain a portion of $\rho$ states with higher rewards and add them to the dataset $\mathcal{D}_{\text{real}}$. $\mathcal{D}_{\text{fake}}$ is similarly generated by states with lower rewards (line 8). We set $\rho = 0.5$ in offline experiments with medium-expert level of data and $\rho = 0.2$ in all other experiments. Then a classifier discriminating data from the two datasets is trained (line 9). It estimates the logarithm of the state density ratio, and the resulting term $\lambda \log \frac{\zeta(s)}{d_{\pi}(s)}$ is added to the reward $r_t$ (line 10). $\lambda$ is regarded as a hyperparameter with values $0.1$ or $0.3$. The effect of $\lambda$ is investigated in Sec. 5.3.
The procedure of the online algorithm CaDM [9]+SRPO is similar to MAPLE+SRPO, where the datasets $\mathcal{D}_{\text{real}}$ and $\mathcal{D}_{\text{fake}}$ are built with data from the replay buffer, rather than the offline dataset.
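A standard property of binary classifiers trained with the NLL loss may help explain why the discriminator output can serve as a density-ratio estimate. The notation $p_{\mathrm{real}}$ and $p_{\mathrm{fake}}$ (the state densities of the two datasets) is introduced here for illustration, and identifying them with $\zeta$ and $d_{\pi}$ is only approximate, since the two datasets are proxies built from value rankings.

```latex
% Bayes-optimal classifier when a fraction \rho of the samples is labeled "real"
D^{*}(s) = \frac{\rho\, p_{\mathrm{real}}(s)}{\rho\, p_{\mathrm{real}}(s) + (1-\rho)\, p_{\mathrm{fake}}(s)}
\quad\Longrightarrow\quad
\frac{D^{*}(s)}{1 - D^{*}(s)} = \frac{\rho}{1-\rho} \cdot \frac{p_{\mathrm{real}}(s)}{p_{\mathrm{fake}}(s)}.
```

Under these assumptions, the odds $\frac{D(s)}{1-D(s)}$ recover the density ratio up to the constant factor $\frac{\rho}{1-\rho}$, which equals one when $\rho = 0.5$, and taking the logarithm gives the log-ratio up to an additive constant.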

4 Theoretical Analysis

In this section, we analyze some properties of MDPs with different dynamics and provide theoretical justifications for the SRPO algorithm in Sec. 3. The notations are introduced in Sec. 2.1 and proofs can be found in Appendix A.4. We first show in Thm. 4.2 that the performance of a policy can be lower-bounded when its stationary state distribution is close to a certain optimal state distribution. In accordance with the intuition in Sec. 3.1, it is also demonstrated in Thm. 4.3 that optimal policies can induce the same stationary state distribution in different dynamics under mild assumptions. We start the analysis with the definition of homomorphous MDPs.

Definition 4.1 (homomorphous MDPs).

In an HiP-MDP $(\mathcal{S}, \mathcal{A}, \Theta, T, r, \gamma, \rho_0)$, consider hidden parameters $\theta_1, \theta_2 \in \Theta$. Let $T_i(s' \mid s, a) = T(s' \mid s, a, \theta_i)$, $\forall (s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$, $i = 1, 2$.
If $\sum_{a \in \mathcal{A}} T_1(s' \mid s, a) > 0 \Leftrightarrow \sum_{a \in \mathcal{A}} T_2(s' \mid s, a) > 0$ for all $s, s' \in \mathcal{S}$, the MDPs $(\mathcal{S}, \mathcal{A}, T_1, r, \gamma, \rho_0)$ and $(\mathcal{S}, \mathcal{A}, T_2, r, \gamma, \rho_0)$ are referred to as homomorphous MDPs.

In this definition, $\sum_{a \in \mathcal{A}} T(s' \mid s, a) > 0$ means that state $s'$ can be reached from $s$, so the equivalence of non-zero transition probabilities refers to the same reachability from $s$ to $s'$. Such a condition holds in a wide range of MDPs differing only in environment parameters. For example, pendulums with different lengths can all reach the upright state from an off-center state, with longer pendulums exerting a larger force. Apart from the homomorphous property, we also require the reward and dynamics functions of MDPs to have Lipschitz properties. We assume that the reward function $r(s, a, s')$ is $\lambda_1$-Lipschitz w.r.t. the action $a$ and that the dynamics function $T(s, a)$ is $\lambda_2$-inverse Lipschitz w.r.t. the action $a$. Discussions on these Lipschitz properties can be found in Appendix A.5.
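For finite state and action spaces, the reachability condition of Definition 4.1 can be checked directly. The following sketch assumes a tabular representation in which `T[s, a, s_next]` stores transition probabilities; the function names are hypothetical.

```python
import numpy as np

def reachability(T):
    """Return a boolean [s, s_next] matrix: True if some action reaches s_next from s."""
    # Summing over the action axis mirrors the sum over a in Definition 4.1.
    return np.asarray(T).sum(axis=1) > 0

def are_homomorphous(T1, T2):
    """Check that two tabular dynamics share the same one-step reachability."""
    return bool(np.array_equal(reachability(T1), reachability(T2)))
```

Two dynamics with different transition probabilities but the same set of reachable next states pass this check, while dynamics that block a transition available in the other fail it.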

With these preliminaries, we first analyze the discrepancy of accumulated returns of two policies with similar stationary state distributions. The analysis is related to our SRPO algorithm in that the state regularized policy optimization formulation in Eq. (2) also constrains the learning policy to have a stationary state distribution similar to that of the optimal policy. Specifically, we derive the following theorem.

Theorem 4.2.

Consider two homomorphous MDPs with dynamics $T$ and $T'$. If $T' \in (T, \varepsilon_m)$, then for every learning policy $\hat{\pi}$ such that $D_{\mathrm{KL}}\left(d^{\hat{\pi}}_{T}(\cdot) \,\|\, d^{*}_{T'}(\cdot)\right) \leqslant \varepsilon_s$, we have

$$\eta_{T}(\hat{\pi}) \geqslant \eta_{T}(\pi_{T}^{*}) - \frac{\lambda_1 \lambda_2 \varepsilon_m + 2\lambda_1 + \sqrt{2} R_{\max} \sqrt{\varepsilon_s}}{1 - \gamma}. \tag{7}$$

The theorem implies that if a policy $\hat{\pi}$ has a stationary state distribution similar to that of the optimal policy in one MDP $M$, $\hat{\pi}$ will have a lower-bound performance guarantee in all MDPs that are homomorphous with the MDP $M$. Therefore, the learning policy can benefit from the state regularized policy optimization in Sec. 3.2.

More specifically, Eq. (7) shows that the gap in accumulated return of $\hat{\pi}$ and $\pi^{*}_{T}$ is related to the dynamics shift $\varepsilon_m$, the KL-Divergence of two stationary state distributions $\varepsilon_s$, and the effective planning horizon $\frac{1}{1-\gamma}$. With respect to the dynamics shift $\varepsilon_m$, it is related to a “uniform” constraint on the dynamics $T'$. We further show in Appendix A.4 that constraining the dynamics shift on a certain state-action pair is enough to derive Eq. (7). Unlike the dynamics shift $\varepsilon_m$ that is determined by a pre-defined RL task, the discrepancy between stationary state distributions $\varepsilon_s$ is determined by the learning policy $\hat{\pi}$ and can be optimized during training to obtain a better performance lower-bound. We also discuss in Appendix A.5 how tight Eq. (7) is in terms of the effective planning horizon $\frac{1}{1-\gamma}$, compared with some similar performance bounds.
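To make the dependence on each quantity concrete, the gap term subtracted in Eq. (7) can be evaluated numerically. The function name and the example values below are illustrative assumptions, not numbers from the paper.

```python
import math

def return_gap_bound(lam1, lam2, eps_m, eps_s, r_max, gamma):
    """Evaluate the term subtracted from the optimal return in Eq. (7)."""
    numerator = (lam1 * lam2 * eps_m
                 + 2.0 * lam1
                 + math.sqrt(2.0) * r_max * math.sqrt(eps_s))
    # Dividing by (1 - gamma) scales the gap by the effective planning horizon.
    return numerator / (1.0 - gamma)

# Illustrative values only: the gap shrinks as eps_s is driven down in training.
loose = return_gap_bound(1.0, 1.0, 0.1, 0.08, 1.0, 0.9)
tight = return_gap_bound(1.0, 1.0, 0.1, 0.02, 1.0, 0.9)
```

Only the $\sqrt{\varepsilon_s}$ term changes between the two calls, which reflects the remark that $\varepsilon_s$ is the quantity the learning policy can optimize.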

With an additional assumption on the action gap $\Delta$ (defined in Sec. 2.1), we further demonstrate that the optimal policies of two homomorphous MDPs can have the same stationary state distribution, which verifies the intuition in Sec. 3.1.

Theorem 4.3.

Consider two homomorphous MDPs with dynamics $T$ and $T'$. If $T' \in (T, \varepsilon_m)$ and the action gap $\Delta$ satisfies $\Delta > \frac{(2-\gamma)\lambda_1 \lambda_2 \varepsilon_m}{1-\gamma}$, then for all $s \in \mathcal{S}$ we have $d^{*}_{T}(s) = d^{*}_{T'}(s)$.

The assumption is mild and holds in many scenarios. For example, in autonomous driving it can be very dangerous to deviate from the optimal policy. Such suboptimal actions have low rewards, leading to a large action gap $\Delta$. In recommendation tasks we are hardly concerned with what items we recommend (the action), as long as the recommendation outcome (the state), i.e., the users' experience, is good enough, leading to a small $\lambda_1$. The condition of a large enough action gap holds in these situations.
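The action-gap condition of Thm. 4.3 is a simple inequality, so it can be checked for given constants. The helper names and the numbers used in the example are hypothetical.

```python
def action_gap_threshold(lam1, lam2, eps_m, gamma):
    """Right-hand side of the action-gap condition in Thm. 4.3."""
    return (2.0 - gamma) * lam1 * lam2 * eps_m / (1.0 - gamma)

def satisfies_gap_condition(delta, lam1, lam2, eps_m, gamma):
    """True if the action gap delta strictly exceeds the threshold."""
    return delta > action_gap_threshold(lam1, lam2, eps_m, gamma)
```

A smaller $\lambda_1$ or a smaller dynamics shift $\varepsilon_m$ lowers the threshold, which matches the two examples given above.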

5 Experiments

In this section, we conduct experiments to investigate the following questions: (1) Can SRPO leverage data with distribution shift and outperform current SOTA algorithms in the setting of HiP-MDP, in both online and offline RL? (2) How does each component of SRPO (e.g., using state regularization rather than behavior regularization) contribute to SRPO's performance? To answer question (1), we use the MuJoCo simulator [41] and generate environments with different transition functions. We train the CaDM+SRPO and MAPLE+SRPO algorithms proposed in Sec. 3.4 and compare them with baseline algorithms. To answer question (2), we conduct ablation studies to examine the role of different modules in SRPO. We also examine how the discriminator $D_{\delta}$ works in complex environments and the effect of regularizing with state distributions of different performance levels.

Figure 4: Results of online experiments on MuJoCo tasks. The comparison is made between CaDM+SRPO and the baseline algorithms PPO and CaDM. Our CaDM+SRPO algorithm has the best overall performance in experiments with 3 and 5 different environment dynamics. The curves show the average return over 4 random seeds and the shaded areas reflect the standard deviation.

5.1 Experiment Setup

We alter the simulator gravity to generate different dynamics in online experiments. Possible values of gravity are {1.0}, {0.7,1.0,1.3}, and {0.4,0.7,1.0,1.3,1.6} in experiments with 1, 3, and 5 kinds of different dynamics, respectively. When the simulator resets, the gravity is uniformly sampled from the set of all possible values. The number of training steps is in proportion to the number of environment parameters. Therefore, the agent has access to the same amount of training data on a certain value of simulator gravity. We also consider the shift of medium density and body mass in offline experiments to show SRPO’s robustness to different forms of dynamics shift.
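The sampling scheme described above can be sketched as follows. The gravity sets are copied from the text, while the function names and the `steps_per_dynamics` parameter are assumptions introduced for illustration; how the sampled value is applied to the simulator is not shown.

```python
import random

# Gravity values listed in the text for 1, 3, and 5 kinds of dynamics.
GRAVITY_SETS = {
    1: [1.0],
    3: [0.7, 1.0, 1.3],
    5: [0.4, 0.7, 1.0, 1.3, 1.6],
}

def sample_gravity(num_dynamics, rng):
    """Uniformly sample a gravity value, as done at each simulator reset."""
    return rng.choice(GRAVITY_SETS[num_dynamics])

def total_training_steps(num_dynamics, steps_per_dynamics):
    """Training steps grow in proportion to the number of environment parameters."""
    return num_dynamics * steps_per_dynamics
```

With this proportional budget, the expected amount of data collected under any single gravity value stays the same across the three settings.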

To perform comparative analysis, we choose CaDM [9] and PPO [7] as baseline algorithms in online experiments. In offline experiments, DARA [5] also exploits a large amount of data with dynamics shift. Its algorithm relies on Importance Sampling and is used as a baseline method. Apart from that, we choose MOPO [40], MAPLE [12] and CQL [26] as baseline methods. More information on the setup of experiments is given in Appendix B.1.

Table 1: Results of offline experiments on MuJoCo tasks. Numbers are the normalized scores according to the D4RL paper [42]. ME, M, MR and R correspond to the medium-expert, medium, medium-replay and random datasets, respectively. The evaluation is done on policies at the last iteration of training, averaged over four random seeds. The number after ± is the standard deviation. Our proposed MAPLE+SRPO algorithm has the best performance in 8 of 12 tasks and the highest overall performance.
| Task | CQL (Single Env) | GAIL | CQL | MOPO | MAPLE | MAPLE+DARA | MAPLE+SRPO (Ours) |
|---|---|---|---|---|---|---|---|
| Walker2d-ME | 1.11 | 0.21 ± 0.03 | 1.03 ± 0.10 | 0.25 ± 0.18 | 0.55 ± 0.21 | 0.80 ± 0.02 | 0.66 ± 0.08 |
| Walker2d-M | 0.79 | 0.15 ± 0.06 | 0.78 ± 0.01 | 0.23 ± 0.34 | 0.82 ± 0.01 | 0.83 ± 0.03 | 0.84 ± 0.03 |
| Walker2d-MR | 0.27 | 0.00 ± 0.00 | 0.07 ± 0.00 | 0.00 ± 0.00 | 0.16 ± 0.02 | 0.17 ± 0.01 | 0.17 ± 0.02 |
| Walker2d-R | 0.07 | 0.00 ± 0.00 | 0.03 ± 0.01 | 0.00 ± 0.00 | 0.22 ± 0.00 | 0.22 ± 0.00 | 0.22 ± 0.00 |
| Hopper-ME | 0.98 | 0.04 ± 0.01 | 0.32 ± 0.14 | 0.01 ± 0.00 | 0.96 ± 0.14 | 0.96 ± 0.06 | 0.98 ± 0.02 |
| Hopper-M | 0.58 | 0.00 ± 0.00 | 0.57 ± 0.16 | 0.01 ± 0.00 | 0.78 ± 0.28 | 0.40 ± 0.05 | 1.03 ± 0.09 |
| Hopper-MR | 0.46 | 0.00 ± 0.00 | 0.14 ± 0.02 | 0.01 ± 0.01 | 0.91 ± 0.11 | 1.02 ± 0.01 | 1.02 ± 0.01 |
| Hopper-R | 0.11 | 0.00 ± 0.00 | 0.11 ± 0.00 | 0.01 ± 0.00 | 0.13 ± 0.00 | 0.13 ± 0.01 | 0.32 ± 0.02 |
| HalfCheetah-ME | 0.62 | 0.36 ± 0.06 | 0.03 ± 0.04 | -0.03 ± 0.00 | 0.50 ± 0.06 | 0.50 ± 0.00 | 0.63 ± 0.01 |
| HalfCheetah-M | 0.44 | 0.25 ± 0.02 | 0.43 ± 0.03 | 0.38 ± 0.28 | 0.62 ± 0.01 | 0.67 ± 0.03 | 0.63 ± 0.01 |
| HalfCheetah-MR | 0.46 | 0.18 ± 0.11 | 0.46 ± 0.00 | -0.03 ± 0.00 | 0.52 ± 0.00 | 0.53 ± 0.01 | 0.55 ± 0.00 |
| HalfCheetah-R | 0.35 | 0.14 ± 0.02 | 0.01 ± 0.02 | -0.03 ± 0.00 | 0.22 ± 0.03 | 0.21 ± 0.00 | 0.24 ± 0.01 |
| Average | 0.52 | 0.11 | 0.33 | 0.068 | 0.53 | 0.54 | 0.61 |
Table 2: Results of ablation studies in offline experiments.

| Environment | MAPLE+SRPO | Behavior Regularizing | Random Partition | Fixed $\lambda = 0.3$ |
|---|---|---|---|---|
| Walker2d | 0.47 | 0.45 | 0.39 | 0.40 |
| Hopper | 0.83 | 0.68 | 0.56 | 0.79 |
| HalfCheetah | 0.51 | 0.50 | 0.45 | 0.40 |
| Average | 0.61 | 0.54 | 0.47 | 0.53 |

Table 3: Performance comparison of differently regularized policies.

| Regularizing policy | Original Hopper Env | 10x Density |
|---|---|---|
| Random | 121.3 | 44.15 |
| Medium | 2178 | 913.3 |
| Expert | 3819 | 3748 |

5.2 Results

Online Experiments

The results of online experiments are shown in Fig. 4. With the context encoder and conditional policy, CaDM is able to outperform PPO in all environments. However, it fails to take advantage of the increase in the amount of data with dynamics shift. Its performance with 5 different dynamics is lower than that with 3 dynamics. In contrast, adding our proposed SRPO algorithm on top of CaDM yields better performance as more training data becomes available. It significantly outperforms the original CaDM algorithm in environments with 5 different dynamics. The performance comparison in the Pendulum environment is also in accordance with the motivating example in Sec. 3.1. More results of online experiments are shown in Appendix B.2.

Offline Experiments

The results of offline experiments are shown in Tab. 1. The column “CQL (Single Env)” refers to the evaluation scores reported in the CQL [26] paper, where the policy is trained with data from a single static environment. Without the mechanism of context-based encoders, GAIL [43], CQL and MOPO [40] cannot handle data with distribution shift and show a performance drop. MAPLE [12] and MAPLE+DARA [5] only achieve marginal performance improvement with respect to CQL (Single Env). On the other hand, MAPLE+SRPO shows significant performance improvement over CQL (Single Env), which means that SRPO can efficiently leverage the additional data with dynamics shift to facilitate policy training. The MAPLE+SRPO algorithm also has a 15% higher evaluation score than MAPLE, achieving the best performance in 8 out of 12 tasks. Apart from MAPLE, the meta-RL algorithm PEARL [44] also has a context encoder for fast adaptation. We compare PEARL with PEARL+SRPO and leave the results to Appendix B.3.
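The relative improvements quoted above follow from the "Average" row of Tab. 1. The short sketch below redoes that arithmetic; the dictionary and function names are introduced here for illustration.

```python
# Average normalized scores copied from the "Average" row of Tab. 1.
average_scores = {
    "CQL (Single Env)": 0.52,
    "MAPLE": 0.53,
    "MAPLE+DARA": 0.54,
    "MAPLE+SRPO": 0.61,
}

def relative_improvement(new, base):
    """Fractional improvement of `new` over `base`."""
    return (new - base) / base

srpo_over_maple = relative_improvement(
    average_scores["MAPLE+SRPO"], average_scores["MAPLE"]
)
maple_over_cql = relative_improvement(
    average_scores["MAPLE"], average_scores["CQL (Single Env)"]
)
```

Here `srpo_over_maple` is about 0.15, consistent with the 15% figure, while `maple_over_cql` is about 0.02, which is why the gain of plain MAPLE over the single-environment CQL scores is described as marginal.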

5.3 Analysis

Ablations

We conduct ablation studies in offline environments to analyze the role of each algorithm component in SRPO. The results are shown in Tab. 2. We first investigate the outcome of regularizing with the state-action distribution rather than the state distribution in Eq. (2). The resulting policy has a lower evaluation score than policies trained with the original SRPO algorithm in all environments. This is because environments with different dynamics do not have a similar optimal policy. The action distribution in the mixed dataset can be misleading when training new policies. According to Sec. 3.3, SRPO trains a classifier to discriminate states with higher values from those with lower values. We also train another classifier discriminating a random binary partition of states. Ablation results show a large performance drop, which verifies the effectiveness of the classification-based surrogate mechanism. We also evaluate MAPLE+SRPO with a fixed value of the hyperparameter $\lambda$. $\lambda = 0.1$ is more suitable for the Walker2d and HalfCheetah environments, while in the Hopper environment $\lambda = 0.3$ is better. This is in accordance with the previous analysis that Hopper agents can benefit more from regularizing with the stationary state distribution.

Effectiveness of Discriminators

Figure 5: Comparison of values on states with high and low output of the discriminator $D$.

We first train a discriminator $D_\delta$ according to Alg. 1. Then a set of states is sampled from the D4RL [42] dataset and split into two sets according to the output of $D_\delta$. The average values of the two sets of states are compared in Fig. 5. As shown in the figure, states with higher $D_\delta$ outputs also have higher values in all three environments. This means that the trained discriminator $D_\delta$ can successfully distinguish states with high values from those with low values. Therefore, its output can be a good surrogate for the density ratio $\frac{\zeta(s)}{d_\pi(s)}$ in Sec. 3.
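The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the evaluation code used for Fig. 5: the helper `split_by_discriminator`, the threshold of 0.5, and the synthetic discriminator outputs and values are all hypothetical stand-ins for a trained discriminator and a learned value function.

```python
import numpy as np

def split_by_discriminator(d_out, values, threshold=0.5):
    """Return (mean value of states with D > threshold, mean value of the rest)."""
    d_out = np.asarray(d_out, dtype=float)
    values = np.asarray(values, dtype=float)
    high = d_out > threshold
    if high.all() or (~high).all():
        raise ValueError("threshold puts every state in the same set")
    return values[high].mean(), values[~high].mean()

# Synthetic stand-ins (assumptions): state values, and discriminator outputs
# that are noisily correlated with those values.
rng = np.random.default_rng(0)
values = rng.normal(size=1000)
d_out = 1.0 / (1.0 + np.exp(-(values + 0.5 * rng.normal(size=1000))))

mean_high, mean_low = split_by_discriminator(d_out, values)
print(f"mean value, high-D states: {mean_high:.3f}")
print(f"mean value, low-D states:  {mean_low:.3f}")
```

If the discriminator output tracks the value of a state, the first mean should exceed the second, which is the pattern reported in Fig. 5.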

Effectiveness of the Regularization

We also study the effect of policy regularization with different performance levels of stationary state distributions. Random, medium and expert policies in the original Hopper environment are used to estimate the stationary state distributions, which regularize the learning policies in a new environment with different dynamics. The results are shown in Tab. 3, where the expert policy is the most effective in regularizing. This verifies the practice in Sec. 3.2 and the theoretical analysis.

6 Conclusion and Discussion

In this work, we focus on the problem of leveraging data with dynamics shift to efficiently train RL agents. Based on the intuition that optimal policies can lead to similar stationary state distributions, we give a constrained optimization formulation that regards the state distribution as a regularizer. After discussions on a sample-based surrogate, we propose the SRPO algorithm which can be an add-on module to context-based algorithms and improve their sample efficiency. The resulting CaDM+SRPO and MAPLE+SRPO algorithms show superior performance when learning on data sampled from environments with different dynamics. Theoretical analyses are also given to analyze some properties of MDPs with different dynamics. They provide justifications for the intuition of the dynamics-invariant state distribution, as well as the constrained policy optimization formulation.

Limitations and Future work

The theoretical analyses of this work require the assumption of homomorphous MDPs, i.e., the same state reachability in different MDPs. It would be interesting to discuss whether similar conclusions on the stationary state distribution can be derived without such an assumption.

Acknowledgements

We thank Wanqi Xue and Yanchen Deng for helpful discussions. This research is supported by the National Research Foundation, Singapore under its Industry Alignment Fund – Pre-positioning (IAF-PP) Funding Initiative and Ministry of Education, Singapore, under its Academic Research Fund Tier 1 (RG13/22). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.

References

  • [1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
  • [2] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nat., 529(7587):484–489, 2016.
  • [3] Wanqi Xue, Qingpeng Cai, Zhenghai Xue, Shuo Sun, Shuchang Liu, Dong Zheng, Peng Jiang, and Bo An. PrefRec: Preference-based recommender systems for reinforcing long-term user engagement. CoRR, abs/2212.02779, 2022.
  • [4] Zhenghai Xue, Qingpeng Cai, Tianyou Zuo, Bin Yang, Lantao Hu, Peng Jiang, and Bo An. AdaRec: Adaptive sequential recommendation for reinforcing long-term user engagement. CoRR, abs/2310.03984, 2023.
  • [5] **xin Liu, Hongyin Zhang, and Donglin Wang. DARA: Dynamics-aware reward augmentation in offline reinforcement learning. In ICLR, 2022.
  • [6] Fan-Ming Luo, Shengyi Jiang, Yang Yu, Zongzhang Zhang, and Yi-Feng Zhang. Adapt to environment sudden changes by learning a context sensitive policy. In AAAI, 2022.
  • [7] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • [8] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
  • [9] Kimin Lee, Younggyo Seo, Seunghyun Lee, Honglak Lee, and **woo Shin. Context-aware dynamics model for generalization in model-based reinforcement learning. In ICML, 2020.
  • [10] Xi Chen, Ali Ghadirzadeh, Tianhe Yu, Yuan Gao, Jianhao Wang, Wenzhe Li, Bin Liang, Chelsea Finn, and Chongjie Zhang. Latent-variable advantage-weighted policy optimization for offline RL. CoRR, abs/2203.08949, 2022.
  • [11] Wenxuan Zhou, Lerrel Pinto, and Abhinav Gupta. Environment probing interaction policies. In ICLR, 2019.
  • [12] Xiong-Hui Chen, Yang Yu, Qingyang Li, Fan-Ming Luo, Zhiwei (Tony) Qin, Wenjie Shang, and Jie** Ye. Offline model-based adaptable policy learning. In NeurIPS, 2021.
  • [13] Benjamin Eysenbach, Shreyas Chaudhari, Swapnil Asawa, Sergey Levine, and Ruslan Salakhutdinov. Off-dynamics reinforcement learning: Training for transfer with domain classifiers. In ICLR, 2021.
  • [14] Haoyi Niu, Shubham Sharma, Yiwen Qiu, Ming Li, Guyue Zhou, Jianming Hu, and Xianyuan Zhan. When to trust your simulator: Dynamics-aware hybrid offline-and-online reinforcement learning. CoRR, abs/2206.13464, 2022.
  • [15] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In ICML, 2017.
  • [16] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR, abs/1805.00909, 2018.
  • [17] Finale Doshi-Velez and George Dimitri Konidaris. Hidden parameter markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI, 2016.
  • [18] Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A survey of zero-shot generalisation in deep reinforcement learning. J. Artif. Intell. Res., 76:201–264, 2023.
  • [19] Luisa M. Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, and Shimon Whiteson. VariBAD: A very good method for bayes-adaptive deep RL via meta-learning. In ICLR, 2020.
  • [20] Jiachen Yang, Brenden K. Petersen, Hongyuan Zha, and Daniel M. Faissol. Single episode policy transfer in reinforcement learning. In ICLR, 2020.
  • [21] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [22] Eric Mitchell, Rafael Rafailov, Xue Bin Peng, Sergey Levine, and Chelsea Finn. Offline meta-reinforcement learning with advantage weighting. In ICML, 2021.
  • [23] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In ICML, 2017.
  • [24] **g-Cheng Pang, Tian Xu, Shengyi Jiang, Yu-Ren Liu, and Yang Yu. Sparsity prior regularized q-learning for sparse action tasks. CoRR, abs/2105.08666, 2021.
  • [25] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In ICML, 2019.
  • [26] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In NeurIPS, 2020.
  • [27] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. CoRR, abs/1911.11361, 2019.
  • [28] Riashat Islam, Komal K. Teru, and Deepak Sharma. Off-policy policy gradient algorithms by constraining the state distribution shift. CoRR, abs/1911.06970, 2019.
  • [29] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In NeurIPS, 2018.
  • [30] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. In NeurIPS, 2019.
  • [31] Tanmay Gangwani and Jian Peng. State-only imitation with transition dynamics mismatch. In ICLR, 2020.
  • [32] Fangchen Liu, Zhan Ling, Tongzhou Mu, and Hao Su. State alignment-based imitation learning. In ICLR, 2020.
  • [33] Tianwei Ni, Harshit S. Sikchi, Yufei Wang, Tejus Gupta, Lisa Lee, and Ben Eysenbach. f-IRL: Inverse reinforcement learning via state marginal matching. In CoRL, 2020.
  • [34] Shentao Yang, Yihao Feng, Shujian Zhang, and Mingyuan Zhou. Regularizing a model-based policy stationary distribution to stabilize offline reinforcement learning. In ICML, 2022.
  • [35] Paul F. Christiano, Zain Shah, Igor Mordatch, Jonas Schneider, Trevor Blackwell, Joshua Tobin, Pieter Abbeel, and Wojciech Zaremba. Transfer from simulation to real world through learning deep inverse dynamics model. CoRR, abs/1610.03518, 2016.
  • [36] Shengyi Jiang, **g-Cheng Pang, and Yang Yu. Offline imitation learning with a misspecified simulator. In NeurIPS, 2020.
  • [37] Emanuel Parzen. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.
  • [38] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, abs/1406.2661, 2014.
  • [39] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
  • [40] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y. Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: model-based offline policy optimization. In NeurIPS, 2020.
  • [41] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, 2012.
  • [42] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: datasets for deep data-driven reinforcement learning. CoRR, abs/2004.07219, 2020.
  • [43] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, 2016.
  • [44] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In ICML, 2019.
  • [45] Xu-Hui Liu, Zhenghai Xue, **g-Cheng Pang, Shengyi Jiang, Feng Xu, and Yang Yu. Regret minimization experience replay in off-policy reinforcement learning. In NeurIPS, 2021.
  • [46] Samarth Sinha, Jiaming Song, Animesh Garg, and Stefano Ermon. Experience replay with likelihood-free importance weights. In L4DC, 2022.
  • [47] Tian Xu, Ziniu Li, and Yang Yu. Error bounds of imitating policies and environments. In NeurIPS, 2020.
  • [48] Yuda Song, Aditi Mavalankar, Wen Sun, and Sicun Gao. Provably efficient model-based policy adaptation. In ICML, 2020.
  • [49] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In NeurIPS, 2019.
  • [50] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In ICLR, 2019.
  • [51] Kailin Zeng, Qiyuan Zhang, Bin Chen, Bin Liang, and Jun Yang. APD: learning diverse behaviors for reinforcement learning through unsupervised active pre-training. IEEE Robotics Autom. Lett., 7(4):12251–12258, 2022.

Appendix A Additional Derivations and Proofs

A.1 Derivations of the Lagrangian

We start from the optimization problem:

$$
\begin{aligned}
\max_{\pi}~ & \mathbb{E}_{s_t,a_t\sim\tau_\pi}\sum_{t=0}^{\infty}\gamma^{t}r\left(s_t,a_t\right) \\
\text{s.t.}~ & D_{\mathrm{KL}}\left(d_\pi(\cdot)\,\|\,\zeta(\cdot)\right)<\varepsilon_m.
\end{aligned}
\tag{8}
$$

The KL-divergence term can be rewritten as:

$$
\begin{aligned}
D_{\mathrm{KL}}\left(d_\pi(\cdot)\,\|\,\zeta(\cdot)\right)
&=-\mathbb{E}_{s\sim d_\pi(s)}\left[\log\zeta(s)-\log d_\pi(s)\right] \\
&=-\int d_\pi(s)\left[\log\zeta(s)-\log d_\pi(s)\right]ds \\
&=-\int(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}p\left(s_t=s\right)\left[\log\zeta(s)-\log d_\pi(s)\right]ds \\
&=-(1-\gamma)\sum_{t=0}^{\infty}\int\gamma^{t}p\left(s_t=s\right)\left[\log\zeta(s)-\log d_\pi(s)\right]ds \\
&=-(1-\gamma)\sum_{t=0}^{\infty}\mathbb{E}_{s_t\sim\tau}\left[\gamma^{t}\left(\log\zeta(s_t)-\log d_\pi(s_t)\right)\right] \\
&=-(1-\gamma)\,\mathbb{E}_{s_t\sim\tau}\sum_{t=0}^{\infty}\gamma^{t}\left(\log\zeta(s_t)-\log d_\pi(s_t)\right).
\end{aligned}
\tag{9}
$$

So the constraint can be written as

$$
\mathbb{E}_{s_t\sim\tau}\sum_{t=0}^{\infty}\left[\gamma^{t}\left(\log d_\pi(s_t)-\log\zeta(s_t)\right)\right]-\frac{\varepsilon_m}{1-\gamma}<0.
\tag{10}
$$
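The identity behind this constraint, namely that the KL divergence of the discounted state distribution equals a discounted sum of per-timestep expectations, can be checked numerically on a small example. The sketch below assumes a hypothetical 3-state Markov chain (the transition matrix `P`, the initial distribution `mu0` and the reference distribution `zeta` are made up for illustration), and it truncates the infinite sum at 500 steps.

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state Markov chain induced by some fixed policy (assumption).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
mu0 = np.array([1.0, 0.0, 0.0])    # initial state distribution (assumption)
zeta = np.array([0.2, 0.3, 0.5])   # reference state distribution (assumption)

# Discounted state distribution d_pi(s) = (1 - gamma) * sum_t gamma^t p(s_t = s),
# computed in closed form as (1 - gamma) * mu0 (I - gamma P)^{-1}.
d_pi = (1 - gamma) * mu0 @ np.linalg.inv(np.eye(3) - gamma * P)

# First line of the derivation: the KL divergence computed directly.
kl_direct = np.sum(d_pi * (np.log(d_pi) - np.log(zeta)))

# Last line of the derivation: -(1 - gamma) * sum_t gamma^t E_{s_t}[log zeta - log d_pi].
kl_time_sum = 0.0
p_t = mu0.copy()                   # p(s_t = .) at t = 0
for t in range(500):               # truncation of the infinite sum
    expectation = np.sum(p_t * (np.log(zeta) - np.log(d_pi)))
    kl_time_sum += -(1 - gamma) * gamma**t * expectation
    p_t = p_t @ P                  # advance the state marginal by one step

print(kl_direct, kl_time_sum)
```

The two printed numbers should agree up to floating-point and truncation error, since exchanging the sum and the integral is harmless in this finite setting.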

The optimization problem can be written in the following standard form:

$$
\begin{aligned}
\min_{\pi}~ & \mathbb{E}_{s_t,a_t\sim\tau}\sum_{t=0}^{\infty}-\gamma^{t}r\left(s_t,a_t\right) \\
\text{s.t.}~ & \mathbb{E}_{s_t\sim\tau}\sum_{t=0}^{\infty}\left[\gamma^{t}\left(\log d_\pi(s_t)-\log\zeta(s_t)\right)\right]-\frac{\varepsilon_m}{1-\gamma}<0.
\end{aligned}
\tag{11}
$$

So the Lagrangian $L$ is

$$
L=-\mathbb{E}_{s_t,a_t\sim\tau}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_t,a_t)+\lambda\log\zeta(s_t)-\lambda\log d_\pi(s_t)\right)\right]-\frac{\lambda\varepsilon_m}{1-\gamma}.
\tag{12}
$$
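For a fixed multiplier $\lambda$, Eq. (12) says that minimizing $L$ amounts to maximizing the discounted return of the augmented reward $r+\lambda\left(\log\zeta-\log d_\pi\right)$, up to a constant. The following sketch checks this algebra on a single sampled trajectory. It is illustrative only: the two function names are hypothetical, the trajectory is finite (a truncation of the infinite sum), and the rewards and log-densities are random placeholders rather than outputs of any learned model.

```python
import numpy as np

def lagrangian_from_constraint(r, log_zeta, log_d, lam, eps_m, gamma):
    """Objective of Eq. (11) plus lam times its constraint, on one trajectory."""
    disc = gamma ** np.arange(len(r))
    objective = np.sum(-disc * r)
    constraint = np.sum(disc * (log_d - log_zeta)) - eps_m / (1 - gamma)
    return objective + lam * constraint

def lagrangian_from_augmented_reward(r, log_zeta, log_d, lam, eps_m, gamma):
    """Eq. (12): negative discounted return of the augmented reward, minus a constant."""
    disc = gamma ** np.arange(len(r))
    augmented = r + lam * (log_zeta - log_d)
    return -np.sum(disc * augmented) - lam * eps_m / (1 - gamma)

# Placeholder trajectory data (assumptions, for illustration only).
rng = np.random.default_rng(0)
r = rng.normal(size=50)
log_zeta = rng.normal(size=50)
log_d = rng.normal(size=50)

lhs = lagrangian_from_constraint(r, log_zeta, log_d, lam=0.3, eps_m=0.1, gamma=0.99)
rhs = lagrangian_from_augmented_reward(r, log_zeta, log_d, lam=0.3, eps_m=0.1, gamma=0.99)
print(lhs, rhs)
```

Both forms give the same number, which is why the regularizer can be implemented as a per-step reward bonus on top of an existing RL algorithm.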

A.2 Derivations of the Forward and Backward Probabilities

The backward probability can be written as:

$$
\begin{aligned}
\beta_t(s_t)
&=\int_{\mathcal{S}}p\left(\mathcal{O}_{t:\infty}\mid s_t,s_{t+1},\pi\right)p\left(s_{t+1}\mid s_t\right)ds_{t+1} \\
&=\int_{\mathcal{S}}p\left(\mathcal{O}_{t}\mid s_t,\pi\right)p\left(\mathcal{O}_{t+1:\infty}\mid s_{t+1},\pi\right)p\left(s_{t+1}\mid s_t\right)ds_{t+1} \\
&=\int_{\mathcal{S}}\max_{a_t}\exp\left(\gamma^{t}r(s_t,a_t)\right)\beta_{t+1}(s_{t+1})\,p\left(s_{t+1}\mid s_t\right)ds_{t+1}.
\end{aligned}
\tag{13}
$$

Taking the logarithm of both sides, we have

$$
\log\beta_t(s_t)=\log\mathbb{E}_{s_{t+1}}\max_{a_t}\exp\left(\gamma^{t}r(s_t,a_t)+\log\beta_{t+1}(s_{t+1})\right).
\tag{14}
$$

Letting $W(s_t)=\log\beta_t(s_t)$, we get

$$
W(s_t)=\log\mathbb{E}_{s_{t+1}}\exp\left[\max_{a_t}\gamma^{t}r(s_t,a_t)+W(s_{t+1})\right].
\tag{15}
$$

According to [16], $W_t$ is a soft version of the traditional value function $V_t$. As the Soft Actor-Critic [8] has become the base algorithm in many scenarios, $\beta_t$ is closely related to the value function learned during training, which is often in its soft version. The forward probability $\alpha_t(s_t)=p\left(\mathcal{O}_{0:t-1}\mid s_t,\pi\right)$ is the probability of the trajectory from timestep $0$ to $t-1$ being optimal given the state $s_t$. Such a probability is hard to model, as the transition from $s_{t-1}$ to $s_t$ depends on the actual policy $\pi$ as well as the environment dynamics. Therefore, we do not take $\alpha_t(s_t)$ into account when dividing training data to train the classifier.
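The backward recursion in Eq. (15) can be illustrated on a small tabular problem. The sketch below makes several assumptions that are not in the derivation: a finite horizon with terminal condition $W_H=0$, a hypothetical reward table `r`, and a hypothetical transition matrix `P` for $p(s_{t+1}\mid s_t)$, which is action-independent as the equation is written. It also computes the "hard" counterpart that replaces $\log\mathbb{E}\exp$ with a plain expectation. By Jensen's inequality the soft quantity is never smaller.

```python
import numpy as np

def soft_backup(r, P, gamma, horizon):
    """Backward recursion of Eq. (15); W[horizon] = 0 is an assumed terminal condition.

    r: (S, A) reward table, P: (S, S) matrix with P[s, s'] = p(s' | s).
    """
    W = np.zeros((horizon + 1, r.shape[0]))
    for t in range(horizon - 1, -1, -1):
        best_r = gamma**t * r.max(axis=1)            # max_a gamma^t r(s, a)
        shift = W[t + 1].max()                       # stabilises the log-sum-exp
        W[t] = best_r + shift + np.log(P @ np.exp(W[t + 1] - shift))
    return W

def hard_backup(r, P, gamma, horizon):
    """Same recursion with a plain expectation instead of log E exp."""
    V = np.zeros((horizon + 1, r.shape[0]))
    for t in range(horizon - 1, -1, -1):
        V[t] = gamma**t * r.max(axis=1) + P @ V[t + 1]
    return V

# Hypothetical 3-state, 2-action problem (assumption).
r = np.array([[0.0, 1.0],
              [0.5, 0.2],
              [1.5, 0.0]])
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.1, 0.8]])

W = soft_backup(r, P, gamma=0.9, horizon=10)
V = hard_backup(r, P, gamma=0.9, horizon=10)
print(W[0], V[0])
```

The gap between `W[0]` and `V[0]` grows with the spread of next-state values, which is one way to read the statement that $W_t$ is a soft version of $V_t$.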

A.3 Discussions on the Surrogate for the Density Ratio

According to some off-policy RL algorithms [45, 46], the idea of training a classifier $D(s)$ as a data-based surrogate of the density ratio $\frac{\zeta(s)}{d_\pi(s)}$ can also be derived from a theorem related to the f-divergence (Lemma 1 in [46]). Such a derivation is essentially the same as our GAN-based proposition. Technically, these algorithms also divide the training data into two parts and train a classifier, which is later used to generate probabilities for prioritized sampling. Our SRPO algorithm proposes a different criterion to divide the training data, and trains a classifier that is used in reward augmentation.
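The link between a binary classifier and a density ratio can be seen in a toy example. With balanced samples from two densities $p$ and $q$, the optimal classifier is $D^*(x)=\frac{p(x)}{p(x)+q(x)}$, so $\log\frac{p(x)}{q(x)}=\log\frac{D^*(x)}{1-D^*(x)}$. The sketch below is not the SRPO training procedure: it assumes two 1-D Gaussians in place of $\zeta$ and $d_\pi$, a logistic-regression classifier, and plain gradient descent with a hand-picked step size. For these two Gaussians the true log-ratio is $x-0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x_p = rng.normal(loc=1.0, scale=1.0, size=n)   # samples from p (stand-in for zeta), label 1
x_q = rng.normal(loc=0.0, scale=1.0, size=n)   # samples from q (stand-in for d_pi), label 0
x = np.concatenate([x_p, x_q])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic-regression classifier D(x) = sigmoid(w * x + b), fitted by gradient descent
# on the binary cross-entropy loss.
w, b = 0.0, 0.0
for _ in range(2000):
    d = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.5 * np.mean((d - y) * x)
    b -= 0.5 * np.mean(d - y)

def log_ratio_estimate(s):
    """Estimate log p(s)/q(s) as log D(s) - log(1 - D(s))."""
    d = 1.0 / (1.0 + np.exp(-(w * s + b)))
    return np.log(d) - np.log1p(-d)

grid = np.array([-1.0, 0.0, 1.0, 2.0])
print(log_ratio_estimate(grid))   # estimated log-ratio
print(grid - 0.5)                 # true log-ratio for these two Gaussians
```

The estimate only approximates the true ratio where both sample sets have support, which is one reason a data-based surrogate is used in practice rather than an exact density ratio.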

A.4 Proofs to Theorems in Sec. 4

We first introduce the following lemma which is essential in proving the two theorems in Sec. 4.

Lemma A.1.

Consider two homomorphous MDPs with dynamics $T$ and $T'$. Assume that $T'\in(T,\varepsilon_m)$, that the reward function is $\lambda_1$-Lipschitz w.r.t. the action, and that the dynamics function is $\lambda_2$-inverse Lipschitz w.r.t. the action. Then we have

$$
\left|V^{*}_{T}(s)-V^{*}_{T'}(s)\right|\leqslant\frac{\lambda_1\lambda_2\varepsilon_m}{1-\gamma}
\tag{16}
$$

for all $s\in\mathcal{S}$.

Proof.

Recall that the optimal value function under dynamics $T$ satisfies

$$
V^{*}_{T}(s)=\max_{a}~r(s,a,T(s,a))+\gamma V^{*}_{T}(T(s,a)).
\tag{17}
$$

Without loss of generality, we assume $V^{*}_{T}(s)\geqslant V^{*}_{T'}(s)$ on a certain state $s$. Define $a^{*}_{T}=\pi^{*}_{T}(s)$ and $\hat{a}$ such that $T'(s,\hat{a})=T(s,a^{*}_{T})=s'$. Then we have

|T(s,aT*)T(s,a^)|𝑇𝑠subscriptsuperscript𝑎𝑇𝑇𝑠^𝑎\displaystyle\left|T(s,a^{*}_{T})-T(s,\hat{a})\right|| italic_T ( italic_s , italic_a start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) - italic_T ( italic_s , over^ start_ARG italic_a end_ARG ) | =|T(s,a^)T(s,a^)|εm,absentsuperscript𝑇𝑠^𝑎𝑇𝑠^𝑎subscript𝜀𝑚\displaystyle=\left|T^{\prime}(s,\hat{a})-T(s,\hat{a})\right|\leqslant% \varepsilon_{m},= | italic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_s , over^ start_ARG italic_a end_ARG ) - italic_T ( italic_s , over^ start_ARG italic_a end_ARG ) | ⩽ italic_ε start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , (18)

and

|r(s,aT*,s)r(s,a^,s)|λ1|aT*a^|λ1λ2εm.𝑟𝑠subscriptsuperscript𝑎𝑇superscript𝑠𝑟𝑠^𝑎superscript𝑠subscript𝜆1subscriptsuperscript𝑎𝑇^𝑎subscript𝜆1subscript𝜆2subscript𝜀𝑚\left|r(s,a^{*}_{T},s^{\prime})-r(s,\hat{a},s^{\prime})\right|\leqslant\lambda% _{1}\left|a^{*}_{T}-\hat{a}\right|\leqslant\lambda_{1}\lambda_{2}\varepsilon_{% m}.| italic_r ( italic_s , italic_a start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_r ( italic_s , over^ start_ARG italic_a end_ARG , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) | ⩽ italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | italic_a start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT - over^ start_ARG italic_a end_ARG | ⩽ italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_ε start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT . (19)

Therefore, for all $s\in\mathcal{S}$,

$$\begin{aligned}
\left|V^{*}_{T}(s)-V^{*}_{T'}(s)\right| &= V^{*}_{T}(s)-V^{*}_{T'}(s) \\
&= r(s,a^{*}_{T},s')+\gamma V^{*}_{T}(s')-\max_{a}\left[r(s,a,T'(s,a))+\gamma V^{*}_{T'}(T'(s,a))\right] \\
&\leqslant r(s,a^{*}_{T},s')+\gamma V^{*}_{T}(s')-r(s,\hat{a},s')-\gamma V^{*}_{T'}(s') \\
&\leqslant \lambda_{1}\lambda_{2}\varepsilon_{m}+\gamma\left|V^{*}_{T}(s')-V^{*}_{T'}(s')\right| \\
&\leqslant \lambda_{1}\lambda_{2}\varepsilon_{m}+\gamma\lambda_{1}\lambda_{2}\varepsilon_{m}+\gamma^{2}\left|V^{*}_{T}(s'')-V^{*}_{T'}(s'')\right| \\
&\leqslant \cdots \\
&\leqslant \frac{\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma},
\end{aligned} \tag{20}$$

which concludes the proof. ∎
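
The elided steps in the chain above can be made explicit. The following is a sketch that assumes $V^{*}_{T}$ and $V^{*}_{T'}$ are bounded (for instance because rewards are bounded, which is not stated here) and writes $s^{(n)}$ for the state reached after applying the one-step argument $n$ times; this notation is introduced only for illustration. Unrolling the recursion $n$ times gives, for every $n\geqslant 1$,

$$\left|V^{*}_{T}(s)-V^{*}_{T'}(s)\right|\leqslant\lambda_{1}\lambda_{2}\varepsilon_{m}\sum_{k=0}^{n-1}\gamma^{k}+\gamma^{n}\left|V^{*}_{T}(s^{(n)})-V^{*}_{T'}(s^{(n)})\right|.$$

Letting $n\to\infty$ with $\gamma\in[0,1)$, the last term vanishes and the geometric sum converges:

$$\lambda_{1}\lambda_{2}\varepsilon_{m}\sum_{k=0}^{\infty}\gamma^{k}=\frac{\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}.$$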

This lemma upper-bounds the discrepancy between the optimal state value functions of two homomorphous MDPs. We then apply it to prove the second theorem in Sec. 4.

Theorem A.2 (Restatement of Thm. 4.3).

Following the assumptions in Lem. A.1, if the action gap $\Delta$ satisfies $\Delta>\frac{(2-\gamma)\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}$, then for all $s\in\mathcal{S}$ we have $d^{*}_{T}(s)=d^{*}_{T'}(s)$.

Proof.

Recall that the action gap is defined as $\Delta=\min_{\theta\in\Theta}\min_{s\in\mathcal{S}}\min_{a\neq\pi_{T}^{*}(s)}\left[V_{T_{\theta}}^{*}(s)-Q_{T_{\theta}}^{*}(s,a)\right]$. Therefore, we have

$$\begin{aligned}
V^{*}_{T}(s) &\geqslant Q^{*}_{T}(s,a)+\Delta \\
&> Q^{*}_{T}(s,a)+\frac{(2-\gamma)\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}
\end{aligned} \tag{21}$$

for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ with $a\neq\pi^{*}_{T}(s)$. The same property holds for the transition function $T'$. We first show that the state transition probabilities induced by $\pi^{*}_{T}$ and $\pi^{*}_{T'}$ are the same: $p_{T}(\cdot|s,\pi^{*}_{T})=p_{T'}(\cdot|s,\pi^{*}_{T'}),~\forall s\in\mathcal{S}$. Without loss of generality, let $V^{*}_{T}(s)\geqslant V^{*}_{T'}(s)$ $(*)$. Let

$$\begin{aligned}
\bar{a} &= \operatorname*{arg\,max}_{a}\; r(s,a,T(s,a))+\gamma V_{T}^{*}(T(s,a)) \\
a' &= \operatorname*{arg\,max}_{a}\; r(s,a,T'(s,a))+\gamma V_{T'}^{*}(T'(s,a)) \\
& T'(s,\tilde{a})=T(s,\bar{a})=\bar{s},\quad T'(s,a')=s'.
\end{aligned} \tag{22}$$

According to Eq. (19), $\|\tilde{a}-\bar{a}\|\leqslant\lambda_{1}\lambda_{2}\varepsilon_{m}$. Suppose $\bar{s}\neq s'$ $(**)$. Then $\tilde{a}\neq a'=\pi^{*}_{T'}(s)$, so

$$V^{*}_{T'}(s)>Q^{*}_{T'}(s,\tilde{a})+\frac{(2-\gamma)\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}. \tag{23}$$

Meanwhile,

$$\begin{aligned}
Q^{*}_{T'}(s,\tilde{a}) &= r(s,\tilde{a},\bar{s})+\gamma V^{*}_{T'}(\bar{s}) \\
&\geqslant r(s,\bar{a},\bar{s})+\gamma V^{*}_{T'}(\bar{s})-\lambda_{1}\lambda_{2}\varepsilon_{m} \\
&\geqslant r(s,\bar{a},\bar{s})+\gamma V^{*}_{T}(\bar{s})-\lambda_{1}\lambda_{2}\varepsilon_{m}-\frac{\gamma\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma} \\
&\geqslant V^{*}_{T}(s)-\frac{(2-\gamma)\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}.
\end{aligned} \tag{24}$$

Combining Eq. (23) and Eq. (24), we get $V^{*}_{T'}(s)>V^{*}_{T}(s)$, which contradicts $(*)$. Hence the assumption $(**)$ is false, so $\bar{s}=s'$.
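
Written as a single chain, with the shorthand $c=\frac{(2-\gamma)\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}$ (introduced here only for brevity), the contradiction reads

$$V^{*}_{T'}(s)>Q^{*}_{T'}(s,\tilde{a})+c\geqslant\left(V^{*}_{T}(s)-c\right)+c=V^{*}_{T}(s),$$

where the strict inequality is Eq. (23) and the second inequality is Eq. (24).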

We then show that $d^{*}_{T}(s)=d^{*}_{T'}(s)$ for all $s\in\mathcal{S}$:

$$\begin{aligned}
&\left\|p_{T}(s_{t}=\cdot|\pi_{T}^{*})-p_{T'}(s_{t}=\cdot|\pi_{T'}^{*})\right\|_{\infty} \\
&= \left\|\sum_{s'}p_{T}(\cdot|s',\pi^{*}_{T})\,p_{T}(s_{t-1}=s'|\pi_{T}^{*})-p_{T'}(\cdot|s',\pi_{T'}^{*})\,p_{T'}(s_{t-1}=s'|\pi_{T'}^{*})\right\|_{\infty} \\
&= \left\|\sum_{s'}p_{T}(\cdot|s',\pi^{*}_{T})\left[p_{T}(s_{t-1}=s'|\pi_{T}^{*})-p_{T'}(s_{t-1}=s'|\pi_{T'}^{*})\right]\right\|_{\infty} \\
&\leqslant \left\|\sum_{s'}p_{T}(\cdot|s',\pi^{*}_{T})\left\|p_{T}(s_{t-1}=\cdot|\pi_{T}^{*})-p_{T'}(s_{t-1}=\cdot|\pi_{T'}^{*})\right\|_{\infty}\right\|_{\infty} \\
&= \left\|\left\|p_{T}(s_{t-1}=\cdot|\pi_{T}^{*})-p_{T'}(s_{t-1}=\cdot|\pi_{T'}^{*})\right\|_{\infty}\sum_{s'}p_{T}(\cdot|s',\pi^{*}_{T})\right\|_{\infty} \\
&= \left\|p_{T}(s_{t-1}=\cdot|\pi_{T}^{*})-p_{T'}(s_{t-1}=\cdot|\pi_{T'}^{*})\right\|_{\infty} \\
&\leqslant \cdots \\
&\leqslant \left\|p_{T}(s_{0}=\cdot|\pi^{*}_{T})-p_{T'}(s_{0}=\cdot|\pi_{T'}^{*})\right\|_{\infty} \\
&= 0.
\end{aligned} \tag{25}$$

Therefore, for all $s\in\mathcal{S}$, we have $p_{T}(s_{t}=s|\pi_{T}^{*})=p_{T'}(s_{t}=s|\pi_{T'}^{*})$. So

$$\left|d^{*}_{T}(s)-d^{*}_{T'}(s)\right|=\left|\sum_{t=0}^{\infty}p_{T}(s_{t}=s|\pi^{*}_{T})-p_{T'}(s_{t}=s|\pi_{T'}^{*})\right|=0 \tag{26}$$

for all $s\in\mathcal{S}$, which concludes the proof. ∎
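
One way to read the recursion over $t$ used in this proof is as an induction. The sketch below assumes deterministic transitions and deterministic optimal policies, as in Eq. (17), a countable state space, and a common initial state distribution $\rho_{0}$ for both MDPs; the map $f$ is notation introduced here for illustration. By the first part of the proof, $f(s):=T(s,\pi^{*}_{T}(s))=T'(s,\pi^{*}_{T'}(s))$ for all $s\in\mathcal{S}$, so for every state $x$

$$p_{T}(s_{t}=x|\pi^{*}_{T})=\sum_{s}\mathbb{1}[f(s)=x]\,p_{T}(s_{t-1}=s|\pi^{*}_{T}),\qquad p_{T'}(s_{t}=x|\pi^{*}_{T'})=\sum_{s}\mathbb{1}[f(s)=x]\,p_{T'}(s_{t-1}=s|\pi^{*}_{T'}).$$

If the two marginals agree at step $t-1$, the right-hand sides coincide, so the marginals also agree at step $t$. The base case is $p_{T}(s_{0}=\cdot|\pi^{*}_{T})=p_{T'}(s_{0}=\cdot|\pi^{*}_{T'})=\rho_{0}$.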

Before proving the first theorem in Sec. 4, we introduce a lemma that involves the 1-Wasserstein distance between policies. It also considers a reference policy that has the same stationary state distribution as the optimal policy under the other dynamics. Such a policy exists thanks to the homomorphous property of the MDPs.

Lemma A.3.

Following the assumptions in Lem. A.1, for any policy $\hat{\pi}$ such that $d^{\hat{\pi}}_{T}(s)=d^{*}_{T'}(s)$ for all $s\in\mathcal{S}$ and $\max_{s}W_{1}\left(\hat{\pi}(\cdot|s),\pi^{*}_{T'}(\cdot|s)\right)\leqslant\varepsilon_{\pi}$, we have

$$\left|\eta_{T}(\pi_{T}^{*})-\eta_{T}(\hat{\pi})\right|\leqslant\frac{\lambda_{1}\lambda_{2}\varepsilon_{m}+\lambda_{1}\varepsilon_{\pi}}{1-\gamma}, \tag{27}$$

where $W_{1}(\hat{\pi}(\cdot|s),\pi^{*}_{T'}(\cdot|s))$ is the 1-Wasserstein distance between the two policies.

Proof.

First, $|\eta_{T}(\pi^{*}_{T})-\eta_{T'}(\pi^{*}_{T'})|$ can be bounded with Lem. A.1:

$$\begin{aligned}
\left|\eta_{T}(\pi^{*}_{T})-\eta_{T'}(\pi^{*}_{T'})\right| &= \left|\mathbb{E}_{s\in\rho_{0}}V^{*}_{T}(s)-\mathbb{E}_{s\in\rho_{0}}V_{T'}^{*}(s)\right| \\
&\leqslant \frac{\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}.
\end{aligned} \tag{28}$$

We then bound $|\eta_{T}(\hat{\pi})-\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})|$. We first define the state-action stationary distributions $D_{1}(s,a)=d_{T}^{\hat{\pi}}(s)\hat{\pi}(a|s)$ and $D_{2}(s,a)=d^{*}_{T^{\prime}}(s)\pi^{*}_{T^{\prime}}(a|s)$. The accumulated return can be written as

$$\begin{aligned}
\eta_{T}(\hat{\pi}) &= \frac{1}{1-\gamma}\mathbb{E}_{s,a,s^{\prime}\sim D_{1}}\left[r(s,a,s^{\prime})\right], \\
\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}}) &= \frac{1}{1-\gamma}\mathbb{E}_{s,a,s^{\prime}\sim D_{2}}\left[r(s,a,s^{\prime})\right].
\end{aligned} \tag{29}$$

We start from the Lipschitz property of the reward function:

$$|r(s,a_{1},s^{\prime})-r(s,a_{2},s^{\prime})|\leqslant\lambda_{1}\|a_{1}-a_{2}\|_{1}. \tag{30}$$

Taking the expectation w.r.t. $d^{*}_{T^{\prime}}(\cdot)$ on both sides, we get

$$\mathbb{E}_{s\sim d^{*}_{T^{\prime}}}|r(s,a_{1},s^{\prime})-r(s,a_{2},s^{\prime})|\leqslant\mathbb{E}_{s\sim d^{*}_{T^{\prime}}}\lambda_{1}\|a_{1}-a_{2}\|_{1}. \tag{31}$$

Let $\mu(A_{1},A_{2}|s)$ be any joint distribution with marginals $\hat{\pi}$ and $\pi^{*}_{T^{\prime}}$ conditioned on $s$. Taking the expectation w.r.t. $\mu$ on both sides and using $d^{\hat{\pi}}_{T}=d^{*}_{T^{\prime}}$, we get

$$\begin{aligned}
\left|\mathbb{E}_{s,a\sim D_{1}}r(s,a,s^{\prime})-\mathbb{E}_{s,a\sim D_{2}}r(s,a,s^{\prime})\right| &\leqslant \mathbb{E}_{s\sim d^{*}_{T^{\prime}}}\mathbb{E}_{a_{1},a_{2}\sim\mu}|r(s,a_{1},s^{\prime})-r(s,a_{2},s^{\prime})| \\
&\leqslant \lambda_{1}\mathbb{E}_{s\sim d^{*}_{T^{\prime}}}\mathbb{E}_{\mu}\|a_{1}-a_{2}\|_{1} \\
&\leqslant \max_{s}\lambda_{1}\mathbb{E}_{\mu}\|a_{1}-a_{2}\|_{1}.
\end{aligned} \tag{32}$$

Eq. (32) holds for every joint distribution $\mu$, so it also holds for $\bar{\mu}=\operatorname*{arg\,min}_{\mu}\lambda_{1}\mathbb{E}_{\mu}\|a_{1}-a_{2}\|_{1}$, which yields the 1-Wasserstein distance:

$$\left|\mathbb{E}_{s,a\sim D_{1}}r(s,a,s^{\prime})-\mathbb{E}_{s,a\sim D_{2}}r(s,a,s^{\prime})\right|\leqslant\max_{s}\lambda_{1}W_{1}(\hat{\pi}(\cdot|s),\pi^{*}_{T^{\prime}}(\cdot|s)). \tag{33}$$

Combining Eq. (29) and Eq. (33) with the assumption $\max_{s}W_{1}(\hat{\pi}(\cdot|s),\pi^{*}_{T^{\prime}}(\cdot|s))\leqslant\varepsilon_{\pi}$, we have

$$|\eta_{T}(\hat{\pi})-\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})|\leqslant\frac{\lambda_{1}\varepsilon_{\pi}}{1-\gamma}. \tag{34}$$

Applying the triangle inequality, we get

$$\begin{aligned}
\left|\eta_{T}(\pi_{T}^{*})-\eta_{T}(\hat{\pi})\right| &\leqslant |\eta_{T}(\pi^{*}_{T})-\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})|+|\eta_{T}(\hat{\pi})-\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})| \\
&\leqslant \dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+\lambda_{1}\varepsilon_{\pi}}{1-\gamma},
\end{aligned} \tag{35}$$

which concludes the proof. ∎

We then use this lemma to prove the first theorem in Sec. 4.
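The coupling step in the proof of Lem. A.3 (Eqs. (32) and (33)) can be checked numerically on a toy case. The sketch below is illustrative only and is not part of the paper's code: it assumes one-dimensional actions in $[-1,1]$, a hypothetical reward term `reward` that is $\lambda_{1}$-Lipschitz in the action, and two hypothetical action samples standing in for $\hat{\pi}(\cdot|s)$ and $\pi^{*}_{T^{\prime}}(\cdot|s)$ at a fixed state. For equal-size one-dimensional samples, the optimal coupling pairs the sorted values, which gives the empirical 1-Wasserstein distance.

```python
import random


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal coupling matches sorted values pairwise.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def reward(action, lam1):
    # Hypothetical action-related reward term (an assumption for illustration).
    # It satisfies |r(a1) - r(a2)| <= lam1 * |a1 - a2|.
    return -lam1 * abs(action)


def clip(a):
    # Keep actions inside [-1, 1], as assumed in the analysis.
    return max(-1.0, min(1.0, a))


rng = random.Random(0)
n = 2000
lam1 = 0.1  # illustrative Lipschitz coefficient

# Hypothetical action samples from the two policies at one fixed state.
actions_hat = [clip(rng.gauss(0.2, 0.3)) for _ in range(n)]
actions_star = [clip(rng.gauss(0.0, 0.3)) for _ in range(n)]

mean_hat = sum(reward(a, lam1) for a in actions_hat) / n
mean_star = sum(reward(a, lam1) for a in actions_star) / n

gap = abs(mean_hat - mean_star)                            # left side of Eq. (33)
bound = lam1 * wasserstein_1d(actions_hat, actions_star)   # right side of Eq. (33)

print(f"reward gap = {gap:.5f}, lam1 * W1 = {bound:.5f}")
```

The inequality `gap <= bound` holds for any such pair of samples, because the sorted pairing is itself a valid coupling and the reward term is Lipschitz in the action.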

Theorem A.4 (Restatement of Thm. 4.2).

Consider two homomorphous MDPs with dynamics $T$ and $T^{\prime}$. If $T^{\prime}\in(T,\varepsilon_{m})$, then for every learning policy $\hat{\pi}$ such that $D_{\mathrm{KL}}(d^{\hat{\pi}}_{T}(\cdot)\|d^{*}_{T^{\prime}}(\cdot))\leqslant\varepsilon_{s}$, we have

$$\eta_{T}(\hat{\pi})\geqslant\eta_{T}(\pi_{T}^{*})-\dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+2\lambda_{1}+\sqrt{2}R_{\max}\sqrt{\varepsilon_{s}}}{1-\gamma}. \tag{36}$$
Proof.

In two homomorphous MDPs with dynamics $T$ and $T^{\prime}$, there exists a policy $\tilde{\pi}$ such that $d^{\tilde{\pi}}_{T}(\cdot)=d^{*}_{T^{\prime}}(\cdot)$. According to Lem. A.3, we have

$$\left|\eta_{T}(\pi_{T}^{*})-\eta_{T}(\tilde{\pi})\right|\leqslant\dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+\lambda_{1}\varepsilon_{\pi}}{1-\gamma}\leqslant\dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+2\lambda_{1}}{1-\gamma}, \tag{37}$$

where the second inequality holds because the actions are bounded in $[-1,1]$, so that $\varepsilon_{\pi}$ can be replaced by its upper bound of 2. This constant is multiplied by the Lipschitz coefficient $\lambda_{1}$, which tends to be small, so it has little influence on the bound. On the other hand, the policies $\hat{\pi}$ and $\tilde{\pi}$ have similar stationary state distributions: $D_{\mathrm{KL}}(d^{\hat{\pi}}_{T}(\cdot)\|d^{\tilde{\pi}}_{T}(\cdot))\leqslant\varepsilon_{s}$. Therefore, their performance gap can be bounded according to results in imitation learning (Lem. 6 in [47]):

$$|\eta_{T}(\hat{\pi})-\eta_{T}(\tilde{\pi})|\leqslant\frac{\sqrt{2}R_{\max}\sqrt{\varepsilon_{s}}}{1-\gamma}. \tag{38}$$

Merging Eq. (37) and Eq. (38), we obtain

$$\begin{aligned}
\eta_{T}(\pi_{T}^{*})-\eta_{T}(\hat{\pi}) &\leqslant |\eta_{T}(\hat{\pi})-\eta_{T}(\pi_{T}^{*})| \\
&\leqslant \left|\eta_{T}(\pi_{T}^{*})-\eta_{T}(\tilde{\pi})\right|+|\eta_{T}(\hat{\pi})-\eta_{T}(\tilde{\pi})| \\
&\leqslant \dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+2\lambda_{1}+\sqrt{2}R_{\max}\sqrt{\varepsilon_{s}}}{1-\gamma}.
\end{aligned} \tag{39}$$
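To get a feel for the size of the gap in Eq. (36), the bound can be evaluated directly. The following is a minimal sketch: the function name `srpo_return_gap_bound` is hypothetical, the values of $\lambda_{1}$, $\lambda_{2}$ and $R_{\max}$ are the Hopper-v2 entries of Tab. 4, and the values of $\varepsilon_{m}$, $\varepsilon_{s}$ and $\gamma$ are assumptions chosen only for illustration.

```python
import math


def srpo_return_gap_bound(lam1, lam2, eps_m, eps_s, r_max, gamma):
    """Right-hand-side gap of Eq. (36).

    Returns (lam1*lam2*eps_m + 2*lam1 + sqrt(2)*r_max*sqrt(eps_s)) / (1 - gamma).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if eps_m < 0.0 or eps_s < 0.0:
        raise ValueError("eps_m and eps_s must be non-negative")
    dynamics_term = lam1 * lam2 * eps_m                       # dynamics shift
    action_term = 2.0 * lam1                                  # bounded actions
    state_term = math.sqrt(2.0) * r_max * math.sqrt(eps_s)    # state KL term
    return (dynamics_term + action_term + state_term) / (1.0 - gamma)


# lam1, lam2, r_max: Hopper-v2 row of Tab. 4.
# eps_m, eps_s, gamma: illustrative assumptions, not values from the paper.
hopper_gap = srpo_return_gap_bound(
    lam1=0.001, lam2=3.45, eps_m=0.1, eps_s=0.01, r_max=3.80, gamma=0.99
)
print(f"upper bound on the return gap: {hopper_gap:.3f}")
```

With small $\lambda_{1}$, the first two terms are negligible and the bound is governed by the state-distribution term $\sqrt{2}R_{\max}\sqrt{\varepsilon_{s}}$.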

A.5 Discussions on the Theoretical Analysis

The Lipschitz Assumptions in Sec. 4

Regarding the reward function, the Lipschitz property implies that if $s$ and $s^{\prime}$ remain unchanged, the deviation of the reward $r$ is no larger than $\lambda_{1}$ times the deviation of the action $a$. Therefore, the Lipschitz coefficient $\lambda_{1}$ depends only on the action-related terms in the reward function. Note that different actions can exist for the same $s$ and $s^{\prime}$, since the reward function may be computed under different dynamics. Regarding the dynamics function, the Lipschitz property indicates that if the current state $s$ remains unchanged and the actions differ by at least $\varepsilon$, i.e., $\left|a_{1}-a_{2}\right|\geqslant\varepsilon$, the next states will exhibit a significant difference: $\left|s_{1}^{\prime}-s_{2}^{\prime}\right|\geqslant\frac{\varepsilon}{\lambda_{2}}$.

Table 4: $\lambda_{1}$, $\lambda_{2}$, and $R_{\max}$ in practical environments.

| Environment | Action-related Reward | $\lambda_{1}$ | $\lambda_{2}$ | $R_{\max}$ |
| --- | --- | --- | --- | --- |
| CartPole-v0 | 0 | 0 | 1.42 | 1.00 |
| InvertedPendulum-v2 | 0 | 0 | 8.58 | 1.00 |
| Swimmer-v2 | $-0.0001\lvert a\rvert_{2}^{2}$ | 0.0001 | 2.59 | 0.36 |
| HalfCheetah-v2 | $-0.1\lvert a\rvert_{2}^{2}$ | 0.1 | 1.01 | 4.80 |
| Hopper-v2 | $-0.001\lvert a\rvert_{2}^{2}$ | 0.001 | 3.45 | 3.80 |
| Walker2d-v2 | $-0.001\lvert a\rvert_{2}^{2}$ | 0.001 | 4.70 | $\geqslant 4$ |
| Ant-v2 | $-0.5\lvert a\rvert_{2}^{2}$ | 0.5 | 0.69 | 6.0 |
| Humanoid-v2 | $-0.1\lvert a\rvert_{2}^{2}$ | 0.1 | 0.03 | $\geqslant 8$ |

In Tab. 4, we list the action-related terms of the reward functions for various RL evaluation environments, along with the corresponding values of $\lambda_{1}$ derived from these terms. Additionally, we sample 50,000 $(s,a,s^{\prime})$ tuples from the replay buffer, slightly modify the action, and observe how the resulting next state $s^{\prime}$ changes. The replay buffer contains trajectories collected during different training phases and should be diverse enough to cover most possible trajectories. This empirical analysis allows us to calculate $\lambda_{2}$ in practice. As indicated in the table, the action-related terms in the reward functions have reasonably small coefficients in all environments, leading to small $\lambda_{1}$ values. Combined with the medium values of $\lambda_{2}$, it can be inferred that the Lipschitz terms, including $\lambda_{1}$ and $\lambda_{1}\lambda_{2}$, will remain small in practical scenarios and will not dominate the error term in Eq. (7). Also, the action gap assumption in Thm. 4.3 (lines 258-259) is not strong and holds in many situations.
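The perturbation procedure described above can be sketched as follows. This is an illustrative reading rather than the authors' script: `toy_step` is a hypothetical deterministic dynamics function standing in for a simulator reset to state $s$, the noise scale is an assumption, and $\lambda_{2}$ is taken as the largest observed ratio $\|a_{1}-a_{2}\|_{1}/\|s_{1}^{\prime}-s_{2}^{\prime}\|_{1}$, which is one way to read the Lipschitz property of the dynamics stated in this section. The paper does not specify how the ratios are aggregated.

```python
import random


def toy_step(state, action, gain=0.5):
    # Hypothetical dynamics s' = s + gain * a, used only for illustration.
    return [s + gain * a for s, a in zip(state, action)]


def l1_distance(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))


def estimate_lambda2(step_fn, samples, noise=1e-2, seed=0, min_shift=1e-6):
    """Estimate lambda_2 as the largest ratio |a1 - a2| / |s1' - s2'|.

    samples: iterable of (state, action) pairs, e.g. drawn from a replay buffer.
    Each action is perturbed slightly and both actions are replayed from the
    same state through step_fn.
    """
    rng = random.Random(seed)
    worst_ratio = 0.0
    for state, action in samples:
        perturbed = [max(-1.0, min(1.0, a + rng.gauss(0.0, noise))) for a in action]
        action_shift = l1_distance(action, perturbed)
        next_shift = l1_distance(step_fn(state, action), step_fn(state, perturbed))
        if next_shift > min_shift:  # skip degenerate pairs to avoid dividing by ~0
            worst_ratio = max(worst_ratio, action_shift / next_shift)
    return worst_ratio


rng = random.Random(1)
buffer_samples = [
    ([rng.uniform(-1.0, 1.0) for _ in range(2)],
     [rng.uniform(-1.0, 1.0) for _ in range(2)])
    for _ in range(500)
]
lambda2_hat = estimate_lambda2(toy_step, buffer_samples)
print(f"estimated lambda_2: {lambda2_hat:.4f}")
```

For the toy dynamics, the next-state shift is `gain` times the action shift, so the estimate recovers `1 / gain`. With a real simulator, `step_fn` would need to restore the simulator to the sampled state before applying each action.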

Failure Cases

Although the assumptions are weak and hold in many situations, there are certain scenarios in which they do not hold and the performance of SRPO cannot be guaranteed. For example, in maze environments with different obstacle layouts, the requirement of homomorphous MDPs is violated. There are also cases where the Lipschitz coefficients $\lambda_{1},\lambda_{2}$ can be large, such as stock markets with very high transaction fees.

The Assumption on Dynamics Discrepancy

We mentioned in Sec. 4 that one of the assumptions used to prove the theorems is that $T\in(T^{\prime},\varepsilon_{m})$. In fact, this is a simplification of the actual requirement, which is weaker than a uniform bound on the dynamics shift. According to Eq. (18), for any state $s$ we only require $T(s,a)$ and $T^{\prime}(s,a)$ to be close on one specific action $\hat{a}$ such that $T^{\prime}(s,\hat{a})=T\left(s,a_{T}^{*}\right)=s^{\prime}$. This is a point-wise bound on the dynamics shift and is comparable to the assumptions in previous analysis [48].

The Tightness of Eq. (7)

Eq. (7) has a form similar to Eq. (1) in Thm. 4.1 of [49], where the return discrepancy $\left|\eta_{T}(\pi_{T}^{*})-\eta_{T}(\hat{\pi})\right|$ is also bounded by differences in the policy distribution and transition functions, with an order of two in the effective horizon (i.e., with a coefficient $\frac{1}{(1-\gamma)^{2}}$). By introducing some assumptions and constraining the policy to have the same stationary state distribution, we obtain a tighter discrepancy bound with an order of one in the effective horizon (i.e., with a coefficient $\frac{1}{1-\gamma}$).
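The difference between the two orders can be made concrete with a small calculation. The helper below is a hypothetical illustration that only evaluates the two coefficients named above for a given discount factor; the discount values are illustrative choices.

```python
def horizon_coefficients(gamma):
    """Return (1 / (1 - gamma), 1 / (1 - gamma)**2) for a discount factor gamma."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    first_order = 1.0 / (1.0 - gamma)
    return first_order, first_order * first_order


for gamma in (0.9, 0.99):
    first_order, second_order = horizon_coefficients(gamma)
    # The ratio of the two coefficients is the effective horizon itself.
    print(f"gamma={gamma}: order one ~ {first_order:.0f}, order two ~ {second_order:.0f}")
```

The second-order coefficient exceeds the first-order one by a factor of the effective horizon $\frac{1}{1-\gamma}$, which is roughly 100 for $\gamma=0.99$.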

Appendix B Experiment Details

B.1 Setup

To generate environments with different transition functions, we alter the XML file of the MuJoCo simulator and change its environment parameters. In online experiments, we build our code on the GitHub repository of CaDM [9] (https://github.com/younggyoseo/CaDM/tree/master). Some customized MuJoCo environments are defined in this repository. Their reward functions differ from those of the original environments. We keep these modifications to make our online results comparable with the original CaDM algorithm. In offline experiments, we build our code on the GitHub repository https://github.com/polixir/OfflineRL. The offline datasets are generated by concatenating the data sampled in the original MuJoCo simulator with data from simulators whose gravity and medium density are altered. In both online and offline experiments, the evaluation is done in online static environments with all possible values of the environment parameters. The average of these evaluation results is reported.
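Altering the simulator XML can be sketched as below. This is a minimal illustration and not the code from either repository: the model string `BASE_XML` and the helper `with_dynamics` are hypothetical, and the sketch assumes that gravity and medium density are set through the `gravity` and `density` attributes of the MJCF `<option>` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal MJCF model used only for illustration.
BASE_XML = """
<mujoco model="toy">
  <option timestep="0.002" gravity="0 0 -9.81" density="0"/>
  <worldbody/>
</mujoco>
""".strip()


def with_dynamics(xml_text, gravity_scale=1.0, density=None):
    """Return a copy of an MJCF string with scaled gravity and a new density."""
    root = ET.fromstring(xml_text)
    option = root.find("option")
    if option is None:
        # Create the element if the model relies on simulator defaults.
        option = ET.SubElement(root, "option")
    gravity = [float(v) for v in option.get("gravity", "0 0 -9.81").split()]
    # The "g" format keeps six significant digits, enough for this sketch.
    option.set("gravity", " ".join(f"{g * gravity_scale:g}" for g in gravity))
    if density is not None:
        option.set("density", f"{density:g}")
    return ET.tostring(root, encoding="unicode")


# Build one variant per gravity scale; the scales are illustrative assumptions.
variants = {scale: with_dynamics(BASE_XML, gravity_scale=scale) for scale in (0.5, 1.0, 1.5)}
print(variants[1.5])
```

Each returned string can be written to a file and loaded as a separate environment, so that data collected under the different variants forms a dataset with dynamics shift.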

Figure 6: Detailed results of online experiments on MuJoCo tasks. In environments with single dynamics, the three algorithms perform similarly.
Table 5: Detailed results of ablation studies in offline experiments.

| Task | MAPLE+SRPO | Behavior Regularization | Random Partition | Fixed $\lambda=0.1$ | Fixed $\lambda=0.3$ |
| --- | --- | --- | --- | --- | --- |
| Walker2d-medium-expert | 0.66 ± 0.08 | 0.70 ± 0.18 | 0.42 ± 0.16 | 0.66 ± 0.08 | 0.38 ± 0.16 |
| Walker2d-medium | 0.84 ± 0.03 | 0.71 ± 0.02 | 0.79 ± 0.00 | 0.72 ± 0.13 | 0.84 ± 0.03 |
| Walker2d-medium-replay | 0.17 ± 0.02 | 0.16 ± 0.01 | 0.14 ± 0.01 | 0.17 ± 0.02 | 0.16 ± 0.01 |
| Walker2d-random | 0.22 ± 0.00 | 0.22 ± 0.00 | 0.22 ± 0.00 | 0.22 ± 0.00 | 0.22 ± 0.00 |
| Hopper-medium-expert | 0.98 ± 0.02 | 0.85 ± 0.25 | 0.46 ± 0.14 | 0.98 ± 0.02 | 0.86 ± 0.18 |
| Hopper-medium | 1.03 ± 0.01 | 0.78 ± 0.26 | 0.76 ± 0.21 | 0.53 ± 0.13 | 1.03 ± 0.01 |
| Hopper-medium-replay | 1.02 ± 0.01 | 0.94 ± 0.04 | 0.91 ± 0.08 | 1.02 ± 0.01 | 0.93 ± 0.03 |
| Hopper-random | 0.32 ± 0.02 | 0.13 ± 0.01 | 0.12 ± 0.01 | 0.13 ± 0.01 | 0.32 ± 0.02 |
| HalfCheetah-medium-expert | 0.63 ± 0.01 | 0.65 ± 0.01 | 0.44 ± 0.18 | 0.63 ± 0.01 | 0.52 ± 0.00 |
| HalfCheetah-medium | 0.63 ± 0.01 | 0.60 ± 0.00 | 0.62 ± 0.02 | 0.61 ± 0.02 | 0.63 ± 0.01 |
| HalfCheetah-medium-replay | 0.55 ± 0.00 | 0.54 ± 0.00 | 0.54 ± 0.01 | 0.55 ± 0.00 | 0.24 ± 0.01 |
| HalfCheetah-random | 0.24 ± 0.01 | 0.21 ± 0.03 | 0.20 ± 0.01 | 0.24 ± 0.01 | 0.23 ± 0.01 |
| Average | 0.61 | 0.54 | 0.47 | 0.54 | 0.53 |
Table 6: Results of offline experiments with a small dataset.

| Task | MOPO | MAPLE | MAPLE+DARA | MAPLE+SRPO (ours) |
| --- | --- | --- | --- | --- |
| Walker2d-medium | $0.21 \pm 0.13$ | $0.45 \pm 0.18$ | $0.74 \pm 0.12$ | $\mathbf{0.79} \pm 0.04$ |
| Walker2d-medium-expert | $0.14 \pm 0.06$ | $0.26 \pm 0.01$ | $0.38 \pm 0.03$ | $\mathbf{0.61} \pm 0.11$ |
| Hopper-medium | $0.01 \pm 0.00$ | $0.42 \pm 0.36$ | $0.36 \pm 0.06$ | $\mathbf{0.51} \pm 0.14$ |
| Hopper-medium-expert | $0.01 \pm 0.00$ | $0.33 \pm 0.09$ | $0.16 \pm 0.04$ | $\mathbf{0.40} \pm 0.06$ |
| HalfCheetah-medium | $0.10 \pm 0.01$ | $0.50 \pm 0.06$ | $0.37 \pm 0.01$ | $\mathbf{0.55} \pm 0.03$ |
| HalfCheetah-medium-expert | $-0.03 \pm 0.00$ | $0.35 \pm 0.01$ | $\mathbf{0.63} \pm 0.03$ | $0.62 \pm 0.19$ |
| Average | 0.07 | 0.39 | 0.44 | $\mathbf{0.58}$ |

B.2 Additional Results

We show full results of online experiments on MuJoCo tasks in Fig. 6. Experiments on environments with single dynamics are included; these are equivalent to experiments on static environments. PPO, CaDM and CaDM+SRPO perform similarly in these tasks. Full results of ablation studies are shown in Tab. 5. We also reduce the amount of offline data to 1/3 and perform additional experiments, with results shown in Tab. 6. MAPLE+SRPO still achieves better performance than the baseline algorithms: it improves the average performance by 31% over MAPLE+DARA and 49% over MAPLE. This evidence indicates that SRPO indeed enables efficient data reuse, in accordance with the statements in the introduction.
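As a quick check of the relative gains quoted above, the following sketch recomputes them from the rounded average scores in Tab. 6. The function name `relative_improvement` is illustrative, and using the rounded two-decimal averages as inputs is an assumption; the original computation may have used unrounded scores.

```python
def relative_improvement(new, baseline):
    """Return the relative gain (new - baseline) / baseline as a fraction."""
    return (new - baseline) / baseline


# Average normalized scores copied from the last row of Tab. 6 (rounded values).
avg = {"MOPO": 0.07, "MAPLE": 0.39, "MAPLE+DARA": 0.44, "MAPLE+SRPO": 0.58}

for name in ("MAPLE+DARA", "MAPLE"):
    gain = relative_improvement(avg["MAPLE+SRPO"], avg[name])
    print(f"MAPLE+SRPO vs {name}: {100 * gain:.1f}%")
```

With these rounded inputs the gains come out at roughly 32% and 49%, in line with the figures in the text up to rounding.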

B.3 Additional Analysis

As a further illustration of the intuition in Sec. 3.1, we train two policies in the Pendulum environment with 0.5 and 2 times the original friction and then visualize the state and action densities. The results in Fig. 7 are similar to those of the experiments that alter the environment gravity: we observe similar state distributions and different peaks in the action distributions.

Figure 7: Visualization of state and action densities in data sampled from the Inverted Pendulum environment with 0.5 and 2 times the original friction. Under both frictions, the state distribution has high density at low pendulum speed and small pendulum angle. Meanwhile, the action distribution has different density peaks under different frictions.

Across the different offline RL tasks, MAPLE+SRPO gains the most in the Hopper environment, where it outperforms all baseline methods in all 4 tasks. In the Walker2d and HalfCheetah environments, however, MAPLE+SRPO outperforms the baselines in only half of the tasks. This difference results from the existence of multiple optimal policies that lead to different stationary state distributions [50, 51]. For example, the agent in the Walker2d environment has many ways to swing its arms to keep balance. When the policy pattern in the offline dataset differs from that of the learned policy, its stationary state distribution may not be a good regularizer. The Hopper agent has fewer degrees of freedom than the other two, so the policy benefits more from regularizing with SRPO.