State Regularized Policy Optimization
on Data with Dynamics Shift

Zhenghai Xue$^{1}$, Qingpeng Cai$^{2}$, Shuchang Liu$^{2}$, Dong Zheng$^{2}$, Peng Jiang$^{2}$, Kun Gai$^{3}$, Bo An$^{1}$

$^{1}$ Nanyang Technological University, Singapore
$^{2}$ Kuaishou Technology
$^{3}$ Unaffiliated
[email protected] [email protected] [email protected]
{caiqingpeng,liushuchang,zhengdong,jiangpeng}@kuaishou.com

Abstract

In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., with different underlying environment dynamics. Most current methods address this issue by training context encoders to identify environment parameters. Data with dynamics shift are separated according to their environment parameters to train the corresponding policy. However, these methods can be sample inefficient as data are used ad hoc, and policies trained for one dynamics cannot benefit from data collected in all other environments with different dynamics. In this paper, we find that in many environments with similar structures and different dynamics, optimal policies have similar stationary state distributions. We exploit this property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. This distribution is used to regularize the policy trained in a new environment, leading to the SRPO (State Regularized Policy Optimization) algorithm. To conduct theoretical analyses, the intuition of similar environment structures is characterized by the notion of homomorphous MDPs. We then demonstrate a lower-bound performance guarantee on policies regularized by the stationary state distribution. In practice, SRPO can be an add-on module to context-based algorithms in both online and offline RL settings. Experimental results show that SRPO can make several context-based algorithms far more data efficient and significantly improve their overall performance.

1 Introduction

Reinforcement Learning (RL) has achieved great success in solving challenging sequential decision-making problems [1, 2]. Unfortunately, existing RL methods usually assume that agents are trained and evaluated in exactly the same environment, which is often not the case in real-world applications where environment dynamics can vary a lot. For example, the recommendation engine of social apps may need to deal with time-varying and heterogeneous user preferences [3, 4]. A robot arm may operate in different scenarios with different joint frictions and medium densities [5]. In these cases, the agent has to work with the trajectory data from different environment dynamics, i.e., data with dynamics shift, which will bias the learning process and lead to poor performance. In fact, some empirical studies [6, 5] demonstrate that general RL algorithms [7, 8] can easily be misled by different environment dynamics and fail to train a good policy.

In recent years, considerable research efforts have been devoted to addressing the dynamics shift and learning generalizable policies for environments with changing dynamics. One common practice is to train a context encoder [9, 10, 11] to associate the environment dynamics with a latent variable. The policy is then trained with the latent variable as an additional input [12]. One issue with this practice is that policies conditioned on a specific latent variable can only learn from data collected in the environment corresponding to that latent variable. In other words, data with different dynamics are used in an ad hoc manner. The generalizability of context encoders relies on the expressive power of neural networks. However, neural networks are prone to overfit and behave poorly when extrapolating. As an example, we benchmarked CaDM [9], which is one of the context-based algorithms, under Ant environments with different gravities and display the results in Fig. 1. Although it can outperform PPO [7] due to its adaptability from context encoders, CaDM fails to constantly improve its performance with more data from different environment dynamics. To mitigate the problem of inefficient data use, there are some attempts that leverage Importance Sampling (IS) [13, 5, 14]. Given the dynamics of the target environment, samples from the source environments are assigned with larger importance weights if they are more likely to happen in the target environment and vice versa. Compared to training context encoders, IS-based methods manage to proactively exploit the data from other dynamics. However, such methods require prior knowledge about the dynamics of the target environment. Also, it is notoriously hard to balance the bias and variance when calculating the IS weights.

Figure 1: Performance comparison of PPO [7], CaDM [9] and CaDM+SRPO in the Ant environment, where SRPO is our proposed state regularized policy optimization method. Details of the experiment setup are in Sec. 5.1.

This paper proposes a new RL paradigm that can explicitly leverage data with dynamics shift. It is also free of the aforementioned drawbacks of IS-based methods. We find that the stationary state distribution induced by optimal policies (later termed optimal state distribution) is similar across a set of environments with similar structures and different environment dynamics. For example, given heterogeneous preferences of users, a video recommendation system may choose different videos to recommend, but the optimal states are the same: users keep pressing the “like” or “save” button and continue watching for a long time. More concretely, the optimal state distribution in one environment dynamics can be informative for training policies in all other different dynamics. We therefore propose a constrained policy optimization (CPO) [15] formulation that requires the policy not only to optimize the cumulative return, but also to generate a stationary state distribution close to the optimal state distribution. By relating optimality to high-reward states [16], we are able to approximate the optimal state distribution from trajectory data regardless of the underlying dynamics, providing a unified and efficient approach to exploiting these data.

Summarizing these ideas, we propose the SRPO (State Regularized Policy Optimization) algorithm. SRPO works as an add-on module in both online and offline context-based RL algorithms such as CaDM [9] and MAPLE [12] to increase their sample efficiency, leading to the CaDM+SRPO and MAPLE+SRPO algorithms. We provide a lower-bound performance guarantee on policies in one dynamics regularized by the optimal state distribution in other dynamics. This theoretically demonstrates the effectiveness of the SRPO algorithm in using data with dynamics shift. Empirical results in both online and offline settings show that SRPO can significantly improve both the data efficiency and the overall performance of several state-of-the-art context-based RL algorithms. We also perform ablation studies to demonstrate the effectiveness of each component in the SRPO algorithm.

2 Background

2.1 Preliminaries

A Markov Decision Process (MDP) can be defined by a tuple $(\mathcal{S},\mathcal{A},T,r,\gamma,\rho_{0})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the bounded action space with actions $a\in(-1,1)$, and $T(s^{\prime}|s,a)\in[0,1]$ and $r(s,a,s^{\prime})\in[-R_{\max},R_{\max}]$ are the transition and reward functions. $\gamma\in(0,1)$ is the discount factor and $\rho_{0}(s)$ is the initial state distribution. In MDPs with deterministic transitions, we denote $T(s,a)$ as the transition function with a slight abuse of notation, and $(T,\varepsilon)$ as $\{T^{\prime}\mid|T(s,a)-T^{\prime}(s,a)|<\varepsilon,~\forall s\in\mathcal{S},a\in\mathcal{A}\}$, which is the $\varepsilon$-neighbourhood of $T$. RL aims at maximizing the accumulated return of policy $\pi$: $\eta_{T}(\pi)=E_{\pi,T}\left[\sum\limits_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\right]$, where the expectation is computed with $s_{0}\sim\rho_{0}$, $a_{t}\sim\pi(\cdot|s_{t})$, and $s_{t+1}\sim T(\cdot|s_{t},a_{t})$. The optimal policy $\pi^{*}$ is defined as $\pi_{T}^{*}=\operatorname*{arg\,max}\limits_{\pi}\eta_{T}(\pi)$. In an MDP with a policy $\pi$, the Q-value $Q_{T}^{\pi}(s,a)$ denotes the expected return after taking action $a$ at state $s$: $Q_{T}^{\pi}(s,a)=E_{\pi,T}\left[\sum\limits_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}=s,a_{0}=a\right]$. The value function is defined as $V_{T}^{\pi}(s)=\mathbb{E}_{a\sim\pi(\cdot|s)}Q_{T}^{\pi}(s,a)$, with $V^{*}_{T}(s)$ being the shorthand for $V^{\pi_{T}^{*}}_{T}(s)$. It satisfies the optimal Bellman Equation $V^{*}_{T}(s)=\max\limits_{a}~r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim T(\cdot|s,a)}V^{*}_{T}(s^{\prime})$. We can also define the stationary state distribution (also known as state occupation function) as $d_{T}^{\pi}(s):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}P_{T}\left(s_{t}=s\mid\pi\right)$, with $d^{*}_{T}(s)$ being the shorthand for $d^{\pi_{T}^{*}}_{T}(s)$.
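To make the definition of $d_{T}^{\pi}$ concrete, the following minimal sketch approximates the discounted stationary state distribution of a small tabular MDP by truncating the infinite sum. The two-state transition table, the policy, and the function name `stationary_state_distribution` are hypothetical and serve only as an illustration.

```python
def stationary_state_distribution(P, pi, rho0, gamma, horizon=2000):
    """Approximate d(s) = (1 - gamma) * sum_t gamma^t * P(s_t = s | pi).

    P[s][a][s2] is the transition probability T(s2 | s, a),
    pi[s][a] is the policy probability pi(a | s),
    rho0[s] is the initial state distribution.
    The infinite sum is truncated after `horizon` steps.
    """
    n_states = len(rho0)
    p_t = list(rho0)  # distribution of s_t, starting at t = 0
    d = [0.0] * n_states
    discount = 1.0
    for _ in range(horizon):
        for s in range(n_states):
            d[s] += (1.0 - gamma) * discount * p_t[s]
        # Push the state distribution one step forward under pi and T.
        nxt = [0.0] * n_states
        for s in range(n_states):
            for a, pa in enumerate(pi[s]):
                for s2 in range(n_states):
                    nxt[s2] += p_t[s] * pa * P[s][a][s2]
        p_t = nxt
        discount *= gamma
    return d


# Hypothetical two-state, two-action MDP (illustrative numbers only).
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],  # transitions from state 1 under actions 0 and 1
]
pi = [[0.5, 0.5], [0.0, 1.0]]  # pi(a | s)
rho0 = [1.0, 0.0]
d = stationary_state_distribution(P, pi, rho0, gamma=0.9)
```

Because of the $(1-\gamma)$ normalization, the entries of `d` sum to one up to the truncation error, so `d` can be compared across dynamics as a proper distribution.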

The Hidden Parameter Markov Decision Process (HiP-MDP) captures a class of MDPs with different transition functions and the same reward function by introducing a set of hidden parameters. Specifically, an HiP-MDP is defined by a tuple $(\mathcal{S},\mathcal{A},\Theta,T,r,\gamma,\rho_{0})$, where $\Theta$ is the space of hidden parameters. The transition function $T_{\theta}(s^{\prime}|s,a,\theta)$ is parameterized not only by states and actions, but also by a hidden parameter $\theta$ sampled from $\Theta$. The action gap of an HiP-MDP is defined as $\Delta=\min\limits_{\theta\in\Theta}\min\limits_{s\in\mathcal{S}}\min\limits_{a\neq\pi^{*}(s)}V_{T_{\theta}}^{*}(s)-Q_{T_{\theta}}^{*}(s,a)$, which reflects the minimum gap between an optimal action and all other sub-optimal actions.

2.2 Related Work

MDPs with Different Dynamics

The setting of HiP-MDP [17] was proposed to model a set of variations in the environment dynamics. The problem has been intensively investigated in recent years [18], and this line of research falls into three categories, i.e., encoder-based, Importance Sampling (IS)-based and meta-RL-based algorithms. Encoder-based methods extract the hidden parameters from trajectories with variational inference [19] or auxiliary loss [9]. These hidden parameters are used as inputs to the transition function [9] or policy network [20, 12]. Unfortunately, these methods train dynamics-specific policies from the trajectory data of each hidden parameter independently, which leads to poor sample efficiency. Instead, our method uses the data from all dynamics to learn an optimal state distribution that facilitates the policy learning. IS-based methods compute the importance ratio between transition probabilities under different dynamics and modify the replay buffer [13, 14, 5] according to the transition probabilities in the test environments, which are often not available in real-world scenarios. Finally, meta-RL algorithms [21, 22] can adapt to environments with new dynamics through fine-tuning on a small amount of data from the test environment. In contrast, our method can be directly applied to new environment dynamics by making a zero-shot transfer.

Behavior Regularized Methods

The idea of constrained policy optimization (CPO) [15] is widely used in RL. Most research focuses on behavior regularized methods, i.e., adding policy constraints based on another policy distribution, as shown in the following optimization problem:

$$\max_{\pi}~\mathbb{E}_{s,a\sim\mathcal{D}}\left[\mathbb{E}_{a^{\prime}\sim\pi(\cdot|s)}Q(s,a^{\prime})\right]\qquad\text{s.t.}~~\mathbb{E}_{s\sim\mathcal{D}}\left[\hat{D}(\pi(\cdot|s)\|\hat{\pi}(\cdot|s))\right]<\varepsilon, \tag{1}$$

where $\mathcal{D}$ is the replay buffer, $\hat{\pi}$ is the regularizing policy, and $\hat{D}$ is a certain distance measure. Maximum-Entropy RL [23, 8] can be considered as CPO with a uniform policy distribution. The sparse action tasks [24] can be solved by CPO with a sparse policy distribution. Besides, many offline RL algorithms [25, 26, 27] are based on the idea of constraining the current policy distribution to be close to the dataset’s policy distribution. However, data sampled from environments with different dynamics can have distinct optimal policies, as illustrated in Sec. 3.1. In such cases, $\hat{\pi}$ in Eq. (1) may include policies that do not match the current environment, and can therefore be misleading. So behavior regularization in Eq. (1) would fail on data with dynamics shift. Differently, our proposed method is based on state regularization, which is more suitable when learning from data with dynamics shift.

Leveraging stationary state distributions

The stationary state distribution $d^{\pi}_{T}(s)$ of policy $\pi$ and dynamics $T$ is an important feature that can measure the differences in policies and transition functions. It has already been exploited in many studies. In Off-Policy RL, Islam et al. [28] estimate the stationary state distributions of both the current policy and the mixed buffer policy. They then compute the off-policy policy gradient with the constraint that the two distributions should be close. Some Off-Policy Policy Evaluation (OPE) algorithms [29, 30] use the steady-state property of stationary Markov processes to estimate the stationary state distributions. In Imitation Learning (IL), state-only IL algorithms [31, 32] require the stationary state distribution of the current policy to be close to that of the expert policy. In Inverse RL (IRL), [33] learns a stationary reward function by computing the gradient of the distance between agent and expert state distribution w.r.t. reward parameters. In Offline RL, [34] requires the stationary state distribution of the learning policy and the behavior policy to be close and performs conservative updates. The use of such distributions in our paper is similar to some studies on sim-to-real [35, 36]. They propose to match the next state distribution in the imperfect simulator and the real environment with an inverse dynamics model. They implicitly rely on the idea that the same state distribution should generate similar returns in environments with different dynamics. We formulate the idea in this paper with theorems and quantitatively analyse such similarity in various conditions.

3 State Regularized Policy Optimization

In this section, we first give motivating examples on why the optimal state distribution in one environment dynamics can be informative in all other different dynamics. A constrained policy optimization formulation is then proposed in Sec. 3.2 based on the optimal state distribution. Solving this optimization problem gives rise to our State Regularized Policy Optimization (SRPO) algorithm that can leverage data with dynamics shift to improve the policy performance.

3.1 Motivating Example

The key intuition behind SRPO is that the optimal state distribution is similar across environments. Consider an example of the Inverted Pendulum environment in Fig. 2. We train two policies in the environments with gravities of 5 and 10 until convergence. Then the kernel density estimation [37] technique is employed to estimate the state and action density of the data collected by the two policies in different areas. It can be observed from the figure that collected data have the same high state density region with low pendulum speed and small pendulum angle, while the action distribution has different density peaks. It demonstrates that the state distribution of data generated by the optimal policy can be similar regardless of the environment dynamics, and therefore can serve as a reference distribution to regularize the training policy in environments with new dynamics. More demonstrating examples can be found at Appendix B.3.

3.2 State Regularized Policy Optimization

Based on the intuition of informative optimal state distribution, we develop a novel technique that regulates RL algorithms to generate a stationary state distribution that is close to the optimal one. Specifically, we propose the following constrained policy optimization formulation:

$$\max_{\pi}~\mathbb{E}_{s_{t},a_{t}\sim\tau_{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r\left(s_{t},a_{t}\right)\right]\qquad\text{s.t.}~~D_{\mathrm{KL}}\left(d_{\pi}(\cdot)\|\zeta(\cdot)\right)<\varepsilon, \tag{2}$$

where $\zeta(s)$ is the optimal state distribution in other environment dynamics. By introducing the stationary state distribution, the optimization problem defined in Eq. (2) extends the regularization of in-distribution data to data with distribution shift. A similar form of Eq. (2) (See Eq. 1 in Sec. 2.2) is employed in Offline RL algorithms to ensure conservative policy updates. But it restricts the training data to be sampled from the same environment.

We solve Eq. (2) by casting it to the following unconstrained optimization problem via Lagrange multipliers:

$$L=-\mathbb{E}_{s_{t},a_{t}\sim\tau}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_{t},a_{t})+\lambda\log\frac{\zeta(s_{t})}{d_{\pi}(s_{t})}\right)\right]-\frac{\lambda\varepsilon}{1-\gamma}, \tag{3}$$

where $\lambda>0$ is the Lagrangian Multiplier. The detailed derivations of the Lagrangian can be found in Appendix A.1. It is noteworthy that in addition to the multiplier term, the only difference of Eq. (3) and the reward-maximization objective of RL is that the logarithm of probability density ratio $\lambda\log\frac{\zeta(s_{t})}{d_{\pi}(s_{t})}$ is added to the reward term $r(s_{t},a_{t})$ . Therefore, one can easily apply our scheme to a wide range of RL algorithms by augmenting the reward function with the density ratio.
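For readers who want the intermediate step, one way to arrive at Eq. (3) from Eq. (2) is sketched below. It is a reconstruction that assumes the multiplier is rescaled by $\frac{1}{1-\gamma}$ and uses the definition of the stationary state distribution in Sec. 2.1; the authoritative derivation remains the one in Appendix A.1.

```latex
% Expectations under d_pi equal normalized discounted sums along trajectories:
\mathbb{E}_{s\sim d_{\pi}}\left[f(s)\right]
  = (1-\gamma)\,\mathbb{E}_{\tau}\left[\sum_{t=0}^{\infty}\gamma^{t}f(s_{t})\right].

% Apply this to the KL constraint with f(s) = \log\frac{d_{\pi}(s)}{\zeta(s)}:
D_{\mathrm{KL}}\left(d_{\pi}\|\zeta\right)
  = \mathbb{E}_{s\sim d_{\pi}}\left[\log\frac{d_{\pi}(s)}{\zeta(s)}\right]
  = -(1-\gamma)\,\mathbb{E}_{\tau}\left[\sum_{t=0}^{\infty}\gamma^{t}\log\frac{\zeta(s_{t})}{d_{\pi}(s_{t})}\right].

% Lagrangian of the minimization form of Eq. (2), with multiplier \lambda/(1-\gamma):
L = -\mathbb{E}_{\tau}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\right]
    + \frac{\lambda}{1-\gamma}\left(D_{\mathrm{KL}}\left(d_{\pi}\|\zeta\right)-\varepsilon\right)
  = -\mathbb{E}_{\tau}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_{t},a_{t})
    + \lambda\log\frac{\zeta(s_{t})}{d_{\pi}(s_{t})}\right)\right]
    - \frac{\lambda\varepsilon}{1-\gamma}.
```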

3.3 Data-based Surrogate of the Density Ratio

The main challenge in solving Eq. (3) is to compute the density ratio $\frac{\zeta(s)}{d_{\pi}(s)}$ because obtaining the optimal state distribution during online training or given suboptimal offline dataset is infeasible. Also, $d_{\pi}(s)$ is intractable if the state space is continuous. Motivated by recent advances in adversarial training [38, 39], we propose a sample-based surrogate for the density ratio $\frac{\zeta(s)}{d_{\pi}(s)}$ .

Proposition 3.1.

In a GAN, when the real data distribution is $\zeta(s)$ and the generated data distribution is $d_{\pi}(s)$, the output of the discriminator $D(s)$ follows

$$\frac{D(s)}{1-D(s)}=\frac{\zeta(s)}{d_{\pi}(s)}. \tag{4}$$
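The identity in Eq. (4) can be checked with a short calculation. The sketch below assumes the standard GAN discriminator objective and a discriminator trained to optimality for fixed $\zeta$ and $d_{\pi}$; the symbol $D^{*}$ for this optimal discriminator is introduced here for illustration.

```latex
% Discriminator objective with real data from \zeta and generated data from d_\pi:
\max_{D}\;\mathbb{E}_{s\sim\zeta}\left[\log D(s)\right]
        + \mathbb{E}_{s\sim d_{\pi}}\left[\log\left(1-D(s)\right)\right].

% Maximizing pointwise in s: set the derivative w.r.t. D(s) to zero.
\frac{\zeta(s)}{D(s)} - \frac{d_{\pi}(s)}{1-D(s)} = 0
\quad\Longrightarrow\quad
D^{*}(s) = \frac{\zeta(s)}{\zeta(s)+d_{\pi}(s)}.

% Substituting D^* recovers the density ratio:
\frac{D^{*}(s)}{1-D^{*}(s)}
  = \frac{\zeta(s)}{\zeta(s)+d_{\pi}(s)}\cdot\frac{\zeta(s)+d_{\pi}(s)}{d_{\pi}(s)}
  = \frac{\zeta(s)}{d_{\pi}(s)}.
```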

Figure 3: HMM in MDP with optimality variables $\mathcal{O}_{t}$.

We discuss the relation of this sample-based surrogate with f-divergences and Off-Policy RL in Appendix A.3. To train the discriminator $D(s)$, we need to generate samples that are close to the optimal state distribution $\zeta(s)$ and away from $d_{\pi}(s)$, which is roughly the average state distribution. Motivated by [16], we model state optimality by a variable $\mathcal{O}_{t}$. As shown in Fig. 3, we regard the state $s_{t}$ in the MDP as a hidden state in a Hidden Markov Model (HMM), and introduce the binary observation state $\mathcal{O}_{t}$. $\mathcal{O}_{t}=1$ denotes that $s_{t}$ is the optimal state at timestep $t$. The observation model is given by

$$p(\mathcal{O}_{t}|s_{t})=\max_{a_{t}}\exp[\gamma^{t}(r(s_{t},a_{t})-R_{\max})]. \tag{5}$$

We can therefore compute the state density ratio $\frac{\zeta(s)}{d_{\pi}(s)}$ as

$$\frac{\zeta(s)}{d_{\pi}(s)}=\frac{d_{\pi}(s|\mathcal{O}_{0:\infty})}{d_{\pi}(s)}=\frac{p(\mathcal{O}_{0:\infty}|s,\pi)d_{\pi}(s)}{p(\mathcal{O}_{0:\infty}|\pi)d_{\pi}(s)}=\frac{\mathbb{E}_{t}[p(\mathcal{O}_{0:t-1}|s_{t},\pi)p(\mathcal{O}_{t:\infty}|s_{t},\pi)]}{p(\mathcal{O}_{0:\infty}|\pi)}, \tag{6}$$

where the second equality follows from Bayes' rule. The last term is related to the forward probability $\alpha_{t}(s_{t})=p(\mathcal{O}_{0:t-1}|s_{t},\pi)$ and backward probability $\beta_{t}(s_{t})=p(\mathcal{O}_{t:\infty}|s_{t},\pi)$ in the HMM. We discuss in Appendix A.2 that $\beta_{t}(s_{t})$ is positively related to a soft version of the MDP's state value $V_{\pi}(s_{t})$. Also, $\alpha_{t}(s_{t})$ has little influence on the overall density ratio. Therefore, the input $s$ will be more likely to be sampled from distribution $\zeta(s)$ if it has a higher state value $V(s)$ than average. With this idea, we are able to build training samples for the discriminator $D(s)$.

Algorithm 1 The workflow of SRPO on top of MAPLE [12].

1: Input: $\phi_{\varphi}$ as a context encoder parameterized by $\varphi$; Adaptable policy network $\pi_{\theta}$ parameterized by $\theta$; Adaptable value network $V_{\psi}$ parameterized by $\psi$; Offline dataset $\mathcal{D}_{\text{off}}$; Rollout horizon $H$; State partition ratio $\rho$; State discriminator $D_{\delta}$ parameterized by $\delta$; Regularization coefficient $\lambda$
2: for 1, 2, 3, $\dots$
3:     for $t=1,2,\dots,H$
4:         Sample $z_{t}$ from $\phi_{\varphi}\left(z\mid s_{t},a_{t-1},z_{t-1}\right)$ and then sample $a_{t}$ from $\pi_{\theta}\left(a\mid s_{t},z_{t}\right)$
5:         Rollout and get transition data $\left(s_{t+1},r_{t},d_{t+1},s_{t},a_{t},z_{t}\right)$. Then add it to $\mathcal{D}_{\text{rollout}}$
6:     end for
7:     Update the context encoder $\phi_{\varphi}$ according to MAPLE.
8:     Sample a batch $\mathcal{D}_{\text{batch}}$ from $\mathcal{D}_{\text{off}}$ and $\mathcal{D}_{\text{rollout}}$ and rank them by their state-values estimated by $V_{\psi}$; Add $\rho|\mathcal{D}_{\text{batch}}|$ states with higher state-values to $\mathcal{D}_{\text{real}}$ and the others to $\mathcal{D}_{\text{fake}}$
9:     Train the discriminator $D_{\delta}$ with nll loss.
10:    For one-step transition $\left(s_{t+1},r_{t},d_{t+1},s_{t},a_{t},z_{t}\right)\in\mathcal{D}_{\text{batch}}$, update $r_{t}$ with $r_{t}+\lambda\frac{D_{\delta}(s_{t})}{1-D_{\delta}(s_{t})}$
11:    Use the updated $\mathcal{D}_{\text{batch}}$ and SAC to update the policy and value network parameters $\theta$ and $\psi$
12: end for

3.4 Practical Algorithm

Summarizing the previous derivations, we obtain a practical reward regularization algorithm, termed SRPO (State Regularized Policy Optimization), to leverage data with dynamics shift. We select the MAPLE [12] algorithm, which is one of the SOTA algorithms in context-based Offline RL, as the base algorithm. The detailed procedure of MAPLE+SRPO is shown in Alg. 1. After preparing the dataset in a model-based Offline RL style [40, 12], we sample a batch of data from the dataset, take the fraction $\rho$ of states with the highest estimated state values and add them to the dataset $\mathcal{D}_{\text{real}}$. $\mathcal{D}_{\text{fake}}$ is built from the remaining states with lower state values (line 8). We set $\rho=0.5$ in offline experiments with medium-expert level of data and $\rho=0.2$ in all other experiments. Then a classifier discriminating data from the two datasets is trained (line 9). It estimates the logarithm of the state density ratio $\lambda\log\frac{\zeta(s)}{d_{\pi}(s)}$, which is added to the reward $r_{t}$ (line 10). $\lambda$ is regarded as a hyperparameter with values $0.1$ or $0.3$. The effect of $\lambda$ is investigated in Sec. 5.3. The procedure of the online algorithm CaDM [9]+SRPO is similar to MAPLE+SRPO, where the datasets $\mathcal{D}_{\text{real}}$ and $\mathcal{D}_{\text{fake}}$ are built with data from the replay buffer, rather than the offline dataset.
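The three SRPO-specific steps of Alg. 1 (partition by state value, train the discriminator, augment the reward) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the discriminator is a hypothetical logistic-regression model on one-dimensional synthetic states, the value estimates are made up, and the reward uses the logarithmic form $\lambda\log\frac{D(s)}{1-D(s)}$ that follows from Eq. (3) and Eq. (4).

```python
import math
import random


def partition_by_value(states, values, rho):
    """Put the top-rho fraction of states by value in D_real, the rest in D_fake."""
    order = sorted(range(len(states)), key=lambda i: values[i], reverse=True)
    n_real = max(1, int(rho * len(states)))
    real = [states[i] for i in order[:n_real]]
    fake = [states[i] for i in order[n_real:]]
    return real, fake


def train_discriminator(real, fake, lr=0.1, epochs=200):
    """Fit D(s) = sigmoid(w . s + b) by gradient descent on the NLL loss."""
    dim = len(real[0])
    w, b = [0.0] * dim, 0.0
    data = [(s, 1.0) for s in real] + [(s, 0.0) for s in fake]
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for s, y in data:
            logit = sum(wi * si for wi, si in zip(w, s)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            # The gradient of the NLL w.r.t. the logit is (p - y).
            for j in range(dim):
                grad_w[j] += (p - y) * s[j]
            grad_b += p - y
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def augmented_reward(r, s, w, b, lam):
    """Return r + lam * log(D(s) / (1 - D(s))); for a sigmoid D this is the logit."""
    logit = sum(wi * si for wi, si in zip(w, s)) + b
    return r + lam * logit


# Synthetic batch: 1-D states whose (hypothetical) value estimate grows with s.
random.seed(0)
states = [[random.uniform(-1.0, 1.0)] for _ in range(200)]
values = [s[0] for s in states]
real, fake = partition_by_value(states, values, rho=0.2)
w, b = train_discriminator(real, fake)
r_high = augmented_reward(0.0, [0.9], w, b, lam=0.1)
r_low = augmented_reward(0.0, [-0.9], w, b, lam=0.1)
```

In this toy setting, states that resemble the high-value partition receive a larger augmented reward than states that resemble the low-value partition, which is the shaping effect the regularizer is meant to provide.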

4 Theoretical Analysis

In this section, we analyze some properties of MDPs with different dynamics and provide theoretical justifications for the SRPO algorithm in Sec. 3. The notations are introduced in Sec. 2.1 and proofs can be found in Appendix A.4. We first show in Thm. 4.2 that the performance of a policy can be lower-bounded when its stationary state distribution is close to a certain optimal state distribution. In accordance with the intuition in Sec. 3.1, it is also demonstrated in Thm. 4.3 that optimal policies can induce the same stationary state distribution in different dynamics under mild assumptions. We start the analysis with the definition of homomorphous MDPs.

Definition 4.1 (homomorphous MDPs).

In an HiP-MDP $(\mathcal{S},\mathcal{A},\Theta,T,r,\gamma,\rho_{0})$, consider hidden parameters $\theta_{1},\theta_{2}\in\Theta$. Let $T_{i}(s^{\prime}|s,a)=T(s^{\prime}|s,a,\theta_{i}),\forall(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S},i=1,2$. If $\sum\limits_{a\in\mathcal{A}}T_{1}(s^{\prime}|s,a)>0\Leftrightarrow\sum\limits_{a\in\mathcal{A}}T_{2}(s^{\prime}|s,a)>0$ for all $s,s^{\prime}\in\mathcal{S}$, MDPs $(\mathcal{S},\mathcal{A},T_{1},r,\gamma,\rho_{0})$ and $(\mathcal{S},\mathcal{A},T_{2},r,\gamma,\rho_{0})$ are referred to as homomorphous MDPs.

In this definition, $\sum\limits_{a\in\mathcal{A}}T(s^{\prime}|s,a)>0$ means state $s^{\prime}$ can be reached from $s$ , so the equivalence of non-zero transition probabilities refers to the same reachability from $s$ to $s^{\prime}$ . Such condition holds in a wide range of MDPs differing only in environment parameters. For example, pendulums with different lengths can all reach the upright state from an off-center state, with longer pendulums exerting a larger force. Apart from the homomorphous property, we also require the reward and dynamics functions of MDPs to have Lipschitz properties. We assume reward function $r(s,a,s^{\prime})$ w.r.t. the action $a$ is $\lambda_{1}$ -Lipschitz and the dynamics function $T(s,a)$ w.r.t. the action $a$ is $\lambda_{2}$ -inverse Lipschitz. Discussions on these Lipschitz properties can be found in Appendix A.5.

With these preliminaries, we first analyze the discrepancy of accumulated returns of two policies with similar stationary state distributions. The analysis is related to our SRPO algorithm in that the state regularized policy optimization formulation in Eq. (2) also constrains the learning policy to have a similar stationary state distribution with the optimal policy. Specifically, we derive a theorem as follows.

Theorem 4.2.

Consider two homomorphous MDPs with dynamics $T$ and $T^{\prime}$. If $T^{\prime}\in(T,\varepsilon_{m})$, then for every learning policy $\hat{\pi}$ such that $D_{\mathrm{KL}}(d^{\hat{\pi}}_{T}(\cdot)\|d^{*}_{T^{\prime}}(\cdot))\leqslant\varepsilon_{s}$, we have

$$\eta_{T}(\hat{\pi})\geqslant\eta_{T}(\pi_{T}^{*})-\dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+2\lambda_{1}+\sqrt{2}R_{\max}\sqrt{\varepsilon_{s}}}{1-\gamma}. \tag{7}$$

The theorem implies that if a policy $\hat{\pi}$ has a similar stationary state distribution with the optimal policy in one MDP $M$ , $\hat{\pi}$ will have a lower-bound performance guarantee in all MDPs that are homomorphous with the MDP $M$ . Therefore, the learning policy can benefit from the state regularized policy optimization in Sec. 3.2.

More specifically, Eq. (7) shows that the gap in accumulated return of $\hat{\pi}$ and $\pi^{*}_{T}$ is related to the dynamics shift $\varepsilon_{m}$ , the KL-Divergence of two stationary state distributions $\varepsilon_{s}$ , and the effective planning horizon $\frac{1}{1-\gamma}$ . With respect to the dynamics shift $\varepsilon_{m}$ , it is related to a “uniform” constraint on the dynamics $T^{\prime}$ . We further show in Appendix A.4 that constraining the dynamics shift on a certain state-action pair is enough to derive Eq. (7). Unlike the dynamics shift $\varepsilon_{m}$ that is determined by a pre-defined RL task, the discrepancy between stationary state distributions $\varepsilon_{s}$ is determined by the learning policy $\hat{\pi}$ and can be optimized during training to obtain a better performance lower-bound. We also discuss in Appendix A.5 how tight Eq. (7) is in terms of the effective planning horizon $\frac{1}{1-\gamma}$ , compared with some similar performance bounds.
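To get a feel for how the terms in Eq. (7) trade off, the following sketch evaluates the right-hand gap for given constants. The function name and all numerical values are hypothetical and chosen only to illustrate the dependence on $\varepsilon_{s}$.

```python
import math


def return_gap_bound(lambda1, lambda2, eps_m, eps_s, r_max, gamma):
    """Upper bound on eta_T(pi*_T) - eta_T(pi_hat) given by Eq. (7)."""
    numerator = (
        lambda1 * lambda2 * eps_m               # dynamics-shift term
        + 2.0 * lambda1                         # Lipschitz term
        + math.sqrt(2.0) * r_max * math.sqrt(eps_s)  # state-distribution term
    )
    return numerator / (1.0 - gamma)


# Illustrative constants (not taken from the experiments).
loose = return_gap_bound(lambda1=0.1, lambda2=1.0, eps_m=0.05,
                         eps_s=0.02, r_max=1.0, gamma=0.9)
tight = return_gap_bound(lambda1=0.1, lambda2=1.0, eps_m=0.05,
                         eps_s=0.005, r_max=1.0, gamma=0.9)
```

With the other constants fixed, reducing $\varepsilon_{s}$ is the only way the learner can shrink the bound, and the improvement scales with $\sqrt{\varepsilon_{s}}$ rather than linearly.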

With an additional assumption on the action gap $\Delta$ (defined in Sec. 2.1), we further demonstrate that the optimal policy of two homomorphous MDPs can have the same stationary state distribution, which verifies the intuition in Sec. 3.1.

Theorem 4.3.

Consider two homomorphous MDPs with dynamics $T$ and $T^{\prime}$ . If $T^{\prime}\in(T,\varepsilon_{m})$ and the action gap $\Delta$ follows $\Delta>\frac{(2-\gamma)\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}$ , for all $s\in\mathcal{S}$ we have $d^{*}_{T}(s)=d^{*}_{T^{\prime}}(s)$ .

The assumption is mild and holds in many scenarios. For example, in autonomous driving it can be very dangerous to deviate from the optimal policy. Such suboptimal actions have low rewards, leading to a large action gap $\Delta$ . In recommendation tasks we are hardly concerned with what items we recommend (the action), as long as the recommendation outcome (the state), i.e., the users’ experiences are good enough, leading to a small $\lambda_{1}$ . The condition of large enough action gap holds in these situations.
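The condition of Thm. 4.3 can be checked numerically when optimal Q-values are available. The sketch below computes the action gap $\Delta$ from Sec. 2.1 for tabular Q-tables and compares it with the threshold in the theorem; the Q-values, constants, and function names are hypothetical.

```python
def action_gap(q_tables):
    """Smallest gap between the best and any other action.

    q_tables[theta][s][a] holds Q*_{T_theta}(s, a) for each hidden parameter.
    """
    gap = float("inf")
    for q in q_tables:
        for q_s in q:
            best = max(q_s)
            best_a = q_s.index(best)
            for a, q_sa in enumerate(q_s):
                if a != best_a:
                    gap = min(gap, best - q_sa)
    return gap


def action_gap_threshold(lambda1, lambda2, eps_m, gamma):
    """Right-hand side of the action-gap condition in Thm. 4.3."""
    return (2.0 - gamma) * lambda1 * lambda2 * eps_m / (1.0 - gamma)


# Two hypothetical dynamics, two states, two actions.
q_tables = [
    [[1.0, 0.4], [0.2, 0.9]],
    [[1.1, 0.5], [0.1, 0.8]],
]
delta = action_gap(q_tables)
threshold = action_gap_threshold(lambda1=0.5, lambda2=1.0, eps_m=0.05, gamma=0.9)
condition_holds = delta > threshold
```

A smaller $\lambda_{1}$ or $\varepsilon_{m}$ lowers the threshold, which matches the two scenarios discussed above.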

5 Experiments

In this section, we conduct experiments to investigate the following questions: (1) Can SRPO leverage data with distribution shift and outperform current SOTA algorithms in the setting of HiP-MDP, in both online and offline RL? (2) How does each component of SRPO (e.g., use state regularization rather than behavior regularization) contribute to SRPO’s performance? To answer question (1), we use the MuJoCo simulator [41] and generate environments with different transition functions. We train the CaDM+SRPO and the MAPLE+SRPO algorithm proposed in Sec. 3.4 and make comparative analysis with baseline algorithms. To answer question (2), we do ablation studies to examine the role of different modules in SRPO. We also examine how the discriminator $D_{\delta}$ works in complex environments and the effect of regularizing with state distributions in different performance levels.

5.1 Experiment Setup

We alter the simulator gravity to generate different dynamics in online experiments. Possible values of gravity are {1.0}, {0.7,1.0,1.3}, and {0.4,0.7,1.0,1.3,1.6} in experiments with 1, 3, and 5 kinds of different dynamics, respectively. When the simulator resets, the gravity is uniformly sampled from the set of all possible values. The number of training steps is in proportion to the number of environment parameters. Therefore, the agent has access to the same amount of training data on a certain value of simulator gravity. We also consider the shift of medium density and body mass in offline experiments to show SRPO’s robustness to different forms of dynamics shift.
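The sampling protocol described above can be written down as a small sketch. The gravity sets are the ones listed in the text; the helper names and the `steps_per_dynamics` budget are hypothetical, and no simulator API is assumed.

```python
import random

# Gravity values per number of dynamics, as listed in the setup.
GRAVITY_SETS = {
    1: [1.0],
    3: [0.7, 1.0, 1.3],
    5: [0.4, 0.7, 1.0, 1.3, 1.6],
}


def sample_gravity(num_dynamics, rng):
    """Uniformly sample a gravity value at environment reset."""
    return rng.choice(GRAVITY_SETS[num_dynamics])


def total_training_steps(num_dynamics, steps_per_dynamics):
    """Training budget proportional to the number of environment parameters."""
    return num_dynamics * steps_per_dynamics


rng = random.Random(0)
gravities = [sample_gravity(5, rng) for _ in range(10)]
budget = total_training_steps(5, steps_per_dynamics=1000)
```

Scaling the budget with the number of dynamics keeps the expected amount of data per gravity value constant across the 1, 3, and 5 dynamics settings.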

To perform comparative analysis, we choose CaDM [9] and PPO [7] as baseline algorithms in online experiments. In offline experiments, DARA [5] also exploits large amount of data with dynamics shift. Its algorithm relies on Importance Sampling and will be used as a baseline method. Apart from that, we choose MOPO [40], MAPLE [12] and CQL [26] as baseline methods. More information on the setup of experiments is shown in Appendix B.1.

Table 1: Results of offline experiments on MuJoCo tasks. Numbers are the normalized scores according to the D4RL paper [42]. ME, M, MR and R correspond to the medium-expert, medium, medium-replay and random dataset, respectively. The evaluation is done on policies at the last iteration of training, averaged over four random seeds. The number after $\pm$ is the standard deviation. Our proposed MAPLE+SRPO algorithm has the best performance in 8 of 12 tasks and the highest overall performance.

	CQL (Single Env)	GAIL	CQL	MOPO	MAPLE	MAPLE +DARA	MAPLE +SRPO(Ours)
Walker2d-ME	1.11	0.21 $\pm$ 0.03	1.03 $\pm$ 0.10	0.25 $\pm$ 0.18	0.55 $\pm$ 0.21	0.80 $\pm$ 0.02	0.66 $\pm$ 0.08
Walker2d-M	0.79	0.15 $\pm$ 0.06	0.78 $\pm$ 0.01	0.23 $\pm$ 0.34	0.82 $\pm$ 0.01	0.83 $\pm$ 0.03	0.84 $\pm$ 0.03
Walker2d-MR	0.27	0.00 $\pm$ 0.00	0.07 $\pm$ 0.00	0.00 $\pm$ 0.00	0.16 $\pm$ 0.02	0.17 $\pm$ 0.01	0.17 $\pm$ 0.02
Walker2d-R	0.07	0.00 $\pm$ 0.00	0.03 $\pm$ 0.01	0.00 $\pm$ 0.00	0.22 $\pm$ 0.00	0.22 $\pm$ 0.00	0.22 $\pm$ 0.00
Hopper-ME	0.98	0.04 $\pm$ 0.01	0.32 $\pm$ 0.14	0.01 $\pm$ 0.00	0.96 $\pm$ 0.14	0.96 $\pm$ 0.06	0.98 $\pm$ 0.02
Hopper-M	0.58	0.00 $\pm$ 0.00	0.57 $\pm$ 0.16	0.01 $\pm$ 0.00	0.78 $\pm$ 0.28	0.40 $\pm$ 0.05	1.03 $\pm$ 0.09
Hopper-MR	0.46	0.00 $\pm$ 0.00	0.14 $\pm$ 0.02	0.01 $\pm$ 0.01	0.91 $\pm$ 0.11	1.02 $\pm$ 0.01	1.02 $\pm$ 0.01
Hopper-R	0.11	0.00 $\pm$ 0.00	0.11 $\pm$ 0.00	0.01 $\pm$ 0.00	0.13 $\pm$ 0.00	0.13 $\pm$ 0.01	0.32 $\pm$ 0.02
HalfCheetah-ME	0.62	0.36 $\pm$ 0.06	0.03 $\pm$ 0.04	-0.03 $\pm$ 0.00	0.50 $\pm$ 0.06	0.50 $\pm$ 0.00	0.63 $\pm$ 0.01
HalfCheetah-M	0.44	0.25 $\pm$ 0.02	0.43 $\pm$ 0.03	0.38 $\pm$ 0.28	0.62 $\pm$ 0.01	0.67 $\pm$ 0.03	0.63 $\pm$ 0.01
HalfCheetah-MR	0.46	0.18 $\pm$ 0.11	0.46 $\pm$ 0.00	-0.03 $\pm$ 0.00	0.52 $\pm$ 0.00	0.53 $\pm$ 0.01	0.55 $\pm$ 0.00
HalfCheetah-R	0.35	0.14 $\pm$ 0.02	0.01 $\pm$ 0.02	-0.03 $\pm$ 0.00	0.22 $\pm$ 0.03	0.21 $\pm$ 0.00	0.24 $\pm$ 0.01
Average	0.52	0.11	0.33	0.068	0.53	0.54	0.61

Table 2: Results of ablation studies in offline experiments.

	MAPLE +SRPO	Behavior Regularizing	Random Partition	Fixed $\lambda=0.3$
Walker2d	0.47	0.45	0.39	0.40
Hopper	0.83	0.68	0.56	0.79
HalfCheetah	0.51	0.50	0.45	0.40
Average	0.61	0.54	0.47	0.53

Table 3: Performance comparison of differently regularized policies.

	Original Hopper Env	10x Density
Random	121.3	44.15
Medium	2178	913.3
Expert	3819	3748

5.2 Results

Online Experiments

The results of online experiments are shown in Fig. 4. With the context encoder and conditional policy, CaDM is able to outperform PPO in all environments. However, it fails to take advantage of the increase in the amount of data with dynamics shift. Its performance with 5 different dynamics is lower than that with 3 dynamics. In contrast, our proposed SRPO algorithm leads to better performance on top of CaDM in accordance with more training data. It significantly outperforms the original CaDM algorithm in environments with 5 different dynamics. The performance comparison in the Pendulum environment is also in accordance with the motivating example in Sec. 3.1. More results of online experiments are shown in Appendix B.2.

Offline Experiments

The results of offline experiments are shown in Tab. 1. The column “CQL (Single Env)” refers to the evaluation score reported in the CQL [26] paper, where the policy is trained with data from a single static environment. Without context-based encoders, GAIL [43], CQL and MOPO [40] cannot handle data with distribution shift and show a performance drop. MAPLE [12] and MAPLE+DARA [5] only achieve marginal performance improvement with respect to CQL (Single Env). On the other hand, MAPLE+SRPO shows significant performance improvement over CQL (Single Env), which means that SRPO can efficiently leverage the additional data with dynamics shift to facilitate policy training. The MAPLE+SRPO algorithm also has a 15% higher average evaluation score than MAPLE (0.61 versus 0.53), achieving the best performance in 8 out of 12 tasks. Apart from MAPLE, the meta-RL algorithm PEARL [44] also has a context encoder for fast adaptation. We compare PEARL with PEARL+SRPO and leave the results to Appendix B.3.
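As a quick sanity check of the averages quoted from Tab. 1, the per-task scores of MAPLE and MAPLE+SRPO can be averaged directly. The lists below are copied from the table; the variable names are illustrative only.

```python
# Per-task normalized scores copied from Tab. 1 (same row order as the table).
maple = [0.55, 0.82, 0.16, 0.22, 0.96, 0.78, 0.91, 0.13, 0.50, 0.62, 0.52, 0.22]
maple_srpo = [0.66, 0.84, 0.17, 0.22, 0.98, 1.03, 1.02, 0.32, 0.63, 0.63, 0.55, 0.24]


def mean(xs):
    return sum(xs) / len(xs)


avg_maple = mean(maple)            # close to the reported 0.53
avg_srpo = mean(maple_srpo)        # close to the reported 0.61
relative_gain = avg_srpo / avg_maple - 1.0
```

The unrounded per-task means give a relative gain of roughly 14%, while the rounded averages 0.61 and 0.53 give roughly 15%, so the two computations agree up to rounding.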

5.3 Analysis

Ablations

We conduct ablation studies in offline environments to analyze the role of each algorithm component in SRPO. The results are shown in Tab. 2. We first investigate the outcome of regularizing with the state-action distribution rather than the state distribution in Eq. (2). The resulting policy has a lower average evaluation score than policies trained with the original SRPO algorithm in all environments. This is because environments with different dynamics do not share a similar optimal policy, so the action distribution in the mixed dataset can be misleading when training new policies. According to Sec. 3.3, SRPO trains a classifier to discriminate states with higher values from those with lower values. We also train another classifier that discriminates a random binary partition of states. Ablation results show a large performance drop, which verifies the effectiveness of the classification-based surrogate mechanism. We also evaluate MAPLE+SRPO with a fixed value of the hyperparameter $\lambda$. $\lambda=0.1$ is more suitable for the Walker2d and HalfCheetah environments, while in the Hopper environment $\lambda=0.3$ is better. This is in accordance with the previous analysis that Hopper agents can benefit more from regularizing with the stationary state distribution.
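The difference between the value-based partition and the random-partition ablation can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the rule "top fraction of states by estimated value" and the parameter `top_fraction` are assumptions made for illustration, since the exact partition criterion is defined in Sec. 3.3.

```python
import random


def partition_by_value(values, top_fraction=0.5):
    """Label 1 for the top `top_fraction` of states by estimated value, else 0.

    The top-fraction rule is an assumed stand-in for the criterion of Sec. 3.3.
    """
    k = max(1, int(len(values) * top_fraction))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    labels = [0] * len(values)
    for i in order[:k]:
        labels[i] = 1
    return labels


def random_partition(num_states, rng):
    """Ablation: labels carry no information about state values."""
    return [rng.randint(0, 1) for _ in range(num_states)]


values = [0.1, 0.9, 0.5, 0.7]
value_labels = partition_by_value(values)
random_labels = random_partition(len(values), random.Random(0))
```

A classifier trained on `value_labels` can learn which states are valuable, whereas one trained on `random_labels` has no such signal, which is consistent with the performance drop reported for the random partition.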

Effectiveness of Discriminators

We first train a discriminator $D_{\delta}$ according to Alg. 1. Then a set of states is sampled from the D4RL [42] dataset and classified into two sets according to the output of $D_{\delta}$ . The average values of the two sets of states are compared in Fig. 5. As shown in the figure, states with higher $D_{\delta}$ outputs also have higher values in all three environments. It means that the trained discriminator $D_{\delta}$ can successfully identify states with high values from those with low values. Therefore, its output can be a good surrogate for the density ratio $\frac{\zeta(s)}{d_{\pi}(s)}$ in Sec. 3.
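The comparison described above amounts to splitting states by the discriminator output and averaging their values within each group. A minimal sketch follows, assuming a decision threshold of 0.5 and toy inputs; both are illustrative choices, not details taken from the paper.

```python
def mean_value_by_discriminator(d_outputs, values, threshold=0.5):
    """Return (mean value of states with D >= threshold, mean value of the rest)."""
    high = [v for d, v in zip(d_outputs, values) if d >= threshold]
    low = [v for d, v in zip(d_outputs, values) if d < threshold]

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(high), mean(low)


# Toy discriminator outputs and state values (hypothetical numbers).
d_outputs = [0.9, 0.8, 0.2, 0.1]
values = [10.0, 8.0, 3.0, 1.0]
high_mean, low_mean = mean_value_by_discriminator(d_outputs, values)
```

If the discriminator is informative, `high_mean` exceeds `low_mean`, which is the pattern reported in Fig. 5.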

Effectiveness of the Regularization

We also study the effect of policy regularization with different performance levels of stationary state distributions. Random, medium and expert policies in the original Hopper environment are used to estimate the stationary state distributions, which regularize the learning policies in a new environment with different dynamics. The results are shown in Tab. 3, where the expert policy is the most effective in regularizing. This verifies the practice in Sec. 3.2 and the theoretical analysis.

6 Conclusion and Discussion

In this work, we focus on the problem of leveraging data with dynamics shift to efficiently train RL agents. Based on the intuition that optimal policies can lead to similar stationary state distributions, we give a constrained optimization formulation that regards the state distribution as a regularizer. After discussions on a sample-based surrogate, we propose the SRPO algorithm which can be an add-on module to context-based algorithms and improve their sample efficiency. The resulting CaDM+SRPO and MAPLE+SRPO algorithms show superior performance when learning on data sampled from environments with different dynamics. Theoretical analyses are also given to analyze some properties of MDPs with different dynamics. They provide justifications for the intuition of the dynamics-invariant state distribution, as well as the constrained policy optimization formulation.

Limitations and Future work

The theoretical analyses of this work require the assumption of homomorphous MDPs, i.e., the same state reachability in different MDPs. It would be interesting to discuss whether similar conclusions on the stationary state distribution can be derived without such an assumption.

Acknowledgements

We thank Wanqi Xue and Yanchen Deng for helpful discussions. This research is supported by the National Research Foundation, Singapore under its Industry Alignment Fund – Pre-positioning (IAF-PP) Funding Initiative and Ministry of Education, Singapore, under its Academic Research Fund Tier 1 (RG13/22). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
[2] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nat., 529(7587):484–489, 2016.
[3] Wanqi Xue, Qingpeng Cai, Zhenghai Xue, Shuo Sun, Shuchang Liu, Dong Zheng, Peng Jiang, and Bo An. PrefRec: Preference-based recommender systems for reinforcing long-term user engagement. CoRR, abs/2212.02779, 2022.
[4] Zhenghai Xue, Qingpeng Cai, Tianyou Zuo, Bin Yang, Lantao Hu, Peng Jiang, and Bo An. AdaRec: Adaptive sequential recommendation for reinforcing long-term user engagement. CoRR, abs/2310.03984, 2023.
[5] Jinxin Liu, Hongyin Zhang, and Donglin Wang. DARA: Dynamics-aware reward augmentation in offline reinforcement learning. In ICLR, 2022.
[6] Fan-Ming Luo, Shengyi Jiang, Yang Yu, Zongzhang Zhang, and Yi-Feng Zhang. Adapt to environment sudden changes by learning a context sensitive policy. In AAAI, 2022.
[7] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
[8] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
[9] Kimin Lee, Younggyo Seo, Seunghyun Lee, Honglak Lee, and Jinwoo Shin. Context-aware dynamics model for generalization in model-based reinforcement learning. In ICML, 2020.
[10] Xi Chen, Ali Ghadirzadeh, Tianhe Yu, Yuan Gao, Jianhao Wang, Wenzhe Li, Bin Liang, Chelsea Finn, and Chongjie Zhang. Latent-variable advantage-weighted policy optimization for offline RL. CoRR, abs/2203.08949, 2022.
[11] Wenxuan Zhou, Lerrel Pinto, and Abhinav Gupta. Environment probing interaction policies. In ICLR, 2019.
[12] Xiong-Hui Chen, Yang Yu, Qingyang Li, Fan-Ming Luo, Zhiwei (Tony) Qin, Wenjie Shang, and Jieping Ye. Offline model-based adaptable policy learning. In NeurIPS, 2021.
[13] Benjamin Eysenbach, Shreyas Chaudhari, Swapnil Asawa, Sergey Levine, and Ruslan Salakhutdinov. Off-dynamics reinforcement learning: Training for transfer with domain classifiers. In ICLR, 2021.
[14] Haoyi Niu, Shubham Sharma, Yiwen Qiu, Ming Li, Guyue Zhou, Jianming Hu, and Xianyuan Zhan. When to trust your simulator: Dynamics-aware hybrid offline-and-online reinforcement learning. CoRR, abs/2206.13464, 2022.
[15] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In ICML, 2017.
[16] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR, abs/1805.00909, 2018.
[17] Finale Doshi-Velez and George Dimitri Konidaris. Hidden parameter markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI, 2016.
[18] Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A survey of zero-shot generalisation in deep reinforcement learning. J. Artif. Intell. Res., 76:201–264, 2023.
[19] Luisa M. Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, and Shimon Whiteson. VariBAD: A very good method for bayes-adaptive deep RL via meta-learning. In ICLR, 2020.
[20] Jiachen Yang, Brenden K. Petersen, Hongyuan Zha, and Daniel M. Faissol. Single episode policy transfer in reinforcement learning. In ICLR, 2020.
[21] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[22] Eric Mitchell, Rafael Rafailov, Xue Bin Peng, Sergey Levine, and Chelsea Finn. Offline meta-reinforcement learning with advantage weighting. In ICML, 2021.
[23] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In ICML, 2017.
[24] Jing-Cheng Pang, Tian Xu, Shengyi Jiang, Yu-Ren Liu, and Yang Yu. Sparsity prior regularized q-learning for sparse action tasks. CoRR, abs/2105.08666, 2021.
[25] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In ICML, 2019.
[26] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In NeurIPS, 2020.
[27] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. CoRR, abs/1911.11361, 2019.
[28] Riashat Islam, Komal K. Teru, and Deepak Sharma. Off-policy policy gradient algorithms by constraining the state distribution shift. CoRR, abs/1911.06970, 2019.
[29] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In NeurIPS, 2018.
[30] Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. In NeurIPS, 2019.
[31] Tanmay Gangwani and Jian Peng. State-only imitation with transition dynamics mismatch. In ICLR, 2020.
[32] Fangchen Liu, Zhan Ling, Tongzhou Mu, and Hao Su. State alignment-based imitation learning. In ICLR, 2020.
[33] Tianwei Ni, Harshit S. Sikchi, Yufei Wang, Tejus Gupta, Lisa Lee, and Ben Eysenbach. f-IRL: Inverse reinforcement learning via state marginal matching. In CoRL, 2020.
[34] Shentao Yang, Yihao Feng, Shujian Zhang, and Mingyuan Zhou. Regularizing a model-based policy stationary distribution to stabilize offline reinforcement learning. In ICML, 2022.
[35] Paul F. Christiano, Zain Shah, Igor Mordatch, Jonas Schneider, Trevor Blackwell, Joshua Tobin, Pieter Abbeel, and Wojciech Zaremba. Transfer from simulation to real world through learning deep inverse dynamics model. CoRR, abs/1610.03518, 2016.
[36] Shengyi Jiang, Jing-Cheng Pang, and Yang Yu. Offline imitation learning with a misspecified simulator. In NeurIPS, 2020.
[37] Emanuel Parzen. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.
[38] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, abs/1406.2661, 2014.
[39] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
[40] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y. Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: model-based offline policy optimization. In NeurIPS, 2020.
[41] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, 2012.
[42] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: datasets for deep data-driven reinforcement learning. CoRR, abs/2004.07219, 2020.
[43] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In NIPS, 2016.
[44] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In ICML, 2019.
[45] Xu-Hui Liu, Zhenghai Xue, Jing-Cheng Pang, Shengyi Jiang, Feng Xu, and Yang Yu. Regret minimization experience replay in off-policy reinforcement learning. In NeurIPS, 2021.
[46] Samarth Sinha, Jiaming Song, Animesh Garg, and Stefano Ermon. Experience replay with likelihood-free importance weights. In L4DC, 2022.
[47] Tian Xu, Ziniu Li, and Yang Yu. Error bounds of imitating policies and environments. In NeurIPS, 2020.
[48] Yuda Song, Aditi Mavalankar, Wen Sun, and Sicun Gao. Provably efficient model-based policy adaptation. In ICML, 2020.
[49] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In NeurIPS, 2019.
[50] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In ICLR, 2019.
[51] Kailin Zeng, Qiyuan Zhang, Bin Chen, Bin Liang, and Jun Yang. APD: learning diverse behaviors for reinforcement learning through unsupervised active pre-training. IEEE Robotics Autom. Lett., 7(4):12251–12258, 2022.

Appendix A Additional Derivations and Proofs

A.1 Derivations of the Lagrangian

We start from the optimization problem:

$$
\begin{aligned}
\max_{\pi}\quad&\mathbb{E}_{s_{t},a_{t}\sim\tau_{\pi}}\sum_{t=0}^{\infty}\gamma^{t}r\left(s_{t},a_{t}\right) \\
\text{s.t.}\quad&D_{\mathrm{KL}}\left(d_{\pi}(\cdot)\,\|\,\zeta(\cdot)\right)<\varepsilon_{m}.
\end{aligned}
\tag{8}
$$

The KL-Divergence term can be transformed as:

$$
\begin{aligned}
D_{\mathrm{KL}}\left(d_{\pi}(\cdot)\,\|\,\zeta(\cdot)\right)
&=-\mathbb{E}_{s\sim d_{\pi}(s)}\left[\log\zeta(s)-\log d_{\pi}(s)\right] \\
&=-\int d_{\pi}(s)\left[\log\zeta(s)-\log d_{\pi}(s)\right]ds \\
&=-\int(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}p\left(s_{t}=s\right)\left[\log\zeta(s)-\log d_{\pi}(s)\right]ds \\
&=-(1-\gamma)\sum_{t=0}^{\infty}\int\gamma^{t}p\left(s_{t}=s\right)\left[\log\zeta(s)-\log d_{\pi}(s)\right]ds \\
&=-(1-\gamma)\sum_{t=0}^{\infty}\mathbb{E}_{s_{t}\sim\tau}\left[\gamma^{t}\left(\log\zeta(s_{t})-\log d_{\pi}(s_{t})\right)\right] \\
&=-(1-\gamma)\mathbb{E}_{s_{t}\sim\tau}\sum_{t=0}^{\infty}\gamma^{t}\left(\log\zeta(s_{t})-\log d_{\pi}(s_{t})\right).
\end{aligned}
\tag{9}
$$
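The derivation relies on the discounted stationary state distribution $d_{\pi}(s)=(1-\gamma)\sum_{t}\gamma^{t}p(s_{t}=s)$. The sketch below checks, on a hypothetical two-state Markov chain, that this quantity is a proper distribution and that the first line of the derivation agrees with the usual definition of the KL divergence. The transition matrix `P`, the initial distribution `rho0` and the reference distribution `zeta` are made-up numbers used only for illustration.

```python
import math


def discounted_state_distribution(P, rho0, gamma, horizon=2000):
    """Truncated sum d(s) = (1 - gamma) * sum_t gamma^t * p(s_t = s)."""
    n = len(rho0)
    p_t = list(rho0)          # p(s_t = .) at the current timestep
    d = [0.0] * n
    weight = 1.0 - gamma      # (1 - gamma) * gamma^t
    for _ in range(horizon):
        for s in range(n):
            d[s] += weight * p_t[s]
        # Propagate the state marginal one step through the chain.
        p_t = [sum(p_t[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= gamma
    return d


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical chain induced by some policy
rho0 = [1.0, 0.0]
gamma = 0.9
d_pi = discounted_state_distribution(P, rho0, gamma)
zeta = [0.5, 0.5]

kl_direct = kl_divergence(d_pi, zeta)
# First line of the derivation: -E_{s ~ d_pi}[log zeta(s) - log d_pi(s)].
kl_from_derivation = -sum(d * (math.log(z) - math.log(d)) for d, z in zip(d_pi, zeta))
```

The truncation at `horizon` steps is harmless here because the discarded weight $\gamma^{2000}$ is numerically zero.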

So the constraint can be written as

$$
\mathbb{E}_{s_{t}\sim\tau}\sum_{t=0}^{\infty}\left[\gamma^{t}\left(\log d_{\pi}(s_{t})-\log\zeta(s_{t})\right)\right]-\frac{\varepsilon_{m}}{1-\gamma}<0.
\tag{10}
$$

The optimization problem can be written as the following standard form

$$
\begin{aligned}
\min_{\pi}\quad&\mathbb{E}_{s_{t},a_{t}\sim\tau}\sum_{t=0}^{\infty}-\gamma^{t}r\left(s_{t},a_{t}\right) \\
\text{s.t.}\quad&\mathbb{E}_{s_{t}\sim\tau}\sum_{t=0}^{\infty}\left[\gamma^{t}\left(\log d_{\pi}(s_{t})-\log\zeta(s_{t})\right)\right]-\frac{\varepsilon_{m}}{1-\gamma}<0.
\end{aligned}
\tag{11}
$$

So the Lagrangian $L$ is

$$
L=-\mathbb{E}_{s_{t},a_{t}\sim\tau}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_{t},a_{t})+\lambda\log\zeta(s_{t})-\lambda\log d_{\pi}(s_{t})\right)\right]-\frac{\lambda\varepsilon_{m}}{1-\gamma}.
\tag{12}
$$
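The Lagrangian suggests maximizing the per-step quantity $r(s_{t},a_{t})+\lambda\log\frac{\zeta(s_{t})}{d_{\pi}(s_{t})}$, i.e., an augmented reward. The following is a minimal sketch of that augmentation. It assumes that a discriminator output $D(s)\in(0,1)$ approximates $\frac{\zeta(s)}{\zeta(s)+d_{\pi}(s)}$, as for an optimal GAN discriminator, so that $\frac{D(s)}{1-D(s)}$ stands in for the density ratio. This mapping, the clipping constant `eps` and the function names are assumptions for illustration, not the authors' exact surrogate.

```python
import math


def log_density_ratio(d_prob, eps=1e-6):
    """log(D / (1 - D)), an assumed stand-in for log(zeta(s) / d_pi(s))."""
    d = min(max(d_prob, eps), 1.0 - eps)   # clip to avoid log(0)
    return math.log(d) - math.log(1.0 - d)


def augmented_reward(reward, d_prob, lam):
    """Per-step term of the Lagrangian: r + lambda * log(zeta / d_pi)."""
    return reward + lam * log_density_ratio(d_prob)


neutral = augmented_reward(1.0, 0.5, lam=0.3)     # ratio 1, reward unchanged
encouraged = augmented_reward(1.0, 0.9, lam=0.3)  # state favoured by zeta
discouraged = augmented_reward(1.0, 0.1, lam=0.3)
```

States that the discriminator attributes to $\zeta$ receive a bonus, and states it attributes to the current policy's own distribution receive a penalty, scaled by $\lambda$.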

A.2 Derivations of the Forward and Backward Probabilities

The backward probability can be written as:

$$
\begin{aligned}
\beta_{t}(s_{t})
&=\int_{\mathcal{S}}p(\mathcal{O}_{t:\infty}|s_{t},s_{t+1},\pi)p(s_{t+1}|s_{t})ds_{t+1} \\
&=\int_{\mathcal{S}}p(\mathcal{O}_{t}|s_{t},\pi)p(\mathcal{O}_{t+1:\infty}|s_{t+1},\pi)p(s_{t+1}|s_{t})ds_{t+1} \\
&=\int_{\mathcal{S}}\max_{a_{t}}\exp(\gamma^{t}r(s_{t},a_{t}))\beta_{t+1}(s_{t+1})p(s_{t+1}|s_{t})ds_{t+1}.
\end{aligned}
\tag{13}
$$

Taking the logarithm on both sides, we have

$$
\log\beta_{t}(s_{t})=\log\mathbb{E}_{s_{t+1}}\max_{a_{t}}\exp\left(\gamma^{t}r(s_{t},a_{t})+\log\beta_{t+1}(s_{t+1})\right).
\tag{14}
$$

Let $W(s_{t})=\log\beta_{t}(s_{t})$ , we get

$$
W(s_{t})=\log\mathbb{E}_{s_{t+1}}\exp\left[\max\limits_{a_{t}}\gamma^{t}r(s_{t},a_{t})+W(s_{t+1})\right].
\tag{15}
$$

According to [16], $W_{t}$ is a soft version of the traditional value function $V_{t}$. As Soft Actor-Critic [8] has become the base algorithm in many scenarios, $\beta_{t}$ is closely related to the value function learned during training, which is often in its soft version. The forward probability $\alpha_{t}(s_{t})=p\left(\mathcal{O}_{0:t-1}\mid s_{t},\pi\right)$ is the probability of the trajectory from timestep $0$ to $t-1$ being optimal given the state $s_{t}$. This probability is hard to model because the transition from $s_{t-1}$ to $s_{t}$ depends on the actual policy $\pi$ as well as the environment dynamics. Therefore, we do not take $\alpha_{t}(s_{t})$ into account when dividing training data to train the classifier.

A.3 Discussions on the Surrogate for the Density Ratio

According to some off-policy RL algorithms [45, 46], the idea of training a classifier $D(s)$ as a data-based surrogate of the density ratio $\frac{\zeta(s)}{d_{\pi}(s)}$ can also be derived from a theorem related to the f-divergence (Lemma 1 in [46]). Such a derivation is essentially the same as our GAN-based proposition. Technically, these algorithms also divide the training data into two parts and train a classifier, which is later used to generate probabilities for prioritized sampling. Our SRPO algorithm proposes a different criterion to divide the training data, and trains a classifier that is used in reward augmentation.

A.4 Proofs to Theorems in Sec. 4

We first introduce the following lemma which is essential in proving the two theorems in Sec. 4.

Lemma A.1.

Consider two homomorphous MDPs with dynamics $T$ and $T^{\prime}$ . Assuming $T^{\prime}\in(T,\varepsilon_{m})$ , the reward function w.r.t. the action is $\lambda_{1}$ -Lipschitz and the dynamics function w.r.t. the action is $\lambda_{2}$ -inverse Lipschitz, we have

$$
\left|V^{*}_{T}(s)-V^{*}_{T^{\prime}}(s)\right|\leqslant\frac{\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}
\tag{16}
$$

for all $s\in\mathcal{S}$ .

Proof.

Recall the optimal value function under dynamics $T$ follows

$$
V^{*}_{T}(s)=\max\limits_{a}~r(s,a,T(s,a))+\gamma V^{*}_{T}(T(s,a)).
\tag{17}
$$

Without loss of generality, we assume $V^{*}_{T}(s)\geqslant V^{*}_{T^{\prime}}(s)$ on a certain state $s$. Define $a^{*}_{T}=\pi^{*}_{T}(s)$ and $\hat{a}$ such that $T^{\prime}(s,\hat{a})=T(s,a^{*}_{T})=s^{\prime}$. Then we have

$$
\left|T(s,a^{*}_{T})-T(s,\hat{a})\right|=\left|T^{\prime}(s,\hat{a})-T(s,\hat{a})\right|\leqslant\varepsilon_{m},
\tag{18}
$$

and

$$
\left|r(s,a^{*}_{T},s^{\prime})-r(s,\hat{a},s^{\prime})\right|\leqslant\lambda_{1}\left|a^{*}_{T}-\hat{a}\right|\leqslant\lambda_{1}\lambda_{2}\varepsilon_{m}.
\tag{19}
$$

Therefore for all $s\in\mathcal{S}$ ,

$$
\begin{aligned}
\left|V^{*}_{T}(s)-V^{*}_{T^{\prime}}(s)\right|
&=V^{*}_{T}(s)-V^{*}_{T^{\prime}}(s) \\
&=r(s,a^{*}_{T},s^{\prime})+\gamma V^{*}_{T}(s^{\prime})-\max\limits_{a}~\left[r(s,a,T^{\prime}(s,a))+\gamma V^{*}_{T^{\prime}}(T^{\prime}(s,a))\right] \\
&\leqslant r(s,a^{*}_{T},s^{\prime})+\gamma V^{*}_{T}(s^{\prime})-r(s,\hat{a},s^{\prime})-\gamma V^{*}_{T^{\prime}}(s^{\prime}) \\
&\leqslant\lambda_{1}\lambda_{2}\varepsilon_{m}+\gamma\left|V^{*}_{T}(s^{\prime})-V^{*}_{T^{\prime}}(s^{\prime})\right| \\
&\leqslant\lambda_{1}\lambda_{2}\varepsilon_{m}+\gamma\lambda_{1}\lambda_{2}\varepsilon_{m}+\gamma^{2}\left|V^{*}_{T}(s^{\prime\prime})-V^{*}_{T^{\prime}}(s^{\prime\prime})\right| \\
&\leqslant\cdots \\
&\leqslant\frac{\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma},
\end{aligned}
\tag{20}
$$

which concludes the proof. ∎
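The unrolling step marked by the dots in the proof is a geometric series. A short worked version, under the additional assumption (made here only for illustration) that rewards are bounded so that $\delta=\sup_{s}\left|V^{*}_{T}(s)-V^{*}_{T^{\prime}}(s)\right|$ is finite, and writing $c=\lambda_{1}\lambda_{2}\varepsilon_{m}$:

$$
\begin{aligned}
\left|V^{*}_{T}(s)-V^{*}_{T^{\prime}}(s)\right|
&\leqslant c+\gamma c+\cdots+\gamma^{n-1}c+\gamma^{n}\delta \\
&=c\,\frac{1-\gamma^{n}}{1-\gamma}+\gamma^{n}\delta
\;\xrightarrow{\,n\to\infty\,}\;\frac{c}{1-\gamma}.
\end{aligned}
$$

Since $0\leqslant\gamma<1$, the remainder $\gamma^{n}\delta$ vanishes and only the geometric sum survives, which gives the bound of the lemma.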

This lemma shows the discrepancy upper bound between the optimal state value functions in two homomorphous MDPs. We then apply it to prove the second theorem in Sec. 4.

Theorem A.2 (Restatement of Thm. 4.3).

Following the assumptions in Lem. A.1, if the action gap $\Delta$ follows $\Delta>\frac{(2-\gamma)\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}$ , for all $s\in\mathcal{S}$ we have $d^{*}_{T}(s)=d^{*}_{T^{\prime}}(s)$ .

Proof.

Recall that the definition of action gap is $\Delta=\min\limits_{\theta\in\Theta}\min\limits_{s\in\mathcal{S}}\min\limits_{a\neq\pi_{T}^{*}(s)}V_{T_{\theta}}^{*}(s)-Q_{T_{\theta}}^{*}(s,a)$. Therefore, we have

$$
\begin{aligned}
V^{*}_{T}(s)&\geqslant Q^{*}_{T}(s,a)+\Delta \\
&>Q^{*}_{T}(s,a)+\frac{(2-\gamma)\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}
\end{aligned}
\tag{21}
$$

for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ if $a\neq\pi^{*}_{T}(s)$. The same property holds for the transition function $T^{\prime}$. We first show that the state transition probabilities derived from $\pi^{*}_{T}$ and $\pi^{*}_{T^{\prime}}$ are the same: $p_{T}(\cdot|s,\pi^{*}_{T})=p_{T^{\prime}}(\cdot|s,\pi^{*}_{T^{\prime}}),~\forall s\in\mathcal{S}$. Without loss of generality, let $V^{*}_{T}(s)\geqslant V_{T^{\prime}}^{*}(s)$ $(*)$. Let

$$
\begin{aligned}
&\bar{a}=\operatorname*{arg\,max}_{a}~r(s,a,T(s,a))+\gamma V_{T}^{*}(T(s,a)), \\
&a^{\prime}=\operatorname*{arg\,max}_{a}~r(s,a,T^{\prime}(s,a))+\gamma V_{T^{\prime}}^{*}(T^{\prime}(s,a)), \\
&T^{\prime}(s,\tilde{a})=T(s,\bar{a})=\bar{s},~T^{\prime}(s,a^{\prime})=s^{\prime}.
\end{aligned}
\tag{22}
$$

According to Eq. (19), $\|\tilde{a}-\bar{a}\|\leqslant\lambda_{1}\lambda_{2}\varepsilon_{m}$ . Supposing $\bar{s}\neq s^{\prime}(**)$ , we have $\tilde{a}\neq a^{\prime}=\pi^{*}_{T^{\prime}}(s)$ . So

$$
V^{*}_{T^{\prime}}(s)>Q^{*}_{T^{\prime}}(s,\tilde{a})+\frac{(2-\gamma)\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}.
\tag{23}
$$

Meanwhile,

$$
\begin{aligned}
Q^{*}_{T^{\prime}}(s,\tilde{a})
&=r(s,\tilde{a},\bar{s})+\gamma V^{*}_{T^{\prime}}(\bar{s}) \\
&\geqslant r(s,\bar{a},\bar{s})+\gamma V^{*}_{T^{\prime}}(\bar{s})-\lambda_{1}\lambda_{2}\varepsilon_{m} \\
&\geqslant r(s,\bar{a},\bar{s})+\gamma V^{*}_{T}(\bar{s})-\lambda_{1}\lambda_{2}\varepsilon_{m}-\frac{\gamma\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma} \\
&\geqslant V^{*}_{T}(s)-\frac{(2-\gamma)\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}.
\end{aligned}
\tag{24}
$$

Combining Eq. (23) and Eq. (24), we get $V^{*}_{T^{\prime}}(s)>V^{*}_{T}(s)$, which contradicts $(*)$. This means that the assumption $(**)$ does not hold, so $\bar{s}=s^{\prime}$.

We then show that $d^{*}_{T}(s)=d^{*}_{T^{\prime}}(s)$ for all $s\in\mathcal{S}$ :

$$
\begin{aligned}
&\left\|p_{T}(s_{t}=\cdot|\pi_{T}^{*})-p_{T^{\prime}}(s_{t}=\cdot|\pi_{T^{\prime}}^{*})\right\|_{\infty} \\
&=\left\|\sum_{s^{\prime}}p_{T}(\cdot|s^{\prime},\pi^{*}_{T})p_{T}(s_{t-1}=s^{\prime}|\pi_{T}^{*})-p_{T^{\prime}}(\cdot|s^{\prime},\pi_{T^{\prime}}^{*})p_{T^{\prime}}(s_{t-1}=s^{\prime}|\pi_{T^{\prime}}^{*})\right\|_{\infty} \\
&=\left\|\sum_{s^{\prime}}p_{T}(\cdot|s^{\prime},\pi^{*}_{T})\left[p_{T}(s_{t-1}=s^{\prime}|\pi_{T}^{*})-p_{T^{\prime}}(s_{t-1}=s^{\prime}|\pi_{T^{\prime}}^{*})\right]\right\|_{\infty} \\
&\leqslant\left\|\sum_{s^{\prime}}p_{T}(\cdot|s^{\prime},\pi^{*}_{T})\left\|p_{T}(s_{t-1}=\cdot|\pi_{T}^{*})-p_{T^{\prime}}(s_{t-1}=\cdot|\pi_{T^{\prime}}^{*})\right\|_{\infty}\right\|_{\infty} \\
&=\left\|\left\|p_{T}(s_{t-1}=\cdot|\pi_{T}^{*})-p_{T^{\prime}}(s_{t-1}=\cdot|\pi_{T^{\prime}}^{*})\right\|_{\infty}\sum_{s^{\prime}}p_{T}(\cdot|s^{\prime},\pi^{*}_{T})\right\|_{\infty} \\
&=\left\|p_{T}(s_{t-1}=\cdot|\pi_{T}^{*})-p_{T^{\prime}}(s_{t-1}=\cdot|\pi_{T^{\prime}}^{*})\right\|_{\infty} \\
&\leqslant\cdots \\
&\leqslant\left\|p_{T}(s_{0}=\cdot|\pi^{*}_{T})-p_{T^{\prime}}(s_{0}=\cdot|\pi_{T^{\prime}}^{*})\right\|_{\infty} \\
&=0.
\end{aligned}
\tag{25}
$$

Therefore, for all $s\in\mathcal{S}$ , we have $p_{T}(s_{t}=s|\pi_{T}^{*})=p_{T^{\prime}}(s_{t}=s|\pi_{T^{\prime}}^{*})$ . So

$$
\left|d^{*}_{T}(s)-d^{*}_{T^{\prime}}(s)\right|=\left|\sum_{t=0}^{\infty}p_{T}(s_{t}=s|\pi^{*}_{T})-p_{T^{\prime}}(s_{t}=s|\pi_{T^{\prime}}^{*})\right|=0
\tag{26}
$$

for all $s\in\mathcal{S}$ , which concludes the proof. ∎

Before proving the first theorem in Sec. 4, we introduce a lemma that incorporates the 1-Wasserstein distance between the policies. It also considers a reference policy that has the same stationary state distribution as the optimal policy in the other dynamics. Such a policy exists thanks to the homomorphous property of the MDPs.

Lemma A.3.

Following the assumptions in Lem. A.1, for any policy $\hat{\pi}$ such that $d^{\hat{\pi}}_{T}(s)=d^{*}_{T^{\prime}}(s)$ for all $s\in\mathcal{S}$ and $\max_{s}W_{1}\left(\hat{\pi}(\cdot|s),\pi^{*}_{T^{\prime}}(\cdot|s)\right)\leqslant\varepsilon_{\pi}$, we have

$$
\left|\eta_{T}(\pi_{T}^{*})-\eta_{T}(\hat{\pi})\right|\leqslant\dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+\lambda_{1}\varepsilon_{\pi}}{1-\gamma},
\tag{27}
$$

where $W_{1}(\hat{\pi}(\cdot|s),\pi^{*}_{T^{\prime}}(\cdot|s))$ is the 1-Wasserstein distance between two policies.

Proof.

First, $|\eta_{T}(\pi^{*}_{T})-\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})|$ can be bounded with Lem. A.1:

$$
\begin{aligned}
\left|\eta_{T}(\pi^{*}_{T})-\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})\right|
&=\left|\mathbb{E}_{s\in\rho_{0}}V^{*}_{T}(s)-\mathbb{E}_{s\in\rho_{0}}V_{T^{\prime}}^{*}(s)\right| \\
&\leqslant\frac{\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma}.
\end{aligned}
\tag{28}
$$

We then try to bound $|\eta_{T}(\hat{\pi})-\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})|$ . We first define the state-action stationary distributions $D_{1}(s,a)=d_{T}^{\hat{\pi}}(s)\hat{\pi}(a|s)$ and $D_{2}(s,a)=d^{*}_{T^{\prime}}(s)\pi^{*}_{T^{\prime}}(a|s)$ . The accumulated return can be written as

$$
\begin{aligned}
\eta_{T}(\hat{\pi})&=\frac{1}{1-\gamma}\mathbb{E}_{s,a,s^{\prime}\sim D_{1}}\left[r(s,a,s^{\prime})\right], \\
\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})&=\frac{1}{1-\gamma}\mathbb{E}_{s,a,s^{\prime}\sim D_{2}}\left[r(s,a,s^{\prime})\right].
\end{aligned}
\tag{29}
$$

We start from the Lipschitz property of the reward function:

$$
|r(s,a_{1},s^{\prime})-r(s,a_{2},s^{\prime})|\leqslant\lambda_{1}\|a_{1}-a_{2}\|_{1}.
\tag{30}
$$

Taking expectation w.r.t. $d^{*}_{T^{\prime}}(\cdot)$ on both sides, we get

$$
\mathbb{E}_{s\sim d^{*}_{T^{\prime}}}|r(s,a_{1},s^{\prime})-r(s,a_{2},s^{\prime})|\leqslant\mathbb{E}_{s\sim d^{*}_{T^{\prime}}}\lambda_{1}\|a_{1}-a_{2}\|_{1}.
\tag{31}
$$

Let $\mu(A_{1},A_{2}|s)$ be any joint distribution with marginals $\hat{\pi}$ and $\pi^{*}_{T^{\prime}}$ conditioned on $s$. Taking expectation w.r.t. $\mu$ on both sides, we get

$$
\begin{aligned}
\left|\mathbb{E}_{s,a\sim D_{1}}r(s,a,s^{\prime})-\mathbb{E}_{s,a\sim D_{2}}r(s,a,s^{\prime})\right|
&\leqslant\mathbb{E}_{s\sim d^{*}_{T^{\prime}}}\mathbb{E}_{a_{1},a_{2}\sim\mu}|r(s,a_{1},s^{\prime})-r(s,a_{2},s^{\prime})| \\
&\leqslant\lambda_{1}\mathbb{E}_{s\sim d^{*}_{T^{\prime}}}\mathbb{E}_{\mu}\|a_{1}-a_{2}\|_{1} \\
&\leqslant\max\limits_{s}\lambda_{1}\mathbb{E}_{\mu}\|a_{1}-a_{2}\|_{1}.
\end{aligned}
\tag{32}
$$

Eq. (32) holds for every joint distribution $\mu$, so it also holds for $\bar{\mu}=\operatorname*{arg\,min}\limits_{\mu}\lambda_{1}\mathbb{E}_{\mu}\|a_{1}-a_{2}\|_{1}$, leading to the 1-Wasserstein distance:

$$
\left|\mathbb{E}_{s,a\sim D_{1}}r(s,a,s^{\prime})-\mathbb{E}_{s,a\sim D_{2}}r(s,a,s^{\prime})\right|\leqslant\max_{s}\lambda_{1}W_{1}(\hat{\pi}(\cdot|s),\pi^{*}_{T^{\prime}}(\cdot|s)).
\tag{33}
$$
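The step from the Lipschitz property to the Wasserstein bound can be checked numerically in one dimension. The sketch below is an illustration under assumptions: it uses a toy reward $-\lambda_{1}|a|$ with $\lambda_{1}=0.1$, two made-up action samplers standing in for $\hat{\pi}$ and $\pi^{*}_{T^{\prime}}$ at a fixed state, and the fact that for two equal-size empirical distributions on the real line the 1-Wasserstein distance equals the mean absolute difference of the sorted samples.

```python
import random


def wasserstein_1d(xs, ys):
    """W1 between two equal-size empirical distributions on the real line."""
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


LAMBDA_1 = 0.1


def reward(a):
    """Toy action-related reward; |r(a1) - r(a2)| <= LAMBDA_1 * |a1 - a2|."""
    return -LAMBDA_1 * abs(a)


rng = random.Random(0)
n = 1000
# Hypothetical action samples from two policies at the same state.
actions_hat = [rng.uniform(-1.0, 0.5) for _ in range(n)]
actions_star = [rng.uniform(-0.5, 1.0) for _ in range(n)]

reward_gap = abs(sum(map(reward, actions_hat)) / n - sum(map(reward, actions_star)) / n)
bound = LAMBDA_1 * wasserstein_1d(actions_hat, actions_star)
```

Here `reward_gap` never exceeds `bound`, mirroring the inequality above for a single state.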

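According to Eq. (29) and the assumption $\max_{s}W_{1}\left(\hat{\pi}(\cdot|s),\pi^{*}_{T^{\prime}}(\cdot|s)\right)\leqslant\varepsilon_{\pi}$, we have

$$
|\eta_{T}(\hat{\pi})-\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})|\leqslant\frac{\lambda_{1}\varepsilon_{\pi}}{1-\gamma}.
\tag{34}
$$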

Applying the triangle inequality, we get

$$
\begin{aligned}
\left|\eta_{T}(\pi_{T}^{*})-\eta_{T}(\hat{\pi})\right|
&\leqslant\left|\eta_{T}(\pi^{*}_{T})-\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})\right|+\left|\eta_{T}(\hat{\pi})-\eta_{T^{\prime}}(\pi^{*}_{T^{\prime}})\right| \\
&\leqslant\dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+\lambda_{1}\varepsilon_{\pi}}{1-\gamma},
\end{aligned}
\tag{35}
$$

which concludes the proof. ∎

We then use this lemma to prove the first theorem in Sec. 4.

Theorem A.4 (Restatement of Thm. 4.2).

$$
\eta_{T}(\hat{\pi})\geqslant\eta_{T}(\pi_{T}^{*})-\dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+2\lambda_{1}+\sqrt{2}R_{\max}\sqrt{\varepsilon_{s}}}{1-\gamma}.
\tag{36}
$$

Proof.

In two homomorphous MDPs with dynamics $T$ and $T^{\prime}$ , there exists a policy $\tilde{\pi}$ such that $d^{\tilde{\pi}}_{T}(\cdot)=d^{*}_{T^{\prime}}(\cdot)$ . According to Lem. A.3, we have

$$
\left|\eta_{T}(\pi_{T}^{*})-\eta_{T}(\tilde{\pi})\right|\leqslant\dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+\lambda_{1}\varepsilon_{\pi}}{1-\gamma}\leqslant\dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+2\lambda_{1}}{1-\gamma},
\tag{37}
$$

where the second inequality is obtained because the actions are bounded in $[-1,1]$. This term is multiplied by the Lipschitz coefficient $\lambda_{1}$, which tends to be small, so it has little influence on the bound. On the other hand, policies $\hat{\pi}$ and $\tilde{\pi}$ have similar stationary state distributions: $D_{\mathrm{KL}}(d^{\hat{\pi}}_{T}(\cdot)\|d^{\tilde{\pi}}_{T}(\cdot))\leqslant\varepsilon_{s}$. Therefore, their performance gap can be bounded according to results in imitation learning (Lem. 6 in [47]):

$$
|\eta_{T}(\hat{\pi})-\eta_{T}(\tilde{\pi})|\leqslant\frac{\sqrt{2}R_{\max}\sqrt{\varepsilon_{s}}}{1-\gamma}.
\tag{38}
$$

Merging Eq. (37) and Eq. (38), we obtain

\begin{aligned}
\eta_{T}(\pi_{T}^{*})-\eta_{T}(\hat{\pi}) &\leqslant\left|\eta_{T}(\hat{\pi})-\eta_{T}(\pi_{T}^{*})\right| \\
&\leqslant\left|\eta_{T}(\pi_{T}^{*})-\eta_{T}(\tilde{\pi})\right|+\left|\eta_{T}(\hat{\pi})-\eta_{T}(\tilde{\pi})\right| \\
&\leqslant\dfrac{\lambda_{1}\lambda_{2}\varepsilon_{m}+2\lambda_{1}+\sqrt{2}R_{\max}\sqrt{\varepsilon_{s}}}{1-\gamma}.
\end{aligned}

(39)

∎
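To get a feel for the size of the gap in Eq. (36), the following sketch evaluates its three terms numerically. The function name `srpo_gap_bound` and the values chosen for $\varepsilon_{m}$, $\varepsilon_{s}$ and $\gamma$ are illustrative assumptions, not values reported in the paper; $\lambda_{1}$, $\lambda_{2}$ and $R_{\max}$ are the HalfCheetah-v2 entries of Tab. 4.

```python
import math


def srpo_gap_bound(lam1, lam2, r_max, eps_m, eps_s, gamma):
    """Gap on the right-hand side of Eq. (36), returned with its three terms."""
    horizon = 1.0 / (1.0 - gamma)             # effective horizon 1 / (1 - gamma)
    dynamics_term = lam1 * lam2 * eps_m * horizon
    action_term = 2.0 * lam1 * horizon        # uses eps_pi <= 2 as in Eq. (37)
    state_term = math.sqrt(2.0) * r_max * math.sqrt(eps_s) * horizon
    total = dynamics_term + action_term + state_term
    return total, (dynamics_term, action_term, state_term)


# lam1, lam2, r_max: HalfCheetah-v2 row of Tab. 4.
# eps_m, eps_s, gamma: assumed values chosen only for illustration.
total, terms = srpo_gap_bound(lam1=0.1, lam2=1.01, r_max=4.80,
                              eps_m=0.1, eps_s=0.01, gamma=0.99)
print(f"gap bound: {total:.3f}")
print("dynamics / action / state terms:", [round(t, 3) for t in terms])
```

Returning the individual terms makes it easy to see which source of error dominates for a given choice of constants.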

A.5 Discussions on the Theoretical Analysis

The Lipschitz Assumptions in Sec. 4

Regarding the reward function, the Lipschitz property implies that if $s$ and $s^{\prime}$ are kept unchanged, the deviation of the reward $r$ is no larger than $\lambda_{1}$ times the deviation of the action $a$ . Therefore, the Lipschitz coefficient $\lambda_{1}$ depends solely on the action-related terms in the reward function. Note that different actions can exist for the same $s$ and $s^{\prime}$ , since the reward function may be evaluated under different dynamics. Regarding the dynamics function, the Lipschitz property indicates that if the current state $s$ remains unchanged and the actions differ by $\left|a_{1}-a_{2}\right|\geqslant\varepsilon$ , the next states will differ by at least $\left|s_{1}^{\prime}-s_{2}^{\prime}\right|\geqslant\frac{\varepsilon}{\lambda_{2}}$ .

Table 4: $\lambda_{1},\lambda_{2},R_{\max}$ in practical environments.

Environment	Action-related Reward	$\lambda_{1}$	$\lambda_{2}$	$R_{\max}$
CartPole-v0	0	0	1.42	1.00
InvertedPendulum-v2	0	0	8.58	1.00
Swimmer-v2	$-0.0001\|a\|_{2}^{2}$	0.0001	2.59	0.36
HalfCheetah-v2	$-0.1\|a\|_{2}^{2}$	0.1	1.01	4.80
Hopper-v2	$-0.001\|a\|_{2}^{2}$	0.001	3.45	3.80
Walker2d-v2	$-0.001\|a\|_{2}^{2}$	0.001	4.70	$\geqslant 4$
Ant-v2	$-0.5\|a\|_{2}^{2}$	0.5	0.69	6.0
Humanoid-v2	$-0.1\|a\|_{2}^{2}$	0.1	0.03	$\geqslant 8$

In Tab. 4, we list the action-related terms of the reward functions for various RL evaluation environments, along with the corresponding values of $\lambda_{1}$ derived from these terms. Additionally, we sample 50,000 $(s,a,s^{\prime})$ tuples from the replay buffer, slightly modify the action, and observe how the resulting next state $s^{\prime}$ changes. The replay buffer contains trajectories collected during different training phases and should be diverse enough to cover most possible trajectories. This empirical analysis allows us to estimate $\lambda_{2}$ in practice. As indicated in the table, the action-related terms in the reward functions have reasonably small coefficients in all environments, leading to small $\lambda_{1}$ values. Combined with the moderate values of $\lambda_{2}$ , this suggests that the Lipschitz terms $\lambda_{1}$ and $\lambda_{1}\lambda_{2}$ remain small in practical scenarios and do not dominate the error term in Eq. (7). Also, the action gap assumption in Thm. 4.3 is not strong and holds in many situations.
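The perturbation procedure described above can be sketched as follows. This is one plausible reading, not the authors' code: it assumes that $\lambda_{2}$ is taken as the largest observed ratio $|a_{1}-a_{2}|/|s_{1}^{\prime}-s_{2}^{\prime}|$, and it replaces the simulator with a hypothetical linear toy dynamics `toy_step`. The helper names, the noise scale and the randomly generated buffer are all assumptions made for illustration.

```python
import math
import random


def toy_step(state, action, gain=0.5):
    """Hypothetical deterministic dynamics s' = s + gain * a (stand-in for a simulator)."""
    return [s + gain * a for s, a in zip(state, action)]


def l2(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def estimate_lambda2(step_fn, samples, noise=0.01, seed=0):
    """Largest observed ratio |a1 - a2| / |s1' - s2'| over slightly perturbed actions."""
    rng = random.Random(seed)
    worst = 0.0
    for state, action in samples:
        perturbed = [a + rng.uniform(-noise, noise) for a in action]
        d_action = l2(action, perturbed)
        d_next = l2(step_fn(state, action), step_fn(state, perturbed))
        if d_next > 0.0:                      # skip pairs with identical next states
            worst = max(worst, d_action / d_next)
    return worst


# Random (s, a) pairs play the role of tuples drawn from a replay buffer.
rng = random.Random(1)
buffer = [([rng.uniform(-1, 1) for _ in range(3)],
           [rng.uniform(-1, 1) for _ in range(3)]) for _ in range(1000)]
lambda2_hat = estimate_lambda2(toy_step, buffer)
print(f"estimated lambda_2: {lambda2_hat:.3f}")
```

For the toy dynamics the next state moves by `gain` times the action perturbation, so the estimate should come out close to `1 / gain`; with a real simulator one would reset to the stored state before stepping with each action.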

Failure Cases

Although the assumptions are weak and hold in many situations, there are certain scenarios in which they do not hold and the performance of SRPO cannot be guaranteed. For example, in maze environments with different obstacle layouts, the requirement of homomorphous MDPs is violated. There are also cases where the Lipschitz coefficients $\lambda_{1},\lambda_{2}$ can be large, such as stock markets with very high transaction fees.

The Assumption on Dynamics Discrepancy

We mentioned in Sec. 4 that one of the assumptions to prove the theorems is that $T\in(T^{\prime},\varepsilon_{m})$ . In fact, this is a simplification of the actual requirement, which is weaker than the uniform bound of dynamics shift. According to Eq. (18), for any state $s$ we only require $T(s,a)$ and $T^{\prime}(s,a)$ to be close on one specific action $\hat{a}$ such that $T^{\prime}(s,\hat{a})=T\left(s,a_{T}^{*}\right)=s^{\prime}$ . This is a point-wise bound on dynamics shift and is comparable to assumptions in previous analysis [48].

The Tightness of Eq. (7)

Eq. (7) has a form similar to Eq. (1) in Thm. 4.1 of [49], where the return discrepancy $\left|\eta_{T}(\pi_{T}^{*})-\eta_{T}(\hat{\pi})\right|$ is also bounded by differences in the policy distribution and transition functions, but with an order of two in the effective horizon (i.e., with a coefficient $\frac{1}{(1-\gamma)^{2}}$ ). By introducing some assumptions and constraining the policy to have the same stationary state distribution, we obtain a tighter discrepancy bound with an order of one in the effective horizon (i.e., with a coefficient $\frac{1}{1-\gamma}$ ).
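As a quick illustration of the difference in horizon dependence, consider an assumed discount factor of $\gamma=0.99$ (a commonly used value, not one taken from the paper):

```latex
\gamma = 0.99:\qquad
\frac{1}{1-\gamma} = 100,
\qquad
\frac{1}{(1-\gamma)^{2}} = 10^{4}.
```

All other terms being equal, a bound that is linear in the effective horizon therefore carries a coefficient two orders of magnitude smaller at this discount factor.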

Appendix B Experiment Details

B.1 Setup

To generate environments with different transition functions, we alter the XML file of the MuJoCo simulator and change its environment parameters. In online experiments, we build our code on the GitHub repository of CaDM [9] (https://github.com/younggyoseo/CaDM/tree/master). Some customized MuJoCo environments are defined in this repository; their reward functions differ from those of the original environments. We keep these modifications to make our online results comparable with the original CaDM algorithm. In offline experiments, we build our code on the GitHub repository https://github.com/polixir/OfflineRL. The offline datasets are generated by concatenating data sampled in the original MuJoCo simulator with data from simulators whose gravity and medium density are altered. In both online and offline experiments, the evaluation is done in online static environments with all possible values of the environment parameters, and the average of these evaluation results is reported.
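A minimal sketch of this kind of XML edit is given below, assuming that the parameters of interest are the `gravity` and `density` attributes of the MJCF `<option>` element. The inline model, the helper name `make_variant` and the chosen scales and density are hypothetical and serve only as an illustration; they are not the files or values used in the experiments.

```python
import xml.etree.ElementTree as ET

# Minimal MJCF-like model used only for illustration (not a file from the paper).
BASE_XML = """
<mujoco model="toy">
  <option timestep="0.002" gravity="0 0 -9.81" density="0"/>
  <worldbody/>
</mujoco>
"""


def make_variant(xml_text, gravity_scale=1.0, density=None):
    """Return a copy of the model XML with scaled gravity and a new medium density."""
    root = ET.fromstring(xml_text)
    option = root.find("option")
    if option is None:                        # create <option> if the model lacks one
        option = ET.SubElement(root, "option")
    # MuJoCo's default gravity is used when the attribute is absent.
    gravity = [float(v) for v in option.get("gravity", "0 0 -9.81").split()]
    option.set("gravity", " ".join(f"{g * gravity_scale:g}" for g in gravity))
    if density is not None:
        option.set("density", f"{density:g}")
    return ET.tostring(root, encoding="unicode")


# One XML string per dynamics setting; the scales and density are assumed values.
variants = {scale: make_variant(BASE_XML, gravity_scale=scale, density=1.2)
            for scale in (0.5, 1.0, 1.5)}
print(variants[0.5])
```

Each resulting string can be written to its own file and loaded as a separate environment, which gives one simulator per dynamics setting.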

Table 5: Detailed results of ablation studies in offline experiments.

	MAPLE+SRPO	Behavior Regularization	Random Partition	Fixed $\lambda=0.1$	Fixed $\lambda=0.3$
Walker2d-medium-expert	0.66 $\pm$ 0.08	0.70 $\pm$ 0.18	0.42 $\pm$ 0.16	0.66 $\pm$ 0.08	0.38 $\pm$ 0.16
Walker2d-medium	0.84 $\pm$ 0.03	0.71 $\pm$ 0.02	0.79 $\pm$ 0.00	0.72 $\pm$ 0.13	0.84 $\pm$ 0.03
Walker2d-medium-replay	0.17 $\pm$ 0.02	0.16 $\pm$ 0.01	0.14 $\pm$ 0.01	0.17 $\pm$ 0.02	0.16 $\pm$ 0.01
Walker2d-random	0.22 $\pm$ 0.00	0.22 $\pm$ 0.00	0.22 $\pm$ 0.00	0.22 $\pm$ 0.00	0.22 $\pm$ 0.00
Hopper-medium-expert	0.98 $\pm$ 0.02	0.85 $\pm$ 0.25	0.46 $\pm$ 0.14	0.98 $\pm$ 0.02	0.86 $\pm$ 0.18
Hopper-medium	1.03 $\pm$ 0.01	0.78 $\pm$ 0.26	0.76 $\pm$ 0.21	0.53 $\pm$ 0.13	1.03 $\pm$ 0.01
Hopper-medium-replay	1.02 $\pm$ 0.01	0.94 $\pm$ 0.04	0.91 $\pm$ 0.08	1.02 $\pm$ 0.01	0.93 $\pm$ 0.03
Hopper-random	0.32 $\pm$ 0.02	0.13 $\pm$ 0.01	0.12 $\pm$ 0.01	0.13 $\pm$ 0.01	0.32 $\pm$ 0.02
Halfcheetah-medium-expert	0.63 $\pm$ 0.01	0.65 $\pm$ 0.01	0.44 $\pm$ 0.18	0.63 $\pm$ 0.01	0.52 $\pm$ 0.00
Halfcheetah-medium	0.63 $\pm$ 0.01	0.60 $\pm$ 0.00	0.62 $\pm$ 0.02	0.61 $\pm$ 0.02	0.63 $\pm$ 0.01
Halfcheetah-medium-replay	0.55 $\pm$ 0.00	0.54 $\pm$ 0.00	0.54 $\pm$ 0.01	0.55 $\pm$ 0.00	0.24 $\pm$ 0.01
Halfcheetah-random	0.24 $\pm$ 0.01	0.21 $\pm$ 0.03	0.20 $\pm$ 0.01	0.24 $\pm$ 0.01	0.23 $\pm$ 0.01
Average	0.61	0.54	0.47	0.54	0.53

Table 6: Results of offline experiments with a small dataset.

	MOPO	MAPLE	MAPLE+DARA	MAPLE+SRPO (ours)
Walker2d-medium	$0.21\pm 0.13$	$0.45\pm 0.18$	$0.74\pm 0.12$	$\mathbf{0.79}\pm 0.04$
Walker2d-medium-expert	$0.14\pm 0.06$	$0.26\pm 0.01$	$0.38\pm 0.03$	$\mathbf{0.61}\pm 0.11$
Hopper-medium	$0.01\pm 0.00$	$0.42\pm 0.36$	$0.36\pm 0.06$	$\mathbf{0.51}\pm 0.14$
Hopper-medium-expert	$0.01\pm 0.00$	$0.33\pm 0.09$	$0.16\pm 0.04$	$\mathbf{0.40}\pm 0.06$
HalfCheetah-medium	$0.10\pm 0.01$	$0.50\pm 0.06$	$0.37\pm 0.01$	$\mathbf{0.55}\pm 0.03$
HalfCheetah-medium-expert	$-0.03\pm 0.00$	$0.35\pm 0.01$	$\mathbf{0.63}\pm 0.03$	$0.62\pm 0.19$
Average	0.07	0.39	0.44	$\mathbf{0.58}$

B.2 Additional Results

We show full results of online experiments on MuJoCo tasks in Fig. 6. Experiments on environments with a single dynamics are included; these are equivalent to experiments on static environments. PPO, CaDM and CaDM+SRPO perform similarly in these tasks. Full results of the ablation studies are shown in Tab. 5. We also reduce the amount of offline data to 1/3 and perform additional experiments, with results shown in Tab. 6. MAPLE+SRPO still achieves better average performance than the baseline algorithms: it improves the average score by 31% over MAPLE+DARA and 49% over MAPLE. This evidence indicates that SRPO indeed enables efficient data reuse, in accordance with the statements in the introduction.
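The relative improvements can be checked from the Average row of Tab. 6. The short sketch below does this arithmetic; the helper name `relative_gain` is hypothetical, and the inputs are the rounded averages from the table, so the result agrees with the quoted percentages only up to rounding.

```python
# Average normalized scores copied from the "Average" row of Tab. 6.
averages = {
    "MOPO": 0.07,
    "MAPLE": 0.39,
    "MAPLE+DARA": 0.44,
    "MAPLE+SRPO": 0.58,
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


gains = {baseline: relative_gain(averages["MAPLE+SRPO"], averages[baseline])
         for baseline in ("MAPLE+DARA", "MAPLE")}
for baseline, gain in gains.items():
    print(f"MAPLE+SRPO vs {baseline}: {gain:.1f}% higher average score")
```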

B.3 Additional Analysis

To provide an additional example supporting the intuition in Sec. 3.1, we train two policies in the Pendulum environment with 0.5 and 2 times the original friction and then visualize the state and action densities. The results in Fig. 7 are similar to those of the experiments that alter the environment gravity: we observe similar state distributions and different peaks in the action distributions.

Across the different offline RL tasks, MAPLE+SRPO gains the largest improvement in the Hopper environment, where it outperforms all baseline methods in all 4 tasks. In the Walker2d and HalfCheetah environments, however, MAPLE+SRPO outperforms the baselines in only half of the tasks. This difference results from the existence of multiple optimal policies that lead to different stationary state distributions [50, 51]. For example, the agent in the Walker2d environment has many ways to swing its arms to keep balance. When the policy pattern in the offline dataset differs from that of the learning policy, its stationary state distribution may not be a good regularizer. The Hopper agent has fewer degrees of freedom than the other two, so its policy benefits more from regularization with SRPO.

\begin{aligned}
\beta_{t}(s_{t}) &=\int_{\mathcal{S}}p(\mathcal{O}_{t:\infty}|s_{t},s_{t+1},\pi)p(s_{t+1}|s_{t})ds_{t+1} \\
&=\int_{\mathcal{S}}p(\mathcal{O}_{t}|s_{t},\pi)p(\mathcal{O}_{t+1:\infty}|s_{t+1},\pi)p(s_{t+1}|s_{t})ds_{t+1} \\
&=\int_{\mathcal{S}}\max_{a_{t}}\exp(\gamma^{t}r(s_{t},a_{t}))\beta_{t+1}(s_{t+1})p(s_{t+1}|s_{t})ds_{t+1}.
\end{aligned}

(13)

\begin{aligned}
\left|V^{*}_{T}(s)-V^{*}_{T^{\prime}}(s)\right| &=V^{*}_{T}(s)-V^{*}_{T^{\prime}}(s) \\
&=r(s,a^{*}_{T},s^{\prime})+\gamma V^{*}_{T}(s^{\prime})-\max\limits_{a}\left[r(s,a,T^{\prime}(s,a))+\gamma V^{*}_{T^{\prime}}(T^{\prime}(s,a))\right] \\
&\leqslant r(s,a^{*}_{T},s^{\prime})+\gamma V^{*}_{T}(s^{\prime})-r(s,\hat{a},s^{\prime})-\gamma V^{*}_{T^{\prime}}(s^{\prime}) \\
&\leqslant\lambda_{1}\lambda_{2}\varepsilon_{m}+\gamma\left|V^{*}_{T}(s^{\prime})-V^{*}_{T^{\prime}}(s^{\prime})\right| \\
&\leqslant\lambda_{1}\lambda_{2}\varepsilon_{m}+\gamma\lambda_{1}\lambda_{2}\varepsilon_{m}+\gamma^{2}\left|V^{*}_{T}(s^{\prime\prime})-V^{*}_{T^{\prime}}(s^{\prime\prime})\right| \\
&\leqslant\cdots \\
&\leqslant\frac{\lambda_{1}\lambda_{2}\varepsilon_{m}}{1-\gamma},
\end{aligned}

(20)

\begin{aligned}
&\left\|p_{T}(s_{t}=\cdot|\pi_{T}^{*})-p_{T^{\prime}}(s_{t}=\cdot|\pi_{T^{\prime}}^{*})\right\|_{\infty} \\
&=\left\|\sum_{s^{\prime}}p_{T}(\cdot|s^{\prime},\pi^{*}_{T})p_{T}(s_{t-1}=s^{\prime}|\pi_{T}^{*})-p_{T^{\prime}}(\cdot|s^{\prime},\pi_{T^{\prime}}^{*})p_{T^{\prime}}(s_{t-1}=s^{\prime}|\pi_{T^{\prime}}^{*})\right\|_{\infty} \\
&=\left\|\sum_{s^{\prime}}p_{T}(\cdot|s^{\prime},\pi^{*}_{T})\left[p_{T}(s_{t-1}=s^{\prime}|\pi_{T}^{*})-p_{T^{\prime}}(s_{t-1}=s^{\prime}|\pi_{T^{\prime}}^{*})\right]\right\|_{\infty} \\
&\leqslant\left\|\sum_{s^{\prime}}p_{T}(\cdot|s^{\prime},\pi^{*}_{T})\left\|p_{T}(s_{t-1}=\cdot|\pi_{T}^{*})-p_{T^{\prime}}(s_{t-1}=\cdot|\pi_{T^{\prime}}^{*})\right\|_{\infty}\right\|_{\infty} \\
&=\left\|\left\|p_{T}(s_{t-1}=\cdot|\pi_{T}^{*})-p_{T^{\prime}}(s_{t-1}=\cdot|\pi_{T^{\prime}}^{*})\right\|_{\infty}\sum_{s^{\prime}}p_{T}(\cdot|s^{\prime},\pi^{*}_{T})\right\|_{\infty} \\
&=\left\|p_{T}(s_{t-1}=\cdot|\pi_{T}^{*})-p_{T^{\prime}}(s_{t-1}=\cdot|\pi_{T^{\prime}}^{*})\right\|_{\infty} \\
&\leqslant\cdots \\
&\leqslant\left\|p_{T}(s_{0}=\cdot|\pi^{*}_{T})-p_{T^{\prime}}(s_{0}=\cdot|\pi_{T^{\prime}}^{*})\right\|_{\infty} \\
&=0.
\end{aligned}

(25)

\begin{aligned}
\left|\mathbb{E}_{s,a\sim D_{1}}r(s,a,s^{\prime})-\mathbb{E}_{s,a\sim D_{2}}r(s,a,s^{\prime})\right| &\leqslant\mathbb{E}_{s\sim d^{*}_{T^{\prime}}}\mathbb{E}_{a_{1},a_{2}\sim\mu}\left|r(s,a_{1},s^{\prime})-r(s,a_{2},s^{\prime})\right| \\
&\leqslant\lambda_{1}\mathbb{E}_{s\sim d^{*}_{T^{\prime}}}\mathbb{E}_{\mu}\|a_{1}-a_{2}\|_{1} \\
&\leqslant\max\limits_{s}\lambda_{1}\mathbb{E}_{\mu}\|a_{1}-a_{2}\|_{1}.
\end{aligned}

(32)
