Forward and Backward State Abstractions for Off-policy Evaluation

Meiling Hao1∗, Pingfan Su2∗, Liyuan Hu2, Zoltán Szabó2, Qingyuan Zhao3 and Chengchun Shi2† (∗ co-first authors; † corresponding author, email: [email protected])
1 School of Statistics, University of International Business and Economics,
Beijing, 100029, China
2 Department of Statistics,
London School of Economics and Political Science,
London, WC2A 2AE, United Kingdom
3 Statistics in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics (DPMMS), University of Cambridge,
Cambridge, CB3 0WB, United Kingdom

Abstract. Off-policy evaluation (OPE) is crucial for evaluating a target policy’s impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging. This paper studies state abstractions, originally designed for policy learning, in the context of OPE. Our contributions are three-fold: (i) We define a set of irrelevance conditions central to learning state abstractions for OPE. (ii) We derive sufficient conditions for achieving irrelevance in Q-functions and marginalized importance sampling ratios, the latter obtained by constructing a time-reversed Markov decision process (MDP) based on the observed MDP. (iii) We propose a novel two-step procedure that sequentially projects the original state space into a smaller space, which substantially reduces the sample complexity of OPE arising from high cardinality.

1 Introduction

Motivation. Off-policy evaluation (OPE) serves as a crucial tool for assessing the impact of a newly developed policy using pre-collected historical data before its deployment in high-stakes applications, such as healthcare (Murphy et al., 2001), recommendation systems (Chapelle & Li, 2011), education (Mandel et al., 2014), dialog systems (Jiang et al., 2021) and robotics (Levine et al., 2020). A fundamental challenge in OPE is its “off-policy” nature, wherein the target policy to be evaluated differs from the behavior policy that generates the offline data. This distributional shift is particularly pronounced in environments with large state spaces of high cardinality. Theoretically, the minimax rate for estimating the target policy’s Q-function decreases rapidly as the state space dimension increases (Chen & Qi, 2022). Empirically, large state spaces significantly challenge the performance of state-of-the-art OPE algorithms (Fu et al., 2020; Voloshin et al., 2021).

Although different policies induce different trajectories in the large ground state space, they can produce similar paths when restricted to relevant, lower-dimensional state spaces (Pavse & Hanna, 2023). Consequently, applying OPE to these abstract spaces can significantly mitigate the distributional shift between target and behavior policies, enhancing the accuracy in predicting the target policy’s value. This makes state abstraction, designed to reduce state space cardinality, particularly appealing for OPE. However, despite the extensive literature on state abstractions for policy learning (see Section 1.1 for details), state abstraction has hardly been explored in the context of OPE.

Contributions. This paper aims to systematically investigate state abstractions for OPE to address the aforementioned gap. Our main contributions include:

  1. Introduction of a set of irrelevance conditions for OPE, accompanied by validations of various OPE methods when applied to abstract state spaces under these conditions.

  2. Derivation of sufficient conditions for state abstractions to achieve irrelevance in Q-functions and marginalized importance sampling (MIS) ratios. A key ingredient of our proposal lies in constructing a time-reversed Markov decision process (MDP, Puterman, 2014) by swapping the future and past. This effectively yields state abstractions that achieve the irrelevance property.

  3. Development of a novel two-step procedure to sequentially obtain a smaller state space and reduce the sample complexity of OPE. It is also guaranteed to yield a smaller state space compared to existing single-step abstractions.

1.1 Related work

Our proposal is closely related to OPE and state abstraction. Additional related work on confounder selection in causal inference is relegated to Appendix A.

Off-policy evaluation. OPE aims to estimate the average return of a given target policy, utilizing historical data generated by a possibly different behavior policy (Dudík et al., 2014; Uehara et al., 2022). The majority of methods in the literature can be classified into the following three categories:

  1. Value-based methods that estimate the target policy’s return by learning either a value function (Sutton et al., 2008; Luckett et al., 2019; Li et al., 2024) or a Q-function (Le et al., 2019; Feng et al., 2020; Hao et al., 2021; Liao et al., 2021; Chen & Qi, 2022; Shi et al., 2022) from the data.

  2. Importance sampling (IS) methods that adjust the observed rewards using the IS ratio, i.e., the ratio of the target policy over the behavior policy, to address their distributional shift. There are two major types: sequential IS (SIS, Precup, 2000; Thomas et al., 2015; Hanna et al., 2019; Hu & Wager, 2023) which employs a cumulative IS ratio, and marginalized IS (Liu et al., 2018; Nachum et al., 2019; Xie et al., 2019; Dai et al., 2020; Yin & Wang, 2020; Wang et al., 2023) which uses the MIS ratio to mitigate the high variance of the SIS estimator.

  3. Doubly robust methods or their variants that employ both the IS ratio and the value/reward function to enhance the robustness of OPE (Zhang et al., 2013; Jiang & Li, 2016; Thomas & Brunskill, 2016; Farajtabar et al., 2018; Kallus & Uehara, 2020; Tang et al., 2020; Uehara et al., 2020; Shi et al., 2021; Kallus & Uehara, 2022; Liao et al., 2022; Xie et al., 2023).

However, none of the aforementioned works studied state abstraction, which is our primary focus.

State abstraction. State abstraction aims to obtain a parsimonious state representation to reduce the sample complexity of reinforcement learning (RL), while ensuring that the optimal policy restricted to the abstract state space attains comparable values as in the original, ground state space. There is an extensive literature on the theoretical and methodological development of state abstraction, particularly bisimulation, a type of abstraction that preserves the Markov property in the abstracted state (Singh et al., 1994; Dean & Givan, 1997; Givan et al., 2003; Ravindran, 2004; Jong & Stone, 2005; Li et al., 2006; Ferns et al., 2004, 2011; Pathak et al., 2017; Wang et al., 2017; Ha & Schmidhuber, 2018; François-Lavet et al., 2019; Gelada et al., 2019; Castro, 2020; Zhang et al., 2020; Allen et al., 2021; Abel, 2022). In particular, Li et al. (2006) analyzed five irrelevance conditions for optimal policy learning. Unlike the aforementioned works that focus on policy learning, we introduce irrelevance conditions for OPE, and propose abstractions that satisfy these irrelevance properties. Meanwhile, the proposed abstraction for achieving irrelevance for the MIS ratio resembles the Markov state abstraction developed by Allen et al. (2021) in the context of policy learning.

More recently, Pavse & Hanna (2023) made a pioneering attempt to study state abstraction for OPE, proving its benefits in enhancing OPE accuracy. However, they primarily focused on MIS estimators. In contrast, our theoretical analysis applies to a broader range of estimators. Moreover, their abstraction did not achieve MIS-ratio irrelevance, nor did they implement the two-step procedure.

Lastly, state abstraction is also related to variable selection (Tangkaratt et al., 2016; Wang et al., 2017; Zhang & Zhang, 2018; Ma et al., 2023) and representation learning for RL (Abel et al., 2016; Shelhamer et al., 2016; Laskin et al., 2020; Uehara et al., 2021).

2 Preliminaries

In this section, we first introduce some key concepts relevant to OPE in RL, such as the MDP, target and behavior policies, value functions, and IS ratios (Section 2.1). We next review state abstractions for optimal policy learning (Section 2.2), along with four prominent OPE methodologies (Section 2.3).

2.1 Data generating process, policy, value and IS ratio

Data. Assume the offline dataset $\mathcal{D}$ comprises multiple trajectories, each containing a sequence of state-action-reward triplets $(S_t,A_t,R_t)_{t\geq 1}$ following a finite MDP, denoted by $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\rho_0,\gamma\rangle$. Here, $\mathcal{S}$ and $\mathcal{A}$ are the discrete state and action spaces, both with finite cardinalities, $\mathcal{T}$ and $\mathcal{R}$ are the state transition and reward functions, $\rho_0$ denotes the initial state distribution, and $\gamma\in(0,1)$ is the discount factor.

The data is generated as follows: (i) At the initial time, the state $S_1$ is generated according to $\rho_0$; (ii) Subsequently, at each time $t$, the agent finds the environment in a specific state $S_t\in\mathcal{S}$ and selects an action $A_t\in\mathcal{A}$ according to a behavior policy $b$ such that $\mathbb{P}(A_t=a\mid S_t)=b(a\mid S_t)$; (iii) The environment delivers an immediate reward $R_t$ with an expected value of $\mathcal{R}(A_t,S_t)$, and transitions into the next state $S_{t+1}\stackrel{d}{\sim}\mathcal{T}(\bullet\mid A_t,S_t)$ according to the transition function $\mathcal{T}$. Notice that both the reward and transition functions rely only on the current state-action pair $(S_t,A_t)$, independent of the past data history. This ensures that the data satisfies the Markov assumption.
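The three generation steps above can be sketched in code. This is a minimal illustration on a hypothetical two-state, two-action MDP; the names `sample` and `generate_trajectory`, the numerical values, and the choice of a deterministic reward equal to its mean are all assumptions made for the example. Arrays are indexed `[s][a]`, whereas the text writes arguments in the order $(a,s)$.

```python
import random

def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_trajectory(rho0, b, T, R, horizon, rng):
    """Simulate (S_t, A_t, R_t) for t = 1, ..., horizon.

    rho0[s]     : initial state distribution
    b[s][a]     : behavior policy b(a | s)
    T[s][a][s2] : transition probability T(s2 | a, s)
    R[s][a]     : expected reward R(a, s); here the reward is deterministic
    """
    traj = []
    s = sample(rho0, rng)              # (i)   S_1 ~ rho0
    for _ in range(horizon):
        a = sample(b[s], rng)          # (ii)  A_t ~ b(. | S_t)
        r = R[s][a]                    # (iii) reward depends only on (S_t, A_t)
        traj.append((s, a, r))
        s = sample(T[s][a], rng)       #       S_{t+1} ~ T(. | A_t, S_t)
    return traj

# Hypothetical toy MDP, used only for illustration.
rho0 = [0.5, 0.5]
b = [[0.7, 0.3], [0.4, 0.6]]
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.3, 0.7], [0.6, 0.4]]]
R = [[1.0, 0.0], [0.0, 2.0]]

trajectory = generate_trajectory(rho0, b, T, R, horizon=5, rng=random.Random(0))
```

Because each draw uses only the current state (and action), the simulated sequence satisfies the Markov assumption by construction.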

Policy and value. Let $\pi$ denote a given target policy we wish to evaluate. We use $\mathbb{E}^{\pi}$ and $\mathbb{P}^{\pi}$ to denote the expectation and probability assuming the actions are chosen according to $\pi$ at each time. The plain $\mathbb{E}$ and $\mathbb{P}$ without superscripts are taken with respect to the behavior policy $b$. Our objective lies in estimating the expected cumulative reward under $\pi$, denoted by $J(\pi)=\mathbb{E}^{\pi}\big[\sum_{t=1}^{+\infty}\gamma^{t-1}R_t\big]$, using the offline dataset generated under a different policy $b$. Additionally, denote by $V^{\pi}$ and $Q^{\pi}$ the state value function and state-action value function (better known as the Q-function), namely,

$$V^{\pi}(s)=\mathbb{E}^{\pi}\Big[\sum_{t=1}^{+\infty}\gamma^{t-1}R_t \,\Big|\, S_1=s\Big] \quad\text{and}\quad Q^{\pi}(a,s)=\mathbb{E}^{\pi}\Big[\sum_{t=1}^{+\infty}\gamma^{t-1}R_t \,\Big|\, S_1=s, A_1=a\Big]. \tag{1}$$

These functions are pivotal in developing value-based estimators, as described in Method 1 of Section 2.3. Moreover, we use $\pi^*$ to denote the optimal policy that maximizes $J(\pi)$, i.e., $\pi^*\in\arg\max_{\pi}J(\pi)$, and write the optimal Q- and value functions $Q^{\pi^*}$, $V^{\pi^*}$ as $Q^*$, $V^*$ for brevity.
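For a small tabular MDP with known dynamics, the value functions in (1) can be computed numerically. The sketch below assumes the standard Bellman evaluation equation, $Q^{\pi}(a,s)=\mathcal{R}(a,s)+\gamma\sum_{s'}\mathcal{T}(s'\mid a,s)V^{\pi}(s')$ with $V^{\pi}(s)=\sum_a\pi(a\mid s)Q^{\pi}(a,s)$, which is a well-known fact rather than something stated in the text above. The function name `evaluate_policy` and the toy numbers are hypothetical; arrays are indexed `[s][a]`.

```python
def evaluate_policy(pi, T, R, gamma, n_iter=2000):
    """Approximate Q^pi and V^pi by iterating the Bellman evaluation equation.

    pi[s][a]    : target policy pi(a | s)
    T[s][a][s2] : transition probability T(s2 | a, s)
    R[s][a]     : expected reward R(a, s)
    """
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(n_iter):
        # V(s) = sum_a pi(a|s) Q(a, s)
        V = [sum(pi[s][a] * Q[s][a] for a in range(n_a)) for s in range(n_s)]
        # Q(a, s) = R(a, s) + gamma * sum_{s'} T(s'|a, s) V(s')
        Q = [[R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    V = [sum(pi[s][a] * Q[s][a] for a in range(n_a)) for s in range(n_s)]
    return Q, V

# Hypothetical toy MDP, used only for illustration.
rho0 = [0.5, 0.5]
pi = [[0.2, 0.8], [0.3, 0.7]]
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.3, 0.7], [0.6, 0.4]]]
R = [[1.0, 0.0], [0.0, 2.0]]
gamma = 0.9

Q_pi, V_pi = evaluate_policy(pi, T, R, gamma)
# J(pi) is the value function averaged over the initial state distribution.
J_pi = sum(rho0[s] * V_pi[s] for s in range(len(rho0)))
```

The iteration converges geometrically at rate $\gamma$, so a few thousand sweeps are more than enough for $\gamma=0.9$; a direct linear solve would work equally well.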

IS ratio. We also introduce the IS ratio $\rho^{\pi}(a,s)=\pi(a\mid s)/b(a\mid s)$, which quantifies the discrepancy between the target policy $\pi$ and the behavior policy $b$. Furthermore, let $w^{\pi}(a,s)$ denote the MIS ratio $(1-\gamma)\sum_{t\geq 1}\gamma^{t-1}\mathbb{P}^{\pi}(S_t=s,A_t=a)/\lim_{t\rightarrow\infty}\mathbb{P}(S_t=s,A_t=a)$. Here, the numerator represents the discounted visitation probability under the target policy $\pi$, a crucial component in policy-based learning for estimating $\pi^*$ (Sutton et al., 1999; Schulman et al., 2015). The denominator corresponds to the limiting state-action distribution under the behavior policy. These ratios are fundamental in constructing IS estimators, as detailed in Methods 2 and 3 of Section 2.3.
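Both ratios can be computed exactly in a small tabular MDP, which may help make the definitions concrete. The sketch below is illustrative only: it assumes known dynamics, a behavior policy with full support, and a chain under $b$ that converges to a unique limiting distribution. The infinite sum in the numerator is truncated at `n_terms`, and all function names and numbers are hypothetical.

```python
def state_step(p, policy, T):
    """Push a state distribution p one step forward under `policy`."""
    n_s, n_a = len(T), len(T[0])
    return [sum(p[s] * policy[s][a] * T[s][a][s2]
                for s in range(n_s) for a in range(n_a))
            for s2 in range(n_s)]

def is_ratio(pi, b):
    """rho^pi(a, s) = pi(a|s) / b(a|s), stored as rho[s][a]."""
    return [[p / q for p, q in zip(pi_s, b_s)] for pi_s, b_s in zip(pi, b)]

def mis_ratio(rho0, pi, b, T, gamma, n_terms=2000):
    """Return (w, d_pi, lim_b), each indexed [s][a]."""
    n_s, n_a = len(T), len(T[0])
    # Numerator: (1 - gamma) * sum_t gamma^(t-1) P^pi(S_t = s, A_t = a).
    d_pi = [[0.0] * n_a for _ in range(n_s)]
    p, weight = list(rho0), 1.0 - gamma
    for _ in range(n_terms):
        for s in range(n_s):
            for a in range(n_a):
                d_pi[s][a] += weight * p[s] * pi[s][a]
        p, weight = state_step(p, pi, T), weight * gamma
    # Denominator: limiting state-action distribution under b (power iteration).
    q = list(rho0)
    for _ in range(n_terms):
        q = state_step(q, b, T)
    lim_b = [[q[s] * b[s][a] for a in range(n_a)] for s in range(n_s)]
    w = [[d_pi[s][a] / lim_b[s][a] for a in range(n_a)] for s in range(n_s)]
    return w, d_pi, lim_b

# Hypothetical toy MDP, used only for illustration.
rho0 = [0.5, 0.5]
pi = [[0.2, 0.8], [0.3, 0.7]]
b = [[0.7, 0.3], [0.4, 0.6]]
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.3, 0.7], [0.6, 0.4]]]
gamma = 0.9

rho = is_ratio(pi, b)
w, d_pi, lim_b = mis_ratio(rho0, pi, b, T, gamma)
```

A quick sanity check is that `d_pi` and `lim_b` each sum to one, and that the MIS ratio is identically one when $\pi=b$ and the initial distribution is already the limiting one.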

2.2 State abstractions for policy learning

Let $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\rho_0,\gamma\rangle$ be the ground MDP. A state abstraction $\phi$ is a mapping from the state space $\mathcal{S}$ to an abstract state space $\mathcal{X}=\{\phi(s):s\in\mathcal{S}\}$. Below, we review some commonly studied definitions of state abstraction designed for learning the optimal policy $\pi^*$; see Jiang (2018).

Definition 1 ($\pi^*$-irrelevance)

$\phi$ is $\pi^*$-irrelevant if there exists an optimal policy $\pi^*$ such that, for any $s^{(1)}, s^{(2)}\in\mathcal{S}$ with $\phi(s^{(1)})=\phi(s^{(2)})$, we have $\pi^*(a\mid s^{(1)})=\pi^*(a\mid s^{(2)})$ for any $a\in\mathcal{A}$.

Definition 2 ($Q^*$-irrelevance)

$\phi$ is $Q^*$-irrelevant if, for any $s^{(1)}, s^{(2)}\in\mathcal{S}$ with $\phi(s^{(1)})=\phi(s^{(2)})$, the optimal Q-function satisfies $Q^*(a,s^{(1)})=Q^*(a,s^{(2)})$ for any $a\in\mathcal{A}$.

Definitions 1 and 2 are easy to understand, requiring the optimal policy/Q-function to depend on a state $s$ only through its abstraction $\phi(s)$. In practical terms, these definitions encourage the transformation of raw MDP data into a new sequence of state-action-reward triplets $(\phi(S),A,R)$ for policy learning. However, the transformed data may not necessarily satisfy the Markov assumption. This leads us to define the following model-irrelevance, which aims to preserve the MDP structure while ensuring $\pi^*$- and $Q^*$-irrelevance.

Definition 3 (Model-irrelevance)

$\phi$ is model-irrelevant if, for any $s^{(1)}, s^{(2)}\in\mathcal{S}$ with $\phi(s^{(1)})=\phi(s^{(2)})$, the following holds for any $a\in\mathcal{A}$, $s'\in\mathcal{S}$ and $x'\in\mathcal{X}$:

$$\mathcal{R}(a,s^{(1)})=\mathcal{R}(a,s^{(2)}) \quad\text{and}\quad \sum_{s'\in\phi^{-1}(x')}\mathcal{T}(s'\mid a,s^{(1)})=\sum_{s'\in\phi^{-1}(x')}\mathcal{T}(s'\mid a,s^{(2)}). \tag{2}$$

The first condition in (2) corresponds to “reward-irrelevance” whereas the second condition represents “transition-irrelevance”. Consequently, Definition 3 defines a “model-based” abstraction, in contrast to the “model-free” abstractions considered in Definitions 1 and 2. Notice that the term $\sum_{s'\in\phi^{-1}(x')}\mathcal{T}(s'\mid a,s)$, appearing in the second equation of (2), represents the probability of transitioning to $\phi(S')=x'$ in the abstract state space. Thus, the second condition essentially requires the abstract next state $\phi(S')$ to be conditionally independent of $S$ given $A$ and $\phi(S)$. Suppose $S$ can be decomposed into the union of $\phi(S)$ and $\psi(S)$, which represent relevant and irrelevant features, respectively. The condition then implies that the evolution of the relevant features depends solely on themselves and the action, independent of the irrelevant features. This ensures that the transformed data triplets $(\phi(S),A,R)$ remain an MDP. Meanwhile, the evolution of the irrelevant features may still depend on the relevant features; see Figure 1(a) for an illustration.
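The two conditions in (2) can be checked mechanically for a tabular MDP. The sketch below is a hypothetical illustration: each ground state is a pair (relevant bit $x$, irrelevant bit $z$) encoded as $s=2x+z$, the relevant bit evolves using only $(x,a)$, and the irrelevant bit is allowed to depend on both $x$ and $z$. The function `is_model_irrelevant` and all numbers are assumptions made for the example.

```python
def is_model_irrelevant(phi, T, R, tol=1e-12):
    """Check reward- and transition-irrelevance of phi (a list: state -> abstract state).

    T[s][a][s2] : transition probability T(s2 | a, s)
    R[s][a]     : expected reward R(a, s)
    """
    n_s, n_a = len(R), len(R[0])
    cells = sorted(set(phi))

    def abstract_transition(s, a):
        # Probability of landing in each abstract state x' from (s, a).
        return [sum(T[s][a][s2] for s2 in range(n_s) if phi[s2] == x) for x in cells]

    for s1 in range(n_s):
        for s2 in range(s1 + 1, n_s):
            if phi[s1] != phi[s2]:
                continue
            for a in range(n_a):
                if abs(R[s1][a] - R[s2][a]) > tol:           # reward-irrelevance
                    return False
                p1, p2 = abstract_transition(s1, a), abstract_transition(s2, a)
                if any(abs(u - v) > tol for u, v in zip(p1, p2)):  # transition-irrelevance
                    return False
    return True

def bern(p, v):
    """Probability that a Bernoulli(p) variable equals v."""
    return p if v == 1 else 1.0 - p

# Hypothetical factored dynamics: s = 2 * x + z.
P_x = [[0.1, 0.8], [0.6, 0.3]]   # P(x' = 1 | x, a): relevant bit ignores z
P_z = [[0.2, 0.9], [0.5, 0.4]]   # P(z' = 1 | x, z): irrelevant bit may depend on x
R_x = [[1.0, 0.0], [0.0, 2.0]]   # reward depends on (x, a) only

T = [[[bern(P_x[x][a], x2) * bern(P_z[x][z], z2)
       for x2 in (0, 1) for z2 in (0, 1)]
      for a in (0, 1)]
     for x in (0, 1) for z in (0, 1)]
R = [[R_x[x][a] for a in (0, 1)] for x in (0, 1) for z in (0, 1)]

keep_relevant = [0, 0, 1, 1]     # phi(s) = x
keep_irrelevant = [0, 1, 0, 1]   # phi(s) = z
ok = is_model_irrelevant(keep_relevant, T, R)
not_ok = is_model_irrelevant(keep_irrelevant, T, R)
```

In this toy construction, keeping only $x$ passes the check even though $z$ evolves as a function of $x$, while keeping only $z$ fails already at the reward condition.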

It is also known that model-irrelevance implies $Q^*$-irrelevance, which in turn implies $\pi^*$-irrelevance; see, e.g., Theorem 2 in Li et al. (2006). Given that the transformed data remains an MDP under model-irrelevance, one can apply existing state-of-the-art RL algorithms to the abstract state space instead of the original ground space, leading to more effective learning of the optimal policy.

Figure 1: Illustrations of (a) model-irrelevance and (b) backward-model-irrelevance. $\rho_t$ is a shorthand for $\rho^{\pi}(A_t,S_t)$ for any $t\geq 1$.

2.3 OPE methodologies

We focus on four OPE methods, covering the three families of estimators introduced in Section 1.1. Each method employs a specific formula to identify $J(\pi)$, which we detail below. The first method is a popular value-based approach, the Q-function-based method. The second and third methods are the two major IS estimators: SIS and MIS. The fourth method is a semi-parametrically efficient doubly robust method, double RL (DRL), known for achieving the smallest possible MSE among a broad class of OPE estimators (Kallus & Uehara, 2020, 2022).

Method 1 (Q-function-based method). For a given Q-function $Q$, define $f_1(Q)$ as the estimating function $\sum_{a\in\mathcal{A}}\pi(a\mid S_1)Q(a,S_1)$ with $S_1$ being the initial state. By (1) and the definition of $J(\pi)$, it is immediate to see that $J(\pi)=\mathbb{E}[f_1(Q^{\pi})]$. This motivates the Q-function-based method, which uses a plug-in estimator to approximate $\mathbb{E}[f_1(Q^{\pi})]$ and thereby estimates $J(\pi)$. In particular, $Q^{\pi}$ can be estimated by Q-learning type algorithms (e.g., fitted Q-evaluation, FQE, Le et al., 2019), and the expectation can be approximated based on the empirical initial state distribution.
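The plug-in step of Method 1 is short enough to sketch directly. The example below assumes an estimate `Q_hat` is already available (for instance from FQE, which is not implemented here) and simply averages $f_1$ over the observed initial states. The function name `q_based_estimate` and the numbers are hypothetical; arrays are indexed `[s][a]`.

```python
def q_based_estimate(initial_states, pi, Q_hat):
    """Average f_1(Q_hat) = sum_a pi(a | S_1) * Q_hat(a, S_1) over observed initial states."""
    n_a = len(pi[0])
    values = [sum(pi[s][a] * Q_hat[s][a] for a in range(n_a)) for s in initial_states]
    return sum(values) / len(values)

# Hypothetical inputs: one initial state per trajectory and a given Q estimate.
initial_states = [0, 1, 1, 0]
pi = [[0.5, 0.5], [1.0, 0.0]]
Q_hat = [[2.0, 4.0], [1.0, 7.0]]

J_hat = q_based_estimate(initial_states, pi, Q_hat)
```

Only the initial states of the offline trajectories enter this average; the remaining data is used upstream, when fitting `Q_hat`.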

Method 2 (Sequential importance sampling). For a given IS ratio $\rho^{\pi}$, let $\rho_{1:t}^{\pi}$ denote the cumulative IS ratio $\prod_{j=1}^{t}\rho^{\pi}(A_{j},S_{j})$. It follows from the change of measure theorem that the counterfactual reward $\mathbb{E}^{\pi}(R_{t})$ is equivalent to $\mathbb{E}(\rho^{\pi}_{1:t}R_{t})$, whose expectation is taken with respect to the offline data distribution. Assuming all trajectories in $\mathcal{D}$ terminate after a finite time $T$, this allows us to approximate $J(\pi)$ by $\mathbb{E}[f_{2}(\rho^{\pi})]$ where $f_{2}(\rho^{\pi})=\sum_{t=1}^{T}\gamma^{t-1}\rho_{1:t}^{\pi}R_{t}$. The approximation error is bounded by $O(\gamma^{T})$, which decays exponentially fast with respect to $T$. SIS utilizes a plug-in estimator to initially estimate $\rho^{\pi}$ (when the behavior policy is unknown), and subsequently employs this estimator, along with the empirical data distribution, to approximate $\mathbb{E}[f_{2}(\rho^{\pi})]$. However, a notable limitation of this estimator is its rapidly increasing variance due to the use of the cumulative IS ratio $\rho^{\pi}_{1:t}$. Specifically, this variance tends to grow exponentially with respect to $t$, a phenomenon often referred to as the curse of horizon (Liu et al., 2018).
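To make the role of the cumulative ratio concrete, the following is a minimal sketch of the empirical average of $f_{2}$. It assumes the IS ratio is already available as a function `rho(action, state)`; the names `sis_estimate`, `Step` and the toy data are hypothetical and only serve the illustration.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical encoding: one trajectory is a sequence of (state, action, reward).
Step = Tuple[int, int, float]


def sis_estimate(
    trajectories: Sequence[Sequence[Step]],
    rho: Callable[[int, int], float],
    gamma: float,
) -> float:
    """Average of f_2 = sum_t gamma^(t-1) * rho_{1:t} * R_t over trajectories."""
    total = 0.0
    for trajectory in trajectories:
        cumulative_ratio = 1.0  # rho_{1:t}, updated multiplicatively
        for index, (state, action, reward) in enumerate(trajectory):
            cumulative_ratio *= rho(action, state)
            # index = t - 1, so gamma ** index is the discount gamma^(t-1)
            total += (gamma ** index) * cumulative_ratio * reward
    return total / len(trajectories)


# Toy data: a single two-step trajectory with unit rewards.
toy = [[(0, 0, 1.0), (1, 0, 1.0)]]
on_policy = sis_estimate(toy, lambda a, s: 1.0, gamma=0.5)   # all ratios equal 1
off_policy = sis_estimate(toy, lambda a, s: 2.0, gamma=0.5)  # constant ratio 2
```

The constant ratio of 2 is not a valid IS ratio; it is used only to show how the product $\rho_{1:t}^{\pi}$ compounds with $t$, which is the source of the variance growth described above.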

Method 3 (Marginalized importance sampling). The MIS estimator is designed to overcome the limitations of the SIS estimator. It breaks the curse of horizon by incorporating the structure of the MDP model. As noted previously, under the Markov assumption, the reward depends only on the current state-action pair, rather than the entire history. This insight allows us to replace the cumulative IS ratio with the MIS ratio $w^{\pi}$, which depends solely on the current state-action pair. This modification considerably reduces variance because $w^{\pi}$ is no longer history-dependent. Assuming the data trajectory is stationary over time, that is, all state-action-reward $(S,A,R)$ triplets have the same distribution, it can be shown that $J(\pi)=\mathbb{E}[f_{3}(w^{\pi})]$ where $f_{3}(w^{\pi})=(1-\gamma)^{-1}w^{\pi}(A,S)R$ for any triplet $(S,A,R)$. Both $w^{\pi}$ and the expectation can be effectively estimated and approximated using offline data.
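For comparison with SIS, a minimal sketch of the empirical average of $f_{3}$ is given below. It assumes an estimate of the MIS ratio is supplied as a function `w(action, state)` and that the triplets are pooled across trajectories; the function name and the toy data are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical encoding: pooled (state, action, reward) triplets.
Triplet = Tuple[int, int, float]


def mis_estimate(
    triplets: Sequence[Triplet],
    w: Callable[[int, int], float],
    gamma: float,
) -> float:
    """Empirical mean of f_3 = (1 - gamma)^(-1) * w(A, S) * R."""
    if not triplets:
        raise ValueError("at least one (state, action, reward) triplet is required")
    weighted_rewards = sum(w(action, state) * reward
                           for state, action, reward in triplets)
    return weighted_rewards / ((1.0 - gamma) * len(triplets))


# Toy check: unit rewards and w = 1 give the value 1 / (1 - gamma).
pooled = [(0, 0, 1.0), (1, 1, 1.0)]
value = mis_estimate(pooled, lambda a, s: 1.0, gamma=0.5)
```

Each triplet is weighted by a single ratio rather than by a product over the history, which is why no term grows with the horizon.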

Method 4 (Double reinforcement learning). DRL combines the Q-function-based method with MIS. Let $f_{4}(Q,w)=f_{1}(Q)+(1-\gamma)^{-1}w(A,S)[R+\gamma\sum_{a}\pi(a|S^{\prime})Q(a,S^{\prime})-Q(A,S)]$, where $f_{1}$ is defined in Method 1 and $(S,A,R,S^{\prime})$ denotes a state-action-reward-next-state tuple. Under the stationarity assumption, it can be shown that $J(\pi)=\mathbb{E}[f_{4}(Q,w)]$ when either $Q=Q^{\pi}$ or $w=w^{\pi}$ (Kallus & Uehara, 2022). DRL proposes to learn both $Q^{\pi}$ and $w^{\pi}$ from the data, employing these estimators to calculate $\mathbb{E}[f_{4}(Q,w)]$ and approximating the expectation with the empirical data distribution. The resulting estimator benefits from double robustness: it is consistent when either $Q^{\pi}$ or $w^{\pi}$ is correctly specified.
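One half of the double robustness property can be checked directly. The sketch below takes $Q=Q^{\pi}$ with an arbitrary $w$, and assumes (as Method 1 is understood here) that $J(\pi)=\mathbb{E}[f_{1}(Q^{\pi})]$. The only other ingredient is the Bellman equation for $Q^{\pi}$.

```latex
% Q-side of double robustness: Q = Q^\pi, w arbitrary.
% Bellman equation: Q^\pi(A,S) = E[ R + \gamma \sum_a \pi(a|S') Q^\pi(a,S') | A, S ],
% so the temporal-difference term has zero conditional mean given (A, S).
\begin{align*}
\mathbb{E}[f_{4}(Q^{\pi},w)]
  &= \mathbb{E}[f_{1}(Q^{\pi})]
   + (1-\gamma)^{-1}\,\mathbb{E}\Big[\, w(A,S)\,
     \mathbb{E}\Big\{ R+\gamma\sum_{a}\pi(a|S^{\prime})Q^{\pi}(a,S^{\prime})
       - Q^{\pi}(A,S) \,\Big|\, A,S \Big\} \Big] \\
  &= \mathbb{E}[f_{1}(Q^{\pi})] + 0 \;=\; J(\pi).
\end{align*}
```

The first equality is the tower property of conditional expectation. The other half (with $w=w^{\pi}$ and arbitrary $Q$) requires the definition of $w^{\pi}$ and is not reproduced here.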

3 Proposed state abstractions for policy evaluation

Here, we propose model-free (Section 3.1) and model-based irrelevance conditions (Section 3.2) for OPE, and analyze the OPE estimators under these conditions (Theorem 1, Theorem 2, Theorem 3). Motivated by this analysis, we propose our two-step procedure (Section 3.3).

3.1 Model-free irrelevance conditions

We first introduce several model-free irrelevance conditions tailored for OPE.

Definition 4 ($\pi$-irrelevance)

$\phi$ is $\pi$-irrelevant if for any $s^{(1)},s^{(2)}\in\mathcal{S}$ whenever $\phi(s^{(1)})=\phi(s^{(2)})$, we have $\pi(a|s^{(1)})=\pi(a|s^{(2)})$ for any $a\in\mathcal{A}$.

Definition 5 ($Q^{\pi}$-irrelevance)

$\phi$ is $Q^{\pi}$-irrelevant if for any $s^{(1)},s^{(2)}\in\mathcal{S}$ whenever $\phi(s^{(1)})=\phi(s^{(2)})$, we have $Q^{\pi}(a,s^{(1)})=Q^{\pi}(a,s^{(2)})$ for any $a\in\mathcal{A}$.

Definitions 4 and 5 are adaptations of Definitions 1 and 2 designed for policy evaluation, with the optimal policy $\pi^{*}$ replaced by the target policy $\pi$. The following definitions are tailored for IS estimators (see Methods 2 and 3 in Section 2.3).

Definition 6 ($\rho^{\pi}$-irrelevance)

$\phi$ is $\rho^{\pi}$-irrelevant if for any $s^{(1)},s^{(2)}\in\mathcal{S}$ whenever $\phi(s^{(1)})=\phi(s^{(2)})$, we have $\rho^{\pi}(a,s^{(1)})=\rho^{\pi}(a,s^{(2)})$ for any $a\in\mathcal{A}$.

Definition 7 ($w^{\pi}$-irrelevance)

$\phi$ is $w^{\pi}$-irrelevant if for any $s^{(1)},s^{(2)}\in\mathcal{S}$ whenever $\phi(s^{(1)})=\phi(s^{(2)})$, we have $w^{\pi}(a,s^{(1)})=w^{\pi}(a,s^{(2)})$ for any $a\in\mathcal{A}$.
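Definitions 4 to 7 share one template: a function of $(a,s)$ must take the same value on any two states that $\phi$ maps together. On a finite state space this can be verified by enumeration. The sketch below is illustrative only; `is_irrelevant` and the toy policy and Q-function are hypothetical names and values, not part of the paper.

```python
from itertools import combinations
from typing import Callable, Hashable, Sequence


def is_irrelevant(
    phi: Callable[[int], Hashable],
    f: Callable[[int, int], float],
    states: Sequence[int],
    actions: Sequence[int],
    tol: float = 1e-12,
) -> bool:
    """True if f(a, s1) == f(a, s2) for all a whenever phi(s1) == phi(s2).

    f(a, s) may stand for pi(a|s), Q^pi(a, s), rho^pi(a, s) or w^pi(a, s).
    """
    for s1, s2 in combinations(states, 2):
        if phi(s1) != phi(s2):
            continue  # the condition only constrains states that are merged
        if any(abs(f(a, s1) - f(a, s2)) > tol for a in actions):
            return False
    return True


# Toy example: four states merged in pairs {0, 1} and {2, 3}.
states, actions = [0, 1, 2, 3], [0, 1]
phi = lambda s: s // 2
pi = lambda a, s: 0.8 if a == s // 2 else 0.2   # depends on s only through phi(s)
q = lambda a, s: float(s % 2 + a)               # differs within a merged pair

pi_ok = is_irrelevant(phi, pi, states, actions)
q_ok = is_irrelevant(phi, q, states, actions)
```

In this toy case `phi` is $\pi$-irrelevant but not irrelevant for the function `q`, since `q` distinguishes states 0 and 1 that `phi` merges.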

Based on the aforementioned definitions, we can immediately state the following theorem:

Theorem 1 (OPE under model-free irrelevance conditions)

Under $Q^{\pi}$-, $\rho^{\pi}$- or $w^{\pi}$-irrelevance, the corresponding methods remain valid when applied to the abstract state space:

  • Under $Q^{\pi}$-irrelevance, the Q-function-based method (Method 1) remains valid, i.e., the Q-function $Q_{\phi}^{\pi}$ defined on the abstract state space satisfies $\mathbb{E}[f_{1}(Q^{\pi})]=\mathbb{E}[f_{1}(Q^{\pi}_{\phi})]$;

  • Under $\rho^{\pi}$-irrelevance, SIS (Method 2) remains valid, i.e., the IS ratio $\rho_{\phi}^{\pi}$ defined on the abstract state space satisfies $\mathbb{E}[f_{2}(\rho^{\pi})]=\mathbb{E}[f_{2}(\rho^{\pi}_{\phi})]$;

  • Under $w^{\pi}$-irrelevance, MIS (Method 3) remains valid, i.e., the MIS ratio $w_{\phi}^{\pi}$ defined on the abstract state space satisfies $\mathbb{E}[f_{3}(w^{\pi})]=\mathbb{E}[f_{3}(w^{\pi}_{\phi})]$.

Moreover, when $\phi$ satisfies either $Q^{\pi}$-irrelevance or $w^{\pi}$-irrelevance, DRL (Method 4) remains valid, i.e., $Q_{\phi}^{\pi}$ and $w_{\phi}^{\pi}$ defined on the abstract state space satisfy $\mathbb{E}[f_{4}(Q^{\pi},w^{\pi})]=\mathbb{E}[f_{4}(Q^{\pi}_{\phi},w^{\pi}_{\phi})]$.

Theorem 1 validates the four OPE methods presented in Section 2.3 when applied to the abstract state space, under the corresponding irrelevance conditions. Notably, DRL requires weaker irrelevance conditions compared to the Q-function-based method and MIS, owing to its inherent double robustness property. Nevertheless, methods for deriving abstractions that satisfy these conditions (particularly $Q^{\pi}$- and $w^{\pi}$-irrelevance) remain unclear. Furthermore, the state-action-reward triplets transformed via these abstractions, $(\phi(S),A,R)$, might not maintain the MDP structure. This complicates the process of learning $Q_{\phi}^{\pi}$ and $w_{\phi}^{\pi}$. These challenges motivate us to develop model-based irrelevance conditions in the subsequent section.

3.2 Model-based irrelevance conditions

To begin with, we discuss two perspectives of the data generated within the MDP framework; see Figure 2 for a graphical illustration.

  1. The first perspective is the traditional forward MDP model with all state-action-reward triplets sequenced by time index. This yields the model-based irrelevance condition defined in Definition 3. We will discuss the relationship between this condition and Definitions 5-7 below.

  2. The second perspective offers a backward view by reversing the time order. Specifically, due to the symmetric nature of the Markov assumption (if the future is independent of the past given the present, then the past must also be independent of the future given the present), the reversed state-action pairs also maintain the Markov property. Leveraging this property, we define another backward MDP, which forms the basis for deriving model-based conditions for achieving $w^{\pi}$-irrelevance and motivates the subsequent two-step procedure. This development represents one of our main contributions.

Figure 2: Illustrations of (a) the forward MDP model and (b) the backward MDP model.

Forward MDP-based model-irrelevance. We first explore the relationship between the model-irrelevance given in Definition 3, and the notions of $Q^{\pi}$-, $\rho^{\pi}$- and $w^{\pi}$-irrelevance.

Theorem 2 (OPE under model-irrelevance)

Let $\phi$ denote a model-irrelevant abstraction.

  • If $\phi$ is additionally $\pi$-irrelevant, then $\phi$ is also $Q^{\pi}$-irrelevant.

  • While $\phi$ is not necessarily $w^{\pi}$-irrelevant, MIS (Method 3) remains valid when applied to the abstract state space. Indeed, the validity only requires reward-irrelevance (see the first part of (2)).

  • While $\phi$ is not necessarily $\rho^{\pi}$-irrelevant, SIS (Method 2) remains valid when applied to the abstract state space if $\phi$ is additionally $\pi$-irrelevant.

  • DRL (Method 4) remains valid when applied to the abstract state space.

The first bullet point establishes the link between model-irrelevance and $Q^{\pi}$-irrelevance, thus proving the validity of the Q-function-based method when applied to the abstract state space. To satisfy $Q^{\pi}$-irrelevance, we need both model-irrelevance and $\pi$-irrelevance. In our implementation, we first adapt existing algorithms (Ha & Schmidhuber, 2018; François-Lavet et al., 2019; Gelada et al., 2019) to train a model-irrelevant abstraction $\phi$, parameterized via deep neural networks. We next combine $\phi(s)$ with $\{\pi(a|s):a\in\mathcal{A}\}$ to obtain a new abstraction $\phi_{for}(s)$. This augmentation ensures $\phi_{for}(s)$ is $\pi$-irrelevant, and hence $Q^{\pi}$-irrelevant. Refer to Appendix B.1 for the detailed procedures.
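The augmentation step can be illustrated on a finite action set. The sketch below assumes that "combining" means pairing $\phi(s)$ with the vector of target-policy probabilities; this is one plausible reading, and the procedure in Appendix B.1 may differ. The helper name `augment_with_policy` and the rounding used to compare floating-point probabilities are assumptions made for the illustration.

```python
from typing import Callable, Hashable, Sequence, Tuple


def augment_with_policy(
    phi: Callable[[int], Hashable],
    pi: Callable[[int, int], float],
    actions: Sequence[int],
    ndigits: int = 8,
) -> Callable[[int], Tuple[Hashable, Tuple[float, ...]]]:
    """Build phi_for(s) = (phi(s), (pi(a|s) for a in actions)).

    Two states share an abstract state under phi_for only if they agree on
    phi AND on every pi(a|s) (up to rounding), so phi_for is pi-irrelevant
    by construction.
    """
    def phi_for(state: int) -> Tuple[Hashable, Tuple[float, ...]]:
        policy_vector = tuple(round(pi(a, state), ndigits) for a in actions)
        return (phi(state), policy_vector)

    return phi_for


# Toy example: phi merges all states, but the target policy depends on s % 2.
phi = lambda s: 0
pi = lambda a, s: 0.8 if a == s % 2 else 0.2
phi_for = augment_with_policy(phi, pi, actions=[0, 1])

same_cell = phi_for(0) == phi_for(2)       # same phi value, same policy vector
different_cell = phi_for(0) != phi_for(1)  # policy differs, so states are split
```

The augmented abstraction is never coarser than $\phi$: it only splits abstract states on which the target policy disagrees.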

The last three bullet points prove the validity of SIS, MIS and DRL, even though $\phi$ is not necessarily $w^{\pi}$-irrelevant or $\rho^{\pi}$-irrelevant. By definition, $\rho^{\pi}$-irrelevance can be achieved by selecting state features that adequately predict the IS ratio. However, methods for constructing $w^{\pi}$-irrelevant abstractions remain less clear. In the following, we introduce a backward MDP model-based irrelevance condition that ensures $w^{\pi}$-irrelevance. We also note that findings similar to those in the first two bullet points have previously been documented in Li et al. (2006) and Pavse & Hanna (2023), respectively. However, the properties of SIS and DRL estimators under model-irrelevance conditions, as summarized in our last two bullet points, remain unexplored in the existing literature.

Backward MDP-based model-irrelevance. To illustrate the rationale behind the proposed model-based abstraction, we introduce the backward MDP model by reversing the time index. Under the (forward) MDP model assumption described in Section 2.1 and assuming that the behavior policy $b$ is not history-dependent, actions and states following $S_{t}$ are independent of those that occurred prior to the realization of $S_{t}$. Accordingly, $(S_{t-1},A_{t-1})$ is conditionally independent of $\{(S_{k},A_{k})\}_{k>t}$ given $S_{t}$. Recall that $T$ corresponds to the termination time of trajectories in $\mathcal{D}$. We define a time-reversed process consisting of state-action-reward triplets $\{(S_{t},A_{t},\rho^{\pi}(A_{t},S_{t})):t=T,\dots,1\}$. Its dynamics are described as follows (see also Figure 2(b) for the configuration):

  • State-action transition: Due to the aforementioned Markov property, the transition of the past state $S_{t+1}$ in the reversed process (future state in the original process) into the current state $S_{t}$ is independent of the past action $A_{t+1}$ in the reversed process (future action in the original process), while the behavior policy that generates $A_{t}$ depends on both the current state $S_{t}$ and the past state $S_{t+1}$ in the reversed process. This yields the time-reversed state-action transition function $\mathbb{P}(A_{t}=a,S_{t}=s|S_{t+1})$.

  • Reward generation: For each state-action pair $(S_{t},A_{t})$, we manually set the reward to the IS ratio $\rho^{\pi}(A_{t},S_{t})$, which plays a crucial role in constructing IS estimators.
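In a tabular setting, the time-reversed state-action transition can be obtained from the forward quantities by Bayes' rule: $\mathbb{P}(A_{t}=a,S_{t}=s|S_{t+1}=s^{\prime})\propto\mathbb{P}(S_{t}=s)\,b(a|s)\,\mathbb{P}(S_{t+1}=s^{\prime}|S_{t}=s,A_{t}=a)$. The sketch below is a minimal illustration under that assumption; the nested-dictionary layout and the names `reversed_transition`, `p_state`, `b` and `P` are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def reversed_transition(
    p_state: Dict[int, float],                   # P(S_t = s)
    b: Dict[int, Dict[int, float]],              # b[s][a] = behavior policy b(a|s)
    P: Dict[int, Dict[int, Dict[int, float]]],   # P[s][a][s_next], forward transition
    states: Sequence[int],
    actions: Sequence[int],
) -> Dict[int, Dict[Tuple[int, int], float]]:
    """Return rev[s_next][(a, s)] = P(A_t = a, S_t = s | S_{t+1} = s_next)."""
    rev = {}
    for s_next in states:
        joint = {
            (a, s): p_state[s] * b[s][a] * P[s][a][s_next]
            for s in states
            for a in actions
        }
        normalizer = sum(joint.values())  # equals P(S_{t+1} = s_next)
        if normalizer > 0:                # skip next states that are never reached
            rev[s_next] = {key: mass / normalizer for key, mass in joint.items()}
    return rev


# Toy forward MDP: action a moves deterministically to state a.
states, actions = [0, 1], [0, 1]
p_state = {0: 0.5, 1: 0.5}
b = {s: {0: 0.5, 1: 0.5} for s in states}
P = {s: {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}} for s in states}
rev = reversed_transition(p_state, b, P, states, actions)
```

In the toy example, observing $S_{t+1}=0$ reveals that $A_{t}=0$ but carries no information about $S_{t}$, so the reversed transition puts mass 0.5 on each of $(a,s)=(0,0)$ and $(0,1)$.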

Given this MDP, analogous to Definition 3, our objective is to identify a state abstraction that is crucial for predicting the reward (i.e., the IS ratio) and the reversed transition function. We provide the formal definition of the proposed backward MDP-based model-irrelevance (backward-model-irrelevance for short) below.

Definition 8 (Backward-model-irrelevance)

$\phi$ is backward-model-irrelevant if for any $s^{(1)},s^{(2)}\in\mathcal{S}$ whenever $\phi(s^{(1)})=\phi(s^{(2)})$, the following hold for any $a\in\mathcal{A}$, $x\in\mathcal{X}$ and $t\in\mathbb{N}^{+}$:

$$\text{(i)}\quad \rho^{\pi}(a,s^{(1)})=\rho^{\pi}(a,s^{(2)});$$
$$\text{(ii)}\quad \sum_{s\in\phi^{-1}(x)}\mathbb{P}(A_{t}=a,S_{t}=s|S_{t+1}=s^{(1)})=\sum_{s\in\phi^{-1}(x)}\mathbb{P}(A_{t}=a,S_{t}=s|S_{t+1}=s^{(2)}). \tag{3}$$
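For a finite state space and a single time step, both conditions of Definition 8 can be checked by enumeration. The sketch below is illustrative only: it assumes the reversed transition is supplied as a table `rev[s_next][(a, s)]` (a hypothetical layout) and ignores the requirement that the conditions hold for every $t$.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, Hashable, Sequence, Tuple

ReversedRow = Dict[Tuple[int, int], float]  # (a, s) -> P(A_t=a, S_t=s | S_{t+1})


def aggregate(row: ReversedRow, phi: Callable[[int], Hashable]) -> Dict:
    """Sum the reversed transition over s in phi^{-1}(x), for each (a, x)."""
    totals = defaultdict(float)
    for (action, state), mass in row.items():
        totals[(action, phi(state))] += mass
    return totals


def is_backward_model_irrelevant(
    phi: Callable[[int], Hashable],
    rho: Callable[[int, int], float],
    rev: Dict[int, ReversedRow],
    states: Sequence[int],
    actions: Sequence[int],
    tol: float = 1e-9,
) -> bool:
    for s1, s2 in combinations(states, 2):
        if phi(s1) != phi(s2):
            continue
        # Condition (i): the IS ratio agrees on merged states.
        if any(abs(rho(a, s1) - rho(a, s2)) > tol for a in actions):
            return False
        # Condition (ii): aggregated reversed transitions agree.
        agg1, agg2 = aggregate(rev[s1], phi), aggregate(rev[s2], phi)
        if any(abs(agg1[k] - agg2[k]) > tol for k in set(agg1) | set(agg2)):
            return False
    return True


# Toy example: states merged in pairs {0, 1} and {2, 3}.
states, actions = [0, 1, 2, 3], [0, 1]
phi = lambda s: s // 2
rho = lambda a, s: 1.5 if a == s // 2 else 0.5
rev = {
    0: {(0, 0): 1.0},
    1: {(0, 1): 1.0},               # differs from rev[0] only within phi^{-1}(0)
    2: {(0, 2): 0.5, (1, 3): 0.5},
    3: {(0, 3): 0.5, (1, 2): 0.5},
}
holds = is_backward_model_irrelevant(phi, rho, rev, states, actions)
fails = is_backward_model_irrelevant(phi, lambda a, s: float(s), rev, states, actions)
```

The rows `rev[0]` and `rev[1]` differ as distributions over raw states yet agree after aggregation over $\phi^{-1}(x)$, which is exactly what condition (ii) requires.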

The conditions of backward-model-irrelevance are similar to those specified for model-irrelevance outlined in Definition 3. The first condition (i) essentially requires reward-irrelevance, i.e., $\rho^{\pi}$-irrelevance, in the backward MDP. The second condition (ii) in equation (3) is equivalent to the conditional independence assumption between the pair $(A_{t},\phi(S_{t}))$ and $S_{t+1}$ given $\phi(S_{t+1})$. As previously assumed, $S_{t}$ can be decomposed into the union of relevant features $\phi(S_{t})$ and irrelevant features $\psi(S_{t})$, leading to the following factorization:

$$\mathbb{P}(S_{t+1}=s^{\prime}|A_{t},\phi(S_{t}))=\mathbb{P}(\psi(S_{t+1})=\psi(s^{\prime})|\phi(S_{t+1})=\phi(s^{\prime}))\,\mathbb{P}(\phi(S_{t+1})=\phi(s^{\prime})|A_{t},\phi(S_{t})).$$

This indicates a two-step transition in the forward model: initially from $(\phi(S_{t}),A_{t})$ to $\phi(S_{t+1})$, and then from $\phi(S_{t+1})$ to $\psi(S_{t+1})$. Importantly, the generation of $\psi(S_{t+1})$ in the second step is conditionally independent of $A_{t}$ and $\phi(S_{t})$. Consequently, $\phi$ extracts state representations that are influenced either by past actions or past relevant features; see Figure 1(b) for an illustration. Combined with $\rho^{\pi}$-irrelevance, this ensures that all information contained within the historical IS ratios $\{\rho^{\pi}(A_{k},S_{k})\}_{k<t}$ can be effectively summarized using a single $A_{t-1}$ and the abstract state $\phi(S_{t-1})$, thus achieving $w^{\pi}$-irrelevance (see Theorem 3 below).

Theorem 3 (OPE under backward-model-irrelevance)

Assume $\phi$ is backward-model-irrelevant.

  • $\phi$ is both $\rho^{\pi}$-irrelevant and $w^{\pi}$-irrelevant.

  • While $\phi$ is not necessarily $Q^{\pi}$-irrelevant, the Q-function-based method (Method 1) remains valid when applied to the abstract state space.

  • DRL (Method 4) remains valid when applied to the abstract state space.

The first bullet point in Theorem 3 validates the two IS methods when applied to the abstract state space under the proposed backward-model-irrelevance, whereas the last two bullet points validate the Q-function-based method and DRL.

To conclude this section, we draw a connection between the proposed backward-model-irrelevant abstraction for OPE and the Markov state abstraction (MSA) developed by Allen et al. (2021) for policy learning. MSA imposes two conditions: (i) inverse-model-irrelevance, which requires $A_t$ to be conditionally independent of $S_t$ and $S_{t+1}$ given $\phi(S_t)$ and $\phi(S_{t+1})$; (ii) density-ratio-irrelevance, which requires $\phi(S_t)$ to be conditionally independent of $S_{t+1}$ given $\phi(S_{t+1})$. For effective policy learning, MSA requires both conditions to hold in data generating processes following a diverse range of behavior policies. When restricted to a single behavior policy, the two conditions are closely related to our backward-model-irrelevance. In particular, they imply our proposed condition in (8), whereas (8) in turn yields density-ratio-irrelevance. This allows us to adapt their algorithm to train state abstractions that satisfy backward-model-irrelevance; see Appendix B.2 for details.

Figure 3: Illustrations of (a) the two-step procedure and (b) an MDP with three groups of state variables, denoted by $\{S_{t,1}\}_t$, $\{S_{t,2}\}_t$ and $\{S_{t,3}\}_t$.

3.3 Two-step procedure for forward and backward state abstraction

The proposed two-step procedure proceeds as follows (see Figure 3(a) for a visualization):

  1. Forward abstraction: Learn an abstraction $\phi_1$ from the ground state space $\mathcal{S} = \mathcal{X}_0$ to $\mathcal{X}_1$, using the data triplets $(S, A, R)$, such that $\phi_1$ is both (forward-)model-irrelevant and $\pi$-irrelevant.

  2. Backward abstraction: Learn an abstraction $\phi_2$ from the abstract state space $\mathcal{X}_1$ to $\mathcal{X}_2$, using the data triplets $(\phi_1(S), A, R)$, such that $\phi_2$ is backward-model-irrelevant.

  3. Output $\mathcal{X}_2$ for off-policy evaluation.

To summarize, our approach applies the forward and backward abstractions in sequence, each to the state produced by the previous step, progressively reducing state cardinality. To illustrate how the two-step procedure reduces state cardinality, we first analyze a toy example.
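The composition of the two abstractions can be sketched as follows. This is a minimal illustration under stated assumptions, not the learning procedure itself: both abstractions are hard-coded variable selections, transitions are stored as hypothetical (S, A, R, S') tuples, and the helper names `select`, `two_step` and `abstract_dataset` are introduced here for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Abstraction = Callable[[State], State]
Transition = Tuple[State, int, float, State]


def select(indices: Sequence[int]) -> Abstraction:
    """Variable-selection abstraction: keep only the listed coordinates."""
    return lambda s: tuple(s[i] for i in indices)


def two_step(phi1: Abstraction, phi2: Abstraction) -> Abstraction:
    """Compose the forward map phi1 (X0 -> X1) with the backward map phi2 (X1 -> X2)."""
    return lambda s: phi2(phi1(s))


def abstract_dataset(data: List[Transition], phi: Abstraction) -> List[Transition]:
    """Apply phi to both S and S' in every (S, A, R, S') transition."""
    return [(phi(s), a, r, phi(s_next)) for (s, a, r, s_next) in data]


# Hypothetical "learned" abstractions, hard-coded here (assumption).
phi1 = select([0, 1])  # forward step: drops coordinate 2 of the ground state
phi2 = select([1])     # backward step: defined on X1, drops its coordinate 0
phi = two_step(phi1, phi2)

data = [((0.1, 0.2, 0.3), 1, 1.0, (0.4, 0.5, 0.6))]
abstracted = abstract_dataset(data, phi)
```

Note that `phi2` takes the output of `phi1` as its input, mirroring the fact that the backward abstraction is learned on $\mathcal{X}_1$ rather than on the ground state space.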

A toy example: Consider an MDP where the state variables can be classified into three groups, depicted in Figure 3(b). For this example, we focus on a specific type of state abstraction known as variable selection, which selects a sub-vector from the original state. Key observations from this example are as follows: (i) The reward depends on the state only through the first group of variables; (ii) The evolution of the first group of variables depends only on the second group, and this dependency is indirect. Specifically, the second group evolves first at each time step and subsequently influences the first group; (iii) The second and third groups in the MDP evolve independently, each relying solely on their own previous states; (iv) The behavior policy depends only on the last two groups; (v) Only the second group of variables is directly influenced by the previous action.

According to (i), selecting the first group of variables achieves reward-irrelevance. Combined with (ii) and (iii), choosing the first two groups achieves model-irrelevance. Assuming the target policy is agnostic to the state, the proposed forward abstraction will select the first two groups of variables.

According to (iv) and the assumption that the target policy is state-agnostic, selecting the last two groups attains $\rho^{\pi}$-irrelevance. Meanwhile, according to (ii) and (v), selecting these variables also achieves backward-model-irrelevance. Thus, the proposed backward abstraction will select the last two groups.

In the two-step procedure, the forward abstraction first eliminates the third group of variables. Given conditions (ii)-(v), selecting just the second group suffices to achieve backward-model-irrelevance, leading to the elimination of the first group in the subsequent backward abstraction. After two iterations, the procedure produces only one group of variables, demonstrating its efficiency in reducing dimensions compared to using either forward or backward abstraction alone.
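The forward step of the toy example can be reproduced with a small closure computation. The sketch below assumes a known group-level dependency graph read off observations (i)-(iii), and uses the rule that a variable selection is model-irrelevant when it contains the reward-relevant groups and is closed under transition parents. This rule and the names `forward_closure`, `reward_parents` and `transition_parents` are illustrative assumptions, not the learning procedure of Appendix B. The backward and two-step selections are copied from the text rather than computed.

```python
def forward_closure(reward_parents, transition_parents):
    """Smallest set of groups containing the reward parents and closed
    under 'parent in the transition graph'."""
    selected = set(reward_parents)
    frontier = list(selected)
    while frontier:
        group = frontier.pop()
        for parent in transition_parents[group]:
            if parent not in selected:
                selected.add(parent)
                frontier.append(parent)
    return selected


# (i) the reward depends on group 1 only.
reward_parents = {1}
# (ii) group 1 is driven by group 2; (iii) groups 2 and 3 depend on themselves.
transition_parents = {1: {2}, 2: {2}, 3: {3}}

forward_only = forward_closure(reward_parents, transition_parents)

# Selections stated in the text (not computed here).
backward_only = {2, 3}
two_step_selection = {2}

sizes = {
    "forward": len(forward_only),
    "backward": len(backward_only),
    "two-step": len(two_step_selection),
}
```

The two-step selection is contained in both single-step selections, which is what makes it the smallest of the three in this example.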

In more complex scenarios, each abstraction guarantees that the cardinality of the state space does not increase, effectively maintaining or reducing complexity. The reduction is more likely because forward and backward abstractions, as illustrated in Figures 1(a) and (b), differ by definition. Meanwhile, according to Theorems 2 and 3, the post-abstraction-OPE remains valid for any of the four methods.

Theorem 4 (The two-step procedure)

The four OPE methods remain valid when applied to the abstracted state produced by the proposed two-step procedure.

Finally, we note that one may further consider an iterative procedure that alternates between forward and backward abstractions. However, it remains unclear whether these methods have guarantees.

4 Numerical experiments

Method. We investigate the finite sample performance of our proposed methods, namely the forward, backward and two-step procedures (details in Appendix B).

Comparisons. We compare the proposed abstraction obtained via the two-step procedure (denoted by ‘two-step’), single-iteration forward (‘forward’) and backward (‘backward’) abstractions against Markov state abstraction (Allen et al., 2021) (‘Markov’) and a reconstruction-based abstraction (Lange & Riedmiller, 2010) (‘auto-encoder’). Each abstraction’s performance is tested using FQE (Le et al., 2019) applied to the abstract state space. We also report the performance of a baseline FQE applied to the unabstracted, ground state space (‘FQE’).
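To make the evaluation step concrete, the following is a minimal tabular sketch of FQE applied to an abstract state space. It is not the implementation used in the experiments: it assumes a finite abstract state space, transitions stored as (S, A, R, S') tuples, and a target policy that can be queried as `pi(a, s)`; the per-cell averaging stands in for the regression step of FQE, and the names `tabular_fqe` and `value_estimate` are hypothetical.

```python
from collections import defaultdict


def tabular_fqe(data, phi, pi, actions, gamma=0.9, n_iters=200):
    """Iterate Q_{k+1}(x, a) = mean of R + gamma * sum_b pi(b|S') Q_k(phi(S'), b)
    over all transitions whose abstract state-action pair equals (x, a)."""
    q = defaultdict(float)
    for _ in range(n_iters):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for s, a, r, s_next in data:
            x_next = phi(s_next)
            target = r + gamma * sum(pi(b, s_next) * q[(x_next, b)] for b in actions)
            key = (phi(s), a)
            sums[key] += target
            counts[key] += 1
        q = defaultdict(float, {k: sums[k] / counts[k] for k in sums})
    return q


def value_estimate(q, initial_states, phi, pi, actions):
    """Average of sum_a pi(a|s) Q(phi(s), a) over the initial states."""
    vals = [sum(pi(a, s) * q[(phi(s), a)] for a in actions) for s in initial_states]
    return sum(vals) / len(vals)


# Toy check: one relevant coordinate, one irrelevant coordinate, reward always 1.
actions = [0]
phi = lambda s: s[0]   # keep the relevant coordinate only
pi = lambda a, s: 1.0  # single-action target policy
data = [((0, 5), 0, 1.0, (0, 7)), ((0, 3), 0, 1.0, (0, 1))]

q_hat = tabular_fqe(data, phi, pi, actions, gamma=0.9, n_iters=200)
v_hat = value_estimate(q_hat, [(0, 2)], phi, pi, actions)
```

In this toy check both transitions collapse to the same abstract state, so the two samples are pooled and the estimate approaches $1/(1-\gamma) = 10$.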

Environments. We consider two environments from OpenAI Gym (Brockman et al., 2016), “CartPole-v0” and “LunarLander-v2”, with original state dimensions of 4 and 8, respectively. For each environment, we manually include 296 and 292 irrelevant variables in the state, leading to a challenging 300-dimensional system. Refer to Appendix C for more details about these environments.
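The padding of the state with irrelevant variables can be sketched as below. The noise distribution is an assumption made for illustration (i.i.d. standard normal draws); the construction actually used is described in Appendix C. The name `pad_state` and the placeholder observation are likewise hypothetical.

```python
import random


def pad_state(state, n_irrelevant, rng):
    """Append n_irrelevant noise coordinates to a state vector.

    Assumption: the irrelevant variables are i.i.d. N(0, 1) draws."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_irrelevant)]
    return list(state) + noise


rng = random.Random(0)
cartpole_state = [0.01, -0.02, 0.03, 0.04]    # placeholder 4-dimensional observation
padded = pad_state(cartpole_state, 296, rng)  # 4 + 296 = 300 coordinates
```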

Results. We report the MSEs and biases of different post-abstraction-OPE estimators and those of the baseline FQE estimator without abstraction in Figure 4 and Figure C.1 in Appendix C. We summarize our findings as follows. First, the proposed two-step method outperforms all other methods, with the smallest MSE and absolute bias in all cases. Since ‘Markov’ and ‘auto-encoder’ are types of model-irrelevant abstractions, these comparisons demonstrate the advantages of the proposed two-step method over single-iteration forward and backward procedures. Second, both figures indicate that the baseline FQE applied to the ground state space performs the worst in all cases. This demonstrates the usefulness of state abstractions for OPE.
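For reference, the two reported metrics can be computed from repeated value estimates as in the sketch below. It assumes the true policy value is available (for example from Monte Carlo rollouts, which is an assumption here) and uses a hypothetical helper name `bias_and_mse`.

```python
def bias_and_mse(estimates, true_value):
    """Empirical bias and mean squared error of a list of value estimates."""
    errors = [e - true_value for e in estimates]
    bias = sum(errors) / len(errors)
    mse = sum(err * err for err in errors) / len(errors)
    return bias, mse


# Two hypothetical replications with true value 1.0.
bias, mse = bias_and_mse([1.0, 3.0], true_value=1.0)
```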

Figure 4: MSEs and biases of FQE estimators when applied to ground and abstract state spaces with various abstractions. The behavior policy is $\epsilon$-greedy with respect to the target policy, with $\epsilon = 0.1, 0.3, 0.5, 0.7$ from left to right.
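An $\epsilon$-greedy behavior policy with respect to a target policy can be sketched as follows. The convention used here is an assumption: with probability $\epsilon$ the action is drawn uniformly from all actions (including the one the target policy would pick), and otherwise it is drawn from the target policy. The function names are hypothetical.

```python
import random


def behavior_prob(a, s, pi, actions, eps):
    """b(a|s) = (1 - eps) * pi(a|s) + eps / |A| under the assumed convention."""
    return (1.0 - eps) * pi(a, s) + eps / len(actions)


def sample_behavior_action(s, pi, actions, eps, rng):
    """Draw one action from the eps-greedy behavior policy."""
    if rng.random() < eps:
        return rng.choice(actions)
    weights = [pi(a, s) for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


actions = [0, 1]
pi = lambda a, s: 1.0 if a == 0 else 0.0  # deterministic target policy (toy)
probs = [behavior_prob(a, None, pi, actions, eps=0.3) for a in actions]
greedy_action = sample_behavior_action(None, pi, actions, 0.0, random.Random(0))
```

Larger values of $\epsilon$ move the behavior policy further from the target policy, which increases the distributional shift that OPE has to correct for.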

References

  • Abel (2022) Abel, D. A theory of abstraction in reinforcement learning. arXiv preprint arXiv:2203.00397, 2022.
  • Abel et al. (2016) Abel, D., Hershkowitz, D., and Littman, M. Near optimal behavior via approximate state abstraction. In International Conference on Machine Learning, pp. 2915–2923, 2016.
  • Allen et al. (2021) Allen, C., Parikh, N., Gottesman, O., and Konidaris, G. Learning Markov state abstractions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8229–8241, 2021.
  • Austin (2011) Austin, P. C. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399–424, 2011.
  • Belloni et al. (2014) Belloni, A., Chernozhukov, V., and Hansen, C. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608–650, 2014.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Castro (2020) Castro, P. S. Scalable methods for computing state similarity in deterministic Markov decision processes. In AAAI Conference on Artificial Intelligence, pp. 10069–10076, 2020.
  • Chapelle & Li (2011) Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257, 2011.
  • Chen & Qi (2022) Chen, X. and Qi, Z. On well-posedness and minimax optimal rates of nonparametric Q-function estimation in off-policy evaluation. In International Conference on Machine Learning, pp. 3558–3582, 2022.
  • Dai et al. (2020) Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C., and Schuurmans, D. CoinDICE: Off-policy confidence interval estimation. In Advances in Neural Information Processing Systems, pp. 9398–9411, 2020.
  • De Luna et al. (2011) De Luna, X., Waernbaum, I., and Richardson, T. S. Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika, 98(4):861–875, 2011.
  • Dean & Givan (1997) Dean, T. and Givan, R. Model minimization in Markov decision processes. In Conference on Artificial Intelligence / Conference on Innovative Applications of Artificial Intelligence, pp.  106–111, 1997.
  • Dudík et al. (2014) Dudík, M., Erhan, D., Langford, J., and Li, L. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
  • Farajtabar et al. (2018) Farajtabar, M., Chow, Y., and Ghavamzadeh, M. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pp. 1447–1456, 2018.
  • Feng et al. (2020) Feng, Y., Ren, T., Tang, Z., and Liu, Q. Accountable off-policy evaluation with kernel Bellman statistics. In International Conference on Machine Learning, pp. 3102–3111, 2020.
  • Ferns et al. (2004) Ferns, N., Panangaden, P., and Precup, D. Metrics for finite Markov decision processes. In Conference on Uncertainty in Artificial Intelligence, pp. 162–169, 2004.
  • Ferns et al. (2011) Ferns, N., Panangaden, P., and Precup, D. Bisimulation metrics for continuous Markov decision processes. SIAM Journal on Computing, 40(6):1662–1714, 2011.
  • François-Lavet et al. (2019) François-Lavet, V., Bengio, Y., Precup, D., and Pineau, J. Combined reinforcement learning via abstract representations. In AAAI Conference on Artificial Intelligence, pp. 3582–3589, 2019.
  • Fu et al. (2020) Fu, J., Norouzi, M., Nachum, O., Tucker, G., Novikov, A., Yang, M., Zhang, M. R., Chen, Y., Kumar, A., Paduraru, C., et al. Benchmarks for deep off-policy evaluation. In International Conference on Learning Representations, 2020.
  • Gelada et al. (2019) Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. DeepMDP: Learning continuous latent space models for representation learning. In International Conference on Machine Learning, pp. 2170–2179, 2019.
  • Givan et al. (2003) Givan, R., Dean, T., and Greig, M. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2):163–223, 2003.
  • Glymour et al. (2008) Glymour, M. M., Weuve, J., and Chen, J. T. Methodological challenges in causal research on racial and ethnic patterns of cognitive trajectories: measurement, selection, and bias. Neuropsychology Review, 18:194–213, 2008.
  • Greenland et al. (1999) Greenland, S., Pearl, J., and Robins, J. M. Confounding and collapsibility in causal inference. Statistical Science, 14(1):29–46, 1999.
  • Guo & Zhao (2023) Guo, F. R. and Zhao, Q. Confounder selection via iterative graph expansion. arXiv preprint arXiv:2309.06053, 2023.
  • Guo et al. (2022) Guo, F. R., Lundborg, A. R., and Zhao, Q. Confounder selection: Objectives and approaches. arXiv preprint arXiv:2208.13871, 2022.
  • Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pp. 2455–2467, 2018.
  • Hanna et al. (2019) Hanna, J., Niekum, S., and Stone, P. Importance sampling policy evaluation with an estimated behavior policy. In International Conference on Machine Learning, pp. 2605–2613, 2019.
  • Hao et al. (2021) Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvari, C., and Wang, M. Bootstrapping fitted Q-evaluation for off-policy inference. In International Conference on Machine Learning, pp. 4074–4084, 2021.
  • Hernán & Robins (2010) Hernán, M. A. and Robins, J. M. Causal inference, 2010.
  • Hernán & Robins (2016) Hernán, M. A. and Robins, J. M. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology, 183(8):758–764, 2016.
  • Hu & Wager (2023) Hu, Y. and Wager, S. Off-policy evaluation in partially observed Markov decision processes under sequential ignorability. The Annals of Statistics, 51(4):1561–1585, 2023.
  • Jiang et al. (2021) Jiang, H., Dai, B., Yang, M., Zhao, T., and Wei, W. Towards automatic evaluation of dialog systems: A model-free off-policy evaluation approach. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  7419–7451, 2021.
  • Jiang (2018) Jiang, N. Notes on state abstractions, 2018.
  • Jiang & Li (2016) Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661, 2016.
  • Jong & Stone (2005) Jong, N. K. and Stone, P. State abstraction discovery from irrelevant state variables. In International Joint Conference on Artificial Intelligence, pp.  752–757, 2005.
  • Kallus & Uehara (2020) Kallus, N. and Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.
  • Kallus & Uehara (2022) Kallus, N. and Uehara, M. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Operations Research, 70(6):3282–3302, 2022.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Koch et al. (2020) Koch, B., Vock, D. M., Wolfson, J., and Vock, L. B. Variable selection and estimation in causal inference using Bayesian spike and slab priors. Statistical Methods in Medical Research, 29(9):2445–2469, 2020.
  • Lange & Riedmiller (2010) Lange, S. and Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. In International Joint Conference on Neural Networks, pp. 1–8, 2010.
  • Laskin et al. (2020) Laskin, M., Srinivas, A., and Abbeel, P. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pp. 5639–5650, 2020.
  • Le et al. (2019) Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712, 2019.
  • Levine et al. (2020) Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Li et al. (2024) Li, G., Wu, W., Chi, Y., Ma, C., Rinaldo, A., and Wei, Y. High-probability sample complexities for policy evaluation with linear function approximation. IEEE Transactions on Information Theory, 2024.
  • Li et al. (2006) Li, L., Walsh, T. J., and Littman, M. L. Towards a unified theory of state abstraction for MDPs. AI&M, 1(2):3, 2006.
  • Liao et al. (2021) Liao, P., Klasnja, P., and Murphy, S. Off-policy estimation of long-term average outcomes with applications to mobile health. Journal of the American Statistical Association, 116(533):382–391, 2021.
  • Liao et al. (2022) Liao, P., Qi, Z., Wan, R., Klasnja, P., and Murphy, S. A. Batch policy learning in average reward Markov decision processes. Annals of Statistics, 50(6):3364, 2022.
  • Liu et al. (2018) Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5361–5371, 2018.
  • Luckett et al. (2019) Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E., and Kosorok, M. R. Estimating dynamic treatment regimes in mobile health using V-learning. Journal of the American Statistical Association, 115:692–706, 2019.
  • Ma et al. (2023) Ma, T., Cai, H., Qi, Z., Shi, C., and Laber, E. B. Sequential knockoffs for variable selection in reinforcement learning. arXiv preprint arXiv:2303.14281, 2023.
  • Mandel et al. (2014) Mandel, T., Liu, Y.-E., Levine, S., Brunskill, E., and Popovic, Z. Offline policy evaluation across representations with applications to educational games. In International Conference on Autonomous Agents and Multi-Agent Systems, pp.  1077–1084, 2014.
  • Murphy et al. (2001) Murphy, S. A., van der Laan, M. J., Robins, J. M., and Group, C. P. P. R. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.
  • Nachum et al. (2019) Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2318–2328, 2019.
  • Pathak et al. (2017) Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pp. 2778–2787, 2017.
  • Pavse & Hanna (2023) Pavse, B. S. and Hanna, J. P. Scaling marginalized importance sampling to high-dimensional state-spaces via state abstraction. In AAAI Conference on Artificial Intelligence, pp. 9417–9425, 2023.
  • Pearl (2009) Pearl, J. Causality. Cambridge University Press, Cambridge, UK, 2 edition, 2009. ISBN 978-0-521-89560-6. doi: 10.1017/CBO9780511803161.
  • Persson et al. (2017) Persson, E., Häggström, J., Waernbaum, I., and de Luna, X. Data-driven algorithms for dimension reduction in causal inference. Computational Statistics & Data Analysis, 105:280–292, 2017.
  • Precup (2000) Precup, D. Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning, pp. 759–766, 2000.
  • Puterman (2014) Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Ravindran (2004) Ravindran, B. An algebraic approach to abstraction in reinforcement learning. University of Massachusetts Amherst, 2004.
  • Robins (1997) Robins, J. M. Causal inference from complex longitudinal data. In Latent variable modeling and applications to causality, pp.  69–117. Springer, 1997.
  • Rubin (2009) Rubin, D. B. Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups? Statistics in Medicine, 28(9):1420–1423, 2009.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
  • Shelhamer et al. (2016) Shelhamer, E., Mahmoudieh, P., Argus, M., and Darrell, T. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.
  • Shi et al. (2021) Shi, C., Wan, R., Chernozhukov, V., and Song, R. Deeply-debiased off-policy interval estimation. In International Conference on Machine Learning, pp. 9580–9591, 2021.
  • Shi et al. (2022) Shi, C., Zhang, S., Lu, W., and Song, R. Statistical inference of the value function for reinforcement learning in infinite-horizon settings. Journal of the Royal Statistical Society Series B, 84(3):765–793, 2022.
  • Shortreed & Ertefaie (2017) Shortreed, S. M. and Ertefaie, A. Outcome-adaptive Lasso: variable selection for causal inference. Biometrics, 73(4):1111–1122, 2017.
  • Singh et al. (1994) Singh, S., Jaakkola, T., and Jordan, M. Reinforcement learning with soft state aggregation. In Advances in Neural Information Processing Systems, pp. 361–368, 1994.
  • Sutton et al. (1999) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 1999.
  • Sutton et al. (2008) Sutton, R. S., Szepesvári, C., and Maei, H. R. A convergent $O(n)$ algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems, pp. 1609–1616, 2008.
  • Tang et al. (2020) Tang, Z., Feng, Y., Li, L., Zhou, D., and Liu, Q. Doubly robust bias reduction in infinite horizon off-policy estimation. In International Conference on Learning Representations, 2020.
  • Tangkaratt et al. (2016) Tangkaratt, V., Morimoto, J., and Sugiyama, M. Model-based reinforcement learning with dimension reduction. Neural Networks, 84:1–16, 2016.
  • Thomas & Brunskill (2016) Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.
  • Thomas et al. (2015) Thomas, P., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In AAAI Conference on Artificial Intelligence, pp. 3000–3006, 2015.
  • Uehara et al. (2020) Uehara, M., Huang, J., and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. In International Conference on Machine Learning, pp. 9659–9668, 2020.
  • Uehara et al. (2021) Uehara, M., Zhang, X., and Sun, W. Representation learning for online and offline RL in low-rank MDPs. In International Conference on Learning Representations, 2021.
  • Uehara et al. (2022) Uehara, M., Shi, C., and Kallus, N. A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355, 2022.
  • VanderWeele & Shpitser (2011) VanderWeele, T. J. and Shpitser, I. A new criterion for confounder selection. Biometrics, 67(4):1406–1413, 2011.
  • VanderWeele (2019) VanderWeele, T. J. Principles of confounder selection. European Journal of Epidemiology, 34:211–219, 2019.
  • Voloshin et al. (2021) Voloshin, C., Le, H. M., Jiang, N., and Yue, Y. Empirical study of off-policy policy evaluation for reinforcement learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
  • Wang et al. (2023) Wang, J., Qi, Z., and Wong, R. K. Projected state-action balancing weights for offline reinforcement learning. The Annals of Statistics, 51(4):1639–1665, 2023.
  • Wang et al. (2017) Wang, L., Laber, E. B., and Witkiewitz, K. Sufficient Markov decision processes with alternating deep neural networks. arXiv preprint arXiv:1704.07531, 2017.
  • Xie et al. (2023) Xie, C., Yang, W., and Zhang, Z. Semiparametrically efficient off-policy evaluation in linear Markov decision processes. In International Conference on Machine Learning, pp. 38227–38257, 2023.
  • Xie et al. (2019) Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pp. 9668–9678, 2019.
  • Yin & Wang (2020) Yin, M. and Wang, Y.-X. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp.  3948–3958, 2020.
  • Zhang et al. (2020) Zhang, A., McAllister, R. T., Calandra, R., Gal, Y., and Levine, S. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2020.
  • Zhang & Zhang (2018) Zhang, B. and Zhang, M. Variable selection for estimating the optimal treatment regimes in the presence of a large number of covariates. The Annals of Applied Statistics, 12(4):2335–2358, 2018.
  • Zhang et al. (2013) Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika, 100(3):681–694, 2013.

Appendix

This appendix is structured as follows: Section A introduces additional related works on confounder selection in causal inference. The implementation details of the proposed state abstraction are discussed in Section B. Additional information concerning the environments and computing resources utilized is presented in Section C. All technical proofs can be found in Section D.

Appendix A Confounder selection in causal inference

Broadly speaking, confounding refers to the problem that even if two variables are not causes of each other, they may exhibit statistical association due to common causes. Controlling for confounding is a central problem in the design of observational studies, and many criteria for confounder selection have been proposed in the literature. A commonly adopted criterion is the “common cause heuristic”, where the user only controls for covariates that are related to both the treatment and the outcome (Glymour et al., 2008; Austin, 2011; Shortreed & Ertefaie, 2017; Koch et al., 2020). Another widely used criterion is to simply use all covariates that are observed before the treatment in time (Rubin, 2009; Hernán & Robins, 2010, 2016). However, neither of these approaches is guaranteed to find a set of covariates that is sufficient to control for confounding. From a graphical perspective, confounder selection is essentially about finding a set of covariates that blocks all “back-door” paths (Pearl, 2009), but this requires full structural knowledge about the causal relationships between the variables, which is often unavailable. This has motivated methods that only require partial structural knowledge (VanderWeele & Shpitser, 2011; VanderWeele, 2019; Guo & Zhao, 2023). All the aforementioned methods need substantive knowledge about the treatment, outcome, and covariates. Other methods use statistical tests (usually of conditional independence) to trim a set of covariates that are assumed to control for confounding (Robins, 1997; Greenland et al., 1999; Hernán & Robins, 2010; De Luna et al., 2011; Belloni et al., 2014; Persson et al., 2017). The reader is referred to Guo et al. (2022) for a recent survey of objectives and approaches for confounder selection.

Confounder selection can be considered a special case of our problem under certain conditions: (i) The state transitions are independent, effectively transforming the MDP into a contextual bandit; (ii) The action space is binary, with the target policy consistently assigning either action 0 or action 1, aimed at assessing the average treatment effect; (iii) State abstractions are confined to variable selections. While our proposed two-step procedure is similar in spirit to the aforementioned algorithms, it addresses a more complex problem involving state transitions. Additionally, our focus is on abstractions that facilitate the engineering of new feature vectors, rather than merely selecting a subset of existing ones.

Appendix B Implementation details

In this section, we present implementation details for forward abstraction (Section B.1) and backward abstraction (Section B.2).

B.1 Implementation details for forward abstraction

We provide details for implementing the proposed forward abstraction in this subsection. We use deep neural networks to parameterize the forward abstraction and estimate the parameters by minimizing the following loss function:

$$\alpha_1 \mathcal{L}_r + \beta_1 \mathcal{L}_{\mathcal{T}} + \delta_1 \mathcal{L}_Q + \lambda_1 \mathcal{L}_{penalty}, \tag{B.1}$$

where $\mathcal{L}_r$, $\mathcal{L}_{\mathcal{T}}$ and $\mathcal{L}_Q$ are the loss functions detailed below, $\mathcal{L}_{penalty}$ is a penalty term, and $\alpha_1, \beta_1, \delta_1, \lambda_1$ are positive constant hyper-parameters whose values are reported in Table B.1.

By definition, the forward abstraction is required to achieve both model-irrelevance and $\pi$-irrelevance. As discussed in Section 3.2, our approach is to learn a model-irrelevant abstraction, denoted as $\phi$, and then concatenate it with $\{\pi(a|\bullet): a \in \mathcal{A}\}$. We denote the concatenated abstraction by $\phi_{for}$.
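The concatenation step can be sketched as follows, assuming a finite action set and a target policy that can be queried as `pi(a, s)`. The function name `make_phi_for` and the toy `phi` and `pi` are hypothetical and only illustrate the construction.

```python
def make_phi_for(phi, pi, actions):
    """Return phi_for(s) = (phi(s), pi(a_1|s), ..., pi(a_m|s))."""
    def phi_for(s):
        return tuple(phi(s)) + tuple(pi(a, s) for a in actions)
    return phi_for


# Toy example: phi keeps the first coordinate; pi is uniform over two actions.
phi = lambda s: (s[0],)
pi = lambda a, s: 0.5
phi_for = make_phi_for(phi, pi, actions=[0, 1])
features = phi_for((2.0, 9.0))
```

Because the target policy's action probabilities are part of the abstract state, the target policy is a function of `phi_for(s)` alone, which is what $\pi$-irrelevance requires.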

We next detail the loss functions and the penalty term. The first two losses rsubscript𝑟\mathcal{L}_{r}caligraphic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and 𝒯subscript𝒯\mathcal{L}_{\mathcal{T}}caligraphic_L start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT are to ensure reward-irrelevance and transition-irrelevance, respectively,

r=1|𝒟|(S,A,R)𝒟[Rϕ(A,ϕ(S))]2,𝒯=1|𝒟|(S,A,S)𝒟𝒯ϕ(A,ϕ(S))ϕ(S)22,formulae-sequencesubscript𝑟1𝒟subscript𝑆𝐴𝑅𝒟superscriptdelimited-[]𝑅subscriptitalic-ϕ𝐴italic-ϕ𝑆2subscript𝒯1𝒟subscript𝑆𝐴superscript𝑆𝒟superscriptsubscriptnormsubscript𝒯italic-ϕ𝐴italic-ϕ𝑆italic-ϕsuperscript𝑆22\displaystyle\mathcal{L}_{r}=\frac{1}{|\mathcal{D}|}\sum_{(S,A,R)\in\mathcal{D% }}\big{[}R-\mathcal{R}_{\phi}\big{(}A,\phi(S)\big{)}\big{]}^{2},\,\,\mathcal{L% }_{\mathcal{T}}=\frac{1}{|\mathcal{D}|}\sum_{(S,A,S^{\prime})\in\mathcal{D}}\|% \mathcal{T}_{\phi}\big{(}A,\phi(S)\big{)}-\phi(S^{\prime})\|_{2}^{2},caligraphic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | caligraphic_D | end_ARG ∑ start_POSTSUBSCRIPT ( italic_S , italic_A , italic_R ) ∈ caligraphic_D end_POSTSUBSCRIPT [ italic_R - caligraphic_R start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_A , italic_ϕ ( italic_S ) ) ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , caligraphic_L start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | caligraphic_D | end_ARG ∑ start_POSTSUBSCRIPT ( italic_S , italic_A , italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∈ caligraphic_D end_POSTSUBSCRIPT ∥ caligraphic_T start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_A , italic_ϕ ( italic_S ) ) - italic_ϕ ( italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ,

where ϕ0subscriptsubscriptitalic-ϕ0\mathcal{R}_{\phi_{0}}caligraphic_R start_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT and 𝒯ϕ0subscript𝒯subscriptitalic-ϕ0\mathcal{T}_{\phi_{0}}caligraphic_T start_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT are the estimated reward and transition functions applied to the abstract state space parameterized by deep neural networks as well, and |𝒟|𝒟|\mathcal{D}|| caligraphic_D | is the cardinality of the dataset 𝒟𝒟\mathcal{D}caligraphic_D.
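As a minimal sketch of the two losses above, the following plain-Python snippet evaluates the reward loss and the transition loss on a toy batch. The function names and the toy maps `phi`, `reward_fn` and `transition_fn` are illustrative assumptions; in the paper these maps are neural networks trained jointly, which is not reproduced here.

```python
def reward_loss(batch, phi, reward_fn):
    """L_r: mean squared error between R and R_phi(A, phi(S))."""
    return sum((r - reward_fn(a, phi(s))) ** 2 for s, a, r, _ in batch) / len(batch)


def transition_loss(batch, phi, transition_fn):
    """L_T: mean squared Euclidean distance between T_phi(A, phi(S)) and phi(S')."""
    total = 0.0
    for s, a, _, s_next in batch:
        pred, target = transition_fn(a, phi(s)), phi(s_next)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(batch)


# Toy example (hypothetical): the third state coordinate is pure noise.
phi = lambda s: s[:2]                           # abstraction drops the noise
reward_fn = lambda a, x: x[0] + a               # stand-in for R_phi
transition_fn = lambda a, x: (x[0] + a, x[1])   # stand-in for T_phi

batch = [  # tuples (S, A, R, S')
    ((0.0, 1.0, 9.0), 1, 1.0, (1.0, 1.0, -3.0)),
    ((1.0, 1.0, -3.0), 0, 1.0, (1.0, 1.0, 5.0)),
]
loss_r = reward_loss(batch, phi, reward_fn)
loss_t = transition_loss(batch, phi, transition_fn)
```

In this toy batch both losses vanish because rewards and transitions depend only on the coordinates that `phi` retains, which is the situation model-irrelevance describes.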

The inclusion of the third loss function, $\mathcal{L}_{Q}$, is motivated by the demonstrated benefits of utilizing model-free objectives to guide the training of state abstractions in policy learning, as evidenced by Gelada et al. (2019); Ha & Schmidhuber (2018); François-Lavet et al. (2019). Given our interest in OPE, we integrate the following FQE loss into the objective function,

$$\mathcal{L}_{Q}=\frac{1}{|\mathcal{D}|}\sum_{(S,A,R,S^{\prime})\in\mathcal{D}}\Big[R+\gamma\sum_{a\in\mathcal{A}}\pi(a|S^{\prime})Q^{-}\big(\phi_{for}(S^{\prime}),a\big)-Q\big(\phi_{for}(S),A\big)\Big]^{2},$$

where $Q^{-}$ and $Q$ represent the estimates of $Q^{\pi}_{\phi_{for}}$, the Q-function on the abstract state space, at the previous and current iterations, respectively.
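The FQE loss above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `q` and `q_prev` are hypothetical callables standing in for the current and previous Q-networks, `pi(a, s)` returns the target policy's probability of action `a` at the ground state `s`, and the toy values are made up.

```python
def fqe_loss(batch, phi_for, q, q_prev, pi, actions, gamma):
    """L_Q: mean squared Bellman residual using the previous-iteration Q as target."""
    total = 0.0
    for s, a, r, s_next in batch:
        x_next = phi_for(s_next)
        # Expected previous-iteration Q-value at S' under the target policy.
        v_next = sum(pi(b, s_next) * q_prev(x_next, b) for b in actions)
        total += (r + gamma * v_next - q(phi_for(s), a)) ** 2
    return total / len(batch)


# Toy example (hypothetical values).
phi_for = lambda s: s[:1]        # keep only the first coordinate
pi = lambda a, s: 0.5            # uniform target policy over two actions
q_prev = lambda x, a: 1.0        # previous-iteration estimate Q^-
q = lambda x, a: 0.0             # current estimate Q
batch = [((0.0, 7.0), 0, 1.0, (1.0, -2.0))]   # (S, A, R, S')
loss_q = fqe_loss(batch, phi_for, q, q_prev, pi, actions=(0, 1), gamma=0.5)
```

Here the regression target is 1 + 0.5 × 1 = 1.5 and the current estimate is 0, so the loss equals 1.5² = 2.25.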

The above objectives allow us to effectively train forward abstractions. However, a potential concern is that the resulting abstraction and transition can collapse to some constant $x_{0}$ such that $\phi_{for}(S)\rightarrow x_{0}$ for all $S\in\mathcal{S}$. To address this limitation, we include the following penalty function of two randomly drawn states to promote diversity in the abstractions:

$$\mathcal{L}_{c}=\frac{1}{|\mathcal{D}|(|\mathcal{D}|-1)}\sum_{S,\tilde{S}\in\mathcal{D},S\neq\tilde{S}}\exp\big(-C_{0}\|\widehat{\phi}(S)-\widehat{\phi}(\tilde{S})\|_{2}\big)$$

for some positive scaling constant $C_{0}$, where $\widehat{\phi}(s)$ is the abstract state estimated from the transition function. In practice, $\widehat{\phi}(\tilde{s})$ can be obtained by shuffling $\widehat{\phi}(s^{\prime})$ across the pairs $(s,s^{\prime})$ in the batch. Additionally, we add another penalty to penalize consecutive abstract states for being more than some predefined distance $d_{0}$ away from each other,

$$\mathcal{L}_{s}=\frac{1}{|\mathcal{D}|}\sum_{(S,S^{\prime})\in\mathcal{D}}C_{1}\big[\|\phi_{for}(S)-\phi_{for}(S^{\prime})\|_{2}-d_{0}\big]^{2},$$

for some positive constant $C_{1}$. These components combine into the final penalty function:

$$\mathcal{L}_{penalty}=\mathcal{L}_{s}+\mathcal{L}_{c}.$$
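The two penalty terms can be illustrated with a short plain-Python sketch. It assumes a batch approximation of the diversity term in which each abstract state is paired with one shuffled partner rather than with all other states; the helper names and toy vectors are hypothetical.

```python
import math
import random


def l2(x, y):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def diversity_penalty(z, z_tilde, c0=1.0):
    """Batch estimate of L_c: mean of exp(-C0 * ||z - z_tilde||_2)."""
    return sum(math.exp(-c0 * l2(a, b)) for a, b in zip(z, z_tilde)) / len(z)


def smoothness_penalty(z, z_next, c1=1.0, d0=1.0):
    """L_s: mean of C1 * (||z - z_next||_2 - d0)^2 over consecutive pairs."""
    return sum(c1 * (l2(a, b) - d0) ** 2 for a, b in zip(z, z_next)) / len(z)


rng = random.Random(0)
z = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]         # abstract states
z_next = [(3.0, 4.0), (6.0, 8.0), (9.0, 12.0)]   # abstract next states
z_tilde = rng.sample(z_next, len(z_next))        # shuffled partners
penalty = smoothness_penalty(z, z_next, d0=5.0) + diversity_penalty(z, z_tilde)

# A collapsed abstraction (all states mapped to one point) attains the maximum of L_c.
collapsed = diversity_penalty([(1.0, 1.0)] * 3, [(1.0, 1.0)] * 3)
```

The collapsed case gives a diversity penalty of exactly 1, its largest possible value, which is why minimizing this term discourages collapse.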

The forward model architecture is as follows:

Forward_model(
  (encoder): Encoder_linear(
    (activation): ReLU()
    (encoder_net): Sequential(
      (0): Linear(in_features=300, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Dropout(p=0.2, inplace=False)
      (5): Linear(in_features=64, out_features=64, bias=True)
      (6): ReLU()
      (7): Dropout(p=0.2, inplace=False)
      (8): Linear(in_features=64, out_features=100, bias=True)
    )
  )
  (transition): Transition(
    (activation): ReLU()
    (T_net): Sequential(
      (0): Linear(in_features=100, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Dropout(p=0.2, inplace=False)
      (5): Linear(in_features=64, out_features=64, bias=True)
    )
    (lstm): LSTMCell(64, 128)
    (tanh): Tanh()
  )
  (reward): Reward(
    (activation): ReLU()
    (reward_net): Sequential(
      (0): Linear(in_features=100, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Dropout(p=0.2, inplace=False)
      (5): Linear(in_features=64, out_features=64, bias=True)
      (6): ReLU()
      (7): Dropout(p=0.2, inplace=False)
      (8): Linear(in_features=64, out_features=64, bias=True)
      (9): ReLU()
      (10): Dropout(p=0.2, inplace=False)
      (11): Linear(in_features=64, out_features=64, bias=True)
      (12): ReLU()
      (13): Dropout(p=0.2, inplace=False)
      (14): Linear(in_features=64, out_features=2, bias=True)
    )
  )
  (FQE): FQE(
    (activation): ReLU()
    (action_net): Sequential(
      (0): Linear(in_features=1, out_features=16, bias=True)
      (1): ReLU()
      (2): Linear(in_features=16, out_features=100, bias=True)
    )
    (xa_net): Linear(in_features=200, out_features=100, bias=True)
    (FQE_net): Sequential(
      (0): Linear(in_features=100, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Dropout(p=0.2, inplace=False)
      (5): Linear(in_features=64, out_features=64, bias=True)
      (6): ReLU()
      (7): Dropout(p=0.2, inplace=False)
      (8): Linear(in_features=64, out_features=2, bias=True)
    )
  )
)

Table B.1: Hyper-parameter information. $m$ is the input feature dimension, and $*$ means no value.

| Environment | Hyper-parameter (forward) | Value | Hyper-parameter (backward) | Value |
|---|---|---|---|---|
| CartPole-v0 | $\alpha_{1}$ | 1 | $\alpha_{2}$ | 1 |
| | $\beta_{1}$ | 1 | $\beta_{2}$ | 1 |
| | $\delta_{1}$ | 1 | $\delta_{2}$ | 1 |
| | $\lambda_{1}$ | $\min(1,\frac{20}{m})$ | $\lambda_{2}$ | $\min(1,\frac{10}{m})$ |
| | $C_{0}$ | 1 | $C_{0}$ | $*$ |
| | $C_{1}$ | 1 | $C_{1}$ | 1 |
| | $d_{0}$ | $0.15m$ | $d_{0}$ | $0.15m$ |
| LunarLander-v2 | $\alpha_{1}$ | 1 | $\alpha_{2}$ | 1 |
| | $\beta_{1}$ | 1 | $\beta_{2}$ | 1 |
| | $\delta_{1}$ | 1 | $\delta_{2}$ | 1 |
| | $\lambda_{1}$ | $\min(1,\frac{20}{m})$ | $\lambda_{2}$ | $\min(1,\frac{20}{m})$ |
| | $C_{0}$ | 1 | $C_{0}$ | $*$ |
| | $C_{1}$ | 1 | $C_{1}$ | 1 |
| | $d_{0}$ | $0.15m$ | $d_{0}$ | $0.15m$ |
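To make the dimension-dependent entries of Table B.1 concrete, the sketch below evaluates the penalty weight $\min(1, 20/m)$ and the distance threshold $0.15m$ for an example input dimension. The helper names are hypothetical, and $m = 300$ is chosen only for illustration.

```python
def penalty_weight(m, numerator=20.0):
    """lambda = min(1, numerator / m), where m is the input feature dimension."""
    return min(1.0, numerator / m)


def smoothness_distance(m):
    """d_0 = 0.15 * m."""
    return 0.15 * m


m = 300                          # example input dimension (assumption)
lambda_1 = penalty_weight(m)     # 20 / 300, roughly 0.067
d_0 = smoothness_distance(m)     # 45
```

For small inputs the weight saturates at 1: `penalty_weight(10)` returns 1.0.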

B.2 Implementation details for backward abstraction

We provide details for implementing the proposed backward abstraction in this subsection. Similar to Section B.1, we use deep neural networks to parameterize the abstraction $\phi_{back}$ and estimate the parameters by minimizing the following loss function,

$$\alpha_{2}\mathcal{L}_{\rho}+\beta_{2}\mathcal{L}_{ratio}+\delta_{2}\mathcal{L}_{inv}+\lambda_{2}\mathcal{L}_{s},$$

where $\alpha_{2},\beta_{2},\delta_{2},\lambda_{2}$ are positive hyper-parameters specified in Table B.1.

Recall that backward-model-irrelevance requires both $\rho^{\pi}$-irrelevance (Definition 6) and (8). The first loss function $\mathcal{L}_{\rho}$ is designed to enforce $\rho^{\pi}$-irrelevance, specified as

$$\mathcal{L}_{\rho}=\frac{1}{|\mathcal{D}|}\sum_{(S,A)\in\mathcal{D}}\big[\widehat{\rho}^{\pi}(A,S)-\rho^{\pi}_{\phi_{back}}\big(A,\phi_{back}(S)\big)\big]^{2},$$

where $\widehat{\rho}^{\pi}$ denotes some consistent estimator of the IS ratio. Note that in the two-step procedure, we should replace $\widehat{\rho}^{\pi}(A,S)$ by

$$\widehat{\rho}^{\pi}_{for}(A,\phi_{for}(S))=\frac{\pi_{\phi_{for}}(A|\phi_{for}(S))}{\widehat{b}(A|\phi_{for}(S))}=\frac{\pi(A|S)}{\widehat{b}(A|\phi_{for}(S))},$$

where $\widehat{b}$ is estimated from the abstracted experiences and $\pi(A|S)$ remains unchanged due to the $\pi$-irrelevance property of the forward abstraction.
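The ratio used in the two-step procedure can be sketched as follows. This is an illustration under simplifying assumptions, not the estimator used in the experiments: the abstract state is taken to be discrete so that the behavior policy can be estimated by empirical frequencies, and all names and data are hypothetical.

```python
from collections import Counter


def fit_behavior(abstract_pairs):
    """Frequency estimate of b(a | x) from (x, a) pairs on the abstract space."""
    joint = Counter(abstract_pairs)
    marginal = Counter(x for x, _ in abstract_pairs)
    return lambda a, x: joint[(x, a)] / marginal[x]


def is_ratio(a, s, pi, b_hat, phi_for):
    """rho_hat_for = pi(A | S) / b_hat(A | phi_for(S))."""
    return pi(a, s) / b_hat(a, phi_for(s))


phi_for = lambda s: s[0]                               # toy abstraction
pi = lambda a, s: 1.0 if a == int(s[0] > 0) else 0.0   # deterministic target policy
data = [((1, 0.3), 1), ((1, -0.8), 1), ((1, 0.1), 0), ((-1, 0.5), 0)]  # (S, A)

b_hat = fit_behavior([(phi_for(s), a) for s, a in data])
ratios = [is_ratio(a, s, pi, b_hat, phi_for) for s, a in data]
```

In this toy data the target policy depends on the state only through `phi_for`, so the numerator can be evaluated on the ground state while the denominator uses the abstract state, as in the display above. The resulting ratios average to one over the batch.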

As commented in Section 3.2, the second condition of (8) holds when $(A_{t},\phi(S_{t}))$ and $S_{t+1}$ are conditionally independent given $\phi(S_{t+1})$. By Bayes' formula, we can show that this is satisfied under inverse-model-irrelevance and density-ratio-irrelevance when setting the learning policy $\pi$ to $b$. This motivates us to leverage the two objectives $\mathcal{L}_{inv}$ and $\mathcal{L}_{ratio}$ used by Allen et al. (2021) for training MSA. More details regarding these losses can be found in Section 5 of Allen et al. (2021). Note that to obtain the non-sequential states $(s,\tilde{s})$ used in $\mathcal{L}_{ratio}$, we flip $s^{\prime}$ in the pairs $(s,s^{\prime})$ in each batch instead of shuffling.
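One way to read the flipping step is sketched below, assuming that "flip" means reversing the order of the next states within a batch; this interpretation and the helper name are assumptions rather than details confirmed by the text.

```python
def flipped_pairs(states, next_states):
    """Pair each state with the next state from the mirrored batch position."""
    return list(zip(states, reversed(next_states)))


states = [0, 1, 2, 3]        # toy state identifiers
next_states = [1, 2, 3, 4]   # their observed successors
pairs = flipped_pairs(states, next_states)
```

Unlike shuffling, this pairing is deterministic. Under this reading, the middle element of a batch with an odd number of pairs stays matched with its own successor.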

Finally, $\mathcal{L}_{s}$ corresponds to the smoothness penalty introduced in Section B.1. The backward model architecture is:

Backward_model(
  (encoder): Encoder_linear(
    (activation): ReLU()
    (encoder_net): Sequential(
      (0): Linear(in_features=100, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Dropout(p=0.2, inplace=False)
      (5): Linear(in_features=64, out_features=64, bias=True)
      (6): ReLU()
      (7): Dropout(p=0.2, inplace=False)
      (8): Linear(in_features=64, out_features=6, bias=True)
    )
  )
  (inverse): Inverse(
    (activation): ReLU()
    (inverse_net): Sequential(
      (0): Linear(in_features=12, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Dropout(p=0.3, inplace=False)
      (5): Linear(in_features=64, out_features=64, bias=True)
      (6): ReLU()
      (7): Dropout(p=0.3, inplace=False)
      (8): Linear(in_features=64, out_features=64, bias=True)
      (9): ReLU()
      (10): Dropout(p=0.3, inplace=False)
      (11): Linear(in_features=64, out_features=64, bias=True)
      (12): ReLU()
      (13): Dropout(p=0.3, inplace=False)
      (14): Linear(in_features=64, out_features=1, bias=True)
    )
  )
  (density): Density(
    (activation): ReLU()
    (density_net): Sequential(
      (0): Linear(in_features=12, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Dropout(p=0.3, inplace=False)
      (5): Linear(in_features=64, out_features=64, bias=True)
      (6): ReLU()
      (7): Dropout(p=0.3, inplace=False)
      (8): Linear(in_features=64, out_features=64, bias=True)
      (9): ReLU()
      (10): Dropout(p=0.3, inplace=False)
      (11): Linear(in_features=64, out_features=64, bias=True)
      (12): ReLU()
      (13): Dropout(p=0.3, inplace=False)
      (14): Linear(in_features=64, out_features=1, bias=True)
    )
  )
  (rho): Rho(
    (activation): ReLU()
    (rho_net): Sequential(
      (0): Linear(in_features=6, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): ReLU()
      (4): Dropout(p=0.3, inplace=False)
      (5): Linear(in_features=64, out_features=64, bias=True)
      (6): ReLU()
      (7): Dropout(p=0.3, inplace=False)
      (8): Linear(in_features=64, out_features=2, bias=True)
    )
  )
)

Appendix C Additional Experimental Details

C.1 Reproducibility

We release our code and data on the website at
https://github.com/pufffs/state-abstraction
The hyper-parameters to train the proposed forward and backward abstractions can be found in Table B.1.

C.2 Experimental settings and additional results

For both environments we use the Adam optimizer (Kingma & Ba, 2014), with learning rate $0.001$ in CartPole and $0.003$ in LunarLander. Model architectures and hyper-parameters are outlined in Appendix B. When conducting OPE, the FQE network has $3$ hidden layers with $64$ nodes per hidden layer for abstraction methods, and $5$ hidden layers with $128$ nodes per hidden layer for non-abstracted observations (shown as ‘FQE’ in the plot).

C.2.1 CartPole-v0

Data generating processes

We manually insert 296 irrelevant features in the state, each following a first-order auto-regressive (AR(1)) model:

$$\mathbb{P}(S_{t+1,j}|S_{t},A_{t})=\mathbb{P}(S_{t+1,j}|S_{t,j}),\qquad j=5,\dots,300.$$
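A minimal sketch of such an augmentation is given below. The AR(1) coefficient, the Gaussian innovations and the initial distribution are assumptions made for illustration, since the text does not specify them; the helper names are hypothetical.

```python
import random


def ar1_step(noise, coef=0.5, scale=1.0, rng=random):
    """Advance each irrelevant feature independently: x' = coef * x + Gaussian noise."""
    return [coef * x + rng.gauss(0.0, scale) for x in noise]


def augment(core_state, noise):
    """Concatenate the 4 original CartPole features with the irrelevant ones."""
    return list(core_state) + list(noise)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(296)]   # features j = 5, ..., 300
state = augment([0.01, -0.02, 0.03, 0.04], noise)   # 300-dimensional state
next_noise = ar1_step(noise, rng=rng)               # ignores the action and the core state
```

Because each added feature evolves using only its own past value, it carries no information about rewards or about the original four coordinates.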

We also define a new state-action-dependent reward as

$$\mathcal{R}(s_{t},a_{t})=1-2s_{t,1}^{2}-5s_{t,3}^{2},$$

where $s_{t,1}$ and $s_{t,3}$ are the first feature (cart position) and third feature (pole angle) of the state $s_{t}$, to replace the original constant rewards. The number of trajectories $n$ in the offline dataset is chosen from $\{5,8,15,30\}$, where each trajectory contains approximately 40 decision points. The target policy is determined by the pole angle: we push the cart to the left if the angle is negative and to the right if it is positive. Namely,

$$\pi(s_{t})=\mathbbm{1}(s_{t,3}>0).$$

The behavior policy that generates the batch data is set to an $\epsilon$-greedy policy with respect to the target policy, with $\epsilon\in\{0.1,\,0.3,\,0.5,\,0.7\}$. Results are averaged over 30 runs for each $(n,\epsilon)$ pair.
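The reward, target policy and behavior policy described above can be sketched in a few lines. One detail is an assumption: with probability $\epsilon$ the behavior policy draws an action uniformly from both actions (including the greedy one), which is a common convention but is not specified in the text.

```python
import random


def reward(state):
    """R(s, a) = 1 - 2 * position^2 - 5 * angle^2; the action does not enter."""
    return 1.0 - 2.0 * state[0] ** 2 - 5.0 * state[2] ** 2


def target_policy(state):
    """Push right (action 1) if the pole angle is positive, else left (action 0)."""
    return int(state[2] > 0)


def behavior_policy(state, epsilon, rng=random):
    """Epsilon-greedy with respect to the target policy (uniform exploration assumed)."""
    if rng.random() < epsilon:
        return rng.randrange(2)
    return target_policy(state)


rng = random.Random(0)
s = [0.5, 0.0, 0.2, 0.0]                   # position, velocity, angle, angular velocity
r = reward(s)                              # 1 - 0.5 - 0.2 = 0.3
a = behavior_policy(s, epsilon=0.3, rng=rng)
```

State indices are zero-based in the code, so `state[0]` and `state[2]` correspond to $s_{t,1}$ and $s_{t,3}$.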

Model parameters

For the proposed forward and backward models, we set the abstracted state dimension to $100$. For the two-step method, we apply forward abstraction followed by backward abstraction, reducing the dimension as $300\rightarrow 100\rightarrow 6$ for $\epsilon\in\{0.1,0.3\}$ and as $300\rightarrow 100\rightarrow 2$ for $\epsilon\in\{0.5,0.7\}$.

C.2.2 LunarLander-v2

Data generating processes

We similarly insert 292 irrelevant auto-regressive features in the state:

$$\mathbb{P}(S_{t+1,j}|S_{t},A_{t})=\mathbb{P}(S_{t+1,j}|S_{t,j}),\qquad j=9,\dots,300.$$

The number of trajectories $n$ in the offline dataset is chosen from $\{7,13,20\}$. Trajectory length differs significantly in this environment: some lengthy episodes can have more than $100000$ decision points, while short episodes have fewer than $100$. When trained and evaluated on the short episodes, OPE methods will fail due to huge distributional drift. We therefore truncate any episode that exceeds 1000 decision points at 1000 and call it a long episode; episodes with fewer than 1000 decision points are called short episodes. When generating trajectories, we use a long-short combination for each size: $\{7=5_{long}+2_{short},\,13=10_{long}+3_{short},\,20=15_{long}+5_{short}\}$. The target policy is an estimated optimal policy pre-trained by a DQN agent, whereas the behavior policy is again $\epsilon$-greedy with respect to the target policy, with $\epsilon\in\{0.1,\,0.3,\,0.5\}$. Results are averaged over 30 runs for each $(n,\epsilon)$ pair and are reported in Figure C.1.
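The truncation rule and the long-short mix can be sketched as follows. The helper name is hypothetical, and treating an episode of exactly 1000 decision points as long is an assumption, since the text leaves that boundary case open.

```python
MAX_LEN = 1000
MIX = {7: (5, 2), 13: (10, 3), 20: (15, 5)}   # n -> (number of long, number of short)


def truncate_and_label(episode, max_len=MAX_LEN):
    """Cut an episode at max_len decision points and label it long or short."""
    if len(episode) >= max_len:
        return episode[:max_len], "long"
    return episode, "short"


long_ep, long_label = truncate_and_label(list(range(1500)))
short_ep, short_label = truncate_and_label(list(range(80)))
```

Each dataset size in `MIX` is the sum of its long and short counts, matching the combinations listed above.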

Model parameters

For the forward and backward models, we reduce the state dimension as $300\rightarrow 100$, and for the two-step method we reduce the dimension as $300\rightarrow 50\rightarrow 4$, by first using the forward model and then the backward model.

Pre-trained agent

We pre-train an agent using DQN and use it as our target policy. The agent is trained until there exists an episode whose cumulative discounted reward exceeds $200$ with discount rate $\gamma=0.99$. We evaluate the oracle value (61.7) of the trained agent by the Monte Carlo method with the same discount rate. The agent model architecture is as follows:

DQN(
  (fc1): Linear(in_features=8, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=4, bias=True)
)
Figure C.1: MSEs and biases of FQE estimators when applied to ground and abstract state spaces with various abstractions. The behavior policy is $\epsilon$-greedy with respect to the target policy, with $\epsilon=0.1,0.3,0.5$ from left to right.
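The Monte Carlo evaluation of the oracle value mentioned under "Pre-trained agent" amounts to averaging discounted returns over rollouts of the target policy. The sketch below shows that computation on made-up reward sequences; the function names are hypothetical and the rollout collection itself is omitted.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^(t-1) * R_t, accumulated backwards for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_value(reward_sequences, gamma=0.99):
    """Average discounted return over independent rollouts."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)


# Toy reward sequences (hypothetical), with gamma = 0.5 for easy checking.
value = monte_carlo_value([[1.0, 1.0, 1.0], [0.0]], gamma=0.5)
```

With these toy inputs the two returns are 1.75 and 0, so the estimate is 0.875.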

C.3 Licences for existing assets

We consider two environments from OpenAI Gym (Brockman et al., 2016), “CartPole-v0” and “LunarLander-v2” with the MIT License and Copyright (c) 2016 OpenAI (https://openai.com).

C.4 Computing resources

C.4.1 CartPole-v0

To build Figure 4, we trained 3 abstraction methods and one non-abstraction method on 4 different sizes of data, each with 30 runs, under 4 $\epsilon$ values. Each run takes approximately 1.5 minutes for four methods on an E2-series CPU with 64GB memory on Google Cloud Platform (GCP). It takes about 12 compute hours to complete all the experiments in the figure.

C.4.2 LunarLander-v2

To build Figure C.1, we trained 3 abstraction methods and one non-abstraction method on 3 different sizes of data, each with 30 runs, under 3 $\epsilon$ values. On average, each run takes approximately 4 minutes for four methods on an E2-series CPU with 64GB memory on GCP. It takes about 18 compute hours to complete all the experiments in the figure.
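The compute totals reported in C.4.1 and C.4.2 follow from multiplying the number of data sizes, $\epsilon$ values and runs by the per-run time. A short check, with a hypothetical helper name:

```python
def compute_hours(n_sizes, n_eps, n_runs, minutes_per_run):
    """Total compute time in hours; each run covers all four methods."""
    return n_sizes * n_eps * n_runs * minutes_per_run / 60.0


cartpole_hours = compute_hours(4, 4, 30, 1.5)      # 480 runs at 1.5 minutes each
lunarlander_hours = compute_hours(3, 3, 30, 4.0)   # 270 runs at 4 minutes each
```

These evaluate to 12 and 18 hours, matching the reported figures.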

Appendix D Technical proofs

We provide the detailed proofs of our theorems (Theorems 1, 2, 3, 4) in this section.

Notations. For events or random variables $A,B,C$, $A\perp\!\!\!\perp B$ means the independence between $A$ and $B$ whereas $A\perp\!\!\!\perp B|C$ means the conditional independence between $A$ and $B$ given $C$.

D.1 Proof of Theorem 1

We prove Theorem 1 in this subsection. We first prove that under $Q^{\pi}$-, $\rho^{\pi}$- or $w^{\pi}$-irrelevance, the corresponding methods remain valid when applied to the abstract state space:

  • $Q^{\pi}$-irrelevance. By definition, $Q^{\pi}$ is the expected return given an initial state $S_1$ and $A_1$. Under $Q^{\pi}$-irrelevance, the Q-function depends on $S_1$ only through $\phi(S_1)$. It follows that $Q^{\pi}$ equals the expected return given $\phi(S_1)$ and $A_1$, the latter being $Q_{\phi}^{\pi}$ – the $Q$-function when restricted to the abstract state space, i.e., $Q_{\phi}^{\pi}(a,\phi(s))=\sum_{t\geq 1}\gamma^{t-1}\mathbb{E}^{\pi}[R_t|A_1=a,\phi(S_1)=\phi(s)]$. It follows that

    $$\begin{aligned}
    \mathbb{E}[f_1(Q^{\pi})] &= \sum_{a,s}\pi(a|s)Q^{\pi}(a,s)\mathbb{P}(S_1=s)\\
    &= \sum_{a,s}\pi(a|s)Q_{\phi}^{\pi}(a,\phi(s))\mathbb{P}(S_1=s)\\
    &= \mathbb{E}[f_1(Q^{\pi}_{\phi})].
    \end{aligned}$$

  • $\rho^{\pi}$-irrelevance. We first establish the equivalence between $\rho^{\pi}$ and $\rho^{\pi}_{\phi}$ – the IS ratio defined on the abstract state space. Under $\rho^{\pi}$-irrelevance, $\rho^{\pi}(a,s)$ depends on $s$ only through $x=\phi(s)$, i.e., it is constant over $s\in\phi^{-1}(x)$. Consequently, for any conditional probability mass function (pmf) $f(s|x)$ such that $\sum_{s\in\phi^{-1}(x)}f(s|x)=1$, we have $\rho^{\pi}(a,s)=\sum_{s\in\phi^{-1}(x)}f(s|x)\rho^{\pi}(a,s)$. By setting $f(s|x)$ to the pmf of $S_t=s$ given $A_t=a$ and $\phi(S_t)=x$, it follows that

    $$\rho^{\pi}(a,s)=\sum_{s\in\phi^{-1}(x)}\mathbb{P}(S_t=s|A_t=a,\phi(S_t)=x)\rho^{\pi}(a,s). \tag{D.1}$$

    Notice that

    $$\mathbb{P}(S_t=s|A_t=a,\phi(S_t)=x)=\frac{\mathbb{P}(A_t=a,S_t=s|\phi(S_t)=x)}{\mathbb{P}(A_t=a|\phi(S_t)=x)}.$$

    The denominator equals $b_{\phi,t}(a|x)$, the behavior policy when restricted to the abstract state space at time $t$. Note that this behavior policy can be non-stationary over time, even though $b$ is time-invariant. As for the numerator, it is straightforward to show that it equals $b(a|s)\mathbb{P}(S_t=s|\phi(S_t)=x)$. This together with (D.1) yields

    $$\rho^{\pi}(a,s)=\sum_{s\in\phi^{-1}(x)}\frac{\pi(a|s)}{b_{\phi,t}(a|x)}\mathbb{P}(S_t=s|\phi(S_t)=x)=\frac{\pi_{\phi,t}(a|x)}{b_{\phi,t}(a|x)}, \tag{D.2}$$

    where $\pi_{\phi,t}$ denotes the target policy restricted to the abstract state space at time $t$. The last term in (D.2) is given by $\rho_{\phi,t}^{\pi}$. Consequently, the cumulative IS ratio $\rho_{1:t}^{\pi}$ is equal to $\prod_{k=1}^{t}\rho_{\phi,k}^{\pi}(A_k,\phi(S_k))$. This in turn yields $\mathbb{E}[f_2(\rho^{\pi})]=\mathbb{E}[f_2(\rho^{\pi}_{\phi})]$.

  • $w^{\pi}$-irrelevance. Similar to the proof under $\rho^{\pi}$-irrelevance, the key lies in establishing the equivalence between $w^{\pi}(a,s)$ and $w^{\pi}_{\phi}(a,\phi(s))$, the latter being the MIS ratio defined on the abstract state space. Once this has been proven, it is immediate to see that $\mathbb{E}[f_3(w^{\pi})]=\mathbb{E}[f_3(w^{\pi}_{\phi})]$, so that MIS remains valid when applied to the abstract state space.

    As discussed in Section 2.3, to guarantee the unbiasedness of the MIS estimator, we additionally require a stationarity assumption. Under this requirement, for a given state-action pair $(S,A)$ in the offline data, its joint pmf can be represented as $p_{\infty}\times b$ where $p_{\infty}$ denotes the marginal state distribution under the behavior policy. Additionally, let $p^{\pi}_t$ denote the pmf of $S_t$ generated under the target policy $\pi$. The MIS ratio can be represented by

    $$w^{\pi}(a,s)=\frac{(1-\gamma)\sum_{t\geq 1}\gamma^{t-1}p_t^{\pi}(s)\pi(a|s)}{p_{\infty}(s)b(a|s)}.$$

    Similar to (D.2), under $w^{\pi}$-irrelevance, it follows that

    $$\begin{aligned}
    w^{\pi}(a,s) &= (1-\gamma)\sum_{s\in\phi^{-1}(x)}\frac{\sum_{t\geq 1}\gamma^{t-1}p_t^{\pi}(s)\pi(a|s)}{p_{\infty}(s)b_{\phi}(a|x)}\mathbb{P}(S=s|\phi(S)=x)\\
    &= \frac{(1-\gamma)\sum_{s\in\phi^{-1}(x)}\sum_{t\geq 1}\gamma^{t-1}p_t^{\pi}(s)\pi(a|s)}{p_{\infty}(x)b_{\phi}(a|x)}.
    \end{aligned}$$

    Here, the subscript $t$ in $b_{\phi}$ and $S$ is dropped due to stationarity. Additionally, $p_{\infty}(x)$ is used to denote the pmf of $\phi(S)$, albeit with a slight abuse of notation. Moreover, the numerator represents the discounted visitation probability of $(A,\phi(S))$ under $\pi$. This proves that $w^{\pi}(a,s)=w^{\pi}_{\phi}(a,\phi(s))$.
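
The identity (D.2) from the $\rho^{\pi}$-irrelevance step can be checked numerically on a toy example. The sketch below is illustrative only: the three ground states, the state pmf `p_state` and both policies are hypothetical numbers chosen so that $\pi(a|s)/b(a|s)$ agrees on the two states that $\phi$ merges. It then verifies that the ground ratio coincides with $\pi_{\phi}(a|x)/b_{\phi}(a|x)$.

```python
import math

# Ground states 0, 1, 2; phi merges states 0 and 1 into abstract state 0.
phi = {0: 0, 1: 0, 2: 1}
# Hypothetical marginal pmf of S_t in the offline data.
p_state = {0: 0.2, 1: 0.3, 2: 0.5}
# Behavior policy b(a|s) and target policy pi(a|s) over three actions,
# chosen so that pi/b agrees on states 0 and 1 (rho-irrelevance).
b = {0: [0.5, 0.25, 0.25], 1: [0.25, 0.5, 0.25], 2: [0.6, 0.2, 0.2]}
pi = {0: [0.4, 0.2, 0.4], 1: [0.2, 0.4, 0.4], 2: [0.1, 0.3, 0.6]}
N_ACTIONS = 3


def abstract_policy(policy):
    """Average policy(a|s) over P(S_t = s | phi(S_t) = x) within each class."""
    out = {}
    for x in set(phi.values()):
        members = [s for s in phi if phi[s] == x]
        mass = sum(p_state[s] for s in members)
        out[x] = [sum(policy[s][a] * p_state[s] for s in members) / mass
                  for a in range(N_ACTIONS)]
    return out


b_phi = abstract_policy(b)
pi_phi = abstract_policy(pi)


def rho(a, s):
    """IS ratio on the ground state space."""
    return pi[s][a] / b[s][a]


def rho_phi(a, x):
    """IS ratio on the abstract state space, the right-hand side of (D.2)."""
    return pi_phi[x][a] / b_phi[x][a]


# (D.2): the two ratios agree for every state-action pair.
ratios_match = all(math.isclose(rho(a, s), rho_phi(a, phi[s]))
                   for s in phi for a in range(N_ACTIONS))
print(ratios_match)
```

If the target policy on state 1 is perturbed so that the ratios on states 0 and 1 differ, the check fails, which is the situation $\rho^{\pi}$-irrelevance rules out.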

Finally, we establish the validity of DRL. By the double robustness property, DRL is valid when either $Q^{\pi}$ or $w^{\pi}$ is correctly specified. Under $Q^{\pi}$-irrelevance, we have $Q^{\pi}(a,s)=Q^{\pi}_{\phi}(a,\phi(s))$ and thus DRL remains valid when applied to the abstract state space. Similarly, we have $w^{\pi}(a,s)=w^{\pi}_{\phi}(a,\phi(s))$ under $w^{\pi}$-irrelevance, which in turn implies DRL’s validity. This completes the proof.
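
The identity $w^{\pi}(a,s)=w^{\pi}_{\phi}(a,\phi(s))$ used above can likewise be illustrated on a small chain. The following is a sketch under strong assumptions of our own, not the paper's experimental setup: all transition probabilities and policies are hypothetical, both policies depend on the state only through $\phi$, and every abstract state splits its mass across its ground states in fixed proportions `q`. Under these assumptions $w^{\pi}$-irrelevance holds by construction, and the code compares the ground MIS ratio with the abstract one.

```python
import math

GAMMA = 0.9
STATES, ACTIONS = [0, 1, 2], [0, 1]
phi = {0: 0, 1: 0, 2: 1}           # ground state -> abstract state
q = {0: 0.25, 1: 0.75, 2: 1.0}     # fixed split of each abstract state's mass

# Hypothetical abstract dynamics T_x[(x, a)][x'] and abstract-level policies.
T_x = {(0, 0): [0.7, 0.3], (0, 1): [0.2, 0.8],
       (1, 0): [0.5, 0.5], (1, 1): [0.9, 0.1]}
b_x = {0: [0.5, 0.5], 1: [0.3, 0.7]}    # behavior policy b(a | phi(s))
pi_x = {0: [0.9, 0.1], 1: [0.2, 0.8]}   # target policy pi(a | phi(s))
init_x = [0.6, 0.4]                     # initial abstract-state pmf


def step(p, policy_x):
    """Push a ground-state pmf one step forward under the given policy."""
    nxt = {s: 0.0 for s in STATES}
    for s in STATES:
        for a in ACTIONS:
            for s2 in STATES:
                trans = T_x[(phi[s], a)][phi[s2]] * q[s2]
                nxt[s2] += p[s] * policy_x[phi[s]][a] * trans
    return nxt


# Stationary state pmf p_inf under the behavior policy (power iteration).
p_inf = {s: 1.0 / len(STATES) for s in STATES}
for _ in range(500):
    p_inf = step(p_inf, b_x)

# Discounted visitation (1 - gamma) * sum_t gamma^(t-1) p_t(s) under pi,
# truncated after 500 steps.
p_t = {s: init_x[phi[s]] * q[s] for s in STATES}
d_pi = {s: 0.0 for s in STATES}
for t in range(500):
    for s in STATES:
        d_pi[s] += (1 - GAMMA) * GAMMA ** t * p_t[s]
    p_t = step(p_t, pi_x)


def w_ground(a, s):
    """MIS ratio on the ground state space."""
    return d_pi[s] * pi_x[phi[s]][a] / (p_inf[s] * b_x[phi[s]][a])


def w_abstract(a, x):
    """MIS ratio on the abstract state space."""
    members = [s for s in STATES if phi[s] == x]
    numer = sum(d_pi[s] * pi_x[phi[s]][a] for s in members)
    p_inf_x = sum(p_inf[s] for s in members)
    b_phi = sum(b_x[phi[s]][a] * p_inf[s] for s in members) / p_inf_x
    return numer / (p_inf_x * b_phi)


w_match = all(math.isclose(w_ground(a, s), w_abstract(a, phi[s]), rel_tol=1e-9)
              for s in STATES for a in ACTIONS)
print(w_match)
```

The fixed-proportion split is only one convenient way to obtain $w^{\pi}$-irrelevance; the proof itself requires no such structure.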

D.2 Proof of Theorem 2

We prove Theorem 2 in this subsection.

  • For any $s^{(1)}$ and $s^{(2)}$ that satisfy (2), we aim to prove

    $$Q^{\pi}(a,s^{(1)})=Q^{\pi}(a,s^{(2)}).$$

    To that end, we proceed by induction. Denote

    $$\begin{aligned}
    Q_j^{\pi}(a,s) &= \mathbb{E}^{\pi}\left[\sum_{t=1}^{j}\gamma^{t-1}R_t|S_1=s,A_1=a\right],\,\,\hbox{and}\\
    V_j^{\pi}(s) &= \mathbb{E}^{\pi}\left[\sum_{t=1}^{j}\gamma^{t-1}R_t|S_1=s\right].
    \end{aligned}$$

    Under reward-irrelevance, we have

    $$\begin{aligned}
    Q_1^{\pi}(a,s^{(1)}) &= \mathbb{E}^{\pi}\left[R_1|S_1=s^{(1)},A_1=a\right]\\
    &= \mathcal{R}(a,s^{(1)})\\
    &= \mathcal{R}(a,s^{(2)})\\
    &= Q_1^{\pi}(a,s^{(2)}).
    \end{aligned}$$

    Together with $\pi$-irrelevance, we obtain that

    $$\begin{aligned}
    V_1^{\pi}(s^{(1)}) &= \sum_{a}\mathbb{E}^{\pi}\left[R_1|S_1=s^{(1)},A_1=a\right]\pi(a|s^{(1)})\\
    &= \sum_{a}\mathcal{R}(a,s^{(1)})\pi(a|s^{(1)})\\
    &= \sum_{a}\underbrace{\mathcal{R}(a,s^{(2)})}_{\mbox{reward-irrelevant}}\underbrace{\pi(a|s^{(2)})}_{\pi\mbox{-irrelevant}}\\
    &= V_1^{\pi}(s^{(2)}).
    \end{aligned}$$

    Suppose we have shown that the following holds for any $j<T$,

    $$Q_j^{\pi}(a,s^{(1)})=Q_j^{\pi}(a,s^{(2)})\,\,\mbox{and}\,\,V_j^{\pi}(s^{(1)})=V_j^{\pi}(s^{(2)}). \tag{D.3}$$

    Our goal is to show (D.3) holds with $j=T$.

    We similarly define $Q_{j,\phi}^{\pi}$ and $V_{j,\phi}^{\pi}$ as the Q- and value functions defined on the abstract state space. Similar to the proof of Theorem 1, we can show that $Q_j^{\pi}=Q_{j,\phi}^{\pi}$ and $V_j^{\pi}=V_{j,\phi}^{\pi}$ for any $j<T$. It follows that

    $$Q^{\pi}_{T}(a,s^{(1)})=\mathbb{E}^{\pi}\left[\sum_{t=1}^{T}\gamma^{t-1}R_t|S_1=s^{(1)},A_1=a\right]$$
    $$=\mathbb{E}^{\pi}\left[\sum_{t=2}^{T}\gamma^{t-1}R_t|S_1=s^{(1)},A_1=a\right]+\mathcal{R}(a,s^{(1)})$$
    $$=\gamma\sum_{s'\in\mathcal{S}}\mathbb{E}^{\pi}\left[\sum_{t=2}^{T}\gamma^{t-2}R_t|S_2=s'\right]\mathcal{T}(s'|s^{(1)},a)+\mathcal{R}(a,s^{(1)})$$
    $$=\gamma\sum_{x'\in\mathcal{X}}\sum_{s'\in\phi^{-1}(x')}\mathbb{E}^{\pi}\left[\sum_{t=2}^{T}\gamma^{t-2}R_t|S_2=s'\right]\mathcal{T}(s'|s^{(1)},a)+\mathcal{R}(a,s^{(1)})$$
    $$=\gamma\sum_{x'\in\mathcal{X}}\sum_{s'\in\phi^{-1}(x')}V_{T-1}^{\pi}(s')\mathcal{T}(s'|s^{(1)},a)+\mathcal{R}(a,s^{(1)})$$
    $$=\gamma\sum_{x'\in\mathcal{X}}\underbrace{V_{T-1,\phi}^{\pi}(x')}_{\mbox{by}\,\,\text{(D.3)}}\sum_{s'\in\phi^{-1}(x')}\mathcal{T}(s'|s^{(1)},a)+\mathcal{R}(a,s^{(1)})$$
    $$=\gamma\sum_{x'\in\mathcal{X}}\underbrace{V_{T-1,\phi}^{\pi}(x')}_{\mbox{by}\,\,\text{(D.3)}}\underbrace{\sum_{s'\in\phi^{-1}(x')}\mathcal{T}(s'|s^{(2)},a)}_{\text{(2)}}+\mathcal{R}(a,s^{(2)})$$
    =\displaystyle== QTπ(a,s(2)).subscriptsuperscript𝑄𝜋𝑇𝑎superscript𝑠2\displaystyle Q^{\pi}_{T}(a,s^{(2)}).italic_Q start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( italic_a , italic_s start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT ) .

    This together with $\pi$-irrelevance proves $V_T^{\pi}$-irrelevance. Consequently, (D.3) holds for any $j \geq 1$. Since $Q_j^{\pi} \to Q^{\pi}$ as $j \to \infty$, we obtain $Q^{\pi}$-irrelevance.
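The induction above can be checked numerically on a toy example. The following is a minimal sketch, not part of the original proof: the four-state MDP, the abstraction `PHI`, and all numbers are hypothetical, chosen so that rewards, the target policy, and the aggregated transition probabilities depend on the state only through $\phi$. The within-class transition probabilities differ on purpose, so the agreement of the Q-values is not trivial.

```python
# Toy check: model-irrelevance plus pi-irrelevance gives Q_j-irrelevance for every j.
# All quantities below are hypothetical illustration values.
GAMMA = 0.9
PHI = [0, 0, 1, 1]          # abstraction: states 0,1 -> class 0; states 2,3 -> class 1
N_S, N_A = 4, 2

# T[a][s][s2] = transition probability. For states in the same class, the mass
# sent to each abstract class agrees, but the split within a class differs.
T = [
    [[0.5, 0.2, 0.1, 0.2], [0.1, 0.6, 0.3, 0.0],
     [0.2, 0.2, 0.3, 0.3], [0.4, 0.0, 0.1, 0.5]],
    [[0.1, 0.1, 0.4, 0.4], [0.0, 0.2, 0.8, 0.0],
     [0.3, 0.6, 0.05, 0.05], [0.9, 0.0, 0.1, 0.0]],
]
R = [[1.0, 1.0, 0.0, 0.0], [0.5, 0.5, 2.0, 2.0]]           # R[a][s], reward-irrelevant
PI = [[0.3, 0.7], [0.3, 0.7], [0.6, 0.4], [0.6, 0.4]]      # PI[s][a], pi-irrelevant


def q_iteration(num_steps):
    """Return Q_j with j = num_steps, using Q_j = R + gamma * T V_{j-1} and V_0 = 0."""
    v = [0.0] * N_S
    q = [[0.0] * N_S for _ in range(N_A)]
    for _ in range(num_steps):
        q = [[R[a][s] + GAMMA * sum(T[a][s][s2] * v[s2] for s2 in range(N_S))
              for s in range(N_S)] for a in range(N_A)]
        v = [sum(PI[s][a] * q[a][s] for a in range(N_A)) for s in range(N_S)]
    return q


def max_within_class_gap(q):
    """Largest |Q(a, s1) - Q(a, s2)| over pairs of states with phi(s1) = phi(s2)."""
    return max(abs(q[a][s1] - q[a][s2])
               for a in range(N_A)
               for s1 in range(N_S) for s2 in range(N_S)
               if PHI[s1] == PHI[s2])


if __name__ == "__main__":
    for j in (1, 5, 50):
        print(j, max_within_class_gap(q_iteration(j)))
```

The printed gaps should be zero up to floating-point rounding for every `j`, mirroring the statement that (D.3) holds for any $j \geq 1$.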

  • We will prove that the MIS estimator constructed on the abstract state space remains valid. With a slight abuse of notation, we use $p_t^{\pi}(a,x)$ to denote the probability $\mathbb{P}^{\pi}(A_t = a, \phi(S_t) = x)$. Under the stationarity assumption, direct calculations yield

    $\mathbb{E}[f_3(w^{\pi}_{\phi})] = \mathbb{E}\left[(1-\gamma)^{-1} w_{\phi}^{\pi}(A, \phi(S)) R\right]$
    $= \mathbb{E}\left[(1-\gamma)^{-1} w_{\phi}^{\pi}(A, \phi(S)) \mathcal{R}\big(A, S\big)\right]$
    $= \mathbb{E}\Big[(1-\gamma)^{-1} w_{\phi}^{\pi}(A, \phi(S)) \underbrace{\mathcal{R}\big(A, \phi(S)\big)}_{\text{reward-irrelevant}}\Big]$
    $= \sum_{a \in \mathcal{A}, x \in \mathcal{X}} \sum_{t=1}^{+\infty} \gamma^{t-1} p_t^{\pi}(a, x) \mathcal{R}_{\phi}(a, x)$
    $= \sum_{a \in \mathcal{A}, x \in \mathcal{X}} \sum_{s \in \phi^{-1}(x)} \sum_{t=1}^{+\infty} \gamma^{t-1} \pi(a | s) p_t^{\pi}(s) \mathcal{R}(a, s)$
    $= \sum_{t=1}^{+\infty} \gamma^{t-1} \mathbb{E}^{\pi}(R_t)$
    $= \mathbb{E}[f_3(w^{\pi})].$

    Notice that we only require reward-irrelevance in the above proof.
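The remark that only reward-irrelevance is needed can be illustrated numerically. The sketch below is an assumption-laden toy, not the authors' code: the MDP, the policy `PI`, the initial distribution `INIT`, and the data distribution `D_B` (a stand-in for the stationary behavior state-action distribution) are all hypothetical, the infinite sum is truncated at `HORIZON` steps, and neither the transitions nor the target policy respect the abstraction. Only the reward depends on the state through $\phi$.

```python
# Toy check: the MIS value computed with the abstract ratio w_phi matches the
# (truncated) discounted return when only the reward is phi-measurable.
# All quantities below are hypothetical illustration values.
GAMMA = 0.9
HORIZON = 200               # truncation of the infinite discounted sum
PHI = [0, 0, 1, 1]
N_S, N_A, N_X = 4, 2, 2

# T[a][s][s2]: deliberately NOT transition-irrelevant.
T = [
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.2, 0.3, 0.4],
     [0.25, 0.25, 0.25, 0.25], [0.0, 0.5, 0.5, 0.0]],
    [[0.2, 0.2, 0.3, 0.3], [0.6, 0.1, 0.1, 0.2],
     [0.1, 0.1, 0.4, 0.4], [0.3, 0.3, 0.2, 0.2]],
]
R_PHI = [[1.0, 0.0], [0.5, 2.0]]                           # R_PHI[a][x], reward-irrelevant
PI = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1], [0.4, 0.6]]      # PI[s][a], varies within classes
INIT = [0.4, 0.1, 0.3, 0.2]                                # initial state distribution
D_B = [[0.10, 0.15], [0.05, 0.20], [0.20, 0.05], [0.15, 0.10]]  # D_B[s][a], sums to 1


def discounted_occupancy():
    """occ[s][a] = sum_{t=1}^{HORIZON} gamma^(t-1) * p_t^pi(s) * pi(a|s)."""
    occ = [[0.0] * N_A for _ in range(N_S)]
    p, disc = INIT[:], 1.0
    for _ in range(HORIZON):
        for s in range(N_S):
            for a in range(N_A):
                occ[s][a] += disc * p[s] * PI[s][a]
        # Propagate the state distribution one step under the target policy.
        p = [sum(p[s] * PI[s][a] * T[a][s][s2]
                 for s in range(N_S) for a in range(N_A)) for s2 in range(N_S)]
        disc *= GAMMA
    return occ


def true_value(occ):
    """Truncated discounted return, computed on the ground state space."""
    return sum(occ[s][a] * R_PHI[a][PHI[s]] for s in range(N_S) for a in range(N_A))


def abstract_mis_value(occ):
    """E_b[(1-gamma)^{-1} w_phi(A, phi(S)) R], with w_phi built from (a, x) marginals."""
    occ_x = [[0.0] * N_X for _ in range(N_A)]
    db_x = [[0.0] * N_X for _ in range(N_A)]
    for s in range(N_S):
        for a in range(N_A):
            occ_x[a][PHI[s]] += occ[s][a]
            db_x[a][PHI[s]] += D_B[s][a]
    # The factors (1-gamma) in w_phi and (1-gamma)^{-1} in f_3 cancel.
    return sum(D_B[s][a] * occ_x[a][PHI[s]] / db_x[a][PHI[s]] * R_PHI[a][PHI[s]]
               for s in range(N_S) for a in range(N_A))


if __name__ == "__main__":
    occ = discounted_occupancy()
    print(true_value(occ), abstract_mis_value(occ))
```

Both printed numbers should agree up to rounding. Replacing `R_PHI[a][PHI[s]]` with a reward that varies within an abstract class would generally break the agreement, which is the role reward-irrelevance plays in the third equality.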

  • It suffices to show that

    $\mathbb{E}[\rho_{1:t}^{\pi} R_t] = \mathbb{E}\Big[\prod_{k=1}^{t} \rho_{\phi,t}^{\pi}(A_k, \phi(S_k)) R_t\Big],$ (D.4)

    for any $t$. Under the Markov assumption, $R_t$ is independent of past state-action pairs given $A_t$ and $S_t$. Consequently, the left-hand side can be represented as

    $\mathbb{E}[\mathbb{E}(\rho_{1:t-1}^{\pi} | A_t, S_t) \rho^{\pi}(A_t, S_t) R_t].$

    Additionally, since the generation of $A_t$ depends only on $S_t$, the inner expectation equals $\mathbb{E}(\rho_{1:t-1}^{\pi} | S_t)$, which can be further shown to equal $p_t^{\pi}(S_t)/p_{\infty}(S_t)$. This allows us to represent the left-hand side of (D.4) as

    $\mathbb{E}\Big[\frac{p_t^{\pi}(S_t)}{p_{\infty}(S_t)} \rho^{\pi}(A_t, S_t) R_t\Big].$ (D.5)
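The identity $\mathbb{E}(\rho_{1:t-1}^{\pi} | S_t) = p_t^{\pi}(S_t)/p_{\infty}(S_t)$ is stated without derivation. One way to see it is sketched below, under assumptions that are not spelled out in the text: $b$ denotes the behavior policy, $\nu$ denotes an initial state distribution shared by the behavior and target trajectories, and $p_{\infty}(s)$ is the behavior marginal $\mathbb{P}(S_t = s)$, which does not depend on $t$ under stationarity. These symbols are introduced here for illustration only.

```latex
\begin{align*}
\mathbb{E}\big[\rho_{1:t-1}^{\pi} \mathbf{1}(S_t = s)\big]
  &= \sum_{\substack{s_1, \dots, s_{t-1} \\ a_1, \dots, a_{t-1}}}
     \nu(s_1) \prod_{k=1}^{t-1}
     \Big[ b(a_k | s_k) \frac{\pi(a_k | s_k)}{b(a_k | s_k)}
           \mathcal{T}(s_{k+1} | s_k, a_k) \Big]
     \quad (\text{with } s_t = s) \\
  &= \sum_{\substack{s_1, \dots, s_{t-1} \\ a_1, \dots, a_{t-1}}}
     \nu(s_1) \prod_{k=1}^{t-1}
     \Big[ \pi(a_k | s_k) \mathcal{T}(s_{k+1} | s_k, a_k) \Big]
   = p_t^{\pi}(s), \\
\mathbb{E}\big(\rho_{1:t-1}^{\pi} \,\big|\, S_t = s\big)
  &= \frac{\mathbb{E}\big[\rho_{1:t-1}^{\pi} \mathbf{1}(S_t = s)\big]}{\mathbb{P}(S_t = s)}
   = \frac{p_t^{\pi}(s)}{p_{\infty}(s)}.
\end{align*}
```

The behavior policy cancels term by term inside the product, which is why only the target-policy state marginal survives in the numerator.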

    Using arguments similar to those used to prove the validity of the MIS estimator, under reward-irrelevance, (D.5) can be shown to equal

    $\sum_{a \in \mathcal{A}, x \in \mathcal{X}} p_t^{\pi}(a, x) \mathcal{R}_{\phi}(a, x).$ (D.6)

    Under transition-irrelevance, the data triplets $(\phi(S), A, R)$ form an MDP, satisfying the Markov assumption. Let $\mathcal{T}_{\phi}$ denote the resulting transition function. Together with $\pi$-irrelevance, we can rewrite (D.6) as

    $\sum_{\substack{a_1, \cdots, a_t \in \mathcal{A} \\ x_1, \cdots, x_t \in \mathcal{X}}} \rho_0(x_1) \prod_{k=1}^{t-1} \Big[\pi_{\phi}(a_k | x_k) \mathcal{T}_{\phi}(x_{k+1} | a_k, x_k)\Big] \pi_{\phi}(a_t | x_t) \mathcal{R}_{\phi}(a_t, x_t).$

    Notice that $\mathcal{T}_{\phi}$ is independent of the target policy $\pi$. Using the change of measure theorem, we can represent the above expression by $\mathbb{E}(\rho_{1:t,\phi}^{\pi} R_t)$, where $\rho_{1:t,\phi}^{\pi}$ denotes the cumulative IS ratio defined on the abstract state space. This completes the proof.

  • Since model-irrelevance implies $Q^{\pi}$-irrelevance, the conclusion directly follows from the last conclusion of Theorem 1.

D.3 Proof of Theorem 3

At the beginning of the proof, we introduce the term Inverse Markovianity for the following phenomenon: the time-reversed sequence of state-action pairs retains the Markov property.

  • $\rho^{\pi}$-irrelevance directly follows from the definition of backward-model-irrelevance. To show $w^{\pi}$-irrelevance, we divide the proof into two steps.
    (1) In the first step, we will prove that if $\phi$ satisfies backward-model-irrelevance, then

    $\rho^{\pi}(A_{t-k}, S_{t-k}) \perp\!\!\!\perp S_t \,|\, \phi(S_t), \quad 1 \leq k \leq t-1.$ (D.7)

    It follows from equation (8) that

    $\mathbb{P}\big(\phi(S_{t-k}) = x | S_{t-k+1}\big) = \mathbb{P}\big(\phi(S_{t-k}) = x | \phi(S_{t-k+1})\big), \quad 1 \leq k \leq t-1.$

    We prove by induction that, for $1 \leq k \leq t-1$,

    $\rho^{\pi}(A_{t-k}, S_{t-k}) \perp\!\!\!\perp S_t \,|\, \phi(S_t).$ (D.8)

    For $k = 1$, we have for any positive constant $c$,

    $\mathbb{P}\big(\rho^{\pi}(A_{t-1}, S_{t-1}) = c | S_t\big) = \mathbb{P}[\rho_{\phi,t-1}^{\pi}\big(A_{t-1}, \phi(S_{t-1})\big) = c | S_t]$
    $= \mathbb{P}[\rho_{\phi,t-1}^{\pi}\big(A_{t-1}, \phi(S_{t-1})\big) = c | \phi(S_t)],$ (D.9)

    where the first equation is due to $\rho^{\pi}$-irrelevance and the second equation follows from (8). This yields

    $\rho^{\pi}(A_{t-1}, S_{t-1}) \perp\!\!\!\perp S_t \,|\, \phi(S_t).$

    Assume that (D.8) holds for all $k \leq t-2$. We now prove that (D.8) also holds for $k = t-1$. By arguments similar to those for (D.3), and writing $g(\phi(S_2))$ for $\mathbb{P}\big(\rho^{\pi}(A_1, S_1) = c | S_2\big)$, which depends on $S_2$ only through $\phi(S_2)$ by the case $k = 1$, we get

    $\mathbb{P}\big(\rho^{\pi}(A_1, S_1) = c | S_t\big) = \mathbb{P}[\mathbb{P}\big(\rho^{\pi}(A_1, S_1) = c | S_2, A_2, S_t, A_t\big) | S_t]$
    $= \mathbb{P}[\mathbb{P}\big(\rho^{\pi}(A_1, S_1) = c | S_2\big) | S_t]$
    $= \mathbb{P}[g\big(\phi(S_2)\big) | S_t].$ (D.10)

    To prove this, we need to show that for any $1 \leq k \leq t-1$, we have

    $\mathbb{P}\big(\phi(S_{t-k}) = x | S_t\big) = \mathbb{P}\big(\phi(S_{t-k}) = x | \phi(S_t)\big).$ (D.11)

    By the definition of the inverse model, (D.11) holds when $k = 1$. Assume that (D.11) holds for all $k \leq t-2$. We now prove that (D.11) also holds for $k = t-1$:

    $\mathbb{P}\big(\phi(S_1) = x | S_t\big) = \mathbb{P}[\mathbb{P}\big(\phi(S_1) = x | S_2, S_t\big) | S_t]$
    $= \underbrace{\mathbb{P}[\mathbb{P}\big(\phi(S_1) = x | S_2\big) | S_t]}_{\text{Inverse Markovianity}}$
    $= \underbrace{\mathbb{P}[\mathbb{P}\big(\phi(S_1) = x | \phi(S_2)\big) | S_t]}_{\text{(D.11)}}$
    $= \underbrace{\mathbb{P}[g\big(\phi(S_2)\big) | S_t]}_{\text{(D.11)}}$
    $= \mathbb{P}[g\big(\phi(S_2)\big) | \phi(S_t)].$

    This proves (D.11). Combining (D.3) and (D.11), we get

    $\mathbb{P}\big(\rho^{\pi}(A_1, S_1) = c | S_t\big) = \mathbb{P}[g\big(\phi(S_2)\big) | \phi(S_t)].$

    This proves (D.7).

    (2) In the second step, we will prove that if $\phi$ satisfies (D.7) and $\rho^{\pi}$-irrelevance, then it is $w^{\pi}$-irrelevant; namely, any $s^{(1)}$ and $s^{(2)}$ satisfying $\rho_t^{\pi}(a, s^{(1)}) = \rho_t^{\pi}(a, s^{(2)})$ also satisfy

    $w^{\pi}(a, s^{(1)}) = w^{\pi}(a, s^{(2)}).$

    It follows from the definition of state abstraction that, for $s^{(1)}$ and $s^{(2)}$, we have

    $\mathbb{P}(X_t | S_t = s^{(1)}) = \mathbf{1}\big(s^{(1)} \in \phi^{-1}(X_t)\big) = \mathbf{1}\big(s^{(2)} \in \phi^{-1}(X_t)\big) = \mathbb{P}(X_t | S_t = s^{(2)}).$ (D.12)

    By (D.12) and (D.7), we have

    $w^{\pi}(a, s^{(1)}) = \frac{(1-\gamma) \sum_{t=1}^{T} \gamma^{t-1} \mathbb{P}^{\pi}(A_t = a, S_t = s^{(1)})}{\mathbb{P}(A = a, S = s^{(1)})}$
    $= \frac{(1-\gamma) \sum_{t=1}^{T} \gamma^{t-1} \mathbb{P}^{\pi}(A_t = a | S_t = s^{(1)}) \mathbb{P}^{\pi}(S_t = s^{(1)})}{\mathbb{P}(A = a | S = s^{(1)}) \mathbb{P}^{b}(S = s^{(1)})}$
    $= \frac{(1-\gamma) \sum_{t=1}^{T} \gamma^{t-1} \rho_t^{\pi}(a, s^{(1)}) \mathbb{P}^{\pi}(S_t = s^{(1)})}{\mathbb{P}^{b}(S = s^{(1)})}$
    $= \frac{(1-\gamma) \sum_{t=1}^{T} \gamma^{t-1} \rho_t^{\pi}(a, s^{(1)}) \mathbb{E}^{\pi}[\mathbf{1}(S_t = s^{(1)})]}{\mathbb{E}^{b}[\mathbf{1}(S_t = s^{(1)})]}$
    =\displaystyle== (1γ)t=1Tγt1ρtπ(a,s(1))𝔼b[𝟏(St=s(1))j=1t1ρjπ(Aj,Sj)]𝔼b[𝟏(St=s(1))]1𝛾superscriptsubscript𝑡1𝑇superscript𝛾𝑡1superscriptsubscript𝜌𝑡𝜋𝑎superscript𝑠1superscript𝔼𝑏delimited-[]1subscript𝑆𝑡superscript𝑠1superscriptsubscriptproduct𝑗1𝑡1superscriptsubscript𝜌𝑗𝜋subscript𝐴𝑗subscript𝑆𝑗superscript𝔼𝑏delimited-[]1subscript𝑆𝑡superscript𝑠1\displaystyle\frac{(1-\gamma)\sum_{t=1}^{T}\gamma^{t-1}\rho_{t}^{\pi}(a,s^{(1)% })\mathbb{E}^{b}[{\bf{1}}(S_{t}=s^{(1)})\prod_{j=1}^{t-1}\rho_{j}^{\pi}(A_{j},% S_{j})]}{\mathbb{E}^{b}[{\bf{1}}(S_{t}=s^{(1)})]}divide start_ARG ( 1 - italic_γ ) ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_a , italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT [ bold_1 ( italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ] end_ARG start_ARG blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT [ bold_1 ( italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) ] end_ARG
    =\displaystyle== (1γ)t=1Tγt1ρtπ(a,s(1))𝔼b[𝔼b(𝟏(St=s(1))j=1t1ρjπ(Aj,Sj)|Xt)]𝔼b[𝟏(St=s(1))]1𝛾superscriptsubscript𝑡1𝑇superscript𝛾𝑡1superscriptsubscript𝜌𝑡𝜋𝑎superscript𝑠1superscript𝔼𝑏delimited-[]superscript𝔼𝑏conditional1subscript𝑆𝑡superscript𝑠1superscriptsubscriptproduct𝑗1𝑡1superscriptsubscript𝜌𝑗𝜋subscript𝐴𝑗subscript𝑆𝑗subscript𝑋𝑡superscript𝔼𝑏delimited-[]1subscript𝑆𝑡superscript𝑠1\displaystyle\frac{(1-\gamma)\sum_{t=1}^{T}\gamma^{t-1}\rho_{t}^{\pi}(a,s^{(1)% })\mathbb{E}^{b}\left[\mathbb{E}^{b}\left({\bf{1}}(S_{t}=s^{(1)})\prod_{j=1}^{% t-1}\rho_{j}^{\pi}(A_{j},S_{j})|X_{t}\right)\right]}{\mathbb{E}^{b}[{\bf{1}}(S% _{t}=s^{(1)})]}divide start_ARG ( 1 - italic_γ ) ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_a , italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT [ blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( bold_1 ( italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) | italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ] end_ARG start_ARG blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT [ bold_1 ( italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) ] end_ARG
    =\displaystyle== (1γ)t=1Tγt1ρtπ(a,s(1))𝔼b[𝔼b(𝟏(St=s(1))|Xt)𝔼b(j=1t1ρjπ(Aj,Sj)|Xt)]𝔼b[𝟏(St=s(1))]by(D.7)subscript1𝛾superscriptsubscript𝑡1𝑇superscript𝛾𝑡1superscriptsubscript𝜌𝑡𝜋𝑎superscript𝑠1superscript𝔼𝑏delimited-[]superscript𝔼𝑏conditional1subscript𝑆𝑡superscript𝑠1subscript𝑋𝑡superscript𝔼𝑏conditionalsuperscriptsubscriptproduct𝑗1𝑡1superscriptsubscript𝜌𝑗𝜋subscript𝐴𝑗subscript𝑆𝑗subscript𝑋𝑡superscript𝔼𝑏delimited-[]1subscript𝑆𝑡superscript𝑠1byitalic-(D.7italic-)\displaystyle\underbrace{\frac{(1-\gamma)\sum_{t=1}^{T}\gamma^{t-1}\rho_{t}^{% \pi}(a,s^{(1)})\mathbb{E}^{b}\left[\mathbb{E}^{b}\left({\bf{1}}(S_{t}=s^{(1)})% |X_{t}\right)\mathbb{E}^{b}\left(\prod_{j=1}^{t-1}\rho_{j}^{\pi}(A_{j},S_{j})|% X_{t}\right)\right]}{\mathbb{E}^{b}[{\bf{1}}(S_{t}=s^{(1)})]}}_{\mbox{by}\,\,% \eqref{eqn:ind3}}under⏟ start_ARG divide start_ARG ( 1 - italic_γ ) ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_a , italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT [ blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( bold_1 ( italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) | italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) | italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ] end_ARG start_ARG 
blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT [ bold_1 ( italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) ] end_ARG end_ARG start_POSTSUBSCRIPT by italic_( italic_) end_POSTSUBSCRIPT
    =\displaystyle== (1γ)t=1Tγt1ρtπ(a,s(1))𝔼b((Xt|St=s(1))j=1t1ρjπ(Aj,Sj)(Xt))1𝛾superscriptsubscript𝑡1𝑇superscript𝛾𝑡1superscriptsubscript𝜌𝑡𝜋𝑎superscript𝑠1superscript𝔼𝑏conditionalsubscript𝑋𝑡subscript𝑆𝑡superscript𝑠1superscriptsubscriptproduct𝑗1𝑡1superscriptsubscript𝜌𝑗𝜋subscript𝐴𝑗subscript𝑆𝑗subscript𝑋𝑡\displaystyle(1-\gamma)\sum_{t=1}^{T}\gamma^{t-1}\rho_{t}^{\pi}(a,s^{(1)})% \mathbb{E}^{b}\left(\frac{\mathbb{P}(X_{t}|S_{t}=s^{(1)})\prod_{j=1}^{t-1}\rho% _{j}^{\pi}(A_{j},S_{j})}{\mathbb{P}(X_{t})}\right)( 1 - italic_γ ) ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_a , italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( divide start_ARG blackboard_P ( italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_s start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ) ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG blackboard_P ( italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG )
    =\displaystyle== (1γ)t=1Tγt1ρtπ(a,s(2))𝔼b((Xt|St=s(2))j=1t1ρjπ(Aj,Sj)(Xt))by(D.12)subscript1𝛾superscriptsubscript𝑡1𝑇superscript𝛾𝑡1superscriptsubscript𝜌𝑡𝜋𝑎superscript𝑠2superscript𝔼𝑏conditionalsubscript𝑋𝑡subscript𝑆𝑡superscript𝑠2superscriptsubscriptproduct𝑗1𝑡1superscriptsubscript𝜌𝑗𝜋subscript𝐴𝑗subscript𝑆𝑗subscript𝑋𝑡byitalic-(D.12italic-)\displaystyle\underbrace{(1-\gamma)\sum_{t=1}^{T}\gamma^{t-1}\rho_{t}^{\pi}(a,% s^{(2)})\mathbb{E}^{b}\left(\frac{\mathbb{P}(X_{t}|S_{t}=s^{(2)})\prod_{j=1}^{% t-1}\rho_{j}^{\pi}(A_{j},S_{j})}{\mathbb{P}(X_{t})}\right)}_{\mbox{by}\,\,% \eqref{eq5}}under⏟ start_ARG ( 1 - italic_γ ) ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_a , italic_s start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT ) blackboard_E start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( divide start_ARG blackboard_P ( italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_s start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT ) ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_ρ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG start_ARG blackboard_P ( italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG ) end_ARG start_POSTSUBSCRIPT by italic_( italic_) end_POSTSUBSCRIPT
    =\displaystyle== wπ(a,s(2)).superscript𝑤𝜋𝑎superscript𝑠2\displaystyle w^{\pi}(a,s^{(2)}).italic_w start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_a , italic_s start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT ) .

    We can therefore conclude that backward-model-irrelevance implies both $\rho^{\pi}$-irrelevance and $w^{\pi}$-irrelevance.

  • It follows from the definition of the $Q$-function-based method that

    \begin{align*}
    \mathbb{E}[f_{1}(Q^{\pi}_{\phi})]
    &= \sum_{a,x}Q_{\phi}^{\pi}(a,x)\pi(a|x)\mathbb{P}(\phi(S_{1})=x) \\
    &= \sum_{a,x}\mathbb{E}^{\pi}\Big[\sum_{t=1}^{+\infty}\gamma^{t-1}R_{t}\,\Big|\,X_{1}=x,A_{1}=a\Big]\pi(a|x)\mathbb{P}(X_{1}=x) \\
    &= \sum_{a,x,r}\sum_{t=1}^{+\infty}\gamma^{t-1}r\,\mathbb{P}^{\pi}\Big[r\,\Big|\,X_{1}=x,A_{1}=a\Big]\pi(a|x)\mathbb{P}(X_{1}=x) \\
    &= \mathbb{E}^{\pi}\Big[\sum_{t=1}^{+\infty}\gamma^{t-1}R_{t}\Big] \\
    &= \mathbb{E}[f_{1}(Q^{\pi})].
    \end{align*}
  • The conclusion directly follows from the last conclusion of Theorem 1, and the first conclusion of Theorem 3.
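
The chain of equalities in the second bullet is essentially the tower property, and it can be checked numerically on a small tabular example. The sketch below is illustrative only: the random MDP, the map `phi`, the initial distribution `mu` and all function names are assumptions introduced here, not part of the paper. It takes $f_1(Q)=\sum_a \pi(a|S_1)Q(a,S_1)$, as the first line of the display suggests, and assumes that $\pi$ depends on the state only through $\phi(s)$. It does not verify any irrelevance condition; it only confirms that averaging $Q^{\pi}$ within each abstract state leaves the policy value unchanged.

```python
import random


def normalise(ws):
    """Scale positive weights so that they sum to one."""
    total = sum(ws)
    return [w / total for w in ws]


def q_function(P, R, pi, phi, gamma, n_iter=2000):
    """Fixed-point iteration of the Bellman equation for Q^pi on the ground MDP."""
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(n_iter):
        # V(s') = sum_a' pi(a' | phi(s')) Q(s', a')
        V = [sum(pi[phi[s]][a] * Q[s][a] for a in range(n_a)) for s in range(n_s)]
        Q = [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q


rng = random.Random(0)
n_s, n_a, n_x, gamma = 4, 2, 2, 0.9
phi = [0, 0, 1, 1]                                   # hypothetical abstraction s -> x
mu = normalise([rng.random() for _ in range(n_s)])   # distribution of S_1
P = [[normalise([rng.random() for _ in range(n_s)]) for _ in range(n_a)]
     for _ in range(n_s)]                            # P[s][a][s']
R = [[rng.random() for _ in range(n_a)] for _ in range(n_s)]         # mean rewards
pi = [normalise([rng.random() for _ in range(n_a)]) for _ in range(n_x)]  # pi[x][a]

Q = q_function(P, R, pi, phi, gamma)

# E[f_1(Q^pi)]: average over the ground initial state.
v_ground = sum(mu[s] * pi[phi[s]][a] * Q[s][a]
               for s in range(n_s) for a in range(n_a))

# Q_phi^pi(a, x) = E^pi[return | X_1 = x, A_1 = a]; since A_1 depends on S_1
# only through X_1, this is the mu-weighted average of Q^pi within block x.
mu_x = [sum(mu[s] for s in range(n_s) if phi[s] == x) for x in range(n_x)]
Q_phi = [[sum(mu[s] * Q[s][a] for s in range(n_s) if phi[s] == x) / mu_x[x]
          for a in range(n_a)] for x in range(n_x)]

# E[f_1(Q_phi^pi)]: average over the abstract initial state.
v_abstract = sum(mu_x[x] * pi[x][a] * Q_phi[x][a]
                 for x in range(n_x) for a in range(n_a))

print(abs(v_ground - v_abstract) < 1e-9)
```

The two values agree up to floating-point error for any partition `phi`, which reflects that this step of the proof uses only conditioning on $(X_1, A_1)$ and the form of the policy, not the structure of the transition kernel.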

D.4 Proof of Theorem 4

Theorem 4 follows directly from Theorems 2 and 3. We present only the case of the $Q$-function-based method with the procedure initialized by the forward state abstraction. First, by the first conclusions of Theorems 1 and 2, the $Q$-function-based method remains valid. Namely, for the forward state abstraction $\phi_{1}$, we have

\[
\mathbb{E}[f_{1}(Q_{\phi_{1}}^{\pi})]=\mathbb{E}[f_{1}(Q^{\pi})].
\]

Based on $\phi_{1}(\mathcal{S})=\mathcal{X}_{1}$, we derive the backward state abstraction $\phi_{2}$. The second conclusion of Theorem 3 implies

\[
\mathbb{E}[f_{1}(Q_{\phi_{2}\circ\phi_{1}}^{\pi})]=\mathbb{E}[f_{1}(Q_{\phi_{1}}^{\pi})]=\mathbb{E}[f_{1}(Q^{\pi})].
\]

Hence the $Q$-function-based method remains valid after the two-step procedure.
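
The first equality in the last display can be unpacked by repeating the argument from the proof of Theorem 3 on the abstract space $\mathcal{X}_{1}$. The following is a sketch under stated assumptions: $Z_{1}=\phi_{1}(S_{1})$ is notation introduced here for illustration only, and $\pi$ is assumed to depend on the state only through $\phi_{2}\circ\phi_{1}$, as the $\pi(a|x)$ notation in that proof suggests.

```latex
\begin{align*}
\mathbb{E}[f_{1}(Q_{\phi_{2}\circ\phi_{1}}^{\pi})]
&= \sum_{a,x}Q_{\phi_{2}\circ\phi_{1}}^{\pi}(a,x)\,\pi(a|x)\,\mathbb{P}(\phi_{2}(Z_{1})=x) \\
&= \sum_{a,x}\mathbb{E}^{\pi}\Big[\sum_{t=1}^{+\infty}\gamma^{t-1}R_{t}\,\Big|\,\phi_{2}(Z_{1})=x,A_{1}=a\Big]\pi(a|x)\,\mathbb{P}(\phi_{2}(Z_{1})=x) \\
&= \mathbb{E}^{\pi}\Big[\sum_{t=1}^{+\infty}\gamma^{t-1}R_{t}\Big]
 = \mathbb{E}[f_{1}(Q_{\phi_{1}}^{\pi})].
\end{align*}
```

The final step uses the first display of this proof, which identifies $\mathbb{E}[f_{1}(Q_{\phi_{1}}^{\pi})]$ with $\mathbb{E}[f_{1}(Q^{\pi})]$, that is, with the expected discounted return under $\pi$.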