Operator World Models for Reinforcement Learning

Pietro Novelli¹
Marco Pratticò¹
Massimiliano Pontil¹,²
Carlo Ciliberto²
Abstract

Policy Mirror Descent (PMD) is a powerful and theoretically sound methodology for sequential decision-making. However, it is not directly applicable to Reinforcement Learning (RL) due to the inaccessibility of explicit action-value functions. We address this challenge by introducing a novel approach based on learning a world model of the environment using conditional mean embeddings. We then leverage the operatorial formulation of RL to express the action-value function in terms of this quantity in closed form via matrix operations. Combining these estimators with PMD leads to POWR, a new RL algorithm for which we prove convergence rates to the global optimum. Preliminary experiments in finite and infinite state settings support the effectiveness of our method.

1 Introduction

¹ Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia, 16100 Genova, Italy. ² AI Centre, Computer Science Department, University College London, London, UK.

In recent years, Reinforcement Learning (RL) [1] has seen significant progress, with methods capable of tackling challenging applications such as robotic manipulation [2], playing Go [3] or Atari games [4], and resource management [5], to name but a few. The central challenge in RL settings is to balance the trade-off between exploration and exploitation, namely to improve upon previous policies while gathering sufficient information about the environment dynamics. Several strategies have been proposed to tackle this issue, including Q-learning-based methods [4], policy optimization [6, 7], and actor-critic methods [8]. In contrast, when full information about the environment is available, sequential decision-making methods need only focus on exploitation. Here, strategies such as policy improvement or policy iteration [9] have been thoroughly studied from both the algorithmic and theoretical standpoints. Within this context, the understanding of Policy Mirror Descent (PMD) methods has recently enjoyed a significant step forward, with results guaranteeing convergence to a global optimum with associated rates [10, 11, 12].

In their original formulation, PMD methods require explicit knowledge of the action-value functions for all policies generated during the optimization process. This is clearly inaccessible in RL applications. Recently, [12] showed how PMD convergence rates can be extended to settings in which inexact estimators of the action-value function are used (see [13] for a similar result from a regret-based perspective). The resulting convergence rates, however, depend on uniform norm bounds on the approximation error, usually guaranteed only under unrealistic and inefficient assumptions such as the availability of a (perfect) simulator to be queried on arbitrary state-action pairs. Moreover, these strategies require repeating this sampling/learning process for any policy generated by the PMD algorithm, which is computationally expensive and demands numerous interactions with the environment. A natural question, therefore, is whether PMD approaches can be efficiently deployed in RL settings while enjoying the same strong theoretical guarantees.

In this work we address these issues by proposing a novel approach to estimating the action-value function. Unlike previous methods that directly approximate the action-value function from samples, we first learn the transition operator and reward function associated with the Markov decision process (MDP). To model the transition operator, we adopt the Conditional Mean Embedding (CME) framework [14, 15]. We then leverage an operatorial characterization of the action-value function to express it in terms of these estimated quantities. This strategy is closely connected to world model methods [16], and can be interpreted as world model learning via CMEs. However, traditional world model methods emphasize learning an implicit model of the environment in the form of a simulator. Applying such a simulator to PMD requires extensive sampling and incurs two sources of error in estimating the action-value function: model error and sampling error. In contrast, CMEs can be used to estimate expectations without sampling and incur only model error, for which learning bounds are available [17, 18]. One of our key results shows that by modeling the transition operator as a CME between suitable Sobolev spaces, we can compute estimates of the action-value function of any sufficiently smooth policy in closed form via efficient matrix operations.

Combining our estimates of the action-value function with the PMD framework, we obtain a novel RL algorithm that we dub Policy mirror descent with Operator World-models for Reinforcement learning (POWR). A byproduct of adopting CMEs to model the transition operator is that we can naturally extend PMD to infinite state space settings. We leverage recent advancements in characterizing the sample complexity of CME estimators to prove convergence rates for the proposed algorithm to the global maximum of the RL problem. Our approach is similar in spirit to [19], which proposed a policy optimization strategy based on CMEs. We extend these ideas to PMD strategies and refine previous results on convergence rates as a byproduct of our analysis. Learning the transition operator with a least-squares based estimator was also recently considered in [20], which proposed an optimistic strategy to prove near-optimal regret bounds in linear mixture MDP settings [21]. In contrast, in this work we cast our problem within a linear MDP setting with possibly infinite latent dimension. We empirically validate our approach in both finite and infinite state settings, reporting promising evidence in support of our theoretical analysis.

Contributions

The main contributions of this paper are: (i) a CME-based world model estimator, which enables us to generate estimators for the action-value function of a policy in closed form via matrix operations; (ii) an (inexact) PMD algorithm combining the learned CME world models with mirror descent update steps to generate improved policies; (iii) a proof that the algorithm is well-defined when learning the world model as an operator between a suitable family of Sobolev spaces; (iv) convergence rates of the proposed approach to the global maximum of the RL problem, under regularity assumptions on the MDP; (v) an empirical evaluation of the proposed approach, comparing it with well-established baselines.

2 Problem Formulation and Policy Mirror Descent

We consider a Markov Decision Process (MDP) over a state space $\mathcal{X}$ and action space $\mathcal{A}$, with transition kernel $\tau$. We assume $\mathcal{X}$ and $\mathcal{A}$ to be Polish and $\tau:\Omega\to\mathcal{M}_1^+(\mathcal{X})$ to be a Borel measurable function from the joint space $\Omega=\mathcal{X}\times\mathcal{A}$ to the space $\mathcal{M}_1^+(\mathcal{X})$ of Borel probability measures on $\mathcal{X}$. We define a policy to be a Borel measurable function $\pi:\mathcal{X}\to\mathcal{M}_1^+(\mathcal{A})$. When $\mathcal{A}$ (respectively $\mathcal{X}$) is a finite set, the space $\mathcal{M}_1^+(\mathcal{A})=\Delta(\mathcal{A})\subseteq\mathbb{R}^{|\mathcal{A}|}$ (respectively $\mathcal{M}_1^+(\mathcal{X})=\Delta(\mathcal{X})\subseteq\mathbb{R}^{|\mathcal{X}|}$) corresponds to the probability simplex.
Given a discount factor $\gamma\in(0,1)$, a (starting) state distribution $\nu\in\mathcal{M}_1^+(\mathcal{X})$ and a Borel measurable, bounded and non-negative reward function $r:\Omega\to\mathbb{R}$, we denote by

$$J(\pi)=E\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(X_t,A_t)\right] \tag{1}$$

the (discounted) expected return of the policy $\pi$ applied to the MDP, yielding the Markov process $(X_t,A_t)_{t\in\mathbb{N}}$, where $X_0$ is distributed according to $\nu$ and, for each $t\in\mathbb{N}$, the action $A_t$ is distributed according to $\pi(\cdot|X_t)$ and $X_{t+1}$ according to $\tau(\cdot|X_t,A_t)$. (All the discussion in this work can be extended to the case where the rewards are also random and $\tau:\Omega\to\mathcal{M}_1^+(\mathcal{X}\times\mathbb{R})$ takes values in the space of joint distributions over states and rewards $(X_{t+1},R_t)$.)

In sequential decision settings, the goal is to find the optimal policy $\pi_*$ maximizing (1) over the space of all measurable policies. In reinforcement learning, one typically assumes that knowledge of the transition $\tau$, the reward $r$, and (possibly) the starting distribution $\nu$ is not available. It is only possible to gather information about these quantities by interacting with the MDP to sample trajectories of state-action pairs $(x_t,a_t)$ and corresponding rewards $r(x_t,a_t)$.
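To make this interaction protocol concrete, the following minimal sketch samples trajectories from a toy tabular MDP and forms a truncated Monte Carlo estimate of (1). The function names, the two-state MDP and all numerical values are illustrative assumptions and do not come from the paper. Policies are stored row-wise as `pi[x, a]`, so that each row is a distribution over actions.

```python
import numpy as np

def sample_trajectory(tau, r, pi, nu, horizon, rng):
    """Roll out `horizon` steps of a tabular MDP under the policy `pi`.

    tau[x, a] is a distribution over next states, r[x, a] a reward,
    pi[x] a distribution over actions, nu a distribution over start states.
    """
    n_states, n_actions = r.shape
    x = rng.choice(n_states, p=nu)
    traj = []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[x])
        traj.append((x, a, r[x, a]))
        x = rng.choice(n_states, p=tau[x, a])
    return traj

def discounted_return(traj, gamma):
    """Truncated version of the sum inside the expectation in (1)."""
    return sum(gamma**t * rew for t, (_, _, rew) in enumerate(traj))

# Hypothetical toy MDP with 2 states and 2 actions.
tau = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],
              [0.5, 0.0]])
pi = np.full((2, 2), 0.5)      # uniform policy
nu = np.array([1.0, 0.0])      # always start in state 0
gamma = 0.9

rng = np.random.default_rng(0)
returns = [discounted_return(sample_trajectory(tau, r, pi, nu, 60, rng), gamma)
           for _ in range(500)]
J_mc = float(np.mean(returns))  # Monte Carlo estimate of J(pi)
```

The estimate only uses sampled tuples $(x_t,a_t,r(x_t,a_t))$, which is exactly the information available in the RL setting; `tau` and `r` appear here only to simulate the environment.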

Policy Mirror Descent (PMD)

In so-called tabular settings, in which both $\mathcal{X}$ and $\mathcal{A}$ are finite sets, the policy optimization problem amounts to maximizing (1) over the space $\Pi=\Delta(\mathcal{A})\otimes\mathbb{R}^{|\mathcal{X}|}$ of column-stochastic matrices, namely matrices $M\in\mathbb{R}^{|\mathcal{A}|\times|\mathcal{X}|}$ with non-negative entries whose columns sum to one, that is $M^{*}\mathbf{1}_{\mathcal{A}}=\mathbf{1}_{\mathcal{X}}$, with $\mathbf{1}$ denoting the vector with all entries equal to one on the appropriate space. Borrowing from the convex optimization literature, where mirror descent algorithms offer a powerful approach to minimize a convex functional over a convex constraint set [22, 23], recent work proposed to adopt mirror descent also for policy optimization, a strategy known as policy mirror descent (PMD) [10]. Even though the objective in (1) is not convex (or concave, since we are maximizing it [11]), it turns out that mirror ascent can nevertheless enjoy global convergence to the maximum, with sublinear [11] or even linear rates [12] (at the cost of dimension-dependent constants).

Starting from an initial policy $\pi_0$, PMD generates a sequence $(\pi_t)_{t\in\mathbb{N}}$ according to the update step

$$\pi_{t+1}(\cdot\,|\,x)=\operatorname*{argmin}_{p\in\Delta(\mathcal{A})}\quad -\eta\left\langle q_{\pi_t}(\cdot,x),\,p\right\rangle+D(p,\pi_t(\cdot|x)), \tag{2}$$

for any $x\in\mathcal{X}$, with $\eta>0$ a step size, $D$ a suitable Bregman divergence [23] and $q_\pi:\Omega\to\mathbb{R}$ the so-called action-value function of a policy $\pi$, namely

$$q_\pi(x,a)=E\left[\sum_{t=0}^{\infty}\gamma^{t}\,r(X_t,A_t)\;\middle|\;X_0=x,\,A_0=a\right], \tag{3}$$

the discounted return obtained by taking action $a\in\mathcal{A}$ in state $x\in\mathcal{X}$ and then following the policy $\pi$. The solution to (2) crucially depends on the choice of $D$. For example, in [10] the authors observed that if $D$ is the Kullback-Leibler divergence, PMD corresponds to the Natural Policy Gradient originally proposed in [24], while [12] showed that if $D$ is the squared Euclidean distance, PMD recovers the Projected Policy Gradient method from [11].
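For the Kullback-Leibler case, the update (2) admits a well-known closed form. The short derivation below is a sketch for finite $\mathcal{A}$, under the assumption that $\pi_t(\cdot|x)$ has full support; the multiplier $\lambda$ is introduced only for this illustration.

```latex
% Lagrangian of (2) with D(p, \pi_t(\cdot|x)) = \sum_a p(a) \log\frac{p(a)}{\pi_t(a|x)}
% and the normalization constraint \sum_a p(a) = 1 (multiplier \lambda):
\mathcal{L}(p,\lambda)
  = -\eta \sum_{a} q_{\pi_t}(x,a)\, p(a)
    + \sum_{a} p(a) \log\frac{p(a)}{\pi_t(a|x)}
    + \lambda \Big( \sum_{a} p(a) - 1 \Big)

% Stationarity with respect to p(a):
-\eta\, q_{\pi_t}(x,a) + \log\frac{p(a)}{\pi_t(a|x)} + 1 + \lambda = 0

% Solving for p(a) and normalizing gives a multiplicative-weights update:
\pi_{t+1}(a|x)
  = \frac{\pi_t(a|x)\, \exp\big(\eta\, q_{\pi_t}(x,a)\big)}
         {\sum_{a'} \pi_t(a'|x)\, \exp\big(\eta\, q_{\pi_t}(x,a')\big)}
```

The non-negativity constraint on $p$ is satisfied automatically by the exponential form, so it does not need its own multiplier.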

PMD in Reinforcement Learning

A clear limitation to adopting PMD in RL settings is that the update (2) requires exact knowledge of the action-value functions $q_{\pi_t}$ associated to each iterate $\pi_t$ of the algorithm. This in turn requires evaluating the expectation in (3), which is not possible in RL, where we do not know the reward $r$ and the MDP transition distribution $\tau$ in advance. While sampling strategies can be adopted to estimate $q_{\pi_t}$, a key question is how the approximation error affects PMD.

The work in [12] provides an answer to this question, extending the analysis of PMD to the case where estimates $\hat{q}_t$ are used in place of the true action-value function in (2). We recall here an informal version of the result for the case of sublinear convergence rates for PMD. We postpone a more rigorous statement of the theorem and its assumptions to Sec. 5, where we extend it to infinite state spaces $\mathcal{X}$.

Theorem 1 (Inexact PMD (Sec. 5 in [12]) – Informal).

In the tabular setting, let $(\pi_t)_{t\in\mathbb{N}}$ be a sequence of policies obtained by applying the PMD update in (2) with functions $\hat{q}_t:\Omega\to\mathbb{R}$ in place of $q_{\pi_t}$ and $D$ a suitable Bregman divergence. For any $T\in\mathbb{N}$ and $\varepsilon>0$, if $\|\hat{q}_t-q_{\pi_t}\|_\infty\leq\varepsilon$ for all $t=1,\dots,T$, then

$$\max_{\pi}\;J(\pi)-J(\pi_T)\leq O(\varepsilon+1/T). \tag{4}$$

Thm. 1 implies that inexact PMD retains the convergence rates of its exact counterpart, provided that the approximation error for each action-value function is of order $1/T$ in uniform norm $\|\cdot\|_\infty$. While this result supports estimating the action-value function in RL, implementing this strategy in practice poses two main challenges, even in tabular settings. First, approximating the expectation in (3) in $\|\cdot\|_\infty$ norm via sampling requires “starting” the MDP from each state $x\in\mathcal{X}$, multiple times. This is often not possible in RL, where we do not have control over the starting distribution $\nu$. Second, repeating this sampling process to learn a $\hat{q}_t$ for each policy $\pi_t$ can become extremely expensive in terms of both the number of computations and interactions with the environment.
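The setting of Thm. 1 can be illustrated numerically. The sketch below runs tabular PMD with the Kullback-Leibler divergence, for which the update has the multiplicative form $\pi_{t+1}(a|x)\propto\pi_t(a|x)\exp(\eta\,\hat{q}_t(x,a))$, and feeds it action-value functions corrupted by noise bounded by $\varepsilon$ in uniform norm. The toy MDP, the noise model, the step size and all function names are assumptions made for illustration; the exact action-values are computed with knowledge of the MDP, which would not be available in RL.

```python
import numpy as np

def q_exact(r, tau, pi, gamma):
    """Exact action-value function of `pi`, solving q = r + gamma * E[q(X', A')]."""
    n_s, n_a = r.shape
    # M[(x, a), (x', a')] = tau(x' | x, a) * pi(a' | x')
    M = (tau[:, :, :, None] * pi[None, None, :, :]).reshape(n_s * n_a, n_s * n_a)
    q = np.linalg.solve(np.eye(n_s * n_a) - gamma * M, r.ravel())
    return q.reshape(n_s, n_a)

def expected_return(r, tau, pi, nu, gamma):
    """J(pi) = sum_x nu(x) sum_a pi(a|x) q_pi(x, a)."""
    return float(nu @ (pi * q_exact(r, tau, pi, gamma)).sum(axis=1))

def inexact_pmd(r, tau, gamma, eta, n_steps, eps, rng):
    """KL-based PMD where each q is perturbed by noise in [-eps, eps]."""
    n_s, n_a = r.shape
    log_pi = np.full((n_s, n_a), -np.log(n_a))   # uniform initial policy
    for _ in range(n_steps):
        q_hat = q_exact(r, tau, np.exp(log_pi), gamma)
        q_hat = q_hat + rng.uniform(-eps, eps, size=r.shape)
        log_pi = log_pi + eta * q_hat            # multiplicative update in log space
        m = log_pi.max(axis=1, keepdims=True)
        log_pi = log_pi - m - np.log(np.exp(log_pi - m).sum(axis=1, keepdims=True))
    return np.exp(log_pi)

# Hypothetical toy MDP with rewards in [0, 1].
tau = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0], [0.5, 0.0]])
nu, gamma = np.array([0.5, 0.5]), 0.9

# Optimal return via value iteration, used only as a reference.
q_star = np.zeros_like(r)
for _ in range(2000):
    q_star = r + gamma * tau @ q_star.max(axis=1)
J_star = float(nu @ q_star.max(axis=1))

rng = np.random.default_rng(0)
pi_exact = inexact_pmd(r, tau, gamma, eta=1.0, n_steps=1000, eps=0.0, rng=rng)
pi_noisy = inexact_pmd(r, tau, gamma, eta=1.0, n_steps=1000, eps=0.05, rng=rng)
gap_exact = J_star - expected_return(r, tau, pi_exact, nu, gamma)
gap_noisy = J_star - expected_return(r, tau, pi_noisy, nu, gamma)
```

Comparing `gap_exact` and `gap_noisy` for different values of `eps` and `n_steps` gives a feel for the $O(\varepsilon+1/T)$ trade-off, although a single toy run says nothing about the constants hidden in (4).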

In this work we propose a new strategy to tackle the problems above. Rather than sampling the MDP to directly estimate each $\hat{q}_t$, we learn estimators $\hat{r}$ and $\hat{\tau}$ for the reward and transition distribution, respectively. Then, for any policy $\pi$, we leverage the relation between these quantities in (3) to generate an estimator $\hat{q}$ for $q_\pi$. This tackles the above challenges since: 1) it enables us to control the approximation error on any action-value function in terms of the approximation error of $\hat{r}$ and $\hat{\tau}$; 2) it does not require sampling from the MDP to learn a new $\hat{q}_t$ for each $\pi_t$ generated by PMD.
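The idea of reusing a single learned model across policies can be sketched in the tabular case. The code below is not the CME estimator developed in this paper: as a simplified stand-in, it estimates $\hat{r}$ and $\hat{\tau}$ by empirical frequencies from one batch of transitions, then evaluates two different policies without any further interaction with the MDP. All names and the toy MDP are illustrative assumptions.

```python
import numpy as np

def fit_tabular_model(transitions, n_states, n_actions):
    """Empirical reward and transition estimates from (x, a, reward, x_next) tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    r_sum = np.zeros((n_states, n_actions))
    for x, a, rew, x_next in transitions:
        counts[x, a, x_next] += 1
        r_sum[x, a] += rew
    visits = counts.sum(axis=2)
    seen = visits > 0
    # Unvisited pairs fall back to a uniform next-state guess and zero reward.
    tau_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)
    tau_hat[seen] = counts[seen] / visits[seen][:, None]
    r_hat = np.zeros((n_states, n_actions))
    r_hat[seen] = r_sum[seen] / visits[seen]
    return r_hat, tau_hat

def evaluate(r_hat, tau_hat, pi, gamma, n_iter=500):
    """Plug-in estimate of q_pi by iterating q <- r_hat + gamma * E[q(X', A')]."""
    q = np.zeros_like(r_hat)
    for _ in range(n_iter):
        v = (pi * q).sum(axis=1)          # v(x') = sum_a' pi(a'|x') q(x', a')
        q = r_hat + gamma * tau_hat @ v   # expectation over x' ~ tau_hat(.|x, a)
    return q

# Hypothetical environment, used only to generate data.
tau = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0], [0.5, 0.0]])

rng = np.random.default_rng(0)
transitions, x = [], 0
for _ in range(5000):                     # one long trajectory, uniform actions
    a = rng.integers(2)
    x_next = rng.choice(2, p=tau[x, a])
    transitions.append((x, a, r[x, a], x_next))
    x = x_next

r_hat, tau_hat = fit_tabular_model(transitions, 2, 2)
# Same learned model, two different policies, no additional sampling.
q_uniform = evaluate(r_hat, tau_hat, np.full((2, 2), 0.5), gamma=0.9)
q_other = evaluate(r_hat, tau_hat, np.array([[0.0, 1.0], [1.0, 0.0]]), gamma=0.9)
```

Both `q_uniform` and `q_other` inherit their error from `r_hat` and `tau_hat` only, which mirrors points 1) and 2) above in this simplified setting.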

3 Proposed Estimator and Algorithm

In this section, we consider the operator-based formulation of the problem introduced in Sec. 2 (see also [11]). This will be instrumental in extending the PMD theory to arbitrary state spaces $\mathcal{X}$, in bounding the approximation error of the action-value function in terms of the approximation errors of the reward and transition distribution, and in motivating conditional mean embeddings as the tool to learn these latter quantities.

Conditional Expectation Operators

We start by defining the transition operator $T$ associated to the MDP transition distribution $\tau$. Let $B_b(\mathcal{X})$ denote the space of bounded Borel measurable functions on a space $\mathcal{X}$. Formally, $T:B_b(\mathcal{X})\to B_b(\Omega)$ is the linear operator such that, for any $f\in B_b(\mathcal{X})$,

$$(Tf)(x,a)=\int_{\mathcal{X}}f(x')\,\tau(dx'|x,a)=E\left[f(X')\mid x,a\right]\qquad\text{for all }(x,a)\in\Omega, \tag{5}$$

where $X'$ is sampled according to $\tau(\cdot|x,a)$. Note that $T$ is the Markov operator [25, Ch. 19] encoding the dynamics of the MDP and its conjugate $T^*:\mathcal{M}(\Omega)\to\mathcal{M}(\mathcal{X})$ is the operator mapping measures $\mu\in\mathcal{M}(\Omega)$ to their transition via $\tau$, namely $(T^*\mu)(\mathcal{B})=\int_{\mathcal{B}\times\Omega}\tau(dx'|x,a)\,\mu(dx,da)$ for any measurable $\mathcal{B}\subseteq\mathcal{X}$. For any policy $\pi$ we define the operator $P_\pi:B_b(\Omega)\to B_b(\mathcal{X})$ such that, for all $g\in B_b(\Omega)$,

$$(P_\pi g)(x)=\int_{\mathcal{A}}g(x,a)\,\pi(da|x)=E\left[g(X,A)\mid X=x\right]\quad\text{for all }x\in\mathcal{X}, \tag{6}$$

where the expectation is taken over the action $A$ sampled according to $\pi(\cdot|x)$. Also $P_\pi$ is a Markov operator and its conjugate $P_\pi^*:\mathcal{M}(\mathcal{X})\to\mathcal{M}(\Omega)$ is the operator mapping any $\nu\in\mathcal{M}(\mathcal{X})$ to its “multiplication” by $\pi$, namely $(P_\pi^*\nu)(\mathcal{C})=\int_{\mathcal{C}}\pi(da|x)\,\nu(dx)$ for any measurable $\mathcal{C}\subseteq\Omega$.
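In the tabular case both operators are plain matrices, which the following sketch makes explicit. It assumes a hypothetical two-state, two-action MDP and flattens pairs $(x,a)$ with the index `x * n_a + a`; these conventions and all numerical values are illustrative choices.

```python
import numpy as np

n_s, n_a = 2, 2
tau = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.1, 0.9]]])   # tau[x, a, x'] = tau(x' | x, a)
pi = np.array([[0.3, 0.7],
               [0.6, 0.4]])                  # pi[x, a] = pi(a | x)

# T maps functions on X (vectors in R^S) to functions on Omega (vectors in R^{S*A}).
T = tau.reshape(n_s * n_a, n_s)              # row (x, a) holds tau(. | x, a)

# P_pi maps functions on Omega to functions on X: (P g)(x) = sum_a pi(a|x) g(x, a).
P = np.zeros((n_s, n_s * n_a))
for x in range(n_s):
    P[x, x * n_a:(x + 1) * n_a] = pi[x]

f = np.array([1.0, -2.0])                    # a bounded function on X
Tf = (T @ f).reshape(n_s, n_a)               # (Tf)(x, a) = E[f(X') | x, a], as in (5)

mu = np.full(n_s * n_a, 0.25)                # a probability measure on Omega
T_star_mu = T.T @ mu                         # conjugate of T: pushes mu through tau

nu = np.array([0.5, 0.5])                    # a probability measure on X
P_star_nu = (P.T @ nu).reshape(n_s, n_a)     # joint measure nu(x) * pi(a | x)
```

Both matrices have non-negative entries and rows summing to one, which is the finite-dimensional counterpart of being Markov operators; their transposes play the role of the conjugates acting on measures.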

Operator Formulation of RL

With these two operators in place, we can characterize the expected reward after a single interaction between a policy $\pi$ and the MDP as $(TP_\pi r)(x,a)=E[r(X',A')\,|\,X_0=x,A_0=a]$. This observation can be applied recursively, yielding the operatorial characterization of the action-value function from (3)

$$q_\pi(x,a)=\sum_{t=0}^{\infty}\gamma^{t}E[r(X_t,A_t)\,|\,X_0=x,A_0=a]=\sum_{t=0}^{\infty}(\gamma TP_\pi)^{t}r=(\mathrm{Id}-\gamma TP_\pi)^{-1}r, \tag{7}$$

where the last equality follows from $T$ and $P_\pi$ being Markov operators [25, Ch. 19] ($\|T\|=\|P_\pi\|=1$), which, together with $\gamma<1$, makes the Neumann series convergent. Analogously, we can reformulate the RL objective introduced in (1) as the pairing

$$J(\pi)=\left\langle P_\pi(\mathrm{Id}-\gamma TP_\pi)^{-1}r,\,\nu\right\rangle=\left\langle P_\pi q_\pi,\,\nu\right\rangle, \tag{8}$$

for $\nu\in\mathcal{M}_1^+(\mathcal{X})$ a starting distribution. In both (7) and (8) the operatorial formulation encodes the cumulative reward collected through the (possibly infinitely many) interactions of the policy with the MDP in closed form, as the inversion $(\mathrm{Id}-\gamma TP_\pi)^{-1}r$. This characterization motivates us to learn $T$ and $r$ from data and then express any action-value function as the interaction of these two terms with the policy $\pi$ as in (7), rather than learning each $q_\pi$ independently for any $\pi$.
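As a sanity check of (7) and (8) in the tabular case, the sketch below computes $q_\pi$ with a single linear solve and compares it with a truncated Neumann series. The toy MDP, the flattening of $(x,a)$ as `x * n_a + a`, and the truncation at 500 terms are assumptions made for illustration, and a discount factor $\gamma<1$ is assumed.

```python
import numpy as np

gamma = 0.9
tau = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.1, 0.9]]])   # tau[x, a, x']
r = np.array([0.0, 1.0, 0.5, 0.0])           # r(x, a), flattened as x * n_a + a
pi = np.array([[0.3, 0.7], [0.6, 0.4]])      # pi[x, a] = pi(a | x)
nu = np.array([0.5, 0.5])                    # starting distribution
n_s, n_a = pi.shape

T = tau.reshape(n_s * n_a, n_s)              # transition operator as a matrix
# P[x, (x', a)] = pi(a | x) if x == x' else 0
P = (np.eye(n_s)[:, :, None] * pi[None, :, :]).reshape(n_s, n_s * n_a)

# Closed form (7): a linear solve avoids forming the inverse explicitly.
A = np.eye(n_s * n_a) - gamma * (T @ P)
q_solve = np.linalg.solve(A, r)

# Truncated Neumann series sum_t (gamma T P)^t r.
q_series, term = np.zeros_like(r), r.copy()
for _ in range(500):
    q_series += term
    term = gamma * (T @ (P @ term))

J = float(nu @ (P @ q_solve))                # pairing (8)
```

The two vectors agree up to the truncation error, which is of order $\gamma^{500}$ here. The same `T` and `r` can be reused for any other policy by rebuilding only `P`.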

Learning the World Model via Conditional Mean Embeddings

Conditional Mean Embeddings (CME) offer an effective tool to model and learn conditional expectation operators from data [15]. They cast the problem of learning $T$ as that of studying the restriction of its action to a suitable family of functions. Let $\varphi:\mathcal{X}\to\mathcal{F}$ and $\psi:\Omega\to\mathcal{G}$ be two feature maps with values in the Hilbert spaces $\mathcal{F}$ and $\mathcal{G}$. With some abuse of notation (which is justified by them being Hilbert spaces), we interpret $\mathcal{F}$ and $\mathcal{G}$ as subspaces of functions in $B_b(\mathcal{X})$ and $B_b(\Omega)$ of the form $f(x)=\left\langle f,\varphi(x)\right\rangle$ and $g(x,a)=\left\langle g,\psi(x,a)\right\rangle$ for any $f\in\mathcal{F}$ and $g\in\mathcal{G}$ and any $(x,a)\in\Omega$. We say that the linear MDP assumption holds with respect to $(\varphi,\psi)$ if the following holds.

Assumption 1 (Linear MDP – Well-specified CME).

The restriction of $T$ to $\mathcal{F}$ is a Hilbert-Schmidt operator $T|_{\mathcal{F}}\in\mathrm{HS}(\mathcal{F},\mathcal{G})$.

In CME settings, the assumption above is known as requiring the CME of $\tau$ to be well-specified. The following result clarifies this aspect and establishes the relation of Asm. 1 with the standard definition of linear MDP.

Proposition 2 (Well-specified CME).

Under Asm. 1, $(T|_{\mathcal{F}})^{*}=(T^{*})|_{\mathcal{G}}$ and, for any $(x,a)\in\Omega$,

$$(T|_{\mathcal{F}})^{*}\psi(x,a)=\int_{\mathcal{X}}\varphi(x')\,\tau(x'|x,a)=E[\varphi(X')\,|\,X=x,A=a]. \tag{9}$$

Prop. 2 (see proof in Sec. A.2) shows that (9) is equivalent to the standard linear MDP assumption (see e.g. [26, Ch. 8]) when $\mathcal{X}$ is a finite set (taking $\varphi$ to be the one-hot encoding), while being weaker in infinite settings. From the CME perspective, the proposition characterizes the action of $(T|_{\mathcal{F}})^{*}$ as sending evaluation vectors in $\mathcal{G}$ to the conditional expectation of evaluation vectors in $\mathcal{F}$ with respect to $\tau$, which is the definition of the conditional mean embedding of $\tau$ [27, 15]. This characterization also suggests a learning strategy: (9) characterizes the action of $T$ as evaluating to the conditional expectation of a vector $\varphi(X')$ given $(x,a)$. Given a set of points $(x_i,a_i)_{i=1}^{n}$ and corresponding $x_i'$ sampled from $\tau(\cdot|x_i,a_i)$, this can be learned by minimizing the regularized squared loss, yielding the estimator

$$T_n=S_n^{*}(K_n+n\lambda\,\mathrm{Id})^{-1}Z_n=\operatorname*{argmin}_{T\in\mathrm{HS}(\mathcal{F},\mathcal{G})}\ \frac{1}{n}\sum_{i=1}^{n}\left\|\varphi(x_i')-T^{*}\psi(x_i,a_i)\right\|_{\mathcal{F}}^{2}+\lambda\left\|T\right\|_{\mathrm{HS}}^{2}. \tag{10}$$

When $\mathcal{F}$ and $\mathcal{G}$ are finite dimensional, $S_n$ and $Z_n$ are matrices with $n$ rows, whose $i$-th rows are respectively the vectors $\psi(x_i,a_i)$ and $\varphi(x_i')$ for $i=1,\dots,n$. In the infinite-dimensional setting they generalize to operators $S_n:\mathcal{G}\to\mathbb{R}^{n}$ and $Z_n:\mathcal{F}\to\mathbb{R}^{n}$. The matrix $K_n=S_nS_n^{*}\in\mathbb{R}^{n\times n}$ is the Gram (or kernel) matrix with $(i,j)$-th entry $(K_n)_{ij}=\langle\psi(x_i,a_i),\psi(x_j,a_j)\rangle$. We conclude our discussion on learning world models via CMEs by noting that in most RL settings the reward function is unknown, too. Analogously to what we have described for $T_n$ and following the standard practice in supervised settings, we can learn an estimator for $r$ by solving a problem akin to (10). This yields a function of the form $r_n=S_n^{*}b=\sum_{i=1}^{n}b_i\,\psi(x_i,a_i)$, a linear combination of the embedded training points whose coefficients are the entries of the vector $b=(K_n+n\lambda\,\mathrm{Id})^{-1}y$, where $y\in\mathbb{R}^{n}$ is the vector with entries $y_i=r(x_i,a_i)$.
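The estimators above can be sketched in the finite-dimensional case, where $S_n$ and $Z_n$ are plain matrices. The snippet below is an illustration under assumed random features (the sizes, the regularization value and the random data are all hypothetical): it forms $T_n=S_n^{*}(K_n+n\lambda\,\mathrm{Id})^{-1}Z_n$ and $r_n=S_n^{*}b$ and compares them with the corresponding primal ridge-regression solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_G, d_F, lam = 20, 4, 3, 1e-2   # illustrative sizes and regularization

S = rng.normal(size=(n, d_G))   # row i stands in for psi(x_i, a_i)
Z = rng.normal(size=(n, d_F))   # row i stands in for phi(x_i')
y = rng.normal(size=n)          # y_i stands in for the observed reward r(x_i, a_i)

K = S @ S.T                                   # Gram matrix K_n = S_n S_n^*
B = np.linalg.inv(K + n * lam * np.eye(n))    # (K_n + n * lam * Id)^{-1}

T_n = S.T @ B @ Z    # dual form of (10): a d_G x d_F matrix mapping F to G
b = B @ y
r_n = S.T @ b        # reward estimator, a vector in G

# Primal ridge solutions, computed in feature space for comparison.
A = S.T @ S + n * lam * np.eye(d_G)
T_primal = np.linalg.solve(A, S.T @ Z)
r_primal = np.linalg.solve(A, S.T @ y)
```

The dual and primal forms agree because $(S_n^{*}S_n+n\lambda\,\mathrm{Id})^{-1}S_n^{*}=S_n^{*}(S_nS_n^{*}+n\lambda\,\mathrm{Id})^{-1}$; only the dual form remains usable when the feature spaces are infinite dimensional.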

Estimating the Action-value Function $q_\pi$

We now propose our strategy to generate an estimator for the action-value function $q_\pi$ of a given policy $\pi$ in terms of an estimator for the reward $r$ and a world model for $T$ learned in terms of the restriction to $\mathcal{G}$ and $\mathcal{F}$. To this end, we need to introduce the notion of compatibility between a policy $\pi$ and the pair $(\mathcal{G},\mathcal{F})$.

Definition 1 ($(\mathcal{G},\mathcal{F})$-compatibility).

A policy $\pi:\mathcal{X}\to\mathcal{M}_1^{+}(\mathcal{A})$ is compatible with two subspaces $\mathcal{F}\subseteq B_b(\mathcal{X})$ and $\mathcal{G}\subseteq B_b(\Omega)$ if the restriction of $P_\pi$ to $\mathcal{G}$ is a bounded linear operator with range contained in $\mathcal{F}$, that is $(P_\pi)|_{\mathcal{G}}:\mathcal{G}\to\mathcal{F}$.

Definition 1 is analogous to the linear MDP Asm. 1 in that it requires the restriction of $P_\pi$ to $\mathcal{G}$ to take values in the associated space $\mathcal{F}$. However, it is slightly weaker, since it requires this restriction to be bounded (and linear) rather than an HS operator. We will discuss in Sec. 4 how this difference allows us to show that a wide range of policies (in particular those generated by our POWR method) are $(\mathcal{G},\mathcal{F})$-compatible for our choice of function spaces. Definition 1 is the key condition that enables us to generate an estimator for $q_\pi$, as characterized by the following result.

Proposition 3.

Let $T_n=S_n^{*}BZ_n\in\mathrm{HS}(\mathcal{F},\mathcal{G})$ and $r_n=S_n^{*}b\in\mathcal{G}$ for some $B\in\mathbb{R}^{n\times n}$ and $b\in\mathbb{R}^{n}$, respectively. Let $\pi$ be $(\mathcal{G},\mathcal{F})$-compatible. Then,

$$\hat{q}_\pi=(\mathrm{Id}-\gamma T_nP_\pi)^{-1}r_n=S_n^{*}(\mathrm{Id}-\gamma BM_{\pi n})^{-1}b \tag{11}$$

where $M_{\pi n}=Z_nP_\pi S_n^{*}\in\mathbb{R}^{n\times n}$ is the matrix with entries

$$(M_{\pi n})_{ij}=\int_{\mathcal{A}}\langle\psi(x_i',a),\psi(x_j,a_j)\rangle\,\pi(da|x_i'). \tag{12}$$

Prop. 3 leverages a kernel trick argument to express the estimator for the action-value function of $\pi$ as the linear combination $\hat{q}_\pi=\sum_{i=1}^{n}c_i\,\psi(x_i,a_i)$ of the (embedded) training points $\psi(x_i,a_i)$, with coefficients given by the entries $c_i$ of the vector $c=(\mathrm{Id}-\gamma BZ_nP_\pi S_n^{*})^{-1}b\in\mathbb{R}^{n}$. We prove the result in Sec. A.4. We note that in (11) both $B$ and $Z_nP_\pi S_n^{*}$ are $n\times n$ matrices, and therefore the characterization of $\hat{q}_\pi$ amounts to solving an $n\times n$ linear system. For settings where $n$ is large, one can adopt random projection methods such as the Nyström approximation to learn $T_n$ and $r_n$ [28]. These strategies have recently been shown to significantly reduce the computational load of learning while retaining the same empirical and theoretical performance as their non-approximated counterparts [29, 30].

We conclude this section by noting that (12) implies that we only need to be able to evaluate $\pi$, and do not need explicit knowledge of $P_\pi$ as an operator. As we shall see, this property will be instrumental in proving generalization bounds for our proposed PMD algorithm in Sec. 4.
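The identity in (11) can be checked numerically in a small tabular setting, where $P_\pi$ is also available explicitly. The sketch below assumes one-hot feature maps and randomly drawn transitions, rewards and policy (all hypothetical); it computes the estimator once through the $n\times n$ system with $M_{\pi n}$ from (12), using only evaluations of $\pi$ at the next states, and once through the operator form.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, n, gamma, lam = 3, 2, 30, 0.9, 1e-2

# Hypothetical transitions (x_i, a_i, x_i') and rewards, drawn at random.
x = rng.integers(n_states, size=n)
a = rng.integers(n_actions, size=n)
x_next = rng.integers(n_states, size=n)
y = rng.normal(size=n)

pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # pi[x, a] = pi(a | x)

# Tabular (one-hot) feature maps: psi(x, a) in R^{|X||A|}, phi(x) in R^{|X|}.
S = np.eye(n_states * n_actions)[x * n_actions + a]     # rows psi(x_i, a_i)
Z = np.eye(n_states)[x_next]                            # rows phi(x_i')

B = np.linalg.inv(S @ S.T + n * lam * np.eye(n))
b = B @ y

# Right-hand side of (11): M_ij = <phi(x_i'), phi(x_j)> * pi(a_j | x_i'), cf. (12).
M = (x_next[:, None] == x[None, :]) * pi[x_next][:, a]
c = np.linalg.solve(np.eye(n) - gamma * B @ M, b)
q_kernel = S.T @ c

# Left-hand side of (11), with P_pi built explicitly as a matrix from G to F.
P_pi = np.zeros((n_states, n_states * n_actions))
for s in range(n_states):
    P_pi[s, s * n_actions:(s + 1) * n_actions] = pi[s]
T_n, r_n = S.T @ B @ Z, S.T @ b
q_operator = np.linalg.solve(
    np.eye(n_states * n_actions) - gamma * T_n @ P_pi, r_n)
```

In this sketch the two routes coincide; the first one never forms $P_\pi$, which is what makes it applicable when $\mathcal{F}$ and $\mathcal{G}$ are infinite dimensional.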

Algorithm 1 POWR: Policy mirror descent with Operator World-models for RL
  Input: Dataset $\mathcal{D}=(x_i,a_i,x_i',r_i)_{i=1}^{n}$, discount $\gamma\in(0,1)$, step size $\eta>0$, kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, initial weights $C_0=0\in\mathbb{R}^{n\times|\mathcal{A}|}$
  Let $E\in\mathbb{R}^{n\times|\mathcal{A}|}$ with rows $E_i=\textsc{OneHot}_{|\mathcal{A}|}(a_i)$ for $i=1,\dots,n$.
  Let $K,H\in\mathbb{R}^{n\times n}$ such that $K_{ij}=k(x_i,x_j)\,\delta_{a_i=a_j}$ and $H_{ij}=k(x_i',x_j)$ $\forall i,j=1,\dots,n$.
  Let $B=(K+\lambda\,\mathrm{Id})^{-1}$ and $b=By$ with $y=(r_1,\dots,r_n)\in\mathbb{R}^{n}$
  For $t=0,1,\dots,T$:
    $\pi=\textsc{softmax}(\eta Q)$ with $Q=HC_t$
    Let $M\in\mathbb{R}^{n\times n}$ such that $M_{ij}=H_{ij}\,\pi_{i a_j}$ for $i,j=1,\dots,n$
    $C_{t+1}=C_t+\mathrm{diag}(c)E$ with $c=(\mathrm{Id}-\gamma BM)^{-1}b$
  Return $\pi_T:\mathcal{X}\to\Delta(\mathcal{A})$ such that $\pi_T(x)=\textsc{softmax}(\eta\,f(x))$ with $f(x)=k(x)C_T$
    and $k(x)=(k(x,x_1),\dots,k(x,x_n))$ for all $x\in\mathcal{X}$
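The steps of Alg. 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the Gaussian kernel, the synthetic one-dimensional dataset, the hyperparameter values and the function names (`powr`, `gaussian_kernel`) are all assumptions made for the example. The matrix $M$ is built following (12), that is $M_{ij}=H_{ij}\,\pi(a_j|x_i')$.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_kernel(X1, X2, sigma=1.0):
    # Assumed kernel choice; any positive definite kernel on the states would do.
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def powr(x, a, x_next, r, n_actions, gamma=0.9, eta=0.1, lam=0.1,
         n_iters=10, kernel=gaussian_kernel):
    n = len(a)
    E = np.eye(n_actions)[a]                        # one-hot rows E_i
    K = kernel(x, x) * (a[:, None] == a[None, :])   # K_ij = k(x_i, x_j) 1[a_i = a_j]
    H = kernel(x_next, x)                           # H_ij = k(x_i', x_j)
    B = np.linalg.inv(K + lam * np.eye(n))
    b = B @ r
    C = np.zeros((n, n_actions))                    # C_0 = 0, i.e. uniform pi_0
    for _ in range(n_iters):
        pi = softmax(eta * H @ C)                   # row i is pi(. | x_i')
        M = H * pi[:, a]                            # M_ij = H_ij pi(a_j | x_i')
        c = np.linalg.solve(np.eye(n) - gamma * B @ M, b)
        C = C + c[:, None] * E                      # C_{t+1} = C_t + diag(c) E

    def policy(x_query):
        # pi_T(x) = softmax(eta * k(x) C_T), one row per query state
        return softmax(eta * kernel(np.atleast_2d(x_query), x) @ C)

    return policy, C

# Synthetic data: action 1 moves right, action 0 moves left, reward favours the origin.
rng = np.random.default_rng(0)
n, n_actions = 40, 2
x = rng.uniform(-1, 1, size=(n, 1))
a = rng.integers(n_actions, size=n)
noise = 0.05 * rng.normal(size=(n, 1))
x_next = np.clip(x + 0.2 * (2 * a[:, None] - 1) + noise, -1, 1)
r = -np.abs(x_next[:, 0])

policy, C = powr(x, a, x_next, r, n_actions)
probs = policy(np.array([0.5]))   # action probabilities at the state x = 0.5
```

The update `C + c[:, None] * E` accumulates the coefficients of successive action-value estimators, so that the returned policy only needs kernel evaluations between the query state and the training states.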

4 Proposed Algorithm: POWR

We are ready to describe our algorithm for world model-based PMD. In the following, we restrict to the case where $\mathcal{A}$ is a finite set, $|\mathcal{A}|<\infty$. As introduced in Sec. 2, policy mirror descent methods are mainly characterized by the choice of Bregman divergence $D$ used for the mirror descent update and, in the case of inexact methods, the estimator $\hat{q}_{\pi_t}$ of the action-value function $q_{\pi_t}$ for the intermediate policies generated by the algorithm.

In POWR, we combine the CME world model presented in Sec. 3 with mirror descent steps using the Kullback-Leibler divergence $D_{\mathrm{KL}}(p;p')=\sum_{a\in\mathcal{A}}p_a\log(p_a/p_a')$ in the update of (2). It was shown in [10] that in this case PMD corresponds to Natural Policy Gradient [24], while [31, Example 9.10] shows how the solution to (2) can be written in closed form for any $x\in\mathcal{X}$ as

$$\pi_{t+1}(\cdot|x)=\frac{\pi_t(\cdot|x)\,e^{\eta\hat{q}_{\pi_t}(x,\cdot)}}{\sum_{a\in\mathcal{A}}\pi_t(a|x)\,e^{\eta\hat{q}_{\pi_t}(x,a)}}. \tag{13}$$

Additionally, the formula above can be applied recursively, expressing $\pi_{t+1}$ as the softmax operator applied to the log of the initial policy plus the sum, scaled by the step size $\eta$, of the action-value estimators up to the current iteration:

$$\pi_{t+1}(\cdot|x)=\textsc{softmax}\left(\log(\pi_0(\cdot|x))+\eta\sum_{s=0}^{t}\hat{q}_{\pi_s}(x,\cdot)\right). \tag{14}$$
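The equivalence between the multiplicative update (13) and the closed form (14) can be verified numerically at a single state. In the sketch below, the initial policy and the sequence of action-value estimates are random stand-ins (assumptions for illustration only).

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_actions, eta, n_steps = 4, 0.5, 6

pi0 = rng.dirichlet(np.ones(n_actions))          # initial policy pi_0(. | x) at a fixed x
q_hats = rng.normal(size=(n_steps, n_actions))   # stand-ins for q_hat_{pi_s}(x, .)

# Iterate the multiplicative update (13).
pi = pi0.copy()
for q in q_hats:
    pi = pi * np.exp(eta * q)
    pi = pi / pi.sum()

# Closed form (14): softmax of log pi_0 plus eta times the summed estimates.
pi_closed = softmax(np.log(pi0) + eta * q_hats.sum(axis=0))
```

The normalization constants of the individual steps cancel inside the softmax, which is why only the running sum of the estimators needs to be stored.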

Choice of the Feature Maps

A key question in adopting the action-value estimators introduced in Sec. 3 is the choice of the two spaces $\mathcal{F}$ and $\mathcal{G}$ used to perform world model learning. Specifically, to apply Prop. 3 and obtain proper estimators $\hat{q}_{\pi_t}$, we need to guarantee that all policies generated by the PMD update are $(\mathcal{G},\mathcal{F})$-compatible (Definition 1). The following result describes a suitable family of such spaces.

Proposition 4 (Separable Spaces).

Let $\phi:\mathcal{X}\to\mathcal{H}$ be a feature map into a Hilbert space $\mathcal{H}$. Let $\mathcal{F}=\mathcal{H}\otimes\mathcal{H}$ and $\mathcal{G}=\mathbb{R}^{|\mathcal{A}|}\otimes\mathcal{H}$ with feature maps respectively $\varphi(x)=\phi(x)\otimes\phi(x)$ and $\psi(x,a)=\phi(x)\otimes e_a$, with $e_a\in\mathbb{R}^{|\mathcal{A}|}$ the one-hot encoding of action $a\in\mathcal{A}$. Let $\pi:\mathcal{X}\to\Delta(\mathcal{A})$ be a policy such that $\pi(a|\cdot)=\langle p_a,\phi(\cdot)\rangle$ with $p_a\in\mathcal{H}$ for any $a\in\mathcal{A}$. Then, $\pi$ is $(\mathcal{G},\mathcal{F})$-compatible.

Prop. 4 (proof in Sec. A.5) states that for the specific choice of function spaces $\mathcal{F}=\mathcal{H}\otimes\mathcal{H}$ and $\mathcal{G}=\mathbb{R}^{|\mathcal{A}|}\otimes\mathcal{H}$, we can guarantee $(\mathcal{G},\mathcal{F})$-compatibility, provided that $\mathcal{H}$ is rich enough to "contain" all $\pi(a|\cdot)$ for $a\in\mathcal{A}$. We postpone the discussion on identifying a suitable space $\mathcal{H}$ for PMD to Sec. 5, since $(\mathcal{G},\mathcal{F})$-compatibility is not needed to mechanically apply Prop. 3 and obtain an estimator $\hat{q}_\pi$. This is because (11) exploits a kernel trick to bypass the need to know $P_\pi$ explicitly, and only requires being able to evaluate $\pi$ on the training data. The latter is possible for $\pi_{t+1}$, thanks to the point-wise characterization of the PMD update in (13). We can therefore present our algorithm.
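For the separable feature map $\psi(x,a)=\phi(x)\otimes e_a$ of Prop. 4, inner products factorize as $\langle\psi(x,a),\psi(x',a')\rangle=\langle\phi(x),\phi(x')\rangle\,\delta_{a=a'}$, so the Gram matrix can be computed from a kernel on states alone. The following sketch checks this with an assumed finite-dimensional $\phi$ given by random vectors (purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_actions = 6, 3, 2

Phi = rng.normal(size=(n, d))          # row i stands in for phi(x_i) (hypothetical features)
a = rng.integers(n_actions, size=n)    # actions a_i
E = np.eye(n_actions)[a]               # one-hot encodings e_{a_i}

# Explicit tensor-product features psi(x_i, a_i) = phi(x_i) (x) e_{a_i}
Psi = np.stack([np.kron(Phi[i], E[i]) for i in range(n)])

gram_explicit = Psi @ Psi.T
# Kernel-trick version: k(x_i, x_j) * 1[a_i = a_j], never forming Psi
gram_kernel = (Phi @ Phi.T) * (a[:, None] == a[None, :])
```

The second expression is the one that extends to infinite-dimensional $\mathcal{H}$, where $\phi$ is only accessible through the kernel $k(x,x')=\langle\phi(x),\phi(x')\rangle$.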

POWR

Alg. 1 describes Policy mirror descent with Operator World-models for Reinforcement learning (POWR). Following the intuition of Prop. 4, the algorithm assumes to work with separable spaces. During an initial phase, we learn the world model Tn=SnBZnsubscript𝑇𝑛superscriptsubscript𝑆𝑛𝐵subscript𝑍𝑛{T}_{n}=S_{n}^{*}BZ_{n}italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_B italic_Z start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and the reward rn=Snbsubscript𝑟𝑛superscriptsubscript𝑆𝑛𝑏r_{n}=S_{n}^{*}bitalic_r start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_b fitting the conditional mean embedding described in 10 on a dataset (xi,ai,xi,ri)i=1nsuperscriptsubscriptsubscript𝑥𝑖subscript𝑎𝑖superscriptsubscript𝑥𝑖subscript𝑟𝑖𝑖1𝑛(x_{i},a_{i},x_{i}^{\prime},r_{i})_{i=1}^{n}( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT (e.g. obtained via experience replay [32]). Once the world model has been learned, we optimize the policy and perform PMD iterations via 14. 
In this second phase, we first evaluate the past (cumulative) action-value estimators $\sum_{s=0}^{t}\hat{q}_s$ to obtain the current policy $\pi_{t+1}(\cdot|x_i)$ by (inexact) PMD via the softmax operator in Eq. (14). We use the newly obtained policy to compute the matrix $M_{\pi_{t+1}n}$ defined in Eq. (12), which is a key component to obtain the estimator $\hat{q}_{t+1}$ for $q_{\pi_{t+1}}$. We note that in the case of the separable spaces of Prop. 4, this matrix reduces to the $n\times n$ matrix $M\in\mathbb{R}^{n\times n}$ with entries $M_{ij}=k(x_i',x_j)\,\pi_{t+1}(a_j|x_i')$, where $k(x_i',x_j)=\langle\phi(x_i'),\phi(x_j)\rangle$. Finally, we obtain $c=(\mathrm{Id}-\gamma BM)^{-1}b$ and the model $\hat{q}_{t+1}=S_n^*c$ according to Prop. 3.
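
As a rough illustration of this second phase, the following NumPy sketch carries out one policy-evaluation step in the separable setting. It is not the reference implementation. It assumes that $B=(K+n\lambda\,\mathrm{Id})^{-1}$ and $b=By$, where $K$ is the Gram matrix of the pairs $(x_i,a_i)$ and $y$ the vector of observed rewards, and it takes a Laplacian state kernel and $q_0=0$. The function names, the regularizer `lam` and the synthetic data are hypothetical.

```python
import numpy as np

def state_kernel(X, Y, sigma=1.0):
    """Laplacian kernel k(x, y) = exp(-||x - y|| / sigma) for all rows of X and Y."""
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def softmax_policy(Q_cum, eta):
    """Row-wise softmax of eta * Q_cum, with Q_cum of shape (m, n_actions)."""
    z = eta * Q_cum
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def powr_evaluation_step(X, A, X_next, y, C_cum, n_actions, gamma, eta, lam):
    """Return c = (Id - gamma B M)^{-1} b, the policy on the x_i', and M."""
    n = len(X)
    onehot = np.eye(n_actions)[A]                 # row i is the indicator of a_i
    K_next = state_kernel(X_next, X)              # K_next[i, j] = k(x_i', x_j)
    # Cumulative estimate at (x_i', a): sum_j C_cum[j] k(x_i', x_j) 1[a_j = a]
    Q_cum = K_next @ (C_cum[:, None] * onehot)
    pi_next = softmax_policy(Q_cum, eta)          # pi_next[i, a] = pi_{t+1}(a | x_i')
    M = K_next * pi_next[:, A]                    # M[i, j] = k(x_i', x_j) pi_{t+1}(a_j | x_i')
    # Assumption: ridge-regularised inverse of the product kernel on the pairs (x_i, a_i)
    K = state_kernel(X, X) * (onehot @ onehot.T)
    B = np.linalg.inv(K + n * lam * np.eye(n))
    b = B @ y
    c = np.linalg.solve(np.eye(n) - gamma * B @ M, b)
    return c, pi_next, M

# Toy usage on synthetic transitions (all values are illustrative).
rng = np.random.default_rng(0)
n, d, n_actions = 20, 2, 3
X = rng.uniform(-1, 1, size=(n, d))
A = rng.integers(0, n_actions, size=n)
X_next = X + 0.1 * rng.standard_normal((n, d))
y = rng.random(n)
C_cum = np.zeros(n)                    # no past estimators yet, so the policy is uniform
c, pi_next, M = powr_evaluation_step(X, A, X_next, y, C_cum, n_actions,
                                     gamma=0.9, eta=1.0, lam=0.1)
C_cum = C_cum + c                      # accumulate coefficients for the next PMD step
```

Because the dataset is fixed during the PMD phase, the cumulative estimator $\sum_s\hat{q}_s$ is represented here by the running sum of the coefficient vectors $c$, so each iteration costs one $n\times n$ linear solve.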

Clearly, the world model learning phase and the PMD phase can be alternated in POWR, essentially finding a trade-off between exploration and exploitation. This could lead to a refinement of the world model as more observations are integrated into the estimator. While in this work we do not investigate the effects of such an alternating strategy, Thm. 7 offers relevant insights in this sense. The result characterizes the behavior of the PMD algorithm when combined with a varying (possibly increasing) accuracy in the estimation of the action-value function (see Sec. 5 for more details).

5 Theoretical Analysis

We now show that POWR converges to the global maximizer of the RL problem in Eq. (1). To this end, we first identify a family of function spaces guaranteed to be compatible with the policies generated by Alg. 1. Then, we provide an extension of the result in [12] for inexact PMD to infinite state spaces $\mathcal{X}$, showing the impact of the action-value approximation error on the convergence rates. By leveraging the simulation lemma, we relate this error, for the estimator introduced in Prop. 3, to the approximation errors of $T_n$ and $r_n$. Finally, we use recent advancements in the characterization of CMEs’ fast learning rates to bound the sample complexity of these latter quantities, yielding error bounds for POWR.

POWR is Well-defined

To properly apply Prop. 3 to estimate the action-value function of any PMD iterate $\pi_t$, we need to guarantee that every iterate belongs to the space $\mathcal{H}$ according to Prop. 4. The following result provides such a family of spaces.

Theorem 5.

Let $\mathcal{X}\subset\mathbb{R}^{d}$ be a compact set and let $\mathcal{H}=W^{2,s}(\mathcal{X})$ be the Sobolev space of smoothness $s>0$ (see e.g. [33]). Let $\pi_t(a|\cdot)$ and $\hat{q}_{\pi_t}(\cdot,a)$ belong to $\mathcal{H}$ for any $a\in\mathcal{A}$ and $\pi_t(a|x)>0$ for any $x\in\mathcal{X}$. Then the policy $\pi_{t+1}$ solution to the PMD update in Eq. (13) belongs to $\mathcal{H}$.

According to Thm. 5, Sobolev spaces offer a viable choice for compatibility with PMD-generated policies. This observation is further supported by the fact that Sobolev spaces of smoothness $s>d/2$ are so-called reproducing kernel Hilbert spaces (rkhs) (see e.g. [34, Ch. 10]). We recall that rkhs are always naturally equipped with a feature map $\phi:\mathcal{X}\to\mathcal{H}$ such that the inner product $\langle\phi(x),\phi(x')\rangle=k(x,x')$ defines a so-called reproducing kernel, namely a positive definite function that is (usually) efficient to evaluate, even if $\phi(x)$ is high or infinite dimensional. For example, $\mathcal{H}=W^{2,s}(\mathcal{X})$ with $s=\lceil\frac{d}{2}\rceil$ has associated kernel $k(x,x')=e^{-\|x-x'\|/\sigma}$ with bandwidth $\sigma>0$ [34]. By applying Thm. 5 to the iterates generated by Alg. 1 we have the following result.
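
To make the role of the reproducing kernel concrete, the short sketch below evaluates $k(x,x')=e^{-\|x-x'\|/\sigma}$ on a handful of toy states and checks numerically that the resulting Gram matrix is symmetric and positive semi-definite, without ever forming $\phi(x)$. The sample points and the tolerance are arbitrary choices made for illustration.

```python
import numpy as np

def laplacian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y|| / sigma), evaluated without forming phi(x)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.linalg.norm(diff) / sigma))

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(6, 2))   # six toy states in R^2

# Gram matrix with entries k(x_i, x_j)
gram = np.array([[laplacian_kernel(p, q) for q in points] for p in points])

# Positive semi-definiteness shows up as non-negative eigenvalues (up to round-off)
min_eig = np.linalg.eigvalsh(gram).min()
print("symmetric:", np.allclose(gram, gram.T), "| PSD:", min_eig >= -1e-10)
```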

Corollary 6.

With the hypotheses of Prop. 4, let $\mathcal{H}=W^{2,s}(\mathcal{X})$ with $s>d/2$. Let $T_n\in\mathrm{HS}(\mathcal{F},\mathcal{G})$ and $r_n\in\mathcal{G}$ be characterized as in Prop. 3. Let $\pi_0(a|\cdot)\propto e^{\eta q_0(\cdot,a)}$ for $q_0$ such that $q_0(\cdot,a)\in\mathcal{H}$ for any $a\in\mathcal{A}$. Then, for any $t\in\mathbb{N}$ the PMD iterates $\pi_t$ generated by Alg. 1 are such that $\pi_t(a|\cdot)\in\mathcal{H}$ and hence are $(\mathcal{G},\mathcal{F})$-compatible.

The above corollary guarantees that if we are able to learn our estimates for the action-value function in a suitably regular Sobolev space $\mathcal{H}$, then POWR is well-defined. This is a necessary condition for studying its theoretical behavior in our main result. We report the proofs of Thm. 5 and Cor. 6 in Sec. C.1.

Inexact PMD Converges

We now present a more rigorous version of the characterization of the convergence rates of the inexact PMD algorithm discussed informally in Thm. 1.

Theorem 7 (Convergence of Inexact PMD).

Let $(\pi_t)_{t\in\mathbb{N}}$ be a sequence of policies generated by Alg. 1 that are all $(\mathcal{G},\mathcal{F})$-compatible. If the action-value functions $\hat{q}_{\pi_t}$ are estimated with an error $\|q_{\pi_t}-\hat{q}_{\pi_t}\|_{\infty}\leq\varepsilon_t$, the iterates of Alg. 1 converge to the optimal policy as

$$J(\pi_*)-J(\pi_T)\leq\varepsilon_T+O\left(\frac{1}{T}+\frac{1}{T}\sum_{t=0}^{T-1}\varepsilon_t\right), \tag{15}$$

where $\pi_*:\mathcal{X}\to\Delta(\mathcal{A})$ is a measurable maximizer of Eq. (8).

Thm. 7 shows that inexact PMD can behave comparably to its exact version provided that the $\hat{q}_{\pi_t}$ are estimated with increasing accuracy and that the sequence of policies is $(\mathcal{G},\mathcal{F})$-compatible, for example in the Sobolev-based setting of Cor. 6. Specifically, if $\|q_{\pi_t}-\hat{q}_{\pi_t}\|_{\infty}\leq O(1/t)$ for any $t\in\mathbb{N}$, the convergence rate of inexact PMD is of order $O(\log T/T)$, only a logarithmic factor slower than exact PMD. This means that we do not necessarily need a good approximation of the world model from the beginning but rather a strategy to improve upon such approximation as we perform more PMD iterations. This points to an alternating strategy between exploration (world model learning) and exploitation (PMD steps), as discussed in Sec. 4. We do not investigate this question in this work.
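
The logarithmic factor can be checked numerically. The sketch below evaluates the quantity inside the $O(\cdot)$ of Eq. (15), with all constants dropped, for the hypothetical schedule $\varepsilon_t=1/(t+1)$ (shifted by one so that $t=0$ is well defined) and compares it with $(2+\log T)/T$, which upper-bounds it because the harmonic sum satisfies $\sum_{t=1}^{T}1/t\leq 1+\log T$.

```python
import math

def inexact_pmd_rate(T, eps):
    """1/T + (1/T) * sum_{t=0}^{T-1} eps(t): the O(.) term of Eq. (15), constants dropped."""
    return 1.0 / T + sum(eps(t) for t in range(T)) / T

def eps_schedule(t):
    """Hypothetical error schedule eps_t = 1/(t+1); the shift avoids dividing by zero at t = 0."""
    return 1.0 / (t + 1)

for T in (10, 100, 1000, 10000):
    rate = inexact_pmd_rate(T, eps_schedule)
    bound = (2.0 + math.log(T)) / T
    print(f"T={T:>6d}  rate={rate:.6f}  (2 + log T)/T={bound:.6f}")
```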

The proof technique for Thm. 7 closely follows [12, Thm. 8 and 13]. We provide a proof in Sec. B.1 since the original result did not allow for a decreasing approximation error but rather assumed a constant one. Moreover, extending it to the case of infinite $\mathcal{X}$ requires taking care of additional details related to potential measurability issues.

Action-value approximation error in terms of World Model estimates

Thm. 7 highlights the importance of studying the error of the estimator for the action-value functions produced by Alg. 1. These objects are obtained via the formula described in Prop. 3 in terms of $T_n$, $r_n$ and $P_{\pi}$. The exact $q_{\pi}$ has an analogous closed-form characterization in terms of $T$, $r$ and $P_{\pi}$, as expressed in Eq. (7), which motivates our operator-based approach. The following result compares these quantities in terms of the approximation errors of the world model and the reward function.

Lemma 8 (Implications of the Simulation Lemma).

Let $T_n$ and $r_n$ be the empirical estimators of the transfer operator $T$ and reward function $r$ as defined in Prop. 3, respectively. Let also $\gamma\bigl\|T_n\bigr\|<\gamma'<1$. Then,

$$\bigl\|\hat{q}_{\pi}-q_{\pi}\bigr\|_{\infty}\leq\frac{1}{1-\gamma'}\left[C_{\psi}\bigl\|r_n-r\bigr\|_{\mathcal{G}}+\frac{\gamma\bigl\|r\bigr\|_{\infty}}{1-\gamma}\bigl\|T|_{\mathcal{F}}-T_n\bigr\|_{\mathrm{HS}}\right]$$

In the result above, when applied to a function in $\mathcal{G}$, such as $r_n$, the uniform norm is to be interpreted as the uniform norm of the evaluation of such function, namely $\|r_n\|_{\infty}=\sup_{(x,a)\in\Omega}|\langle r_n,\psi(x,a)\rangle|$, and analogously for $T_n$. The proof, reported in Sec. C.2, follows by first decomposing the difference $q_{\pi}-\hat{q}_{\pi}$ with the simulation lemma [26, Lemma 2.2] and then applying the triangle inequality for the uniform norm.
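
The structure of this bound can be sanity-checked in a small tabular MDP. The sketch below does not verify Lemma 8 itself. It checks the classical finite-state analogue in which the estimated transition matrix is row-stochastic (so that $\gamma'=\gamma$), $C_{\psi}$ is replaced by $1$, the reward error is measured in uniform norm, and the Hilbert-Schmidt error is replaced by the largest row-wise $\ell_1$ distance between transition kernels. The random MDP and the perturbation sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

def random_stochastic(rows, cols):
    """Random matrix with non-negative rows summing to one."""
    P = rng.random((rows, cols))
    return P / P.sum(axis=1, keepdims=True)

T_true = random_stochastic(nS * nA, nS)   # T[(x, a), x'] with (x, a) flattened as x * nA + a
r_true = rng.random(nS * nA)
pi = random_stochastic(nS, nA)            # pi[x, a] = pi(a | x)

# Perturbed (hypothetical) world model; the convex mixture keeps rows stochastic
T_hat = 0.95 * T_true + 0.05 * random_stochastic(nS * nA, nS)
r_hat = r_true + 0.01 * rng.standard_normal(nS * nA)

# P_pi maps q, indexed by (x, a), to v(x) = sum_a pi(a | x) q(x, a)
P_pi = np.zeros((nS, nS * nA))
for x in range(nS):
    P_pi[x, x * nA:(x + 1) * nA] = pi[x]

def q_value(T, r):
    """Closed form q = (Id - gamma T P_pi)^{-1} r."""
    return np.linalg.solve(np.eye(nS * nA) - gamma * T @ P_pi, r)

q, q_hat = q_value(T_true, r_true), q_value(T_hat, r_hat)

lhs = np.abs(q_hat - q).max()
reward_err = np.abs(r_hat - r_true).max()
model_err = np.abs(T_hat - T_true).sum(axis=1).max()   # worst-case l1 error over (x, a)
rhs = (reward_err + gamma * np.abs(r_true).max() / (1 - gamma) * model_err) / (1 - gamma)
print(f"sup-norm error {lhs:.4f} <= bound {rhs:.4f}")
```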

POWR converges

We are ready to state the convergence result for Alg. 1. We consider the setting where the dataset used to learn $T_n$ (and $r_n$) is made of i.i.d. triplets $(x_i,a_i,x_i')$ with $(x_i,a_i)$ sampled from a distribution $\rho\in\mathcal{M}_1^{+}(\Omega)$ supported on all $\Omega$ (such as the state occupancy measure (see e.g. [11] or Sec. A.3) of the uniform policy $\pi(\cdot|x)=1/|\mathcal{A}|$) and $x_i'$ sampled from $\tau(\cdot|x_i,a_i)$. To guarantee bounds in uniform norm, the result makes a further regularity assumption on the transfer operator (and the reward function).

Assumption 2 (Strong Source Condition).

Let $\rho\in\mathcal{M}_1^{+}(\Omega)$ and let $C_{\rho}$ be the covariance operator $\sum_{a\in\mathcal{A}}\int_{\mathcal{X}}\rho(dx,a)\,\psi(x,a)\otimes\psi(x,a)$. The transition operator $T$ and the reward function $r$ are such that $T|_{\mathcal{F}}\in\mathrm{HS}(\mathcal{F},\mathcal{G})$ and $r\in\mathcal{G}$. Further, $\|(T|_{\mathcal{F}})^{*}C_{\rho}^{-\beta}\|_{\mathrm{HS}}<\infty$ and $\|C_{\rho}^{-\beta}r\|_{\mathcal{G}}<\infty$ for some $\beta>0$.

Assumption 2 imposes a strong requirement on the so-called source condition, a quantity that describes how well the target objective of the learning process (here $T$ and $r$) “interacts” with the sampling distribution. The assumption is always satisfied when the hypothesis space is finite dimensional (e.g. in the tabular RL setting) and imposes additional smoothness on $T$ and $r$ when they belong to a Sobolev space. Equipped with this assumption, we can now state the convergence theorem for Alg. 1.

Theorem 9.

Let $(\pi_t)_{t\in\mathbb{N}}$ be a sequence of policies generated by Alg. 1 in the same setting of Cor. 6. If the action-value functions $\hat{q}_{\pi_t}$ are estimated from a dataset $(x_i,a_i;x_i')_{i=1}^{n}$ with $(x_i,a_i)\sim\rho\in\mathcal{M}_1^{+}(\Omega)$ such that Asm. 2 holds with parameter $\beta$, the iterates of Alg. 1 converge to the optimal policy as

$$J(\pi_*)-J(\pi_T)\leq O\left(\frac{1}{T}+\delta^{2}n^{-\alpha}\right)$$

with probability not less than $1-4e^{-\delta}$. Here, $\alpha\in\left(\frac{\beta}{2+2\beta},\frac{\beta}{1+2\beta}\right)$ and $\pi_*:\mathcal{X}\to\Delta(\mathcal{A})$ is a measurable maximizer of Eq. (8).

The proof of Thm. 9 is reported in Appendix C and combines the results discussed in this section with fast convergence rates for the least-squares [35] and CME [18] estimators. In particular, we first use Thm. 5 to guarantee that the policies produced by Alg. 1 are all $(\mathcal{G},\mathcal{F})$-compatible and therefore that applying Prop. 3 to obtain an estimator for the action-value function is well-defined. Then, we use Lemma 8 to study the approximation error of these estimators in terms of our estimates for the world model and the reward function. Bounds on these quantities are then used in the result for inexact PMD convergence in Thm. 7. We note here that since the latter results require convergence in uniform norm, we cannot leverage standard results for least-squares and CME convergence, which characterize convergence in $L_2(\Omega,\mu)$ and would only require Asm. 1 (Linear MDP) to hold. Rather, we need to impose Asm. 2 to guarantee faster rates in uniform norm.
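
To get a feeling for the sample-dependent term in Thm. 9, the sketch below computes the admissible interval for $\alpha$ given $\beta$ and the number of PMD iterations at which the $1/T$ term matches $n^{-\alpha}$. Constants and the dependence on $\delta$ are ignored, and the helper names are hypothetical. Both endpoints of the interval stay below $1/2$ for every $\beta>0$, since $2\beta<1+2\beta$.

```python
import math

def alpha_interval(beta):
    """Open interval (beta / (2 + 2 beta), beta / (1 + 2 beta)) for the exponent alpha."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return beta / (2 + 2 * beta), beta / (1 + 2 * beta)

def balancing_iterations(n, alpha):
    """Smallest integer T with T >= n**alpha, i.e. where 1/T is of the order of n**(-alpha)."""
    return math.ceil(n ** alpha)

for beta in (0.5, 1.0, 2.0, 10.0):
    lo, hi = alpha_interval(beta)
    mid = 0.5 * (lo + hi)   # an arbitrary admissible exponent inside the interval
    print(f"beta={beta:>4}: alpha in ({lo:.3f}, {hi:.3f}), "
          f"T ~ {balancing_iterations(10_000, mid)} for n = 10000")
```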

6 Experimental results

We empirically evaluated POWR on classical Gym environments [36], ranging from discrete (FrozenLake-v1, Taxi-v3) to continuous state spaces (MountainCar-v0). To balance exploration and exploitation, we alternated between running the environment with the current policy to collect samples for world model learning and running Alg. 1 for a number of steps to generate a new policy. Appendix D provides implementation details regarding this process as well as details on the kernel and hyperparameters used in both tabular and infinite-state settings.

Fig. 1 compares our approach with the performance of well-established baselines including A2C [37, 38], DQN [4, 38], TRPO [7, 38], and PPO [6, 38]. The figure reports the average cumulative reward obtained by the models on test environments with respect to the number of interactions with the MDP (timesteps in log scale in the figure) across 7 different training runs. In all plots, the horizontal dashed line represents the “success” threshold for the corresponding environment, according to official guidelines. We observe that our method outperforms all competitors by a significant margin in terms of sample complexity (i.e. reward achieved with respect to the number of timesteps executed) and, in the case of the Taxi-v3 environment, it avoids converging to a local optimum, in contrast to other methods such as A2C and TRPO. On the downside, we note that our method exhibits less stability than other approaches, particularly during the initial stages of the training process. This is arguably due to a sub-optimal interplay between exploration and exploitation, which will be the subject of future work.

Figure 1: (a) FrozenLake-v1, (b) Taxi-v3, (c) MountainCar-v0. The plots show the average cumulative reward in different environments with respect to the timesteps (i.e. number of interactions with the MDP). The dark lines represent the mean of the cumulative reward and the shaded area spans the minimum and maximum values reached across 7 independent runs. The horizontal dashed lines represent the reward threshold proposed by the Gym library [36].

7 Conclusions and Future Work

Motivated by recent advancements in policy mirror descent (PMD), this work introduced a novel reinforcement learning (RL) algorithm leveraging these results. Our approach operates in two, possibly alternating, phases: learning a world model and planning via PMD. During exploration, we utilize conditional mean embeddings (CMEs) to learn a world model operator, showing that this procedure is well-posed when performed over suitable Sobolev spaces. The planning phase involves PMD steps for which we guarantee convergence to a global optimum at a polynomial rate under specific MDP regularities.

Our analysis opens avenues for further exploration. Firstly, extending PMD to infinite action spaces remains a challenge. While we introduced the operatorial perspective on RL for infinite state space settings, the PMD update with KL divergence requires approximation methods (e.g., Monte Carlo) whose impact on convergence requires investigation. Secondly, scalability to large environments requires adopting approximate yet efficient CME estimators like Nyström [30] or reduced-rank regressors [39]. Thirdly, a question we touched upon only empirically is whether alternating world model learning with inexact PMD updates benefits the exploration-exploitation trade-off. Studying this strategy’s impact on convergence is a promising future direction. Finally, a crucial question is generalizing our policy compatibility results beyond Sobolev spaces. Ideally, a representation learning process would identify suitable feature maps that guarantee compatibility with the PMD-generated policies while allowing for added flexibility in learning the world model.

References

  • [1] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [2] OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
  • [3] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [4] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv:1312.5602, 2013.
  • [5] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM workshop on hot topics in networks, pages 50–56, 2016.
  • [6] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
  • [7] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization, 2017.
  • [8] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018.
  • [9] Dimitri Bertsekas. Dynamic Programming and Optimal Control: Volume I, volume 4. Athena scientific, 2012.
  • [10] Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5668–5675, 2020.
  • [11] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.
  • [12] Lin Xiao. On the convergence rates of policy gradient methods. Journal of Machine Learning Research, 23(282):1–36, 2022.
  • [13] Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision processes. In International Conference on Machine Learning, pages 2160–2169. PMLR, 2019.
  • [14] Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
  • [15] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
  • [16] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
  • [17] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A general framework for consistent structured prediction with implicit loss embeddings. Journal of Machine Learning Research, 21(98):1–67, 2020.
  • [18] Zhu Li, Dimitri Meunier, Mattes Mollenhauer, and Arthur Gretton. Optimal rates for regularized conditional mean embedding learning. Advances in Neural Information Processing Systems, 35:4433–4445, 2022.
  • [19] Steffen Grünewälder, Guy Lever, Luca Baldassarre, Massimiliano Pontil, and Arthur Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. In Proceedings of the 29th International Conference on Machine Learning, pages 1603–1610, 2012.
  • [20] Antoine Moulin and Gergely Neu. Optimistic planning by regularized dynamic programming. In International Conference on Machine Learning, pages 25337–25357. PMLR, 2023.
  • [21] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
  • [22] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
  • [23] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
  • [24] Sham M. Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 14, 2001.
  • [25] C.D. Aliprantis and K.C. Border. Infinite Dimensional Analysis: A Hitchhiker’s Guide. Studies in Economic Theory. Springer, 1999.
  • [26] Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. 2021. URL https://rltheorybook.github.io, 2022.
  • [27] Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimiliano Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning, pages 1803–1810, 2012.
  • [28] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems, 13, 2000.
  • [29] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 28, 2015.
  • [30] Giacomo Meanti, Antoine Chatalic, Vladimir R. Kostic, Pietro Novelli, Massimiliano Pontil, and Lorenzo Rosasco. Estimating Koopman operators with sketching to provably learn large scale dynamical systems. Advances in Neural Information Processing Systems, 36, 2023.
  • [31] Amir Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, 2017.
  • [32] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [33] Robert A. Adams and John J. F. Fournier. Sobolev Spaces. Elsevier, 2003.
  • [34] Holger Wendland. Scattered data approximation, volume 17. Cambridge University Press, 2004.
  • [35] Simon Fischer and Ingo Steinwart. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 21(205):1–38, 2020.
  • [36] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • [37] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv:1602.01783, 2016.
  • [38] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.
  • [39] Vladimir R. Kostic, Pietro Novelli, Andreas Maurer, Carlo Ciliberto, Lorenzo Rosasco, and Massimiliano Pontil. Learning dynamical systems via Koopman operator regression in reproducing kernel Hilbert spaces. Advances in Neural Information Processing Systems, 35:4017–4031, 2022.
  • [40] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 2013.
  • [41] O. Kallenberg. Foundations of Modern Probability. Probability and Its Applications. Springer New York, 2002.
  • [42] Giulia Luise, Saverio Salzo, Massimiliano Pontil, and Carlo Ciliberto. Sinkhorn barycenters with free support via frank-wolfe algorithm. Advances in Neural Information Processing Systems, 32, 2019.
  • [43] Vladimir R. Kostic, Karim Lounici, Pietro Novelli, and Massimiliano Pontil. Sharp spectral rates for koopman operator learning. Advances in Neural Information Processing Systems, 36, 2023.

Appendix

The appendices are organized as follows:

  • Appendix A discusses the operatorial formulation of RL and shows how to derive the operator-based results in this work.

  • Appendix B focuses on policy mirror descent (PMD) and its convergence rate in the inexact setting.

  • Appendix C proves the main result of this work, namely the theoretical analysis of POWR.

  • Appendix D provides details on the experiments reported in this work.

Appendix A Operatorial Results

A.1 Auxiliary Lemma

We recall here a corollary of the Sherman-Woodbury identity [40].

Lemma A.1.

Let $A$ and $B$ be two conformable linear operators such that $I+AB$ is invertible. Then $(I+AB)^{-1}A = A(I+BA)^{-1}$.

Proof.

The result is obvious if $A$ is invertible. More generally, we consider the following two applications of the Sherman-Woodbury formula [40]:

$$(I+AB)^{-1} = I - A(I+BA)^{-1}B \tag{A.1}$$

and

$$(I+BA)^{-1} = I - (I+BA)^{-1}BA. \tag{A.2}$$

Multiplying (A.1) by $A$ on the right and (A.2) by $A$ on the left, both right-hand sides reduce to $A - A(I+BA)^{-1}BA$, which gives the desired result. ∎
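As a quick sanity check of Lemma A.1, the identity can be verified numerically on random rectangular matrices, for which $A$ is not invertible. This is only an illustrative sketch: the function name `push_through_gap`, the matrix sizes, and the scaling factor (chosen so that $I+AB$ is safely invertible) are assumptions made here and are not part of the paper.

```python
import numpy as np

def push_through_gap(A, B):
    """Max absolute difference between (I + AB)^{-1} A and A (I + BA)^{-1}."""
    n, m = A.shape
    lhs = np.linalg.solve(np.eye(n) + A @ B, A)   # (I + AB)^{-1} A
    rhs = A @ np.linalg.inv(np.eye(m) + B @ A)    # A (I + BA)^{-1}
    return float(np.max(np.abs(lhs - rhs)))

rng = np.random.default_rng(0)
# Rectangular (hence non-invertible) A; small entries keep I + AB well conditioned.
A = 0.1 * rng.standard_normal((4, 6))
B = 0.1 * rng.standard_normal((6, 4))
gap = push_through_gap(A, B)
print(gap)
```

The two sides agree up to floating-point error even though $AB$ and $BA$ have different sizes, which is exactly the situation exploited later when moving $P$ across the inverse.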

A.2 Markov operators and their properties

We recall here the notion of Markov operators, which is central for a number of results in the following. We refer to [25, Chapter 19] for more details on the topic.

Definition A.1 (Markov operators).

Let $\mathcal{X}$ and $\mathcal{Y}$ be Polish spaces. A bounded linear operator $P \in \mathcal{L}(B_b(\mathcal{X}), B_b(\mathcal{Y}))$ is a Markov operator if it is positive and maps the unit function to the unit function, that is:

$$\begin{aligned} &\textbf{a. } f \geq 0 \in B_b(\mathcal{X}) \implies Pf \geq 0 \in B_b(\mathcal{Y}), \\ &\textbf{b. } P\mathbf{1}_{\mathcal{X}} = \mathbf{1}_{\mathcal{Y}}, \end{aligned}$$

where $\mathbf{1}_{\mathcal{X}}:\mathcal{X}\to\mathbb{R}$ (respectively $\mathbf{1}_{\mathcal{Y}}$) denotes the function taking constant value equal to $1$ on $\mathcal{X}$ (respectively $\mathcal{Y}$).

We recall that Markov operators are a convex subset of $\mathcal{L}(B_b(\mathcal{X}), B_b(\mathcal{Y}))$. Here we denote this set as $\mathcal{L}_{\rm M}(B_b(\mathcal{X}), B_b(\mathcal{Y}))$. Direct inspection of (5) and (6) shows that the transfer operator $T$ associated to an MDP and the policy operator $P_\pi$ associated to a policy $\pi$ are both Markov operators.
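In the finite case, the two properties of Definition A.1 can be illustrated with matrices: an operator acting as $f \mapsto Mf$ is Markov exactly when $M$ has non-negative entries and rows summing to one. The sketch below checks both properties and the convexity claim on random examples; the helper names `is_markov` and `random_markov` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def is_markov(M, tol=1e-12):
    """Check properties a. and b. of Definition A.1 for the map f -> M @ f."""
    # a. For matrices, mapping non-negative f to non-negative M f
    #    is equivalent to M having non-negative entries.
    positive = bool(np.all(M >= -tol))
    # b. The constant-one function is mapped to the constant-one function.
    ones_in, ones_out = np.ones(M.shape[1]), np.ones(M.shape[0])
    unit = bool(np.allclose(M @ ones_in, ones_out, atol=1e-10))
    return positive and unit

def random_markov(rng, n_out, n_in):
    """Random row-stochastic matrix of shape (n_out, n_in)."""
    M = rng.random((n_out, n_in))
    return M / M.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
M1 = random_markov(rng, 3, 5)
M2 = random_markov(rng, 3, 5)
mix = 0.3 * M1 + 0.7 * M2       # convex combination of two Markov operators
f = rng.random(5)               # a non-negative function on a 5-point space
print(is_markov(M1), is_markov(M2), is_markov(mix))
```

The difference `M1 - M2` fails both checks, which is a reminder that Markov operators form a convex set and not a linear subspace.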

Markov Operators and Policy Operators

In (6) we defined the policy operator $P_\pi$ associated to a policy $\pi$. It turns out that the converse is also true, namely that any such Markov operator is a policy operator.

Proposition A.2.

Let $P \in \mathcal{L}_{\rm M}(B_b(\mathcal{X}), B_b(\Omega))$ be a Markov operator. Then there exists $\pi_P$ such that the associated policy operator corresponds to $P$, namely $P_{\pi_P} = P$.

Proof.

Define the map $\pi_P:\mathcal{X}\to\mathcal{M}(\mathcal{A})$ taking values in the space of bounded Borel measures over $\mathcal{A}$ such that, for any $x\in\mathcal{X}$ and any Borel measurable subset $\mathcal{B}\subseteq\mathcal{A}$,

$$\pi_P(\mathcal{B}|x) = (P\,\mathbf{1}_{\mathcal{X}\times\mathcal{B}})(x). \tag{A.3}$$

We need to guarantee that for every $x\in\mathcal{X}$ the function $\pi_P(\cdot|x)$ is a signed measure. To show this, first note that the operation defined by $\pi_P$ is well defined, since for any measurable set $\mathcal{B}$ the function $\mathbf{1}_{\mathcal{X}\times\mathcal{B}}$ is also measurable, making $\pi_P(\mathcal{B}|x)$ well defined as well. Moreover, since $\mathbf{1}_{\emptyset}(a)=0$ for any $a\in\mathcal{A}$, we have $\mathbf{1}_{\mathcal{X}\times\emptyset}=0$ and therefore $\pi_P(\emptyset|x)=0$ for any $x\in\mathcal{X}$. Finally, $\sigma$-additivity follows from the definition of indicator functions, namely $\mathbf{1}_{\bigcup_{i=1}^{\infty}\mathcal{B}_i}=\sum_{i=1}^{\infty}\mathbf{1}_{\mathcal{B}_i}$ for any family of pairwise disjoint sets $(\mathcal{B}_i)_{i=1}^{\infty}$, which implies $\pi_P\left(\bigcup_{i=1}^{\infty}\mathcal{B}_i\,\middle|\,x\right)=\sum_{i=1}^{\infty}\pi_P(\mathcal{B}_i|x)$ for any $x\in\mathcal{X}$.

We now apply the two properties of Markov operators to show that $\pi_P$ takes values in $\mathcal{M}_1^+(\mathcal{A})$, namely that $\pi_P(\cdot|x)$ is a non-negative measure with total mass $1$. Since Markov operators map non-negative functions to non-negative functions and since $\mathbf{1}_{\mathcal{X}\times\mathcal{B}}\geq 0$ for any measurable $\mathcal{B}\subseteq\mathcal{A}$, we have $\pi_P(\cdot|x)\geq 0$ as well for any $x\in\mathcal{X}$. Moreover, since $\Omega=\mathcal{X}\times\mathcal{A}$ and $P\mathbf{1}_{\Omega}=\mathbf{1}_{\mathcal{X}}$, we have

$$\pi_P(\mathcal{A}|x) = (P\,\mathbf{1}_{\Omega})(x) = \mathbf{1}_{\mathcal{X}}(x) = 1, \tag{A.4}$$

for any $x\in\mathcal{X}$. Therefore $\pi_P(\cdot|x)$ is a probability measure for any $x\in\mathcal{X}$. Direct application of (6) shows that the associated policy operator corresponds to $P$, namely $P_{\pi_P}=P$ as desired. ∎

Given the correspondence between policies and their Markov operators according to (6) and Proposition A.2, in the following we will denote the policy operator associated to a policy $\pi$ simply by $P$ where clear from context.
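The construction in (A.3) can be made concrete for finite state and action spaces. The sketch below assumes the finite-dimensional form $(P_\pi f)(x)=\sum_a \pi(a|x) f(x,a)$ for the policy operator, with functions on $\Omega=\mathcal{X}\times\mathcal{A}$ flattened in row-major order; this encoding and the helper names `policy_operator` and `policy_from_operator` are assumptions made for illustration. It builds $P$ from a random policy, recovers $\pi_P$ by applying $P$ to the indicators $\mathbf{1}_{\mathcal{X}\times\{a\}}$, and checks the round trip.

```python
import numpy as np

def policy_operator(pi):
    """Matrix of P_pi with (P_pi f)(x) = sum_a pi[x, a] * f[x, a].

    Functions on Omega = X x A are flattened so that (x, a) -> x * k + a.
    """
    n, k = pi.shape
    P = np.zeros((n, n * k))
    for x in range(n):
        P[x, x * k:(x + 1) * k] = pi[x]
    return P

def policy_from_operator(P, n, k):
    """Recover pi_P(a | x) = (P 1_{X x {a}})(x), i.e. (A.3) with B = {a}."""
    pi = np.zeros((n, k))
    for a in range(k):
        indicator = np.zeros((n, k))
        indicator[:, a] = 1.0                 # indicator function of X x {a}
        pi[:, a] = P @ indicator.ravel()      # row-major flattening matches x * k + a
    return pi

rng = np.random.default_rng(0)
n, k = 4, 3
pi = rng.random((n, k))
pi /= pi.sum(axis=1, keepdims=True)           # each row is a distribution over actions
P = policy_operator(pi)
pi_rec = policy_from_operator(P, n, k)
print(np.max(np.abs(pi_rec - pi)))
```

Each recovered row is non-negative and sums to one, mirroring the two steps of the proof that use positivity and $P\mathbf{1}_{\Omega}=\mathbf{1}_{\mathcal{X}}$.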

With the definition of Markov operator in place, we can now prove the following result introduced in the main paper.

See 2

Proof.

Recall that, since they are Hilbert spaces, $\mathcal{F}\cong\mathcal{F}^*$ and $\mathcal{G}\cong\mathcal{G}^*$ are isometric to their duals, and therefore we can interpret any $f\in\mathcal{F}$ as the function $f(\cdot)=\left\langle f,\varphi(\cdot)\right\rangle$ with some abuse of notation, where clear from context. By Asm. 1 we have that $T|_{\mathcal{F}}$ takes values in $\mathcal{G}$. This means that $(T|_{\mathcal{F}}f)\in\mathcal{G}$ or, in other words,

$$\begin{aligned} \left\langle T|_{\mathcal{F}}f,\psi(x,a)\right\rangle &= (T|_{\mathcal{F}}f)(x,a) \\ &= \int_{\mathcal{X}} f(x')\,\tau(dx'|x,a) \\ &= \int_{\mathcal{X}} \left\langle f,\varphi(x')\right\rangle\,\tau(dx'|x,a) \\ &= \left\langle f,\int_{\mathcal{X}}\varphi(x')\,\tau(dx'|x,a)\right\rangle, \end{aligned}$$

from which we obtain

$$\left\langle f,(T|_{\mathcal{F}})^*\psi(x,a)\right\rangle = \left\langle f,\int_{\mathcal{X}}\varphi(x')\,\tau(dx'|x,a)\right\rangle.$$

Since the above equality holds for any $f\in\mathcal{F}$, (9) holds, as desired. ∎

We note that the result can be extended to the setting where $T|_{\mathcal{F}}(\mathcal{F})\subseteq\mathcal{G}$, namely where the image of $T|_{\mathcal{F}}$ is contained in $\mathcal{G}$, which amounts to a sort of $(\mathcal{F},\mathcal{G})$-compatibility for the transition operator (see Definition 1).

A.3 The operatorial formulation of RL

According to the operatorial characterization in (7), the action-value function of a policy $\pi$ is directly related to the action of the associated policy operator $P$. To highlight this relation, we will adopt the following notation:

  • Action-value (or Q-)function.

    $$q(P) = \left(\mathrm{Id}-\gamma TP\right)^{-1} r. \tag{A.5}$$
  • Value function.

    $$v(P) = P\,q(P). \tag{A.6}$$
  • Cumulative reward. The RL objective functional

    $$J(P) = \left\langle P\left(\mathrm{Id}-\gamma TP\right)^{-1} r,\nu\right\rangle = \left\langle P\,q(P),\nu\right\rangle = \left\langle v(P),\nu\right\rangle. \tag{A.7}$$
  • State visitation (or state occupancy) measure. By the characterization of the adjoints of $P$ and $T$ (see the discussion in Sec. 3), we can represent the evolution of a state distribution $\nu_t$ at time $t$ to the next state distribution as $\nu_{t+1}=T^*P^*\nu_t$. Applying this relation recursively, we recover the state visitation probability associated to the starting state distribution $\nu_0=\nu\in\mathcal{M}_1^+(\mathcal{X})$, the MDP with transition $T$ and the policy $P$ as

    $$d_\nu(P) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t (T^*P^*)^t\nu = (1-\gamma)\left(\mathrm{Id}-\gamma PT\right)^{-*}\nu, \tag{A.8}$$

    where the weights $(1-\gamma)\gamma^t$ sum to one, so that the series is a convex combination of the probability distributions $\nu_t$; this guarantees that $d_\nu(P)$ is well defined (namely, it belongs to $\mathcal{M}_1^+(\mathcal{X})$).

Previous well-known RL results in operator form

Under the operatorial formulation of RL, we can recover several well-known results from the reinforcement learning literature with concise proofs. We recall here a few of these results that will be useful in the following.

Remark A.3.

Algebraic manipulation of the cumulative expected reward $J(P)$ implies

$$J(P) = \left\langle P\left(\mathrm{Id}-\gamma TP\right)^{-1} r,\nu\right\rangle = \left\langle Pr,\left(\mathrm{Id}-\gamma PT\right)^{-*}\nu\right\rangle = \frac{1}{1-\gamma}\left\langle Pr,d_\nu(P)\right\rangle,$$

where we used Lemma A.1 and $d_\nu(P)$ is the state visitation distribution starting from $\nu$ and following the policy $P$.
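For a finite MDP, the quantities (A.5) to (A.8) and the identity in Remark A.3 reduce to a few linear solves. The sketch below is a minimal illustration under assumed conventions: $T$ is an $(nk)\times n$ row-stochastic matrix with $(Tf)(x,a)=\sum_{x'}\tau(x'|x,a)f(x')$, $P$ is an $n\times(nk)$ matrix encoding the policy, and pairings $\langle\cdot,\cdot\rangle$ are dot products. The random MDP, the helper names, and the values of `n`, `k`, `gamma` are hypothetical choices made here.

```python
import numpy as np

def random_stochastic(rng, rows, cols):
    """Random matrix with non-negative rows summing to one."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

def policy_operator(pi):
    """Matrix of P with (P f)(x) = sum_a pi[x, a] * f[x * k + a]."""
    n, k = pi.shape
    P = np.zeros((n, n * k))
    for x in range(n):
        P[x, x * k:(x + 1) * k] = pi[x]
    return P

rng = np.random.default_rng(0)
n, k, gamma = 5, 3, 0.9
T = random_stochastic(rng, n * k, n)              # transfer operator: functions on X -> functions on Omega
P = policy_operator(random_stochastic(rng, n, k)) # policy operator: functions on Omega -> functions on X
r = rng.random(n * k)                             # reward, a function on Omega
nu = np.full(n, 1.0 / n)                          # uniform initial state distribution

q = np.linalg.solve(np.eye(n * k) - gamma * T @ P, r)                 # (A.5)
v = P @ q                                                             # (A.6)
J = float(v @ nu)                                                     # (A.7)
d = (1 - gamma) * np.linalg.solve((np.eye(n) - gamma * P @ T).T, nu)  # (A.8), adjoint = transpose
J_via_d = float((P @ r) @ d) / (1 - gamma)                            # Remark A.3
print(J, J_via_d, d.sum())
```

The transpose in the computation of `d` plays the role of the adjoint in $(\mathrm{Id}-\gamma PT)^{-*}$, and the resulting vector is a probability distribution over states, consistent with the convex-combination argument.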

The following result, known as Performance Difference Lemma (see e.g. [11, Lemma 1.16]), will be instrumental to prove the convergence rates for PMD in Thm. 7.

Lemma A.4 (Performance difference).

Let $P_1$, $P_2$ be two policy operators. The following equality holds:

$$J(P_1)-J(P_2) = \frac{1}{1-\gamma}\left\langle (P_1-P_2)\,q(P_2),d_\nu(P_1)\right\rangle. \tag{A.9}$$
Proof.

Using the definition of $J(P_1)$ and Lemma A.1 one gets

$$\begin{aligned} J(P_1)-J(P_2) &= \left\langle P_1\left(\mathrm{Id}-\gamma TP_1\right)^{-1}r,\nu\right\rangle - \left\langle P_2\left(\mathrm{Id}-\gamma TP_2\right)^{-1}r,\nu\right\rangle \\ &= \left\langle \left(\mathrm{Id}-\gamma P_1T\right)^{-1}P_1 r,\nu\right\rangle - \left\langle P_2\left(\mathrm{Id}-\gamma TP_2\right)^{-1}r,\nu\right\rangle \\ &= \left\langle \left(\mathrm{Id}-\gamma P_1T\right)^{-1}P_1\left(\mathrm{Id}-\gamma TP_2\right)\left(\mathrm{Id}-\gamma TP_2\right)^{-1}r,\nu\right\rangle \\ &\quad - \left\langle \left(\mathrm{Id}-\gamma P_1T\right)^{-1}\left(\mathrm{Id}-\gamma P_1T\right)P_2\left(\mathrm{Id}-\gamma TP_2\right)^{-1}r,\nu\right\rangle \\ &= \left\langle \left(\mathrm{Id}-\gamma P_1T\right)^{-1}\left[P_1\left(\mathrm{Id}-\gamma TP_2\right)-\left(\mathrm{Id}-\gamma P_1T\right)P_2\right]\left(\mathrm{Id}-\gamma TP_2\right)^{-1}r,\nu\right\rangle \\ &= \left\langle \left(\mathrm{Id}-\gamma P_1T\right)^{-1}\left[P_1-P_2\right]\left(\mathrm{Id}-\gamma TP_2\right)^{-1}r,\nu\right\rangle \\ &= \frac{1}{1-\gamma}\left\langle (P_1-P_2)\,q(P_2),d_\nu(P_1)\right\rangle. \end{aligned}$$
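The performance difference identity (A.9) can also be checked numerically on a small random MDP. As before, this is only a sketch under assumed finite-dimensional conventions (row-stochastic matrices for $T$ and $P$, dot products for the pairings); the helper names `random_stochastic`, `policy_operator` and `evaluate`, as well as the problem sizes, are hypothetical.

```python
import numpy as np

def random_stochastic(rng, rows, cols):
    """Random matrix with non-negative rows summing to one."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

def policy_operator(pi):
    """Matrix of P with (P f)(x) = sum_a pi[x, a] * f[x * k + a]."""
    n, k = pi.shape
    P = np.zeros((n, n * k))
    for x in range(n):
        P[x, x * k:(x + 1) * k] = pi[x]
    return P

def evaluate(P, T, r, nu, gamma):
    """Return q(P), J(P) and d_nu(P) following (A.5), (A.7) and (A.8)."""
    n = nu.shape[0]
    q = np.linalg.solve(np.eye(r.shape[0]) - gamma * T @ P, r)
    J = float((P @ q) @ nu)
    d = (1 - gamma) * np.linalg.solve((np.eye(n) - gamma * P @ T).T, nu)
    return q, J, d

rng = np.random.default_rng(1)
n, k, gamma = 6, 4, 0.95
T = random_stochastic(rng, n * k, n)
r = rng.random(n * k)
nu = random_stochastic(rng, 1, n)[0]              # random initial state distribution
P1 = policy_operator(random_stochastic(rng, n, k))
P2 = policy_operator(random_stochastic(rng, n, k))

q1, J1, d1 = evaluate(P1, T, r, nu, gamma)
q2, J2, d2 = evaluate(P2, T, r, nu, gamma)
lhs = J1 - J2                                            # left-hand side of (A.9)
rhs = float(((P1 - P2) @ q2) @ d1) / (1 - gamma)         # right-hand side of (A.9)
print(lhs, rhs)
```

Note the asymmetry of the formula: the action-value function is that of $P_2$, while the visitation measure is that of $P_1$.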

A direct consequence of the operator formulation of the performance difference lemma is the following operator-based characterization of the differential behavior of the RL objective. The result can be found in [11] for the case of finite state and action spaces; here, however, the operatorial formulation allows for a much more concise proof.

Corollary A.5 (Directional derivatives).

For any two Markov operators $P_1, P_2 : B_b(\mathcal{X}) \to B_b(\Omega)$, the directional derivative of $J$ at $P_1$ towards $P_2$ is

$$\lim_{h\to 0}\frac{J(P_1 + h(P_2 - P_1)) - J(P_1)}{h} = \frac{1}{1-\gamma}\left\langle (P_2 - P_1)\,q(P_1),\, d_\nu(P_1)\right\rangle. \tag{A.10}$$
Proof.

The result follows by recalling that the space of Markov operators is convex, namely for any $h\in[0,1]$ the term $P_h = P_1 + h(P_2 - P_1)$ is still a Markov operator. Therefore, we can apply Lemma A.4 to obtain

$$J(P_h) - J(P_1) = \frac{1}{1-\gamma}\left\langle (P_h - P_1)\,q(P_1),\, d_\nu(P_h)\right\rangle \tag{A.11}$$
$${} = \frac{h}{1-\gamma}\left\langle (P_2 - P_1)\,q(P_1),\, d_\nu(P_h)\right\rangle. \tag{A.12}$$

We can therefore divide the above quantity by $h$ and send $h\to 0$. The result follows by observing that $d_\nu(P_h) = (Id - \gamma T P_h)^{-*}\nu \to (Id - \gamma T P_1)^{-*}\nu = d_\nu(P_1)$ as $h\to 0$, since $\|P_h\| = 1$ for any $h\in[0,1]$ and the function $M \mapsto (Id - \gamma T M)^{-1}$ is continuous on the open ball of radius $1/\gamma > 1$ in $\mathcal{L}(B_b(\Omega), B_b(\mathcal{X}))$ with respect to the operator norm. ∎
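As a sanity check, Corollary A.5 can be tested numerically in the finite setting, where all operators become matrices. The following is a minimal sketch under stated assumptions: the MDP is random, the names `policy_operator`, `q`, `J` and `d_nu` are illustrative, state-action pairs are flattened as `s * n_actions + a`, and $d_\nu(P)$ is taken to be $(1-\gamma)(Id - \gamma P T)^{-*}\nu$, the normalization that matches the last step in the proof of Lemma A.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.5
n_pairs = n_states * n_actions  # size of Omega = X x A

# Transition operator T: a row-stochastic (n_pairs, n_states) matrix.
T = rng.random((n_pairs, n_states))
T /= T.sum(axis=1, keepdims=True)
r = rng.random(n_pairs)                 # reward on state-action pairs
nu = np.full(n_states, 1.0 / n_states)  # initial state distribution


def policy_operator(pi):
    """Embed a policy table pi[s, a] as a (n_states, n_pairs) Markov matrix."""
    P = np.zeros((n_states, n_pairs))
    for s in range(n_states):
        P[s, s * n_actions:(s + 1) * n_actions] = pi[s]
    return P


def random_policy():
    pi = rng.random((n_states, n_actions))
    return pi / pi.sum(axis=1, keepdims=True)


def q(P):
    """Action-value function q(P) = (Id - gamma T P)^{-1} r."""
    return np.linalg.solve(np.eye(n_pairs) - gamma * T @ P, r)


def J(P):
    """Objective J(P) = <P q(P), nu>."""
    return nu @ (P @ q(P))


def d_nu(P):
    """Assumed normalization: (1 - gamma) (Id - gamma P T)^{-*} nu."""
    A = np.eye(n_states) - gamma * P @ T
    return (1 - gamma) * np.linalg.solve(A.T, nu)


P1 = policy_operator(random_policy())
P2 = policy_operator(random_policy())

# Right-hand side of (A.10).
analytic = (d_nu(P1) @ ((P2 - P1) @ q(P1))) / (1 - gamma)

# Forward finite difference along the segment P_h = P1 + h (P2 - P1).
h = 1e-6
finite_diff = (J(P1 + h * (P2 - P1)) - J(P1)) / h

print(f"analytic = {analytic:.6f}, finite difference = {finite_diff:.6f}")
```

The two printed values should agree up to the $O(h)$ error of the forward difference, which reflects the fact that $d_\nu(P_h)$ varies continuously with $h$.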

Properties of $(Id - \gamma P T)^{-1}$

The quantity $(Id - \gamma P T)^{-1}$ (note, not $(Id - \gamma T P)^{-1}$) plays a central role in the study of POWR. We prove here a few properties that will be useful in the following.

Lemma A.6 (Properties of $Id - \gamma P T$).

The following facts are true:

  1. For any $f \in B_b(\mathcal{X})$ with $f \geq 0$ it holds $(Id - \gamma P T)^{-1} f \geq f$.

  2. The operator $(1-\gamma)(Id - \gamma P T)^{-1}$ is a Markov operator.

  3. For any positive measure $\nu \in \mathcal{M}(B_b(\mathcal{X}))$ it holds $(1-\gamma)\left\|(Id - \gamma P T)^{-*}\nu\right\|_{\rm TV} = \left\|\nu\right\|_{\rm TV}$.

  4. For any positive measure $\nu \in \mathcal{M}(B_b(\mathcal{X}))$ it holds $\left\|P^{*}\nu\right\|_{\rm TV} = \left\|\nu\right\|_{\rm TV}$.

  5. For any bounded linear operator $X$, policy operator $P$ and discount factor $\gamma < 1/\left\|X\right\|$, it holds $\bigl\|(Id - \gamma X P)^{-1}\bigr\|_{\infty} \leq 1/(1 - \gamma\left\|X\right\|)$.

Proof.

Since both $T$ and $P$ are Markov operators by construction, it immediately follows that their composition is a Markov operator as well. Using the Neumann series representation of $(Id - \gamma P T)^{-1}$ it follows that for all $f \in B_b(\mathcal{X})$ with $f \geq 0$

$$(Id - \gamma P T)^{-1} f = \sum_{t=0}^{\infty}\gamma^{t}(P T)^{t} f = f + \sum_{t=1}^{\infty}\gamma^{t}(P T)^{t} f \geq f \geq 0,$$

proving (1). Further,

$$(Id - \gamma P T)^{-1}\mathbf{1}_{\mathcal{X}} = \sum_{t=0}^{\infty}\gamma^{t}(P T)^{t}\mathbf{1}_{\mathcal{X}} = \sum_{t=0}^{\infty}\gamma^{t}\mathbf{1}_{\mathcal{X}} = \frac{\mathbf{1}_{\mathcal{X}}}{1-\gamma}$$

showing that $(1-\gamma)(Id - \gamma P T)^{-1}\mathbf{1}_{\mathcal{X}} = \mathbf{1}_{\mathcal{X}}$ and proving (2). Finally, since $(1-\gamma)(Id - \gamma P T)^{-1}$ and $P$ are Markov operators, (3) and (4) follow from the direct application of [25, Theorem 19.2]. For the last point (5), let $f \in B_b(\Omega)$. As $P$ is a conditional expectation operator, it holds that

$$\left\|X P f\right\|_{\infty} \leq \left\|X\right\| P(\mathbf{1}_{\Omega}\left\|f\right\|_{\infty}) = \left\|X\right\|\left\|f\right\|_{\infty}$$

where the inequality is just the conditional version of Jensen's inequality [41, Chapter 5] applied to the (convex) function $\left\|\cdot\right\|_{\infty}$, while the equality comes from the fact that $TP$ is a Markov operator. Then, we have

$$\begin{split}
\sup_{\|f\|_{\infty}=1}\bigl\|(Id - \gamma X P)^{-1} f\bigr\|_{\infty} &= \sup_{\|f\|_{\infty}=1}\Bigl\|\sum_{t=0}^{\infty}(\gamma X P)^{t} f\Bigr\|_{\infty}\\
&\leq \sup_{\|f\|_{\infty}=1}\sum_{t=0}^{\infty}\gamma^{t}\bigl\|(X P)^{t} f\bigr\|_{\infty}\\
&\leq \sup_{\|f\|_{\infty}=1}\sum_{t=0}^{\infty}\gamma^{t}\|X\|^{t}\|f\|_{\infty}\\
(\|f\|_{\infty}=1)\quad &= \frac{1}{1-\gamma\|X\|}.
\end{split}$$
∎
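The five properties of Lemma A.6 can be checked on a small finite example, where $T$ and $P$ are row-stochastic matrices and the TV norm of a positive measure is its total mass. This is only an illustrative sketch: the random MDP, the flattening of state-action pairs as `s * n_actions + a`, and the variable names are assumptions, and the matrix $X$ in item 5 is rescaled so that $\gamma\|X\| < 1$, which is what makes the Neumann series converge.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.5
n_pairs = n_states * n_actions

# Row-stochastic transition operator T of shape (n_pairs, n_states).
T = rng.random((n_pairs, n_states))
T /= T.sum(axis=1, keepdims=True)

# Policy operator P of shape (n_states, n_pairs) built from a random policy table.
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)
P = np.zeros((n_states, n_pairs))
for s in range(n_states):
    P[s, s * n_actions:(s + 1) * n_actions] = pi[s]

R = np.linalg.inv(np.eye(n_states) - gamma * P @ T)  # (Id - gamma P T)^{-1}
tol = 1e-12

# (1) R f >= f for every nonnegative f.
f = rng.random(n_states)
item1 = bool(np.all(R @ f >= f - tol))

# (2) (1 - gamma) R is Markov: nonnegative entries, rows summing to one.
K = (1 - gamma) * R
item2 = bool(np.all(K >= -tol) and np.allclose(K.sum(axis=1), 1.0))

# (3) and (4): total mass of a positive measure nu is preserved.
nu = rng.random(n_states)
mass = np.abs(nu).sum()
item3 = bool(np.isclose((1 - gamma) * np.abs(R.T @ nu).sum(), mass))
item4 = bool(np.isclose(np.abs(P.T @ nu).sum(), mass))

# (5) Resolvent bound for an arbitrary X with gamma * ||X|| < 1.
X = rng.standard_normal((n_pairs, n_states))
X *= 1.5 / np.linalg.norm(X, ord=np.inf)  # induced sup-norm ||X|| = 1.5
lhs = np.linalg.norm(np.linalg.inv(np.eye(n_pairs) - gamma * X @ P), ord=np.inf)
rhs = 1.0 / (1.0 - gamma * np.linalg.norm(X, ord=np.inf))
item5 = bool(lhs <= rhs + tol)

print(item1, item2, item3, item4, item5)
```

For matrices, `np.linalg.norm(A, ord=np.inf)` is the maximum absolute row sum, that is, the operator norm induced by $\|\cdot\|_\infty$ on functions, which is the norm used in item 5.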

Simulation Lemma

We report here the Simulation lemma, since it will be key to bridging the gap between Policy Mirror Descent and Conditional Mean Embeddings in Thm. 9 through Lemma 8.

Lemma A.7 (Simulation Lemma [26, Lemma 2.2]).

Let $\gamma>0$ and let $T_1$, $T_2$ be two linear operators with operator norm strictly less than $1/\gamma$. Let $P$ be a policy operator. Denote by $q(P,T) = (Id - \gamma T P)^{-1} r$ the (generalized) action-value function associated to these terms and by $v(P,T) = P q(P,T)$ the corresponding value function. Then the following equality holds

$$q(P,T_1) - q(P,T_2) = \gamma\left(Id - \gamma T_1 P\right)^{-1}\left(T_1 - T_2\right) v(P,T_2). \tag{A.13}$$

Proof.

Using the same technique as in the proof of Lemma A.4, one has

$$\begin{split}
q(P,T_1) - q(P,T_2) &= \left(Id - \gamma T_1 P\right)^{-1} r - \left(Id - \gamma T_2 P\right)^{-1} r\\
&= \gamma\left(Id - \gamma T_1 P\right)^{-1}\left(T_1 - T_2\right) P \left(Id - \gamma T_2 P\right)^{-1} r\\
&= \gamma\left(Id - \gamma T_1 P\right)^{-1}\left(T_1 - T_2\right) v(P,T_2),
\end{split}$$

where for the second equality we have used the fact that for any two invertible operators $M$ and $N$ it holds $M^{-1} - N^{-1} = M^{-1}(N - M)N^{-1}$, here with $M = Id - \gamma T_1 P$ and $N = Id - \gamma T_2 P$, so that $N - M = \gamma(T_1 - T_2)P$, and applied the operatorial characterization of the value function to conclude the proof. ∎

We then have the following result, which is a generalization of the standard Simulation lemma in [26, Lemma 2.2] where we also allow the reward function to vary.

Corollary A.8.

Let $\gamma>0$ and let $T_1$, $T_2$ be two linear operators with operator norm strictly less than $1/\gamma$. Let $r_1$ and $r_2$ be two reward functions and $P$ a policy operator. Denote by $q(P,T,r) = (Id - \gamma T P)^{-1} r$ the (generalized) action-value function associated to these terms and by $v(P,T,r) = P q(P,T,r)$ the corresponding value function. Then the following equality holds

$$q(P,T_1,r_1) - q(P,T_2,r_2) = (Id - \gamma T_1 P)^{-1}(r_1 - r_2) + \gamma(Id - \gamma T_1 P)^{-1}(T_1 - T_2)\, v(P,T_2,r_2).$$
Proof.

The difference between action-value functions can be written as

$$\begin{aligned}
q(P,T_1,r_1) - q(P,T_2,r_2) &= (Id - \gamma T_1 P)^{-1} r_1 - (Id - \gamma T_2 P)^{-1} r_2\\
&= (Id - \gamma T_1 P)^{-1}(r_1 - r_2) + \left[(Id - \gamma T_1 P)^{-1} - (Id - \gamma T_2 P)^{-1}\right] r_2
\end{aligned}$$

where we added and removed a term $(Id - \gamma T_1 P)^{-1} r_2$. The result follows by plugging in the Simulation Lemma A.7 for the second term of the right hand side. ∎
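Both identities are purely algebraic, so they can be verified to machine precision on a random finite MDP. The sketch below is an illustration under assumptions (random row-stochastic $T_1$, $T_2$, random rewards, hypothetical helper names `q`, `v`, `resolvent_T1`, and pairs flattened as `s * n_actions + a`). It checks the identity $q(P,T_1) - q(P,T_2) = \gamma(Id - \gamma T_1 P)^{-1}(T_1 - T_2)\,v(P,T_2)$, with the difference written as $T_1 - T_2$ as obtained from the resolvent identity $M^{-1} - N^{-1} = M^{-1}(N - M)N^{-1}$, together with its counterpart for varying rewards.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 2, 0.9
n_pairs = n_states * n_actions


def random_stochastic(rows, cols):
    """Random matrix with nonnegative entries and rows summing to one."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)


# Two transition models and two reward functions on state-action pairs.
T1 = random_stochastic(n_pairs, n_states)
T2 = random_stochastic(n_pairs, n_states)
r1, r2 = rng.random(n_pairs), rng.random(n_pairs)

# Policy operator P of shape (n_states, n_pairs).
pi = random_stochastic(n_states, n_actions)
P = np.zeros((n_states, n_pairs))
for s in range(n_states):
    P[s, s * n_actions:(s + 1) * n_actions] = pi[s]


def q(T, r):
    """Generalized action-value function (Id - gamma T P)^{-1} r."""
    return np.linalg.solve(np.eye(n_pairs) - gamma * T @ P, r)


def v(T, r):
    """Corresponding value function P q(P, T, r)."""
    return P @ q(T, r)


def resolvent_T1(y):
    """Apply (Id - gamma T1 P)^{-1} to a function y on state-action pairs."""
    return np.linalg.solve(np.eye(n_pairs) - gamma * T1 @ P, y)


# Simulation lemma: same reward r1 under both models.
lhs_lemma = q(T1, r1) - q(T2, r1)
rhs_lemma = gamma * resolvent_T1((T1 - T2) @ v(T2, r1))

# Corollary: the reward is allowed to change as well.
lhs_cor = q(T1, r1) - q(T2, r2)
rhs_cor = resolvent_T1(r1 - r2) + gamma * resolvent_T1((T1 - T2) @ v(T2, r2))

err_lemma = float(np.max(np.abs(lhs_lemma - rhs_lemma)))
err_cor = float(np.max(np.abs(lhs_cor - rhs_cor)))
print(f"lemma residual = {err_lemma:.2e}, corollary residual = {err_cor:.2e}")
```

Both residuals should be at the level of floating-point round-off. Taking norms on both sides then bounds the error on $q$ in terms of $\|T_1 - T_2\|$ and $\|r_1 - r_2\|_\infty$.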

The corollary above will be useful in Appendix C to control the approximation error of the estimates $\hat{q}_t$ appearing in the convergence rates for inexact PMD in Thm. 7.

A.4 Action-value Estimator for $(\mathcal{G},\mathcal{F})$-compatible Policies

We can leverage the notation introduced in this section to prove the following form for the world model-based estimator of the action-value function.

See 3

Proof.

By hypothesis

$$\hat{q}_{\pi} = (Id - \gamma T_n P_{\pi})^{-1} r_n = (Id - \gamma S_n^{*} B Z_n P_{\pi})^{-1} S_n^{*} b. \tag{A.14}$$

Eq. 11 follows by applying Lemma A.1. Eq. 12 can be verified by direct calculation. Denote by $(e_i)_{i=1}^{n}$ the vectors of the canonical basis in $\mathbb{R}^{n}$. Then, for any $i,j=1,\dots,n$

$$(M_{\pi n})_{ij} = \left\langle e_i, M_{\pi n} e_j\right\rangle = \left\langle e_i, Z_n P_{\pi} S_n^{*} e_j\right\rangle = \left\langle Z_n^{*} e_i, P_{\pi} S_n^{*} e_j\right\rangle. \tag{A.15}$$

Now, we recall that the two operators $S_n : \mathcal{G} \to \mathbb{R}^{n}$ and $Z_n : \mathcal{F} \to \mathbb{R}^{n}$ are the evaluation operators for, respectively, the points $(x_i,a_i)_{i=1}^{n}$ and $(x_i')_{i=1}^{n}$. Consequently, their adjoints satisfy, for any vector $v \in \mathbb{R}^{n}$,

$$S_n^{*} v = \sum_{i=1}^{n} v_i\, \psi(x_i,a_i) \qquad\text{and}\qquad Z_n^{*} v = \sum_{i=1}^{n} v_i\, \varphi(x_i'). \tag{A.16}$$

This implies that

$$(M_{\pi n})_{ij} = \left\langle Z_n^{*} e_i, P_{\pi} S_n^{*} e_j\right\rangle = \left\langle \varphi(x_i'), P_{\pi}\psi(x_j,a_j)\right\rangle. \tag{A.17}$$

Since $P_{\pi}$ is $(\mathcal{G},\mathcal{F})$-compatible by hypothesis, we can leverage the same reasoning used in Prop. 2 to show that

$$(P_{\pi}|_{\mathcal{G}})^{*}\varphi(x') = \int_{\mathcal{A}} \psi(x',a)\, \pi(da|x') \tag{A.18}$$

for any $x^{\prime}\in\mathcal{X}$. By plugging this equation in the previous characterization for $(M_{\pi n})_{ij}$ we have

\begin{align}
\left\langle \varphi(x_{i}^{\prime}),\,P_{\pi}\psi(x_{j},a_{j})\right\rangle &=\left\langle P_{\pi}^{*}\varphi(x_{i}^{\prime}),\,\psi(x_{j},a_{j})\right\rangle \tag{A.19}\\
&=\left\langle \int_{\mathcal{A}}\psi(x_{i}^{\prime},a)\,\pi(da|x_{i}^{\prime}),\,\psi(x_{j},a_{j})\right\rangle \tag{A.20}\\
&=\int_{\mathcal{A}}\left\langle \psi(x_{i}^{\prime},a),\,\psi(x_{j},a_{j})\right\rangle\,\pi(da|x_{i}^{\prime}), \tag{A.21}
\end{align}

as required. ∎
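For a finite action set the integral in A.21 reduces to a sum over actions, so the matrix can be assembled directly from kernel evaluations. The following is a minimal NumPy sketch under assumptions that are not taken from the text: the inner product is assumed to factorize as $\langle\psi(x,a),\psi(y,b)\rangle = k_{\mathcal{X}}(x,y)\,\mathbb{1}[a=b]$ with a Gaussian state kernel, and the names `gaussian_kernel`, `policy_matrix` and `policy` are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Hypothetical state kernel k_X(x, y); any positive-definite kernel would do.
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def policy_matrix(next_states, states, actions, policy, sigma=1.0):
    """Entries M[i, j] = sum_a <psi(x_i', a), psi(x_j, a_j)> pi(a | x_i').

    Assumes <psi(x, a), psi(y, b)> = k_X(x, y) * 1[a == b], so only the
    term a = a_j survives in the sum over actions.
    policy(x) must return a probability vector over the finite action set.
    """
    n = len(states)
    M = np.zeros((n, n))
    for i in range(n):
        probs = policy(next_states[i])          # pi(. | x_i')
        for j in range(n):
            k_ij = gaussian_kernel(next_states[i], states[j], sigma)
            M[i, j] = k_ij * probs[actions[j]]  # kernel weight times pi(a_j | x_i')
    return M
```

With a different action kernel the inner loop would sum over all actions instead of picking out the single entry `probs[actions[j]]`.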

A.5 Separable Spaces

We show here the sufficient condition for $(\mathcal{G},\mathcal{F})$-compatibility of a policy in the case of the separable spaces introduced in Sec. 4.

See 4

Proof.

The proposition follows from observing that for any $v\in\mathbb{R}^{|\mathcal{A}|}$ and $h\in\mathcal{H}$, applying $P_{\pi}$ according to 6 to the function $g(x,a)=\left\langle h,\phi(x)\right\rangle\left\langle v,De_{a}\right\rangle$ yields

\begin{align}
(P_{\pi}g)(x)=\sum_{a\in\mathcal{A}}g(x,a)\,\pi(a|x) &=\left\langle h,\phi(x)\right\rangle\sum_{a\in\mathcal{A}}\left\langle v,De_{a}\right\rangle\pi(a|x) \tag{A.22}\\
&=\left\langle h,\phi(x)\right\rangle\sum_{a\in\mathcal{A}}\left\langle v,De_{a}\right\rangle\left\langle p_{a},\phi(x)\right\rangle \tag{A.23}\\
&=\left\langle h\otimes\sum_{a\in\mathcal{A}}\left\langle v,De_{a}\right\rangle p_{a},\ \phi(x)\otimes\phi(x)\right\rangle \tag{A.24}
\end{align}

Hence $(P_{\pi}g)(x)=\left\langle f,\varphi(x)\right\rangle$ with $f=h\otimes h^{\prime}\in\mathcal{H}\otimes\mathcal{H}=\mathcal{F}$ and $h^{\prime}=\sum_{a\in\mathcal{A}}\left\langle v,De_{a}\right\rangle p_{a}\in\mathcal{H}$. Therefore, the restriction of $P_{\pi}$ to $\mathcal{G}$ takes values in $\mathcal{F}$, as desired. ∎
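The chain A.22 to A.24 can be checked numerically in a finite-dimensional instance. The sketch below assumes a tabular setting that the text does not prescribe: $\phi(x)$ is the one-hot vector of state $x$, so that $\pi(a|x)=\langle p_a,\phi(x)\rangle$ holds with $p_a$ the $a$-th column of the policy table. The matrix `D` and the vectors `h`, `v` are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Tabular assumption: phi(x) is the one-hot vector of state x, so that
# pi(a | x) = <p_a, phi(x)> with p_a the a-th column of the policy table.
Phi = np.eye(n_states)
Pi = rng.random((n_states, n_actions))
Pi /= Pi.sum(axis=1, keepdims=True)          # rows are probability vectors

D = rng.normal(size=(n_actions, n_actions))  # arbitrary action feature map a -> D e_a
h = rng.normal(size=n_states)
v = rng.normal(size=n_actions)

w = D.T @ v                # w[a] = <v, D e_a>
h_prime = Pi @ w           # h' = sum_a <v, D e_a> p_a
f = np.outer(h, h_prime)   # f = h (tensor) h'

def P_pi_g(x):
    # Left-hand side of A.22: sum_a g(x, a) pi(a | x).
    return sum((h @ Phi[x]) * w[a] * Pi[x, a] for a in range(n_actions))

def inner_f_varphi(x):
    # Right-hand side of A.24: <f, phi(x) (tensor) phi(x)>.
    return Phi[x] @ f @ Phi[x]

# Largest discrepancy between the two sides over all states.
max_gap = max(abs(P_pi_g(x) - inner_f_varphi(x)) for x in range(n_states))
```

Up to floating-point error, `max_gap` should vanish, which is the content of the identity $(P_{\pi}g)(x)=\langle f,\varphi(x)\rangle$ in this special case.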

Appendix B Policy Mirror Descent

In this section we briefly review the tools needed to formulate the PMD method and discuss the convergence rates for inexact PMD. Most of the discussion follows the presentation in [12] formulated within the notation used in this work.

Let $D:\Delta(\mathcal{A})\times\text{rint}\,\Delta(\mathcal{A})\to\mathbb{R}$ be a Bregman divergence [31, Definition 9.2] over the probability simplex, where $\text{rint}\,\Delta(\mathcal{A})$ denotes the relative interior of $\Delta(\mathcal{A})$. In the following, for any $t\in\mathbb{N}$ we will denote by $\pi_{t}$ the policy produced at iteration $t$ by a PMD algorithm according to the update 2 (with either the exact action-value function or an estimator, as discussed in Sec. 3) with divergence $D$ and step-size $\eta>0$. We denote by $P_{t}=P_{\pi_{t}}$ the associated operator. We recall here the PMD update from 2, highlighting the dependency on the policy operator $P_{t}$ via the action-value function $q(P_{t})$.

$$\pi_{t+1}(\cdot|x)\in\operatorname*{argmin}_{p\in\Delta(\mathcal{A})}\left\{-\eta\sum_{a\in\mathcal{A}}q(P_{t})(x,a)\,p_{a}+D(p;\pi_{t}(\cdot|x))\right\}\quad\text{for all }x\in\mathcal{X}. \tag{B.1}$$
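When $D$ is the Kullback-Leibler divergence and the state and action sets are finite, the minimizer of B.1 has the well-known closed form $\pi_{t+1}(a|x)\propto\pi_{t}(a|x)\exp(\eta\,q(P_{t})(x,a))$. The sketch below illustrates this special case only; the function name `pmd_kl_step` and the table layout are assumptions made for the example.

```python
import numpy as np

def pmd_kl_step(pi_t, q, eta):
    """One PMD step with KL divergence on a finite state-action table.

    pi_t: array (n_states, n_actions), rows strictly inside the simplex.
    q:    array (n_states, n_actions), action values (exact or estimated).
    Returns pi_{t+1}(a|x) proportional to pi_t(a|x) * exp(eta * q(x, a)).
    """
    logits = np.log(pi_t) + eta * q
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    new = np.exp(logits)
    return new / new.sum(axis=1, keepdims=True)
```

Working with log-probabilities and subtracting the row-wise maximum leaves the normalized result unchanged while avoiding overflow for large $\eta$ or large action values.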

While this point-wise characterization is sufficient to define the updated policy $\pi_{t+1}:\mathcal{X}\to\Delta(\mathcal{A})$ from the previous $\pi_{t}$ and its action-value function $q(P_{t})$, we need to guarantee that $\pi_{t+1}$ is measurable. If that were not the case, we would not be able to guarantee the existence of a $P_{t+1}$ associated with it, possibly affecting the well-definedness of iteratively applying the mirror descent update B.1. The following result addresses this issue.

Lemma B.1 (Measurability of the Mirror Descent updates).

Let $D:\Delta(\mathcal{A})\times\text{rint}\,\Delta(\mathcal{A})\to\mathbb{R}$ be a Bregman divergence continuous in its first argument. There exists a measurable policy $\pi_{t+1}:\mathcal{X}\to\Delta(\mathcal{A})$ that satisfies B.1 for all $x\in\mathcal{X}$.

Proof.

The proof follows from the Measurable Maximum Theorem [25, Theorem 18.19]. Let us denote by $f_{t}:\mathcal{X}\times\Delta(\mathcal{A})\to\mathbb{R}$ the function

$$f_{t}(x,p):=-\eta\sum_{a\in\mathcal{A}}q(P_{t})(x,a)\,p_{a}+D(p;\pi_{t}(x)). \tag{B.2}$$

Let also $\kappa:\mathcal{X}\twoheadrightarrow\Delta(\mathcal{A})$ be the constant correspondence $x\mapsto\Delta(\mathcal{A})$ for all $x\in\mathcal{X}$. Clearly $\kappa$ has nonempty compact values, and it is also weakly measurable since, for any open set $G\subset\Delta(\mathcal{A})$, its lower inverse $\kappa^{\ell}(G):=\{x\in\mathcal{X}:\kappa(x)\cap G\neq\varnothing\}=\mathcal{X}$ belongs to the Borel sigma-algebra of $\mathcal{X}$. Finally, since $q(P_{t})\in B_{b}(\Omega)$ and, by assumption, $D$ is continuous in its first argument, $f_{t}$ is a Carathéodory function. Then, by [25, Theorem 18.19], the correspondence of minimizers $\mu:\mathcal{X}\twoheadrightarrow\Delta(\mathcal{A})$ defined as

$$\mu(x):=\left\{p_{*}\in\kappa(x):f_{t}(x,p_{*})=\min_{p\in\kappa(x)}f_{t}(x,p)\right\}$$

admits a measurable selector, which we denote by $\pi_{t+1}:\mathcal{X}\to\Delta(\mathcal{A})$, proving the statement of the Lemma. ∎

The previous Lemma is the key technical step enabling us to extend the convergence rates of Mirror Descent proved in [12] to non-tabular settings. We now state and prove a few Lemmas instrumental to proving Thm. 7.

Lemma B.2 (Three-points lemma).

Let $\pi_{t+1}:\mathcal{X}\to\Delta(\mathcal{A})$ be a measurable minimizer of B.2 and $P_{t+1}$ its associated operator. For every measurable policy $\pi:\mathcal{X}\to\Delta(\mathcal{A})$ (alongside its associated operator $P$) it holds

$$\eta\left[(P_{t+1}-P)q(P_{t})\right](x)\geq D(\pi(x);\pi_{t+1}(x))-D(\pi(x);\pi_{t}(x)) \tag{B.3}$$
Proof.

The function $f_{t}(x,p)$ in B.2 is convex and differentiable in $p$, as it is the sum of a linear function and a (strictly convex) Bregman divergence. By the first-order optimality condition [31, Corollary 3.68], a minimizer $p_{*}$ of $f_{t}(x,\cdot)$ satisfies $\left\langle\nabla f_{t}(x,p_{*}),p-p_{*}\right\rangle\geq 0$ for all $p\in\Delta(\mathcal{A})$, where the gradient is taken with respect to $p$. In particular, taking $p=\pi(x)$,

$$\left\langle\nabla f_{t}(x,p_{*}),\,\pi(x)-p_{*}\right\rangle\geq 0. \tag{B.4}$$

Since $\pi_{t+1}(x)$ is a minimizer of $f_{t}(x,\cdot)$ by assumption, letting $p_{*}=\pi_{t+1}(x)$, the first-order optimality condition B.4 becomes

$$\begin{split}
&\eta\left[(P_{t+1}-P)q(P_{t})\right](x)-\left\langle\nabla\psi(\pi_{t}(x))-\nabla\psi(\pi_{t+1}(x)),\,\pi(x)-\pi_{t+1}(x)\right\rangle\geq 0\quad\implies\\
&\eta\left[(P_{t+1}-P)q(P_{t})\right](x)\geq D(\pi(x);\pi_{t+1}(x))-D(\pi(x);\pi_{t}(x))+D(\pi_{t+1}(x);\pi_{t}(x))\quad\implies\\
&\eta\left[(P_{t+1}-P)q(P_{t})\right](x)\geq D(\pi(x);\pi_{t+1}(x))-D(\pi(x);\pi_{t}(x))
\end{split}$$

Here, in the first line we used the definition of Bregman divergence [31, Definition 9.2], $D(p;q):=\psi(p)-\psi(q)-\left\langle\nabla\psi(q),p-q\right\rangle$ for a suitable Legendre function $\psi:\Delta(\mathcal{A})\to\mathbb{R}$; the first implication follows from the three-points property of Bregman divergences [31, Lemma 9.11], and the last implication from the non-negativity of $D(\pi_{t+1}(x);\pi_{t}(x))$. ∎
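The inequality B.3 can be sanity-checked at a single state when $D$ is the Kullback-Leibler divergence. The sketch below is only an illustration under that assumption, with randomly drawn `pi_t`, `pi` and `q`. In this case the minimizer lies in the interior of the simplex, so the slack in B.3 coincides with the dropped term $D(\pi_{t+1}(x);\pi_{t}(x))$.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence between strictly positive probability vectors.
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
n_actions, eta = 4, 0.5

def random_simplex_point():
    # Random probability vector bounded away from the boundary of the simplex.
    p = rng.random(n_actions) + 1e-3
    return p / p.sum()

pi_t, pi = random_simplex_point(), random_simplex_point()
q = rng.normal(size=n_actions)

# KL mirror step at a single state: pi_{t+1} proportional to pi_t * exp(eta * q).
pi_next = pi_t * np.exp(eta * q)
pi_next /= pi_next.sum()

lhs = eta * (pi_next - pi) @ q        # eta [(P_{t+1} - P) q](x)
rhs = kl(pi, pi_next) - kl(pi, pi_t)  # right-hand side of B.3
slack = lhs - rhs                     # equals D(pi_{t+1}; pi_t) in this KL case
```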

Corollary B.3 (MD Iterations are monotonically increasing).

This Corollary is essentially a restatement of [12, Lemma 7]. Let $(P_{t})_{t\in\mathbb{N}}$ be the sequence of policy operators associated to the measurable minimizers of B.1 for all $t\in\mathbb{N}$. For all $x\in\mathcal{X}$ it holds

$$\left[(P_{t+1}-P_{t})q(P_{t})\right](x)\geq 0 \tag{B.5}$$

and

$$J(P_{t+1})-J(P_{t})\geq 0 \tag{B.6}$$

i.e. a mirror descent iteration never decreases the objective function. Further, if $\tilde{q}(P_{t})\in B_{b}(\Omega)$ is such that $\left\|q(P_{t})-\tilde{q}(P_{t})\right\|_{\infty}\leq\varepsilon_{t}$, then B.5 holds inexactly on $\tilde{q}(P_{t})$ as

$$\left[(P_{t+1}-P_{t})\tilde{q}(P_{t})\right](x)\geq-2\varepsilon_{t}. \tag{B.7}$$
Proof.

By setting $\pi(x)=\pi_{t}(x)$ in B.3, and recalling that $D(p;q)\geq 0$ with equality if and only if $p=q$, it follows that

$$\eta\left[(P_{t+1}-P_{t})q(P_{t})\right](x)\geq D(\pi_{t}(x);\pi_{t+1}(x))\geq 0,$$

giving B.5. Integrating B.5 over $\left(\mathrm{Id}-\gamma P_{t+1}T\right)^{-*}\nu$ and using the Performance Difference Lemma A.4 one gets B.6. Finally, we get B.7 from

$$\begin{split}
\left[(P_{t+1}-P_{t})\tilde{q}(P_{t})\right](x)&=\left[(P_{t+1}-P_{t})q(P_{t})\right](x)+\left[(P_{t+1}-P_{t})(\tilde{q}(P_{t})-q(P_{t}))\right](x)\\
&\geq\left[(P_{t+1}-P_{t})(\tilde{q}(P_{t})-q(P_{t}))\right](x)\\
&\geq-\bigl\|(P_{t+1}-P_{t})(\tilde{q}(P_{t})-q(P_{t}))\bigr\|_{\infty}\\
&\geq-\bigl\|P_{t+1}-P_{t}\bigr\|\,\bigl\|\tilde{q}(P_{t})-q(P_{t})\bigr\|_{\infty}\\
&\geq-2\varepsilon_{t}.
\end{split} \tag{B.8}$$

Here the first inequality follows from B.5, and the last from the fact that policy operators are Markov operators and have norm 1, so that $\bigl\|P_{t+1}-P_{t}\bigr\|\leq\bigl\|P_{t+1}\bigr\|+\bigl\|P_{t}\bigr\|=2$. ∎
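The monotonicity statements B.5 and B.6 can be observed numerically on a small random tabular MDP with exact action values. The sketch below makes several assumptions for illustration: finite state and action sets, the KL divergence, a uniform initial distribution `nu`, the objective written as $J(P)=\langle P\,q(P),\nu\rangle$, and a flat index `x * A + a` for state-action pairs. None of the names come from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eta = 6, 3, 0.9, 1.0

# Random tabular MDP: T[(x, a), x'] transition probabilities, r[(x, a)] rewards.
T = rng.random((S * A, S))
T /= T.sum(axis=1, keepdims=True)
r = rng.random(S * A)
nu = np.full(S, 1.0 / S)  # assumed initial state distribution

def policy_operator(pi):
    # Matrix of P_pi: (P_pi g)(x) = sum_a g(x, a) pi(a | x), g indexed by x * A + a.
    P = np.zeros((S, S * A))
    for x in range(S):
        P[x, x * A:(x + 1) * A] = pi[x]
    return P

def q_value(pi):
    # q(P) = (Id - gamma T P)^{-1} r
    return np.linalg.solve(np.eye(S * A) - gamma * T @ policy_operator(pi), r)

def objective(pi):
    # J(P) = <P q(P), nu>
    return nu @ policy_operator(pi) @ q_value(pi)

pi = np.full((S, A), 1.0 / A)
values, gains = [], []
for _ in range(20):
    q = q_value(pi)
    new = pi * np.exp(eta * q.reshape(S, A))  # KL mirror step with exact q
    new /= new.sum(axis=1, keepdims=True)
    # Left-hand side of B.5 at every state; record its minimum over states.
    gains.append(((policy_operator(new) - policy_operator(pi)) @ q).min())
    values.append(objective(pi))
    pi = new
```

Up to floating-point error, every entry of `gains` should be non-negative and `values` should be non-decreasing, in line with B.5 and B.6.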

B.1 Convergence rates of PMD

We are finally ready to prove the convergence rates for the Policy Mirror Descent algorithm B.1. The proof technique is loosely based on [12, Theorem 8, Lemma 12], and extends them to the case of general state spaces through the key Lemma B.1 and using a fully operatorial formalism.

See 7

Proof.

As usual, in this proof we denote the estimated and exact action-value functions as $q_{n}(P_{t}):=\hat{q}_{\pi_{t}}=(\mathrm{Id}-\gamma T_{n}P_{t})^{-1}r_{n}$ and $q(P_{t}):=q_{\pi_{t}}=(\mathrm{Id}-\gamma TP_{t})^{-1}r$, respectively. By hypothesis, Alg. 1 is well-defined since all policies it generates are $(\mathcal{G},\mathcal{F})$-compatible. The resulting sequence of policies $(\pi_{t})_{t\in\mathbb{N}}$ is generated via the update rule 13 on the inexact action-value functions $q_{n}(P_{t})$, as defined in 11.
As the update rule 13 is a (measurable) minimizer of B.1 when $D$ equals the Kullback-Leibler divergence, the three-points Lemma B.2 with $\pi(x)=\pi_{*}(x)$ yields

$$\left[(P_{*}-P_{t+1})q_{n}(P_{t})\right](x)\leq\frac{1}{\eta}D(\pi_{*}(x);\pi_{t}(x))-\frac{1}{\eta}D(\pi_{*}(x);\pi_{t+1}(x)).$$

Adding and subtracting the term $\left[(P_{*}-P_{t+1})q(P_{t})\right](x)$, and bounding the remaining difference as $\left[(P_{*}-P_{t+1})(q(P_{t})-q_{n}(P_{t}))\right](x)\leq 2\varepsilon_{t}$ (see the derivation of B.8), one gets

$$\left[(P_{*}-P_{t+1})q(P_{t})\right](x)\leq 2\varepsilon_{t}+\frac{1}{\eta}D(\pi_{*}(x);\pi_{t}(x))-\frac{1}{\eta}D(\pi_{*}(x);\pi_{t+1}(x)).$$

Adding and subtracting $P_t q(P_t)$ on the left side gives

$$\left[(P_* - P_t)\,q(P_t)\right](x) \leq \left[(P_{t+1} - P_t)\,q(P_t)\right](x) + 2\varepsilon_t + \frac{1}{\eta} D(\pi_*(x);\pi_t(x)) - \frac{1}{\eta} D(\pi_*(x);\pi_{t+1}(x)),$$

and integrating with respect to the positive measure $(\mathrm{Id} - \gamma P_* T)^{-*}\nu$ and using the performance difference Lemma A.4 on the left-hand side one has

$$\begin{aligned} J(P_*) - J(P_t) \leq\; & \frac{1}{1-\gamma}\left\langle (P_{t+1} - P_t)\,q(P_t) + 2\varepsilon_t,\; d_\nu(P_*) \right\rangle \\ & + \frac{1}{\eta(1-\gamma)}\left\langle D(\pi_*;\pi_t) - D(\pi_*;\pi_{t+1}),\; d_\nu(P_*) \right\rangle, \end{aligned} \tag{B.9}$$

where we used A.8 on the right-hand-side terms. Since $(P_{t+1} - P_t)\,q(P_t) + 2\varepsilon_t \geq 0$ because of B.7, we can use fact (1) from Lemma A.6 with $(\mathrm{Id} - \gamma P_{t+1} T)^{-1}$ and the performance difference Lemma A.4 to get

$$\begin{aligned} \left\langle (P_{t+1} - P_t)\,q(P_t) + 2\varepsilon_t,\; d_\nu(P_*) \right\rangle &\leq \left\langle (\mathrm{Id} - \gamma P_{t+1} T)^{-1}\left[(P_{t+1} - P_t)\,q(P_t) + 2\varepsilon_t\right],\; d_\nu(P_*) \right\rangle \\ &= \left\langle P_{t+1}\,q(P_{t+1}),\; d_\nu(P_*) \right\rangle - \left\langle P_t\,q(P_t),\; d_\nu(P_*) \right\rangle + \frac{2\varepsilon_t}{1-\gamma}. \end{aligned}$$
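The equality in the last display can be sanity-checked numerically on a small finite MDP. The sketch below is only an illustration under assumptions that are not part of the source: states and actions are finite, $T$ is an $(|\mathcal{X}||\mathcal{A}|) \times |\mathcal{X}|$ row-stochastic matrix, each policy operator $P$ is the $|\mathcal{X}| \times (|\mathcal{X}||\mathcal{A}|)$ matrix averaging over actions, and all sizes and names (`policy_operator`, `q_of`, `resolvent`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 5, 3, 0.9  # hypothetical sizes and discount factor

def row_stochastic(shape):
    M = rng.random(shape)
    return M / M.sum(axis=1, keepdims=True)

def policy_operator(pi):
    """(S, S*A) matrix with (P f)(x) = sum_a pi(a|x) f(x, a)."""
    n_s, n_a = pi.shape
    P = np.zeros((n_s, n_s * n_a))
    for x in range(n_s):
        P[x, x * n_a:(x + 1) * n_a] = pi[x]
    return P

T = row_stochastic((S * A, S))            # transition operator (assumed finite)
r = rng.uniform(-1.0, 1.0, size=S * A)    # reward on state-action pairs
P_t = policy_operator(row_stochastic((S, A)))
P_next = policy_operator(row_stochastic((S, A)))

def q_of(P):
    # q(P) = (Id - gamma T P)^{-1} r
    return np.linalg.solve(np.eye(S * A) - gamma * T @ P, r)

q_t, q_next = q_of(P_t), q_of(P_next)
resolvent = np.linalg.inv(np.eye(S) - gamma * P_next @ T)

# (Id - gamma P_{t+1} T)^{-1} (P_{t+1} - P_t) q(P_t) = P_{t+1} q(P_{t+1}) - P_t q(P_t)
lhs = resolvent @ ((P_next - P_t) @ q_t)
rhs = P_next @ q_next - P_t @ q_t

# The resolvent maps the constant function 1 to 1 / (1 - gamma),
# which is where the term 2 * eps_t / (1 - gamma) comes from.
const = resolvent @ np.ones(S)
```

In this finite setting the resolvent is a Neumann series of nonnegative matrices starting with the identity, so it dominates the identity entrywise. This matches the way fact (1) of Lemma A.6 is used for the inequality in the first line.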

Substituting this bound in B.9 and summing over $t = 0, \ldots, T-1$ one gets

$$\begin{aligned} \sum_{t=0}^{T-1} J(P_*) - J(P_t) \leq\; & \frac{1}{1-\gamma}\left( \left\langle P_T\,q(P_T),\; d_\nu(P_*) \right\rangle - \left\langle P_0\,q(P_0),\; d_\nu(P_*) \right\rangle \right) + \frac{2}{(1-\gamma)^2}\sum_{t=0}^{T-1}\varepsilon_t \\ & + \frac{1}{\eta(1-\gamma)}\left\langle D(\pi_*;\pi_0) - D(\pi_*;\pi_T),\; d_\nu(P_*) \right\rangle. \end{aligned}$$

Using facts (3) and (4) from Lemma A.6 we have that the terms $\left\langle P\,q(P),\; d_\nu(P_*) \right\rangle$ on the right-hand side can be bounded as

$$\begin{aligned} \left\langle P\,q(P),\; d_\nu(P_*) \right\rangle &= \left\langle P(\mathrm{Id} - \gamma T P)^{-1} r,\; d_\nu(P_*) \right\rangle \\ &= \left\langle (\mathrm{Id} - \gamma P T)^{-1} P r,\; d_\nu(P_*) \right\rangle \\ &= \left\langle r,\; P^*(\mathrm{Id} - \gamma P T)^{-*} d_\nu(P_*) \right\rangle \\ \text{(Duality)} &\leq \|r\|_\infty \left\| P^*(\mathrm{Id} - \gamma P T)^{-*} d_\nu(P_*) \right\|_{\mathrm{TV}} \\ (\text{Lemma A.6 and } \|\nu\|_{\mathrm{TV}} = 1) &\leq \frac{\|r\|_\infty}{1-\gamma}, \end{aligned}$$

while $-\left\langle D(\pi_*;\pi_T),\; d_\nu(P_*) \right\rangle$ can be dropped due to the positivity of Bregman divergences, yielding

$$\sum_{t=0}^{T-1} J(P_*) - J(P_t) \leq \frac{2}{(1-\gamma)^2}\left( \|r\|_\infty + \sum_{t=0}^{T-1}\varepsilon_t \right) + \frac{1}{\eta(1-\gamma)}\left\langle D(\pi_*;\pi_0),\; d_\nu(P_*) \right\rangle. \tag{B.10}$$
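The bound $\left\langle P\,q(P),\; d_\nu(P_*) \right\rangle \leq \|r\|_\infty/(1-\gamma)$ used to reach B.10 is easy to check in a tabular setting. The following is a minimal sketch, assuming a finite MDP with matrix representations of $T$ and $P$; the sizes, the random reward and the probability vector `d` standing in for $d_\nu(P_*)$ are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9  # hypothetical sizes and discount factor

# T maps functions on states to functions on state-action pairs: shape (S*A, S).
T = rng.random((S * A, S))
T /= T.sum(axis=1, keepdims=True)

# P averages a state-action function over the actions of a random policy: shape (S, S*A).
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)
P = np.zeros((S, S * A))
for x in range(S):
    P[x, x * A:(x + 1) * A] = pi[x]

r = rng.uniform(-1.0, 1.0, size=S * A)  # reward on state-action pairs

# q(P) = (Id - gamma T P)^{-1} r and v(P) = P q(P).
q = np.linalg.solve(np.eye(S * A) - gamma * T @ P, r)
v = P @ q

# Push-through identity: P (Id - gamma T P)^{-1} r = (Id - gamma P T)^{-1} P r.
v_alt = np.linalg.solve(np.eye(S) - gamma * P @ T, P @ r)

d = rng.random(S)
d /= d.sum()  # any probability vector; stands in for d_nu(P_*)
pairing = float(v @ d)
bound = np.abs(r).max() / (1.0 - gamma)
```

Here `v` and `v_alt` agree, which is the second equality of the chain, and `pairing` never exceeds `bound` because `d` has unit total mass.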

Now notice that for all $t < T$ it holds

$$\begin{aligned} J(P_t) &= \left\langle P_t\,q(P_t),\; \nu \right\rangle \\ &= \left\langle P_t\,q_n(P_t),\; \nu \right\rangle + \left\langle P_t(q(P_t) - q_n(P_t)),\; \nu \right\rangle \\ (\text{Equation B.6}) &\leq \left\langle P_T\,q_n(P_T),\; \nu \right\rangle + \left\langle P_t(q(P_t) - q_n(P_t)),\; \nu \right\rangle \\ &= \left\langle P_T\,q_n(P_T),\; \nu \right\rangle + \left\langle P_t(q(P_t) - q_n(P_t)),\; \nu \right\rangle + \left\langle P_T(q_n(P_T) - q(P_T)),\; \nu \right\rangle \\ &\leq J(P_T) + \varepsilon_t + \varepsilon_T, \end{aligned}$$

so that

$$J(P_*) - J(P_T) \leq \varepsilon_T + \frac{1}{T}\sum_{t=0}^{T-1} J(P_*) - J(P_t) + \varepsilon_t.$$

Combining this with B.10 we obtain

$$J(P_*) - J(P_T) \leq \varepsilon_T + \frac{1}{T}\left[ \left(1 + \frac{2}{(1-\gamma)^2}\right)\sum_{t=0}^{T-1}\varepsilon_t + \frac{2\|r\|_\infty}{(1-\gamma)^2} + \frac{1}{\eta(1-\gamma)}\left\langle D(\pi_*;\pi_0),\; d_\nu(P_*) \right\rangle \right],$$

leading to the desired bound. ∎
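To get a feeling for how the terms of the final inequality trade off, one can evaluate its right-hand side directly. The helper below is a sketch and not part of the source: the function name and argument names are hypothetical, and `d0` stands for the scalar $\left\langle D(\pi_*;\pi_0),\; d_\nu(P_*) \right\rangle$, which is assumed to be given.

```python
def pmd_suboptimality_bound(eps, eps_T, gamma, r_inf, eta, d0):
    """Right-hand side of the final inequality.

    eps   : list [eps_0, ..., eps_{T-1}] of per-iteration estimation errors
    eps_T : estimation error at the last iterate
    r_inf : sup-norm of the reward
    d0    : <D(pi_*; pi_0), d_nu(P_*)>, assumed known here
    """
    T = len(eps)
    c = 2.0 / (1.0 - gamma) ** 2
    return eps_T + ((1.0 + c) * sum(eps) + c * r_inf + d0 / (eta * (1.0 - gamma))) / T

# Exact action-value estimates: the bound decays as 1/T.
exact = [pmd_suboptimality_bound([0.0] * T, 0.0, 0.5, 1.0, 1.0, 1.0) for T in (10, 100, 1000)]

# Constant error eps: the bound approaches (2 + 2 / (1 - gamma)^2) * eps as T grows.
plateau = pmd_suboptimality_bound([0.1] * 10**5, 0.1, 0.5, 1.0, 1.0, 1.0)
```

With exact estimates the bound vanishes at rate $O(1/T)$, while a constant error $\varepsilon_t = \varepsilon$ leaves a floor of $\left(2 + 2/(1-\gamma)^2\right)\varepsilon$ that more iterations cannot remove.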

Appendix C POWR Convergence Rates

In this section we prove the convergence of POWR. To do so, we first need to show that, under the choice of spaces $\mathcal{F}$ and $\mathcal{G}$ proposed in this work, the resulting PMD iterations are well defined. Then, we need to bound the approximation error of the estimates for the action-value functions of the iterates produced by the inexact PMD algorithm, which appear in the rates of Thm. 7.

C.1 POWR is Well-defined

In order to guarantee that the iterations of POWR generate policies $\pi_t$ for which we can compute an estimator according to the formula in Prop. 3, we need to guarantee that all such policies are $(\mathcal{G},\mathcal{F})$-compatible. In particular we restrict to the case of the separable spaces introduced in Prop. 4, for which it turns out that it is sufficient to show that all policies belong to the space $\mathcal{H}$ characterizing $\mathcal{F} = \mathcal{H} \otimes \mathcal{H}$ and $\mathcal{G} = \mathbb{R}^{|\mathcal{A}|} \otimes \mathcal{H}$. The following result provides a candidate for choosing such a space.

See 5

Proof.

We recall that Sobolev spaces [33] over a compact subset $\mathcal{X}$ of $\mathbb{R}^D$ are closed with respect to the operations of sum, multiplication, exponentiation or inversion (if the function is supported on the entire domain $\mathcal{X}$), namely for any two $f, f' \in \mathcal{H}$, $f + f',\, f f',\, e^{f} \in \mathcal{H}$ and, if $f(x) > 0$ for all $x \in \mathcal{X}$, $1/f \in \mathcal{H}$. This follows by applying the chain rule and the boundedness of derivatives over the compact $\mathcal{X}$ (see for instance [42, Lemma E.2.2]). The proof follows by observing that the one-step update $\pi_{t+1}$ in 13 is expressed precisely in terms of these operations and the hypothesis that $\pi_t(a|\cdot)$ and $\hat{q}_t(\cdot,a)$ belong to $\mathcal{H}$ for any $a \in \mathcal{A}$. ∎
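The structure of the argument can be made concrete with a small sketch of one update step. We assume here, as an illustration only, that the update in 13 takes the usual multiplicative form $\pi_{t+1}(a|x) \propto \pi_t(a|x)\, e^{\eta \hat{q}_t(x,a)}$ and that states are replaced by a finite grid; the function name `pmd_step` and all sizes are hypothetical.

```python
import numpy as np

def pmd_step(pi_t, q_hat, eta):
    """One multiplicative update on a grid of states (rows) and actions (columns)."""
    weights = pi_t * np.exp(eta * q_hat)             # exponentiation, then product
    normalizer = weights.sum(axis=1, keepdims=True)  # finite sum over actions, strictly positive
    return weights / normalizer                      # inversion of a positive function

rng = np.random.default_rng(1)
pi_t = rng.random((5, 3)) + 0.1           # strictly positive current policy
pi_t /= pi_t.sum(axis=1, keepdims=True)
q_hat = rng.normal(size=(5, 3))           # stand-in for the action-value estimate
pi_next = pmd_step(pi_t, q_hat, eta=0.5)
```

Each line of `pmd_step` uses exactly one of the operations under which $\mathcal{H}$ is closed, and the normalizer is strictly positive, so the inversion is admissible.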

Choosing the space $\mathcal{H}$ according to the above result and combining it with the PMD iterations of Alg. 1, we have the following corollary.

See 6

Proof.

We proceed by induction. Since $\bar{q}(\cdot,a) \in \mathcal{H}$, we can apply the same reasoning as in Thm. 5 to guarantee that $\pi_0(a|\cdot) \in \mathcal{H}$ for any $a \in \mathcal{A}$. Moreover, $\pi_0(a|\cdot) > 0$ for any $a \in \mathcal{A}$ since it is the (normalized) exponential of a function. Hence $\pi_0$ is $(\mathcal{G},\mathcal{F})$-compatible. Therefore, $\hat{q}_0$ obtained according to Prop. 3 is well defined and belongs to $\mathcal{G}$, implying $\hat{q}_0(\cdot,a) \in \mathcal{H}$ for any $a \in \mathcal{A}$. Now, assume by inductive hypothesis that the policy $\pi_t(a|\cdot)$ generated by POWR at time $t$ and the corresponding estimator $\hat{q}_t(\cdot,a)$ of the action-value function belong to $\mathcal{H}$ and that $\pi_t(a|x) > 0$ for any $(x,a) \in \Omega$. Then, by Thm. 5, the solution $\pi_{t+1}$ to the PMD update in 13 also belongs to $\mathcal{H}$ (and is therefore $(\mathcal{G},\mathcal{F})$-compatible). Additionally, since $\pi_{t+1}$ can be expressed as the softmax of a (finite) sum of functions in $\mathcal{H}$, we also have $\pi_{t+1}(a|x) > 0$ for all $(x,a) \in \Omega$, proving the inductive step and concluding the proof. ∎
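The remark that $\pi_{t+1}$ is the softmax of a finite sum of functions can be checked by unrolling the recursion. The sketch below rests on assumptions not stated in the source: the update is taken in multiplicative form, $\pi_0$ is taken to be the softmax of $\bar{q}$ (the text only says it is a normalized exponential), states live on a finite grid, and the estimates $\hat{q}_s$ are replaced by random arrays.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilisation; does not change the result
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
n_states, n_actions, eta, steps = 6, 4, 0.3, 5  # hypothetical sizes

q_bar = rng.normal(size=(n_states, n_actions))
pi = softmax(q_bar)          # assumed initial policy pi_0
accumulated = q_bar.copy()   # log pi_0 up to a per-state constant

for _ in range(steps):
    q_hat = rng.normal(size=(n_states, n_actions))  # stand-in for the estimate at this step
    pi = pi * np.exp(eta * q_hat)                   # multiplicative update
    pi /= pi.sum(axis=1, keepdims=True)
    accumulated += eta * q_hat

closed_form = softmax(accumulated)  # softmax of q_bar + eta * sum of all estimates
```

The iterated policy `pi` coincides with `closed_form`, and every entry stays strictly positive, which is the positivity property carried along by the induction.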

The above corollary guarantees that if we are able to learn our estimates for the action-value function in a suitably regular Sobolev space $\mathcal{H}$, then POWR is well-defined. This is a necessary condition for studying its theoretical behavior in our main result.

C.2 Controlling the Action-value Estimation Error

We now show how to control the estimation error for the action-value function. We start by considering the following application of the (generalized) Simulation lemma in Corollary A.8.

See 8

Proof.

Recall that in the notation of these appendices the action value of a policy and its estimator via the world model CME framework are denoted $q_\pi = q(P)$ and $\hat{q}_\pi = q_n(\pi)$ respectively. We can apply Corollary A.8 to obtain

$$q_n(P) - q(P) = (\mathrm{Id} - \gamma T_n P)^{-1}(r_n - r) + \gamma(\mathrm{Id} - \gamma T_n P)^{-1}(T - T_n)\,v(P).$$

Then, by Lemma A.6, point 5, we have

$$\left\| q_n(P) - q(P) \right\|_\infty \leq \frac{1}{1-\gamma'}\Big[ \left\| r_n - r \right\|_\infty + \gamma \left\| (T - T_n)\,v(P) \right\|_\infty \Big],$$

where $v(P) := P(\mathrm{Id} - \gamma T P)^{-1} r$ is the value function of the MDP, and we used that $\gamma \|T_n\| < \gamma'$. Because of Asm. 2, and $P$ being $(\mathcal{G},\mathcal{F})$-compatible, it holds that $r \in \mathcal{G}$, $v(P) \in \mathcal{F}$, while Prop. 3 implies $r_n \in \mathcal{G}$, and $(T - T_n)\,v(P) \in \mathcal{G}$ as well. Therefore, using the reproducing property

$$\left\| r_n - r \right\|_\infty = \sup_{(x,a)\in\Omega} \left| \left\langle \psi(x,a),\; r_n - r \right\rangle_{\mathcal{G}} \right| \leq \left\| r_n - r \right\|_{\mathcal{G}} \sup_{(x,a)\in\Omega} \left\| \psi(x,a) \right\|_{\mathcal{G}} = C_\psi \left\| r_n - r \right\|_{\mathcal{G}}$$

where we assumed a bounded kernel $\left\langle \psi(x,a),\; \psi(x,a) \right\rangle \leq C_\psi$ for all $(x,a) \in \Omega$. Similarly, for the term depending on $T_n - T$ we have

$$\begin{aligned} \left\| (T - T_n)\,v(P) \right\|_\infty &= \sup_{(x,a)\in\Omega} \left| \left[ (T|_{\mathcal{F}} - T_n)\,v(P) \right](x,a) \right| \\ &= \sup_{(x,a)\in\Omega} \left| \left\langle \psi(x,a),\; (T|_{\mathcal{F}} - T_n)\,v(P) \right\rangle_{\mathcal{G}} \right| \\ &= \sup_{(x,a)\in\Omega} \Big| \mathrm{Tr}\left[ (v(P) \otimes \psi(x,a))(T|_{\mathcal{F}} - T_n) \right] \Big| \\ &\leq \left\| T|_{\mathcal{F}} - T_n \right\|_{HS} \sup_{(x,a)\in\Omega} \left| v(P)(x,a) \right| \\ &\leq \frac{\|r\|_\infty}{1-\gamma} \left\| T|_{\mathcal{F}} - T_n \right\|_{HS}. \end{aligned}$$

Combining the previous two bounds, we obtain

$$
\bigl\| q_n(P) - q(P) \bigr\|_{\infty} \leq \frac{1}{1-\gamma'} \left[ C_{\psi} \bigl\| r_n - r \bigr\|_{\mathcal{G}} + \frac{\gamma \|r\|_{\infty}}{1-\gamma} \bigl\| T|_{\mathcal{F}} - T_n \bigr\|_{HS} \right],
$$

as desired. ∎

According to the result above, we can control the approximation error for the action-value function in terms of the approximation errors $\|r_n - r\|_{\mathcal{G}}$ and $\bigl\| T|_{\mathcal{F}} - T_n \bigr\|_{HS}$. This can be done by leveraging state-of-the-art statistical learning rates for the ridge regression and CME estimators from [35, 18, 43]. The following lemma connects Asm. 2 with the notation used in [35], which enables us to use the required result.

Lemma C.1 (Relation between A.8 and [35]’s definition).

The following two facts are equivalent:

  1. $g\in\mathcal{G}$ satisfies the strong source condition Asm. 2 with parameter $\beta$ on the probability distribution $\rho$.

  2. $g\in[\mathcal{G}]_{\rho}^{1+2\beta}$ as in the notation of [35].

Proof.

Using the same notations as in [35], we have

$$
g\in[\mathcal{G}]_{\rho}^{1+2\beta} \iff g=\sum_{i\in\mathbb{N}} a_i \mu_i^{\frac{1}{2}+\beta} e_i \ \text{ and } \ \bigl\| (a_i)_{i\in\mathbb{N}} \bigr\|_{\ell^2} < \infty.
$$

For $2 \implies 1$, we have that for $g\in[\mathcal{G}]_{\rho}^{1+2\beta}$

$$
C_{\rho}^{-\beta} g = \sum_{i\in\mathbb{N}} a_i \mu_i^{\frac{1}{2}} e_i \tag{C.1}
$$

whose $\mathcal{G}$-norm is $\bigl\| C_{\rho}^{-\beta} g \bigr\|_{\mathcal{G}} = \bigl\| (a_i)_{i\in\mathbb{N}} \bigr\|_{\ell^2} < \infty$.

For $1 \implies 2$, let $g=\sum_{i\in\mathbb{N}} b_i \mu_i^{\frac{1}{2}} e_i$ with $\bigl\| (b_i)_{i\in\mathbb{N}} \bigr\|_{\ell^2} < \infty$ (since $g\in\mathcal{G}$). Now, $\bigl\| C_{\rho}^{-\beta} g \bigr\|_{\mathcal{G}} < \infty$ is equivalent to $\bigl\| (b_i \mu_i^{-\beta})_{i\in\mathbb{N}} \bigr\|_{\ell^2} < \infty$. By letting $b_i = \mu_i^{\beta} a_i$ we have that $\bigl\| (a_i)_{i\in\mathbb{N}} \bigr\|_{\ell^2} < \infty$ and that

$$
g=\sum_{i\in\mathbb{N}} a_i \mu_i^{\frac{1}{2}+\beta} e_i,
$$

that is, $g\in[\mathcal{G}]_{\rho}^{1+2\beta}$. ∎
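The identity used in the proof can be checked numerically on a finite truncation. The following is a minimal sketch under stated assumptions, not part of the original analysis: the eigenvalue sequence $\mu_i = i^{-2}$, the coefficients $a_i = 1/i$, and the helper names `g_norm` and `apply_cov_power` are all hypothetical. Functions are represented by their coefficients on $(e_i)_i$, and, following the convention of [35], $(\mu_i^{1/2} e_i)_i$ is taken to be orthonormal in $\mathcal{G}$.

```python
import math

def g_norm(coeffs, mu):
    """G-norm of f = sum_i coeffs[i] * e_i, assuming (mu_i^{1/2} e_i) is orthonormal in G."""
    return math.sqrt(sum(c * c / m for c, m in zip(coeffs, mu)))

def apply_cov_power(coeffs, mu, power):
    """Apply C_rho^power, assumed to act diagonally: e_i -> mu_i^power * e_i."""
    return [c * m ** power for c, m in zip(coeffs, mu)]

beta = 0.5                                             # source-condition parameter (illustrative)
mu = [1.0 / (i * i) for i in range(1, 51)]             # hypothetical eigenvalue decay
a = [1.0 / i for i in range(1, 51)]                    # square-summable coefficients
g = [ai * m ** (0.5 + beta) for ai, m in zip(a, mu)]   # g = sum_i a_i mu_i^{1/2 + beta} e_i

lhs = g_norm(apply_cov_power(g, mu, -beta), mu)        # ||C_rho^{-beta} g||_G
rhs = math.sqrt(sum(ai * ai for ai in a))              # ||(a_i)||_{l^2}
```

Up to floating-point error, `lhs` and `rhs` coincide, which is the norm identity stated after (C.1).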

With the connection between [35] and Asm. 2 in place we can characterize the bound on the approximation error for the world model-based estimation of the action-value function.

Proposition C.2.

Let $T_n$ and $r_n$ be the empirical estimators of the transfer operator $T$ and reward function $r$ as defined in Prop. 3, respectively. When $P$ is a $(\mathcal{G},\mathcal{F})$-compatible policy as in Definition 1 and the strong source condition Asm. 2 is attained with parameter $\beta$, it holds

$$
\bigl\| q_n(P) - q(P) \bigr\|_{\infty} \leq O(\delta^{2} n^{-\alpha}), \tag{C.2}
$$

with rates $\alpha\in\left(\frac{\beta}{2+2\beta}, \frac{\beta}{1+2\beta}\right)$ and probability not less than $1-4e^{-\delta}$.

Proof.

We use Lemma C.1 to apply Theorem 3.1 (ii) from [35] to show that under Asm. 2 with parameter $\beta$ it holds, with probability not less than $1-4e^{-\delta}$,

$$
\bigl\| r_n - r \bigr\|_{\mathcal{G}} \leq \delta^{2} c_r n^{-\alpha_r}. \tag{C.3}
$$

The rate $\alpha_r\in\left(\frac{\beta}{2+2\beta}, \frac{\beta}{1+2\beta}\right)$ is determined by the properties of the inclusion $\mathcal{G}\hookrightarrow B_b(\Omega)$, and the constant $c_r>0$ is independent of $n$ and $\delta$. Similarly, point (2.) of [18, Theorem 2] shows that under Asm. 2

$$
\bigl\| T|_{\mathcal{F}} - T_n \bigr\|_{HS(\mathcal{F},\mathcal{G})} \leq \delta^{2} c_T n^{-\alpha_T} \tag{C.4}
$$

again with probability not less than $1-4e^{-\delta}$, rates $\alpha_T\in\left(\frac{\beta}{2+2\beta}, \frac{\beta}{1+2\beta}\right)$ and with $c_T>0$ independent of $n$ and $\delta$. Combining every bound and denoting $\alpha := \min(\alpha_r, \alpha_T)$, we conclude

$$
\bigl\| q_n(P) - q(P) \bigr\|_{\infty} \leq \frac{\delta^{2}}{1-\gamma'} \left[ C_{\psi} c_r + \frac{\gamma c_T \|r\|_{\infty}}{1-\gamma} \right] n^{-\alpha} = O(\delta^{2} n^{-\alpha}). \tag{C.5}
$$

as required. ∎
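To get a feel for how the bound (C.5) scales with the sample size, it can be evaluated directly. The sketch below is only illustrative: the function names `rate_interval` and `q_error_bound` are hypothetical, and every numerical constant ($C_\psi$, $c_r$, $c_T$, $\|r\|_\infty$, $\gamma'$, $\delta$, $\alpha_r$, $\alpha_T$) is a placeholder, since the analysis only guarantees that these constants exist, not their values.

```python
def rate_interval(beta):
    """Endpoints of the open interval containing the rate alpha in Proposition C.2."""
    return beta / (2 + 2 * beta), beta / (1 + 2 * beta)

def q_error_bound(n, delta, gamma, gamma_prime, C_psi, c_r, c_T, r_inf, alpha_r, alpha_T):
    """Right-hand side of (C.5) before the O(.) simplification."""
    alpha = min(alpha_r, alpha_T)  # the slower of the two rates dominates
    constant = (C_psi * c_r + gamma * c_T * r_inf / (1.0 - gamma)) / (1.0 - gamma_prime)
    return delta ** 2 * constant * n ** (-alpha)

# Placeholder values chosen only for illustration (not taken from the paper).
lo, hi = rate_interval(beta=1.0)
params = dict(delta=2.0, gamma=0.99, gamma_prime=0.995, C_psi=1.0,
              c_r=1.0, c_T=1.0, r_inf=1.0, alpha_r=0.30, alpha_T=0.28)
bounds = [q_error_bound(n, **params) for n in (10**3, 10**4, 10**5)]
```

With these placeholders, each tenfold increase in $n$ shrinks the bound by a factor of $10^{-\alpha}$ with $\alpha = 0.28$, while the factors $1/(1-\gamma)$ and $1/(1-\gamma')$ inflate the constant when the discount is close to one.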

C.3 Convergence Rates for POWR

With a bound on the estimation error of the action-value function computed by Alg. 1, we are finally ready to state the complexity bounds for POWR.

See 9

Proof.

Since the setting of Cor. 6 implies that the policies $P_t$ are $(\mathcal{G},\mathcal{F})$-compatible for all $t$, and Asm. 2 holds, both $q(P_t)$ and $q_n(P_t)$ belong to $\mathcal{G}$ for every policy in the sequence $(P_t)_{t\in\mathbb{N}}$. This ensures that we can plug the statistical learning bounds of Proposition C.2 into Thm. 7, yielding the final bound. ∎

Appendix D Experimental details

D.1 Additional Results

In this section, we examine the empirical results of our method in more detail. We present boxplots showing the average timestep at which a reward threshold is met during training. The testing environments are the same as those introduced previously, with reward thresholds being the standard ones given in [36], except for the Taxi-v3 environment, where the threshold is marginally lower. Interestingly, in this environment, only DQN and our algorithm achieve the original threshold within $1.5\times 10^{6}$ training timesteps. The new, lower threshold is also reached by the PPO algorithm.

As depicted in Fig. 2, our approach attains the desired reward more quickly than the competing algorithms. Furthermore, the timestep at which our method reaches the threshold exhibits lower variance than that of the other techniques. This suggests that our approach requires a consistent number of timesteps to learn how to solve a given environment.

[Figure 2, three panels: (a) FrozenLake-v1, (b) Taxi-v3, (c) MountainCar-v0]

Figure 2: This boxplot visually represents the mean timestep at which various algorithms attain a specified reward threshold during their training. The reward targets are set at $0.8$ for FrozenLake-v1, $6$ for Taxi-v3, and $-110$ for MountainCar-v0. The absence of a box indicates that the corresponding algorithm was unable to meet the reward threshold within the training process.
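The quantity summarized in Fig. 2 can be computed from logged training curves. The sketch below shows one plausible way to do so and is an assumption rather than the procedure actually used: the text does not specify how threshold crossings are detected, the helper names `first_timestep_reaching` and `mean_timestep_to_threshold` are hypothetical, and the three runs are made-up toy data.

```python
from statistics import mean

def first_timestep_reaching(timesteps, rewards, threshold):
    """Return the first logged timestep whose reward meets the threshold, or None."""
    for t, r in zip(timesteps, rewards):
        if r >= threshold:
            return t
    return None

def mean_timestep_to_threshold(runs, threshold):
    """Average the first-crossing timestep over the runs that reach the threshold."""
    hits = [first_timestep_reaching(ts, rs, threshold) for ts, rs in runs]
    hits = [h for h in hits if h is not None]
    return mean(hits) if hits else None  # None corresponds to an absent box

# Toy curves (timesteps, evaluation rewards) for three hypothetical seeds.
toy_runs = [
    ([100, 200, 300], [0.10, 0.85, 0.90]),
    ([100, 200, 300], [0.50, 0.60, 0.80]),
    ([100, 200, 300], [0.20, 0.30, 0.40]),  # never reaches the threshold
]
avg_timestep = mean_timestep_to_threshold(toy_runs, threshold=0.8)
```

Runs that never reach the threshold are excluded from the average here. If no run reaches it, the function returns `None`, mirroring the missing boxes in the figure.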

D.2 Hyperparameters

To test our approach, we tuned the hyperparameters of the algorithm through a grid search. Table 1 reports the optimal parameters for each testing environment. Here, t_epochs denotes the number of training epochs used for data collection before updating the policy, and n_iter_pmd refers to the number of policy mirror descent iterations used to update the policy.

| Parameter  | FrozenLake-v1 | Taxi-v3   | MountainCar-v0 |
|------------|---------------|-----------|----------------|
| $\eta$     | 1             | 1         | 0.1            |
| $\lambda$  | $10^{-6}$     | $10^{-6}$ | $10^{-6}$      |
| $\gamma$   | 0.99          | 0.99      | 0.99           |
| t_epochs   | 5             | 5         | 1              |
| n_iter_pmd | 10            | 10        | 10             |

Table 1: Hyperparameters for using our method in the proposed environments.
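For reproducibility, the values of Table 1 can be stored as a per-environment configuration. The snippet below is a minimal sketch: the dictionary layout, the key names `eta` and `lam`, and the helper `get_hyperparams` are hypothetical conventions, not the interface of the released implementation. Only the numerical values come from Table 1.

```python
# Values copied from Table 1; the key names are illustrative assumptions.
POWR_HYPERPARAMS = {
    "FrozenLake-v1": {"eta": 1.0, "lam": 1e-6, "gamma": 0.99, "t_epochs": 5, "n_iter_pmd": 10},
    "Taxi-v3": {"eta": 1.0, "lam": 1e-6, "gamma": 0.99, "t_epochs": 5, "n_iter_pmd": 10},
    "MountainCar-v0": {"eta": 0.1, "lam": 1e-6, "gamma": 0.99, "t_epochs": 1, "n_iter_pmd": 10},
}

def get_hyperparams(env_id):
    """Return a copy of the Table 1 settings for the given environment id."""
    if env_id not in POWR_HYPERPARAMS:
        raise KeyError(f"No tuned hyperparameters recorded for {env_id!r}")
    return dict(POWR_HYPERPARAMS[env_id])

mountaincar_config = get_hyperparams("MountainCar-v0")
```

Returning a copy keeps the reference table unchanged if a caller modifies its own configuration during further tuning.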

D.3 Other methods

We compare the performance of our algorithm with several baselines. In particular, we considered A2C [37], DQN [4], TRPO [7] and PPO [6], which we ran using the implementations in the stable baselines library [38]. For all baselines, we used the standard hyperparameters in [38].