Oracle-Efficient Reinforcement Learning
for Max Value Ensembles

Marcel Hussing University of Pennsylvania Michael Kearns University of Pennsylvania Aaron Roth University of Pennsylvania Sikata Sengupta University of Pennsylvania Jessica Sorrell University of Pennsylvania
Abstract

Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of heuristic base or constituent policies upon which we would like to improve in a scalable manner. In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds.

1 Introduction

Computationally efficient RL algorithms are known for simple environments with small state spaces such as tabular Markov decision processes (MDPs) (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002), but practical applications often require dealing with large or even infinite state spaces. Learning efficiently in these cases requires computational complexity independent of the state space, but this is statistically impossible without strong assumptions on the class of MDPs (Jaksch et al., 2010; Lattimore and Hutter, 2012; Du et al., 2019; Domingues et al., 2021). Even in structured MDPs that admit statistically efficient algorithms, learning an optimal policy can still be computationally intractable (Kane et al., 2022; Golowich et al., 2024).

These obstacles to practical RL motivate the study of ensembling methods (Lee et al., 2021; Peer et al., 2021; Chen et al., 2021; Hiraoka et al., 2022), which assume access to multiple sub-optimal policies for the same MDP and aim to leverage these constituent policies to improve upon them. There are now several provably efficient ensembling algorithms, but their guarantees require strong assumptions on the representation of the target policy learned by the algorithm. Brukhim et al. (2022) use the boosting framework for ensembling developed in the supervised learning setting (Freund and Schapire, 1997) to learn an optimal policy, assuming access to a weak learner for a parameterized policy class. To efficiently converge to an optimal policy, the target policy must be expressible as a depth-two circuit over policies from a base class which is efficiently weak-learnable. The convergence guarantees additionally require strong bounds on the worst-case distance between state-visitation distributions of the target policy and policies from the base class.

Another line of ensembling work considers a weaker objective than learning an optimal policy (Cheng et al., 2020; Liu et al., 2023, 2024). These works instead aim to learn a policy competitive with a max-aggregation policy, which takes whichever action maximizes the advantage function with respect to a max-following policy at the current state. When these works have provable guarantees, they require the assumption that the target max-aggregation policy can be approximated in an online-learnable parametric class, as well as the assumption that policy gradients within the class can be efficiently estimated with low variance and bias.

Our goal is to learn a policy competitive with a similar but incomparable benchmark to that of Cheng et al. (2020) under comparatively weak assumptions. We give an efficient algorithm for learning a policy competitive with a max-following policy (Definition 2.1), assuming the learner has access to a squared-error regression oracle for the value functions of the constituent policies. Our algorithm exclusively queries this oracle on distributions over states that are efficiently samplable, thereby reducing the problem of learning a max-following competitive policy to supervised learning of value functions. Notably, our learnability assumptions pertain only to the value functions of the constituent policies and not to the more complicated class of max-following benchmark policies or their value functions. Our algorithm is simple and effective, which we demonstrate empirically in Section 5.

It is natural to wonder if access to an oracle such as ours could be leveraged to instead efficiently learn an optimal policy, obviating the need for weaker benchmarks (and our results). However, it was recently shown by Golowich et al. (2024) that learning an optimal policy in a particular family of block MDPs is computationally intractable under reasonable cryptographic assumptions, even when the learner has access to a squared-error regression oracle. Their oracle captures a general class of regression tasks that includes value function estimation, and therefore also captures our oracle assumption. Our work shows that when we instead consider the simpler objective of efficiently learning a policy that competes with max-following, a regression oracle is in fact sufficient. We leave open the interesting question of whether such an oracle is necessary.

1.1 Results

Our main contribution is a novel algorithm for improving upon a set of $K$ given policies that is oracle efficient with respect to a squared-error regression oracle, and therefore scalable in large state spaces (Algorithm 1, Theorem 3.1). We consider the episodic RL setting in which the learner interacts with its environment for episodes of a fixed length $H$. The algorithm incrementally constructs an improved policy over $H$ iterations, learning an improved policy for step $h\in[H]$ of the episode at iteration $h$. This incremental approach allows the algorithm to explicitly construct efficiently samplable distributions over states visited by the improved policy at step $h$ by simply executing the current policy for $h$ steps. It can then query its oracle to obtain approximate value functions for all constituent policies with respect to this distribution. This in turn allows the algorithm to learn an improved policy for step $h+1$ by following the policy with highest estimated value. By incrementally constructing an improved policy over steps of the episode, we can avoid making assumptions like those of Brukhim et al. (2022) about the overlap between state-visitation distributions of the target policy and the intermediate policies constructed by the algorithm.

Because our oracle only gives us approximate value functions, we take as our benchmark class the set of approximate max-following policies (Definition 2.3). This is a superset of the class of max-following policies and contains all policies that at each state follow the action of some constituent policy with near-maximum value at that state. In Section 4, we prove that for any set of constituent policies, the worst approximate max-following policy is competitive with the best constituent policy (Lemma 4.1) and provide several example MDPs illustrating how our benchmark relates to other natural benchmarks.

Finally, we demonstrate the practical feasibility of our algorithm using a heuristic version on a set of robotic manipulation tasks from the CompoSuite benchmark (Mendez et al., 2022; Hussing et al., 2023). We demonstrate that in all cases, the max-following policy we find is at least as good as the constituent policies and in several cases outperforms them significantly.

1.2 Related work

As discussed above, our work is related to a recent line of research learning a max-aggregation policy (Cheng et al., 2020; Liu et al., 2023, 2024), which can be viewed as a one-step look-ahead max-following policy and is incomparable to the class of max-following policies (see the appendix of Cheng et al. (2020) for example MDPs demonstrating this fact). These works all assume online learnability of the target policy class, which is strictly stronger than our batch learnability assumption for constituent policy value functions.

The work of Cheng et al. (2020) proposes an algorithm (MAMBA) that uses policy gradient methods, and the convergence of the learned policy to their benchmark depends on the bias and variance of those policy gradients. Liu et al. (2023, 2024) build on the work of Cheng et al. (2020). Their algorithm MAPS-SE modifies MAMBA to promote exploration when there is uncertainty about which constituent policy has the greatest value at a state, via an upper confidence bound (UCB) approach to policy selection. Reducing uncertainty about the constituent policies' value functions reduces the bias and variance of the gradient estimates, improving convergence guarantees. However, policy gradient techniques are known to generally have high variance (Wu et al., 2018), and this appears to affect the practical performance of MAPS-SE in certain cases (see Section 5 for additional discussion).

The boosting approach to policy ensembling of Brukhim et al. (2022) also necessitates strong assumptions. This follows from the computational separation in Golowich et al. (2024), which shows that our oracle assumption is insufficient to learn an optimal policy, whereas the assumptions made in Brukhim et al. (2022) enable convergence to optimality. Our work also gives convergence guarantees that are independent of any relationship between the starting state distribution, the state-visitation distributions of the base policy class, and the state-visitation distribution of the target policy, whereas bounds on the closeness of these distributions are required for convergence in Brukhim et al. (2022).

There are other lines of work on policy improvement, which consider improving upon a single base policy and therefore do not address the challenge of ensembling (Sun et al., 2017; Schulman et al., 2015; Chang et al., 2015). Empirical work on ensemble imitation learning (IL) also studies the problem of leveraging multiple base policies for learning (Li et al., 2018; Kurenkov et al., 2019), but these works lack provable guarantees of efficient convergence to a meaningful benchmark.

Song et al. (2023) provide a survey of a variety of more complex techniques to ensemble policies, mainly from a practical perspective. Barreto et al. (2017, 2020) decompose complex tasks into a set of multiple smaller tasks where they use transfer learning, but they make strong assumptions about the joint parametrization of rewards for various tasks and about the representations of the tasks.

2 Preliminaries

We consider an episodic fixed-horizon Markov decision process (MDP) (Puterman, 1994) which we formalize as a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},R,P,\mu_{0},H)$ where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $R$ is a reward function, $P$ the transition dynamics, $\mu_{0}$ a distribution over starting states and $H$ the horizon (Sutton and Barto, 2018). $[N]$ will denote the set $\{0,...,N-1\}$. In the beginning, an initial state $s_{0}$ is sampled from $\mu_{0}$. At any time $h\in[H]$, the agent is in some state $s_{h}\in\mathcal{S}$ and chooses an action $a_{h}\in\mathcal{A}$ based on a function $\pi_{h}$ mapping from states to distributions over actions $\Pi:\mathcal{S}\mapsto\Delta(\mathcal{A})$.
As a consequence, the agent traverses to a new next state $s_{h+1}$ sampled from $P(\cdot|s_{h},a_{h})$ and obtains a reward $R(s_{h},a_{h})$. Without loss of generality, we assume that rewards are bounded within $[0,1]$. The sequence of functions $\pi_{h}$ used by the agent is referred to as its policy, and is denoted $\pi=\{\pi_{h}\}_{h\in[H]}$. A trajectory is the sequence of (state, action) pairs taken by the agent over an episode of length $H$, and is denoted $\tau=\{(s_{h},a_{h})\}_{h\in[H]}$. We will use the notation $\tau\sim\pi(\mu_{0})$ to refer to sampling a trajectory by first sampling a starting state $s_{0}\sim\mu_{0}$, and then executing policy $\pi$ from $s_{0}$.

The goal of the learner is to maximize the expected cumulative reward $\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0},P}[\sum_{t=0}^{H-1}R(s_{t},a_{t})]$ over episodes of length $H$. We further define the value function as the expected cumulative return of following some policy $\pi$ from some state $s$ as $V^{\pi}(s)=\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0},P}[\sum_{t=0}^{H-1}R(s_{t},a_{t})|\pi,s_{0}=s]$.
Due to the finite horizon of the episodic setting, we will also need to refer to the expected cumulative reward from state $s$ under policy $\pi$ from time $h\in[H]$. We denote this time-specific value function by $V_{h}^{\pi}(s)=\mathop{\mathbb{E}}_{P}[\sum_{t=h}^{H-1}R(s_{t},a_{t})|\pi,s_{h}=s]$. Finally, the key object of interest is a max-following policy.
Given access to a set of $K$ arbitrarily defined policies $\Pi^{k}=\{\pi^{k}\}_{k=1}^{K}$ and their respective value functions, which we denote by the shorthand $V^{\pi^{k}}=V^{k}$, a max-following policy is defined as a policy that at every step follows the action of the policy with the highest value in that state.
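To make the time-specific value function concrete, the following is a minimal sketch that computes $V_{h}^{\pi}(s)$ by backward induction. It assumes a tiny tabular MDP with known dynamics and a deterministic policy, neither of which is assumed in the paper; the names `time_specific_values`, `P`, `R`, and `always_stay` are hypothetical and chosen only for illustration.

```python
def time_specific_values(P, R, policy, H):
    """Return V with V[h][s] = expected reward collected from time h to H-1.

    P[s][a] is a list of (next_state, probability) pairs, R[s][a] is a reward
    in [0, 1], and policy[h][s] is the deterministic action at time h
    (an illustrative simplification of a policy over action distributions).
    """
    states = list(P.keys())
    # V[H][s] = 0: no reward is collected once the episode has ended.
    V = [{s: 0.0 for s in states} for _ in range(H + 1)]
    for h in reversed(range(H)):
        for s in states:
            a = policy[h][s]
            expected_next = sum(p * V[h + 1][s2] for s2, p in P[s][a])
            V[h][s] = R[s][a] + expected_next
    return V


# Hypothetical two-state MDP: action 0 stays put, action 1 moves to the
# other state. Only staying in state 1 is rewarded.
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]}, 1: {0: [(1, 1.0)], 1: [(0, 1.0)]}}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}
H = 3
always_stay = [{0: 0, 1: 0} for _ in range(H)]
V = time_specific_values(P, R, always_stay, H)
```

In this toy example the value of staying in state 1 shrinks as $h$ grows, because fewer steps remain in the episode. This is why the definitions index value functions by time.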

Definition 2.1 (Max-following policy class).

Fix a set of policies $\Pi^{k}$ for a common MDP $\mathcal{M}$ and an episode length $H$. The class of max-following policies $\Pi^{k}_{\max}$ is defined as

$$\Pi^{k}_{\max}=\{\pi:\forall h\in[H],\forall s\in\mathcal{S},\pi_{h}(s)=\pi^{k^{*}}(s)\text{ for some }k^{*}\in\operatorname*{argmax}_{k\in[K]}V^{k}(s)\}$$

Note that for any collection of constituent policies $\Pi^{k}$ there may be many max-following policies, due to ties between the value functions. Different max-following policies may have different expected return, and we refer the reader to Observation 4.5 for an example demonstrating this fact.
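As an illustration of Definition 2.1, the sketch below selects the action of a max-following policy at a single state. It assumes exact value functions are available as Python callables, which is precisely what the learner does not have in the paper; the function name and the toy policies are hypothetical. Ties are broken toward the smallest index, which is just one of the many max-following policies the definition allows.

```python
def max_following_action(s, policies, values):
    """Return (k_star, action), where k_star maximizes values[k](s).

    policies[k](s) is the action of constituent policy k at state s and
    values[k](s) is its value there (both hypothetical callables).
    """
    # max() returns the first maximizer, so ties go to the smallest index k.
    k_star = max(range(len(policies)), key=lambda k: values[k](s))
    return k_star, policies[k_star](s)


# Two toy constituent policies over states {0, 1}.
policies = [lambda s: "left", lambda s: "right"]
values = [lambda s: 1.0 if s == 0 else 0.0, lambda s: 0.5]

choice_at_0 = max_following_action(0, policies, values)  # policy 0 is best at s=0
choice_at_1 = max_following_action(1, policies, values)  # policy 1 is best at s=1
```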

We assume access to a value function oracle that allows us to approximate a value function of a policy under a samplable distribution at any specified time $h\in[H]$. This oracle is intended to capture the common assumption that the value function of a policy can be efficiently well-approximated by a function from a fixed parameterized class. In practice, one might imagine implementing this oracle as a neural network minimizing the squared error to a target value function.

Definition 2.2 (Oracle for $\pi$ value function estimates).

We denote by $\mathcal{O}^{\pi}$ an oracle satisfying the following guarantee for a policy $\pi$. For any $\alpha\in(0,1]$ and any $h\in[H]$, given as input the time $h$ and sampling access to any efficiently samplable distribution $\mu$, the oracle outputs $\hat{V}_{h}^{\pi}\leftarrow\mathcal{O}^{\pi}(\alpha,\mu,h)$ such that $\mathop{\mathbb{E}}_{s\sim\mu}[(\hat{V}_{h}^{\pi}(s)-V_{h}^{\pi}(s))^{2}]\leq\alpha$.
We use the notation $\mathcal{O}_{\alpha}^{\pi}=\mathcal{O}^{\pi}(\alpha,\cdot,\cdot)$ to denote $\mathcal{O}^{\pi}$ with fixed accuracy parameter $\alpha$. We will also use the shorthand $\mathcal{O}^{k}=\mathcal{O}^{\pi^{k}}$.

Looking ahead to Section 3, we note that for every distribution $\mu$ on which Algorithm 1 queries an oracle, $\mu$ is not only efficiently samplable, but samplable by executing an explicitly constructed policy $\pi_{\mathsf{samp}}$ for $h$ steps in MDP $\mathcal{M}$, starting from $\mu_{0}$. Thus, for any distribution $\mu$, policy $\pi^{k}$, and time $h$ for which we query $\mathcal{O}^{k}$, we could efficiently obtain an unbiased estimate of $\mathop{\mathbb{E}}_{s\sim\mu}[V^{k}_{h}(s)]$ by following a known $\pi_{\mathsf{samp}}$ for $h$ steps from $\mu_{0}$, and then switching to $\pi^{k}$ for the remainder of the episode. We mention this to highlight that our oracle is not eliding any technical obstacles to sampling in the episodic setting.
It is simply abstracting the supervised learning task of converting unbiased estimates of $\mathop{\mathbb{E}}_{s\sim\mu}[V^{k}_{h}(s)]$ into an approximation $\hat{V}^{k}_{h}$ with small squared error with respect to $\mu$.
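The roll-in and roll-out procedure described above can be sketched as follows. This is an illustrative stand-in for the oracle, not the authors' implementation: it assumes a hypothetical simulator `env_step(s, a, rng)` returning `(next_state, reward)`, deterministic policies given as callables `(time, state) -> action`, and a tabular function class, for which the squared-error minimizer is simply the per-state mean of the sampled returns.

```python
import random
from collections import defaultdict


def rollout_targets(env_step, sample_start, pi_samp, pi_k, h, H, n, rng):
    """Collect n pairs (s_h, G), where G is the return of pi_k from time h.

    Each G is an unbiased estimate of V_h^k(s_h), with s_h drawn from the
    distribution induced by running pi_samp for h steps from the start.
    """
    data = []
    for _ in range(n):
        s = sample_start(rng)
        for i in range(h):  # roll in with the known sampling policy
            s, _ = env_step(s, pi_samp(i, s), rng)
        s_h, ret = s, 0.0
        for t in range(h, H):  # roll out with constituent policy k
            s, r = env_step(s, pi_k(t, s), rng)
            ret += r
        data.append((s_h, ret))
    return data


def tabular_least_squares(data):
    """Squared-error ERM over tabular functions: the mean target per state."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, y in data:
        sums[s] += y
        counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}


def env_step(s, a, rng):
    # Hypothetical two-state dynamics: action 0 stays, action 1 flips the
    # state. Only staying in state 1 is rewarded.
    reward = 1.0 if (s == 1 and a == 0) else 0.0
    return (s if a == 0 else 1 - s), reward


rng = random.Random(0)
data = rollout_targets(
    env_step,
    sample_start=lambda rng: rng.choice([0, 1]),
    pi_samp=lambda i, s: 1,  # sampling policy: always move
    pi_k=lambda t, s: 0,     # constituent policy: always stay
    h=1, H=3, n=200, rng=rng,
)
V_hat = tabular_least_squares(data)
```

With a richer function class (for example a neural network, as the text suggests), only `tabular_least_squares` would change; the data collection step stays the same.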

Lastly, we define our benchmark class of policies. Given a set of constituent policies $\Pi^{k}$, our benchmark defines for each state and time a set of permissible actions: any action taken by a policy $\pi^{t}\in\Pi^{k}$ for which the value $V_{h}^{t}(s)$ is sufficiently close to the maximum value $\max_{k\in[K]}V_{h}^{k}(s)$. The class of approximate max-following policies then consists of every policy that exclusively takes permissible actions. We refer the reader to Section 4 for further explanation of this benchmark.

Definition 2.3 (Approximate max-following policies).

We define a set of $\beta$-good policies at state $s\in\mathcal{S}$ and time $h\in[H]$, selected from a set $\Pi^{k}$, as follows.

$$T_{\beta,h}(s)=\{\pi\in\Pi^{k}:V_{h}^{\pi}(s)\geq\max_{k\in[K]}V_{h}^{k}(s)-\beta\}.$$

Then we define the set of approximate max-following policies for $\Pi^{k}$ to be

$$\Pi^{k^{*}}_{\beta}=\{\pi:\forall h\in[H],\forall s\in\mathcal{S},\pi_{h}(s)=\pi_{h}^{t}(s)\text{ for some }\pi^{t}\in T_{\beta,h}(s)\}.$$
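As a small illustration of Definition 2.3, the sketch below computes the indices of the $\beta$-good policies at one fixed state and time, given the list of values $V_{h}^{k}(s)$ for $k\in[K]$. The function name and the numbers are hypothetical; any policy that always follows one of the returned indices is an approximate max-following policy.

```python
def beta_good_set(values_at_s, beta):
    """Return indices k with V_h^k(s) >= max_k V_h^k(s) - beta.

    values_at_s[k] is the value of constituent policy k at a fixed (s, h).
    """
    best = max(values_at_s)
    return [k for k, v in enumerate(values_at_s) if v >= best - beta]


values_at_s = [0.9, 0.85, 0.2]               # hypothetical values of 3 policies
exact = beta_good_set(values_at_s, 0.0)      # only the true maximizer
relaxed = beta_good_set(values_at_s, 0.1)    # near-maximal policies also allowed
```

Setting `beta` to 0 recovers the set of exact maximizers used in Definition 2.1, so the approximate class contains the max-following class.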

3 The $\mathsf{MaxIteration}$ learning algorithm

In this section, we introduce our algorithm for learning an approximate max-following policy, $\mathsf{MaxIteration}$ (Algorithm 1). This algorithm learns a good approximation of a max-following policy at step $h$, assuming access to a good approximation of a max-following policy for all previous steps.

For the first step ($h=0$), the algorithm learns a good approximation $\hat{V}^{k}_{0}$ for all constituent policies $\pi^{k}$ on the starting distribution $\mu_{0}$. These approximate value functions can in turn be used to define the first action taken by the approximate max-following policy, namely $\hat{\pi}_{0}(s)=\pi^{\operatorname*{argmax}_{k}\hat{V}^{k}_{0}(s)}(s)$.
Following $\hat{\pi}_{0}$ from $\mu_{0}$ generates a samplable distribution over states $\mu_{1}(s)=\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}[P(s|s_{0},\hat{\pi}_{0}(s_{0}))]$, and so our oracle assumption allows us to obtain good estimates $\hat{V}^{k}_{1}$ with respect to $\mu_{1}$ for all $\pi^{k}$. We can then define the second action of the approximate max-following policy, and so on, for all $H$ steps.
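The incremental procedure described in the two paragraphs above can be sketched in a few lines. This is a hedged illustration and not the authors' implementation: it assumes a small tabular environment exposed through a hypothetical simulator `env_step(s, a, rng)`, deterministic constituent policies given as callables `(time, state) -> action`, and it replaces the regression oracle by per-state averaging of Monte Carlo returns. States never visited during fitting default to an estimate of 0.0, which is an arbitrary choice made only for this sketch.

```python
import random
from collections import defaultdict


def max_iteration(env_step, sample_start, constituents, H, n, rng):
    """Fit value estimates one time step at a time and follow the argmax."""
    K = len(constituents)
    V_hat = []  # V_hat[h][k] maps state -> estimated value of policy k from time h

    def learned_action(i, s):
        # Follow the constituent policy with the highest estimated value at (i, s).
        k_best = max(range(K), key=lambda k: V_hat[i][k].get(s, 0.0))
        return constituents[k_best](i, s)

    for h in range(H):
        estimates = []
        for k in range(K):
            sums, counts = defaultdict(float), defaultdict(int)
            for _ in range(n):
                s = sample_start(rng)
                for i in range(h):  # roll in with the policy learned so far
                    s, _ = env_step(s, learned_action(i, s), rng)
                s_h, ret = s, 0.0
                for t in range(h, H):  # roll out with constituent policy k
                    s, r = env_step(s, constituents[k](t, s), rng)
                    ret += r
                sums[s_h] += ret
                counts[s_h] += 1
            # Stand-in for the oracle: per-state mean return (tabular least squares).
            estimates.append({s: sums[s] / counts[s] for s in sums})
        V_hat.append(estimates)
    return learned_action, V_hat


def env_step(s, a, rng):
    # Hypothetical deterministic dynamics: action 0 stays, action 1 flips the
    # state. Being in state 1 pays reward 1 regardless of the action.
    return (s if a == 0 else 1 - s), (1.0 if s == 1 else 0.0)


stay = lambda t, s: 0
move = lambda t, s: 1
rng = random.Random(0)
H = 3
policy, V_hat = max_iteration(env_step, lambda rng: 0, [stay, move], H, n=5, rng=rng)

# Evaluate the learned policy from the start state 0.
s, learned_return = 0, 0.0
for t in range(H):
    s, r = env_step(s, policy(t, s), rng)
    learned_return += r
```

In this toy environment, always staying earns 0 from state 0 and always moving earns 1, while the learned policy moves once and then stays, earning 2. This mirrors the claim that a max-following policy can be strictly better than every constituent policy.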

Algorithm 1 $\mathsf{MaxIteration}^{\mathcal{M}}_{\alpha}(\Pi^{k})$
for $h\in[H]$ do
    for $k\in[K]$ do
        let $\mu_{h}$ be the distribution sampled by executing the following procedure:
            sample a starting state $s_{0}\sim\mu_{0}$
            for $i\in[h]$ do
                $s_{i+1}\sim P(\,\cdot\mid s_{i},\pi^{\operatorname*{argmax}_{k}\hat{V}^{k}_{i}(s_{i})}(s_{i}))$
            end for
            output $s_{h}$
        $\hat{V}^{k}_{h}\leftarrow\mathcal{O}_{\alpha}^{k}(\mu_{h},h)$
    end for
end for
return policy $\hat{\pi}=\{\hat{\pi}_{h}\}_{h\in[H]}$ where $\hat{\pi}_{h}(s)=\pi^{\operatorname*{argmax}_{k\in[K]}\hat{V}^{k}_{h}(s)}(s)$
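The loop structure of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the two-state toy MDP, the stationary constituent policies, and the helper names `max_iteration`, `make_mc_oracle`, and `average_return` are all hypothetical. The oracle $\mathcal{O}^{k}_{\alpha}$ is stood in for by Monte Carlo rollouts with per-state averaging, which is the empirical squared-error minimizer over tabular functions; a realistic setting would use a regression oracle over a richer function class.

```python
import random

def max_iteration(sample_start, step, policies, H, oracle):
    """Sketch of Algorithm 1; returns pi_hat(h, s) -> action."""
    K = len(policies)
    V_hat = []  # V_hat[h][k](s) estimates V_h^k(s)

    def pi_hat(h, s):
        # Follow the constituent policy with the largest estimated value.
        k_best = max(range(K), key=lambda k: V_hat[h][k](s))
        return policies[k_best](s)

    def make_sampler(h):
        def sample_mu_h():
            # Roll the current approximate max-following policy forward h steps.
            s = sample_start()
            for i in range(h):
                s, _ = step(s, pi_hat(i, s))
            return s
        return sample_mu_h

    for h in range(H):
        sampler = make_sampler(h)
        # One oracle query per (h, k) pair: O(HK) queries in total.
        V_hat.append([oracle(k, sampler, h) for k in range(K)])
    return pi_hat

def make_mc_oracle(step, policies, H, n_samples=2000):
    """Hypothetical stand-in for the ERM oracle: tabular Monte Carlo averages."""
    def oracle(k, sample_mu_h, h):
        totals, counts = {}, {}
        for _ in range(n_samples):
            s0 = s = sample_mu_h()
            ret = 0.0
            for _ in range(h, H):  # follow policy k for the remaining steps
                s, r = step(s, policies[k](s))
                ret += r
            totals[s0] = totals.get(s0, 0.0) + ret
            counts[s0] = counts.get(s0, 0) + 1
        table = {s: totals[s] / counts[s] for s in totals}
        return lambda s: table.get(s, 0.0)
    return oracle

# Toy MDP (an assumption for illustration): two states, two actions,
# reward 1 when the action matches the state, uniformly random next state.
def sample_start():
    return random.randint(0, 1)

def step(s, a):
    reward = 1.0 if a == s else 0.0
    return random.randint(0, 1), reward

policies = [lambda s: 0, lambda s: 1]  # each constituent always plays one action

def average_return(act, H, episodes=2000):
    total = 0.0
    for _ in range(episodes):
        s = sample_start()
        for h in range(H):
            s, r = step(s, act(h, s))
            total += r
    return total / episodes

random.seed(0)
H = 5
pi_hat = max_iteration(sample_start, step, policies, H,
                       make_mc_oracle(step, policies, H))
learned = average_return(pi_hat, H)
baseline = max(average_return(lambda h, s, p=p: p(s), H) for p in policies)
```

In this toy example each constituent policy is right about half the time, so `baseline` should be near $H/2$, while the learned max-following policy should collect a return close to $H$. The gap illustrates how max-following can improve substantially on the best constituent policy.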
Theorem 3.1.

For any $\varepsilon\in(0,1]$, any MDP $\mathcal{M}$ with starting state distribution $\mu_{0}$, any episode length $H$, and any $K$ policies $\Pi^{k}$ defined on $\mathcal{M}$, let $\alpha\in\Theta(\tfrac{\varepsilon^{3}}{KH^{4}})$ and $\beta\in\Theta(\tfrac{\varepsilon}{H})$. Then $\mathsf{MaxIteration}^{\mathcal{M}}_{\alpha}(\Pi^{k})$ makes $O(HK)$ oracle queries and outputs $\hat{\pi}$ such that

\[
\mathbb{E}_{s_{0}\sim\mu_{0}}\left[V^{\hat{\pi}}(s_{0})\right]\geq\min_{\pi\in\Pi^{k^{*}}_{\beta}}\mathbb{E}_{s_{0}\sim\mu_{0}}\left[V^{\pi}(s_{0})\right]-O(\varepsilon).
\]
Proof.

For all $h\in[H]$, $k\in[K]$, let $\hat{V}_{h}^{k}$ denote the approximate value function obtained from $\mathcal{O}_{\alpha}^{k}(\mu_{h},h)$ in Algorithm 1. We then define, for every $h\in[H]$, the set of states for which some approximate value function $\hat{V}^{k}_{h}(s)$ has large absolute error ($B_{h}$) and the set of bad trajectories ($B_{\tau}$) that pass through a state in $B_{h}$ for any $h\in[H]$: $B_{h}=\{s\in S:\exists k\in[K]\text{ s.t. }|\hat{V}_{h}^{k}(s)-V_{h}^{k}(s)|\geq\tfrac{\varepsilon}{2H}\}$ and $B_{\tau}=\{\{(s_{h},a_{h})\}_{h\in[H]}:\exists h\in[H]\text{ s.t. }s_{h}\in B_{h}\}$.
We will show that there exists an approximate max-following policy $\pi\in\Pi^{k^{*}}_{\beta}$ such that for any trajectory $\tau^{\prime}\not\in B_{\tau}$, $\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[\tau=\tau^{\prime}]=\Pr_{\tau\sim\pi(\mu_{0})}[\tau=\tau^{\prime}]$.
We then bound the probability $\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[\tau\in B_{\tau}]$, and the contribution to $\mathbb{E}_{s_{0}\sim\mu_{0}}\left[V^{\pi}(s_{0})\right]$ from these trajectories, proving the claim.

Let $V_{h}^{k^{*}}(s)$ denote the value of the policy that $\hat{\pi}$ follows at time $h$ and state $s$. From the definition of the bad set $B_{h}$ and the setting of $\beta\in\Theta(\tfrac{\varepsilon}{H})$, for any state $s\not\in B_{h}$,

\[
V^{k^{*}}_{h}(s)\geq\hat{V}^{k^{*}}_{h}(s)-\tfrac{\varepsilon}{2H}\geq\max_{k\in[K]}\hat{V}^{k}_{h}(s)-\tfrac{\varepsilon}{2H}\geq\max_{k\in[K]}V^{k}_{h}(s)-\beta.
\]

In other words, if a state $s$ is not bad at time $h$, then $\hat{\pi}_{h}(s)=\pi_{h}^{k}(s)$ for a policy $\pi^{k}$ that has value $V_{h}^{k}(s)$ within $\beta$ of the true max value $\max_{k\in[K]}V_{h}^{k}(s)$.
It then follows from the definition of the class of approximate max-following policies $\Pi^{k^{*}}_{\beta}$ (Definition 2.3) that there exists some $\pi\in\Pi^{k^{*}}_{\beta}$ such that for all $h\in[H]$, for all $s\not\in B_{h}$, $\hat{\pi}_{h}(s)=\pi_{h}(s)$.

For any trajectory $\tau^{\prime}$, $\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[\tau=\tau^{\prime}]=\Pr_{\mu_{0}}[s_{0}]\cdot\prod_{h=0}^{H-1}P(s_{h+1}\mid s_{h},\hat{\pi}_{h}(s_{h}))$.
Then for any trajectory $\tau^{\prime}\not\in B_{\tau}$, $\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[\tau=\tau^{\prime}]=\Pr_{\tau\sim\pi(\mu_{0})}[\tau=\tau^{\prime}]$, and therefore

\[
\mathbb{E}_{\tau\sim\hat{\pi}(\mu_{0})}\left[\sum_{h=0}^{H-1}R(s_{h},a_{h})\mid\tau\not\in B_{\tau}\right]=\mathbb{E}_{\tau\sim\pi(\mu_{0})}\left[\sum_{h=0}^{H-1}R(s_{h},a_{h})\mid\tau\not\in B_{\tau}\right].
\]

For $\tau\in B_{\tau}$, we have lower and upper bounds $\mathbb{E}_{\tau\sim\hat{\pi}(\mu_{0})}[\sum_{h=0}^{H-1}R(s_{h},a_{h})\mid\tau\in B_{\tau}]\geq 0$ and $\mathbb{E}_{\tau\sim\pi(\mu_{0})}[\sum_{h=0}^{H-1}R(s_{h},a_{h})\mid\tau\in B_{\tau}]\leq H$. We can then write:

\begin{align*}
\mathbb{E}_{s_{0}\sim\mu_{0}}\left[V^{\hat{\pi}}(s_{0})\right]
&=\mathbb{E}_{\tau\sim\hat{\pi}(\mu_{0})}\left[\sum_{h=0}^{H-1}R(s_{h},a_{h})\mid\tau\not\in B_{\tau}\right]\cdot\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[\tau\not\in B_{\tau}]\\
&\quad+\mathbb{E}_{\tau\sim\hat{\pi}(\mu_{0})}\left[\sum_{h=0}^{H-1}R(s_{h},a_{h})\mid\tau\in B_{\tau}\right]\cdot\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[\tau\in B_{\tau}]\\
&\geq\mathbb{E}_{\tau\sim\hat{\pi}(\mu_{0})}\left[\sum_{h=0}^{H-1}R(s_{h},a_{h})\mid\tau\not\in B_{\tau}\right]\cdot\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[\tau\not\in B_{\tau}]\\
&=\mathbb{E}_{\tau\sim\pi(\mu_{0})}\left[\sum_{h=0}^{H-1}R(s_{h},a_{h})\mid\tau\not\in B_{\tau}\right]\cdot\Pr_{\tau\sim\pi(\mu_{0})}[\tau\not\in B_{\tau}]\\
&\geq\mathbb{E}_{\tau\sim\pi(\mu_{0})}\left[\sum_{h=0}^{H-1}R(s_{h},a_{h})\right]-H\cdot\Pr_{\tau\sim\pi(\mu_{0})}[\tau\in B_{\tau}]\\
&\geq\min_{\pi\in\Pi^{k^{*}}_{\beta}}\mathbb{E}_{s_{0}\sim\mu_{0}}[V^{\pi}(s_{0})]-H\cdot\Pr_{\tau\sim\pi(\mu_{0})}[\tau\in B_{\tau}].
\end{align*}

It remains to upper-bound $\Pr_{\tau\sim\pi(\mu_{0})}[\tau\in B_{\tau}]$. We have already argued $\Pr_{\tau\sim\pi(\mu_{0})}[\tau\in B_{\tau}]=\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[\tau\in B_{\tau}]$.
Observing that $\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[\tau\in B_{\tau}]\leq\sum_{h=0}^{H-1}\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[s_{h}\in B_{h}]$, it is sufficient to show $\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[s_{h}\in B_{h}]\in O(\tfrac{\varepsilon}{H^{2}})$ to prove the claim.
For all $h\in[H]$, let $\mu_{h}(s)=\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[s_{h}=s]$, and note that this is the distribution supplied to the oracle at iteration $h$ of Algorithm 1. It follows from our oracle assumption (Definition 2.2) that for all $k\in[K]$, $\mathbb{E}_{s_{h}\sim\mu_{h}}[(\hat{V}^{k}(s_{h})-V^{k}(s_{h}))^{2}]<\alpha$. We apply Markov's inequality to conclude that for all $k\in[K]$,

\[
\Pr_{s_{h}\sim\mu_{h}}\left[|\hat{V}_{h}^{k}(s_{h})-V_{h}^{k}(s_{h})|\geq\tfrac{\varepsilon}{2H}\right]<\tfrac{4\alpha H^{2}}{\varepsilon^{2}}\in O(\tfrac{\varepsilon}{KH^{2}}).
\]

Union bounding over the $K$ constituent policies gives $\Pr_{s_{h}\sim\mu_{h}}[s_{h}\in B_{h}]\in O(\tfrac{\varepsilon}{H^{2}})$, from the definition of $B_{h}$. Union bounding over the trajectory length $H$, we then have $\Pr_{\tau\sim\hat{\pi}(\mu_{0})}[\tau\in B_{\tau}]\in O(\tfrac{\varepsilon}{H})$. It follows that

$$\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{\hat{\pi}}(s_{0})\right]\geq\min_{\pi\in\Pi^{k^{*}}_{\beta}}\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{\pi}(s_{0})\right]-O(\varepsilon),$$

completing the proof. ∎
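The Markov step in the proof above can be unpacked as follows. This is only a sketch: it assumes the choice $\alpha\in\Theta(\tfrac{\varepsilon^{3}}{KH^{4}})$ used in Corollary 4.2, and the shorthand $X_{k}=\hat{V}_{h}^{k}(s_{h})-V_{h}^{k}(s_{h})$ is introduced here for brevity and does not appear in the original text.

```latex
\begin{align*}
\Pr_{s_h\sim\mu_h}\!\left[|X_k|\geq\tfrac{\varepsilon}{2H}\right]
  &= \Pr_{s_h\sim\mu_h}\!\left[X_k^{2}\geq\tfrac{\varepsilon^{2}}{4H^{2}}\right]
   \leq \frac{\mathbb{E}_{s_h\sim\mu_h}\!\left[X_k^{2}\right]}{\varepsilon^{2}/(4H^{2})}
   < \frac{4\alpha H^{2}}{\varepsilon^{2}}, \\
\alpha\in\Theta\!\left(\tfrac{\varepsilon^{3}}{KH^{4}}\right)
  &\;\Longrightarrow\;
   \frac{4\alpha H^{2}}{\varepsilon^{2}}\in O\!\left(\tfrac{\varepsilon}{KH^{2}}\right), \\
\text{union bound over } k\in[K]:\quad
  & K\cdot O\!\left(\tfrac{\varepsilon}{KH^{2}}\right)=O\!\left(\tfrac{\varepsilon}{H^{2}}\right), \\
\text{union bound over the } H \text{ steps}:\quad
  & H\cdot O\!\left(\tfrac{\varepsilon}{H^{2}}\right)=O\!\left(\tfrac{\varepsilon}{H}\right).
\end{align*}
```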

4 The approximate max-following benchmark

In this section, we provide additional context for our benchmark class of approximate max-following policies. We show that the worst policy in our benchmark class competes with the best fixed policy from the set of constituent policies. We also provide examples of MDPs that showcase properties of the set of (approximate) max-following policies.

Lemma 4.1 (Worst approximate max-following policy competes with best fixed policy).

For any $\varepsilon\in(0,1]$ and any episode length $H$, let $\beta\in\Theta(\tfrac{\varepsilon}{H})$. Then for any MDP $\mathcal{M}$ with starting state distribution $\mu_{0}$, and any $K$ policies $\Pi^{k}$ defined on $\mathcal{M}$,

$$\min_{\pi\in\Pi^{k^{*}}_{\beta}}\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{\pi}(s_{0})\right]\geq\max_{k\in[K]}\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{k}(s_{0})\right]-O(\varepsilon).$$

We defer the proof of Lemma 4.1 to Appendix B.

It is an immediate corollary of Theorem 3.1 and Lemma 4.1 that the policy learned by Algorithm 1 competes with the best constituent policy.

Corollary 4.2.

For any $\varepsilon\in(0,1]$, any MDP $\mathcal{M}$ with starting state distribution $\mu_{0}$, any episode length $H$, and any $K$ policies $\Pi^{k}$ defined on $\mathcal{M}$, let $\alpha\in\Theta(\tfrac{\varepsilon^{3}}{KH^{4}})$, and let $\hat{\pi}$ denote the policy output by $\mathsf{MaxIteration}^{\mathcal{M}}_{\alpha}(\Pi^{k})$. Then

$$\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{\hat{\pi}}(s_{0})\right]\geq\max_{k\in[K]}\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{k}(s_{0})\right]-O(\varepsilon).$$

We provide diagrams of MDPs as examples for the observations that we make below. States in $\mathcal{S}$ are denoted by the labels on the nodes. Actions in $\mathcal{A}$ are indicated by arrows from given states with deterministic transition dynamics, and the rewards $R(s,a)$ are labeled over the corresponding arrows. Arrows may be omitted for transitions that are self-loops with reward $0$.

Observation 4.3.

The worst approximate max-following policy can be arbitrarily better than the best constituent policy.

[Diagram: MDP with states $s_{0}$, $s_{1}$, $s_{2}$ and arrows labeled with reward 1.]
(a) MDP in which two policies going either only left or right obtain low return but max-following them would be optimal.
[Diagram: MDP with states $s_{0}$, $s_{1}$, $s_{2}$, $s_{3}$, $s_{4}$ and arrows labeled with rewards 0, $\varepsilon$, and 1.]
(b) MDP with $\mathcal{A}=\{\mathsf{right},\mathsf{left},\mathsf{up}\}$ where starting from $s_{2}$, max-following is far worse than optimal and starting from $s_{0}$, different max-following policies have different values (depending on tie-breaking).
Figure 1: Examples of MDPs with max-following policy performance comparison

Consider in Figure 1(a) two policies on this MDP: $\pi^{0}(s)=\mathsf{right}$ and $\pi^{1}(s)=\mathsf{left}$, for all $s\in\mathcal{S}$. Note that for any episode length $H\geq 2$, for all $k\in\{0,1\}$, $\max_{s\in\mathcal{S}}V^{k}(s)=2$. For any $\beta<1$, $\Pi^{k^{*}}_{\beta}$ comprises policies $\pi$ such that $\pi(s_{0})=\mathsf{right}$, $\pi(s_{2})=\mathsf{left}$, and $\pi(s_{1})\in\{\mathsf{right},\mathsf{left}\}$.
Therefore for any episode length $H$, and state $s\in\mathcal{S}$, $\min_{\pi\in\Pi^{k^{*}}_{\beta}}V^{\pi}(s)=H$. In this example, any approximate max-following policy is also an optimal policy, whose gap in expected return with the best constituent policy can be made arbitrarily large by increasing $H$.
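The claim for Figure 1(a) can be checked numerically with a short sketch. The transition table below is an assumption: it is one three-state chain that reproduces the values stated in the text (each constituent has maximum value 2, while max-following earns $H$), not necessarily the exact diagram. The function names and the tie-breaking rule are likewise illustrative.

```python
# Hypothetical three-state chain consistent with the values stated in the text
# (an assumption: the original diagram is not reproduced here).
RIGHT, LEFT = "right", "left"

# (state, action) -> (next_state, reward); unlisted pairs are zero-reward self-loops.
TRANSITIONS = {
    (0, RIGHT): (1, 1.0),
    (1, RIGHT): (2, 1.0),
    (2, LEFT): (1, 1.0),
    (1, LEFT): (0, 1.0),
}


def step(state, action):
    """Deterministic transition; missing entries are self-loops with reward 0."""
    return TRANSITIONS.get((state, action), (state, 0.0))


def value(action, state, steps):
    """Return of the constant policy that always plays `action` for `steps` steps."""
    total = 0.0
    for _ in range(steps):
        state, reward = step(state, action)
        total += reward
    return total


def max_following_return(state, horizon):
    """Roll out exact max-following; ties are broken in favour of RIGHT."""
    total = 0.0
    for h in range(horizon):
        remaining = horizon - h
        best = max((RIGHT, LEFT), key=lambda a: value(a, state, remaining))
        state, reward = step(state, best)
        total += reward
    return total


H = 10  # illustrative episode length
print(max(value(RIGHT, s, H) for s in range(3)))        # best value of pi^0
print([max_following_return(s, H) for s in range(3)])   # max-following returns
```

In this assumed chain the constituent values stay capped at 2 for every $H\geq 2$, whereas the max-following return grows linearly with $H$.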

Observation 4.4.

A max-following policy cannot always compete with an optimal policy.

In Figure 1(b), consider policies $\pi^{0}(s)=\mathsf{right}$, $\pi^{1}(s)=\mathsf{left}$, and $\pi^{2}(s)=\mathsf{up}$, for all $s\in\mathcal{S}$. At state $s_{2}$, $\pi^{0}$ is the only policy with non-zero value. Thus, any max-following policy will take action $\mathsf{right}$ from $s_{2}$, receiving reward $\varepsilon$ and then reward 0 for the remainder of the episode. Given a starting state distribution supported entirely on $s_{2}$, for any episode length $H\geq 3$, the optimal policy will obtain cumulative reward $H-2$, whereas any max-following policy will only obtain reward $\varepsilon$.

Observation 4.5.

Different max-following policies may have different expected cumulative reward.

We again consider Figure 1(b), but suppose now the starting state distribution is supported entirely on $s_{0}$. For all $k\in[3]$, $V^{k}(s_{0})=0$ and so a max-following policy may take any action from $s_{0}$. A max-following policy that always takes actions $\mathsf{left}$ or $\mathsf{up}$ from $s_{0}$ will only ever obtain cumulative reward 0, but a max-following policy that takes action $\mathsf{right}$ will move to $s_{1}$ and (so long as more than one step remains in the episode) will then take action $\mathsf{up}$ and move to state $s_{4}$, where it will stay to obtain cumulative reward $H-2$.

[Diagram: MDP with states $s_{0}$ (start), $s_{1}$, $s_{2}$, $s_{3}$, $s_{4}$, $s_{5}$ and arrows labeled with rewards 0, $\varepsilon$, and 1.]
(a) MDP where small value approximation errors at $s_{0}$ hinder max-following. Arrows representing transition dynamics are color-coded red to indicate actions taken by $\pi^{0}$ and blue to indicate actions taken by $\pi^{1}$.
[Plot: rewards $R(s,-1)$ and $R(s,1)$ against the state; axes: state, reward.]
(b) MDP where the max-following value function is piecewise linear, but the constituent policies' values are affine functions of the state for fixed actions.
Figure 2: Examples for Observation 4.6 and Observation 4.7

If the value functions of constituent policies are exactly known, it is easy to construct a max-following policy, but the learner may not have access to these functions. If the learner only has access to approximations and follows whichever policy has the larger approximate value at the current state, the resulting policy can have much lower expected cumulative reward than the max-following policy. This is true even for state-wise bounds on the value approximation error. This observation previously motivated our definition of the approximate max-following class (Definition 2.3).

Observation 4.6.

Small value function approximation errors can be an obstacle to learning a max-following policy.

In Figure 2(a), we again consider policies $\pi^{0}(s)=\mathsf{right}$ and $\pi^{1}(s)=\mathsf{left}$ for all states $s\in\mathcal{S}$, color coding the actions taken by $\pi^{0}$ with red and those taken by $\pi^{1}$ with blue. For a starting state distribution supported entirely on $s_{0}$, a max-following policy $\pi$ will take action $\pi(s_{0})=\mathsf{left}$, $\pi(s_{2})=\mathsf{right}$, and $\pi(s_{3})=\mathsf{left}$ for the remainder of the episode, obtaining reward $H-2+2\varepsilon$.
However, given only approximate value functions $\hat{V}^{k}$ with state-wise absolute error bound $|\hat{V}_{h}^{k}(s)-V_{h}^{k}(s)|\leq\varepsilon$ for all states $s$ and times $h$, the policy $\hat{\pi}$ that takes action $\pi_{h}^{k^{*}}(s)$ for $k^{*}=\operatorname*{argmax}_{k\in[2]}\hat{V}_{h}^{k}(s)$ can have much lower expected cumulative reward than a max-following policy.
For example, if $\hat{V}^{0}_{0}(s_{0})=\varepsilon$ and $\hat{V}^{1}_{0}(s_{0})=0$ in our Figure 2(a) example, then $\hat{\pi}$ will have expected return 0.
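To make the failure mode concrete, the sketch below applies the greedy rule $k^{*}=\operatorname{argmax}_{k}\hat{V}^{k}(s)$ once to true values and once to estimates. The estimates are the ones from the example above; the true values at $s_{0}$ (0 for $\pi^{0}$ and $\varepsilon$ for $\pi^{1}$) and the numeric value of `EPS` are assumptions chosen to be consistent with it, and `follow_index` is a hypothetical helper name.

```python
EPS = 0.1  # illustrative error level (hypothetical value)


def follow_index(values):
    """Index of the constituent policy with the largest (estimated) value."""
    return max(range(len(values)), key=lambda k: values[k])


# Assumed true values at s0: pi^0 earns 0, pi^1 earns EPS.
true_values = [0.0, EPS]
# Estimates from the text's example; each lies within EPS of the assumed truth.
approx_values = [EPS, 0.0]

assert all(abs(a - t) <= EPS for a, t in zip(approx_values, true_values))
print(follow_index(true_values))    # 1: max-following picks pi^1 (left)
print(follow_index(approx_values))  # 0: the estimates pick pi^0 (right)
```

Although both estimates respect the state-wise error bound, the selected constituent flips, which is the situation the approximate max-following class is designed to tolerate.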

Observation 4.7.

A max-following policy's value function is not always of the same parametric class as the constituent policies' value functions.

As a simple first example, consider an MDP with states $\mathcal{S}=[0,1]$ and actions $\mathcal{A}=\{-1,1\}$. Every action leads to a self-loop (for all $a\in\mathcal{A}$, $P(s|s,a)=1$) and for a fixed action, rewards are affine functions of the state (e.g. $R(s,-1)=1-s$ and $R(s,1)=s$). We consider two policies: $\pi^{0}(s)=-1$ and $\pi^{1}(s)=1$ for all $s\in\mathcal{S}$. Notice that for episode length $H$, $V^{0}(s)=HR(s,-1)$ and $V^{1}(s)=HR(s,1)$. Since the dynamics keep the state fixed regardless of the action, the value of the max-following policy at state $s$ is simply the maximum of the two constituent value functions at $s$. It is therefore piecewise linear, unlike the constituent policies' value functions, which are affine (see Figure 2(b)). To provide a more complex MDP example, we consider a traditional control problem with continuous state and action spaces: the discrete linear quadratic regulator.
In this example the constituent linear policies have quadratic value functions, but the max-following policy is not of the same parametric class. See Appendix A for further discussion.
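The self-loop example can be verified with a few lines of code. This is a minimal sketch using the rewards $R(s,-1)=1-s$ and $R(s,1)=s$ from the text; the episode length `H = 5` and the helper `passes_midpoint_test` are illustrative choices, and the midpoint identity is only a necessary condition for a function to be affine.

```python
H = 5  # illustrative episode length (hypothetical value)


def v0(s):
    """Value of pi^0 (always action -1): H * R(s, -1) = H * (1 - s)."""
    return H * (1.0 - s)


def v1(s):
    """Value of pi^1 (always action 1): H * R(s, 1) = H * s."""
    return H * s


def v_max(s):
    """Max-following value: the state never changes, so it is the pointwise max."""
    return max(v0(s), v1(s))


def passes_midpoint_test(f, tol=1e-9):
    """Every affine f satisfies f(0.5) == (f(0) + f(1)) / 2; check that identity."""
    return abs(f(0.5) - 0.5 * (f(0.0) + f(1.0))) <= tol


print(passes_midpoint_test(v0), passes_midpoint_test(v1))  # both constituents pass
print(passes_midpoint_test(v_max))                         # the pointwise max fails
```

The pointwise maximum has a kink at $s=0.5$, so it fails the midpoint identity and cannot be represented by the affine class that contains $V^{0}$ and $V^{1}$.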

5 Experiments

We proceed to examine our $\mathsf{MaxIteration}$ algorithm in a set of experiments that use neural network function approximators as oracles. These experiments aim to provide a scenario that demonstrates the usefulness of max-following. While previous works in this line of research have studied the ability to integrate knowledge from the constituent policies to increase the performance of a learnable policy (Cheng et al., 2020; Liu et al., 2023, 2024), our algorithm offers an alternative approach. We consider a common scenario from the field of robotics where one has access to older policies from a robotic simulator that were used in previous projects. As long as the dynamics of the MDP of interest do not differ, such old policies can simply be re-used in new applications. In such cases, training completely from scratch can be incredibly expensive due to the vast search space (Schulman et al., 2017; Haarnoja et al., 2018). We note that this setup is related to the one used by Barreto et al. (2017, 2020), but we do not put any constraints on the reward functions.

Experimental setup   A recent robotic simulation benchmark called CompoSuite (Mendez et al., 2022) and its corresponding offline datasets (Hussing et al., 2023) offer an instantiation of such a scenario. CompoSuite consists of four axes: robot arms, objects, objectives and obstacles. Tasks are constructed by combining one element from each axis. We consider tasks with a fixed IIWA robotic manipulator and no obstacle. This leaves us with a total of 16 tasks. These 16 tasks are randomly grouped into pairs. Each pair defines one experiment in which the policies trained on the pair's two tasks serve as our constituents. To create a new target task, we change one element per task, creating novel combinations for each group. For example, we start with the constituent policies that can 1) put and place a box into a trashcan and 2) push a plate. The target task can be to push the box. We train our constituent policies on the expert datasets using the offline RL algorithm Implicit Q-learning (IQL) (Kostrikov et al., 2022). This ensures we obtain very strong constituent policies for their respective tasks. After training the constituents, we run $\mathsf{MaxIteration}$ and the baselines for a short amount of time in the simulator. We report mean performance and standard error over 5 seeds using an evaluation of $32$ episodes.
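The task construction described above can be sketched as follows. This is an illustration only: the text mentions just the box and plate objects and the push and trashcan objectives, so the remaining element names are placeholders, the 4 x 4 split of the 16 tasks is an assumption, and `build_groups` and its seed are hypothetical rather than the procedure actually used.

```python
import itertools
import random

# Only "box", "plate", "push", and "trashcan" appear in the text; the remaining
# names are placeholders, and the 4 x 4 split of the 16 tasks is an assumption.
OBJECTS = ["box", "plate", "object_3", "object_4"]
OBJECTIVES = ["push", "trashcan", "objective_3", "objective_4"]


def build_groups(seed=0):
    """Enumerate tasks for a fixed robot and no obstacle, then pair them at random."""
    tasks = list(itertools.product(OBJECTS, OBJECTIVES))
    rng = random.Random(seed)
    rng.shuffle(tasks)
    # Consecutive tasks in the shuffled order form one experiment group.
    return [(tasks[i], tasks[i + 1]) for i in range(0, len(tasks), 2)]


groups = build_groups()
print(len(groups))  # number of experiment groups, two constituent tasks each
```

Under these assumptions each of the 16 tasks lands in exactly one of 8 groups, and a target task for a group would then be formed by swapping one element of a constituent task.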

Algorithms   For practical purposes, we use a heuristic version of $\mathsf{MaxIteration}$ which does not re-compute the max-following policy at every step $h$ but rather after multiple steps. For our baselines, we ran the code provided by Liu et al. (2023) to train the MAPS algorithm but were unable to obtain non-trivial return even after a reasonable amount of tuning. MAPS has been shown to have difficulties with leveraging very performant constituent policies such as the ones we are using (see the Walker experiment in Figure 1(d) of Liu et al. (2023), in which the algorithm struggles to be competitive with the best, high-return constituent policy). They conjecture that in this case, their estimates of the constituent value functions will be less accurate in early training, resulting in gradient estimates with large bias and variance, weakening their convergence guarantees.

We provide an evaluation of $\mathsf{MaxIteration}$ on tasks originally used by Liu et al. (2023) in Appendix C.3.

For now, we opt to use IQL's fine-tuning capabilities, which offer a policy-improvement-style method on top of the best-performing constituent policy, for comparison. Fine-tuning provides a strong baseline in the sense that it has access to the already trained value functions of the constituent policies, providing it with inherently more starting information. For comparability, we limit the number of episodes available for fine-tuning to the same number of episodes available for training $\mathsf{MaxIteration}$. For more details we refer to Appendix C.

Experimental Results

Figure 3 contains a set of demonstrative results. The full results are deferred to Appendix C. The selected results in Figure 3 highlight three properties of $\mathsf{MaxIteration}$:

  1. There are cases where max-following not only increases the return but actually leads to solving a task successfully even when none of the constituent policies achieve success.

  2. With successful constituent policies, max-following can significantly increase the success rate.

  3. Max-following can sometimes increase return but not necessarily lead to success, demonstrating the need to better understand which attributes make up good constituent policies in the future.

Figure 3: Mean cumulative return and success over $5$ seeds of $\mathsf{MaxIteration}$ compared to fine-tuning IQL on selected tasks. Error bars correspond to standard error. Full bars correspond to returns and red lines indicate the success rate of each algorithm. $\mathsf{MaxIteration}$ can yield improvements in return but increased return does not always yield success.

The results in Appendix C demonstrate that in all cases, $\mathsf{MaxIteration}$ is at least as good as the best constituent policy, which is not the case for algorithms from prior work (Liu et al., 2023), as discussed earlier. Moreover, $\mathsf{MaxIteration}$ consistently leads to greater return improvement than fine-tuning given the same amount of data. Fine-tuning with substantially more resources would eventually surpass the performance of $\mathsf{MaxIteration}$, as $\mathsf{MaxIteration}$ is limited to competing with the max-following benchmark, which can be suboptimal.

6 Conclusion

We introduce $\mathsf{MaxIteration}$, an algorithm to efficiently learn a policy that is competitive with the approximate max-following benchmark (and hence also with all constituent policies). We provide empirical evidence that max-following over previously learned skills enables us to complete tasks that would be inefficient to learn from scratch, and that the resulting policies are superior to the individually trained experts for the given fixed skills.

Limitations and Future Work

Our goal in this work has been to learn a policy that competes with an approximate max-following policy under minimal assumptions. However, we still assume efficient batch learnability of constituent value functions, which will not always be feasible in practice. While it seems likely that our oracle assumption is necessary for learning an approximate max-following policy, we leave proving this claim for future work. We also leave consideration of alternative ensembling approaches to future work. Max-value ensembling is sensitive to slight differences in the values between constituent policies whereas, e.g., softmax takes into account the relative 'weighting' of values. In addition, it would be interesting to characterize the amount of improvement we can obtain over our constituent policies or prove conditions under which our approximate max-following policy is competitive with a true max-following policy or the optimal policy. One could also extend this analysis to ensembling methods like softmax and study the nature of guarantees in that setting. Extending beyond MDPs to the partially observable setting, and to the discounted infinite-horizon setting, would also add richness to the class of problems we could consider.

Acknowledgements and Disclosure of Funding

The authors are partially supported by ARO grant W911NF2010080, DARPA grant HR001123S0011, the Simons Foundation Collaboration on Algorithmic Fairness, and NSF grants FAI-2147212 and CCF-2217062.

References

  • Amit et al. [2020] Ron Amit, Ron Meir, and Kamil Ciosek. Discount factor as a regularizer in reinforcement learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 269–278. PMLR, 13–18 Jul 2020.
  • Barreto et al. [2017] Andre Barreto, Will Dabney, Remi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, 2017.
  • Barreto et al. [2020] André Barreto, Shaobo Hou, Diana Borsa, David Silver, and Doina Precup. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 117(48):30079–30087, 2020. doi: 10.1073/pnas.1907370117.
  • Bertsekas [2012] Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 4. Athena scientific, 2012.
  • Brafman and Tennenholtz [2002] Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
  • Brukhim et al. [2022] Nataly Brukhim, Elad Hazan, and Karan Singh. A boosting approach to reinforcement learning. Advances in Neural Information Processing Systems, 35:33806–33817, 2022.
  • Chang et al. [2015] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning, pages 2058–2066. PMLR, 2015.
  • Chen et al. [2021] Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross. Randomized ensembled double q-learning: Learning fast without a model. In International Conference on Learning Representations, 2021.
  • Cheng et al. [2020] Ching-An Cheng, Andrey Kolobov, and Alekh Agarwal. Policy improvement via imitation of multiple oracles. Advances in Neural Information Processing Systems, 33:5587–5598, 2020.
  • Domingues et al. [2021] Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pages 578–598. PMLR, 2021.
  • Du et al. [2019] Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.
  • Freund and Schapire [1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • Glorot et al. [2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
  • Golowich et al. [2024] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Exploration is harder than prediction: Cryptographically separating reinforcement learning from supervised learning. arXiv preprint arXiv:2404.03774, 2024.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870. PMLR, 10–15 Jul 2018.
  • Hiraoka et al. [2022] Takuya Hiraoka, Takahisa Imagawa, Taisei Hashimoto, Takashi Onishi, and Yoshimasa Tsuruoka. Dropout q-functions for doubly efficient reinforcement learning. In International Conference on Learning Representations, 2022.
  • Hussing et al. [2023] Marcel Hussing, Jorge A. Mendez, Anisha Singrodia, Cassandra Kent, and Eric Eaton. Robotic manipulation datasets for offline compositional reinforcement learning, 2023.
  • Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010.
  • Kane et al. [2022] Daniel Kane, Sihan Liu, Shachar Lovett, and Gaurav Mahajan. Computational-statistical gap in reinforcement learning. In Conference on Learning Theory, pages 1282–1302. PMLR, 2022.
  • Kearns and Singh [2002] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49:209–232, 2002.
  • Kostrikov et al. [2022] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=68n2s9ZJWF8.
  • Kurenkov et al. [2019] Andrey Kurenkov, Ajay Mandlekar, Roberto Martin-Martin, Silvio Savarese, and Animesh Garg. Ac-teach: A bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers. arXiv preprint arXiv:1909.04121, 2019.
  • Lattimore and Hutter [2012] Tor Lattimore and Marcus Hutter. Pac bounds for discounted mdps. In Algorithmic Learning Theory, pages 320–334, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-34106-9.
  • Lee et al. [2021] Kimin Lee, Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In International Conference on Machine Learning. PMLR, 2021.
  • Li et al. [2018] Guohao Li, Matthias Mueller, Vincent Casser, Neil Smith, Dominik L Michels, and Bernard Ghanem. Oil: Observational imitation learning. arXiv preprint arXiv:1803.01129, 2018.
  • Liu et al. [2023] Xuefeng Liu, Takuma Yoneda, Chaoqi Wang, Matthew Walter, and Yuxin Chen. Active policy improvement from multiple black-box oracles. In International Conference on Machine Learning, pages 22320–22337. PMLR, 2023.
  • Liu et al. [2024] Xuefeng Liu, Takuma Yoneda, Rick Stevens, Matthew Walter, and Yuxin Chen. Blending imitation and reinforcement learning for robust policy improvement. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=eJ0dzPJq1F.
  • Mendez et al. [2022] Jorge A. Mendez, Marcel Hussing, Meghna Gummadi, and Eric Eaton. Composuite: A compositional reinforcement learning benchmark. In 1st Conference on Lifelong Learning Agents, 2022.
  • Peer et al. [2021] Oren Peer, Chen Tessler, Nadav Merlis, and Ron Meir. Ensemble bootstrapping for q-learning. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8454–8463. PMLR, 18–24 Jul 2021.
  • Puterman [1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994. ISBN 0471619779.
  • Schulman et al. [2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
  • Seno and Imai [2022] Takuma Seno and Michita Imai. d3rlpy: An offline deep reinforcement learning library. Journal of Machine Learning Research, 23(315):1–20, 2022. URL http://jmlr.org/papers/v23/22-0017.html.
  • Song et al. [2023] Yanjie Song, Ponnuthurai Nagaratnam Suganthan, Witold Pedrycz, Junwei Ou, Yongming He, Yingwu Chen, and Yutong Wu. Ensemble reinforcement learning: A survey. Applied Soft Computing, page 110975, 2023.
  • Sun et al. [2017] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In International conference on machine learning, pages 3309–3318. PMLR, 2017.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tunyasuvunakool et al. [2020] Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020. ISSN 2665-9638. doi: https://doi.org/10.1016/j.simpa.2020.100022.
  • Wu et al. [2018] Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.

Appendix A MDP Examples

A.1 LQR max-following parametric class vs. constituent policies

$$
\begin{aligned}
\min_{\{u_t\}_{t=0}^{\infty}} \quad & \sum_{t=0}^{\infty} \gamma^{t}\left(x_t^{T} Q x_t + u_t^{T} R u_t\right) \\
\text{subject to} \quad & x_{t+1} = A x_t + B u_t + w_t,
\end{aligned}
$$

To motivate the use of max-following policies in a richer class of MDPs, we consider a traditional control problem with continuous state and action spaces: the discrete-time linear quadratic regulator (LQR). Note that here we analyze the infinite-horizon discounted case so that we can work with a time-invariant value function, but episodic analogues exist. Consider the setting above, where $\gamma \in [0,1]$ is a discount factor and $w_t \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$. Here, we consider the simple case where $Q, R, A = I$ and $B = (1+\epsilon) I$. We know that the optimal policy is of the form $u = -K^* x$ [Bertsekas, 2012], and we fix two policies of the form $u_1 = -K_1 x$ and $u_2 = -K_2 x$, each of which is stable along only one component and unstable along the other. It is important to note that the value functions of the individual policies and of the optimal policy have exact quadratic forms $V(x) = x^T P x + q$, but the max-following policy is not necessarily within the same parametric class.
For example, $P_1$ is the solution to the Lyapunov equation $P_1 = I + K_1^T K_1 + \gamma (A - K_1)^T P_1 (A - K_1)$ and $q_1 = \frac{\gamma}{1-\gamma} \sigma^2 \operatorname{tr}(P_1)$. A similar formula exists for policy $2$.

In LQR, for the controllers $K_1, K_2$ described above, a max-following policy attains higher value than either individual expert policy, each of which is unstable along one axis. Moreover, while the optimal policy is superior to all the other policies, a max-following policy is more competitive with it than either individual expert policy. A max-following policy benefits from the stabilizing component that each individual policy provides along its own axis, which lets it perform better than any individual one.
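The quadratic form of a fixed linear policy's value function can be computed numerically. The following is a minimal sketch in standard-library Python, under stated assumptions: the closed-loop matrix `M` is passed in explicitly (the Lyapunov equation above writes it as $A - K_1$), the gain `K` used in the usage lines is an illustrative choice rather than the $K_1$ of our experiments, and the fixed-point iteration converges only when $\gamma$ times the squared spectral radius of `M` is below one. All function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(c: float, a: Matrix) -> Matrix:
    return [[c * x for x in row] for row in a]


def quadratic_value(K: Matrix, M: Matrix, gamma: float, sigma2: float,
                    iters: int = 2000) -> Tuple[Matrix, float]:
    """Iterate P <- I + K^T K + gamma * M^T P M, then set
    q = gamma / (1 - gamma) * sigma2 * tr(P)."""
    n = len(K)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    stage = add(identity, matmul(transpose(K), K))  # I + K^T K
    P = stage
    for _ in range(iters):
        P = add(stage, scale(gamma, matmul(matmul(transpose(M), P), M)))
    q = gamma / (1.0 - gamma) * sigma2 * sum(P[i][i] for i in range(n))
    return P, q


def value_at(P: Matrix, q: float, x: List[float]) -> float:
    """Evaluate V(x) = x^T P x + q."""
    n = len(x)
    return sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n)) + q


# Illustrative gain acting on the first axis only, with M = A - K and A = I.
K = [[1.0, 0.0], [0.0, 0.0]]
M = [[0.0, 0.0], [0.0, 1.0]]
P, q = quadratic_value(K, M, gamma=0.9, sigma2=1.0)
print(P, q, value_at(P, q, [1.0, 0.0]), value_at(P, q, [0.0, 1.0]))
```

In this toy instance the entry of `P` along the uncontrolled axis is much larger than along the controlled one, which is the asymmetry a max-following policy exploits when the two constituent controllers cover different axes.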

Appendix B Additional Proofs

See 4.1

Proof.

We will prove the claim inductively, showing that for all $C \in [H]$, if we run any approximate max-following policy for $C$ steps, and then continue following the policy $\pi^k$ chosen at step $C$ for the rest of the episode, then our expected return is not much worse than if we had followed any fixed $\pi^k$ for the whole episode.

Somewhat more formally, recalling the definition of the set of approximate max-following policies $\Pi^{k^*}_{\beta}$ (Definition 2.3), at every time $h \in [H]$ and state $s \in \mathcal{S}$, a policy $\pi \in \Pi^{k^*}_{\beta}$ takes action $\pi_h^t(s)$ for a $\pi^t \in \Pi^k$ such that $V^t_h(s) \geq \max_{k\in[K]} V_h^k(s) - \beta$.
Letting $\pi^{t(s,h)}$ denote the $\pi^t \in \Pi^k$ that $\pi$ follows at state $s$ and time $h$, we will show that if at some step $C \in [H]$ we have

$$\mathbb{E}_{s_0\sim\mu_0,P}\left[\sum_{h=0}^{C} R(s_h,\pi_h(s_h)) + \sum_{h=C+1}^{H-1} R(s_h,\pi_h^{t(s_C,C)}(s_h))\right] \geq \max_{k\in[K]} \mathbb{E}_{s_0\sim\mu_0}\left[V^k(s_0)\right] - O\left(\tfrac{\varepsilon(C+1)}{H}\right),$$

for all $\pi \in \Pi^{k^*}_{\beta}$, then the same holds for $C+1$ for all $\pi$.
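The hybrid policies compared in this induction can be made concrete with a small sketch. It is illustrative only: it assumes the values $V_h^k(s_h)$ along a trajectory are given as a table, whereas in an actual rollout the visited states depend on the actions taken, and it breaks ties by lowest index although any admissible index is allowed. The function names are hypothetical.

```python
from typing import List, Sequence


def beta_max_indices(values: Sequence[float], beta: float) -> List[int]:
    """Indices t whose value is within beta of the best constituent value."""
    best = max(values)
    return [t for t, v in enumerate(values) if v >= best - beta]


def hybrid_schedule(values_along_path: Sequence[Sequence[float]],
                    C: int, beta: float) -> List[int]:
    """Constituent index followed at each step h by the hybrid policy:
    re-select an admissible index at every step h <= C, then keep the
    index chosen at step C for the rest of the episode."""
    schedule: List[int] = []
    current = 0
    for h, values in enumerate(values_along_path):
        if h <= C:
            current = beta_max_indices(values, beta)[0]
        schedule.append(current)
    return schedule


# values_along_path[h][k] plays the role of V_h^k(s_h) for K = 2, H = 3.
values_along_path = [[1.0, 0.5], [0.2, 0.9], [0.3, 0.8]]
print(hybrid_schedule(values_along_path, C=0, beta=0.05))
print(hybrid_schedule(values_along_path, C=1, beta=0.05))
```

With `C=0` the schedule commits to the policy selected at the initial state, which is the base case; increasing `C` by one lets the schedule re-select once more before committing, which is the change analyzed in the inductive step.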

In the base case, $C = 0$, the claim

$$\mathbb{E}_{s_0\sim\mu_0,P}\left[\sum_{h=0}^{H-1} R(s_h,\pi_h^{t(s_0,0)}(s_h))\right] \geq \max_{k\in[K]} \mathbb{E}_{s_0\sim\mu_0}\left[V^k(s_0)\right] - O\left(\tfrac{\varepsilon}{H}\right)$$

for all $\pi \in \Pi^{k^*}_{\beta}$ and all $\pi^k \in \Pi^k$, follows straightforwardly from the definition of $\Pi^{k^*}_{\beta}$ and the setting of $\beta \in \Theta(\tfrac{\varepsilon}{H})$, since

$$
\begin{aligned}
\mathbb{E}_{s_0\sim\mu_0,P}\left[\sum_{h=0}^{H-1} R(s_h,\pi_h^{t(s_0,0)}(s_h))\right] &= \mathbb{E}_{s_0\sim\mu_0}\left[V^{\pi^{t(s_0,0)}}(s_0)\right] \\
&\geq \mathbb{E}_{s_0\sim\mu_0}\left[\max_{k\in[K]} V^k(s_0) - O\left(\tfrac{\varepsilon}{H}\right)\right] \\
&\geq \max_{k\in[K]} \mathbb{E}_{s_0\sim\mu_0}\left[V^k(s_0)\right] - O\left(\tfrac{\varepsilon}{H}\right).
\end{aligned}
$$

The last inequality holds because the expectation of a maximum is at least the maximum of the expectations.

We now prove the inductive step. We wish to show that if at step $C$, we have for some $\pi \in \Pi^{k^*}_{\beta}$

$$\mathbb{E}_{s_0\sim\mu_0,P}\left[\sum_{h=0}^{C} R(s_h,\pi_h(s_h)) + \sum_{h=C+1}^{H-1} R(s_h,\pi_h^{t(s_C,C)}(s_h))\right] \geq \max_{k\in[K]} \mathbb{E}_{s_0\sim\mu_0}\left[V^k(s_0)\right] - O\left(\tfrac{\varepsilon(C+1)}{H}\right),$$

then continuing to follow $\pi$ at step $C+1$ and following $\pi^{t(s_{C+1},C+1)}$ thereafter reduces expected return by at most $O(\tfrac{\varepsilon}{H})$. Now if $\pi_{C+1}(s_{C+1}) = \pi_{C+1}^{t}(s_{C+1})$ for $\pi^t \in \Pi^k$, it must be the case that

$$V_{C+1}^{t}(s_{C+1}) \geq \max_{k\in[K]} V^{k}_{C+1}(s_{C+1}) - O\left(\tfrac{\varepsilon}{H}\right),$$

otherwise $\pi \notin \Pi^{k^*}_{\beta}$. It follows that

$$\mathbb{E}_{s_0\sim\mu_0,P}\left[\sum_{h=0}^{C+1} R(s_h,\pi_h(s_h)) + \sum_{h=C+2}^{H-1} R(s_h,\pi_h^{t(s_{C+1},C+1)}(s_h))\right]$$
$$= \mathbb{E}_{s_0\sim\mu_0,P}\left[\sum_{h=0}^{C} R(s_h,\pi_h(s_h)) + V_{C+1}^{t(s_{C+1},C+1)}(s_{C+1})\right] \quad \text{(by definition of $V$ and $\pi_{C+1}(s_{C+1})$)}$$
$$\geq \mathbb{E}_{s_0\sim\mu_0,P}\left[\sum_{h=0}^{C} R(s_h,\pi_h(s_h)) + \max_{k\in[K]} V_{C+1}^{k}(s_{C+1}) - O\left(\tfrac{\varepsilon}{H}\right)\right] \quad \text{(from $\pi \in \Pi^{k^*}_{\beta}$)}$$
โ‰ฅ๐”ผs0โˆผฮผ0,P[โˆ‘h=0CRโข(sh,ฯ€hโข(sh))+VC+1tโข(sC,C)โข(sC+1)โˆ’Oโข(ฮตH)]absentsubscript๐”ผsimilar-tosubscript๐‘ 0subscript๐œ‡0๐‘ƒdelimited-[]superscriptsubscriptโ„Ž0๐ถ๐‘…subscript๐‘ โ„Žsubscript๐œ‹โ„Žsubscript๐‘ โ„Žsuperscriptsubscript๐‘‰๐ถ1๐‘กsubscript๐‘ ๐ถ๐ถsubscript๐‘ ๐ถ1๐‘‚๐œ€๐ป\displaystyle\geq\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0},P}\left[\sum_{h=0}^{C}R% (s_{h},\pi_{h}(s_{h}))+V_{C+1}^{t(s_{C},C)}(s_{C+1})-O(\tfrac{\varepsilon}{H})\right]โ‰ฅ blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT โˆผ italic_ฮผ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_P end_POSTSUBSCRIPT [ โˆ‘ start_POSTSUBSCRIPT italic_h = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT italic_R ( italic_s start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_ฯ€ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) ) + italic_V start_POSTSUBSCRIPT italic_C + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t ( italic_s start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT , italic_C ) end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_C + 1 end_POSTSUBSCRIPT ) - italic_O ( divide start_ARG italic_ฮต end_ARG start_ARG italic_H end_ARG ) ]
=๐”ผs0โˆผฮผ0,P[โˆ‘h=0CRโข(sh,ฯ€hโข(sh))+โˆ‘h=C+1Hโˆ’1Rโข(sh,ฯ€htโข(sC,C)โข(sh))]โˆ’Oโข(ฮตH)absentsubscript๐”ผsimilar-tosubscript๐‘ 0subscript๐œ‡0๐‘ƒdelimited-[]superscriptsubscriptโ„Ž0๐ถ๐‘…subscript๐‘ โ„Žsubscript๐œ‹โ„Žsubscript๐‘ โ„Žsuperscriptsubscriptโ„Ž๐ถ1๐ป1๐‘…subscript๐‘ โ„Žsuperscriptsubscript๐œ‹โ„Ž๐‘กsubscript๐‘ ๐ถ๐ถsubscript๐‘ โ„Ž๐‘‚๐œ€๐ป\displaystyle=\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0},P}\left[\sum_{h=0}^{C}R(s_% {h},\pi_{h}(s_{h}))+\sum_{h=C+1}^{H-1}R(s_{h},\pi_{h}^{t(s_{C},C)}(s_{h}))% \right]-O(\tfrac{\varepsilon}{H})= blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT โˆผ italic_ฮผ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_P end_POSTSUBSCRIPT [ โˆ‘ start_POSTSUBSCRIPT italic_h = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT italic_R ( italic_s start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_ฯ€ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) ) + โˆ‘ start_POSTSUBSCRIPT italic_h = italic_C + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H - 1 end_POSTSUPERSCRIPT italic_R ( italic_s start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_ฯ€ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t ( italic_s start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT , italic_C ) end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) ) ] - italic_O ( divide start_ARG italic_ฮต end_ARG start_ARG italic_H end_ARG ) (by definition of V๐‘‰Vitalic_V)
โ‰ฅmaxkโˆˆ[K]โข๐”ผs0โˆผฮผ0[Vkโข(s)]โˆ’Oโข(ฮตโข(C+2)H)absentsubscript๐‘˜delimited-[]๐พsubscript๐”ผsimilar-tosubscript๐‘ 0subscript๐œ‡0delimited-[]superscript๐‘‰๐‘˜๐‘ ๐‘‚๐œ€๐ถ2๐ป\displaystyle\geq\max_{k\in[K]}\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{% k}(s)\right]-O(\tfrac{\varepsilon(C+2)}{H})\quad\quad\quad\quad\quad\quad\quad% \quad\quad\quadโ‰ฅ roman_max start_POSTSUBSCRIPT italic_k โˆˆ [ italic_K ] end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT โˆผ italic_ฮผ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_V start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( italic_s ) ] - italic_O ( divide start_ARG italic_ฮต ( italic_C + 2 ) end_ARG start_ARG italic_H end_ARG ) (by inductive hypothesis)

and so the claim holds for time $C+1$, for any $\pi\in\Pi^{k^{*}}_{\beta}$ for which it holds for time $C$. We showed that the base case $C=0$ holds for all $\pi\in\Pi^{k^{*}}_{\beta}$, and therefore we have

\[
\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0},P}\left[\sum_{h=0}^{C} R(s_{h},\pi_{h}(s_{h})) + \sum_{h=C+1}^{H-1} R(s_{h},\pi_{h}^{t(s_{C},C)}(s_{h}))\right] \geq \max_{k\in[K]} \mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{k}(s)\right] - O(\tfrac{\varepsilon(C+1)}{H})
\]

for all $C\in[H]$. In particular, for $C=H-1$ we conclude that

\[
\mathop{\mathbb{E}}_{s_{0}\sim\mu_{0},P}\left[\sum_{h=0}^{C} R(s_{h},\pi_{h}(s_{h}))\right] \geq \max_{k\in[K]} \mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{k}(s)\right] - O(\varepsilon)
\]

and it follows that

\[
\min_{\pi\in\Pi^{k^{*}}_{\beta}} \mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{\hat{\pi}}(s_{0})\right] \geq \max_{k\in[K]} \mathop{\mathbb{E}}_{s_{0}\sim\mu_{0}}\left[V^{k}(s_{0})\right] - O(\varepsilon).
\]

∎
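
The bookkeeping of the error term in the induction above can be made explicit. As a sketch, write $c\,\varepsilon/H$ for the per-step slack hidden in $O(\tfrac{\varepsilon}{H})$; the constant $c$ is notation introduced here for illustration and does not appear in the proof. Each inductive step adds one such slack term to the hypothesis at time $C$, and the total reaches $O(\varepsilon)$ at the last step:

```latex
\[
\underbrace{\frac{c\,\varepsilon\,(C+1)}{H}}_{\text{hypothesis at time } C}
+ \underbrace{\frac{c\,\varepsilon}{H}}_{\text{step } C+1}
= \frac{c\,\varepsilon\,(C+2)}{H},
\qquad
\text{and for } C = H-1:\quad
\frac{c\,\varepsilon\,\bigl((H-1)+1\bigr)}{H} = c\,\varepsilon = O(\varepsilon).
\]
```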

Appendix C Additional information about experiments

For our experiments, we use a heuristic version of MaxIteration that operates in rounds. First, the algorithm collects a set of trajectories with every constituent policy to initialize the respective value functions. Then, in every round and for every constituent policy, the algorithm executes the max-following policy for $\beta$ steps and then switches to that constituent policy. At the end of each round, the value functions of the constituent policies are updated. The switching times $\beta$ are uniformly spaced along the full horizon and thus depend on the number of rounds and the horizon. The total number of episodes is an upper bound on the number of samples collected, and it is the quantity we use to compare run-times between MaxIteration and IQL. Finally, we use a discount factor $\gamma$, which has been shown to have regularizing effects on the value function updates [Amit et al., 2020].
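
The round structure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names `switch_times`, `TabularValue`, `run_episode`, `update_value`, `max_iteration_heuristic` and `ChainEnv` are hypothetical, a tabular running mean of Monte Carlo returns stands in for the neural value networks, the exact spacing of the switch times is assumed, and the toy environment exists only to make the sketch runnable.

```python
from collections import defaultdict


def switch_times(horizon, num_rounds):
    """Switch times beta, spaced uniformly over the horizon (assumed spacing)."""
    return [(r * horizon) // num_rounds for r in range(num_rounds)]


class TabularValue:
    """Running mean of returns-to-go per (state, step); stand-in for a value network."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, state, step, ret):
        self.total[(state, step)] += ret
        self.count[(state, step)] += 1

    def __call__(self, state, step):
        n = self.count.get((state, step), 0)
        return self.total[(state, step)] / n if n else 0.0


def run_episode(env, policies, values, k, beta, horizon):
    """Max-following for the first `beta` steps, then constituent policy `k`.

    Returns the (state, step, reward) triples observed after the switch.
    """
    state = env.reset()
    tail = []
    for step in range(horizon):
        if step < beta:
            # Follow whichever policy currently has the highest estimated value.
            j = max(range(len(policies)), key=lambda i: values[i](state, step))
        else:
            j = k
        next_state, reward = env.step(state, policies[j](state))
        if step >= beta:
            tail.append((state, step, reward))
        state = next_state
    return tail


def update_value(value, tail, gamma):
    """Monte Carlo update from discounted returns-to-go, computed backwards."""
    ret = 0.0
    for state, step, reward in reversed(tail):
        ret = reward + gamma * ret
        value.update(state, step, ret)


def max_iteration_heuristic(env, policies, horizon, num_rounds, gamma=0.99):
    values = [TabularValue() for _ in policies]
    # Initialization: one plain rollout per constituent policy (beta = 0).
    for k in range(len(policies)):
        update_value(values[k], run_episode(env, policies, values, k, 0, horizon), gamma)
    for beta in switch_times(horizon, num_rounds):
        # Collect with the value estimates from the start of the round ...
        tails = [run_episode(env, policies, values, k, beta, horizon)
                 for k in range(len(policies))]
        # ... and update every value function only at the end of the round.
        for k, tail in enumerate(tails):
            update_value(values[k], tail, gamma)
    return values


class ChainEnv:
    """Hypothetical toy MDP: integer states, reward 1 whenever the action is +1."""

    def reset(self):
        return 0

    def step(self, state, action):
        return state + action, float(action == 1)


values = max_iteration_heuristic(
    ChainEnv(), [lambda s: 1, lambda s: -1], horizon=4, num_rounds=2, gamma=1.0)
```

In this toy run the first policy always earns reward and the second never does, so the max-following prefix always follows the first policy, and the second policy's value function is only updated at the states that prefix reaches.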

For IQL, we use the d3rlpy implementations [Seno and Imai, 2022] and code provided by Hussing et al. [2023].

C.1 Hyperparameters

Both algorithms are first run for 10,000 steps without updates (to initialize the value functions for MaxIteration and to pre-fill the buffer for IQL), and then for 50,000 steps of online training.

All neural networks are multi-layer perceptrons with ReLU activations [Glorot et al., 2011], 2 layers, and a hidden dimension of 256 per layer.

Table 1: Hyperparameters for MaxIteration
Hyperparameter | Value
Optimizer | Adam
Adam $\beta_{1}$ | 0.9
Adam $\beta_{2}$ | 0.999
Adam $\varepsilon$ | 1e-8
Value Function Learning Rate | 1e-4
Number of rounds | 50
Number of gradient steps per round | 40,000
Batch Size | 64
$\gamma$ | 0.99

Table 2: Hyperparameters for Implicit Q-Learning
Hyperparameter | Value
Optimizer | Adam
Adam $\beta_{1}$ | 0.9
Adam $\beta_{2}$ | 0.999
Adam $\varepsilon$ | 1e-8
Actor Learning Rate | 4e-3
Critic Learning Rate | 4e-3
Batch Size | #Tasks $\times$ 256
n_steps | 1
$\gamma$ | 0.99
$\tau$ | 0.005
n_critics | 2
expectile | 0.7
weight_temp | 3.0
max_weight | 100

C.2 Full results on CompoSuite

Figure 4: Mean cumulative return and success over 5 seeds of MaxIteration compared to fine-tuning IQL on all considered tasks. Error bars correspond to standard error. Full bars correspond to returns and red lines indicate the success rate of each algorithm.
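
For readers reproducing the error bars, the following sketch shows one common way to compute a mean and its standard error across seeds. It assumes the sample standard deviation (with $n-1$ in the denominator), which the text does not specify; the function name `mean_and_standard_error` and the input numbers are hypothetical and are not results from the paper.

```python
import math
import statistics


def mean_and_standard_error(per_seed_values):
    """Return (mean, standard error of the mean) for one result per seed."""
    n = len(per_seed_values)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    return statistics.fmean(per_seed_values), statistics.stdev(per_seed_values) / math.sqrt(n)


# Hypothetical per-seed returns, one value for each of 5 seeds.
mean, sem = mean_and_standard_error([1.0, 2.0, 3.0, 4.0, 5.0])
```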

C.3 Results on DM Control

We run our MaxIteration algorithm on the DM Control benchmarks [Tunyasuvunakool et al., 2020], similar to the MAPS [Liu et al., 2023] setup. In their setup, the constituent policies correspond to 3 different checkpointed models from one run of the online Soft Actor-Critic [Haarnoja et al., 2018] algorithm. As a result, the latest checkpoint generally outperforms the previous two, meaning that one constituent policy is strictly better than the others everywhere. We report the final performance over 5 seeds using 16 evaluation trajectories in Figure 5. The results show that our algorithm behaves as expected and always uses the best oracle. Without a policy improvement operator, this setup does not allow us to exceed the performance of the constituent policies.
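
The reason max-following cannot improve on the best checkpoint here can be seen in a small sketch: when one value function dominates at every state, the argmax never changes. The function `max_following_choice` and the three linear value estimates below are hypothetical stand-ins chosen for illustration, not the learned value functions from the experiments.

```python
def max_following_choice(state, value_fns):
    """Index of the constituent policy with the highest estimated value at `state`."""
    return max(range(len(value_fns)), key=lambda k: value_fns[k](state))


# Hypothetical value estimates for three checkpoints; the last one dominates
# the other two at every state considered below.
checkpoints = [lambda s: 0.2 * s, lambda s: 0.5 * s + 1.0, lambda s: s + 2.0]

# Max-following selects the same (dominant) policy at every state.
choices = {max_following_choice(s, checkpoints) for s in range(50)}
```

Since the selected index is constant, the max-following policy coincides with the dominant constituent policy and inherits exactly its return.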

Figure 5: Mean cumulative return over 5 seeds of MaxIteration on DM Control tasks [Tunyasuvunakool et al., 2020]. Error bars correspond to standard error. MaxIteration always selects the best-performing constituent policy.

C.4 Computational Resources

Our experiments were conducted using a total of 17 GPUs, including both server-grade (e.g., NVIDIA RTX A6000) and consumer-grade (e.g., NVIDIA RTX 3090) GPUs. Training the constituent policies from offline data takes less than 2 hours. Our MaxIteration algorithm takes about 3 hours to train, while the baseline fine-tuning takes around 1 hour. A large share of the runtime cost stems from executing the simulator.