When Do Skills Help Reinforcement Learning?
A Theoretical Analysis of Temporal Abstractions

Zhening Li    Gabriel Poesia    Armando Solar-Lezama
Abstract

Skills are temporal abstractions that are intended to improve reinforcement learning (RL) performance through hierarchical RL. Despite our intuition about the properties of an environment that make skills useful, a precise characterization has been absent. We provide the first such characterization, focusing on the utility of deterministic skills in deterministic sparse-reward environments with finite action spaces. We show theoretically and empirically that RL performance gain from skills is worse in environments where solutions to states are less compressible. Additional theoretical results suggest that skills benefit exploration more than they benefit learning from existing experience, and that using unexpressive skills such as macroactions may worsen RL performance. We hope our findings can guide research on automatic skill discovery and help RL practitioners better decide when and how to use skills.

Machine Learning, ICML

1 Introduction

In most real-world sequential decision making problems, agents are only given sparse rewards for their actions. This makes reinforcement learning (RL) challenging, as agents can only recognize good behavior after long sequences of good decisions. This issue can be mitigated by leveraging temporal abstractions (Sutton et al., 1999), also known as skills. A skill is a high-level action — such as a fixed sequence of actions (macroaction) or a sub-policy with a termination condition (option) — that is expected to be useful in a large number of states. Skills can be hand-engineered to perform subtasks (Pedersen et al., 2016; He et al., 2011) or learned from experience (Machado et al., 2017; Bacon et al., 2017; Barreto et al., 2019; Kipf et al., 2019; Jiang et al., 2022; Li et al., 2022). Incorporating skills into the agent’s action space (hierarchical RL) allows it to act at a higher level and reach goals in fewer steps, which may improve exploration and thus RL performance.

Despite their appeal, skills have not seen widespread use. In fact, they were not involved in most major breakthroughs and applications of RL, such as surpassing human-level performance in all Atari games (Badia et al., 2020), RLHF for aligning LLMs with human preferences (Ouyang et al., 2022), AlphaTensor for faster matrix multiplication (Fawzi et al., 2022), and AlphaDev for faster sorting (Mankowitz et al., 2023). A reason skills have not been widely adopted is that they sometimes do not improve RL performance and it is unclear how to determine beforehand whether they would. While several methods have been developed to automatically discover skills, most of them require the practitioner to decide whether to use skills at all. To our knowledge, LEMMA (Li et al., 2022) is the only algorithm that automatically decides whether skills are useful by learning the optimal number of skills — zero would mean that skills do not help. However, this is accomplished by optimizing a heuristic objective that does not necessarily reflect the benefits to RL. Other skill discovery algorithms such as Option-Critic (Bacon et al., 2017), eigenoptions (Machado et al., 2017), deep skill chaining (Bagaria & Konidaris, 2019), LOVE (Jiang et al., 2022) and COPlanLearn (Nayyar et al., 2023) determine the number of skills using a hyperparameter. A better understanding of how exactly skills benefit RL may guide research in automatically determining whether skills would be useful in an environment and the optimal number to learn if they are. Such an understanding can also provide insight into why skills do not work in certain environments as well as help practitioners better decide whether to use skills for a given RL task.

Our work provides a theoretical analysis of when and how skills and hierarchical RL benefit RL performance in deterministic sparse-reward environments. We hope our insights will serve to guide research in automatic skill discovery including the automatic determination of whether to use skills, and allow practitioners to better understand the kinds of environments where skills are helpful. In summary, we make the following contributions:

  • We define two metrics, $p$-exploration difficulty and $p$-learning difficulty, that quantify the hardness of exploration and learning from experience in a deterministic sparse-reward environment with a finite action space. We show empirically that these metrics correlate strongly with the sample complexity of several RL algorithms (Section 3).

  • We define two closely related metrics that measure the incompressibility of solutions to states generated by the environment. Under mild assumptions, we prove lower bounds on the change in $p$-learning difficulty and $p$-exploration difficulty due to deterministic skills in terms of the incompressibility measures. We show that skills are better suited to decreasing $p$-exploration difficulty than $p$-learning difficulty, and that less expressive skills are less apt at decreasing the difficulty metrics. In particular, for each difficulty metric, we demonstrate the existence of environments where incorporating macroactions provably increases it (Sections 4 and 5).

  • We show empirically that macroactions and deep neural options are less beneficial in environments with higher incompressibility (Section 6).

  • We describe how to derive skill learning objectives from our incompressibility metrics (Section 7).

All proofs are found in Appendix E. Code for the experiments is publicly available at https://github.com/uranium11010/rl-skill-theory.

2 Preliminary Definitions

We first introduce basic definitions related to deterministic sparse-reward Markov decision processes (MDPs), which are the focus of this paper. We choose to focus on sparse-reward environments since skills are purported to alleviate the sparse-reward problem. Despite our focus on deterministic environments, a large number of environments both in the standard RL literature (e.g., the original Atari game environments (Bellemare et al., 2013) and MuJoCo (Todorov et al., 2012)) and in applications of RL (e.g., program synthesis (Ellis et al., 2019; Mankowitz et al., 2023) and mathematical reasoning (Kaliszyk et al., 2018; Poesia et al., 2021; Wu et al., 2021)) are deterministic. Furthermore, by focusing on a special case of MDPs, our hardness results — lower bounds on the change in difficulty due to skills — suggest that improving RL using skills in the general case of stochastic environments can be at least as hard. Finally, Section F.1 provides preliminary results on generalizing to stochastic environments, suggesting that many insights obtained from studying deterministic environments apply to stochastic ones as well.

Definition 2.1.

A deterministic sparse-reward MDP (DSMDP) is defined by a 4-tuple $\mathcal{M}=(S,A,T,g)$ where $S$ is the state space, $A$ is the action space, $T:(S\setminus\{g\})\times A\to S$ is the deterministic transition function and $g\in S$ is the goal state.

Note that environments that have multiple goal states can also be formulated as DSMDPs by merging these goal states into a single goal state. The CompILE2 environment introduced in Section 3.3 is one such example — see Appendix B for more details.

Borrowing terminology commonly used in symbolic reasoning domains, we say “solve a state” as shorthand for “find a sequence of actions that leads to the goal state,” and we call such a sequence of actions a solution. This is formalized below.

Definition 2.2.

A solution to a state $s\in S\setminus\{g\}$ of a DSMDP $\mathcal{M}=(S,A,T,g)$ is a sequence of actions $(a_1,\ldots,a_l)\in A^l$ ($l\geq 1$) such that applying the sequence of actions starting in $s$ results in the goal state $g$:

$$T(s,(a_1,\ldots,a_l))=g, \tag{1}$$

where $T(s,(a_1,\ldots,a_l)):=T(\cdots(T(s,a_1),a_2)\cdots,a_l)$ denotes the result of applying action sequence $(a_1,\ldots,a_l)$ to state $s$. Here, $l>0$ is called the length of the solution. We will denote by $\operatorname{Sol}_{\mathcal{M}}(s)$ the set of solutions to $s$ and by $d_{\mathcal{M}}(s)=\min_{\sigma\in\operatorname{Sol}_{\mathcal{M}}(s)}|\sigma|$ the length of a shortest solution to $s$.

Note that a state can have no solutions. For example, in domains where we’d like to formalize the notion of “death,” one could transition to a “dead state” that goes to itself for all actions taken, and that dead state has no solutions. In contrast, states that have at least one solution are called solvable states.
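To make Definitions 2.1 and 2.2 concrete, the following minimal Python sketch encodes a small DSMDP as a transition table, checks whether an action sequence is a solution, and computes $d_{\mathcal{M}}(s)$ by breadth-first search backwards from the goal. The four-state environment (including its dead state), the state and action names, and the helper functions are all illustrative assumptions and are not taken from the paper's code.

```python
from collections import deque

# Hypothetical toy DSMDP: states "A", "B", "dead"; goal "G"; actions 0 and 1.
# T is only defined on non-goal states, as in Definition 2.1.
GOAL = "G"
T = {
    ("A", 0): "B", ("A", 1): "dead",
    ("B", 0): GOAL, ("B", 1): "A",
    ("dead", 0): "dead", ("dead", 1): "dead",
}

def apply_actions(T, goal, s, actions):
    """Apply an action sequence to s; return None if the goal is reached early."""
    for a in actions:
        if s == goal:
            return None  # T is undefined at the goal, so the sequence cannot continue
        s = T[(s, a)]
    return s

def is_solution(T, goal, s, actions):
    """Check Equation 1: a non-empty sequence that takes s exactly to the goal."""
    return s != goal and len(actions) >= 1 and apply_actions(T, goal, s, actions) == goal

def shortest_solution_lengths(T, goal):
    """Return {s: d(s)} for every solvable state, via BFS on reversed transitions."""
    predecessors = {}
    for (s, _a), s_next in T.items():
        predecessors.setdefault(s_next, set()).add(s)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        u = queue.popleft()
        for s in predecessors.get(u, ()):
            if s not in dist:
                dist[s] = dist[u] + 1
                queue.append(s)
    del dist[goal]  # d is only defined for non-goal states
    return dist

d = shortest_solution_lengths(T, GOAL)
```

In this toy environment, `d` contains only the solvable states "A" and "B"; the dead state never appears because no path from it reaches the goal.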

Some results in this paper assume that no two states share a solution, a property we call solution separability.

Definition 2.3.

A DSMDP is solution-separable if no sequence of actions is a solution to more than one state.

Any DSMDP with invertible transitions is solution-separable. Here, we say a DSMDP $(S,A,T,g)$ has invertible transitions if $s=s'$ whenever $T(s,a)=T(s',a)$ and $T(s,a)$ is either solvable or the goal. Examples include (a) all twisty puzzles such as the Rubik’s cube; (b) grid world domains where taking a vacuous action (e.g., walking into a wall or picking up a non-existent object) leads to instant death; (c) sliding puzzles where taking a vacuous action leads to instant death.
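As a small illustration of the invertibility condition, the sketch below checks it by brute force on two hypothetical variants of a one-dimensional corridor: one where walking into the wall leaves the agent in place, and one where it leads to a dead state. The corridor, its action names, and the function names are assumptions made for this example only.

```python
from collections import deque
from itertools import combinations

def solvable_states(T, goal):
    """States from which the goal is reachable (BFS over reversed transitions)."""
    predecessors = {}
    for (s, _a), s_next in T.items():
        predecessors.setdefault(s_next, set()).add(s)
    seen = {goal}
    queue = deque([goal])
    while queue:
        u = queue.popleft()
        for s in predecessors.get(u, ()):
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return seen - {goal}

def has_invertible_transitions(T, actions, goal):
    """True if s == s' whenever T(s,a) == T(s',a) and T(s,a) is solvable or the goal."""
    relevant = solvable_states(T, goal) | {goal}
    states = list({s for (s, _a) in T})
    for a in actions:
        for s1, s2 in combinations(states, 2):
            if T[(s1, a)] == T[(s2, a)] and T[(s1, a)] in relevant:
                return False
    return True

def corridor(n, wall_is_deadly):
    """Hypothetical corridor with cells 0..n, goal at cell n, actions "L" and "R"."""
    T = {}
    for s in range(n):
        T[(s, "R")] = s + 1
        if s > 0:
            T[(s, "L")] = s - 1
        else:
            # Vacuous action at the left wall: stay put, or die instantly.
            T[(s, "L")] = "dead" if wall_is_deadly else 0
    if wall_is_deadly:
        T[("dead", "L")] = T[("dead", "R")] = "dead"
    return T

bouncy = has_invertible_transitions(corridor(3, False), ["L", "R"], 3)
deadly = has_invertible_transitions(corridor(3, True), ["L", "R"], 3)
```

Here `bouncy` is `False` and `deadly` is `True`. In the non-deadly variant, the sequence (L, R, R, R) solves both cell 0 and cell 1, so that variant is not solution-separable either.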

The following definition formalizes RL in the episodic setting as applied to a DSMDP.

Definition 2.4.

In reinforcement learning (RL) in the episodic setting, an agent interacts with an environment (MDP) in episodes to learn a policy $\pi(a\mid s)$ that optimizes the expected cumulative reward from one episode. For a DSMDP, the optimal policy is

$$\operatorname*{arg\,max}_{\pi}\operatorname*{\mathbb{E}}_{\substack{s_0\sim p_0\\(s_0,a_1,\ldots,a_l,s_l)\sim\operatorname{Rollout}_{\pi}(s_0)}}\left[\gamma^{l-1}\boldsymbol{1}[s_l=g]\right]. \tag{2}$$

Here, $p_0$ is the initial state distribution and $0<\gamma\leq 1$ is the discount factor. $\operatorname{Rollout}_{\pi}(s_0)$ is the result of rolling out policy $\pi$ starting in state $s_0$, stopping when either the goal state is reached or $H$ actions have been taken, where $H$ is called the horizon and is sometimes considered part of the definition of an MDP. Note that when $\gamma=1$, Equation 2 becomes maximizing the probability that the policy solves $s_0\sim p_0$.

Now, we introduce skills. Whereas skills need not be deterministic in general, we are studying deterministic environments and will thus focus on deterministic skills.

Definition 2.5.

A deterministic skill in a DSMDP is a function from states to finite action sequences. In other words, for each state, we specify the sequence of actions to be taken if the agent initiates the skill in that state. Note that this sequence is allowed to be empty.

We will refer to deterministic skills as simply “skills.”

The prototypical example of an unexpressive class of skills is macroactions.

Definition 2.6.

A macroaction is a skill that produces the same sequence of actions of length greater than 1 regardless of the state in which the skill is initiated.

Incorporating skills into a DSMDP is called a skill augmentation, which is more precisely defined below.

Definition 2.7.

A DSMDP $\mathcal{M}_0=(S,A_0,T_0,g)$ augmented with a finite set of skills $Z$ is the DSMDP $\mathcal{M}_+=(S,A_+,T_+,g)$ where $A_+=A_0\cup Z$, $T_+(s,a)=T_0(s,a)$ for $a\in A_0$, and $T_+(s,a)=T_0(s,a(s))$ for $a\in Z$.[1] We say $\mathcal{M}_+$ is the $A_+$-skill augmentation of $\mathcal{M}_0$. We call $A_0$ the base action space and $A_+$ the skill-augmented action space. Furthermore, if $Z\neq\emptyset$ so that $A_0$ is a proper subset of $A_+$, then we say the skill augmentation is strict.

[1] Technically, $T_+$ is a partial function as $T_+(s,z)$ is undefined if unrolling the skill $z$ reaches the goal state before the unrolling finishes. Thus, in this case, the agent is considered not to have reached the goal state. (However, our HRL implementation in our experiments follows the more common convention that the agent is considered successful in this situation.)

For simplicity, when discussing a base environment $\mathcal{M}_0$ and its skill augmentation $\mathcal{M}_+$, we will abuse notation by writing subscripts “$+$” or “$0$” in places where they should really be “$\mathcal{M}_+$” or “$\mathcal{M}_0$”, such as $d_0(s)$ and $\operatorname{Sol}_0(s)$ for $d_{\mathcal{M}_0}(s)$ and $\operatorname{Sol}_{\mathcal{M}_0}(s)$. We allow repetition of skills and skills are also allowed to overlap with base actions. In such cases, $Z$ and $A_+$ should be interpreted as multisets.
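The construction in Definition 2.7 can be sketched in a few lines of Python. The sketch below represents a skill as a function from states to tuples of base actions and builds the augmented transition table, leaving an entry undefined when the goal is reached before the skill finishes (the convention in the footnote to Definition 2.7). The corridor environment, the two example skills, and all function names are hypothetical choices for illustration.

```python
def unroll_skill(T0, goal, s, skill):
    """Run skill(s), a tuple of base actions, from s in the base DSMDP.

    Returns None when the goal is reached before the sequence finishes,
    which leaves the augmented transition undefined for that state.
    """
    for a in skill(s):
        if s == goal:
            return None
        s = T0[(s, a)]
    return s

def augment(T0, base_actions, goal, skills):
    """Build (A_plus, T_plus) from a base DSMDP and a dict {name: skill}."""
    states = {s for (s, _a) in T0}
    A_plus = list(base_actions) + list(skills)
    T_plus = dict(T0)  # base actions keep their original transitions
    for name, skill in skills.items():
        for s in states:
            s_next = unroll_skill(T0, goal, s, skill)
            if s_next is not None:
                T_plus[(s, name)] = s_next
    return A_plus, T_plus

# Hypothetical corridor with cells 0..3 and the goal at cell 3.
GOAL = 3
T0 = {}
for s in range(GOAL):
    T0[(s, "R")] = s + 1
    T0[(s, "L")] = max(s - 1, 0)

skills = {
    # Macroaction: the same action sequence in every state.
    "RR": lambda s: ("R", "R"),
    # State-dependent skill: walk right to cell 2 (empty sequence at cell 2).
    "to_cell_2": lambda s: ("R",) * (2 - s),
}
A_plus, T_plus = augment(T0, ["L", "R"], GOAL, skills)
```

In this toy example the macroaction "RR" is undefined at cell 2, because its first step already reaches the goal, while the state-dependent skill is defined everywhere and acts as a no-op at cell 2.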

3 Quantifying RL Difficulty in a Deterministic Sparse-Reward Environment

To study how much skills can reduce the difficulty of applying RL to a DSMDP, we need to first quantify this difficulty. Unfortunately, existing MDP difficulty metrics fail to capture RL difficulty in DSMDPs since they were not designed to directly estimate sample efficiency or regret, but instead appear in loose asymptotic performance bounds of RL algorithms (see Appendix A for a brief survey). As a result, they correlate poorly with actual performance measures like total regret (Conserva & Rauber, 2022). We therefore aim to develop difficulty metrics for DSMDPs by directly estimating an RL performance measure — in our case, sample efficiency — and to verify them empirically.

Below, we introduce two metrics quantifying the difficulty of applying RL to a deterministic sparse-reward environment, assuming that the environments compared have the same state space (e.g., they are different skill augmentations of the same base environment). We motivate these metrics using heuristic arguments that estimate the sample efficiency of an RL agent in the episodic setting without assuming any particular RL algorithm. We then experimentally test how well the metrics correlate with the sample efficiency of 4 popular RL algorithms in 32 macroaction augmentations of each of 4 base environments.

3.1 Quantifying Difficulty in Learning from Experience

To quantify the complexity of learning a DSMDP from existing experience, suppose that the agent has gathered enough experience to effectively reduce the remaining learning problem to a planning problem. Then Lemma 3.1 shows that the number of iterations through the entire state space needed to learn the value of a state is linear in the minimum length of a solution to that state.

Lemma 3.1.

Suppose we apply value iteration with discount rate $\gamma=1$ and learning rate $\alpha$ to a DSMDP $\mathcal{M}=(S,A,T,g)$ with a finite action space. In particular, we initialize $V(s)\leftarrow 0$ for $s\neq g$ and $V(g)\leftarrow 1$, and at time $t$, we update the entire table using

$$V(s)\leftarrow(1-\alpha)V(s)+\alpha\max_{a}V(T(s,a))\quad\text{for all $s\neq g$}. \tag{3}$$

If $\alpha=1$, then the number of time steps until the value of a solvable state $s$ becomes its true value (i.e., $1$) is $d_{\mathcal{M}}(s)$. If $\alpha<1$, then the number of time steps until the value of a solvable state $s$ is within $\varepsilon$ of its true value (i.e., $1-V(s)<\varepsilon$) is

$$\Theta\left(\frac{d_{\mathcal{M}}(s)+\log(1/\varepsilon)}{\alpha}\right).$$

Since each iteration has a complexity of $\Theta(|S||A|)$, the total complexity for learning the value of a state $s$ is $\Theta(|S||A|d_{\mathcal{M}}(s))$ for constant $\alpha,\varepsilon$. If we apply the same intuition to the RL setting, then we would expect that learning the optimal policy at a state $s$ requires $\Theta(d_{\mathcal{M}}(s))$ “iterations,” where one “iteration” involves the agent sampling experiences that effectively cover the entire space of state-action pairs. Thus, as a rough estimation, approximately $\Theta(|S_{\text{eff}}||A|d_{\mathcal{M}}(s))$ samples are needed to learn the policy at state $s$. Here, $|S_{\text{eff}}|$ is some effective size of the state space, counting only those states that we “care about,” i.e., those with positive $p_0(s)$ or that are part of (short) solutions to states with positive $p_0(s)$. For constant $|S_{\text{eff}}|$, this estimation of the sample complexity motivates using a weighted average of $|A|d_{\mathcal{M}}(s)$ over states $s$ to measure the complexity of learning from experience.
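The statement of Lemma 3.1 can be checked numerically on a small example. The sketch below runs the synchronous update of Equation 3 on a hypothetical corridor DSMDP and records, for each state, the first sweep at which its value is learned. The environment, the function name, and the particular values of the learning rate and tolerance are assumptions chosen for illustration; this is not the paper's experimental code.

```python
def sweeps_until_learned(T, actions, goal, alpha, eps, num_sweeps):
    """Run synchronous value iteration (Equation 3) for num_sweeps sweeps.

    Returns {s: t}, where t is the first sweep after which V(s) equals 1
    exactly or satisfies 1 - V(s) < eps.
    """
    states = sorted({s for (s, _a) in T})
    V = {s: 0.0 for s in states}
    V[goal] = 1.0
    first_learned = {}
    for t in range(1, num_sweeps + 1):
        # Compute every update from the old table, then swap it in.
        new_V = dict(V)
        for s in states:
            best = max(V[T[(s, a)]] for a in actions)
            new_V[s] = (1.0 - alpha) * V[s] + alpha * best
        V = new_V
        for s in states:
            if s not in first_learned and (V[s] == 1.0 or 1.0 - V[s] < eps):
                first_learned[s] = t
    return first_learned

# Hypothetical corridor with cells 0..5 and the goal at cell 5, so d(s) = 5 - s.
GOAL = 5
T = {}
for s in range(GOAL):
    T[(s, "R")] = s + 1
    T[(s, "L")] = max(s - 1, 0)

exact = sweeps_until_learned(T, ["L", "R"], GOAL, alpha=1.0, eps=0.0, num_sweeps=20)
damped = sweeps_until_learned(T, ["L", "R"], GOAL, alpha=0.5, eps=0.01, num_sweeps=200)
```

With $\alpha=1$, `exact[s]` equals $5-s$, the shortest solution length of cell $s$. With $\alpha=0.5$, each state takes at least as many sweeps, and states farther from the goal take no fewer sweeps than nearer ones.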

Definition 3.2.

Let $\mathcal{M}=(S,A,T,g)$ be a DSMDP with finite action space $A$. For a probability distribution $p$ on solvable states, the $p$-learning difficulty of $\mathcal{M}$ is defined as

$$J_{\text{learn}}(\mathcal{M};p)=|A|\,\mathbb{E}_{s\sim p}[d_{\mathcal{M}}(s)] \tag{4}$$

where $d_{\mathcal{M}}(s)$ is the length of a shortest solution to $s$.

The distribution $p$ assigns higher importance to states that we care more about learning to solve. If $p_0$ denotes the initial state distribution of the MDP, then $p$ should be higher for states with higher $p_0$. For simplicity, we can just take $p$ to be $p_0$.

The $p$-learning difficulty can be viewed as a generalization of diameter (Auer et al., 2008). While the diameter of an MDP is originally defined for the continuous learning setting, a natural extension to the episodic setting for a DSMDP is the maximum length of a solution to a state, $\max_{s\neq g}d_{\mathcal{M}}(s)$. Ignoring the $|A|$ factor, this is the $p$-learning difficulty when $p$ is zero for all but the state(s) with the largest $d_{\mathcal{M}}(s)$.
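As a worked example of Definition 3.2, the sketch below computes the $p$-learning difficulty of a hypothetical corridor DSMDP before and after adding a single macroaction. The environment, the uniform choice of $p$, and the function names are illustrative assumptions.

```python
from collections import deque

def shortest_solution_lengths(T, goal):
    """Return {s: d(s)} for every solvable state, via BFS on reversed transitions."""
    predecessors = {}
    for (s, _a), s_next in T.items():
        predecessors.setdefault(s_next, set()).add(s)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        u = queue.popleft()
        for s in predecessors.get(u, ()):
            if s not in dist:
                dist[s] = dist[u] + 1
                queue.append(s)
    del dist[goal]
    return dist

def learning_difficulty(T, actions, goal, p):
    """J_learn(M; p) = |A| * E_{s~p}[d_M(s)] (Equation 4).

    p maps solvable states to probabilities that sum to 1.
    """
    d = shortest_solution_lengths(T, goal)
    return len(actions) * sum(weight * d[s] for s, weight in p.items())

# Hypothetical corridor with cells 0..4 and the goal at cell 4.
GOAL = 4
T0 = {}
for s in range(GOAL):
    T0[(s, "R")] = s + 1
    T0[(s, "L")] = max(s - 1, 0)

# Augment with the macroaction "RR" (two steps right). It is left undefined at
# cell 3, where the goal would be reached before the macroaction finishes.
T_plus = dict(T0)
for s in range(GOAL - 1):
    T_plus[(s, "RR")] = s + 2

p = {s: 0.25 for s in range(GOAL)}  # uniform weights over the four non-goal cells
J_base = learning_difficulty(T0, ["L", "R"], GOAL, p)
J_plus = learning_difficulty(T_plus, ["L", "R", "RR"], GOAL, p)
```

In this toy corridor, the mean shortest solution length drops from 2.5 to 1.5 while $|A|$ grows from 2 to 3, so the difficulty falls from 5.0 to 4.5. A skill that shortens no shortest solution would leave $d$ unchanged and only increase $|A|$, raising the metric.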

3.2 Quantifying Difficulty in Exploration

The $p$-learning difficulty does not take into account the complexity of gathering the needed experience: learning a state $s$ starts to take place only after the agent has seen state-action pairs that form a chain leading from $s$ to the goal state. Thus, as a simplification, an agent’s learning process in the episodic setting can be roughly divided into two stages: the first stage is dominated by exploration, where the agent tries to find reward signal and gather experience; the second stage is dominated by learning, where the agent learns from the experience. The sample efficiency of the learning stage is captured by the $p$-learning difficulty. Let us now motivate the definition of $p$-exploration difficulty by estimating the sample efficiency of the exploration stage.

Suppose that the initial exploration policy is a uniformly random policy, and let $q(s)$ denote the probability that such a policy solves $s$ in one episode. Assuming that the policy remains roughly uniform until the agent finally solves $s$ for the first time, the expected number of episodes until this happens is $1/q(s)$, and the number of environment steps taken is $H/q(s)$ where $H$ is the horizon. To obtain an upper bound on the expected total number of steps taken to find a solution to every state, we simply sum this expression over all states to arrive at $N_{\text{sum}}=H\sum_{s}\frac{1}{q(s)}$. Note that this can be a significant overestimate of the true sample complexity: solving a state $s$ often updates the agent in a way that helps it solve states whose solutions contain $s$. We will address this issue later.

For a constant horizon $H$ and state space size, $N_{\text{sum}}\propto\mathbb{E}_{s\sim p}\left[1/q(s)\right]$ where $p$ is a uniform distribution over all states. As with the $p$-learning difficulty, we generalize this to allow different weights $p(s)$ to be assigned to different states. For example, if a state has small $q(s)$ but the MDP’s initial state distribution $p_0$ assigns almost zero probability to $s$, then we can afford not to learn to solve $s$ and this can be reflected by having $p(s)\approx 0$. For simplicity, we can simply set $p$ to $p_0$, as with the $p$-learning difficulty.

We now address the issue of overestimating the sample complexity. In practice, this overestimation is more significant when $q(s)$ for different $s$ are more disparate. In DSMDPs where states vary in difficulty (vary in $q(s)$), solving easy states (states with large $q(s)$) generally updates the agent in a way that helps it find solutions to harder states (states with small $q(s)$). For this reason, we find empirically (Section D.2) that the arithmetic mean $N_{\text{AM}}=\mathbb{E}_{s\sim p}[1/q(s)]$ is outperformed by the geometric mean $N_{\text{GM}}=\exp(\mathbb{E}_{s\sim p}[\log(1/q(s))])$, which is lower than $N_{\text{AM}}$ when there’s variety in $1/q(s)$. Although this estimation of exploration sample complexity is quite rough, it is difficult to make better estimates without knowing details of the MDP structure and RL algorithm. Also, the resultant definition of $p$-exploration difficulty already performs well empirically on several environments for several RL algorithms (Section 3.3).

Finally, we take the logarithm of $N_{\text{GM}}$ as that simplifies notation in our theoretical results. We also replace the fixed horizon with a random horizon sampled from a geometric distribution to simplify theoretical analysis.
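The relationship between the two means, and the effect of taking the logarithm, can be spelled out in one line using Jensen's inequality for the concave function $\log$. This is a standard fact, stated here for convenience rather than taken from the paper's proofs:

```latex
\log N_{\text{GM}}
  = \mathbb{E}_{s\sim p}\!\left[\log\frac{1}{q(s)}\right]
  = \mathbb{E}_{s\sim p}\left[-\log q(s)\right]
  \;\le\; \log \mathbb{E}_{s\sim p}\!\left[\frac{1}{q(s)}\right]
  = \log N_{\text{AM}},
```

with equality exactly when $1/q(s)$ is constant on the support of $p$. The expectation of $-\log q(s)$ is the quantity that Definition 3.3 adopts, with $q$ computed under a geometric horizon.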

Definition 3.3.

Let $\mathcal{M} = (S, A, T, g)$ be a DSMDP with finite action space $A$. For a probability distribution $p$ on solvable states and $0 \leq \delta < 1$, the $\delta$-discounted $p$-exploration difficulty of $\mathcal{M}$ is defined as

$$J_{\text{explore}}(\mathcal{M}; p, \delta) = \mathbb{E}_{s\sim p}\left[-\log q_{\mathcal{M},\delta}(s)\right] \tag{5}$$

where

$$q_{\mathcal{M},\delta}(s) := \sum_{\sigma \in \operatorname{Sol}(s)} \left(\frac{1-\delta}{|A|}\right)^{|\sigma|} \tag{6}$$

is the probability that the following policy solves $s$: at every time step, terminate with probability $\delta$ and choose an action uniformly at random with probability $1-\delta$. $q_{\mathcal{M},\delta}(s)$ is also the probability that the uniformly random policy solves $s$ within a horizon of length $H$, where $H+1$ is sampled from the geometric distribution with parameter $\delta$.
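
As a minimal sketch of how Definition 3.3 could be evaluated on a small tabular environment, the Python below computes $q_{\mathcal{M},\delta}(s)$ by fixed-point iteration and then averages $-\log q$ under $p$. It assumes (this is not stated in the section) that a solution is an action sequence that reaches a goal state for the first time on its last action, and that $\delta > 0$ so the iteration contracts. The function names, the `step`/`is_goal` callbacks, and the corridor environment are all hypothetical.

```python
import math


def solve_probabilities(states, actions, step, is_goal, delta, tol=1e-12, max_iters=100_000):
    """Approximate q_{M,delta}(s) for every non-goal state by fixed-point iteration.

    Assumption (not from the paper): a solution is an action sequence that
    reaches a goal state for the first time on its last action.
    """
    c = (1.0 - delta) / len(actions)  # P(do not terminate) * P(pick one specific action)
    q = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_q = {
            s: sum(c * (1.0 if is_goal(step(s, a)) else q[step(s, a)]) for a in actions)
            for s in states
        }
        change = max(abs(new_q[s] - q[s]) for s in states)
        q = new_q
        if change < tol:
            break
    return q


def exploration_difficulty(p, q):
    """J_explore = E_{s~p}[-log q(s)], with p given as {state: probability}."""
    return sum(prob * -math.log(q[s]) for s, prob in p.items() if prob > 0)


if __name__ == "__main__":
    # Hypothetical corridor: non-goal states 0..3, goal at 4, actions move left or right.
    n = 4
    states = list(range(n))
    q = solve_probabilities(
        states, [-1, +1], lambda s, a: max(0, s + a), lambda s: s == n, delta=0.1
    )
    p = {s: 1.0 / n for s in states}  # uniform distribution over start states
    print(exploration_difficulty(p, q))
```

The recursion sums over first actions instead of enumerating $\operatorname{Sol}(s)$, which is infinite whenever the environment contains cycles.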

3.3 Experiments

Table 1: Correlations between $\log N$ and $\log J$, where $N$ is the number of environment steps the agent takes to learn the environment and $J = \lambda J_{\text{learn}} + (1-\lambda)\exp(J_{\text{explore}})$ is a weighted average of the $p$-learning difficulty and the exponential of the $p$-exploration difficulty. Convergence criteria include reaching a certain reward threshold $r^*$ (0.5 for RubiksCube222 and 0.9 for the other environments) or reaching a certain threshold $\Delta Q^*$ or $\Delta V^*$ in the $p$-weighted average error in action or state values (0.2 for RubiksCube222 and 0.05 for the other environments). The value of $\lambda \in [0, 1]$ was chosen so that the correlation was maximized. Data points where the algorithm never converges before the experiment run ends (100M environment steps) were excluded from the calculation of the correlation. The reported errors are standard errors of the mean over 5 random seeds.

| Algorithm | Convergence criterion | $\log J_{\mathtt{CliffWalking}}$ | $\log J_{\mathtt{CompILE2}}$ | $\log J_{\mathtt{8Puzzle}}$ | $\log J_{\mathtt{RubiksCube222}}$ |
|---|---|---|---|---|---|
| Q-Learning | $\log N_{r \geq r^*}$ | 0.947 ± 0.006 | 0.792 ± 0.025 | 0.403 ± 0.036 | 0.857 ± 0.023 |
| Q-Learning | $\log N_{\overline{\Delta Q} \leq \Delta Q^*}$ | 0.953 ± 0.008 | 0.786 ± 0.023 | 0.671 ± 0.056 | 0.937 ± 0.003 |
| Value iteration | $\log N_{r \geq r^*}$ | 0.933 ± 0.009 | 0.825 ± 0.018 | 0.693 ± 0.051 | 0.785 ± 0.031 |
| Value iteration | $\log N_{\overline{\Delta V} \leq \Delta V^*}$ | 0.951 ± 0.015 | 0.849 ± 0.013 | 0.885 ± 0.011 | 0.748 ± 0.029 |
| REINFORCE | $\log N_{r \geq r^*}$ | 0.949 ± 0.006 | 0.869 ± 0.013 | 0.678 ± 0.020 | 0.892 ± 0.029 |
| DQN | $\log N_{r \geq r^*}$ | 0.789 ± 0.028 | 0.758 ± 0.076 | 0.583 ± 0.039 | 0.753 ± 0.019 |

In motivating $p$-learning difficulty and $p$-exploration difficulty, we made significant approximations to estimate the sample complexity without assuming a particular environment or RL algorithm. Despite this, we show empirically that a combination of the two difficulty metrics predicts sample complexity well across a variety of environments and RL algorithms.

We study four deterministic sparse-reward environments: (a) CliffWalking, a simple grid world (Sutton & Barto, 2018); (b) CompILE2, the CompILE grid world with visit length 2 (Kipf et al., 2019); (c) 8Puzzle, the 8-puzzle; (d) RubiksCube222, the 2x2 Rubik’s cube. For the computation of $p$-learning difficulty and $p$-exploration difficulty to be feasible, $p$ needs to have finite support over a sufficiently small number of states ($\sim 10^7$ or less). To mitigate this limitation, we chose environments for which there exist larger versions with a similar MDP structure. For example, the 2x2 Rubik’s cube should behave similarly to the 3x3 cube, 4x4 cube, etc., and the 8-puzzle should behave similarly to the 15-puzzle, 24-puzzle, etc.

Each environment has 32 action space variants, with one being the base environment (the trivial skill augmentation) and 31 with different sets of macroactions. One macroaction augmentation is calculated using LEMMA (Li et al., 2022) on offline data derived from breadth-first search; 5 are variations of that macroaction augmentation; and 25 are generated randomly. More details are given in Appendix B.

We evaluate how well a combination of $p$-learning difficulty and $p$-exploration difficulty captures the sample complexity of 4 RL algorithms on the different variants of each environment. The algorithms are: (a) Q-learning (Watkins, 1989); (b) Value iteration (Bellman, 1957), modified to the RL setting, similar to (Agostinelli et al., 2019); (c) REINFORCE (Williams, 1992), made tabular by parameterizing the policy directly with the logits of the actions; (d) Deep Q-networks (DQN) (Mnih et al., 2015).

According to Sections 3.1 and 3.2, we expect $J_{\text{learn}}$ to scale roughly linearly with the sample complexity of learning from experience and $\exp(J_{\text{explore}})$ to scale roughly linearly with the sample complexity of exploration. We thus choose a weighted average $J = \lambda J_{\text{learn}} + (1-\lambda)\exp(J_{\text{explore}})$ ($0 \leq \lambda \leq 1$) to represent the combined difficulty. The discount $\delta$ used in the $p$-exploration difficulty is set to $1/H$, where $H$ is the environment’s horizon. The sample complexity $N$ and the combined difficulty $J$ spanned several orders of magnitude in CliffWalking and CompILE2, so we took the logarithm of both before computing their Pearson correlation coefficient. The value of $\lambda$ was chosen to maximize this correlation. The results are summarized in Table 1. Most correlation values are at least around 0.7, demonstrating that combining $p$-learning difficulty and $p$-exploration difficulty allows us to capture a significant portion of the variation in RL sample efficiency on different action space variants of the same environment.
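
The selection of $\lambda$ described above can be sketched as a one-dimensional grid search. The code below is an illustrative assumption about how such a search might look, not the authors' implementation: the function names, the grid resolution, and the choice of a grid search (as opposed to any other optimizer) are hypothetical. Inputs are per-variant values of $J_{\text{learn}}$, $J_{\text{explore}}$, and the measured $N$.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def best_lambda(j_learn, j_explore, n_steps, grid_size=1000):
    """Grid-search lambda in [0, 1] to maximize corr(log N, log J).

    J = lambda * J_learn + (1 - lambda) * exp(J_explore), as in the text.
    The grid search itself is an assumption made for illustration.
    """
    log_n = [math.log(n) for n in n_steps]
    best_lam, best_r = None, -math.inf
    for i in range(grid_size + 1):
        lam = i / grid_size
        log_j = [
            math.log(lam * jl + (1 - lam) * math.exp(je))
            for jl, je in zip(j_learn, j_explore)
        ]
        r = pearson(log_n, log_j)
        if r > best_r:  # NaN compares False, so degenerate grids are skipped
            best_lam, best_r = lam, r
    return best_lam, best_r
```

Because $\lambda$ is tuned per table cell to maximize the correlation, the reported values are best read as upper envelopes of how well a fixed combination would predict sample complexity.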

We also conducted experiments to directly test Lemma 3.1 by computing the correlation between the number of iterations it takes value iteration to converge and the $p$-weighted average solution length (Section D.1). In addition to state value iteration, we also considered Q-value iteration to simulate Q-learning. With two exceptions, all correlations are above 0.9, thus empirically corroborating Lemma 3.1.

4 Effect of Skills on Learning from Experience

Part of our goal is to understand what makes a particular set of skills helpful for an RL agent. One intuition articulated in prior work (Jiang et al., 2022; Kipf et al., 2019) is that skills help compress optimal trajectories, making them shorter and thus more likely to be found during exploration. But, conversely, data distributions can be provably incompressible when their entropy is too high (Cover, 1994). As a result, we expect that skills are less likely to be helpful when the distribution of optimal trajectories in the environment is incompressible. This intuition is made precise by Theorem 4.2, which states that the ratio between the new and old $p$-learning difficulties after an $A_+$-skill augmentation is lower-bounded by the product of an incompressibility measure and a factor penalizing large $|A_+|$. Before stating the theorem, let’s first define this incompressibility measure.

Definition 4.1.

Let $\mathcal{M}_0 = (S, A_0, T_0, g)$ be a DSMDP with finite $|A_0| > 1$ and $\mathcal{M}_+ = (S, A_+, T_+, g)$ its $A_+$-skill augmentation. Let $p$ be a distribution over solvable states. The $A_+$-merged $p$-incompressibility is defined as

$$\mathrm{IC}_{A_+}(\mathcal{M}_0; p) = \sup_{0 < \varepsilon < 1} \frac{\mathrm{H}[P_+] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s\sim p}[d_0(s)] \log\left(\frac{|A_0|}{1-\varepsilon}\right)}. \tag{7}$$

Here, $P_+$ is the distribution of canonical shortest solutions in $\mathcal{M}_+$ to states sampled from $p$, where the canonical shortest solutions are chosen such that $\mathrm{H}[P_+]$ is maximized. Note that $\mathrm{H}[P_+]$ is the entropy of the state distribution after states with the same canonical solution in $\mathcal{M}_+$ have been merged into one state. Thus, it has the property $\mathrm{H}[P_+] \leq \mathrm{H}[p]$, where equality holds iff all states in the support of $p$ have different canonical solutions.

$A_+$-merged $p$-incompressibility can be understood as the coding efficiency of using base actions to write solutions to states sampled from $p$ as opposed to using a code optimized for the distribution of shortest solutions with skills. More precisely, we can write

$$\mathrm{IC}_{A_+}(\mathcal{M}_0; p) = \sup_{0 < \varepsilon < 1} \frac{\mathrm{H}[P_+] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathrm{H}[P_0, P_{0,\text{unif},\varepsilon}] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}, \tag{8}$$

where $\mathrm{H}[P_+]$ is the optimal expected number of bits needed to encode a (canonical) shortest solution in $\mathcal{M}_+$ to a state $s \sim p$, and $\mathrm{H}[P_0, P_{0,\text{unif},\varepsilon}]$ denotes the cross entropy between $P_0$ and $P_{0,\text{unif},\varepsilon}$. $P_0$ is the distribution of shortest solutions to states sampled from $p$ containing only base actions. $P_{0,\text{unif},\varepsilon}(\sigma) = \varepsilon(1-\varepsilon)^{|\sigma|-1}|A_0|^{-|\sigma|}$ is a uniform prior over base action sequences. $\mathrm{H}[P_0, P_{0,\text{unif},\varepsilon}]$ is thus the expected number of bits required to encode a shortest solution using a fixed-length code over base actions $A_0$, optimized for a termination symbol that appears at the end of each time step with probability $\varepsilon$.
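
The agreement between the denominators of Equations 7 and 8 follows from expanding the cross entropy. The short derivation below uses only the formula for $P_{0,\text{unif},\varepsilon}$ given above, together with the assumption that a shortest base-action solution to $s$ has length $d_0(s)$ (the notation $d_0$ is taken from earlier in the paper).

```latex
\begin{align*}
\mathrm{H}[P_0, P_{0,\text{unif},\varepsilon}]
  &= \mathbb{E}_{\sigma\sim P_0}\left[-\log P_{0,\text{unif},\varepsilon}(\sigma)\right] \\
  &= \mathbb{E}_{\sigma\sim P_0}\left[-\log\varepsilon - (|\sigma|-1)\log(1-\varepsilon) + |\sigma|\log|A_0|\right] \\
  &= \log\left(\frac{1-\varepsilon}{\varepsilon}\right)
     + \mathbb{E}_{s\sim p}[d_0(s)]\,\log\left(\frac{|A_0|}{1-\varepsilon}\right).
\end{align*}
```

Subtracting $\log\left(\frac{1-\varepsilon}{\varepsilon}\right)$ from both sides recovers the denominator of Equation 7.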

We now introduce the theorem, which shows how $A_+$-merged $p$-incompressibility can be used to bound how much skills in $A_+$ can improve $p$-learning difficulty.

Theorem 4.2.

Let $\mathcal{M}_+ = (S, A_+, T_+, g)$ be the $A_+$-skill augmentation of the DSMDP $\mathcal{M}_0 = (S, A_0, T_0, g)$ with finite $|A_0| > 1$, and $p$ a probability distribution over solvable states. Then

$$\frac{J_{\text{learn}}(\mathcal{M}_+; p)}{J_{\text{learn}}(\mathcal{M}_0; p)} \geq \frac{|A_+| \log|A_0|}{|A_0| \log|A_+|}\, \mathrm{IC}_{A_+}(\mathcal{M}_0; p). \tag{9}$$

We can use Theorem 4.2 to understand the effect that the expressivity of skills has on their ability to improve $p$-learning difficulty. (See Section F.2 for a more formal treatment where the incompressibility measure in Theorem 4.2 is replaced with one defined explicitly in terms of a quantitative measure of expressivity.) More expressive skills can encode more diverse behavior and thus allow a larger number of action sequences to be encoded as the same skill. This allows states to share solutions more often, which decreases $\mathrm{H}[P_+]$ and hence $\mathrm{IC}_{A_+}(\mathcal{M}_0; p)$. As a result, the lower bound on the $p$-learning difficulty ratio decreases. As concrete examples, if we place no restriction on what kinds of skills are allowed, then we can simply include a single skill that solves all solvable states, resulting in $\mathrm{IC}_{A_+}(\mathcal{M}_0; p) = 0$ and $J_{\text{learn}}(\mathcal{M}_+; p) = |A_0| + 1$. This is less than $J_{\text{learn}}(\mathcal{M}_0; p)$ whenever $\mathbb{E}_{s\sim p}[d_0(s)] > 1 + 1/|A_0|$, which is true for all RL environments of practical interest. If a skill is allowed to be a concrete sequence of actions and loops of actions, then states whose solutions involve different numbers of repetitions of the same component will have the same solution containing a skill with a loop whose body is that component. Thus, $\mathrm{H}[P_+] < \mathrm{H}[p]$ but is larger than the value of zero obtained when no restriction is placed on skills. Finally, if skills are restricted to macroactions, then distinct solutions remain distinct after rewriting with macroactions, and so the $A_+$-merged $p$-incompressibility achieves its maximum value. In solution-separable environments, this maximum value is equal to the unmerged $p$-incompressibility (Definition 4.3), in which case Theorem 4.2 can be restated in terms of it (Corollary 4.4).

Definition 4.3.

Let $\mathcal{M} = (S, A, T, g)$ be a DSMDP with finite $|A| > 1$ and $p$ a distribution over solvable states. The unmerged $p$-incompressibility is defined as

$$\mathrm{IC}(\mathcal{M}; p) = \sup_{0 < \varepsilon < 1} \mathrm{IC}(\mathcal{M}; p, \varepsilon), \tag{10}$$

where the $\varepsilon$-discounted unmerged $p$-incompressibility

$$\mathrm{IC}(\mathcal{M}; p, \varepsilon) = \frac{\mathrm{H}[p] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s\sim p}[d_{\mathcal{M}}(s)] \log\left(\frac{|A|}{1-\varepsilon}\right)}. \tag{11}$$

It measures incompressibility on a scale from 0 to 1 if $\mathcal{M}$ is solution-separable. Furthermore, unlike the $A_+$-merged $p$-incompressibility, it is a function of only $\mathcal{M}$ and $p$ and is thus a general measure of the incompressibility of $\mathcal{M}$.
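
As a rough numerical illustration of Equations 10 and 11, the sketch below evaluates the $\varepsilon$-discounted quantity on a grid of $\varepsilon$ values and takes the maximum. This only approximates the supremum from below, and the function names, the dictionary inputs (`p` mapping states to probabilities, `d` mapping states to shortest-solution lengths), and the grid resolution are hypothetical choices made for the example.

```python
import math


def discounted_incompressibility(p, d, num_actions, eps):
    """Equation 11: (H[p] - log((1-eps)/eps)) / (E[d] * log(|A| / (1-eps)))."""
    entropy = -sum(pr * math.log(pr) for pr in p.values() if pr > 0)
    mean_d = sum(pr * d[s] for s, pr in p.items())
    numerator = entropy - math.log((1 - eps) / eps)
    denominator = mean_d * math.log(num_actions / (1 - eps))
    return numerator / denominator


def unmerged_incompressibility(p, d, num_actions, grid_size=9999):
    """Equation 10, approximated by a maximum over an evenly spaced epsilon grid.

    The grid maximum is a lower estimate of the true supremum over (0, 1).
    """
    return max(
        discounted_incompressibility(p, d, num_actions, k / (grid_size + 1))
        for k in range(1, grid_size + 1)
    )


if __name__ == "__main__":
    # Hypothetical environment: 2 actions, and each of the 2**10 action
    # sequences of length 10 is the unique shortest solution of one state.
    depth, num_actions = 10, 2
    n_states = num_actions ** depth
    p = {s: 1.0 / n_states for s in range(n_states)}
    d = {s: depth for s in range(n_states)}
    print(unmerged_incompressibility(p, d, num_actions))
```

In this hypothetical tree-like environment the value stays strictly between 0 and 1, consistent with the stated range for solution-separable environments.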

Corollary 4.4 (Corollary to Theorem 4.2).

In the setup to Theorem 4.2, suppose $\mathcal{M}_0$ is solution-separable and $A_+$ is a macroaction augmentation (see Section F.3 for the version of this corollary that does not assume solution-separability). Then

$$\frac{J_{\text{learn}}(\mathcal{M}_+; p)}{J_{\text{learn}}(\mathcal{M}_0; p)} \geq \frac{|A_+| \log|A_0|}{|A_0| \log|A_+|}\, \mathrm{IC}(\mathcal{M}_0; p). \tag{12}$$

A direct consequence of the above corollary is that there exist environments where incorporating macroactions will always worsen $p$-learning difficulty, no matter how many there are or what they are.

Corollary 4.5 (Corollary to Corollary 4.4).

In the setup to Theorem 4.2, suppose $\mathcal{M}_0$ is solution-separable and $A_+$ is a strict macroaction augmentation. If

$$1 - \mathrm{IC}(\mathcal{M}_0; p) \leq \frac{1}{|A_0| + 1}\left(1 - \frac{1}{\ln|A_0|}\right),$$

then $J_{\text{learn}}(\mathcal{M}_+; p) > J_{\text{learn}}(\mathcal{M}_0; p)$.
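
To get a feel for how demanding the condition in Corollary 4.5 is, the sketch below evaluates the right-hand side of Equation 12 and the corollary's condition for given action-space sizes. The function names and the sample values ($|A_0| = 4$, an incompressibility of 0.95) are hypothetical and serve only to illustrate the formulas.

```python
import math


def learn_ratio_lower_bound(n_base, n_aug, ic):
    """Right-hand side of Equation 12: (|A+| log|A0|) / (|A0| log|A+|) * IC."""
    return (n_aug * math.log(n_base)) / (n_base * math.log(n_aug)) * ic


def macroactions_always_hurt(n_base, ic):
    """Sufficient condition of Corollary 4.5 for J_learn to increase."""
    return 1 - ic <= (1 - 1 / math.log(n_base)) / (n_base + 1)


if __name__ == "__main__":
    n_base, ic = 4, 0.95  # hypothetical values, not taken from the experiments
    print(macroactions_always_hurt(n_base, ic))
    # Lower bound on the J_learn ratio after adding 1, 2, or 3 macroactions.
    for n_aug in (5, 6, 7):
        print(n_aug, learn_ratio_lower_bound(n_base, n_aug, ic))
```

For $|A_0| = 2$ the right-hand side of the condition is negative because $\ln 2 < 1$, so the condition cannot be met by any incompressibility value of at most 1.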

5 Effect of Skills on Exploration

To study the properties of a DSMDP that make exploration difficult, we have derived a tight lower bound on the $p$-exploration difficulty of a DSMDP in terms of the entropy of $p$ and a term representing how dense solutions to states are in the space of all solutions (Theorem 5.2).

Definition 5.1.

Let $\mathcal{M} = (S, A, T, g)$ be a DSMDP with finite action space $A$. For $0 \leq \delta < 1$, the $\delta$-discounted solution density of $\mathcal{M}$ is defined as

$$D(\mathcal{M}; \delta) = \sum_{s} \rho_{\mathcal{M},\delta}(s), \tag{13}$$

where

$$\begin{aligned}
\rho_{\mathcal{M},\delta}(s) &= \frac{\delta}{1-\delta}\, q_{\mathcal{M},\delta}(s) \\
&= \sum_{\sigma \in \operatorname{Sol}_{\mathcal{M}}(s)} \delta (1-\delta)^{|\sigma|-1} |A|^{-|\sigma|}
\end{aligned} \tag{14}$$

is the probability that a uniformly random action sequence with length sampled from $\operatorname{Geometric}(\delta)$ solves $s$.

Theorem 5.2.

Let $\mathcal{M}_+ = (S, A_+, T_+, g)$ be the $A_+$-skill augmentation of the DSMDP $\mathcal{M}_0 = (S, A_0, T_0, g)$ with a finite action space, and $p$ a probability distribution over solvable states. Then for $0 < \delta < 1$,

$$J_{\text{explore}}(\mathcal{M}_+; p, \delta) \geq \mathrm{H}[p] - \log\left(\frac{1-\delta}{\delta} D(\mathcal{M}_+; \delta)\right). \tag{15}$$

Furthermore, if the state space is finite and $\delta > \max_s p(s)$, then for any $\varepsilon > 0$, there exists an $A_+$-skill augmentation $\mathcal{M}_+$ of $\mathcal{M}_0$ such that

$$J_{\text{explore}}(\mathcal{M}_+; p, \delta) < \mathrm{H}[p] - \log\left(\frac{1-\delta}{\delta} D(\mathcal{M}_+; \delta)\right) + \varepsilon, \tag{16}$$

thus showing that the lower bound given above is tight for all finite DSMDPs and a large range of $\delta$.
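
One way to see where the lower bound in Equation 15 comes from is Gibbs' inequality. The sketch below is offered as intuition under the assumption that the normalizing sum $Z$ is finite and positive; it is not necessarily the route taken in the paper's proof, and the symbols $Z$ and $r$ are introduced here only for this illustration. By Equations 13 and 14, $\frac{1-\delta}{\delta} D(\mathcal{M}_+; \delta) = \sum_s q_{\mathcal{M}_+,\delta}(s)$, so normalizing $q$ yields a probability distribution over states.

```latex
\begin{align*}
Z &:= \sum_{s'} q_{\mathcal{M}_+,\delta}(s') = \frac{1-\delta}{\delta}\, D(\mathcal{M}_+;\delta),
\qquad r(s) := \frac{q_{\mathcal{M}_+,\delta}(s)}{Z}, \\
J_{\text{explore}}(\mathcal{M}_+; p, \delta)
  &= \mathbb{E}_{s\sim p}\left[-\log\big(Z\, r(s)\big)\right]
   = \mathbb{E}_{s\sim p}\left[-\log r(s)\right] - \log Z \\
  &\geq \mathrm{H}[p] - \log Z .
\end{align*}
```

The inequality is the statement that the cross entropy of $p$ relative to $r$ is at least the entropy of $p$, with equality iff $r = p$, which suggests that the bound is approached when the solve probabilities are proportional to $p$.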

The fact that the lower bound grows with H[p]Hdelimited-[]𝑝\mathrm{H}[p]roman_H [ italic_p ] is intuitive: when there are many states that we care about learning to solve (H[p]Hdelimited-[]𝑝\mathrm{H}[p]roman_H [ italic_p ] is large), it is hard for the agent to gather the experience needed to learn to solve all these states (Jexploresubscript𝐽exploreJ_{\text{{explore}}}italic_J start_POSTSUBSCRIPT explore end_POSTSUBSCRIPT is large). However, incorporating skills only changes the action space and cannot affect H[p]Hdelimited-[]𝑝\mathrm{H}[p]roman_H [ italic_p ]. Skills thus improve exploration by increasing the δ𝛿\deltaitalic_δ-discounted solution density, which is interpreted as the density of solutions to states within the space of all action sequences. Action sequences of length l𝑙litalic_l equally divide a total density of δ(1δ)l1𝛿superscript1𝛿𝑙1\delta(1-\delta)^{l-1}italic_δ ( 1 - italic_δ ) start_POSTSUPERSCRIPT italic_l - 1 end_POSTSUPERSCRIPT, so that the combined density of all possible action sequences is 1. If +subscript\mathcal{M}_{+}caligraphic_M start_POSTSUBSCRIPT + end_POSTSUBSCRIPT is solution-separable, then sρ+,δ(s)1subscript𝑠subscript𝜌𝛿𝑠1\sum_{s}\rho_{+,\delta}(s)\leq 1∑ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT italic_ρ start_POSTSUBSCRIPT + , italic_δ end_POSTSUBSCRIPT ( italic_s ) ≤ 1, whereas if every action sequence solves some state, then sρ+,δ(s)1subscript𝑠subscript𝜌𝛿𝑠1\sum_{s}\rho_{+,\delta}(s)\geq 1∑ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT italic_ρ start_POSTSUBSCRIPT + , italic_δ end_POSTSUBSCRIPT ( italic_s ) ≥ 1. Skills improve exploration by increasing this density, similar to how skills reduce A+subscript𝐴A_{+}italic_A start_POSTSUBSCRIPT + end_POSTSUBSCRIPT-merged p𝑝pitalic_p-incompressibility by allowing more states to share solutions. More expressive skills are more apt at increasing solution density. 
For example, introducing macroactions in a solution-separable environment results in a solution-separable environment, so the density remains at most 1. If we introduce the logic of loops, then states whose solutions involve different repetitions of the same component can be solved by the same action sequence containing a loop skill, hence increasing the density. In the extreme case where no restriction is placed on the kind of skills allowed, we can introduce many skills, each of which automatically solves all solvable states. The resultant density is approximately $\delta|S_{\text{solvable}}|$, which is usually much larger than 1.
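
As a rough illustration of this bookkeeping, the Python sketch below assumes that each action sequence of length $l$ over an action space of size $|A|$ carries density $\delta(1-\delta)^{l-1}/|A|^l$. This is our reading of the "equally divide" statement above, not the paper's formal definition, and all function names are hypothetical. The sketch checks numerically that the combined density approaches 1, and gives a crude lower estimate of the summed density in the extreme case where every added skill solves every solvable state.

```python
def sequence_density(length, num_actions, delta):
    """Density of a single action sequence (assumed form, see lead-in)."""
    return delta * (1 - delta) ** (length - 1) / num_actions ** length


def total_density(num_actions, delta, max_length):
    """Combined density of all action sequences of length 1..max_length."""
    return sum(
        num_actions ** l * sequence_density(l, num_actions, delta)
        for l in range(1, max_length + 1)
    )


def universal_skill_density(num_solvable, num_base_actions, num_skills, delta):
    """Lower estimate of the summed density when every added skill solves
    every solvable state. Only length-1 solutions are counted, so the value
    under the assumed definition is at least this large."""
    share_of_skills = num_skills / (num_base_actions + num_skills)
    return num_solvable * delta * share_of_skills


if __name__ == "__main__":
    # Truncated sum over lengths; tends to 1 as max_length grows.
    print(total_density(num_actions=3, delta=0.5, max_length=60))
    # Many universal skills: the estimate approaches delta * num_solvable.
    print(universal_skill_density(10, 2, 998, 0.5))
```

With many universal skills the estimate tends to $\delta|S_{\text{solvable}}|$, in line with the approximation quoted above.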

As a corollary to Theorem 5.2, the increase in $p$-exploration difficulty due to macroactions is lower-bounded by the $\delta$-discounted unmerged $p$-incompressibility (Equation 11) in solution-separable environments, thus providing the $p$-exploration difficulty counterpart to Corollary 4.4.

Corollary 5.3 (Corollary to Theorem 5.2).

In the setup to Theorem 5.2, suppose $\mathcal{M}_0$ is solution-separable, $|A_0| > 1$, and $A_+$ is a macroaction augmentation. Then

$$\frac{J_{\text{explore}}(\mathcal{M}_+; p, \delta)}{J_{\text{explore}}(\mathcal{M}_0; p, \delta)} \geq \mathrm{IC}(\mathcal{M}_0; p, \delta). \tag{17}$$

Compared to Corollary 4.4, the factor $\frac{|A_+| \log |A_0|}{|A_0| \log |A_+|}$ penalizing large $A_+$ is absent, and the $\sup$ in $\mathrm{IC}(\mathcal{M}_0; p) = \sup_{0<\delta<1} \mathrm{IC}(\mathcal{M}_0; p, \delta)$ has been removed. The resultant weaker bound suggests that skills are better suited to improving exploration than learning from experience. This is made more precise in Theorem 5.4 and Corollary 5.5 below, but before stating these results, we first give an intuitive explanation for why this is the case.

In discussing the effects of skills on learning from existing experience, there was a tradeoff between increasing the action space size and reducing solution lengths. Intuitively, while skills allow reward information to propagate to states faster, a large action space means a larger number of experiences to iterate through to efficiently cover the space of all state-action pairs $(s, a)$. Such a tradeoff is not so clear in the effects of skills on exploration. To improve exploration, skills are chosen so that a uniformly random policy in the augmented action space is more likely to reach the goal. If skills are expressive enough, this should always be possible, unless the base action space is already close to optimal. Of course, the most general skills trivially improve $p$-exploration difficulty by simply mapping every solvable state to the goal, which gives $J_{\text{explore}} \approx 0$. But there can be skills that achieve the maximum possible $A_+$-merged $p$-incompressibility (which appears in the lower bound for $p$-learning difficulty increase in Theorem 4.2) but still decrease $p$-exploration difficulty. This is made precise by the following theorem.

Theorem 5.4.

Let $\mathcal{M}_0 = (S, A_0, T_0, g)$ be a solution-separable DSMDP with finite $|A_0| > 1$ as well as finite $|S|$. Let $p$ be a probability distribution over solvable states. For all $\delta > \max_s p(s)$ for which $p \not\equiv \rho_{0,\delta}$, there exists an $A_+$-skill augmentation $\mathcal{M}_+$ of $\mathcal{M}_0$ such that:

  • There exist distinct shortest solutions in $A_+$ to all states in the support of $p$ (namely, $\mathrm{H}[P_+]$ achieves its maximum possible value $\mathrm{H}[p]$ and thus $\mathrm{IC}_{A_+}(\mathcal{M}_0; p)$ achieves its maximum possible value $\mathrm{IC}(\mathcal{M}_0; p)$);

  • $J_{\text{explore}}(\mathcal{M}_+; p, \delta) < J_{\text{explore}}(\mathcal{M}_0; p, \delta)$.

Corollary 5.5 (Corollary to Theorem 5.4).

Assume the setup to Theorem 5.4. If

$$1 - \mathrm{IC}(\mathcal{M}_0; p) \leq \frac{1}{|A_0| + 1}\left(1 - \frac{1}{\ln |A_0|}\right),$$

then there exists a skill augmentation $\mathcal{M}_+$ of $\mathcal{M}_0$ such that $J_{\text{learn}}(\mathcal{M}_+; p) > J_{\text{learn}}(\mathcal{M}_0; p)$ but $J_{\text{explore}}(\mathcal{M}_+; p, \delta) < J_{\text{explore}}(\mathcal{M}_0; p, \delta)$.

Corollary 5.5 shows that there are environments where skills can benefit exploration but harm learning from experience. This again suggests that skills are more apt at improving exploration than learning.
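
The threshold in Corollary 5.5 is easy to evaluate numerically. The sketch below (function names are hypothetical) computes the right-hand side of the condition for a given $|A_0|$ and checks the condition for a supplied incompressibility value. One observation follows directly from the formula: for $|A_0| = 2$ the right-hand side is negative because $\ln 2 < 1$, so the condition could only hold if the incompressibility exceeded 1. The inputs used here are illustrative and not taken from the paper's experiments.

```python
import math


def corollary_5_5_threshold(num_base_actions):
    """Right-hand side of the condition: (1 - 1/ln|A0|) / (|A0| + 1)."""
    if num_base_actions <= 1:
        raise ValueError("The corollary assumes |A0| > 1.")
    return (1 - 1 / math.log(num_base_actions)) / (num_base_actions + 1)


def condition_holds(incompressibility, num_base_actions):
    """True if 1 - IC <= threshold for the given base action count."""
    return 1 - incompressibility <= corollary_5_5_threshold(num_base_actions)


if __name__ == "__main__":
    for n in (2, 3, 4, 8, 16):
        print(n, corollary_5_5_threshold(n))
```

The condition only covers highly incompressible environments: the threshold is a small positive number for $|A_0| \geq 3$, so $\mathrm{IC}(\mathcal{M}_0; p)$ must be close to 1 for it to apply.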

Figure 1: For each of the 4 environments studied, we plot the point $(x, y)$ where $x$ is the unmerged $p$-incompressibility of the base environment and $y$ is the best complexity improvement ratio $\min C_+/C_0$ over the 31 macroaction augmentations of the base environment. Different colors represent different measures $C$ of complexity, and different panels correspond to sample complexities $N$ of different RL algorithms. The plots corresponding to $p$-learning difficulty ($J_{\text{learn}}$) and $p$-exploration difficulty ($J_{\text{explore}}$) have been repeated across panels for clearer comparison with the plots corresponding to the sample complexities ($N$) of the RL algorithms.

As a final discussion on the effect that skills have on exploration, we answer the question: are there environments where unexpressive skills like macroactions always harm exploration? Unlike Corollary 4.4, there is no penalty factor in the lower bound given in Corollary 5.3. As a result, there is no environment where the lower bound is above 1, which would have implied that all macroaction augmentations increase $p$-exploration difficulty. Nevertheless, the answer to the question is still affirmative. The following two theorems construct environments where incorporating macroactions always increases $p$-exploration difficulty, no matter how many there are or what they are.

Theorem 5.6.

Let $\mathcal{M}_0 = (S, A_0, T_0, g)$ be a solution-separable DSMDP with a finite action space such that any state that has a length-1 solution only has length-1 solutions. Let $p$ be a probability distribution over solvable states. Suppose that $\delta > 0$ and

$$D_{\mathrm{KL}}\left(p \parallel \rho_{0,\delta}\right) \leq \frac{\delta^2 \log e}{8(|A_0| + 1)^2}.$$

Then $J_{\text{explore}}(\mathcal{M}_+; p, \delta) > J_{\text{explore}}(\mathcal{M}_0; p, \delta)$ for any strict macroaction augmentation $\mathcal{M}_+$ of $\mathcal{M}_0$.
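
Because the factor $\log e$ appears on the right-hand side, the inequality in Theorem 5.6 does not depend on the logarithm base, provided the KL divergence is computed in the same base: in nats it reads $D_{\mathrm{KL}}(p \parallel \rho_{0,\delta}) \leq \delta^2 / (8(|A_0| + 1)^2)$. A minimal sketch of this check follows. The dictionaries `p` and `rho` are hypothetical inputs mapping states to values, computing $\rho_{0,\delta}$ itself is outside the sketch's scope, and the other hypotheses of the theorem are not verified.

```python
import math


def kl_divergence_nats(p, rho):
    """Sum of p(s) * ln(p(s) / rho(s)) over states with p(s) > 0.
    rho is used as given; it is not renormalized."""
    return sum(ps * math.log(ps / rho[s]) for s, ps in p.items() if ps > 0)


def theorem_5_6_condition(p, rho, delta, num_base_actions):
    """True if the KL inequality of Theorem 5.6 holds (natural-log form)."""
    threshold = delta ** 2 / (8 * (num_base_actions + 1) ** 2)
    return kl_divergence_nats(p, rho) <= threshold


if __name__ == "__main__":
    uniform = {"s1": 0.5, "s2": 0.5}
    skewed = {"s1": 0.9, "s2": 0.1}
    print(theorem_5_6_condition(uniform, uniform, delta=0.5, num_base_actions=2))
    print(theorem_5_6_condition(skewed, uniform, delta=0.5, num_base_actions=2))
```

The threshold shrinks quadratically in $|A_0| + 1$, so for larger base action spaces $p$ must match $\rho_{0,\delta}$ very closely for the theorem to apply.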

Theorem 5.7.

Let $\mathcal{M}_0 = (S, A_0, T_0, g)$ be a solution-separable DSMDP with a finite action space such that: 1) every action sequence is the solution to some state; 2) for every solvable state, all solutions to that state have the same length. Let $p$ be a probability distribution over solvable states such that $p(s)/p(s') = |\operatorname{Sol}_0(s)| / |\operatorname{Sol}_0(s')|$ for any $s, s'$ whose solutions have the same length. Then

$$J_{\text{explore}}(\mathcal{M}_+; p, 0) - J_{\text{explore}}(\mathcal{M}_0; p, 0) \geq \frac{|A_0|}{|A_+|}\left(1 - \frac{|A_0|}{|A_+|}\right) \tag{18}$$

for any strict $A_+$-macroaction augmentation $\mathcal{M}_+$ of $\mathcal{M}_0$.

A stronger version of this theorem (Section F.4) relaxes the conditions on $\mathcal{M}_0$ and $p$; the modified bound subtracts a corresponding KL-divergence term.
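
The right-hand side of Equation 18 depends only on the ratio $|A_0|/|A_+|$. The sketch below (the function name is hypothetical) evaluates it. Since $x(1-x)$ is maximized at $x = 1/2$, the guaranteed gap is largest, namely $1/4$, when the augmented action space is twice the size of the base action space, and it tends to 0 as the number of macroactions grows without bound.

```python
def theorem_5_7_gap(num_base_actions, num_augmented_actions):
    """Lower bound on the increase in p-exploration difficulty from
    Equation 18, for a strict macroaction augmentation (|A+| > |A0|)."""
    if num_augmented_actions <= num_base_actions:
        raise ValueError("A strict augmentation needs |A+| > |A0|.")
    ratio = num_base_actions / num_augmented_actions
    return ratio * (1 - ratio)


if __name__ == "__main__":
    base = 4
    for extra in (1, 2, 4, 16, 64):
        print(extra, theorem_5_7_gap(base, base + extra))
```

The bound is strictly positive for every strict augmentation, which is what makes the increase in exploration difficulty unavoidable under the theorem's hypotheses.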

Stated in words, Theorem 5.6 says that macroactions harm exploration when most action sequences are solutions to some state and a state's assigned importance $p(s)$ is close to the probability that a uniformly random action sequence solves it. Theorem 5.7 suggests that it suffices for $p(s)$ to be roughly proportional to this probability across states whose solutions have the same length. These results make more precise our intuition that it is more difficult to use skills to improve exploration in environments where solutions to states look uniformly randomly distributed.

6 Experiments

Corollaries 4.4 and 5.3 suggest that solution-separable DSMDPs with lower unmerged $p$-incompressibility can benefit more from macroactions. We test this prediction on the four environments studied in Section 3.3, which include both solution-separable (RubiksCube222) and non-solution-separable (CliffWalking, CompILE2, 8Puzzle) DSMDPs. For different complexity measures $C$ ($p$-learning difficulty, $p$-exploration difficulty, and sample complexity $N$ of four RL algorithms), Figure 1 shows the best complexity improvement ratio $\min C_+/C_0$ across the 31 (strict) macroaction augmentations of each base environment against the unmerged $p$-incompressibility of the base environment. We observe a positive correlation regardless of the choice of $C$ and RL algorithm, thus corroborating our theoretical predictions: macroactions are more helpful in environments with lower unmerged $p$-incompressibility.

While the definition of unmerged $p$-incompressibility is motivated in the context of macroactions (Corollaries 4.4 and 5.3), experiments with general stochastic options discovered by LOVE (Jiang et al., 2022) show that it successfully captures the difficulty of applying HRL with general options in an environment. Table 2 shows the unmerged $p$-incompressibility values of our four environments, along with the sample complexity improvement ratio $N_+/N_0$ from optionally applying HRL with options discovered by LOVE. The improvement from HRL decreases as the unmerged $p$-incompressibility increases.
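
The monotone relationship can be checked directly from the mean values reported in Table 2. The sketch below hard-codes those four rows and computes a Spearman rank correlation with a small hand-written helper (ties are not handled, which suffices for these values). With only four environments, and with overlapping standard deviations for 8Puzzle and RubiksCube222, this is a sanity check and not statistical evidence.

```python
# Mean values copied from Table 2: (IC(M0; p), N+/N0).
TABLE_2 = {
    "CliffWalking": (0.0000, 0.000007),
    "CompILE2": (0.1475, 0.00023),
    "8Puzzle": (0.5157, 0.64),
    "RubiksCube222": (0.8072, 0.73),
}


def ranks(values):
    """Rank of each value (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    ic = [row[0] for row in TABLE_2.values()]
    ratio = [row[1] for row in TABLE_2.values()]
    print(spearman(ic, ratio))
```

A higher ratio $N_+/N_0$ means a smaller benefit from HRL, so a positive rank correlation here corresponds to the benefit shrinking as incompressibility grows.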

7 $p$-Incompressibility for Skill Learning

Appendix G demonstrates two ways to use our incompressibility measures to derive objectives for skill learning. We show that, under mild approximations, these two objectives are equivalent to two minimum description length (MDL) objectives previously used in the skill learning literature. In particular, finding the $A_+$ that minimizes $A_+$-merged $p$-incompressibility corresponds to the objective used by LOVE (Jiang et al., 2022), and finding the skills such that the resultant skill-augmented environment has the highest unmerged $p$-incompressibility corresponds to the objective used by LEMMA (Li et al., 2022).

Table 2: Unmerged $p$-incompressibility $\mathrm{IC}(\mathcal{M}_0; p)$ vs. the improvement ratio $N_+/N_0$ of sample complexity $N_{r \geq r^*}$ from applying HRL with LOVE options. Results are averaged over 5 seeds. Because HRL can fail to learn an environment on some seeds, we set the improvement ratio to 1 if HRL does not improve the sample complexity.

Environment      $N_+/N_0$              $\mathrm{IC}(\mathcal{M}_0; p)$
CliffWalking     0.000007 ± 0.000007    0.0000
CompILE2         0.00023 ± 0.00011      0.1475
8Puzzle          0.64 ± 0.19            0.5157
RubiksCube222    0.73 ± 0.17            0.8072

8 Conclusion

We introduce the first theoretical analysis of the utility of RL skills, focusing on deterministic sparse-reward MDPs. With both theoretical motivation and empirical verification, we introduce metrics that quantify two aspects of RL complexity: exploration and learning from experience. We show both theoretically and experimentally that these metrics can be improved more in environments where solutions to states are more compressible. Further theoretical results suggest that skills benefit exploration more than learning from experience, and that less expressive skills are less beneficial to improving RL sample efficiency. Our work is a first step towards characterizing the properties of an environment that make skills helpful for RL, and we expect future theoretical work to generalize beyond deterministic sparse-reward MDPs with finite action spaces.

Acknowledgements

We thank anonymous referees for useful suggestions and discussions, as well as instructors of the MIT Advanced Undergraduate Research Opportunities Program (SuperUROP) for suggestions on presentation.

This work was funded by U.S. National Science Foundation (NSF) awards #1918771 and #1918839. In addition, ZL was supported by the MIT Advanced Undergraduate Research Opportunities Program (SuperUROP) and GP was supported by the Stanford Interdisciplinary Graduate Fellowship (SIGF).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Agostinelli et al. (2019) Agostinelli, F., McAleer, S., Shmakov, A., and Baldi, P. Solving the Rubik’s cube with deep reinforcement learning and search. Nature Machine Intelligence, 1(8):356–363, 2019.
  • Auer et al. (2008) Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008.
  • Bacon et al. (2017) Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • Badia et al. (2020) Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, Z. D., and Blundell, C. Agent57: Outperforming the Atari human benchmark. In International Conference on Machine Learning, pp. 507–517. PMLR, 2020.
  • Bagaria & Konidaris (2019) Bagaria, A. and Konidaris, G. Option discovery using deep skill chaining. In International Conference on Learning Representations, 2019.
  • Barreto et al. (2019) Barreto, A., Borsa, D., Hou, S., Comanici, G., Aygün, E., Hamel, P., Toyama, D., Mourad, S., Silver, D., Precup, D., et al. The option keyboard: Combining skills in reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bellman (1957) Bellman, R. A Markovian decision process. Journal of Mathematics and Mechanics, pp.  679–684, 1957.
  • Choi (1994) Choi, K. P. On the medians of gamma distributions and an equation of Ramanujan. Proceedings of the American Mathematical Society, 121(1):245–251, 1994.
  • Conserva & Rauber (2022) Conserva, M. and Rauber, P. Hardness in Markov decision processes: Theory and practice. Advances in Neural Information Processing Systems, 35:14824–14838, 2022.
  • Cover (1994) Cover, T. Information theory and statistics. In Proceedings of 1994 Workshop on Information Theory and Statistics, pp.  2. IEEE, 1994.
  • Ellis et al. (2019) Ellis, K., Nye, M., Pu, Y., Sosa, F., Tenenbaum, J., and Solar-Lezama, A. Write, execute, assess: Program synthesis with a repl. Advances in Neural Information Processing Systems, 32, 2019.
  • Fawzi et al. (2022) Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., R Ruiz, F. J., Schrittwieser, J., Swirszcz, G., et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022.
  • He et al. (2011) He, R., Brunskill, E., and Roy, N. Efficient planning under uncertainty with macro-actions. Journal of Artificial Intelligence Research, 40:523–570, 2011.
  • Hukmani et al. (2021) Hukmani, K., Kolekar, S., and Vobugari, S. Solving twisty puzzles using parallel Q-learning. Engineering Letters, 29(4), 2021.
  • Jiang et al. (2022) Jiang, Y., Liu, E., Eysenbach, B., Kolter, J. Z., and Finn, C. Learning options via compression. Advances in Neural Information Processing Systems, 35:21184–21199, 2022.
  • Kakade & Langford (2002) Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pp.  267–274, 2002.
  • Kaliszyk et al. (2018) Kaliszyk, C., Urban, J., Michalewski, H., and Olšák, M. Reinforcement learning of theorem proving. Advances in Neural Information Processing Systems, 31, 2018.
  • Kipf et al. (2019) Kipf, T., Li, Y., Dai, H., Zambaldi, V., Sanchez-Gonzalez, A., Grefenstette, E., Kohli, P., and Battaglia, P. Compile: Compositional imitation learning and execution. In International Conference on Machine Learning, pp.  3418–3428. PMLR, 2019.
  • Li et al. (2022) Li, Z., Poesia, G., Costilla-Reyes, O., Goodman, N., and Solar-Lezama, A. LEMMA: Bootstrapping high-level mathematical reasoning with learned symbolic abstractions. NeurIPS’22 MATH-AI Workshop, 2022.
  • Machado et al. (2017) Machado, M. C., Bellemare, M. G., and Bowling, M. A Laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, pp. 2295–2304. PMLR, 2017.
  • Maillard et al. (2014) Maillard, O.-A., Mann, T. A., and Mannor, S. “How hard is my MDP?” The distribution-norm to the rescue. Advances in Neural Information Processing Systems, 27, 2014.
  • Mankowitz et al. (2023) Mankowitz, D. J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J.-B., Ahern, A., et al. Faster sorting algorithms discovered using deep reinforcement learning. Nature, 618(7964):257–263, 2023.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Nayyar et al. (2023) Nayyar, R. K., Verma, S., and Srivastava, S. Learning generalizable symbolic options for transfer in reinforcement learning. In NeurIPS 2023 Workshop on Generalization in Planning, 2023.
  • Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Pedersen et al. (2016) Pedersen, M. R., Nalpantidis, L., Andersen, R. S., Schou, C., Bøgh, S., Krüger, V., and Madsen, O. Robot skills for manufacturing: From concept to industrial deployment. Robotics and Computer-Integrated Manufacturing, 37:282–291, 2016.
  • Poesia et al. (2021) Poesia, G., Dong, W., and Goodman, N. Contrastive reinforcement learning of symbolic reasoning domains. Advances in Neural Information Processing Systems, 34:15946–15956, 2021.
  • Simchowitz & Jamieson (2019) Simchowitz, M. and Jamieson, K. G. Non-asymptotic gap-dependent regret bounds for tabular MDPs. Advances in Neural Information Processing Systems, 32, 2019.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Temporal difference learning. In Reinforcement Learning: An Introduction, chapter 6. MIT Press, 2018.
  • Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
  • Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.  5026–5033. IEEE, 2012.
  • Watkins (1989) Watkins, C. J. C. H. Learning from delayed rewards. 1989.
  • Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
  • Wu et al. (2021) Wu, M., Norrish, M., Walder, C., and Dezfouli, A. Tacticzero: Learning to prove theorems from scratch with deep reinforcement learning. Advances in Neural Information Processing Systems, 34:9330–9342, 2021.

Appendix A Survey on Existing RL Difficulty Metrics

Here, we provide a brief survey on existing RL difficulty metrics and explain why they are inadequate for our purposes. See Conserva & Rauber (2022) for a more detailed survey and benchmark. We will be using the notation $\mathcal{M} = (S, A, P, R)$ for an MDP with state space $S$, action space $A$, transition kernel $P$, and reward kernel $R$.

  • The environmental value norm of the optimal policy (Maillard et al., 2014) is given by

    $$\sup_{(s,a) \in S \times A} \sqrt{\operatorname{Var}_{s' \sim P(s,a)} V_\gamma^*(s')}, \tag{19}$$

    where $P(s,a)$ is the transition kernel of the MDP and $V_\gamma^*$ is the value function of the optimal policy with discount factor $\gamma$. The variation in the values of next states quantifies the difficulty in obtaining accurate sample estimates of action values. However, in deterministic MDPs, which are our focus, the environmental value norm of the optimal policy is always zero and is therefore not applicable.

  • The distribution mismatch coefficient (Kakade & Langford, 2002) is given by

    $$\sup_\pi \sum_{s \in S} \frac{\mu_s^*}{\mu_s^\pi}, \tag{20}$$

    where $\mu_s^\pi$ is the stationary distribution of the Markov chain induced by policy $\pi$ and $\mu_s^*$ is the stationary distribution of the Markov chain induced by the optimal policy. It measures how much the stationary distribution of states visited by the agent can differ from the optimal distribution. It is defined only for ergodic MDPs (otherwise the stationary distribution may not be uniquely defined) in the continuing setting, whereas we focus on deterministic MDPs (which are not ergodic when $|S| > 1$) in the episodic setting.

  • The sum of reciprocals of suboptimality gaps (Simchowitz & Jamieson, 2019) is given by

    $$\sum_{(s,a) \in S \times A:\ \Delta(s,a) \neq 0} \frac{1}{\Delta(s,a)}, \quad \Delta(s,a) = V^*(s) - Q^*(s,a), \tag{21}$$

    where $V^*(s)$ and $Q^*(s,a)$ are the state and action value functions of the optimal policy. Larger $\Delta(s,a)$ allows the agent to more easily distinguish suboptimal actions from the optimal action and can thus reduce average total regret in the long run. However, as Conserva & Rauber (2022) point out, smaller $\Delta(s,a)$ makes it easier to find a near-optimal policy, which contributes to decreasing the sample complexity.

  • The diameter (Auer et al., 2008) is defined to be

    $$\sup_{s_{1}\neq s_{2}}\inf_{\pi}T^{\pi}_{s_{1}\to s_{2}}, \tag{22}$$

    where $T^{\pi}_{s_{1}\to s_{2}}$ denotes the expected time to reach $s_{2}$ starting in $s_{1}$ following policy $\pi$. While this is defined for the continuous setting, a natural definition for the diameter of a DSMDP $\mathcal{M}$ in the episodic setting would be

    $$\sup_{s\neq g:\ \operatorname{Sol}_{\mathcal{M}}(s)\neq\emptyset}d_{\mathcal{M}}(s), \tag{23}$$

    where $d_{\mathcal{M}}(s)$ denotes the length of a shortest solution to $s$. However, taking the supremum is overly pessimistic, and in many cases, there may be states that are far from the goal but that we do not care about solving. Our $p$-learning difficulty takes this into account by using a weighted average of $d_{\mathcal{M}}(s)$, multiplied by $|A|$ to take into account the additional sample complexity due to a large action space.
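To make the contrast between the supremum in Equation 23 and a weighted average concrete, the following is a minimal Python sketch on a toy chain. The function names, the toy environment, and the distribution `p` are illustrative assumptions, and the last quantity follows only the verbal description above (a $p$-weighted average of $d_{\mathcal{M}}(s)$ times $|A|$), not the formal definition in the main text.

```python
from collections import deque

def shortest_solution_lengths(states, actions, step, goal):
    """Return {s: d(s)} for every state from which `goal` is reachable.

    `step(s, a)` is a deterministic transition function (assumed interface).
    We build the reversed transition graph and run BFS outward from the goal.
    """
    predecessors = {s: set() for s in states}
    for s in states:
        if s == goal:
            continue  # the goal is treated as terminal
        for a in actions:
            predecessors[step(s, a)].add(s)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        s = queue.popleft()
        for prev in predecessors[s]:
            if prev not in dist:
                dist[prev] = dist[s] + 1
                queue.append(prev)
    return dist

def episodic_diameter(dist, goal):
    """Supremum of d(s) over solvable non-goal states (Equation 23)."""
    return max(d for s, d in dist.items() if s != goal)

def weighted_length_times_actions(dist, p, num_actions):
    """|A| times the p-weighted average shortest solution length."""
    return num_actions * sum(p_s * dist[s] for s, p_s in p.items())

# Toy chain 0..5 with goal 0: "L" moves one step toward the goal, "S" stays.
states = list(range(6))
actions = ["L", "S"]
step = lambda s, a: s - 1 if a == "L" else s
dist = shortest_solution_lengths(states, actions, step, goal=0)

# Hypothetical start distribution that mostly cares about a nearby state.
p = {1: 0.9, 5: 0.1}
diameter = episodic_diameter(dist, goal=0)
difficulty = weighted_length_times_actions(dist, p, len(actions))
```

In this toy, the diameter is driven entirely by state 5, while the weighted quantity is dominated by state 1, which carries most of the probability mass.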

Appendix B Environments

Experiments were conducted on 4 base environments of varying complexity:

  • CliffWalking (Sutton & Barto, 2018) is a toy grid world environment of size $4\times 12$ where the agent always begins in the bottom left corner and has to travel to the bottom right corner. The available actions are moving one step in each of the 4 cardinal directions. The agent returns to its original position whenever it touches a square in the bottom row other than the leftmost and rightmost squares.

  • CompILE2 is one of the CompILE grid world environments (Kipf et al., 2019). The agent navigates in a $10\times 10$ grid world with walls both lining the edges and within the grid. The world also has several objects of different kinds, possibly with several of each kind. The agent’s goal is to pick up several specified (kinds of) objects in order. In CompILE2, the agent has to pick up 2 objects. The available actions are moving one step in each of the 4 cardinal directions in addition to attempting to pick up the object in the current cell. The positions and types of the objects are fixed but the agent’s position is randomized at every reset, following Jiang et al. (2022). We did not choose 3 or more objects for the agent to pick up because we found that the agent could not find the positive reward signal without suitable skills in these cases, consistent with previous findings on the same environment (Kipf et al., 2019; Jiang et al., 2022). Since whether the goal is reached depends on the sequence of objects the agent has picked up, the state includes both the grid and the sequence of objects that the agent has picked up thus far. Since Kipf et al. (2019) did not publish the source code for the environment, we use the implementation by Jiang et al. (2022).

    Because there can be several of the same kind of object on the grid, there are different sequences of objects the agent can pick up that amount to the same sequence of kinds of objects. There are thus multiple goal states, which are merged into one to comply with the definition of a DSMDP.

  • 8Puzzle is the 8-puzzle, the $3\times 3$ version of the better-known 15-puzzle. There are 8 tiles numbered 1 to 8 on a $3\times 3$ board so that there is one tile missing. The available actions are moving the position of the missing tile in each of the four cardinal directions. The solved state has the numbers 1 to 8 in order from left to right, top to bottom. The puzzle is scrambled from the solved state by applying a random legal action $K$ times where $K$ is uniform between 1 and 31. Here, 31 is the maximum distance from any state to the goal state. The puzzle is re-scrambled if the scramble leaves the puzzle solved.

  • RubiksCube222 is the 2x2 Rubik’s cube, also called the pocket cube. The available actions are turning the front, right, or top faces clockwise by $90^{\circ}$. The cube is scrambled by applying a random sequence of moves of length $K$ where $K$ is uniform between 1 and 11 and where each move is turning the front, right, or top face $90^{\circ}$ clockwise, $180^{\circ}$, or $90^{\circ}$ counterclockwise and no two consecutive moves turn the same face. (Note that the action space used for scrambling is larger than the action space of the agent.) Here, 11 is the maximum distance from a state to the solved state. We use the implementation provided by Hukmani et al. (2021).

For 8Puzzle and RubiksCube222, our choice of sampling the scramble length uniformly from 1 to some maximum $K$ follows Agostinelli et al. (2019).
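The two scrambling procedures described above can be sketched as follows. This is an illustrative Python sketch, not the implementation used in the experiments: the state encoding, the placement of the missing tile in the bottom-right corner of the solved state, the redrawing of $K$ on a re-scramble, and the move-string notation (`F`, `F2`, `F'`) are all assumptions.

```python
import random

SOLVED = (1, 2, 3, 4, 5, 6, 7, 8, 0)          # 0 marks the missing tile (assumed layout)
OFFSETS = {"U": -3, "D": 3, "L": -1, "R": 1}  # index shift of the missing tile

def legal_actions(state):
    """Directions in which the missing tile can move on the 3x3 board."""
    row, col = divmod(state.index(0), 3)
    legal = []
    if row > 0:
        legal.append("U")
    if row < 2:
        legal.append("D")
    if col > 0:
        legal.append("L")
    if col < 2:
        legal.append("R")
    return legal

def apply_action(state, action):
    """Swap the missing tile with its neighbour in the given direction."""
    blank = state.index(0)
    target = blank + OFFSETS[action]
    tiles = list(state)
    tiles[blank], tiles[target] = tiles[target], tiles[blank]
    return tuple(tiles)

def scramble_8puzzle(rng, max_len=31):
    """Apply K random legal actions, K uniform on 1..max_len; retry if solved."""
    while True:
        state = SOLVED
        for _ in range(rng.randint(1, max_len)):
            state = apply_action(state, rng.choice(legal_actions(state)))
        if state != SOLVED:
            return state

def cube_scramble_moves(rng, max_len=11):
    """Sample a scramble sequence in which no two consecutive moves share a face."""
    faces, turns = "FRU", ("", "2", "'")  # 90 clockwise, 180, 90 counterclockwise
    moves, last_face = [], None
    for _ in range(rng.randint(1, max_len)):
        face = rng.choice([f for f in faces if f != last_face])
        moves.append(face + rng.choice(turns))
        last_face = face
    return moves

rng = random.Random(0)
start_state = scramble_8puzzle(rng)
scramble = cube_scramble_moves(rng)
```

Note that `cube_scramble_moves` only produces the move sequence; applying it to a cube state is left to the environment implementation.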

Basic information about the 4 base environments is summarized in Table 3.

Table 3: Basic information about the base environments studied by our experiments. $|A_{0}|$: size of base action space; $|S|$: size of state space; $|S_{p>0}|$: size of support of $p$.

Environment     $|A_{0}|$   $|S|$       $|S_{p>0}|$
CliffWalking    4           32          1
CompILE2        5           115,462     59
8Puzzle         4           362,880     181,439
RubiksCube222   3           3,674,160   3,674,159

For each base environment, one of the 32 action space variants is just the base environment itself. The remaining 31 are (strict) macroaction augmentations generated as follows:

  • For CliffWalking, the LEMMA abstraction algorithm (Li et al., 2022) found one single macroaction from the offline trajectory data generated using breadth-first search (BFS). That single macroaction is just the shortest sequence of actions that solves the only possible starting state of the environment: (U = up, R = right, D = down, L = left)

    • URRRRRRRRRRRD

    5 other sets of macroactions were derived from subsequences of near-optimal solutions to the starting state:

    • RR

    • RR, RRRR, RRRRRRRR

    • RRRRRRRRRRR

    • UUURRRR, RRR, DRDRD

    • URRRRRRRRRRR, RRRRRRRRRRRD

    Furthermore, for each $k=1,2,3,4,5$, we randomly generated 5 sets of $k$ distinct macroactions. A random macroaction with length $L+1$ ($L\sim\operatorname{Geometric}(1/3)$) was generated as follows:

    • With probability 0.4, randomly choose between U and R with probabilities 0.3 and 0.7;

    • With probability 0.3, randomly choose between R and D with probabilities 0.7 and 0.3;

    • With probability 0.1, randomly choose between D and L with probabilities 0.7 and 0.3;

    • With probability 0.2, randomly choose between L and U with probabilities 0.3 and 0.7.

    We did not choose probabilities uniform across all directions because doing so results in several sets of macroactions that cause the agent to drift leftward or downward during random exploration, so that the agent almost never receives any positive reward signal. However, it was also the presence of drift that helped us generate variety in the learnability of the macroaction-augmented environments. Variation in the direction of the drift across different sets of macroactions resulted in sample efficiencies that varied across 7 orders of magnitude.

  • For CompILE2, LEMMA discovered the following set of macroactions: (L = left, U = up, R = right, D = down, P = pick up)

    • PUURRRP, LL, UU, DD

    5 other sets of macroactions were derived from subsequences of subsets of these macroactions:

    • LL, UU, DD

    • LL, UU, RRR, DD

    • PUU, RRRP

    • PUURRRP

    • PUURRRP, LL, UU, RRR, DD

    Furthermore, for each $k=1,2,3,4,5$, we randomly generated 5 sets of $k$ distinct macroactions. A random macroaction with length $L+1$ ($L\sim\operatorname{Geometric}(1/3)$) was generated as follows:

    • With probability 1/4, randomly choose among L, U and P with probabilities 0.4, 0.4 and 0.2;

    • With probability 1/4, randomly choose among U, R and P with probabilities 0.4, 0.4 and 0.2;

    • With probability 1/4, randomly choose among R, D and P with probabilities 0.4, 0.4 and 0.2;

    • With probability 1/4, randomly choose among D, L and P with probabilities 0.4, 0.4 and 0.2.

  • For 8Puzzle, LEMMA discovered the following set of macroactions: (U = up, R = right, D = down, L = left)

    • RD, LDR

    5 other sets of macroactions were derived from subsets of these macroactions, possibly with reflection across the diagonal (a symmetry of the puzzle):

    • RD

    • LDR

    • RD, DR

    • LDR, URD

    • RD, DR, LDR, URD

    Furthermore, for each $k=1,2,3,4,5$, we randomly generated 5 sets of $k$ distinct macroactions. A random macroaction with length $L+1$ ($L\sim\operatorname{Geometric}(1/2)$) was generated by sampling from U, R, D, L with probabilities 0.2, 0.3, 0.3, 0.2. The higher probabilities for R and D are intended to encourage moving the position of the missing tile towards the bottom-right corner.

  • For RubiksCube222, LEMMA generated the empty set. However, the 3 top-scoring macroactions were: (F = front face $90^{\circ}$, R = right face $90^{\circ}$, U = top face $90^{\circ}$)

    • FF, RR, UU

    5 other sets of macroactions were derived from subsets of these macroactions, possibly with more repetition of some base action:

    • FF

    • FF, FFF

    • FF, RR

    • FF, FFF, RR, RRR

    • FF, FFF, RR, RRR, UUU

    Note that FF, RR, UU are half-turns of faces (denoted F2, R2, U2 in standard cube notation) and FFF, RRR, UUU are counter-clockwise $90^{\circ}$ turns (usually denoted F′, R′, U′).

    Furthermore, for each $k=1,2,3,4,5$, we randomly generated 5 sets of $k$ distinct macroactions. A random macroaction with length $L+1$ ($L\sim\operatorname{Geometric}(1/2)$) was generated by sampling from F, R, U each with probability 1/3.
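The random macroaction generation described in this appendix can be sketched as below. Several details are not pinned down by the text, so the sketch fixes them by assumption: $L$ is taken to have support $\{1,2,\ldots\}$ (so every macroaction has length at least 2), the mixture component (for example "U or R") is drawn once per macroaction rather than once per base action, and duplicates within a set are resolved by resampling. All function and constant names are hypothetical.

```python
import random

def sample_geometric(rng, p):
    """Number of Bernoulli(p) trials up to and including the first success."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k

def random_macroaction(rng, p_stop, mixture):
    """Sample one macroaction of length L + 1 with L ~ Geometric(p_stop).

    `mixture` is a list of (weight, {action: probability}) components.
    Assumption: one component is drawn per macroaction, then every base
    action is drawn i.i.d. from that component.
    """
    components = [dist for _, dist in mixture]
    weights = [w for w, _ in mixture]
    component = rng.choices(components, weights=weights)[0]
    length = sample_geometric(rng, p_stop) + 1
    acts, probs = zip(*component.items())
    return "".join(rng.choices(acts, weights=probs, k=length))

def random_macroaction_set(rng, k, p_stop, mixture):
    """Draw macroactions until k distinct ones have been collected."""
    chosen = set()
    while len(chosen) < k:
        chosen.add(random_macroaction(rng, p_stop, mixture))
    return sorted(chosen)

# Mixture weights and per-component probabilities as listed for CliffWalking.
CLIFFWALKING_MIXTURE = [
    (0.4, {"U": 0.3, "R": 0.7}),
    (0.3, {"R": 0.7, "D": 0.3}),
    (0.1, {"D": 0.7, "L": 0.3}),
    (0.2, {"L": 0.3, "U": 0.7}),
]
# 8Puzzle uses a single component over all four directions.
EIGHT_PUZZLE_MIXTURE = [(1.0, {"U": 0.2, "R": 0.3, "D": 0.3, "L": 0.2})]

rng = random.Random(0)
cliff_sets = [random_macroaction_set(rng, k, 1 / 3, CLIFFWALKING_MIXTURE)
              for k in range(1, 6)]
puzzle_set = random_macroaction_set(rng, 3, 1 / 2, EIGHT_PUZZLE_MIXTURE)
```

Under the per-macroaction assumption, each CliffWalking macroaction uses at most two directions, which is one way to obtain the consistent drift discussed above.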

Appendix C Experimental Details

C.1 Hyperparameters

  • The learning rate is $\alpha=0.1$ for Q-learning, value iteration and REINFORCE, and $\alpha=0.0005$ for DQN.

  • For the off-policy RL algorithms (Q-learning, value iteration, and DQN), the optimal epsilon schedule for epsilon greedy can vary by orders of magnitude across different action space variants of the same base environment. We therefore adopt an adaptive epsilon-greedy exploration policy where the probability $\varepsilon$ of choosing a random action starts at 1 and is decreased by 0.002 every time the agent beats its highest test reward so far by 0.002, until $\varepsilon=0.1$.

  • Testing was performed with 200 episodes (1 episode for CliffWalking, which only has one starting state) using the greedy policy (Q-learning, value iteration, DQN) or the current policy (REINFORCE). For the purposes of computing sample complexity, the $N$ at which a reward or value error threshold is reached is computed by averaging over all values of $N$ where the reward/value error crosses above/below the threshold.

  • Experiments were run with a maximum of 100M environment steps. We applied early stopping with a test reward threshold of 0.95 (0.75 for RubiksCube222) and average value error threshold of 0.025 (0.1 for RubiksCube222).

  • The horizon is 50 for all environments, including skill-augmented environments. In addition, to simulate a cost of applying too many base actions, we terminate an episode whenever the number of base actions reaches 100.

  • For Q-learning, value iteration, and DQN, the replay buffer size is 1000 and updates are performed once every 4 episodes with a batch size of 32.

  • Details on the model architecture of DQN are given in Section C.3.

No extensive hyperparameter tuning was done as the purpose of our experiments was not to compare RL algorithms, but to compare the performance of one algorithm on different action space variants of the same base environment.
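The adaptive epsilon-greedy schedule listed among the hyperparameters above can be sketched as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the best test reward is assumed to start at 0, and "beats by 0.002" is read as exceeding the previous best by at least 0.002.

```python
import random

class AdaptiveEpsilon:
    """Epsilon starts at 1 and drops by `step` on each sufficiently large
    improvement of the best test reward, down to `floor`."""

    def __init__(self, start=1.0, floor=0.1, step=0.002, margin=0.002):
        self.epsilon = start
        self.floor = floor
        self.step = step
        self.margin = margin
        self.best_reward = 0.0  # assumed initial value for sparse rewards

    def update(self, test_reward):
        """Call after every test phase; returns the current epsilon."""
        if test_reward >= self.best_reward + self.margin:
            self.best_reward = test_reward
            self.epsilon = max(self.floor, self.epsilon - self.step)
        return self.epsilon

def epsilon_greedy(rng, q_values, epsilon):
    """Pick a uniformly random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

schedule = AdaptiveEpsilon()
rng = random.Random(0)
for test_reward in (0.0, 0.5, 0.501, 0.6):
    eps = schedule.update(test_reward)
    action = epsilon_greedy(rng, [0.1, 0.9, 0.3], eps)
```

Because the schedule reacts to test reward rather than to the step count, it needs no per-environment tuning of a decay horizon, which matches the motivation given above.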

C.2 Computational Resources

Experiments were run on 28 NVIDIA GPUs (8× Quadro RTX 5000, 8× GeForce GTX 1080 Ti, 8× Tesla V100 SXM2 32GB, 4× RTX 6000 Ada Generation). One experiment, which usually consisted of 32 runs of some RL algorithm on different macroaction augmentations of the same base environment, took anywhere from under a minute to about a week to finish. In total, all experiments were completed within one month.

C.3 Algorithm-Specific Details

  • Value iteration is adapted to the RL setting in a way similar to Deep Approximate Value Iteration (DAVI) (Agostinelli et al., 2019). In DAVI, a state is chosen from some initial distribution and the value network is updated by minimizing the quadratic loss between the current state value and the Bellman update, thus requiring a forward pass that computes the values of all next states. In our version of value iteration, a state is chosen from the initial distribution and we apply a rollout of the epsilon-greedy policy. For each state $s$ in the rollout, we also compute all possible next states. Similar to Q-learning, these next states are stored along with $s$ in a replay buffer. When we sample a state (along with its next states) from the replay buffer, its value is updated in the direction of the Bellman update. Note that the fact that all possible next states are computed from each state in a rollout multiplies the number of environment steps taken by $|A|$.

  • The policy $\pi_{\theta}(a\mid s)$ in REINFORCE is parameterized directly by the logits. In other words, the weights are an $|S|\times|A|$ matrix and $\pi_{\theta}(\cdot\mid s)=\mathrm{Softmax}(\theta_{s,\cdot})$.

  • The implementation of the deep neural net in DQN depends on the environment. A state embedding is first constructed from the input before being passed into a linear projection head that outputs the action values $Q(s,\cdot)$ of a state $s$.

    • In CliffWalking, the input is a length-3 multihot vector at every location of the 4-by-12 grid (hence a $4\times 12\times 3$ binary tensor). In each multihot vector, the 3 indices represent the player, goal, and cliff. The state embedding is constructed by passing the input through a 2-layer CNN with ReLU activation followed by a 2-layer MLP with ReLU activation. The CNN has a kernel size of 3 and padding of 1. The hidden dimension is 32 and the output embedding has dimension 16.

    • In CompILE2, the input has two components. The grid is represented as a length-12 multihot vector at every location of the 10-by-10 grid (hence a $10\times 10\times 12$ binary tensor). The 12 indices of each multihot vector represent the 10 types of objects, wall, and agent. The next object the agent has to pick up is represented as a length-10 one-hot vector. The grid is passed through a 2-layer CNN with ReLU activation followed by a 2-layer MLP with ReLU activation. The result is concatenated with an embedding of the next object the agent has to pick up and passed through a linear projection to form the final embedding of the observation. The CNN has a kernel size of 3 and padding of 0. The hidden dimension is 32 in the CNN layers and 128 in the MLP layers; the object embedding has dimension 16; the output embedding has dimension 128.

    • In 8Puzzle, the input is a length-9 one-hot vector at every location of the 3-by-3 grid (hence a $3\times 3\times 9$ binary tensor) denoting the tile present at each location (or the absence thereof). The state embedding is constructed by passing the input through a 2-layer CNN with ReLU activation followed by a 2-layer MLP with ReLU activation. The CNN has a kernel size of 3 and padding of 1. The hidden dimension is 32 and the output dimension is 32.

    • In RubiksCube222, the input is a length-6 multihot vector for each of $6\times 4=24$ tiles of the cube (hence a $24\times 6$ binary tensor) denoting the color of each tile. The state embedding is constructed by flattening the input and passing it through a 4-layer MLP with ReLU activation. The hidden dimension is 64 and the output dimension is 32.
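The tabular softmax parameterization used for REINFORCE above can be sketched in plain Python. The class name and the update rule shown (undiscounted, no baseline, a single return credited to every step of the episode) are assumptions made for illustration; the text only specifies that the logits form an $|S|\times|A|$ matrix.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class TabularSoftmaxPolicy:
    """pi_theta(. | s) = Softmax(theta[s, .]) with one logit per (s, a)."""

    def __init__(self, num_states, num_actions):
        self.theta = [[0.0] * num_actions for _ in range(num_states)]

    def probs(self, s):
        return softmax(self.theta[s])

    def sample(self, s, rng):
        return rng.choices(range(len(self.theta[s])), weights=self.probs(s))[0]

    def reinforce_update(self, episode, ret, lr=0.1):
        """One policy-gradient step on a list of (state, action) pairs.

        Uses d/d theta[s][b] log pi(a | s) = 1[b == a] - pi(b | s).
        Assumption: the same return `ret` is credited to every step.
        """
        for s, a in episode:
            pi = self.probs(s)
            for b in range(len(pi)):
                grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
                self.theta[s][b] += lr * ret * grad_log_pi

policy = TabularSoftmaxPolicy(num_states=2, num_actions=3)
policy.reinforce_update([(0, 1)], ret=1.0)
sampled_action = policy.sample(0, random.Random(0))
```

After the update, action 1 becomes more likely in state 0, while the distribution in state 1 is untouched because its logits are separate parameters.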

Appendix D Additional Empirical Tests of $p$-Learning and $p$-Exploration Difficulty

D.1 Empirically Verifying Lemma 3.1 for Motivating $p$-Learning Difficulty

To test how well $p$-learning difficulty captures learning from experience, we study the value iteration algorithm for planning with known transitions and rewards in a DSMDP. We consider two variants of value iteration: state value iteration for learning the values of states (Bellman, 1957), and action value iteration for learning the values of state-action pairs. The latter is like Q-learning (Watkins, 1989) but modified to update the values of all state-action pairs at once. Instead of the original Bellman update, each update uses a linear interpolation between the old value and the new value given by the Bellman update with a learning rate of $\alpha=0.1$ (see Equation 3).
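A minimal sketch of the interpolated state value iteration described above is given below, applied to chains of two different lengths. It is written under assumptions: no discounting is applied, the goal value is fixed at 1, convergence is declared when every state is within `eps` of 1 (a simplification of the averaged criteria used in Table 4), and all function names are hypothetical.

```python
def value_iteration_sweep(values, states, actions, step, goal, alpha):
    """One synchronous sweep: V(s) <- (1 - alpha) V(s) + alpha max_a V(T(s, a))."""
    new_values = dict(values)
    for s in states:
        if s != goal:
            target = max(values[step(s, a)] for a in actions)
            new_values[s] = (1 - alpha) * values[s] + alpha * target
    return new_values

def iterations_until_converged(states, actions, step, goal,
                               alpha=0.1, eps=0.01, max_iters=100_000):
    """Number of sweeps until every state's value is within eps of 1."""
    values = {s: (1.0 if s == goal else 0.0) for s in states}
    for t in range(1, max_iters + 1):
        values = value_iteration_sweep(values, states, actions, step, goal, alpha)
        if all(1.0 - v <= eps for v in values.values()):
            return t
    return None  # did not converge within max_iters

def chain(n):
    """Chain of states 0..n with goal 0; the single action moves one step closer."""
    return list(range(n + 1)), ["a"], (lambda s, a: s - 1), 0

n_short = iterations_until_converged(*chain(5))
n_long = iterations_until_converged(*chain(10))
```

On these toy chains the longer chain needs more sweeps, which is the qualitative dependence on solution length that the correlations in this section measure.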

For each base environment, we test the correlation between average solution length and sample complexity $N$ on 32 macroaction augmentations of that environment. The results are summarized in Table 4. We find that the correlation between convergence time and average solution length is almost always greater than 0.9, with it occasionally being near-perfect (above 0.99).

Table 4: Across 32 macroaction augmentations of each of 4 base environments, we report the correlations between: the number of iterations until convergence ($N$) for two variants of value iteration (state values and Q-values) and two convergence criteria ($r\geq 0.95$; $\overline{\Delta V}$ or $\overline{\Delta Q}\leq 0.01$); and the $p$-weighted mean solution length of a state ($\overline{d}:=\mathbb{E}_{s\sim p}[d_{+}(s)]$). The reported errors are standard errors of the mean over 5 seeds.

Algorithm          Criterion                            $\overline{d}_{\mathtt{CliffWalking}}$  $\overline{d}_{\mathtt{CompILE2}}$  $\overline{d}_{\mathtt{8Puzzle}}$  $\overline{d}_{\mathtt{RubiksCube222}}$
Q-value iteration  $N_{r\geq 0.95}$                     0.980 ± 0.001                           0.934 ± 0.012                       0.901 ± 0.013                      0.942 ± 0.007
Q-value iteration  $N_{\overline{\Delta Q}\leq 0.01}$   0.998 ± 0.000                           0.977 ± 0.003                       0.968 ± 0.006                      0.989 ± 0.001
Value iteration    $N_{r\geq 0.95}$                     0.977 ± 0.001                           0.942 ± 0.005                       0.902 ± 0.015                      0.942 ± 0.007
Value iteration    $N_{\overline{\Delta V}\leq 0.01}$   0.998 ± 0.000                           0.984 ± 0.002                       0.969 ± 0.005                      0.985 ± 0.001

D.2 Arithmetic Mean Variant of $p$-Exploration Difficulty Performs Worse Than the Geometric Mean

Table 5 shows the version of Table 1 where $J_{\text{explore}}=\log N_{\text{GM}}$ is redefined to be $\log N_{\text{AM}}$. (Up to a constant factor, $N_{\text{AM}}=\mathbb{E}_{s\sim p}[1/q(s)]$ estimates an upper bound on the sample complexity of the exploration stage of RL.) Comparing the results with Table 1, we find that with 3 exceptions (in 8Puzzle), all correlation values are no higher than those when the geometric mean is used.⁵ This provides empirical validation for using the geometric mean as opposed to the arithmetic mean in our definition of $p$-exploration difficulty.

⁵ The correlation values of CliffWalking are exactly equal across the two tables because this environment has only one possible starting state, as a result of which the arithmetic and geometric means are exactly equal.
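The difference between the two means can be illustrated numerically. The sketch below assumes that $N_{\text{GM}}$ is the $p$-weighted geometric mean of $1/q(s)$, i.e. $\exp(\mathbb{E}_{s\sim p}[\log(1/q(s))])$, mirroring the arithmetic-mean expression given above; the function names and the example values of $p$ and $q$ are hypothetical.

```python
import math

def n_arithmetic(p, q):
    """N_AM = E_{s~p}[1 / q(s)]."""
    return sum(p[s] / q[s] for s in p)

def n_geometric(p, q):
    """N_GM = exp(E_{s~p}[log(1 / q(s))]) (assumed form of the geometric mean)."""
    return math.exp(sum(p[s] * math.log(1.0 / q[s]) for s in p))

# Two equally likely start states, one of which is much harder to solve.
p = {"easy": 0.5, "hard": 0.5}
q = {"easy": 0.1, "hard": 0.001}
am = n_arithmetic(p, q)   # dominated by the hard state
gm = n_geometric(p, q)    # averages the two on a log scale

# With a single start state (as in CliffWalking) the two means coincide.
p_single, q_single = {"start": 1.0}, {"start": 0.02}
am_single = n_arithmetic(p_single, q_single)
gm_single = n_geometric(p_single, q_single)
```

By the AM-GM inequality, `am` is never smaller than `gm`; the arithmetic mean is pulled toward the rarest, hardest states, whereas the geometric mean weighs states on a logarithmic scale.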

Table 5: Version of Table 1 where the geometric mean is replaced with the arithmetic mean in the definition of $J_{\text{explore}}$. With 3 exceptions, all correlation values are no higher than those when the geometric mean is used (Table 1).

Algorithm        Sample complexity                                  $\log J_{\mathtt{CliffWalking}}$  $\log J_{\mathtt{CompILE2}}$  $\log J_{\mathtt{8Puzzle}}$  $\log J_{\mathtt{RubiksCube222}}$
Q-Learning       $\log N_{r\geq r^{*}}$                             0.947 ± 0.006                     0.661 ± 0.049                 0.301 ± 0.047                0.366 ± 0.081
Q-Learning       $\log N_{\overline{\Delta Q}\leq\Delta Q^{*}}$     0.953 ± 0.008                     0.631 ± 0.061                 0.442 ± 0.043                0.763 ± 0.019
Value iteration  $\log N_{r\geq r^{*}}$                             0.933 ± 0.009                     0.724 ± 0.043                 0.788 ± 0.042                0.247 ± 0.058
Value iteration  $\log N_{\overline{\Delta V}\leq\Delta V^{*}}$     0.951 ± 0.015                     0.732 ± 0.035                 0.877 ± 0.011                0.694 ± 0.021
REINFORCE        $\log N_{r\geq r^{*}}$                             0.949 ± 0.006                     0.732 ± 0.039                 0.715 ± 0.020                0.537 ± 0.139
DQN              $\log N_{r\geq r^{*}}$                             0.789 ± 0.028                     0.752 ± 0.075                 0.621 ± 0.025                0.576 ± 0.023

Appendix E Proofs

Proof of Lemma 3.1.

(Note: This proof assumes $\log$ refers to the natural logarithm.)

For $\alpha=1$, simple induction on $t$ shows that, at time $t$, the states with value $1$ are exactly those states that can be solved with $t$ actions or less, and all other states have value $0$.

For the $\alpha<1$ case, let’s first consider the case where the DSMDP is a chain of states $0,1,\ldots,n$ where state 0 is the only goal state and $T(s,a)=s-1$ for any action $a$ and non-goal state $s\neq 0$. Then the value iteration formula becomes $V(s)\leftarrow(1-\alpha)V(s)+\alpha V(s-1)$ for $s>0$ and $V(0)=1$. For $\alpha\ll 1$, we can write this as a differential equation

$$\frac{dV_{s}}{dt}=-\alpha(V_{s}-V_{s-1})$$

for $s>0$, and $\frac{dV_{0}}{dt}=0$. (We have switched to subscript notation to make it clearer that this is a linear system of ODEs in time.) Solving the system with the initial conditions $V_{0}(0)=1$ and $V_{s}(0)=0$ for $s>0$ yields

$$V_{s}(t)=1-e^{-\alpha t}\sum_{k=0}^{s-1}\frac{(\alpha t)^{k}}{k!}.$$

Note that $V_{s}(t)$ decreases in $s$, i.e., at any time $t$, states closer to the goal have higher value.
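As a sanity check that is not part of the original proof, the closed form can be verified by direct substitution into the system of ODEs. Differentiating the closed form and evaluating the right-hand side of the ODE for $s\geq 1$ give the same expression:

```latex
\begin{aligned}
\frac{dV_{s}}{dt}
&= \alpha e^{-\alpha t}\sum_{k=0}^{s-1}\frac{(\alpha t)^{k}}{k!}
 - \alpha e^{-\alpha t}\sum_{k=1}^{s-1}\frac{(\alpha t)^{k-1}}{(k-1)!}
 = \alpha e^{-\alpha t}\,\frac{(\alpha t)^{s-1}}{(s-1)!}, \\
-\alpha\,(V_{s}-V_{s-1})
&= \alpha e^{-\alpha t}\left(\sum_{k=0}^{s-1}\frac{(\alpha t)^{k}}{k!}
 - \sum_{k=0}^{s-2}\frac{(\alpha t)^{k}}{k!}\right)
 = \alpha e^{-\alpha t}\,\frac{(\alpha t)^{s-1}}{(s-1)!}.
\end{aligned}
```

The initial conditions also hold: at $t=0$ only the $k=0$ term of the sum survives, so $V_{s}(0)=1-1=0$ for $s\geq 1$, while for $s=0$ the sum is empty and $V_{0}(t)=1$ for all $t$. Monotonicity in $s$ follows because increasing $s$ adds a positive term to the sum for any $t>0$.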

If $\alpha t=as+b\log(1/\varepsilon)$ where $a>1$ and $b=\frac{a}{a-1}$, then

$$\begin{aligned}
\log(1-V_{s}(t)) &\leq -\alpha t+\log\left(s\frac{(\alpha t)^{s}}{s!}\right) \\
&\leq -as-b\log(1/\varepsilon)+s\log(as+b\log(1/\varepsilon))-(s-1)(\log(s-1)-1) \\
&\leq -as-b\log(1/\varepsilon)+s\log s+s\log a+\frac{b}{a}\log(1/\varepsilon)-(s-1)(\log(s-1)-1) \\
&= -s(a-\log a-1)-\left(b-\frac{b}{a}\right)\log(1/\varepsilon)+s(\log s-\log(s-1))+\log(s-1)-1 \\
&= \log\varepsilon-s\left(a-\log a-1+\log\left(\frac{s-1}{s}\right)-\frac{\log(s-1)-1}{s}\right),
\end{aligned}$$

where the last equality uses $b-\frac{b}{a}=1$. This is less than $\log\varepsilon$ for sufficiently large $s$ since $a-\log a-1>0$ and the remaining two terms in the parentheses vanish as $s\to\infty$.

Let $\alpha t = s + \log(1/\varepsilon) - 1 - \log 2$. Then for $s \geq 2$ and $\varepsilon \leq 1/2$, we have

\begin{align*}
\log(1 - V_s(t)) &\geq -s - \log(1/\varepsilon) + 1 + \log 2 + \log\left(\sum_{k=0}^{s-1}\frac{(s-1)^k}{k!}\right) \\
&\overset{(*)}{\geq} -s - \log(1/\varepsilon) + 1 + \log 2 + \log\left(\frac{1}{2}e^{s-1}\right) \\
&= \log\varepsilon,
\end{align*}

where the first inequality holds because $\alpha t \geq s - 1$ when $\varepsilon \leq 1/2$, and the inequality marked $(*)$ uses the fact that the median of a Poisson distribution with positive integer rate $s-1$ is exactly $s-1$ (Choi, 1994).

We have thus shown that we need $\alpha t = \Theta(s + \log(1/\varepsilon))$ to obtain $1 - V_s(t) = \varepsilon$. In other words, the time until the value estimate $V_s(t)$ is within $\varepsilon$ of its true value of 1 is

\[
t = \Theta\left(\frac{s + \log(1/\varepsilon)}{\alpha}\right). \tag{24}
\]
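The two bounds can be checked numerically. The following is a minimal Python sketch, assuming natural logarithms and the illustrative choice $a = 2$ (so $b = 2$); the function names are hypothetical and not taken from the paper. It evaluates $1 - V_s(t)$ directly from the closed form and compares it with $\varepsilon$ at the two values of $\alpha t$ used in the argument.

```python
import math


def one_minus_V(s, alpha_t):
    """Return 1 - V_s(t) = exp(-alpha*t) * sum_{k<s} (alpha*t)^k / k!."""
    term = math.exp(-alpha_t)  # k = 0 term
    total = 0.0
    for k in range(s):
        total += term
        term *= alpha_t / (k + 1)  # turn the k-th term into the (k+1)-th
    return total


def sufficient_alpha_t(s, eps, a=2.0):
    """alpha*t = a*s + b*log(1/eps) with b = a/(a-1); a = 2 is an assumed example value."""
    b = a / (a - 1)
    return a * s + b * math.log(1 / eps)


def insufficient_alpha_t(s, eps):
    """alpha*t = s + log(1/eps) - 1 - log 2, the value used for the lower bound."""
    return s + math.log(1 / eps) - 1 - math.log(2)


# Each entry: (s, eps, upper bound holds, lower bound holds).
results = []
for s in (2, 10, 50):
    for eps in (0.5, 0.1, 1e-3):
        upper_ok = one_minus_V(s, sufficient_alpha_t(s, eps)) <= eps
        lower_ok = one_minus_V(s, insufficient_alpha_t(s, eps)) >= eps
        results.append((s, eps, upper_ok, lower_ok))

for row in results:
    print(row)
```

The grid of $(s, \varepsilon)$ values is arbitrary; it only needs to respect $s \geq 2$ and $\varepsilon \leq 1/2$, the conditions under which the lower bound was derived.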

Now let’s return to the general graph setting. In this situation, the invariants are as follows:

  • $\max_a V(T(s,a), t) = V(n(s), t)$, where $n(s)$ is the next state on the shortest path from $s$ to any goal.

  • $V(s,t) = V_{d(s)}(t)$, where $V_d(t)$ is the solution to the value function in the case of a simple chain, as we just derived.

These invariants are preserved because $V_d(t)$ is non-increasing in $d$. Thus, replacing $s$ with $d(s)$ in the formula for the chain DSMDP (Equation 24) yields the result for the general DSMDP case. ∎
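As an illustration of the reduction to the chain case, the sketch below computes $d(s)$ by breadth-first search from the goals and then evaluates $V_{d(s)}(t)$ with the chain formula. The graph encoding (a dictionary mapping each state to its successor states under $T$), the toy graph, and all names are assumptions made for this example.

```python
import math
from collections import deque


def goal_distances(transitions, goals):
    """BFS over reversed edges: d(s) = fewest actions from s to any goal."""
    reverse = {}
    for state, successors in transitions.items():
        for nxt in successors:
            reverse.setdefault(nxt, []).append(state)
    dist = {g: 0 for g in goals}
    queue = deque(goals)
    while queue:
        u = queue.popleft()
        for v in reverse.get(u, []):
            if v not in dist:  # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def chain_value(d, alpha_t):
    """V_d(t) = 1 - exp(-alpha*t) * sum_{k<d} (alpha*t)^k / k!."""
    term = math.exp(-alpha_t)
    tail = 0.0
    for k in range(d):
        tail += term
        term *= alpha_t / (k + 1)
    return 1.0 - tail


# Hypothetical toy graph: state -> list of successor states; "g" is the goal.
transitions = {"a": ["b", "c"], "b": ["g"], "c": ["b"], "g": []}
dist = goal_distances(transitions, ["g"])
values = {s: chain_value(d, alpha_t=3.0) for s, d in dist.items()}
print(dist)
print(values)
```

States at the same distance from the goal receive the same value, and states closer to the goal receive a higher one, which is the content of the second invariant.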

Proof of Theorem 4.2.

(Note: We use the version of the geometric distribution with support excluding 0.)

For $\sigma \in (A_+)^+$, let $P_{+,\text{unif},\varepsilon}(\sigma) = \varepsilon(1-\varepsilon)^{|\sigma|-1}|A_+|^{-|\sigma|}$, the probability that a random sequence of length $\sim \operatorname{Geometric}(\varepsilon)$ with actions chosen uniformly from $A_+$ is exactly $\sigma$. Then $P_{+,\text{unif},\varepsilon}$ is a probability distribution over $(A_+)^+$, so

\[
\mathbb{E}_{\sigma \sim P_+}[-\log P_{+,\text{unif},\varepsilon}(\sigma)] = \mathrm{H}[P_+, P_{+,\text{unif},\varepsilon}] \geq \mathrm{H}[P_+],
\]

where $\mathrm{H}[p, q]$ denotes the cross entropy between $p$ and $q$.

Now, fix any $0 < \varepsilon < 1$. Then

\begin{align*}
\mathbb{E}_{s \sim p}[d_+(s)]\log\left(\frac{|A_+|}{1-\varepsilon}\right) + \log\left(\frac{1-\varepsilon}{\varepsilon}\right) &= \mathbb{E}_{\sigma \sim P_+}[|\sigma|]\log\left(\frac{|A_+|}{1-\varepsilon}\right) + \log\left(\frac{1-\varepsilon}{\varepsilon}\right) \\
&= \mathbb{E}_{\sigma \sim P_+}[-\log P_{+,\text{unif},\varepsilon}(\sigma)] \\
&\geq \mathrm{H}[P_+].
\end{align*}
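The middle equality follows by expanding the definition of $P_{+,\text{unif},\varepsilon}$. Spelled out for a single sequence $\sigma$ (a worked step added for readability, using only the definition given earlier in this proof):

\begin{align*}
-\log P_{+,\text{unif},\varepsilon}(\sigma) &= -\log\varepsilon - (|\sigma| - 1)\log(1-\varepsilon) + |\sigma|\log|A_+| \\
&= |\sigma|\log\left(\frac{|A_+|}{1-\varepsilon}\right) + \log\left(\frac{1-\varepsilon}{\varepsilon}\right).
\end{align*}

Taking the expectation over $\sigma \sim P_+$ turns $|\sigma|$ into $\mathbb{E}_{\sigma \sim P_+}[|\sigma|]$ and gives the stated identity.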

Thus,

\begin{align*}
\frac{\mathbb{E}_{s \sim p}[d_+(s)]\log\left(\frac{|A_+|}{1-\varepsilon}\right)}{\mathbb{E}_{s \sim p}[d_0(s)]\log\left(\frac{|A_0|}{1-\varepsilon}\right)} &\geq \frac{\mathrm{H}[P_+] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s \sim p}[d_0(s)]\log\left(\frac{|A_0|}{1-\varepsilon}\right)} \\
\frac{J_{\text{learn}}(\mathcal{M}_+; p)}{J_{\text{learn}}(\mathcal{M}_0; p)} = \frac{\mathbb{E}_{s \sim p}[d_+(s)]\,|A_+|}{\mathbb{E}_{s \sim p}[d_0(s)]\,|A_0|} &\geq \frac{|A_+|\log\left(\frac{|A_0|}{1-\varepsilon}\right)}{|A_0|\log\left(\frac{|A_+|}{1-\varepsilon}\right)} \cdot \frac{\mathrm{H}[P_+] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s \sim p}[d_0(s)]\log\left(\frac{|A_0|}{1-\varepsilon}\right)} \\
&\geq \frac{|A_+|\log|A_0|}{|A_0|\log|A_+|} \cdot \frac{\mathrm{H}[P_+] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s \sim p}[d_0(s)]\log\left(\frac{|A_0|}{1-\varepsilon}\right)}.
\end{align*}

The last inequality used the fact that $|A_+| \geq |A_0|$ gives

\[
\frac{\log\left(\frac{|A_0|}{1-\varepsilon}\right)}{\log\left(\frac{|A_+|}{1-\varepsilon}\right)} = 1 - \frac{\log\left(\frac{|A_+|}{|A_0|}\right)}{\log\left(\frac{|A_+|}{1-\varepsilon}\right)} \geq 1 - \frac{\log\left(\frac{|A_+|}{|A_0|}\right)}{\log|A_+|} = \frac{\log|A_0|}{\log|A_+|}.
\]

Now, we have

\[
\frac{J_{\text{learn}}(\mathcal{M}_+; p)}{J_{\text{learn}}(\mathcal{M}_0; p)} \geq \sup_{0 < \varepsilon < 1} \frac{|A_+|\log|A_0|}{|A_0|\log|A_+|} \cdot \frac{\mathrm{H}[P_+] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s \sim p}[d_0(s)]\log\left(\frac{|A_0|}{1-\varepsilon}\right)} = \frac{|A_+|\log|A_0|}{|A_0|\log|A_+|}\,\mathrm{IC}_{A_+}(\mathcal{M}_0; p),
\]

which completes the proof. ∎
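To get a feel for the size of this lower bound, the supremum over $\varepsilon$ can be approximated on a grid. The sketch below is only illustrative: it assumes natural logarithms (entropy in nats), $|A_0| \geq 2$, and made-up input values; the function name and the grid resolution are not from the paper.

```python
import math


def learn_ratio_lower_bound(entropy_p_plus, mean_d0, n_base, n_total, grid=10_000):
    """Grid approximation of the lower bound on J_learn(M+; p) / J_learn(M0; p).

    entropy_p_plus : H[P+] in nats
    mean_d0        : E_{s~p}[d_0(s)]
    n_base, n_total: |A_0| and |A_+| (requires 2 <= n_base <= n_total)
    """
    prefactor = (n_total * math.log(n_base)) / (n_base * math.log(n_total))
    best = -math.inf
    for i in range(1, grid):  # eps ranges over the open interval (0, 1)
        eps = i / grid
        numerator = entropy_p_plus - math.log((1 - eps) / eps)
        denominator = mean_d0 * math.log(n_base / (1 - eps))
        best = max(best, numerator / denominator)
    return prefactor * best


# Hypothetical numbers chosen only to exercise the function.
example = learn_ratio_lower_bound(entropy_p_plus=20.0, mean_d0=10.0, n_base=4, n_total=6)
print(example)
```

A finite grid can only underestimate the supremum, so the returned value remains a valid (if slightly loose) lower bound under the stated assumptions.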

Proof of Corollary 4.5.

Since $|A_0| \geq 2$, we have $|A_+| \geq |A_0| + 1 \geq 3$. The function $f(x) = \ln x / x$ is decreasing for $x \geq e$, so

\begin{align*}
\frac{|A_0|\ln|A_+|}{|A_+|\ln|A_0|} = \frac{f(|A_+|)}{f(|A_0|)} &\leq \frac{f(|A_0| + 1)}{f(|A_0|)} = \frac{|A_0|\ln(|A_0| + 1)}{(|A_0| + 1)\ln|A_0|} \\
&= \frac{|A_0|}{|A_0| + 1}\left(1 + \frac{\ln\left(1 + \frac{1}{|A_0|}\right)}{\ln|A_0|}\right) < \frac{|A_0|}{|A_0| + 1}\left(1 + \frac{1}{|A_0|\ln|A_0|}\right) = \frac{1}{|A_0| + 1}\left(|A_0| + \frac{1}{\ln|A_0|}\right).
\end{align*}

Then

\[
\frac{J_{\text{learn}}(\mathcal{M}_+; p)}{J_{\text{learn}}(\mathcal{M}_0; p)} = \frac{|A_+|\ln|A_0|}{|A_0|\ln|A_+|}\,\mathrm{IC}(\mathcal{M}_0; p) > \frac{1 - \frac{1}{|A_0| + 1}\left(1 - \frac{1}{\ln|A_0|}\right)}{\frac{1}{|A_0| + 1}\left(|A_0| + \frac{1}{\ln|A_0|}\right)} = 1,
\]

as desired. ∎
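The chain of inequalities on $f(x) = \ln x / x$ used in this proof is easy to sanity-check numerically. The following sketch (function names are hypothetical) tests it for a range of action-space sizes with $|A_0| \geq 2$ and $|A_+| \geq |A_0| + 1$.

```python
import math


def f(x):
    """f(x) = ln(x) / x, which is decreasing for x >= e."""
    return math.log(x) / x


def ratio_upper_bound(n_base):
    """Final right-hand side of the chain: (|A_0| + 1/ln|A_0|) / (|A_0| + 1)."""
    return (n_base + 1 / math.log(n_base)) / (n_base + 1)


def check_chain(n_base, n_total):
    """True if f(|A+|)/f(|A0|) <= f(|A0|+1)/f(|A0|) < ratio_upper_bound(|A0|)."""
    lhs = f(n_total) / f(n_base)   # equals |A0| ln|A+| / (|A+| ln|A0|)
    mid = f(n_base + 1) / f(n_base)
    return lhs <= mid < ratio_upper_bound(n_base)


all_ok = all(check_chain(n, n + k) for n in range(2, 50) for k in range(1, 20))
print(all_ok)
```

The strict inequality in the middle of the chain is just $\ln(1 + x) < x$ for $x > 0$, applied with $x = 1/|A_0|$.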

Proof of Theorem 5.2.
\begin{align*}
J_{\text{explore}}(\mathcal{M}_+; p, \delta) &= \mathbb{E}_{s \sim p}[-\log q_{+,\delta}(s)] \\
&= \mathbb{E}_{s \sim p}[-\log \rho_{+,\delta}(s)] - \log\left(\frac{1-\delta}{\delta}\right) \\
&= \mathbb{E}_{s \sim p}\left[-\log\left(\frac{\rho_{+,\delta}(s)}{D(\mathcal{M}_+; \delta)}\right)\right] - \log\left(\frac{1-\delta}{\delta} D(\mathcal{M}_+; \delta)\right) \\
&= \mathrm{H}[p] + D_{\mathrm{KL}}\left(p \,\middle\|\, \frac{\rho_{+,\delta}(\cdot)}{D(\mathcal{M}_+; \delta)}\right) - \log\left(\frac{1-\delta}{\delta} D(\mathcal{M}_+; \delta)\right) \tag{25} \\
&\geq \mathrm{H}[p] - \log\left(\frac{1-\delta}{\delta} D(\mathcal{M}_+; \delta)\right),
\end{align*}

where we have used the fact that $\frac{\rho_{+,\delta}(\cdot)}{D(\mathcal{M}_+; \delta)}$ is a normalized probability distribution.
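The step leading to Equation 25 is the standard decomposition of cross entropy into entropy plus KL divergence, and the final inequality is non-negativity of the KL divergence. A small numerical illustration follows, with two arbitrary distributions `p` and `q` standing in for $p$ and the normalized $\rho_{+,\delta}$; the values are made up for the example.

```python
import math


def entropy(p):
    """H[p] in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """E_{s~p}[-log q(s)] in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) in nats; requires q(s) > 0 wherever p(s) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over three states.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)
```

The bound is tight exactly when the KL term vanishes, which is why the remainder of the proof constructs $A_+$ so that the normalized $\rho_{+,\delta}$ approaches $p$.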

Now, suppose the state space is finite and $\delta > \max_s p(s)$. According to Equation 25, we want to show that we can make $D_{\mathrm{KL}}\left(p \,\middle\|\, \frac{\rho_{+,\delta}(\cdot)}{D(\mathcal{M}_+; \delta)}\right)$ arbitrarily small with a suitable choice of $A_+$. Construct $A_+$ as follows. Let the number of skills $|A_+| - |A_0|$ be some large number $K \gg \max\{|A_0|, 1/\min_{s : p(s) > 0} p(s)\}$. For each solvable state $s$ with $p(s) > 0$, let $\lfloor K f(s) \rfloor$ skills send $s$ directly to the goal state and the remaining $K - \lfloor K f(s) \rfloor$ send $s$ back to $s$ itself, where $f(s) = \frac{\delta}{\delta - (1-\delta)p(s)}\,p(s) \in (0, 1)$. (For solvable states $s$ with $p(s) = 0$, simply let all $K$ skills send $s$ back to $s$ itself.) Let's now show that $\rho_{+,\delta}(s) \to p(s)$ as $K \to \infty$ for every solvable state $s$.

$\rho_{+,\delta}(s)$ is the probability that an action sequence $\sigma$ with actions uniformly chosen from $A_+$ and length $|\sigma| \sim \operatorname{Geometric}(\delta)$ solves $s$. Among the solutions to $s$, the total probability of those that contain a base action is no more than the total probability of all action sequences that contain a base action. The latter is given by

\begin{align*}
1 - \sum_{\sigma \in (A_+ \setminus A_0)^+} \delta(1-\delta)^{|\sigma|-1}|A_+|^{-|\sigma|} &= 1 - \sum_{l=1}^{\infty} (|A_+| - |A_0|)^l\,\delta(1-\delta)^{l-1}|A_+|^{-l} \\
&= 1 - \frac{\delta}{1-\delta}\sum_{l=1}^{\infty}\left((1-\delta)\left(1 - \frac{|A_0|}{|A_+|}\right)\right)^l \\
&= 1 - \frac{\delta\left(1 - \frac{|A_0|}{|A_+|}\right)}{1 - (1-\delta)\left(1 - \frac{|A_0|}{|A_+|}\right)} \\
&\to 0, \quad \text{as $|A_0|/|A_+| \to 0$}.
\end{align*}
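The closed form above can be compared against a truncated version of the series. The sketch below does so for one arbitrary choice of $|A_0|$, $|A_+|$ and $\delta$, and also evaluates the closed form for a much larger $|A_+|$; the function names and numbers are illustrative assumptions.

```python
def prob_contains_base_action(n_base, n_total, delta):
    """Closed form for the probability that a random sequence contains a base action."""
    r = 1 - n_base / n_total  # probability that a single action is a skill
    return 1 - delta * r / (1 - (1 - delta) * r)


def prob_contains_base_action_series(n_base, n_total, delta, max_len=500):
    """Same quantity from the series over sequence lengths l = 1..max_len."""
    r = 1 - n_base / n_total
    term = delta * r            # l = 1: delta * (1-delta)^0 * r^1
    skills_only = 0.0
    for _ in range(max_len):
        skills_only += term
        term *= (1 - delta) * r  # move from length l to length l + 1
    return 1 - skills_only


closed = prob_contains_base_action(4, 20, 0.3)
series = prob_contains_base_action_series(4, 20, 0.3)
many_skills = prob_contains_base_action(4, 10**6, 0.3)
print(closed, series, many_skills)
```

With $|A_0|$ fixed, increasing the number of skills drives this probability toward zero, so sequences containing base actions contribute negligibly to $\rho_{+,\delta}(s)$ in the limit.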

It now remains to show that the total probability of solutions to $s$ that consist only of skills approximates $p(s)$ arbitrarily well as $K \to \infty$. For $s$ with $p(s) = 0$, no such solutions exist and so their total probability is 0. For $s$ with $p(s) > 0$,

\begin{align*}
\sum_{\sigma \in \operatorname{Sol}_+(s) \cap (A_+ \setminus A_0)^+} \delta(1-\delta)^{|\sigma|-1}|A_+|^{-|\sigma|} &= \sum_{l=1}^{\infty} \lfloor K f(s) \rfloor (K - \lfloor K f(s) \rfloor)^{l-1}\,\delta(1-\delta)^{l-1}|A_+|^{-l} \\
&= \frac{\delta \lfloor K f(s) \rfloor}{|A_+|}\sum_{l=1}^{\infty}\left((1-\delta)\frac{K - \lfloor K f(s) \rfloor}{|A_+|}\right)^{l-1} \\
&= \frac{\delta \lfloor K f(s) \rfloor}{|A_+|} \cdot \frac{1}{1 - (1-\delta)\frac{K - \lfloor K f(s) \rfloor}{|A_+|}} \\
&\to \delta f(s)\,\frac{1}{1 - (1-\delta)(1 - f(s))} \quad \text{(as $K \to \infty$)} \\
&= p(s).
\end{align*}

By now, we have shown that $\rho_{+,\delta}(s)\to p(s)$ as $K\to\infty$ for every solvable state $s$. Since $S$ is finite, this convergence is uniform, so the KL-divergence between $p$ and the normalized version of $\rho_{+,\delta}$ tends to zero as $K\to\infty$, as desired. ∎
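As a numerical sanity check of the limit above, the following Python sketch evaluates the skill-only probability mass for finite $K$ and compares it with the limiting expression. It assumes $|A_+| = |A_0| + K$, as in the proof of Theorem 5.4; the function names and the sample values of $f(s)$, $\delta$, and $|A_0|$ are illustrative choices, not taken from the paper.

```python
import math


def skill_only_mass(K, f_s, delta, n_base, terms=None):
    """Probability mass of skill-only solutions to s when there are K skills.

    Assumes |A_+| = n_base + K, with floor(K * f_s) skills sending s
    directly to the goal. If `terms` is given, the geometric series is
    truncated after that many terms instead of using its closed form.
    """
    n_plus = n_base + K
    direct = math.floor(K * f_s)
    first = delta * direct / n_plus               # the l = 1 term
    ratio = (1 - delta) * (K - direct) / n_plus   # common ratio, below 1
    if terms is None:
        return first / (1 - ratio)
    return sum(first * ratio ** (l - 1) for l in range(1, terms + 1))


def limit_mass(f_s, delta):
    """Limit of skill_only_mass as K grows without bound."""
    return delta * f_s / (1 - (1 - delta) * (1 - f_s))


if __name__ == "__main__":
    for K in (10, 1000, 10**6):
        print(K, skill_only_mass(K, 0.3, 0.1, 4), limit_mass(0.3, 0.1))
```

The truncated series and the closed form agree for any fixed $K$, and the gap to the limit shrinks as $K$ grows, which is the convergence the proof relies on.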

Proof of Corollary 5.3.

Since $\mathcal{M}_0$ is solution-separable and $\mathcal{M}_+$ is a macroaction augmentation of $\mathcal{M}_0$, $\mathcal{M}_+$ is also solution-separable. Thus, $D(\mathcal{M}_+;\delta)\leq 1$. By Theorem 5.2,

$$J_{\text{explore}}(\mathcal{M}_+;p,\delta)\geq\mathrm{H}[p]-\log\left(\frac{1-\delta}{\delta}D(\mathcal{M}_+;\delta)\right)\geq\mathrm{H}[p]-\log\left(\frac{1-\delta}{\delta}\right),$$

whereas

$$J_{\text{explore}}(\mathcal{M}_0;p,\delta)=\operatorname{\mathbb{E}}_{s\sim p}[-\log q_{0,\delta}(s)]\leq\operatorname{\mathbb{E}}_{s\sim p}\left[-\log\left(\left(\frac{1-\delta}{|A_0|}\right)^{d_0(s)}\right)\right]=\operatorname{\mathbb{E}}_{s\sim p}[d_0(s)]\log\left(\frac{|A_0|}{1-\delta}\right).$$

Thus,

$$\frac{J_{\text{explore}}(\mathcal{M}_+;p,\delta)}{J_{\text{explore}}(\mathcal{M}_0;p,\delta)}\geq\frac{\mathrm{H}[p]-\log\left(\frac{1-\delta}{\delta}\right)}{\operatorname{\mathbb{E}}_{s\sim p}[d_0(s)]\log\left(\frac{|A_0|}{1-\delta}\right)},$$

as desired. ∎
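The bound just derived is simple to evaluate for a concrete state distribution. The sketch below is an illustration only: the function name `corollary_5_3_bound`, its argument layout, and the example values are assumptions made here, not part of the paper. Because the same logarithm base appears in the numerator and the denominator, the ratio does not depend on the base.

```python
import math


def corollary_5_3_bound(p, d0, n_base, delta, base=2.0):
    """Lower bound on J_explore(M_+) / J_explore(M_0) from the proof above.

    p      : probabilities of the solvable states (should sum to 1)
    d0     : distance of each state to the goal in M_0
    n_base : |A_0|, the number of base actions
    delta  : the parameter delta, with 0 < delta < 1
    """
    def log(x):
        return math.log(x, base)

    entropy = -sum(pi * log(pi) for pi in p if pi > 0)
    mean_dist = sum(pi * di for pi, di in zip(p, d0))
    numerator = entropy - log((1 - delta) / delta)
    denominator = mean_dist * log(n_base / (1 - delta))
    return numerator / denominator


if __name__ == "__main__":
    # 16 equally likely states, each at distance 2, with 4 base actions.
    print(corollary_5_3_bound([1 / 16] * 16, [2] * 16, 4, 0.5))
```

With $\delta = 1/2$ the $\log\frac{1-\delta}{\delta}$ term vanishes, so this example reduces to $\log 16 / (2 \log 8) = 2/3$.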

Proof of Theorem 5.4.

The construction given in the proof of Theorem 5.2 allows us to make $D_{\mathrm{KL}}\left(p\parallel\frac{\rho_{+,\delta}(s)}{D(\mathcal{M}_+;\delta)}\right)$ arbitrarily close to 0 and $D(\mathcal{M}_+;\delta)=\sum_s\rho_{+,\delta}(s)$ arbitrarily close to 1 with sufficiently large $K=|A_+|-|A_0|$. Recalling Equation 25, this means that for any $\varepsilon>0$, the construction gives

$$J_{\text{explore}}(\mathcal{M}_+;p,\delta)<\mathrm{H}[p]-\log\left(\frac{1-\delta}{\delta}\right)+\varepsilon$$

for sufficiently large $K$.

On the other hand, let $p',\rho_{0,\delta}'$ be distributions defined on solvable states in addition to a dummy state $s_d$ such that $p'(s)=p(s)$ and $\rho_{0,\delta}'(s)=\rho_{0,\delta}(s)$ whenever $s\neq s_d$, whereas $p'(s_d)=0$ and $\rho_{0,\delta}'(s_d)=1-\sum_{s\neq s_d}\rho_{0,\delta}(s)$. (Note that $\rho_{0,\delta}'(s_d)\geq 0$ because $\mathcal{M}_0$ is solution-separable.) Then $D_{\mathrm{KL}}\left(p'\parallel\rho_{0,\delta}'\right)>0$ since $p'\not\equiv\rho_{0,\delta}'$. This gives

\begin{align*}
J_{\text{explore}}(\mathcal{M}_0;p,\delta)
&= \operatorname{\mathbb{E}}_{s\sim p}\left[-\log\rho_{0,\delta}(s)\right]-\log\left(\frac{1-\delta}{\delta}\right) \\
&= \mathrm{H}[p]+\operatorname{\mathbb{E}}_{s\sim p}\left[-\log\frac{\rho_{0,\delta}(s)}{p(s)}\right]-\log\left(\frac{1-\delta}{\delta}\right) \\
&= \mathrm{H}[p]+D_{\mathrm{KL}}\left(p'\parallel\rho_{0,\delta}'(s)\right)-\log\left(\frac{1-\delta}{\delta}\right) \tag{26} \\
&> \mathrm{H}[p]-\log\left(\frac{1-\delta}{\delta}\right).
\end{align*}

As a result, for sufficiently large $K$, $J_{\text{explore}}(\mathcal{M}_+;p,\delta)<J_{\text{explore}}(\mathcal{M}_0;p,\delta)$.

Now, let’s show that the construction in the proof of Theorem 5.2 can be made more precise to allow all states with $p(s)>0$ to have distinct canonical shortest solutions in $A_+$. Simply choose $K$ large enough so that, for all $s$ with $p(s)>0$, the number of skills $\lfloor Kf(s)\rfloor$ that send $s$ directly to $g$ is at least the number of states with $p(s)>0$. Then the number of shortest solutions to every $s$ with $p(s)>0$ is at least the number of such $s$, so it is possible to choose one shortest solution for every such $s$ so that all the chosen solutions are distinct. ∎
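The decomposition in Equation 26 can be checked numerically. The sketch below pads a sub-normalized $\rho_{0,\delta}$ with a dummy state and confirms that the cross-entropy term equals $\mathrm{H}[p]$ plus the KL-divergence; the $-\log\frac{1-\delta}{\delta}$ term appears on both sides and is therefore left out. The helper names, the base-2 logarithm, and the sample numbers are assumptions made for illustration.

```python
import math


def entropy(p):
    """Shannon entropy H[p] in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, rho):
    """E_{s~p}[-log rho(s)] in bits; rho need not be normalized."""
    return -sum(pi * math.log2(ri) for pi, ri in zip(p, rho) if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) in bits; states with p(s) = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pad_with_dummy(p, rho):
    """Append the dummy state s_d: zero mass under p, leftover mass under rho."""
    slack = 1.0 - sum(rho)
    if slack < 0:
        raise ValueError("rho must sum to at most 1 (solution-separability)")
    return list(p) + [0.0], list(rho) + [slack]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    rho = [0.2, 0.1, 0.05]  # sub-normalized, as for a solution-separable DSMDP
    p_pad, rho_pad = pad_with_dummy(p, rho)
    print(cross_entropy(p, rho), entropy(p) + kl_divergence(p_pad, rho_pad))
```

The dummy state carries no mass under $p'$, so it changes neither the expectation nor the KL sum; its only role is to make $\rho_{0,\delta}'$ a proper distribution so that the divergence is nonnegative.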

Proof of Corollary 5.5.

Define $A_+$ as in Theorem 5.4, so that $J_{\text{explore}}(\mathcal{M}_+;p,\delta)<J_{\text{explore}}(\mathcal{M}_0;p,\delta)$ and Theorem 4.2 gives

$$\frac{J_{\text{learn}}(\mathcal{M}_+;p)}{J_{\text{learn}}(\mathcal{M}_0;p)}\geq\frac{|A_+|\log|A_0|}{|A_0|\log|A_+|}\,\mathrm{IC}(\mathcal{M}_0;p), \tag{27}$$

which is identical to Equation 12. Then the proof that the additional condition in the corollary implies $J_{\text{learn}}(\mathcal{M}_+;p)>J_{\text{learn}}(\mathcal{M}_0;p)$ is identical to the proof of Corollary 4.5. ∎

Proof of Theorem 5.6.

Augment the state space of $\mathcal{M}_0$ with a state $s_1$ that is solved by every length-1 sequence that is not already the solution to any other state. (Furthermore, all actions that do not result in the goal state instead transition to a dead state.) Denote by $\bar{\mathcal{M}}_0$ the resultant DSMDP and for simplicity of notation we write $\bar{\rho}_{0,\delta}$ for $\rho_{\bar{\mathcal{M}}_0,\delta}$ and $\bar{d}_0$ for $d_{\bar{\mathcal{M}}_0}$. Let $\bar{\mathcal{M}}_+$ denote the $A_+$-macroaction augmentation of $\bar{\mathcal{M}}_0$. Then the solutions to $s_1$ in $A_+$ are exactly the same as those in $A_0$ since macroactions always have length greater than 1. We will write $\bar{\rho}_{+,\delta}$ to mean $\rho_{\bar{\mathcal{M}}_+,\delta}$. Let $\bar{p}$ be a distribution over the solvable states of $\bar{\mathcal{M}}_0$ so that $\bar{p}=p$ on the solvable states of $\mathcal{M}_0$ and $\bar{p}(s_1)=0$.

As in the proof of Theorem 5.4, we define distributions $\bar{p}',\bar{\rho}_{0,\delta}',\bar{\rho}_{+,\delta}'$ over the solvable states in addition to a dummy state $s_d$ to be equal to $\bar{p},\bar{\rho}_{0,\delta},\bar{\rho}_{+,\delta}$ whenever $s\neq s_d$, whereas $\bar{p}'(s_d)=0$ and $\bar{\rho}_{0,\delta}'(s_d),\bar{\rho}_{+,\delta}'(s_d)$ are such that $\bar{\rho}_{0,\delta}',\bar{\rho}_{+,\delta}'$ are normalized probability distributions.

First, let’s show that

$$\sum_s|\bar{\rho}_{+,\delta}'(s)-\bar{\rho}_{0,\delta}'(s)|\geq\frac{\delta}{|A_0|+1}. \tag{28}$$

If $s$ is distance 1 away from the goal in $\bar{\mathcal{M}}_0$, then

$$\bar{\rho}_{+,\delta}'(s)=\delta\frac{n(s)}{|A_+|},\quad\bar{\rho}_{0,\delta}'(s)=\delta\frac{n(s)}{|A_0|},$$

where $n(s)$ denotes the number of solutions to $s$ in $\bar{\mathcal{M}}_0$ (or equivalently, $\bar{\mathcal{M}}_+$), all of which have length 1. Thus,

$$\sum_s|\bar{\rho}_{+,\delta}'(s)-\bar{\rho}_{0,\delta}'(s)|\geq\sum_{s:\bar{d}_0(s)=1}\delta n(s)\left(\frac{1}{|A_0|}-\frac{1}{|A_+|}\right)=\delta|A_0|\left(\frac{1}{|A_0|}-\frac{1}{|A_+|}\right)=\delta\left(1-\frac{|A_0|}{|A_+|}\right)\geq\frac{\delta}{|A_0|+1},$$

where the last inequality used the fact that $|A_+|\geq|A_0|+1$.

We will now use Equation 28 to prove the theorem. By the triangle inequality,

$$\sum_s|\bar{\rho}_{+,\delta}'(s)-\bar{p}'(s)|\geq\sum_s|\bar{\rho}_{+,\delta}'(s)-\bar{\rho}_{0,\delta}'(s)|-\sum_s|\bar{\rho}_{0,\delta}'(s)-\bar{p}'(s)|\geq\frac{\delta}{|A_0|+1}-\sum_s|\bar{\rho}_{0,\delta}'(s)-\bar{p}'(s)|.$$

Pinsker’s inequality says that $D_{\mathrm{KL}}\left(p\parallel q\right)\geq\frac{1}{2}\left(\sum_x|p(x)-q(x)|\right)^2\log e$ for any two probability mass functions $p,q$. Thus, if

$$D_{\mathrm{KL}}\left(\bar{p}'\parallel\bar{\rho}_{0,\delta}'\right)=D_{\mathrm{KL}}\left(p'\parallel\rho_{0,\delta}'\right)<\frac{\delta^2\log e}{8(|A_0|+1)^2},$$

then

$$\sum_s|\bar{\rho}_{+,\delta}'(s)-\bar{p}'(s)|>\frac{\delta}{|A_0|+1}-\sqrt{\frac{2}{\log e}\cdot\frac{\delta^2\log e}{8(|A_0|+1)^2}}=\frac{\delta}{2(|A_0|+1)}$$

and so

$$D_{\mathrm{KL}}\left(p'\parallel\rho_{+,\delta}'\right)=D_{\mathrm{KL}}\left(\bar{p}'\parallel\bar{\rho}_{+,\delta}'\right)>\frac{1}{2}\left(\frac{\delta}{2(|A_0|+1)}\right)^2\log e>D_{\mathrm{KL}}\left(p'\parallel\rho_{0,\delta}'\right).$$

Now, by Equation 26, this is equivalent to $J_{\text{explore}}(\mathcal{M}_+;p,\delta)>J_{\text{explore}}(\mathcal{M}_0;p,\delta)$, as desired. ∎
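The constants in this argument fit together as follows, and the arithmetic can be verified numerically. The sketch below measures KL-divergence in bits (an assumption; any base works as long as $\log e$ is taken in the same base) and uses hypothetical helper names. It checks Pinsker's inequality on one sample pair and confirms that the threshold on $D_{\mathrm{KL}}(p'\parallel\rho_{0,\delta}')$ leaves an $L_1$ gap of exactly $\frac{\delta}{2(|A_0|+1)}$.

```python
import math

LOG2_E = math.log2(math.e)  # the "log e" factor when logs are base 2


def l1_distance(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_bits(p, q):
    """D_KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pinsker_lower_bound(p, q):
    """Right-hand side of Pinsker's inequality, with KL measured in bits."""
    return 0.5 * l1_distance(p, q) ** 2 * LOG2_E


def kl_threshold(delta, n_base):
    """Threshold on D_KL(p' || rho'_0) assumed in the proof; n_base is |A_0|."""
    return delta ** 2 * LOG2_E / (8 * (n_base + 1) ** 2)


def l1_gap_after_threshold(delta, n_base):
    """delta / (|A_0| + 1) minus the largest L1 distance the threshold allows."""
    max_l1 = math.sqrt(2 / LOG2_E * kl_threshold(delta, n_base))
    return delta / (n_base + 1) - max_l1


if __name__ == "__main__":
    p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
    print(kl_bits(p, q), pinsker_lower_bound(p, q))
    print(l1_gap_after_threshold(0.1, 4), 0.1 / (2 * (4 + 1)))
```

Plugging the resulting gap back into Pinsker's inequality reproduces the threshold itself, which is why the final chain of strict inequalities closes.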

The proof of Theorem 5.7 is omitted as the stronger version and its proof are given in Section F.4.

Appendix F Additional Theoretical Results

F.1 Preliminary Results on Stochastic Environments

Here, we provide preliminary generalizations of our results for stochastic sparse-reward MDPs, which are DSMDPs (Definition 2.1) where the transition kernel $T$ may be stochastic (i.e., $T(s,a)$ is now a distribution over $S$).

In a (possibly stochastic) sparse-reward MDP, let $W_{\sigma s}$ be the probability that taking actions $\sigma$ starting in $s$ results in the goal state. For an ordering $\sigma_1, \sigma_2, \ldots$ of all positive-length action sequences in non-decreasing length, define

$$w_{\sigma_k s} = \begin{cases} W_{\sigma_k s} & 1 \leq k < k_{max} \\ 1 - \sum_{i=1}^{k_{max}-1} W_{\sigma_i s} & k = k_{max} \\ 0 & k > k_{max} \end{cases}$$

where $k_{max}$ is the largest $k$ such that $\sum_{i=1}^{k-1} W_{\sigma_i s} < 1$. As a result, $\sum_{\sigma} w_{\sigma s} = 1$.

Let’s redefine $d_{\mathcal{M}}(s)$ to be the weighted mean $\sum_{\sigma} w_{\sigma s} |\sigma|$, so that $p$-learning difficulty (Equation 4) and $A_+$-merged $p$-incompressibility (Equation 7) are now defined using this new notion of shortest solution length. Furthermore, in the definition of $A_+$-merged $p$-incompressibility (Equation 7), redefine $P_+$ to be $P_+(\sigma) = \sum_s p(s) w_{\sigma s}$ so that $\sum_{\sigma} P_+(\sigma) = 1$. Note that the new definitions match the old definitions when the environment is deterministic. The stochasticity effectively spreads the responsibility of being a “shortest solution” over several short solutions whose success probabilities $W_{\sigma s}$ add up to 1.
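To make the truncation concrete, the following minimal Python sketch computes the weights $w_{\sigma_k s}$ and the redefined $d_{\mathcal{M}}(s)$ from a finite list of success probabilities. The function names, the toy probabilities, and the assumption that the list is long enough for the prefix sums to reach 1 are illustrative choices, not part of the original analysis.

```python
from typing import List


def truncated_weights(W: List[float]) -> List[float]:
    """Map success probabilities W[k-1] = W_{sigma_k s} to weights w_{sigma_k s}.

    Sequences are assumed to be listed in non-decreasing length. Probability
    mass is handed out in that order until a total of 1 has been assigned.
    """
    weights = []
    remaining = 1.0  # mass not yet assigned to an earlier sequence
    for W_k in W:
        if W_k < remaining:
            weights.append(W_k)        # case k < k_max
            remaining -= W_k
        else:
            weights.append(remaining)  # case k = k_max, and 0 for every k > k_max
            remaining = 0.0
    return weights


def expected_solution_length(W: List[float], lengths: List[int]) -> float:
    """Redefined d(s): the weighted mean of |sigma| under the weights w."""
    return sum(w_k * n for w_k, n in zip(truncated_weights(W), lengths))


# Hypothetical stochastic state: four candidate sequences of lengths 1, 2, 2, 3.
W = [0.5, 0.25, 0.5, 0.25]
lengths = [1, 2, 2, 3]
weights = truncated_weights(W)
d = expected_solution_length(W, lengths)

# Deterministic special case: every W is 0 or 1, so all mass lands on the
# first (hence shortest) solution and d(s) is its length.
deterministic_weights = truncated_weights([0.0, 0.0, 1.0, 1.0])
```

In the deterministic case the weights reduce to an indicator of the first solution in the ordering, which is why the redefinition agrees with the original shortest solution length.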

Theorem F.1 (Generalization of Theorem 4.2).

Under the above redefinitions for stochastic sparse-reward MDPs, Equation 9 of Theorem 4.2 continues to hold.

Proof.

The proof is identical to that of the original Theorem 4.2. ∎

In stochastic environments, we can keep the original definition of $p$-exploration difficulty (Equation 5) since the probabilistic definition of $q_{\mathcal{M},\delta}(s)$ continues to make sense when there’s stochasticity. (As a reminder, it is the probability that a uniformly random policy that terminates with probability $\delta$ before each step solves $s$.) Similarly, we keep the definition of $\delta$-discounted solution density (Definition 5.1), which is also defined in terms of $q$.
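As an illustration of why this probabilistic definition carries over, the sketch below evaluates $q_{\mathcal{M},\delta}(s)$ on a toy stochastic environment by fixed-point iteration. It assumes that an episode counts as solved the first time the goal is reached and that states missing from the transition table are dead ends; the table layout and the name `solve_probability` are hypothetical.

```python
from typing import Dict, Hashable, List

# T[s][a] maps each next state to its probability under action a in state s.
Transition = Dict[Hashable, List[Dict[Hashable, float]]]


def solve_probability(T: Transition, goal: Hashable, n_actions: int,
                      delta: float, iters: int = 200) -> Dict[Hashable, float]:
    """Probability that a uniformly random policy, which terminates with
    probability delta before each step, reaches the goal from each state."""
    q = {s: 0.0 for s in T}
    for _ in range(iters):
        new_q = {}
        for s in T:
            total = 0.0
            for a in range(n_actions):
                for s_next, prob in T[s][a].items():
                    # Reaching the goal ends the episode with success.
                    reach = 1.0 if s_next == goal else q.get(s_next, 0.0)
                    total += prob * reach
            # Survive the termination check, then pick an action uniformly.
            new_q[s] = (1.0 - delta) * total / n_actions
        q = new_q
    return q


# Deterministic toy: action 0 reaches the goal, action 1 stays in place.
det = {"s": [{"g": 1.0}, {"s": 1.0}]}
q_det = solve_probability(det, goal="g", n_actions=2, delta=0.5)

# Stochastic variant: action 0 now succeeds only half of the time.
sto = {"s": [{"g": 0.5, "s": 0.5}, {"s": 1.0}]}
q_sto = solve_probability(sto, goal="g", n_actions=2, delta=0.5)
```

For the deterministic toy the fixed point satisfies $q = \frac{1-\delta}{2}(1 + q)$, and for the stochastic variant $q = \frac{1-\delta}{2}(0.5 + 1.5q)$, so the same recursion handles both cases.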

Theorem F.2 (Generalization of the first half of Theorem 5.2).

Under the above redefinitions for stochastic sparse-reward MDPs, Equation 15 of Theorem 5.2 continues to hold.

Proof.

The proof is identical to that of the original Theorem 5.2. ∎

F.2 Incorporating Skill Expressivity in Theorem 4.2

In Theorem F.5 below, we provide a version of Theorem 4.2 that eliminates the dependence of $\mathrm{IC}_{A_+}(\mathcal{M}_0; p)$ on $A_+$ and makes it depend explicitly on a quantitative measure of skill expressivity instead. This new measure of incompressibility (Equation 29), which we call $E$-expressive $p$-incompressibility, decreases in $E$. This is expected as an environment is more compressible when the available skills are more expressive.

Definition F.3 (Quantifying skill expressivity).

With respect to a DSMDP $\mathcal{M} = (S, A, T, g)$, define the behavior variety expressivity $E_z$ of a skill $z: S \to A^*$ to be $|z(S)|$, i.e., the number of distinct action sequences that $z$ can produce.
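For a finite state space, $E_z$ can be computed by enumeration. The sketch below is a minimal illustration; the state space, the two skills, and the helper name are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Sequence


def behavior_variety_expressivity(
    skill: Callable[[Hashable], Sequence[str]],
    states: Iterable[Hashable],
) -> int:
    """E_z = |z(S)|: the number of distinct action sequences the skill emits."""
    return len({tuple(skill(s)) for s in states})


states = range(6)


def macroaction(s):
    # A macroaction ignores the state and always emits the same sequence.
    return ("up", "up")


def parity_option(s):
    # A state-dependent skill with two possible behaviors.
    return ("left", "up") if s % 2 == 0 else ("right",)


E_macro = behavior_variety_expressivity(macroaction, states)
E_option = behavior_variety_expressivity(parity_option, states)
```

A macroaction always has expressivity 1 under this measure, whereas a state-dependent skill can have expressivity up to $|S|$.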

Definition F.4 ($E$-expressive $p$-incompressibility).

For a DSMDP $\mathcal{M} = (S, A, T, g)$ with finite $|A| > 1$, define its $E$-expressive $p$-incompressibility to be

$$\mathrm{IC}(\mathcal{M}; p, E) = \sup_{0 < \varepsilon < 1} \frac{\min_P \mathrm{H}[P] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s \sim p}[d_{\mathcal{M}}(s)] \log\left(\frac{|A| E}{1-\varepsilon}\right)} \tag{29}$$

where the $\min_P$ is taken over all choices of canonical (not necessarily shortest) solutions to all states. Recall that, given a choice of canonical solutions to all states, $P(\sigma)$ is the sum over $p(s)$ of all states $s$ that have $\sigma$ as their canonical solution. As a result, $\mathrm{H}[P] \leq \mathrm{H}[p]$ and equality holds in solution-separable DSMDPs. Note that expressivity $E$ occurs once in the denominator, so that larger $E$ results in smaller $\mathrm{IC}(\mathcal{M}; p, E)$.
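The dependence on $E$ can be checked numerically. The following sketch approximates the supremum in Equation 29 by a maximum over a grid of $\varepsilon$ values, so it yields a lower estimate of the supremum rather than its exact value. It takes $\min_P \mathrm{H}[P]$ and $\mathbb{E}_{s \sim p}[d_{\mathcal{M}}(s)]$ as given inputs, uses base-2 logarithms throughout, and all numbers are hypothetical.

```python
import math


def expressive_incompressibility(min_entropy: float,
                                 mean_solution_length: float,
                                 n_actions: int,
                                 E: int,
                                 grid: int = 9999) -> float:
    """Grid approximation of IC(M; p, E) from Equation 29.

    min_entropy stands for min_P H[P] (in bits) and mean_solution_length
    for E_{s~p}[d(s)]; both are supplied by the caller.
    """
    best = -math.inf
    for i in range(1, grid + 1):
        eps = i / (grid + 1)  # strictly between 0 and 1
        numerator = min_entropy - math.log2((1.0 - eps) / eps)
        denominator = mean_solution_length * math.log2(n_actions * E / (1.0 - eps))
        best = max(best, numerator / denominator)
    return best


# Hypothetical environment: 10 bits of solution entropy, mean length 5, |A| = 4.
ic_macro = expressive_incompressibility(10.0, 5.0, n_actions=4, E=1)
ic_option = expressive_incompressibility(10.0, 5.0, n_actions=4, E=4)
```

With these inputs the estimate for $E = 4$ is smaller than the one for $E = 1$, in line with the remark that larger $E$ lowers the incompressibility.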

Theorem F.5 (Expressivity and $p$-learning difficulty improvability).

Assuming the setup to Theorem 4.2, the following modified version of Equation 9 holds:

$$\frac{J_{\text{learn}}(\mathcal{M}_+; p)}{J_{\text{learn}}(\mathcal{M}_0; p)} \geq \frac{|A_+| \log |A_0|}{|A_0| \log |A_+|} \, \mathrm{IC}(\mathcal{M}_0; p, E) \tag{30}$$

where $E := \max_{z \in A_+ \setminus A_0} E_z$ is the maximum behavior variety expressivity of a skill in the skill augmentation. Higher expressivity $E$ thus reduces incompressibility and allows skills to improve $p$-learning difficulty more, as expected.

Proof.

Given any choice of canonical shortest solutions in $A_+$, define the random variables $\sigma_+ \in (A_+)^+$ and $\sigma_0 \in (A_0)^+$ as follows. For $s \sim p$, $\sigma_+$ is the canonical solution to $s$ in $A_+$, and $\sigma_0$ is the same solution but with skills expanded into base actions. Then the distribution of $\sigma_+$ is just $P_+$; let $P_0$ be the distribution of $\sigma_0$.

Note that

$$\mathrm{H}[P_+] + \mathrm{H}[\sigma_0 \mid \sigma_+] = \mathrm{H}[(\sigma_+, \sigma_0)] \geq \mathrm{H}[P_0]. \tag{31}$$

Furthermore, since any $\sigma_+$ can expand to at most $E^{|\sigma_+|}$ different base action sequences,

$$\mathrm{H}[\sigma_0 \mid \sigma_+] \leq \mathbb{E}_{\sigma_+ \sim P_+}\left[|\sigma_+| \log E\right]. \tag{32}$$
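Equations 31 and 32 can be sanity-checked on a small joint distribution. The sketch below uses a hypothetical skill `z` with $E = 2$ expansions (`aa` or `bb`) and base-2 logarithms; the distribution itself is invented for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable


def entropy(dist: Dict[Hashable, float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint: Dict[tuple, float], index: int) -> Dict[Hashable, float]:
    """Marginal of a joint distribution over pairs along one coordinate."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[index]] += p
    return dict(out)


E = 2
# Keys are (sigma_plus, sigma_0): a solution over A_+ and its base-action expansion.
joint = {
    (("z",), "aa"): 0.25,
    (("z",), "bb"): 0.25,
    (("a", "z"), "aaa"): 0.25,
    (("a", "z"), "abb"): 0.25,
}

H_joint = entropy(joint)
H_plus = entropy(marginal(joint, 0))   # H[P_+]
H_zero = entropy(marginal(joint, 1))   # H[P_0]
H_cond = H_joint - H_plus              # H[sigma_0 | sigma_+] by the chain rule

# Right-hand side of Equation 32: the expected value of |sigma_+| log E.
bound = sum(p * len(sigma_plus) * math.log2(E)
            for (sigma_plus, _), p in joint.items())
```

Here $\mathrm{H}[P_+] = 1$ bit, $\mathrm{H}[\sigma_0 \mid \sigma_+] = 1$ bit, and the bound of Equation 32 evaluates to 1.5 bits, so both inequalities hold.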

In addition, recall from the proof of Theorem 4.2 that, for any $0 < \varepsilon < 1$,

$$\mathbb{E}_{s \sim p}[d_+(s)] \log\left(\frac{|A_+|}{1-\varepsilon}\right) + \log\left(\frac{1-\varepsilon}{\varepsilon}\right) \geq \mathrm{H}[P_+]. \tag{33}$$

Thus, substituting Equations 32 and 33 into Equation 31 yields

$$\begin{aligned}
\mathbb{E}_{s \sim p}[d_+(s)] \log\left(\frac{|A_+|}{1-\varepsilon}\right) + \log\left(\frac{1-\varepsilon}{\varepsilon}\right) + \mathbb{E}_{\sigma_+ \sim P_+}[|\sigma_+|] \log E &\geq \mathrm{H}[P_0] \\
\mathbb{E}_{s \sim p}[d_+(s)] \log\left(\frac{|A_+|}{1-\varepsilon}\right) + \mathbb{E}_{s \sim p}[d_+(s)] \log E &\geq \mathrm{H}[P_0] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right) \\
\mathbb{E}_{s \sim p}[d_+(s)] \log\left(\frac{|A_+| E}{1-\varepsilon}\right) &\geq \mathrm{H}[P_0] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right) \\
\frac{\mathbb{E}_{s \sim p}[d_+(s)] \log\left(\frac{|A_+| E}{1-\varepsilon}\right)}{\mathbb{E}_{s \sim p}[d_0(s)] \log\left(\frac{|A_0| E}{1-\varepsilon}\right)} &\geq \frac{\min_{P_0} \mathrm{H}[P_0] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s \sim p}[d_0(s)] \log\left(\frac{|A_0| E}{1-\varepsilon}\right)} \\
\frac{J_{\text{learn}}(\mathcal{M}_+; p)}{J_{\text{learn}}(\mathcal{M}_0; p)} &\geq \frac{|A_+| \log\left(\frac{|A_0| E}{1-\varepsilon}\right)}{|A_0| \log\left(\frac{|A_+| E}{1-\varepsilon}\right)} \cdot \frac{\min_{P_0} \mathrm{H}[P_0] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s \sim p}[d_0(s)] \log\left(\frac{|A_0| E}{1-\varepsilon}\right)}
\end{aligned}$$

where

$$\frac{\log\left(\frac{|A_0| E}{1-\varepsilon}\right)}{\log\left(\frac{|A_+| E}{1-\varepsilon}\right)} \geq \frac{\log |A_0|}{\log |A_+|}$$

since $\frac{E}{1-\varepsilon} > 1$ and $|A_+| \geq |A_0|$. Thus,

$$\frac{J_{\text{learn}}(\mathcal{M}_+; p)}{J_{\text{learn}}(\mathcal{M}_0; p)} \geq \frac{|A_+| \log |A_0|}{|A_0| \log |A_+|} \cdot \frac{\min_{P_0} \mathrm{H}[P_0] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s \sim p}[d_0(s)] \log\left(\frac{|A_0| E}{1-\varepsilon}\right)},$$

which is true for all $0 < \varepsilon < 1$, as desired. ∎
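The logarithm-ratio inequality used in the proof above can be verified by a direct computation. As shorthand introduced only for this check, write $a = |A_0|$, $b = |A_+|$, and $c = E/(1-\varepsilon)$, so that $1 < a \leq b$ and $c > 1$, and all logarithms below are positive.

```latex
\[
\frac{\log(ac)}{\log(bc)} - \frac{\log a}{\log b}
= \frac{(\log a + \log c)\log b - \log a\,(\log b + \log c)}{\log(bc)\,\log b}
= \frac{\log c\,(\log b - \log a)}{\log(bc)\,\log b}
\geq 0 .
\]
```

The numerator is non-negative because $\log c > 0$ and $\log b \geq \log a$, and the denominator is positive because $b > 1$ and $bc > 1$.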

F.3 Relaxing Solution-Separability Assumption in Corollary 4.4

Corollary F.6 (Generalization of Corollary 4.4).

Relaxing the solution-separability assumption, Corollary 4.4 holds if we replace $\mathrm{H}[p]$ in the definition of $\mathrm{IC}(\mathcal{M}_0; p)$ with $\min_{P_0} \mathrm{H}[P_0]$. Here, $P_0$ is the distribution of canonical solutions to states sampled from $p$, and the minimum is taken over all possible choices of canonical solutions. Thus, $\mathrm{H}[P_0]$ can be understood as the entropy of the state distribution if states with the same canonical solution are merged into one “super-state.”

Proof.

The result follows directly from Theorem F.5 by setting $E = 1$. ∎

F.4 Stronger Version of Theorem 5.7

Note: For notational simplicity, we will write $q_{\mathcal{M}}$ to mean $q_{\mathcal{M},\delta=0}$.

Before stating the stronger version of Theorem 5.7, we need to first define solution-length separations of state spaces.

Definition F.7.

For a DSMDP $\mathcal{M} = (S, A, T, g)$, let $S_{\text{solvable}}$ denote the set of solvable states. The solution-length separation $\tilde{S}_{\text{solvable}}$ of $S_{\text{solvable}}$ is the result of separating every solvable state $s \in S_{\text{solvable}}$ into a set $\tilde{S}(s)$ of sub-states corresponding to the lengths of solutions to $s$. Formally, we write

$$\tilde{S}_{\text{solvable}} := \bigcup_{s \in S_{\text{solvable}}} \tilde{S}(s), \quad \tilde{S}(s) := \{(s, l) \mid l > 0 \text{ s.t. } \exists \sigma \in \operatorname{Sol}_{\mathcal{M}}(s) \text{ with } |\sigma| = l\}.$$

Furthermore, for a sub-state $\tilde{s} = (s, l)$ of $s$ corresponding to solution length $l$, we naturally define its solutions to be the length-$l$ solutions to $s$. Formally,

$$\tilde{\operatorname{Sol}}_{\mathcal{M}}((s, l)) := \{\sigma \in \operatorname{Sol}_{\mathcal{M}}(s) \mid |\sigma| = l\}$$

where the tilde is used to make it explicit that we’re applying the operation to sub-states.

Functions on $S_{\text{solvable}}$ defined using solutions to states can therefore be naturally extended to $\tilde{S}$. For example, $\tilde{d}(\tilde{s})$ for $\tilde{s} = (s, l) \in \tilde{S}$ is just $l$, and

$$\tilde{q}_{\mathcal{M}}(\tilde{s}) := \sum_{\sigma \in \tilde{\operatorname{Sol}}_{\mathcal{M}}(\tilde{s})} |A|^{-|\sigma|} = \left|\tilde{\operatorname{Sol}}_{\mathcal{M}}(\tilde{s})\right| |A|^{-l} \quad \text{if } \tilde{s} = (s, l).$$

For an arbitrary function $f: S_{\text{solvable}} \to \mathbb{R}$, there is a family of natural extensions to $\tilde{S}_{\text{solvable}}$. Specifically, we say that $\tilde{f}: \tilde{S}_{\text{solvable}} \to \mathbb{R}$ is a solution-length-separated additive extension if, for all $s \in S_{\text{solvable}}$,

$$f(s) = \sum_{\tilde{s} \in \tilde{S}(s)} \tilde{f}(\tilde{s}),$$

and $f(s) = \tilde{f}(s)$ for $s \notin S_{\text{solvable}}$. For example, $\tilde{q}_{\mathcal{M}}$ as defined above is a solution-length-separated additive extension to $q_{\mathcal{M}}$.
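The construction can be illustrated on a toy environment with finitely many solutions per state. The sketch below builds the sub-states, computes $\tilde{q}_{\mathcal{M}}$, and checks the additive-extension property against $q_{\mathcal{M}}(s) = \sum_{\sigma \in \operatorname{Sol}_{\mathcal{M}}(s)} |A|^{-|\sigma|}$, which is the form of $q_{\mathcal{M}}$ assumed here for $\delta = 0$. The solution sets and helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set, Tuple

Solutions = Dict[Hashable, Set[Tuple[str, ...]]]


def length_separate(solutions: Solutions) -> Dict[tuple, Set[Tuple[str, ...]]]:
    """Group each state's solutions by length: (s, l) -> length-l solutions to s."""
    sub = defaultdict(set)
    for s, sols in solutions.items():
        for sigma in sols:
            if len(sigma) > 0:  # only positive solution lengths define sub-states
                sub[(s, len(sigma))].add(sigma)
    return dict(sub)


def q_tilde(sub_solutions, n_actions: int) -> Dict[tuple, float]:
    """q~((s, l)) = |Sol~((s, l))| * |A|^(-l)."""
    return {(s, l): len(sols) * n_actions ** (-l)
            for (s, l), sols in sub_solutions.items()}


def q_state(solutions: Solutions, n_actions: int) -> Dict[Hashable, float]:
    """Assumed delta = 0 form: the sum of |A|^(-|sigma|) over all solutions to s."""
    return {s: sum(n_actions ** (-len(sigma)) for sigma in sols)
            for s, sols in solutions.items()}


# Toy solution sets over the two base actions "a" and "b".
solutions = {
    "s1": {("a",), ("b", "a")},
    "s2": {("a", "a"), ("a", "b")},
}
sub_solutions = length_separate(solutions)
qt = q_tilde(sub_solutions, n_actions=2)
q = q_state(solutions, n_actions=2)

# Additive extension: q(s) equals the sum of q~ over the sub-states of s.
recombined = defaultdict(float)
for (s, l), value in qt.items():
    recombined[s] += value
```

State `s1` splits into the sub-states `("s1", 1)` and `("s1", 2)`, whose $\tilde{q}$ values 0.5 and 0.25 add up to $q(\text{s1}) = 0.75$.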

Theorem F.8 (Generalization of Theorem 5.7).

Let $\mathcal{M}_{+}=(S,A_{+},T_{+},g)$ be the $A_{+}$-macroaction augmentation of the solution-separable DSMDP $\mathcal{M}_{0}=(S,A_{0},T_{0},g)$ with a finite action space, and $p$ a probability distribution over solvable states. Then there exists a solution-length-separated additive extension $\tilde{p}$ to $p$ in $\mathcal{M}_{0}$ such that

\[
J_{\text{explore}}(\mathcal{M}_{+};p)-J_{\text{explore}}(\mathcal{M}_{0};p)\geq\frac{|A_{0}|}{|A_{+}|}\left(1-\frac{|A_{0}|}{|A_{+}|}\right)-D_{\mathrm{KL}}\left(\tilde{p}\parallel\tilde{\lambda}\tilde{q}_{0}\right). \tag{34}
\]

Here, $(\tilde{\lambda}\tilde{q}_{0})((s,l)):=\tilde{\lambda}(l)\tilde{q}_{0}((s,l))$, where

\[
\tilde{\lambda}(l):=\sum_{\tilde{s}^{\prime}:\tilde{d}_{0}(\tilde{s}^{\prime})=l}\tilde{p}(\tilde{s}^{\prime})
\]

is the total probability (under $\tilde{p}$) of sub-states with solution length $l$ and

\[
\tilde{q}_{0}((s,l))=\left|\tilde{\operatorname{Sol}}_{0}((s,l))\right||A_{0}|^{-l}
\]

is the probability that a uniformly random action sequence of length $l$ is a solution to $s$. To make $\tilde{\lambda}\tilde{q}_{0}$ a normalized probability distribution, we introduce a dummy sub-state $\tilde{s}_{d}(l)$ for each solution length $l$ with $\tilde{p}(\tilde{s}_{d}(l)):=0$ and $\tilde{q}_{0}(\tilde{s}_{d}(l)):=1-\sum_{s:(s,l)\in\tilde{S}_{\text{solvable}}}\tilde{q}_{0}((s,l))$. Note that $\tilde{q}_{0}(\tilde{s}_{d}(l))\geq 0$ because of solution-separability, and it is equal to zero when every action sequence of length $l$ is the solution to some state.
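As a rough numerical illustration of the quantities in this statement (not part of the original theorem), the sketch below computes $\tilde{\lambda}$, $\tilde{q}_{0}$, and the right-hand side of Equation 34 for a given candidate $\tilde{p}$. The function names, the dictionary layout, and the toy numbers are assumptions made for illustration, and natural logarithms are assumed. The theorem only asserts that some $\tilde{p}$ achieves the bound, so the value computed here is the bound for this particular $\tilde{p}$ only.

```python
import math


def kl_to_lambda_q0(p_tilde, n_solutions, n_base_actions):
    """Return D_KL(p~ || lambda~ * q~_0) in nats.

    p_tilde:      {(s, l): probability of sub-state (s, l)}
    n_solutions:  {(s, l): number of base solutions of length l for s}
    Dummy sub-states have p~ = 0, so they contribute nothing to the sum.
    """
    # lambda~(l): total probability of sub-states with solution length l.
    lam = {}
    for (_, l), prob in p_tilde.items():
        lam[l] = lam.get(l, 0.0) + prob

    kl = 0.0
    for (s, l), prob in p_tilde.items():
        if prob == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        # q~_0((s, l)): chance that a uniform length-l sequence solves s.
        q0 = n_solutions[(s, l)] * n_base_actions ** (-l)
        kl += prob * math.log(prob / (lam[l] * q0))
    return kl


def gain_lower_bound(p_tilde, n_solutions, n_base_actions, n_aug_actions):
    """Right-hand side of Equation 34 evaluated for the given p~."""
    x = n_base_actions / n_aug_actions
    return x * (1.0 - x) - kl_to_lambda_q0(p_tilde, n_solutions, n_base_actions)


# Hypothetical toy instance: two base actions, one added macroaction.
p_tilde = {("a", 1): 0.5, ("b", 2): 0.5}
n_solutions = {("a", 1): 1, ("b", 2): 2}
kl = kl_to_lambda_q0(p_tilde, n_solutions, n_base_actions=2)
bound = gain_lower_bound(p_tilde, n_solutions, n_base_actions=2, n_aug_actions=3)
```

In this toy instance both sub-states have $\tilde{q}_{0}=1/2$ and $\tilde{\lambda}(1)=\tilde{\lambda}(2)=1/2$, so the KL term equals $\log 2$ and the bound is negative: the guarantee is only informative when $\tilde{p}$ is close to $\tilde{\lambda}\tilde{q}_{0}$.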

Proof of Theorem F.8.

For each solvable state $s$, denote by $\tilde{S}(s)$ the set of sub-states resulting from separating $s$ by solution length in the base environment. Define the solution-length-separated additive extension $\tilde{p}$ to $p$ such that $\tilde{p}(\tilde{s})\propto\tilde{q}_{+}(\tilde{s})$ for $\tilde{s}\in\tilde{S}(s)$, or more precisely,

\[
\tilde{p}(\tilde{s})=p(s)\frac{\tilde{q}_{+}(\tilde{s})}{q_{+}(s)},\quad q_{+}(s)=\sum_{\tilde{s}\in\tilde{S}(s)}\tilde{q}_{+}(\tilde{s}).
\]

Here,

\[
\tilde{q}_{+}((s,l)):=\sum_{\substack{\sigma\in\operatorname{Sol}_{+}(s)\\ \text{$\sigma$ expands to $l$ base actions}}}|A_{+}|^{-|\sigma|}
\]

denotes the $q$ of the $A_{+}$-macroaction augmentation of $\tilde{\mathcal{M}}_{0}$ (the solution-length separation of $\mathcal{M}_{0}$ with respect to $A_{0}$), and not the $q$ of the solution-length separation of $\mathcal{M}_{+}$ (the $A_{+}$-macroaction augmentation of $\mathcal{M}_{0}$).
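To make the definition of $\tilde{q}_{+}$ concrete, here is a minimal sketch that sums $|A_{+}|^{-|\sigma|}$ over all sequences $\sigma$ of augmented actions whose expansion is one of the base solutions of a sub-state. It assumes that a sequence over $A_{+}$ solves $s$ exactly when its expansion is a base solution of $s$; the function name `q_plus`, the `expansions` dictionary, and the toy action set are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


def q_plus(base_solutions, expansions):
    """Sum of |A_+|^(-|sigma|) over sequences sigma whose expansion
    equals one of the given base solutions.

    base_solutions: iterable of tuples of base actions (all of length l)
    expansions:     {action name in A_+: non-empty tuple of base actions};
                    a base action expands to itself
    """
    step = Fraction(1, len(expansions))  # each action in sigma costs 1/|A_+|

    @lru_cache(maxsize=None)
    def weight(target):
        # Total weight of sequences over A_+ that expand exactly to `target`.
        if not target:
            return Fraction(1)
        total = Fraction(0)
        for expansion in expansions.values():
            k = len(expansion)
            if target[:k] == expansion:
                total += step * weight(target[k:])
        return total

    solutions = {tuple(sol) for sol in base_solutions}
    return sum((weight(sol) for sol in solutions), Fraction(0))


# Hypothetical toy: base actions "a" and "b", plus the macroaction "ab".
expansions = {"a": ("a",), "b": ("b",), "ab": ("a", "b")}
q_ab = q_plus([("a", "b")], expansions)   # reachable as (a, b) or as (ab)
q_ba = q_plus([("b", "a")], expansions)   # reachable only as (b, a)
```

Here the sub-state solved by `ab` gains probability mass (4/9 versus 1/4 under uniform base actions), while the one solved by `ba` loses mass (1/9 versus 1/4).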

Then

\[
\log\frac{q_{0}(s)}{q_{+}(s)}=\log\sum_{\tilde{s}\in\tilde{S}(s)}\frac{\tilde{q}_{+}(\tilde{s})}{q_{+}(s)}\frac{\tilde{q}_{0}(\tilde{s})}{\tilde{q}_{+}(\tilde{s})}\geq\sum_{\tilde{s}\in\tilde{S}(s)}\frac{\tilde{q}_{+}(\tilde{s})}{q_{+}(s)}\log\frac{\tilde{q}_{0}(\tilde{s})}{\tilde{q}_{+}(\tilde{s})}
\]

by Jensen’s inequality, so

\[
J_{\text{explore}}(\mathcal{M}_{+};p)-J_{\text{explore}}(\mathcal{M}_{0};p)=\mathbb{E}_{s\sim p}\left[\log\frac{q_{0}(s)}{q_{+}(s)}\right]\geq\mathbb{E}_{\tilde{s}\sim\tilde{p}}\left[\log\frac{\tilde{q}_{0}(\tilde{s})}{\tilde{q}_{+}(\tilde{s})}\right].
\]

It thus suffices to lower-bound the latter. Let’s consider base solution lengths $l$ separately.

Fix some $l\geq 1$. Define $\tilde{p}_{l}$ to be the conditional distribution of $\tilde{p}$ on sub-states with solution length $l$. In other words, if $\tilde{S}_{l}$ denotes the set of sub-states with solution length $l$, then $\tilde{p}_{l}$ is a distribution over $\tilde{S}_{l}$ defined as

\[
\tilde{p}_{l}(\tilde{s})=\frac{\tilde{p}(\tilde{s})}{\tilde{\lambda}(l)},\quad\tilde{\lambda}(l)=\sum_{\tilde{s}^{\prime}\in\tilde{S}_{l}}\tilde{p}(\tilde{s}^{\prime}).
\]

We write

\[
\mathbb{E}_{\tilde{s}\sim\tilde{p}_{l}}\left[\log\frac{\tilde{q}_{0}(\tilde{s})}{\tilde{q}_{+}(\tilde{s})}\right]=\mathbb{E}_{\tilde{s}\sim\tilde{p}_{l}}\left[\log\frac{\tilde{q}_{0}(\tilde{s})}{\tilde{q}_{+}(\tilde{s})/\sum_{\tilde{s}^{\prime}\in\tilde{S}_{l}^{*}}\tilde{q}_{+}(\tilde{s}^{\prime})}\right]-\log\sum_{\tilde{s}^{\prime}\in\tilde{S}_{l}^{*}}\tilde{q}_{+}(\tilde{s}^{\prime}), \tag{35}
\]

where $\tilde{S}_{l}^{*}$ denotes the set $\tilde{S}_{l}$ of sub-states with solution length $l$, along with a dummy state $\tilde{s}_{d}$ for every length-$l$ action sequence that isn’t a solution to any state. Note that $\tilde{q}_{0}(\tilde{s}_{d})=|A_{0}|^{-l}$ for each dummy state so that $\sum_{\tilde{s}^{\prime}\in\tilde{S}_{l}^{*}}\tilde{q}_{0}(\tilde{s}^{\prime})=|A_{0}|^{l}|A_{0}|^{-l}=1$, whereas $\tilde{q}_{+}(\tilde{s}_{d})=\sum_{\sigma\in(A_{+})^{+}\text{ expands to }\alpha}|A_{+}|^{-|\sigma|}$ where $\alpha$ is the action sequence assigned as the solution to $\tilde{s}_{d}$. As usual for dummy states, we define $\tilde{p}_{l}(\tilde{s}_{d})=0$.

Let’s first lower-bound the first term on the RHS of Equation 35:

\[
\mathbb{E}_{\tilde{s}\sim\tilde{p}_{l}}\left[\log\frac{\tilde{q}_{0}(\tilde{s})}{\tilde{q}_{+}(\tilde{s})/\sum_{\tilde{s}^{\prime}\in\tilde{S}_{l}^{*}}\tilde{q}_{+}(\tilde{s}^{\prime})}\right]=D_{\mathrm{KL}}\left(\tilde{p}_{l}\parallel\frac{\tilde{q}_{+}(\cdot)}{\sum_{\tilde{s}^{\prime}\in\tilde{S}_{l}^{*}}\tilde{q}_{+}(\tilde{s}^{\prime})}\right)-D_{\mathrm{KL}}\left(\tilde{p}_{l}\parallel\tilde{q}_{0}\right)\geq-D_{\mathrm{KL}}\left(\tilde{p}_{l}\parallel\tilde{q}_{0}\right). \tag{36}
\]
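The identity behind Equation 36 is the general fact that $\mathbb{E}_{p}[\log(q/r)]=D_{\mathrm{KL}}(p\parallel r)-D_{\mathrm{KL}}(p\parallel q)$ for probability distributions $q$ and $r$, followed by dropping the nonnegative first term. A small numerical check, with made-up numbers and hypothetical helper names, could look like this:

```python
import math


def kl(p, q):
    """D_KL(p || q) in nats for distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def first_term(p, q0, q_plus):
    """E_{s~p}[log(q0(s) / (q_plus(s) / sum(q_plus)))]."""
    z = sum(q_plus)
    return sum(pi * math.log(q0_i / (qp_i / z))
               for pi, q0_i, qp_i in zip(p, q0, q_plus) if pi > 0)


# Hypothetical numbers; the last entry plays the role of a dummy state (p = 0).
p = [0.7, 0.3, 0.0]
q0 = [0.25, 0.25, 0.5]        # sums to 1 once the dummy state is included
q_plus = [0.4, 0.1, 0.2]      # unnormalized, as in the proof

r = [v / sum(q_plus) for v in q_plus]   # normalized version of q_plus
lhs = first_term(p, q0, q_plus)
rhs = kl(p, r) - kl(p, q0)
```

The two sides agree up to floating-point error, and `lhs` is at least `-kl(p, q0)` because `kl(p, r)` is nonnegative.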

Let’s now upper-bound the sum in the second term on the RHS of Equation 35. We write

\[
\sum_{\tilde{s}^{\prime}\in\tilde{S}_{l}^{*}}\tilde{q}_{+}(\tilde{s}^{\prime})=\sum_{\substack{\sigma\in(A_{+})^{*}\\ \text{$\sigma$ expands to $l$ base actions}}}|A_{+}|^{-|\sigma|},
\]

which is a function $f_{l}(x_{2},\ldots,x_{K})$ where $x_{k}$ is the number of macroactions of length $k$ divided by $|A_{+}|$ and $K$ is the maximum length of any macroaction. To see that this is a function of only $l$ and $x_{k}$, notice that changing the number of macroactions of every length as well as the number of base actions by the same factor $\xi$ (which keeps all $x_{k}$ unchanged) will result in $\xi^{l^{\prime}}$ times more sequences $\sigma\in(A_{+})^{*}$ such that $|\sigma|=l^{\prime}$ and $\sigma$ expands to $l$ base actions, whereas the $|A_{+}|^{-l^{\prime}}$ summand is multiplied by a factor of $\xi^{-l^{\prime}}$ for these sequences. The two factors cancel out, thus leaving the entire sum unchanged.
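The scale-invariance argument can be checked by brute force on a small example. The sketch below enumerates every way of splitting $l$ base actions into consecutive blocks, counts the sequences $\sigma$ realizing each split, and sums $|A_{+}|^{-|\sigma|}$. The helper names and the action counts are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def compositions(l, parts):
    """Yield every tuple of lengths drawn from `parts` that sums to l."""
    if l == 0:
        yield ()
        return
    for k in parts:
        if k <= l:
            for rest in compositions(l - k, parts):
                yield (k,) + rest


def f_brute(l, counts):
    """Sum of |A_+|^(-|sigma|) over all sigma expanding to l base actions.

    counts: {k: number of actions in A_+ that expand to k base actions};
            counts[1] is the number of base actions.
    """
    n_aug = sum(counts.values())
    total = Fraction(0)
    for lengths in compositions(l, sorted(counts)):
        n_sequences = 1
        for k in lengths:
            n_sequences *= counts[k]  # choices for this position of sigma
        total += Fraction(n_sequences, n_aug ** len(lengths))
    return total


counts = {1: 2, 2: 1, 3: 1}                      # hypothetical action counts
scaled = {k: 5 * n for k, n in counts.items()}   # same x_k, with xi = 5
values = [f_brute(l, counts) for l in range(1, 7)]
values_scaled = [f_brute(l, scaled) for l in range(1, 7)]
```

Both lists coincide, in line with the claim that the sum depends on the action counts only through the ratios $x_{k}$.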

Now, let’s derive a recursive formula for $f_{l}(x_{2},\ldots,x_{K})$ where the $x_{i}$ are treated as parameters. To do this, we separate the sum over $\sigma$ into cases depending on whether the first action in $\sigma$ is a macroaction and, if so, on its length. If the first action in $\sigma$ is a base action, then the rest of $\sigma$ expands to length $l-1$, so the contribution to the sum is $x_{1}f_{l-1}(x_{2},\ldots,x_{K})$, where $x_{1}:=1-\sum_{k=2}^{K}x_{k}$ is the number of base actions divided by $|A_{+}|$. If the first action in $\sigma$ is a macroaction of length $k$, then the rest of $\sigma$ expands to length $l-k$, so the contribution to the sum is $x_{k}f_{l-k}(x_{2},\ldots,x_{K})$. To summarize,

\[
f_{l}=\sum_{k=1}^{K}x_{k}f_{l-k},
\]

where it is understood that $f_{i}=0$ for $i<0$. The base case is $f_{0}=1$. Since the sum of the coefficients $x_{k}$ in the recursive formula equals 1, $f_{l}$ is just a weighted average of $f_{l-1},f_{l-2},\ldots,f_{l-K}$. Thus, if $f_{l}\leq a$ for $1\leq l\leq K$ then $f_{l}\leq a$ for all $l\geq 1$.
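The recursion and the weighted-average observation can be sketched in a few lines. The function name `f_sequence` and the parameter values are hypothetical; exact rational arithmetic is used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction


def f_sequence(x, l_max):
    """Return [f_0, ..., f_{l_max}] from f_l = sum_k x_k * f_{l-k}.

    x: [x_1, ..., x_K], the fraction of actions in A_+ with expanded
       length k; the entries are expected to sum to 1.
    """
    f = [Fraction(1)]  # base case f_0 = 1
    for l in range(1, l_max + 1):
        # Terms with l - k < 0 vanish, so k only runs up to min(K, l).
        terms = (x[k - 1] * f[l - k] for k in range(1, min(len(x), l) + 1))
        f.append(sum(terms, Fraction(0)))
    return f


x = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # hypothetical x_1, x_2, x_3
f = f_sequence(x, l_max=30)
cap = max(f[1:len(x) + 1])  # largest of f_1, ..., f_K
```

For these parameters $f_{1}=f_{2}=1/2$ and $f_{3}=5/8$, and every later $f_{l}$ stays at or below `cap`, as the weighted-average argument predicts.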

Let’s show by induction on $K$ that

\[
f_{l}\leq 1-x_{1}+x_{1}^{2} \tag{37}
\]

for all $l\geq 1$. It suffices to show that $f_{l}\leq 1-x_{1}+x_{1}^{2}$ for $1\leq l\leq K$.

For $K=1$, $f_{1}=x_{1}=1=1-x_{1}+x_{1}^{2}$. For $K=2$, $f_{1}=x_{1}\leq x_{1}+(1-x_{1})^{2}=1-x_{1}+x_{1}^{2}$ and $f_{2}=x_{1}f_{1}+(1-x_{1})=1-x_{1}+x_{1}^{2}$.
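Before following the inductive step, it can be reassuring to check Equation 37 numerically. The sketch below evaluates the slack $(1-x_{1}+x_{1}^{2})-f_{l}$ over a coarse grid of parameter vectors for small $K$. The grid resolution, the range of $l$, and the function names are arbitrary choices made for illustration; a grid check is of course not a substitute for the proof.

```python
from fractions import Fraction
from itertools import product


def f_values(x, l_max):
    """[f_1, ..., f_{l_max}] for f_l = sum_k x[k-1] * f_{l-k}, with f_0 = 1."""
    f = [Fraction(1)]
    for l in range(1, l_max + 1):
        terms = (x[k - 1] * f[l - k] for k in range(1, min(len(x), l) + 1))
        f.append(sum(terms, Fraction(0)))
    return f[1:]


def worst_slack(K, grid=6, l_max=25):
    """Smallest value of (1 - x_1 + x_1^2) - f_l over a grid of x summing to 1."""
    worst = None
    for numerators in product(range(grid + 1), repeat=K):
        if sum(numerators) != grid or numerators[0] == 0:
            continue  # stay on the simplex and keep at least some base actions
        x = [Fraction(n, grid) for n in numerators]
        bound = 1 - x[0] + x[0] ** 2
        slack = min(bound - value for value in f_values(x, l_max))
        worst = slack if worst is None else min(worst, slack)
    return worst


slacks = {K: worst_slack(K) for K in (1, 2, 3, 4)}
```

A slack of exactly zero for $K=2$ reflects the base case above, where $f_{2}$ attains the bound with equality.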

Now, for $K\geq 3$, assume the $K-1$ and $K-2$ cases hold.

Let’s upper-bound $f_{l}$ for the following two cases separately: (i) $1\leq l\leq K-1$; (ii) $l=K$.

(i) Define $x_{k}^{\prime}=x_{k}/\bar{x}_{K}$ for $1\leq k\leq K-1$ where $\bar{x}_{K}:=1-x_{K}=\sum_{i=1}^{K-1}x_{i}$. Define the sequence $f^{\prime}_{l}=\sum_{k=1}^{K-1}x_{k}^{\prime}f^{\prime}_{l-k}$ with $f^{\prime}_{i}=0$ for $i<0$ and $f^{\prime}_{0}=1$.
Then the inductive hypothesis gives $f^{\prime}_{l}\leq 1-x_{1}^{\prime}+x_{1}^{\prime 2}$ for $1\leq l\leq K-1$. For $l\leq K-1$ the $x_{K}$ term of the recursion vanishes, so $f_{l}=\bar{x}_{K}\sum_{k=1}^{K-1}x_{k}^{\prime}f_{l-k}$, and it is easy to show by induction on $l$ that $f_{l}\leq\bar{x}_{K}f^{\prime}_{l}$ for $1\leq l\leq K-1$ (using $f_{0}=f^{\prime}_{0}=1$ and $\bar{x}_{K}\leq 1$), so for $1\leq l\leq K-1$,

\[
f_{l}\leq\bar{x}_{K}(1-x_{1}^{\prime}+x_{1}^{\prime 2})=\bar{x}_{K}-x_{1}+\frac{x_{1}^{2}}{\bar{x}_{K}},
\]

where $\bar{x}_{K}$ is restricted to the range $x_{1}\leq\bar{x}_{K}\leq 1$. Since $\bar{x}_{K}+\frac{x_{1}^{2}}{\bar{x}_{K}}$ is increasing for $\bar{x}_{K}\geq x_{1}$, its maximum is reached when $\bar{x}_{K}=1$, i.e.,

flx¯Kx1+x12x¯K1x1+x12.subscript𝑓𝑙subscript¯𝑥𝐾subscript𝑥1superscriptsubscript𝑥12subscript¯𝑥𝐾1subscript𝑥1superscriptsubscript𝑥12f_{l}\leq\bar{x}_{K}-x_{1}+\frac{x_{1}^{2}}{\bar{x}_{K}}\leq 1-x_{1}+x_{1}^{2}.italic_f start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ≤ over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + divide start_ARG italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT end_ARG ≤ 1 - italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT .

(ii) Note that recursively expanding the recursion formula for $f_l$ until we reach the base cases results in a polynomial in $x_1, \ldots, x_K$. It is easy to see by induction on $l$ that, for $1 \leq l \leq K$, no term contains $x_k$ where $k > l$ and there is a single term containing $x_l$, which is just $x_l$. So $f_{K-1} = P_1(x_1, \ldots, x_{K-2}) + x_{K-1}$ for some polynomial $P_1$, and

$$\begin{aligned}
f_K &= \sum_{k=1}^{K} x_k f_{K-k} \\
&= x_1 \left(P_1(x_1, \ldots, x_{K-2}) + x_{K-1}\right) + \sum_{k=2}^{K-2} x_k f_{K-k} + x_{K-1} f_1 + x_K \\
&= P_2(x_1, \ldots, x_{K-2}) + 2 x_1 x_{K-1} + x_K
\end{aligned}$$

for some polynomial $P_2$. Substituting $x_K = 1 - \sum_{k=1}^{K-1} x_k$ results in

$$f_K = P_3(x_1, \ldots, x_{K-2}) + x_{K-1}(2x_1 - 1)$$

for some polynomial $P_3$. This is linear in $x_{K-1}$, where $0 \leq x_{K-1} \leq 1 - \sum_{i=1}^{K-2} x_i$, so

$$f_K \leq \max\left\{ f_K|_{x_{K-1}=0},\; f_K|_{x_K=0} \right\}$$

where

$$\begin{aligned}
f_K|_{x_{K-1}=0} &= P_3(x_1, \ldots, x_{K-2}) \\
f_K|_{x_K=0} &= P_3(x_1, \ldots, x_{K-2}) + \left(1 - \sum_{i=1}^{K-2} x_i\right)(2x_1 - 1).
\end{aligned}$$

$f_K|_{x_K=0} \leq 1 - x_1 + x_1^2$ by the inductive hypothesis. Now let’s upper-bound $f_K|_{x_{K-1}=0}$.

Note that, regardless of the value of $x_{K-1}$, we have $f_0 = 1$ and $f_k \leq 1 - x_1 + x_1^2$ for $1 \leq k \leq K-2$ by the inductive hypothesis, since these values of $f_k$ are independent of $x_{K-1}$. Thus,

$$\begin{aligned}
f_{K-1}|_{x_{K-1}=0} &= \sum_{k=1}^{K} x_k f_{K-1-k} \leq \sum_{k=1}^{K-2} x_k \left(1 - x_1 + x_1^2\right) = (1 - x_K)\left(1 - x_1 + x_1^2\right) \\
f_K|_{x_{K-1}=0} &= \sum_{k=1}^{K} x_k f_{K-k} \\
&\leq x_1 (1 - x_K)\left(1 - x_1 + x_1^2\right) + \sum_{k=2}^{K-2} x_k \left(1 - x_1 + x_1^2\right) + x_K \\
&= \left(x_1(1 - x_K) + 1 - x_1 - x_K\right)\left(1 - x_1 + x_1^2\right) + x_K \\
&= \left(1 - x_K(1 + x_1)\right)\left(1 - x_1 + x_1^2\right) + x_K \\
&= 1 - x_1 + x_1^2 - x_K\left(1 + x_1^3\right) + x_K \\
&\leq 1 - x_1 + x_1^2.
\end{aligned}$$

Thus, we have shown that $f_K \leq 1 - x_1 + x_1^2$, which completes the inductive step. This concludes the proof of Equation 37.
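The bound just proved can be sanity-checked numerically. The sketch below is illustrative and not part of the proof: it assumes the recursion $f_l = \sum_{k=1}^{K} x_k f_{l-k}$ with $f_0 = 1$ and $f_i = 0$ for $i < 0$, draws random points $(x_1, \ldots, x_K)$ on the probability simplex, and reports the largest value of $f_l - (1 - x_1 + x_1^2)$ over $1 \leq l \leq K$. The function names and the sampling scheme are hypothetical, chosen only for illustration.

```python
import math
import random


def recursion_values(x, max_l):
    """Return [f_0, ..., f_max_l] for f_l = sum_k x_k * f_{l-k}, with f_0 = 1
    and f_i = 0 for i < 0. Here x[k-1] holds x_k."""
    K = len(x)
    f = [1.0]
    for l in range(1, max_l + 1):
        # Terms with l - k < 0 vanish, so k only runs up to min(K, l).
        f.append(sum(x[k - 1] * f[l - k] for k in range(1, min(K, l) + 1)))
    return f


def random_simplex_point(K, rng):
    """Sample (x_1, ..., x_K) with x_k >= 0 and sum_k x_k = 1 (illustrative choice)."""
    weights = [rng.expovariate(1.0) for _ in range(K)]
    total = sum(weights)
    return [w / total for w in weights]


def max_violation(K, trials, seed=0):
    """Largest observed f_l - (1 - x_1 + x_1^2) over 1 <= l <= K and all trials."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        x = random_simplex_point(K, rng)
        f = recursion_values(x, K)
        bound = 1.0 - x[0] + x[0] ** 2
        worst = max(worst, max(f[1:]) - bound)
    return worst
```

A non-positive result (up to floating-point error) is consistent with Equation 37. The point $x = (1/2, 1/2)$ attains the bound with equality at $l = 2$, since $f_2 = x_1^2 + x_2 = 3/4 = 1 - x_1 + x_1^2$.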

Now,

$$-\log \sum_{\tilde{s}^{\prime} \in \tilde{S}_l^{*}} \tilde{q}_{+}(\tilde{s}^{\prime}) = -\log f_l \geq 1 - f_l \geq \frac{|A_0|}{|A_+|}\left(1 - \frac{|A_0|}{|A_+|}\right), \tag{38}$$

where the last inequality follows from Equation 37 with $x_1 = |A_0|/|A_+|$.
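As a purely illustrative instance (the sizes are hypothetical), take $|A_0| = 4$ base actions and $|A_+| = 8$ actions in total, so that $x_1 = 1/2$. Equation 37 then gives $f_l \leq 3/4$, and the right-hand side of Equation 38 evaluates to

$$\frac{|A_0|}{|A_+|}\left(1 - \frac{|A_0|}{|A_+|}\right) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}.$$

Since $c(1-c) \leq 1/4$ for all $c \in [0, 1]$, this is the largest value this lower bound can take.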

We now substitute Equations 36 and 38 into Equation 35 to obtain

$$\mathbb{E}_{\tilde{s} \sim \tilde{p}_l}\left[\log \frac{\tilde{q}_0(\tilde{s})}{\tilde{q}_{+}(\tilde{s})}\right] \geq \frac{|A_0|}{|A_+|}\left(1 - \frac{|A_0|}{|A_+|}\right) - D_{\mathrm{KL}}\left(\tilde{p}_l \parallel \tilde{q}_0\right).$$

Thus, we finally have

$$\begin{aligned}
J_{\text{explore}}(\mathcal{M}_{+}; p) - J_{\text{explore}}(\mathcal{M}_0; p) &\geq \mathbb{E}_{\tilde{s} \sim \tilde{p}}\left[\log \frac{\tilde{q}_0(\tilde{s})}{\tilde{q}_{+}(\tilde{s})}\right] \\
&= \sum_{l=1}^{\infty} \tilde{\lambda}(l)\, \mathbb{E}_{\tilde{s} \sim \tilde{p}_l}\left[\log \frac{\tilde{q}_0(\tilde{s})}{\tilde{q}_{+}(\tilde{s})}\right] \\
&\geq \sum_{l=1}^{\infty} \tilde{\lambda}(l) \left(\frac{|A_0|}{|A_+|}\left(1 - \frac{|A_0|}{|A_+|}\right) - D_{\mathrm{KL}}\left(\tilde{p}_l \parallel \tilde{q}_0\right)\right) \\
&\geq \frac{|A_0|}{|A_+|}\left(1 - \frac{|A_0|}{|A_+|}\right) - D_{\mathrm{KL}}\left(\tilde{p} \parallel \tilde{\lambda}\tilde{q}_0\right).
\end{aligned}$$

Appendix G Relating $p$-Incompressibility to Skill Learning

The intuition that skills should optimally compress successful trajectories has been previously used by skill-discovery algorithms like LOVE and LEMMA. Using the incompressibility measures introduced in this paper, we can approach skill learning more rigorously. There are two approaches to converting $p$-incompressibility into a skill-learning objective.

The first approach is to find $A_+$ that minimizes the lower bound on the $p$-learning difficulty increase ratio as given in Theorem 4.2. This is equivalent to minimizing

$$\mathcal{L}_1(A_+) = \frac{|A_+|}{\log |A_+|} \sup_{0 < \varepsilon < 1} \frac{\mathrm{H}[P_+] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\log\left(\frac{|A_0|}{1-\varepsilon}\right)}, \tag{39}$$

where the $\sup$ factor is proportional to the $A_+$-merged $p$-incompressibility. Usually, $\mathrm{H}[P_+]$ is large, as a result of which the maximizing $\varepsilon$ satisfies $\varepsilon \ll 1$ and $\mathrm{H}[P_+] \gg \log\left(\frac{1-\varepsilon}{\varepsilon}\right)$. Thus, minimizing $\mathcal{L}_1(A_+)$ becomes equivalent to minimizing

$$\mathcal{L}_2(A_+) = \frac{|A_+|}{\log |A_+|} \mathrm{H}[P_+]. \tag{40}$$

When $|A_+|$ is known or given as a hyperparameter, the prefactor $|A_+|/\log |A_+|$ is constant, so the objective is to minimize

$$\mathcal{L}_3(A_+) = \mathrm{H}[P_+]. \tag{41}$$

Note that in practice, it is not possible to compute $P_+$, the distribution of shortest solutions using actions from $A_+$ to states generated by the environment. However, we do have a training set of offline experience, so we can use our skills to rewrite these solutions and define $\hat{P}_+$ to be the resultant empirical distribution of abstracted solutions. The $P_+$ appearing in the objectives $\mathcal{L}_1, \mathcal{L}_2, \mathcal{L}_3$ should thus be interpreted as $\hat{P}_+$ as calculated from our training set.
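To make these objectives concrete, the following minimal sketch evaluates them from a set of already-abstracted solutions. It is an illustration under stated assumptions rather than a reference implementation: all logarithms are taken base 2, the supremum in Equation 39 is approximated from below by a log-spaced grid over $\varepsilon$ (adequate only in the large-entropy regime discussed above), and all function and parameter names are hypothetical.

```python
import math
from collections import Counter


def empirical_entropy_bits(abstracted_solutions):
    """Entropy (bits) of the empirical distribution over whole abstracted solutions."""
    counts = Counter(tuple(sol) for sol in abstracted_solutions)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def sup_factor(entropy_bits, n_base_actions, points_per_decade=100, decades=12):
    """Grid approximation (from below) of the sup over 0 < eps < 1 in Equation 39.

    Returns (best value found, eps at which it was found).
    """
    best_val, best_eps = -math.inf, None
    for k in range(1, decades * points_per_decade + 1):
        eps = 10.0 ** (-k / points_per_decade)  # log-spaced grid inside (0, 1)
        numerator = entropy_bits - math.log2((1.0 - eps) / eps)
        denominator = math.log2(n_base_actions / (1.0 - eps))
        val = numerator / denominator
        if val > best_val:
            best_val, best_eps = val, eps
    return best_val, best_eps


def objective_l1(n_actions, entropy_bits, n_base_actions):
    """L_1 from Equation 39; requires n_actions >= 2 so that log2(n_actions) > 0."""
    return n_actions / math.log2(n_actions) * sup_factor(entropy_bits, n_base_actions)[0]


def objective_l2(n_actions, entropy_bits):
    """L_2 from Equation 40; L_3 is simply entropy_bits itself."""
    return n_actions / math.log2(n_actions) * entropy_bits
```

For example, with an assumed entropy of 1000 bits and $|A_0| = 4$, the grid maximum is found at $\varepsilon$ well below $0.01$ and the $\sup$ factor is within a few percent of $\mathrm{H}[P_+]/\log |A_0|$, which is the sense in which $\mathcal{L}_1$ and $\mathcal{L}_2$ agree up to a factor independent of $A_+$.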

However, the resultant approximation of $\mathrm{H}[P_+]$ will be a significant under-approximation if the training set is much smaller than the number of states that cover most of the state space under $p$. In this case, we recommend modeling $P_+$ with the assumption that it is generated by sampling i.i.d. actions from a distribution $p_{a,+}$ over $A_+$, with solution length sampled from a distribution $p_{l,+}$. Then the maximum-likelihood (ML) estimates of $p_{a,+}$ and $p_{l,+}$ are just the empirical distribution of actions and the empirical distribution of solution lengths in the abstracted training set.
If we define $\tilde{P}_+$ to be the distribution of action sequences defined by this choice of $p_{a,+}$ and $p_{l,+}$, then we can approximate $\mathrm{H}[P_+] \approx \mathrm{H}[\hat{P}_+, \tilde{P}_+] = \mathrm{H}[p_{l,+}] + \overline{l}_+ \mathrm{H}[p_{a,+}]$, where $\overline{l}_+ := \mathbb{E}_{l \sim p_{l,+}}[l]$ is the average solution length. Under this approximation, $\mathcal{L}_3$ becomes

$$\mathcal{L}_4(A_+) = \mathrm{H}[p_{l,+}] + \overline{l}_+ \mathrm{H}[p_{a,+}]. \tag{42}$$

(We can similarly apply this approximation to $\mathcal{L}_1$ and $\mathcal{L}_2$.) It is often the case that $\mathrm{H}[p_{l,+}]$ is much smaller than $\overline{l}_+ \mathrm{H}[p_{a,+}]$, so neglecting that term results in the objective

$$\mathcal{L}_5(A_+) = \overline{l}_+ \mathrm{H}[p_{a,+}]. \tag{43}$$

Note that $\mathcal{L}_5$ is exactly the minimum description length (MDL) objective used by LOVE (Jiang et al., 2022). It represents the average number of bits required to encode an abstracted solution, where the encoding of actions is optimized for the empirical distribution of actions in the abstracted training set.
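The objectives $\mathcal{L}_4$ and $\mathcal{L}_5$ depend only on two empirical distributions, so they are straightforward to compute. The sketch below is a minimal illustration, assuming the training solutions have already been rewritten in terms of $A_+$ and are given as sequences of hashable action labels, with entropies measured in bits. It also computes the cross-entropy $\mathrm{H}[\hat{P}_+, \tilde{P}_+]$ directly from its definition, as a check on the decomposition used in Equation 42. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Entropy (bits) of the empirical distribution given by a Counter of counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mdl_objectives(abstracted_solutions):
    """Return (L_4, L_5) from Equations 42 and 43 for a list of abstracted solutions."""
    length_counts = Counter(len(sol) for sol in abstracted_solutions)
    action_counts = Counter(a for sol in abstracted_solutions for a in sol)
    mean_length = sum(len(sol) for sol in abstracted_solutions) / len(abstracted_solutions)
    l5 = mean_length * entropy_bits(action_counts)
    l4 = entropy_bits(length_counts) + l5
    return l4, l5


def cross_entropy_bits(abstracted_solutions):
    """Cross-entropy of the empirical solution distribution under the i.i.d. model.

    The model draws a length from the empirical length distribution and then
    each action independently from the empirical action distribution.
    """
    n = len(abstracted_solutions)
    length_counts = Counter(len(sol) for sol in abstracted_solutions)
    action_counts = Counter(a for sol in abstracted_solutions for a in sol)
    n_actions = sum(action_counts.values())
    total = 0.0
    for sol in abstracted_solutions:
        log_prob = math.log2(length_counts[len(sol)] / n)
        log_prob += sum(math.log2(action_counts[a] / n_actions) for a in sol)
        total -= log_prob
    return total / n
```

As a toy example, two copies of the solution `("a", "b", "a", "b")` give $\mathcal{L}_5 = 4$ bits, whereas rewriting each copy as `("M", "M")` with a hypothetical macroaction `M` standing for `a` followed by `b` gives $\mathcal{L}_5 = 0$ bits.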

The second approach to deriving a skill learning objective from $p$-incompressibility is based on the idea that the maximally abstracted environment is the least compressible. Using unmerged $p$-incompressibility to measure incompressibility, this corresponds to the maximization objective

$$\mathcal{J}_{6}(A_{+}) = \mathrm{IC}(\mathcal{M}_{+}; p) = \sup_{0<\varepsilon<1} \frac{\mathrm{H}[p] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}{\mathbb{E}_{s\sim p}[d_{+}(s)] \log\left(\frac{|A_{+}|}{1-\varepsilon}\right)}. \tag{44}$$

Similar to $\mathrm{H}[P_{+}]$ in $\mathcal{L}_{1}$, $\mathrm{H}[p]$ in $\mathcal{J}_{6}$ is often large, in which case the maximizing $\varepsilon$ satisfies $\varepsilon \ll 1$ and $\mathrm{H}[p] \gg \log\left(\frac{1-\varepsilon}{\varepsilon}\right)$. Under this approximation, the maximization objective becomes the minimization objective

$$\mathcal{L}_{7}(A_{+}) = \mathbb{E}_{s\sim p}[d_{+}(s)] \log|A_{+}|. \tag{45}$$
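The step from (44) to (45) can be spelled out as follows. This is a sketch under the approximations stated above, with the additional assumption that $\mathrm{H}[p]$ is fixed by the environment and does not depend on the choice of $A_{+}$.

```latex
\begin{align*}
\mathcal{J}_{6}(A_{+})
&= \sup_{0<\varepsilon<1}
   \frac{\mathrm{H}[p] - \log\left(\frac{1-\varepsilon}{\varepsilon}\right)}
        {\mathbb{E}_{s\sim p}[d_{+}(s)]\,\bigl(\log|A_{+}| - \log(1-\varepsilon)\bigr)} \\
&\approx \frac{\mathrm{H}[p]}{\mathbb{E}_{s\sim p}[d_{+}(s)]\,\log|A_{+}|}.
\end{align*}
```

The numerator simplifies because $\mathrm{H}[p] \gg \log\left(\frac{1-\varepsilon}{\varepsilon}\right)$, and the denominator simplifies because $-\log(1-\varepsilon) \approx \varepsilon$ is negligible next to $\log|A_{+}|$ when $\varepsilon \ll 1$. With $\mathrm{H}[p]$ held fixed, maximizing this ratio over $A_{+}$ is equivalent to minimizing its denominator, which is $\mathcal{L}_{7}$.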

As with $P_{+}$, $\mathbb{E}_{s\sim p}[d_{+}(s)]$ cannot be computed exactly, so we approximate it with the average solution length in the abstracted training set, i.e., $\overline{l}_{+}$. As a result,

$$\mathcal{L}_{7}(A_{+}) = \overline{l}_{+} \log|A_{+}|, \tag{46}$$

which is just $\mathcal{L}_{5}$ but with a uniform distribution for $p_{a,+}$. It can thus also be interpreted as an MDL objective where the encoding of actions is a uniform code. Note that this is exactly the objective used by LEMMA (Li et al., 2022).
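The relationship between the two MDL objectives can be illustrated numerically. The Python sketch below compares $\mathcal{L}_{5}$ (entropy code) with $\mathcal{L}_{7}$ (uniform code) on two toy abstracted training sets. It assumes that $p_{a,+}$ is the empirical action distribution, that logarithms are base 2, and that $|A_{+}|$ is supplied by the caller; the function names and data are hypothetical and are not taken from LOVE or LEMMA.

```python
import math
from collections import Counter


def l5_objective(solutions):
    """Mean solution length times the entropy (bits) of empirical action usage."""
    counts = Counter(a for sol in solutions for a in sol)
    total = sum(counts.values())
    h_act = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return (total / len(solutions)) * h_act


def l7_objective(solutions, num_actions):
    """Mean solution length times log2 of the size of the action space."""
    mean_length = sum(len(sol) for sol in solutions) / len(solutions)
    return mean_length * math.log2(num_actions)


# Hypothetical toy sets over an action space of size 2.
skewed = [["a", "b"], ["a", "b", "a"], ["a", "a"], ["b"]]   # "a" used more often
uniform = [["a", "b"], ["b", "a"]]                          # both used equally

l5_skewed, l7_skewed = l5_objective(skewed), l7_objective(skewed, 2)
l5_uniform, l7_uniform = l5_objective(uniform), l7_objective(uniform, 2)
print(l5_skewed, l7_skewed)
print(l5_uniform, l7_uniform)
```

Because the entropy of a distribution over $|A_{+}|$ actions is at most $\log|A_{+}|$, $\mathcal{L}_{5}$ never exceeds $\mathcal{L}_{7}$ under these assumptions, and the two coincide when every action in $A_{+}$ is used equally often.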