
Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Victor Boone
[email protected]
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, 38000 Grenoble, France

Zihan Zhang
[email protected]
Princeton University
Abstract

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}\left(\sqrt{\mathrm{sp}(h^*)SAT}\right)$ (where $\widetilde{\mathrm{O}}(\cdot)$ hides logarithmic factors of $(S,A,T)$), where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S\times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$.

Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.

1 Introduction

Reinforcement learning (RL) Burnetas and Katehakis (1997); Sutton and Barto (2018) has become a popular approach for solving complex sequential decision-making tasks and has recently achieved notable advancements in diverse fields of application. The RL problem is generally formulated as a Markov Decision Process (MDP) Puterman (1994), where the agent interacts with an unknown environment to maximize its cumulative rewards.

In this paper, we consider the problem of learning average-reward MDPs, where the central task is to balance exploration (i.e., learning the unknown environment) and exploitation (i.e., planning optimally according to current knowledge) along the infinite-horizon learning process. One way to measure the performance of the learner is the regret, which compares the rewards gathered by the learner, unaware of the exact structure of its environment, to the expected performance of an omniscient agent that knows the environment in advance. The seminal work of Auer et al. (2009) provides a minimax regret lower bound $\Omega\left(\sqrt{DSAT}\right)$, where $D$ is the diameter (the maximal distance between two different states), $S$ the number of states, $A$ the number of actions and $T$ the learning horizon. They also provide an algorithm achieving regret $\widetilde{\mathrm{O}}\left(\sqrt{D^2S^2AT}\right)$. Ever since Auer et al. (2009), many works have been devoted to closing the gap between the regret lower and upper bounds in the average-reward setting Auer et al. (2009); Bartlett and Tewari (2009); Filippi et al. (2010); Talebi and Maillard (2018); Fruit et al. (2018, 2020); Bourel et al. (2020); Zhang and Ji (2019); Ouyang et al. (2017); Agrawal and Jia (2023); Abbasi-Yadkori et al. (2019); Wei et al. (2020) and more. Subsequent works Fruit et al. (2018); Zhang and Ji (2019) refined the minimax regret lower bound to $\Omega\left(\sqrt{\mathrm{sp}(h^*)SAT}\right)$, where $\mathrm{sp}(h^*)$ is the span of the bias function, which is the maximal gap between the long-term cumulative rewards starting from two different states. The difference is significant, since $\mathrm{sp}(h^*)\leq D$ and the gap between the two can be arbitrarily large. However, no existing work achieves the following three requirements simultaneously:

  • (1) The method achieves minimax optimal regret guarantees $\widetilde{\mathrm{O}}\left(\sqrt{\mathrm{sp}(h^*)SAT}\right)$;

  • (2) The proposed method is tractable;

  • (3) No prior knowledge on the model is required.

Most algorithms simply fail to achieve minimax optimal regret, and the only method achieving it Zhang and Ji (2019) is intractable because it relies on an oracle to solve difficult optimization problems along the learning process. Naturally, we raise the question of whether these three requirements can be met all at once:

Is there a tractable algorithm with $\widetilde{\mathrm{O}}\left(\sqrt{\mathrm{sp}(h^*)SAT}\right)$ minimax regret without prior knowledge?

Contributions.

In this paper, we answer the above question affirmatively, by proposing a polynomial time algorithm with regret guarantees $\widetilde{\mathrm{O}}\left(\sqrt{\mathrm{sp}(h^*)SAT}\right)$ for average-reward MDPs. Our method can further incorporate almost arbitrary prior bias information $\mathcal{H}_*\subseteq\mathbf{R}^{\mathcal{S}}$ to improve its regret.

Theorem 1 (Informal).

For any $c>0$, provided that the confidence region used by PMEVI-DT satisfies mild regularity conditions (see Assumptions 1-3), if $T\geq c^5$, then for every weakly communicating model with bias span less than $c$ and with bias vector within $\mathcal{H}_*$, PMEVI-DT$(\mathcal{H}_*,T)$ achieves regret:

\[
\mathrm{O}\left(\sqrt{cSAT\log\left(\tfrac{SAT}{\delta}\right)}\right)+\mathrm{O}\left(cS^{\frac{5}{2}}A^{\frac{3}{2}}T^{\frac{9}{20}}\log^{2}\left(\tfrac{SAT}{\delta}\right)\right)
\]

in expectation and with high probability. Moreover, if PMEVI-DT runs with the same confidence regions as UCRL2 Auer et al. (2009), then it enjoys a time complexity $\mathrm{O}(DS^3AT)$.
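A rough comparison of the two terms, ignoring logarithmic factors, indicates when the second term becomes lower order. This is a back-of-the-envelope sketch under that simplification, not a statement taken from the theorem:

```latex
\[
cS^{\frac{5}{2}}A^{\frac{3}{2}}T^{\frac{9}{20}}\leq\sqrt{cSAT}
\iff \sqrt{c}\,S^{2}A\leq T^{\frac{1}{20}}
\iff T\geq c^{10}S^{40}A^{20}.
\]
```

Since $T^{\frac{9}{20}}$ grows more slowly than $T^{\frac{1}{2}}$, the first term eventually dominates, although this rough count suggests that the required horizon can be very large in $S$ and $A$.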

The geometry of the prior bias region $\mathcal{H}_*$ is discussed later (see 4). It can be taken trivial, with $\mathcal{H}_*=\mathbf{R}^{\mathcal{S}}$, to obtain a completely prior-less algorithm. To the best of our knowledge, this is the first tractable algorithm with minimax optimal regret bounds (up to logarithmic factors). The algorithm does not require any prior knowledge of $\mathrm{sp}(h^*)$, thus circumventing the potentially high cost associated with learning $\mathrm{sp}(h^*)$. On the technical side, a key novelty of our method is the subroutine named PMEVI (see Algorithm 2), which improves and can replace EVI Auer et al. (2009) in any algorithm that relies on it Auer et al. (2009); Fruit et al. (2018); Filippi et al. (2010); Fruit et al. (2020); Bourel et al. (2020) to boost its performance and achieve minimax optimal regret.

Related works.

Here is a short overview of the learning theory of average-reward MDPs. For communicating MDPs, the notable work of Auer et al. (2009) proposes the famous UCRL2 algorithm, a mature version of their prior UCRL Auer and Ortner (2006), achieving a regret bound of $\widetilde{\mathrm{O}}(DS\sqrt{AT})$. This paper pioneered the use of optimistic methods to learn MDPs efficiently. A line of papers Filippi et al. (2010); Fruit et al. (2020); Bourel et al. (2020) developed this direction by tightening the confidence region that UCRL2 relies on, and sharpened its analysis through the use of local properties of MDPs, such as local diameters and local bias variances, but none of these works went beyond regret guarantees of order $S\sqrt{DAT}$, and they all suffer from an extra $\sqrt{S}$. A parallel direction was initiated by Bartlett and Tewari (2009), who design REGAL to attain $\mathrm{sp}(h^*)$-dependent regret bounds (instead of $D$) while extending the regret bounds to weakly-communicating MDPs. The computational intractability of REGAL is addressed by Fruit et al. (2018) with SCAL, while Zhang and Ji (2019) further enhance the regret analysis by evaluating the bias differences with EBF, eventually reaching optimal minimax regret but losing tractability.

Another successful design approach is Bayesian-flavored sampling, derived from Thompson Sampling Thompson (1933), which usually replaces optimism. However, the regret guarantees of these algorithms are usually limited to the Bayesian setting Ouyang et al. (2017); Theocharous et al. (2017), although Agrawal and Jia (2023) also enjoy $\widetilde{\mathrm{O}}(S\sqrt{DAT})$ high-probability regret by coupling posterior sampling and optimism. Another line of research focuses on the study of ergodic MDPs, where all policies mix uniformly according to a mixing time. To name a few, the model-free algorithm Politex Abbasi-Yadkori et al. (2019) attains a regret of $\widetilde{\mathrm{O}}\left((t_{\mathrm{mix}})^{3}t_{\mathrm{hit}}\sqrt{SA}\,T^{\frac{3}{4}}\right)$. By leveraging an optimistic mirror descent algorithm, Wei et al. (2020) achieve an enhanced regret of $\widetilde{\mathrm{O}}\left(\sqrt{(t_{\mathrm{mix}})^{2}t_{\mathrm{hit}}AT}\right)$.

We refer the readers to Table 1 for a (non-exhaustive) list of existing algorithms.

Table 1: Comparison of related works on RL algorithms for average-reward MDPs, where $S\times A$ is the size of the state-action space, $T$ is the total number of steps, $D$ ($D_s$) is the (local) diameter, $\mathrm{sp}(h^*)\leq D$ is the span of the bias vector, $t_{\mathrm{mix}}$ is the worst-case mixing time, and $t_{\mathrm{hit}}$ is the hitting time (i.e., the expected time needed to visit a given state under any policy).

| Algorithm | Regret in $\widetilde{\mathrm{O}}(\cdot)$ | Tractable | Comment/Requirements |
| --- | --- | --- | --- |
| REGAL Bartlett and Tewari (2009) | $\mathrm{sp}(h^*)S\sqrt{AT}$ | $\times$ | knowledge of $\mathrm{sp}(h^*)$ |
| UCRL2 Auer et al. (2009) | $DS\sqrt{AT}$ | $\checkmark$ | - |
| PSRL Agrawal and Jia (2023) | $DS\sqrt{AT}$ | $\checkmark$ | Bayesian regret |
| SCAL Fruit et al. (2018) | $\mathrm{sp}(h^*)S\sqrt{AT}$ | $\checkmark$ | knowledge of $\mathrm{sp}(h^*)$ |
| UCRL2B Fruit et al. (2020) | $S\sqrt{DAT}$ | $\checkmark$ | extra $\sqrt{\log(T)}$ in upper-bound |
| UCRL3 Bourel et al. (2020) | $D+\sqrt{T\sum_{s,a}D_s^2L_{s,a}}$ | $\checkmark$ | $L_{s,a}:=\sum_{s'}\sqrt{p(s'\mid s,a)(1-p(s'\mid s,a))}$ |
| KL-UCRL Filippi et al. (2010); Talebi and Maillard (2018) | $S\sqrt{DAT}$ | $\checkmark$ | - |
| EBF Zhang and Ji (2019) | $\sqrt{\mathrm{sp}(h^*)SAT}$ | $\times$ | optimal, knowledge of $\mathrm{sp}(h^*)$ |
| Optimistic-Q Wei et al. (2020) | $\mathrm{sp}(h^*)(SA)^{\frac{1}{3}}T^{\frac{2}{3}}$ | $\checkmark$ | model-free |
| UCB-AVG Zhang and Xie (2023) | $S^5A^2\,\mathrm{sp}(h^*)\sqrt{T}$ | $\checkmark$ | model-free, knowledge of $\mathrm{sp}(h^*)$ |
| MDP-OOMD Wei et al. (2020) | $\sqrt{(t_{\mathrm{mix}})^{2}t_{\mathrm{hit}}AT}$ | $\checkmark$ | ergodic |
| Politex Abbasi-Yadkori et al. (2019) | $(t_{\mathrm{mix}})^{3}t_{\mathrm{hit}}\sqrt{SA}\,T^{\frac{3}{4}}$ | $\checkmark$ | model-free, ergodic |
| PMEVI-DT (this work) | $\sqrt{\mathrm{sp}(h^*)SAT}$ | $\checkmark$ | - |
| Lower bound | $\Omega\left(\sqrt{\mathrm{sp}(h^*)SAT}\right)$ | - | - |

2 Preliminaries

We fix a finite state-action space structure $\mathcal{X}:=\bigcup_{s\in\mathcal{S}}\{s\}\times\mathcal{A}(s)$, and denote by $\mathcal{M}$ the collection of all MDPs with state-action space $\mathcal{X}$ and rewards supported in $[0,1]$.

Infinite-horizon MDP.

An element $M\in\mathcal{M}$ is a tuple $(\mathcal{S},\mathcal{A},p,r)$ where $p$ is the transition kernel and $r$ the reward function. The random state-action pair played by the agent at time $t$ is denoted $X_t\equiv(S_t,A_t)$, and the achieved reward is $R_t$. A policy is a deterministic rule $\pi:\mathcal{S}\to\mathcal{A}$ and we write $\Pi$ for the space of policies. Coupled with an $M\in\mathcal{M}$, a policy properly defines the distribution of $(X_t)$, whose associated probability and expectation operators are denoted $\mathbf{P}^{\pi}_{s},\mathbf{E}^{\pi}_{s}$, where $s\in\mathcal{S}$ is the initial state. Under $M$, a fixed policy has a reward function $r^{\pi}(s):=r(s,\pi(s))$, a transition matrix $P^{\pi}$, a gain $g^{\pi}(s):=\lim\frac{1}{T}\mathbf{E}^{\pi}_{s}[R_0+\ldots+R_{T-1}]$ and a bias $h^{\pi}:=\lim\sum_{t=0}^{T-1}(R_t-g(S_t))$, which together satisfy the Poisson equation $h^{\pi}+g^{\pi}=r^{\pi}+P^{\pi}h^{\pi}$, see Puterman (1994). The Bellman operator of the MDP is:

\[
Lu(s):=\max_{a\in\mathcal{A}(s)}\left\{r(s,a)+p(s,a)u\right\} \tag{1}
\]

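As an illustration of the Bellman operator defined above, here is a minimal NumPy sketch. The array layout (`r` of shape `(S, A)`, `p` of shape `(S, A, S)`), the function name `bellman_operator` and the two-state MDP are assumptions made for this example, and every action is assumed to be available in every state.

```python
import numpy as np

def bellman_operator(r, p, u):
    """Apply L to u: (Lu)(s) = max_a { r(s, a) + p(s, a) u }.

    r: array of shape (S, A), mean rewards in [0, 1]
    p: array of shape (S, A, S), p[s, a, s'] = probability of moving to s'
    u: array of shape (S,)
    Assumption: all actions are available in all states.
    """
    q = r + p @ u            # q[s, a] = r(s, a) + sum_{s'} p(s' | s, a) u(s')
    return q.max(axis=1)     # maximise over actions, state by state

# Hypothetical two-state, two-action MDP with deterministic transitions.
r = np.array([[0.0, 0.5],
              [1.0, 0.0]])
p = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])

Lu_zero = bellman_operator(r, p, np.zeros(2))            # best one-step rewards
Lu_other = bellman_operator(r, p, np.array([0.0, 2.0]))
```

With `u = 0` the operator simply returns the best immediate reward in each state; with a non-zero `u` it trades immediate reward against the value of the next state.
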
Weakly-communicating MDPs.

$M$ is weakly-communicating Puterman (1994); Bartlett and Tewari (2009) if the state space can be divided into two sets: (1) the transient set, consisting of states that are transient under all policies; (2) the non-transient set, where every state is reachable starting from any other non-transient state. In this case, $L$ has a span-fixpoint $h^*$ (see Puterman (1994)), i.e., there exists $h^*\in\mathbf{R}^{\mathcal{S}}$ such that $Lh^*-h^*\in\mathbf{R}e$ where $e$ is the vector full of ones. We write $h^*\in\mathrm{Fix}(L)$. Then $g^*:=Lh^*-h^*$ is the optimal gain function and every policy $\pi$ satisfies $r^{\pi}+P^{\pi}h^*\leq g^*+h^*$. We accordingly define the Bellman gaps:

\[
\Delta^*(s,a):=h^*(s)+g^*(s)-r(s,a)-p(s,a)h^*\geq 0. \tag{2}
\]
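The span-fixpoint and the Bellman gaps can be approximated numerically on small examples. The sketch below uses relative value iteration, a standard textbook scheme that is swapped in here for illustration and is not the method of this paper; its convergence is assumed rather than guaranteed (it can fail on periodic chains). The function names and the two-state MDP are hypothetical.

```python
import numpy as np

def span(u):
    """sp(u) = max(u) - min(u)."""
    return float(np.max(u) - np.min(u))

def relative_value_iteration(r, p, tol=1e-10, max_iter=10_000):
    """Approximate (g*, h*), normalised so that h*(last state) = 0.

    Iterates h <- Lh - (Lh)(s_ref) e with s_ref the last state.
    Assumption: the iteration converges for the MDP at hand.
    """
    h = np.zeros(r.shape[0])
    for _ in range(max_iter):
        Lh = (r + p @ h).max(axis=1)
        h_next = Lh - Lh[-1]
        converged = np.max(np.abs(h_next - h)) < tol
        h = h_next
        if converged:
            break
    g = (r + p @ h).max(axis=1) - h    # g* = L h* - h*, constant across states
    return g, h

def bellman_gaps(r, p, g, h):
    """Delta*(s, a) = h*(s) + g*(s) - r(s, a) - p(s, a) h*."""
    return (h + g)[:, None] - r - p @ h

# Hypothetical two-state, two-action MDP with deterministic transitions.
r = np.array([[0.0, 0.5],
              [1.0, 0.0]])
p = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])

g, h = relative_value_iteration(r, p)
gaps = bellman_gaps(r, p, g, h)
bias_span = span(h)
```

In this toy model the optimal behaviour is to move to the second state and stay there; the gaps vanish exactly on the state-action pairs used by that behaviour and are positive elsewhere.
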

Another important concept is the diameter, which describes the maximal distance from one state to another state. It is given by $D:=\sup_{s\neq s'}\inf_{\pi}\mathbf{E}_{s}^{\pi}[\inf\{t\geq 1:S_t=s'\}]$. An MDP is said to be communicating if its diameter $D$ is finite.

Reinforcement learning.

The learner is only aware that $M\in\mathcal{M}$ and has no further information about $M$. From the past observations and the current state $S_t$, the agent picks an available action $A_t\in\mathcal{A}(S_t)$, receives a reward $R_t$ and observes the new state $S_{t+1}$. The regret of the agent is:

\[
\mathrm{Reg}(T):=Tg^*-\sum_{t=0}^{T-1}R_t. \tag{3}
\]

Its expected value satisfies $\mathbf{E}[\mathrm{Reg}(T)]=\mathbf{E}\left[\sum_{t=0}^{T-1}\Delta^*(X_t)\right]+\mathbf{E}[h^*(S_T)-h^*(S_0)]$, which follows by summing (2) along the trajectory and telescoping the bias terms; the quantity $\sum_{t=0}^{T-1}\Delta^*(X_t)$ will be referred to as the pseudo-regret. This paper focuses on minimax regret guarantees. Specifically, for $c\geq 1$, denote by $\mathcal{M}_c:=\left\{M\in\mathcal{M}:\exists h^*\in\mathrm{Fix}(L(M)),\mathrm{sp}(h^*)\leq c\right\}$ the set of weakly-communicating MDPs that admit a bias function with span at most $c$, where the span of a vector $u\in\mathbf{R}^{\mathcal{S}}$ is $\mathrm{sp}(u):=\max(u)-\min(u)$. Following Auer et al. (2009), for every algorithm $\mathbf{A}$ and all $c>0$, we have

\[
\max_{M\in\mathcal{M}_c}\mathbf{E}^{M,\mathbf{A}}[\mathrm{Reg}(T)]=\Omega\left(\sqrt{cSAT}\right). \tag{4}
\]

The goal of this work is to reach this lower bound with a tractable algorithm.
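To make the regret (3) and the pseudo-regret concrete, the following sketch computes both on a short hand-written trajectory. The function names, the gap table and the trajectory are illustrative assumptions, chosen by hand for a deterministic two-state model with $g^*=1$; they are not data from the paper.

```python
import numpy as np

def regret(g_star, rewards):
    """Reg(T) = T g* - sum_t R_t, as in (3)."""
    return len(rewards) * g_star - float(np.sum(rewards))

def pseudo_regret(gaps, trajectory):
    """Sum of the Bellman gaps Delta*(S_t, A_t) over the played pairs."""
    return float(sum(gaps[s, a] for s, a in trajectory))

# Hypothetical values: optimal gain and Bellman gaps of a toy two-state MDP.
g_star = 1.0
gaps = np.array([[1.0, 0.0],
                 [0.0, 1.5]])

# Pairs (S_t, A_t) for t = 0..3 and the rewards R_t collected along the way.
trajectory = [(0, 1), (1, 0), (1, 0), (1, 0)]
rewards = [0.5, 1.0, 1.0, 1.0]

reg = regret(g_star, rewards)
pseudo = pseudo_regret(gaps, trajectory)
boundary = reg - pseudo   # what remains is a boundary term in the bias h*
```

Here the learner only plays zero-gap actions, so its pseudo-regret is zero, yet its regret is positive: the difference is a boundary term that depends only on the bias at the first and last states, and is therefore bounded by $\mathrm{sp}(h^*)$.
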

3 Algorithm PMEVI-DT

The method designed in this work can be applied to any algorithm relying on extended Bellman operators to compute the deployed policies Auer et al. (2009); Filippi et al. (2010); Fruit et al. (2018); Bourel et al. (2020) and beyond Tewari and Bartlett (2007). We start by reviewing the principles behind these algorithms. These algorithms follow the optimism-in-the-face-of-uncertainty (OFU) principle, meaning that they deploy policies achieving the highest possible gain that is plausible under their current information. This is done by building a confidence region $\mathcal{M}_t\subseteq\mathcal{M}$ for the hidden model $M$, then searching for a policy $\pi$ solving the optimization problem:

\[
g^*(\mathcal{M}_t):=\sup\left\{g^{\pi}(\mathcal{M}_t):\pi\in\Pi,\ \mathrm{sp}\left(g^{\pi}(\mathcal{M}_t)\right)=0\right\}\text{ with }g^{\pi}(\mathcal{M}_t):=\sup\left\{g\left(\pi,\widetilde{M}\right):\widetilde{M}\in\mathcal{M}_t\right\}. \tag{5}
\]

The design of the confidence region $\mathcal{M}_t$ varies from one work to another. Provided that $\mathcal{M}_t$ has been designed, these OFU-algorithms work as follows: at the start of episode $k$, the optimization problem (5) is solved, and its solution $\pi_k$ is played until the end of the episode. The duration of episodes can be managed in various ways, although the most popular is arguably the doubling trick (DT), which essentially waits until a state-action pair is about to double the visit count it had at the beginning of the current episode (see Algorithm 1). In the rest of this section, we use $\hat{p}_t(s,a)$ (and $\hat{r}_t(s,a)$) to denote the empirical transition (and reward) of the latest doubling update before the $t$-th step, and further denote $\hat{M}_t := (\hat{r}_t, \hat{p}_t)$.
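The doubling-trick rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `episode_should_end` and `episode_lengths` are hypothetical, the trajectory is pre-recorded (the policy is left out), and $N_t(s,a)$ is taken to count the visits made strictly before the check.

```python
def episode_should_end(counts, start_counts, state, action):
    """(DT) rule: stop once N_t(s, a) >= max(1, 2 * N_{t_k}(s, a))."""
    pair = (state, action)
    return counts.get(pair, 0) >= max(1, 2 * start_counts.get(pair, 0))


def episode_lengths(trajectory):
    """Split a list of (state, action) pairs into doubling-trick episodes."""
    counts, start_counts = {}, {}
    lengths, current = [], 0
    for i, pair in enumerate(trajectory):
        counts[pair] = counts.get(pair, 0) + 1  # the pair is played once more
        current += 1
        is_last = i + 1 == len(trajectory)
        # The rule is checked on the pair that is about to be played next.
        if not is_last and episode_should_end(counts, start_counts, *trajectory[i + 1]):
            lengths.append(current)      # close the current episode
            start_counts = dict(counts)  # freeze N_{t_k} for the next episode
            current = 0
    if current:
        lengths.append(current)
    return lengths
```

On a trajectory that keeps visiting a single pair, the episode lengths grow geometrically, so the number of episodes (and of calls to the planning subroutine) stays logarithmic in the number of visits.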

Extended Bellman operators and EVI.

To solve (5) efficiently, the celebrated work of Auer et al. (2009) introduced the extended value iteration algorithm (EVI). Assume that $\mathcal{M}_t$ is an $(s,a)$-rectangular confidence region, meaning that $\mathcal{M}_t \equiv \prod_{s,a}(\mathcal{R}_t(s,a) \times \mathcal{P}_t(s,a))$ where $\mathcal{R}_t(s,a)$ and $\mathcal{P}_t(s,a)$ are respectively the confidence regions for $r(s,a)$ and $p(s,a)$ after $t$ learning steps. EVI is the algorithm computing the sequence defined by:

$$v_{i+1}(s) \equiv \mathcal{L}_{t}v_{i}(s) := \max_{a\in\mathcal{A}(s)}\ \max_{\tilde{r}(s,a)\in\mathcal{R}_{t}(s,a)}\ \max_{\tilde{p}(s,a)\in\mathcal{P}_{t}(s,a)} \left(\tilde{r}(s,a)+\tilde{p}(s,a)\cdot v_{i}\right) \tag{6}$$

until $\mathrm{sp}(v_{i+1}-v_{i}) < \epsilon$, where $\epsilon > 0$ is the numerical precision. When the process stops, it is known that any policy $\pi$ such that $\pi(s)$ achieves $\mathcal{L}_t v_i$ in (6) satisfies $g^{\pi}(\mathcal{M}_t) \geq g^{*}(\mathcal{M}_t) - \epsilon$, hence is nearly optimistically optimal. This process gets its name from the observation that $\mathcal{L}_t$ is the Bellman operator of $\mathcal{M}_t$ seen as an MDP, hence EVI is just the Value Iteration algorithm Puterman (1994) run in $\mathcal{M}_t$.
A choice of action from $s \in \mathcal{S}$ in $\mathcal{M}_t$ consists of (1) a choice of action $a \in \mathcal{A}(s)$, (2) a choice of reward $\tilde{r}(s,a) \in \mathcal{R}_t(s,a)$ and (3) a choice of transition $\tilde{p}(s,a) \in \mathcal{P}_t(s,a)$; it is an extended version of $\mathcal{A}(s)$.
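To make the iteration (6) concrete, here is a minimal Python sketch of EVI. It assumes, purely for illustration, that each $\mathcal{P}_t(s,a)$ is an $\ell_1$-ball around the empirical kernel (as in UCRL2-style regions) and that each $\mathcal{R}_t(s,a)$ is summarized by its largest element; the names `optimistic_transition`, `extended_value_iteration`, `r_upper` and `radius` are hypothetical.

```python
def span(v):
    return max(v) - min(v)


def optimistic_transition(p_hat, radius, v):
    """Maximize p . v over distributions p with ||p - p_hat||_1 <= radius."""
    order = sorted(range(len(v)), key=lambda s: v[s], reverse=True)
    p = list(p_hat)
    best = order[0]
    p[best] = min(1.0, p_hat[best] + radius / 2.0)
    excess = sum(p) - 1.0
    for s in reversed(order):  # take mass away from low-value states first
        if excess <= 1e-12:
            break
        if s != best:
            removed = min(p[s], excess)
            p[s] -= removed
            excess -= removed
    return p


def extended_value_iteration(r_upper, p_hat, radius, epsilon, max_iter=10000):
    """Iterate (6) until sp(v_{i+1} - v_i) < epsilon.

    r_upper[s][a] is the largest reward in R_t(s, a), where the inner maximum
    over rewards is attained. Returns (greedy policy, gain estimate).
    """
    n_states = len(r_upper)
    v = [0.0] * n_states
    for _ in range(max_iter):
        new_v, policy = [], []
        for s in range(n_states):
            q_values = []
            for a in range(len(r_upper[s])):
                p = optimistic_transition(p_hat[s][a], radius[s][a], v)
                q_values.append(r_upper[s][a] + sum(x * y for x, y in zip(p, v)))
            greedy = max(range(len(q_values)), key=q_values.__getitem__)
            policy.append(greedy)
            new_v.append(q_values[greedy])
        diff = [x - y for x, y in zip(new_v, v)]
        if span(diff) < epsilon:
            return policy, 0.5 * (max(diff) + min(diff))
        low = min(new_v)
        v = [x - low for x in new_v]  # recentre; span and argmax are unchanged
    raise RuntimeError("EVI did not converge within max_iter iterations")
```

Recentring the iterate by a constant does not change the stopping test, since the extended operator commutes with constant shifts; it only keeps the values bounded.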

Towards Projected Mitigated EVI.

The regret of an OFU-algorithm is directly related to the quality of the confidence region $\mathcal{M}_t$. That is why most previous works tried to approach the regret lower bound $\sqrt{DSAT}$ of Auer et al. (2009) by refining $\mathcal{M}_t$. The older works of Auer et al. (2009); Bartlett and Tewari (2009); Filippi et al. (2010) have been improved with variance-aware analyses Talebi and Maillard (2018); Fruit et al. (2018, 2020); Bourel et al. (2020) that essentially make use of tightened kernel confidence regions $\mathcal{P}_t$. While these algorithms successively reduce the gap between the regret upper and lower bounds, they fail to achieve the optimal regret $\sqrt{DSAT}$. Meanwhile, the EBF algorithm of Zhang and Ji (2019) achieves the lower bound, but (1) it is intractable because it relies on an oracle to retrieve optimistically optimal policies and (2) it needs prior information on the bias function. Nonetheless, the method of Zhang and Ji (2019) strongly suggests that inferring bias information from the available data is key to achieving minimax optimal regret.

Rather surprisingly and in opposition to this previous line of work, our work suggests that the choice of the confidence region $\mathcal{M}_t$ has little importance. Instead, our algorithm takes an arbitrary (well-behaved) confidence region as input, infers bias information similarly to EBF Zhang and Ji (2019) and makes use of it to heavily refine the extended Bellman operator (6) associated with the input confidence region. Our algorithm can further take arbitrary prior information (possibly none) on the bias vector to tighten its bias confidence region. The pseudo-code given in Algorithm 1 is the high-level structure of our algorithm PMEVI-DT. In Section 3.1, we explain how (6) is refined using bias information and in Section 3.2, we explain how bias information is obtained.

  Algorithm 1: PMEVI-DT$(\mathcal{H}_{*}, T, t \mapsto \mathcal{M}_t)$

Parameters: Bias prior $\mathcal{H}_{*}$, horizon $T$, a system of confidence regions $t \mapsto \mathcal{M}_t$

1:  for $k = 1, 2, \ldots$ do
2:     Set $t_k \leftarrow t$, update confidence region $\mathcal{M}_t$;
3:     $\mathcal{H}'_t \leftarrow \texttt{BiasEstimation}(\mathcal{F}_t, \mathcal{M}_t, \delta)$;
4:     $\mathcal{H}_t \leftarrow \mathcal{H}_{*} \cap \{u : \mathrm{sp}(u) \leq T^{1/5}\} \cap \mathcal{H}'_t$;
5:     $\Gamma_t \leftarrow \texttt{BiasProjection}(\mathcal{H}_t, -)$;
6:     $\beta_t \leftarrow \texttt{VarianceApprox}(\mathcal{H}'_t, \mathcal{F}_t)$;
7:     $\mathfrak{h}_k \leftarrow \texttt{PMEVI}(\mathcal{M}_t, \beta_t, \Gamma_t, \sqrt{\log(t)/t})$;
8:     $\mathfrak{g}_k \leftarrow \mathfrak{L}_t \mathfrak{h}_k - \mathfrak{h}_k$;
9:     Update policy $\pi_k \leftarrow \texttt{Greedy}(\mathcal{M}_t, \mathfrak{h}_k, \beta_t)$;
10:     repeat
11:        Play $A_t \leftarrow \pi_k(S_t)$, observe $R_t, S_{t+1}$;
12:        Increment $t \leftarrow t + 1$;
13:     until (DT) $N_t(S_t, \pi_k(S_t)) \geq 1 \vee 2 N_{t_k}(X_t)$.
14:  end for

 

  Algorithm 2: PMEVI$(\mathcal{M}, \beta, \Gamma, \epsilon)$

Parameters: region $\mathcal{M}$, mitigation $\beta$, projection $\Gamma$, precision $\epsilon$, initial vector $v_0$ (optional)

1:  if $v_0$ not initialized then set $v_0 \leftarrow 0$;
2:  $n \leftarrow 0$;
3:  $\mathcal{L} \leftarrow$ extended operator associated to $\mathcal{M}$;
4:  repeat
5:     $v_{n+\frac{1}{2}} \leftarrow \mathcal{L}^{\beta} v_n$;
6:     $v_{n+1} \leftarrow \Gamma v_{n+\frac{1}{2}}$;
7:     $n \leftarrow n + 1$;
8:  until $\mathrm{sp}(v_n - v_{n-1}) < \epsilon$
9:  return $v_n$.

 

3.1 Projected mitigated extended value iteration (PMEVI)

Assume that an external mechanism provides a confidence region $\mathcal{H}_t$ for the bias function $h^{*}$. Provided that $\mathcal{M}_t$ is correct ($M \in \mathcal{M}_t$) and that $\mathcal{H}_t$ is correct ($h^{*} \in \mathcal{H}_t$), we want to find a policy-model pair $(\pi, \tilde{M})$ that maximizes the gain and such that $h(\pi, \tilde{M}) \in \mathcal{H}_t$. This is done with an improved version of (6) combining two ideas.

  1. Projection (Section 3.2). Whenever it is correct, the bias confidence region $\mathcal{H}_t$ informs the learner that the search for an optimistic model can be restricted to those with bias within $\mathcal{H}_t$. This is done by projecting $\mathcal{L}_t^{\beta}$ (see mitigation) using an operator $\Gamma_t : \mathbf{R}^{\mathcal{S}} \to \mathcal{H}_t$ that has to satisfy a few non-trivial regularity conditions, specified in Proposition 2.

  2. Mitigation (Section 3.3). When one is aware that $h^{*} \in \mathcal{H}_t$, the dynamical bias update $\tilde{p}(s,a) u_i$ in (6) (with $u_i$ denoting the current iterate) can be controlled better, by trying to restrict (6) to some $\tilde{p}(s,a)$ such that $\tilde{p}(s,a) u_i \leq \hat{p}_t(s,a) u_i + (p(s,a) - \hat{p}_t(s,a)) u_i$ with the knowledge that $u_i \in \mathcal{H}_t$.

    For a fixed $u \in \mathbf{R}^{\mathcal{S}}$, the empirical Bernstein inequality (Lemma 38) provides a variance bound of the form $(\hat{p}_t(s,a) - p(s,a)) u \leq \beta_t(s,a,u)$. By computing $\beta_t(s,a) := \max_{u \in \mathcal{H}_t} \beta_t(s,a,u)$, the search makes sure that $(\hat{p}_t(s,a) - p(s,a)) h^{*} \leq \beta_t(s,a)$ even though $h^{*}$ is unknown. For $\beta \in \mathbf{R}_{+}^{\mathcal{X}}$ (the algorithm uses $\beta = \beta_t$), we introduce the $\beta$-mitigated extended Bellman operator:

    $$\mathcal{L}_t^{\beta} u(s) := \max_{a \in \mathcal{A}(s)}\ \sup_{\tilde{r}(s,a) \in \mathcal{R}_t(s,a)}\ \sup_{\tilde{p}(s,a) \in \mathcal{P}_t(s,a)} \Big\{ \tilde{r}(s,a) + \min\left\{ \tilde{p}(s,a) u,\ \hat{p}_t(s,a) u + \beta(s,a) \right\} \Big\} \tag{7}$$
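A single backup of the mitigated operator (7) can be sketched as follows. Since the cap $\hat{p}_t(s,a)u + \beta(s,a)$ does not depend on $\tilde{p}(s,a)$, the supremum of the minimum equals the minimum between the cap and $\sup_{\tilde{p}} \tilde{p}(s,a)u$, so the sketch takes the latter as an input. The function name `mitigated_backup` and its arguments are hypothetical, and the optimistic values are assumed to come from whatever inner maximization the confidence region allows.

```python
def dot(p, u):
    return sum(x * y for x, y in zip(p, u))


def mitigated_backup(r_upper, optimistic_values, p_hat, beta, u):
    """Apply the mitigated operator (7) at one state.

    r_upper[a]           largest reward in R_t(s, a)
    optimistic_values[a] sup of p . u over P_t(s, a)
    p_hat[a]             empirical transition row for (s, a)
    beta[a]              mitigation level, beta[a] >= 0
    Returns (backed-up value, greedy action).
    """
    best_value, best_action = float("-inf"), None
    for a in range(len(r_upper)):
        cap = dot(p_hat[a], u) + beta[a]  # empirical value plus variance term
        value = r_upper[a] + min(optimistic_values[a], cap)
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```

Setting every `beta[a]` to `float("inf")` disables the cap and recovers the plain extended backup of (6); a finite `beta` can change which action is greedy.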

The proposition below shows how well-behaved the composition $\mathfrak{L}_t := \Gamma_t \circ \mathcal{L}_t^{\beta}$ is. Its proof requires a complete analysis of projected mitigated Bellman operators, which is deferred to the appendix.

Proposition 2.

Fix $\beta \in \mathbf{R}_{+}^{\mathcal{X}}$ and assume that there exists a projection operator $\Gamma_t : \mathbf{R}^{\mathcal{S}} \to \mathcal{H}_t$ which is (O1) monotone: $u \leq v \Rightarrow \Gamma u \leq \Gamma v$; (O2) non span-expansive: $\mathrm{sp}(\Gamma u - \Gamma v) \leq \mathrm{sp}(u - v)$; (O3) linear: $\Gamma(u + \lambda e) = \Gamma u + \lambda e$; and (O4) $\Gamma u \leq u$. Then, the projected mitigated extended Bellman operator $\mathfrak{L}_t := \Gamma_t \circ \mathcal{L}_t^{\beta}$ has the following properties:

  • (1) There exists a unique $\mathfrak{g}_t \in \mathbf{R}e$ such that $\exists \mathfrak{h}_t \in \mathcal{H}_t$, $\mathfrak{L}_t \mathfrak{h}_t = \mathfrak{h}_t + \mathfrak{g}_t$;

  • (2) If $M \in \mathcal{M}_t$, $h^{*} \in \mathcal{H}_t$ and $(\hat{p}_t(s,a) - p(s,a)) h^{*} \leq \beta_t(s,a)$, then $\mathfrak{g}_t \geq g^{*}(M)$;

  • (3) If $\mathcal{M}_t$ is convex, then for all $u \in \mathbf{R}^{\mathcal{S}}$, the policy $\pi =: \texttt{Greedy}(\mathcal{M}_t, u, \beta_t)$ picking the actions achieving $\mathcal{L}_t^{\beta} u$ satisfies $\mathfrak{L}_t u = \tilde{r}^{\pi} + \tilde{P}^{\pi} u$ for $\tilde{r}^{\pi}(s) \leq \sup \mathcal{R}_t(s, \pi(s))$ and $\tilde{P}^{\pi}(s) \in \mathcal{P}_t(s, \pi(s))$;

  • (4) For all $u \in \mathbf{R}^{\mathcal{S}}$ and $n \geq 0$, $\mathrm{sp}\left(\mathfrak{L}_t^{n+1} u - \mathfrak{L}_t^{n} u\right) \leq \mathrm{sp}\left(\mathcal{L}_t^{n+1} u - \mathcal{L}_t^{n} u\right)$.

Property (1) guarantees that $\mathfrak{L}_t$ has a fixed point, while (2) states that this fixed point corresponds to an optimistic gain $\mathfrak{g}_t$ if the model and the bias confidence region are correct and the mitigation is not too aggressive. Combined with (3), the Poisson equation of a policy corresponds to this fixed point, i.e., $\tilde{r}^{\pi} + \tilde{P}^{\pi} \mathfrak{h}_t = \mathfrak{h}_t + \mathfrak{g}_t$, so that $\mathfrak{g}_t$ is the gain and $\mathfrak{h}_t \in \mathcal{H}_t$ is a legal bias for $\pi$ under the model $(\tilde{r}^{\pi}, \tilde{P}^{\pi})$.
Lastly, property (4) guarantees that the iterates $\mathfrak{L}_t^{n} u$ converge to a fixed point of $\mathfrak{L}_t$ at least as quickly as $\mathcal{L}_t^{n} u$ goes to a fixed point of $\mathcal{L}_t$; the convergence of $\mathcal{L}_t^{n} u$ is already guaranteed by existing studies and is discussed in the appendix.
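As a sanity check on conditions (O1) to (O4), consider the simplest possible bias region, a span ball $\{u : \mathrm{sp}(u) \leq c\}$, together with the truncation $u \mapsto \min(u, \min u + c)$. This is a special case chosen for illustration only: the projection built for the pairwise constraints of the bias confidence region is more general, and the name `span_clip` is hypothetical.

```python
def span(u):
    return max(u) - min(u)


def span_clip(u, c):
    """Truncate u from above so that the result has span at most c."""
    ceiling = min(u) + c
    return [min(x, ceiling) for x in u]


# Small numerical check of (O1)-(O4) on integer vectors.
u, v, c = [0, 3, 10], [1, 4, 10], 5
gu, gv = span_clip(u, c), span_clip(v, c)
diff = lambda a, b: [x - y for x, y in zip(a, b)]

monotone = all(x <= y for x, y in zip(gu, gv))               # (O1), since u <= v
non_expansive = span(diff(gu, gv)) <= span(diff(u, v))       # (O2)
shift_ok = span_clip([x + 2 for x in u], c) == [x + 2 for x in gu]  # (O3)
below_input = all(x <= y for x, y in zip(gu, u))             # (O4)
```

A numerical check on one pair of vectors is of course no proof; for this truncation the four conditions can be verified by hand because every entry of $\Gamma u - \Gamma v$ lies between $\min(u - v)$ and $\max(u - v)$.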

Provided that the bias confidence region is constructed, Proposition 2 foreshadows how powerful the construction is: the algorithm PMEVI, obtained by iterating $\mathfrak{L}_t$ instead of $\mathcal{L}_t$ in EVI, can replace the well-known EVI within any algorithm of the literature that relies on it (UCRL2 Auer et al. (2009), UCRL2B Fruit et al. (2020) or KL-UCRL Filippi et al. (2010)) for an immediate improvement of its theoretical guarantees.
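The loop of Algorithm 2 is short enough to sketch directly. In the Python below, the mitigated operator and the projection are passed as callables, so the loop is agnostic to how they are built. Everything in the toy usage is an assumption made for illustration: `toy_operator` is the plain Bellman operator of a two-state deterministic MDP (no confidence region, no mitigation), `span_clip` is a span-ball truncation, and the span bound 0.5 is an arbitrary value chosen so that the projection is active.

```python
def span(v):
    return max(v) - min(v)


def pmevi(operator, projection, epsilon, n_states, v0=None, max_iter=100000):
    """Alternate an operator step and a projection step until the span of
    two consecutive projected iterates drops below epsilon."""
    v = list(v0) if v0 is not None else [0.0] * n_states
    for _ in range(max_iter):
        half = operator(v)        # v_{n+1/2}: mitigated extended backup
        new_v = projection(half)  # v_{n+1}: projection onto the bias region
        if span([x - y for x, y in zip(new_v, v)]) < epsilon:
            return new_v
        v = new_v
    raise RuntimeError("PMEVI did not converge within max_iter iterations")


def toy_operator(v):
    # State 0: stay (reward 0.1) or move to state 1 (reward 0).
    # State 1: stay (reward 1.0) or move to state 0 (reward 0).
    return [max(0.1 + v[0], v[1]), max(1.0 + v[1], v[0])]


def span_clip(u, c):
    ceiling = min(u) + c
    return [min(x, ceiling) for x in u]


h = pmevi(toy_operator, lambda v: span_clip(v, 0.5), 1e-6, n_states=2)
```

The returned vector `h` has span at most 0.5, and applying the projected operator once more shifts it by a constant vector, which plays the role of the gain $\mathfrak{g}_t$ in property (1).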

3.2 Building the bias confidence region and its projection operator

The bias confidence region used by PMEVI-DT is obtained as a collection of constraints of the form:

$$\forall s \neq s', \quad \mathfrak{h}(s) - \mathfrak{h}(s') - c(s,s') \leq d(s,s'). \tag{8}$$

Such constraints include (1) prior bias constraints (if any) of the form $\mathfrak{h}(s) - \mathfrak{h}(s') \leq c_*(s,s')$; (2) span constraints of the form $\mathfrak{h}(s) - \mathfrak{h}(s') \leq c_0 := T^{1/5}$, which generate the span semi-ball $\{u : \mathrm{sp}(u) \leq T^{1/5}\}$; and (3) pairwise constraints obtained by estimating bias differences in the style of Zhang and Ji (2019); Zhang and Xie (2023), which we further improve. We start by defining a bias difference estimator.

Definition 1 (Bias difference estimator).

Given a pair of states $s \neq s'$, their sequence of commute times $(\tau_i^{s \leftrightarrow s'})_{i \geq 0}$ is defined by $\tau_{2i}^{s \leftrightarrow s'} := \inf\{t > \tau_{2i-1}^{s \leftrightarrow s'} : S_t = s\}$ and $\tau_{2i+1}^{s \leftrightarrow s'} := \inf\{t > \tau_{2i}^{s \leftrightarrow s'} : S_t = s'\}$, with the convention that $\tau_{-1}^{s \leftrightarrow s'} = -\infty$. The number of commutations up to time $t$ is $N_t(s \leftrightarrow s') := \inf\{i : \tau_i^{s \leftrightarrow s'} \leq t\}$, and $\hat{g}(t) := \frac{1}{t}\sum_{i=0}^{t-1} R_i$ is the empirical gain. The bias difference estimator at time $T$ is any quantity $c_T(s,s') \in \mathbf{R}$ such that:

$$N_T(s \leftrightarrow s')\, c_T(s,s') = \sum\nolimits_{i=0}^{N_T(s \leftrightarrow s')-1} (-1)^i \sum\nolimits_{t=\tau_i^{s \leftrightarrow s'}}^{\tau_{i+1}^{s \leftrightarrow s'}-1} (\hat{g}(T) - R_t). \tag{9}$$
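As an illustration of Definition 1, the following minimal Python sketch evaluates (9) on a recorded trajectory. It is not the authors' implementation. The function names are hypothetical, states are encoded as integers, the trajectory is indexed from $0$ to $T-1$, and $N_T(s \leftrightarrow s')$ is read as the index of the last commute time that has occurred within the trajectory; that reading is an assumption made so that every inner sum in (9) only involves observed rewards.

```python
from typing import List, Optional, Sequence


def commute_times(states: Sequence[int], s: int, s_prime: int) -> List[int]:
    """Alternating hitting times: first visit to s, then the next visit to s', then s, ..."""
    taus: List[int] = []
    target = s
    for t, state in enumerate(states):
        if state == target:
            taus.append(t)
            # Since s != s', the next target cannot be hit at the same time step.
            target = s_prime if target == s else s
    return taus


def bias_difference_estimate(
    states: Sequence[int], rewards: Sequence[float], s: int, s_prime: int
) -> Optional[float]:
    """Evaluate c_T(s, s') from (9); returns None when no commute has been completed."""
    horizon = len(rewards)
    taus = commute_times(states[:horizon], s, s_prime)
    # Assumption: N_T is the index of the last observed commute time.
    n_commutes = len(taus) - 1
    if n_commutes <= 0:
        return None
    g_hat = sum(rewards) / horizon  # empirical gain \hat{g}(T)
    total = 0.0
    for i in range(n_commutes):
        sign = 1.0 if i % 2 == 0 else -1.0
        total += sign * sum(g_hat - rewards[t] for t in range(taus[i], taus[i + 1]))
    return total / n_commutes
```

For instance, `bias_difference_estimate([0, 1, 0, 1], [1.0, 0.0, 1.0, 0.0], 0, 1)` sums three one-step commutes, each contributing $-0.5$ after the alternating sign is applied.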
Lemma 3.

With probability $1 - 2\delta$, for all $T' \leq T$ and all $\tilde{g} \geq g^*$, we have:

$$N_{T'}(s \leftrightarrow s') \left| h^*(s) - h^*(s') - c_{T'}(s,s') \right| \leq 3\,\mathrm{sp}(h^*) + (1 + \mathrm{sp}(h^*)) \sqrt{8T\log\left(\tfrac{2}{\delta}\right)} + 2\sum\nolimits_{t=0}^{T'-1} (\tilde{g} - R_t). \tag{10}$$

Lemma 3 says that the quality of the estimator $c_T(s,s')$ is directly linked to the number of observed commutes between $s$ and $s'$ as well as to the regret. The idea is that if the algorithm makes many commutes between $s$ and $s'$ and if its regret is small, then the algorithm mostly takes optimal paths from $s$ to $s'$. However, the bound provided by Lemma 3 is not accessible to the learner, because $\mathrm{sp}(h^*)$ is unknown in general. To overcome this issue, $\mathrm{sp}(h^*)$ is upper-bounded by $c_0 := T^{1/5}$. Overall, this leads to the algorithm that estimates the bias confidence region, specified in Algorithm 3.

  Algorithm 3: BiasEstimation$(\mathcal{F}_t, \mathcal{M}_t, \delta)$

 

Parameters: History $\mathcal{F}_t$, model region $\mathcal{M}_t$, confidence $\delta > 0$

1:  Estimate bias differences $c_t$ via (9);
2:  Estimate optimistic gain $\tilde{g} \leftarrow \min_{k < K(t)} \mathfrak{g}_k$;
3:  Inner regret estimation $B_0 \leftarrow t\tilde{g} - \sum_{i=0}^{t-1} R_i$;
4:  $\ell \leftarrow \sqrt{8T\log\left(\tfrac{2}{\delta}\right)}$, $c_0 \leftarrow T^{\frac{1}{5}}$;
5:  Estimate the bias difference errors as:

$$d_t(s,s') \equiv \mathrm{error}(c_t,s,s') := \frac{3c_0 + (1+c_0)(1+\ell) + 2B_0}{N_t(s \leftrightarrow s')}$$

6:  return  $(c_t, \mathrm{error}(c_t,-,-))$, (8) defines $\mathcal{H}'_t$.
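Step 5 of Algorithm 3 is a closed-form expression, and a small Python sketch makes the roles of $c_0$, $\ell$ and $B_0$ concrete. The function name and argument names are hypothetical, and returning $+\infty$ when no commute has been observed is an assumption of this sketch (the algorithm as stated divides by $N_t(s \leftrightarrow s')$ without addressing that case).

```python
import math


def bias_difference_error(
    horizon: int, inner_regret: float, n_commutes: int, delta: float
) -> float:
    """Error d_t(s, s') from step 5 of Algorithm 3.

    horizon      -- T, the number of learning steps
    inner_regret -- B_0 = t * g_tilde - sum of observed rewards
    n_commutes   -- N_t(s <-> s')
    delta        -- confidence parameter
    """
    if n_commutes <= 0:
        # Assumption: no commute observed means the constraint is vacuous.
        return math.inf
    ell = math.sqrt(8.0 * horizon * math.log(2.0 / delta))
    c0 = horizon ** 0.2  # c_0 = T^{1/5}, the proxy for sp(h*)
    return (3.0 * c0 + (1.0 + c0) * (1.0 + ell) + 2.0 * inner_regret) / n_commutes
```

The error decreases as $1/N_t(s \leftrightarrow s')$ for a fixed numerator, which is why pairs of states that are commuted between often yield the tightest constraints.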

 

  Algorithm 4: BiasProjection$(\mathcal{H}_t, u)$

 

Parameters: $\mathcal{H}_t$ a collection of linear constraints (8), $u \in \mathbf{R}^{\mathcal{S}}$ to project

1:  $v \leftarrow 0^{\mathcal{S}}$;
2:  for $s \in \mathcal{S}$ do
3:     Using linear programming, compute:
4:     $v(s) \leftarrow \sup\left\{w(s) : w \leq u \text{ and } w \in \mathcal{H}_t\right\}$;
5:  end for
6:  return  $v$.
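Algorithm 4 solves one linear program per state. Because every constraint in (8) is a difference constraint, the same coordinate-wise supremum can also be computed with Bellman-Ford style relaxations, a standard fact about systems of difference constraints. The sketch below uses that swapped-in technique rather than the linear programs of Algorithm 4. The function name and the matrix `bound` are hypothetical: `bound[s][s2]` is assumed to hold the tightest available upper bound on $w(s) - w(s')$ (for instance $c(s,s') + d(s,s')$ from (8)), with `math.inf` meaning no constraint.

```python
import math
from typing import List, Sequence


def bias_projection(bound: Sequence[Sequence[float]], u: Sequence[float]) -> List[float]:
    """Largest w (coordinate-wise) with w <= u and w[s] - w[s2] <= bound[s][s2]."""
    n = len(u)
    v = list(u)
    for _ in range(n):
        changed = False
        for s in range(n):
            for s2 in range(n):
                # Relax the constraint w[s] <= w[s2] + bound[s][s2].
                if s != s2 and v[s2] + bound[s][s2] < v[s]:
                    v[s] = v[s2] + bound[s][s2]
                    changed = True
        if not changed:
            return v
    # Still changing after n rounds: the constraints contain a negative cycle.
    raise ValueError("the constraint set is empty")
```

With a pure span constraint `bound[s][s2] = 2.0` for all pairs and `u = [5.0, 0.0, 3.0]`, the result is `[2.0, 0.0, 2.0]`: every coordinate is lowered just enough to be within 2 of the smallest one. Applying the function again to its own output leaves it unchanged.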

 

Coupled with prior information and span constraints, the obtained bias confidence region $\mathcal{H}_t$ is a polyhedron of the same kind as the one encountered in Zhang and Xie (2023), generated by constraints of the form (8). Similarly to their Proposition 3, one can project onto $\mathcal{H}_t$ in polynomial time with Algorithm 4. Moreover, the resulting projection operator satisfies the prerequisites (O1-4) of Proposition 2, making sure that PMEVI (Algorithm 2) is well-behaved. This is proved in Section B.2 of the appendix.

Lemma 4.

Assume that $\mathcal{H}$ is the set of $\mathfrak{h} \in \mathbf{R}^{\mathcal{S}}$ satisfying a system of inequalities of the form (8). If $\mathcal{H}$ is nonempty, then the operator $\Gamma u := \text{BiasProjection}(\mathcal{H}, u)$ (see Algorithm 4) is a projection on $\mathcal{H}$ and satisfies the properties (O1-4) defined in Proposition 2.

3.3 Mitigation using finer bias dynamical error

The fact that $h^* \in \mathcal{H}_t$ with high probability is used in PMEVI-DT to restrict the search of EVI by reducing the dynamical bias error. This reduction is based on an empirical Bernstein inequality (see Lemma 38) applied to $(\hat{p}(s,a) - p(s,a))u$. Here, it gives that with probability $1 - \delta$, we have:

$$\left(\hat{p}_t(s,a) - p(s,a)\right) u \leq \sqrt{\frac{2\mathbf{V}(\hat{p}_t(s,a),u)\log\left(\tfrac{3T}{\delta}\right)}{\max\left\{1, N_t(s,a)\right\}}} + \frac{3\,\mathrm{sp}(u)\log\left(\tfrac{3T}{\delta}\right)}{\max\left\{1, N_t(s,a)\right\}} =: \beta_t(s,a,u) \tag{11}$$

where $\mathbf{V}(\hat{p}_t(s,a),u)$ is the variance of $u$ under the probability vector $\hat{p}_t(s,a)$. More specifically, if $q$ is a probability on $\mathcal{S}$ and $u \in \mathbf{R}^{\mathcal{S}}$, we set $\mathbf{V}(q,u) := \sum_s q(s)(u(s) - q \cdot u)^2$. In (11), $u \in \mathbf{R}^{\mathcal{S}}$, $(s,a) \in \mathcal{X}$ and $T \geq 1$ are fixed. One is tempted to use (11) directly to mitigate the extended Bellman operator, but the resulting operator is ill-behaved because it loses monotonicity. This issue is avoided by changing $\beta_t(s,a,u)$ to $\max_{u \in \mathcal{H}_t} \beta_t(s,a,u)$ in (9). We obtain a variance maximization problem, which is a convex maximization problem with linear constraints. Even in very simple settings, such optimization problems are NP-hard Pardalos and Schnitger (1988), hence computing $\max_{u \in \mathcal{H}_t} \beta_t(s,a,u)$ is not reasonable in general. Thankfully, this value can be upper-bounded by a tractable quantity that is enough to guarantee regret efficiency. The mitigation $\beta_t$ used by PMEVI-DT is given by Algorithm 5.
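To make the quantities in (11) concrete, here is a minimal Python sketch of the variance $\mathbf{V}(q,u)$, the span $\mathrm{sp}(u)$ and the bonus $\beta_t(s,a,u)$ for one fixed vector $u$. The function names are hypothetical and vectors are plain Python lists indexed by state. This only evaluates the right-hand side of (11); it does not address the maximization over $\mathcal{H}_t$ discussed above.

```python
import math
from typing import Sequence


def variance(q: Sequence[float], u: Sequence[float]) -> float:
    """V(q, u) = sum_s q(s) * (u(s) - q.u)^2 for a probability vector q."""
    mean = sum(qs * us for qs, us in zip(q, u))
    return sum(qs * (us - mean) ** 2 for qs, us in zip(q, u))


def span(u: Sequence[float]) -> float:
    """sp(u) = max(u) - min(u)."""
    return max(u) - min(u)


def bernstein_bonus(
    p_hat: Sequence[float], u: Sequence[float], n_visits: int, horizon: int, delta: float
) -> float:
    """Right-hand side of (11): empirical Bernstein bound on (p_hat - p) u."""
    log_term = math.log(3.0 * horizon / delta)
    n = max(1, n_visits)
    return math.sqrt(2.0 * variance(p_hat, u) * log_term / n) + 3.0 * span(u) * log_term / n
```

For a fixed $u$, the first term shrinks as $1/\sqrt{N_t(s,a)}$ and is scaled by the empirical variance, while the second term shrinks as $1/N_t(s,a)$ and is scaled by the span.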

  Algorithm 5: VarianceApproximation$(\mathcal{H}'_t, \mathcal{F}_t)$

 

Parameters: Bias region $\mathcal{H}'_t$, history $\mathcal{F}_t$

1:  Extract constraints $(c, \mathrm{error}(c,-,-)) \leftarrow \mathcal{H}'_t$;
2:  Set $c_0 \leftarrow T^{\frac{1}{5}}$;
3:  Pick a reference point $h_0 \leftarrow \text{BiasProjection}(\mathcal{H}_t, c(-,s_0))$;
4:  for $(s,a) \in \mathcal{X}$ do
5:     $\rho \leftarrow \log\left(\tfrac{SAT}{\delta}\right) / \max\left\{1, N_t(s,a)\right\}$;
6:     $\mathrm{var}(s,a) \leftarrow \mathbf{V}(\hat{p}_t(s,a), h_0) + 8c_0 \sum_{s' \in \mathcal{S}} \hat{p}_t(s'|s,a)\, c(s',s)$;
7:     $\beta_t(s,a) \leftarrow \sqrt{2\,\mathrm{var}(s,a)\rho} + 3c_0\rho$, or $+\infty$ if $N_t(s,a) = 0$;
8:  end for
9:  return  $\beta_t$.
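Steps 5 to 7 of Algorithm 5 can be sketched for a single pair $(s,a)$ as follows. This is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, `c_to_s[s2]` stands for the value $c(s',s)$ used in step 6, and clamping the variance proxy at zero before taking the square root is an assumption of this sketch that Algorithm 5 does not state.

```python
import math
from typing import Sequence


def mitigation_bonus(
    p_hat: Sequence[float],
    h0: Sequence[float],
    c_to_s: Sequence[float],
    n_visits: int,
    n_states: int,
    n_actions: int,
    horizon: int,
    delta: float,
) -> float:
    """beta_t(s, a) from steps 5-7 of Algorithm 5 for one state-action pair."""
    if n_visits == 0:
        return math.inf  # step 7: unvisited pairs get an infinite bonus
    c0 = horizon ** 0.2  # c_0 = T^{1/5}
    rho = math.log(n_states * n_actions * horizon / delta) / max(1, n_visits)
    mean = sum(p * h for p, h in zip(p_hat, h0))
    var_h0 = sum(p * (h - mean) ** 2 for p, h in zip(p_hat, h0))  # V(p_hat, h0)
    var = var_h0 + 8.0 * c0 * sum(p * c for p, c in zip(p_hat, c_to_s))
    var = max(var, 0.0)  # assumption: keep the square root well defined
    return math.sqrt(2.0 * var * rho) + 3.0 * c0 * rho
```

Unlike the maximization over $\mathcal{H}_t$, this quantity only needs one variance evaluation at the reference point $h_0$ and one weighted sum per pair $(s,a)$.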

 

4 Regret guarantees

Theorem 5 below shows that PMEVI-DT has minimax optimal regret under regularity assumptions on the confidence region $\mathcal{M}_t$ that it uses. Assumption 1 asserts that the confidence region holds uniformly with high probability. Assumption 2 asserts that the reward confidence region is sub-Weissman (see Lemma 35), and Assumption 3 requires the model confidence region to ensure that EVI (6) converges in the first place. Assumption 4 asserts that the prior bias region is correct.

Assumption 1.

With probability $1 - \delta$, we have $M \in \bigcap_{k=1}^{K(T)} \mathcal{M}_{t_k}$.

Assumption 2.

There exists a constant $C > 0$ such that for all $(s,a) \in \mathcal{X}$, for all $t \leq T$, we have:

$$\mathcal{R}_t(s,a) \subseteq \left\{\tilde{r}(s,a) \in \mathcal{R}(s,a) : N_t(s,a) \left\|\hat{r}_t(s,a) - \tilde{r}(s,a)\right\|_1^2 \leq C\log\left(\tfrac{2SA(1+N_t(s,a))}{\delta}\right)\right\}.$$
Assumption 3.

For $t \geq 0$, $\mathcal{M}_t$ is a $(s,a)$-rectangular convex region and $\mathcal{L}_t^n u$ converges to a fixed point.

Assumption 4.

The prior bias region $\mathcal{H}_*$ contains $h^*(M)$ and is generated by constraints of the form:

$$\forall s \neq s', \quad \mathfrak{h}(s) - \mathfrak{h}(s') \leq c_*(s,s')$$

with $c_*(s,s') \in [-\infty, \infty]$ (possibly infinite).

Refer to Section A.2 for the feasibility of Assumption 1, Section A.2.3 for Assumption 2, and Section A.3 for Assumption 3.

Theorem 5 (Main result).

Let $c > 0$. Assume that PMEVI-DT runs with a confidence region system $t \mapsto \mathcal{M}_t$ that guarantees Assumptions 1-3. If $T \geq c^5$, then for every weakly communicating model with $\mathrm{sp}(h^*) \leq c$ and such that Assumption 4 is satisfied ($h^* \in \mathcal{H}_*$), PMEVI-DT achieves regret:

$$\mathrm{O}\left(\sqrt{cSAT\log\left(\tfrac{SAT}{\delta}\right)}\right) + \mathrm{O}\left(cS^{\frac{5}{2}}A^{\frac{3}{2}}T^{\frac{9}{20}}\log^2\left(\tfrac{SAT}{\delta}\right)\right)$$

with probability $1 - 26\delta$, and in expectation if $\delta < \sqrt{1/T}$. Moreover, if PMEVI-DT runs with the same confidence regions as UCRL2 Auer et al. (2009), then it enjoys a time complexity of $\mathrm{O}(DS^3AT)$.
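As a reading aid, the short computation below compares the two terms of the bound. It ignores the constants hidden by $\mathrm{O}(\cdot)$, so the resulting threshold is only indicative of the order of magnitude; the shorthand $L$ is introduced here for convenience.

```latex
% Shorthand (introduced for this computation only): L := \log(SAT/\delta).
\[
\frac{c\,S^{5/2}A^{3/2}\,T^{9/20}\,L^{2}}{\sqrt{c\,S\,A\,T\,L}}
  \;=\; \sqrt{c}\;S^{2}A\;T^{-1/20}\;L^{3/2}.
\]
% Hence, up to hidden constants, the second term is at most the first one as soon as
\[
T^{1/20} \;\geq\; \sqrt{c}\;S^{2}A\;L^{3/2}.
\]
```

The second term is therefore of lower order in $T$, but its polynomial dependence on $S$ and $A$ means that the leading term only dominates for large horizons.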

To have a completely prior-less algorithm, pick $\mathcal{H}_* = \mathbf{R}^{\mathcal{S}}$. The proof of Theorem 5 is too long to fit within these pages, so the complete proof is deferred to the appendix. We focus here on the main ideas.

[Figure 1 (diagram). The model region $\mathcal{M}_t$, the bias region $\mathcal{H}_t$ and the mitigation $\beta_t$ feed Projected Mitigated EVI (Alg. 1). The diagram relates the regret $\mathrm{Reg}(T)$, $\sum_k \sum_{t=t_k}^{t_{k+1}-1}(g^* - R_t)$, to the following terms: optimistic regret (Lemma 6), $\sum_k \sum_{t=t_k}^{t_{k+1}-1}(\mathfrak{g}_k - R_t)$; navigation error (Lemma 7), $\sum_{k,t}(p(X_t) - e_{S_{t+1}})\mathfrak{h}_k$; empirical bias error (Lemma 8), $\sum_{k,t}(\hat{p}_{t_k}(X_t) - p(X_t))h^*$; optimism overshoot (Lemma 9), $\sum_{k,t}(\tilde{p}_{t_k}(S_t) - \hat{p}_{t_k}(S_t))\mathfrak{h}_k$; second-order error (Lemma 10), $\sum_{k,t}(\hat{p}_{t_k}(S_t) - p(X_t))(\mathfrak{h}_k - h^*)$. Bounds displayed in the diagram: $\mathrm{O}\left(\sqrt{SAT\log(T)}\right)$; $\mathrm{O}(c_0\log(SAT))$; $\sqrt{\mathrm{sp}(h^*)SAT\log(SAT)}$; $S^2Ac_0\log^2(T)\sqrt{\mathrm{Reg}(T)}$; $\mathrm{O}\left(c_0S^2A^2T^{1/4}\log^2(T)\right)$; with the annotation "(in expectation)".]
Figure 1: An overview of PMEVI-DT and its regret analysis. In the above, $\mathfrak{g}_k$ and $\mathfrak{h}_k$ are the optimistic gain and bias functions produced by PMEVI (see Algorithm 2) at episode $k$, and $\hat{p}_{t_k}$ and $\tilde{p}_{t_k}$ are respectively the empirical and optimistic kernel models at episode $k$.

We start by introducing notations. At episode $k$, the played policy is denoted $\pi_k$. Since $\pi_k$ is a greedy response to $\mathfrak{h}_k$, Proposition 2 (3) guarantees that there exist $\tilde{r}_k(s) \le \sup \mathcal{R}_{t_k}(s, \pi_k(s))$ and $\tilde{P}_k(s) \in \mathcal{P}_{t_k}(s, \pi_k(s))$ such that $\mathfrak{h}_k + \mathfrak{g}_k = \tilde{r}_k + \tilde{P}_k \mathfrak{h}_k$. 
The reward-kernel pair $\tilde{M}_k = (\tilde{r}_k, \tilde{P}_k)$ is referred to as the optimistic model of $\pi_k$. We write $P_k := P_{\pi_k}(M)$ for the true kernel and $\hat{P}_k := P_{\pi_k}(\hat{M}_{t_k})$ for the empirical kernel. Likewise, we define the reward functions $r_k$ and $\hat{r}_k$. 
The optimistic gain and bias satisfy $\mathfrak{g}_k = g(\pi_k, \tilde{M}_k)$ and $\mathfrak{h}_k = h(\pi_k, \tilde{M}_k)$. We further denote $c_0 = T^{\frac{1}{5}}$.
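To make the Poisson equation $\mathfrak{h}_k + \mathfrak{g}_k = \tilde{r}_k + \tilde{P}_k \mathfrak{h}_k$ concrete, here is a minimal numerical sketch. It assumes a small ergodic Markov reward process induced by a fixed policy; the function name `gain_and_bias`, the 3-state kernel `P` and the reward vector `r` are illustrative choices and are not taken from the paper. The bias is normalized to have zero mean under the stationary distribution, which is one possible convention among others.

```python
import numpy as np

def gain_and_bias(P, r):
    """Gain g and bias h of an ergodic Markov reward process (P, r).

    Solves h + g*1 = r + P h with the normalization mu @ h = 0,
    where mu is the stationary distribution of P.
    """
    n = P.shape[0]
    # Stationary distribution: mu P = mu and sum(mu) = 1 (consistent linear system).
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    g = float(mu @ r)
    # Bias via (I - P + 1 mu^T) h = r - g*1, which also enforces mu @ h = 0.
    h = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), mu), r - g)
    return g, h

# Hypothetical 3-state example (not from the paper).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
r = np.array([0.0, 0.1, 1.0])
g, h = gain_and_bias(P, r)

# The Poisson equation holds coordinate-wise.
assert np.allclose(h + g, r + P @ h)
```

Any vector of the form `h + constant` satisfies the same equation, which is why only the span of the bias, not its absolute level, enters the regret bounds.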

The regret is first decomposed episodically with $\operatorname{Reg}(T) = \sum_k \sum_{t=t_k}^{t_{k+1}-1} (g^* - R_t)$. The first step goes back to the analysis of UCRL2 Auer et al. (2009), and consists in upper-bounding the regret over episode $k$ with optimistic quantities that are exclusive to that episode.

Lemma 6 (Reward optimism).

With probability $1 - 6\delta$, we have:

\[
\operatorname{Reg}(T) \le \sum\nolimits_k \sum\nolimits_{t=t_k}^{t_{k+1}-1} (\mathfrak{g}_k - R_t) \le \sum\nolimits_k \sum\nolimits_{t=t_k}^{t_{k+1}-1} (\mathfrak{g}_k - \tilde{r}_k(X_t)) + \mathrm{O}\left(\sqrt{SAT \log\left(\tfrac{T}{\delta}\right)}\right). \tag{12}
\]

We introduce the two optimistic regrets $B(T) := \sum_k \sum_{t=t_k}^{t_{k+1}-1} (\mathfrak{g}_k - R_t)$ and $\tilde{B}(T) := \sum_k \sum_{t=t_k}^{t_{k+1}-1} (\mathfrak{g}_k - \tilde{r}_k(X_t))$. 
Rewriting the summand $\mathfrak{g}_k - \tilde{r}_k(X_t)$ using the Poisson equation $\mathfrak{h}_k + \mathfrak{g}_k = \tilde{r}_k + \tilde{P}_k \mathfrak{h}_k$, we get:

\[
\tilde{B}(T) = \sum\nolimits_k \sum\nolimits_{t=t_k}^{t_{k+1}-1} \left(\tilde{p}_k(S_t) - e_{S_t}\right) \mathfrak{h}_k.
\]
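For completeness, the step from the Poisson equation to this identity can be written out coordinate-wise. The sketch below assumes that $\tilde{p}_k(s)$ denotes the row of $\tilde{P}_k$ at state $s$, that $e_s$ is the $s$-th canonical basis vector, and that $X_t = (S_t, \pi_k(S_t))$ during episode $k$, which is how the notation appears to be used here.

```latex
\begin{align*}
\mathfrak{g}_k - \tilde{r}_k(s)
  &= \big(\tilde{P}_k \mathfrak{h}_k\big)(s) - \mathfrak{h}_k(s)
  && \text{(Poisson equation read at state } s\text{)} \\
  &= \tilde{p}_k(s)\,\mathfrak{h}_k - e_s\,\mathfrak{h}_k
   = \big(\tilde{p}_k(s) - e_s\big)\,\mathfrak{h}_k .
\end{align*}
```

Taking $s = S_t$ and summing over $t$ within each episode, then over episodes, gives the stated expression.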

The analysis proceeds by decomposing the above expression of $\tilde{B}(T)$ in the style of Zhang and Ji (2019). We write $\sum_{t=t_k}^{t_{k+1}-1} (\tilde{p}_k(S_t) - e_{S_t}) \mathfrak{h}_k$ as:

\[
\sum\nolimits_{t=t_k}^{t_{k+1}-1} \Big( \underbrace{\left(p_k(S_t) - e_{S_t}\right) \mathfrak{h}_k}_{\text{navigation error } (1k)} + \underbrace{\left(\hat{p}_k(S_t) - p_k(S_t)\right) h^*}_{\text{empirical bias error } (2k)} + \underbrace{\left(\tilde{p}_k(S_t) - \hat{p}_k(S_t)\right) \mathfrak{h}_k}_{\text{optimistic overshoot } (3k)} + \underbrace{\left(\hat{p}_k(S_t) - p_k(S_t)\right) (\mathfrak{h}_k - h^*)}_{\text{second order error } (4k)} \Big)
\]
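The four-term splitting above is a purely algebraic identity, and it can be sanity-checked numerically for a single time step. The sketch below uses random stand-ins for the true, empirical and optimistic transition rows and for the two bias vectors; all variable names and the number of states are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of states (illustrative)

def random_distribution(rng, n):
    """Draw a random probability vector over n states."""
    w = rng.random(n)
    return w / w.sum()

# Stand-ins for the true, empirical and optimistic transition rows at S_t.
p, p_hat, p_tilde = (random_distribution(rng, n) for _ in range(3))
e = np.eye(n)[2]              # e_{S_t}, with S_t = 2 chosen arbitrarily
h_opt = rng.normal(size=n)    # stand-in for the optimistic bias h_k
h_star = rng.normal(size=n)   # stand-in for the optimal bias h*

navigation = (p - e) @ h_opt
empirical = (p_hat - p) @ h_star
overshoot = (p_tilde - p_hat) @ h_opt
second_order = (p_hat - p) @ (h_opt - h_star)

lhs = (p_tilde - e) @ h_opt
rhs = navigation + empirical + overshoot + second_order
assert np.isclose(lhs, rhs)
```

The identity holds for any vectors, so the content of the analysis lies entirely in how each of the four terms is bounded.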

Each error term is bounded separately. Below, we denote $\mathbf{V}(q, u) := \sum_s q(s) (u(s) - q \cdot u)^2$.
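As a small illustration of this definition, $\mathbf{V}(q, u)$ is the variance of $u$ under the distribution $q$. The following sketch assumes `q` is a probability vector; the function name `variance` and the two-state example are hypothetical.

```python
import numpy as np

def variance(q, u):
    """V(q, u) = sum_s q(s) * (u(s) - q.u)^2 for a probability vector q."""
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    mean = q @ u
    return float(q @ (u - mean) ** 2)

q = np.array([0.5, 0.5])
u = np.array([0.0, 2.0])
v = variance(q, u)  # mean is 1, so V = 0.5 * 1 + 0.5 * 1 = 1

# Shifting u by a constant leaves V unchanged, so V depends on u only
# through its fluctuations; it is at most span(u)^2 / 4.
assert np.isclose(variance(q, u + 3.0), v)
assert v <= (u.max() - u.min()) ** 2 / 4 + 1e-12
```

The shift invariance is consistent with the bounds below depending on the span of $h^*$ rather than on its absolute values.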

Lemma 7 (Navigation error).

With probability $1 - 7\delta$, the navigation error is bounded by:

\[
\sum\nolimits_k \sum\nolimits_{t=t_k}^{t_{k+1}-1} (p_k(S_t) - e_{S_t}) \mathfrak{h}_k \le \sqrt{2 \sum\nolimits_{t=0}^{T-1} \mathbf{V}(p(X_t), h^*) \log\left(\tfrac{T}{\delta}\right)} + 2 S A^{\frac{1}{2}} \sqrt{3 B(T)} \log\left(\tfrac{T}{\delta}\right) + \widetilde{\mathrm{O}}\left(T^{\frac{7}{20}}\right).
\]
Lemma 8 (Empirical bias error).

With probability $1 - \delta$, the empirical bias error is bounded by:

\[
\sum\nolimits_k \sum\nolimits_{t=t_k}^{t_{k+1}-1} \left(\hat{p}_k(S_t) - p_k(S_t)\right) h^* \le 4 \sqrt{SA \sum\nolimits_{t=0}^{T-1} \mathbf{V}(p(X_t), h^*) \log\left(\tfrac{SAT}{\delta}\right)} + \mathrm{O}\left(\log^2(T)\right).
\]
Lemma 9 (Optimism overshoot).

With probability $1 - 6\delta$, the optimism overshoot is bounded by:

\[
\sum\nolimits_k \sum\nolimits_{t=t_k}^{t_{k+1}-1} \left(\tilde{p}_k(S_t) - \hat{p}_k(S_t)\right) \mathfrak{h}_k \le \begin{Bmatrix} 4 \sqrt{2 SA \sum\nolimits_{t=0}^{T-1} \mathbf{V}(p(X_t), h^*) \log\left(\tfrac{SAT}{\delta}\right)} \\ + 8 (1 + c_0) S^{\frac{3}{2}} A \log^{\frac{3}{2}}\left(\tfrac{SAT}{\delta}\right) \sqrt{B(T)} + \widetilde{\mathrm{O}}\left(T^{\frac{1}{4}}\right) \end{Bmatrix}.
\]
Lemma 10 (Second order error).

With probability $1 - 6\delta$, the second order error is bounded by:

\[
\sum\nolimits_k \sum\nolimits_{t=t_k}^{t_{k+1}-1} \left(\hat{p}_k(S_t) - p_k(S_t)\right) (\mathfrak{h}_k - h^*) \le 16 S^2 A (1 + c_0) \log^{\frac{1}{2}}\left(\tfrac{S^2 A T}{\delta}\right) \sqrt{2 B(T)} + \widetilde{\mathrm{O}}\left(T^{\frac{1}{4}}\right).
\]

We see that the empirical bias error (Lemma 8) and the optimism overshoot (Lemma 9) both involve the sum of variances $\sum_{t=0}^{T-1} \mathbf{V}(p(X_t), h^*)$, which is shown in Lemma 29 to be of order $\mathrm{sp}(h^*) \mathrm{sp}(r) T + \sum_{t=0}^{T-1} \Delta^*(X_t)$. The pseudo-regret term $\sum_{t=0}^{T-1} \Delta^*(X_t)$ is bounded in terms of the regret using Corollary 31, and then by $B(T)$. With high probability, we obtain an equation of the form:

\[
B(T) \le C \sqrt{(1 + \mathrm{sp}(h^*)) SAT \log\left(\tfrac{T}{\delta}\right)} + C S^2 A (1 + c_0) \log^2(T) \sqrt{B(T)} + \widetilde{\mathrm{O}}\left(T^{\frac{1}{4}}\right)
\]

where $C$ is a constant. Setting $\alpha := C S^2 A (1 + c_0) \log^2(T)$ and $\beta := C \sqrt{(1 + \mathrm{sp}(h^*)) SAT \log(T/\delta)} + \widetilde{\mathrm{O}}(T^{1/4})$, the above equation is of the form $B(T) \le \beta + \alpha \sqrt{B(T)}$. Solving in $B(T)$, we find $B(T) \le \beta + \alpha \sqrt{\beta} + \alpha^2$. The dominant term is $\beta$, hence we readily obtain:
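The inequality $B(T) \le \beta + \alpha \sqrt{B(T)}$ can be solved as a quadratic in $\sqrt{B(T)}$. The derivation below is a sketch under the assumptions $x := B(T) \ge 0$ and $\alpha, \beta \ge 0$; it produces the cross term $\alpha\sqrt{\beta}$.

```latex
\begin{align*}
x \le \beta + \alpha\sqrt{x}
  &\iff \big(\sqrt{x}\big)^2 - \alpha\sqrt{x} - \beta \le 0 \\
  &\implies \sqrt{x} \le \tfrac{1}{2}\Big(\alpha + \sqrt{\alpha^2 + 4\beta}\Big)
   \le \alpha + \sqrt{\beta}
  && \text{since } \sqrt{\alpha^2 + 4\beta} \le \alpha + 2\sqrt{\beta} \\
  &\implies x \le \beta + \alpha\sqrt{x}
   \le \beta + \alpha\big(\alpha + \sqrt{\beta}\big)
   = \beta + \alpha\sqrt{\beta} + \alpha^2 .
\end{align*}
```

The last line reuses the original inequality once the bound on $\sqrt{x}$ is available, which avoids squaring the root of the quadratic explicitly.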

\[
B(T) \le C \sqrt{(1 + \mathrm{sp}(h^*)) \mathrm{sp}(r) SAT \log\left(\tfrac{T}{\delta}\right)} + \widetilde{\mathrm{O}}\left(\mathrm{sp}(h^*) \mathrm{sp}(r) S^{\frac{5}{2}} A^{\frac{3}{2}} (1 + c_0) T^{\frac{1}{4}}\right). \tag{13}
\]

Since $c_0 = \mathrm{o}(T^{\frac{1}{4}})$, we conclude that $B(T) = \mathrm{O}\left(\sqrt{\mathrm{sp}(h^*) SAT \log(T/\delta)}\right)$, ending the proof.
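To see why the condition on $c_0$ matters, one can compare the growth of the lower-order factor $(1 + c_0) T^{1/4}$ with that of $\sqrt{T}$ for $c_0 = T^{1/5}$. The sketch below deliberately ignores the dependence on $S$, $A$, $\mathrm{sp}(h^*)$ and logarithmic factors, so it only illustrates the exponents in $T$; the function name `lower_order_ratio` is hypothetical.

```python
import math

def lower_order_ratio(T, c0_exponent=0.2):
    """Ratio (1 + c0) * T**(1/4) / sqrt(T) with c0 = T**c0_exponent."""
    c0 = T ** c0_exponent
    return (1.0 + c0) * T ** 0.25 / math.sqrt(T)

# The ratio behaves like T**(-1/20) for large T: it vanishes, but slowly.
ratios = [lower_order_ratio(10.0 ** k) for k in (4, 8, 16, 32)]
assert all(a > b for a, b in zip(ratios, ratios[1:]))
```

With these simplifications the ratio decays like $T^{-1/20}$, so the lower-order term is asymptotically negligible but may remain visible over a long horizon.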

5 Experimental illustrations

To get a grasp of how PMEVI-DT behaves in practice, we provide in Fig. 2 a few illustrative experiments. In both experiments, the environment is a river-swim, a model known to be hard to learn despite its small size, with high diameter and bias span. Its description is found in Bourel et al. (2020) and is reported in the appendix for completeness.

[Figure 2, left panel. Legend: UCRL2B & PMEVI-UCRL2B; UCRL2 & PMEVI-UCRL2; KLUCRL & PMEVI-KLUCRL.]
[Figure 2, right panel. Legend: UCRL2; PMEVI(c=2); PMEVI(c=0.5); PMEVI(c=1).]
Figure 2: (Left) Running a few algorithms from the literature on a 5-state river-swim and comparing their average regret against their PMEVI variants, obtained by replacing calls to the EVI sub-routine with calls to PMEVI. (Right) Running UCRL2 and PMEVI-DT with the same confidence region as UCRL2 on a 3-state river-swim. PMEVI-DT is run with prior knowledge $h^*(s_1) \le h^*(s_2) - c \le h^*(s_3) - 2c$ for $c \in \{0, 0.5, 1, 1.5, 2\}$.

In the first experiment, we observe that PMEVI behaves almost identically to its EVI counterparts when no prior on the bias region is given. This is because most of the regret is incurred during the early learning phase, when bias information is impossible to obtain (the regret is still growing linearly and the bias estimator is off). Accordingly, the bias confidence region is too large and all projections onto it are trivial during the iterations of PMEVI. Thankfully, this also means that calls to PMEVI are not substantially heavier than calls to EVI from a computational perspective. In the second experiment, we measure the influence of prior bias information on the behavior of PMEVI-DT. We observe that PMEVI-DT is very efficient at using adequate prior bias information to strikingly reduce the burn-in cost of the learning process on this 3-state river-swim.
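The prior knowledge used in the second experiment is a pair of linear constraints on the bias vector, $h^*(s_1) \le h^*(s_2) - c \le h^*(s_3) - 2c$, that is, consecutive bias values are separated by at least $c$. The sketch below only checks membership in this constraint set; the function name `satisfies_prior` and the example vector `h_example` are hypothetical and do not correspond to the actual bias of the river-swim.

```python
def satisfies_prior(h, c):
    """Check h(s1) <= h(s2) - c <= h(s3) - 2c for a 3-state bias vector h."""
    h1, h2, h3 = h
    return h1 <= h2 - c <= h3 - 2 * c

# Hypothetical bias vector, for illustration only.
h_example = (0.0, 1.0, 2.5)

# Larger c describes a smaller (more informative) constraint set.
feasible = {c: satisfies_prior(h_example, c) for c in (0, 0.5, 1, 1.5, 2)}
assert feasible[0] and feasible[1] and not feasible[1.5]
```

Since the constraint set shrinks as $c$ grows, a larger $c$ encodes stronger prior information, provided the true bias still satisfies it.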

References

  • Abbasi-Yadkori et al. [2019] Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced POLITEX. arXiv preprint arXiv:1908.10479, 2019.
  • Agrawal and Jia [2023] Shipra Agrawal and Randy Jia. Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds. Mathematics of Operations Research, 48(1):363–392, 2023. Publisher: INFORMS.
  • Audibert et al. [2009] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009. Publisher: Elsevier.
  • Auer and Ortner [2006] Peter Auer and Ronald Ortner. Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning. Proceedings of the 19th International Conference on Neural Information Processing Systems, December 2006.
  • Auer et al. [2009] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal Regret Bounds for Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009.
  • Azuma [1967] Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357–367, 1967. Publisher: Tohoku University, Mathematical Institute.
  • Bartlett and Tewari [2009] Peter L. Bartlett and Ambuj Tewari. REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 35–42, Arlington, Virginia, USA, June 2009. AUAI Press. ISBN 978-0-9749039-5-8.
  • Bourel et al. [2020] Hippolyte Bourel, Odalric Maillard, and Mohammad Sadegh Talebi. Tightening Exploration in Upper Confidence Reinforcement Learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1056–1066. PMLR, July 2020.
  • Burnetas and Katehakis [1997] Apostolos Burnetas and Michael Katehakis. Optimal Adaptive Policies for Markov Decision Processes. Mathematics of Operations Research - MOR, 22:222–255, February 1997.
  • Cohen et al. [2020] Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time, 2020.
  • Filippi et al. [2010] Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in Reinforcement Learning and Kullback-Leibler Divergence. 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122, September 2010. arXiv: 1004.5229.
  • Fruit [2019] Ronan Fruit. Exploration-exploitation dilemma in Reinforcement Learning under various form of prior knowledge. PhD Thesis, Université de Lille 1, Sciences et Technologies; CRIStAL UMR 9189, 2019.
  • Fruit et al. [2018] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning. Proceedings of the 35th International Conference on Machine Learning, 2018.
  • Fruit et al. [2020] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved Analysis of UCRL2 with Empirical Bernstein Inequality. ArXiv, abs/2007.05456, 2020.
  • Jonsson et al. [2020] Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. Planning in markov decision processes with gap-dependent sample complexity. Advances in Neural Information Processing Systems, 33:1253–1263, 2020.
  • Ouyang et al. [2017] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning Unknown Markov Decision Processes: A Thompson Sampling Approach. arXiv:1709.04570 [cs], September 2017. arXiv: 1709.04570.
  • Pardalos and Schnitger [1988] Panos M. Pardalos and Georg Schnitger. Checking local optimality in constrained quadratic programming is NP-hard. Operations Research Letters, 7:33–35, 1988.
  • Puterman [1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1 edition, April 1994. ISBN 978-0-471-61977-2 978-0-470-31688-7.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Talebi and Maillard [2018] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs. Journal of Machine Learning Research, pages 1–36, April 2018. Publisher: Microtome Publishing.
  • Tewari and Bartlett [2007] Ambuj Tewari and P. Bartlett. Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs. In NIPS, 2007.
  • Theocharous et al. [2017] Georgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, and Nikos Vlassis. Posterior sampling for large scale reinforcement learning. arXiv preprint arXiv:1711.07979, 2017.
  • Thompson [1933] William R Thompson. On the Likelihood that One Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3-4):285–294, December 1933. ISSN 0006-3444.
  • Wei et al. [2020] Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10170–10180. PMLR, July 2020.
  • Zhang and Ji [2019] Zihan Zhang and Xiangyang Ji. Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Zhang and Xie [2023] Zihan Zhang and Qiaomin Xie. Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes. In The Thirty Sixth Annual Conference on Learning Theory, pages 5476–5477. PMLR, 2023.
  • Zhang et al. [2020] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition. arXiv:2004.10019 [cs, stat], June 2020. arXiv: 2004.10019.

Appendix


Appendix A Construction of PMEVI-DT

This section provides the technical details required to understand the design of PMEVI-DT in Section 3. We further discuss Assumptions 1-4 appearing in Theorem 5 and provide sufficient conditions under which they are met.

A.1 Proof of Lemma 3, estimation of the bias error

Fix $s, s' \in \mathcal{S}$. We denote $\alpha_T := N_T(s \leftrightarrow s')\,(h^*(s) - h^*(s') - c_T(s, s'))$. We start by considering the better estimator $c'_T(s, s')$, which satisfies the same equation (9) as $c_T(s, s')$ but with $\hat{g}(T)$ replaced by $g^*$, namely:

\[
N_T(s \leftrightarrow s')\, c'_T(s, s') = \sum\nolimits_{i=0}^{N_T(s \leftrightarrow s') - 1} (-1)^{i} \sum\nolimits_{t=\tau_i^{s \leftrightarrow s'}}^{\tau_{i+1}^{s \leftrightarrow s'} - 1} (g^* - R_t).
\]

To avoid typographical clutter, we write $\tau_i$ instead of $\tau_i^{s \leftrightarrow s'}$ in the remainder of the proof, and we write $\alpha'_T := N_T(s \leftrightarrow s')\,(h^*(s) - h^*(s') - c'_T(s, s'))$.
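The alternating sum defining the estimator can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `commute_times` and `commute_estimator` are hypothetical, and the convention for the commute times (first visit to $s$, then first later visit to $s'$, and so on) is an assumption chosen to match the alternation used in the proof. Passing the optimal gain $g^*$ as `gain` gives the analogue of $c'_T(s, s')$, while passing the empirical gain gives the analogue of $c_T(s, s')$.

```python
from typing import List, Optional, Sequence


def commute_times(states: Sequence[int], s: int, s_prime: int) -> List[int]:
    """Alternating hitting times tau_0 < tau_1 < ... of s, s_prime, s, ...

    Assumed convention (hypothetical): tau_0 is the first visit to s,
    tau_1 the first later visit to s_prime, tau_2 the next visit to s, etc.
    """
    taus: List[int] = []
    target = s
    for t, state in enumerate(states):
        if state == target:
            taus.append(t)
            target = s_prime if target == s else s
    return taus


def commute_estimator(
    states: Sequence[int],
    rewards: Sequence[float],
    s: int,
    s_prime: int,
    gain: float,
) -> Optional[float]:
    """Return (1/N) * sum_i (-1)^i * sum_{t=tau_i}^{tau_{i+1}-1} (gain - R_t).

    `states` holds S_0, ..., S_T and `rewards` holds R_0, ..., R_{T-1}.
    Returns None when no commute between s and s_prime has been completed.
    """
    taus = commute_times(states, s, s_prime)
    n_commutes = len(taus) - 1
    if n_commutes <= 0:
        return None
    total = 0.0
    for i in range(n_commutes):
        sign = 1.0 if i % 2 == 0 else -1.0
        total += sign * sum(gain - rewards[t] for t in range(taus[i], taus[i + 1]))
    return total / n_commutes
```

For instance, `commute_estimator([0, 1, 0, 1, 0], [1, 0, 1, 0], 0, 1, 0.5)` evaluates the alternating sum on a deterministic two-state cycle with gain 0.5.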

(STEP 1) We start by relating the two estimators. Intuitively, $\hat{g}(T)$ is a good estimator of $g^*$ when the regret is small. Recall that $\hat{g}(T) := \frac{1}{T} \sum_{t=0}^{T-1} R_t$, hence:

\[
\sum\nolimits_{t=0}^{T-1} \left|\hat{g}(T) - g^*\right| = \left|\sum\nolimits_{t=0}^{T-1} (R_t - g^*)\right| = \left|\operatorname{Reg}(T)\right|.
\]

Therefore,

\[
\left|\alpha_T\right| \le \left|\alpha'_T\right| + \left|\alpha_T - \alpha'_T\right| \le \left|\alpha'_T\right| + \sum\nolimits_{t=0}^{T-1} \left|\hat{g}(T) - g^*\right| \le \left|\alpha'_T\right| + \left|\operatorname{Reg}(T)\right|.
\]
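The middle inequality can be unpacked as follows. This is a short sketch that assumes, as stated at the start of the proof, that $c_T$ and $c'_T$ satisfy the same equation (9) up to replacing $\hat{g}(T)$ by $g^*$, and that the commute times counted by $N_T(s \leftrightarrow s')$ all occur within the first $T$ steps. Under these assumptions the reward terms cancel in the difference:

\[
\alpha_T - \alpha'_T = N_T(s \leftrightarrow s')\,\big(c'_T(s, s') - c_T(s, s')\big) = \sum\nolimits_{i=0}^{N_T(s \leftrightarrow s') - 1} (-1)^{i} \sum\nolimits_{t=\tau_i}^{\tau_{i+1} - 1} \big(g^* - \hat{g}(T)\big).
\]

Since the intervals $\{\tau_i, \ldots, \tau_{i+1} - 1\}$ are disjoint, the triangle inequality gives $\left|\alpha_T - \alpha'_T\right| \le \sum_{t=0}^{T-1} \left|\hat{g}(T) - g^*\right| = T \left|\hat{g}(T) - g^*\right|$, which is the quantity computed in the previous display.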

We are left with upper-bounding $\left|\alpha'_T\right|$.

(STEP 2) If $i$ is even, then $S_{\tau_i} = s$ and $S_{\tau_{i+1}} = s'$; otherwise $S_{\tau_i} = s'$ and $S_{\tau_{i+1}} = s$. In both cases, we have $h^*(S_{\tau_{i+1}}) - h^*(S_{\tau_i}) = (-1)^{i} (h^*(s') - h^*(s))$. Therefore, using Bellman's equation, the quantity $\mathrm{A} := \sum_{t=\tau_i}^{\tau_{i+1}-1} (g^* - R_t)$ satisfies:

\[
\begin{aligned}
\mathrm{A}
&= \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} \left(p(X_t) - e_{S_t}\right) h^* + \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} (r(X_t) - R_t) + \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} \Delta^*(X_t) \\
&= \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} \left(e_{S_{t+1}} - e_{S_t}\right) h^* + \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} \left(p(X_t) - e_{S_{t+1}}\right) h^* + \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} (r(X_t) - R_t) + \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} \Delta^*(X_t) \\
&= (-1)^{i} (h^*(s') - h^*(s)) + \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} \left(p(X_t) - e_{S_{t+1}}\right) h^* + \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} (r(X_t) - R_t) + \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} \Delta^*(X_t).
\end{aligned}
\]
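The chain of equalities for $\mathrm{A}$ is a pathwise algebraic identity: it only uses a telescoping sum and the gap written through the Bellman equation, $\Delta(s, a) = h(s) + g - r(s, a) - p(s, a) h$, which is the form the first equality implicitly relies on. The Python sketch below checks the identity numerically. The two-state, two-action MDP (`P`, `R_MEAN`) and the pair `(g, h)` are illustrative assumptions, not values from the paper; the identity holds for any `(g, h)`, and optimality of $(g^*, h^*)$ is only what makes the gaps non-negative.

```python
import random

# Hypothetical 2-state, 2-action MDP (illustrative numbers, not from the paper).
# P[(s, a)] is the distribution of the next state, R_MEAN[(s, a)] the mean reward.
P = {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6], (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
R_MEAN = {(0, 0): 0.1, (0, 1): 0.0, (1, 0): 0.3, (1, 1): 0.8}


def check_decomposition(steps=200, seed=0, g=0.5, h=(0.0, 1.3)):
    """Return both sides of: sum_t (g - R_t) = telescoped + martingale + noise + gaps."""
    rng = random.Random(seed)
    state = start = 0
    lhs = martingale = noise = gaps = 0.0
    for _ in range(steps):
        x = (state, rng.randrange(2))                      # uniformly random action
        reward = 1.0 if rng.random() < R_MEAN[x] else 0.0  # Bernoulli reward in [0, 1]
        nxt = 0 if rng.random() < P[x][0] else 1           # sample the next state
        ph = P[x][0] * h[0] + P[x][1] * h[1]               # p(X_t) h
        lhs += g - reward                                  # g - R_t
        martingale += ph - h[nxt]                          # (p(X_t) - e_{S_{t+1}}) h
        noise += R_MEAN[x] - reward                        # r(X_t) - R_t
        gaps += h[state] + g - R_MEAN[x] - ph              # Delta(X_t)
        state = nxt
    telescoped = h[state] - h[start]                       # sum_t (e_{S_{t+1}} - e_{S_t}) h
    return lhs, telescoped + martingale + noise + gaps
```

Both returned values agree up to floating-point error for any seed and horizon, which is the content of the decomposition before the sign $(-1)^{i}$ is factored in.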

Multiplying by $(-1)^{i}$ and rearranging, $h^*(s') - h^*(s) + (-1)^{i+1} \sum_{t=\tau_i}^{\tau_{i+1}-1} (g^* - R_t)$ is equal to:

\[
(-1)^{i+1} \left( \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} \left( \left(p(X_t) - e_{S_{t+1}}\right) h^* + r(X_t) - R_t \right) + \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} \Delta^*(X_t) \right).
\]

We proceed by summing over $i$. By the triangle inequality, we obtain:

\[
\left|\alpha'_T\right| \le \left| \sum\nolimits_{i=0}^{N_T(s \leftrightarrow s') - 1} \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} (-1)^{i+1} \left( \left(p(X_t) - e_{S_{t+1}}\right) h^* + r(X_t) - R_t \right) \right| + \sum\nolimits_{i=0}^{N_T(s \leftrightarrow s') - 1} \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} \Delta^*(X_t).
\]

Because all Bellman gaps $\Delta^*$ are non-negative, the second term is upper-bounded by the pseudo-regret $\sum_{t=0}^{T-1} \Delta^*(X_t)$. The first term is a martingale, and the martingale difference sequence $(-1)^{i+1} \big( (p(X_t) - e_{S_{t+1}}) h^* + r(X_t) - R_t \big)$ has span at most $\mathrm{sp}(h^*) + 1$ since rewards are supported in $[0, 1]$. Although the number of terms involved is random, it is upper-bounded by $T$; hence, by the maximal version of Azuma-Hoeffding's inequality (Lemma 32), we have that with probability at least $1 - \delta$, uniformly for $T' \le T$,

\[
\left| \sum\nolimits_{i=0}^{N_{T'}(s \leftrightarrow s') - 1} \sum\nolimits_{t=\tau_i}^{\tau_{i+1}-1} (-1)^{i+1} \left( \left(p(X_t) - e_{S_{t+1}}\right) h^* + r(X_t) - R_t \right) \right| \le (1 + \mathrm{sp}(h^*)) \sqrt{\tfrac{1}{2} T \log\left(\tfrac{2}{\delta}\right)}.
\]
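To get a feel for the size of this deviation term, the right-hand side can be evaluated directly. The helper below is a minimal sketch; the name `azuma_hoeffding_radius` and its argument names are hypothetical, and the function simply evaluates $(1 + \mathrm{sp}(h^*)) \sqrt{\tfrac{1}{2} T \log(2/\delta)}$ for given inputs.

```python
import math


def azuma_hoeffding_radius(span_h: float, horizon: int, delta: float) -> float:
    """Evaluate (1 + sp(h*)) * sqrt(0.5 * T * log(2 / delta))."""
    if span_h < 0 or horizon <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need span_h >= 0, horizon > 0 and 0 < delta < 1")
    return (1.0 + span_h) * math.sqrt(0.5 * horizon * math.log(2.0 / delta))
```

The radius grows linearly in $1 + \mathrm{sp}(h^*)$ and as $\sqrt{T}$ in the horizon, so multiplying the horizon by four doubles it.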

(STEP 3) We conclude that with probability $1 - \delta$, for all $T' \le T$,

\[
\alpha_{T'} \le (1 + \mathrm{sp}(h^*)) \sqrt{\tfrac{1}{2} T \log\left(\tfrac{2}{\delta}\right)} + \sum\nolimits_{t=0}^{T'-1} \Delta^*(X_t) + \left|\operatorname{Reg}(T')\right|.
\]

We are left with relating both $\sum_{t=0}^{T'-1} \Delta^*(X_t)$ and $\left|\operatorname{Reg}(T')\right|$ to $\sum_{t=0}^{T'-1} (\tilde{g} - R_t)$. Using the Bellman equation again, we find that:

\[
\begin{aligned}
\left| \sum\nolimits_{t=0}^{T'-1} (g^* - R_t - \Delta^*(X_t)) \right|
&\le \left| h^*(S_0) - h^*(S_{T'}) \right| + \left| \sum\nolimits_{t=0}^{T'-1} \left( \left(p(X_t) - e_{S_{t+1}}\right) h^* + \left(r(X_t) - R_t\right) \right) \right| \\
&\le \mathrm{sp}(h^*) + (1 + \mathrm{sp}(h^*)) \sqrt{\tfrac{1}{2} T \log\left(\tfrac{2}{\delta}\right)}
\end{aligned}
\]

where the last inequality holds with probability $1-\delta$ uniformly over $T'\leq T$ by Azuma-Hoeffding's inequality again (Lemma 32). Remark that if $y-z\leq x\leq y+z$, then $|x|\leq|y|+|z|$, hence we conclude that with probability $1-\delta$, for all $T'\leq T$:

\[
\begin{aligned}
\sum\nolimits_{t=0}^{T'-1}\Delta^*(X_t) + \left|\operatorname{Reg}(T')\right|
&\leq 2\sum\nolimits_{t=0}^{T'-1}\Delta^*(X_t) + (1+\mathrm{sp}(h^*))\sqrt{\tfrac{1}{2}T\log\left(\tfrac{2}{\delta}\right)} + \mathrm{sp}(h^*) \\
&\leq 2\sum\nolimits_{t=0}^{T'-1}\left(g^*-R_t\right) + 3(1+\mathrm{sp}(h^*))\sqrt{\tfrac{1}{2}T\log\left(\tfrac{2}{\delta}\right)} + 3\,\mathrm{sp}(h^*) \\
&\leq 2\sum\nolimits_{t=0}^{T'-1}\left(\tilde{g}-R_t\right) + 3(1+\mathrm{sp}(h^*))\sqrt{\tfrac{1}{2}T\log\left(\tfrac{2}{\delta}\right)} + 3\,\mathrm{sp}(h^*)
\end{aligned}
\]

where the last inequality invokes $\tilde{g}\geq g^*$. We conclude that, with probability $1-2\delta$, for all $T'\leq T$, we have:

\[
N_{T'}(s\leftrightarrow s')\left(h^*(s)-h^*(s')-c_{T'}(s,s')\right) \leq 3\,\mathrm{sp}(h^*) + (1+\mathrm{sp}(h^*))\sqrt{8T\log\left(\tfrac{2}{\delta}\right)} + 2\sum\nolimits_{t=0}^{T'-1}(\tilde{g}-R_t).
\]

This concludes the proof. ∎

A.2 The confidence region of PMEVI-DT

The algorithm PMEVI-DT can be instantiated in many ways, depending on the type of confidence region one is willing to use for rewards and kernels. In this work, we allow for four types of confidence regions, described below. For conciseness, let $q\in\{r,p\}$ be a symbolic letter that stands for either a reward or a kernel, and denote by $\mathcal{Q}_t(s,a)$ the confidence region for $q(s,a)$ at time $t$. If $q=r$, then $\dim(q)=2$ (Bernoulli rewards) with $\mathcal{Q}(s,a)=[0,1]$; and if $q=p$, then $\dim(q)=S$ with $\mathcal{Q}(s,a)=\mathcal{P}(\mathcal{S})$.

  • (C1)

    Azuma-Hoeffding or Weissman type confidence regions, with $\mathcal{Q}_t(s,a)$ taken as:

    \[
    \left\{\tilde{q}(s,a)\in\mathcal{Q}(s,a) : N_t(s,a)\left\|\hat{q}_t(s,a)-\tilde{q}(s,a)\right\|_1^2 \leq \dim(q)\log\left(\tfrac{2SA(1+N_t(s,a))}{\delta}\right)\right\}.
    \]
  • (C2)

    Empirical Bernstein type confidence regions, with $\mathcal{Q}_t(s,a)$ taken as:

    \[
    \left\{\tilde{q}(s,a)\in\mathcal{Q}(s,a) : \forall i,\ \left|\hat{q}_t(i|s,a)-\tilde{q}(i|s,a)\right| \leq \sqrt{\tfrac{2\mathbf{V}(\hat{q}_t(i|s,a))\log\left(\tfrac{2\dim(q)SAT}{\delta}\right)}{N_t(s,a)}} + \tfrac{3\log\left(\tfrac{2\dim(q)SAT}{\delta}\right)}{N_t(s,a)}\right\},
    \]

    with the convention that $x/0=+\infty$ for $x>0$.

  • (C3)

    Empirical likelihood type confidence regions, with $\mathcal{Q}_t(s,a)$ taken as:

    \[
    \left\{\tilde{q}(s,a)\in\mathcal{Q}(s,a) : N_t(s,a)\operatorname{KL}\left(\hat{q}_t(s,a)\,\|\,\tilde{q}(s,a)\right) \leq \log\left(\tfrac{2SA}{\delta}\right) + (\dim(q)-1)\log\left(e\left(1+\tfrac{N_t(s,a)}{\dim(q)-1}\right)\right)\right\}.
    \]
  • (C4)

    Trivial confidence region with $\mathcal{Q}_t(s,a)=\mathcal{Q}(s,a)$.

A few remarks are in order. When rewards are not Bernoulli, only the confidence regions (C1) and (C4) are eligible among the above. In that case, Weissman's inequality must be replaced by Azuma's inequality for $\sigma$-sub-Gaussian random variables, see Lemma 34. Since rewards are supported in $[0,1]$, Hoeffding's Lemma guarantees that reward distributions are $\sigma$-sub-Gaussian with $\sigma=\tfrac{1}{2}$.
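The membership tests behind (C1), (C2) and (C3) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used for PMEVI-DT: the function names, the list representation of $\hat{q}_t(s,a)$ and $\tilde{q}(s,a)$, the choice $\mathbf{V}(\hat{q}_t(i|s,a))=\hat{q}_t(i|s,a)(1-\hat{q}_t(i|s,a))$ and the decision to accept every candidate when $N_t(s,a)=0$ are all choices made here for illustration.

```python
import math


def c1_contains(q_hat, q_tilde, n, S, A, delta):
    """(C1): n * ||q_hat - q_tilde||_1^2 <= dim(q) * log(2*S*A*(1+n)/delta)."""
    dim = len(q_hat)
    l1 = sum(abs(a - b) for a, b in zip(q_hat, q_tilde))
    return n * l1 ** 2 <= dim * math.log(2 * S * A * (1 + n) / delta)


def c2_contains(q_hat, q_tilde, n, S, A, T, delta):
    """(C2): entrywise empirical Bernstein test."""
    if n == 0:
        return True  # convention x/0 = +inf: every radius is infinite
    dim = len(q_hat)
    log_term = math.log(2 * dim * S * A * T / delta)
    for a, b in zip(q_hat, q_tilde):
        var = a * (1.0 - a)  # assumption: Bernoulli variance of entry i
        radius = math.sqrt(2 * var * log_term / n) + 3 * log_term / n
        if abs(a - b) > radius:
            return False
    return True


def c3_contains(q_hat, q_tilde, n, S, A, delta):
    """(C3): n * KL(q_hat || q_tilde) <= log(2SA/delta) + (dim-1)*log(e*(1+n/(dim-1)))."""
    if n == 0:
        return True  # assumption: no data means no constraint
    dim = len(q_hat)  # requires dim >= 2
    kl = 0.0
    for a, b in zip(q_hat, q_tilde):
        if a > 0.0:
            if b <= 0.0:
                return False  # KL divergence is infinite
            kl += a * math.log(a / b)
    rhs = math.log(2 * S * A / delta)
    rhs += (dim - 1) * math.log(math.e * (1 + n / (dim - 1)))
    return n * kl <= rhs
```

For instance, with `q_hat = [0.5, 0.5]` and many samples, a distant candidate such as `[0.9, 0.1]` is rejected by all three tests, while `q_hat` itself is always accepted.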

A.2.1 Correctness of the model confidence region $\mathcal{M}_t$ and Assumption 1

The confidence regions $\mathcal{Q}_t(s,a)$ described in (C1-4) are tuned so that the following result holds:

Lemma 11.

Assume that, for all $q\in\{r,p\}$ and $(s,a)\in\mathcal{X}$, we choose $\mathcal{Q}_t(s,a)$ among (C1-4). Then Assumption 1 holds. More specifically, the region of models $\mathcal{M}_t:=\prod_{s,a}\left(\mathcal{R}_t(s,a)\times\mathcal{P}_t(s,a)\right)$ satisfies $\mathbf{P}(\exists t\leq T: M\notin\mathcal{M}_t)\leq\delta$.

Proof.

We show that, for all $q\in\{r,p\}$ and $(s,a)\in\mathcal{X}$, if $\mathcal{Q}_t(s,a)$ is chosen among (C1-4), then

\[
\mathbf{P}\left(\exists t\leq T: q(s,a)\notin\mathcal{Q}_t(s,a)\right)\leq\delta.
\]

If $\mathcal{Q}_t(s,a)$ is chosen with (C1), this is a direct application of Lemma 35; with (C2), this is Lemma 36; with (C3), this is Lemma 37; and with (C4) this is by definition. ∎

A.2.2 Simultaneous correctness of bias confidence region $\mathcal{H}_t$, mitigation $\beta_t$ and optimism

In this section, we show that if Assumption 1 holds, then the bias confidence region constructed by PMEVI-DT is correct with high probability, and that the mitigation is not too strong. Recall that $(\mathfrak{g}_k,\mathfrak{h}_k)$ are the optimistic gain and bias of the policy deployed in episode $k$ (see Algorithm 1). In particular, we have $\mathfrak{g}_k=\mathfrak{L}_{t_k}\mathfrak{h}_k-\mathfrak{h}_k$ with $\mathfrak{h}_k\in\mathcal{H}_{t_k}$. We start with a result on the deviation of the variance, on which the variance approximation of Algorithm 5 is based. Recall that the bias confidence region $\mathcal{H}_t$ is obtained as the collection of constraints:

  • (1)

    prior constraints (if any) $\mathfrak{h}(s)-\mathfrak{h}(s')\leq c_*(s,s')$;

  • (2)

    span constraints $\mathfrak{h}(s)-\mathfrak{h}(s')\leq c_0:=T^{1/5}$;

  • (3)

    dynamically inferred constraints $\left|\mathfrak{h}(s)-\mathfrak{h}(s')-c_t(s',s)\right|\leq\mathrm{error}(c_t,s',s)$ (see Algorithm 3).
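The three families of constraints can be checked mechanically for a candidate bias vector. The following is a minimal sketch under stated assumptions: `in_bias_region`, the nested-list layout of `c_t`, `err` and `c_prior`, and the tolerance `tol` are hypothetical names introduced here, and the values $c_t(s',s)$ and $\mathrm{error}(c_t,s',s)$ are taken as given inputs rather than computed as in Algorithm 3.

```python
def in_bias_region(h, c0, c_t, err, c_prior=None, tol=1e-12):
    """Check constraints (1)-(3) for a candidate bias vector h.

    h       : list of bias values, one per state
    c0      : span bound (T ** (1/5) in the text)
    c_t     : c_t[s2][s] plays the role of c_t(s', s)      (hypothetical layout)
    err     : err[s2][s] plays the role of error(c_t, s', s) (hypothetical layout)
    c_prior : optional c_prior[s][s2], upper bound on h(s) - h(s')
    """
    n = len(h)
    for s in range(n):
        for s2 in range(n):
            diff = h[s] - h[s2]
            # (1) prior constraints, if any
            if c_prior is not None and diff > c_prior[s][s2] + tol:
                return False
            # (2) span constraints
            if diff > c0 + tol:
                return False
            # (3) dynamically inferred constraints
            if abs(diff - c_t[s2][s]) > err[s2][s] + tol:
                return False
    return True
```

With two states, `h = [0.0, 1.0]`, exact estimates `c_t = [[0.0, 1.0], [-1.0, 0.0]]` and a uniform error of `0.1`, the vector is accepted for `c0 = 2.0` and rejected for `c0 = 0.5`, since its span is 1.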

We have the following result.

Lemma 12.

Let $u,v\in\mathcal{H}_t$ and fix $p$ a probability distribution on $\mathcal{S}$. Then for all $s\in\mathcal{S}$,

\[
\mathbf{V}(p,u)\leq\mathbf{V}(p,v)+8c_0\sum\nolimits_{s'\in\mathcal{S}}p(s')\,\mathrm{error}(c_t,s',s).
\]
Proof.

We start by establishing the following result: If $p$ is a probability distribution on $\mathcal{S}$ and $u,v\in\mathbf{R}^{\mathcal{S}}$, we have:

\[
\mathbf{V}(p,u)\leq\mathbf{V}(p,v)+2\left(p\cdot\left|u-v\right|\right)\max(u+v) \tag{14}
\]

where $\cdot$ is the dot product, $u^2$ the Hadamard product $uu$ and $\left|u\right|$ the vector whose entry $s$ is $\left|u(s)\right|$. (14) is obtained with a straightforward computation:

\[
\begin{aligned}
\mathbf{V}(p,u)-\mathbf{V}(p,v)
&= p\cdot(u^2-v^2)+(p\cdot v)^2-(p\cdot u)^2 \\
&= p\cdot((u-v)(u+v))-(p\cdot(u-v))(p\cdot(u+v)) \\
&\leq p\cdot(\left|u-v\right|(u+v))+(p\cdot\left|u-v\right|)(p\cdot\left|u+v\right|) \\
&\leq 2(p\cdot\left|u-v\right|)\max(u+v).
\end{aligned}
\]

Observe that $v$ can be changed to $v+\lambda e$, where $e$ is the vector full of ones, without changing the result. The same goes for $u$. We now move to the proof of the main statement. First, translate $u$ and $v$ such that $u(s)=v(s)=0$. Then, we have:

\[
\begin{aligned}
p\cdot\left|u-v\right|
&= \sum\nolimits_{s'\in\mathcal{S}}p(s')\left|u(s')-u(s)-c_t(s',s)+v(s)-v(s')+c_t(s',s)\right| \\
&\leq \sum\nolimits_{s'\in\mathcal{S}}p(s')\left(\left|u(s')-u(s)-c_t(s',s)\right|+\left|v(s')-v(s)-c_t(s',s)\right|\right) \\
&\leq 2\sum\nolimits_{s'\in\mathcal{S}}p(s')\,\mathrm{error}(c_t,s',s).
\end{aligned}
\]

Conclude using that $\max(u+v)\leq\max(u)+\max(v)\leq 2c_0$ for $u,v\in\mathcal{H}_t$ such that $u(s)=v(s)=0$. ∎
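The variance comparison used in this proof can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not part of the proof: the function names are hypothetical, $\mathbf{V}(p,u)$ is taken to be $p\cdot u^2-(p\cdot u)^2$ as in the computation above, and the right-hand side uses $\max_s\left|u(s)+v(s)\right|$, a more conservative quantity chosen here so that the check is valid for vectors of arbitrary sign.

```python
import random


def variance(p, u):
    """V(p, u) = p . u^2 - (p . u)^2."""
    mean = sum(pi * ui for pi, ui in zip(p, u))
    return sum(pi * ui * ui for pi, ui in zip(p, u)) - mean * mean


def variance_gap_bound(p, u, v):
    """2 (p . |u - v|) max_s |u(s) + v(s)| (sup-norm variant, an assumption)."""
    l1 = sum(pi * abs(ui - vi) for pi, ui, vi in zip(p, u, v))
    return 2.0 * l1 * max(abs(ui + vi) for ui, vi in zip(u, v))


def check_random_instances(trials=1000, n_states=5, seed=0):
    """Return True if V(p,u) <= V(p,v) + bound on every random draw."""
    rng = random.Random(seed)
    for _ in range(trials):
        w = [rng.random() + 1e-9 for _ in range(n_states)]
        total = sum(w)
        p = [x / total for x in w]  # random probability vector
        u = [rng.uniform(-3.0, 3.0) for _ in range(n_states)]
        v = [rng.uniform(-3.0, 3.0) for _ in range(n_states)]
        lhs = variance(p, u)
        rhs = variance(p, v) + variance_gap_bound(p, u, v)
        if lhs > rhs + 1e-9:  # small slack for floating-point error
            return False
    return True
```

Calling `check_random_instances()` exercises the inequality on random triples $(p,u,v)$; `variance` is also invariant under the translations $u\mapsto u+\lambda e$ used in the proof, up to floating-point error.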

Lemma 13.

Assume that Assumption 1 holds and that $c_0\geq\mathrm{sp}(h^*)$. Then, with probability $1-4\delta$, for all $k\leq K(T)$, (1) $\mathfrak{g}_k\geq g^*$, (2) $h^*\in\mathcal{H}_{t_k}$, and (3) for all $(s,a)$, $(\hat{p}_{t_k}(s,a)-p(s,a))h^*\leq\beta_{t_k}(s,a)$.

Proof.

Let $E_1$ be the event $(\forall k\leq K(T),\ M\in\mathcal{M}_{t_k})$. Let $E_2$ be the event stating that, for all $T'\leq T$,

\[
N_{T'}(s\leftrightarrow s')\left|h^*(s)-h^*(s')-c_{T'}(s,s')\right| \leq 3\,\mathrm{sp}(h^*)+(1+\mathrm{sp}(h^*))\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2\sum\nolimits_{t=0}^{T'-1}(\tilde{g}-R_t),
\]

and let $E_3$ be the event stating that, for all $T'\leq T$ and for all $(s,a)\in\mathcal{X}$, we have:

\[
\left(\hat{p}_{T'}(s,a)-p(s,a)\right)h^* \leq \sqrt{\tfrac{2\mathbf{V}(\hat{p}_{T'}(s,a),h^*)\log\left(\frac{SAT}{\delta}\right)}{N_{T'}(s,a)}}+\tfrac{3\,\mathrm{sp}(h^*)\log\left(\frac{SAT}{\delta}\right)}{N_{T'}(s,a)}.
\]
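For concreteness, the right-hand side of the inequality defining $E_3$ can be evaluated as follows. This is a minimal sketch under stated assumptions: `bernstein_deviation` and `span` are hypothetical names, $\mathbf{V}(p,h)$ is taken to be $p\cdot h^2-(p\cdot h)^2$, and the convention $x/0=+\infty$ is applied when the visit count is zero.

```python
import math


def span(h):
    """sp(h) = max(h) - min(h)."""
    return max(h) - min(h)


def bernstein_deviation(p_hat, h, n, S, A, T, delta):
    """sqrt(2 V(p_hat, h) log(SAT/delta) / n) + 3 sp(h) log(SAT/delta) / n."""
    if n == 0:
        return math.inf  # convention x/0 = +inf
    mean = sum(pi * hi for pi, hi in zip(p_hat, h))
    second = sum(pi * hi * hi for pi, hi in zip(p_hat, h))
    var = max(second - mean * mean, 0.0)  # clip tiny negative rounding errors
    log_term = math.log(S * A * T / delta)
    return math.sqrt(2.0 * var * log_term / n) + 3.0 * span(h) * log_term / n
```

When `p_hat` is a point mass the variance term vanishes and only the second term, of order $\mathrm{sp}(h)/n$, remains; the first term decays as $1/\sqrt{n}$ otherwise.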

By Lemma 3, we have 𝐏(E2)12δ𝐏subscript𝐸212𝛿\mathbf{P}(E_{2})\geq 1-2\deltabold_P ( italic_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ≥ 1 - 2 italic_δ and by Lemma 36, we have 𝐏(E3)1δ𝐏subscript𝐸31𝛿\mathbf{P}(E_{3})\geq 1-\deltabold_P ( italic_E start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) ≥ 1 - italic_δ, so 𝐏(E1E2E3)14δ𝐏subscript𝐸1subscript𝐸2subscript𝐸314𝛿\mathbf{P}(E_{1}\cap E_{2}\cap E_{3})\geq 1-4\deltabold_P ( italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∩ italic_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∩ italic_E start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) ≥ 1 - 4 italic_δ. We prove by induction on kK(T)𝑘𝐾𝑇k\leq K(T)italic_k ≤ italic_K ( italic_T ) that, on E1E2subscript𝐸1subscript𝐸2E_{1}\cap E_{2}italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∩ italic_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, (1) 𝔤kgsubscript𝔤𝑘superscript𝑔\mathfrak{g}_{k}\geq g^{*}fraktur_g start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ≥ italic_g start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, (2) htksuperscriptsubscriptsubscript𝑡𝑘h^{*}\in\mathcal{H}_{t_{k}}italic_h start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∈ caligraphic_H start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT (3) and for all (s,a)𝑠𝑎(s,a)( italic_s , italic_a ), (p^tk(s,a)p(s,a))hβtk(s,a)subscript^𝑝subscript𝑡𝑘𝑠𝑎𝑝𝑠𝑎superscriptsubscript𝛽subscript𝑡𝑘𝑠𝑎(\hat{p}_{t_{k}}(s,a)-p(s,a))h^{*}\leq\beta_{t_{k}}(s,a)( over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_s , italic_a ) - italic_p ( italic_s , italic_a ) ) italic_h start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ≤ italic_β start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_s , italic_a ), where 𝔤ksubscript𝔤𝑘\mathfrak{g}_{k}fraktur_g start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is the optimistic gain of the policy deployed at episode k𝑘kitalic_k. For k=0𝑘0k=0italic_k = 0, this is obvious. 
Indeed, N0(ss)=0N_{0}(s\leftrightarrow s^{\prime})=0italic_N start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_s ↔ italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = 0 for all s,s𝑠superscript𝑠s,s^{\prime}italic_s , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT hence c0(s,s)=c0sp(h)subscript𝑐0𝑠superscript𝑠subscript𝑐0spsuperscriptc_{0}(s,s^{\prime})=c_{0}\geq{\mathrm{sp}\left(h^{*}\right)}italic_c start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ( italic_s , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_c start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ≥ roman_sp ( italic_h start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ). Therefore,

0{𝔥𝐑𝒮:sp(𝔥)c0}{𝔥𝐑𝒮:sp(𝔥)sp(h)}superset-of-or-equalssubscript0conditional-set𝔥superscript𝐑𝒮sp𝔥subscript𝑐0superset-of-or-equalsconditional-set𝔥superscript𝐑𝒮sp𝔥spsuperscript\mathcal{H}_{0}\supseteq\left\{\mathfrak{h}\in\mathbf{R}^{\mathcal{S}}:{% \mathrm{sp}\left(\mathfrak{h}\right)}\leq c_{0}\right\}\supseteq\left\{% \mathfrak{h}\in\mathbf{R}^{\mathcal{S}}:{\mathrm{sp}\left(\mathfrak{h}\right)}% \leq{\mathrm{sp}\left(h^{*}\right)}\right\}caligraphic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ⊇ { fraktur_h ∈ bold_R start_POSTSUPERSCRIPT caligraphic_S end_POSTSUPERSCRIPT : roman_sp ( fraktur_h ) ≤ italic_c start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT } ⊇ { fraktur_h ∈ bold_R start_POSTSUPERSCRIPT caligraphic_S end_POSTSUPERSCRIPT : roman_sp ( fraktur_h ) ≤ roman_sp ( italic_h start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) }

so $\mathcal{H}_0$ contains $h^*$, proving (2). Moreover, since $N_0(s,a) = 0$, we have $\beta_0(s,a) = +\infty$, proving (3). Finally, since $M \in \mathcal{M}_0$ on $E_1$, by statement (2) of Proposition 2, we have $\mathfrak{g}_k \geq g^*$, proving (1).
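The base case only uses the span semi-norm $\mathrm{sp}(\mathfrak{h}) = \max \mathfrak{h} - \min \mathfrak{h}$. As a minimal illustrative sketch, membership in the span ball $\{\mathfrak{h} : \mathrm{sp}(\mathfrak{h}) \leq c_0\}$ can be checked as below. The helper names `span` and `in_span_ball`, as well as the numerical values, are hypothetical and not part of the paper's algorithms.

```python
def span(h):
    """Span semi-norm sp(h) = max(h) - min(h) of a vector given as a list."""
    return max(h) - min(h)


def in_span_ball(h, c0):
    """True when sp(h) <= c0, i.e. h lies in the ball {h : sp(h) <= c0}."""
    return span(h) <= c0


# Hypothetical 3-state bias vector and prior bound c0 >= sp(h*).
h_star = [0.0, 1.5, 0.5]
c0 = 2.0
```

Any vector whose span is at most $c_0$ passes the check, which is the only property of $h^*$ used for statement (2) at $k = 0$.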

Now assume that $k \geq 1$. By induction, $\mathfrak{g}_\ell \geq g^*$ for all $\ell < k$, so on $E_2$ we have:

\[
N_{t_k}(s \leftrightarrow s') \left| h^*(s) - h^*(s') - c_{t_k}(s, s') \right| \leq 3\,\mathrm{sp}(h^*) + (1 + \mathrm{sp}(h^*)) \sqrt{8 T \log\left(\tfrac{2}{\delta}\right)} + 2 \sum\nolimits_{\ell=1}^{k-1} \sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1} (\mathfrak{g}_\ell - R_t).
\]

By design of $\mathcal{H}_{t_k}$ (see Algorithm 3), we deduce that (2) $h^* \in \mathcal{H}_{t_k}$. Denote by $h_0 \in \mathcal{H}_{t_k}$ the reference point used by Algorithm 5. For all $(s,a) \in \mathcal{X}$, on $E_1 \cap E_2 \cap E_3$, we have:

\begin{align*}
\left(\hat{p}_{t_k}(s,a) - p(s,a)\right) h^*
&\leq \sqrt{\tfrac{2 \mathbf{V}(\hat{p}_{t_k}(s,a), h^*) \log\left(\frac{SAT}{\delta}\right)}{N_{t_k}(s,a)}} + \tfrac{3\,\mathrm{sp}(h^*) \log\left(\frac{SAT}{\delta}\right)}{N_{t_k}(s,a)} \\
&\leq \sqrt{\tfrac{2 \left( \mathbf{V}(\hat{p}_{t_k}(s,a), h_0) \log\left(\frac{SAT}{\delta}\right) + 8 c_0 \sum_{s' \in \mathcal{S}} \hat{p}_{t_k}(s'|s,a)\, \text{error}(c_{t_k}, s', s) \right) \log\left(\frac{SAT}{\delta}\right)}{N_{t_k}(s,a)}} + \tfrac{3 c_0 \log\left(\frac{SAT}{\delta}\right)}{N_{t_k}(s,a)} && \text{($h^* \in \mathcal{H}_{t_k}$ + Lemma 12)} \\
&=: \beta_{t_k}(s,a)
\end{align*}

by construction of Algorithm 5. Accordingly, (3) is satisfied. Finally, $M \in \mathcal{M}_{t_k}$ on $E_1$, so by Proposition 2, we have (1) $\mathfrak{g}_k \geq g^*$. ∎
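To make the first inequality of the last display concrete, the following sketch evaluates a Bernstein-type radius of the form $\sqrt{2 \mathbf{V}(\hat{p}, h) \log(SAT/\delta) / N} + 3\,\mathrm{sp}(h) \log(SAT/\delta) / N$. It assumes that $\mathbf{V}(p, h)$ denotes the variance of $h$ under $p$ (its definition lies outside this section), and it does not reproduce the computable quantity $\beta_{t_k}$ of Algorithm 5, which additionally involves $h_0$ and the error terms. The function names are hypothetical.

```python
import math


def variance(p, h):
    """Assumed meaning of V(p, h): variance of h under the distribution p."""
    mean = sum(pi * hi for pi, hi in zip(p, h))
    return sum(pi * (hi - mean) ** 2 for pi, hi in zip(p, h))


def bernstein_radius(p_hat, h, n_visits, S, A, T, delta):
    """sqrt(2 V(p_hat, h) log(SAT/delta) / N) + 3 sp(h) log(SAT/delta) / N."""
    if n_visits == 0:
        # Unvisited pair: the radius is infinite, as for beta_0 in the base case.
        return math.inf
    log_term = math.log(S * A * T / delta)
    sp_h = max(h) - min(h)
    return (math.sqrt(2 * variance(p_hat, h) * log_term / n_visits)
            + 3 * sp_h * log_term / n_visits)
```

The radius is infinite for an unvisited pair and shrinks as the visit count grows, which is the behaviour the induction relies on.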

Corollary 14.

Assume that, for all $q \in \{r, p\}$ and $(s,a) \in \mathcal{X}$, we choose $\mathcal{Q}_t(s,a)$ among (C1-4). Then, with probability $1 - 3\delta$, for all $k \leq K(T)$, we have (1) $\mathfrak{g}_k \geq g^*$, (2) $h^* \in \mathcal{H}_{t_k}$, and (3) for all $(s,a)$, $(\hat{p}_{t_k}(s,a) - p(s,a)) h^* \leq \beta_{t_k}(s,a)$.

Proof.

By Lemma 11, Assumption 1 is satisfied. Apply Lemma 13. ∎

A.2.3 Sub-Weissman reward confidence region and Assumption 2

Although the kernel confidence region can even be chosen to be trivial with (C4), PMEVI-DT needs the reward confidence region to be sub-Weissman in the following sense in order to work:

Assumption 2.

There exists a constant $C > 0$ such that for all $(s,a) \in \mathcal{X}$, for all $t \leq T$, we have:

\[
\mathcal{R}_t(s,a) \subseteq \left\{ \tilde{r}(s,a) \in \mathcal{R}(s,a) : N_t(s,a) \left\| \hat{r}_t(s,a) - \tilde{r}(s,a) \right\|_1^2 \leq C \log\left( \tfrac{2 S A (1 + N_t(s,a))}{\delta} \right) \right\}.
\]

This is indeed the case if $\mathcal{R}_t(s,a)$ is chosen among (C1-3).
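As a small sketch of what Assumption 2 asks for, the function below tests whether a candidate mean reward satisfies the sub-Weissman inequality. It assumes scalar mean rewards, so that $\|\cdot\|_1$ reduces to an absolute value; the name `in_sub_weissman_region` and the numerical values used to exercise it are hypothetical.

```python
import math


def in_sub_weissman_region(r_hat, r_tilde, n_visits, C, S, A, delta):
    """Check N * |r_hat - r_tilde|^2 <= C * log(2 S A (1 + N) / delta).

    Assumes scalar mean rewards, so the 1-norm is an absolute value.
    """
    lhs = n_visits * abs(r_hat - r_tilde) ** 2
    rhs = C * math.log(2 * S * A * (1 + n_visits) / delta)
    return lhs <= rhs
```

With $N_t(s,a) = 0$ the left-hand side vanishes, so every candidate reward is accepted; as visits accumulate, candidates must lie within a distance of order $\sqrt{\log(N_t(s,a)) / N_t(s,a)}$ of the empirical mean.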

A.3 Convergence of EVI and Assumption 3

We start with a preliminary lemma on the speed of convergence of EVI. Lemma 15 is intended to be applied to extended MDPs. Below, when we say that the action space is compact, we further assume that $a \in \mathcal{A}(s) \mapsto p(s,a)$ is a continuous map, so that the Bellman operator is continuous and $g^*$ and $h^*$ are well-defined, see Puterman [1994].

Lemma 15.

Let $M$ be a weakly-communicating MDP with finite state space $\mathcal{S}$ and compact action space, and let $L$ be its Bellman operator. Assume that there exists $\gamma > 0$ such that, $\forall u \in \mathbf{R}^{\mathcal{S}}$,

\[
\forall s \in \mathcal{S}, \exists a \in \mathcal{A}(s), \quad L u(s) = r(s,a) + p(s,a) u = r(s,a) + \gamma \max(u) + (1 - \gamma) q_s^u u \tag{*}
\]

with $q_s^u \in \mathcal{P}(\mathcal{S})$. Then, for all $u \in \mathbf{R}^{\mathcal{S}}$ and all $\epsilon > 0$, if $\mathrm{sp}(L^{n+1} u - L^n u) \geq \epsilon$, then, writing $w_0 := u - h^*$:

\[
n \leq 2 + \frac{4\,\mathrm{sp}(w_0)}{\gamma \epsilon} + \frac{2}{\gamma} \log\left( \frac{2\,\mathrm{sp}(w_0)}{\epsilon} \right).
\]
Proof.

Since $M$ is weakly communicating, has finitely many states and compact action space, it has well-defined gain $g^*$ and bias $h^*$ functions. Denote $u_n := L^n u$ and define:

\begin{align*}
w_n &:= \max_{\pi \in \Pi} \left\{ r_\pi + P_\pi u_{n-1} \right\} - n g^* - h^* \\
&= \max_{\pi \in \Pi} \left\{ r_\pi - g^* + (P_\pi - I) h^* + P_\pi \left( u_{n-1} - h^* - (n-1) g^* \right) \right\} =: \max_{\pi \in \Pi} \left\{ r'_\pi + P_\pi w_{n-1} \right\}.
\end{align*}

Observe that the policy achieving the maximum is the one achieving $u_n = r_\pi + P_\pi u_{n-1}$. Remark that $r'_\pi(s) = -\Delta^*(s, \pi(s)) \leq 0$, where $\Delta^*(s, \pi(s))$ is the Bellman gap of the pair $(s, \pi(s))$, which we write more simply as $\Delta_\pi$. For all $n$, there exists $\pi_n \in \Pi$ such that $w_{n+1} = -\Delta_{\pi_n} + P_{\pi_n} w_n$. Moreover, by assumption, we have $P_{\pi_n} = \gamma \cdot e\, e_{s_n}^\top + (1 - \gamma) Q_n$ for some state $s_n$, where $Q_n$ is a stochastic matrix. Consequently,

\[
\left( \min(-\Delta_{\pi_n}) + \gamma w_n(s_n) \right) e + (1 - \gamma) Q_n w_n \leq w_{n+1} \leq \left( \max(-\Delta_{\pi_n}) + \gamma w_n(s_n) \right) e + (1 - \gamma) Q_n w_n.
\]

Hence, $\mathrm{sp}(w_{n+1}) \leq (1 - \gamma)\,\mathrm{sp}(w_n) + \mathrm{sp}(\Delta_{\pi_n})$. In addition, $w_n = L^n u - L^n h^*$, so by non-expansiveness of $L$ in span semi-norm, $\mathrm{sp}(w_{n+1}) \leq \mathrm{sp}(w_n)$. Overall,

\[
\mathrm{sp}(w_{n+1}) \leq \min\left( (1 - \gamma)\,\mathrm{sp}(w_n) + \mathrm{sp}(\Delta_{\pi_n}),\ \mathrm{sp}(w_n) \right). \tag{15}
\]
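The factor $(1 - \gamma)$ in (15) comes from the common mass $\gamma$ that every row of $P_{\pi_n}$ places on a single state. The sketch below isolates that step only: it builds a kernel of the form $\gamma\, e\, e_{s}^\top + (1 - \gamma) Q$ from a hypothetical stochastic matrix `Q` and checks numerically that applying it shrinks the span by at least $(1 - \gamma)$. The additive term $\mathrm{sp}(\Delta_{\pi_n})$ is not modelled, and all names and values are illustrative assumptions.

```python
def span(v):
    """Span semi-norm sp(v) = max(v) - min(v)."""
    return max(v) - min(v)


def mix_kernel(Q, s_star, gamma):
    """Rows of P = gamma * e e_{s_star}^T + (1 - gamma) * Q."""
    n = len(Q)
    return [[gamma * (1.0 if t == s_star else 0.0) + (1 - gamma) * Q[s][t]
             for t in range(n)] for s in range(n)]


def apply(P, w):
    """Matrix-vector product P w for a row-stochastic matrix P."""
    return [sum(P[s][t] * w[t] for t in range(len(w))) for s in range(len(w))]


# Hypothetical stochastic matrix, vector and mixing weight.
Q = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]
w = [3.0, -1.0, 2.0]
gamma = 0.25
P = mix_kernel(Q, s_star=0, gamma=gamma)
Pw = apply(P, w)
```

Since $P w = \gamma w(s) e + (1 - \gamma) Q w$, the constant part does not contribute to the span and $\mathrm{sp}(P w) = (1 - \gamma)\,\mathrm{sp}(Q w) \leq (1 - \gamma)\,\mathrm{sp}(w)$.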

Fix $\epsilon > 0$, and let $n_\epsilon := \inf\left\{ n : \mathrm{sp}(w_n) < \epsilon \right\}$.

Let $\pi^*$ be an optimal policy. We have $w_{n+1} \geq P_{\pi^*} w_n$, so by induction, $w_{n+1} \geq P_{\pi^*}^{n+1} w_0 \geq \min(w_0) e$. Meanwhile, we see that $\left\| w_n \right\|_1 \geq \sum_{k=0}^{n-1} \left\| \Delta_{\pi_k} \right\|_1 + S \min(w_0)$, so $\sum_{k=0}^{n-1} \left\| \Delta_{\pi_k} \right\|_1 \leq \mathrm{sp}(w_0)$. Since $\Delta_{\pi_k} \geq 0$ for all $k$, we have $\mathrm{sp}(\Delta_{\pi_k}) \leq \left\| \Delta_{\pi_k} \right\|_1$, so $\sum_{k=0}^{n-1} \mathrm{sp}(\Delta_{\pi_k}) \leq \mathrm{sp}(w_0)$.

By (15), either $\mathrm{sp}(w_{n+1}) \leq (1 - \tfrac{1}{2}\gamma) \max(\epsilon, \mathrm{sp}(w_n))$ or $\mathrm{sp}(\Delta_{\pi_n}) \geq \tfrac{1}{2}\gamma\epsilon$, but because $\sum_{k=0}^{+\infty} \mathrm{sp}(\Delta_{\pi_k}) \leq \mathrm{sp}(w_0)$, the second case can happen at most $\tfrac{2\,\mathrm{sp}(w_0)}{\gamma\epsilon}$ times. We deduce that, for all $n \leq n_\epsilon$,

\[
\mathrm{sp}(w_{n+1}) \leq \left( 1 - \tfrac{1}{2}\gamma \right)^{n - \frac{2\,\mathrm{sp}(w_0)}{\gamma\epsilon}} \mathrm{sp}(w_0).
\]

In particular, for $n = n_\epsilon - 2$, since $\mathrm{sp}(w_{n_\epsilon - 1}) \geq \epsilon$ by definition of $n_\epsilon$, we get:

\[
\epsilon \leq \left( 1 - \tfrac{1}{2}\gamma \right)^{n_\epsilon - 2 - \frac{2\,\mathrm{sp}(w_0)}{\gamma\epsilon}} \mathrm{sp}(w_0).
\]

We obtain:

\[
n_\epsilon \leq 2 + \frac{2\,\mathrm{sp}(w_0)}{\gamma\epsilon} + \frac{2}{\gamma} \log\left( \frac{\mathrm{sp}(w_0)}{\epsilon} \right).
\]
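The counting argument can be sanity-checked numerically by iterating recursion (15) with equality. In the sketch below, `gaps` is a hypothetical sequence standing in for the values $\mathrm{sp}(\Delta_{\pi_k})$, chosen so that its sum does not exceed $\mathrm{sp}(w_0)$; the function names and the numbers are assumptions made for illustration, and the case $\mathrm{sp}(w_0) \geq \epsilon$ is assumed.

```python
import math


def n_eps_bound(sp_w0, gamma, eps):
    """Right-hand side 2 + 2 sp(w_0)/(gamma eps) + (2/gamma) log(sp(w_0)/eps)."""
    return 2 + 2 * sp_w0 / (gamma * eps) + (2 / gamma) * math.log(sp_w0 / eps)


def first_hit(sp_w0, gamma, eps, gaps):
    """Smallest n with x_n < eps, where x_0 = sp_w0 and
    x_{n+1} = min((1 - gamma) * x_n + gaps[n], x_n); gaps are 0 once exhausted."""
    x, n = sp_w0, 0
    while x >= eps:
        d = gaps[n] if n < len(gaps) else 0.0
        x = min((1 - gamma) * x + d, x)
        n += 1
    return n


sp_w0, gamma, eps = 10.0, 0.2, 0.1
gaps = [0.5] * 20  # hypothetical values of sp(Delta_{pi_k}); they sum to sp_w0
n_eps = first_hit(sp_w0, gamma, eps, gaps)
```

On this instance the simulated hitting time stays well below the bound, which is expected since the bound budgets for up to $\tfrac{2\,\mathrm{sp}(w_0)}{\gamma\epsilon}$ non-contracting steps.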

To conclude, check that $\mathrm{sp}(L^{n+1} u - L^n u) = \mathrm{sp}(w_{n+1} - w_n) \leq 2\,\mathrm{sp}(w_n)$. Hence, if $\mathrm{sp}(L^{n+1} u - L^n u) \geq \epsilon$, then $\mathrm{sp}(w_n) \geq \tfrac{1}{2}\epsilon$, so $n < n_{\epsilon/2}$, and the claim follows from the bound on $n_\epsilon$ applied with $\tfrac{1}{2}\epsilon$ in place of $\epsilon$. ∎
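The statement of Lemma 15 can be illustrated on a toy finite MDP built so that (*) holds by construction: in each state $s$ there is one action per target state $a \in \mathcal{S}$, with kernel $\gamma e_a + (1 - \gamma) q_s$ and reward $r(s)$, so that $L u(s) = r(s) + \gamma \max(u) + (1 - \gamma) q_s u$. This construction, the numerical values, and the helper names are assumptions made for illustration; in particular $\mathrm{sp}(w_0) = \mathrm{sp}(u - h^*)$ with $u = 0$ is only approximated numerically by the span of a late iterate.

```python
import math


def span(v):
    return max(v) - min(v)


def bellman(u, r, q, gamma):
    """Toy operator Lu(s) = r(s) + gamma * max(u) + (1 - gamma) * q_s u."""
    top, n = max(u), len(u)
    return [r[s] + gamma * top
            + (1 - gamma) * sum(q[s][t] * u[t] for t in range(n))
            for s in range(n)]


def first_n_below(u, r, q, gamma, eps, max_iter=10_000):
    """Smallest n with sp(L^{n+1} u - L^n u) < eps."""
    for n in range(max_iter):
        nxt = bellman(u, r, q, gamma)
        if span([a - b for a, b in zip(nxt, u)]) < eps:
            return n
        u = nxt
    raise RuntimeError("no convergence within max_iter")


def lemma15_bound(sp_w0, gamma, eps):
    """2 + 4 sp(w_0)/(gamma eps) + (2/gamma) log(2 sp(w_0)/eps)."""
    return (2 + 4 * sp_w0 / (gamma * eps)
            + (2 / gamma) * math.log(2 * sp_w0 / eps))


r = [0.0, 1.0, 0.5]
q = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
gamma, eps = 0.1, 1e-3

# Numerical proxy for sp(w_0) = sp(h*) when u = 0: span of a late iterate.
u = [0.0, 0.0, 0.0]
for _ in range(500):
    u = bellman(u, r, q, gamma)
sp_w0 = span(u)

n_stop = first_n_below([0.0, 0.0, 0.0], r, q, gamma, eps)
```

Here `n_stop - 1` is the last index at which the span of the increment is still at least $\epsilon$, so it is the quantity that Lemma 15 bounds. On this toy instance the iteration stops far earlier than the bound, because the fixed kernels $q_s$ contribute additional contraction.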

Before moving to the application of interest, remark that this result can be greatly improved if the smallest positive Bellman gap $\inf\left\{ \Delta^*(s,a) : \Delta^*(s,a) > 0 \right\}$ is not zero: the dominant term $\frac{4\,\mathrm{sp}(w_0)}{\gamma\epsilon}$ can then be replaced by a constant independent of $\epsilon$.

Corollary 16.

Assume that $\mathcal{M}_t$ has non-empty interior, and that its Bellman operator satisfies the requirement of Lemma 15, i.e., there exists $\gamma > 0$ such that, $\forall u \in \mathbf{R}^{\mathcal{S}}, \forall s \in \mathcal{S}, \exists a \in \mathcal{A}(s), \exists \tilde{r}_t(s,a) \in \mathcal{R}_t(s,a), \exists \tilde{p}_t(s,a) \in \mathcal{P}_t(s,a)$:

\[
\mathcal{L}_t u(s) = \tilde{r}_t(s,a) + \tilde{p}_t(s,a) u = \tilde{r}_t(s,a) + \gamma \max(u) + (1 - \gamma) q_s^u u
\]

for some qsu𝒫(𝒮)superscriptsubscript𝑞𝑠𝑢𝒫𝒮q_{s}^{u}\in\mathcal{P}(\mathcal{S})italic_q start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_u end_POSTSUPERSCRIPT ∈ caligraphic_P ( caligraphic_S ). Then 3 is satisfied, and span fix-points h~tsubscript~𝑡\tilde{h}_{t}over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT of tsubscript𝑡\mathcal{L}_{t}caligraphic_L start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT are such that g(t)=th~th~tsuperscript𝑔subscript𝑡subscript𝑡subscript~𝑡subscript~𝑡g^{*}(\mathcal{M}_{t})=\mathcal{L}_{t}\tilde{h}_{t}-\tilde{h}_{t}italic_g start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( caligraphic_M start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = caligraphic_L start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT.

Proof.

If $\mathcal{M}_t$ has non-empty interior, then for all $(s,a)$, $\mathcal{P}_t(s,a)$ has non-empty interior. Therefore, for every state-action pair, there exists $\tilde{p}_t(s,a)\in\mathcal{P}_t(s,a)$ that is fully supported. It follows that $\mathcal{M}_t$ is communicating, and standard results (Puterman [1994]) then guarantee that its span fix-points $\tilde{h}_t$ exist and that $\tilde{g}_t:=\mathcal{L}_t\tilde{h}_t-\tilde{h}_t\in\mathbf{R}e$ does not depend on the initial state.

Moreover, if $\widetilde{M}\in\mathcal{M}_t$ and $\pi\in\Pi$ with $\tilde{g}_\pi\equiv g(\pi,\mathcal{M}_t)\in\mathbf{R}e$, letting $\tilde{r}_\pi:=r_\pi(\widetilde{M})$ and $\tilde{P}_\pi:=P_\pi(\widetilde{M})$, we have:

$$\tilde{r}_\pi+\tilde{P}_\pi\tilde{h}_t\leq\mathcal{L}_t\tilde{h}_t\leq\tilde{g}_t e+\tilde{h}_t.$$

So by induction and since $\mathcal{L}_t$ is obviously monotone and linear, we show that:

$$\sum_{k=0}^{n-1}\tilde{P}_\pi^k\tilde{r}_\pi\leq n\tilde{g}_t e+(I-\tilde{P}_\pi^n)\tilde{h}_t.$$

Dividing by $n$ and letting $n$ go to infinity, we obtain $g(\pi,\mathcal{M}_t)\leq\tilde{g}_t$. Observe that equality is attained by taking the policy achieving $(\tilde{g}_t,\tilde{h}_t)$.
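One way to spell out the induction is sketched below. It only uses that $\tilde{P}_\pi$ is a stochastic matrix, hence monotone with $\tilde{P}_\pi e=e$. Rearranging the inequality $\tilde{r}_\pi+\tilde{P}_\pi\tilde{h}_t\leq\tilde{g}_t e+\tilde{h}_t$ and applying $\tilde{P}_\pi^k$ to both sides gives, for every $k\geq 0$,

$$\tilde{P}_\pi^k\tilde{r}_\pi\leq\tilde{g}_t e+\left(\tilde{P}_\pi^k-\tilde{P}_\pi^{k+1}\right)\tilde{h}_t.$$

Summing over $k=0,\dots,n-1$, the right-hand side telescopes:

$$\sum_{k=0}^{n-1}\tilde{P}_\pi^k\tilde{r}_\pi\leq n\tilde{g}_t e+\left(I-\tilde{P}_\pi^n\right)\tilde{h}_t,\qquad\left\|\left(I-\tilde{P}_\pi^n\right)\tilde{h}_t\right\|_\infty\leq\mathrm{sp}(\tilde{h}_t).$$

The remainder term is bounded uniformly in $n$, so it vanishes once both sides are divided by $n$.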

To see that EVI indeed converges, observe that Lemma 15 provides a finite bound on the number of iterations required until $\mathrm{sp}(\mathcal{L}_t^{n+1}u-\mathcal{L}_t^n u)\leq\epsilon$. Hence $\mathrm{sp}(\mathcal{L}_t^{n+1}u-\mathcal{L}_t^n u)$ vanishes to $0$. ∎
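The span-based stopping rule used in this argument can be illustrated numerically. The following is a minimal Python sketch, assuming a small, fully known MDP with fully supported kernels rather than the extended MDP $\mathcal{M}_t$. The arrays `r` and `P` and all function names are illustrative assumptions and are not part of PMEVI.

```python
import numpy as np

def span(x):
    """Span semi-norm sp(x) = max(x) - min(x)."""
    return float(np.max(x) - np.min(x))

def bellman(u, r, P):
    """Plain Bellman operator: (Lu)(s) = max_a { r[s, a] + P[s, a] @ u }."""
    return np.max(r + P @ u, axis=1)

def value_iteration(r, P, eps, max_iter=100_000):
    """Iterate u <- Lu until sp(Lu - u) <= eps.

    Returns (gain estimate, re-centered value vector, number of iterations).
    """
    u = np.zeros(r.shape[0])
    for n in range(1, max_iter + 1):
        u_next = bellman(u, r, P)
        diff = u_next - u
        if span(diff) <= eps:
            # The optimal gain lies in [min(diff), max(diff)], so the
            # midpoint is within eps / 2 of it.
            return 0.5 * (diff.max() + diff.min()), u_next - u_next.min(), n
        # Re-centering shifts u by a multiple of e and leaves all spans unchanged.
        u = u_next - u_next.min()
    raise RuntimeError("span stopping criterion not reached")

# Hypothetical 2-state, 2-action MDP; every kernel is fully supported.
r = np.array([[0.0, 0.1], [1.0, 0.5]])
P = np.array([
    [[0.1, 0.9], [0.9, 0.1]],  # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.2, 0.8]],  # transitions from state 1 under actions 0 and 1
])
gain, bias, n_iter = value_iteration(r, P, eps=1e-8)
```

Because every kernel in this toy example is fully supported, the span of successive differences contracts geometrically, which is the mechanism that the requirement of Lemma 15 is meant to guarantee for $\mathcal{L}_t$.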

About 3.

The assumptions made by Corollary 16 are met if the kernel confidence regions are:

  • Built out of Weissman’s inequality (C1) (see the next section, also Auer et al. [2009]);

  • Built out of Bernstein’s inequality (C2) (because the maximization algorithm to compute $\tilde{p}_t(s,a)u_i$ in EVI has the same greedy properties as with Weissman’s inequality);

  • Trivial (C4) obviously.

For confidence regions built with empirical likelihood estimates (C3), there is no guarantee of convergence (although we conjecture that one could be established), but the gain is still well-defined because $\mathcal{M}_t$ remains communicating. However, just as in the original work of Filippi et al. [2010], convergence is always observed numerically.

A.4 Proof of Theorem 5: Complexity of PMEVI with Weissman confidence regions

In this section, we show that when Weissman confidence regions are used for kernels (C1), the iterates of $\mathcal{L}_t$ converge to an $\epsilon$ span-fix-point quickly.

Proposition 17.

Assume that PMEVI-DT uses kernel confidence regions of Weissman type (C1) satisfying 1. Then with probability $1-\delta$, the number of iterations of PMEVI (see Algorithm 2) is $\mathrm{O}\left(D\sqrt{S}AT\right)$, hence the algorithm has polynomial per-step amortized complexity.

Proof.

With Weissman type confidence regions for kernels, for all $t\leq T$ and $(s,a)\in\mathcal{X}$, we have

$$\mathcal{P}_t(s,a)\supseteq\left\{\tilde{p}(s,a)\in\mathcal{P}(s,a):\left\|\tilde{p}(s,a)-\hat{p}_t(s,a)\right\|_1\leq\sqrt{\frac{S\log(2SAT)}{T}}\right\}$$

It follows that, for all $t\leq T$, the extended Bellman operator $\mathcal{L}_t$ satisfies the prerequisite $(*)$ of Lemma 15 with

$$\gamma=\frac{1}{2}\sqrt{\frac{S\log(2SAT/\delta)}{T}}=\Omega\left(\sqrt{\frac{S\log(T/\delta)}{T}}\right).$$

Under 1, we have $M\in\mathcal{M}_t$ with probability $1-\delta$. On this event, $\mathcal{M}_t$ is weakly communicating and $\mathrm{sp}(h^*(\mathcal{M}_t))\leq D(M)$, so we can apply Lemma 15 and conclude that every call to PMEVI (Algorithm 2) takes

$$\mathrm{O}\left(\frac{\mathrm{sp}(w_0)\sqrt{T}}{\epsilon\sqrt{\frac{S\log(T/\delta)}{T}}}\right)=\mathrm{O}\left(\frac{DT}{\sqrt{S}\log(T)}\right)$$

iterations, where we use that $\epsilon=\sqrt{\tfrac{\log(SAT/\delta)}{T}}$, that $\mathrm{sp}(w_0)=\mathrm{O}\left(\mathrm{sp}(h^*(\mathcal{M}_t))\right)=\mathrm{O}(D(M))$ and that $\delta\geq\frac{1}{T}$. Since the number of episodes under the doubling trick (DT) is $\mathrm{O}(SA\log(T))$, we conclude accordingly. ∎

Every call to the projection operator solves a linear program. Although in theory this time is polynomial (relying on recent work on the complexity of LP such as Cohen et al. [2020], it is the current matrix multiplication time $\mathrm{O}(S^{2.38})$), in practice, reducing the number of calls to the projection operator is key to running PMEVI-DT in reasonable time.

Appendix B Analysis of the projected mitigated Bellman operator

In this section, we fix the model region $\mathcal{M}$, the bias region $\mathcal{H}$ and the mitigation vector $\beta$, dropping the subscript $t$ for conciseness. We denote by $\hat{r},\hat{p}$ the respective empirical reward and kernel. Further assume that $\mathcal{H}=\mathcal{H}_0+\mathbf{R}e$ with $\mathcal{H}_0$ a compact convex set. The associated projection operation (see Section B.2) is denoted $\Gamma$. The (vanilla) extended Bellman operator $\mathcal{L}$ associated to $\mathcal{M}$ is given by $\mathcal{L}u(s):=\max_{a\in\mathcal{A}(s)}\left\{\sup\mathcal{R}(s,a)+\sup\mathcal{P}(s,a)u\right\}$. The $\beta$-mitigated extended Bellman operator associated to $\mathcal{M}$ is:

$$\mathcal{L}^\beta u(s):=\max_{a\in\mathcal{A}(s)}\sup_{\tilde{r}(s,a)\in\mathcal{R}(s,a)}\sup_{\tilde{p}(s,a)\in\mathcal{P}(s,a)}\Big\{\tilde{r}(s,a)+\min\left\{\tilde{p}(s,a)u,\hat{p}(s,a)u+\beta(s,a)\right\}\Big\}. \tag{16}$$

The function Greedy$(\mathcal{M},u,\beta)$ returns a stationary deterministic policy that picks its actions among those reaching the maximum above. The projection of $\mathcal{L}^\beta$ to $\mathcal{H}$ is

$$\mathfrak{L}\equiv\mathfrak{L}^{\beta,\mathcal{H}}:=\Gamma\circ\mathcal{L}^\beta. \tag{17}$$
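To make definition (16) concrete, here is a hedged Python sketch of one application of the $\beta$-mitigated operator. It assumes that each $\mathcal{P}(s,a)$ is an $\ell_1$-ball of radius `radius[s, a]` around $\hat{p}(s,a)$ (Weissman type) and that `r_sup[s, a]` stands for $\sup\mathcal{R}(s,a)$. Since $x\mapsto\min\{x,c\}$ is nondecreasing and continuous, the supremum over $\tilde{p}$ can be moved inside the minimum. The inner maximization uses the classical greedy mass-shifting routine for $\ell_1$ confidence balls. All names and numerical values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def max_over_l1_ball(p_hat, radius, u):
    """Maximize p @ u over probability vectors p with ||p - p_hat||_1 <= radius."""
    p = np.array(p_hat, dtype=float)
    order = np.argsort(u)               # states sorted by increasing u
    best = order[-1]
    p[best] = min(1.0, p_hat[best] + radius / 2.0)
    for s in order:                     # remove the excess mass from the worst states first
        excess = p.sum() - 1.0
        if excess <= 1e-15:
            break
        p[s] = max(0.0, p[s] - excess)
    return float(p @ u)

def mitigated_bellman(u, r_sup, p_hat, radius, beta):
    """One application of (16), returning the values and a greedy policy.

    Uses sup_p min{p @ u, c} = min{sup_p p @ u, c} with c = p_hat @ u + beta.
    """
    S, A = r_sup.shape
    q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            optimistic = max_over_l1_ball(p_hat[s, a], radius[s, a], u)
            mitigated = min(optimistic, p_hat[s, a] @ u + beta[s, a])
            q[s, a] = r_sup[s, a] + mitigated
    return q.max(axis=1), q.argmax(axis=1)

# Hypothetical data: 3 states, 2 actions.
p_hat = np.array([
    [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]],
    [[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    [[0.6, 0.2, 0.2], [0.0, 0.5, 0.5]],
])
r_sup = np.array([[0.2, 0.1], [0.5, 0.4], [0.9, 1.0]])
radius = np.full((3, 2), 0.2)
beta = np.full((3, 2), 0.1)
u = np.array([0.0, 1.0, 2.0])
values, policy = mitigated_bellman(u, r_sup, p_hat, radius, beta)
```

Setting `beta` to infinity in this sketch recovers the vanilla extended operator $\mathcal{L}$, while setting it to zero collapses the optimism on kernels and leaves only the empirical kernel $\hat{p}$.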

The goal of this section is to establish Proposition 2 and

B.1 Finding an optimistic policy under bias constraints

The main goal is to find an optimistic policy under bias constraints (projection) and bias error constraints (mitigation). The bias constraints imply that we search for a policy $\pi$ together with a model $\widetilde{M}$ such that $h(\pi,\widetilde{M})\in\mathcal{H}$. The bias error constraint means that, for $\tilde{h}\equiv h(\pi,\widetilde{M})$, we want in addition $\tilde{p}(s,\pi(s))\tilde{h}\leq\hat{p}(s,\pi(s))\tilde{h}+\beta(s,\pi(s))$ where $\tilde{p}$ is the transition kernel of $\widetilde{M}$. In the end, our goal is to track the solution of the following optimization problem:

$$g^*(\mathcal{H},\beta,\mathcal{M}):=\sup\left\{g\left(\pi,\widetilde{M}\right):\begin{array}{c}\pi\in\Pi,\ \widetilde{M}\in\mathcal{M},\\ \forall s\in\mathcal{S},~\tilde{p}(s,\pi(s))\tilde{h}\leq\hat{p}(s,\pi(s))\tilde{h}+\beta(s,\pi(s)),\\ \tilde{h}\equiv h(\pi,\widetilde{M})\in\mathcal{H},~\mathrm{sp}\left(g\left(\pi,\widetilde{M}\right)\right)=0\end{array}\right\} \tag{18}$$

where the supremum is taken with respect to the product order on $\mathbf{R}^{\mathcal{S}}$. In particular, if $\mathcal{U}\subseteq\mathbf{R}^{\mathcal{S}}$, check that $u^*=\sup\mathcal{U}$ is obtained as $u^*(s):=\sup\left\{v(s):v\in\mathcal{U}\right\}$. The constraint $\mathrm{sp}\left(g\left(\pi,\widetilde{M}\right)\right)=0$ is suggested by the work of Fruit et al. [2018], Fruit [2019] and is key for the problem to be solvable.

The bias and the $\beta$-constraints make the problem hard to handle with a “pure” extended MDP solution, which is why the extended Bellman operators are mitigated (with $\beta$) then projected (with $\Gamma$). The mitigation operation guarantees that the $\beta$-constraint is satisfied, while the projection on $\mathcal{H}$ makes sure that the bias constraint is satisfied. It is important for both operations to be compatible, i.e., that the $\beta$-constraint that $\mathcal{L}^\beta$ forces is not lost when applying $\Gamma$. As a matter of fact, projecting then mitigating would not work.

We now explain why $\mathfrak{L}$ can be used to solve (18).

B.2 Projection operation and definition of $\mathfrak{L}$

We start by discussing why $\mathfrak{L}$ is well-defined at all. The well-definition of $\mathcal{L}^\beta$ is obvious. The point is to explain why the projection onto $\mathcal{H}$ is possible while preserving mandatory structural properties such as monotonicity, non-expansivity, linearity and more. For general $\mathcal{H}$, such properties are impossible to meet. But the bias confidence region constructed with Algorithm 3 has a specific shape that makes the projection possible. The central property is the one below:

(A1) The downward closure $\left\{v\leq u:v\in\mathcal{H}\right\}$ of every $u\in\mathbf{R}^{\mathcal{S}}$ has a maximum in $\mathcal{H}$.

The only order that we will be considering is the product order on $\mathbf{R}^{\mathcal{S}}$. Recall that a set $\mathcal{U}\subseteq\mathbf{R}^{\mathcal{S}}$ has a maximum if there exists $u\in\mathcal{U}$ such that $v\leq u$ for all $v\in\mathcal{U}$. A supremum of $\mathcal{U}$ is a minimal upper-bound of $\mathcal{U}$, i.e., $u$ such that (1) $v\leq u$ for all $v\in\mathcal{U}$ and (2) no $w$ satisfying (1) can be smaller than $u$. For the product order, the supremum of a subset $\mathcal{U}$ is unique and of the form $u(s)=\sup\left\{v(s):v\in\mathcal{U}\right\}$.

Define the projection $\Gamma:\mathbf{R}^{\mathcal{S}}\to\mathcal{H}$ as follows:

$$\Gamma u:=\max\left\{v\leq u:v\in\mathcal{H}\right\}. \tag{19}$$

In general, Assumption (A1) is satisfied when $\mathcal{H}$ admits a join, i.e., is stable under finite suprema: $u,v\in\mathcal{H}\Rightarrow\sup(u,v)\in\mathcal{H}$.

Lemma 18.

If $\mathcal{H}$ is generated by constraints of the form $\mathfrak{h}(s)-\mathfrak{h}(s')-c(s,s')\leq d(s,s')$, then it has a join and (A1) is satisfied. Moreover, $\Gamma$ is then correctly computed with Algorithm 4.

Proof.

The first half of the result is well-known, see Zhang and Xie [2023], but we recall a proof for completeness. Let $v_1,v_2\in\mathcal{H}$ and define $v_3:=\sup(v_1,v_2)$. Observe that $v_3(s)-v_3(s')\leq\max(v_1(s)-v_1(s'),v_2(s)-v_2(s'))\leq c(s,s')+d(s,s')$. So $v_3\in\mathcal{H}$.

We continue by showing that if $\mathcal{H}$ has a join, then (19) is well-defined. For $s\in\mathcal{S}$, take a sequence $v_n^s$ such that $v_n^s(s)\to\alpha(s):=\sup\left\{v(s):v\leq u,v\in\mathcal{H}\right\}$. Because the span of every element of $\mathcal{H}$ is upper-bounded by $c:=\sup\left\{\mathrm{sp}(v):v\in\mathcal{H}\right\}$, it follows that $v_n^s$ evolves in the compact region $\left\{v\leq u:v\in\mathcal{H}\right\}\cap\left\{v:\left\|v-\alpha(s)e\right\|_\infty\leq 1+c\right\}$. We can therefore extract a convergent subsequence of $v_n^s$, converging to some $v_*^s$ that belongs to $\mathcal{H}$ since the latter is closed.
By construction, $v_*^s(s)=\alpha(s)$. Because $\mathcal{H}$ has a join, $v_*:=\sup\left\{v_*^s:s\in\mathcal{S}\right\}\in\mathcal{H}$. ∎
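For constraint sets of the form considered in Lemma 18, the projection admits a simple closed form. The sketch below is an illustration under assumptions: it writes the constraints as $v(s)-v(s')\leq b(s,s')$ with $b:=c+d$, assumes they contain no negative cycle, and uses the classical shortest-path argument for difference constraints. It is not claimed to coincide with the paper's Algorithm 4, which is not reproduced in this section. The names `project` and `in_H` are hypothetical.

```python
import numpy as np

def in_H(v, b, tol=1e-12):
    """Check the difference constraints v[s] - v[s2] <= b[s, s2] for all pairs."""
    v = np.asarray(v, dtype=float)
    return bool(np.all(v[:, None] - v[None, :] <= b + tol))

def project(u, b):
    """Largest v <= u (product order) with v[s] - v[s2] <= b[s, s2].

    Assumes the constraints have no negative cycle, so that H is non-empty.
    """
    u = np.asarray(u, dtype=float)
    d = np.array(b, dtype=float)
    np.fill_diagonal(d, 0.0)
    for k in range(len(u)):
        # Floyd-Warshall step: tighten each constraint through intermediate state k.
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    # v(s) = min over s2 of u(s2) + d(s, s2); taking s2 = s gives v <= u.
    return np.min(d + u[None, :], axis=1)

# Hypothetical constraints on 3 states and a vector to project.
b = np.array([[0.0, 1.0, 4.0],
              [4.0, 0.0, 1.0],
              [4.0, 4.0, 0.0]])
u = np.array([10.0, 3.0, 0.0])
v = project(u, b)

# Join property of Lemma 18: the componentwise max of two elements of H stays in H.
v1 = project(np.array([0.0, 5.0, 5.0]), b)
v2 = project(np.array([5.0, 0.0, 5.0]), b)
join_in_H = in_H(np.maximum(v1, v2), b)
```

Any feasible $w\leq u$ satisfies $w(s)\leq w(s')+d(s,s')\leq u(s')+d(s,s')$ for the tightened bounds $d$, which is why the returned vector is the maximum of the downward closure in this sketch.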

Lemma 19.

Under assumption (A1), the operator $\Gamma u:=\max\left\{v\leq u:v\in\mathcal{H}\right\}$ is well-defined, and is:

  (1) monotone: $u\leq v\Rightarrow\Gamma u\leq\Gamma v$;

  (2) non span-expansive: $\mathrm{sp}(\Gamma u-\Gamma v)\leq\mathrm{sp}(u-v)$;

  (3) linear: $\Gamma(u+\lambda e)=\Gamma u+\lambda e$;

  (4) $\Gamma u\leq u$.

Proof.

The well-definition of $\Gamma$ is obvious from (A1). For (1), if $u\leq v$ then $w\leq u\Rightarrow w\leq v$. Hence $\Gamma u:=\max\left\{w\leq u:w\in\mathcal{H}\right\}\leq\max\left\{w\leq v:w\in\mathcal{H}\right\}=:\Gamma v$. For (3), check that it follows from $\mathcal{H}=\mathcal{H}+\mathbf{R}e$. For (4), we obviously have $\Gamma u:=\max\left\{v\leq u:v\in\mathcal{H}\right\}\leq u$.

The more difficult point is (2) span non-expansivity. Pick $u,v\in\mathbf{R}^{\mathcal{S}}$. By linearity, it suffices to show the result for $\sum_s u(s)=\sum_s v(s)$. In that case, we have $\mathrm{sp}(v-u)=\max(v-u)+\max(u-v)$. Observe that for all $w\leq u$, we have $w+\min(v-u)e\leq v$. Since $\mathcal{H}=\mathcal{H}+\mathbf{R}e$, it follows that:

$$\max\left\{w\leq u:w\in\mathcal{H}\right\}\leq\max\left\{w\leq v:w\in\mathcal{H}\right\}+\max(u-v)e.$$

Similarly, we have $\max\left\{w\leq u:w\in\mathcal{H}\right\}\geq\max\left\{w\leq v:w\in\mathcal{H}\right\}+\min(u-v)e$. Using them both at once, we find $\mathrm{sp}(\Gamma u-\Gamma v)\leq\mathrm{sp}(v-u)$. ∎
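The four properties of Lemma 19 can be checked numerically on a simple instance. The sketch below assumes the particular bias region $\mathcal{H}=\{v:\mathrm{sp}(v)\leq c\}$, for which the projection reduces to the span truncation $\Gamma u=\min(u,\min(u)+c)$. This choice of $\mathcal{H}$, the function names and the random test vectors are illustrative assumptions. A randomized check of this kind is a sanity test, not a proof.

```python
import numpy as np

def span(x):
    """Span semi-norm sp(x) = max(x) - min(x)."""
    return float(np.max(x) - np.min(x))

def gamma(u, c):
    """Projection onto H = {v : sp(v) <= c}: componentwise min(u, min(u) + c)."""
    return np.minimum(u, np.min(u) + c)

def check_lemma19(c=1.0, trials=1000, seed=0, tol=1e-12):
    """Randomized check of properties (1)-(4) for the span truncation."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        u, v = 3 * rng.normal(size=(2, 5))
        lam = rng.normal()
        lo = np.minimum(u, v)  # lo <= u in the product order
        ok = (
            np.all(gamma(lo, c) <= gamma(u, c) + tol)                 # (1) monotone
            and span(gamma(u, c) - gamma(v, c)) <= span(u - v) + tol  # (2) non span-expansive
            and np.allclose(gamma(u + lam, c), gamma(u, c) + lam)     # (3) shift by lam * e
            and np.all(gamma(u, c) <= u)                              # (4) below u
        )
        if not ok:
            return False
    return True

all_ok = check_lemma19()
```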

The properties (1), (3) and (4) are essential for $\mathfrak{L}$ to properly address the optimization problem (18). The property (2) is just as important, because it plays a central part in the convergence of value iteration. The next result shows similar properties for the $\beta$-mitigated extended Bellman operator $\mathcal{L}^\beta$. From now on, we will assume (A1), because it is almost-surely satisfied by the bias confidence region generated by Algorithm 3.

Lemma 20.

The $\beta$-mitigated extended Bellman operator $\mathcal{L}^{\beta}$ is (1) monotone, (2) non-span-expansive and (3) linear.

Proof.

The properties (1) and (3) directly follow from the definition. We focus on (2). Fix $u, u' \in \mathbf{R}^{\mathcal{S}}$. By Lemma 26, we can write $\mathcal{L}^{\beta}u = \tilde{r}_{\pi} + \tilde{P}_{\pi}u$ and $\mathcal{L}^{\beta}u' = \tilde{r}_{\pi'} + \tilde{P}_{\pi'}u'$. In the following, we write $\beta_{\pi}(s) := \beta(s, \pi(s))$. Check that:

$$\mathcal{L}^{\beta}u - \mathcal{L}^{\beta}u' = \tilde{r}_{\pi} + \tilde{P}_{\pi}u - \left(\tilde{r}_{\pi'} + \tilde{P}_{\pi'}u'\right) \le \tilde{r}_{\pi} + \tilde{P}_{\pi}u - \left(\tilde{r}_{\pi} + \min\left\{\tilde{P}_{\pi}u', \hat{P}_{\pi}u' + \beta_{\pi}\right\}\right).$$

If the minimum is reached with $\tilde{P}_{\pi}u'$, then:

$$\mathcal{L}^{\beta}u - \mathcal{L}^{\beta}u' \le \tilde{P}_{\pi}(u - u').$$

If the minimum is reached with $\hat{P}_{\pi}u' + \beta_{\pi}$, then upper-bound $\tilde{P}_{\pi}u$ by $\hat{P}_{\pi}u + \beta_{\pi}$ to obtain:

$$\mathcal{L}^{\beta}u - \mathcal{L}^{\beta}u' \le \hat{P}_{\pi}(u - u').$$

Overall, we find that there exists $Q_{\pi} \in \mathcal{P}_{\pi}$ such that $\mathcal{L}^{\beta}u - \mathcal{L}^{\beta}u' \le Q_{\pi}(u - u')$. Similarly, we find $Q_{\pi'} \in \mathcal{P}_{\pi'}$ such that $\mathcal{L}^{\beta}u - \mathcal{L}^{\beta}u' \ge Q_{\pi'}(u - u')$. We conclude that:

$$\mathrm{sp}\left(\mathcal{L}^{\beta}u - \mathcal{L}^{\beta}u'\right) \le \mathrm{sp}\left((Q_{\pi} - Q_{\pi'})(u - u')\right) \le \mathrm{sp}\left(u - u'\right).$$

This concludes the proof. ∎
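The three properties of Lemma 20 can be observed on a small synthetic instance. The sketch below assumes that the operator takes the form $(\mathcal{L}^{\beta}u)(s) = \max_a \{\sup \mathcal{R}(s,a) + \min(\sup \mathcal{P}(s,a)u, \hat{p}(s,a)u + \beta(s,a))\}$, which matches how it is used in the proof above. Each kernel confidence region is represented by a finite list of candidate distributions (the maximum of a linear form over their convex hull is attained at one of them); the names `mitigated_bellman`, `random_instance` and `worst_span_gap` are hypothetical.

```python
import random


def span(u):
    return max(u) - min(u)


def dot(p, u):
    return sum(pi * ui for pi, ui in zip(p, u))


def mitigated_bellman(u, r_sup, kernels, p_hat, beta):
    """Apply the assumed form of L^beta to u.

    r_sup[s][a]   : upper reward bound (stands for sup R(s, a))
    kernels[s][a] : finite list of candidate kernels (stands for P(s, a))
    p_hat[s][a]   : empirical kernel
    beta[s][a]    : mitigation term
    """
    out = []
    for s in range(len(u)):
        best = float("-inf")
        for a in range(len(r_sup[s])):
            optimistic = max(dot(p, u) for p in kernels[s][a])
            mitigated = min(optimistic, dot(p_hat[s][a], u) + beta[s][a])
            best = max(best, r_sup[s][a] + mitigated)
        out.append(best)
    return out


def random_distribution(rng, n):
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def random_instance(rng, num_states=4, num_actions=3, num_candidates=3):
    """Synthetic model; the empirical kernel is the first candidate."""
    r_sup = [[rng.random() for _ in range(num_actions)] for _ in range(num_states)]
    kernels = [[[random_distribution(rng, num_states) for _ in range(num_candidates)]
                for _ in range(num_actions)] for _ in range(num_states)]
    p_hat = [[kernels[s][a][0] for a in range(num_actions)] for s in range(num_states)]
    beta = [[0.2 * rng.random() for _ in range(num_actions)] for _ in range(num_states)]
    return r_sup, kernels, p_hat, beta


def worst_span_gap(trials=500, seed=0):
    """Largest observed value of sp(L^beta u - L^beta v) - sp(u - v)."""
    rng = random.Random(seed)
    model = random_instance(rng)
    worst = float("-inf")
    for _ in range(trials):
        u = [rng.uniform(-3, 3) for _ in range(4)]
        v = [rng.uniform(-3, 3) for _ in range(4)]
        lu, lv = mitigated_bellman(u, *model), mitigated_bellman(v, *model)
        gap = span([a - b for a, b in zip(lu, lv)]) - span([a - b for a, b in zip(u, v)])
        worst = max(worst, gap)
    return worst
```

Linearity, in the sense $\mathcal{L}^{\beta}(u + \lambda e) = \mathcal{L}^{\beta}u + \lambda e$, can be checked the same way by shifting the input by a constant.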

By composition, we obtain the following result.

Corollary 21.

$\mathfrak{L}$ is (1) monotone, (2) non-span-expansive and (3) linear. Moreover, $\mathrm{sp}\left(\mathfrak{L}u - \mathfrak{L}v\right) \le \mathrm{sp}\left(\mathcal{L}u - \mathcal{L}v\right)$ for all $u, v \in \mathbf{R}^{\mathcal{S}}$.

B.3 Fix-points of $\mathfrak{L}$ and (weak) optimism

Lemma 22.

$\mathfrak{L}$ has a fix-point in span semi-norm, i.e., $\exists u \in \mathcal{H}, \mathrm{sp}\left(\mathfrak{L}u - u\right) = 0$.

Proof.

The idea is to apply Brouwer’s fix-point theorem in $\mathbf{R}^{\mathcal{S}}$ quotiented by the equivalence relation $u \sim v \Leftrightarrow \mathrm{sp}\left(u - v\right) = 0$, where $\mathrm{sp}\left(\cdot\right)$ becomes a norm. By linearity (Corollary 21), $\mathfrak{L}$ is well-defined in this quotient space, and if $\mathfrak{L}$ is shown continuous on $\mathbf{R}^{\mathcal{S}}$, so will it be on the quotient.

We show that $\mathfrak{L}$ is sequentially continuous on $\mathcal{H}$. Consider a sequence $(u_n) \in \mathcal{H}^{\mathbf{N}}$ converging to $u \in \mathcal{H}$ and fix $\epsilon > 0$. Provided that $n > N_{\epsilon}$ for $N_{\epsilon}$ large enough, we have $\left\|u_n - u\right\|_{\infty} < \epsilon$, i.e., $u - \epsilon e \le u_n \le u + \epsilon e$.
Therefore, on the one hand, for all $v \le u_n$, we have $v - \epsilon e \le u$, so $\max\{v \le u_n : v \in \mathcal{H}\} \le \max\{v \le u : v \in \mathcal{H}\} + \epsilon e$; and on the other hand, for all $v \le u$, we have $v - \epsilon e \le u_n$, so $\max\{v \le u : v \in \mathcal{H}\} \le \max\{v \le u_n : v \in \mathcal{H}\} + \epsilon e$. Hence:

$$\left\|\max\{v \le u : v \in \mathcal{H}\} - \max\{v \le u_n : v \in \mathcal{H}\}\right\|_{\infty} \le \epsilon.$$

This shows that $\Gamma$ is continuous. The operator $\mathcal{L}^{\beta}$ is obviously continuous as well, so $\mathfrak{L} = \Gamma \circ \mathcal{L}^{\beta}$ is continuous by composition. Since $\mathcal{H} = \mathcal{H}_0 + \mathbf{R}e$ with $\mathcal{H}_0$ compact and convex, the quotient $\mathcal{H}/{\sim}$ is compact and convex, and is preserved by $\mathfrak{L}/{\sim}$. By Brouwer’s fix-point theorem, $\mathfrak{L}/{\sim}$ has a fix-point in $\mathcal{H}/{\sim}$. So $\mathfrak{L}$ has a span fix-point in $\mathcal{H}$. ∎

We write $\operatorname{Fix}(\mathfrak{L})$ for the set of span fix-points of $\mathfrak{L}$.

Lemma 23.

$\mathfrak{L}$ has well-defined growth. Specifically, if $\mathfrak{L}u = u + \mathfrak{g}e$, then:

  (1) There exists $c > 0$ such that, for all $v \in \mathcal{H}_0$, $(n\mathfrak{g} - c)e + u \le \mathfrak{L}^n v \le (n\mathfrak{g} + c)e + u$;

  (2) If $u' \in \operatorname{Fix}(\mathfrak{L})$, then $\mathfrak{L}u' - u' = \mathfrak{g}e$.

Proof.

Setting $c := \max_{v \in \mathcal{H}_0} \left\|v - u\right\|_{\infty} < \infty$, one can check that $u - ce \le v \le u + ce$ for all $v \in \mathcal{H}_0$. This proves (1) for $n = 0$, and we then proceed by induction on $n \ge 0$. By the induction hypothesis, $\mathfrak{L}^n v \le u + (n\mathfrak{g} + c)e$, and by Corollary 21, $\mathfrak{L}$ is monotone, so we have:

$$\mathfrak{L}^{n+1}v = \mathfrak{L}\mathfrak{L}^n v \le \mathfrak{L}(u + (n\mathfrak{g} + c)e) = u + ((n+1)\mathfrak{g} + c)e,$$

where the last equality uses the linearity of $\mathfrak{L}$ together with $\mathfrak{L}u = u + \mathfrak{g}e$. The lower bound on $\mathfrak{L}^n v$ is shown similarly, establishing (1).

For (2), pick $u' \in \operatorname{Fix}(\mathfrak{L})$ with $\mathfrak{L}u' = u' + \mathfrak{g}'e$. Up to translating $u'$, we can assume that $u' \in \mathcal{H}_0$ and apply (1). We get:

$$(n\mathfrak{g} - c)e + u \le n\mathfrak{g}'e + u' \le (n\mathfrak{g} + c)e + u.$$

Divide by $n$ and let $n$ go to infinity. We conclude that $\mathfrak{g} = \mathfrak{g}'$. ∎

We finally have everything in hand to claim that $\mathfrak{L}$ solves (18).

Corollary 24.

The growth of $\mathfrak{L}$ given by $\mathfrak{g}e = \mathfrak{L}u - u$ for $u \in \operatorname{Fix}(\mathfrak{L})$ is well-defined, and:

$$\forall u \in \mathcal{H}, \quad \mathfrak{g}e = \liminf_{n \to \infty} \frac{\mathfrak{L}^n u}{n} = \limsup_{n \to \infty} \frac{\mathfrak{L}^n u}{n}.$$

Moreover, $\mathfrak{g} \ge g^*(\mathcal{H}, \beta, \mathcal{M})$.

Proof.

The growth property is a direct consequence of Lemma 23. We show $\mathfrak{g} \ge g^*(\mathcal{H}, \beta, \mathcal{M})$, which is defined in (18). Pick $\pi \in \Pi$ and $\widetilde{M} \in \mathcal{M}$ its model, with $\tilde{h} \equiv h(\pi, \widetilde{M})$ and $\tilde{P}_{\pi}\tilde{h} \le \hat{P}_{\pi}\tilde{h} + \beta_{\pi}$ where $\beta_{\pi}(s) := \beta(s, \pi(s))$. Up to translation, we can assume that $\tilde{h} \in \mathcal{H}_0$.

We have $g(\pi, \widetilde{M}) = \tilde{g}e$ for $\tilde{g} \in \mathbf{R}$, so

$$\tilde{h} + \tilde{g}e = \tilde{r}_{\pi} + \tilde{P}_{\pi}\tilde{h} \le \mathfrak{L}\tilde{h}$$

by definition. By monotonicity of $\mathfrak{L}$, see Corollary 21, $n\tilde{g}e + \tilde{h} \le \mathfrak{L}^n\tilde{h}$ follows by induction on $n \ge 0$. By Lemma 23, we further have $\mathfrak{L}^n\tilde{h} \le (n\mathfrak{g} + c)e + u$ where $u \in \operatorname{Fix}(\mathfrak{L})$. Combining the two bounds,

$$\tilde{g}e \le \mathfrak{g}e + \frac{ce + u - \tilde{h}}{n}.$$

Letting $n \to \infty$, we deduce that $\tilde{g} \le \mathfrak{g}$. Conclude by taking the best $\pi$ and $\widetilde{M}$. ∎
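The growth characterization of Corollary 24 can be illustrated by iterating a projected operator and dividing by the number of iterations. The sketch below is a simplification under stated assumptions: the confidence region is replaced by a single known two-state model (so the mitigation plays no role), $\mathcal{H}$ is the span ball of radius `c`, and the rewards and kernels are made up for illustration.

```python
def bellman(u, rewards, kernel):
    """(L u)(s) = max_a [ rewards[s][a] + kernel[s][a] . u ]."""
    return [
        max(
            rewards[s][a] + sum(p * x for p, x in zip(kernel[s][a], u))
            for a in range(len(rewards[s]))
        )
        for s in range(len(u))
    ]


def project_span_ball(u, c):
    """Gamma for the assumed H = {v : sp(v) <= c}: clip u at min(u) + c."""
    cap = min(u) + c
    return [min(x, cap) for x in u]


def projected_iterates(u0, rewards, kernel, c, n):
    """Return (Gamma o L)^n u0."""
    u = list(u0)
    for _ in range(n):
        u = project_span_ball(bellman(u, rewards, kernel), c)
    return u


def growth_estimate(u0, rewards, kernel, c, n):
    """Return (Gamma o L)^n u0 / n, which approaches g e as n grows."""
    return [x / n for x in projected_iterates(u0, rewards, kernel, c, n)]


# Hypothetical two-state, two-action model used only for illustration.
REWARDS = [[0.0, 0.5], [1.0, 0.2]]
KERNEL = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.3, 0.7], [0.6, 0.4]],
]
```

Two different starting points in $\mathcal{H}$ should yield estimates that differ by at most their sup-norm distance divided by $n$, and each estimate should have span at most $c/n$, which is consistent with the limit being a multiple of $e$ that does not depend on the starting point.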

The next theorem follows directly with the same proof technique, and guarantees optimism.

Theorem 25.

Assume that $g^* + h^* \le \mathfrak{L}h^*$. Then $\mathfrak{g} \ge g^*$.

The condition “$g^* + h^* \le \mathfrak{L}h^*$” can be referred to as a weak form of optimism. We qualify this version of optimism as weak because it is much weaker than the optimism property suggested by Fruit [2019], namely $\mathcal{L} \ge L$, where $L$ is the Bellman operator of the true MDP. Here, we only ask for $\mathfrak{L}h^* \ge Lh^*$, i.e., optimism at the fix-point of $L$. This condition is met as soon as $M \in \mathcal{M}$, $h^* \in \mathcal{H}$ and $\beta$ is large enough, but is in fact much more general.
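One possible way to spell out the argument behind Theorem 25 is sketched below. It identifies $g^*$ with the constant vector $g^* e$, takes $u \in \operatorname{Fix}(\mathfrak{L})$ with $\mathfrak{L}u = u + \mathfrak{g}e$, and introduces the constant $c' := \|h^* - u\|_{\infty}$, which is notation added here for illustration. Only the monotonicity and linearity of $\mathfrak{L}$ (Corollary 21) are used.

```latex
\begin{align*}
  h^* + n g^* e &\le \mathfrak{L}^n h^*
    && \text{(induction on $n$ from $g^* e + h^* \le \mathfrak{L}h^*$)} \\
  \mathfrak{L}^n h^* &\le \mathfrak{L}^n (u + c' e) = u + (n\mathfrak{g} + c') e
    && \text{(since $h^* \le u + c' e$)} \\
  g^* e &\le \mathfrak{g} e + \frac{c' e + u - h^*}{n}
    && \text{(combine the two lines and divide by $n$)}
\end{align*}
```

Letting $n \to \infty$ in the last line gives $g^* \le \mathfrak{g}$.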

B.4 Modelization of the projected mitigated Bellman operator $\mathfrak{L}$

The aim of this paragraph is to establish Corollary 27, stating that $\mathfrak{L}u$ can be written as $\tilde{r}_{\pi} + \tilde{P}_{\pi}u$ for a policy $\pi$ produced by $\texttt{Greedy}(\mathcal{M}, u, \beta)$.

Lemma 26 (Modelization).

For $\pi \in \Pi$, denote $\beta_{\pi}(s) := \beta(s, \pi(s))$, $\mathcal{R}_{\pi} := \prod_s \mathcal{R}(s, \pi(s))$ and $\mathcal{P}_{\pi} := \prod_s \mathcal{P}(s, \pi(s))$. Fix $u \in \mathbf{R}^{\mathcal{S}}$ and let $\pi := \texttt{Greedy}(\mathcal{M}, u, \beta)$.

  (1) If $\mathcal{P}$ is convex, then there exists $(\tilde{r}_{\pi}, \tilde{P}_{\pi}) \in \mathcal{R}_{\pi} \times \mathcal{P}_{\pi}$ such that $\mathcal{L}^{\beta}u = \tilde{r}_{\pi} + \tilde{P}_{\pi}u$.

  (2) Assume that $\mathcal{L}^{\beta}u = \tilde{r}_{\pi} + \tilde{P}_{\pi}u$. There exists $r'_{\pi} \le \tilde{r}_{\pi}$ such that $\mathfrak{L}u = r'_{\pi} + \tilde{P}_{\pi}u$.

The convexity requirement of (1) is always true if the kernel confidence region is chosen via (C1-4).

Proof.

For (1), fix a state $s \in \mathcal{S}$, let $a := \pi(s)$ and $\rho := \min(\sup \mathcal{P}(s,a)u, \hat{p}(s,a)u + \beta(s,a))$. If $\rho = \sup \mathcal{P}(s,a)u$, then there is nothing to say because $\mathcal{P}$ is compact, hence the sup is a max and $\rho$ is of the form $\tilde{p}(s,a)u$. Otherwise, pick $\tilde{p}(s,a) \in \mathcal{P}(s,a)$ with $\tilde{p}(s,a)u > \hat{p}(s,a)u + \beta(s,a)$. Introduce, for $\lambda \in [0,1]$,

$$\tilde{p}_{\lambda}(s,a) := \lambda\tilde{p}(s,a) + (1-\lambda)\hat{p}(s,a).$$

By continuity, there exists $\lambda \in (0,1)$ such that $\tilde{p}_{\lambda}(s,a)u = \hat{p}(s,a)u + \beta(s,a)$ and by convexity of $\mathcal{P}(s,a)$, $\tilde{p}_{\lambda}(s,a) \in \mathcal{P}(s,a)$. This proves (1).

For (2), recall that $\mathfrak{L}u = \Gamma\mathcal{L}^{\beta}u = \Gamma(\tilde{r}_{\pi} + \tilde{P}_{\pi}u)$. Since $\Gamma v \le v$ for $v \in \mathbf{R}^{\mathcal{S}}$, we have:

$$\Gamma(\tilde{r}_{\pi} + \tilde{P}_{\pi}u) \le \tilde{r}_{\pi} + \tilde{P}_{\pi}u.$$

Set $r'_{\pi} := \Gamma(\tilde{r}_{\pi} + \tilde{P}_{\pi}u) - \tilde{P}_{\pi}u$. Check that $r'_{\pi}$ satisfies $r'_{\pi} \le \tilde{r}_{\pi}$ and $\mathfrak{L}u = r'_{\pi} + \tilde{P}_{\pi}u$. ∎
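The interpolation step in the proof of (1) admits a closed form: since $\tilde{p}_{\lambda}u = \hat{p}u + \lambda(\tilde{p}u - \hat{p}u)$, the value $\lambda = \beta / (\tilde{p}u - \hat{p}u)$ hits the mitigated target exactly. The sketch below illustrates this for a single state-action pair; the function name `mitigate_kernel` and the numerical values are hypothetical, and it assumes $\beta > 0$ together with $\tilde{p}u > \hat{p}u + \beta$.

```python
def dot(p, u):
    return sum(pi * ui for pi, ui in zip(p, u))


def mitigate_kernel(p_tilde, p_hat, u, beta):
    """Return (lam, p_lam) with p_lam = lam * p_tilde + (1 - lam) * p_hat
    and p_lam . u = p_hat . u + beta.

    Assumes p_tilde . u > p_hat . u + beta, i.e. the mitigation is active.
    """
    gap = dot(p_tilde, u) - dot(p_hat, u)
    if gap <= beta:
        raise ValueError("mitigation inactive: p_tilde already satisfies the constraint")
    lam = beta / gap
    p_lam = [lam * a + (1.0 - lam) * b for a, b in zip(p_tilde, p_hat)]
    return lam, p_lam


# Example: the optimistic kernel overshoots the empirical value by 0.4,
# while only beta = 0.1 of optimism is allowed.
lam, p_lam = mitigate_kernel([0.1, 0.9], [0.5, 0.5], [0.0, 1.0], 0.1)
```

The resulting `p_lam` is a convex combination of two elements of $\mathcal{P}(s,a)$, so it stays in $\mathcal{P}(s,a)$ whenever that set is convex and contains $\hat{p}(s,a)$.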

The last corollary below is crucial to claim that greedy policies are good choices in PMEVI-DT.

Corollary 27 (Greedy modelization).

Let $u \in \mathbf{R}^{\mathcal{S}}$ and fix $\pi := \texttt{Greedy}(\mathcal{M}, u, \beta)$. If $\mathcal{P}$ is convex, then with the notations of Lemma 26, there exists $\tilde{r}_{\pi} \le \sup \mathcal{R}_{\pi}$ and $\tilde{P}_{\pi} \in \mathcal{P}_{\pi}$ such that $\mathfrak{L}u = \tilde{r}_{\pi} + \tilde{P}_{\pi}u$.

Appendix C Proof of Theorem 5: Regret analysis of PMEVI-DT

We recall a few notations. At episode $k$, the played policy is denoted $\pi_k$. Since $\pi_k$ is a greedy response to $\mathfrak{h}_k$, by Proposition 2 (3), there exist $\tilde{r}_k(s) \leq \sup \mathcal{R}_{t_k}(s, \pi_k(s))$ and $\tilde{P}_k(s) \in \mathcal{P}_{t_k}(s, \pi_k(s))$ such that $\mathfrak{h}_k + \mathfrak{g}_k = \tilde{r}_k + \tilde{P}_k \mathfrak{h}_k$.
The reward-kernel pair $\tilde{M}_k = (\tilde{r}_k, \tilde{P}_k)$ is referred to as the optimistic model of $\pi_k$. We write $P_k := P_{\pi_k}(M)$ for the true kernel and $\hat{P}_k := P_{\pi_k}(\hat{M}_{t_k})$ for the empirical kernel. Likewise, we define the reward functions $r_k$ and $\hat{r}_k$.
The optimistic gain and bias satisfy $\mathfrak{g}_k = g(\pi_k, \widetilde{M}_k)$ and $\mathfrak{h}_k = h(\pi_k, \widetilde{M}_k)$. We further denote $c_0 = T^{\frac{1}{5}}$.

Important remark.

To slightly simplify the analysis, we assume that PMEVI is run with perfect precision $\epsilon = 0$, i.e., that $\mathfrak{h}_k = \texttt{PMEVI}(\mathcal{M}_{t_k}, \beta_{t_k}, \Gamma_{t_k}, 0)$, which is hence a span fix-point of $\mathfrak{L}_{t_k}$. This assumption is mild and can be dropped by adding an extra error term that has to be carried through the calculations.

C.1 Number of episodes under doubling trick (DT)

Lemma 28 (Number of episodes, Auer et al. [2009]).

The number of episodes up to time $T \geq SA$ is upper-bounded by:

$$K(T) \leq SA \log_2\left(\tfrac{8T}{SA}\right).$$
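As an illustration of Lemma 28, the sketch below counts episodes on a synthetic sequence of state-action visits and compares the count with $SA \log_2(8T/SA)$. It assumes the usual doubling rule of Auer et al. [2009], where an episode ends as soon as the within-episode visit count of some pair reaches its count at the start of the episode (or 1 if the pair was never visited). The function names `episode_bound` and `count_episodes` and the uniformly random visit sequence are hypothetical and only serve the illustration; they are not part of PMEVI-DT.

```python
import math
import random


def episode_bound(n_pairs: int, horizon: int) -> float:
    """Bound SA * log2(8T / SA) of Lemma 28, stated for T >= SA."""
    if horizon < n_pairs:
        raise ValueError("the bound is stated for T >= S*A")
    return n_pairs * math.log2(8 * horizon / n_pairs)


def count_episodes(visits, n_pairs: int) -> int:
    """Count episodes under the assumed doubling rule on visit counts."""
    before = [0] * n_pairs  # visits accumulated before the current episode
    within = [0] * n_pairs  # visits accumulated during the current episode
    episodes = 0
    episode_open = False
    for pair in visits:
        if not episode_open:
            # The first step after a closed episode opens a new one.
            episodes += 1
            episode_open = True
        within[pair] += 1
        if within[pair] >= max(1, before[pair]):
            # Doubling criterion met: merge the counts and close the episode.
            for j in range(n_pairs):
                before[j] += within[j]
                within[j] = 0
            episode_open = False
    return episodes


if __name__ == "__main__":
    rng = random.Random(0)
    S, A, T = 5, 3, 10_000
    visits = [rng.randrange(S * A) for _ in range(T)]
    print(count_episodes(visits, S * A), "<=", episode_bound(S * A, T))
```

The bound is combinatorial: it only depends on the visit counts, not on how the visited pairs are chosen, so any visit sequence of length $T \geq SA$ can be substituted for the random one.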

C.2 Sum of bias variances

Lemma 29 below shows that $\sum_{t=0}^{T-1} \mathbf{V}(p(X_t), h^*)$ scales as $T \mathrm{sp}(h^*) \mathrm{sp}(r) + \mathrm{sp}(h^*) \operatorname{Reg}(T)$ in probability.

Lemma 29.

With probability at least $1-\delta$, we have:

$$\sum_{t=0}^{T-1} \mathbf{V}(p(X_t), h^*) \leq 2\mathrm{sp}(h^*)\mathrm{sp}(r) T + \mathrm{sp}(h^*)^2 \sqrt{\tfrac{1}{2} T \log\left(\tfrac{1}{\delta}\right)} + 2\mathrm{sp}(h^*) \sum_{t=0}^{T-1} \Delta^*(X_t) + \mathrm{sp}(h^*)^2.$$
Proof.

Using the Bellman equation $h^*(s) + g^*(s) = r(s,a) + p(s,a) h^* + \Delta^*(s,a)$, we have:

$$\mathbf{V}(p(X_t), h^*) = \left(p(X_t) - e_{S_t}\right) h^{*2} + 2 h^*(S_t)\left(\Delta^*(X_t) + r(X_t) - g^*(S_t)\right).$$

Since $\mathrm{sp}(h^{*2}) \leq \mathrm{sp}(h^*)^2$, we get:

\begin{align*}
\sum_{t=0}^{T-1} \mathbf{V}(p(X_t), h^*)
&\leq \sum_{t=0}^{T-1} \left(p(X_t) - e_{S_t}\right) h^{*2} + 2\mathrm{sp}(h^*)\left(\mathrm{sp}(r) T + \sum_{t=0}^{T-1} \Delta^*(X_t)\right) \\
&\leq \sum_{t=0}^{T-1} \left(p(X_t) - e_{S_{t+1}}\right) h^{*2} + 2\mathrm{sp}(h^*)\left(\tfrac{1}{2}\mathrm{sp}(h^*) + \mathrm{sp}(r) T + \sum_{t=0}^{T-1} \Delta^*(X_t)\right) \\
&\leq 2\mathrm{sp}(h^*)\mathrm{sp}(r) T + \mathrm{sp}(h^*)^2 \sqrt{\tfrac{1}{2} T \log\left(\tfrac{1}{\delta}\right)} + 2\mathrm{sp}(h^*) \sum_{t=0}^{T-1} \Delta^*(X_t) + \mathrm{sp}(h^*)^2 && \text{(Lemma 32)}
\end{align*}

where the second inequality shifts $e_{S_t}$ to $e_{S_{t+1}}$ at the cost of the telescoping term $h^{*2}(S_T) - h^{*2}(S_0) \leq \mathrm{sp}(h^*)^2$, and the last inequality holds with probability $1-\delta$. This concludes the proof. ∎
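For the reader's convenience, the decomposition of $\mathbf{V}(p(X_t), h^*)$ used at the start of this proof can be expanded as follows. This is a sketch under the assumption that $\mathbf{V}(p, u) := p u^2 - (p u)^2$ is the variance of $u$ under $p$; the shorthand $c_t$ is introduced here only for the computation.

```latex
\begin{align*}
c_t &:= g^*(S_t) - r(X_t) - \Delta^*(X_t),
\qquad p(X_t) h^* = h^*(S_t) + c_t
&& \text{(Bellman equation)} \\
\mathbf{V}(p(X_t), h^*)
&= p(X_t) h^{*2} - \left(h^*(S_t) + c_t\right)^2 \\
&= \left(p(X_t) - e_{S_t}\right) h^{*2} - 2 h^*(S_t)\, c_t - c_t^2 \\
&\leq \left(p(X_t) - e_{S_t}\right) h^{*2}
   + 2 h^*(S_t)\left(\Delta^*(X_t) + r(X_t) - g^*(S_t)\right).
\end{align*}
```

Under this convention the remainder $-c_t^2$ is nonpositive, so dropping it is harmless for the upper bound of Lemma 29.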

C.3 Regret and pseudo-regret: A tight relation

In this paragraph, we bound the regret with respect to the pseudo-regret (and conversely) up to a factor of order $(\mathrm{sp}(h^*)\mathrm{sp}(r)\log(\tfrac{T}{\delta}))^{1/2}$. Hence, in proofs, the pseudo-regret can be changed to the regret with ease.

Lemma 30.

With probability $1-4\delta$, the regret and the pseudo-regret are linked as follows:

$$\left|\sum_{t=0}^{T-1}(g^* - R_t) - \sum_{t=0}^{T-1}\Delta^*(X_t)\right| \leq \begin{Bmatrix} 2\sqrt{\left(2\mathrm{sp}(h^*)\mathrm{sp}(r) + \tfrac{1}{8}\right) T \log\left(\tfrac{T}{\delta}\right)} + \sqrt{2\mathrm{sp}(h^*)\log\left(\tfrac{T}{\delta}\right)\sum_{t=0}^{T-1}\Delta^*(X_t)} \\ + \mathrm{sp}(h^*)\left(\tfrac{1}{2}T\right)^{\frac{1}{4}}\log^{\frac{3}{4}}\left(\tfrac{T}{\delta}\right) + 4\mathrm{sp}(h^*)\log\left(\tfrac{T}{\delta}\right) + 2\mathrm{sp}(h^*) \end{Bmatrix}.$$
Proof.

We rely again on the Poisson equation $g^*(S_t) - r(X_t) - \Delta^*(X_t) = (p(X_t) - e_{S_t}) h^*$, so:

\begin{align*}
\mathrm{A} := \left|\sum_{t=0}^{T-1}\left(g^* - R_t - \Delta^*(X_t)\right)\right|
&\leq \left|\sum_{t=0}^{T-1}\left(p(X_t) - e_{S_t}\right) h^*\right| + \left|\sum_{t=0}^{T-1}\left(R_t - r(X_t)\right)\right| \\
&\leq \mathrm{sp}(h^*) + \left|\sum_{t=0}^{T-1}\left(p(X_t) - e_{S_{t+1}}\right) h^*\right| + \left|\sum_{t=0}^{T-1}\left(R_t - r(X_t)\right)\right|.
\end{align*}

Up to the constant $\mathrm{sp}(h^*)$, the two error terms are respectively a navigation and a reward error. The second is bounded using Azuma’s inequality (Lemma 32), showing that with probability $1-2\delta$, we have:

$$\left|\sum_{t=0}^{T-1}\left(R_t - r(X_t)\right)\right| \leq \sqrt{\tfrac{1}{2} T \log\left(\tfrac{1}{\delta}\right)}.$$

We continue by using Freedman’s inequality, instantiated in the form of Lemma 33. With probability $1-\delta$, we have:

$$\left|\sum_{t=0}^{T-1}\left(p(X_t) - e_{S_{t+1}}\right) h^*\right| \leq \sqrt{2\sum_{t=0}^{T-1}\mathbf{V}(p(X_t), h^*)\log\left(\tfrac{T}{\delta}\right)} + 4\mathrm{sp}(h^*)\log\left(\tfrac{T}{\delta}\right).$$

The quantity $\sum_{t=0}^{T-1}\mathbf{V}(p(X_t), h^*)$ is a classical one that appears at several places throughout the analysis. Using Lemma 29, we bound it explicitly. Further simplifying the bound with $\sqrt{a+b} \leq \sqrt{a} + \sqrt{b}$, we get that with probability $1-4\delta$, we have:

$$\mathrm{A} \leq \begin{Bmatrix} \sqrt{2\mathrm{sp}(h^*)\mathrm{sp}(r) T \log\left(\tfrac{T}{\delta}\right)} + \sqrt{\tfrac{1}{2} T \log\left(\tfrac{1}{\delta}\right)} + \sqrt{2\mathrm{sp}(h^*)\log\left(\tfrac{T}{\delta}\right)\sum_{t=0}^{T-1}\Delta^*(X_t)} \\ + \mathrm{sp}(h^*)\left(\tfrac{1}{2}T\right)^{\frac{1}{4}}\log^{\frac{3}{4}}\left(\tfrac{T}{\delta}\right) + 4\mathrm{sp}(h^*)\log\left(\tfrac{T}{\delta}\right) + 2\mathrm{sp}(h^*) \end{Bmatrix}.$$

Bound $\log(\tfrac{1}{\delta})$ by $\log(\tfrac{T}{\delta})$ and use $\sqrt{a} + \sqrt{b} \leq 2\sqrt{a+b}$ to merge the terms in $\sqrt{T\log(\tfrac{T}{\delta})}$ under a single square-root. ∎
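The reward-noise step of the proof can be checked numerically. The sketch below is only an illustration under assumptions not made in the text: rewards are taken to be i.i.d. Bernoulli with a fixed mean (hence in $[0,1]$), and the names `azuma_radius` and `exceedance_rate` are hypothetical. It estimates how often $|\sum_t (R_t - r(X_t))|$ exceeds $\sqrt{\tfrac{1}{2} T \log(\tfrac{1}{\delta})}$, which the two-sided Azuma bound caps at $2\delta$.

```python
import math
import random


def azuma_radius(horizon: int, delta: float) -> float:
    """Deviation radius sqrt(T/2 * log(1/delta)) used for the reward noise."""
    return math.sqrt(0.5 * horizon * math.log(1.0 / delta))


def exceedance_rate(horizon: int, delta: float, mean: float = 0.5,
                    trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated runs where |sum_t (R_t - mean)| exceeds the radius.

    Rewards are i.i.d. Bernoulli(mean), an assumption made for illustration.
    """
    rng = random.Random(seed)
    radius = azuma_radius(horizon, delta)
    exceed = 0
    for _ in range(trials):
        # Centered reward noise accumulated over one run of length `horizon`.
        noise = sum((1.0 if rng.random() < mean else 0.0) - mean
                    for _ in range(horizon))
        if abs(noise) > radius:
            exceed += 1
    return exceed / trials


if __name__ == "__main__":
    T, delta = 200, 0.05
    print("radius:", azuma_radius(T, delta))
    print("empirical exceedance:", exceedance_rate(T, delta), "vs 2*delta =", 2 * delta)
```

For Bernoulli noise the empirical frequency is typically well below $2\delta$, since Azuma's inequality is a worst-case bound over all bounded martingale differences.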

Overall, Lemma 30 states that the regret $\sum_{t=0}^{T-1}(g^* - R_t)$ and the pseudo-regret $\sum_{t=0}^{T-1}\Delta^*(X_t)$ differ by about $(\mathrm{sp}(h^*) T \log(\tfrac{T}{\delta}))^{1/2}$ in probability (up to asymptotically negligible additional terms). In general, the precise form of Lemma 30 is not convenient to use because it is of the form $x \leq y + \alpha\sqrt{y} + \beta$, which is not linear in $y$. Corollary 31 factorizes the result into one which will be more convenient in proofs.

Corollary 31.

Denote $x := \sum_{t=0}^{T-1}(g^* - R_t)$ and $y := \sum_{t=0}^{T-1}\Delta^*(X_t)$. Further introduce:

\begin{align*}
\alpha &:= \sqrt{2\mathrm{sp}(h^*)\log\left(\tfrac{T}{\delta}\right)} \\
\beta &:= 2\sqrt{\left(2\mathrm{sp}(h^*)\mathrm{sp}(r) + \tfrac{1}{2}\right) T \log\left(\tfrac{T}{\delta}\right)} + \mathrm{sp}(h^*)\left(\tfrac{1}{2}T\right)^{\frac{1}{4}}\log^{\frac{3}{4}}\left(\tfrac{T}{\delta}\right) + 2\mathrm{sp}(h^*)\left(2\log\left(\tfrac{T}{\delta}\right) + 1\right).
\end{align*}

Then, with probability $1-4\delta$, we have $\sqrt{x} \leq \sqrt{y} + \tfrac{1}{2}\alpha + \sqrt{\beta}$ and $\sqrt{y} \leq \sqrt{x} + \alpha + \sqrt{\beta}$.

Proof.

This is straightforward algebra from the result of Lemma 30. ∎
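The algebra behind Corollary 31 can be spelled out as follows. This is a sketch that assumes $x, y \geq 0$ and starts from the consequence $|x - y| \leq \alpha\sqrt{y} + \beta$ of Lemma 30, with $\alpha$ and $\beta$ as in the corollary.

```latex
\begin{align*}
x &\leq y + \alpha\sqrt{y} + \beta
   \leq \left(\sqrt{y} + \tfrac{1}{2}\alpha\right)^2 + \beta
&&\Longrightarrow\quad
\sqrt{x} \leq \sqrt{y} + \tfrac{1}{2}\alpha + \sqrt{\beta}, \\
y &\leq x + \alpha\sqrt{y} + \beta
&&\Longrightarrow\quad
\sqrt{y} \leq \tfrac{1}{2}\left(\alpha + \sqrt{\alpha^2 + 4(x + \beta)}\right)
          \leq \alpha + \sqrt{x} + \sqrt{\beta}.
\end{align*}
```

The first line completes the square and uses $\sqrt{a+b} \leq \sqrt{a} + \sqrt{b}$; the second solves the quadratic inequality in $\sqrt{y}$ and applies the same subadditivity twice.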

C.4 Proof of Lemma 6, reward optimism

We start by getting rid of the reward noise. We have:

\begin{align*}
\operatorname{Reg}(T) &:= \sum_{t=0}^{T-1}(g^* - R_t) = \sum_{t=0}^{T-1}(g^* - r(X_t)) + \sum_{t=0}^{T-1}(r(X_t) - R_t) \\
&\le \sum_{t=0}^{T-1}(g^* - r(X_t)) + \sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}
\end{align*}

with probability $1-\delta$ by Azuma's inequality (Lemma 32). We are left with $\sum_{t=0}^{T-1}(g^* - r(X_t))$. We continue by splitting the regret episodically and invoking optimism. By Lemma 13, with probability $1-4\delta$, we have $\sum_{t=0}^{T-1}(g^* - r(X_t)) \le \sum_k \sum_{t=t_k}^{t_{k+1}-1}(\mathfrak{g}_k - r(X_t))$. Introduce

\begin{equation}
B_0(T) := \sum_k \sum_{t=t_k}^{t_{k+1}-1}(\mathfrak{g}_k - r(X_t)). \tag{20}
\end{equation}
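As a numerical sanity check of the Azuma step above, the following sketch simulates the reward noise $\sum_{t<T}(r(X_t) - R_t)$ and compares it with the level $\sqrt{\tfrac{1}{2}T\log(1/\delta)}$. It assumes, purely for illustration, Bernoulli rewards in $[0,1]$ with a single fixed mean; the function names and parameter values are hypothetical and do not come from the paper.

```python
import math
import random


def azuma_bound(T, delta):
    """Deviation level sqrt(T/2 * log(1/delta)) used in the reward-noise step."""
    return math.sqrt(0.5 * T * math.log(1.0 / delta))


def noise_sum(T, mean, rng):
    """One draw of sum_t (r - R_t), with R_t ~ Bernoulli(mean) (illustrative assumption)."""
    return sum(mean - (1.0 if rng.random() < mean else 0.0) for _ in range(T))


def violation_rate(T=1000, delta=0.05, mean=0.5, runs=1000, seed=0):
    """Fraction of simulated runs in which the noise sum exceeds the Azuma level."""
    rng = random.Random(seed)
    bound = azuma_bound(T, delta)
    hits = sum(noise_sum(T, mean, rng) > bound for _ in range(runs))
    return hits / runs


if __name__ == "__main__":
    # The empirical violation rate should stay below delta.
    print(violation_rate())
```

The empirical rate is typically well below $\delta$, since Azuma's inequality ignores the actual variance of the rewards and only uses their range.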

We focus on bounding $B_0(T)$. By 2, $\tilde{r}_k(s,a)$ is of the form $\hat{r}_k(s,a) + \sqrt{C\log(2SAT/\delta)/N_{t_k}(s,a)} - \eta_k(s,a)$ with $\eta_k(s,a) \in \mathbf{R}$. By statement (3) of Proposition 2, $\eta_k(s,a) \ge 0$. Therefore,

\begin{align*}
B_0(T) &= \sum_k \sum_{t=t_k}^{t_{k+1}-1}\left(\mathfrak{g}_k - \tilde{r}_k(X_t)\right) + \sum_k \sum_{t=t_k}^{t_{k+1}-1}\left(\tilde{r}_k(X_t) - r(X_t)\right) \\
&\le \sum_k \sum_{t=t_k}^{t_{k+1}-1}\left(\mathfrak{g}_k - \tilde{r}_k(X_t)\right) + SA + \sum_k \sum_{t=t_k}^{t_{k+1}-1}\mathbf{1}\left(N_{t_k}(X_t) \ge 1\right)\left(\hat{r}_k(X_t) - r(X_t) + \sqrt{\frac{C\log\left(\tfrac{2SAT}{\delta}\right)}{N_{t_k}(X_t)}}\right) \\
&\overset{(*)}{\le} \sum_k \sum_{t=t_k}^{t_{k+1}-1}\left(\mathfrak{g}_k - \tilde{r}_k(X_t)\right) + SA + \sum_k \sum_{t=t_k}^{t_{k+1}-1}\mathbf{1}\left(N_{t_k}(X_t) \ge 1\right)\left(\sqrt{\frac{2\log\left(\tfrac{2SAT}{\delta}\right)}{N_{t_k}(s,a)}} + \sqrt{\frac{C\log\left(\tfrac{2SAT}{\delta}\right)}{N_{t_k}(s,a)}}\right)
\end{align*}

where $(*)$ holds with probability $1-\delta$ following Lemma 35. By the doubling trick rule (DT), we have $N_t(X_t) \le 2N_{t_k}(X_t)$ for $t < t_{k+1}$, so, with probability $1-\delta$,

\begin{align*}
B_0(T) &\le \sum_k \sum_{t=t_k}^{t_{k+1}-1}\left(\mathfrak{g}_k - \tilde{r}_k(X_t)\right) + SA + 2\sum_k \sum_{t=t_k}^{t_{k+1}-1}\mathbf{1}\left(N_{t_k}(X_t) \ge 1\right)\sqrt{\frac{(2+C)\log\left(\tfrac{2SAT}{\delta}\right)}{N_{t_k}(s,a)}} \\
&\le \sum_k \sum_{t=t_k}^{t_{k+1}-1}\left(\mathfrak{g}_k - \tilde{r}_k(X_t)\right) + SA + 2\sqrt{(2+C)\log\left(\tfrac{2SAT}{\delta}\right)}\sum_{s,a}\sum_{n=1}^{N_T(s,a)-1}\sqrt{\tfrac{1}{n}} \\
&\le \sum_k \sum_{t=t_k}^{t_{k+1}-1}\left(\mathfrak{g}_k - \tilde{r}_k(X_t)\right) + SA + 4\sqrt{(2+C)\log\left(\tfrac{2SAT}{\delta}\right)}\sum_{s,a}\sqrt{N_T(s,a)} \\
\text{(Jensen)}\quad &\le \sum_k \sum_{t=t_k}^{t_{k+1}-1}\left(\mathfrak{g}_k - \tilde{r}_k(X_t)\right) + SA + 4\sqrt{(2+C)SAT\log\left(\tfrac{2SAT}{\delta}\right)}.
\end{align*}
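The counting steps of this chain (doubling trick, then $\sum_{n<N} n^{-1/2} \le 2\sqrt{N}$, then Jensen) can be checked numerically. The sketch below is only an illustration: it assumes that an episode ends as soon as the visit count of the pair just played reaches $\max(1, 2N_{t_k})$, which is one common reading of a doubling rule and may differ in detail from (DT); pairs are drawn uniformly at random, and all names are hypothetical.

```python
import math
import random


def simulate_counts(n_pairs=6, T=5000, seed=0):
    """Run T steps over n_pairs state-action pairs with a doubling episode rule.

    Assumed rule (illustrative): the episode ends once the count of the pair
    just played reaches max(1, 2 * its count at the start of the episode).
    Returns (lhs, counts) with lhs = sum_t 1(N_{t_k}(X_t) >= 1) / sqrt(N_{t_k}(X_t)).
    """
    rng = random.Random(seed)
    counts = [0] * n_pairs      # N_t(x), visits strictly before time t
    start = list(counts)        # N_{t_k}(x), frozen at the episode start
    lhs = 0.0
    for _ in range(T):
        x = rng.randrange(n_pairs)            # pair played at time t
        assert counts[x] <= 2 * start[x]      # doubling invariant N_t <= 2 N_{t_k}
        if start[x] >= 1:
            lhs += 1.0 / math.sqrt(start[x])
        counts[x] += 1
        if counts[x] >= max(1, 2 * start[x]):
            start = list(counts)              # a new episode begins
    return lhs, counts


def check_chain(n_pairs=6, T=5000, seed=0):
    """Check lhs <= sqrt(2) * sum_x sum_{n<N_T(x)} n^{-1/2} <= 2 sqrt(2) sum_x sqrt(N_T(x)) <= 2 sqrt(2 n_pairs T)."""
    lhs, counts = simulate_counts(n_pairs, T, seed)
    harmonic = sum(sum(1.0 / math.sqrt(n) for n in range(1, N)) for N in counts)
    root_sum = sum(math.sqrt(N) for N in counts)
    return (lhs <= math.sqrt(2) * harmonic + 1e-9
            and harmonic <= 2 * root_sum
            and root_sum <= math.sqrt(n_pairs * T) + 1e-9)
```

Under these assumptions the simulated left-hand side is at most $2\sqrt{2}\sqrt{SAT}$, which is compatible with the constant $4\sqrt{2+C}$ obtained above once $\sqrt{2}+\sqrt{C} \le \sqrt{2(2+C)}$ is used.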

We conclude that with probability $1-6\delta$, we have:

\begin{equation}
\operatorname{Reg}(T) \le \sum_k \sum_{t=t_k}^{t_{k+1}-1}\left(\mathfrak{g}_k - \tilde{r}_k(X_t)\right) + 4\sqrt{(2+C)SAT\log\left(\tfrac{2SAT}{\delta}\right)} + \sqrt{\tfrac{1}{2}T\log\left(\tfrac{2SAT}{\delta}\right)} + SA. \tag{21}
\end{equation}
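The probability $1-6\delta$ comes from a union bound over the three high-probability events invoked in this proof (Azuma's inequality, Lemma 13 and Lemma 35). The accounting below is a sketch; the event names $E_{\mathrm{Azuma}}$, $E_{13}$ and $E_{35}$ are introduced here for illustration only.

```latex
\begin{align*}
\mathbf{P}\left(E_{\mathrm{Azuma}}^{c} \cup E_{13}^{c} \cup E_{35}^{c}\right)
&\le \mathbf{P}\left(E_{\mathrm{Azuma}}^{c}\right) + \mathbf{P}\left(E_{13}^{c}\right) + \mathbf{P}\left(E_{35}^{c}\right) \\
&\le \delta + 4\delta + \delta = 6\delta .
\end{align*}
```

The Azuma term appears in (21) with $\log(2SAT/\delta)$ in place of $\log(1/\delta)$, which only enlarges the bound since $2SAT \ge 1$.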

This concludes the proof. ∎

C.5 Proof of Lemma 7, navigation error

We have:

\begin{align*}
\sum_k \sum_{t=t_k}^{t_{k+1}-1}(p_k(S_t) - e_{S_t})\mathfrak{h}_k &\le \sum_k \sum_{t=t_k}^{t_{k+1}-1}(p_k(S_t) - e_{S_{t+1}})\mathfrak{h}_k + \sum_k \mathrm{sp}(\mathfrak{h}_k) \\
&\le \underbrace{\sum_k \sum_{t=t_k}^{t_{k+1}-1}(p_k(S_t) - e_{S_{t+1}})(\mathfrak{h}_k - h^*)}_{\mathrm{A}_1} + \underbrace{\sum_k \sum_{t=t_k}^{t_{k+1}-1}(p_k(S_t) - e_{S_{t+1}})h^*}_{\mathrm{A}_2} + \sum_k \mathrm{sp}(\mathfrak{h}_k).
\end{align*}

The last term is $\mathrm{O}(c_0 SA\log(T))$ by Lemma 28, hence is $\mathrm{O}(T^{1/5}\log(T))$.
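The first inequality of the display above follows from a telescoping argument within each episode. The sketch below assumes the usual definition $\mathrm{sp}(h) = \max_s h(s) - \min_s h(s)$ and that $\mathfrak{h}_k$ is fixed throughout episode $k$:

```latex
\begin{align*}
\sum_{t=t_k}^{t_{k+1}-1}(p_k(S_t) - e_{S_t})\mathfrak{h}_k
&= \sum_{t=t_k}^{t_{k+1}-1}(p_k(S_t) - e_{S_{t+1}})\mathfrak{h}_k
 + \sum_{t=t_k}^{t_{k+1}-1}\left(\mathfrak{h}_k(S_{t+1}) - \mathfrak{h}_k(S_t)\right) \\
&= \sum_{t=t_k}^{t_{k+1}-1}(p_k(S_t) - e_{S_{t+1}})\mathfrak{h}_k
 + \mathfrak{h}_k(S_{t_{k+1}}) - \mathfrak{h}_k(S_{t_k}) \\
&\le \sum_{t=t_k}^{t_{k+1}-1}(p_k(S_t) - e_{S_{t+1}})\mathfrak{h}_k + \mathrm{sp}(\mathfrak{h}_k).
\end{align*}
```

Summing over the episodes $k$ gives the stated bound, with one span term per episode.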

(STEP 1) We start by bounding $\mathrm{A}_1$. By Lemma 13, with probability $1-4\delta$, we have $h^* \in \mathcal{H}_{t_k}$ for all $k \le K(T)$. So $\mathrm{sp}(\mathfrak{h}_k - h^*) \le \mathrm{sp}(\mathfrak{h}_k) + \mathrm{sp}(h^*) \le 2c_0$. By Freedman's inequality invoked in the form of Lemma 33, we have with probability $1-5\delta$,

\[
\mathrm{A}_1 \le \sqrt{2\sum_k \sum_{t=t_k}^{t_{k+1}-1}\mathbf{V}\left(p(X_t), \mathfrak{h}_k - h^*\right)\log\left(\tfrac{T}{\delta}\right)} + 8c_0\log\left(\tfrac{T}{\delta}\right).
\]

It suffices to bound the first term. Recall that $e$ is the vector full of ones. We have:

\begin{align*}
\sum_k \sum_{t=t_k}^{t_{k+1}-1}\mathbf{V}(p(X_t), \mathfrak{h}_k - h^*) &= \sum_k \sum_{t=t_k}^{t_{k+1}-1}\mathbf{V}\left(p(X_t), \mathfrak{h}_k - h^* - (\mathfrak{h}_k(S_t) - h^*(S_t))\cdot e\right) \\
&\le \sum_k \sum_{t=t_k}^{t_{k+1}-1}\sum_{s' \in \mathcal{S}} p(s'|X_t)\left(\mathfrak{h}_k(s') - h^*(s') - (\mathfrak{h}_k(S_t) - h^*(S_t))\right)^2 \\
&\overset{(*)}{\le} 3\sum_k \sum_{t=t_k}^{t_{k+1}-1}\mathbf{E}\left[\sum_{s' \in \mathcal{S}} p(s'|X_t)\left(\mathfrak{h}_k(s') - h^*(s') - (\mathfrak{h}_k(S_t) - h^*(S_t))\right)^2 \,\Bigg|\, \mathcal{F}_t\right] + 16c_0^2\log\left(\tfrac{1}{\delta}\right) \\
&= 3\sum_k \sum_{t=t_k}^{t_{k+1}-1}\left(\mathfrak{h}_k(S_{t+1}) - h^*(S_{t+1}) - (\mathfrak{h}_k(S_t) - h^*(S_t))\right)^2 + 16c_0^2\log\left(\tfrac{1}{\delta}\right).
\end{align*}
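The first two lines of this display rely on two elementary facts about the variance: it is unchanged when a multiple of $e$ is subtracted, and it is bounded by the second moment about any constant. The sketch below checks both on a small example, assuming $\mathbf{V}(p,u)$ denotes the variance of $u(s')$ for $s' \sim p$; the vectors `p` and `u` are hypothetical numbers standing in for $p(\cdot|X_t)$ and $\mathfrak{h}_k - h^*$.

```python
def variance(p, u):
    """V(p, u): variance of u(s') when s' is drawn from the distribution p."""
    mean = sum(pi * ui for pi, ui in zip(p, u))
    return sum(pi * (ui - mean) ** 2 for pi, ui in zip(p, u))


def second_moment_about(p, u, c):
    """sum_{s'} p(s') * (u(s') - c)^2, which upper-bounds V(p, u) for any c."""
    return sum(pi * (ui - c) ** 2 for pi, ui in zip(p, u))


p = [0.2, 0.5, 0.3]    # hypothetical transition kernel p(.|X_t)
u = [1.0, -0.5, 2.0]   # hypothetical values of h_k - h^*
c = u[0]               # value of h_k - h^* at the current state S_t

# Shift invariance: V(p, u) = V(p, u - c * e).
shifted = [ui - c for ui in u]
shift_gap = abs(variance(p, u) - variance(p, shifted))

# Second-moment bound: V(p, u) <= sum_{s'} p(s') (u(s') - c)^2.
bound_holds = variance(p, u) <= second_moment_about(p, u, c)
```

The gap between the two sides of the second-moment bound is exactly the squared distance between $c$ and the mean of $u$ under $p$, so centering at the current state costs nothing when that distance is small.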

Here the inequality $(*)$ holds with probability $1-\delta$ following Lemma 40. We will bound the summand with the bias estimation error $\text{error}(c_k, s, s')$ that spawns the inner regret estimation $B_0(t_k) = \sum_{\ell=1}^{k-1}\sum_{t=t_\ell}^{t_{\ell+1}-1}(\mathfrak{g}_\ell - R_t)$. This inner estimation is linked to the overall optimistic regret $B(T) := \sum_{k,t}(\mathfrak{g}_k - R_t)$ by:

\begin{align*}
B_0(t_k)
&\leq \sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)
\\
&\overset{(*)}{\leq} \sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(g^*-R_t\right)
\\
&\leq \sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_k}^{T-1}\left(\Delta^*(X_t)+\left(p(X_t)-e_{S_t}\right)h^*+r(X_t)-R_t\right)
\\
&\leq \sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)+\mathrm{sp}(h^*)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_k}^{T-1}\left(\left(p(X_t)-e_{S_{t+1}}\right)h^*+r(X_t)-R_t\right)
\\
&\overset{(\dagger)}{\leq} \sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)+\mathrm{sp}(h^*)+(1+\mathrm{sp}(h^*))\sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}
\\
&=: B(T)+\mathrm{sp}(h^*)+(1+\mathrm{sp}(h^*))\sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}.
\end{align*}

In the above, $(*)$ holds with probability $1-4\delta$ uniformly in $k$ by Lemma 13, and $(\dagger)$ holds, also uniformly in $k$, with probability $1-\delta$ by the Azuma-Hoeffding inequality (Lemma 32). Continuing, still on the event specified by Lemma 13, we have with probability $1-6\delta$:

\begin{align*}
\sum_k\sum_{t=t_k}^{t_{k+1}-1}\mathbf{V}(p(X_t),\mathfrak{h}_k-h^*)
&\leq 3\sum_k\sum_{t=t_k}^{t_{k+1}-1}\frac{3c_0+(1+c_0)\sqrt{8t_k\log\left(\tfrac{2}{\delta}\right)}+2B_0(t_k)}{N_{t_k}(S_{t+1}\leftrightarrow S_t)}+16c_0^2\log\left(\tfrac{1}{\delta}\right)
\\
&\leq 3\sum_k\sum_{t=t_k}^{t_{k+1}-1}\frac{4c_0+(1+c_0)\sqrt{32T\log\left(\tfrac{2}{\delta}\right)}+2B(T)}{N_{t_k}(S_t,A_t,S_{t+1})}+16c_0^2\log\left(\tfrac{1}{\delta}\right)
\\
\text{(DT)}\quad
&\leq 12c_0^2S^2A+3\left(4c_0+(1+c_0)\sqrt{32T\log\left(\tfrac{2}{\delta}\right)}+2B(T)\right)S^2A\log(T)
\\
&\phantom{\leq{}}+16c_0^2\log\left(\tfrac{1}{\delta}\right).
\end{align*}
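The passage from $e_{S_t}$ to $e_{S_{t+1}}$ in the chain bounding $B_0(t_k)$ above costs only an additive $\mathrm{sp}(h^*)$, because the differences $h^*(S_{t+1})-h^*(S_t)$ telescope along the trajectory. The following is a minimal numerical sketch of that telescoping argument; the vector `h`, the state sequence, and the helper names are illustrative assumptions and do not come from the paper.

```python
import random


def span(h):
    """Span of a vector: max(h) - min(h)."""
    return max(h) - min(h)


def telescoped_sum(h, states):
    """Sum of h[S_{t+1}] - h[S_t] along a trajectory of states."""
    return sum(h[states[t + 1]] - h[states[t]] for t in range(len(states) - 1))


# Illustrative (hypothetical) bias vector and trajectory.
random.seed(0)
h = [random.uniform(-5.0, 5.0) for _ in range(6)]
states = [random.randrange(6) for _ in range(1000)]

total = telescoped_sum(h, states)
# The sum collapses to the difference between the last and first visited states...
assert abs(total - (h[states[-1]] - h[states[0]])) < 1e-8
# ...so its magnitude never exceeds the span, whatever the trajectory length.
assert abs(total) <= span(h) + 1e-8
```

The bound is independent of the trajectory length, which is why the correction appears as a single $\mathrm{sp}(h^*)$ term rather than one growing with $T$.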

(STEP 2) For $\mathrm{A}_2$, by Freedman's inequality, again invoked in the form of Lemma 33, we have with probability $1-\delta$,

\begin{align*}
\mathrm{A}_2
&\leq \sqrt{2\sum_k\sum_{t=t_k}^{t_{k+1}-1}\mathbf{V}(p_k(S_t),h^*)\log\left(\tfrac{T}{\delta}\right)}+8c_0\log\left(\tfrac{T}{\delta}\right)
\\
&\leq \sqrt{2\sum_{t=0}^{T-1}\mathbf{V}(p(X_t),h^*)\log\left(\tfrac{T}{\delta}\right)}+8c_0\log\left(\tfrac{T}{\delta}\right).
\end{align*}

We recognize the sum of variances $\sum_{t=0}^{T-1}\mathbf{V}(p(X_t),h^*)$, which we leave as is.

(STEP 3) As a result, with probability $1-7\delta$, we have:

\begin{align*}
\sum_k\sum_{t=t_k}^{t_{k+1}-1}(p_k(S_t)-e_{S_t})\mathfrak{h}_k
&\leq \sqrt{2\sum_{t=0}^{T-1}\mathbf{V}(p(X_t),h^*)\log\left(\tfrac{T}{\delta}\right)}+2SA^{\frac{1}{2}}\sqrt{3B(T)}\log\left(\tfrac{T}{\delta}\right)
\\
&\phantom{\leq{}}+\operatorname{O}\left(SA^{\frac{1}{2}}T^{\frac{7}{20}}\log^{\frac{3}{4}}\left(\tfrac{T}{\delta}\right)\right)
\end{align*}

when $c_0=T^{\frac{1}{5}}$. ∎
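One way to see where the exponent $\frac{7}{20}$ comes from is to track powers of $T$ only. The sketch below assumes that the lower-order term arises from taking the square root of the $(1+c_0)\sqrt{T}$ contribution in the variance bound above (as a Freedman-type bound would suggest) and ignores the factors in $S$, $A$ and the logarithms. The function name is hypothetical.

```python
from fractions import Fraction


def t_exponent_after_sqrt(c0_exp, inner_exp):
    """Exponent of T in sqrt(c0 * T**inner_exp) when c0 = T**c0_exp."""
    return (c0_exp + inner_exp) / 2


c0_exp = Fraction(1, 5)  # c0 = T^(1/5), as chosen in the proof

# sqrt(c0 * sqrt(T)): the term driving the lower-order contribution.
main_term = t_exponent_after_sqrt(c0_exp, Fraction(1, 2))
# sqrt(c0^2): from the c0^2 terms of the variance bound (no extra power of T).
quadratic_term = t_exponent_after_sqrt(2 * c0_exp, Fraction(0))

print(main_term, quadratic_term)  # prints: 7/20 1/5
```

Under these assumptions, the $c_0^2$ terms and the additive $8c_0\log(T/\delta)$ term scale as $T^{\frac{1}{5}}$ and are dominated by $T^{\frac{7}{20}}$.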

C.6 Proof of Lemma 8, empirical bias error

Because $h^*$ is a fixed vector, Bennett's inequality (see Lemma 39) guarantees that $(\hat{p}_k(S_t)-p_k(S_t))h^*$ is small, as follows. Taking a union bound over Lemma 39 with confidence $\frac{\delta}{SAT}$ over all pairs $(s,a)$ and visit counts $N(s,a)\leq T$, we see that with probability $1-\delta$, for all $k$, we have:

\begin{align*}
\sum_{t=t_k}^{t_{k+1}-1}\left(\hat{p}_k(S_t)-p_k(S_t)\right)h^*
&\leq \mathrm{sp}(h^*)SA+\sum_{t=t_k}^{t_{k+1}-1}\mathbf{1}\left(N_{t_k}(X_t)\geq 1\right)\left(\sqrt{\tfrac{2\mathbf{V}(p(X_t),h^*)\log\left(\frac{SAT}{\delta}\right)}{N_{t_k}(X_t)}}+\tfrac{\mathrm{sp}(h^*)\log\left(\tfrac{SAT}{\delta}\right)}{3N_{t_k}(X_t)}\right)
\\
\text{(by doubling trick)}\quad
&\leq \mathrm{sp}(h^*)SA+2\sum_{t=t_k}^{t_{k+1}-1}\mathbf{1}\left(N_t(X_t)\geq 1\right)\left(\sqrt{\tfrac{2\mathbf{V}(p(X_t),h^*)\log\left(\frac{SAT}{\delta}\right)}{N_t(X_t)}}+\tfrac{\mathrm{sp}(h^*)\log\left(\tfrac{SAT}{\delta}\right)}{3N_t(X_t)}\right).
\end{align*}

Summing this over $k$ and factorizing over state-action pairs, we get that with probability $1-\delta$,

\begin{align*}
\sum_k(2k)
&\leq \mathrm{sp}(h^*)SA+2\sum_{s,a}\left(\sum_{n=1}^{N_T(s,a)}\sqrt{\tfrac{2\mathbf{V}(p(s,a),h^*)\log\left(\frac{SAT}{\delta}\right)}{n}}+\sum_{n=1}^{N_T(s,a)}\tfrac{\mathrm{sp}(h^*)\log\left(\tfrac{SAT}{\delta}\right)}{n}\right)
\\
&\leq \mathrm{sp}(h^*)SA+4\sum_{s,a}\sqrt{N_T(s,a)\mathbf{V}(p(s,a),h^*)\log\left(\tfrac{SAT}{\delta}\right)}+2\mathrm{sp}(h^*)SA\log\left(\tfrac{SAT}{\delta}\right)\log(T)
\\
\text{(Jensen)}\quad
&\leq \mathrm{sp}(h^*)SA+4\sqrt{SA\sum\nolimits_{s,a}\mathbf{V}(p(s,a),h^*)\log\left(\tfrac{SAT}{\delta}\right)}+2\mathrm{sp}(h^*)SA\log\left(\tfrac{SAT}{\delta}\right)\log(T)
\\
&= \mathrm{sp}(h^*)SA+4\sqrt{\sum\nolimits_{t=0}^{T-1}\mathbf{V}(p(X_t),h^*)\log\left(\tfrac{SAT}{\delta}\right)}+2\mathrm{sp}(h^*)SA\log\left(\tfrac{SAT}{\delta}\right)\log(T).
\end{align*}

We recognize the sum of variances $\sum_{t=0}^{T-1}\mathbf{V}(p(X_t),h^*)$, which is left to be upper-bounded later on. ∎
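The passage from the per-visit sums over $n$ to closed-form bounds in the proof above rests on standard elementary inequalities: $\sum_{n=1}^{N}n^{-1/2}\leq 2\sqrt{N}$, $\sum_{n=1}^{N}n^{-1}\leq 1+\log N$, and the Jensen (or Cauchy-Schwarz) bound $\sum_{i=1}^{m}\sqrt{x_i}\leq\sqrt{m\sum_i x_i}$. The sketch below checks them numerically; it is only a sanity check with hypothetical helper names, not a substitute for the proof, and it does not track the constants used in the paper.

```python
import math


def inv_sqrt_sum(n):
    """Sum of 1/sqrt(i) for i = 1..n."""
    return sum(1.0 / math.sqrt(i) for i in range(1, n + 1))


def harmonic_sum(n):
    """Sum of 1/i for i = 1..n."""
    return sum(1.0 / i for i in range(1, n + 1))


def sum_of_roots(xs):
    """Sum of sqrt(x) over nonnegative entries x."""
    return sum(math.sqrt(x) for x in xs)


def root_of_scaled_sum(xs):
    """sqrt(m * sum(xs)), the Jensen / Cauchy-Schwarz upper bound."""
    return math.sqrt(len(xs) * sum(xs))


for n in (1, 10, 1000, 100000):
    assert inv_sqrt_sum(n) <= 2.0 * math.sqrt(n)
    assert harmonic_sum(n) <= 1.0 + math.log(n)

# Hypothetical per-pair quantities, e.g. visit counts times variances.
xs = [0.0, 1.0, 4.0, 9.0, 2.5]
assert sum_of_roots(xs) <= root_of_scaled_sum(xs) + 1e-12
```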

C.7 Proof of Lemma 9, optimism overshoot

Because of the $\beta$-mitigation generated by Algorithm 5, the quantity $(\tilde{p}_k(S_t)-\hat{p}_k(S_t))\mathfrak{h}_k$ is shown to be directly related to $\mathbf{V}(p(X_t),h^*)$ up to a provably negligible error. Denote by $h'_k$ the reference point $\text{BiasProjection}(\mathcal{H}_{t_k},c_{t_k}(-,s_0))$ used in Algorithm 5 (denoted $h_0$ in the algorithm). By Lemma 13, with probability $1-4\delta$, we have $h^*\in\mathcal{H}_{t_k}$ for all $k$. To lighten notation, we write $d_{t_k}(s',s)$ instead of $\text{error}(c_{t_k},s',s)$.

(STEP 1) Denote $\text{A}:=(\tilde{p}_k(S_t)-\hat{p}_k(S_t))\mathfrak{h}_k$. By construction of $\tilde{p}_k$, we have $\text{A}\leq\beta_{t_k}(X_t)$, so:

\begin{align*}
\text{A} &\leq \beta_{t_k}(X_t) \\
&=: \sqrt{\frac{2\left(\mathbf{V}(\hat{p}_k(S_t),h'_k)+8c_0\sum_{s'\in\mathcal{S}}\hat{p}_k(s'|S_t)\,d_{t_k}(s',S_t)\log\left(\tfrac{SAT}{\delta}\right)\right)}{N_{t_k}(X_t)}}+\frac{3c_0\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_k}(X_t)} \\
&\leq \underbrace{\sqrt{\frac{2\mathbf{V}(\hat{p}_k(S_t),h'_k)}{N_{t_k}(X_t)}}}_{\text{A}_1}+\underbrace{\sqrt{\frac{16c_0\sum_{s'\in\mathcal{S}}\hat{p}_k(s'|S_t)\,d_{t_k}(s',S_t)\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_k}(X_t)}}}_{\text{A}_2}+\frac{3c_0\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_k}(X_t)}.
\end{align*}
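
The last inequality is the subadditivity of the square root, applied to the two summands under the root. As a short worked step (the names $a$ and $b$ are placeholders introduced here for illustration, not notation from the analysis):

\begin{align*}
\sqrt{a+b} &\leq \sqrt{a}+\sqrt{b} \qquad (a,b\geq 0), \\
a &= \frac{2\mathbf{V}(\hat{p}_k(S_t),h'_k)}{N_{t_k}(X_t)}, \qquad
b = \frac{16c_0\sum_{s'\in\mathcal{S}}\hat{p}_k(s'|S_t)\,d_{t_k}(s',S_t)\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_k}(X_t)},
\end{align*}

so that $\sqrt{a}=\text{A}_1$ and $\sqrt{b}=\text{A}_2$, while the additive term $3c_0\log\left(\tfrac{SAT}{\delta}\right)/N_{t_k}(X_t)$ is carried over unchanged.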

The rightmost term of A is of order $\mathrm{O}(\log^{2}(T))$, hence is negligible. We focus on the other two. The analysis of $\text{A}_1$ will spawn a term similar to $\text{A}_2$, so we start with the latter. Recall that $d_{t_k}$ is the bias error provided by Algorithm 3 and that the inner regret estimation is $B_0(t_k)=\sum_{\ell=1}^{k-1}\sum_{t=t_\ell}^{t_{\ell+1}-1}(\mathfrak{g}_\ell-R_t)$. Now, remark that:

\begin{align*}
B_0(t_k) &\leq \sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right) \\
&\overset{(*)}{\leq} \sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(g^*-R_t\right) \\
&\leq \sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_k}^{T-1}\left(\Delta^*(X_t)+\left(p(X_t)-e_{S_t}\right)h^*+r(X_t)-R_t\right) \\
&\leq \sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)+\mathrm{sp}\left(h^*\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_k}^{T-1}\left(\left(p(X_t)-e_{S_{t+1}}\right)h^*+r(X_t)-R_t\right) \\
&\overset{(\dagger)}{\leq} \sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_\ell}^{t_{\ell+1}-1}\left(\mathfrak{g}_k-R_t\right)+\mathrm{sp}\left(h^*\right)+\left(1+\mathrm{sp}\left(h^*\right)\right)\sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)} \\
&=: B(T)+\mathrm{sp}\left(h^*\right)+\left(1+\mathrm{sp}\left(h^*\right)\right)\sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}.
\end{align*}

In the above, $(*)$ holds with probability $1-4\delta$ uniformly in $k$ following Lemma 13, and $(\dagger)$ holds, also uniformly in $k$, with probability $1-\delta$ by applying Azuma-Hoeffding’s inequality (Lemma 32). Therefore, with probability $1-5\delta$, for all $k$ and $t\in\left\{t_k,\ldots,t_{k+1}-1\right\}$, we have:

\begin{align*}
\sqrt{\frac{16c_0\sum_{s'\in\mathcal{S}}\hat{p}_k(s'|S_t)\,d_{t_k}(s',S_t)\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_k}(X_t)}}
&\leq \frac{\sqrt{16c_0\log\left(\tfrac{SAT}{\delta}\right)\sum_{s'\in\mathcal{S}}N_{t_k}(S_t,A_t,s')\,d_{t_k}(s',S_t)}}{N_{t_k}(X_t)} \\
&\leq \frac{\sqrt{16c_0\log\left(\tfrac{SAT}{\delta}\right)\sum_{s'\in\mathcal{S}}N_{t_k}(S_t\leftrightarrow s')\,d_{t_k}(s',S_t)}}{N_{t_k}(X_t)} \\
&\leq \frac{\sqrt{16c_0\log\left(\tfrac{SAT}{\delta}\right)S\left(3c_0+(1+c_0)\left(1+\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}\right)+2B_0(t_k)\right)}}{N_{t_k}(X_t)} \\
&\leq \frac{\sqrt{16c_0\log\left(\tfrac{SAT}{\delta}\right)S\left((1+c_0)\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}\right)+2B(T)\right)}}{N_{t_k}(X_t)} \\
&\leq \frac{\sqrt{16c_0\log\left(\tfrac{SAT}{\delta}\right)S\left((1+c_0)\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2B(T)\right)\right)}}{N_{t_k}(X_t)}.
\end{align*}
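
Two routine facts sit behind this chain. The probability $1-5\delta$ is a union bound over the events of $(*)$ and $(\dagger)$, and the first step of the display is a rewriting of $\hat{p}_k$. The sketch below assumes that $\hat{p}_k$ is the empirical frequency $N_{t_k}(S_t,A_t,s')/N_{t_k}(X_t)$ with $X_t=(S_t,A_t)$; this is an assumption made here, since the definition is not restated in this part of the proof:

\begin{align*}
\mathbb{P}\left((*)\text{ or }(\dagger)\text{ fails}\right) &\leq 4\delta+\delta=5\delta, \\
\frac{\sum_{s'\in\mathcal{S}}\hat{p}_k(s'|S_t)\,d_{t_k}(s',S_t)}{N_{t_k}(X_t)} &= \frac{\sum_{s'\in\mathcal{S}}N_{t_k}(S_t,A_t,s')\,d_{t_k}(s',S_t)}{N_{t_k}(X_t)^{2}}.
\end{align*}

Taking square roots moves the factor $1/N_{t_k}(X_t)$ outside the root, so under that assumption the first step holds with equality and the later steps are upper bounds on the quantity under the root.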

This bound will be enough. We move on to $\text{A}_1$. We have:

\begin{align*}
\sqrt{\mathbf{V}(\hat{p}_k(S_t),h'_k)}
&\leq \sqrt{\left|\mathbf{V}(\hat{p}_k(S_t),h'_k)-\mathbf{V}(p(X_t),h^*)\right|}+\sqrt{\mathbf{V}(p(X_t),h^*)} \\
&\leq \sqrt{\left|\mathbf{V}(\hat{p}_k(S_t),h'_k)-\mathbf{V}(\hat{p}_k(X_t),h^*)\right|}+\sqrt{\left|\mathbf{V}(\hat{p}_k(S_t),h^*)-\mathbf{V}(p(X_t),h^*)\right|}+\sqrt{\mathbf{V}(p(X_t),h^*)} \\
&\overset{(*)}{\leq} \sqrt{8c_0\sum\nolimits_{s'\in\mathcal{S}}\hat{p}_k(s'|S_t)\,d_k(s',S_t)}+\mathrm{sp}\left(h^*\right)\sqrt{\left\|\hat{p}_k(S_t)-p_k(S_t)\right\|_1}+\sqrt{\mathbf{V}(p(X_t),h^*)} \\
&\overset{(\dagger)}{\leq} \sqrt{8c_0\sum\nolimits_{s'\in\mathcal{S}}\hat{p}_k(s'|S_t)\,d_k(s',S_t)}+\mathrm{sp}\left(h^*\right)\left(\frac{S\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_k}(X_t)}\right)^{\frac{1}{4}}+\sqrt{\mathbf{V}(p(X_t),h^*)} \\
&\leq \frac{\mathrm{A}_2}{\sqrt{2N_{t_k}(X_t)}}+\mathrm{sp}\left(h^*\right)\left(\frac{S\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_k}(X_t)}\right)^{\frac{1}{4}}+\sqrt{\mathbf{V}(p(X_t),h^*)}
\end{align*}

where $(*)$ is obtained by applying Lemma 12 and $(\dagger)$ holds with probability $1-\delta$ by applying Weissman’s inequality, see Lemma 35. All together, with probability $1-6\delta$, A is upper-bounded by:

\[
\text{A}\leq\sqrt{\frac{2\mathbf{V}(p(X_t),h^*)\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_k}(X_t)}}+2\text{A}_2+\underbrace{\mathrm{sp}\left(h^*\right)\sqrt{\frac{2\log\left(\tfrac{SAT}{\delta}\right)\sqrt{S\log\tfrac{SAT}{\delta}}}{N_{t_k}(X_t)\sqrt{N_{t_k}(X_t)}}}+\frac{3c_0\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_k}(X_t)}}_{\text{A}_3(k,t)}.
\]
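
The $\mathrm{sp}\left(h^*\right)$ part of $\text{A}_3(k,t)$ can be checked by a one-line computation. Reading the prefactor $\sqrt{2\log\left(\tfrac{SAT}{\delta}\right)/N_{t_k}(X_t)}$ off the first term of the bound, it is this prefactor times the fourth-root term from the bound on $\sqrt{\mathbf{V}(\hat{p}_k(S_t),h'_k)}$. In the sketch below, $L$ and $N$ are shorthands introduced here for $\log\left(\tfrac{SAT}{\delta}\right)$ and $N_{t_k}(X_t)$; they are not notation from the analysis:

\[
\sqrt{\frac{2L}{N}}\left(\frac{SL}{N}\right)^{\frac{1}{4}}
=\sqrt{\frac{2L}{N}\sqrt{\frac{SL}{N}}}
=\sqrt{\frac{2L\sqrt{SL}}{N\sqrt{N}}}.
\]

The probability $1-6\delta$ is again a union bound: the $1-5\delta$ event used for $\text{A}_2$ combined with the $1-\delta$ event of Weissman’s inequality.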

(STEP 2) The number of visits $N_k(X_t)$ is lower-bounded by $\frac{1}{2}N_t(X_t)$ when $N_k(X_t)\geq 1$, by the doubling trick (DT). By summing over $t$ and $k$, we find that with probability $1-6\delta$,

\begin{align*}
\sum_{k}(3k)
&\leq SAc_{0}+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\sqrt{\frac{2\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}}+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(2\text{A}_{2}(k,t)+\text{A}_{3}(k,t)\right)\\
&\overset{\text{(DT)}}{\leq} SAc_{0}+2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\sqrt{\frac{2\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac{SAT}{\delta}\right)}{N_{t}(X_{t})}}+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(2\text{A}_{2}(k,t)+\text{A}_{3}(k,t)\right)\\
&\leq SAc_{0}+4\sqrt{2SA\sum\nolimits_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac{SAT}{\delta}\right)}+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(2\text{A}_{2}(k,t)+\text{A}_{3}(k,t)\right)
\end{align*}

where the last inequality is obtained with computations similar to those detailed in the proof of Lemma 8. We recognize the variance term, which we leave as is. We finish the proof by bounding the lower-order terms $\text{A}_{2}$ and $\text{A}_{3}$.
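As an informal sanity check of the (DT) step, the following Python sketch simulates a doubling schedule and records the smallest observed ratio $N_{t_k}(x)/N_t(x)$ over visits with $N_{t_k}(x) \geq 1$. It assumes a UCRL2-style rule (a new episode starts once the within-episode visits of some pair reach $\max(1, N_{t_k}(x))$), which is an assumption of this illustration and not a restatement of the algorithm analyzed here. Pairs are drawn uniformly at random instead of being generated by a policy, and the function name and parameters are hypothetical.

```python
import random


def worst_doubling_ratio(num_pairs=6, horizon=5000, seed=0):
    """Return (min over visits of N_{t_k}(x) / N_t(x), number of episodes).

    Assumed rule (illustration only): an episode ends as soon as the
    within-episode visit count of some pair reaches max(1, N_{t_k}(x)).
    """
    rng = random.Random(seed)
    counts = [0] * num_pairs        # N_t(x): visits strictly before time t
    start_counts = [0] * num_pairs  # N_{t_k}(x): counts at the episode start
    fresh = [0] * num_pairs         # visits made during the current episode
    worst = 1.0
    episodes = 1
    for _ in range(horizon):
        x = rng.randrange(num_pairs)  # stand-in for the visited pair X_t
        if start_counts[x] >= 1:
            # Here counts[x] >= start_counts[x] >= 1, so the ratio is defined.
            worst = min(worst, start_counts[x] / counts[x])
        counts[x] += 1
        fresh[x] += 1
        if fresh[x] >= max(1, start_counts[x]):
            # Doubling condition met: start a new episode.
            start_counts = counts[:]
            fresh = [0] * num_pairs
            episodes += 1
    return worst, episodes
```

Under the assumed rule, the within-episode count of a pair stays strictly below $N_{t_k}(x)$ while the episode is running, so the returned ratio should stay above $\frac{1}{2}$, which is the inequality used in (DT).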

(STEP 3) We start with $\text{A}_{2}$. We have:

\begin{align*}
\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\text{A}_{2}(k,t)
&:=\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\frac{\sqrt{16c_{0}\log\left(\tfrac{SAT}{\delta}\right)S\left((1+c_{0})\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2B(T)\right)\right)}}{N_{t_{k}}(X_{t})}\\
&\overset{\text{(DT)}}{\leq} 2\sqrt{16c_{0}S\log\left(\tfrac{SAT}{\delta}\right)\left((1+c_{0})\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2B(T)\right)\right)}~SA\log(T)\\
&\leq 8(1+c_{0})S^{\frac{3}{2}}A\log^{\frac{3}{2}}\left(\tfrac{SAT}{\delta}\right)\left(2+4T^{\frac{1}{4}}\log^{\frac{1}{4}}\left(\tfrac{SAT}{\delta}\right)+\sqrt{2B(T)}\right).
\end{align*}
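The (DT) step above rests on a harmonic-sum argument: the inverse counts $1/N_{t_k}(X_t)$, summed over all visits with $N_{t_k}(X_t) \geq 1$, grow only logarithmically in $T$. The Python sketch below checks a conservative version of this numerically, namely $\sum_t \mathbf{1}_{N_{t_k}(X_t)\geq 1}/N_{t_k}(X_t) \leq 2SA(1+\log T)$. The extra "$1+$" is a safety margin chosen for this illustration, the doubling rule is the UCRL2-style one (an assumption), pairs are sampled uniformly at random, and all names are hypothetical.

```python
import math
import random


def inverse_count_sum(num_pairs, horizon, seed=0):
    """Sum of 1 / N_{t_k}(X_t) over visits with N_{t_k}(X_t) >= 1.

    Assumed rule (illustration only): an episode ends as soon as the
    within-episode visit count of some pair reaches max(1, N_{t_k}(x)).
    """
    rng = random.Random(seed)
    counts = [0] * num_pairs        # N_t(x)
    start_counts = [0] * num_pairs  # N_{t_k}(x)
    fresh = [0] * num_pairs         # visits made during the current episode
    total = 0.0
    for _ in range(horizon):
        x = rng.randrange(num_pairs)  # stand-in for the visited pair X_t
        if start_counts[x] >= 1:
            total += 1.0 / start_counts[x]
        counts[x] += 1
        fresh[x] += 1
        if fresh[x] >= max(1, start_counts[x]):
            start_counts = counts[:]
            fresh = [0] * num_pairs
    return total


def harmonic_bound(num_pairs, horizon):
    """Conservative bound 2 * SA * (1 + log T) used for comparison."""
    return 2.0 * num_pairs * (1.0 + math.log(horizon))
```

For instance, comparing `inverse_count_sum(8, 20000)` with `harmonic_bound(8, 20000)` should show the simulated sum sitting below the bound, since each $1/N_{t_k}(x)$ is at most $2/N_t(x)$ and $\sum_{n=1}^{N} 1/n \leq 1+\log N$.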

(STEP 4) We are left with $\text{A}_{3}$. We have:

\begin{align*}
\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\text{A}_{3}(k,t)
&:=\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(\mathrm{sp}\left(h^{*}\right)\sqrt{\frac{2\log\left(\tfrac{SAT}{\delta}\right)\sqrt{S\log\tfrac{SAT}{\delta}}}{N_{t_{k}}(X_{t})\sqrt{N_{t_{k}}(X_{t})}}}+\frac{3c_{0}\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}\right)\\
&\overset{\text{(DT)}}{\leq}\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(\mathrm{sp}\left(h^{*}\right)\sqrt{\frac{2\log\left(\tfrac{SAT}{\delta}\right)\sqrt{S\log\tfrac{SAT}{\delta}}}{N_{t_{k}}(X_{t})\sqrt{N_{t_{k}}(X_{t})}}}+\frac{3c_{0}\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}\right)\\
&\leq C\,\mathrm{sp}\left(h^{*}\right)S^{\frac{5}{4}}AT^{\frac{1}{4}}\log^{\frac{3}{4}}\left(\tfrac{SAT}{\delta}\right)+6c_{0}SA\log\left(\tfrac{SAT}{\delta}\right)\\
&=\mathrm{O}\left(\mathrm{sp}\left(h^{*}\right)S^{\frac{5}{4}}AT^{\frac{1}{4}}\log\left(\tfrac{SAT}{\delta}\right)\right).
\end{align*}
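The $T^{\frac{1}{4}}$ rate in the bound on $\text{A}_{3}$ is consistent with an elementary fact about sums of $N^{-\frac{3}{4}}$: for each pair, $\sum_{n=1}^{N} n^{-\frac{3}{4}} \leq 4N^{\frac{1}{4}}$, and by concavity $\sum_{x} N_x^{\frac{1}{4}} \leq (SA)^{\frac{3}{4}}T^{\frac{1}{4}}$ when $\sum_x N_x = T$. The Python sketch below checks this numerically. It is offered as one plausible route to a constant such as $C$, not as the exact computation behind the displayed inequality, and the function names and the example counts are hypothetical.

```python
def power_sum(n_visits, exponent=0.75):
    """Sum of n ** (-exponent) for n = 1, ..., n_visits (0.0 if n_visits == 0)."""
    return sum(n ** -exponent for n in range(1, n_visits + 1))


def three_quarter_bound(visit_counts):
    """Return (lhs, rhs) for the claim lhs <= rhs, where

    lhs = sum over pairs x of sum_{n=1}^{N_x} n^(-3/4),
    rhs = 4 * (SA)^(3/4) * T^(1/4), with SA = number of pairs, T = sum_x N_x.
    """
    num_pairs = len(visit_counts)
    horizon = sum(visit_counts)
    lhs = sum(power_sum(c) for c in visit_counts)
    rhs = 4.0 * num_pairs ** 0.75 * horizon ** 0.25
    return lhs, rhs
```

Calling `three_quarter_bound([1, 10, 100, 1000, 0, 5])` returns a pair whose first entry should not exceed the second. Multiplying the right-hand side by the $S^{\frac{1}{4}}\log^{\frac{3}{4}}$ factor coming from the numerator of $\text{A}_{3}$, and using $(SA)^{\frac{3}{4}} \leq SA$, gives a term of order $S^{\frac{5}{4}}AT^{\frac{1}{4}}\log^{\frac{3}{4}}$.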

This concludes the proof. ∎

C.8 Proof of Lemma 10, second order error

Recall that by Lemma 13, with probability $1-4\delta$, $h^{*}\in\mathcal{H}_{t_{k}}$ for all $k$, hence $\mathrm{sp}\left(\mathfrak{h}_{k}-h^{*}\right)\leq 2c_{0}$ for all $k$ on the same event. Therefore, with probability $1-4\delta$,

\begin{align*}
\sum_{k}(4k)
&:=2c_{0}SA+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(\hat{p}_{k}(S_{t})-p_{k}(S_{t})\right)\left(\mathfrak{h}_{k}-h^{*}\right)\\
&=2c_{0}SA+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(\hat{p}_{k}(s^{\prime}|S_{t})-p_{k}(s^{\prime}|S_{t})\right)\left(\mathfrak{h}_{k}-h^{*}(s^{\prime})\right)\\
&\overset{(*)}{\leq}2c_{0}SA+2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(\hat{p}_{k}(s^{\prime}|S_{t})-p_{k}(s^{\prime}|S_{t})\right)d_{t_{k}}(s^{\prime},S_{t})\\
&\overset{(\dagger)}{\leq}2c_{0}SA+2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(d_{k}(s^{\prime},S_{t})\sqrt{\frac{2\hat{p}_{k}(s^{\prime}|S_{t})\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t_{k}}(X_{t})}}+3d_{k}(s^{\prime}|S_{t})\frac{\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t_{k}}(X_{t})}\right)\\
&\leq 2c_{0}SA+2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(\sqrt{c_{0}}\sqrt{\frac{2\hat{p}_{k}(s^{\prime}|S_{t})d_{k}(s^{\prime},S_{t})\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t_{k}}(X_{t})}}+\frac{3c_{0}\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t_{k}}(X_{t})}\right)\\
&\leq 2c_{0}SA+4\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(\sqrt{c_{0}}\sqrt{\frac{2\hat{p}_{k}(s^{\prime}|S_{t})d_{k}(s^{\prime},S_{t})\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t}(X_{t})}}+\frac{3c_{0}\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t}(X_{t})}\right)
\end{align*}

where $(*)$ uses that $h^{*}\in\mathcal{H}_{t_{k}}$, and $(\dagger)$ is obtained by applying the empirical Bernstein's inequality, see Lemma 36, to $\hat{p}_{k}(s^{\prime}|S_{t})-p_{k}(s^{\prime}|S_{t})$, and holds with probability $1-\delta$. The rightmost term's sum is upper-bounded by:

\[
4\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\frac{3c_{0}\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t}(X_{t})}\leq 12S^{2}A\log(T)\log\left(\tfrac{S^{2}AT}{\delta}\right).
\]

For the other term, we follow the lines of the proof of Lemma 9 (term $\text{A}_{2}$). We have, with probability $1-5\delta$ ($4\delta$ of which comes from invoking Lemma 13):

$$\begin{aligned}
\hat{p}_{k}(s^{\prime}|S_{t})d_{k}(s^{\prime},S_{t}) &=\frac{N_{t_{k}}(S_{t},A_{t},s^{\prime})\left((1+c_{0})\left(1+\sqrt{8t_{k}\log\left(\tfrac{2}{\delta}\right)}\right)+2B_{0}(t_{k})\right)}{N_{t_{k}}(S_{t}\leftrightarrow s^{\prime})N_{t_{k}}(X_{t})} \\
&\leq\frac{\left((1+c_{0})\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2B(T)\right)\right)}{N_{t_{k}}(X_{t})}.
\end{aligned}$$

Therefore,

$$\sqrt{c_{0}}\sqrt{\frac{2\hat{p}_{k}(s^{\prime}|S_{t})d_{t_{k}}(s^{\prime},S_{t})\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t}(X_{t})}}\leq\frac{4(1+c_{0})\sqrt{\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2B(T)\right)\log\left(\tfrac{S^{2}AT}{\delta}\right)}}{N_{t}(X_{t})}.$$

Summing over $k$, $t$ and $s^{\prime}$, with probability $1-6\delta$, we have:

$$\sum_{k}(4k)\leq\begin{Bmatrix}16S^{2}A(1+c_{0})\log^{\frac{1}{2}}\left(\tfrac{S^{2}AT}{\delta}\right)\left(\sqrt{2B(T)}+2\left(8T\log\left(\tfrac{2}{\delta}\right)\right)^{\frac{1}{4}}\right)\\ +32S^{2}A\left(\log(T)\log\left(\tfrac{S^{2}AT}{\delta}\right)+(1+c_{0})\log^{\frac{1}{2}}\left(\tfrac{S^{2}AT}{\delta}\right)\right)\end{Bmatrix}$$

This concludes the proof. ∎

Appendix D Details on experiments

D.1 River swim

Experiments are run on $n$-state river-swim. Such MDPs are, despite their small size, known to be hard to learn. They consist of $n$ states aligned in a straight line with two playable actions, $\textsc{right}$ and $\textsc{left}$, whose dynamics are given in the figure below. Rewards are Bernoulli and null everywhere except for $r(s_{n},\textsc{right})=0.95$ and $r(s_{0},\textsc{left})=0.05$.

[Figure: transition diagram over states $s_{1},s_{2},\cdots,s_{n}$, with edge probabilities $0.6$, $0.4$, $0.6$, $0.35$, $0.05$, $0.05$, $0.35$, $0.95$, $0.05$, $1$, $1$, $1$, $1$.]
Figure 3: The kernel of an $n$-state river-swim.

$3$-state river-swim.

The gain is $g^{*}\approx 0.82$ and $h^{*}\approx(-4.28,-2.24,0.4)$.

$5$-state river-swim.

The gain is $g^{*}\approx 0.82$ and $h^{*}\approx(-9.62,-7.58,-4.96,-2.27,0.45)$.
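The reported gains and biases can be cross-checked numerically. The following is a minimal sketch, not the authors' code, and it rests on two assumptions that the text does not state explicitly: (i) the kernel of action `right` is read off the edge labels of Figure 3 as "first state: stay $0.6$, move right $0.4$; inner states: move left $0.05$, stay $0.6$, move right $0.35$; last state: move left $0.05$, stay $0.95$", and (ii) the policy that always plays `right` is optimal. The function name `river_swim_gain_and_bias` is hypothetical. Under `right` the chain is a birth-death chain, so the stationary distribution follows from detailed balance and the Poisson equation can be solved on increments.

```python
def river_swim_gain_and_bias(n):
    """Gain and bias of the always-`right` policy on n-state river-swim (n >= 2).

    Assumed kernel under `right` (inferred from the figure, not stated in the text):
      first state : stay 0.6, move right 0.4
      inner states: move left 0.05, stay 0.6, move right 0.35
      last state  : move left 0.05, stay 0.95
    """
    up = [0.4] + [0.35] * (n - 2) + [0.0]    # P(s_i -> s_{i+1})
    down = [0.0] + [0.05] * (n - 1)          # P(s_i -> s_{i-1})
    reward = [0.0] * (n - 1) + [0.95]        # mean reward of `right`

    # Stationary distribution via detailed balance:
    # pi[i] * up[i] = pi[i + 1] * down[i + 1].
    weights = [1.0]
    for i in range(n - 1):
        weights.append(weights[-1] * up[i] / down[i + 1])
    total = sum(weights)
    pi = [w / total for w in weights]

    gain = sum(p * r for p, r in zip(pi, reward))

    # Poisson equation h = r - g + P h, written on increments:
    # up[i] * (h[i+1] - h[i]) = g - r[i] + down[i] * (h[i] - h[i-1]).
    h = [0.0]
    previous_step = 0.0
    for i in range(n - 1):
        step = (gain - reward[i] + down[i] * previous_step) / up[i]
        h.append(h[-1] + step)
        previous_step = step

    # Normalise so that the bias has zero mean under pi.
    shift = sum(p * x for p, x in zip(pi, h))
    return gain, [x - shift for x in h]
```

Under these assumptions, the sketch gives a gain of roughly $0.818$ for $n=3$ and $0.814$ for $n=5$, and bias vectors that agree with the values reported above up to rounding.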

Appendix E Standard concentration inequalities

Lemma 32 (Azuma’s inequality, Azuma [1967]).

Let $(U_{t})_{t\geq 0}$ be a martingale difference sequence such that $\mathrm{sp}\left(U_{t}\right)\leq c$ a.s., i.e., there exists $a_{t}\in\mathbf{R}$ such that $a_{t}\leq U_{t}\leq a_{t}+c$ a.s. Then, for all $\delta>0$,

$$\mathbf{P}\left(\sum\nolimits_{t=0}^{T-1}U_{t}\geq c\sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}\right)\leq\delta.$$
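As a hedged numerical illustration of Lemma 32 (not part of the original analysis), the sketch below evaluates the deviation level $c\sqrt{\tfrac{1}{2}T\log(1/\delta)}$ and compares it with simulated sums. The noise model (independent variables uniform on $[-\tfrac12,\tfrac12]$, which form a martingale difference sequence of span $c=1$), the function names and the seed are assumptions chosen for illustration.

```python
import math
import random


def azuma_threshold(c, T, delta):
    """Deviation level c * sqrt(T/2 * log(1/delta)) from Lemma 32."""
    return c * math.sqrt(0.5 * T * math.log(1.0 / delta))


def violation_frequency(T, delta, trials=2000, seed=0):
    """Fraction of simulated runs whose sum reaches the Azuma threshold.

    Illustrative noise model (an assumption): U_t uniform on [-1/2, 1/2],
    so that sp(U_t) <= c = 1 and E[U_t] = 0.
    """
    rng = random.Random(seed)
    threshold = azuma_threshold(1.0, T, delta)
    violations = 0
    for _ in range(trials):
        total = sum(rng.uniform(-0.5, 0.5) for _ in range(T))
        if total >= threshold:
            violations += 1
    return violations / trials
```

The observed frequency is expected to sit well below $\delta$ in this example, since the bound only uses the span of $U_{t}$ and ignores its variance.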
Lemma 33 (Freedman’s inequality, Zhang et al. [2020]).

Let $(U_{t})_{t\geq 0}$ be a martingale difference sequence such that $\left|U_{t}\right|\leq c$ a.s., and denote its conditional variance by $V_{t}:=\mathbf{E}[U_{t}^{2}|\mathcal{F}_{t-1}]$. Then, for all $\delta>0$,

$$\mathbf{P}\left(\exists T^{\prime}\leq T:\sum\nolimits_{t=0}^{T^{\prime}-1}U_{t}\geq\sqrt{2\sum\nolimits_{t=0}^{T^{\prime}-1}V_{t}\log\left(\tfrac{T}{\delta}\right)}+4c\log\left(\tfrac{T}{\delta}\right)\right)\leq\delta.$$
Lemma 34 (Time-uniform Azuma, Bourel et al. [2020]).

Let $(U_{t})$ be a martingale difference sequence such that, for all $\lambda\in\mathbf{R}$, $\mathbf{E}[\exp(\lambda U_{t})|U_{1},\ldots,U_{t-1}]\leq\exp\left(\frac{\lambda^{2}\sigma^{2}}{2}\right)$. Then:

$$\forall\delta>0,\quad\mathbf{P}\left(\exists n\geq 1,\quad\left(\sum\nolimits_{k=1}^{n}U_{k}\right)^{2}\geq n\sigma^{2}\left(1+\tfrac{1}{n}\right)\log\left(\tfrac{\sqrt{1+n}}{\delta}\right)\right)\leq\delta.$$
Lemma 35 (Time-uniform Weissman).

Let $q$ be a distribution over $\left\{1,\ldots,d\right\}$. Let $(U_{t})$ be a sequence of i.i.d. random variables with distribution $q$. Then:

$$\forall\delta>0,\quad\mathbf{P}\left(\exists n\geq 1,\left\|\sum\nolimits_{i=1}^{n}\left(e_{U_{i}}-q\right)\right\|_{1}^{2}\geq nd\log\left(\tfrac{2\sqrt{1+n}}{\delta}\right)\right)\leq\delta.$$
Proof.

Note that $\left\|\sum_{k=1}^{n}(e_{U_{k}}-q)\right\|_{1}=\max_{v\in\left\{-1,1\right\}^{d}}\sum_{k=1}^{n}\left\langle e_{U_{k}}-q,v\right\rangle$. Let $W_{k}^{v}:=\left\langle e_{U_{k}}-q,v\right\rangle$. For each $v\in\left\{-1,1\right\}^{d}$, $(W_{k}^{v})$ is a family of i.i.d. random variables with $-\left\langle q,v\right\rangle\leq W_{k}^{v}\leq 1-\left\langle q,v\right\rangle$, so $\mathbf{E}[\exp(\lambda W_{k}^{v})]\leq\exp\left(\tfrac{\lambda^{2}}{8}\right)$ by Hoeffding's Lemma. By Lemma 34, we have:

$$\begin{aligned}
\mathbf{P}\left(\exists n\geq 1,\left\|\sum_{k=1}^{n}(e_{U_{k}}-q)\right\|_{1}\geq\sqrt{nd\log\left(\tfrac{2\sqrt{1+n}}{\delta}\right)}\right) &=\mathbf{P}\left(\exists v\in\left\{-1,1\right\}^{d},\exists n,\sum_{k=1}^{n}W_{k}^{v}\geq\sqrt{nd\log\left(\tfrac{2\sqrt{1+n}}{\delta}\right)}\right) \\
&\leq\sum_{v\in\left\{-1,1\right\}^{d}}\mathbf{P}\left(\exists n,\sum_{k=1}^{n}W_{k}^{v}\geq\sqrt{nd\log\left(\tfrac{2\sqrt{1+n}}{\delta}\right)}\right) \\
&\leq\sum_{v\in\left\{-1,1\right\}^{d}}\mathbf{P}\left(\exists n,\sum_{k=1}^{n}W_{k}^{v}\geq\sqrt{\tfrac{1}{2}n\left(1+\tfrac{1}{n}\right)\log\left(\tfrac{\sqrt{1+n}}{2^{-d}\delta}\right)}\right) \\
&\leq 2^{d}\cdot 2^{-d}\delta=\delta.
\end{aligned}$$

This concludes the proof. ∎
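The first step of the proof above rewrites the $\ell_{1}$-norm as a maximum over the $2^{d}$ sign vectors, which is what allows a union bound over $v\in\{-1,1\}^{d}$. A minimal sketch checking this identity by brute force is given below. The distribution `q`, the sample and all function names are hypothetical, and outcomes are indexed from $0$ to $d-1$ rather than from $1$ to $d$.

```python
from itertools import product


def deviation_vector(samples, q):
    """Coordinates of sum_k (e_{U_k} - q), with outcomes indexed 0..d-1."""
    counts = [0] * len(q)
    for u in samples:
        counts[u] += 1
    return [counts[i] - len(samples) * q[i] for i in range(len(q))]


def l1_norm(x):
    """Sum of absolute values of the coordinates of x."""
    return sum(abs(x_i) for x_i in x)


def max_over_sign_vectors(x):
    """max over v in {-1, 1}^d of <x, v>, by brute force over 2^d vectors."""
    return max(
        sum(v_i * x_i for v_i, x_i in zip(v, x))
        for v in product((-1, 1), repeat=len(x))
    )


q = [0.5, 0.3, 0.2]            # hypothetical distribution with d = 3
samples = [0, 0, 1, 2, 0, 1]   # hypothetical draws, n = 6
x = deviation_vector(samples, q)
```

Here `l1_norm(x)` and `max_over_sign_vectors(x)` coincide up to floating-point error, the maximum being attained at $v_{i}=\mathrm{sign}(x_{i})$.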

Lemma 36 (Time-uniform Empirical Bernstein).

Let $(U_{k})_{k\geq 1}$ be a martingale difference sequence such that $\mathrm{sp}\left(U_{n}\right)\leq c$ a.s., let $\hat{U}_{n}:=\frac{1}{n}\sum_{k=1}^{n}U_{k}$ be the empirical mean and $\hat{V}_{n}:=\frac{1}{n}\sum_{k=1}^{n}(U_{k}-\hat{U}_{n})^{2}$ the population variance. Then,

$$\forall\delta>0,\forall T>0,\quad\mathbf{P}\left(\exists t\leq T,\sum\nolimits_{i=1}^{t}U_{i}\geq\sqrt{2t\hat{V}_{t}\log\left(\tfrac{3T}{\delta}\right)}+3c\log\left(\tfrac{3T}{\delta}\right)\right)\leq\delta.$$
Proof.

This is obtained by applying Lemma 38 with confidence level $\delta/T$ for each $n\leq T$, followed by a union bound over the values of $n\leq T$. ∎

Lemma 37 (Time-uniform Empirical Likelihoods, Jonsson et al. [2020]).

Let $q$ be a distribution on $\left\{1,\ldots,d\right\}$. Let $(U_{t})$ be a sequence of i.i.d. random variables with distribution $q$, and let $\hat{q}_{n}$ denote the empirical distribution of $U_{1},\ldots,U_{n}$. Then:

$$\forall\delta>0,\quad\mathbf{P}\left(\exists n\geq 1,n\operatorname{KL}(\hat{q}_{n}||q)>\log\left(\tfrac{1}{\delta}\right)+(d-1)\log\left(e\left(1+\tfrac{n}{d-1}\right)\right)\right)\leq\delta.$$
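To make the two sides of Lemma 37 concrete, the sketch below computes the Kullback-Leibler divergence between two finite distributions and the right-hand side $\log(1/\delta)+(d-1)\log\left(e\left(1+\tfrac{n}{d-1}\right)\right)$. It is only an illustration: the function names are hypothetical, and the code assumes $d\geq 2$ and $q_{i}>0$ wherever the first argument puts mass.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for finite distributions; terms with p_i = 0 contribute 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def likelihood_threshold(n, d, delta):
    """Right-hand side of Lemma 37 (requires d >= 2)."""
    return math.log(1.0 / delta) + (d - 1) * math.log(math.e * (1.0 + n / (d - 1)))


def within_confidence_region(q_hat, q, n, delta):
    """True when n * KL(q_hat || q) does not exceed the threshold."""
    return n * kl_divergence(q_hat, q) <= likelihood_threshold(n, len(q), delta)
```

For instance, `within_confidence_region(q, q, n, delta)` holds for any admissible inputs, since the divergence of a distribution from itself is zero and the threshold is positive for $\delta<1$.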
Lemma 38 (Empirical Bernstein inequality, Audibert et al. [2009]).

Let $(U_{k})_{k\geq 1}$ be a martingale difference sequence such that $\mathrm{sp}\left(U_{n}\right)\leq c$ a.s., let $\hat{U}_{n}:=\frac{1}{n}\sum_{k=1}^{n}U_{k}$ be the empirical mean and $\hat{V}_{n}:=\frac{1}{n}\sum_{k=1}^{n}(U_{k}-\hat{U}_{n})^{2}$ the population variance. Then,

$$\forall\delta>0,\forall n\geq 1,\quad\mathbf{P}\left(\sum\nolimits_{k=1}^{n}U_{k}\geq\sqrt{2n\hat{V}_{n}\log\left(\tfrac{3}{\delta}\right)}+3c\log\left(\tfrac{3}{\delta}\right)\right)\leq\delta.$$
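The deviation level of Lemma 38 depends on the data only through $n$ and $\hat{V}_{n}$, so it can be evaluated directly from a sample. A minimal sketch follows; the function name is hypothetical and the variance uses the $\tfrac{1}{n}$ normalisation of the statement.

```python
import math


def empirical_bernstein_threshold(samples, c, delta):
    """sqrt(2 n V_hat log(3/delta)) + 3 c log(3/delta), as in Lemma 38."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((u - mean) ** 2 for u in samples) / n  # 1/n normalisation
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * n * variance * log_term) + 3.0 * c * log_term
```

Calling the function with `delta / T` in place of `delta` reproduces the $\log\left(\tfrac{3T}{\delta}\right)$ terms of Lemma 36.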
Lemma 39 (Bennett’s inequality, Audibert et al. [2009]).

Let $(U_{t})_{t\geq 0}$ be a martingale difference sequence such that $\left|U_{t}\right|\leq c$ a.s., and denote its conditional variance by $V_{t}:=\mathbf{E}[U_{t}^{2}|\mathcal{F}_{t-1}]$. Then,

$$\forall\delta>0,\forall n\geq 1,\quad\mathbf{P}\left(\exists k\leq n,\sum\nolimits_{i=1}^{k}U_{i}\geq\sqrt{2\sum\nolimits_{i=1}^{n}V_{i}\log\left(\tfrac{1}{\delta}\right)}+\tfrac{1}{3}c\log\left(\tfrac{1}{\delta}\right)\right)\leq\delta.$$
Lemma 40 (Lemma 3 of Zhang and Xie [2023]).

Let $(U_{t})$ be a sequence of random variables such that $0\leq U_{t}\leq c$ a.s., and let $\mathcal{F}_{t}:=\sigma(U_{0},U_{1},\ldots,U_{t-1})$. Then:

$$\begin{aligned}
&\forall\delta>0,\quad\mathbf{P}\left(\exists T\geq 0,\sum\nolimits_{t=0}^{T-1}U_{t}\geq 3\sum\nolimits_{t=0}^{T-1}\mathbf{E}[U_{t}|\mathcal{F}_{t-1}]+c\log\left(\tfrac{1}{\delta}\right)\right)\leq\delta; \\
&\forall\delta>0,\quad\mathbf{P}\left(\exists T\geq 0,\sum\nolimits_{t=0}^{T-1}\mathbf{E}[U_{t}|\mathcal{F}_{t-1}]\geq 3\sum\nolimits_{t=0}^{T-1}U_{t}+c\log\left(\tfrac{1}{\delta}\right)\right)\leq\delta.
\end{aligned}$$