
Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Victor Boone
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, 38000 Grenoble, France

Zihan Zhang
Princeton University

Abstract

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms suffer from either sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\operatorname*{{\rm O}}}\left(\!\!\sqrt{{\mathrm{sp}\left(h^{*}\right)}SAT}\right)$, where $\widetilde{\operatorname*{{\rm O}}}(\cdot)$ hides logarithmic factors of $(S,A,T)$, ${\mathrm{sp}\left(h^{*}\right)}$ is the span of the optimal bias function $h^{*}$, $S\times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on ${\mathrm{sp}\left(h^{*}\right)}$.

Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.

1 Introduction

Reinforcement learning (RL) Burnetas and Katehakis (1997); Sutton and Barto (2018) has become a popular approach for solving complex sequential decision-making tasks and has recently achieved notable advancements in diverse fields of application. The RL problem is generally formulated as a Markov Decision Process (MDP) Puterman (1994), where the agent interacts with an unknown environment to maximize its cumulative rewards.

In this paper, we consider the problem of learning average-reward MDPs, where the central task is to balance exploration (i.e., learning the unknown environment) and exploitation (i.e., planning optimally according to current knowledge) along the infinite-horizon learning process. One way to measure the performance of the learner is the regret, which compares the rewards gathered by the learner, unaware of the exact structure of its environment, to the expected performance of an omniscient agent that knows the environment in advance. The seminal work of Auer et al. (2009) provides a minimax regret lower bound $\Omega\left(\!\!\sqrt{DSAT}\right)$, where $D$ is the diameter (the maximal distance between two different states), $S$ the number of states, $A$ the number of actions and $T$ the learning horizon. They also provide an algorithm achieving regret $\widetilde{\operatorname*{{\rm O}}}\left(\!\!\sqrt{D^{2}S^{2}AT}\right)$. Ever since Auer et al. (2009), many works have been devoted to closing the gap between the regret lower and upper bounds in the average reward setting Auer et al. (2009); Bartlett and Tewari (2009); Filippi et al. (2010); Talebi and Maillard (2018); Fruit et al. (2018, 2020); Bourel et al. (2020); Zhang and Ji (2019); Ouyang et al. (2017); Agrawal and Jia (2023); Abbasi-Yadkori et al. (2019); Wei et al. (2020) and more. Subsequent works Fruit et al. (2018); Zhang and Ji (2019) refined the minimax regret lower bound to $\Omega\left(\!\!\sqrt{{\mathrm{sp}\left(h^{*}\right)}SAT}\right)$, where ${\mathrm{sp}\left(h^{*}\right)}$ is the span of the bias function, which is the maximal gap between the long-term cumulative rewards starting from two different states. The difference is significant, since ${\mathrm{sp}\left(h^{*}\right)}\leq D$ and the gap between the two can be arbitrarily large. However, no existing work achieves the following three requirements simultaneously:

(1) The method achieves minimax optimal regret guarantees $\widetilde{\operatorname*{{\rm O}}}\left(\!\!\sqrt{{\mathrm{sp}\left(h^{*}\right)}SAT}\right)$;

(2) The proposed method is tractable;

(3) No prior knowledge on the model is required.

Most algorithms simply fail to achieve minimax optimal regret, and the only method achieving it, Zhang and Ji (2019), is intractable because it relies on an oracle to solve difficult optimization problems along the learning process. Naturally, we raise the question of whether these three requirements can be met all at once:

Is there a tractable algorithm with $\widetilde{\operatorname*{{\rm O}}}\left(\!\!\sqrt{{\mathrm{sp}\left(h^{*}\right)}SAT}\right)$ minimax regret without prior knowledge?

Contributions.

In this paper, we answer the above question affirmatively, by proposing a polynomial time algorithm with regret guarantees $\widetilde{\operatorname*{{\rm O}}}\left(\!\!\sqrt{{\mathrm{sp}\left(h^{*}\right)}SAT}\right)$ for average-reward MDPs. Our method can further incorporate almost arbitrary prior bias information $\mathcal{H}_{*}\subseteq\mathbf{R}^{\mathcal{S}}$ to improve its regret.

Theorem 1 (Informal).

For any $c>0$, provided that the confidence region used by PMEVI-DT satisfies mild regularity conditions (see Assumptions 1-3), if $T\geq c^{5}$, then for every weakly communicating model with bias span less than $c$ and with bias vector within $\mathcal{H}_{*}$, PMEVI-DT $(\mathcal{H}_{*},T)$ achieves regret:

\operatorname*{{\rm O}}\left(\!\!\sqrt{cSAT\log\left(\tfrac{SAT}{\delta}\right)}\right)+\operatorname*{{\rm O}}\left(cS^{\frac{5}{2}}A^{\frac{3}{2}}T^{\frac{9}{20}}\log^{2}\left(\tfrac{SAT}{\delta}\right)\right)

in expectation and with high probability. Moreover, if PMEVI-DT runs with the same confidence regions as UCRL2 Auer et al. (2009), then it enjoys a time complexity of $\operatorname*{{\rm O}}(DS^{3}AT)$.

The geometry of the prior bias region $\mathcal{H}_{*}$ is discussed later (see 4). It can be taken to be trivial, with $\mathcal{H}_{*}=\mathbf{R}^{\mathcal{S}}$, to obtain a completely prior-less algorithm. To the best of our knowledge, this is the first tractable algorithm with minimax optimal regret bounds (up to logarithmic factors). The algorithm does not require any prior knowledge of ${\mathrm{sp}\left(h^{*}\right)}$, thus circumventing the potentially high cost associated with learning ${\mathrm{sp}\left(h^{*}\right)}$. On the technical side, a key novelty of our method is the subroutine named PMEVI (see Algorithm 2), which improves on EVI Auer et al. (2009) and can replace it in any algorithm that relies on it Auer et al. (2009); Fruit et al. (2018); Filippi et al. (2010); Fruit et al. (2020); Bourel et al. (2020) to boost its performance and achieve minimax optimal regret.

Related works.

Here is a short overview of the learning theory of average reward MDPs. For communicating MDPs, the notable work of Auer et al. (2009) proposes the famous UCRL2 algorithm, a mature version of their prior UCRL Auer and Ortner (2006), achieving a regret bound of $\widetilde{\operatorname*{{\rm O}}}(DS\!\!\sqrt{AT})$. This paper pioneered the use of optimistic methods to learn MDPs efficiently. A line of papers Filippi et al. (2010); Fruit et al. (2020); Bourel et al. (2020) developed this direction by tightening the confidence region that UCRL2 relies on, and sharpened its analysis through the use of local properties of MDPs, such as local diameters and local bias variances, but none of these works went beyond regret guarantees of order $S\!\!\sqrt{DAT}$, and all suffer from an extra $\!\!\sqrt{S}$ factor. A parallel direction was initiated by Bartlett and Tewari (2009), who design REGAL to attain ${\mathrm{sp}\left(h^{*}\right)}$-dependent regret bounds (instead of $D$) while extending the regret bounds to weakly-communicating MDPs. The computational intractability of REGAL is addressed by Fruit et al. (2018) with SCAL, while Zhang and Ji (2019) further enhance the regret analysis by evaluating the bias differences with EBF, eventually reaching optimal minimax regret but losing tractability.

Another successful design approach is Bayesian-flavored sampling, derived from Thompson Sampling Thompson (1933), which usually replaces optimism. The regret guarantees of these algorithms, however, usually hold only in the Bayesian setting Ouyang et al. (2017); Theocharous et al. (2017), although Agrawal and Jia (2023) also enjoys $\widetilde{\operatorname*{{\rm O}}}(S\!\!\sqrt{DAT})$ high probability regret by coupling posterior sampling and optimism. Another line of research focuses on the study of ergodic MDPs, where all policies mix uniformly according to a mixing time. To name a few, the model-free algorithm Politex Abbasi-Yadkori et al. (2019) attains a regret of $\widetilde{\operatorname*{{\rm O}}}((t_{\mathrm{mix}})^{3}t_{\mathrm{hit}}\!\!\sqrt{SA}T^{\frac{3}{4}})$. By leveraging an optimistic mirror descent algorithm, Wei et al. (2020) achieve an enhanced regret of $\widetilde{\operatorname*{{\rm O}}}(\!\!\sqrt{(t_{\mathrm{mix}})^{2}t_{\mathrm{hit}}AT})$.

We refer the readers to Table 1 for a (non-exhaustive) list of existing algorithms.

Table 1: Comparison of related works on RL algorithms for average-reward MDPs, where $S\times A$ is the size of the state-action space, $T$ is the total number of steps, $D$ ($D_{s}$) is the (local) diameter, ${\mathrm{sp}\left(h^{*}\right)}\leq D$ is the span of the bias vector, $t_{\mathrm{mix}}$ is the worst-case mixing time, and $t_{\mathrm{hit}}$ is the hitting time (i.e., the expected time needed to visit a given state under any policy).

Algorithm	Regret in $\widetilde{\operatorname*{{\rm O}}}(-)$	Tractable	Comment/Requirements
REGAL Bartlett and Tewari (2009)	${\mathrm{sp}\left(h^{*}\right)}S\!\!\sqrt{AT}$	$\times$	knowledge of ${\mathrm{sp}\left(h^{*}\right)}$
UCRL2 Auer et al. (2009)	$DS\!\!\sqrt{AT}$	$\checkmark$	-
PSRL Agrawal and Jia (2023)	$DS\!\!\sqrt{AT}$	$\checkmark$	Bayesian regret
SCAL Fruit et al. (2018)	${\mathrm{sp}\left(h^{*}\right)}S\!\!\sqrt{AT}$	$\checkmark$	knowledge of ${\mathrm{sp}\left(h^{*}\right)}$
UCRL2B Fruit et al. (2020)	$S\!\!\sqrt{DAT}$	$\checkmark$	extra $\sqrt{\log(T)}$ in upper-bound
UCRL3 Bourel et al. (2020)	$D+\!\!\sqrt{T\sum_{s,a}D_{s}^{2}L_{s,a}}$	$\checkmark$	$L_{s,a}:=\sum_{s^{\prime}}\!\!\sqrt{p(s^{\prime}\mid s,a)(1-p(s^{\prime}\mid s,a))}$
KL-UCRL Filippi et al. (2010); Talebi and Maillard (2018)	$S\!\!\sqrt{DAT}$	$\checkmark$	-
EBF Zhang and Ji (2019)	$\sqrt{{\mathrm{sp}\left(h^{*}\right)}SAT}$	$\times$	optimal, knowledge of ${\mathrm{sp}\left(h^{*}\right)}$
Optimistic-Q Wei et al. (2020)	${\mathrm{sp}\left(h^{*}\right)}(SA)^{\frac{1}{3}}T^{\frac{2}{3}}$	$\checkmark$	model-free
UCB-AVG Zhang and Xie (2023)	$S^{5}A^{2}{\mathrm{sp}\left(h^{*}\right)}\!\!\sqrt{T}$	$\checkmark$	model-free, knowledge of ${\mathrm{sp}\left(h^{*}\right)}$
MDP-OOMD Wei et al. (2020)	$\sqrt{(t_{\mathrm{mix}})^{2}t_{\mathrm{hit}}AT}$	$\checkmark$	ergodic
Politex Abbasi-Yadkori et al. (2019)	$(t_{\mathrm{mix}})^{3}t_{\mathrm{hit}}\!\!\sqrt{SA}T^{\frac{3}{4}}$	$\checkmark$	model-free, ergodic
PMEVI-DT (this work)	$\sqrt{{\mathrm{sp}\left(h^{*}\right)}SAT}$	$\checkmark$	-
Lower bound	$\Omega\left(\!\!\sqrt{{\mathrm{sp}\left(h^{*}\right)}SAT}\right)$	-	-

2 Preliminaries

We fix a finite state-action space structure $\mathcal{X}:=\bigcup_{s\in\mathcal{S}}\left\{s\right\}\times\mathcal{A}(s)$ , and denote $\mathcal{M}$ the collection of all MDPs with state-action space $\mathcal{X}$ and rewards supported in $[0,1]$ .

Infinite-horizon MDP.

An element $M\in\mathcal{M}$ is a tuple $(\mathcal{S},\mathcal{A},p,r)$ where $p$ is the transition kernel and $r$ the reward function. The random state-action pair played by the agent at time $t$ is denoted $X_{t}\equiv(S_{t},A_{t})$, and the achieved reward is $R_{t}$. A policy is a deterministic rule $\pi:\mathcal{S}\to\mathcal{A}$ and we write $\Pi$ for the space of policies. Coupled with an $M\in\mathcal{M}$, a policy properly defines the distribution of $(X_{t})$, whose associated probability and expectation operators are denoted $\mathbf{P}^{\pi}_{s},\mathbf{E}^{\pi}_{s}$, where $s\in\mathcal{S}$ is the initial state. Under $M$, a fixed policy has a reward function $r^{\pi}(s):=r(s,\pi(s))$, a transition matrix $P^{\pi}$, a gain $g^{\pi}(s):=\lim_{T\to\infty}\frac{1}{T}\mathbf{E}^{\pi}_{s}[R_{0}+\ldots+R_{T-1}]$ and a bias $h^{\pi}(s):=\lim_{T\to\infty}\mathbf{E}^{\pi}_{s}[\sum_{t=0}^{T-1}(R_{t}-g^{\pi}(S_{t}))]$, which together satisfy the Poisson equation $h^{\pi}+g^{\pi}=r^{\pi}+P^{\pi}h^{\pi}$, see Puterman (1994). The Bellman operator of the MDP is:

Lu(s):=\max_{a\in\mathcal{A}(s)}\left\{r(s,a)+p(s,a)u\right\}

(1)

Weakly-communicating MDPs.

$M$ is weakly-communicating Puterman (1994); Bartlett and Tewari (2009) if the state space can be divided into two sets: (1) the transient set, consisting of states that are transient under all policies; (2) the non-transient set, where every state is reachable starting from any other non-transient state. In this case, $L$ has a span-fixpoint $h^{*}$ (see Puterman (1994)), i.e., there exists $h^{*}\in\mathbf{R}^{\mathcal{S}}$ such that $Lh^{*}-h^{*}\in\mathbf{R}e$ where $e$ is the vector full of ones. We write $h^{*}\in\operatorname{{Fix}}(L)$. Then $g^{*}:=Lh^{*}-h^{*}$ is the optimal gain function and every policy $\pi$ satisfies $r^{\pi}+P^{\pi}h^{*}\leq g^{*}+h^{*}$. We accordingly define the Bellman gaps:

\Delta^{*}(s,a):=h^{*}(s)+g^{*}(s)-r(s,a)-p(s,a)h^{*}\geq 0.

(2)
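As an illustration of the Bellman operator (1), its span-fixpoint and the Bellman gaps (2), here is a minimal Python sketch on a toy two-state MDP. The toy model `R`, `P`, the function names and the recentring step are assumptions made for this example only and are not taken from the paper; the iteration is plain value iteration stopped on the span of successive differences.

```python
import numpy as np

def span(u):
    """Span of a vector: max(u) - min(u)."""
    return float(np.max(u) - np.min(u))

def bellman(r, p, u):
    """Return (Lu, q) with q[s, a] = r(s, a) + p(s, a) u and Lu = max_a q."""
    q = r + p @ u                      # p has shape (S, A, S), u has shape (S,)
    return q.max(axis=1), q

def value_iteration(r, p, eps=1e-8, max_iter=100_000):
    """Iterate u <- Lu (recentred) until sp(Lu - u) < eps."""
    u = np.zeros(r.shape[0])
    for _ in range(max_iter):
        Lu, q = bellman(r, p, u)
        if span(Lu - u) < eps:
            break
        u = Lu - Lu.min()              # recentring is harmless: L(u + c e) = Lu + c e
    diff = Lu - u
    g = 0.5 * (diff.max() + diff.min())  # approximate optimal gain
    gaps = u[:, None] + g - q            # approximate Bellman gaps, as in (2)
    return g, u, gaps

# Hypothetical toy MDP: 2 states, 2 actions, rewards in [0, 1].
R = np.array([[0.2, 0.0],
              [1.0, 0.5]])
P = np.array([[[1.0, 0.0], [0.1, 0.9]],
              [[0.1, 0.9], [0.0, 1.0]]])

g_star, h_star, gaps = value_iteration(R, P)
print(g_star, h_star, gaps)
```

On this toy model the optimal gain is $0.9$ and the optimal bias has span $1$, which can be checked by hand from the Poisson equation of the policy playing the second action in the first state and the first action in the second state.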

Another important concept is the diameter, which describes the maximal distance from one state to another state. It is given by $D:=\sup_{s\neq s^{\prime}}\inf_{\pi}\mathbf{E}_{s}^{\pi}[\inf\left\{t\geq 1:S_{t}=s^{\prime}\right\}].$ An MDP is said to be communicating if its diameter $D$ is finite.

Reinforcement learning.

The learner is only aware that $M\in\mathcal{M}$ and has no further knowledge of $M$. From the past observations and the current state $S_{t}$, the agent picks an available action $A_{t}\in\mathcal{A}(S_{t})$, receives a reward $R_{t}$ and observes the new state $S_{t+1}$. The regret of the agent is:

\operatorname{{Reg}}(T):=Tg^{*}-\sum_{t=0}^{T-1}R_{t}.

(3)

Its expected value satisfies $\mathbf{E}[\operatorname{{Reg}}(T)]=\mathbf{E}[\sum_{t=0}^{T-1}\Delta^{*}(X_{t})]+\mathbf{E}[h^{*}(S_{T})-h^{*}(S_{0})]$ (sum the identity $g^{*}-r(s,a)=\Delta^{*}(s,a)+p(s,a)h^{*}-h^{*}(s)$, which follows from (2), along the trajectory) and the quantity $\sum_{t=0}^{T-1}\Delta^{*}(X_{t})$ will be referred to as the pseudo-regret. This paper focuses on minimax regret guarantees. Specifically, for $c\geq 1$, denote $\mathcal{M}_{c}:=\left\{M\in\mathcal{M}:\exists h^{*}\in\operatorname{{Fix}}(L(M)),{\mathrm{sp}\left(h^{*}\right)}\leq c\right\}$ the set of weakly-communicating MDPs that admit a bias function with span at most $c$, where the span of a vector $u\in\mathbb{R}^{\mathcal{S}}$ is ${\mathrm{sp}\left(u\right)}:=\max(u)-\min(u)$. Following Auer et al. (2009), for every algorithm $\mathbf{A}$ and all $c>0$, we have

\max_{M\in\mathcal{M}_{c}}\mathbf{E}^{M,\mathbf{A}}[\operatorname{{Reg}}(T)]=\Omega\left(\!\sqrt{cSAT}\right).

(4)

The goal of this work is to reach this lower bound with a tractable algorithm.

3 Algorithm PMEVI-DT

The method designed in this work can be applied to any algorithm relying on extended Bellman operators to compute the deployed policies Auer et al. (2009); Filippi et al. (2010); Fruit et al. (2018); Bourel et al. (2020) and beyond Tewari and Bartlett (2007). We start by reviewing the principles behind these algorithms. These algorithms follow the optimism-in-the-face-of-uncertainty (OFU) principle, meaning that they deploy policies achieving the highest possible gain that is plausible under their current information. This is done by building a confidence region $\mathcal{M}_{t}\subseteq\mathcal{M}$ for the hidden model $M$, then searching for a policy $\pi$ solving the optimization problem:

g^{*}(\mathcal{M}_{t}):=\sup\left\{g^{\pi}(\mathcal{M}_{t}):\pi\in\Pi,{\mathrm{sp}\left(g^{\pi}(\mathcal{M}_{t})\right)}=0\right\}\text{ with }g^{\pi}(\mathcal{M}_{t}):=\sup\left\{g\left(\pi,\widetilde{M}\right):\widetilde{M}\in\mathcal{M}_{t}\right\}.

(5)

The design of the confidence region $\mathcal{M}_{t}$ varies from one work to another. Provided that $\mathcal{M}_{t}$ has been designed, these OFU-algorithms work as follows: at the start of episode $k$, the optimization problem (5) is solved, and its solution $\pi_{k}$ is played until the end of the episode. The duration of episodes can be managed in various ways, although the most popular is arguably the doubling trick (DT), which essentially waits until a state-action pair is about to double the visit count it had at the beginning of the current episode (see Algorithm 1). In the rest of this section, we use $\hat{p}_{t}(s,a)$ (and $\hat{r}_{t}(s,a)$) to denote the empirical transition (and reward) of the latest doubling update before the $t$-th step, and further denote $\hat{M}_{t}:=(\hat{r}_{t},\hat{p}_{t})$.

Extended Bellman operators and EVI.

To solve (5) efficiently, the celebrated work of Auer et al. (2009) introduced the extended value iteration algorithm (EVI). Assume that $\mathcal{M}_{t}$ is an $(s,a)$-rectangular confidence region, meaning that $\mathcal{M}_{t}\equiv\prod_{s,a}(\mathcal{R}_{t}(s,a)\times\mathcal{P}_{t}(s,a))$ where $\mathcal{R}_{t}(s,a)$ and $\mathcal{P}_{t}(s,a)$ are respectively the confidence regions for $r(s,a)$ and $p(s,a)$ after $t$ learning steps. EVI is the algorithm computing the sequence defined by:

v_{i+1}(s)\equiv\mathcal{L}_{t}v_{i}(s):=\max_{a\in\mathcal{A}(s)}\max_{\tilde{r}(s,a)\in\mathcal{R}_{t}(s,a)}\max_{\tilde{p}(s,a)\in\mathcal{P}_{t}(s,a)}\left(\tilde{r}(s,a)+\tilde{p}(s,a)\cdot v_{i}\right)

(6)

until ${\mathrm{sp}\left(v_{i+1}-v_{i}\right)}<\epsilon$ where $\epsilon>0$ is the numerical precision. When the process stops, it is known that any policy $\pi$ such that $\pi(s)$ achieves $\mathcal{L}_{t}v_{i}$ in (6) satisfies $g^{\pi}(\mathcal{M}_{t})\geq g^{*}(\mathcal{M}_{t})-\epsilon$, hence is nearly optimistically optimal. This process gets its name from the observation that $\mathcal{L}_{t}$ is the Bellman operator of $\mathcal{M}_{t}$ seen as an MDP, hence EVI is just the Value Iteration algorithm Puterman (1994) run in $\mathcal{M}_{t}$. A choice of action from $s\in\mathcal{S}$ in $\mathcal{M}_{t}$ consists of (1) a choice of action $a\in\mathcal{A}(s)$, (2) a choice of reward $\tilde{r}(s,a)\in\mathcal{R}_{t}(s,a)$ and (3) a choice of transition $\tilde{p}(s,a)\in\mathcal{P}_{t}(s,a)$; it is an extended version of $\mathcal{A}(s)$.
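The iteration (6) can be sketched in Python as follows. This is only an illustration under explicit assumptions: the transition regions $\mathcal{P}_{t}(s,a)$ are taken to be $\ell_{1}$-balls around the empirical kernel (one possible choice, in the style of UCRL2), the reward regions are intervals capped at $1$, and all names (`optimistic_transition`, `evi`, `p_radius`, `r_bonus`) are hypothetical. The recentring of the iterates is added for numerical convenience and is not part of (6).

```python
import numpy as np

def span(u):
    return float(np.max(u) - np.min(u))

def optimistic_transition(p_hat, radius, v):
    """Maximise p . v over {p in simplex : ||p - p_hat||_1 <= radius} (assumed L1 region)."""
    p = np.array(p_hat, dtype=float)
    order = np.argsort(-v)                     # states sorted by decreasing value
    best = order[0]
    p[best] = min(1.0, p[best] + radius / 2)   # push mass onto the best state
    for s in order[::-1]:                      # remove the excess from the worst states
        excess = p.sum() - 1.0
        if excess <= 1e-12:
            break
        p[s] = max(0.0, p[s] - excess)
    return p

def evi(r_hat, r_bonus, p_hat, p_radius, eps, max_iter=10_000):
    """Extended value iteration; returns the last iterate and a greedy policy."""
    n_states, n_actions = r_hat.shape
    r_up = np.minimum(1.0, r_hat + r_bonus)    # optimistic rewards, capped at 1
    v = np.zeros(n_states)
    q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        for s in range(n_states):
            for a in range(n_actions):
                p_opt = optimistic_transition(p_hat[s, a], p_radius[s, a], v)
                q[s, a] = r_up[s, a] + p_opt @ v
        v_next = q.max(axis=1)
        if span(v_next - v) < eps:
            return v_next, q.argmax(axis=1)
        v = v_next - v_next.min()              # recentring (not part of (6))
    return v, q.argmax(axis=1)

# Hypothetical empirical model: 2 states, 2 actions.
R_HAT = np.array([[0.2, 0.0], [1.0, 0.5]])
P_HAT = np.array([[[1.0, 0.0], [0.1, 0.9]],
                  [[0.1, 0.9], [0.0, 1.0]]])
zeros = np.zeros((2, 2))
v, policy = evi(R_HAT, zeros, P_HAT, zeros, eps=1e-8)
print(v, policy)
```

With zero radii and zero bonuses, the confidence region reduces to the empirical model and the sketch is ordinary value iteration; enlarging `p_radius` or `r_bonus` makes the returned values more optimistic.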

Towards Projected Mitigated EVI.

Obviously, the regret of an OFU-algorithm is directly related to the quality of the confidence region $\mathcal{M}_{t}$. That is why most previous works tried to approach the regret lower bound $\!\!\sqrt{DSAT}$ of Auer et al. (2009) by refining $\mathcal{M}_{t}$. The older works of Auer et al. (2009); Bartlett and Tewari (2009); Filippi et al. (2010) have been improved with a variance-aware analysis Talebi and Maillard (2018); Fruit et al. (2018, 2020); Bourel et al. (2020) that essentially makes use of tightened kernel confidence regions $\mathcal{P}_{t}$. While all these algorithms successively reduce the gap between the regret upper and lower bounds, they fail to achieve optimal regret $\!\!\sqrt{DSAT}$. Meanwhile, the EBF algorithm of Zhang and Ji (2019) achieves the lower bound but (1) the algorithm is intractable because it relies on an oracle to retrieve optimistically optimal policies and (2) it needs prior information on the bias function. Nonetheless, the method of Zhang and Ji (2019) strongly suggests that inferring bias information from the available data is key to achieving minimax optimal regret.

Rather surprisingly and in opposition to this previous line of work, our work suggests that the choice of the confidence region $\mathcal{M}_{t}$ has little importance. Instead, our algorithm takes an arbitrary (well-behaved) confidence region as input, infers bias information similarly to EBF Zhang and Ji (2019) and makes use of it to heavily refine the extended Bellman operator (6) associated to the input confidence region. Our algorithm can further take arbitrary prior information (possibly none) on the bias vector to tighten its bias confidence region. The pseudo-code given in Algorithm 1 is the high-level structure of our algorithm PMEVI-DT. In Section 3.1, we explain how (6) is refined using bias information and in Section 3.2, we explain how bias information is obtained.

Algorithm 1: PMEVI-DT $(\mathcal{H}_{*},T,t\mapsto\mathcal{M}_{t})$

Parameters: Bias prior $\mathcal{H}_{*}$, horizon $T$, a system of confidence regions $t\mapsto\mathcal{M}_{t}$

1: for $k=1,2,\ldots$ do
2:   Set $t_{k}\leftarrow t$, update confidence region $\mathcal{M}_{t}$;
3:   $\mathcal{H}^{\prime}_{t}\leftarrow\texttt{BiasEstimation}(\mathcal{F}_{t},\mathcal{M}_{t},\delta)$;
4:   $\mathcal{H}_{t}\leftarrow\mathcal{H}_{*}\cap\{u:{\mathrm{sp}\left(u\right)}\leq T^{1/5}\}\cap\mathcal{H}_{t}^{\prime}$;
5:   $\Gamma_{t}\leftarrow\texttt{BiasProjection}(\mathcal{H}_{t},-)$;
6:   $\beta_{t}\leftarrow\texttt{VarianceApprox}(\mathcal{H}^{\prime}_{t},\mathcal{F}_{t})$;
7:   $\mathfrak{h}_{k}\leftarrow\texttt{PMEVI}(\mathcal{M}_{t},\beta_{t},\Gamma_{t},\!\!\sqrt{\log(t)/t})$;
8:   $\mathfrak{g}_{k}\leftarrow\mathfrak{L}_{t}\mathfrak{h}_{k}-\mathfrak{h}_{k}$;
9:   Update policy $\pi_{k}\leftarrow\texttt{Greedy}(\mathcal{M}_{t},\mathfrak{h}_{k},\beta_{t})$;
10:  repeat
11:    Play $A_{t}\leftarrow\pi_{k}(S_{t})$, observe $R_{t},S_{t+1}$;
12:    Increment $t\leftarrow t+1$;
13:  until (DT) $N_{t}(S_{t},\pi_{k}(S_{t}))\geq 1\vee 2N_{t_{k}}(X_{t})$
14: end for
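The episode structure of Algorithm 1 (lines 10 to 13) can be sketched as below. This is a minimal illustration of the doubling-trick stopping rule only, under the assumption that the policy is already computed; the environment `env_step`, the array `counts` and the function name `play_episode` are hypothetical and introduced for this example.

```python
import numpy as np

def play_episode(env_step, policy, state, counts):
    """Play `policy` until the (DT) rule fires; return (final state, episode length).

    counts[s, a] is the visit count N_t(s, a) and is updated in place.
    """
    start = counts.copy()                     # N_{t_k}: counts at the start of the episode
    steps = 0
    while True:
        action = policy[state]
        next_state = env_step(state, action)  # the reward is ignored in this sketch
        counts[state, action] += 1
        state = next_state
        steps += 1
        upcoming = (state, policy[state])     # pair about to be played
        if counts[upcoming] >= max(1, 2 * start[upcoming]):
            return state, steps

# Hypothetical one-state, one-action environment: episode lengths grow geometrically.
counts = np.zeros((1, 1), dtype=int)
lengths = []
state = 0
for _ in range(4):
    state, n = play_episode(lambda s, a: 0, [0], state, counts)
    lengths.append(n)
print(lengths)  # [1, 1, 2, 4]
```

Because an episode ends as soon as the pair about to be played has doubled its count, the number of episodes grows only logarithmically with the number of steps for each state-action pair.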

Algorithm 2: PMEVI $(\mathcal{M},\beta,\Gamma,\epsilon)$

Parameters: region $\mathcal{M}$, mitigation $\beta$, projection $\Gamma$, precision $\epsilon$, initial vector $v_{0}$ (optional)

1: if $v_{0}$ not initialized then set $v_{0}\leftarrow 0$;
2: $n\leftarrow 0$;
3: $\mathcal{L}\leftarrow$ extended operator associated to $\mathcal{M}$;
4: repeat
5:   $v_{n+\frac{1}{2}}\leftarrow\mathcal{L}^{\beta}v_{n}$;
6:   $v_{n+1}\leftarrow\Gamma v_{n+\frac{1}{2}}$;
7:   $n\leftarrow n+1$;
8: until ${\mathrm{sp}\left(v_{n}-v_{n-1}\right)}<\epsilon$
9: return $v_{n}$
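The loop of Algorithm 2 can be sketched generically in Python, with the mitigated operator and the projection passed in as callables. Everything below the function `pmevi` is a stand-in chosen for illustration and is an assumption of this example: the "operator" is the plain Bellman operator of a toy MDP (no confidence region, no mitigation), and the "projection" is the span truncation $u\mapsto\min\{u,\min(u)+c\}$ rather than the projection built in Algorithm 4.

```python
import numpy as np

def pmevi(mitigated_op, project, n_states, eps, v0=None, max_iter=10_000):
    """Iterate v <- project(mitigated_op(v)) until sp(v_n - v_{n-1}) < eps."""
    v = np.zeros(n_states) if v0 is None else np.array(v0, dtype=float)
    for _ in range(max_iter):
        v_half = mitigated_op(v)       # v_{n+1/2}
        v_next = project(v_half)       # v_{n+1}
        diff = v_next - v
        v = v_next
        if diff.max() - diff.min() < eps:
            break
    return v

# Toy stand-ins (hypothetical): plain Bellman operator and span truncation.
R = np.array([[0.2, 0.0], [1.0, 0.5]])
P = np.array([[[1.0, 0.0], [0.1, 0.9]],
              [[0.1, 0.9], [0.0, 1.0]]])

def toy_operator(u):
    return (R + P @ u).max(axis=1)

def span_clip(c):
    return lambda u: np.minimum(u, u.min() + c)

h_tight = pmevi(toy_operator, span_clip(0.5), n_states=2, eps=1e-8)
h_loose = pmevi(toy_operator, span_clip(2.0), n_states=2, eps=1e-8)
print(h_tight, h_loose, toy_operator(h_loose) - h_loose)
```

In this toy run, the loose constraint (span at most $2$) leaves the fixpoint of the unprojected operator untouched, while the tight one (span at most $0.5$) returns a vector whose span is exactly the allowed bound.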

3.1 Projected mitigated extended value iteration (PMEVI)

Assume that an external mechanism provides a confidence region $\mathcal{H}_{t}$ for the bias function $h^{*}$. Provided that $\mathcal{M}_{t}$ is correct ($M\in\mathcal{M}_{t}$) and that $\mathcal{H}_{t}$ is correct ($h^{*}\in\mathcal{H}_{t}$), we want to find a policy-model pair $(\pi,\tilde{M})$ that maximizes the gain and such that $h(\pi,\tilde{M})\in\mathcal{H}_{t}$. This is done with an improved version of (6) combining two ideas.

1. Projection (Section 3.2). Whenever it is correct, the bias confidence region $\mathcal{H}_{t}$ informs the learner that the search of an optimistic model can be constrained to those with bias within $\mathcal{H}_{t}$. This is done by projecting $\mathcal{L}_{t}^{\beta}$ (see mitigation) using an operator $\Gamma_{t}:\mathbf{R}^{\mathcal{S}}\to\mathcal{H}_{t}$, which has to satisfy a few non-trivial regularity conditions that are specified in Proposition 2.

2. Mitigation (Section 3.3). When one is aware that $h^{*}\in\mathcal{H}_{t}$, the dynamical bias update $\tilde{p}(s,a)v_{i}$ in (6) can be controlled better, by trying to restrict (6) to some $\tilde{p}(s,a)$ such that $\tilde{p}(s,a)v_{i}\leq\hat{p}_{t}(s,a)v_{i}+(p(s,a)-\hat{p}_{t}(s,a))v_{i}$ with the knowledge that $v_{i}\in\mathcal{H}_{t}$.

For a fixed $u\in\mathbf{R}^{\mathcal{S}}$ , the empirical Bernstein inequality (Lemma 38) provides a variance bound of the form $(\hat{p}_{t}(s,a)-p(s,a))u\leq\beta_{t}(s,a,u)$ . By computing $\beta_{t}(s,a):=\max_{u\in\mathcal{H}_{t}}\beta_{t}(s,a,u)$ , the search makes sure that $(\hat{p}_{t}(s,a)-p(s,a))h^{*}\leq\beta_{t}(s,a)$ even though $h^{*}$ is unknown. For $\beta\in\mathbf{R}_{+}^{\mathcal{X}}$ , we introduce the $\beta$ -mitigated extended Bellman operator:

\mathcal{L}_{t}^{\beta}u(s):=\max_{a\in\mathcal{A}(s)}\sup_{\tilde{r}(s,a)\in\mathcal{R}_{t}(s,a)}\sup_{\tilde{p}(s,a)\in\mathcal{P}_{t}(s,a)}\Big\{\tilde{r}(s,a)+\min\left\{\tilde{p}(s,a)u,\hat{p}_{t}(s,a)u+\beta(s,a)\right\}\Big\}

(7)
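A short worked remark, offered as a reading aid rather than as part of the original argument, shows why (7) is no harder to evaluate than (6). Fix $(s,a)$ and $u$, write $\beta$ for the mitigation vector, and let $\kappa$ (a symbol introduced here) denote the mitigation cap, which does not depend on $\tilde{p}$. Since $x\mapsto\min\{x,\kappa\}$ is nondecreasing and continuous, the supremum over $\tilde{p}$ commutes with the minimum:

```latex
\kappa := \hat{p}_{t}(s,a)u + \beta(s,a),
\qquad
\sup_{\tilde{p}\in\mathcal{P}_{t}(s,a)} \min\bigl\{\tilde{p}\,u,\ \kappa\bigr\}
  = \min\Bigl\{\sup_{\tilde{p}\in\mathcal{P}_{t}(s,a)} \tilde{p}\,u,\ \kappa\Bigr\},
\qquad\text{so}\qquad
\mathcal{L}_{t}^{\beta}u(s)
  = \max_{a\in\mathcal{A}(s)} \Bigl\{ \sup\mathcal{R}_{t}(s,a)
  + \min\Bigl\{\sup_{\tilde{p}\in\mathcal{P}_{t}(s,a)} \tilde{p}\,u,\ \hat{p}_{t}(s,a)u+\beta(s,a)\Bigr\} \Bigr\}.
```

In other words, the mitigated operator caps the optimistic next-state value computed by EVI at the empirical next-state value plus the mitigation term.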

The proposition below shows how well-behaved the composition $\mathfrak{L}_{t}:=\Gamma_{t}\circ\mathcal{L}_{t}^{\beta}$ is. Its proof requires building a complete analysis of projected mitigated Bellman operators, which is deferred to the appendix.

Proposition 2.

Fix $\beta\in\mathbf{R}_{+}^{\mathcal{X}}$ and assume that there exists a projection operator $\Gamma_{t}:\mathbf{R}^{\mathcal{S}}\to\mathcal{H}_{t}$ which is (O1) monotone: $u\leq v\Rightarrow\Gamma u\leq\Gamma v$; (O2) non span-expansive: ${\mathrm{sp}\left(\Gamma u-\Gamma v\right)}\leq{\mathrm{sp}\left(u-v\right)}$; (O3) linear: $\Gamma(u+\lambda e)=\Gamma u+\lambda e$; and (O4) $\Gamma u\leq u$. Then, the projected mitigated extended Bellman operator $\mathfrak{L}_{t}:=\Gamma_{t}\circ\mathcal{L}_{t}^{\beta}$ has the following properties:

(1) There exists a unique $\mathfrak{g}_{t}\in\mathbf{R}e$ such that $\exists\mathfrak{h}_{t}\in\mathcal{H}_{t},\mathfrak{L}_{t}\mathfrak{h}_{t}=\mathfrak{h}_{t}+\mathfrak{g}_{t}$;

(2) If $M\in\mathcal{M}_{t}$, $h^{*}\in\mathcal{H}_{t}$ and $(\hat{p}_{t}(s,a)-p(s,a))h^{*}\leq\beta_{t}(s,a)$, then $\mathfrak{g}_{t}\geq g^{*}(M)$;

(3) If $\mathcal{M}_{t}$ is convex, then for all $u\in\mathbf{R}^{\mathcal{S}}$, the policy $\pi=:\texttt{Greedy}(\mathcal{M}_{t},u,\beta_{t})$ picking the actions achieving $\mathcal{L}_{t}^{\beta}u$ satisfies $\mathfrak{L}_{t}u=\tilde{r}^{\pi}+\tilde{P}^{\pi}u$ for $\tilde{r}^{\pi}(s)\leq\sup\mathcal{R}_{t}(s,\pi(s))$ and $\tilde{P}^{\pi}(s)\in\mathcal{P}_{t}(s,\pi(s))$;

(4) For all $u\in\mathbf{R}^{\mathcal{S}}$ and $n\geq 0$, ${\mathrm{sp}\left(\mathfrak{L}_{t}^{n+1}u-\mathfrak{L}_{t}^{n}u\right)}\leq{\mathrm{sp}\left(\mathcal{L}_{t}^{n+1}u-\mathcal{L}_{t}^{n}u\right)}$.

Property (1) guarantees that $\mathfrak{L}_{t}$ has a fix-point, while (2) states that this fix-point corresponds to an optimistic gain $\mathfrak{g}_{t}$ if the model and the bias confidence region are correct and the mitigation is not too aggressive. Combined with (3), the Poisson equation of a policy corresponds to this fix-point, i.e., $\tilde{r}^{\pi}+\tilde{P}^{\pi}\mathfrak{h}_{t}=\mathfrak{h}_{t}+\mathfrak{g}_{t}$, so that $\mathfrak{g}_{t}$ is the gain and $\mathfrak{h}_{t}\in\mathcal{H}_{t}$ is a legal bias for $\pi$ under the model $(\tilde{r}^{\pi},\tilde{P}^{\pi})$. Lastly, property (4) guarantees that the iterates $\mathfrak{L}_{t}^{n}u$ converge to a fix-point of $\mathfrak{L}_{t}$ at least as quickly as $\mathcal{L}_{t}^{n}u$ goes to a fix-point of $\mathcal{L}_{t}$; the convergence of $\mathcal{L}_{t}^{n}u$ is already guaranteed by existing studies and is discussed in the appendix.

Provided that the bias confidence region is constructed, Proposition 2 foreshadows how powerful the construction is: the algorithm PMEVI, obtained by iterating $\mathfrak{L}_{t}$ instead of $\mathcal{L}_{t}$ in EVI, can replace the well-known EVI within any algorithm of the literature that relies on it (UCRL2 Auer et al. (2009), UCRL2B Fruit et al. (2020) or KL-UCRL Filippi et al. (2010)) for an immediate improvement of its theoretical guarantees.

3.2 Building the bias confidence region and its projection operator

The bias confidence region used by PMEVI-DT is obtained as a collection of constraints of the form:

\forall s\neq s^{\prime},\quad\mathfrak{h}(s)-\mathfrak{h}(s^{\prime})-c(s,s^{\prime})\leq d(s,s^{\prime}).

(8)

Such constraints include (1) prior bias constraints (if any) of the form $\mathfrak{h}(s)-\mathfrak{h}(s^{\prime})\leq c_{*}(s,s^{\prime})$; (2) span constraints of the form $\mathfrak{h}(s)-\mathfrak{h}(s^{\prime})\leq c_{0}:=T^{1/5}$, generating the span semi-ball $\{u:{\mathrm{sp}\left(u\right)}\leq T^{1/5}\}$; and (3) pair-wise constraints obtained by estimating bias differences in the style of Zhang and Ji (2019); Zhang and Xie (2023), which we further improve. We start by defining a bias difference estimator.

Definition 1 (Bias difference estimator).

Given a pair of states $s\neq s^{\prime}$, their sequence of commute times $(\tau_{i}^{s\leftrightarrow s^{\prime}})_{i\geq 0}$ is defined by $\tau_{2i}^{s\leftrightarrow s^{\prime}}:=\inf\{t>\tau_{2i-1}^{s\leftrightarrow s^{\prime}}:S_{t}=s\}$ and $\tau_{2i+1}^{s\leftrightarrow s^{\prime}}:=\inf\{t>\tau_{2i}^{s\leftrightarrow s^{\prime}}:S_{t}=s^{\prime}\}$, with the convention that $\tau_{-1}^{s\leftrightarrow s^{\prime}}=-\infty$. The number of commutations up to time $t$ is $N_{t}(s\leftrightarrow s^{\prime}):=\sup\{i:\tau_{i}^{s\leftrightarrow s^{\prime}}\leq t\}$, and $\hat{g}(t):=\frac{1}{t}\sum_{i=0}^{t-1}R_{i}$ is the empirical gain. The bias difference estimator at time $T$ is any quantity $c_{T}(s,s^{\prime})\in\mathbf{R}$ such that:

N_{T}(s\leftrightarrow s^{\prime})c_{T}(s,s^{\prime})=\sum\nolimits_{i=0}^{N_{T}(s\leftrightarrow s^{\prime})-1}(-1)^{i}\sum\nolimits_{t=\tau_{i}^{s\leftrightarrow s^{\prime}}}^{\tau_{i+1}^{s\leftrightarrow s^{\prime}}-1}(\hat{g}(T)-R_{t}).

(9)
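To make Definition 1 concrete, the sketch below transcribes the commute times and the right-hand side of (9) literally, reading the outer sum as running over the commute index $i$ and counting $N_{T}$ as the index of the last commute time observed up to $T$. The function names and the toy trajectory are hypothetical, and the sketch makes no claim beyond the arithmetic of (9).

```python
def commute_times(states, s, s_prime):
    """tau_0 is the first visit to s, tau_1 the next visit to s_prime, and so on."""
    taus, target = [], s
    for t, x in enumerate(states):
        if x == target:
            taus.append(t)
            target = s_prime if target == s else s
    return taus

def bias_difference(states, rewards, s, s_prime):
    """Evaluate (9): alternating sums of (g_hat - R_t) over commute segments.

    states = [S_0, ..., S_T] and rewards = [R_0, ..., R_{T-1}].
    Returns None when no commute has been completed (any value then satisfies (9)).
    """
    horizon = len(rewards)
    g_hat = sum(rewards) / horizon                  # empirical gain g_hat(T)
    taus = commute_times(states, s, s_prime)
    n_commutes = len(taus) - 1                      # index of the last commute time
    if n_commutes <= 0:
        return None
    total = 0.0
    for i in range(n_commutes):
        segment = rewards[taus[i]:taus[i + 1]]      # t = tau_i, ..., tau_{i+1} - 1
        total += (-1) ** i * sum(g_hat - r for r in segment)
    return total / n_commutes

# Hypothetical trajectory alternating between states 0 and 1.
states = [0, 1, 0, 1, 0]
rewards = [1.0, 0.0, 1.0, 0.0]
c = bias_difference(states, rewards, 0, 1)
print(commute_times(states, 0, 1), c)
```

The estimator needs only the trajectory and the rewards, which is why it can be recomputed at the start of every episode.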

Lemma 3.

With probability $1-2\delta$ , for all $T^{\prime}\leq T$ and all $\tilde{g}\geq g^{*}$ , we have:

N_{T^{\prime}}(s\leftrightarrow s^{\prime})\left|h^{*}(s)-h^{*}(s^{\prime})-c_{T^{\prime}}(s,s^{\prime})\right|\leq 3{\mathrm{sp}\left(h^{*}\right)}+(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{8T\log(\tfrac{2}{\delta})}+2\sum\nolimits_{t=0}^{T^{\prime}-1}(\tilde{g}-R_{t}).

(10)

Lemma 3 says that the quality of the estimator $c_{T}(s,s^{\prime})$ is directly linked to the number of observed commutes between $s$ and $s^{\prime}$ as well as the regret. The idea is that if the algorithm makes many commutes between $s$ and $s^{\prime}$ and if its regret is small, then the algorithm mostly takes optimal paths from $s$ to $s^{\prime}$ . The bound provided by Lemma 3 is not accessible to the learner however, because ${\mathrm{sp}\left(h^{*}\right)}$ is unknown in general. To overcome this issue, ${\mathrm{sp}\left(h^{*}\right)}$ is upper-bounded by $c_{0}:=T^{1/5}$ . Overall, this leads to the design of the algorithm estimating the bias confidence region as specified in Algorithm 3.

Algorithm 3: BiasEstimation $(\mathcal{F}_{t},\mathcal{M}_{t},\delta)$

Parameters: History $\mathcal{F}_{t}$, model region $\mathcal{M}_{t}$, confidence $\delta>0$

1: Estimate bias differences $c_{t}$ via (9);
2: Estimate optimistic gain $\tilde{g}\leftarrow\min_{k<K(t)}\mathfrak{g}_{k}$;
3: Inner regret estimation $B_{0}\leftarrow t\tilde{g}-\sum_{i=0}^{t-1}R_{i}$;
4: $\ell\leftarrow\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}$; $c_{0}\leftarrow T^{\frac{1}{5}}$;
5: Estimate the bias difference errors as: $\displaystyle d_{t}(s,s^{\prime})\equiv\text{error}(c_{t},s,s^{\prime}):=\frac{3c_{0}+(1+c_{0})(1+\ell)+2B_{0}}{N_{t}(s\leftrightarrow s^{\prime})}$
6: return $(c_{t},\text{error}(c_{t},-,-))$, (8) defines $\mathcal{H}^{\prime}_{t}$

Algorithm 4: BiasProjection $(\mathcal{H}_{t},u)$

Parameters: $\mathcal{H}_{t}$ a collection of linear constraints (8), $u\in\mathbf{R}^{\mathcal{S}}$ to project

1: $v\leftarrow 0^{\mathcal{S}}$;
2: for $s\in\mathcal{S}$ do
3:   Using linear programming, compute:
4:   $v(s)\leftarrow\sup\left\{w(s):w\leq u\text{ and }w\in\mathcal{H}_{t}\right\}$;
5: end for
6: return $v$
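Algorithm 4 is stated with a linear program per state. As an illustrative alternative, and not the paper's procedure, the sketch below exploits the fact that constraints of the form (8) are difference constraints: the componentwise largest $w\leq u$ satisfying $w(s)-w(s^{\prime})\leq b(s,s^{\prime})$ can be read off from all-pairs shortest paths. Here `bounds[s, s2]` stands for $c(s,s^{\prime})+d(s,s^{\prime})$, with `np.inf` when there is no constraint; the function names are hypothetical.

```python
import numpy as np

def tighten(bounds):
    """Floyd-Warshall closure of the bounds h(s) - h(s2) <= bounds[s, s2]."""
    c = np.array(bounds, dtype=float)
    np.fill_diagonal(c, 0.0)
    for k in range(c.shape[0]):
        # Chaining h(s) - h(k) and h(k) - h(s2) gives a bound on h(s) - h(s2).
        c = np.minimum(c, c[:, [k]] + c[[k], :])
    if np.any(np.diag(c) < -1e-12):
        raise ValueError("constraints are infeasible: the region is empty")
    return c

def bias_projection(bounds, u):
    """Largest v <= u (componentwise) satisfying all difference constraints."""
    c = tighten(bounds)
    u = np.asarray(u, dtype=float)
    # v(s) = min over s2 of u(s2) + c[s, s2]; the term s2 = s gives v <= u.
    return (c + u[None, :]).min(axis=1)

# Example: the span semi-ball {w : sp(w) <= 3} on three states.
span_bounds = np.full((3, 3), 3.0)
u = np.array([0.0, 5.0, 10.0])
v = bias_projection(span_bounds, u)
print(v)  # [0. 3. 3.]
```

On the span semi-ball this reduces to the truncation $v(s)=\min\{u(s),\min(u)+c_{0}\}$, and the toy checks $v\leq u$ and $\Gamma(u+\lambda e)=\Gamma u+\lambda e$ can be verified numerically on the example.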

Coupled with prior information and span constraints, the obtained bias confidence region $\mathcal{H}_{t}$ is a polyhedron of the same kind as the one encountered in Zhang and Xie (2023) generated by constraints of the form (8), and similarly to their Proposition 3, one can project onto $\mathcal{H}_{t}$ in polynomial time with Algorithm 4. Moreover, the resulting projection operator satisfies the prerequisites (O1-4) of Proposition 2, making sure that PMEVI (Algorithm 2) is well-behaved. This is proved in the appendix Section B.2.

Lemma 4.

Assume that $\mathcal{H}$ is the set of $\mathfrak{h}\in\mathbf{R}^{\mathcal{S}}$ satisfying a system of constraints of the form (8). If $\mathcal{H}$ is nonempty, then the operator $\Gamma u:=\text{{BiasProjection}}(\mathcal{H},u)$ (see Algorithm 4) is a projection on $\mathcal{H}$ and satisfies the properties (O1-4) defined in Proposition 2.

3.3 Mitigation using finer bias dynamical error

The fact that $h^{*}\in\mathcal{H}_{t}$ with high probability is used in PMEVI-DT to restrict the search of EVI by reducing the dynamical bias error. This reduction is based on an empirical Bernstein inequality (see Lemma 38) applied to $(\hat{p}_{t}(s,a)-p(s,a))u$. Here, it gives that with probability $1-\delta$, we have:

\left(\hat{p}_{t}(s,a)-p(s,a)\right)u\leq\sqrt{\frac{2\mathbf{V}(\hat{p}_{t}(s,a),u)\log\left(\tfrac{3T}{\delta}\right)}{\max\left\{1,N_{t}(s,a)\right\}}}+\frac{3{\mathrm{sp}\left(u\right)}\log\left(\tfrac{3T}{\delta}\right)}{\max\left\{1,N_{t}(s,a)\right\}}=:\beta_{t}(s,a,u)

(11)

where $\mathbf{V}(\hat{p}_{t}(s,a),u)$ is the variance of $u$ under the probability vector $\hat{p}_{t}(s,a)$. More specifically, if $q$ is a probability on $\mathcal{S}$ and $u\in\mathbf{R}^{\mathcal{S}}$, we set $\mathbf{V}(q,u):=\sum_{s}q(s)(u(s)-q\cdot u)^{2}$. In (11), $u\in\mathbf{R}^{\mathcal{S}}$, $(s,a)\in\mathcal{X}$ and $T\geq 1$ are fixed. One is tempted to use (11) directly to mitigate the extended Bellman operator, but the resulting operator is ill-behaved because it loses monotonicity. This issue is avoided by changing $\beta_{t}(s,a,u)$ to $\max_{u\in\mathcal{H}_{t}}\beta_{t}(s,a,u)$ in (9). We obtain a variance maximization problem, which is a convex maximization problem with linear constraints. Even in very simple settings, such optimization problems are NP-hard Pardalos and Schnitger (1988), hence computing $\max_{u\in\mathcal{H}_{t}}\beta_{t}(s,a,u)$ is not reasonable in general. Thankfully, this value can be upper-bounded by a tractable quantity that is enough to guarantee regret efficiency. The mitigation $\beta_{t}$ used by PMEVI-DT is provided by Algorithm 5.
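For a fixed vector $u$, the right-hand side of (11) is cheap to evaluate. The sketch below, in plain Python, computes $\mathbf{V}(q,u)$ and $\beta_{t}(s,a,u)$ for a single pair $(s,a)$; the function names are hypothetical and the span is taken to be $\max(u)-\min(u)$.

```python
import math


def variance(q, u):
    """V(q, u) = sum_s q(s) * (u(s) - q.u)^2 for a probability vector q."""
    mean = sum(qs * us for qs, us in zip(q, u))
    return sum(qs * (us - mean) ** 2 for qs, us in zip(q, u))


def span(u):
    """sp(u) = max(u) - min(u)."""
    return max(u) - min(u)


def bernstein_bound(p_hat, u, n_visits, T, delta):
    """Right-hand side beta_t(s, a, u) of (11) for a single pair (s, a)."""
    log_term = math.log(3 * T / delta)
    n = max(1, n_visits)
    return (math.sqrt(2 * variance(p_hat, u) * log_term / n)
            + 3 * span(u) * log_term / n)
```

The difficulty discussed above does not lie in this evaluation but in maximizing it over $u\in\mathcal{H}_{t}$, since the variance is convex in $u$.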

Algorithm 5: VarianceApproximation $(\mathcal{H}^{\prime}_{t},\mathcal{F}_{t})$

Parameters: Bias region $\mathcal{H}^{\prime}_{t}$, history $\mathcal{F}_{t}$

1: Extract constraints $(c,\text{error}(c,-,-))\leftarrow\mathcal{H}^{\prime}_{t}$;

2: Set $c_{0}\leftarrow T^{\frac{1}{5}}$;

3: Pick a reference point $h_{0}\leftarrow\text{{BiasProjection}}(\mathcal{H}_{t},c(-,s_{0}))$;

4: for $(s,a)\in\mathcal{X}$ do

5: $\rho\leftarrow\log\left(\tfrac{SAT}{\delta}\right)/\max\left\{1,N_{t}(s,a)\right\}$;

6: $\text{var}(s,a)\leftarrow\mathbf{V}(\hat{p}_{t}(s,a),h_{0})+8c_{0}\sum_{s^{\prime}\in\mathcal{S}}\hat{p}_{t}(s^{\prime}|s,a)c(s^{\prime},s)$;

7: $\beta_{t}(s,a)\leftarrow\sqrt{2\text{var}(s,a)\rho}+3c_{0}\rho$ (set to $+\infty$ if $N_{t}(s,a)=0$);

8: end for

9: return $\beta_{t}$

4 Regret guarantees

Theorem 5 below shows that PMEVI-DT has minimax optimal regret under regularity assumptions on the confidence region $\mathcal{M}_{t}$ that it uses. Assumption 1 asserts that the confidence region holds uniformly with high probability. Assumption 2 asserts that the reward confidence region is sub-Weissman (see Lemma 35), and Assumption 3 asserts that the model confidence region ensures that EVI (6) converges in the first place. Assumption 4 asserts that the prior bias region is correct.

Assumption 1.

With probability $1-\delta$ , we have $M\in\bigcap_{k=1}^{K(T)}\mathcal{M}_{t_{k}}$ .

Assumption 2.

There exists a constant $C>0$ such that for all $(s,a)\in\mathcal{X}$, for all $t\leq T$, we have:

\mathcal{R}_{t}(s,a)\subseteq\left\{\tilde{r}(s,a)\in\mathcal{R}(s,a):N_{t}(s,a)\left\|\hat{r}_{t}(s,a)-\tilde{r}(s,a)\right\|_{1}^{2}\leq C\log\left(\tfrac{2SA(1+N_{t}(s,a))}{\delta}\right)\right\}.

Assumption 3.

For $t\geq 0$, $\mathcal{M}_{t}$ is an $(s,a)$-rectangular convex region and $\mathcal{L}_{t}^{n}u$ converges to a fixed point.

Assumption 4.

The prior bias region $\mathcal{H}_{*}$ contains $h^{*}(M)$ and is generated by constraints of the form:

\forall s\neq s^{\prime},\quad\mathfrak{h}(s)-\mathfrak{h}(s^{\prime})\leq c_{*}(s,s^{\prime})

with $c_{*}(s,s^{\prime})\in[-\infty,\infty]$ (possibly infinite).

Refer to Section A.2 for the feasibility of Assumption 1, Section A.2.3 for Assumption 2, and Section A.3 for Assumption 3.

Theorem 5 (Main result).

Let $c>0$. Assume that PMEVI-DT runs with a confidence region system $t\mapsto\mathcal{M}_{t}$ that guarantees Assumptions 1-3. If $T\geq c^{5}$, then for every weakly communicating model with ${\mathrm{sp}\left(h^{*}\right)}\leq c$ and such that Assumption 4 is satisfied ($h^{*}\in\mathcal{H}_{*}$), PMEVI-DT achieves regret:

\operatorname*{{\rm O}}\left(\!\!\sqrt{cSAT\log\left(\tfrac{SAT}{\delta}\right)}\right)+\operatorname*{{\rm O}}\left(cS^{\frac{5}{2}}A^{\frac{3}{2}}T^{\frac{9}{20}}\log^{2}\left(\tfrac{SAT}{\delta}\right)\right)

with probability $1-26\delta$, and in expectation if $\delta<\!\!\sqrt{1/T}$. Moreover, if PMEVI-DT runs with the same confidence regions as UCRL2 Auer et al. (2009), then it enjoys a time complexity $\operatorname*{{\rm O}}(DS^{3}AT)$.

To have a completely prior-less algorithm, pick $\mathcal{H}_{*}=\mathbf{R}^{\mathcal{S}}$. The proof of Theorem 5 is too long to fit within these pages, so the complete proof is deferred to the appendix. We focus here on the main ideas.

Figure 1: An overview of PMEVI-DT and its regret analysis. In the above, $\mathfrak{g}_{k}$ and $\mathfrak{h}_{k}$ are the optimistic gain and bias functions produced by PMEVI (see Algorithm 2) at episode $k$, and $\hat{p}_{t_{k}}$ and $\tilde{p}_{t_{k}}$ are respectively the empirical and optimistic kernel models at episode $k$.

We start by introducing notations. At episode $k$, the played policy is denoted $\pi_{k}$. Since it is a greedy response to $\mathfrak{h}_{k}$, by Proposition 2 (3), there exist $\tilde{r}_{k}(s)\leq\sup\mathcal{R}_{t_{k}}(s,\pi_{k}(s))$ and $\tilde{P}_{k}(s)\in\mathcal{P}_{t_{k}}(s,\pi_{k}(s))$ such that $\mathfrak{h}_{k}+\mathfrak{g}_{k}=\tilde{r}_{k}+\tilde{P}_{k}\mathfrak{h}_{k}$. The reward-kernel pair $\tilde{M}_{k}=(\tilde{r}_{k},\tilde{P}_{k})$ is referred to as the optimistic model of $\pi_{k}$. We write $P_{k}:=P_{\pi_{k}}(M)$ for the true kernel and $\hat{P}_{k}:=P_{\pi_{k}}(\hat{M}_{t_{k}})$ for the empirical kernel. Likewise, we define the reward functions $r_{k}$ and $\hat{r}_{k}$. The optimistic gain and bias satisfy $\mathfrak{g}_{k}=g(\pi_{k},\tilde{M}_{k})$ and $\mathfrak{h}_{k}=h(\pi_{k},\tilde{M}_{k})$. We further denote $c_{0}=T^{\frac{1}{5}}$.

The regret is first decomposed episodically with $\operatorname{{Reg}}(T)=\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}(g^{*}-R_{t})$ . The first step goes back to the analysis of UCRL2 Auer et al. (2009), and consists in upper-bounding the regret over episode $k$ with optimistic quantities that are exclusive to that episode.

Lemma 6 (Reward optimism).

With probability $1-6\delta$, we have:

\operatorname{{Reg}}(T)\leq\sum\nolimits_{k}\sum\nolimits_{t=t_{k}}^{t_{k+1}-1}(\mathfrak{g}_{k}-R_{t})\leq\sum\nolimits_{k}\sum\nolimits_{t=t_{k}}^{t_{k+1}-1}(\mathfrak{g}_{k}-\tilde{r}_{k}(X_{t}))+\operatorname*{{\rm O}}\left(\sqrt{SAT\log\left(\tfrac{T}{\delta}\right)}\right).

(12)

We introduce the two optimistic regrets $B(T):=\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}(\mathfrak{g}_{k}-R_{t})$ and $\tilde{B}(T):=\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}(\mathfrak{g}_{k}-\tilde{r}_{k}(X_{t}))$. Rewriting the summand $\mathfrak{g}_{k}-\tilde{r}_{k}(X_{t})$ using the Poisson equation $\mathfrak{h}_{k}+\mathfrak{g}_{k}=\tilde{r}_{k}+\tilde{P}_{k}\mathfrak{h}_{k}$, we get:

\tilde{B}(T)=\sum\nolimits_{k}\sum\nolimits_{t=t_{k}}^{t_{k+1}-1}\left(\tilde{p}_{k}(S_{t})-e_{S_{t}}\right)\mathfrak{h}_{k}.

The analysis proceeds by decomposing the above expression of $\tilde{B}(T)$ in the style of Zhang and Ji (2019). We write $\sum_{t=t_{k}}^{t_{k+1}-1}(\tilde{p}_{k}(S_{t})-e_{S_{t}})\mathfrak{h}_{k}$ as:

\sum\nolimits_{t=t_{k}}^{t_{k+1}-1}\left(\underbrace{\left(p_{k}(S_{t})-e_{S_{t}}\right)\mathfrak{h}_{k}}_{\text{navigation error }(1k)}+\underbrace{\left(\hat{p}_{k}(S_{t})-p_{k}(S_{t})\right)h^{*}}_{\text{empirical bias error }(2k)}+\underbrace{\left(\tilde{p}_{k}(S_{t})-\hat{p}_{k}(S_{t})\right)\mathfrak{h}_{k}}_{\text{optimism overshoot }(3k)}+\underbrace{\left(\hat{p}_{k}(S_{t})-p_{k}(S_{t})\right)(\mathfrak{h}_{k}-h^{*})}_{\text{second order error }(4k)}\right)

Each error term is bounded separately. Below, we denote $\mathbf{V}(q,u):=\sum_{s}q(s)(u(s)-q\cdot u)^{2}$ .
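The four-term decomposition is a purely algebraic identity, which can be sanity-checked numerically. The following Python sketch draws arbitrary kernels and bias vectors for a single time step and verifies that the four summands add up to $(\tilde{p}_{k}(S_{t})-e_{S_{t}})\mathfrak{h}_{k}$. All names are hypothetical and the random data carries no meaning beyond the check.

```python
import random


def dot(p, h):
    return sum(pi * hi for pi, hi in zip(p, h))


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def random_distribution(n, rng):
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def decomposition_terms(p, p_hat, p_tilde, e, h_k, h_star):
    """The four summands (1k)-(4k) for a single time step."""
    navigation = dot(sub(p, e), h_k)
    empirical_bias = dot(sub(p_hat, p), h_star)
    overshoot = dot(sub(p_tilde, p_hat), h_k)
    second_order = dot(sub(p_hat, p), sub(h_k, h_star))
    return navigation, empirical_bias, overshoot, second_order


rng = random.Random(0)
n = 4
p, p_hat, p_tilde = (random_distribution(n, rng) for _ in range(3))
e = [1.0, 0.0, 0.0, 0.0]            # indicator vector of the current state
h_k = [rng.uniform(0, 3) for _ in range(n)]
h_star = [rng.uniform(0, 3) for _ in range(n)]
terms = decomposition_terms(p, p_hat, p_tilde, e, h_k, h_star)
lhs = dot(sub(p_tilde, e), h_k)
gap = abs(sum(terms) - lhs)         # zero up to floating-point rounding
```

The point of the split is that only terms (2k) and (3k) carry a variance of order ${\mathrm{sp}\left(h^{*}\right)}$, while (4k) is of second order.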

Lemma 7 (Navigation error).

With probability $1-7\delta$ , the navigation error is bounded by:

\sum\nolimits_{k}\sum\nolimits_{t=t_{k}}^{t_{k+1}-1}(p_{k}(S_{t})-e_{S_{t}})\mathfrak{h}_{k}\leq\!\!\sqrt{2\sum\nolimits_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac{T}{\delta}\right)}+2SA^{\frac{1}{2}}\!\!\sqrt{3B(T)}\log\left(\tfrac{T}{\delta}\right)+\widetilde{\operatorname*{{\rm O}}}\left(T^{\frac{7}{20}}\right).

Lemma 8 (Empirical bias error).

With probability $1-\delta$ , the empirical bias error is bounded by:

\sum\nolimits_{k}\sum\nolimits_{t=t_{k}}^{t_{k+1}-1}\left(\hat{p}_{k}(S_{t})-p_{k}(S_{t})\right)h^{*}\leq 4\sqrt{SA\sum\nolimits_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac{SAT}{\delta}\right)}+\operatorname*{{\rm O}}\left(\log^{2}(T)\right).

Lemma 9 (Optimism overshoot).

With probability $1-6\delta$ , the optimism overshoot is bounded by:

\sum\nolimits_{k}\sum\nolimits_{t=t_{k}}^{t_{k+1}-1}\left(\tilde{p}_{k}(S_{t})-\hat{p}_{k}(S_{t})\right)\mathfrak{h}_{k}\leq 4\sqrt{2SA\sum\nolimits_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac{SAT}{\delta}\right)}+8(1+c_{0})S^{\frac{3}{2}}A\log^{\frac{3}{2}}\left(\tfrac{SAT}{\delta}\right)\sqrt{B(T)}+\widetilde{\operatorname*{{\rm O}}}\left(T^{\frac{1}{4}}\right).

Lemma 10 (Second order error).

With probability $1-6\delta$ , the second order error is bounded by:

\sum\nolimits_{k}\sum\nolimits_{t=t_{k}}^{t_{k+1}-1}\left(\hat{p}_{k}(S_{t})-p_{k}(S_{t})\right)(\mathfrak{h}_{k}-h^{*})\leq 16S^{2}A(1+c_{0})\log^{\frac{1}{2}}\left(\tfrac{S^{2}AT}{\delta}\right)\sqrt{2B(T)}+\widetilde{\operatorname*{{\rm O}}}\left(T^{\frac{1}{4}}\right).

We see that the empirical bias error (Lemma 8) and the optimism overshoot (Lemma 9) both involve the sum of variances $\sum_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})$, which is shown in Lemma 29 to be of order ${\mathrm{sp}\left(h^{*}\right)}{\mathrm{sp}\left(r\right)}T+\sum_{t=0}^{T-1}\Delta^{*}(X_{t})$. The pseudo-regret term $\sum_{t=0}^{T-1}\Delta^{*}(X_{t})$ is bounded by the regret using Corollary 31, and then by $B(T)$. With high probability, we obtain an inequality of the form:

B(T)\leq C\sqrt{(1+{\mathrm{sp}\left(h^{*}\right)})SAT\log\left(\tfrac{T}{\delta}\right)}+CS^{2}A(1+c_{0})\log^{2}(T)\sqrt{B(T)}+\widetilde{\operatorname*{{\rm O}}}\left(T^{\frac{1}{4}}\right)

where $C$ is a constant. Setting $\alpha:=CS^{2}A(1+c_{0})\log^{2}(T)$ and $\beta:=C\sqrt{(1+{\mathrm{sp}\left(h^{*}\right)})SAT\log(T/\delta)}+\widetilde{\operatorname*{{\rm O}}}(T^{1/4})$, the above inequality is of the form $B(T)\leq\beta+\alpha\sqrt{B(T)}$. Solving in $B(T)$, we find $B(T)\leq\beta+\alpha\sqrt{\beta}+\alpha^{2}$. The dominant term is $\beta$, hence we readily obtain:

B(T)\leq C\sqrt{(1+{\mathrm{sp}\left(h^{*}\right)}){\mathrm{sp}\left(r\right)}SAT\log\left(\tfrac{T}{\delta}\right)}+\widetilde{\operatorname*{{\rm O}}}\left({\mathrm{sp}\left(h^{*}\right)}{\mathrm{sp}\left(r\right)}S^{\frac{5}{2}}A^{\frac{3}{2}}(1+c_{0})T^{\frac{1}{4}}\right).

(13)

Since $c_{0}=\operatorname*{{\rm o}}(T^{\frac{1}{4}})$, we conclude that $B(T)=\operatorname*{{\rm O}}\left(\!\sqrt{{\mathrm{sp}\left(h^{*}\right)}SAT\log(T/\delta)}\right)$, ending the proof.
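The step that solves the self-bounding inequality in $B(T)$ can be spelled out as follows. This is a short sketch assuming only $\alpha,\beta\geq 0$, and it uses $\sqrt{\alpha^{2}+4\beta}\leq\alpha+2\sqrt{\beta}$:

```latex
\begin{align*}
x^{2}\leq\beta+\alpha x,\quad x:=\sqrt{B(T)}
&\implies x\leq\tfrac{1}{2}\left(\alpha+\sqrt{\alpha^{2}+4\beta}\right)\leq\alpha+\sqrt{\beta}\\
&\implies B(T)\leq\beta+\alpha x\leq\beta+\alpha\sqrt{\beta}+\alpha^{2}.
\end{align*}
```

With $\alpha$ of order $S^{2}A(1+c_{0})$ and $\beta$ of order $\sqrt{T}$ up to logarithmic factors, the cross term $\alpha\sqrt{\beta}$ is of order $(1+c_{0})T^{\frac{1}{4}}$, which matches the shape of the second term in (13).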

5 Experimental illustrations

To get a grasp of how PMEVI-DT behaves in practice, we provide in Fig. 2 a few illustrative experiments. In both experiments, the environment is a river-swim, a model known to be hard to learn despite its size, with high diameter and bias span. Its description is found in Bourel et al. (2020) and is reported in the appendix for completeness.

Figure 2: (To the left) Running a few algorithms of the literature on $5$-state river-swim and comparing their average regret against their PMEVI variants, obtained by changing calls to the EVI sub-routine to calls to PMEVI. (To the right) Running UCRL2 and PMEVI-DT with the same confidence region as UCRL2 on a $3$-state river-swim. PMEVI-DT is run with prior knowledge $h^{*}(s_{1})\leq h^{*}(s_{2})-c\leq h^{*}(s_{3})-2c$ for $c\in\left\{0,0.5,1,1.5,2\right\}$.

We observe on the first experiment that PMEVI behaves almost identically to its EVI counterparts when no prior on the bias region is given. This is because most of the regret is due to the earlier learning phase, when bias information is impossible to get (the regret is still growing linearly and the bias estimator is off). Accordingly, the bias confidence region is too large and all projections onto it are trivial during the iterations of PMEVI. Thankfully, this also makes the calls to PMEVI not substantially heavier than calls to EVI from a computational perspective. On the second experiment, we measure the influence of prior bias information on the behavior of PMEVI-DT. We observe that PMEVI-DT is very efficient at using adequate bias prior information to strikingly reduce the burn-in cost of the learning process on this $3$-state river-swim.

References

Abbasi-Yadkori et al. [2019] Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced POLITEX. arXiv preprint arXiv:1908.10479, 2019.
Agrawal and Jia [2023] Shipra Agrawal and Randy Jia. Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds. Mathematics of Operations Research, 48(1):363–392, 2023. Publisher: INFORMS.
Audibert et al. [2009] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009. Publisher: Elsevier.
Auer and Ortner [2006] Peter Auer and Ronald Ortner. Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning. Proceedings of the 19th International Conference on Neural Information Processing Systems, December 2006.
Auer et al. [2009] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal Regret Bounds for Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009.
Azuma [1967] Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357 – 367, 1967. Publisher: Tohoku University, Mathematical Institute.
Bartlett and Tewari [2009] Peter L. Bartlett and Ambuj Tewari. REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 35–42, Arlington, Virginia, USA, June 2009. AUAI Press. ISBN 978-0-9749039-5-8.
Bourel et al. [2020] Hippolyte Bourel, Odalric Maillard, and Mohammad Sadegh Talebi. Tightening Exploration in Upper Confidence Reinforcement Learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1056–1066. PMLR, July 2020.
Burnetas and Katehakis [1997] Apostolos Burnetas and Michael Katehakis. Optimal Adaptive Policies for Markov Decision Processes. Mathematics of Operations Research - MOR, 22:222–255, February 1997.
Cohen et al. [2020] Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time, 2020.
Filippi et al. [2010] Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in Reinforcement Learning and Kullback-Leibler Divergence. 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122, September 2010. arXiv: 1004.5229.
Fruit [2019] Ronan Fruit. Exploration-exploitation dilemma in Reinforcement Learning under various form of prior knowledge. PhD Thesis, Université de Lille 1, Sciences et Technologies; CRIStAL UMR 9189, 2019.
Fruit et al. [2018] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning. Proceedings of the 35th International Conference on Machine Learning, 2018.
Fruit et al. [2020] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved Analysis of UCRL2 with Empirical Bernstein Inequality. ArXiv, abs/2007.05456, 2020.
Jonsson et al. [2020] Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. Planning in markov decision processes with gap-dependent sample complexity. Advances in Neural Information Processing Systems, 33:1253–1263, 2020.
Ouyang et al. [2017] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning Unknown Markov Decision Processes: A Thompson Sampling Approach. arXiv:1709.04570 [cs], September 2017. arXiv: 1709.04570.
Pardalos and Schnitger [1988] Panos M. Pardalos and Georg Schnitger. Checking local optimality in constrained quadratic programming is NP-hard. Operations Research Letters, 7:33–35, 1988.
Puterman [1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1 edition, April 1994. ISBN 978-0-471-61977-2 978-0-470-31688-7.
Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
Talebi and Maillard [2018] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs. Journal of Machine Learning Research, pages 1–36, April 2018. Publisher: Microtome Publishing.
Tewari and Bartlett [2007] Ambuj Tewari and P. Bartlett. Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs. In NIPS, 2007.
Theocharous et al. [2017] Georgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, and Nikos Vlassis. Posterior sampling for large scale reinforcement learning. arXiv preprint arXiv:1711.07979, 2017.
Thompson [1933] William R Thompson. On the Likelihood that One Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3-4):285–294, December 1933. ISSN 0006-3444.
Wei et al. [2020] Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10170–10180. PMLR, July 2020.
Zhang and Ji [2019] Zihan Zhang and Xiangyang Ji. Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Zhang and Xie [2023] Zihan Zhang and Qiaomin Xie. Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes. In The Thirty Sixth Annual Conference on Learning Theory, pages 5476–5477. PMLR, 2023.
Zhang et al. [2020] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition. arXiv:2004.10019 [cs, stat], June 2020. arXiv: 2004.10019.

Appendix


Appendix A Construction of PMEVI-DT

This section provides the technical details required to understand the design of PMEVI-DT in Section 3. We further discuss Assumptions 1-4 appearing in Theorem 5 and provide sufficient conditions under which they are met.

A.1 Proof of Lemma 3, estimation of the bias error

Fix $s,s^{\prime}\in\mathcal{S}$. We denote $\alpha_{T}:=N_{T}(s\leftrightarrow s^{\prime})(h^{*}(s)-h^{*}(s^{\prime})-c_{T}(s,s^{\prime}))$. We start by considering the better estimator $c^{\prime}_{T}(s,s^{\prime})$ that satisfies the same equation (9) as $c_{T}(s,s^{\prime})$ but with $\hat{g}(T)$ changed to $g^{*}$, namely:

N_{T}(s\leftrightarrow s^{\prime})c^{\prime}_{T}(s,s^{\prime})=\sum\nolimits_{i=0}^{N_{T}(s\leftrightarrow s^{\prime})-1}(-1)^{i}\sum\nolimits_{t=\tau_{i}^{s\leftrightarrow s^{\prime}}}^{\tau_{i+1}^{s\leftrightarrow s^{\prime}}-1}(g^{*}-R_{t}).

To avoid typographical clutter, we write $\tau_{i}$ instead of $\tau_{i}^{s\leftrightarrow s^{\prime}}$ in the remainder of the proof and we write $\alpha_{T}^{\prime}:=N_{T}(s\leftrightarrow s^{\prime})(h^{*}(s)-h^{*}(s^{\prime})-c^{\prime}_{T}(s,s^{\prime}))$.

(STEP 1) We start by relating the two estimators. Intuitively, $\hat{g}(T)$ is a good estimator for $g^{*}$ when the regret is small. Recall that $\hat{g}(T):=\frac{1}{T}\sum_{t=0}^{T-1}R_{t}$ , hence:

\sum\nolimits_{t=0}^{T-1}\left|\hat{g}(T)-g^{*}\right|=\left|\sum\nolimits_{t=0}^{T-1}(R_{t}-g^{*})\right|=\left|\operatorname{{Reg}}(T)\right|.

Therefore,

\left|\alpha_{T}\right|\leq\left|\alpha_{T}^{\prime}\right|+\left|\alpha_{T}-\alpha_{T}^{\prime}\right|\leq\left|\alpha_{T}^{\prime}\right|+\sum\nolimits_{t=0}^{T-1}\left|\hat{g}(T)-g^{*}\right|\leq\left|\alpha_{T}^{\prime}\right|+\left|\operatorname{{Reg}}(T)\right|.

We are left with upper-bounding $\left|\alpha_{T}^{\prime}\right|$ .

(STEP 2) If $i$ is even, then $S_{\tau_{i}}=s$ and $S_{\tau_{i+1}}=s^{\prime}$; otherwise $S_{\tau_{i}}=s^{\prime}$ and $S_{\tau_{i+1}}=s$. In both cases, we have $h^{*}(S_{\tau_{i+1}})-h^{*}(S_{\tau_{i}})=(-1)^{i}(h^{*}(s^{\prime})-h^{*}(s))$. Therefore, using Bellman’s equation, the quantity $\text{A}:=\sum_{t=\tau_{i}}^{\tau_{i+1}-1}(g^{*}-R_{t})$ satisfies:

	A	$\displaystyle=\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}\left(p(X_{t})-e_{S_{t}}\right)h^{*}+\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}(r(X_{t})-R_{t})+\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}\Delta^{*}(X_{t})$
		$\displaystyle=\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}\left(e_{S_{t+1}}-e_{S_{t}}\right)h^{*}+\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}\left(p(X_{t})-e_{S_{t+1}}\right)h^{*}+\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}(r(X_{t})-R_{t})+\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}\Delta^{*}(X_{t})$
		$\displaystyle=(-1)^{i}(h^{*}(s^{\prime})-h^{*}(s))+\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}\left(p(X_{t})-e_{S_{t+1}}\right)h^{*}+\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}(r(X_{t})-R_{t})+\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}\Delta^{*}(X_{t}).$

Multiplying by $(-1)^{i}$ and rearranging, $h^{*}(s^{\prime})-h^{*}(s)+(-1)^{i+1}\sum_{t=\tau_{i}}^{\tau_{i+1}-1}(g^{*}-R_{t})$ is equal to:

(-1)^{i+1}\left(\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}\left(\left(p(X_{t})-e_{S_{t+1}}\right)h^{*}+r(X_{t})-R_{t}\right)+\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}\Delta^{*}(X_{t})\right).

Proceed by summing over $i$. By the triangle inequality, we obtain:

\left|\alpha_{T}^{\prime}\right|\leq\left|\sum\nolimits_{i=0}^{N_{T}(s\leftrightarrow s^{\prime})-1}\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}(-1)^{i+1}\left(\left(p(X_{t})-e_{S_{t+1}}\right)h^{*}+r(X_{t})-R_{t}\right)\right|+\sum\nolimits_{i=0}^{N_{T}(s\leftrightarrow s^{\prime})-1}\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}\Delta^{*}(X_{t}).

Because all Bellman gaps $\Delta^{*}$ are non-negative, the second term is upper-bounded by the pseudo-regret $\sum_{t=0}^{T-1}\Delta^{*}(X_{t})$. The first term is a martingale, and the martingale difference sequence $(-1)^{i+1}((p(X_{t})-e_{S_{t+1}})h^{*}+r(X_{t})-R_{t})$ has span at most ${\mathrm{sp}\left(h^{*}\right)}+1$ since rewards are supported in $[0,1]$. Although the number of terms involved is random, it is upper-bounded by $T$, hence by the maximal version of Azuma-Hoeffding’s inequality (Lemma 32), we have that with probability at least $1-\delta$, uniformly for $T^{\prime}\leq T$,

\left|\sum\nolimits_{i=0}^{N_{T^{\prime}}(s\leftrightarrow s^{\prime})-1}\sum\nolimits_{t=\tau_{i}}^{\tau_{i+1}-1}(-1)^{i+1}\left(\left(p(X_{t})-e_{S_{t+1}}\right)h^{*}+r(X_{t})-R_{t}\right)\right|\leq(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{\tfrac{1}{2}T\log\left(\tfrac{2}{\delta}\right)}.

(STEP 3) We conclude that with probability $1-\delta$, for all $T^{\prime}\leq T$,

\alpha_{T^{\prime}}\leq(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{\tfrac{1}{2}T\log\left(\tfrac{2}{\delta}\right)}+\sum\nolimits_{t=0}^{T^{\prime}-1}\Delta^{*}(X_{t})+\left|\operatorname{{Reg}}(T^{\prime})\right|.

We are left with relating both $\sum_{t=0}^{T^{\prime}-1}\Delta^{*}(X_{t})$ and $\left|\operatorname{{Reg}}(T^{\prime})\right|$ to $\sum_{t=0}^{T^{\prime}-1}(\tilde{g}-R_{t})$ . Using the Bellman equation again, we find that:

	$\displaystyle\left|\sum\nolimits_{t=0}^{T^{\prime}-1}(g^{*}-R_{t}-\Delta^{*}(X_{t}))\right|$	$\displaystyle\leq\left|h^{*}(S_{0})-h^{*}(S_{T^{\prime}})\right|+\left|\sum\nolimits_{t=0}^{T^{\prime}-1}\left(\left(p(X_{t})-e_{S_{t+1}}\right)h^{*}+\left(r(X_{t})-R_{t}\right)\right)\right|$
		$\displaystyle\leq{\mathrm{sp}\left(h^{*}\right)}+(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{\tfrac{1}{2}T\log\left(\tfrac{2}{\delta}\right)}$

where the last inequality holds with probability $1-\delta$ uniformly over $T^{\prime}\leq T$ by Azuma-Hoeffding’s inequality again (Lemma 32). Remark that if $y-z\leq x\leq y+z$, then $\left|x\right|\leq\left|y\right|+\left|z\right|$, hence we conclude that with probability $1-\delta$, for all $T^{\prime}\leq T$:

	$\displaystyle\sum\nolimits_{t=0}^{T^{\prime}-1}\Delta^{*}(X_{t})+\left|\operatorname{{Reg}}(T^{\prime})\right|$	$\displaystyle\leq 2\sum\nolimits_{t=0}^{T^{\prime}-1}\Delta^{*}(X_{t})+(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{\tfrac{1}{2}T\log\left(\tfrac{2}{\delta}\right)}+{\mathrm{sp}\left(h^{*}\right)}$
		$\displaystyle\leq 2\sum\nolimits_{t=0}^{T^{\prime}-1}\left(g^{*}-R_{t}\right)+3(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{\tfrac{1}{2}T\log\left(\tfrac{2}{\delta}\right)}+3{\mathrm{sp}\left(h^{*}\right)}$
		$\displaystyle\leq 2\sum\nolimits_{t=0}^{T^{\prime}-1}\left(\tilde{g}-R_{t}\right)+3(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{\tfrac{1}{2}T\log\left(\tfrac{2}{\delta}\right)}+3{\mathrm{sp}\left(h^{*}\right)}$

where the last inequality invokes $\tilde{g}\geq g^{*}$. We conclude that, with probability $1-2\delta$, for all $T^{\prime}\leq T$, we have:

N_{T^{\prime}}(s\leftrightarrow s^{\prime})(h^{*}(s)-h^{*}(s^{\prime})-c_{T^{\prime}}(s,s^{\prime}))\leq 3{\mathrm{sp}\left(h^{*}\right)}+\left(1+{\mathrm{sp}\left(h^{*}\right)}\right)\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2\sum\nolimits_{t=0}^{T^{\prime}-1}(\tilde{g}-R_{t}).

This concludes the proof. ∎

A.2 The confidence region of PMEVI-DT

The algorithm PMEVI-DT can be instantiated in many ways, depending on the type of confidence region one is willing to use for rewards and kernels. In this work, we allow for four types of confidence regions, described below. For conciseness, $q\in\left\{r,p\right\}$ is a symbolic letter that can be a reward or a kernel, and we denote by $\mathcal{Q}_{t}(s,a)$ the confidence region for $q(s,a)$ at time $t$. If $q=r$, then $\dim(q)=2$ (Bernoulli rewards) with $\mathcal{Q}(s,a)=[0,1]$; and if $q=p$, then $\dim(q)=S$ with $\mathcal{Q}(s,a)=\mathcal{P}(\mathcal{S})$.

(C1) Azuma-Hoeffding or Weissman type confidence regions, with $\mathcal{Q}_{t}(s,a)$ taken as:

\left\{\tilde{q}(s,a)\in\mathcal{Q}(s,a):N_{t}(s,a)\left\|\hat{q}_{t}(s,a)-\tilde{q}(s,a)\right\|_{1}^{2}\leq\dim(q)\log\left(\tfrac{2SA(1+N_{t}(s,a))}{\delta}\right)\right\}.

(C2) Empirical Bernstein type confidence regions, with $\mathcal{Q}_{t}(s,a)$ taken as:

\left\{\tilde{q}(s,a)\in\mathcal{Q}(s,a):\forall i,\left|\hat{q}_{t}(i|s,a)-\tilde{q}(i|s,a)\right|\leq\sqrt{\tfrac{2\mathbf{V}(\hat{q}_{t}(i|s,a))\log\left(\tfrac{2\dim(q)SAT}{\delta}\right)}{N_{t}(s,a)}}+\tfrac{3\log\left(\tfrac{2\dim(q)SAT}{\delta}\right)}{N_{t}(s,a)}\right\}

with the convention that $x/0=+\infty$ for $x>0$.

(C3) Empirical likelihood type confidence regions, with $\mathcal{Q}_{t}(s,a)$ taken as:

\left\{\tilde{q}(s,a)\in\mathcal{Q}(s,a):N_{t}(s,a)\operatorname{{\rm KL}}(\hat{q}_{t}(s,a)\|\tilde{q}(s,a))\leq\log\left(\tfrac{2SA}{\delta}\right)+(\dim(q)-1)\log\left(e\left(1+\tfrac{N_{t}(s,a)}{\dim(q)-1}\right)\right)\right\}.

(C4) Trivial confidence region with $\mathcal{Q}_{t}(s,a)=\mathcal{Q}(s,a)$.

A few remarks are in order. When rewards are not Bernoulli, only the confidence regions (C1) and (C4) are eligible among the above. Then, Weissman’s inequality must be changed to Azuma’s inequality for $\sigma$-sub-Gaussian random variables, see Lemma 34. Since rewards are supported in $[0,1]$, Hoeffding’s Lemma guarantees that reward distributions are $\sigma$-sub-Gaussian with $\sigma=\tfrac{1}{2}$.
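To give a feel for how the regions (C1) and (C2) shrink with the number of visits, here is a small Python sketch of their radii. The function names are hypothetical, and reading $\mathbf{V}(\hat{q}_{t}(i|s,a))$ as the Bernoulli variance $\hat{q}(1-\hat{q})$ is an assumption made for illustration only.

```python
import math


def weissman_radius(n_visits, dim, S, A, delta):
    """L1 radius implied by (C1): sqrt(dim * log(2SA(1+N)/delta) / N)."""
    if n_visits == 0:
        return math.inf  # the constraint of (C1) is vacuous when N = 0
    log_term = math.log(2 * S * A * (1 + n_visits) / delta)
    return math.sqrt(dim * log_term / n_visits)


def bernstein_radius(q_hat_i, n_visits, dim, S, A, T, delta):
    """Per-coordinate radius of (C2) around the estimate q_hat(i|s,a)."""
    if n_visits == 0:
        return math.inf  # convention x / 0 = +inf
    log_term = math.log(2 * dim * S * A * T / delta)
    var_i = q_hat_i * (1 - q_hat_i)  # assumed reading of V(q_hat(i|s,a))
    return (math.sqrt(2 * var_i * log_term / n_visits)
            + 3 * log_term / n_visits)
```

For near-deterministic coordinates the Bernstein radius decays as $1/N_{t}(s,a)$ rather than $1/\sqrt{N_{t}(s,a)}$, which is the usual motivation for variance-aware regions.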

A.2.1 Correctness of the model confidence region $\mathcal{M}_{t}$ and 1

The confidence regions $\mathcal{Q}_{t}(s,a)$ described with (C1-4) are tuned so that the following result holds:

Lemma 11.

Assume that, for all $q\in\left\{r,p\right\}$ and $(s,a)\in\mathcal{X}$ , we choose $\mathcal{Q}_{t}(s,a)$ among (C1-4). Then 1 holds. More specifically, the region of models $\mathcal{M}_{t}:=\prod_{s,a}(\mathcal{R}_{t}(s,a)\times\mathcal{P}_{t}(s,a))$ satisfies $\mathbf{P}(\exists t\leq T:M\notin\mathcal{M}_{t})\leq\delta$ .

Proof.

We show that, for all $q\in\left\{r,p\right\}$ and $(s,a)\in\mathcal{X}$, if $\mathcal{Q}_{t}(s,a)$ is chosen among (C1-4), then

\mathbf{P}\left(\exists t\leq T:q(s,a)\notin\mathcal{Q}_{t}(s,a)\right)\leq\delta.

If $\mathcal{Q}_{t}(s,a)$ is chosen with (C1), this is a direct application of Lemma 35; with (C2), this is Lemma 36; with (C3), this is Lemma 37; and with (C4) this is by definition. ∎

A.2.2 Simultaneous correctness of bias confidence region $\mathcal{H}_{t}$ , mitigation $\beta_{t}$ and optimism

In this section, we show that if Assumption 1 holds, then the bias confidence region constructed by PMEVI-DT is correct with high probability, and that the mitigation is not too strong. Recall that $(\mathfrak{g}_{k},\mathfrak{h}_{k})$ are the optimistic gain and bias of the policy deployed in episode $k$ (see Algorithm 1). In particular, we have $\mathfrak{g}_{k}=\mathfrak{L}_{t_{k}}\mathfrak{h}_{k}-\mathfrak{h}_{k}$ with $\mathfrak{h}_{k}\in\mathcal{H}_{t_{k}}$. We start with a result on the deviation of the variance, which is what the variance approximation Algorithm 5 is based on. Recall that the bias confidence region $\mathcal{H}_{t}$ is obtained as the collection of constraints:

(1) prior constraints (if any) $\mathfrak{h}(s)-\mathfrak{h}(s^{\prime})\leq c_{*}(s,s^{\prime})$;

(2) span constraints $\mathfrak{h}(s)-\mathfrak{h}(s^{\prime})\leq c_{0}:=T^{1/5}$;

(3) dynamically inferred constraints $\left|\mathfrak{h}(s)-\mathfrak{h}(s^{\prime})-c_{t}(s^{\prime},s)\right|\leq\text{error}(c_{t},s^{\prime},s)$ (see Algorithm 3).

We have the following result.

Lemma 12.

Let $u,v\in\mathcal{H}_{t}$ and fix $p$ a probability distribution on $\mathcal{S}$ . Then for all $s\in\mathcal{S}$ ,

\mathbf{V}(p,u)\leq\mathbf{V}(p,v)+8c_{0}\sum\nolimits_{s^{\prime}\in\mathcal{S}}p(s^{\prime})~\text{error}(c_{t},s^{\prime},s).

Proof.

We start by establishing the following result: If $p$ is a probability distribution on $\mathcal{S}$ and $u,v\in\mathbf{R}^{\mathcal{S}}$ , we have:

\mathbf{V}(p,u)\leq\mathbf{V}(p,v)+2\left(p\cdot\left|u-v\right|\right)\max(u+v)

(14)

where $\cdot$ is the dot product, $u^{2}$ the Hadamard product $uu$ and $\left|u\right|$ the vector whose entry $s$ is $\left|u(s)\right|$ . (14) is obtained with a straight forward computation:

	$\displaystyle\mathbf{V}(p,u)-\mathbf{V}(p,v)$	$\displaystyle=p\cdot(u^{2}-v^{2})+(p\cdot v)^{2}-(p\cdot u)^{2}$
		$\displaystyle=p\cdot((u-v)(u+v))+(p\cdot(u-v))(p\cdot(u+v))$
		$\displaystyle\leq p\cdot(\left\|u-v\right\|(u+v))+(p\cdot\left\|u-v\right\|)(p% \cdot\left\|u+v\right\|)$
		$\displaystyle\leq 2(p\cdot\left\|u-v\right\|)\max(u+v).$

Observe that $v$ can be changed to $v+\lambda e$ , where $e$ is the vector full of ones, without changing the result. The same goes for $u$ . We now move to the proof of the main statement. First, translate $u$ and $v$ such that $u(s)=v(s)=0$ . Then, we have:

	$\displaystyle p\cdot\left|u-v\right|$	$\displaystyle=\sum\nolimits_{s^{\prime}\in\mathcal{S}}p(s^{\prime})\left|u(s^{\prime})-u(s)-c_{t}(s^{\prime},s)+v(s)-v(s^{\prime})+c_{t}(s^{\prime},s)\right|$
		$\displaystyle\leq\sum\nolimits_{s^{\prime}\in\mathcal{S}}p(s^{\prime})\left(\left|u(s^{\prime})-u(s)-c_{t}(s^{\prime},s)\right|+\left|v(s^{\prime})-v(s)-c_{t}(s^{\prime},s)\right|\right)$
		$\displaystyle\leq 2\sum\nolimits_{s^{\prime}\in\mathcal{S}}p(s^{\prime})~{}\text{error}(c_{t},s^{\prime},s).$

Conclude using that, by the span constraints (2), $\max(u+v)\leq\max(u)+\max(v)\leq 2c_{0}$ for $u,v\in\mathcal{H}_{t}$ such that $u(s)=v(s)=0$ . ∎
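Inequality (14) can be sanity-checked numerically. The sketch below is only an illustration and not part of the proof: it assumes entrywise nonnegative vectors $u,v$ (so that $\max(u+v)$ also bounds $\left|u+v\right|$), and the helper names `variance`, `check_variance_bound` and `random_distribution` are hypothetical.

```python
import random


def dot(p, u):
    """Dot product p . u of two equal-length sequences."""
    return sum(pi * ui for pi, ui in zip(p, u))


def variance(p, u):
    """V(p, u) = p . u^2 - (p . u)^2, the variance of u under p."""
    return dot(p, [ui * ui for ui in u]) - dot(p, u) ** 2


def check_variance_bound(p, u, v, tol=1e-9):
    """Check V(p,u) <= V(p,v) + 2 (p . |u - v|) max(u + v), i.e. (14).

    Assumes u and v are entrywise nonnegative (illustrative restriction).
    """
    gap = dot(p, [abs(ui - vi) for ui, vi in zip(u, v)])
    bound = variance(p, v) + 2.0 * gap * max(ui + vi for ui, vi in zip(u, v))
    return variance(p, u) <= bound + tol


def random_distribution(rng, n):
    """Draw a random probability vector of size n."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


# Try the inequality on 1000 random instances with 5 states.
rng = random.Random(0)
all_ok = all(
    check_variance_bound(
        random_distribution(rng, 5),
        [rng.uniform(0.0, 3.0) for _ in range(5)],
        [rng.uniform(0.0, 3.0) for _ in range(5)],
    )
    for _ in range(1000)
)
```

The flag `all_ok` records whether the bound held on every sampled instance; the small tolerance only absorbs floating-point rounding.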

Lemma 13.

Assume that Assumption 1 holds and that $c_{0}\geq{\mathrm{sp}\left(h^{*}\right)}$ . Then, with probability $1-4\delta$ , for all $k\leq K(T)$ , (1) $\mathfrak{g}_{k}\geq g^{*}$ , (2) $h^{*}\in\mathcal{H}_{t_{k}}$ and (3) for all $(s,a)$ , $(\hat{p}_{t_{k}}(s,a)-p(s,a))h^{*}\leq\beta_{t_{k}}(s,a)$ .

Proof.

Let $E_{1}$ be the event $(\forall k\leq K(T),M\in\mathcal{M}_{t_{k}})$ . Let $E_{2}$ be the event stating that, for all $T^{\prime}\leq T$ ,

N_{T^{\prime}}(s\leftrightarrow s^{\prime})\left|h^{*}(s)-h^{*}(s^{\prime})-c_% {T^{\prime}}(s,s^{\prime})\right|\leq 3{\mathrm{sp}\left(h^{*}\right)}+(1+{% \mathrm{sp}\left(h^{*}\right)})\sqrt{8T\log(\tfrac{2}{\delta})}+2\sum\nolimits% _{t=0}^{T^{\prime}-1}(\tilde{g}-R_{t}),

and let $E_{3}$ be the event stating that, for all $T^{\prime}\leq T$ and for all $(s,a)\in\mathcal{X}$ , we have:

\left(\hat{p}_{T^{\prime}}(s,a)-p(s,a)\right)h^{*}\leq\sqrt{\tfrac{2\mathbf{V}% (\hat{p}_{T^{\prime}}(s,a),h^{*})\log\left(\frac{SAT}{\delta}\right)}{N_{T^{% \prime}}(s,a)}}+\tfrac{3{\mathrm{sp}\left(h^{*}\right)}\log\left(\frac{SAT}{% \delta}\right)}{N_{T^{\prime}}(s,a)}.

By Assumption 1, we have $\mathbf{P}(E_{1})\geq 1-\delta$ ; by Lemma 3, we have $\mathbf{P}(E_{2})\geq 1-2\delta$ ; and by Lemma 36, we have $\mathbf{P}(E_{3})\geq 1-\delta$ . By a union bound, $\mathbf{P}(E_{1}\cap E_{2}\cap E_{3})\geq 1-4\delta$ . We prove by induction on $k\leq K(T)$ that, on $E_{1}\cap E_{2}\cap E_{3}$ , (1) $\mathfrak{g}_{k}\geq g^{*}$ , (2) $h^{*}\in\mathcal{H}_{t_{k}}$ and (3) for all $(s,a)$ , $(\hat{p}_{t_{k}}(s,a)-p(s,a))h^{*}\leq\beta_{t_{k}}(s,a)$ , where $\mathfrak{g}_{k}$ is the optimistic gain of the policy deployed at episode $k$ . For $k=0$ , this is obvious. Indeed, $N_{0}(s\leftrightarrow s^{\prime})=0$ for all $s,s^{\prime}$ hence $c_{0}(s,s^{\prime})=c_{0}\geq{\mathrm{sp}\left(h^{*}\right)}$ . Therefore,

\mathcal{H}_{0}\supseteq\left\{\mathfrak{h}\in\mathbf{R}^{\mathcal{S}}:{% \mathrm{sp}\left(\mathfrak{h}\right)}\leq c_{0}\right\}\supseteq\left\{% \mathfrak{h}\in\mathbf{R}^{\mathcal{S}}:{\mathrm{sp}\left(\mathfrak{h}\right)}% \leq{\mathrm{sp}\left(h^{*}\right)}\right\}

so $\mathcal{H}_{0}$ contains $h^{*}$ , proving (2). Moreover, since $N_{0}(s,a)=0$ , we have $\beta_{0}(s,a)=+\infty$ , proving (3). Finally, since $M\in\mathcal{M}_{0}$ on $E_{1}$ , by statement (2) of Proposition 2, we have $\mathfrak{g}_{0}\geq g^{*}$ , proving (1).

Now assume that $k\geq 1$ . By induction $\mathfrak{g}_{\ell}\geq g^{*}$ for all $\ell<k$ , so on $E_{2}$ we have:

N_{t_{k}}(s\leftrightarrow s^{\prime})\left|h^{*}(s)-h^{*}(s^{\prime})-c_{t_{k% }}(s,s^{\prime})\right|\leq 3{\mathrm{sp}\left(h^{*}\right)}+(1+{\mathrm{sp}% \left(h^{*}\right)})\sqrt{8T\log(\tfrac{2}{\delta})}+2\sum\nolimits_{\ell=1}^{% k-1}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}(\mathfrak{g}_{\ell}-R_{t}).

By design of $\mathcal{H}_{t_{k}}$ (see Algorithm 3), we deduce that (2) $h^{*}\in\mathcal{H}_{t_{k}}$ . Denote by $h_{0}\in\mathcal{H}_{t_{k}}$ the reference point used by Algorithm 5. For all $(s,a)\in\mathcal{X}$ , on $E_{1}\cap E_{2}\cap E_{3}$ , we have:

	$\displaystyle\left(\hat{p}_{t_{k}}(s,a)-p(s,a)\right)h^{*}$	$\displaystyle\leq\sqrt{\tfrac{2\mathbf{V}(\hat{p}_{t_{k}}(s,a),h^{*})\log\left(\frac{SAT}{\delta}\right)}{N_{t_{k}}(s,a)}}+\tfrac{3{\mathrm{sp}\left(h^{*}\right)}\log\left(\frac{SAT}{\delta}\right)}{N_{t_{k}}(s,a)}$
	( $h^{*}\in\mathcal{H}_{t_{k}}$ + Lemma 12)	$\displaystyle\leq\sqrt{\tfrac{2\left(\mathbf{V}(\hat{p}_{t_{k}}(s,a),h_{0})+8c_{0}\sum_{s^{\prime}\in\mathcal{S}}\hat{p}_{t_{k}}(s^{\prime}\mid s,a)~{}\text{error}(c_{t_{k}},s^{\prime},s)\right)\log\left(\frac{SAT}{\delta}\right)}{N_{t_{k}}(s,a)}}+\tfrac{3c_{0}\log\left(\frac{SAT}{\delta}\right)}{N_{t_{k}}(s,a)}$
		$\displaystyle=:\beta_{t_{k}}(s,a)$

by construction of Algorithm 5. Accordingly, (3) is satisfied. Finally, $M\in\mathcal{M}_{t_{k}}$ on $E_{1}$ so by Proposition 2, we have (1) $\mathfrak{g}_{k}\geq g^{*}$ . ∎

Corollary 14.

Assume that, for all $q\in\left\{r,p\right\}$ and $(s,a)\in\mathcal{X}$ , we choose $\mathcal{Q}_{t}(s,a)$ among (C1-4). Then, with probability $1-3\delta$ , for all $k\leq K(T)$ , we have (1) $\mathfrak{g}_{k}\geq g^{*}$ , (2) $h^{*}\in\mathcal{H}_{t_{k}}$ and (3) for all $(s,a)$ , $(\hat{p}_{t_{k}}(s,a)-p(s,a))h^{*}\leq\beta_{t_{k}}(s,a)$ .

Proof.

By Lemma 11, Assumption 1 is satisfied. Apply Lemma 13. ∎

A.2.3 Sub-Weissman reward confidence region and Assumption 2

Although the kernel confidence region can even be chosen to be trivial with (C4), PMEVI-DT needs the reward confidence region to be sub-Weissman in the following sense in order to work:

Assumption 2.

There exists a constant $C>0$ such that for all $(s,a)\in\mathcal{X}$ , for all $t\leq T$ , we have:

\mathcal{R}_{t}(s,a)\subseteq\left\{\tilde{r}(s,a)\in\mathcal{R}(s,a):N_{t}(s,% a)\left\|\hat{r}_{t}(s,a)-\tilde{r}(s,a)\right\|_{1}^{2}\leq C\log\left(\tfrac% {2SA(1+N_{t}(s,a))}{\delta}\right)\right\}.

This is indeed the case if $\mathcal{R}_{t}(s,a)$ is chosen among (C1-3).

A.3 Convergence of EVI and Assumption 3

We start with a preliminary lemma on the speed of convergence of EVI. Lemma 15 is meant to be applied to extended MDPs. Below, when we claim that the action space is compact, we further claim that $a\in\mathcal{A}(s)\mapsto p(s,a)$ is a continuous map, so that the Bellman operator is continuous and that $g^{*}$ and $h^{*}$ are well-defined, see Puterman [1994].

Lemma 15.

Let $M$ be a weakly-communicating MDP with finite state space $\mathcal{S}$ and compact action space, and let $L$ be its Bellman operator. Assume that there exists $\gamma>0$ such that, $\forall u\in\mathbf{R}^{\mathcal{S}}$ ,

\forall s\in\mathcal{S},\exists a\in\mathcal{A}(s),\quad Lu(s)=r(s,a)+p(s,a)u=r(s,a)+\gamma\max(u)+(1-\gamma)q_{s}^{u}u

(*)

with $q_{s}^{u}\in\mathcal{P}(\mathcal{S})$ . Then, for all $u\in\mathbf{R}^{\mathcal{S}}$ and all $\epsilon>0$ , if ${\mathrm{sp}\left(L^{n+1}u-L^{n}u\right)}\geq\epsilon$ , then, writing $w_{0}:=u-h^{*}$ :

n\leq 2+\frac{4{\mathrm{sp}\left(w_{0}\right)}}{\gamma\epsilon}+\frac{2}{% \gamma}\log\left(\frac{2{\mathrm{sp}\left(w_{0}\right)}}{\epsilon}\right).

Proof.

Since $M$ is weakly communicating, has finitely many states and compact action space, it has well-defined gain $g^{*}$ and bias $h^{*}$ functions. Denote $u_{n}:=L^{n}u$ and $w_{n}:=u_{n}-ng^{*}-h^{*}$ . For $n\geq 1$ , we have:

	$\displaystyle w_{n}$	$\displaystyle=\max_{\pi\in\Pi}\left\{r_{\pi}+P_{\pi}u_{n-1}\right\}-ng^{*}-h^{*}$
		$\displaystyle=\max_{\pi\in\Pi}\left\{r_{\pi}-g^{*}+(P_{\pi}-I)h^{*}+P_{\pi}\left(u_{n-1}-h^{*}-(n-1)g^{*}\right)\right\}=:\max_{\pi\in\Pi}\left\{r^{\prime}_{\pi}+P_{\pi}w_{n-1}\right\}.$

Observe that the policy achieving the maximum is the one achieving $u_{n}=r_{\pi}+P_{\pi}u_{n-1}$ . Remark that $r^{\prime}_{\pi}(s)=-\Delta^{*}(s,\pi(s))\leq 0$ is the Bellman gap of the pair $(s,\pi(s))$ , that we more simply write $\Delta_{\pi}$ . For all $n$ , there exists $\pi_{n}\in\Pi$ such that $w_{n+1}=-\Delta_{\pi_{n}}+P_{\pi_{n}}w_{n}$ . Moreover, by assumption, we have $P_{\pi_{n}}=\gamma\cdot ee_{s_{n}}^{\top}+(1-\gamma)Q_{n}$ for some state $s_{n}$ with indicator vector $e_{s_{n}}$ , where $Q_{n}$ is a stochastic matrix. Therefore,

\left(\min(-\Delta_{\pi_{n}})+\gamma w_{n}(s_{n})\right)e+(1-\gamma)Q_{n}w_{n}% \leq w_{n+1}\leq\left(\max(-\Delta_{\pi_{n}})+\gamma w_{n}(s_{n})\right)e+(1-% \gamma)Q_{n}w_{n}.

Hence, ${\mathrm{sp}\left(w_{n+1}\right)}\leq(1-\gamma){\mathrm{sp}\left(w_{n}\right)}% +{\mathrm{sp}\left(\Delta_{\pi_{n}}\right)}$ . In addition, $w_{n}=L^{n}u-L^{n}h^{*}$ , so by non-expansiveness of $L$ in span semi-norm, ${\mathrm{sp}\left(w_{n+1}\right)}\leq{\mathrm{sp}\left(w_{n}\right)}$ . Overall,

{\mathrm{sp}\left(w_{n+1}\right)}\leq\min\left((1-\gamma){\mathrm{sp}\left(w_{% n}\right)}+{\mathrm{sp}\left(\Delta_{\pi_{n}}\right)},{\mathrm{sp}\left(w_{n}% \right)}\right).

(15)

Fix $\epsilon>0$ , and let $n_{\epsilon}:=\inf\left\{n:{\mathrm{sp}\left(w_{n}\right)}<\epsilon\right\}$ .

Let $\pi^{*}$ be an optimal policy. We have $w_{n+1}\geq P_{\pi^{*}}w_{n}$ so by induction, $w_{n+1}\geq P_{\pi^{*}}^{n+1}w_{0}\geq\min(w_{0})e$ . Meanwhile, we see that $\left\|w_{n}\right\|_{1}\geq\sum_{k=0}^{n-1}\left\|\Delta_{\pi_{k}}\right\|_{1}+S\min(w_{0})$ , so $\sum_{k=0}^{n-1}\left\|\Delta_{\pi_{k}}\right\|_{1}\leq{\mathrm{sp}\left(w_{0}\right)}$ . Since $\Delta_{\pi_{k}}\leq 0$ for all $k$ , we have ${\mathrm{sp}\left(\Delta_{\pi_{k}}\right)}\leq\left\|\Delta_{\pi_{k}}\right\|_{1}$ so $\sum_{k=0}^{n-1}{\mathrm{sp}\left(\Delta_{\pi_{k}}\right)}\leq{\mathrm{sp}\left(w_{0}\right)}$ .

By (15), either ${\mathrm{sp}\left(w_{n+1}\right)}\leq(1-\tfrac{1}{2}\gamma)\max(\epsilon,{% \mathrm{sp}\left(w_{n}\right)})$ or ${\mathrm{sp}\left(\Delta_{\pi_{n}}\right)}\geq\tfrac{1}{2}\gamma\epsilon$ , but because $\sum_{k=0}^{+\infty}{\mathrm{sp}\left(\Delta_{\pi_{k}}\right)}\leq{\mathrm{sp}% \left(w_{0}\right)}$ , the second case can happen at most $\tfrac{2{\mathrm{sp}\left(w_{0}\right)}}{\gamma\epsilon}$ times. We deduce that, for all $n\leq n_{\epsilon}$ ,

{\mathrm{sp}\left(w_{n+1}\right)}\leq\left(1-\tfrac{1}{2}\gamma\right)^{n-% \frac{2{\mathrm{sp}\left(w_{0}\right)}}{\gamma\epsilon}}{\mathrm{sp}\left(w_{0% }\right)}.

In particular, for $n=n_{\epsilon}-1$ , we get:

\epsilon\leq\left(1-\tfrac{1}{2}\gamma\right)^{n_{\epsilon}-2-\frac{2{\mathrm{% sp}\left(w_{0}\right)}}{\gamma\epsilon}}{\mathrm{sp}\left(w_{0}\right)}.

We obtain:

n_{\epsilon}\leq 2+\frac{2{\mathrm{sp}\left(w_{0}\right)}}{\gamma\epsilon}+% \frac{2}{\gamma}\log\left(\frac{{\mathrm{sp}\left(w_{0}\right)}}{\epsilon}% \right).

To conclude, check that ${\mathrm{sp}\left(L^{n+1}u-L^{n}u\right)}={\mathrm{sp}\left(w_{n+1}-w_{n}% \right)}\leq 2{\mathrm{sp}\left(w_{n}\right)}$ . ∎

Before moving to the application of interest, remark that this result can be greatly improved if the supremum $\sup\left\{\Delta^{*}(s,a):\Delta^{*}(s,a)<0\right\}$ is not zero, to change the dominant term $\frac{4{\mathrm{sp}\left(w_{0}\right)}}{\gamma\epsilon}$ for a constant independent of $\epsilon$ .
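The stopping rule behind Lemma 15 can be illustrated on a toy instance. The sketch below is a plain value iteration with a span-based stopping criterion, written under assumptions that are not taken from the paper: the 2-state, 2-action MDP and its numbers are hypothetical, and so are the helper names. Every kernel puts mass at least $0.1$ on every state, so the requirement (*) holds with $\gamma=0.1$ .

```python
def span(u):
    """Span semi-norm sp(u) = max(u) - min(u)."""
    return max(u) - min(u)


def bellman(u, rewards, kernels):
    """Apply L u(s) = max_a { r(s,a) + p(s,a) . u }."""
    return [
        max(
            r + sum(p_next * u_next for p_next, u_next in zip(p, u))
            for r, p in zip(rewards[s], kernels[s])
        )
        for s in range(len(u))
    ]


def value_iteration(u, rewards, kernels, eps, max_iter=10_000):
    """Iterate L until sp(L^{n+1} u - L^n u) < eps.

    Returns the first such n and the midpoint of L^{n+1} u - L^n u,
    used here as a gain estimate.
    """
    for n in range(max_iter):
        nxt = bellman(u, rewards, kernels)
        diff = [a - b for a, b in zip(nxt, u)]
        if span(diff) < eps:
            return n, 0.5 * (max(diff) + min(diff))
        u = nxt
    raise RuntimeError("no convergence within max_iter")


# Toy MDP (hypothetical numbers): reward 1 in state 1, reward 0 in state 0;
# action 0 drifts towards state 0 and action 1 drifts towards state 1.
rewards = [[0.0, 0.0], [1.0, 1.0]]
kernels = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.9, 0.1], [0.1, 0.9]],
]
n_iter, gain = value_iteration([0.0, 0.0], rewards, kernels, eps=1e-8)
```

On this instance the optimal policy always plays action 1, whose stationary distribution is $(0.1,0.9)$ , so the optimal gain is $0.9$ and the iterates reach a span fix-point after very few steps.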

Corollary 16.

Assume that $\mathcal{M}_{t}$ has non-empty interior, and that its Bellman operator satisfies the requirement of Lemma 15, i.e., there exists $\gamma>0$ such that, $\forall u\in\mathbf{R}^{\mathcal{S}},\forall s\in\mathcal{S},\exists a\in\mathcal{A}(s),\exists\tilde{r}_{t}(s,a)\in\mathcal{R}_{t}(s,a),\exists\tilde{p}_{t}(s,a)\in\mathcal{P}_{t}(s,a)$ :

\mathcal{L}_{t}u(s)=\tilde{r}_{t}(s,a)+\tilde{p}_{t}(s,a)u=\tilde{r}_{t}(s,a)+% \gamma\max(u)+(1-\gamma)q_{s}^{u}u

for some $q_{s}^{u}\in\mathcal{P}(\mathcal{S})$ . Then Assumption 3 is satisfied, and span fix-points $\tilde{h}_{t}$ of $\mathcal{L}_{t}$ are such that $g^{*}(\mathcal{M}_{t})=\mathcal{L}_{t}\tilde{h}_{t}-\tilde{h}_{t}$ .

Proof.

If $\mathcal{M}_{t}$ has non-empty interior, then for all $(s,a)$ , $\mathcal{P}_{t}(s,a)$ has non-empty interior. Therefore, for every state-action pair, there exists $\tilde{p}_{t}(s,a)\in\mathcal{P}_{t}(s,a)$ that is fully supported. It follows that $\mathcal{M}_{t}$ is communicating, and it follows from standard results Puterman [1994] that its span fix-points $\tilde{h}_{t}$ exist and that $\tilde{g}_{t}:=\mathcal{L}_{t}\tilde{h}_{t}-\tilde{h}_{t}\in\mathbf{R}e$ does not depend on the initial state.

Moreover, if $\widetilde{M}\in\mathcal{M}_{t}$ and $\pi\in\Pi$ with $\tilde{g}_{\pi}\equiv g(\pi,\mathcal{M}_{t})\in\mathbf{R}e$ , letting $\tilde{r}_{\pi}:=r_{\pi}(\tilde{M})$ and $\tilde{P}_{\pi}:=P_{\pi}(\tilde{M})$ , we have:

\tilde{r}_{\pi}+\tilde{P}_{\pi}\tilde{h}_{t}\leq\mathcal{L}_{t}\tilde{h}_{t}\leq\tilde{g}_{t}e+\tilde{h}_{t}.

By induction, and since $\tilde{P}_{\pi}$ is monotone and satisfies $\tilde{P}_{\pi}e=e$ , we obtain:

\sum_{k=0}^{n-1}\tilde{P}_{\pi}^{k}\tilde{r}_{\pi}\leq n\tilde{g}_{t}e+(I-\tilde{P}_{\pi}^{n})\tilde{h}_{t}.

Dividing by $n$ and letting $n$ go to infinity, we obtain $g(\pi,\mathcal{M}_{t})\leq\tilde{g}_{t}$ . Observe that we have equality by taking the policy achieving $(\tilde{g}_{t},\tilde{h}_{t})$ .

To see that EVI indeed converges, observe that Lemma 15 provides a finite bound on the number of iterations $n$ for which ${\mathrm{sp}\left(\mathcal{L}_{t}^{n+1}u-\mathcal{L}_{t}^{n}u\right)}\geq\epsilon$ . Hence ${\mathrm{sp}\left(\mathcal{L}_{t}^{n+1}u-\mathcal{L}_{t}^{n}u\right)}$ converges to $0$ . ∎

About Assumption 3.

The assumptions made by Corollary 16 are met if the kernel confidence regions are:

•

Built out of Weissman’s inequality (C1) (see the next section, also Auer et al. [2009]);
•

Built out of Bernstein’s inequality (C2) (because the maximization algorithm to compute $\tilde{p}_{t}(s,a)u_{i}$ in EVI has the same greedy properties as with Weissman’s inequality);
•

Trivial (C4) obviously.

For confidence regions built with empirical likelihood estimates (C3), there is no guarantee of convergence (although we conjecture that one could be established), but the gain is still well-defined because $\mathcal{M}_{t}$ remains communicating. However, just as in the original work of Filippi et al. [2010], convergence is always observed numerically.

A.4 Proof of Theorem 5: Complexity of PMEVI with Weissman confidence regions

In this section, we show that when one is using Weissman confidence regions for kernels (C1), then the iterates of $\mathcal{L}_{t}$ converge to an $\epsilon$ span-fix-point quickly.

Proposition 17.

Assume that PMEVI-DT uses kernel confidence regions of Weissman type (C1) satisfying Assumption 1. Then with probability $1-\delta$ , the number of iterations of PMEVI (see Algorithm 2) is $\operatorname*{{\rm O}}\left(D\!\!\sqrt{S}AT\right)$ , hence the algorithm has polynomial per-step amortized complexity.

Proof.

With Weissman type confidence regions for kernels, for all $t\leq T$ and $(s,a)\in\mathcal{X}$ , we have

\mathcal{P}_{t}(s,a)\supseteq\left\{\tilde{p}(s,a)\in\mathcal{P}(s,a):\left\|% \tilde{p}(s,a)-\hat{p}_{t}(s,a)\right\|_{1}\leq\sqrt{\frac{S\log(2SAT)}{T}}\right\}

It follows that, for all $t\leq T$ , the extended Bellman operator $\mathcal{L}_{t}$ satisfies the prerequisite $(*)$ of Lemma 15 with

\gamma=\frac{1}{2}\sqrt{\frac{S\log(2SAT/\delta)}{T}}=\Omega\left(\!\!\sqrt{% \frac{S\log(T/\delta)}{T}}\right).

Under Assumption 1, we have $M\in\mathcal{M}_{t}$ with probability $1-\delta$ . Under this event, $\mathcal{M}_{t}$ is weakly communicating and ${\mathrm{sp}\left(h^{*}(\mathcal{M}_{t})\right)}\leq D(M)$ , so we can apply Lemma 15 and conclude that the number of iterations of every call to PMEVI (Algorithm 2) is

\operatorname*{{\rm O}}\left(\frac{{\mathrm{sp}\left(w_{0}\right)}\sqrt{T}}{% \epsilon\sqrt{\frac{S\log(T/\delta)}{T}}}\right)=\operatorname*{{\rm O}}\left(% \frac{DT}{\sqrt{S}\log(T)}\right)

where we use that $\epsilon=\sqrt{\tfrac{\log(SAT/\delta)}{T}}$ , that ${\mathrm{sp}\left(w_{0}\right)}=\operatorname*{{\rm O}}\left({\mathrm{sp}\left% (h^{*}(\mathcal{M}_{t})\right)}\right)=\operatorname*{{\rm O}}(D(M))$ and that $\delta\geq\frac{1}{T}$ . Since the number of episodes under the doubling trick (DT) is $\operatorname*{{\rm O}}(SA\log(T))$ , we conclude accordingly. ∎
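To get a feel for the orders of magnitude, the bound of Lemma 15 can be evaluated with the values of $\gamma$ and $\epsilon$ used in the proof. The sketch below is purely illustrative: the function names are hypothetical, ${\mathrm{sp}\left(w_{0}\right)}$ is replaced by the diameter $D$ (an upper bound up to constants, which is an assumption of this example), and the instance sizes are made up.

```python
import math


def lemma15_bound(sp_w0, gamma, eps):
    """Right-hand side of Lemma 15:
    2 + 4 sp(w0) / (gamma eps) + (2 / gamma) log(2 sp(w0) / eps)."""
    return (
        2.0
        + 4.0 * sp_w0 / (gamma * eps)
        + (2.0 / gamma) * math.log(2.0 * sp_w0 / eps)
    )


def pmevi_iteration_bound(S, A, T, D, delta):
    """Plug the Weissman-type gamma and the precision eps of the proof
    into Lemma 15, using D in place of sp(w0) (illustrative assumption)."""
    gamma = 0.5 * math.sqrt(S * math.log(2 * S * A * T / delta) / T)
    eps = math.sqrt(math.log(S * A * T / delta) / T)
    return lemma15_bound(D, gamma, eps)


# Hypothetical instance: S = 10 states, A = 4 actions, D = 20, delta = 1/T.
bounds = {
    T: pmevi_iteration_bound(10, 4, T, 20.0, 1.0 / T)
    for T in (10**4, 10**5, 10**6)
}
```

The values in `bounds` grow slightly slower than linearly in $T$ , in line with the $\operatorname*{{\rm O}}\left(\frac{DT}{\sqrt{S}\log(T)}\right)$ estimate per call.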

Every call to the projection operator solves a linear program. Although this time is polynomial in theory (by recent work on the complexity of LP such as Cohen et al. [2020], it matches the current matrix multiplication time $\operatorname*{{\rm O}}(S^{2.38})$ ), in practice, reducing the number of calls to the projection operator is key to running PMEVI-DT in reasonable time.

Appendix B Analysis of the projected mitigated Bellman operator

In this section, we fix the model region $\mathcal{M}$ , the bias region $\mathcal{H}$ and the mitigation vector $\beta$ , dropping the subscript $t$ for conciseness. We denote $\hat{r},\hat{p}$ the respective empirical reward and kernel. Further assume that $\mathcal{H}=\mathcal{H}_{0}+\mathbf{R}e$ with $\mathcal{H}_{0}$ a compact convex set. The associated projection operation (see Section B.2) is denoted $\Gamma$ . The (vanilla) extended Bellman operator $\mathcal{L}$ associated to $\mathcal{M}$ is given by $\mathcal{L}u(s):=\max_{a\in\mathcal{A}(s)}\left\{\sup\mathcal{R}(s,a)+\sup\mathcal{P}(s,a)u\right\}$ . The $\beta$ -mitigated extended Bellman operator associated to $\mathcal{M}$ is:

\mathcal{L}^{\beta}u(s):=\max_{a\in\mathcal{A}(s)}\sup_{\tilde{r}(s,a)\in\mathcal{R}(s,a)}\sup_{\tilde{p}(s,a)\in\mathcal{P}(s,a)}\Big{\{}\tilde{r}(s,a)+\min\left\{\tilde{p}(s,a)u,\hat{p}(s,a)u+\beta(s,a)\right\}\Big{\}}.

(16)

The function Greedy $(\mathcal{M},u,\beta)$ returns a stationary deterministic policy that picks its actions among those reaching the maximum above. The projection of $\mathcal{L}^{\beta}$ to $\mathcal{H}$ is

\mathfrak{L}\equiv\mathfrak{L}^{\beta,\mathcal{H}}:=\Gamma\circ\mathcal{L}^{% \beta}.

(17)
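One application of the $\beta$ -mitigated operator (16) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by PMEVI: each kernel region $\mathcal{P}(s,a)$ is replaced by a finite list of candidate kernels (`p_candidates`, a hypothetical stand-in), the reward region only enters through its supremum `r_sup`, and all names and numbers are made up. It uses the fact that $x\mapsto\min\{x,c\}$ is non-decreasing, so the supremum over $\tilde{p}(s,a)$ can be taken before the minimum.

```python
def dot(p, u):
    """Dot product p . u of two equal-length sequences."""
    return sum(pi * ui for pi, ui in zip(p, u))


def mitigated_bellman(u, r_sup, p_hat, p_candidates, beta):
    """One application of the beta-mitigated extended Bellman operator.

    r_sup[s][a]        : supremum of the reward region R(s, a)
    p_hat[s][a]        : empirical kernel at (s, a)
    p_candidates[s][a] : finite list standing in for P(s, a) (assumption)
    beta[s][a]         : mitigation level; float("inf") disables mitigation
    """
    out = []
    for s in range(len(u)):
        values = []
        for a in range(len(r_sup[s])):
            optimistic = max(dot(p, u) for p in p_candidates[s][a])
            mitigated = min(optimistic, dot(p_hat[s][a], u) + beta[s][a])
            values.append(r_sup[s][a] + mitigated)
        out.append(max(values))
    return out


# Hypothetical 2-state, 1-action instance.
u = [0.0, 1.0]
r_sup = [[0.3], [0.3]]
p_hat = [[[0.5, 0.5]], [[0.5, 0.5]]]
p_candidates = [[[[0.5, 0.5], [0.2, 0.8]]], [[[0.5, 0.5], [0.2, 0.8]]]]
inf = float("inf")
vanilla = mitigated_bellman(u, r_sup, p_hat, p_candidates, [[inf], [inf]])
mitigated = mitigated_bellman(u, r_sup, p_hat, p_candidates, [[0.1], [0.1]])
```

With $\beta=+\infty$ the mitigation is inactive and the vanilla extended operator is recovered; with a finite $\beta$ the optimistic transition term is capped by the empirical one plus $\beta$ , so the mitigated value never exceeds the vanilla one.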

The goal of this section is to establish Proposition 2 and Lemma 4, as follows:

•

Proposition 2 statement (1) is a consequence of Lemma 22;
•

Proposition 2 statement (2) follows from Theorem 25;
•

Proposition 2 statement (3) follows from Corollary 27;
•

Proposition 2 statement (4) follows from Corollary 21;
•

Proposition 2 prerequisites on the projection operator and Lemma 4 follow from Lemma 19.

B.1 Finding an optimistic policy under bias constraints

The main goal is to find an optimistic policy under bias constraints (projection) and bias error constraints (mitigation). The bias constraints imply that we search for a policy $\pi$ together with a model $\widetilde{M}$ such that $h(\pi,\widetilde{M})\in\mathcal{H}$ . The bias error constraint means that, for $\tilde{h}\equiv h(\pi,\widetilde{M})$ , we want in addition $\tilde{p}(s,\pi(s))\tilde{h}\leq\hat{p}(s,\pi(s))\tilde{h}+\beta(s,\pi(s))$ where $\tilde{p}$ is the transition kernel of $\widetilde{M}$ . In the end, our goal is to track the solution of the following optimization problem:

g^{*}(\mathcal{H},\beta,\mathcal{M}):=\sup\left\{g\left(\pi,\widetilde{M}% \right):\begin{array}[]{c}\pi\in\Pi,\widetilde{M}\in\mathcal{M},\\ \forall s\in\mathcal{S},~{}\tilde{p}(s,\pi(s))\tilde{h}\leq\hat{p}(s,\pi(s))% \tilde{h}+\beta(s,\pi(s)),\\ \tilde{h}\equiv h(\pi,\widetilde{M})\in\mathcal{H},~{}{\mathrm{sp}\left(g\left% (\pi,\widetilde{M}\right)\right)}=0\end{array}\right\}

(18)

where the supremum is taken with respect to the product order on $\mathbf{R}^{\mathcal{S}}$ . In particular, if $\mathcal{U}\subseteq\mathbf{R}^{\mathcal{S}}$ , check that $u^{*}=\sup\mathcal{U}$ is obtained as $u^{*}(s):=\sup\left\{v(s):v\in\mathcal{U}\right\}$ . The constraint ${\mathrm{sp}\left(g\left(\pi,\widetilde{M}\right)\right)}=0$ is suggested by the work of Fruit et al. [2018], Fruit [2019] and is key for the problem to be solvable.

The bias and the $\beta$ -constraints make the problem difficult to handle with a “pure” extended MDP solution, which is why the extended Bellman operators are mitigated (with $\beta$ ) then projected (with $\Gamma$ ). The mitigation operation guarantees that the $\beta$ -constraint is satisfied, while the projection on $\mathcal{H}$ makes sure that the bias constraint is satisfied. It is important for both operations to be compatible, i.e., that the $\beta$ -constraint that $\mathcal{L}^{\beta}$ forces is not lost when applying $\Gamma$ . As a matter of fact, projecting then mitigating would not work.

We now explain why $\mathfrak{L}$ can be used to solve (18).

B.2 Projection operation and definition of $\mathfrak{L}$

We start by discussing why $\mathfrak{L}$ is well-defined at all. That $\mathcal{L}^{\beta}$ is well-defined is obvious. The point is to explain why the projection onto $\mathcal{H}$ is possible while preserving mandatory structural properties such as monotonicity, non-expansiveness, linearity and more. For general $\mathcal{H}$ , such properties are impossible to meet. But the bias confidence region constructed with Algorithm 3 has a specific shape that makes the projection possible. The central property is the one below:

(A1) The downward closure $\left\{v\leq u:v\in\mathcal{H}\right\}$ of every $u\in\mathbf{R}^{\mathcal{S}}$ has a maximum in $\mathcal{H}$ .

The only order that we will be considering is the product order on $\mathbf{R}^{\mathcal{S}}$ . Recall that a set $\mathcal{U}\subseteq\mathbf{R}^{\mathcal{S}}$ has a maximum if there exists $u\in\mathcal{U}$ such that $v\leq u$ for all $v\in\mathcal{U}$ . A supremum of $\mathcal{U}$ is a minimal upper-bound of $\mathcal{U}$ , i.e., $u$ such that (1) $v\leq u$ for all $v\in\mathcal{U}$ and (2) no $w$ satisfying (1) can be smaller than $u$ . For the product order, the supremum of a subset $\mathcal{U}$ is unique and of the form $u(s)=\sup\left\{v(s):v\in\mathcal{U}\right\}$ .
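The difference between a maximum and a supremum for the product order can be seen on a two-element family. The snippet below is a small illustration for finite families only; the helper names `pointwise_sup` and `maximum` are hypothetical.

```python
def pointwise_sup(vectors):
    """Supremum for the product order: entrywise maximum of a finite family."""
    return [max(column) for column in zip(*vectors)]


def maximum(vectors):
    """Return the maximum of the family for the product order, or None.

    A maximum exists exactly when the entrywise supremum belongs to the family.
    """
    candidate = pointwise_sup(vectors)
    return candidate if candidate in [list(v) for v in vectors] else None


family = [[1.0, 0.0], [0.0, 1.0]]
sup_family = pointwise_sup(family)           # upper bound, not in the family
max_family = maximum(family)                 # neither element dominates the other
max_closed = maximum(family + [sup_family])  # adding the join restores a maximum
```

Here the family $\{(1,0),(0,1)\}$ has supremum $(1,1)$ but no maximum, which is why stability under finite suprema matters for the projection defined next.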

Define the projection $\Gamma:\mathbf{R}^{\mathcal{S}}\to\mathcal{H}$ as such:

\Gamma u:=\max\left\{v\leq u:v\in\mathcal{H}\right\}.

(19)

In general, Assumption (A1) is satisfied when $\mathcal{H}$ admits a join, i.e., is stable by finite supremum: $u,v\in\mathcal{H}\Rightarrow\sup(u,v)\in\mathcal{H}$ .

Lemma 18.

If $\mathcal{H}$ is generated by constraints of the form $\mathfrak{h}(s)-\mathfrak{h}(s^{\prime})-c(s,s^{\prime})\leq d(s,s^{\prime})$ , then it has a join and (A1) is satisfied. Moreover, $\Gamma$ is then correctly computed with Algorithm 4.

Proof.

The first half of the result is well-known, see Zhang and Xie [2023], but we recall a proof for self-containedness. Let $v_{1},v_{2}\in\mathcal{H}$ and define $v_{3}:=\sup(v_{1},v_{2})$ . Observe that $v_{3}(s)-v_{3}(s^{\prime})\leq\max(v_{1}(s)-v_{1}(s^{\prime}),v_{2}(s)-v_{2}(s% ^{\prime}))\leq c(s,s^{\prime})+d(s,s^{\prime})$ . So $v_{3}\in\mathcal{H}$ .

We continue by showing that if $\mathcal{H}$ has a join, then (19) is well-defined. For $s\in\mathcal{S}$ , take a sequence $v_{n}^{s}$ such that $v_{n}^{s}(s)\to\alpha(s):=\sup\left\{v(s):v\leq u,v\in\mathcal{H}\right\}$ . Because the span of every element of $\mathcal{H}$ is upper-bounded by $c:=\sup\left\{{\mathrm{sp}\left(v\right)}:v\in\mathcal{H}\right\}$ , it follows that $v_{n}^{s}$ eventually evolves in the compact region $\left\{v\leq u:v\in\mathcal{H}\right\}\cap\left\{v:\left\|v-\alpha(s)e\right\|_{\infty}\leq 1+c\right\}$ . We can therefore extract a convergent subsequence of $v_{n}^{s}$ , converging to some $v_{*}^{s}$ that belongs to $\mathcal{H}$ since the latter is closed. By construction, $v_{*}^{s}\leq u$ and $v_{*}^{s}(s)=\alpha(s)$ . Because $\mathcal{H}$ has a join, $v_{*}:=\sup\left\{v_{*}^{s}:s\in\mathcal{S}\right\}\in\mathcal{H}$ . Moreover, $v_{*}\leq u$ and $v_{*}(s)=\alpha(s)$ for all $s$ , so $v_{*}$ is the maximum in (19). ∎
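For regions generated by difference constraints as in Lemma 18, the maximum in (19) also admits a classical shortest-path description. The sketch below is not Algorithm 4 of the paper: it is an illustrative alternative that assumes finitely many states, a feasible constraint system (no negative cycle), and a hypothetical matrix `bound[s][t]` standing for the total bound $c(s,t)+d(s,t)$ on $v(s)-v(t)$ (use `float("inf")` for an unconstrained pair).

```python
def shortest_path_closure(bound):
    """Floyd-Warshall closure of the bounds bound[s][t] on v(s) - v(t)."""
    n = len(bound)
    dist = [[0.0 if s == t else bound[s][t] for t in range(n)] for s in range(n)]
    for k in range(n):
        for s in range(n):
            for t in range(n):
                if dist[s][k] + dist[k][t] < dist[s][t]:
                    dist[s][t] = dist[s][k] + dist[k][t]
    return dist


def project(u, bound):
    """Greatest v <= u such that v(s) - v(t) <= bound[s][t] for all s, t.

    Assumes the constraints are feasible (no negative cycle in bound).
    """
    dist = shortest_path_closure(bound)
    n = len(u)
    return [min(u[t] + dist[s][t] for t in range(n)) for s in range(n)]


# Hypothetical example: three states, every difference bounded by 1,
# which amounts to a span constraint sp(v) <= 1.
bound = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
u = [0.0, 5.0, 0.5]
projected = project(u, bound)
shifted = project([x + 2.0 for x in u], bound)
```

On this example the large entry of `u` is pulled down to respect the span constraint while the other entries are left unchanged, and translating `u` by a constant translates the projection by the same constant.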

Lemma 19.

Under assumption (A1), the operator $\Gamma u:=\max\left\{v\leq u:v\in\mathcal{H}\right\}$ is well-defined, and is:

(1)

monotone: $u\leq v\Rightarrow\Gamma u\leq\Gamma v$ ;
(2)

non span-expansive: ${\mathrm{sp}\left(\Gamma u-\Gamma v\right)}\leq{\mathrm{sp}\left(u-v\right)}$ ;
(3)

linear: $\Gamma(u+\lambda e)=\Gamma u+\lambda e$ ;
(4)

$\Gamma u\leq u$ .

Proof.

That $\Gamma$ is well-defined is obvious from (A1). For (1), if $u\leq v$ then $w\leq u\Rightarrow w\leq v$ . Hence $\Gamma u:=\max\left\{w\leq u:w\in\mathcal{H}\right\}\leq\max\left\{w\leq v:w\in\mathcal{H}\right\}=:\Gamma v$ . For (3), check that it follows from $\mathcal{H}=\mathcal{H}+\mathbf{R}e$ . For (4), we obviously have $\Gamma u:=\max\left\{v\leq u:v\in\mathcal{H}\right\}\leq u$ .

The more difficult point is (2) span non-expansivity. Pick $u,v\in\mathbf{R}^{\mathcal{S}}$ . By linearity, it suffices to show the result for $\sum_{s}u(s)=\sum_{s}v(s)$ . In that case, we have ${\mathrm{sp}\left(v-u\right)}=\max(v-u)+\max(u-v)$ . Observe that for all $w\leq u$ , we have $w+\min(v-u)e\leq v$ . Since $\mathcal{H}=\mathcal{H}+\mathbf{R}e$ , it follows that:

\max\left\{w\leq u:w\in\mathcal{H}\right\}\leq\max\left\{w\leq v:w\in\mathcal{H}\right\}+\max(u-v)e.

Similarly, we have $\max\left\{w\leq u:w\in\mathcal{H}\right\}\geq\max\left\{w\leq v:w\in\mathcal{H}\right\}+\min(u-v)e$ . Using them both at once, we find ${\mathrm{sp}\left(\Gamma u-\Gamma v\right)}\leq\max(u-v)-\min(u-v)={\mathrm{sp}\left(v-u\right)}$ . ∎

The properties (1), (3) and (4) are essential for $\mathfrak{L}$ to properly address the optimization problem (18). The property (2) is just as important, because it plays a central part in the convergence of value iteration. The next result shows similar properties for the $\beta$ -mitigated extended Bellman operator $\mathcal{L}^{\beta}$ . From now on, we will assume (A1), because it is almost-surely satisfied by the bias confidence region generated by Algorithm 3.

Lemma 20.

The $\beta$ -mitigated extended Bellman operator $\mathcal{L}^{\beta}$ is (1) monotone, (2) non-span-expansive and (3) linear.

Proof.

The properties (1) and (3) directly follow from the definition. We focus on (2). Fix $u,u^{\prime}\in\mathbf{R}^{\mathcal{S}}$ . By Lemma 26, we can write $\mathcal{L}^{\beta}u=\tilde{r}_{\pi}+\tilde{P}_{\pi}u$ and $\mathcal{L}^{\beta}u^{\prime}=\tilde{r}_{\pi^{\prime}}+\tilde{P}_{\pi^{\prime}% }u^{\prime}$ . In the following, we write $\beta_{\pi}(s):=\beta(s,\pi(s))$ . Check that:

\mathcal{L}^{\beta}u-\mathcal{L}^{\beta}u^{\prime}=\tilde{r}_{\pi}+\tilde{P}_{% \pi}u-\left(\tilde{r}_{\pi^{\prime}}+\tilde{P}_{\pi^{\prime}}u^{\prime}\right)% \leq\tilde{r}_{\pi}+\tilde{P}_{\pi}u-\left(\tilde{r}_{\pi}+\min\left\{\tilde{P% }_{\pi}u^{\prime},\hat{P}_{\pi}u^{\prime}+\beta_{\pi}\right\}\right).

If the minimum is reached with $\tilde{P}_{\pi}u^{\prime}$ , then:

\mathcal{L}^{\beta}u-\mathcal{L}^{\beta}u^{\prime}\leq\tilde{P}_{\pi}(u-u^{% \prime}).

If the minimum is reached with $\hat{P}_{\pi}u^{\prime}+\beta_{\pi}$ , then upper-bound $\tilde{P}_{\pi}u$ by $\hat{P}_{\pi}u+\beta_{\pi}$ to obtain:

\mathcal{L}^{\beta}u-\mathcal{L}^{\beta}u^{\prime}\leq\hat{P}_{\pi}(u-u^{% \prime}).

Overall, we find that there exists $Q_{\pi}\in\mathcal{P}_{\pi}$ such that $\mathcal{L}^{\beta}u-\mathcal{L}^{\beta}u^{\prime}\leq Q_{\pi}(u-u^{\prime})$ . Similarly, we find $Q_{\pi^{\prime}}\in\mathcal{P}_{\pi^{\prime}}$ such that $\mathcal{L}^{\beta}u-\mathcal{L}^{\beta}u^{\prime}\geq Q_{\pi^{\prime}}(u-u^{% \prime})$ . We conclude that:

{\mathrm{sp}\left(\mathcal{L}^{\beta}u-\mathcal{L}^{\beta}u^{\prime}\right)}% \leq{\mathrm{sp}\left((Q_{\pi}-Q_{\pi^{\prime}})(u-u^{\prime})\right)}\leq{% \mathrm{sp}\left(u-u^{\prime}\right)}.

This concludes the proof. ∎

By composition, we obtain the following result.

Corollary 21.

$\mathfrak{L}$ is (1) monotone, (2) non-span-expansive and (3) linear. Moreover, ${\mathrm{sp}\left(\mathfrak{L}u-\mathfrak{L}v\right)}\leq{\mathrm{sp}\left(% \mathcal{L}u-\mathcal{L}v\right)}$ for all $u,v\in\mathbf{R}^{\mathcal{S}}$ .

B.3 Fix-points of $\mathfrak{L}$ and (weak) optimism

Lemma 22.

$\mathfrak{L}$ has a fix-point in span semi-norm, i.e., $\exists u\in\mathcal{H},{\mathrm{sp}\left(\mathfrak{L}u-u\right)}=0$ .

Proof.

The idea is to apply Brouwer’s fix-point theorem in $\mathbf{R}^{\mathcal{S}}$ quotiented by the equivalence relation $u\sim v\Leftrightarrow{\mathrm{sp}\left(u-v\right)}=0$ , where ${\mathrm{sp}\left(\cdot\right)}$ becomes a norm. By linearity (Corollary 21), $\mathfrak{L}$ is well-defined in this quotient space, and if $\mathfrak{L}$ is shown continuous on $\mathbf{R}^{\mathcal{S}}$ , so will it be on the quotient.

We show that $\mathfrak{L}$ is sequentially continuous on $\mathcal{H}$ . Consider a sequence $u_{n}\in\mathcal{H}^{\mathbf{N}}$ converging to $u\in\mathcal{H}$ and fix $\epsilon>0$ . Provided that $n>N_{\epsilon}$ for $N_{\epsilon}$ large enough, we have $\left\|u_{n}-u\right\|_{\infty}<\epsilon$ , i.e., $u-\epsilon e\leq u_{n}\leq u+\epsilon e$ . Therefore, on the one hand, for all $v\leq u_{n}$ , we have $v-\epsilon e\leq u$ so $\max\left\{v\leq u_{n}:v\in\mathcal{H}\right\}\leq\max\left\{v\leq u:v\in\mathcal{H}\right\}+\epsilon e$ ; and on the other hand, for all $v\leq u$ , $v-\epsilon e\leq u_{n}$ so $\max\left\{v\leq u:v\in\mathcal{H}\right\}\leq\max\left\{v\leq u_{n}:v\in\mathcal{H}\right\}+\epsilon e$ . Hence:

\left\|\max\left\{v\leq u:v\in\mathcal{H}\right\}-\max\left\{v\leq u_{n}:v\in\mathcal{H}\right\}\right\|_{\infty}\leq\epsilon.

This shows that $\Gamma$ is continuous. The operator $\mathcal{L}^{\beta}$ is obviously continuous as well, so $\mathfrak{L}=\Gamma\circ\mathcal{L}^{\beta}$ is continuous by composition. Since $\mathcal{H}=\mathcal{H}_{0}+\mathbf{R}e$ with $\mathcal{H}_{0}$ compact and convex, the quotient $\mathcal{H}/{\sim}$ is compact and convex, and is preserved by $\mathfrak{L}/{\sim}$ . By Brouwer’s fix-point theorem, $\mathfrak{L}/{\sim}$ has a fix-point in $\mathcal{H}/{\sim}$ . So $\mathfrak{L}$ has a span fix-point in $\mathcal{H}$ . ∎

We write $\operatorname{{Fix}}(\mathfrak{L})$ for the set of span fix-points of $\mathfrak{L}$ .

Lemma 23.

$\mathfrak{L}$ has well-defined growth. Specifically, if $\mathfrak{L}u=u+\mathfrak{g}e$ , then:

(1)

There exists $c>0$ , s.t., for all $v\in\mathcal{H}_{0}$ , $(n\mathfrak{g}-c)e+u\leq\mathfrak{L}^{n}v\leq(n\mathfrak{g}+c)e+u$ ;
(2)

If $u^{\prime}\in\operatorname{{Fix}}(\mathfrak{L})$ , then $\mathfrak{L}u^{\prime}-u^{\prime}=\mathfrak{g}e$ .

Proof.

Setting $c:=\max_{v\in\mathcal{H}_{0}}\left\|v-u\right\|_{\infty}<\infty$ , one can check that $u-ce\leq v\leq u+ce$ for all $v\in\mathcal{H}_{0}$ . This proves (1) for $n=0$ and we then proceed by induction on $n\geq 0$ . By induction, $\mathfrak{L}^{n}v\leq u+(n\mathfrak{g}+c)e$ and by Corollary 21, $\mathfrak{L}$ is monotone, so we have:

\mathfrak{L}^{n+1}v=\mathfrak{L}\mathfrak{L}^{n}v\leq\mathfrak{L}(u+(n\mathfrak{g}+c)e)=u+((n+1)\mathfrak{g}+c)e

where the last equality uses the linearity of $\mathfrak{L}$ together with $\mathfrak{L}u=u+\mathfrak{g}e$ . The lower bound of $\mathfrak{L}^{n}v$ is shown similarly, establishing (1).

For (2), pick $u^{\prime}\in\operatorname{{Fix}}(\mathfrak{L})$ with $\mathfrak{L}u^{\prime}=u^{\prime}+\mathfrak{g}^{\prime}e$ . Up to translating $u^{\prime}$ , we can assume that $u^{\prime}\in\mathcal{H}_{0}$ and apply (1). We get:

(n\mathfrak{g}-c)e+u\leq n\mathfrak{g}^{\prime}e+u^{\prime}\leq(n\mathfrak{g}+% c)e+u.

Divide by $n$ and let $n$ go to infinity. We conclude that $\mathfrak{g}=\mathfrak{g}^{\prime}$ . ∎

We finally have everything in hand to claim that $\mathfrak{L}$ solves (18).

Corollary 24.

The growth of $\mathfrak{L}$ given by $\mathfrak{g}=\mathfrak{L}u-u$ for $u\in\operatorname{{Fix}}(\mathfrak{L})$ is well-defined, and:

\forall u\in\mathcal{H},\quad\mathfrak{g}e=\liminf_{n\to\infty}\frac{\mathfrak% {L}^{n}u}{n}=\limsup_{n\to\infty}\frac{\mathfrak{L}^{n}u}{n}.

Moreover, $\mathfrak{g}\geq g^{*}(\mathcal{H},\beta,\mathcal{M})$ .

Proof.

The growth property is a direct consequence of Lemma 23. We show $\mathfrak{g}\geq g^{*}(\mathcal{H},\beta,\mathcal{M})$ which is defined in (18). Pick $\pi\in\Pi,\widetilde{M}\in\mathcal{M}$ its model with $\tilde{h}\equiv h(\pi,\widetilde{M})$ and $\tilde{P}_{\pi}\tilde{h}\leq\hat{P}_{\pi}\tilde{h}+\beta_{\pi}$ where $\beta_{\pi}(s):=\beta(s,\pi(s))$ . Up to translation, we can assume that $\tilde{h}\in\mathcal{H}_{0}$ .

We have $g(\pi,\widetilde{M})=\tilde{g}e$ for $\tilde{g}\in\mathbf{R}$ , so

\tilde{h}+\tilde{g}e=\tilde{r}_{\pi}+\tilde{P}_{\pi}\tilde{h}\leq\mathfrak{L}% \tilde{h}

by definition. By monotonicity of $\mathfrak{L}$ , see Corollary 21, $n\tilde{g}e+\tilde{h}\leq\mathfrak{L}^{n}\tilde{h}$ follows by induction on $n\geq 0$ . By Lemma 23, we further have $\mathfrak{L}^{n}\tilde{h}\leq(n\mathfrak{g}+c)e+u$ where $u\in\operatorname{{Fix}}(\mathfrak{L})$ . In tandem,

\tilde{g}e\leq\mathfrak{g}e+\frac{ce+u-\tilde{h}}{n}.

Letting $n\to\infty$ , we deduce that $\tilde{g}\leq\mathfrak{g}$ . Conclude by taking the best $\pi$ and $\widetilde{M}$ . ∎

The next theorem follows directly with the same proof technique, and guarantees optimism.

Theorem 25.

Assume that $g^{*}+h^{*}\leq\mathfrak{L}h^{*}$ . Then $\mathfrak{g}\geq g^{*}$ .

The condition “ $g^{*}+h^{*}\leq\mathfrak{L}h^{*}$ ” can be referred to as a weak form of optimism. We qualify this version of optimism as weak because it is much weaker than the optimism property suggested by Fruit [2019], namely $\mathcal{L}\geq L$ where $L$ is the Bellman operator of the true MDP. Here, we only ask for $\mathfrak{L}h^{*}\geq Lh^{*}$ , i.e., optimism at the fix-point of $L$ . This condition is met as soon as $M\in\mathcal{M}$ , $h^{*}\in\mathcal{H}$ and $\beta$ is large enough, but is in fact much more general.

B.4 Modelization of the projected mitigated Bellman operator $\mathfrak{L}$

The aim of this paragraph is to establish Corollary 27, stating that $\mathfrak{L}u$ can be viewed as a policy produced by Greedy $(\mathcal{M},u,\beta)$ .

Lemma 26 (Modelization).

For $\pi\in\Pi$ , denote $\beta_{\pi}(s):=\beta(s,\pi(s))$ , $\mathcal{R}_{\pi}:=\prod_{s}\mathcal{R}(s,\pi(s))$ and $\mathcal{P}_{\pi}:=\prod_{s}\mathcal{P}(s,\pi(s))$ . Fix $u\in\mathbf{R}^{\mathcal{S}}$ and let $\pi:=\texttt{Greedy}(\mathcal{M},u,\beta)$ .

(1)

If $\mathcal{P}$ is convex, then there exists $(\tilde{r}_{\pi},\tilde{P}_{\pi})\in\mathcal{R}_{\pi}\times\mathcal{P}_{\pi}$ such that $\mathcal{L}_{\beta}u=\tilde{r}_{\pi}+\tilde{P}_{\pi}u$ .
(2)

Assume that $\mathcal{L}_{\beta}u=\tilde{r}_{\pi}+\tilde{P}_{\pi}u$ . There exists $r^{\prime}_{\pi}\leq\tilde{r}_{\pi}$ such that $\mathfrak{L}u=r^{\prime}_{\pi}+\tilde{P}_{\pi}u$ .

The convexity requirement of (1) is always true if the kernel confidence region is chosen via (C1-4).

Proof.

For (1), fix a state $s\in\mathcal{S}$ , let $a:=\pi(s)$ and $\rho:=\min(\sup\mathcal{P}(s,a)u,\hat{p}(s,a)u+\beta(s,a))$ . If $\rho=\sup\mathcal{P}(s,a)u$ , then there is nothing to prove: $\mathcal{P}$ is compact, hence the sup is a max and $\rho$ is of the form $\tilde{p}(s,a)u$ . Otherwise, there exists $\tilde{p}(s,a)\in\mathcal{P}(s,a)$ with $\tilde{p}(s,a)u>\hat{p}(s,a)u+\beta(s,a)$ . Introduce, for $\lambda\in[0,1]$ ,

\tilde{p}_{\lambda}(s,a):=\lambda\tilde{p}(s,a)+(1-\lambda)\hat{p}(s,a).

By continuity, there exists $\lambda\in(0,1)$ such that $\tilde{p}_{\lambda}(s,a)u=\hat{p}(s,a)u+\beta(s,a)$ and by convexity of $\mathcal{P}(s,a)$ , $\tilde{p}_{\lambda}(s,a)\in\mathcal{P}(s,a)$ . This proves (1).

For (2), recall that $\mathfrak{L}u=\Gamma\mathcal{L}_{\beta}u=\Gamma(\tilde{r}_{\pi}+\tilde{P}_{\pi}u)$ . Since $\Gamma v\leq v$ for all $v\in\mathbf{R}^{\mathcal{S}}$ , we have:

\Gamma(\tilde{r}_{\pi}+\tilde{P}_{\pi}u)\leq\tilde{r}_{\pi}+\tilde{P}_{\pi}u.

Set $r^{\prime}_{\pi}:=\Gamma(\tilde{r}_{\pi}+\tilde{P}_{\pi}u)-\tilde{P}_{\pi}u$ . Check that $r^{\prime}_{\pi}$ satisfies $r^{\prime}_{\pi}\leq\tilde{r}_{\pi}$ and $\mathfrak{L}u=r^{\prime}_{\pi}+\tilde{P}_{\pi}u$ . ∎
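The interpolation argument used for (1) can be made concrete. The sketch below computes the mixing weight $\lambda$ for which $\tilde{p}_{\lambda}u=\hat{p}u+\beta$ whenever the candidate kernel overshoots the mitigation threshold. The function name `mitigated_kernel` and the numerical values are hypothetical, and the sketch assumes $\beta\geq 0$ and kernels given as plain probability vectors.

```python
def mitigated_kernel(p_tilde, p_hat, u, beta):
    """Return (p_lambda, lam) with p_lambda = lam * p_tilde + (1 - lam) * p_hat.

    If p_tilde already satisfies p_tilde.u <= p_hat.u + beta, it is returned
    unchanged (lam = 1). Otherwise lam is chosen so that
    p_lambda.u = p_hat.u + beta. Assumes beta >= 0.
    """
    def dot(p):
        return sum(pi * ui for pi, ui in zip(p, u))

    gap = dot(p_tilde) - dot(p_hat)
    if gap <= beta:
        return list(p_tilde), 1.0
    lam = beta / gap  # lies in [0, 1) because 0 <= beta < gap
    mixed = [lam * pt + (1.0 - lam) * ph for pt, ph in zip(p_tilde, p_hat)]
    return mixed, lam


if __name__ == "__main__":
    p_lam, lam = mitigated_kernel([0.1, 0.9], [0.5, 0.5], [0.0, 1.0], 0.1)
    print(lam, p_lam)  # the mixed kernel is a convex combination of the two inputs
```

Since the result is a convex combination of $\tilde{p}(s,a)$ and $\hat{p}(s,a)$, it stays in $\mathcal{P}(s,a)$ as soon as that set is convex and contains both points, which is where the convexity assumption enters.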

The last corollary below is crucial to claim that greedy policies are good choices in PMEVI-DT.

Corollary 27 (Greedy modelization).

Let $u\in\mathbf{R}^{\mathcal{S}}$ and fix $\pi:=\texttt{Greedy}(\mathcal{M},u,\beta)$ . If $\mathcal{P}$ is convex, then with the notations of Lemma 26, there exists $\tilde{r}_{\pi}\leq\sup\mathcal{R}_{\pi}$ and $\tilde{P}_{\pi}\in\mathcal{P}_{\pi}$ such that $\mathfrak{L}u=\tilde{r}_{\pi}+\tilde{P}_{\pi}u$ .

Appendix C Proof of Theorem 5: Regret analysis of PMEVI-DT

We recall a few notations. At episode $k$ , the played policy is denoted $\pi_{k}$ . As a greedy response to $\mathfrak{h}_{k}$ , by Proposition 2 (3), there exist $\tilde{r}_{k}(s)\leq\sup\mathcal{R}_{t_{k}}(s,\pi_{k}(s))$ and $\tilde{P}_{k}(s)\in\mathcal{P}_{t_{k}}(s,\pi_{k}(s))$ such that $\mathfrak{h}_{k}+\mathfrak{g}_{k}=\tilde{r}_{k}+\tilde{P}_{k}\mathfrak{h}_{k}$ . The reward-kernel pair $\widetilde{M}_{k}=(\tilde{r}_{k},\tilde{P}_{k})$ is referred to as the optimistic model of $\pi_{k}$ . We write $P_{k}:=P_{\pi_{k}}(M)$ for the true kernel and $\hat{P}_{k}:=P_{\pi_{k}}(\hat{M}_{t_{k}})$ for the empirical kernel. Likewise, we define the reward functions $r_{k}$ and $\hat{r}_{k}$ . The optimistic gain and bias satisfy $\mathfrak{g}_{k}=g(\pi_{k},\widetilde{M}_{k})$ and $\mathfrak{h}_{k}=h(\pi_{k},\widetilde{M}_{k})$ . We further denote $c_{0}=T^{\frac{1}{5}}$ .

Important remark.

To slightly simplify the analysis, we assume that PMEVI is run with perfect precision $\epsilon=0$ , i.e., that $\mathfrak{h}_{k}=\texttt{PMEVI}(\mathcal{M}_{t_{k}},\beta_{t_{k}},\Gamma_{t_{k}},0)$ hence is a span fix-point of $\mathfrak{L}_{t_{k}}$ . This assumption is mild and can be dropped by adding an extra error term that has to be carried through the calculations.

C.1 Number of episodes under doubling trick (DT)

Lemma 28 (Number of episodes, Auer et al. [2009]).

The number of episodes up to time $T\geq SA$ is upper-bounded by:

K(T)\leq SA\log_{2}\left(\tfrac{8T}{SA}\right).
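The bound of Lemma 28 is purely combinatorial and can be checked by simulation. The sketch below assumes the doubling rule in the style of Auer et al. [2009]: an episode ends as soon as the number of in-episode visits of the current pair reaches $\max(1,N_{t_{k}}(s,a))$. The exact form of (DT) used by PMEVI-DT may differ in details, and the names `count_episodes` and `episode_bound`, as well as the uniformly random visit sequence, are illustrative assumptions.

```python
import math
import random


def count_episodes(S, A, T, seed=0):
    """Count episodes started before time T under a doubling rule.

    Pairs (s, a) are drawn uniformly at random; the counting argument
    behind the bound does not depend on how the sequence is generated.
    """
    rng = random.Random(seed)
    n_start = [[0] * A for _ in range(S)]   # visit counts at the start of the episode
    in_episode = [[0] * A for _ in range(S)]  # visits made during the episode
    episodes = 1
    for t in range(T):
        s, a = rng.randrange(S), rng.randrange(A)
        in_episode[s][a] += 1
        if in_episode[s][a] >= max(1, n_start[s][a]):
            # The visited pair has doubled its count: close the episode.
            for i in range(S):
                for j in range(A):
                    n_start[i][j] += in_episode[i][j]
                    in_episode[i][j] = 0
            if t < T - 1:
                episodes += 1
    return episodes


def episode_bound(S, A, T):
    """Upper bound SA * log2(8T / (SA)) of Lemma 28, valid for T >= SA."""
    return S * A * math.log2(8 * T / (S * A))


if __name__ == "__main__":
    print(count_episodes(3, 2, 10_000), episode_bound(3, 2, 10_000))
```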

C.2 Sum of bias variances

Lemma 29 below shows that $\sum_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})$ scales as $T{\mathrm{sp}\left(h^{*}\right)}{\mathrm{sp}\left(r\right)}+{\mathrm{sp}\left(h^{*}\right)}\operatorname{{Reg}}(T)$ with high probability.

Lemma 29.

With probability at least $1-\delta$ , we have:

\sum\nolimits_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})\leq 2{\mathrm{sp}\left(h^{% *}\right)}{\mathrm{sp}\left(r\right)}T+{\mathrm{sp}\left(h^{*}\right)}^{2}% \sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}+2{\mathrm{sp}\left(h^{*% }\right)}\sum\nolimits_{t=0}^{T-1}\Delta^{*}(X_{t})+{\mathrm{sp}\left(h^{*}% \right)}^{2}.

Proof.

Using the Bellman equation $h^{*}(s)+g^{*}(s)=r(s,a)+p(s,a)h^{*}+\Delta^{*}(s,a)$ , we have:

\mathbf{V}(p(X_{t}),h^{*})\leq\left(p(X_{t})-e_{S_{t}}\right)h^{*2}+2h^{*}(S_{t})(\Delta^{*}(X_{t})+r(X_{t})-g^{*}(S_{t})).

Since ${\mathrm{sp}\left(h^{*2}\right)}\leq{\mathrm{sp}\left(h^{*}\right)}^{2}$ , we get:

	$\displaystyle\sum\nolimits_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})$	$\displaystyle\leq\sum\nolimits_{t=0}^{T-1}\left(p(X_{t})-e_{S_{t}}\right)h^{*2}+2{\mathrm{sp}\left(h^{*}\right)}\left({\mathrm{sp}\left(r\right)}T+\sum\nolimits_{t=0}^{T-1}\Delta^{*}(X_{t})\right)$
		$\displaystyle\leq\sum\nolimits_{t=0}^{T-1}\left(p(X_{t})-e_{S_{t+1}}\right)h^{*2}+2{\mathrm{sp}\left(h^{*}\right)}\left(\tfrac{1}{2}{\mathrm{sp}\left(h^{*}\right)}+{\mathrm{sp}\left(r\right)}T+\sum\nolimits_{t=0}^{T-1}\Delta^{*}(X_{t})\right)$
	(Lemma 32)	$\displaystyle\leq 2{\mathrm{sp}\left(h^{*}\right)}{\mathrm{sp}\left(r\right)}T+{\mathrm{sp}\left(h^{*}\right)}^{2}\sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}+2{\mathrm{sp}\left(h^{*}\right)}\sum\nolimits_{t=0}^{T-1}\Delta^{*}(X_{t})+{\mathrm{sp}\left(h^{*}\right)}^{2}$

where the last inequality holds with probability $1-\delta$ . This concludes the proof. ∎
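Two elementary facts are used in this proof: the bound ${\mathrm{sp}\left(h^{*2}\right)}\leq{\mathrm{sp}\left(h^{*}\right)}^{2}$ and the telescoping that replaces $e_{S_{t}}$ by $e_{S_{t+1}}$ at a cost of at most ${\mathrm{sp}\left(h^{*}\right)}^{2}$ . The sketch below checks both on a toy vector. It assumes that the bias is normalized so that $\min h=0$ , which is what the span inequality appears to require; the vector `H`, the distribution `P`, the trajectory `STATES` and the helper names are hypothetical.

```python
def span(h):
    """Span of a vector: max minus min."""
    return max(h) - min(h)


def variance(p, h):
    """Variance of h under the distribution p, i.e. V(p, h)."""
    mean = sum(pi * hi for pi, hi in zip(p, h))
    return sum(pi * (hi - mean) ** 2 for pi, hi in zip(p, h))


def telescoped_shift(h, states):
    """Sum over t of h(S_{t+1})^2 - h(S_t)^2 along a trajectory.

    This is the price paid when e_{S_t} is replaced by e_{S_{t+1}}.
    """
    return sum(h[states[t + 1]] ** 2 - h[states[t]] ** 2
               for t in range(len(states) - 1))


H = [0.0, 0.5, 2.0]          # toy bias vector, normalized so that min(H) == 0
P = [0.2, 0.3, 0.5]          # toy transition distribution
STATES = [0, 2, 1, 1, 2]     # toy trajectory S_0, ..., S_T

if __name__ == "__main__":
    h_squared = [x ** 2 for x in H]
    print(span(h_squared), span(H) ** 2)       # span of the square vs squared span
    print(telescoped_shift(H, STATES))         # equals H[S_T]^2 - H[S_0]^2
    print(variance(P, H))                      # equals P.H^2 - (P.H)^2
```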

C.3 Regret and pseudo-regret: A tight relation

In this paragraph, we bound the regret in terms of the pseudo-regret (and conversely) up to an additive term of order $({\mathrm{sp}\left(h^{*}\right)}{\mathrm{sp}\left(r\right)}T\log(\tfrac{T}{\delta}))^{1/2}$ . Hence, in proofs, the pseudo-regret can be changed to the regret with ease.

Lemma 30.

With probability $1-4\delta$ , the regret and the pseudo-regret are linked as follows:

\left|\sum_{t=0}^{T-1}(g^{*}-R_{t})-\sum_{t=0}^{T-1}\Delta^{*}(X_{t})\right|% \leq\begin{Bmatrix}2\!\!\sqrt{\left(2{\mathrm{sp}\left(h^{*}\right)}{\mathrm{% sp}\left(r\right)}+\tfrac{1}{8}\right)T\log\left(\tfrac{T}{\delta}\right)}+\!% \!\sqrt{2{\mathrm{sp}\left(h^{*}\right)}\log\left(\tfrac{T}{\delta}\right)\sum% \nolimits_{t=0}^{T-1}\Delta^{*}(X_{t})}\\ +{\mathrm{sp}\left(h^{*}\right)}\left(\tfrac{1}{2}T\right)^{\frac{1}{4}}\log^{% \frac{3}{4}}\left(\frac{T}{\delta}\right)+4{\mathrm{sp}\left(h^{*}\right)}\log% \left(\tfrac{T}{\delta}\right)+2{\mathrm{sp}\left(h^{*}\right)}\end{Bmatrix}.

Proof.

We rely again on the Poisson equation $g^{*}(S_{t})-r(X_{t})-\Delta^{*}(X_{t})=(p(X_{t})-e_{S_{t}})h^{*}$ , so:

	$\displaystyle\textrm{A}:=\left|\sum\nolimits_{t=0}^{T-1}(g^{*}-R_{t}-\Delta^{*}(X_{t}))\right|$	$\displaystyle\leq\left|\sum\nolimits_{t=0}^{T-1}\left(p(X_{t})-e_{S_{t}}\right)h^{*}\right|+\left|\sum\nolimits_{t=0}^{T-1}\left(R_{t}-r(X_{t})\right)\right|$
		$\displaystyle\leq{\mathrm{sp}\left(h^{*}\right)}+\left|\sum\nolimits_{t=0}^{T-1}\left(p(X_{t})-e_{S_{t+1}}\right)h^{*}\right|+\left|\sum\nolimits_{t=0}^{T-1}\left(R_{t}-r(X_{t})\right)\right|.$

Up to the constant ${\mathrm{sp}\left(h^{*}\right)}$ , the two error terms are respectively a navigation and a reward error. The second is bounded using Azuma’s inequality (Lemma 32), showing that with probability $1-2\delta$ , we have:

\left|\sum\nolimits_{t=0}^{T-1}(R_{t}-r(X_{t}))\right|\leq\sqrt{\tfrac{1}{2}T% \log\left(\tfrac{1}{\delta}\right)}.

We continue by using Freedman’s inequality, instantiated in the form of Lemma 33. With probability $1-\delta$ , we have:

\left|\sum\nolimits_{t=0}^{T-1}\left(p(X_{t})-e_{S_{t+1}}\right)h^{*}\right|% \leq\sqrt{2\sum\nolimits_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac% {T}{\delta}\right)}+4{\mathrm{sp}\left(h^{*}\right)}\log\left(\tfrac{T}{\delta% }\right).

The quantity $\sum_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})$ is a classical one that appears at several places throughout the analysis. Using Lemma 29, we bound it explicitly. Further simplifying the bound with $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ , we get that with probability $1-4\delta$ , we have:

\textrm{A}\leq\begin{Bmatrix}\sqrt{2{\mathrm{sp}\left(h^{*}\right)}{\mathrm{sp% }\left(r\right)}T\log\left(\tfrac{T}{\delta}\right)}+\sqrt{\tfrac{1}{2}T\log% \left(\tfrac{1}{\delta}\right)}+\sqrt{2{\mathrm{sp}\left(h^{*}\right)}\log% \left(\tfrac{T}{\delta}\right)\sum\nolimits_{t=0}^{T-1}\Delta^{*}(X_{t})}\\ +{\mathrm{sp}\left(h^{*}\right)}\left(\tfrac{1}{2}T\right)^{\frac{1}{4}}\log^{% \frac{3}{4}}\left(\frac{T}{\delta}\right)+4{\mathrm{sp}\left(h^{*}\right)}\log% \left(\tfrac{T}{\delta}\right)+2{\mathrm{sp}\left(h^{*}\right)}\end{Bmatrix}.

Bound $\log(\tfrac{1}{\delta})$ by $\log(\tfrac{T}{\delta})$ and use $\sqrt{a}+\sqrt{b}\leq 2\sqrt{a+b}$ to merge the terms in $\sqrt{T\log(\tfrac{T}{\delta})}$ under a single square-root. ∎

Overall, Lemma 30 states that the regret $\sum_{t=0}^{T-1}(g^{*}-R_{t})$ and the pseudo-regret $\sum_{t=0}^{T-1}\Delta^{*}(X_{t})$ differ by about $({\mathrm{sp}\left(h^{*}\right)}T\log(\tfrac{T}{\delta}))^{1/2}$ in probability (up to asymptotically negligible additional terms). In general, the precise form of Lemma 30 is not convenient to use because it is of the form $x\leq y+\alpha\sqrt{y}+\beta$ , which is not linear in $y$ . Corollary 31 factorizes the result into one which will be more convenient in proofs.

Corollary 31.

Denote $x:=\sum_{t=0}^{T-1}(g^{*}-R_{t})$ and $y:=\sum_{t=0}^{T-1}\Delta^{*}(X_{t})$ . Further introduce:

	$\displaystyle\alpha$	$\displaystyle:=\sqrt{2{\mathrm{sp}\left(h^{*}\right)}\log\left(\tfrac{T}{% \delta}\right)}$
	$\displaystyle\beta$	$\displaystyle:=2\sqrt{\left(2{\mathrm{sp}\left(h^{*}\right)}{\mathrm{sp}\left(r\right)}+\tfrac{1}{2}\right)T\log\left(\tfrac{T}{\delta}\right)}+{\mathrm{sp}\left(h^{*}\right)}\left(\tfrac{1}{2}T\right)^{\frac{1}{4}}\log^{\frac{3}{4}}\left(\tfrac{T}{\delta}\right)+2{\mathrm{sp}\left(h^{*}\right)}\left(2\log\left(\tfrac{T}{\delta}\right)+1\right).$

Then, with probability $1-4\delta$ , we have $\sqrt{x}\leq\sqrt{y}+\tfrac{1}{2}\alpha+\sqrt{\beta}$ and $\sqrt{y}\leq\sqrt{x}+\alpha+\sqrt{\beta}$ .

Proof.

This is straightforward algebra from the result of Lemma 30. ∎
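The algebra can be sanity-checked numerically. Assuming $x,y\geq 0$ and $|x-y|\leq\alpha\sqrt{y}+\beta$ , the largest admissible $x$ is $y+\alpha\sqrt{y}+\beta\leq(\sqrt{y}+\tfrac{1}{2}\alpha+\sqrt{\beta})^{2}$ , and the largest admissible $\sqrt{y}$ is the positive root of $s^{2}-\alpha s-(x+\beta)=0$ , which is at most $\sqrt{x}+\alpha+\sqrt{\beta}$ . The sketch below checks both extreme cases on a grid; the function names and the grid are illustrative choices, not part of the analysis.

```python
import math


def worst_case_x(y, alpha, beta):
    """Largest x compatible with x <= y + alpha * sqrt(y) + beta."""
    return y + alpha * math.sqrt(y) + beta


def worst_case_sqrt_y(x, alpha, beta):
    """Largest sqrt(y) compatible with y <= x + alpha * sqrt(y) + beta.

    It is the positive root of s^2 - alpha * s - (x + beta) = 0.
    """
    return (alpha + math.sqrt(alpha ** 2 + 4.0 * (x + beta))) / 2.0


def corollary_holds(values, tol=1e-9):
    """Check both square-root inequalities at the extreme points of a grid."""
    for alpha in values:
        for beta in values:
            for z in values:
                upper = math.sqrt(worst_case_x(z, alpha, beta))
                if upper > math.sqrt(z) + alpha / 2.0 + math.sqrt(beta) + tol:
                    return False
                lower = worst_case_sqrt_y(z, alpha, beta)
                if lower > math.sqrt(z) + alpha + math.sqrt(beta) + tol:
                    return False
    return True


if __name__ == "__main__":
    print(corollary_holds([0.0, 0.1, 1.0, 7.5, 100.0, 1e4]))
```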

C.4 Proof of Lemma 6, reward optimism

We start by getting rid of the reward noise. We have:

	$\displaystyle\operatorname{{Reg}}(T)$	$\displaystyle:=\sum\nolimits_{t=0}^{T-1}(g^{*}-R_{t})=\sum\nolimits_{t=0}^{T-1}(g^{*}-r(X_{t}))+\sum\nolimits_{t=0}^{T-1}(r(X_{t})-R_{t})$
		$\displaystyle\leq\sum\nolimits_{t=0}^{T-1}(g^{*}-r(X_{t}))+\sqrt{\tfrac{1}{2}T% \log\left(\tfrac{1}{\delta}\right)}$

with probability $1-\delta$ by Azuma’s inequality (Lemma 32). We are left with $\sum_{t=0}^{T-1}(g^{*}-r(X_{t}))$ . We continue by splitting the regret episodically and invoking optimism. By Lemma 13, with probability $1-4\delta$ , we have $\sum\nolimits_{t=0}^{T-1}(g^{*}-r(X_{t}))\leq\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1% }(\mathfrak{g}_{k}-r(X_{t}))$ . Introduce

B_{0}(T):=\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}(\mathfrak{g}_{k}-r(X_{t})).

(20)

We focus on bounding $B_{0}(T)$ . By 2, $\tilde{r}_{k}(s,a)$ is of the form $\hat{r}_{k}(s,a)+\sqrt{C\log(2SAT/\delta)/N_{t_{k}}(s,a)}-\eta_{k}(s,a)$ with $\eta_{k}(s,a)\in\mathbf{R}$ . By the statement (3) of Proposition 2, $\eta_{k}(s,a)\geq 0$ . Therefore,

	$\displaystyle B_{0}(T)$	$\displaystyle=\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\left(\mathfrak{g}_{k}-\tilde{% r}_{k}(X_{t})\right)+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\left(\tilde{r_{k}}(X_{% t})-r(X_{t})\right)$
		$\displaystyle\leq\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\left(\mathfrak{g}_{k}-% \tilde{r}_{k}(X_{t})\right)+SA+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}% \left(N_{t_{k}}(X_{t})\geq 1\right)\left(\hat{r_{k}}(X_{t})-r(X_{t})+\sqrt{% \frac{C\log\left(\tfrac{2SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}}\right)$
		$\displaystyle\overset{(*)}{\leq}\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\left(% \mathfrak{g}_{k}-\tilde{r}_{k}(X_{t})\right)+SA+\sum_{k}\sum_{t=t_{k}}^{t_{k+1% }-1}\mathbf{1}\left(N_{t_{k}}(X_{t})\geq 1\right)\left(\sqrt{\frac{2\log\left(% \tfrac{2SAT}{\delta}\right)}{N_{t_{k}}(s,a)}}+\sqrt{\frac{C\log\left(\tfrac{2% SAT}{\delta}\right)}{N_{t_{k}}(s,a)}}\right)$

where $(*)$ holds with probability $1-\delta$ following Lemma 35. By the doubling trick rule (DT), we have $N_{t}(X_{t})\leq 2N_{t_{k}}(X_{t})$ for $t<t_{k+1}$ , so, with probability $1-\delta$ ,

	$\displaystyle B_{0}(T)$	$\displaystyle\leq\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\left(\mathfrak{g}_{k}-% \tilde{r}_{k}(X_{t})\right)+SA+2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}% \left(N_{t_{k}}(X_{t})\geq 1\right)\sqrt{\frac{(2+C)\log\left(\tfrac{2SAT}{% \delta}\right)}{N_{t_{k}}(s,a)}}$
		$\displaystyle\leq\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\left(\mathfrak{g}_{k}-% \tilde{r}_{k}(X_{t})\right)+SA+2\sqrt{(2+C)\log\left(\tfrac{2SAT}{\delta}% \right)}\sum_{s,a}\sum_{n=1}^{N_{T}(s,a)-1}\sqrt{\tfrac{1}{n}}$
		$\displaystyle\leq\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\left(\mathfrak{g}_{k}-% \tilde{r}_{k}(X_{t})\right)+SA+4\sqrt{(2+C)\log\left(\tfrac{2SAT}{\delta}% \right)}\sum_{s,a}\sqrt{N_{T}(s,a)}$
	(Jensen)	$\displaystyle\leq\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\left(\mathfrak{g}_{k}-% \tilde{r}_{k}(X_{t})\right)+SA+4\sqrt{(2+C)SAT\log\left(\tfrac{2SAT}{\delta}% \right)}.$

We conclude that with probability $1-6\delta$ , we have:

\operatorname{{Reg}}(T)\leq\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\left(\mathfrak{g% }_{k}-\tilde{r}_{k}(X_{t})\right)+4\sqrt{(2+C)SAT\log\left(\tfrac{2SAT}{\delta% }\right)}+\sqrt{\tfrac{1}{2}T\log\left(\tfrac{2SAT}{\delta}\right)}+SA.

(21)

This concludes the proof. ∎
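The last two steps of the chain above rely on the elementary bounds $\sum_{n=1}^{N}n^{-1/2}\leq 2\sqrt{N}$ and, by Jensen's inequality, $\sum_{s,a}\sqrt{N_{T}(s,a)}\leq\sqrt{SAT}$ when $\sum_{s,a}N_{T}(s,a)=T$ . The following sketch checks both numerically; the helper names and the visit counts are hypothetical.

```python
import math


def inverse_sqrt_sum(n_visits):
    """Compute sum_{n=1}^{N} 1 / sqrt(n)."""
    return sum(1.0 / math.sqrt(n) for n in range(1, n_visits + 1))


def jensen_gap(counts):
    """Return sqrt(len(counts) * T) - sum_i sqrt(counts[i]), with T = sum(counts).

    Jensen's inequality (concavity of the square root) says this is >= 0.
    """
    total = sum(counts)
    return math.sqrt(len(counts) * total) - sum(math.sqrt(c) for c in counts)


if __name__ == "__main__":
    print(inverse_sqrt_sum(100), 2 * math.sqrt(100))
    # Unbalanced visit counts over four state-action pairs, T = 1000.
    print(jensen_gap([1, 10, 100, 889]))
```

The gap in Jensen's inequality vanishes only when all pairs are visited equally often, so the $\sqrt{SAT}$ term is attained by uniform exploration.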

C.5 Proof of Lemma 7, navigation error

We have:

	$\displaystyle\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}(p_{k}(S_{t})-e_{S_{t}})% \mathfrak{h}_{k}$	$\displaystyle\leq\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}(p_{k}(S_{t})-e_{S_{t+1}})% \mathfrak{h}_{k}+\sum_{k}{\mathrm{sp}\left(\mathfrak{h}_{k}\right)}$
		$\displaystyle\leq\underbrace{\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}(p_{k}(S_{t})-e_{S_{t+1}})(\mathfrak{h}_{k}-h^{*})}_{\mathrm{A}_{1}}+\underbrace{\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}(p_{k}(S_{t})-e_{S_{t+1}})h^{*}}_{\mathrm{A}_{2}}+\sum_{k}{\mathrm{sp}\left(\mathfrak{h}_{k}\right)}.$

The last term is $\operatorname*{{\rm O}}(c_{0}SA\log(T))$ by Lemma 28, hence is $\operatorname*{{\rm O}}(T^{1/5}\log(T))$ .

(STEP 1) We start by bounding $\mathrm{A}_{1}$ . By Lemma 13, with probability $1-4\delta$ , we have $h^{*}\in\mathcal{H}_{t_{k}}$ for all $k\leq K(T)$ . So ${\mathrm{sp}\left(\mathfrak{h}_{k}-h^{*}\right)}\leq{\mathrm{sp}\left(% \mathfrak{h}_{k}\right)}+{\mathrm{sp}\left(h^{*}\right)}\leq 2c_{0}$ . By Freedman’s inequality invoked in the form of Lemma 33, we have with probability $1-5\delta$ ,

\mathrm{A}_{1}\leq\sqrt{2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{V}\left(p(X% _{t}),\mathfrak{h}_{k}-h^{*}\right)\log\left(\tfrac{T}{\delta}\right)}+8c_{0}% \log\left(\tfrac{T}{\delta}\right)

It suffices to bound the first term. Recall that $e$ is the vector full of ones. We have:

	$\displaystyle\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{V}(p(X_{t}),\mathfrak{h}_{k}-h^{*})=\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{V}\left(p(X_{t}),\mathfrak{h}_{k}-h^{*}-(\mathfrak{h}_{k}(S_{t})-h^{*}(S_{t}))\cdot e\right)$
	$\displaystyle\leq\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}p(s^{\prime}|X_{t})\left(\mathfrak{h}_{k}(s^{\prime})-h^{*}(s^{\prime})-(\mathfrak{h}_{k}(S_{t})-h^{*}(S_{t}))\right)^{2}$
	$\displaystyle\overset{(*)}{\leq}3\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{E}\left[\sum_{s^{\prime}\in\mathcal{S}}p(s^{\prime}|X_{t})\left(\mathfrak{h}_{k}(s^{\prime})-h^{*}(s^{\prime})-(\mathfrak{h}_{k}(S_{t})-h^{*}(S_{t}))\right)^{2}\Bigg{|}\mathcal{F}_{t}\right]+16c_{0}^{2}\log\left(\tfrac{1}{\delta}\right)$
	$\displaystyle=3\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\left(\mathfrak{h}_{k}(S_{t+1})-h^{*}(S_{t+1})-(\mathfrak{h}_{k}(S_{t})-h^{*}(S_{t}))\right)^{2}+16c_{0}^{2}\log\left(\tfrac{1}{\delta}\right).$

Here the inequality $(*)$ holds with probability $1-\delta$ following Lemma 40. We will bound the summand with the bias estimation error $\text{error}(c_{k},s,s^{\prime})$ that spawns the inner regret estimation $B_{0}(t_{k})=\sum_{\ell=1}^{k-1}\sum_{t=t_{\ell}}^{t_{\ell+1}-1}(\mathfrak{g}_% {\ell}-R_{t})$ . This inner estimation is linked to $B(T):=\sum_{k,t}(\mathfrak{g}_{k}-R_{t})$ the overall optimistic regret by:

	$\displaystyle B_{0}(t_{k})$	$\displaystyle\leq\sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{% \ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)-\sum\nolimits_{\ell=k}^{K(T)}% \sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)$
		$\displaystyle\overset{(*)}{\leq}\sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(g^{*}-R_{t}\right)$
		$\displaystyle\leq\sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_{k}}^{T-1}\left(\Delta^{*}(X_{t})+\left(p(X_{t})-e_{S_{t}}\right)h^{*}+r(X_{t})-R_{t}\right)$
		$\displaystyle\leq\sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)+{\mathrm{sp}\left(h^{*}\right)}-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_{k}}^{T-1}\left(\left(p(X_{t})-e_{S_{t+1}}\right)h^{*}+r(X_{t})-R_{t}\right)$
		$\displaystyle\overset{(\dagger)}{\leq}\sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)+{\mathrm{sp}\left(h^{*}\right)}+(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}$
		$\displaystyle=:B(T)+{\mathrm{sp}\left(h^{*}\right)}+(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}.$

In the above, $(*)$ holds with probability $1-4\delta$ uniformly on $k$ following Lemma 13 and $(\dagger)$ holds, also uniformly on $k$ , with probability $1-\delta$ by applying Azuma-Hoeffding’s inequality (Lemma 32). Continuing, still on the event specified by Lemma 13, we have with probability $1-6\delta$ :

	$\displaystyle\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{V}(p(X_{t}),\mathfrak{h% }_{k}-h^{*})$	$\displaystyle\leq 3\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\frac{3c_{0}+(1+c_{0})% \sqrt{8t_{k}\log\left(\tfrac{2}{\delta}\right)}+2B_{0}(t_{k})}{N_{t_{k}}(S_{t+% 1}\leftrightarrow S_{t})}+16c_{0}^{2}\log\left(\tfrac{1}{\delta}\right)$
		$\displaystyle\leq 3\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\frac{4c_{0}+(1+c_{0})% \sqrt{32T\log\left(\tfrac{2}{\delta}\right)}+2B(T)}{N_{t_{k}}(S_{t},A_{t},S_{t% +1})}+16c_{0}^{2}\log\left(\tfrac{1}{\delta}\right)$
	(DT)	$\displaystyle\leq 12c_{0}^{2}S^{2}A+3\left(4c_{0}+(1+c_{0})\sqrt{32T\log\left(% \tfrac{2}{\delta}\right)}+2B(T)\right)S^{2}A\log(T)$
		$\displaystyle\phantom{\leq{}}+16c_{0}^{2}\log\left(\tfrac{1}{\delta}\right).$

(STEP 2) For $\mathrm{A}_{2}$ , by Freedman’s inequality invoked in the form of Lemma 33 again, we have with probability $1-\delta$ ,

	$\displaystyle\mathrm{A}_{2}$	$\displaystyle\leq\sqrt{2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{V}(p_{k}(S_{% t}),h^{*})\log\left(\tfrac{T}{\delta}\right)}+8c_{0}\log\left(\tfrac{T}{\delta% }\right)$
		$\displaystyle\leq\sqrt{2\sum_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})\log\left(% \tfrac{T}{\delta}\right)}+8c_{0}\log\left(\tfrac{T}{\delta}\right).$

We recognize the sum of variances $\sum_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})$ that we leave as is.

(STEP 3) As a result, with probability $1-7\delta$ , we have:

$\displaystyle\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}(p_{k}(S_{t})-e_{S_{t}})% \mathfrak{h}_{k}\leq\sqrt{2\sum_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})\log\left% (\tfrac{T}{\delta}\right)}+2SA^{\frac{1}{2}}\sqrt{3B(T)}\log\left(\tfrac{T}{% \delta}\right)+\operatorname*{{\rm O}}\left(SA^{\frac{1}{2}}T^{\frac{7}{20}}% \log^{\frac{3}{4}}\left(\tfrac{T}{\delta}\right)\right)$

when $c_{0}=T^{\frac{1}{5}}$ . ∎
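The exponent $\tfrac{7}{20}$ can be traced with exact rational arithmetic. Under our reading of (STEP 1), the dominant lower-order contribution to the variance sum is the product $c_{0}\sqrt{T}$ (up to factors of $S$ , $A$ and logarithms), and Freedman's inequality then takes a square root. The sketch below only tracks powers of $T$ ; the variable names are hypothetical.

```python
from fractions import Fraction

# Exponent of T carried by c_0 = T^(1/5).
C0_EXPONENT = Fraction(1, 5)

# Dominant lower-order term inside the variance sum: c_0 * sqrt(T).
VARIANCE_TERM_EXPONENT = C0_EXPONENT + Fraction(1, 2)

# Freedman's inequality takes the square root of the variance sum.
CROSS_TERM_EXPONENT = VARIANCE_TERM_EXPONENT / 2

if __name__ == "__main__":
    print(CROSS_TERM_EXPONENT)                     # exponent of T in the O(.) term
    print(CROSS_TERM_EXPONENT < Fraction(1, 2))    # lower order than sqrt(T)
    print(C0_EXPONENT < CROSS_TERM_EXPONENT)       # dominates the c_0 * log terms
```

Any choice $c_{0}=T^{\gamma}$ with $\gamma<\tfrac{1}{2}$ keeps this term below $\sqrt{T}$ in this bookkeeping; the value $\tfrac{1}{5}$ is the one fixed in the analysis.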

C.6 Proof of Lemma 8, empirical bias error

Because $h^{*}$ is a fixed vector, Bennett’s inequality (see Lemma 39) guarantees that $(\hat{p}_{k}(S_{t})-p_{k}(S_{t}))h^{*}$ is small as follows. By doing a union bound over Lemma 39 with confidence $\frac{\delta}{SAT}$ over all pairs $(s,a)$ and visit counts $N(s,a)\leq T$ , we see that with probability $1-\delta$ , for all $k$ , we have:

	$\displaystyle\sum_{t=t_{k}}^{t_{k+1}-1}\left(\hat{p}_{k}(S_{t})-p_{k}(S_{t})\right)h^{*}$	$\displaystyle\leq{\mathrm{sp}\left(h^{*}\right)}SA+\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}\left(N_{t_{k}}(X_{t})\geq 1\right)\left(\sqrt{\tfrac{2\mathbf{V}(p(X_{t}),h^{*})\log\left(\frac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}}+\tfrac{{\mathrm{sp}\left(h^{*}\right)}\log\left(\tfrac{SAT}{\delta}\right)}{3N_{t_{k}}(X_{t})}\right)$
	(by doubling trick)	$\displaystyle\leq{\mathrm{sp}\left(h^{*}\right)}SA+2\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}\left(N_{t}(X_{t})\geq 1\right)\left(\sqrt{\tfrac{2\mathbf{V}(p(X_{t}),h^{*})\log\left(\frac{SAT}{\delta}\right)}{N_{t}(X_{t})}}+\tfrac{{\mathrm{sp}\left(h^{*}\right)}\log\left(\tfrac{SAT}{\delta}\right)}{3N_{t}(X_{t})}\right).$

Summing this over $k$ and factorizing over state-action pairs, we get that with probability $1-\delta$ ,

	$\displaystyle\sum_{k}(2k)$	$\displaystyle\leq{\mathrm{sp}\left(h^{*}\right)}SA+2\sum_{s,a}\left(\sum_{n=1}^{N_{T}(s,a)}\sqrt{\tfrac{2\mathbf{V}(p(s,a),h^{*})\log\left(\frac{SAT}{\delta}\right)}{n}}+\sum_{n=1}^{N_{T}(s,a)}\tfrac{{\mathrm{sp}\left(h^{*}\right)}\log\left(\tfrac{SAT}{\delta}\right)}{n}\right)$
		$\displaystyle\leq{\mathrm{sp}\left(h^{*}\right)}SA+4\sum_{s,a}\sqrt{N_{T}(s,a)\mathbf{V}(p(s,a),h^{*})\log\left(\tfrac{SAT}{\delta}\right)}+2{\mathrm{sp}\left(h^{*}\right)}SA\log\left(\tfrac{SAT}{\delta}\right)\log(T)$
	(Jensen)	$\displaystyle\leq{\mathrm{sp}\left(h^{*}\right)}SA+4\sqrt{SA\sum\nolimits_{s,a}N_{T}(s,a)\mathbf{V}(p(s,a),h^{*})\log\left(\tfrac{SAT}{\delta}\right)}+2{\mathrm{sp}\left(h^{*}\right)}SA\log\left(\tfrac{SAT}{\delta}\right)\log(T)$
		$\displaystyle={\mathrm{sp}\left(h^{*}\right)}SA+4\sqrt{SA\sum\nolimits_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac{SAT}{\delta}\right)}+2{\mathrm{sp}\left(h^{*}\right)}SA\log\left(\tfrac{SAT}{\delta}\right)\log(T).$

We recognize the sum of variances $\sum_{t=0}^{T-1}\mathbf{V}(p(X_{t}),h^{*})$ , that is left to be upper-bounded later on. ∎

C.7 Proof of Lemma 9, optimism overshoot

Because of the $\beta$ -mitigation generated by Algorithm 5, the quantity $(\tilde{p}_{k}(S_{t})-\hat{p}_{k}(S_{t}))\mathfrak{h}_{k}$ is shown to be directly related to $\mathbf{V}(p(X_{t}),h^{*})$ up to a provably negligible error. Denote $h^{\prime}_{k}$ the reference point BiasProjection $(\mathcal{H}_{t_{k}},c_{t_{k}}(-,s_{0}))$ used in Algorithm 5 (denoted $h_{0}$ in the algorithm). By Lemma 13, with probability $1-4\delta$ , we have $h^{*}\in\mathcal{H}_{t_{k}}$ for all $k$ . To lighten up notations, we write $d_{t_{k}}(s^{\prime},s)$ instead of $\text{error}(c_{t_{k}},s^{\prime},s)$ .

(STEP 1) Denote $\text{A}:=(\tilde{p}_{k}(S_{t})-\hat{p}_{k}(S_{t}))\mathfrak{h}_{k}$ . By construction of $\tilde{p}_{k}$ , we have $\text{A}\leq\beta_{t_{k}}(X_{t})$ , so:

	A	$\displaystyle\leq\beta_{t_{k}}(X_{t})$
		$\displaystyle=:\sqrt{\frac{2\left(\mathbf{V}(\hat{p}_{k}(S_{t}),h^{\prime}_{k})+8c_{0}\sum_{s^{\prime}\in\mathcal{S}}\hat{p}_{k}(s^{\prime}|S_{t})d_{t_{k}}(s^{\prime},S_{t})\log\left(\tfrac{SAT}{\delta}\right)\right)}{N_{t_{k}}(X_{t})}}+\frac{3c_{0}\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}$
		$\displaystyle\leq\underbrace{\sqrt{\frac{2\mathbf{V}(\hat{p}_{k}(S_{t}),h^{\prime}_{k})}{N_{t_{k}}(X_{t})}}}_{\text{A}_{1}}+\underbrace{\sqrt{\frac{16c_{0}\sum_{s^{\prime}\in\mathcal{S}}\hat{p}_{k}(s^{\prime}|S_{t})d_{t_{k}}(s^{\prime},S_{t})\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}}}_{\text{A}_{2}}+\frac{3c_{0}\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}.$

The rightmost term of A is of order $\operatorname*{{\rm O}}(\log^{2}(T))$ hence is negligible. We focus on the other two. The analysis of $\text{A}_{1}$ will spawn a term similar to $\text{A}_{2}$ , hence we start by the second. Recall that $d_{t_{k}}$ is the bias error provided by Algorithm 3 and that the inner regret estimation is $B_{0}(t_{k})=\sum_{\ell=1}^{k-1}\sum_{t=t_{\ell}}^{t_{\ell+1}-1}(\mathfrak{g}_% {\ell}-R_{t})$ . Now, remark that:

	$\displaystyle B_{0}(t_{k})$	$\displaystyle\leq\sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{% \ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)-\sum\nolimits_{\ell=k}^{K(T)}% \sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)$
		$\displaystyle\overset{(*)}{\leq}\sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(g^{*}-R_{t}\right)$
		$\displaystyle\leq\sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_{k}}^{T-1}\left(\Delta^{*}(X_{t})+\left(p(X_{t})-e_{S_{t}}\right)h^{*}+r(X_{t})-R_{t}\right)$
		$\displaystyle\leq\sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)+{\mathrm{sp}\left(h^{*}\right)}-\sum\nolimits_{\ell=k}^{K(T)}\sum\nolimits_{t=t_{k}}^{T-1}\left(\left(p(X_{t})-e_{S_{t+1}}\right)h^{*}+r(X_{t})-R_{t}\right)$
		$\displaystyle\overset{(\dagger)}{\leq}\sum\nolimits_{\ell=1}^{K(T)}\sum\nolimits_{t=t_{\ell}}^{t_{\ell+1}-1}\left(\mathfrak{g}_{k}-R_{t}\right)+{\mathrm{sp}\left(h^{*}\right)}+(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}$
		$\displaystyle=:B(T)+{\mathrm{sp}\left(h^{*}\right)}+(1+{\mathrm{sp}\left(h^{*}\right)})\sqrt{\tfrac{1}{2}T\log\left(\tfrac{1}{\delta}\right)}.$

In the above, $(*)$ holds with probability $1-4\delta$ uniformly on $k$ following Lemma 13 and $(\dagger)$ holds, also uniformly on $k$ , with probability $1-\delta$ by applying Azuma-Hoeffding’s inequality (Lemma 32). Therefore, with probability $1-5\delta$ , for all $k$ and $t\in\left\{t_{k},\ldots,t_{k+1}-1\right\}$ , we have:

	$\displaystyle\sqrt{\frac{16c_{0}\sum_{s^{\prime}\in\mathcal{S}}\hat{p}_{k}(s^{\prime}|S_{t})d_{t_{k}}(s^{\prime},S_{t})\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}}$	$\displaystyle\leq\frac{\sqrt{16c_{0}\log\left(\tfrac{SAT}{\delta}\right)\sum_{s^{\prime}\in\mathcal{S}}N_{t_{k}}(S_{t},A_{t},s^{\prime})d_{t_{k}}(s^{\prime},S_{t})}}{N_{t_{k}}(X_{t})}$
		$\displaystyle\leq\frac{\sqrt{16c_{0}\log\left(\tfrac{SAT}{\delta}\right)\sum_{% s^{\prime}\in\mathcal{S}}N_{t_{k}}(S_{t}\leftrightarrow s^{\prime})d_{t_{k}}(s% ^{\prime},S_{t})}}{N_{t_{k}}(X_{t})}$
		$\displaystyle\leq\tfrac{\sqrt{16c_{0}\log\left(\tfrac{SAT}{\delta}\right)S% \left(3c_{0}+(1+c_{0})\left(1+\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}% \right)+2B_{0}(t_{k})\right)}}{N_{t_{k}}(X_{t})}$
		$\displaystyle\leq\tfrac{\sqrt{16c_{0}\log\left(\tfrac{SAT}{\delta}\right)S% \left((1+c_{0})\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}\right)+2B(% T)\right)}}{N_{t_{k}}(X_{t})}$
		$\displaystyle\leq\tfrac{\sqrt{16c_{0}\log\left(\tfrac{SAT}{\delta}\right)S% \left((1+c_{0})\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2B(T)% \right)\right)}}{N_{t_{k}}(X_{t})}.$

This bound will be enough. We move on to $\text{A}_{1}$ . We have:

	$\displaystyle\sqrt{\mathbf{V}(\hat{p}_{k}(S_{t}),h^{\prime}_{k})}$	$\displaystyle\leq\sqrt{\left|\mathbf{V}(\hat{p}_{k}(S_{t}),h^{\prime}_{k})-\mathbf{V}(p(X_{t}),h^{*})\right|}+\sqrt{\mathbf{V}(p(X_{t}),h^{*})}$
		$\displaystyle\leq\sqrt{\left|\mathbf{V}(\hat{p}_{k}(S_{t}),h^{\prime}_{k})-\mathbf{V}(\hat{p}_{k}(S_{t}),h^{*})\right|}+\sqrt{\left|\mathbf{V}(\hat{p}_{k}(S_{t}),h^{*})-\mathbf{V}(p(X_{t}),h^{*})\right|}+\sqrt{\mathbf{V}(p(X_{t}),h^{*})}$
		$\displaystyle\overset{(*)}{\leq}\sqrt{8c_{0}\sum\nolimits_{s^{\prime}\in\mathcal{S}}\hat{p}_{k}(s^{\prime}|S_{t})d_{k}(s^{\prime},S_{t})}+{\mathrm{sp}\left(h^{*}\right)}\sqrt{\left\|\hat{p}_{k}(S_{t})-p_{k}(S_{t})\right\|_{1}}+\sqrt{\mathbf{V}(p(X_{t}),h^{*})}$
		$\displaystyle\overset{(\dagger)}{\leq}\sqrt{8c_{0}\sum\nolimits_{s^{\prime}\in\mathcal{S}}\hat{p}_{k}(s^{\prime}|S_{t})d_{k}(s^{\prime},S_{t})}+{\mathrm{sp}\left(h^{*}\right)}\left(\frac{S\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}\right)^{\frac{1}{4}}+\sqrt{\mathbf{V}(p(X_{t}),h^{*})}$
		$\displaystyle\leq\frac{\mathrm{A}_{2}}{\sqrt{2N_{t_{k}}(X_{t})}}+{\mathrm{sp}\left(h^{*}\right)}\left(\frac{S\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}\right)^{\frac{1}{4}}+\sqrt{\mathbf{V}(p(X_{t}),h^{*})}$

where $(*)$ is obtained by applying Lemma 12 and $(\dagger)$ holds with probability $1-\delta$ by applying Weissman’s inequality, see Lemma 35. All together, with probability $1-6\delta$ , A is upper-bounded by:

\text{A}\leq\sqrt{\frac{2\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac{SAT}{% \delta}\right)}{N_{t_{k}}(X_{t})}}+2\text{A}_{2}+\underbrace{{\mathrm{sp}\left% (h^{*}\right)}\sqrt{\frac{2\log\left(\tfrac{SAT}{\delta}\right)\sqrt{S\log% \tfrac{SAT}{\delta}}}{N_{t_{k}}(X_{t})\sqrt{N_{t_{k}}(X_{t})}}}+\frac{3c_{0}% \log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})}}_{\text{A}_{3}(k,t)}.

(STEP 2) The number of visits $N_{t_{k}}(X_{t})$ is lower-bounded by $\frac{1}{2}N_{t}(X_{t})$ when $N_{t_{k}}(X_{t})\geq 1$ by the doubling trick (DT). By summing over $t$ and $k$ , we find that with probability $1-6\delta$ ,

	$\displaystyle\sum_{k}(3k)$	$\displaystyle\leq SAc_{0}+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{% k}}(X_{t})\geq 1}\sqrt{\frac{2\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac{SAT}{% \delta}\right)}{N_{t_{k}}(X_{t})}}+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1% }_{N_{t_{k}}(X_{t})\geq 1}(2\text{A}_{2}(k,t)+\text{A}_{3}(k,t))$
	(DT)	$\displaystyle\leq SAc_{0}+2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_% {k}}(X_{t})\geq 1}\sqrt{\frac{2\mathbf{V}(p(X_{t}),h^{*})\log\left(\tfrac{SAT}% {\delta}\right)}{N_{t}(X_{t})}}+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{% N_{t_{k}}(X_{t})\geq 1}(2\text{A}_{2}(k,t)+\text{A}_{3}(k,t))$
		$\displaystyle\leq SAc_{0}+4\sqrt{2SA\sum\nolimits_{t=0}^{T-1}\mathbf{V}(p(X_{t% }),h^{*})\log\left(\tfrac{SAT}{\delta}\right)}+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}% -1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}(2\text{A}_{2}(k,t)+\text{A}_{3}(k,t))$

where the last inequality is obtained with computations that are similar to those detailed in the proof of Lemma 8. We recognize the variance that we will leave as is. We finish the proof by bounding the lower order terms $\text{A}_{2}$ and $\text{A}_{3}$ .

(STEP 3) We start with $\text{A}_{2}$ . We have:

	$\displaystyle\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})% \geq 1}\text{A}_{2}(k,t)$	$\displaystyle:=\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})% \geq 1}\tfrac{\sqrt{16c_{0}\log\left(\tfrac{SAT}{\delta}\right)S\left((1+c_{0}% )\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2B(T)\right)\right)}}{N_% {t_{k}}(X_{t})}$
	(DT)	$\displaystyle\leq 2\sqrt{16c_{0}S\log\left(\tfrac{SAT}{\delta}\right)\left((1+% c_{0})\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2B(T)\right)\right)% }~{}SA\log(T)$
		$\displaystyle\leq 8(1+c_{0})S^{\frac{3}{2}}A\log^{\frac{3}{2}}\left(\tfrac{SAT% }{\delta}\right)\left(2+4T^{\frac{1}{4}}\log^{\frac{1}{4}}\left(\tfrac{SAT}{% \delta}\right)+\sqrt{2B(T)}\right).$

(STEP 4) We are left with $\text{A}_{3}$ . We have:

	$\displaystyle\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})% \geq 1}\text{A}_{3}(k,t)$	$\displaystyle:=\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})% \geq 1}\left({\mathrm{sp}\left(h^{*}\right)}\sqrt{\frac{2\log\left(\tfrac{SAT}% {\delta}\right)\sqrt{S\log\tfrac{SAT}{\delta}}}{N_{t_{k}}(X_{t})\sqrt{N_{t_{k}% }(X_{t})}}}+\frac{3c_{0}\log\left(\tfrac{SAT}{\delta}\right)}{N_{t_{k}}(X_{t})% }\right)$
	(DT)	$\displaystyle\leq 2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left({\mathrm{sp}\left(h^{*}\right)}\sqrt{\frac{2\log\left(\tfrac{SAT}{\delta}\right)\sqrt{S\log\tfrac{SAT}{\delta}}}{N_{t}(X_{t})\sqrt{N_{t}(X_{t})}}}+\frac{3c_{0}\log\left(\tfrac{SAT}{\delta}\right)}{N_{t}(X_{t})}\right)$
		$\displaystyle\leq C{\mathrm{sp}\left(h^{*}\right)}S^{\frac{5}{4}}AT^{\frac{1}{% 4}}\log^{\frac{3}{4}}\left(\tfrac{SAT}{\delta}\right)+6c_{0}SA\log\left(\tfrac% {SAT}{\delta}\right)$
		$\displaystyle=\operatorname{{\rm O}}\left({\mathrm{sp}\left(h^{*}\right)}S^{\frac{5}{4}}AT^{\frac{1}{4}}\log\left(\tfrac{SAT}{\delta}\right)\right).$

This concludes the proof. ∎
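The (DT) steps in this proof reduce sums over time to per-pair sums of the form $\sum_{n=1}^{N}n^{-\alpha}$. The sketch below numerically checks the elementary integral-comparison bounds for the exponents $\tfrac{1}{2}$, $\tfrac{3}{4}$ and $1$; the function names and the tested values of $N$ are illustrative choices, not taken from the paper.

```python
import math


def power_sum(N, alpha):
    """Return sum_{n=1}^{N} n^(-alpha)."""
    return sum(n ** (-alpha) for n in range(1, N + 1))


def check_bounds(N):
    """Check three integral-comparison bounds for a given N >= 1.

    sum 1/sqrt(n)  <= 2 sqrt(N)
    sum n^(-3/4)   <= 4 N^(1/4)
    sum 1/n        <= 1 + log(N)
    """
    ok_half = power_sum(N, 0.5) <= 2 * math.sqrt(N)
    ok_three_quarters = power_sum(N, 0.75) <= 4 * N ** 0.25
    ok_one = power_sum(N, 1.0) <= 1 + math.log(N)
    return ok_half and ok_three_quarters and ok_one


if __name__ == "__main__":
    for N in (1, 2, 10, 1000):
        print(N, check_bounds(N))
```

Summing these per-pair bounds over the $SA$ state-action pairs is what produces the $T^{1/4}$ and $\log(T)$ factors in the lower-order terms.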

C.8 Proof of Lemma 10, second order error

Recall that by Lemma 13, with probability $1-4\delta$ , $h^{*}\in\mathcal{H}_{t_{k}}$ for all $k$ , hence ${\mathrm{sp}\left(\mathfrak{h}_{k}-h^{*}\right)}\leq 2c_{0}$ for all $k$ on the same event. Therefore, with probability $1-4\delta$ ,

	$\displaystyle\sum_{k}(4k)$	$\displaystyle:=2c_{0}SA+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbf{1}_{N_{t_{k}% }(X_{t})\geq 1}\left(\hat{p}_{k}(S_{t})-p_{k}(S_{t})\right)\left(\mathfrak{h}_% {k}-h^{*}\right)$
		$\displaystyle=2c_{0}SA+\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}(\hat{p}_{k}(s^{\prime}|S_{t})-p_{k}(s^{\prime}|S_{t}))(\mathfrak{h}_{k}(s^{\prime})-h^{*}(s^{\prime}))$
		$\displaystyle\overset{(*)}{\leq}2c_{0}SA+2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}(\hat{p}_{k}(s^{\prime}|S_{t})-p_{k}(s^{\prime}|S_{t}))d_{t_{k}}(s^{\prime},S_{t})$
		$\displaystyle\overset{(\dagger)}{\leq}2c_{0}SA+2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(d_{k}(s^{\prime},S_{t})\sqrt{\frac{2\hat{p}_{k}(s^{\prime}|S_{t})\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t_{k}}(X_{t})}}+3d_{k}(s^{\prime},S_{t})\frac{\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t_{k}}(X_{t})}\right)$
		$\displaystyle\leq 2c_{0}SA+2\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(\sqrt{c_{0}}\sqrt{\frac{2\hat{p}_{k}(s^{\prime}|S_{t})d_{k}(s^{\prime},S_{t})\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t_{k}}(X_{t})}}+\frac{3c_{0}\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t_{k}}(X_{t})}\right)$
		$\displaystyle\leq 2c_{0}SA+4\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\mathbf{1}_{N_{t_{k}}(X_{t})\geq 1}\left(\sqrt{c_{0}}\sqrt{\frac{2\hat{p}_{k}(s^{\prime}|S_{t})d_{k}(s^{\prime},S_{t})\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t}(X_{t})}}+\frac{3c_{0}\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t}(X_{t})}\right)$

where $(*)$ uses that $h^{*}\in\mathcal{H}_{t_{k}}$ , and $(\dagger)$ is obtained by applying the empirical Bernstein’s inequality, see Lemma 36, to $\hat{p}_{k}(s^{\prime}|S_{t})-p_{k}(s^{\prime}|S_{t})$ , and holds with probability $1-\delta$ . The rightmost term’s sum is upper-bounded by:

4\sum_{k}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}\in\mathcal{S}}\frac{3c_{0}% \log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t}(X_{t})}\leq 12S^{2}A\log(T)% \log\left(\tfrac{S^{2}AT}{\delta}\right).

For the other term, we follow the lines of the proof of Lemma 9 (term $\text{A}_{2}$ ). With probability $1-5\delta$ ( $4\delta$ of which comes from invoking Lemma 13), we have:

	$\displaystyle\hat{p}_{k}(s^{\prime}|S_{t})d_{k}(s^{\prime},S_{t})$	$\displaystyle=\frac{N_{t_{k}}(S_{t},A_{t},s^{\prime})\left((1+c_{0})\left(1+\sqrt{8t_{k}\log\left(\tfrac{2}{\delta}\right)}\right)+2B_{0}(t_{k})\right)}{N_{t_{k}}(S_{t}\leftrightarrow s^{\prime})N_{t_{k}}(X_{t})}$
		$\displaystyle\leq\frac{\left((1+c_{0})\left(3+2\sqrt{8T\log\left(\tfrac{2}{% \delta}\right)}+2B(T)\right)\right)}{N_{t_{k}}(X_{t})}.$

Therefore,

\sqrt{c_{0}}\sqrt{\frac{2\hat{p}_{k}(s^{\prime}|S_{t})d_{t_{k}}(s^{\prime},S_{t})\log\left(\tfrac{S^{2}AT}{\delta}\right)}{N_{t}(X_{t})}}\leq\frac{4(1+c_{0})\sqrt{\left(3+2\sqrt{8T\log\left(\tfrac{2}{\delta}\right)}+2B(T)\right)\log\left(\tfrac{S^{2}AT}{\delta}\right)}}{N_{t}(X_{t})}.

Summing over $k$ , $t$ , $s^{\prime}$ , with probability $1-6\delta$ , we have:

\sum_{k}(4k)\leq\begin{Bmatrix}16S^{2}A(1+c_{0})\log^{\frac{1}{2}}\left(\tfrac% {S^{2}AT}{\delta}\right)\left(\sqrt{2B(T)}+2\left(8T\log\left(\tfrac{2}{\delta% }\right)\right)^{\frac{1}{4}}\right)\\ +32S^{2}A\left(\log(T)\log\left(\tfrac{S^{2}AT}{\delta}\right)+(1+c_{0})\log^{% \frac{1}{2}}\left(\tfrac{S^{2}AT}{\delta}\right)\right)\end{Bmatrix}

This concludes the proof. ∎

Appendix D Details on experiments

D.1 River swim

Experiments are run on $n$ -state river-swim MDPs. Despite their small size, such MDPs are known to be hard to learn. They consist of $n$ states aligned in a straight line, with two playable actions, right and left, whose dynamics are given in the figure below. Rewards are Bernoulli and null everywhere except for $r(s_{n},\textsc{right})=0.95$ and $r(s_{0},\textsc{left})=0.05$ .

Figure 3: The kernel of an $n$ -state river-swim.

$3$ -state river-swim.

The gain is $g^{*}\approx 0.82$ and $h^{*}\approx(-4.28,-2.24,0.4)$ .

$5$ -state river-swim.

The gain is $g^{*}\approx 0.82$ and $h^{*}\approx(-9.62,-7.58,-4.96,-2.27,0.45)$ .
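To make the quantities $g^{*}$ and $h^{*}$ concrete, the following sketch builds a river-swim MDP and computes its optimal gain and bias by relative value iteration. The transition probabilities (`p_right`, `p_stay`, `p_left`) are hypothetical: the paper's kernel is given only in Figure 3, so the values produced here need not coincide with those reported above. States are indexed from `0` to `n - 1`, and all function names are ours.

```python
def river_swim(n, p_right=0.35, p_stay=0.6, p_left=0.05):
    """Build (P, r) with P[s][a][s'] and mean rewards r[s][a].

    Action 0 is 'left', action 1 is 'right'. The probabilities are an
    assumption of this sketch, not the kernel of the paper.
    """
    P = [[[0.0] * n for _ in range(2)] for _ in range(n)]
    r = [[0.0, 0.0] for _ in range(n)]
    for s in range(n):
        P[s][0][max(s - 1, 0)] = 1.0  # 'left' always succeeds
        # 'right': mass that would leave the chain stays at the border
        P[s][1][min(s + 1, n - 1)] += p_right
        P[s][1][s] += p_stay
        P[s][1][max(s - 1, 0)] += p_left
    r[0][0] = 0.05       # small reward for 'left' in the first state
    r[n - 1][1] = 0.95   # large reward for 'right' in the last state
    return P, r


def relative_value_iteration(P, r, eps=1e-9, max_iter=100_000):
    """Return (gain, bias) with the bias normalised to 0 at the last state."""
    n = len(P)
    h = [0.0] * n
    diff = [0.0] * n
    for _ in range(max_iter):
        Lh = [max(r[s][a] + sum(P[s][a][t] * h[t] for t in range(n))
                  for a in range(2)) for s in range(n)]
        diff = [Lh[s] - h[s] for s in range(n)]
        h = [x - Lh[-1] for x in Lh]
        if max(diff) - min(diff) < eps:  # span stopping criterion
            break
    gain = 0.5 * (max(diff) + min(diff))
    return gain, h


if __name__ == "__main__":
    gain, bias = relative_value_iteration(*river_swim(3))
    print("gain:", gain, "span of bias:", max(bias) - min(bias))
```

The span of the returned bias vector is the quantity ${\mathrm{sp}\left(h^{*}\right)}$ that drives the regret bound.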

Appendix E Standard concentration inequalities

Lemma 32 (Azuma’s inequality, Azuma [1967]).

Let $(U_{t})_{t\geq 0}$ be a martingale difference sequence such that ${\mathrm{sp}\left(U_{t}\right)}\leq c$ a.s., i.e., there exists $a_{t}\in\mathbf{R}$ such that $a_{t}\leq U_{t}\leq a_{t}+c$ a.s. Then, for all $\delta>0$ ,

\mathbf{P}\left(\sum\nolimits_{t=0}^{T-1}U_{t}\geq c\sqrt{\tfrac{1}{2}T\log% \left(\tfrac{1}{\delta}\right)}\right)\leq\delta.
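As a sanity check of the statement, the sketch below estimates by simulation how often the threshold is exceeded for i.i.d. centered increments uniform on $\{-\tfrac{1}{2},+\tfrac{1}{2}\}$ (so $c=1$), which is a special case of a martingale difference sequence. The parameters `T`, `delta`, `trials` and `seed` are arbitrary illustrative choices.

```python
import math
import random


def azuma_threshold(c, T, delta):
    """Threshold c * sqrt(T/2 * log(1/delta)) from Lemma 32."""
    return c * math.sqrt(0.5 * T * math.log(1.0 / delta))


def violation_frequency(T=100, delta=0.1, trials=2000, seed=0):
    """Empirical frequency of the event {sum_t U_t >= threshold}."""
    rng = random.Random(seed)
    threshold = azuma_threshold(1.0, T, delta)
    hits = 0
    for _ in range(trials):
        total = sum(rng.choice((-0.5, 0.5)) for _ in range(T))
        hits += total >= threshold
    return hits / trials


if __name__ == "__main__":
    print(violation_frequency())
```

The observed frequency should stay below `delta`; the inequality is typically loose for such well-behaved increments.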

Lemma 33 (Freedman’s inequality, Zhang et al. [2020]).

Let $(U_{t})_{t\geq 0}$ be a martingale difference sequence such that $\left|U_{t}\right|\leq c$ a.s., and denote its conditional variance by $V_{t}:=\mathbf{E}[U_{t}^{2}|\mathcal{F}_{t-1}]$ . Then, for all $\delta>0$ ,

\mathbf{P}\left(\exists T^{\prime}\leq T:\sum\nolimits_{t=0}^{T^{\prime}-1}U_{% t}\geq\sqrt{2\sum\nolimits_{t=0}^{T^{\prime}-1}V_{t}\log\left(\tfrac{T}{\delta% }\right)}+4c\log\left(\tfrac{T}{\delta}\right)\right)\leq\delta.

Lemma 34 (Time-uniform Azuma, Bourel et al. [2020]).

Let $(U_{t})$ be a martingale difference sequence such that, for all $\lambda\in\mathbf{R}$ , $\mathbf{E}[\exp(\lambda U_{t})|U_{1},\ldots,U_{t-1}]\leq\exp(\frac{\lambda^{2}\sigma^{2}}{2})$ . Then:

\forall\delta>0,\quad\mathbf{P}\left(\exists n\geq 1,\quad\left(\sum\nolimits_% {k=1}^{n}U_{k}\right)^{2}\geq n\sigma^{2}\left(1+\tfrac{1}{n}\right)\log\left(% \tfrac{\!\!\sqrt{1+n}}{\delta}\right)\right)\leq\delta.

Lemma 35 (Time-uniform Weissman).

Let $q$ be a distribution over $\left\{1,\ldots,d\right\}$ . Let $(U_{t})$ be a sequence of i.i.d. random variables of distribution $q$ . Then:

\forall\delta>0,\quad\mathbf{P}\left(\exists n\geq 1,\left\|\sum\nolimits_{i=1% }^{n}\left(e_{U_{i}}-q\right)\right\|_{1}^{2}\geq nd\log\left(\tfrac{2\sqrt{1+% n}}{\delta}\right)\right)\leq\delta.

Proof.

Remark that $\left\|\sum_{k=1}^{n}(e_{U_{k}}-q)\right\|_{1}=\max_{v\in\left\{-1,1\right\}^{% d}}\sum_{k=1}^{n}\left\langle e_{U_{k}}-q,v\right\rangle$ . Let $W_{k}^{v}:=\left\langle e_{U_{k}}-q,v\right\rangle$ . Remark that for each $v\in\left\{-1,1\right\}^{d}$ , $(W_{k}^{v})$ is a family of i.i.d. random variables with $-\left\langle q,v\right\rangle\leq W_{k}^{v}\leq 1-\left\langle q,v\right\rangle$ , so $\mathbf{E}[\exp(\lambda W_{k}^{v})]\leq\exp(\tfrac{\lambda^{2}}{8})$ by Hoeffding’s Lemma. By Lemma 34, we have:

	$\displaystyle\mathbf{P}\left(\exists n\geq 1,\left\|\sum_{k=1}^{n}(e_{U_{k}}-q)\right\|_{1}\geq\!\!\sqrt{nd\log\left(\tfrac{2\!\!\sqrt{1+n}}{\delta}\right)}\right)$	$\displaystyle=\mathbf{P}\left(\exists v\in\left\{-1,1\right\}^{d},\exists n,\sum_{k=1}^{n}W_{k}^{v}\geq\!\!\sqrt{nd\log\left(\tfrac{2\!\!\sqrt{1+n}}{\delta}\right)}\right)$
		$\displaystyle\leq\sum_{v\in\left\{-1,1\right\}^{d}}\mathbf{P}\left(\exists n,% \sum_{k=1}^{n}W_{k}^{v}\geq\!\!\sqrt{nd\log\left(\tfrac{2\!\!\sqrt{1+n}}{% \delta}\right)}\right)$
		$\displaystyle\leq\sum_{v\in\left\{-1,1\right\}^{d}}\mathbf{P}\left(\exists n,% \sum_{k=1}^{n}W_{k}^{v}\geq\!\!\sqrt{\tfrac{1}{2}n\left(1+\tfrac{1}{n}\right)% \log\left(\tfrac{\!\!\sqrt{1+n}}{2^{-d}\delta}\right)}\right)$
		$\displaystyle\leq 2^{d}\cdot 2^{-d}\delta=\delta.$

This concludes the proof. ∎
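The proof starts from the dual representation of the $\ell_{1}$ norm as a maximum over sign vectors. The short sketch below checks this identity by brute force on small vectors; the function names are ours and the enumeration is exponential in $d$, so it is only meant as an illustration.

```python
from itertools import product


def l1_norm(x):
    """Return the l1 norm of the vector x."""
    return sum(abs(xi) for xi in x)


def max_over_signs(x):
    """Return max over v in {-1, 1}^d of the inner product <x, v>."""
    return max(sum(vi * xi for vi, xi in zip(v, x))
               for v in product((-1, 1), repeat=len(x)))


if __name__ == "__main__":
    x = [0.3, -1.2, 0.0, 2.5]
    print(l1_norm(x), max_over_signs(x))
```

The maximum is attained at $v_{i}=\mathrm{sign}(x_{i})$, which is why a union bound over the $2^{d}$ sign vectors suffices.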

Lemma 36 (Time-uniform Empirical Bernstein).

Let $(U_{k})_{k\geq 1}$ be a martingale difference sequence such that ${\mathrm{sp}\left(U_{n}\right)}\leq c$ a.s., let $\hat{U}_{n}:=\frac{1}{n}\sum_{k=1}^{n}U_{k}$ be the empirical mean and $\hat{V}_{n}:=\frac{1}{n}\sum_{k=1}^{n}(U_{k}-\hat{U}_{n})^{2}$ the population variance. Then,

\forall\delta>0,\forall T>0,\quad\mathbf{P}\left(\exists t\leq T,\sum\nolimits% _{i=1}^{t}U_{i}\geq\sqrt{2t\hat{V}_{t}\log\left(\tfrac{3T}{\delta}\right)}+3c% \log\left(\tfrac{3T}{\delta}\right)\right)\leq\delta.

Proof.

This is obtained with a union bound over the values of $t\leq T$ , applying Lemma 38 to each of them. ∎
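Spelled out, the argument can be written as follows, assuming Lemma 38 is applied at confidence level $\delta/T$ for each fixed $t\leq T$ (this choice of level is our reading of the proof, consistent with the $\log\left(\tfrac{3T}{\delta}\right)$ term in the statement):

```latex
\mathbf{P}\left(\exists t\leq T,\ \sum\nolimits_{i=1}^{t}U_{i}\geq\sqrt{2t\hat{V}_{t}\log\left(\tfrac{3T}{\delta}\right)}+3c\log\left(\tfrac{3T}{\delta}\right)\right)
\leq\sum_{t=1}^{T}\mathbf{P}\left(\sum\nolimits_{i=1}^{t}U_{i}\geq\sqrt{2t\hat{V}_{t}\log\left(\tfrac{3}{\delta/T}\right)}+3c\log\left(\tfrac{3}{\delta/T}\right)\right)
\leq\sum_{t=1}^{T}\tfrac{\delta}{T}=\delta.
```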

Lemma 37 (Time-uniform Empirical Likelihoods, Jonsson et al. [2020]).

Let $q$ be a distribution on $\left\{1,\ldots,d\right\}$ . Let $(U_{t})$ be a sequence of i.i.d. random variables of distribution $q$ . Then:

\forall\delta>0,\quad\mathbf{P}\left(\exists n\geq 1,n\operatorname{{\rm KL}}(% \hat{q}_{n}||q)>\log\left(\tfrac{1}{\delta}\right)+(d-1)\log\left(e\left(1+% \tfrac{n}{d-1}\right)\right)\right)\leq\delta.

Lemma 38 (Empirical Bernstein inequality, Audibert et al. [2009]).

\forall\delta>0,\forall n\geq 1,\quad\mathbf{P}\left(\sum\nolimits_{k=1}^{n}U_% {k}\geq\sqrt{2n\hat{V}_{n}\log\left(\tfrac{3}{\delta}\right)}+3c\log\left(% \tfrac{3}{\delta}\right)\right)\leq\delta.

Lemma 39 (Bennett’s inequality, Audibert et al. [2009]).

Let $(U_{t})_{t\geq 0}$ be a martingale difference sequence such that $\left|U_{t}\right|\leq c$ a.s., and denote its conditional variance by $V_{t}:=\mathbf{E}[U_{t}^{2}|\mathcal{F}_{t-1}]$ . Then,

\forall\delta>0,\forall n\geq 1,\quad\mathbf{P}\left(\exists k\leq n,\sum% \nolimits_{i=1}^{k}U_{i}\geq\sqrt{2\sum\nolimits_{i=1}^{n}V_{i}\log\left(% \tfrac{1}{\delta}\right)}+\tfrac{1}{3}c\log\left(\tfrac{1}{\delta}\right)% \right)\leq\delta.

Lemma 40 (Lemma 3 of Zhang and Xie [2023]).

Let $(U_{t})$ be a sequence of random variables such that $0\leq U_{t}\leq c$ a.s., and let $\mathcal{F}_{t}:=\sigma(U_{0},U_{1},\ldots,U_{t-1})$ . Then:

	$\displaystyle\forall\delta>0,\quad\mathbf{P}\left(\exists T\geq 0,\sum\nolimits_{t=0}^{T-1}U_{t}\geq 3\sum\nolimits_{t=0}^{T-1}\mathbf{E}[U_{t}|\mathcal{F}_{t-1}]+c\log\left(\tfrac{1}{\delta}\right)\right)\leq\delta;$
	$\displaystyle\forall\delta>0,\quad\mathbf{P}\left(\exists T\geq 0,\sum\nolimits_{t=0}^{T-1}\mathbf{E}[U_{t}|\mathcal{F}_{t-1}]\geq 3\sum\nolimits_{t=0}^{T-1}U_{t}+c\log\left(\tfrac{1}{\delta}\right)\right)\leq\delta.$

