Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
Abstract
In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{O}(\sqrt{\mathrm{sp}(h^*) S A T})$ (where $\widetilde{O}(\cdot)$ hides logarithmic factors of $(S, A, T)$), where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S \times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$.
Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.
1 Introduction
Reinforcement learning (RL) Burnetas and Katehakis (1997); Sutton and Barto (2018) has become a popular approach for solving complex sequential decision-making tasks and has recently achieved notable advancements in diverse fields of application. The RL problem is generally formulated as a Markov Decision Process (MDP) Puterman (1994), where the agent interacts with an unknown environment to maximize its cumulative rewards.
In this paper, we consider the problem of learning average-reward MDPs, where the central task is to balance between exploration (i.e., learning the unknown environment) and exploitation (i.e., planning optimally according to current knowledge) along the infinite-horizon learning process. One way to measure the performance of the learner is the regret, which compares the gathered rewards of the learner, unaware of the exact structure of its environment, to the expected performance of an omniscient agent that knows the environment in advance. The seminal work of Auer et al. (2009) provides a minimax regret lower bound $\Omega(\sqrt{DSAT})$, where $D$ is the diameter (the maximal distance between two different states), $S$ the number of states, $A$ the number of actions and $T$ the learning horizon. They also provide an algorithm achieving regret $\widetilde{O}(DS\sqrt{AT})$. Ever since Auer et al. (2009), many works have been devoted to closing the gap between the regret lower and upper bounds in the average reward setting Auer et al. (2009); Bartlett and Tewari (2009); Filippi et al. (2010); Talebi and Maillard (2018); Fruit et al. (2018, 2020); Bourel et al. (2020); Zhang and Ji (2019); Ouyang et al. (2017); Agrawal and Jia (2023); Abbasi-Yadkori et al. (2019); Wei et al. (2020) and more. Subsequent works Fruit et al. (2018); Zhang and Ji (2019) refined the minimax regret lower bound to $\Omega(\sqrt{\mathrm{sp}(h^*)SAT})$, where $\mathrm{sp}(h^*)$ is the span of the bias function, which is the maximal gap of the long-term cumulative rewards starting from two different states. The difference is significant, since $\mathrm{sp}(h^*) \leq D$ and the gap between the two can be arbitrarily large. However, no existing work achieves the following three requirements simultaneously:
- (1) The method achieves minimax optimal regret guarantees;
- (2) The proposed method is tractable;
- (3) No prior knowledge on the model is required.
Most algorithms simply fail to achieve minimax optimal regret, and the only method achieving it Zhang and Ji (2019) is intractable because it relies on an oracle to solve difficult optimization problems along the learning process. Naturally, we raise the question of whether these three requirements can be met all at once:
Is there a tractable algorithm with minimax regret without prior knowledge?
Contributions.
In this paper, we answer the above question affirmatively, by proposing a polynomial time algorithm with regret guarantees for average-reward MDPs. Our method can further incorporate almost arbitrary prior bias information to improve its regret.
Theorem 1 (Informal).
For any , provided that the confidence region used by PMEVI-DT satisfies mild regularity conditions (see Assumptions 1-3), if , then for every weakly communicating model with bias span less than and with bias vector within , PMEVI-DT achieves regret:
in expectation and with high probability. Moreover, if PMEVI-DT runs with the same confidence regions as UCRL2 Auer et al. (2009), then it enjoys a time complexity .
The geometry of the prior bias region is discussed later (see Assumption 4). It can be taken trivial with to obtain a completely prior-less algorithm. To the best of our knowledge, this is the first tractable algorithm with minimax optimal regret bounds (up to logarithmic factors). The algorithm does not necessitate any prior knowledge of , thus circumventing the potentially high cost associated with learning . On the technical side, a key novelty of our method is the subroutine named PMEVI (see Algorithm 2) that improves and can replace EVI Auer et al. (2009) in any algorithm that relies on it Auer et al. (2009); Fruit et al. (2018); Filippi et al. (2010); Fruit et al. (2020); Bourel et al. (2020) to boost its performance and achieve minimax optimal regret.
Related works.
Here is a short overview of the learning theory of average reward MDPs. For communicating MDPs, the notable work of Auer et al. (2009) proposes the famous UCRL2 algorithm, a mature version of their prior UCRL Auer and Ortner (2006), achieving a regret bound of $\widetilde{O}(DS\sqrt{AT})$. This paper pioneered the use of optimistic methods to learn MDPs efficiently. A line of papers Filippi et al. (2010); Fruit et al. (2020); Bourel et al. (2020) developed this direction by tightening the confidence region that UCRL2 relies on, and sharpened its analysis through the use of local properties of MDPs, such as local diameters and local bias variances, but none of these works went beyond regret guarantees of order and suffer from an extra . A parallel direction was initiated by Bartlett and Tewari (2009), who design REGAL to attain $\mathrm{sp}(h^*)$-dependent regret bounds (instead of $D$) while extending the regret bounds to weakly-communicating MDPs. The computational intractability of REGAL is addressed by Fruit et al. (2018) with SCAL, while Zhang and Ji (2019) further enhance the regret analysis by evaluating the bias differences with EBF, eventually reaching optimal minimax regret but losing tractability.
Another successful design approach is Bayesian-flavored sampling, derived from Thompson Sampling Thompson (1933), that usually replaces optimism. The regret guarantees of these algorithms usually stick to the Bayesian setting however Ouyang et al. (2017); Theocharous et al. (2017), although Agrawal and Jia (2023) also enjoys high probability regret by coupling posterior sampling and optimism. Another line of research focuses on the study of ergodic MDPs, where all policies mix uniformly according to a mixing time. To name a few, the model-free algorithm Politex Abbasi-Yadkori et al. (2019) attains a regret of . By leveraging an optimistic mirror descent algorithm, Wei et al. (2020) achieve an enhanced regret of .
We refer the readers to Table 1 for a (non-exhaustive) list of existing algorithms.
Algorithm | Regret in $\widetilde{O}(\cdot)$ | Tractable | Comment/Requirements |
REGAL Bartlett and Tewari (2009) | | | knowledge of $\mathrm{sp}(h^*)$ |
UCRL2 Auer et al. (2009) | | | - |
PSRL Agrawal and Jia (2023) | | | Bayesian regret |
SCAL Fruit et al. (2018) | | | knowledge of $\mathrm{sp}(h^*)$ |
UCRL2B Fruit et al. (2020) | | | extra in upper-bound |
UCRL3 Bourel et al. (2020) | | | |
KL-UCRL Filippi et al. (2010); Talebi and Maillard (2018) | | | - |
EBF Zhang and Ji (2019) | | | optimal, knowledge of $\mathrm{sp}(h^*)$ |
Optimistic-Q Wei et al. (2020) | | | model-free |
UCB-AVG Zhang and Xie (2023) | | | model-free, knowledge of $\mathrm{sp}(h^*)$ |
MDP-OOMD Wei et al. (2020) | | | ergodic |
Politex Abbasi-Yadkori et al. (2019) | | | model-free, ergodic |
PMEVI-DT (this work) | | | - |
Lower bound | | - | - |
2 Preliminaries
We fix a finite state-action space structure , and denote the collection of all MDPs with state-action space and rewards supported in .
Infinite-horizon MDP.
An element is a tuple where is the transition kernel and the reward function. The random state-action pair played by the agent at time is denoted , and the achieved reward is . A policy is a deterministic rule and we write the space of policies. Coupled with a , a policy properly defines the distribution of whose associated probability and expectation operators are denoted , where is the initial state. Under , a fixed policy has a reward function , a transition matrix , a gain and a bias , that all together satisfy the Poisson equation , see Puterman (1994). The Bellman operator of the MDP is:
(1) |
Weakly-communicating MDPs.
is weakly-communicating Puterman (1994); Bartlett and Tewari (2009) if the state space can be divided into two sets: (1) the transient set, consisting of states that are transient under all policies; (2) the non-transient set, where every state is reachable starting from any other non-transient state. In this case, has a span-fixpoint (see Puterman (1994)), i.e., there exists such that where is the vector full of ones. We write . Then is the optimal gain function and every policy satisfies . We accordingly define the Bellman gaps:
(2) |
Another important concept is the diameter, which describes the maximal distance from one state to another state. It is given by $D := \max_{s \neq s'} \min_{\pi} \mathbb{E}_s^{\pi}[\inf\{t \geq 1 : S_t = s'\}]$. An MDP is said to be communicating if its diameter is finite.
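To make the notions of Bellman operator, gain, bias and span concrete, here is a minimal Python sketch of plain value iteration with a span-based stopping rule. It assumes a known model with rewards `r[s, a]` and kernel `p[s, a, s']` stored as NumPy arrays and an aperiodic optimal chain; the two-state model and all function names are hypothetical and only serve as an illustration.

```python
import numpy as np

def span(u):
    """Span semi-norm: sp(u) = max(u) - min(u)."""
    return float(np.max(u) - np.min(u))

def bellman(u, r, p):
    """(Lu)(s) = max_a { r[s, a] + sum_s' p[s, a, s'] * u(s') }."""
    return np.max(r + p @ u, axis=1)

def value_iteration(r, p, eps=1e-9, max_iter=100_000):
    """Iterate u <- Lu until sp(Lu - u) < eps.

    Returns an estimate of the optimal gain and a bias estimate
    normalised so that its minimum is 0.
    """
    u = np.zeros(r.shape[0])
    for _ in range(max_iter):
        v = bellman(u, r, p)
        diff = v - u
        if span(diff) < eps:
            return 0.5 * (diff.max() + diff.min()), v - v.min()
        u = v - v.min()  # recentring is harmless: L(u + c e) = Lu + c e
    raise RuntimeError("value iteration did not converge")

# Hypothetical 2-state, 2-action model (not taken from the paper).
r = np.array([[0.2, 0.0],
              [1.0, 0.5]])
p = np.array([[[1.0, 0.0], [0.5, 0.5]],
              [[0.1, 0.9], [1.0, 0.0]]])
gain, bias = value_iteration(r, p)
print(gain, span(bias))  # close to 5/6 and 5/3 for this toy model
```

In this toy model the optimal policy moves from state 0 to state 1 and stays there, so the returned pair approximately solves the span-fixpoint equation with gain 5/6 and bias span 5/3.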
Reinforcement learning.
The learner is only aware that but knows nothing further about what looks like. From the past observations and the current state , the agent picks an available action , receives a reward and observes the new state . The regret of the agent is:
(3) |
Its expected value satisfies and the quantity will be referred to as the pseudo-regret. This paper focuses on minimax regret guarantees. Specifically, for , denote the set of weakly-communicating MDPs that admit a bias function with span at most , where the span of a vector $u$ is $\mathrm{sp}(u) := \max(u) - \min(u)$. Following Auer et al. (2009), for every algorithm and for all , we have
(4) |
The goal of this work is to reach this lower bound with a tractable algorithm.
3 Algorithm PMEVI-DT
The method designed in this work can be applied to any algorithm relying on extended Bellman operators to compute the deployed policies Auer et al. (2009); Filippi et al. (2010); Fruit et al. (2018); Bourel et al. (2020) and beyond Tewari and Bartlett (2007). We start by reviewing the principles behind these algorithms. These algorithms follow the optimism-in-face-of-uncertainty (OFU) principle, meaning that they deploy policies achieving the highest possible gain that is plausible under their current information. This is done by building a confidence region for the hidden model , then searching for a policy solving the optimization problem:
(5) |
The design of the confidence region varies from one work to another. Provided that has been designed, these OFU-algorithms work as follows: At the start of episode , the optimization problem (5) is solved, and its solution is played until the end of the episode. The duration of episodes can be managed in various ways, although the most popular is arguably the doubling trick (DT), that essentially waits until a state-action pair is about to double the visit count it had at the beginning of the current episode (see Algorithm 1). In the rest of this section, we use (and ) to denote the empirical transition (and reward) of the latest doubling update before the -th step, and further denote .
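The doubling trick can be sketched in a few lines of Python. The snippet below follows the classical UCRL2 criterion, which we assume here: an episode ends as soon as the number of visits of some pair within the episode reaches `max(1, N)`, where `N` is its visit count at the start of the episode. The class and method names are hypothetical.

```python
import numpy as np

class DoublingTrick:
    """Tracks visit counts and signals when a new episode must start."""

    def __init__(self, n_states, n_actions):
        # Visits accumulated before the current episode started.
        self.total = np.zeros((n_states, n_actions), dtype=int)
        # Visits accumulated within the current episode.
        self.local = np.zeros((n_states, n_actions), dtype=int)

    def record(self, s, a):
        """Register a visit of (s, a); return True if the episode ends."""
        self.local[s, a] += 1
        if self.local[s, a] >= max(1, self.total[s, a]):
            self.total += self.local   # commit the episode's counts
            self.local[:] = 0
            return True
        return False

# With a single state-action pair, episodes end at visits 1, 2, 4, 8, 16.
dt = DoublingTrick(1, 1)
ends = [t for t in range(1, 17) if dt.record(0, 0)]
print(ends)  # [1, 2, 4, 8, 16]
```

Under this rule the number of episodes grows only logarithmically with the number of visits of each pair, which is what keeps the number of planning calls small.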
Extended Bellman operators and EVI.
To solve (5) efficiently, the celebrated work of Auer et al. (2009) introduced the extended value iteration algorithm (EVI). Assume that is a -rectangular confidence region, meaning that where and are respectively the confidence region for and after learning steps. EVI is the algorithm computing the sequence defined by:
(6) |
until where is the numerical precision. When the process stops, it is known that any policy such that achieves in (6) satisfies , hence is nearly optimistically optimal. This process gets its name from the observation that is the Bellman operator of seen as an MDP, hence EVI is just the Value Iteration algorithm Puterman (1994) run in . A choice of action from in consists in (1) a choice of action , (2) a choice of reward and (3) a choice of transition ; it is an extended version of .
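For intuition, the following Python sketch implements EVI in the special case of UCRL2-style regions, which is an assumption on our side: rewards in $[0, 1]$ with interval confidence regions of radius `r_rad[s, a]`, and kernels with $\ell_1$-balls of radius `p_rad[s, a]`. The inner maximisation over the $\ell_1$-ball uses the standard greedy procedure (add mass to the best next state, remove it from the worst ones). All names are hypothetical.

```python
import numpy as np

def optimistic_kernel(p_hat, radius, u):
    """Maximise q @ u over probability vectors q with ||q - p_hat||_1 <= radius."""
    q = np.array(p_hat, dtype=float)
    order = np.argsort(u)                       # states by increasing value
    best = order[-1]
    q[best] = min(1.0, q[best] + radius / 2.0)  # push mass to the best state
    for s in order:                             # remove the excess from the worst states
        excess = q.sum() - 1.0
        if excess <= 0.0:
            break
        q[s] = max(0.0, q[s] - excess)
    return q

def extended_value_iteration(r_hat, r_rad, p_hat, p_rad, eps=1e-6, max_iter=100_000):
    """Return (greedy policy, optimistic gain estimate, optimistic bias estimate)."""
    n_states, n_actions = r_hat.shape
    r_opt = np.minimum(1.0, r_hat + r_rad)      # optimistic rewards, capped at 1
    u = np.zeros(n_states)
    for _ in range(max_iter):
        q_values = np.empty((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                q = optimistic_kernel(p_hat[s, a], p_rad[s, a], u)
                q_values[s, a] = r_opt[s, a] + q @ u
        v = q_values.max(axis=1)
        diff = v - u
        if diff.max() - diff.min() < eps:       # span stopping rule
            gain = 0.5 * (diff.max() + diff.min())
            return q_values.argmax(axis=1), gain, v - v.min()
        u = v - v.min()
    raise RuntimeError("EVI did not converge")
```

With all radii set to zero, the routine reduces to ordinary value iteration on the empirical model; enlarging the radii can only increase the returned gain, which is the optimism property exploited by OFU-algorithms.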
Towards Projected Mitigated EVI.
Obviously, the regret of an OFU-algorithm is directly related to the quality of the confidence region . That is why most previous works tried to approach the regret lower bound of Auer et al. (2009) by refining . The older works of Auer et al. (2009); Bartlett and Tewari (2009); Filippi et al. (2010) have been improved with a variance aware analysis Talebi and Maillard (2018); Fruit et al. (2018, 2020); Bourel et al. (2020) that essentially makes use of tightened kernel confidence regions . While all these algorithms successively reduce the gap between the regret upper and lower bounds, they fail to achieve optimal regret . Meanwhile, the EBF algorithm of Zhang and Ji (2019) achieves the lower bound but (1) the algorithm is intractable because it relies on an oracle to retrieve optimistically optimal policies and (2) needs prior information on the bias function. Nonetheless, the method of Zhang and Ji (2019) strongly suggests that inferring bias information from the available data is key to achieving minimax optimal regret.
Rather surprisingly and in opposition to this previous line of work, our work suggests that the choice of the confidence region has little importance. Instead, our algorithm takes an arbitrary (well-behaved) confidence region in, infers bias information similarly to EBF Zhang and Ji (2019) and makes use of it to heavily refine the extended Bellman operator (6) associated with the input confidence region. Our algorithm can further take arbitrary prior information (possibly none) on the bias vector to tighten its bias confidence region. The pseudo-code given in Algorithm 1 is the high-level structure of our algorithm PMEVI-DT. In Section 3.1, we explain how (6) is refined using bias information and in Section 3.2, we explain how bias information is obtained.
Algorithm 1: PMEVI-DT
Parameters: Bias prior , horizon , a system of confidence region
Algorithm 2: PMEVI
Parameters: region , mitigation , projection , precision , initial vector (optional)
3.1 Projected mitigated extended value iteration (PMEVI)
Assume that an external mechanism provides a confidence region for the bias function . Provided that is correct () and that is correct (), we want to find a pair of policy-model that maximize the gain and such that . This is done with an improved version of (6) combining two ideas.
1. Projection (Section 3.2). Whenever it is correct, the bias confidence region informs the learner that the search of an optimistic model can be constrained to those with bias within . This is done by projecting (see mitigation) using an operator , that has to satisfy a few non-trivial regularity conditions that are specified in Proposition 2.
2. Mitigation (Section 3.3). When one is aware that , the dynamical bias update in (6) can be controlled better, by trying to restrict (6) to some such that with the knowledge that .
For a fixed , the empirical Bernstein inequality (Lemma 38) provides a variance bound of the form . By computing , the search makes sure that even though is unknown. For , we introduce the -mitigated extended Bellman operator:
(7)
The proposition below shows how well-behaved the composition is. Its proof requires to build a complete analysis of projected mitigated Bellman operators. This is deferred to the appendix.
Proposition 2.
Fix and assume that there exists a projection operator which is (O1) monotone: ; (O2) non span-expansive: ; (O3) linear: and (O4) . Then, the projected mitigated extended Bellman operator has the following properties:
- (1) There exists a unique such that ;
- (2) If , and , then ;
- (3) If is convex, then for all , the policy picking the actions achieving satisfies for and ;
- (4) For all and , .
The property (1) guarantees that has a fix-point while (2) states that this fix-point corresponds to an optimistic gain if the model and the bias confidence region are correct and the mitigation isn’t too aggressive. Combined with (3), the Poisson equation of a policy corresponds to this fix-point, i.e., , so that is the gain and is a legal bias for under the model . Lastly, the property (4) guarantees that the iterates converge to a fix-point of at least as quickly as goes to a fix-point of ; the convergence of is already guaranteed by existing studies and is discussed in the appendix.
Provided that the bias confidence region is constructed, Proposition 2 foreshadows how powerful the construction is: The algorithm PMEVI, obtained by iterating instead of in EVI, can replace the well-known EVI within any algorithm of the literature that relies on it (UCRL2 Auer et al. (2009), UCRL2B Fruit et al. (2020) or KL-UCRL Filippi et al. (2010)) for an immediate improvement of its theoretical guarantees.
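To give a feel for the requirements placed on the projection operator, here is a small Python sketch that checks them numerically on the simplest candidate, the span truncation $\Gamma u := \min(u, \min(u) + c)$ used in span-constrained methods such as SCAL. This is an illustration under assumptions: we read (O1) as $u \le v \Rightarrow \Gamma u \le \Gamma v$, (O2) as $\mathrm{sp}(\Gamma u - \Gamma v) \le \mathrm{sp}(u - v)$, (O3) as $\Gamma(u + \lambda e) = \Gamma u + \lambda e$ and (O4) as $\Gamma u \le u$, and the truncation only handles a span constraint, not the general bias region of Section 3.2.

```python
import numpy as np

def span(u):
    """Span semi-norm: sp(u) = max(u) - min(u)."""
    return float(np.max(u) - np.min(u))

def span_projection(u, c):
    """Truncate u so that its span is at most c: (Gamma u)(s) = min(u(s), min(u) + c)."""
    return np.minimum(u, u.min() + c)

def check_properties(c=1.0, n_states=5, n_trials=1000, seed=0, tol=1e-12):
    """Randomised check of the assumed readings of (O1)-(O4) for the span truncation."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        u = 3.0 * rng.normal(size=n_states)
        v = u + rng.uniform(0.0, 2.0, size=n_states)   # v >= u coordinate-wise
        lam = rng.normal()
        gu, gv = span_projection(u, c), span_projection(v, c)
        ok = (
            np.all(gu <= gv + tol)                               # (O1) monotone
            and span(gu - gv) <= span(u - v) + tol               # (O2) non span-expansive
            and np.allclose(span_projection(u + lam, c), gu + lam)  # (O3) shift-equivariant
            and np.all(gu <= u)                                  # (O4) never increases u
            and span(gu) <= c + tol                              # lands in the span ball
        )
        if not ok:
            return False
    return True

print(check_properties())
```

Note that monotonicity together with shift-equivariance already implies the non span-expansive property, so (O2) comes for free for operators of this kind.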
3.2 Building the bias confidence region and its projection operator
The bias confidence region used by PMEVI-DT is obtained as a collection of constraints of the form:
(8) |
Such constraints include (1) prior bias constraints (if any) of the form of ; (2) span constraints of the form spawning the span semi-ball ; and (3) pair-wise constraints obtained by estimating bias differences in the style of Zhang and Ji (2019); Zhang and Xie (2023) that we further improve. We start by defining a bias difference estimator.
Definition 1 (Bias difference estimator).
Given a pair of states , their sequence of commute times is defined by with the convention that . The number of commutations up to time is , and is the empirical gain. The bias difference estimator at time is any quantity such that:
(9) |
Lemma 3.
With probability , for all and all , we have:
(10) |
Lemma 3 says that the quality of the estimator is directly linked to the number of observed commutes between and as well as the regret. The idea is that if the algorithm makes many commutes between and and if its regret is small, then the algorithm mostly takes optimal paths from to . The bound provided by Lemma 3 is not accessible to the learner however, because is unknown in general. To overcome this issue, is upper-bounded by . Overall, this leads to the design of the algorithm estimating the bias confidence region as specified in Algorithm 3.
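As an illustration of the commute times entering Definition 1, the Python sketch below extracts them from a trajectory of states. It assumes the alternating convention suggested by the proof in Section A.1: the sequence records the first visit to `s`, then the next visit to `s_prime`, then the next visit to `s`, and so on. The function name and the toy trajectory are hypothetical.

```python
def commute_times(states, s, s_prime):
    """Times at which the trajectory alternately hits s, then s_prime, then s, ...

    `states` is the sequence of visited states and `s != s_prime` is assumed.
    """
    times = []
    target = s
    for t, x in enumerate(states):
        if x == target:
            times.append(t)
            target = s_prime if target == s else s   # switch the target state
    return times

trajectory = [0, 1, 1, 2, 0, 0, 2, 1, 0]
print(commute_times(trajectory, 0, 2))  # [0, 3, 4, 6, 8]
```

Each pair of consecutive times delimits one commute between the two states; the more such commutes are observed, the more data is available to estimate the corresponding bias difference.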
Algorithm 3: BiasEstimation
Parameters: History , model region , confidence
Algorithm 4: BiasProjection
Parameters: a collection of linear constraints (8), to project
Coupled with prior information and span constraints, the obtained bias confidence region is a polyhedron of the same kind as the one encountered in Zhang and Xie (2023) generated by constraints of the form (8), and similarly to their Proposition 3, one can project onto in polynomial time with Algorithm 4. Moreover, the resulting projection operator satisfies the prerequisites (O1-4) of Proposition 2, making sure that PMEVI (Algorithm 2) is well-behaved. This is proved in the appendix Section B.2.
Lemma 4.
Assume that is a set of satisfying a system of equations of the form of (8). If is non empty, then the operator (see Algorithm 4) is a projection on and satisfies the properties (O1-4) defined in Proposition 2.
3.3 Mitigation using finer bias dynamical error
The fact that with high probability is used in PMEVI-DT to restrict the search of EVI by reducing the dynamical bias error. This reduction is based on an empirical Bernstein inequality (see Lemma 38) applied to . Here, it gives that with probability , we have:
(11) |
where is the variance of under the probability vector . More specifically, if is a probability on and , we set . In (11), , and are fixed. One is tempted to use (11) directly to mitigate the extended Bellman operator, but the resulting operator is ill-behaved because it loses monotonicity. This issue is avoided by changing to in (9). We obtain a variance maximization problem, which is a convex maximization problem with linear constraints. Even in very simple settings, such optimization problems are NP-hard Pardalos and Schnitger (1988) hence computing is not reasonable in general. Thankfully, this value can be upper-bounded by a tractable quantity that is enough to guarantee regret efficiency. The mitigation used by PMEVI-DT is provided by Algorithm 5.
Algorithm 5: VarianceApproximation
Parameters: Bias region , history
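The difficulty of the variance maximization problem can be illustrated with a short Python sketch. It assumes the usual definition $\mathbf{V}(q, u) = \sum_s q(s)\,(u(s) - q \cdot u)^2$ and, for simplicity, replaces the bias confidence region by a hypothetical box `low <= u <= high`; it is not Algorithm 5. Since the variance is convex in `u`, its maximum over the box is attained at a vertex, and the naive search below enumerates all $2^S$ of them.

```python
import itertools
import numpy as np

def variance(q, u):
    """V(q, u) = sum_s q(s) * (u(s) - q @ u)^2 for a probability vector q."""
    mean = q @ u
    return float(q @ (u - mean) ** 2)

def max_variance_over_box(q, low, high):
    """Brute-force maximum of V(q, .) over the box low <= u <= high.

    Convexity in u places the maximum at one of the 2^S vertices,
    hence the exponential cost of this naive enumeration.
    """
    best = 0.0
    for vertex in itertools.product(*zip(low, high)):
        best = max(best, variance(q, np.array(vertex)))
    return best

q = np.array([0.2, 0.3, 0.5])
print(max_variance_over_box(q, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]))
```

A cheap upper bound is always available: by Popoviciu's inequality, $\mathbf{V}(q, u) \le \mathrm{sp}(u)^2 / 4$, which is tight in the example above but can be loose for more structured regions.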
4 Regret guarantees
Theorem 5 below shows that PMEVI-DT has minimax optimal regret under regularity assumptions on the used confidence region . Assumption 1 asserts that the confidence region holds uniformly with high probability. Assumption 2 asserts that the reward confidence region is sub-Weissman (see Lemma 35) and Assumption 3 assumes that the model confidence region makes sure that EVI (6) converges in the first place. Assumption 4 asserts that the prior bias region is correct.
Assumption 1.
With probability , we have .
Assumption 2.
There exists a constant such that for all , for all , we have:
Assumption 3.
For , is a -rectangular convex region and converges to a fix-point.
Assumption 4.
The prior bias region contains and is generated by constraints of the form:
with (possibly infinite).
Refer to Section A.2 for the feasibility of Assumption 1, Section A.2.3 for Assumption 2, and Section A.3 for Assumption 3.
Theorem 5 (Main result).
Let . Assume that PMEVI-DT runs with a confidence region system that guarantees Assumptions 1-3. If , then for every weakly communicating model with and such that Assumption 4 is satisfied (), PMEVI-DT achieves regret:
with probability , and in expectation if . Moreover, if PMEVI-DT runs with the same confidence regions as UCRL2 Auer et al. (2009), then it enjoys a time complexity .
To have a completely prior-less algorithm, pick . The proof of Theorem 5 is too long to fit within these pages, so the complete proof is deferred to the appendix. We will focus here on the main ideas.
We start by introducing notations. At episode , the played policy is denoted . As a greedy response to , by Proposition 2 (3), there exists and such that . The reward-kernel pair is referred to as the optimistic model of . We write the true kernel and the empirical kernel. Likewise, we define the reward functions and . The optimistic gain and bias satisfy and . We further denote .
The regret is first decomposed episodically with . The first step goes back to the analysis of UCRL2 Auer et al. (2009), and consists in upper-bounding the regret over episode with optimistic quantities that are exclusive to that episode.
Lemma 6 (Reward optimism).
With probability , we have:
(12) |
We introduce the two optimistic regrets and . Rewriting the summand using the Poisson equation , we get:
The analysis proceeds by decomposing the above expression of in the style of Zhang and Ji (2019). We write as:
Each error term is bounded separately. Below, we denote .
Lemma 7 (Navigation error).
With probability , the navigation error is bounded by:
Lemma 8 (Empirical bias error).
With probability , the empirical bias error is bounded by:
Lemma 9 (Optimism overshoot).
With probability , the optimism overshoot is bounded by:
Lemma 10 (Second order error).
With probability , the second order error is bounded by:
We see that the empirical bias error (Lemma 8) and the optimism overshoot (Lemma 9) both involve the sum of variances , which is shown in Lemma 29 to be of order . The pseudo-regret term is bounded with the regret using Corollary 31, then by . With high probability, we obtain an equation of the form:
where is a constant. Setting and , the above equation is of the form . Solving in , we find . The dominant term is , hence we readily obtain:
(13) |
Since , we conclude that , ending the proof.
5 Experimental illustrations
To get a grasp of how PMEVI-DT behaves in practice, we provide in Fig. 2 a few illustrative experiments. In both experiments, the environment is a river-swim, which is a model known to be hard to learn despite its size, with high diameter and bias span. Its description is found in Bourel et al. (2020) and is reported in the appendix for self-containedness.
We observe on the first experiment that PMEVI behaves almost identically to its EVI counterparts when no prior on the bias region is given. This is because most of the regret is due to the earlier learning phase, when bias information is impossible to get (the regret is still growing linearly and the bias estimator is off). Accordingly, the bias confidence region is too large and all projections onto it are trivial during the iterations of PMEVI. Thankfully, this also makes the calls to PMEVI not substantially heavier than calls to EVI from a computational perspective. On the second experiment, we measure the influence of prior bias information on the behavior of PMEVI-DT. We observe that PMEVI-DT is very efficient at using adequate bias prior information to strikingly reduce the burn-in cost of the learning process on this -state riverswim.
References
- Abbasi-Yadkori et al. [2019] Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced POLITEX. arXiv preprint arXiv:1908.10479, 2019.
- Agrawal and Jia [2023] Shipra Agrawal and Randy Jia. Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds. Mathematics of Operations Research, 48(1):363–392, 2023. Publisher: INFORMS.
- Audibert et al. [2009] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009. Publisher: Elsevier.
- Auer and Ortner [2006] Peter Auer and Ronald Ortner. Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning. Proceedings of the 19th International Conference on Neural Information Processing Systems, December 2006.
- Auer et al. [2009] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal Regret Bounds for Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009.
- Azuma [1967] Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357 – 367, 1967. Publisher: Tohoku University, Mathematical Institute.
- Bartlett and Tewari [2009] Peter L. Bartlett and Ambuj Tewari. REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 35–42, Arlington, Virginia, USA, June 2009. AUAI Press. ISBN 978-0-9749039-5-8.
- Bourel et al. [2020] Hippolyte Bourel, Odalric Maillard, and Mohammad Sadegh Talebi. Tightening Exploration in Upper Confidence Reinforcement Learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1056–1066. PMLR, July 2020.
- Burnetas and Katehakis [1997] Apostolos Burnetas and Michael Katehakis. Optimal Adaptive Policies for Markov Decision Processes. Mathematics of Operations Research - MOR, 22:222–255, February 1997.
- Cohen et al. [2020] Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time, 2020.
- Filippi et al. [2010] Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in Reinforcement Learning and Kullback-Leibler Divergence. 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122, September 2010. arXiv: 1004.5229.
- Fruit [2019] Ronan Fruit. Exploration-exploitation dilemma in Reinforcement Learning under various form of prior knowledge. PhD Thesis, Université de Lille 1, Sciences et Technologies; CRIStAL UMR 9189, 2019.
- Fruit et al. [2018] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning. Proceedings of the 35th International Conference on Machine Learning, 2018.
- Fruit et al. [2020] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved Analysis of UCRL2 with Empirical Bernstein Inequality. ArXiv, abs/2007.05456, 2020.
- Jonsson et al. [2020] Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. Planning in markov decision processes with gap-dependent sample complexity. Advances in Neural Information Processing Systems, 33:1253–1263, 2020.
- Ouyang et al. [2017] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning Unknown Markov Decision Processes: A Thompson Sampling Approach. arXiv:1709.04570 [cs], September 2017. arXiv: 1709.04570.
- Pardalos and Schnitger [1988] Panos M. Pardalos and Georg Schnitger. Checking local optimality in constrained quadratic programming is NP-hard. Operations Research Letters, 7:33–35, 1988.
- Puterman [1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1 edition, April 1994. ISBN 978-0-471-61977-2 978-0-470-31688-7.
- Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Talebi and Maillard [2018] Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs. Journal of Machine Learning Research, pages 1–36, April 2018. Publisher: Microtome Publishing.
- Tewari and Bartlett [2007] Ambuj Tewari and P. Bartlett. Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs. In NIPS, 2007.
- Theocharous et al. [2017] Georgios Theocharous, Zheng Wen, Yasin Abbasi-Yadkori, and Nikos Vlassis. Posterior sampling for large scale reinforcement learning. arXiv preprint arXiv:1711.07979, 2017.
- Thompson [1933] William R Thompson. On the Likelihood that One Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 25(3-4):285–294, December 1933. ISSN 0006-3444.
- Wei et al. [2020] Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10170–10180. PMLR, July 2020.
- Zhang and Ji [2019] Zihan Zhang and Xiangyang Ji. Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Zhang and Xie [2023] Zihan Zhang and Qiaomin Xie. Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes. In The Thirty Sixth Annual Conference on Learning Theory, pages 5476–5477. PMLR, 2023.
- Zhang et al. [2020] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition. arXiv:2004.10019 [cs, stat], June 2020. arXiv: 2004.10019.
Appendix
Appendix A Construction of PMEVI-DT
This section provides the technical details required to understand the design of PMEVI-DT in Section 3. We further discuss Assumptions 1-4 appearing in Theorem 5 and provide sufficient conditions so that they are met.
A.1 Proof of Lemma 3, estimation of the bias error
Fix . We denote . We will start by considering the better estimator that satisfies the same equation (9) as but with changed to , namely:
To avoid typographical clutter, we write instead of in the remainder of the proof and we write .
(STEP 1) We start by relating the two estimators. Intuitively, is a good estimator for when the regret is small. Recall that , hence:
Therefore,
We are left with upper-bounding .
(STEP 2) If is even, then and ; otherwise and . In both cases, we have . Therefore, using Bellman’s equation, the quantity satisfies:
Multiplying by and rearranging, appears to be equal to:
Proceed by summing over . By the triangle inequality, we obtain:
Because all Bellman gaps are non-negative, the second term is upper-bounded by the pseudo-regret . The first term is a martingale, and the martingale difference sequence has span at most since rewards are supported in . Although the number of involved is random, it is upper-bounded by , hence by the maximal version of Azuma-Hoeffding’s inequality (Lemma 32), we have that with probability at least , uniformly for ,
(STEP 3) We conclude that with probability , for all ,
We are left with relating both and to . Using the Bellman equation again, we find that:
where the last inequality holds with probability uniformly over by Azuma-Hoeffding’s inequality again (Lemma 32). Remark that if , then , hence we conclude that with probability , for all :
where the last inequality invokes . We conclude that, with probability , for all , we have:
This concludes the proof. ∎
A.2 The confidence region of PMEVI-DT
The algorithm PMEVI-DT can be instantiated in many ways, depending on the type of confidence region one is willing to use for rewards and kernels. In this work, we allow for four types of confidence regions, described below. For conciseness, is a symbolic letter that can be a reward or a kernel and denotes the confidence region for at time . If , then (Bernoulli rewards) with ; and if , then with .
- (C1) Azuma-Hoeffding or Weissman type confidence regions, with taken as:
- (C2) Empirical Bernstein type confidence regions, with taken as:
  with the convention that for .
- (C3) Empirical likelihood type confidence regions, with taken as:
- (C4) Trivial confidence region with .
A few remarks are in order. When rewards are not Bernoulli, only the confidence regions (C1) and (C4) are eligible among the above. Then, Weissman’s inequality must be changed to Azuma’s inequality for -sub-Gaussian random variables, see Lemma 34. Since rewards are supported in , Hoeffding’s Lemma guarantees that reward distributions are -sub-Gaussian with .
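To make the shape of such regions concrete, here is a minimal sketch of fixed-sample-size confidence radii. It is an illustration under stated assumptions, not the exact expressions of (C1): rewards are assumed to lie in [0, 1] (hence 1/2-sub-Gaussian by Hoeffding's Lemma), the Weissman radius uses the classical bound P(||p_hat - p||_1 >= eps) <= (2^S - 2) exp(-n eps^2 / 2), and the time-uniform corrections of Lemmas 34 and 35 are not included. The function names are hypothetical.

```python
import math


def weissman_radius(n_states, n_samples, delta):
    """L1 radius eps with P(||p_hat - p||_1 >= eps) <= delta, for a fixed sample size.

    Assumption: classical Weissman bound, without any time-uniform correction.
    """
    if n_states < 2:
        return 0.0  # a single reachable state: the kernel is known exactly
    if n_samples <= 0:
        return 2.0  # no data: the L1 distance between distributions is at most 2
    log_term = math.log((2.0 ** n_states - 2.0) / delta)
    return min(2.0, math.sqrt(2.0 * log_term / n_samples))


def hoeffding_radius(n_samples, delta, sigma=0.5):
    """Two-sided radius for the mean of sigma-sub-Gaussian samples (fixed sample size)."""
    if n_samples <= 0:
        return 1.0  # no data: rewards are assumed to lie in [0, 1]
    return min(1.0, sigma * math.sqrt(2.0 * math.log(2.0 / delta) / n_samples))


# Radii shrink at rate 1/sqrt(n) as a state-action pair is visited more often.
radii = [weissman_radius(5, n, 0.05) for n in (10, 100, 1000)]
```

Both radii shrink at rate 1/sqrt(n); the Weissman radius additionally grows with the number of states through the (2^S - 2) factor.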
A.2.1 Correctness of the model confidence region and 1
The confidence regions described with (C1-4) are tuned so that the following result holds:
Lemma 11.
Assume that, for all and , we choose among (C1-4). Then 1 holds. More specifically, the region of models satisfies .
A.2.2 Simultaneous correctness of bias confidence region , mitigation and optimism
In this section, we show that if 1 holds, then the bias confidence region constructed by PMEVI-DT is correct with high probability, and that the mitigation is not too strong. Recall that are the optimistic gain and bias of the policy deployed in episode (see Algorithm 1). In particular, we have with . We start with a result on the deviation of the variance, which is what the variance approximation Algorithm 5 is based on. Recall that the bias confidence region is obtained as the collection of constraints:
- (1) prior constraints (if any) ;
- (2) span constraints ;
- (3) dynamically inferred constraints (see Algorithm 3).
We have the following result.
Lemma 12.
Let and fix a probability distribution on . Then for all ,
Proof.
We start by establishing the following result: If is a probability distribution on and , we have:
(14) |
where is the dot product, the Hadamard product and the vector whose entry is . (14) is obtained with a straightforward computation:
Observe that can be changed to , where is the vector full of ones, without changing the result. The same goes for . We now move to the proof of the main statement. First, translate and such that . Then, we have:
Conclude using that for such that . ∎
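The translation invariance used in this proof is easy to check numerically. The sketch below assumes that the variance of a vector u under a distribution p is V(p, u) = p . (u ∘ u) - (p . u)^2, the usual identity written with the dot and Hadamard products; the helper name `variance` and the random test vectors are illustrative only.

```python
import numpy as np


def variance(p, u):
    """Variance of u under p: V(p, u) = p . (u * u) - (p . u)^2."""
    return float(p @ (u * u) - (p @ u) ** 2)


rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))  # a random probability distribution on 5 states
u = rng.normal(size=5)         # an arbitrary value vector
e = np.ones(5)                 # the vector full of ones

v_original = variance(p, u)
v_shifted = variance(p, u + 3.7 * e)  # translating u by a multiple of e

# The two values agree up to floating-point error.
gap = abs(v_original - v_shifted)
```

This is why the proof may translate the vectors freely before bounding the variance.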
Lemma 13.
Assume that 1 holds and that . Then, with probability , for all , (1) and (2) and (3) for all , .
Proof.
Let the event . Let the event stating that, for all ,
and let the event stating that, for all and for all , we have:
By Lemma 3, we have and by Lemma 36, we have , so . We prove by induction on that, on , (1) , (2) (3) and for all , , where is the optimistic gain of the policy deployed at episode . For , this is obvious. Indeed, for all hence . Therefore,
so contains , proving (2). Moreover, since , we have , proving (3). Finally, since on , by the statement (2) of Proposition 2, we have , hence proving (1).
Now assume that . By induction for all , so on we have:
By design of (see Algorithm 3), we deduce that (2) . Denote the reference point used by Algorithm 5. For all , on , we have:
( + Lemma 12) | |||
by construction of Algorithm 5. Accordingly, (3) is satisfied. Finally, on so by Proposition 2, we have (1) . ∎
Corollary 14.
Assume that, for all and , we choose among (C1-4). Then, with probability , for all , we have and (2) and (3) for all , .
A.2.3 Sub-Weissman reward confidence region and 2
Although the kernel confidence region can even be chosen to be trivial with (C4), in order to work, PMEVI-DT needs the reward confidence region to be sub-Weissman in the following sense:
Assumption 2.
There exists a constant such that for all , for all , we have:
This is indeed the case if is chosen among (C1-3).
A.3 Convergence of EVI and 3
We start with a preliminary lemma on the speed of convergence of EVI. Lemma 15 is meant to be applied to extended MDPs. Below, when we claim that the action space is compact, we further claim that is a continuous map, so that the Bellman operator is continuous and that and are well-defined, see Puterman [1994].
Lemma 15.
Let a weakly-communicating MDP with finite state space and compact action space, and let its Bellman operator. Assume that there exists such that, ,
() |
with . Then, for all and all , if , then:
Proof.
Since is weakly communicating, has finitely many states and compact action space, it has well-defined gain and bias functions. Denote .
Observe that the policy achieving the maximum is the one achieving . Remark that is the Bellman gap of the pair , that we more simply write . For all , there exists such that . Moreover, by assumption, we have where is a stochastic matrix. Moreover,
Hence, . In addition, , so by non-expansiveness of in span semi-norm, . Overall,
(15) |
Fix , and let .
Let an optimal policy. We have so by induction, . Meanwhile, we see that , so . Since for all , we have so .
By (15), either or , but because , the second case can happen at most times. We deduce that, for all ,
In particular, for , we get:
We obtain:
To conclude, check that . ∎
Before moving to the application of interest, remark that this result can be greatly improved if the supremum is not zero, replacing the dominant term by a constant independent of .
Corollary 16.
Proof.
If has non-empty interior, then for all , has non-empty interior. Therefore, for every state-action pair, there exists that is fully supported. It follows that is communicating, and it follows from standard results Puterman [1994] that its span fix-points do exist and that does not depend on the initial state.
Moreover, if and with , letting and , we have:
So by induction and since is obviously monotone and linear, we show that:
Dividing by and letting it go to infinity, we obtain . Observe that we have equality by taking the policy achieving .
To see that EVI indeed converges, simply observe that Lemma 15 provides a finite bound on how much time is required until the . Hence vanishes to . ∎
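The stopping rule discussed above can be illustrated on an ordinary finite MDP. The following is a minimal sketch of value iteration stopped when the span of one Bellman update falls below a tolerance; it is not EVI itself (there is no confidence region and the action space is finite), and the two-state MDP, the function names and the tolerance are assumptions made for the example.

```python
import numpy as np


def span(u):
    """Span semi-norm sp(u) = max(u) - min(u)."""
    return float(np.max(u) - np.min(u))


def bellman(r, P, u):
    """Bellman operator: (Lu)(s) = max_a { r(s, a) + P(s, a) . u }."""
    return np.max(r + P @ u, axis=1)


def value_iteration(r, P, eps=1e-8, max_iter=100_000):
    """Iterate u <- Lu until sp(Lu - u) < eps; return (gain, bias, iterations)."""
    u = np.zeros(r.shape[0])
    for n in range(1, max_iter + 1):
        v = bellman(r, P, u)
        diff = v - u
        if span(diff) < eps:
            gain = 0.5 * (np.max(diff) + np.min(diff))
            return float(gain), v - np.min(v), n
        u = v - np.min(v)  # recentring keeps iterates bounded; spans are unaffected
    raise RuntimeError("value iteration did not reach the requested precision")


# Hypothetical 2-state MDP: action 0 stays put, action 1 tries to switch state.
r = np.array([[0.2, 0.0], [0.8, 0.0]])  # r[s, a]
P = np.zeros((2, 2, 2))                 # P[s, a, s']
P[0, 0] = [1.0, 0.0]
P[1, 0] = [0.0, 1.0]
P[0, 1] = [0.1, 0.9]
P[1, 1] = [0.9, 0.1]
gain, bias, n_iter = value_iteration(r, P)
```

On this toy instance the optimal policy moves to the second state and stays there, so the gain is 0.8 and the bias gap between the two states is 0.8 / 0.9.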
About 3.
The assumptions made by Corollary 16 are met if the kernel confidence regions are:
- Built out of Weissman’s inequality (C1) (see the next section, also Auer et al. [2009]);
- Built out of Bernstein’s inequality (C2) (because the maximization algorithm to compute in EVI has the same greedy properties as with Weissman’s inequality);
- Trivial (C4), obviously.
For confidence regions built with empirical likelihood estimates (C3), there is no guarantee of convergence (we conjecture that one could be established), but the gain is still well-defined because remains communicating. However, as in the original work of Filippi et al. [2010], convergence is always observed numerically.
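The greedy property mentioned for Weissman regions refers to the classical inner maximization used in UCRL2-style extended value iteration: move as much probability mass as the L1 radius allows onto the best next state, and take it away from the worst ones. The sketch below is our recollection of that standard routine, written for a single state-action pair; the function name and the example numbers are hypothetical.

```python
import numpy as np


def weissman_inner_max(p_hat, u, radius):
    """Maximize p . u over distributions p with ||p - p_hat||_1 <= radius."""
    order = np.argsort(-u)  # states sorted from highest to lowest value
    p = np.asarray(p_hat, dtype=float).copy()
    best = order[0]
    # Put up to radius / 2 extra mass on the most valuable state.
    p[best] = min(1.0, p[best] + radius / 2.0)
    # Remove the excess mass from the least valuable states first.
    for s in order[::-1]:
        excess = p.sum() - 1.0
        if excess <= 1e-12:
            break
        p[s] = max(0.0, p[s] - excess)
    return p


p_hat = np.array([0.5, 0.3, 0.2])
u = np.array([1.0, 0.0, 2.0])
p_opt = weissman_inner_max(p_hat, u, radius=0.2)
```

Here the routine shifts mass 0.1 from the state of value 0 to the state of value 2, which uses the whole L1 budget of 0.2.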
A.4 Proof of Theorem 5: Complexity of PMEVI with Weissman confidence regions
In this section, we show that when one is using Weissman confidence regions for kernels (C1), then the iterates of converge to a span fix-point quickly.
Proposition 17.
Assume that PMEVI-DT uses kernel confidence regions of Weissman type (C1) satisfying 1. Then with probability , the number of iterations of PMEVI (see Algorithm 2) is , hence the algorithm has polynomial per-step amortized complexity.
Proof.
With Weissman type confidence regions for kernels, for all and , we have
It follows that, for all , the extended Bellman operator satisfies the prerequisite of Lemma 15 with
Under 1, we have with probability . On this event, is weakly communicating and , so we can apply Lemma 15 and conclude that every call to PMEVI (Algorithm 2) takes
where we use that , that and that . Since the number of episodes under the doubling trick (DT) is , we conclude accordingly. ∎
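The logarithmic number of episodes can be observed in simulation. The sketch below assumes the usual doubling rule from the UCRL2 literature: an episode ends as soon as the number of visits of some state-action pair within the episode reaches its visit count at the start of the episode (or 1 if the pair was never visited). State-action pairs are drawn uniformly at random, which is a simplification, and the bound quoted in the comment, SA log2(8T / (SA)), is the one we recall from that literature rather than an expression copied from this paper.

```python
import math
import random


def count_episodes(n_pairs, horizon, seed=0):
    """Count episodes produced by the doubling rule over `horizon` steps."""
    rng = random.Random(seed)
    before = [0] * n_pairs  # visit counts at the start of the current episode
    within = [0] * n_pairs  # visit counts inside the current episode
    episodes = 1
    for _ in range(horizon):
        z = rng.randrange(n_pairs)  # simplification: uniformly random pair
        within[z] += 1
        if within[z] >= max(1, before[z]):
            # The count of z has doubled: close the episode and start a new one.
            for i in range(n_pairs):
                before[i] += within[i]
                within[i] = 0
            episodes += 1
    return episodes


n_pairs, horizon = 6, 10_000
episodes = count_episodes(n_pairs, horizon)
# Recalled bound for the doubling rule, valid when horizon >= n_pairs.
recalled_bound = n_pairs * math.log2(8 * horizon / n_pairs)
```

Since each pair can trigger a new episode only when its count doubles, the number of episodes grows logarithmically with the horizon, which is what keeps the amortized cost of the planning calls low.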
Every call to the projection operator solves a linear program. Although this time is polynomial in theory (relying on recent work on the complexity of LP such as Cohen et al. [2020], it is the current matrix multiplication time ), in practice, reducing the number of calls to the projection operator is key to running PMEVI-DT in reasonable time.
Appendix B Analysis of the projected mitigated Bellman operator
In this section, we fix the model region , the bias region and the mitigation vector , dropping the subscript for conciseness. We denote the respective empirical reward and kernel. Further assume that with a compact convex set. The associated projection operation (see Section B.2) is denoted . The (vanilla) extended Bellman operator associated to is given by . The -mitigated extended Bellman operator associated to is:
(16) |
The function Greedy returns a stationary deterministic policy that picks its actions among the one reaching the maximum above. The projection of to is
(17) |
The goal of this section is to establish Proposition 2 and Lemma 4:
- Proposition 2 statement (1) is a consequence of Lemma 22;
- Proposition 2 statement (2) follows from Theorem 25;
- Proposition 2 statement (3) follows from Corollary 27;
- Proposition 2 statement (4) follows from Corollary 21;
- Proposition 2 prerequisites on the projection operator and Lemma 4 follow from Lemma 19.
B.1 Finding an optimistic policy under bias constraints
The main goal is to find an optimistic policy under bias constraints (projection) and bias error constraints (mitigation). The bias constraints imply that we search for a policy together with a model such that . The bias error means that, for , we want in addition where is the transition kernel of . In the end, our goal is to track the solution of the following optimization problem:
(18) |
where the supremum is taken with respect to the product order . In particular, if , check that is obtained as . The constraint is suggested by the work of Fruit et al. [2018], Fruit [2019] and is key for the problem to be solvable.
The bias and the -constraints make the problem difficult to handle with a “pure” extended MDP solution, which is why the extended Bellman operators are mitigated (with ) then projected (with ). The mitigation operation guarantees that the -constraint is satisfied, while the projection on makes sure that the bias constraint is satisfied. It is important for both operations to be compatible, i.e., that the -constraint that forces is not lost when applying . As a matter of fact, projecting then mitigating would not work.
We now explain why can be used to solve (18).
B.2 Projection operation and definition of
We start by discussing why is well-defined at all. The well-definedness of is obvious. The point is to explain why the projection onto is possible while preserving mandatory structural properties such as monotonicity, non-expansivity, linearity and more. For general , such properties are impossible to meet. But the bias confidence region constructed with Algorithm 3 has a specific shape that makes the projection possible. The central property is the one below:
(A1) The downward closure of every has a maximum in .
The only order that we will be considering is the product order on . Recall that a set has a maximum if there exists such that for all . A supremum of is a minimal upper-bound of , i.e., such that (1) for all and (2) no satisfying (1) can be smaller than . For the product order, the supremum of a subset is unique and of the form .
Define the projection as follows:
(19) |
In general, Assumption (A1) is satisfied when admits a join, i.e., is stable by finite supremum: .
Lemma 18.
If is generated by constraints of the form , then it has a join and (A1) is satisfied. Moreover, is then correctly computed with Algorithm 4.
Proof.
The first half of the result is well-known, see Zhang and Xie [2023], but we recall a proof for completeness. Let and define . Observe that . So .
We continue by showing that if has a join, then (19) is well-defined. For , take a sequence such that . Because the span of every element of is upper-bounded by , it follows that evolves in the compact region . We can therefore extract a convergent subsequence of , converging to some that belongs to since the latter is closed. By construction, . Because has a join, . ∎
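The join property can be checked numerically. The sketch below assumes that the constraints are pairwise difference constraints of the form h(s) - h(s') <= c(s, s'), which is our reading of the constraint form in Lemma 18; the bounds c, the sampling scheme and the helper name are illustrative assumptions. It samples feasible vectors by rejection and verifies that the entrywise maximum of any two of them is feasible again.

```python
import numpy as np


def is_feasible(h, c, tol=1e-12):
    """Check h(s) - h(s') <= c(s, s') for every pair of states."""
    diffs = h[:, None] - h[None, :]
    return bool(np.all(diffs <= c + tol))


rng = np.random.default_rng(0)
n_states = 3
c = rng.uniform(0.5, 2.0, size=(n_states, n_states))  # hypothetical pairwise bounds

# Rejection sampling of feasible bias vectors.
candidates = rng.uniform(0.0, 2.0, size=(2000, n_states))
feasible = [h for h in candidates if is_feasible(h, c)][:50]

# The join (entrywise maximum) of two feasible vectors stays feasible.
joins_ok = all(
    is_feasible(np.maximum(h1, h2), c) for h1 in feasible for h2 in feasible
)
```

The reason is the one given in the proof: if the maximum at s is attained by h1, then the difference of the joins at (s, s') is at most h1(s) - h1(s'), which satisfies the constraint.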
Lemma 19.
Under assumption (A1), the operator is well-defined, and is:
- (1) monotone: ;
- (2) non span-expansive: ;
- (3) linear: ;
- (4) .
Proof.
The well-definedness of is obvious from (A1). For (1), if then . Hence . For (3), check that it follows from . For (4), we obviously have .
The more difficult point is (2) span non-expansivity. Pick . By linearity, it suffices to show the result for . In that case, we have . Observe that for all , we have . Since , it follows that:
Similarly, we have . Using them both at once, we find . ∎
The properties (1), (3) and (4) are essential for to properly address the optimization problem (18). The property (2) is just as important, because it plays a central part in the convergence of value iteration. The next result shows similar properties for the -mitigated extended Bellman operator . From now on, we will assume (A1), because it is almost-surely satisfied by the bias confidence region generated by Algorithm 3.
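These four properties can be verified numerically in the simplest special case. The sketch below assumes that the bias region contains only a span constraint sp(h) <= c and that the projection of u is the largest vector of the region below u; in that case the projection reduces to the clipping u -> min(u, min(u) + c). This closed form is a special case chosen for illustration and is not Algorithm 4; the names `project_span` and `checks` are hypothetical.

```python
import numpy as np


def span(u):
    """Span semi-norm sp(u) = max(u) - min(u)."""
    return float(np.max(u) - np.min(u))


def project_span(u, c):
    """Largest v <= u (entrywise) such that span(v) <= c."""
    return np.minimum(u, np.min(u) + c)


rng = np.random.default_rng(1)
c = 1.0
checks = {"below": True, "monotone": True, "linear": True, "non_expansive": True}
for _ in range(1000):
    u = rng.normal(scale=2.0, size=6)
    v = rng.normal(scale=2.0, size=6)
    w = u + rng.uniform(0.0, 1.0, size=6)  # w >= u entrywise
    lam = rng.normal()                     # translation along the all-ones vector
    pu, pv, pw = project_span(u, c), project_span(v, c), project_span(w, c)
    checks["below"] &= bool(np.all(pu <= u))
    checks["monotone"] &= bool(np.all(pu <= pw + 1e-12))
    checks["linear"] &= bool(np.allclose(project_span(u + lam, c), pu + lam))
    checks["non_expansive"] &= span(pu - pv) <= span(u - v) + 1e-12
```

All four flags remain true over the random trials, in line with Lemma 19 for this particular region.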
Lemma 20.
The -mitigated extended Bellman operator is (1) monotone, (2) non-span-expansive and (3) linear.
Proof.
The properties (1) and (3) directly follow from the definition. We focus on (2). Fix . By Lemma 26, we can write and . In the following, we write . Check that:
If the minimum is reached with , then:
If the minimum is reached with , then upper-bound by to obtain:
Overall, we find that there exists such that . Similarly, we find such that . We conclude that:
This concludes the proof. ∎
By composition, we obtain the following result.
Corollary 21.
is (1) monotone, (2) non-span-expansive and (3) linear. Moreover, for all .
B.3 Fix-points of and (weak) optimism
Lemma 22.
has a fix-point in span semi-norm, i.e., .
Proof.
The idea is to apply Brouwer’s fix-point theorem in quotiented by the equivalence relation , where becomes a norm. By linearity (Corollary 21), is well-defined in this quotient space, and if is shown continuous on , so will it be on the quotient.
We show that is sequentially continuous on . Consider a sequence converging to and fix . Provided that for large enough, we have , i.e., . Therefore, on the one hand, for all , we have so ; and on the other hand, for all , so . Hence:
It shows that is continuous. The operator is obviously continuous as well, so is continuous by composition. Since with compact and convex, the quotient is compact and convex, and is preserved by . By Brouwer’s fix-point theorem, has a fix-point in . So has a span fix-point in . ∎
We write the span fix-points of .
Lemma 23.
has well-defined growth. Specifically, if , then:
- (1) There exists , s.t., for all , ;
- (2) If , then .
Proof.
Setting , one can check that for all . This proves (1) for and we then proceed by induction on . By induction, and by Corollary 21, is monotone, so we have:
where the last inequality uses the linearity of together with . The lower bound of is shown similarly, establishing (1).
For (2), pick with . Up to translating , we can assume that and apply (1). We get:
Divide by and let it go to infinity. We conclude that . ∎
We finally have everything in hand to claim that solves (18).
Corollary 24.
The growth of given by for is well-defined, and:
Moreover, .
Proof.
The growth property is a direct consequence of Lemma 23. We show which is defined in (18). Pick its model with and where . Up to translation, we can assume that .
We have for , so
by definition. By monotonicity of , see Corollary 21, follows by induction on . By Lemma 23, we further have where . In tandem,
Letting , we deduce that . Conclude by taking the best and . ∎
The next theorem follows directly with the same proof technique, and guarantees optimism.
Theorem 25.
Assume that . Then .
The condition “” can be referred to as a weak form of optimism. We qualify this version of optimism as weak because it is much weaker than the optimism property suggested by Fruit [2019] where is the Bellman operator of the true MDP. Here, we only ask for , i.e., optimism at the fix-point of . This condition is met as soon as , and is large enough, but is in fact much more general.
B.4 Modelization of the projected mitigated Bellman operator
The aim of this paragraph is to establish Corollary 27, stating that can be viewed as a policy produced by Greedy.
Lemma 26 (Modelization).
For , denote , and . Fix and let .
- (1) If is convex, then there exists such that .
- (2) Assume that . There exists such that .
The convexity requirement of (1) is always true if the kernel confidence region is chosen via (C1-4).
Proof.
For (1), fix a state , let and . If , then there is nothing to say because is compact, hence the sup is a max and is of the form . Otherwise, let with . Introduce, for ,
By continuity, there exists such that and by convexity of , . This proves (1).
For (2), recall that . Since , for , we have:
Set . Check that satisfies and . ∎
The last corollary below is crucial to claim that greedy policies are good choices in PMEVI-DT.
Corollary 27 (Greedy modelization).
Let and fix . If is convex, then with the notations of Lemma 26, there exists and such that .
Appendix C Proof of Theorem 5: Regret analysis of PMEVI-DT
We recall a few notations. At episode , the played policy is denoted . As a greedy response to , by Proposition 2 (3), there exists and such that . The reward-kernel pair is referred to as the optimistic model of . We write the true kernel and the empirical kernel. Likewise, we define the reward functions and . The optimistic gain and bias satisfy and . We further denote .
Important remark.
To slightly simplify the analysis, we assume that PMEVI is run with perfect precision , i.e., that hence is a span fix-point of . This assumption is mild and can be dropped by adding an extra error term that has to be carried through the calculations.
C.1 Number of episodes under doubling trick (DT)
Lemma 28 (Number of episodes, Auer et al. [2009]).
The number of episodes up to time is upper-bounded by:
C.2 Sum of bias variances
Lemma 29 below shows that scales as in probability.
Lemma 29.
With probability at least , we have:
Proof.
Using the Bellman equation , we have:
Since , we get:
(Lemma 32) |
where the last inequality holds with probability . This concludes the proof. ∎
C.3 Regret and pseudo-regret: A tight relation
In this paragraph, we bound the regret with respect to the pseudo-regret (and conversely) up to a factor of order . Hence, in proofs, the pseudo-regret can be changed to the regret with ease.
Lemma 30.
With probability , the regret and the pseudo-regret are linked as follows:
Proof.
We rely again on the Poisson equation , so:
Up to the constant , the two error terms are respectively a navigation and a reward error. The second is bounded using Azuma’s inequality (Lemma 32), showing that with probability , we have:
We continue by using Freedman’s inequality, instantiated in the form of Lemma 33. With probability , we have:
The quantity is a classical one that appears at several places throughout the analysis. Using Lemma 29, we bound it explicitly. Further simplifying the bound with , we get that with probability , we have:
Bound by and use to merge the terms in under a single square-root. ∎
Overall, Lemma 30 states that the regret and the pseudo-regret differ by about in probability (up to asymptotically negligible additional terms). In general, the precise form of Lemma 30 is not convenient to use because it is of a form that is not linear in . Corollary 31 factorizes the result into one which will be more convenient in proofs.
Corollary 31.
Denote and . Further introduce:
Then, with probability , we have and .
Proof.
This is straightforward algebra from the result of Lemma 30. ∎
C.4 Proof of Lemma 6, reward optimism
We start by getting rid of the reward noise. We have:
with probability by Azuma’s inequality (Lemma 32). We are left with . We continue by splitting the regret episodically and invoking optimism. By Lemma 13, with probability , we have . Introduce
(20) |
We focus on bounding . By 2, is of the form with . By the statement (3) of Proposition 2, . Therefore,
where holds with probability following Lemma 35. By the doubling trick rule (DT), we have for , so, with probability ,
(Jensen) |
We conclude that with probability , we have:
(21) |
This concludes the proof. ∎
C.5 Proof of Lemma 7, navigation error
We have:
The last term is by Lemma 28, hence is .
(STEP 1) We start by bounding . By Lemma 13, with probability , we have for all . So . By Freedman’s inequality invoked in the form of Lemma 33, we have with probability ,
It suffices to bound the first term. Recall that is the vector full of ones. We have:
Here the inequality holds with probability following Lemma 40. We will bound the summand with the bias estimation error that spawns the inner regret estimation . This inner estimation is linked to the overall optimistic regret by:
In the above, holds with probability uniformly on following Lemma 13 and holds, also uniformly on , with probability by applying Azuma-Hoeffding’s inequality (Lemma 32). Continuing, still on the event specified by Lemma 13, we have with probability :
(DT) | |||
(STEP 2) For , by Freedman’s inequality invoked in the form of Lemma 33 again, we have with probability ,
We recognize the sum of variance that we leave as is.
(STEP 3) As a result, with probability , we have:
when . ∎
C.6 Proof of Lemma 8, empirical bias error
Because is a fixed vector, Bennett’s inequality (see Lemma 39) guarantees that is small as follows. By doing a union bound over Lemma 39 with confidence over all pairs and visit counts , we see that with probability , for all , we have:
(by doubling trick) |
Summing this over and factorizing over state-action pairs, we get that with probability ,
(Jensen) | |||
We recognize the sum of variances , that is left to be upper-bounded later on. ∎
C.7 Proof of Lemma 9, optimism overshoot
Because of the -mitigation generated by Algorithm 5, the quantity is shown to be directly related to up to a provably negligible error. Denote the reference point BiasProjection used in Algorithm 5 (denoted in the algorithm). By Lemma 13, with probability , we have for all . To lighten notation, we write instead of .
(STEP 1) Denote . By construction of , we have , so:
A | |||
The rightmost term of A is of order hence is negligible. We focus on the other two. The analysis of will spawn a term similar to , hence we start with the second. Recall that is the bias error provided by Algorithm 3 and that the inner regret estimation is . Now, remark that:
In the above, holds with probability uniformly on following Lemma 13 and holds, also uniformly on , with probability by applying Azuma-Hoeffding’s inequality (Lemma 32). Therefore, with probability , for all and , we have:
This bound will be enough. We move on to . We have:
where is obtained by applying Lemma 12 and holds with probability by applying Weissman’s inequality, see Lemma 35. All together, with probability , A is upper-bounded by:
(STEP 2) The number of visits is lower-bounded by when by doubling trick (DT). By summing over and , we find that with probability ,
(DT) | |||
where the last inequality is obtained with computations that are similar to those detailed in the proof of Lemma 8. We recognize the variance that we will leave as is. We finish the proof by bounding the lower order terms and .
(STEP 3) We start with . We have:
(DT) | |||
(STEP 4) We are left with . We have:
(DT) | |||
This concludes the proof. ∎
C.8 Proof of Lemma 10, second order error
Recall that by Lemma 13, with probability , for all , hence for all on the same event. Therefore, with probability ,
where uses that , and is obtained by applying the empirical Bernstein’s inequality, see Lemma 36, to , and holds with probability . The rightmost term’s sum is upper-bounded by:
For the other term, follow the line of the proof of Lemma 9 (term ). We have with probability ( of which is by invoking Lemma 13):
Therefore,
Summing over , , , with probability , we have:
This concludes the proof. ∎
Appendix D Details on experiments
D.1 River swim
Experiments are run on -states river-swim. Such MDPs are, despite their size, known to be hard to learn. They consist of states aligned in a straight line with two playable actions right and left whose dynamics are given in the figure below. Rewards are Bernoulli and null everywhere except for and .
-state river-swim.
The gain is and .
-state river-swim.
The gain is and .
Appendix E Standard concentration inequalities
Lemma 32 (Azuma’s inequality, Azuma [1967]).
Let a martingale difference sequence such that a.s., i.e., there exists such that a.s. Then, for all ,
Lemma 33 (Freedman’s inequality, Zhang et al. [2020]).
Let a martingale difference sequence such that a.s., and denote its conditional variance . Then, for all ,
Lemma 34 (Time-uniform Azuma, Bourel et al. [2020]).
Let a martingale difference sequence such that, for all , . Then:
Lemma 35 (Time-uniform Weissman).
Let a distribution over . Let a sequence of i.i.d. random variables of distribution . Then:
Proof.
Remark that . Let . Remark that for each , is a family of i.i.d. random variables with , so by Hoeffding’s Lemma. By Lemma 34, we have:
This concludes the proof. ∎
Lemma 36 (Time-uniform Empirical Bernstein).
Let a martingale difference sequence such that a.s., let the empirical mean and the population variance. Then,
Proof.
This is obtained with a union bound on the values of , then applying Lemma 38. ∎
Lemma 37 (Time-uniform Empirical Likelihoods, Jonsson et al. [2020]).
Let a distribution on . Let a sequence of i.i.d. random variables of distribution . Then:
Lemma 38 (Empirical Bernstein inequality, Audibert et al. [2009]).
Let a martingale difference sequence such that a.s., let the empirical mean and the population variance. Then,
Lemma 39 (Bennett’s inequality, Audibert et al. [2009]).
Let a martingale difference sequence such that a.s., and denote its conditional variance . Then,
Lemma 40 (Lemma 3 of Zhang and Xie [2023]).
Let be a sequence of random variables such that a.s., and let . Then: