Contractual Reinforcement Learning:
Pulling Arms with Invisible Hands

Acknowledgment: This work is supported in part by Army Research Office Award W911NF-23-1-0030, ONR Award N00014-23-1-2802 and NSF Award CCF-2303372.

Jibang Wu (Department of Computer Science, University of Chicago), Siyu Chen (Department of Statistics and Data Science, Yale University), Mengdi Wang (Department of Electrical and Computer Engineering, Princeton University), Huazheng Wang (School of Electrical Engineering and Computer Science, Oregon State University), Haifeng Xu (Department of Computer Science, University of Chicago)
Corresponding author emails: {wujibang, haifengxu}@uchicago.edu
Abstract

The agency problem emerges in today’s large-scale machine learning tasks, where the learners are unable to direct content creation or enforce data collection. In this work, we propose a theoretical framework for aligning economic interests of different stakeholders in online learning problems through contract design. The problem, termed contractual reinforcement learning, naturally arises from the classic model of Markov decision processes, where a learning principal seeks to optimally influence the agent’s action policy for their common interests through a set of payment rules contingent on the realization of the next state. For the planning problem, we design an efficient dynamic programming algorithm to determine the optimal contracts against the far-sighted agent. For the learning problem, we introduce a generic design of no-regret learning algorithms to untangle the challenges from robust design of contracts to the balance of exploration and exploitation, reducing the complexity analysis to the construction of efficient search algorithms. For several natural classes of problems, we design tailored search algorithms that provably achieve $\widetilde{O}(\sqrt{T})$ regret. We also present an algorithm with $\widetilde{O}(T^{2/3})$ regret for the general problem that improves the existing analysis in online contract design with mild technical assumptions.

1 Introduction

“Every individual… intends only his own gain, and is led by an invisible hand to promote an end which was no part of his intention.”

— Adam Smith, The Theory of Moral Sentiments, 1759.

The “invisible hand” metaphor by Adam Smith illustrates how properly designed incentive structures can guide self-interested individuals to inadvertently promote the greater social good. This concept is increasingly relevant in the realm of machine learning, as the scale of applications expands and the conflict of economic interests intensifies. For example, an Internet platform wants to estimate the ad revenues from serving different types of content, but it is up to the creators to decide what content to produce. While the platform seeks high-quality content to boost its long-term growth, creators may opt to minimize their production costs. This misalignment has prompted platforms to implement revenue-sharing models, fueling the growth of the creator economy, projected to exceed half a trillion by 2027 [16, 1, 25, 2]. However, current incentive models are inadequate, especially in light of their roles in exacerbating the proliferation of clickbait and misinformation online [54, 55, 31]. Moreover, this issue of misalignment extends well beyond content platforms. E-commerce sites rely on sufficient consumers experimenting with new products for accurate preference assessments. Gig platforms depend on freelance workers accepting tasks to gather essential operational data. Even recommender systems are paying users for their engagement in order to effectively optimize their algorithms [3]. In these cases, the learner’s hands are tied, and decision-makers interacting with the environment have their own objectives, dooming the system to under-exploration regardless of the learner’s objective. Hence, there is a pressing need to pursue formal treatments of incentive alignment problems between the learners and decision-makers and to design principled learning algorithms with statistical and computational efficiency guarantees.

Contributions. On the conceptual side, the presence of self-interested decision-makers challenges our common assumption in online learning, where a single learner controls all the interactions with the environment. This paper introduces the contractual reinforcement learning (RL) problem in the principal-agent Markov decision process (PAMDP), where we adopt the principal-agent model from contract theory [27, 24] to capture strategic interactions between the learner and the decision-maker. As illustrated in Figure 1, the learner (henceforth, principal/she) collects the rewards from the actions of the decision-maker (henceforth, agent/he). Without any incentive design, the agent simply optimizes his policy in a standard Markov decision process (MDP) based on his cost function. However, since the agent’s optimal policy is not necessarily in the principal’s best interest, the principal is motivated to properly incentivize the agent to act in her favor by designing contracts that specify the payment rules contingent on the realization of the next state. The core challenge in this design problem is the information asymmetry at two levels: (1) the principal cannot observe the agent’s action a priori and has to condition her payment on the probabilistic outcome of the action, a phenomenon known as moral hazard in economics; (2) the agent is far-sighted, in that he is willing to take suboptimal actions at one step in order to reach a more favorable state in future steps, which is a major barrier for theoretical analysis in multi-agent learning problems.

Figure 1: An illustration of contractual RL in the PAMDP.

On the technical side, this paper provides a comprehensive solution framework to address the unique learning and computational challenges when moral hazard meets far-sighted agency in contractual RL problems. In Section 2, we define state value functions for both the agent and principal, from which we derive a new class of Bellman equations to characterize the intricate correspondence between the principal and agent’s optimal policy. This leads to our Theorem 1, which shows that the principal’s optimal planning problem can be solved by a clean formulation of dynamic programming in polynomial time. The learning problem is more involved, so we begin with the contractual bandit learning problem (episode length $H=1$) in Section 3 to focus on the challenges from moral hazard. In particular, to achieve low regret, the principal’s learning algorithm must balance exploration and exploitation while continuously improving its estimation of the agent’s preferences to determine cost-efficient contracts. In Theorem 2, we construct a generic algorithm that reduces the learning problem into a standard online learning problem and an efficient search problem for the agent’s decision boundary. As a consequence, we are able to obtain sublinear regret guarantees under different setups, summarized in Table 1. The efficient search algorithm we designed for learning the outcome distribution difference in the simplex may be of interest for general use. With these insights, we delve into the full contractual RL problem in Section 4 and show a provably efficient learning algorithm under several technical assumptions in Theorem 3. Meanwhile, the general result highlights a trade-off between statistical and computational tractability, leaving an intriguing open question on the existence of the best-of-both-worlds solution.
The complexity of search is in logarithmic order yet with a large constant in the Markovian setup, and we expect an improved analysis by organically combining the search and exploration in the algorithm design.

Table 1: Regrets in Contractual RL. $\widetilde{O}$ omits logarithmic terms and other problem-specific constants.
Moral Hazard Far-sighted Agency Known Cost Regret
$\widetilde{O}(T^{1/2})$, Corollary 2.1
$\widetilde{O}(T^{2/3})$, Corollary 2.2
$\widetilde{O}(T^{1/2})$, Corollary 2.3
$\widetilde{O}(T^{1/2})$, Theorem 3

Related Work. Our problem is built upon the principal-agent model in contract theory, a crucial branch of economics [27, 48, 35]. Driven by an accelerating trend of contract-based markets deployed to Internet-based applications, the contract design problem has recently started to receive surging interest, especially from the computer science community [24, 28, 9, 20]. The principal-agent model has also been applied to the delegation of online search problems [13, 33] and machine learning tasks [45]. While these works focus on the computational aspects of contract design, we consider the adaptive design problem of the contract between learners and decision-makers in an initially unknown environment. For our learning problem, the dynamic (contextual) pricing problem [34, 41, 47, 39, 36] can be viewed as one of its special cases, where the contract is contingent on the agent’s binary action and the principal already knows her reward function. As we will see in Section 3, our algorithm is able to borrow some design insights from these pricing problems. Meanwhile, the online contract design problem begins as a variant of dynamic pricing [34] where the agent’s cost is stochastically (or adversarially) chosen, and the regret bound is $\Theta(\sqrt{T})$ (or $\Theta(T^{2/3})$ in the adversarial setup). Ho et al. [29] and Zhu et al. [57] consider a generalized model where the agent has multiple actions, and both the cost and reward of his actions are determined by the agent’s Bayesian type, which is unknown to the learner. This problem relates to the continuum-armed bandit problem [6], except the principal’s utility is not continuous, and Zhu et al. [57] show an almost tight, near-linear regret bound $\widetilde{\Theta}(T^{1-K/|\mathcal{S}|})$ for some constant $K$ and the number of outcomes $|\mathcal{S}|$. In comparison, our learning problem is closer to the standard contract design model, in which the agent type is observable by the principal (captured by the initial state or context), as many platforms hold a good amount of data on their users and content creators. More importantly, this modeling choice allows us to focus on solving the key challenges of learning and planning the optimal contract under moral hazard, where we are able to achieve $\widetilde{O}(\sqrt{T})$ regret for a large class of problems and $\widetilde{O}(T^{2/3})$ in general under mild assumptions. Lastly, several recent works [22, 21, 46] consider a simple special case of our problem, where there is no Markov state transition and the principal can directly incentivize the agent to take a certain action without the barrier of moral hazard. We defer further discussion of the related work to Appendix A.

2 Problem Formulation

2.1 The Principal-Agent Markov Decision Process

Let us first recall the standard reinforcement learning problem in a (finite-horizon) Markov decision process $(\mathcal{A}, \mathcal{S}, \{P_h, r_h\}_{h=1}^{H}, P_0)$, where we have the agent’s action space $\mathcal{A}$, the environment’s state space $\mathcal{S}$, the transition kernel $P_h : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, the expected reward function $r_h : \mathcal{S} \times \mathcal{A} \to [0,1]$, the initial state distribution $P_0 \in \Delta(\mathcal{S})$ and the horizon length $H$. The contractual reinforcement learning problem simply extends the MDP to a principal-agent Markov decision process $(\mathcal{A}, \mathcal{S}, \{P_h, r_h, c_h\}_{h=1}^{H}, P_0)$ with the additional cost function $c_h : \mathcal{S} \times \mathcal{A} \to [0,1]$. (Footnote 1: The $[0,1]$ range of the cost and reward functions is without loss of generality, due to constant shifting and rescaling, and thereby covers existing models [22, 21, 46] that assume a positive reward function for the agent.) In this process, the agent interacts with the environment by taking actions and bearing the costs, whereas the principal receives the reward from the environment. Unable to directly interact with the environment, the principal has to instead design and implement contracts to incentivize the agent to take actions in her interest. Below, we formalize the design of their policies.

Following from a standard MDP, the agent’s action policy $\bm{\pi} = \{\pi_h : \mathcal{S} \to \Delta(\mathcal{A})\}_{h=1}^{H}$ specifies that at each step $h$, given the state $s$, the agent would take the action $a \sim \pi_h(s)$. In the following subsection, we will discuss how the agent chooses his action policy and that it suffices to only consider deterministic action policies. Meanwhile, the principal’s contract policy $\bm{x} = \{x_h : \mathcal{S} \times \mathcal{S} \to \mathbb{R}_+\}_{h=1}^{H}$ is a sequence of non-liable payment rules $x_h$, where $x_h(s_h, s_{h+1})$ specifies the payment to the agent if the next state $s_{h+1}$ is realized, given the current state $s_h$ at the $h$-th step. The non-liability constraint ensures that the principal’s payment in the contract for any realization of the next state must be non-negative; the problem would otherwise degenerate with a trivially optimal solution for the principal (see e.g., [24]). Denote $\Pi, \mathcal{X}$ as the agent’s and principal’s policy spaces, respectively. Let $|\mathcal{S}| = S$, $|\mathcal{A}| = A$ and thus $|\Pi| = (SA)^{H}$.

The typical setting of the PAMDP problems can be summarized by the following steps. At the beginning of each episode, the initial state $s_1 \sim P_0$ is realized and observed by both the principal and the agent. Afterwards, the principal commits to a contract policy $\bm{x}$ and the agent accordingly chooses an action policy $\bm{\pi}$. Their interactions then proceed as follows at each step $h \in [H]$:

1. The agent takes an action $a_h \sim \pi_h(s_h)$ and bears the cost $c_h(s_h, a_h)$.
2. The next state $s_{h+1} \sim P_h(s_h, a_h)$ is realized and observed by both the principal and agent.
3. The principal receives a noisy reward $\iota_h(s_h, s_{h+1})$ and pays the agent $x_h(s_h, s_{h+1})$.
4. The principal observes the agent’s action $a_h$.

In this step, the principal’s utility is $\iota_h(s_h, s_{h+1}) - x_h(s_h, s_{h+1})$, her reward minus the payment to the agent, whereas the agent’s utility is $x_h(s_h, s_{h+1}) - c_h(s_h, a_h)$, the payment from the principal minus his cost. The reward noise has zero mean such that $r_h(s,a) = \mathop{\mathbf{E}}_{s' \sim P_h(s,a)} \iota_h(s, s')$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$.
We refer the readers to Appendix B.1 for a summary of notations and Appendix B.2 for a full discussion of our modeling choices.
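The interaction protocol above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not code from the paper: the function name `simulate_episode`, the array layouts, and the use of a deterministic agent policy are hypothetical choices, and the table `iota` holds the mean reward per transition, so the zero-mean reward noise is omitted.

```python
import numpy as np

def simulate_episode(P0, P, cost, iota, x, pi, rng):
    """Roll out one PAMDP episode (hypothetical array layout, 0-based steps).

    P0:   (S,)          initial state distribution
    P:    (H, S, A, S)  transition kernels, P[h, s, a] is a distribution over s'
    cost: (H, S, A)     agent costs c_h(s, a)
    iota: (H, S, S)     mean principal reward for the transition (s, s')
    x:    (H, S, S)     contract payments x_h(s, s'), required to be >= 0
    pi:   (H, S)        deterministic agent policy, pi[h, s] is an action index
    """
    H, S, A, _ = P.shape
    assert np.all(x >= 0), "payments must be non-negative (non-liability)"
    s = rng.choice(S, p=P0)              # initial state, seen by both parties
    principal_utility, agent_utility = 0.0, 0.0
    trajectory = []
    for h in range(H):
        a = pi[h, s]                           # step 1: agent acts and bears the cost
        s_next = rng.choice(S, p=P[h, s, a])   # step 2: next state is realized
        # step 3: principal collects the reward and pays according to the contract
        principal_utility += iota[h, s, s_next] - x[h, s, s_next]
        agent_utility += x[h, s, s_next] - cost[h, s, a]
        trajectory.append((s, a, s_next))      # step 4: action is observed afterwards
        s = s_next
    return principal_utility, agent_utility, trajectory
```

Note that the payment depends only on the pair $(s_h, s_{h+1})$ and never on the action, which is how the sketch reflects moral hazard.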

2.2 The Optimal Contract Policy

Without any contract design, the model reduces to a standard MDP $(\mathcal{A}, \mathcal{S}, \{P_h, c_h\}_{h=1}^{H}, P_0)$ for the agent and the principal passively collects the reward from the agent’s policy. This outcome could be suboptimal for both the principal and agent. Instead, by reshaping the agent’s reward environment through the design of the contract policy, the principal could induce the agent to adopt some action policy with higher social surplus. This motivates the problem of designing the optimal contract policy. We focus on a realistic yet challenging setup in the face of a long-lived, far-sighted and Bayesian rational agent who is also planning optimally for his cumulative reward; we expect the case of myopic agents can be worked out with a simpler approach. In particular, since the agent’s utility is not necessarily $0$ under the principal’s optimal contract at any state due to moral hazard, a far-sighted agent could take certain actions that are sub-optimal in the current step, yet steer him toward certain future states where he can obtain higher cumulative utility.

We extend the notions of value functions and optimal policies from MDP to PAMDP. Under any action policy $\bm{\pi}$ and contract policy $\bm{x}$, we define the principal’s state value function at the $h$-th step as

$$V_h^{\bm{x},\bm{\pi}}(s) := \mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H} \big( r_\tau(s_\tau, a_\tau) - x_\tau(s_\tau, s_{\tau+1}) \big) \,\Big|\, \{\pi_\tau\}_{\tau=h}^{H}, s_h = s\Big],$$

and the agent’s state value function at the $h$-th step as

$$U_h^{\bm{x},\bm{\pi}}(s) := \mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H} \big( x_\tau(s_\tau, s_{\tau+1}) - c_\tau(s_\tau, a_\tau) \big) \,\Big|\, \{\pi_\tau\}_{\tau=h}^{H}, s_h = s\Big],$$

where the expectations in both $V, U$ are with respect to the randomness of the trajectory (due to the stochasticity of state transitions and action policy). Let $V^{\bm{x},\bm{\pi}} := \mathop{\mathbf{E}}_{s \sim P_0} V_1^{\bm{x},\bm{\pi}}(s)$ and $U^{\bm{x},\bm{\pi}} := \mathop{\mathbf{E}}_{s \sim P_0} U_1^{\bm{x},\bm{\pi}}(s)$. The principal’s goal is to maximize her value $V^{\bm{x},\bm{\pi}}$, given the agent’s optimal response $\bm{\pi}$, which equivalently maximizes $V_1^{\bm{x},\bm{\pi}}(s)$ at any initial state $s$ with $P_0(s) > 0$. Hence, we define the principal’s optimal contract policy $\bm{x}^* = \{x^*_h\}_{h=1}^{H}$ and the corresponding optimal value function $V^*$ as the optimal solution and value of the following bi-level optimization problem. (Footnote 2: Throughout this paper, we assume the agent breaks ties in favor of the principal. This is without loss of generality in generic games, since the principal can force the tie-breaking by making an infinitesimally small additional payment to the action of her interest.)
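For a fixed pair of policies, both value functions can be computed by a backward recursion that follows from the definitions via the tower property of expectation. The sketch below is an illustration only; the function name `evaluate`, the array layouts, and the restriction to a deterministic action policy are assumptions made for the example.

```python
import numpy as np

def evaluate(P, r, cost, x, pi):
    """Evaluate V_h^{x,pi} and U_h^{x,pi} for a fixed contract and agent policy.

    P:    (H, S, A, S)  transition kernels
    r:    (H, S, A)     expected principal rewards r_h(s, a)
    cost: (H, S, A)     agent costs c_h(s, a)
    x:    (H, S, S)     contract payments x_h(s, s')
    pi:   (H, S)        deterministic agent policy (action indices)
    Returns V, U of shape (H + 1, S); the last row is the zero terminal value.
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    U = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        for s in range(S):
            a = pi[h, s]
            p = P[h, s, a]                 # distribution over next states
            expected_payment = p @ x[h, s]
            # principal: reward minus expected payment plus continuation value
            V[h, s] = r[h, s, a] - expected_payment + p @ V[h + 1]
            # agent: expected payment minus cost plus continuation value
            U[h, s] = expected_payment - cost[h, s, a] + p @ U[h + 1]
    return V, U
```

With 0-based step indices, the values $V^{\bm{x},\bm{\pi}}$ and $U^{\bm{x},\bm{\pi}}$ would then be obtained as `P0 @ V[0]` and `P0 @ U[0]` for an initial distribution `P0`.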

$$V^*, \bm{x}^* := \mathop{\mathrm{maxarg}}_{\bm{x} \in \mathcal{X}} V^{\bm{x}, \bm{\pi}^{\bm{x}}} \quad \text{s.t.} \quad \bm{\pi}^{\bm{x}} = \mathop{\mathrm{argmax}}_{\bm{\pi} \in \Pi} U^{\bm{x}, \bm{\pi}},$$ (2.1)

where “maxarg𝑚𝑎𝑥𝑎𝑟𝑔\mathop{maxarg}italic_m italic_a italic_x italic_a italic_r italic_g” is a convenient operator notation on an optimization problem that returns the optimal objective value followed by its optimal solution. For notational convenience, we will denote the agent’s optimal action policy in response to contract policy 𝝅𝝅\bm{\pi}bold_italic_π as 𝝅𝒙=argmax𝝅ΠU𝒙,𝝅superscript𝝅𝒙subscript𝑎𝑟𝑔𝑚𝑎𝑥𝝅Πsuperscript𝑈𝒙𝝅\bm{\pi}^{\bm{x}}=\mathop{argmax}_{\bm{\pi}\in\Pi}U^{\bm{x},\bm{\pi}}bold_italic_π start_POSTSUPERSCRIPT bold_italic_x end_POSTSUPERSCRIPT = start_BIGOP italic_a italic_r italic_g italic_m italic_a italic_x end_BIGOP start_POSTSUBSCRIPT bold_italic_π ∈ roman_Π end_POSTSUBSCRIPT italic_U start_POSTSUPERSCRIPT bold_italic_x , bold_italic_π end_POSTSUPERSCRIPT, and use shorthands Vh𝒙:=Vh𝒙,𝝅𝒙,Uh𝒙:=Uh𝒙,𝝅𝒙formulae-sequenceassignsuperscriptsubscript𝑉𝒙superscriptsubscript𝑉𝒙superscript𝝅𝒙assignsuperscriptsubscript𝑈𝒙superscriptsubscript𝑈𝒙superscript𝝅𝒙V_{h}^{\bm{x}}:=V_{h}^{\bm{x},\bm{\pi}^{\bm{x}}},U_{h}^{\bm{x}}:=U_{h}^{\bm{x}% ,\bm{\pi}^{\bm{x}}}italic_V start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_x end_POSTSUPERSCRIPT := italic_V start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_x , bold_italic_π start_POSTSUPERSCRIPT bold_italic_x end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT , italic_U start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_x end_POSTSUPERSCRIPT := italic_U start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_x , bold_italic_π start_POSTSUPERSCRIPT bold_italic_x end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT for the principal’s and agent’s value function under contract policy 𝒙𝒙\bm{x}bold_italic_x at the hhitalic_h-th step given that the agent responds optimally. Meanwhile, we denote 𝒙𝝅=argmax𝒙𝒳V𝒙,𝝅 s.t. 𝝅=argmax𝝅ΠU𝒙,𝝅superscript𝒙𝝅subscript𝑎𝑟𝑔𝑚𝑎𝑥𝒙𝒳superscript𝑉𝒙𝝅 s.t. 
𝝅subscript𝑎𝑟𝑔𝑚𝑎𝑥𝝅Πsuperscript𝑈𝒙𝝅\bm{x}^{\bm{\pi}}=\mathop{argmax}_{\bm{x}\in\mathcal{X}}V^{\bm{x},\bm{\pi}}% \text{ s.t. }\bm{\pi}=\mathop{argmax}_{\bm{\pi}\in\Pi}U^{\bm{x},\bm{\pi}}bold_italic_x start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT = start_BIGOP italic_a italic_r italic_g italic_m italic_a italic_x end_BIGOP start_POSTSUBSCRIPT bold_italic_x ∈ caligraphic_X end_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT bold_italic_x , bold_italic_π end_POSTSUPERSCRIPT s.t. bold_italic_π = start_BIGOP italic_a italic_r italic_g italic_m italic_a italic_x end_BIGOP start_POSTSUBSCRIPT bold_italic_π ∈ roman_Π end_POSTSUBSCRIPT italic_U start_POSTSUPERSCRIPT bold_italic_x , bold_italic_π end_POSTSUPERSCRIPT as the principal’s optimal contract policy to induce the agent’s action policy 𝝅𝝅\bm{\pi}bold_italic_π. We use similar shorthands Vh𝝅:=Vh𝒙𝝅,𝝅,Uh𝒙:=Uh𝒙𝝅,𝝅formulae-sequenceassignsuperscriptsubscript𝑉𝝅superscriptsubscript𝑉superscript𝒙𝝅𝝅assignsuperscriptsubscript𝑈𝒙superscriptsubscript𝑈superscript𝒙𝝅𝝅V_{h}^{\bm{\pi}}:=V_{h}^{\bm{x}^{\bm{\pi}},\bm{\pi}},U_{h}^{\bm{x}}:=U_{h}^{% \bm{x}^{\bm{\pi}},\bm{\pi}}italic_V start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT := italic_V start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_x start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT , bold_italic_π end_POSTSUPERSCRIPT , italic_U start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_x end_POSTSUPERSCRIPT := italic_U start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_italic_x start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT , bold_italic_π end_POSTSUPERSCRIPT for the principal’s and agent’s value function under contract policy 𝒙𝝅superscript𝒙𝝅\bm{x}^{\bm{\pi}}bold_italic_x start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT at the hhitalic_h-th step given that the agent responds optimally. 
Notably, since the optimization problem (2.1) hinges on the intricate correspondence between $\bm{x}$ and $\bm{\pi}$, it is unclear for now whether the principal can efficiently plan his optimal policy by adopting the standard approach in MDPs.

Solving for the Agent’s Optimal Policy. One key observation is that the correspondence between $\bm{\pi}$ and $\bm{x}$ has a clean characterization through the Bellman equation. Specifically, both functions $\{\pi_{h}^{\bm{x}},U_{h}^{\bm{x}}\}_{h=1}^{H}$ can be solved through backward induction with $U^{\bm{x}}_{H+1}(s)=0$:

$$\text{given } U^{\bm{x}}_{h+1},\qquad U^{\bm{x}}_{h}(s),\pi^{\bm{x}}_{h}(s)=\mathop{maxarg}_{a\in\mathcal{A}}P_{h}(s,a)\cdot[x_{h}(s)+U^{\bm{x}}_{h+1}]-c_{h}(s,a). \tag{2.2}$$

Notice that since $\pi^{\bm{x}}_{h}(s)$ is a maximizer of a linear function, the agent’s best-responding policy $\pi_{h}$ is deterministic without loss of generality. With $\bm{\pi}^{\bm{x}}$, the principal’s value function under $\bm{x}$ can also be computed iteratively from $V^{\bm{x}}_{H+1}(s)=0$:

$$\text{given } V^{\bm{x}}_{h+1},\qquad V_{h}^{\bm{x}}(s)=r_{h}(s,\pi_{h}^{\bm{x}}(s))+P_{h}(s,\pi_{h}^{\bm{x}}(s))\cdot[V^{\bm{x}}_{h+1}-x_{h}(s)]. \tag{2.3}$$

Due to the space limit, we only solve for the agent’s best response $\bm{\pi}^{\bm{x}}$ at any given $\bm{x}$. We refer the reader to Appendix B.3 for the more involved formulation that solves for the optimal policy $\bm{x}^{\bm{\pi}}$ for any given $\bm{\pi}$.
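The two recursions (2.2) and (2.3) can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the nested-list layout `P[h][s][a][s']`, the function names, and the toy two-state instance are all assumptions made here, and ties in the agent's maximization are broken toward the lowest action index.

```python
def best_response(P, c, x):
    """Backward induction for Eq. (2.2).

    Returns (pi, U), where pi[h][s] is the agent's action and U[h][s] his value.
    Assumed layout: P[h][s][a] is a distribution over next states, x[h][s] is a
    payment vector indexed by next state.
    """
    H, S = len(P), len(P[0])
    U = [[0.0] * S for _ in range(H + 1)]  # row H plays the role of U_{H+1} = 0
    pi = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            def w(a):
                # P_h(s,a) . [x_h(s) + U_{h+1}] - c_h(s,a)
                return sum(p * (x[h][s][s2] + U[h + 1][s2])
                           for s2, p in enumerate(P[h][s][a])) - c[h][s][a]
            a_best = max(range(len(P[h][s])), key=w)  # lowest index wins ties
            pi[h][s], U[h][s] = a_best, w(a_best)
    return pi, U


def principal_value(P, r, x, pi):
    """Backward recursion for Eq. (2.3) given the agent's policy pi."""
    H, S = len(P), len(P[0])
    V = [[0.0] * S for _ in range(H + 1)]  # row H plays the role of V_{H+1} = 0
    for h in reversed(range(H)):
        for s in range(S):
            a = pi[h][s]
            V[h][s] = r[h][s][a] + sum(p * (V[h + 1][s2] - x[h][s][s2])
                                       for s2, p in enumerate(P[h][s][a]))
    return V


# Hypothetical instance: action 0 ("shirk") is free, action 1 ("work") costs 0.3
# and makes next state 1 more likely; the contract pays 0.6 on next state 1.
H, S = 2, 2
P = [[[[0.8, 0.2], [0.2, 0.8]] for _ in range(S)] for _ in range(H)]
c = [[[0.0, 0.3] for _ in range(S)] for _ in range(H)]
r = [[[0.0, 1.0] for _ in range(S)] for _ in range(H)]
x = [[[0.0, 0.6] for _ in range(S)] for _ in range(H)]

pi, U = best_response(P, c, x)
V = principal_value(P, r, x, pi)
```

On this toy instance the agent works at every step, since the payment spread of 0.6 outweighs the cost difference of 0.3.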

Value Decomposition. Another key observation is that the value functions can be decomposed into parts that depend only on the agent’s action policy. This is analogous to standard contract design, where the principal’s and agent’s utilities sum up to the social surplus, i.e., the difference between the reward and cost of the agent’s action. Here, let the principal’s expected reward and the agent’s expected cost from the $h$-th step onward be

$$R_{h}^{\bm{\pi}}(s):=\mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H}r_{\tau}(s_{\tau},a_{\tau})\,\Big|\,\{\pi_{\tau}\}_{\tau=h}^{H},s_{h}=s\Big],\quad C_{h}^{\bm{\pi}}(s):=\mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H}c_{\tau}(s_{\tau},a_{\tau})\,\Big|\,\{\pi_{\tau}\}_{\tau=h}^{H},s_{h}=s\Big].$$

By linearity of expectation, for any policies $\bm{x}\in\mathcal{X},\bm{\pi}\in\Pi$, at any state $s$ of any step $h$, we have

$$V_{h}^{\bm{x},\bm{\pi}}(s)=R_{h}^{\bm{\pi}}(s)-C_{h}^{\bm{\pi}}(s)-U_{h}^{\bm{x},\bm{\pi}}(s).$$

Both functions $R,C$ are determined by the agent’s action policy $\bm{\pi}$, regardless of the contract policy $\bm{x}$. In addition, $C_{h}^{\bm{\pi}}(s)+U_{h}^{\bm{x},\bm{\pi}}(s)$ captures the total expected payment transferred from the principal to the agent from the $h$-th step onward at state $s$, since the agent’s utility is the payment received minus the cost incurred. Since the total reward is fixed under any given $\bm{\pi}$, the principal’s value is maximized under the minimal total payment, $\zeta_{h}^{\bm{\pi}}(s):=C_{h}^{\bm{\pi}}(s)+U_{h}^{\bm{\pi}}(s)$. The function $\zeta_{h}^{\bm{\pi}}$ thus serves as the equivalent optimization objective in the least-payment Bellman equation in Appendix B.3.
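For completeness, the decomposition can be verified by a one-step induction. The sketch below assumes that, for a fixed deterministic action policy $\bm{\pi}$, the value functions $V^{\bm{x},\bm{\pi}}$ and $U^{\bm{x},\bm{\pi}}$ satisfy recursions of the same form as (2.3) and (2.2) with $a=\pi_{h}(s)$ in place of the maximizer; the shorthand $a$ is introduced here only for readability.

```latex
\begin{align*}
V_{h}^{\bm{x},\bm{\pi}}(s) + U_{h}^{\bm{x},\bm{\pi}}(s)
 &= r_{h}(s,a) + P_{h}(s,a)\cdot\big[V_{h+1}^{\bm{x},\bm{\pi}} - x_{h}(s)\big]
  + P_{h}(s,a)\cdot\big[x_{h}(s) + U_{h+1}^{\bm{x},\bm{\pi}}\big] - c_{h}(s,a) \\
 &= r_{h}(s,a) - c_{h}(s,a)
  + P_{h}(s,a)\cdot\big[V_{h+1}^{\bm{x},\bm{\pi}} + U_{h+1}^{\bm{x},\bm{\pi}}\big] \\
 &= r_{h}(s,a) - c_{h}(s,a)
  + P_{h}(s,a)\cdot\big[R_{h+1}^{\bm{\pi}} - C_{h+1}^{\bm{\pi}}\big]
  \qquad \text{(induction hypothesis)} \\
 &= R_{h}^{\bm{\pi}}(s) - C_{h}^{\bm{\pi}}(s).
\end{align*}
```

The payment terms $\pm P_{h}(s,a)\cdot x_{h}(s)$ cancel at every step, which is why the sum of the two values does not depend on the contract policy. The base case holds because all four functions vanish at step $H+1$.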

Solving for the Optimal Contract Policy. With the two observations above, it is clear that the principal’s optimal value $V^{*}=\mathop{max}_{\bm{\pi}\in\Pi}V^{\bm{\pi}}$, together with the corresponding optimal policy, can be determined by computing $V^{\bm{\pi}}$ for every $\bm{\pi}$ according to the least-payment Bellman equation in Appendix B.3. However, solving this maximization problem by enumeration is intractable, as there are exponentially many possible $\bm{\pi}$. Instead, we have to interleave the process of solving for the optimal policy with least payment and maximum reward. This enables the following construction of a bi-level backward induction that iteratively solves for the optimal contract policy $\bm{x}^{*}$.

Theorem 1 (Bellman Equations of PAMDP).

The optimal contract policy can be solved by dynamic programming in polynomial time, from $h=H$ to $1$ with $U^{*}_{H+1}(s),V^{*}_{H+1}(s)=0,\forall s\in\mathcal{S},a\in\mathcal{A}$,

$$\begin{aligned}
W^{*}_{h}(s,a;x) &= P_{h}(s,a)\cdot[x+U^{*}_{h+1}]-c_{h}(s,a), \\
x^{*}_{h}(s;a) &= \mathop{argmin}_{x:\mathcal{S}\to\mathbb{R}_{+}}\left\{P_{h}(s,a)\cdot x\mid W^{*}_{h}(s,a;x)\geq W^{*}_{h}(s,a^{\prime};x),\forall a^{\prime}\in\mathcal{A}\right\}, \\
Q^{*}_{h}(s,a) &= r_{h}(s,a)+P_{h}(s,a)\cdot[V^{*}_{h+1}-x^{*}_{h}(s;a)], \\
V_{h}^{*}(s),\pi^{*}_{h}(s) &= \mathop{maxarg}_{a\in\mathcal{A}}Q^{*}_{h}(s,a),\quad x^{*}_{h}(s)=x^{*}_{h}(s;\pi^{*}_{h}(s)),\quad U_{h}^{*}(s)=W^{*}_{h}(s,\pi^{*}_{h}(s);x^{*}_{h}(s)).
\end{aligned} \tag{2.4}$$

To interpret the Bellman equations above, $x^{*}_{h}(s;a)$ denotes the contract with the least payment that induces the agent to take action $a$ at state $s$ in step $h$. Given that $\pi^{*}_{h}(s)$ is the best agent action for the principal to induce, the optimal contract at state $s$ in step $h$ can be determined as $x^{*}_{h}(s)=x^{*}_{h}(s;\pi^{*}_{h}(s))$. $Q_{h}^{*}(s,a)$ and $W_{h}^{*}(s,a;x)$ are respectively the principal’s and agent’s total expected utility from the $h$-th step under policies $\{x^{*}_{\tau}\}_{\tau=h+1}^{H}$ and $\{\pi^{*}_{\tau}\}_{\tau=h+1}^{H}$, which can be viewed as their optimal state-action value functions at the $h$-th step, serving as intermediate variables for the computation. See Appendix B.4 for the proof of correctness.
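The bi-level structure of Theorem 1 can be illustrated with a small sketch. Note the substitution: the inner least-payment problem is a linear program in the theorem, whereas the code below replaces it with a brute-force search over a discretized contract grid, which is exponential in the number of states and only suitable for toy instances. The function name, the data layout `P[h][s][a][s']`, the grid, and the tolerance-based tie-breaking in favor of the principal are all assumptions of this sketch.

```python
from itertools import product


def solve_pamdp_grid(P, r, c, grid, tol=1e-12):
    """Bi-level backward induction with a grid search standing in for the LP.

    Assumes 0.0 is in `grid`, so that at least one action is always inducible.
    Returns (V, U, pi, xs) indexed as [h][s].
    """
    H, S = len(P), len(P[0])
    U = [[0.0] * S for _ in range(H + 1)]  # U*_{H+1} = 0
    V = [[0.0] * S for _ in range(H + 1)]  # V*_{H+1} = 0
    pi = [[None] * S for _ in range(H)]
    xs = [[None] * S for _ in range(H)]
    contracts = list(product(grid, repeat=S))  # candidate payment vectors
    for h in reversed(range(H)):
        for s in range(S):
            actions = range(len(P[h][s]))

            def W(a, x):  # agent's state-action value W*_h(s, a; x)
                return sum(p * (x[s2] + U[h + 1][s2])
                           for s2, p in enumerate(P[h][s][a])) - c[h][s][a]

            def pay(a, x):  # expected payment P_h(s, a) . x
                return sum(p * x[s2] for s2, p in enumerate(P[h][s][a]))

            best = None
            for a in actions:
                # Inner level: cheapest contract on the grid that makes a a best
                # response (indifference is resolved in the principal's favor).
                feasible = [x for x in contracts
                            if all(W(a, x) >= W(b, x) - tol for b in actions)]
                if not feasible:
                    continue
                x_a = min(feasible, key=lambda x: pay(a, x))
                # Outer level: principal's value Q*_h(s, a) of inducing a.
                q = r[h][s][a] + sum(p * (V[h + 1][s2] - x_a[s2])
                                     for s2, p in enumerate(P[h][s][a]))
                if best is None or q > best[0]:
                    best = (q, a, x_a)
            q, a, x_a = best
            V[h][s], pi[h][s], xs[h][s] = q, a, x_a
            U[h][s] = W(a, x_a)
    return V, U, pi, xs


# Hypothetical two-state, two-action, two-step instance.
H, S = 2, 2
P = [[[[0.8, 0.2], [0.2, 0.8]] for _ in range(S)] for _ in range(H)]
c = [[[0.0, 0.3] for _ in range(S)] for _ in range(H)]
r = [[[0.0, 1.0] for _ in range(S)] for _ in range(H)]
grid = [i / 10 for i in range(11)]

V, U, pi, xs = solve_pamdp_grid(P, r, c, grid)
```

In this instance the incentive constraint for the costly action reduces to a payment spread of at least 0.5 between the two next states, so the search settles on the contract (0, 0.5) at every step.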

2.3 The Contractual Reinforcement Learning Problem

We now introduce the reinforcement learning problem in PAMDP, where the principal acts as the learner and seeks to adaptively improve its contract policy by interacting with the agent. Following the online learning convention, we use the expected regret to evaluate the learning performance in $T$ episodes, $\operatorname{Reg}(T):=\sum_{t=1}^{T}\big(V^{*}-V^{\bm{x}^{t}}\big),$ where $\bm{x}^{t}$ is the principal’s contract policy in the $t$-th episode.

This paper makes a few assumptions for the analysis of reinforcement learning problems. First, the far-sighted agent has perfect knowledge of his cost function and the state transition kernel $\{P_{h},c_{h}\}_{h\in[H]}$ such that he can always choose the best response. This is realistic because, in the applications of our interest, agents are experts in their fields (e.g., content creators, freelance workers, ride-sharing drivers) who have learned about the environment sufficiently well, whereas the principal, as the system designer, has not. Second, the agent at time $t$ is assumed to best respond to $\bm{x}^{t}$. This can be equivalently interpreted as the agent at each time $t$ showing up only once. This is motivated by the reality of Internet applications, where each individual agent’s participation accounts for only a negligible portion of the system’s traffic and hence has little influence over the entire system’s learning policy, so the best response (regardless of the learning policy) is optimal for each individual. Third, we assume that the design space of contracts is restricted to $\{x_{h}:\mathcal{S}\times\mathcal{S}\to[0,\eta]\}$ at any step $h\in[H]$. This reflects the practical concern of contract design under randomness: while contracts with bounded payments may sacrifice optimality, they regularize the variance in the payment transfer and reduce the risk for both the principal and agents.
Moreover, as long as the environment parameters have finite precision, the parameter $\eta$ can be matched to the finite bit complexity of the optimal contract. Though this assumption is without loss of generality from a modeling perspective, we expect future work to develop tighter analysis techniques to relax the dependency on $\eta$. For other regularity assumptions necessary to obtain tractable complexity results, we defer to the technical sections.

3 Warm-up: Solving the Contractual Bandit Learning Problem

In this section, we consider an important special case of the contractual reinforcement learning problem with $H=1$, which allows us to first focus on the learning challenge from moral hazard without the concern of far-sighted agency. We refer to this problem as the contractual bandit learning problem. Below, we first describe the contractual bandit learning problem with much simplified notation, since it suffices to omit the current state and the time step in the subscripts given that $H=1$. We then showcase a generic analysis of the statistical complexity of the contractual bandit learning problem.

3.1 The Contractual Bandit Learning Problem

In this setup, the agent’s policy space is simply its action space $\mathcal{A}$, i.e., the set of bandit arms. $P:\mathcal{A}\to\Delta(\mathcal{S})$ specifies an outcome distribution for each action, where the outcome space $\mathcal{S}$ could naturally capture the reward stochasticity of each arm in bandit learning problems. The principal designs the contract $x:\mathcal{S}\to\mathbb{R}_{+}$, contingent on the outcome space $\mathcal{S}$, to influence the agent’s choice of action. The principal’s reward and agent’s cost are both functions of the agent’s action, $r,c:\mathcal{A}\to[0,1]$. We consider a contractual bandit learning problem with $T$ rounds. At the beginning of each round $t$, the principal commits to a contract $x_{t}$ and interacts with the agent as follows:

1. The agent takes the action $a_{t}$.
2. The outcome $s_{t}\sim P(a_{t})$ is realized and observed by both the principal and agent.
3. The principal receives the noisy reward $\iota(s_{t})$ and pays the agent $x_{t}(s_{t})$.
4. The principal observes the agent’s action $a_{t}$.

Here, the noisy reward function satisfies $\mathop{\mathbf{E}}_{s\sim P(a_{t})}\iota(s)=r(a_{t})$, and we assume the agent’s action always maximizes his expected utility, i.e., $a_{t}\in\mathop{argmax}_{a\in\mathcal{A}}\{P(a)\cdot x_{t}-c(a)\}$. To determine the principal’s optimal contract, let us recall the notion of least payment function from the general setup.
We similarly define $\zeta:\mathcal{A}\to\mathbb{R}_{+}$ such that for any given action $a\in\mathcal{A}$, it outputs the least expected payment necessary to induce $a$, $\zeta(a):=\mathop{min}_{x\in\mathcal{X}^{a}}P(a)\cdot x,$ where $\mathcal{X}^{a}=\{x\in\mathcal{X}:[P(a)-P(a^{\prime})]\cdot x\geq c(a)-c(a^{\prime}),\forall a^{\prime}\neq a\}$ denotes the set of all contracts under which the agent would respond with action $a$.
Hence, the principal can determine the optimal action to induce, $a^{*}=\mathop{argmax}_{a\in\mathcal{A}}\sum_{t=1}^{T}[r(a)-\zeta(a)]$, with the optimal contract $x^{*}=\mathop{argmin}_{x\in\mathcal{X}^{a^{*}}}P(a^{*})\cdot x$.
With the benchmark of the optimal contract $x^{*}$ that induces $a^{*}$ with the least payment $\zeta(a^{*})$, we can measure the learning performance in $T$ rounds with the expected regret as follows, $\operatorname{Reg}(T)=\mathop{max}_{a^{*}\in\mathcal{A}}\sum_{t=1}^{T}\big[r(a^{*})-\zeta(a^{*})\big]-\sum_{t=1}^{T}\big[r(a_{t})-P(a_{t})\cdot x_{t}\big].$ This problem is a strict generalization of standard online learning, as it degenerates to the standard notion of regret when $\zeta(a)=0,\forall a\in\mathcal{A}$.
However, with the additional $\zeta$ function, the no-regret learner must not only obtain good estimates of both $r$ and $\zeta$ to identify the optimal action, but also implement contracts that induce the optimal action with expected payment approaching $\zeta$.
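The interaction protocol and the regret definition above can be sketched as follows. This is an illustrative toy, not the paper's learning algorithm: the function names, the two-arm instance, and the sequence of contracts are hypothetical, and the values of $\zeta$ are solved by hand for this instance (inducing arm 1 requires a payment spread of at least 0.5, so its least expected payment is $0.8\times 0.5=0.4$).

```python
import random


def respond(P, c, x):
    """Agent's best response: argmax_a P(a).x - c(a), lowest index on ties."""
    return max(range(len(P)),
               key=lambda a: sum(p * x[s] for s, p in enumerate(P[a])) - c[a])


def play_round(P, c, iota, x, rng):
    """Steps 1-4 of the protocol; returns (a_t, s_t, reward, payment)."""
    a_t = respond(P, c, x)                                    # step 1
    s_t = rng.choices(range(len(P[a_t])), weights=P[a_t])[0]  # step 2
    return a_t, s_t, iota[s_t], x[s_t]                        # steps 3 and 4


def expected_regret(P, r, c, zeta, contracts):
    """Reg(T) for a fixed sequence of contracts, with zeta supplied by the caller."""
    benchmark = max(r[a] - zeta[a] for a in range(len(P)))
    total = 0.0
    for x in contracts:
        a_t = respond(P, c, x)
        expected_payment = sum(p * x[s] for s, p in enumerate(P[a_t]))
        total += benchmark - (r[a_t] - expected_payment)
    return total


# Hypothetical instance with two arms and two outcomes.
P = [[0.8, 0.2], [0.2, 0.8]]
c = [0.0, 0.3]
iota = [0.0, 1.0]
r = [sum(p * iota[s] for s, p in enumerate(P[a])) for a in range(2)]  # E[iota(s)]
zeta = [0.0, 0.4]  # solved by hand for this instance

contracts = [[0.0, 0.0], [0.0, 0.6], [0.0, 0.55]]
reg = expected_regret(P, r, c, zeta, contracts)
a_t, s_t, reward, payment = play_round(P, c, iota, [0.0, 0.6], random.Random(0))
```

The first contract pays nothing and induces the free arm, while the other two induce arm 1 but overpay relative to $\zeta(1)$; both kinds of loss contribute to the regret.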

A Simpler Case with Direct Incentives. We remark that a special case of the contractual bandit learning problem assumes the principal is able to design her contract contingent on the agent’s action. This enables the principal to implement any payment rule $x:\mathcal{A}\to\mathbb{R}_{+}$, and the agent responds with his optimal action $a^{*}=\mathop{argmax}_{a\in\mathcal{A}}x(a)-c(a)$. With this relaxation, $\zeta=c$, since the optimal $x$ to induce any action $a$ is to set a direct incentive with $x(a)=c(a),x(a^{\prime})=0,\forall a^{\prime}\neq a$.
The expected regret reduces to $\operatorname{Reg}(T)=\mathop{max}_{a^{*}\in\mathcal{A}}\sum_{t=1}^{T}\big[r(a^{*})-c(a^{*})\big]-\sum_{t=1}^{T}\big[r(a_{t})-x_{t}(a_{t})\big].$ As we will see in this paper, the learning problem becomes more tractable in this setup, since the principal can directly learn the cost function $c$ to determine the least payment to induce each action. In Appendices C.2 and C.3, we showcase the multi-armed bandits and linear bandits under direct incentives, both of which have been recently studied by Scheid et al. [46].
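A small sketch of the direct-incentive contract makes the role of tie-breaking visible. With $x(a)=c(a)$ the agent is exactly indifferent between the target action and any zero-cost alternative, so the sketch assumes ties are resolved in favor of the principal's preferred action, or alternatively adds a small bonus `eps`. The helper names, the `prefer` argument, and the cost vector are hypothetical.

```python
def direct_incentive(c, a, eps=0.0):
    """Pay c[a] (+ eps) when action a is taken and nothing otherwise."""
    return [c[a] + eps if b == a else 0.0 for b in range(len(c))]


def agent_choice(x, c, prefer=None):
    """argmax_a x(a) - c(a); among exact ties pick `prefer` if it is one of them."""
    utils = [x[a] - c[a] for a in range(len(c))]
    best = max(utils)
    ties = [a for a, u in enumerate(utils) if u == best]
    return prefer if prefer in ties else ties[0]


c = [0.0, 0.3, 0.5]                 # hypothetical cost vector
x = direct_incentive(c, 2)          # pays 0.5 on action 2 only
chosen_tie = agent_choice(x, c, prefer=2)                           # tie with action 0
chosen_eps = agent_choice(direct_incentive(c, 2, eps=0.01), c)      # strict preference
chosen_default = agent_choice(x, c)                                 # no favorable tie-breaking
```

Without favorable tie-breaking, the agent in this example falls back to the free action 0, which is why a learner that only estimates $c$ would typically pay slightly above its estimate.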

3.2 A Generic Approach to Contractual Bandit Learning

We begin with a natural assumption that enables us to employ existing techniques in online learning to obtain tractable complexity results for a large class of contractual bandit learning problems.

Assumption 1 ($\lambda$-Inducibility).

For any action $a\in\mathcal{A}$, there exists an event $e\in\{0,1\}^{S}$ as a distribution of outcomes such that $[P(a)-P(a^{\prime})]\cdot e\geq\lambda,\ \forall a^{\prime}\neq a$.

This assumption ensures the regularity of the problem instance in the sense that each action is dominantly capable of inducing a set of outcomes over others, so that for any cost function $c$ and any action $a$, there exists a contract $x$ to induce $a$. To see this, one can explicitly construct such a contract as $x=e\cdot\max_{a^{\prime}}\frac{c(a)-c(a^{\prime})}{\lambda}$, where $e$ is the event such that $[P(a)-P(a^{\prime})]\cdot e\geq\lambda,\ \forall a^{\prime}\neq a$. Otherwise, if $\lambda\leq 0$, then there could be some action that is never the agent’s best response under any contract.
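
The explicit construction can be checked numerically. The sketch below builds the contract $x=e\cdot\max_{a^{\prime}}\frac{c(a)-c(a^{\prime})}{\lambda}$ and compares the agent's utilities. The two-action, two-outcome instance and all function names are hypothetical; the agent is assumed to maximize expected payment minus cost, with ties broken in the principal's favor.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def inducibility_margin(P: List[List[float]], a: int, e: Sequence[int]) -> float:
    """min over a' != a of [P(a) - P(a')] . e for a candidate event e."""
    return min(
        dot([pa - pb for pa, pb in zip(P[a], P[b])], e)
        for b in range(len(P)) if b != a
    )


def inducing_contract(c: Sequence[float], a: int, e: Sequence[int],
                      lam: float) -> List[float]:
    """x = e * max_{a'} (c(a) - c(a')) / lam; the max includes a' = a, so scale >= 0."""
    scale = max((c[a] - cb) / lam for cb in c)
    return [scale * ei for ei in e]


def agent_utility(P: List[List[float]], c: Sequence[float],
                  x: Sequence[float], a: int) -> float:
    """Expected payment under outcome distribution P[a] minus the cost c[a]."""
    return dot(P[a], x) - c[a]


# Hypothetical instance: rows of P are outcome distributions of actions 0 and 1.
P = [[0.8, 0.2], [0.3, 0.7]]
c = [0.0, 0.4]
e = [0, 1]                          # event "outcome 1 occurs"
lam = inducibility_margin(P, 1, e)  # about 0.5 for this instance
x = inducing_contract(c, 1, e, lam)
print(x, agent_utility(P, c, x, 1), agent_utility(P, c, x, 0))
```

In this instance the agent is left exactly indifferent between the two actions, which is why the robust contract sets introduced next add a strictly positive margin.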

We now propose a generic approach to designing statistically efficient algorithms for the contractual bandit learning problem. The key idea of our approach is to decouple the learning of the contract from the learning of the optimal action. In particular, let us first assume an oracle that is able to construct a robust contract set, in the sense of Definition 1, for each action $a\in\mathcal{A}$, despite the uncertainty in parameter estimation. We use the robust contract set to determine the optimistic action and eventually learn the optimal action with no regret. This enables us to decompose the sample complexity result into the estimation errors from the optimal contract and from the optimal action, according to Theorem 2.

Definition 1 ($\varepsilon$-margin Contract Set).

We define the $\varepsilon$-margin contract set for each action $a\in\mathcal{A}$ as

$$\mathcal{X}^{a}(\varepsilon)=\{x\in\mathcal{X}:[P(a)-P(a^{\prime})]\cdot x\geq c(a)-c(a^{\prime})+\varepsilon,\ \forall a^{\prime}\neq a\}.$$
Theorem 2.

Under Assumption 1, with an $\varepsilon$-margin contract set for every action $a\in\mathcal{A}$, there is a generic algorithm with regret $\widetilde{O}(\eta\sqrt{T}+T\varepsilon/\lambda)$ for the contractual bandit learning problems.

The key step of the proof is Lemma 1, which shows the contracts solved from LP (C.1) have bounded suboptimality from the least payment contract (both in estimation and in execution) depending on parameter estimation error $\epsilon$ and the robustness margin $\varepsilon$. This allows us to simply adopt an upper confidence bound argument to bound the regret. See Appendix C.1 for the full proof and the construction of the generic algorithm. The rationale behind Theorem 2 is to separate the learning of the contract sets from the learning of the optimal action. In particular, the learning and construction procedure of such contract sets has been a well-established problem in variants of Stackelberg games [37, 43]. We abstract this problem into the design of a $\chi(\varepsilon)$-learning procedure defined below.

Definition 2 ($\chi(\varepsilon)$-Learning Procedure).

For a $\chi(\varepsilon)$-learning procedure, after any $\chi(\varepsilon)$ number of rounds, it can construct a robust contract set $\widehat{\mathcal{X}}^{a},\ \forall a\in\mathcal{A}$ such that $\mathcal{X}^{a}(\varepsilon)\subseteq\widehat{\mathcal{X}}^{a}\subseteq\mathcal{X}^{a}$.
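
Membership in the contract sets of Definitions 1 and 2 amounts to checking finitely many linear inequalities. The following sketch tests whether a contract lies in $\mathcal{X}^{a}(\varepsilon)$ when $P$ and $c$ are known exactly; taking $\varepsilon=0$ is used here as a stand-in for $\mathcal{X}^{a}$, and the function name and toy instance are assumptions for illustration.

```python
from typing import List, Sequence


def in_margin_set(P: List[List[float]], c: Sequence[float], a: int,
                  x: Sequence[float], eps: float = 0.0) -> bool:
    """Check [P(a) - P(a')] . x >= c(a) - c(a') + eps for every a' != a."""
    for b in range(len(c)):
        if b == a:
            continue
        gain = sum((pa - pb) * xi for pa, pb, xi in zip(P[a], P[b], x))
        if gain < c[a] - c[b] + eps:
            return False
    return True


# Hypothetical instance: two actions, two outcomes.
P = [[0.8, 0.2], [0.3, 0.7]]
c = [0.0, 0.4]
x = [0.0, 1.0]  # pays 1 on outcome 1 only

# A contract with margin 0.05 also satisfies the margin-free constraints,
# but the same contract does not have margin 0.2.
print(in_margin_set(P, c, 1, x, eps=0.05),
      in_margin_set(P, c, 1, x, eps=0.0),
      in_margin_set(P, c, 1, x, eps=0.2))
```

A learning procedure in the sense of Definition 2 has to produce a set sandwiched between these two, using estimates in place of the true $P$ and $c$.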

Based on the concept in Definition 2, an immediate implication of Theorem 2 is that if there is an $O(1/\varepsilon)$-learning procedure, a simple “prepare-then-commit” style algorithm can achieve $\widetilde{O}(\sqrt{T})$ regret in the contractual bandit problem. That is, it first prepares for a warm start by running the learning procedure for $T^{1/2}$ rounds to obtain the $O(T^{-1/2})$-margin contract sets, then commits to following Algorithm 2 for the remaining $T-T^{1/2}$ rounds. Furthermore, using the standard doubling trick [15], we can convert the “prepare-then-commit” style algorithm into an anytime algorithm with the same $\widetilde{O}(\sqrt{T})$ regret guarantee that is agnostic to the time horizon $T$ during its construction. Therefore, the difficulty of solving the contractual bandit learning problem hinges on the statistical efficiency of the learning procedure, which heavily depends on the problem structure.
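
The bookkeeping behind the doubling trick can be sketched as follows. This is a schematic schedule only, under the assumption that each epoch of length $n$ restarts the prepare-then-commit routine with about $\sqrt{n}$ warm-start rounds; it is not the construction analyzed in [15] or in the appendix, and the function name is hypothetical.

```python
import math
from typing import List, Tuple


def doubling_schedule(total_rounds: int) -> List[Tuple[int, int]]:
    """Split the horizon into epochs of length 1, 2, 4, ... (last one truncated).

    Returns (epoch_length, warm_start_rounds) pairs, where the warm start
    takes ceil(sqrt(epoch_length)) rounds (an assumption of this sketch).
    """
    schedule = []
    used, k = 0, 0
    while used < total_rounds:
        length = min(2 ** k, total_rounds - used)
        warm = math.ceil(math.sqrt(length))
        schedule.append((length, warm))
        used += length
        k += 1
    return schedule


schedule = doubling_schedule(1000)
total_warm = sum(w for _, w in schedule)
print(schedule)
print(total_warm, math.sqrt(1000))
```

Because the epoch lengths grow geometrically, the warm-start rounds summed over epochs stay within a constant factor of $\sqrt{T}$, so the schedule never needs to know $T$ in advance.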

Solving Bandit Problems under Direct Incentives. As a direct application of Theorem 2, we show that an $O(1/\varepsilon)$-learning procedure can be constructed for the two bandit problems under direct incentives, which thus admit online learning algorithms with $O(\sqrt{T})$ regret. The construction of the efficient search algorithm essentially relies on a binary search for the cost of each arm. In addition, the binary search algorithm can be generalized to cases with infinitely many arms. This problem is known as contextual search, and recent work [38] has established clean solutions with nearly optimal performance. We defer the detailed construction and proofs to Appendices C.2 and C.3.
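
As an illustration of the binary search idea, the sketch below locates the cost of a single arm from accept/reject feedback. It is a simplified stand-in for the constructions in Appendices C.2 and C.3 and rests on assumptions not stated in the text: costs lie in $[0,1]$, the agent has a zero-cost outside option, the principal pays only on the target arm, and ties are broken in the principal's favor. All names are hypothetical.

```python
from typing import Callable, Tuple


def make_agent(cost: float) -> Callable[[float], bool]:
    """Hypothetical agent: plays the target arm iff payment - cost >= 0."""
    return lambda payment: payment - cost >= 0.0


def search_cost(plays_target: Callable[[float], bool],
                upper: float = 1.0, eps: float = 1e-3) -> Tuple[float, int]:
    """Binary search for the least payment inducing the target arm.

    Keeps the invariant that `hi` induces the arm while payments below
    `lo` do not; returns a payment exceeding the true cost by at most eps,
    together with the number of rounds used.
    """
    lo, hi = 0.0, upper
    rounds = 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if plays_target(mid):
            hi = mid   # mid is enough: the cost is at most mid
        else:
            lo = mid   # mid is too low: the cost exceeds mid
        rounds += 1
    return hi, rounds


payment, rounds = search_cost(make_agent(0.37))
print(payment, rounds)
```

Each query halves the uncertainty interval, so the number of rounds grows only logarithmically in $1/\varepsilon$ for a single arm, which is well within the $O(1/\varepsilon)$ budget of the learning procedure.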

Corollary 2.1.

Multi-armed bandits and linear bandits under direct incentives have $\widetilde{\Theta}(\sqrt{T})$ regret.

Solving Contractual Bandit Problems under Moral Hazard. The construction of an efficient learning procedure is difficult in general contractual bandit learning. We instead start with sufficient knowledge of $P$ to construct an $O(1/\varepsilon)$-learning procedure under the following assumption. This assumption is motivated by practice, where the principal would ask the agent to provide a listing of desired conditions for him to perform different levels of service. The search problem is otherwise known to have an exponential sample complexity lower bound in Stackelberg games [43].

Assumption 2 (Preliminary Contracts).

For any $a\in\mathcal{A}$, the principal has the preliminary knowledge to construct a non-liable contract $x$ that induces the agent’s action $a$ with constant payment.

We defer the construction of this learning procedure and its proof to Appendix C.4. As a result, we can construct an explore-then-commit style algorithm with $O(T^{2/3})$ regret for general contractual bandit learning. Specifically, this algorithm induces the agent to take each action uniformly at random for $T^{2/3}$ rounds under Assumption 2. Then, given that the outcome distribution is estimated with error up to $T^{-1/3}$, it can efficiently estimate the difference of cost up to error $T^{-1/3}$ and thus construct a $T^{-1/3}$-optimal contract to induce the optimal action $a^{*}$ in the remaining rounds.
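
The $T^{2/3}$ rate reflects a balance between the two phases. As a rough sketch, suppose that $T_{1}$ rounds of uniform exploration yield estimation error of order $T_{1}^{-1/2}$; this scaling is an assumption that matches the $T^{-1/3}$ error at $T_{1}=T^{2/3}$ stated above, and constants and logarithmic factors are suppressed.

```latex
\operatorname{Reg}(T)
  \;\lesssim\;
  \underbrace{T_{1}}_{\text{exploration}}
  \;+\;
  \underbrace{(T-T_{1})\,T_{1}^{-1/2}}_{\text{commit to a } T_{1}^{-1/2}\text{-optimal contract}},
\qquad
T_{1}=T^{2/3}
\;\Longrightarrow\;
\operatorname{Reg}(T)\;\lesssim\;T^{2/3}+T\cdot T^{-1/3}=O(T^{2/3}).
```

Choosing $T_{1}$ smaller makes the commit phase more expensive and choosing it larger makes exploration dominate, so $T_{1}=T^{2/3}$ equalizes the two terms.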

Corollary 2.2.

Under Assumptions 1 and 2, $\widetilde{O}(T^{2/3})$ regret can be achieved for contractual bandit learning problems.

This result reveals the core challenge of learning the optimal contract under moral hazard. That is, constructing the contract to induce the optimal action, $[P(a^{*})-P(a^{\prime})]\cdot x\geq c(a^{*})-c(a^{\prime}),\ \forall a^{\prime}\neq a^{*}$, already requires a sufficiently good estimate of $P$ for all actions (including the suboptimal ones). This observation raises the question of whether it is possible to learn $P(a^{\prime})$ without playing the costly sub-optimal action $a^{\prime}$, which is the barrier to achieving $o(T^{2/3})$ regret. The answer turns out to be “Yes” but with some catches. The solution is to implement a binary search procedure for contract $x$ near the hyperplane formed by the linear system $[P(a)-P(a^{\prime})]\cdot x=c(a)-c(a^{\prime}),\ \forall a^{\prime}\neq a$. 
We want to solve for the parameters $c(a)-c(a^{\prime})$ and $P(a)-P(a^{\prime}),\ \forall a^{\prime}\neq a$ in the linear system with bounded errors using a number of contracts $x$ that almost satisfy the linear system. This is, however, impossible without knowing at least one set of parameters in the linear system to ensure it has full rank.
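
One way to see the missing rank, offered as an illustrative remark rather than the argument of Appendix D, is that each equation of the system is invariant to a common rescaling of its parameters. Writing $\mu:=P(a)-P(a^{\prime})$ and $\Delta c:=c(a)-c(a^{\prime})$ as shorthand for a single pair of actions:

```latex
\{x : \mu\cdot x = \Delta c\}
  \;=\;
\{x : (\alpha\mu)\cdot x = \alpha\,\Delta c\}
\qquad\text{for every } \alpha>0 .
```

Contracts located near the hyperplane therefore cannot distinguish $(\mu,\Delta c)$ from $(\alpha\mu,\alpha\,\Delta c)$. Fixing one side, for instance a known $\Delta c\neq 0$, pins down $\alpha=1$ and removes this degree of freedom.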

Corollary 2.3.

Under Assumption 1 and with the knowledge of agent’s cost, $\widetilde{O}(T^{1/2})$ regret can be achieved for contractual bandit learning problems.

In Appendix D, we formally show that, knowing the agent’s cost, there is an efficient learning procedure for the unknown parameters $P(a)-P(a^{\prime}),\ \forall a^{\prime}\neq a$ with small errors under mild assumptions. This allows us to attain $\widetilde{O}(\sqrt{T})$ regret for the general contractual bandit problem, and we showcase its application in designing contractual RL algorithms in the next section. Since the design and analysis of the learning procedure are highly technical, we also demonstrate the high-level idea on a simplified instance in Example 1 of Appendix D. More generally, we expect that a similar learning procedure exists if we alternatively assume some predictive state $s_{0}$ in $P$ such that the principal knows $P(s_{0}|a),\ \forall a\in\mathcal{A}$, since it would also eliminate one extra degree of freedom in the linear system above.

4 The Complexity of Contractual Reinforcement Learning

If we treat each stationary policy in contractual RL as an arm and its induced visitation measure (see its formal definition in Appendix E.1) as an outcome in the contractual bandit problem, the generic algorithm from Section 3.2 already provides an $\widetilde{O}(T^{2/3})$ regret bound. However, the computational and statistical complexity of both Algorithms 2 and 3 has polynomial dependence on the size of the action space, which has become exponential as $|\Pi|=(SA)^{H}$. Moreover, as pointed out above, it requires uniformly good knowledge of the transition kernel $P$ to construct the near-optimal contract policy under moral hazard. In this section, we provide an improved analysis for the complexity of contractual reinforcement learning, given that the agent’s cost function $\{c_{h}\}_{h=1}^{H}$ is known initially. 
This assumption allows us to leverage the learning procedure designed in the last section to efficiently learn the parameters $\mu_{h}(s,a,a^{\prime}):=P_{h}(s,a)-P_{h}(s,a^{\prime})$ for all $h\in[H],s\in\mathcal{S},a,a^{\prime}\in\mathcal{A}$.

Input: State and action sets $\mathcal{S},\mathcal{A}$, number of steps $H$, episodes $T$, solver $\mathscr{A}$.
Run the $\chi(\varepsilon)$-learning procedure in Algorithm 5 for $T_{1}$ rounds and obtain estimates $\{\widehat{\mu}_{h}\}_{h\in[H]}$.
Initialize the empirical estimates of parameters $\{\widehat{P}^{1}_{h},\widehat{r}^{1}_{h},b^{1}_{h}\}_{h\in[H]}$.
for $t=1,\dots,T-T_{1}$ do
       Solve $\bm{x}^{t},\bm{\pi}^{t}$ from the subroutine $\mathscr{A}$ using the parameters $\{\widehat{P}^{t}_{h},\widehat{r}^{t}_{h},b^{t}_{h},\widehat{\mu}_{h}\}_{h\in[H]}$.
       Execute the policy $\bm{x}^{t}$ and observe the trajectory $\{(s_{h}^{t},a_{h}^{t},r_{h}^{t})\}_{h\in[H]}$.
       Update the empirical estimates of parameters $\{\widehat{P}^{t}_{h},\widehat{r}^{t}_{h},b^{t}_{h},\widehat{\mu}_{h}\}_{h\in[H]}$.
Algorithm 1 Contractual RL with Warm Start
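
The control flow of Algorithm 1 can be mirrored by a short skeleton. Every callable below (`learn_mu`, `init_params`, `solver`, `execute`, `update`) is a hypothetical placeholder for a component the paper specifies elsewhere (Algorithms 5 to 7 and the environment); the sketch shows only the two-phase structure and makes no claim about how those components are implemented.

```python
from typing import Any, Callable, List, Tuple


def contractual_rl_warm_start(
    T: int,
    T1: int,
    learn_mu: Callable[[int], Any],                  # warm-start learning procedure
    init_params: Callable[[], Any],                  # initial estimates of (P, r, bonus)
    solver: Callable[[Any, Any], Tuple[Any, Any]],   # returns (contract, action policy)
    execute: Callable[[Any], List[tuple]],           # runs one episode, returns a trajectory
    update: Callable[[Any, List[tuple]], Any],       # refreshes the empirical estimates
) -> List[Tuple[Any, Any]]:
    """Two-phase loop: T1 warm-start rounds, then T - T1 solve/execute/update episodes."""
    mu_hat = learn_mu(T1)
    params = init_params()
    history = []
    for _ in range(T - T1):
        x_t, pi_t = solver(params, mu_hat)
        trajectory = execute(x_t)
        params = update(params, trajectory)
        history.append((x_t, pi_t))
    return history


# Stub components, used only to exercise the control flow.
history = contractual_rl_warm_start(
    T=5,
    T1=2,
    learn_mu=lambda rounds: "mu_hat",
    init_params=lambda: 0,
    solver=lambda params, mu: (params, mu),
    execute=lambda x: [(0, 0, 0.0)],
    update=lambda params, trajectory: params + 1,
)
print(history)
```

Passing the components as arguments keeps the warm-start phase and the solver interchangeable, which is the separation the generic approach relies on.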

We sketch the no-regret learning algorithm in contractual RL in Algorithm 1, which cuts the number of episodes $T$ into two phases and can be improved to be agnostic to $T$ with the doubling trick. It begins by running the $\chi(\varepsilon)$-learning procedure to efficiently obtain the estimated parameter $\widehat{\mu}$ for the construction of the robust contract policy. Then, the algorithm uses a solver to determine the robust contract policy $\bm{x}^{t}$ that induces an optimistic action policy $\bm{\pi}^{t}$ with almost optimal payment. In Theorem 3, we state the complexity results under two different solvers that work under different technical assumptions and provide different trade-offs in statistical and computational complexity. Here, $\kappa,\lambda_{s}$ in the regret bound are constants in the regularity assumptions, and we omit $\log T$ terms from learning $\mu_{h}$, though the effect of these constants can be canceled out only for sufficiently large $T$; we defer the details to Appendix E. Below we zoom into the construction of each component.

Theorem 3.

With high probability, Algorithm 1 has $\widetilde{O}\left((SA^{-1/2}+\kappa^{-1/2})H^{2}\sqrt{T}\right)$ regret using the solver in Algorithm 6 and $\widetilde{O}\left((H^{2}SA^{-1/2}+\eta\lambda^{1-H}_{s}\kappa^{-1/2})\sqrt{T}\right)$ regret using the solver in Algorithm 7 in contractual RL under mild assumptions.

$\chi(\varepsilon)$-Learning Procedure in Contractual RL. One challenge in the construction is the need to separate the stepwise interference among $\{x_{h}\}_{h=1}^{H}$. Otherwise, the actual response space for the agent is $(SA)^{H}$, which is unacceptable even for doing binary search. Our solution is due to the observation that if we fix $x_{h+1},\dots,x_{H}$ and tune $x_{h}$ only, the agent’s expected profits $U_{h+1}^{\bm{x}},\dots,U_{H}^{\bm{x}}$ remain unchanged. This allows us to set $x_{h}$ without influencing the agent’s action policy for steps $h+1,\dots,H$. Another key challenge in constructing the oracle in the MDP setting is to guarantee a visitation measure over each state at step $h$. To maximize the visitation measure of a particular state $s$ at step $h$, we let $x_{h}(s,\cdot)$ have nonzero values such that the agent has a strong incentive to maximize his visitation measure over state $s$ at step $h$. 
To simplify our analysis, we assume that the maximal visitation measure at each state $s\in\mathcal{S}$ and at each step $h\in[H]$ is bounded below, though we expect this to be relaxed via a more careful analysis, since rarely visited states contribute little to the estimation of the cumulative utility. Lastly, the task of setting $x_{h}(s,\cdot)$ is solved in the bandit learning setup under the techniques and assumptions specified in Appendix D. See Section E.2 for the formal proof and detailed construction of the learning procedure.

Solving for Optimistic and Robust Contract Policies. We show two different solvers for the optimistic contract with bounded suboptimality using the estimated parameters. Their basic idea is the same, which is to include an additional bonus for optimism and a margin for robustness. However, it turns out that each of them ensures either statistical or computational efficiency but not both, leaving an intriguing open question on the existence of a best-of-both-worlds solver. For the solver in Algorithm 6, we directly solve for the optimal contract policy according to LP (2.1) with an additional bonus and margin step for the entire policy. For the solver in Algorithm 7, we employ the value iteration from the Bellman equation (2.4) with bonus and margin set at every step. Both solvers require an inducibility assumption similar to Assumption 1 in the contractual bandit learning problem. However, the computationally efficient solver requires the inducibility assumption to hold at every step, whereas the statistically efficient solver only requires the inducibility assumption to hold at the trajectory level. We defer their detailed construction and their proofs to Appendices E.3 and E.4.

5 Conclusion

In this paper, we propose the study of contractual reinforcement learning problems in which the principal learns to influence the agent’s policy by adaptively designing contracts that are contingent on the state realization. The principal must not only balance the tradeoff between her payments and rewards from the agent’s policy, but also incentivize the agent’s exploration for her learning in an unknown environment. Our primary approach is to decouple this general problem into a standard online learning problem and a hyperplane search problem. This enables a clean analysis of the no-regret learning guarantee under several variants of technical assumptions. Meanwhile, several technical gaps remain for future work, including a tighter analysis under relaxed assumptions and the general setup where the agent adaptively improves his policy. We believe this model forms a natural theoretical basis for the agency problem in today’s large scale machine learning tasks, where the economic incentives of users, creators, and service providers stand in conflict with the Internet platform’s long-term objective. More generally, it sheds light on the emergent problems of AI alignment from the perspective of steering AI behaviors through reward-shaping in its training environment. We hope this work will motivate new avenues for developing robust, incentive-compatible frameworks that align diverse stakeholder interests in complex digital ecosystems.

References

  • [1] Creator earnings report breakdown, where are we in the creator economy? https://neoreach.com/creator-earnings/. Accessed: 2024-05-18.
  • [2] The creator economy. https://www.goldmansachs.com/intelligence/pages/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027.html. Accessed: 2024-05-18.
  • [3] Tiktok lite, a new app quietly released in france that rewards screen time. https://www.lemonde.fr/en/pixels/article/2024/04/13/tiktok-lite-a-new-app-quietly-released-in-france-that-rewards-screen-time_6668286_13.html. Accessed: 2024-05-18.
  • Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
  • Agarwal et al. [2019] Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, pages 10–4, 2019.
  • Agrawal [1995] Rajeev Agrawal. The continuum-armed bandit problem. SIAM journal on control and optimization, 33(6):1926–1951, 1995.
  • Agrawal and Devanur [2016] Shipra Agrawal and Nikhil Devanur. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29, 2016.
  • Agrawal and Devanur [2014] Shipra Agrawal and Nikhil R Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 989–1006, 2014.
  • Alon et al. [2021] Tal Alon, Paul Dütting, and Inbal Talgam-Cohen. Contracts with private cost per unit-of-effort. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 52–69, 2021.
  • Badanidiyuru et al. [2018] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. Journal of the ACM (JACM), 65(3):1–55, 2018.
  • Bahar et al. [2020] Gal Bahar, Omer Ben-Porat, Kevin Leyton-Brown, and Moshe Tennenholtz. Fiduciary bandits. In International Conference on Machine Learning, pages 518–527. PMLR, 2020.
  • Balcan et al. [2015] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Commitment without regrets: Online learning in stackelberg security games. In Proceedings of the sixteenth ACM conference on economics and computation, pages 61–78, 2015.
  • Bechtel et al. [2022] Curtis Bechtel, Shaddin Dughmi, and Neel Patel. Delegated pandora’s box. arXiv preprint arXiv:2202.10382, 2022.
  • Bernasconi et al. [2023] Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Francesco Trovò, and Nicola Gatti. Optimal rates and efficient algorithms for online bayesian persuasion. In International Conference on Machine Learning, pages 2164–2183. PMLR, 2023.
  • Besson and Kaufmann [2018] Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.
  • Bhargava [2022] Hemant K Bhargava. The creator economy: Managing ecosystem supply, revenue sharing, and platform design. Management Science, 68(7):5233–5251, 2022.
  • Blum et al. [2004] Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. Online learning in online auctions. Theoretical Computer Science, 324(2-3):137–146, 2004.
  • Braverman et al. [2019] Mark Braverman, Jieming Mao, Jon Schneider, and S Matthew Weinberg. Multi-armed bandit problems with strategic arms. In Conference on Learning Theory, pages 383–416. PMLR, 2019.
  • Cacciamani et al. [2023] Federico Cacciamani, Matteo Castiglioni, and Nicola Gatti. Online information acquisition: Hiring multiple agents. arXiv preprint arXiv:2307.06210, 2023.
  • Castiglioni et al. [2022] Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. Designing menus of contracts efficiently: The power of randomization. arXiv preprint arXiv:2202.10966, 2022.
  • Dogan et al. [2023a] Ilgin Dogan, Zuo-Jun Max Shen, and Anil Aswani. Estimating and incentivizing imperfect-knowledge agents with hidden rewards. arXiv preprint arXiv:2308.06717, 2023a.
  • Dogan et al. [2023b] Ilgin Dogan, Zuo-Jun Max Shen, and Anil Aswani. Repeated principal-agent games with unobserved agent rewards and perfect-knowledge agents. arXiv preprint arXiv:2304.07407, 2023b.
  • Dudík et al. [2020] Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. Journal of the ACM (JACM), 67(5):1–57, 2020.
  • Dütting et al. [2019] Paul Dütting, Tim Roughgarden, and Inbal Talgam-Cohen. Simple versus optimal contracts. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 369–387, 2019.
  • Florida [2022] Richard Florida. The rise of the creator economy. 2022.
  • Frazier et al. [2014] Peter Frazier, David Kempe, Jon Kleinberg, and Robert Kleinberg. Incentivizing exploration. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 5–22, 2014.
  • Grossman and Hart [1992] Sanford J Grossman and Oliver D Hart. An analysis of the principal-agent problem. In Foundations of insurance economics, pages 302–340. Springer, 1992.
  • Guruganesh et al. [2021] Guru Guruganesh, Jon Schneider, and Joshua R Wang. Contracts under moral hazard and adverse selection. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 563–582, 2021.
  • Ho et al. [2016] Chien-Ju Ho, Aleksandrs Slivkins, and Jennifer Wortman Vaughan. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. Journal of Artificial Intelligence Research, 55:317–359, 2016.
  • Immorlica et al. [2022] Nicole Immorlica, Karthik Sankararaman, Robert Schapire, and Aleksandrs Slivkins. Adversarial bandits with knapsacks. Journal of the ACM, 69(6):1–47, 2022.
  • Immorlica et al. [2024] Nicole Immorlica, Meena Jagadeesan, and Brendan Lucier. Clickbait vs. quality: How engagement-based optimization shapes the content landscape in online platforms. arXiv preprint arXiv:2401.09804, 2024.
  • Kaelbling et al. [1998] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
  • Kleinberg and Kleinberg [2018] Jon Kleinberg and Robert Kleinberg. Delegated search approximates efficient search. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 287–302, 2018.
  • Kleinberg and Leighton [2003] Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 594–605. IEEE, 2003.
  • Laffont and Martimort [2009] Jean-Jacques Laffont and David Martimort. The theory of incentives. In The Theory of Incentives. Princeton university press, 2009.
  • Leme and Schneider [2018] Renato Paes Leme and Jon Schneider. Contextual search via intrinsic volumes. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 268–282. IEEE, 2018.
  • Letchford et al. [2009] Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In International symposium on algorithmic game theory, pages 250–262. Springer, 2009.
  • Liu et al. [2021] Allen Liu, Renato Paes Leme, and Jon Schneider. Optimal contextual pricing and extensions. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1059–1078. SIAM, 2021.
  • Lobel et al. [2018] Ilan Lobel, Renato Paes Leme, and Adrian Vladu. Multidimensional binary search for contextual decision-making. Operations Research, 66(5):1346–1361, 2018.
  • Mansour et al. [2015] Yishay Mansour, Aleksandrs Slivkins, and Vasilis Syrgkanis. Bayesian incentive-compatible bandit exploration. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 565–582, 2015.
  • Mao et al. [2018] Jieming Mao, Renato Leme, and Jon Schneider. Contextual pricing for Lipschitz buyers. Advances in Neural Information Processing Systems, 31, 2018.
  • McDiarmid et al. [1989] Colin McDiarmid et al. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.
  • Peng et al. [2019] Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2149–2156, 2019.
  • Ratliff et al. [2018] Lillian J Ratliff, Shreyas Sekar, Liyuan Zheng, and Tanner Fiez. Incentives in the dark: Multi-armed bandits for evolving users with unknown type. arXiv preprint arXiv:1803.04008, 2018.
  • Saig et al. [2024] Eden Saig, Inbal Talgam-Cohen, and Nir Rosenfeld. Delegated classification. Advances in Neural Information Processing Systems, 36, 2024.
  • Scheid et al. [2024] Antoine Scheid, Daniil Tiapkin, Etienne Boursier, Aymeric Capitaine, El Mahdi El Mhamdi, Éric Moulines, Michael I Jordan, and Alain Durmus. Incentivized learning in principal-agent bandit games. arXiv preprint arXiv:2403.03811, 2024.
  • Shah et al. [2019] Virag Shah, Ramesh Johari, and Jose Blanchet. Semi-parametric dynamic contextual pricing. Advances in Neural Information Processing Systems, 32, 2019.
  • Smith [2004] Stephen A Smith. Contract theory. OUP Oxford, 2004.
  • Tran-Thanh et al. [2010] Long Tran-Thanh, Archie Chapman, Enrique Munoz De Cote, Alex Rogers, and Nicholas R Jennings. Epsilon–first policies for budget–limited multi-armed bandits. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
  • Tran-Thanh et al. [2012] Long Tran-Thanh, Archie Chapman, Alex Rogers, and Nicholas Jennings. Knapsack based optimal policies for budget–limited multi–armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 1134–1140, 2012.
  • Wu et al. [2022] Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael I Jordan, and Haifeng Xu. Sequential information design: Markov persuasion process and its efficient reinforcement learning. arXiv preprint arXiv:2202.10678, 2022.
  • Xia et al. [2015] Yingce Xia, Haifang Li, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Thompson sampling for budgeted multi-armed bandits. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • Xia et al. [2016] Yingce Xia, Tao Qin, Weidong Ma, Nenghai Yu, and Tie-Yan Liu. Budgeted multi-armed bandits with multiple plays. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2210–2216, 2016.
  • Yao et al. [2023] Fan Yao, Chuanhao Li, Denis Nekipelov, Hongning Wang, and Haifeng Xu. How bad is top-$k$ recommendation under competing content creators? In International Conference on Machine Learning, pages 39674–39701. PMLR, 2023.
  • Yao et al. [2024] Fan Yao, Chuanhao Li, Karthik Abinav Sankararaman, Yiming Liao, Yan Zhu, Qifan Wang, Hongning Wang, and Haifeng Xu. Rethinking incentives in recommender systems: Are monotone rewards always beneficial? Advances in Neural Information Processing Systems, 36, 2024.
  • Zhao et al. [2023] Geng Zhao, Banghua Zhu, Jiantao Jiao, and Michael Jordan. Online learning in Stackelberg games with an omniscient follower. In International Conference on Machine Learning, pages 42304–42316. PMLR, 2023.
  • Zhu et al. [2022] Banghua Zhu, Stephen Bates, Zhuoran Yang, Yixin Wang, Jiantao Jiao, and Michael I Jordan. The sample complexity of online contract design. arXiv preprint arXiv:2211.05732, 2022.
  • Zhu et al. [2023] Banghua Zhu, Sai Praneeth Karimireddy, Jiantao Jiao, and Michael I Jordan. Online learning in a creator economy. arXiv preprint arXiv:2305.11381, 2023.
  • Zuo [2024] Shiliang Zuo. New perspectives in online contract design: Heterogeneous, homogeneous, non-myopic agents and team production. arXiv preprint arXiv:2403.07143, 2024.

Appendix A Further Discussion on Related Work

Contract Design.

Contract theory has long been a crucial branch of economics [27, 48, 35]. Driven by the accelerating deployment of contract-based markets in Internet applications, the contract design problem has recently attracted surging interest, especially from the computer science community [24, 28, 9, 20]. The principal-agent model has also been applied to the delegation of online search problems [13, 33] and machine learning tasks [45]. While these works focus on the computational aspects of contract design, our work aims to adaptively design the optimal contract between learners and decision makers in an initially unknown environment.

Dynamic Pricing.

Our model is related to dynamic (contextual) pricing problems [34, 41, 47, 39, 36], where a seller learns to post a price on a single item for a sequence of buyers with a fixed cost (possibly under different contexts). In particular, they can be viewed as special cases of contractual reinforcement learning, where the contract is contingent on the agent’s binary action and the principal already knows her reward function. As we will see in Section 3, our algorithm borrows some design insights from these pricing problems. Nonetheless, our learning algorithm deals with more involved situations, where the agent has multiple actions (e.g., a list of items to buy) whose rewards to the principal are unknown, and the contract is contingent not necessarily on the agent’s actions but on their outcomes. As a result, while it is possible to achieve constant regret in these pricing problems, the regret lower bound of contractual reinforcement learning is $\Omega(\sqrt{T})$.

Online Contract Design.

The problem begins as a variant of dynamic pricing in Kleinberg and Leighton [34], where the agent’s cost is stochastically (or adversarially) chosen and the regret bound is $\Theta(\sqrt{T})$ (or $\Theta(T^{2/3})$ in the adversarial setup). Ho et al. [29] and Zhu et al. [57] consider a generalized model where the agent has multiple (instead of binary) actions, and both the cost and reward of his actions are determined by the agent’s Bayesian type, which is unknown to the learner. These problems can be viewed as continuum-armed bandit problems [6], except that the principal’s utility is not continuous. Zhu et al. [57] show an almost tight, nearly linear regret bound of $\widetilde{\Theta}(T^{1-K/|\mathcal{S}|})$ for this problem, for some constant $K$ and number of outcomes $|\mathcal{S}|$. On top of this model, Zhu et al. [58] consider the joint online optimization of the contract and the recommendation policy in the context of the creator economy. Zuo [59] assumes a smoothness condition and presents a direct reduction to the standard Lipschitz bandits problem. In comparison, our learning problem is closer to the standard contract design model, in which the agent type is observable by the principal (captured by the initial state or context), as many platforms hold a good amount of data on their users and content creators.
More importantly, this modeling choice allows us to focus on the key challenges of learning and planning the optimal contract under moral hazard, where we are able to achieve $\widetilde{O}(\sqrt{T})$ regret for a large class of problems and $\widetilde{O}(T^{2/3})$ regret in general under mild assumptions. Meanwhile, several recent works [22, 21, 46] consider a simple special case of our problem, where there is no Markov state transition and the principal can directly incentivize the agent to take a certain action without the barrier of moral hazard.

Online Learning with Incentive Constraints.

Incentive design problems have been studied in online learning in several different ways. One line of work, known as incentivized exploration [26, 40], considers situations where the principal recommends that the agents pull different arms, and the recommendation policy must be incentive compatible for the agents in a Bayesian sense w.r.t. each agent’s prior over arm rewards. Bahar et al. [11] consider the fiduciary bandits problem, where a slightly stronger constraint of individual rationality is introduced. Our model differs from these works in that the principal uses monetary incentives (contracts) instead of an information advantage to influence the agents’ decisions. Another line of work, known as budgeted bandits [49, 50, 52, 53], and more generally, bandits with knapsacks [10, 8, 7, 30], models the intrinsic cost of arm selection. There, the cost only affects the learner’s choices through the limited budget, whereas the learner (principal) in our multi-agent decision making process needs to properly reimburse the agent’s (opportunity) cost in order to influence the agent’s arm choices. Ratliff et al. [44] consider the multi-armed bandit problem where the reward distribution (impacted by user types) shifts according to the history of arm selection. Braverman et al. [18] model each bandit arm as a self-interested agent that keeps part of the reward from the principal to strategically maximize his long-term utility. Besides the online contract design problem, there is also a rich line of literature on online learning problems under Stackelberg games, information design, and auction design setups [17, 12, 23, 51, 56, 14, 19].

Appendix B Omitted Content in Section 2

B.1 Notations and Illustrations

We use the notation $[n]$ for the set $\{1,2,\dots,n\}$. We use $\Delta(\mathcal{S})$ to denote the simplex over a discrete set $\mathcal{S}$. For a probability distribution $P\in\Delta(\mathcal{S})$, we use $P(s)$ to denote the measure of $s\in\mathcal{S}$ in $P$. We use the notation $\mathop{\mathrm{maxarg}}, \mathop{\mathrm{minarg}}$ for operators on an optimization problem that return the optimal objective value followed by its optimal solution, e.g., $0, a = \mathop{\mathrm{minarg}} \lVert x-a\rVert^{2}$.
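As a small illustration of the $\mathop{\mathrm{minarg}}$ convention, the following Python sketch returns the optimal value together with a minimizer. The helper name `minarg` and the restriction to a finite candidate set are assumptions made for illustration only; the operator in the text applies to general optimization problems.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def minarg(objective: Callable[[T], float], candidates: Iterable[T]) -> Tuple[float, T]:
    """Return (optimal value, optimal solution) over a finite candidate set."""
    best_value, best_point = None, None
    for point in candidates:
        value = objective(point)
        # Keep the first candidate that attains the smallest value seen so far.
        if best_value is None or value < best_value:
            best_value, best_point = value, point
    if best_value is None:
        raise ValueError("candidates must be non-empty")
    return best_value, best_point


# Mirrors the example in the text: minimizing (x - a)^2 yields value 0 at x = a.
a = 3
value, solution = minarg(lambda x: (x - a) ** 2, range(-5, 6))
```

Returning the value first and the solution second matches the order stated in the definition above.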

We will interchangeably treat a function $f:\mathcal{X}\to\mathcal{Y}$ as a vector from $\mathcal{Y}^{|\mathcal{X}|}$. As such, we denote the inner product $f\cdot g := \sum_{x\in\mathcal{X}} f(x)g(x)$ for $f,g:\mathcal{X}\to\mathcal{Y}$ or $\langle f,g\rangle_{\mathcal{X}\times\mathcal{Y}} := \sum_{x\in\mathcal{X},y\in\mathcal{Y}} f(x,y)g(x,y)$ for $f,g:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$. Denote their outer product as $f\otimes g$. Denote $\lVert f\rVert_{\ell,\infty} := \sup_{x\in\mathcal{X}} \lVert f(x)\rVert_{\ell}$ for $f:\mathcal{X}\to\mathcal{Y}$.
In addition, for a function $f:\mathcal{X}\times\mathcal{Y}\to\mathcal{Z}$, we use $f(x)\in\mathcal{Z}^{\mathcal{Y}}$ to denote the partial function $f(x,\cdot)$. For a conditional probability $P:\mathcal{X}\to\Delta(\mathcal{Y})$, we denote by $P(y|x)$ the measure of $y\in\mathcal{Y}$ given $x\in\mathcal{X}$.

Table 2: A table of notations in the contractual reinforcement learning problem
| Symbols | Interpretations |
|---|---|
| $\mathcal{S}, \mathcal{A}$ | state, action space |
| $P_h:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S}),\ P_0\in\Delta(\mathcal{S})$ | transition kernel, initial state distribution |
| $\iota_h:\mathcal{S}\times\mathcal{S}\to\mathbb{R}_+$ | noisy reward function at $h$-th step |
| $r_h:\mathcal{S}\times\mathcal{A}\to[0,1]$ | expected reward function at $h$-th step |
| $c_h:\mathcal{S}\times\mathcal{A}\to[0,1]$ | cost function at $h$-th step |
| $\bm{x}=\{x_h:\mathcal{S}\times\mathcal{S}\to\mathbb{R}_+\}_{h=1}^{H}$ | contract policy |
| $\bm{\pi}=\{\pi_h:\mathcal{S}\to\Delta(\mathcal{A})\}_{h=1}^{H}$ | action policy |
| $\Pi$ | action policy space |
| $\mathcal{X}$ | contract policy space |
| $V_h^{\bm{x},\bm{\pi}}, V_h^{\bm{x}}, V_h^{\bm{\pi}}, V_h^{*}:\mathcal{S}\to\mathbb{R}_+$ | principal’s state value function at $h$-th step |
| $U_h^{\bm{x},\bm{\pi}}, U_h^{\bm{x}}, U_h^{\bm{\pi}}, V_h^{*}:\mathcal{S}\to\mathbb{R}_+$ | agent’s state value function at $h$-th step |
| $\rho_h^{\bm{\pi}}:\Delta(\mathcal{S})$ | state visitation measure at $h$-th step |
| $\zeta_h^{\bm{\pi}}:\mathcal{A}\to\mathbb{R}_+$ | least payment function at $h$-th step |
| $Q_h^{\bm{\pi}}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}_+$ | principal’s state-action value function at $h$-th step |
| $W_h^{\bm{\pi}}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}_+$ | agent’s state-action value function at $h$-th step |
[Diagram nodes: Principal, Agent, Environment. Edge labels: Contract Policy $x_h:\mathcal{S}\times\mathcal{S}\to\mathbb{R}_+$; Action Policy $\pi_h:\mathcal{S}\to\mathcal{A}$; State Transition $s_{h+1}\sim P(s_h,\pi_h(s_h))$; Payment $x_h(s_h,s_{h+1})$.]
Figure 2: An illustration of the interaction procedure in the principal-agent Markov decision process.
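The interaction procedure in the figure can be sketched as a short simulation loop. This is only an illustrative sketch under stated assumptions: the function name `run_episode`, the callable-based interfaces for $P_h$, $x_h$ and $\pi_h$, and the toy two-state instance are all hypothetical and do not come from the paper.

```python
import random
from typing import Callable, Dict, List, Tuple

State, Action = int, int


def run_episode(
    H: int,
    s1: State,
    P: Callable[[int, State, Action], Dict[State, float]],  # P_h(. | s, a)
    x: Callable[[int, State, State], float],                 # x_h(s, s')
    pi: Callable[[int, State], Action],                      # pi_h(s)
    rng: random.Random,
) -> Tuple[List[Tuple[State, Action, State, float]], float]:
    """Simulate one episode; return the trajectory and the total payment."""
    s, trajectory, total_payment = s1, [], 0.0
    for h in range(1, H + 1):
        a = pi(h, s)                       # agent picks an action
        dist = P(h, s, a)
        states, probs = zip(*dist.items())
        s_next = rng.choices(states, weights=probs, k=1)[0]  # environment transitions
        payment = x(h, s, s_next)          # payment depends only on (s, s'), not on a
        trajectory.append((s, a, s_next, payment))
        total_payment += payment
        s = s_next
    return trajectory, total_payment


# Hypothetical toy instance: action 1 ("work") makes state 1 more likely.
def P(h, s, a):
    return {0: 0.1, 1: 0.9} if a == 1 else {0: 0.8, 1: 0.2}


def x(h, s, s_next):
    return 0.5 if s_next == 1 else 0.0


def pi(h, s):
    return 1


trajectory, total = run_episode(3, 0, P, x, pi, random.Random(0))
```

The payment rule never reads the action `a`, which is how the sketch reflects that the contract is contingent on the realized next state only.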

B.2 Discussion on the Modeling Choices.

We make a few remarks on the procedure of the PAMDP.

  1. It is without loss of generality for the principal to commit to its contract policy $\bm{x}$ at the very beginning of each episode, since this MDP setup can be viewed as an extensive-form game, as long as the principal, as the first mover, can predict the agent’s response and plan its follow-up moves accordingly. Once the principal commits to its contract policy, the agent can also determine his optimal action policy in response.

  2. We assume that the Markovian state, once realized, is publicly observable by both the agent and the principal, serving as the natural condition and contingency for the contract design. Hence, our model directly uses the state transition kernel, $P_h:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$, as the outcome distribution in contract design problems. Otherwise, if either the agent or the principal only partially observes the state, the planning problem is known to be intractable [32], and we leave this open question for future work. A more subtle caveat here is that, different from the standard episodic MDP, the transition kernel in the last step, $P_H$, matters, as it influences the principal’s design of the contract.

  3. The principal’s noisy reward $\iota_h(s_h,s_{h+1})$ is set to be conditionally independent of the agent’s action $a_h$, given the next state $s_{h+1}$. This is necessary for a subtle modeling reason: as we will see in the next section, the contract design problem would become much easier if the principal could condition its payment directly on the agent’s action (i.e., without the concern of moral hazard). Note that since the reward itself can be modeled as a part of the state, the existence of such $\iota_h$ is without loss of generality, and there is no need to assume additional zero-mean noise on top of $\iota_h$.

  4. We assume the principal is able to observe the agent’s action once the payment is transferred. This is well-motivated in practice. For example, a content platform may ask creators to fill out a survey on the amount of time they spent creating their content; the creators have no incentive to misreport this information as long as their payment is independent of the answers. In the more general setup, the principal is only able to observe a probabilistic signal of the agent taking some action $a$ (e.g., from the realization of the next state and knowledge of the transition kernel). For the convenience of analysis, we omit the additional steps for the principal to infer the agent’s decision up to a sufficient level of confidence by repeating the same contract policy, though this could introduce an additional factor of $H$ into the sample complexity, depending on the mixing ratio. We leave the tight analysis to future work.

  5. It is without loss of generality to assume that the rational agent always has the incentive to participate in the PAMDP. This is because enforcing the additional constraint that the agent’s utility must be non-negative under the principal’s optimal contract is equivalent to adding an “idle” action $a_0$ to the existing action set $\mathcal{A}$ with $r_h(s,a_0)=c_h(s,a_0)=0, \forall s\in\mathcal{S}, h\in[H]$, which allows our analysis to ignore the agent’s non-negative utility (individual rationality) constraint.
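The idle-action argument in the last remark can be illustrated with a minimal sketch. The nested-list layout (`r[h][s][a]`, `c[h][s][a]`, `P[h][s][a]`), the helper names, and in particular the transition distribution assigned to $a_0$ are assumptions for illustration; the text only specifies that $a_0$ has zero reward and zero cost.

```python
import copy


def add_idle_action(r, c, P, idle_transition):
    """Append an idle action a0 with zero reward and zero cost at every (h, s).

    r[h][s][a], c[h][s][a]: expected reward and cost; P[h][s][a]: list of
    next-state probabilities. idle_transition[h][s] is a hypothetical,
    caller-supplied next-state distribution for a0 (not specified in the text).
    """
    r_aug, c_aug, P_aug = copy.deepcopy(r), copy.deepcopy(c), copy.deepcopy(P)
    for h in range(len(r)):
        for s in range(len(r[h])):
            r_aug[h][s].append(0.0)
            c_aug[h][s].append(0.0)
            P_aug[h][s].append(list(idle_transition[h][s]))
    return r_aug, c_aug, P_aug


def one_step_utilities(P_hs, c_hs, x_hs):
    """Agent's one-step utility P_h(s, a) . x - c_h(s, a) for each action a."""
    return [sum(p * pay for p, pay in zip(P_hs[a], x_hs)) - c_hs[a]
            for a in range(len(c_hs))]


# Toy instance: H = 1, two states, one costly action whose payment does not cover its cost.
r = [[[1.0], [1.0]]]
c = [[[0.6], [0.6]]]
P = [[[[0.1, 0.9]], [[0.1, 0.9]]]]
x = [0.0, 0.5]                      # non-negative payment per next state
idle = [[[1.0, 0.0], [0.0, 1.0]]]   # assumed: idling keeps the current state

r2, c2, P2 = add_idle_action(r, c, P, idle)
before = max(one_step_utilities(P[0][0], c[0][0], x))
after = max(one_step_utilities(P2[0][0], c2[0][0], x))
```

Because payments are non-negative and $a_0$ costs nothing, the agent’s best one-step utility in the augmented instance cannot fall below zero, whichever transition is assumed for $a_0$.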

B.3 Least-Payment Bellman Equations in PAMDP

With the correspondence between $\bm{x}$ and $\bm{\pi}^{\bm{x}}$, a natural next step is to fix $\bm{\pi}$ and find the contract policy

$$\bm{x}^{\bm{\pi}} = \mathop{\mathrm{argmax}}_{\bm{x}\in\mathcal{X}} V^{\bm{x},\bm{\pi}} \quad\text{s.t.}\quad \bm{\pi} = \mathop{\mathrm{argmax}}_{\bm{\pi}\in\Pi} U^{\bm{x},\bm{\pi}},$$

with the maximal value among all contract policies to which the agent would optimally respond with the action policy $\bm{\pi}$. Since the total expected reward of the principal is fixed under $\bm{\pi}$, the objective of the optimization problem can be equivalently rewritten as minimizing the principal’s total payment,

$$\bm{x}^{\bm{\pi}}, \zeta^{\bm{\pi}} = \mathop{\mathrm{minarg}}_{\bm{x}\in\mathcal{X}} \mathop{\mathbf{E}}\Big[\sum_{h=1}^{H} x_h(s_h,a_h) \,\Big|\, \{\pi_h\}_{h=1}^{H}\Big] \quad\text{s.t.}\quad \bm{\pi} = \mathop{\mathrm{argmax}}_{\bm{\pi}\in\Pi} U^{\bm{x},\bm{\pi}}.$$

Recall that $\zeta_h^{\bm{\pi}}(s)$ denotes the least amount of expected payment to induce an action policy $\bm{\pi}$ at the state $s$ from the $h$-th step. Meanwhile, since $U^{\bm{\pi}}_h(s) = P_h(s,\pi_h(s))\cdot[x+U^{\bm{\pi}}_{h+1}] - c_h(s,\pi_h(s))$, the above constraint can be equivalently rewritten as a set of constraints in an iterative form,

$$\pi_{h}=\operatorname*{argmax}_{a\in\mathcal{A}}P_{h}(s,a)\cdot[x+U^{\bm{\pi}}_{h+1}]-c_{h}(s,a),\quad\forall h\in[H].$$

Therefore, such a contract policy $\bm{x}^{\bm{\pi}}$ can be computed iteratively by backward induction from $h=H$ to $1$, initialized with $U^{\bm{x}}_{H+1}(s)=0$, $\forall s\in\mathcal{S}$,

$$\begin{aligned}
W^{\bm{\pi}}_{h}(s,a;x) &= P_{h}(s,a)\cdot[x+U^{\bm{\pi}}_{h+1}]-c_{h}(s,a), \\
x^{\bm{\pi}}_{h}(s) &= \operatorname*{argmin}_{x:\mathcal{S}\to\mathbb{R}_{+}}\{P_{h}(s,\pi_{h}(s))\cdot x\ |\ W^{\bm{\pi}}_{h}(s,\pi_{h}(s);x)\geq W^{\bm{\pi}}_{h}(s,a^{\prime};x),\forall a^{\prime}\in\mathcal{A}\}, \\
U^{\bm{\pi}}_{h}(s) &= W^{\bm{\pi}}_{h}(s,\pi_{h}(s);x_{h}^{\bm{\pi}}(s)), \\
\zeta^{\bm{\pi}}_{h}(s) &= P_{h}(s,\pi_{h}(s))\cdot[x_{h}^{\bm{\pi}}(s)+\zeta_{h+1}^{\bm{\pi}}], \\
V_{h}^{\bm{\pi}}(s) &= r_{h}(s,\pi_{h}(s))+P_{h}(s,\pi_{h}(s))\cdot[V^{\bm{\pi}}_{h+1}-x^{\bm{\pi}}_{h}(s)],
\end{aligned}\tag{B.1}$$

where the functions $\zeta^{\bm{\pi}}_{h}(s)$ and $V_{h}^{\bm{\pi}}(s)$ are computed as by-products of the value iteration. We refer to Equation (B.1) as the least-payment Bellman equation.
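The recursions for $U$, $\zeta$ and $V$ can be sketched in code. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the contracts are assumed to be already computed (the argmin step is skipped), steps are 0-indexed, and the function name `evaluate_policy`, the nested-list data layout and all numbers in the toy instance are hypothetical.

```python
def dot(p, v):
    """Inner product of a next-state distribution p with a vector v."""
    return sum(pj * vj for pj, vj in zip(p, v))


def evaluate_policy(H, P, r, c, pi, x):
    """Backward recursion for (U, zeta, V) under a fixed policy and contracts.

    Hypothetical layout (all lists, steps 0-indexed):
      P[h][s][a]  next-state distribution, r[h][s][a] reward, c[h][s][a] cost,
      pi[h][s]    action induced at step h in state s,
      x[h][s]     payment per realized next state (assumed already computed).
    """
    n = len(P[0])
    U, zeta, V = [0.0] * n, [0.0] * n, [0.0] * n  # values at step H+1
    for h in reversed(range(H)):
        U_new, zeta_new, V_new = [], [], []
        for s in range(n):
            a = pi[h][s]
            p, pay = P[h][s][a], x[h][s]
            # U_h(s) = P . [x + U_{h+1}] - c
            U_new.append(dot(p, [pay[j] + U[j] for j in range(n)]) - c[h][s][a])
            # zeta_h(s) = P . [x + zeta_{h+1}]
            zeta_new.append(dot(p, [pay[j] + zeta[j] for j in range(n)]))
            # V_h(s) = r + P . [V_{h+1} - x]
            V_new.append(r[h][s][a] + dot(p, [V[j] - pay[j] for j in range(n)]))
        U, zeta, V = U_new, zeta_new, V_new
    return U, zeta, V


# Toy instance (made-up numbers): 2 states, 2 actions, horizon 2, stationary.
H = 2
P = [[[[0.6, 0.4], [0.2, 0.8]] for _ in range(2)] for _ in range(H)]
r = [[[0.3, 1.0] for _ in range(2)] for _ in range(H)]
c = [[[0.0, 0.2] for _ in range(2)] for _ in range(H)]
pi = [[1, 1] for _ in range(H)]
x = [[[0.0, 0.5] for _ in range(2)] for _ in range(H)]

U, zeta, V = evaluate_policy(H, P, r, c, pi, x)
print(U, zeta, V)
```

On this toy instance the principal collects a reward of 1 per step and pays 0.4 in expectation per step, so the recursion gives $V\approx 1.2$, $\zeta\approx 0.8$ and $U\approx 0.4$ at the first step.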

B.4 Bellman Optimality Equations in PAMDP

Proof of Theorem 1.

We begin by giving an interpretation for each variable in the Bellman equation. With slight abuse of notation, $x^{*}_{h}(s;a)$ denotes the contract with the least payment that induces the agent to take action $a$ in step $h$. Given that $\pi^{*}_{h}(s)$ is the best agent action for the principal to induce, the optimal contract at state $s$ in step $h$ can be determined as $x^{*}_{h}(s)=x^{*}_{h}(s;\pi^{*}_{h}(s))$. $Q_{h}^{*}(s,a)$ and $W_{h}^{*}(s,a;x)$ are respectively the principal’s and the agent’s total expected utility from the $h$-th step under the policies $\{x^{*}_{\tau}\}_{\tau=h+1}^{H}$ and $\{\pi^{*}_{\tau}\}_{\tau=h+1}^{H}$; they can be interpreted as the optimal state-action value functions at the $h$-th step and serve as intermediate variables for the computation. We now prove the optimality of the solution $\bm{x}^{*},V^{*}$ by induction:

For the base case, observe that planning for any state $s$ at the last step $H$ is reduced to a standard contract design problem and the optimal contract can be determined by solving the following linear program, $\forall a\in\mathcal{A}$,

$$\begin{aligned}
Q_{H}^{*}(s,a),x^{*}_{H}(s;a) &= \operatorname*{argmin}_{x:\mathcal{S}\to\mathbb{R}_{+}}r_{H}(s,a)-P_{H}(s,a)\cdot x \\
\text{s.t.}&\quad P_{H}(s,a)\cdot x-c_{H}(s,a)\geq P_{H}(s,a^{\prime})\cdot x-c_{H}(s,a^{\prime}),\ \forall a^{\prime}\neq a,
\end{aligned}$$

where $x^{*}_{H}(s;a)$ is the least-payment contract that induces the agent to take action $a$ in the last step and $Q_{H}^{*}(s,a)$ is the principal’s expected utility under $x^{*}_{H}(s;a)$. In Equation 2.4, we omit the term $r_{H}(s,a)$, since it is a constant once the action is fixed. Hence, $V_{H}^{*}(s),a^{*}=\operatorname*{maxarg}_{a\in\mathcal{A}}Q^{*}_{H}(s,a)$ determines the best action $\pi^{*}_{H}(s)=a^{*}$ for the principal to induce, and $V_{H}^{*}(s)$ is the optimal state value function at the $H$-th step. The principal’s optimal contract can be determined as $x^{*}_{H}(s)=x^{*}_{H}(s;a^{*})$. The agent’s value is $U_{H}^{*}(s)=P_{H}(s,a^{*})\cdot x^{*}_{H}(s)-c_{H}(s,a^{*})$ based on his best response $a^{*}$ under $x^{*}_{H}(s)$.
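As a concrete illustration of the base case, consider a hypothetical last-step instance with two outcomes and two actions. All numbers below are made up for illustration and do not come from the paper; the sketch reads the program as minimizing the expected payment subject to the incentive constraint and limited liability $x\geq 0$.

```latex
% Hypothetical instance at step H, fixed state s; outcomes s_1, s_2.
% P_H(s,a_0) = (0.6, 0.4), c_H(s,a_0) = 0,   r_H(s,a_0) = 0.3
% P_H(s,a_1) = (0.2, 0.8), c_H(s,a_1) = 0.2, r_H(s,a_1) = 1
\begin{align*}
\text{Inducing } a_1:\quad
 & 0.2x_1 + 0.8x_2 - 0.2 \ge 0.6x_1 + 0.4x_2
   \iff x_2 - x_1 \ge 0.5, \\
 & x^{*}_{H}(s;a_1) = (0,\,0.5), \quad
   P_H(s,a_1)\cdot x^{*}_{H}(s;a_1) = 0.4, \quad
   Q^{*}_{H}(s,a_1) = 1 - 0.4 = 0.6. \\
\text{Inducing } a_0:\quad
 & 0.6x_1 + 0.4x_2 \ge 0.2x_1 + 0.8x_2 - 0.2
   \iff x_2 - x_1 \le 0.5, \\
 & x^{*}_{H}(s;a_0) = (0,\,0), \quad
   Q^{*}_{H}(s,a_0) = 0.3. \\
\text{Hence}\quad
 & a^{*} = a_1, \quad V^{*}_{H}(s) = 0.6, \quad
   x^{*}_{H}(s) = (0,\,0.5), \quad
   U^{*}_{H}(s) = 0.4 - 0.2 = 0.2.
\end{align*}
```

In this instance the costly action is worth inducing because the extra reward (0.7) exceeds the least expected payment (0.4) needed to make it the agent's best response.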

For the inductive case, given that $\{x^{*}_{\tau}\}_{\tau=h+1}^{H}$ is optimal, with the agent’s best-responding action policy $\{\pi^{*}_{\tau}\}_{\tau=h+1}^{H}$, we show that $x^{*}_{h}$ solved from Equation 2.4 is optimal. Observe that $W^{*}_{h}(s,a;x)=P_{h}(s,a)\cdot[x+U^{*}_{h+1}]-c_{h}(s,a)$ captures the agent’s total utility of taking action $a$ under the contract $x$ at state $s$ in step $h$ and then optimally following the action policy $\{\pi^{*}_{\tau}\}_{\tau=h+1}^{H}$ under $\{x^{*}_{\tau}\}_{\tau=h+1}^{H}$ from step $h+1$. Here, we can use $U^{*}_{h+1}$ computed in the previous iteration, because the agent’s value $U^{*}_{h+1}(s)$ is conditionally independent of the action in the current step given the realization of the next state $s$; this enables efficient computation through dynamic programming. Similar to the base case, for every action $a\in\mathcal{A}$, the principal computes the least-payment contract for the agent to take action $a$,

$$\begin{aligned}
Q_{h}^{*}(s,a),x^{*}_{h}(s;a) &= \operatorname*{minarg}_{x:\mathcal{S}\to\mathbb{R}_{+}}r_{h}(s,a)+P_{h}(s,a)\cdot[V^{*}_{h+1}-x^{*}_{h}(s;a)] \\
\text{s.t.}&\quad P_{h}(s,a)\cdot[x+U^{*}_{h+1}]-c_{h}(s,a)\geq P_{h}(s,a^{\prime})\cdot[x+U^{*}_{h+1}]-c_{h}(s,a^{\prime}),\ \forall a^{\prime}\neq a,
\end{aligned}$$

where the objective is the principal’s total utility if the agent takes action $a$ at the current step $h$ and follows the policy $\{\pi^{*}_{\tau}\}_{\tau=h+1}^{H}$ onward; the constraint reflects that it is (weakly) optimal for the agent to take action $a$ at the current step. In Equation 2.4, we omit the term $r_{h}(s,a)+P_{h}(s,a)\cdot V^{*}_{h+1}$, since it is constant once the action is fixed. With the least-payment contract $x^{*}_{h}(s;a)$ and state-action value $Q^{*}_{h}(s,a)$ for each action, $V_{h}^{*}(s),a^{*}=\operatorname*{maxarg}_{a\in\mathcal{A}}Q^{*}_{h}(s,a)$ determines the best action $\pi^{*}_{h}(s)=a^{*}$ for the principal to induce, and $V_{h}^{*}(s)$ is the optimal state value function at the $h$-th step. The principal’s optimal contract can be determined as $x^{*}_{h}(s)=x^{*}_{h}(s;a^{*})$. The agent’s value is $U_{h}^{*}(s)=P_{h}(s,a^{*})\cdot x^{*}_{h}(s)-c_{h}(s,a^{*})$ based on his best response $a^{*}$ under $x^{*}_{h}(s)$. Therefore, $x^{*}_{h}(s)$ is the optimal contract given the optimal contract policy $\{x^{*}_{\tau}\}_{\tau=h+1}^{H}$ obtained in the previous iterations, which concludes the induction.

Lastly, we note that this Bellman equation can be solved efficiently using backward induction from the state value functions at step $H+1$, $V^{*}_{H+1},U^{*}_{H+1}=0$. In each step, it solves $S\times A$ many linear programs for the optimal contract $x^{*}_{h}(s;a)$, while each linear program has $O(A^{2})$ many constraints. Hence, the total time complexity to solve for the optimal policy $\bm{x}^{*}$ is polynomial w.r.t. $A,S,H$.
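The backward induction described in this proof can be sketched as follows. This is a minimal, self-contained Python sketch under several assumptions, not the authors' implementation: the model is stationary, each linear program is solved by brute-force vertex enumeration (exponential in the number of outcomes, so only suitable for tiny instances; a polynomial-time LP solver would replace it in practice), the agent's continuation value follows the definition of $W^{*}_{h}$, and the names `solve_square`, `least_payment` and `plan` as well as the toy numbers are hypothetical.

```python
from itertools import combinations


def dot(p, v):
    return sum(pj * vj for pj, vj in zip(p, v))


def solve_square(A, b, tol=1e-12):
    """Solve A z = b by Gauss-Jordan elimination; None if A is singular."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        if abs(M[piv][col]) < tol:
            return None
        M[col], M[piv] = M[piv], M[col]
        for i in range(n):
            if i != col:
                f = M[i][col] / M[col][col]
                M[i] = [mi - f * mc for mi, mc in zip(M[i], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def least_payment(p, c, a, U_next, tol=1e-9):
    """min p[a].x  s.t.  x >= 0 and action a is a (weak) best response.

    Brute force over vertices; returns (payment, x), or None if infeasible.
    """
    n = len(p[a])
    G, b = [], []  # all constraints in the form G[i].x >= b[i]
    for a2 in range(len(p)):
        if a2 != a:
            d = [p[a][j] - p[a2][j] for j in range(n)]
            G.append(d)
            b.append(c[a] - c[a2] - dot(d, U_next))
    for j in range(n):  # limited liability: x_j >= 0
        G.append([1.0 if k == j else 0.0 for k in range(n)])
        b.append(0.0)
    best = None
    for idx in combinations(range(len(G)), n):
        z = solve_square([G[i] for i in idx], [b[i] for i in idx])
        if z is None:
            continue
        if all(dot(G[i], z) >= b[i] - tol for i in range(len(G))):
            pay = dot(p[a], z)
            if best is None or pay < best[0] - tol:
                best = (pay, z)
    return best


def plan(H, P, r, c):
    """Backward induction for a stationary model P[s][a], r[s][a], c[s][a]."""
    S = len(P)
    V, U = [0.0] * S, [0.0] * S  # V_{H+1} = U_{H+1} = 0
    policy = []
    for _ in range(H):
        V_new, U_new, step = [], [], []
        for s in range(S):
            best = None
            for a in range(len(P[s])):
                sol = least_payment(P[s], c[s], a, U)
                if sol is None:
                    continue  # action a cannot be induced at this state
                pay, x = sol
                q = r[s][a] + dot(P[s][a], V) - pay
                if best is None or q > best[0]:
                    best = (q, a, x)
            q, a, x = best
            V_new.append(q)
            U_new.append(dot(P[s][a], [xj + uj for xj, uj in zip(x, U)]) - c[s][a])
            step.append((a, x))
        V, U = V_new, U_new
        policy.insert(0, step)
    return V, U, policy


# Toy instance (made-up numbers): 2 states, 2 actions, horizon 2.
P = [[[0.6, 0.4], [0.2, 0.8]]] * 2
r = [[0.3, 1.0]] * 2
c = [[0.0, 0.2]] * 2
V, U, policy = plan(2, P, r, c)
print(V, U, policy[0][0])
```

Each call to `plan` solves one program per (step, state, action) triple, matching the $S\times A$ programs per step counted above; only the inner solver differs from what an efficient implementation would use.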

Appendix C Proofs in Section 3

C.1 The Regret Analysis of the Generic Algorithm

We first describe the design of the generic algorithm in Algorithm 2 and the technical lemmas.

Input: $\{\mathcal{X}^{a}(\varepsilon)\}_{a\in\mathcal{A}}$, the $\varepsilon$-margin contract sets.
for $t=1\dots T$ do
    Estimate the least payment for each action under the empirical outcome distribution,
    $$\widehat{\zeta}_{t}(a)\leftarrow\min_{x\in\mathcal{X}^{a}(\varepsilon)}\widehat{P}_{t}(a)\cdot x.\tag{C.1}$$
    Determine the best action based on the optimistic estimation of profit,
    $$a_{t}\leftarrow\operatorname*{argmax}_{a\in\mathcal{A}}\widehat{r}_{t}(a)-\widehat{\zeta}_{t}(a)+(1+\eta)\epsilon_{t}(a).$$
    Solve for a robust contract to induce $a_{t}$,
    $$x_{t}\leftarrow\operatorname*{argmin}_{x\in\mathcal{X}^{a_{t}}(\varepsilon)}\widehat{P}_{t}(a_{t})\cdot x.$$
    Commit to contract $x_{t}$ and observe the agent’s action $a^{\prime}_{t}$, outcome $s_{t}$ and its reward $\iota_{t}(s_{t})$.
    Update the empirical estimation of the outcome distribution and reward function, $\widehat{P}_{t},\widehat{r}_{t}$.
    Set the confidence interval $\epsilon_{t}$ such that $\lVert P(a)-\widehat{P}_{t}(a)\rVert_{1}\leq\epsilon_{t}(a),\forall a\in\mathcal{A}$, with prob. $1-\delta$.
Algorithm 2 Contractual bandit learning with $\varepsilon$-margin contract sets
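The optimistic action-selection step of Algorithm 2 can be illustrated with a short Python sketch. It is a simplified illustration only: the estimated least payments $\widehat{\zeta}_{t}(a)$ are assumed to be given (the minimization over $\mathcal{X}^{a}(\varepsilon)$ is not implemented), $\eta$ is treated as a known constant, and the function name `select_action` and all numbers are hypothetical.

```python
def select_action(r_hat, zeta_hat, eps, eta):
    """Return (a_t, scores), where
    score(a) = r_hat(a) - zeta_hat(a) + (1 + eta) * eps(a)."""
    scores = [rh - zh + (1.0 + eta) * e
              for rh, zh, e in zip(r_hat, zeta_hat, eps)]
    a_t = max(range(len(scores)), key=lambda a: scores[a])
    return a_t, scores


# Made-up estimates for three actions; action 2 has been tried least often,
# so its confidence width eps is the largest.
r_hat = [0.3, 1.0, 0.7]
zeta_hat = [0.0, 0.4, 0.2]
eps = [0.05, 0.05, 0.30]

a_t, scores = select_action(r_hat, zeta_hat, eps, eta=1.0)
# Greedy choice that ignores the exploration bonus, for comparison.
greedy = max(range(3), key=lambda a: r_hat[a] - zeta_hat[a])
print(a_t, greedy)
```

In this example the greedy rule prefers action 1, while the optimistic rule picks the less explored action 2 because of its larger bonus, which is how the algorithm balances exploration and exploitation.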
Lemma 1.

Under Assumption 1, for each action $a\in\mathcal{A}$, given a robust contract set $\mathcal{X}^{a}(\varepsilon)$ with margin $\varepsilon$ and an empirical estimate $\widehat{P}(a)$ with $\lVert\widehat{P}(a)-P(a)\rVert_{1}\leq\epsilon$, let $\widehat{x},\widehat{\zeta}(a)$ be the minimizer and minimum objective value of LP (C.1). The following conditions are satisfied,

  1. The expected payment of $\widehat{x}$ is bounded as $0\leq P(a)\cdot\widehat{x}-\zeta(a)\leq\lambda^{-1}\varepsilon$.

  2. The estimated payment of $\widehat{x}$ is bounded as $-\eta\epsilon\leq\widehat{\zeta}(a)-\zeta(a)\leq\lambda^{-1}\varepsilon+\eta\epsilon$.

Lemma 2 (McDiarmid et al. [42]).

With $t$ i.i.d. samples of an $m$-dimensional distribution $Q$, we can construct a confidence ball $\mathcal{B}=\{Q\in\Delta^{m}:\lVert\widehat{Q}_{t}-Q\rVert_{1}\leq\sqrt{\frac{m\log(1/\delta)}{t}}\}$ such that $Q\in\mathcal{B}$ with prob. at least $1-\delta$.
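The confidence radius in Lemma 2 is simple to compute. The sketch below is a hedged Python illustration: it takes the radius exactly as written in the lemma, builds the empirical distribution from integer-coded samples, and uses hypothetical helper names `empirical_distribution` and `l1_radius`.

```python
import math
from collections import Counter


def empirical_distribution(samples, m):
    """Empirical frequencies of outcomes 0..m-1 in the sample list."""
    counts = Counter(samples)
    t = len(samples)
    return [counts[j] / t for j in range(m)]


def l1_radius(m, t, delta):
    """L1 confidence radius sqrt(m * log(1/delta) / t) from Lemma 2."""
    return math.sqrt(m * math.log(1.0 / delta) / t)


q_hat = empirical_distribution([0, 1, 1, 1], m=2)
radius_t = l1_radius(m=2, t=100, delta=0.05)
radius_4t = l1_radius(m=2, t=400, delta=0.05)
print(q_hat, radius_t, radius_4t)
```

Quadrupling the number of samples halves the radius, which is the $t^{-1/2}$ rate that the regret analysis relies on.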

Proof of Theorem 2.

At a high level, Algorithm 2 proceeds by following the upper confidence bound of the expected “profit” of each action $z(a):=r(a)-\zeta(a)$, which shrinks at the rate of $t^{-1/2}$, based on Lemma 1. This enables us to apply the upper confidence bound analysis from online learning problems to the contractual bandit learning problem.

That is, we construct a variable $\widetilde{z}_{t}(a):=\widetilde{r}_{t}(a)-\widehat{\zeta}_{t}(a)+(1+\eta)\epsilon_{t}(a)+\lambda^{-1}\varepsilon$ as an optimistic estimate of $z(a)$. First, notice that Algorithm 2 is equivalent to following the action $a_{t}=\operatorname{argmax}_{a\in\mathcal{A}}\widetilde{z}_{t}(a)$ at each round $t$, as $\lambda^{-1}\varepsilon$ is constant and does not affect the optimization. Second, we show that the difference between $\widetilde{z}_{t}(a)$ and $z(a)$ satisfies the following inequality with probability at least $1-\delta$,

$$0\leq\widetilde{z}_{t}(a)-z(a)\leq(2+2\eta)\epsilon_{t}(a)+\lambda^{-1}\varepsilon,\quad\forall a\in\mathcal{A},\ t\in[T]. \tag{C.2}$$

On the event that $\lVert\widehat{P}_{t}(a)-P(a)\rVert_{1}\leq\epsilon_{t}(a)$, we can derive that $\left|\widehat{r}_{t}(a)-r(a)\right|=\left|[\widehat{P}_{t}(a)-P(a)]\cdot\iota\right|\leq\epsilon_{t}(a)$ and $-\eta\epsilon_{t}(a)\leq\widehat{\zeta}_{t}(a)-\zeta(a)\leq\lambda^{-1}\varepsilon+\eta\epsilon_{t}(a)$ by Lemma 1. This implies that $-(1+\eta)\epsilon_{t}(a)-\lambda^{-1}\varepsilon\leq\widehat{r}_{t}(a)-\widehat{\zeta}_{t}(a)-r(a)+\zeta(a)\leq(1+\eta)\epsilon_{t}(a)$, which leads to Equation (C.2).
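The optimistic index and the resulting action choice can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: the function names and the numeric statistics are hypothetical, and the reward estimate is written as `r_hat`, following the empirical quantity used in the derivation of Equation (C.2).

```python
def optimistic_profit(r_hat, zeta_hat, eps_t, eta, lam, eps_robust):
    """Optimistic index: r_hat - zeta_hat + (1 + eta) * eps_t + eps_robust / lam."""
    return r_hat - zeta_hat + (1.0 + eta) * eps_t + eps_robust / lam

def select_action(stats, eta, lam, eps_robust):
    """Return the action with the largest optimistic index.

    stats maps each action to a tuple (r_hat, zeta_hat, eps_t).
    """
    return max(
        stats,
        key=lambda a: optimistic_profit(*stats[a], eta, lam, eps_robust),
    )

# Hypothetical statistics: "a2" looks slightly worse on average but has been
# explored less, so its confidence width eps_t is larger.
stats = {"a1": (0.6, 0.2, 0.05), "a2": (0.5, 0.2, 0.30)}
eta, lam, eps_robust = 1.0, 0.5, 0.01
chosen = select_action(stats, eta, lam, eps_robust)
print(chosen)
```

Because the term `eps_robust / lam` is the same for every action, dropping it from the index would not change the selected action, which is the observation made in the proof.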

Under the event that $\lVert\widehat{P}_{t}(a)-P(a)\rVert_{1}\leq\epsilon_{t}(a),\ \forall a\in\mathcal{A}$, we have $a_{t}=a^{\prime}_{t},\ \forall t\in[T]$, and the expected regret of Algorithm 2 in the $T$ rounds is as follows,

$$\operatorname{Reg}(T)=\max_{a^{*}\in\mathcal{A}}\sum_{t=1}^{T}[r(a^{*})-\zeta(a^{*})]-\sum_{t=1}^{T}[r(a_{t})-P(a_{t})\cdot x_{t}].$$

We decompose the regret into two cases, depending on whether the optimal arm $a^{*}$ is played at round $t$:

When $a_{t}=a^{*}$, we have $[r(a^{*})-\zeta(a^{*})]-[r(a_{t})-P(a_{t})\cdot x_{t}]=P(a_{t})\cdot x_{t}-\zeta(a^{*})\leq\lambda^{-1}\varepsilon$ by Lemma 1.

When $a_{t}\neq a^{*}$, we have

$$
\begin{aligned}
[r(a^{*})-\zeta(a^{*})]-[r(a_{t})-P(a_{t})\cdot x_{t}] &\leq\widetilde{z}_{t}(a^{*})-r(a_{t})+P(a_{t})\cdot x_{t}\\
&\leq\widetilde{z}_{t}(a_{t})-r(a_{t})+\zeta(a_{t})+\lambda^{-1}\varepsilon\\
&=\widetilde{z}_{t}(a_{t})-z(a_{t})+\lambda^{-1}\varepsilon\\
&\leq(2+2\eta)\epsilon_{t}(a_{t})+2\lambda^{-1}\varepsilon,
\end{aligned}
$$

where the first inequality follows from Equation (C.2), which gives $z(a^{*})\leq\widetilde{z}_{t}(a^{*})$; the second inequality uses the fact that $\widetilde{z}_{t}(a^{*})\leq\widetilde{z}_{t}(a_{t})$ and that $P(a_{t})\cdot x_{t}\leq\zeta(a_{t})+\lambda^{-1}\varepsilon$ from Lemma 1; the third inequality follows from Equation (C.2) applied to $a_{t}$, which gives $\widetilde{z}_{t}(a_{t})-z(a_{t})\leq(2+2\eta)\epsilon_{t}(a_{t})+\lambda^{-1}\varepsilon$.

It remains to bound the total regret based on the exact choice of $\epsilon_{t}$ in each setup.

The Case of Finite Action Space.

Consider the case where the action space $\mathcal{A}$ is finite. For any action $a\in\mathcal{A}$, we denote by $N_{t}(a)$ the number of times action $a$ has been taken. By Lemma 2, we can set $\epsilon_{t}(a)=\sqrt{\frac{|\mathcal{S}|\log(T|\mathcal{A}|/\delta)}{N_{t}(a)}}$ such that the empirical estimate of the outcome distribution $\widehat{P}_{t}$ satisfies $\lVert\widehat{P}_{t}(a)-P(a)\rVert_{1}\leq\epsilon_{t}(a)$ with probability at least $1-\frac{\delta}{T|\mathcal{A}|}$. Thus, by the union bound, with probability $1-\delta$, the expected regret can be bounded as follows,

$$
\begin{aligned}
\operatorname{Reg}(T) &\leq\sum_{t=1}^{T}\left[(2+2\eta)\sqrt{\frac{|\mathcal{S}|\log(T|\mathcal{A}|/\delta)}{N_{t}(a_{t})}}+2\lambda^{-1}\varepsilon\right]\\
&\leq(4+4\eta)\sqrt{|\mathcal{S}|\log(T|\mathcal{A}|/\delta)}\sum_{a\in\mathcal{A}}\sqrt{N_{T}(a)}+2\lambda^{-1}\varepsilon T\\
&\leq(4+4\eta)\sqrt{|\mathcal{S}|\log(T|\mathcal{A}|/\delta)}\sqrt{|\mathcal{A}|T}+2\lambda^{-1}\varepsilon T\\
&=O\left(\eta\sqrt{T|\mathcal{A}||\mathcal{S}|\log(T|\mathcal{A}|/\delta)}+\varepsilon T/\lambda\right),
\end{aligned}
$$

where the first inequality uses the fact that the loss incurred when $a_{t}\neq a^{*}$ is at least as large as the loss when $a_{t}=a^{*}$; the second inequality regroups the sum by action and uses $\sum_{n=1}^{N}n^{-1/2}\leq 2\sqrt{N}$; the third inequality applies the Cauchy-Schwarz inequality and uses the fact that $\sum_{a\in\mathcal{A}}N_{T}(a)=T$.
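The two summation steps above can be checked numerically on a toy action sequence. The sketch below is illustrative only: the sequence is hypothetical, and $N_{t}(a_{t})$ is assumed to count the pulls of $a_{t}$ up to and including round $t$, so that the denominator is never zero.

```python
import math

def sum_inverse_sqrt_counts(actions):
    """Return sum_t 1/sqrt(N_t(a_t)) and the final pull counts N_T(a).

    N_t(a_t) is taken to include the pull at round t (an assumption of this sketch).
    """
    counts = {}
    total = 0.0
    for a in actions:
        counts[a] = counts.get(a, 0) + 1
        total += 1.0 / math.sqrt(counts[a])
    return total, counts

actions = [0, 1, 0, 2, 0, 1, 0, 0, 2, 1]   # hypothetical play sequence, T = 10
total, counts = sum_inverse_sqrt_counts(actions)

# Step 1: regroup by action and use sum_{n=1}^{N} n^{-1/2} <= 2 sqrt(N).
per_action_bound = 2.0 * sum(math.sqrt(n) for n in counts.values())

# Step 2: Cauchy-Schwarz, sum_a sqrt(N_T(a)) <= sqrt(|A| * T).
worst_case_bound = 2.0 * math.sqrt(len(counts) * len(actions))

print(total, per_action_bound, worst_case_bound)
```

The second bound no longer depends on how the pulls are distributed across actions, which is what yields the $\sqrt{|\mathcal{A}|T}$ factor in the final rate.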

The Case of Infinite Action Space with Linear Context.

In the case when the action space $\mathcal{A}\subset\mathbb{R}^{d}$ is infinite, the outcome distribution is $P(a)=a^{\top}\theta$ for some unknown parameter $\theta\in\mathbb{R}^{d\times m}$. Let $\widehat{\theta}_{t}=\Sigma_{t}^{-1}\sum_{\tau=1}^{t}s_{\tau}a_{\tau}$ with $\Sigma_{t}=\lambda I+\sum_{\tau=1}^{t}a_{\tau}a_{\tau}^{\top}$. By Lemma 11 of Abbasi-Yadkori et al. [4], with probability at least $1-\delta$, we have $\lVert\widehat{\theta}_{t}-\theta^{*}\rVert_{\Sigma_{t}}\leq\sqrt{\beta_{t}}$, where $\beta_{t}=\sigma^{2}(2+4d\log(T+1)+8\log(4/\delta))\,d\log(1+\frac{T}{d\sigma^{2}})$. We can set $\epsilon_{t}=\sqrt{\beta_{t}\lVert a\rVert_{\Sigma_{t}^{-1}}}$ such that the empirical estimate of the outcome distribution $\widehat{P}_{t}=a^{\top}\widehat{\theta}_{t}$ satisfies

$$\lVert\widehat{P}_{t}(a)-P(a)\rVert_{1}=\lVert(\Sigma_{t}^{-1/2}a)^{\top}\Sigma_{t}^{1/2}(\widehat{\theta}_{t}-\theta^{*})\rVert_{1}\leq\lVert\Sigma_{t}^{-1/2}a\rVert\,\lVert\Sigma_{t}^{1/2}(\widehat{\theta}_{t}-\theta^{*})\rVert\leq\sqrt{\beta_{t}\lVert a\rVert_{\Sigma_{t}^{-1}}}=\epsilon_{t}(a).$$

Thus, by the union bound, with probability $1-\delta$, the expected regret can be bounded as follows,

$$
\begin{aligned}
\operatorname{Reg}(T) &\leq\sum_{t=1}^{T}\left[(2+2\eta)\sqrt{\beta_{t}\lVert a\rVert_{\Sigma_{t}^{-1}}}+2\lambda^{-1}\varepsilon\right]\\
&\leq(2+2\eta)\sqrt{T\sum_{t=1}^{T}\beta_{t}^{2}\lVert a\rVert_{\Sigma_{t}^{-1}}^{2}}+2\lambda^{-1}\varepsilon T\\
&=O\Big((1+\eta)\sqrt{T}\big(d\log(T)+\log(1/\delta)\big)+\varepsilon T/\lambda\Big),
\end{aligned}
$$

where the first inequality uses the fact that the loss incurred when $a_{t}\neq a^{*}$ is at least as large as the loss when $a_{t}=a^{*}$; the second inequality follows from the Cauchy-Schwarz inequality; the final equality substitutes the definition of $\beta_{t}$ and bounds the remaining sum of squared norms $\lVert a\rVert_{\Sigma_{t}^{-1}}^{2}$ over the $T$ rounds by $O(d\log T)$, a standard elliptical-potential bound for linear bandits.

Proof of Lemma 1.

Pick an arbitrary $a\in\mathcal{A}$. We have $\mathcal{X}^{a}(\varepsilon)=\{x:[P(a)-P(a^{\prime})]\cdot x\geq c(a)-c(a^{\prime})+\varepsilon,\ \forall a^{\prime}\neq a\}$ and an empirical estimate $\widehat{P}(a)$ with $\lVert\widehat{P}(a)-P(a)\rVert_{1}\leq\epsilon$. With $\mathcal{X}^{a}(\varepsilon)$ and $\widehat{P}(a)$, LP (C.1) solves for a robust contract $\widehat{x}$. Since $\widehat{x}\in\mathcal{X}^{a}(\varepsilon)\subseteq\mathcal{X}^{a}$, the agent’s best response to $\widehat{x}$ is to take action $a$.

First, we derive a bound for an intermediate value $\min_{x\in\mathcal{X}^{a}(\varepsilon)}P(a)\cdot x$. Recall that Assumption 1 guarantees that, for any $x\in\mathcal{X}^{a}$, there exists $\overline{x}=x+\lambda^{-1}\varepsilon e$ such that $\overline{x}\in\mathcal{X}^{a}(\varepsilon)$. Let $x^{*}=\operatorname{argmin}_{x\in\mathcal{X}^{a}}P(a)\cdot x$ and $\overline{x}^{*}=\operatorname{argmin}_{x\in\mathcal{X}^{a}(\varepsilon)}P(a)\cdot x$. We have

$$\min_{x\in\mathcal{X}^{a}(\varepsilon)}P(a)\cdot x=\min_{x\in\mathcal{X}^{a}}P(a)\cdot x+P(a)\cdot(\overline{x}^{*}-x^{*}).$$

Since $\lVert P(a)\rVert_{1}=1$ and $\lVert\overline{x}^{*}-x^{*}\rVert_{\infty}\leq\lambda^{-1}\varepsilon$, we have $\left|P(a)\cdot(\overline{x}^{*}-x^{*})\right|\leq\lVert P(a)\rVert_{1}\lVert\overline{x}^{*}-x^{*}\rVert_{\infty}\leq\lambda^{-1}\varepsilon$. In addition, as $\mathcal{X}^{a}(\varepsilon)\subseteq\mathcal{X}^{a}$, $\min_{x\in\mathcal{X}^{a}(\varepsilon)}P(a)\cdot x\geq\min_{x\in\mathcal{X}^{a}}P(a)\cdot x=\zeta(a)$. Hence, we have

$$0\leq\min_{x\in\mathcal{X}^{a}(\varepsilon)}P(a)\cdot x-\zeta(a)\leq\lambda^{-1}\varepsilon. \tag{C.3}$$
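The Hölder step behind Equation (C.3) can be illustrated with a small numeric sketch. All values below are hypothetical, and the direction `e` is simply assumed to have entries in $[0,1]$ so that the shift has $\ell_\infty$ norm at most $\lambda^{-1}\varepsilon$; the paper's Assumption 1 may specify `e` differently.

```python
def expected_payment(p, x):
    """Expected payment P(a) . x of contract x under outcome distribution p."""
    return sum(pi * xi for pi, xi in zip(p, x))

p = [0.2, 0.5, 0.3]        # hypothetical outcome distribution P(a)
x_star = [0.0, 0.4, 1.0]   # hypothetical contract in X^a
e = [1.0, 0.5, 0.0]        # assumed shift direction with entries in [0, 1]
lam, eps = 0.5, 0.01
shift = eps / lam          # lambda^{-1} * epsilon

# Shifted contract x_bar = x_star + lambda^{-1} * epsilon * e.
x_bar = [xi + shift * ei for xi, ei in zip(x_star, e)]

gap = expected_payment(p, x_bar) - expected_payment(p, x_star)
l1_norm_p = sum(abs(pi) for pi in p)
linf_shift = max(abs(xb - xs) for xb, xs in zip(x_bar, x_star))

# Hoelder: |P(a) . (x_bar - x_star)| <= ||P(a)||_1 * ||x_bar - x_star||_inf.
print(gap, l1_norm_p * linf_shift, shift)
```

Since $P(a)$ is a probability vector, the extra expected payment caused by the shift can never exceed the largest per-outcome increase, which is at most $\lambda^{-1}\varepsilon$.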

Notice that $\widehat{x} \in \mathcal{X}^a(\varepsilon)$ and is not necessarily the minimizer of $P(a) \cdot x$ over $\mathcal{X}^a$. By Equation (C.3), we get the first condition of this lemma:

$$\zeta(a) = \min_{x \in \mathcal{X}^a} P(a) \cdot x \leq P(a) \cdot \widehat{x} \leq \min_{x \in \mathcal{X}^a(\varepsilon)} P(a) \cdot x \leq \zeta(a) + \lambda^{-1}\varepsilon.$$

We now bound the estimated payment $\widehat{\zeta}(a) = \widehat{P}(a) \cdot \widehat{x}$. Since $\lVert P(a) - \widetilde{P} \rVert_1 \leq \epsilon$ and $\lVert \widetilde{x}^a \rVert_\infty \leq \eta$ by the bounded contract space assumption, we have

$$\left| \widehat{\zeta}(a) - P(a) \cdot \widehat{x} \right| = \left| [P(a) - \widehat{P}(a)] \cdot \widehat{x} \right| \leq \lVert P(a) - \widehat{P}(a) \rVert_1 \lVert \widehat{x} \rVert_\infty \leq \eta\epsilon. \tag{C.4}$$

Therefore, combining Equation (C.4) and the first condition, we get the second condition of this lemma,

$$-\eta\epsilon \leq \widehat{\zeta}(a) - \zeta(a) \leq \lambda^{-1}\varepsilon + \eta\epsilon.$$
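As a quick numerical sanity check of the Hölder step behind Equation (C.4), the following Python sketch samples random distributions and bounded contracts and verifies $|[P(a) - \widehat{P}(a)] \cdot \widehat{x}| \leq \lVert P(a) - \widehat{P}(a) \rVert_1 \lVert \widehat{x} \rVert_\infty$. The helper names, the outcome dimension 4, and the bound `eta = 5.0` are illustrative assumptions and not part of the analysis.

```python
import random


def holder_gap(p, p_hat, x_hat):
    """Return (lhs, rhs) of |(p - p_hat) . x_hat| <= ||p - p_hat||_1 * ||x_hat||_inf."""
    lhs = abs(sum((pi - qi) * xi for pi, qi, xi in zip(p, p_hat, x_hat)))
    l1 = sum(abs(pi - qi) for pi, qi in zip(p, p_hat))
    linf = max(abs(xi) for xi in x_hat)
    return lhs, l1 * linf


def random_distribution(rng, m):
    """Sample a probability vector of length m (illustrative helper)."""
    weights = [rng.random() for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
eta = 5.0  # assumed bound on the contract space, ||x||_inf <= eta
checks = []
for _ in range(1000):
    p = random_distribution(rng, 4)
    p_hat = random_distribution(rng, 4)
    x_hat = [rng.uniform(0.0, eta) for _ in range(4)]
    lhs, rhs = holder_gap(p, p_hat, x_hat)
    checks.append(lhs <= rhs + 1e-12)  # small slack for floating point

print(all(checks))
```

The check only illustrates the inequality on random instances; the proof itself does not rely on it.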

C.2 Solving Multi-armed Bandits under Direct Incentives

Multi-Armed Bandits under Direct Incentives

This is perhaps the simplest yet natural class of contractual online learning problems. The principal is unable to directly pull arms but is able to receive the reward from the arm pulled by the agent. In this problem, we have the action space $\mathcal{A} = [N]$ and $r, c : [N] \to [0,1]$ specifying the principal's reward and the agent's cost of pulling each arm $i \in [N]$. At the beginning of each round $t$, the principal sets a contract $x_t : [N] \to \mathbb{R}_+$ and the agent accordingly decides its best response $i_t = \arg\max_{i \in [N]} \big[ x_t(i) - c(i) \big]$. At the end of each round $t$, the principal is able to observe the exact arm $i_t$ taken by the agent as well as the noisy bandit feedback on its corresponding reward $\widetilde{r}_t(i) = r(i) + \epsilon_t$, where $\epsilon_t$ is zero-mean, i.i.d. $\sigma$-sub-Gaussian noise.
Finally, the learning goal of the principal is to minimize the regret, $\operatorname{Reg}(T) = T \cdot \max_{i \in [N]} \big[ r(i) - c(i) \big] - \sum_{t \in [T]} [r(i_t) - x_t(i_t)]$.
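To make the interaction protocol concrete, here is a minimal Python sketch of the agent's best response and of the regret defined above, computed with mean rewards (the noise $\epsilon_t$ is omitted). The function names, the tie-breaking rule (lowest index wins), and the numerical instance are assumptions for illustration only.

```python
def best_response(contract, cost):
    """Agent picks the arm maximizing payment minus cost.

    Ties go to the lowest index (an assumption; the text does not fix a rule).
    """
    utilities = [x - c for x, c in zip(contract, cost)]
    return max(range(len(cost)), key=lambda i: utilities[i])


def regret(reward, cost, contracts):
    """Reg(T) = T * max_i [r(i) - c(i)] - sum_t [r(i_t) - x_t(i_t)]."""
    benchmark = max(r - c for r, c in zip(reward, cost))
    collected = 0.0
    for x_t in contracts:
        i_t = best_response(x_t, cost)
        collected += reward[i_t] - x_t[i_t]
    return len(contracts) * benchmark - collected


# Hypothetical instance with three arms; arm 2 maximizes r(i) - c(i).
reward = [0.2, 0.9, 0.6]
cost = [0.0, 0.5, 0.1]
# Pay arm 2 slightly more than its cost so that it is strictly preferred.
contracts = [[0.0, 0.0, 0.15]] * 10
print(best_response(contracts[0], cost), regret(reward, cost, contracts))
```

In this instance the principal overpays arm 2 by a fixed margin each round, so the regret grows linearly in the number of rounds; shrinking that margin is exactly what the learning procedure below is for.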

Lemma 3 (Binary Search for Finite Arms).

There exists an $O(|\mathcal{A}| \log(1/\varepsilon))$-learning procedure for multi-armed bandits under direct incentives.

Proof of Lemma 3.

We show an explicit construction of a $\chi(\varepsilon)$-learning procedure for this problem. Observe that, if we can learn an estimate $\widehat{c}$ with $|\widehat{c}(a) - c(a)| \leq \varepsilon/2, \forall a \in \mathcal{A}$, we can set the least payment contract $x$ as follows: $x(a) = \widehat{c}(a) + \varepsilon/2$, $x(a') = 0, \forall a' \neq a$. Since $x(a) - c(a) \geq 0 \geq x(a') - c(a')$, it is optimal for the agent to respond with action $a$. Moreover, the payment is minimized up to $\varepsilon$, as $x(a) - \varepsilon \leq c(a) + \varepsilon/2 + \varepsilon/2 - \varepsilon = c(a) = \zeta(a)$.

So it only remains to learn an estimate with $|\widehat{c}(a) - c(a)| \leq \varepsilon/2, \forall a \in \mathcal{A}$. This can be achieved through binary search. For any action $a$, we maintain a cost lower bound $c^-(a)$ and upper bound $c^+(a)$, initialized to $0$ and $1$ since $c(a) \in [0,1]$. At each round, the algorithm sets the contract $x$ with $x(a) = \frac{c^-(a) + c^+(a)}{2}$, $x(a') = 0, \forall a' \neq a$. If the agent takes the action $a$, the payment already covers the cost, so the algorithm updates the upper bound $c^+(a) \leftarrow \frac{c^-(a) + c^+(a)}{2}$. Otherwise, it updates the lower bound $c^-(a) \leftarrow \frac{c^-(a) + c^+(a)}{2}$.
In $\log(1/\varepsilon) + 1$ rounds, the algorithm is guaranteed to have $c^+(a) - c^-(a) \leq \varepsilon/2$ and thus an estimate with $|\widehat{c}(a) - c(a)| \leq \varepsilon/2$. Conducting the binary search for every action, the total sample complexity is $O(|\mathcal{A}| \log(1/\varepsilon))$.
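The binary search can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: arm 0 is assumed to be a zero-cost outside option, the agent is assumed to break ties in favor of the paid arm, and `make_agent` and `estimate_costs` are hypothetical names. When the agent pulls the target arm, the payment covers its cost, so the upper bound is lowered; otherwise the lower bound is raised.

```python
import math


def make_agent(cost):
    """Hypothetical agent oracle returning the best-response arm for a contract.

    Ties in utility are broken in favor of the higher-paid arm (an assumption).
    """
    def respond(contract):
        return max(
            range(len(cost)),
            key=lambda i: (contract[i] - cost[i], contract[i]),
        )
    return respond


def estimate_costs(respond, n_arms, eps):
    """Estimate every arm's cost to within eps/2 by per-arm binary search."""
    rounds = math.ceil(math.log2(1.0 / eps)) + 1
    estimates, queries = [], 0
    for a in range(n_arms):
        lo, hi = 0.0, 1.0  # costs are assumed to lie in [0, 1]
        for _ in range(rounds):
            mid = (lo + hi) / 2.0
            contract = [0.0] * n_arms
            contract[a] = mid  # pay only the target arm
            queries += 1
            if respond(contract) == a:
                hi = mid  # payment covers the cost: c(a) <= mid
            else:
                lo = mid  # payment is too low: c(a) > mid
        estimates.append((lo + hi) / 2.0)
    return estimates, queries


cost = [0.0, 0.3, 0.72, 1.0]  # arm 0 is the assumed zero-cost outside option
estimates, queries = estimate_costs(make_agent(cost), len(cost), eps=0.01)
print(estimates, queries)
```

The number of queries is the number of arms times the number of halving rounds, matching the $O(|\mathcal{A}| \log(1/\varepsilon))$ count in the proof.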

C.3 Solving Linear Bandits under Direct Incentives

Linear Bandits under Direct Incentives

In this problem, we have the action space $\mathcal{A} \subset \mathbb{R}^d$ (composed of the context vectors) and $r, c : \mathcal{A} \to [0,1]$ specifying the principal's reward and the agent's cost of choosing each context $a \in \mathcal{A}$. At the beginning of each round $t$, the principal observes a set of contexts $\mathcal{A}_t \subset \mathcal{A}$ and sets a contract $x_t : \mathcal{A}_t \to \mathbb{R}_+$. The agent accordingly decides its best response $a_t = \arg\max_{a \in \mathcal{A}_t} \big[ x_t(a) - c(a) \big]$, where $c(a) = a^\top \gamma$.
At the end of each round $t$, the principal is able to observe the exact arm $a_t$ taken by the agent as well as the noisy bandit feedback on its corresponding reward $r(a_t) = a_t^\top \theta + \epsilon_t$, where $\epsilon_t$ is zero-mean, i.i.d. $\sigma$-sub-Gaussian noise. $\theta, \gamma$ are fixed, unknown parameters to be learnt. Without loss of generality, we assume $\lVert \theta \rVert \leq 1$, $\lVert \gamma \rVert \leq 1$, $\lVert a_t \rVert \leq \sqrt{d}$ by coordinate transformation.
Finally, the learning goal of the principal is to minimize the regret, $\operatorname{Reg}(T) = \sum_{t=1}^{T} \big[ (a^*_t)^\top \theta^* - (a^*_t)^\top \gamma^* - a_t^\top \theta^* + x_t(a_t) \big]$, where $a^*_t = \arg\max_{a_t \in \mathcal{A}_t} \{ a_t^\top \theta^* - a_t^\top \gamma^* \}$ is the optimal arm at round $t$.

Lemma 4 (Contextual Search for Infinite Arms).

There exists an $O(d \log(1/\varepsilon))$-learning procedure for linear bandits under direct incentives.

Proof of Lemma 4.

We show an explicit construction of a $\chi(\varepsilon)$-learning procedure for the problem with the agent's best response function $h^*(x) = \arg\max_{a \in \mathcal{A}} \{ x(a) - a^\top \gamma^* \}$ for some parameter $\gamma^* \in \mathbb{R}^d$, $\lVert \gamma^* \rVert \leq 1$ and action set $\mathcal{A} \subset \mathbb{R}^d$. Observe that, if we can learn an estimate $\gamma$ such that $\lVert \gamma - \gamma^* \rVert \leq \frac{1}{2t}$, we can set the least payment rule $x$ such that $x(a) = \gamma^\top a + \frac{1}{2t}$, $x(a') = 0, \forall a' \neq a$.
Since $x(a) - c(a) \geq \frac{1}{2t} - \lVert \gamma - \gamma^* \rVert \cdot \lVert a \rVert \geq 0 \geq x(a') - c(a')$, we have $h(x) = a$. Moreover, $x(a) - c(a) \leq \frac{1}{2t} + \lVert \gamma - \gamma^* \rVert \cdot \lVert a \rVert \leq \frac{1}{t}$.

To learn an estimate $\gamma$ such that $\lVert \gamma - \gamma^* \rVert \leq \frac{1}{2t}$, we adopt the contextual search algorithm under symmetric loss [38]. At a high level, we use the constant regret guarantee of the contextual search algorithm against an adversarially chosen context at every round, and we present a simple argument assuming $\mathcal{A}_t = \mathcal{A}$ that allows us to pick an arbitrary context for the contextual search algorithm. Specifically, consider a contextual search problem with the unknown vector $\gamma^* \in \mathbb{R}^d$ and $\lVert \gamma^* \rVert \leq 1$. Fix any unit vector $e \in \mathcal{A}$. In $O(t)$ rounds, the contextual search algorithm can determine a knowledge set $\Gamma_t$ of all feasible $\gamma$ such that $\max_{\gamma \in \Gamma_t} |\gamma^\top e - (\gamma^*)^\top e| \leq 2^{-t}$.
Repeating this search procedure for all $d$ linearly independent directions in $\mathcal{A}$, we obtain a knowledge set of all feasible $\gamma$ such that $\max_{\gamma \in \Gamma_t} |\gamma^\top a - (\gamma^*)^\top a| \leq 2^{-t}, \forall a \in \mathcal{A}$, since any action $a$ can be decomposed as a convex combination of the $d$ linearly independent unit vectors. The total sample complexity is $O(d \log(1/\varepsilon))$. ∎
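The per-direction search can be illustrated with a much simpler stand-in for the contextual search algorithm of [38]. The Python sketch below assumes that the standard basis vectors $e_1, \dots, e_d$ and a zero-cost outside option are available actions and that $\gamma^* \in [0,1]^d$, so plain bisection along each direction suffices; the oracle `accepts` and all function names are hypothetical.

```python
import math


def make_oracle(gamma_star):
    """Hypothetical agent: takes direction e_j iff the payment covers gamma_star[j]."""
    def accepts(j, payment):
        return payment >= gamma_star[j]
    return accepts


def learn_gamma(accepts, d, eps):
    """Bisection along each basis direction; returns (gamma_hat, number of queries)."""
    rounds = math.ceil(math.log2(1.0 / eps))
    gamma_hat, queries = [], 0
    for j in range(d):
        lo, hi = 0.0, 1.0  # assumed range of each coordinate of gamma_star
        for _ in range(rounds):
            mid = (lo + hi) / 2.0
            queries += 1
            if accepts(j, mid):
                hi = mid  # payment covers the cost along e_j
            else:
                lo = mid
        gamma_hat.append((lo + hi) / 2.0)
    return gamma_hat, queries


def predicted_cost(a, gamma_hat):
    """Estimated cost a^T gamma_hat of an arbitrary action a."""
    return sum(ai * gi for ai, gi in zip(a, gamma_hat))


gamma_star = [0.2, 0.55, 0.9]
gamma_hat, queries = learn_gamma(make_oracle(gamma_star), d=3, eps=1e-3)
print(gamma_hat, queries, predicted_cost([0.5, 0.3, 0.2], gamma_hat))
```

Because every coordinate is learnt to the same precision, the cost of any convex combination of the basis directions inherits that precision, and the number of queries is $d$ times the number of halving rounds regardless of how many actions there are.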

C.4 Solving General Contractual Bandit Problems

Lemma 5.

Under Assumption 2 and given $\widehat{P}$ that satisfies $\lVert \widehat{P} - P \rVert_{1,\infty} \leq \varepsilon/\eta$, we can construct an $O\big( |\mathcal{A}|^2 \log(|\mathcal{A}| \eta / \varepsilon) \big)$-learning procedure for general contractual bandit learning problems.

Proof of Lemma 5.

We denote $d(a, a') := c(a) - c(a')$. The learning procedure queries certain contracts in a binary search fashion in order to obtain an estimate $\widehat{d}$ with bounded error $|\widehat{d}(a, a') - d(a, a')|, \forall a, a'$, from which we can construct the $\varepsilon$-margin contract set $\widehat{\mathcal{X}}^a$ for any action $a \in \mathcal{A}$ and thereby compute an almost least payment contract according to Lemma 1. For precise analysis, let $\lVert \widehat{P} - P \rVert_{1,\infty} \leq \epsilon = \frac{\varepsilon}{10\eta}$. We describe the full procedure in Algorithm 3.

Input: Action set $\mathcal{A}$, estimated parameters $\widehat{P}$.
Output: Robust contract sets $\widehat{\mathcal{X}}^a, \forall a \in \mathcal{A}$.
$\widehat{d}(a, a') \leftarrow \infty, \forall a, a' \in \mathcal{A}$.
Set binary search precision $\epsilon = \frac{\varepsilon}{10 \eta |\mathcal{A}|}$.
for each $a \in \mathcal{A}$ do
       Construct a contract $x^a$ that induces the action $a$.
       for each $a' \neq a \in \mathcal{A}'$ do
             Construct a contract $x^{a'}$ that induces the action $a'$.
             Binary search for parameter $\alpha \in (0,1)$ such that $x = \alpha x^a + (1-\alpha) x^{a'}$ induces action $a$, while $x' = (\alpha+\epsilon) x^a + (1-\alpha-\epsilon) x^{a'}$ induces action $a''$.
             Use $x, x'$ to solve for $\widehat{d}(a, a'') \leftarrow \big[ \widehat{P}(a) - \widehat{P}(a'') \big] \cdot x^a$.
for each $a, a' \in \mathcal{A}$ do
       if $\widehat{d}(a, a') = \infty$ then
             $\widehat{d}(a, a') \leftarrow \min_{\mathcal{P}} \sum_{(a_i, a_j) \in \mathcal{P}} \widehat{d}(a_i, a_j)$, where $\mathcal{P}$ is a choice of path from $a$ to $a'$.
return $\widehat{\mathcal{X}}^a = \{ x \in \mathcal{X} : \big[ \widehat{P}(a) - \widehat{P}(a') \big] \cdot x \geq \widehat{d}(a, a') + \varepsilon/2, \forall a \neq a' \}$ for each $a \in \mathcal{A}$
Algorithm 3 $\chi(\varepsilon)$-Learning Procedure in Contractual Bandit Learning
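The path-completion step of Algorithm 3 (filling each missing $\widehat{d}(a, a')$ with the cheapest sum along a path from $a$ to $a'$) can be computed with an all-pairs shortest-path pass. The Python sketch below uses Floyd-Warshall and assumes the measured differences contain no negative cycle, which holds for exact differences $d(a, a') = c(a) - c(a')$; the function name and the three-action instance are hypothetical.

```python
import math

INF = math.inf


def complete_differences(d_hat):
    """Fill missing (infinite) entries of d_hat with the cheapest path sum.

    Measured entries are returned unchanged; only entries equal to infinity
    are replaced. Assumes the measured entries contain no negative cycle.
    """
    n = len(d_hat)
    dist = [row[:] for row in d_hat]
    for i in range(n):
        dist[i][i] = 0.0  # the empty path from a to a has cost difference 0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return [
        [dist[i][j] if math.isinf(d_hat[i][j]) else d_hat[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical instance with costs (0.1, 0.4, 0.8): pairs (0,1) and (1,2)
# were measured directly, while the pair (0,2) is missing.
d_hat = [
    [INF, -0.3, INF],
    [0.3, INF, -0.4],
    [INF, 0.4, INF],
]
completed = complete_differences(d_hat)
print(completed[0][2], completed[2][0])
```

With noisy estimates the error of a filled entry accumulates along the path, which is one reason the binary search precision in Algorithm 3 is scaled by $1/|\mathcal{A}|$.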

To prove its correctness, we start from the observation that for any two actions $a, a'$ with sufficiently small $\epsilon$, given two contracts $x^a, x^{a'}$ that respectively induce actions $a, a'$ and satisfy $\lVert x^a - x^{a'} \rVert_\infty \leq \epsilon$, we can obtain the estimate $\widehat{d}(a, a')$ such that $|\widehat{d}(a, a') - d(a, a')| < 3\epsilon\eta = 3\varepsilon/10$.
To see this, we introduce a contract $x^0 = \alpha x^a + (1-\alpha) x^{a'}$ for some $\alpha \in (0,1)$ such that $\big[ P(a) - P(a') \big] x^0 = d(a, a')$. Such $x^0$ must exist, since $\big[ P(a) - P(a') \big] x^a > d(a, a')$ and $\big[ P(a) - P(a') \big] x^{a'} < d(a, a')$.
Now let $\widehat{d}(a, a') = \big[ \widehat{P}(a) - \widehat{P}(a') \big] \cdot x^a$; we have

|d^(a,a)d(a,a)|^𝑑𝑎superscript𝑎𝑑𝑎superscript𝑎\displaystyle\quad\left|\widehat{d}(a,a^{\prime})-d(a,a^{\prime})\right|| over^ start_ARG italic_d end_ARG ( italic_a , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_d ( italic_a , italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) | =|[P(a)P(a)](xax0)+[P^(a)P(a)P^(a)+P(a)]xa|absentdelimited-[]𝑃𝑎𝑃superscript𝑎superscript𝑥𝑎superscript𝑥0delimited-[]^𝑃𝑎𝑃𝑎^𝑃superscript𝑎𝑃superscript𝑎superscript𝑥𝑎\displaystyle=\left|\big{[}P(a)-P(a^{\prime})\big{]}\cdot(x^{a}-x^{0})+\big{[}% \widehat{P}(a)-P(a)-\widehat{P}(a^{\prime})+P(a^{\prime})\big{]}\cdot x^{a}\right|= | [ italic_P ( italic_a ) - italic_P ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ] ⋅ ( italic_x start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT - italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) + [ over^ start_ARG italic_P end_ARG ( italic_a ) - italic_P ( italic_a ) - over^ start_ARG italic_P end_ARG ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) + italic_P ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ] ⋅ italic_x start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT |
(1α)|[P(a)P(a)](xaxa)|+2ϵηabsent1𝛼delimited-[]𝑃𝑎𝑃superscript𝑎superscript𝑥𝑎superscript𝑥superscript𝑎2italic-ϵ𝜂\displaystyle\leq(1-\alpha)\left|\big{[}P(a)-P(a^{\prime})\big{]}\cdot(x^{a}-x% ^{a^{\prime}})\right|+2\epsilon\eta≤ ( 1 - italic_α ) | [ italic_P ( italic_a ) - italic_P ( italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ] ⋅ ( italic_x start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT - italic_x start_POSTSUPERSCRIPT italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) | + 2 italic_ϵ italic_η
(1α)ϵ2+2ϵηabsent1𝛼superscriptitalic-ϵ22italic-ϵ𝜂\displaystyle\leq(1-\alpha)\epsilon^{2}+2\epsilon\eta≤ ( 1 - italic_α ) italic_ϵ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 2 italic_ϵ italic_η
<3ϵη=3ε/10.absent3italic-ϵ𝜂3𝜀10\displaystyle<3\epsilon\eta=3\varepsilon/10.< 3 italic_ϵ italic_η = 3 italic_ε / 10 .

To obtain such contracts $x^{a}, x^{a'}$, it suffices to run a binary search starting from two initial contracts $x^{a}, x^{a'}$ that induce actions $a, a'$, respectively. If the contract $x' = \frac{1}{2} x^{a} + \frac{1}{2} x^{a'}$ induces action $a$, we update $x^{a} \leftarrow x'$; otherwise, we update $x^{a'} \leftarrow x'$. After $\log(1/\epsilon)$ rounds, the distance between $x^{a}$ and $x^{a'}$ is bounded by $\epsilon$.
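The bisection step above can be sketched in Python as follows. This is a minimal illustration, not the paper's Algorithm 3. The oracle `best_response` (which returns the action the agent plays under a given contract) and the name `bisect_contracts` are hypothetical. The sketch also assumes that any midpoint not inducing the first action can be treated as lying on the other side of the boundary.

```python
from typing import Callable, Hashable, List, Tuple

Contract = List[float]


def bisect_contracts(
    x_a: Contract,
    x_b: Contract,
    action_a: Hashable,
    best_response: Callable[[Contract], Hashable],
    eps: float,
) -> Tuple[Contract, Contract]:
    """Shrink the segment [x_a, x_b] until its sup-norm length is at most eps.

    x_a is assumed to induce action_a and x_b to induce a different action.
    """
    x_a, x_b = list(x_a), list(x_b)
    while max(abs(u - v) for u, v in zip(x_a, x_b)) > eps:
        # Midpoint contract x' = (x_a + x_b) / 2.
        mid = [(u + v) / 2 for u, v in zip(x_a, x_b)]
        if best_response(mid) == action_a:
            x_a = mid  # boundary lies between mid and x_b
        else:
            x_b = mid  # boundary lies between x_a and mid
    return x_a, x_b
```

Each iteration halves the sup-norm distance and costs one interaction with the agent, which matches the logarithmic round count stated above.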
As described in Algorithm 3, we can run such a binary search for every pair of actions $a, a'$. Since two actions may not share a decision boundary, we identify all action pairs that do. For a pair that does not share a decision boundary, we can find a path through neighbouring actions to determine their cost difference given by $d$; the shortest such path can be found by Dijkstra's algorithm in $O(|\mathcal{A}|^{2})$ time. In the worst case, such a path can be as long as $|\mathcal{A}| - 2$, so we need to conduct the binary search to the precision level $\epsilon/|\mathcal{A}| = \frac{\varepsilon}{10\eta|\mathcal{A}|}$, which takes $O(\log(|\mathcal{A}|\eta/\varepsilon))$ rounds.
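To illustrate how cost differences propagate along a path of neighbouring actions, here is a small Python sketch. It assumes that $d$ is additive and antisymmetric, i.e. $d(a,a'') = d(a,a') + d(a',a'')$ and $d(a',a) = -d(a,a')$, as is the case when $d$ is a difference of costs. It uses breadth-first search in place of Dijkstra's algorithm, since every edge counts as one hop. The function name and data layout are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Tuple


def chain_cost_differences(
    edge_d: Dict[Tuple[Hashable, Hashable], float],
    source: Hashable,
) -> Dict[Hashable, float]:
    """Return the estimate of d(source, a) for every action a reachable from source.

    edge_d[(a, b)] is the estimated d(a, b) for a pair sharing a decision boundary.
    """
    neighbours: Dict[Hashable, Dict[Hashable, float]] = {}
    for (a, b), d in edge_d.items():
        neighbours.setdefault(a, {})[b] = d
        neighbours.setdefault(b, {})[a] = -d  # antisymmetry (assumed)

    d_from_source = {source: 0.0}
    queue = deque([source])
    while queue:
        a = queue.popleft()
        for b, d_ab in neighbours.get(a, {}).items():
            if b not in d_from_source:
                # Additivity (assumed): d(source, b) = d(source, a) + d(a, b).
                d_from_source[b] = d_from_source[a] + d_ab
                queue.append(b)
    return d_from_source
```

Because the per-edge errors add up along the path, a path with $k$ edges multiplies the error by at most $k$, which is why the binary search precision is tightened by a factor of $|\mathcal{A}|$.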

Finally, with the estimated parameters $\widehat{P}$ and $\widehat{d}(a,a')$, the algorithm constructs the robust contract set $\widehat{\mathcal{X}}^{a} = \{x \in \mathcal{X} : \big[\widehat{P}(a) - \widehat{P}(a')\big] \cdot x \geq \widehat{d}(a,a') + \varepsilon/2, \ \forall a' \neq a\}$, and we claim that $\mathcal{X}^{a}(\varepsilon) \subseteq \widehat{\mathcal{X}}^{a} \subseteq \mathcal{X}^{a}$.
To verify that $\widehat{\mathcal{X}}^{a} \subseteq \mathcal{X}^{a}$, we check that the following inequality holds for all $\widehat{x} \in \widehat{\mathcal{X}}^{a}$ and all $a' \neq a$:

\begin{align*}
[P(a) - P(a')] \cdot \widehat{x}
&\geq [\widehat{P}(a) - \widehat{P}(a')] \cdot \widehat{x} + [P(a) - \widehat{P}(a) - P(a') + \widehat{P}(a')] \cdot \widehat{x} \\
&\geq \widehat{d}(a,a') + \varepsilon/2 - 2 \max_{a \in \mathcal{A}} \left\lVert \widehat{P}(a) - P(a) \right\rVert_{1} \left\lVert \widehat{x} \right\rVert_{\infty} \\
&\geq d(a,a') + \varepsilon/2 - \varepsilon/5 - 3\varepsilon/10 \geq d(a,a').
\end{align*}

Similarly, to verify $\mathcal{X}^{a}(\varepsilon) \subseteq \widehat{\mathcal{X}}^{a}$, we check that the following inequality holds for all $x \in \mathcal{X}^{a}(\varepsilon)$ and all $a' \neq a$:

\begin{align*}
[\widehat{P}(a) - \widehat{P}(a')] \cdot x
&\geq [P(a) - P(a')] \cdot x + [\widehat{P}(a) - P(a) - \widehat{P}(a') + P(a')] \cdot x \\
&\geq d(a,a') + \varepsilon - 2 \max_{a \in \mathcal{A}} \left\lVert \widehat{P}(a) - P(a) \right\rVert_{1} \left\lVert x \right\rVert_{\infty} \\
&\geq \widehat{d}(a,a') + \varepsilon - \varepsilon/5 - 3\varepsilon/10 \geq \widehat{d}(a,a') + \varepsilon/2.
\end{align*}
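As a concrete reading of the robust contract set constructed in this proof, the following Python sketch tests whether a contract $x$ satisfies every constraint $[\widehat{P}(a) - \widehat{P}(a')] \cdot x \geq \widehat{d}(a,a') + \varepsilon/2$. The dictionary-based layout of the estimates and the name `in_robust_set` are assumptions made for illustration. The sketch does not check membership in the ambient contract space $\mathcal{X}$.

```python
from typing import Dict, Hashable, Sequence, Tuple


def in_robust_set(
    x: Sequence[float],
    a: Hashable,
    P_hat: Dict[Hashable, Sequence[float]],
    d_hat: Dict[Tuple[Hashable, Hashable], float],
    eps: float,
) -> bool:
    """Check the eps/2-margin incentive constraints of action a against all others."""
    for b in P_hat:
        if b == a:
            continue
        # Estimated gain in expected payment from playing a instead of b.
        gain = sum((pa - pb) * xs for pa, pb, xs in zip(P_hat[a], P_hat[b], x))
        if gain < d_hat[(a, b)] + eps / 2:
            return False
    return True
```

The $\varepsilon/2$ margin is what absorbs the estimation errors $\varepsilon/5$ and $3\varepsilon/10$ in the two inclusion arguments above.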

Appendix D Searching on Probability Simplex

In this section, we discuss the details of specifying the information structure through hyperplane searching. A motivation for hyperplane searching is given in Example 1, where learning the outcome distribution difference within $\mathcal{O}(\log T)$ rounds potentially avoids pulling the non-optimal arm too many times and paves the way for constructing a $T^{-1}$-optimal contract. In addition, the need to plan with the transition kernel $P_{h}(\cdot \,|\, s,a)$ in the MDP environment with a far-sighted agent prompts us to learn the difference $P_{h}(\cdot \,|\, s,a) - P_{h}(\cdot \,|\, s,a')$, in order to fully exploit the information structure and to reduce the cost of redundant exploration.

Example 1 ($o(T^{2/3})$ regret with known cost).

Consider a class of contractual bandit problem instances parameterized by $\mu \in (0,1]$. For each instance, there are two outcomes $s_{1}, s_{2}$ with mean rewards $\iota(s_{1}) = 1$, $\iota(s_{2}) = 0$, and two agent actions $a_{1}, a_{2}$ with costs $c(a_{1}) = 1/2$, $c(a_{2}) = 0$ and outcome distributions $P(a_{1}) = [1, 0]$, $P(a_{2}) = [1-\mu, \mu]$. One can verify that the optimal contract $x^{*}$ here sets $x^{*}(s_{1}) = \frac{1}{2\mu}$, $x^{*}(s_{2}) = 0$, and the principal gets the expected utility $1 - \frac{1}{2\mu}$.
The naive learning method is to play $a_{2}$ for $T^{2/3}$ rounds and learn its outcome distribution, parameterized by $\mu$, up to an error of $O(T^{-1/3})$. This is costly as $a_{2}$ is the sub-optimal arm, resulting in the $\widetilde{O}(T^{2/3})$ regret in Theorem 2. However, an alternative method is to conduct a binary search for $\mu$. This would achieve $O(\log T)$ regret, since the algorithm can bound the estimation error of $\mu$ by $T^{-1}$ in $O(\log T)$ rounds and then construct a $T^{-1}$-optimal contract.
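The binary search for $\mu$ in Example 1 can be sketched in Python as below. Under a contract $x = (t, 0)$, the agent prefers $a_{1}$ exactly when $t - 1/2 \geq (1-\mu)t$, that is, when $t \geq 1/(2\mu)$. Offering $t = 1/(2\widehat{\mu})$ therefore reveals whether the guess $\widehat{\mu}$ is at most $\mu$. The function names, the tie-breaking in favour of $a_{1}$, and the known lower bound `mu_min` (used to keep payments bounded) are assumptions of this sketch and not part of the example.

```python
import math
from typing import Callable


def agent_best_response(t: float, mu: float) -> str:
    """Agent's action under contract x = (t, 0) in Example 1 (ties go to a1)."""
    utility_a1 = t - 0.5           # P(a1) = [1, 0], c(a1) = 1/2
    utility_a2 = (1.0 - mu) * t    # P(a2) = [1 - mu, mu], c(a2) = 0
    return "a1" if utility_a1 >= utility_a2 else "a2"


def estimate_mu(respond: Callable[[float], str], horizon: int, mu_min: float) -> float:
    """Binary-search mu using about log2(horizon) rounds; respond(t) is the oracle."""
    lo, hi = mu_min, 1.0           # invariant: lo <= mu <= hi
    rounds = math.ceil(math.log2(horizon))
    for _ in range(rounds):
        guess = (lo + hi) / 2
        # Offering t = 1 / (2 * guess) induces a1 exactly when guess <= mu.
        if respond(1.0 / (2.0 * guess)) == "a1":
            lo = guess
        else:
            hi = guess
    return lo  # an underestimate, so t = 1 / (2 * lo) still induces a1
```

Returning the lower end of the bracket is a deliberate choice here: the resulting contract slightly overpays but remains incentive compatible for $a_{1}$.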

Here, we consider searching for the agent’s best response section in a $D$-dimensional probability simplex. Let $\mathcal{S}$ with $|\mathcal{S}| = d$ be the outcome space, let $x : \mathcal{S} \rightarrow \mathbb{R}_{+}$ denote the contract the principal announces to the agent, and let $\mathcal{A}$ be the agent’s action set with $|\mathcal{A}| = N$. We restrict the contract $x$ to the subspace $x \in \mathcal{P}^{d-1}$, where $\mathcal{P}^{d-1}$ is the $(d-1)$-dimensional probability simplex. We remark that searching over a low-dimensional simplex is without loss of generality, and we just consider the simplex $\left\|x\right\|_{1} = \eta$, $x \in [0,\eta]^{D}$ for simplicity, where $\eta$ bounds the infinity norm of any contract we use. This is because we have the following proposition.

Proposition 1 (Action inducibility).

If an action $i \in [N]$ can be induced by a contract $x$ with $\left\|x\right\|_{\infty} \leq \eta$, then $i$ can also be induced by the contract $y = x + ((N-1)\eta - \left\|x\right\|_{1}) \mathds{1}/N$.

Proof.

The inducibility condition implies

\[
\langle x, p_{i} - p_{j} \rangle \geq c_{i} - c_{j}, \quad \forall j \neq i.
\]

Adding $((N-1)\eta - \left\|x\right\|_{1}) \mathds{1}/N$ to $x$ does not change the inequality, since $\langle \mathds{1}, p_{i} - p_{j} \rangle = 0$ for any two distributions $p_{i}, p_{j}$. Moreover,

\[
y^{(l)} = x^{(l)} + ((N-1)\eta - \left\|x\right\|_{1})/N = x^{(l)} \frac{N-1}{N} + \frac{(N-1)\eta}{N} - \frac{1}{N} \sum_{m \neq l} x^{(m)} \geq 0,
\]

which implies that $y$ is a valid contract with $\left\|y\right\|_{1} = (N-1)\eta$. ∎
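A quick numerical check of Proposition 1 can be written in Python as follows. The sketch shifts a contract by the constant vector from the proposition and confirms that the agent's best response is unchanged. The instance below is made up for illustration and uses as many outcomes as actions ($N = 3$), so that the norm identity $\left\|y\right\|_{1} = (N-1)\eta$ can be checked as stated. The function names are hypothetical.

```python
from typing import List, Sequence


def best_response(x: Sequence[float], P: Sequence[Sequence[float]], c: Sequence[float]) -> int:
    """Index of the action maximizing expected payment minus cost."""
    utilities = [
        sum(p_s * x_s for p_s, x_s in zip(p, x)) - cost
        for p, cost in zip(P, c)
    ]
    return max(range(len(P)), key=lambda i: utilities[i])


def shift_contract(x: Sequence[float], eta: float, n: int) -> List[float]:
    """Return y = x + ((n - 1) * eta - ||x||_1) * 1 / n, as in Proposition 1."""
    shift = ((n - 1) * eta - sum(x)) / n
    return [x_s + shift for x_s in x]


# Hypothetical instance: 3 actions, 3 outcomes, payments bounded by eta = 1.
P = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
c = [0.3, 0.1, 0.0]
x = [0.9, 0.1, 0.0]
y = shift_contract(x, eta=1.0, n=3)
```

Every action's expected payment increases by the same constant under the shift, so the ranking of actions, and hence the induced action, stays the same.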

For simplicity, we ignore the scale $(N-1)\eta$ and just conduct our search on the probability simplex. Under this setting, the best response region for $a_{i} \in \mathcal{A}$ is

\[
\mathcal{V}_{i} = \left\{ x \in \mathcal{P}^{d-1} \,\big|\, \langle x, p_{i} - p_{j} \rangle \geq c_{i} - c_{j}, \quad \forall j \neq i \right\},
\]

where $p_{i} \in \Delta(\mathcal{S})$ is the outcome distribution under action $i$ and $c_{i}$ is the action cost the agent has to pay, for any $i \in [N]$. Our target is to identify each $\mathcal{V}_{i}$ by searching for the hyperplanes that separate these $\mathcal{V}_{i}$ under weak assumptions. Specifically, we assume that the cost of each action is known. The algorithm is summarized in Algorithm 4.

Input: Number of actions $N$, number of samples $T$, binary search threshold $\varepsilon$, parameter $c_{d}$.
Initialize memory $\mathcal{M} = \emptyset$;
for $t = 1, \dots, T$ do
    Randomly sample $z_{1}, z_{2} \in \mathcal{P}^{d}$ and draw the line $\ell \subset \mathcal{P}^{d}$ connecting $z_{1}, z_{2}$;
    Binary search on $\ell$ for all the switching points up to precision $\varepsilon$ (precision $\varepsilon$ in this algorithm always means the $d$-dimensional infinity norm, $\left\|x_{k} - y_{k}\right\|_{\infty} \leq \varepsilon$), and obtain all the segments containing a switching point: $\overline{x_{1}y_{1}}, \dots, \overline{x_{m}y_{m}}$;
    for $k = 1, \dots, m$ do
        $\mathcal{M} \leftarrow \mathcal{M} \cup \left\{ \left(x_{k}, a^{*}(x_{k})\right) \right\} \cup \left\{ \left(y_{k}, a^{*}(y_{k})\right) \right\}$;
        Randomly draw a $d$-dimensional simplex centered at $(x_{k} + y_{k})/2$ with length $\sqrt{2} c_{d}$ and vertices $v_{1}, \dots, v_{d+1}$;
        Play $v_{1}, \dots, v_{d+1}$ and obtain the best responses $a^{*}_{1}, \dots, a^{*}_{d+1}$;
        for each pair $(i,j)$ s.t. $1 \leq i < j \leq d+1$ and $a^{*}_{i} \neq a^{*}_{j}$ do
            Binary search on $\overline{v_{i}v_{j}}$ for a switching point up to precision $\varepsilon$, and obtain the segment $\overline{uw}$ containing the switching point;
            $\mathcal{M} \leftarrow \mathcal{M} \cup \left\{ \left(u, a^{*}(u)\right) \right\} \cup \left\{ \left(w, a^{*}(w)\right) \right\}$;
Solve for $\mathcal{S}$ with $\mathcal{M}$;
Algorithm 4 Searching on Probability Simplex
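The step "binary search on $\ell$ for all the switching points" in Algorithm 4 can be sketched in Python as a recursive bisection. This is only an illustration of that single step. The oracle `best_response` and the name `switching_segments` are hypothetical. The sketch relies on each best response region being convex (an intersection of half-spaces), so a segment whose endpoints share a best response contains no switching point, ignoring ties.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Point = List[float]


def switching_segments(
    z1: Sequence[float],
    z2: Sequence[float],
    best_response: Callable[[Sequence[float]], Hashable],
    eps: float,
) -> List[Tuple[Point, Point]]:
    """Return sub-segments of [z1, z2] of sup-norm length <= eps that contain a switch."""
    found: List[Tuple[Point, Point]] = []
    stack = [(list(z1), list(z2), best_response(z1), best_response(z2))]
    while stack:
        x, y, a_x, a_y = stack.pop()
        if a_x == a_y:
            continue  # convexity: the whole sub-segment induces the same action
        if max(abs(u - v) for u, v in zip(x, y)) <= eps:
            found.append((x, y))  # bracket of one switching point
            continue
        mid = [(u + v) / 2 for u, v in zip(x, y)]
        a_mid = best_response(mid)
        stack.append((x, mid, a_x, a_mid))
        stack.append((mid, y, a_mid, a_y))
    return found
```

The endpoints of each returned segment, together with their best responses, are the pairs that Algorithm 4 stores in the memory $\mathcal{M}$.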

Here, we show how to recover $p_{1}, \dots, p_{N}$ from the memory $\mathcal{M}$. Suppose that when the algorithm terminates, we have $\mathcal{M} = \left\{ (w_{l}, a^{*}(w_{l})) \right\}_{l \in [L]}$. We just solve for $(p_{1}, \dots, p_{N})$ that satisfies the following constraints:

\begin{align*}
\langle w_{l}, p_{a^{*}(w_{l})} - p_{a'} \rangle &\geq c_{a^{*}(w_{l})} - c_{a'}, \quad \forall a' \neq a^{*}(w_{l}), \quad \forall l \in [L], \tag{D.1} \\
\langle \mathds{1}, p_{i} - p_{j} \rangle &= 0, \quad \forall (i,j) \in [N]^{2}. \tag{D.2}
\end{align*}
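To make Conditions (D.1) and (D.2) concrete, the following Python sketch checks whether a candidate tuple $(p_{1}, \dots, p_{N})$ is consistent with a recorded memory. It only verifies feasibility of a given candidate and does not solve for one, which would require a linear programming solver. The list-based layout, the zero-based action indices, and the tolerance `tol` are assumptions of the sketch.

```python
from typing import List, Sequence, Tuple


def satisfies_constraints(
    memory: Sequence[Tuple[Sequence[float], int]],
    p: Sequence[Sequence[float]],
    c: Sequence[float],
    tol: float = 1e-12,
) -> bool:
    """Check (D.1) and (D.2) for candidate distributions p and known costs c.

    memory holds pairs (w, a_star): a contract and the index of the agent's response.
    """
    # (D.2): <1, p_i - p_j> = 0, i.e. all candidates carry the same total mass.
    totals: List[float] = [sum(p_i) for p_i in p]
    if any(abs(t - totals[0]) > tol for t in totals):
        return False
    # (D.1): the observed response must be weakly preferred to every other action.
    for w, a_star in memory:
        for b in range(len(p)):
            if b == a_star:
                continue
            lhs = sum(w_k * (pa - pb) for w_k, pa, pb in zip(w, p[a_star], p[b]))
            if lhs < c[a_star] - c[b] - tol:
                return False
    return True
```

Each recorded contract contributes $N-1$ linear inequalities in the unknown distributions, so the feasible set shrinks as more switching points are stored.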

In the sequel, we write $\mathcal{S}$ for the set of $(p_{1}, \dots, p_{N})$ that satisfy Conditions (D.1) and (D.2). For Algorithm 4 to work, we introduce the following assumption on the volume of $\mathcal{V}_{i}$.

Assumption 3 (Minimal Volume Ratio).

Let $\operatorname{Vol}^{d}(\mathcal{V})$ denote the $d$-dimensional volume of a set $\mathcal{V} \subseteq \mathbb{R}^{d}$. We assume that there exists $\varsigma \in (0,1]$ such that $\operatorname{Vol}^{d-1}(\mathcal{V}_{i}) \geq \varsigma \cdot \operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})$ for any $i \in [N]$.

The minimal volume ratio assumption guarantees that all the sections $\mathcal{V}_{i}$ are detectable via random sampling with high probability. We also make the following assumption on the cost difference.

Assumption 4 (Minimal Cost Difference).

We assume that $\inf_{1 \leq i < j \leq N} \left|c_{i} - c_{j}\right| \geq \theta$.

Specifically, we use the following definition of the surface detection probability function.

Definition 3 (Surface Detection Probability Function).

Let $\mathrm{Conv}^{d-1}$ be the set of convex regions on some $(d-1)$-dimensional hyperplane such that $e\subset\mathcal{P}^{d}$ for any $e\in\mathrm{Conv}^{d-1}$. Define the function $\sigma_d:[0,1]\rightarrow[0,1]$ as the pointwise maximum such that

$$\mathbb{P}(\ell\cap e\neq\emptyset)\geq\sigma_d\left(\frac{\operatorname{Vol}^{d-1}(e)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}\right),\quad\forall e\in\mathrm{Conv}^{d-1}.$$

Note that $\sigma_d$ is a property inherent to the $d$-dimensional probability simplex. We argue that $\sigma_d$ can be roughly viewed as a linear function for small $e$. To characterize the searching result $\mathcal{S}$ of Algorithm 4, we present the following lemma.

Lemma 6.

Under Assumptions 3 and 4, suppose that $\varepsilon$ and $c_d$ are chosen to satisfy

$$\begin{aligned}
\xi_d^{2} &:= \frac{c_d^{2}}{d^{2}}-\frac{d\varepsilon^{2}}{8}>0,\\
\tau_d &:= \left(\frac{\varsigma^{2}}{3d}\right)^{d}-d^{2}\left(1+\frac{4}{d\varsigma}\right)\cdot(c_d+\varepsilon)>0.
\end{aligned}$$

After $T$ samples and no more than $\mathcal{O}(TNd^{2}\log(1/\varepsilon))$ rounds, with probability at least $1-Ne^{-T\sigma_d(\tau_d)}$, we have for any $(p_1,\dots,p_N)\in\mathcal{S}$ that

$$\left\|(p_i-p_j)-(p_i^{*}-p_j^{*})\right\|_2\leq\frac{2(N-1)\sqrt{d}\cdot\varepsilon}{\xi_d^{d-1}\theta}.$$

To construct an efficient learning procedure, we need to determine the optimal values of $\varepsilon$ and $c_d$ such that the total number of rounds is minimized while the learning error is controlled by $\varepsilon$.
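As a rough numerical illustration of the probability bound in Lemma 6, the sketch below inverts $Ne^{-T\sigma_d(\tau_d)}\leq\delta$ to obtain the smallest number of samples $T$. The argument `sigma` stands in for $\sigma_d(\tau_d)$ and is a made-up placeholder value, since the text gives no closed form for $\sigma_d$; the target `delta` and both function names are likewise assumptions introduced only for this example.

```python
import math


def failure_bound(num_actions: int, sigma: float, num_samples: int) -> float:
    """Upper bound N * exp(-T * sigma) on the failure probability."""
    return num_actions * math.exp(-num_samples * sigma)


def samples_needed(num_actions: int, sigma: float, delta: float) -> int:
    """Smallest integer T such that N * exp(-T * sigma) <= delta."""
    if num_actions < 1 or not 0.0 < sigma <= 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("need N >= 1, sigma in (0, 1], delta in (0, 1)")
    # Solve N * exp(-T * sigma) <= delta for T and round up.
    return max(0, math.ceil(math.log(num_actions / delta) / sigma))


if __name__ == "__main__":
    # Hypothetical inputs: N = 5 actions, sigma_d(tau_d) = 0.01, delta = 0.05.
    T = samples_needed(5, 0.01, 0.05)
    print(T, failure_bound(5, 0.01, T))
```

The dependence on $N$ and $\delta$ is only logarithmic, so in this bound the sample count is driven mainly by $1/\sigma_d(\tau_d)$.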

Corollary 3.1.

By properly setting $c_d$ and $\varepsilon$ and running the simplex searching algorithm for $t$ rounds, we guarantee that the learning error is less than $\varepsilon$ with probability at least

$$1-N\exp\left(-\frac{t\cdot\sigma_d(\tau_d)}{Nd^{4}\log(N\varsigma^{-2}\varepsilon^{-1}\theta^{-1})}\right),$$

where $\tau_d=(\varsigma^{2}/6d)^{2}$ is a constant.

Proof.

Define the constant

$$\Upsilon:=\frac{\varsigma^{2d+1}}{2\cdot 3^{d}d^{d+1}(d\varsigma+4)}<1.$$

We let $c_d=\Upsilon/2$. Then the first condition holds if $\varepsilon\leq\Upsilon/d^{3/2}$. Moreover, we have $\xi_d\geq\Upsilon/(\sqrt{8}d)$ and the second condition holds automatically with $\tau_d\geq(\varsigma^{2}/6d)^{2}$. Therefore, the constraint for $\varepsilon$ becomes

$$\varepsilon\leq\min\left\{\frac{\Upsilon}{d^{3/2}},\ \frac{\varepsilon\theta}{2N\sqrt{d}}\left(\frac{\Upsilon}{\sqrt{8}d}\right)^{d-1}\right\}.$$

We can take equality for the optimal $\varepsilon$. Obviously, the second term dominates, and we thus have the total rounds bounded by

$$\begin{aligned}
\text{total rounds} &\leq\mathcal{O}\left(TNd^{2}\left(\log\left(\frac{2N\sqrt{d}}{\varepsilon\theta}\right)+d\log\left(\frac{\sqrt{8}d}{\Upsilon}\right)\right)\right)\\
&=\mathcal{O}\left(TNd^{2}\left(\log\left(2N\sqrt{d}(\varepsilon\theta)^{-1}\right)+d^{2}\log\left(d\varsigma^{-2}\right)\right)\right),
\end{aligned}$$

where the failure probability is bounded by $Ne^{-T\sigma_d(\tau_d)}$ with $\tau_d=(\varsigma^{2}/6d)^{2}$ being a constant. ∎
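The parameter choice in this proof can be checked numerically. The following is a minimal sketch, assuming only the formulas for $\Upsilon$, $\xi_d^{2}$ and $\tau_d$ stated in Lemma 6 and in the proof above. The function name and the `eps_scale` knob (which sets $\varepsilon$ to a fraction of $\Upsilon/d^{3/2}$) are hypothetical and not part of the paper.

```python
def corollary_parameters(d: int, varsigma: float, eps_scale: float = 1.0):
    """Return (Upsilon, c_d, eps, xi_sq, tau) for the choice c_d = Upsilon / 2.

    eps is set to eps_scale * Upsilon / d**1.5, so eps_scale <= 1 matches the
    sufficient condition eps <= Upsilon / d^{3/2} used in the proof.
    """
    if d < 1 or not 0.0 < varsigma <= 1.0 or not 0.0 < eps_scale <= 1.0:
        raise ValueError("need d >= 1, varsigma in (0, 1], eps_scale in (0, 1]")
    upsilon = varsigma ** (2 * d + 1) / (
        2 * 3 ** d * d ** (d + 1) * (d * varsigma + 4)
    )
    c_d = upsilon / 2
    eps = eps_scale * upsilon / d ** 1.5
    # First condition of Lemma 6: xi_d^2 = c_d^2 / d^2 - d * eps^2 / 8 > 0.
    xi_sq = c_d ** 2 / d ** 2 - d * eps ** 2 / 8
    # Second condition of Lemma 6: tau_d > 0.
    tau = (varsigma ** 2 / (3 * d)) ** d - d ** 2 * (1 + 4 / (d * varsigma)) * (
        c_d + eps
    )
    return upsilon, c_d, eps, xi_sq, tau


if __name__ == "__main__":
    for d in (2, 3, 4):
        upsilon, c_d, eps, xi_sq, tau = corollary_parameters(d, 0.5, eps_scale=0.5)
        print(d, upsilon, xi_sq > 0, tau > 0)
```

Because $\Upsilon$ shrinks like $\varsigma^{2d+1}/(3d)^{d}$, the admissible $\varepsilon$ becomes very small even for moderate $d$, which is where the extra $d^{2}\log(d\varsigma^{-2})$ term in the round bound comes from.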

Proof of Lemma 6.

We consider an undirected graph $\mathcal{G}=(\mathcal{A},E)$ with the node set $\mathcal{A}$ and the edge set $E=\left\{e_{ij}\,|\,e_{ij}=\mathcal{V}_i\cap\mathcal{V}_j,\ \forall i\neq j\right\}$. Define the event $\mathcal{E}_{ij}^{d}$ as follows.

Definition 4 (Surface Detection Event).

We say that event $\mathcal{E}_{ij}^{d}$ happens if there exists $t\in[T]$ for which we have successfully searched for a $k_t$ such that, for the $d$-dimensional simplex $\mathbb{S}^{d}$ placed around $(x_{k_t}+y_{k_t})/2$ in Algorithm 4 and any $x\in\mathbb{S}^{d}$, the best response at $x$ satisfies $a^{*}(x)\in\{i,j\}$.

Simply put, the event $\mathcal{E}_{ij}^{d}$ guarantees that $\mathbb{S}^{d}$ only contains two possible actions $\{i,j\}$, and $e_{ij}$ can therefore be successfully learned via binary searching for the intersections of the edges of the simplex with $e_{ij}$. The following proposition backs up our statement.

Proposition 2 (Intersection Geometry).

Under event $\mathcal{E}_{ij}^{d}$ and the condition $\varepsilon/2<c_d/(d\sqrt{d+1})$, let $\mathbb{S}^{d}$ denote the simplex corresponding to $\mathcal{E}_{ij}^{d}$. Then $\mathbb{S}^{d}\cap e_{ij}$ contains a $(d-1)$-dimensional ball with radius at least

$$\xi_d:=\sqrt{\frac{c_d^{2}}{d(d+1)}-\frac{d\varepsilon^{2}}{4}}.$$
Proof.

Under event $\mathcal{E}_{ij}^{d}$, we claim that the hyperplane $e_{ij}$ must intersect the $\sqrt{d}\varepsilon/2$-ball $\mathbb{B}_1=\mathbb{B}(\sqrt{d}\varepsilon/2)$ centered at $(x_{k_t}+y_{k_t})/2$, since $e_{ij}$ passes through the segment $\overline{x_{k_t}y_{k_t}}$, which lies inside $\mathbb{B}_1$.
Moreover, we consider a ball $\mathbb{B}_2=\mathbb{B}(c_d/\sqrt{d(d+1)})$ also centered at $(x_{k_t}+y_{k_t})/2$. Since the smallest distance from the center of a $d$-dimensional simplex with edge length $\sqrt{2}c_d$ to any of its faces is $c_d/\sqrt{d(d+1)}$, we have $\mathbb{B}_1\subset\mathbb{B}_2\subset\mathbb{S}^{d}$ under the condition $\varepsilon/2<c_d/(d\sqrt{d+1})$. In addition, $\mathbb{B}_2$ is also the largest ball contained in $\mathbb{S}^{d}$. We therefore conclude that the intersection area satisfies

$$\mathbb{B}_2\cap e_{ij}\subseteq\mathbb{S}^{d}\cap e_{ij}.$$

Since $\mathbb{S}^{d}$ only intersects the hyperplane $e_{ij}$, we have $\mathbb{B}_2\cap e_{ij}\subseteq\mathbb{S}^{d}\cap e_{ij}\subseteq e_{ij}$. Thus, the left-hand side corresponds to the intersection of a $d$-dimensional ball and a $(d-1)$-dimensional hyperplane. Since $e_{ij}$ intersects $\mathbb{B}_1$, a $(d-1)$-dimensional ball with radius at least

$$\xi_d=\sqrt{\frac{c_d^{2}}{d(d+1)}-\frac{d\varepsilon^{2}}{4}},$$

should be contained in $\mathbb{S}^{d}\cap e_{ij}\subseteq e_{ij}$. ∎
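The radius in Proposition 2 comes from a Pythagorean step that the proof leaves implicit. As a sketch, write $r_2=c_d/\sqrt{d(d+1)}$ for the radius of $\mathbb{B}_2$, and let $\delta$ be the distance from the common center $(x_{k_t}+y_{k_t})/2$ to the hyperplane containing $e_{ij}$. Both $r_2$ and $\delta$ are symbols introduced here only for illustration. Since $e_{ij}$ meets $\mathbb{B}_1$, we have $\delta\leq\sqrt{d}\varepsilon/2$, and the slice of $\mathbb{B}_2$ cut by the hyperplane is a $(d-1)$-dimensional ball whose radius satisfies:

```latex
\[
\sqrt{r_2^{2}-\delta^{2}}
\;\geq\;
\sqrt{\frac{c_d^{2}}{d(d+1)}-\frac{d\varepsilon^{2}}{4}}
\;=\;\xi_d .
\]
```

The quantity under the square root is positive exactly when $\varepsilon/2<c_d/(d\sqrt{d+1})$, which is the condition assumed in the proposition.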

Proposition 2 characterizes the geometry of the intersection area $\mathbb{S}^{d}\cap e_{ij}$ under event $\mathcal{E}_{ij}^{d}$. Specifically, the intersection area should contain a $(d-1)$-dimensional ball with radius lower bounded by $\xi_d$. Such a geometry is critical for solving the hyperplanes to small errors. We consider the following constraints,

$$\begin{aligned}
&\langle u_k,p_i-p_j\rangle\geq c_i-c_j,\quad\langle w_k,p_i-p_j\rangle\leq c_i-c_j,\quad\forall k\in[K],\\
&\langle\mathds{1},p_i-p_j\rangle=0,
\end{aligned}\tag{D.3}$$

where $(u_k,w_k)$ is the binary searching result on the edges of simplex $\mathbb{S}^{d}$. Under event $\mathcal{E}_{ij}^{d}$, let $\mathcal{S}_{ij}$ denote the set of $(p_i,p_j)$ satisfying these constraints. Apparently, $\mathcal{S}_{ij}$ is a relaxation of $\mathcal{S}$. The following proposition characterizes the searching errors in $\mathcal{S}_{ij}$ under the event $\mathcal{E}_{ij}^{d}$ in terms of $p_i-p_j$.
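To make the pairs $(u_k,w_k)$ in (D.3) concrete, the sketch below runs a binary search along one segment whose endpoints induce different best responses. It is an illustration under stated assumptions, not the procedure of Algorithm 4. The two-action oracle `best_response`, the helper `gap`, and all numerical values are hypothetical, and the oracle simply encodes that the agent weakly prefers action $i$ at $x$ when $\langle x,p_i-p_j\rangle\geq c_i-c_j$.

```python
import math


def gap(x, p_i, p_j):
    """Inner product <x, p_i - p_j>."""
    return sum(xk * (a - b) for xk, a, b in zip(x, p_i, p_j))


def best_response(x, p_i, p_j, c_i, c_j):
    """Hypothetical oracle: returns 'i' iff the agent weakly prefers action i at x."""
    return "i" if gap(x, p_i, p_j) >= c_i - c_j else "j"


def bisect_edge(x, y, oracle, eps):
    """Shrink a segment to (u, w) with oracle(u) == 'i', oracle(w) == 'j', |u - w| <= eps."""
    u, w = (x, y) if oracle(x) == "i" else (y, x)
    if oracle(u) != "i" or oracle(w) != "j":
        raise ValueError("endpoints must induce different best responses")
    while math.dist(u, w) > eps:
        mid = tuple((a + b) / 2 for a, b in zip(u, w))
        # Keep the half of the segment that still straddles the boundary.
        if oracle(mid) == "i":
            u = mid
        else:
            w = mid
    return u, w


if __name__ == "__main__":
    # Made-up outcome distributions and costs for two actions.
    p_i, p_j, c_i, c_j = (0.7, 0.2, 0.1), (0.2, 0.3, 0.5), 0.3, 0.1
    oracle = lambda x: best_response(x, p_i, p_j, c_i, c_j)
    u, w = bisect_edge((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), oracle, 1e-3)
    print(u, w)
```

Each iteration halves the segment, so reaching length $\varepsilon$ takes on the order of $\log(1/\varepsilon)$ oracle calls per edge, which matches the $\log(1/\varepsilon)$ factor in the round count of Lemma 6. The returned pair satisfies the two inequalities of (D.3) by construction.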

Proposition 3 (Local Hyperplane Learnability).

Under Assumption 4 and event $\mathcal{E}_{ij}^{d}$, for any $(p_i,p_j)\in\mathcal{S}_{ij}$, we have

$$\left\|(p_i-p_j)-(p_i^{*}-p_j^{*})\right\|_2\leq\varphi_d^{-1}\sqrt{d}\varepsilon,$$

where

$$\varphi_d:=\left(\frac{\xi_d}{\sqrt{1-d^{-1}}}\right)^{d-1}\cdot\frac{\theta}{\sqrt{2}}\cdot\sqrt{\frac{d}{d+1}}.$$
Proof.

We first relax $\mathcal{S}_{ij}$ to

$$\check{\mathcal{S}}_{ij}:(p_i,p_j)\quad\text{subject to}\quad\langle s_k,p_i-p_j\rangle=c_i-c_j+\epsilon_k,\quad\langle\mathds{1},p_i-p_j\rangle=0,\quad|\epsilon_k|\leq\varepsilon,\quad\forall k\in[K].$$

Note that $s_1,\dots,s_K$ correspond to the vertices of $\mathbb{S}^{d}\cap e_{ij}$. Using Proposition 2, we pick $\widetilde{s}_1,\dots,\widetilde{s}_d$ on the ball $\mathbb{B}_2\subset\mathbb{S}^{d}\cap e_{ij}$ that form a $(d-1)$-dimensional simplex. Note that $\mathbb{S}^{d}\cap e_{ij}$ must be convex. Thus, we can express $\widetilde{s}_1,\dots,\widetilde{s}_d$ as convex combinations of $s_1,\dots,s_K$. In matrix form, we suppose

$$\begin{bmatrix}\widetilde{s}_1^{\top}\\ \cdots\\ \widetilde{s}_d^{\top}\end{bmatrix}=Q\begin{bmatrix}s_1^{\top}\\ \cdots\\ s_K^{\top}\end{bmatrix},$$

where $Q$ has row sums equal to $1$. Using $\widetilde{s}_1,\dots,\widetilde{s}_d$, we can further relax $\check{\mathcal{S}}_{ij}$ by multiplying $\mathrm{diag}(Q,1)$ to the constraints,

$$\widetilde{\mathcal{S}}_{ij}:(p_i,p_j)\quad\text{subject to}\quad\underbrace{\begin{bmatrix}\widetilde{s}_1^{\top}\\ \cdots\\ \widetilde{s}_d^{\top}\\ \mathds{1}_{d+1}^{\top}/(d+1)\end{bmatrix}}_{\widetilde{S}}(p_i-p_j)=\begin{bmatrix}(c_i-c_j)\mathds{1}_d\\ 0\end{bmatrix}+\underbrace{\begin{bmatrix}\epsilon\\ 0\end{bmatrix}}_{\widetilde{\epsilon}},\quad\left\|\epsilon\right\|_{\infty}\leq\varepsilon.$$

Note that the first $d$ rows of $S$ form a $(d-1)$-dimensional simplex on $e_{ij}$. Moreover, since $|c_i-c_j|>\theta$, we note that $\mathds{1}_{d+1}$ does not lie on $e_{ij}$. To guarantee linear independence of the rows of $\widetilde{S}$, we aim to lower bound the distance from $\mathds{1}_{d+1}/(d+1)$ to $e_{ij}$. For any $x\in e_{ij}$, we have $\langle x,p_i^*-p_j^*\rangle=c_i-c_j$, and for $\mathds{1}_{d+1}/(d+1)$ we have $\langle \mathds{1}_{d+1}/(d+1),p_i^*-p_j^*\rangle=0$. Thus, the Cauchy-Schwarz inequality gives

\[
\left\|x-\frac{\mathds{1}_{d+1}}{d+1}\right\|_2\cdot\left\|p_i^*-p_j^*\right\|_2
\ge\left|\Big\langle x-\frac{\mathds{1}_{d+1}}{d+1},\,p_i^*-p_j^*\Big\rangle\right|
=\left|c_i-c_j\right|\ge\theta.
\]

Since $\left\|p_i^*-p_j^*\right\|_2\le\sqrt{2}$, we conclude that for any $x\in e_{ij}$,

\[
\left\|x-\frac{\mathds{1}_{d+1}}{d+1}\right\|_2\ge\frac{\theta}{\sqrt{2}}.
\]
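The Cauchy-Schwarz step above can be sanity-checked numerically. The following is a minimal sketch, not part of the original proof: it samples random probability vectors in place of $p_i^*$, $p_j^*$ and $x$, and lets the inner product $\langle x, p_i^*-p_j^*\rangle$ play the role of $c_i-c_j$. The function name and sampling scheme are illustrative assumptions.

```python
import math
import random


def distance_lower_bound_holds(d, trials=1000, seed=0):
    """Check ||x - 1/(d+1)||_2 >= |<x, p_i - p_j>| / sqrt(2) on random samples.

    Hypothetical helper: p_i, p_j, x are random points of the probability
    simplex in R^{d+1}; the gap |<x, p_i - p_j>| stands in for |c_i - c_j|.
    """
    rng = random.Random(seed)

    def random_prob(n):
        weights = [rng.random() + 1e-12 for _ in range(n)]
        total = sum(weights)
        return [w / total for w in weights]

    center = [1.0 / (d + 1)] * (d + 1)
    for _ in range(trials):
        p_i, p_j, x = (random_prob(d + 1) for _ in range(3))
        diff = [a - b for a, b in zip(p_i, p_j)]
        # <center, diff> = 0 because diff sums to zero, so the gap below
        # equals |<x - center, diff>|.
        gap = abs(sum(a * b for a, b in zip(x, diff)))
        dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(x, center)))
        norm_diff = math.sqrt(sum(v * v for v in diff))
        if norm_diff > math.sqrt(2) + 1e-12:
            return False  # would contradict ||p_i - p_j||_2 <= sqrt(2)
        if dist < gap / math.sqrt(2) - 1e-12:
            return False  # would contradict the distance lower bound
    return True
```

Calling `distance_lower_bound_holds(3)` exercises the bound in dimension $d=3$; the check is only empirical and does not replace the argument.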

Thus, the distance from $\mathds{1}_{d+1}/(d+1)$ to $e_{ij}$ is at least $\theta/\sqrt{2}$, which indicates that the volume of the parallelepiped spanned by the rows of $\widetilde{S}$ is at least

\[
\begin{aligned}
\operatorname{Vol}^{d}(\widetilde{S})
&\ge\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\cdot\left(\frac{\xi_d}{\sqrt{1-d^{-1}}}\right)^{d-1}\cdot\frac{\theta}{\sqrt{2}\,d}\\
&=\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left(\frac{\xi_d}{\sqrt{1-d^{-1}}}\right)^{d-1}\cdot\frac{\theta}{\sqrt{2}}\cdot\sqrt{\frac{d}{d+1}}.
\end{aligned}
\]
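The equality between the two lines above rests on the ratio of simplex volumes. As a hedged illustration, assuming $\mathcal{P}^k$ denotes the standard probability simplex with $k+1$ vertices (whose $k$-dimensional volume is $\sqrt{k+1}/k!$), the identity $\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})/d=\operatorname{Vol}^{d}(\mathcal{P}^{d})\sqrt{d/(d+1)}$ can be checked directly. The function names below are illustrative.

```python
import math


def simplex_volume(k):
    """k-dimensional volume of the standard probability simplex in R^{k+1}.

    Assumption: P^k is the convex hull of the k+1 standard basis vectors,
    whose volume is sqrt(k+1) / k!.
    """
    return math.sqrt(k + 1) / math.factorial(k)


def ratio_identity_gap(d):
    """Absolute gap between Vol(P^{d-1})/d and Vol(P^d) * sqrt(d/(d+1))."""
    lhs = simplex_volume(d - 1) / d
    rhs = simplex_volume(d) * math.sqrt(d / (d + 1))
    return abs(lhs - rhs)
```

For instance, `ratio_identity_gap(2)` compares $\sqrt{2}/2$ with $(\sqrt{3}/2)\sqrt{2/3}$, which agree up to floating-point error.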

Therefore, the determinant of $\widetilde{S}$ satisfies

\[
\det(\widetilde{S})\ge\left(\frac{\xi_d}{\sqrt{1-d^{-1}}}\right)^{d-1}\cdot\frac{\theta}{\sqrt{2}}\cdot\sqrt{\frac{d}{d+1}}=:\varphi_d.
\]

Note that $\widetilde{S}$ has row sums equal to $1$, which implies that all the eigenvalues of $\widetilde{S}$ are no larger than $1$. Since the determinant is the product of the eigenvalues, the smallest eigenvalue of $\widetilde{S}$ is no less than $\varphi_d$ and, consequently, the largest eigenvalue of $\widetilde{S}^{-1}$ is no more than $\varphi_d^{-1}$. Following the definition of $\widetilde{\mathcal{S}}_{ij}$, we have for any $(p_i,p_j)\in\widetilde{\mathcal{S}}_{ij}$ that

\[
\left\|(p_i-p_j)-(p_i^*-p_j^*)\right\|_2=\left\|\widetilde{S}^{-1}\widetilde{\epsilon}\right\|_2\le\varphi_d^{-1}\left\|\widetilde{\epsilon}\right\|_2\le\varphi_d^{-1}\sqrt{d}\,\varepsilon.
\]

Proposition 3 connects the geometric argument to the learning error in terms of $p_i-p_j$. To use it, we need to characterize under what conditions and with what probability the event $\mathcal{E}_{ij}^{d}$ occurs. To start with, we study the volume of a surface that separates the probability simplex into two disjoint parts.

Proposition 4 (Surface Volume).

Suppose $A\subset\mathcal{P}^{d}$ is a compact subset such that $\operatorname{Vol}^{d}(A)>0$ and $\operatorname{Vol}^{d}(A^{c})>0$. Let $E=A\cap A^{c}$ be the surface that separates $A$ and $A^{c}$, and assume that $E$ comprises $J$ hyperplanes. It then follows that

\[
\frac{\operatorname{Vol}^{d-1}(E)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}\ge J^{-(d-1)}\cdot\left(\frac{\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^{d}(A)}{\left(\operatorname{Vol}^{d}(\mathcal{P}^{d})\right)^{2}}\cdot\frac{2\sqrt{d+1}}{3d\sqrt{2d}}\right)^{d}.
\]
Proof.

For any $x\in\mathcal{P}^{d}$, if $x\in A$, consider the following set

\[
B(x)=\left\{y\in\mathcal{P}^{d}\,\big|\,\exists z\in A^{c}\text{ s.t. $x,y,z$ are on the same line $\ell$}\right\}.
\]

Apparently, $A^{c}\subseteq B(x)$. Hence, we have $\operatorname{Vol}^{d}(B(x))\ge\operatorname{Vol}^{d}(A^{c})$. Similarly, for $x\in A^{c}$, we can define $B(x)$ as

\[
B(x)=\left\{y\in\mathcal{P}^{d}\,\big|\,\exists z\in A\text{ s.t. $x,y,z$ are on the same line $\ell$}\right\},
\]

and it follows that $\operatorname{Vol}^{d}(B(x))\ge\operatorname{Vol}^{d}(A)$. Following these observations, we have

\[
\mathcal{L}_{B}:=\int_{x\in\mathcal{P}}\operatorname{Vol}^{d}(B(x))\,\mathrm{d}x\ge 2\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^{d}(A).
\tag{D.4}
\]

Another important fact is that the line $\ell$ connecting $x,y,z$ must pass through $E$ at least once, since $z$ and $x$ belong to different regions. For any $(d-1)$-dimensional surface $S\subset\mathcal{P}^{d}$ and any point $x\in\mathcal{P}^{d}$, we define

\[
C(S,x)=\left\{y\in\mathcal{P}^{d}\,\big|\,\exists z\in S\text{ s.t. $x,y,z$ are on the same line}\right\}.
\]

Apparently, we have $B(x)\subseteq C(E,x)$ and thus $\mathcal{L}_{B}\le\mathcal{L}_{C(E,\cdot)}$. Consider $S=s_1\cup s_2$; it follows from the definition that $C(S,x)=C(s_1,x)\cup C(s_2,x)$, which implies that $\operatorname{Vol}^{d}(C(S,x))\le\operatorname{Vol}^{d}(C(s_1,x))+\operatorname{Vol}^{d}(C(s_2,x))$. Suppose that $E=E_1\cup E_2\cup\dots\cup E_J$; it then holds that

\[
\mathcal{L}_{C(E,\cdot)}=\int_{x\in\mathcal{P}}\operatorname{Vol}^{d}(C(E,x))\,\mathrm{d}x\le\sum_{E_i}\underbrace{\int_{x\in\mathcal{P}}\operatorname{Vol}^{d}(C(E_i,x))\,\mathrm{d}x}_{=:R(E_i)}.
\]

Suppose each $E_i$ is small enough that it lies on a hyperplane\footnote{Or approximating the surface $E$ with countably infinite hyperplanes.}. We draw hyperplanes $S_1^{i}\parallel S_2^{i}\parallel E_i$ such that the two farthest vertices of the simplex $\mathcal{P}^{d}$ lie on $S_1^{i}$ and $S_2^{i}$. We denote the region between $S_1^{i}$ and $S_2^{i}$ by $\mathcal{Q}^{i}$. We split $\mathcal{Q}^{i}$ into $\mathcal{Q}^{i}_1$ and $\mathcal{Q}^{i}_2$, where $\mathcal{Q}^{i}_1$ consists of the points whose distance to the hyperplane containing $E_i$ is larger than $\gamma_i$, for some $\gamma_i\le\sqrt{2}/2$, and $\mathcal{Q}^{i}_2=\mathcal{Q}^{i}\backslash\mathcal{Q}^{i}_1$. Consider the following definition for $x\in\mathcal{Q}^{i}_1$,

\[
\widetilde{C}(E_i,x)=\left\{y\in\mathcal{Q}^{i}\,\big|\,\exists z\in E_i\text{ s.t. $x,y,z$ are on the same line}\right\},
\]

and for $x\in\mathcal{Q}^{i}_2$,

\[
\widetilde{C}(E_i,x)=\left\{y\in\mathcal{P}^{d}\,\big|\,\exists z\in E_i\text{ s.t. $x,y,z$ are on the same line}\right\}.
\]

Apparently, $C(E_i,x)\subseteq\widetilde{C}(E_i,x)$ and we thus have

\[
\begin{aligned}
R(E_i)
&\le\int_{x\in\mathcal{Q}^{i}_1}\operatorname{Vol}^{d}(\widetilde{C}(E_i,x))\,\mathrm{d}x+\int_{x\in\mathcal{Q}^{i}_2}\operatorname{Vol}^{d}(\widetilde{C}(E_i,x))\,\mathrm{d}x\\
&\le\operatorname{Vol}^{d}(\mathcal{Q}^{i}_1)\cdot\operatorname{Vol}^{d-1}(E_i)\cdot\left(\frac{\sqrt{2}}{\gamma_i}\right)^{d-1}\cdot\sqrt{2}+\operatorname{Vol}^{d}(\mathcal{Q}^{i}_2)\cdot\operatorname{Vol}^{d}(\mathcal{P}^{d})\\
&\le\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left((\sqrt{2})^{d}\operatorname{Vol}^{d-1}(E_i)\,\gamma_i^{-(d-1)}+2\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\cdot\gamma_i\right),
\end{aligned}
\]

where the second inequality follows from the definition of $\mathcal{Q}_1^{i}$ and the fact that the largest hyperplane section contained in $\widetilde{C}(E_i,x)$ has volume at most $(\sqrt{2}/\gamma_i)^{d-1}\operatorname{Vol}^{d-1}(E_i)$. Since $\gamma_i$ is adjustable, we plug in

\[
\gamma_i=\sqrt{2}\cdot\left(\frac{\operatorname{Vol}^{d-1}(E_i)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}\right)^{1/d},
\]

and obtain

\[
R(E_i)\le 3\sqrt{2}\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left(\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\right)^{(d-1)/d}\cdot\left(\operatorname{Vol}^{d-1}(E_i)\right)^{1/d}.
\]
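The constant $3\sqrt{2}$ comes from substituting the chosen $\gamma_i$ into the bracketed term of the previous bound. A small sketch of this arithmetic follows; the function names and the scalar arguments `vol_e` (for $\operatorname{Vol}^{d-1}(E_i)$) and `vol_p` (for $\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})$) are illustrative placeholders, and the common factor $\operatorname{Vol}^{d}(\mathcal{P}^{d})$ is omitted on both sides.

```python
import math


def bracket(vol_e, vol_p, gamma, d):
    """(sqrt 2)^d * Vol(E_i) * gamma^{-(d-1)} + 2 * Vol(P^{d-1}) * gamma."""
    return math.sqrt(2) ** d * vol_e * gamma ** (-(d - 1)) + 2 * vol_p * gamma


def chosen_gamma(vol_e, vol_p, d):
    """gamma_i = sqrt(2) * (Vol(E_i) / Vol(P^{d-1}))^{1/d}."""
    return math.sqrt(2) * (vol_e / vol_p) ** (1 / d)


def closed_form(vol_e, vol_p, d):
    """3 * sqrt(2) * Vol(P^{d-1})^{(d-1)/d} * Vol(E_i)^{1/d}."""
    return 3 * math.sqrt(2) * vol_p ** ((d - 1) / d) * vol_e ** (1 / d)
```

With this choice the first term contributes $\sqrt{2}$ and the second $2\sqrt{2}$ times the common factor, so `bracket(v_e, v_p, chosen_gamma(v_e, v_p, d), d)` matches `closed_form(v_e, v_p, d)` up to rounding. The algebra holds for any positive inputs; the constraint $\gamma_i\le\sqrt{2}/2$ is not enforced by the sketch.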

Summing over $i\in[J]$ and using Jensen's inequality, we obtain

\[
\mathcal{L}_{C(E,\cdot)}\le 3\sqrt{2}\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left(J\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\right)^{(d-1)/d}\cdot\left(\operatorname{Vol}^{d-1}(E)\right)^{1/d}.
\tag{D.5}
\]
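The Jensen step uses the concavity of $t\mapsto t^{1/d}$, which gives $\sum_{i}v_i^{1/d}\le J^{(d-1)/d}\left(\sum_i v_i\right)^{1/d}$ for nonnegative $v_1,\dots,v_J$. The sketch below checks this inequality numerically, under the assumption that the pieces $E_i$ have volumes summing to $\operatorname{Vol}^{d-1}(E)$; the function name is illustrative.

```python
def jensen_gap(volumes, d):
    """Return J^{(d-1)/d} * (sum v_i)^{1/d} - sum v_i^{1/d}, which should be >= 0.

    `volumes` is a hypothetical list of the piece volumes Vol^{d-1}(E_i).
    """
    num_pieces = len(volumes)
    lhs = sum(v ** (1 / d) for v in volumes)
    rhs = num_pieces ** ((d - 1) / d) * sum(volumes) ** (1 / d)
    return rhs - lhs
```

The gap vanishes when all pieces have equal volume, which is the equality case of Jensen's inequality.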

Combining (D.5) with (D.4) and using the inequality $\mathcal{L}_{C(E,\cdot)}\geq\mathcal{L}_{B}$, we obtain

$$2\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^{d}(A)\leq 3\sqrt{2}\,\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left(J\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\right)^{(d-1)/d}\cdot\left(\operatorname{Vol}^{d-1}(E)\right)^{1/d},$$

which further implies that

$$\begin{aligned}
\operatorname{Vol}^{d-1}(E) &\geq \left(\frac{2\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^{d}(A)}{3\sqrt{2}\,\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left(J\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\right)^{(d-1)/d}}\right)^{d} \\
&= J^{-(d-1)}\cdot\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\cdot\left(\frac{2\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^{d}(A)}{3\sqrt{2}\left(\operatorname{Vol}^{d}(\mathcal{P}^{d})\right)^{2}}\right)^{d}\cdot\left(\frac{\sqrt{d+1}}{d\sqrt{d}}\right)^{d}.
\end{aligned}$$

Following Proposition 4, a direct conclusion is that if $A$ contains $N_{1}$ action sections and $A^{c}$ contains $N_{2}$ action sections ($N=N_{1}+N_{2}$), there exists a surface $E_{k}$ such that

$$\frac{\operatorname{Vol}^{d-1}(E_{k})}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}\geq J^{-d}\cdot\left(\frac{\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^{d}(A)}{\left(\operatorname{Vol}^{d}(\mathcal{P}^{d})\right)^{2}}\cdot\frac{2\sqrt{d+1}}{3d\sqrt{2d}}\right)^{d}\geq\left(\frac{2\sqrt{d+1}\,\varsigma^{2}}{3d\sqrt{2d}}\right)^{d},$$

where the last inequality follows from Assumption 3 and the fact that $J\leq N_{1}N_{2}$. However, a sampled line $\ell$ passing through such a surface $e_{ij}$ does not guarantee event $\mathcal{E}_{ij}^{d}$, even though $\operatorname{Vol}^{d-1}(e_{ij})$ is bounded below. The following proposition states a sufficient condition for $\mathcal{E}_{ij}^{d}$ to hold.

Proposition 5 (Effective Surface).

Define the effective surface $\widetilde{e}_{ij}$ as

$$\widetilde{e}_{ij}=\left\{x\in e_{ij}\,\middle|\,\left\|x-y\right\|_{2}\geq\iota+h,\quad\forall y\in\partial e_{ij}\right\},$$

where $\iota=2\sqrt{2(d+1)}\,h/(d\sqrt{d}\,\varsigma)$ and $h=c_{d}+\varepsilon>c_{d}\sqrt{1-(d+1)^{-1}}+\varepsilon$. Under Assumption 3, event $\mathcal{E}_{ij}^{d}$ holds for any searching line $\ell$ crossing $\widetilde{e}_{ij}$.

Proof.

Under the condition $\iota>0$, the simplex is guaranteed not to intersect $\partial e_{ij}$. However, this condition alone does not guarantee that the simplex $\mathbb{S}^{d}$ contains only actions $i,j$, since a third action may occur outside $e_{ij}$. In the sequel, we show that no third action can appear if $\ell$ goes through $\widetilde{e}_{ij}$. Define $e_{ij}^{\prime}$ as

$$e_{ij}^{\prime}=\left\{x\in e_{ij}\,\middle|\,\exists y\in\widetilde{e}_{ij}\text{ s.t. }\left\|y-x\right\|_{2}\leq h\right\}.$$

Clearly, for any $x\in e^{\prime}_{ij}$ and $y\in\partial e_{ij}$, we have $\left\|x-y\right\|_{2}\geq\iota$. We now prove that the prism $A$ with base $e_{ij}^{\prime}$ and height $h$ belongs to either $\mathcal{V}_{i}$ or $\mathcal{V}_{j}$. Suppose that $A$ lies on action $i$'s side and that there exists a point in $A$ whose best response is not $i$. Then there must be a hyperplane $e_{ik}$ passing through $A$ for some $k\notin\{i,j\}$. Suppose $z\in e_{ik}\cap A$. Consider the set $E_{ik}\cap E_{ij}$, where $E_{ij}$ is the whole hyperplane containing $e_{ij}$. For any $x\in e_{ij}^{\prime}$ and $y\in E_{ik}\cap E_{ij}$, we have $\left\|x-y\right\|_{2}\geq\iota$, since $E_{ik}\cap E_{ij}$ must lie outside $e_{ij}$. In addition, since $\mathcal{V}_{i}$ is a convex set, $\mathcal{V}_{i}$ must lie between $E_{ik}$ and $E_{ij}$. Therefore, the volume of $\mathcal{V}_{i}$ is no more than the volume of the region that lies between $E_{ik}$ and $E_{ij}$,

$$\operatorname{Vol}^{d}(\mathcal{V}_{i})\leq\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\cdot\frac{\sqrt{2}\,h}{\iota}=\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\frac{\sqrt{2(d+1)}\,h}{d\sqrt{d}\,\iota}.$$

By Assumption 3, we conclude that

$$\iota\leq\frac{\sqrt{2(d+1)}\,h}{d\sqrt{d}\,\varsigma},$$

which contradicts the condition $\iota>\sqrt{2(d+1)}\,h/(d\sqrt{d}\,\varsigma)$. Thus, we conclude that $A\subseteq\mathcal{V}_{i}$. The same argument shows that $B\subseteq\mathcal{V}_{j}$ if $B$ lies on action $j$'s side. For any line $\ell$ passing through $\widetilde{e}_{ij}$, let $x=\ell\cap\widetilde{e}_{ij}$. Then $x\in A\cup B$, and the minimal distance from $x$ to $\partial(A\cup B)$ is at least $h>\varepsilon$. Thus, $x$ is guaranteed to be detected by the binary search on $\ell$. On the other hand, $h-\varepsilon>c_{d}\sqrt{1-(d+1)^{-1}}$, meaning that the simplex $\mathbb{S}^{d}$ placed at $(x_{k}+y_{k})/2$ also lies within $A\cup B$. Thus, the simplex contains only two actions and event $\mathcal{E}_{ij}^{d}$ follows. ∎
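As a quick numerical illustration of the constants in Proposition 5, the following Python sketch computes $h$, $\iota$, and the shrink distance $\iota+h$, and compares $\iota$ with the threshold $\sqrt{2(d+1)}\,h/(d\sqrt{d}\,\varsigma)$ used in the contradiction step of the proof. The function names and the numerical values of $d$, $\varsigma$, $c_d$, and $\varepsilon$ are assumptions made for illustration only; they are not taken from the paper.

```python
import math


def effective_surface_margins(d: int, varsigma: float, c_d: float, eps: float):
    """Return (h, iota, shrink) for the effective-surface definition.

    h      = c_d + eps
    iota   = 2 * sqrt(2 (d + 1)) * h / (d * sqrt(d) * varsigma)
    shrink = iota + h, the distance kept between the effective surface
             and the boundary of e_ij.
    """
    if d < 1 or varsigma <= 0 or c_d <= 0 or eps <= 0:
        raise ValueError("expected d >= 1 and positive varsigma, c_d, eps")
    h = c_d + eps
    iota = 2.0 * math.sqrt(2.0 * (d + 1)) * h / (d * math.sqrt(d) * varsigma)
    return h, iota, iota + h


def contradiction_threshold(d: int, varsigma: float, h: float) -> float:
    """Upper bound on iota that a third action inside the prism would force."""
    return math.sqrt(2.0 * (d + 1)) * h / (d * math.sqrt(d) * varsigma)


# Hypothetical parameter values, chosen only to exercise the formulas.
h, iota, shrink = effective_surface_margins(d=2, varsigma=0.5, c_d=0.1, eps=0.01)
threshold = contradiction_threshold(d=2, varsigma=0.5, h=h)
```

With this choice of $\iota$, the value is exactly twice the threshold, so the strict inequality needed for the contradiction holds for any positive $h$ and $\varsigma$.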

Proposition 5 characterizes a sufficient condition for $\mathcal{E}_{ij}^{d}$ to hold, namely that the searching line $\ell$ passes through the effective surface $\widetilde{e}_{ij}$. The next question is whether we can find enough effective surfaces via random sampling. To answer this question, we need to argue that there are sufficiently many effective surfaces with surface volume bounded below. It is equally important to lower-bound the probability that $\ell$ passes through an effective surface with positive surface volume. Recall the definition of $\sigma_{d}$:

$$\mathbb{P}(\ell\cap e\neq\emptyset)\geq\sigma_{d}\left(\frac{\operatorname{Vol}^{d-1}(e)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}\right),\quad\forall e\in\mathrm{Conv}^{d-1}.$$

Now, we study how many effective surfaces need to be detected. From Proposition 3, we see that if event $\mathcal{E}_{ij}^{d}$ holds, we can estimate $p_{i}-p_{j}$ up to small errors. To learn all the pairwise outcome distribution differences, we just need to construct a tree in graph $\mathcal{G}$, where an edge $e_{ij}$ is selected if $\mathcal{E}_{ij}^{d}$ happens. To see this point, we invoke the following definition.

Definition 5 (Connected Components).

We say that actions $i$ and $j$ belong to the same connected component if there exists a path $i,k_{1},\dots,k_{n},j$ such that events $\mathcal{E}_{ik_{1}}^{d},\mathcal{E}_{k_{1}k_{2}}^{d},\dots,\mathcal{E}_{k_{n}j}^{d}$ happen.
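The bookkeeping behind Definition 5 can be sketched with a standard union-find structure: each detected event $\mathcal{E}_{ij}^{d}$ is treated as an edge between actions $i$ and $j$, and an edge is kept for the tree only when it merges two components. This is an illustrative sketch, assuming actions are indexed $0,\dots,N-1$ and detected events arrive as index pairs; the class `ActionComponents`, its methods, and the example pairs are hypothetical and not part of the paper's algorithm.

```python
class ActionComponents:
    """Union-find over action indices 0..n-1; an edge is a detected event E_ij^d."""

    def __init__(self, n_actions: int) -> None:
        self.parent = list(range(n_actions))
        self.count = n_actions  # current number of connected components

    def find(self, i: int) -> int:
        # Path halving keeps the trees shallow.
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def record_event(self, i: int, j: int) -> bool:
        """Record that the event for the pair (i, j) happened.

        Returns True if this merged two components (the edge joins the tree),
        and False if i and j were already connected.
        """
        root_i, root_j = self.find(i), self.find(j)
        if root_i == root_j:
            return False
        self.parent[root_i] = root_j
        self.count -= 1
        return True

    def connected(self, i: int, j: int) -> bool:
        return self.find(i) == self.find(j)


# Illustrative run with five actions and hypothetical detected pairs.
components = ActionComponents(5)
detected = [(0, 1), (1, 2), (0, 2), (3, 4), (2, 3)]
tree_edges = [pair for pair in detected if components.record_event(*pair)]
```

Once a single component remains, the kept edges form a tree on all $N$ actions, so any pairwise difference $p_{i}-p_{j}$ can be recovered by summing the estimated differences along the tree path from $i$ to $j$.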

In the sequel, we use $C$ to denote a connected component. With a slight abuse of notation, we also denote by $C$ the union of sections $\mathcal{V}_{k}$ such that $k\in C$. Applying Proposition 4 and the discussion following it with $C$ as $A$ and $C^{c}$ as $A^{c}$, it follows that there exists a surface $e_{ij}$ on the boundary between $C$ and $C^{c}$ such that

$$\frac{\operatorname{Vol}^{d-1}(e_{ij})}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}\geq\left(\frac{2\sqrt{d+1}\,\varsigma^{2}}{3d\sqrt{2d}}\right)^{d}.$$

Here, we assume that $i\in C$ and $j\in C^{c}$. Note that $\operatorname{Vol}^{d-2}(\partial e)\leq\operatorname{Vol}^{d-2}(\partial\mathcal{P}^{d-1})$ because $e$ is convex. Recall the definition of the effective surface $\widetilde{e}_{ij}$, which corresponds to shrinking $e_{ij}$ inward from its boundary by distance $\iota+h$. Therefore, the normalized volume of $\widetilde{e}_{ij}$ is at least

$$\begin{aligned}
\frac{\operatorname{Vol}^{d-1}(\widetilde{e}_{ij})}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})} &\geq \frac{\operatorname{Vol}^{d-1}(e_{ij})-\operatorname{Vol}^{d-2}(\partial e)\,(h+\iota)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})} \\
&\geq \frac{\left(\frac{2\sqrt{d+1}\,\varsigma^{2}}{3d\sqrt{2d}}\right)^{d}\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})-d\operatorname{Vol}^{d-2}(\mathcal{P}^{d-2})\,(h+\iota)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})} \\
&\geq \left(\frac{2\sqrt{d+1}\,\varsigma^{2}}{3d\sqrt{2d}}\right)^{d}-(d-1)\sqrt{(d-1)d}\left(1+\frac{2\sqrt{2(d+1)}}{d\sqrt{d}\,\varsigma}\right)\cdot h \\
&=: \tau_{d}.
\end{aligned}$$

Therefore, by Proposition 5, event $\mathcal{E}_{ij}^{d}$ happens in the next sample with probability at least $\sigma_{d}(\tau_{d})$, in which case $C$ expands. Since $C$ can expand at most $N-1$ times, after $T=b\log(N-1)/\sigma_{d}(\tau_{d})$ samples we have $C=\mathcal{A}$ with probability at least $1-(1/N)^{b-1}$. Moreover, when $C=\mathcal{A}$, the error for estimating $p_{i}-p_{j}$ is bounded by

$$\left\|(p_{i}-p_{j})-(p_{i}^{*}-p_{j}^{*})\right\|_{2}\leq(N-2)\,\varphi_{d}^{-1}\sqrt{d}\,\varepsilon.$$
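To get a feel for the sample count $T=b\log(N-1)/\sigma_{d}(\tau_{d})$, the following Python sketch evaluates the lower bound $\tau_{d}$ and the resulting $T$ (rounded up). It is only a sketch under stated assumptions: the function $\sigma_{d}$ is defined elsewhere in the paper and is passed in here as a caller-supplied callable, the identity map used in the example is a placeholder rather than the paper's $\sigma_{d}$, and the function names and all numerical values are hypothetical.

```python
import math
from typing import Callable


def tau_d(d: int, varsigma: float, h: float) -> float:
    """Lower bound on Vol(effective surface) / Vol(P^{d-1}) as defined above."""
    main = (2.0 * math.sqrt(d + 1) * varsigma ** 2
            / (3.0 * d * math.sqrt(2.0 * d))) ** d
    slope = (d - 1) * math.sqrt((d - 1) * d) * (
        1.0 + 2.0 * math.sqrt(2.0 * (d + 1)) / (d * math.sqrt(d) * varsigma)
    )
    return main - slope * h


def samples_needed(n_actions: int, b: float, tau: float,
                   sigma_d: Callable[[float], float]) -> int:
    """Return ceil(b * log(N - 1) / sigma_d(tau))."""
    if n_actions < 2:
        raise ValueError("need at least two actions")
    if tau <= 0:
        raise ValueError("tau_d must be positive; h is too large for this bound")
    hit_prob = sigma_d(tau)
    if not 0.0 < hit_prob <= 1.0:
        raise ValueError("sigma_d(tau) must lie in (0, 1]")
    return math.ceil(b * math.log(n_actions - 1) / hit_prob)


# Hypothetical values; the identity map only stands in for sigma_d.
tau = tau_d(d=2, varsigma=0.5, h=1e-4)
T = samples_needed(n_actions=5, b=2.0, tau=tau, sigma_d=lambda t: t)
```

Note that $\tau_{d}$ is positive only when $h$ is small relative to $\varsigma^{2d}$, which is why the sketch rejects non-positive values instead of returning a sample count.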

Appendix E Proofs in Section 4

E.1 Preliminaries for Regret Analysis

Visitation Measure.

We define the state visitation measure $\rho^{\bm{\pi}}_{h}\in\Delta(\mathcal{S})$ at step $h$ induced by action policy $\bm{\pi}$ as

$$\rho^{\bm{\pi}}_{h}(s):=\mathop{\mathbf{E}}_{\bm{\pi},\{P_{h}\}_{h=0}^{H}}\left[\mathds{1}(s_{h}=s)\right].$$

Given that $\rho_{1}^{\bm{\pi}}=P_{0}$, the visitation measure can be computed iteratively from $h=1$ to $H+1$ as

$$\rho^{\bm{\pi}}_{h+1}(s)=\langle\rho_{h}^{\bm{\pi}}\otimes\pi_{h},P_{h}\rangle_{\mathcal{S}\times\mathcal{A}}=\sum_{s^{\prime}\in\mathcal{S}}\sum_{a\in\mathcal{A}}\rho_{h}^{\bm{\pi}}(s^{\prime})\,\pi_{h}(a|s^{\prime})\,P_{h}(s|s^{\prime},a).$$
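The recursion above can be sketched in a few lines of Python for a tabular problem. This is a minimal illustration, assuming finite state and action sets indexed from 0 and nested lists for the policy and transition kernel; the function names, the data layout, and the toy two-state example are assumptions made here and are not part of the paper.

```python
from typing import List

Vector = List[float]              # indexed by state
Policy = List[List[float]]        # policy[s][a] = pi_h(a | s)
Kernel = List[List[List[float]]]  # kernel[s][a][s_next] = P_h(s_next | s, a)


def next_visitation(rho_h: Vector, pi_h: Policy, p_h: Kernel) -> Vector:
    """One step of the recursion: rho_{h+1}(s) = sum_{s', a} rho_h(s') pi_h(a|s') P_h(s|s', a)."""
    n_states = len(rho_h)
    rho_next = [0.0] * n_states
    for s_prev in range(n_states):
        for a, prob_a in enumerate(pi_h[s_prev]):
            weight = rho_h[s_prev] * prob_a
            for s_next in range(n_states):
                rho_next[s_next] += weight * p_h[s_prev][a][s_next]
    return rho_next


def visitation_measures(p0: Vector, policies: List[Policy],
                        kernels: List[Kernel]) -> List[Vector]:
    """Return [rho_1, ..., rho_{H+1}], starting from rho_1 = P_0."""
    rhos = [list(p0)]
    for pi_h, p_h in zip(policies, kernels):
        rhos.append(next_visitation(rhos[-1], pi_h, p_h))
    return rhos


# Toy example: two states, two actions, horizon H = 2, stationary policy and kernel.
p0 = [1.0, 0.0]
pi = [[0.5, 0.5], [1.0, 0.0]]
kernel = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [0.0, 1.0]],  # state 1 is absorbing
]
rhos = visitation_measures(p0, [pi, pi], [kernel, kernel])
```

Each $\rho_{h}^{\bm{\pi}}$ produced this way remains a probability vector, which gives a cheap sanity check on a hand-built kernel.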

With this definition, we have for any $\bm{x}\in\mathcal{X},\bm{\pi}\in\Pi$,

$$\begin{aligned}
U^{\bm{x},\bm{\pi}} &=\sum_{h=1}^{H}\langle\rho_{h}^{\bm{\pi}}\otimes\pi_{h},\,P_{h}\cdot x_{h}-c_{h}\rangle_{\mathcal{S}\times\mathcal{A}}=\sum_{h=1}^{H}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\rho_{h}^{\bm{\pi}}(s)\,\pi_{h}(a|s)\,[P_{h}(s,a)\cdot x_{h}(s)-c_{h}(s,a)],\\
V^{\bm{x},\bm{\pi}} &=\sum_{h=1}^{H}\langle\rho_{h}^{\bm{\pi}}\otimes\pi_{h},\,r_{h}-P_{h}\cdot x_{h}\rangle_{\mathcal{S}\times\mathcal{A}}=\sum_{h=1}^{H}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\rho_{h}^{\bm{\pi}}(s)\,\pi_{h}(a|s)\,[r_{h}(s,a)-P_{h}(s,a)\cdot x_{h}(s)].
\end{aligned}$$
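As a minimal sketch of the two sums above, the following Python function accumulates the agent's utility $U$ and the principal's utility $V$ from precomputed visitation measures. The function name `utilities` and the nested-list layout are assumptions made for illustration; in particular, `x[h][s]` is taken to be the vector of payments indexed by the realized next state, so that `P[h][s][a]` dotted with `x[h][s]` is the expected payment.

```python
def utilities(rho, pi, P, x, c, r):
    """Return (U, V) for contract x and policy pi.

    rho[h][s]      : visitation measure at step h (0-indexed)
    pi[h][s][a]    : policy probabilities
    P[h][s][a][s2] : transition probabilities
    x[h][s][s2]    : payment when the next state s2 is realized (assumed layout)
    c[h][s][a]     : agent's cost;  r[h][s][a] : principal's reward
    """
    U = 0.0
    V = 0.0
    for h in range(len(P)):
        for s, rho_s in enumerate(rho[h]):
            for a, p_a in enumerate(pi[h][s]):
                # Expected payment P_h(s, a) . x_h(s).
                pay = sum(p * x_s2 for p, x_s2 in zip(P[h][s][a], x[h][s]))
                w = rho_s * p_a
                U += w * (pay - c[h][s][a])
                V += w * (r[h][s][a] - pay)
    return U, V
```

Since the payment enters $U$ and $V$ with opposite signs, `U + V` equals the expected reward minus the expected cost regardless of the contract.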

Parameter Estimation.

Algorithm 1 omits the details of how it estimates the parameters from the trajectory $\{(s_{h}^{t},a_{h}^{t},r_{h}^{t})\}_{h\in[H]}$ observed in each episode. Specifically, at the end of the $t$-th episode, given this trajectory, it updates the counting variables for all $h\in[H]$,

$$\begin{aligned}
N_{h}^{t+1}(s,a) &\leftarrow N_{h}^{t}(s,a)+\mathbf{1}[s^{t}_{h}=s,a^{t}_{h}=a],\\
N_{h}^{t+1}(s,a;s') &\leftarrow N_{h}^{t}(s,a;s')+\mathbf{1}[s^{t}_{h}=s,a^{t}_{h}=a,s^{t}_{h+1}=s'],\\
N_{h}^{t}(s) &=\sum_{a\in\mathcal{A}}N_{h}^{t}(s,a),
\end{aligned}$$

and the empirical mean reward, empirical transition probabilities, and bonus for all $h\in[H],s,s'\in\mathcal{S},a\in\mathcal{A}$,

$$\begin{aligned}
\widehat{r}_{h}^{t}(s,a) &\leftarrow\frac{\sum_{\tau\in[t-1]}\mathbf{1}[s^{\tau}_{h}=s,a^{\tau}_{h}=a]\,r^{\tau}_{h}(s,a)}{N^{t}_{h}(s,a)},\\
\widehat{P}^{t}_{h}(s'|s,a) &\leftarrow\frac{N^{t}_{h}(s,a;s')}{N^{t}_{h}(s,a)},\\
b_{h}^{t}(s,a) &\leftarrow(2H+2)\sqrt{\frac{\ln(SAHT/\delta)}{N_{h}^{t}(s,a)}}.
\end{aligned}$$
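The bookkeeping above can be sketched as a small Python class. Everything named here (`TabularEstimator`, its methods, the 4-tuple trajectory format that carries the next state explicitly, 0-indexed steps) is a hypothetical choice for illustration. The paper does not say how unvisited pairs with $N_h^t(s,a)=0$ are handled; this sketch clips the count at 1, which is an assumption.

```python
import math
from collections import defaultdict


class TabularEstimator:
    """Counts, empirical means, and bonuses for a tabular episodic MDP."""

    def __init__(self, S, A, H, T, delta):
        self.H = H
        self.log_term = math.log(S * A * H * T / delta)
        self.n_sa = defaultdict(int)     # (h, s, a) -> N_h(s, a)
        self.n_sas = defaultdict(int)    # (h, s, a, s') -> N_h(s, a; s')
        self.r_sum = defaultdict(float)  # (h, s, a) -> sum of observed rewards

    def update(self, trajectory):
        """trajectory[h] = (s_h, a_h, r_h, s_{h+1}), steps 0-indexed."""
        for h, (s, a, r, s_next) in enumerate(trajectory):
            self.n_sa[(h, s, a)] += 1
            self.n_sas[(h, s, a, s_next)] += 1
            self.r_sum[(h, s, a)] += r

    def _n(self, h, s, a):
        # Assumption: clip at 1 so unvisited pairs do not divide by zero.
        return max(1, self.n_sa[(h, s, a)])

    def r_hat(self, h, s, a):
        return self.r_sum[(h, s, a)] / self._n(h, s, a)

    def p_hat(self, h, s, a, s_next):
        return self.n_sas[(h, s, a, s_next)] / self._n(h, s, a)

    def bonus(self, h, s, a):
        # b_h(s, a) = (2H + 2) * sqrt(ln(SAHT / delta) / N_h(s, a))
        return (2 * self.H + 2) * math.sqrt(self.log_term / self._n(h, s, a))
```

Calling `update` once per episode and then querying `r_hat`, `p_hat`, and `bonus` reproduces the three estimators; the bonus shrinks at rate $1/\sqrt{N}$ as a pair is visited more often.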

For notational convenience, in the sequel we refer to the $t$-th episode as the $(T_{1}+t)$-th episode after running the $\chi(\varepsilon)$-learning procedure for $T_{1}$ rounds.

Lemma 7 (Agarwal et al. [5]).

Fix $\delta\in(0,1)$. In any episode $t$, $\forall s\in\mathcal{S},a\in\mathcal{A},h\in[H]$, with probability at least $1-\delta$, we have

$$\left|\widehat{r}^{t}_{h}(s,a)-r_{h}(s,a)\right|\leq 2\sqrt{\frac{\ln(SAHT/\delta)}{N_{h}^{t}(s,a)}},\quad\left\lVert\widehat{P}_{h}^{t}(s,a)-P_{h}(s,a)\right\rVert_{1}\leq 2\sqrt{\frac{\ln(SAHT/\delta)}{N_{h}^{t}(s,a)}},$$
$$\left|\left(\widehat{P}_{h}^{t}(s,a)-P_{h}(s,a)\right)^{\top}f\right|\leq 8H\sqrt{\frac{S\ln(SAHT/\delta)}{N_{h}^{t}(s,a)}},\quad\forall f:\mathcal{S}\to[0,H].$$

We denote by $\mathcal{E}_{\text{model}}$ the event that the inequalities in Lemmas 7 and 10 hold. To keep our notation consistent, we let $R^{\bm{\pi}}:=\mathop{\mathbf{E}}_{s\sim P_{0}}R_{1}^{\bm{\pi}}(s)$ and $C^{\bm{\pi}}:=\mathop{\mathbf{E}}_{s\sim P_{0}}C_{1}^{\bm{\pi}}(s)$ such that $V^{\bm{\pi}}=R^{\bm{\pi}}-C^{\bm{\pi}}-U^{\bm{\pi}}=R^{\bm{\pi}}-\zeta^{\bm{\pi}}$.

Following the optimism principle, we determine the optimal action policy to explore from $\mathop{\mathrm{argmax}}_{\bm{\pi}}\widehat{R}_{h}^{\bm{\pi}}(s)-\widehat{\zeta}^{\bm{\pi}}_{h}(s)$; optimism is preserved as long as the difference $\widehat{\zeta}_{h}^{\bm{\pi}}(s)-\zeta^{\bm{\pi}}_{h}(s)$ is relatively small. Here, with a carefully chosen reward bonus $b_{h}$, $\widehat{R}_{h}^{\bm{\pi}}(s)$ is the principal's optimistic expected reward for a given action policy $\bm{\pi}$ at the $h$-th step in state $s$, i.e.,

$$\widehat{R}_{h}^{\bm{\pi}}(s)=\min\{H,\ \widehat{r}_{h}(s,\pi_{h}(s))+b_{h}(s,\pi_{h}(s))+\widehat{P}_{h}(s,\pi_{h}(s))\cdot\widehat{R}^{\bm{\pi}}_{h+1}\}.$$
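The truncated backward recursion for $\widehat{R}_{h}^{\bm{\pi}}$ can be sketched as follows for a deterministic policy. The function name `optimistic_values`, the nested-list layout, and the 0-indexed steps (with `R[H]` holding the terminal value $\widehat{R}_{H+1}^{\bm{\pi}}=0$) are assumptions of this illustration, not the paper's implementation.

```python
def optimistic_values(r_hat, bonus, P_hat, pi, H):
    """Return R with R[h][s] = optimistic value of policy pi at step h, state s.

    r_hat[h][s][a], bonus[h][s][a] : empirical reward and exploration bonus
    P_hat[h][s][a][s2]             : empirical transition probabilities
    pi[h][s]                       : action chosen by a deterministic policy
    """
    S = len(pi[0])
    R = [[0.0] * S for _ in range(H + 1)]  # R[H] is the all-zero terminal value
    for h in reversed(range(H)):
        for s in range(S):
            a = pi[h][s]
            # Expected optimistic continuation value under the empirical model.
            cont = sum(p * v for p, v in zip(P_hat[h][s][a], R[h + 1]))
            # Truncate at H, as in the min{H, ...} of the definition.
            R[h][s] = min(H, r_hat[h][s][a] + bonus[h][s][a] + cont)
    return R
```

The truncation at $H$ keeps the optimistic values within $[0,H]$ when rewards and bonuses are nonnegative, which is the range of test functions $f$ covered by Lemma 7.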
Lemma 8 (Optimism).

If $\mathcal{E}_{\text{model}}$ is true, then for any $\bm{\pi}$ and any episode $t$, $\widehat{R}^{t,\bm{\pi}}(s)-R^{\bm{\pi}}(s)\geq 0$.

Proof.

We prove by induction that $\widehat{R}^{t,\bm{\pi}}(s)-R^{\bm{\pi}}(s)\geq 0$. Starting from step $H+1$, since $\widehat{R}_{H+1}^{t,\bm{\pi}}(s)=R_{H+1}^{\bm{\pi}}(s)=0,\forall s\in\mathcal{S}$, the base case holds. For the inductive step, given that $\widehat{R}_{h+1}^{t,\bm{\pi}}(s)\geq R_{h+1}^{\bm{\pi}}(s),\forall s\in\mathcal{S}$, we can derive the following inequality for any $s\in\mathcal{S}$, where $a=\pi_{h}(s)$:

$$\begin{aligned}
\widehat{R}_{h}^{t,\bm{\pi}}(s)-R_{h}^{\bm{\pi}}(s) &=\widehat{r}^{t}_{h}(s,a)+b^{t}_{h}(s,a)-r_{h}(s,a)+\widehat{P}^{t}_{h}(s,a)\cdot\widehat{R}^{t,\bm{\pi}}_{h+1}-P_{h}(s,a)\cdot R^{\bm{\pi}}_{h+1}\\
&\geq\widehat{r}^{t}_{h}(s,a)+b^{t}_{h}(s,a)-r_{h}(s,a)+\widehat{P}^{t}_{h}(s,a)\cdot R^{\bm{\pi}}_{h+1}-P_{h}(s,a)\cdot R^{\bm{\pi}}_{h+1}\\
&=\widehat{r}^{t}_{h}(s,a)+b^{t}_{h}(s,a)-r_{h}(s,a)+\left(\widehat{P}_{h}^{t}(s,a)-P_{h}(s,a)\right)\cdot R^{\bm{\pi}}_{h+1}\\
&\geq b_{h}^{t}(s,a)-(2+2H)\sqrt{\frac{\ln(SAHT/\delta)}{N_{h}^{t}(s,a)}}\geq 0,
\end{aligned}$$

where the first inequality is due to the induction hypothesis $\widehat{R}_{h+1}^{t,\bm{\pi}}(s)\geq R_{h+1}^{\bm{\pi}}(s),\forall s\in\mathcal{S}$, and the last inequality is due to Lemma 7. This completes the induction and concludes the proof. ∎

Lemma 9.

If $\mathcal{E}_{\text{model}}$ is true, $\sum_{t=1}^{T}\widehat{R}^{t,\bm{\pi}^{t}}(s)-R^{\bm{\pi}^{t}}(s)=O(H^{2}S\sqrt{AT\ln(SAHT/\delta)})$.

Proof.

We apply the simulation lemma [5] to bound the difference term for each episode $t$ in $E_{2}$,

$$\begin{aligned}
\widehat{R}^{t,\bm{\pi}^{t}}-R^{\bm{\pi}^{t}} &\leq\sum_{h=1}^{H}\langle\rho^{\bm{\pi}^{t}}_{h}\otimes\pi_{h}^{t},\,b^{t}_{h}+\left(\widehat{P}_{h}^{t}-P_{h}\right)\cdot\widehat{R}^{t,\bm{\pi}^{t}}_{h+1}\rangle_{\mathcal{S}\times\mathcal{A}}\\
&\leq\sum_{h=1}^{H}\mathop{\mathbf{E}}_{s,a\sim\rho^{\bm{\pi}^{t}}_{h}\otimes\pi_{h}^{t}}\bigg[10H\sqrt{\frac{S\ln(SAHT/\delta)}{N_{h}^{t}(s,a)}}\bigg],
\end{aligned}$$

where the last inequality follows from Lemma 7.

The cumulative loss over all $T$ episodes can be bounded as

$$\begin{aligned}
\sum_{t=1}^{T}\widehat{R}^{t,\bm{\pi}^{t}}-R^{\bm{\pi}^{t}} &\leq 10H\sqrt{S\ln(SAHT/\delta)}\mathop{\mathbf{E}}\bigg[\sum_{t=1}^{T}\sum_{h=1}^{H}\sqrt{1/N_{h}^{t}(s,a)}\bigg]\\
&\leq 20H^{2}\sqrt{S\ln(SAHT/\delta)}\sqrt{SAT}\\
&=O(H^{2}S\sqrt{AT\ln(SAHT/\delta)}),
\end{aligned}$$

where the first inequality sums the per-episode bound over $t$ and uses the linearity of expectation, and the second inequality uses the fact that $\sum_{i=1}^{N}1/\sqrt{i}\leq 2\sqrt{N}$.
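For the step that produces the $\sqrt{SAT}$ factor, a sketch of the standard counting argument may help. It is written for the realized visits at a fixed step $h$, and it assumes (our convention for this illustration, not stated in the paper) that the count in the denominator includes the current visit, so the $i$-th visit to $(s,a)$ contributes $1/\sqrt{i}$. Here $N_{h}(s,a)$ denotes the final count after $T$ episodes.

```latex
\begin{aligned}
\sum_{t=1}^{T}\frac{1}{\sqrt{N_{h}^{t}(s_{h}^{t},a_{h}^{t})}}
 &= \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\ \sum_{i=1}^{N_{h}(s,a)}\frac{1}{\sqrt{i}}
 \leq 2\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\sqrt{N_{h}(s,a)} \\
 &\leq 2\sqrt{SA\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}N_{h}(s,a)}
 = 2\sqrt{SAT}.
\end{aligned}
```

The first inequality is the stated fact $\sum_{i=1}^{N}1/\sqrt{i}\leq 2\sqrt{N}$, the second is the Cauchy-Schwarz inequality over the $SA$ state-action pairs, and the final equality holds because exactly one pair is visited at step $h$ in each episode. Summing over $h\in[H]$ gives $2H\sqrt{SAT}$, which matches the constant $20H^{2}$ in the display above.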

Regret Decomposition.

For ease of analysis, we assume a common regularity condition on the MDP, which ensures that the trajectory induced under any policy has enough randomness. The lower bound constant $\kappa$ is no more than $1/|\mathcal{S}|$, and we expect this assumption to be relaxed in future work.

Assumption 5 (MDP Mixing Condition).

There exists $\kappa>0$ such that $P_{h}(s^{\prime}|s,a)\geq\kappa,\ \forall s^{\prime},s,a.$
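To make the mixing condition concrete, here is a minimal sketch that computes the largest admissible $\kappa$ for a tabular transition kernel and checks that it cannot exceed $1/|\mathcal{S}|$. The nested-list layout `P[h][s][a][s2]`, the helper `random_kernel`, and the sampling range are assumptions made only for this illustration.

```python
import random


def mixing_constant(P):
    """Return the largest kappa with P[h][s][a][s2] >= kappa for every entry."""
    kappa = 1.0
    for P_h in P:
        for P_hs in P_h:
            for dist in P_hs:
                # Each dist must be a probability distribution over next states.
                assert abs(sum(dist) - 1.0) < 1e-9
                kappa = min(kappa, min(dist))
    return kappa


def random_kernel(H, S, A, seed=0):
    """Hypothetical generator of a strictly positive kernel, for the demo only."""
    rng = random.Random(seed)
    P = []
    for _ in range(H):
        P_h = []
        for _ in range(S):
            row = []
            for _ in range(A):
                w = [rng.uniform(0.1, 1.0) for _ in range(S)]
                total = sum(w)
                row.append([x / total for x in w])
            P_h.append(row)
        P.append(P_h)
    return P


H, S, A = 3, 4, 2
kappa = mixing_constant(random_kernel(H, S, A))
# The S entries of a distribution sum to one, so its minimum is at most 1/S.
assert 0.0 < kappa <= 1.0 / S
```

A kernel with any zero transition probability yields `kappa == 0.0` and therefore violates the assumption.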

We now present the proof of Theorem 3 via a decomposition of its regret into different components.

Theorem (Full Statement of Theorem 3).

In a contractual RL problem, with probability at least $1-\delta$, Algorithm 1 has $\widetilde{O}\left((H^{2}SA^{1/2}+H^{2}\kappa^{-1/2})\sqrt{T\ln(SAHT/\delta)}\right)$ regret using the solver in Algorithm 7, and $\widetilde{O}\left((H^{2}SA^{1/2}+\eta\lambda_{w}\kappa^{-1/2})\sqrt{T\ln(SAHT/\delta)}\right)$ regret using the solver in Algorithm 6.

Proof of Theorem 3.

Consider the following decomposition of regret,

\begin{align*}
\operatorname{Reg}(T) &= \sum_{t=1}^{T}\Big(V^{*}-V^{\bm{x}^{t}}\Big) \\
&\leq O(T_{1})+\sum_{t=1}^{T-T_{1}}\Big(V^{*}-V^{\bm{x}^{t}}\Big) \\
&= O(T_{1})+\sum_{t=1}^{T-T_{1}}\Big([V^{*}-\widehat{V}^{t,\bm{\pi}^{*}}]+[\widehat{V}^{t,\bm{\pi}^{*}}-\widehat{V}^{t,\bm{\pi}^{t}}]+[\widehat{V}^{t,\bm{\pi}^{t}}-V^{\bm{x}^{t}}]\Big) \\
&\leq O(T_{1})+\sum_{t=1}^{T-T_{1}}\Big(\widehat{\zeta}^{t,\bm{\pi}^{*}}-\zeta^{\bm{\pi}^{*}}+\widehat{R}^{t,\bm{\pi}^{t}}-R^{\bm{\pi}^{t}}+\left|\zeta^{\bm{\pi}^{t}}-\widehat{\zeta}^{t,\bm{\pi}^{t}}\right|\Big),
\end{align*}

where $\widehat{V}^{t,\bm{\pi}^{*}}=\widehat{R}^{t,\bm{\pi}^{*}}-\widehat{\zeta}^{t,\bm{\pi}^{*}}$ and $\widehat{V}^{t,\bm{\pi}^{t}}=\widehat{R}^{t,\bm{\pi}^{t}}-\widehat{\zeta}^{t,\bm{\pi}^{t}}$ are the principal's estimated values under the least payment contract policy determined from Equation (B.1) with $\{\widehat{P}^{t}_{h},\widehat{r}^{t}_{h},b^{t}_{h},\epsilon^{t}_{h}\}_{h\in[H]}$, and $V^{\bm{\pi}^{t}}=R^{\bm{\pi}^{t}}-\zeta^{\bm{\pi}^{t}}$ is the principal's exact value under the least payment contract policy determined from Equation (B.1). We now consider each of the three bracketed difference terms in the decomposition:

  • The bound on the first term is derived from the value decomposition, $V^{*}-\widehat{V}^{t,\bm{\pi}^{*}}=R^{\bm{\pi}^{*}}-\zeta^{\bm{\pi}^{*}}-\widehat{R}^{t,\bm{\pi}^{*}}+\widehat{\zeta}^{t,\bm{\pi}^{*}}\leq\widehat{\zeta}^{t,\bm{\pi}^{*}}-\zeta^{\bm{\pi}^{*}}$, since the difference $R^{\bm{\pi}^{*}}-\widehat{R}^{t,\bm{\pi}^{*}}\leq 0$ by Lemma 8.

  • The second term is non-positive, $\widehat{V}^{t,\bm{\pi}^{*}}-\widehat{V}^{t,\bm{\pi}^{t}}\leq 0$, since $\bm{\pi}^{t}=\arg\max_{\bm{\pi}}\widehat{R}^{t,\bm{\pi}}-\widehat{\zeta}^{t,\bm{\pi}}$ is the optimal action policy to induce under the optimistic planning.

  • The third term can be decomposed as $\widehat{V}^{t,\bm{\pi}^{t}}-V^{\bm{x}^{t}}=\widehat{R}^{t,\bm{\pi}^{t}}-\widehat{U}^{t,\bm{\pi}^{t}}-C^{\bm{\pi}^{t}}-R^{\bm{\pi}^{t}}+U^{\bm{x}^{t}}+C^{\bm{\pi}^{t}}\leq\widehat{R}^{t,\bm{\pi}^{t}}-R^{\bm{\pi}^{t}}+\left|\zeta^{\bm{\pi}^{t}}-\widehat{\zeta}^{t,\bm{\pi}^{t}}\right|$, since $U^{\bm{x}^{t}}-\widehat{U}^{t,\bm{\pi}^{t}}\eqsim\left|\zeta^{\bm{\pi}^{t}}-\widehat{\zeta}^{t,\bm{\pi}^{t}}\right|$ by either Lemma 12 or 15.

If we adopt the solver in Algorithm 6, by Lemmas 9 and 14, the total regret is

\begin{align*}
\operatorname{Reg}(T) &= O(T_{1})+O\big(H^{2}S\sqrt{A(T-T_{1})\ln(SAHT/\delta)}\big)+\sum_{\tau=1}^{T-T_{1}}O\left(H^{2}\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/\tau}\right) \\
&= O\left(HAS^{5}\kappa_{0}^{-1}\log(\eta T\lambda_{w}^{-1})+H^{2}S\sqrt{AT\ln(SAHT/\delta)}+H^{2}\kappa^{-1/2}\sqrt{T\ln(SAH/\delta)}\right) \\
&= \widetilde{O}\left((H^{2}SA^{1/2}+H^{2}\kappa^{-1/2})\sqrt{T\ln(SAHT/\delta)}\right),
\end{align*}

where the first term $T_{1}=\widetilde{O}(\log T)$ can be dropped, as it is dominated by the second term, which is on the order of $\sqrt{T-T_{1}}$.

Finally, if we adopt the solver in Algorithm 7, by Lemmas 9 and 16, the total regret is

\begin{align*}
\operatorname{Reg}(T) &= O(T_{1})+O\big(H^{2}S\sqrt{A(T-T_{1})\ln(SAHT/\delta)}\big)+\sum_{\tau=1}^{T-T_{1}}O\left(\min\big\{H,\eta\lambda_{s}^{1-H}\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/T}\big\}\right) \\
&= O\left(HAS^{5}\kappa_{0}^{-1}\log(\eta T\lambda_{s}^{-1})\log(1/\delta)+H^{2}S\sqrt{AT\ln(SAHT/\delta)}+\eta\lambda_{s}^{1-H}\kappa^{-1/2}\sqrt{T\ln(SAH/\delta)}\right) \\
&= \widetilde{O}\left((H^{2}SA^{1/2}+\eta\lambda_{s}^{1-H}\kappa^{-1/2})\sqrt{T\ln(SAHT/\delta)}\right),
\end{align*}

where the first term $T_{1}=\widetilde{O}(\log T)$ can be dropped, as it is dominated by the second term, which is on the order of $\sqrt{T-T_{1}}$.

E.2 $\chi(\varepsilon)$-Learning Procedure in Contractual Reinforcement Learning

Input: State and action sets $\mathcal{S},\mathcal{A}$, number of steps $H$, cost function $\{c_{h}\}_{h=1}^{H}$, search precision $\varepsilon$.
For each $s\in\mathcal{S},h\in[H]$, initialize a subroutine $\mathscr{A}(s,h)$ described in Algorithm 4.
Set $N_{s,h}\leftarrow 0$ for all $s,h$.
for $t=1,\dots,\chi(\varepsilon)$ do
       $(s^{t},h^{t})=\arg\min_{s,h}N_{s,h}$.
       Set $x_{h}(s,\cdot)$ to $0$ for all $(h,s)\neq(h^{t},s^{t})$, and set $x_{h^{t}}(s^{t},\cdot)$ to the contract that $\mathscr{A}(s^{t},h^{t})$ is going to use next.
       Set $x_{h^{t}}(s^{t},\cdot)\leftarrow x_{h^{t}}(s^{t},\cdot)+H\kappa_{0}^{-1}\max_{(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]}c_{h}(s,a)$.
       Execute the contract policy $\bm{x}$ and collect trajectory $\mathcal{T}=\{(s_{h},a_{h},r_{h})\}_{h\in[H]}$.
       if $s_{h^{t}}=s^{t}$ in the trajectory $\mathcal{T}$ then
              Let $\mathscr{A}(s^{t},h^{t})$ step with $(s_{h^{t}},a_{h^{t}},s_{h^{t}+1})\in\mathcal{T}$ and $N_{s^{t},h^{t}}\leftarrow N_{s^{t},h^{t}}+1$.
return $\mu_{h}(s,\cdot,\cdot)$ from $\mathscr{A}(s,h)$ for each $s\in\mathcal{S},h\in[H]$.
Algorithm 5 $\chi(\varepsilon)$-Learning Procedure in Contractual RL with Far-sighted Agent

The algorithm for constructing the $\chi(\varepsilon)$-learning procedure is summarized in Algorithm 5.
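The scheduling logic of Algorithm 5 can be sketched in a few lines of Python. This is only an illustration under strong simplifications: `StubSearch` is a hypothetical stand-in for the subroutine of Algorithm 4 (which is not reproduced here), the contract is reduced to a single scalar payment per targeted $(s,h)$ pair, and `toy_rollout` is a made-up simulator that ignores the contract and moves uniformly at random.

```python
import random


class StubSearch:
    """Hypothetical placeholder for the per-(s, h) search subroutine."""

    def __init__(self):
        self.samples = []

    def next_contract(self):
        return 0.0  # placeholder for the contract the subroutine would try next

    def step(self, observation):
        self.samples.append(observation)


def run_schedule(states, H, budget, rollout, bonus):
    subs = {(s, h): StubSearch() for s in states for h in range(1, H + 1)}
    counts = {key: 0 for key in subs}
    for _ in range(budget):
        # Target the (state, step) pair whose subroutine has advanced the least.
        target = min(counts, key=counts.get)
        s_t, h_t = target
        # Zero payment everywhere except at the target, which receives the bonus.
        contract = {target: subs[target].next_contract() + bonus}
        trajectory = rollout(contract)  # entry h-1 holds (s_h, a_h, s_{h+1})
        s_h, a_h, s_next = trajectory[h_t - 1]
        if s_h == s_t:
            # Only a trajectory that hits the target advances its subroutine.
            subs[target].step((s_h, a_h, s_next))
            counts[target] += 1
    return counts


def toy_rollout(states, H, rng):
    """Made-up environment: uniformly random transitions, contract ignored."""

    def rollout(contract):
        traj, s = [], rng.choice(states)
        for _ in range(H):
            s_next = rng.choice(states)
            traj.append((s, "a", s_next))
            s = s_next
        return traj

    return rollout


counts = run_schedule([0, 1], 2, 200, toy_rollout([0, 1], 2, random.Random(0)), bonus=1.0)
# Because only the current minimum is ever incremented, counts stay balanced.
assert max(counts.values()) - min(counts.values()) <= 1
```

In the actual procedure the bonus is what makes the agent steer toward the targeted state; the toy simulator does not model that incentive, so it only demonstrates the balanced bookkeeping of the counters $N_{s,h}$.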

Assumption 6 (Weakly ergodic MDP).

There exists $\kappa_{0}>0$ such that $\max_{\bm{\pi}}\rho_{h}^{\bm{\pi}}(s)\geq 2\kappa_{0},\ \forall s,h.$

Note that this assumption is weaker than the mixing MDP assumption, where each state's visitation measure at each step is lower bounded under all of the agent's policies, since the agent's policy $\pi$ chosen here is in favor of visiting state $s$ at step $h$.
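For a known tabular MDP, the quantity $\max_{\bm{\pi}}\rho_{h}^{\bm{\pi}}(s)$ can be computed by a backward recursion that maximizes the probability of reaching $s$ at step $h$. The sketch below is illustrative only: the nested-list layout `P[h][s][a][s2]`, the initial distribution `rho1`, and both function names are assumptions of this example rather than notation from the paper.

```python
def max_visitation(P, rho1, target_state, target_step):
    """Max over policies of Pr[s_h = target_state], with h = target_step (1-indexed).

    P[h][s][a][s2] is the transition probability out of step h + 1 (0-indexed h),
    and rho1[s] is the initial state distribution.
    """
    S = len(rho1)
    # w[s]: best achievable probability of hitting the target, starting from s.
    w = [1.0 if s == target_state else 0.0 for s in range(S)]
    for h in range(target_step - 2, -1, -1):
        w = [
            max(sum(p * w[s2] for s2, p in enumerate(dist)) for dist in P[h][s])
            for s in range(S)
        ]
    return sum(rho1[s] * w[s] for s in range(S))


def weak_ergodicity_constant(P, rho1, H):
    """Largest kappa_0 such that max_pi rho_h^pi(s) >= 2 * kappa_0 for all s, h."""
    S = len(rho1)
    return min(
        max_visitation(P, rho1, s, h) for s in range(S) for h in range(1, H + 1)
    ) / 2.0


# Two states, two actions: action 0 moves to state 0, action 1 moves to state 1.
step = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
P = [step, step]
rho1 = [0.5, 0.5]
print(weak_ergodicity_constant(P, rho1, H=2))
```

In this example every state is reachable with probability one at step 2, yet the kernel has zero entries, so it satisfies the weak ergodicity condition while violating the mixing condition of Assumption 5.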

Also, recall that throughout this section, we assume that the principal already knows the agent's cost function $\{c_{h}\}_{h=1}^{H}$ and that the contract design space is restricted to $\{x_{h}:\mathcal{S}\times\mathcal{S}\to[0,\eta]\}$. Moreover, we inherit all the technical assumptions from the analysis in Appendix D. With these assumptions, we show that the following Lemma 10 holds.

Lemma 10.

For any $\delta\in(0,1)$, Algorithm 5 guarantees an estimate satisfying $\left\lVert\widehat{\mu}_{h}-\mu_{h}\right\rVert_{1,\infty}\leq\varepsilon$ with probability at least $1-\delta$ in $\widetilde{O}(HAS^{5}\kappa_{0}^{-1}\log(1/\varepsilon)\log(1/\delta))$ episodes.

Proof of Lemma 10.

Since in each episode only $x_{h}(s,\cdot)$ gives a nonzero payment, and we add a constant $H\kappa_{0}^{-1}\max_{(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]}c_{h}(s,a)$ that dominates the potential costs, the agent's optimal policy generates a visitation measure of $s$ at step $h$ larger than $\kappa_{0}$. To illustrate this point, we take $\widehat{\bm{\pi}}$ that maximizes $\rho_{h}^{\bm{\pi}}(s)$ while $\widehat{\pi}_{h}=\arg\max_{\pi_{h}}\langle\pi_{h},P_{h}\cdot x_{h}\rangle_{\mathcal{S}\times\mathcal{A}}$ at step $h$ (the choice of $\pi_{h}$ does not influence $\rho_{h}^{\bm{\pi}}(s)$). For any $\bm{\pi}$ such that $\rho_{h}^{\bm{\pi}}(s)<\kappa_{0}$, the agent's expected profit difference can be expressed as

\begin{align*}
&U^{\widehat{\bm{\pi}}} - U^{\bm{\pi}} \\
&\quad= \sum_{h'=1}^{H} \langle \rho_{h'}^{\widehat{\bm{\pi}}} \otimes \widehat{\pi}_{h'} - \rho_{h'}^{\bm{\pi}} \otimes \pi_{h'},\ P_{h'}\cdot x_{h'} - c_{h'} \rangle_{\mathcal{S}\times\mathcal{A}} \\
&\quad\geq \langle \rho_{h}^{\widetilde{\bm{\pi}}} \otimes \pi_{h}(s,\cdot) - \rho_{h}^{\bm{\pi}} \otimes \pi_{h}(s,\cdot),\ P_{h}\cdot x_{h}(s,\cdot) \rangle_{\mathcal{A}} - \sum_{h'=1}^{H} \langle \rho_{h'}^{\widetilde{\bm{\pi}}} \otimes \widetilde{\pi}_{h'} - \rho_{h'}^{\bm{\pi}} \otimes \pi_{h'},\ c_{h'} \rangle_{\mathcal{S}\times\mathcal{A}} \\
&\quad> \kappa_0 H \kappa_0^{-1} \max_{(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]} c_h(s,a) - H \max_{(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]} c_h(s,a) = 0,
\end{align*}

where $\widetilde{\bm{\pi}}$ is constructed by substituting $\widehat{\pi}_h$ with $\pi_h$ in $\widehat{\bm{\pi}}$. Here, the first inequality holds since $\widehat{\pi}_h$ is the optimal one-step policy with respect to the payment. The last inequality holds by noting that $\widetilde{\bm{\pi}}$ still maximizes $\rho_h^{\bm{\pi}}(s)$, which gives $\rho_h^{\widetilde{\bm{\pi}}}(s) \geq 2\kappa_0$ by Assumption 6, and that $x$ has a constant shift $H\kappa_0^{-1}\max_{(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]} c_h(s,a)$. Note that the best response can only have higher profit for the agent. Thus, we conclude that any $\bm{\pi}$ such that $\rho_h^{\bm{\pi}}(s) < \kappa_0$ cannot be the agent's best response. In other words, for the best response $\bm{\pi}^*$, we have $\rho_h^{\bm{\pi}^*}(s) \geq \kappa_0$.
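The last inequality can be unpacked into two elementary bounds. The following is a sketch under stated assumptions, not a step taken from the original proof: $c_{\max}$ is shorthand introduced here for $\max_{(s,a,h)} c_h(s,a)$, costs are assumed nonnegative with $c_{\max} > 0$, and the expected payment at $(s,h)$ is assumed to be at least the constant shift $H\kappa_0^{-1}c_{\max}$.

```latex
% Sketch only: c_{\max} is shorthand introduced for this illustration.
% Assumes nonnegative costs and expected payment at (s,h) >= H \kappa_0^{-1} c_{\max}.
\begin{align*}
\big(\rho_h^{\widetilde{\bm{\pi}}}(s) - \rho_h^{\bm{\pi}}(s)\big)\,
  \langle \pi_h(s,\cdot),\ P_h\cdot x_h(s,\cdot)\rangle_{\mathcal{A}}
  &> (2\kappa_0 - \kappa_0)\cdot H\kappa_0^{-1} c_{\max} = H c_{\max}, \\
\sum_{h'=1}^{H} \langle \rho_{h'}^{\widetilde{\bm{\pi}}} \otimes \widetilde{\pi}_{h'}
  - \rho_{h'}^{\bm{\pi}} \otimes \pi_{h'},\ c_{h'}\rangle_{\mathcal{S}\times\mathcal{A}}
  &\leq \sum_{h'=1}^{H} \langle \rho_{h'}^{\widetilde{\bm{\pi}}} \otimes \widetilde{\pi}_{h'},\ c_{h'}\rangle_{\mathcal{S}\times\mathcal{A}}
  \leq H c_{\max}.
\end{align*}
```

The first line uses $\rho_h^{\widetilde{\bm{\pi}}}(s) \geq 2\kappa_0$ and $\rho_h^{\bm{\pi}}(s) < \kappa_0$; the second uses that each occupancy measure sums to one at every step. Subtracting the second bound from the first gives a strictly positive profit difference.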

Let $X_{s,h}$ be the random variable representing the total steps that $\mathscr{A}(s,h)$ successfully takes, and let $Y$ be a random variable such that $Y \sim \text{Binomial}(n, \kappa_0)$, where $n = \chi(\varepsilon)$. It follows from our previous discussion that for any $\lambda \in (0, \kappa_0)$,

\begin{align*}
\mathbb{P}\left(\sum_{s,h} X_{s,h} \leq \lambda n\right) &\leq \mathbb{P}(Y \leq \lambda n) \\
&\leq \inf_{\theta \geq 0} \mathbb{E}\left[\exp(-\theta(Y - \lambda n))\right] \\
&\leq \exp\left(-n\kappa_0 f(\lambda/\kappa_0)\right),
\end{align*}

where $f(x) = 1 - x + x\log(x)$. By the algorithm procedure, we conclude that

\[
\mathbb{P}\left(\min_{s,h} X_{s,h} \leq \frac{\lambda n}{H|\mathcal{S}|} - 1\right) \leq \exp\left(-n\kappa_0 f(\lambda/\kappa_0)\right).
\]
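The Chernoff step above is the standard multiplicative lower-tail bound for a binomial variable, and it can be checked numerically. The sketch below compares the exact binomial tail with $\exp(-n\kappa_0 f(\lambda/\kappa_0))$; the function names and the parameter values in the usage note are illustrative assumptions, not quantities from the paper.

```python
import math


def f(x):
    """f(x) = 1 - x + x*log(x), extended by continuity with f(0) = 1."""
    return 1.0 if x == 0 else 1.0 - x + x * math.log(x)


def binom_cdf(k, n, p):
    """Exact P(Y <= k) for Y ~ Binomial(n, p)."""
    k = min(int(math.floor(k)), n)
    if k < 0:
        return 0.0
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def chernoff_lower_tail(n, kappa0, lam):
    """Upper bound exp(-n*kappa0*f(lam/kappa0)) on P(Y <= lam*n), for 0 < lam < kappa0."""
    return math.exp(-n * kappa0 * f(lam / kappa0))
```

For instance, with the hypothetical values `n = 200`, `kappa0 = 0.3`, and `lam = kappa0 / 2`, calling `binom_cdf(lam * n, n, kappa0)` and `chernoff_lower_tail(n, kappa0, lam)` shows the exact tail sitting below the bound, which is the regime the proof later selects with $\lambda = \kappa_0/2$.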

Combined with Corollary 3.1, we conclude that Algorithm 5 has error

\[
\sup_{s,a,h} \left\lVert \big(\widehat{P}_h(s,a) - \widehat{P}_h(s,a')\big) - \big(P_h(s,a) - P_h(s,a')\big) \right\rVert_2 \geq \varepsilon
\]

with probability no more than

\[
H|\mathcal{S}||\mathcal{A}| \exp\left(-\frac{(\lambda n - H|\mathcal{S}|)\cdot\sigma_d(\tau_d)}{H|\mathcal{A}||\mathcal{S}|^5 \log(|\mathcal{A}|\varsigma^{-2}\varepsilon^{-1}\theta^{-1})}\right) + \exp\left(-n\kappa_0 f(\lambda/\kappa_0)\right),
\]

for any $\lambda \in (0, \kappa_0)$. The first term dominates, so we plug in $\lambda = \kappa_0/2$ and conclude that the error is no more than $\varepsilon$ with probability at least

\[
1 - C_0 H|\mathcal{S}||\mathcal{A}| \exp\left(-\frac{\kappa_0 \chi(\varepsilon)\cdot\sigma_{\mathcal{S}}(\tau_{\mathcal{S}})}{H|\mathcal{A}||\mathcal{S}|^5 \log(|\mathcal{A}|\varsigma^{-2}\varepsilon^{-1}\theta^{-1})}\right),
\]

where $C_0$ is a constant and $\tau_{\mathcal{S}} = (\varsigma^2/6|\mathcal{S}|)^2$.

Fixing $\delta \in (0,1)$ and converting the error bound to the $\ell_1$ norm, this algorithm guarantees that, with probability at least $1-\delta$,

\[
\sup_{s,a,h} \left\lVert \widehat{\mu}_h(s,a,a') - \mu_h(s,a,a') \right\rVert_2 \leq \varepsilon,
\]

in $O\left(\frac{HAS^5\log(\varepsilon^{-1}S^{1/2})}{\kappa_0\sigma_{\mathcal{S}}(\tau_{\mathcal{S}})}\log(C_0 HSA/\delta)\right) = \widetilde{O}\left(HAS^5\kappa_0^{-1}\log(1/\varepsilon)\log(1/\delta)\right)$ rounds, where we ignore the constant $\sigma_{\mathcal{S}}(\tau_{\mathcal{S}})$ from Assumption 3.
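The round count follows from inverting the failure probability. As a sketch, write $L$ for the logarithmic factor in the denominator of the exponent (a shorthand introduced here, not notation from the paper) and require the failure probability to be at most $\delta$:

```latex
% Sketch only: L denotes the logarithmic factor in the exponent's denominator.
\begin{align*}
C_0 H|\mathcal{S}||\mathcal{A}| \exp\left(-\frac{\kappa_0\,\chi(\varepsilon)\,\sigma_{\mathcal{S}}(\tau_{\mathcal{S}})}{H|\mathcal{A}||\mathcal{S}|^5 L}\right) \leq \delta
\quad\Longleftrightarrow\quad
\chi(\varepsilon) \geq \frac{H|\mathcal{A}||\mathcal{S}|^5 L}{\kappa_0\,\sigma_{\mathcal{S}}(\tau_{\mathcal{S}})} \log\left(\frac{C_0 H|\mathcal{S}||\mathcal{A}|}{\delta}\right).
\end{align*}
```

Taking $\chi(\varepsilon)$ equal to the right-hand side and treating $\sigma_{\mathcal{S}}(\tau_{\mathcal{S}})$ as a constant gives the stated order of rounds.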

Lemma 11.

If $\mathcal{E}_{\text{model}}$ holds, with $\left\lVert \widehat{\mu}_h - \mu_h \right\rVert_{1,\infty} \leq \varepsilon$, then at any episode $t \in [T]$, we have

\begin{equation}
\left\lVert \widehat{P}_h^t(s,a) - P_h(s,a) \right\rVert_1 \leq 2\sqrt{\frac{\ln(SAHT/\delta)}{N_h^t(s)}} + \varepsilon. \tag{E.1}
\end{equation}
Proof.

To improve the estimate of $\widehat{P}$ with $\widehat{\mu}$, we can construct the estimator $\widehat{P}_h^t$ as

\begin{align*}
\widehat{P}^t_h(s,a) &= \sum_{a'\in\mathcal{A}} \frac{N^t_h(s,a')}{N_h^t(s)} \left[\widehat{P}^t_h(s,a') + \widehat{\mu}_h(s,a,a')\right] \\
&= \frac{\sum_{a'\in\mathcal{A}} \left[N^t_h(s,a';\cdot) + \mu_h(s,a,a') N^t_h(s,a')\right]}{N_h^t(s)} + \sum_{a'\in\mathcal{A}} \frac{N^t_h(s,a')}{N_h^t(s)} \left[\widehat{\mu}_h(s,a,a') - \mu_h(s,a,a')\right].
\end{align*}

The first part is an unbiased estimator for $P^t_h(s,a)$, and thus by Hoeffding's inequality, with probability at least $1-\delta$, for all $s,a,h,t$, with $N_h^t(s)$ samples,

\[
\left\lVert \frac{\sum_{a'\in\mathcal{A}} \left[N^t_h(s,a';\cdot) + \mu_h(s,a,a') N^t_h(s,a')\right]}{N_h^t(s)} - P^t_h(s,a) \right\rVert_1 \leq 2\sqrt{\frac{\ln(SAHT/\delta)}{N_h^t(s)}}.
\]

Using the triangle inequality and the fact that $\frac{N^t_h(s,a')}{N_h^t(s)} \leq 1$, we get the inequality in Eq. (E.1). ∎
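The pooled estimator in this proof can be sketched in a few lines of Python for one fixed state and step. The function name `pooled_estimate` and the nested-list layout of `counts` and `mu_hat` are assumptions made for illustration; they are not part of the paper's algorithms.

```python
def pooled_estimate(counts, mu_hat, a):
    """Pooled estimate of P(. | s, a) at a fixed state s and step h.

    counts[b][s2]    : number of observed transitions to s2 after action b
    mu_hat[a][b][s2] : estimate of P(s2 | s, a) - P(s2 | s, b)
    """
    n_states = len(counts[a])
    total = sum(sum(row) for row in counts)  # plays the role of N_h^t(s)
    if total == 0:
        raise ValueError("no samples observed at this state")
    estimate = [0.0] * n_states
    for b, row in enumerate(counts):
        n_b = sum(row)  # plays the role of N_h^t(s, b)
        if n_b == 0:
            continue  # actions never taken carry zero weight
        weight = n_b / total
        for s2 in range(n_states):
            # empirical P(s2 | s, b), shifted by the estimated difference to action a
            estimate[s2] += weight * (row[s2] / n_b + mu_hat[a][b][s2])
    return estimate
```

The design point is that samples from every action $a'$ contribute to the estimate for $a$, so the concentration term scales with the state count $N_h^t(s)$ rather than the state-action count, at the price of the additive error in $\widehat{\mu}$.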

E.3 Proofs for the Statistically Efficient Solver

Input: Parameters $\{\widehat{P}_h, \widehat{r}_h, b_h, \widehat{\mu}_h\}_{h\in[H]}$, robustness parameter $\epsilon$
Output: Contract policy $\bm{x}$, action policy $\bm{\pi}$
for $\bm{\pi} \in \Pi$ do
      $\widehat{U}^{\bm{\pi}}, \widehat{\bm{x}}^{\bm{\pi}} = \mathop{\mathrm{minarg}}_{\bm{x}} \widehat{U}^{\bm{x},\bm{\pi}} \quad \text{s.t.} \quad \widehat{U}^{\bm{x},\bm{\pi}} - \widehat{U}^{\bm{x},\bm{\pi}'} \geq \epsilon, \ \forall \bm{\pi}' \neq \bm{\pi},$ (E.2)
      $\widehat{V}^{\bm{\pi}} \leftarrow \widehat{R}^{\bm{\pi}} - C^{\bm{\pi}} - \widehat{U}^{\bm{\pi}}$
Set $\widehat{\bm{\pi}}^* \leftarrow \mathop{\mathrm{argmax}}_{\bm{\pi}\in\Pi} \widehat{V}^{\bm{\pi}}$, $\widehat{\bm{x}}^* \leftarrow \widehat{\bm{x}}^{\widehat{\bm{\pi}}^*}$
return $\widehat{\bm{x}}^*, \widehat{\bm{\pi}}^*$
Algorithm 6 Linear Programming under Uncertainty
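The outer loop of Algorithm 6 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `solve_min_utility` is a hypothetical callable standing in for LP (E.2), and `reward` and `cost` are hypothetical callables returning $\widehat{R}^{\bm{\pi}}$ and $C^{\bm{\pi}}$. Skipping infeasible policies is also an assumption of this sketch.

```python
def select_contract(policies, solve_min_utility, reward, cost):
    """Enumerate candidate policies and keep the one with the best estimated value.

    solve_min_utility(pi) -> (agent_utility, contract), or None if the
    robust LP for pi is infeasible (hypothetical stand-in for LP (E.2)).
    """
    best = None
    for pi in policies:
        result = solve_min_utility(pi)
        if result is None:
            continue  # assumption of this sketch: skip policies that cannot be induced
        agent_utility, contract = result
        # estimated principal value: reward minus cost minus agent utility
        value = reward(pi) - cost(pi) - agent_utility
        if best is None or value > best[0]:
            best = (value, pi, contract)
    if best is None:
        raise ValueError("no policy admits a feasible contract")
    _, pi_star, x_star = best
    return x_star, pi_star
```

The loop enumerates all of $\Pi$, so it targets statistical rather than computational efficiency, in line with the section title.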
Assumption 7 (Weak λ𝜆\lambdaitalic_λ-Inducibility).

For any stationary action policy $\bm{\pi} \in \Pi$, there exists an event $e_h \in \{0,1\}^S$ for each $h$ such that $\sum_{h=2}^{H+1} (\rho_h^{\bm{\pi}} - \rho_h^{\bm{\pi}'})\cdot e_h \geq \lambda_w, \ \forall \bm{\pi} \neq \bm{\pi}'$.

This assumption ensures that for any stationary action policy $\bm{\pi} \in \Pi$, there exists a set of bounded reward functions $\{r_h\}_{h=1}^{H}$ with $\max_{s\in\mathcal{S}, a\in\mathcal{A}, h\in[H]} \left\lVert r_h(s,a) \right\rVert \leq 1$ such that $\bm{\pi}$ dominates any other stationary action policy $\bm{\pi}'$ with additional expected utility at least $\lambda_w$ in the MDP environment $(\mathcal{S}, \mathcal{A}, \{P_h, r_h\}_{h=1}^{H})$.
For example, one can set $r_h(s,a) = \sum_{s'\in\mathcal{S}} e_{h+1}(s') P_h(s'|s,a)$. For a similar reason, under this assumption, for any stationary action policy $\bm{\pi} \in \Pi$, there exists a contract policy $\bm{x}$ with $\left\lVert x_h \right\rVert_\infty \leq 1$ under which $U^{\bm{x},\bm{\pi}} - U^{\bm{x},\bm{\pi}'} \geq \lambda_w, \ \forall \bm{\pi} \neq \bm{\pi}'$, where $x_h(s,s') = e_{h+1}(s')$.
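The quantity bounded in Assumption 7 can be computed directly in a small tabular MDP. The sketch below assumes deterministic policies and a nested-list encoding of the transition kernel; the function names and data layout are illustrative assumptions, not the paper's notation.

```python
def visitation(P, policy, rho1):
    """Return [rho_1, ..., rho_{H+1}] for a deterministic policy.

    P[h][s][a]   : next-state distribution at step h+1 (list over states)
    policy[h][s] : action taken in state s at step h+1
    rho1         : initial state distribution
    """
    rhos = [list(rho1)]
    for h in range(len(P)):
        nxt = [0.0] * len(rho1)
        for s, mass in enumerate(rhos[-1]):
            for s2, p in enumerate(P[h][s][policy[h][s]]):
                nxt[s2] += mass * p
        rhos.append(nxt)
    return rhos


def event_gap(P, rho1, pi, pi_alt, events):
    """Sum over h = 2..H+1 of (rho_h^pi - rho_h^pi_alt) . e_h.

    events[k-1] encodes e_{k+1} as a 0/1 list over states, for k = 1..H.
    """
    rho = visitation(P, pi, rho1)
    rho_alt = visitation(P, pi_alt, rho1)
    return sum(
        (rho[k][s] - rho_alt[k][s]) * events[k - 1][s]
        for k in range(1, len(rho))
        for s in range(len(rho1))
    )
```

Under these assumptions, a policy $\bm{\pi}$ satisfies the condition with margin $\lambda_w$ when `event_gap` is at least $\lambda_w$ against every alternative policy for some fixed choice of `events`.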

Algorithm 6 describes the statistically efficient solver. Below we show that the contract policy it computes is robust and optimistic in a formal sense.

Lemma 12.

Under Assumption 7, with estimates $\widehat{\mu},\widehat{P}$ such that $\lVert\widehat{\mu}_h-\mu_h\rVert_{1,\infty}\leq\frac{\epsilon}{SH^2}$ and $\lVert\widehat{P}_h-P_h\rVert_{1,\infty}\leq\epsilon'$, the policy $\bm{\pi},\widehat{\bm{x}}^{\bm{\pi}}$ solved from LP (E.2) satisfies the following condition,

$$U^{\widehat{\bm{x}}^{\bm{\pi}}}-U^{\bm{\pi}}\eqsim\left|\widehat{\zeta}^{\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(\lambda_w^{-1}H\eta\epsilon+H^2\epsilon'\right).$$
Proof.

Pick any action policy $\bm{\pi}\in\Pi$. We first show that $\bm{\pi}$ is the agent's best response under $\widehat{\bm{x}}^{\bm{\pi}}$ solved from LP (E.2). That is, due to Lemma 13, the following inequality holds for all $\bm{\pi}'\neq\bm{\pi}$,

$$U^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}-U^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}'}\geq\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}-\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}'}-\epsilon\geq 0.$$

Let $\varepsilon=\frac{\epsilon}{SH^2}$, so that $\lVert\widehat{\mu}_h-\mu_h\rVert_{1,\infty}\leq\varepsilon$. We show that the payment under $\bm{x}$ has bounded suboptimality,

$$\widehat{\zeta}^{\bm{\pi}}-\zeta^{\bm{\pi}}=\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}-U^{\bm{\pi}}\leq\widehat{U}^{\bm{x}^{\bm{\pi}},\bm{\pi}}-U^{\bm{x}^{\bm{\pi}},\bm{\pi}}+\frac{2SH^3\eta\varepsilon}{\lambda_w}\leq\frac{2SH^3\eta\varepsilon}{\lambda_w}+H^2\epsilon'=\frac{2H\eta\epsilon}{\lambda_w}+H^2\epsilon',\tag{E.3}$$

where $\bm{x}^{\bm{\pi}}=\operatorname{argmin}_{\bm{x}}U^{\bm{x},\bm{\pi}}\quad\text{s.t.}\quad U^{\bm{x},\bm{\pi}}-U^{\bm{x},\bm{\pi}'}\geq 0,\ \forall\bm{\pi}'\neq\bm{\pi}$, and thus $U^{\bm{\pi}}=U^{\bm{x}^{\bm{\pi}},\bm{\pi}}$ by definition.

We first prove the first inequality in Equation (E.3). Let $\bm{x}^0$ be the contract policy such that $U^{\bm{x}^0,\bm{\pi}}-U^{\bm{x}^0,\bm{\pi}'}\geq\lambda_w,\ \forall\bm{\pi}'\neq\bm{\pi}$, with $U^{\bm{x}^0,\bm{\pi}}\leq H$ and $\widehat{U}^{\bm{x}^0,\bm{\pi}}\leq H\eta$. Such $\bm{x}^0$ exists by Assumption 1.
The utility is linear in the contract: for any contract policies $\bm{x}',\bm{x}''$ and any $c\in\mathbb{R}$, let $\bm{x}=\bm{x}'+c\bm{x}''$, that is, $\bm{x}_h(s,s')=\bm{x}'_h(s,s')+c\bm{x}''_h(s,s'),\ \forall h,s,s'$. Then $U^{\bm{x},\bm{\pi}}=U^{\bm{x}',\bm{\pi}}+cU^{\bm{x}'',\bm{\pi}}$.
We claim that $\widetilde{\bm{x}}=\bm{x}^{\bm{\pi}}+\frac{2SH^2\varepsilon}{\lambda_w}\bm{x}^0$ satisfies the following inequalities:

$$\widehat{U}^{\widetilde{\bm{x}},\bm{\pi}}-\widehat{U}^{\widetilde{\bm{x}},\bm{\pi}'}\geq U^{\widetilde{\bm{x}},\bm{\pi}}-U^{\widetilde{\bm{x}},\bm{\pi}'}-SH^2\varepsilon\geq\epsilon,\tag{E.4}$$

where the first inequality is due to Lemma 13 and the second inequality is by the construction of $\widetilde{\bm{x}}$: by linearity, $U^{\widetilde{\bm{x}},\bm{\pi}}-U^{\widetilde{\bm{x}},\bm{\pi}'}\geq 0+\frac{2SH^2\varepsilon}{\lambda_w}\lambda_w=2SH^2\varepsilon$, and $SH^2\varepsilon=\epsilon$. Consequently,

$$\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}\leq\widehat{U}^{\widetilde{\bm{x}},\bm{\pi}}=\widehat{U}^{\bm{x}^{\bm{\pi}},\bm{\pi}}+\frac{2SH^2\varepsilon}{\lambda_w}\widehat{U}^{\bm{x}^0,\bm{\pi}}\leq\widehat{U}^{\bm{x}^{\bm{\pi}},\bm{\pi}}+\frac{2SH^3\eta\varepsilon}{\lambda_w}.$$

For the first inequality, notice that $\widetilde{\bm{x}}$ is feasible in the LP (E.2) according to Equation (E.4), so the optimal value of LP (E.2), attained by $\widehat{\bm{x}}^{\bm{\pi}}$, is no larger than the value of $\widetilde{\bm{x}}$. The equality is due to the linearity of $U^{\bm{x},\bm{\pi}}$ w.r.t. $\bm{x}$. The last inequality uses the fact that $\widehat{U}^{\bm{x}^0,\bm{\pi}}\leq H\eta$ by construction.
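As a sanity check on the constants, the following minimal Python sketch evaluates the perturbation argument with exact rational arithmetic. The function name `perturbation_budget` and the numeric inputs are illustrative assumptions, not part of the paper; only the definition $\varepsilon=\epsilon/(SH^2)$, the scaling factor $2SH^2\varepsilon/\lambda_w$, and the bound $\widehat{U}^{\bm{x}^0,\bm{\pi}}\leq H\eta$ are taken from the argument above.

```python
from fractions import Fraction


def perturbation_budget(S, H, eta, lam_w, eps):
    """Exact arithmetic behind the perturbed contract x~ = x^pi + c * x^0.

    All names and inputs here are hypothetical; they only mirror the symbols
    S, H, eta, lambda_w and epsilon used in the proof.
    """
    S, H, eta, lam_w, eps = (Fraction(v) for v in (S, H, eta, lam_w, eps))
    var_eps = eps / (S * H**2)              # varepsilon = epsilon / (S H^2)
    c = 2 * S * H**2 * var_eps / lam_w      # scaling applied to x^0
    true_margin = c * lam_w                 # lower bound on U^{x~,pi} - U^{x~,pi'}
    est_margin = true_margin - S * H**2 * var_eps  # subtract the Lemma 13 slack
    extra_payment = c * H * eta             # upper bound on c * U_hat^{x^0,pi}
    return {"c": c, "est_margin": est_margin, "extra_payment": extra_payment}


budget = perturbation_budget(S=4, H=5, eta=2,
                             lam_w=Fraction(1, 10), eps=Fraction(1, 100))
print(budget)
```

In this sketch the estimated margin equals $\epsilon$ exactly, matching the right-hand side of (E.4), and the extra payment equals $2H\eta\epsilon/\lambda_w$, which is of order $\lambda_w^{-1}H\eta\epsilon$ as in Lemma 12.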

For the last inequality in Equation (E.3), we apply the simulation lemma [5],

$$\begin{aligned}
\widehat{U}^{\bm{x}^{\bm{\pi}},\bm{\pi}}-U^{\bm{x}^{\bm{\pi}},\bm{\pi}}&=\sum_{h=1}^{H}\left\langle\rho_h^{\bm{\pi}}\otimes\pi_h,(\widehat{P}_h-P_h)\cdot\widehat{U}_h^{\bm{x}^{\bm{\pi}},\bm{\pi}}\right\rangle_{\mathcal{S}\times\mathcal{A}}\\
&\leq H\sum_{h=1}^{H}\left\lVert\widehat{P}_h(s,a)-P_h(s,a)\right\rVert_1\\
&\leq H^2\epsilon'.
\end{aligned}$$
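The $H^2\epsilon'$ bound can be checked numerically on a small tabular example. The sketch below is a simplified illustration and not the paper's setting: it assumes rewards $r_h(s,a)\in[0,1]$ that do not depend on the transition model, a fixed deterministic policy, and a randomly perturbed kernel $\widehat{P}$. All function names and parameters (`random_dist`, `policy_value`, `simulation_gap`, `noise`, `seed`) are hypothetical.

```python
import random


def random_dist(n, rng):
    """Return a random probability vector of length n with positive entries."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def policy_value(P, r, pi, H, S):
    """Backward induction for a fixed deterministic policy; returns V_1 per state."""
    V = [0.0] * S  # value after the final step is zero
    for h in reversed(range(H)):
        V = [
            r[h][s][pi[h][s]]
            + sum(P[h][s][pi[h][s]][s2] * V[s2] for s2 in range(S))
            for s in range(S)
        ]
    return V


def simulation_gap(S=3, A=2, H=4, noise=0.05, seed=0):
    """Compare values under P and a perturbed P_hat against the H^2 * eps' bound."""
    rng = random.Random(seed)
    P = [[[random_dist(S, rng) for _ in range(A)] for _ in range(S)]
         for _ in range(H)]
    # P_hat mixes every true row with an independent random distribution.
    P_hat = [[[[(1 - noise) * p + noise * q
                for p, q in zip(P[h][s][a], random_dist(S, rng))]
               for a in range(A)] for s in range(S)] for h in range(H)]
    r = [[[rng.random() for _ in range(A)] for _ in range(S)] for _ in range(H)]
    pi = [[rng.randrange(A) for _ in range(S)] for _ in range(H)]
    # eps' is the largest L1 distance between true and estimated rows.
    eps_prime = max(
        sum(abs(p - q) for p, q in zip(P[h][s][a], P_hat[h][s][a]))
        for h in range(H) for s in range(S) for a in range(A)
    )
    V = policy_value(P, r, pi, H, S)
    V_hat = policy_value(P_hat, r, pi, H, S)
    gap = max(abs(v - vh) for v, vh in zip(V, V_hat))
    return gap, H**2 * eps_prime


gap, bound = simulation_gap()
print(gap <= bound)
```

Because each per-step value is at most $H$ under these assumptions, the gap accumulates at most $H\epsilon'$ per step over $H$ steps, which is the same counting used in the display above.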

Lemma 13.

Given that $\lVert\widehat{\mu}_h-\mu_h\rVert_{1,\infty}\leq\varepsilon$, we have

$$\left|U^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}-U^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}'}-\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}+\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}'}\right|\leq SH^2\varepsilon.$$
Proof.

Notice that we can apply the performance difference lemma [5] to any $U^{\bm{x},\bm{\pi}},U^{\bm{x},\bm{\pi}'}$, as they share the same $\bm{x}$ and thus the same environment, differing only in the policy,

$$U^{\bm{x},\bm{\pi}}-U^{\bm{x},\bm{\pi}'}=\sum_{h=1}^{H}\left\langle\rho_h^{\bm{\pi}}\otimes(\pi_h-\pi'_h),P_h\cdot U_h\right\rangle_{\mathcal{S}\times\mathcal{A}}=\sum_{h=1}^{H}\sum_{s\in\mathcal{S}}\rho_h^{\bm{\pi}}(s)\,\mu_h(s,\pi_h(s),\pi'_h(s))\cdot U_h.$$

Hence, we have

$$\begin{aligned}
&\left|U^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}-U^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}'}-\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}+\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}'}\right|\\
&=\left|\sum_{h=1}^{H}\sum_{s\in\mathcal{S}}\left[\rho_h^{\bm{\pi}}(s)\,\mu_h(s,\pi_h(s),\pi'_h(s))\cdot U_h-\widehat{\rho}_h^{\bm{\pi}}(s)\,\widehat{\mu}_h(s,\pi_h(s),\pi'_h(s))\cdot\widehat{U}_h\right]\right|\\
&\leq H\sum_{h=1}^{H}\sum_{s\in\mathcal{S}}\left\lVert\mu_h(s,\pi_h(s),\pi'_h(s))-\widehat{\mu}_h(s,\pi_h(s),\pi'_h(s))\right\rVert_1\\
&\leq SH^2\max_{s,a,a',h}\left\lVert\widehat{\mu}_h(s,a,a')-\mu_h(s,a,a')\right\rVert_1\\
&\leq SH^2\varepsilon.
\end{aligned}$$

Lemma 14.

If $\mathcal{E}_{\text{model}}$ is true, with $T_1=O(HAS^5\kappa_0^{-1}\log(\eta T\lambda_w^{-1}))$, under Assumptions 5 and 7, at any episode $t\in[T]$, we have

$$\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(H^2\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/t}\right).$$
Proof.

The proof balances the estimation errors of $P$ and $\mu$ according to the ratio given in Lemma 12. For this analysis, let $\sup_{s,a,a',h}\lVert\widehat{\mu}^t_h(s,a,a')-\mu_h(s,a,a')\rVert_1\leq\varepsilon$, so that we can apply Lemma 11 and get

$$\left\lVert\widehat{P}_h^t(s,a)-P_h(s,a)\right\rVert_1\leq 2\sqrt{\frac{\ln(SAHT/\delta)}{2\kappa t}}+\varepsilon,$$

where we use the fact that $N_h^t(s)\geq 2\kappa t$ under Assumption 5. By Lemma 12, we can construct contracts to induce any action policy $\bm{\pi}$ such that

$$\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(\lambda_w^{-1}SH^3\eta\varepsilon+H^2\left(2\sqrt{\frac{\ln(SAHT/\delta)}{2\kappa t}}+\varepsilon\right)\right)=O\left(\lambda_w^{-1}SH^3\eta\varepsilon+H^2\sqrt{\frac{\ln(SAHT/\delta)}{2\kappa t}}\right).$$

Then, we can set ε=λwSHηt𝜀subscript𝜆𝑤𝑆𝐻𝜂𝑡\varepsilon=\frac{\lambda_{w}}{SH\eta\sqrt{t}}italic_ε = divide start_ARG italic_λ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_ARG start_ARG italic_S italic_H italic_η square-root start_ARG italic_t end_ARG end_ARG such that the second term dominates the regret bound, and we have

|ζ^t,𝝅ζ𝝅|superscript^𝜁𝑡𝝅superscript𝜁𝝅\displaystyle\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|| over^ start_ARG italic_ζ end_ARG start_POSTSUPERSCRIPT italic_t , bold_italic_π end_POSTSUPERSCRIPT - italic_ζ start_POSTSUPERSCRIPT bold_italic_π end_POSTSUPERSCRIPT | =O(H2κ1/2ln(SAHT/δ)/t})\displaystyle=O\left(H^{2}\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/t}\big{\}}\right)= italic_O ( italic_H start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_κ start_POSTSUPERSCRIPT - 1 / 2 end_POSTSUPERSCRIPT square-root start_ARG roman_ln ( italic_S italic_A italic_H italic_T / italic_δ ) / italic_t end_ARG } )
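As a quick check of this choice of $\varepsilon$ (a sketch that only substitutes into the first term of the bound above), we have

\[
\lambda_w^{-1}SH^3\eta\varepsilon = \lambda_w^{-1}SH^3\eta\cdot\frac{\lambda_w}{SH\eta\sqrt{t}} = \frac{H^2}{\sqrt{t}},
\]

which is at most $H^2\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/t}$ whenever $\kappa\leq 1$ and $\ln(SAHT/\delta)\geq 1$. Both conditions are assumptions of this check rather than statements taken from the proof.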

It remains to ensure that $T_1$ rounds suffice to reach $\varepsilon=\frac{\lambda_w}{SH\eta\sqrt{T-T_1}}$. By Lemma 10, the sample complexity of reaching accuracy $\varepsilon$ is on the order of $\widetilde{O}(HAS^5\kappa_0^{-1}\log(1/\varepsilon))$. Hence, $T_1=O(HAS^5\kappa_0^{-1}\log(SH\eta T\lambda_w^{-1}))=O(HAS^5\kappa_0^{-1}\log(\eta T\lambda_w^{-1}))$.

E.4 Proofs for the Computationally Efficient Solver

Input: Parameters $\{\widehat{P}_h,\widehat{r}_h,b_h,\widehat{\mu}_h,\epsilon_h\}_{h\in[H]}$, robustness parameter $\{\epsilon_h\}_{h\in[H]}$
Output: Contract policy $\bm{x}$, action policy $\bm{\pi}$
Set $\widehat{V}_{H+1}(s),\widehat{U}_{H+1}(s)\leftarrow 0,\ \forall a\in\mathcal{A}, s\in\mathcal{S}$.
for $h=H,\dots,1$ do
\begin{align*}
\widehat{x}_h(s;a) &= \operatorname*{argmin}_{x:\mathcal{S}\to\mathbb{R}_+}\left\{\widehat{P}_h(s,a)\cdot x \ \middle|\ \widehat{\mu}_h(s,a,a')\cdot[x+\widehat{U}_{h+1}]\geq c_h(s,a)-c_h(s,a')+\epsilon_h,\ \forall a'\neq a\right\}, \tag{E.5}\\
\widehat{Q}_h(s,a) &= \min\left\{H,\ \widehat{r}_h(s,a)+b_h(s,a)+\widehat{P}_h(s,a)\cdot\widehat{V}_{h+1}\right\}-\widehat{P}_h(s,a)\cdot\widehat{x}_h(s;a),\\
\widehat{V}_h(s),\pi_h(s) &= \operatorname*{maxarg}_{a\in\mathcal{A}}\widehat{Q}_h(s,a),\quad \widehat{x}_h(s)=\widehat{x}_h(s;\pi_h(s)),\\
\widehat{U}_h(s) &= \min\left\{H,\ \widehat{P}_h(s,a)\cdot[\widehat{x}_h(s)+\widehat{U}_{h+1}]-c_h(s,a)\right\}.
\end{align*}
return $\widehat{\bm{x}}=\{\widehat{x}_h\}_{h\in[H]},\ \bm{\pi}=\{\pi_h\}_{h\in[H]}$
Algorithm 7 Value-Iteration under Uncertainty
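The backward recursion of Algorithm 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the LP in (E.5) is delegated to a hypothetical callable `solve_contract`, steps are 0-indexed, the function names and the nested-list data layout are ours, and the update of $\widehat{U}_h(s)$ is read with $a=\pi_h(s)$.

```python
def dot(p, v):
    """Inner product of two equal-length lists."""
    return sum(pi * vi for pi, vi in zip(p, v))


def value_iteration(P_hat, r_hat, b, c, H, solve_contract):
    """Backward induction over steps h = H-1, ..., 0 (0-indexed).

    P_hat[h][s][a] is a list of next-state probabilities; r_hat, b and c use
    the same indexing and hold scalars. solve_contract(h, s, a, U_next) is a
    hypothetical callable standing in for the minimizer of the LP in (E.5).
    """
    S = len(P_hat[0])
    V_next, U_next = [0.0] * S, [0.0] * S
    pi = [[None] * S for _ in range(H)]
    x_hat = [[None] * S for _ in range(H)]
    for h in reversed(range(H)):
        V, U = [0.0] * S, [0.0] * S
        for s in range(S):
            best = None
            for a in range(len(P_hat[h][s])):
                p = P_hat[h][s][a]
                x = solve_contract(h, s, a, U_next)
                # Optimistic principal value, clipped at H, minus expected payment.
                q = min(H, r_hat[h][s][a] + b[h][s][a] + dot(p, V_next)) - dot(p, x)
                if best is None or q > best[0]:
                    best = (q, a, x)
            V[s], pi[h][s], x_hat[h][s] = best
            # Agent value under the chosen contract (assumes a = pi_h(s)).
            p = P_hat[h][s][pi[h][s]]
            agent = dot(p, [xi + ui for xi, ui in zip(x_hat[h][s], U_next)])
            U[s] = min(H, agent - c[h][s][pi[h][s]])
        V_next, U_next = V, U
    return x_hat, pi, V_next, U_next


# With a single action there are no incentive constraints, so x = 0 is optimal.
zero_contract = lambda h, s, a, U_next: [0.0] * len(U_next)
H, S = 2, 2
P_hat = [[[[0.5, 0.5]] for _ in range(S)] for _ in range(H)]
r_hat = [[[1.0] for _ in range(S)] for _ in range(H)]
zeros = [[[0.0] for _ in range(S)] for _ in range(H)]
x_hat, pi, V1, U1 = value_iteration(P_hat, r_hat, zeros, zeros, H, zero_contract)
print(V1, U1, pi)
```

Passing the LP solver as an argument keeps the recursion independent of any particular optimization library; any routine that returns a feasible minimizer of (E.5) can be plugged in.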
Assumption 8 (Strong $\lambda$-Inducibility).

For any $a\in\mathcal{A}, s\in\mathcal{S}, h\in[H]$, there exists an event $e\in\{0,1\}^S$ such that $[P_h(s,a)-P_h(s,a')]\cdot e\geq\lambda_s,\ \forall a'\neq a$.

This assumption ensures that at any state $s$ of any step $h$, for any cost function $c_h$ and any action $a$, there exists a contract $x_h$ to induce $a$. Intuitively, this assumption asks that each action is dominantly capable of inducing a set of outcomes over the others. It is different from assuming that the outcome distribution of each action is distinguishable from the others. In particular, notice that even if we have $\min_{a'\neq a}\left\lVert P_h(s,a)-P_h(s,a')\right\rVert\geq\lambda_s$, there is no guarantee that $\exists x \text{ s.t. } x\cdot[P_h(s,a)-P_h(s,a')]\geq 0,\ \forall a'\neq a$, i.e., that there exists a contract $\bm{x}$ under which $a$ is the optimal action for the agent at this step regardless of the cost. Notice that when $H=1$, Assumptions 7 and 8 are equivalent.
In general, Assumption 8 is stronger than Assumption 7.
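The gap between distinguishability and inducibility can be checked by brute force on a small instance. The sketch below is only an illustration: the function name `inducibility_margin` and the toy distributions are ours. It enumerates every event $e\in\{0,1\}^S$ for a fixed state and step and reports the largest worst-case gap $\min_{a'\neq a}[P(a)-P(a')]\cdot e$.

```python
from itertools import product


def inducibility_margin(P, a):
    """Return (margin, event) for action a, where event is the best
    e in {0,1}^S and margin is min over a' != a of [P[a] - P[a']] . e."""
    n_outcomes = len(P[a])
    best_margin, best_event = float("-inf"), None
    for e in product((0, 1), repeat=n_outcomes):
        gaps = [
            sum((p - q) * ei for p, q, ei in zip(P[a], P[other], e))
            for other in range(len(P)) if other != a
        ]
        margin = min(gaps)
        if margin > best_margin:
            best_margin, best_event = margin, e
    return best_margin, best_event


# Three actions over two outcomes; action 0 is a mixture of actions 1 and 2.
P = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
for a in range(len(P)):
    print(a, inducibility_margin(P, a))
```

In this toy instance, action 0 is at $\ell_1$ distance 1 from each of the other two actions, yet no event gives it a strictly positive margin, so Assumption 8 fails for it with any $\lambda_s>0$. Actions 1 and 2 each admit an event with margin $0.5$.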

We describe the computationally efficient solver in Algorithm 7. Below we show that the contract policy it solves is robust and optimistic in a formal sense.

Lemma 15.

Under Assumption 8, with estimation of $\widehat{\mu},\widehat{P}$ such that $\left\lVert\widehat{\mu}_h-\mu_h\right\rVert_{1,\infty}\leq\frac{\epsilon}{H-h+\eta}$, $\left\lVert\widehat{P}_h-P_h\right\rVert_{1,\infty}\leq\epsilon'/\eta$ and choice of $\epsilon_H=\epsilon$, $\epsilon_h=O(\min\{H,\epsilon\lambda_s^{h-H}+\epsilon'\lambda_s^{h+1-H}\}),\ \forall h<H$, the policy $\bm{\pi},\widehat{\bm{x}}^{\bm{\pi}}$ output from Algorithm 7 satisfies the following condition,

\[
U^{\widehat{\bm{x}}^{\bm{\pi}}}-U^{\bm{\pi}}\eqsim\left|\widehat{\zeta}^{\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(\min\{H,\epsilon\lambda_s^{-H-1}+\epsilon'\lambda_s^{-H}\}\right).
\]
Proof.

Fix any action policy $\bm{\pi}$. We compare the contract policies $\bm{x}^{\bm{\pi}}$ and $\widehat{\bm{x}}^{\bm{\pi}}$ solved respectively from Equations (B.1) and (E.5), i.e., using the ground-truth and estimated parameters. In the remainder of the proof, we omit the superscript $\bm{\pi}$ in both $\widehat{\bm{x}}^{\bm{\pi}}$ and $\bm{x}^{\bm{\pi}}$ for simplicity.

For simplicity of notation, we prove the following claim. For any $h\in[H]$, suppose we have $\left\lVert\widehat{U}_{h+1}^{\bm{\pi}}-U_{h+1}\right\rVert_\infty\leq\alpha$, $\left\lVert\widehat{P}_h-P_h\right\rVert_{1,\infty}\leq\epsilon'/\eta$, $\left\lVert\widehat{\mu}_h-\mu_h\right\rVert_{1,\infty}\leq\beta$. Then, with the choice of $\epsilon_h=(H-h+\eta)\beta+2\alpha$, the solution of Equation (E.5) in the $h$-th step satisfies the following conditions: for every $s\in\mathcal{S}$,

  1. $\widehat{x}_h(s)$ induces agent’s action $\pi_h(s)$.

  2. The expected payment of $\widehat{x}_h(s)$ is bounded as $U_h^{\widehat{x}_h(s)}(s)-U_h^{\bm{\pi}}(s)\leq\min\{H,\frac{\epsilon_h}{\lambda_s}+\epsilon'\}$.

  3. The estimated payment of $\widehat{x}_h(s)$ (as well as the estimated agent value) is bounded as $\left|\widehat{\zeta}_h^{\bm{\pi}}(s)-\zeta_h^{\bm{\pi}}(s)\right|=\left|\widehat{U}_h^{\bm{\pi}}(s)-U_h^{\bm{\pi}}(s)\right|\leq\min\{H,\frac{\epsilon_h}{\lambda_s}+2\epsilon'\}$.

We pick any state $s$ and let $a=\pi_h(s)$. Equation (E.5) solves for $\widehat{x}_h(s)$ as the solution to the following optimization program,

\[
\begin{array}{lll}
\mbox{minimize} & \widehat{P}_h(s,a)\cdot x & \\
\mbox{subject to} & \widehat{\mu}_h(s,a,a')\cdot[x+\widehat{U}_{h+1}^{\bm{\pi}}]\geq c_h(s,a)-c_h(s,a')+\epsilon_h, & \mbox{for } a'\neq a.\\
 & x(s,s')\geq 0, & \mbox{for } s'\in\mathcal{S}.
\end{array}
\tag{E.6}
\]

We first argue that with $\epsilon_h=(H-h+\eta)\beta+2\alpha$, $\widehat{x}_h(s)$ induces the agent to play action $a$ at state $s$. That is because the following inequality must hold, $\forall a'\neq a$,

\begin{align*}
&\mu_h(s,a,a')\cdot[\widehat{x}_h(s)+U_{h+1}]\\
&\geq\widehat{\mu}_h(s,a,a')\cdot[\widehat{x}_h(s)+\widehat{U}_{h+1}]-\beta\left\lVert\widehat{x}_h(s)\right\rVert_\infty-\beta\left\lVert U_h\right\rVert_\infty-\alpha\left\lVert\mu_h(s,a,a')\right\rVert_1+\beta\alpha\\
&\geq c(s,a)-c(s,a')+\epsilon_h-\beta\eta-\beta(H-h)-2\alpha\\
&\geq c(s,a)-c(s,a').
\end{align*}

To see that LP (E.6) must have a feasible solution, let us consider the contract $x'_h(s)=x_h(s)+\frac{\epsilon_h}{\lambda_s}x^a$. We choose $x^a$ such that $[P_h(s,a)-P_h(s,a')]\cdot x^a\geq\lambda_s,\ \forall a'\neq a$. The existence of $x^a$ is implied by Assumption 8. Then $x'_h(s)$ satisfies the constraints of LP (E.6), i.e., $\forall a'\neq a$,

\begin{align*}
&\widehat{\mu}_h(s,a,a')\cdot[x'_h(s)+\widehat{U}_{h+1}^{\bm{\pi}}]\\
&\geq\mu_h(s,a,a')\cdot x'_h(s)-\beta\left\lVert x'_h(s)\right\rVert_\infty-\beta\left\lVert U_h\right\rVert_\infty-\alpha\left\lVert\widehat{\mu}_h(s,a,a')\right\rVert_1+\beta\alpha\\
&\geq\mu_h(s,a,a')\cdot x_h(s)+\mu_h(s,a,a')\cdot\frac{\beta}{\lambda_s}x^a-\beta\eta-\beta(H-h)-2\alpha\\
&\geq c(s,a)-c(s,a')+\epsilon_h-\beta\eta-\beta(H-h)\\
&\geq c(s,a)-c(s,a').
\end{align*}
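The effect of the shift $x'_h(s)=x_h(s)+\frac{\epsilon_h}{\lambda_s}x^a$ can be checked numerically on a toy instance. The sketch below is an illustration under assumptions of ours: it takes $\mu_h(s,a,a')$ to be the difference $P_h(s,a)-P_h(s,a')$, ignores the continuation value, and uses hypothetical helper names `margins` and `shift_contract`. It shows that adding the scaled event contract raises every incentive margin by at least $\epsilon_h$ when the event separates $a$ from each $a'$ by at least $\lambda_s$.

```python
def margins(P, a, x):
    """Incentive margins [P[a] - P[a']] . x for every a' != a."""
    return [
        sum((p - q) * xi for p, q, xi in zip(P[a], P[other], x))
        for other in range(len(P)) if other != a
    ]


def shift_contract(x, event, eps_h, lam_s):
    """Return x' = x + (eps_h / lam_s) * event, computed coordinate-wise."""
    return [xi + (eps_h / lam_s) * ei for xi, ei in zip(x, event)]


P = [[0.75, 0.25], [0.25, 0.75]]   # two actions, two outcomes (toy values)
a, event, lam_s = 0, [1, 0], 0.5    # [P[0] - P[1]] . event equals lam_s here
x, eps_h = [1.0, 0.0], 0.25

x_new = shift_contract(x, event, eps_h, lam_s)
gain = [m1 - m0 for m0, m1 in zip(margins(P, a, x), margins(P, a, x_new))]
extra_payment = sum(p * (x1 - x0) for p, x0, x1 in zip(P[a], x, x_new))
print(x_new, gain, extra_payment)
```

The extra expected payment of the shifted contract is $\frac{\epsilon_h}{\lambda_s}P(a)\cdot x^a\leq\frac{\epsilon_h}{\lambda_s}$, which is where the $\frac{\epsilon_h}{\lambda_s}$ term in the payment bound comes from.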

Moreover, since xh(s)subscript𝑥𝑠{x}_{h}(s)italic_x start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s ) minimizes LP (E.6), we have P^(a)x^h(s)P^(a)xh(s)=P^(a)[xh(s)+ϵhλsxa].^𝑃𝑎subscript^𝑥𝑠^𝑃𝑎subscriptsuperscript𝑥𝑠^𝑃𝑎delimited-[]subscript𝑥𝑠subscriptitalic-ϵsubscript𝜆𝑠superscript𝑥𝑎\widehat{P}(a)\cdot\widehat{x}_{h}(s)\leq\widehat{P}(a)\cdot{x}^{\prime}_{h}(s% )=\widehat{P}(a)\cdot[{x}_{h}(s)+\frac{\epsilon_{h}}{\lambda_{s}}x^{a}].over^ start_ARG italic_P end_ARG ( italic_a ) ⋅ over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s ) ≤ over^ start_ARG italic_P end_ARG ( italic_a ) ⋅ italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s ) = over^ start_ARG italic_P end_ARG ( italic_a ) ⋅ [ italic_x start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s ) + divide start_ARG italic_ϵ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_ARG start_ARG italic_λ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_ARG italic_x start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT ] . Given that xa=1subscriptdelimited-∥∥superscript𝑥𝑎1\left\lVert x^{a}\right\rVert_{\infty}=1∥ italic_x start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT ∞ end_POSTSUBSCRIPT = 1 and P(a)1=1subscriptdelimited-∥∥𝑃𝑎11\left\lVert P(a)\right\rVert_{1}=1∥ italic_P ( italic_a ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 1, we have P^h(s,a)xa1subscript^𝑃𝑠𝑎superscript𝑥𝑎1\widehat{P}_{h}(s,a)x^{a}\leq 1over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s , italic_a ) italic_x start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT ≤ 1. Using x^h(s)subscript^𝑥𝑠\widehat{x}_{h}(s)over^ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s ), the principal’s expected payment is bounded from the least payment as

\[
\begin{aligned}
U^{\widehat{x}_h(s)}_h(s)-U_h^{\bm{\pi}}(s)
&=P_h(s,a)\cdot[\widehat{x}_h(s)-x_h(s)]\\
&=\widehat{P}_h(s,a)\cdot[\widehat{x}_h(s)-x_h(s)]+[P_h(s,a)-\widehat{P}_h(s,a)]\cdot[\widehat{x}_h(s)-x_h(s)]\\
&\leq\widehat{P}_h(s,a)\frac{\epsilon_h}{\lambda_s}x^a+\left\lVert P_h(s,a)-\widehat{P}_h(s,a)\right\rVert_1\left\lVert\widehat{x}_h(s)-x_h(s)\right\rVert_\infty\\
&\leq\frac{\epsilon_h}{\lambda_s}+\epsilon'.
\end{aligned}
\]
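The second and third lines of this derivation split the inner product under the true kernel into an estimated part plus a Hölder remainder. The following is a minimal numerical sanity check of that step on randomly generated vectors. The names `p`, `p_hat`, and `d` are illustrative stand-ins for $P_h(s,a)$, $\widehat{P}_h(s,a)$, and $\widehat{x}_h(s)-x_h(s)$; they are assumptions of this sketch and do not come from the paper.

```python
import random

def holder_split(p, p_hat, d):
    """Return (lhs, estimated_part, holder_remainder) for the product p . d.

    lhs              = p . d
    estimated_part   = p_hat . d
    holder_remainder = ||p - p_hat||_1 * ||d||_inf
    """
    lhs = sum(pi * di for pi, di in zip(p, d))
    est = sum(qi * di for qi, di in zip(p_hat, d))
    l1 = sum(abs(pi - qi) for pi, qi in zip(p, p_hat))
    linf = max(abs(di) for di in d)
    return lhs, est, l1 * linf

def random_distribution(n, rng):
    """Draw a random probability vector of length n."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
checks = []
for _ in range(1000):
    n = rng.randint(2, 8)
    p = random_distribution(n, rng)
    p_hat = random_distribution(n, rng)
    d = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    lhs, est, rem = holder_split(p, p_hat, d)
    # p . d <= p_hat . d + ||p - p_hat||_1 * ||d||_inf (small float tolerance)
    checks.append(lhs <= est + rem + 1e-12)
all_hold = all(checks)
```

The check only exercises the decomposition and Hölder's inequality. The subsequent bounds on each part rely on the LP optimality and the model-error assumptions stated in the text.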

In addition, since $U_h^{\bm{\pi}}(s)\leq H$ and $U^{\widehat{x}_h(s)}_h(s)\geq 0$, we obtain the bound $U^{\widehat{x}_h(s)}_h(s)-U_h^{\bm{\pi}}(s)\leq\min\{H,\frac{\epsilon_h}{\lambda_s}+\epsilon'\}$.

Since $\left|\widehat{U}^{\bm{\pi}}_h(s)-U_h^{\widehat{x}_h(s)}(s)\right|=\left|[\widehat{P}_h(s,a)-P_h(s,a)]\cdot\widehat{x}_h(s)\right|\leq\epsilon'$, we have

\[
\left|\widehat{\zeta}^{\bm{\pi}}_h(s)-\zeta^{\bm{\pi}}_h(s)\right|
=\left|\widehat{U}^{\bm{\pi}}_h(s)-U^{\bm{\pi}}_h(s)\right|
\leq\left|\widehat{U}^{\bm{\pi}}_h(s)-U_h^{\widehat{x}_h(s)}(s)\right|+\left|U^{\widehat{x}_h(s)}_h(s)-U_h^{\bm{\pi}}(s)\right|
\leq\min\Big\{H,\frac{\epsilon_h}{\lambda_s}+2\epsilon'\Big\}.
\]

We now plug in the values of $\alpha$ and $\beta$. For the base case in the $H$-th step, we have $\alpha=0$ and $\beta=\frac{\epsilon}{H+2\eta}$. Then $\epsilon_H=\epsilon$ and $\left|\widehat{\zeta}^{\bm{\pi}}_H(s)-\zeta^{\bm{\pi}}_H(s)\right|=\left|\widehat{U}^{\bm{\pi}}_H(s)-U^{\bm{\pi}}_H(s)\right|=O\left(\min\{H,\frac{\epsilon}{\lambda_s}+\epsilon'\}\right)$.

For the inductive case in the $h$-th step, given that $\left\lVert\widehat{U}^{\bm{\pi}}_{h+1}-U^{\bm{\pi}}_{h+1}\right\rVert_\infty=O(\min\{H,\epsilon\lambda_s^{h-H}+\epsilon'\lambda_s^{h+1-H}\})$, we have $\alpha=O(\min\{H,\epsilon\lambda_s^{h-H}+\epsilon'\lambda_s^{h+1-H}\})$, $\beta=\frac{\epsilon}{H-h+2\eta}$, and
\[
\epsilon_h=\left\lVert\widehat{U}^{\bm{\pi}}_{h+1}-U^{\bm{\pi}}_{h+1}\right\rVert_\infty+\epsilon=O(\min\{H,\epsilon\lambda_s^{h-H}+\epsilon'\lambda_s^{h+1-H}\}).
\]
Then, we can plug in $\epsilon_h$ and bound the estimation error as
\[
U^{\widehat{x}_h(s)}_h(s)-U_h^{\bm{\pi}}(s)\eqsim\left|\widehat{\zeta}^{\bm{\pi}}_h(s)-\zeta^{\bm{\pi}}_h(s)\right|=\left|\widehat{U}^{\bm{\pi}}_h(s)-U^{\bm{\pi}}_h(s)\right|\leq O\left(\min\{H,\epsilon\lambda_s^{h-1-H}+\epsilon'\lambda_s^{h-H}\}\right).
\]
This gives the bound as claimed in the lemma. ∎
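The induction can be read as a backward recursion: $\epsilon_H=\epsilon$, each step's error is at most $\epsilon_h/\lambda_s+2\epsilon'$, and $\epsilon_h$ for $h<H$ is the step-$(h+1)$ error plus $\epsilon$. The sketch below unrolls this recursion numerically and compares it with the claimed rate $\epsilon\lambda_s^{h-1-H}+\epsilon'\lambda_s^{h-H}$. It assumes $\lambda_s\in(0,1]$, omits the clipping at $H$, and sets all hidden constants to 1. The function names and the parameter values are illustrative and not taken from the paper.

```python
def unroll_errors(H, eps, eps_prime, lam):
    """Backward recursion for the per-step error bound.

    err[h] = eps_h / lam + 2 * eps_prime, with eps_H = eps and
    eps_h = err[h + 1] + eps for h < H. Returns a dict {h: err[h]}.
    """
    err = {}
    eps_h = eps
    for h in range(H, 0, -1):
        err[h] = eps_h / lam + 2.0 * eps_prime
        eps_h = err[h] + eps  # this value plays the role of eps_{h-1}
    return err

def claimed_rate(H, h, eps, eps_prime, lam):
    """eps * lam^(h-1-H) + eps' * lam^(h-H), without the O(.) constant."""
    return eps * lam ** (h - 1 - H) + eps_prime * lam ** (h - H)

# Illustrative parameter values (assumptions of this sketch).
H, eps, eps_prime, lam = 5, 1e-3, 1e-3, 0.5
err = unroll_errors(H, eps, eps_prime, lam)
ratios = {h: err[h] / claimed_rate(H, h, eps, eps_prime, lam) for h in err}
```

Under these assumptions the unrolled error is a pair of geometric sums in $\lambda_s^{-1}$, so the ratio to the claimed rate is at most $2(H-h+1)$ in general and stays bounded by a constant when $\lambda_s$ is bounded away from 1. Any such factor is presumably absorbed into the $O(\cdot)$ notation.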

Lemma 16.

If $\mathcal{E}_{\text{model}}$ holds, then under Assumptions 5 and 8, with $T_1=O(HAS^5\kappa_0^{-1}\log(\eta T\lambda_s^{-1}))$, at any episode $t\in[T]$ we have

\[
\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(\min\big\{H,\eta\lambda_s^{1-H}\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/t}\big\}\right).
\]
Proof.

The proof balances the error bounds on $P$ and $\mu$ according to the ratio given in Lemma 12. For this analysis, suppose $\sup_{s,a,a',h}\left\lVert\widehat{\mu}^t_h(s,a,a')-\mu_h(s,a,a')\right\rVert_1\leq\varepsilon$, so that we can apply Lemma 11 and get

\[
\left\lVert\widehat{P}_h^t(s,a)-P_h(s,a)\right\rVert_1\leq 2\sqrt{\frac{\ln(SAHT/\delta)}{2\kappa t}}+\varepsilon,
\]

where we use the fact that $N_h^t(s)\geq 2\kappa t$ under Assumption 5, similar to the analysis in Lemma 10. By Lemma 15, Equation (E.5) can be used to construct contracts that induce any action policy $\bm{\pi}$ such that

\[
\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(\min\Big\{H,(H+\eta)\epsilon\lambda_s^{-H}+\eta\lambda_s^{1-H}\sqrt{\frac{\ln(SAHT/\delta)}{2\kappa t}}\Big\}\right).
\]

Then, we can set $\varepsilon=\frac{\lambda_s}{(H+\eta)\sqrt{t}}$ so that the second term dominates the bound, and we have

\[
\begin{aligned}
\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|
&=O\left(\min\Big\{H,\eta\lambda_s^{1-H}\sqrt{\frac{\ln(SAHT/\delta)}{\kappa t}}\Big\}\right)\\
&=O\left(\min\big\{H,\eta\lambda_s^{1-H}\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/t}\big\}\right).
\end{aligned}
\]
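To see why this choice of $\varepsilon$ lets the second term dominate, one can evaluate both terms of the preceding two-term bound numerically. The sketch below assumes that the $\epsilon$ multiplying $(H+\eta)\lambda_s^{-H}$ in the first term is the same quantity as the $\varepsilon$ being set, uses unit constants, and picks arbitrary illustrative parameter values. None of the function or variable names come from the paper.

```python
import math

def bound_terms(H, eta, lam, kappa, S, A, T, delta, t):
    """Return (first_term, second_term) of the two-term bound,
    with eps chosen as lam / ((H + eta) * sqrt(t))."""
    eps = lam / ((H + eta) * math.sqrt(t))
    first = (H + eta) * eps * lam ** (-H)
    log_term = math.log(S * A * H * T / delta)
    second = eta * lam ** (1 - H) * math.sqrt(log_term / (2.0 * kappa * t))
    return first, second

# Illustrative parameter values (assumptions of this sketch).
params = dict(H=4, eta=2.0, lam=0.5, kappa=0.1, S=10, A=5, T=10_000, delta=0.05)
ratios = []
for t in (10, 100, 1_000, 10_000):
    first, second = bound_terms(t=t, **params)
    ratios.append(first / second)  # ratio of first term to second term
```

With this choice the first term reduces to $\lambda_s^{1-H}/\sqrt{t}$, so the ratio of the two terms equals $\sqrt{2\kappa/\ln(SAHT/\delta)}/\eta$ and does not depend on $t$. The first term is therefore at most a constant multiple of the second whenever that quantity is bounded.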

It remains to show that $T_1$ rounds suffice to ensure $\varepsilon=\frac{\lambda_s}{(H+\eta)\sqrt{t}}$. By Lemma 10, the sample complexity of attaining this $\varepsilon$ is on the order of $\widetilde{O}(HAS^5\kappa_0^{-1}\log(t))$. Hence, we can set $T_1=O(HAS^5\kappa_0^{-1}\log((H+\eta)T\lambda_s^{-1}))=O(HAS^5\kappa_0^{-1}\log(\eta T\lambda_s^{-1}))$. ∎