Contractual Reinforcement Learning:
Pulling Arms with Invisible Hands

This work is supported in part by Army Research Office Award W911NF-23-1-0030, ONR Award N00014-23-1-2802 and NSF Award CCF-2303372.

Jibang Wu, Department of Computer Science, University of Chicago.
Siyu Chen, Department of Statistics and Data Science, Yale University.
Mengdi Wang, Department of Electrical and Computer Engineering, Princeton University.
Huazheng Wang, School of Electrical Engineering and Computer Science, Oregon State University.
Haifeng Xu, Department of Computer Science, University of Chicago.
Corresponding author emails: {wujibang, haifengxu}@uchicago.edu.

Abstract

The agency problem emerges in today’s large-scale machine learning tasks, where the learners are unable to direct content creation or enforce data collection. In this work, we propose a theoretical framework for aligning the economic interests of different stakeholders in online learning problems through contract design. The problem, termed contractual reinforcement learning, naturally arises from the classic model of Markov decision processes, where a learning principal seeks to optimally influence the agent’s action policy for their common interests through a set of payment rules contingent on the realization of the next state. For the planning problem, we design an efficient dynamic programming algorithm to determine the optimal contracts against the far-sighted agent. For the learning problem, we introduce a generic design of no-regret learning algorithms that untangles the challenges, from the robust design of contracts to the balance of exploration and exploitation, and reduces the complexity analysis to the construction of efficient search algorithms. For several natural classes of problems, we design tailored search algorithms that provably achieve $\widetilde{O}(\sqrt{T})$ regret. We also present an algorithm with $\widetilde{O}(T^{2/3})$ regret for the general problem, which improves on the existing analysis in online contract design under mild technical assumptions.

1 Introduction

“Every individual… intends only his own gain, and is led by an invisible hand to promote an end which was no part of his intention.”

-- Adam Smith, The Wealth of Nations, 1776.

The “invisible hand” metaphor by Adam Smith illustrates how properly designed incentive structures can guide self-interested individuals to inadvertently promote the greater social good. This concept is increasingly relevant in the realm of machine learning, as the scale of applications expands and the conflict of economic interests intensifies. For example, an Internet platform wants to estimate the ad revenues from serving different types of content, but it is up to the creators to decide what content to produce. While the platform seeks high-quality content to boost its long-term growth, creators may opt to minimize their production costs. This misalignment has prompted platforms to implement revenue-sharing models, fueling the growth of the creator economy, projected to exceed half a trillion by 2027 [16, 1, 25, 2]. However, current incentive models are inadequate, especially in light of their roles in exacerbating the proliferation of clickbait and misinformation online [54, 55, 31]. Moreover, this issue of misalignment extends well beyond content platforms. E-commerce sites rely on sufficient consumers experimenting with new products for accurate preference assessments. Gig platforms depend on freelance workers accepting tasks to gather essential operational data. Even recommender systems are paying users for their engagement in order to effectively optimize their algorithms [3]. In these cases, the learner’s hands are tied, and decision-makers interacting with the environment have their own objectives, dooming the system to under-exploration regardless of the learner’s objective. Hence, there is a pressing need to pursue formal treatments of incentive alignment problems between the learners and decision-makers and to design principled learning algorithms with statistical and computational efficiency guarantees.

Contributions. On the conceptual side, the presence of self-interested decision-makers challenges a common assumption in online learning, namely that a single learner controls all the interactions with the environment. This paper introduces the contractual reinforcement learning (RL) problem in the principal-agent Markov decision process (PAMDP), where we adopt the principal-agent model from contract theory [27, 24] to capture strategic interactions between the learner and the decision-maker. As illustrated in Figure 1, the learner (henceforth, principal/she) collects the rewards from the actions of the decision-maker (henceforth, agent/he). Without any incentive design, the agent simply optimizes his policy in a standard Markov decision process (MDP) based on his cost function. However, since the agent’s optimal policy is not necessarily in the principal’s best interest, the principal is motivated to properly incentivize the agent to act in her favor by designing contracts that specify the payment rules contingent on the realization of the next state. The core challenge in this design problem is the information asymmetry at two levels: (1) the principal cannot observe the agent’s action a priori and has to condition her payment on the probabilistic outcome of the action, a phenomenon known as moral hazard in economics; (2) the agent is far-sighted in that he is willing to take suboptimal actions at one step in order to reach a more favorable state in future steps, a major barrier for theoretical analysis in multi-agent learning problems.

Figure 1: An illustration of contractual RL in the PAMDP.

On the technical side, this paper provides a comprehensive solution framework to address the unique learning and computational challenges that arise when moral hazard meets far-sighted agency in contractual RL problems. In Section 2, we define state value functions for both the agent and the principal, from which we derive a new class of Bellman equations to characterize the intricate correspondence between the principal’s and the agent’s optimal policies. This leads to our Theorem 1, which shows that the principal’s optimal planning problem can be solved by a clean formulation of dynamic programming in polynomial time. The learning problem is more involved, so we begin with the contractual bandit learning problem (episode length $H=1$) in Section 3 to focus on the challenges from moral hazard. In particular, to achieve low regret, the principal’s learning algorithm must balance exploration and exploitation while continuously improving its estimation of the agent’s preferences to determine cost-efficient contracts. In Theorem 2, we construct a generic algorithm that reduces the learning problem to a standard online learning problem and an efficient search problem for the agent’s decision boundary. As a consequence, we are able to obtain sublinear regret guarantees under different setups, summarized in Table 1. The efficient search algorithm we design for learning the outcome distribution difference in the simplex may be of interest for general use. With these insights, we delve into the full contractual RL problem in Section 4 and show a provably efficient learning algorithm under several technical assumptions in Theorem 3. Meanwhile, the general result highlights a trade-off between statistical and computational tractability, leaving an intriguing open question on the existence of a best-of-both-worlds solution.
The complexity of search is in logarithmic order yet with a large constant in the Markovian setup, and we expect an improved analysis by organically combining the search and exploration in the algorithm design.

Table 1: Regrets in contractual RL. $\widetilde{O}$ omits logarithmic terms and other problem-specific constants.

Moral Hazard	Far-sighted Agency	Known Cost	Regret
✗	✗	✗	$\widetilde{O}(T^{1/2})$ , Corollary 2.1
✓	✗	✗	$\widetilde{O}({T}^{2/3})$ , Corollary 2.2
✓	✗	✓	$\widetilde{O}({T}^{1/2})$ , Corollary 2.3
✓	✓	✓	$\widetilde{O}({T}^{1/2})$ , Theorem 3

Related Work. Our problem is built upon the principal-agent model in contract theory, a crucial branch of economics [27, 48, 35]. Driven by an accelerating trend of contract-based markets deployed in Internet-based applications, the contract design problem has recently received surging interest, especially from the computer science community [24, 28, 9, 20]. The principal-agent model has also been applied to the delegation of online search problems [13, 33] and machine learning tasks [45]. While these works focus on the computational aspects of contract design, we consider the adaptive design problem of the contract between learners and decision-makers in an initially unknown environment. For our learning problem, the dynamic (contextual) pricing problem [34, 41, 47, 39, 36] can be viewed as one of its special cases, where the contract is contingent on the agent’s binary action and the principal already knows her reward function. As we will see in Section 3, our algorithm is able to borrow some design insights from these pricing problems. Meanwhile, the online contract design problem begins as a variant of dynamic pricing [34] where the agent’s cost is stochastically (or adversarially) chosen, and the regret bound is $\Theta(\sqrt{T})$ (or $\Theta(T^{2/3})$ in the adversarial setup). Ho et al. [29] and Zhu et al. [57] consider a generalized model where the agent has multiple actions, and both the cost and reward of his actions are determined by the agent’s Bayesian type, which is unknown to the learner. This problem relates to the continuum-armed bandit problem [6], except that the principal’s utility is not continuous, and Zhu et al. [57] show an almost tight, nearly linear regret bound $\widetilde{\Theta}(T^{1-K/|{\mathcal{S}}|})$ for some constant $K$ and the number of outcomes $|{\mathcal{S}}|$.
In comparison, our learning problem is closer to the standard contract design model, in which the agent type is observable by the principal (captured by the initial state or context), as many platforms hold a good amount of data on their users and content creators. More importantly, this modeling choice allows us to focus on solving the key challenges of learning and planning the optimal contract under moral hazard, where we are able to achieve $\widetilde{O}(\sqrt{T})$ regret for a large class of problems and $\widetilde{O}(T^{2/3})$ in general under mild assumptions. Lastly, several recent works [22, 21, 46] consider a simple special case of our problem, where there is no Markov state transition and the principal can directly incentivize the agent to take a certain action without the barrier of moral hazard. We defer further discussion of the related work to Appendix A.

2 Problem Formulation

2.1 The Principal-Agent Markov Decision Process

Let us first recall the standard reinforcement learning problem in a (finite-horizon) Markov decision process $(\mathcal{A},{\mathcal{S}},\{P_{h},r_{h}\}_{h=1}^{H},P_{0})$, where we have the agent’s action space $\mathcal{A}$, the environment’s state space ${\mathcal{S}}$, the transition kernel $P_{h}:{\mathcal{S}}\times\mathcal{A}\to\Delta({\mathcal{S}})$, the expected reward function $r_{h}:{\mathcal{S}}\times\mathcal{A}\to[0,1]$, the initial state distribution $P_{0}\in\Delta({\mathcal{S}})$ and the horizon length $H$. The contractual reinforcement learning problem simply extends the MDP to a principal-agent Markov decision process $(\mathcal{A},{\mathcal{S}},\{P_{h},r_{h},c_{h}\}_{h=1}^{H},P_{0})$ with the additional cost function $c_{h}:{\mathcal{S}}\times\mathcal{A}\to[0,1]$. (Footnote 1: The $[0,1]$ range of the cost and reward functions is without loss of generality, due to constant shifting and rescaling, and thereby covers existing models [22, 21, 46] that assume a positive reward function for the agent.) In this process, the agent interacts with the environment by taking actions and bearing the costs, whereas the principal receives the reward from the environment. Unable to directly interact with the environment, the principal has to instead design and implement contracts to incentivize the agent to take actions in her interest. Below, we formalize the design of their policies.

As in a standard MDP, the agent’s action policy $\bm{\pi}=\{\pi_{h}:{\mathcal{S}}\to\Delta(\mathcal{A})\}_{h=1}^{H}$ specifies that at each step $h$, given the state $s$, the agent takes the action $a\sim\pi_{h}(s)$. In the following subsection, we will discuss how the agent chooses his action policy and that it suffices to only consider deterministic action policies. Meanwhile, the principal’s contract policy $\bm{x}=\{x_{h}:{\mathcal{S}}\times{\mathcal{S}}\to\mathbb{R}_{+}\}_{h=1}^{H}$ is a sequence of non-liable payment rules $x_{h}$, where $x_{h}(s_{h},s_{h+1})$ specifies the payment to the agent if the next state $s_{h+1}$ is realized, given the current state $s_{h}$ at the $h$-th step. The non-liability constraint ensures that the principal’s payment in the contract for any realization of the next state must be non-negative; the problem would otherwise degenerate with a trivially optimal solution for the principal (see e.g., [24]). Denote $\Pi,\mathcal{X}$ as the agent’s and principal’s policy spaces, respectively. Let $|{\mathcal{S}}|=S,|\mathcal{A}|=A$ and thus $|\Pi|=(SA)^{H}$.

The typical setting of the PAMDP problems can be summarized by the following steps. At the beginning of each episode, the initial state $s_{1}\sim P_{0}$ is realized and observed by both the principal and the agent. Afterwards, the principal commits to a contract policy $\bm{x}$ and the agent accordingly chooses an action policy $\bm{\pi}$. Their interactions then proceed as follows at each step $h\in[H]$:

1. The agent takes an action $a_{h}\sim\pi_{h}(s_{h})$ and bears the cost $c_{h}(s_{h},a_{h})$.
2. The next state $s_{h+1}\sim P_{h}(s_{h},a_{h})$ is realized and observed by both the principal and agent.
3. The principal receives a noisy reward $\iota_{h}(s_{h},s_{h+1})$ and pays the agent $x_{h}(s_{h},s_{h+1})$.
4. The principal observes the agent’s action $a_{h}$.

At each step, the principal’s utility is $\iota_{h}(s_{h},s_{h+1})-x_{h}(s_{h},s_{h+1})$, her reward minus the payment to the agent, whereas the agent’s utility is $x_{h}(s_{h},s_{h+1})-c_{h}(s_{h},a_{h})$, the payment from the principal minus his cost. The reward noise has zero mean such that $r_{h}(s,a)=\mathop{\mathbf{E}}_{s^{\prime}\sim P_{h}(s,a)}\iota_{h}(s,s^{\prime}),\forall s\in{\mathcal{S}},a\in\mathcal{A}$. We refer the readers to Appendix B.1 for a summary of notations and Appendix B.2 for a full discussion of our modeling choices.

2.2 The Optimal Contract Policy

Without any contract design, the model reduces to a standard MDP $(\mathcal{A},{\mathcal{S}},\{P_{h},c_{h}\}_{h=1}^{H},P_{0})$ for the agent, and the principal passively collects the reward from the agent’s policy. This outcome could be suboptimal for both the principal and the agent. Instead, by reshaping the agent’s reward environment through the design of the contract policy, the principal could induce the agent to adopt an action policy with higher social surplus. This motivates the problem of designing the optimal contract policy. We focus on a realistic yet challenging setup in the face of a long-lived, far-sighted and Bayesian rational agent who is also planning optimally for his cumulative reward; we expect that the case of myopic agents can be worked out with a simpler approach. In particular, since the agent’s utility is not necessarily $0$ under the principal’s optimal contract at any state due to moral hazard, a far-sighted agent could take certain actions that are sub-optimal in the current step, yet steer him toward future states where he can obtain higher cumulative utility.

We extend notions of value functions and optimal policies from MDP to PAMDP. Under any action policy $\bm{\pi}$ and contract policy $\bm{x}$ , we define the principal’s state value function at the $h$ -th step as,

$$V_{h}^{\bm{x},\bm{\pi}}(s):=\mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H}\big(r_{\tau}(s_{\tau},a_{\tau})-x_{\tau}(s_{\tau},s_{\tau+1})\big)\,\Big|\,\{\pi_{\tau}\}_{\tau=h}^{H},\,s_{h}=s\Big],$$

and the agent’s state value function at the $h$ -th step as,

$$U_{h}^{\bm{x},\bm{\pi}}(s):=\mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H}\big(x_{\tau}(s_{\tau},s_{\tau+1})-c_{\tau}(s_{\tau},a_{\tau})\big)\,\Big|\,\{\pi_{\tau}\}_{\tau=h}^{H},\,s_{h}=s\Big],$$

where the expectation in both $V,U$ is with respect to the randomness of the trajectory (due to the stochasticity of state transitions and action policy). Let $V^{\bm{x},\bm{\pi}}:=\mathop{\mathbf{E}}_{s\sim P_{0}}V_{1}^{\bm{x},\bm{\pi}}(s)$ and $U^{\bm{x},\bm{\pi}}:=\mathop{\mathbf{E}}_{s\sim P_{0}}U_{1}^{\bm{x},\bm{\pi}}(s)$. The principal’s goal is to maximize her value $V^{\bm{x},\bm{\pi}}$, given the agent’s optimal response $\bm{\pi}$, which equivalently maximizes $V_{1}^{\bm{x},\bm{\pi}}(s)$ at any initial state $s$ with $P_{0}(s)>0$. Hence, we define the principal’s optimal contract policy $\bm{x}^{*}=\{x^{*}_{h}\}_{h=1}^{H}$ and the corresponding optimal value function $V^{*}$ as the optimal solution and value of the following bi-level optimization problem. (Footnote 2: Throughout this paper, we assume the agent breaks ties in favor of the principal. This is without loss of generality in generic games, since the principal can force the tie-breaking by making an infinitesimally small additional payment to the action of her interest.)

$$V^{*},\bm{x}^{*}:=\mathop{\mathrm{maxarg}}_{\bm{x}\in\mathcal{X}}V^{\bm{x},\bm{\pi}^{\bm{x}}}\quad\text{s.t.}\quad\bm{\pi}^{\bm{x}}=\mathop{\mathrm{argmax}}_{\bm{\pi}\in\Pi}U^{\bm{x},\bm{\pi}},\tag{2.1}$$

where “ $\mathop{maxarg}$ ” is a convenient operator notation on an optimization problem that returns the optimal objective value followed by its optimal solution. For notational convenience, we will denote the agent’s optimal action policy in response to contract policy $\bm{x}$ as $\bm{\pi}^{\bm{x}}=\mathop{argmax}_{\bm{\pi}\in\Pi}U^{\bm{x},\bm{\pi}}$, and use shorthands $V_{h}^{\bm{x}}:=V_{h}^{\bm{x},\bm{\pi}^{\bm{x}}},U_{h}^{\bm{x}}:=U_{h}^{\bm{x},\bm{\pi}^{\bm{x}}}$ for the principal’s and agent’s value functions under contract policy $\bm{x}$ at the $h$-th step given that the agent responds optimally. Meanwhile, we denote $\bm{x}^{\bm{\pi}}=\mathop{argmax}_{\bm{x}\in\mathcal{X}}V^{\bm{x},\bm{\pi}}\text{ s.t. }\bm{\pi}=\mathop{argmax}_{\bm{\pi}^{\prime}\in\Pi}U^{\bm{x},\bm{\pi}^{\prime}}$ as the principal’s optimal contract policy to induce the agent’s action policy $\bm{\pi}$. We use similar shorthands $V_{h}^{\bm{\pi}}:=V_{h}^{\bm{x}^{\bm{\pi}},\bm{\pi}},U_{h}^{\bm{\pi}}:=U_{h}^{\bm{x}^{\bm{\pi}},\bm{\pi}}$ for the principal’s and agent’s value functions under contract policy $\bm{x}^{\bm{\pi}}$ at the $h$-th step given that the agent responds with $\bm{\pi}$. Notably, since the optimization problem (2.1) hinges on the intricate correspondence between $\bm{x}$ and $\bm{\pi}$, it is unclear for now whether the principal can efficiently plan her optimal policy by adopting the standard approach in MDPs.

Solving for the Agent’s Optimal Policy. One key observation is that the correspondence between $\bm{\pi}$ and $\bm{x}$ has a clean characterization through the Bellman equation. Specifically, both functions $\{\pi_{h}^{\bm{x}},U_{h}^{\bm{x}}\}_{h=1}^{H}$ can be solved through backward induction with $U^{\bm{x}}_{H+1}(s)=0$ :

$$\text{given } U^{\bm{x}}_{h+1},\qquad U^{\bm{x}}_{h}(s),\pi^{\bm{x}}_{h}(s)=\mathop{\mathrm{maxarg}}_{a\in\mathcal{A}}\ P_{h}(s,a)\cdot[x_{h}(s)+U^{\bm{x}}_{h+1}]-c_{h}(s,a).\tag{2.2}$$

Notice that since $\pi^{\bm{x}}_{h}(s)$ is a maximizer of a linear function, the agent’s best-responding policy $\pi^{\bm{x}}_{h}$ is deterministic without loss of generality. With $\bm{\pi}^{\bm{x}}$, the principal’s value function under $\bm{x}$ can also be computed iteratively from $V^{\bm{x}}_{H+1}(s)=0$ :

$$\text{given } V^{\bm{x}}_{h+1},\qquad V_{h}^{\bm{x}}(s)=r_{h}(s,\pi_{h}^{\bm{x}}(s))+P_{h}(s,\pi_{h}^{\bm{x}}(s))\cdot[V^{\bm{x}}_{h+1}-x_{h}(s)].\tag{2.3}$$

Due to the space limit, we only solve the agent’s best response $\bm{\pi}^{\bm{x}}$ at any given $\bm{x}$ . We refer the reader to Appendix B.3 for the more involved formulation to solve the optimal policy $\bm{x}^{\bm{\pi}}$ for any given $\bm{\pi}$ .
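The backward induction in (2.2) and (2.3) can be sketched in a few lines of Python. This is a minimal illustration rather than the paper’s implementation: the function name `agent_best_response`, the nested-list layout `P[h][s][a][s']`, and the two-state toy instance are assumptions made here for the example. Ties in the agent’s value are broken in the principal’s favor, as assumed in the text.

```python
def agent_best_response(P, r, c, x, H, tol=1e-12):
    """Backward induction for eqs. (2.2) and (2.3).

    P[h][s][a][k]: probability of next state k; r[h][s][a], c[h][s][a]: reward, cost;
    x[h][s][k]: payment when state k follows state s at step h (hypothetical layout).
    Returns the agent's policy and the step-1 values (U, V).
    """
    S, A = len(P[0]), len(P[0][0])
    U_next, V_next = [0.0] * S, [0.0] * S          # U_{H+1} = V_{H+1} = 0
    policy = [None] * H
    for h in reversed(range(H)):
        U, V, pi = [0.0] * S, [0.0] * S, [0] * S
        for s in range(S):
            # Agent's value of each action: P_h(s,a) . [x_h(s) + U_{h+1}] - c_h(s,a)
            W = [sum(P[h][s][a][k] * (x[h][s][k] + U_next[k]) for k in range(S))
                 - c[h][s][a] for a in range(A)]
            # Principal's value if a is played: r_h(s,a) + P_h(s,a) . [V_{h+1} - x_h(s)]
            Q = [r[h][s][a]
                 + sum(P[h][s][a][k] * (V_next[k] - x[h][s][k]) for k in range(S))
                 for a in range(A)]
            top = max(W)
            ties = [a for a in range(A) if W[a] >= top - tol]
            pi[s] = max(ties, key=lambda a: Q[a])   # tie-break for the principal
            U[s], V[s] = W[pi[s]], Q[pi[s]]
        policy[h] = pi
        U_next, V_next = U, V
    return policy, U_next, V_next


# Toy instance (assumed): state 1 is "good", action 1 is "work", action 0 is "shirk".
H = 2
P_step = [[(0.8, 0.2), (0.3, 0.7)], [(0.8, 0.2), (0.3, 0.7)]]   # P_step[s][a]
r_step = [[0.2, 0.7], [0.2, 0.7]]     # expected reward = probability of the good state
c_step = [[0.0, 0.2], [0.0, 0.1]]     # working is cheaper in the good state
P, r, c = [P_step] * H, [r_step] * H, [c_step] * H

bonus = [[(0.0, 0.5), (0.0, 0.5)]] * H    # pay 0.5 whenever the good state is reached
no_pay = [[(0.0, 0.0), (0.0, 0.0)]] * H   # no contract at all

pi_bonus, U_bonus, V_bonus = agent_best_response(P, r, c, bonus, H)
pi_zero, U_zero, V_zero = agent_best_response(P, r, c, no_pay, H)
```

In this toy instance the agent shirks everywhere without a contract and works everywhere under the bonus contract, which raises the principal’s value even after payments.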

Value Decomposition. Another key observation is that the value functions can be decomposed into parts that depend only on the agent’s action policy. This is analogous to the standard contract design where the principal’s and agent’s utilities sum up to the social surplus, i.e., the difference between the reward and cost of the agent’s action. Here, let the principal’s expected reward and the agent’s expected cost function from the $h$-th step onward be

$$R_{h}^{\bm{\pi}}(s):=\mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H}r_{\tau}(s_{\tau},a_{\tau})\,\Big|\,\{\pi_{\tau}\}_{\tau=h}^{H},\,s_{h}=s\Big],\quad C_{h}^{\bm{\pi}}(s):=\mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H}c_{\tau}(s_{\tau},a_{\tau})\,\Big|\,\{\pi_{\tau}\}_{\tau=h}^{H},\,s_{h}=s\Big].$$

By linearity of expectation, for any policy $\bm{x}\in\mathcal{X},\bm{\pi}\in\Pi$ , at any state $s$ of any step $h$ , we have

V_{h}^{\bm{x},\bm{\pi}}(s)=R_{h}^{\bm{\pi}}(s)-C_{h}^{\bm{\pi}}(s)-U_{h}^{\bm{% x},\bm{\pi}}(s).

Both functions $R,C$ are determined by the agent’s action policy $\bm{\pi}$, regardless of the contract policy $\bm{x}$. In addition, $C_{h}^{\bm{\pi}}(s)+U_{h}^{\bm{x},\bm{\pi}}(s)$ captures the total expected payment transferred from the principal to the agent from the $h$-th step onward at state $s$. Since the total reward is fixed under any given $\bm{\pi}$, the principal’s value is maximized under the minimal total payment, $\zeta_{h}^{\bm{\pi}}(s):=C_{h}^{\bm{\pi}}(s)+U_{h}^{\bm{\pi}}(s)$. The function $\zeta_{h}^{\bm{\pi}}$ thus serves as the equivalent optimization objective in the least-payment Bellman equation in Appendix B.3.
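For completeness, the decomposition follows in one line from the definitions of $V$, $U$, $R$ and $C$; the short derivation below is a sketch that uses only those definitions, with every expectation conditioned on $\{\pi_{\tau}\}_{\tau=h}^{H}$ and $s_{h}=s$.

$$\begin{aligned}
V_{h}^{\bm{x},\bm{\pi}}(s)+U_{h}^{\bm{x},\bm{\pi}}(s)
&=\mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H}\big(r_{\tau}(s_{\tau},a_{\tau})-x_{\tau}(s_{\tau},s_{\tau+1})\big)+\big(x_{\tau}(s_{\tau},s_{\tau+1})-c_{\tau}(s_{\tau},a_{\tau})\big)\Big]\\
&=\mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H}r_{\tau}(s_{\tau},a_{\tau})\Big]-\mathop{\mathbf{E}}\Big[\sum_{\tau=h}^{H}c_{\tau}(s_{\tau},a_{\tau})\Big]
=R_{h}^{\bm{\pi}}(s)-C_{h}^{\bm{\pi}}(s).
\end{aligned}$$

The payments cancel inside the sum, and the same bookkeeping gives the expected total payment as $\mathop{\mathbf{E}}\big[\sum_{\tau=h}^{H}x_{\tau}(s_{\tau},s_{\tau+1})\big]=U_{h}^{\bm{x},\bm{\pi}}(s)+C_{h}^{\bm{\pi}}(s)$.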

Solving for the Optimal Contract Policy. With the two observations above, it is clear that the principal’s optimal value and policy, $V^{*}=\mathop{max}_{\bm{\pi}\in\Pi}V^{\bm{\pi}}$, can be determined by computing $V^{\bm{\pi}}$ for every $\bm{\pi}$, according to the least-payment Bellman equation in Appendix B.3. However, this maximization problem is still intractable, as there are exponentially many possible $\bm{\pi}$. Instead, we have to interleave the process of solving for the policy with the least payment and the policy with the maximum reward. This enables the following construction of a bi-level backward induction that iteratively solves for the optimal contract policy $\bm{x}^{*}$.

Theorem 1 (Bellman Equations of PAMDP).

The optimal contract policy can be solved by dynamic programming in polynomial time, from $h=H$ to $1$ with $U^{*}_{H+1}(s)=V^{*}_{H+1}(s)=0,\forall s\in{\mathcal{S}}$,

$$\begin{aligned}
W^{*}_{h}(s,a;x)&=P_{h}(s,a)\cdot[x+U^{*}_{h+1}]-c_{h}(s,a),\\
x^{*}_{h}(s;a)&=\mathop{\mathrm{argmin}}_{x:{\mathcal{S}}\to\mathbb{R}_{+}}\left\{P_{h}(s,a)\cdot x\mid W^{*}_{h}(s,a;x)\geq W^{*}_{h}(s,a^{\prime};x),\forall a^{\prime}\in\mathcal{A}\right\},\\
Q^{*}_{h}(s,a)&=r_{h}(s,a)+P_{h}(s,a)\cdot[V^{*}_{h+1}-x^{*}_{h}(s;a)],\\
V_{h}^{*}(s),\pi^{*}_{h}(s)&=\mathop{\mathrm{maxarg}}_{a\in\mathcal{A}}Q^{*}_{h}(s,a),\quad x^{*}_{h}(s)=x^{*}_{h}(s;\pi^{*}_{h}(s)),\quad U_{h}^{*}(s)=W^{*}_{h}(s,\pi^{*}_{h}(s);x^{*}_{h}(s)).
\end{aligned}\tag{2.4}$$

To interpret the Bellman equation above, $x^{*}_{h}(s;a)$ denotes the contract with the least payment to induce the agent to take action $a$ at state $s$ in step $h$. Given that $\pi^{*}_{h}(s)$ is the best agent action for the principal to induce, the optimal contract at state $s$ in step $h$ can be determined as $x^{*}_{h}(s)=x^{*}_{h}(s;\pi^{*}_{h}(s))$. $Q_{h}^{*}(s,a),W_{h}^{*}(s,a;x)$ are respectively the principal’s and agent’s total expected utility from the $h$-th step under policies $\{x^{*}_{\tau}\}_{\tau=h+1}^{H}$ and $\{\pi^{*}_{\tau}\}_{\tau=h+1}^{H}$, which can be viewed as their optimal state-action value functions at the $h$-th step, serving as the intermediate variables for the computation. See Appendix B.4 for the proof of correctness.
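The recursion (2.4) can be illustrated with a small Python sketch. To stay self-contained it is restricted to two states, so that the least-payment linear program in the second line of (2.4) has only two variables and can be solved by enumerating vertices; a general implementation would call an LP solver instead. The function names, the data layout `P[h][s][a]`, and the toy instance are assumptions made for this example, and the payment cap $\eta$ introduced later for learning is ignored.

```python
from itertools import combinations

def min_payment_contract(p_a, ic_rows, tol=1e-9):
    """Minimise p_a . x over x >= 0 subject to g . x >= b for each (g, b).

    Two next states only: a feasible optimum then sits at a vertex, i.e. at the
    intersection of two constraint boundaries. Returns (payment, x), or None
    when the action cannot be induced.
    """
    rows = list(ic_rows) + [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0)]  # add x >= 0
    best = None
    for (g1, b1), (g2, b2) in combinations(rows, 2):
        det = g1[0] * g2[1] - g1[1] * g2[0]
        if abs(det) < tol:
            continue  # parallel boundaries never define a vertex
        x = ((b1 * g2[1] - g1[1] * b2) / det,      # Cramer's rule
             (g1[0] * b2 - b1 * g2[0]) / det)
        if all(g[0] * x[0] + g[1] * x[1] >= b - tol for g, b in rows):
            pay = p_a[0] * x[0] + p_a[1] * x[1]
            if best is None or pay < best[0] - tol:
                best = (pay, x)
    return best

def solve_pamdp(P, r, c, H):
    """Bi-level backward induction of (2.4) for a two-state PAMDP."""
    S, A = 2, len(P[0][0])
    U_next, V_next = [0.0] * S, [0.0] * S
    plan = [None] * H                      # plan[h][s] = (induced action, contract)
    for h in reversed(range(H)):
        U, V, step = [0.0] * S, [0.0] * S, [None] * S
        for s in range(S):
            best = None
            for a in range(A):
                p_a = P[h][s][a]
                rows = []
                for b in range(A):         # incentive constraints W(a) >= W(b)
                    if b == a:
                        continue
                    g = (p_a[0] - P[h][s][b][0], p_a[1] - P[h][s][b][1])
                    rhs = (c[h][s][a] - c[h][s][b]
                           - (g[0] * U_next[0] + g[1] * U_next[1]))
                    rows.append((g, rhs))
                sol = min_payment_contract(p_a, rows)
                if sol is None:
                    continue               # action a is not inducible here
                pay, x = sol
                q = r[h][s][a] + p_a[0] * V_next[0] + p_a[1] * V_next[1] - pay
                if best is None or q > best[0]:
                    w = pay + p_a[0] * U_next[0] + p_a[1] * U_next[1] - c[h][s][a]
                    best = (q, w, a, x)
            V[s], U[s], step[s] = best[0], best[1], (best[2], best[3])
        plan[h] = step
        U_next, V_next = U, V
    return V_next, U_next, plan

# Toy instance (assumed): state 1 is "good", action 1 is "work", action 0 is "shirk".
H = 2
P_step = [[(0.8, 0.2), (0.3, 0.7)], [(0.8, 0.2), (0.3, 0.7)]]   # P_step[s][a]
r_step = [[0.2, 0.7], [0.2, 0.7]]
c_step = [[0.0, 0.2], [0.0, 0.1]]
V1, U1, plan = solve_pamdp([P_step] * H, [r_step] * H, [c_step] * H, H)
```

In this instance the agent’s continuation utility is higher after the bad state than after the good state, so inducing work from state 0 requires a bonus of 0.44 at the first step, compared with 0.4 at the last step. This is the far-sighted effect captured by the $U^{*}_{h+1}$ term in $W^{*}_{h}$.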

2.3 The Contractual Reinforcement Learning Problem

We now introduce the reinforcement learning problem in PAMDP, where the principal acts as the learner and seeks to adaptively improve her contract policy by interacting with the agent. Following the online learning convention, we use the expected regret to evaluate the learning performance in $T$ episodes, $\operatorname{Reg}(T):=\sum_{t=1}^{T}\big(V^{*}-V^{\bm{x}^{t}}\big),$ where $\bm{x}^{t}$ is the principal’s contract policy in the $t$-th episode.

This paper makes a few assumptions for the analysis of reinforcement learning problems. First, the far-sighted agent has perfect knowledge of his cost function and the state transition kernel $\{P_{h},c_{h}\}_{h\in[H]}$ such that he can always choose the best response. This is realistic because, in the applications of our interest, agents are the experts in their fields (e.g., content creators, freelance workers, ride-sharing drivers) who have learned about the environment sufficiently well, whereas the principal, as the system designer, has not. Second, the agent at time $t$ is assumed to best respond to $\bm{x}^{t}$. This can be equivalently interpreted as the agent at each time $t$ showing up only once. This is motivated by the reality of Internet applications, where each individual agent’s participation only accounts for a negligible portion of the system’s traffic and hence has little influence over the entire system’s learning policy, so the best response (regardless of the learning policy) is optimal for each individual. Third, we assume that the design space of contracts is restricted to $\{x_{h}:{\mathcal{S}}\times{\mathcal{S}}\to[0,\eta]\}$ at any step $h\in[H]$. This reflects the practical concern of contract design under randomness: while a contract with bounded payment may sacrifice optimality, it regularizes the variance in the payment transfer and reduces the risk for both the principal and agents. Moreover, as long as the environment parameters have finite precision, the parameter $\eta$ can be matched to the finite bit complexity of the optimal contract. Though this assumption is without loss of generality from a modeling perspective, we expect future work to develop tighter analysis techniques to relax the dependency on $\eta$. For other regularity assumptions necessary to obtain tractable complexity results, we defer to the technical sections.

3 Warm-up: Solving the Contractual Bandit Learning Problem

In this section, we consider an important special case of the contractual reinforcement learning problem with $H=1$, which allows us to first focus on the learning challenge from moral hazard without the concern of far-sighted agency. We refer to this problem as the contractual bandit learning problem. Below, we first describe the contractual bandit learning problem with much simplified notation, since it suffices to omit the current state and the time step in the subscripts given that $H=1$. We then showcase a generic analysis of the statistical complexity of the contractual bandit learning problem.

3.1 The Contractual Bandit Learning Problem

In this setup, the agent’s policy space is simply his action space $\mathcal{A}$, i.e., the set of bandit arms. $P:\mathcal{A}\to\Delta({\mathcal{S}})$ specifies an outcome distribution for each action, where the outcome space ${\mathcal{S}}$ could naturally capture the reward stochasticity of each arm in bandit learning problems. The principal designs the contract $x:{\mathcal{S}}\to\mathbb{R}_{+}$, contingent on the outcome space ${\mathcal{S}}$, to influence the agent’s choice of action. The principal’s reward and the agent’s cost are both functions of the agent’s action, $r,c:\mathcal{A}\to[0,1]$. We consider a contractual bandit learning problem with $T$ rounds. At the beginning of each round $t$, the principal commits to a contract $x_{t}$ and interacts with the agent as follows:

1. The agent takes the action $a_{t}$.
2. The outcome $s_{t}\sim P(a_{t})$ is realized and observed by both the principal and agent.
3. The principal receives the noisy reward $\iota(s_{t})$ and pays the agent $x_{t}(s_{t})$.
4. The principal observes the agent’s action $a_{t}$.

Here, the noisy reward function satisfies $\mathop{\mathbf{E}}_{s\sim P(a_{t})}\iota(s)=r(a_{t})$, and we assume the agent’s action always maximizes his expected utility, i.e., $a_{t}\in\mathop{argmax}_{a\in\mathcal{A}}\{P(a)\cdot x_{t}-c(a)\}$. To determine the principal’s optimal contract, let us recall the notion of the least payment function from the general setup. We similarly define $\zeta:\mathcal{A}\to\mathbb{R}_{+}$ such that for any given action $a\in\mathcal{A}$, it outputs the least expected payment necessary to induce $a$, $\zeta(a):=\mathop{min}_{x\in\mathcal{X}^{a}}P(a)\cdot x,$ where $\mathcal{X}^{a}=\{x\in\mathcal{X}:[P(a)-P(a^{\prime})]\cdot x\geq c(a)-c(a^{\prime}),\forall a^{\prime}\neq a\}$ denotes the set of all contracts under which the agent would respond with action $a$. Hence, the principal can determine the optimal action to induce, $a^{*}\in\mathop{argmax}_{a\in\mathcal{A}}[r(a)-\zeta(a)]$, with the optimal contract $x^{*}=\mathop{argmin}_{x\in\mathcal{X}^{a^{*}}}P(a^{*})\cdot x$. With the benchmark of the optimal contract $x^{*}$ that induces $a^{*}$ with the least payment $\zeta(a^{*})$, we can measure the learning performance in $T$ rounds with the expected regret as follows, $\operatorname{Reg}(T)=\mathop{max}_{a^{*}\in\mathcal{A}}\sum_{t=1}^{T}\big[r(a^{*})-\zeta(a^{*})\big]-\sum_{t=1}^{T}\big[r(a_{t})-P(a_{t})\cdot x_{t}\big].$ This problem is a strict generalization of standard online learning, as it degenerates to the standard notion of regret when $\zeta(a)=0,\forall a\in\mathcal{A}$. However, with the additional $\zeta$ function, the no-regret learner must not only obtain good estimates of both $r$ and $\zeta$ in order to identify the optimal action, but also implement contracts that induce the optimal action and have expected payment approaching $\zeta$.
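The four-step round protocol described at the start of this subsection can be simulated with a short Python sketch. It is only an illustration under stated assumptions: the names `best_response` and `play_round`, the two-action instance, and the choice of a reward that depends deterministically on the outcome are all hypothetical, and ties in the agent’s utility are resolved by lowest index rather than in the principal’s favor.

```python
import random

def best_response(P, c, x):
    """Agent's action maximising P(a) . x - c(a); ties go to the lowest index."""
    utils = [sum(p * pay for p, pay in zip(P[a], x)) - c[a] for a in range(len(P))]
    return max(range(len(P)), key=lambda a: utils[a])

def play_round(P, c, iota, x, rng):
    a = best_response(P, c, x)                        # step 1: the agent acts
    s = rng.choices(range(len(x)), weights=P[a])[0]   # step 2: outcome s ~ P(a)
    reward, payment = iota[s], x[s]                   # step 3: reward and payment
    return {"action": a, "outcome": s,                # step 4: action is observed
            "principal": reward - payment, "agent": payment - c[a]}

# Assumed instance: action 1 ("work") makes the good outcome 1 more likely.
P = [[0.8, 0.2], [0.3, 0.7]]   # P[a][s]
c = [0.0, 0.2]                 # cost of each action
iota = [0.0, 1.0]              # reward attached to each outcome, so r = [0.2, 0.7]
x = [0.0, 0.5]                 # contract: pay 0.5 on the good outcome

rng = random.Random(0)
rounds = [play_round(P, c, iota, x, rng) for _ in range(2000)]
avg_principal = sum(rd["principal"] for rd in rounds) / len(rounds)
```

Under this contract the agent works, and the principal’s expected per-round utility is $r(a)-P(a)\cdot x=0.7-0.35=0.35$. In this instance the least payment that induces work is $\zeta=0.28$ (pay $0.4$ on the good outcome), so the contract above incurs an expected regret of $0.07$ per round.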

A Simpler Case with Direct Incentives. We remark that a special case of the contractual bandit learning problem assumes the principal is able to design her contract contingent on the agent’s action. This enables the principal to implement any payment rule $x:\mathcal{A}\to\mathbb{R}_{+}$, and the agent responds with his optimal action $a^{*}=\mathop{argmax}_{a\in\mathcal{A}}\{x(a)-c(a)\}$. With this relaxation, $\zeta=c$, since the optimal $x$ to induce any action $a$ is to set a direct incentive with $x(a)=c(a),x(a^{\prime})=0,\forall a^{\prime}\neq a$. The expected regret reduces to $\operatorname{Reg}(T)=\mathop{max}_{a^{*}\in\mathcal{A}}\sum_{t=1}^{T}\big[r(a^{*})-c(a^{*})\big]-\sum_{t=1}^{T}\big[r(a_{t})-x_{t}(a_{t})\big].$ As we will see in this paper, the learning problem becomes more tractable in this setup, since the principal can directly learn the cost function $c$ to determine the least payment to induce each action. In Appendix C.2 and C.3, we showcase the multi-armed bandits and linear bandits under direct incentives, both of which have been recently studied by Scheid et al. [46].
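With direct incentives and known $r$ and $c$, the planning step is a one-liner, as the following Python sketch suggests. The helper names and the three-action instance are assumptions for illustration; the agent’s tie between the induced action and a zero-cost alternative is broken in the principal’s favor, as assumed throughout.

```python
def direct_incentive_plan(r, c):
    """Pick the action maximising r(a) - c(a) and pay exactly its cost."""
    a_star = max(range(len(r)), key=lambda a: r[a] - c[a])
    x = [c[a] if a == a_star else 0.0 for a in range(len(r))]
    return a_star, x

def agent_response(x, c, r, tol=1e-12):
    """Agent maximises x(a) - c(a); ties are broken in the principal's favor."""
    utils = [x[a] - c[a] for a in range(len(c))]
    top = max(utils)
    ties = [a for a in range(len(c)) if utils[a] >= top - tol]
    return max(ties, key=lambda a: r[a] - x[a])

# Assumed instance with three actions.
r = [0.2, 0.7, 0.9]
c = [0.0, 0.2, 0.6]
a_star, x = direct_incentive_plan(r, c)
induced = agent_response(x, c, r)
principal_utility = r[induced] - x[induced]
```

Here the surplus $r-c$ is largest for the middle action, the contract pays exactly its cost, and the agent, indifferent between that action and the free one, follows the principal’s preferred choice.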

3.2 A Generic Approach to Contractual Bandit Learning

We begin with a natural assumption that enables us to employ existing techniques in online learning to obtain tractable complexity results for a large class of contractual bandit learning problems.

Assumption 1 ( $\lambda$ -Inducibility).

For any action $a\in\mathcal{A}$, there exists an event $e\in\{0,1\}^{S}$, i.e., an indicator vector over the outcomes, such that $[P(a)-P(a^{\prime})]\cdot e\geq\lambda,\forall a^{\prime}\neq a$.

This assumption ensures the regularity of the problem instance in the sense that each action is dominantly capable of inducing a set of outcomes over the other actions, such that for any cost function $c$ and any action $a$, there exists a contract $x$ to induce $a$. To see this, one can explicitly construct such a contract as $x=e\cdot\mathop{max}_{a^{\prime}}\frac{c(a)-c(a^{\prime})}{\lambda}$, where $e$ is the event such that $[P(a)-P(a^{\prime})]\cdot e\geq\lambda,\forall a^{\prime}\neq a$. Otherwise, if $\lambda\leq 0$, then there could be some action that is never the agent’s best response under any contract.
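The explicit construction above can be checked numerically. The sketch below is a minimal illustration under assumptions: the helper names, the three-action instance, and the choice of events (each action’s event is the single outcome it makes most likely) are hypothetical. The maximum is taken over all $a^{\prime}$ including $a$ itself, so the scaling factor is non-negative and the contract respects $x\geq 0$.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def inducing_contract(a, c, e, lam):
    """x = e * max_{a'} (c(a) - c(a')) / lam, the construction from the text."""
    scale = max(c[a] - c[b] for b in range(len(c))) / lam   # b = a gives scale >= 0
    return [scale * ek for ek in e]

def induces(a, x, P, c, tol=1e-9):
    """True if a is a best response to x, up to ties and a numerical tolerance."""
    return all(dot(P[a], x) - c[a] >= dot(P[b], x) - c[b] - tol
               for b in range(len(c)))

# Assumed instance: action a makes outcome a likely, so e_a is the indicator of
# outcome a and [P(a) - P(a')] . e_a = 0.8 - 0.1 = 0.7 for every a' != a.
P = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
c = [0.0, 0.3, 0.5]
lam = 0.7
events = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

contracts = [inducing_contract(a, c, events[a], lam) for a in range(3)]
all_induced = all(induces(a, contracts[a], P, c) for a in range(3))
```

In this instance the constructed contract leaves the agent exactly indifferent between the target action and the cheapest action, which is why the check allows ties; the tie is resolved by the tie-breaking convention or, in the learning setting, by adding a small margin.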

We now propose a generic approach to designing statistically efficient algorithms for contractual bandit learning problems. The key idea of our approach is to decouple the learning of the contract from the learning of the optimal action. In particular, let us first assume an oracle that is able to construct a robust contract set, as in Definition 1, for each action $a\in\mathcal{A}$ despite the uncertainty in parameter estimation. We use the robust contract set to determine the optimistic action and eventually learn the optimal action with no regret. This enables us to decompose the sample complexity result into the estimation errors of the optimal contract and of the optimal action, according to Theorem 2.

Definition 1 ( $\varepsilon$ -margin Contract Set).

We define the $\varepsilon$ -margin contract set for each action $a\in\mathcal{A}$ as

$${\mathcal{X}}^{a}(\varepsilon)=\{x\in\mathcal{X}:[P(a)-P(a^{\prime})]\cdot x\geq c(a)-c(a^{\prime})+\varepsilon,\ \forall a^{\prime}\neq a\}.$$
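As an illustration of Definition 1, the following sketch tests whether a given contract belongs to the $\varepsilon$-margin contract set of an action. The instance and the helper names are hypothetical and chosen only to show how the set shrinks as the margin grows.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def in_margin_set(x, a, P, c, eps):
    """Check [P(a) - P(a')] . x >= c(a) - c(a') + eps for every a' != a."""
    for b in P:
        if b == a:
            continue
        diff = [pa - pb for pa, pb in zip(P[a], P[b])]
        if dot(diff, x) < c[a] - c[b] + eps:
            return False
    return True


# Hypothetical instance with three actions and three outcomes.
P = {"a0": [0.6, 0.2, 0.2], "a1": [0.2, 0.6, 0.2], "a2": [0.2, 0.2, 0.6]}
c = {"a0": 0.0, "a1": 0.2, "a2": 0.5}

x = [0.0, 0.0, 2.0]  # pays 2 only when the third outcome is realized
```

Here `x` induces `a2` with some slack, so it lies in the margin set for small `eps` but drops out once `eps` exceeds that slack, reflecting the nesting ${\mathcal{X}}^{a}(\varepsilon)\subseteq{\mathcal{X}}^{a}(0)$.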

Theorem 2.

Under Assumption 1, with an $\varepsilon$-margin contract set for every action $a\in\mathcal{A}$, there is a generic algorithm with regret $\widetilde{O}(\eta\sqrt{T}+T\varepsilon/\lambda)$ for the contractual bandit learning problems.

The key step of the proof is Lemma 1, which shows the contracts solved from LP (C.1) have bounded suboptimality from the least payment contract (both in estimation and in execution) depending on parameter estimation error $\epsilon$ and the robustness margin $\varepsilon$ . This allows us to simply adopt an upper confidence bound argument to bound the regret. See Appendix C.1 for the full proof and the construction of the generic algorithm. The rationale behind Theorem 2 is to separate the learning of the contract sets from the learning of the optimal action. In particular, the learning and construction procedure of such contract sets has been a well-established problem in variants of Stackelberg games [37, 43]. We abstract this problem into the design of a $\chi(\varepsilon)$ -learning procedure defined below.

Definition 2 ( $\chi(\varepsilon)$ -Learning Procedure).

For a $\chi(\varepsilon)$-learning procedure, after any $\chi(\varepsilon)$ number of rounds, it can construct a robust contract set $\widehat{\mathcal{X}}^{a},\forall a\in\mathcal{A}$ such that ${\mathcal{X}}^{a}(\varepsilon)\subseteq\widehat{{\mathcal{X}}}^{a}\subseteq{\mathcal{X}}^{a}$.

Based on the concept in Definition 2, an immediate implication of Theorem 2 is that if there is an ${O}(1/\varepsilon)$-learning procedure, a simple “prepare-then-commit” style algorithm can achieve $\widetilde{O}(\sqrt{T})$ regret in the contractual bandit problem. That is, it first prepares for a warm start by running the learning procedure for $T^{1/2}$ rounds to obtain the $O(T^{-1/2})$-margin contract sets, then commits to following Algorithm 2 for the remaining $T-T^{1/2}$ rounds. Furthermore, using the standard doubling trick [15], we can convert the “prepare-then-commit” style algorithm into an anytime algorithm with the same $\widetilde{O}(\sqrt{T})$ regret guarantee that is agnostic to the time horizon $T$ during its construction. Therefore, the difficulty of solving the contractual bandit learning problem hinges on the statistical efficiency of the learning procedure, which heavily depends on the problem structure.
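The doubling trick mentioned here can be sketched as follows: the horizon is split into epochs of geometrically growing length, and a horizon-dependent base algorithm is restarted in each epoch with that epoch's length as its horizon. The code below only computes the epoch schedule and sums hypothetical per-epoch bounds of the form $C\sqrt{n}$; the constant `C` and the restart-from-scratch convention are assumptions of this sketch, not details taken from the construction in the appendix.

```python
import math


def doubling_epochs(T):
    """Epoch lengths 1, 2, 4, ... with the last one truncated so they sum to T."""
    epochs, k, used = [], 0, 0
    while used < T:
        length = min(2 ** k, T - used)
        epochs.append(length)
        used += length
        k += 1
    return epochs


def restarted_regret_bound(T, C=1.0):
    """Sum of per-epoch bounds C * sqrt(length) under restarts."""
    return sum(C * math.sqrt(n) for n in doubling_epochs(T))


# The geometric sum stays within a constant factor of sqrt(T).
FACTOR = math.sqrt(2) / (math.sqrt(2) - 1)
```

Summing the geometric series shows that the restarted bound is at most `FACTOR * C * sqrt(T)`, so the $\sqrt{T}$ rate survives without knowing $T$ in advance, at the price of a constant factor.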

Solving Bandit Problems under Direct Incentives. As a direct application of Theorem 2, we show that an $O(1/\varepsilon)$-learning procedure can be constructed for the two bandit problems under direct incentives, which thus admit online learning algorithms with $\widetilde{O}(\sqrt{T})$ regret. The construction of the efficient search algorithm essentially relies on a binary search for the cost of each arm. In addition, the binary search algorithm can be generalized to cases with infinitely many arms. Such a problem is known as contextual search, and recent work [38] has established clean solutions with nearly optimal performance. We defer the detailed construction and proofs to Appendices C.2 and C.3.
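The binary search for an arm's cost can be illustrated with a simulated agent. The sketch assumes, purely for illustration, that the agent has a zero-utility opt-out, that he takes the arm whenever the offered payment covers his cost, and that the cost lies in $[0,1]$; the function names are hypothetical and the actual procedure is the one in Appendix C.2.

```python
def make_agent(cost):
    """Simulated agent: takes the arm iff the payment covers his cost."""
    return lambda payment: payment >= cost


def search_cost(accepts, eps, lo=0.0, hi=1.0):
    """Binary search over payments.

    Maintains lo <= cost <= hi and returns (hi, rounds), where hi is a
    payment that induces the arm while overpaying by at most eps.
    """
    rounds = 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        rounds += 1
        if accepts(mid):
            hi = mid  # payment was enough: the cost is at most mid
        else:
            lo = mid  # payment was rejected: the cost exceeds mid
    return hi, rounds


payment, rounds = search_cost(make_agent(0.37), eps=1e-3)
```

Each query halves the uncertainty interval, so the number of rounds spent per arm grows only logarithmically in $1/\varepsilon$ in this idealized, noise-free setting.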

Corollary 2.1.

Multi-armed bandits and linear bandits under direct incentives have $\widetilde{\Theta}(\sqrt{T})$ regret.

Solving Contractual Bandit Problems under Moral Hazard. The construction of an efficient learning procedure is difficult in general contractual bandit learning. We instead start with sufficient knowledge of $P$ to construct an $O(1/\varepsilon)$-learning procedure under the following assumption. This assumption is motivated by practice, where the principal would ask the agent to provide a listing of the conditions he requires to perform different levels of service. The search problem is otherwise known to have an exponential sample complexity lower bound in Stackelberg games [43].

Assumption 2 (Preliminary Contracts).

For any $a\in\mathcal{A}$, the principal has the preliminary knowledge to construct a non-liable contract $x$ that induces the agent’s action $a$ with constant payment.

We defer the construction of this learning procedure and its proof to Appendix C.4. As a result, we can construct an explore-then-commit style algorithm with $\widetilde{O}(T^{2/3})$ regret for general contractual bandit learning, as stated in Corollary 2.2. Specifically, this algorithm induces the agent to take each action uniformly at random for $T^{2/3}$ rounds under Assumption 2. Then, given that the outcome distribution is estimated with error up to $T^{-1/3}$, it can efficiently estimate the difference of costs up to error $T^{-1/3}$ and thus construct a $T^{-1/3}$-optimal contract to induce the optimal action $a^{*}$ in the remaining rounds.
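The $T^{2/3}$ rate reflects the usual explore-then-commit balance. As a rough sketch, assume that the per-round regret is at most a constant $B$ and that $T_{0}$ exploration rounds yield estimation error $\widetilde{O}(T_{0}^{-1/2})$; both are simplifying assumptions, and problem-dependent constants are suppressed:

$$\operatorname{Reg}(T)\;\lesssim\;\underbrace{B\,T_{0}}_{\text{exploration}}+\underbrace{(T-T_{0})\cdot\widetilde{O}\big(T_{0}^{-1/2}\big)}_{\text{commitment}}\;\leq\;B\,T_{0}+\widetilde{O}\big(T\,T_{0}^{-1/2}\big).$$

Choosing $T_{0}=T^{2/3}$ equalizes the two terms, since $T\cdot T^{-1/3}=T^{2/3}$, which gives $\operatorname{Reg}(T)=\widetilde{O}(T^{2/3})$.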

Corollary 2.2.

Under Assumptions 1 and 2, $\widetilde{O}(T^{2/3})$ regret can be achieved for contractual bandit learning problems.

This result reveals the core challenge of learning the optimal contract under moral hazard. That is, constructing a contract that induces the optimal action, i.e., one satisfying $[P(a^{*})-P(a^{\prime})]\cdot x\geq c(a^{*})-c(a^{\prime}),\forall a^{\prime}\neq a^{*}$, already requires a sufficiently good estimate of $P$ for all actions (including the suboptimal ones). This observation raises the question of whether it is possible to learn $P(a^{\prime})$ without playing the costly suboptimal action $a^{\prime}$, which is the barrier to achieving $o(T^{2/3})$ regret. The answer turns out to be “Yes”, but with some catches. The solution is to implement a binary search procedure for the contract $x$ near the hyperplane formed by the linear system $[P(a)-P(a^{\prime})]\cdot x=c(a)-c(a^{\prime}),\forall a^{\prime}\neq a$. We want to solve for the parameters $c(a)-c(a^{\prime})$ and $P(a)-P(a^{\prime}),\forall a^{\prime}\neq a$ in the linear system with bounded errors using a number of contracts $x$ that almost satisfy the linear system. This is, however, impossible unless at least one set of parameters in the linear system is known, which ensures that the system has full rank.

Corollary 2.3.

Under Assumption 1 and with knowledge of the agent’s cost, $\widetilde{O}(T^{1/2})$ regret can be achieved for contractual bandit learning problems.

In Appendix D, we formally show that, knowing the agent’s cost, there is an efficient learning procedure for the unknown parameters $P(a)-P(a^{\prime}),\forall a^{\prime}\neq a$ with small errors under mild assumptions. This allows us to attain $\widetilde{O}(\sqrt{T})$ regret for the general contractual bandit problem, and we showcase its application in designing contractual RL algorithms in the next section. Since the design and analysis of the learning procedure are highly technical, we also demonstrate the high-level idea on a simplified instance in Example 1 of Appendix D. More generally, we expect that a similar learning procedure exists if we alternatively assume there is some predictive state $s_{0}$ such that the principal knows $P(s_{0}|a),\forall a\in\mathcal{A}$, since this would also eliminate one extra degree of freedom in the linear system above.

4 The Complexity of Contractual Reinforcement Learning

If we treat each stationary policy in contractual RL as an arm and its induced visitation measure (see its formal definition in Appendix E.1) as an outcome in the contractual bandit problem, the generic algorithm from Section 3.2 already provides a $\widetilde{O}(T^{2/3})$ regret bound. However, the computational and statistical complexity of both Algorithms 2 and 3 has polynomial dependence on the size of the action space, which has become exponential as $|\Pi|=(SA)^{H}$. Moreover, as pointed out above, it requires uniformly good knowledge of the transition kernel $P$ to construct the near-optimal contract policy under moral hazard. In this section, we provide an improved analysis for the complexity of contractual reinforcement learning, given that the agent’s cost function $\{c_{h}\}_{h=1}^{H}$ is known initially. This assumption allows us to leverage the learning procedure designed in the last section to efficiently learn the parameters $\mu_{h}(s,a,a^{\prime}):=P_{h}(s,a)-P_{h}(s,a^{\prime})$ for all $h\in[H],s\in{\mathcal{S}},a,a^{\prime}\in\mathcal{A}$.

Algorithm 1 Contractual RL with Warm Start

Input: State and action sets ${\mathcal{S}},\mathcal{A}$, number of steps $H$, episodes $T$, solver ${\mathscr{A}}$

Run the $\chi(\varepsilon)$-learning procedure in Algorithm 5 for $T_{1}$ rounds and obtain estimates $\{\widehat{\mu}_{h}\}_{h\in[H]}$

Initialize empirical estimate of parameters $\{\widehat{P}^{1}_{h},\widehat{r}^{1}_{h},b^{1}_{h}\}_{h\in[H]}$

for $t=1\dots T-T_{1}$ do

    Solve $\bm{x}^{t},\bm{\pi}^{t}$ from the subroutine ${\mathscr{A}}$ using the parameters $\{\widehat{P}^{t}_{h},\widehat{r}^{t}_{h},b^{t}_{h},\widehat{\mu}_{h}\}_{h\in[H]}$

    Execute the policy $\bm{x}^{t}$ and observe the trajectory $\{(s_{h}^{t},a_{h}^{t},r_{h}^{t})\}_{h\in[H]}$

    Update the empirical estimate of parameters $\{\widehat{P}^{t}_{h},\widehat{r}^{t}_{h},b^{t}_{h},\widehat{\mu}_{h}\}_{h\in[H]}$

end for

We sketch the no-regret learning algorithm in contractual RL in Algorithm 1, which cuts the number of episodes $T$ into two phases and can be improved to be agnostic to $T$ with the doubling trick. It begins by running the $\chi(\varepsilon)$-learning procedure to efficiently obtain the estimated parameter $\widehat{\mu}$ for the construction of a robust contract policy. Then, the algorithm uses a solver to determine the robust contract policy $\bm{x}^{t}$ that induces an optimistic action policy $\bm{\pi}^{t}$ with almost optimal payment. In Theorem 3, we state the complexity results under two different solvers that work under different technical assumptions and provide different trade-offs in statistical and computational complexity. Here, $\kappa,\lambda_{s}$ in the regret bound are constants in the regularity assumptions, and we omit $\log T$ terms from learning $\mu_{h}$, though the effect of these constants can be canceled out only for sufficiently large $T$; we defer the details to Appendix E. Below we zoom into the construction of each component.
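The two-phase control flow of Algorithm 1 can be summarized by the following skeleton. Every callable passed in (`learn_mu`, `init_estimates`, `solve`, `execute`, `update`) is a hypothetical placeholder for the corresponding component described in Appendix E; the sketch only fixes the order in which the components interact and makes no claim about their internals.

```python
def contractual_rl_warm_start(T, T1, learn_mu, init_estimates,
                              solve, execute, update):
    """Skeleton of a warm-start loop; all callables are placeholders."""
    # Phase 1: run the chi(eps)-learning procedure for T1 rounds.
    mu_hat = learn_mu(T1)

    # Phase 2: optimistic planning with robust contracts for T - T1 episodes.
    estimates = init_estimates()
    log = []
    for t in range(1, T - T1 + 1):
        x_t, pi_t = solve(estimates, mu_hat)       # contract and action policy
        trajectory = execute(x_t)                  # agent responds, data observed
        estimates = update(estimates, trajectory)  # refresh empirical estimates
        log.append((t, x_t, pi_t))
    return log


# Toy stand-ins that only exercise the control flow.
demo_log = contractual_rl_warm_start(
    T=10, T1=3,
    learn_mu=lambda rounds: "mu_hat",
    init_estimates=lambda: 0,
    solve=lambda est, mu: (("x", est), ("pi", mu)),
    execute=lambda x: [x],
    update=lambda est, traj: est + len(traj),
)
```

Keeping the solver as a parameter mirrors the statement of Theorem 3, where the same outer loop is analyzed with two different solvers.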

Theorem 3.

With high probability, Algorithm 1 has $\widetilde{O}\left((SA^{-1/2}+\kappa^{-1/2})H^{2}\sqrt{T}\right)$ regret using the solver in Algorithm 6 and $\widetilde{O}\left((H^{2}SA^{-1/2}+\eta\lambda^{1-H}_{s}\kappa^{-1/2})\sqrt{T}\right)$ regret using the solver in Algorithm 7 in contractual RL under mild assumptions.

$\chi(\varepsilon)$-Learning Procedure in Contractual RL. One challenge in the construction is the need to separate the stepwise interference among $\{x_{h}\}_{h=1}^{H}$. Otherwise, the actual response space for the agent is $(SA)^{H}$, which is unacceptable even for doing binary search. Our solution is based on the observation that if we fix $x_{h+1},\dots,x_{H}$ and tune $x_{h}$ only, the agent’s expected profits $U_{h+1}^{\bm{x}},\dots,U_{H}^{\bm{x}}$ remain unchanged. This allows us to set $x_{h}$ without influencing the agent’s action policy for steps $h+1,\dots,H$. Another key challenge in constructing the oracle in the MDP setting is to guarantee sufficient visitation measure over each state at step $h$. To maximize the visitation measure of a particular state $s$ at step $h$, we let $x_{h}(s,\cdot)$ have nonzero values such that the agent has a strong incentive to maximize his visitation measure over state $s$ at step $h$. To simplify our analysis, we assume that the maximal visitation measure at each state $s\in{\mathcal{S}}$ and at each step $h\in[H]$ is bounded below, though we expect this assumption can be relaxed via a more careful analysis, since rarely visited states contribute little to the estimation of the cumulative utility. Lastly, the task of setting $x_{h}(s,\cdot)$ is solved in the bandit learning setup under the techniques and assumptions specified in Appendix D. See Appendix E.2 for the formal proof and detailed construction of the learning procedure.

Solving for Optimistic and Robust Contract Policies. We show two different solvers for the optimistic contract with bounded suboptimality using the estimated parameters. Their basic idea is the same, which is to include an additional bonus for optimism and a margin for robustness. However, it turns out that each of them ensures either statistical or computational efficiency but not both, leaving an intriguing open question on the existence of a best-of-both-worlds solver. For the solver in Algorithm 6, we directly solve for the optimal contract policy according to LP (2.1) with an additional bonus and margin for the entire policy. For the solver in Algorithm 7, we employ value iteration from the Bellman equation (2.4) with a bonus and margin set at every step. Both solvers require an inducibility assumption similar to Assumption 1 in the contractual bandit learning problem. However, the computationally efficient solver requires the inducibility assumption to hold at every step, whereas the statistically efficient solver only requires the inducibility assumption to hold at the trajectory level. We defer their detailed construction and proofs to Appendices E.3 and E.4.

5 Conclusion

In this paper, we propose the study of contractual reinforcement learning problems in which the principal learns to influence the agent’s policy by adaptively designing contracts that are contingent on the state realization. The principal must not only balance the tradeoff between her payments and rewards from the agent’s policy, but also incentivize the agent’s exploration for her learning in an unknown environment. Our primary approach is to decouple this general problem into a standard online learning problem and a hyperplane search problem. This enables a clean analysis of the no-regret learning guarantee under several variants of technical assumptions. Meanwhile, several technical gaps remain for future work, including a tighter analysis under relaxed assumptions and the general setup where the agent adaptively improves his policy. We believe this model forms a natural theoretical basis for the agency problem in today’s large scale machine learning tasks, where the economic incentives of users, creators, and service providers stand in conflict with the Internet platform’s long-term objective. More generally, it sheds light on the emergent problems of AI alignment from the perspective of steering AI behaviors through reward-shaping in its training environment. We hope this work will motivate new avenues for developing robust, incentive-compatible frameworks that align diverse stakeholder interests in complex digital ecosystems.

References

[1] Creator earnings report breakdown, where are we in the creator economy? https://neoreach.com/creator-earnings/. Accessed: 2024-05-18.
[2] The creator economy. https://www.goldmansachs.com/intelligence/pages/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027.html. Accessed: 2024-05-18.
[3] Tiktok lite, a new app quietly released in france that rewards screen time. https://www.lemonde.fr/en/pixels/article/2024/04/13/tiktok-lite-a-new-app-quietly-released-in-france-that-rewards-screen-time_6668286_13.html. Accessed: 2024-05-18.
[4] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
[5] Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, pages 10–4, 2019.
[6] Rajeev Agrawal. The continuum-armed bandit problem. SIAM journal on control and optimization, 33(6):1926–1951, 1995.
[7] Shipra Agrawal and Nikhil Devanur. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29, 2016.
[8] Shipra Agrawal and Nikhil R Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 989–1006, 2014.
[9] Tal Alon, Paul Dütting, and Inbal Talgam-Cohen. Contracts with private cost per unit-of-effort. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 52–69, 2021.
[10] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. Journal of the ACM (JACM), 65(3):1–55, 2018.
[11] Gal Bahar, Omer Ben-Porat, Kevin Leyton-Brown, and Moshe Tennenholtz. Fiduciary bandits. In International Conference on Machine Learning, pages 518–527. PMLR, 2020.
[12] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Commitment without regrets: Online learning in stackelberg security games. In Proceedings of the sixteenth ACM conference on economics and computation, pages 61–78, 2015.
[13] Curtis Bechtel, Shaddin Dughmi, and Neel Patel. Delegated pandora’s box. arXiv preprint arXiv:2202.10382, 2022.
[14] Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Francesco Trovò, and Nicola Gatti. Optimal rates and efficient algorithms for online bayesian persuasion. In International Conference on Machine Learning, pages 2164–2183. PMLR, 2023.
[15] Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.
[16] Hemant K Bhargava. The creator economy: Managing ecosystem supply, revenue sharing, and platform design. Management Science, 68(7):5233–5251, 2022.
[17] Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. Online learning in online auctions. Theoretical Computer Science, 324(2-3):137–146, 2004.
[18] Mark Braverman, Jieming Mao, Jon Schneider, and S Matthew Weinberg. Multi-armed bandit problems with strategic arms. In Conference on Learning Theory, pages 383–416. PMLR, 2019.
[19] Federico Cacciamani, Matteo Castiglioni, and Nicola Gatti. Online information acquisition: Hiring multiple agents. arXiv preprint arXiv:2307.06210, 2023.
[20] Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. Designing menus of contracts efficiently: The power of randomization. arXiv preprint arXiv:2202.10966, 2022.
[21] Ilgin Dogan, Zuo-Jun Max Shen, and Anil Aswani. Estimating and incentivizing imperfect-knowledge agents with hidden rewards. arXiv preprint arXiv:2308.06717, 2023.
[22] Ilgin Dogan, Zuo-Jun Max Shen, and Anil Aswani. Repeated principal-agent games with unobserved agent rewards and perfect-knowledge agents. arXiv preprint arXiv:2304.07407, 2023.
[23] Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. Journal of the ACM (JACM), 67(5):1–57, 2020.
[24] Paul Dütting, Tim Roughgarden, and Inbal Talgam-Cohen. Simple versus optimal contracts. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 369–387, 2019.
[25] Richard Florida. The rise of the creator economy. 2022.
[26] Peter Frazier, David Kempe, Jon Kleinberg, and Robert Kleinberg. Incentivizing exploration. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 5–22, 2014.
[27] Sanford J Grossman and Oliver D Hart. An analysis of the principal-agent problem. In Foundations of insurance economics, pages 302–340. Springer, 1992.
[28] Guru Guruganesh, Jon Schneider, and Joshua R Wang. Contracts under moral hazard and adverse selection. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 563–582, 2021.
[29] Chien-Ju Ho, Aleksandrs Slivkins, and Jennifer Wortman Vaughan. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. Journal of Artificial Intelligence Research, 55:317–359, 2016.
[30] Nicole Immorlica, Karthik Sankararaman, Robert Schapire, and Aleksandrs Slivkins. Adversarial bandits with knapsacks. Journal of the ACM, 69(6):1–47, 2022.
[31] Nicole Immorlica, Meena Jagadeesan, and Brendan Lucier. Clickbait vs. quality: How engagement-based optimization shapes the content landscape in online platforms. arXiv preprint arXiv:2401.09804, 2024.
[32] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
[33] Jon Kleinberg and Robert Kleinberg. Delegated search approximates efficient search. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 287–302, 2018.
[34] Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 594–605. IEEE, 2003.
[35] Jean-Jacques Laffont and David Martimort. The theory of incentives. In The Theory of Incentives. Princeton university press, 2009.
[36] Renato Paes Leme and Jon Schneider. Contextual search via intrinsic volumes. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 268–282. IEEE, 2018.
[37] Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In International symposium on algorithmic game theory, pages 250–262. Springer, 2009.
[38] Allen Liu, Renato Paes Leme, and Jon Schneider. Optimal contextual pricing and extensions. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1059–1078. SIAM, 2021.
[39] Ilan Lobel, Renato Paes Leme, and Adrian Vladu. Multidimensional binary search for contextual decision-making. Operations Research, 66(5):1346–1361, 2018.
[40] Yishay Mansour, Aleksandrs Slivkins, and Vasilis Syrgkanis. Bayesian incentive-compatible bandit exploration. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 565–582, 2015.
[41] Jieming Mao, Renato Leme, and Jon Schneider. Contextual pricing for lipschitz buyers. Advances in Neural Information Processing Systems, 31, 2018.
[42] Colin McDiarmid et al. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.
[43] Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2149–2156, 2019.
[44] Lillian J Ratliff, Shreyas Sekar, Liyuan Zheng, and Tanner Fiez. Incentives in the dark: multi-armed bandits for evolving users with unknown type. arXiv preprint arXiv:1803.04008, 55, 2018.
[45] Eden Saig, Inbal Talgam-Cohen, and Nir Rosenfeld. Delegated classification. Advances in Neural Information Processing Systems, 36, 2024.
[46] Antoine Scheid, Daniil Tiapkin, Etienne Boursier, Aymeric Capitaine, El Mahdi El Mhamdi, Éric Moulines, Michael I Jordan, and Alain Durmus. Incentivized learning in principal-agent bandit games. arXiv preprint arXiv:2403.03811, 2024.
[47] Virag Shah, Ramesh Johari, and Jose Blanchet. Semi-parametric dynamic contextual pricing. Advances in Neural Information Processing Systems, 32, 2019.
[48] Stephen A Smith. Contract theory. OUP Oxford, 2004.
[49] Long Tran-Thanh, Archie Chapman, Enrique Munoz De Cote, Alex Rogers, and Nicholas R Jennings. Epsilon–first policies for budget–limited multi-armed bandits. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[50] Long Tran-Thanh, Archie Chapman, Alex Rogers, and Nicholas Jennings. Knapsack based optimal policies for budget–limited multi–armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 1134–1140, 2012.
[51] Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael I Jordan, and Haifeng Xu. Sequential information design: Markov persuasion process and its efficient reinforcement learning. arXiv preprint arXiv:2202.10678, 2022.
[52] Yingce Xia, Haifang Li, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Thompson sampling for budgeted multi-armed bandits. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[53] Yingce Xia, Tao Qin, Weidong Ma, Nenghai Yu, and Tie-Yan Liu. Budgeted multi-armed bandits with multiple plays. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2210–2216, 2016.
[54] Fan Yao, Chuanhao Li, Denis Nekipelov, Hongning Wang, and Haifeng Xu. How bad is top-$k$ recommendation under competing content creators? In International Conference on Machine Learning, pages 39674–39701. PMLR, 2023.
[55] Fan Yao, Chuanhao Li, Karthik Abinav Sankararaman, Yiming Liao, Yan Zhu, Qifan Wang, Hongning Wang, and Haifeng Xu. Rethinking incentives in recommender systems: Are monotone rewards always beneficial? Advances in Neural Information Processing Systems, 36, 2024.
[56] Geng Zhao, Banghua Zhu, Jiantao Jiao, and Michael Jordan. Online learning in stackelberg games with an omniscient follower. In International Conference on Machine Learning, pages 42304–42316. PMLR, 2023.
[57] Banghua Zhu, Stephen Bates, Zhuoran Yang, Yixin Wang, Jiantao Jiao, and Michael I Jordan. The sample complexity of online contract design. arXiv preprint arXiv:2211.05732, 2022.
[58] Banghua Zhu, Sai Praneeth Karimireddy, Jiantao Jiao, and Michael I Jordan. Online learning in a creator economy. arXiv preprint arXiv:2305.11381, 2023.
[59] Shiliang Zuo. New perspectives in online contract design: Heterogeneous, homogeneous, non-myopic agents and team production. arXiv preprint arXiv:2403.07143, 2024.

Appendix A Further Discussion on Related Work

Contract Design.

Contract theory has been a crucial branch of economics [27, 48, 35]. Driven by an accelerating trend of contract-based markets deployed in Internet-based applications, the contract design problem has recently started to receive surging interest, especially from the computer science community [24, 28, 9, 20]. The principal-agent model has also been applied to the delegation of online search problems [13, 33] and machine learning tasks [45]. While these works focus on the computational aspects of contract design, our work aims to adaptively design the optimal contract between learners and decision makers in an initially unknown environment.

Dynamic Pricing.

Our model is related to the dynamic (contextual) pricing problems [34, 41, 47, 39, 36], where a seller learns to post a price on a single item for a sequence of buyers with a fixed cost (possibly under different contexts). In particular, they can be viewed as special cases of contractual reinforcement learning, where the contract is contingent on the agent’s binary action and the principal already knows her reward function. As we will see in Section 3, our algorithm is able to borrow some design insights from these pricing problems. Nonetheless, our learning algorithm deals with more involved situations, where the agent has multiple actions (e.g., a list of items to buy) whose rewards to the principal are unknown, and the contract is not necessarily contingent on the agent’s actions but on their outcomes. As such, it is possible to achieve constant regret in these pricing problems, whereas the regret lower bound of contractual reinforcement learning is $\Omega(\sqrt{T})$.

Online Contract Design.

The problem begins as a variant of dynamic pricing in Kleinberg and Leighton [34], where the agent’s cost is stochastically (or adversarially) chosen, and the regret bound is $\Theta(\sqrt{T})$ (or $\Theta(T^{2/3})$ in the adversarial setup). Ho et al. [29] and Zhu et al. [57] consider a generalized model where the agent has multiple (instead of binary) actions, and both the cost and reward of his actions are determined by the agent’s Bayesian type, which is unknown to the learner. These problems can be viewed as a continuum-armed bandit problem [6], except that the principal’s utility is not continuous. Zhu et al. [57] show an almost tight, nearly linear regret bound of $\widetilde{\Theta}(T^{1-K/|{\mathcal{S}}|})$ for this problem, for some constant $K$ and the number of outcomes $|{\mathcal{S}}|$. On top of this model, Zhu et al. [58] consider the joint online optimization problem of contract and recommendation policy in the context of the creator economy. Zuo [59] assumes a smoothness condition and presents a direct reduction to the standard Lipschitz bandits problem. In comparison, our learning problem is closer to the standard contract design model, in which the agent type is observable by the principal (captured by the initial state or context), as many platforms hold a good amount of data on their users and content creators. More importantly, this modeling choice allows us to focus on solving the key challenges of learning and planning the optimal contract under moral hazard, where we are able to achieve $\widetilde{O}(\sqrt{T})$ regret for a large class of problems and $\widetilde{O}(T^{2/3})$ in general under mild assumptions. Meanwhile, several recent works [22, 21, 46] consider a simple special case of our problem, where there is no Markov state transition and the principal can directly incentivize the agent to take a certain action without the barrier of moral hazard.

Online Learning with Incentive Constraints.

Incentive design problems have been studied in online learning in several different ways. One line of work, known as incentivized exploration [26, 40], considers situations where the principal recommends that the agents pull different arms and the recommendation policy must be incentive compatible for the agents in a Bayesian sense w.r.t. each agent’s prior over arm rewards. Bahar et al. [11] consider the fiduciary bandits problem, where a slightly stronger constraint of individual rationality is introduced. Our model is different from these works in that the principal uses monetary incentives (contracts) instead of an information advantage to influence the agents’ decisions. Another line of work, known as budgeted bandits [49, 50, 52, 53], and more generally, bandits with knapsacks [10, 8, 7, 30], models the intrinsic cost of arm selection. The cost only affects the learner’s choices due to the limited budget, whereas the learner (principal) in our multi-agent decision making process needs to properly reimburse the agent’s (opportunity) cost in order to influence the agent’s arm choices. Ratliff et al. [44] consider the multi-armed bandit problem where the reward distribution (impacted by user types) shifts according to the history of arm selection. Braverman et al. [18] model each bandit arm as a self-interested agent that keeps part of the reward from the principal to strategically maximize his long-term utility. Besides the online contract design problem, there is also a rich line of literature on online learning problems under Stackelberg games, information design and auction design setups [17, 12, 23, 51, 56, 14, 19].

Appendix B Omitted Content in Section 2

B.1 Notations and Illustrations

We use the notation $[n]$ for the set $\{1,2,\dots,n\}$. We use $\Delta({\mathcal{S}})$ to denote the simplex over the discrete set ${\mathcal{S}}$. For a probability distribution $P\in\Delta({\mathcal{S}})$, we use $P(s)$ to denote the measure of $s\in{\mathcal{S}}$ under $P$. We use $\mathop{maxarg},\mathop{minarg}$ as operators on an optimization problem that return the optimal objective value followed by its optimal solution, e.g., $0,a=\mathop{minarg}_{x}\left\lVert x-a\right\rVert^{2}$.

We will interchangeably treat a function $f:\mathcal{X}\to\mathcal{Y}$ as a vector in $\mathcal{Y}^{|\mathcal{X}|}$. As such, we denote the inner product $f\cdot g:=\sum_{x\in\mathcal{X}}f(x)g(x)$ for $f,g:\mathcal{X}\to\mathcal{Y}$, or $\langle f,g\rangle_{\mathcal{X}\times\mathcal{Y}}:=\sum_{x\in\mathcal{X},y\in\mathcal{Y}}f(x,y)g(x,y)$ for $f,g:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$. Denote their outer product as $f\otimes g$. Denote $\left\lVert f\right\rVert_{\ell,\infty}:=\sup_{x\in\mathcal{X}}\left\lVert f(x)\right\rVert_{\ell}$ for $f:\mathcal{X}\to\mathcal{Y}$. In addition, for a function $f:\mathcal{X}\times\mathcal{Y}\to\mathcal{Z}$, we use $f(x)\in\mathcal{Z}^{\mathcal{Y}}$ to denote the partial function $f(x,\cdot)$. For a conditional probability $P:\mathcal{X}\to\Delta(\mathcal{Y})$, we denote by $P(y|x)$ the measure of $y\in\mathcal{Y}$ given $x\in\mathcal{X}$.

Table 2: A table of notations in the contractual reinforcement learning problem

Symbols	Interpretations
${\mathcal{S}},\mathcal{A}$	state, action space
$P_{h}:{\mathcal{S}}\times\mathcal{A}\to\Delta({\mathcal{S}}),P_{0}\in\Delta({\mathcal{S}})$	transition kernel, initial state distribution
$\iota_{h}:{\mathcal{S}}\times{\mathcal{S}}\to\mathbb{R}_{+}$	noisy reward function at $h$ -th step
$r_{h}:{\mathcal{S}}\times\mathcal{A}\to[0,1]$	expected reward function at $h$ -th step
$c_{h}:{\mathcal{S}}\times\mathcal{A}\to[0,1]$	cost function at $h$ -th step
$\bm{x}=\{x_{h}:{\mathcal{S}}\times{\mathcal{S}}\to\mathbb{R}_{+}\}_{h=1}^{H}$	contract policy
$\bm{\pi}=\{\pi_{h}:{\mathcal{S}}\to\Delta(\mathcal{A})\}_{h=1}^{H}$	action policy
$\Pi$	action policy space
$\mathcal{X}$	contract policy space
$V_{h}^{\bm{x},\bm{\pi}},V_{h}^{\bm{x}},V_{h}^{\bm{\pi}},V_{h}^{*}:{\mathcal{S}}\to\mathbb{R}_{+}$	principal’s state value function at $h$ -th step
$U_{h}^{\bm{x},\bm{\pi}},U_{h}^{\bm{x}},U_{h}^{\bm{\pi}},U_{h}^{*}:{\mathcal{S}}\to\mathbb{R}_{+}$	agent’s state value function at $h$ -th step
$\rho_{h}^{\bm{\pi}}\in\Delta({\mathcal{S}})$	state visitation measure at $h$ -th step
$\zeta_{h}^{\bm{\pi}}:\mathcal{A}\to\mathbb{R}_{+}$	least payment function at $h$ -th step
$Q_{h}^{\bm{\pi}}:{\mathcal{S}}\times\mathcal{A}\to\mathbb{R}_{+}$	principal’s state-action value function at $h$ -th step
$W_{h}^{\bm{\pi}}:{\mathcal{S}}\times\mathcal{A}\to\mathbb{R}_{+}$	agent’s state-action value function at $h$ -th step

Figure 2: An illustration of the interaction procedure in the principal-agent Markov decision process.

B.2 Discussion on the Modeling Choices.

We make a few remarks on the procedure of the PAMDP.

1.

It is without loss of generality for the principal to commit to his contract policy $\bm{x}$ at the very beginning of each episode: this MDP setup can be viewed as an extensive-form game in which the principal, as the first mover, can predict the agent’s response and plan his follow-up moves accordingly. Once the principal commits to his contract policy, the agent can also determine his optimal action policy in response.
2.

We assume that the Markovian state, once realized, is publicly observable by both the agent and the principal, serving as the natural condition and contingency for contract design. Hence, our model directly uses the state transition kernel, $P_{h}:{\mathcal{S}}\times\mathcal{A}\to\Delta({\mathcal{S}})$, as the outcome distribution in contract design problems. Otherwise, if either the agent or the principal only partially observes the state, the planning problem is known to be intractable [32], and we leave this open question for future work. A more subtle caveat is that, unlike in the standard episodic MDP, the transition kernel in the last step $P_{H}$ matters, as it influences the principal’s design of the contract.
3.

The principal’s noisy reward $\iota_{h}(s_{h},s_{h+1})$ is set to be conditionally independent of the agent’s action $a_{h}$, given the next state $s_{h+1}$. This is necessary for a subtle modeling reason: as we will see in the next section, the contract design problem would become much easier if the principal could condition its payment directly on the agent’s action (i.e., without the concern of moral hazard). Note that since the reward itself can be modeled as a part of the state, the existence of such $\iota_{h}$ is without loss of generality, and there is no need to assume additional zero-mean noise on top of $\iota_{h}$.
4.

We assume the principal is able to observe the agent’s action once the payment is transferred. This is well-motivated in practice. For example, a content platform may ask creators to fill in a survey on the amount of time they spent creating their content; the creators have no incentive to misreport this information, as long as their payment is independent of the answers. In the more general setup, the principal is only able to observe a probabilistic signal of the agent taking some action $a$ (e.g., from the realization of the next state and knowledge of the transition kernel). For the convenience of analysis, we omit the additional steps for the principal to infer the agent’s decision up to a sufficient level of confidence by repeating the same contract policy, though this could introduce an additional factor of $H$ into the sample complexity, depending on the mixing ratio. We leave the tight analysis to future work.
5.

It is without loss of generality to assume that the rational agent always has the incentive to participate in the PAMDP. This is because enforcing the additional constraint that the agent’s utility must be non-negative under the principal’s optimal contract is equivalent to adding an “idle” action $a_{0}$ to the existing action set $\mathcal{A}$ with $r_{h}(s,a_{0})=c_{h}(s,a_{0})=0,\forall s\in{\mathcal{S}},h\in[H]$ , which allows our analysis to ignore the agent’s non-negative utility (individual rationality) constraint.

B.3 Least-Payment Bellman Equations in PAMDP

With the correspondence between $\bm{x}$ and $\bm{\pi}^{\bm{x}}$ , a natural next step is to fix $\bm{\pi}$ and find the contract policy

\bm{x}^{\bm{\pi}}=\mathop{argmax}_{\bm{x}\in\mathcal{X}}V^{\bm{x},\bm{\pi}}% \quad\text{s.t.}\quad\bm{\pi}=\mathop{argmax}_{\bm{\pi}\in\Pi}U^{\bm{x},\bm{% \pi}},

which attains the maximal value among all contract policies to which the agent would optimally respond with the action policy $\bm{\pi}$. Since the total expected reward of the principal is fixed under $\bm{\pi}$, the objective of the optimization problem can be equivalently rewritten as minimizing the principal’s total payment,

\bm{x}^{\bm{\pi}},\zeta^{\bm{\pi}}=\mathop{minarg}_{\bm{x}\in\mathcal{X}}\mathop{\mathbf{E}}\big{[}\sum_{h=1}^{H}x_{h}(s_{h},s_{h+1})\big{|}\{\pi_{h}\}_{h=1}^{H}\big{]}\quad\text{s.t.}\quad\bm{\pi}=\mathop{argmax}_{\bm{\pi}\in\Pi}U^{\bm{x},\bm{\pi}}.

Recall that $\zeta_{h}^{\bm{\pi}}(s)$ denotes the least amount of expected payment to induce an action policy $\bm{\pi}$ at the state $s$ from the $h$ -th step. Meanwhile, since $U^{\bm{\pi}}_{h}(s)=P_{h}(s,\pi_{h}(s))\cdot[x+U^{\bm{\pi}}_{h+1}]-c_{h}(s,\pi% _{h}(s))$ , the above constraint can be equivalently rewritten as a set of constraints in an iterative form,

\pi_{h}=\mathop{argmax}_{a\in\mathcal{A}}P_{h}(s,a)\cdot[x+U^{\bm{\pi}}_{h+1}]% -c_{h}(s,a),\quad\forall h\in[H].

Therefore, such a contract policy $\bm{x}^{\bm{\pi}}$ can be computed iteratively by backward induction from $h=H$ to $1$, starting with $U^{\bm{\pi}}_{H+1}(s)=\zeta^{\bm{\pi}}_{H+1}(s)=V^{\bm{\pi}}_{H+1}(s)=0$, $\forall s\in{\mathcal{S}}$,

$\displaystyle W^{\bm{\pi}}_{h}(s,a;x)$	$\displaystyle=P_{h}(s,a)\cdot[x+U^{\bm{\pi}}_{h+1}]-c_{h}(s,a),$	(B.1)
$\displaystyle x^{\bm{\pi}}_{h}(s)$	$\displaystyle=\mathop{argmin}_{x:{\mathcal{S}}\to\mathbb{R}_{+}}\{P_{h}(s,\pi_{h}(s))\cdot x\ \big|\ W^{\bm{\pi}}_{h}(s,\pi_{h}(s);x)\geq W^{\bm{\pi}}_{h}(s,a^{\prime};x),\forall a^{\prime}\in\mathcal{A}\},$
$\displaystyle U^{\bm{\pi}}_{h}(s)$	$\displaystyle=W^{\bm{\pi}}_{h}(s,\pi_{h}(s);x_{h}^{\bm{\pi}}(s)),$
$\displaystyle\zeta^{\bm{\pi}}_{h}(s)$	$\displaystyle=P_{h}(s,\pi_{h}(s))\cdot[x_{h}^{\bm{\pi}}(s)+\zeta_{h+1}^{\bm{\pi}}],$
$\displaystyle V_{h}^{\bm{\pi}}(s)$	$\displaystyle=r_{h}(s,\pi_{h}(s))+P_{h}(s,\pi_{h}(s))\cdot[V^{\bm{\pi}}_{h+1}-x^{\bm{\pi}}_{h}(s)],$

where the functions $\zeta^{\bm{\pi}}_{h}(s),V_{h}^{\bm{\pi}}(s)$ are computed as by-products of the value iteration. We refer to Equation (B.1) as the least-payment Bellman equation.
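The agent-side part of Equation (B.1) can be checked numerically. The following is a minimal Python sketch, assuming a single state with two outcomes and two actions; the helper names (`agent_values`, `induces`) and all numbers are hypothetical and chosen only for illustration. It evaluates $W_{h}(s,a;x)$ for every action and tests whether a given contract makes a target action a (weak) best response.

```python
from typing import List, Sequence


def dot(p: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(pi * vi for pi, vi in zip(p, v))


def agent_values(P_s, c_s, x, U_next) -> List[float]:
    """W_h(s, a; x) = P_h(s, a) . [x + U_{h+1}] - c_h(s, a), for every action a."""
    continuation = [xi + ui for xi, ui in zip(x, U_next)]
    return [dot(P_s[a], continuation) - c_s[a] for a in range(len(P_s))]


def induces(P_s, c_s, x, U_next, target: int, tol: float = 1e-9) -> bool:
    """True if `target` is a (weak) best response to contract x at this state."""
    W = agent_values(P_s, c_s, x, U_next)
    return all(W[target] >= w - tol for w in W)


# Hypothetical instance: outcomes are (good, bad); action 0 = idle, action 1 = effort.
P_s = [[0.3, 0.7], [0.8, 0.2]]  # outcome distribution of each action
c_s = [0.0, 0.2]                # cost of each action

# The same contract, evaluated without and with a continuation value on "good".
myopic = induces(P_s, c_s, x=[0.3, 0.0], U_next=[0.0, 0.0], target=1)
farsighted = induces(P_s, c_s, x=[0.3, 0.0], U_next=[0.1, 0.0], target=1)
expected_payment = dot(P_s[1], [0.3, 0.0])
```

In this instance the incentive constraint for effort reads $0.5\,(x(\text{good})-x(\text{bad})+U_{h+1}(\text{good})-U_{h+1}(\text{bad}))\geq 0.2$, so the bonus of $0.3$ fails with zero continuation value but suffices once the continuation gap is $0.1$. This is why the backward induction needs $U^{\bm{\pi}}_{h+1}$ before solving for the contract at step $h$.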

B.4 Bellman Optimality Equations in PAMDP

Proof of Theorem 1.

We begin by giving an interpretation for each variable in the Bellman equation. With slight abuse of notation, $x^{*}_{h}(s;a)$ denotes the contract with the least payment to induce the agent to take action $a$ at state $s$ in step $h$. Given that $\pi^{*}_{h}(s)$ is the best agent action for the principal to induce, the optimal contract at state $s$ in step $h$ can be determined as $x^{*}_{h}(s)=x^{*}_{h}(s;\pi^{*}_{h}(s))$. $Q_{h}^{*}(s,a)$ and $W_{h}^{*}(s,a;x)$ are respectively the principal’s and agent’s total expected utility from the $h$-th step under policies $\{x^{*}_{\tau}\}_{\tau=h+1}^{H}$ and $\{\pi^{*}_{\tau}\}_{\tau=h+1}^{H}$, which can be interpreted as their optimal state-action value functions at the $h$-th step, serving as intermediate variables for the computation. We now prove the optimality of the solution $\bm{x}^{*},V^{*}$ via induction:

For the base case, observe that planning for any state $s$ at the last step $H$ is reduced to a standard contract design problem and the optimal contract can be determined by solving the following linear program, $\forall a\in\mathcal{A}$ ,

	$\displaystyle Q_{H}^{*}(s,a),x^{*}_{H}(s;a)$	$\displaystyle=\mathop{maxarg}_{x:{\mathcal{S}}\to\mathbb{R}_{+}}r_{H}(s,a)-P_{H}(s,a)\cdot x$
	s.t.	$\displaystyle\quad P_{H}(s,a)\cdot x-c_{H}(s,a)\geq P_{H}(s,a^{\prime})\cdot x-c_{H}(s,a^{\prime}),\forall a^{\prime}\neq a,$

where $x^{*}_{H}(s;a)$ is the least payment contract to induce the agent to take action $a$ in the last step and $Q_{H}^{*}(s,a)$ is the principal’s expected utility under $x^{*}_{H}(s;a)$. In Equation 2.4, we omit the term $r_{H}(s,a)$, since it is a constant once the action is fixed. Hence, $V_{H}^{*}(s),a^{*}=\mathop{maxarg}_{a\in\mathcal{A}}Q^{*}_{H}(s,a)$ determines the best action $\pi^{*}_{H}(s)=a^{*}$ for the principal to induce and $V_{H}^{*}(s)$ is the optimal state value function at the $H$-th step. The principal’s optimal contract can be determined as $x^{*}_{H}(s)=x^{*}_{H}(s;a^{*})$. The agent’s value is $U_{H}^{*}(s)=P_{H}(s,a^{*})\cdot x^{*}_{H}(s)-c_{H}(s,a^{*})$ based on his best response $a^{*}$ under $x^{*}_{H}(s)$.
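As a sanity check of the base case, the following worked instance solves the last-step linear program by hand. It is only an illustration under assumed numbers: a single state $s$ with two outcomes $(s_{g},s_{b})$, an idle action $a_{0}$ and an effort action $a_{1}$, none of which come from the paper.

```latex
% Hypothetical instance (all numbers are assumptions for illustration):
%   P_H(s,a_1) = (0.8, 0.2),  c_H(s,a_1) = 0.2,  r_H(s,a_1) = 0.8,
%   P_H(s,a_0) = (0.3, 0.7),  c_H(s,a_0) = 0,    r_H(s,a_0) = 0.3.
\begin{align*}
\text{IC for } a_1:\quad
  & 0.8\,x(s_g) + 0.2\,x(s_b) - 0.2 \;\ge\; 0.3\,x(s_g) + 0.7\,x(s_b)
  \;\Longleftrightarrow\; x(s_g) - x(s_b) \ge 0.4, \\
\text{least payment:}\quad
  & 0.8\,x(s_g) + 0.2\,x(s_b) = 0.8\,[x(s_g) - x(s_b)] + x(s_b) \ge 0.32,
  \quad\text{attained at } x^{*}_{H}(s;a_1) = (0.4,\,0), \\
\text{principal:}\quad
  & Q^{*}_{H}(s,a_1) = 0.8 - 0.32 = 0.48,
  \qquad Q^{*}_{H}(s,a_0) = 0.3 - 0 = 0.3 \ \text{ with } x^{*}_{H}(s;a_0) = (0,0), \\
\text{hence}\quad
  & V^{*}_{H}(s) = 0.48,\quad \pi^{*}_{H}(s) = a_1,\quad
    U^{*}_{H}(s) = 0.32 - 0.2 = 0.12 .
\end{align*}
```

Under the contract $(0.4,0)$ the agent is exactly indifferent between $a_{1}$ and $a_{0}$ (both yield $0.12$), which matches the weak incentive constraint in the linear program.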

For the inductive case, given that $\{x^{*}_{\tau}\}_{\tau=h+1}^{H}$ is optimal, with the agent’s best responding action policy $\{\pi^{*}_{\tau}\}_{\tau=h+1}^{H}$, we show that $x^{*}_{h}$ solved from Equation 2.4 is optimal. Let us observe that $W^{*}_{h}(s,a;x)=P_{h}(s,a)\cdot[x+U^{*}_{h+1}]-c_{h}(s,a)$ captures the agent’s total utility of taking action $a$ under the contract $x$ at step $h$ and state $s$, and then optimally following the action policy $\{\pi^{*}_{\tau}\}_{\tau=h+1}^{H}$ under $\{x^{*}_{\tau}\}_{\tau=h+1}^{H}$ from step $h+1$. Here, we can use $U^{*}_{h+1}$ computed from the previous iteration, because the agent’s value $U^{*}_{h+1}(s)$ is conditionally independent of the action in the current step given the realization of the next state $s$; this enables efficient computation through dynamic programming. Similar to the base case, for every action $a\in\mathcal{A}$, the principal computes the least payment contract for the agent to take action $a$,

	$\displaystyle Q_{h}^{*}(s,a),x^{*}_{h}(s;a)$	$\displaystyle=\mathop{maxarg}_{x:{\mathcal{S}}\to\mathbb{R}_{+}}r_{h}(s,a)+P_{h}(s,a)\cdot[V^{*}_{h+1}-x]$
	s.t.	$\displaystyle\quad P_{h}(s,a)\cdot[x+U^{*}_{h+1}]-c_{h}(s,a)\geq P_{h}(s,a^{\prime})\cdot[x+U^{*}_{h+1}]-c_{h}(s,a^{\prime}),\forall a^{\prime}\neq a,$

where the objective is set as the principal’s total utility if the agent takes action $a$ at the current step $h$ and follows the policy $\{\pi^{*}_{\tau}\}_{\tau=h+1}^{H}$ onward; the constraint reflects that it is (weakly) optimal for the agent to take action $a$ at the current step. In Equation 2.4, we omit the term $r_{h}(s,a)+P_{h}(s,a)\cdot V^{*}_{h+1}$, since it is constant once the action is fixed. With the least payment contract $x^{*}_{h}(s;a)$ and state-action value $Q^{*}_{h}(s,a)$ for each action, $V_{h}^{*}(s),a^{*}=\mathop{maxarg}_{a\in\mathcal{A}}Q^{*}_{h}(s,a)$ determines the best action $\pi^{*}_{h}(s)=a^{*}$ for the principal to induce and $V_{h}^{*}(s)$ is the optimal state value function at the $h$-th step. The principal’s optimal contract can be determined as $x^{*}_{h}(s)=x^{*}_{h}(s;a^{*})$. The agent’s value is $U_{h}^{*}(s)=P_{h}(s,a^{*})\cdot[x^{*}_{h}(s)+U^{*}_{h+1}]-c_{h}(s,a^{*})$ based on his best response $a^{*}$ under $x^{*}_{h}(s)$. Therefore, $x^{*}_{h}(s)$ is the optimal contract given the optimal contract policy $\{x^{*}_{\tau}\}_{\tau=h+1}^{H}$ computed in the previous iterations, which concludes the induction.

Lastly, we note that this Bellman equation can be solved efficiently using backward induction from the state value functions of the $(H+1)$-th step, $V^{*}_{H+1}=U^{*}_{H+1}=0$. In each step, it solves $S\times A$ many linear programs for the optimal contracts $x^{*}_{h}(s;a)$, while each linear program has $O(A^{2})$ many constraints. Hence, the total time complexity to solve for the optimal policy $\bm{x}^{*}$ is polynomial w.r.t. $A,S,H$.

∎

Appendix C Proofs in Section 3

C.1 The Regret Analysis of the Generic Algorithm

We first describe the design of the generic algorithm in Algorithm 2 and the technical lemmas.

Input: $\{\mathcal{X}^{a}(\varepsilon)\}_{a\in\mathcal{A}}$, the $\varepsilon$-margin contract sets.

for $t=1\dots T$ do

    Estimate the least payment for each action under the empirical outcome distribution,

    \widehat{\zeta}_{t}(a)\leftarrow\mathop{min}_{x\in\mathcal{X}^{a}(\varepsilon)}\widehat{P}_{t}(a)\cdot x.

    (C.1)

    Determine the best action based on the optimistic estimation of profit,

    a_{t}\leftarrow\mathop{argmax}_{a\in\mathcal{A}}\widehat{r}_{t}(a)-\widehat{\zeta}_{t}(a)+(1+\eta)\epsilon_{t}(a).

    Solve for a robust contract to induce $a_{t}$,

    x_{t}\leftarrow\mathop{argmin}_{x\in\mathcal{X}^{a_{t}}(\varepsilon)}\widehat{P}_{t}(a_{t})\cdot x.

    Commit to contract $x_{t}$ and observe the agent’s action $a^{\prime}_{t}$, outcome $s_{t}$ and its reward $\iota_{t}(s_{t})$.

    Update the empirical estimation of the outcome distribution and reward function, $\widehat{P}_{t},\widehat{r}_{t}$.

    Set the confidence interval $\epsilon_{t}$ such that $\left\lVert P(a)-\widehat{P}_{t}(a)\right\rVert_{1}\leq\epsilon_{t}(a),\forall a\in\mathcal{A}$, with prob. $1-\delta$.

Algorithm 2 Contractual bandit learning with $\varepsilon$-margin contract sets
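The selection step of Algorithm 2 can be sketched in a few lines of Python. This is a simplified illustration, not the authors’ implementation: it assumes each $\varepsilon$-margin set $\mathcal{X}^{a}(\varepsilon)$ is given as a finite list of candidate contracts (so the linear program (C.1) becomes a minimum over a list), and the function name `select_round` and all numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Contract = List[float]


def dot(p: Sequence[float], x: Sequence[float]) -> float:
    """Expected payment of contract x under outcome distribution p."""
    return sum(pi * xi for pi, xi in zip(p, x))


def select_round(
    P_hat: Dict[int, List[float]],          # empirical outcome distribution per action
    r_hat: Dict[int, float],                # empirical expected reward per action
    eps: Dict[int, float],                  # confidence radius eps_t(a) per action
    eta: float,                             # assumed bound on the contract's sup-norm
    candidates: Dict[int, List[Contract]],  # finite stand-in for X^a(epsilon)
) -> Tuple[int, Contract, Dict[int, float]]:
    zeta_hat: Dict[int, float] = {}
    best_x: Dict[int, Contract] = {}
    for a, xs in candidates.items():
        # Analogue of (C.1): cheapest candidate under the empirical distribution.
        best_x[a] = min(xs, key=lambda x: dot(P_hat[a], x))
        zeta_hat[a] = dot(P_hat[a], best_x[a])
    # Optimistic index: estimated profit plus the exploration bonus (1 + eta) * eps_t(a).
    a_t = max(candidates, key=lambda a: r_hat[a] - zeta_hat[a] + (1 + eta) * eps[a])
    return a_t, best_x[a_t], zeta_hat


# Hypothetical two-action, two-outcome instance.
P_hat = {0: [0.3, 0.7], 1: [0.8, 0.2]}
r_hat = {0: 0.3, 1: 0.8}
eps = {0: 0.05, 1: 0.05}
candidates = {0: [[0.0, 0.0]], 1: [[0.45, 0.0], [0.5, 0.05]]}
a_t, x_t, zeta_hat = select_round(P_hat, r_hat, eps, eta=1.0, candidates=candidates)
```

Here action 1 has index $0.8-0.36+0.1=0.54$ against $0.4$ for action 0, so the sketch commits to the contract $(0.45,0)$. A full implementation would replace the list minimum with a linear program over the true polytope and update `P_hat`, `r_hat` and `eps` after each round.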

Lemma 1.

Under Assumption 1, for each action $a\in\mathcal{A}$, given a robust contract set $\mathcal{X}^{a}(\varepsilon)$ with margin $\varepsilon$ and an empirical estimate $\widehat{P}(a)$ with $\left\lVert\widehat{P}(a)-P(a)\right\rVert_{1}\leq\epsilon$, let $\widehat{x},\widehat{\zeta}(a)$ be the minimizer and minimum objective value of LP (C.1). The following conditions are satisfied,

1.

The expected payment of $\widehat{x}$ is bounded as, $0\leq P(a)\cdot\widehat{x}-{\zeta}(a)\leq\lambda^{-1}\varepsilon.$
2.

The estimated payment of $\widehat{x}$ is bounded as, $-\eta\epsilon\leq\widehat{\zeta}(a)-\zeta(a)\leq\lambda^{-1}\varepsilon+\eta\epsilon.$

Lemma 2 (McDiarmid et al. [42]).

With $t$ i.i.d. samples of an $m$ -dimensional distribution $Q$ , we can construct a confidence ball $\mathcal{B}=\{Q\in\Delta^{m}:\left\lVert\widehat{Q}_{t}-Q\right\rVert_{1}\leq% \sqrt{\frac{m\log(1/\delta)}{t}}\}$ such that $Q\in\mathcal{B}$ with prob. at least $1-\delta$ .

Proof of Theorem 2.

At a high level, Algorithm 2 proceeds by following the upper confidence bound of the expected “profit” of each action $z(a):=r(a)-\zeta(a)$ , which shrinks at the rate of $t^{-1/2}$ , based on Lemma 1. This enables us to apply the upper confidence bound analysis from online learning problem to the contractual bandit learning problem.

That is, we construct a variable $\widetilde{z}_{t}(a):=\widehat{r}_{t}(a)-\widehat{\zeta}_{t}(a)+(1+\eta)\epsilon_{t}(a)+\lambda^{-1}\varepsilon$ as an optimistic estimation of $z(a)$. First, notice that Algorithm 2 equivalently follows the action $a_{t}=\mathop{argmax}_{a\in\mathcal{A}}\widetilde{z}_{t}(a)$ at each round $t$, as $\lambda^{-1}\varepsilon$ is constant and does not affect the optimization. Second, we show that the difference between $\widetilde{z}_{t}(a)$ and $z(a)$ satisfies the following inequality with probability at least $1-\delta$,

0\leq\widetilde{z}_{t}(a)-z(a)\leq(2+2\eta)\epsilon_{t}(a)+\lambda^{-1}% \varepsilon,\quad\forall a\in\mathcal{A},t\in[T].

(C.2)

On the event that $\left\lVert\widehat{P}_{t}(a)-P(a)\right\rVert_{1}\leq\epsilon_{t}(a)$, we can derive that $\left|\widehat{r}_{t}(a)-r(a)\right|=\left|[\widehat{P}_{t}(a)-P(a)]\cdot\iota\right|\leq\epsilon_{t}(a)$ and $-\eta\epsilon_{t}(a)\leq\widehat{\zeta}_{t}(a)-\zeta(a)\leq\lambda^{-1}\varepsilon+\eta\epsilon_{t}(a)$ by Lemma 1. This implies that $-(1+\eta)\epsilon_{t}(a)-\lambda^{-1}\varepsilon\leq\widehat{r}_{t}(a)-\widehat{\zeta}_{t}(a)-r(a)+\zeta(a)\leq(1+\eta)\epsilon_{t}(a)$, which, after adding $(1+\eta)\epsilon_{t}(a)+\lambda^{-1}\varepsilon$ to each side, leads to Equation (C.2).

Under the event that $\left\lVert\widehat{P}_{t}(a)-P(a)\right\rVert_{1}\leq\epsilon_{t}(a),\forall a% \in\mathcal{A}$ , we have $a_{t}=a^{\prime}_{t},\forall t\in[T]$ and the expected regret of Algorithm 2 in the $T$ rounds is as follows,

\operatorname{Reg}(T)=\mathop{max}_{a^{*}\in\mathcal{A}}\sum_{t=1}^{T}[r(a^{*}% )-\zeta(a^{*})]-\sum_{t=1}^{T}[r(a_{t})-P(a_{t})x_{t}].

We decompose the regret into two cases on whether the optimal arm $a^{*}$ is played at round $t$ :

When $a_{t}=a^{*}$ , we have $[r(a^{*})-\zeta(a^{*})]-[r(a_{t})-P(a_{t})x_{t}]=P(a_{t})x_{t}-\zeta(a^{*})% \leq\lambda^{-1}\varepsilon$ by Lemma 1.

When $a_{t}\neq a^{*}$ , we have

	$\displaystyle[r(a^{*})-\zeta(a^{*})]-[r(a_{t})-P(a_{t})\cdot x_{t}]$	$\displaystyle\leq\widetilde{z}_{t}(a^{*})-r(a_{t})+P(a_{t})\cdot x_{t}$
		$\displaystyle\leq\widetilde{z}_{t}(a_{t})-r(a_{t})+\zeta(a_{t})+\lambda^{-1}\varepsilon$
		$\displaystyle=\widetilde{z}_{t}(a_{t})-z(a_{t})+\lambda^{-1}\varepsilon$
		$\displaystyle\leq(2+2\eta)\epsilon_{t}(a_{t})+2\lambda^{-1}\varepsilon,$

where the first inequality follows from Equation (C.2), i.e., $z(a^{*})\leq\widetilde{z}_{t}(a^{*})$; the second inequality uses the fact that $\widetilde{z}_{t}(a^{*})\leq\widetilde{z}_{t}(a_{t})$ and $P(a_{t})\cdot x_{t}\leq\zeta(a_{t})+\lambda^{-1}\varepsilon$ from Lemma 1; the equality holds by the definition $z(a_{t})=r(a_{t})-\zeta(a_{t})$; and the last inequality follows from Equation (C.2), i.e., $\widetilde{z}_{t}(a_{t})-z(a_{t})\leq(2+2\eta)\epsilon_{t}(a_{t})+\lambda^{-1}\varepsilon$.

It remains to bound the total regret based on the exact choice of $\epsilon_{t}$ in the different setups.

The Case of Finite Action Space.

Suppose the action space $\mathcal{A}$ is finite. For any action $a\in\mathcal{A}$, we denote by $N_{t}(a)$ the number of times action $a$ has been taken up to round $t$. By Lemma 2, we can set $\epsilon_{t}(a)=\sqrt{\frac{|{\mathcal{S}}|\log(T|\mathcal{A}|/\delta)}{N_{t}(a)}}$ such that the empirical estimation of the outcome distribution $\widehat{P}_{t}$ satisfies $\left\lVert\widehat{P}_{t}(a)-P(a)\right\rVert_{1}\leq\epsilon_{t}(a)$ with probability at least $1-\frac{\delta}{T|\mathcal{A}|}$. Thus, by the union bound, with probability at least $1-\delta$, the expected regret can be bounded as follows,

	$\displaystyle\operatorname{Reg}(T)$	$\displaystyle\leq\sum_{t=1}^{T}\Big[(2+2\eta)\sqrt{\frac{|{\mathcal{S}}|\log(T|\mathcal{A}|/\delta)}{N_{t}(a_{t})}}+2\lambda^{-1}\varepsilon\Big]$
		$\displaystyle\leq(4+4\eta)\sqrt{|{\mathcal{S}}|\log(T|\mathcal{A}|/\delta)}\sum_{a\in\mathcal{A}}\sqrt{N_{T}(a)}+2\lambda^{-1}\varepsilon T$
		$\displaystyle\leq(4+4\eta)\sqrt{|{\mathcal{S}}|\log(T|\mathcal{A}|/\delta)}\sqrt{|\mathcal{A}|T}+2\lambda^{-1}\varepsilon T$
		$\displaystyle=O(\eta\sqrt{T|\mathcal{A}||{\mathcal{S}}|\log(T|\mathcal{A}|/\delta)}+\varepsilon T/\lambda),$

where the first inequality uses the fact that the loss incurred when $a_{t}\neq a^{*}$ is at least as much as the loss when $a_{t}=a^{*}$; the second inequality groups the rounds by action and uses $\sum_{n=1}^{N}n^{-1/2}\leq 2\sqrt{N}$; the third inequality applies the Cauchy-Schwarz inequality and uses the fact that $\sum_{a\in\mathcal{A}}N_{T}(a)=T$.
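The confidence radius and the two summation steps in the display above can be checked numerically. The following Python sketch is only illustrative: the function names and the pull counts are hypothetical, and the radius is the one stated for the finite-action case, $\epsilon_{t}(a)=\sqrt{|{\mathcal{S}}|\log(T|\mathcal{A}|/\delta)/N_{t}(a)}$.

```python
import math


def confidence_radius(n_outcomes: int, n_actions: int, horizon: int,
                      delta: float, pulls: int) -> float:
    """eps_t(a) = sqrt(|S| * log(T * |A| / delta) / N_t(a))."""
    return math.sqrt(n_outcomes * math.log(horizon * n_actions / delta) / pulls)


def inverse_sqrt_sum(n: int) -> float:
    """sum_{k=1}^{n} 1 / sqrt(k), the per-action sum of shrinking radii."""
    return sum(1.0 / math.sqrt(k) for k in range(1, n + 1))


# Hypothetical pull counts N_T(a) for three actions, summing to T = 100.
counts = [50, 30, 20]
T = sum(counts)

# Per-action bound: sum_{k <= N} k^{-1/2} <= 2 * sqrt(N).
per_action_ok = all(inverse_sqrt_sum(n) <= 2 * math.sqrt(n) for n in counts)

# Cauchy-Schwarz: sum_a sqrt(N_T(a)) <= sqrt(|A| * T).
cauchy_schwarz_ok = sum(math.sqrt(n) for n in counts) <= math.sqrt(len(counts) * T)

# Quadrupling the number of pulls halves the radius (the t^{-1/2} rate).
ratio = confidence_radius(3, 3, T, 0.05, 1) / confidence_radius(3, 3, T, 0.05, 4)
```

The checks only confirm the elementary inequalities for one set of counts; they do not replace the high-probability argument of Lemma 2.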

The Case of Infinite Action Space with Linear Context.

In the case where the action space $\mathcal{A}\subset\mathbb{R}^{d}$ is infinite, the outcome distribution is $P(a)=a^{\top}\theta^{*}$ for some unknown parameter $\theta^{*}\in\mathbb{R}^{d\times m}$. Let $\widehat{\theta}_{t}=\Sigma_{t}^{-1}\sum_{\tau=1}^{t}s_{\tau}a_{\tau}$ with $\Sigma_{t}=\lambda I+\sum_{\tau=1}^{t}a_{\tau}a_{\tau}^{\top}$. By Lemma 11 of Abbasi-Yadkori et al. [4], with probability at least $1-\delta$, we have $\left\lVert\widehat{\theta}_{t}-\theta^{*}\right\rVert_{\Sigma_{t}}\leq\sqrt{\beta_{t}}$, where $\beta_{t}=\sigma^{2}(2+4d\log(T+1)+8\log(4/\delta))d\log(1+\frac{T}{d\sigma^{2}})$. We can set $\epsilon_{t}(a)=\sqrt{\beta_{t}}\left\lVert a\right\rVert_{\Sigma_{t}^{-1}}$ such that the empirical estimation of the outcome distribution $\widehat{P}_{t}(a)=a^{\top}\widehat{\theta}_{t}$ satisfies

\left\lVert\widehat{P}_{t}(a)-P(a)\right\rVert_{1}=\left\lVert(\Sigma_{t}^{-1/2}a)^{\top}\Sigma_{t}^{1/2}(\widehat{\theta}_{t}-\theta^{*})\right\rVert_{1}\leq\left\lVert\Sigma_{t}^{-1/2}a\right\rVert\left\lVert\Sigma_{t}^{1/2}(\widehat{\theta}_{t}-\theta^{*})\right\rVert\leq\sqrt{\beta_{t}}\left\lVert a\right\rVert_{\Sigma_{t}^{-1}}=\epsilon_{t}(a).

Thus, by the union bound, with probability at least $1-\delta$, the expected regret can be bounded as follows,

	$\displaystyle\operatorname{Reg}(T)$	$\displaystyle\leq\sum_{t=1}^{T}\Big[(2+2\eta)\sqrt{\beta_{t}}\left\lVert a_{t}\right\rVert_{\Sigma_{t}^{-1}}+2\lambda^{-1}\varepsilon\Big]$
		$\displaystyle\leq(2+2\eta)\sqrt{T\sum_{t=1}^{T}\beta_{t}\left\lVert a_{t}\right\rVert_{\Sigma_{t}^{-1}}^{2}}+2\lambda^{-1}\varepsilon T$
		$\displaystyle=O\bigg{(}(1+\eta)\sqrt{T}\big{(}d\log(T)+\log(1/\delta)\big{)}+\varepsilon T/\lambda\bigg{)}.$

∎

Proof of Lemma 1.

Pick an arbitrary $a\in\mathcal{A}$. We are given ${\mathcal{X}}^{a}(\varepsilon)=\{x:[P(a)-P(a^{\prime})]\cdot x\geq c(a)-c(a^{\prime})+\varepsilon,\forall a^{\prime}\neq a\}$ and an empirical estimate $\widehat{P}(a)$ with $\left\lVert\widehat{P}(a)-P(a)\right\rVert_{1}\leq\epsilon$. With ${\mathcal{X}}^{a}(\varepsilon)$ and $\widehat{P}(a)$, LP (C.1) solves for a robust contract $\widehat{x}$. Since $\widehat{x}\in\mathcal{X}^{a}(\varepsilon)\subseteq\mathcal{X}^{a}$, the agent’s best response to $\widehat{x}$ is to take action $a$.

First, we derive a bound for an intermediate value $\mathop{min}_{x\in{\mathcal{X}}^{a}(\varepsilon)}P(a)\cdot x$ . Recall that Assumption 1 guarantees that, for any $x\in\mathcal{X}^{a}$ , there exists $\overline{x}=x+\lambda^{-1}\varepsilon e$ such that $\overline{x}\in{\mathcal{X}}^{a}(\varepsilon)$ . Let $x^{*}=\mathop{argmin}_{x\in\mathcal{X}^{a}}P(a)\cdot x$ and $\overline{x}^{*}=\mathop{argmin}_{x\in{\mathcal{X}}^{a}(\varepsilon)}P(a)\cdot x$ . We have

\mathop{min}_{x\in{\mathcal{X}}^{a}(\varepsilon)}P(a)\cdot x=\mathop{min}_{x% \in\mathcal{X}^{a}}P(a)\cdot x+P(a)\cdot(\overline{x}^{*}-x^{*}).

Since $\left\lVert P(a)\right\rVert_{1}=1,\left\lVert\overline{x}^{*}-x^{*}\right% \rVert_{\infty}\leq\lambda^{-1}\varepsilon$ , we have $\left|P(a)\cdot(\overline{x}^{*}-x^{*})\right|\leq\left\lVert P(a)\right\rVert% _{1}\left\lVert\overline{x}^{*}-x^{*}\right\rVert_{\infty}$ . In addition, as ${\mathcal{X}}^{a}(\varepsilon)\subseteq{\mathcal{X}}^{a}$ , $\mathop{min}_{x\in{\mathcal{X}}^{a}(\varepsilon)}P(a)\cdot x\geq\mathop{min}_{% x\in{\mathcal{X}}^{a}}P(a)\cdot x=\zeta(a)$ . Hence, we have

0\leq\mathop{min}_{x\in{\mathcal{X}}^{a}(\varepsilon)}P(a)\cdot x-\zeta(a)\leq% \lambda^{-1}\varepsilon.

(C.3)

Notice that $\widehat{x}\in{\mathcal{X}}^{a}(\varepsilon)$ and is not necessarily the minimizer of $P(a)\cdot x$ over ${\mathcal{X}}^{a}$ . We get the first condition of this lemma, by Equation (C.3)

\zeta(a)=\mathop{min}_{x\in{\mathcal{X}}^{a}}P(a)\cdot x\leq P(a)\cdot\widehat% {x}\leq\mathop{min}_{x\in{\mathcal{X}}^{a}(\varepsilon)}P(a)\cdot x\leq\zeta(a% )+\lambda^{-1}\varepsilon.

We now bound the estimated payment $\widehat{\zeta}(a)=\widehat{P}(a)\cdot\widehat{x}$. Since $\left\lVert P(a)-\widehat{P}(a)\right\rVert_{1}\leq\epsilon$ and $\left\lVert\widehat{x}\right\rVert_{\infty}\leq\eta$ by the bounded contract space assumption, we have

\left|\widehat{\zeta}(a)-P(a)\cdot\widehat{x}\right|=\left|[P(a)-\widehat{P}(a% )]\cdot\widehat{x}\right|\leq\left\lVert P(a)-\widehat{P}(a)\right\rVert_{1}% \left\lVert\widehat{x}\right\rVert_{\infty}\leq\eta\epsilon.

(C.4)

Therefore, combining Equation (C.4) and the first condition, we get the second condition of this lemma,

-\eta\epsilon\leq\widehat{\zeta}(a)-\zeta(a)\leq\lambda^{-1}\varepsilon+\eta\epsilon.

∎
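The estimation-error step (C.4) is an instance of Hölder’s inequality, $|u\cdot v|\leq\lVert u\rVert_{1}\lVert v\rVert_{\infty}$. A minimal numerical check in Python, with hypothetical values for $P(a)$, $\widehat{P}(a)$ and $\widehat{x}$ and an assumed contract bound $\eta=1$:

```python
from typing import Sequence


def l1_norm(v: Sequence[float]) -> float:
    return sum(abs(t) for t in v)


def sup_norm(v: Sequence[float]) -> float:
    return max(abs(t) for t in v)


# Hypothetical true and empirical outcome distributions, and a contract with sup-norm <= eta.
P = [0.8, 0.2]
P_hat = [0.75, 0.25]
x_hat = [0.45, 0.0]
eta = 1.0

diff = [p - q for p, q in zip(P, P_hat)]
epsilon = l1_norm(diff)                                  # ||P(a) - P_hat(a)||_1
gap = abs(sum(d * x for d, x in zip(diff, x_hat)))       # |zeta_hat(a) - P(a) . x_hat|
holder_bound = epsilon * sup_norm(x_hat)                 # ||P - P_hat||_1 * ||x_hat||_inf
```

In this instance the gap is about $0.0225$, the Hölder bound about $0.045$, and the cruder bound $\eta\epsilon$ used in the lemma about $0.1$.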

C.2 Solving Multi-armed Bandits under Direct Incentives

Multi-Armed Bandits under Direct Incentives

This is perhaps the simplest yet most natural class of contractual online learning problems. The principal is unable to directly pull arms but is able to receive the reward from the arm pulled by the agent. In this problem, we have the action space $\mathcal{A}=[N]$ and $r,c:[N]\to[0,1]$ specifying the principal’s reward and agent’s cost of pulling each arm $i\in[N]$. At the beginning of each round $t$, the principal sets a contract $x_{t}:[N]\to\mathbb{R}_{+}$ and the agent accordingly decides its best response $i_{t}=\mathop{argmax}_{i\in[N]}\big{[}x_{t}(i)-c(i)\big{]}$. At the end of each round $t$, the principal is able to observe the exact arm $i_{t}$ taken by the agent as well as the noisy bandit feedback on its corresponding reward $\widetilde{r}_{t}(i_{t})=r(i_{t})+\epsilon_{t}$, where $\epsilon_{t}$ is zero-mean, i.i.d. $\sigma$-sub-Gaussian noise. Finally, the learning goal of the principal is to minimize the regret, $\operatorname{Reg}(T)=T\cdot\mathop{max}_{i\in[N]}\big{[}r(i)-c(i)\big{]}-\sum_{t\in[T]}[r(i_{t})-x_{t}(i_{t})]$.

Lemma 3 (Binary Search for Finite Arms).

There exists an $O(|\mathcal{A}|\log(1/\varepsilon))$ -learning procedure for multi-armed bandits under direct incentives.

Proof of Lemma 3.

We show an explicit construction of a $\chi(\varepsilon)$-learning procedure for this problem. Observe that, if we can learn an estimate with $|\widehat{c}(a)-c(a)|\leq\varepsilon/2,\forall a\in\mathcal{A}$, we can set the contract $x$ as follows, $x(a)=\widehat{c}(a)+\varepsilon/2,x(a^{\prime})=0,\forall a^{\prime}\neq a$. Since $x(a)-c(a)\geq 0\geq x(a^{\prime})-c(a^{\prime})$, it is optimal for the agent to respond with action $a$. Moreover, the payment is within $\varepsilon$ of the least payment, as $x(a)-\varepsilon\leq c(a)+\varepsilon/2+\varepsilon/2-\varepsilon=c(a)=\zeta(a)$.

So it only remains to learn the estimate with $|\widehat{c}(a)-c(a)|\leq\varepsilon/2,\forall a\in\mathcal{A}$. This can be achieved through binary search. For any action $a$, we maintain a cost lower bound $c^{-}(a)$ and upper bound $c^{+}(a)$, initialized to $0$ and $1$ since $c(a)\in[0,1]$. At each round, the algorithm sets the contract $x$ with $x(a)=\frac{c^{-}(a)+c^{+}(a)}{2},x(a^{\prime})=0,\forall a^{\prime}\neq a$. If the agent takes the action $a$, the payment covers the cost, so the algorithm updates the upper bound $c^{+}(a)\leftarrow\frac{c^{-}(a)+c^{+}(a)}{2}$. Otherwise, it updates the lower bound $c^{-}(a)\leftarrow\frac{c^{-}(a)+c^{+}(a)}{2}$. In $\log(1/\varepsilon)+1$ rounds, the algorithm is guaranteed to have $c^{+}(a)-c^{-}(a)\leq\varepsilon/2$ and thus an estimate $\widehat{c}(a)\in[c^{-}(a),c^{+}(a)]$ with $|\widehat{c}(a)-c(a)|\leq\varepsilon/2$. To conduct the binary search for every action, the total sample complexity is $O(|\mathcal{A}|\log(1/\varepsilon))$.

∎
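The binary search in this proof can be sketched as follows. This is a minimal simulation under stated assumptions, not the authors’ code: the agent is simulated by a hypothetical `best_response` oracle that breaks ties toward the larger payment, arm 0 is an assumed idle arm with zero cost, and the cost vector is used only to simulate the agent (the principal never reads it directly).

```python
import math
from typing import List, Tuple


def best_response(costs: List[float], x: List[float]) -> int:
    """Simulated agent: picks the arm maximizing x(i) - c(i).

    Ties go to the arm with the larger payment (assumed principal-favorable tie-breaking).
    """
    return max(range(len(costs)), key=lambda i: (x[i] - costs[i], x[i]))


def estimate_cost(costs: List[float], arm: int, eps: float) -> Tuple[float, int]:
    """Binary search for c(arm) using only the agent's observed responses."""
    lo, hi = 0.0, 1.0              # costs are assumed to lie in [0, 1]
    queries = 0
    while hi - lo > eps / 2:
        mid = (lo + hi) / 2
        x = [0.0] * len(costs)
        x[arm] = mid               # pay only for the target arm
        if best_response(costs, x) == arm:
            hi = mid               # the payment suffices, so c(arm) <= mid
        else:
            lo = mid               # the payment is rejected, so c(arm) > mid
        queries += 1
    return hi, queries             # hi overestimates c(arm) by at most eps / 2


costs = [0.0, 0.37, 0.62]          # hypothetical; arm 0 is an idle arm with zero cost
eps = 0.01
c_hat, queries = estimate_cost(costs, arm=1, eps=eps)

# Contract from the proof: pay c_hat + eps / 2 on the target arm, nothing elsewhere.
contract = [0.0, c_hat + eps / 2, 0.0]
chosen = best_response(costs, contract)
```

Returning the upper bound `hi` as the estimate keeps the resulting contract incentive compatible, and the overpayment `contract[1] - costs[1]` stays within `eps`.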

C.3 Solving Linear Bandits under Direct Incentives

Linear Bandits under Direct Incentives

In this problem, we have the action space $\mathcal{A}\subset\mathbb{R}^{d}$ (composed of the context vectors) and $r,c:\mathcal{A}\to[0,1]$ specifying the principal’s reward and agent’s cost of choosing each context $a\in\mathcal{A}$. At the beginning of each round $t$, the principal observes a set of contexts $\mathcal{A}_{t}\subset\mathcal{A}$ and sets a contract $x_{t}:\mathcal{A}_{t}\to\mathbb{R}_{+}$. The agent accordingly decides its best response $a_{t}=\mathop{argmax}_{a\in\mathcal{A}_{t}}\big{[}x_{t}(a)-c(a)\big{]}$, where $c(a)=a^{\top}\gamma^{*}$. At the end of each round $t$, the principal is able to observe the exact arm $a_{t}$ taken by the agent as well as the noisy bandit feedback on its corresponding reward $\widetilde{r}_{t}(a_{t})=a_{t}^{\top}\theta^{*}+\epsilon_{t}$, where $\epsilon_{t}$ is zero-mean, i.i.d. $\sigma$-sub-Gaussian noise. $\theta^{*},\gamma^{*}$ are fixed, unknown parameters to be learnt. Without loss of generality, we assume $\left\lVert\theta^{*}\right\rVert\leq 1,\left\lVert\gamma^{*}\right\rVert\leq 1,\left\lVert a_{t}\right\rVert\leq\sqrt{d}$ by coordinate transformation. Finally, the learning goal of the principal is to minimize the regret, $\operatorname{Reg}(T)=\sum_{t=1}^{T}[(a^{*}_{t})^{\top}\theta^{*}-(a^{*}_{t})^{\top}\gamma^{*}-a_{t}^{\top}\theta^{*}+x_{t}(a_{t})],$ where $a^{*}_{t}=\mathop{argmax}_{a\in\mathcal{A}_{t}}\{a^{\top}\theta^{*}-a^{\top}\gamma^{*}\}$ is the optimal arm at round $t$.

Lemma 4 (Contextual Search for Infinite Arms).

There exists an $O(d\log(1/\varepsilon))$-learning procedure for linear bandits under direct incentives.

Proof of Lemma 4.

We show an explicit construction of a $\chi(\varepsilon)$-learning procedure for the problem with the agent’s best response function $h^{*}(x)=\mathop{argmax}_{a\in\mathcal{A}}\{x(a)-a^{\top}\gamma^{*}\}$ for some parameter $\gamma^{*}\in\mathbb{R}^{d},\left\lVert\gamma^{*}\right\rVert\leq 1$ and action set $\mathcal{A}\subset\mathbb{R}^{d}$. Observe that, if we can learn an estimate $\gamma$ such that $\left\lVert\gamma-\gamma^{*}\right\rVert\leq\frac{1}{2t}$, we can set the least payment rule $x$ such that $x(a)=\gamma^{\top}a+\frac{1}{2t},x(a^{\prime})=0,\forall a^{\prime}\neq a$. Since $x(a)-c(a)\geq\frac{1}{2t}-\left\lVert\gamma-\gamma^{*}\right\rVert\cdot\left\lVert a\right\rVert\geq 0\geq x(a^{\prime})-c(a^{\prime})$, we have $h^{*}(x)=a$. Moreover, $x(a)-c(a)\leq\frac{1}{2t}+\left\lVert\gamma-\gamma^{*}\right\rVert\cdot\left\lVert a\right\rVert\leq\frac{1}{t}$.

To learn an estimate $\gamma$ such that $\left\lVert\gamma-\gamma^{*}\right\rVert\leq\frac{1}{2t}$, we adopt the contextual search algorithm under symmetric loss [38]. At a high level, we use the constant regret guarantee of the contextual search algorithm against adversarially chosen contexts at every round, and we present a simple argument assuming $\mathcal{A}_{t}=\mathcal{A}$ that allows us to pick an arbitrary context for the contextual search algorithm. Specifically, consider a contextual search problem with the unknown vector $\gamma^{*}\in\mathbb{R}^{d}$ and $\left\lVert\gamma^{*}\right\rVert\leq 1$. Fix any unit vector $e\in\mathcal{A}$; in $O(t)$ rounds, the contextual search algorithm can determine a knowledge set $\Gamma_{t}$ of all feasible $\gamma$ such that $\mathop{max}_{\gamma\in\Gamma_{t}}|\gamma^{\top}e-(\gamma^{*})^{\top}e|\leq 2^{-t}$. Repeating this search procedure for all $d$ linearly independent directions in $\mathcal{A}$, we obtain a knowledge set of all feasible $\gamma$ such that $\mathop{max}_{\gamma\in\Gamma_{t}}|\gamma^{\top}a-(\gamma^{*})^{\top}a|\leq 2^{-t},\forall a\in\mathcal{A}$, since any action $a$ can be decomposed as a convex combination of the $d$ linearly independent unit vectors. The total sample complexity is $O(d\log(1/\varepsilon))$. ∎

C.4 Solving General Contractual Bandit Problems

Lemma 5.

Under Assumption 2 and given $\widehat{P}$ that satisfies $\left\lVert\widehat{P}-P\right\rVert_{1,\infty}\leq\varepsilon/\eta$, we can construct an $O\big(|\mathcal{A}|^{2}\log(|\mathcal{A}|\eta/\varepsilon)\big)$-learning procedure for general contractual bandit learning problems.

Proof of Lemma 5.

We denote $d(a,a^{\prime}):=c(a)-c(a^{\prime})$. The learning procedure queries contracts in a binary search fashion in order to obtain an estimate $\widehat{d}$ with bounded error $|\widehat{d}(a,a^{\prime})-d(a,a^{\prime})|,\forall a,a^{\prime}$, from which we can construct the $\varepsilon$-margin contract set $\widehat{\mathcal{X}}^{a}$ for any action $a\in\mathcal{A}$ and thereby compute an almost least payment contract according to Lemma 1. For a precise analysis, let $\left\lVert\widehat{P}-P\right\rVert_{1,\infty}\leq\epsilon=\frac{\varepsilon}{10\eta}$. We describe the full procedure in Algorithm 3.

Input: Action set $\mathcal{A}$, estimated parameters $\widehat{P}$

Output: Robust contract sets $\widehat{\mathcal{X}}^{a},\forall a\in\mathcal{A}$

$\widehat{d}(a,a^{\prime})\leftarrow\infty,\forall a,a^{\prime}\in\mathcal{A}$

Set binary search precision $\epsilon=\frac{\varepsilon}{10\eta|\mathcal{A}|}$

for each $a\in\mathcal{A}$ do

    Construct a contract $x^{a}$ that induces the action $a$

    for each $a^{\prime}\neq a\in\mathcal{A}^{\prime}$ do

        Construct a contract $x^{a^{\prime}}$ that induces the action $a^{\prime}$

        Binary search for parameter $\alpha\in(0,1)$ such that $x=\alpha x^{a}+(1-\alpha)x^{a^{\prime}}$ induces action $a$, while $x^{\prime}=(\alpha+\epsilon)x^{a}+(1-\alpha-\epsilon)x^{a^{\prime}}$ induces action $a^{\prime\prime}$

        Use $x,x^{\prime}$ to solve for $\widehat{d}(a,a^{\prime\prime})\leftarrow\big[\widehat{P}(a)-\widehat{P}(a^{\prime\prime})\big]\cdot x^{a}$

for each $a,a^{\prime}\in\mathcal{A}$ do

    if $\widehat{d}(a,a^{\prime})=\infty$ then

        $\widehat{d}(a,a^{\prime})\leftarrow\min_{\mathcal{P}}\sum_{(a_{i},a_{j})\in\mathcal{P}}\widehat{d}(a_{i},a_{j})$, where $\mathcal{P}$ is a choice of path from $a$ to $a^{\prime}$

return $\widehat{\mathcal{X}}^{a}=\{x\in\mathcal{X}:\big[\widehat{P}(a)-\widehat{P}(a^{\prime})\big]\cdot x\geq\widehat{d}(a,a^{\prime})+\varepsilon/2,\forall a\neq a^{\prime}\}$ for each $a\in\mathcal{A}$

Algorithm 3 $\chi(\varepsilon)$-Learning Procedure in Contractual Bandit Learning

To prove its correctness, we start from the observation that for any two actions $a,a^{\prime}$ and sufficiently small $\epsilon$, given two contracts $x^{a},x^{a^{\prime}}$ that respectively induce actions $a,a^{\prime}$ and satisfy $\left\lVert x^{a}-x^{a^{\prime}}\right\rVert_{\infty}\leq\epsilon$, we can obtain an estimate $\widehat{d}(a,a^{\prime})$ such that $|\widehat{d}(a,a^{\prime})-d(a,a^{\prime})|<3\epsilon\eta=3\varepsilon/10$. To see this, we introduce a contract $x^{0}=\alpha x^{a}+(1-\alpha)x^{a^{\prime}}$ for some $\alpha\in(0,1)$ such that $\big[P(a)-P(a^{\prime})\big]x^{0}=d(a,a^{\prime})$. Such an $x^{0}$ must exist, since $\big[P(a)-P(a^{\prime})\big]x^{a}>d(a,a^{\prime})$ and $\big[P(a)-P(a^{\prime})\big]x^{a^{\prime}}<d(a,a^{\prime})$. Now letting $\widehat{d}(a,a^{\prime})=\big[\widehat{P}(a)-\widehat{P}(a^{\prime})\big]\cdot x^{a}$, we have

$\displaystyle\begin{aligned}\left\|\widehat{d}(a,a^{\prime})-d(a,a^{\prime})\right\|&=\left\|\big[P(a)-P(a^{\prime})\big]\cdot(x^{a}-x^{0})+\big[\widehat{P}(a)-P(a)-\widehat{P}(a^{\prime})+P(a^{\prime})\big]\cdot x^{a}\right\|\\ &\leq(1-\alpha)\left\|\big[P(a)-P(a^{\prime})\big]\cdot(x^{a}-x^{a^{\prime}})\right\|+2\epsilon\eta\\ &\leq(1-\alpha)\epsilon^{2}+2\epsilon\eta\\ &<3\epsilon\eta=3\varepsilon/10.\end{aligned}$

To obtain such contracts $x^{a},x^{a^{\prime}}$, it suffices to run a binary search starting from two initial contracts $x^{a},x^{a^{\prime}}$ that induce actions $a,a^{\prime}$, respectively. If the contract $x^{\prime}=\frac{1}{2}x^{a}+\frac{1}{2}x^{a^{\prime}}$ induces the action $a$, we update $x^{a}\leftarrow x^{\prime}$; otherwise, $x^{a^{\prime}}\leftarrow x^{\prime}$. In $\log(1/\epsilon)$ rounds, the distance between $x^{a}$ and $x^{a^{\prime}}$ is bounded by $\epsilon$. As described in Algorithm 3, we can run such a binary search for every pair of actions $a,a^{\prime}$. Two actions may not share a decision boundary, so we identify all action pairs that do. For a pair that does not share a decision boundary, we can find a path through their neighbours to determine their cost difference $d$, and the shortest path can be found by Dijkstra's algorithm in $O(|\mathcal{A}|^{2})$ time. In the worst case, such a path can be as long as $|\mathcal{A}|-2$, which means we need to conduct the binary search to the precision level of $\epsilon/|\mathcal{A}|=\frac{\varepsilon}{10\eta|\mathcal{A}|}$ for $O(\log(|\mathcal{A}|\eta/\varepsilon))$ rounds.
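The halving step described above can be sketched as follows. This is a minimal illustration on a hypothetical two-action, two-outcome instance; the names `best_response` and `bisect_contracts` are not from the paper, the agent oracle is simulated from a known $P$ and $c$, and the true $P$ stands in for the estimate $\widehat{P}$ when forming $\widehat{d}$.

```python
def best_response(x, P, c):
    """Action maximizing expected payment minus cost (ties go to the lowest index)."""
    utils = [sum(p * xi for p, xi in zip(P[a], x)) - c[a] for a in range(len(P))]
    return max(range(len(P)), key=lambda a: utils[a])

def bisect_contracts(x_a, x_b, a, P, c, eps):
    """Halve the segment [x_a, x_b] until its endpoints are eps-close in sup norm.
    Invariant: x_a induces action a, x_b does not."""
    rounds = 0
    while max(abs(u - v) for u, v in zip(x_a, x_b)) > eps:
        mid = [(u + v) / 2 for u, v in zip(x_a, x_b)]
        if best_response(mid, P, c) == a:
            x_a = mid
        else:
            x_b = mid
        rounds += 1
    return x_a, x_b, rounds

# Hypothetical instance with true cost difference d(0, 1) = 0.3.
P = [[0.9, 0.1], [0.4, 0.6]]
c = [0.3, 0.0]
x_a, x_b, rounds = bisect_contracts([1.0, 0.0], [0.0, 0.0], 0, P, c, 1e-4)

# Estimate d(0, 1) = [P(0) - P(1)] . x^a, using the true P in place of P-hat.
d_hat = sum((p0 - p1) * xi for p0, p1, xi in zip(P[0], P[1], x_a))
```

With more than two actions the midpoint may induce a third action $a^{\prime\prime}$; Algorithm 3 handles that case by recording the estimate for the pair $(a,a^{\prime\prime})$ instead, which this two-action sketch does not exercise.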

Finally, with the estimated parameters $\widehat{P}$ and $\widehat{d}(a,a^{\prime})$, the algorithm constructs the robust contract set $\widehat{{\mathcal{X}}}^{a}=\{x\in\mathcal{X}:\big[\widehat{P}(a)-\widehat{P}(a^{\prime})\big]\cdot x\geq\widehat{d}(a,a^{\prime})+\varepsilon/2,\forall a\neq a^{\prime}\}$, and we claim that ${\mathcal{X}}^{a}(\varepsilon)\subseteq\widehat{{\mathcal{X}}}^{a}\subseteq{\mathcal{X}}^{a}$. To verify that $\widehat{{\mathcal{X}}}^{a}\subseteq{\mathcal{X}}^{a}$, we can check that the following inequality must hold, $\forall\widehat{x}\in\widehat{{\mathcal{X}}}^{a},\forall a^{\prime}\neq a$,

$\displaystyle\begin{aligned}[P(a)-P(a^{\prime})]\cdot\widehat{x}&\geq[\widehat{P}(a)-\widehat{P}(a^{\prime})]\cdot\widehat{x}+[{P}(a)-\widehat{P}(a)-{P}(a^{\prime})+\widehat{P}(a^{\prime})]\cdot\widehat{x}\\ &\geq\widehat{d}(a,a^{\prime})+\varepsilon/2-2\max_{a\in\mathcal{A}}\left\lVert\widehat{P}(a)-{P}(a)\right\rVert_{1}\left\lVert x\right\rVert_{\infty}\\ &\geq d(a,a^{\prime})+\varepsilon/2-\varepsilon/5-3\varepsilon/10\geq d(a,a^{\prime}).\end{aligned}$

Similarly, to verify ${\mathcal{X}}^{a}(\varepsilon)\subseteq\widehat{{\mathcal{X}}}^{a}$ , we can check that the following inequality must hold, $\forall x\in{\mathcal{X}}^{a}(\varepsilon),\forall a^{\prime}\neq a$ ,

$\displaystyle\begin{aligned}[\widehat{P}(a)-\widehat{P}(a^{\prime})]\cdot x&\geq[{P}(a)-{P}(a^{\prime})]\cdot{x}+[\widehat{P}(a)-{P}(a)-\widehat{P}(a^{\prime})+{P}(a^{\prime})]\cdot{x}\\ &\geq{d}(a,a^{\prime})+\varepsilon-2\max_{a\in\mathcal{A}}\left\lVert\widehat{P}(a)-{P}(a)\right\rVert_{1}\left\lVert x\right\rVert_{\infty}\\ &\geq\widehat{d}(a,a^{\prime})+\varepsilon-\varepsilon/5-3\varepsilon/10\geq\widehat{d}(a,a^{\prime})+\varepsilon/2.\end{aligned}$

∎

Appendix D Searching on Probability Simplex

In this section, we discuss the details of specifying the information structure through hyperplane searching. A motivation for hyperplane searching is given in Example 1, where learning the outcome distribution difference within $\mathcal{O}(\log T)$ rounds potentially avoids pulling the non-optimal arm too many times and paves the way for constructing a $T^{-1}$-optimal contract. In addition, the need to plan with the transition kernel $P_{h}(\cdot{\,|\,}s,a)$ in the MDP environment with a far-sighted agent prompts us to learn the difference $P_{h}(\cdot{\,|\,}s,a)-P_{h}(\cdot{\,|\,}s,a^{\prime})$ in order to fully exploit the information structure and to reduce the cost of redundant exploration.

Example 1 ( $o(T^{2/3})$ regret with known cost).

Consider a class of contractual bandit problem instances parameterized by $\mu\in(0,1]$. For each instance, there are two outcomes $s_{1},s_{2}$ with mean reward $\iota(s_{1})=1,\iota(s_{2})=0$, and two agent actions $a_{1},a_{2}$ with cost $c(a_{1})=1/2,c(a_{2})=0$ and outcome distribution $P(a_{1})=[1,0],P(a_{2})=[1-\mu,\mu]$. One can verify that the optimal contract $x^{*}$ here is to set $x^{*}(s_{1})=\frac{1}{2\mu},x^{*}(s_{2})=0$, and the principal gets the expected utility $1-\frac{1}{2\mu}$. The naive learning method is to play $a_{2}$ for $T^{2/3}$ rounds and learn its outcome distribution, parameterized by $\mu$, up to an error of $O(T^{-1/3})$. This is costly as $a_{2}$ is the sub-optimal arm, resulting in $\widetilde{O}(T^{2/3})$ regret in Theorem 2. However, an alternative method is to conduct a binary search for $\mu$. This would achieve $O(\log T)$ regret, since the algorithm can bound the estimation error of $\mu$ by $T^{-1}$ in $O(\log T)$ rounds, and then construct a $T^{-1}$-optimal contract.
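The binary search in Example 1 can be sketched as follows. This is one possible parameterization, not necessarily the authors': the principal posts $x=(\frac{1}{2w},0)$ for a trial value $w\in(0,1]$, and the agent prefers $a_{1}$ exactly when $w\leq\mu$, so each response halves the interval containing $\mu$. The function names and the tie-breaking rule in favour of $a_{1}$ are assumptions.

```python
import math

def agent_action(x, mu):
    """Best response in Example 1; ties are broken in favour of a1 (an assumption)."""
    u1 = x[0] - 0.5                     # P(a1) = [1, 0], c(a1) = 1/2
    u2 = (1 - mu) * x[0] + mu * x[1]    # P(a2) = [1 - mu, mu], c(a2) = 0
    return "a1" if u1 >= u2 else "a2"

def estimate_mu(mu, T):
    """Binary search over w in (0, 1], posting the contract x = (1/(2w), 0).
    The agent picks a1 exactly when w <= mu, so each round halves [lo, hi]."""
    lo, hi = 0.0, 1.0
    rounds = math.ceil(math.log2(T))
    for _ in range(rounds):
        w = (lo + hi) / 2
        if agent_action((1 / (2 * w), 0.0), mu) == "a1":
            lo = w
        else:
            hi = w
    return lo, hi, rounds

lo, hi, rounds = estimate_mu(0.3, 1000)
```

After $\lceil\log_{2}T\rceil$ rounds the interval $[lo,hi]$ has width at most $1/T$ and contains $\mu$, which is the $O(\log T)$-round estimate the example refers to.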

Here, we consider searching for the agent's best response sections in a $D$-dimensional probability simplex. Let ${\mathcal{S}}$ with $|{\mathcal{S}}|=d$ be the outcome space, let $x:{\mathcal{S}}\rightarrow\mathbb{R}_{+}$ denote the contract the principal announces to the agent, and let $\mathcal{A}$ be the agent's action set with $\left|\mathcal{A}\right|=N$. We restrict the contract $x$ to a subspace $x\in\mathcal{P}^{d-1}$, where $\mathcal{P}^{d-1}$ is the $(d-1)$-dimensional probability simplex. We remark that searching over a low-dimensional simplex is without loss of generality, and we just consider the simplex $\left\|x\right\|_{1}=\eta,x\in[0,\eta]^{D}$ for simplicity, where $\eta$ bounds the infinity norm of any contract we use. This is because we have the following proposition.

Proposition 1 (Action inducibility).

If an action $i\in[N]$ can be induced by contract $x$ with $\left\|x\right\|_{\infty}\leq\eta$ , then $i$ can also be induced by contract $y=x+((N-1)\eta-\left\|x\right\|_{1})\operatorname{\mathds{1}}/N$ .

Proof.

The inducibility condition implies

\displaystyle\langle x,p_{i}-p_{j}\rangle\geq c_{i}-c_{j},\forall j\neq i.

Adding $((N-1)\eta-\left\|x\right\|_{1})\operatorname{\mathds{1}}/N$ to $x$ does not change the inequality, since $\langle\operatorname{\mathds{1}},p_{i}-p_{j}\rangle=0$. Moreover,

\displaystyle y^{(l)}=x^{(l)}+((N-1)\eta-\left\|x\right\|_{1})/N=x^{(l)}\frac{N-1}{N}+\frac{(N-1)\eta}{N}-\frac{1}{N}\sum_{m\neq l}x^{(m)}\geq 0,

which implies that $y$ is a valid contract with $\left\|y\right\|_{1}=(N-1)\eta$ . ∎
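The shift in Proposition 1 is easy to check numerically. The sketch below uses a hypothetical instance with three outcomes and $N=3$ actions (the proof implicitly takes the number of coordinates of $x$ equal to $N$, and the sketch does the same); the names `shift_contract` and `best_response` are illustrative only.

```python
def shift_contract(x, eta):
    """y = x + ((N - 1) * eta - ||x||_1) * 1 / N, with N = len(x) as in the proof."""
    n = len(x)
    delta = ((n - 1) * eta - sum(abs(v) for v in x)) / n
    return [v + delta for v in x]

def best_response(x, P, c):
    """Action maximizing <x, p_i> - c_i (ties go to the lowest index)."""
    utils = [sum(p * xi for p, xi in zip(P[i], x)) - c[i] for i in range(len(P))]
    return max(range(len(P)), key=lambda i: utils[i])

# Hypothetical instance with eta = 1.
P = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
c = [0.2, 0.1, 0.0]
x = [0.9, 0.1, 0.3]
y = shift_contract(x, eta=1.0)
```

Because every row of `P` sums to one, the constant shift adds the same amount to each action's expected payment, so the induced action is unchanged while $\left\|y\right\|_{1}=(N-1)\eta$ and $y\geq 0$.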

For simplicity, we ignore the scale $(N-1)\eta$ and just conduct our search on the probability simplex. Under this setting, the best response region for $a_{i}\in\mathcal{A}$ is

\displaystyle\mathcal{V}_{i}=\left\{x\in\mathcal{P}^{d-1}{\,\big|\,}\langle x,p_{i}-p_{j}\rangle\geq c_{i}-c_{j},\quad\forall j\neq i\right\},

where $p_{i}\in\Delta({\mathcal{S}})$ is the outcome distribution under action $i$ and $c_{i}$ is the action cost the agent has to pay for any $i\in[N]$ . Our target is to identify each $\mathcal{V}_{i}$ by searching for the hyperplanes that separate these $\mathcal{V}_{i}$ under weak assumptions. Specifically, we assume that the cost of each action is known. The algorithm is summarized in Algorithm 4.

Input: Number of actions $N$, number of samples $T$, binary search threshold $\varepsilon$, parameters $c_{d}$

Initial memory $\mathcal{M}=\emptyset$;

for $t=1,\dots,T$ do

    Randomly sample $z_{1},z_{2}\in\mathcal{P}^{d}$ and draw the line $\ell\subset\mathcal{P}^{d}$ connecting $z_{1},z_{2}$;

    Binary search on $\ell$ for all the switching points up to precision $\varepsilon$³, and obtain all the segments containing a switching point: $\overline{x_{1}y_{1}},\dots,\overline{x_{m}y_{m}}$;

    for $k=1,\dots,m$ do

        $\mathcal{M}\leftarrow\mathcal{M}\cup\left\{\left(x_{k},a^{*}(x_{k})\right)\right\}\cup\left\{\left(y_{k},a^{*}(y_{k})\right)\right\}$;

        Randomly draw a $d$-dimensional simplex centered at $(x_{k}+y_{k})/2$ with length $\sqrt{2}c_{d}$ and vertices $v_{1},\dots,v_{d+1}$;

        Play $v_{1},\dots,v_{d+1}$ and obtain the best response $a^{*}_{1},\dots,a^{*}_{d+1}$;

        for each pair $(i,j)$ s.t. $1\leq i<j\leq d+1$ and $a^{*}_{i}\neq a^{*}_{j}$ do

            Binary search on $\overline{v_{i}v_{j}}$ for a switching point up to precision $\varepsilon$, and obtain the segment $\overline{uw}$ containing the switching point;

            $\mathcal{M}\leftarrow\mathcal{M}\cup\left\{\left(u,a^{*}(u)\right)\right\}\cup\left\{\left(w,a^{*}(w)\right)\right\}$;

Solve for ${\mathcal{S}}$ with $\mathcal{M}$;

Algorithm 4 Searching on Probability Simplex

³ Precision $\varepsilon$ in this algorithm always means the $d$-dimensional infinity-norm $\left\|x_{k}-y_{k}\right\|_{\infty}\leq\varepsilon$.
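The line-search step of Algorithm 4 can be sketched as a recursive bisection. This is one way to find all switching points on a segment, not necessarily the authors' implementation: it relies on each best response region $\mathcal{V}_{i}$ being convex, so two endpoints with the same response cannot have a switch between them. The oracle `best_response`, the helper name `switching_segments`, and the toy instance are hypothetical.

```python
def best_response(x, P, c):
    """Action maximizing <x, p_i> - c_i (ties go to the lowest index)."""
    utils = [sum(p * xi for p, xi in zip(P[i], x)) - c[i] for i in range(len(P))]
    return max(range(len(P)), key=lambda i: utils[i])

def switching_segments(z1, z2, oracle, eps):
    """Return tuples (u, a(u), w, a(w)) with ||u - w||_inf <= eps and a(u) != a(w),
    one per switching point on the segment from z1 to z2 (ordered from z1 to z2)."""
    found = []

    def recurse(u, au, w, aw):
        if au == aw:
            return  # convex regions: equal endpoint responses mean no switch inside
        if max(abs(a - b) for a, b in zip(u, w)) <= eps:
            found.append((u, au, w, aw))
            return
        m = tuple((a + b) / 2 for a, b in zip(u, w))
        am = oracle(m)
        recurse(u, au, m, am)
        recurse(m, am, w, aw)

    z1, z2 = tuple(z1), tuple(z2)
    recurse(z1, oracle(z1), z2, oracle(z2))
    return found

# Hypothetical instance: three outcomes, each action concentrates on one outcome.
P = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
c = [0, 0, 0]
segments = switching_segments((0.6, 0.4, 0.0), (0.0, 0.4, 0.6),
                              lambda x: best_response(x, P, c), 1e-3)
```

Each returned tuple supplies the two labelled points that Algorithm 4 adds to the memory $\mathcal{M}$, and the number of oracle calls per switching point is logarithmic in $1/\varepsilon$.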

Here, we show how to recover $p_{1},\dots,p_{N}$ from the memory $\mathcal{M}$ . Suppose that when the algorithm terminates, we have $\mathcal{M}=\left\{(w_{l},a^{*}(w_{l}))\right\}_{l\in[L]}$ . We just solve for $(p_{1},\cdots,p_{N})$ that satisfies the following constraints,

$\displaystyle\begin{aligned}\langle w_{l},p_{a^{*}(w_{l})}-p_{a^{\prime}}\rangle&\geq c_{a^{*}(w_{l})}-c_{a^{\prime}},\quad\forall a^{\prime}\neq a^{*}(w_{l}),\quad\forall l\in[L],&&\text{(D.1)}\\ \langle\operatorname{\mathds{1}},p_{i}-p_{j}\rangle&=0,\quad\forall(i,j)\in[N]^{2}.&&\text{(D.2)}\end{aligned}$

In the sequel, we write ${\mathcal{S}}$ as the set of $(p_{1},\dots,p_{N})$ that satisfy Conditions (D.1) and (D.2). For Algorithm 4 to work, we introduce the following assumption on the volume of $\mathcal{V}_{i}$ .

Assumption 3 (Minimal Volume Ratio).

Let $\operatorname{Vol}^{d}(\mathcal{V})$ denote the $d$-dimensional volume of a set $\mathcal{V}\subset\mathbb{R}^{d}$. We assume that there exists $\varsigma\in(0,1]$ such that $\operatorname{Vol}^{d-1}(\mathcal{V}_{i})\geq\varsigma\cdot\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})$ for any $i\in[N]$.

The minimal volume ratio assumption guarantees that all the sections $\mathcal{V}_{i}$ are detectable via random sampling with high probability. We also make the following assumption on the cost difference.

Assumption 4 (Minimal Cost Difference).

We assume that $\inf_{1\leq i<j\leq N}\left|c_{i}-c_{j}\right|\geq\theta$ .

Specifically, we use the following definition of surface detection probability function.

Definition 3 (Surface Detection Probability Function).

Let ${\mathrm{Conv}}^{d-1}$ be the set of convex regions on some $(d-1)$ -dimensional hyperplane such that $e\subset\mathcal{P}^{d}$ for any $e\in{\mathrm{Conv}}^{d-1}$ . Define function $\sigma_{d}:[0,1]\rightarrow[0,1]$ as the pointwise maximum such that,

\displaystyle\mathbb{P}(\ell\cap e\neq\emptyset)\geq\sigma_{d}\left(\frac{\operatorname{Vol}^{d-1}(e)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}\right),\quad\forall e\in{\mathrm{Conv}}^{d-1}.

Note that $\sigma_{d}$ is a property inherent to the $d$ -dimensional probability simplex. We argue that $\sigma_{d}$ can be roughly viewed as a linear function for small $e$ . To characterize the searching result ${\mathcal{S}}$ of Algorithm 4, we present the following Lemma.

Lemma 6.

Under Assumptions 3, 4, suppose that $\varepsilon,c_{d}$ are chosen to satisfy

$\displaystyle\begin{aligned}\xi_{d}^{2}&:={\frac{c_{d}^{2}}{d^{2}}-\frac{d\varepsilon^{2}}{8}}>0,\\ \tau_{d}&:=\left(\frac{\varsigma^{2}}{3d}\right)^{d}-d^{2}\left(1+\frac{4}{d\varsigma}\right)\cdot(c_{d}+\varepsilon)>0.\end{aligned}$

After $T$ samples and no more than $\mathcal{O}(TNd^{2}\log(1/\varepsilon))$ rounds, with probability at least $1-Ne^{-T\sigma_{d}(\tau_{d})}$ , we have for any $(p_{1},\dots,p_{N})\in{\mathcal{S}}$ that

\displaystyle\left\|(p_{i}-p_{j})-(p_{i}^{*}-p_{j}^{*})\right\|_{2}\leq\frac{2(N-1)\sqrt{d}\cdot\varepsilon}{\xi_{d}^{d-1}\theta}.

To construct an efficient learning procedure, we need to determine the optimal value for $\varepsilon,c_{d}$ such that the total round number is minimized while the learning error is controlled by $\varepsilon$ .

Corollary 3.1.

By properly setting $c_{d}$ and $\varepsilon$ and running the simplex searching algorithm for $t$ rounds, we guarantee the learning error less than $\varepsilon$ with probability at least

1-N\exp\left(-\frac{t\cdot\sigma_{d}(\tau_{d})}{Nd^{4}\log(N\varsigma^{-2}\varepsilon^{-1}\theta^{-1})}\right),

where $\tau_{d}=(\varsigma^{2}/6d)^{2}$ is a constant.

Proof.

Define constant

\displaystyle\Upsilon:=\frac{\varsigma^{2d+1}}{2\cdot 3^{d}d^{d+1}(d\varsigma+4)}<1.

We let $c_{d}=\Upsilon/2$ . Then it suffices for the first condition to hold if $\varepsilon\leq\Upsilon/d^{3/2}$ . Moreover, we have $\xi_{d}\geq\Upsilon/\sqrt{8}d$ and the second condition holds automatically with $\tau_{d}\geq(\varsigma^{2}/6d)^{2}$ . Therefore, the constraint for $\varepsilon$ becomes,

\displaystyle\varepsilon\leq\min\left\{\frac{\Upsilon}{d^{3/2}},\frac{\varepsilon\theta}{2N\sqrt{d}}\left(\frac{\Upsilon}{\sqrt{8}d}\right)^{d-1}\right\}.

We can take equality for the optimal $\varepsilon$ . Obviously, the second term dominates, and we thus have the total rounds bounded by

$\displaystyle\begin{aligned}\text{total rounds}&\leq\mathcal{O}\left(TNd^{2}\left(\log\left(\frac{2N\sqrt{d}}{\varepsilon\theta}\right)+d\log\left(\frac{\sqrt{8}d}{\Upsilon}\right)\right)\right)\\ &=\mathcal{O}\left(TNd^{2}\left(\log\left(2N\sqrt{d}(\varepsilon\theta)^{-1}\right)+d^{2}\log\left(d\varsigma^{-2}\right)\right)\right),\end{aligned}$

where the failure probability is bounded by $Ne^{-T\sigma_{d}(\tau_{d})}$ with $\tau_{d}=(\varsigma^{2}/6d)^{2}$ being a constant. ∎

Proof of Lemma 6.

We consider an undirected graph $\mathcal{G}=(\mathcal{A},E)$ with the node set $\mathcal{A}$ and the edge set $E=\left\{e_{ij}{\,|\,}e_{ij}=\mathcal{V}_{i}\cap\mathcal{V}_{j},\forall i\neq j\right\}$ . Define event $\mathcal{E}_{ij}^{d}$ as follows.

Definition 4 (Surface Detection Event).

We say that event $\mathcal{E}_{ij}^{d}$ happens if there exists $t\in[T]$ and we have successfully searched for a $k_{t}$ such that for the $d$ -dimensional simplex $\mathbb{S}^{d}$ placed around $(x_{k_{t}}+y_{k_{t}})/2$ in Algorithm 4 and any $x\in\mathbb{S}^{d}$ , the best response at $x$ satisfies $a^{*}(x)\in\{i,j\}$ .

Simply put, the event $\mathcal{E}_{ij}^{d}$ guarantees that $\mathbb{S}^{d}$ only contains two possible actions $\{i,j\}$, so $e_{ij}$ can be successfully learned via binary searching for the intersections of the edges of the simplex with $e_{ij}$. The following proposition backs up our statement.

Proposition 2 (Intersection Geometry).

Under event $\mathcal{E}_{ij}^{d}$ and condition $\varepsilon/2<c_{d}/(d\sqrt{d+1})$, let $\mathbb{S}^{d}$ denote the simplex corresponding to $\mathcal{E}_{ij}^{d}$. Then $\mathbb{S}^{d}\cap e_{ij}$ contains a $(d-1)$-dimensional ball with radius at least

\displaystyle\xi_{d}:=\sqrt{\frac{c_{d}^{2}}{d(d+1)}-\frac{d\varepsilon^{2}}{4}}.

Proof.

Under event $\mathcal{E}_{ij}^{d}$, we claim that the hyperplane $e_{ij}$ must intersect the $\sqrt{d}\varepsilon/2$-ball $\mathbb{B}_{1}=\mathbb{B}(\sqrt{d}\varepsilon/2)$ centered at $(x_{k_{t}}+y_{k_{t}})/2$, since $e_{ij}$ passes through $\overline{x_{k_{t}}y_{k_{t}}}$, which lies inside $\mathbb{B}_{1}$. Moreover, we consider a ball $\mathbb{B}_{2}=\mathbb{B}(c_{d}/\sqrt{d(d+1)})$ also centered at $(x_{k_{t}}+y_{k_{t}})/2$. Since the smallest distance from the center of a $d$-dimensional simplex with length $\sqrt{2}c_{d}$ to any of its facets is $c_{d}/\sqrt{d(d+1)}$, we have $\mathbb{B}_{1}\subset\mathbb{B}_{2}\subset\mathbb{S}^{d}$ under the condition $\varepsilon/2<c_{d}/(d\sqrt{d+1})$. In addition, $\mathbb{B}_{2}$ is also the largest ball contained in $\mathbb{S}^{d}$. We therefore conclude that the intersection area satisfies,

\mathbb{B}_{2}\cap e_{ij}\subseteq\mathbb{S}^{d}\cap e_{ij}.

Since $\mathbb{S}^{d}$ only intersects with hyperplane $e_{ij}$, we have $\mathbb{B}_{2}\cap e_{ij}\subseteq\mathbb{S}^{d}\cap e_{ij}\subseteq e_{ij}$. Thus, the left-hand side corresponds to the intersection of a $d$-dimensional ball and a $(d-1)$-dimensional hyperplane. Since $e_{ij}$ intersects $\mathbb{B}_{1}$, a $(d-1)$-dimensional ball with radius at least

\displaystyle\xi_{d}=\sqrt{\frac{c_{d}^{2}}{d(d+1)}-\frac{d\varepsilon^{2}}{4}},

should be contained in $\mathbb{S}^{d}\cap e_{ij}\subseteq e_{ij}$ . ∎
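The radius $\xi_{d}$ follows from the Pythagorean relation between the two concentric balls: a hyperplane at distance at most $\sqrt{d}\varepsilon/2$ from the common center cuts $\mathbb{B}_{2}$ in a $(d-1)$-dimensional ball of radius at least $\sqrt{r_{2}^{2}-r_{1}^{2}}$. The sketch below evaluates this for illustrative values; the function names are hypothetical, and it uses the standard inradius $a/\sqrt{2d(d+1)}$ of a regular $d$-simplex with edge length $a$.

```python
import math

def inradius(d, c_d):
    """Inradius of a regular d-simplex with edge length sqrt(2) * c_d."""
    edge = math.sqrt(2) * c_d
    return edge / math.sqrt(2 * d * (d + 1))

def xi(d, c_d, eps):
    """Lower bound on the radius of the (d-1)-ball inside S^d intersected with e_ij."""
    r2 = inradius(d, c_d)          # radius of B_2
    r1 = math.sqrt(d) * eps / 2    # radius of B_1
    if r1 >= r2:
        raise ValueError("need eps / 2 < c_d / (d * sqrt(d + 1))")
    return math.sqrt(r2 ** 2 - r1 ** 2)

radius = xi(d=3, c_d=0.1, eps=0.01)
```

The guard in `xi` is exactly the condition $\varepsilon/2<c_{d}/(d\sqrt{d+1})$ of Proposition 2, rewritten as $r_{1}<r_{2}$.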

Proposition 2 characterizes the geometry of the intersection area $\mathbb{S}^{d}\cap e_{ij}$ under event $\mathcal{E}_{ij}^{d}$ . Specifically, the intersection area should contain a $(d-1)$ -dimensional ball with radius lower bounded by $\xi_{d}$ . Such a geometry is critical for solving the hyperplanes to small errors. We consider the following constraints,

$\displaystyle\begin{aligned}&\langle u_{k},p_{i}-p_{j}\rangle\geq c_{i}-c_{j},\quad\langle w_{k},p_{i}-p_{j}\rangle\leq c_{i}-c_{j},\quad\forall k\in[K],\\ &\langle\operatorname{\mathds{1}},p_{i}-p_{j}\rangle=0,&&\text{(D.3)}\end{aligned}$

where $(u_{k},w_{k})$ is the binary searching result on the edges of simplex $\mathbb{S}^{d}$ . Under event $\mathcal{E}_{ij}^{d}$ , let ${\mathcal{S}}_{ij}$ denote the set of $(p_{i},p_{j})$ satisfying these constraints. Apparently, ${\mathcal{S}}_{ij}$ is a relaxation of ${\mathcal{S}}$ . The following proposition characterizes the searching errors in ${\mathcal{S}}_{ij}$ under the event $\mathcal{E}_{ij}^{d}$ in terms of $p_{i}-p_{j}$ .

Proposition 3 (Local Hyperplane Learnability).

Under Assumption 4 and event $\mathcal{E}_{ij}^{d}$ , for any $(p_{i},p_{j})\in{\mathcal{S}}_{ij}$ , we have

\displaystyle\left\|(p_{i}-p_{j})-(p_{i}^{*}-p_{j}^{*})\right\|_{2}\leq\varphi_{d}^{-1}\sqrt{d}\varepsilon,

where

\displaystyle\varphi_{d}:=\left(\frac{\xi_{d}}{\sqrt{1-d^{-1}}}\right)^{d-1}\cdot\frac{\theta}{\sqrt{2}}\cdot\sqrt{\frac{d}{d+1}}.

Proof.

We first relax ${\mathcal{S}}_{ij}$ to

\displaystyle\check{\mathcal{S}}_{ij}:(p_{i},p_{j})\quad\mbox{subject to}\quad\langle s_{k},p_{i}-p_{j}\rangle=c_{i}-c_{j}+\epsilon_{k},\quad\langle\operatorname{\mathds{1}},p_{i}-p_{j}\rangle=0,\quad|\epsilon_{k}|\leq\varepsilon,\quad\forall k\in[K].

Note that $s_{1},\dots,s_{K}$ corresponds to the vertices of $\mathbb{S}^{d}\cap e_{ij}$ . Using Proposition 2, we pick $\widetilde{s}_{1},\dots,\widetilde{s}_{d}$ on the ball $\mathbb{B}_{2}\subset\mathbb{S}^{d}\cap e_{ij}$ that form a $(d-1)$ -dimensional simplex. Note that $\mathbb{S}^{d}\cap e_{ij}$ must be convex. Thus, we can express $\widetilde{s}_{1},\dots,\widetilde{s}_{d}$ as convex combinations of $s_{1},\dots,s_{K}$ . In the matrix form, we suppose

\displaystyle\begin{bmatrix}\widetilde{s}_{1}^{\top}\\ \cdots\\ \widetilde{s}_{d}^{\top}\end{bmatrix}=Q\begin{bmatrix}s_{1}^{\top}\\ \cdots\\ s_{K}^{\top}\end{bmatrix},

where $Q$ has row sums equal to $1$ . Using $\widetilde{s}_{1},\dots,\widetilde{s}_{d}$ , we can further relax $\check{\mathcal{S}}_{ij}$ by multiplying $\mathrm{diag}(Q,1)$ to the constraints,

\displaystyle\widetilde{\mathcal{S}}_{ij}:(p_{i},p_{j})\quad\mbox{subject to}\quad\underbrace{\begin{bmatrix}\widetilde{s}_{1}^{\top}\\ \vdots\\ \widetilde{s}_{d}^{\top}\\ \operatorname{\mathds{1}}_{d+1}^{\top}/(d+1)\end{bmatrix}}_{\widetilde{S}}(p_{i}-p_{j})=\begin{bmatrix}(c_{i}-c_{j})\operatorname{\mathds{1}}_{d}\\ 0\end{bmatrix}+\underbrace{\begin{bmatrix}\epsilon\\ 0\end{bmatrix}}_{\widetilde{\epsilon}},\quad\left\|\epsilon\right\|_{\infty}\leq\varepsilon.

Note that the first $d$ rows of $\widetilde{S}$ form a $(d-1)$-dimensional simplex on $e_{ij}$. Moreover, since $|c_{i}-c_{j}|\geq\theta$, we note that $\operatorname{\mathds{1}}_{d+1}/(d+1)$ does not lie on $e_{ij}$. To guarantee linear independence of the rows of $\widetilde{S}$, we aim to lower bound the distance from $\operatorname{\mathds{1}}_{d+1}/(d+1)$ to $e_{ij}$. For any $x\in e_{ij}$, we have $\langle x,p_{i}^{*}-p_{j}^{*}\rangle=c_{i}-c_{j}$, and for $\operatorname{\mathds{1}}_{d+1}/(d+1)$ we have $\langle\operatorname{\mathds{1}}_{d+1}/(d+1),p_{i}^{*}-p_{j}^{*}\rangle=0$. Thus, we use the Cauchy-Schwarz inequality and obtain,

\displaystyle\left\|x-\frac{\operatorname{\mathds{1}}_{d+1}}{d+1}\right\|_{2}\cdot\left\|p_{i}^{*}-p_{j}^{*}\right\|_{2}\geq\left|\Big\langle x-\frac{\operatorname{\mathds{1}}_{d+1}}{d+1},p_{i}^{*}-p_{j}^{*}\Big\rangle\right|=\left|c_{i}-c_{j}\right|\geq\theta.

Since $\left\lVert p_{i}^{*}-p_{j}^{*}\right\rVert_{2}\leq\sqrt{2}$, we conclude that for any $x\in e_{ij}$,

\displaystyle\left\|x-\frac{\operatorname{\mathds{1}}_{d+1}}{d+1}\right\|_{2}\geq\frac{\theta}{\sqrt{2}}.

Thus, the distance from $\operatorname{\mathds{1}}_{d+1}/(d+1)$ to $e_{ij}$ is at least $\theta/\sqrt{2}$, which indicates that the volume of the parallelepiped spanned by the rows of $\widetilde{S}$ is at least

$\displaystyle\begin{aligned}\operatorname{Vol}^{d}(\widetilde{S})&\geq\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\cdot\left(\frac{\xi_{d}}{\sqrt{1-d^{-1}}}\right)^{d-1}\cdot\frac{\theta}{\sqrt{2}d}\\ &=\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left(\frac{\xi_{d}}{\sqrt{1-d^{-1}}}\right)^{d-1}\cdot\frac{\theta}{\sqrt{2}}\cdot\sqrt{\frac{d}{d+1}}.\end{aligned}$

Therefore, the determinant of $\widetilde{S}$ satisfies

\displaystyle\det(\widetilde{S})\geq\left(\frac{\xi_{d}}{\sqrt{1-d^{-1}}}\right)^{d-1}\cdot\frac{\theta}{\sqrt{2}}\cdot\sqrt{\frac{d}{d+1}}=:\varphi_{d}.

Note that $\widetilde{S}$ has row sums equal to $1$ , which implies that all the eigenvalues of $\widetilde{S}$ should be no larger than $1$ . Hence, the smallest eigenvalue of $\widetilde{S}$ should be no less than $\varphi_{d}$ and consequently, the largest eigenvalue of $\widetilde{S}^{-1}$ should be no more than $\varphi_{d}^{-1}$ . Following the definition of $\widetilde{\mathcal{S}}_{ij}$ , we have for any $(p_{i},p_{j})\in\widetilde{\mathcal{S}}_{ij}$ that

\displaystyle\left\|(p_{i}-p_{j})-(p_{i}^{*}-p_{j}^{*})\right\|_{2}=\left\|\widetilde{S}^{-1}\widetilde{\epsilon}\right\|_{2}\leq\varphi_{d}^{-1}\left\|\widetilde{\epsilon}\right\|_{2}\leq\varphi_{d}^{-1}\sqrt{d}\varepsilon.

∎
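To make the linear system behind Proposition 3 concrete, the following sketch recovers $p_{i}-p_{j}$ in the noiseless case $\epsilon=0$ from points lying exactly on $e_{ij}$ together with the normalized all-ones row. The instance has three outcomes and is entirely hypothetical, and `solve` is a small illustrative helper using exact rational arithmetic rather than anything from the paper.

```python
from fractions import Fraction as F

def solve(A, b):
    """Gauss-Jordan elimination in exact rational arithmetic (A must be invertible)."""
    n = len(A)
    M = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

# Hypothetical ground truth: p_i* - p_j* and the cost gap c_i - c_j.
delta_star = [F(2, 5), F(-2, 5), F(0)]
cost_gap = F(1, 10)

# Two points of the simplex on the hyperplane <s, delta_star> = cost_gap,
# plus the normalized all-ones row enforcing <1, p_i - p_j> = 0.
s1 = [F(5, 8), F(3, 8), F(0)]
s2 = [F(1, 2), F(1, 4), F(1, 4)]
ones = [F(1, 3)] * 3

delta = solve([s1, s2, ones], [cost_gap, cost_gap, 0])
```

With binary-search noise, the right-hand side is perturbed by at most $\varepsilon$ per row, and the error in `delta` is then governed by how well conditioned the stacked matrix is, which is what the bound through $\varphi_{d}$ quantifies.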

Proposition 3 bridges the geometrical argument to the learning errors in terms of $p_{i}-p_{j}$ . To obtain such benefits, we need to characterize under what conditions and with what probability the event $\mathcal{E}_{ij}^{d}$ will occur. To start with, we study the volume of a surface that separates the probability simplex into two disjoint parts.

Proposition 4 (Surface Volume).

Suppose $A\subset\mathcal{P}^{d}$ is a compact subset such that $\operatorname{Vol}^{d}(A)>0$ and $\operatorname{Vol}^{d}(A^{c})>0$. Let $E=A\cap A^{c}$ be the surface that separates $A$ and $A^{c}$, and assume that $E$ comprises $J$ hyperplanes. It then follows that

\displaystyle\frac{\operatorname{Vol}^{d-1}(E)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}\geq J^{-(d-1)}\cdot\left(\frac{\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^{d}(A)}{\left(\operatorname{Vol}^{d}(\mathcal{P}^{d})\right)^{2}}\cdot\frac{2\sqrt{d+1}}{3d\sqrt{2d}}\right)^{d}.

Proof.

For any $x\in\mathcal{P}^{d}$ , if $x\in A$ , consider the following set

\displaystyle B(x)=\left\{y\in\mathcal{P}^{d}{\,\big|\,}\exists z\in A^{c}\text{ s.t. $x,y,z$ are on the same line $\ell$}\right\}.

Apparently, $A^{c}\subseteq B(x)$. Hence, we have $\operatorname{Vol}^{d}(B(x))\geq\operatorname{Vol}^{d}(A^{c})$. Similarly, for $x\in A^{c}$, we can define $B(x)$ as

\displaystyle B(x)=\left\{y\in\mathcal{P}^{d}{\,\big|\,}\exists z\in A\text{ s.t. $x,y,z$ are on the same line $\ell$}\right\},

and it follows that $\operatorname{Vol}^{d}(B(x))\geq\operatorname{Vol}^{d}(A)$ . Following these observations, we have

\displaystyle\mathcal{L}_{B}:=\int_{x\in\mathcal{P}}\operatorname{Vol}^{d}(B(x)){\mathrm{d}}x\geq 2\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^{d}(A).\qquad\text{(D.4)}

Another important fact is that the line $\ell$ connecting $x,y,z$ must go through $E$ at least once, since $z$ and $x$ belong to different areas. For any $(d-1)$-dimensional surface $S\subset\mathcal{P}^{d}$ and any point $x\in\mathcal{P}^{d}$, we define

\displaystyle C(S,x)=\left\{y\in\mathcal{P}^{d}{\,\big|\,}\exists z\in S\text{ s.t. $x,y,z$ are on the same line}\right\}.

Apparently, we have $B(x)\subseteq C(E,x)$ and thus $\mathcal{L}_{B}\leq\mathcal{L}_{C(E,\cdot)}$. Consider $S=s_{1}\cup s_{2}$. It follows from the definition that $C(S,x)=C(s_{1},x)\cup C(s_{2},x)$, which suggests that $\operatorname{Vol}^{d}(C(S,x))\leq\operatorname{Vol}^{d}(C(s_{1},x))+\operatorname{Vol}^{d}(C(s_{2},x))$. Supposing that $E=E_{1}\cup E_{2}\cup\dots\cup E_{J}$, it then holds that

\displaystyle\mathcal{L}_{C(E,\cdot)}=\int_{x\in\mathcal{P}}\operatorname{Vol}^{d}(C(E,x)){\mathrm{d}}x\leq\sum_{E_{i}}\underbrace{\int_{x\in\mathcal{P}}\operatorname{Vol}^{d}(C(E_{i},x)){\mathrm{d}}x}_{=:R(E_{i})}.

Suppose each $E_{i}$ is small enough that it lies on a hyperplane (or approximate the surface $E$ with countably many hyperplanes). We draw hyperplanes $S_{1}^{i}\parallel S_{2}^{i}\parallel E_{i}$ such that the farthest two vertices of the simplex $\mathcal{P}^{d}$ lie on $S_{1}^{i}$ and $S_{2}^{i}$. We denote the area between $S_{1}^{i}$ and $S_{2}^{i}$ by $\mathcal{Q}^{i}$. We separate $\mathcal{Q}^{i}$ into $\mathcal{Q}^{i}_{1}$ and $\mathcal{Q}^{i}_{2}$, where $\mathcal{Q}^{i}_{1}$ is defined as the set of points that have a distance larger than $\gamma_{i}\leq\sqrt{2}/2$ to the hyperplane containing $E_{i}$, and $\mathcal{Q}^{i}_{2}=\mathcal{Q}^{i}\backslash\mathcal{Q}^{i}_{1}$. Consider the following definition for $x\in\mathcal{Q}^{i}_{1}$,

\displaystyle\widetilde{C}(E_{i},x)=\left\{y\in\mathcal{Q}^{i}{\,\big|\,}\exists z\in E_{i}\text{ s.t. $x,y,z$ are on the same line}\right\},

and for $x\in\mathcal{Q}^{i}_{2}$,

\displaystyle\widetilde{C}(E_{i},x)=\left\{y\in\mathcal{P}^{d}{\,\big|\,}\exists z\in E_{i}\text{ s.t. $x,y,z$ are on the same line}\right\}.

Apparently, $C(E_{i},x)\subseteq\widetilde{C}(E_{i},x)$ and we thus have

$\displaystyle\begin{aligned}R(E_{i})&\leq\int_{x\in\mathcal{Q}^{i}_{1}}\operatorname{Vol}^{d}(\widetilde{C}(E_{i},x)){\mathrm{d}}x+\int_{x\in\mathcal{Q}^{i}_{2}}\operatorname{Vol}^{d}(\widetilde{C}(E_{i},x)){\mathrm{d}}x\\ &\leq\operatorname{Vol}^{d}(\mathcal{Q}^{i}_{1})\cdot\operatorname{Vol}^{d-1}(E_{i})\cdot\left(\frac{\sqrt{2}}{\gamma_{i}}\right)^{d-1}\cdot\sqrt{2}+\operatorname{Vol}^{d}(\mathcal{Q}^{i}_{2})\cdot\operatorname{Vol}^{d}(\mathcal{P}^{d})\\ &\leq\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left((\sqrt{2})^{d}\operatorname{Vol}^{d-1}(E_{i})\gamma_{i}^{-(d-1)}+2\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\cdot\gamma_{i}\right),\end{aligned}$

where the second inequality holds from the definition of $\mathcal{Q}_{1}^{i}$ and the fact that the largest hyperplane contained in $\widetilde{C}(E_{i},x)$ is no more than $(\sqrt{2}/\gamma_{i})^{d-1}\operatorname{Vol}^{d-1}(E_{i})$ . Here, since $\gamma_{i}$ is adjustable, we plug in

\displaystyle\gamma_{i}=\sqrt{2}\cdot\left(\frac{\operatorname{Vol}^{d-1}(E_{i})}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}\right)^{1/d},

and obtain

\displaystyle R(E_{i})\leq 3\sqrt{2}\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left(\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\right)^{(d-1)/d}\cdot\left(\operatorname{Vol}^{d-1}(E_{i})\right)^{1/d}.
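For completeness, the substitution can be checked term by term. Writing $V_{E}=\operatorname{Vol}^{d-1}(E_{i})$ and $V_{P}=\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})$ as shorthand used only in this remark, the choice $\gamma_{i}=\sqrt{2}(V_{E}/V_{P})^{1/d}$ gives

$\displaystyle\begin{aligned}(\sqrt{2})^{d}V_{E}\gamma_{i}^{-(d-1)}&=(\sqrt{2})^{d}V_{E}\cdot(\sqrt{2})^{-(d-1)}\left(\frac{V_{E}}{V_{P}}\right)^{-(d-1)/d}=\sqrt{2}\,V_{P}^{(d-1)/d}V_{E}^{1/d},\\ 2V_{P}\gamma_{i}&=2\sqrt{2}\,V_{P}\left(\frac{V_{E}}{V_{P}}\right)^{1/d}=2\sqrt{2}\,V_{P}^{(d-1)/d}V_{E}^{1/d},\end{aligned}$

so the two terms inside the bracket of the bound on $R(E_{i})$ sum to $3\sqrt{2}\,V_{P}^{(d-1)/d}V_{E}^{1/d}$, which is the stated constant.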

By summing over $i\in[J]$ and using Jensen's inequality, we obtain

\displaystyle\mathcal{L}_{C(E,\cdot)}\leq 3\sqrt{2}\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left(J\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\right)^{(d-1)/d}\cdot\left(\operatorname{Vol}^{d-1}(E)\right)^{1/d}.\qquad\text{(D.5)}

Combining (D.5) with (D.4) and using the inequality $\mathcal{L}_{C(E,\cdot)}\geq\mathcal{L}_{B}$ , we obtain

\displaystyle 2\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^{d}(A)\leq 3% \sqrt{2}\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left(J\operatorname{Vol}^% {d-1}(\mathcal{P}^{d-1})\right)^{(d-1)/d}\cdot\left(\operatorname{Vol}^{d-1}(E% )\right)^{1/d},

which further implies that

	$\displaystyle\operatorname{Vol}^{d-1}(E)$	$\displaystyle\geq\left(\frac{2\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^% {d}(A)}{3\sqrt{2}\operatorname{Vol}^{d}(\mathcal{P}^{d})\cdot\left(J% \operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\right)^{(d-1)/d}}\right)^{d}$
		$\displaystyle=J^{-(d-1)}\cdot\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})\cdot% \left(\frac{2\operatorname{Vol}^{d}(A^{c})\operatorname{Vol}^{d}(A)}{3\sqrt{2}% \left(\operatorname{Vol}^{d}(\mathcal{P}^{d})\right)^{2}}\right)^{d}\cdot\left% (\frac{\sqrt{d+1}}{d\sqrt{d}}\right)^{d}.$

∎

Following Proposition 4, a direct conclusion is that if $A$ contains $N_{1}$ action sections and $A^{c}$ contains $N_{2}$ action sections ( $N=N_{1}+N_{2}$ ), there exists a surface $E_{k}$ such that

\displaystyle\frac{\operatorname{Vol}^{d-1}(E_{k})}{\operatorname{Vol}^{d-1}(% \mathcal{P}^{d-1})}\geq J^{-d}\cdot\left(\frac{\operatorname{Vol}^{d}(A^{c})% \operatorname{Vol}^{d}(A)}{\left(\operatorname{Vol}^{d}(\mathcal{P}^{d})\right% )^{2}}\cdot\frac{2\sqrt{d+1}}{3d\sqrt{2d}}\right)^{d}\geq\left(\frac{2\sqrt{d+% 1}\varsigma^{2}}{3d\sqrt{2d}}\right)^{d},

where the last inequality follows from Assumption 3 and the fact that $J\leq N_{1}N_{2}$. However, a sampled line $\ell$ passing through such a surface $e_{ij}$ does not guarantee event $\mathcal{E}_{ij}^{d}$, even though $\operatorname{Vol}^{d-1}(e_{ij})$ is bounded below. The following proposition states a sufficient condition for $\mathcal{E}_{ij}^{d}$ to hold.

Proposition 5 (Effective Surface).

Define the effective surface $\widetilde{e}_{ij}$ as

\displaystyle\widetilde{e}_{ij}=\left\{x\in e_{ij}{\,|\,}\left\|x-y\right\|_{2% }\geq\iota+h,\quad\forall y\in\partial e_{ij}\right\},

where $\iota=2{\sqrt{2(d+1)}h}/{(d\sqrt{d}\varsigma)}$ , $h=c_{d}+\varepsilon>c_{d}\sqrt{1-(d+1)^{-1}}+\varepsilon$ . Event $\mathcal{E}_{ij}^{d}$ holds for any searching line $\ell$ crossing $\widetilde{e}_{ij}$ under Assumption 3.

Proof.

Under condition $\iota>0$ , we guarantee that the simplex does not intersect with $\partial e_{ij}$ . However, such a condition is not sufficient if we want to guarantee that the simplex $\mathbb{S}^{d}$ only contains actions $i,j$ since a third action may occur outside $e_{ij}$ . In the sequel, we will show that it is impossible for a third action to appear if $\ell$ goes through $\widetilde{e}_{ij}$ . Define $e_{ij}^{\prime}$ as,

\displaystyle e_{ij}^{\prime}=\left\{x\in e_{ij}{\,|\,}\exists y\in\widetilde{% e}_{ij},\text{ s.t. }\left\|y-x\right\|_{2}\leq h\right\}.

Clearly, for any $x\in e^{\prime}_{ij}$ and $y\in\partial e_{ij}$, we have $\left\|x-y\right\|_{2}\geq\iota$. Now, we aim to prove that the prism $A$ with base $e_{ij}^{\prime}$ and height $h$ belongs to either $\mathcal{V}_{i}$ or $\mathcal{V}_{j}$. Suppose that $A$ lies on action $i$’s side and there exists a point in $A$ whose best response is not $i$. Then, there must be a hyperplane $e_{ik}$ passing through $A$ for some $k\notin\{i,j\}$. Suppose $z\in e_{ik}\cap A$. Consider the set $E_{ik}\cap E_{ij}$, where $E_{ij}$ is the whole hyperplane containing $e_{ij}$. For any $x\in e_{ij}^{\prime}$ and $y\in E_{ik}\cap E_{ij}$, we have $\left\|x-y\right\|_{2}\geq\iota$ since $E_{ik}\cap E_{ij}$ must lie outside $e_{ij}$. In addition, since $\mathcal{V}_{i}$ is a convex set, $\mathcal{V}_{i}$ must lie between $E_{ik}$ and $E_{ij}$. Therefore, the volume of $\mathcal{V}_{i}$ is no more than the volume of the region that lies between $E_{ik}$ and $E_{ij}$,

\displaystyle\operatorname{Vol}^{d}(\mathcal{V}_{i})\leq\operatorname{Vol}^{d-% 1}(\mathcal{P}^{d-1})\cdot\frac{\sqrt{2}h}{\iota}=\operatorname{Vol}^{d}(% \mathcal{P}^{d})\cdot\frac{\sqrt{2(d+1)}h}{d\sqrt{d}\iota}.

By Assumption 3, we conclude that

\displaystyle\iota\leq\frac{\sqrt{2(d+1)}h}{d\sqrt{d}\varsigma},

which conflicts with the condition $\iota>{\sqrt{2(d+1)}h}/{(d\sqrt{d}\varsigma)}$ . Thus, we conclude that $A\subseteq\mathcal{V}_{i}$ . The same argument also applies to the conclusion $B\subseteq\mathcal{V}_{j}$ if $B$ lies on action $j$ ’s side. For any line $\ell$ passing through $\widetilde{e}_{ij}$ , let $x=\ell\cap\widetilde{e}_{ij}$ . Then we have $x\in A\cup B$ and the minimal distance from $x$ to $\partial(A\cup B)$ is at least $h>\varepsilon$ . Thus, $x$ is guaranteed to be detected when doing the binary search on $\ell$ . On the other hand, $h-\varepsilon>c_{d}\sqrt{1-(d+1)^{-1}}$ , meaning that the simplex $\mathbb{S}^{d}$ placed at $(x_{k}+y_{k})/2$ also lies within $A\cup B$ . Thus, the simplex only contains two actions and event $\mathcal{E}_{ij}^{d}$ follows. ∎

Proposition 5 characterizes a sufficient condition for $\mathcal{E}_{ij}^{d}$ to hold, namely that the searching line $\ell$ passes through the effective surface $\widetilde{e}_{ij}$. The next question is whether we can find enough effective surfaces via random sampling. To answer it, we need to argue that there are sufficiently many effective surfaces with surface volume bounded below. It is equally important to lower bound the probability that $\ell$ passes through an effective surface with positive surface volume. Recall the definition of $\sigma_{d}$:

\displaystyle\mathbb{P}(\ell\cap e\neq\emptyset)\geq\sigma_{d}\left(\frac{% \operatorname{Vol}^{d-1}(e)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}% \right),\quad\forall e\in{\mathrm{Conv}}^{d-1}.

Next, we study how many effective surfaces need to be detected. From Proposition 3, if event $\mathcal{E}_{ij}^{d}$ holds, we can estimate $p_{i}-p_{j}$ up to small errors. To learn all pairwise outcome distribution differences, we just need to construct a tree in graph $\mathcal{G}$, where an edge $e_{ij}$ is selected if $\mathcal{E}_{ij}^{d}$ happens. To make this precise, we invoke the following definition.

Definition 5 (Connected Components).

We say that actions $i$ and $j$ belong to the same connected component if there exists a path $i,k_{1},\dots,k_{n},j$ such that events $\mathcal{E}_{ik_{1}}^{d},\mathcal{E}_{k_{1}k_{2}}^{d},\dots,\mathcal{E}_{k_{n}j}^{d}$ happen.

In the sequel, we use $C$ to denote a connected component. With a slight abuse of notation, we also denote by $C$ the union of sections $\mathcal{V}_{k}$ such that $k\in C$. Using Proposition 4 and the discussion following it, we take $C$ as $A$ and $C^{c}$ as $A^{c}$, and it follows that there exists a surface $e_{ij}$ on the boundary of $C$ and $C^{c}$ such that

\displaystyle\frac{\operatorname{Vol}^{d-1}(e_{ij})}{\operatorname{Vol}^{d-1}(% \mathcal{P}^{d-1})}\geq\left(\frac{2\sqrt{d+1}\varsigma^{2}}{3d\sqrt{2d}}% \right)^{d}.

Here, we assume that $i\in C$ and $j\in C^{c}$. Note that $\operatorname{Vol}^{d-2}(\partial e)\leq\operatorname{Vol}^{d-2}(\partial\mathcal{P}^{d-1})$ because $e$ is convex. Recall that the effective surface $\widetilde{e}_{ij}$ is obtained by shrinking $e_{ij}$ inward by distance $\iota+h$. Therefore, the volume of $\widetilde{e}_{ij}$ is at least

	$\displaystyle\frac{\operatorname{Vol}^{d-1}(\widetilde{e}_{ij})}{\operatorname% {Vol}^{d-1}(\mathcal{P}^{d-1})}$	$\displaystyle\geq\frac{\operatorname{Vol}^{d-1}(e_{ij})-\operatorname{Vol}^{d-% 2}(\partial e)(h+\iota)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}$
		$\displaystyle\geq\frac{\left(\frac{2\sqrt{d+1}\varsigma^{2}}{3d\sqrt{2d}}% \right)^{d}\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})-d\operatorname{Vol}^{d-% 2}(\mathcal{P}^{d-2})(h+\iota)}{\operatorname{Vol}^{d-1}(\mathcal{P}^{d-1})}$
		$\displaystyle\geq\left(\frac{2\sqrt{d+1}\varsigma^{2}}{3d\sqrt{2d}}\right)^{d}% -{(d-1)\sqrt{(d-1)d}}\left(1+2\frac{\sqrt{2(d+1)}}{(d\sqrt{d}\varsigma)}\right% )\cdot h$
		$\displaystyle=:\tau_{d}.$

Therefore, with probability $\sigma_{d}(\tau_{d})$ event $\mathcal{E}_{ij}^{d}$ will happen in the next sample following Proposition 5, which also means that $C$ will expand. Since $C$ will expand for at most $N-1$ times, we have after $T=b\log(N-1)/\sigma_{d}(\tau_{d})$ samples that $C=\mathcal{A}$ with probability at least $1-(1/N)^{b-1}$ . Moreover, when $C=\mathcal{A}$ , the error for estimating $p_{i}-p_{j}$ is bounded by

\displaystyle\left\|(p_{i}-p_{j})-(p_{i}^{*}-p_{j}^{*})\right\|_{2}\leq(N-2)% \varphi_{d}^{-1}\sqrt{d}\varepsilon.

∎

Appendix E Proofs in Section 4

E.1 Preliminaries for Regret Analysis

Visitation Measure.

We define the state visitation measure $\rho^{\bm{\pi}}_{h}\in\Delta({\mathcal{S}})$ at step $h$ induced by action policy $\bm{\pi}$ as

\rho^{\bm{\pi}}_{h}(s):=\mathop{\mathbf{E}}_{\bm{\pi},\{P_{h}\}_{h=0}^{H}}% \left[\operatorname{\mathds{1}}(s_{h}=s)\right]

Given that $\rho_{1}^{\bm{\pi}}=P_{0}$, the visitation measure can be computed iteratively for $h=1,\dots,H$, yielding $\rho_{2}^{\bm{\pi}},\dots,\rho_{H+1}^{\bm{\pi}}$, as

\rho^{\bm{\pi}}_{h+1}(s)=\langle\rho_{h}^{\bm{\pi}}\otimes\pi_{h},P_{h}\rangle% _{{\mathcal{S}}\times\mathcal{A}}=\sum_{s^{\prime}\in{\mathcal{S}}}\sum_{a\in% \mathcal{A}}\rho_{h}^{\bm{\pi}}(s^{\prime})\pi_{h}(a|s^{\prime})P_{h}(s|s^{% \prime},a).

With this definition, we have for any $\bm{x}\in\mathcal{X},\bm{\pi}\in\Pi$ ,

U^{\bm{x},\bm{\pi}}=\sum_{h=1}^{H}\langle\rho_{h}^{\bm{\pi}}\otimes\pi_{h},P_{% h}\cdot x_{h}-c_{h}\rangle_{{\mathcal{S}}\times\mathcal{A}}=\sum_{h=1}^{H}\sum% _{s\in{\mathcal{S}}}\sum_{a\in\mathcal{A}}\rho_{h}^{\bm{\pi}}(s)\pi_{h}(a|s)[P% _{h}(s,a)\cdot x_{h}(s)-c_{h}(s,a)],

V^{\bm{x},\bm{\pi}}=\sum_{h=1}^{H}\langle\rho_{h}^{\bm{\pi}}\otimes\pi_{h},r_{% h}-P_{h}\cdot x_{h}\rangle_{{\mathcal{S}}\times\mathcal{A}}=\sum_{h=1}^{H}\sum% _{s\in{\mathcal{S}}}\sum_{a\in\mathcal{A}}\rho_{h}^{\bm{\pi}}(s)\pi_{h}(a|s)[r% _{h}(s,a)-P_{h}(s,a)\cdot x_{h}(s)].
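The forward recursion and the two value formulas above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the nested-list layout, and the 0-based step index (so `rho[0]` plays the role of $\rho_{1}^{\bm{\pi}}=P_{0}$) are assumptions made here, not part of the paper's algorithms.

```python
def visitation_measures(P0, P, pi):
    """Forward recursion rho_{h+1}(s) = sum_{s',a} rho_h(s') pi_h(a|s') P_h(s|s',a).

    P0[s]          : initial state distribution
    P[h][s][a][s2] : transition probability at step h (0-indexed)
    pi[h][s][a]    : probability of action a in state s at step h
    Returns rho with rho[h][s] for h = 0, ..., H (H + 1 entries).
    """
    S, H = len(P0), len(P)
    rho = [list(P0)]
    for h in range(H):
        nxt = [0.0] * S
        for s in range(S):
            for a, pa in enumerate(pi[h][s]):
                w = rho[h][s] * pa  # joint weight of (s, a) at step h
                for s2 in range(S):
                    nxt[s2] += w * P[h][s][a][s2]
        rho.append(nxt)
    return rho


def principal_agent_values(P0, P, pi, x, r, c):
    """Return (U, V): the agent's utility and the principal's value.

    x[h][s][s2] : payment at step h for the transition s -> s2
    r[h][s][a]  : principal's expected reward
    c[h][s][a]  : agent's cost
    """
    S = len(P0)
    rho = visitation_measures(P0, P, pi)
    U = V = 0.0
    for h in range(len(P)):
        for s in range(S):
            for a, pa in enumerate(pi[h][s]):
                # Expected payment P_h(s, a) . x_h(s)
                pay = sum(P[h][s][a][s2] * x[h][s][s2] for s2 in range(S))
                w = rho[h][s] * pa
                U += w * (pay - c[h][s][a])
                V += w * (r[h][s][a] - pay)
    return U, V
```

Note that $U^{\bm{x},\bm{\pi}}+V^{\bm{x},\bm{\pi}}$ does not depend on the contract, since the payment terms cancel and only $r_{h}-c_{h}$ remains.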

Parameter Estimation.

Algorithm 1 omits the details of how it estimates the parameters based on the observed trajectory $\{(s_{h}^{t},a_{h}^{t},r_{h}^{t})\}_{h\in[H]}$ in each episode. Specifically, at the end of $t$ -th episode, given the trajectory $\{(s_{h}^{t},a_{h}^{t},r_{h}^{t})\}_{h\in[H]}$ , it updates the counting variables for all $h\in[H]$ ,

	$\displaystyle N_{h}^{t+1}(s,a)$	$\displaystyle\leftarrow N_{h}^{t}(s,a)+{\bf 1}[s^{t}_{h}=s,a^{t}_{h}=a],$
	$\displaystyle N_{h}^{t+1}(s,a;s^{\prime})$	$\displaystyle\leftarrow N_{h}^{t}(s,a;s^{\prime})+{\bf 1}[s^{t}_{h}=s,a^{t}_{h% }=a,s^{t}_{h+1}=s^{\prime}],$
	$\displaystyle N_{h}^{t}(s)$	$\displaystyle=\sum_{a\in\mathcal{A}}N_{h}^{t}(s,a),$

and the empirical mean reward and bonus for all $h\in[H],s,s^{\prime}\in{\mathcal{S}},a\in\mathcal{A}$ ,

	$\displaystyle\widehat{r}_{h}^{t}(s,a)$	$\displaystyle\leftarrow\frac{\sum_{\tau\in[t-1]}{\bf 1}[s^{\tau}_{h}=s,a^{\tau% }_{h}=a]r^{\tau}_{h}(s,a)}{N^{t}_{h}(s,a)},$
	$\displaystyle\widehat{P}^{t}_{h}(s^{\prime}|s,a)$	$\displaystyle\leftarrow\frac{N^{t}_{h}(s,a;s^{\prime})}{N^{t}_{h}(s,a)},$
	$\displaystyle b_{h}^{t}(s,a)$	$\displaystyle\leftarrow(2H+2)\sqrt{\frac{\ln(SAHT/\delta)}{N_{h}^{t}(s,a)}}.$
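A minimal sketch of these updates for a single step $h$ is given below. The class name `StepEstimator` and the convention of clamping the count at 1 for unvisited pairs are assumptions made for illustration; the paper does not specify how $N_{h}^{t}(s,a)=0$ is handled.

```python
import math
from collections import defaultdict


class StepEstimator:
    """Counts and empirical estimates for a single step h (illustrative helper)."""

    def __init__(self, S, A, H, T, delta):
        self.H = H
        self.log_term = math.log(S * A * H * T / delta)  # ln(SAHT / delta)
        self.n_sa = defaultdict(int)     # N_h(s, a)
        self.n_sas = defaultdict(int)    # N_h(s, a; s')
        self.r_sum = defaultdict(float)  # sum of rewards observed at (s, a)

    def update(self, s, a, r, s_next):
        """Record one transition (s, a, r, s') observed at this step."""
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r

    def _n(self, s, a):
        # Assumption: clamp at 1 so unvisited pairs do not divide by zero.
        return max(self.n_sa.get((s, a), 0), 1)

    def r_hat(self, s, a):
        """Empirical mean reward."""
        return self.r_sum.get((s, a), 0.0) / self._n(s, a)

    def p_hat(self, s, a, s_next):
        """Empirical transition probability N(s, a; s') / N(s, a)."""
        return self.n_sas.get((s, a, s_next), 0) / self._n(s, a)

    def bonus(self, s, a):
        """Bonus (2H + 2) * sqrt(ln(SAHT / delta) / N(s, a))."""
        return (2 * self.H + 2) * math.sqrt(self.log_term / self._n(s, a))
```

One estimator per step $h\in[H]$ would be kept, and `update` would be called once per step with the transition observed in the current episode.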

For notational convenience, in the sequel we write the $t$-th episode for the $(T_{1}+t)$-th episode overall, that is, the $t$-th episode after running the $\chi(\varepsilon)$-learning procedure for $T_{1}$ rounds.

Lemma 7 (Agarwal et al. [5]).

Fix $\delta\in(0,1)$ . In any episode $t$ , $\forall s\in{\mathcal{S}},a\in\mathcal{A},h\in[H]$ , with probability at least $1-\delta$ , we have

\left|\widehat{r}^{t}_{h}(s,a)-{r}(s,a)\right|\leq 2\sqrt{\frac{\ln(SAHT/% \delta)}{N_{h}^{t}(s,a)}},\quad\left\lVert\widehat{P}_{h}^{t}(s,a)-P_{h}(s,a)% \right\rVert_{1}\leq 2\sqrt{\frac{\ln(SAHT/\delta)}{N_{h}^{t}(s,a)}},

\left|\left(\widehat{P}_{h}^{t}(s,a)-P_{h}(s,a)\right)^{\top}f\right|\leq 8H% \sqrt{\frac{S\ln(SAHT/\delta)}{N_{h}^{t}(s,a)}},\quad\forall f:{\mathcal{S}}% \to[0,H].

We denote by $\mathcal{E}_{\text{model}}$ the event that the inequalities in Lemmas 7 and 10 hold. To make our notation consistent, we let $R^{\bm{\pi}}:=\mathop{\mathbf{E}}_{s\sim P_{0}}R_{1}^{\bm{\pi}}(s)$ and $C^{\bm{\pi}}:=\mathop{\mathbf{E}}_{s\sim P_{0}}C_{1}^{\bm{\pi}}(s)$ such that $V^{\bm{\pi}}=R^{\bm{\pi}}-C^{\bm{\pi}}-U^{\bm{\pi}}=R^{\bm{\pi}}-\zeta^{\bm{\pi}}$.

Following the optimism principle, we determine the optimal action policy to explore from $\mathop{argmax}_{\bm{\pi}}\widehat{R}_{h}^{\bm{\pi}}(s)-\widehat{\zeta}^{\bm{% \pi}}_{h}(s)$ — the optimism is preserved as long as the difference of $\widehat{\zeta}_{h}^{\bm{\pi}}(s)-{\zeta}^{\bm{\pi}}_{h}(s)$ is relatively small. Here, with some carefully chosen amount of reward bonus $b_{h}$ , $\widehat{R}_{h}^{\bm{\pi}}(s)$ is the principal’s expected reward under optimism for a given action policy $\bm{\pi}$ at $h$ -th step with state $s$ , i.e.,

\widehat{R}_{h}^{\bm{\pi}}(s)=\mathop{min}\{H,\ \widehat{r}_{h}(s,\pi_{h}(s))+% b_{h}(s,\pi_{h}(s))+\widehat{P}_{h}(s,\pi_{h}(s))\cdot\widehat{R}^{\bm{\pi}}_{% h+1}\}.

Lemma 8 (Optimism).

If $\mathcal{E}_{\text{model}}$ is true, for any $\bm{\pi}$ , in any episode $t$ , $\widehat{R}^{t,\bm{\pi}}(s)-R^{\bm{\pi}}(s)\geq 0$ .

Proof.

We prove by induction that $\widehat{R}^{t,\bm{\pi}}(s)-R^{\bm{\pi}}(s)\geq 0$. Starting from step $H+1$, the base case holds since $\widehat{R}_{H+1}^{t,\bm{\pi}}(s)=R_{H+1}^{\bm{\pi}}(s)=0,\forall s\in{\mathcal{S}}$. For the inductive step, given that $\widehat{R}_{h+1}^{t,\bm{\pi}}(s)\geq R_{h+1}^{\bm{\pi}}(s),\forall s\in{\mathcal{S}}$, we can derive the following inequality for any $s\in{\mathcal{S}}$, where $a=\pi_{h}(s)$:

	$\displaystyle\widehat{R}_{h}^{t,\bm{\pi}}(s)-R_{h}^{\bm{\pi}}(s)$	$\displaystyle=\widehat{r}^{t}_{h}(s,\pi_{h}(s))+b^{t}_{h}(s,a)-r_{h}(s,a)+% \widehat{P}^{t}_{h}(s,a)\cdot\widehat{R}^{t,\bm{\pi}}_{h+1}-P_{h}(s,a)\cdot R^% {\bm{\pi}}_{h+1}$
		$\displaystyle\geq\widehat{r}^{t}_{h}(s,a)+b^{t}_{h}(s,a)-r_{h}(s,a)+\widehat{P% }^{t}_{h}(s,a)\cdot R^{\bm{\pi}}_{h+1}-P_{h}(s,a)\cdot R^{\bm{\pi}}_{h+1}$
		$\displaystyle=\widehat{r}^{t}_{h}(s,a)+b^{t}_{h}(s,a)-r_{h}(s,a)+\left(% \widehat{P}_{h}^{t}(s,a)-P_{h}(s,a)\right)\cdot R^{\bm{\pi}}_{h+1}$
		$\displaystyle\geq b_{h}^{t}(s,a)-(2+2H)\sqrt{\frac{\ln(SAHT/\delta)}{N_{h}^{t}% (s,a)}}\geq 0,$

where the first inequality is due to the induction hypothesis $\widehat{R}_{h+1}^{t,\bm{\pi}}(s)\geq R_{h+1}^{\bm{\pi}}(s),\forall s\in{\mathcal{S}}$, and the remaining inequalities are due to Lemma 7 and the definition of the bonus $b_{h}^{t}$. This completes the induction and concludes the proof. ∎

Lemma 9.

If $\mathcal{E}_{\text{model}}$ is true, $\sum_{t=1}^{T}\widehat{R}^{t,\bm{\pi}^{t}}(s)-R^{\bm{\pi}^{t}}(s)=O(H^{2}S% \sqrt{AT\ln(SAHT/\delta)})$ .

Proof.

We apply the simulation lemma [5] to bound the difference term for each episode $t$ in $E_{2}$ ,

	$\displaystyle\widehat{R}^{t,\bm{\pi}^{t}}-R^{\bm{\pi}^{t}}$	$\displaystyle\leq\sum_{h=1}^{H}\langle{\rho}^{\bm{\pi}^{t}}_{h}\otimes\pi_{h}^% {t},b^{t}_{h}+\left(\widehat{P}_{h}^{t}-P_{h}\right)\cdot\widehat{R}^{t,\bm{% \pi}^{t}}_{h+1}\rangle_{{\mathcal{S}}\times\mathcal{A}}$
		$\displaystyle\leq\sum_{h=1}^{H}\mathop{\mathbf{E}}_{(s,a)\sim\rho^{\bm{\pi}^{t}}_{h}\otimes\pi_{h}^{t}}\big[10H\sqrt{\frac{S\ln(SAHT/\delta)}{N_{h}^{t}(s,a)}}\big],$

where the last inequality follows from Lemma 7.

The cumulative loss for all $T$ episodes can be bounded as,

	$\displaystyle\sum_{t=1}^{T}\widehat{R}^{t,\bm{\pi}^{t}}-R^{\bm{\pi}^{t}}$	$\displaystyle=10H\sqrt{S\ln(SAHT/\delta)}\mathop{\mathbf{E}}\bigg{[}\sum_{t=1}% ^{T}\sum_{h=1}^{H}\sqrt{1/N_{h}^{t}(s,a)}\bigg{]}$
		$\displaystyle\leq 20H^{2}\sqrt{S\ln(SAHT/\delta)}\sqrt{SAT}$
		$\displaystyle=O(H^{2}S\sqrt{AT\ln(SAHT/\delta)}),$

where the first equality is by the linearity of expectation, and the inequality is by the fact $\sum_{i=1}^{N}1/\sqrt{i}\leq 2\sqrt{N}$.

∎

Regret Decomposition.

For ease of analysis, we assume a common regularity condition in MDPs, which ensures that the trajectory induced under any policy has enough randomness. Note that the lower-bound constant $\kappa$ can be at most $1/|{\mathcal{S}}|$, and we expect this assumption to be relaxed in future work.

Assumption 5 (MDP Mixing Condition).

There exists $\kappa>0$ such that $P_{h}(s^{\prime}|s,a)\geq\kappa,\forall s^{\prime},s,a.$

We now present the proof of Theorem 3 via a decomposition of its regret into different components.

Theorem (Full Statement of Theorem 3).

In a contractual RL problem, with probability at least $1-\delta$ , Algorithm 1 has $\widetilde{O}\left((H^{2}SA^{-1/2}+H^{2}\kappa^{-1/2})\sqrt{T\ln(SAHT/\delta)}\right)$ regret using the solver in Algorithm 7, and $\widetilde{O}\left((H^{2}SA^{-1/2}+\eta\lambda_{w}\kappa^{-1/2})\sqrt{T\ln(% SAHT/\delta)}\right)$ regret using the solver in Algorithm 6.

Proof of Theorem 3.

Consider the following decomposition of regret,

	$\displaystyle\operatorname{Reg}(T)$	$\displaystyle=\sum_{t=1}^{T}V^{*}-V^{\bm{x}^{t}}$
		$\displaystyle\leq O(T_{1})+\sum_{t=1}^{T-T_{1}}V^{*}-V^{\bm{x}^{t}}$
		$\displaystyle=O(T_{1})+\sum_{t=1}^{T-T_{1}}\Big([V^{*}-\widehat{V}^{t,\bm{\pi}^{*}}]+[\widehat{V}^{t,\bm{\pi}^{*}}-\widehat{V}^{t,\bm{\pi}^{t}}]+[\widehat{V}^{t,\bm{\pi}^{t}}-V^{\bm{x}^{t}}]\Big)$
		$\displaystyle\leq O(T_{1})+\sum_{t=1}^{T-T_{1}}\Big(\widehat{\zeta}^{t,\bm{\pi}^{*}}-\zeta^{\bm{\pi}^{*}}+\widehat{R}^{t,\bm{\pi}^{t}}-R^{\bm{\pi}^{t}}+\left|{\zeta}^{\bm{\pi}^{t}}-\widehat{\zeta}^{t,\bm{\pi}^{t}}\right|\Big),$

where $\widehat{V}^{t,\bm{\pi}^{*}}=\widehat{R}^{t,\bm{\pi}^{*}}-\widehat{\zeta}^{t,% \bm{\pi}^{*}}$ , $\widehat{V}^{t,\bm{\pi}^{t}}=\widehat{R}^{t,\bm{\pi}^{t}}-\widehat{\zeta}^{t,% \bm{\pi}^{t}}$ are the principal’s estimated value under the least payment contract policy determined from Equation (B.1) with $\{\widehat{P}^{t}_{h},\widehat{r}^{t}_{h},b^{t}_{h},\epsilon^{t}_{h}\}_{h\in[H]}$ and $V^{\bm{\pi}^{t}}=R^{\bm{\pi}^{t}}-\zeta^{\bm{\pi}^{t}}$ is the principal’s exact value under the least payment contract policy determined from Equation (B.1). We now consider each of the three difference terms in the last inequality:

•

The bound of first term is derived from the value decomposition, $V^{*}-\widehat{V}^{t,\bm{\pi}^{*}}=R^{\bm{\pi}^{*}}-\zeta^{\bm{\pi}^{*}}-% \widehat{R}^{t,\bm{\pi}^{*}}+\widehat{\zeta}^{t,\bm{\pi}^{*}}\leq\widehat{% \zeta}^{t,\bm{\pi}^{*}}-\zeta^{\bm{\pi}^{*}}$ , since the difference $R^{\bm{\pi}^{*}}-\widehat{R}^{t,\bm{\pi}^{*}}\leq 0$ , by Lemma 8.
•

The second term is non-positive, $\widehat{V}^{t,\bm{\pi}^{*}}-\widehat{V}^{t,\bm{\pi}^{t}}\leq 0$ , since $\bm{\pi}^{t}=\mathop{argmax}_{\bm{\pi}}\widehat{R}^{t,\bm{\pi}}-\widehat{\zeta% }^{t,\bm{\pi}}$ is the optimal action policy to induce under the optimistic planning.
•

The third term can be decomposed as, $\widehat{V}^{t,\bm{\pi}^{t}}-V^{\bm{x}^{t}}=\widehat{R}^{t,\bm{\pi}^{t}}-% \widehat{U}^{t,\bm{\pi}^{t}}-C^{\bm{\pi}^{t}}-R^{\bm{\pi}^{t}}+U^{\bm{x}^{t}}+% C^{\bm{\pi}^{t}}\leq\widehat{R}^{t,\bm{\pi}^{t}}-R^{\bm{\pi}^{t}}+\left|{\zeta% }^{\bm{\pi}^{t}}-\widehat{\zeta}^{t,\bm{\pi}^{t}}\right|$ , since $U^{\bm{x}^{t}}-\widehat{U}^{t,\bm{\pi}^{t}}\eqsim\left|{\zeta}^{\bm{\pi}^{t}}-% \widehat{\zeta}^{t,\bm{\pi}^{t}}\right|$ by either Lemma 12 or 15.

Finally, if we adopt the solver in Algorithm 6, by Lemma 9 and 14, the total regret is,

	$\displaystyle\operatorname{Reg}(T)$	$\displaystyle=O(T_{1})+O(H^{2}S\sqrt{A(T-T_{1})\ln(SAHT/\delta)})+\sum_{\tau=1}^{T-T_{1}}O\left(H^{2}\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/\tau}\right)$
		$\displaystyle={O}\left(HAS^{5}\kappa_{0}^{-1}\log(\eta T\lambda_{w}^{-1})+H^{2% }S\sqrt{AT\ln(SAHT/\delta)}+H^{2}\kappa^{-1/2}\sqrt{T\ln(SAH/\delta)}\right)$
		$\displaystyle=\widetilde{O}\left((H^{2}SA^{-1/2}+H^{2}\kappa^{-1/2})\sqrt{T\ln% (SAHT/\delta)}\right),$

where the first term $T_{1}=\widetilde{O}(\log T)$ can be dropped as being dominated by the second term on the order of $\sqrt{T-T_{1}}$ .

Similarly, if we adopt the solver in Algorithm 7, by Lemma 9 and 16, the total regret is,

	$\displaystyle\operatorname{Reg}(T)$	$\displaystyle=O(T_{1})+O(H^{2}S\sqrt{A(T-T_{1})\ln(SAHT/\delta)})+\sum_{\tau=1% }^{T-T_{1}}O\left(\mathop{min}\big{\{}H,\eta\lambda_{s}^{1-H}\kappa^{-1/2}% \sqrt{\ln(SAHT/\delta)/T}\big{\}}\right)$
		$\displaystyle={O}\left(HAS^{5}\kappa_{0}^{-1}\log(\eta T\lambda_{s}^{-1})\log(% 1/\delta)+H^{2}S\sqrt{AT\ln(SAHT/\delta)}+\eta\lambda_{s}^{1-H}\kappa^{-1/2}% \sqrt{T\ln(SAH/\delta)}\right)$
		$\displaystyle=\widetilde{O}\left((H^{2}SA^{-1/2}+\eta\lambda_{s}^{1-H}\kappa^{% -1/2})\sqrt{T\ln(SAHT/\delta)}\right),$

where the first term $T_{1}=\widetilde{O}(\log T)$ can be dropped as being dominated by the second term on the order of $\sqrt{T-T_{1}}$ .

∎

E.2 $\chi(\varepsilon)$ -Learning Procedure in Contractual Reinforcement Learning

Input: State and action sets ${\mathcal{S}},\mathcal{A}$, number of steps $H$, cost function $\{c_{h}\}_{h=1}^{H}$, search precision $\varepsilon$
For each $s\in{\mathcal{S}},h\in[H]$, initialize a subroutine ${\mathscr{A}}(s,h)$ described in Algorithm 4.
Set $N_{s,h}\leftarrow 0$ for all $s,h$
for $t=1,\dots,\chi(\varepsilon)$ do
  $(s^{t},h^{t})=\mathop{argmin}_{s,h}N_{s,h}$
  Set $x_{h}(s,\cdot)\leftarrow 0$ for all $(h,s)\neq(h^{t},s^{t})$, and set $x_{h^{t}}(s^{t},\cdot)$ to the contract that ${\mathscr{A}}(s^{t},h^{t})$ is going to use next.
  Set $x_{h^{t}}(s^{t},\cdot)\leftarrow x_{h^{t}}(s^{t},\cdot)+H\kappa_{0}^{-1}\mathop{max}_{(s,a,h)\in{\mathcal{S}}\times\mathcal{A}\times[H]}c_{h}(s,a)$
  Execute the contract policy $\bm{x}$ and collect trajectory ${\mathcal{T}}=\{(s_{h},a_{h},r_{h})\}_{h\in[H]}$
  if $s_{h^{t}}=s^{t}$ in the trajectory ${\mathcal{T}}$ then
    Let ${\mathscr{A}}(s^{t},h^{t})$ step with $(s_{h^{t}},a_{h^{t}},s_{h^{t}+1})\in{\mathcal{T}}$ and $N_{s^{t},h^{t}}\leftarrow N_{s^{t},h^{t}}+1$
return $\mu_{h}(s,\cdot,\cdot)$ from ${\mathscr{A}}(s^{t},h^{t})$ for each $s\in{\mathcal{S}},h\in[H]$

Algorithm 5 $\chi(\varepsilon)$-Learning Procedure in Contractual RL with Far-sighted Agent

The algorithm for constructing the $\chi(\varepsilon)$ -learning procedure is summarized in Algorithm 5.

Assumption 6 (Weakly ergodic MDP).

There exists $\kappa_{0}>0$ such that $\mathop{max}_{\bm{\pi}}\rho_{h}^{\bm{\pi}}(s)\geq 2\kappa_{0},\forall s,h.$

Note that this assumption is weaker than the mixing MDP assumption, under which each state’s visitation measure at each step is lower bounded under every agent policy, since here the agent’s policy $\bm{\pi}$ is chosen in favor of visiting state $s$ at step $h$.

Also, recall that throughout this section, we assume that the principal already knows the agent’s cost function $\{c_{h}\}_{h=1}^{H}$ and that the contract design space is restricted to $\{x_{h}:{\mathcal{S}}\times{\mathcal{S}}\to[0,\eta]\}$. Moreover, we inherit all the technical assumptions from the analysis in Appendix D. Under these assumptions, we show that the following Lemma 10 holds.

Lemma 10.

For any $\delta\in(0,1)$ , Algorithm 5 guarantees an estimation of $\left\lVert\widehat{\mu}_{h}-{\mu}_{h}\right\rVert_{1,\infty}\leq\varepsilon$ with probability at least $1-\delta$ in $\widetilde{O}(HAS^{5}\kappa_{0}^{-1}\log(1/\varepsilon)\log(1/\delta))$ episodes.

Proof for Lemma 10.

Since in each round only $x_{h}(s,\cdot)$ gives nonzero payment, and we add a constant $H\kappa_{0}^{-1}\mathop{max}_{(s,a,h)\in{\mathcal{S}}\times\mathcal{A}\times[H]}c_{h}(s,a)$ that dominates the potential costs, the agent’s optimal policy generates a visitation measure of $s$ at step $h$ of at least $\kappa_{0}$. To see this, we take $\widehat{\bm{\pi}}$ that maximizes $\rho_{h}^{\bm{\pi}}(s)$ while $\widehat{\pi}_{h}=\mathop{argmax}_{\pi_{h}}\langle\pi_{h},P_{h}\cdot x_{h}\rangle_{{\mathcal{S}}\times\mathcal{A}}$ at step $h$ (the choice of $\pi_{h}$ does not influence $\rho_{h}^{\bm{\pi}}(s)$). For any $\bm{\pi}$ such that $\rho_{h}^{\bm{\pi}}(s)<\kappa_{0}$, the agent’s expected profit difference satisfies

	$\displaystyle\qquad U^{\widehat{\bm{\pi}}}-U^{{\bm{\pi}}}$
	$\displaystyle\quad=\sum_{h^{\prime}=1}^{H}\langle\rho_{h^{\prime}}^{{\widehat{% \bm{\pi}}}}\otimes\widehat{\pi}_{h^{\prime}}-\rho_{h^{\prime}}^{\bm{\pi}}% \otimes\pi_{h^{\prime}},P_{h^{\prime}}\cdot x_{h^{\prime}}-c_{h^{\prime}}% \rangle_{{\mathcal{S}}\times\mathcal{A}}$
	$\displaystyle\quad\geq\langle\rho_{h}^{{\widetilde{\bm{\pi}}}}\otimes\pi_{h}(s% ,\cdot)-\rho_{h}^{\bm{\pi}}\otimes\pi_{h}(s,\cdot),P_{h}\cdot x_{h}(s,\cdot)% \rangle_{\mathcal{A}}-\sum_{h^{\prime}=1}^{H}\langle\rho_{h^{\prime}}^{% \widetilde{\bm{\pi}}}\otimes\widetilde{\pi}_{h^{\prime}}-\rho_{h^{\prime}}^{% \bm{\pi}}\otimes\pi_{h^{\prime}},c_{h^{\prime}}\rangle_{{\mathcal{S}}\times% \mathcal{A}}$
	$\displaystyle\quad>\kappa_{0}H\kappa_{0}^{-1}\mathop{max}_{(s,a,h)\in{\mathcal% {S}}\times\mathcal{A}\times[H]}c_{h}(s,a)-H\mathop{max}_{(s,a,h)\in{\mathcal{S% }}\times\mathcal{A}\times[H]}c_{h}(s,a)=0,$

where $\widetilde{\bm{\pi}}$ is constructed by substituting $\widehat{\pi}_{h}$ with $\pi_{h}$ in $\widehat{\bm{\pi}}$. Here, the first inequality holds since $\widehat{\pi}_{h}$ is the optimal one-step policy concerning the payment. The last inequality holds by noting that $\widetilde{\bm{\pi}}$ still maximizes $\rho_{h}^{\bm{\pi}}(s)$, which gives $\rho_{h}^{\widetilde{\bm{\pi}}}(s)\geq 2\kappa_{0}$ by Assumption 6, and that $x$ has a constant shift $H\kappa_{0}^{-1}\mathop{max}_{(s,a,h)\in{\mathcal{S}}\times\mathcal{A}\times[H]}c_{h}(s,a)$. Note that the best response can only have higher profit for the agent. Thus, we conclude that any $\bm{\pi}$ such that $\rho_{h}^{\bm{\pi}}(s)<\kappa_{0}$ cannot be the agent’s best response. In other words, for the best response $\bm{\pi}^{*}$, we have $\rho_{h}^{\bm{\pi}^{*}}(s)\geq\kappa_{0}$.

Let $X_{s,h}$ be the random variable representing the total steps that ${\mathscr{A}}(s,h)$ successfully takes and $Y$ be a random variable such that $Y\sim\text{Binomial}(n,\kappa_{0})$ , where $n=\chi(\varepsilon)$ . It follows from our previous discussion that for any $\lambda\in(0,\kappa_{0})$ ,

	$\displaystyle\mathbb{P}\left(\sum_{s,h}X_{s,h}\leq\lambda n\right)$	$\displaystyle\leq\mathbb{P}(Y\leq\lambda n)$
		$\displaystyle\leq\inf_{\theta\geq 0}\mathbb{E}\left[\exp(-\theta(Y-\lambda n))\right]$
		$\displaystyle\leq\exp\left(-n\kappa_{0}f(\lambda/\kappa_{0})\right),$

where $f(x)=1-x+x\log(x)$ . By the algorithm procedure, we conclude that

\displaystyle\mathbb{P}\left(\mathop{min}_{s,h}X_{s,h}\leq\frac{\lambda n}{H|{% \mathcal{S}}|}-1\right)\leq\exp(-n\kappa_{0}f(\lambda/\kappa_{0})).
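The Chernoff-type bound $\mathbb{P}(Y\leq\lambda n)\leq\exp(-n\kappa_{0}f(\lambda/\kappa_{0}))$ used above can be sanity-checked numerically against the exact binomial tail. The sketch below is illustrative only; the helper names and the specific values of $n$, $\kappa_{0}$, and $\lambda$ in the usage note are assumptions chosen here.

```python
import math


def binom_lower_tail(n, p, k):
    """Exact P(Y <= k) for Y ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def chernoff_bound(n, kappa0, lam):
    """exp(-n * kappa0 * f(lam / kappa0)) with f(x) = 1 - x + x * log(x).

    Requires 0 < lam < kappa0 so that x lies in (0, 1).
    """
    x = lam / kappa0
    f = 1 - x + x * math.log(x)
    return math.exp(-n * kappa0 * f)


def check(n, kappa0, lam):
    """Return (exact tail, bound) for the event Y <= lam * n."""
    k = math.floor(lam * n)
    return binom_lower_tail(n, kappa0, k), chernoff_bound(n, kappa0, lam)
```

For instance, `check(200, 0.3, 0.15)` compares the exact probability that a $\text{Binomial}(200, 0.3)$ variable is at most $30$ with the bound for $\lambda=\kappa_{0}/2$, the choice made later in the proof.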

Combined with Corollary 3.1, we conclude that Algorithm 5 has error

\sup_{s,a,h}\left\lVert(\widehat{P}_{h}(s,a)-\widehat{P}_{h}(s,a^{\prime})-(P_% {h}(s,a)-P_{h}(s,a^{\prime}))\right\rVert_{2}\geq\varepsilon

with probability no more than

\displaystyle H|{\mathcal{S}}||\mathcal{A}|\exp\left(-\frac{(\lambda n-H|{% \mathcal{S}}|)\cdot\sigma_{d}(\tau_{d})}{H|\mathcal{A}||{\mathcal{S}}|^{5}\log% (|\mathcal{A}|\varsigma^{-2}\varepsilon^{-1}\theta^{-1})}\right)+\exp\left(-n% \kappa_{0}f(\lambda/\kappa_{0})\right),

for any $\lambda\in(0,\kappa_{0})$ . The first term dominates and we just plug in $\lambda=\kappa_{0}/2$ and conclude that the error is no more than $\varepsilon$ with probability at least

\displaystyle 1-C_{0}H|{\mathcal{S}}||\mathcal{A}|\exp\left(-\frac{\kappa_{0}% \chi(\varepsilon)\cdot\sigma_{\mathcal{S}}(\tau_{\mathcal{S}})}{H|\mathcal{A}|% |{\mathcal{S}}|^{5}\log(|\mathcal{A}|\varsigma^{-2}\varepsilon^{-1}\theta^{-1}% )}\right),

where $C_{0}$ is a constant and $\tau_{\mathcal{S}}=(\varsigma^{2}/6|{\mathcal{S}}|)^{2}$ .

Fix $\delta\in(0,1)$. Converting the error bound to the $\ell_{1}$ norm, the algorithm guarantees that, with probability at least $1-\delta$,

\sup_{s,a,a^{\prime},h}\left\lVert\widehat{\mu}_{h}(s,a,a^{\prime})-{\mu}_{h}(s,a,a^{\prime})\right\rVert_{1}\leq\varepsilon,

in ${O}\left(\frac{HAS^{5}\log(\varepsilon^{-1}S^{1/2})}{\kappa_{0}\sigma_{\mathcal{S}}(\tau_{\mathcal{S}})}\log(C_{0}HSA/\delta)\right)=\widetilde{O}(HAS^{5}\kappa_{0}^{-1}\log(1/\varepsilon)\log(1/\delta))$ rounds, where we ignore the constant $\sigma_{\mathcal{S}}(\tau_{\mathcal{S}})$ from Assumption 3.

∎

Lemma 11.

If $\mathcal{E}_{\text{model}}$ is true, with $\left\lVert\widehat{\mu}_{h}-{\mu}_{h}\right\rVert_{1,\infty}\leq\varepsilon$ , at any episode $t\in[T]$ , we have

\left\lVert\widehat{P}_{h}^{t}(s,a)-P_{h}(s,a)\right\rVert_{1}\leq 2\sqrt{% \frac{\ln(SAHT/\delta)}{N_{h}^{t}(s)}}+\varepsilon,

(E.1)

Proof.

To improve the estimate of $\widehat{P}$ with $\widehat{\mu}$ , we can construct the estimator of $\widehat{P}_{h}^{t}$ as,

	$\displaystyle\widehat{P}^{t}_{h}(s,a)$	$\displaystyle=\sum_{a^{\prime}\in\mathcal{A}}\frac{N^{t}_{h}(s,a^{\prime})}{N_% {h}^{t}(s)}[\widehat{P}^{t}_{h}(s,a^{\prime})+\widehat{\mu}_{h}(s,a,a^{\prime})]$
		$\displaystyle=\frac{\sum_{a^{\prime}\in\mathcal{A}}[N^{t}_{h}(s,a^{\prime};% \cdot)+{\mu}_{h}(s,a,a^{\prime})N^{t}_{h}(s,a^{\prime})]}{N_{h}^{t}(s)}+\sum_{% a^{\prime}\in\mathcal{A}}\frac{N^{t}_{h}(s,a^{\prime})}{N_{h}^{t}(s)}[\widehat% {\mu}_{h}(s,a,a^{\prime})-{\mu}_{h}(s,a,a^{\prime})]$

The first part is an unbiased estimator for ${P}^{t}_{h}(s,a)$ and thus by Hoeffding’s inequality, with probability at least $1-\delta$ , for all $s,a,h,t$ , with $N_{h}^{t}(s)$ samples,

\bigg\|\frac{\sum_{a^{\prime}\in\mathcal{A}}[N^{t}_{h}(s,a^{\prime};\cdot)+{\mu}_{h}(s,a,a^{\prime})N^{t}_{h}(s,a^{\prime})]}{N_{h}^{t}(s)}-{P}^{t}_{h}(s,a)\bigg\|_{1}\leq 2\sqrt{\frac{\ln(SAHT/\delta)}{N_{h}^{t}(s)}}.

Using the triangle inequality and the fact that $\frac{N^{t}_{h}(s,a^{\prime})}{N_{h}^{t}(s)}\leq 1$ , we get the inequality in Eq (E.1). ∎
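The pooled estimator in this proof can be sketched as follows, under the assumption (as in the construction above) that $\widehat{\mu}_{h}(s,a,a^{\prime})$ estimates $P_{h}(s,a)-P_{h}(s,a^{\prime})$. The function name and the nested-list layout are illustrative choices, and the state $s$ and step $h$ are held fixed.

```python
def pooled_transition_estimate(a, counts, p_hat, mu_hat):
    """Estimate P_h(s, a) by pooling samples from every action at state s.

    counts[a2]         : N_h(s, a2), assumed to sum to a positive total
    p_hat[a2][s2]      : empirical transition vector for (s, a2)
    mu_hat[a][a2][s2]  : estimate of P_h(s2 | s, a) - P_h(s2 | s, a2)
    """
    total = sum(counts)  # N_h(s)
    n_states = len(p_hat[0])
    est = [0.0] * n_states
    for a2, n in enumerate(counts):
        w = n / total  # weight N_h(s, a2) / N_h(s)
        for s2 in range(n_states):
            # Shift the (s, a2) estimate by the estimated difference to target (s, a).
            est[s2] += w * (p_hat[a2][s2] + mu_hat[a][a2][s2])
    return est
```

The point of the construction is that every visit to state $s$ contributes to the estimate for each action, so the confidence width scales with $N_{h}^{t}(s)$ rather than $N_{h}^{t}(s,a)$, at the price of the additive $\varepsilon$ from the error in $\widehat{\mu}$.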

E.3 Proofs for the Statistically Efficient Solver

Input: Parameters $\{\widehat{P}_{h},\widehat{r}_{h},b_{h},\widehat{\mu}_{h}\}_{h\in[H]}$, robustness parameter $\epsilon$
Output: Contract policy $\bm{x}$, action policy $\bm{\pi}$
for $\bm{\pi}\in\Pi$ do
  $\widehat{U}^{\bm{\pi}},\widehat{\bm{x}}^{\bm{\pi}}=\mathop{minarg}_{\bm{x}}\widehat{U}^{\bm{x},\bm{\pi}}\quad\text{s.t.}\quad\widehat{U}^{\bm{x},\bm{\pi}}-\widehat{U}^{\bm{x},\bm{\pi}^{\prime}}\geq\epsilon,\forall\bm{\pi}^{\prime}\neq\bm{\pi},$ (E.2)
  $\widehat{V}^{\bm{\pi}}\leftarrow\widehat{R}^{\bm{\pi}}-{C}^{\bm{\pi}}-\widehat{U}^{\bm{\pi}}$
Set $\widehat{\bm{\pi}}^{*}\leftarrow\mathop{argmax}_{\bm{\pi}\in\Pi}\widehat{V}^{\bm{\pi}},\widehat{\bm{x}}^{*}\leftarrow\widehat{\bm{x}}^{\widehat{\bm{\pi}}^{*}}$
return $\widehat{\bm{x}}^{*},\widehat{\bm{\pi}}^{*}$

Algorithm 6 Linear Programming under Uncertainty

Assumption 7 (Weak $\lambda$ -Inducibility).

For any stationary action policy $\bm{\pi}\in\Pi$, there exists an event $e_{h}\in\{0,1\}^{S}$ for each $h$ such that $\sum_{h=2}^{H+1}(\rho_{h}^{\bm{\pi}}-\rho_{h}^{\bm{\pi}^{\prime}})\cdot e_{h}\geq\lambda_{w},\forall\bm{\pi}^{\prime}\neq\bm{\pi}.$

This assumption ensures that for any stationary action policy $\bm{\pi}\in\Pi$, there exists a set of bounded reward functions $\{r_{h}\}_{h=1}^{H}$ with $\mathop{max}_{s\in{\mathcal{S}},a\in\mathcal{A},h\in[H]}\left\lVert r_{h}(s,a)\right\rVert\leq 1$ such that $\bm{\pi}$ dominates any other stationary action policy $\bm{\pi}^{\prime}$ with additional expected utility at least $\lambda_{w}$ in the MDP environment $({\mathcal{S}},\mathcal{A},\{P_{h},r_{h}\}_{h=1}^{H})$. For example, one can set $r_{h}(s,a)=\sum_{s^{\prime}\in{\mathcal{S}}}e_{h+1}(s^{\prime})P_{h}(s^{\prime}|s,a)$. For a similar reason, under this assumption, for any stationary action policy $\bm{\pi}\in\Pi$, there exists a contract policy $\bm{x}$ with $\left\lVert x_{h}\right\rVert_{\infty}\leq 1$ under which $U^{\bm{x},\bm{\pi}}-U^{\bm{x},\bm{\pi}^{\prime}}\geq\lambda_{w},\forall\bm{\pi}^{\prime}\neq\bm{\pi}$, where $x_{h}(s,s^{\prime})=e_{h+1}(s^{\prime})$.

We describe the statistically efficient solver in Algorithm 6. Below we show that the contract policy it solves is robust and optimistic in a formal sense.

Lemma 12.

Under Assumption 7, given estimates $\widehat{\mu},\widehat{P}$ such that $\left\lVert\widehat{\mu}_{h}-{\mu}_{h}\right\rVert_{1,\infty}\leq\frac{\epsilon}{SH^{2}}$ and $\left\lVert\widehat{P}_{h}-{P}_{h}\right\rVert_{1,\infty}\leq\epsilon^{\prime}$, the policy $\bm{\pi},\widehat{\bm{x}}^{\bm{\pi}}$ solved from the LP (E.2) satisfies the following condition,

U^{\widehat{\bm{x}}^{\bm{\pi}}}-U^{\bm{\pi}}\eqsim\left|\widehat{\zeta}^{\bm{% \pi}}-\zeta^{\bm{\pi}}\right|=O\left(\lambda_{w}^{-1}H\eta\epsilon+H^{2}% \epsilon^{\prime}\right).

Proof.

Pick any action policy $\bm{\pi}\in\Pi$. We first show that $\bm{\pi}$ is the agent’s best response under $\widehat{\bm{x}}^{\bm{\pi}}$ solved from LP (E.2). That is, due to Lemma 13, the following inequality holds, $\forall\bm{\pi}^{\prime}\neq\bm{\pi}$,

{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}-{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}^{\prime}}\geq\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}-\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}^{\prime}}-\epsilon\geq 0.

Let $\varepsilon=\frac{\epsilon}{SH^{2}}$, so that $\left\lVert\widehat{\mu}_{h}-{\mu}_{h}\right\rVert_{1,\infty}\leq\varepsilon$. We show that the payment under $\widehat{\bm{x}}^{\bm{\pi}}$ has bounded suboptimality,

\widehat{\zeta}^{\bm{\pi}}-{\zeta}^{\bm{\pi}}=\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}-{U}^{\bm{\pi}}\leq\widehat{U}^{{\bm{x}}^{\bm{\pi}},\bm{\pi}}-{U}^{{\bm{x}}^{\bm{\pi}},\bm{\pi}}+\frac{2SH^{3}\eta\varepsilon}{\lambda_{w}}\leq\frac{2SH^{3}\eta\varepsilon}{\lambda_{w}}+H^{2}\epsilon^{\prime}=\frac{2H\eta\epsilon}{\lambda_{w}}+H^{2}\epsilon^{\prime},

(E.3)

where $\bm{x}^{\bm{\pi}}=\mathop{argmin}_{\bm{x}}{U}^{\bm{x},\bm{\pi}}\quad\text{s.t.}\quad{U}^{\bm{x},\bm{\pi}}-{U}^{\bm{x},\bm{\pi}^{\prime}}\geq 0,\forall\bm{\pi}^{\prime}\neq\bm{\pi}$ and thus ${U}^{\bm{\pi}}={U}^{\bm{x}^{\bm{\pi}},\bm{\pi}}$ by definition.

We first prove the first inequality in Equation (E.3). Let $\bm{x}^{0}$ be a contract policy such that ${U}^{\bm{x}^{0},\bm{\pi}}-{U}^{\bm{x}^{0},\bm{\pi}^{\prime}}\geq\lambda_{w},\forall\bm{\pi}^{\prime}\neq\bm{\pi}$ with ${U}^{{\bm{x}}^{0},\bm{\pi}}\leq H$ and $\widehat{U}^{{\bm{x}}^{0},\bm{\pi}}\leq H\eta$. Such $\bm{x}^{0}$ exists by Assumption 7. It is easy to verify the following linearity: for any contract policies $\bm{x}^{\prime},\bm{x}^{\prime\prime}$ and any $c\in\mathbb{R}$, let $\bm{x}=\bm{x}^{\prime}+c\bm{x}^{\prime\prime}$ such that $\bm{x}_{h}(s,s^{\prime})=\bm{x}^{\prime}_{h}(s,s^{\prime})+c\bm{x}^{\prime\prime}_{h}(s,s^{\prime}),\forall h,s,s^{\prime}$, then $U^{\bm{x},\bm{\pi}}=U^{\bm{x}^{\prime},\bm{\pi}}+cU^{\bm{x}^{\prime\prime},\bm{\pi}}$. We claim that $\widetilde{\bm{x}}=\bm{x}^{\bm{\pi}}+\frac{2SH^{2}\varepsilon}{\lambda_{w}}\bm{x}^{0}$ satisfies the following inequalities:

\widehat{U}^{\widetilde{\bm{x}},\bm{\pi}}-\widehat{U}^{\widetilde{\bm{x}},\bm{\pi}^{\prime}}\geq{U}^{\widetilde{\bm{x}},\bm{\pi}}-{U}^{\widetilde{\bm{x}},\bm{\pi}^{\prime}}-SH^{2}\varepsilon\geq\epsilon,

(E.4)

where the first inequality is due to Lemma 13 and the second inequality is by the construction of $\widetilde{\bm{x}}$ .

\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}\leq\widehat{U}^{\widetilde{\bm{x}},\bm{\pi}}=\widehat{U}^{{\bm{x}}^{\bm{\pi}},\bm{\pi}}+\frac{2SH^{2}\varepsilon}{\lambda_{w}}\widehat{U}^{{\bm{x}}^{0},\bm{\pi}}\leq\widehat{U}^{{\bm{x}}^{\bm{\pi}},\bm{\pi}}+\frac{2SH^{3}\eta\varepsilon}{\lambda_{w}}.

For the first inequality, notice that $\widetilde{\bm{x}}$ is feasible in LP (E.2) according to Equation (E.4), so the optimal value of LP (E.2) is no larger than the value attained by $\widetilde{\bm{x}}$. The equality is due to the linearity of $U^{\bm{x},\bm{\pi}}$ w.r.t. $\bm{x}$. The last inequality uses the fact that $\widehat{U}^{{\bm{x}}^{0},\bm{\pi}}\leq H\eta$ by construction.

For the last inequality in Equation (E.3), we apply the simulation lemma [5],

	$\displaystyle\widehat{U}^{{\bm{x}}^{\bm{\pi}},\bm{\pi}}-{U}^{{\bm{x}}^{\bm{\pi}},\bm{\pi}}$	$\displaystyle=\sum_{h=1}^{H}\langle\rho_{h}^{{\bm{\pi}}}\otimes\pi_{h},(\widehat{P}_{h}-{P}_{h})\cdot\widehat{U}_{h}^{{\bm{x}}^{\bm{\pi}},\bm{\pi}}\rangle_{{\mathcal{S}}\times\mathcal{A}}$
		$\displaystyle\leq H\sum_{h=1}^{H}\left\lVert\widehat{P}_{h}-{P}_{h}\right\rVert_{1,\infty}$
		$\displaystyle\leq H^{2}\epsilon^{\prime}.$

∎

Lemma 13.

Given that $\left\lVert\widehat{\mu}_{h}-{\mu}_{h}\right\rVert_{1,\infty}\leq\varepsilon$ , we have

\left|U^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}-U^{\widehat{\bm{x}}^{\bm{\pi}},% \bm{\pi}^{\prime}}-\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}+\widehat% {U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}^{\prime}}\right|\leq SH^{2}\varepsilon

Proof.

Notice that we can apply the performance difference lemma [5] to any pair ${U}^{\bm{x},\bm{\pi}},{U}^{\bm{x},\bm{\pi}^{\prime}}$, as they share the same $\bm{x}$ and thus the same environment, differing only in the policy,

\displaystyle{U}^{\bm{x},\bm{\pi}}-{U}^{\bm{x},\bm{\pi}^{\prime}}=\sum_{h=1}^{H}\langle\rho_{h}^{{\bm{\pi}}}\otimes(\pi_{h}-\pi^{\prime}_{h}),{P}_{h}\cdot U_{h}\rangle_{{\mathcal{S}}\times\mathcal{A}}=\sum_{h=1}^{H}\sum_{s\in{\mathcal{S}}}\rho_{h}^{{\bm{\pi}}}(s)\mu_{h}(s,\pi_{h}(s),\pi_{h}^{\prime}(s))\cdot U_{h}.

Hence, we have

	$\displaystyle\quad\left|U^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}-U^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}^{\prime}}-\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}}+\widehat{U}^{\widehat{\bm{x}}^{\bm{\pi}},\bm{\pi}^{\prime}}\right|$
	$\displaystyle=\left|\sum_{h=1}^{H}\sum_{s\in{\mathcal{S}}}\left[\rho_{h}^{{\bm{\pi}}}(s)\mu_{h}(s,\pi_{h}(s),\pi_{h}^{\prime}(s))\cdot U_{h}-\widehat{\rho}_{h}^{{\bm{\pi}}}(s)\widehat{\mu}_{h}(s,\pi_{h}(s),\pi_{h}^{\prime}(s))\cdot\widehat{U}_{h}\right]\right|$
	$\displaystyle\leq H\sum_{h=1}^{H}\sum_{s\in{\mathcal{S}}}\left\lVert\mu_{h}(s,\pi_{h}(s),\pi_{h}^{\prime}(s))-\widehat{\mu}_{h}(s,\pi_{h}(s),\pi_{h}^{\prime}(s))\right\rVert_{1}$
	$\displaystyle\leq SH^{2}\mathop{max}_{s,a,a^{\prime},h}\left\lVert\widehat{\mu}_{h}(s,a,a^{\prime})-{\mu}_{h}(s,a,a^{\prime})\right\rVert_{1}$
	$\displaystyle\leq SH^{2}\varepsilon.$

∎
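Lemma 13 is what allows a margin enforced on the estimated model to carry over to the true model. The sketch below illustrates this in the simplest setting: one step, two actions, three outcomes. All numbers are hypothetical, the payment cap is assumed to be 1, and `least_payment` is an illustrative closed-form solver for a single-constraint LP rather than the solver used in the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def least_payment(p, d, rhs):
    """Solve min p.x s.t. d.x >= rhs, x >= 0 (with p >= 0) in closed form.

    With one constraint, an optimal solution pays on a single outcome:
    the one with d_i > 0 that minimizes the ratio p_i / d_i.
    """
    if rhs <= 0:
        return [0.0] * len(p)
    ratios = [(p[i] / d[i], i) for i in range(len(p)) if d[i] > 0]
    if not ratios:
        raise ValueError("constraint cannot be met")
    _, i = min(ratios)
    x = [0.0] * len(p)
    x[i] = rhs / d[i]
    return x

# True and estimated outcome distributions of the target action a and the
# alternative action b (hypothetical numbers).
P_a, P_b = [0.2, 0.3, 0.5], [0.5, 0.3, 0.2]
Ph_a, Ph_b = [0.17, 0.30, 0.53], [0.52, 0.30, 0.18]
cost_gap = 0.2                      # c(a) - c(b)

mu = [u - v for u, v in zip(P_a, P_b)]
mu_hat = [u - v for u, v in zip(Ph_a, Ph_b)]
err = sum(abs(u - v) for u, v in zip(mu, mu_hat))   # l1 estimation error

# Contract computed from the estimates without and with a robustness margin.
naive = least_payment(Ph_a, mu_hat, cost_gap)
robust = least_payment(Ph_a, mu_hat, cost_gap + err * 1.0)  # cap assumed to be 1

naive_gain = dot(mu, naive)    # true incentive gap under the naive contract
robust_gain = dot(mu, robust)  # true incentive gap under the robust contract
```

Here the estimate overstates how well the outcomes separate the two actions, so the naive contract underpays and the agent's true incentive gap falls below the cost gap. Adding a margin equal to the estimation error times the payment cap restores incentive compatibility, which is the one-step analogue of the $SH^{2}\varepsilon$ slack in Lemma 13.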

Lemma 14.

If $\mathcal{E}_{\text{model}}$ is true, with $T_{1}=O(HAS^{5}\kappa_{0}^{-1}\log(\eta T\lambda_{w}^{-1}))$, under Assumptions 5 and 7, at any episode $t\in[T]$, we have

\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(H^{2}\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/t}\right).

Proof.

The proof balances the error bounds on $P$ and $\mu$ according to the ratio given in Lemma 12. For this analysis, let $\sup_{s,a,a^{\prime},h}\left\lVert\widehat{\mu}^{t}_{h}(s,a,a^{\prime})-{\mu}_{h}(s,a,a^{\prime})\right\rVert_{1}\leq\varepsilon$, so that we can apply Lemma 11 and get

\left\lVert\widehat{P}_{h}^{t}(s,a)-P_{h}(s,a)\right\rVert_{1}\leq 2\sqrt{% \frac{\ln(SAHT/\delta)}{2\kappa t}}+\varepsilon,

where we use the fact $N_{h}^{t}(s)\geq 2\kappa t$ under Assumption 5. By Lemma 12, we can construct contracts to induce any action policy $\bm{\pi}$ such that

\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(\lambda_{w}^{-1}SH^{3}\eta\varepsilon+H^{2}\Big(2\sqrt{\frac{\ln(SAHT/\delta)}{2\kappa t}}+\varepsilon\Big)\right)=O\left(\lambda_{w}^{-1}SH^{3}\eta\varepsilon+H^{2}\sqrt{\frac{\ln(SAHT/\delta)}{2\kappa t}}\right).

Then, we can set $\varepsilon=\frac{\lambda_{w}}{SH\eta\sqrt{t}}$ so that the second term dominates the bound, and we have

\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(H^{2}\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/t}\right).

It remains to ensure that $T_{1}$ rounds suffice to reach $\varepsilon=\frac{\lambda_{w}}{SH\eta\sqrt{T-T_{1}}}$. By Lemma 10, the sample complexity of reaching accuracy $\varepsilon$ is on the order of $\widetilde{O}(HAS^{5}\kappa_{0}^{-1}\log(1/\varepsilon))$. Hence, $T_{1}=O(HAS^{5}\kappa_{0}^{-1}\log(SH\eta T\lambda_{w}^{-1}))=O(HAS^{5}\kappa_{0}^{-1}\log(\eta T\lambda_{w}^{-1}))$. ∎
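The balancing step in the proof above can be checked numerically. The sketch below evaluates the two terms of the bound (with all hidden constants set to 1, which is an assumption made only for illustration) at the choice $\varepsilon=\lambda_{w}/(SH\eta\sqrt{t})$; the function name and parameter values are hypothetical.

```python
import math

def payment_error_terms(t, S, A, H, T, eta, lam_w, kappa, delta):
    """Return (eps, mu_term, p_term) for the bound in the proof of Lemma 14.

    mu_term is the contribution of the error in mu; p_term is the
    contribution of the error in P. Constants inside O(.) are set to 1.
    """
    eps = lam_w / (S * H * eta * math.sqrt(t))
    mu_term = S * H**3 * eta * eps / lam_w  # simplifies to H^2 / sqrt(t)
    p_term = H**2 * math.sqrt(math.log(S * A * H * T / delta) / (2 * kappa * t))
    return eps, mu_term, p_term

eps, mu_term, p_term = payment_error_terms(
    t=400, S=5, A=3, H=4, T=10_000, eta=2.0, lam_w=0.1, kappa=0.1, delta=0.05
)
```

With this choice the $\mu$ term collapses to $H^{2}/\sqrt{t}$, independent of $S$, $\eta$ and $\lambda_{w}$, and is dominated by the $P$ term whenever $\ln(SAHT/\delta)\geq 2\kappa$.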

E.4 Proofs for the Computationally Efficient Solver

Input: Parameters $\{\widehat{P}_{h},\widehat{r}_{h},b_{h},\widehat{\mu}_{h}\}_{h\in[H]}$, robustness parameters $\{\epsilon_{h}\}_{h\in[H]}$

Output: Contract policy $\bm{x}$, action policy $\bm{\pi}$

Set $\widehat{V}_{H+1}(s),\widehat{U}_{H+1}(s)\leftarrow 0,\forall s\in{\mathcal{S}}$

for $h=H,\dots,1$ do

$\displaystyle\widehat{x}_{h}(s;a)$	$\displaystyle=\mathop{argmin}_{x:{\mathcal{S}}\to\mathbb{R}_{+}}\{\widehat{P}_{h}(s,a)\cdot x\ \big|\ \widehat{\mu}_{h}(s,a,a^{\prime})\cdot[x+\widehat{U}_{h+1}]\geq{c}_{h}(s,a)-{c}_{h}(s,a^{\prime})+\epsilon_{h},\forall a^{\prime}\neq a\},$	(E.5)
$\displaystyle\widehat{Q}_{h}(s,a)$	$\displaystyle=\mathop{min}\big{\{}H,\ \widehat{r}_{h}(s,a)+b_{h}(s,a)+\widehat{P}_{h}(s,a)\cdot\widehat{V}_{h+1}\big{\}}-\widehat{P}_{h}(s,a)\cdot\widehat{x}_{h}(s;a),$
$\displaystyle\widehat{V}_{h}(s),\pi_{h}(s)$	$\displaystyle=\mathop{maxarg}_{a\in\mathcal{A}}\widehat{Q}_{h}(s,a),\quad\widehat{x}_{h}(s)=\widehat{x}_{h}(s;\pi_{h}(s))$
$\displaystyle\widehat{U}_{h}(s)$	$\displaystyle=\mathop{min}\big{\{}H,\ \widehat{P}_{h}(s,\pi_{h}(s))\cdot[\widehat{x}_{h}(s)+\widehat{U}_{h+1}]-c_{h}(s,\pi_{h}(s))\big{\}}.$

return $\widehat{\bm{x}}=\{\widehat{x}_{h}\}_{h\in[H]},\bm{\pi}=\{\pi_{h}\}_{h\in[H]}$

Algorithm 7 Value-Iteration under Uncertainty
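The backward induction of Algorithm 7 can be sketched in a few lines when there are only two actions, since the LP in Equation (E.5) then has a single incentive constraint and admits a closed-form solution. The code below is a minimal illustration under that assumption, with $\widehat{\mu}_{h}(s,a,a^{\prime})$ taken to be $\widehat{P}_{h}(s,a)-\widehat{P}_{h}(s,a^{\prime})$; the function names, data layout and numbers are hypothetical, and a general implementation would need an LP solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def least_payment(p, d, rhs):
    """min p.x s.t. d.x >= rhs, x >= 0 (p >= 0); returns None if infeasible."""
    if rhs <= 0:
        return [0.0] * len(p)
    ratios = [(p[i] / d[i], i) for i in range(len(p)) if d[i] > 0]
    if not ratios:
        return None
    _, i = min(ratios)
    x = [0.0] * len(p)
    x[i] = rhs / d[i]
    return x

def value_iteration_under_uncertainty(P_hat, r_hat, bonus, cost, eps, H):
    """Two-action sketch of Algorithm 7; arrays are indexed [h][s][a]."""
    S = len(P_hat[0])
    V, U = [0.0] * S, [0.0] * S          # V_{H+1} and U_{H+1}
    contracts, policy = [None] * H, [None] * H
    for h in reversed(range(H)):
        V_new, U_new, x_h, pi_h = [], [], [], []
        for s in range(S):
            best = None
            for a in (0, 1):
                alt = 1 - a
                p = P_hat[h][s][a]
                mu = [u - v for u, v in zip(p, P_hat[h][s][alt])]
                # Constraint of (E.5), with the mu.U_{h+1} term moved to the right.
                rhs = cost[h][s][a] - cost[h][s][alt] + eps[h] - dot(mu, U)
                x = least_payment(p, mu, rhs)
                if x is None:
                    continue                 # action a cannot be induced
                q = min(H, r_hat[h][s][a] + bonus[h][s][a] + dot(p, V)) - dot(p, x)
                if best is None or q > best[0]:
                    best = (q, a, x)
            if best is None:
                raise ValueError("no inducible action at this state")
            q, a, x = best
            p = P_hat[h][s][a]
            V_new.append(q)
            pi_h.append(a)
            x_h.append(x)
            U_new.append(min(H, dot(p, [xi + ui for xi, ui in zip(x, U)]) - cost[h][s][a]))
        V, U = V_new, U_new
        contracts[h], policy[h] = x_h, pi_h
    return contracts, policy, V, U

# Toy one-step instance with two states (hypothetical numbers).
P_hat = [[[[0.8, 0.2], [0.3, 0.7]], [[0.8, 0.2], [0.3, 0.7]]]]
r_hat = [[[0.2, 1.0], [0.2, 1.0]]]
bonus = [[[0.0, 0.0], [0.0, 0.0]]]
cost = [[[0.0, 0.2], [0.0, 0.2]]]
contracts, policy, V, U = value_iteration_under_uncertainty(
    P_hat, r_hat, bonus, cost, eps=[0.05], H=1
)
```

In this instance the costly action is worth inducing: the contract pays only on the second outcome, and the robustness parameter `eps` raises that payment slightly above what the estimated model alone would require.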

Assumption 8 (Strong $\lambda$ -Inducibility).

For any $a\in\mathcal{A},s\in{\mathcal{S}},h\in[H]$, there exists an event $e\in\{0,1\}^{S}$ such that $[P_{h}(s,a)-P_{h}(s,a^{\prime})]\cdot e\geq\lambda_{s},\forall a^{\prime}\neq a.$

This assumption ensures that at any state $s$ of any step $h$, for any cost function $c_{h}$ and any action $a$, there exists a contract $x_{h}$ that induces $a$. Intuitively, the assumption asks that each action makes some set of outcomes strictly more likely than every other action does. This is different from assuming that the outcome distribution of each action is distinguishable from the others. In particular, notice that even if we have $\mathop{min}_{a^{\prime}\neq a}\left\lVert P_{h}(s,a)-P_{h}(s,a^{\prime})\right\rVert\geq\lambda_{s}$, there is no guarantee that $\exists x\text{ s.t. }x\cdot[P_{h}(s,a)-P_{h}(s,a^{\prime})]>0,\forall a^{\prime}\neq a$, i.e., that there exists a contract $x$ under which $a$ is the optimal action for the agent at this step regardless of the cost. Notice that when $H=1$, Assumptions 7 and 8 are equivalent. In general, Assumption 8 is stronger than Assumption 7.
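The gap between distinguishability and inducibility can be seen on a small example. In the sketch below (hypothetical numbers, illustrative function names), the outcome distribution of one action is the average of the other two: it is well separated from both in $\ell_{1}$ distance, yet for any contract $x$ the two incentive gaps sum to zero, so they cannot both be positive.

```python
from itertools import product

def l1_distance(p, q):
    return sum(abs(u - v) for u, v in zip(p, q))

def strong_margin(P, a):
    """Best value of min_{a' != a} [P(a) - P(a')] . e over events e in {0,1}^S."""
    S = len(P[a])
    best = float("-inf")
    for e in product((0, 1), repeat=S):
        gap = min(
            sum((P[a][s] - P[b][s]) * e[s] for s in range(S))
            for b in range(len(P)) if b != a
        )
        best = max(best, gap)
    return best

# Action 1 is the midpoint of actions 0 and 2 (hypothetical numbers).
P = [[0.6, 0.2, 0.2],
     [0.4, 0.2, 0.4],
     [0.2, 0.2, 0.6]]

min_dist = min(l1_distance(P[1], P[b]) for b in (0, 2))
margin_mid = strong_margin(P, 1)   # action that is a mixture of the others
margin_end = strong_margin(P, 0)   # action with a dominant outcome
```

The midpoint action is at $\ell_{1}$ distance 0.4 from each alternative but has margin 0, so Assumption 8 fails for it, whereas action 0 has margin 0.2 through the event that selects the first outcome.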

We describe the computationally efficient solver in Algorithm 7. Below we show that the contract policy it solves is robust and optimistic in a formal sense.

Lemma 15.

Under Assumption 8, given estimates $\widehat{\mu},\widehat{P}$ such that $\left\lVert\widehat{\mu}_{h}-{\mu}_{h}\right\rVert_{1,\infty}\leq\frac{\epsilon}{H-h+\eta}$, $\left\lVert\widehat{P}_{h}-{P}_{h}\right\rVert_{1,\infty}\leq\epsilon^{\prime}/\eta$ and the choice of $\epsilon_{H}=\epsilon,\epsilon_{h}=O(\mathop{min}\{H,\epsilon\lambda_{s}^{h-H}+\epsilon^{\prime}\lambda_{s}^{h+1-H}\}),\forall h<H$, the policies $\bm{\pi},\widehat{\bm{x}}^{\bm{\pi}}$ output by Algorithm 7 satisfy the following condition,

U^{\widehat{\bm{x}}^{\bm{\pi}}}-U^{\bm{\pi}}\eqsim\left|\widehat{\zeta}^{\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(\mathop{min}\{H,\epsilon\lambda_{s}^{-H-1}+\epsilon^{\prime}\lambda_{s}^{-H}\}\right).

Proof.

Fix any action policy $\bm{\pi}$. We compare the contract policies ${\bm{x}}^{\bm{\pi}}$ and $\widehat{\bm{x}}^{\bm{\pi}}$ solved respectively from Equations (B.1) and (E.5), i.e., using the ground-truth and estimated parameters. In the remainder of the proof, we omit the superscript $\bm{\pi}$ in both $\widehat{\bm{x}}^{\bm{\pi}}$ and ${\bm{x}}^{\bm{\pi}}$ for simplicity.

For notational simplicity, we first prove the following claim. For any $h\in[H]$, suppose we have $\left\lVert\widehat{U}_{h+1}^{\bm{\pi}}-U_{h+1}\right\rVert_{\infty}\leq\alpha$, $\left\lVert\widehat{P}_{h}-{P}_{h}\right\rVert_{1,\infty}\leq\epsilon^{\prime}/\eta$, $\left\lVert\widehat{\mu}_{h}-{\mu}_{h}\right\rVert_{1,\infty}\leq\beta$. Then, with the choice of $\epsilon_{h}=(H-h+\eta)\beta+2\alpha$, the solution of Equation (E.5) in the $h$-th step satisfies the following conditions for every $s\in{\mathcal{S}}$:

1. $\widehat{x}_{h}(s)$ induces the agent’s action $\pi_{h}(s)$.

2. The expected payment of $\widehat{x}_{h}(s)$ is bounded as ${U}^{\widehat{x}_{h}(s)}_{h}(s)-U_{h}^{\bm{\pi}}(s)\leq\mathop{min}\{H,\frac{\epsilon_{h}}{\lambda_{s}}+\epsilon^{\prime}\}.$

3. The estimated payment of $\widehat{x}_{h}(s)$ (as well as the estimated agent value) is bounded as $\left|\widehat{\zeta}^{\bm{\pi}}_{h}(s)-{\zeta}^{\bm{\pi}}_{h}(s)\right|=\left|\widehat{U}^{\bm{\pi}}_{h}(s)-U^{\bm{\pi}}_{h}(s)\right|\leq\mathop{min}\{H,\frac{\epsilon_{h}}{\lambda_{s}}+2\epsilon^{\prime}\}.$

We pick any state $s$ and let $a=\pi_{h}(s)$ . Equation (E.5) solves for $\widehat{x}_{h}(s)$ as the solution to the following optimization program,

\begin{array}[]{lll}\mbox{minimize}&{\widehat{P}_{h}(s,a)\cdot x}&\\ \mbox{subject to}&\widehat{\mu}_{h}(s,a,a^{\prime})\cdot[x+\widehat{U}_{h+1}^{% \bm{\pi}}]\geq{c}_{h}(s,a)-{c}_{h}(s,a^{\prime})+\epsilon_{h},&\mbox{for }a^{% \prime}\neq a.\\ &x(s,s^{\prime})\geq 0,&\mbox{for }s^{\prime}\in{\mathcal{S}}.\\ \end{array}

(E.6)

We first argue that with $\epsilon_{h}=(H-h+\eta)\beta+2\alpha$, $\widehat{x}_{h}(s)$ induces the agent to play action $a$ at state $s$. This is because the following inequality must hold, $\forall a^{\prime}\neq a$,

	$\displaystyle\quad{\mu}_{h}(s,a,a^{\prime})\cdot[\widehat{x}_{h}(s)+U_{h+1}]$
	$\displaystyle\geq\widehat{\mu}_{h}(s,a,a^{\prime})\cdot[\widehat{x}_{h}(s)+% \widehat{U}_{h+1}]-\beta\left\lVert\widehat{x}_{h}(s)\right\rVert_{\infty}-% \beta\left\lVert U_{h}\right\rVert_{\infty}-\alpha\left\lVert{\mu}_{h}(s,a,a^{% \prime})\right\rVert_{1}+\beta\alpha$
	$\displaystyle\geq c(s,a)-c(s,a^{\prime})+\epsilon_{h}-\beta\eta-\beta(H-h)-2\alpha$
	$\displaystyle\geq c(s,a)-c(s,a^{\prime}).$

To see that LP (E.6) must have a feasible solution, let us consider the contract ${x}^{\prime}_{h}(s)={x}_{h}(s)+\frac{\epsilon_{h}}{\lambda_{s}}x^{a}$ . We choose $x^{a}$ such that $[P_{h}(s,a)-P_{h}(s,a^{\prime})]\cdot x^{a}\geq\lambda_{s},\forall a^{\prime}\neq a$ . The existence of $x^{a}$ is implied by Assumption 8. ${x}^{\prime}_{h}(s)$ satisfies the constraints of LP (E.6), i.e., $\forall a^{\prime}\neq a,$

	$\displaystyle\quad\widehat{\mu}_{h}(s,a,a^{\prime})\cdot[{x}^{\prime}_{h}(s)+% \widehat{U}_{h+1}^{\bm{\pi}}]$
	$\displaystyle\geq{\mu}_{h}(s,a,a^{\prime})\cdot{x}^{\prime}_{h}(s)-\beta\left% \lVert x^{\prime}_{h}(s)\right\rVert_{\infty}-\beta\left\lVert{U}_{h}\right% \rVert_{\infty}-\alpha\left\lVert\widehat{\mu}_{h}(s,a,a^{\prime})\right\rVert% _{1}+\beta\alpha$
	$\displaystyle\geq{\mu}_{h}(s,a,a^{\prime})\cdot{x}_{h}(s)+{\mu}_{h}(s,a,a^{% \prime})\cdot\frac{\beta}{\lambda_{s}}x^{a}-\beta\eta-\beta(H-h)-2\alpha$
	$\displaystyle\geq c(s,a)-c(s,a^{\prime})+\epsilon_{h}-\beta\eta-\beta(H-h)$
	$\displaystyle\geq c(s,a)-c(s,a^{\prime}).$

Moreover, since $\widehat{x}_{h}(s)$ minimizes LP (E.6), we have $\widehat{P}(a)\cdot\widehat{x}_{h}(s)\leq\widehat{P}(a)\cdot{x}^{\prime}_{h}(s)=\widehat{P}(a)\cdot[{x}_{h}(s)+\frac{\epsilon_{h}}{\lambda_{s}}x^{a}].$ Given that $\left\lVert x^{a}\right\rVert_{\infty}=1$ and $\left\lVert P(a)\right\rVert_{1}=1$, we have $\widehat{P}_{h}(s,a)\cdot x^{a}\leq 1$. Using $\widehat{x}_{h}(s)$, the gap between the principal’s expected payment and the least payment is bounded as

	$\displaystyle{U}^{\widehat{x}_{h}(s)}_{h}(s)-U_{h}^{\bm{\pi}}(s)$	$\displaystyle=P_{h}(s,a)\cdot[\widehat{x}_{h}(s)-{x}_{h}(s)]$
		$\displaystyle=\widehat{P}_{h}(s,a)\cdot[\widehat{x}_{h}(s)-{x}_{h}(s)]+[P_{h}(% s,a)-\widehat{P}_{h}(s,a)]\cdot[\widehat{x}_{h}(s)-{x}_{h}(s)]$
		$\displaystyle\leq\widehat{P}_{h}(s,a)\frac{\epsilon_{h}}{\lambda_{s}}x^{a}+% \left\lVert P_{h}(s,a)-\widehat{P}_{h}(s,a)\right\rVert_{1}\left\lVert\widehat% {x}_{h}(s)-{x}_{h}(s)\right\rVert_{\infty}$
		$\displaystyle\leq\frac{\epsilon_{h}}{\lambda_{s}}+\epsilon^{\prime}.$

In addition, since $U_{h}^{\bm{\pi}}(s)\leq H$ and ${U}^{\widehat{x}_{h}(s)}_{h}(s)\geq 0$, we obtain the bound ${U}^{\widehat{x}_{h}(s)}_{h}(s)-U_{h}^{\bm{\pi}}(s)\leq\mathop{min}\{H,\frac{\epsilon_{h}}{\lambda_{s}}+\epsilon^{\prime}\}.$

Since $\left|\widehat{U}^{\bm{\pi}}_{h}(s)-U_{h}^{\widehat{x}_{h}(s)}(s)\right|=\left% |[\widehat{P}_{h}(s,a)-P_{h}(s,a)]\cdot\widehat{x}_{h}(s)\right|\leq\epsilon^{\prime}$ , we have

\left|\widehat{\zeta}^{\bm{\pi}}_{h}(s)-{\zeta}^{\bm{\pi}}_{h}(s)\right|=\left% |\widehat{U}^{\bm{\pi}}_{h}(s)-U^{\bm{\pi}}_{h}(s)\right|\leq\left|\widehat{U}% ^{\bm{\pi}}_{h}(s)-U_{h}^{\widehat{x}_{h}(s)}(s)\right|+\left|{U}^{\widehat{x}% _{h}(s)}_{h}(s)-U_{h}^{\bm{\pi}}(s)\right|\leq\mathop{min}\{H,\frac{\epsilon_{% h}}{\lambda_{s}}+2\epsilon^{\prime}\}.

We now plug in the values of $\alpha,\beta$. For the base case in the $H$-th step, we have $\alpha=0,\beta=\frac{\epsilon}{\eta}$. Then, $\epsilon_{H}=\epsilon$ and $\left|\widehat{\zeta}^{\bm{\pi}}_{H}(s)-\zeta^{\bm{\pi}}_{H}(s)\right|=\left|\widehat{U}^{\bm{\pi}}_{H}(s)-U^{\bm{\pi}}_{H}(s)\right|=O\left(\mathop{min}\{H,\frac{\epsilon}{\lambda_{s}}+\epsilon^{\prime}\}\right)$.

For the inductive case in the $h$-th step, given that $\left\lVert\widehat{U}^{\bm{\pi}}_{h+1}-U^{\bm{\pi}}_{h+1}\right\rVert_{\infty}=O(\mathop{min}\{H,\epsilon\lambda_{s}^{h-H}+\epsilon^{\prime}\lambda_{s}^{h+1-H}\})$, we have $\alpha=O(\mathop{min}\{H,\epsilon\lambda_{s}^{h-H}+\epsilon^{\prime}\lambda_{s}^{h+1-H}\}),\beta=\frac{\epsilon}{H-h+\eta}$ and $\epsilon_{h}=2\left\lVert\widehat{U}^{\bm{\pi}}_{h+1}-U^{\bm{\pi}}_{h+1}\right\rVert_{\infty}+\epsilon=O(\mathop{min}\{H,\epsilon\lambda_{s}^{h-H}+\epsilon^{\prime}\lambda_{s}^{h+1-H}\})$. Then, we can plug in $\epsilon_{h}$ and bound the estimation error as ${U}^{\widehat{x}_{h}(s)}_{h}(s)-U_{h}^{\bm{\pi}}(s)\eqsim\left|\widehat{\zeta}^{\bm{\pi}}_{h}(s)-{\zeta}^{\bm{\pi}}_{h}(s)\right|=\left|\widehat{U}^{\bm{\pi}}_{h}(s)-U^{\bm{\pi}}_{h}(s)\right|\leq O\left(\mathop{min}\{H,\epsilon\lambda_{s}^{h-1-H}+\epsilon^{\prime}\lambda_{s}^{h-H}\}\right).$ This gives the bound claimed in the lemma. ∎
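The induction above amounts to a simple backward recursion on the robustness parameter and the value error. The sketch below unrolls it using the one-step claim, i.e., $\epsilon_{h}=(H-h+\eta)\beta+2\alpha$ with $(H-h+\eta)\beta=\epsilon$, and the error bound $\min\{H,\epsilon_{h}/\lambda_{s}+2\epsilon^{\prime}\}$. The function name and numbers are hypothetical, and the sketch does not attempt to reproduce the constants hidden in the lemma's $O(\cdot)$.

```python
def robustness_schedule(H, eps, eps_prime, lam_s):
    """Unroll eps_h and the value-error bound backward from step H to step 1.

    err[h] bounds |U_hat_h - U_h|; err[H + 1] = 0 is the terminal condition.
    """
    eps_h = {}
    err = {H + 1: 0.0}
    for h in range(H, 0, -1):
        alpha = err[h + 1]                       # error inherited from step h + 1
        eps_h[h] = min(H, eps + 2 * alpha)       # (H - h + eta) * beta = eps
        err[h] = min(H, eps_h[h] / lam_s + 2 * eps_prime)
    return eps_h, err

eps_h, err = robustness_schedule(H=3, eps=0.001, eps_prime=0.001, lam_s=0.5)
```

Each backward step inflates the error by a factor proportional to $1/\lambda_{s}$, which is the source of the exponential dependence on $H$ in Lemma 15; the clipping at $H$ keeps the bound from becoming vacuous.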

Lemma 16.

If $\mathcal{E}_{\text{model}}$ is true, under Assumptions 5 and 8, with $T_{1}=O(HAS^{5}\kappa_{0}^{-1}\log(\eta T\lambda_{s}^{-1}))$, at any episode $t\in[T]$, we have

\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(\mathop{min}% \big{\{}H,\eta\lambda_{s}^{1-H}\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/t}\big{\}}% \right).

Proof.

As in the proof of Lemma 14, let $\sup_{s,a,a^{\prime},h}\left\lVert\widehat{\mu}^{t}_{h}(s,a,a^{\prime})-{\mu}_{h}(s,a,a^{\prime})\right\rVert_{1}\leq\varepsilon$, so that Lemma 11 gives

\left\lVert\widehat{P}_{h}^{t}(s,a)-P_{h}(s,a)\right\rVert_{1}\leq 2\sqrt{% \frac{\ln(SAHT/\delta)}{2\kappa t}}+\varepsilon,

where we use the fact $N_{h}^{t}(s)\geq 2\kappa t$ under Assumption 5, similar to the analysis in Lemma 10. By Lemma 15, Equation (E.5) constructs contracts that induce any action policy $\bm{\pi}$ such that

\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|=O\left(\mathop{min}\big{\{}H,(H+\eta)\varepsilon\lambda_{s}^{-H}+\eta\lambda_{s}^{1-H}\sqrt{\frac{\ln(SAHT/\delta)}{2\kappa t}}\big{\}}\right).

Then, we can set $\varepsilon=\frac{\lambda_{s}}{(H+\eta)\sqrt{t}}$ so that the second term dominates the bound, and we have

	$\displaystyle\left|\widehat{\zeta}^{t,\bm{\pi}}-\zeta^{\bm{\pi}}\right|$	$\displaystyle=O\left(\mathop{min}\big{\{}H,\eta\lambda_{s}^{1-H}\sqrt{\frac{\ln(SAHT/\delta)}{\kappa t}}\big{\}}\right)$
		$\displaystyle=O\left(\mathop{min}\big{\{}H,\eta\lambda_{s}^{1-H}\kappa^{-1/2}\sqrt{\ln(SAHT/\delta)/t}\big{\}}\right).$

It remains to ensure that $T_{1}$ rounds suffice to reach $\varepsilon=\frac{\lambda_{s}}{(H+\eta)\sqrt{t}}$. By Lemma 10, the sample complexity of reaching accuracy $\varepsilon$ is on the order of $\widetilde{O}(HAS^{5}\kappa_{0}^{-1}\log(t))$. Hence, we can set $T_{1}=O(HAS^{5}\kappa_{0}^{-1}\log((H+\eta)T\lambda_{s}^{-1}))=O(HAS^{5}\kappa_{0}^{-1}\log(\eta T\lambda_{s}^{-1}))$. ∎

	$\displaystyle Q_{h}^{}(s,a),x^{}_{h}(s;a)$	$\displaystyle=\mathop{minarg}_{x:{\mathcal{S}}\to\mathbb{R}_{+}}r_{h}(s,a)+P_{% h}(s,a)\cdot[V^{}_{h+1}-x^{}_{h}(s;a)]$
	s.t.	$\displaystyle\quad P_{h}(s,a)\cdot[x+U^{}_{h+1}]-c_{h}(s,a)\geq P_{h}(s,a^{% \prime})\cdot[x+U^{}_{h+1}]-c_{h}(s,a^{\prime}),\forall a^{\prime}\neq a,$

	$\displaystyle\operatorname{Reg}(T)$	$\displaystyle\leq\sum_{t=1}^{T}(2+2\eta)\sqrt{\frac{\|{\mathcal{S}}\|\log(T\|% \mathcal{A}\|/\delta)}{N_{t}(a_{t})}}+2\lambda^{-1}\varepsilon$
		$\displaystyle\leq(4+4\eta)\sqrt{\|{\mathcal{S}}\|\log(T\|\mathcal{A}\|/\delta)}% \sum_{a\in\mathcal{A}}\sqrt{N_{T}(a)}+2\lambda^{-1}\varepsilon T$
		$\displaystyle\leq(4+4\eta)\sqrt{\|{\mathcal{S}}\|\log(T\|\mathcal{A}\|/\delta)}% \sqrt{\|\mathcal{A}\|T}+2\lambda^{-1}\varepsilon T$
		$\displaystyle=O(\eta\sqrt{T\|\mathcal{A}\|\|{\mathcal{S}}\|\log(T\|\mathcal{A}\|/% \delta)}+\varepsilon T/\lambda),$

	$\displaystyle\quad\left\|\widehat{d}(a,a^{\prime})-d(a,a^{\prime})\right\|$	$\displaystyle=\left\|\big{[}P(a)-P(a^{\prime})\big{]}\cdot(x^{a}-x^{0})+\big{[}% \widehat{P}(a)-P(a)-\widehat{P}(a^{\prime})+P(a^{\prime})\big{]}\cdot x^{a}\right\|$
		$\displaystyle\leq(1-\alpha)\left\|\big{[}P(a)-P(a^{\prime})\big{]}\cdot(x^{a}-x% ^{a^{\prime}})\right\|+2\epsilon\eta$
		$\displaystyle\leq(1-\alpha)\epsilon^{2}+2\epsilon\eta$
		$\displaystyle<3\epsilon\eta=3\varepsilon/10.$
