Contractual Reinforcement Learning:
Pulling Arms with Invisible Hands
This work is supported in part by Army Research Office Award W911NF-23-1-0030, ONR Award N00014-23-1-2802, and NSF Award CCF-2303372.
Abstract
The agency problem emerges in today’s large scale machine learning tasks, where the learners are unable to direct content creation or enforce data collection. In this work, we propose a theoretical framework for aligning economic interests of different stakeholders in the online learning problems through contract design. The problem, termed contractual reinforcement learning, naturally arises from the classic model of Markov decision processes, where a learning principal seeks to optimally influence the agent’s action policy for their common interests through a set of payment rules contingent on the realization of next state. For the planning problem, we design an efficient dynamic programming algorithm to determine the optimal contracts against the far-sighted agent. For the learning problem, we introduce a generic design of no-regret learning algorithms to untangle the challenges from robust design of contracts to the balance of exploration and exploitation, reducing the complexity analysis to the construction of efficient search algorithms. For several natural classes of problems, we design tailored search algorithms that provably achieve regret. We also present an algorithm with for the general problem that improves the existing analysis in online contract design with mild technical assumptions.
1 Introduction
“Every individual… intends only his own gain, and is led by an invisible hand to promote an end which was no part of his intention.”
— Adam Smith, The Wealth of Nations, 1776.
The “invisible hand” metaphor by Adam Smith illustrates how properly designed incentive structures can guide self-interested individuals to inadvertently promote the greater social good. This concept is increasingly relevant in the realm of machine learning, as the scale of applications expands and the conflict of economic interests intensifies. For example, an Internet platform wants to estimate the ad revenues from serving different types of content, but it is up to the creators to decide what content to produce. While the platform seeks high-quality content to boost its long-term growth, creators may opt to minimize their production costs. This misalignment has prompted platforms to implement revenue-sharing models, fueling the growth of the creator economy, projected to exceed half a trillion by 2027 [16, 1, 25, 2]. However, current incentive models are inadequate, especially in light of their roles in exacerbating the proliferation of clickbait and misinformation online [54, 55, 31]. Moreover, this issue of misalignment extends well beyond content platforms. E-commerce sites rely on sufficient consumers experimenting with new products for accurate preference assessments. Gig platforms depend on freelance workers accepting tasks to gather essential operational data. Even recommender systems are paying users for their engagement in order to effectively optimize their algorithms [3]. In these cases, the learner’s hands are tied, and decision-makers interacting with the environment have their own objectives, dooming the system to under-exploration regardless of the learner’s objective. Hence, there is a pressing need to pursue formal treatments of incentive alignment problems between the learners and decision-makers and to design principled learning algorithms with statistical and computational efficiency guarantees.
Contributions. On the conceptual side, the presence of self-interested decision-makers challenges our common assumption in online learning, where a single learner controls all the interactions with the environment. This paper introduces the contractual reinforcement learning (RL) problem in the principal-agent Markov decision process (PAMDP), where we adopt the principal-agent model from contract theory [27, 24] to capture strategic interactions between the learner and the decision-maker. As illustrated in Figure 1, the learner (henceforth, principal/she) collects the rewards from the actions of the decision-maker (henceforth, agent/he). Without any incentive design, the agent simply optimizes his policy in a standard Markov decision process (MDP) based on his cost function. However, since the agent’s optimal policy is not necessarily in the principal’s best interest, the principal is motivated to incentivize the agent to act in her favor by designing contracts that specify the payment rules contingent on the realization of the next state. The core challenge in this design problem is the information asymmetry at two levels: (1) the principal cannot observe the agent’s action a priori and has to condition her payment on the probabilistic outcome of the action, a phenomenon known as moral hazard in economics; (2) the agent is far-sighted in that he is willing to take suboptimal actions at one step in order to reach a more favorable state in future steps, a major barrier for theoretical analysis in multi-agent learning problems.
![Refer to caption](x1.png)
On the technical side, this paper provides a comprehensive solution framework to address the unique learning and computational challenges when moral hazard meets far-sighted agency in contractual RL problems. In Section 2, we define state value functions for both the agent and principal, from which we derive a new class of Bellman equations to characterize the intricate correspondence between the principal and agent’s optimal policy. This leads to our Theorem 1, which shows that the principal’s optimal planning problem can be solved by a clean formulation of dynamic programming in polynomial time. The learning problem is more involved, so we begin with the contractual bandit learning problem (episode length one) in Section 3 to focus on the challenges from moral hazard. In particular, to achieve low regret, the principal’s learning algorithm must balance exploration and exploitation while continuously improving its estimation of the agent’s preferences to determine cost-efficient contracts. In Theorem 2, we construct a generic algorithm that reduces the learning problem to a standard online learning problem and an efficient search problem for the agent’s decision boundary. As a consequence, we are able to obtain sublinear regret guarantees under different setups, summarized in Table 1. The efficient search algorithm we designed for learning the outcome distribution difference in the simplex may be of interest for general use. With these insights, we delve into the full contractual RL problem in Section 4 and show a provably efficient learning algorithm under several technical assumptions in Theorem 3. Meanwhile, the general result highlights a trade-off between statistical and computational tractability, leaving an intriguing open question on the existence of the best-of-both-worlds solution.
The complexity of search is in logarithmic order yet with a large constant in the Markovian setup, and we expect an improved analysis by organically combining the search and exploration in the algorithm design.
| Moral Hazard | Far-sighted Agency | Known Cost | Regret |
| --- | --- | --- | --- |
| ✗ | ✗ | ✗ | see Corollary 2.1 |
| ✓ | ✗ | ✗ | see Corollary 2.2 |
| ✓ | ✗ | ✓ | see Corollary 2.3 |
| ✓ | ✓ | ✓ | see Theorem 3 |
Related Work. Our problem is built upon the principal-agent model in contract theory, a crucial branch of economics [27, 48, 35]. Driven by an accelerating trend of contract-based markets deployed to Internet-based applications, the contract design problem has recently started to receive surging interest, especially from the computer science community [24, 28, 9, 20]. The principal-agent model has also been applied to the delegation of online search problems [13, 33] and machine learning tasks [45]. While these works focus on the computational aspects of contract design, we consider the adaptive design problem of the contract between learners and decision-makers in an initially unknown environment. For our learning problem, the dynamic (contextual) pricing problem [34, 41, 47, 39, 36] can be viewed as one of its special cases, where the contract is contingent on the agent’s binary action and the principal already knows her reward function. As we will see in Section 3, our algorithm is able to borrow some design insights from these pricing problems. Meanwhile, the online contract design problem begins as a variant of dynamic pricing [34] where the agent’s cost is stochastically (or adversarially) chosen, and the regret bound is (or in the adversarial setup). Ho et al. [29] and Zhu et al. [57] consider a generalized model where the agent has multiple actions, and both the cost and reward of his actions are determined by the agent’s Bayesian type, which is unknown to the learner. This problem relates to the continuum-armed bandit problem [6], except the principal’s utility is not continuous, and Zhu et al. [57] show an almost tight linear regret bound for some constant and the number of outcomes . In comparison, our learning problem is closer to the standard contract design model, in which the agent type is observable by the principal (captured by the initial state or context), as many platforms hold a good amount of data on their users and content creators.
More importantly, this modeling choice allows us to focus on solving the key challenges of learning and planning the optimal contract under moral hazard, where we are able to achieve regret for a large class of problems and in general under mild assumptions. Lastly, several recent works [22, 21, 46] consider a simple special case of our problem, where there is no Markov state transition and the principal can directly incentivize the agent to take a certain action without the barrier of moral hazard. We defer further discussion of the related work to Appendix A.
2 Problem Formulation
2.1 The Principal-Agent Markov Decision Process
Let us first recall the standard reinforcement learning problem in a (finite-horizon) Markov decision process , where we have the agent’s action space , the environment’s state space , the transition kernel , the expected reward function , the initial state distribution and the horizon length . The contractual reinforcement learning problem simply extends the MDP to a principal-agent Markov decision process with the additional cost function . (Footnote 1: The scale of the cost and reward function range is without loss of generality, due to constant shifting and rescaling, and thereby covers existing models [22, 21, 46] that assume a positive reward function for the agent.) In this process, the agent interacts with the environment by taking actions and bearing the costs, whereas the principal receives the reward from the environment. Unable to directly interact with the environment, the principal has to instead design and implement contracts to incentivize the agent to take actions in her interest. Below, we formalize the design of their policies.
Following from a standard MDP, the agent’s action policy specifies that at each step , given the state , the agent would take the action . In the following subsection, we will discuss how the agent chooses his action policy and that it suffices to only consider deterministic action policies. Meanwhile, the principal’s contract policy is a sequence of non-liable payment rules , where specifies the payment to the agent if the next state is realized, given the current state at the -th step. The non-liability constraint ensures that the principal’s payment in the contract for any realization of the next state must be non-negative; the problem would otherwise degenerate with a trivially optimal solution for the principal (see e.g., [24]). Denote as the agent and principal’s policy space, respectively. Let and thus .
The typical setting of the PAMDP problems can be summarized by the following steps. In the beginning of each episode, the initial state is realized and observed by both the principal and the agent. Afterwards, the principal commits to a contract policy and the agent accordingly chooses an action policy . Their interactions then proceed as follows at each step ,
1. The agent takes an action and bears the cost .
2. The next state is realized and observed by both the principal and agent.
3. The principal receives a noisy reward and pays the agent .
4. The principal observes the agent’s action .
In this step, the principal’s utility is , her reward minus the payment to agent, whereas the agent’s utility is , the payment from principal minus his cost. The reward noise has zero mean such that . We refer the readers to Appendix B.1 for a summary of notations and Appendix B.2 for a full discussion of our modeling choices.
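The four interaction steps above can be sketched as a small simulation. This is only an illustrative sketch: the two-state instance, the names `run_episode`, `contract`, and `policy`, and the zero-indexed steps are assumptions made here, not taken from the paper.

```python
import random

# Hypothetical toy instance: 2 states, 2 actions, horizon H = 2 (steps indexed from 0).
P = [[[1.0, 0.0], [0.5, 0.5]],   # P[s][a][s2]: probability of next state s2
     [[1.0, 0.0], [0.5, 0.5]]]
r = [[0.0, 0.0], [1.0, 1.0]]     # r[s][a]: principal's expected reward
c = [[0.0, 0.1], [0.0, 0.1]]     # c[s][a]: agent's cost
H = 2
# contract[h][s][s2]: non-negative payment if s2 is realized after state s at step h
contract = [[[0.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 1.0]]]
policy = [[1, 1], [0, 1]]        # policy[h][s]: agent's deterministic action


def run_episode(s0, rng, noise_std=0.0):
    """Simulate one episode; return (principal utility, agent utility, trajectory)."""
    s, principal_total, agent_total, trajectory = s0, 0.0, 0.0, []
    for h in range(H):
        a = policy[h][s]                                   # 1. agent acts, bears c[s][a]
        s_next = rng.choices(range(len(P[s][a])), weights=P[s][a])[0]  # 2. next state
        reward = r[s][a] + rng.gauss(0.0, noise_std)       # 3. zero-mean reward noise
        payment = contract[h][s][s_next]                   #    payment depends on s_next
        principal_total += reward - payment
        agent_total += payment - c[s][a]
        trajectory.append((s, a))                          # 4. action revealed afterwards
        s = s_next
    return principal_total, agent_total, trajectory


pu, au, traj = run_episode(s0=0, rng=random.Random(0))
surplus = sum(r[s][a] - c[s][a] for s, a in traj)
print(pu, au, surplus)
```

With the noise switched off, the two utilities sum to the reward minus the cost along the realized trajectory, because payments are pure transfers between the two parties.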
2.2 The Optimal Contract Policy
Without any contract design, the model reduces to a standard MDP for the agent and the principal passively collects the reward from the agent’s policy. This outcome could be suboptimal for both the principal and agent. Instead, by reshaping the agent’s reward environment through the design of contract policy, the principal could induce the agent to adopt some action policy with higher social surplus. This motivates the problem of designing the optimal contract policy. We focus on a realistic yet challenging setup in the face of a long-lived, far-sighted and Bayesian rational agent who is also planning optimally for his cumulative reward; we expect the case of myopic agents can be worked out with a simpler approach. In particular, since the agent’s utility is not necessarily under the principal’s optimal contract at any state due to moral hazard, a far-sighted agent could take certain actions that are sub-optimal in the current step, yet steer him toward certain future states where he can obtain higher cumulative utility.
We extend notions of value functions and optimal policies from MDP to PAMDP. Under any action policy and contract policy , we define the principal’s state value function at the -th step as,
and the agent’s state value function at the -th step as,
where the expectations in both are with respect to the randomness of the trajectory (due to the stochasticity of state transitions and action policy). Let and . The principal’s goal is to maximize her value , given the agent’s optimal response , which equivalently maximizes at any initial state with . Hence, we define the principal’s optimal contract policy and the corresponding optimal value function as the optimal solution and value of the following bi-level optimization problem, (Footnote 2: Throughout this paper, we assume the agent breaks ties in favor of the principal. This is without loss of generality in generic games, since the principal can force the tie-breaking by making an infinitesimally small additional payment to the action of her interest.)
(2.1) |
where “” is a convenient operator notation on an optimization problem that returns the optimal objective value followed by its optimal solution. For notational convenience, we will denote the agent’s optimal action policy in response to contract policy as , and use shorthands for the principal’s and agent’s value function under contract policy at the -th step given that the agent responds optimally. Meanwhile, we denote as the principal’s optimal contract policy to induce the agent’s action policy . We use similar shorthands for the principal’s and agent’s value function under contract policy at the -th step given that the agent responds optimally. Notably, since the optimization problem (2.1) hinges on the intricate correspondence between and , it is unclear for now if the principal can efficiently plan her optimal policy by adopting the standard approach in MDP.
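For orientation, the value functions and the bi-level problem described in this subsection can be written out as follows. This is a sketch in assumed notation (the symbols $\pi$, $x$, $r$, $c$, $V$, $U$, $H$ and the step indexing are choices made here for illustration); it follows the verbal definitions, in which the principal's per-step utility is reward minus payment and the agent's is payment minus cost.

```latex
% Assumed notation: \pi = (\pi_h) action policy, x = (x_h) contract policy,
% x_h(s, s') payment at step h, r reward, c cost, H horizon length.
\begin{align*}
V_h^{\pi,x}(s) &= \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H}
    \big(r(s_{h'},a_{h'}) - x_{h'}(s_{h'},s_{h'+1})\big) \,\Big|\, s_h = s\Big],\\
U_h^{\pi,x}(s) &= \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H}
    \big(x_{h'}(s_{h'},s_{h'+1}) - c(s_{h'},a_{h'})\big) \,\Big|\, s_h = s\Big],\\
\max_{x}\; & V_1^{\pi,x}(s) \quad \text{s.t.} \quad
    \pi \in \arg\max_{\pi'} U_1^{\pi',x}(s).
\end{align*}
```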
Solving for the Agent’s Optimal Policy. One key observation is that the correspondence between and has a clean characterization through the Bellman equation. Specifically, both functions can be solved through backward induction with :
(2.2) |
Notice that since is a maximizer of a linear function, the agent’s best responding policy is deterministic without loss of generality. With , the principal’s value function under can also be computed iteratively from :
(2.3) |
Due to the space limit, we only solve the agent’s best response at any given . We refer the reader to Appendix B.3 for the more involved formulation to solve the optimal policy for any given .
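The backward induction for the agent's best response, and the forward evaluation of the principal's value under that response, can be sketched as below. The instance, the function names, and the zero-indexed steps are hypothetical; ties are broken by the lowest action index here, which is a simplification of the tie-breaking rule assumed in the paper.

```python
from fractions import Fraction as Fr


def agent_best_response(P, c, x, H):
    """Backward induction for the agent under a fixed contract policy x.

    Returns (pi, U): pi[h][s] is the chosen action, U[h][s] the agent's value.
    Ties go to the lowest action index (a simplifying assumption).
    """
    S, A = len(P), len(P[0])
    U = [[Fr(0)] * S for _ in range(H + 1)]   # terminal values U[H][s] = 0
    pi = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            # Expected payment plus continuation value, minus the action's cost.
            q = [sum(P[s][a][t] * (x[h][s][t] + U[h + 1][t]) for t in range(S)) - c[s][a]
                 for a in range(A)]
            pi[h][s] = max(range(A), key=lambda a: q[a])
            U[h][s] = q[pi[h][s]]
    return pi, U


def principal_value(P, r, x, pi, H):
    """Principal's value when the agent follows pi under contract policy x."""
    S = len(P)
    V = [[Fr(0)] * S for _ in range(H + 1)]
    for h in reversed(range(H)):
        for s in range(S):
            a = pi[h][s]
            V[h][s] = r[s][a] + sum(P[s][a][t] * (V[h + 1][t] - x[h][s][t]) for t in range(S))
    return V


# Hypothetical instance: action 1 costs 1/10 and reaches state 1 with probability 1/2.
half = Fr(1, 2)
P = [[[1, 0], [half, half]], [[1, 0], [half, half]]]
c = [[0, Fr(1, 10)], [0, Fr(1, 10)]]
r = [[0, 0], [1, 1]]
x = [[[0, 0], [0, 0]],          # step 0 pays nothing
     [[0, 0], [0, 1]]]          # step 1 pays 1 only on the transition 1 -> 1
pi, U = agent_best_response(P, c, x, H=2)
V = principal_value(P, r, x, pi, H=2)
print(pi, U[0], V[0])
```

In this toy instance the agent takes the costly action at step 0 even though that step pays nothing, because it raises his chance of reaching the state where the step-1 contract is profitable. This is the far-sighted behavior discussed above.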
Value Decomposition. Another key observation is that the value functions can be decomposed into parts that depend only on the agent’s action policy. This is analogous to the standard contract design where the principal’s and agent’s utilities sum up to the social surplus, i.e., the difference between the reward and cost of the agent’s action. Here, let the principal’s expected reward and agent’s expected cost function in the -th step be
By linearity of expectation, for any policy , at any state of any step , we have
Both functions are fixed to the agent’s action policy , regardless of the contract policy . In addition, captures the total amount of expected payment transferred from the principal to the agent since -th step at state . Since the total reward is fixed under any given , the principal’s value is maximized under the minimal total payment, . The function thus serves as the equivalent optimization objective in the least-payment Bellman equation in Appendix B.3.
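The decomposition can be made explicit as follows. This is a sketch in assumed notation: $R$, $C$, and the payment symbol $Z$ are labels chosen here for the expected cumulative reward, cost, and payment, and need not match the paper's symbols.

```latex
% Assumed notation: R, C, Z are expected cumulative reward, cost, and payment
% from step h onward when the agent follows \pi under contract policy x.
\begin{align*}
R_h^{\pi}(s) &= \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H} r(s_{h'},a_{h'}) \,\Big|\, s_h = s\Big],
\qquad
C_h^{\pi}(s) = \mathbb{E}\Big[\textstyle\sum_{h'=h}^{H} c(s_{h'},a_{h'}) \,\Big|\, s_h = s\Big],\\
V_h^{\pi,x}(s) &= R_h^{\pi}(s) - Z_h^{\pi,x}(s),
\qquad
U_h^{\pi,x}(s) = Z_h^{\pi,x}(s) - C_h^{\pi}(s),\\
V_h^{\pi,x}(s) + U_h^{\pi,x}(s) &= R_h^{\pi}(s) - C_h^{\pi}(s).
\end{align*}
```

Since $R$ and $C$ do not depend on $x$, maximizing $V$ over contracts that induce a fixed $\pi$ is the same as minimizing the expected total payment $Z$.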
Solving for the Optimal Contract Policy. With the two observations above, it is clear that the principal’s optimal value and policy can be determined by computing for every , according to the least-payment Bellman equation in Appendix B.3. However, this maximization problem is still intractable, as there are exponentially many possible . Instead, we have to interleave the process of solving for the optimal policy with least payment and maximum reward. This enables the following construction of a bi-level backward induction that iteratively solves for the optimal contract policy .
Theorem 1 (Bellman Equations of PAMDP).
The optimal contract policy can be solved by dynamic programming in polynomial time, from to with ,
(2.4) | ||||
To interpret the Bellman equation above, denotes the contract with the least payment to induce the agent to take action at state in step . Given that is the best agent action for the principal to induce, the optimal contract at state in step can be determined as . are respectively the principal’s and agent’s total expected utility from -th step under policy and , which can be viewed as their optimal state-action value function at -th step, serving as the intermediate variable for the computation. See Appendix B.4 for the proof of correctness.
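One way to read the interpretation above is as the following bi-level recursion. This is a sketch reconstructed from the prose rather than the theorem's exact statement; the symbols $x_h^{(s,a)}$, $Q^V$, $Q^U$, $V^*$, $U^*$, and $P$ are assumed here, with $V^*_{H+1} = U^*_{H+1} = 0$.

```latex
% Assumed notation; inner problem: least-payment contract inducing action a at (h, s).
\begin{align*}
x_h^{(s,a)} &\in \arg\min_{x \ge 0} \textstyle\sum_{s'} P(s' \mid s,a)\, x(s')
  \quad \text{s.t.} \quad
  a \in \arg\max_{a'} \Big\{ \textstyle\sum_{s'} P(s' \mid s,a')
      \big[x(s') + U^*_{h+1}(s')\big] - c(s,a') \Big\},\\
Q^V_h(s,a) &= r(s,a) + \textstyle\sum_{s'} P(s' \mid s,a)
      \big[V^*_{h+1}(s') - x_h^{(s,a)}(s')\big],\\
Q^U_h(s,a) &= -c(s,a) + \textstyle\sum_{s'} P(s' \mid s,a)
      \big[U^*_{h+1}(s') + x_h^{(s,a)}(s')\big],\\
a^*_h(s) &\in \arg\max_{a} Q^V_h(s,a), \qquad
V^*_h(s) = Q^V_h\big(s, a^*_h(s)\big), \qquad
U^*_h(s) = Q^U_h\big(s, a^*_h(s)\big).
\end{align*}
```

Under this reading, each inner problem is a linear program with one variable per next state and one incentive constraint per alternative action, which is consistent with the polynomial-time claim.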
2.3 The Contractual Reinforcement Learning Problem
We now introduce the reinforcement learning problem in PAMDP, where the principal acts as the learner and seeks to adaptively improve its contract policy by interacting with the agent. Following the online learning convention, we use the expected regret to evaluate the learning performance in episodes, where is the principal’s contract policy in the -th episode.
This paper makes a few assumptions for the analysis of reinforcement learning problems. First, the far-sighted agent has perfect knowledge of his cost function and the state transition kernel such that he can always choose the best response. This is realistic because in applications of our interest, agents are the experts (e.g., content creators, freelance workers, ride-sharing drivers) in their fields who have learnt about the environment sufficiently well, whereas the principal, as the system designer, has not. Second, the agent at time is assumed to best respond to . This can be equivalently interpreted as the agent at each time showing up only once. This is motivated by the reality of Internet applications where each individual agent’s participation only accounts for a negligible portion of the system’s traffic and hence has little influence over the entire system’s learning policy, so the best response (regardless of the learning policy) is optimal for each individual. Third, we assume that the design space of contract is restricted to at any step . This reflects the practical concern of contract design under randomness: while a contract with bounded payment may sacrifice optimality, it regularizes the variance in the payment transfer and reduces the risk for both the principal and agents. Moreover, as long as the environment parameters have finite precision, the parameter can be matched to the finite bit complexity of the optimal contract. Though this assumption is without loss of generality from a modeling perspective, we expect future work to develop tighter analysis techniques to relax the dependency on . For other regularity assumptions necessary to obtain tractable complexity results, we defer to the technical sections.
3 Warm-up: Solving the Contractual Bandit Learning Problem
In this section, we consider an important special case of the contractual reinforcement learning problem with , which allows us to first focus on the learning challenge from moral hazard without the concern of far-sighted agency. We refer to this problem as the contractual bandit learning problem. Below, we first describe the contractual bandit learning problem with much simplified notations, since it suffices to omit the current state and the time step in the subscripts given that . We then showcase a generic analysis of the statistical complexity of the contractual bandit learning problem.
3.1 The Contractual Bandit Learning Problem
In this setup, the agent’s policy space is simply his action space , i.e., the set of bandit arms. specifies an outcome distribution for each action, where the outcome space could naturally capture the reward stochasticity of each arm in bandit learning problems. The principal designs the contract , contingent on the outcome space , to influence the agent’s choice of action. The principal’s reward and agent’s cost are both functions of the agent’s action, . We consider a contractual bandit learning problem with rounds. In the beginning of each round , the principal commits to a contract and interacts with the agent as follows:
1. The agent takes the action .
2. The outcome is realized and observed by both the principal and agent.
3. The principal receives the noisy reward and pays the agent .
4. The principal observes the agent’s action .
Here, the noisy reward function satisfies , and we assume the agent’s action always maximizes his expected utility, i.e., . To determine the principal’s optimal contract, let us recall the notion of least payment function from the general setup. We similarly define such that for any given action , it outputs the least amount of the expected payment necessary to induce , where denotes the set of all contracts under which the agent would respond with action . Hence, the principal can determine the optimal action to induce, with the optimal contract . With the benchmark of the optimal contract that induces with the least payment , we can measure the learning performance in rounds with the expected regret as follows, This problem is a strict generalization of standard online learning, as it degenerates to the standard notion of regret when . However, with the additional function, the no-regret learner must not only obtain good estimates of both and towards the optimal action, but also implement the contracts that induce the optimal action and have expected payment approaching .
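To make the least payment function concrete, the sketch below computes it by brute force on a payment grid for a tiny two-action, two-outcome instance. Both the instance and the grid search are assumptions made here for illustration; in general this is a linear program, and the grid is only a stand-in that keeps the example dependency-free.

```python
from fractions import Fraction as Fr
from itertools import product

# Hypothetical instance: 2 actions (0 = shirk, 1 = work), 2 outcomes (0 = fail, 1 = success).
P = [[Fr(4, 5), Fr(1, 5)], [Fr(1, 5), Fr(4, 5)]]   # P[a][o]: outcome distribution
c = [Fr(0), Fr(3, 10)]                              # agent's cost per action
r = [Fr(1, 5), Fr(1)]                               # principal's expected reward per action
GRID = [Fr(k, 20) for k in range(21)]               # payments in {0, 0.05, ..., 1}


def agent_utility(a, x):
    """Expected payment under contract x minus the cost of action a."""
    return sum(p * pay for p, pay in zip(P[a], x)) - c[a]


def least_payment(a):
    """Cheapest grid contract under which a is a best response (ties favor the principal)."""
    best = None
    for x in product(GRID, repeat=len(P[a])):
        if all(agent_utility(a, x) >= agent_utility(b, x) for b in range(len(c))):
            pay = sum(p * xo for p, xo in zip(P[a], x))
            if best is None or pay < best[0]:
                best = (pay, x)
    return best   # None if no grid contract induces a


pay = {a: least_payment(a) for a in range(len(c))}
a_star = max(pay, key=lambda a: r[a] - pay[a][0])   # action the principal prefers to induce
print(pay, a_star)
```

Here inducing the costly action requires an expected payment of 2/5, which exceeds its cost of 3/10. The gap of 1/10 is the rent the agent keeps because the principal can only pay on outcomes, not on actions.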
A Simpler Case with Direct Incentives. We remark that a special case of the contractual bandit learning problem assumes the principal is able to design her contract contingent on the agent’s action. This enables the principal to implement any payment rule , and the agent responds with his optimal action . With this relaxation, , since the optimal way to induce any action is to set a direct incentive with . The expected regret reduces to As we will see in this paper, the learning problem becomes more tractable in this setup, since the principal can directly learn the cost function to determine the least payment to induce each action. In Appendix C.2 and C.3 we showcase the multi-armed bandits and linear bandits under direct incentives, both of which have been recently studied by Scheid et al. [46].
3.2 A Generic Approach to Contractual Bandit Learning
We begin with a natural assumption that enables us to simply employ existing techniques in online learning to obtain tractable complexity results for a large class of contractual bandit learning problems.
Assumption 1 (-Inducibility).
For any action , there exists an event as a distribution of outcomes such that .
This assumption ensures the regularity of the problem instance in the sense that each action is dominantly capable of inducing a set of outcomes over others such that for any cost function and any action , there exists a contract to induce . To see this, one can explicitly construct such contract as , where is the event such that . Otherwise, if , then there could be some action that is never the agent’s best response under any contract.
We now propose a generic approach to design statistically efficient algorithms for the contractual bandit learning problem. The key idea of our approach is to decouple the learning of the contract from the learning of the optimal action. In particular, let us first assume an oracle in Definition 1 that is able to construct a robust contract set for each action , despite the uncertainty in parameter estimation. We use the robust contract set to determine the optimistic action and eventually learn the optimal action with no regret. This enables us to decouple the sample complexity result into the estimation errors from the optimal contract and the optimal action, according to Theorem 2.
Definition 1 (-margin Contract Set).
We define the -margin contract set for each action as
Theorem 2.
Under Assumption 1, with a -margin contract set for every action , there is a generic algorithm with regret for the contractual bandit learning problems.
The key step of the proof is Lemma 1, which shows the contracts solved from LP (C.1) have bounded suboptimality from the least payment contract (both in estimation and in execution) depending on parameter estimation error and the robustness margin . This allows us to simply adopt an upper confidence bound argument to bound the regret. See Appendix C.1 for the full proof and the construction of the generic algorithm. The rationale behind Theorem 2 is to separate the learning of the contract sets from the learning of the optimal action. In particular, the learning and construction procedure of such contract sets has been a well-established problem in variants of Stackelberg games [37, 43]. We abstract this problem into the design of a -learning procedure defined below.
Definition 2 (-Learning Procedure).
For a -learning procedure, after any number of rounds, it can construct a robust contract set such that .
Based on the concept in Definition 2, an immediate implication of Theorem 2 is that if there is an -learning procedure, a simple “prepare-then-commit” style algorithm can achieve regret in the contractual bandit problem. That is, it first prepares for a warm start by running the learning procedure for rounds to obtain the -margin contract sets, then commits to follow Algorithm 2 for the remaining rounds. Furthermore, using the standard doubling trick [15], we can convert the “prepare-then-commit” style algorithm into an anytime algorithm with the same regret guarantee that is agnostic to the time horizon during its construction. Therefore, the difficulty of solving the contractual bandit learning problem hinges on the statistical efficiency of the learning procedure, which heavily depends on the problem structure.
Solving Bandit Problems under Direct Incentives. As a direct application of Theorem 2, we show that the -learning procedure can be constructed for the two bandit problems under direct incentives and thus admits regret online learning algorithm. The construction of the efficient search algorithm essentially relies on the binary search for the cost of each arm. In addition, the binary search algorithm can be generalized to cases with infinitely many arms. Such a problem is known as contextual search, and recent work [38] has established clean solutions with nearly optimal performance. We defer their detailed construction and proofs to Appendix C.2 and C.3.
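The binary search for an arm's cost under direct incentives can be sketched as follows. This is a minimal illustration, not the construction in Appendix C.2: it assumes costs lie in [0, 1], that some arm has zero cost (so the switching threshold equals the arm's cost), and that ties go to the paid arm; `agent_response` and `search_cost` are hypothetical names.

```python
def agent_response(payment, cost):
    """Agent picks the arm maximizing payment - cost; ties go to the higher payment."""
    return max(range(len(cost)), key=lambda a: (payment[a] - cost[a], payment[a]))


def search_cost(arm, cost, eps=1e-3):
    """Binary-search the smallest payment on `arm` that makes the agent pull it."""
    lo, hi, rounds = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        payment = [0.0] * len(cost)
        payment[arm] = mid                 # pay only for the target arm
        if agent_response(payment, cost) == arm:
            hi = mid                       # mid is enough to induce the arm
        else:
            lo = mid                       # mid is too small
        rounds += 1
    return lo, hi, rounds


true_cost = [0.0, 0.37, 0.62]              # hidden from the principal; used only to simulate the agent
lo, hi, rounds = search_cost(1, true_cost)
print(lo, hi, rounds)
```

The upper endpoint `hi` is always a payment that induces the arm, so committing to it afterwards overpays by at most `eps`, and the number of rounds spent grows only logarithmically in `1/eps`.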
Corollary 2.1.
Multi-armed bandits and linear bandits under direct incentives have regret.
Solving Contractual Bandit Problems under Moral Hazard. The construction of an efficient learning procedure is difficult in general contractual bandit learning. We instead start with sufficient knowledge of to construct an -learning procedure under the following assumption. This assumption is motivated by practice, where the principal would ask the agent to provide a listing of desired conditions for him to perform different levels of service. The search problem is otherwise known to have an exponential sample complexity lower bound in Stackelberg games [43].
Assumption 2 (Preliminary Contracts).
For any , the principal has the preliminary knowledge to construct a non-liable contract that induces the agent’s action with constant payment.
We defer the construction of this learning procedure and its proof to Appendix C.4. As a result, we can construct an explore-then-commit style algorithm with regret for general contractual bandit learning. Specifically, this algorithm induces the agent to take each action uniformly at random for rounds under Assumption 2. Then, given that the outcome distribution is estimated with error up to , it can efficiently estimate the difference of cost up to error and thus construct an -optimal contract to induce the optimal action in the remaining rounds.
Corollary 2.2.
This result reveals the core challenge of learning the optimal contract under moral hazard. That is, constructing the contract to induce the optimal action already requires a sufficiently good estimate of for all actions (including the suboptimal ones). This observation raises the question of whether it is possible to learn without playing the costly sub-optimal action, which is the barrier to achieving regret. The answer turns out to be “Yes” but with some catches. The solution is to implement a binary search procedure for contracts near the hyperplane formed by the linear system . We want to solve for the parameters and in the linear system with bounded errors using a number of contracts that almost satisfy the linear system. This is however impossible unless the principal knows at least one set of parameters in the linear system to ensure it has full rank.
Corollary 2.3.
Under Assumption 1 and with the knowledge of agent’s cost, regret can be achieved for contractual bandit learning problems.
In Appendix D, we formally show that, knowing the agent’s cost, there is an efficient learning procedure for the unknown parameters with small errors under mild assumptions. This allows us to attain for the general contractual bandit problem, and we showcase its application in designing contractual RL algorithms in the next section. Since the design and analysis of the learning procedure is highly technical, we also demonstrate the high-level idea on a simplified instance in Example 1 of Appendix D. More generally, we expect a similar learning procedure exists if we alternatively assume some predictive state in such that the principal knows , since it would also eliminate one extra degree of freedom in the linear system above.
4 The Complexity of Contractual Reinforcement Learning
If we treat each stationary policy in contractual RL as an arm and its induced visitation measure (see its formal definition in Appendix E.1) as an outcome in the contractual bandit problem, the generic algorithm from Section 3.2 already provides a regret bound. However, the computational and statistical complexity of both Algorithm 2 and 3 has polynomial dependence on the size of action space, which has become exponential as . Moreover, as pointed out above, it requires a uniformly good knowledge over the transition kernel to constructing the near-optimal contract policy under the moral hazard. In this section, we provide an improved analysis for the complexity of contractual reinforcement learning, given that the agent’s cost function is known initially. This assumption allows us to leverage the learning procedure designed in the last section to efficiently learn the parameters for all .
We sketch the no-regret learning algorithm for contractual RL in Algorithm 1, which cuts the number of episodes into two phases and can be made agnostic to with the doubling trick. It begins by running the -learning procedure to efficiently obtain the estimated parameter for the construction of a robust contract policy. Then, the algorithm uses a solver to determine the robust contract policy that induces an optimistic action policy with almost optimal payment. In Theorem 3, we state the complexity results under two different solvers that work under different technical assumptions and provide different trade-offs between statistical and computational complexity. Here, in the regret bound are constants in the regularity assumptions, and omit terms from learning , though the effect of these constants can be canceled out only for sufficiently large ; we defer the details to Appendix E. Below we zoom into the construction of each component.
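The two-phase structure combined with the doubling trick can be sketched as a schedule over episodes. This is only an illustrative skeleton and not Algorithm 1 itself: the function names are hypothetical, and the square-root rule for the length of the first phase is an assumption made for the example, not the split used in the paper.

```python
import math


def two_phase_schedule(num_episodes, explore_len_fn=lambda t: math.ceil(math.sqrt(t))):
    """Split an epoch into (parameter-learning episodes, contract-solving episodes).

    explore_len_fn is a placeholder assumption, not the paper's choice.
    """
    n_explore = min(num_episodes, explore_len_fn(num_episodes))
    return n_explore, num_episodes - n_explore


def doubling_schedule(total_episodes):
    """Epochs of guessed length 1, 2, 4, ...; each epoch restarts the two-phase split.

    total_episodes is used only to simulate where the run stops; a learner
    would simply keep doubling its guess until it is interrupted.
    """
    schedule, start, guess = [], 0, 1
    while start < total_episodes:
        length = min(guess, total_episodes - start)
        schedule.append((start, *two_phase_schedule(length)))
        start += length
        guess *= 2
    return schedule


schedule = doubling_schedule(10)
for start, n_explore, n_exploit in schedule:
    print(f"epoch at episode {start}: {n_explore} learning, {n_exploit} solving")
```

Each tuple records where an epoch starts and how many of its episodes go to each phase, so the learner never needs the total number of episodes in advance.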
Theorem 3.
-Learning Procedure in Contractual RL. One challenge in the construction is the need to separate the stepwise interference among . Otherwise, the actual response space for the agent is , which is prohibitively large even for binary search. Our solution relies on the observation that if we fix and tune only, the agent’s expected profits remain unchanged. This allows us to set without influencing the agent’s action policy for step . Another key challenge in constructing the oracle in the MDP setting is to guarantee sufficient visitation measure over each state at step . To maximize the visitation measure of a particular state at step , we let have nonzero values such that the agent has a strong incentive to maximize her visitation measure over state at step . To simplify our analysis, we assume that the maximal visitation measure at each state and at each step is bounded below, though we expect this assumption can be relaxed via a more careful analysis, since states that are rarely visited contribute little to the estimation of the cumulative utility. Lastly, the task of setting is solved in the bandit learning setup under the techniques and assumptions specified in Appendix D. See Section E.2 for the formal proof and detailed construction of the learning procedure.
Solving for Optimistic and Robust Contract Policies. We present two different solvers for the optimistic contract with bounded suboptimality using the estimated parameters. Their basic idea is the same, which is to include an additional bonus for optimism and a margin for robustness. However, it turns out that each ensures either statistical or computational efficiency but not both, leaving an intriguing open question on the existence of a best-of-both-worlds solver. For the solver in Algorithm 6, we directly solve for the optimal contract policy according to LP (2.1) with an additional bonus and margin set for the entire policy. For the solver in Algorithm 7, we employ value iteration from the Bellman equation (2.4) with the bonus and margin set at every step. Both solvers require an inducibility assumption similar to Assumption 1 in the contractual bandit learning problem. However, the computationally efficient solver requires the inducibility assumption to hold at every step, whereas the statistically efficient solver only requires the inducibility assumption to hold at the trajectory level. We defer their detailed construction and their proofs to Appendix E.3 and E.4.
5 Conclusion
In this paper, we propose the study of contractual reinforcement learning problems in which the principal learns to influence the agent’s policy by adaptively designing contracts that are contingent on the state realization. The principal must not only balance the tradeoff between her payments and rewards from the agent’s policy, but also incentivize the agent’s exploration for her learning in an unknown environment. Our primary approach is to decouple this general problem into a standard online learning problem and a hyperplane search problem. This enables a clean analysis of the no-regret learning guarantee under several variants of technical assumptions. Meanwhile, several technical gaps remain for future work, including a tighter analysis under relaxed assumptions and the general setup where the agent adaptively improves his policy. We believe this model forms a natural theoretical basis for the agency problem in today’s large scale machine learning tasks, where the economic incentives of users, creators, and service providers stand in conflict with the Internet platform’s long-term objective. More generally, it sheds light on the emergent problems of AI alignment from the perspective of steering AI behaviors through reward-shaping in its training environment. We hope this work will motivate new avenues for developing robust, incentive-compatible frameworks that align diverse stakeholder interests in complex digital ecosystems.
References
- [1] Creator earnings report breakdown, where are we in the creator economy? https://neoreach.com/creator-earnings/. Accessed: 2024-05-18.
- [2] The creator economy. https://www.goldmansachs.com/intelligence/pages/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027.html. Accessed: 2024-05-18.
- [3] Tiktok lite, a new app quietly released in france that rewards screen time. https://www.lemonde.fr/en/pixels/article/2024/04/13/tiktok-lite-a-new-app-quietly-released-in-france-that-rewards-screen-time_6668286_13.html. Accessed: 2024-05-18.
- Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
- Agarwal et al. [2019] Alekh Agarwal, Nan Jiang, Sham M Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, pages 10–4, 2019.
- Agrawal [1995] Rajeev Agrawal. The continuum-armed bandit problem. SIAM journal on control and optimization, 33(6):1926–1951, 1995.
- Agrawal and Devanur [2016] Shipra Agrawal and Nikhil Devanur. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29, 2016.
- Agrawal and Devanur [2014] Shipra Agrawal and Nikhil R Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 989–1006, 2014.
- Alon et al. [2021] Tal Alon, Paul Dütting, and Inbal Talgam-Cohen. Contracts with private cost per unit-of-effort. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 52–69, 2021.
- Badanidiyuru et al. [2018] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. Journal of the ACM (JACM), 65(3):1–55, 2018.
- Bahar et al. [2020] Gal Bahar, Omer Ben-Porat, Kevin Leyton-Brown, and Moshe Tennenholtz. Fiduciary bandits. In International Conference on Machine Learning, pages 518–527. PMLR, 2020.
- Balcan et al. [2015] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Commitment without regrets: Online learning in stackelberg security games. In Proceedings of the sixteenth ACM conference on economics and computation, pages 61–78, 2015.
- Bechtel et al. [2022] Curtis Bechtel, Shaddin Dughmi, and Neel Patel. Delegated pandora’s box. arXiv preprint arXiv:2202.10382, 2022.
- Bernasconi et al. [2023] Martino Bernasconi, Matteo Castiglioni, Andrea Celli, Alberto Marchesi, Francesco Trovò, and Nicola Gatti. Optimal rates and efficient algorithms for online bayesian persuasion. In International Conference on Machine Learning, pages 2164–2183. PMLR, 2023.
- Besson and Kaufmann [2018] Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.
- Bhargava [2022] Hemant K Bhargava. The creator economy: Managing ecosystem supply, revenue sharing, and platform design. Management Science, 68(7):5233–5251, 2022.
- Blum et al. [2004] Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. Online learning in online auctions. Theoretical Computer Science, 324(2-3):137–146, 2004.
- Braverman et al. [2019] Mark Braverman, Jieming Mao, Jon Schneider, and S Matthew Weinberg. Multi-armed bandit problems with strategic arms. In Conference on Learning Theory, pages 383–416. PMLR, 2019.
- Cacciamani et al. [2023] Federico Cacciamani, Matteo Castiglioni, and Nicola Gatti. Online information acquisition: Hiring multiple agents. arXiv preprint arXiv:2307.06210, 2023.
- Castiglioni et al. [2022] Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. Designing menus of contracts efficiently: The power of randomization. arXiv preprint arXiv:2202.10966, 2022.
- Dogan et al. [2023a] Ilgin Dogan, Zuo-Jun Max Shen, and Anil Aswani. Estimating and incentivizing imperfect-knowledge agents with hidden rewards. arXiv preprint arXiv:2308.06717, 2023a.
- Dogan et al. [2023b] Ilgin Dogan, Zuo-Jun Max Shen, and Anil Aswani. Repeated principal-agent games with unobserved agent rewards and perfect-knowledge agents. arXiv preprint arXiv:2304.07407, 2023b.
- Dudík et al. [2020] Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. Journal of the ACM (JACM), 67(5):1–57, 2020.
- Dütting et al. [2019] Paul Dütting, Tim Roughgarden, and Inbal Talgam-Cohen. Simple versus optimal contracts. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 369–387, 2019.
- Florida [2022] Richard Florida. The rise of the creator economy. 2022.
- Frazier et al. [2014] Peter Frazier, David Kempe, Jon Kleinberg, and Robert Kleinberg. Incentivizing exploration. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 5–22, 2014.
- Grossman and Hart [1992] Sanford J Grossman and Oliver D Hart. An analysis of the principal-agent problem. In Foundations of insurance economics, pages 302–340. Springer, 1992.
- Guruganesh et al. [2021] Guru Guruganesh, Jon Schneider, and Joshua R Wang. Contracts under moral hazard and adverse selection. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 563–582, 2021.
- Ho et al. [2016] Chien-Ju Ho, Aleksandrs Slivkins, and Jennifer Wortman Vaughan. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. Journal of Artificial Intelligence Research, 55:317–359, 2016.
- Immorlica et al. [2022] Nicole Immorlica, Karthik Sankararaman, Robert Schapire, and Aleksandrs Slivkins. Adversarial bandits with knapsacks. Journal of the ACM, 69(6):1–47, 2022.
- Immorlica et al. [2024] Nicole Immorlica, Meena Jagadeesan, and Brendan Lucier. Clickbait vs. quality: How engagement-based optimization shapes the content landscape in online platforms. arXiv preprint arXiv:2401.09804, 2024.
- Kaelbling et al. [1998] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
- Kleinberg and Kleinberg [2018] Jon Kleinberg and Robert Kleinberg. Delegated search approximates efficient search. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 287–302, 2018.
- Kleinberg and Leighton [2003] Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pages 594–605. IEEE, 2003.
- Laffont and Martimort [2009] Jean-Jacques Laffont and David Martimort. The theory of incentives. In The Theory of Incentives. Princeton university press, 2009.
- Leme and Schneider [2018] Renato Paes Leme and Jon Schneider. Contextual search via intrinsic volumes. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 268–282. IEEE, 2018.
- Letchford et al. [2009] Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In International symposium on algorithmic game theory, pages 250–262. Springer, 2009.
- Liu et al. [2021] Allen Liu, Renato Paes Leme, and Jon Schneider. Optimal contextual pricing and extensions. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1059–1078. SIAM, 2021.
- Lobel et al. [2018] Ilan Lobel, Renato Paes Leme, and Adrian Vladu. Multidimensional binary search for contextual decision-making. Operations Research, 66(5):1346–1361, 2018.
- Mansour et al. [2015] Yishay Mansour, Aleksandrs Slivkins, and Vasilis Syrgkanis. Bayesian incentive-compatible bandit exploration. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 565–582, 2015.
- Mao et al. [2018] Jieming Mao, Renato Leme, and Jon Schneider. Contextual pricing for lipschitz buyers. Advances in Neural Information Processing Systems, 31, 2018.
- McDiarmid et al. [1989] Colin McDiarmid et al. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.
- Peng et al. [2019] Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2149–2156, 2019.
- Ratliff et al. [2018] Lillian J Ratliff, Shreyas Sekar, Liyuan Zheng, and Tanner Fiez. Incentives in the dark: multi-armed bandits for evolving users with unknown type. arXiv preprint arXiv:1803.04008, 2018.
- Saig et al. [2024] Eden Saig, Inbal Talgam-Cohen, and Nir Rosenfeld. Delegated classification. Advances in Neural Information Processing Systems, 36, 2024.
- Scheid et al. [2024] Antoine Scheid, Daniil Tiapkin, Etienne Boursier, Aymeric Capitaine, El Mahdi El Mhamdi, Éric Moulines, Michael I Jordan, and Alain Durmus. Incentivized learning in principal-agent bandit games. arXiv preprint arXiv:2403.03811, 2024.
- Shah et al. [2019] Virag Shah, Ramesh Johari, and Jose Blanchet. Semi-parametric dynamic contextual pricing. Advances in Neural Information Processing Systems, 32, 2019.
- Smith [2004] Stephen A Smith. Contract theory. OUP Oxford, 2004.
- Tran-Thanh et al. [2010] Long Tran-Thanh, Archie Chapman, Enrique Munoz De Cote, Alex Rogers, and Nicholas R Jennings. Epsilon–first policies for budget–limited multi-armed bandits. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
- Tran-Thanh et al. [2012] Long Tran-Thanh, Archie Chapman, Alex Rogers, and Nicholas Jennings. Knapsack based optimal policies for budget–limited multi–armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 1134–1140, 2012.
- Wu et al. [2022] Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael I Jordan, and Haifeng Xu. Sequential information design: Markov persuasion process and its efficient reinforcement learning. arXiv preprint arXiv:2202.10678, 2022.
- Xia et al. [2015] Yingce Xia, Haifang Li, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Thompson sampling for budgeted multi-armed bandits. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
- Xia et al. [2016] Yingce Xia, Tao Qin, Weidong Ma, Nenghai Yu, and Tie-Yan Liu. Budgeted multi-armed bandits with multiple plays. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2210–2216, 2016.
- Yao et al. [2023] Fan Yao, Chuanhao Li, Denis Nekipelov, Hongning Wang, and Haifeng Xu. How bad is top-K recommendation under competing content creators? In International Conference on Machine Learning, pages 39674–39701. PMLR, 2023.
- Yao et al. [2024] Fan Yao, Chuanhao Li, Karthik Abinav Sankararaman, Yiming Liao, Yan Zhu, Qifan Wang, Hongning Wang, and Haifeng Xu. Rethinking incentives in recommender systems: Are monotone rewards always beneficial? Advances in Neural Information Processing Systems, 36, 2024.
- Zhao et al. [2023] Geng Zhao, Banghua Zhu, Jiantao Jiao, and Michael Jordan. Online learning in stackelberg games with an omniscient follower. In International Conference on Machine Learning, pages 42304–42316. PMLR, 2023.
- Zhu et al. [2022] Banghua Zhu, Stephen Bates, Zhuoran Yang, Yixin Wang, Jiantao Jiao, and Michael I Jordan. The sample complexity of online contract design. arXiv preprint arXiv:2211.05732, 2022.
- Zhu et al. [2023] Banghua Zhu, Sai Praneeth Karimireddy, Jiantao Jiao, and Michael I Jordan. Online learning in a creator economy. arXiv preprint arXiv:2305.11381, 2023.
- Zuo [2024] Shiliang Zuo. New perspectives in online contract design: Heterogeneous, homogeneous, non-myopic agents and team production. arXiv preprint arXiv:2403.07143, 2024.
Table of Contents
- 1 Introduction
- 2 Problem Formulation
- 3 Warm-up: Solving the Contractual Bandit Learning Problem
- 4 The Complexity of Contractual Reinforcement Learning
- 5 Conclusion
- A Further Discussion on Related Work
- B Omitted Content in Section 2
- C Proofs in Section 3
- D Searching on Probability Simplex
- E Proofs in Section 4
Appendix A Further Discussion on Related Work
Contract Design.
Contract theory has been a crucial branch of economics [27, 48, 35]. Driven by an accelerating trend of contract-based markets deployed in Internet-based applications, the contract design problem has recently received surging interest, especially from the computer science community [24, 28, 9, 20]. The principal-agent model has also been applied to the delegation of online search problems [13, 33] and machine learning tasks [45]. While these works focus on the computational aspects of contract design, our work adaptively designs the optimal contract between learners and decision makers in an initially unknown environment.
Dynamic Pricing.
Our model is related to the dynamic (contextual) pricing problems [34, 41, 47, 39, 36], where a seller learns to post a price on a single item for a sequence of buyers with a fixed cost (possibly under different contexts). In particular, they can be viewed as special cases of contractual reinforcement learning, where the contract is contingent on the agent’s binary action and the principal already knows her reward function. As we will see in Section 3, our algorithm is able to borrow some design insights from these pricing problems. Nonetheless, our learning algorithm deals with more involved situations, where the agent has multiple actions (e.g., a list of items to buy) whose rewards to the principal are unknown, and the contract is not necessarily contingent on the agent’s actions but on their outcomes. As such, it is possible to achieve constant regret in these pricing problems, whereas the regret lower bound of contractual reinforcement learning is .
Online Contract Design.
The problem begins as a variant of dynamic pricing in Kleinberg and Leighton [34], where the agent’s cost is stochastically (or adversarially) chosen, and the regret bound is (or in the adversarial setup). Ho et al. [29] and Zhu et al. [57] consider a generalized model where the agent has multiple (instead of binary) actions, and both the cost and reward of his actions are determined by the agent’s Bayesian type, which is unknown to the learner. These problems can be viewed as a continuum-armed bandit problem [6], except that the principal’s utility is not continuous. Zhu et al. [57] show an almost tight linear regret bound of this problem for some constant and the number of outcomes . On top of this model, Zhu et al. [58] consider the joint online optimization problem of contract and recommendation policy in the context of the creator economy. Zuo [59] assumes a smoothness condition and presents a direct reduction to the standard Lipschitz bandits problem. In comparison, our learning problem is closer to the standard contract design model, in which the agent type is observable by the principal (captured by the initial state or context), as many platforms hold a good amount of data on their users and content creators. More importantly, this modeling choice allows us to focus on solving the key challenges of learning and planning the optimal contract under moral hazard, where we are able to achieve regret for a large class of problems and in general under mild assumptions. Meanwhile, several recent works [22, 21, 46] consider a simple special case of our problem, where there is no Markov state transition and the principal can directly incentivize the agent to take a certain action without the barrier of moral hazard.
Online Learning with Incentive Constraints.
Incentive design problems have been studied in online learning in several different ways. One line of work, known as incentivized exploration [26, 40], considers situations where the principal recommends that the agents pull different arms, and the recommendation policy must be incentive compatible for the agents in a Bayesian sense w.r.t. each agent’s prior over arm rewards. Bahar et al. [11] consider the fiduciary bandits problem, where a slightly stronger constraint of individual rationality is introduced. Our model differs from these works in that the principal uses monetary incentives (contracts) instead of an information advantage to influence the agents’ decisions. Another line of work, known as budgeted bandits [49, 50, 52, 53], and more generally, bandits with knapsacks [10, 8, 7, 30], models the intrinsic cost of arm selection. There, the cost only affects the learner’s choices due to the limited budget, whereas the learner (principal) in our multi-agent decision making process needs to properly reimburse the agent’s (opportunity) cost in order to influence the agent’s arm choices. Ratliff et al. [44] consider the multi-armed bandit problem where the reward distribution (impacted by user types) shifts according to the history of arm selection. Braverman et al. [18] model each bandit arm as a self-interested agent that keeps part of the reward from the principal to strategically maximize his long-term utility. Besides the online contract design problem, there is also a rich line of literature on online learning problems under Stackelberg games, information design and auction design setups [17, 12, 23, 51, 56, 14, 19].
Appendix B Omitted Content in Section 2
B.1 Notations and Illustrations
We use the notation of for the set . We use to denote the simplex space on discrete set . For probability distribution , we will use to denote the measure of in . We use the notation of as an operator on an optimization problem that returns the optimal objective value followed by its optimal solution, e.g., .
We will interchangeably treat a function as a vector from . As such, we denote the inner product for or for . Denote their outer product as . Denote for . In addition, for function , we use . For conditional probability , we denote as the measure of given .
Symbols | Interpretations |
state, action space | |
transition kernel, initial state distribution | |
noisy reward function at -th step | |
expected reward function at -th step | |
cost function at -th step | |
contract policy | |
action policy | |
action policy space | |
contract policy space | |
principal’s state value function at -th step | |
agent’s state value function at -th step | |
state visitation measure at -th step | |
least payment function at -th step | |
principal’s state-action value function at -th step | |
agent’s state-action value function at -th step |
B.2 Discussion on the Modeling Choices.
We make a few remarks on the procedure of the PAMDP.
1. It is without loss of generality for the principal to commit his contract policy at the very beginning of each episode, since this MDP setup can be viewed as an extensive-form game, as long as the principal, as the first mover, can predict the agent’s response and plan his follow-up move accordingly. Once the principal commits to its contract policy, the agent can also determine his optimal action policy in response.
2. We assume the Markovian state, after its realization, is publicly observable by both the agent and principal, serving as the natural condition and contingency for the contract design. Hence, our model directly uses the state transition kernel, , as the outcome distribution in contract design problems. Otherwise, if either the agent or the principal only partially observes the state, the planning problem is known to be intractable [32], and we leave this open question for future work. A more subtle caveat here is that, different from the standard episodic MDP, the transition kernel in the last step matters, as it influences the principal’s design of the contract.
3. The principal’s noisy reward is set to be conditionally independent of the agent’s action , given the next state . This is necessary for a subtle modeling reason: as we will see in the next section, the contract design problem would become much easier if the principal could condition its payment directly on the agent’s action (i.e., without the concern of moral hazard). Note that since the reward itself can be modeled as a part of the state, the existence of such is without loss of generality, and there is no need to assume additional zero-mean noise on top of .
4. We assume the principal is able to observe the agent’s action once the payment is transferred. This is well-motivated in practice. For example, a content platform may ask the creators to fill in a survey on the amount of time they spent creating their content; the creators have no incentive to misreport this information, as long as their payment is independent of the answers. The more general setup is that the principal is only able to observe a probabilistic signal of the agent taking some action (e.g., from the realization of the next state and knowledge of the transition kernel). For the convenience of analysis, we omit the additional steps for the principal to infer the agent’s decision up to a sufficient level of confidence by repeating the same contract policy, though this could introduce an additional factor of into the sample complexity, depending on the mixing ratio. We leave the tight analysis to future work.
5. It is without loss of generality to assume that the rational agent always has the incentive to participate in the PAMDP. This is because enforcing the additional constraint that the agent’s utility must be non-negative under the principal’s optimal contract is equivalent to adding an “idle” action to the existing action set with , which allows our analysis to ignore the agent’s non-negative utility (individual rationality) constraint.
B.3 Least-Payment Bellman Equations in PAMDP
With the correspondence between and , a natural next step is to fix and find the contract policy
with the maximal value among all contract policies to which the agent would optimally respond with action policy . Since the total expected reward of the principal is fixed under , the objective of the optimization problem can be equivalently rewritten as minimizing the principal’s total payment,
Recall that denotes the least amount of expected payment to induce an action policy at the state from the -th step. Meanwhile, since , the above constraint can be equivalently rewritten as a set of constraints in an iterative form,
Therefore, such a contract policy can be computed iteratively with backward induction from to with , ,
(B.1)
where the functions are computed as by-products of the value iteration. We refer to Equation (B.1) as the least-payment Bellman equation.
B.4 Bellman Optimality Equations in PAMDP
Proof of Theorem 1.
We begin by giving an interpretation for each variable in the Bellman equation. With slight abuse of notation, denotes the contract with the least payment to induce the agent to take action in each step . Given that is the best agent action for the principal to induce, the optimal contract at state in step can be determined as . are respectively the principal’s and agent’s total expected utility from -th step under policy and , which can be interpreted as their optimal state-action value function at -th step, serving as the intermediate variable for the computation. We now prove the optimality of its solution via induction:
For the base case, observe that planning for any state at the last step is reduced to a standard contract design problem and the optimal contract can be determined by solving the following linear program, ,
s.t. |
where is the least payment contract to induce the agent to take action in the last step and is the principal’s expected utility under . In Equation 2.4, we omit the term , since it is a constant once the action is fixed. Hence, determines the best action for the principal to induce and is the optimal state value function at -th step. The principal’s optimal contract can be determined as . The agent’s value is based on his best response under .
For the inductive case, given that is optimal, with the agent’s best responding action policy , we show that solved from Equation 2.4 is optimal. Let us observe that captures the agent’s total utility of taking action under the contract at step state and then optimally following the action policy under from step . Here, we can use computed from the previous iteration, because the agent’s value is conditionally independent of the action in the current step given the realization of the next state; this enables efficient computation through dynamic programming. Similar to the base case, for every action , the principal computes the least payment contract for the agent to take action ,
s.t. |
where the objective is set as the principal’s total utility if the agent takes action at the current step and follows the policy onward; the constraint reflects that it is (weakly) optimal for the agent to take action at the current step. In Equation 2.4, we omit the terms , since they are constant once the action is fixed. With the least payment contract and state value for each action, determines the best action for the principal to induce and is the optimal state value function at -th step. The principal’s optimal contract can be determined as . The agent’s value is based on his best response under . Therefore, is the optimal contract following from the optimal contract policy in previous steps , which concludes the induction.
Lastly, we note that this Bellman equation can be solved efficiently using backward induction from the state-function of the -step, . In each step, it solves many linear programs for the optimal contract , while each linear program has many constraints. Hence, the total time complexity to solve for the optimal policy is polynomial w.r.t. .
∎
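The dynamic-programming structure used in the proof can be illustrated on the agent’s side alone. The sketch below computes, by backward induction, the agent’s best-response action policy to a fixed contract policy in a small tabular instance. It makes several simplifying assumptions that are not from the paper: a time-homogeneous transition kernel, a contract indexed by (step, current state, next state), ties broken toward the lowest action index, and hypothetical names and numbers throughout. The principal’s per-action linear programs are not included.

```python
def agent_best_response(P, cost, contract, H):
    """Backward induction for the agent's best response to a fixed contract policy.

    P[s][a][s2]        : transition probability (assumed time-homogeneous here)
    cost[h][s][a]      : agent's cost of action a in state s at step h
    contract[h][s][s2] : payment when the state moves from s to s2 at step h
    Returns (policy[h][s], value at the first step for every state).
    """
    S, A = len(P), len(P[0])
    V_next = [0.0] * S  # the agent's value after the final step is zero
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        V = [0.0] * S
        for s in range(S):
            # Expected payment plus continuation value, minus the action's cost.
            q = [
                sum(P[s][a][s2] * (contract[h][s][s2] + V_next[s2]) for s2 in range(S))
                - cost[h][s][a]
                for a in range(A)
            ]
            policy[h][s] = max(range(A), key=lambda a: q[a])
            V[s] = q[policy[h][s]]
        V_next = V
    return policy, V_next


# Hypothetical instance: action 1 is costly but makes state 1 more likely.
H = 2
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.9, 0.1], [0.2, 0.8]]]
cost = [[[0.0, 0.3], [0.0, 0.3]] for _ in range(H)]
pay_on_state_1 = [[[0.0, 1.0], [0.0, 1.0]] for _ in range(H)]
no_payment = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(H)]

policy, values = agent_best_response(P, cost, pay_on_state_1, H)
idle_policy, idle_values = agent_best_response(P, cost, no_payment, H)
print(policy, values)
print(idle_policy, idle_values)
```

With a payment on reaching state 1 the agent chooses the costly action at every step, and with no payment she never does. The continuation value `V_next` depends only on the next state, which is what lets each step be solved separately.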
Appendix C Proofs in Section 3
C.1 The Regret Analysis of the Generic Algorithm
We first describe the design of the generic algorithm in Algorithm 2 and the technical lemmas.
(C.1)
Lemma 1.
Under Assumption 1, for each action , given a robust contract set with margin , and an empirical estimation of with , let be the minimizer and minimum objective value of LP (C.1). The following conditions are satisfied,
1. The expected payment of is bounded as,
2. The estimated payment of is bounded as,
Lemma 2 (McDiarmid et al. [42]).
With i.i.d. samples of an -dimensional distribution , we can construct a confidence ball such that with prob. at least .
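As a concrete instance of such a confidence ball, the sketch below builds an empirical outcome distribution and an L1 radius around it. The radius is an assumption for illustration: it comes from the standard L1 deviation bound for empirical distributions over finitely many outcomes (failure probability at most 2^m exp(-n eps^2 / 2)), and it is not necessarily the radius used in Lemma 2. Function names and sample values are hypothetical.

```python
import math
from collections import Counter


def empirical_distribution(samples, num_outcomes):
    """Empirical frequencies of outcomes 0, ..., num_outcomes - 1."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[o] / n for o in range(num_outcomes)]


def l1_confidence_radius(num_samples, num_outcomes, delta):
    """Radius eps such that ||p_hat - p||_1 <= eps with probability >= 1 - delta.

    Obtained by solving 2^m * exp(-n * eps^2 / 2) = delta for eps
    (an assumed standard bound, not the paper's exact constant).
    """
    log_term = num_outcomes * math.log(2.0) + math.log(1.0 / delta)
    return math.sqrt(2.0 * log_term / num_samples)


p_hat = empirical_distribution([0, 1, 1, 2], num_outcomes=3)
r_100 = l1_confidence_radius(100, 3, delta=0.05)
r_400 = l1_confidence_radius(400, 3, delta=0.05)
print(p_hat, r_100, r_400)
```

Quadrupling the number of samples halves the radius, which is the inverse-square-root shrinkage that the regret analysis relies on.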
Proof of Theorem 2.
At a high level, Algorithm 2 proceeds by following the upper confidence bound of the expected “profit” of each action , which shrinks at the rate of , based on Lemma 1. This enables us to apply the upper confidence bound analysis from online learning problem to the contractual bandit learning problem.
That is, we construct a variable as an optimistic estimation of . First, notice that Algorithm 2 is equivalent to following the action at each round , as is constant and does not affect the optimization. Second, we show that under the difference between and satisfies the following inequality with probability at least ,
(C.2)
On the event that , we can derive that and by Lemma 1. This implies that , which leads to Equation (C.2).
Under the event that , we have and the expected regret of Algorithm 2 in the rounds is as follows,
We decompose the regret into two cases on whether the optimal arm is played at round :
When , we have by Lemma 1.
When , we have
where the first inequality follows from Equation (C.2) that ; the second inequality uses the fact that and from Lemma 1; the third inequality follows Equation (C.2) that .
It remains to bound the total regret based on the exact choice of in different setups.
The Case of Finite Action Space.
In the case where the action space is finite, for any action , we denote as the number of times action has been taken. By Lemma 2, we can set such that the empirical estimation of the outcome distribution satisfies with probability at least . Thus, by the union bound, with probability , the expected regret can be bounded as follows,
where the first inequality uses the fact that the loss incurred when is at least as much as the loss when ; the second inequality follows from the Cauchy-Schwarz inequality; the third inequality again applies the Cauchy-Schwarz inequality and uses the fact that .
The Case of Infinite Action Space with Linear Context.
In the case when the action space is infinite, the outcome distribution for some unknown parameter . Let with . By Lemma 11 of Abbasi-Yadkori et al. [4], with probability at least , we have , where . We can set such that the empirical estimation of the outcome distribution satisfies
Thus, by the union bound, with probability , the expected regret can be bounded as follows,
where the first inequality uses the fact that the loss incurred when is at least as much as the loss when ; the second inequality follows from the Cauchy-Schwarz inequality; the third inequality again applies the Cauchy-Schwarz inequality and uses the fact that .
∎
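The optimistic selection rule analysed in this proof can be illustrated with a minimal sketch. It is not Algorithm 2 itself: the function name `optimistic_profit_loop`, the noiseless profit values, and the bonus `sqrt(2 log t / n)` are assumptions chosen for illustration, and the robustness margin of the contracts is ignored.

```python
import math


def optimistic_profit_loop(mean_profit, horizon):
    """Induce, in every round, the action with the largest optimistic profit.

    mean_profit[a] is a hypothetical noiseless per-round profit (reward minus
    payment) of inducing action a. A real learner only observes noisy samples.
    """
    n_actions = len(mean_profit)
    counts = [0] * n_actions      # how often each action was induced
    totals = [0.0] * n_actions    # accumulated profit per action
    for t in range(1, horizon + 1):
        if t <= n_actions:
            a = t - 1  # induce every action once to initialise the estimates
        else:
            # Empirical profit plus a bonus that shrinks like 1 / sqrt(n).
            a = max(
                range(n_actions),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        counts[a] += 1
        totals[a] += mean_profit[a]
    return counts


counts = optimistic_profit_loop([0.2, 0.5, 0.8], horizon=2000)
```

In this toy run the suboptimal actions are induced only logarithmically often, which is the behaviour the regret decomposition above relies on.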
Proof of Lemma 1.
Pick an arbitrary . We have and an empirical estimation of with , LP (C.1). With and , LP (C.1) solves for a robust contract . Since , the agent’s best response to is to take action .
First, we derive a bound for an intermediate value . Recall that Assumption 1 guarantees that, for any , there exists such that . Let and . We have
Since , we have . In addition, as , . Hence, we have
(C.3) |
Notice that and is not necessarily the minimizer of over . We get the first condition of this lemma, by Equation (C.3)
We now bound the estimated payment . Since by the bounded contract space assumption, we have
(C.4) |
Therefore, combining Equation (C.4) and the first condition, we get the second condition of this lemma,
∎
C.2 Solving Multi-armed Bandits under Direct Incentives
Multi-Armed Bandits under Direct Incentives
This is perhaps the simplest yet most natural class of contractual online learning problems. The principal is unable to directly pull arms but is able to receive the reward from the arm pulled by the agent. In this problem, we have the action space and specifying the principal’s reward and agent’s cost of pulling each arm . At the beginning of each round , the principal sets a contract and the agent accordingly decides its best response . At the end of each round , the principal is able to observe the exact arm taken by the agent as well as the noisy bandit feedback on its corresponding reward , where is zero-mean, i.i.d. -sub-Gaussian noise. Finally, the learning goal of the principal is to minimize the regret, .
Lemma 3 (Binary Search for Finite Arms).
There exists an -learning procedure for multi-armed bandits under direct incentives.
Proof of Lemma 3.
We show an explicit construction of -learning procedure in the problem. Observe that, if we can learn an estimation of , we can set the least payment contract as follows, . , it is optimal for the agent to respond with action . Moreover, the payment is minimized as .
So it only remains to learn the estimation of . This can be achieved through binary search. For any action , we set a cost lower bound and upper bound . At each round, the algorithm sets the contract with . If the agent takes the action , then the algorithm updates . Otherwise, it updates . In rounds, the algorithm is guaranteed to have and thus an estimation . To conduct the binary search for every action, the total sample complexity is .
∎
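The binary search in this proof can be sketched as follows. This is a minimal illustration under stated assumptions: costs lie in [0, 1], the agent has a zero-cost outside option so that it pulls the targeted arm exactly when the payment covers its cost, and the names `search_cost` and `agent_takes_arm` are hypothetical.

```python
def search_cost(agent_takes_arm, lower=0.0, upper=1.0, eps=1e-3):
    """Bisect on the payment offered for one arm until the cost is pinned down.

    agent_takes_arm(payment) is an assumed oracle returning True when the
    agent responds to this payment by pulling the targeted arm.
    """
    rounds = 0
    while upper - lower > eps:
        payment = (lower + upper) / 2.0
        if agent_takes_arm(payment):
            upper = payment   # the payment already covers the cost
        else:
            lower = payment   # the payment is too small to induce the arm
        rounds += 1
    # Returning the upper end keeps the estimate on the inducing side.
    return upper, rounds


true_cost = 0.37  # hidden from the principal, used only to simulate the agent
estimate, rounds = search_cost(lambda payment: payment >= true_cost)
```

The interval halves each round, so roughly log2(1/eps) rounds are needed per arm; the upper endpoint is returned because paying it still induces the arm.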
C.3 Solving Linear Bandits under Direct Incentives
Linear Bandits under Direct Incentives
In this problem, we have the action space (composed of the context vectors) and specifying the principal’s reward and agent’s cost of choosing each context . At the beginning of each round , the principal observes a set of contexts and sets a contract . The agent accordingly decides its best response , where . At the end of each round , the principal is able to observe the exact arm taken by the agent as well as the noisy bandit feedback on its corresponding reward , where is zero-mean, i.i.d. -sub-Gaussian noise. are fixed, unknown parameters to be learned. Without loss of generality, we assume by coordinate transformation. Finally, the learning goal of the principal is to minimize the regret, where is the optimal arm at round .
Lemma 4 (Contextual Search for Infinite Arms).
There exists an -learning procedure for linear bandits under direct incentives.
Proof of Lemma 4.
We show an explicit construction of -learning procedure in the problem with agent’s best response function for some parameter and action set . Observe that, if we can learn an estimation of such that , we can set the least payment rule such that . Since , we have . Moreover, .
To learn an estimation of such that , we adopt the contextual search algorithm under symmetric loss [38]. At a high level, we use the constant regret guarantee of the contextual search algorithm against an adversarially chosen context at every round, and we present a simple argument assuming that allows us to pick an arbitrary context for the contextual search algorithm. Specifically, consider a contextual search problem with the unknown vector and . Fix any unit vector , in rounds, the contextual search algorithm can determine a knowledge set of all feasible such that . Repeating this search procedure for all linearly independent directions in , we obtain a knowledge set of all feasible such that , since any action can be decomposed as a convex combination of the linearly independent unit vectors. The total sample complexity is . ∎
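The direction-by-direction argument can be illustrated with a simplified stand-in for the contextual search algorithm of [38]: plain bisection along each standard basis direction. The oracle `accepts`, the assumption that every coordinate of the cost parameter lies in [0, 1], and all function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def search_linear_cost(accepts, dim, eps=1e-3):
    """Estimate a linear cost parameter one basis direction at a time.

    accepts(context, payment) is an assumed oracle returning True when the
    agent chooses the given context under the offered payment.
    """
    theta_hat = []
    for i in range(dim):
        basis = [1.0 if j == i else 0.0 for j in range(dim)]
        lower, upper = 0.0, 1.0
        while upper - lower > eps:
            payment = (lower + upper) / 2.0
            if accepts(basis, payment):
                upper = payment
            else:
                lower = payment
        theta_hat.append(upper)  # overestimates the coordinate by at most eps
    return theta_hat


theta = [0.3, 0.6, 0.1]  # hidden cost parameter, used only for simulation
theta_hat = search_linear_cost(lambda x, p: p >= dot(theta, x), dim=3)
x = [0.5, 0.2, 0.3]      # a context that is a convex combination of the basis
err = abs(dot(theta_hat, x) - dot(theta, x))
```

Because each coordinate is off by at most eps, the cost of any convex combination of the basis directions is also estimated within eps.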
C.4 Solving General Contractual Bandit Problems
Lemma 5.
Under Assumption 2 and given that satisfies , we can construct an -learning procedure for general contractual bandit learning problems.
Proof of Lemma 5.
We denote and the learning procedure is to query certain contract in a binary search fashion in order to obtain estimation with bounded error , from which we can construct -margin contract set for any action and thereby compute almost least payment contract according to Lemma 1. For precise analysis, let . We describe the full procedure in Algorithm 3.
To prove its correctness, we start from the observation that for any two actions with sufficiently small , given two contracts that respectively induce actions and , we can obtain the estimation such that . To see this, we introduce a contract for some such that . Such must exist, since and . Now let , we have
To obtain the contracts , it suffices to run a binary search starting from two initial contracts that induce actions . If the contract induces the action , then we update . Otherwise, . In rounds, the distance between and is bounded by . As described in Algorithm 3, we can run such a binary search for every pair of actions . While two actions may not share a decision boundary, we identify all action pairs that do share a decision boundary with each other. This means that for pairs that do not share a decision boundary, we can find a path through their neighbours to determine their cost difference given by , and the shortest path can be found by Dijkstra’s algorithm in . In the worst case, such a path can be as long as , which means that we need to conduct the binary search to the precision level of for rounds.
Finally, with the estimated parameters and , the algorithm constructs the robust contract set , and we claim that . To verify that , we can check that the following inequality must hold, ,
Similarly, to verify , we can check that the following inequality must hold, ,
∎
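The path argument in this proof, where cost differences of non-adjacent actions are obtained by chaining neighbouring pairs, can be sketched as follows. This is not Algorithm 3: the pairwise estimates in `diff` are assumed to come from the binary searches, every boundary-sharing pair is given unit edge weight, and the function name is hypothetical.

```python
import heapq


def costs_from_differences(n_actions, diff, source=0):
    """Chain pairwise cost differences along shortest paths from `source`.

    diff[(i, j)] is an estimate of c_j - c_i for a pair of actions that share
    a decision boundary. Returns relative costs and hop counts per action.
    """
    adj = {i: [] for i in range(n_actions)}
    for (i, j), d in diff.items():
        adj[i].append((j, d))
        adj[j].append((i, -d))
    hops = {source: 0}
    rel_cost = {source: 0.0}
    heap = [(0, source)]
    while heap:  # Dijkstra with unit edge weights
        h, i = heapq.heappop(heap)
        if h > hops[i]:
            continue  # stale heap entry
        for j, d in adj[i]:
            if j not in hops or h + 1 < hops[j]:
                hops[j] = h + 1
                rel_cost[j] = rel_cost[i] + d
                heapq.heappush(heap, (h + 1, j))
    return rel_cost, hops


eps = 0.01
true_cost = [0.0, 0.2, 0.5, 0.9]  # hidden costs, used only to build estimates
diff = {(0, 1): 0.2 + eps, (1, 2): 0.3 - eps, (2, 3): 0.4 + eps, (0, 2): 0.5 + eps}
rel_cost, hops = costs_from_differences(4, diff)
```

Errors add up along the path, so an action reached in k hops has error at most k times the per-pair precision. This is why the per-pair binary search has to be run to a finer precision when paths can be long.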
Appendix D Searching on Probability Simplex
In this section, we discuss the details related to specifying the information structure through hyperplane searching. A motivation for doing hyperplane searching is given in Example 1, where learning the outcome distribution difference with rounds potentially avoids pulling the non-optimal arm too many times and paves the way for constructing -optimal contract. In addition, the need to plan with the transition kernel in the MDP environment with far-sighted agent prompts us to learn the difference in in order to fully exploit the information structure as well as to reduce the cost of redundant explorations.
Example 1 ( regret with known cost).
Consider a class of contractual bandit problem instances parameterized on . For each instance, there are two outcomes with mean reward , and two agent actions with cost and outcome distribution . One can verify that the optimal contract here is to set and the principal gets the expected utility . The naive learning method is to play for rounds and learn its outcome distribution parameterized by up to the bounded error . This is costly as is the sub-optimal arm, resulting in regret in Theorem 2. However, an alternative method is to conduct a binary search for . This would achieve regret, since the algorithm can get estimation error of bounded by in rounds, and construct an -optimal contract.
Here, we consider searching for the agent’s best response section in a -dimensional probability simplex. Let with be the outcome space and denote the contract the principal announces to the agent. Here, we restrict the contract to a subspace where is the -dimension probability simplex. We remark that searching over a low dimensional simplex is without loss of generality, and we just consider the simplex for simplicity, where bounds the infinity norm of any contract we use. Let be the agent’s action set with . This is because we have the following proposition.
Proposition 1 (Action inducibility).
If an action can be induced by contract with , then can also be induced by contract .
Proof.
The inducibility condition implies
Obviously, adding to does not change the inequality. Moreover,
which implies that is a valid contract with . ∎
For simplicity, we ignore the scale and just conduct our search on the probability simplex. Under this setting, the best response region for is
where is the outcome distribution under action and is the action cost the agent has to pay for any . Our target is to identify each by searching for the hyperplanes that separate these under weak assumptions. Specifically, we assume that the cost of each action is known. The algorithm is summarized in Algorithm 4.
Here, we show how to recover from the memory . Suppose that when the algorithm terminates, we have . We just solve for that satisfies the following constraints,
(D.1) | ||||
(D.2) |
In the sequel, we write as the set of that satisfy Conditions (D.1) and (D.2). For Algorithm 4 to work, we introduce the following assumption on the volume of .
Assumption 3 (Minimal Volume Ratio).
Let denote the -dimensional volume of set . We assume that there exists such that for any .
The minimal volume ratio assumption guarantees that all the sections are detectable via random sampling with high probability. We also make the following assumption on the cost difference.
Assumption 4 (Minimal Cost difference).
We assume that .
Specifically, we use the following definition of surface detection probability function.
Definition 3 (Surface Detection Probability Function).
Let be the set of convex regions on some -dimensional hyperplane such that for any . Define function as the pointwise maximum such that,
Note that is a property inherent to the -dimensional probability simplex. We argue that can be roughly viewed as a linear function for small . To characterize the searching result of Algorithm 4, we present the following Lemma.
Lemma 6.
To construct an efficient learning procedure, we need to determine the optimal value for such that the total round number is minimized while the learning error is controlled by .
Corollary 3.1.
By properly setting and and running the simplex searching algorithm for rounds, we guarantee the learning error less than with probability at least
where is a constant.
Proof.
Define constant
We let . Then it suffices for the first condition to hold if . Moreover, we have and the second condition holds automatically with . Therefore, the constraint for becomes,
We can take equality for the optimal . Obviously, the second term dominates, and we thus have the total rounds bounded by
total round | |||
where the failure probability is bounded by with being a constant. ∎
Proof of Lemma 6.
We consider an undirected graph with the node set and the edge set . Define event as follows.
Definition 4 (Surface Detection Event).
We say that event happens if there exists and we have successfully searched for a such that for the -dimensional simplex placed around in Algorithm 4 and any , the best response at satisfies .
Simply put, the event guarantees that only contains two possible actions and can therefore be successfully learned via binary search for the intersections of the edges of the simplex with . The following proposition backs up our statement.
Proposition 2 (Intersection Geometry).
Under event and condition , let denote the simplex corresponding to . Then contains a -dimensional ball with radius at least
Proof.
Under event , we claim that the hyperplane must intersect with the -ball centered at , since passes through , which lies inside . Moreover, we consider a ball also centered at . Since the smallest distance from the center of a -dimensional simplex with length to any of its surface is , we have under the condition . In addition, is also the largest ball contained in . We therefore conclude that the intersection area satisfies,
Since only intersects with hyperplane , we have . Thus, the left-hand side corresponds to the intersection area of a -dimensional ball and a -dimensional hyperplane. Since intersects with , a -dimensional ball with radius at least
should be contained in . ∎
Proposition 2 characterizes the geometry of the intersection area under event . Specifically, the intersection area should contain a -dimensional ball with radius lower bounded by . Such a geometry is critical for solving the hyperplanes to small errors. We consider the following constraints,
(D.3) |
where is the binary searching result on the edges of simplex . Under event , let denote the set of satisfying these constraints. Apparently, is a relaxation of . The following proposition characterizes the searching errors in under the event in terms of .
Proposition 3 (Local Hyperplane Learnability).
Proof.
We first relax to
Note that corresponds to the vertices of . Using Proposition 2, we pick on the ball that form a -dimensional simplex. Note that must be convex. Thus, we can express as convex combinations of . In the matrix form, we suppose
where has row sums equal to . Using , we can further relax by multiplying to the constraints,
Note that the first rows of form a -dimensional simplex on . Moreover, since , we note that does not lie on . To guarantee linear independence of , we aim to lower bound the distance from to . For any , we have , and for we have . Thus, we use the Cauchy-Schwarz inequality and obtain,
Since , we conclude that for any ,
Thus, the distance from to is at least , which indicates that the volume of the cube spanned by the rows of is at least
Therefore, the determinant of satisfies
Note that has row sums equal to , which implies that all the eigenvalues of should be no larger than . Hence, the smallest eigenvalue of should be no less than and consequently, the largest eigenvalue of should be no more than . Following the definition of , we have for any that
∎
Proposition 3 bridges the geometrical argument to the learning errors in terms of . To obtain such benefits, we need to characterize under what conditions and with what probability the event will occur. To start with, we study the volume of a surface that separates the probability simplex into two disjoint parts.
Proposition 4 (Surface Volume).
Suppose is a compact subset such that and . Let be the surface that separates and , and assume that comprises hyperplanes. It then follows that
Proof.
For any , if , consider the following set
Apparently, . Hence, we have . Similarly, for , we can define as
and it follows that . Following these observations, we have
(D.4) |
Another important fact is that that connects must go through at least once since and belong to different areas. For any -dimensional surface and any point , we define
Apparently, we have and thus . Consider , and it follows from the definition that , which suggests that . Suppose that , it then holds that
Consider is small enough such that lies on a hyperplane (or approximate the surface with countably infinitely many hyperplanes). We draw hyperplanes such that the farthest two vertices of the simplex lie on and . We denote the area between and by . We separate into and , where is defined as the points that have a distance larger than to the hyperplane containing and . Consider the following definition for ,
and for ,
Apparently, and we thus have
where the second inequality holds from the definition of and the fact that the largest hyperplane contained in is no more than . Here, since is adjustable, we plug in
and obtain
By summing up and using Jensen’s inequality, we obtain
(D.5) |
Combining (D.5) with (D.4) and using the inequality , we obtain
which further implies that
∎
Following Proposition 4, a direct conclusion is that if contains action sections and contains action sections (), there exists a surface such that
where the last inequality holds from Assumption 3 and the fact that . However, a sampled line passing through this does not guarantee event even though is bounded below. The following proposition states a sufficient condition for to hold.
Proposition 5 (Effective Surface).
Define the effective surface as
where , . Event holds for any searching line crossing under Assumption 3.
Proof.
Under condition , we guarantee that the simplex does not intersect with . However, such a condition is not sufficient if we want to guarantee that the simplex only contains actions since a third action may occur outside . In the sequel, we will show that it is impossible for a third action to appear if goes through . Define as,
Apparently, for any and , we have . Now, we aim to prove that the prism with base and height belongs to either or . Suppose that lies on action ’s side and there exists a point in whose best response is not . Then, there must be a hyperplane passing through for some . Suppose . Consider the set , where is the whole hyperplane containing . Obviously, for any and , since should lie outside . In addition, since is a convex set, we note that must lie between and . Therefore, the volume of should be no more than the volume of the area that lies between and ,
By Assumption 3, we conclude that
which conflicts with the condition . Thus, we conclude that . The same argument also applies to the conclusion if lies on action ’s side. For any line passing through , let . Then we have and the minimal distance from to is at least . Thus, is guaranteed to be detected when doing the binary search on . On the other hand, , meaning that the simplex placed at also lies within . Thus, the simplex only contains two actions and event follows. ∎
Proposition 5 characterizes a sufficient condition for to hold, i.e., the searching line passing through the effective surface . The next question is whether we can find enough effective surfaces via random sampling. To answer this question, we need to argue that there are sufficiently many effective surfaces with surface volume bounded below. It is also of the same importance to bound below the probability of passing through an effective surface with positive surface volume. Recall the definition of .
Now, we study how many effective surfaces need to be detected. From Proposition 3, we see that if event holds, we can estimate up to small errors. To learn all the pairwise outcome distribution differences, we just need to construct a tree in graph where an edge is selected if happens. To see this point, we invoke the following definition.
Definition 5 (Connected Components).
We say that actions and belong to the same connected component if there exists a path such that events happen.
In the sequel, we use to denote a connected component. With a little abuse of notation, we also denote by the union of sections such that . Using Proposition 4 and the following discussion, we take as and as , and it follows that there exists a surface on the boundary of and such that
Here, we assume that and . Note that we have because is convex. Recall the definition of the effective surface , which corresponds to shrinking up to distance . Therefore, the volume of is at least
Therefore, with probability event will happen in the next sample following Proposition 5, which also means that will expand. Since can expand at most times, we have after samples that with probability at least . Moreover, when , the error for estimating is bounded by
∎
Appendix E Proofs in Section 4
E.1 Preliminaries for Regret Analysis
Visitation Measure.
We define the state visitation measure at step induced by action policy as
Given that , the visitation measure can be computed iteratively from to as,
With this definition, we have for any ,
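The iterative computation of the visitation measure can be sketched as below. The layout of `transition` (indexed by step, state, action, next state), the restriction to deterministic policies, and the toy numbers are assumptions made for illustration.

```python
def visitation_measure(initial, transition, policy, horizon):
    """Return d, where d[h][s] is the probability of being in state s at step h.

    initial[s]              : initial state distribution
    transition[h][s][a][s2] : probability of moving from s to s2 under action a
    policy[h][s]            : action taken in state s at step h (deterministic)
    Steps are 0-indexed here.
    """
    n_states = len(initial)
    d = [list(initial)]
    for h in range(horizon - 1):
        nxt = [0.0] * n_states
        for s in range(n_states):
            a = policy[h][s]
            for s2 in range(n_states):
                # Push the mass at (h, s) forward through the transition kernel.
                nxt[s2] += d[h][s] * transition[h][s][a][s2]
        d.append(nxt)
    return d


# Toy instance: two states, two actions, the same kernel at every step.
kernel = [
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],  # from state 1 under actions 0 and 1
]
d = visitation_measure(
    initial=[1.0, 0.0],
    transition=[kernel, kernel, kernel],
    policy=[[1, 0], [1, 0], [1, 0]],
    horizon=3,
)
```

Each row of `d` sums to one, and the expected value of any per-step quantity under the policy can then be written as a weighted sum over states with these weights.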
Parameter Estimation.
Algorithm 1 omits the details of how it estimates the parameters based on the observed trajectory in each episode. Specifically, at the end of -th episode, given the trajectory , it updates the counting variables for all ,
and the empirical mean reward and bonus for all ,
For notational convenience, we will refer to the -th episode as the -th episode after running the -learning procedure for rounds in the sequel.
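The count-based estimation described above can be sketched as follows. The exact indexing of the reward and the form of the bonus are not recoverable from the text here, so this sketch assumes rewards indexed by (step, state, action) and a bonus proportional to 1/sqrt(N); the class and method names are hypothetical.

```python
import math
from collections import defaultdict


class EmpiricalModel:
    """Counting variables and empirical means built from observed episodes."""

    def __init__(self):
        self.n_sa = defaultdict(int)          # visits to (h, s, a)
        self.n_sas = defaultdict(int)         # visits to (h, s, a, s_next)
        self.reward_sum = defaultdict(float)  # accumulated reward at (h, s, a)

    def update(self, trajectory):
        """trajectory: list of (h, s, a, s_next, reward) tuples of one episode."""
        for h, s, a, s_next, reward in trajectory:
            self.n_sa[(h, s, a)] += 1
            self.n_sas[(h, s, a, s_next)] += 1
            self.reward_sum[(h, s, a)] += reward

    def transition_prob(self, h, s, a, s_next):
        n = self.n_sa.get((h, s, a), 0)
        return self.n_sas.get((h, s, a, s_next), 0) / n if n else 0.0

    def mean_reward(self, h, s, a):
        n = self.n_sa.get((h, s, a), 0)
        return self.reward_sum.get((h, s, a), 0.0) / n if n else 0.0

    def bonus(self, h, s, a, scale=1.0):
        # Assumed bonus shape: shrinks like 1 / sqrt(number of visits).
        return scale / math.sqrt(max(1, self.n_sa.get((h, s, a), 0)))


model = EmpiricalModel()
model.update([(0, 0, 1, 1, 1.0), (1, 1, 0, 0, 0.0)])
model.update([(0, 0, 1, 0, 0.0), (1, 0, 0, 0, 1.0)])
p_hat = model.transition_prob(0, 0, 1, 1)
r_hat = model.mean_reward(0, 0, 1)
b = model.bonus(0, 0, 1)
```

Unvisited triples return zero estimates and the largest bonus, which is one simple way to keep the planner optimistic about them.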
Lemma 7 (Agarwal et al. [5]).
Fix . In any episode , , with probability at least , we have
We denote as the event that the inequalities in Lemmas 7 and 10 hold. To make our notation consistent, we let and such that .
Following the optimism principle, we determine the optimal action policy to explore from — the optimism is preserved as long as the difference of is relatively small. Here, with some carefully chosen amount of reward bonus , is the principal’s expected reward under optimism for a given action policy at -th step with state , i.e.,
Lemma 8 (Optimism).
If is true, for any , in any episode , .
Proof.
We prove by induction that . Start from the -step, since , the base case holds. For the inductive case, given that , we can derive the following inequality for any , let
where the first inequality is due to the given induction condition and the last inequality is due to Lemma 7. Therefore, the induction holds, which concludes the proof. ∎
Lemma 9.
If is true, .
Regret Decomposition.
For ease of analysis, we assume a common regularity condition in MDP, which ensures that the trajectory induced under any policy has enough randomness. This lower bound constant is no more than , and we expect this assumption to be relaxed in future work.
Assumption 5 (MDP Mixing Condition).
There exists such that
We now present the proof of Theorem 3 via a decomposition of its regret into different components.
Theorem (Full Statement of Theorem 3).
Proof of Theorem 3.
Consider the following decomposition of regret,
where , are the principal’s estimated value under the least payment contract policy determined from Equation (B.1) with and is the principal’s exact value under the least payment contract policy determined from Equation (B.1). We now consider each of the three difference terms in the last inequality:
• The bound on the first term is derived from the value decomposition, , since the difference , by Lemma 8.
• The second term is non-positive, , since is the optimal action policy to induce under the optimistic planning.
- •
Finally, if we adopt the solver in Algorithm 6, by Lemmas 9 and 14, the total regret is,
where the first term can be dropped as being dominated by the second term on the order of .
Similarly, if we adopt the solver in Algorithm 7, by Lemmas 9 and 16, the total regret is,
where the first term can be dropped as being dominated by the second term on the order of .
∎
E.2 -Learning Procedure in Contractual Reinforcement Learning
The algorithm for constructing the -learning procedure is summarized in Algorithm 5.
Assumption 6 (Weakly ergodic MDP).
There exists such that
Note that this assumption is weaker than the mixing MDP assumption where each state’s visitation measure at each step is lower bounded under all the agent’s policies, since the agent’s policy chosen here is in favor of visiting state at step .
Also, recall that throughout this section, we assume that the principal already knows the agent’s cost function and the contract design space is restricted to . Moreover, we inherit all the technical assumptions from the analysis in Appendix D. With these assumptions, we show that the following Lemma 10 holds.
Lemma 10.
For any , Algorithm 5 guarantees an estimation of with probability at least in episodes.
Proof for Lemma 10.
Since each time, only gives nonzero payment, and we add a constant that dominates the potential costs, the agent’s optimal policy generates a visitation measure of at step larger than . To illustrate this point, we take that maximizes while at step (the choice of does not influence ). For any such that , we have for the agent’s expected profit difference expressed as
where is constructed by substituting with in . Here, the first inequality holds since is the optimal one-step policy concerning the payment. The last inequality holds by noting that still maximizes which gives by Assumption 6, and that has a constant shift . Note that the best response can only have higher profit for the agent. Thus, we conclude that any such that cannot be the agent’s best response. In other words, for the best response , we have .
Let be the random variable representing the total steps that successfully takes and be a random variable such that , where . It follows from our previous discussion that for any ,
where . By the algorithm procedure, we conclude that
Combined with Corollary 3.1, we conclude that Algorithm 5 has error
with probability no more than
for any . The first term dominates and we just plug in and conclude that the error is no more than with probability at least
where is a constant and .
Fix , and convert the error bound to norm, this algorithm guarantees that, with probability at least ,
in rounds, where we ignore the constant from Assumption 3.
∎
Lemma 11.
If is true, with , at any episode , we have
(E.1) |
Proof.
To improve the estimate of with , we can construct the estimator of as,
The first part is an unbiased estimator for and thus by Hoeffding’s inequality, with probability at least , for all , with samples,
Using the triangle inequality and the fact that , we get the inequality in Eq (E.1). ∎
E.3 Proofs for the Statistically Efficient Solver
(E.2) |
Assumption 7 (Weak -Inducibility).
For any stationary action policy , there exists an event for each such that
This assumption ensures that for any stationary action policy , there exists a set of bounded reward functions with such that dominates any other stationary action policy with additional expected utility at least in the MDP environment . For example, one can set . For a similar reason, under this assumption, for any stationary action policy , there exists a contract policy with under which , where .
We describe the statistically efficient solver in Algorithm 6. Below we show that the contract policy it solves is robust and optimistic in a formal sense.
Lemma 12.
Proof.
Pick any action policy , we first show that is the agent’s best response under solved from Equation E.2. That is, due to Lemma 13, the following inequality holds, ,
Let . We show that the payment under has bounded suboptimality,
(E.3) |
where and thus by definition.
We first prove the first inequality in Equation (E.3). Let be the contract policy such that with and . Such exists by Assumption 1. It is easy to verify the linearity here that for any , let such that , then . We claim that satisfies the following inequalities:
(E.4) |
where the first inequality is due to Lemma 13 and the second inequality is by the construction of .
For the first inequality, notice that is feasible in the LP (E.2) according to Equation (E.4) and its value should be no larger than the optimal value of LP (E.2). The equation is due to the linearity of w.r.t. . The last inequality uses the fact that by construction.
∎
Lemma 13.
Given that , we have
Proof.
Notice that we can apply the performance difference lemma [5] on any as they share the same and thus the same environment except for different policy,
Hence, we have
∎
Proof.
This proof is to find a good trade-off between the error bound of and according to the ratio given in Lemma 12. For this analysis, let such that we can apply Lemma 11 and get
where we use the fact under Assumption 5. By Lemma 12, we can construct contracts to induce any action policy such that
Then, we can set such that the second term dominates the regret bound, and we have
It remains to ensure that within rounds we have . By Lemma 10, we know the sample complexity of is on the order of . Hence, ∎
E.4 Proofs for the Computationally Efficient Solver
(E.5) | ||||
Assumption 8 (Strong -Inducibility).
For any , there exists an event such that
This assumption ensures that at any state of any step , for any cost function and any action , there exists a contract to induce . Intuitively, this assumption asks that each action is dominantly capable of inducing a set of outcomes over others. It is different from assuming that the outcome distribution of each action is distinguishable from the others. In particular, notice that even if we have , there is no guarantee that , i.e., there exists a contract under which is the optimal action for the agent at this step regardless of the cost. Notice that when , Assumptions 7 and 8 are equivalent. In general, Assumption 8 is stronger than Assumption 7.
We describe the computationally efficient solver in Algorithm 7. Below we show that the contract policy it solves is robust and optimistic in a formal sense.
Lemma 15.
Proof.
Fix any action policy , we compare the contract policy and solved respectively from Equation (B.1) and (E.5), i.e., using the ground-truth and estimated parameters. In the remainder of the proof, we will omit the superscript in both and for simplicity.
For the simplicity of notations, we prove the following claim. For any , suppose we have , , . Then, with the choice of , the solution of Equation (E.5) in the -th step satisfies the following conditions: for every ,
1. induces agent’s action .
2. The expected payment of is bounded as
3. The estimated payment of (as well as the estimated agent value) is bounded as
We pick any state and let . Equation (E.5) solves for as the solution to the following optimization program,
(E.6) |
We first argue that with , induces the agent to play action at state . That is because the following inequality must hold, ,
To see that LP (E.6) must have a feasible solution, let us consider the contract . We choose such that . The existence of is implied by Assumption 8. satisfies the constraints of LP (E.6), i.e.,
Moreover, since minimizes LP (E.6), we have Given that and , we have . Using , the principal’s expected payment is bounded from the least payment as
In addition, , this gives the bound
Since , we have
We now plug in the value of . For the base case in the -th step, we have . Then,
For the inductive case in the -th step, given that , we have and . Then, we can plug in and bound the estimation error as, This gives the bound as claimed in the Lemma. ∎
Proof.
This proof is to find a good trade-off between the error bound of and according to the ratio given in Lemma 15. For this analysis, let such that we can apply Lemma 11 and get
where we use the fact under Assumption 5, similar to the analysis in Lemma 10. By Lemma 15, Equation (E.5) can construct contracts to induce any action policy such that
Then, we can set such that the second term dominates the regret bound, and we have
It remains to ensure that within rounds we have . By Lemma 10, we know the sample complexity of is on the order of . Hence, we can set . ∎