License: arXiv.org perpetual non-exclusive license
arXiv:2310.00817v3 [stat.ML] 21 Mar 2024

Learning to Make Adherence-Aware Advice

Guanting Chen$^{1}$, Xiaocheng Li$^{2}$, Chunlin Sun$^{3}$, Hanzhao Wang$^{2}$
$^{1}$ Department of Statistics and Operations Research, UNC-Chapel Hill
$^{2}$ Imperial College Business School, Imperial College London
$^{3}$ Institute for Computational and Mathematical Engineering, Stanford University
[email protected]
{xiaocheng.li, h.wang19}@imperial.ac.uk
[email protected]
Abstract

As artificial intelligence (AI) systems play an increasingly prominent role in human decision-making, challenges surface in the realm of human-AI interactions. One challenge arises from the suboptimal AI policies due to the inadequate consideration of humans disregarding AI recommendations, as well as the need for AI to provide advice selectively when it is most pertinent. This paper presents a sequential decision-making model that (i) takes into account the human’s adherence level (the probability that the human follows/rejects machine advice) and (ii) incorporates a defer option so that the machine can temporarily refrain from making advice. We provide learning algorithms that learn the optimal advice policy and make advice only at critical time stamps. Compared to problem-agnostic reinforcement learning algorithms, our specialized learning algorithms not only enjoy better theoretical convergence properties but also show strong empirical performance.

1 Introduction

Artificial intelligence (AI) has achieved remarkable success across various aspects of everyday life. However, it is crucial to acknowledge that many of AI’s accomplishments have been developed as fully automatic systems (Mnih et al., 2015; Silver et al., 2017). In several important domains like AI-assisted driving (Balachandran et al., 2021) and AI-assisted healthcare (Shaheen, 2021), AI is faced with the challenge of interacting with humans (Mozannar and Sontag, 2020; De et al., 2021), introducing a more intricate and demanding dynamic. This interaction between AI and humans gives rise to two significant issues. Firstly, it is common for humans to reject following AI’s advice, and if AI assumes humans’ perfect adherence to its advice, the advice generated under this assumption may not be optimal. Secondly, humans may prefer AI to refrain from constant advice-giving, opting for AI intervention only when necessary. They may value their autonomy when performing well but expect AI guidance during critical moments or when they encounter situations in which they are typically less proficient. These considerations underscore the importance of comprehending human behavior and preferences to develop effective and adaptable AI systems for human-AI interactions.

To address the mentioned challenges, in this paper, we provide a decision-making model for human-AI interactions. For the first challenge, the model takes into account the human’s adherence level, defined as the probability that the human takes the AI’s advice. This allows the machine to account for variations in human adherence level when making advice. For the second challenge, the AI model features an action named defer, which refrains from giving advice to humans. This feature recognizes that there are instances when humans prefer autonomy and only seek AI guidance during critical moments or situations where they typically struggle. By integrating the adherence level and action deferral into our model, we formulate these challenges as a decision-making problem.

To cater to this specialized decision-making model, we have developed tailored learning algorithms that are both provably convergent and empirically efficient. These algorithms are specifically designed to effectively handle the unique characteristics and challenges of the human-AI interaction setting.

1.1 Related Work

Human-AI interactions.     Human-AI interactions have long been studied in fields such as robotics. Methods for modeling human behaviors and collaborating with robots (Bobu et al., 2020; Laidlaw and Dragan, 2022; Carroll et al., 2019) have achieved strong empirical performance. Similar to our definition of adherence level, a stream of literature (Chen et al., 2018; Williams et al., 2023) integrates trust (Khavas et al., 2020) as a latent factor into the human-AI model and solves a partially observable Markov decision process (POMDP) to obtain policies with strong empirical outcomes. Our work primarily centers on modeling and establishing theoretical foundations for the human-AI interaction model and the associated learning problems, thereby complementing the existing body of human-AI interaction literature.

Modeling human-AI interactions.     On the modeling side, Grand-Clément and Pauphilet (2022) propose a decision-making model that incorporates the adherence level and show that, when the adherence level is low, the optimal advice can differ from the optimal decision. See also Sun et al. (2022) for an applied setting involving different adherence levels, Shani et al. (2019) for the relationship between the model and the exploration-conscious RL setting, and Jacq et al. (2022) for the so-called lazy-MDP, which features an action similar to defer in our setting.

Machine learning in human-AI interactions.     Although there has been no literature associated with learning the decision-making model similar to Grand-Clément and Pauphilet (2022) and Jacq et al. (2022), other machine learning approaches have been put forward (Bastani et al., 2021; Meresht et al., 2020; Straitouri et al., 2021; Okati et al., 2021; Chen et al., 2022; Hong et al., 2023; Mao et al., 2023; Mohri et al., 2023) with different human-AI interaction settings.

Theoretical reinforcement learning.     Our first proposed algorithm is an optimism-based reinforcement learning method that learns the optimal advice policy. This approach is inspired by the theoretical online reinforcement learning literature (Jaksch et al., 2010; Lattimore and Hutter, 2014; Dann and Brunskill, 2015; Azar et al., 2017; Dann et al., 2017; Zanette and Brunskill, 2019; Domingues et al., 2021). Instead of directly applying the upper confidence bounds from the literature, we customize the learning algorithm so that it leverages special properties of our decision-making model, resulting in advantages in both theoretical properties and empirical performance. Our second algorithm adopts a reward-free exploration (RFE) approach (Jin et al., 2020), which first explores the environment for a given number of episodes and is then capable of outputting a near-optimal policy for any bounded reward function. We find this approach works well for learning algorithms that make pertinent advice. See Zhang et al. (2020); Kaufmann et al. (2021); Ménard et al. (2021); Miryoosefi and Jin (2022) for follow-up works in RFE.

Our contribution is twofold:

First, we propose a decision-making model for advice-giving that incorporates human’s adherence level and an option for the AI to defer the advice and trust the human. This is a comprehensive modeling framework for effective human-AI interactions, where the optimal decision-making not only considers human adherence level but also makes advice/recommendations only at critical states.

Second, based on this decision-making model, we develop tailored learning algorithms that output near-optimal advice policies and know when to make pertinent advice. Compared to the state-of-the-art problem-agnostic RL algorithms, our algorithms feature tighter sample complexity bounds and stronger empirical performance.

2 Model Setup

Consider a human decision-maker that takes sequential actions under an episodic Markov decision process (MDP) described by the tuple $\mathcal{M}^{\mathtt{H}}=(\mathcal{S},\mathcal{A},H,p,r)$. The superscript $\mathtt{H}$ emphasizes the human’s involvement in this MDP, $\mathcal{S}$ denotes the set of states, $\mathcal{A}$ denotes the set of actions, $H$ is the horizon of each episode (different from the superscript $\mathtt{H}$), $p$ denotes a deterministic time-dependent transition kernel so that $p_h(s'|s,a)$ is the transition probability from state $s\in\mathcal{S}$ to state $s'\in\mathcal{S}$ under the action $a\in\mathcal{A}$ at time $h$, and $r$ denotes a time-dependent reward function where $r_h(s,a)\in[0,1]$. Let $S=|\mathcal{S}|$ and $A=|\mathcal{A}|$ denote the cardinality of $\mathcal{S}$ and $\mathcal{A}$, respectively.

Suppose the human follows a fixed (suboptimal) policy $\pi^{\mathtt{H}}$ such that the probability of taking action $a$ at state $s$ and time $h$ is $\pi_h^{\mathtt{H}}(a|s)$. Alongside the human, an intelligent machine makes advice as decision support to improve the reward collected under $\pi_h^{\mathtt{H}}$. In other words, the machine does not seek to change human policy but rather improve its final outcome given its suboptimality. Specifically, upon the arrival at each state, the machine can choose to make advice $a^{\mathtt{M}}\in\mathcal{A}$ to the human (the superscript $\mathtt{M}$ stands for the machine), or to trust the human and defer the action to the human, denoted by $a^{\mathtt{M}}=\text{defer}$. If the machine chooses to defer, the human follows its default policy $\pi^{\mathtt{H}}$. If the machine chooses to advise, the human takes the machine’s advice with probability $\theta(s,a^{\mathtt{M}})\in[0,1]$, where $\theta(\cdot,\cdot)$ is the adherence level of the human, and is defined as follows.

Definition 1 The human’s adherence level $\theta:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$ gives, for each state $s$ and each piece of advice $a^{\mathtt{M}}$, the probability $\theta(s,a^{\mathtt{M}})$ that the human adopts (adheres to) that advice at that state.

Given the setup, the human takes action a𝙷superscript𝑎𝙷a^{\mathtt{H}}italic_a start_POSTSUPERSCRIPT typewriter_H end_POSTSUPERSCRIPT according to the following law:

$$\mathbb{P}_h(a^{\mathtt{H}}=a|s,a^{\mathtt{M}})=\begin{cases}\pi_h^{\mathtt{H}}(a|s), & \text{if } a^{\mathtt{M}}=\text{defer},\\ \theta(s,a^{\mathtt{M}}), & \text{if } a^{\mathtt{M}}\neq\text{defer and } a=a^{\mathtt{M}} \text{ (adhere)},\\ (1-\theta(s,a^{\mathtt{M}}))\cdot\cfrac{\pi_h^{\mathtt{H}}(a|s)}{1-\pi_h^{\mathtt{H}}(a^{\mathtt{M}}|s)}, & \text{if } a^{\mathtt{M}}\neq\text{defer and } a\neq a^{\mathtt{M}} \text{ (not adhere)}.\end{cases} \tag{1}$$
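The adherence law (1) can be sketched in a few lines of Python for a single state and time step. This is only an illustrative sketch: the function name `human_action_dist`, the `DEFER` sentinel, and the list-based representation of $\pi_h^{\mathtt{H}}(\cdot|s)$ are assumptions introduced here, not part of the paper. The case $\pi_h^{\mathtt{H}}(a^{\mathtt{M}}|s)=1$ with $\theta<1$ makes the denominator in (1) vanish; the sketch raises an error there rather than guessing a convention.

```python
from typing import List, Optional

DEFER = None  # hypothetical sentinel standing for the machine's defer option


def human_action_dist(pi_h: List[float], theta: float,
                      advice: Optional[int]) -> List[float]:
    """Distribution of the human's final action at one state and time step.

    pi_h[a] is the default-policy probability of action a, theta is the
    adherence level for the advised action, and advice is an action index
    or DEFER.
    """
    if advice is DEFER:
        # Machine defers: the human simply follows the default policy.
        return list(pi_h)
    rest = 1.0 - pi_h[advice]  # default mass on all non-advised actions
    if rest <= 0.0 and theta < 1.0:
        # Zero denominator in the "not adhere" branch; the text does not
        # specify this case, so the sketch refuses to guess.
        raise ValueError("default policy puts all mass on the advised action")
    dist = []
    for a, prob in enumerate(pi_h):
        if a == advice:
            dist.append(theta)  # adhere
        elif rest > 0.0:
            dist.append((1.0 - theta) * prob / rest)  # not adhere
        else:
            dist.append(0.0)
    return dist


# Toy usage: three actions, machine advises action 0, adherence 0.6.
example = human_action_dist([0.5, 0.3, 0.2], theta=0.6, advice=0)
print(example, sum(example))
```

In the "not adhere" branch the default policy is renormalised over the non-advised actions, so the returned probabilities sum to one whenever the input policy does.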

To summarize, under the human-machine interaction, the underlying dynamic becomes

$$s_h\xrightarrow{\text{machine makes advice}}a^{\mathtt{M}}\xrightarrow{a^{\mathtt{H}}\sim\mathbb{P}_h(\cdot|s_h,a^{\mathtt{M}})}a^{\mathtt{H}}\xrightarrow{s_{h+1}\sim p_h(\cdot|s_h,a^{\mathtt{H}})}s_{h+1}.$$

At each time $h$, the machine first makes the advice $a^{\mathtt{M}}$ upon observing the state $s_h$, the human incorporates this advice into a final action $a^{\mathtt{H}}$, and the system then transitions to the next state $s_{h+1}$.

The machine’s MDP. From the machine’s perspective, the MDP is slightly different from the MDP faced by the human. It can be described by $\mathcal{M}^{\mathtt{M}}=\left(\mathcal{S},\bar{\mathcal{A}},H,p^{\mathtt{M}},r^{\mathtt{M}}\right)$. This MDP shares the same state space $\mathcal{S}$ and horizon $H$ as the human MDP $\mathcal{M}^{\mathtt{H}}$. The action space is augmented to include the defer option: $\bar{\mathcal{A}}=\mathcal{A}\cup\{\text{defer}\}$. From the machine’s perspective, the transition can be viewed as a direct consequence of making advice $a^{\mathtt{M}}\in\bar{\mathcal{A}}$ (i.e., $s_h\to a^{\mathtt{M}}\to s_{h+1}$), and the transition kernel becomes

$$p_h^{\mathtt{M}}(s'|s,a^{\mathtt{M}})=\sum_{a^{\mathtt{H}}\in\mathcal{A}}p_h(s'|s,a^{\mathtt{H}})\cdot\mathbb{P}_h(a^{\mathtt{H}}|s,a^{\mathtt{M}}), \tag{2}$$

where $p_h$ is the transition kernel of the MDP $\mathcal{M}^{\mathtt{H}}$, and the probability $\mathbb{P}_h(\cdot|s,a)$ is specified by the adherence dynamics (1). In parallel, we define the reward by marginalizing over the human’s action:

$$r_h^{\mathtt{M}}(s,a^{\mathtt{M}})=\sum_{a^{\mathtt{H}}\in\mathcal{A}}r_h(s,a^{\mathtt{H}})\cdot\mathbb{P}_h(a^{\mathtt{H}}|s,a^{\mathtt{M}}).$$
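As a rough illustration of the marginalization in (2) and in the reward definition, the sketch below computes $p_h^{\mathtt{M}}(\cdot|s,a^{\mathtt{M}})$ and $r_h^{\mathtt{M}}(s,a^{\mathtt{M}})$ at a single time step from tabular inputs. The function name `machine_model`, the nested-list layout, and the `DEFER` sentinel are assumptions made for this example; the sketch also assumes $\pi_h^{\mathtt{H}}(a^{\mathtt{M}}|s)<1$ so that the renormalisation in (1) is well defined.

```python
from typing import List, Optional, Tuple

DEFER = None  # hypothetical sentinel for the defer option


def machine_model(p_h: List[List[List[float]]], r_h: List[List[float]],
                  pi_h: List[List[float]], theta: List[List[float]],
                  s: int, advice: Optional[int]) -> Tuple[List[float], float]:
    """Return (p^M_h(. | s, advice), r^M_h(s, advice)) at one time step.

    p_h[s][a][s2] and r_h[s][a] describe the human's MDP, pi_h[s][a] is the
    default policy, and theta[s][a] is the adherence level.
    """
    n_actions = len(pi_h[s])
    if advice is DEFER:
        act = list(pi_h[s])
    else:
        # Adherence law (1); assumes pi_h[s][advice] < 1.
        rest = 1.0 - pi_h[s][advice]
        act = [theta[s][advice] if a == advice
               else (1.0 - theta[s][advice]) * pi_h[s][a] / rest
               for a in range(n_actions)]
    n_states = len(p_h[s][0])
    # Marginalize the human's action out of the kernel and the reward.
    p_m = [sum(p_h[s][a][s2] * act[a] for a in range(n_actions))
           for s2 in range(n_states)]
    r_m = sum(r_h[s][a] * act[a] for a in range(n_actions))
    return p_m, r_m


# Toy instance: action a moves deterministically to state a; action 1 pays 1.
p_h = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
r_h = [[0.0, 1.0], [0.0, 1.0]]
pi_h = [[0.5, 0.5], [0.5, 0.5]]
theta = [[0.8, 0.8], [0.8, 0.8]]
print(machine_model(p_h, r_h, pi_h, theta, s=0, advice=1))
print(machine_model(p_h, r_h, pi_h, theta, s=0, advice=DEFER))
```

In this toy instance, advising action 1 shifts probability mass toward the rewarding action in proportion to the adherence level, whereas deferring reproduces the human's default behaviour.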

Denote by $\boldsymbol{\pi}=\{\pi_h\}_{h\in[H]}$ the machine’s policy, where $\pi_h:\mathcal{S}\to\bar{\mathcal{A}}$. The value function then becomes

$$V_{h_0}^{\pi}(s)=\mathbb{E}\left[\sum_{h=h_0}^{H}r^{\mathtt{M}}_h(s_h,a_h)\,\Big|\,s_{h_0}=s\right],\quad\text{where } a_h=\pi_h(s_h) \text{ and } s_{h+1}\sim p_h^{\mathtt{M}}(\cdot|s_h,a_h),$$

and let $V^{\pi}_{H+1}(s)=0$ for any $s\in\mathcal{S}$. The optimal value $V^{*}$ and the optimal policy $\pi^{*}$ are defined by

$$V_h^{*}(s)=\max_{\pi\in\Pi}V_h^{\pi}(s),\quad \pi^{*}=\operatorname*{arg\,max}_{\pi\in\Pi}V_1^{\pi}(s),$$

where $\Pi$ consists of all deterministic non-anticipating Markov policies. Similarly, we define the corresponding $Q$ functions to be

$$Q_h^{\pi}(s,a)=r^{\mathtt{M}}_h(s,a)+\sum_{s'\in\mathcal{S}}p^{\mathtt{M}}_h(s'|s,a)V_{h+1}^{\pi}(s'),\quad\text{and}\quad Q_h^{*}(s,a)=r^{\mathtt{M}}_h(s,a)+\sum_{s'\in\mathcal{S}}p^{\mathtt{M}}_h(s'|s,a)V_{h+1}^{*}(s').$$
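When the machine’s model is fully known, the definitions above can be evaluated by standard backward induction, using the Bellman optimality relation $V_h^{*}(s)=\max_{a\in\bar{\mathcal{A}}}Q_h^{*}(s,a)$. That relation is a standard fact for finite-horizon MDPs rather than something stated in this passage. The sketch below is a generic planning routine, not one of the paper’s learning algorithms; the name `solve_machine_mdp` and the nested-list layout are assumptions, and time steps are indexed from 0 in the code rather than from 1.

```python
from typing import List, Tuple


def solve_machine_mdp(r_m: List[List[List[float]]],
                      p_m: List[List[List[List[float]]]]
                      ) -> Tuple[List[List[float]], List[List[int]]]:
    """Backward induction on a known machine MDP.

    r_m[h][s][a] and p_m[h][s][a][s2] are indexed by time step h = 0..H-1,
    state s, and augmented action a (one index is reserved for defer).
    Returns (value, policy): value[h][s] is the optimal value and
    policy[h][s] is a greedy augmented action.
    """
    horizon = len(r_m)
    n_states = len(r_m[0])
    # The extra last row plays the role of V_{H+1} = 0.
    value = [[0.0] * n_states for _ in range(horizon + 1)]
    policy = [[0] * n_states for _ in range(horizon)]
    for h in range(horizon - 1, -1, -1):
        for s in range(n_states):
            # Q*_h(s, a) = r^M_h(s, a) + sum_s2 p^M_h(s2 | s, a) V*_{h+1}(s2)
            q = [r_m[h][s][a]
                 + sum(p_m[h][s][a][s2] * value[h + 1][s2]
                       for s2 in range(n_states))
                 for a in range(len(r_m[h][s]))]
            best = max(range(len(q)), key=q.__getitem__)
            value[h][s] = q[best]
            policy[h][s] = best
    return value, policy


# Toy instance: one state, two augmented actions, horizon 2.
r_m = [[[0.2, 0.5]], [[1.0, 0.3]]]
p_m = [[[[1.0], [1.0]]], [[[1.0], [1.0]]]]
value, policy = solve_machine_mdp(r_m, p_m)
print(value[0], policy)
```

Ties between actions are broken toward the lowest index here, which is an arbitrary choice of this sketch.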

Human-centric system. The theme of the formulation and all our following results is a human-centric decision system where the machine acknowledges the suboptimal behavior of the human and makes advice on critical states to improve the reward. So the learning and optimization of our paper take the perspective of the machine (solving $\mathcal{M}^{\mathtt{M}}$) and do not seek to change the underlying human policy $\pi^{\mathtt{H}}$.

3 The learning problem

Now we discuss learning problems associated with the above human-machine adherence model. We consider two learning environments for the problem:

$\mathcal{E}_1$ (Environment 1, partially known): the environment’s state transition kernel $p$, the reward $r$, and the human’s behavior policy $\pi^{\mathtt{H}}$ are known; the human’s adherence level $\theta$ is unknown.

$\mathcal{E}_2$ (Environment 2, fully unknown): the environment’s state transition kernel $p$, the reward $r$, the human’s behavior policy $\pi^{\mathtt{H}}$, and the human’s adherence level $\theta$ are unknown.

For $\mathcal{E}_1$, the goal is simply to learn the optimal policy under the unknown adherence level $\theta$. We develop a learning algorithm that outputs an $\epsilon$-optimal advice policy and features better sample complexity compared to the vanilla application of problem-agnostic RL methods on $\mathcal{M}^{\mathtt{M}}$. For $\mathcal{E}_2$, we know neither the environment nor the human’s policy. Thus the learning problem entails learning the dynamics of both the environment and the human policy. We develop a provably convergent learning algorithm that outputs the optimal policy, and in addition, the learned advice policy only gives advice when necessary (choosing to defer for non-critical steps).

Our investigations on these two learning formulations highlight three points. First, the inherent structure of the human-machine interaction allows more sample-efficient algorithms (than the vanilla application of the off-the-shelf RL algorithms) both theoretically and empirically. Second, the knowledge of the underlying environment ($\mathcal{E}_1$ compared against $\mathcal{E}_2$) significantly, and unsurprisingly, reduces the sample complexity of the learning algorithm. Third, we establish a close connection between the formulation of the human-machine interaction and the problems of reward-free exploration (Jin et al., 2020) and constrained MDPs (Altman, 2021).

3.1 Main Results

We first state the technical results and then present the detailed algorithms and analyses in the subsequent section.

Theorem 1 (Environment $\mathcal{E}_1$, informal) For environment $\mathcal{E}_1$, Algorithm 1 finds an $\epsilon$-optimal advice policy with a PAC sample complexity $O(H^2S^2A/\epsilon^2)$ with high probability.

Under environment $\mathcal{E}_1$, Theorem 1 gives a PAC sample complexity for the UCB-type (Upper-Confidence-Bound-type) Algorithm 1. We remark that applying the existing problem-agnostic algorithms can only achieve a suboptimal order of sample complexity on the problem: $O(H^3S^2A/\epsilon^2)$ via the model-based algorithm (Dann and Brunskill, 2015) and $O(H^4SA/\epsilon^2)$ via the model-free algorithm (Jin et al., 2018).$^{1}$ Specifically, the bound in Dann and Brunskill (2015) gives an additional factor of $H$ compared to the bounds in the original setting, where a stationary transition density is assumed; this is because, although the adherence level $\theta$ is stationary, the transition becomes non-stationary when $\theta$ is compounded with the transition kernel of the human’s underlying MDP. Also, we note that such an improvement on $H$ is not due to a reduction in the number of unknown parameters, because the adherence level $\theta$ has a dimensionality of $SA$. Indeed, the key to the improvement is that the intrinsic structure of the human-machine problem enables a more sample-efficient design of the UCB algorithm (see Section 4.1 for details). Moreover, we also provide another algorithm that finds an $\epsilon$-optimal advice policy with a sample complexity of $O(H^3SA/\epsilon^2)$ for $\mathcal{E}_2$ (see the appendix for the algorithm and details).

$^{1}$ The authors obtain a regret bound instead of a PAC sample complexity bound. However, they convert the regret bound to a PAC sample complexity bound in (Jin et al., 2018, Section 3.1).
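To give a feel for the relative scaling of the three orders quoted above, the following back-of-the-envelope calculation plugs in illustrative values. The choices $H=10$, $S=20$, $A=5$, $\epsilon=0.1$ are assumptions made purely for illustration, and the constants and logarithmic factors hidden by $O(\cdot)$ are ignored, so the numbers indicate relative orders only and are not actual sample counts.

$$\begin{aligned}
\frac{H^2S^2A}{\epsilon^2} &= \frac{10^2\cdot 20^2\cdot 5}{0.1^2} = 2\times 10^{7} && \text{(order in Theorem 1)},\\
\frac{H^3S^2A}{\epsilon^2} &= \frac{10^3\cdot 20^2\cdot 5}{0.1^2} = 2\times 10^{8} && \text{(model-based order)},\\
\frac{H^4SA}{\epsilon^2} &= \frac{10^4\cdot 20\cdot 5}{0.1^2} = 1\times 10^{8} && \text{(model-free order)}.
\end{aligned}$$

Under these assumed values, the order in Theorem 1 is smaller than the model-based order by exactly the factor $H=10$, and smaller than the model-free order by the factor $H^2/S=5$.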

For environment $\mathcal{E}_{2}$, we assume no prior knowledge at all, which makes the machine's problem no different from a generic RL problem. We therefore consider a slight twist of the machine's MDP built on the notion of pertinent advice. This twisted formulation admits richer analytical structure and draws interesting connections with several existing frameworks. Specifically, consider a new machine's MDP $\mathcal{M}^{\mathtt{M}}_{\beta}\in\left(\mathcal{S},\bar{\mathcal{A}},H,p^{\mathtt{M}},r^{\mathtt{M}}_{\beta}\right)$ which inherits everything from $\mathcal{M}^{\mathtt{M}}\in\left(\mathcal{S},\bar{\mathcal{A}},H,p^{\mathtt{M}},r^{\mathtt{M}}\right)$ except for the reward

$$r^{\mathtt{M}}_{h,\beta}(s,a)=r_{h}^{\mathtt{M}}(s,a)-\beta\cdot\mathbb{I}\{a\neq\text{defer}\}, \tag{3}$$

where $\mathbb{I}\{\cdot\}$ is the indicator function and $\beta>0$ is a constant. Under $\mathcal{M}^{\mathtt{M}}_{\beta}$, we denote by $V^{\pi}_{\beta}$ and $V^{*}_{\beta}$ the value function of $\pi$ and the optimal value function, respectively, and the optimal policy is $\pi_{\beta}^{*}\in\operatorname*{arg\,max}_{\pi}V_{\beta}^{\pi}$. The new reward function imposes a penalty of $\beta$ for each piece of advice and thus regularizes the number of times the machine gives advice over the horizon. In practice, advising the human at every step can be annoying in applications such as gaming, driving, or sports. Hence, it is crucial to deliver advice selectively, based on how critical it is; we informally call this pertinent advice. For example, when the human is an expert and already achieves near-optimal performance, there is no need to give advice; likewise, when the human is under-performing but the adherence level is low, there is little point in giving advice because it is unlikely to be taken.
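As a minimal sketch of the penalized reward in (3), the snippet below subtracts $\beta$ from every non-defer action. The array layout (`r_M` of shape `(H, S, A + 1)`, with index `DEFER = 0` reserved for the defer action) and the function name `penalized_reward` are illustrative assumptions, not notation from the paper; the rewards are synthetic.

```python
import numpy as np

DEFER = 0  # assumption: index 0 of the machine's action axis encodes "defer"


def penalized_reward(r_M, beta):
    """Return r_beta with r_beta[h, s, a] = r_M[h, s, a] - beta * 1{a != defer}."""
    r_beta = np.array(r_M, dtype=float, copy=True)
    r_beta[:, :, DEFER + 1:] -= beta  # every advice action pays the penalty beta
    return r_beta


# Synthetic rewards, for illustration only.
rng = np.random.default_rng(0)
r_M = rng.uniform(size=(3, 4, 5))  # (H, S, A + 1)
r_beta = penalized_reward(r_M, beta=0.2)
```

Only the reward changes; the transition kernel $p^{\mathtt{M}}$ is shared by all values of $\beta$, which is what later allows a single exploration phase to serve a whole family of penalties.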

Proposition 1.

For all $s\in\mathcal{S}$ and $h\in[H]$ such that $\pi^{*}_{h,\beta}(s)\neq\text{defer}$, we have

$$Q_{h}^{*}(s,\pi^{*}_{h,\beta}(s))-V_{h}^{\pi^{\mathtt{H}}}(s)\geq\beta.$$

The proposition says that if the machine gives the advice $\pi^{*}_{h,\beta}(s)$ and follows the optimal policy afterward, the expected reward is at least $\beta$ higher than if the machine defers all the way to the end of the horizon. In this light, we can rank how critical advice is at different states by solving $\mathcal{M}^{\mathtt{M}}_{\beta}$ for different values of $\beta$, which gives a better interpretation of the human-machine system.
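The inequality in Proposition 1 can be checked numerically on a small synthetic instance. The sketch below assumes a specific adherence model that is not restated in this section: advice $a$ in state $s$ is followed with probability `theta[s, a]`, and otherwise (and under defer) the human draws an action from her own policy `pi_H`. The function names `solve` and `advice_gap`, the array shapes, and the grid of penalties are all hypothetical choices made for illustration.

```python
import numpy as np


def solve(p, r, pi_H, theta, beta=0.0):
    """Backward induction for the machine's problem with advice penalty beta.

    Shapes: p (H, S, A, S), r (H, S, A), pi_H (H, S, A), theta (S, A).
    """
    H, S, A, _ = p.shape
    V = np.zeros((H + 1, S))
    Q_adv = np.zeros((H, S, A))  # advise a now, act optimally afterward
    Q_def = np.zeros((H, S))     # defer now, act optimally afterward
    for h in reversed(range(H)):
        q = r[h] + p[h] @ V[h + 1]            # value if the human plays a
        Q_def[h] = (pi_H[h] * q).sum(axis=1)  # human follows her own policy
        Q_adv[h] = theta * q + (1 - theta) * Q_def[h][:, None] - beta
        V[h] = np.maximum(Q_def[h], Q_adv[h].max(axis=1))
    return Q_adv, Q_def, V[:H]


rng = np.random.default_rng(1)
H, S, A = 5, 6, 3
p = rng.dirichlet(np.ones(S), size=(H, S, A))
r = rng.uniform(size=(H, S, A))
pi_H = rng.dirichlet(np.ones(A), size=(H, S))
theta = rng.uniform(size=(S, A))

Q_star, _, _ = solve(p, r, pi_H, theta)              # unpenalized Q*
_, _, V_human = solve(p, r, pi_H, np.zeros((S, A)))  # defer all the way


def advice_gap(beta):
    """Where pi*_beta advises, return Q*_h(s, pi*_beta) - V_h^{pi^H}(s)."""
    Qb_adv, Qb_def, _ = solve(p, r, pi_H, theta, beta)
    advised = Qb_adv.max(axis=2) > Qb_def  # ties are broken toward defer
    a_star = Qb_adv.argmax(axis=2)
    gap = np.take_along_axis(Q_star, a_star[..., None], axis=2)[..., 0] - V_human
    return advised, gap


betas = [0.0, 0.05, 0.1, 0.2, 0.4]
# Crude criticalness score: for how many penalties on the grid is (h, s) still advised?
score = sum(advice_gap(b)[0].astype(int) for b in betas)
```

Entries of `score` that stay positive for large penalties mark the step-state pairs where advice matters most under the assumed model.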

Theorem 2 (Environment $\mathcal{E}_{2}$, informal). For $\mathcal{E}_{2}$, Algorithm 2 outputs a family of $\epsilon$-optimal policies $\{\hat{\pi}_{\beta}\}_{\beta>0}$ for $\{\mathcal{M}_{\beta}^{\mathtt{H}}\}_{\beta>0}$ with $O(H^{5}SA/\epsilon^{2})$ episodes such that the following inequality

$$V_{1,\beta}^{*}(s_{1})-V_{1,\beta}^{\hat{\pi}_{\beta}}(s_{1})\leq\epsilon \tag{4}$$

holds uniformly for all $\beta>0$ with high probability.

Theorem 2 gives the sample complexity of Algorithm 2, which learns a near-optimal policy for all the models $\{\mathcal{M}_{\beta}^{\mathtt{H}}\}_{\beta\geq 0}$ simultaneously. Such joint learning not only provides a family of policies from which the human can choose $\beta$ according to her or his performance, but also gives us a handle on which states are critical, i.e., where the human's policy can be significantly improved.

4 Algorithms and Analyses

In this section, we present the algorithms and analyses that achieve the results mentioned previously.

4.1 UCB-based algorithm for $\mathcal{E}_{1}$

Under $\mathcal{E}_{1}$, the machine works with a human whose adherence level $\theta$ is unknown. An important property of $\theta$ is that the human-machine team achieves a higher optimal reward when the human has a higher adherence level. To emphasize the dependence on $\theta$, we write

$$V_{h}^{\pi}(s|\theta)=\mathbb{E}\left[\sum_{h^{\prime}=h}^{H}r^{\mathtt{M}}_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})\,\Big|\,s_{h}=s,\ \text{adherence parameter }\theta\right]\quad\text{and}\quad V_{h}^{*}(s|\theta)=\max_{\pi\in\Pi}V_{h}^{\pi}(s|\theta).$$

Proposition 2 (Monotonicity property).

Suppose $\theta_{1}\geq\theta_{2}$ holds entry-wise. Then the following inequality holds for all $s\in\mathcal{S}$ and $h\in[H]$:

$$V_{h}^{*}(s|\theta_{1})\geq V_{h}^{*}(s|\theta_{2}).$$

Proposition 2 implies that finding an upper bound for the optimal value function reduces to finding an upper bound for $\theta$. Algorithm 1 follows this implication and maintains an optimistic estimate $\bar{\theta}^{t}$ of the true parameter $\theta$. In each episode, it computes the policy $\hat{\pi}^{t}$ by treating $\bar{\theta}^{t}$ as the true parameter and rolls out the episode according to $\hat{\pi}^{t}$. It then updates the estimate with the new observations. The optimistic estimate $\bar{\theta}^{t}$ takes a standard UCB form with a careful choice of the confidence width; we defer the details to the appendix. The algorithm shares the intuition of other UCB-based algorithms: as observations accumulate, the upper confidence bound $\bar{\theta}^{t}$ shrinks toward the true $\theta$, and so do the value functions.
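The monotonicity in Proposition 2 can be illustrated numerically. The sketch below is not the paper's proof; it assumes the same adherence model as one plausible reading of the setup (advice $a$ is followed with probability `theta[s, a]`, otherwise the human acts by `pi_H`, and the machine may always defer), and the name `optimal_value` and all array shapes are hypothetical.

```python
import numpy as np


def optimal_value(p, r, pi_H, theta):
    """Optimal value V*_h(s | theta) of the machine's problem by backward induction.

    Shapes: p (H, S, A, S), r (H, S, A), pi_H (H, S, A), theta (S, A).
    """
    H, S, A, _ = p.shape
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        q = r[h] + p[h] @ V[h + 1]       # value if the human plays action a
        q_H = (pi_H[h] * q).sum(axis=1)  # value if the human acts on her own (defer)
        # Expected improvement over deferring when advising a with adherence theta.
        gain = (theta * (q - q_H[:, None])).max(axis=1)
        V[h] = q_H + np.maximum(gain, 0.0)
    return V[:H]


rng = np.random.default_rng(2)
H, S, A = 6, 5, 4
p = rng.dirichlet(np.ones(S), size=(H, S, A))
r = rng.uniform(size=(H, S, A))
pi_H = rng.dirichlet(np.ones(A), size=(H, S))

theta_lo = rng.uniform(size=(S, A))
theta_hi = theta_lo + (1.0 - theta_lo) * rng.uniform(size=(S, A))  # entry-wise larger

V_lo = optimal_value(p, r, pi_H, theta_lo)
V_hi = optimal_value(p, r, pi_H, theta_hi)
```

Under these assumptions `V_hi` dominates `V_lo` at every step and state, which is the property that lets an upper confidence bound on $\theta$ act as an upper confidence bound on the value function.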

Algorithm 1 UCB-ADherence (UCB-AD)
Input: Target probability level $\delta$.
Initialize $t=1$, $\mathcal{D}_{t-1}=\emptyset$, and the optimistic estimate $\bar{\theta}^{t}=\bm{1}$.
for $t=1,2,\cdots$ do
    Solve for the advice policy $\hat{\pi}^{t}=\operatorname*{arg\,max}_{\pi}V^{\pi}(\cdot|\bar{\theta}^{t})$ given the current optimistic estimate $\bar{\theta}^{t}$
    Sample a new episode $z_{t}=\left\{s_{1}^{t},a_{1}^{\mathtt{M},t},a_{1}^{\mathtt{H},t},r^{t}_{1},\cdots,s_{H}^{t},a_{H}^{\mathtt{M},t},a_{H}^{\mathtt{H},t},r_{H}^{t}\right\}$ following policy $\hat{\pi}^{t}$
    Update $\mathcal{D}_{t}\leftarrow\mathcal{D}_{t-1}\cup\{z_{t}\}$
    Update the optimistic estimate $\bar{\theta}^{t}\rightarrow\bar{\theta}^{t+1}$ based on $\mathcal{D}_{t}$ and $\delta$
end for
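The loop of Algorithm 1 can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: the confidence width below is a generic Hoeffding-style choice (the paper's width is given in its appendix and is not reproduced here), the simulator is assumed to reveal whether advice was followed, the initial state is fixed to `0`, and the names `greedy_advice` and `ucb_ad` are hypothetical.

```python
import numpy as np


def greedy_advice(p, r, pi_H, theta):
    """Optimal advice policy if theta were the true adherence; -1 means defer."""
    H, S, A, _ = p.shape
    V = np.zeros((H + 1, S))
    policy = np.full((H, S), -1)
    for h in reversed(range(H)):
        q = r[h] + p[h] @ V[h + 1]
        q_H = (pi_H[h] * q).sum(axis=1)
        gain = theta * (q - q_H[:, None])
        best_gain = gain.max(axis=1)
        policy[h] = np.where(best_gain > 0, gain.argmax(axis=1), -1)
        V[h] = q_H + np.maximum(best_gain, 0.0)
    return policy


def ucb_ad(p, r, pi_H, theta_true, episodes, delta, rng):
    H, S, A, _ = p.shape
    n = np.zeros((S, A))         # times advice a was given in state s
    followed = np.zeros((S, A))  # times that advice was followed
    theta_bar = np.ones((S, A))  # optimistic initialization, as in Algorithm 1
    for t in range(1, episodes + 1):
        policy = greedy_advice(p, r, pi_H, theta_bar)
        s = 0  # assumption: fixed initial state s_1
        for h in range(H):
            a_M = policy[h, s]
            adhere = a_M >= 0 and rng.random() < theta_true[s, a_M]
            if a_M >= 0:
                n[s, a_M] += 1
                followed[s, a_M] += adhere
            a_H = a_M if adhere else rng.choice(A, p=pi_H[h, s])
            s = rng.choice(S, p=p[h, s, a_H])
        # Hypothetical Hoeffding-style width, not the paper's choice.
        width = np.sqrt(np.log(2 * S * A * t * t / delta) / (2 * np.maximum(n, 1)))
        mean = followed / np.maximum(n, 1)
        theta_bar = np.where(n > 0, np.minimum(1.0, mean + width), 1.0)
    return theta_bar, n


rng = np.random.default_rng(3)
H, S, A = 3, 4, 2
p = rng.dirichlet(np.ones(S), size=(H, S, A))
r = rng.uniform(size=(H, S, A))
pi_H = rng.dirichlet(np.ones(A), size=(H, S))
theta_true = rng.uniform(size=(S, A))
theta_bar, n_visits = ucb_ad(p, r, pi_H, theta_true, episodes=50, delta=0.1, rng=rng)
```

Pairs $(s,a)$ that are never advised keep $\bar{\theta}(s,a)=1$, so the planner stays optimistic about them until they are tried.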

Theorem 1 establishes an $(\epsilon,\delta)$-PAC result for Algorithm 1.

Theorem 1.

For any $\delta\in(0,1)$, $\epsilon\in(0,1]$, and $T\in\mathbb{N}^{+}$, the number of policies among $\{\hat{\pi}^{t}\}_{t=1}^{T}$ from Algorithm 1 that are not $\epsilon$-optimal, i.e., $V_{1}^{*}(s_{1})-V_{1}^{\hat{\pi}^{t}}(s_{1})>\epsilon$, is bounded by $\tilde{O}\left(\frac{H^{2}S^{2}A}{\epsilon^{2}}\cdot\log\frac{1}{\delta}\right)$ with probability $1-\delta$.

The proof of the theorem follows the analysis of Dann and Brunskill (2015). One caveat is that the original analysis of Dann and Brunskill (2015) focuses on a stationary setting in which transition probabilities depend solely on the state and action and are independent of the time step. However, even when the adherence level $\theta$ remains the same over time, the machine's MDP is non-stationary. A direct adaptation is to enlarge the state space to incorporate the horizon step $h$, yet this results in a sample complexity of $O(H^{3}S^{2}A/\epsilon^{2})$, a worse dependency on $H$. The key is to carry out the upper-bound analysis in the space of adherence levels and use Proposition 2 to convert it into a suboptimality gap with respect to the value function. This treatment gives the desired bound in Theorem 1, which also outperforms the bound from a direct application of the results of Azar et al. (2017) to non-stationary MDPs.

4.2 Reward-free exploration algorithm for $\mathcal{E}_{2}$

$\mathcal{E}_{2}$ has more unknown parameters than $\mathcal{E}_{1}$ and thus naturally requires more intensive exploration. Moreover, the learning objective becomes more complex: we aim not only to learn the near-optimal policy but also to discern the pertinent advice.

Algorithm 2 is based on the concept of reward-free exploration (RFE) (Jin et al., 2020). Specifically, RFE algorithms usually consist of an exploration phase and a planning phase. During the exploration phase, the algorithm collects trajectories from an MDP $\mathcal{M}$ without a pre-specified reward function. In the planning phase, it can compute near-optimal policies of $\mathcal{M}$ for any bounded deterministic reward function.

In our human-machine model, the machine observes $s_{h}\to a^{\mathtt{M}}\to a^{\mathtt{H}}\to s_{h+1}$, and the trajectory for episode $t$ is $z_{t}=\{s_{1}^{t},a_{1}^{\mathtt{M},t},a_{1}^{\mathtt{H},t},r_{1}^{t},s_{2}^{t},a_{2}^{\mathtt{M},t},a_{2}^{\mathtt{H},t},r_{2}^{t},\cdots,s_{H}^{t},a_{H}^{\mathtt{M},t},a_{H}^{\mathtt{H},t},r_{H}^{t}\}$, where $a_{h}^{\mathtt{M},t}=\pi^{t}(s_{h}^{t})$, $a_{h}^{\mathtt{H},t}\sim\mathbb{P}_{h}(\cdot|s_{h}^{t},a_{h}^{\mathtt{M},t})$, and $s_{h+1}^{t}\sim p_{h}(\cdot|s_{h}^{t},a_{h}^{\mathtt{H},t})$.
We denote by $\hat{p}^{\mathtt{M},t}_{h}$ and $\hat{r}^{\mathtt{M},t}_{h}$ the empirical estimates of $p^{\mathtt{M}}$ and $r^{\mathtt{M}}_{h}$, and by $n_{h}^{t}(s,a)=\sum_{i=1}^{t}\mathbb{I}\left\{\left(s_{h}^{i},a_{h}^{\mathtt{M},i}\right)=(s,a)\right\}$ the number of times the machine gives advice $a$ at time $h$ and state $s$ in the first $t$ episodes. The key quantity in Algorithm 2 is

$$W_{h}^{t}(s,a)=\min\left(H,\ 16H^{2}\cfrac{\phi(n_{h}^{t}(s,a),\delta)}{n_{h}^{t}(s,a)}+\left(1+\frac{1}{H}\right)\sum_{s^{\prime}}\hat{p}^{\mathtt{M},t}_{h}(s^{\prime}|s,a)\max_{a^{\prime}}W_{h+1}^{t}(s^{\prime},a^{\prime})\right), \tag{5}$$

where $W^{t}_{H+1}(s,a)=0$ for $(s,a)\in\mathcal{S}\times\mathcal{A}$, and $\phi(n,\delta)$ grows at the order of $O(\log(n)+\log(1/\delta))$ and is specified in Theorem 2.

Now we formally introduce Algorithm 2. The algorithm iteratively minimizes an upper bound, defined by (5), that measures the uncertainty of a state-action pair; the upper bound shrinks as the number of visits to the state-action pair increases. The algorithm stops when the upper bound falls below a pre-specified threshold. This algorithm is inspired by the RF-Express algorithm (Ménard et al., 2021), with slight differences in the definition of $W_{h}^{t}(s,a)$, in $\phi(n,\delta)$, and in the stopping rule. In our application, the reward $r^{\mathtt{M}}$ is stochastic and we need to account for its estimation error, whereas in Ménard et al. (2021) the algorithm does not need to deal with the reward at all.
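Recursion (5) can be evaluated by backward induction over $h$. In the sketch below, `phi` is only a hypothetical stand-in with the stated order $O(\log(n)+\log(1/\delta))$, since the exact form is the one specified with Theorem 2; the function name `uncertainty_bound` and the array shapes are likewise assumptions made for illustration.

```python
import numpy as np


def phi(n, delta):
    # Hypothetical stand-in of order O(log n + log(1/delta)); not the paper's exact phi.
    return np.log(np.e * (1.0 + n)) + np.log(1.0 / delta)


def uncertainty_bound(p_hat, n, delta):
    """Evaluate recursion (5).

    p_hat: (H, S, A, S) empirical transitions, n: (H, S, A) visit counts.
    Returns W of shape (H, S, A).
    """
    H, S, A, _ = p_hat.shape
    W = np.zeros((H + 1, S, A))  # boundary condition W_{H+1} = 0
    for h in reversed(range(H)):
        with np.errstate(divide="ignore"):
            bonus = 16 * H**2 * phi(n[h], delta) / n[h]  # +inf where n = 0
        cont = (1 + 1 / H) * (p_hat[h] @ W[h + 1].max(axis=1))
        W[h] = np.minimum(H, bonus + cont)  # the cap at H handles unvisited pairs
    return W[:H]


rng = np.random.default_rng(4)
H, S, A = 4, 3, 2
p_hat = rng.dirichlet(np.ones(S), size=(H, S, A))
W_unvisited = uncertainty_bound(p_hat, np.zeros((H, S, A)), delta=0.1)
W_visited = uncertainty_bound(p_hat, np.full((H, S, A), 1e12), delta=0.1)
```

With no visits the bound sits at its cap $H$, and with very many visits it collapses toward zero, which is the shrinking behavior the stopping rule relies on.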

Algorithm 2 : RFE-β𝛽\betaitalic_β
1:Input: ϵ,δitalic-ϵ𝛿\epsilon,\deltaitalic_ϵ , italic_δ, and user-specified {βi}isubscriptsubscript𝛽𝑖𝑖\{\beta_{i}\}_{i\in\mathcal{I}}{ italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ caligraphic_I end_POSTSUBSCRIPT, where \mathcal{I}caligraphic_I could be any set where βi[0,H)subscript𝛽𝑖0𝐻\beta_{i}\in[0,H)italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ [ 0 , italic_H )
2:Stage 1: Reward-free exploration
3:Initialize t=1𝑡1t=1italic_t = 1 and Wht(s,a)=Hsubscriptsuperscript𝑊𝑡𝑠𝑎𝐻W^{t}_{h}(s,a)=Hitalic_W start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s , italic_a ) = italic_H for all (s,a)𝒮×𝒜𝑠𝑎𝒮𝒜(s,a)\in\mathcal{S}\times\mathcal{A}( italic_s , italic_a ) ∈ caligraphic_S × caligraphic_A
4:Compute πtsuperscript𝜋𝑡{\pi}^{t}italic_π start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT so that πht(s)=argmaxa𝒜Wht(s,a)superscriptsubscript𝜋𝑡𝑠subscriptargmax𝑎𝒜superscriptsubscript𝑊𝑡𝑠𝑎{\pi}_{h}^{t}(s)=\operatorname*{arg\,max}_{a\in\mathcal{A}}W_{h}^{t}(s,a)italic_π start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_s ) = start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_s , italic_a ) (see (5))
5:while $W_{1}^{t}(s_{1},\pi^{t}(s_{1}))+4e\sqrt{W_{1}^{t}(s_{1},\pi^{t}(s_{1}))}>\epsilon/H$ do
6:     Sample trajectory $z_{t}=\{s_{1}^{t},a_{1}^{\mathtt{M},t},a_{1}^{\mathtt{H},t},r_{1}^{t},\cdots,s_{H}^{t},a_{H}^{\mathtt{M},t},a_{H}^{\mathtt{H},t},r_{H}^{t}\}$ following $\pi^{t}$
7:     Update $t\leftarrow t+1$, $\mathcal{D}\leftarrow\mathcal{D}\cup\{z_{t}\}$, $\hat{p}^{\mathtt{M},t}_{h}(s^{\prime}|s,a)$, $\hat{r}^{\mathtt{M},t}_{h}(s,a)$, and $W_{h}^{t}(s,a)$
8:end while
9:Stage 2: Policy identification
10:Use planning algorithms to output optimal advice policy $\{\hat{\pi}_{\beta_{i}}^{\tau}\}_{i\in\mathcal{I}}$ for $\left\{\left(\mathcal{S},\bar{\mathcal{A}},H,\hat{p}^{\mathtt{M}},\hat{r}^{\mathtt{M}}_{\beta_{i}}\right)\right\}_{i\in\mathcal{I}}$
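The loop condition of Stage 1 can be read as a simple predicate on the exploration quantity at the initial state. The following is a minimal sketch, assuming that $e$ denotes Euler's number and that the value $W_{1}^{t}(s_{1},\pi^{t}(s_{1}))$ has already been computed; the function name `stage1_should_continue` is hypothetical and not part of the algorithm as stated.

```python
import math


def stage1_should_continue(w1: float, epsilon: float, horizon: int) -> bool:
    """Return True while W + 4e*sqrt(W) > epsilon / H, i.e. while Stage 1 keeps sampling.

    w1 stands for W_1^t(s_1, pi^t(s_1)); treating `e` as Euler's number
    is an assumption of this sketch.
    """
    if w1 < 0:
        raise ValueError("w1 must be non-negative")
    return w1 + 4 * math.e * math.sqrt(w1) > epsilon / horizon
```

Because the square-root term dominates for small values, $W_{1}^{t}$ has to shrink to roughly the order of $(\epsilon/H)^{2}$ before the exploration stage stops.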
Theorem 2.

For $\delta\in(0,1)$, $\epsilon\in(0,1]$, and $\phi(n,\delta)=6\log(4HSA/(\epsilon\delta))+S\log(8e(n+1))$, with probability $1-\delta$, Stage 1 of Algorithm 2 stops in $\tau$ episodes and

$$\tau\leq C_{1}\cfrac{H^{5}SA}{\epsilon^{2}}\left(6\log(4HSA/(\epsilon\delta))+S\right),$$

where $C_{1}=\tilde{O}(\log(HSA))$. Moreover, $\{\hat{\pi}^{\tau}_{\beta}\}_{\beta>0}$ have the following property

$$P\left(V_{1,\beta}^{*}(s_{1})-V_{1,\beta}^{\hat{\pi}^{\tau}_{\beta}}(s_{1})\leq\epsilon\,\,\text{uniformly for all $\beta\in[0,H)$}\right)>1-\delta.$$

Theorem 2 ensures that Algorithm 2 estimates the underlying MDP accurately enough that all the policies $\{\hat{\pi}_{\beta}^{\tau}\}_{\beta\in[0,H)}$ for pertinent advice are near optimal. The proof is a direct application of RF-Express (Ménard et al., 2021), except that we have to take care of the estimation error in $\hat{r}^{\mathtt{M}}$. Although Algorithm 2 has the uniform convergence property for any number of bounded reward functions, it can also be used in the same way as Algorithm 1, to find the $\epsilon$-optimal policy for $\mathcal{M}^{\mathtt{M}}$ if provided with the non-penalized reward function $\hat{r}^{\mathtt{M}}$. In this context, we can modify RFE-$\beta$ so that, with high probability, it solves $\mathcal{M}^{\mathtt{M}}$ with a sample complexity of $O(H^{3}SA/\epsilon^{2})$ (see Algorithm LABEL:alg:alg3 in Appendix LABEL:ap_pf_alg2 for details).
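To get a feel for how the episode bound in Theorem 2 scales, the right-hand side can be evaluated numerically. The sketch below is illustrative only: the factor $C_{1}$ is not specified beyond its order, so it is exposed as a placeholder argument `c1` (an assumption), and the function name is hypothetical.

```python
import math


def stage1_episode_bound(H: int, S: int, A: int, epsilon: float,
                         delta: float, c1: float = 1.0) -> float:
    """Evaluate c1 * H^5 * S * A / eps^2 * (6*log(4HSA/(eps*delta)) + S).

    c1 is a placeholder for the unspecified factor C_1 in Theorem 2.
    """
    log_term = 6 * math.log(4 * H * S * A / (epsilon * delta))
    return c1 * H**5 * S * A / epsilon**2 * (log_term + S)
```

Halving $\epsilon$ multiplies the leading factor by four, while the dependence on $\delta$ is only logarithmic.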

CMDP for pertinent advice. The algorithm RFE-$\beta$ solves a class of problems $\{\mathcal{M}_{\beta}^{\mathtt{M}}\}_{\beta>0}$ simultaneously for all the $\beta$’s, and it measures the pertinence of advice by $\beta$. However, sometimes humans lack a quantitative view of how large a $\beta$ value should be to count as pertinent. Here, we introduce a different perspective on how the human should rank the importance of advice, framing it as “in $H$ steps, I want advice no more than $D$ times”, and formulate this as a CMDP problem

$$\max_{\pi}\,\,\,\mathbb{E}^{\pi}\left[\sum_{h=1}^{H}r^{\mathtt{M}}(s_{h},a_{h})\right]\quad\text{s.t.}\quad\mathbb{E}^{\pi}\left[\sum_{h=1}^{H}\mathbb{I}\{a_{h}\neq\text{defer}\}\right]\leq D,\tag{6}$$

where $D\in(0,H)$. By standard primal-dual theory, this formulation is closely related to the penalty $\beta$ in (3), because we can treat $\beta$ as the dual variable for the constraint with budget $D$. We refer the reader to the proof of Corollary 1 in Appendix LABEL:ap_pf_alg2 for details.
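The link between the budget $D$ and the penalty $\beta$ can be sketched through the Lagrangian of (6). This is the standard construction, written under the assumption that the penalized reward in (3) has the form $r^{\mathtt{M}}(s,a)-\beta\,\mathbb{I}\{a\neq\text{defer}\}$; the symbol $L$ is introduced here only for illustration.

```latex
\begin{aligned}
L(\pi,\beta)
  &= \mathbb{E}^{\pi}\left[\sum_{h=1}^{H} r^{\mathtt{M}}(s_h,a_h)\right]
     - \beta\left(\mathbb{E}^{\pi}\left[\sum_{h=1}^{H}\mathbb{I}\{a_h\neq\text{defer}\}\right] - D\right) \\
  &= \mathbb{E}^{\pi}\left[\sum_{h=1}^{H}\Big(r^{\mathtt{M}}(s_h,a_h) - \beta\,\mathbb{I}\{a_h\neq\text{defer}\}\Big)\right] + \beta D,
  \qquad \beta \geq 0.
\end{aligned}
```

For a fixed $\beta$, the term $\beta D$ does not depend on $\pi$, so maximizing $L(\pi,\beta)$ over $\pi$ amounts to solving the $\beta$-penalized problem, and the dual problem then minimizes over $\beta\geq 0$.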

Now we present the CMDP method for pertinent advice. After Stage 1 of RFE-$\beta$, we solve

$$\max_{\pi}\,\,\,\hat{\mathbb{E}}^{\pi}\left[\sum_{h=1}^{H}\hat{r}^{\mathtt{M},\tau}(s_{h},a_{h})\right]\quad\text{s.t.}\quad\hat{\mathbb{E}}^{\pi}\left[\sum_{h=1}^{H}\mathbb{I}\{a_{h}\neq\text{defer}\}\right]\leq D,\tag{7}$$

where $\hat{\mathbb{E}}$ is the expectation with the underlying transition being $\hat{p}^{\mathtt{M},\tau}$. The next corollary states that $\hat{\pi}^{\tau}_{D}$, the solution of (7), is a near-optimal policy for the CMDP (6).

Corollary 1.

In the same setting as Theorem 2, for $\delta\in(0,1)$ and $\epsilon\in(0,1]$, with probability $1-\delta$, for all $D\in(0,H)$, $\hat{\pi}^{\tau}_{D}$ is a near-optimal solution for the original CMDP (6) such that

$$V_{1}^{\hat{\pi}^{\tau}_{D}}(s_{1})\geq V_{1}^{\pi^{*}_{D}}(s_{1})-2\epsilon,\quad\text{and}\quad\mathbb{E}^{\hat{\pi}^{\tau}_{D}}\left[\sum_{h=1}^{H}\mathbb{I}\{a_{h}\neq\text{defer}\}\right]\leq D+\epsilon,\tag{8}$$

where $\pi^{*}_{D}$ is the optimal solution for (6).

Corollary 1 also implies that RFE-$\beta$ can compute near-optimal policies of CMDP (6) for all the constraints $D\in[0,H)$, with a sample complexity of $O(H^{5}SA/\epsilon^{2})$. Compared to other CMDP learning algorithms (for example, $O(H^{2}S^{3}A/\epsilon^{2})$ in Kalagarla et al. (2021)), the sample complexity of Corollary 1 features a lower order in $S$. Moreover, the near-optimal result holds for all constraints $D\in[0,H)$, whereas for other CMDP learning algorithms the result holds only for a pre-specified $D$.
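One way to approximate a solution of the empirical problem (7) is to exploit the primal-dual relation with the penalty $\beta$: plan for the penalized reward by backward induction and bisect on $\beta$ until the expected number of advice steps fits the budget $D$. The sketch below uses this Lagrangian bisection as a swapped-in technique and is not necessarily the planner used in the paper. It assumes stationary estimates `p[s][a][s2]` and `r[s][a]` with rewards in $[0,1]$, and a hypothetical action index `DEFER = 0`. Since it only searches deterministic policies, it may under-use the budget when the exact CMDP optimum requires randomization.

```python
from typing import List

DEFER = 0  # hypothetical index of the "defer" action


def plan_penalized(p, r, beta: float, H: int, s1: int = 0):
    """Backward induction for the reward r(s, a) - beta * 1{a != DEFER}.

    Returns (policy, value, count): policy[h][s] is the chosen action,
    value is the un-penalized expected reward from s1, and count is the
    expected number of non-defer actions from s1.
    """
    S, A = len(r), len(r[0])
    V = [0.0] * S  # penalized value-to-go
    R = [0.0] * S  # un-penalized reward-to-go of the greedy policy
    N = [0.0] * S  # expected advice count-to-go of the greedy policy
    policy: List[List[int]] = []
    for _ in range(H):
        new_V, new_R, new_N, pi_h = [], [], [], []
        for s in range(S):
            best = None
            for a in range(A):
                cost = 0.0 if a == DEFER else 1.0
                q = r[s][a] - beta * cost + sum(p[s][a][j] * V[j] for j in range(S))
                # DEFER is evaluated first and later actions must be strictly
                # better, so ties favor deferring.
                if best is None or q > best[0] + 1e-12:
                    q_r = r[s][a] + sum(p[s][a][j] * R[j] for j in range(S))
                    q_n = cost + sum(p[s][a][j] * N[j] for j in range(S))
                    best = (q, q_r, q_n, a)
            new_V.append(best[0])
            new_R.append(best[1])
            new_N.append(best[2])
            pi_h.append(best[3])
        V, R, N = new_V, new_R, new_N
        policy.insert(0, pi_h)  # steps are processed from h = H down to h = 1
    return policy, R[s1], N[s1]


def solve_budget(p, r, H: int, D: float, iters: int = 40):
    """Bisect on beta in [0, H] for the smallest penalty meeting the budget D."""
    policy, value, count = plan_penalized(p, r, 0.0, H)
    if count <= D:
        return 0.0, policy, value, count
    lo, hi = 0.0, float(H)
    best = plan_penalized(p, r, hi, H)  # with rewards in [0, 1], this always defers
    for _ in range(iters):
        mid = (lo + hi) / 2
        cand = plan_penalized(p, r, mid, H)
        if cand[2] <= D:
            hi, best = mid, cand
        else:
            lo = mid
    return (hi,) + best


# Toy instance (illustrative numbers): state 0 is "easy" for the human,
# state 1 is "hard", and the two states alternate regardless of the action.
p = [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
r = [[0.9, 1.0], [0.0, 1.0]]
beta, policy, value, count = solve_budget(p, r, H=4, D=2)
```

In the toy instance the search settles on a penalty for which the machine advises only in the hard state, which mirrors the idea of giving advice only at critical steps.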

5 Numerical Experiment

We perform numerical experiments under two environments: Flappy Bird (Williams et al., 2023) and Car Driving (Meresht et al., 2020). Both Atari game-like environments are suitable and convenient for modeling human behavior while retaining the learning structure for the machine. We focus on the Flappy Bird environment here and defer the Car Driving environment to Appendix LABEL:apnd:B.

Flappy Bird Environment. We consider a game map of a 7-by-20 grid of cells. Each cell can be empty, contain a star, or act as a wall. The goal is to navigate the bird across the map from left to right and collect as many stars as possible. However, colliding with a wall or reaching the (upper and lower) boundaries leads to the end of the game. An example map is displayed in Figure 1, which splits into three phases: the first phase contains almost only stars and no walls, the second phase contains almost only walls and very few stars, and the third phase contains both stars and walls.

Figure 1: Flappy Bird environment: player needs to navigate the bird to avoid walls and collect stars.

We define the state space as the current location of the bird on the grid, represented by coordinates $(x,y)\in\mathbb{Z}^{2}$, with a total of $7\times 20=140$ states. We define the action space as $\mathcal{A}=\{\text{Up, Up-Up, Down}\}$. Each action causes the bird to move forward by one cell. In addition, the “Up” action moves the bird one cell upwards, the “Up-Up” action moves it two cells upwards, and the “Down” action moves it one cell downwards. The reward of the MDP is a function of the state only: we get a reward of $1$ when the current state (location) has a star and $0$ otherwise. To model human behavior, we consider two sub-optimal human policies: Policy Greedy, which prioritizes collecting stars in the next column, and Policy Safe, which focuses on avoiding walls in the next column. If there is no preferred action available, both policies maintain a horizontal zig-zag line by alternating between “Up” and “Down”. For the adherence level $\theta$, we assume that for all $s\in\mathcal{S}$ and $h=1,\dots,H$, the human adheres to the advice with probability $0.9$, except for the aggressive advice “Up-Up” (which moves too fast vertically), for which the adherence level is $0.7$. We compare the following algorithms:

  • UCB-ADherence (UCB-AD): Algorithm 1 that finds the $\epsilon$-optimal advice policy.

  • RFE-ADvice (RFE-AD): Algorithm LABEL:alg:alg3, a variant of RFE-$\beta$ that finds the $\epsilon$-optimal policy.

  • RFE-$\beta$: Algorithm 2 that outputs pertinent advice policy by exploring then planning.

  • RFE-CMDP: A variant of RFE-$\beta$ that solves the CMDP (7) after exploring.
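The Flappy Bird dynamics and the adherence model described above can be sketched as follows. This is a minimal illustration under stated assumptions: rows are indexed $0$ to $6$ and columns $0$ to $19$, leaving the grid vertically or entering a wall cell counts as a crash, and the names `move`, `reward`, and `realized_action` are hypothetical. The human's own action is passed in as an argument, so Policy Greedy and Policy Safe are not modeled here.

```python
import random
from typing import Optional, Set, Tuple

ROWS, COLS = 7, 20
Cell = Tuple[int, int]  # (x, y) = (column, row)

# Vertical displacement of each action; every action also moves one cell forward.
DY = {"Up": 1, "Up-Up": 2, "Down": -1}
# Adherence levels from the experiment setup: 0.9, except 0.7 for "Up-Up".
ADHERENCE = {"Up": 0.9, "Up-Up": 0.7, "Down": 0.9}


def realized_action(advice: Optional[str], human_action: str,
                    rng: random.Random) -> str:
    """Action actually played: advice=None means the machine defers."""
    if advice is None:
        return human_action
    return advice if rng.random() < ADHERENCE[advice] else human_action


def reward(state: Cell, stars: Set[Cell]) -> float:
    """Reward depends on the state only: 1 if the current cell has a star."""
    return 1.0 if state in stars else 0.0


def move(state: Cell, action: str, walls: Set[Cell]) -> Tuple[Cell, bool]:
    """Apply an action and report whether the bird crashed.

    Treating out-of-range rows as a crash is an assumption of this sketch.
    """
    x, y = state
    nxt = (x + 1, y + DY[action])
    crashed = not (0 <= nxt[1] < ROWS) or nxt in walls
    return nxt, crashed
```

A single interaction step would call `realized_action` with the machine's advice and the human's intended action, then `move` with the result.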

Figures 2(a) and 2(b) present the results for the two algorithms UCB-AD and RFE-AD for the environment $\mathcal{E}_{1}$. They also include the state-of-the-art algorithm EULER (Zanette and Brunskill, 2019), which achieves a generic minimax optimal regret. From the regret plots, UCB-AD outperforms both RFE-AD and EULER. This advantage is attributed to UCB-AD’s effective utilization of the information and structure of the underlying MDP. These results also show that our tailored algorithms UCB-AD and RFE-AD are much more efficient than directly applying problem-agnostic RL algorithms in the adherence model. We further test UCB-AD with different $\theta$’s: with $\theta_{1}$, $\theta(a,s)\equiv 0.8$, and with $\theta_{2}$, $\theta(a,s)\equiv 0.4$. Figure 2(c) shows the relationship between the regret of UCB-AD and $\theta$: for both policies, UCB-AD achieves smaller regret with higher $\theta$. Intuitively, a high adherence level implies a high probability of following the advice instead of taking $\pi^{\mathtt{H}}$, which reduces the regret caused by the suboptimality of $\pi^{\mathtt{H}}$.

(a) Regrets of Policy Greedy.
(b) Regrets of Policy Safe.
(c) Regrets of UCB-AD.
Figure 2: The regrets for learning the optimal advice for Policy Greedy and Policy Safe. Figures 2(a) and 2(b) show the regrets of RFE-AD, UCB-AD, and EULER for the two policies, respectively. Figure 2(c) shows the regrets of UCB-AD for the two policies under different $\theta$’s.

Figure 3 summarizes results for three algorithms under the environment $\mathcal{E}_{2}$, namely RFE-$\beta$, RFE-CMDP, and UC-CFH, a provably convergent CMDP algorithm (Kalagarla et al., 2021), under Policy Safe. In Figure 3(a), we see that RFE-$\beta$ exhibits convergence for different $\beta$’s, which empirically corroborates the theoretical finding. In Figure 3(b), we compare RFE-CMDP and UC-CFH under a simpler environment with the advice budget being 1 ($D=1$). We observe that RFE-CMDP shows a marginal performance advantage over UC-CFH in terms of the convergence rate. More importantly, Figure 3(c) shows that, by only using the estimated transition kernel after learning for $D=1$ (Figure 3(b)), RFE-CMDP is able to obtain a near-optimal policy for problem instances with different advice budgets ($D=2,3,4$, and $5$). However, UC-CFH fails to explore the whole transition kernel sufficiently and can only output the near-optimal policy for the original problem instance. Moreover, RFE-CMDP is more sample efficient with respect to the advice budget, because UC-CFH has to be run multiple times with different advice budget parameters to get a near-optimal policy for all of them.

(a) Value gaps of RFE-$\beta$.
(b) Value gaps w.r.t. episode.
(c) Value gaps when evaluating.
Figure 3: The performances of making pertinent advice. The value gap is defined as the difference between the value of the current policy and the optimal value, with the red dashed line marking $0$ loss of the policy. Figure 3(a) shows the convergence of RFE-$\beta$ under different $\beta$’s. Figure 3(b) compares the convergence of RFE-CMDP and UC-CFH. Figure 3(c) evaluates the performance of the policy learned from the learning episodes in Figure 3(b).
(a) Trajectory of Policy Greedy.
(b) Trajectory of Policy Safe.
Figure 4: Typical trajectories of the two policy types. The red color means the machine defers and the green color means the machine advises.

Lastly, we show that RFE-$\beta$ is capable of generating pertinent advice for different policies. Figure 4 displays representative trajectories of the two policies playing the game while receiving guidance from the machine, which follows $\hat{\pi}_{\beta}$ trained in the experiment of Figure 2. By setting $\beta=0.3$, the machine outputs a policy that only gives advice when necessary. Since Policy Greedy behaves well in the first phase, the machine almost only gives advice in the second and third phases. Similarly, for Policy Safe, the machine almost only gives advice in the first and third phases, and chooses to defer most of the time in the second phase.

References

  • Altman [2021] Eitan Altman. Constrained Markov decision processes. Routledge, 2021.
  • Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
  • Balachandran et al. [2021] Avinash Balachandran, Tiffany L Chen, Jonathan YM Goh, Stephen McGill, Guy Rosman, Simon Stent, and John J Leonard. Human-centric intelligent driving: Collaborating with the driver to improve safety. In Automated Road Transportation Symposium, pages 85–109. Springer, 2021.
  • Bastani et al. [2021] Hamsa Bastani, Osbert Bastani, and Wichinpong Park Sinchaisri. Improving human decision-making with machine learning. arXiv preprint arXiv:2108.08454, 2021.
  • Bobu et al. [2020] Andreea Bobu, Dexter RR Scobee, Jaime F Fisac, S Shankar Sastry, and Anca D Dragan. Less is more: Rethinking probabilistic models of human behavior. In Proceedings of the 2020 acm/ieee international conference on human-robot interaction, pages 429–437, 2020.
  • Carroll et al. [2019] Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. On the utility of learning about humans for human-ai coordination. Advances in neural information processing systems, 32, 2019.
  • Chen et al. [2018] Min Chen, Stefanos Nikolaidis, Harold Soh, David Hsu, and Siddhartha Srinivasa. Planning with trust for human-robot collaboration. In Proceedings of the 2018 ACM/IEEE international conference on human-robot interaction, pages 307–315, 2018.
  • Chen et al. [2022] Ningyuan Chen, Ming Hu, and Wenhao Li. Algorithmic decision-making safeguarded by human knowledge. arXiv preprint arXiv:2211.11028, 2022.
  • Dann and Brunskill [2015] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. Advances in Neural Information Processing Systems, 28, 2015.
  • Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
  • De et al. [2021] Abir De, Nastaran Okati, Ali Zarezade, and Manuel Gomez Rodriguez. Classification under human assistance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 5905–5913, 2021.
  • Domingues et al. [2021] Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pages 578–598. PMLR, 2021.
  • Grand-Clément and Pauphilet [2022] Julien Grand-Clément and Jean Pauphilet. The best decisions are not the best advice: Making adherence-aware recommendations. arXiv preprint arXiv:2209.01874, 2022.
  • Hong et al. [2023] Joey Hong, Anca Dragan, and Sergey Levine. Learning to influence human behavior with offline reinforcement learning. arXiv preprint arXiv:2303.02265, 2023.
  • Jacq et al. [2022] Alexis Jacq, Johan Ferret, Olivier Pietquin, and Matthieu Geist. Lazy-mdps: Towards interpretable reinforcement learning by learning when to act. arXiv preprint arXiv:2203.08542, 2022.
  • Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
  • Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018.
  • Jin et al. [2020] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.
  • Kalagarla et al. [2021] Krishna C Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finite-horizon mdp with constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8030–8037, 2021.
  • Kaufmann et al. [2021] Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Algorithmic Learning Theory, pages 865–891. PMLR, 2021.
  • Khavas et al. [2020] Zahra Rezaei Khavas, S Reza Ahmadzadeh, and Paul Robinette. Modeling trust in human-robot interaction: A survey. In International conference on social robotics, pages 529–541. Springer, 2020.
  • Laidlaw and Dragan [2022] Cassidy Laidlaw and Anca Dragan. The boltzmann policy distribution: Accounting for systematic suboptimality in human models. arXiv preprint arXiv:2204.10759, 2022.
  • Lattimore and Hutter [2014] Tor Lattimore and Marcus Hutter. Near-optimal pac bounds for discounted mdps. Theoretical Computer Science, 558:125–143, 2014.
  • Mao et al. [2023] Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. Two-stage learning to defer with multiple experts. Advances in neural information processing systems, 36, 2023.
  • Ménard et al. [2021] Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pages 7599–7608. PMLR, 2021.
  • Meresht et al. [2020] Vahid Balazadeh Meresht, Abir De, Adish Singla, and Manuel Gomez-Rodriguez. Learning to switch between machines and humans. arXiv preprint arXiv:2002.04258, 2020.
  • Miryoosefi and Jin [2022] Sobhan Miryoosefi and Chi Jin. A simple reward-free approach to constrained reinforcement learning. In International Conference on Machine Learning, pages 15666–15698. PMLR, 2022.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
  • Mohri et al. [2023] Christopher Mohri, Daniel Andor, Eunsol Choi, and Michael Collins. Learning to reject with a fixed predictor: Application to decontextualization. arXiv preprint arXiv:2301.09044, 2023.
  • Mozannar and Sontag [2020] Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning, pages 7076–7087. PMLR, 2020.
  • Okati et al. [2021] Nastaran Okati, Abir De, and Manuel Rodriguez. Differentiable learning under triage. Advances in Neural Information Processing Systems, 34:9140–9151, 2021.
  • Shaheen [2021] Mohammed Yousef Shaheen. Applications of artificial intelligence (ai) in healthcare: A review. ScienceOpen Preprints, 2021.
  • Shani et al. [2019] Lior Shani, Yonathan Efroni, and Shie Mannor. Exploration conscious reinforcement learning revisited. In International conference on machine learning, pages 5680–5689. PMLR, 2019.
  • Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
  • Straitouri et al. [2021] Eleni Straitouri, Adish Singla, Vahid Balazadeh Meresht, and Manuel Gomez-Rodriguez. Reinforcement learning under algorithmic triage. arXiv preprint arXiv:2109.11328, 2021.
  • Sun et al. [2022] Jiankun Sun, Dennis J Zhang, Haoyuan Hu, and Jan A Van Mieghem. Predicting human discretion to adjust algorithmic prescription: A large-scale field experiment in warehouse operations. Management Science, 68(2):846–865, 2022.
  • Williams et al. [2023] Katherine J Williams, Madeleine S Yuh, and Neera Jain. A computational model of coupled human trust and self-confidence dynamics. ACM Transactions on Human-Robot Interaction, 2023.
  • Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.
  • Zhang et al. [2020] Xuezhou Zhang, Yuzhe Ma, and Adish Singla. Task-agnostic exploration in reinforcement learning. Advances in Neural Information Processing Systems, 33:11734–11743, 2020.