Learning to Make Adherence-Aware Advice

Guanting Chen$^{1}$, Xiaocheng Li$^{2}$, Chunlin Sun$^{3}$, Hanzhao Wang$^{2}$

$^{1}$ Department of Statistics and Operations Research, UNC-Chapel Hill
$^{2}$ Imperial College Business School, Imperial College London
$^{3}$ Institute for Computational and Mathematical Engineering, Stanford University

[email protected]
{xiaocheng.li, h.wang19}@imperial.ac.uk
[email protected]

Abstract

As artificial intelligence (AI) systems play an increasingly prominent role in human decision-making, challenges surface in the realm of human-AI interactions. One challenge arises from the suboptimal AI policies due to the inadequate consideration of humans disregarding AI recommendations, as well as the need for AI to provide advice selectively when it is most pertinent. This paper presents a sequential decision-making model that (i) takes into account the human’s adherence level (the probability that the human follows/rejects machine advice) and (ii) incorporates a defer option so that the machine can temporarily refrain from making advice. We provide learning algorithms that learn the optimal advice policy and make advice only at critical time stamps. Compared to problem-agnostic reinforcement learning algorithms, our specialized learning algorithms not only enjoy better theoretical convergence properties but also show strong empirical performance.

1 Introduction

Artificial intelligence (AI) has achieved remarkable success across various aspects of everyday life. However, it is crucial to acknowledge that many of AI’s accomplishments have been developed as fully automatic systems (Mnih et al., 2015; Silver et al., 2017). In several important domains like AI-assisted driving (Balachandran et al., 2021) and AI-assisted healthcare (Shaheen, 2021), AI is faced with the challenge of interacting with humans (Mozannar and Sontag, 2020; De et al., 2021), introducing a more intricate and demanding dynamic. This interaction between AI and humans gives rise to two significant issues. Firstly, it is common for humans to reject following AI’s advice, and if AI assumes humans’ perfect adherence to its advice, the advice generated under this assumption may not be optimal. Secondly, humans may prefer AI to refrain from constant advice-giving, opting for AI intervention only when necessary. They may value their autonomy when performing well but expect AI guidance during critical moments or when they encounter situations in which they are typically less proficient. These considerations underscore the importance of comprehending human behavior and preferences to develop effective and adaptable AI systems for human-AI interactions.

To address the mentioned challenges, in this paper, we provide a decision-making model for human-AI interactions. For the first challenge, the model takes into account the human’s adherence level, defined as the probability that the human takes the AI’s advice. This allows the machine to account for variations in human adherence level when making advice. For the second challenge, the AI model features an action named defer, which refrains from giving advice to humans. This feature recognizes that there are instances when humans prefer autonomy and only seek AI guidance during critical moments or situations where they typically struggle. By integrating the adherence level and action deferral into our model, we formulate these challenges as a decision-making problem.

To cater to this specialized decision-making model, we have developed tailored learning algorithms that are both provably convergent and empirically efficient. These algorithms are specifically designed to effectively handle the unique characteristics and challenges of the human-AI interaction setting.

1.1 Related Work

Human-AI interactions. Human-AI interactions have long been studied in fields such as robotics. Methods for modeling human behaviors and collaborating with robots (Bobu et al., 2020; Laidlaw and Dragan, 2022; Carroll et al., 2019) have achieved strong empirical performance. Similar to our definition of adherence level, a stream of literature (Chen et al., 2018; Williams et al., 2023) integrates trust (Khavas et al., 2020) as a latent factor into the human-AI model and solves a Partially Observable Markov Decision Process (POMDP) to obtain policies with strong empirical outcomes. Our work primarily centers on modeling and establishing theoretical foundations for the human-AI interaction model and the associated learning problems, thereby complementing the existing body of human-AI interaction literature.

Modeling human-AI interactions. On the modeling side, Grand-Clément and Pauphilet (2022) propose a decision-making model that incorporates the adherence level and illustrate that when the adherence level is low, the optimal advice can be different from the optimal decision. Also, see Sun et al. (2022) for an applied setting of interacting with different adherence levels, Shani et al. (2019) for the relationship between the model and the exploration-conscious RL setting, and Jacq et al. (2022) for the so-called lazy-MDP that features an action similar to defer in our setting.

Machine learning in human-AI interactions. Although there has been no literature associated with learning the decision-making model similar to Grand-Clément and Pauphilet (2022) and Jacq et al. (2022), other machine learning approaches have been put forward (Bastani et al., 2021; Meresht et al., 2020; Straitouri et al., 2021; Okati et al., 2021; Chen et al., 2022; Hong et al., 2023; Mao et al., 2023; Mohri et al., 2023) with different human-AI interaction settings.

Theoretical reinforcement learning. Our first proposed algorithm is an optimism-based reinforcement learning method that learns the optimal advice policy. This approach is inspired by the theoretical online reinforcement learning literature (Jaksch et al., 2010; Lattimore and Hutter, 2014; Dann and Brunskill, 2015; Azar et al., 2017; Dann et al., 2017; Zanette and Brunskill, 2019; Domingues et al., 2021). Instead of directly applying the upper confidence bound in the literature, we customize the learning algorithm so that it leverages special properties in our decision-making model, resulting in advantages in theoretical properties and empirical performance. Our second algorithm adopts a reward-free exploration (RFE) approach (Jin et al., 2020), which first explores the environment for a given number of episodes, and then becomes capable of outputting a near-optimal policy for any bounded reward function. We find this approach works well for learning algorithms that make pertinent advice. See Zhang et al. (2020); Kaufmann et al. (2021); Ménard et al. (2021); Miryoosefi and Jin (2022) for the follow-up works in RFE.

Our contribution is twofold:

First, we propose a decision-making model for advice-giving that incorporates human’s adherence level and an option for the AI to defer the advice and trust the human. This is a comprehensive modeling framework for effective human-AI interactions, where the optimal decision-making not only considers human adherence level but also makes advice/recommendations only at critical states.

Second, based on this decision-making model, we develop tailored learning algorithms that output near-optimal advice policies and know when to make pertinent advice. Compared to the state-of-the-art problem-agnostic RL algorithms, our algorithms feature a tighter sample complexity bound and stronger empirical performance.

2 Model Setup

Consider a human decision-maker that takes sequential actions under an episodic Markov decision process (MDP) described by the tuple $\mathcal{M}^{\mathtt{H}}=(\mathcal{S},\mathcal{A},H,p,r)$ . The superscript ${\mathtt{H}}$ emphasizes the human’s involvement in this MDP, $\mathcal{S}$ denotes the set of states, $\mathcal{A}$ denotes the set of actions, $H$ is the horizon of each episode (different from the superscript ${\mathtt{H}}$ ), $p$ denotes a deterministic time-dependent transition kernel so that $p_{h}(s^{\prime}|s,a)$ is the transition probability from state $s\in\mathcal{S}$ to state $s^{\prime}\in\mathcal{S}$ under the action $a\in\mathcal{A}$ at time $h$ , and $r$ denotes a time-dependent reward function where $r_{h}(s,a)\in[0,1]$ . Let $S=|\mathcal{S}|$ and $A=|\mathcal{A}|$ denote the cardinality of $\mathcal{S}$ and $\mathcal{A}$ , respectively.

Suppose the human follows a fixed (suboptimal) policy $\pi^{\mathtt{H}}$ such that the probability of taking action $a$ at state $s$ and time $h$ is $\pi_{h}^{\mathtt{H}}(a|s)$ . Alongside the human, an intelligent machine makes advice as decision support to improve the reward collected under $\pi_{h}^{\mathtt{H}}$ . In other words, the machine does not seek to change human policy but rather improve its final outcome given its suboptimality. Specifically, upon the arrival at each state, the machine can choose to make advice $a^{\mathtt{M}}\in\mathcal{A}$ to the human (the superscript $\mathtt{M}$ stands for the machine), or to trust the human and defer the action to the human, denoted by $a^{\mathtt{M}}=\text{defer}$ . If the machine chooses to defer, the human follows its default policy $\pi^{\mathtt{H}}.$ If the machine chooses to advise, the human takes the machine’s advice with probability $\theta(s,a^{\mathtt{M}})\in[0,1]$ , where $\theta(\cdot,\cdot)$ is the adherence level of the human, and is defined as follows.

Definition 1 The human’s adherence level $\theta:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$ is the probability that the human adopts (adheres to) a given piece of machine advice at a given state.

Given the setup, the human takes action $a^{\mathtt{H}}$ according to the following law:

\mathbb{P}_{h}(a^{\mathtt{H}}=a|s,a^{\mathtt{M}})=\begin{cases}\pi_{h}^{\mathtt{H}}(a|s),&\text{if $a^{\mathtt{M}}=\text{defer}$},\\ \theta(s,a^{\mathtt{M}}),&\text{if $a^{\mathtt{M}}\neq\text{defer}$ and $a=a^{\mathtt{M}}$ (adhere)},\\ (1-\theta(s,a^{\mathtt{M}}))\cdot\cfrac{\pi_{h}^{\mathtt{H}}(a|s)}{1-\pi_{h}^{\mathtt{H}}(a^{\mathtt{M}}|s)},&\text{if $a^{\mathtt{M}}\neq\text{defer}$ and $a\neq a^{\mathtt{M}}$ (not adhere)}.\end{cases}

(1)

To summarize, under the human-machine interaction, the underlying dynamic becomes

s_{h}\xrightarrow{\text{machine makes advice}}a^{\mathtt{M}}\xrightarrow{\text{$a^{\mathtt{H}}\sim\mathbb{P}_{h}(\cdot|s_{h},a^{\mathtt{M}})$}}a^{\mathtt{H}}\xrightarrow{\text{$s_{h+1}\sim p_{h}(\cdot|s_{h},a^{\mathtt{H}})$}}s_{h+1}.

At each time $h$, the machine first makes the advice $a^{\mathtt{M}}$ upon observing the state $s_{h}$, the human then incorporates the machine’s advice into a final action $a^{\mathtt{H}}$, and the system transitions to the next state $s_{h+1}$.
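The adherence law (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `DEFER`, `human_action_probs`, and `sample_human_action`, the list-based layout of the policy and adherence level, and the handling of the case $\pi_{h}^{\mathtt{H}}(a^{\mathtt{M}}|s)=1$ (where the third branch of (1) is undefined) are all assumptions made here.

```python
import random

DEFER = -1  # hypothetical encoding of the defer option


def human_action_probs(pi_row, theta_row, a_m):
    """Distribution over the human's action a^H at one state, following law (1).

    pi_row[a]    : pi_h^H(a|s), the human's default policy at this state
    theta_row[a] : theta(s, a), the adherence level for advice a
    a_m          : machine advice, an action index or DEFER
    """
    if a_m == DEFER:
        return list(pi_row)
    rest = 1.0 - pi_row[a_m]
    if rest <= 0.0:
        # Law (1) is undefined when the default policy already puts all mass
        # on a_m; this sketch assumes the human then plays a_m.
        return [1.0 if a == a_m else 0.0 for a in range(len(pi_row))]
    th = theta_row[a_m]
    return [th if a == a_m else (1.0 - th) * pi_row[a] / rest
            for a in range(len(pi_row))]


def sample_human_action(pi_row, theta_row, a_m, rng=random):
    """Draw a^H from the distribution defined by law (1)."""
    probs = human_action_probs(pi_row, theta_row, a_m)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

When the machine advises, the probability mass not assigned to the advised action is spread over the remaining actions in proportion to the human's default policy, so the returned list always sums to one.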

The machine’s MDP. From the machine’s perspective, the MDP is slightly different from the MDP faced by the human. It can be described by $\mathcal{M}^{\mathtt{M}}=\left(\mathcal{S},\bar{\mathcal{A}},H,p^{\mathtt{M}},r^{\mathtt{M}}\right)$. This MDP shares the same state space $\mathcal{S}$ and horizon $H$ as the human MDP $\mathcal{M}^{\mathtt{H}}$. The action space is augmented to include the defer option: $\bar{\mathcal{A}}=\mathcal{A}\cup\{\text{defer}\}$. From the machine’s perspective, the transition can be viewed as a direct consequence of making advice $a^{\mathtt{M}}\in\bar{\mathcal{A}}$ (i.e., $s_{h}\to a^{\mathtt{M}}\to s_{h+1}$), and the transition kernel becomes

p_{h}^{\mathtt{M}}(s^{\prime}|s,a^{\mathtt{M}})=\sum_{a^{\mathtt{H}}\in\mathcal{A}}p_{h}(s^{\prime}|s,a^{\mathtt{H}})\cdot\mathbb{P}_{h}(a^{\mathtt{H}}|s,a^{\mathtt{M}}),

(2)

where $p_{h}$ is the transition kernel of the MDP $\mathcal{M}^{\mathtt{H}}$, and the probability $\mathbb{P}_{h}(\cdot|s,a^{\mathtt{M}})$ is specified by the adherence dynamics (1). In parallel, we define the reward by marginalizing over the human’s action:

r_{h}^{\mathtt{M}}(s,a^{\mathtt{M}})=\sum_{a^{\mathtt{H}}\in\mathcal{A}}r_{h}(s,a^{\mathtt{H}})\cdot\mathbb{P}_{h}(a^{\mathtt{H}}|s,a^{\mathtt{M}}).
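As a rough illustration of (2) and the marginalized reward, the sketch below builds the machine's transition kernel and reward at a single step $h$ from the human-side quantities. The function names, the nested-list layouts, the convention that the extra advice index `A` encodes defer, and the treatment of the degenerate case $\pi_{h}^{\mathtt{H}}(a^{\mathtt{M}}|s)=1$ are assumptions of this example rather than part of the paper.

```python
def action_law(pi_row, theta_row, a_m, defer):
    """Human action distribution at one state under law (1)."""
    if a_m == defer:
        return list(pi_row)
    rest = 1.0 - pi_row[a_m]
    if rest <= 0.0:  # edge case not covered by (1); assume the human plays a_m
        return [1.0 if a == a_m else 0.0 for a in range(len(pi_row))]
    th = theta_row[a_m]
    return [th if a == a_m else (1.0 - th) * pi_row[a] / rest
            for a in range(len(pi_row))]


def machine_model(p_h, r_h, pi_h, theta):
    """Build p_h^M and r_h^M at a single step h by marginalizing a^H.

    Assumed layouts: p_h[s][a][s2], r_h[s][a], pi_h[s][a], theta[s][a].
    The returned arrays have one extra advice index, A, standing for defer.
    """
    S, A = len(p_h), len(p_h[0])
    p_m = [[[0.0] * S for _ in range(A + 1)] for _ in range(S)]
    r_m = [[0.0] * (A + 1) for _ in range(S)]
    for s in range(S):
        for a_m in range(A + 1):
            law = action_law(pi_h[s], theta[s], a_m, defer=A)
            for a_h, w in enumerate(law):
                r_m[s][a_m] += w * r_h[s][a_h]
                for s2 in range(S):
                    p_m[s][a_m][s2] += w * p_h[s][a_h][s2]
    return p_m, r_m
```

Because the kernel depends on $\pi_{h}^{\mathtt{H}}$, which may vary with $h$, the resulting machine MDP is time-dependent even when $\theta$ is not.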

Denote $\bm{\pi}=\{\pi_{h}\}_{h\in[H]}$ the machine’s policy where $\pi_{h}:\mathcal{S}\to\bar{\mathcal{A}}$ . The value function then becomes

V_{h_{0}}^{\pi}(s)=\mathbb{E}\left[\sum_{h=h_{0}}^{H}r^{\mathtt{M}}_{h}(s_{h},a_{h})\Big|s_{h_{0}}=s\right],\,\,\,\text{where $a_{h}=\pi_{h}(s_{h})$ and $s_{h+1}\sim p_{h}^{\mathtt{M}}(\cdot|s_{h},a_{h})$},

and let $V^{\pi}_{H+1}(s)=0$ for any $s\in\mathcal{S}$ . The optimal value $V^{*}$ and the optimal policy $\pi^{*}$ are defined by

V_{h}^{*}(s)=\max_{\pi\in\Pi}V_{h}^{\pi}(s),\ \ \pi^{*}=\operatorname*{arg\,max}_{\pi\in\Pi}V_{1}^{\pi}(s),

where $\Pi$ consists of all deterministic non-anticipating Markov policies. Similarly, we define the corresponding $Q$ functions to be

Q_{h}^{\pi}(s,a)=r^{\mathtt{M}}_{h}(s,a)+\sum_{s^{\prime}\in\mathcal{S}}p^{\mathtt{M}}_{h}(s^{\prime}|s,a)V_{h+1}^{\pi}(s^{\prime}),\text{ and }Q_{h}^{*}(s,a)=r^{\mathtt{M}}_{h}(s,a)+\sum_{s^{\prime}\in\mathcal{S}}p^{\mathtt{M}}_{h}(s^{\prime}|s,a)V_{h+1}^{*}(s^{\prime}).
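When $p^{\mathtt{M}}$ and $r^{\mathtt{M}}$ are known, the optimal $Q$ and value functions above can be computed by standard backward induction. The sketch below is one way to do so; the function name `backward_induction`, the 0-based step index, and the nested-list layouts are assumptions of this example.

```python
def backward_induction(p_m, r_m):
    """Finite-horizon dynamic programming on the machine's MDP.

    Assumed layouts (0-based step h = 0, ..., H-1):
      p_m[h][s][a][s2] : machine transition kernel p_h^M(s2 | s, a)
      r_m[h][s][a]     : machine reward r_h^M(s, a)
    Returns V[h][s] (with V[H][s] = 0) and a greedy policy[h][s].
    """
    H, S = len(r_m), len(r_m[0])
    V = [[0.0] * S for _ in range(H + 1)]
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            # Q_h^*(s, a) = r_h^M(s, a) + sum_{s'} p_h^M(s'|s, a) V_{h+1}^*(s')
            q = [r_m[h][s][a]
                 + sum(p_m[h][s][a][s2] * V[h + 1][s2] for s2 in range(S))
                 for a in range(len(r_m[h][s]))]
            best = max(range(len(q)), key=q.__getitem__)
            policy[h][s] = best
            V[h][s] = q[best]
    return V, policy
```

Ties between actions are broken in favor of the smallest index here, which is an arbitrary choice.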

Human-centric system. The theme of the formulation and all our following results is a human-centric decision system where the machine acknowledges the suboptimal behavior of the human and makes advice on critical states to improve the reward. So the learning and optimization of our paper take the perspective of the machine (solving $\mathcal{M}^{\mathtt{M}}$ ) and do not seek to change the underlying human policy $\pi^{\mathtt{H}}.$

3 The learning problem

Now we discuss learning problems associated with the above human-machine adherence model. We consider two learning environments for the problem:

$\mathcal{E}_{1}$ (Environment 1 – partially known): the environment’s state transition kernel $p$ , the reward $r$ , and the human’s behavior policy $\pi^{\mathtt{H}}$ , are known; the human’s adherence level $\theta$ is unknown.

$\mathcal{E}_{2}$ (Environment 2 – fully unknown): the environment’s state transition kernel $p$ , the reward $r$ , the human’s behavior policy $\pi^{\mathtt{H}}$ , and the human’s adherence level $\theta$ are unknown.

For $\mathcal{E}_{1}$ , the goal is simply to learn the optimal policy under the unknown adherence level $\theta$ . We develop a learning algorithm that outputs $\epsilon$ -optimal advice policy and features better sample complexity compared to the vanilla application of problem-agnostic RL methods on $\mathcal{M}^{\mathtt{M}}$ . For $\mathcal{E}_{2}$ , we know neither the environment nor the human’s policy. Thus the learning problem entails learning the dynamics of both the environment and the human policy. We develop a provably convergent learning algorithm that outputs the optimal policy, and in addition, the learned advice policy only gives advice when necessary (choosing to defer for non-critical steps).

Our investigations on these two learning formulations highlight three points. First, the inherent structure of the human-machine interaction allows more sample-efficient algorithms (than the vanilla application of the off-the-shelf RL algorithms) both theoretically and empirically. Second, the knowledge of the underlying environment ($\mathcal{E}_{1}$ compared against $\mathcal{E}_{2}$) significantly, and unsurprisingly, reduces the sample complexity of the learning algorithm. Third, we establish a close connection between the formulation of the human-machine interaction and the problems of reward-free exploration (Jin et al., 2020) and constrained MDPs (Altman, 2021).

3.1 Main Results

We first state the technical results and then present the detailed algorithms and analyses in the subsequent section.

Theorem 1 (Environment $\mathcal{E}_{1}$ , informal) For environment $\mathcal{E}_{1}$ , Algorithm 1 finds an $\epsilon$ -optimal advice policy with a PAC sample complexity $O(H^{2}S^{2}A/\epsilon^{2})$ with high probability.

Under environment $\mathcal{E}_{1}$, Theorem 1 gives a PAC sample complexity for the UCB-type (Upper-Confidence-Bound-type) Algorithm 1. We remark that applying the existing problem-agnostic algorithms can only achieve a suboptimal order of sample complexity on the problem: $O(H^{3}S^{2}A/\epsilon^{2})$ via the model-based algorithm (Dann and Brunskill, 2015) and $O(H^{4}SA/\epsilon^{2})$ via the model-free algorithm (Jin et al., 2018). (Footnote 1: Jin et al. (2018) obtain a regret bound instead of a PAC sample complexity bound; however, they convert the regret bound to a PAC sample complexity bound in Jin et al. (2018, Section 3.1).) Specifically, the bound in Dann and Brunskill (2015) gives an additional factor of $H$ compared to the bounds in the original setting, where a stationary transition density is assumed; this is because, although the adherence level $\theta$ is stationary, the transition becomes non-stationary when compounding $\theta$ with the transition of the human’s underlying MDP. Also, we note that such an improvement on $H$ is not due to a reduction in the number of unknown parameters because the adherence level $\theta$ has a dimensionality of $SA$. Indeed, the key to the improvement is that the intrinsic structure of the human-machine problem enables a more sample-efficient design of the UCB algorithm (see Section 4.1 for details). Moreover, we also provide another algorithm that finds an $\epsilon$-optimal advice policy with a sample complexity of $O(H^{3}SA/\epsilon^{2})$ for $\mathcal{E}_{2}$ (see the appendix for details).

For environment $\mathcal{E}_{2}$, we assume no prior knowledge at all, and this makes the machine’s problem no different than a generic RL problem. Thus we consider a slight twist of the machine’s MDP with the notion of pertinent advice. This twisted formulation enables richer analytical structures and draws interesting connections with several existing frameworks. Specifically, consider a new machine’s MDP $\mathcal{M}^{\mathtt{M}}_{\beta}=\left(\mathcal{S},\bar{\mathcal{A}},H,p^{\mathtt{M}},r^{\mathtt{M}}_{\beta}\right)$ which inherits everything from $\mathcal{M}^{\mathtt{M}}=\left(\mathcal{S},\bar{\mathcal{A}},H,p^{\mathtt{M}},r^{\mathtt{M}}\right)$ except for the reward

r^{\mathtt{M}}_{h,\beta}(s,a)=r_{h}^{\mathtt{M}}(s,a)-\beta\cdot\mathbb{I}\{a\neq\text{defer}\},

(3)

where $\mathbb{I}\{\cdot\}$ is the indicator function and $\beta>0$ is a constant. Under $\mathcal{M}^{\mathtt{M}}_{\beta}$, we denote by $V^{\pi}_{\beta}$ and $V^{*}_{\beta}$ the value function of $\pi$ and the optimal value function, respectively, and by $\pi_{\beta}^{*}\in\operatorname*{arg\,max}_{\pi}V_{\beta}^{\pi}$ the optimal policy. The new reward function imposes a penalty of $\beta$ for making advice and thus regularizes the number of times the machine gives advice throughout the horizon. In practice, providing advice to the human at every step can be annoying in applications such as gaming, driving, or sports. Hence, it is crucial to prioritize and selectively deliver advice based on how critical it is, which we informally term pertinent advice. For example, when the human is an expert and already achieves near-optimal performance, there is no need to give advice; likewise, when the human is under-performing but the adherence level is low, there is no need to give advice because it is unlikely to be taken.

Proposition 1.

For all $s\in\mathcal{S}$ and $h\in[H]$ such that $\pi^{*}_{h,\beta}(s)\neq\text{defer}$ , we have

Q_{h}^{*}(s,\pi^{*}_{h,\beta}(s))-V_{h}^{\pi^{\mathtt{H}}}(s)\geq\beta.

The proposition says that if the machine takes $\pi^{*}_{h,\beta}(s)$ and sticks with the optimal policy afterward, the reward will be at least $\beta$ more than if the machine chooses to defer all the way to the end. In this light, we can rank the criticalness of making advice at different states by solving $\mathcal{M}^{\mathtt{M}}_{\beta}$ with different values of $\beta$, which gives a better interpretation of this human-machine system.
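A one-step case ($H=1$) makes the role of $\beta$ concrete: there $Q_{1}^{*}(s,a)=r_{1}^{\mathtt{M}}(s,a)$ and $V_{1}^{\pi^{\mathtt{H}}}(s)=r_{1}^{\mathtt{M}}(s,\text{defer})$, so advice is worth giving only when its gain over deferring covers the penalty. The sketch below, with the hypothetical helper `pertinent_advice` and made-up reward values, applies the penalized reward (3) at a single state.

```python
def pertinent_advice(r_m_row, defer_index, beta):
    """One-step (H = 1) illustration of the penalized reward (3).

    r_m_row[a]  : machine reward r^M(s, a) at a fixed state, including defer
    defer_index : position of the defer option in r_m_row (assumed layout)
    Returns the maximizer of the penalized reward and its unpenalized gain
    over deferring.
    """
    penalized = [r if a == defer_index else r - beta
                 for a, r in enumerate(r_m_row)]
    best = max(range(len(penalized)), key=penalized.__getitem__)
    gain = r_m_row[best] - r_m_row[defer_index]
    return best, gain


# Illustrative numbers only: advising action 0 improves on deferring by 0.1.
rewards = [0.6, 0.4, 0.5]  # last entry is the defer option
small_beta_choice = pertinent_advice(rewards, defer_index=2, beta=0.05)
large_beta_choice = pertinent_advice(rewards, defer_index=2, beta=0.2)
```

With the small penalty the machine advises action 0, and with the large penalty it defers; sweeping $\beta$ in this way orders states by how much advice helps.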

Theorem 2 (Environment $\mathcal{E}_{2}$, informal) For $\mathcal{E}_{2}$, Algorithm 2 outputs a family of $\epsilon$-optimal policies $\{\hat{\pi}_{\beta}\}_{\beta>0}$ for $\{\mathcal{M}_{\beta}^{\mathtt{M}}\}_{\beta>0}$ with $O(H^{5}SA/\epsilon^{2})$ episodes such that the following inequality

\displaystyle V_{1,\beta}^{*}(s_{1})-V_{1,\beta}^{\hat{\pi}_{\beta}}(s_{1})\leq\epsilon

(4)

holds uniformly for all $\beta>0$ with high probability.

Theorem 2 gives the sample complexity of Algorithm 2, which learns a near-optimal policy for all the models $\{\mathcal{M}_{\beta}^{\mathtt{M}}\}_{\beta\geq 0}$ simultaneously. Such joint learning not only provides a family of policies from which the human can choose $\beta$ according to her/his performance but also gives us a handle to understand which are the critical states where the human’s policy can be significantly improved.

4 Algorithms and Analyses

In this section, we present the algorithms and analyses that achieve the results mentioned previously.

4.1 UCB-based algorithm for $\mathcal{E}_{1}$

Under $\mathcal{E}_{1}$, the machine works with a human with an unknown adherence level $\theta$. An important property of $\theta$, stated in Proposition 2 below, is that the team of human and machine achieves a higher optimal reward if the human has a higher adherence level. To emphasize the dependence on $\theta$, we write

V_{h}^{\pi}(s|\theta)=\mathbb{E}\left[\sum_{h^{\prime}=h}^{H}r^{\mathtt{M}}_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})\Big|s_{h}=s,\text{adherence parameter }\theta\right]\,\,\text{and}\,\,\,V_{h}^{*}(s|\theta)=\max_{\pi\in\Pi}V_{h}^{\pi}(s|\theta).

Proposition 2 (Monotonicity property).

Suppose $\theta_{1}\geq\theta_{2}$ holds entry-wise, then the following inequality holds for all $s\in\mathcal{S}$ and $h\in[H]$

V_{h}^{*}(s|\theta_{1})\geq V_{h}^{*}(s|\theta_{2}).

Proposition 2 implies that finding an upper bound for the optimal value function reduces to finding an upper bound for $\theta$. Algorithm 1 follows this implication and maintains an optimistic estimate $\bar{\theta}^{t}$ for the true parameter $\theta$. In each episode, it computes the policy $\hat{\pi}^{t}$ by treating $\bar{\theta}^{t}$ as the true parameter, and rolls out the episode according to $\hat{\pi}^{t}$. Then it updates the estimate with the new observations. The optimistic estimate $\bar{\theta}^{t}$ takes a standard UCB form with a careful choice of the confidence width; we defer the details to the appendix. The algorithm shares the same intuition as other UCB-based algorithms: with more and more observations, the upper confidence bound $\bar{\theta}^{t}$ shrinks to the true $\theta$, and so do the value functions.

Algorithm 1 UCB-ADherence (UCB-AD)

1: Input: Target probability level $\delta$
2: Initialize $t=1$, $\mathcal{D}_{t-1}=\emptyset$, and the optimistic estimate $\bar{\theta}^{t}=\bm{1}$.
3: for $t=1,2,\cdots$ do
4:   Solve the advice policy $\hat{\pi}^{t}=\operatorname*{arg\,max}_{\pi}V^{\pi}(\cdot|\bar{\theta}^{t})$ given the current optimistic estimate $\bar{\theta}^{t}$
5:   Sample a new episode $z_{t}=\left\{s_{1}^{t},a_{1}^{\mathtt{M},t},a_{1}^{\mathtt{H},t},r^{t}_{1},\cdots,s_{H}^{t},a_{H}^{\mathtt{M},t},a_{H}^{\mathtt{H},t},r_{H}^{t}\right\}$ following policy $\hat{\pi}^{t}$
6:   Update $\mathcal{D}_{t}\leftarrow\mathcal{D}_{t-1}\cup\{z_{t}\}$
7:   Update the optimistic estimate $\bar{\theta}^{t}\rightarrow\bar{\theta}^{t+1}$ based on $\mathcal{D}_{t}$ and $\delta$
8: end for
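Step 7 of Algorithm 1 can be sketched as follows. Under law (1), whenever the machine advises $a^{\mathtt{M}}$ at state $s$, the event $a^{\mathtt{H}}=a^{\mathtt{M}}$ occurs with probability exactly $\theta(s,a^{\mathtt{M}})$, so adherence can be estimated from the frequency of followed advice. The confidence width below is a generic Hoeffding-style bound chosen for illustration; it is an assumption of this example and not the width used in the paper's analysis, and the names `update_counts`, `optimistic_theta`, and `DEFER` are hypothetical.

```python
import math

DEFER = -1  # hypothetical encoding of the defer option


def update_counts(n_advised, n_followed, episode):
    """Accumulate adherence statistics from one episode.

    episode: list of (s, a_m, a_h) triples observed in one rollout.
    """
    for s, a_m, a_h in episode:
        if a_m == DEFER:
            continue  # deferring reveals nothing about adherence
        n_advised[s][a_m] += 1
        if a_h == a_m:
            n_followed[s][a_m] += 1


def optimistic_theta(n_advised, n_followed, delta):
    """Upper confidence bound on theta(s, a), clipped to [0, 1].

    Pairs that were never advised keep the optimistic initial value 1.
    The Hoeffding-style width is an illustrative assumption.
    """
    S, A = len(n_advised), len(n_advised[0])
    theta_bar = [[1.0] * A for _ in range(S)]
    for s in range(S):
        for a in range(A):
            n = n_advised[s][a]
            if n > 0:
                mean = n_followed[s][a] / n
                width = math.sqrt(math.log(2 * S * A / delta) / (2 * n))
                theta_bar[s][a] = min(1.0, mean + width)
    return theta_bar
```

The optimistic estimate would then be plugged into the known $p$, $r$, and $\pi^{\mathtt{H}}$ to build the machine's MDP and solve step 4 by dynamic programming.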

Theorem 1 establishes an $(\epsilon,\delta)$ -PAC result for Algorithm 1.

Theorem 1.

For any $\delta\in(0,1)$ , $\epsilon\in(0,1]$ , and $T\in\mathbb{N}^{+}$ , the number of policies among $\{\hat{\pi}^{t}\}_{t=1}^{T}$ from Algorithm 1 that are not $\epsilon$ -optimal, i.e., $V_{1}^{*}(s_{1})-V_{1}^{\hat{\pi}^{t}}(s_{1})>\epsilon$ , is bounded by $\tilde{O}\left(\frac{H^{2}S^{2}A}{\epsilon^{2}}\cdot\log\frac{1}{\delta}\right)$ with probability $1-\delta$ .

The proof of the theorem mimics the analysis of Dann and Brunskill (2015). One caveat is that the original analysis of Dann and Brunskill (2015) focuses on a stationary setting where transition probabilities depend solely on state and action and are independent of the time step. However, even when the adherence level $\theta$ remains the same over time, the machine’s MDP is non-stationary. A direct adaptation is to enlarge the state space to incorporate the horizon step $h$, yet this will result in a sample complexity of $O(H^{3}S^{2}A/\epsilon^{2})$, a worse dependency on $H$. The key is to reduce the upper bound analysis to the adherence level space and utilize Proposition 2 to convert that into a suboptimality gap with respect to the value function. This treatment gives the desirable bound in Theorem 1, which also outperforms the bound from a direct application of results from Azar et al. (2017) to non-stationary MDPs.

4.2 Reward-free exploration algorithm for $\mathcal{E}_{2}$

$\mathcal{E}_{2}$ has more unknown parameters than $\mathcal{E}_{1}$ and thus it naturally entails more intense exploration. Moreover, the learning objective becomes more complex: we aim not only to learn the near-optimal policy but also to discern the pertinent advice.

Algorithm 2 is based on the concept of reward-free exploration (RFE) (Jin et al., 2020). Specifically, RFE algorithms usually consist of an exploration phase and a planning phase. During the exploration phase, the algorithm collects trajectories from an MDP $\mathcal{M}$ without a pre-specified reward function. In the planning phase, it can compute near-optimal policies of $\mathcal{M}$, given any deterministic reward function that is bounded.

In our human-machine model, the machine observes $s_{h}\to a^{\mathtt{M}}\to a^{\mathtt{H}}\to s_{h+1}$, and the trajectory for episode $t$ is $z_{t}=\{s_{1}^{t},a_{1}^{\mathtt{M},t},a_{1}^{\mathtt{H},t},r_{1}^{t},s_{2}^{t},a_{2}^{\mathtt{M},t},a_{2}^{\mathtt{H},t},r_{2}^{t},\cdots,s_{H}^{t},a_{H}^{\mathtt{M},t},a_{H}^{\mathtt{H},t},r_{H}^{t}\}$, where $a_{h}^{\mathtt{M},t}=\pi^{t}(s_{h}^{t})$, $a_{h}^{\mathtt{H},t}\sim\mathbb{P}_{h}(\cdot|s_{h}^{t},a_{h}^{\mathtt{M},t})$, and $s_{h+1}^{t}\sim{p}_{h}(\cdot|s_{h}^{t},a_{h}^{\mathtt{H},t})$. We denote by $\hat{p}^{\mathtt{M},t}_{h}$ and $\hat{r}^{\mathtt{M},t}_{h}$ the empirical estimates of $p^{\mathtt{M}}_{h}$ and $r^{\mathtt{M}}_{h}$, and by $n_{h}^{t}(s,a)=\sum_{i=1}^{t}\mathbb{I}{\left\{\left(s_{h}^{i},a_{h}^{\mathtt{M},i}\right)=(s,a)\right\}}$ the number of times the machine gives advice $a$ at time $h$ and state $s$ in the first $t$ episodes. The key quantity in Algorithm 2 is

W_{h}^{t}(s,a)=\min\left(H,16H^{2}\cfrac{\phi(n_{h}^{t}(s,a),\delta)}{n_{h}^{t}(s,a)}+\left(1+\frac{1}{H}\right)\sum_{s^{\prime}}\hat{p}^{\mathtt{M},t}_{h}(s^{\prime}|s,a)\max_{a^{\prime}}W_{h+1}^{t}(s^{\prime},a^{\prime})\right),

(5)

where $W^{t}_{H+1}(s,a)=0$ for $(s,a)\in\mathcal{S}\times\mathcal{A}$ , and $\phi(n,\delta)$ grows at the order of $O(\log(n)+\log(1/\delta))$ and is specified in Theorem 2.
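The recursion (5) and the stopping test of Algorithm 2 can be sketched as below, using the $\phi(n,\delta)$ stated in Theorem 2. The function names, the 0-based step index, the nested-list layouts, and the convention $W=H$ for unvisited pairs (where (5) would divide by zero, matching the initialization in Algorithm 2) are assumptions of this example.

```python
import math


def phi(n, delta, H, S, A, eps):
    """phi(n, delta) as stated in Theorem 2."""
    return 6 * math.log(4 * H * S * A / (eps * delta)) + S * math.log(8 * math.e * (n + 1))


def compute_W(counts, p_hat, H, delta, eps):
    """Evaluate the uncertainty measure W of (5) by backward recursion.

    Assumed layouts (0-based step h = 0, ..., H-1):
      counts[h][s][a]    : number of visits n_h^t(s, a)
      p_hat[h][s][a][s2] : empirical machine transition kernel
    Returns W[h][s][a] with W[H] identically zero.
    """
    S, A = len(counts[0]), len(counts[0][0])
    W = [[[0.0] * A for _ in range(S)] for _ in range(H + 1)]
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                n = counts[h][s][a]
                if n == 0:
                    W[h][s][a] = float(H)  # unvisited pair: maximal uncertainty
                    continue
                bonus = 16 * H * H * phi(n, delta, H, S, A, eps) / n
                nxt = sum(p_hat[h][s][a][s2] * max(W[h + 1][s2]) for s2 in range(S))
                W[h][s][a] = min(float(H), bonus + (1 + 1 / H) * nxt)
    return W


def should_stop(W, s1, eps, H):
    """Stopping test of Stage 1: the greedy advice at s1 maximizes W at step 1."""
    w = max(W[0][s1])
    return w + 4 * math.e * math.sqrt(w) <= eps / H
```

Since the exploration policy picks the advice with the largest $W$, each episode is steered toward the state-advice pairs whose estimates are still the least certain.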

Now we formally introduce our Algorithm 2. The algorithm iteratively minimizes an upper bound defined by (5) which measures the uncertainty of a state-action pair, and the upper bound shrinks as the number of visits for the state-action pair increases. The algorithm stops when the upper bound is less than a pre-specified threshold. This algorithm is inspired by the RF-Express algorithm (Ménard et al., 2021), and there is a slight difference in the definition of $W_{h}^{t}(s,a)$, $\phi(n,\delta)$ and the stopping rule. In our application, the reward $r^{\mathtt{M}}$ is stochastic and we need to take care of the estimation error, while in Ménard et al. (2021), the algorithm does not need to deal with the reward at all.

Algorithm 2 RFE-$\beta$

1: Input: $\epsilon,\delta$, and user-specified $\{\beta_{i}\}_{i\in\mathcal{I}}$, where $\mathcal{I}$ could be any set where $\beta_{i}\in[0,H)$
2: Stage 1: Reward-free exploration
3: Initialize $t=1$ and $W^{t}_{h}(s,a)=H$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$
4: Compute ${\pi}^{t}$ so that ${\pi}_{h}^{t}(s)=\operatorname*{arg\,max}_{a\in\mathcal{A}}W_{h}^{t}(s,a)$ (see (5))
5: while $W_{1}^{t}(s_{1},\pi^{t}(s_{1}))+4e\sqrt{W_{1}^{t}(s_{1},\pi^{t}(s_{1}))}>\epsilon/H$ do
6:   Sample trajectory $z_{t}=\{s_{1}^{t},a_{1}^{\mathtt{M},t},a_{1}^{\mathtt{H},t},r_{1}^{t},\cdots,s_{H}^{t},a_{H}^{\mathtt{M},t},a_{H}^{\mathtt{H},t},r_{H}^{t}\}$ following $\pi^{t}$
7:   Update $t\leftarrow t+1$, $\mathcal{D}\leftarrow\mathcal{D}\cup\{z_{t}\}$, $\hat{p}^{\mathtt{M},t}_{h}(s^{\prime}|s,a)$, $\hat{r}^{\mathtt{M},t}_{h}(s,a)$, and $W_{h}^{t}(s,a)$
8: end while
9: Stage 2: Policy identification
10: Use planning algorithms to output optimal advice policies $\{\hat{\pi}_{\beta_{i}}^{\tau}\}_{i\in\mathcal{I}}$ for $\left\{\left(\mathcal{S},\bar{\mathcal{A}},H,\hat{p}^{\mathtt{M}},\hat{r}^{\mathtt{M}}_{\beta_{i}}\right)\right\}_{i\in\mathcal{I}}$

Theorem 2.

For $\delta\in(0,1)$ , $\epsilon\in(0,1]$ , and $\phi(n,\delta)=6\log(4HSA/(\epsilon\delta))+S\log(8e(n+1))$ , with probability $1-\delta$ , Stage 1 of Algorithm 2 stops in $\tau$ episodes and

\tau\leq C_{1}\cfrac{H^{5}SA}{\epsilon^{2}}\left(6\log(4HSA/(\epsilon\delta))+S\right),

where $C_{1}=\tilde{O}(\log(HSA))$ . Moreover, $\{\hat{\pi}^{\tau}_{\beta}\}_{\beta>0}$ have the following property

P\left(V_{1,\beta}^{*}(s_{1})-V_{1,\beta}^{\hat{\pi}^{\tau}_{\beta}}(s_{1})\leq\epsilon\,\,\text{uniformly for all $\beta\in[0,H)$}\right)>1-\delta.

Theorem 2 ensures that Algorithm 2 provides sample estimation for the underlying MDP such that all the policies $\{\hat{\pi}_{\beta}^{\tau}\}_{\beta\in[0,H)}$ for pertinent advice are near optimal. The proof is a direct application of the RF-Express analysis (Ménard et al., 2021), except that we have to take care of the estimation error in $\hat{r}^{\mathtt{M}}$. Although Algorithm 2 has the uniform convergence property for any number of bounded reward functions, it can also be used in the same way as Algorithm 1, to find the $\epsilon$-optimal policy for $\mathcal{M}^{\mathtt{M}}$ if provided with the non-penalized reward function $\hat{r}^{\mathtt{M}}$. In this context, we can modify RFE-$\beta$ so that with high probability, it solves $\mathcal{M}^{\mathtt{M}}$ with a sample complexity of $O(H^{3}SA/\epsilon^{2})$ (see the appendix for details).

CMDP for pertinent advice. The algorithm RFE- $\beta$ solves a class of problems $\{\mathcal{M}_{\beta}^{\mathtt{M}}\}_{\beta>0}$ simultaneously for all the $\beta$ ’s and it measures the pertinence of advice by $\beta$ . However, sometimes humans lack a quantitative view of how large a $\beta$ value should be considered as pertinent. Here, we introduce a different perspective on how the human should rank the importance of advice, framing it as “in $H$ steps, I want advice no more than $D$ times”, and formulate this as a CMDP problem

\max_{\pi}\,\,\,\mathbb{E}^{\pi}\left[\sum_{h=1}^{H}r^{\mathtt{M}}(s_{h},a_{h})\right]\quad\text{s.t.}\quad\mathbb{E}^{\pi}\left[\sum_{h=1}^{H}\mathbb{I}\{a_{h}\neq\text{defer}\}\right]\leq D,

(6)

where $D\in(0,H)$. By standard primal-dual theory, this formulation is closely related to the penalty $\beta$ in (3), because we can treat $\beta$ as a dual variable for the constraint with budget $D$. We refer the reader to the proof of Corollary 1 in the appendix for details.
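To make the primal-dual link concrete, the Lagrangian of (6) with a multiplier $\beta\geq 0$ can be rewritten using the penalized reward (3). This is a short worked identity for illustration rather than the paper's proof, and the symbol $L(\pi,\beta)$ is notation introduced here.

```latex
\begin{aligned}
L(\pi,\beta)
&=\mathbb{E}^{\pi}\left[\sum_{h=1}^{H}r^{\mathtt{M}}_{h}(s_{h},a_{h})\right]
 -\beta\left(\mathbb{E}^{\pi}\left[\sum_{h=1}^{H}\mathbb{I}\{a_{h}\neq\text{defer}\}\right]-D\right)\\
&=\mathbb{E}^{\pi}\left[\sum_{h=1}^{H}\Big(r^{\mathtt{M}}_{h}(s_{h},a_{h})-\beta\cdot\mathbb{I}\{a_{h}\neq\text{defer}\}\Big)\right]+\beta D\\
&=V_{1,\beta}^{\pi}(s_{1})+\beta D.
\end{aligned}
```

For a fixed $\beta$, the term $\beta D$ does not depend on $\pi$, so maximizing $L(\pi,\beta)$ over $\pi$ is the same as solving $\mathcal{M}^{\mathtt{M}}_{\beta}$.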

Now we present the CMDP method for pertinent advice. After stage 1 of RFE- $\beta$ , we solve

\max_{\pi}\,\,\,\hat{\mathbb{E}}^{\pi}\left[\sum_{h=1}^{H}\hat{r}^{\mathtt{M},\tau}(s_{h},a_{h})\right]\quad\text{s.t.}\quad\hat{\mathbb{E}}^{\pi}\left[\sum_{h=1}^{H}\mathbb{I}\{a_{h}\neq\text{defer}\}\right]\leq D,

(7)

where $\hat{\mathbb{E}}$ is the expectation with the underlying transition being $\hat{p}^{\mathtt{M},\tau}$ . The next corollary states that $\hat{\pi}^{\tau}_{D}$ , the solution for (7), is a near-optimal policy for the CMDP (6).

Corollary 1.

In the same setting as Theorem 2, for $\delta\in(0,1)$ and $\epsilon\in(0,1]$ , with probability $1-\delta$ , for all $D\in(0,H)$ , $\hat{\pi}^{\tau}_{D}$ is a near-optimal solution for the original CMDP (6) such that

$$V_{1}^{\hat{\pi}^{\tau}_{D}}(s_{1})\geq V_{1}^{\pi^{*}_{D}}(s_{1})-2\epsilon\quad\text{and}\quad\mathbb{E}^{\hat{\pi}^{\tau}_{D}}\left[\sum_{h=1}^{H}\mathbb{I}\{a_{h}\neq\text{defer}\}\right]\leq D+\epsilon,\tag{8}$$

where ${\pi^{*}_{D}}$ is the optimal solution for (6).
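The planning problem (7) is a linear program in occupancy measures, a standard way of solving finite-horizon CMDPs. The sketch below is a minimal illustration rather than the implementation used in the paper: the function names `solve_budgeted_cmdp` and `policy_from_occupancy` are hypothetical, the transition kernel is assumed to be stationary for brevity, and the estimated kernel and reward are passed in as plain arrays.

```python
import numpy as np
from scipy.optimize import linprog


def solve_budgeted_cmdp(P, r, mu0, H, D, defer):
    """Solve a finite-horizon CMDP with an advice budget as an LP.

    P[s, a, s2]: (estimated) transition kernel, assumed stationary here.
    r[s, a]:     (estimated) reward.
    mu0[s]:      initial state distribution.
    defer:       index of the defer action, which consumes no budget.
    Returns the occupancy measures q[h, s, a] and the optimal value.
    """
    S, A, _ = P.shape
    n = H * S * A

    def idx(h, s, a):
        return (h * S + s) * A + a

    # linprog minimizes, so negate the reward in order to maximize it.
    c = -np.tile(r.reshape(-1), H)

    # Flow conservation: mass leaving (h, s2) equals mass entering it.
    A_eq = np.zeros((H * S, n))
    b_eq = np.zeros(H * S)
    for h in range(H):
        for s2 in range(S):
            row = h * S + s2
            for a in range(A):
                A_eq[row, idx(h, s2, a)] += 1.0
            if h == 0:
                b_eq[row] = mu0[s2]
            else:
                for s in range(S):
                    for a in range(A):
                        A_eq[row, idx(h - 1, s, a)] -= P[s, a, s2]

    # Budget: the expected number of non-defer actions is at most D.
    uses_budget = np.ones((S, A))
    uses_budget[:, defer] = 0.0
    A_ub = np.tile(uses_budget.reshape(-1), H)[None, :]

    res = linprog(c, A_ub=A_ub, b_ub=[D], A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(H, S, A), -res.fun


def policy_from_occupancy(q):
    """Recover pi_h(a | s) = q[h, s, a] / sum_a q[h, s, a] (uniform if unvisited)."""
    mass = q.sum(axis=2, keepdims=True)
    uniform = np.full_like(q, 1.0 / q.shape[2])
    return np.where(mass > 1e-12, q / np.maximum(mass, 1e-12), uniform)
```

Because always deferring uses no budget, the program is feasible for every $D$ . The recovered policy is in general randomized, which is expected for CMDPs, and the same estimated kernel can be reused for any number of budgets by changing only `D`.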

Corollary 1 also implies that RFE- $\beta$ can compute near-optimal policies of CMDP (6) for all the constraints $D\in[0,H)$ , with a sample complexity of $O(H^{5}SA/\epsilon^{2})$ . Compared to other CMDP learning algorithms (for example, $O(H^{2}S^{3}A/\epsilon^{2})$ in Kalagarla et al. (2021)), this sample complexity has a lower order in $S$ . Moreover, the near-optimality guarantee holds simultaneously for all constraints $D\in[0,H)$ , whereas for other CMDP learning algorithms the result holds only for a pre-specified $D$ .
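As a rough illustration of when the lower order in $S$ outweighs the higher order in $H$ , the ratio of the two bounds can be computed directly. This comparison ignores constants and logarithmic factors, and the horizon $H=20$ below is an illustrative assumption rather than a value taken from the experiments.

$$\frac{H^{5}SA/\epsilon^{2}}{H^{2}S^{3}A/\epsilon^{2}}=\frac{H^{3}}{S^{2}},\qquad\text{e.g.}\quad S=140,\ H=20:\quad\frac{20^{3}}{140^{2}}=\frac{8000}{19600}\approx 0.41 .$$

Under these simplifications, the bound of Corollary 1 is the smaller of the two whenever $S>H^{3/2}$ .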

5 Numerical Experiment

We perform numerical experiments in two environments: Flappy Bird (Williams et al., 2023) and Car Driving (Meresht et al., 2020). Both Atari game-like environments are suitable and convenient for modeling human behavior while retaining the learning structure for the machine. We focus on the Flappy Bird environment here and defer the Car Driving environment to Appendix LABEL:apnd:B.

Flappy Bird Environment. We consider a game map of a 7-by-20 grid of cells. Each cell can be empty, contain a star, or act as a wall. The goal is to navigate the bird across the map from left to right and collect as many stars as possible. However, colliding with a wall or reaching the (upper and lower) boundaries leads to the end of the game. An example map is displayed in Figure 1, which splits into three phases: the first phase contains almost only stars and no walls, the second phase contains almost only walls and very few stars, and the third phase contains both stars and walls.

Figure 1: Flappy Bird environment: the player needs to navigate the bird to avoid walls and collect stars.

We define the state space as the current location of the bird on the grid, represented by coordinates $(x,y)\in\mathbb{Z}^{2}$ , with a total of $7\times 20=140$ states. The action space is $\mathcal{A}=\{\text{Up, Up-Up, Down}\}$ . Each action causes the bird to move forward by one cell. In addition, the “Up” action moves the bird one cell upwards, the “Up-Up” action moves it two cells upwards, and the “Down” action moves it one cell downwards. The reward is a function of the state only: it is $1$ when the current state (location) contains a star and $0$ otherwise. To model human behavior, we consider two sub-optimal human policies: Policy Greedy, which prioritizes collecting stars in the next column, and Policy Safe, which focuses on avoiding walls in the next column. If no preferred action is available, both policies maintain a horizontal zig-zag line by alternating between “Up” and “Down”. For the adherence level $\theta$ , we assume that for all $s\in\mathcal{S}$ and $h=1,...,H$ , the human adheres to the advice with probability $0.9$ , except for the aggressive advice “Up-Up” (which moves too fast vertically), for which the adherence level is $0.7$ . We compare the following algorithms:

• UCB-ADherence (UCB-AD): Algorithm 1 that finds the $\epsilon$ -optimal advice policy.

• RFE-ADvice (RFE-AD): Algorithm LABEL:alg:alg3, a variant of RFE- $\beta$ that finds the $\epsilon$ -optimal policy.

• RFE- $\beta$ : Algorithm 2 that outputs pertinent advice policy by exploring then planning.

• RFE-CMDP: A variant of RFE- $\beta$ that solves the CMDP (7) after exploring.
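The adherence mechanism of the Flappy Bird setup described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the experimental code: the names `executed_action` and `step` are hypothetical, “Up” is taken to increase the row index, and leaving the 7-row grid is treated as the boundary collision, which is one possible reading of the boundary rule.

```python
import random

MOVES = {"Up": 1, "Up-Up": 2, "Down": -1}           # vertical displacement per action
ADHERENCE = {"Up": 0.9, "Up-Up": 0.7, "Down": 0.9}  # adherence levels stated in the text
DEFER = "defer"


def executed_action(advice, human_action, rng):
    """Return the action the human actually takes given the machine's advice."""
    if advice == DEFER:
        # No advice is given, so the human follows their own policy.
        return human_action
    # With probability theta the human follows the advice, otherwise their own choice.
    return advice if rng.random() < ADHERENCE[advice] else human_action


def step(state, action, stars=frozenset(), walls=frozenset(), n_rows=7):
    """Move one column forward; return (next_state, reward, done)."""
    x, y = state
    nxt = (x + 1, y + MOVES[action])
    # Assumption: hitting a wall or leaving the grid vertically ends the game.
    done = nxt in walls or not (0 <= nxt[1] < n_rows)
    reward = 1 if (not done and nxt in stars) else 0
    return nxt, reward, done
```

From the machine's point of view, advising “Up-Up” therefore yields a mixture of the advised move and whatever the human policy would have chosen, which is why the best advice can differ from the best action.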

Figures 1(a) and 1(b) present the results for the two algorithms UCB-AD and RFE-AD for the environment $\mathcal{E}_{1}$ . They also include the state-of-the-art algorithm EULER (Zanette and Brunskill, 2019), which achieves a generic minimax optimal regret. From the regret plot, UCB-AD outperforms both RFE-AD and EULER. This advantage is attributed to UCB-AD’s effective utilization of the information and structure of the underlying MDP. These results also show that our tailored algorithms UCB-AD and RFE-AD are much more efficient than directly applying problem-agnostic RL algorithms in the adherence model. We further test UCB-AD with different adherence levels: under $\theta_{1}$ , $\theta(a,s)\equiv 0.8$ , and under $\theta_{2}$ , $\theta(a,s)\equiv 0.4$ . Figure 1(c) shows the relationship between the regret of UCB-AD and $\theta$ : for both human policies, UCB-AD achieves smaller regret with higher $\theta$ . Intuitively, a high adherence level implies a high probability of following the advice instead of taking $\pi^{\mathtt{H}}$ , which reduces the regret caused by the suboptimality of $\pi^{\mathtt{H}}$ .
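A one-step calculation illustrates this intuition. The notation $Q(s,a)$ for the value of executing action $a$ in state $s$ is introduced here only for illustration and is held fixed; the calculation is a heuristic sketch, not part of the regret analysis.

$$\theta\,Q(s,a)+(1-\theta)\sum_{a'}\pi^{\mathtt{H}}(a'\mid s)\,Q(s,a')=Q(s,a)-(1-\theta)\Big(Q(s,a)-\sum_{a'}\pi^{\mathtt{H}}(a'\mid s)\,Q(s,a')\Big).$$

The left-hand side is the value obtained when the machine advises $a$ and the human follows with probability $\theta$ . The shortfall relative to executing $a$ directly scales with $1-\theta$ : the factor is $0.2$ when $\theta\equiv 0.8$ and $0.6$ when $\theta\equiv 0.4$ , so the same suboptimality of $\pi^{\mathtt{H}}$ weighs three times as heavily in the latter case.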

Figure 3 summarizes results for three algorithms under the environment $\mathcal{E}_{2}$ , namely RFE- $\beta$ , RFE-CMDP, and UC-CFH, a provably convergent CMDP algorithm (Kalagarla et al., 2021), under Policy Safe. In Figure 2(a), we see that RFE- $\beta$ converges for different values of $\beta$ , which empirically corroborates the theoretical finding. In Figure 2(b), we compare RFE-CMDP and UC-CFH under a simpler environment with an advice budget of 1 ( $D=1$ ). We observe that RFE-CMDP shows a marginal performance advantage over UC-CFH in terms of the convergence rate. More importantly, Figure 2(c) shows that, using only the transition kernel estimated while learning for $D=1$ (Figure 2(b)), RFE-CMDP is able to obtain near-optimal policies for problem instances with different advice budgets ( $D=2,3,4$ and $5$ ). In contrast, UC-CFH fails to explore the whole transition kernel sufficiently and can only output a near-optimal policy for the original problem instance. Moreover, RFE-CMDP is more sample efficient with respect to the advice budget, because UC-CFH has to be run multiple times with different advice budget parameters to obtain a near-optimal policy for each of them.

Lastly, we show that RFE- $\beta$ is capable of generating pertinent advice for different human policies. Figure 4 displays representative trajectories of the two policies playing the game while receiving guidance from the machine, which follows $\hat{\pi}_{\beta}$ trained in the experiment of Figure 2. By setting $\beta=0.3$ , the machine outputs a policy that only gives advice when necessary. Since Policy Greedy behaves well in the first phase, the machine gives advice almost only in the second and third phases. Similarly, for Policy Safe, the machine gives advice almost only in the first and third phases and chooses to defer most of the time in the second phase.

References

Altman [2021] Eitan Altman. Constrained Markov decision processes. Routledge, 2021.
Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
Balachandran et al. [2021] Avinash Balachandran, Tiffany L Chen, Jonathan YM Goh, Stephen McGill, Guy Rosman, Simon Stent, and John J Leonard. Human-centric intelligent driving: Collaborating with the driver to improve safety. In Automated Road Transportation Symposium, pages 85–109. Springer, 2021.
Bastani et al. [2021] Hamsa Bastani, Osbert Bastani, and Wichinpong Park Sinchaisri. Improving human decision-making with machine learning. arXiv preprint arXiv:2108.08454, 2021.
Bobu et al. [2020] Andreea Bobu, Dexter RR Scobee, Jaime F Fisac, S Shankar Sastry, and Anca D Dragan. Less is more: Rethinking probabilistic models of human behavior. In Proceedings of the 2020 acm/ieee international conference on human-robot interaction, pages 429–437, 2020.
Carroll et al. [2019] Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. On the utility of learning about humans for human-ai coordination. Advances in neural information processing systems, 32, 2019.
Chen et al. [2018] Min Chen, Stefanos Nikolaidis, Harold Soh, David Hsu, and Siddhartha Srinivasa. Planning with trust for human-robot collaboration. In Proceedings of the 2018 ACM/IEEE international conference on human-robot interaction, pages 307–315, 2018.
Chen et al. [2022] Ningyuan Chen, Ming Hu, and Wenhao Li. Algorithmic decision-making safeguarded by human knowledge. arXiv preprint arXiv:2211.11028, 2022.
Dann and Brunskill [2015] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. Advances in Neural Information Processing Systems, 28, 2015.
Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
De et al. [2021] Abir De, Nastaran Okati, Ali Zarezade, and Manuel Gomez Rodriguez. Classification under human assistance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 5905–5913, 2021.
Domingues et al. [2021] Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pages 578–598. PMLR, 2021.
Grand-Clément and Pauphilet [2022] Julien Grand-Clément and Jean Pauphilet. The best decisions are not the best advice: Making adherence-aware recommendations. arXiv preprint arXiv:2209.01874, 2022.
Hong et al. [2023] Joey Hong, Anca Dragan, and Sergey Levine. Learning to influence human behavior with offline reinforcement learning. arXiv preprint arXiv:2303.02265, 2023.
Jacq et al. [2022] Alexis Jacq, Johan Ferret, Olivier Pietquin, and Matthieu Geist. Lazy-mdps: Towards interpretable reinforcement learning by learning when to act. arXiv preprint arXiv:2203.08542, 2022.
Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018.
Jin et al. [2020] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.
Kalagarla et al. [2021] Krishna C Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finite-horizon mdp with constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8030–8037, 2021.
Kaufmann et al. [2021] Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Algorithmic Learning Theory, pages 865–891. PMLR, 2021.
Khavas et al. [2020] Zahra Rezaei Khavas, S Reza Ahmadzadeh, and Paul Robinette. Modeling trust in human-robot interaction: A survey. In International conference on social robotics, pages 529–541. Springer, 2020.
Laidlaw and Dragan [2022] Cassidy Laidlaw and Anca Dragan. The boltzmann policy distribution: Accounting for systematic suboptimality in human models. arXiv preprint arXiv:2204.10759, 2022.
Lattimore and Hutter [2014] Tor Lattimore and Marcus Hutter. Near-optimal pac bounds for discounted mdps. Theoretical Computer Science, 558:125–143, 2014.
Mao et al. [2023] Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. Two-stage learning to defer with multiple experts. Advances in neural information processing systems, 36, 2023.
Ménard et al. [2021] Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pages 7599–7608. PMLR, 2021.
Meresht et al. [2020] Vahid Balazadeh Meresht, Abir De, Adish Singla, and Manuel Gomez-Rodriguez. Learning to switch between machines and humans. arXiv preprint arXiv:2002.04258, 2020.
Miryoosefi and Jin [2022] Sobhan Miryoosefi and Chi Jin. A simple reward-free approach to constrained reinforcement learning. In International Conference on Machine Learning, pages 15666–15698. PMLR, 2022.
Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
Mohri et al. [2023] Christopher Mohri, Daniel Andor, Eunsol Choi, and Michael Collins. Learning to reject with a fixed predictor: Application to decontextualization. arXiv preprint arXiv:2301.09044, 2023.
Mozannar and Sontag [2020] Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning, pages 7076–7087. PMLR, 2020.
Okati et al. [2021] Nastaran Okati, Abir De, and Manuel Rodriguez. Differentiable learning under triage. Advances in Neural Information Processing Systems, 34:9140–9151, 2021.
Shaheen [2021] Mohammed Yousef Shaheen. Applications of artificial intelligence (ai) in healthcare: A review. ScienceOpen Preprints, 2021.
Shani et al. [2019] Lior Shani, Yonathan Efroni, and Shie Mannor. Exploration conscious reinforcement learning revisited. In International conference on machine learning, pages 5680–5689. PMLR, 2019.
Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
Straitouri et al. [2021] Eleni Straitouri, Adish Singla, Vahid Balazadeh Meresht, and Manuel Gomez-Rodriguez. Reinforcement learning under algorithmic triage. arXiv preprint arXiv:2109.11328, 2021.
Sun et al. [2022] Jiankun Sun, Dennis J Zhang, Haoyuan Hu, and Jan A Van Mieghem. Predicting human discretion to adjust algorithmic prescription: A large-scale field experiment in warehouse operations. Management Science, 68(2):846–865, 2022.
Williams et al. [2023] Katherine J Williams, Madeleine S Yuh, and Neera Jain. A computational model of coupled human trust and self-confidence dynamics. ACM Transactions on Human-Robot Interaction, 2023.
Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.
Zhang et al. [2020] Xuezhou Zhang, Yuzhe Ma, and Adish Singla. Task-agnostic exploration in reinforcement learning. Advances in Neural Information Processing Systems, 33:11734–11743, 2020.