\setcopyright

ifaamas \acmConference[AAMAS ’24]Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024)May 6 – 10, 2024 Auckland, New ZealandN. Alechina, V. Dignum, M. Dastani, J.S. Sichman (eds.) \copyrightyear2024 \acmYear2024 \acmDOI \acmPrice \acmISBN \acmSubmissionID995 \authornoteBoth authors contributed equally to the paper. \affiliation \institutionGeorgia Institute of Technology \cityAtlanta, GA \countryUSA \authornotemark[1] \affiliation \institutionGeorgia Institute of Technology \cityAtlanta, GA \countryUSA \affiliation \institutionGeorgia Institute of Technology \cityAtlanta, GA \countryUSA \affiliation \institutionGeorgia Institute of Technology \cityAtlanta, GA \countryUSA \affiliation \institutionGeorgia Institute of Technology \cityAtlanta, GA \countryUSA

Mixed-Initiative Human-Robot Teaming under Suboptimality with Online Bayesian Adaptation

Manisha Natarajan [email protected] , Chunyue Xue [email protected] , Sanne van Waveren [email protected] , Karen Feigh [email protected] and Matthew Gombolay [email protected]

Abstract.

For effective human-agent teaming, robots and other artificial intelligence (AI) agents must infer their human partner’s abilities and behavioral response patterns and adapt accordingly. Most prior works make the unrealistic assumption that one or more teammates can act near-optimally. In real-world collaboration, humans and autonomous agents can be suboptimal, especially when each only has partial domain knowledge. In this work, we develop computational modeling and optimization techniques for enhancing the performance of suboptimal human-agent teams, where the human and the agent have asymmetric capabilities and act suboptimally due to incomplete environmental knowledge. We adopt an online Bayesian approach that enables a robot to infer people’s willingness to comply with its assistance in a sequential decision-making game. Our user studies show that user preferences and team performance indeed vary with robot intervention styles, and our approach for mixed-initiative collaborations enhances objective team performance ( $p<.001$ ) and subjective measures, such as user’s trust ( $p<.001$ ) and perceived likeability of the robot ( $p<.001$ ).

Key words and phrases:

Human-Agent Teams, Mixed-Initiative, Suboptimality, POMDP

1. Introduction

Human-agent teaming has the potential to leverage the unique capabilities of humans and artificial intelligence (AI) agents to enhance team performance. In real-world situations, both humans and agents can be suboptimal, especially when dealing with uncertainty human_suboptimality ; kahn2017uncertainty . Imagine a human collaborating with a robot in an urban search-and-rescue (USAR) mission with reduced visibility due to fog or smoke. The human can take over control when the robot is more prone to make errors (e.g., in unstructured environments). Likewise, when human vision is limited, the robot can intervene or take control. To optimize this collaboration, robots need to develop a Theory of Mind rabinowitz2018machine , i.e., the ability to infer the human teammates’ mental states and anticipate their actions to determine when such intervention is beneficial. In this work, we look at mixed-initiative interactions, where the robot actively models human behavior to decide when to intervene to maximize team performance.

In human-robot teams, mixed-initiative interaction refers to a collaborative strategy in which teammates opportunistically seize and relinquish initiative from and to each other during a mission, where initiative can range from low-level motion control to high-level goal specification jiang2015mixed . We study such interactions in a teaming task in which the human and the robot act suboptimally because they have partial knowledge of the environment. Specifically, the human teleoperates the robot, similar to USAR missions isaacs2022teleoperation , and must collaborate with the robot (seize or relinquish control) to reach a goal location. The human and the robot have asymmetric capabilities and non-identical, partial knowledge of the environment. During the task, when the human selects an action, the robot can either comply with and execute the chosen action, interrupt by not executing the chosen action, or take control and execute an alternative action. If the robot interrupts or takes control, the human can decide whether to accept or oppose the robot’s decision.

Our goal is to learn a domain-agnostic robot policy that can effectively adapt to diverse users to maximize team performance without prior human interaction data. Achieving such ad-hoc or zero-shot coordination with novel human partners has been a longstanding challenge in AI klien2004ten ; paleja2021utility . Recent works explore zero-shot human-AI collaboration by learning AI agent policies either from human-human demonstrations carroll2019utility ; hong2023learning or via self-play without any human data strouse2021collaborating ; zhao2023maximum . However, these approaches look at domains where humans and agents have symmetric capabilities. In contrast, our work delves into human-agent teaming with asymmetric capabilities, where mixed-initiative teaming is essential. Prior works in mixed-initiative teaming have adopted strategies for switching control between humans and robots by estimating user performance chiou2021mixed or operator engagement few2006improved ; our work differs by explicitly modeling user compliance to determine when robots should intervene.

Our contributions are two-fold. First, we propose a novel, online, Bayesian approach called Bayes-POMCP, for zero-shot human-robot collaboration in mixed-initiative settings. We model the human-robot team as a Partially Observable Markov Decision Process (POMDP), where the robot maintains a belief over users’ compliance tendencies. Initially, the robot has high uncertainty about user preferences and willingness to comply. Through Bayesian Learning, the robot’s estimation is iteratively refined, reducing its uncertainty upon subsequent interactions with the user. By conditioning the robot’s policy on the uncertainty of the human model, our approach is more robust to adapt to a diverse pool of participants than having a single, unified model for all subjects. To address the computational challenges in solving POMDPs and ensure that our approach is feasible to run online with novel users, Bayes-POMCP employs a Monte-Carlo search (scalable to large state spaces) while anticipating appropriate user behavior with approximate belief updates.

Second, we design a new user study interface for examining mixed-initiative human-robot teaming. We open-source our implementation¹¹1https://github.com/CORE-Robotics-Lab/Bayes-POMCP. We conduct two human-subjects experiments ( $n=30$ and $n=28$ ) with the interface to show that (1) user preferences and team performance can vary when the robot employs different intervention styles, and (2) our proposed approach performs favorably on both objective (team performance) and subjective (users’ trust, robot likeability) metrics with novel users.

2. Related Work

2.1. Modeling Human Behavior

For seamless human-robot collaboration, robots must anticipate human behavior and act accordingly. Prior works have shown that robots modeling human behavior can improve team performance across many applications, such as autonomous driving sadigh2016planning , assistive robotics jeon2020shared , and collaborative games pellegrinelli2016human . Both model-free and model-based approaches have been employed for modeling human behavior. Model-free approaches (e.g., imitation learning carroll2019utility ) require substantial data and generally employ neural networks to learn human behavior without making strong assumptions.

In contrast, model-based approaches require far fewer samples but make certain assumptions about human behavior (e.g., humans exhibit bounded rationality simon1972theories ). Prior works in HRI have used POMDPs and their variants (e.g., BAMDP, MOMDP, I-POMDP) to account for latent factors such as trust, intent, or capability influencing human decision-making chen2018planning ; lee2020getting ; ramachandran2019personalized ; wang2016impact . Most prior POMDP-based works either assume known model parameters or employ maximum likelihood estimation (MLE) to estimate them lee2019bayesian ; ramachandran2019personalized ; chen2018planning ; wang2016impact . However, these approaches can fail to generalize to a diverse population and are prone to overfit bishop2006pattern . Hence, we instead adopt a Bayesian approach to jointly learn the POMDP parameters and the robot policy during human interactions, similar to prior work lee2020getting ; nanavati2021modeling ; ng2012bayes . However, a major drawback of such Bayesian approaches is the need to update beliefs over an augmented state space comprising both the human latent states and the POMDP parameters, which can quickly become computationally intractable. We overcome this challenge by making key approximations about the belief space and using conjugate priors for belief representation, which allows for computing quick belief updates. Our work differs from prior Bayesian approaches in HRI lee2020getting ; nanavati2021modeling by maintaining belief about dynamic latent parameters, such as trust or compliance natarajan2020effects ; chen2018planning which varies during interactions and across individuals.

2.2. Human-Agent Teaming

Recently, there has been a surge in interest in designing AI agents that are capable of collaborating with humans, especially in ad-hoc settings barrett2017making ; hu2020other ; hong2023learning . Ad-hoc or zero-shot human-agent teaming requires agents to be adept at collaborating with diverse users in novel contexts without prior interactions. Achieving ad-hoc, zero-shot coordination with novel human partners has been a longstanding challenge in AI and will be crucial for the ubiquitous deployment of robots and AI agents klien2004ten ; paleja2021utility . Recent works aim to achieve ad-hoc human-AI teaming either from human-human demonstrations using Behavior Cloning carroll2019utility and offline RL hong2023learning or via self-play without any human data strouse2021collaborating . Others have also explored population-based training to learn robot policies that are generalizable across diverse users zhao2023maximum ; lou2023pecan . However, these approaches focus on domains where both humans and agents have symmetric capabilities and work concurrently. In contrast, our work examines mixed-initiative teaming, where humans and agents possess asymmetric capabilities and must share control to achieve the task objective. Thus, we cannot learn robot policies from human-human demonstrations. Further, we seek to optimize team performance when all teammates are suboptimal, which is seldom explored in human-robot teams lee2020getting ; natarajan2023human .

3. Preliminaries

We model the human-robot team as a Bayes-Adaptive POMDP (BA-POMDP) ross2007bayes , allowing the robot to dynamically learn and adjust its policy based on estimations of human model parameters, while accounting for estimation uncertainty.

A POMDP is defined as a tuple $\mathcal{M}=(S,A,O,\mathcal{T,E,}\>d_{0},R,\gamma)$ where $S$ is a set of states $s\in S$ , $A$ is a set of actions $a\in A$ , $O$ is a set of observations $o\in O$ , $\mathcal{T}(s_{t+1}|s_{t},a_{t})$ is the state transition probabilities, $\mathcal{E}(o_{t}|s_{t})$ is the emission function, $d_{0}$ is the initial state distribution, $R(s_{t},a_{t})$ is the reward for taking action $a$ in state $s$ at time step $t$ , and $\gamma\in(0,1]$ is the discount factor. The agent’s goal is to learn a policy, $\pi:\mathcal{B}\rightarrow A$ , that maximizes the expected cumulative discounted reward (return), where $b\in\mathcal{B}$ is a belief state inferred by a history of previous observations and actions, $h$ . Belief updates can be achieved via the Bayes rule (infeasible for large state spaces) or with an unweighted particle filter (approximate update).

Most prior works in POMDPs assume a fully-specified environment (i.e., the model parameters $\mathcal{T}$ , $\mathcal{E}$ are known) lauri2022partially , which is unrealistic in HRI as we neither have access to the person’s true latent states (e.g., trust, preferences) nor how they change during the interaction. We adopt the BA-POMDP framework — a Bayesian Reinforcement Learning approach for solving POMDPs ross2007bayes . The BA-POMDP employs Dirichlet vectors, $\chi$ , to represent uncertainty over the model parameters $(\mathcal{T},\mathcal{E})$ . As the POMDP states are hidden, $\chi$ cannot be computed and is included as part of the state.

3.1. Solving POMDPs

Partially Observable Monte-Carlo Planning (POMCP) is an online solver that extends the Monte-Carlo Tree Search (MCTS) to POMDPs silver2010monte . POMCP uses the UCT (Upper Confidence Bound (UCB) for Trees) to select actions and an unweighted particle filter for belief updates. In POMCP, the UCT algorithm is extended to partially observable domains using a search tree of histories $h$ instead of states, where each node in the tree stores statistics – visitation count $N(h)$ , value or mean return $V(h)$ , and belief $b(h)$ , approximated by particles. The algorithm performs online planning through multiple simulations, incrementally building the search tree. The return of each simulation is used to update the statistics for all visited nodes. POMCP terminates based on preset criteria (e.g., maximum number of simulations).

We model the human-robot team as a BA-POMDP. Solving BA-POMDPs is difficult as they are infinite-state POMDPs. The current state-of-the-art, online algorithm for solving BA-POMDPs is BA-POMCP(extending POMCP for BA-POMDPs) katt2017learning . In this work, we propose Bayes-POMCP, which extends the BA-POMCP algorithm for suboptimal human-robot teams.

4. Method

In this section, we first define the human-robot team model (BA-POMDP) for mixed-initiative interactions and then describe how we utilize a variant of the BA-POMCP algorithm to learn an adaptive robot policy for our current setting.

4.1. Human-Robot Team Model

4.1.1. State Space

In our human-robot team model, the state space combines the world state and user latent states $s=(x,z)$ . The world state, $x\in\mathcal{X}$ , refers to the task that the human-robot team is working on, and the latent states, $z\in\mathcal{Z}$ , can refer to the user’s trust or tendency to comply with the robot and their task execution preferences. The robot does not have access to the user’s latent states and must infer these states by observing the user’s actions. We focus on suboptimal human-robot teaming, assuming that the suboptimality arises from task-related errors or incomplete knowledge, i.e., both agents may make errors or cannot observe the full world state. Thus, the world state as observed by the robot may not always align with what the human observes $(x^{R}_{t}\neq x^{H}_{t},\forall t)$ .

4.1.2. Action Space

As we are planning from the robot’s perspective, the action space comprises the actions $a^{R}\in A^{R}$ that the robot can take in the environment. In our mixed-initiative collaborative scenario, we assume that the robot first observes the human action and then selects its action²²2Our approach is not restricted to this mixed-initiative setting and can be extended to cases where either the robot takes the first action or works concurrently with users.. The robot can choose to either execute, intervene, or override the user’s actions. Additionally, the robot may choose to explain whenever it intervenes or overrides the user.

4.1.3. Observation Space

The robot observes the human actions $a^{H}\in A^{H}$ . We assume that the human’s action depends on their knowledge of the current world state $x_{t}$ and the history of interactions, $h_{t-1}$ with the robot, i.e., the human follows the policy, $\pi^{H}(a^{H}_{t}|x_{t},h_{t-1},a^{R}_{t-1})$ , where $h_{t-1}=\{a^{H}_{0},a^{R}_{0},a^{H}_{1},a^{R}_{1},\cdots,a^{H}_{t-1}\}$ . Similar to prior work chen2018planning , we assume that the user’s latent state, $z_{t}$ , is a compact representation of the interaction history ( $z_{t}\approx\{h_{t-1}\cup a^{R}_{t-1}\}$ ). Thus, $\pi^{H}(a^{H}_{t}|x_{t},h_{t-1},a^{R}_{t-1})\approx\pi^{H}(a^{H}_{t}|x_{t},z_{% t})$ .

4.1.4. Transition and Emission Models

We define the state transition model, $\mathcal{T}$ , from the robot’s perspective, i.e., $\mathcal{T}=p(s_{t+1}|s_{t},a^{R}_{t})$ . However, for mixed-initiative settings, the transitions in the state, $s_{t}=(x_{t},z_{t})$ , occur as a result of both human and robot actions at each time step. Thus we rewrite the transition model as:

	$\displaystyle p(s_{t+1}\|s_{t},a^{R}_{t})=\sum\limits_{a^{H}}p(s_{t+1}\|s_{t},a^% {R}_{t},a^{H}_{t})\times\pi^{H}(a^{H}_{t}\|x_{t},z_{t})$		(1)
	$\displaystyle=\sum\limits_{a^{H}}p(x_{t+1}\|x_{t},a^{R}_{t},a^{H}_{t})\times p(% z_{t+1}\|z_{t},a^{R}_{t},a^{H}_{t})\times\pi^{H}(a^{H}_{t}\|x_{t},z_{t})$		(2)

Equation 2 comes from our assumption that given the human and robot actions, the world state dynamics are independent of the human latent state dynamics. In our collaborative scenario, we only estimate the latent state dynamics as part of the BA-POMDP, as we assume that the world state dynamics are deterministic and known.

The emission model $\mathcal{E}$ for the human-robot team refers to the human policy $\pi^{H}(a^{H}_{t}|x_{t},z_{t})$ which is also unknown to the robot and must be estimated to solve the BA-POMDP.

4.1.5. Reward Function

The reward function $\mathcal{R}(x,a^{H},a^{R})$ is positive for team actions that contribute to achieving the task goal and negative for team actions that hinder task success. We assume that both the user and the robot are aware of the reward function.

4.2. Adaptive Robot Intervention Policy in Mixed-Initiative Teams (Bayes-POMCP)

To maximize human-robot team performance in real-time for mixed-initiative settings, we implement a modified version of the BA-POMCP katt2017learning . Here, we highlight the key changes we make to the BA-POMCP algorithm. Figure 1 provides an overview of our approach, and the complete procedure is described in Algorithm 1.

Refer to caption — Figure 1. Graphical overview of the Bayes-POMCP approach for mixed-initiative Human-Robot Teaming: At each timestep $t$ , the human first takes an action based on interaction history, $h$ , and their current observation of the world state, $x$ . The robot then determines when and how to intervene by anticipating human behavior using a Monte-Carlo tree search. The reward is calculated based on both human and robot actions.

Input: Initial world state

x_{0}

; Interaction history

h_{-1}=[]

; initial belief

b_{0}

; Search Tree

T=\{\}

a^{R}_{-1}\leftarrow\text{No-Assist}

// By default before episode starts

z_{0}\leftarrow\{h_{-1}\cup a^{R}_{-1}\}

// Initial human latent state

a^{H}_{0}\leftarrow\textsc{RealHuman}(\cdot|x_{0},z_{0})

// First human action

h_{0}\leftarrow[a^{H}_{0}]

T(h_{0})\leftarrow\textsc{ConstructNode}(T,h_{0})

// Construct root node for $t=0,1,2,\dots\text{max\_steps}\>$ do

\color[rgb]{0.01171875,0.5859375,0.87109375}a^{R}_{t}\leftarrow\textsc{Search}% (h_{t})

// Root node

\rhd

Search (Supp. Alg. 2) if $(h_{t}a^{R}_{t})\notin T$ then

3 ConstructNode(

T,h_{t}a^{R}_{t}

)

x_{t+1}\leftarrow p(\cdot|x_{t},a^{R}_{t},a^{H}_{t})\>\;

// Update World State

z_{t+1}\leftarrow\{h_{t}\cup a^{R}_{t}\}\>\;

// Update true latent state

\not{\Rightarrow}

Robot

a^{H}_{t+1}\leftarrow\textsc{RealHuman}(\cdot|x_{t+1},z_{t+1})

// Next user action

h_{t+1}\leftarrow h_{t}\cup\{a^{R}_{t},a^{H}_{t+1}\}

if $h_{t+1}\notin T$ then

T(h_{t+1})\leftarrow

ConstructNode(

T,h_{t+1}

)

6 // Belief update: next root node

\color[rgb]{0.01171875,0.5859375,0.87109375}b(h_{t+1})\leftarrow\textsc{Belief% -Update}(b(h_{t}),a^{R}_{t},a^{H}_{t+1})

\textsc{Prune-Tree}(T,h_{t+1})

h_{t+1}

is the root node

Algorithm 1 Bayes-POMCP: Maximizing Performance in Mixed-Initiative Human-Robot Teams

4.2.1. Belief Approximation

Similar to POMCP, BA-POMCP is an online algorithm that constructs a lookahead search tree through environment simulations and maintains a belief over latent parameters using an unweighted particle filter to determine the best action at each time step. However, in BA-POMCP, we need to maintain a belief over both the latent states $|S|$ and the model parameters $\mathcal{T},\mathcal{E}$ ( $|S|^{2}\times|A|+|S|\times|A|\times|O|$ parameters). Computing the posterior update over such a large space can be expensive. Further, it is difficult for the posterior distribution to converge to the true parameters, especially when we only have access to limited interactions.

Hence, we leverage the independence assumption between the world state and the latent state transition (Equation 2) to approximate the belief in each node in the search tree. Approximating the belief makes it feasible to compute the belief updates in real-time for fluent HRI. Since we only need the human action to determine the next world state, we choose to maintain the belief only over the user policy space instead of all latent states and model parameters. We compute the posterior update for the belief $b(h_{t+1})$ from the prior belief, $b(h_{t})$ , based on the interaction history, $h_{t}$ , at each node.

4.2.2. Simulating Human Policy

In BA-POMCP, we need to simulate human actions during the rollout for constructing the search tree. As the robot lacks direct knowledge of the true human policy, we first estimate the human policy parameters and use the same for simulation. Given that the human actions can be categorized as being compliant/non-compliant with the robot, we model the true human policy as a Bernoulli distribution with an unknown parameter, $\mu$ , that signifies the likelihood of user compliance for a given interaction history, $h$ . To estimate $\mu$ , we adopt a Bayesian approach. We assume a prior distribution or belief over the space of human policies $b=p(\mu)$ . We approximate $b$ using a set of particles, which is updated upon subsequent interactions with the user. In general, performing the belief update can be computationally expensive, but such updates can be computed efficiently for the conjugate family of distributions bishop2006pattern . Thus, we model each particle as a beta distribution – the conjugate prior for Bernoulli distributions.

To simulate the human action during rollout, we sample a particle from $b$ at the current node. We use this sampled particle to anticipate the next human actions and update it based on the interaction outcomes during the simulation. Additionally, we assume that humans are rational and employ an $\epsilon$ -greedy heuristic to select the user’s actions in case of non-compliance.

Alternatively, we can use a random policy to mimic human behavior, but this would require more simulations to cover a range of possible human responses and determine the optimal robot action—resulting in increased computation time. Therefore, we opt for estimating user compliance and then simulating the human policy, which we find empirically more efficient.

To evaluate the contributions of our proposed modifications to the BA-POMCP algorithm katt2017learning , we perform an ablation analysis without modeling humans, i.e., we only use random rollout policies for anticipating human behavior and perform no belief updates. We refer to this approach as POMCP in our analysis (Section 6.2).

5. Evaluation

5.1. Domain

We modified the Frozen Lake environment from OpenAI Gym brockman2016openai for evaluating mixed-initiative human-robot teaming. In this domain, the users must collaborate with the robot to navigate an $8\times 8$ frozen lake grid from start to goal in the fewest steps possible while avoiding holes and slippery regions. We modified the original domain to only have certain grids as slippery instead of a constant slip probability throughout the map. Step** on a slippery region will cause the agent to fall into a hole. Both the human and the robot can only observe whether the adjacent four grids are slippery. Each time the agent falls into a hole, the team incurs a penalty $\alpha$ and must begin again from the start location.

To enforce suboptimality, we introduce errors in the human and robot observations of slippery grids. These errors include – False Positives (observing a safe grid as slippery), and False Negatives (observing a slippery region as safe). Moreover, certain parts of the map are covered by fog which reduces human visibility. The human and robot accuracies for identifying slippery regions are shown in Figure 2. During the game, the human teleoperates the robot across the lake, but the robot may intervene or take control if it finds that the user chose a longer or unsafe path (e.g., slippery regions or holes) to the goal. Additionally, the user is equipped with a high-quality ( $100\%$ accurate) sensor for detecting slippery regions in adjacent grids, but each use of the sensor incurs a point cost $\rho$ . The overall team performance or game reward for each round is calculated as a combination of step penalty (shorter path $\rightarrow$ higher reward), penalty for falling into holes $\alpha$ , detection penalty $\rho$ , and a bonus $\kappa$ for reaching the goal as shown in Equation 3.

\text{Reward}=\textrm{Max steps}-\textrm{\# steps taken}-\alpha\times\textrm{% \# falls into hole}\\ -\rho\times\textrm{\# detections}+\kappa\times\mathbbm{1}[\textrm{goal reached% }==\textrm{True}]

(3)

We empirically set max steps $=80$ , $\alpha=10$ , $\rho=2$ and $\kappa=30$ for our human-subject experiments. Our environment is inspired by USAR missions, where humans teleoperate robots, but both humans and robots can have complementary skills and varying domain knowledge. Further details of the user study domain can be found in the Supplementary.

5.2. Human-Subjects Experiments

We conducted two user studies to 1) examine how users respond to different robot intervention styles with and without explanations but with a static policy (Data Collection Study) and 2) evaluate human-robot team performance with the proposed adaptive Bayes-POMCP approach (Evaluation Study).

5.2.1. Data Collection Study

We employ a $1\times 5$ within-subjects experiment design to examine user responses to various robot interventions in mixed-initiative teaming (Figure 2b). These interventions include – no assist: the robot does not intervene (baseline), interrupt: the robot stops the user from executing an action, take-control: the robot overrides the user’s action with its own action, interrupt+explain: the robot interrupts and explains, take-control+explain: the robot takes over control and explains. To ensure consistency across intervention strategies, the robot employs the same handcrafted heuristic that determines when to intervene. The heuristic intervention policy is a short-horizon planner that only intervenes if the user’s current action is anticipated to lead to a slippery region (based on the robot’s knowledge), a hole, or a longer path ( $\geq$ k steps) and will cede control to the user if the user persistently chooses the action the robot is intervening. The heuristic employs a static intervention style. The algorithm for the heuristic policy can be found in the Supplementary.

5.2.2. Evaluation Study

We employ a $1\times 3$ within-subjects experiment to compare human-robot team performance under different robot policies. The examined policies are our proposed approach – Bayes-POMCP, the same heuristic policy as was used in the data collection study, and an adversarial policy (Adv-Bayes-POMCP) optimized for negative game reward (Equation 3). We include the adversarial policy as an adaptive baseline to show that (1) our proposed approach can successfully aid or inhibit the user from reaching the goal, and (2) it is essential for the adaptive policy to reason when to intervene effectively in addition to switching the intervention styles. To perform a balanced comparison, we ensure that the run times of all robot policies are identical. Further, we limit the use of the detection sensor ( $\leq 5$ ) in the evaluation study to force participants to rely on the robot’s assistance.

5.2.3. Metrics

For both studies, we assess user preferences and performance using subjective and objective measures, respectively. Our subjective measures include trust muir1989operators , likeability bartneck2009measurement , and willingness to comply raemdonck2013feedback (adapted from human-human interactions for HRI) measured via 5-point Likert scales. All questionnaires were administered to the users after each round in both studies. Further, participants reported their demographics, highest completed education, prior experience with robots, and completed a 50-item personality scale Goldberg1992TheStructure as part of the pre-study questionnaire. At the end of the study, users ranked their preferences for the different robot agents. All questionnaires used for the study can be found in the Supplementary. Objective performance was assessed based on the total game reward (Equation 3) in each round.

5.2.4. Participants and Procedure

We recruited $30$ participants (Age: $25.56\pm 3.38$ , Female: $33\%$ ) for the data collection study and $28$ new participants (Age: $25.27\pm 3.28$ , Female: $50\%$ ) for the evaluation study, all from a local university campus after IRB approval. The procedure was the same for both studies. Written consent from the participants was obtained before the experiment. At the start of the study, participants received written game instructions along with a demonstration from the experimenter. Participants first completed three practice rounds to familiarize themselves with the game and then engaged in ten and six rounds (two rounds per condition) for the data collection study and evaluation study, respectively. The subjects were instructed to complete each round by taking the shortest path to the goal. The experiment order was randomized, and participants completed pre- and post-study questionnaires.

5.3. Hypotheses

To investigate how different users interact with various robot intervention styles, we first conducted a data collection study. We hypothesize the following:

H1A: The human-robot team performance can vary with different robot intervention styles. Although the robot follows the same heuristic across different conditions in Study 1, we hypothesize that the team performance will vary across intervention styles as users may respond differently. For instance, users may be better informed to choose the next action appropriately when the robot intervenes and provides an explanation.

H1B: Users will have different preferences for working with various robot intervention styles. Humans have varying personality traits and task preferences, which may impact how they perceive and collaborate with teammates. For instance, extroverted individuals are more likely to assume leadership and less likely to renounce control in human-human teams kickul2000emergent . Likewise, we hypothesize that users will have different preferences when working with robots that interrupt or take control with or without offering explanations.

For the evaluation study, we compare the human-robot team performance with the adaptive Bayes-POMCP policy against heuristics used in the first study and an adversarial baseline – Adv-Bayes-POMCP. We hypothesize:

H2A: The human-robot team performance will be the highest when the robot employs the adaptive Bayes-POMCP policy. We hypothesize that the Bayes-POMCP policy which actively anticipates human actions by considering their latent states, is better suited for determining when and how to intervene various users and will thereby maximize team performance. In contrast, the baselines that do not model the human latent states (the heuristic policy) or optimize for negative reward (the adversarial Bayes-POMCP), will not be able to assist the users appropriately.

H2B: Users will most prefer to work with our proposed approach, the adaptive Bayes-POMCP policy. We hypothesize that the Bayes-POMCP policy can effectively intervene users by modeling their latent states and will, therefore, not only improve team performance but also have a positive impact on the users’ subjective preference for collaborating with the robot.

6. Results and Discussion

In this section, we first discuss the results of the data collection study. Next, we show results from our simulation experiments used to validate Bayes-POMCP before testing on human participants. We then discuss the results from the evaluation study, comparing our Bayes-POMCP approach and two baselines.

All our statistical analyses were performed using libraries in R, and the significance level $\alpha$ was set at $0.05$ . For our analysis, we use parametric tests unless the model fails to meet the required assumptions (normality, homoscedasticity, et cetera). Details of all models and tests used for each hypothesis, along with the effect sizes and statistical power, are listed in the Supplementary.

6.1. Data Collection Study

For the data collection study, we recruited $30$ participants and excluded one participant as an outlier since they failed to complete all ten rounds in the study (failure rate across all subjects: $1.733\pm 2.365$ ). Thus we have data from $29$ subjects for our analysis.

H1A: Team Performance and Robot Intervention Styles. We compare the team performance using the game reward (Equation 3) across the five robot intervention styles employed in the first study. The robot either used the same heuristic policy to determine when to intervene or did not intervene at all (no-assist: baseline condition). Each user participated in two rounds for each intervention style, totaling ten rounds, all played on different maps with varying levels of difficulty. To mitigate ordering effects and map-related biases, the experiment conditions and map assignments were randomized.

We use Kruskal-Wallis (a non-parametric test), with the dependent variable as the reward and the independent variable as the robot intervention style. We obtain statistical significance for the intervention style ( $H(4)=58.16,p<.001$ ). Subsequently, we use Dunn’s test for performing post-hoc pairwise comparisons, and the significance values are shown in Figure 2(a).

Takeaway: We find that the human-robot team performance is impacted by the intervention styles used by the robot, rejecting the null hypothesis (Figure 2(a)). Firstly, it is worth noting that the team performance significantly improves when the robot intervenes compared to the baseline (no assistance), validating the need for robot interventions in our study domain. Secondly, the team performance is the highest when the robot takes over control. Lastly, adding explanations did not significantly improve performance for the same intervention style (e.g., between interrupt and interrupt+explain).

H1B: Users’ Working Preference and Robot Intervention Styles. At the end of the first user study, participants were asked to rank their preferences for working with various robot intervention styles on a scale from 1 (lowest) to 5 (highest). As user rankings are considered as ordinal data, we use Kruskal-Wallis, a non-parametric test to analyze H1B. We find that robot intervention style indeed influences user preferences $(H(4)=61.67,p<.001)$ . The majority of the users preferred the interrupt+explain agent the most and the take-control agent the least, as shown in Figure 2(b).

Takeaway: Our results suggest that, despite explanations not improving performance, most users favor working with robots that offer explanations for their interventions. Interestingly, even though the take-control agent achieved the highest team performance, it was the least preferred choice for the majority of users. These findings highlight the need for an adaptive robot policy that adjusts the intervention style to maximize performance and user satisfaction. If the robot only takes over control, it can improve team performance in the short term but can cause user dissatisfaction and can potentially lead to users abandoning the system in the long run.

6.2. Simulation Experiments

We first validate whether our proposed method can adapt to diverse users by testing with various simulated human models before testing the Bayes-POMCP policy on users. For the simulation experiments, we compare Bayes-POMCP against two baselines – (1) the standard POMCP algorithm silver2010monte with no human model (POMCP) and (2) the heuristic agents (both take-control and interrupt) on five of the $8\times 8$ maps used in the data collection study. To simulate a diverse set of users, we modulate two latent parameters that determine their behavior – the users’ capability or expertise ( $\psi$ ) to solve the task and the users’ tendency to comply with the agent ( $\theta$ ). We test with both static users (whose latent parameters – $\psi,\theta$ are fixed) and dynamic users, whose $\theta$ varies continuously based on the interaction history, but $\psi$ remains fixed (i.e., we assume no learning effect as the domain is simple). We provide further details of the simulated human population in the Supplementary. Our results (Figure 4) indicate that Bayes-POMCP outperforms both the heuristics employed in the first study and the ablation baseline without human modeling (POMCP) for static and dynamic user models.

6.3. Evaluation Study

Upon verifying our policy with different simulated users, we collected data from $28$ new participants (who did not take part in the first user study) for the evaluation study. We excluded data from three subjects. Two of the three subjects encountered graphic rendering issues in the study interface. The other subject was excluded as an outlier as they failed to complete all six rounds (failure rate across all subjects: $3.48\pm 0.77$ ). Hence we only include data from the remaining $25$ subjects for our analysis.

H2A: Team Performance and Robot Policy. In this user study, we evaluated the team performance for different robot policies – the heuristic agents (interrupt+explain and take-control+explain) from the first study, our proposed approach Bayes-POMCP optimized for the true reward and the negative reward. Each user participated in two rounds per policy, totaling six rounds, all played on different maps (a subset from the first study). We used the Kruskal-Wallis test with the reward as the dependent variable and the robot policy as the independent variable. We obtained statistical significance for the robot policy $(H(2)=109.89,p<.001)$ and performed post-hoc analysis with Dunn’s test, whose results are shown in Figure 4(a).

Takeaway: We find that Bayes-POMCP policy significantly outperforms our baselines for team performance, rejecting the null hypothesis (Figure 4(a)). We also find that the adversarial Bayes-POMCP is effective in preventing the user from reaching the goal, as reflected by the negative reward.

H2B: Users’ Working Preference and Robot Policy. Users ranked their preferences for working with the different robot agents at the end of the second study. We perform the Kruskal-Wallis test, which shows that the robot policy significantly influences user preferences $(H(2)=45.41,p<.001)$ . We find that $68\%$ of the users preferred the Bayes-POMCP agent the most, $20\%$ preferred heuristic agents the most, and $88\%$ preferred the Adv-Bayes-POMCP agent the least. $12\%\>(=3/25)$ did not answer the preference survey.

We also analyzed subjective metrics with Likert scales for trust, willingness to comply, and robot likeability. We conducted three rANOVA with the subjective metrics as the dependent variables and independent variables as robot policy, number of rounds completed, demographics (age, gender, prior robotics experience), and pre-study questionnaire responses of the user. We find that robot policy was statistically significant across all subjective metrics from the three ANOVAs, with our proposed approach having the highest mean values. We then performed post-hoc analysis using Tukey HSD. For further details of the analysis, see Supplementary.

Takeaway: We find that Bayes-POMCP policy significantly outperforms our baselines across all subjective metrics, and the majority ( $68\%$ ) of the users rated that they would most prefer to work with the Bayes-POMCP agent in the evaluation study.

6.4. Summary of Results

We summarize our key findings from two human-subject experiments and analysis with simulated human models:

(1)

Robot interventions are necessary for improving team performance when both humans and robots are suboptimal due to having non-identical, partial domain knowledge.
(2)

The robot intervention style (interrupt or take-control) can impact both team performance $(p<0.001)$ and user preferences $(p<0.001)$ . Users prefer robots that offer explanations for interventions, albeit without performance improvement.
(3)

Our proposed approach, Bayes-POMCP, can effectively intervene users (both simulated and real human subjects) to maximize human-robot team performance.
(4)

Bayes-POMCP not only enhances team performance but also positively influences users’ preference to collaborate with the robot and their self-reported measures, such as trust and likeability towards the robot.

7. Limitations and Future Work

While our approach successfully improves human-robot team performance in a computationally efficient manner, Bayes-POMCP relies on an environment simulator to estimate the value of human-robot actions in the Monte Carlo search tree, which may not be available for real-world human-robot collaboration tasks. Therefore, in future work, we aim to explore alternative methods, such as deep learning silver2018general for value estimation. Moreover, our findings indicate that while robot explanations positively influenced users’ subjective perceptions, they did not improve team performance. We hypothesize that this may be because the task was relatively simple, and users did not need explanations from the robot to enhance their decision-making. In future work, we seek to assess the utility of explanations in improving team performance for more complex teaming tasks. Finally, our findings are limited to short-horizon interactions, as users only played two rounds of the game with each agent. To address this limitation, our proposed approach can be extended to longitudinal HRI tasks, where robots must anticipate and adapt to changes in user behavior or preferences over time.

8. Conclusion

In this work, we propose an online Bayesian approach, Bayes-POMCP, to optimize performance in mixed-initiative human-robot teams when both agents are suboptimal. Our focus is on learning a robot policy for effective user intervention. We find that robot interventions can improve performance while recognizing diverse user preferences. Next, we evaluate Bayes-POMCP, and show its effectiveness in improving team performance across different simulated human models and real users. We address the computational challenges in solving POMDPs by using a Monte-Carlo search with belief approximation and using conjugate priors to perform belief updates efficiently. In future work, we plan to continue evaluating our algorithm for long-horizon interactions and extend it beyond grid-world domains to real-world human-robot collaboration tasks.

{acks}

This work was sponsored by a gift from Konica Minolta and the National Institutes of Health (NIH) under Grant 1RO1HL157457.

References

[1] Samuel Barrett, Avi Rosenfeld, Sarit Kraus, and Peter Stone. Making friends on the fly: Cooperating with new teammates. Artificial Intelligence, 242:132–171, 2017.
[2] Christoph Bartneck, Dana Kulić, Elizabeth Croft, and Susana Zoghbi. Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. International journal of social robotics, 1:71–81, 2009.
[3] Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006.
[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
[5] Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. On the utility of learning about humans for human-AI coordination. Advances in neural information processing systems, 32, 2019.
[6] Min Chen, Stefanos Nikolaidis, Harold Soh, David Hsu, and Siddhartha Srinivasa. Planning with trust for human-robot collaboration. In Proceedings of the 2018 ACM/IEEE international conference on human-robot interaction, pages 307–315, 2018.
[7] Manolis Chiou, Nick Hawes, and Rustam Stolkin. Mixed-initiative variable autonomy for remotely operated mobile robots. ACM Transactions on Human-Robot Interaction (THRI), 10(4):1–34, 2021.
[8] Douglas A Few, David J Bruemmer, and Miles C Walton. Improved human-robot teaming through facilitated initiative. In ROMAN 2006-The 15th IEEE International Symposium on Robot and Human Interactive Communication, pages 171–176. IEEE, 2006.
[9] Lewis R. Goldberg. The Development of Markers for the Big-Five Factor Structure. Psychological Assessment, 4(1):26–42, 1992.
[10] Joey Hong, Anca Dragan, and Sergey Levine. Learning to influence human behavior with offline reinforcement learning. arXiv preprint arXiv:2303.02265, 2023.
[11] Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. “other-play” for zero-shot coordination. In International Conference on Machine Learning, pages 4399–4410. PMLR, 2020.
[12] J Isaacs, Kevin Knoedler, Andrew Herdering, Mishell Beylik, and Hugo Quintero. Teleoperation for urban search and rescue applications. Field Robotics, 2:1177–1190, 2022.
[13] Hong Jun Jeon, Dylan P Losey, and Dorsa Sadigh. Shared autonomy with learned latent actions. arXiv preprint arXiv:2005.03210, 2020.
[14] Shu Jiang and Ronald C Arkin. Mixed-initiative human-robot interaction: definition, taxonomy, and survey. In 2015 IEEE International conference on systems, man, and cybernetics, pages 954–961. IEEE, 2015.
[15] Gregory Kahn, Adam Villaflor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017.
[16] Sammie Katt, Frans A Oliehoek, and Christopher Amato. Learning in pomdps with monte carlo tree search. In International Conference on Machine Learning, pages 1819–1827. PMLR, 2017.
[17] Jill Kickul and George Neuman. Emergent leadership behaviors: The function of personality and cognitive ability in determining teamwork performance and ksas. Journal of Business and Psychology, 15:27–51, 2000.
[18] Glen Klien, David D Woods, Jeffrey M Bradshaw, Robert R Hoffman, and Paul J Feltovich. Ten challenges for making automation a” team player” in joint human-agent activity. IEEE Intelligent Systems, 19(6):91–95, 2004.
[19] Minae Kwon, Erdem Biyik, Aditi Talati, Karan Bhasin, Dylan P. Losey, and Dorsa Sadigh. When humans aren’t optimal: Robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’20, page 43–52, New York, NY, USA, 2020. Association for Computing Machinery.
[20] Mikko Lauri, David Hsu, and Joni Pajarinen. Partially observable markov decision processes in robotics: A survey. IEEE Transactions on Robotics, 39(1):21–40, 2022.
[21] ** Joo Lee, Fei Sha, and Cynthia Breazeal. A bayesian theory of mind approach to nonverbal communication. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 487–496. IEEE, 2019.
[22] Joshua Lee, Jeffrey Fong, Bing Cai Kok, and Harold Soh. Getting to know one another: Calibrating intent, capabilities and trust for human-robot collaboration. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6296–6303. IEEE, 2020.
[23] Xingzhou Lou, Jiaxian Guo, Junge Zhang, Jun Wang, Kaiqi Huang, and Yali Du. Pecan: Leveraging policy ensemble for context-aware zero-shot human-ai coordination. arXiv preprint arXiv:2301.06387, 2023.
[24] B.M. Muir and B.M. Muir. Operators’ Trust in and Use of Automatic Controllers in a Supervisory Process Control Task. Canadian theses on microfiche. University of Toronto, 1989.
[25] Amal Nanavati, Christoforos I Mavrogiannis, Kevin Weatherwax, Leila Takayama, Maya Cakmak, and Siddhartha S Srinivasa. Modeling human helpfulness with individual and contextual factors for robot planning. In Robotics: Science and Systems, 2021.
[26] Manisha Natarajan and Matthew Gombolay. Effects of anthropomorphism and accountability on trust in human robot interaction. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pages 33–42, 2020.
[27] Manisha Natarajan, Esmaeil Seraj, Batuhan Altundas, Rohan Paleja, Sean Ye, Letian Chen, Reed Jensen, Kimberlee Chestnut Chang, and Matthew Gombolay. Human-robot teaming: Grand challenges. Current Robotics Reports, pages 1–20, 2023.
[28] Brenda Ng, Kofi Boakye, Carol Meyers, and Andrew Wang. Bayes-adaptive interactive pomdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 1408–1414, 2012.
[29] Rohan Paleja, Muyleng Ghuy, Nadun Ranawaka Arachchige, Reed Jensen, and Matthew Gombolay. The utility of explainable ai in ad hoc human-machine teaming. Advances in neural information processing systems, 34:610–623, 2021.
[30] Stefania Pellegrinelli, Henny Admoni, Shervin Javdani, and Siddhartha Srinivasa. Human-robot shared workspace collaboration via hindsight optimization. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 831–838. IEEE, 2016.
[31] Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, SM Ali Eslami, and Matthew Botvinick. Machine theory of mind. In International conference on machine learning, pages 4218–4227. PMLR, 2018.
[32] Isabel Raemdonck and Jan-Willem Strijbos. Feedback perceptions and attribution by secretarial employees: Effects of feedback-content and sender characteristics. European Journal of Training and Development, 37(1):24–48, 2013.
[33] Aditi Ramachandran, Sarah Strohkorb Sebo, and Brian Scassellati. Personalized robot tutoring using the assistive tutor pomdp (at-pomdp). In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8050–8057, 2019.
[34] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive pomdps. Advances in neural information processing systems, 20, 2007.
[35] Dorsa Sadigh, Shankar Sastry, Sanjit A Seshia, and Anca D Dragan. Planning for autonomous cars that leverage effects on human actions. In Robotics: Science and systems, volume 2, pages 1–9. Ann Arbor, MI, USA, 2016.
[36] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
[37] David Silver and Joel Veness. Monte-carlo planning in large pomdps. Advances in neural information processing systems, 23, 2010.
[38] Herbert A Simon. Theories of bounded rationality, decision and organization. CBR a. R. Radner. Amsterdam, NorthHolland, 1972.
[39] DJ Strouse, Kevin McKee, Matt Botvinick, Edward Hughes, and Richard Everett. Collaborating with humans without human data. Advances in Neural Information Processing Systems, 34:14502–14515, 2021.
[40] Ning Wang, David V Pynadath, and Susan G Hill. The impact of pomdp-generated explanations on trust and performance in human-robot teams. In Proceedings of the 2016 international conference on autonomous agents & multiagent systems, pages 997–1005, 2016.
[41] Rui Zhao, **ming Song, Yufeng Yuan, Haifeng Hu, Yang Gao, Yi Wu, Zhongqian Sun, and Wei Yang. Maximum entropy population-based training for zero-shot human-ai coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 6145–6153, 2023.