License: CC BY 4.0
arXiv:2403.16178v1 [cs.RO] 24 Mar 2024
Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), May 6 – 10, 2024, Auckland, New Zealand. N. Alechina, V. Dignum, M. Dastani, J.S. Sichman (eds.). Submission ID 995.

Affiliation (all authors): Georgia Institute of Technology, Atlanta, GA, USA. Author note (first and second authors): Both authors contributed equally to the paper.

Mixed-Initiative Human-Robot Teaming under Suboptimality with Online Bayesian Adaptation

Manisha Natarajan, Chunyue Xue, Sanne van Waveren, Karen Feigh, and Matthew Gombolay
Abstract.

For effective human-agent teaming, robots and other artificial intelligence (AI) agents must infer their human partner’s abilities and behavioral response patterns and adapt accordingly. Most prior works make the unrealistic assumption that one or more teammates can act near-optimally. In real-world collaboration, humans and autonomous agents can be suboptimal, especially when each only has partial domain knowledge. In this work, we develop computational modeling and optimization techniques for enhancing the performance of suboptimal human-agent teams, where the human and the agent have asymmetric capabilities and act suboptimally due to incomplete environmental knowledge. We adopt an online Bayesian approach that enables a robot to infer people’s willingness to comply with its assistance in a sequential decision-making game. Our user studies show that user preferences and team performance indeed vary with robot intervention styles, and our approach for mixed-initiative collaborations enhances objective team performance ($p<.001$) and subjective measures, such as user’s trust ($p<.001$) and perceived likeability of the robot ($p<.001$).

Key words and phrases:
Human-Agent Teams, Mixed-Initiative, Suboptimality, POMDP

1. Introduction

Human-agent teaming has the potential to leverage the unique capabilities of humans and artificial intelligence (AI) agents to enhance team performance. In real-world situations, both humans and agents can be suboptimal, especially when dealing with uncertainty human_suboptimality ; kahn2017uncertainty . Imagine a human collaborating with a robot in an urban search-and-rescue (USAR) mission with reduced visibility due to fog or smoke. The human can take over control when the robot is more prone to make errors (e.g., in unstructured environments). Likewise, when human vision is limited, the robot can intervene or take control. To optimize this collaboration, robots need to develop a Theory of Mind rabinowitz2018machine , i.e., the ability to infer the human teammates’ mental states and anticipate their actions to determine when such intervention is beneficial. In this work, we look at mixed-initiative interactions, where the robot actively models human behavior to decide when to intervene to maximize team performance.

In human-robot teams, mixed-initiative interaction refers to a collaborative strategy in which teammates opportunistically seize and relinquish initiative from and to each other during a mission, where initiative can range from low-level motion control to high-level goal specification jiang2015mixed . We study such interactions in a teaming task in which the human and the robot act suboptimally because they have partial knowledge of the environment. Specifically, the human teleoperates the robot, similar to USAR missions isaacs2022teleoperation , and must collaborate with the robot (seize or relinquish control) to reach a goal location. The human and the robot have asymmetric capabilities and non-identical, partial knowledge of the environment. During the task, when the human selects an action, the robot can either comply with and execute the chosen action, interrupt by not executing the chosen action, or take control and execute an alternative action. If the robot interrupts or takes control, the human can decide whether to accept or oppose the robot’s decision.

Our goal is to learn a domain-agnostic robot policy that can effectively adapt to diverse users to maximize team performance without prior human interaction data. Achieving such ad-hoc or zero-shot coordination with novel human partners has been a longstanding challenge in AI klien2004ten ; paleja2021utility . Recent works explore zero-shot human-AI collaboration by learning AI agent policies either from human-human demonstrations carroll2019utility ; hong2023learning or via self-play without any human data strouse2021collaborating ; zhao2023maximum . However, these approaches look at domains where humans and agents have symmetric capabilities. In contrast, our work delves into human-agent teaming with asymmetric capabilities, where mixed-initiative teaming is essential. Prior works in mixed-initiative teaming have adopted strategies for switching control between humans and robots by estimating user performance chiou2021mixed or operator engagement few2006improved ; our work differs by explicitly modeling user compliance to determine when robots should intervene.

Our contributions are two-fold. First, we propose a novel, online, Bayesian approach called Bayes-POMCP, for zero-shot human-robot collaboration in mixed-initiative settings. We model the human-robot team as a Partially Observable Markov Decision Process (POMDP), where the robot maintains a belief over users’ compliance tendencies. Initially, the robot has high uncertainty about user preferences and willingness to comply. Through Bayesian Learning, the robot’s estimation is iteratively refined, reducing its uncertainty upon subsequent interactions with the user. By conditioning the robot’s policy on the uncertainty of the human model, our approach is more robust to adapt to a diverse pool of participants than having a single, unified model for all subjects. To address the computational challenges in solving POMDPs and ensure that our approach is feasible to run online with novel users, Bayes-POMCP employs a Monte-Carlo search (scalable to large state spaces) while anticipating appropriate user behavior with approximate belief updates.

Second, we design a new user study interface for examining mixed-initiative human-robot teaming. We open-source our implementation (https://github.com/CORE-Robotics-Lab/Bayes-POMCP). We conduct two human-subjects experiments ($n=30$ and $n=28$) with the interface to show that (1) user preferences and team performance can vary when the robot employs different intervention styles, and (2) our proposed approach performs favorably on both objective (team performance) and subjective (users’ trust, robot likeability) metrics with novel users.

2. Related Work

2.1. Modeling Human Behavior

For seamless human-robot collaboration, robots must anticipate human behavior and act accordingly. Prior works have shown that robots modeling human behavior can improve team performance across many applications, such as autonomous driving sadigh2016planning , assistive robotics jeon2020shared , and collaborative games pellegrinelli2016human . Both model-free and model-based approaches have been employed for modeling human behavior. Model-free approaches (e.g., imitation learning carroll2019utility ) require substantial data and generally employ neural networks to learn human behavior without making strong assumptions.

In contrast, model-based approaches require far fewer samples but make certain assumptions about human behavior (e.g., humans exhibit bounded rationality simon1972theories ). Prior works in HRI have used POMDPs and their variants (e.g., BAMDP, MOMDP, I-POMDP) to account for latent factors such as trust, intent, or capability influencing human decision-making chen2018planning ; lee2020getting ; ramachandran2019personalized ; wang2016impact . Most prior POMDP-based works either assume known model parameters or employ maximum likelihood estimation (MLE) to estimate them lee2019bayesian ; ramachandran2019personalized ; chen2018planning ; wang2016impact . However, these approaches can fail to generalize to a diverse population and are prone to overfit bishop2006pattern . Hence, we instead adopt a Bayesian approach to jointly learn the POMDP parameters and the robot policy during human interactions, similar to prior work lee2020getting ; nanavati2021modeling ; ng2012bayes . However, a major drawback of such Bayesian approaches is the need to update beliefs over an augmented state space comprising both the human latent states and the POMDP parameters, which can quickly become computationally intractable. We overcome this challenge by making key approximations about the belief space and using conjugate priors for belief representation, which allows for computing quick belief updates. Our work differs from prior Bayesian approaches in HRI lee2020getting ; nanavati2021modeling by maintaining belief about dynamic latent parameters, such as trust or compliance natarajan2020effects ; chen2018planning which varies during interactions and across individuals.

2.2. Human-Agent Teaming

Recently, there has been a surge in interest in designing AI agents that are capable of collaborating with humans, especially in ad-hoc settings barrett2017making ; hu2020other ; hong2023learning . Ad-hoc or zero-shot human-agent teaming requires agents to be adept at collaborating with diverse users in novel contexts without prior interactions. Achieving ad-hoc, zero-shot coordination with novel human partners has been a longstanding challenge in AI and will be crucial for the ubiquitous deployment of robots and AI agents klien2004ten ; paleja2021utility . Recent works aim to achieve ad-hoc human-AI teaming either from human-human demonstrations using Behavior Cloning carroll2019utility and offline RL hong2023learning or via self-play without any human data strouse2021collaborating . Others have also explored population-based training to learn robot policies that are generalizable across diverse users zhao2023maximum ; lou2023pecan . However, these approaches focus on domains where both humans and agents have symmetric capabilities and work concurrently. In contrast, our work examines mixed-initiative teaming, where humans and agents possess asymmetric capabilities and must share control to achieve the task objective. Thus, we cannot learn robot policies from human-human demonstrations. Further, we seek to optimize team performance when all teammates are suboptimal, which is seldom explored in human-robot teams lee2020getting ; natarajan2023human .

3. Preliminaries

We model the human-robot team as a Bayes-Adaptive POMDP (BA-POMDP) ross2007bayes , allowing the robot to dynamically learn and adjust its policy based on estimations of human model parameters, while accounting for estimation uncertainty.

A POMDP is defined as a tuple $\mathcal{M}=(S,A,O,\mathcal{T},\mathcal{E},d_{0},R,\gamma)$, where $S$ is a set of states $s\in S$, $A$ is a set of actions $a\in A$, $O$ is a set of observations $o\in O$, $\mathcal{T}(s_{t+1}|s_{t},a_{t})$ gives the state transition probabilities, $\mathcal{E}(o_{t}|s_{t})$ is the emission function, $d_{0}$ is the initial state distribution, $R(s_{t},a_{t})$ is the reward for taking action $a$ in state $s$ at time step $t$, and $\gamma\in(0,1]$ is the discount factor. The agent’s goal is to learn a policy, $\pi:\mathcal{B}\rightarrow A$, that maximizes the expected cumulative discounted reward (return), where $b\in\mathcal{B}$ is a belief state inferred from a history of previous observations and actions, $h$. Belief updates can be achieved via the Bayes rule (infeasible for large state spaces) or with an unweighted particle filter (approximate update).
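The approximate belief update mentioned above can be illustrated with a minimal Python sketch of an unweighted particle filter based on rejection sampling. The function name `particle_filter_update`, the generative model `toy_simulate`, and the two-state toy problem are illustrative assumptions and are not taken from the paper.

```python
import random


def particle_filter_update(particles, action, observation, simulate,
                           rng, max_tries=100_000):
    """Unweighted particle filter step (rejection sampling).

    particles: list of sampled states approximating the current belief b
    simulate:  hypothetical generative model, (s, a, rng) -> (s_next, o)
    Returns next states whose simulated observation matched the real one;
    their empirical distribution approximates the updated belief.
    """
    target = len(particles)
    updated = []
    for _ in range(max_tries):
        if len(updated) == target:
            break
        s = rng.choice(particles)                # sample s ~ b
        s_next, o = simulate(s, action, rng)     # s' ~ T(.|s,a), o ~ E(.|s')
        if o == observation:                     # keep only consistent samples
            updated.append(s_next)
    return updated


def toy_simulate(s, a, rng):
    """Toy model (assumption): static state, observed correctly 80% of the time."""
    o = s if rng.random() < 0.8 else 1 - s
    return s, o


rng = random.Random(0)
prior = [0] * 50 + [1] * 50                      # uniform belief over {0, 1}
posterior = particle_filter_update(prior, "noop", 1, toy_simulate, rng)
frac_one = sum(posterior) / len(posterior)       # about 0.8 in expectation
```

In this toy problem the exact Bayes posterior for state 1 after observing 1 is 0.8, so the particle fraction should land near that value; the filter never evaluates the emission probabilities explicitly, which is what makes it usable when only a simulator is available.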

Most prior works in POMDPs assume a fully-specified environment (i.e., the model parameters $\mathcal{T}$, $\mathcal{E}$ are known) lauri2022partially , which is unrealistic in HRI as we neither have access to the person’s true latent states (e.g., trust, preferences) nor how they change during the interaction. We adopt the BA-POMDP framework, a Bayesian Reinforcement Learning approach for solving POMDPs ross2007bayes . The BA-POMDP employs Dirichlet vectors, $\chi$, to represent uncertainty over the model parameters $(\mathcal{T},\mathcal{E})$. As the POMDP states are hidden, $\chi$ cannot be computed and is included as part of the state.
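To make the role of the Dirichlet vectors concrete, the following Python sketch keeps one count vector per context (for example, a state-action pair) and reads off the expected categorical distribution by normalizing the counts. The class name `DirichletCounts`, the uniform prior of 1.0, and the string-labelled states are assumptions chosen for illustration; the sketch also presumes the context and outcome are observed, which is exactly what does not hold in a POMDP and why the counts are folded into the state.

```python
from collections import defaultdict


class DirichletCounts:
    """Dirichlet count vectors over a finite set of outcomes, one per context."""

    def __init__(self, outcomes, prior=1.0):
        self.outcomes = tuple(outcomes)
        # Each new context starts from the same symmetric prior counts.
        self.counts = defaultdict(lambda: {o: prior for o in self.outcomes})

    def update(self, context, outcome):
        """Add one observed occurrence of `outcome` in `context`."""
        self.counts[context][outcome] += 1.0

    def expected(self, context):
        """Posterior-mean categorical distribution for `context`."""
        c = self.counts[context]
        total = sum(c.values())
        return {o: v / total for o, v in c.items()}


chi = DirichletCounts(outcomes=("s0", "s1"))
chi.update(("s0", "a"), "s1")
chi.update(("s0", "a"), "s1")
expected = chi.expected(("s0", "a"))   # counts are s0: 1, s1: 3
```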

3.1. Solving POMDPs

Partially Observable Monte-Carlo Planning (POMCP) is an online solver that extends the Monte-Carlo Tree Search (MCTS) to POMDPs silver2010monte . POMCP uses the UCT (Upper Confidence Bound (UCB) for Trees) to select actions and an unweighted particle filter for belief updates. In POMCP, the UCT algorithm is extended to partially observable domains using a search tree of histories $h$ instead of states, where each node in the tree stores statistics: visitation count $N(h)$, value or mean return $V(h)$, and belief $b(h)$, approximated by particles. The algorithm performs online planning through multiple simulations, incrementally building the search tree. The return of each simulation is used to update the statistics for all visited nodes. POMCP terminates based on preset criteria (e.g., maximum number of simulations).
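The action-selection step of UCT can be sketched with the standard UCB1 rule, which adds an exploration bonus to each action's mean return. The function name `ucb_select`, the exploration constant `c`, and the example statistics are assumptions for illustration; the action labels follow the robot's options described in the introduction.

```python
import math


def ucb_select(stats, c=1.0):
    """Pick an action at a history node with the UCB1 rule used by UCT.

    stats: dict mapping action -> (visit_count, mean_return)
    c:     exploration constant (assumed value; tuned per domain in practice)
    """
    total = sum(n for n, _ in stats.values())    # plays the role of N(h)
    best_action, best_score = None, -math.inf
    for action, (n, value) in stats.items():
        if n == 0:
            return action                        # try unvisited actions first
        score = value + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action


node = {"comply": (100, 1.0), "interrupt": (1, 0.5)}
explore = ucb_select(node, c=1.0)   # rarely tried action gets a large bonus
exploit = ucb_select(node, c=0.0)   # pure greedy choice on mean return
```

With `c=0.0` the rule reduces to greedy selection, while larger values of `c` push the search toward rarely visited actions.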

We model the human-robot team as a BA-POMDP. Solving BA-POMDPs is difficult as they are infinite-state POMDPs. The current state-of-the-art online algorithm for solving BA-POMDPs is BA-POMCP (extending POMCP for BA-POMDPs) katt2017learning . In this work, we propose Bayes-POMCP, which extends the BA-POMCP algorithm for suboptimal human-robot teams.

4. Method

In this section, we first define the human-robot team model (BA-POMDP) for mixed-initiative interactions and then describe how we utilize a variant of the BA-POMCP algorithm to learn an adaptive robot policy for our current setting.

4.1. Human-Robot Team Model

4.1.1. State Space

In our human-robot team model, the state space combines the world state and user latent states, $s=(x,z)$. The world state, $x\in\mathcal{X}$, refers to the task that the human-robot team is working on, and the latent states, $z\in\mathcal{Z}$, can refer to the user’s trust or tendency to comply with the robot and their task execution preferences. The robot does not have access to the user’s latent states and must infer these states by observing the user’s actions. We focus on suboptimal human-robot teaming, assuming that the suboptimality arises from task-related errors or incomplete knowledge, i.e., both agents may make errors or cannot observe the full world state. Thus, the world state as observed by the robot may not always align with what the human observes $(x^{R}_{t}\neq x^{H}_{t},\ \forall t)$.

4.1.2. Action Space

As we are planning from the robot’s perspective, the action space comprises the actions $a^{R}\in A^{R}$ that the robot can take in the environment. In our mixed-initiative collaborative scenario, we assume that the robot first observes the human action and then selects its action. (Footnote: Our approach is not restricted to this mixed-initiative setting and can be extended to cases where either the robot takes the first action or works concurrently with users.) The robot can choose to either execute, intervene, or override the user’s actions. Additionally, the robot may choose to explain whenever it intervenes or overrides the user.

4.1.3. Observation Space

The robot observes the human actions $a^{H}\in A^{H}$. We assume that the human’s action depends on their knowledge of the current world state $x_{t}$ and the history of interactions, $h_{t-1}$, with the robot, i.e., the human follows the policy, $\pi^{H}(a^{H}_{t}|x_{t},h_{t-1},a^{R}_{t-1})$, where $h_{t-1}=\{a^{H}_{0},a^{R}_{0},a^{H}_{1},a^{R}_{1},\cdots,a^{H}_{t-1}\}$.
Similar to prior work chen2018planning , we assume that the user’s latent state, $z_{t}$, is a compact representation of the interaction history ($z_{t}\approx\{h_{t-1}\cup a^{R}_{t-1}\}$). Thus, $\pi^{H}(a^{H}_{t}|x_{t},h_{t-1},a^{R}_{t-1})\approx\pi^{H}(a^{H}_{t}|x_{t},z_{t})$.

4.1.4. Transition and Emission Models

We define the state transition model, $\mathcal{T}$, from the robot’s perspective, i.e., $\mathcal{T}=p(s_{t+1}|s_{t},a^{R}_{t})$. However, for mixed-initiative settings, the transitions in the state, $s_{t}=(x_{t},z_{t})$, occur as a result of both human and robot actions at each time step. Thus we rewrite the transition model as:

\begin{align}
p(s_{t+1}|s_{t},a^{R}_{t}) &= \sum\limits_{a^{H}} p(s_{t+1}|s_{t},a^{R}_{t},a^{H}_{t})\times\pi^{H}(a^{H}_{t}|x_{t},z_{t}) \tag{1}\\
&= \sum\limits_{a^{H}} p(x_{t+1}|x_{t},a^{R}_{t},a^{H}_{t})\times p(z_{t+1}|z_{t},a^{R}_{t},a^{H}_{t})\times\pi^{H}(a^{H}_{t}|x_{t},z_{t}) \tag{2}
\end{align}

Equation 2 comes from our assumption that given the human and robot actions, the world state dynamics are independent of the human latent state dynamics. In our collaborative scenario, we only estimate the latent state dynamics as part of the BA-POMDP, as we assume that the world state dynamics are deterministic and known.
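The factorization in Equation 2 can be sketched in Python as a marginalization over human actions, with the world step treated as deterministic in line with the assumption stated above. The function `robot_transition` and the toy corridor (its human policy, world dynamics, and latent dynamics) are hypothetical and serve only to show how the three factors combine.

```python
def robot_transition(x, z, a_r, human_actions, human_policy, world_step, latent_step):
    """Return p(s_{t+1} | s_t, a^R_t) as a dict {(x_next, z_next): probability}.

    Sums, over human actions a^H, the product
    p(x'|x, a^R, a^H) * p(z'|z, a^R, a^H) * pi^H(a^H | x, z),
    where the world step is deterministic, so p(x'|...) is 0 or 1.
    """
    dist = {}
    for a_h in human_actions:
        p_h = human_policy(a_h, x, z)                 # pi^H(a^H | x, z)
        x_next = world_step(x, a_r, a_h)              # deterministic world dynamics
        for z_next, p_z in latent_step(z, a_r, a_h).items():
            key = (x_next, z_next)
            dist[key] = dist.get(key, 0.0) + p_h * p_z
    return dist


# Toy 1-D corridor; everything below is an assumption for illustration.
HUMAN_ACTIONS = ("left", "right")


def human_policy(a_h, x, z):
    p_right = 0.75 if z == "high" else 0.25
    return p_right if a_h == "right" else 1.0 - p_right


def world_step(x, a_r, a_h):
    if a_r == "take_control":
        return x + 1                                  # robot's alternative action
    if a_r == "interrupt":
        return x                                      # human action not executed
    return x + 1 if a_h == "right" else x - 1         # comply


def latent_step(z, a_r, a_h):
    if a_r == "comply":
        return {z: 1.0}                               # latent state unchanged
    return {"high": 0.5, "low": 0.5}                  # intervention perturbs it


comply_dist = robot_transition(0, "high", "comply", HUMAN_ACTIONS,
                               human_policy, world_step, latent_step)
control_dist = robot_transition(0, "high", "take_control", HUMAN_ACTIONS,
                                human_policy, world_step, latent_step)
```

Because the human policy enters only as a weight on each branch, replacing `human_policy` with an estimated model changes the predicted transitions without touching the known world dynamics.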

The emission model $\mathcal{E}$ for the human-robot team refers to the human policy $\pi^{H}(a^{H}_{t}|x_{t},z_{t})$, which is also unknown to the robot and must be estimated to solve the BA-POMDP.

4.1.5. Reward Function

The reward function $\mathcal{R}(x,a^{H},a^{R})$ is positive for team actions that contribute to achieving the task goal and negative for team actions that hinder task success. We assume that both the user and the robot are aware of the reward function.

4.2. Adaptive Robot Intervention Policy in Mixed-Initiative Teams (Bayes-POMCP)

To maximize human-robot team performance in real-time for mixed-initiative settings, we implement a modified version of the BA-POMCP katt2017learning . Here, we highlight the key changes we make to the BA-POMCP algorithm. Figure 1 provides an overview of our approach, and the complete procedure is described in Algorithm 1.

Figure 1. Graphical overview of the Bayes-POMCP approach for mixed-initiative Human-Robot Teaming: At each timestep $t$, the human first takes an action based on interaction history, $h$, and their current observation of the world state, $x$. The robot then determines when and how to intervene by anticipating human behavior using a Monte-Carlo tree search. The reward is calculated based on both human and robot actions.
Input: Initial world state x0subscript𝑥0x_{0}italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT; Interaction history h1=[]subscript1h_{-1}=[]italic_h start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT = [ ]; initial belief b0subscript𝑏0b_{0}italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT; Search Tree T={}𝑇T=\{\}italic_T = { }
1 a1RNo-Assistsubscriptsuperscript𝑎𝑅1No-Assista^{R}_{-1}\leftarrow\text{No-Assist}italic_a start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT ← No-Assist // By default before episode starts z0{h1a1R}subscript𝑧0subscript1subscriptsuperscript𝑎𝑅1z_{0}\leftarrow\{h_{-1}\cup a^{R}_{-1}\}italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ← { italic_h start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT ∪ italic_a start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT start_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT } // Initial human latent state a0HRealHuman(|x0,z0)a^{H}_{0}\leftarrow\textsc{RealHuman}(\cdot|x_{0},z_{0})italic_a start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ← RealHuman ( ⋅ | italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) // First human action h0[a0H]subscript0delimited-[]subscriptsuperscript𝑎𝐻0h_{0}\leftarrow[a^{H}_{0}]italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ← [ italic_a start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ] T(h0)ConstructNode(T,h0)𝑇subscript0ConstructNode𝑇subscript0T(h_{0})\leftarrow\textsc{ConstructNode}(T,h_{0})italic_T ( italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ← ConstructNode ( italic_T , italic_h start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) // Construct root node for t=0,1,2,max_steps𝑡012normal-…max_stepst=0,1,2,\dots\text{max\_steps}\>italic_t = 0 , 1 , 2 , … max_steps do
2       atRSearch(ht)subscriptsuperscript𝑎𝑅𝑡Searchsubscript𝑡\color[rgb]{0.01171875,0.5859375,0.87109375}a^{R}_{t}\leftarrow\textsc{Search}% (h_{t})italic_a start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← Search ( italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) // Root node contains-as-subgroup\rhd Search (Supp. Alg. 2) if (htatR)Tsubscript𝑡subscriptsuperscript𝑎𝑅𝑡𝑇(h_{t}a^{R}_{t})\notin T( italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ∉ italic_T then
3             ConstructNode(T,htatR𝑇subscript𝑡subscriptsuperscript𝑎𝑅𝑡T,h_{t}a^{R}_{t}italic_T , italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT)
4      xt+1p(|xt,atR,atH)x_{t+1}\leftarrow p(\cdot|x_{t},a^{R}_{t},a^{H}_{t})\>\;italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← italic_p ( ⋅ | italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) // Update World State zt+1{htatR}subscript𝑧𝑡1subscript𝑡subscriptsuperscript𝑎𝑅𝑡z_{t+1}\leftarrow\{h_{t}\cup a^{R}_{t}\}\>\;italic_z start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← { italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∪ italic_a start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } // Update true latent state ⇏⇏\not{\Rightarrow}⇏ Robot at+1HRealHuman(|xt+1,zt+1)a^{H}_{t+1}\leftarrow\textsc{RealHuman}(\cdot|x_{t+1},z_{t+1})italic_a start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← RealHuman ( ⋅ | italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) // Next user action ht+1ht{atR,at+1H}subscript𝑡1subscript𝑡subscriptsuperscript𝑎𝑅𝑡subscriptsuperscript𝑎𝐻𝑡1h_{t+1}\leftarrow h_{t}\cup\{a^{R}_{t},a^{H}_{t+1}\}italic_h start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∪ { italic_a start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT } if ht+1Tsubscript𝑡1𝑇h_{t+1}\notin Titalic_h start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ∉ italic_T then
5             T(ht+1)𝑇subscript𝑡1absentT(h_{t+1})\leftarrowitalic_T ( italic_h start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) ← ConstructNode(T,ht+1𝑇subscript𝑡1T,h_{t+1}italic_T , italic_h start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT)
6      // Belief update: next root node b(ht+1)Belief-Update(b(ht),atR,at+1H)𝑏subscript𝑡1Belief-Update𝑏subscript𝑡subscriptsuperscript𝑎𝑅𝑡subscriptsuperscript𝑎𝐻𝑡1\color[rgb]{0.01171875,0.5859375,0.87109375}b(h_{t+1})\leftarrow\textsc{Belief% -Update}(b(h_{t}),a^{R}_{t},a^{H}_{t+1})italic_b ( italic_h start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) ← Belief-Update ( italic_b ( italic_h start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , italic_a start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) Prune-Tree(T,ht+1)Prune-Tree𝑇subscript𝑡1\textsc{Prune-Tree}(T,h_{t+1})Prune-Tree ( italic_T , italic_h start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) // ht+1subscript𝑡1h_{t+1}italic_h start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT is the root node
Algorithm 1 Bayes-POMCP: Maximizing Performance in Mixed-Initiative Human-Robot Teams
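The re-rooting at the end of each step can be illustrated with a small sketch. It is not the authors' implementation: the `Node` class, the dictionary-based tree keyed by history tuples, and the function names are assumptions made for illustration, and the belief update itself is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical search-tree node, indexed by the history that reaches it."""
    visits: int = 0
    value: float = 0.0
    belief: list = field(default_factory=list)  # placeholder for b(h)


def extend_history(history, robot_action, human_action):
    """h_{t+1} <- h_t followed by (a^R_t, a^H_{t+1}); histories are tuples."""
    return history + (robot_action, human_action)


def advance_root(tree, history, robot_action, human_action):
    """Move the root to h_{t+1}: create the node if missing, then prune."""
    new_history = extend_history(history, robot_action, human_action)
    if new_history not in tree:
        tree[new_history] = Node()  # stands in for ConstructNode
    # Keep only nodes whose history starts with the new root's history.
    n = len(new_history)
    pruned = {h: node for h, node in tree.items() if h[:n] == new_history}
    return pruned, new_history
```

Keying nodes by the full history makes pruning a prefix check: every node that cannot be reached from the new root is dropped, so statistics gathered in the surviving subtree are reused at the next step.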

4.2.1. Belief Approximation

Similar to POMCP, BA-POMCP is an online algorithm that constructs a lookahead search tree through environment simulations and maintains a belief over latent parameters using an unweighted particle filter to determine the best action at each time step. However, in BA-POMCP, we need to maintain a belief over both the latent states $|S|$ and the model parameters $\mathcal{T}, \mathcal{E}$ ($|S|^2 \times |A| + |S| \times |A| \times |O|$ parameters). Computing the posterior update over such a large space can be expensive. Further, it is difficult for the posterior distribution to converge to the true parameters, especially when we only have access to limited interactions.

Hence, we leverage the independence assumption between the world state and the latent state transition (Equation 2) to approximate the belief in each node in the search tree. Approximating the belief makes it feasible to compute the belief updates in real time for fluent HRI. Since we only need the human action to determine the next world state, we choose to maintain the belief only over the user policy space instead of all latent states and model parameters. We compute the posterior update for the belief $b(h_{t+1})$ from the prior belief, $b(h_t)$, based on the interaction history, $h_t$, at each node.

4.2.2. Simulating Human Policy

In BA-POMCP, we need to simulate human actions during the rollout for constructing the search tree. As the robot lacks direct knowledge of the true human policy, we first estimate the human policy parameters and use the same for simulation. Given that the human actions can be categorized as being compliant/non-compliant with the robot, we model the true human policy as a Bernoulli distribution with an unknown parameter, $\mu$, that signifies the likelihood of user compliance for a given interaction history, $h$. To estimate $\mu$, we adopt a Bayesian approach. We assume a prior distribution or belief over the space of human policies $b = p(\mu)$. We approximate $b$ using a set of particles, which is updated upon subsequent interactions with the user. In general, performing the belief update can be computationally expensive, but such updates can be computed efficiently for the conjugate family of distributions bishop2006pattern. Thus, we model each particle as a beta distribution – the conjugate prior for Bernoulli distributions.
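The conjugate update can be written out for a single particle. As a sketch, assume the particle is a beta distribution with parameters $(a, b)$ and let $c \in \{0, 1\}$ indicate whether the user complied in one interaction. The symbols $a$, $b$, and $c$ are chosen here for illustration and do not appear in the paper's notation.

```latex
\[
p(\mu) = \mathrm{Beta}(\mu \mid a, b) \propto \mu^{a-1}(1-\mu)^{b-1},
\qquad
p(c \mid \mu) = \mu^{c}(1-\mu)^{1-c},
\]
\[
p(\mu \mid c) \propto \mu^{a+c-1}(1-\mu)^{b-c}
= \mathrm{Beta}(\mu \mid a + c,\; b + 1 - c),
\qquad
\mathbb{E}[\mu \mid c] = \frac{a + c}{a + b + 1}.
\]
```

Under these assumptions the update amounts to incrementing one of two counts, which is why it stays cheap inside the search tree.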

To simulate the human action during rollout, we sample a particle from $b$ at the current node. We use this sampled particle to anticipate the next human actions and update it based on the interaction outcomes during the simulation. Additionally, we assume that humans are rational and employ an $\epsilon$-greedy heuristic to select the user’s actions in case of non-compliance.
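A minimal sketch of this rollout step follows, under several assumptions that the text does not specify: particles are stored as `(a, b)` beta parameters, the compliance probability is drawn from the sampled particle with `betavariate`, and the greedy choice is taken with respect to a hypothetical `action_scores` dictionary (for example, a heuristic estimate of progress toward the goal).

```python
import random


def update_particle(particle, complied):
    """Conjugate Beta-Bernoulli update: add one count to the matching parameter."""
    a, b = particle
    return (a + 1, b) if complied else (a, b + 1)


def simulate_human_action(particles, robot_action, action_scores, epsilon, rng):
    """Sample one particle, then simulate a compliant or non-compliant action.

    Returns (human_action, complied, updated_particle).
    """
    a, b = rng.choice(particles)       # sample a particle from the belief b
    mu = rng.betavariate(a, b)         # assumed: draw a compliance probability
    complied = rng.random() < mu
    if complied:
        action = robot_action
    elif rng.random() < epsilon:       # epsilon-greedy: explore uniformly
        action = rng.choice(sorted(action_scores))
    else:                              # epsilon-greedy: exploit the best score
        action = max(action_scores, key=action_scores.get)
    return action, complied, update_particle((a, b), complied)
```

Passing the random generator explicitly keeps rollouts reproducible, for example `simulate_human_action([(1, 1)], "up", scores, 0.1, random.Random(0))`.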

Alternatively, we can use a random policy to mimic human behavior, but this would require more simulations to cover a range of possible human responses and determine the optimal robot action—resulting in increased computation time. Therefore, we opt for estimating user compliance and then simulating the human policy, which we find empirically more efficient.

To evaluate the contributions of our proposed modifications to the BA-POMCP algorithm katt2017learning, we perform an ablation analysis without modeling humans, i.e., we only use random rollout policies for anticipating human behavior and perform no belief updates. We refer to this approach as POMCP in our analysis (Section 6.2).

5. Evaluation

5.1. Domain

We modified the Frozen Lake environment from OpenAI Gym brockman2016openai for evaluating mixed-initiative human-robot teaming. In this domain, the users must collaborate with the robot to navigate an $8 \times 8$ frozen lake grid from start to goal in the fewest steps possible while avoiding holes and slippery regions. We modified the original domain to only have certain grids as slippery instead of a constant slip probability throughout the map. Stepping on a slippery region will cause the agent to fall into a hole. Both the human and the robot can only observe whether the adjacent four grids are slippery. Each time the agent falls into a hole, the team incurs a penalty $\alpha$ and must begin again from the start location.

To enforce suboptimality, we introduce errors in the human and robot observations of slippery grids. These errors include – False Positives (observing a safe grid as slippery), and False Negatives (observing a slippery region as safe). Moreover, certain parts of the map are covered by fog which reduces human visibility. The human and robot accuracies for identifying slippery regions are shown in Figure 2. During the game, the human teleoperates the robot across the lake, but the robot may intervene or take control if it finds that the user chose a longer or unsafe path (e.g., slippery regions or holes) to the goal. Additionally, the user is equipped with a high-quality ($100\%$ accurate) sensor for detecting slippery regions in adjacent grids, but each use of the sensor incurs a point cost $\rho$. The overall team performance or game reward for each round is calculated as a combination of step penalty (shorter path $\rightarrow$ higher reward), penalty for falling into holes $\alpha$, detection penalty $\rho$, and a bonus $\kappa$ for reaching the goal as shown in Equation 3.

\begin{equation}
\begin{split}
\text{Reward} = {} & \text{Max steps} - \text{\# steps taken} - \alpha \times \text{\# falls into hole} \\
& - \rho \times \text{\# detections} + \kappa \times \mathbbm{1}[\text{goal reached} == \text{True}]
\end{split}
\tag{3}
\end{equation}
Figure 2. Frozen Lake Domain used in this study. Figure 2(a) shows the overall game layout. Figure 2(b) depicts robot intervention styles: interrupt, take-control, and Figure 2(c) shows the human and robot accuracies in identifying slippery grids.

We empirically set max steps $= 80$, $\alpha = 10$, $\rho = 2$ and $\kappa = 30$ for our human-subject experiments. Our environment is inspired by USAR missions, where humans teleoperate robots, but both humans and robots can have complementary skills and varying domain knowledge. Further details of the user study domain can be found in the Supplementary.
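As a worked illustration of Equation 3 with these constants, the sketch below computes the reward for a round. The function name and the example rounds are made up for illustration and are not data from the study.

```python
def game_reward(steps_taken, falls, detections, goal_reached,
                max_steps=80, alpha=10, rho=2, kappa=30):
    """Equation 3 with the constants used in the human-subject experiments."""
    return (max_steps - steps_taken
            - alpha * falls
            - rho * detections
            + kappa * (1 if goal_reached else 0))


# Hypothetical round: 20 steps, one fall, two sensor uses, goal reached.
# 80 - 20 - 10*1 - 2*2 + 30 = 76
example = game_reward(steps_taken=20, falls=1, detections=2, goal_reached=True)
```

A round that uses all 80 steps without reaching the goal, falling, or using the sensor scores 0 under this formula.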

5.2. Human-Subjects Experiments

We conducted two user studies to 1) examine how users respond to different robot intervention styles with and without explanations but with a static policy (Data Collection Study) and 2) evaluate human-robot team performance with the proposed adaptive Bayes-POMCP approach (Evaluation Study).

5.2.1. Data Collection Study

We employ a $1 \times 5$ within-subjects experiment design to examine user responses to various robot interventions in mixed-initiative teaming (Figure 2b). These interventions include – no assist: the robot does not intervene (baseline), interrupt: the robot stops the user from executing an action, take-control: the robot overrides the user’s action with its own action, interrupt+explain: the robot interrupts and explains, take-control+explain: the robot takes over control and explains. To ensure consistency across intervention strategies, the robot employs the same handcrafted heuristic that determines when to intervene. The heuristic intervention policy is a short-horizon planner that only intervenes if the user’s current action is anticipated to lead to a slippery region (based on the robot’s knowledge), a hole, or a longer path ($\geq k$ steps), and will cede control to the user if the user persistently chooses the action the robot is intervening against. The heuristic employs a static intervention style. The algorithm for the heuristic policy can be found in the Supplementary.
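The decision rule described above can be sketched as a single predicate. This is only an illustration of the prose description: the actual algorithm is in the Supplementary, and the argument names, the `patience` threshold for "persistently", and the way path length is measured (`extra_steps`) are all assumptions.

```python
def should_intervene(leads_to_slippery, leads_to_hole, extra_steps, k,
                     times_user_repeated, patience):
    """Return True if the heuristic robot should intervene on the user's action.

    leads_to_slippery / leads_to_hole: the robot's (possibly wrong) prediction.
    extra_steps: additional path length the action is expected to cause.
    times_user_repeated: how often the user re-chose the contested action.
    patience: hypothetical number of repeats after which the robot cedes control.
    """
    if times_user_repeated >= patience:
        return False  # cede control to a persistent user
    return bool(leads_to_slippery or leads_to_hole or extra_steps >= k)
```

Because the same predicate decides when to intervene in every condition, only the intervention style (interrupt or take-control, with or without explanation) differs between conditions.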

(a) Team Performance vs. Robot Intervention Style
(b) User preferences for working with different robots
Figure 3. Results from Data Collection Study with Heuristic Mixed-Initiative Policies. Figure 3(a) shows that the team performance is the highest for the take-control agents and the lowest with no-assist (baseline). Figure 3(b) shows the users’ preference ranking across intervention styles. The majority of the users prefer to work with the interrupt+explain agent the most (rank $= 5$).

5.2.2. Evaluation Study

We employ a $1 \times 3$ within-subjects experiment to compare human-robot team performance under different robot policies. The examined policies are our proposed approach – Bayes-POMCP, the same heuristic policy as was used in the data collection study, and an adversarial policy (Adv-Bayes-POMCP) optimized for negative game reward (Equation 3). We include the adversarial policy as an adaptive baseline to show that (1) our proposed approach can successfully aid or inhibit the user from reaching the goal, and (2) it is essential for the adaptive policy to reason when to intervene effectively in addition to switching the intervention styles. To perform a balanced comparison, we ensure that the run times of all robot policies are identical. Further, we limit the use of the detection sensor ($\leq 5$) in the evaluation study to force participants to rely on the robot’s assistance.

5.2.3. Metrics

For both studies, we assess user preferences and performance using subjective and objective measures, respectively. Our subjective measures include trust muir1989operators, likeability bartneck2009measurement, and willingness to comply raemdonck2013feedback (adapted from human-human interactions for HRI) measured via 5-point Likert scales. All questionnaires were administered to the users after each round in both studies. Further, participants reported their demographics, highest completed education, prior experience with robots, and completed a 50-item personality scale Goldberg1992TheStructure as part of the pre-study questionnaire. At the end of the study, users ranked their preferences for the different robot agents. All questionnaires used for the study can be found in the Supplementary. Objective performance was assessed based on the total game reward (Equation 3) in each round.

5.2.4. Participants and Procedure

We recruited $30$ participants (Age: $25.56 \pm 3.38$, Female: $33\%$) for the data collection study and $28$ new participants (Age: $25.27 \pm 3.28$, Female: $50\%$) for the evaluation study, all from a local university campus after IRB approval. The procedure was the same for both studies. Written consent from the participants was obtained before the experiment. At the start of the study, participants received written game instructions along with a demonstration from the experimenter. Participants first completed three practice rounds to familiarize themselves with the game and then engaged in ten and six rounds (two rounds per condition) for the data collection study and evaluation study, respectively. The subjects were instructed to complete each round by taking the shortest path to the goal. The experiment order was randomized, and participants completed pre- and post-study questionnaires.

5.3. Hypotheses

To investigate how different users interact with various robot intervention styles, we first conducted a data collection study. We hypothesize the following:

H1A: The human-robot team performance can vary with different robot intervention styles. Although the robot follows the same heuristic across different conditions in Study 1, we hypothesize that the team performance will vary across intervention styles as users may respond differently. For instance, users may be better informed to choose the next action appropriately when the robot intervenes and provides an explanation.

H1B: Users will have different preferences for working with various robot intervention styles. Humans have varying personality traits and task preferences, which may impact how they perceive and collaborate with teammates. For instance, extroverted individuals are more likely to assume leadership and less likely to renounce control in human-human teams kickul2000emergent. Likewise, we hypothesize that users will have different preferences when working with robots that interrupt or take control with or without offering explanations.

For the evaluation study, we compare the human-robot team performance with the adaptive Bayes-POMCP policy against heuristics used in the first study and an adversarial baseline – Adv-Bayes-POMCP. We hypothesize:

H2A: The human-robot team performance will be the highest when the robot employs the adaptive Bayes-POMCP policy. We hypothesize that the Bayes-POMCP policy, which actively anticipates human actions by considering their latent states, is better suited for determining when and how to intervene with various users and will thereby maximize team performance. In contrast, the baselines that do not model the human latent states (the heuristic policy) or that optimize for negative reward (the adversarial Bayes-POMCP) will not be able to assist the users appropriately.

H2B: Users will most prefer to work with our proposed approach, the adaptive Bayes-POMCP policy. We hypothesize that the Bayes-POMCP policy can intervene effectively by modeling users’ latent states and will, therefore, not only improve team performance but also have a positive impact on the users’ subjective preference for collaborating with the robot.

6. Results and Discussion

In this section, we first discuss the results of the data collection study. Next, we show results from our simulation experiments used to validate Bayes-POMCP before testing on human participants. We then discuss the results from the evaluation study, comparing our Bayes-POMCP approach and two baselines.

All our statistical analyses were performed using libraries in R, and the significance level $\alpha$ was set at $0.05$. For our analysis, we use parametric tests unless the model fails to meet the required assumptions (normality, homoscedasticity, et cetera). Details of all models and tests used for each hypothesis, along with the effect sizes and statistical power, are listed in the Supplementary.

6.1. Data Collection Study

For the data collection study, we recruited $30$ participants and excluded one participant as an outlier since they failed to complete all ten rounds in the study (failure rate across all subjects: $1.733 \pm 2.365$). Thus we have data from $29$ subjects for our analysis.

H1A: Team Performance and Robot Intervention Styles. We compare the team performance using the game reward (Equation 3) across the five robot intervention styles employed in the first study. The robot either used the same heuristic policy to determine when to intervene or did not intervene at all (no-assist: baseline condition). Each user participated in two rounds for each intervention style, totaling ten rounds, all played on different maps with varying levels of difficulty. To mitigate ordering effects and map-related biases, the experiment conditions and map assignments were randomized.

(a) Static Users - vary expertise ($\psi$).
(b) Static Users - vary compliance ($\theta$).
(c) Dynamic Users
Figure 4. Team performance in simulation experiments with static and dynamic latent user models. Figures 4(a) and 4(b) show that Bayes-POMCP can enhance team performance across users of varied expertise and compliance tendencies, respectively. Bayes-POMCP outperforms heuristics and the ablation POMCP model, especially for users with low expertise.

We use the Kruskal-Wallis test (a non-parametric test), with the reward as the dependent variable and the robot intervention style as the independent variable. We obtain statistical significance for the intervention style ($H(4) = 58.16$, $p < .001$). Subsequently, we use Dunn’s test for performing post-hoc pairwise comparisons, and the significance values are shown in Figure 3(a).
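For readers unfamiliar with the test, the sketch below computes the Kruskal-Wallis $H$ statistic from scratch. It is not the authors' analysis code (they used R), it omits the tie correction and the p-value lookup, and the function names are chosen here for illustration.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return (H, degrees_of_freedom), without the tie correction."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)
    return h, len(groups) - 1
```

The degrees of freedom equal the number of groups minus one, so comparing five intervention styles gives four, which matches the $H(4)$ reported above.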

Takeaway: We find that the human-robot team performance is impacted by the intervention styles used by the robot, rejecting the null hypothesis (Figure 3(a)). Firstly, it is worth noting that the team performance significantly improves when the robot intervenes compared to the baseline (no assistance), validating the need for robot interventions in our study domain. Secondly, the team performance is the highest when the robot takes over control. Lastly, adding explanations did not significantly improve performance for the same intervention style (e.g., between interrupt and interrupt+explain).

H1B: Users’ Working Preference and Robot Intervention Styles. At the end of the first user study, participants were asked to rank their preferences for working with various robot intervention styles on a scale from 1 (lowest) to 5 (highest). As user rankings are ordinal data, we use Kruskal-Wallis, a non-parametric test, to analyze H1B. We find that robot intervention style indeed influences user preferences ($H(4) = 61.67$, $p < .001$). The majority of the users preferred the interrupt+explain agent the most and the take-control agent the least, as shown in Figure 3(b).

Takeaway: Our results suggest that, despite explanations not improving performance, most users favor working with robots that offer explanations for their interventions. Interestingly, even though the take-control agent achieved the highest team performance, it was the least preferred choice for the majority of users. These findings highlight the need for an adaptive robot policy that adjusts the intervention style to maximize performance and user satisfaction. If the robot only takes over control, it can improve team performance in the short term but can cause user dissatisfaction and can potentially lead to users abandoning the system in the long run.

(a) Team Performance vs. Robot Policies
(b) User Working Preferences for Robot Policies
Figure 5. Results from the Evaluation Study. Figure 5(a) shows that the team performance is the highest for the Bayes-POMCP agent and the lowest for the Adv-Bayes-POMCP (the adversarial baseline). Figure 5(b) shows that the majority of the users prefer our approach compared to the baselines.

6.2. Simulation Experiments

We first validate whether our proposed method can adapt to diverse users by testing with various simulated human models before testing the Bayes-POMCP policy on users. For the simulation experiments, we compare Bayes-POMCP against two baselines – (1) the standard POMCP algorithm silver2010monte with no human model (POMCP) and (2) the heuristic agents (both take-control and interrupt) on five of the $8 \times 8$ maps used in the data collection study. To simulate a diverse set of users, we modulate two latent parameters that determine their behavior – the users’ capability or expertise ($\psi$) to solve the task and the users’ tendency to comply with the agent ($\theta$). We test with both static users (whose latent parameters – $\psi, \theta$ are fixed) and dynamic users, whose $\theta$ varies continuously based on the interaction history, but $\psi$ remains fixed (i.e., we assume no learning effect as the domain is simple). We provide further details of the simulated human population in the Supplementary. Our results (Figure 4) indicate that Bayes-POMCP outperforms both the heuristics employed in the first study and the ablation baseline without human modeling (POMCP) for static and dynamic user models.
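One plausible reading of a static simulated user is sketched below. The exact user models are specified in the Supplementary, so everything here is an assumption made for illustration: the function name, the interpretation of $\theta$ as a per-step compliance probability, and the interpretation of $\psi$ as the probability of picking the optimal action when acting alone. Dynamic users are not sketched because the text does not state how $\theta$ changes with the interaction history.

```python
import random


def static_user_action(robot_suggestion, optimal_action, all_actions,
                       psi, theta, rng):
    """Hypothetical static user with fixed expertise psi and compliance theta.

    With probability theta the user follows the robot's suggestion (if any).
    Otherwise the user acts alone: optimal with probability psi, else random.
    """
    if robot_suggestion is not None and rng.random() < theta:
        return robot_suggestion
    if rng.random() < psi:
        return optimal_action
    return rng.choice(all_actions)
```

Sweeping `psi` and `theta` over a grid of values would produce a population of users in the spirit of the static-user experiments in Figure 4.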

6.3. Evaluation Study

Upon verifying our policy with different simulated users, we collected data from $28$ new participants (who did not take part in the first user study) for the evaluation study. We excluded data from three subjects. Two of the three subjects encountered graphic rendering issues in the study interface. The other subject was excluded as an outlier as they failed to complete all six rounds (failure rate across all subjects: $3.48 \pm 0.77$). Hence we only include data from the remaining $25$ subjects for our analysis.

H2A: Team Performance and Robot Policy. In this user study, we evaluated the team performance for different robot policies – the heuristic agents (interrupt+explain and take-control+explain) from the first study, and our proposed approach Bayes-POMCP optimized for the true reward and for the negative reward. Each user participated in two rounds per policy, totaling six rounds, all played on different maps (a subset from the first study). We used the Kruskal-Wallis test with the reward as the dependent variable and the robot policy as the independent variable. We obtained statistical significance for the robot policy ($H(2) = 109.89$, $p < .001$) and performed post-hoc analysis with Dunn’s test, whose results are shown in Figure 5(a).

Takeaway: We find that the Bayes-POMCP policy significantly outperforms our baselines for team performance, rejecting the null hypothesis (Figure 5(a)). We also find that the adversarial Bayes-POMCP is effective in preventing the user from reaching the goal, as reflected by the negative reward.

H2B: Users’ Working Preference and Robot Policy. Users ranked their preferences for working with the different robot agents at the end of the second study. We perform the Kruskal-Wallis test, which shows that the robot policy significantly influences user preferences ($H(2) = 45.41$, $p < .001$). We find that $68\%$ of the users preferred the Bayes-POMCP agent the most, $20\%$ preferred heuristic agents the most, and $88\%$ preferred the Adv-Bayes-POMCP agent the least. $12\%$ ($= 3/25$) did not answer the preference survey.

We also analyzed subjective metrics with Likert scales for trust, willingness to comply, and robot likeability. We conducted three rANOVAs with the subjective metrics as the dependent variables and robot policy, number of rounds completed, demographics (age, gender, prior robotics experience), and the user’s pre-study questionnaire responses as the independent variables. We find that robot policy was statistically significant across all subjective metrics from the three ANOVAs, with our proposed approach having the highest mean values. We then performed post-hoc analysis using Tukey HSD. For further details of the analysis, see Supplementary.

Takeaway: We find that the Bayes-POMCP policy significantly outperforms our baselines across all subjective metrics, and the majority ($68\%$) of the users reported that they would most prefer to work with the Bayes-POMCP agent in the evaluation study.

6.4. Summary of Results

We summarize our key findings from two human-subject experiments and analysis with simulated human models:

  (1) Robot interventions are necessary for improving team performance when both humans and robots are suboptimal due to having non-identical, partial domain knowledge.

  (2) The robot intervention style (interrupt or take-control) can impact both team performance ($p < 0.001$) and user preferences ($p < 0.001$). Users prefer robots that offer explanations for interventions, albeit without performance improvement.

  (3) Our proposed approach, Bayes-POMCP, can intervene effectively with users (both simulated and real human subjects) to maximize human-robot team performance.

  (4) Bayes-POMCP not only enhances team performance but also positively influences users’ preference to collaborate with the robot and their self-reported measures, such as trust and likeability towards the robot.

7. Limitations and Future Work

While our approach successfully improves human-robot team performance in a computationally efficient manner, Bayes-POMCP relies on an environment simulator to estimate the value of human-robot actions in the Monte Carlo search tree, which may not be available for real-world human-robot collaboration tasks. Therefore, in future work, we aim to explore alternative methods, such as deep learning silver2018general for value estimation. Moreover, our findings indicate that while robot explanations positively influenced users’ subjective perceptions, they did not improve team performance. We hypothesize that this may be because the task was relatively simple, and users did not need explanations from the robot to enhance their decision-making. In future work, we seek to assess the utility of explanations in improving team performance for more complex teaming tasks. Finally, our findings are limited to short-horizon interactions, as users only played two rounds of the game with each agent. To address this limitation, our proposed approach can be extended to longitudinal HRI tasks, where robots must anticipate and adapt to changes in user behavior or preferences over time.

8. Conclusion

In this work, we propose an online Bayesian approach, Bayes-POMCP, to optimize performance in mixed-initiative human-robot teams when both agents are suboptimal. Our focus is on learning a robot policy for effective user intervention. We find that robot interventions can improve performance while recognizing diverse user preferences. Next, we evaluate Bayes-POMCP, and show its effectiveness in improving team performance across different simulated human models and real users. We address the computational challenges in solving POMDPs by using a Monte-Carlo search with belief approximation and using conjugate priors to perform belief updates efficiently. In future work, we plan to continue evaluating our algorithm for long-horizon interactions and extend it beyond grid-world domains to real-world human-robot collaboration tasks.

Acknowledgments

This work was sponsored by a gift from Konica Minolta and the National Institutes of Health (NIH) under Grant 1RO1HL157457.

References

  • [1] Samuel Barrett, Avi Rosenfeld, Sarit Kraus, and Peter Stone. Making friends on the fly: Cooperating with new teammates. Artificial Intelligence, 242:132–171, 2017.
  • [2] Christoph Bartneck, Dana Kulić, Elizabeth Croft, and Susana Zoghbi. Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. International journal of social robotics, 1:71–81, 2009.
  • [3] Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006.
  • [4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • [5] Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. On the utility of learning about humans for human-AI coordination. Advances in neural information processing systems, 32, 2019.
  • [6] Min Chen, Stefanos Nikolaidis, Harold Soh, David Hsu, and Siddhartha Srinivasa. Planning with trust for human-robot collaboration. In Proceedings of the 2018 ACM/IEEE international conference on human-robot interaction, pages 307–315, 2018.
  • [7] Manolis Chiou, Nick Hawes, and Rustam Stolkin. Mixed-initiative variable autonomy for remotely operated mobile robots. ACM Transactions on Human-Robot Interaction (THRI), 10(4):1–34, 2021.
  • [8] Douglas A Few, David J Bruemmer, and Miles C Walton. Improved human-robot teaming through facilitated initiative. In ROMAN 2006-The 15th IEEE International Symposium on Robot and Human Interactive Communication, pages 171–176. IEEE, 2006.
  • [9] Lewis R. Goldberg. The Development of Markers for the Big-Five Factor Structure. Psychological Assessment, 4(1):26–42, 1992.
  • [10] Joey Hong, Anca Dragan, and Sergey Levine. Learning to influence human behavior with offline reinforcement learning. arXiv preprint arXiv:2303.02265, 2023.
  • [11] Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. “other-play” for zero-shot coordination. In International Conference on Machine Learning, pages 4399–4410. PMLR, 2020.
  • [12] J Isaacs, Kevin Knoedler, Andrew Herdering, Mishell Beylik, and Hugo Quintero. Teleoperation for urban search and rescue applications. Field Robotics, 2:1177–1190, 2022.
  • [13] Hong Jun Jeon, Dylan P Losey, and Dorsa Sadigh. Shared autonomy with learned latent actions. arXiv preprint arXiv:2005.03210, 2020.
  • [14] Shu Jiang and Ronald C Arkin. Mixed-initiative human-robot interaction: definition, taxonomy, and survey. In 2015 IEEE International conference on systems, man, and cybernetics, pages 954–961. IEEE, 2015.
  • [15] Gregory Kahn, Adam Villaflor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017.
  • [16] Sammie Katt, Frans A Oliehoek, and Christopher Amato. Learning in pomdps with monte carlo tree search. In International Conference on Machine Learning, pages 1819–1827. PMLR, 2017.
  • [17] Jill Kickul and George Neuman. Emergent leadership behaviors: The function of personality and cognitive ability in determining teamwork performance and ksas. Journal of Business and Psychology, 15:27–51, 2000.
  • [18] Glen Klien, David D Woods, Jeffrey M Bradshaw, Robert R Hoffman, and Paul J Feltovich. Ten challenges for making automation a “team player” in joint human-agent activity. IEEE Intelligent Systems, 19(6):91–95, 2004.
  • [19] Minae Kwon, Erdem Biyik, Aditi Talati, Karan Bhasin, Dylan P. Losey, and Dorsa Sadigh. When humans aren’t optimal: Robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’20, page 43–52, New York, NY, USA, 2020. Association for Computing Machinery.
  • [20] Mikko Lauri, David Hsu, and Joni Pajarinen. Partially observable markov decision processes in robotics: A survey. IEEE Transactions on Robotics, 39(1):21–40, 2022.
  • [21] Jin Joo Lee, Fei Sha, and Cynthia Breazeal. A bayesian theory of mind approach to nonverbal communication. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 487–496. IEEE, 2019.
  • [22] Joshua Lee, Jeffrey Fong, Bing Cai Kok, and Harold Soh. Getting to know one another: Calibrating intent, capabilities and trust for human-robot collaboration. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6296–6303. IEEE, 2020.
  • [23] Xingzhou Lou, Jiaxian Guo, Junge Zhang, Jun Wang, Kaiqi Huang, and Yali Du. Pecan: Leveraging policy ensemble for context-aware zero-shot human-ai coordination. arXiv preprint arXiv:2301.06387, 2023.
  • [24] B.M. Muir. Operators’ Trust in and Use of Automatic Controllers in a Supervisory Process Control Task. Canadian theses on microfiche. University of Toronto, 1989.
  • [25] Amal Nanavati, Christoforos I Mavrogiannis, Kevin Weatherwax, Leila Takayama, Maya Cakmak, and Siddhartha S Srinivasa. Modeling human helpfulness with individual and contextual factors for robot planning. In Robotics: Science and Systems, 2021.
  • [26] Manisha Natarajan and Matthew Gombolay. Effects of anthropomorphism and accountability on trust in human robot interaction. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pages 33–42, 2020.
  • [27] Manisha Natarajan, Esmaeil Seraj, Batuhan Altundas, Rohan Paleja, Sean Ye, Letian Chen, Reed Jensen, Kimberlee Chestnut Chang, and Matthew Gombolay. Human-robot teaming: Grand challenges. Current Robotics Reports, pages 1–20, 2023.
  • [28] Brenda Ng, Kofi Boakye, Carol Meyers, and Andrew Wang. Bayes-adaptive interactive pomdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 1408–1414, 2012.
  • [29] Rohan Paleja, Muyleng Ghuy, Nadun Ranawaka Arachchige, Reed Jensen, and Matthew Gombolay. The utility of explainable ai in ad hoc human-machine teaming. Advances in neural information processing systems, 34:610–623, 2021.
  • [30] Stefania Pellegrinelli, Henny Admoni, Shervin Javdani, and Siddhartha Srinivasa. Human-robot shared workspace collaboration via hindsight optimization. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 831–838. IEEE, 2016.
  • [31] Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, SM Ali Eslami, and Matthew Botvinick. Machine theory of mind. In International conference on machine learning, pages 4218–4227. PMLR, 2018.
  • [32] Isabel Raemdonck and Jan-Willem Strijbos. Feedback perceptions and attribution by secretarial employees: Effects of feedback-content and sender characteristics. European Journal of Training and Development, 37(1):24–48, 2013.
  • [33] Aditi Ramachandran, Sarah Strohkorb Sebo, and Brian Scassellati. Personalized robot tutoring using the assistive tutor pomdp (at-pomdp). In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8050–8057, 2019.
  • [34] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive pomdps. Advances in neural information processing systems, 20, 2007.
  • [35] Dorsa Sadigh, Shankar Sastry, Sanjit A Seshia, and Anca D Dragan. Planning for autonomous cars that leverage effects on human actions. In Robotics: Science and systems, volume 2, pages 1–9. Ann Arbor, MI, USA, 2016.
  • [36] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
  • [37] David Silver and Joel Veness. Monte-carlo planning in large pomdps. Advances in neural information processing systems, 23, 2010.
  • [38] Herbert A Simon. Theories of bounded rationality. In C. B. McGuire and R. Radner, editors, Decision and Organization. North-Holland, Amsterdam, 1972.
  • [39] DJ Strouse, Kevin McKee, Matt Botvinick, Edward Hughes, and Richard Everett. Collaborating with humans without human data. Advances in Neural Information Processing Systems, 34:14502–14515, 2021.
  • [40] Ning Wang, David V Pynadath, and Susan G Hill. The impact of pomdp-generated explanations on trust and performance in human-robot teams. In Proceedings of the 2016 international conference on autonomous agents & multiagent systems, pages 997–1005, 2016.
  • [41] Rui Zhao, Jinming Song, Yufeng Yuan, Haifeng Hu, Yang Gao, Yi Wu, Zhongqian Sun, and Wei Yang. Maximum entropy population-based training for zero-shot human-ai coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 6145–6153, 2023.