Online Context Learning for Socially-compliant Navigation

Iaroslav Okunevich1, Alexandre Lombard1, Tomas Krajnik2, Yassine Ruichek1, Zhi Yan1∗ This work was supported by the Bourgogne-Franche-Comté regional research project LOST-CoRoNa, the CZ MSMT (No. 8J23FR023) and PHC Barrande (No. 49275QM) project 3L4AR.1UTBM, CIAD UMR 7533, F-90010 Belfort, France. [email protected]2Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czechia. [email protected]Corresponding Author.
Abstract

Robot social navigation needs to adapt to different human factors and environmental contexts. However, since these factors and contexts are difficult to predict and cannot be exhaustively enumerated, traditional learning-based methods have difficulty in ensuring the social attributes of robots in long-term and cross-environment deployments. This letter introduces an online context learning method that aims to empower robots to adapt to new social environments online. The proposed method adopts a two-layer structure. The bottom layer is built using a deep reinforcement learning-based method to ensure the output of basic robot navigation commands. The upper layer is implemented using an online robot learning-based method to socialize the control commands suggested by the bottom layer. Experiments using a community-wide simulator show that our method outperforms the state-of-the-art ones. Experimental results in the most challenging scenarios show that our method improves the performance of the state-of-the-art by 8%. The source code of the proposed method, the data used, and the tools for the pre-training step will be publicly available at https://github.com/Nedzhaken/SOCSARL-OL.

I Introduction

Mobile robots are becoming more and more popular nowadays in various applications. These applications can be classified into two categories: human-free and human-presence in the robot’s work area. For the former, robots do not need to take additional account of human participation, and hence their social nature. Typical examples include robots in fully automated factories [1, 2], or robots performing exploration tasks in hazardous environments [3, 4]. For applications in which humans are present, such as robots used for supermarket cleaning [5], hospital disinfection [6], or elderly care [7], they need to take into account human behavior and their social attributes in their task and motion planning [8]. In these applications, robots must achieve high-quality human-robot interaction (HRI) to ensure that human activity performance or comfort is not compromised [9, 10].

Effective HRI between mobile robots and humans can be achieved at both the hardware and software levels. Socially-compliant robot navigation is a typical example of HRI software solutions. The core idea is to apply human social rules to HRI while avoiding collisions with people. This kind of HRI is usually indirect, or implicit. The rules applied can be based on environmental [11] or social [12] context. The social compliance that a robot should demonstrate when navigating can be achieved through special reward functions in deep reinforcement learning (DRL) algorithms [13], attention scores of people around the robot [14], spatiotemporal representations of people around [15, 16], and so forth [9]. However, the social rules used in these methods are predefined and constant because they are considered to be general and universal. This is distinct from real-life human behavior, which changes and adapts to new social conditions.

Key challenges in robot social navigation include planning the robot’s task and motion, enabling the robot to behave in a socially-compliant manner, and effectively evaluating different approaches. In our previous work, we focused on standardizable evaluation procedures for social navigation [9]. In this letter, we focus on the behavior of robots. Our research goal is to ensure the social efficiency and navigation robustness of mobile robots when deployed in new social scenarios. To this end, an online robot learning (ORL) approach is proposed to achieve the goal, as it is able to train knowledge (discriminative) models spontaneously and autonomously in new contexts without manual intervention, thereby providing meaningful social cues for downstream navigation tasks.

Refer to caption
Figure 1: Conceptual diagram of our proposed approach. The navigation module selects the robot’s next action. The social module adds social value to the proposed action. The online social context learning method updates the social module to adapt it to the new social environment, which is represented by human trajectories.

A conceptual diagram of our method is shown in Fig. 1. Specifically, we first utilize DRL as a training algorithm for our foundation model to control the robot motion, as it is proven effective for robot navigation in crowds [14]. We then build a “social module” to evaluate the social efficiency of the robot trajectory, including traveled and will follow, and adjust the robot motion to achieve socially-compliant behavior by intervening in the foundation model. The module aims to extract the hidden social rules of human movement and is initially trained with human social trajectory data. However, it is likely that the social context of these training data does not correspond to the new one in which the robot is deployed. Therefore, the module is adapted to new contexts by being updated on-the-fly with on-site data while the robot is operating.

The contribution of this letter is threefold.

  • We propose to combine DRL with ORL to enable mobile robots to adapt to new social contexts efficiently and autonomously. Its specific implementation includes building a social module that can be updated online and promptly intervening in the robot’s DRL-based navigation system to ensure the robot’s social efficiency after switching contexts, without human intervention.

  • The social module is initially trained on the state-of-the-art (SOTA) contextually-rich THOR-Magni dataset [17], which contains social trajectories of humans in shared spaces with a robot. We supplement this dataset with non-social trajectory data.

  • We evaluate various social navigation methods, both ours and others, in new social contexts that are different from the one in which these methods were originally trained. We also complicate the experimental conditions to understand the robustness of the methods.

II Related Work

II-A Socially-compliant Navigation

A mainstream approach to enabling robots to exhibit social behaviors during navigation is based on learning, represented by imitation learning and DRL. For the former, the simulated data can be the robot behavior controlled by humans in a real environment [18] or generated in a simulated environment [19]. It is worth pointing out that the human agents in [19] are controlled by the classical social force (SF) model. The limitations of existing imitation learning methods still lie in their reliance on expert demonstrations and their difficulty in adapting to new social environments.

DRL learns robot control policies through reward functions and the agent’s state information. The research axis most relevant to our work starts with the Collision Avoidance with DRL (CADRL) method proposed by Chen et al. [20] for multi-agent communication-free decentralized navigation. This method uses a neural network to parameterize the value function. Subsequently, a socially-aware CADRL, named SA-CADRL [13], was proposed with a modified reward function that enables the robot to obey basic social rules, such as the right- or left-hand rule, when navigating. One limitation of CADRL-based approaches is that the input dimension of the neural network used is fixed, which prevents the robot from taking advantage of information about more people around it.

Everett  et al. [21] broke this limitation and utilized the data of all people around the robot by sending their states sequentially to a long short-term memory (LSTM) network. One limitation of this approach is that the priority of the LSTM input is only inversely proportional to the distance between the human and the robot, i.e., the input with a shorter distance is prioritized, while their relative speed is ignored. As a result, the impact of a human approaching the robot at high speed from a distance may be considered less important than a stationary person near the robot.

Chen et al. [14] proposed a solution to this problem. The idea comprises several steps. Since each person influences other people, the human-human pairwise interaction is modeled by a cost map. After that, an embedding vector is calculated for each human, based on their state and the cost map. The next step is calculating the attention score for each individual based on the individual embedding vector and the mean embedding vector of all detected persons. The final representation of the people around the robot is a linear combination of the individual attention score and the pairwise interaction vector of each person. This crowd representation facilitates highly efficient crowd navigation, an essential element of social navigation. Nevertheless, this method does not incorporate explicit or implicit social rules, except for the requirement of social distance in the reward function.

The preceding methods, except the LSTM method [21], are predicated solely on the spatial dimension of HRI. Liu et al. [22] proposed a Decentralized Structural Recurrent Neural Network (DSRNN) to leverage the spatial and temporal relationships for crowd navigation. Yang et al. [15] suggested applying a spatial-temporal state transformer for more effective crowd navigation. While the spatial-temporal transformer structure appears to be a more effective baseline for the social navigation method, the evaluation below will demonstrate that this method is less effective in crowd navigation compared with the attention mechanism [14].

II-B Online Context Learning

The aforementioned methods are all based on offline trained models and are therefore challenged by new social contexts. These contexts are diverse and even humans understand them differently [23]. There are only a few works related to ours. Shahrezaie et al. [11] proposed an online context robot navigation. Contexts and corresponding navigation rules are predefined based on the interviews. The content detection system allows the robot to change navigation online, depending on the environment. Lui et al. [24] proposed Lifelong Learning for Navigation (LLfN), which includes an additional module that corrects the robot’s behavior in complex contexts and updates online to learn new contexts. Although this work focuses on the problem of forgetting navigation experience and does not include social contexts, the structure of a main navigation method and an extra correction module is similar to our proposal.

II-C Discussion

According to our investigation, existing methods still struggle to meet the performance requirements of robots for social navigation in changing environments or across environments. It is an intuitive idea to directly perform online context updates on the social navigation model. However, taking the SOTA method SARL as an example, there are two obvious challenges. First, an update is difficult to complete in a short time, which goes against the spirit of “fast learning and immediate deployment” of online robot learning [25]. Second, the update of the model also has a clear demand for computing resources, which poses a challenge to deploying the model to the edge [26].

Therefore, considering the current algorithm and hardware development, it is a good choice to modularize the social context and only update the social module online. Additional benefits of doing so include, first, the developed social module can also be easily integrated into other navigation systems. Second, this solution is more in line with the requirements of the robustness of the robot system, that is, the social module will not harm the output of the robot navigation module, but only serves as an auxiliary. Finally, it has better interpretability compared to the end-to-end methods [13, 21, 14, 15].

III Method

We propose a novel architecture for robot social navigation that consists of two layers. The underlying layer is a DRL-based robot navigation method due to its ability to handle complex navigation problems. In addition, at this layer we extend considerations of safe distance to social distance. SARL is therefore used in our concrete implementation due to its demonstrated SOTA performance. The upper layer is a novel ORL-based social module that learns social context from human trajectory data and socializes the robot navigation control commands output by the bottom layer. In order to adapt to different social environments, the module is updated online based on new social contexts. In the remainder of this section, we first formulate the target problems, then propose an improvement to the original SARL, and finally detail our online learnable social module.

III-A Problem Formulation

Corresponding to reinforcement learning (RL), robot navigation can be formalized as a Markov decision process (MDP). The objective is to determine the optimal strategy for robot control through interaction with the environment. In social navigation, in addition to the environment, the robot should also consider the social attributes of humans:

J(π,θ)=s𝒮ρ(s)a𝒜π(a|s)s𝒮P(s|s,a)[R(s,a,s)+γH(s,θ)]𝐽𝜋𝜃subscript𝑠𝒮𝜌𝑠subscript𝑎𝒜𝜋|𝑎𝑠subscriptsuperscript𝑠𝒮𝑃conditionalsuperscript𝑠𝑠𝑎delimited-[]𝑅𝑠𝑎superscript𝑠𝛾𝐻superscript𝑠𝜃\begin{split}J(\pi,\theta)=&\sum_{s\in\mathcal{S}}\rho(s)\sum_{a\in\mathcal{A}% }\pi(a|s)\cdot\\ &\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)[R(s,a,s^{\prime})+\gamma H(s% ^{\prime},\theta)]\end{split}start_ROW start_CELL italic_J ( italic_π , italic_θ ) = end_CELL start_CELL ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_ρ ( italic_s ) ∑ start_POSTSUBSCRIPT italic_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_π ( italic_a | italic_s ) ⋅ end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ∑ start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s , italic_a ) [ italic_R ( italic_s , italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) + italic_γ italic_H ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_θ ) ] end_CELL end_ROW (1)

where s𝒮ρ(s)subscript𝑠𝒮𝜌𝑠\sum_{s\in\mathcal{S}}\rho(s)∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_ρ ( italic_s ) represents all possible states s𝑠sitalic_s in the environment, weighted by the initial state distribution ρ(s)𝜌𝑠\rho(s)italic_ρ ( italic_s ); a𝒜π(a|s)subscript𝑎𝒜𝜋conditional𝑎𝑠\sum_{a\in\mathcal{A}}\pi(a|s)∑ start_POSTSUBSCRIPT italic_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_π ( italic_a | italic_s ) indicates all possible actions a𝑎aitalic_a available in a state s𝑠sitalic_s, weighted by the policy’s probability of taking that action π(a|s)𝜋conditional𝑎𝑠\pi(a|s)italic_π ( italic_a | italic_s ); s𝒮P(s|s,a)subscriptsuperscript𝑠𝒮𝑃conditionalsuperscript𝑠𝑠𝑎\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)∑ start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s , italic_a ) represents all possible next states ssuperscript𝑠s^{\prime}italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT that the robot could transition to after taking action a𝑎aitalic_a in state s𝑠sitalic_s, weighted by the transition probability P(s|s,a)𝑃conditionalsuperscript𝑠𝑠𝑎P(s^{\prime}|s,a)italic_P ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s , italic_a ); R(s,a,s)𝑅𝑠𝑎superscript𝑠R(s,a,s^{\prime})italic_R ( italic_s , italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) is the immediate reward received for taking action a𝑎aitalic_a in state s𝑠sitalic_s and transitioning to state ssuperscript𝑠s^{\prime}italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT; γ𝛾\gammaitalic_γ is the discount factor; and H(s,θ)𝐻superscript𝑠𝜃H(s^{\prime},\theta)italic_H ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_θ ) is the human-related cost function evaluated at the next state ssuperscript𝑠s^{\prime}italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT with parameter vector θ𝜃\thetaitalic_θ. H(s,θ)𝐻superscript𝑠𝜃H(s^{\prime},\theta)italic_H ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_θ ) captures the impact of the environment’s state on the robot’s objective, considering human presence or behavior, which can be implemented based on hand-crafted features or learned models. A specific implementation of this function is detailed in Section III-C.

Then the reward function taking into account the robot’s sociality can be composed of three parts:

R(s,a)=Rn(s)+Rc(s,a)+Rh(s,a)𝑅𝑠𝑎subscript𝑅𝑛𝑠subscript𝑅𝑐𝑠𝑎subscript𝑅𝑠𝑎R(s,a)=R_{n}(s)+R_{c}(s,a)+R_{h}(s,a)italic_R ( italic_s , italic_a ) = italic_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_s ) + italic_R start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ( italic_s , italic_a ) + italic_R start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s , italic_a ) (2)

where Rn(s)subscript𝑅𝑛𝑠R_{n}(s)italic_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_s ) represents the reward for progress towards the goal, Rc(s,a)subscript𝑅𝑐𝑠𝑎R_{c}(s,a)italic_R start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ( italic_s , italic_a ) indicates the reward for maintaining a comfortable social distance between robot and humans, and Rh(s,a)subscript𝑅𝑠𝑎R_{h}(s,a)italic_R start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s , italic_a ) is compliance reward for adhering to social norms and cultural context. In our proposed method, Rn(s)subscript𝑅𝑛𝑠R_{n}(s)italic_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_s ) and Rc(s,a)subscript𝑅𝑐𝑠𝑎R_{c}(s,a)italic_R start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ( italic_s , italic_a ) are calculated by the DRL layer (cf. Section III-B), while Rh(s,a)subscript𝑅𝑠𝑎R_{h}(s,a)italic_R start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_s , italic_a ) is provided by the ORL layer (cf. Section III-C).

On the other hand, since social environments are inherently changing, robots are expected to be able to update their policies online in order to adapt to the changes:

πt(a|s,t+1)=fd(πs(a|s,t),Rt(s,a,s),s,t+1)subscript𝜋𝑡conditional𝑎𝑠𝑡1subscript𝑓𝑑subscript𝜋𝑠conditional𝑎𝑠𝑡subscript𝑅𝑡𝑠𝑎superscript𝑠superscript𝑠𝑡1\pi_{t}(a|s,t+1)=f_{d}(\pi_{s}(a|s,t),R_{t}(s,a,s^{\prime}),s^{\prime},t+1)italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_a | italic_s , italic_t + 1 ) = italic_f start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_a | italic_s , italic_t ) , italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s , italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_t + 1 ) (3)

where πt(a|s,t+1)subscript𝜋𝑡conditional𝑎𝑠𝑡1\pi_{t}(a|s,t+1)italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_a | italic_s , italic_t + 1 ) represents the policy at time t+1𝑡1t+1italic_t + 1; and πs(a|s,t)subscript𝜋𝑠conditional𝑎𝑠𝑡\pi_{s}(a|s,t)italic_π start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_a | italic_s , italic_t ) represents the policy at time t𝑡titalic_t. Building an effective fdsubscript𝑓𝑑f_{d}italic_f start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT is the key.

More specifically, MDP includes the next important components: the joint state stjnsubscriptsuperscripts𝑗𝑛𝑡\textbf{s}^{jn}_{t}s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT at time t𝑡titalic_t, action of agent atsubscripta𝑡\textbf{a}_{t}a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT at time t, reward Rt(stjn,at)subscriptR𝑡subscriptsuperscripts𝑗𝑛𝑡subscripta𝑡\textbf{R}_{t}(\textbf{s}^{jn}_{t},\textbf{a}_{t})R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) at time t, which the agent receive for its action and transition probability Pt(stjn,at,st+Δtjn)subscriptP𝑡subscriptsuperscripts𝑗𝑛𝑡subscripta𝑡subscriptsuperscripts𝑗𝑛𝑡Δ𝑡\textbf{P}_{t}(\textbf{s}^{jn}_{t},\textbf{a}_{t},\textbf{s}^{jn}_{t+\Delta t})P start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + roman_Δ italic_t end_POSTSUBSCRIPT ). The joint state at time t equals a combination of robot-human states at time t stjn=[rht1,,rhtn]subscriptsuperscripts𝑗𝑛𝑡subscriptsuperscriptrh1𝑡subscriptsuperscriptrh𝑛𝑡\textbf{s}^{jn}_{t}=[\textbf{rh}^{1}_{t},...,\textbf{rh}^{n}_{t}]s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ rh start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , rh start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ], where rhtnsubscriptsuperscriptrh𝑛𝑡\textbf{rh}^{n}_{t}rh start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the joint robot-human state of nth human. The robot-human state is a combination of robot and ith human state around the robot rhti=[str,hti]subscriptsuperscriptrh𝑖𝑡subscriptsuperscripts𝑟𝑡subscriptsuperscripth𝑖𝑡\textbf{rh}^{i}_{t}=[\textbf{s}^{r}_{t},\textbf{h}^{i}_{t}]rh start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ s start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , h start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ]. These states are defined as:

str=[px,py,vx,vy,rr,gx,gy,θ,vpref]t,subscriptsuperscript𝑠𝑟𝑡subscriptsubscript𝑝𝑥subscript𝑝𝑦subscript𝑣𝑥subscript𝑣𝑦subscript𝑟𝑟subscript𝑔𝑥subscript𝑔𝑦𝜃subscript𝑣𝑝𝑟𝑒𝑓𝑡s^{r}_{t}=[p_{x},p_{y},v_{x},v_{y},r_{r},g_{x},g_{y},\theta,v_{pref}]_{t},italic_s start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_θ , italic_v start_POSTSUBSCRIPT italic_p italic_r italic_e italic_f end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , (4)
hti=[pxi,pyi,vxi,vyi,rhi]t,subscriptsuperscript𝑖𝑡subscriptsubscriptsuperscript𝑝𝑖𝑥subscriptsuperscript𝑝𝑖𝑦subscriptsuperscript𝑣𝑖𝑥subscriptsuperscript𝑣𝑖𝑦subscriptsuperscript𝑟𝑖𝑡h^{i}_{t}=[p^{i}_{x},p^{i}_{y},v^{i}_{x},v^{i}_{y},r^{i}_{h}]_{t},italic_h start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ italic_p start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_p start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_v start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_v start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , (5)

where pxsubscript𝑝𝑥p_{x}italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, pysubscript𝑝𝑦p_{y}italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT and pxisubscriptsuperscript𝑝𝑖𝑥p^{i}_{x}italic_p start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, pyisubscriptsuperscript𝑝𝑖𝑦p^{i}_{y}italic_p start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT represent the robot and ith human positions, vxsubscript𝑣𝑥v_{x}italic_v start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, vysubscript𝑣𝑦v_{y}italic_v start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT and vxisubscriptsuperscript𝑣𝑖𝑥v^{i}_{x}italic_v start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, vyisubscriptsuperscript𝑣𝑖𝑦v^{i}_{y}italic_v start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT are the robot and ith human velocities, rrsubscript𝑟𝑟r_{r}italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and rhisubscriptsuperscript𝑟𝑖r^{i}_{h}italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT are the robot and ith human radius. gxsubscript𝑔𝑥g_{x}italic_g start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, gysubscript𝑔𝑦g_{y}italic_g start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT are the goal positions, θ𝜃\thetaitalic_θ is the robot orientation and vprefsubscript𝑣𝑝𝑟𝑒𝑓v_{pref}italic_v start_POSTSUBSCRIPT italic_p italic_r italic_e italic_f end_POSTSUBSCRIPT is the preferred robot speed. Position, speed, and radius are observable parameters and are available to other agents. The robot-human states are updated to follow the robot-centric parameterisation, where the x-axis points to the robot’s goal and the robot’s position is the origin. The final shape of the ith robot-human state, which forms the input for the value neural network, is:

rhti=[dg,vpref,θ,rr,vx,vy,pxi,pyi,vxi,vyi,rhi,di,rsum]t,𝑟subscriptsuperscript𝑖𝑡subscriptsubscript𝑑𝑔subscript𝑣𝑝𝑟𝑒𝑓𝜃subscript𝑟𝑟subscript𝑣𝑥subscript𝑣𝑦subscriptsuperscript𝑝𝑖𝑥subscriptsuperscript𝑝𝑖𝑦subscriptsuperscript𝑣𝑖𝑥subscriptsuperscript𝑣𝑖𝑦subscriptsuperscript𝑟𝑖subscript𝑑𝑖subscript𝑟𝑠𝑢𝑚𝑡{rh}^{i}_{t}=[d_{g},v_{pref},\theta,r_{r},v_{x},v_{y},p^{i}_{x},p^{i}_{y},v^{i% }_{x},v^{i}_{y},r^{i}_{h},d_{i},r_{sum}]_{t},italic_r italic_h start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ italic_d start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_p italic_r italic_e italic_f end_POSTSUBSCRIPT , italic_θ , italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_p start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_p start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_v start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_v start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_r start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , (6)

where dgsubscript𝑑𝑔d_{g}italic_d start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT is the distance from the robot to the goal, disubscript𝑑𝑖d_{i}italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the distance to the ith human, and rsumsubscript𝑟𝑠𝑢𝑚r_{sum}italic_r start_POSTSUBSCRIPT italic_s italic_u italic_m end_POSTSUBSCRIPT is the sum of robot and ith human radius.

III-B SARL

In SARL’s value-based reinforcement learning framework, the goal is to discover the optimal navigation policy, denoted as π(stjn)superscript𝜋subscriptsuperscript𝑠𝑗𝑛𝑡\pi^{*}(s^{jn}_{t})italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ), to achieve the highest value V(stjn)superscript𝑉subscriptsuperscript𝑠𝑗𝑛𝑡V^{*}(s^{jn}_{t})italic_V start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) for the joined state stjnsubscriptsuperscript𝑠𝑗𝑛𝑡s^{jn}_{t}italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT at time t𝑡titalic_t:

V(stjn)=i=tTγivprefRt(stjn,π(stjn)),superscript𝑉subscriptsuperscript𝑠𝑗𝑛𝑡superscriptsubscript𝑖𝑡𝑇superscript𝛾𝑖subscript𝑣𝑝𝑟𝑒𝑓subscript𝑅𝑡subscriptsuperscript𝑠𝑗𝑛𝑡superscript𝜋subscriptsuperscript𝑠𝑗𝑛𝑡V^{*}(s^{jn}_{t})=\sum_{i=t}^{T}\gamma^{i\cdot v_{pref}}\cdot R_{t}(s^{jn}_{t}% ,\pi^{*}(s^{jn}_{t})),italic_V start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i = italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_i ⋅ italic_v start_POSTSUBSCRIPT italic_p italic_r italic_e italic_f end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⋅ italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) , (7)

where γ[0,1)𝛾01\gamma\in[0,1)italic_γ ∈ [ 0 , 1 ) is the discount factor, vprefsubscript𝑣𝑝𝑟𝑒𝑓v_{pref}italic_v start_POSTSUBSCRIPT italic_p italic_r italic_e italic_f end_POSTSUBSCRIPT is used as a normalization parameter of γ𝛾\gammaitalic_γ for numerical reasons [20], and T𝑇Titalic_T is the time of the final state. A constant velocity model is applied to estimate the subsequent states of humans over a short temporal interval ΔtΔ𝑡\Delta troman_Δ italic_t and to approximate the next joint state st+Δtjnsubscriptsuperscript𝑠𝑗𝑛𝑡Δ𝑡s^{jn}_{t+\Delta t}italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + roman_Δ italic_t end_POSTSUBSCRIPT based on stjnsubscriptsuperscript𝑠𝑗𝑛𝑡s^{jn}_{t}italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, then the optimal strategy can be formulated as:

π(stjn)=argmaxat[Rt(stjn,at)+γΔtvprefV(st+Δtjn)],superscript𝜋subscriptsuperscript𝑠𝑗𝑛𝑡subscriptsubscript𝑎𝑡subscript𝑅𝑡subscriptsuperscript𝑠𝑗𝑛𝑡subscript𝑎𝑡superscript𝛾Δ𝑡subscript𝑣𝑝𝑟𝑒𝑓𝑉subscriptsuperscript𝑠𝑗𝑛𝑡Δ𝑡\pi^{*}(s^{jn}_{t})=\arg\max_{a_{t}}[R_{t}(s^{jn}_{t},a_{t})+\gamma^{\Delta t% \cdot v_{pref}}\cdot V(s^{jn}_{t+\Delta t})],italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = roman_arg roman_max start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + italic_γ start_POSTSUPERSCRIPT roman_Δ italic_t ⋅ italic_v start_POSTSUBSCRIPT italic_p italic_r italic_e italic_f end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⋅ italic_V ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + roman_Δ italic_t end_POSTSUBSCRIPT ) ] , (8)

where the value network is trained by the temporal-difference method with experience replay [13, 14, 15].

The speed and angular direction define the robot’s action space. The holonomic kinematics have modeled the movements, enabling the agent to move in any direction with any acceleration. The action space comprises 80 actions. The 5 linear speeds are distributed between 00 to vprefsubscript𝑣𝑝𝑟𝑒𝑓v_{pref}italic_v start_POSTSUBSCRIPT italic_p italic_r italic_e italic_f end_POSTSUBSCRIPT and the angular direction is spaced from 00 to 2π2𝜋2\pi2 italic_π.

In original SARL, the reward is equal to 1 if the robot reaches the target position. This reward is changed to dplan/drealsubscript𝑑𝑝𝑙𝑎𝑛subscript𝑑𝑟𝑒𝑎𝑙d_{plan}/d_{real}italic_d start_POSTSUBSCRIPT italic_p italic_l italic_a italic_n end_POSTSUBSCRIPT / italic_d start_POSTSUBSCRIPT italic_r italic_e italic_a italic_l end_POSTSUBSCRIPT in ours, where dplansubscript𝑑𝑝𝑙𝑎𝑛d_{plan}italic_d start_POSTSUBSCRIPT italic_p italic_l italic_a italic_n end_POSTSUBSCRIPT is the planned driving distance from its current position to the target one before the robot starts moving, while drealsubscript𝑑𝑟𝑒𝑎𝑙d_{real}italic_d start_POSTSUBSCRIPT italic_r italic_e italic_a italic_l end_POSTSUBSCRIPT is the actual distance traveled by the robot after it reaches the specified location. The interpretation of this change is that the robot should minimize unnecessary local path adjustments while maintaining socially-compliant navigation. Thus the full reward function becomes:

Rt(stjn,at)={0.25if dmin<0dplan/drealelse if dg=00.1+dmin/2else if dmin<dc0otherwise,subscript𝑅𝑡subscriptsuperscript𝑠𝑗𝑛𝑡subscript𝑎𝑡cases0.25if dmin<0subscript𝑑𝑝𝑙𝑎𝑛subscript𝑑𝑟𝑒𝑎𝑙else if dg=00.1subscript𝑑𝑚𝑖𝑛2else if dmin<dc0otherwiseR_{t}(s^{jn}_{t},a_{t})=\begin{cases}-0.25&\text{if $d_{min}<0$}\\ d_{plan}/d_{real}&\text{else if $d_{g}=0$}\\ -0.1+d_{min}/2&\text{else if $d_{min}<d_{c}$}\\ 0&\text{otherwise}\end{cases},italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = { start_ROW start_CELL - 0.25 end_CELL start_CELL if italic_d start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT < 0 end_CELL end_ROW start_ROW start_CELL italic_d start_POSTSUBSCRIPT italic_p italic_l italic_a italic_n end_POSTSUBSCRIPT / italic_d start_POSTSUBSCRIPT italic_r italic_e italic_a italic_l end_POSTSUBSCRIPT end_CELL start_CELL else if italic_d start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT = 0 end_CELL end_ROW start_ROW start_CELL - 0.1 + italic_d start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT / 2 end_CELL start_CELL else if italic_d start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT < italic_d start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL otherwise end_CELL end_ROW , (9)

where dminsubscript𝑑𝑚𝑖𝑛d_{min}italic_d start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT is the distance between the robot and the nearest person, and dcsubscript𝑑𝑐d_{c}italic_d start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT is the comfortable social distance. dcsubscript𝑑𝑐d_{c}italic_d start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT should be adjusted according to the spatio-temporal characteristics. For example, it can be higher in large malls or during epidemics, and smaller in small stores or during normal periods.

III-C Social Module

The task of the social module is to implement social modifications to the SARL output. Intuitively, if the action atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT suggested by SARL is considered to be social, its corresponding value should be increased so that it is selected by the optimal policy π(stjn)superscript𝜋subscriptsuperscript𝑠𝑗𝑛𝑡\pi^{*}(s^{jn}_{t})italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). Otherwise, its corresponding value should be lowered so that it is discarded by the policy π(stjn)superscript𝜋subscriptsuperscript𝑠𝑗𝑛𝑡\pi^{*}(s^{jn}_{t})italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). Based on the dual consideration of the effectiveness of small-batch online learning [27] and tracklet-based behavioral analysis [28], we employ a tracklet-based sociality assessment method, which is visualized in Fig. 2(a). Specifically, a tracklet is defined as a set of robot instantaneous states:

trt=[st(uΔt)s,st((u1)Δt)s,,sts,st+Δts,,st+((1+f)Δt)s],𝑡subscript𝑟𝑡subscriptsuperscript𝑠𝑠𝑡𝑢Δ𝑡subscriptsuperscript𝑠𝑠𝑡𝑢1Δ𝑡subscriptsuperscript𝑠𝑠𝑡subscriptsuperscript𝑠𝑠𝑡Δ𝑡subscriptsuperscript𝑠𝑠𝑡1𝑓Δ𝑡tr_{t}=[s^{s}_{t-(u\cdot\Delta t)},s^{s}_{t-((u-1)\cdot\Delta t)},\ldots,\\ s^{s}_{t},s^{s}_{t+\Delta t},\ldots,s^{s}_{t+((1+f)\cdot\Delta t)}],start_ROW start_CELL italic_t italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - ( italic_u ⋅ roman_Δ italic_t ) end_POSTSUBSCRIPT , italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - ( ( italic_u - 1 ) ⋅ roman_Δ italic_t ) end_POSTSUBSCRIPT , … , end_CELL end_ROW start_ROW start_CELL italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + roman_Δ italic_t end_POSTSUBSCRIPT , … , italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + ( ( 1 + italic_f ) ⋅ roman_Δ italic_t ) end_POSTSUBSCRIPT ] , end_CELL end_ROW (10)

where each state contains the robot’s position and velocity, denoted as sts=[px,py,vx,vy]tsubscriptsuperscript𝑠𝑠𝑡subscriptsubscript𝑝𝑥subscript𝑝𝑦subscript𝑣𝑥subscript𝑣𝑦𝑡s^{s}_{t}=[p_{x},p_{y},v_{x},v_{y}]_{t}italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = [ italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT; st(uΔt)s,st((u1)Δt)s,subscriptsuperscript𝑠𝑠𝑡𝑢Δ𝑡subscriptsuperscript𝑠𝑠𝑡𝑢1Δ𝑡s^{s}_{t-(u\cdot\Delta t)},s^{s}_{t-((u-1)\cdot\Delta t)},\ldotsitalic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - ( italic_u ⋅ roman_Δ italic_t ) end_POSTSUBSCRIPT , italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - ( ( italic_u - 1 ) ⋅ roman_Δ italic_t ) end_POSTSUBSCRIPT , … represents u𝑢uitalic_u previous states; stssubscriptsuperscript𝑠𝑠𝑡s^{s}_{t}italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT refers to the current state; st+Δtssubscriptsuperscript𝑠𝑠𝑡Δ𝑡s^{s}_{t+\Delta t}italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + roman_Δ italic_t end_POSTSUBSCRIPT represents the state of applying atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to stssubscriptsuperscript𝑠𝑠𝑡s^{s}_{t}italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT; ,st+((1+f)Δt)ssubscriptsuperscript𝑠𝑠𝑡1𝑓Δ𝑡\ldots,s^{s}_{t+((1+f)\cdot\Delta t)}… , italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + ( ( 1 + italic_f ) ⋅ roman_Δ italic_t ) end_POSTSUBSCRIPT represents f𝑓fitalic_f future states, when atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is applied f𝑓fitalic_f times to st+Δtssubscriptsuperscript𝑠𝑠𝑡Δ𝑡s^{s}_{t+\Delta t}italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + roman_Δ italic_t end_POSTSUBSCRIPT. The values of u𝑢uitalic_u and f𝑓fitalic_f are currently determined by experience, in our experiments u=11,f=3formulae-sequence𝑢11𝑓3u=11,f=3italic_u = 11 , italic_f = 3.

Refer to caption
(a) The robot tracklet trt𝑡subscript𝑟𝑡tr_{t}italic_t italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. The blue dots are u𝑢uitalic_u previous states of the robot, the yellow dot is its current state, the red dot is the robot’s state after action atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, and the green dots are f𝑓fitalic_f future prediction states.
Refer to caption
(b) The structure of social value network with GRU and four fully-connected (FC) layers. The input is the robot tracklets. The numbers at the bottom are the input dimensions. Social value (SV) is the output of this network and can be 1 for social tracklet and 0 for non-social tracklet.
Figure 2: A example of robot tracklet and the proposed social value network.

To evaluate the sociality of tracklets we employ a binary classification, i.e. social or non-social. To this end, a social value function is defined as:

SVt(trt)=ψsv(trt,Wsv),𝑆subscript𝑉𝑡𝑡subscript𝑟𝑡subscript𝜓𝑠𝑣𝑡subscript𝑟𝑡subscript𝑊𝑠𝑣SV_{t}(tr_{t})=\psi_{sv}(tr_{t},W_{sv}),italic_S italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_t italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_ψ start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT ( italic_t italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_W start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT ) , (11)

where ψsv()subscript𝜓𝑠𝑣\psi_{sv}(\cdot)italic_ψ start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT ( ⋅ ) is a gated recurrent unit (GRU) [29] with multi-layer perceptron (MLP) and Wsvsubscript𝑊𝑠𝑣W_{sv}italic_W start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT are the weights of it. The GRU-MLP model is visualized in Fig. 2(b), which contains four fully-connected layers, with ReLU nonlinear activation function and batch normalization, and the Sigmoid activation function.

Therefore, the final optimal strategy for robot navigation with the social module is:

π(stjn)=argmaxat[Rt(stjn,at)++γΔtvprefV(st+Δtjn)+ksSVt(trt)],superscript𝜋subscriptsuperscript𝑠𝑗𝑛𝑡subscriptsubscript𝑎𝑡subscript𝑅𝑡subscriptsuperscript𝑠𝑗𝑛𝑡subscript𝑎𝑡superscript𝛾Δ𝑡subscript𝑣𝑝𝑟𝑒𝑓𝑉subscriptsuperscript𝑠𝑗𝑛𝑡Δ𝑡subscript𝑘𝑠𝑆subscript𝑉𝑡𝑡subscript𝑟𝑡\pi^{*}(s^{jn}_{t})=\arg\max_{a_{t}}[R_{t}(s^{jn}_{t},a_{t})+\\ +\gamma^{\Delta t\cdot v_{pref}}\cdot V(s^{jn}_{t+\Delta t})+k_{s}\cdot SV_{t}% (tr_{t})],start_ROW start_CELL italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = roman_arg roman_max start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + end_CELL end_ROW start_ROW start_CELL + italic_γ start_POSTSUPERSCRIPT roman_Δ italic_t ⋅ italic_v start_POSTSUBSCRIPT italic_p italic_r italic_e italic_f end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⋅ italic_V ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + roman_Δ italic_t end_POSTSUBSCRIPT ) + italic_k start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ⋅ italic_S italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_t italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ] , end_CELL end_ROW (12)

where kssubscript𝑘𝑠k_{s}italic_k start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT is the coefficient of social importance during robot navigation, which is set to 0.6 in our experiments, and γΔtvprefV(st+Δtjn)+ksSVt(trt)superscript𝛾Δ𝑡subscript𝑣𝑝𝑟𝑒𝑓𝑉subscriptsuperscript𝑠𝑗𝑛𝑡Δ𝑡subscript𝑘𝑠𝑆subscript𝑉𝑡𝑡subscript𝑟𝑡\gamma^{\Delta t\cdot v_{pref}}\cdot V(s^{jn}_{t+\Delta t})+k_{s}\cdot SV_{t}(% tr_{t})italic_γ start_POSTSUPERSCRIPT roman_Δ italic_t ⋅ italic_v start_POSTSUBSCRIPT italic_p italic_r italic_e italic_f end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ⋅ italic_V ( italic_s start_POSTSUPERSCRIPT italic_j italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + roman_Δ italic_t end_POSTSUBSCRIPT ) + italic_k start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ⋅ italic_S italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_t italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) responds explicitly to our introduction of human factor γH(s,θ)𝛾𝐻superscript𝑠𝜃\gamma H(s^{\prime},\theta)italic_γ italic_H ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_θ ) in Section III-A.

III-D Online Social Context Learning

The social performance of robots is challenged by changing social contexts. Although this problem can theoretically be solved by building massive complete datasets to train general and universal models, in practice, on the one hand, the cost of implementing such idea can be very high, and on the other hand, it is inherently unable to support long-term autonomy of mobile robots because the assumption that the dataset contains all social contexts is invalid [25]. Therefore, a mechanism is introduced in this letter to update the proposed social model online, triggered by the latter’s detection of differences between its internal and external social contexts. If the difference exceeds a tolerance, the module needs to be retrained online on the external context, and the training is iterated until the difference falls within the tolerance.

The proposed online social context learning is detailed in Algorithm 1. First, a buffer is set up to continuously collect the states of the robot and the humans around it (line 2). When the buffer is full (line 9), the states that constitute the robot’s last tracklet and the states that constitute the n𝑛nitalic_n people’s last tracklets are transferred separately (lines 10-11). When the number of stored robot tracklets reaches a threshold (line 12), the efficiency of the social module is analyzed.

Next, the labels of the states in a part of the human tracklets (line 13), i.e., social or non-social, need to be determined. Existing methods are often based on a strong assumption that all human behavior is social [30, 31, 8]. Based on our previous findings [9], we propose here a more elegant labeling method. A metric named extra distance ratio is used, which is defined as:

Rdist=ds/da,subscript𝑅𝑑𝑖𝑠𝑡subscript𝑑𝑠subscript𝑑𝑎R_{dist}=d_{s}/d_{a},italic_R start_POSTSUBSCRIPT italic_d italic_i italic_s italic_t end_POSTSUBSCRIPT = italic_d start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT / italic_d start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT , (13)

where dssubscript𝑑𝑠d_{s}italic_d start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT represents the Euclidean distance from the start point to the end point in a tracklet, while dasubscript𝑑𝑎d_{a}italic_d start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT represents the actual length of the tracklet. The higher the value of Rdistsubscript𝑅𝑑𝑖𝑠𝑡R_{dist}italic_R start_POSTSUBSCRIPT italic_d italic_i italic_s italic_t end_POSTSUBSCRIPT, the better the sociality. Therefore, if a tracklet’s Rdistsubscript𝑅𝑑𝑖𝑠𝑡R_{dist}italic_R start_POSTSUBSCRIPT italic_d italic_i italic_s italic_t end_POSTSUBSCRIPT is higher than a preset threshold, all states on the tracklet are considered social, otherwise non-social (line 14). Through this method, the robot obtains information about the external social context. Then, through the output of the social module, it becomes aware of its internal social context (line 15).

Finally, if the binary accuracy of the two contexts is below a threshold (the lower the accuracy, the greater the difference, line 16), the social model is updated (line 20). To do so, the aforementioned labeling method is still used to label the human set Trh𝑇subscript𝑟Tr_{h}italic_T italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT (line 17) and the robot set Trr𝑇subscript𝑟𝑟Tr_{r}italic_T italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT (line 18) respectively. Then the training set Dnewsubscript𝐷𝑛𝑒𝑤D_{new}italic_D start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT consists of Trh𝑇subscript𝑟Tr_{h}italic_T italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT and non-social samples in Trr𝑇subscript𝑟𝑟Tr_{r}italic_T italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT (line 19). Additionally, to prevent model overfitting, the robot and human sets are purged (line 21) and old data will not be used in the next iteration. Ltraksubscript𝐿𝑡𝑟𝑎𝑘L_{trak}italic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_k end_POSTSUBSCRIPT, Kupsubscript𝐾𝑢𝑝K_{up}italic_K start_POSTSUBSCRIPT italic_u italic_p end_POSTSUBSCRIPT and Kaccsubscript𝐾𝑎𝑐𝑐K_{acc}italic_K start_POSTSUBSCRIPT italic_a italic_c italic_c end_POSTSUBSCRIPT are hyperparameters and need to be fine tuned. In our experiments, Ltrak=16subscript𝐿𝑡𝑟𝑎𝑘16L_{trak}=16italic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_k end_POSTSUBSCRIPT = 16, Kup=3subscript𝐾𝑢𝑝3K_{up}=3italic_K start_POSTSUBSCRIPT italic_u italic_p end_POSTSUBSCRIPT = 3 and Kacc=0.5subscript𝐾𝑎𝑐𝑐0.5K_{acc}=0.5italic_K start_POSTSUBSCRIPT italic_a italic_c italic_c end_POSTSUBSCRIPT = 0.5.

1 Input: ψsv(,Wsv)subscript𝜓𝑠𝑣subscript𝑊𝑠𝑣\psi_{sv}(\cdot,W_{sv})italic_ψ start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT ( ⋅ , italic_W start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT ): the social neural network
2 Ttsubscript𝑇𝑡T_{t}italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT: the buffered human and robot states at time t𝑡titalic_t
3 Ltraksubscript𝐿𝑡𝑟𝑎𝑘L_{trak}italic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_k end_POSTSUBSCRIPT: the dimension of the input to ψsvsubscript𝜓𝑠𝑣\psi_{sv}italic_ψ start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT
4 Trr𝑇subscript𝑟𝑟Tr_{r}italic_T italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT: the set of robot tracklets
5 Trh𝑇subscript𝑟Tr_{h}italic_T italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT: the set of human tracklets
6 Kupsubscript𝐾𝑢𝑝K_{up}italic_K start_POSTSUBSCRIPT italic_u italic_p end_POSTSUBSCRIPT: the update threshold
7 Kaccsubscript𝐾𝑎𝑐𝑐K_{acc}italic_K start_POSTSUBSCRIPT italic_a italic_c italic_c end_POSTSUBSCRIPT: the accuracy threshold
8 Output: ψsv(,Wsv)subscript𝜓𝑠𝑣subscript𝑊𝑠𝑣\psi_{sv}(\cdot,W_{sv})italic_ψ start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT ( ⋅ , italic_W start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT ): the social neural network
9 if |Tt|subscript𝑇𝑡|T_{t}|| italic_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | % Ltraksubscript𝐿𝑡𝑟𝑎𝑘L_{trak}italic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_k end_POSTSUBSCRIPT = 00 then
10       Trr𝑇subscript𝑟𝑟Tr_{r}italic_T italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT \leftarrow The last robot tracklet (Ltrak×srssubscript𝐿𝑡𝑟𝑎𝑘subscriptsuperscript𝑠𝑠𝑟L_{trak}\times s^{s}_{r}italic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_k end_POSTSUBSCRIPT × italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT)
11       Trh𝑇subscript𝑟Tr_{h}italic_T italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT \leftarrow The last human tracklets (Ltrak×shs)×nsubscript𝐿𝑡𝑟𝑎𝑘subscriptsuperscript𝑠𝑠𝑛(L_{trak}\times s^{s}_{h})\times n( italic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_k end_POSTSUBSCRIPT × italic_s start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) × italic_n
12       if |Trr|𝑇subscript𝑟𝑟|Tr_{r}|| italic_T italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT | % Kupsubscript𝐾𝑢𝑝K_{up}italic_K start_POSTSUBSCRIPT italic_u italic_p end_POSTSUBSCRIPT = 00 then
13             Trh𝑇subscriptsuperscript𝑟Tr^{{}^{\prime}}_{h}italic_T italic_r start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT \leftarrow The last Kup×nsubscript𝐾𝑢𝑝𝑛K_{up}\times nitalic_K start_POSTSUBSCRIPT italic_u italic_p end_POSTSUBSCRIPT × italic_n human tracklets \in Trh𝑇subscript𝑟Tr_{h}italic_T italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT;
14             YTrhsubscript𝑌𝑇subscriptsuperscript𝑟Y_{Tr^{{}^{\prime}}_{h}}italic_Y start_POSTSUBSCRIPT italic_T italic_r start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT \leftarrow label(Trh)𝑇subscriptsuperscript𝑟(Tr^{{}^{\prime}}_{h})( italic_T italic_r start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT );
15             Y^Trhsubscript^𝑌𝑇subscriptsuperscript𝑟\hat{Y}_{Tr^{{}^{\prime}}_{h}}over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_T italic_r start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT \leftarrow ψsv(Trh,Wsv)subscript𝜓𝑠𝑣𝑇subscriptsuperscript𝑟subscript𝑊𝑠𝑣\psi_{sv}(Tr^{{}^{\prime}}_{h},W_{sv})italic_ψ start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT ( italic_T italic_r start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_W start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT );
16             if binary_acc(YTrh,Y^Trh)<Kaccsubscript𝑌𝑇subscriptsuperscript𝑟subscript^𝑌𝑇subscriptsuperscript𝑟subscript𝐾𝑎𝑐𝑐(Y_{Tr^{{}^{\prime}}_{h}},\hat{Y}_{Tr^{{}^{\prime}}_{h}})<K_{acc}( italic_Y start_POSTSUBSCRIPT italic_T italic_r start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT , over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_T italic_r start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) < italic_K start_POSTSUBSCRIPT italic_a italic_c italic_c end_POSTSUBSCRIPT then
17                   YTrhsubscript𝑌𝑇subscript𝑟Y_{Tr_{h}}italic_Y start_POSTSUBSCRIPT italic_T italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT \leftarrow label(Trh)𝑇subscript𝑟(Tr_{h})( italic_T italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT );
18                   YTrrsubscript𝑌𝑇subscript𝑟𝑟Y_{Tr_{r}}italic_Y start_POSTSUBSCRIPT italic_T italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT \leftarrow label(Trr)𝑇subscript𝑟𝑟(Tr_{r})( italic_T italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT );
19                   Dnew[Trh,YTrh]{x,yTrr,YTrry=0}D_{new}\leftarrow[Tr_{h},Y_{Tr_{h}}]\cup\{x,y\in Tr_{r},Y_{Tr_{r}}\mid y=0\}italic_D start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ← [ italic_T italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_T italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT ] ∪ { italic_x , italic_y ∈ italic_T italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_T italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∣ italic_y = 0 };
20                   Wsvsubscript𝑊𝑠𝑣W_{sv}italic_W start_POSTSUBSCRIPT italic_s italic_v end_POSTSUBSCRIPT \leftarrow train(Dnew)subscript𝐷𝑛𝑒𝑤(D_{new})( italic_D start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT );
21                   Trr𝑇subscript𝑟𝑟Tr_{r}italic_T italic_r start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, Trh𝑇subscript𝑟Tr_{h}italic_T italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT \leftarrow \emptyset;
22                  
23            
24      
Algorithm 1 Online Social Context Learning

IV Evaluation

In this section we introduce the initialization of the social module, simulation, and real robot experiments.

IV-A Social Module Initialization

The initialization of the social module can be carried out in two ways, either online learning during deployment [32, 33], or offline training using simulators or datasets. We adopt the latter because of its higher experimental efficiency. Specifically, the Thor-Magni dataset [17] is used, which includes five different HRI scenarios with the number of participants ranging from four to nine. The human trajectories were recorded by a high-precision motion capture system and are used by us as social examples. We then apply the well-known Optimal Reciprocal Collision Avoidance (ORCA) [34] method to generate corresponding robot trajectories according to the human ones, which are labeled as non-social. All trajectories are segmented into 41675 tracklets, each contains 16 points (i.e. Ltraksubscript𝐿𝑡𝑟𝑎𝑘L_{trak}italic_L start_POSTSUBSCRIPT italic_t italic_r italic_a italic_k end_POSTSUBSCRIPT). After shuffling, 70% of them is used for training and 30% for validation. The social neural network is trained in 50 epochs with a batch size of 32. The optimizer is RMSProp with a learning rate of 0.001. The criterion is a binary cross-entropy loss, which results in an accuracy of 89.69%. For more details on the non-social robot trajectory generation and the social model training, please refer to our open source repository.

IV-B Simulation Results

We first evaluate our proposed method using the SARL simulator [14] widely used by the community [15, 35]. The original experiment involved a robot agent moving between two points at a distance of 8m8𝑚8~{}m8 italic_m, while 5 human agents also moved between two other points at the same distance. Since the lines connecting a pair of points all intersect at one place, HRI occurs when the robot and human agents move. The only difference between the test cases is the starting position of the human agents. SARL demonstrated excellent results under these settings, with a success rate of 100%.

The original experiments of SARL are enriched in this letter. First, the number of human agents is increased, ranging from 5 to 10 in steps of 1. When the number of agents reaches 10, the simulation will start again from 5, and the process repeats. Second, the test cases are complicated to be more realistic. The robot agent is asked to reach multiple goals in sequence, rather than one. The initial timeout is 25s25𝑠25~{}s25 italic_s as per [14] and increases by 25s25𝑠25~{}s25 italic_s for each new goal obtained. Third, the robot agent needs to traverse distances of 40m40𝑚40~{}m40 italic_m and 68m68𝑚68~{}m68 italic_m, and the human agent is controlled by ORCA (as per [20, 14, 15]) and SF (first use) respectively, thus constituting 4 experimental categories: Short-ORCA, Short-SF, Long-ORCA and Long-SF.

Two preset experimental scenarios are used (as shown in Fig. 3). In one, all human agents are randomly initialized within a circle of radius 4m4𝑚4~{}m4 italic_m, and in the other, the circle is replaced by a square with a side length of 10m10𝑚10~{}m10 italic_m. Each scenario contains 250 test cases, with the number of agents evenly distributed. The radius and preferred speed of the robot and human agents are 0.3m0.3𝑚0.3~{}m0.3 italic_m and 1m/s1𝑚𝑠1~{}m/s1 italic_m / italic_s respectively (cf. Eq. 4). We take the heuristic-based ORCA method without sociality as the baseline and conduct comparative experiments with four other learning-based methods, including CADRL [20], LSTM-RL [21], SARL [14] and ST2 [15]. The results of our method include those from the full SOCSARL-OL architecture as well as SOCSARL which does not include online learning for ablation purposes. Three common performance metrics [13, 14, 15] including navigation success rate, collision rate, and average navigation time of successful cases are used to evaluate the aforementioned methods. The experimental results are shown in Table I.

Refer to caption
Figure 3: Test cases corresponding to the two simulation experiment scenarios. Left: “circle” scenario. Right: “square” scenario.
TABLE I: Evaluation Results of Different Methods
Method Success Collision Time¯¯Time\overline{\text{Time}}over¯ start_ARG Time end_ARG (s) RdistsubscriptRdist\textbf{R}_{\textbf{dist}}R start_POSTSUBSCRIPT dist end_POSTSUBSCRIPT
Short-ORCA
ORCA [34] 0.91 0.00 53.48 0.964
CADRL [20] 0.06 0.93 59.49 0.828
LSTM-RL [21] 0.23 0.62 48.55 0.886
SARL [14] 0.71 0.25 50.94 0.928
ST2 [15] 0.54 0.46 48.50 0.899
SOCSARL (ours) 0.70 0.25 50.84 0.929
SOCSARL-OL (ours) 0.77 0.23 49.17 0.933
Short-SF
ORCA [34] 0.01 0.99 44.42 0.99
CADRL [20] 0.19 0.79 54.83 0.898
LSTM-RL [21] 0.13 0.74 47.41 0.896
SARL [14] 0.77 0.19 48.96 0.957
ST2 [15] 0.60 0.40 45.96 0.935
SOCSARL (ours) 0.77 0.19 48.92 0.957
SOCSARL-OL (ours) 0.81 0.19 47.61 0.958
Long-ORCA
ORCA [34] 0.84 0.00 91.29 0.960
CADRL [20] 0.01 0.97 105.83 0.833
LSTM-RL [21] 0.06 0.78 88.23 0.891
SARL [14] 0.54 0.38 90.3 0.921
ST2 [15] 0.33 0.67 81.17 0.908
SOCSARL (ours) 0.54 0.38 90.05 0.921
SOCSARL-OL (ours) 0.62 0.38 85.87 0.927
Long-SF
ORCA [34] 0.00 1.00 - -
CADRL [20] 0.09 0.88 96.86 0.898
LSTM-RL [21] 0.01 0.84 83.29 0.894
SARL [14] 0.56 0.36 85.89 0.956
ST2 [15] 0.42 0.58 77.03 0.943
SOCSARL (ours) 0.57 0.35 86.00 0.956
SOCSARL-OL (ours) 0.64 0.35 82.66 0.959

First, the results of the baseline method ORCA confirm the necessity of introducing SF into our experiments. Because when both the robot and human agents use ORCA, the former shows performance that exceeds all other tested methods. However, when the human agents are controlled by SF, the ORCA-based robot agent shows the worst performance among all methods. These results also illustrate from one aspect the importance of consistency of the robot’s internal and external social contexts. From a sociological perspective, sociality is a consensus that a social agent may not function well in a non-social environment, and vice versa. The performance differences of other methods in the ORCA and SF categories also support this claim. Specifically, the social methods SARL, ST2, and ours perform better in the SF-based social environments than in the ORCA-based non-social environments. While the non-social method LSTM-RL performs better in the ORCA-based environments. Although CADRL is also a non-social method, it is sensitive to the distance between agents and thus coincidentally closer to the social context, thus performing better in the SF-based environments.

Second, it can be seen that the enriched experimental settings pose challenges to the SOTA methods, but the overall performance ranking of SARL, LSTM-RL, and CADRL is consistent with that reported in [14]. Specifically, SARL shows the best results among the three, recalling that SARL is the development of LSTM-RL, which is the development of CADRL. The overall performance of ST2 is inferior to that of SARL, which is different from [15]. This can be interpreted as SARL having an advantage in dealing with complex environments.

Third, it can be seen that our proposed method outperforms other learning-based ones overall. The Rdistsubscript𝑅𝑑𝑖𝑠𝑡R_{dist}italic_R start_POSTSUBSCRIPT italic_d italic_i italic_s italic_t end_POSTSUBSCRIPT metric demonstrates the effectiveness of our improvement to the original SARL reward function (cf. Eq. 9). ST2 shows fast times in four categories but at the expense of performance in the other two metrics. In addition, it can be seen that the performance of our method without the OL module is comparable to that of SARL. This shows that, on the one hand, the proposed social module basically recognizes the sociality of SARL output without online updates. On the other hand, the improvement of the robot’s social performance is indeed achieved by updating the social module online.

Finally, we provide insights into the robustness of different methods in different environments. According to the two metrics of “success rate” and “collision rate”, our SOCSARL-OL has a maximum difference of 4% in different category pairs (i.e. Short-ORCA and Short-SF, Long-ORCA and Long-SF), ST2 has 9%, SARL has 6%, LSTM-RL has 10%, and CADRL has 13%. Our method shows the best robustness. This again demonstrates the superiority of online learning, while SOCSARL without online learning has a difference of up to 7%.

IV-C Real Robot Experiments

Our method is evaluated using a real robot we developed [9, 26]. We wanted to demonstrate the effectiveness of the robot social navigation system and measure how humans felt about interacting with the robot. We replicated the experimental scenario from the SARL paper [14], as shown in Fig. 4(a). During the experiment, people randomly walked around the room, while the robot had to move from one side of the room to the other. 12 human participants were divided into 3 groups of 5/5/2 for the experiment. SOCSARL as a social method was compared with ORCA as a non-social method. After the experiment, the participants answered the following two questions for each method as per [18]:

  1. 1.

    On a scale of 1 to 5, how “socially-compliant” do you think the robot was (think of social compliance as how considerate the robot was of your presence)?

  2. 2.

    On a scale of 1 to 5, how “safe” did you feel around the robot?

The results of the survey are shown in Fig. 4(b). It can be seen that our method makes people feel safer than ORCA, but the social feeling is comparable to ORCA. After the evaluation, we conducted a short interview with the participants. Their main opinion was that they were confused by the avoidance action of our SOCSARL method. This was attributed to the social distance condition in the reward function (cf. Eq. 9), which made the robot start the avoidance action very early. This was inconsistent with the participants’ stereotypes of mobile robots. People generally believed that the robot could get closer to them before reacting. However, this was contrary to the safety requirement of the robot’s movement.

Refer to caption
(a) Snapshot of a real robot experiment. The robot needs to avoid pedestrians from five different directions in a closed space (only four of them are shown).
Refer to caption
(b) Mean and standard deviation of the scores given by twelve human participants in the real robot experiment.
Figure 4: Real robot experiment and its results.

V Conclusions

This letter presented an online context learning method for socially-compliant robot navigation, named SOCSARL-OL. The method proposed to combine deep reinforcement learning (DRL) and online robot learning (ORL) to empower the robot to adapt to changing social contexts online in order to improve its social performance. The experimental results demonstrated that our method surpasses the SOTA ones. Future work includes further evaluation of the real-world performance of online context learning by running the robot over longer periods of time in larger spaces, as well as assessing the robustness of the system in different real-world public spaces.

Acknowledgments

We thank RoboSense for sponsoring the 3D lidar, UTBM CRUNCH Lab for support in robotic instrumentation, and all those involved in the experiments.

References

  • [1] I. Okunevich, D. Trinitatova, P. Kopanev, and D. Tsetserukou, “Mobilecharger: An autonomous mobile robot with inverted delta actuator for robust and safe robot charging,” in 2021 26th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA).   IEEE, 2021, pp. 1–8.
  • [2] H. Pikner, R. Sell, K. Karjust, E. Malayjerdi, and T. Velsker, “Cyber-physical control system for autonomous logistic robot,” in 2021 IEEE 19th International Power Electronics and Motion Control Conference (PEMC).   IEEE, 2021, pp. 699–704.
  • [3] D. Topolsky, I. Topolskaya, I. Plaksina, P. Shaburov, N. Yumagulov, D. Fedorov, and E. Zvereva, “Development of a mobile robot for mine exploration,” Processes, vol. 10, no. 5, p. 865, 2022.
  • [4] T. Roucek, M. Pecka, P. Cízek, T. Petrícek, J. Bayer, V. Salanský, D. Hert, M. Petrlík, T. Báca, V. Spurný, F. Pomerleau, V. Kubelka, J. Faigl, K. Zimmermann, M. Saska, T. Svoboda, and T. Krajník, “DARPA subterranean challenge: Multi-robotic exploration of underground environments,” in Proceedings of MESAS, J. Mazal, A. Fagiolini, and P. Vasík, Eds., 2019, pp. 274–290.
  • [5] Z. Yan, S. Schreiberhuber, G. Halmetschlager, T. Duckett, M. Vincze, and N. Bellotto, “Robot perception of static and dynamic objects with an autonomous floor scrubber,” Intelligent Service Robotics, vol. 13, no. 3, pp. 403–417, 2020.
  • [6] S. Perminov, N. Mikhailovskiy, A. Sedunin, I. Okunevich, I. Kalinov, M. Kurenkov, and D. Tsetserukou, “Ultrabot: Autonomous mobile robot for indoor uv-c disinfection,” in 2021 IEEE 17th International Conference on Automation Science and Engineering (CASE).   IEEE, 2021, pp. 2147–2152.
  • [7] S. Coşar, M. Fernandez-Carmona, R. Agrigoroaie, J. Pages, F. Ferland, F. Zhao, S. Yue, N. Bellotto, and A. Tapus, “Enrichme: Perception and interaction of an assistive robot for the elderly at home,” International Journal of Social Robotics, vol. 12, pp. 779–805, 2020.
  • [8] T. Vintr, Z. Yan, K. Eyisoy, F. Kubis, J. Blaha, J. Ulrich, C. S. Swaminathan, S. M. Mellado, T. Kucner, M. Magnusson, G. Cielniak, J. Faigl, T. Duckett, A. J. Lilienthal, and T. Krajník, “Natural criteria for comparison of pedestrian flow forecasting models,” in Proceedings of IROS, 2020, pp. 11 197–11 204.
  • [9] I. Okunevich, V. Hilaire, S. Galland, O. Lamotte, L. Shilova, Y. Ruichek, and Z. Yan, “Human-centered benchmarking for socially-compliant robot navigation,” in 2023 European Conference on Mobile Robots (ECMR).   IEEE, 2023, pp. 1–7.
  • [10] C. S. Song and Y.-K. Kim, “The role of the human-robot interaction in consumers’ acceptance of humanoid retail service robots,” Journal of Business Research, vol. 146, pp. 489–503, 2022.
  • [11] R. S. Shahrezaie, B. N. Manalo, A. G. Brantley, C. R. Lynch, and D. Feil-Seifer, “Advancing socially-aware navigation for public spaces,” in 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN).   IEEE, 2022, pp. 1015–1022.
  • [12] R. D. Benedictis, A. Umbrico, F. Fracasso, G. Cortellessa, A. Orlandini, and A. Cesta, “A dichotomic approach to adaptive interaction for socially assistive robots,” User Modeling and User-Adapted Interaction, vol. 33, no. 2, pp. 293–331, 2023.
  • [13] Y. F. Chen, M. Everett, M. Liu, and J. P. How, “Socially aware motion planning with deep reinforcement learning,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 1343–1350.
  • [14] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, “Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning,” in 2019 international conference on robotics and automation (ICRA).   IEEE, 2019, pp. 6015–6022.
  • [15] Y. Yang, J. Jiang, J. Zhang, J. Huang, and M. Gao, “St2: Spatial-temporal state transformer for crowd-aware autonomous navigation,” IEEE Robotics and Automation Letters, vol. 8, no. 2, pp. 912–919, 2023.
  • [16] C. Chen, S. Hu, P. Nikdel, G. Mori, and M. Savva, “Relational graph learning for crowd navigation. in 2020 ieee,” in RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 10 007–10 013.
  • [17] T. Schreiter, T. R. de Almeida, Y. Zhu, E. G. Maestro, L. Morillo-Mendez, A. Rudenko, L. Palmieri, T. P. Kucner, M. Magnusson, and A. J. Lilienthal, “Thor-magni: A large-scale indoor motion capture recording of human movement and robot interaction,” arXiv preprint arXiv:2403.09285, 2024.
  • [18] H. Karnan, A. Nair, X. Xiao, G. Warnell, S. Pirk, A. Toshev, J. Hart, J. Biswas, and P. Stone, “Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11 807–11 814, 2022.
  • [19] L. Tai, J. Zhang, M. Liu, and W. Burgard, “Socially compliant navigation through raw depth inputs with generative adversarial imitation learning,” in 2018 IEEE international conference on robotics and automation (ICRA).   IEEE, 2018, pp. 1111–1117.
  • [20] Y. F. Chen, M. Liu, M. Everett, and J. P. How, “Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning,” in 2017 IEEE international conference on robotics and automation (ICRA).   IEEE, 2017, pp. 285–292.
  • [21] M. Everett, Y. F. Chen, and J. P. How, “Motion planning among dynamic, decision-making agents with deep reinforcement learning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 3052–3059.
  • [22] S. Liu, P. Chang, W. Liang, N. Chakraborty, and K. Driggs-Campbell, “Decentralized structural-rnn for robot crowd navigation with deep reinforcement learning,” in 2021 IEEE international conference on robotics and automation (ICRA).   IEEE, 2021, pp. 3517–3524.
  • [23] P. T. Singamaneni, A. Favier, and R. Alami, “Human-aware navigation planner for diverse human-robot interaction contexts,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 5817–5824.
  • [24] B. Liu, X. Xiao, and P. Stone, “A lifelong learning approach to mobile robot navigation,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1090–1096, 2021.
  • [25] Z. Yan, L. Sun, T. Krajnik, T. Duckett, and N. Bellotto, “Towards long-term autonomy: A perspective from robot learning,” in Proceedings of the AAAI-23 Bridge Program on AI & Robotics, 2023.
  • [26] I. Okunevich, V. Hilaire, S. Galland, O. Lamotte, Y. Ruichek, and Z. Yan, “An open-source software-hardware integration scheme for embodied human perception in service robotics,” in The 20th IEEE International Conference on Advanced Robotics and Its Social Impacts (ARSO 2024), 2024.
  • [27] Z. Yan, T. Duckett, and N. Bellotto, “Online learning for 3d lidar-based human detection: experimental analysis of point cloud clustering and classification methods,” Autonomous Robots, vol. 44, no. 2, pp. 147–164, 2020.
  • [28] S. Cosar, G. Donatiello, V. Bogorny, C. Gárate, L. O. Alvares, and F. Brémond, “Toward abnormal trajectory and event detection in video surveillance,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 3, pp. 683–695, 2017.
  • [29] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [30] H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard, “Socially compliant mobile robot navigation via inverse reinforcement learning,” The International Journal of Robotics Research, vol. 35, no. 11, pp. 1289–1307, 2016.
  • [31] V. Adeli, E. Adeli, I. Reid, J. C. Niebles, and H. Rezatofighi, “Socially and contextually aware human motion and pose forecasting,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6033–6040, 2020.
  • [32] Z. Yan, T. Duckett, and N. Bellotto, “Online learning for human classification in 3D LiDAR-based tracking,” in Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, September 2017, pp. 864–871.
  • [33] Z. Yan, L. Sun, T. Duckett, and N. Bellotto, “Multisensor online transfer learning for 3d lidar-based human detection with a mobile robot,” in Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, October 2018, pp. 7635–7640.
  • [34] J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha, “Reciprocal n-body collision avoidance,” in Robotics Research: The 14th International Symposium ISRR.   Springer, 2011, pp. 3–19.
  • [35] C.-L. Cheng, C.-C. Hsu, S. Saeedvand, and J.-H. Jo, “Multi-objective crowd-aware robot navigation system using deep reinforcement learning,” Applied Soft Computing, vol. 151, p. 111154, 2024.