Online Context Learning for Socially-compliant Navigation

Iaroslav Okunevich¹, Alexandre Lombard¹, Tomas Krajnik², Yassine Ruichek¹, Zhi Yan^1∗ This work was supported by the Bourgogne-Franche-Comté regional research project LOST-CoRoNa, the CZ MSMT (No. 8J23FR023) and PHC Barrande (No. 49275QM) project 3L4AR.¹UTBM, CIAD UMR 7533, F-90010 Belfort, France. [email protected]²Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czechia. [email protected]^∗Corresponding Author.

Abstract

Robot social navigation needs to adapt to different human factors and environmental contexts. However, since these factors and contexts are difficult to predict and cannot be exhaustively enumerated, traditional learning-based methods have difficulty in ensuring the social attributes of robots in long-term and cross-environment deployments. This letter introduces an online context learning method that aims to empower robots to adapt to new social environments online. The proposed method adopts a two-layer structure. The bottom layer is built using a deep reinforcement learning-based method to ensure the output of basic robot navigation commands. The upper layer is implemented using an online robot learning-based method to socialize the control commands suggested by the bottom layer. Experiments using a community-wide simulator show that our method outperforms the state-of-the-art ones. Experimental results in the most challenging scenarios show that our method improves the performance of the state-of-the-art by 8%. The source code of the proposed method, the data used, and the tools for the pre-training step will be publicly available at https://github.com/Nedzhaken/SOCSARL-OL.

I Introduction

Mobile robots are becoming more and more popular nowadays in various applications. These applications can be classified into two categories: human-free and human-presence in the robot’s work area. For the former, robots do not need to take additional account of human participation, and hence their social nature. Typical examples include robots in fully automated factories [1, 2], or robots performing exploration tasks in hazardous environments [3, 4]. For applications in which humans are present, such as robots used for supermarket cleaning [5], hospital disinfection [6], or elderly care [7], they need to take into account human behavior and their social attributes in their task and motion planning [8]. In these applications, robots must achieve high-quality human-robot interaction (HRI) to ensure that human activity performance or comfort is not compromised [9, 10].

Effective HRI between mobile robots and humans can be achieved at both the hardware and software levels. Socially-compliant robot navigation is a typical example of HRI software solutions. The core idea is to apply human social rules to HRI while avoiding collisions with people. This kind of HRI is usually indirect, or implicit. The rules applied can be based on environmental [11] or social [12] context. The social compliance that a robot should demonstrate when navigating can be achieved through special reward functions in deep reinforcement learning (DRL) algorithms [13], attention scores of people around the robot [14], spatiotemporal representations of people around [15, 16], and so forth [9]. However, the social rules used in these methods are predefined and constant because they are considered to be general and universal. This is distinct from real-life human behavior, which changes and adapts to new social conditions.

Key challenges in robot social navigation include planning the robot’s task and motion, enabling the robot to behave in a socially-compliant manner, and effectively evaluating different approaches. In our previous work, we focused on standardizable evaluation procedures for social navigation [9]. In this letter, we focus on the behavior of robots. Our research goal is to ensure the social efficiency and navigation robustness of mobile robots when deployed in new social scenarios. To this end, an online robot learning (ORL) approach is proposed to achieve the goal, as it is able to train knowledge (discriminative) models spontaneously and autonomously in new contexts without manual intervention, thereby providing meaningful social cues for downstream navigation tasks.

Refer to caption — Figure 1: Conceptual diagram of our proposed approach. The navigation module selects the robot’s next action. The social module adds social value to the proposed action. The online social context learning method updates the social module to adapt it to the new social environment, which is represented by human trajectories.

A conceptual diagram of our method is shown in Fig. 1. Specifically, we first utilize DRL as a training algorithm for our foundation model to control the robot motion, as it is proven effective for robot navigation in crowds [14]. We then build a “social module” to evaluate the social efficiency of the robot trajectory, including traveled and will follow, and adjust the robot motion to achieve socially-compliant behavior by intervening in the foundation model. The module aims to extract the hidden social rules of human movement and is initially trained with human social trajectory data. However, it is likely that the social context of these training data does not correspond to the new one in which the robot is deployed. Therefore, the module is adapted to new contexts by being updated on-the-fly with on-site data while the robot is operating.

The contribution of this letter is threefold.

•

We propose to combine DRL with ORL to enable mobile robots to adapt to new social contexts efficiently and autonomously. Its specific implementation includes building a social module that can be updated online and promptly intervening in the robot’s DRL-based navigation system to ensure the robot’s social efficiency after switching contexts, without human intervention.
•

The social module is initially trained on the state-of-the-art (SOTA) contextually-rich THOR-Magni dataset [17], which contains social trajectories of humans in shared spaces with a robot. We supplement this dataset with non-social trajectory data.
•

We evaluate various social navigation methods, both ours and others, in new social contexts that are different from the one in which these methods were originally trained. We also complicate the experimental conditions to understand the robustness of the methods.

II Related Work

II-A Socially-compliant Navigation

A mainstream approach to enabling robots to exhibit social behaviors during navigation is based on learning, represented by imitation learning and DRL. For the former, the simulated data can be the robot behavior controlled by humans in a real environment [18] or generated in a simulated environment [19]. It is worth pointing out that the human agents in [19] are controlled by the classical social force (SF) model. The limitations of existing imitation learning methods still lie in their reliance on expert demonstrations and their difficulty in adapting to new social environments.

DRL learns robot control policies through reward functions and the agent’s state information. The research axis most relevant to our work starts with the Collision Avoidance with DRL (CADRL) method proposed by Chen et al. [20] for multi-agent communication-free decentralized navigation. This method uses a neural network to parameterize the value function. Subsequently, a socially-aware CADRL, named SA-CADRL [13], was proposed with a modified reward function that enables the robot to obey basic social rules, such as the right- or left-hand rule, when navigating. One limitation of CADRL-based approaches is that the input dimension of the neural network used is fixed, which prevents the robot from taking advantage of information about more people around it.

Everett et al. [21] broke this limitation and utilized the data of all people around the robot by sending their states sequentially to a long short-term memory (LSTM) network. One limitation of this approach is that the priority of the LSTM input is only inversely proportional to the distance between the human and the robot, i.e., the input with a shorter distance is prioritized, while their relative speed is ignored. As a result, the impact of a human approaching the robot at high speed from a distance may be considered less important than a stationary person near the robot.

Chen et al. [14] proposed a solution to this problem. The idea comprises several steps. Since each person influences other people, the human-human pairwise interaction is modeled by a cost map. After that, an embedding vector is calculated for each human, based on their state and the cost map. The next step is calculating the attention score for each individual based on the individual embedding vector and the mean embedding vector of all detected persons. The final representation of the people around the robot is a linear combination of the individual attention score and the pairwise interaction vector of each person. This crowd representation facilitates highly efficient crowd navigation, an essential element of social navigation. Nevertheless, this method does not incorporate explicit or implicit social rules, except for the requirement of social distance in the reward function.

The preceding methods, except the LSTM method [21], are predicated solely on the spatial dimension of HRI. Liu et al. [22] proposed a Decentralized Structural Recurrent Neural Network (DSRNN) to leverage the spatial and temporal relationships for crowd navigation. Yang et al. [15] suggested applying a spatial-temporal state transformer for more effective crowd navigation. While the spatial-temporal transformer structure appears to be a more effective baseline for the social navigation method, the evaluation below will demonstrate that this method is less effective in crowd navigation compared with the attention mechanism [14].

II-B Online Context Learning

The aforementioned methods are all based on offline trained models and are therefore challenged by new social contexts. These contexts are diverse and even humans understand them differently [23]. There are only a few works related to ours. Shahrezaie et al. [11] proposed an online context robot navigation. Contexts and corresponding navigation rules are predefined based on the interviews. The content detection system allows the robot to change navigation online, depending on the environment. Lui et al. [24] proposed Lifelong Learning for Navigation (LLfN), which includes an additional module that corrects the robot’s behavior in complex contexts and updates online to learn new contexts. Although this work focuses on the problem of forgetting navigation experience and does not include social contexts, the structure of a main navigation method and an extra correction module is similar to our proposal.

II-C Discussion

According to our investigation, existing methods still struggle to meet the performance requirements of robots for social navigation in changing environments or across environments. It is an intuitive idea to directly perform online context updates on the social navigation model. However, taking the SOTA method SARL as an example, there are two obvious challenges. First, an update is difficult to complete in a short time, which goes against the spirit of “fast learning and immediate deployment” of online robot learning [25]. Second, the update of the model also has a clear demand for computing resources, which poses a challenge to deploying the model to the edge [26].

Therefore, considering the current algorithm and hardware development, it is a good choice to modularize the social context and only update the social module online. Additional benefits of doing so include, first, the developed social module can also be easily integrated into other navigation systems. Second, this solution is more in line with the requirements of the robustness of the robot system, that is, the social module will not harm the output of the robot navigation module, but only serves as an auxiliary. Finally, it has better interpretability compared to the end-to-end methods [13, 21, 14, 15].

III Method

We propose a novel architecture for robot social navigation that consists of two layers. The underlying layer is a DRL-based robot navigation method due to its ability to handle complex navigation problems. In addition, at this layer we extend considerations of safe distance to social distance. SARL is therefore used in our concrete implementation due to its demonstrated SOTA performance. The upper layer is a novel ORL-based social module that learns social context from human trajectory data and socializes the robot navigation control commands output by the bottom layer. In order to adapt to different social environments, the module is updated online based on new social contexts. In the remainder of this section, we first formulate the target problems, then propose an improvement to the original SARL, and finally detail our online learnable social module.

III-A Problem Formulation

Corresponding to reinforcement learning (RL), robot navigation can be formalized as a Markov decision process (MDP). The objective is to determine the optimal strategy for robot control through interaction with the environment. In social navigation, in addition to the environment, the robot should also consider the social attributes of humans:

\begin{split}J(\pi,\theta)=&\sum_{s\in\mathcal{S}}\rho(s)\sum_{a\in\mathcal{A}% }\pi(a|s)\cdot\\ &\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)[R(s,a,s^{\prime})+\gamma H(s% ^{\prime},\theta)]\end{split}

(1)

where $\sum_{s\in\mathcal{S}}\rho(s)$ represents all possible states $s$ in the environment, weighted by the initial state distribution $\rho(s)$ ; $\sum_{a\in\mathcal{A}}\pi(a|s)$ indicates all possible actions $a$ available in a state $s$ , weighted by the policy’s probability of taking that action $\pi(a|s)$ ; $\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)$ represents all possible next states $s^{\prime}$ that the robot could transition to after taking action $a$ in state $s$ , weighted by the transition probability $P(s^{\prime}|s,a)$ ; $R(s,a,s^{\prime})$ is the immediate reward received for taking action $a$ in state $s$ and transitioning to state $s^{\prime}$ ; $\gamma$ is the discount factor; and $H(s^{\prime},\theta)$ is the human-related cost function evaluated at the next state $s^{\prime}$ with parameter vector $\theta$ . $H(s^{\prime},\theta)$ captures the impact of the environment’s state on the robot’s objective, considering human presence or behavior, which can be implemented based on hand-crafted features or learned models. A specific implementation of this function is detailed in Section III-C.

Then the reward function taking into account the robot’s sociality can be composed of three parts:

R(s,a)=R_{n}(s)+R_{c}(s,a)+R_{h}(s,a)

(2)

where $R_{n}(s)$ represents the reward for progress towards the goal, $R_{c}(s,a)$ indicates the reward for maintaining a comfortable social distance between robot and humans, and $R_{h}(s,a)$ is compliance reward for adhering to social norms and cultural context. In our proposed method, $R_{n}(s)$ and $R_{c}(s,a)$ are calculated by the DRL layer (cf. Section III-B), while $R_{h}(s,a)$ is provided by the ORL layer (cf. Section III-C).

On the other hand, since social environments are inherently changing, robots are expected to be able to update their policies online in order to adapt to the changes:

\pi_{t}(a|s,t+1)=f_{d}(\pi_{s}(a|s,t),R_{t}(s,a,s^{\prime}),s^{\prime},t+1)

(3)

where $\pi_{t}(a|s,t+1)$ represents the policy at time $t+1$ ; and $\pi_{s}(a|s,t)$ represents the policy at time $t$ . Building an effective $f_{d}$ is the key.

More specifically, MDP includes the next important components: the joint state $\textbf{s}^{jn}_{t}$ at time $t$ , action of agent $\textbf{a}_{t}$ at time t, reward $\textbf{R}_{t}(\textbf{s}^{jn}_{t},\textbf{a}_{t})$ at time t, which the agent receive for its action and transition probability $\textbf{P}_{t}(\textbf{s}^{jn}_{t},\textbf{a}_{t},\textbf{s}^{jn}_{t+\Delta t})$ . The joint state at time t equals a combination of robot-human states at time t $\textbf{s}^{jn}_{t}=[\textbf{rh}^{1}_{t},...,\textbf{rh}^{n}_{t}]$ , where $\textbf{rh}^{n}_{t}$ is the joint robot-human state of nth human. The robot-human state is a combination of robot and ith human state around the robot $\textbf{rh}^{i}_{t}=[\textbf{s}^{r}_{t},\textbf{h}^{i}_{t}]$ . These states are defined as:

s^{r}_{t}=[p_{x},p_{y},v_{x},v_{y},r_{r},g_{x},g_{y},\theta,v_{pref}]_{t},

(4)

h^{i}_{t}=[p^{i}_{x},p^{i}_{y},v^{i}_{x},v^{i}_{y},r^{i}_{h}]_{t},

(5)

where $p_{x}$ , $p_{y}$ and $p^{i}_{x}$ , $p^{i}_{y}$ represent the robot and ith human positions, $v_{x}$ , $v_{y}$ and $v^{i}_{x}$ , $v^{i}_{y}$ are the robot and ith human velocities, $r_{r}$ and $r^{i}_{h}$ are the robot and ith human radius. $g_{x}$ , $g_{y}$ are the goal positions, $\theta$ is the robot orientation and $v_{pref}$ is the preferred robot speed. Position, speed, and radius are observable parameters and are available to other agents. The robot-human states are updated to follow the robot-centric parameterisation, where the x-axis points to the robot’s goal and the robot’s position is the origin. The final shape of the ith robot-human state, which forms the input for the value neural network, is:

{rh}^{i}_{t}=[d_{g},v_{pref},\theta,r_{r},v_{x},v_{y},p^{i}_{x},p^{i}_{y},v^{i% }_{x},v^{i}_{y},r^{i}_{h},d_{i},r_{sum}]_{t},

(6)

where $d_{g}$ is the distance from the robot to the goal, $d_{i}$ is the distance to the ith human, and $r_{sum}$ is the sum of robot and ith human radius.

III-B SARL

In SARL’s value-based reinforcement learning framework, the goal is to discover the optimal navigation policy, denoted as $\pi^{*}(s^{jn}_{t})$ , to achieve the highest value $V^{*}(s^{jn}_{t})$ for the joined state $s^{jn}_{t}$ at time $t$ :

V^{*}(s^{jn}_{t})=\sum_{i=t}^{T}\gamma^{i\cdot v_{pref}}\cdot R_{t}(s^{jn}_{t}% ,\pi^{*}(s^{jn}_{t})),

(7)

where $\gamma\in[0,1)$ is the discount factor, $v_{pref}$ is used as a normalization parameter of $\gamma$ for numerical reasons [20], and $T$ is the time of the final state. A constant velocity model is applied to estimate the subsequent states of humans over a short temporal interval $\Delta t$ and to approximate the next joint state $s^{jn}_{t+\Delta t}$ based on $s^{jn}_{t}$ and $a_{t}$ , then the optimal strategy can be formulated as:

\pi^{*}(s^{jn}_{t})=\arg\max_{a_{t}}[R_{t}(s^{jn}_{t},a_{t})+\gamma^{\Delta t% \cdot v_{pref}}\cdot V(s^{jn}_{t+\Delta t})],

(8)

where the value network is trained by the temporal-difference method with experience replay [13, 14, 15].

The speed and angular direction define the robot’s action space. The holonomic kinematics have modeled the movements, enabling the agent to move in any direction with any acceleration. The action space comprises 80 actions. The 5 linear speeds are distributed between $0$ to $v_{pref}$ and the angular direction is spaced from $0$ to $2\pi$ .

In original SARL, the reward is equal to 1 if the robot reaches the target position. This reward is changed to $d_{plan}/d_{real}$ in ours, where $d_{plan}$ is the planned driving distance from its current position to the target one before the robot starts moving, while $d_{real}$ is the actual distance traveled by the robot after it reaches the specified location. The interpretation of this change is that the robot should minimize unnecessary local path adjustments while maintaining socially-compliant navigation. Thus the full reward function becomes:

R_{t}(s^{jn}_{t},a_{t})=\begin{cases}-0.25&\text{if $d_{min}<0$}\\ d_{plan}/d_{real}&\text{else if $d_{g}=0$}\\ -0.1+d_{min}/2&\text{else if $d_{min}<d_{c}$}\\ 0&\text{otherwise}\end{cases},

(9)

where $d_{min}$ is the distance between the robot and the nearest person, and $d_{c}$ is the comfortable social distance. $d_{c}$ should be adjusted according to the spatio-temporal characteristics. For example, it can be higher in large malls or during epidemics, and smaller in small stores or during normal periods.

III-C Social Module

The task of the social module is to implement social modifications to the SARL output. Intuitively, if the action $a_{t}$ suggested by SARL is considered to be social, its corresponding value should be increased so that it is selected by the optimal policy $\pi^{*}(s^{jn}_{t})$ . Otherwise, its corresponding value should be lowered so that it is discarded by the policy $\pi^{*}(s^{jn}_{t})$ . Based on the dual consideration of the effectiveness of small-batch online learning [27] and tracklet-based behavioral analysis [28], we employ a tracklet-based sociality assessment method, which is visualized in Fig. 2(a). Specifically, a tracklet is defined as a set of robot instantaneous states:

tr_{t}=[s^{s}_{t-(u\cdot\Delta t)},s^{s}_{t-((u-1)\cdot\Delta t)},\ldots,\\ s^{s}_{t},s^{s}_{t+\Delta t},\ldots,s^{s}_{t+((1+f)\cdot\Delta t)}],

(10)

where each state contains the robot’s position and velocity, denoted as $s^{s}_{t}=[p_{x},p_{y},v_{x},v_{y}]_{t}$ ; $s^{s}_{t-(u\cdot\Delta t)},s^{s}_{t-((u-1)\cdot\Delta t)},\ldots$ represents $u$ previous states; $s^{s}_{t}$ refers to the current state; $s^{s}_{t+\Delta t}$ represents the state of applying $a_{t}$ to $s^{s}_{t}$ ; $\ldots,s^{s}_{t+((1+f)\cdot\Delta t)}$ represents $f$ future states, when $a_{t}$ is applied $f$ times to $s^{s}_{t+\Delta t}$ . The values of $u$ and $f$ are currently determined by experience, in our experiments $u=11,f=3$ .

To evaluate the sociality of tracklets we employ a binary classification, i.e. social or non-social. To this end, a social value function is defined as:

SV_{t}(tr_{t})=\psi_{sv}(tr_{t},W_{sv}),

(11)

where $\psi_{sv}(\cdot)$ is a gated recurrent unit (GRU) [29] with multi-layer perceptron (MLP) and $W_{sv}$ are the weights of it. The GRU-MLP model is visualized in Fig. 2(b), which contains four fully-connected layers, with ReLU nonlinear activation function and batch normalization, and the Sigmoid activation function.

Therefore, the final optimal strategy for robot navigation with the social module is:

\pi^{*}(s^{jn}_{t})=\arg\max_{a_{t}}[R_{t}(s^{jn}_{t},a_{t})+\\ +\gamma^{\Delta t\cdot v_{pref}}\cdot V(s^{jn}_{t+\Delta t})+k_{s}\cdot SV_{t}% (tr_{t})],

(12)

where $k_{s}$ is the coefficient of social importance during robot navigation, which is set to 0.6 in our experiments, and $\gamma^{\Delta t\cdot v_{pref}}\cdot V(s^{jn}_{t+\Delta t})+k_{s}\cdot SV_{t}(% tr_{t})$ responds explicitly to our introduction of human factor $\gamma H(s^{\prime},\theta)$ in Section III-A.

III-D Online Social Context Learning

The social performance of robots is challenged by changing social contexts. Although this problem can theoretically be solved by building massive complete datasets to train general and universal models, in practice, on the one hand, the cost of implementing such idea can be very high, and on the other hand, it is inherently unable to support long-term autonomy of mobile robots because the assumption that the dataset contains all social contexts is invalid [25]. Therefore, a mechanism is introduced in this letter to update the proposed social model online, triggered by the latter’s detection of differences between its internal and external social contexts. If the difference exceeds a tolerance, the module needs to be retrained online on the external context, and the training is iterated until the difference falls within the tolerance.

The proposed online social context learning is detailed in Algorithm 1. First, a buffer is set up to continuously collect the states of the robot and the humans around it (line 2). When the buffer is full (line 9), the states that constitute the robot’s last tracklet and the states that constitute the $n$ people’s last tracklets are transferred separately (lines 10-11). When the number of stored robot tracklets reaches a threshold (line 12), the efficiency of the social module is analyzed.

Next, the labels of the states in a part of the human tracklets (line 13), i.e., social or non-social, need to be determined. Existing methods are often based on a strong assumption that all human behavior is social [30, 31, 8]. Based on our previous findings [9], we propose here a more elegant labeling method. A metric named extra distance ratio is used, which is defined as:

R_{dist}=d_{s}/d_{a},

(13)

where $d_{s}$ represents the Euclidean distance from the start point to the end point in a tracklet, while $d_{a}$ represents the actual length of the tracklet. The higher the value of $R_{dist}$ , the better the sociality. Therefore, if a tracklet’s $R_{dist}$ is higher than a preset threshold, all states on the tracklet are considered social, otherwise non-social (line 14). Through this method, the robot obtains information about the external social context. Then, through the output of the social module, it becomes aware of its internal social context (line 15).

Finally, if the binary accuracy of the two contexts is below a threshold (the lower the accuracy, the greater the difference, line 16), the social model is updated (line 20). To do so, the aforementioned labeling method is still used to label the human set $Tr_{h}$ (line 17) and the robot set $Tr_{r}$ (line 18) respectively. Then the training set $D_{new}$ consists of $Tr_{h}$ and non-social samples in $Tr_{r}$ (line 19). Additionally, to prevent model overfitting, the robot and human sets are purged (line 21) and old data will not be used in the next iteration. $L_{trak}$ , $K_{up}$ and $K_{acc}$ are hyperparameters and need to be fine tuned. In our experiments, $L_{trak}=16$ , $K_{up}=3$ and $K_{acc}=0.5$ .

1 Input:

\psi_{sv}(\cdot,W_{sv})

: the social neural network

T_{t}

: the buffered human and robot states at time

t

L_{trak}

: the dimension of the input to

\psi_{sv}

Tr_{r}

: the set of robot tracklets

Tr_{h}

: the set of human tracklets

K_{up}

: the update threshold

K_{acc}

: the accuracy threshold

8 Output:

\psi_{sv}(\cdot,W_{sv})

: the social neural network

9 if $|T_{t}|$ % $L_{trak}$ = $0$ then

Tr_{r}

\leftarrow

The last robot tracklet (

L_{trak}\times s^{s}_{r}

)

Tr_{h}

\leftarrow

The last human tracklets

(L_{trak}\times s^{s}_{h})\times n

12 if $|Tr_{r}|$ % $K_{up}$ = $0$ then

Tr^{{}^{\prime}}_{h}

\leftarrow

The last

K_{up}\times n

human tracklets

\in

Tr_{h}

;

Y_{Tr^{{}^{\prime}}_{h}}

\leftarrow

label

(Tr^{{}^{\prime}}_{h})

;

\hat{Y}_{Tr^{{}^{\prime}}_{h}}

\leftarrow

\psi_{sv}(Tr^{{}^{\prime}}_{h},W_{sv})

;

16 if binary_acc $(Y_{Tr^{{}^{\prime}}_{h}},\hat{Y}_{Tr^{{}^{\prime}}_{h}})<K_{acc}$ then

Y_{Tr_{h}}

\leftarrow

label

(Tr_{h})

;

Y_{Tr_{r}}

\leftarrow

label

(Tr_{r})

;

D_{new}\leftarrow[Tr_{h},Y_{Tr_{h}}]\cup\{x,y\in Tr_{r},Y_{Tr_{r}}\mid y=0\}

;

W_{sv}

\leftarrow

train

(D_{new})

;

Tr_{r}

Tr_{h}

\leftarrow

\emptyset

;

Algorithm 1 Online Social Context Learning

IV Evaluation

In this section we introduce the initialization of the social module, simulation, and real robot experiments.

IV-A Social Module Initialization

The initialization of the social module can be carried out in two ways, either online learning during deployment [32, 33], or offline training using simulators or datasets. We adopt the latter because of its higher experimental efficiency. Specifically, the Thor-Magni dataset [17] is used, which includes five different HRI scenarios with the number of participants ranging from four to nine. The human trajectories were recorded by a high-precision motion capture system and are used by us as social examples. We then apply the well-known Optimal Reciprocal Collision Avoidance (ORCA) [34] method to generate corresponding robot trajectories according to the human ones, which are labeled as non-social. All trajectories are segmented into 41675 tracklets, each contains 16 points (i.e. $L_{trak}$ ). After shuffling, 70% of them is used for training and 30% for validation. The social neural network is trained in 50 epochs with a batch size of 32. The optimizer is RMSProp with a learning rate of 0.001. The criterion is a binary cross-entropy loss, which results in an accuracy of 89.69%. For more details on the non-social robot trajectory generation and the social model training, please refer to our open source repository.

IV-B Simulation Results

We first evaluate our proposed method using the SARL simulator [14] widely used by the community [15, 35]. The original experiment involved a robot agent moving between two points at a distance of $8~{}m$ , while 5 human agents also moved between two other points at the same distance. Since the lines connecting a pair of points all intersect at one place, HRI occurs when the robot and human agents move. The only difference between the test cases is the starting position of the human agents. SARL demonstrated excellent results under these settings, with a success rate of 100%.

The original experiments of SARL are enriched in this letter. First, the number of human agents is increased, ranging from 5 to 10 in steps of 1. When the number of agents reaches 10, the simulation will start again from 5, and the process repeats. Second, the test cases are complicated to be more realistic. The robot agent is asked to reach multiple goals in sequence, rather than one. The initial timeout is $25~{}s$ as per [14] and increases by $25~{}s$ for each new goal obtained. Third, the robot agent needs to traverse distances of $40~{}m$ and $68~{}m$ , and the human agent is controlled by ORCA (as per [20, 14, 15]) and SF (first use) respectively, thus constituting 4 experimental categories: Short-ORCA, Short-SF, Long-ORCA and Long-SF.

Two preset experimental scenarios are used (as shown in Fig. 3). In one, all human agents are randomly initialized within a circle of radius $4~{}m$ , and in the other, the circle is replaced by a square with a side length of $10~{}m$ . Each scenario contains 250 test cases, with the number of agents evenly distributed. The radius and preferred speed of the robot and human agents are $0.3~{}m$ and $1~{}m/s$ respectively (cf. Eq. 4). We take the heuristic-based ORCA method without sociality as the baseline and conduct comparative experiments with four other learning-based methods, including CADRL [20], LSTM-RL [21], SARL [14] and ST² [15]. The results of our method include those from the full SOCSARL-OL architecture as well as SOCSARL which does not include online learning for ablation purposes. Three common performance metrics [13, 14, 15] including navigation success rate, collision rate, and average navigation time of successful cases are used to evaluate the aforementioned methods. The experimental results are shown in Table I.

TABLE I: Evaluation Results of Different Methods

Short-ORCA
Method	Success	Collision	$\overline{\text{Time}}$ (s)	$\textbf{R}_{\textbf{dist}}$
ORCA [34]	0.91	0.00	53.48	0.964
CADRL [20]	0.06	0.93	59.49	0.828
LSTM-RL [21]	0.23	0.62	48.55	0.886
SARL [14]	0.71	0.25	50.94	0.928
ST² [15]	0.54	0.46	48.50	0.899
SOCSARL (ours)	0.70	0.25	50.84	0.929
SOCSARL-OL (ours)	0.77	0.23	49.17	0.933
Short-SF
ORCA [34]	0.01	0.99	44.42	0.99
CADRL [20]	0.19	0.79	54.83	0.898
LSTM-RL [21]	0.13	0.74	47.41	0.896
SARL [14]	0.77	0.19	48.96	0.957
ST² [15]	0.60	0.40	45.96	0.935
SOCSARL (ours)	0.77	0.19	48.92	0.957
SOCSARL-OL (ours)	0.81	0.19	47.61	0.958
Long-ORCA
ORCA [34]	0.84	0.00	91.29	0.960
CADRL [20]	0.01	0.97	105.83	0.833
LSTM-RL [21]	0.06	0.78	88.23	0.891
SARL [14]	0.54	0.38	90.3	0.921
ST² [15]	0.33	0.67	81.17	0.908
SOCSARL (ours)	0.54	0.38	90.05	0.921
SOCSARL-OL (ours)	0.62	0.38	85.87	0.927
Long-SF
ORCA [34]	0.00	1.00	-	-
CADRL [20]	0.09	0.88	96.86	0.898
LSTM-RL [21]	0.01	0.84	83.29	0.894
SARL [14]	0.56	0.36	85.89	0.956
ST² [15]	0.42	0.58	77.03	0.943
SOCSARL (ours)	0.57	0.35	86.00	0.956
SOCSARL-OL (ours)	0.64	0.35	82.66	0.959

First, the results of the baseline method ORCA confirm the necessity of introducing SF into our experiments. Because when both the robot and human agents use ORCA, the former shows performance that exceeds all other tested methods. However, when the human agents are controlled by SF, the ORCA-based robot agent shows the worst performance among all methods. These results also illustrate from one aspect the importance of consistency of the robot’s internal and external social contexts. From a sociological perspective, sociality is a consensus that a social agent may not function well in a non-social environment, and vice versa. The performance differences of other methods in the ORCA and SF categories also support this claim. Specifically, the social methods SARL, ST², and ours perform better in the SF-based social environments than in the ORCA-based non-social environments. While the non-social method LSTM-RL performs better in the ORCA-based environments. Although CADRL is also a non-social method, it is sensitive to the distance between agents and thus coincidentally closer to the social context, thus performing better in the SF-based environments.

Second, it can be seen that the enriched experimental settings pose challenges to the SOTA methods, but the overall performance ranking of SARL, LSTM-RL, and CADRL is consistent with that reported in [14]. Specifically, SARL shows the best results among the three, recalling that SARL is the development of LSTM-RL, which is the development of CADRL. The overall performance of ST² is inferior to that of SARL, which is different from [15]. This can be interpreted as SARL having an advantage in dealing with complex environments.

Third, it can be seen that our proposed method outperforms other learning-based ones overall. The $R_{dist}$ metric demonstrates the effectiveness of our improvement to the original SARL reward function (cf. Eq. 9). ST² shows fast times in four categories but at the expense of performance in the other two metrics. In addition, it can be seen that the performance of our method without the OL module is comparable to that of SARL. This shows that, on the one hand, the proposed social module basically recognizes the sociality of SARL output without online updates. On the other hand, the improvement of the robot’s social performance is indeed achieved by updating the social module online.

Finally, we provide insights into the robustness of different methods in different environments. According to the two metrics of “success rate” and “collision rate”, our SOCSARL-OL has a maximum difference of 4% in different category pairs (i.e. Short-ORCA and Short-SF, Long-ORCA and Long-SF), ST² has 9%, SARL has 6%, LSTM-RL has 10%, and CADRL has 13%. Our method shows the best robustness. This again demonstrates the superiority of online learning, while SOCSARL without online learning has a difference of up to 7%.

IV-C Real Robot Experiments

Our method is evaluated using a real robot we developed [9, 26]. We wanted to demonstrate the effectiveness of the robot social navigation system and measure how humans felt about interacting with the robot. We replicated the experimental scenario from the SARL paper [14], as shown in Fig. 4(a). During the experiment, people randomly walked around the room, while the robot had to move from one side of the room to the other. 12 human participants were divided into 3 groups of 5/5/2 for the experiment. SOCSARL as a social method was compared with ORCA as a non-social method. After the experiment, the participants answered the following two questions for each method as per [18]:

1.

On a scale of 1 to 5, how “socially-compliant” do you think the robot was (think of social compliance as how considerate the robot was of your presence)?
2.

On a scale of 1 to 5, how “safe” did you feel around the robot?

The results of the survey are shown in Fig. 4(b). It can be seen that our method makes people feel safer than ORCA, but the social feeling is comparable to ORCA. After the evaluation, we conducted a short interview with the participants. Their main opinion was that they were confused by the avoidance action of our SOCSARL method. This was attributed to the social distance condition in the reward function (cf. Eq. 9), which made the robot start the avoidance action very early. This was inconsistent with the participants’ stereotypes of mobile robots. People generally believed that the robot could get closer to them before reacting. However, this was contrary to the safety requirement of the robot’s movement.

V Conclusions

This letter presented an online context learning method for socially-compliant robot navigation, named SOCSARL-OL. The method proposed to combine deep reinforcement learning (DRL) and online robot learning (ORL) to empower the robot to adapt to changing social contexts online in order to improve its social performance. The experimental results demonstrated that our method surpasses the SOTA ones. Future work includes further evaluation of the real-world performance of online context learning by running the robot over longer periods of time in larger spaces, as well as assessing the robustness of the system in different real-world public spaces.

Acknowledgments

We thank RoboSense for sponsoring the 3D lidar, UTBM CRUNCH Lab for support in robotic instrumentation, and all those involved in the experiments.

References

[1] I. Okunevich, D. Trinitatova, P. Kopanev, and D. Tsetserukou, “Mobilecharger: An autonomous mobile robot with inverted delta actuator for robust and safe robot charging,” in 2021 26th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA). IEEE, 2021, pp. 1–8.
[2] H. Pikner, R. Sell, K. Karjust, E. Malayjerdi, and T. Velsker, “Cyber-physical control system for autonomous logistic robot,” in 2021 IEEE 19th International Power Electronics and Motion Control Conference (PEMC). IEEE, 2021, pp. 699–704.
[3] D. Topolsky, I. Topolskaya, I. Plaksina, P. Shaburov, N. Yumagulov, D. Fedorov, and E. Zvereva, “Development of a mobile robot for mine exploration,” Processes, vol. 10, no. 5, p. 865, 2022.
[4] T. Roucek, M. Pecka, P. Cízek, T. Petrícek, J. Bayer, V. Salanský, D. Hert, M. Petrlík, T. Báca, V. Spurný, F. Pomerleau, V. Kubelka, J. Faigl, K. Zimmermann, M. Saska, T. Svoboda, and T. Krajník, “DARPA subterranean challenge: Multi-robotic exploration of underground environments,” in Proceedings of MESAS, J. Mazal, A. Fagiolini, and P. Vasík, Eds., 2019, pp. 274–290.
[5] Z. Yan, S. Schreiberhuber, G. Halmetschlager, T. Duckett, M. Vincze, and N. Bellotto, “Robot perception of static and dynamic objects with an autonomous floor scrubber,” Intelligent Service Robotics, vol. 13, no. 3, pp. 403–417, 2020.
[6] S. Perminov, N. Mikhailovskiy, A. Sedunin, I. Okunevich, I. Kalinov, M. Kurenkov, and D. Tsetserukou, “Ultrabot: Autonomous mobile robot for indoor uv-c disinfection,” in 2021 IEEE 17th International Conference on Automation Science and Engineering (CASE). IEEE, 2021, pp. 2147–2152.
[7] S. Coşar, M. Fernandez-Carmona, R. Agrigoroaie, J. Pages, F. Ferland, F. Zhao, S. Yue, N. Bellotto, and A. Tapus, “Enrichme: Perception and interaction of an assistive robot for the elderly at home,” International Journal of Social Robotics, vol. 12, pp. 779–805, 2020.
[8] T. Vintr, Z. Yan, K. Eyisoy, F. Kubis, J. Blaha, J. Ulrich, C. S. Swaminathan, S. M. Mellado, T. Kucner, M. Magnusson, G. Cielniak, J. Faigl, T. Duckett, A. J. Lilienthal, and T. Krajník, “Natural criteria for comparison of pedestrian flow forecasting models,” in Proceedings of IROS, 2020, pp. 11 197–11 204.
[9] I. Okunevich, V. Hilaire, S. Galland, O. Lamotte, L. Shilova, Y. Ruichek, and Z. Yan, “Human-centered benchmarking for socially-compliant robot navigation,” in 2023 European Conference on Mobile Robots (ECMR). IEEE, 2023, pp. 1–7.
[10] C. S. Song and Y.-K. Kim, “The role of the human-robot interaction in consumers’ acceptance of humanoid retail service robots,” Journal of Business Research, vol. 146, pp. 489–503, 2022.
[11] R. S. Shahrezaie, B. N. Manalo, A. G. Brantley, C. R. Lynch, and D. Feil-Seifer, “Advancing socially-aware navigation for public spaces,” in 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 2022, pp. 1015–1022.
[12] R. D. Benedictis, A. Umbrico, F. Fracasso, G. Cortellessa, A. Orlandini, and A. Cesta, “A dichotomic approach to adaptive interaction for socially assistive robots,” User Modeling and User-Adapted Interaction, vol. 33, no. 2, pp. 293–331, 2023.
[13] Y. F. Chen, M. Everett, M. Liu, and J. P. How, “Socially aware motion planning with deep reinforcement learning,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1343–1350.
[14] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, “Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning,” in 2019 international conference on robotics and automation (ICRA). IEEE, 2019, pp. 6015–6022.
[15] Y. Yang, J. Jiang, J. Zhang, J. Huang, and M. Gao, “St²: Spatial-temporal state transformer for crowd-aware autonomous navigation,” IEEE Robotics and Automation Letters, vol. 8, no. 2, pp. 912–919, 2023.
[16] C. Chen, S. Hu, P. Nikdel, G. Mori, and M. Savva, “Relational graph learning for crowd navigation. in 2020 ieee,” in RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 10 007–10 013.
[17] T. Schreiter, T. R. de Almeida, Y. Zhu, E. G. Maestro, L. Morillo-Mendez, A. Rudenko, L. Palmieri, T. P. Kucner, M. Magnusson, and A. J. Lilienthal, “Thor-magni: A large-scale indoor motion capture recording of human movement and robot interaction,” arXiv preprint arXiv:2403.09285, 2024.
[18] H. Karnan, A. Nair, X. Xiao, G. Warnell, S. Pirk, A. Toshev, J. Hart, J. Biswas, and P. Stone, “Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11 807–11 814, 2022.
[19] L. Tai, J. Zhang, M. Liu, and W. Burgard, “Socially compliant navigation through raw depth inputs with generative adversarial imitation learning,” in 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 1111–1117.
[20] Y. F. Chen, M. Liu, M. Everett, and J. P. How, “Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 285–292.
[21] M. Everett, Y. F. Chen, and J. P. How, “Motion planning among dynamic, decision-making agents with deep reinforcement learning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3052–3059.
[22] S. Liu, P. Chang, W. Liang, N. Chakraborty, and K. Driggs-Campbell, “Decentralized structural-rnn for robot crowd navigation with deep reinforcement learning,” in 2021 IEEE international conference on robotics and automation (ICRA). IEEE, 2021, pp. 3517–3524.
[23] P. T. Singamaneni, A. Favier, and R. Alami, “Human-aware navigation planner for diverse human-robot interaction contexts,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 5817–5824.
[24] B. Liu, X. Xiao, and P. Stone, “A lifelong learning approach to mobile robot navigation,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1090–1096, 2021.
[25] Z. Yan, L. Sun, T. Krajnik, T. Duckett, and N. Bellotto, “Towards long-term autonomy: A perspective from robot learning,” in Proceedings of the AAAI-23 Bridge Program on AI & Robotics, 2023.
[26] I. Okunevich, V. Hilaire, S. Galland, O. Lamotte, Y. Ruichek, and Z. Yan, “An open-source software-hardware integration scheme for embodied human perception in service robotics,” in The 20th IEEE International Conference on Advanced Robotics and Its Social Impacts (ARSO 2024), 2024.
[27] Z. Yan, T. Duckett, and N. Bellotto, “Online learning for 3d lidar-based human detection: experimental analysis of point cloud clustering and classification methods,” Autonomous Robots, vol. 44, no. 2, pp. 147–164, 2020.
[28] S. Cosar, G. Donatiello, V. Bogorny, C. Gárate, L. O. Alvares, and F. Brémond, “Toward abnormal trajectory and event detection in video surveillance,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 3, pp. 683–695, 2017.
[29] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[30] H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard, “Socially compliant mobile robot navigation via inverse reinforcement learning,” The International Journal of Robotics Research, vol. 35, no. 11, pp. 1289–1307, 2016.
[31] V. Adeli, E. Adeli, I. Reid, J. C. Niebles, and H. Rezatofighi, “Socially and contextually aware human motion and pose forecasting,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6033–6040, 2020.
[32] Z. Yan, T. Duckett, and N. Bellotto, “Online learning for human classification in 3D LiDAR-based tracking,” in Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, September 2017, pp. 864–871.
[33] Z. Yan, L. Sun, T. Duckett, and N. Bellotto, “Multisensor online transfer learning for 3d lidar-based human detection with a mobile robot,” in Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, October 2018, pp. 7635–7640.
[34] J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha, “Reciprocal n-body collision avoidance,” in Robotics Research: The 14th International Symposium ISRR. Springer, 2011, pp. 3–19.
[35] C.-L. Cheng, C.-C. Hsu, S. Saeedvand, and J.-H. Jo, “Multi-objective crowd-aware robot navigation system using deep reinforcement learning,” Applied Soft Computing, vol. 151, p. 111154, 2024.