License: arXiv.org perpetual non-exclusive license
arXiv:2403.10833v1 [cs.RO] 16 Mar 2024

Deep Reinforcement Learning-based Large-scale Robot Exploration

Yuhong Cao11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT, Rui Zhao11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT, Yizhuo Wang11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT, Bairan Xiang11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT, Guillaume Sartoretti11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT 11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT Author is with Department of Mechanical Engineering, College of Design and Engineering, National University of Singapore. [email protected]
Abstract

In this work, we propose a deep reinforcement learning (DRL) based reactive planner to solve large-scale Lidar-based autonomous robot exploration problems in 2D action space. Our DRL-based planner allows the agent to reactively plan its exploration path by making implicit predictions about unknown areas, based on a learned estimation of the underlying transition model of the environment. To this end, our approach relies on learned attention mechanisms for their powerful ability to capture long-term dependencies at different spatial scales to reason about the robot’s entire belief over known areas. Our approach relies on ground truth information (i.e., privileged learning) to guide the environment estimation during training, as well as on a graph rarefaction algorithm, which allows models trained in small-scale environments to scale to large-scale ones. Simulation results show that our model exhibits better exploration efficiency (12%percent1212\%12 % in path length, 6%percent66\%6 % in makespan) and lower planning time (60%percent6060\%60 %) than the state-of-the-art planners in a 130m×100m130𝑚100𝑚130m\times 100m130 italic_m × 100 italic_m benchmark scenario. We also validate our learned model on hardware.

I Introduction

Autonomous exploration focuses on finding the shortest total exploration path to map an unknown environment (i.e., classify it into free and occupied areas). In this work, we consider autonomous exploration based on omnidirectional 3D Lidar with 2D action space (i.e., for a ground robot). There, recent developments of Lidar-based simultaneous localization and map** (SLAM) can now near-perfectly mitigate state estimation error and yield high-quality maps on hundreds-meter scenarios [1]. Leveraging these advanced SLAM approaches, current autonomous exploration planners [2, 3, 4, 5, 6] are able to assume the localization and map** are perfect and start tackling complex larger-scale environments towards real-life deployments. However, optimizing trajectories in large-scale environments is non-trivial, since the exploration task requires real-time replanning as the robot discovers unknown areas and updates its map, with planning horizons in the hundreds of meters.

Refer to caption
Figure 1: A mobile ground robot exploring an indoor lab environment using our DRL-based planner. Top-left: view from the robot’s onboard camera (not used for map**). Top-right: agent’s current belief (map) of the environment, and view of our neural network’s inputs. Bottom: 3D point cloud (map) of the environment constructed by the robot. The axes system represents the current position and orientation of the robot, the purple ball the next waypoint output by our planner,

Advanced large-scale planners often leverage hierarchical planning to decrease computational complexity while maintaining the fineness of local paths [2, 3, 4, 6], as a result, the state-of-the-art planner, TARE [4], is able to explore environments at the hundred-meters scale while kee** computing times under a second at each planning step. Nevertheless, these conventional planners share the same limitation: they try to optimize paths solely based on the partial belief (map) over the environment. However, even optimal long-term paths on the partial map can lead to suboptimal longer-term behaviors as the map is updated with new measurements that may contradict the current plan. At its core, the exploration problem is a partially observable Markov decision process (POMDP), where the transition model between true states (ground truth of the environment) and the robot belief (map) is hard to predict. For example, a robot cannot confirm that two close-by, incomplete corridors in its partial map are connected before fully exploring this portion of the environment. However, we note that humans in the same situations will often try to infer the complete structure of an environment from a partial map (i.e., guess and reconstruct that full corridor) to improve efficiency. Learning-based approaches such as supervised learning or model-based reinforcement learning may be an option to explicitly predict unknown areas from the agent’s partial map, allowing conventional planners to then plan paths over the predicted belief. However, since the underlying transition model in autonomous exploration is highly stochastic and thus very hard to estimate/learn, directly relying on such a learned prediction may not be a wise choice in practice.

We believe that a simple model-free deep reinforcement learning (DRL) approach can elegantly handle all these challenges, by implicitly estimating the transition model and learning an adaptive policy. We build upon recent attention-based policy neural network [7, 8], which have shown outstanding abilities at allowing the agent to reason about its entire belief at multiple spatial scales. In this work, we further propose to train an attention-based policy with the assistance of ground truth knowledge (i.e., privileged learning). It enables the model to draw out more potential of attention mechanism at implicitly predicting unknown areas from the partial robot belief, leading to a more stable training process and better performance after training. In particular, we train our model under the soft actor-critic (SAC) algorithm [9, 10], and let the critic network have access to the ground truth of the environment to get precise long-term evaluations. We further propose a graph rarefaction algorithm, which enhances the model’s capacity to capture longer-term temporal dependencies between areas, thus extending the use of trained policy from small- to large-scale environments. We compare our DRL-based planner with conventional planners in simplified small-scale environments, where it achieves 11% better exploration efficiency in terms of path length. We then test our planner in a high-fidelity 3D Gazebo simulation benchmark of a 130m×100m130𝑚100𝑚130m\times 100m130 italic_m × 100 italic_m indoor office, where it achieves better exploration efficiency (12% in path length, 6% in makespan) and significantly lower computing times (60%) compared to the state-of-the-art exploration planner: TARE [4, 6]. Finally, we experimentally validate our planner on a ground robot in an 80m×10m80𝑚10𝑚80m\times 10m80 italic_m × 10 italic_m indoor scenario involving cluttered furniture, highlighting its real-life applicability without any additional training (demonstrated in Figure 1). To the best of our knowledge, our work is the first DRL planner that can effectively explore such large-scale environments.

II Related Works

In this work, we focus on methods applicable to the problem setup where the information in the robot belief is evaluated through frontiers (i.e., boundary between known, free area and unknown area). There are another group of other methods based on the problem setup where the information is evaluated through information theory (e.g., mutual information) [11, 12, 13]. However, information theory-based methods require the robot belief to be a probabilistic occupancy map, making them not applicable in the setup of this work and related works discussed below.

Small-scale exploration planner Frontier-based exploration planners have been shown to be very efficient in small-scale environments. Yamauchi et al. [14] first proposed to utilize frontiers to drive exploration, where the cost (i.e., the path length from the current position to a frontier) is usually used to rank frontiers, with the lowest-costed frontier being chosen as the next one to be visited by the robot. More recent works [15, 16] further considered utility (i.e., the size of a frontier, quantifying the amount of information expected to be collected at that frontier) to improve exploration efficiency. Although the resulting paths are often short-sighted (i.e., optimize for shorter-term objectives, often at the cost of longer-term exploration ones), the above planners remain near-optimal in simple scenarios, where the number of frontiers is relatively low. In general, as the number of frontiers increases in larger and more cluttered environments, frontier-based planners suffer from such myopic decisions. Instead of evaluating a single frontier, [17] relied on a sampling-based strategy in a receding-horizon manner to optimize a long-term trajectory, thus achieving non-myopic decisions and significantly outperforming frontier-based planners in relatively complex (but still small-scale) environments while remaining executable in real-time.

Large-scale exploration planner Optimizing long-term trajectories online in large-scale exploration problems is challenging since trajectories might be over 100m100𝑚100m100 italic_m in length, while the gap between individual waypoints should remain below 1m1𝑚1m1 italic_m to allow for fine detail map**. Some advanced sampling strategies have been proposed to tackle larger-scale exploration [18, 19]. Selin et al. [2] found that, in large-scale environments, frontiers are sparse at the global scale but dense at the local level, and proposed a hierarchical framework that combined a global frontier-based planner and a local planner based on [17]. Following works [3, 20, 6, 4] drew on the experience of such hierarchical frameworks to perform path planning at multiple resolutions and further improve the exploration efficiency of planned paths, which significantly outperformed naive frontier- and sampling-based planners [3, 6, 4]. Notably, Cao et al. [6, 4] proposed to use a TSP-based global planner and a sampling-based local planner with a coverage constraint to trade off the planning fineness of nearby and distant areas, which currently exhibits state-of-the-art performance with 80% better exploration efficiency over other benchmark planners [17, 21, 3].

DRL-based exploration planner While some works [22, 23] have relied on deep learning to explicitly predict the layout of the unknown (but only structured indoor) environments to assist conventional approaches, most learning-based approaches leverage deep reinforcement learning (DRL) to directly and reactively make decisions on movements. DRL-based planners are expected to achieve long-term exploration objectives by training a policy to maximize long-term returns. Most current DRL-based exploration planners [24, 25, 26, 27] rely on convolutional neural networks (CNNs) to select a waypoint from a set of locations, based on a visual representation of the agent’s belief (e.g., map). Although DRL intuitively looks like a well-suited method for autonomous exploration, the performance of these DRL-based planners is only on par with naive frontier-based planners in small-scale environments. On the other hand, Chen et al. [28] proposed to utilize graph neural network to handle the localization error during exploration, however their approach can only work in environments without obstacles. Notably, relying on learned attention over a graph representation of the agent’s belief, our previous work ARiADNE [8] achieved remarkable improvements over frontier-based planners and one of these CNN-based DRL planners [25]. However, ARiADNE’s advantage over TARE is marginal, and like other previous DRL methods, this DRL planner cannot scale to large environments.

Refer to caption
Figure 2: Our proposed DRL-based planner. The robot first transforms its current map (point cloud) data to an occupancy grid, and then extracts and rarefies an informative graph from it. After that, this graph (node features and an adjacency edge mask) is fed to our attention-based network (consisting of an encoder and a decoder), which finally outputs a policy over which neighboring node should be the next waypoint.

III Problem Statement

Let \mathcal{E}caligraphic_E be a bounded environment, classified into free areas fsubscript𝑓\mathcal{E}_{f}caligraphic_E start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT and occupied areas osubscript𝑜\mathcal{E}_{o}caligraphic_E start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT, where fo=subscript𝑓subscript𝑜\mathcal{E}_{f}\cup\mathcal{E}_{o}=\mathcal{E}caligraphic_E start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ∪ caligraphic_E start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT = caligraphic_E and fo=subscript𝑓subscript𝑜\mathcal{E}_{f}\cap\mathcal{E}_{o}=\emptysetcaligraphic_E start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ∩ caligraphic_E start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT = ∅. The robot maintains a map \mathcal{M}caligraphic_M (i.e., robot belief) about \mathcal{E}caligraphic_E, which is classified into free areas fsubscript𝑓\mathcal{M}_{f}caligraphic_M start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT, occupied areas osubscript𝑜\mathcal{M}_{o}caligraphic_M start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT, and unknown areas usubscript𝑢\mathcal{M}_{u}caligraphic_M start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT, where fou=subscript𝑓subscript𝑜subscript𝑢\mathcal{M}_{f}\cup\mathcal{M}_{o}\cup\mathcal{M}_{u}=\mathcal{M}caligraphic_M start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ∪ caligraphic_M start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT ∪ caligraphic_M start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT = caligraphic_M. Starting from an empty belief over the environment (i.e., =usubscript𝑢\mathcal{M}=\mathcal{M}_{u}caligraphic_M = caligraphic_M start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT), the robot executes an exploration path ψ𝜓\psiitalic_ψ and updates its map/belief about the environment using measurements taken along the path from an onboard sensor (here, a LiDAR). The exploration task completes when osubscript𝑜\mathcal{M}_{o}caligraphic_M start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT is closed with respect to fsubscript𝑓\mathcal{M}_{f}caligraphic_M start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT111In practice, the map is often considered complete when all frontiers have been removed, which also ensures the set of occupied areas is closed., and the objective is to minimize the cost C(ψ)𝐶𝜓C(\psi)italic_C ( italic_ψ ) (e.g., the path length L(ψ)𝐿𝜓L(\psi)italic_L ( italic_ψ ) or makespan T(ψ)𝑇𝜓T(\psi)italic_T ( italic_ψ )) of the exploration path.

Considering the completed exploration path as a finite sequence of robot-visited positions ψ=[p1,p2,,pn]𝜓subscript𝑝1subscript𝑝2subscript𝑝𝑛\psi=[p_{1},p_{2},...,p_{n}]italic_ψ = [ italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ], where pifsubscript𝑝𝑖subscript𝑓p_{i}\in\mathcal{E}_{f}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_E start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT is the robot position at decision step i𝑖iitalic_i, the exploration problem is a partially observable Markov decision process (POMDP). This problem can be formulated as a tuple (S,A,T,R,Ω,O)𝑆𝐴𝑇𝑅Ω𝑂(S,A,T,R,\Omega,O)( italic_S , italic_A , italic_T , italic_R , roman_Ω , italic_O ), where S𝑆Sitalic_S is a set of states, A𝐴Aitalic_A is a set of actions, T𝑇Titalic_T is a set of conditional probabilities between states, R𝑅Ritalic_R is a set of reward, ΩΩ\Omegaroman_Ω is a set of observations, O𝑂Oitalic_O is a set of conditional probabilities between states and observations The true state si=(i,i,ψ1:i),siSformulae-sequencesubscript𝑠𝑖subscript𝑖subscript𝑖subscript𝜓:1𝑖subscript𝑠𝑖𝑆s_{i}=(\mathcal{E}_{i},\mathcal{M}_{i},\psi_{1:i}),s_{i}\in Sitalic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( caligraphic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , caligraphic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_ψ start_POSTSUBSCRIPT 1 : italic_i end_POSTSUBSCRIPT ) , italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_S is not available to the agent during exploration; instead, the robot only has access to observation (i.e., its map/belief) oi=(i,ψ1:i),oiΩformulae-sequencesubscript𝑜𝑖subscript𝑖subscript𝜓:1𝑖subscript𝑜𝑖Ωo_{i}=(\mathcal{M}_{i},\psi_{1:i}),o_{i}\in\Omegaitalic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( caligraphic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_ψ start_POSTSUBSCRIPT 1 : italic_i end_POSTSUBSCRIPT ) , italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ roman_Ω, depending on O(oi|si)𝑂conditionalsubscript𝑜𝑖subscript𝑠𝑖O(o_{i}|s_{i})italic_O ( italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). At each decision step, the robot selects and takes action aiπ(|oi)a_{i}\sim\pi(\cdot|o_{i})italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_π ( ⋅ | italic_o start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), which indicates the next waypoint to visit. Upon reaching that next state si+1subscript𝑠𝑖1s_{i+1}italic_s start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT with T(si+1|si,ai)𝑇conditionalsubscript𝑠𝑖1subscript𝑠𝑖subscript𝑎𝑖T(s_{i+1}|s_{i},a_{i})italic_T ( italic_s start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT | italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), the agent receives a reward ri=C(ψt1:t)subscript𝑟𝑖𝐶subscript𝜓:𝑡1𝑡r_{i}=-C(\psi_{t-1:t})italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = - italic_C ( italic_ψ start_POSTSUBSCRIPT italic_t - 1 : italic_t end_POSTSUBSCRIPT ). The objective is to find an optimal policy π*superscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, which selects an action at each decision step that maximizes the long-term discounted return 𝔼(Σt=inrt)𝔼subscriptsuperscriptΣ𝑛𝑡𝑖subscript𝑟𝑡\mathbb{E}(\Sigma^{n}_{t=i}r_{t})blackboard_E ( roman_Σ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t = italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ).

IV Methodology

Solving a highly stochastic POMDP such as autonomous exploration is non-trivial, and thus we propose to rely on model-free DRL for its powerful ability to make implicit predictions based on past experiences. In this work, we rely on an attention-based neural network similar to [8, 7] for its proven ability to model multi-scale dependencies between areas. We further improve the model’s performance by allowing the critic to train based on ground truth information, to help it precisely estimate returns and the underlying transition model. Guided by these improved predictions, our policy network can finally improve its ability to achieve long-term efficiency under partial observations.

We first decrease the complexity of the problem by extracting a dense collision-free graph from the map. With this collision-free graph as input, we train an attention-based policy network offline, using the soft actor-critic (SAC) with discrete actions [9, 10]. Note that under the actor-critic framework, the critic network is only used during training to assist the policy network in estimating the state-action value. Therefore, we allow the critic network to access true states s𝑠sitalic_s, instead of partial observations o𝑜oitalic_o, to estimate the long-term impact of current actions more precisely, which in turn leads to improved overall performances. Moreover, benefiting from the nature of attention layers [29, 30], our policy network is able to generalize to arbitrary graphs and arbitrary scale environments. We further propose an algorithm to rarefy the dense collision-free graph, allowing our DRL-based planner to naturally scale to larger-scale environments without any additional training.

IV-A Sequential Decision-making on a Graph

Consider the collision-free graph G=(V,E)𝐺𝑉𝐸G=(V,E)italic_G = ( italic_V , italic_E ) extracted from the map \mathcal{M}caligraphic_M, where V=(v1,v2,..,vm),vi=(xi,yi)fV=(v_{1},v_{2},..,v_{m}),\forall v_{i}=(x_{i},y_{i})\in\mathcal{M}_{f}italic_V = ( italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , . . , italic_v start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) , ∀ italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ caligraphic_M start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT is a set of nodes (one of the nodes is located at the robot current position ptsubscript𝑝𝑡p_{t}italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT) and E=((v1,v2),,(vm1,vm))𝐸subscript𝑣1subscript𝑣2subscript𝑣𝑚1subscript𝑣𝑚E=((v_{1},v_{2}),...,(v_{m-1},v_{m}))italic_E = ( ( italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) , … , ( italic_v start_POSTSUBSCRIPT italic_m - 1 end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) ) is a set of collision-free edges. In this work, nodes in V𝑉Vitalic_V uniformly cover free areas fsubscript𝑓\mathcal{M}_{f}caligraphic_M start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. We find the k𝑘kitalic_k nearest neighbors of each node and check for collisions to construct the set of admissible edges E𝐸Eitalic_E (i.e., edges between nodes that do not cross osubscript𝑜\mathcal{M}_{o}caligraphic_M start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT or usubscript𝑢\mathcal{M}_{u}caligraphic_M start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT). During exploration, this collision graph extends incrementally along with the update of free areas. Our DRL planner only makes decisions on which neighboring node will be the next waypoint (i.e, π(ai|ot)=p(vi|ot),(pt,vi)Eformulae-sequence𝜋conditionalsubscript𝑎𝑖subscript𝑜𝑡𝑝conditionalsubscript𝑣𝑖subscript𝑜𝑡subscript𝑝𝑡subscript𝑣𝑖𝐸\pi(a_{i}|o_{t})=p(v_{i}|o_{t}),(p_{t},v_{i})\in Eitalic_π ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_p ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , ( italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ italic_E), toward which the robot then navigates. Note that, during training, the planner only makes decisions upon arriving at the previously selected waypoint, while at execution time, the planner can make decisions at a fixed frequency to be more reactive to map updates.

IV-B Policy Network

We first extract an informative graph G*superscript𝐺G^{*}italic_G start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, extending the collision graph G𝐺Gitalic_G, as the input of our policy network to allow for more efficient learning. The informative graph G*=(V*,E)superscript𝐺superscript𝑉𝐸G^{*}=(V^{*},E)italic_G start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT = ( italic_V start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT , italic_E ) shares the same collision-free edge set as G𝐺Gitalic_G, and each node vi*=(vi,ui,gi)V*subscriptsuperscript𝑣𝑖subscript𝑣𝑖subscript𝑢𝑖subscript𝑔𝑖superscript𝑉v^{*}_{i}=(v_{i},u_{i},g_{i})\in V^{*}italic_v start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ italic_V start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT has two more properties than nodes in V𝑉Vitalic_V: 1) the node’s utility (denoted as uisubscript𝑢𝑖u_{i}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, quantifying observable frontiers within line of sight from location visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) and 2) the guidepost (denoted as gisubscript𝑔𝑖g_{i}italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, a binary value defining whether visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT has already been visited previously). The utility extracts information from the map to avoid the need for unnecessary learning of low-level pattern recognition, allowing the network to focus on modeling long-term dependencies between areas. The guidepost representing the path ψ1:tsubscript𝜓:1𝑡\psi_{1:t}italic_ψ start_POSTSUBSCRIPT 1 : italic_t end_POSTSUBSCRIPT executed so far, which we empirically found to help speed up training by encouraging the robot to select unvisited nodes. The informative graph is normalized and used as input of our policy network, composed of an encoder and a decoder, as shown in Figure 2, where the encoder learns to model the dependencies among nodes V*superscript𝑉V^{*}italic_V start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT and the decoder uses these dependencies to finally output the policy π𝜋\piitalic_π over neighboring nodes.

Encoder A graph structure is naturally aligned with the input format of an attention layer:

qi=WQhiq,ki=WKhik,v,vi=WVhik,v,uij=qiTkjd,formulae-sequencesubscript𝑞𝑖superscript𝑊𝑄subscriptsuperscript𝑞𝑖formulae-sequencesubscript𝑘𝑖superscript𝑊𝐾subscriptsuperscript𝑘𝑣𝑖formulae-sequencesubscript𝑣𝑖superscript𝑊𝑉subscriptsuperscript𝑘𝑣𝑖subscript𝑢𝑖𝑗superscriptsubscript𝑞𝑖𝑇subscript𝑘𝑗𝑑\displaystyle q_{i}=W^{Q}h^{q}_{i},\ k_{i}=W^{K}h^{k,v}_{i},\ v_{i}=W^{V}h^{k,% v}_{i},\ u_{ij}=\frac{q_{i}^{T}\cdot k_{j}}{\sqrt{d}},italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_W start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_W start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT italic_k , italic_v end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_W start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT italic_k , italic_v end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = divide start_ARG italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⋅ italic_k start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG , (1)
wij={euijj=1neuij,Mij=00,Mij=1,hi=j=1nwijvj,1imformulae-sequencesubscript𝑤𝑖𝑗casessuperscript𝑒subscript𝑢𝑖𝑗superscriptsubscript𝑗1𝑛superscript𝑒subscript𝑢𝑖𝑗subscript𝑀𝑖𝑗00subscript𝑀𝑖𝑗1formulae-sequencesubscriptsuperscript𝑖superscriptsubscript𝑗1𝑛subscript𝑤𝑖𝑗subscript𝑣𝑗1𝑖𝑚\displaystyle w_{ij}=\left\{\begin{array}[]{cc}\frac{e^{u_{ij}}}{\sum_{j=1}^{n% }e^{u_{ij}}},&M_{ij}=0\\ 0,&M_{ij}=1\end{array}\right.,\ h^{\prime}_{i}=\sum_{j=1}^{n}w_{ij}v_{j},1\leq i\leq mitalic_w start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = { start_ARRAY start_ROW start_CELL divide start_ARG italic_e start_POSTSUPERSCRIPT italic_u start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT italic_u start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_ARG , end_CELL start_CELL italic_M start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 0 end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL italic_M start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1 end_CELL end_ROW end_ARRAY , italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , 1 ≤ italic_i ≤ italic_m

where each hid×1subscript𝑖superscript𝑑1h_{i}\in\mathbb{R}^{d\times 1}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × 1 end_POSTSUPERSCRIPT is a d𝑑ditalic_d-dimension feature vector projected from node vi*subscriptsuperscript𝑣𝑖v^{*}_{i}italic_v start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, superscripts q𝑞qitalic_q and k,v𝑘𝑣k,vitalic_k , italic_v denote the source of the query, key, and value respectively, WQ,WK,WVd×dsuperscript𝑊𝑄superscript𝑊𝐾superscript𝑊𝑉superscript𝑑𝑑W^{Q},W^{K},W^{V}\in\mathbb{R}^{d\times d}italic_W start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT , italic_W start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT , italic_W start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT are learnable matrices, and M𝑀Mitalic_M, which serves as an edge mask, is a m×m𝑚𝑚m\times mitalic_m × italic_m adjacency matrix built from E𝐸Eitalic_E (Mij=0subscript𝑀𝑖𝑗0M_{ij}=0italic_M start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 0, if (i,j)E𝑖𝑗𝐸(i,j)\in E( italic_i , italic_j ) ∈ italic_E, else Mij=1subscript𝑀𝑖𝑗1M_{ij}=1italic_M start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1). Note that in each attention layer, each node in the informative graph is only allowed to access the features of its neighboring nodes. To allow the model to learn dependencies between non-neighboring nodes, we stack six attention layers in the encoder, each taking the output of its previous attention layer as input. The output of the encoder is a group of node features, each hisubscriptsuperscript𝑖h^{\prime}_{i}italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT modeling dependencies between node vi*subscriptsuperscript𝑣𝑖v^{*}_{i}italic_v start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and all other nodes.

Decoder Denoting the node feature corresponding to the robot’s current position ptsubscript𝑝𝑡p_{t}italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT as hcsuperscript𝑐h^{c}italic_h start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT, we first add a feature vector representing the global graph to it, by passing hcsuperscript𝑐h^{c}italic_h start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT as the query source and all node features hisubscriptsuperscript𝑖h^{\prime}_{i}italic_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as the key and value source to an attention layer, concatenating its output with hcsuperscript𝑐h^{c}italic_h start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT and projecting its back to a d𝑑ditalic_d-dimension feature h*csuperscriptabsent𝑐h^{*c}italic_h start_POSTSUPERSCRIPT * italic_c end_POSTSUPERSCRIPT. We then select its neighboring node features hin,(pt,vi)Esubscriptsuperscript𝑛𝑖for-allsubscript𝑝𝑡subscript𝑣𝑖𝐸h^{n}_{i},\forall(p_{t},v_{i})\in Eitalic_h start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , ∀ ( italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ italic_E and input them as the key source and h*csuperscriptabsent𝑐h^{*c}italic_h start_POSTSUPERSCRIPT * italic_c end_POSTSUPERSCRIPT as the query source to a pointer layer [31, 7, 8], which directly outputs attention weights w𝑤witalic_w as the policy π(ai|ot)=p(vi|ot),(pt,vi)Eformulae-sequence𝜋conditionalsubscript𝑎𝑖subscript𝑜𝑡𝑝conditionalsubscript𝑣𝑖subscript𝑜𝑡subscript𝑝𝑡subscript𝑣𝑖𝐸\pi(a_{i}|o_{t})=p(v_{i}|o_{t}),(p_{t},v_{i})\in Eitalic_π ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_p ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , ( italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∈ italic_E. By using this pointer network at the end of the decoder, the final policy’s dimensions naturally match the number of neighboring nodes.

Refer to caption
Figure 3: Informative graph (left) and Ground-truth graph (right). The informative graph is the input of our policy network. The ground truth graph is the input of our critic network, which is only used during training to assist the learning of the policy. Nodes are color-coded based on their utility (dark purple to yellow, low to high). The blue trajectory is the path executed so far by the robot (light purple node).

IV-C Training with Ground Truth

SAC We train our attention-based policy network using the soft actor-critic (SAC) algorithm with discrete actions [9, 10], in the set of small-scale exploration environments provided by [25]. The goal of SAC is to balance the trade-off between maximizing returns and policy entropy:

π*=argmax𝔼[t=0Tγt(rt+α(π(.|ot)))],\pi^{*}=\mathop{\rm{argmax}}\mathbb{E}[\sum^{T}_{t=0}\gamma^{t}(r_{t}+\alpha% \mathcal{H}(\pi(.|o_{t})))],\vspace{-0.15cm}italic_π start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT = roman_argmax blackboard_E [ ∑ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ( italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_α caligraphic_H ( italic_π ( . | italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ) ] , (2)

where \mathcal{H}caligraphic_H denotes the entropy and α𝛼\alphaitalic_α is a temperature parameter that tunes the importance of the entropy term versus the return. We use reward sha** to help tune the training: for each action atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, in addition to the cost reward rc=C(ψt1:t)subscript𝑟𝑐𝐶subscript𝜓:𝑡1𝑡r_{c}=-C(\psi_{t-1:t})italic_r start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = - italic_C ( italic_ψ start_POSTSUBSCRIPT italic_t - 1 : italic_t end_POSTSUBSCRIPT ), we give an exploration reward resubscript𝑟𝑒r_{e}italic_r start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT that quantifies the number of observed frontiers associated with atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Upon finishing exploration (i.e., once the utility of all nodes is zero), we also give the agent a fixed finishing reward rfsubscript𝑟𝑓r_{f}italic_r start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. Thus the reward used for training is ri=arc+bre+rfsubscript𝑟𝑖𝑎subscript𝑟𝑐𝑏subscript𝑟𝑒subscript𝑟𝑓r_{i}=a\cdot r_{c}+b\cdot r_{e}+r_{f}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_a ⋅ italic_r start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT + italic_b ⋅ italic_r start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT + italic_r start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT, where a𝑎aitalic_a and b𝑏bitalic_b are scaling parameters (in practice a=1/64𝑎164a=1/64italic_a = 1 / 64, b=1/50𝑏150b=1/50italic_b = 1 / 50, rf=20subscript𝑟𝑓20r_{f}=20italic_r start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT = 20). SAC trains a critic network ϕitalic-ϕ\phiitalic_ϕ to estimate the soft action-state values Qϕ(ot,ai)subscript𝑄italic-ϕsubscript𝑜𝑡subscript𝑎𝑖Q_{\phi}(o_{t},a_{i})italic_Q start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), using the critic loss: JQ(ϕ)=𝔼ot[12(Qϕ(ot,at)(rt+γ𝔼ot+1[V(ot+1)]))2]subscript𝐽𝑄italic-ϕsubscript𝔼subscript𝑜𝑡delimited-[]12superscriptsubscript𝑄italic-ϕsubscript𝑜𝑡subscript𝑎𝑡subscript𝑟𝑡𝛾subscript𝔼subscript𝑜𝑡1delimited-[]𝑉subscript𝑜𝑡12J_{Q}(\phi)=\mathbb{E}_{o_{t}}[\frac{1}{2}(Q_{\phi}(o_{t},a_{t})-(r_{t}+\gamma% \mathbb{E}_{o_{t+1}}[V(o_{t+1})]))^{2}]italic_J start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT ( italic_ϕ ) = blackboard_E start_POSTSUBSCRIPT italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_Q start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - ( italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_γ blackboard_E start_POSTSUBSCRIPT italic_o start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_V ( italic_o start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) ] ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ], where V(ot)=𝔼ai[Q(ot,ai)]αlog(π(|ot))V(o_{t})=\mathbb{E}_{a_{i}}[Q(o_{t},a_{i})]-\alpha{\rm log}(\pi(\cdot|o_{t}))italic_V ( italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = blackboard_E start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_Q ( italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ] - italic_α roman_log ( italic_π ( ⋅ | italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ). The policy network θ𝜃\thetaitalic_θ is trained to output a policy that maximizes the expected state-action value, where the policy loss reads: Jπ(θ)=𝔼(ot,ai)[αlog(πθ(ai|ot))Qϕ(ot,ai)]subscript𝐽𝜋𝜃subscript𝔼subscript𝑜𝑡subscript𝑎𝑖delimited-[]𝛼logsubscript𝜋𝜃conditionalsubscript𝑎𝑖subscript𝑜𝑡subscript𝑄italic-ϕsubscript𝑜𝑡subscript𝑎𝑖J_{\pi}(\theta)=\mathbb{E}_{(o_{t},a_{i})}[\alpha{\rm log}(\pi_{\theta}(a_{i}|% o_{t}))-Q_{\phi}(o_{t},a_{i})]italic_J start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT ( italic_θ ) = blackboard_E start_POSTSUBSCRIPT ( italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT [ italic_α roman_log ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) - italic_Q start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ]. The temperature parameter is auto-tuned during training and the temperature loss is calculated as: J(α)=𝔼at[α(logπt(at|ot)+¯)]𝐽𝛼subscript𝔼subscript𝑎𝑡delimited-[]𝛼logsubscript𝜋𝑡conditionalsubscript𝑎𝑡subscript𝑜𝑡¯J(\alpha)=\mathbb{E}_{a_{t}}[-\alpha({\rm log}\pi_{t}(a_{t}|o_{t})+\overline{% \mathcal{H}})]italic_J ( italic_α ) = blackboard_E start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ - italic_α ( roman_log italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + over¯ start_ARG caligraphic_H end_ARG ) ], where ¯¯\overline{\mathcal{H}}over¯ start_ARG caligraphic_H end_ARG denotes the target entropy [9, 10].

Critic Network The training of a model-free DRL agent on a POMDP involves the need to estimate state/state-action values from partial observations of the environment, where the model needs to implicitly learn to predict the transition model T𝑇Titalic_T and O𝑂Oitalic_O, which is non-trivial in a stochastic problem such as autonomous exploration. In SAC, the learning of the policy network depends on accurate state-action value estimations from the critic network, where such coupling might further hamper the learning of optimal policies (e.g., from oscillations in thye critic’s training from high-variance gradients). In this context, recent works [8] tried to use standard SAC (i.e., identical observation inputs and nearly the same network structure for both actor and critic with just a different final layer to differentiate their output) but found that the critic network remained rather noisy due to the high randomness of the underlying transition model (shown as the gradient variance). During training, the critic loss remains high (10%similar-toabsentpercent10\sim 10\%∼ 10 % of Q(ot,at)𝑄subscript𝑜𝑡subscript𝑎𝑡Q(o_{t},a_{t})italic_Q ( italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )), showing that the critic is struggling at learning the underlying long-term effects of actions/states, and thus providing less accurate estimations to train the actor.

To address this issue, in this work, we leverage the structure of the actor-critic framework, where the critic network is only used to assist policy network training and is not required after being trained, allowing us to take true states as the observations for the critic network (i.e., privileged learning). Figure 3 shows a visual comparison between the inputs of the policy network and the critic network. Specifically, we keep the structure of the vanilla critic network but construct a ground truth graph G=(V,E)superscript𝐺superscript𝑉superscript𝐸G^{\prime}=(V^{\prime},E^{\prime})italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = ( italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_E start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ), from stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and otsubscript𝑜𝑡o_{t}italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Gsuperscript𝐺G^{\prime}italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is similar to the informative graph used as input of the policy network, but the node set Vsuperscript𝑉V^{\prime}italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT covers the whole (ground truth) free area in \mathcal{E}caligraphic_E, the property of each node vi=(xi,yi,ui,ei)subscriptsuperscript𝑣𝑖subscript𝑥𝑖subscript𝑦𝑖subscript𝑢𝑖subscript𝑒𝑖v^{\prime}_{i}=(x_{i},y_{i},u_{i},e_{i})italic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is slightly different: the unexplored feature eisubscript𝑒𝑖e_{i}italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a binary value indicating whether visubscriptsuperscript𝑣𝑖v^{\prime}_{i}italic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is in the already explored area by the robot. If a node visubscriptsuperscript𝑣𝑖v^{\prime}_{i}italic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT lies in unexplored areas, we set its utility ui=1subscript𝑢𝑖1u_{i}=-1italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = - 1. By doing so, the critic network now only needs to tackle a more traditional MDP problem, i.e., estimating paths to cover an environment with fully known prior, allowing for more accurate estimations of state-action values (around 10 times lower state-action value loss and gradient variance). In practice, this translates into 10%similar-toabsentpercent10\sim 10\%∼ 10 % increased average per-step rewards compared to standard SAC (more details can be found in the supplemental material).

TABLE I: Comparisons with baseline planners in small-scale environments (identical 100 scenarios for each method).
Nearest Utility NBVP TARE Local CNN ARiANDE Ours
Distance (pixel) 1354(±plus-or-minus\pm±410) 1268(±plus-or-minus\pm±396) 1323(±plus-or-minus\pm±371) 1266(±plus-or-minus\pm±388) 1323(±plus-or-minus\pm±428) 1204(±plus-or-minus\pm±378) 1118(±321plus-or-minus321\pm 321± 321)

Training details Our model is trained on the dungeon environments provided by [25], where each environment \mathcal{E}caligraphic_E is a 640×480640480640\times 480640 × 480 grid world. The sensor range of our robot is set to 80808080 cells. We uniformly place 30×30303030\times 3030 × 30 points in the environment and select all points in the explored free area fsubscript𝑓\mathcal{M}_{f}caligraphic_M start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT to form the node set V𝑉Vitalic_V and construct the information graph G*=(V*,E*)superscript𝐺superscript𝑉superscript𝐸G^{*}=(V^{*},E^{*})italic_G start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT = ( italic_V start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT , italic_E start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ), where the number of neighboring nodes k=20𝑘20k=20italic_k = 20. The exploration task is completed once there is no non-zero utility node in V*superscript𝑉V^{*}italic_V start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT (using a threshold to ignore nodes with ui5subscript𝑢𝑖5u_{i}\leq 5italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≤ 5 in practice). During training, we set the max episode length to 128128128128 decision steps, the discount factor to γ=1𝛾1\gamma=1italic_γ = 1, the batch size to 64646464, and the episode buffer size to 2500250025002500. Training starts after the episode buffer collects more than 1000100010001000 step data. The target entropy is set to 0.01log(k)0.01log𝑘0.01\cdot{\rm log}(k)0.01 ⋅ roman_log ( italic_k ). Training happens at the end of each episode, and is composed of 8888 iterations. We use the Adam optimizer with a learning rate of 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT for both policy and critic networks and 104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT for the temperature auto-tuning. The target critic network updates every 64646464 training step. Our model is trained on a desktop equipped with an AMD Ryzen7 5700X CPU and an NVIDIA GeForce RTX 4080 GPU. We train our model utilizing Ray, a distributed framework for machine learning [32], and run 16 training environments simultaneously to accelerate data collection. The training needs approximately 3 days to fully converge. Our full code can be found at https://github.com/marmotlab/large-scale-DRL-exploration.

IV-D Graph Rarefaction for Large-scale Exploration

Although our policy network is trained in small-scale environments only, our networks can naturally handle arbitrary graphs. However, the number of nodes needed to let G*G*italic_G * uniformly cover larger-scale environments will be significantly higher, even though while the graph structure would be dense, information would remain sparse (i.e., many frontiers are very far from each other, and separated by a large number of 0-utility nodes). To extend the use of our trained model to larger-scale environments, we propose a method to rarefy our informative graph, which lets the model capture longer-term dependencies more efficiently (i.e., frontiers far away can connect with each other through fewer edges). In short, we first classify all nodes with non-zero utility into multi groups according to neighboring relationship, and then find the shortest paths on G𝐺Gitalic_G from the robot’s position to each group through A* and check for line of sight at each waypoint in these paths, to find a minimal set of nodes VsV*superscript𝑉𝑠superscript𝑉V^{s}\subset V^{*}italic_V start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ⊂ italic_V start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT that represents all information in the map (the pseudo-code can be found in the supplemental material). The result is a sparser information graph Gs=(Vs,Es)superscript𝐺𝑠superscript𝑉𝑠superscript𝐸𝑠G^{s}=(V^{s},E^{s})italic_G start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT = ( italic_V start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT , italic_E start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ), which is used as the policy network’s input.

V Experiment

Refer to caption
(a) DSVP, distance 1351m1351𝑚1351m1351 italic_m, time 796s796𝑠796s796 italic_s
Refer to caption
(b) TARE, distance 1177m1177𝑚1177m1177 italic_m, time 644s644𝑠644s644 italic_s
Refer to caption
(c) Ours, distance 985m985𝑚985m985 italic_m, time 582s582𝑠582s582 italic_s
Figure 4: Exploration paths comparisons in a large-scale 130m×100m130𝑚100𝑚130m\times 100m130 italic_m × 100 italic_m indoor office simulation.
Refer to caption
(a) TARE, 1505m1505𝑚1505m1505 italic_m, 803s803𝑠803s803 italic_s
Refer to caption
(b) Ours, 1104m1104𝑚1104m1104 italic_m, 659s659𝑠659s659 italic_s
Figure 5: Exploration paths comparisons in a large-scale 150m×150m150𝑚150𝑚150m\times 150m150 italic_m × 150 italic_m outdoor forest simulation.

V-A Validation in Small-scale Environments

We first test our trained model on a benchmark set of 100100100100 small-scale simplified simulation environments (never seen during training). We compare our planner with baseline planners including (1) nearest: the robot always moves to the nearest frontier, (2) utility: the robot moves to the frontier which balances cost and utility, (3) NBVP: a sampling-based planner based on rapid random trees [17], and (4) TARE Local: the local level planner of TARE [4, 6]. (5) CNN: a DRL planner based on CNNs [25]. (6) ARiADNE: an ablation variant of our model trained without ground truth as in [8]).

In our tests, we consider exploring more than 99% (meaning minor error is tolerable) of the free area in the environment as completing the exploration. As shown in Table I, our DRL planner outperforms all baseline planners (11% better than TARE Local and Utility, 15% better than CNN and NBVP) in terms of average travel distance to complete the exploration. We first note that the vanilla model trained without ground truth already achieves shorter exploration paths than all other baseline planners, showing the efficiency of our attention-based policy network to reason about dependencies of areas at multiple scales and thus making decisions that balance exploration and refining the map. We then note that our model trained with ground truth further improves over this vanilla ARiADNE model (7% better), verifying that training the critic network with ground truth can assist the policy network to estimate the transition model and long-term returns more precisely. Our results suggest that our planner can effectively predict the structure of unexplored areas, to make further non-myopic decisions and avoid redundant movements, by relying on past experiences in similarly structured indoor environments seen at training. We visualize attention weights of our attention layer to observe the insight learned mechanism (see our supplemental material), and we find that in some heads the planner focuses its attention over areas where connections might be missing, implying that the planner is predicting the structure of unknown areas from the belief of known areas.

V-B Validation in Large-scale Environments

Refer to caption
Figure 6: Comparisons with baseline planners in the large-scale indoor benchmark [6] (10 runs each), including ARiADNE despite its inability to complete exploration.

We then test our large-scale exploration planner in a Gazebo simulation of a 130m×100m130𝑚100𝑚130m\times 100m130 italic_m × 100 italic_m office with cluttered obstacles provided by [6]. Compared to our small-scale test environments where the sensor model and robot kinetic are highly simplified, in this large-scale test environment, the simulation considers real-world sensor model (a Velodyne 16-line 3D LiDAR) and real robot constraint (a four-wheel differential drive robot with max speed 2m/s2𝑚𝑠2m/s2 italic_m / italic_s). Note that the model used for our large-scale tests is the same as the one for our small-scale tests, but using our graph rarefaction algorithm. We use Octomap [33] to classify free and occupied areas (i.e., define ”explored” area). We set the resolution of Octomap to 0.4m0.4𝑚0.4m0.4 italic_m, the max map update range to 20m20𝑚20m20 italic_m, the gap of nodes to 2m2𝑚2m2 italic_m, the number of neighboring nodes in the dense collision-free graph to 5555, the number of neighboring nodes in the sparse information graph to 10101010, the sparse radius to 12m12𝑚12m12 italic_m, and replan the robot’s path every 0.8s0.8𝑠0.8s0.8 italic_s. Although we construct a 3D Octomap, the planner only considers frontiers at the ground level (i.e., planning is based on a 2D belief). All computations are conducted on a Intel i7-1270p CPU.

TABLE II: Comparisons with baseline planners in the large-scale indoor benchmark [6] (10 runs each).
DSVP TARE Ours Human
Distance (m𝑚mitalic_m) 1462 1158 1020 879
Time (s𝑠sitalic_s) 870 634 590 520
Computing (s𝑠sitalic_s) 0.90 0.24 0.15 /
Efficiency (m3/msuperscript𝑚3𝑚m^{3}/mitalic_m start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT / italic_m) 3.91 4.82 5.42 6.27
Efficiency (m3/ssuperscript𝑚3𝑠m^{3}/sitalic_m start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT / italic_s) 6.57 8.81 9.36 10.60
TABLE III: Comparisons with baseline planners in the large-scale outdoor benchmark [6] (5 runs each).
DSVP TARE Ours Human
Distance (m𝑚mitalic_m) 2002 1347 1174 1071
Time (s𝑠sitalic_s) 1053 707 686 549
Computing (s𝑠sitalic_s) 1.31 0.70 0.44 /
Efficiency (m3/msuperscript𝑚3𝑚m^{3}/mitalic_m start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT / italic_m) 21.13 29.37 32.63 36.05
Efficiency (m3/ssuperscript𝑚3𝑠m^{3}/sitalic_m start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT / italic_s) 40.17 55.96 55.85 70.33
Refer to caption
(a) Turtlebot3
Refer to caption
(b) Teaching room
Refer to caption
(c) Final Octomap
Refer to caption
(d) Office room
Refer to caption
(e) Another office room
Refer to caption
(f) Final point cloud map
Figure 7: Real world experiment in a 80m×10m80𝑚10𝑚80m\times 10m80 italic_m × 10 italic_m laboratory with cluttered furniture and moving pedestrians.

We compare our large-exploration planner with: (1) TARE [6, 4] (also the provider of this test environment), the state-of-the-art large-scale exploration planner which in general produces 80%percent8080\%80 % increased exploration efficiency over benchmark planners such as [17, 3]; in TARE, a global planner is designed to find a global TSP path to visit all unexplored areas, and the start of the global path is taken as the destination for the local path planner discussed in the last subsection, (2) DSVP [34], which uses a rapid random tree-based local planner and a graph-based global planner, (3) Best human practice with full prior knowledge about the environment provided by [6] as the reference for optimal exploration path. The hyperparameters of TARE and DSVP were tuned by the original authors directly on benchmark environments. Note that TARE further considers 3D frontiers but still does path planning in 2D action space. We evaluate the performance of all planners from multiple metrics: (1) Distance: total exploration path length, (2) Time: total duration of the exploration task, (3) Computing: per-step planning time, and (4) Efficiency: explored volume divided by the travel distance or makespan. Results are shown in Table II, Figure 4, and Figure 6.

We note that our DRL-based planner outperforms TARE in terms of exploration efficiency (12% in terms of distance efficiency and 6% in terms of time efficiency) and algorithm computing time (including frontier detection, informative graph update, and network inference time, 60% faster even though our planner is implemented in Python and TARE in C++). As shown in Figure 4, the exploration path planned by our DRL planner exhibits fewer redundant movements than benchmark planners. Although TARE theoretically guarantees near-optimal paths to cover frontiers in its current belief, we believe the advanced performance of our DRL planners shows the key importance of predicting the potential structure of unknown areas, which can yield decisions that benefit longer-term exploration. Note that paths planned by TARE allow the robot to move at the fastest speed, as TARE explicitly considers the motion constraint of the robot. Therefore, our advantage in terms of time efficiency is smaller than on distance efficiency.

Although our model was trained in indoor environments, we also test it in a 150m×150m150𝑚150𝑚150m\times 150m150 italic_m × 150 italic_m cluttered outdoor forest Gazebo environment without further training (results are shown in Table III and Figure 5). In this outdoor environment, the structure learned by our model (e.g., junctions and corridors) in indoor environments does not exist anymore, and we expect our planner not to be able to perform implicit predictions as well as in indoor environments. As expected, we observe our planner sometimes randomly chooses waypoints, implying that it cannot decide which action is optimal. However, in this cluttered outdoor environment, our DRL planner still achieves time efficiency on par with TARE and better distance efficiency.

V-C Validation in Real World

We finally validate our large-scale planner in the real world, using a wheeled robot equipped with a 3D LiDAR to explore an 80m×10m80𝑚10𝑚80m\times 10m80 italic_m × 10 italic_m indoor laboratory with cluttered furniture and moving pedestrians (see Figure 7). The robot platform is a customized four-wheel Turtlebot3 with a max speed of 0.2m/s0.2𝑚𝑠0.2m/s0.2 italic_m / italic_s. We use a Leishen C16 LiDAR with LOAM [1] to both get LiDAR odometry and map**. For our DRL planner, compared to the parameters used in simulation, we set the resolution of Octomap to 0.2m0.2𝑚0.2m0.2 italic_m, and the gap of nodes to 0.8m0.8𝑚0.8m0.8 italic_m to better handle such a cluttered real-world indoor environment. We believe the robot successfully and efficiently explores the lab in 12121212 minutes, highlighting the robust applicability of our planner in the real world. It should be noted that most previous DRL-based planners [24, 26, 25, 28, 27] are not validated on hardware in such a cluttered and large-scale environment. Compared to these planners that directly take the map as network input, where images from the real world may be very noisy and different from those used during training, we believe that extracting an informative graph helps the learned model avoid failure on never-seen real-world scenarios.

VI Conclusion

In this work, we propose a DRL-based exploration framework, which relies on ground truth information to assist the training of an attention-based policy network. We show that our trained model can more precisely estimate the unexplored areas, which helps the planner better reason about its map/belief and make decisions that benefit long-term objectives. With a graph-rarefaction algorithm, our planner becomes the first DRL-based method applicable to large environments and significantly outperforms state-of-the-art conventional planners in a benchmark indoor simulation environment. We also validate our planner in the real world, highlighting its potential for real-life deployments on robot.

Future works will focus on extending our planner from single to multi-robot exploration, where the planner needs to further help robots achieve efficient cooperation. We are also interested in letting the robot take account of the robot’s motion constraint in training.

VII Acknowledgement

This work was supported by Temasek Laboratories (TL@NUS) under grant TL/FS/2022/01 and the Singapore Ministry of Education Academic Research Fund Tier 1.

Refer to caption
Figure 8: Ablation results, with and without ground truth when training the critic network. By simplifying the POMDP to a MDP, our privileged learning approach reduces the state-action value loss and gradient variance by 10 times over the original critic network.
Refer to caption
Figure 9: Visualization of the learned attention weights from the three main heads of our decoder in a small-scale scenario. The query source is the node feature corresponding to the current position (purple). Key and value sources are all nodes in the graph (blue), and their radius is proportional to their learned attention weight. Note how the first and second heads adapt their attention when they notice that two corridors (top right) might connect.
Input: node set V*superscript𝑉V^{*}italic_V start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, map \mathcal{M}caligraphic_M, robot pos. ptsubscript𝑝𝑡p_{t}italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, radius Dthsubscript𝐷𝑡D_{th}italic_D start_POSTSUBSCRIPT italic_t italic_h end_POSTSUBSCRIPT
Output: sparse information graph Gs=(Vs,Es)superscript𝐺𝑠superscript𝑉𝑠superscript𝐸𝑠G^{s}=(V^{s},E^{s})italic_G start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT = ( italic_V start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT , italic_E start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT )
Get non-zero utility node set U𝑈Uitalic_U from V*superscript𝑉V^{*}italic_V start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT
Initialize node set VsUsuperscript𝑉𝑠𝑈V^{s}\leftarrow Uitalic_V start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ← italic_U, covered node set U¯¯𝑈\overline{U}\leftarrow\emptysetover¯ start_ARG italic_U end_ARG ← ∅
for vU𝑣𝑈v\in Uitalic_v ∈ italic_U do
       if vU¯𝑣normal-¯𝑈v\in\overline{U}italic_v ∈ over¯ start_ARG italic_U end_ARG then continue;
       Find nearby node set N𝑁Nitalic_N in Dthsubscript𝐷𝑡D_{th}italic_D start_POSTSUBSCRIPT italic_t italic_h end_POSTSUBSCRIPT
       for vNsuperscript𝑣normal-′𝑁v^{\prime}\in Nitalic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ italic_N do
             if line (v,v)𝑣superscript𝑣normal-′(v,v^{\prime})( italic_v , italic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) is collision free then  U¯v¯𝑈superscript𝑣\overline{U}\leftarrow v^{\prime}over¯ start_ARG italic_U end_ARG ← italic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT;
            
       end for
      Find path ζ𝜁\zetaitalic_ζ from ptsubscript𝑝𝑡p_{t}italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to v*superscript𝑣v^{*}italic_v start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, set ref node vref=vsubscript𝑣𝑟𝑒𝑓𝑣v_{ref}=vitalic_v start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT = italic_v
       for i|ζ|𝑖𝜁i\in|\zeta|italic_i ∈ | italic_ζ | do
             if line (vref,ζi)subscript𝑣𝑟𝑒𝑓subscript𝜁𝑖(v_{ref},\zeta_{i})( italic_v start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT , italic_ζ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is not collision free or L(vref,ζi)Dth𝐿subscript𝑣𝑟𝑒𝑓subscript𝜁𝑖subscript𝐷𝑡L(v_{ref},\zeta_{i})\geq D_{th}italic_L ( italic_v start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT , italic_ζ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≥ italic_D start_POSTSUBSCRIPT italic_t italic_h end_POSTSUBSCRIPT then
                   vref=ζi1subscript𝑣𝑟𝑒𝑓subscript𝜁𝑖1v_{ref}=\zeta_{i-1}italic_v start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT = italic_ζ start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT, Vsvrefsuperscript𝑉𝑠subscript𝑣𝑟𝑒𝑓V^{s}\leftarrow v_{ref}italic_V start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ← italic_v start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT
            
       end for
      
end for
Get collision free edge set Essuperscript𝐸𝑠E^{s}italic_E start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT based on Vssuperscript𝑉𝑠V^{s}italic_V start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT and \mathcal{M}caligraphic_M
Algorithm 1 Dense Graph Rarefaction Algorithm

References

  • [1] J. Zhang and S. Singh, “Loam: Lidar odometry and map** in real-time,” in Robotics: Science and Systems, vol. 2, no. 9.   Berkeley, CA, 2014, pp. 1–9.
  • [2] M. Selin, M. Tiger, D. Duberg, F. Heintz, and P. Jensfelt, “Efficient autonomous exploration planning of large-scale 3-d environments,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1699–1706, 2019.
  • [3] T. Dang, M. Tranzatto, S. Khattak, F. Mascarich, K. Alexis, and M. Hutter, “Graph-based subterranean exploration path planning using aerial and legged robots,” Journal of Field Robotics, vol. 37, no. 8, pp. 1363–1388, 2020.
  • [4] C. Cao, H. Zhu, H. Choset, and J. Zhang, “Tare: A hierarchical framework for efficiently exploring complex 3d environments,” in Robotics: Science and Systems, 2021.
  • [5] F. Yang, D.-H. Lee, J. Keller, and S. Scherer, “Graph-based topological exploration planning in large-scale 3d environments,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 12 730–12 736.
  • [6] C. Cao, H. Zhu, Z. Ren, H. Choset, and J. Zhang, “Representation granularity enables time-efficient autonomous exploration in large, complex worlds,” Science Robotics, vol. 8, no. 80, p. eadf0970, 2023.
  • [7] Y. Cao, Y. Wang, A. Vashisth, H. Fan, and G. A. Sartoretti, “Catnipp: Context-aware attention-based network for informative path planning,” in Conference on Robot Learning.   PMLR, 2023, pp. 1928–1937.
  • [8] Y. Cao, T. Hou, Y. Wang, X. Yi, and G. Sartoretti, “Ariadne: A reinforcement learning approach using attention-based deep networks for exploration,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 10 219–10 225.
  • [9] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning.   PMLR, 2018, pp. 1861–1870.
  • [10] P. Christodoulou, “Soft actor-critic for discrete action settings,” arXiv preprint arXiv:1910.07207, 2019.
  • [11] Z. Zhang, T. Henderson, S. Karaman, and V. Sze, “Fsmi: Fast computation of shannon mutual information for information-theoretic map**,” The International Journal of Robotics Research, vol. 39, no. 9, pp. 1155–1177, 2020.
  • [12] A. Asgharivaskasi, S. Koga, and N. Atanasov, “Active map** via gradient ascent optimization of shannon mutual information over continuous se (3) trajectories,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2022, pp. 12 994–13 001.
  • [13] A. Asgharivaskasi and N. Atanasov, “Semantic octree map** and shannon mutual information computation for robot exploration,” IEEE Transactions on Robotics, vol. 39, no. 3, pp. 1910–1928, 2023.
  • [14] B. Yamauchi, “A frontier-based approach for autonomous exploration,” in Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA’97.’Towards New Computational Principles for Robotics and Automation’.   IEEE, 1997, pp. 146–151.
  • [15] D. Holz, N. Basilico, F. Amigoni, and S. Behnke, “Evaluating the efficiency of frontier-based exploration strategies,” in ISR 2010 (41st International Symposium on Robotics) and ROBOTIK 2010 (6th German Conference on Robotics).   VDE, 2010, pp. 1–8.
  • [16] M. Kulich, J. Faigl, and L. Přeučil, “On distance utility in the exploration task,” in 2011 IEEE International Conference on Robotics and Automation.   IEEE, 2011, pp. 4455–4460.
  • [17] A. Bircher, M. Kamel, K. Alexis, H. Oleynikova, and R. Siegwart, “Receding horizon” next-best-view” planner for 3d exploration,” in 2016 IEEE international conference on robotics and automation (ICRA).   IEEE, 2016, pp. 1462–1468.
  • [18] D. Duberg and P. Jensfelt, “Ufoexplorer: Fast and scalable sampling-based exploration with a graph-based planning structure,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 2487–2494, 2022.
  • [19] L. Schmid, M. Pantic, R. Khanna, L. Ott, R. Siegwart, and J. Nieto, “An efficient sampling-based method for online informative path planning in unknown environments,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1500–1507, 2020.
  • [20] B. Zhou, Y. Zhang, X. Chen, and S. Shen, “Fuel: Fast uav exploration using incremental frontier structure and hierarchical planning,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 779–786, 2021.
  • [21] M. Dharmadhikari, T. Dang, L. Solanka, J. Loje, H. Nguyen, N. Khedekar, and K. Alexis, “Motion primitives-based path planning for fast and agile exploration using aerial robots,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 179–185.
  • [22] M. Luperto, L. Fochetta, and F. Amigoni, “Exploration of indoor environments through predicting the layout of partially observed rooms,” in Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, 2021, pp. 836–843.
  • [23] L. Schmid, C. Ni, Y. Zhong, R. Siegwart, and O. Andersson, “Fast and compute-efficient sampling-based local exploration planning via distribution learning,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7810–7817, 2022.
  • [24] D. Zhu, T. Li, D. Ho, C. Wang, and M. Q.-H. Meng, “Deep reinforcement learning supervised autonomous exploration in office environments,” in 2018 IEEE international conference on robotics and automation (ICRA).   IEEE, 2018, pp. 7548–7555.
  • [25] F. Chen, S. Bai, T. Shan, and B. Englot, “Self-learning exploration and map** for mobile robots via deep reinforcement learning,” in Aiaa scitech 2019 forum, 2019, p. 0396.
  • [26] H. Li, Q. Zhang, and D. Zhao, “Deep reinforcement learning-based automatic exploration for navigation in unknown environment,” IEEE transactions on neural networks and learning systems, vol. 31, no. 6, pp. 2064–2076, 2019.
  • [27] Y. Xu, J. Yu, J. Tang, J. Qiu, J. Wang, Y. Shen, Y. Wang, and H. Yang, “Explore-bench: Data sets, metrics and evaluations for frontier-based and deep-reinforcement-learning-based autonomous exploration,” in 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 6225–6231.
  • [28] F. Chen, J. D. Martin, Y. Huang, J. Wang, and B. Englot, “Autonomous exploration under uncertainty via deep reinforcement learning on graphs,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 6140–6147.
  • [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [30] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” Advances in neural information processing systems, vol. 28, 2015.
  • [31] Y. Cao, Z. Sun, and G. Sartoretti, “Dan: Decentralized attention-based neural network for the minmax multiple traveling salesman problem,” in International Symposium on Distributed Autonomous Robotic Systems.   Springer, 2022, pp. 202–215.
  • [32] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al., “Ray: A distributed framework for emerging ai applications,” in Proceedings of OSDI, 2018, pp. 561–577.
  • [33] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, “Octomap: An efficient probabilistic 3d map** framework based on octrees,” Autonomous robots, vol. 34, pp. 189–206, 2013.
  • [34] H. Zhu, C. Cao, Y. Xia, S. Scherer, J. Zhang, and W. Wang, “Dsvp: Dual-stage viewpoint planner for rapid exploration by dynamic expansion,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 7623–7630.