\affiliation

[1]organization=Electrical & Computer Engineering Department, addressline=6 MetroTech Center, city=Brooklyn, postcode=11201, state=New York, country=USA \affiliation[2]organization=Electrical Engineering, New York University Abu Dhabi (NYUAD), addressline=Saadiyat Island, postcode=129188, state=Abu Dhabi, country=UAE \affiliation[3]organization=Center for Artificial Intelligence and Robotics, NYUAD, addressline=Saadiyat Island, postcode=129188, state=Abu Dhabi, country=UAE

Efficient and Distributed Large-Scale
3D Map Registration using Tomographic Features

Halil Utku Unlu Anthony Tzes Prashanth Krishnamurthy Farshad Khorrami

Abstract

A robust, resource-efficient, distributed, and minimally parameterized 3D map matching and merging algorithm is proposed. The suggested algorithm utilizes tomographic features from 2D projections of horizontal cross-sections of gravity-aligned local maps, and matches these projection slices at all possible height differences, enabling the estimation of four degrees of freedom in an efficient and parallelizable manner. The advocated algorithm improves state-of-the-art feature extraction and registration pipelines by an order of magnitude in memory use and execution time. Experimental studies are offered to investigate the efficiency of this 3D map merging scheme.

keywords:

Map merging, 3D perception, Multi-robot exploration

1 Introduction

In many applications, robot collaboration not only is required, but also greatly enhances the effectiveness of a robotic system. Multiple robots can provide a larger coverage [1], different views of the working environment, faster exploration, and resiliency against individual agent failures. However, communication fidelity, limited computational capabilities, and the operated environments’ dynamic and uncertain nature hamper the utility of multi-agent systems in real-world applications [2].

A baseline requirement for a collaborative mission for robotic systems is a shared understanding of the environment and their location within it. External infrastructures with a centralized computation structure can provide the global state to each agent, but they restrict the applicability of the systems in unknown, unstructured environments without a central unit.

The common solution for autonomous localization, navigation, and map** in unknown, uncertain environments is the set of methods called simultaneous localization and map** (SLAM), in which the robotic platforms rely on the data they collect through onboard sensors to estimate their position and generate a map. Single robot SLAM for both 2D and 3D motion is a mature field with many advanced platforms and frameworks [3, 4].

An extension of the SLAM methods on multiple agents is commonly referred to as Collaborative SLAM (C-SLAM), and it aims to provide a consistent and more accurate system state estimate (i.e., location of all agents, global map) by utilizing data from multiple participating agents. The field has seen significant development in recent years, though large-scale deployment needs further refinement [2]. Part of the reason is that the SLAM task requires the use of dense, rich data from vision sensors to generate a useful and accurate environment representation. 3D data streams (RGB-D sensors and LiDAR) generate large amount of data requiring a high communication bandwidth that can overload the network and demand more computational budget from the resource-limited mobile agents.

An alternative way for utilizing individual map information gathered by every agent in the field is to perform map matching and merging, where the agents share their map with other agents upon connectivity and attempt to align and combine other maps, hence providing a more complete view of their environment. Multiple approaches have been proposed for 2D map matching and merging scenarios [5, 6, 7, 8], but the solutions do not easily scale to 3D map representations. Although the advances in low-power devices with GPUs address partially the computation issue, the complexity needed from the robotic agents adds additional burdens on already resource-strapped platforms. Furthermore, many of the existing algorithms require adjusting the parameters of the network, either through training on a different data or adjusting a large set of key parameters.

This paper attempts to address the gap in map matching for large-scale 3D maps by proposing a feature extraction framework that effectively enables the use of 2D image features in a pair of gravity-aligned 3D maps using tomography [9] and providing a thorough evaluation and comparisons against alternatives. In the proposed method, the algorithm a) extracts a binary occupancy representation for horizontal cross-sections of the maps, b) computes 2D features over these slices, and c) restricts matching space for the features across maps to its corresponding section only. Cross-correlating the maps based on their agreement in pose estimations yields a 4 degree of freedom (DoF) transformation estimate. The proposed approach will be demonstrated to be significantly efficient (in both time and memory consumption) and accurate compared to state-of-the-art point cloud registration methods. An early version of this paper appeared in [9], which does not contain the simulation studies, any experimental results, and the recent algorithms over the past two years.

The contributions of this article include:

1.

a simple and efficient approach to extracting 3D features via tomographic extraction of horizontal sections with minimal parameterization,
2.

study of viable uses of the aforementioned features in addressing large-scale 3D-map matching and merging scenarios across environments of varying scales,
3.

extensive comparative studies on simulated and real data to assess effectiveness against alternative methods,
4.

experimental studies using a quadruped to emulate merging scenario in a large indoor area, and
5.

the source code to replicate the study for both proposed and alternative methods¹¹1https://github.com/RISC-NYUAD/tomographic-map-matching.

The remainder of the paper is structured as follows: Relevant work is introduced in Section 2, and the formulation of the studied problem is provided in Section 3. The proposed method is described in Section 4, detailing the process of tomographic feature extraction and registration with the computed features. The structure of the comparative analysis and the baseline methods used for evaluation are detailed in Section 5. Results on a simulated dataset are illustrated in Section 6. Experimental studies on real data, using KITTI odometry benchmark and a newly generated data from a large-scale indoor environment, are provided in Section 7. Limitations of the algorithm and the overall study are provided in Section 8, followed by the concluding remarks.

2 Related Work

2.1 Map Matching and Merging

Map matching is related to estimating the relative transformation between a pair of maps, and map merging is the process of combining the representations of a set of maps to a common structure, given the relative transformation. The literature for matching and merging probabilistic 2D occupancy grid maps is vast and several reviews provide a thorough coverage of the field [8, 10]. Overall, two common threads can be observed: optimization-based and feature-based map matching.

Optimization-based algorithms operate on the entire probability grid without an intermediate representation. Some of the earlier work focused on maximizing a similarity metric [5] and a connectivity metric with better results [11]. More recent work uses genetic algorithms to address the matching and merging on low-resource platforms [12]. Feature-based map matching algorithms extract an intermediate representation from the raw occupancy grid to perform associations and estimations. Salient and descriptive feature extraction forms the basis of matching [13, 14, 15, 16, 17].

Tomographic features and a 2D map matching based on the Radon transform of the occupancy grid is employed in [15]. Tomographically salient features are used in a Gaussian mixture model to estimate the 3 DoF transformation. Another feature-based algorithm extracts 2D image features from image representations of the occupancy grid maps and performs an initial guess via RANSAC, which is then refined with ICP for tighter alignment [14]. Extensive comparison of various image features and ideal parameter selection process is provided in the paper.

The algorithms outlined earlier cannot be applied in 3D domain directly. The main issues in 3D map matching stem primarily from the scale and density of data with the added dimension [18]. 3D LiDAR sensors provide accurate position information in a long range, while stereo cameras with onboard computational units can measure dense depth in a camera field of view. Maps generated from such sensors are inevitably large. The ideas for matching and merging are similar to those in the 2D case: optimize a distance cost over 3D data [19, 20], or extract local or global features from the 3D data and perform a robust registration using the corresponding features [21, 22, 23, 24].

However, the aforementioned methods cannot run in real time using onboard computer of the individual agents. Near real-time performance was achieved in [21] while providing demonstrations on large scale but simplistic indoor environments. Similarly [24] relies on learning-based place recognition descriptors from pose-centered spherical projections and GICP optimization, but require preprocessing on each scan to be performed. While the SLAM and map matching and merging operations both require the preprocessing step, the systems end up being coupled.

2.2 Collaborative SLAM

The primary focus for collaborative SLAM (C-SLAM) algorithms [25], is the pose optimization using both intra-robot and inter-robot loop closures within the maps. As such, individual agents do not share their maps fully with others in the network. With the exception of some centralized architectures [26, 27], it is not possible to obtain the global map with many of the prominent C-SLAM architectures [28, 29, 30]. It is indeed possible to perform the map merging using the C-SLAM algorithms, but a framework for sharing map data between agents is not discussed, and the bandwidth requirements for dense 3D map data remains a significant challenge.

2.3 3D Point Features

A common issue that needs to be addressed for both 2D and 3D feature description problem is the necessity for the descriptors to be pose invariant. The same point of interest, viewed from different locations, should result in ideally the same descriptor. An additional challenge for arbitrary 3D point clouds is that the data structure is not necessarily ordered, and any permutation of the order of points should still provide accurate descriptors. Much of the work is focusing on obtaining such pose-invariant descriptors, and can be grouped into two major categories: hand-crafted and learning-based.

Hand-crafted feature descriptors establish a local frame of reference, informed by the point neighborhood, to generate discerning statistics about the point. A prominent algorithm is Fast Point Feature Histograms (FPFH) [31], comprised of the histograms of angular variations for a point of interest in its k-nearest neighborhood. A comprehensive review of hand-crafted features in [32] found FPFH to be effective in low-noise scan matching scenarios. An adaptation of FAST [33] corner detection on orthogonal 2D projections of the neighborhood of a point is proposed as an efficient alternative for low-resource platforms [22].

Recent algorithms attempt to learn the discerning and pose-invariant descriptors directly from available object, indoor, and outdoor datasets. Earlier realizations relied on adapting Convolutional Neural Networks (CNNs) to 3D voxel-based representations [34, 35]. Structured use of multi-layer perceptrons were shown to be effective in dealing with the unordered nature of point cloud data [36, 37], and various CNN-architectures that focus directly on point data emerged [38, 39, 40, 41, 42].

A review of learning-based point cloud feature extraction for various applications can be found in [43]. Two of the convolutional architectures are important to our study: Kernel Point Convolution (KPConv) [44], which uses radius-neighborhoods to define deformable kernels to be used in 3D point convolution, and Fully Convolutional Geometric Features (FCGF) [45], which implements a fully-convolutional architecture with a metric learning loss to improve both accuracy and efficiency.

2.4 Point Cloud Registration

Given noiseless data and perfect correspondences, the registration problem has a closed form solution by solving the Procrustes problem. However, the conditions cannot be met in real world. Sensors are noisy, and correspondences can be wrong: 95% outlier rates have been reported [46], necessitating robust solutions. Reviews for the algorithms addressing registration problem for same-source [47] and cross-source [48] are available.

Sample consensus-based algorithms (SAC or RANSAC-family) [49] find the most consistent registration hypothesis by evaluating the solution generated from a minimal set of correspondences, sampled from the initial correspondence set, and finding the solution that is accurate for the most number of correspondences, termed the inlier set. More modern iterations of RANSAC family of algorithms provide optimality guarantees in quadratic-time [50] and differentiable implementation for trainability and use in learning-based frameworks [51]. Another significant robust registration method relies on semi-definite relaxation in its optimization, allowing it to handle extreme outlier ratios with a certificate of optimality [52].

Learning-based registration frameworks cast the problem into differentiable sets of modules, treating initial set of correspondences as putative and assigning weights on each correspondence. Deep Global Registration (DGR) [53] minimizes a robust energy metric with a convolutional network for confidence assignment, while PointDSC [54] incorporates spatial consistency explicitly to the network architecture.

Attention-based transformer networks aim to utilize the benefits of transformer architectures for point cloud registration problems. Some examples include PREDATOR [55], a model that learns to focus primarily on the overlap** regions, GeoTransformer [56], an archirecture that encodes the geometric structure of the point neighborhood to obtain a pose-invariant superpoint representation and performs keypoint-free coarse-to-fine registration, and Rotation-Invariant Transformer for Point Cloud Matching (RoITr) [57], a system trained to be able to deal with arbitrary rotational transformations.

A significant bottleneck for the attention mechanisms is the computational complexity. Most of the algorithms appear to be implemented and tested only on server-grade GPUs, without much concern for mobile deployment. BUFFER [58] stands out as a model that attempts a trade-off between accuracy, efficiency, and generalizability by improving the feature quality and estimation pipelines with small-scale networks.

3 Problem Formulation

Vectors and matrices are denoted with lowercase and uppercase boldfaced characters (e.g. $\mathbf{p}$ , $\mathbf{T}$ ), respectively. Sets are named with calligraphic fonts (e.g. $\mathcal{S}$ , $\mathcal{C}$ ) or otherwise implicitly defined inside curly brackets. The cardinality of a set is denoted as $|\cdot|$ . $\mathbb{R}$ and $\mathbb{Z}^{+}$ denote the set of real numbers and positive integers, respectively. Object indexing and memberships are denoted with a combination of pre- or post-fixed subscripts and superscripts.

3.1 Map Matching Problem

This paper addresses the problem of pairwise merging of gravity-aligned 3D maps that can be canonically represented as an unordered point cloud, where each point stores a coordinate as ${{}^{i}\mathbf{p}_{j}}\in\mathbb{R}^{3},~{}i\in\{c,d\}$ , by two agents $c$ and $d$ . We assume no prior knowledge of the absolute pose information for any agent, and the agents are not required to observe each other via a rendezvous. The communication network is assumed to have sufficient bandwidth in the order of 10 Mbps in order to transfer incremental maps having a size close to 70 MB. The map matching and merging tasks are invoked upon sufficient connectivity.

The map of different agents will be denoted as

{{}^{i}\mathcal{M}}=\{\dots,~{}{{}^{i}\mathbf{p}_{j}},~{}\dots\},\quad|{{}^{i}% \mathcal{M}}|=N_{i},\quad i\in\{c,~{}d\}.

(1)

To ensure the applicability and scalability of the tested algorithms, it is assumed that the point cloud representation is processed with a voxel grid filter of leaf size $g$ . The filtering process ensures more uniform point density and eliminates the dependency on particular sensor sets during map** (i.e., LiDAR providing high point density perpendicular to its rotation axis, RGBD/stereo having high density in limited field of view).

The number of points in agents’ maps, $N_{i}$ , depends primarily on the map resolution and the size of the environment that is being mapped. Sensible combinations that preserve a moderate amount of detail can increase the map size quickly, reaching the order of millions.

Let $w$ denote a frame of reference fixed in the world. The points of the map ${{}^{i}\mathcal{M}}$ can be transformed into $w$ via a rigid 3D homogeneous transformation $\mathbf{T}_{i}^{w}\in SE(3)$ :

SE(3)=\left\{\begin{bmatrix}\mathbf{R}_{3\times 3}&\mathbf{t}_{3\times 1}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix},\quad\mathbf{R}^{\top}\mathbf{R}=\mathbf% {RR}^{\top}=\mathbf{I},\quad\det({\mathbf{R}})=1\right\}

(2)

The global world frame $w$ can be arbitrarily defined by anchoring one agent’s frame as the map origin. Assigning the local frame of agent $c$ as the fixed frame $w$ , the goal is to estimate the transformation $\mathbf{T}_{d}^{c}\in SE(3)$ .

We assume that the agents’ maps are gravity-aligned. More explicitly, we assume that the agents are capable of observing the gravity vector. The gravity vector is common among all of the agents operating in the environment and is parallel to the $z$ -axis of the map they are building. We do not restrict the $z$ -axes to be identical (i.e., origins in the $z$ -axes are coincident), as the agents may have different initializations for the map height.

With this assumption, the problem is restricted to 4 degrees of freedom (DoF) and the homogeneous transformation can be parameterized as

\mathbf{T}_{d}^{c}=\mathbf{T}(x_{d}^{c},y_{d}^{c},z_{d}^{c},\theta_{d}^{c})=% \begin{bmatrix}\cos\theta_{d}^{c}&-\sin\theta_{d}^{c}&0&x_{d}^{c}\\ \sin\theta_{d}^{c}&\cos\theta_{d}^{c}&0&y_{d}^{c}\\ 0&0&1&z_{d}^{c}\\ 0&0&0&1\end{bmatrix}.

(3)

The given setup does not restrict the motion of the agents. The agent is free to execute any motion, as long as the gravity vector can be measured.

A map matching (or more generally, a point cloud registration) algorithm relies on the set of correspondences $\mathcal{C}$ established between pairs of points:

\mathcal{C}\triangleq\left\{({{}^{c}}\mathbf{p}_{m},{{}^{d}}\mathbf{p}_{n}):~{% }m,n\in\mathbb{Z}^{+},~{}m\leq N_{c},~{}n\leq N_{d}\right\}.

(4)

Assuming that all correspondences are “true”, which are the pairs that overlap when the true transformation is applied to the targeted point cloud, the registration task can be cast as an optimization problem as

\mathbf{T}_{d}^{c}=\underset{\mathbf{T}\in SE(3)}{\arg\min}~{}\sum_{({{}^{c}}% \mathbf{p}_{m},{{}^{d}}\mathbf{p}_{n})\in\mathcal{C}}\rho({{}^{c}}\mathbf{p}_{% m},\mathbf{T}~{}{{}^{d}}\mathbf{p}_{n})

(5)

where $\rho(\mathbf{a},\mathbf{b})$ is a non-negative distance metric to determine the error for a given correspondence under the transformation hypothesis $\mathbf{T}$ . As an example, in the case of point-to-point iterative closest point (ICP), the cost metric assumes the form of Euclidean distance.

Estimating “true” correspondences is a difficult task, as it depends on the discrimination capacity of the descriptor and some mismatches cannot be avoided due to perceptual aliasing. It is reasonable to expect outlier correspondence rates of 90% [59] for 3D datasets.

An inlier set $\mathcal{C}_{\text{in}}\subseteq\mathcal{C}$ is the set of correspondences that agree with a given transformation, and it needs to be selected from correspondences that agree with the unknown, correct transformation, leading to a modification of (5):

\mathbf{T}_{d}^{c}=\underset{\mathbf{T}\in SE(3)}{\arg\min}~{}\sum_{({{}^{c}}% \mathbf{p}_{m},{{}^{d}}\mathbf{p}_{n})\in\mathcal{C}_{\text{in}}}\rho({{}^{c}}% \mathbf{p}_{m},\mathbf{T}~{}{{}^{d}}\mathbf{p}_{n})~{}.

(6)

4 Proposed Method

4.1 Tomographic Features

A tomographic section of the map, termed slice throughout the paper, is defined as the 2D binary occupancy representation of a horizontal cross-section of a map $\mathcal{M}_{i}$ at a particular height $h$ . The points in the map to generate a slice at height $h$ can be defined as:

{{}^{i}\widehat{\mathcal{S}}_{h}}\triangleq\left\{{{}^{i}}\mathbf{p}_{j}:\ h-t% <\left.{{}^{i}\mathbf{p}_{j}}\right|_{z}\leq h+t,\right\}

(7)

where $\left.{\mathbf{p}}\right|_{z}$ is the $z$ -coordinate of point $\mathbf{p}$ and $t$ is a parameter that determines the thickness of the cross-section.

A natural choice for the thickness parameter to utilize all available information is $t=g/2$ , half the leaf size. In this manner, every point in the map will belong to a single slice for a finite set of $h$ that are $g$ apart. Let ${}^{i}z_{\max}$ ( ${}^{i}z_{\min}$ ) be the maximum (minimum) $z$ -coordinate of the map ${}^{i}\mathcal{M}$ . Then, the index set of heights $\mathcal{H}_{i}$ for ${}^{i}\mathcal{M}$ is defined as:

\mathcal{H}_{i}\triangleq\left\{{{}^{i}}z_{\min}+k\cdot g:\ k\in\mathbb{Z}^{+}% ,\ k\leq\frac{{{}^{i}}z_{\max}-{{}^{i}}z_{\min}}{g}\right\}.

(8)

Points in every slice ${{}^{i}\widehat{\mathcal{S}}_{h}},~{}h\in\mathcal{H}_{i}$ are projected onto the $xy$ -plane and discretized on a 2D grid to obtain a 2D binary occupancy image, ${{}^{i}\mathbf{S}_{h}}$ . The extrema $xy$ coordinates of points in ${{}^{i}\widehat{\mathcal{S}}_{h}}$ and the grid size $g$ dictate the width and height of the resultant binary image. Intuitively, each pixel represents the occupancy of a real area of size $g\times g~{}m^{2}$ .

Examples of the 2D binary images obtained through the slicing are provided in Figure 1 for an indoor environment.

Refer to caption — Figure 1: Samples of 2D binary slice images extracted at differnet heights for a simulated indoor environment. Binary images with colorized borders denote the corresponding height at the 3D rendering of the environment.

ORB features and descriptors [60] are extracted from every 2D binary slice, and the highest scoring $K$ features are kept. The set of 128-bit descriptors, paired with the 2D coordinates, is denoted as ${{}^{i}}\mathcal{F}_{h}$ and forms the basis of the map matching task:

{{}^{i}}\mathcal{F}_{h}\triangleq\{{{}^{i}}\mathbf{f}_{h,k}\},~{}|{{}^{i}}% \mathcal{F}_{h}|\leq K.

(9)

The choice of ORB is primarily informed by the lack of real photo intensity changes within the binary images ${{}^{i}\mathbf{S}_{h}}$ and the efficiency of the underlying FAST [33] corner detection with oriented BRIEF [61] descriptors. Alternative feature extraction and description pipelines can be used [62], though the reliability and efficiency of ORB features are validated in the literature for various cases [63].

The slicing and feature extraction results in a different representation of the map as an indexed set of 2D binary slices and associated features:

\mathcal{S}_{i}\triangleq\left\{({{}^{i}\mathbf{S}_{h}},~{}{{}^{i}}\mathcal{F}% _{h}),~{}h\in\mathcal{H}_{i}\right\}.

(10)

4.2 Map Matching with Tomographic Features

Given the set of binary slices and their associated features $\mathcal{S}_{c}$ and $\mathcal{S}_{d}$ , the task is to estimate the 4 unknowns in the transformation $\mathbf{T}_{d}^{c}=\mathbf{T}(x_{d}^{c},y_{d}^{c},z_{d}^{c},\theta_{d}^{c})$ (from (3)). The proposed method solves a hierarchical optimization problem in two stages: a) the rigid 2D transformation estimation, and b) the height difference estimation, as shown in Figure 2.

4.2.1 Rigid 2D Transformation Estimation

The first stage estimates the 3 DoF for different values of $z_{d}^{c}$ . For a known height difference, each slice in a map will have either one corresponding slice that will be at the same height, or none. For each slice pair $({{}^{c}\mathbf{S}_{m}},~{}{{}^{d}\mathbf{S}_{n}})_{k}$ indexed by $k$ , matching computed features yields a set of putative correspondences $\{\cdots,({{}^{c}\mathbf{p}_{j}},~{}{{}^{d}\mathbf{p}_{j}}),\cdots\}_{k}$ , and the pose estimation problem reduces to finding a 2D rigid transformation, which has the form

\begin{bmatrix}\cos~{}\theta_{d}^{c}&-\sin~{}\theta_{d}^{c}&x_{d}^{c}\\ \sin~{}\theta_{d}^{c}&~{}\cos~{}\theta_{d}^{c}&y_{d}^{c}\end{bmatrix}\begin{% bmatrix}{{}^{d}x_{j}}\\ {{}^{d}y_{j}}\\ 1\end{bmatrix}=\begin{bmatrix}{{}^{c}x_{j}}\\ {{}^{c}y_{j}}\end{bmatrix}

(11)

for corresponding points ${{}^{i}\mathbf{p}_{j}}=\begin{bmatrix}{{}^{i}x_{j}}&{{}^{i}y_{j}}\end{bmatrix}% ^{\top},~{}i\in\{c,d\}$ . The problem is then cast to a linear system as

\begin{bmatrix}\vdots&\vdots&\vdots&\vdots\\ {{}^{d}x_{j}}&-{{}^{d}y_{j}}&1&0\\ {{}^{d}y_{j}}&~{}{{}^{d}x_{j}}&0&1\\ \vdots&\vdots&\vdots&\vdots\end{bmatrix}\begin{bmatrix}\alpha\\ \beta\\ x_{d}^{c}\\ y_{d}^{c}\end{bmatrix}=\begin{bmatrix}\vdots\\ {{}^{c}x_{j}}\\ {{}^{c}y_{j}}\\ \vdots\end{bmatrix}

(12)

where $\alpha=s\cos(\theta_{d}^{c})$ and $\beta=s\sin(\theta_{d}^{c})$ with $s$ as the scale parameter. Since we assume known scale, $s=1$ . The angle $\theta_{d}^{c}$ is recovered using $\text{atan2}(\beta,\alpha)$ .

For simplicity, RANSAC is used to obtain a robust solution, followed by Levenberg-Marquardt refinement steps over the inlier set. The linear system given in Equation (12) is solved for every slice pair $k$ at a given height to obtain the set of transformations at a known height:

\widetilde{\mathcal{T}}(z_{d}^{c})\triangleq\left\{{{}_{k}\widetilde{\mathbf{T% }}(\widetilde{x}_{d}^{c},\widetilde{y}_{d}^{c},z_{d}^{c},\widetilde{\theta}_{d% }^{c})}\triangleq{{}_{k}\widetilde{\mathbf{T}}_{d}^{c}}\right\}.

(13)

It is not expected that every slice pair $k$ provides an accurate estimate, as the agents may not have complete occupancy information in the neighborhood of a particular $xy$ -coordinate at every height. In such cases, the pose estimate ${{}_{k}\widetilde{\mathbf{T}}_{d}^{c}}$ does not represent the true transform, and simple averaging across all slices will introduce large errors.

To that end, a consensus between different 2D rigid estimations is established to obtain the most likely pose estimate. We perform this by finding the largest set of inlier hypotheses $\widehat{\mathcal{T}}(z_{d}^{c})\subseteq\widetilde{\mathcal{T}}(z_{d}^{c})$ within a distance to an anchor hypothesis, ${{}_{k}}\widetilde{\mathbf{T}}_{d}^{c}$ :

\widehat{\mathcal{T}}(z_{d}^{c})=\arg\max~{}\left|\left\{{{}_{k}}\widetilde{% \mathbf{T}}_{d}^{c}:\mathbf{d}({{}_{k}}\widetilde{\mathbf{T}}_{d}^{c},~{}{{}_{% l}}\widetilde{\mathbf{T}}_{d}^{c})\prec\mathbf{t},~{}{{}_{\{k,l\}}}\widetilde{% \mathbf{T}}_{d}^{c}\in\widetilde{\mathcal{T}}(z_{d}^{c})\right\}\right|,

(14)

where the vector-valued function $\mathbf{d}$ encodes the Euclidean distance between the estimates of $x_{d}^{c},y_{d}^{c}$ and the angular distance between the angles $\theta_{d}^{c}$ , and $\mathbf{t}$ is a threshold to specify maximum allowed deviations. The parameter-wise average of the hypotheses is used in the inlier set $\widehat{\mathcal{T}}(z_{d}^{c})$ to compute the resultant pose hypothesis, $\bar{\mathbf{T}}_{d}^{c}$ .

4.2.2 Height Difference Estimation

The previous step assumed that the height difference $z_{d}^{c}$ between maps is known. Unless there are means to establish the initial height difference, it is a strong assumption that would restrict the applicability of the algorithm to different scenarios.

The $z_{d}^{c}$ is similarly estimated based on the consensus. Due to the discretization of the 3D maps via slicing, there are finite number of relative height steps that can be established between a pair maps, all of which are $g$ units apart. Testing all possible height steps can be thought of as “cross-correlation” of the maps: each step results in a pose hypothesis $\bar{\mathbf{T}}_{d}^{c}$ computed from an inlier set $\widehat{\mathcal{T}}(z_{d}^{c})$ , and the size of the inlier set is the value for the correlation operator. As such, the height difference $z_{d}^{c}$ with the largest cardinality for $\widehat{\mathcal{T}}(z_{d}^{c})$ is selected as the height estimate:

z_{d}^{c}=\underset{z}{\arg\max}~{}\left|\widehat{\mathcal{T}}(z)\right|.

(15)

The correct height difference would allow the largest number of slice pairs that themselves agree on the remaining degrees of freedom. Empirical tests indicate that the number of inlier correspondences are maximized for all slices when they are matched with the correct slice from the other map for indoor/outdoor environments with distinctive geometric features. We refer to the proposed method as Consensus throughout the paper.

4.2.3 Computational Complexity

The proposed method utilizes various simple algorithms that have a low computational burden individually, but repeated for a large number of hypotheses. Especially, the cross-correlation for maps that span larger heights are computationally expensive. However, each of these individual steps (feature calculation for every slice and 3 DoF estimation for a pair of slices) are independent of each other, enabling parallelization opportunities with hardware acceleration. Moreover, it is not necessary for intermittent map matching and merging on different indoor and outdoor scenarios.

5 Comparative Studies

We provide a comprehensive evaluation of the accuracy and efficiency of the proposed algorithm in both simulated and real-life datasets against various registration pipelines from the literature.

5.1 Algorithm Baselines and Metrics

Comparisons to eight different baselines are provided. The selection includes standard algorithms commonly used as a baseline in literature, as well as the state-of-the-art registration algorithms. The baselines that are not learning-based study different conditions in feature extraction and registration, and a literature implementation may not necessarily exist for them.

For the learning-based algorithms, the selection was informed primarily on their adaptability to the map matching problem. In all of the cases, the weights provided by the original authors are used. Since some methods do not fit the memory of mobile-grade GPUs, the evaluations are done on a high-performance computer with server-grade GPU (NVIDIA A100 80 GB).

5.1.1 Tomographic TEASER++

Tomographic TEASER++ [52] replaces the consensus-based selection of the height difference with a 3D registration problem at each height step, and selecting the solution that provides the largest number of inlier features.

Conversion from 2D ORB features to 3D features are performed by assigning the height difference as the $z$ -coordinate to the points in the source map slice. The number of registration problems is the same as the consensus step in the proposed algorithm.

5.1.2 ORB-TEASER++

TEASER++ may incur a high memory and time cost, which was found to be the main bottleneck for the previous baseline. ORB-TEASER++ was tested as an attempt to reduce the number of registration steps to a minimum. In ORB-TEASER++, the feature matching step is performed between 2D keypoints across all of the slices in a map and the correspondences are used with TEASER++ for robust registration.

In essence, all 2D ORB features are treated as 3D features by assigning their relative height in the map’s own frame as their $z$ -coordinates. Then, the matching is performed against the 3D ORB features in the other map.

5.1.3 FPFH-RANSAC

ORB will be shown to not be an appropriate feature for 3D data in the tomographic manner, as it is prone to perceptual aliasing across different slices. For example, a single edge of a wall with no features would yield the same ORB signature for similar heights. Therefore, when the matching is performed with the points from the other map, many identical matches will exist.

Instead, in FPFH-RANSAC the standard baseline of many point feature comparisons is used. FPFH [31], features are computed for 3D Harris keypoints of the map point clouds, and a rigid transformation hypothesis is computed using RANSAC over point matches. An open-source implementation of this method [64] is available online²²2https://github.com/hrnr/map-merge.

5.1.4 FPFH-TEASER++

As a modern alternative to RANSAC, which is known to be inaccurate under high outlier rates, this baseline uses TEASER++ for its registration task. The same features as FPFH-RANSAC baseline (FPFH descriptors of Harris 3D corners) are matched and registered using TEASER++ instead.

5.1.5 DeepGlobalRegistration

DeepGlobalRegistration [53] is the first learning-based baseline used. DeepGlobalRegistration uses FCGF [45], assigns confidence for each correspondence, and solves a differentiable weighted Procrustes problem.

The complete pipeline is modified in the comparisons here to provide fair comparisons. Specifically, the ICP refinement and the safeguard registration implementations are deactivated to evaluate the network as-is, since these refinements are method-agnostic.

5.1.6 GeoTransformer

GeoTransformer [56] adopts the transformer architecture [65] to encode the geometric structure based on neighboring point geometries to obtain a pose-invariant superpoint representation. With high-inlier ratios of the GeoTransformer correspondences, accurate local-to-global pose estimation is made possible, even under low overlap scenarios.

5.1.7 Rotation-Invariant Transformer (RoITr)

RoITr [57] proposes a cascade of attention modules that aims to eliminate any pose dependency on the features to isolate only the point geometry. In this manner, the architecture becomes inherently rotation-invariant.

RoITr is the only model among the others that do not provide model weights trained for KITTI dataset. For the real-data evaluation, RoITr is omitted.

5.1.8 BUFFER

BUFFER [58] distills the point-based and patch-based feature extraction methods, improving point-wise orientation and saliency estimation and performs pose estimation via RANSAC over inlier set of correspondences as returned from the network.

5.2 Comparison Metrics

Performance of the map matching systems are evaluated based on four factors: a) translation error, b) rotation error, c) execution time, d) and memory usage.

5.2.1 Translation Error

Translation error is measured as the distance between the ground truth transformation estimation and the calculated estimation, and is reported in meters. Different applications require different voxel grid sizes. For example, an indoor robot with a small footprint will benefit from a higher resolution map (e.g. 0.05 m - 0.2 m grid size), whereas for an outdoor mission the same parameters will introduce large processing bottlenecks and yield a very large map, in terms of the occupied cells. To that end, the translation error threshold to determine if a map matching task is successful is assigned to be 5 times the grid size of the underlying map. This is in line with the literature for point cloud registration tasks for both indoor and outdoor scenarios.

5.2.2 Rotation Error

Rotation error is calculated from the difference in yaw angles between the ground truth transformation and the estimated transform, and is reported in radians. Unlike the translation error, the rotation error does not depend on the map resolution. The error threshold is set as 0.1745 radians ( $\approx 10^{\circ}$ ), which is more conservative than the point cloud registration literature. The extents of the maps in map matching tasks are much larger in comparison and small variations can result in very large deviations.

5.2.3 Execution Time

Execution time is measured as the total time it takes the algorithm to compute the transformation parameters, reported in seconds. Any redundant pre-processing and post-processing steps are excluded from this measurement.

For the proposed method and baselines without a learning component, the computations were carried out on an Intel Phantom Canyon NUC11PHKi7C (Intel i7-1165G74 CPU, 64 GB RAM). Some deep learning-based pipelines required more memory than it is available on the aforementioned hardware, due to the scale of the problem. As such, the computations for learning-based baselines were carried out on a high-performance computer node with an NVIDIA A100 SXM4 80 GB card using 8 CPU cores and 530 GB RAM. While such hardware cannot be put on a mobile platform currently, this study takes into consideration the availability of better hardware in the future.

5.2.4 Memory Usage

Memory usage refers to the total amount of physical memory the algorithm utilizes to generate its result, in Bytes. Resident set size is selected as an estimate of the memory consumption for the proposed method and the first 4 baselines that do not use learning-based algorithms. For deep learning-based pipelines, the peak GPU memory usage is used as a measurement for the memory.

The following sections will introduce the simulation and real datasets and provide the results for the aforementioned baselines and the proposed method.

6 Simulation Studies

Simulation data is taken from individual trajectories of InteriorNet dataset [66], in which a camera executes up to 3 different continuous 6 DoF motions in various furnished indoor environments. Color and depth streams, matched with the camera pose, are used to generate a 3D occupancy grid map to generate sample datasets for this study. A representative pair of color and depth images are provided in Figure 3

The pose information for different trajectories in the same environment are provided with respect to a common, fixed frame of reference. In order to simulate a multi-agent map** scenario, the initial pose of the trajectory is set as the origin of that trajectory, after removing the initial roll and pitch. In that way, each trajectory is emulating a different agent with its own local origin.

Five sequences from InteriorNet dataset were selected, each with three trajectories. In total, there are 30 different instances of map matching tasks (5 sequences $\times$ 3 trajectories $\times$ 2 directions). The average overlap between pairs of maps is calculated as 71.76% (minimum 48.74%, maximum 91.35%).

The InteriorNet dataset provides noiseless measurements with perfect pose information. To test the robustness of the algorithms against common sources of noise, two different noise modalities were tested: pose estimation noise, and measurement noise.

In order to simulate inaccuracies arising from noisy IMU measurements and imperfect map** algorithm conditions, the ground truth pose for each camera reading is perturbed independently in each degree of freedom. Yaw, pitch, roll, and 3 translation perturbations are randomly sampled from independent, zero-mean Gaussian distributions with two different standard deviations. For $\sigma=0.02$ , the translation and rotation perturbations are sampled from distributions with standard deviations $0.02~{}$ m and $0.01~{}$ rad, respectively. For $\sigma=0.05$ , the standard deviations are $0.05~{}$ m and $0.025~{}$ rad. For the instances with added noise, the depth measurements from the camera were also corrupted with a noise profile mimicking the commercially available Intel RealSense D435i depth sensor. Some sample maps obtained under various noise levels are provided in Figure 4 for a qualitative comparison.

6.1 Map Matching Results

Figures 5-8 provide the translation and rotation errors, memory footprint, and execution times of each method on different levels of noise. The results are reported for maps that have been pre-filtered with a voxel grid filter of leaf size 0.05 m.

Each plot has a red dashed horizontal line, whose value represents a particular threshold. The selection is mostly subjective, success/failure classification is not performed based on these thresholds.

Translation error threshold is set to 0.25 m, five times the grid size of the maps. Rotation error threshold is set to 0.1745 rad ( $\approx 10^{\circ}$ ). Execution time threshold marks the 60 s line. Memory threshold is set to be 16 GB, which is the maximum capacity of the Jetson Orin NX computer, which can be attached to any mobile platform.

The noiseless case (leftmost panels in each figure) serves as an indicator that the baselines can actually handle the map matching task, especially for the learning-based methods. Consensus, Tomographic-TEASER++, FPFH-TEASER++, DeepGlobalRegistration (DGR), and BUFFER provide accurate results for the majority of the map pairs. Learning-based methods provide better results, with DGR being the most accurate. FPFH-TEASER++ shows a large variance with some failures. GeoTransformer and RoITr are still accurate for some pairs, but they fail to provide correct transform for the majority of the pairs. The failure of ORB-TEASER++ indicates that ORB features are not appropriate for use as a 3D feature descriptor, but their tomographic use (Tomographic-TEASER++) render them useful.

All of the algorithms return with an estimate within 20 seconds. RoITr and Consensus methods take less than a second for the majority, with Consensus showing a larger variance and some outliers that reach up to 3 seconds.

For the memory use, however, the non-learning and non-TEASER++ methods shine. The memory use is limited to approximately 100 MB for Consensus and FPFH-based methods, at least an order of magnitude lower than any other method. The maximum clique search over the correspondence graph for TEASER++ has a large memory footprint for the ORB-TEASER++ implementation.

For the noisy data (middle and right panels in Figures 5-8), all of the learning-based methods fail. DGR provides some correct results, but on average is not reliable. FPFH features lose their discrimination capabilities under the noisy instances, as seen by the decreased accuracy between ideal and noisy instances. On the other hand, the tomographic methods demonstrate a more graceful accuracy degradation. The noise appears to be less impactful on the projected cross-sectional slices, when compared to other 3D descriptors.

The execution time increases with the noise in the input data, with the exception of RoITr. The execution time gap across methods is more pronounced at higher noise levels. Tomographic methods take longer to execute, as the noisy map means there is a larger number of slices across all maps. Consensus method remains approximately an order of magnitude faster in comparison to DGR and GeoTransformer. The other learning-based algorithms pass the 60 second mark. The bottleneck for FPFH-based methods becomes the computation of keypoints and the features and they cross the minute mark as well.

Finally, memory usage for all of the methods increase with the increased noise. Tomographic-TEASER++ reaches the correspondence cap utilizes the same amount of memory as the ORB-TEASER++. FPFH-based methods show limited increase, suggesting that the memory footprint is mostly dominated by the feature computation, rather than the robust pose estimation. The most memory-efficient method remains to be the Consensus, utilizing less than 200 MB on all instances.

Overall, tomographic methods appear to be the the better alternatives. In the ideal scenario, Tomographic-TEASER++ provides the smallest translation error. However, the Consensus method is more accurate in the noisy instances, and has the smallest execution time and memory footprint across all of the data types. The success of FPFH-TEASER++, DGR and BUFFER on noiseless data indicate that the algorithms can provide accurate results. However, the feature quality appears to degrade significantly under noisy data, resulting in quick accuracy degredation.

7 Experimental Studies

7.1 KITTI Odometry Data

To evaluate the performance on a real-life dataset, the LiDAR data from the sequences of KITTI odometry benchmark [67] was used. In total, 11 sequences have GPS ground truth available. Out of the 11, only 5 sequences (00, 02, 05, 06, 07) revisit the previously explored locations and can be used for a map matching scenario. We divide the sequences in half to emulate two agents and perform matching for each agent, providing a total of 10 different instances of map merging. Average overlap was estimated at 43.90% (minimum 11.17%, maximum 83.53%). Figure 9 provides an example pair for the map matching task.

For this study, two of the learning-based baselines could not be used. RoITr model does not provide any weights for KITTI or hyperparameter set that can be used to train the architecture on KITTI sequences. For GeoTransformer, the data preparation step filled up the RAM of the high-performance computing node (approx. 530 GB) and could not be run.

Two grid sizes were used for the remaining baselines. Figure 10 reports the results for the data where the maps have been pre-filtered with a voxel grid filter of size 0.50 m. Results provided in Figure 11 use data that was pre-filtered to 0.20 m. Horizontal red lines mark the same thresholds as the InteriorNet case: five times the grid size for translation error, 0.1745 rad for rotation error, 60 s for the execution time, and 16 GB for the memory usage.

Due to the large number of correspondences, FPFH-TEASER++ ran out of memory in the mobile computer (64 GB) at higher resolution. As such, the method is omitted for the 0.20 m grid size (Figure 11).

The only algorithm that is successful for the majority of the pairs is the tomographic methods. Consensus provides high accuracy for both rotation and translation. Tomographic-TEASER++ fails to estimate the translation accurately, although the rotation estimation is mostly succesful. Even though DGR was accurate in the simulated data, it fails to perform in the KITTI study. BUFFER is still able to provide the correct transformation for some of the instances, but overall is inaccurate. FPFH features appear to be not descriptive on the KITTI dataset, which could be due to improper parameter selection.

The execution time is significantly higher for Consensus in comparison to the indoor data. For the higher resolution map, the execution time reaches up to a minute. The primary reason for the losses in efficiency is due to the much larger number of slices across both maps. Specifically for the sequence 02, which is shown in Figure 9, the maps span approximately 50 meters. Slice correlation step is much more time consuming, and it hampers the time efficiency.

In a similar fashion, the memory footprint of the Consensus algorithm is significantly higher, reaching up to 10 GB for some instances of high-resolution map. The footprint remains lower than any other algorithm that performs accurate estimations.

Overall, even though the Consensus algorithm efficiency and memory usage has increased significantly, it is still the best performing option out of the baseline methods.

Note that across the studies, we are only changing the grid size of the maps for the tomographic algorithms. Despite the minimal parameterization, the algorithm performance is not impacted significantly. The appropriate grid size selection is still a factor to be taken into consideration, but the choices here (0.2 m and 0.5 m) is common for LiDAR-based outdoor map** tasks.

7.2 Large-scale Indoor Environment Data

Finally, the algorithms’ performance is evaluated using a generated dataset, map** a large-scale indoor environment. The interior space is an atrium area connecting office spaces, classrooms, and laboratories across two floors. The environment is especially challenging for LiDAR-based distance sensors, as there is a glass construction at the center with opening to the outside. A photograph of the environment is provided in Figure 12(a). A Unitree quadruped robot equipped with RoboSense 3D LiDAR sensor (Figure 12(b)), was used to map the environment in 3 separate trajectories. The horizontal cross-sections of the individual maps overlaid on the architectural plan of the structure is provided in Figure 13. Each trajectory is map** along two corners of the triangle.

Horizontal red lines mark the same thresholds as the InteriorNet case: five times the grid size for translation error, 0.1745 rad for rotation error, 60 s for the execution time, and 16 GB for the memory usage.

To demonstrate the applicability to different forms of map** tools, the robot estimates its odometry using KISS-ICP [68] and generates the map using VDBFusion [69] framework. KISS-ICP estimates relative transformations between two key frames using point-to-point ICP with simple motion compensation, an adaptive correspondence threshold, and a robust kernel to provide a minimally-parameterized solution. VDBFusion represents the environment as a truncated signed distance function (TSDF) to obtain more accurate dense reconstructions, powered by OpenVDB³³3https://www.openvdb.org/, an open-source, production-level hierarchical data structure library. In this way, we demonstrate backend-agnostic capabilities of our method in the map matching task.

To generate the map, deskewed LiDAR point cloud data is integrated into the VDBFusion map, and the 3D triangle mesh representation of the resultant map was extracted using marching cubes algorithm at a grid size of 0.05 m. Vertices of the mesh are provided as input to all of the maps. A rendering of the the three maps generated as described above are provided in Figure 14.

Overall, three maps and two directions provide a total of 6 map matching tasks. Similar to the KITTI study, Geo-Transformer baseline could not be run due to memory consumption. The results for all of the other methods are provided in Figure 15.

Only the estimates provided by the proposed Consensus method are close to the true transformations. Tomographic TEASER++ and FPFH-TEASER++ provide accurate orientation for some of the pairings, but overall no other algorithm is able to provide accurate results.

Even for the Consensus algorithm, the translation error is much larger. The algorithm is able to estimate the correct rotation for all but one pairing. A parameter tuning may improve the translation error as well, but the finer registration with ICP is always possible. The algorithm is providing reasonable global transformation estimates.

Execution time is reaching to the minute mark for the proposed method in this framework. The grid size is small enough to generate both large-size slices for feature extraction and a large number of slices to be cross-correlated. However, the memory footprint is not affected, utilizing approximately 500 MB in RAM to carry out the computations.

8 Limitations

Our proposed algorithm has inherent limitations that render it unusable for generic point cloud registration tasks. The three most important ones are the missing degrees of freedom, the inability to deal with non-rigid estimations, and the dependence on the map heights.

Our assumption is that the agents have accurate gravity estimates that they can use to align their maps. IMU sensors are common, but are not entirely accurate. Large accelerations can affect the orientation quality and impact the overall map alignment. Even though the Consensus method was demonstrated to be resilient to noise, there is no way to recover the roll and pitch if the origin (the initial pose) of the agent is misaligned.

Another limitation is that the output is a single rigid transformation. In practice, large-scale maps tend to drift even if robust map** methods are used. As such, a single rigid transformation is generally not the most accurate answer. A deformable hypothesis operating on the pose graph [19] would me more appropriate, and there is no way to adapt this method directly.

Finally, the complexity of the algorithm depends mostly on the number of slices for each map. Per-slice transform estimations are fast, but the correlation step requires many hypotheses to be tested at once. The higher the number of slices, the heavier the computations. Hardware-accelerated parallelization could sidestep the computational bottlenecks, but the dependence on the map heights will not change.

Furthermore, the comparative studies, performed against the baselines has certain limits. These limitations can be grouped into 3 categories: parameter tuning for the tasks, disparity in the execution paradigms, and potentially suboptimal or inaccurate adaptations.

The weights used for learning-based models are taken directly from the best parameters as provided by the original authors. For the simulated study, this implies training on 3DMatch [35] dataset, which contains 3D scans of small-scale scenes. Even though the grid size parameter is appropriate, we do not account for the simulation-to-real domain gap, and the scale of the problem is not the same. The results on the ideal data (Figures 5-8, left-most panels) indicates that some of the algorithms can still perform, but a much thorough comparison would require to train the models on data that is more representative for the type and the scale of the map matching task.

For the FPFH-based methods, an extensive search on the key parameters, mainly the radii to estimate normals, keypoints, and features, was not performed. In a similar fashion to the tomographic methods, the radii was scaled based on the grid size, but a parameter study could yield a more optimal answer. The only parameter that was changed across different datasets is the grid size of the maps, which is known ahead of the time. The proposed algorithm is able to perform accurately with minimal parameterization. As such, the advantage of tomographic methods remain.

The efficiency is compared using the execution time and memory footprints. An ideal scenario would require the programming paradigms and the hardware platforms to be the same. However, learning-based methods are implemented primarily on Python, whereas we implemented our method and the remaining baselines (Consensus, ORB-family and FPFH-family of algorithms) in C++. Even though the Python bindings for the learning-based algorithm backends are compiled as well, the comparison is not entirely applicable. Furthermore, Python memory management is different than C++, introducing additional sources of issue.

On the other hand, the metrics were calculated for learning-based algorithms on a high-performance computer node that is equipped with a powerful graphics card. While we attempted to provide the best possible comparison, this limitation should be explicitly stated.

Finally, not all of the learning-based baselines had an existing implementation to perform registration on arbitrary datasets. The implementations from BUFFER and RoITr were mimicked from the source code to obtain the transformations.

9 Conclusions

Tomographic features were proposed to address large-scale map matching task in gravity-aligned 3D maps. The proposed minimally-parameterized approach was both accurate and efficient in memory and computation power. Furthermore, the tomographic approach was found to be more resilient to measurement noise in sensors in comparison to other methods utilizing 3D feature descriptors, learning-based or otherwise. The findings are corroborated on real datasets that map volumes of different scales, underscoring the algorithmic efficiency and accuracy.

The proposed algorithm cannot provide a non-rigid estimate to deal with map** inaccuracies, and is not able to estimate the roll and pitch. Nevertheless, for large and dense maps with bounded extents, such as the case in many graph-based SLAM algorithms, the tomographic approach can provide fast and robust transformations and requires investigations.

References

Tzes et al. [2018] M. Tzes, S. Papatheodorou, A. Tzes, Visual Area Coverage by Heterogeneous Aerial Agents Under Imprecise Localization, IEEE Control Systems Letters 2 (2018) 623–628.
Lajoie et al. [2022] P.-Y. Lajoie, B. Ramtoula, F. Wu, G. Beltrame, Towards Collaborative Simultaneous Localization and Map**: a Survey of the Current Research Landscape, Field Robotics 2 (2022) 971–1000.
Kazerouni et al. [2022] I. A. Kazerouni, L. Fitzgerald, G. Dooly, D. Toal, A survey of state-of-the-art on visual SLAM, Expert Systems with Applications 205 (2022) 117734.
Huang et al. [2023] J. Huang, S. Junginger, H. Liu, K. Thurow, Indoor Positioning Systems of Mobile Robots: A Review, Robotics 12 (2023) 47.
Carpin et al. [2005] S. Carpin, A. Birk, V. Jucikas, On map merging, Robotics and Autonomous Systems 53 (2005) 1–14.
Saeedi et al. [2014] S. Saeedi, L. Paull, M. Trentini, M. Seto, H. Li, Map merging for multiple robots using Hough peak matching, Robotics and Autonomous Systems 62 (2014) 1408–1424.
Ding and Feng [2019] L. Ding, C. Feng, DeepMap**: Unsupervised map estimation from multiple point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8650–8659.
Andersone [2019] I. Andersone, Heterogeneous Map Merging: State of the Art, Robotics 8 (2019) 74.
Unlu et al. [2023] H. U. Unlu, A. Tzes, P. Krishnamurthy, F. Khorrami, Distributed 3D-Map Matching and Merging on Resource-Limited Platforms Using Tomographic Features, in: 2023 European Conference on Mobile Robots (ECMR), IEEE, 2023, pp. 1–6.
Yu et al. [2020] S. Yu, C. Fu, A. K. Gostar, M. Hu, A Review on Map-Merging Methods for Typical Map Types in Multiple-Ground-Robot SLAM Solutions, Sensors 20 (2020) 6988.
Erinc and Carpin [2014] G. Erinc, S. Carpin, Anytime merging of appearance-based maps, Autonomous Robots 36 (2014) 241–256.
Sun et al. [2023] Q. Sun, T. Liao, H. Du, Y. Zhao, C.-C. Chen, A Method of Merging Maps for MUAVs Based on an Improved Genetic Algorithm, Sensors 23 (2023) 447.
Aragues et al. [2011] R. Aragues, J. Cortes, C. Sagues, Distributed consensus algorithms for merging feature-based maps with limited communication, Robotics and Autonomous Systems 59 (2011) 163–180.
Blanco et al. [2013] J.-L. Blanco, J. González-Jiménez, J.-A. Fernández-Madrigal, A robust, multi-hypothesis approach to matching occupancy grid maps, Robotica 31 (2013) 687–701.
Lee [2020] H. Lee, Tomographic Feature-Based Map Merging for Multi-Robot Systems, Electronics 9 (2020) 107.
Chen et al. [2021] B. Chen, S. Li, H. Zhao, L. Liu, Map Merging with Suppositional Box for Multi-Robot Indoor Map**, Electronics 10 (2021) 815.
Wang et al. [2023] M. Wang, M. Cong, Y. Du, D. Liu, X. Tian, Multi-robot raster map fusion without initial relative position, Robotic Intelligence and Automation 43 (2023) 498–508.
Xie et al. [2020] Y. Xie, Y. Tang, R. Zhou, Y. Guo, H. Shi, Map merging with terrain-adaptive density using mobile 3D laser scanner, Robotics and Autonomous Systems 134 (2020) 103649.
Bonanni et al. [2017] T. M. Bonanni, B. Della Corte, G. Grisetti, 3-D Map Merging on Pose Graphs, IEEE Robotics and Automation Letters 2 (2017) 1031–1038.
Mohanarajah et al. [2015] G. Mohanarajah, V. Usenko, M. Singh, R. D’Andrea, M. Waibel, Cloud-Based Collaborative 3D Map** in Real-Time With Low-Cost Robots, IEEE Transactions on Automation Science and Engineering 12 (2015) 423–431.
Yue et al. [2018] Y. Yue, P. N. Senarathne, C. Yang, J. Zhang, M. Wen, D. Wang, Hierarchical Probabilistic Fusion Framework for Matching and Merging of 3-D Occupancy Maps, IEEE Sensors Journal 18 (2018) 8933–8949.
Basso et al. [2023] M. Basso, D. Stocchero, P. Schnarndorf, E. P. de Freitas, Merging three-dimensional occupancy grid maps to support multi-uavs cooperative navigation systems, Journal of Field Robotics 40 (2023) 483–504.
Yin et al. [2023] P. Yin, S. Zhao, H. Lai, R. Ge, J. Zhang, H. Choset, S. Scherer, AutoMerge: A Framework for Map Assembling and Smoothing in City-Scale Environments, IEEE Transactions on Robotics (2023).
Stathoulopoulos et al. [2023] N. Stathoulopoulos, A. Koval, A.-a. Agha-mohammadi, G. Nikolakopoulos, FRAME: Fast and Robust Autonomous 3D Point Cloud Map-merging for Egocentric Multi-Robot Exploration, in: 2023 IEEE international conference on robotics and automation (ICRA), IEEE, 2023, pp. 3483–3489.
Lajoie et al. [2022] P.-Y. Lajoie, B. Ramtoula, F. Wu, G. Beltrame, Towards Collaborative Simultaneous Localization and Map**: a Survey of the Current Research Landscape, Field Robotics 2 (2022) 971–1000.
Schmuck and Chli [2019] P. Schmuck, M. Chli, CCM-SLAM: Robust and efficient centralized collaborative monocular simultaneous localization and map** for robotic teams, Journal of Field Robotics 36 (2019) 763–781.
Chang et al. [2022] Y. Chang, K. Ebadi, C. Denniston, M. F. Ginting, A. Rosinol, A. Reinke, M. Palieri, J. Shi, A. Chatterjee, B. Morrell, A. akbar Agha-mohammadi, L. Carlone, Lamp 2.0: A robust multi-robot slam system for operation in challenging large-scale underground environments, IEEE Robotics and Automation Letters 7 (2022) 9175–9182.
Lajoie et al. [2020] P.-Y. Lajoie, B. Ramtoula, Y. Chang, L. Carlone, G. Beltrame, DOOR-SLAM: Distributed, online, and outlier resilient SLAM for robotic teams, IEEE Robotics and Automation Letters 5 (2020) 1656–1663.
Tian et al. [2022] Y. Tian, Y. Chang, F. H. Arias, C. Nieto-Granda, J. P. How, L. Carlone, Kimera-multi: Robust, distributed, dense metric-semantic slam for multi-robot systems, IEEE Transactions on Robotics 38 (2022).
Lajoie and Beltrame [2023] P.-Y. Lajoie, G. Beltrame, Swarm-SLAM: Sparse Decentralized Collaborative Simultaneous Localization and Map** Framework for Multi-Robot Systems, IEEE Robotics and Automation Letters 9 (2023) 475–482.
Rusu et al. [2009] R. B. Rusu, N. Blodow, M. Beetz, Fast Point Feature Histograms (FPFH) for 3D Registration, in: 2009 IEEE international conference on robotics and automation, IEEE, 2009, pp. 3212–3217.
Guo et al. [2016] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, J. Wan, N. M. Kwok, A Comprehensive Performance Evaluation of 3D Local Feature Descriptors, International Journal of Computer Vision 116 (2016) 66–89.
Trajković and Hedley [1998] M. Trajković, M. Hedley, Fast corner detection, Image and vision computing 16 (1998) 75–87.
Wu et al. [2015] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, J. Xiao, 3D ShapeNets: A Deep Representation for Volumetric Shapes, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1912–1920.
Zeng et al. [2016] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, T. A. Funkhouser, 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 199–208.
Qi et al. [2017] C. R. Qi, H. Su, K. Mo, L. J. Guibas, PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
Deng et al. [2018] H. Deng, T. Birdal, S. Ilic, PPFNet: Global Context Aware Local Features for Robust 3D Point Matching, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, 2018, pp. 195–205.
Hua et al. [2018] B.-S. Hua, M.-K. Tran, S.-K. Yeung, Pointwise Convolutional Neural Networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 984–993.
Atzmon et al. [2018] M. Atzmon, H. Maron, Y. Lipman, Point convolutional neural networks by extension operators, ACM Transactions on Graphics 37 (2018) 1–12.
Gojcic et al. [2019] Z. Gojcic, C. Zhou, J. D. Wegner, A. Wieser, The Perfect Match: 3D Point Cloud Matching With Smoothed Densities, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5545–5554.
Bai et al. [2020] X. Bai, Z. Luo, L. Zhou, H. Fu, L. Quan, C.-L. Tai, D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6359–6367.
Floris et al. [2024] A. Floris, L. Frittoli, D. Carrera, G. Boracchi, Composite convolution: A flexible operator for deep learning on 3d point clouds, Pattern Recognition (2024) 110557.
Zhang et al. [2020] Z. Zhang, Y. Dai, J. Sun, Deep learning based point cloud registration: an overview, Virtual Reality & Intelligent Hardware 2 (2020) 222–246.
Thomas et al. [2019] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, L. Guibas, KPConv: Flexible and Deformable Convolution for Point Clouds, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 2019, pp. 6411–6420.
Choy et al. [2019] C. Choy, J. Park, V. Koltun, Fully Convolutional Geometric Features, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 2019, pp. 8958–8966.
Bustos and Chin [2017] A. P. Bustos, T.-J. Chin, Guaranteed Outlier Removal for Point Cloud Registration with Correspondences, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2017) 2868–2882.
Huang et al. [2021] X. Huang, G. Mei, J. Zhang, R. Abbas, A comprehensive survey on point cloud registration, arXiv preprint arXiv:2103.02690 (2021).
Huang et al. [2023] X. Huang, G. Mei, J. Zhang, Cross-source point cloud registration: Challenges, progress and prospects, Neurocomputing (2023) 126383.
Fischler and Bolles [1981] M. A. Fischler, R. C. Bolles, Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Communications of the ACM 24 (1981) 381–395.
Li et al. [2023] J. Li, P. Shi, Q. Hu, Y. Zhang, Qgore: Quadratic-time guaranteed outlier removal for point cloud registration, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
Wei et al. [2023] T. Wei, Y. Patel, A. Shekhovtsov, J. Matas, D. Barath, Generalized differentiable ransac, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17649–17660.
Yang et al. [2020] H. Yang, J. Shi, L. Carlone, TEASER: Fast and Certifiable Point Cloud Registration, IEEE Transactions on Robotics 37 (2020) 314–333.
Choy et al. [2020] C. Choy, W. Dong, V. Koltun, Deep Global Registration, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2020, pp. 2514–2523.
Bai et al. [2021] X. Bai, Z. Luo, L. Zhou, H. Chen, L. Li, Z. Hu, H. Fu, C.-L. Tai, Pointdsc: Robust point cloud registration using deep spatial consistency, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2021, pp. 15859–15869.
Huang et al. [2021] S. Huang, Z. Gojcic, M. Usvyatsov, A. Wieser, K. Schindler, PREDATOR: Registration of 3D Point Clouds with Low Overlap, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2021, pp. 4267–4276.
Qin et al. [2023] Z. Qin, H. Yu, C. Wang, Y. Guo, Y. Peng, S. Ilic, D. Hu, K. Xu, GeoTransformer: Fast and Robust Point Cloud Registration With Geometric Transformer, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
Yu et al. [2023] H. Yu, Z. Qin, J. Hou, M. Saleh, D. Li, B. Busam, S. Ilic, Rotation-Invariant Transformer for Point Cloud Matching, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 5384–5393.
Ao et al. [2023] S. Ao, Q. Hu, H. Wang, K. Xu, Y. Guo, BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1255–1264.
Bustos and Chin [2017] A. P. Bustos, T.-J. Chin, Guaranteed Outlier Removal for Point Cloud Registration with Correspondences, IEEE transactions on pattern analysis and machine intelligence 40 (2017) 2868–2882.
Rublee et al. [2011] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, ORB: An efficient alternative to SIFT or SURF, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 2564–2571.
Calonder et al. [2010] M. Calonder, V. Lepetit, C. Strecha, P. Fua, BRIEF: Binary robust independent elementary features, in: European conference on computer vision, Springer, 2010, pp. 778–792.
Ma et al. [2021] J. Ma, X. Jiang, A. Fan, J. Jiang, J. Yan, Image matching from handcrafted to deep features: A survey, International Journal of Computer Vision 129 (2021) 23–79.
Tareen and Saleem [2018] S. A. K. Tareen, Z. Saleem, A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK, in: 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), IEEE, 2018, pp. 1–10.
Hörner [2018] J. Hörner, Automatic Point Clouds Merging, Master’s thesis, Charles University, Department of Theoretical Computer Science and Mathematical Logic, 2018. URL: https://dspace.cuni.cz/bitstream/handle/20.500.11956/101930/120308521.pdf, accessed: 2024-06-04.
Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is All you Need, Advances in neural information processing systems 30 (2017).
Li et al. [2018] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, S. Leutenegger, InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset, arXiv preprint arXiv:1809.00716 (2018).
Geiger et al. [2012] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 3354–3361.
Vizzo et al. [2023] I. Vizzo, T. Guadagnino, B. Mersch, L. Wiesmann, J. Behley, C. Stachniss, KISS-ICP: In Defense of Point-to-Point ICP – Simple, Accurate, and Robust Registration If Done the Right Way, IEEE Robotics and Automation Letters 8 (2023) 1029–1036.
Vizzo et al. [2022] I. Vizzo, T. Guadagnino, J. Behley, C. Stachniss, VDBFusion: Flexible and Efficient TSDF Integration of Range Sensor Data, Sensors 22 (2022) 1296.

Author Contributions

Co-author contributions follow the CRediT taxonomy: Halil Utku Unlu: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review and editing, Visualization. Anthony Tzes: Conceptualization, Methodology, Resources, Writing - review and editing, Supervision. Prashanth Krishnamurthy: Conceptualization, Methodology, Writing - review and editing. Farshad Khorrami: Conceptualization, Methodology, Writing - review and editing, Supervision.

Funding

This work was partially supported by the NYUAD Center for Artificial Intelligence and Robotics (CAIR), funded by Tamkeen under the NYUAD Research Institute Award CG010. The funding sources were not involved in study design, in the collection, analysis and interpretation of data, in the writing of the report, and in the decision to submit the article for publication.

Efficient and Distributed Large-Scale 3D Map Registration using Tomographic Features