\affiliation

[1]organization=Electrical & Computer Engineering Department, addressline=6 MetroTech Center, city=Brooklyn, postcode=11201, state=New York, country=USA \affiliation[2]organization=Electrical Engineering, New York University Abu Dhabi (NYUAD), addressline=Saadiyat Island, postcode=129188, state=Abu Dhabi, country=UAE \affiliation[3]organization=Center for Artificial Intelligence and Robotics, NYUAD, addressline=Saadiyat Island, postcode=129188, state=Abu Dhabi, country=UAE

Efficient and Distributed Large-Scale
3D Map Registration using Tomographic Features

Halil Utku Unlu Anthony Tzes Prashanth Krishnamurthy Farshad Khorrami
Abstract

A robust, resource-efficient, distributed, and minimally parameterized 3D map matching and merging algorithm is proposed. The suggested algorithm utilizes tomographic features from 2D projections of horizontal cross-sections of gravity-aligned local maps, and matches these projection slices at all possible height differences, enabling the estimation of four degrees of freedom in an efficient and parallelizable manner. The advocated algorithm improves state-of-the-art feature extraction and registration pipelines by an order of magnitude in memory use and execution time. Experimental studies are offered to investigate the efficiency of this 3D map merging scheme.

keywords:
Map merging, 3D perception, Multi-robot exploration

1 Introduction

In many applications, robot collaboration not only is required, but also greatly enhances the effectiveness of a robotic system. Multiple robots can provide a larger coverage [1], different views of the working environment, faster exploration, and resiliency against individual agent failures. However, communication fidelity, limited computational capabilities, and the operated environments’ dynamic and uncertain nature hamper the utility of multi-agent systems in real-world applications [2].

A baseline requirement for a collaborative mission for robotic systems is a shared understanding of the environment and their location within it. External infrastructures with a centralized computation structure can provide the global state to each agent, but they restrict the applicability of the systems in unknown, unstructured environments without a central unit.

The common solution for autonomous localization, navigation, and map** in unknown, uncertain environments is the set of methods called simultaneous localization and map** (SLAM), in which the robotic platforms rely on the data they collect through onboard sensors to estimate their position and generate a map. Single robot SLAM for both 2D and 3D motion is a mature field with many advanced platforms and frameworks [3, 4].

An extension of the SLAM methods on multiple agents is commonly referred to as Collaborative SLAM (C-SLAM), and it aims to provide a consistent and more accurate system state estimate (i.e., location of all agents, global map) by utilizing data from multiple participating agents. The field has seen significant development in recent years, though large-scale deployment needs further refinement [2]. Part of the reason is that the SLAM task requires the use of dense, rich data from vision sensors to generate a useful and accurate environment representation. 3D data streams (RGB-D sensors and LiDAR) generate large amount of data requiring a high communication bandwidth that can overload the network and demand more computational budget from the resource-limited mobile agents.

An alternative way for utilizing individual map information gathered by every agent in the field is to perform map matching and merging, where the agents share their map with other agents upon connectivity and attempt to align and combine other maps, hence providing a more complete view of their environment. Multiple approaches have been proposed for 2D map matching and merging scenarios [5, 6, 7, 8], but the solutions do not easily scale to 3D map representations. Although the advances in low-power devices with GPUs address partially the computation issue, the complexity needed from the robotic agents adds additional burdens on already resource-strapped platforms. Furthermore, many of the existing algorithms require adjusting the parameters of the network, either through training on a different data or adjusting a large set of key parameters.

This paper attempts to address the gap in map matching for large-scale 3D maps by proposing a feature extraction framework that effectively enables the use of 2D image features in a pair of gravity-aligned 3D maps using tomography [9] and providing a thorough evaluation and comparisons against alternatives. In the proposed method, the algorithm a) extracts a binary occupancy representation for horizontal cross-sections of the maps, b) computes 2D features over these slices, and c) restricts matching space for the features across maps to its corresponding section only. Cross-correlating the maps based on their agreement in pose estimations yields a 4 degree of freedom (DoF) transformation estimate. The proposed approach will be demonstrated to be significantly efficient (in both time and memory consumption) and accurate compared to state-of-the-art point cloud registration methods. An early version of this paper appeared in [9], which does not contain the simulation studies, any experimental results, and the recent algorithms over the past two years.

The contributions of this article include:

  • 1.

    a simple and efficient approach to extracting 3D features via tomographic extraction of horizontal sections with minimal parameterization,

  • 2.

    study of viable uses of the aforementioned features in addressing large-scale 3D-map matching and merging scenarios across environments of varying scales,

  • 3.

    extensive comparative studies on simulated and real data to assess effectiveness against alternative methods,

  • 4.

    experimental studies using a quadruped to emulate merging scenario in a large indoor area, and

  • 5.

    the source code to replicate the study for both proposed and alternative methods111https://github.com/RISC-NYUAD/tomographic-map-matching.

The remainder of the paper is structured as follows: Relevant work is introduced in Section 2, and the formulation of the studied problem is provided in Section 3. The proposed method is described in Section 4, detailing the process of tomographic feature extraction and registration with the computed features. The structure of the comparative analysis and the baseline methods used for evaluation are detailed in Section 5. Results on a simulated dataset are illustrated in Section 6. Experimental studies on real data, using KITTI odometry benchmark and a newly generated data from a large-scale indoor environment, are provided in Section 7. Limitations of the algorithm and the overall study are provided in Section 8, followed by the concluding remarks.

2 Related Work

2.1 Map Matching and Merging

Map matching is related to estimating the relative transformation between a pair of maps, and map merging is the process of combining the representations of a set of maps to a common structure, given the relative transformation. The literature for matching and merging probabilistic 2D occupancy grid maps is vast and several reviews provide a thorough coverage of the field [8, 10]. Overall, two common threads can be observed: optimization-based and feature-based map matching.

Optimization-based algorithms operate on the entire probability grid without an intermediate representation. Some of the earlier work focused on maximizing a similarity metric [5] and a connectivity metric with better results [11]. More recent work uses genetic algorithms to address the matching and merging on low-resource platforms [12]. Feature-based map matching algorithms extract an intermediate representation from the raw occupancy grid to perform associations and estimations. Salient and descriptive feature extraction forms the basis of matching [13, 14, 15, 16, 17].

Tomographic features and a 2D map matching based on the Radon transform of the occupancy grid is employed in [15]. Tomographically salient features are used in a Gaussian mixture model to estimate the 3 DoF transformation. Another feature-based algorithm extracts 2D image features from image representations of the occupancy grid maps and performs an initial guess via RANSAC, which is then refined with ICP for tighter alignment [14]. Extensive comparison of various image features and ideal parameter selection process is provided in the paper.

The algorithms outlined earlier cannot be applied in 3D domain directly. The main issues in 3D map matching stem primarily from the scale and density of data with the added dimension [18]. 3D LiDAR sensors provide accurate position information in a long range, while stereo cameras with onboard computational units can measure dense depth in a camera field of view. Maps generated from such sensors are inevitably large. The ideas for matching and merging are similar to those in the 2D case: optimize a distance cost over 3D data [19, 20], or extract local or global features from the 3D data and perform a robust registration using the corresponding features [21, 22, 23, 24].

However, the aforementioned methods cannot run in real time using onboard computer of the individual agents. Near real-time performance was achieved in [21] while providing demonstrations on large scale but simplistic indoor environments. Similarly [24] relies on learning-based place recognition descriptors from pose-centered spherical projections and GICP optimization, but require preprocessing on each scan to be performed. While the SLAM and map matching and merging operations both require the preprocessing step, the systems end up being coupled.

2.2 Collaborative SLAM

The primary focus for collaborative SLAM (C-SLAM) algorithms [25], is the pose optimization using both intra-robot and inter-robot loop closures within the maps. As such, individual agents do not share their maps fully with others in the network. With the exception of some centralized architectures [26, 27], it is not possible to obtain the global map with many of the prominent C-SLAM architectures [28, 29, 30]. It is indeed possible to perform the map merging using the C-SLAM algorithms, but a framework for sharing map data between agents is not discussed, and the bandwidth requirements for dense 3D map data remains a significant challenge.

2.3 3D Point Features

A common issue that needs to be addressed for both 2D and 3D feature description problem is the necessity for the descriptors to be pose invariant. The same point of interest, viewed from different locations, should result in ideally the same descriptor. An additional challenge for arbitrary 3D point clouds is that the data structure is not necessarily ordered, and any permutation of the order of points should still provide accurate descriptors. Much of the work is focusing on obtaining such pose-invariant descriptors, and can be grouped into two major categories: hand-crafted and learning-based.

Hand-crafted feature descriptors establish a local frame of reference, informed by the point neighborhood, to generate discerning statistics about the point. A prominent algorithm is Fast Point Feature Histograms (FPFH) [31], comprised of the histograms of angular variations for a point of interest in its k-nearest neighborhood. A comprehensive review of hand-crafted features in [32] found FPFH to be effective in low-noise scan matching scenarios. An adaptation of FAST [33] corner detection on orthogonal 2D projections of the neighborhood of a point is proposed as an efficient alternative for low-resource platforms [22].

Recent algorithms attempt to learn the discerning and pose-invariant descriptors directly from available object, indoor, and outdoor datasets. Earlier realizations relied on adapting Convolutional Neural Networks (CNNs) to 3D voxel-based representations [34, 35]. Structured use of multi-layer perceptrons were shown to be effective in dealing with the unordered nature of point cloud data [36, 37], and various CNN-architectures that focus directly on point data emerged [38, 39, 40, 41, 42].

A review of learning-based point cloud feature extraction for various applications can be found in [43]. Two of the convolutional architectures are important to our study: Kernel Point Convolution (KPConv) [44], which uses radius-neighborhoods to define deformable kernels to be used in 3D point convolution, and Fully Convolutional Geometric Features (FCGF) [45], which implements a fully-convolutional architecture with a metric learning loss to improve both accuracy and efficiency.

2.4 Point Cloud Registration

Given noiseless data and perfect correspondences, the registration problem has a closed form solution by solving the Procrustes problem. However, the conditions cannot be met in real world. Sensors are noisy, and correspondences can be wrong: 95% outlier rates have been reported [46], necessitating robust solutions. Reviews for the algorithms addressing registration problem for same-source [47] and cross-source [48] are available.

Sample consensus-based algorithms (SAC or RANSAC-family) [49] find the most consistent registration hypothesis by evaluating the solution generated from a minimal set of correspondences, sampled from the initial correspondence set, and finding the solution that is accurate for the most number of correspondences, termed the inlier set. More modern iterations of RANSAC family of algorithms provide optimality guarantees in quadratic-time [50] and differentiable implementation for trainability and use in learning-based frameworks [51]. Another significant robust registration method relies on semi-definite relaxation in its optimization, allowing it to handle extreme outlier ratios with a certificate of optimality [52].

Learning-based registration frameworks cast the problem into differentiable sets of modules, treating initial set of correspondences as putative and assigning weights on each correspondence. Deep Global Registration (DGR) [53] minimizes a robust energy metric with a convolutional network for confidence assignment, while PointDSC [54] incorporates spatial consistency explicitly to the network architecture.

Attention-based transformer networks aim to utilize the benefits of transformer architectures for point cloud registration problems. Some examples include PREDATOR [55], a model that learns to focus primarily on the overlap** regions, GeoTransformer [56], an archirecture that encodes the geometric structure of the point neighborhood to obtain a pose-invariant superpoint representation and performs keypoint-free coarse-to-fine registration, and Rotation-Invariant Transformer for Point Cloud Matching (RoITr) [57], a system trained to be able to deal with arbitrary rotational transformations.

A significant bottleneck for the attention mechanisms is the computational complexity. Most of the algorithms appear to be implemented and tested only on server-grade GPUs, without much concern for mobile deployment. BUFFER [58] stands out as a model that attempts a trade-off between accuracy, efficiency, and generalizability by improving the feature quality and estimation pipelines with small-scale networks.

3 Problem Formulation

Vectors and matrices are denoted with lowercase and uppercase boldfaced characters (e.g. 𝐩𝐩\mathbf{p}bold_p, 𝐓𝐓\mathbf{T}bold_T), respectively. Sets are named with calligraphic fonts (e.g. 𝒮𝒮\mathcal{S}caligraphic_S, 𝒞𝒞\mathcal{C}caligraphic_C) or otherwise implicitly defined inside curly brackets. The cardinality of a set is denoted as |||\cdot|| ⋅ |. \mathbb{R}blackboard_R and +superscript\mathbb{Z}^{+}blackboard_Z start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT denote the set of real numbers and positive integers, respectively. Object indexing and memberships are denoted with a combination of pre- or post-fixed subscripts and superscripts.

3.1 Map Matching Problem

This paper addresses the problem of pairwise merging of gravity-aligned 3D maps that can be canonically represented as an unordered point cloud, where each point stores a coordinate as 𝐩ji3,i{c,d}formulae-sequencesuperscriptsubscript𝐩𝑗𝑖superscript3𝑖𝑐𝑑{{}^{i}\mathbf{p}_{j}}\in\mathbb{R}^{3},~{}i\in\{c,d\}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT , italic_i ∈ { italic_c , italic_d }, by two agents c𝑐citalic_c and d𝑑ditalic_d. We assume no prior knowledge of the absolute pose information for any agent, and the agents are not required to observe each other via a rendezvous. The communication network is assumed to have sufficient bandwidth in the order of 10 Mbps in order to transfer incremental maps having a size close to 70 MB. The map matching and merging tasks are invoked upon sufficient connectivity.

The map of different agents will be denoted as

i={,𝐩ji,},|i|=Ni,i{c,d}.formulae-sequencesuperscript𝑖superscriptsubscript𝐩𝑗𝑖formulae-sequencesuperscript𝑖subscript𝑁𝑖𝑖𝑐𝑑{{}^{i}\mathcal{M}}=\{\dots,~{}{{}^{i}\mathbf{p}_{j}},~{}\dots\},\quad|{{}^{i}% \mathcal{M}}|=N_{i},\quad i\in\{c,~{}d\}.start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT caligraphic_M = { … , start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , … } , | start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT caligraphic_M | = italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i ∈ { italic_c , italic_d } . (1)

To ensure the applicability and scalability of the tested algorithms, it is assumed that the point cloud representation is processed with a voxel grid filter of leaf size g𝑔gitalic_g. The filtering process ensures more uniform point density and eliminates the dependency on particular sensor sets during map** (i.e., LiDAR providing high point density perpendicular to its rotation axis, RGBD/stereo having high density in limited field of view).

The number of points in agents’ maps, Nisubscript𝑁𝑖N_{i}italic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, depends primarily on the map resolution and the size of the environment that is being mapped. Sensible combinations that preserve a moderate amount of detail can increase the map size quickly, reaching the order of millions.

Let w𝑤witalic_w denote a frame of reference fixed in the world. The points of the map isuperscript𝑖{{}^{i}\mathcal{M}}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT caligraphic_M can be transformed into w𝑤witalic_w via a rigid 3D homogeneous transformation 𝐓iwSE(3)superscriptsubscript𝐓𝑖𝑤𝑆𝐸3\mathbf{T}_{i}^{w}\in SE(3)bold_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_w end_POSTSUPERSCRIPT ∈ italic_S italic_E ( 3 ):

SE(3)={[𝐑3×3𝐭3×1𝟎1×31],𝐑𝐑=𝐑𝐑=𝐈,det(𝐑)=1}SE(3)=\left\{\begin{bmatrix}\mathbf{R}_{3\times 3}&\mathbf{t}_{3\times 1}\\ \mathbf{0}_{1\times 3}&1\end{bmatrix},\quad\mathbf{R}^{\top}\mathbf{R}=\mathbf% {RR}^{\top}=\mathbf{I},\quad\det({\mathbf{R}})=1\right\}italic_S italic_E ( 3 ) = { [ start_ARG start_ROW start_CELL bold_R start_POSTSUBSCRIPT 3 × 3 end_POSTSUBSCRIPT end_CELL start_CELL bold_t start_POSTSUBSCRIPT 3 × 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL bold_0 start_POSTSUBSCRIPT 1 × 3 end_POSTSUBSCRIPT end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] , bold_R start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_R = bold_RR start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT = bold_I , roman_det ( bold_R ) = 1 } (2)

The global world frame w𝑤witalic_w can be arbitrarily defined by anchoring one agent’s frame as the map origin. Assigning the local frame of agent c𝑐citalic_c as the fixed frame w𝑤witalic_w, the goal is to estimate the transformation 𝐓dcSE(3)superscriptsubscript𝐓𝑑𝑐𝑆𝐸3\mathbf{T}_{d}^{c}\in SE(3)bold_T start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ∈ italic_S italic_E ( 3 ).

We assume that the agents’ maps are gravity-aligned. More explicitly, we assume that the agents are capable of observing the gravity vector. The gravity vector is common among all of the agents operating in the environment and is parallel to the z𝑧zitalic_z-axis of the map they are building. We do not restrict the z𝑧zitalic_z-axes to be identical (i.e., origins in the z𝑧zitalic_z-axes are coincident), as the agents may have different initializations for the map height.

With this assumption, the problem is restricted to 4 degrees of freedom (DoF) and the homogeneous transformation can be parameterized as

𝐓dc=𝐓(xdc,ydc,zdc,θdc)=[cosθdcsinθdc0xdcsinθdccosθdc0ydc001zdc0001].superscriptsubscript𝐓𝑑𝑐𝐓superscriptsubscript𝑥𝑑𝑐superscriptsubscript𝑦𝑑𝑐superscriptsubscript𝑧𝑑𝑐superscriptsubscript𝜃𝑑𝑐matrixsuperscriptsubscript𝜃𝑑𝑐superscriptsubscript𝜃𝑑𝑐0superscriptsubscript𝑥𝑑𝑐superscriptsubscript𝜃𝑑𝑐superscriptsubscript𝜃𝑑𝑐0superscriptsubscript𝑦𝑑𝑐001superscriptsubscript𝑧𝑑𝑐0001\mathbf{T}_{d}^{c}=\mathbf{T}(x_{d}^{c},y_{d}^{c},z_{d}^{c},\theta_{d}^{c})=% \begin{bmatrix}\cos\theta_{d}^{c}&-\sin\theta_{d}^{c}&0&x_{d}^{c}\\ \sin\theta_{d}^{c}&\cos\theta_{d}^{c}&0&y_{d}^{c}\\ 0&0&1&z_{d}^{c}\\ 0&0&0&1\end{bmatrix}.bold_T start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT = bold_T ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) = [ start_ARG start_ROW start_CELL roman_cos italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL start_CELL - roman_sin italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL start_CELL 0 end_CELL start_CELL italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL roman_sin italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL start_CELL roman_cos italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL start_CELL 0 end_CELL start_CELL italic_y start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] . (3)

The given setup does not restrict the motion of the agents. The agent is free to execute any motion, as long as the gravity vector can be measured.

A map matching (or more generally, a point cloud registration) algorithm relies on the set of correspondences 𝒞𝒞\mathcal{C}caligraphic_C established between pairs of points:

𝒞{(𝐩mc,𝐩nd):m,n+,mNc,nNd}.𝒞conditional-setsuperscriptsubscript𝐩𝑚𝑐superscriptsubscript𝐩𝑛𝑑formulae-sequence𝑚𝑛superscriptformulae-sequence𝑚subscript𝑁𝑐𝑛subscript𝑁𝑑\mathcal{C}\triangleq\left\{({{}^{c}}\mathbf{p}_{m},{{}^{d}}\mathbf{p}_{n}):~{% }m,n\in\mathbb{Z}^{+},~{}m\leq N_{c},~{}n\leq N_{d}\right\}.caligraphic_C ≜ { ( start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) : italic_m , italic_n ∈ blackboard_Z start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT , italic_m ≤ italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT , italic_n ≤ italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT } . (4)

Assuming that all correspondences are “true”, which are the pairs that overlap when the true transformation is applied to the targeted point cloud, the registration task can be cast as an optimization problem as

𝐓dc=argmin𝐓SE(3)(𝐩mc,𝐩nd)𝒞ρ(𝐩mc,𝐓𝐩nd)superscriptsubscript𝐓𝑑𝑐𝐓𝑆𝐸3subscriptsuperscriptsubscript𝐩𝑚𝑐superscriptsubscript𝐩𝑛𝑑𝒞𝜌superscriptsubscript𝐩𝑚𝑐𝐓superscriptsubscript𝐩𝑛𝑑\mathbf{T}_{d}^{c}=\underset{\mathbf{T}\in SE(3)}{\arg\min}~{}\sum_{({{}^{c}}% \mathbf{p}_{m},{{}^{d}}\mathbf{p}_{n})\in\mathcal{C}}\rho({{}^{c}}\mathbf{p}_{% m},\mathbf{T}~{}{{}^{d}}\mathbf{p}_{n})bold_T start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT = start_UNDERACCENT bold_T ∈ italic_S italic_E ( 3 ) end_UNDERACCENT start_ARG roman_arg roman_min end_ARG ∑ start_POSTSUBSCRIPT ( start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ∈ caligraphic_C end_POSTSUBSCRIPT italic_ρ ( start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_T start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) (5)

where ρ(𝐚,𝐛)𝜌𝐚𝐛\rho(\mathbf{a},\mathbf{b})italic_ρ ( bold_a , bold_b ) is a non-negative distance metric to determine the error for a given correspondence under the transformation hypothesis 𝐓𝐓\mathbf{T}bold_T. As an example, in the case of point-to-point iterative closest point (ICP), the cost metric assumes the form of Euclidean distance.

Estimating “true” correspondences is a difficult task, as it depends on the discrimination capacity of the descriptor and some mismatches cannot be avoided due to perceptual aliasing. It is reasonable to expect outlier correspondence rates of 90% [59] for 3D datasets.

An inlier set 𝒞in𝒞subscript𝒞in𝒞\mathcal{C}_{\text{in}}\subseteq\mathcal{C}caligraphic_C start_POSTSUBSCRIPT in end_POSTSUBSCRIPT ⊆ caligraphic_C is the set of correspondences that agree with a given transformation, and it needs to be selected from correspondences that agree with the unknown, correct transformation, leading to a modification of (5):

𝐓dc=argmin𝐓SE(3)(𝐩mc,𝐩nd)𝒞inρ(𝐩mc,𝐓𝐩nd).superscriptsubscript𝐓𝑑𝑐𝐓𝑆𝐸3subscriptsuperscriptsubscript𝐩𝑚𝑐superscriptsubscript𝐩𝑛𝑑subscript𝒞in𝜌superscriptsubscript𝐩𝑚𝑐𝐓superscriptsubscript𝐩𝑛𝑑\mathbf{T}_{d}^{c}=\underset{\mathbf{T}\in SE(3)}{\arg\min}~{}\sum_{({{}^{c}}% \mathbf{p}_{m},{{}^{d}}\mathbf{p}_{n})\in\mathcal{C}_{\text{in}}}\rho({{}^{c}}% \mathbf{p}_{m},\mathbf{T}~{}{{}^{d}}\mathbf{p}_{n})~{}.bold_T start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT = start_UNDERACCENT bold_T ∈ italic_S italic_E ( 3 ) end_UNDERACCENT start_ARG roman_arg roman_min end_ARG ∑ start_POSTSUBSCRIPT ( start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ∈ caligraphic_C start_POSTSUBSCRIPT in end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_ρ ( start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_T start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) . (6)

4 Proposed Method

4.1 Tomographic Features

A tomographic section of the map, termed slice throughout the paper, is defined as the 2D binary occupancy representation of a horizontal cross-section of a map isubscript𝑖\mathcal{M}_{i}caligraphic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT at a particular height hhitalic_h. The points in the map to generate a slice at height hhitalic_h can be defined as:

𝒮^hi{𝐩ji:ht<𝐩ji|zh+t,}{{}^{i}\widehat{\mathcal{S}}_{h}}\triangleq\left\{{{}^{i}}\mathbf{p}_{j}:\ h-t% <\left.{{}^{i}\mathbf{p}_{j}}\right|_{z}\leq h+t,\right\}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT over^ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ≜ { start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT : italic_h - italic_t < start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ≤ italic_h + italic_t , } (7)

where 𝐩|zevaluated-at𝐩𝑧\left.{\mathbf{p}}\right|_{z}bold_p | start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT is the z𝑧zitalic_z-coordinate of point 𝐩𝐩\mathbf{p}bold_p and t𝑡titalic_t is a parameter that determines the thickness of the cross-section.

A natural choice for the thickness parameter to utilize all available information is t=g/2𝑡𝑔2t=g/2italic_t = italic_g / 2, half the leaf size. In this manner, every point in the map will belong to a single slice for a finite set of hhitalic_h that are g𝑔gitalic_g apart. Let zmaxisuperscriptsubscript𝑧𝑖{}^{i}z_{\max}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT italic_z start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT (zminisuperscriptsubscript𝑧𝑖{}^{i}z_{\min}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT italic_z start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT) be the maximum (minimum) z𝑧zitalic_z-coordinate of the map isuperscript𝑖{}^{i}\mathcal{M}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT caligraphic_M. Then, the index set of heights isubscript𝑖\mathcal{H}_{i}caligraphic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for isuperscript𝑖{}^{i}\mathcal{M}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT caligraphic_M is defined as:

i{zmini+kg:k+,kzmaxizminig}.subscript𝑖conditional-setsuperscriptsubscript𝑧𝑖𝑘𝑔formulae-sequence𝑘superscript𝑘superscriptsubscript𝑧𝑖superscriptsubscript𝑧𝑖𝑔\mathcal{H}_{i}\triangleq\left\{{{}^{i}}z_{\min}+k\cdot g:\ k\in\mathbb{Z}^{+}% ,\ k\leq\frac{{{}^{i}}z_{\max}-{{}^{i}}z_{\min}}{g}\right\}.caligraphic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≜ { start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT italic_z start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT + italic_k ⋅ italic_g : italic_k ∈ blackboard_Z start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT , italic_k ≤ divide start_ARG start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT italic_z start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT - start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT italic_z start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT end_ARG start_ARG italic_g end_ARG } . (8)

Points in every slice 𝒮^hi,hisuperscriptsubscript^𝒮𝑖subscript𝑖{{}^{i}\widehat{\mathcal{S}}_{h}},~{}h\in\mathcal{H}_{i}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT over^ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_h ∈ caligraphic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are projected onto the xy𝑥𝑦xyitalic_x italic_y-plane and discretized on a 2D grid to obtain a 2D binary occupancy image, 𝐒hisuperscriptsubscript𝐒𝑖{{}^{i}\mathbf{S}_{h}}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT bold_S start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT. The extrema xy𝑥𝑦xyitalic_x italic_y coordinates of points in 𝒮^hisuperscriptsubscript^𝒮𝑖{{}^{i}\widehat{\mathcal{S}}_{h}}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT over^ start_ARG caligraphic_S end_ARG start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT and the grid size g𝑔gitalic_g dictate the width and height of the resultant binary image. Intuitively, each pixel represents the occupancy of a real area of size g×gm2𝑔𝑔superscript𝑚2g\times g~{}m^{2}italic_g × italic_g italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

Examples of the 2D binary images obtained through the slicing are provided in Figure 1 for an indoor environment.

Refer to caption
Figure 1: Samples of 2D binary slice images extracted at differnet heights for a simulated indoor environment. Binary images with colorized borders denote the corresponding height at the 3D rendering of the environment.

ORB features and descriptors [60] are extracted from every 2D binary slice, and the highest scoring K𝐾Kitalic_K features are kept. The set of 128-bit descriptors, paired with the 2D coordinates, is denoted as hisuperscriptsubscript𝑖{{}^{i}}\mathcal{F}_{h}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT caligraphic_F start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT and forms the basis of the map matching task:

hi{𝐟h,ki},|hi|K.formulae-sequencesuperscriptsubscript𝑖superscriptsubscript𝐟𝑘𝑖superscriptsubscript𝑖𝐾{{}^{i}}\mathcal{F}_{h}\triangleq\{{{}^{i}}\mathbf{f}_{h,k}\},~{}|{{}^{i}}% \mathcal{F}_{h}|\leq K.start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT caligraphic_F start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ≜ { start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT bold_f start_POSTSUBSCRIPT italic_h , italic_k end_POSTSUBSCRIPT } , | start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT caligraphic_F start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT | ≤ italic_K . (9)

The choice of ORB is primarily informed by the lack of real photo intensity changes within the binary images 𝐒hisuperscriptsubscript𝐒𝑖{{}^{i}\mathbf{S}_{h}}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT bold_S start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT and the efficiency of the underlying FAST [33] corner detection with oriented BRIEF [61] descriptors. Alternative feature extraction and description pipelines can be used [62], though the reliability and efficiency of ORB features are validated in the literature for various cases [63].

The slicing and feature extraction results in a different representation of the map as an indexed set of 2D binary slices and associated features:

𝒮i{(𝐒hi,hi),hi}.subscript𝒮𝑖superscriptsubscript𝐒𝑖superscriptsubscript𝑖subscript𝑖\mathcal{S}_{i}\triangleq\left\{({{}^{i}\mathbf{S}_{h}},~{}{{}^{i}}\mathcal{F}% _{h}),~{}h\in\mathcal{H}_{i}\right\}.caligraphic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≜ { ( start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT bold_S start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT caligraphic_F start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) , italic_h ∈ caligraphic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } . (10)

4.2 Map Matching with Tomographic Features

Given the set of binary slices and their associated features 𝒮csubscript𝒮𝑐\mathcal{S}_{c}caligraphic_S start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and 𝒮dsubscript𝒮𝑑\mathcal{S}_{d}caligraphic_S start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT, the task is to estimate the 4 unknowns in the transformation 𝐓dc=𝐓(xdc,ydc,zdc,θdc)superscriptsubscript𝐓𝑑𝑐𝐓superscriptsubscript𝑥𝑑𝑐superscriptsubscript𝑦𝑑𝑐superscriptsubscript𝑧𝑑𝑐superscriptsubscript𝜃𝑑𝑐\mathbf{T}_{d}^{c}=\mathbf{T}(x_{d}^{c},y_{d}^{c},z_{d}^{c},\theta_{d}^{c})bold_T start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT = bold_T ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) (from (3)). The proposed method solves a hierarchical optimization problem in two stages: a) the rigid 2D transformation estimation, and b) the height difference estimation, as shown in Figure 2.

Refer to caption
Figure 2: A visual summary of the proposed distributed 3D map matching framework. Each agent {c,d}𝑐𝑑\{c,d\}{ italic_c , italic_d } is responsible for extracting slices 𝐬hi{c,d}superscriptsubscript𝐬subscript𝑖𝑐𝑑{{}^{\{c,d\}}}\mathbf{s}_{h_{i}}start_FLOATSUPERSCRIPT { italic_c , italic_d } end_FLOATSUPERSCRIPT bold_s start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT at a predefined grid size. The pose is estimated via two-stage optimization algorithm, where one of the agents cross-correlates the slices by estimating a 3DoF rigid transformation, 𝐓~dcsuperscriptsubscript~𝐓𝑑𝑐\widetilde{\mathbf{T}}_{d}^{c}over~ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT, between slices, and the consensus of different height hypotheses, zdcsuperscriptsubscript𝑧𝑑𝑐z_{d}^{c}italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT, is used to determine the 4th DoF.

4.2.1 Rigid 2D Transformation Estimation

The first stage estimates the 3 DoF for different values of zdcsuperscriptsubscript𝑧𝑑𝑐z_{d}^{c}italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT. For a known height difference, each slice in a map will have either one corresponding slice that will be at the same height, or none. For each slice pair (𝐒mc,𝐒nd)ksubscriptsuperscriptsubscript𝐒𝑚𝑐superscriptsubscript𝐒𝑛𝑑𝑘({{}^{c}\mathbf{S}_{m}},~{}{{}^{d}\mathbf{S}_{n}})_{k}( start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT bold_S start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT bold_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT indexed by k𝑘kitalic_k, matching computed features yields a set of putative correspondences {,(𝐩jc,𝐩jd),}ksubscriptsuperscriptsubscript𝐩𝑗𝑐superscriptsubscript𝐩𝑗𝑑𝑘\{\cdots,({{}^{c}\mathbf{p}_{j}},~{}{{}^{d}\mathbf{p}_{j}}),\cdots\}_{k}{ ⋯ , ( start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) , ⋯ } start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, and the pose estimation problem reduces to finding a 2D rigid transformation, which has the form

[cosθdcsinθdcxdcsinθdccosθdcydc][xjdyjd1]=[xjcyjc]matrixsuperscriptsubscript𝜃𝑑𝑐superscriptsubscript𝜃𝑑𝑐superscriptsubscript𝑥𝑑𝑐superscriptsubscript𝜃𝑑𝑐superscriptsubscript𝜃𝑑𝑐superscriptsubscript𝑦𝑑𝑐matrixsuperscriptsubscript𝑥𝑗𝑑superscriptsubscript𝑦𝑗𝑑1matrixsuperscriptsubscript𝑥𝑗𝑐superscriptsubscript𝑦𝑗𝑐\begin{bmatrix}\cos~{}\theta_{d}^{c}&-\sin~{}\theta_{d}^{c}&x_{d}^{c}\\ \sin~{}\theta_{d}^{c}&~{}\cos~{}\theta_{d}^{c}&y_{d}^{c}\end{bmatrix}\begin{% bmatrix}{{}^{d}x_{j}}\\ {{}^{d}y_{j}}\\ 1\end{bmatrix}=\begin{bmatrix}{{}^{c}x_{j}}\\ {{}^{c}y_{j}}\end{bmatrix}[ start_ARG start_ROW start_CELL roman_cos italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL start_CELL - roman_sin italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL start_CELL italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL roman_sin italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL start_CELL roman_cos italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL start_CELL italic_y start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL 1 end_CELL end_ROW end_ARG ] = [ start_ARG start_ROW start_CELL start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] (11)

for corresponding points 𝐩ji=[xjiyji],i{c,d}formulae-sequencesuperscriptsubscript𝐩𝑗𝑖superscriptmatrixsuperscriptsubscript𝑥𝑗𝑖superscriptsubscript𝑦𝑗𝑖top𝑖𝑐𝑑{{}^{i}\mathbf{p}_{j}}=\begin{bmatrix}{{}^{i}x_{j}}&{{}^{i}y_{j}}\end{bmatrix}% ^{\top},~{}i\in\{c,d\}start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT bold_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = [ start_ARG start_ROW start_CELL start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL start_CELL start_FLOATSUPERSCRIPT italic_i end_FLOATSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT , italic_i ∈ { italic_c , italic_d }. The problem is then cast to a linear system as

[xjdyjd10yjdxjd01][αβxdcydc]=[xjcyjc]matrixsuperscriptsubscript𝑥𝑗𝑑superscriptsubscript𝑦𝑗𝑑10superscriptsubscript𝑦𝑗𝑑superscriptsubscript𝑥𝑗𝑑01matrix𝛼𝛽superscriptsubscript𝑥𝑑𝑐superscriptsubscript𝑦𝑑𝑐matrixsuperscriptsubscript𝑥𝑗𝑐superscriptsubscript𝑦𝑗𝑐\begin{bmatrix}\vdots&\vdots&\vdots&\vdots\\ {{}^{d}x_{j}}&-{{}^{d}y_{j}}&1&0\\ {{}^{d}y_{j}}&~{}{{}^{d}x_{j}}&0&1\\ \vdots&\vdots&\vdots&\vdots\end{bmatrix}\begin{bmatrix}\alpha\\ \beta\\ x_{d}^{c}\\ y_{d}^{c}\end{bmatrix}=\begin{bmatrix}\vdots\\ {{}^{c}x_{j}}\\ {{}^{c}y_{j}}\\ \vdots\end{bmatrix}[ start_ARG start_ROW start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL start_CELL - start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL start_CELL start_FLOATSUPERSCRIPT italic_d end_FLOATSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_α end_CELL end_ROW start_ROW start_CELL italic_β end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_y start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] = [ start_ARG start_ROW start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL start_FLOATSUPERSCRIPT italic_c end_FLOATSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL end_ROW end_ARG ] (12)

where α=scos(θdc)𝛼𝑠superscriptsubscript𝜃𝑑𝑐\alpha=s\cos(\theta_{d}^{c})italic_α = italic_s roman_cos ( italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) and β=ssin(θdc)𝛽𝑠superscriptsubscript𝜃𝑑𝑐\beta=s\sin(\theta_{d}^{c})italic_β = italic_s roman_sin ( italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) with s𝑠sitalic_s as the scale parameter. Since we assume known scale, s=1𝑠1s=1italic_s = 1. The angle θdcsuperscriptsubscript𝜃𝑑𝑐\theta_{d}^{c}italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT is recovered using atan2(β,α)atan2𝛽𝛼\text{atan2}(\beta,\alpha)atan2 ( italic_β , italic_α ).

For simplicity, RANSAC is used to obtain a robust solution, followed by Levenberg-Marquardt refinement steps over the inlier set. The linear system given in Equation (12) is solved for every slice pair k𝑘kitalic_k at a given height to obtain the set of transformations at a known height:

𝒯~(zdc){𝐓~k(x~dc,y~dc,zdc,θ~dc)𝐓~dck}.~𝒯superscriptsubscript𝑧𝑑𝑐subscript~𝐓𝑘superscriptsubscript~𝑥𝑑𝑐superscriptsubscript~𝑦𝑑𝑐superscriptsubscript𝑧𝑑𝑐superscriptsubscript~𝜃𝑑𝑐subscriptsuperscriptsubscript~𝐓𝑑𝑐𝑘\widetilde{\mathcal{T}}(z_{d}^{c})\triangleq\left\{{{}_{k}\widetilde{\mathbf{T% }}(\widetilde{x}_{d}^{c},\widetilde{y}_{d}^{c},z_{d}^{c},\widetilde{\theta}_{d% }^{c})}\triangleq{{}_{k}\widetilde{\mathbf{T}}_{d}^{c}}\right\}.over~ start_ARG caligraphic_T end_ARG ( italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) ≜ { start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT over~ start_ARG bold_T end_ARG ( over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , over~ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , over~ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) ≜ start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT over~ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT } . (13)

It is not expected that every slice pair k𝑘kitalic_k provides an accurate estimate, as the agents may not have complete occupancy information in the neighborhood of a particular xy𝑥𝑦xyitalic_x italic_y-coordinate at every height. In such cases, the pose estimate 𝐓~dcksubscriptsuperscriptsubscript~𝐓𝑑𝑐𝑘{{}_{k}\widetilde{\mathbf{T}}_{d}^{c}}start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT over~ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT does not represent the true transform, and simple averaging across all slices will introduce large errors.

To that end, a consensus between different 2D rigid estimations is established to obtain the most likely pose estimate. We perform this by finding the largest set of inlier hypotheses 𝒯^(zdc)𝒯~(zdc)^𝒯superscriptsubscript𝑧𝑑𝑐~𝒯superscriptsubscript𝑧𝑑𝑐\widehat{\mathcal{T}}(z_{d}^{c})\subseteq\widetilde{\mathcal{T}}(z_{d}^{c})over^ start_ARG caligraphic_T end_ARG ( italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) ⊆ over~ start_ARG caligraphic_T end_ARG ( italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) within a distance to an anchor hypothesis, 𝐓~dcksubscriptsuperscriptsubscript~𝐓𝑑𝑐𝑘{{}_{k}}\widetilde{\mathbf{T}}_{d}^{c}start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT over~ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT:

𝒯^(zdc)=argmax|{𝐓~dck:𝐝(𝐓~dck,𝐓~dcl)𝐭,𝐓~dc{k,l}𝒯~(zdc)}|,^𝒯superscriptsubscript𝑧𝑑𝑐conditional-setsubscriptsuperscriptsubscript~𝐓𝑑𝑐𝑘formulae-sequenceprecedes𝐝subscriptsuperscriptsubscript~𝐓𝑑𝑐𝑘subscriptsuperscriptsubscript~𝐓𝑑𝑐𝑙𝐭subscriptsuperscriptsubscript~𝐓𝑑𝑐𝑘𝑙~𝒯superscriptsubscript𝑧𝑑𝑐\widehat{\mathcal{T}}(z_{d}^{c})=\arg\max~{}\left|\left\{{{}_{k}}\widetilde{% \mathbf{T}}_{d}^{c}:\mathbf{d}({{}_{k}}\widetilde{\mathbf{T}}_{d}^{c},~{}{{}_{% l}}\widetilde{\mathbf{T}}_{d}^{c})\prec\mathbf{t},~{}{{}_{\{k,l\}}}\widetilde{% \mathbf{T}}_{d}^{c}\in\widetilde{\mathcal{T}}(z_{d}^{c})\right\}\right|,over^ start_ARG caligraphic_T end_ARG ( italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) = roman_arg roman_max | { start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT over~ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT : bold_d ( start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT over~ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , start_FLOATSUBSCRIPT italic_l end_FLOATSUBSCRIPT over~ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) ≺ bold_t , start_FLOATSUBSCRIPT { italic_k , italic_l } end_FLOATSUBSCRIPT over~ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ∈ over~ start_ARG caligraphic_T end_ARG ( italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) } | , (14)

where the vector-valued function 𝐝𝐝\mathbf{d}bold_d encodes the Euclidean distance between the estimates of xdc,ydcsuperscriptsubscript𝑥𝑑𝑐superscriptsubscript𝑦𝑑𝑐x_{d}^{c},y_{d}^{c}italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT , italic_y start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT and the angular distance between the angles θdcsuperscriptsubscript𝜃𝑑𝑐\theta_{d}^{c}italic_θ start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT, and 𝐭𝐭\mathbf{t}bold_t is a threshold to specify maximum allowed deviations. The parameter-wise average of the hypotheses is used in the inlier set 𝒯^(zdc)^𝒯superscriptsubscript𝑧𝑑𝑐\widehat{\mathcal{T}}(z_{d}^{c})over^ start_ARG caligraphic_T end_ARG ( italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) to compute the resultant pose hypothesis, 𝐓¯dcsuperscriptsubscript¯𝐓𝑑𝑐\bar{\mathbf{T}}_{d}^{c}over¯ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT.

4.2.2 Height Difference Estimation

The previous step assumed that the height difference zdcsuperscriptsubscript𝑧𝑑𝑐z_{d}^{c}italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT between maps is known. Unless there are means to establish the initial height difference, it is a strong assumption that would restrict the applicability of the algorithm to different scenarios.

The zdcsuperscriptsubscript𝑧𝑑𝑐z_{d}^{c}italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT is similarly estimated based on the consensus. Due to the discretization of the 3D maps via slicing, there are finite number of relative height steps that can be established between a pair maps, all of which are g𝑔gitalic_g units apart. Testing all possible height steps can be thought of as “cross-correlation” of the maps: each step results in a pose hypothesis 𝐓¯dcsuperscriptsubscript¯𝐓𝑑𝑐\bar{\mathbf{T}}_{d}^{c}over¯ start_ARG bold_T end_ARG start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT computed from an inlier set 𝒯^(zdc)^𝒯superscriptsubscript𝑧𝑑𝑐\widehat{\mathcal{T}}(z_{d}^{c})over^ start_ARG caligraphic_T end_ARG ( italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ), and the size of the inlier set is the value for the correlation operator. As such, the height difference zdcsuperscriptsubscript𝑧𝑑𝑐z_{d}^{c}italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT with the largest cardinality for 𝒯^(zdc)^𝒯superscriptsubscript𝑧𝑑𝑐\widehat{\mathcal{T}}(z_{d}^{c})over^ start_ARG caligraphic_T end_ARG ( italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) is selected as the height estimate:

zdc=argmax𝑧|𝒯^(z)|.superscriptsubscript𝑧𝑑𝑐𝑧^𝒯𝑧z_{d}^{c}=\underset{z}{\arg\max}~{}\left|\widehat{\mathcal{T}}(z)\right|.italic_z start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT = underitalic_z start_ARG roman_arg roman_max end_ARG | over^ start_ARG caligraphic_T end_ARG ( italic_z ) | . (15)

The correct height difference would allow the largest number of slice pairs that themselves agree on the remaining degrees of freedom. Empirical tests indicate that the number of inlier correspondences are maximized for all slices when they are matched with the correct slice from the other map for indoor/outdoor environments with distinctive geometric features. We refer to the proposed method as Consensus throughout the paper.

4.2.3 Computational Complexity

The proposed method utilizes various simple algorithms that have a low computational burden individually, but repeated for a large number of hypotheses. Especially, the cross-correlation for maps that span larger heights are computationally expensive. However, each of these individual steps (feature calculation for every slice and 3 DoF estimation for a pair of slices) are independent of each other, enabling parallelization opportunities with hardware acceleration. Moreover, it is not necessary for intermittent map matching and merging on different indoor and outdoor scenarios.

5 Comparative Studies

We provide a comprehensive evaluation of the accuracy and efficiency of the proposed algorithm in both simulated and real-life datasets against various registration pipelines from the literature.

5.1 Algorithm Baselines and Metrics

Comparisons to eight different baselines are provided. The selection includes standard algorithms commonly used as a baseline in literature, as well as the state-of-the-art registration algorithms. The baselines that are not learning-based study different conditions in feature extraction and registration, and a literature implementation may not necessarily exist for them.

For the learning-based algorithms, the selection was informed primarily on their adaptability to the map matching problem. In all of the cases, the weights provided by the original authors are used. Since some methods do not fit the memory of mobile-grade GPUs, the evaluations are done on a high-performance computer with server-grade GPU (NVIDIA A100 80 GB).

5.1.1 Tomographic TEASER++

Tomographic TEASER++ [52] replaces the consensus-based selection of the height difference with a 3D registration problem at each height step, and selecting the solution that provides the largest number of inlier features.

Conversion from 2D ORB features to 3D features are performed by assigning the height difference as the z𝑧zitalic_z-coordinate to the points in the source map slice. The number of registration problems is the same as the consensus step in the proposed algorithm.

5.1.2 ORB-TEASER++

TEASER++ may incur a high memory and time cost, which was found to be the main bottleneck for the previous baseline. ORB-TEASER++ was tested as an attempt to reduce the number of registration steps to a minimum. In ORB-TEASER++, the feature matching step is performed between 2D keypoints across all of the slices in a map and the correspondences are used with TEASER++ for robust registration.

In essence, all 2D ORB features are treated as 3D features by assigning their relative height in the map’s own frame as their z𝑧zitalic_z-coordinates. Then, the matching is performed against the 3D ORB features in the other map.

5.1.3 FPFH-RANSAC

ORB will be shown to not be an appropriate feature for 3D data in the tomographic manner, as it is prone to perceptual aliasing across different slices. For example, a single edge of a wall with no features would yield the same ORB signature for similar heights. Therefore, when the matching is performed with the points from the other map, many identical matches will exist.

Instead, in FPFH-RANSAC the standard baseline of many point feature comparisons is used. FPFH [31], features are computed for 3D Harris keypoints of the map point clouds, and a rigid transformation hypothesis is computed using RANSAC over point matches. An open-source implementation of this method [64] is available online222https://github.com/hrnr/map-merge.

5.1.4 FPFH-TEASER++

As a modern alternative to RANSAC, which is known to be inaccurate under high outlier rates, this baseline uses TEASER++ for its registration task. The same features as FPFH-RANSAC baseline (FPFH descriptors of Harris 3D corners) are matched and registered using TEASER++ instead.

5.1.5 DeepGlobalRegistration

DeepGlobalRegistration [53] is the first learning-based baseline used. DeepGlobalRegistration uses FCGF [45], assigns confidence for each correspondence, and solves a differentiable weighted Procrustes problem.

The complete pipeline is modified in the comparisons here to provide fair comparisons. Specifically, the ICP refinement and the safeguard registration implementations are deactivated to evaluate the network as-is, since these refinements are method-agnostic.

5.1.6 GeoTransformer

GeoTransformer [56] adopts the transformer architecture [65] to encode the geometric structure based on neighboring point geometries to obtain a pose-invariant superpoint representation. With high-inlier ratios of the GeoTransformer correspondences, accurate local-to-global pose estimation is made possible, even under low overlap scenarios.

5.1.7 Rotation-Invariant Transformer (RoITr)

RoITr [57] proposes a cascade of attention modules that aims to eliminate any pose dependency on the features to isolate only the point geometry. In this manner, the architecture becomes inherently rotation-invariant.

RoITr is the only model among the others that do not provide model weights trained for KITTI dataset. For the real-data evaluation, RoITr is omitted.

5.1.8 BUFFER

BUFFER [58] distills the point-based and patch-based feature extraction methods, improving point-wise orientation and saliency estimation and performs pose estimation via RANSAC over inlier set of correspondences as returned from the network.

5.2 Comparison Metrics

Performance of the map matching systems are evaluated based on four factors: a) translation error, b) rotation error, c) execution time, d) and memory usage.

5.2.1 Translation Error

Translation error is measured as the distance between the ground truth transformation estimation and the calculated estimation, and is reported in meters. Different applications require different voxel grid sizes. For example, an indoor robot with a small footprint will benefit from a higher resolution map (e.g. 0.05 m - 0.2 m grid size), whereas for an outdoor mission the same parameters will introduce large processing bottlenecks and yield a very large map, in terms of the occupied cells. To that end, the translation error threshold to determine if a map matching task is successful is assigned to be 5 times the grid size of the underlying map. This is in line with the literature for point cloud registration tasks for both indoor and outdoor scenarios.

5.2.2 Rotation Error

Rotation error is calculated from the difference in yaw angles between the ground truth transformation and the estimated transform, and is reported in radians. Unlike the translation error, the rotation error does not depend on the map resolution. The error threshold is set as 0.1745 radians (10absentsuperscript10\approx 10^{\circ}≈ 10 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT), which is more conservative than the point cloud registration literature. The extents of the maps in map matching tasks are much larger in comparison and small variations can result in very large deviations.

5.2.3 Execution Time

Execution time is measured as the total time it takes the algorithm to compute the transformation parameters, reported in seconds. Any redundant pre-processing and post-processing steps are excluded from this measurement.

For the proposed method and baselines without a learning component, the computations were carried out on an Intel Phantom Canyon NUC11PHKi7C (Intel i7-1165G74 CPU, 64 GB RAM). Some deep learning-based pipelines required more memory than it is available on the aforementioned hardware, due to the scale of the problem. As such, the computations for learning-based baselines were carried out on a high-performance computer node with an NVIDIA A100 SXM4 80 GB card using 8 CPU cores and 530 GB RAM. While such hardware cannot be put on a mobile platform currently, this study takes into consideration the availability of better hardware in the future.

5.2.4 Memory Usage

Memory usage refers to the total amount of physical memory the algorithm utilizes to generate its result, in Bytes. Resident set size is selected as an estimate of the memory consumption for the proposed method and the first 4 baselines that do not use learning-based algorithms. For deep learning-based pipelines, the peak GPU memory usage is used as a measurement for the memory.

The following sections will introduce the simulation and real datasets and provide the results for the aforementioned baselines and the proposed method.

6 Simulation Studies

Simulation data is taken from individual trajectories of InteriorNet dataset [66], in which a camera executes up to 3 different continuous 6 DoF motions in various furnished indoor environments. Color and depth streams, matched with the camera pose, are used to generate a 3D occupancy grid map to generate sample datasets for this study. A representative pair of color and depth images are provided in Figure 3

Refer to caption
(a) Color image
Refer to caption
(b) (Normalized) Depth image
Figure 3: A sample pair of color and depth images from the InteriorNet dataset.

The pose information for different trajectories in the same environment are provided with respect to a common, fixed frame of reference. In order to simulate a multi-agent map** scenario, the initial pose of the trajectory is set as the origin of that trajectory, after removing the initial roll and pitch. In that way, each trajectory is emulating a different agent with its own local origin.

Five sequences from InteriorNet dataset were selected, each with three trajectories. In total, there are 30 different instances of map matching tasks (5 sequences ×\times× 3 trajectories ×\times× 2 directions). The average overlap between pairs of maps is calculated as 71.76% (minimum 48.74%, maximum 91.35%).

The InteriorNet dataset provides noiseless measurements with perfect pose information. To test the robustness of the algorithms against common sources of noise, two different noise modalities were tested: pose estimation noise, and measurement noise.

In order to simulate inaccuracies arising from noisy IMU measurements and imperfect map** algorithm conditions, the ground truth pose for each camera reading is perturbed independently in each degree of freedom. Yaw, pitch, roll, and 3 translation perturbations are randomly sampled from independent, zero-mean Gaussian distributions with two different standard deviations. For σ=0.02𝜎0.02\sigma=0.02italic_σ = 0.02, the translation and rotation perturbations are sampled from distributions with standard deviations 0.020.020.02~{}0.02m and 0.010.010.01~{}0.01rad, respectively. For σ=0.05𝜎0.05\sigma=0.05italic_σ = 0.05, the standard deviations are 0.050.050.05~{}0.05m and 0.0250.0250.025~{}0.025rad. For the instances with added noise, the depth measurements from the camera were also corrupted with a noise profile mimicking the commercially available Intel RealSense D435i depth sensor. Some sample maps obtained under various noise levels are provided in Figure 4 for a qualitative comparison.

Refer to caption
(a) σ=0.00𝜎0.00\sigma=0.00italic_σ = 0.00
Refer to caption
(b) σ=0.02𝜎0.02\sigma=0.02italic_σ = 0.02
Refer to caption
(c) σ=0.05𝜎0.05\sigma=0.05italic_σ = 0.05
Figure 4: Sample point clouds from InteriorNet dataset under various noise conditions.

6.1 Map Matching Results

Figures 5-8 provide the translation and rotation errors, memory footprint, and execution times of each method on different levels of noise. The results are reported for maps that have been pre-filtered with a voxel grid filter of leaf size 0.05 m.

Each plot has a red dashed horizontal line, whose value represents a particular threshold. The selection is mostly subjective, success/failure classification is not performed based on these thresholds.

Translation error threshold is set to 0.25 m, five times the grid size of the maps. Rotation error threshold is set to 0.1745 rad (10absentsuperscript10\approx 10^{\circ}≈ 10 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT). Execution time threshold marks the 60 s line. Memory threshold is set to be 16 GB, which is the maximum capacity of the Jetson Orin NX computer, which can be attached to any mobile platform.

Refer to caption
Figure 5: Translation errors of the tested algorithms for the InteriorNet data on various noise levels.
Refer to caption
Figure 6: Rotation errors of the tested algorithms for the InteriorNet data on various noise levels.
Refer to caption
Figure 7: Execution time of the tested algorithms for the InteriorNet data on various noise levels.
Refer to caption
Figure 8: Memory usage of the tested algorithms for the InteriorNet data on various noise levels. For the first 5 algorithms, peak resident set size for CPU is reported. For the rest, peak GPU memory usage is reported.

The noiseless case (leftmost panels in each figure) serves as an indicator that the baselines can actually handle the map matching task, especially for the learning-based methods. Consensus, Tomographic-TEASER++, FPFH-TEASER++, DeepGlobalRegistration (DGR), and BUFFER provide accurate results for the majority of the map pairs. Learning-based methods provide better results, with DGR being the most accurate. FPFH-TEASER++ shows a large variance with some failures. GeoTransformer and RoITr are still accurate for some pairs, but they fail to provide correct transform for the majority of the pairs. The failure of ORB-TEASER++ indicates that ORB features are not appropriate for use as a 3D feature descriptor, but their tomographic use (Tomographic-TEASER++) render them useful.

All of the algorithms return with an estimate within 20 seconds. RoITr and Consensus methods take less than a second for the majority, with Consensus showing a larger variance and some outliers that reach up to 3 seconds.

For the memory use, however, the non-learning and non-TEASER++ methods shine. The memory use is limited to approximately 100 MB for Consensus and FPFH-based methods, at least an order of magnitude lower than any other method. The maximum clique search over the correspondence graph for TEASER++ has a large memory footprint for the ORB-TEASER++ implementation.

For the noisy data (middle and right panels in Figures 5-8), all of the learning-based methods fail. DGR provides some correct results, but on average is not reliable. FPFH features lose their discrimination capabilities under the noisy instances, as seen by the decreased accuracy between ideal and noisy instances. On the other hand, the tomographic methods demonstrate a more graceful accuracy degradation. The noise appears to be less impactful on the projected cross-sectional slices, when compared to other 3D descriptors.

The execution time increases with the noise in the input data, with the exception of RoITr. The execution time gap across methods is more pronounced at higher noise levels. Tomographic methods take longer to execute, as the noisy map means there is a larger number of slices across all maps. Consensus method remains approximately an order of magnitude faster in comparison to DGR and GeoTransformer. The other learning-based algorithms pass the 60 second mark. The bottleneck for FPFH-based methods becomes the computation of keypoints and the features and they cross the minute mark as well.

Finally, memory usage for all of the methods increase with the increased noise. Tomographic-TEASER++ reaches the correspondence cap utilizes the same amount of memory as the ORB-TEASER++. FPFH-based methods show limited increase, suggesting that the memory footprint is mostly dominated by the feature computation, rather than the robust pose estimation. The most memory-efficient method remains to be the Consensus, utilizing less than 200 MB on all instances.

Overall, tomographic methods appear to be the the better alternatives. In the ideal scenario, Tomographic-TEASER++ provides the smallest translation error. However, the Consensus method is more accurate in the noisy instances, and has the smallest execution time and memory footprint across all of the data types. The success of FPFH-TEASER++, DGR and BUFFER on noiseless data indicate that the algorithms can provide accurate results. However, the feature quality appears to degrade significantly under noisy data, resulting in quick accuracy degredation.

7 Experimental Studies

7.1 KITTI Odometry Data

To evaluate the performance on a real-life dataset, the LiDAR data from the sequences of KITTI odometry benchmark [67] was used. In total, 11 sequences have GPS ground truth available. Out of the 11, only 5 sequences (00, 02, 05, 06, 07) revisit the previously explored locations and can be used for a map matching scenario. We divide the sequences in half to emulate two agents and perform matching for each agent, providing a total of 10 different instances of map merging. Average overlap was estimated at 43.90% (minimum 11.17%, maximum 83.53%). Figure 9 provides an example pair for the map matching task.

Refer to caption
(a) Map 1 - First half
Refer to caption
(b) Map 2 - Second half
Figure 9: A sample map pair from KITTI dataset (sequence 02). The elevation is color-coded according to the scales on the right.

For this study, two of the learning-based baselines could not be used. RoITr model does not provide any weights for KITTI or hyperparameter set that can be used to train the architecture on KITTI sequences. For GeoTransformer, the data preparation step filled up the RAM of the high-performance computing node (approx. 530 GB) and could not be run.

Two grid sizes were used for the remaining baselines. Figure 10 reports the results for the data where the maps have been pre-filtered with a voxel grid filter of size 0.50 m. Results provided in Figure 11 use data that was pre-filtered to 0.20 m. Horizontal red lines mark the same thresholds as the InteriorNet case: five times the grid size for translation error, 0.1745 rad for rotation error, 60 s for the execution time, and 16 GB for the memory usage.

Due to the large number of correspondences, FPFH-TEASER++ ran out of memory in the mobile computer (64 GB) at higher resolution. As such, the method is omitted for the 0.20 m grid size (Figure 11).

Refer to caption
Figure 10: Aggregated errors, execution time, and the memory usage of the tested algorithms on KITTI odometry dataset, pre-filtered to a grid size of 0.50 m.
Refer to caption
Figure 11: Aggregated errors, execution time, and the memory usage of the tested algorithms on KITTI odometry dataset, pre-filtered to a grid size of 0.20 m.

The only algorithm that is successful for the majority of the pairs is the tomographic methods. Consensus provides high accuracy for both rotation and translation. Tomographic-TEASER++ fails to estimate the translation accurately, although the rotation estimation is mostly succesful. Even though DGR was accurate in the simulated data, it fails to perform in the KITTI study. BUFFER is still able to provide the correct transformation for some of the instances, but overall is inaccurate. FPFH features appear to be not descriptive on the KITTI dataset, which could be due to improper parameter selection.

The execution time is significantly higher for Consensus in comparison to the indoor data. For the higher resolution map, the execution time reaches up to a minute. The primary reason for the losses in efficiency is due to the much larger number of slices across both maps. Specifically for the sequence 02, which is shown in Figure 9, the maps span approximately 50 meters. Slice correlation step is much more time consuming, and it hampers the time efficiency.

In a similar fashion, the memory footprint of the Consensus algorithm is significantly higher, reaching up to 10 GB for some instances of high-resolution map. The footprint remains lower than any other algorithm that performs accurate estimations.

Overall, even though the Consensus algorithm efficiency and memory usage has increased significantly, it is still the best performing option out of the baseline methods.

Note that across the studies, we are only changing the grid size of the maps for the tomographic algorithms. Despite the minimal parameterization, the algorithm performance is not impacted significantly. The appropriate grid size selection is still a factor to be taken into consideration, but the choices here (0.2 m and 0.5 m) is common for LiDAR-based outdoor map** tasks.

7.2 Large-scale Indoor Environment Data

Finally, the algorithms’ performance is evaluated using a generated dataset, map** a large-scale indoor environment. The interior space is an atrium area connecting office spaces, classrooms, and laboratories across two floors. The environment is especially challenging for LiDAR-based distance sensors, as there is a glass construction at the center with opening to the outside. A photograph of the environment is provided in Figure 12(a). A Unitree quadruped robot equipped with RoboSense 3D LiDAR sensor (Figure 12(b)), was used to map the environment in 3 separate trajectories. The horizontal cross-sections of the individual maps overlaid on the architectural plan of the structure is provided in Figure 13. Each trajectory is map** along two corners of the triangle.

Horizontal red lines mark the same thresholds as the InteriorNet case: five times the grid size for translation error, 0.1745 rad for rotation error, 60 s for the execution time, and 16 GB for the memory usage.

Refer to caption
(a) Large indoor environment for the experimental studies
Refer to caption
(b) Unitree B1
Figure 12: Photos of the environment (left) and the robotic platform (right)
Refer to caption
Figure 13: Individual maps overlaid on the architectural plan of the environment used in the experimental study. Each color indicates the area covered by an agent.

To demonstrate the applicability to different forms of map** tools, the robot estimates its odometry using KISS-ICP [68] and generates the map using VDBFusion [69] framework. KISS-ICP estimates relative transformations between two key frames using point-to-point ICP with simple motion compensation, an adaptive correspondence threshold, and a robust kernel to provide a minimally-parameterized solution. VDBFusion represents the environment as a truncated signed distance function (TSDF) to obtain more accurate dense reconstructions, powered by OpenVDB333https://www.openvdb.org/, an open-source, production-level hierarchical data structure library. In this way, we demonstrate backend-agnostic capabilities of our method in the map matching task.

To generate the map, deskewed LiDAR point cloud data is integrated into the VDBFusion map, and the 3D triangle mesh representation of the resultant map was extracted using marching cubes algorithm at a grid size of 0.05 m. Vertices of the mesh are provided as input to all of the maps. A rendering of the the three maps generated as described above are provided in Figure 14.

Refer to caption
(a) Map 1: 2,290,006 points
Refer to caption
(b) Map 2: 2,205,667 points
Refer to caption
(c) Map 3: 1,651,985 points
Figure 14: Pose-accurate renderings and the number of points for each map used in the experimental study. The colors match the overlays from Figure 13.

Overall, three maps and two directions provide a total of 6 map matching tasks. Similar to the KITTI study, Geo-Transformer baseline could not be run due to memory consumption. The results for all of the other methods are provided in Figure 15.

Refer to caption
Figure 15: Aggregated errors, execution time, and the memory usage of the tested algorithms for the experimental study. Horizontal line for the translation error is at 0.25 m.

Only the estimates provided by the proposed Consensus method are close to the true transformations. Tomographic TEASER++ and FPFH-TEASER++ provide accurate orientation for some of the pairings, but overall no other algorithm is able to provide accurate results.

Even for the Consensus algorithm, the translation error is much larger. The algorithm is able to estimate the correct rotation for all but one pairing. A parameter tuning may improve the translation error as well, but the finer registration with ICP is always possible. The algorithm is providing reasonable global transformation estimates.

Execution time is reaching to the minute mark for the proposed method in this framework. The grid size is small enough to generate both large-size slices for feature extraction and a large number of slices to be cross-correlated. However, the memory footprint is not affected, utilizing approximately 500 MB in RAM to carry out the computations.

8 Limitations

Our proposed algorithm has inherent limitations that render it unusable for generic point cloud registration tasks. The three most important ones are the missing degrees of freedom, the inability to deal with non-rigid estimations, and the dependence on the map heights.

Our assumption is that the agents have accurate gravity estimates that they can use to align their maps. IMU sensors are common, but are not entirely accurate. Large accelerations can affect the orientation quality and impact the overall map alignment. Even though the Consensus method was demonstrated to be resilient to noise, there is no way to recover the roll and pitch if the origin (the initial pose) of the agent is misaligned.

Another limitation is that the output is a single rigid transformation. In practice, large-scale maps tend to drift even if robust map** methods are used. As such, a single rigid transformation is generally not the most accurate answer. A deformable hypothesis operating on the pose graph [19] would me more appropriate, and there is no way to adapt this method directly.

Finally, the complexity of the algorithm depends mostly on the number of slices for each map. Per-slice transform estimations are fast, but the correlation step requires many hypotheses to be tested at once. The higher the number of slices, the heavier the computations. Hardware-accelerated parallelization could sidestep the computational bottlenecks, but the dependence on the map heights will not change.

Furthermore, the comparative studies, performed against the baselines has certain limits. These limitations can be grouped into 3 categories: parameter tuning for the tasks, disparity in the execution paradigms, and potentially suboptimal or inaccurate adaptations.

The weights used for learning-based models are taken directly from the best parameters as provided by the original authors. For the simulated study, this implies training on 3DMatch [35] dataset, which contains 3D scans of small-scale scenes. Even though the grid size parameter is appropriate, we do not account for the simulation-to-real domain gap, and the scale of the problem is not the same. The results on the ideal data (Figures 5-8, left-most panels) indicates that some of the algorithms can still perform, but a much thorough comparison would require to train the models on data that is more representative for the type and the scale of the map matching task.

For the FPFH-based methods, an extensive search on the key parameters, mainly the radii to estimate normals, keypoints, and features, was not performed. In a similar fashion to the tomographic methods, the radii was scaled based on the grid size, but a parameter study could yield a more optimal answer. The only parameter that was changed across different datasets is the grid size of the maps, which is known ahead of the time. The proposed algorithm is able to perform accurately with minimal parameterization. As such, the advantage of tomographic methods remain.

The efficiency is compared using the execution time and memory footprints. An ideal scenario would require the programming paradigms and the hardware platforms to be the same. However, learning-based methods are implemented primarily on Python, whereas we implemented our method and the remaining baselines (Consensus, ORB-family and FPFH-family of algorithms) in C++. Even though the Python bindings for the learning-based algorithm backends are compiled as well, the comparison is not entirely applicable. Furthermore, Python memory management is different than C++, introducing additional sources of issue.

On the other hand, the metrics were calculated for learning-based algorithms on a high-performance computer node that is equipped with a powerful graphics card. While we attempted to provide the best possible comparison, this limitation should be explicitly stated.

Finally, not all of the learning-based baselines had an existing implementation to perform registration on arbitrary datasets. The implementations from BUFFER and RoITr were mimicked from the source code to obtain the transformations.

9 Conclusions

Tomographic features were proposed to address large-scale map matching task in gravity-aligned 3D maps. The proposed minimally-parameterized approach was both accurate and efficient in memory and computation power. Furthermore, the tomographic approach was found to be more resilient to measurement noise in sensors in comparison to other methods utilizing 3D feature descriptors, learning-based or otherwise. The findings are corroborated on real datasets that map volumes of different scales, underscoring the algorithmic efficiency and accuracy.

The proposed algorithm cannot provide a non-rigid estimate to deal with map** inaccuracies, and is not able to estimate the roll and pitch. Nevertheless, for large and dense maps with bounded extents, such as the case in many graph-based SLAM algorithms, the tomographic approach can provide fast and robust transformations and requires investigations.

References

  • Tzes et al. [2018] M. Tzes, S. Papatheodorou, A. Tzes, Visual Area Coverage by Heterogeneous Aerial Agents Under Imprecise Localization, IEEE Control Systems Letters 2 (2018) 623–628.
  • Lajoie et al. [2022] P.-Y. Lajoie, B. Ramtoula, F. Wu, G. Beltrame, Towards Collaborative Simultaneous Localization and Map**: a Survey of the Current Research Landscape, Field Robotics 2 (2022) 971–1000.
  • Kazerouni et al. [2022] I. A. Kazerouni, L. Fitzgerald, G. Dooly, D. Toal, A survey of state-of-the-art on visual SLAM, Expert Systems with Applications 205 (2022) 117734.
  • Huang et al. [2023] J. Huang, S. Junginger, H. Liu, K. Thurow, Indoor Positioning Systems of Mobile Robots: A Review, Robotics 12 (2023) 47.
  • Carpin et al. [2005] S. Carpin, A. Birk, V. Jucikas, On map merging, Robotics and Autonomous Systems 53 (2005) 1–14.
  • Saeedi et al. [2014] S. Saeedi, L. Paull, M. Trentini, M. Seto, H. Li, Map merging for multiple robots using Hough peak matching, Robotics and Autonomous Systems 62 (2014) 1408–1424.
  • Ding and Feng [2019] L. Ding, C. Feng, DeepMap**: Unsupervised map estimation from multiple point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8650–8659.
  • Andersone [2019] I. Andersone, Heterogeneous Map Merging: State of the Art, Robotics 8 (2019) 74.
  • Unlu et al. [2023] H. U. Unlu, A. Tzes, P. Krishnamurthy, F. Khorrami, Distributed 3D-Map Matching and Merging on Resource-Limited Platforms Using Tomographic Features, in: 2023 European Conference on Mobile Robots (ECMR), IEEE, 2023, pp. 1–6.
  • Yu et al. [2020] S. Yu, C. Fu, A. K. Gostar, M. Hu, A Review on Map-Merging Methods for Typical Map Types in Multiple-Ground-Robot SLAM Solutions, Sensors 20 (2020) 6988.
  • Erinc and Carpin [2014] G. Erinc, S. Carpin, Anytime merging of appearance-based maps, Autonomous Robots 36 (2014) 241–256.
  • Sun et al. [2023] Q. Sun, T. Liao, H. Du, Y. Zhao, C.-C. Chen, A Method of Merging Maps for MUAVs Based on an Improved Genetic Algorithm, Sensors 23 (2023) 447.
  • Aragues et al. [2011] R. Aragues, J. Cortes, C. Sagues, Distributed consensus algorithms for merging feature-based maps with limited communication, Robotics and Autonomous Systems 59 (2011) 163–180.
  • Blanco et al. [2013] J.-L. Blanco, J. González-Jiménez, J.-A. Fernández-Madrigal, A robust, multi-hypothesis approach to matching occupancy grid maps, Robotica 31 (2013) 687–701.
  • Lee [2020] H. Lee, Tomographic Feature-Based Map Merging for Multi-Robot Systems, Electronics 9 (2020) 107.
  • Chen et al. [2021] B. Chen, S. Li, H. Zhao, L. Liu, Map Merging with Suppositional Box for Multi-Robot Indoor Map**, Electronics 10 (2021) 815.
  • Wang et al. [2023] M. Wang, M. Cong, Y. Du, D. Liu, X. Tian, Multi-robot raster map fusion without initial relative position, Robotic Intelligence and Automation 43 (2023) 498–508.
  • Xie et al. [2020] Y. Xie, Y. Tang, R. Zhou, Y. Guo, H. Shi, Map merging with terrain-adaptive density using mobile 3D laser scanner, Robotics and Autonomous Systems 134 (2020) 103649.
  • Bonanni et al. [2017] T. M. Bonanni, B. Della Corte, G. Grisetti, 3-D Map Merging on Pose Graphs, IEEE Robotics and Automation Letters 2 (2017) 1031–1038.
  • Mohanarajah et al. [2015] G. Mohanarajah, V. Usenko, M. Singh, R. D’Andrea, M. Waibel, Cloud-Based Collaborative 3D Map** in Real-Time With Low-Cost Robots, IEEE Transactions on Automation Science and Engineering 12 (2015) 423–431.
  • Yue et al. [2018] Y. Yue, P. N. Senarathne, C. Yang, J. Zhang, M. Wen, D. Wang, Hierarchical Probabilistic Fusion Framework for Matching and Merging of 3-D Occupancy Maps, IEEE Sensors Journal 18 (2018) 8933–8949.
  • Basso et al. [2023] M. Basso, D. Stocchero, P. Schnarndorf, E. P. de Freitas, Merging three-dimensional occupancy grid maps to support multi-uavs cooperative navigation systems, Journal of Field Robotics 40 (2023) 483–504.
  • Yin et al. [2023] P. Yin, S. Zhao, H. Lai, R. Ge, J. Zhang, H. Choset, S. Scherer, AutoMerge: A Framework for Map Assembling and Smoothing in City-Scale Environments, IEEE Transactions on Robotics (2023).
  • Stathoulopoulos et al. [2023] N. Stathoulopoulos, A. Koval, A.-a. Agha-mohammadi, G. Nikolakopoulos, FRAME: Fast and Robust Autonomous 3D Point Cloud Map-merging for Egocentric Multi-Robot Exploration, in: 2023 IEEE international conference on robotics and automation (ICRA), IEEE, 2023, pp. 3483–3489.
  • Lajoie et al. [2022] P.-Y. Lajoie, B. Ramtoula, F. Wu, G. Beltrame, Towards Collaborative Simultaneous Localization and Map**: a Survey of the Current Research Landscape, Field Robotics 2 (2022) 971–1000.
  • Schmuck and Chli [2019] P. Schmuck, M. Chli, CCM-SLAM: Robust and efficient centralized collaborative monocular simultaneous localization and map** for robotic teams, Journal of Field Robotics 36 (2019) 763–781.
  • Chang et al. [2022] Y. Chang, K. Ebadi, C. Denniston, M. F. Ginting, A. Rosinol, A. Reinke, M. Palieri, J. Shi, A. Chatterjee, B. Morrell, A. akbar Agha-mohammadi, L. Carlone, Lamp 2.0: A robust multi-robot slam system for operation in challenging large-scale underground environments, IEEE Robotics and Automation Letters 7 (2022) 9175–9182.
  • Lajoie et al. [2020] P.-Y. Lajoie, B. Ramtoula, Y. Chang, L. Carlone, G. Beltrame, DOOR-SLAM: Distributed, online, and outlier resilient SLAM for robotic teams, IEEE Robotics and Automation Letters 5 (2020) 1656–1663.
  • Tian et al. [2022] Y. Tian, Y. Chang, F. H. Arias, C. Nieto-Granda, J. P. How, L. Carlone, Kimera-multi: Robust, distributed, dense metric-semantic slam for multi-robot systems, IEEE Transactions on Robotics 38 (2022).
  • Lajoie and Beltrame [2023] P.-Y. Lajoie, G. Beltrame, Swarm-SLAM: Sparse Decentralized Collaborative Simultaneous Localization and Map** Framework for Multi-Robot Systems, IEEE Robotics and Automation Letters 9 (2023) 475–482.
  • Rusu et al. [2009] R. B. Rusu, N. Blodow, M. Beetz, Fast Point Feature Histograms (FPFH) for 3D Registration, in: 2009 IEEE international conference on robotics and automation, IEEE, 2009, pp. 3212–3217.
  • Guo et al. [2016] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, J. Wan, N. M. Kwok, A Comprehensive Performance Evaluation of 3D Local Feature Descriptors, International Journal of Computer Vision 116 (2016) 66–89.
  • Trajković and Hedley [1998] M. Trajković, M. Hedley, Fast corner detection, Image and vision computing 16 (1998) 75–87.
  • Wu et al. [2015] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, J. Xiao, 3D ShapeNets: A Deep Representation for Volumetric Shapes, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1912–1920.
  • Zeng et al. [2016] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, T. A. Funkhouser, 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 199–208.
  • Qi et al. [2017] C. R. Qi, H. Su, K. Mo, L. J. Guibas, PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
  • Deng et al. [2018] H. Deng, T. Birdal, S. Ilic, PPFNet: Global Context Aware Local Features for Robust 3D Point Matching, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, 2018, pp. 195–205.
  • Hua et al. [2018] B.-S. Hua, M.-K. Tran, S.-K. Yeung, Pointwise Convolutional Neural Networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 984–993.
  • Atzmon et al. [2018] M. Atzmon, H. Maron, Y. Lipman, Point convolutional neural networks by extension operators, ACM Transactions on Graphics 37 (2018) 1–12.
  • Gojcic et al. [2019] Z. Gojcic, C. Zhou, J. D. Wegner, A. Wieser, The Perfect Match: 3D Point Cloud Matching With Smoothed Densities, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5545–5554.
  • Bai et al. [2020] X. Bai, Z. Luo, L. Zhou, H. Fu, L. Quan, C.-L. Tai, D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6359–6367.
  • Floris et al. [2024] A. Floris, L. Frittoli, D. Carrera, G. Boracchi, Composite convolution: A flexible operator for deep learning on 3d point clouds, Pattern Recognition (2024) 110557.
  • Zhang et al. [2020] Z. Zhang, Y. Dai, J. Sun, Deep learning based point cloud registration: an overview, Virtual Reality & Intelligent Hardware 2 (2020) 222–246.
  • Thomas et al. [2019] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, L. Guibas, KPConv: Flexible and Deformable Convolution for Point Clouds, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 2019, pp. 6411–6420.
  • Choy et al. [2019] C. Choy, J. Park, V. Koltun, Fully Convolutional Geometric Features, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 2019, pp. 8958–8966.
  • Bustos and Chin [2017] A. P. Bustos, T.-J. Chin, Guaranteed Outlier Removal for Point Cloud Registration with Correspondences, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2017) 2868–2882.
  • Huang et al. [2021] X. Huang, G. Mei, J. Zhang, R. Abbas, A comprehensive survey on point cloud registration, arXiv preprint arXiv:2103.02690 (2021).
  • Huang et al. [2023] X. Huang, G. Mei, J. Zhang, Cross-source point cloud registration: Challenges, progress and prospects, Neurocomputing (2023) 126383.
  • Fischler and Bolles [1981] M. A. Fischler, R. C. Bolles, Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Communications of the ACM 24 (1981) 381–395.
  • Li et al. [2023] J. Li, P. Shi, Q. Hu, Y. Zhang, Qgore: Quadratic-time guaranteed outlier removal for point cloud registration, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
  • Wei et al. [2023] T. Wei, Y. Patel, A. Shekhovtsov, J. Matas, D. Barath, Generalized differentiable ransac, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17649–17660.
  • Yang et al. [2020] H. Yang, J. Shi, L. Carlone, TEASER: Fast and Certifiable Point Cloud Registration, IEEE Transactions on Robotics 37 (2020) 314–333.
  • Choy et al. [2020] C. Choy, W. Dong, V. Koltun, Deep Global Registration, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2020, pp. 2514–2523.
  • Bai et al. [2021] X. Bai, Z. Luo, L. Zhou, H. Chen, L. Li, Z. Hu, H. Fu, C.-L. Tai, Pointdsc: Robust point cloud registration using deep spatial consistency, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2021, pp. 15859–15869.
  • Huang et al. [2021] S. Huang, Z. Gojcic, M. Usvyatsov, A. Wieser, K. Schindler, PREDATOR: Registration of 3D Point Clouds with Low Overlap, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2021, pp. 4267–4276.
  • Qin et al. [2023] Z. Qin, H. Yu, C. Wang, Y. Guo, Y. Peng, S. Ilic, D. Hu, K. Xu, GeoTransformer: Fast and Robust Point Cloud Registration With Geometric Transformer, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
  • Yu et al. [2023] H. Yu, Z. Qin, J. Hou, M. Saleh, D. Li, B. Busam, S. Ilic, Rotation-Invariant Transformer for Point Cloud Matching, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 5384–5393.
  • Ao et al. [2023] S. Ao, Q. Hu, H. Wang, K. Xu, Y. Guo, BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1255–1264.
  • Bustos and Chin [2017] A. P. Bustos, T.-J. Chin, Guaranteed Outlier Removal for Point Cloud Registration with Correspondences, IEEE transactions on pattern analysis and machine intelligence 40 (2017) 2868–2882.
  • Rublee et al. [2011] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, ORB: An efficient alternative to SIFT or SURF, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 2564–2571.
  • Calonder et al. [2010] M. Calonder, V. Lepetit, C. Strecha, P. Fua, BRIEF: Binary robust independent elementary features, in: European conference on computer vision, Springer, 2010, pp. 778–792.
  • Ma et al. [2021] J. Ma, X. Jiang, A. Fan, J. Jiang, J. Yan, Image matching from handcrafted to deep features: A survey, International Journal of Computer Vision 129 (2021) 23–79.
  • Tareen and Saleem [2018] S. A. K. Tareen, Z. Saleem, A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK, in: 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), IEEE, 2018, pp. 1–10.
  • Hörner [2018] J. Hörner, Automatic Point Clouds Merging, Master’s thesis, Charles University, Department of Theoretical Computer Science and Mathematical Logic, 2018. URL: https://dspace.cuni.cz/bitstream/handle/20.500.11956/101930/120308521.pdf, accessed: 2024-06-04.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is All you Need, Advances in neural information processing systems 30 (2017).
  • Li et al. [2018] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, S. Leutenegger, InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset, arXiv preprint arXiv:1809.00716 (2018).
  • Geiger et al. [2012] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 3354–3361.
  • Vizzo et al. [2023] I. Vizzo, T. Guadagnino, B. Mersch, L. Wiesmann, J. Behley, C. Stachniss, KISS-ICP: In Defense of Point-to-Point ICP – Simple, Accurate, and Robust Registration If Done the Right Way, IEEE Robotics and Automation Letters 8 (2023) 1029–1036.
  • Vizzo et al. [2022] I. Vizzo, T. Guadagnino, J. Behley, C. Stachniss, VDBFusion: Flexible and Efficient TSDF Integration of Range Sensor Data, Sensors 22 (2022) 1296.

Author Contributions

Co-author contributions follow the CRediT taxonomy: Halil Utku Unlu: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review and editing, Visualization. Anthony Tzes: Conceptualization, Methodology, Resources, Writing - review and editing, Supervision. Prashanth Krishnamurthy: Conceptualization, Methodology, Writing - review and editing. Farshad Khorrami: Conceptualization, Methodology, Writing - review and editing, Supervision.

Funding

This work was partially supported by the NYUAD Center for Artificial Intelligence and Robotics (CAIR), funded by Tamkeen under the NYUAD Research Institute Award CG010. The funding sources were not involved in study design, in the collection, analysis and interpretation of data, in the writing of the report, and in the decision to submit the article for publication.