License: arXiv.org perpetual non-exclusive license
arXiv:2403.11367v1 [cs.CV] 17 Mar 2024

3DGS-ReLoc: 3D Gaussian Splatting for Map Representation and Visual ReLocalization

Peng Jiang$^{1}$, Gaurav Pandey$^{2}$ and Srikanth Saripalli$^{1}$

$^{1}$Peng Jiang and Srikanth Saripalli are with J. Mike Walker ’66 Department of Mechanical Engineering, Texas A&M University, College Station, TX-77843, USA maskjp,[email protected]

$^{2}$Gaurav Pandey is with The Department of Engineering Technology and Industrial Distribution, Texas A&M University, College Station, TX-77843, USA [email protected]

Abstract

This paper presents a novel system designed for 3D mapping and visual relocalization using 3D Gaussian Splatting. Our proposed method uses LiDAR and camera data to create accurate and visually plausible representations of the environment. By leveraging LiDAR data to initiate the training of the 3D Gaussian Splatting map, our system constructs maps that are both detailed and geometrically accurate. To mitigate excessive GPU memory usage and facilitate rapid spatial queries, we employ a combination of a 2D voxel map and a KD-tree. This preparation makes our method well-suited for visual localization tasks, enabling efficient identification of correspondences between the query image and the rendered image from the Gaussian Splatting map via normalized cross-correlation (NCC). Additionally, we refine the camera pose of the query image using feature-based matching and the Perspective-n-Point (PnP) technique. The effectiveness, adaptability, and precision of our system are demonstrated through extensive evaluation on the KITTI360 dataset.

I Introduction

The rapid evolution of autonomous driving and robotic navigation technologies has underscored the critical importance of advanced scene reconstruction methodologies. These technologies rely heavily on the integration of data from diverse sensor modalities to create detailed and accurate representations of the environment. Among various sensor fusion techniques, the combination of LiDAR and camera data is particularly noteworthy. This fusion harnesses LiDAR’s precise depth sensing capabilities alongside the rich visual details captured by cameras, a synergy crucial for achieving the level of environmental understanding necessary for autonomous systems to navigate safely and efficiently. However, the challenge lies in harmonizing these different types of data into a unified, detailed, and geometrically accurate representation of the scene, a task that is both complex and essential for traversing intricate urban landscapes.

This paper introduces 3DGS-ReLoc, a novel system tailored for visual relocalization in autonomous navigation, employing 3D Gaussian Splatting (3DGS) as its primary map representation technique[1]. Utilizing LiDAR data, our method initiates the training of the 3D Gaussian Splatting representation, enabling the generation of large-scale, geometry-accurate maps. This initial training with LiDAR significantly improves our system’s ability to create detailed and precise environmental models, which is essential for advanced perception systems in autonomous vehicles. Moreover, to address the high GPU memory consumption challenge, we adopt a strategy of dividing 3D Gaussian Splatting maps into 2D voxels and utilizing a KD-tree for efficient spatial querying.

3D Gaussian Splatting representation can generate high-fidelity images and depth data in association with known camera poses within the map’s coordinates. This capability simplifies our method by facilitating the straightforward identification of correspondences between the query image and the Gaussian Splatting map through image feature detection and matching techniques. Additionally, by leveraging the depth information and its corresponding camera pose, we can accurately determine the camera pose of the query image. Implementing 3D Gaussian Splatting for visual relocalization not only showcases the adaptability of our method but also effectively tackles the complexities involved in fusing sensor data, contributing to the development of more precise and efficient scene representation techniques.

We conducted an extensive evaluation of our methodology with the KITTI360 dataset [2]. This dataset was chosen for its comprehensive annotations, which aid in creating accurate maps in diverse urban landscapes. Our results highlight our system’s effectiveness, versatility, and precision. Specifically, we showcase the utility of 3D Gaussian Splatting for scene representation in visual relocalization tasks.

II Related Work

II-A Map Representation

Maps are crucial for robot navigation and autonomous driving, with traditional representations including voxel grids, point clouds, and meshes, as highlighted in recent literature [3]. The advent of neural rendering techniques has introduced a new avenue for constructing maps with high fidelity. These models capture and depict 3D scenes by utilizing images and corresponding poses for guidance. This approach enables synthesizing high-fidelity images from novel views of the scene. Among these, Neural Radiance Fields (NeRF) [4] has gained prominence. It encodes the radiance fields of complex 3-D scenes into the weights of multilayer perceptrons (MLPs), demonstrating exceptional realism in rendering 3-D environments through volume rendering under 2-D supervision. This innovation has significantly contributed to the development of mapping systems and the enhancement of SLAM (Simultaneous Localization and Mapping) systems, including iMAP [5] and NICE-SLAM [6]. iMAP, for instance, employs an MLP for real-time scene representation within a SLAM framework, while NICE-SLAM introduces a dense, efficient, and robust SLAM approach by integrating multilevel local scene information and optimizing with geometric priors for better detail in large indoor scenes.

However, the scene-specific nature of networks trained with NeRF, where each 3-D scene’s representation is encoded in an MLP’s weights, restricts their generalizability across different environments. Furthermore, the computational intensity of NeRF-based methods results in slow rendering times. 3D Gaussian Splatting [1] has emerged as a viable alternative, providing an explicit representation more in line with traditional mapping approaches and enabling easier integration of conventional methods with minimal adjustments. This approach not only accelerates training times but also maintains high-quality visuals akin to NeRF. Recent efforts to apply 3D Gaussian Splatting to SLAM [7, 8, 9, 10] have shown promise. SplaTAM [10] represents an innovative application of 3D Gaussian Splatting in SLAM, offering dense SLAM capabilities with monocular RGB-D cameras and enabling online camera pose tracking through a single 3D Gaussian Splatting map. Building on these efforts, [9] introduced SGS-SLAM, which incorporates 3D semantic segmentation into the GS-SLAM system. This method uses multi-channel optimization during mapping to combine appearance, geometric, and semantic constraints with key-frame optimization, enhancing the quality of reconstruction. Despite these advancements, the focus of research remains predominantly on indoor scenes of limited size, utilizing RGB-D cameras to generate dense point clouds. Several studies have been conducted on outdoor large-scale 3D Gaussian Splatting reconstruction [11, 12, 13, 14]. However, these studies primarily focus on generating high-quality images [11] or handling dynamic scenarios in street data [14, 13] for simulation purposes, and do not explore their potential for map representation and relocalization.

II-B Visual Relocalization

Visual relocalization aims to estimate a camera’s position and orientation from a single query image. Approaches to visual relocalization vary, including feature-based methods, scene coordinate regression, pose regression, and direct image alignment. DSAC [15] exemplifies the scene coordinate regression method, circumventing the need for explicit 3D map** by mastering a pixel-to-point transformation through differentiable RANSAC for seamless end-to-end learning. In the realm of pose regression methods, the notable work by Laskar et al. [16] stands out. They leverage Convolutional Neural Networks (CNNs) to identify similar images within a database and calculate relative poses, employing RANSAC to enhance accuracy. Meanwhile, PixLoc [17] serves as a prime example of direct image alignment, utilizing deep multiscale features. PixLoc redefines localization as a metric learning challenge, facilitating comprehensive end-to-end training.

Despite the variety of methods, 2D-3D feature-based approaches remain predominant. These approaches aim to estimate a camera’s position and orientation (pose) from a 2D image within a previously mapped 3D scene. The construction of these 3D models typically involves Structure-from-Motion (SfM) with color images [18], Truncated Signed-Distance Function (TSDF) from range images [19], or LiDAR-based mapping techniques [20]. They compute the camera pose by matching 2D-3D correspondences through local feature descriptors. Since these descriptors often depend on the original imaging angle, research has focused on creating viewpoint-independent features [21] or learning across different modalities, such as with P2-Net’s unified descriptor for pixel-point matching [22]. Contrastive learning has been explored to bridge the gap between camera images and LiDAR point clouds [23]. Additionally, approaches like that of Wolcott et al. [20] propose localizing a camera within a 3D LiDAR-generated prior ground map by maximizing normalized mutual information between real camera measurements and generated synthetic LiDAR intensity images. Compared to traditional map representations, the 3D Gaussian Splatting representation has a more direct linkage to images, as it enables the rendering of new images and depth maps from novel viewpoints. This capability mitigates the challenges associated with view dependence, enhancing our ability to manage perspective-related difficulties more effectively.

Figure 1: Pipeline of 3D Gaussian Splatting for Map Representation and Visual ReLocalization: The process starts by creating a colorized point cloud map from LiDAR scans, images, and poses. This map serves as the initialization for the 3D Gaussian Splatting (3DGS) map, which is incrementally trained on submaps. The 3DGS map is stored as a 2D voxel map, with a KD-tree enabling rapid spatial queries. For relocalization, a submap proximate to the query image’s coarse pose is selected to render a series of images and depths. The query image is then subjected to a brute-force search against this image sequence to find the closest rendered image and depth. Subsequently, feature-based matching and the Perspective-n-Point (PnP) method are employed to iteratively refine the query image’s pose, achieving precise localization within the global map.

III Method

This section will first revisit the concept of 3D Gaussian Splatting. Then, we will detail our system, which consists of two main components: the 3D Gaussian Splatting (3DGS) Map Representation and visual relocalization using the 3DGS Map. The complete system is illustrated in Fig.1.

III-A Revisit 3D Gaussian Splatting

The 3D Gaussian Splatting (3DGS) [1] is a rasterization technique designed for real-time rendering of photorealistic scenes using a group of 3D Gaussians for modeling. The original approach unfolds in three steps: a) employing the Structure from Motion (SfM) technique to estimate the poses of a collection of images from the same scene and a sparse point cloud of the scene; b) transforming each point in the cloud into a 3D Gaussian; c) applying Stochastic Gradient Descent (SGD) to refine the Gaussians, allowing for adaptive densification and pruning of the Gaussians based on the gradients and predefined criteria. The following parameters characterize each Gaussian in the model:

  • Center of the Gaussian $\mu_i = [x_1, x_2, x_3] \in \mathbb{R}^3$ (usually initialized using the sparse point cloud from SfM);

  • Covariance matrix of the Gaussian $\Sigma_i = R_i S_i S_i^\top R_i^\top$ $^{1}$, comprised of a scaling matrix $S_i = \operatorname{diag}([s_x, s_y, s_z])$ and a rotation matrix $R_i = \mathrm{q2R}([r_w, r_x, r_y, r_z])$, with $\mathrm{q2R}$ converting a quaternion to a rotation matrix;

  • RGB color $c_i \in \mathbb{R}^3$ or spherical harmonics (SH) coefficients $c_i \in \mathbb{R}^k$, facilitating view-dependent colors with $k$ representing the degrees of freedom;

  • Opacity $o_i \in \mathbb{R}$.

$^{1}$Covariance matrices are physically meaningful only when they are positive semi-definite. Constraining gradient descent to consistently produce such matrices is challenging, as update steps and gradients may unintentionally generate invalid matrices. As a resolution, the foundational paper [1] advocated for an alternate yet equally expressive optimization representation. Here, the covariance matrix $\Sigma_i$ analogously describes an ellipsoid’s configuration, decomposed into scale and rotation matrices.

Accordingly, a 3D Gaussian is defined as $g_i = [\mu_i, S_i, R_i, c_i, o_i]$ and a full 3DGS map is the set of Gaussians $G = \{g_0, \ldots, g_N\}$.
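The scale-and-rotation parameterization of the covariance can be illustrated with a minimal NumPy sketch. The function names `quat_to_rot` and `covariance` and the example values are assumptions made for illustration and are not taken from the authors' implementation. The point of the sketch is that any scale vector and any non-zero quaternion yield a symmetric, positive semi-definite $\Sigma_i$ whose eigenvalues are the squared scales.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a quaternion [r_w, r_x, r_y, r_z] to a 3x3 rotation matrix (q2R)."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)  # normalize first
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(scale, quat):
    """Sigma = R S S^T R^T with S = diag(scale)."""
    R = quat_to_rot(quat)
    S = np.diag(np.asarray(scale, dtype=float))
    return R @ S @ S.T @ R.T

# Hypothetical example values, chosen only for illustration.
sigma = covariance([0.5, 0.2, 0.1], [0.9, 0.1, 0.3, 0.2])
eigenvalues = np.linalg.eigvalsh(sigma)  # ascending; equal to the squared scales
```

Because the optimizer only ever updates the scale vector and the quaternion, no projection back onto the set of valid covariance matrices is needed.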

To render an image for a camera characterized by the intrinsic matrix $K$ and pose $W_t$ (world-to-camera transformation), the Gaussians are first transformed into camera coordinates. They are then sorted by depth and rendered in a front-to-back sequence using Max’s volume rendering formula [24]:

$$C(\hat{x}) = \sum_{i \in \mathcal{S}} c_i\, q_i(\hat{x}) \prod_{j=1}^{i-1} \left(1 - q_j(\hat{x})\right) \tag{1}$$

Here, the final rendered color $C(\hat{x})$ at the camera projection plane for pixel $\hat{x}$ is the weighted sum of each Gaussian’s color $c_i$. The weight is calculated using the footprint function $q_i$ derived from the Gaussian kernel [25] (see Eq. 2), and is modulated by an occlusion (transmittance) term that accounts for all Gaussians preceding the current one.

$$q_i(\hat{x}) = o_i \frac{1}{\left|J^{-1}\right| \left|W^{-1}\right|} G_{\hat{\Sigma}_i^c}\left(\hat{x} - \hat{\mu}_i\right) \tag{2}$$

where $G_{\hat{\Sigma}_i^c}$ is a Gaussian function with covariance matrix $\hat{\Sigma}_i^c$, a $2 \times 2$ matrix obtained by excluding the last row and column from the matrix computed using Eq. 3, and $\hat{\mu} = [x_1, x_2]$ is the first two values of the mean $\mu$ of this Gaussian.

$$\Sigma_i^c = J W \Sigma_i W^\top J^\top \tag{3}$$

where $J = \partial m(\mu) / \partial \mu$ is the Jacobian of the projection formula in Eq. 4:

$$m(\mu) = K \left( \frac{W \mu}{(W \mu)_z} \right) \tag{4}$$

For a comprehensive derivation of the footprint function, readers are directed to [25].
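As an illustration of Eqs. 3 and 4, the following sketch projects a single Gaussian into the image. It assumes that $J$ is evaluated with respect to the camera-frame point, as in the EWA formulation [25], and it builds the $2 \times 3$ Jacobian of the pixel coordinates directly, which yields the $2 \times 2$ covariance without forming and truncating a $3 \times 3$ matrix. The function name `project_gaussian`, the split of $W$ into `R_wc` and `t_wc`, and the numeric values are all hypothetical.

```python
import numpy as np

def project_gaussian(mu_world, sigma_world, K, R_wc, t_wc):
    """Project a 3D Gaussian to the image plane.

    R_wc, t_wc: rotation and translation of the world-to-camera transform W.
    Returns the 2D mean in pixels and the 2x2 image-space covariance.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x, y, z = R_wc @ mu_world + t_wc                         # mean in camera coordinates
    mean_2d = np.array([fx * x / z + cx, fy * y / z + cy])   # Eq. 4
    # Jacobian of the perspective projection w.r.t. the camera-frame point (assumption).
    J = np.array([
        [fx / z, 0.0, -fx * x / z**2],
        [0.0, fy / z, -fy * y / z**2],
    ])
    cov_2d = J @ R_wc @ sigma_world @ R_wc.T @ J.T           # Eq. 3, already 2x2
    return mean_2d, cov_2d

# Hypothetical camera and an isotropic Gaussian 10 m in front of it.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
mean_2d, cov_2d = project_gaussian(
    np.array([0.0, 0.0, 10.0]), 0.04 * np.eye(3), K, np.eye(3), np.zeros(3))
```

For this on-axis example the footprint is isotropic, with variance $(f \cdot s / z)^2 = (500 \cdot 0.2 / 10)^2 = 100$ square pixels.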

For rendering depth, we can simply replace the color $c_i$ with the depth $z_i = x_3$ of the Gaussian transformed into camera coordinates:

$$D(\hat{x}) = \sum_{i \in \mathcal{S}} z_i\, q_i(\hat{x}) \prod_{j=1}^{i-1} \left(1 - q_j(\hat{x})\right) \tag{5}$$
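Eqs. 1 and 5 share the same blending weights, which a short sketch for a single pixel makes explicit. It assumes the footprint values $q_i(\hat{x})$ have already been evaluated and sorted from near to far. The function name `composite` and the example numbers are illustrative only.

```python
import numpy as np

def composite(values, q):
    """Front-to-back blend of per-Gaussian values at one pixel.

    values: (N,) or (N, C) array, sorted near-to-far (colors or depths).
    q:      (N,) footprint weights q_i(x_hat) in [0, 1], same order.
    """
    values = np.asarray(values, dtype=float)
    q = np.asarray(q, dtype=float)
    # Transmittance before Gaussian i: prod_{j<i} (1 - q_j); it is 1 for the first one.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - q)[:-1]))
    weights = q * transmittance
    return np.tensordot(weights, values, axes=1)

# Two overlapping Gaussians: a red one at 2 m in front of a blue one at 4 m.
q = [0.5, 0.5]
pixel_color = composite([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], q)  # Eq. 1
pixel_depth = composite([2.0, 4.0], q)                          # Eq. 5
```

Here the weights are 0.5 and 0.25, so the pixel color is (0.5, 0, 0.25) and the depth is 2.0. As in Eq. 5, the depth is not normalized by the accumulated opacity.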

III-B 3D Gaussian Splatting Map Representation

III-B1 3D Map Construction and Initialization

Contrary to the original method, which begins with Structure from Motion (SfM) as outlined in [1], our approach initializes the 3D Gaussians from the 3D map generated from LiDAR point cloud data and the corresponding images. This ensures that our foundational representation possesses accurate geometric information. However, the depth information derived from the LiDAR point cloud is sparse, which presents challenges for subsequent visual localization tasks because keypoints detected at this stage may lack corresponding LiDAR depth. Nevertheless, by leveraging the densification scheme of the 3D Gaussian Splatting method [1], our method can increase the number of underlying Gaussians during the training process. This allows dense depth to be rendered from various viewpoints using the 3DGS representation, which in turn provides precise depth information for our visual localization tasks.
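One way to obtain the colorized point cloud used for initialization is to project each LiDAR point into an image taken at a known pose and read off the pixel color. The sketch below is a simplified illustration under stated assumptions: it uses a pinhole model, ignores occlusion, and uses nearest-pixel lookup. The function name `colorize_points` and all values are hypothetical, and the paper does not specify how its colorization is implemented.

```python
import numpy as np

def colorize_points(points_world, image, K, R_wc, t_wc):
    """Assign each LiDAR point the color of the pixel it projects to.

    points_world: (N, 3) points in map coordinates.
    image:        (H, W, 3) RGB image taken at the given pose.
    Returns the points that land inside the image and their colors.
    """
    h, w = image.shape[:2]
    pts_cam = points_world @ R_wc.T + t_wc          # world -> camera
    in_front = pts_cam[:, 2] > 1e-6                 # drop points behind the camera
    pts_cam, pts_kept = pts_cam[in_front], points_world[in_front]
    uv = pts_cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]                     # perspective division
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    return pts_kept[inside], image[rows[inside], cols[inside]]

# Tiny synthetic example: one visible point, one behind the camera, one off-image.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[2, 3] = [10, 20, 30]
K = np.array([[2.0, 0.0, 3.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 5.0], [0.0, 0.0, -1.0], [100.0, 0.0, 1.0]])
kept_points, kept_colors = colorize_points(points, image, K, np.eye(3), np.zeros(3))
```

The resulting positions and colors would serve as the initial $\mu_i$ and $c_i$ of the Gaussians; the initial scales, rotations, and opacities are not covered by this sketch.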

III-B2 Map Storage and Management

The 3D Gaussian Splatting method is known for its high GPU memory consumption, making the representation of large outdoor scenes challenging. To mitigate this, we use only RGB color information and forgo Spherical Harmonics (SH) decomposition. While SH decomposition aids in capturing lighting and view-dependent effects, it significantly raises memory requirements. Our choice reduces the map’s memory footprint, but it constrains direct comparisons of rendering quality with methods that encode lighting information, which is key in outdoor settings due to complex light and shadow dynamics (see Section V-A). Our primary aim is to establish a dependable mapping system for visual relocalization, so detailed rendering quality comparisons, particularly regarding lighting effects, are beyond this work’s scope.

To efficiently manage and train large-scale 3DGS maps, we organize the 3D environment into a 2D voxel grid. This method divides the map into smaller voxels, each storing the 3DGS parameters of the Gaussians assigned to it according to their centers $\mu$, and assigns a unique hash ID to every voxel for quick querying. For better efficiency in spatial querying and updating 3DGS parameters in voxels according to the camera pose of each image, we utilize a KD-tree. Constructed from the voxels’ center points, the KD-tree swiftly identifies voxels within a certain range of the cameras. This approach not only improves our system’s scalability but also minimizes the use of computational resources, allowing for the detailed reconstruction of large environments without excessive GPU memory demands.
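The storage scheme can be sketched in plain Python. The code below is a minimal illustration and not the authors' implementation: Gaussians are bucketed by the $(x, y)$ cell of their center, the tuple key stands in for the voxel's hash ID, and a small hand-written 2D KD-tree over voxel centers answers range queries around a camera position. In practice a library KD-tree such as `scipy.spatial.cKDTree` would be the usual choice. The voxel size, the query radius, and all names are assumptions.

```python
from collections import defaultdict

def voxel_key(center, voxel_size):
    """2D voxel index from a Gaussian center: only x and y are used."""
    return (int(center[0] // voxel_size), int(center[1] // voxel_size))

def build_voxel_map(centers, voxel_size):
    """Group Gaussian indices by the 2D voxel containing their center."""
    voxels = defaultdict(list)
    for idx, c in enumerate(centers):
        voxels[voxel_key(c, voxel_size)].append(idx)
    return voxels

def build_kdtree(points, depth=0):
    """Minimal 2D KD-tree over (x, y, payload) tuples."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def query_radius(node, target, radius, found=None):
    """Collect payloads of all tree points within `radius` of `target`."""
    if found is None:
        found = []
    if node is None:
        return found
    x, y, payload = node["point"]
    if (x - target[0]) ** 2 + (y - target[1]) ** 2 <= radius ** 2:
        found.append(payload)
    diff = target[node["axis"]] - node["point"][node["axis"]]
    # Always search the side containing the target; cross the split only if needed.
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    query_radius(node[near], target, radius, found)
    if abs(diff) <= radius:
        query_radius(node[far], target, radius, found)
    return found

# Hypothetical Gaussian centers (x, y, z) and a 10 m voxel size.
centers = [(1.0, 1.0, 0.3), (2.0, 3.0, 1.0), (12.0, 1.0, 0.5), (55.0, 40.0, 2.0)]
size = 10.0
voxels = build_voxel_map(centers, size)
tree = build_kdtree([((i + 0.5) * size, (j + 0.5) * size, (i, j)) for i, j in voxels])
nearby = query_radius(tree, (8.0, 4.0), 12.0)           # voxels near the camera
active = sorted(g for key in nearby for g in voxels[key])
```

Only the Gaussians listed in `active` would need to be resident on the GPU for rendering or training at that camera position.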

III-B3 Loss Function

The original 3D Gaussian Splatting technique was developed primarily for novel view synthesis, focusing on producing high-quality images. Consequently, it utilizes a balanced combination ($L_{rgb}$) of the Mean Absolute Error loss ($L_1$) and the Structural Similarity Index Measure (SSIM) loss ($L_{\text{D-SSIM}}$) to evaluate the difference between the rendered image ($\mathcal{R}(G_t)$) and the ground truth image ($I_t$) (see Eq. 7). This singular focus allows Gaussian Splatting to compromise the precision of rendered depth in favor of enhancing the visual quality of RGB images. Additionally, this leads the densification process to introduce new Gaussians that do not adhere to the underlying geometry.

$$L_{rgb}(I_1, I_2) = (1 - \lambda) L_1(I_1, I_2) + \lambda L_{\text{D-SSIM}}(I_1, I_2) \tag{6}$$

$$L_{photo} = L_{rgb}\left(\mathcal{R}(G_t), I_t\right) \tag{7}$$
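A compact sketch of Eq. 6 is given below for illustration. Two simplifications are assumed and should be read as such: SSIM is computed once over the whole image rather than over local windows, and the D-SSIM term is taken as $1 - \mathrm{SSIM}$. The weight `lam` is left as a required argument because the value of $\lambda$ is not stated in this section. The constants 0.01 and 0.03 are the standard SSIM stabilizers.

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute error between two images."""
    return np.mean(np.abs(a - b))

def global_ssim(a, b, data_range=1.0):
    """SSIM computed over the whole image as a single window (a simplification)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = np.mean((a - mu_a) * (b - mu_b))
    return ((2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def rgb_loss(a, b, lam):
    """Eq. 6: (1 - lam) * L1 + lam * D-SSIM, with D-SSIM taken as 1 - SSIM."""
    return (1.0 - lam) * l1_loss(a, b) + lam * (1.0 - global_ssim(a, b))

# Synthetic stand-ins for a rendered image and a ground-truth image in [0, 1].
rng = np.random.default_rng(0)
rendered = rng.random((8, 8, 3))
loss_same = rgb_loss(rendered, rendered, lam=0.2)
loss_diff = rgb_loss(rendered, 1.0 - rendered, lam=0.2)
```

The photometric loss of Eq. 7 is then `rgb_loss` applied to the rendered image and the ground truth image at the same pose.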

To address this limitation, we incorporate a re-projection error loss aimed at maintaining both the geometric accuracy of the scene representation and the fidelity of rendered depth. We acquire depth information ($\mathcal{D}(G_t)$) at pose $W_t$ through Gaussian Splatting ($G_t$). This depth information is then used to re-project the ground truth image ($I_t$) from pose $W_t$ to pose $W_{t+1}$ through the transformation matrix $T_t^{t+1}$, and the re-projected image ($\mathcal{P}(\mathcal{D}(G_t), I_t, T_t^{t+1})$) is subsequently compared with the actual ground truth image ($I_{t+1}$) at pose $W_{t+1}$ (see Eq. 8).

$$L_{reproj} = L_{rgb}\left(\mathcal{P}\left(\mathcal{D}(G_t), I_t, T_t^{t+1}\right), I_{t+1}\right) \tag{8}$$
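The geometric core of the re-projection operator $\mathcal{P}$ can be sketched as follows: each pixel of frame $t$ is back-projected with its rendered depth, moved into the frame of camera $t+1$, and projected again. This is a minimal sketch assuming a pinhole camera and strictly positive depth. The function name `reproject_pixels` and the example values are hypothetical, and the resampling of colors at the resulting coordinates is omitted.

```python
import numpy as np

def reproject_pixels(depth_t, K, T_t_to_t1):
    """Map every pixel of frame t into frame t+1 using the rendered depth.

    depth_t:   (H, W) depth rendered at pose W_t.
    T_t_to_t1: 4x4 transform from camera t to camera t+1.
    Returns an (H, W, 2) array of (u, v) coordinates in frame t+1.
    """
    h, w = depth_t.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T              # back-project to unit-depth rays
    pts_t = rays * depth_t[..., None]            # 3D points in camera t
    pts_t1 = pts_t @ T_t_to_t1[:3, :3].T + T_t_to_t1[:3, 3]
    proj = pts_t1 @ K.T
    return proj[..., :2] / proj[..., 2:3]

# A flat scene 10 m away and a 0.5 m shift along x between the two cameras.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 10.0)
T = np.eye(4)
T[0, 3] = 0.5
coords = reproject_pixels(depth, K, T)
```

In this example every pixel moves by $f_x \cdot 0.5 / 10 = 5$ pixels along $u$. An error in the rendered depth changes this displacement, which is why comparing the warped image with $I_{t+1}$ penalizes inaccurate depth.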

As a result, our comprehensive loss function is outlined in Eq. 9, with $\omega_1$ and $\omega_2$ serving as weights to balance the contributions of the two loss functions effectively.

$$L = \omega_1 L_{photo} + \omega_2 L_{reproj} \tag{9}$$

III-C Visual ReLocalization Method Using 3DGS Map

III-C1 Initial ReLocalization

Our approach starts with leveraging raw pose data to pinpoint the query’s location on the global map. This data may come from various sources, including GPS systems. Utilizing the raw pose as a reference, we retrieve a segment of the global 3DGS map most likely to contain the query image’s precise location.

After selecting the nearby submap, we refine location accuracy through a brute-force search. This involves generating and comparing multiple images from the 3DGS submap with the query image to find the most visually similar one, assuming that similarity indicates closeness in location. This method improves upon GPS accuracy, providing a precise starting point for feature-based matching. We employ normalized cross-correlation (NCC) [26] for this image comparison, a metric frequently applied in medical image registration to evaluate similarity, as defined below:

NCC(Iq,IGt)=(i,j)(IqI¯q)(IGtI¯Gt)(i,j)(IqI¯q)2(i,j)(IGtI¯Gt)2NCCsubscript𝐼𝑞subscript𝐼subscript𝐺𝑡subscript𝑖𝑗subscript𝐼𝑞subscript¯𝐼𝑞subscript𝐼subscript𝐺𝑡subscript¯𝐼𝐺𝑡subscript𝑖𝑗superscriptsubscript𝐼𝑞subscript¯𝐼𝑞2𝑖𝑗superscriptsubscript𝐼subscript𝐺𝑡subscript¯𝐼𝐺𝑡2\mathrm{NCC}(I_{q},I_{G_{t}})=\frac{\sum_{(i,j)}(I_{q}-\bar{I}_{q})(I_{G_{t}}-% \bar{I}_{G{t}})}{\sqrt{\sum_{(i,j)}(I_{q}-\bar{I}_{q})^{2}}\sqrt{\sum{(i,j)}(I% _{G_{t}}-\bar{I}_{G{t}})^{2}}}roman_NCC ( italic_I start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT , italic_I start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) = divide start_ARG ∑ start_POSTSUBSCRIPT ( italic_i , italic_j ) end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT - over¯ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) ( italic_I start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT - over¯ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_G italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG square-root start_ARG ∑ start_POSTSUBSCRIPT ( italic_i , italic_j ) end_POSTSUBSCRIPT ( italic_I start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT - over¯ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG square-root start_ARG ∑ ( italic_i , italic_j ) ( italic_I start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT - over¯ start_ARG italic_I end_ARG start_POSTSUBSCRIPT italic_G italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_ARG (10)

Where $I_{q}$ represents the query image, $I_{G_{t}}$ denotes the image rendered from the 3DGS submap at pose $W_{t}$, and $\bar{I}$ indicates the mean intensity of the corresponding image.
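The NCC score above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming both images are grayscale arrays of identical size; the function name `ncc` and the small `eps` guard against constant images are our own additions, not part of the original formulation.

```python
import numpy as np

def ncc(query: np.ndarray, rendered: np.ndarray, eps: float = 1e-12) -> float:
    """Normalized cross-correlation between two same-sized grayscale images."""
    # Subtract the mean intensity of each image (the I-bar terms).
    q = query.astype(np.float64) - query.mean()
    g = rendered.astype(np.float64) - rendered.mean()
    # Product of the two root-sum-of-squares terms in the denominator.
    denom = np.sqrt((q * q).sum()) * np.sqrt((g * g).sum())
    # eps (an assumption) avoids division by zero for constant images.
    return float((q * g).sum() / (denom + eps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((48, 64))
    # A brightness/contrast change leaves the score essentially unchanged.
    print(ncc(img, 2.0 * img + 3.0))
```

Because the means are subtracted and the result is normalized, the score is insensitive to a global gain and offset in intensity, which is one reason it tolerates moderate appearance differences between rendered and real images.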

Another rationale behind selecting normalized cross-correlation is its differentiable nature, which aligns well with the differentiable characteristics of the 3D Gaussian representation. This compatibility has the potential to facilitate a fully differentiable relocalization pipeline, creating avenues for seamless integration and optimization within the overall system. (For further discussion, see Section V-B.)

III-C2 ReLocalization Refinement

After pinpointing the closest rendered image, we adopt a feature-based matching technique. This phase entails identifying matching points between the query image and the closest rendered counterpart. By harnessing the known camera pose associated with the rendered image, along with the depth rendered from the 3DGS map, we employ the Perspective-n-Point (PnP) algorithm to refine the pose of the query image within the selected submap.
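The 3D side of the 2D-3D correspondences can be obtained by back-projecting matched keypoints of the rendered image with the rendered depth and the known rendering pose. The sketch below illustrates that step only, assuming a pinhole camera, depth measured along the camera z-axis, and a 4x4 camera-to-world pose; the function and argument names are hypothetical. The resulting world points, paired with the matched query keypoints, would then be passed to a PnP solver (for example OpenCV's `cv2.solvePnPRansac`; the specific solver is our assumption).

```python
import numpy as np

def backproject_keypoints(kpts_uv, depth, K, T_world_cam):
    """Lift pixel keypoints of the rendered image to 3D world points.

    kpts_uv:     (N, 2) pixel coordinates (u, v) in the rendered image
    depth:       (H, W) rendered depth along the camera z-axis, in meters
    K:           (3, 3) pinhole intrinsics
    T_world_cam: (4, 4) camera-to-world pose of the rendered view
    Returns the valid world points and a boolean mask over the inputs.
    """
    kpts_uv = np.asarray(kpts_uv, dtype=np.float64)
    u, v = kpts_uv[:, 0], kpts_uv[:, 1]
    # Nearest-pixel depth lookup, clamped to the image bounds.
    rows = np.clip(np.rint(v).astype(int), 0, depth.shape[0] - 1)
    cols = np.clip(np.rint(u).astype(int), 0, depth.shape[1] - 1)
    z = depth[rows, cols]
    valid = z > 0  # pixels without rendered depth are discarded
    # Pinhole back-projection into the camera frame.
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z], axis=1)
    # Rigid transform into the world (submap) frame.
    pts_world = pts_cam @ T_world_cam[:3, :3].T + T_world_cam[:3, 3]
    return pts_world[valid], valid
```

Keypoints whose mask entry is `False` should be dropped from the query side as well, so that the 2D and 3D arrays stay aligned.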

To further enhance the precision of localization, we employ an iterative refinement process on the initially estimated pose. This involves repeatedly performing the feature-based matching step with images newly rendered using the pose estimated from the preceding step. Each cycle is designed to progressively refine the pose estimation, capitalizing on the increased accuracy with each iteration to achieve the most precise localization achievable.

Considering the broad spectrum of available feature detection and matching algorithms [18], we opted for SuperPoint [21] for keypoint detection and feature extraction, coupled with LightGlue [27] for the matching process. These choices were driven by their proven effectiveness and compatibility with our localization framework, enabling us to achieve high-quality feature matching and efficient pose recovery, as shown in Section IV.

[Figure 2 panels: (a) X Error vs. Normalized Cross Correlation; (b) Query Image; (c) Best Matched Rendered Image Along X; (d) Worst Matched Rendered Image Along X; (e) Yaw Error vs. Normalized Cross Correlation; (f) Query Image; (g) Best Matched Rendered Image Along Yaw; (h) Worst Matched Rendered Image Along Yaw]
Figure 2: (a)/(e) Illustrating the Relationship between X/Yaw Error and Normalized Cross-Correlation in Localization Initialization; (b)/(f) Query Image for Localization; (c)/(g) Best Matches in Rendered Image Sequences; (d)/(h) Worst Matches in Rendered Image Sequences.

III-C3 Live Relocalization

In the live relocalization task, the system must continuously track a camera’s pose using streaming images. For the initial query image, we conduct initialization and refinement as outlined in Sections III-C1 and III-C2. For the images that follow, we adopt a constant velocity model to predict the camera’s next pose, then refine that prediction with the feature-matching technique described in Section III-C2. This streamlined approach eliminates the need for a brute-force search on every query image, significantly boosting the efficiency of ongoing relocalization. Such efficiency is crucial for real-time applications in fields like autonomous navigation and robotics, providing smoother and more reliable tracking. To enhance pose estimation accuracy, we can incorporate more sophisticated methods, such as filter-based techniques [28].
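A constant velocity model can be sketched as re-applying the most recent relative motion to the current pose. The snippet below is one common formulation on 4x4 camera-to-world matrices, assuming equally spaced frames; the function name is hypothetical and the text does not specify how the model is parameterized.

```python
import numpy as np

def predict_next_pose(T_prev: np.ndarray, T_curr: np.ndarray) -> np.ndarray:
    """Constant-velocity prediction from the last two 4x4 camera-to-world poses."""
    # Relative motion from the previous to the current frame,
    # expressed in the previous camera frame.
    delta = np.linalg.inv(T_prev) @ T_curr
    # Assume the same motion repeats over the next frame interval.
    return T_curr @ delta

if __name__ == "__main__":
    T0 = np.eye(4)
    T1 = np.eye(4)
    T1[0, 3] = 1.0  # moved 1 m along x between the two frames
    print(predict_next_pose(T0, T1)[:3, 3])
```

The predicted pose only serves as the rendering viewpoint for the next feature-matching step, so moderate prediction errors are corrected by the subsequent PnP refinement.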

IV experiment

Our experimental evaluation was conducted using the KITTI360 dataset, which includes LiDAR data, camera images, ground truth poses, and semantic/instance labels. We focus on the dataset’s first route (2013_05_28_drive_0000_sync), dividing it into two segments: the initial drive and the revisit.

The 3D Gaussian Map was constructed using data solely from the first drive, which provided the necessary inputs for initializing and training our 3DGS map. This map aimed to establish a reliable reference for our relocalization tasks. Data from the second drive were then used to test our relocalization algorithm, allowing us to measure our approach’s effectiveness in a real-world setting. Specifically, we selected two subsequences for our experiments:

  • Seq0: frames 4200 to 4550 were used for map creation, and frames 7779 to 8002 for visual relocalization;

  • Seq1: frames 7120 to 7450 were used for constructing the map, and frames 9754 to 10062 for visual relocalization;

During map construction, we utilized the dataset’s semantic annotations to exclude the sky and instance labels corresponding to moving objects like vehicles and pedestrians. This approach concentrated our efforts on static environmental features, eliminating the necessity for dynamic reconstruction within the 3DGS model [29]. While our mapping system is capable of processing the entire route, our relocalization experiments focused on selected subsequences. All training and experimental activities were conducted on an NVIDIA RTX A4000 equipped with 16 GB of memory. The submap size was set to 120 meters for training and 150 meters for relocalization, with a voxel resolution of 1 meter.

IV-A Initial ReLocalization

In the initial phase of relocalization, we employ the normalized cross-correlation (NCC) metric to evaluate the similarity between pairs of images as mentioned in Section III-C1. To illustrate the utility of NCC, we present two examples demonstrating its effectiveness in overcoming common localization challenges.

In Fig. 2 (e-h), we examine the scenario where there is an error in the yaw angle. Fig. 2(f) displays the query image used for relocalization. Despite the presence of two new bicyclists in the query image, the NCC metric successfully identifies the correct match (indicated by the yellow point), demonstrating robustness to changes in scene composition and minor orientation errors.

Further, Fig. 2 (a-d) explores the impact of errors in the x position on the localization process. Remarkably, the NCC metric maintains its effectiveness even with an error margin of up to 10 meters, accurately localizing the correct position. This scenario reveals a clear negative relationship between the x position error and the NCC value, underscoring the metric’s capacity to guide correct localization under significant positional discrepancies.

These examples highlight the NCC metric’s effectiveness in handling orientation (yaw) and positional (x) errors during initial relocalization. Utilizing the NCC metric enhances our method’s ability to withstand scene variations and inaccuracies in starting positions, laying a robust groundwork for accurate localization in complex settings. It’s important to note that the closely matched examples presented here benefit from the use of a very fine grid size during the search process. However, in practical applications and in our implementation, we opt for a coarser grid size to expedite the initialization phase. For achieving precise relocalization, we subsequently apply a feature-based matching method, as detailed in Section III-C2 and illustrated in the subsequent section.

Seq   Success   Stage    X Error (m)     Y Error (m)     Yaw Error (°)
0     219/223   Init     3.513 (3.080)   2.381 (1.807)   14.007 (10.685)
0     219/223   Refine   0.185 (0.189)   0.117 (0.168)   0.535 (0.498)
1     304/308   Init     3.212 (2.535)   3.148 (2.450)   12.001 (13.388)
1     304/308   Refine   0.098 (0.076)   0.114 (0.103)   0.247 (0.239)
TABLE I: Evaluation for Single Image Query Re-localization Error in Initialization and Refinement Stage. Entries are mean (standard deviation).

To evaluate our method’s effectiveness more thoroughly, we used all query images for the initial relocalization analysis. We introduced noise into $(x, y, yaw)$ of the ground truth pose of each query image. The noise was uniformly sampled within a range of $(-10, 10)$ meters for the $x$ and $y$ translations and $(-90^{\circ}, 90^{\circ})$ for the $yaw$ rotation. A brute-force search was conducted with a grid size of $2$ meters and $10^{\circ}$ within a search space of $(-15, 15)$ meters and $(-360^{\circ}, 360^{\circ})$. We explored the initial $20\%$ of the search space using a random search and applied an early stop criterion. This criterion was based on whether the Normalized Cross-Correlation (NCC) dropped below a set threshold and whether we could successfully obtain sufficient matching points with the second-stage method. The outcomes of this evaluation are detailed in Table I and illustrated in Figure 3. As indicated in Table I, both sequences exhibit high success rates, with Seq 0 achieving a 98.2% success rate (219 out of 223 attempts) and Seq 1 achieving a 98.7% success rate (304 out of 308 attempts). Following the exclusion of unsuccessful matches, we calculated the mean and standard deviation of the errors in $(x, y, yaw)$. The distribution of translation errors, predominantly within the $(-5, 5)$ meter range yet occasionally exceeding this, is depicted in Figure 3.
Despite the presence of significant translation errors initially, the refinement stage markedly enhanced localization accuracy.
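The randomized grid search with early stopping can be sketched as follows. This is only one possible reading of the procedure: candidates are visited in random order, at least a fixed fraction of the grid is always explored, and the search then stops as soon as the best similarity score reaches a threshold. The names `grid_candidates`, `search_initial_pose`, `score_fn`, and the threshold value are hypothetical; in the actual pipeline the score would come from rendering the submap at each candidate pose and comparing with the query image, and the stop test would also involve the number of feature matches.

```python
import itertools
import random
import numpy as np

def grid_candidates(xy_range, xy_step, yaw_range, yaw_step):
    """All (dx, dy, dyaw) offsets on a regular grid centered at zero."""
    xs = np.arange(-xy_range, xy_range + 1e-9, xy_step)
    yaws = np.arange(-yaw_range, yaw_range + 1e-9, yaw_step)
    return list(itertools.product(xs, xs, yaws))

def search_initial_pose(score_fn, candidates, explore_frac=0.2,
                        threshold=0.9, seed=0):
    """Visit candidates in random order; stop early once the score is good enough."""
    order = list(candidates)
    random.Random(seed).shuffle(order)
    # Always explore at least this many candidates before stopping.
    min_visits = int(np.ceil(explore_frac * len(order)))
    best, best_score, visited = None, -np.inf, 0
    for visited, cand in enumerate(order, start=1):
        score = score_fn(cand)  # e.g. NCC of query vs. rendered image
        if score > best_score:
            best, best_score = cand, score
        if visited >= min_visits and best_score >= threshold:
            break
    return best, best_score, visited

if __name__ == "__main__":
    # Synthetic stand-in for the image similarity, peaked at a known offset.
    def toy_score(c):
        x, y, yaw = c
        return float(np.exp(-((x - 1) ** 2 + (y + 3) ** 2) / 50 - (yaw - 20) ** 2 / 800))

    cands = grid_candidates(xy_range=15, xy_step=2, yaw_range=180, yaw_step=10)
    print(search_initial_pose(toy_score, cands))
```

Setting the threshold above the maximum attainable score turns the routine into an exhaustive brute-force search, which is a convenient way to check the early-stopping variant against the full grid.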

Figure 3: Evaluation of Initial Localization X, Y, Yaw Error Histogram. (a) Seq 0; (b) Seq 1.

IV-B ReLocalization Refinement

To thoroughly assess the accuracy of our final fine-pose estimations, we started with the initial poses that were successfully obtained, as outlined in Section IV-A. These poses underwent processing via keypoint detection and feature extraction utilizing SuperPoint [21], followed by matching through LightGlue [27]. This cycle of detection, description, and matching was iteratively performed to enhance the accuracy of our estimations. The outcomes of these iterative enhancements are summarized in Table I and illustrated in Figure 4.

Following the refinement process, we observed significant improvements in the results. For instance, in Seq 0, initial positioning errors decreased from $3.513$ meters to $0.185$ meters in the X-axis, from $2.381$ meters to $0.117$ meters in the Y-axis, and from $14.007^{\circ}$ to $0.535^{\circ}$ in $Yaw$. Similarly, in Seq 1, errors were reduced from $3.212$ meters to $0.098$ meters in $X$, from $3.148$ meters to $0.114$ meters in $Y$, and from $12.001^{\circ}$ to $0.247^{\circ}$ in $Yaw$. Beyond the reduction in errors, the consistency of our methodology is also evident from the diminished standard deviation values, showcasing the reliability and precision of our approach.

Figure 4: Evaluation of Re-Localization X, Y, Yaw Error Histogram. (a) Seq 0; (b) Seq 1.

IV-C Live ReLocalization

For live relocalization evaluation, we randomly initialized the pose of the first query image, which corresponds to the first frame in each sequence. We then streamed subsequent images for live relocalization. As detailed in Table II, we utilize Absolute Pose Error (APE) and Relative Pose Error (RPE) to evaluate our system’s performance on the live relocalization task. The table presents comprehensive statistics for both metrics across two sequences, including the Root Mean Square Error (RMSE), Mean, Median, Standard Deviation (Std), Minimum (Min), Maximum (Max), and Sum of Squared Errors (SSE).

Metric   Seq   RMSE    Mean    Median   Std     Min     Max     SSE
APE      0     0.103   0.092   0.088    0.047   0.013   0.349   2.387
APE      1     0.099   0.087   0.079    0.047   0.013   0.311   3.032
RPE      0     0.083   0.070   0.060    0.046   0.008   0.252   1.543
RPE      1     0.083   0.070   0.060    0.045   0.008   0.292   2.140
TABLE II: Evaluation for Live Re-localization using Absolute Pose Error (APE) and Relative Pose Error (RPE)

For APE, the RMSE is around $0.1$, with an average error near $0.09$ and a median of about $0.08$, indicating high accuracy with low variability (standard deviation of $0.047$). SSE values highlight precise pose estimations over time.

RPE shows consistent metrics for sequences 0 and 1, with an RMSE of $0.083$ and mean and median errors of $0.070$ and $0.060$, respectively, showing stable relative pose accuracy. Standard deviations are minimal ($0.046$ for Sequence 0 and $0.045$ for Sequence 1), with error ranges of $0.008$ to $0.252$ for Sequence 0 and $0.008$ to $0.292$ for Sequence 1, and SSE values of $1.543$ for Sequence 0 and $2.140$ for Sequence 1, indicating robust relative pose estimation.
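The summary statistics reported for APE and RPE can be computed from a vector of per-frame errors as sketched below. This is an illustrative sketch only: the helper names are hypothetical, the per-frame APE is taken here as the Euclidean distance between estimated and ground-truth positions (an assumption, since the error definition is not spelled out in the text), and the standard deviation is the population form.

```python
import numpy as np

def ape_translation(gt_xyz, est_xyz):
    """Per-frame absolute translation error (Euclidean distance)."""
    gt = np.asarray(gt_xyz, dtype=np.float64)
    est = np.asarray(est_xyz, dtype=np.float64)
    return np.linalg.norm(est - gt, axis=1)

def error_stats(errors):
    """RMSE, mean, median, std, min, max and SSE of a 1-D error vector."""
    e = np.asarray(errors, dtype=np.float64)
    return {
        "rmse": float(np.sqrt(np.mean(e ** 2))),
        "mean": float(e.mean()),
        "median": float(np.median(e)),
        "std": float(e.std()),  # population standard deviation
        "min": float(e.min()),
        "max": float(e.max()),
        "sse": float(np.sum(e ** 2)),
    }

if __name__ == "__main__":
    gt = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
    est = [[0.1, 0.0, 0.0], [1.0, 0.1, 0.0], [2.0, 0.0, 0.2]]
    print(error_stats(ape_translation(gt, est)))
```

With the population standard deviation, the quantities satisfy RMSE² = Mean² + Std², which offers a quick consistency check on a row of such statistics; for example, $0.092^2 + 0.047^2 \approx 0.103^2$ for APE on Seq 0.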

Visual analysis of roll-pitch-yaw and XYZ trajectories shows close alignment with ground truth (see Figs. 5 and 6), but pitch and Z-axis estimates have more noise. The noise may stem from inaccuracies in keypoints extracted from ground features, which are less precisely depicted in the 3D Gaussian Splatting map. More sophisticated trajectory estimation techniques, such as filter-based methods, could reduce this noise and yield smoother, more accurate results.

Figure 5: Comparison of Ground Truth and Predicted Trajectories from Six Views: Roll, Pitch, Yaw, X, Y, Z for Sequence 0
Figure 6: Comparison of Ground Truth and Predicted Trajectories from Six Views: Roll, Pitch, Yaw, X, Y, Z for Sequence 1

V Limitation and Discussion

V-A Balancing Visual Quality with Memory and Geometric Fidelity

To minimize the map’s footprint, we opted against using Spherical Harmonics (SH) to encode lighting and view-dependent information. While effective in reducing memory usage, this decision has its trade-offs, particularly in outdoor environments where dynamic lighting plays a significant role. For instance, as illustrated in Fig. 7, changes in lighting direction can result in varying ground colors, leading to artifacts in our rendered images. Despite this challenge, which was particularly noticeable in Seq 0 due to varying lighting conditions, the localization accuracy between Seq 0 and Seq 1 remained consistent in our experiments. This resilience is primarily attributed to the robustness of the Normalized Cross Correlation (NCC) metric, as well as the feature detection and matching capabilities of SuperPoint [21] and LightGlue [27].

[Figure 7 panels: (a) Ground with Reflection; (b) Ground without Reflection after Moving forward; (c) Rendered Image]
Figure 7: Without encoding lighting information in Gaussian Map, lighting changes cause rendering artifacts

This observation prompts a reevaluation of the necessity to encode lighting information within the map for relocalization tasks. Our findings suggest incorporating dynamic lighting and shadows may not be essential for achieving accurate localization. Moreover, an ideal map might benefit from eliminating lighting and shadow effects to focus more on the environment’s geometric and structural aspects, further streamlining the localization process without compromising accuracy. This finding suggests a potential direction for future research, exploring the balance between visual fidelity, memory efficiency, and geometric accuracy in the context of map representation and localization.

V-B Towards a Fully Differentiable Localization Pipeline

The 3D Gaussian Splatting representation’s differentiability is an interesting feature, which might offer the possibility of creating a fully differentiable pipeline to perform localization on a 3D Gaussian Splatting submap. This capability might enable us to bypass the traditional detection-description-matching approach, removing the need to train separate models for feature detection and extraction. Additionally, a fully differentiable pipeline can facilitate integration with other differentiable methods for navigation and planning systems. We have initially evaluated several metrics to perform direct localization on a 3D Gaussian Splatting map. These metrics include Gradient Correlation (GC), Normalized Cross Correlation (NCC) [26], and Mutual Information (MI) [30]. However, our preliminary experiment indicates that these metrics are particularly sensitive to initial pose estimates and prone to settling into local minima during gradient descent optimization. These challenges suggest the need to explore alternative optimization techniques or strategies to address these issues.
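Of the metrics listed above, Gradient Correlation is not defined in the text. A commonly used definition, assumed here, is the average of the NCC between the horizontal gradient images and the NCC between the vertical gradient images. The sketch below follows that definition with NumPy finite differences; the function names are hypothetical, and the sketch is meant only to make the metric concrete, not to reproduce the preliminary experiments.

```python
import numpy as np

def ncc(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """Normalized cross-correlation of two same-sized arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))

def gradient_correlation(img_a, img_b) -> float:
    """Average NCC of the vertical and horizontal image gradients."""
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (cols, x).
    ay, ax = np.gradient(np.asarray(img_a, dtype=np.float64))
    by, bx = np.gradient(np.asarray(img_b, dtype=np.float64))
    return 0.5 * (ncc(ax, bx) + ncc(ay, by))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32))
    # A constant brightness offset does not change the gradients.
    print(gradient_correlation(img, img + 5.0))
```

Because it compares edges rather than raw intensities, this score is unaffected by a constant brightness offset, but like NCC it is only informative when the two images already overlap substantially, which is consistent with the sensitivity to the initial pose noted above.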

VI conclusion

This paper has explored the integration of LiDAR and camera data through the novel application of 3D Gaussian Splatting, addressing the crucial need for advanced map representation methodologies in the rapidly evolving domains of autonomous driving and robotic navigation. By leveraging the strengths of both LiDAR’s depth sensing and the detailed imagery provided by cameras, we have demonstrated a robust approach to creating detailed and geometrically accurate environmental representations, crucial for the safe and efficient navigation of autonomous systems. Our methodology, which begins with LiDAR data to initiate the training of the 3D Gaussian Splatting representation, facilitates the construction of comprehensive maps while addressing common challenges such as high memory usage and inaccuracies in underlying geometry.

The implementation of our technique in the visual relocalization task showcases its capacity to enhance the precision of feature identification and positioning, contributing significantly to the field by enabling more sophisticated perception systems for autonomous vehicles. Through our evaluation with the KITTI360 dataset, we have validated the effectiveness, adaptability, and precision of our approach, underscoring its potential to advance environmental perception and system reliability. Ultimately, our work contributes to the broader conversation on sensor data integration and map representation, offering a pathway toward more efficient, accurate, and reliable localization and navigation in complex urban landscapes.

References

  • [1] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D Gaussian Splatting for Real-Time Radiance Field Rendering,” vol. 42, no. 4, p. 1.
  • [2] Y. Liao, J. Xie, and A. Geiger, “KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D,” vol. 45, no. 3, pp. 3292–3310.
  • [3] C. Chen, B. Wang, C. X. Lu, N. Trigoni, and A. Markham, “Deep Learning for Visual Localization and Mapping: A Survey,” pp. 1–21.
  • [4] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis.
  • [5] E. Sucar, S. Liu, J. Ortiz, and A. J. Davison, “iMAP: Implicit Mapping and Positioning in Real-Time,” pp. 6229–6238.
  • [6] Z. Zhu, S. Peng, V. Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, and M. Pollefeys, “NICE-SLAM: Neural Implicit Scalable Encoding for SLAM.”
  • [7] V. Yugay, Y. Li, T. Gevers, and M. R. Oswald. Gaussian-SLAM: Photo-realistic Dense SLAM with Gaussian Splatting.
  • [8] H. Matsuki, R. Murai, P. H. J. Kelly, and A. J. Davison. Gaussian Splatting SLAM.
  • [9] M. Li, S. Liu, H. Zhou, G. Zhu, N. Cheng, and H. Wang. SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM.
  • [10] N. Keetha, J. Karhade, K. M. Jatavallabhula, G. Yang, S. Scherer, D. Ramanan, and J. Luiten. SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM.
  • [11] J. Lin, Z. Li, X. Tang, J. Liu, S. Liu, J. Liu, Y. Lu, X. Wu, S. Xu, Y. Yan, and W. Yang. VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction.
  • [12] Y. Chen, C. Gu, J. Jiang, X. Zhu, and L. Zhang. Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-time Rendering.
  • [13] Y. Yan, H. Lin, C. Zhou, W. Wang, H. Sun, K. Zhan, X. Lang, X. Zhou, and S. Peng. Street Gaussians for Modeling Dynamic Urban Scenes.
  • [14] X. Zhou, Z. Lin, X. Shan, Y. Wang, D. Sun, and M.-H. Yang. DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes.
  • [15] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “DSAC - Differentiable RANSAC for Camera Localization,” pp. 6684–6692.
  • [16] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala, “Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network,” pp. 929–938.
  • [17] P.-E. Sarlin, A. Unagar, M. Larsson, H. Germain, C. Toft, V. Larsson, M. Pollefeys, V. Lepetit, L. Hammarstrand, F. Kahl, and T. Sattler, “Back to the Feature: Learning Robust Camera Localization From Pixels To Pose,” pp. 3247–3257.
  • [18] M. R. U. Saputra, A. Markham, and N. Trigoni, “Visual SLAM and Structure from Motion in Dynamic Environments: A Survey,” vol. 51, no. 2, pp. 37:1–37:36.
  • [19] D. Werner, A. Al-Hamadi, and P. Werner, “Truncated Signed Distance Function: Experiments on Voxel Size,” in Image Analysis and Recognition, ser. Lecture Notes in Computer Science, A. Campilho and M. Kamel, Eds.   Springer International Publishing, pp. 357–364.
  • [20] R. W. Wolcott and R. M. Eustice, “Visual localization within LIDAR maps for automated urban driving,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 176–183.
  • [21] D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperPoint: Self-Supervised Interest Point Detection and Description,” pp. 224–236.
  • [22] B. Wang, C. Chen, Z. Cui, J. Qin, C. X. Lu, Z. Yu, P. Zhao, Z. Dong, F. Zhu, N. Trigoni, and A. Markham, “P2-Net: Joint Description and Detection of Local Features for Pixel and Point Matching.”
  • [23] P. Jiang and S. Saripalli, “Contrastive Learning of Features between Images and LiDAR,” in 2022 IEEE 18th International Conference on Automation Science and Engineering (CASE), pp. 411–417.
  • [24] N. Max, “Optical models for direct volume rendering,” vol. 1, no. 2, pp. 99–108.
  • [25] M. Zwicker, H. Pfister, J. van Baar, and M. Gross, “EWA volume splatting,” in Proceedings Visualization, 2001. VIS ’01., pp. 29–538.
  • [26] Y. Hiasa, Y. Otake, M. Takao, T. Matsuoka, K. Takashima, A. Carass, J. L. Prince, N. Sugano, and Y. Sato, “Cross-modality image synthesis from unpaired data using cyclegan,” in International Workshop on Simulation and Synthesis in Medical Imaging.   Springer, 2018, pp. 31–41.
  • [27] P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “LightGlue: Local Feature Matching at Light Speed,” pp. 17 627–17 638.
  • [28] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics (Intelligent Robotics and Autonomous Agents).   The MIT Press.
  • [29] J. Luiten, G. Kopanas, B. Leibe, and D. Ramanan. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis.
  • [30] P. Jiang, P. Osteen, and S. Saripalli, “SemCal: Semantic LiDAR-Camera Calibration using Neural Mutual Information Estimator,” in 2021 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI).   IEEE, pp. 1–7.