HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: epic

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: arXiv.org perpetual non-exclusive license
arXiv:2404.05164v1 [cs.RO] 08 Apr 2024

Rendering-Enhanced Automatic Image-to-Point Cloud Registration for Roadside Scenes

Yu Sheng11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT, Lu Zhang22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT, Xingchen Li11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT, Yifan Duan11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT, Yanyong Zhang11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT, Yu Zhang11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT, and Jianmin Ji1,2,12{}^{1,2,{\dagger}}start_FLOATSUPERSCRIPT 1 , 2 , † end_FLOATSUPERSCRIPT 11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT School of Computer Science and Technology, University of Science and Technology of China (USTC), Hefei 230026, China22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, Anhui, China{}^{\dagger}start_FLOATSUPERSCRIPT † end_FLOATSUPERSCRIPT Corresponding author. [email protected]
Abstract

Prior point cloud provides 3D environmental context, which enhances the capabilities of monocular camera in downstream vision tasks, such as 3D object detection, via data fusion. However, the absence of accurate and automated registration methods for estimating camera extrinsic parameters in roadside scene point clouds notably constrains the potential applications of roadside cameras. This paper proposes a novel approach for the automatic registration between prior point clouds and images from roadside scenes. The main idea involves rendering photorealistic grayscale views taken at specific perspectives from the prior point cloud with the help of their features like RGB or intensity values. These generated views can reduce the modality differences between images and prior point clouds, thereby improve the robustness and accuracy of the registration results. Particularly, we specify an efficient algorithm, named neighbor rendering, for the rendering process. Then we introduce a method for automatically estimating the initial guess using only rough guesses of camera’s position. At last, we propose a procedure for iteratively refining the extrinsic parameters by minimizing the reprojection error for line features extracted from both generated and camera images using Segment Anything Model (SAM). We assess our method using a self-collected dataset, comprising eight cameras strategically positioned throughout the university campus. Experiments demonstrate our method’s capability to automatically align prior point cloud with roadside camera image, achieving a rotation accuracy of 0.202{}^{\circ}start_FLOATSUPERSCRIPT ∘ end_FLOATSUPERSCRIPT and a translation precision of 0.079m. Furthermore, we validate our approach’s effectiveness in visual applications by substantially improving monocular 3D object detection performance.

I Introduction

Image-to-point cloud registration seeks to identify a rigid transformation for aligning a 3D point cloud with a 2D image. This process, a cornerstone of multi-modal fusion, has attracted considerable attention in domains like autonomous driving and computer vision. Integrating prior point clouds with images has been proven to enhance multiple downstream tasks in roadside environments, such as vehicle speed measurements [43] and 3D object detection [27, 26]. Unlike the application fields of LiDAR-camera calibration [3, 4] and visual localization [29, 28], the registration of roadside camera images with prior point clouds aims at finding the fixed cameras’ poses to these point clouds, a task that is currently heavily reliant on manual techniques for its accomplishment [26].

Refer to caption
Figure 1: Our method registers drone-collected point clouds with roadside camera images. This registration enhances monocular algorithm performance for downstream roadside tasks.

Given the distinct classifications of roadside environments, the direct usage of existing automatic registration methods may encounter a range of challenges. Firstly, the prior point cloud and image are often captured at different time and within different coordinate systems. This results in a variation in the relative poses between different roadside cameras and point clouds, complicating the acquisition of precise initial guess needed by classical registration methods [3, 5]. While learning-based [54, 29] approaches have the potential to mitigate the dependency on initial guess through scene comprehension, acquiring a diverse and sufficient dataset for training is difficult in roadside settings. The recent work [4] manages to lessen the dependency of the registration algorithm on initial guess by transforming point clouds into images, thereby reducing the modal discrepancy between the two data. However, [4] struggles to address the unpredictable extrinsic parameters in roadside scenarios. Moreover, due to sparse and uneven distribution of roadside point clouds, directly projecting point clouds onto images as done in [4, 9] can lead to issues such as the background being visible through the front surface (so-called bleeding problem), making it challenging to generate effective views for feature matching. This constitutes the second challenge. Although numerous methods, including NPBG [30] and 3DGS [32], have been developed to generate new views from point clouds with remarkable results, directly applying these methods to roadside point clouds with a vast number of points would incur an unbearable computational cost. Furthermore, these methods often overlook the 2D-3D correspondence during the rendering process, a key component necessary for accurate registration.

To address the aforementioned challenges, we propose an automatic registration method for roadside camera images and prior point clouds. We introduce neighbor rendering, an efficient approach for generating realistic views from point cloud while preserved 2D-3D correspondence, ideal for registration. Leveraging neighbor rendering, we propose a initial guess estimation framework that involves pose sampling near rough camera location selected from point cloud and view generating, followed by utilizing SuperGlue [14] to establish 2D-3D. Finally, we employ state-of-the-art (SOTA) segmentation model SAM [33] to extract line features from generated images and obtain corresponding line features in the point cloud through the 2D-3D correspondence provided by neighbor rendering. The extrinsic parameters are optimized by minimizing the reprojection error. We apply the registered image-point cloud pairs to roadside 3D object detection tasks, achieving remarkable performance and thus proving the effectiveness of our entire process. It is noteworthy that, although our point clouds originate from DJI drones, our method seamlessly integrates with pre-existing maps produced by SLAM methods like R33{}^{3}start_FLOATSUPERSCRIPT 3 end_FLOATSUPERSCRIPTLIVE [56]. Our main contributions are as follows:

  • We propose a framework for automated image-to-point cloud registration in roadside scenes, utilizing view generated from point clouds to diminish data modality gaps, achieving SOTA performance in real-world roadside environments.

  • We introduce an efficient rendering method and utilize SAM to accurately extract edges from heterogeneous images, thereby enhancing registration accuracy.

  • We apply our method to roadside 3D detection task to demonstrate the framework’s practical effectiveness.

II Related Work

II-A Image-to-Point Cloud registration

II-A1 Generic Image-to-Point Cloud registration

Camera-LiDAR extrinsic calibration and visual localization are two key applications of image-to-point cloud registration. Existing calibration methods between LiDAR and camera can be categorized into two types: target-based methods and target-less methods [7]. Target-based methods typically use checkboards [20] to establish the correspondence between LiDAR and images, while target-less methods [3, 5] often relies on line or point features to achieve this objective. [4] optimize extrinsic parameters by minimizing the information discrepancy between LiDAR intensity images and grayscale images. The same idea is applied in visual localization as well [42]. As neural network technology advances, [35, 36, 49, 54] leverages networks for identifying the correlations between 2D and 3D features, with subsequent refinement achieved through optimization with either the Perspective-n-Point (PnP) or Particle Swarm Optimization (PSO) algorithms. Another research trajectory [28] delves into the differentiability aspects of PnP, facilitating the end-to-end training of networks.

II-A2 Roadside Image-to-Point Cloud registration

In the camera-world calibration procedure for the Rope3D dataset [26], high-definition maps (HD Maps) containing lane and crosswalk endpoints are first projected onto 2D images to obtain the initial transformation, followed by bundle adjustment optimization. [47] utilizes manual matching of 2D-3D features followed by optimization using distance transform. [43] utilizes panoramic images for point cloud reconstruction and aligns roadside camera images with the reconstructed point cloud. While these methods achieve excellent results, the need for manual intervention or obtaining panoramic images adds complexity to deployment.

II-B View Rendering

Rendering views from point clouds is primarily achieved through two techniques: points-based rendering and meshing. Meshing facilitates the creation of high-quality images; however, it necessitates complex preprocessing step [37]. The concept of utilizing points as rendering primitives was initially introduced by [38, 39], but it leads to bleeding problem. Ray casting series [45, 46] leverage surrounding points (volumes) to deduce the specific point on the ray, resulting in enhanced visual outcomes. However, discerning spatial relationships in vast, unordered point clouds requires substantial computational cost. [40] tackles these issues by substituting each point with oriented flat disks (a surfel), where the orientation and radius of these disks can be estimated from the point clouds. This strategy underpins recent neural network-based rendering techniques such as NPBG [30] and 3DGS [32]. These methods focus more on generating high-quality views, thereby neglecting the 2D-3D correspondence needed by registration.

II-C Roadside Camera Applications

A thorough review of monocular roadside camera applications is presented in [55]. Beyond tasks utilizing exclusively 2D data, like vehicle counting, applications requiring 3D information—for instance, vehicle speed measurement and vehicle distance estimation [53, 51, 50]—typically depend on the presumption of a flat ground or predefined size information. Such presumptions and prior knowledge considerably confine the algorithm’s versatility across varied settings. Conversely, leveraging the 3D information from registered point cloud offers a robust solution to these limitations.

III Methodology

III-A Overview

Refer to caption
Figure 2: Our automatic registration method follows a two-step framework. (1) In initial guess estimation pipeline, we first sample camera poses around the rough location and synthesize views, then match this views with camera image using SuperGlue and estimate the initial guess by 2D-3D correspondence. (2) In optimization process, we first extract lines in point cloud and image, then we match these lines and optimize the extrinsics using reprojection errors.
Refer to caption
Figure 3: The purple points represent foreground points, while the green points represent background points. Neighbor rendering filters out the background points and estimate the 3D point Pisubscript𝑃𝑖P_{i}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT corresponding to pixel pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT based on the foreground points.

Given a pair of roadside camera image 𝓘H*W*3𝓘superscript𝐻𝑊3\bm{\mathcal{I}}\in\mathbb{R}^{H*W*3}bold_caligraphic_I ∈ blackboard_R start_POSTSUPERSCRIPT italic_H * italic_W * 3 end_POSTSUPERSCRIPT and a prior point cloud 𝒫N×3𝒫superscript𝑁3\mathcal{P}\in\mathbb{R}^{N\times 3}caligraphic_P ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × 3 end_POSTSUPERSCRIPT with RGB values 𝒞N×3𝒞superscript𝑁3\mathcal{C}\in\mathbb{R}^{N\times 3}caligraphic_C ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × 3 end_POSTSUPERSCRIPT or intensity N×1superscript𝑁1\mathcal{L}\in\mathbb{R}^{N\times 1}caligraphic_L ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × 1 end_POSTSUPERSCRIPT and rough camera position selected from point cloud, our method aims to estimate the grid transformation T=[R,t]SE(3)𝑇𝑅𝑡𝑆𝐸3T=[R,t]\in SE(3)italic_T = [ italic_R , italic_t ] ∈ italic_S italic_E ( 3 ) between point cloud coordinates and camera coordinates, where RSO(3)𝑅𝑆𝑂3R\in SO(3)italic_R ∈ italic_S italic_O ( 3 ) is the rotation matrix and t3𝑡superscript3t\in\mathbb{R}^{3}italic_t ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT is the translation vector.

In this section, we first detail the principles of neighbor rendering as described in Section III-B. Following that, in Section III-C, we propose a pipeline designed to obtain precise initial guess solely based on the selected rough camera position. Lastly, in Section III-D, we discuss optimizing extrinsic parameters by minimizing reprojection errors and effectively extracting line features using SAM. The overview of our framework is shown in Fig. 2.

III-B View Rendering

Each pixel pi(ui,vi)subscript𝑝𝑖subscript𝑢𝑖subscript𝑣𝑖p_{i}(u_{i},v_{i})italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) corresponds to a ray ri(ui,vi)=t*K1[ui,vi,1]Tsubscript𝑟𝑖subscript𝑢𝑖subscript𝑣𝑖𝑡superscript𝐾1superscriptsubscript𝑢𝑖subscript𝑣𝑖1𝑇r_{i}(u_{i},v_{i})=t*K^{-1}[u_{i},v_{i},1]^{T}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_t * italic_K start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT [ italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 1 ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT in the camera coordinates. Without accounting for opacity, only the 3D point that falls on the ray and is closest to the camera can be projected onto the image. However, due to the sparsity of roadside point clouds, lots of rays may not intersect with any 3D points, resulting in holes and bleeding problems (shown in Fig. 3). While previous studies like NPBG [30] and 3DGS [32] have made significant strides in addressing these issues, directly applying these methods to our task primarily faces two challenges: computational cost and the loss of 2D-3D correspondence, as detailed in Section II-B.

Drawing insights from prior research, we propose an efficient rendering method, termed neighbor rendering, to tackle these challenges. Our method is based on two reasonable hypotheses: 1) When deduce the specific point on a certain ray like [45, 46], only the 3D points surrounding this ray will be taken into consideration. 2) Local 3d points can be approximated as belonging to a plane, a hypothesis also utilized in [40]. Specifically, we first project the point cloud onto the image and use the Z-buffer [30] to filter out occluded points. This process not only filters occluded and out-of-view points but also reorganizes the point cloud, which enables retrieving points in the point cloud based on pixel coordinates, thus facilitating parallel processing. For a given pixel pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, we need to infer the points lying on its corresponding ray risubscript𝑟𝑖r_{i}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Since adjacent 3D points are necessarily adjacent in pixel space, according to hypothesis 1, we only need to consider the 3D points retrieved from the N𝑁Nitalic_N neighbor pixels of pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, denote as 𝒬i={Pij3j{0,,N1}}subscript𝒬𝑖conditional-setsuperscriptsubscript𝑃𝑖𝑗superscript3𝑗0𝑁1\mathcal{Q}_{i}=\{P_{i}^{j}\in\mathbb{R}^{3}\mid j\in\{0,\ldots,N-1\}\,\}caligraphic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ∣ italic_j ∈ { 0 , … , italic_N - 1 } }. Considering that these 3D points may not be adjacent in spatial, as shown in Fig. 3, filtering out background points is necessary. We adopt a simple distance filter for background point filtering, as described in Eq. (1). Here, d()𝑑d(\cdot)italic_d ( ⋅ ) represents the distance from a 3D point to the camera and ξ𝜉\xiitalic_ξ is a manually set threshold, f()=1𝑓1f(\cdot)=1italic_f ( ⋅ ) = 1 indicates that the point is considered as a foreground point.

f(Pij)={1ifd(Pij)min{d(Pim)Pim𝒬i}ξ,0others.f(P_{i}^{j})=\left\{\begin{aligned} &1\quad if\ d(P_{i}^{j})-\min\{d(P_{i}^{m}% )\mid P_{i}^{m}\in\mathcal{Q}_{i}\}\leq\xi,\\ &0\quad others.\\ \end{aligned}\right.italic_f ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) = { start_ROW start_CELL end_CELL start_CELL 1 italic_i italic_f italic_d ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) - roman_min { italic_d ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ) ∣ italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ∈ caligraphic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } ≤ italic_ξ , end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL 0 italic_o italic_t italic_h italic_e italic_r italic_s . end_CELL end_ROW (1)

According to hypothesis 2, we fit plane function aix+biy+ciz+di=0subscript𝑎𝑖𝑥subscript𝑏𝑖𝑦subscript𝑐𝑖𝑧subscript𝑑𝑖0a_{i}x+b_{i}y+c_{i}z+d_{i}=0italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x + italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_y + italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_z + italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 with foreground points and regard the intersection point of risubscript𝑟𝑖r_{i}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with this plane (denote as Pisubscript𝑃𝑖P_{i}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) as the corresponding 3D point for pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The appearance value of Pisubscript𝑃𝑖P_{i}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (denote as v(Pi)𝑣subscript𝑃𝑖v(P_{i})italic_v ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )), such as the intensity or RGB values, are obtained by weighting the values of foreground points. Given that points closer to Pisubscript𝑃𝑖P_{i}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT should bear more resemblance to its appearance, inspired by the Inverse Distance Weighting (IDW) [15], we propose the Eq. (2) to compute the weights W()𝑊W(\cdot)italic_W ( ⋅ ).

v(Pi)𝑣subscript𝑃𝑖\displaystyle v(P_{i})italic_v ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) =j=0,f(Pij)=1N1W(Pij)v(Pij)j=0,f(Pij)=1N1W(Pij),absentsubscriptsuperscript𝑁1formulae-sequence𝑗0𝑓superscriptsubscript𝑃𝑖𝑗1𝑊superscriptsubscript𝑃𝑖𝑗𝑣superscriptsubscript𝑃𝑖𝑗subscriptsuperscript𝑁1formulae-sequence𝑗0𝑓superscriptsubscript𝑃𝑖𝑗1𝑊superscriptsubscript𝑃𝑖𝑗\displaystyle=\frac{\sum\limits^{N-1}\limits_{j=0,\,f(P_{i}^{j})=1}W(P_{i}^{j}% )\,v(P_{i}^{j})}{\sum\limits^{N-1}\limits_{j=0,\,f(P_{i}^{j})=1}W(P_{i}^{j})},= divide start_ARG ∑ start_POSTSUPERSCRIPT italic_N - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j = 0 , italic_f ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) = 1 end_POSTSUBSCRIPT italic_W ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) italic_v ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) end_ARG start_ARG ∑ start_POSTSUPERSCRIPT italic_N - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j = 0 , italic_f ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) = 1 end_POSTSUBSCRIPT italic_W ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) end_ARG , (2)
W(Pij)𝑊superscriptsubscript𝑃𝑖𝑗\displaystyle W(P_{i}^{j})italic_W ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) =ξ+min{d(Pim)Pim𝒬i}d(Pij)exp(D(Pij)),absent𝜉𝑚𝑖𝑛conditional-set𝑑superscriptsubscript𝑃𝑖𝑚superscriptsubscript𝑃𝑖𝑚subscript𝒬𝑖𝑑superscriptsubscript𝑃𝑖𝑗𝐷superscriptsubscript𝑃𝑖𝑗\displaystyle=\frac{\xi+min\{d(P_{i}^{m})\mid P_{i}^{m}\in\mathcal{Q}_{i}\}-d(% P_{i}^{j})}{\exp(D(P_{i}^{j}))},= divide start_ARG italic_ξ + italic_m italic_i italic_n { italic_d ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ) ∣ italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ∈ caligraphic_Q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } - italic_d ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) end_ARG start_ARG roman_exp ( italic_D ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) ) end_ARG ,

where D(Pij)𝐷superscriptsubscript𝑃𝑖𝑗D(P_{i}^{j})italic_D ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) represents the distance of Pijsuperscriptsubscript𝑃𝑖𝑗P_{i}^{j}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT to the Pisubscript𝑃𝑖P_{i}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. This equation indicates that points closer to Pisubscript𝑃𝑖P_{i}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT have larger weights, and these weights decay exponentially as the distance increases.

We deploy our rendering algorithm using CUDA, allowing us to generate a 4K image in 20ms on RTX-3050 GPU. The visualization of generated images are available in the appendix.

III-C Initial Guess Estimation

Through the analysis of typical installation configurations for roadside cameras, we observe that these cameras are characterized by an essentially zero roll angle and a slight downward pitch angle, which facilitates the observation of the ground area. The yaw angle, conversely, is determined by the specific direction the camera is aimed at, as demonstrated in Fig. 4. Building on this observation, we assign the camera a zero roll angle and a minor pitch angle. Within the expanse of the panoramic view, eight yaw angles are uniformly sampled and their respective views are generated through neighbor rendering. We then match these generated views with the camera image using SuperGlue [14] until a sufficient number of matched feature points obtained. Leveraging the 2D-3D correspondence preserved by neighbor rendering, we utilize PnP [16] to estimate the rough pose of the camera.

Sampling merely eight yaw directions might result in partial matches between generated and camera images, leading to the clustering of matched feature points in particular regions. As indicated in [3], this uneven distribution of matched features heightens the uncertainty in extrinsic parameters estimation, making the estimation sensitive to environmental changes. Thus, by regenerating view with the roughly estimated pose and matching again, we ensure that the initial guess are estimated under the condition of a uniform distribution of matched feature points.

Refer to captionRefer to caption
Figure 4: The typical installation for roadside cameras. We assume that the camera has a small pitch angle and place pseudo-cameras at different yaw angles to generate views.

III-D Extrinsic Parameters Optimization

III-D1 Line Extraction

Edge extraction from images is well-established, yet deriving edges from point clouds poses greater difficulties. While depth discontinuity [5] is a prevalent technique for edge detection in point cloud, it tends to yield jagged, imprecise edges. [3] pioneered the use of plane intersections for precise edge extraction from point clouds, showing notable efficacy. However, roadside scenes often lack sufficient planar features to meet the algorithm’s requirements. Furthermore, line features in roadside images primarily focus on surface textures like lane markings, which the aforementioned methods cannot directly extract. Given our capability to generate realistic views and establish 2D-3D correspondences, it is possible to extract 2D line features from the generated images and then derive their corresponding 3D line features. However, traditional edge detection algorithm [17] necessitate parameter tuning for optimal extraction, yet finding a unified setting for these parameters across heterogeneous image is difficult. Therefore, we leverage the robust generalization capabilities of the powerful vision model SAM [33] to facilitate precise extraction of line features in both generated and camera images.

III-D2 Optimization

To align point cloud edges with those in images, we uniformly sample multiple points PLisuperscriptsubscript𝑃𝐿𝑖P_{L}^{i}italic_P start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT from each edge in the point cloud. These points are then projected onto the image using current extrinsic estimate T¯¯𝑇\bar{T}over¯ start_ARG italic_T end_ARG and camera intrinsic parameters K𝐾Kitalic_K.

pi=KT¯(PLi).subscript𝑝𝑖𝐾¯𝑇superscriptsubscript𝑃𝐿𝑖p_{i}=K\cdot\bar{T}(P_{L}^{i}).italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_K ⋅ over¯ start_ARG italic_T end_ARG ( italic_P start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) . (3)

For each pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, we search for its M𝑀Mitalic_M nearest pixels 𝒮i={qij2j{0,,M1}}subscript𝒮𝑖conditional-setsuperscriptsubscript𝑞𝑖𝑗superscript2𝑗0𝑀1\mathcal{S}_{i}=\{q_{i}^{j}\in\mathbb{R}^{2}\mid j\in\{0,\ldots,M-1\}\,\}caligraphic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∣ italic_j ∈ { 0 , … , italic_M - 1 } } in the k𝑘kitalic_k-D tree formed by image edge pixels. The line formed by 𝒮isubscript𝒮𝑖\mathcal{S}_{i}caligraphic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the image edge corresponding to pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. To simplify computation, we parameterize the line as qi0superscriptsubscript𝑞𝑖0q_{i}^{0}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT with the norm vector nisubscript𝑛𝑖n_{i}italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, which is derived from the minimum eigenvector obtained through through principal component analysis on 𝒮isubscript𝒮𝑖\mathcal{S}_{i}caligraphic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT .

The extrinsic parameters are optimized by minimizing the projection error between pisubscript𝑝𝑖p_{i}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and qi0superscriptsubscript𝑞𝑖0q_{i}^{0}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT along nisubscript𝑛𝑖n_{i}italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, which can be formulated as follows:

T*=argmin𝑇i=1n(niT(KT(PiL)qi0)2)=argminR,ti=1n(niT((KR(PiL)+t)qi0)2).superscript𝑇𝑇subscriptsuperscript𝑛𝑖1superscriptnormsuperscriptsubscript𝑛𝑖𝑇𝐾𝑇subscriptsuperscript𝑃𝐿𝑖superscriptsubscript𝑞𝑖02𝑅𝑡subscriptsuperscript𝑛𝑖1superscriptnormsuperscriptsubscript𝑛𝑖𝑇𝐾𝑅subscriptsuperscript𝑃𝐿𝑖𝑡superscriptsubscript𝑞𝑖02\begin{split}T^{*}&=\arg\underset{T}{\min}\sum\limits^{n}\limits_{i=1}\left(||% n_{i}^{T}(K\cdot T(P^{L}_{i})-q_{i}^{0})||^{2}\right)\\ &=\arg\underset{R,t}{\min}\sum\limits^{n}\limits_{i=1}\left(||n_{i}^{T}((K% \cdot R(P^{L}_{i})+t)-q_{i}^{0})||^{2}\right).\end{split}start_ROW start_CELL italic_T start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_CELL start_CELL = roman_arg underitalic_T start_ARG roman_min end_ARG ∑ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT ( | | italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_K ⋅ italic_T ( italic_P start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) | | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = roman_arg start_UNDERACCENT italic_R , italic_t end_UNDERACCENT start_ARG roman_min end_ARG ∑ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT ( | | italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( ( italic_K ⋅ italic_R ( italic_P start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_t ) - italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) | | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) . end_CELL end_ROW (4)

To address the constraints introduced by the rotation matrix, we can utilize the transformation between Lie groups and Lie algebras:

R=E(θ)=exp(θ)=I+sin(θ)θ+(1cos(θ))θ2.𝑅𝐸𝜃superscript𝜃𝐼𝜃superscript𝜃1𝜃superscript𝜃2\begin{split}R&=E(\theta)=\exp(\theta^{\wedge})\\ &=I+\sin(\theta)\theta^{\wedge}+(1-\cos(\theta))\theta^{\wedge 2}.\end{split}start_ROW start_CELL italic_R end_CELL start_CELL = italic_E ( italic_θ ) = roman_exp ( italic_θ start_POSTSUPERSCRIPT ∧ end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = italic_I + roman_sin ( italic_θ ) italic_θ start_POSTSUPERSCRIPT ∧ end_POSTSUPERSCRIPT + ( 1 - roman_cos ( italic_θ ) ) italic_θ start_POSTSUPERSCRIPT ∧ 2 end_POSTSUPERSCRIPT . end_CELL end_ROW (5)

where θ𝜃\thetaitalic_θ represents the Lie algebraic representation of the rotation matrix, and θsuperscript𝜃\theta^{\wedge}italic_θ start_POSTSUPERSCRIPT ∧ end_POSTSUPERSCRIPT denotes the corresponding skew-symmetric matrix. By transforming the optimization problem to the Lie algebra, the number of parameters to be optimized reduces to six, and the constraints are eliminated. The optimization process can be solved recursively. Let T¯¯𝑇\bar{T}over¯ start_ARG italic_T end_ARG be the current extrinsic estimate, using the operator \boxplus encapsulated in the Special Euclidean group SE(3)𝑆𝐸3SE(3)italic_S italic_E ( 3 ) [18], we can parameterize the target extrinsic T𝑇Titalic_T as:

T𝑇\displaystyle T\;italic_T T¯SE(3)δTexp(δT)T¯,absentsubscript𝑆𝐸3¯𝑇𝛿𝑇𝛿𝑇¯𝑇\displaystyle\leftarrow\;\bar{T}\boxplus_{SE(3)}\delta T\triangleq\exp(\delta T% )\cdot\bar{T},← over¯ start_ARG italic_T end_ARG ⊞ start_POSTSUBSCRIPT italic_S italic_E ( 3 ) end_POSTSUBSCRIPT italic_δ italic_T ≜ roman_exp ( italic_δ italic_T ) ⋅ over¯ start_ARG italic_T end_ARG , (6)
δT𝛿𝑇\displaystyle\delta Titalic_δ italic_T =(δθ,δt)6.absent𝛿𝜃𝛿𝑡superscript6\displaystyle=(\delta\theta,\delta t)\in\mathbb{R}^{6}.= ( italic_δ italic_θ , italic_δ italic_t ) ∈ blackboard_R start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT .

Given the objective function defined in Eq. (4) and the operator \boxplus, we can employ ceres [25] to efficiently solve the problem.

IV Experiments

IV-A Dataset Preparation

We validate our method on a self-collected dataset. As shown in Fig. 1, the prior point cloud is acquired using a DJI-M300 drone, equipped with the Zenmuse L1 module, alongside Hikvision cameras mounted on roadside poles for image capture. The cameras’ intrinsic parameters are calibrated using a checkerboard method [1]. Data are gathered from eight unique areas, each providing two pairs of point cloud-image datasets, and all point clouds are aligned within a consistent global coordinate system. For each area, one pair of point cloud-image is designated for evaluation, while the other include cones evenly distributed within the camera’s field of view (FoV) to facilitate the acquisition of pseudo ground truth, as shown in Fig. 5.

To obtain the ground truth of the extrinsic parameters, We manually annotated 2D-3D corresponding points of clearly visible cones and minimized the reprojection error. The accuracy of this transformations are validated by projecting the point clouds onto the images, which are then used as pseudo ground truth.

Refer to captionRefer to caption
Figure 5: Left is cones in camera images, right is the corresponding point clouds, the points size has been enlarged for better visualization.
Refer to captionRefer to captionRefer to caption
Figure 6: The qualitative registration results. The green lines represent the lines in the point cloud projected onto the camera plane using the estimated extrinsics. Visualization results for all scenarios can be found in the appendix.
TABLE I: Quantitative Registration Results on Dataset
Ours Continue Edge[3]
  Scenes Trans.(m)\downarrow Rot.({}^{\circ}start_FLOATSUPERSCRIPT ∘ end_FLOATSUPERSCRIPT) \downarrow Init. Guess Trans.(m)\downarrow Rot.({}^{\circ}start_FLOATSUPERSCRIPT ∘ end_FLOATSUPERSCRIPT)\downarrow
1 0.095 0.219 0.157 0.375
2 0.094 0.128 0.684 0.487
3 0.045 0.102 0.462 0.193
4 0.108 0.273 0.709 1.237
5 0.162 0.302 1.098 1.455
6 0.099 0.189 0.416 0.601
7 0.027 0.196 0.088 0.270
8 0.006 0.207 0.096 0.390
Avg. 0.079 0.202 8/8 0.464 0.602
TABLE II: Ground Distance Evaluation on Dataset
Methods Max Error(%percent\%%)\downarrow Median Error(%percent\%%)\downarrow RMSE(%percent\%%)\downarrow
OptInOpt [50] 15.78 10.80 12.87
PlaneCalib [51] 14.32 9.23 11.69
DeepVPCalib [52] 12.17 8.11 10.62
Revaud et al. [53] 14.87 10.91 12.54
Vuong et al. [43] 6.75 3.22 4.68
Ours 4.75 1.16 1.67
Refer to captionRefer to caption
Figure 7: Examples of bad scenes. The scene on the left lacks sufficient features, while on the right, dense foliage obstructs the drone’s ability to capture the point cloud.
TABLE III: Ablation Experiments Results
Mean Errors
Using RGB Neighbor Rending Trans.(m)\downarrow Rot.({}^{\circ}start_FLOATSUPERSCRIPT ∘ end_FLOATSUPERSCRIPT)\downarrow
0.796*{}^{*}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT 0.477*{}^{*}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT
0.133 0.247
0.079 0.202
*{}^{\mathrm{*}}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPTOnly five successful scenarios were included in the statistics.

IV-B Evaluation Metrics

The assessment of extrinsic parameters involves quantifying the mean translation error across the three axes and mean rotation error for roll, pitch, and yaw. Furthermore, we incorporate ground distance measurements as utilized in roadside camera calibration methods [43], to evaluate the accuracy of our extrinsic parameters estimates. Specifically, we utilize the distance disubscript𝑑𝑖d_{i}italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT between cones in points cloud as the ground truth and estimate these distances using the corresponding pixel positions, denote as d¯isubscript¯𝑑𝑖\bar{d}_{i}over¯ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The normalized distance measurement error is calculated as ri=|d¯idi|disubscript𝑟𝑖subscript¯𝑑𝑖subscript𝑑𝑖subscript𝑑𝑖r_{i}=\frac{|\bar{d}_{i}-d_{i}|}{d_{i}}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG | over¯ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG start_ARG italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG. It’s worth noting that in [43], d¯isubscript¯𝑑𝑖\bar{d}_{i}over¯ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is estimated using the distance of intersection points of rays and ground plane. Since we are registering image with point cloud rather than the ground plane, we use the distance of 3D points corresponding to the pixels to estimated d¯isubscript¯𝑑𝑖\bar{d}_{i}over¯ start_ARG italic_d end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

IV-C Results and Analysis

Table I presents the registration results of our method in eight different scenes, which achieves an average translation error of 0.079m0.079𝑚0.079m0.079 italic_m and a rotation error of 0.202superscript0.2020.202^{\circ}0.202 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT. The maximum translation error does not exceed 0.17m0.17𝑚0.17m0.17 italic_m, and the maximum rotation error is around 0.3superscript0.30.3^{\circ}0.3 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT, demonstrating the accuracy and robustness of our approach. Fig. 6 visualizes the registration results of scenes 4 (left) and 5 (middle). To replicate errors in manually selecting camera positions in point cloud, we added a ±2mplus-or-minus2𝑚\pm 2m± 2 italic_m perturbation to the displacement vector of pseudo ground truth, treating this as the camera’s position for further testing. Our experiments show that our method robustly handles this variation, automatically estimating initial guess across all tested scenes.

Given that the preprocessing and initial guess estimation processes outlined in [4] are ineffective on our dataset, we adopt the SOTA automatic line-based calibration method [3] as our baseline. We add a disturbance(±0.5mplus-or-minus0.5𝑚\pm 0.5m± 0.5 italic_m for translation and ±1plus-or-minussuperscript1\pm 1^{\circ}± 1 start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT for rotation) to ground truth as initial guess for [3]. As shown in Table I, [3] achieves satisfactory registration in some scenes, its performance varies significantly across different scenes due to a scarcity of rich planar features. It’s noteworthy that both our method and [3] demonstrate relatively minimal errors in rotation. This is because roadside cameras have a wide observation range, and minor angular deviations may be magnified over distances, making the errors more detectable and more amenable to optimization by the algorithm.

Table II displays the performance of our method in ground distance measurement on the self-collected dataset. We evaluate from the perspectives of maximum error, median error, and root-mean-squared error (RMSE, in %percent\%%). Compared to SOTA method [43], our distance measurement results show significant improvement across all evaluation metrics, indicating the precision of the registration method.

In extreme conditions where ground texture features are minimal and excessive tree density obstructs drones from capturing point clouds effectively, our method may fail. In Fig. 7, we illustrate scenarios where our method fails to function properly. In the depicted left scene, the road surface lacks sufficient texture features and the absence of features in the vicinity making it difficult to obtain enough feature points for matching. The image on the right shows the camera’s left FoV being obstructed, with features predominantly centralized. Moreover, the dense foliage of trees complicates the acquisition of point clouds, resulting in substantial registration inaccuracies.

IV-D Ablation studies

We investigated the impact of neighbor rendering module and appearance of point clouds to the overall method. Table III presents our results. Using intensity information instead of RGB values only results in a slight decrease in registration accuracy, thanks to SuperGlue’s robust capability in matching heterogeneous images. Omitting neighbor rendering from image generating resulted in successful registration in just five of eight scenarios, accompanied by a significant drop in accuracy. This issue arises from distant points in the upper part of the image forming dense pixels, whereas closer points at the bottom don’t without neighboring rendering, causing match concentration in the upper area and biasing extrinsic parameter estimation towards a local optimum.

To evaluate the impact of point cloud density on different rendering methods, we conducted tests assessing their sensitivity to density variations. We observe that direct projection struggles to produce matchable images across all scenes when the point cloud is downsampled by a 3 cm radius. Conversely, incorporating neighbor rendering allows the method to function effectively, even with a 8 cm downsampling radius, maintaining registration precision. Post-downsampling registration outcomes are depicted to the right of Fig. 6, with rendering details in the appendix.

V Applications

We apply the registration framework to roadside 3D object detection. We initially employ YOLOv7 [48] for 2D object detection, estimating the bottom centers of detection boxes as the interface points between vehicles’ rear ends and the ground. Using the pixel coordinates of these points, we can retrieve the corresponding 3D points from the registered point cloud, which serves as the 3D position of the target for 3D object detection.

Refer to caption
Figure 8: Comparing distance measurements from different methods, the ground truth is provided by the GPS-RTK equipped on the vehicle.

We utilize a GPS-RTK-equipped vehicle, time-synchronized with the roadside camera, to provide positional ground truth. The distance from the GPS device to the vehicle’s rear is manually gauged. Considering the limitation of GPS to only two-dimensional (x and y coordinates) data, we simplify the evaluation by transforming 3D detection positions into distances from the target to the camera, adopting the distance measurement error from Section IV-B as our evaluative metric. Owing to spatial constraints, details on evaluation equipment and object detection result visualizations are provided in the appendix.

Fig. 8 depicts the distance measurements results of our method. Within a 250-meter detection range, most of the errors between our measurements and the ground truth is controlled within approximately 1.5%percent1.51.5\%1.5 %. The notable error at 50 meters is due to approximation error arising from using the 2D bbox’s bottom center to estimate the vehicle’s rear end ground contact point at close distances, but this approximation error decreasing as the distance increases. Road surface irregularities cause fitting plane methods to incur substantial errors, which increase with distance. Conversely, the registered point cloud effectively mitigates this issue. This results demonstrates the significance of fusing prior point clouds with images for visual-based detection tasks and also proves the precision of our registration method.

VI Conclusion

This paper presents an automated approach for registering prior point clouds with roadside camera images. We introduce an efficient rendering method aimed at minimizing discrepancies between generated images and actual camera photos to enhance registration. Moreover, we outline a workflow capable of precisely estimating initial guess using merely approximate camera locations, markedly decreasing the registration process’s dependence on initial configurations. We achieving a rotation accuracy of 0.202{}^{\circ}start_FLOATSUPERSCRIPT ∘ end_FLOATSUPERSCRIPT and a translation precision of 0.079m in self-collected dataset.

In the future, we plan to investigate our methodology’s applicability in varied contexts, including indoor scenes and tunnels, and to enhance performance by combining our approach with advanced rendering methods.

APPENDIX

VI-A Visualization of Neighbor Rendering

Without NR

Refer to caption
Refer to caption

With NR

Refer to caption Radius=3cm3𝑐𝑚3cm3 italic_c italic_m
Refer to caption Radius=8cm8𝑐𝑚8cm8 italic_c italic_m
Figure A1: The top row displays images generated by directly projecting the point cloud and filtering occluded points, while the bottom row shows images generated using neighbor rendering (NR). Utilize RGB information to generate grayscale values.

Fig. A1 demonstrates the effects of different rendering methods on point clouds of varying densities. It is evident that when the point cloud downsampling radius is 3cm3𝑐𝑚3cm3 italic_c italic_m, the method of direct projection struggles to form clear images, with white holes occupying a significant portion. Additionally, within the red box, road surfaces that should be obscured by trees are visible due to these holes, a phenomenon known as the bleeding problem. Employing neighbor rendering effectively addresses these issues, creating views as realistic as the original camera images. At an 8cm8𝑐𝑚8cm8 italic_c italic_m downsampling radius, directly projecting the point cloud fails to convey any meaningful information, whereas neighbor rendering still maintains clear and visible textures, effectively mitigating the bleeding problem.

VI-B Experiments Results and Scenes Visualization

Refer to captionRefer to captionRefer to captionRefer to captionRefer to captionRefer to captionRefer to captionRefer to caption
Figure A2: From left to right, and top to bottom, are the visualizations of scenes 1 to 8, in sequence. The green lines represent the results of projecting the edges from the point cloud onto the image using the registered extrinsic parameters.

In Fig. A2, we present the scenes utilized in our experiments alongside the visualizations of their registration outcomes. Our algorithm consistently achieve accurate registration across all eight scenes. Specifically, in Scene 3, the inability of several green circles on the left to align with image objects stems from dense foliage obstructing the drone’s capacity to capture point clouds under the trees. Nevertheless, our technique adeptly registers point clouds with images, highlighting the resilience and effectiveness of our methodology.

VI-C 3D Object Detection Application

Refer to captionRefer to caption
Figure A3: Left: Vehicle equipped with GPS-RTK to provide ground truth. Right: visualization of 3D object detection results

We utilize a vehicle outfitted with GPS-RTK to obtain 3D positional ground truth. The visualization of 3D object detection outcomes is shown in the right panel of Fig A3. Given that GPS only yields xy𝑥𝑦xyitalic_x italic_y coordinates as ground truth, our analysis is conducted from a Bird’s Eye View (BEV), concentrating exclusively on xy𝑥𝑦xyitalic_x italic_y results and omitting z𝑧zitalic_z values.

We identify the contact point between vehicles and the ground using the center of the bottom edge of the green 2D detection boxes depicted in the right side of Fig. A3. The 3D point in the registered point cloud corresponding to this pixel is designated as the target’s 3D location. The two-dimensional coordinates shown in Fig. A3 indicate the xy𝑥𝑦xyitalic_x italic_y position of this 3D point.

References

  • [1] J. Cui, J. Niu, Z. Ouyang, Y. He, and D. Liu, “Acsc: Automatic calibration for non-repetitive scanning solid-state lidar and camera systems,” arXiv preprint arXiv:2011.08516, 2020.
  • [2] S. Chen, J. Liu, X. Liang, S. Zhang, J. Hyyppä, and R. Chen, “A novel calibration method between a camera and a 3d lidar with infrared images,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 4963–4969.
  • [3] C. Yuan, X. Liu, X. Hong, and F. Zhang, “Pixel-level extrinsic self calibration of high resolution lidar and camera in targetless environments,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7517–7524, 2021.
  • [4] K. Koide, S. Oishi, M. Yokozuka, and A. Banno, “General, single-shot, target-less, and automatic lidar-camera extrinsic calibration toolbox,” arXiv preprint arXiv:2302.05094, 2023.
  • [5] X. Zhang, S. Zhu, S. Guo, J. Li, and H. Liu, “Line-based automatic extrinsic calibration of lidar and camera,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 9347–9353.
  • [6] M. Á. Muñoz-Bañón, F. A. Candelas, and F. Torres, “Targetless camera-lidar calibration in unstructured environments,” IEEE Access, vol. 8, pp. 143 692–143 705, 2020.
  • [7] X. Li, Y. Xiao, B. Wang, H. Ren, Y. Zhang, and J. Ji, “Automatic targetless lidar–camera calibration: a survey,” Artificial Intelligence Review, pp. 1–39, 2022.
  • [8] R. Ishikawa, T. Oishi, and K. Ikeuchi, “Lidar and camera calibration using motions estimated by sensor fusion odometry,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 7342–7349.
  • [9] Y. Zhu, C. Zheng, C. Yuan, X. Huang, and X. Hong, “Camvox: A low-cost and accurate lidar-assisted visual slam system,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 5049–5055.
  • [10] L. Di Giammarino, I. Aloise, C. Stachniss, and G. Grisetti, “Visual place recognition using lidar intensity information,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2021, pp. 4382–4389.
  • [11] Y. Lin, C. Wang, J. Cheng, B. Chen, F. Jia, Z. Chen, and J. Li, “Line segment extraction for large scale unorganized point clouds,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 102, pp. 172–183, 2015.
  • [12] D. Zhang, B. Fang, W. Yang, X. Luo, and Y. Tang, “Robust inverse perspective map** based on vanishing point,” in Proceedings of the International Conference on Security, Pattern Analysis, and Cybernetics (SPAC).   IEEE, 2014, pp. 458–463.
  • [13] S. Khan, D. Wollherr, and M. Buss, “Modeling laser intensities for simultaneous localization and map**,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 692–699, 2016.
  • [14] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938–4947.
  • [15] I. Ashraf, S. Hur, and Y. Park, “An investigation of interpolation techniques to generate 2d intensity image from lidar data,” IEEE Access, vol. 5, pp. 8250–8260, 2017.
  • [16] V. Lepetit, F. Moreno-Noguer, and P. Fua, “Ep n p: An accurate o (n) solution to the p n p problem,” International journal of computer vision, vol. 81, pp. 155–166, 2009.
  • [17] J. Canny, “A computational approach to edge detection,” IEEE Transactions on pattern analysis and machine intelligence, no. 6, pp. 679–698, 1986.
  • [18] C. Hertzberg, R. Wagner, U. Frese, and L. Schröder, “Integrating generic sensor fusion algorithms with sound state representations through encapsulation of manifolds,” Information Fusion, vol. 14, no. 1, pp. 57–77, 2013.
  • [19] T. Ma, Z. Liu, G. Yan, and Y. Li, “Crlf: Automatic calibration and refinement based on line feature for lidar and camera in road scenes,” arXiv preprint arXiv:2103.04558, 2021.
  • [20] Q. Zhang and R. Pless, “Extrinsic calibration of a camera and laser range finder (improves camera calibration),” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2004, pp. 2301–2306.
  • [21] G. Pandey, J. McBride, S. Savarese, and R. Eustice, “Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 26, no. 1, 2012, pp. 2053–2059.
  • [22] S. Katz, A. Tal, and R. Basri, “Direct visibility of point sets,” in ACM SIGGRAPH 2007 papers, 2007, pp. 24–es.
  • [23] C. Zheng, Q. Zhu, W. Xu, X. Liu, Q. Guo, and F. Zhang, “Fast-livo: Fast and tightly-coupled sparse-direct lidar-inertial-visual odometry,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2022, pp. 4003–4009.
  • [24] G. Yan, Z. Liu, C. Wang, C. Shi, P. Wei, X. Cai, T. Ma, Z. Liu, Z. Zhong, Y. Liu, et al., “Opencalib: A multi-sensor calibration toolbox for autonomous driving,” Software Impacts, vol. 14, p. 100393, 2022.
  • [25] S. Agarwal, K. Mierle, and T. C. S. Team, “Ceres Solver,” 3 2022. [Online]. Available: https://github.com/ceres-solver/ceres-solver
  • [26] X. Ye, M. Shu, H. Li, Y. Shi, Y. Li, G. Wang, X. Tan, and E. Ding, “Rope3d: The roadside perception dataset for autonomous driving and monocular 3d object detection task,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21 341–21 350.
  • [27] B. Ravi Kiran, L. Roldao, B. Irastorza, R. Verastegui, S. Suss, S. Yogamani, V. Talpaert, A. Lepoutre, and G. Trehard, “Real-time dynamic object detection for autonomous driving using prior 3d-maps,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0.
  • [28] J. Zhou, B. Ma, W. Zhang, Y. Fang, Y.-S. Liu, and Z. Han, “Differentiable registration of images and lidar point clouds with voxelpoint-to-pixel matching,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [29] D. Cattaneo, M. Vaghi, A. L. Ballardini, S. Fontana, D. G. Sorrenti, and W. Burgard, “Cmrnet: Camera to lidar-map registration,” in 2019 IEEE intelligent transportation systems conference (ITSC).   IEEE, 2019, pp. 1283–1289.
  • [30] K.-A. Aliev, A. Sevastopolsky, M. Kolos, D. Ulyanov, and V. Lempitsky, “Neural point-based graphics,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16.   Springer, 2020, pp. 696–712.
  • [31] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
  • [32] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, 2023.
  • [33] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  • [34] H. Yu, W. Zhen, W. Yang, J. Zhang, and S. Scherer, “Monocular camera localization in prior lidar maps with 2d-3d line correspondences,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 4588–4594.
  • [35] M. Feng, S. Hu, M. H. Ang, and G. H. Lee, “2d3d-matchnet: Learning to match keypoints across 2d image and 3d point cloud,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 4790–4796.
  • [36] S. Ren, Y. Zeng, J. Hou, and X. Chen, “Corri2p: Deep image-to-point cloud registration via dense correspondence,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 3, pp. 1198–1208, 2022.
  • [37] B. Kovač and B. Žalik, “Visualization of lidar datasets using point-based rendering technique,” Computers & Geosciences, vol. 36, no. 11, pp. 1443–1450, 2010.
  • [38] M. Levoy and T. Whitted, “The use of points as a display primitive,” 1985.
  • [39] J. P. Grossman and W. J. Dally, “Point sample rendering,” in Rendering Techniques’ 98: Proceedings of the Eurographics Workshop in Vienna, Austria, June 29—July 1, 1998 9.   Springer, 1998, pp. 181–192.
  • [40] H. Pfister, M. Zwicker, J. Van Baar, and M. Gross, “Surfels: Surface elements as rendering primitives,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques, 2000, pp. 335–342.
  • [41] P. Giannakeris, V. Kaltsa, K. Avgerinakis, A. Briassouli, S. Vrochidis, and I. Kompatsiaris, “Speed estimation and abnormality detection from surveillance cameras,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 93–99.
  • [42] G. Pascoe, W. Maddern, and P. Newman, “Direct visual localisation and calibration for road vehicles in changing city environments,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 9–16.
  • [43] K. Vuong, R. Tamburo, and S. G. Narasimhan, “Toward planet-wide traffic camera calibration,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 8553–8562.
  • [44] V.-H. Do, L.-H. Nghiem, N. Pham Thi, and N. Pham Ngoc, “A simple camera calibration method for vehicle velocity estimation,” in 2015 12th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2015, pp. 1–5.
  • [45] S. D. Roth, “Ray casting for modeling solids,” Computer graphics and image processing, vol. 18, no. 2, pp. 109–144, 1982.
  • [46] M. Hadwiger, C. Sigg, H. Scharsach, K. Bühler, and M. H. Gross, “Real-time ray-casting and advanced shading of discrete isosurfaces,” in Computer graphics forum, vol. 24, no. 3.   Citeseer, 2005, pp. 303–312.
  • [47] X. **g, F. Han, X. Ding, Y. Wang, and R. Xiong, “Intrinsic and extrinsic calibration of roadside lidar and camera,” in 2022 China Automation Congress (CAC).   IEEE, 2022, pp. 2367–2372.
  • [48] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 7464–7475.
  • [49] Y. Sun, J. Li, Y. Wang, X. Xu, X. Yang, and Z. Sun, “Atop: An attention-to-optimization approach for automatic lidar-camera calibration via cross-modal object matching,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 696–708, 2022.
  • [50] V. Bartl and A. Herout, “Optinopt: Dual optimization for automatic camera calibration by multi-target observations,” in 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019, pp. 1–8.
  • [51] V. Bartl, R. Juránek, J. Špaňhel, and A. Herout, “Planecalib: Automatic camera calibration by multiple observations of rigid objects on plane,” in 2020 Digital Image Computing: Techniques and Applications (DICTA), 2020, pp. 1–8.
  • [52] V. Kocur and M. Ftáčnik, “Traffic camera calibration via vehicle vanishing point detection,” in Artificial Neural Networks and Machine Learning – ICANN 2021, I. Farkaš, P. Masulli, S. Otte, and S. Wermter, Eds.   Cham: Springer International Publishing, 2021, pp. 628–639.
  • [53] J. Revaud and M. Humenberger, “Robust automatic monocular vehicle speed estimation for traffic surveillance,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 4551–4561.
  • [54] X. Lv, B. Wang, Z. Dou, D. Ye, and S. Wang, “Lccnet: Lidar and camera self-calibration using cost volume network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2021, pp. 2894–2901.
  • [55] X. Zhang, Y. Feng, P. Angeloudis, and Y. Demiris, “Monocular visual traffic surveillance: A review,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 14 148–14 165, 2022.
  • [56] J. Lin and F. Zhang, “R3live: A robust, real-time, rgb-colored, lidar-inertial-visual tightly-coupled state estimation and map** package,” in 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 10 672–10 678.