Rendering-Enhanced Automatic Image-to-Point Cloud Registration for Roadside Scenes

Yu Sheng

{}^{1}

, Lu Zhang

{}^{2}

, Xingchen Li

{}^{1}

, Yifan Duan

{}^{1}

, Yanyong Zhang

{}^{1}

, Yu Zhang

{}^{1}

, and Jianmin Ji

{}^{1,2,{\dagger}}

{}^{1}

School of Computer Science and Technology, University of Science and Technology of China (USTC), Hefei 230026, China

{}^{2}

Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, Anhui, China

{}^{\dagger}

Corresponding author. [email protected]

Abstract

Prior point cloud provides 3D environmental context, which enhances the capabilities of monocular camera in downstream vision tasks, such as 3D object detection, via data fusion. However, the absence of accurate and automated registration methods for estimating camera extrinsic parameters in roadside scene point clouds notably constrains the potential applications of roadside cameras. This paper proposes a novel approach for the automatic registration between prior point clouds and images from roadside scenes. The main idea involves rendering photorealistic grayscale views taken at specific perspectives from the prior point cloud with the help of their features like RGB or intensity values. These generated views can reduce the modality differences between images and prior point clouds, thereby improve the robustness and accuracy of the registration results. Particularly, we specify an efficient algorithm, named neighbor rendering, for the rendering process. Then we introduce a method for automatically estimating the initial guess using only rough guesses of camera’s position. At last, we propose a procedure for iteratively refining the extrinsic parameters by minimizing the reprojection error for line features extracted from both generated and camera images using Segment Anything Model (SAM). We assess our method using a self-collected dataset, comprising eight cameras strategically positioned throughout the university campus. Experiments demonstrate our method’s capability to automatically align prior point cloud with roadside camera image, achieving a rotation accuracy of 0.202 ${}^{\circ}$ and a translation precision of 0.079m. Furthermore, we validate our approach’s effectiveness in visual applications by substantially improving monocular 3D object detection performance.

I Introduction

Image-to-point cloud registration seeks to identify a rigid transformation for aligning a 3D point cloud with a 2D image. This process, a cornerstone of multi-modal fusion, has attracted considerable attention in domains like autonomous driving and computer vision. Integrating prior point clouds with images has been proven to enhance multiple downstream tasks in roadside environments, such as vehicle speed measurements [43] and 3D object detection [27, 26]. Unlike the application fields of LiDAR-camera calibration [3, 4] and visual localization [29, 28], the registration of roadside camera images with prior point clouds aims at finding the fixed cameras’ poses to these point clouds, a task that is currently heavily reliant on manual techniques for its accomplishment [26].

Refer to caption — Figure 1: Our method registers drone-collected point clouds with roadside camera images. This registration enhances monocular algorithm performance for downstream roadside tasks.

Given the distinct classifications of roadside environments, the direct usage of existing automatic registration methods may encounter a range of challenges. Firstly, the prior point cloud and image are often captured at different time and within different coordinate systems. This results in a variation in the relative poses between different roadside cameras and point clouds, complicating the acquisition of precise initial guess needed by classical registration methods [3, 5]. While learning-based [54, 29] approaches have the potential to mitigate the dependency on initial guess through scene comprehension, acquiring a diverse and sufficient dataset for training is difficult in roadside settings. The recent work [4] manages to lessen the dependency of the registration algorithm on initial guess by transforming point clouds into images, thereby reducing the modal discrepancy between the two data. However, [4] struggles to address the unpredictable extrinsic parameters in roadside scenarios. Moreover, due to sparse and uneven distribution of roadside point clouds, directly projecting point clouds onto images as done in [4, 9] can lead to issues such as the background being visible through the front surface (so-called bleeding problem), making it challenging to generate effective views for feature matching. This constitutes the second challenge. Although numerous methods, including NPBG [30] and 3DGS [32], have been developed to generate new views from point clouds with remarkable results, directly applying these methods to roadside point clouds with a vast number of points would incur an unbearable computational cost. Furthermore, these methods often overlook the 2D-3D correspondence during the rendering process, a key component necessary for accurate registration.

To address the aforementioned challenges, we propose an automatic registration method for roadside camera images and prior point clouds. We introduce neighbor rendering, an efficient approach for generating realistic views from point cloud while preserved 2D-3D correspondence, ideal for registration. Leveraging neighbor rendering, we propose a initial guess estimation framework that involves pose sampling near rough camera location selected from point cloud and view generating, followed by utilizing SuperGlue [14] to establish 2D-3D. Finally, we employ state-of-the-art (SOTA) segmentation model SAM [33] to extract line features from generated images and obtain corresponding line features in the point cloud through the 2D-3D correspondence provided by neighbor rendering. The extrinsic parameters are optimized by minimizing the reprojection error. We apply the registered image-point cloud pairs to roadside 3D object detection tasks, achieving remarkable performance and thus proving the effectiveness of our entire process. It is noteworthy that, although our point clouds originate from DJI drones, our method seamlessly integrates with pre-existing maps produced by SLAM methods like R ${}^{3}$ LIVE [56]. Our main contributions are as follows:

•

We propose a framework for automated image-to-point cloud registration in roadside scenes, utilizing view generated from point clouds to diminish data modality gaps, achieving SOTA performance in real-world roadside environments.
•

We introduce an efficient rendering method and utilize SAM to accurately extract edges from heterogeneous images, thereby enhancing registration accuracy.
•

We apply our method to roadside 3D detection task to demonstrate the framework’s practical effectiveness.

II Related Work

II-A Image-to-Point Cloud registration

II-A1 Generic Image-to-Point Cloud registration

Camera-LiDAR extrinsic calibration and visual localization are two key applications of image-to-point cloud registration. Existing calibration methods between LiDAR and camera can be categorized into two types: target-based methods and target-less methods [7]. Target-based methods typically use checkboards [20] to establish the correspondence between LiDAR and images, while target-less methods [3, 5] often relies on line or point features to achieve this objective. [4] optimize extrinsic parameters by minimizing the information discrepancy between LiDAR intensity images and grayscale images. The same idea is applied in visual localization as well [42]. As neural network technology advances, [35, 36, 49, 54] leverages networks for identifying the correlations between 2D and 3D features, with subsequent refinement achieved through optimization with either the Perspective-n-Point (PnP) or Particle Swarm Optimization (PSO) algorithms. Another research trajectory [28] delves into the differentiability aspects of PnP, facilitating the end-to-end training of networks.

II-A2 Roadside Image-to-Point Cloud registration

In the camera-world calibration procedure for the Rope3D dataset [26], high-definition maps (HD Maps) containing lane and crosswalk endpoints are first projected onto 2D images to obtain the initial transformation, followed by bundle adjustment optimization. [47] utilizes manual matching of 2D-3D features followed by optimization using distance transform. [43] utilizes panoramic images for point cloud reconstruction and aligns roadside camera images with the reconstructed point cloud. While these methods achieve excellent results, the need for manual intervention or obtaining panoramic images adds complexity to deployment.

II-B View Rendering

Rendering views from point clouds is primarily achieved through two techniques: points-based rendering and meshing. Meshing facilitates the creation of high-quality images; however, it necessitates complex preprocessing step [37]. The concept of utilizing points as rendering primitives was initially introduced by [38, 39], but it leads to bleeding problem. Ray casting series [45, 46] leverage surrounding points (volumes) to deduce the specific point on the ray, resulting in enhanced visual outcomes. However, discerning spatial relationships in vast, unordered point clouds requires substantial computational cost. [40] tackles these issues by substituting each point with oriented flat disks (a surfel), where the orientation and radius of these disks can be estimated from the point clouds. This strategy underpins recent neural network-based rendering techniques such as NPBG [30] and 3DGS [32]. These methods focus more on generating high-quality views, thereby neglecting the 2D-3D correspondence needed by registration.

II-C Roadside Camera Applications

A thorough review of monocular roadside camera applications is presented in [55]. Beyond tasks utilizing exclusively 2D data, like vehicle counting, applications requiring 3D information—for instance, vehicle speed measurement and vehicle distance estimation [53, 51, 50]—typically depend on the presumption of a flat ground or predefined size information. Such presumptions and prior knowledge considerably confine the algorithm’s versatility across varied settings. Conversely, leveraging the 3D information from registered point cloud offers a robust solution to these limitations.

III Methodology

III-A Overview

Given a pair of roadside camera image $\bm{\mathcal{I}}\in\mathbb{R}^{H*W*3}$ and a prior point cloud $\mathcal{P}\in\mathbb{R}^{N\times 3}$ with RGB values $\mathcal{C}\in\mathbb{R}^{N\times 3}$ or intensity $\mathcal{L}\in\mathbb{R}^{N\times 1}$ and rough camera position selected from point cloud, our method aims to estimate the grid transformation $T=[R,t]\in SE(3)$ between point cloud coordinates and camera coordinates, where $R\in SO(3)$ is the rotation matrix and $t\in\mathbb{R}^{3}$ is the translation vector.

In this section, we first detail the principles of neighbor rendering as described in Section III-B. Following that, in Section III-C, we propose a pipeline designed to obtain precise initial guess solely based on the selected rough camera position. Lastly, in Section III-D, we discuss optimizing extrinsic parameters by minimizing reprojection errors and effectively extracting line features using SAM. The overview of our framework is shown in Fig. 2.

III-B View Rendering

Each pixel $p_{i}(u_{i},v_{i})$ corresponds to a ray $r_{i}(u_{i},v_{i})=t*K^{-1}[u_{i},v_{i},1]^{T}$ in the camera coordinates. Without accounting for opacity, only the 3D point that falls on the ray and is closest to the camera can be projected onto the image. However, due to the sparsity of roadside point clouds, lots of rays may not intersect with any 3D points, resulting in holes and bleeding problems (shown in Fig. 3). While previous studies like NPBG [30] and 3DGS [32] have made significant strides in addressing these issues, directly applying these methods to our task primarily faces two challenges: computational cost and the loss of 2D-3D correspondence, as detailed in Section II-B.

Drawing insights from prior research, we propose an efficient rendering method, termed neighbor rendering, to tackle these challenges. Our method is based on two reasonable hypotheses: 1) When deduce the specific point on a certain ray like [45, 46], only the 3D points surrounding this ray will be taken into consideration. 2) Local 3d points can be approximated as belonging to a plane, a hypothesis also utilized in [40]. Specifically, we first project the point cloud onto the image and use the Z-buffer [30] to filter out occluded points. This process not only filters occluded and out-of-view points but also reorganizes the point cloud, which enables retrieving points in the point cloud based on pixel coordinates, thus facilitating parallel processing. For a given pixel $p_{i}$ , we need to infer the points lying on its corresponding ray $r_{i}$ . Since adjacent 3D points are necessarily adjacent in pixel space, according to hypothesis 1, we only need to consider the 3D points retrieved from the $N$ neighbor pixels of $p_{i}$ , denote as $\mathcal{Q}_{i}=\{P_{i}^{j}\in\mathbb{R}^{3}\mid j\in\{0,\ldots,N-1\}\,\}$ . Considering that these 3D points may not be adjacent in spatial, as shown in Fig. 3, filtering out background points is necessary. We adopt a simple distance filter for background point filtering, as described in Eq. (1). Here, $d(\cdot)$ represents the distance from a 3D point to the camera and $\xi$ is a manually set threshold, $f(\cdot)=1$ indicates that the point is considered as a foreground point.

f(P_{i}^{j})=\left\{\begin{aligned} &1\quad if\ d(P_{i}^{j})-\min\{d(P_{i}^{m}% )\mid P_{i}^{m}\in\mathcal{Q}_{i}\}\leq\xi,\\ &0\quad others.\\ \end{aligned}\right.

(1)

According to hypothesis 2, we fit plane function $a_{i}x+b_{i}y+c_{i}z+d_{i}=0$ with foreground points and regard the intersection point of $r_{i}$ with this plane (denote as $P_{i}$ ) as the corresponding 3D point for $p_{i}$ . The appearance value of $P_{i}$ (denote as $v(P_{i})$ ), such as the intensity or RGB values, are obtained by weighting the values of foreground points. Given that points closer to $P_{i}$ should bear more resemblance to its appearance, inspired by the Inverse Distance Weighting (IDW) [15], we propose the Eq. (2) to compute the weights $W(\cdot)$ .

	$\displaystyle v(P_{i})$	$\displaystyle=\frac{\sum\limits^{N-1}\limits_{j=0,\,f(P_{i}^{j})=1}W(P_{i}^{j}% )\,v(P_{i}^{j})}{\sum\limits^{N-1}\limits_{j=0,\,f(P_{i}^{j})=1}W(P_{i}^{j})},$		(2)
	$\displaystyle W(P_{i}^{j})$	$\displaystyle=\frac{\xi+min\{d(P_{i}^{m})\mid P_{i}^{m}\in\mathcal{Q}_{i}\}-d(% P_{i}^{j})}{\exp(D(P_{i}^{j}))},$		(2)

where $D(P_{i}^{j})$ represents the distance of $P_{i}^{j}$ to the $P_{i}$ . This equation indicates that points closer to $P_{i}$ have larger weights, and these weights decay exponentially as the distance increases.

We deploy our rendering algorithm using CUDA, allowing us to generate a 4K image in 20ms on RTX-3050 GPU. The visualization of generated images are available in the appendix.

III-C Initial Guess Estimation

Through the analysis of typical installation configurations for roadside cameras, we observe that these cameras are characterized by an essentially zero roll angle and a slight downward pitch angle, which facilitates the observation of the ground area. The yaw angle, conversely, is determined by the specific direction the camera is aimed at, as demonstrated in Fig. 4. Building on this observation, we assign the camera a zero roll angle and a minor pitch angle. Within the expanse of the panoramic view, eight yaw angles are uniformly sampled and their respective views are generated through neighbor rendering. We then match these generated views with the camera image using SuperGlue [14] until a sufficient number of matched feature points obtained. Leveraging the 2D-3D correspondence preserved by neighbor rendering, we utilize PnP [16] to estimate the rough pose of the camera.

Sampling merely eight yaw directions might result in partial matches between generated and camera images, leading to the clustering of matched feature points in particular regions. As indicated in [3], this uneven distribution of matched features heightens the uncertainty in extrinsic parameters estimation, making the estimation sensitive to environmental changes. Thus, by regenerating view with the roughly estimated pose and matching again, we ensure that the initial guess are estimated under the condition of a uniform distribution of matched feature points.

III-D Extrinsic Parameters Optimization

III-D1 Line Extraction

Edge extraction from images is well-established, yet deriving edges from point clouds poses greater difficulties. While depth discontinuity [5] is a prevalent technique for edge detection in point cloud, it tends to yield jagged, imprecise edges. [3] pioneered the use of plane intersections for precise edge extraction from point clouds, showing notable efficacy. However, roadside scenes often lack sufficient planar features to meet the algorithm’s requirements. Furthermore, line features in roadside images primarily focus on surface textures like lane markings, which the aforementioned methods cannot directly extract. Given our capability to generate realistic views and establish 2D-3D correspondences, it is possible to extract 2D line features from the generated images and then derive their corresponding 3D line features. However, traditional edge detection algorithm [17] necessitate parameter tuning for optimal extraction, yet finding a unified setting for these parameters across heterogeneous image is difficult. Therefore, we leverage the robust generalization capabilities of the powerful vision model SAM [33] to facilitate precise extraction of line features in both generated and camera images.

III-D2 Optimization

To align point cloud edges with those in images, we uniformly sample multiple points $P_{L}^{i}$ from each edge in the point cloud. These points are then projected onto the image using current extrinsic estimate $\bar{T}$ and camera intrinsic parameters $K$ .

p_{i}=K\cdot\bar{T}(P_{L}^{i}).

(3)

For each $p_{i}$ , we search for its $M$ nearest pixels $\mathcal{S}_{i}=\{q_{i}^{j}\in\mathbb{R}^{2}\mid j\in\{0,\ldots,M-1\}\,\}$ in the $k$ -D tree formed by image edge pixels. The line formed by $\mathcal{S}_{i}$ represents the image edge corresponding to $p_{i}$ . To simplify computation, we parameterize the line as $q_{i}^{0}$ with the norm vector $n_{i}$ , which is derived from the minimum eigenvector obtained through through principal component analysis on $\mathcal{S}_{i}$ .

The extrinsic parameters are optimized by minimizing the projection error between $p_{i}$ and $q_{i}^{0}$ along $n_{i}$ , which can be formulated as follows:

\begin{split}T^{*}&=\arg\underset{T}{\min}\sum\limits^{n}\limits_{i=1}\left(||% n_{i}^{T}(K\cdot T(P^{L}_{i})-q_{i}^{0})||^{2}\right)\\ &=\arg\underset{R,t}{\min}\sum\limits^{n}\limits_{i=1}\left(||n_{i}^{T}((K% \cdot R(P^{L}_{i})+t)-q_{i}^{0})||^{2}\right).\end{split}

(4)

To address the constraints introduced by the rotation matrix, we can utilize the transformation between Lie groups and Lie algebras:

\begin{split}R&=E(\theta)=\exp(\theta^{\wedge})\\ &=I+\sin(\theta)\theta^{\wedge}+(1-\cos(\theta))\theta^{\wedge 2}.\end{split}

(5)

where $\theta$ represents the Lie algebraic representation of the rotation matrix, and $\theta^{\wedge}$ denotes the corresponding skew-symmetric matrix. By transforming the optimization problem to the Lie algebra, the number of parameters to be optimized reduces to six, and the constraints are eliminated. The optimization process can be solved recursively. Let $\bar{T}$ be the current extrinsic estimate, using the operator $\boxplus$ encapsulated in the Special Euclidean group $SE(3)$ [18], we can parameterize the target extrinsic $T$ as:

	$\displaystyle T\;$	$\displaystyle\leftarrow\;\bar{T}\boxplus_{SE(3)}\delta T\triangleq\exp(\delta T% )\cdot\bar{T},$		(6)
	$\displaystyle\delta T$	$\displaystyle=(\delta\theta,\delta t)\in\mathbb{R}^{6}.$		(6)

Given the objective function defined in Eq. (4) and the operator $\boxplus$ , we can employ ceres [25] to efficiently solve the problem.

IV Experiments

IV-A Dataset Preparation

We validate our method on a self-collected dataset. As shown in Fig. 1, the prior point cloud is acquired using a DJI-M300 drone, equipped with the Zenmuse L1 module, alongside Hikvision cameras mounted on roadside poles for image capture. The cameras’ intrinsic parameters are calibrated using a checkerboard method [1]. Data are gathered from eight unique areas, each providing two pairs of point cloud-image datasets, and all point clouds are aligned within a consistent global coordinate system. For each area, one pair of point cloud-image is designated for evaluation, while the other include cones evenly distributed within the camera’s field of view (FoV) to facilitate the acquisition of pseudo ground truth, as shown in Fig. 5.

To obtain the ground truth of the extrinsic parameters, We manually annotated 2D-3D corresponding points of clearly visible cones and minimized the reprojection error. The accuracy of this transformations are validated by projecting the point clouds onto the images, which are then used as pseudo ground truth.

TABLE I: Quantitative Registration Results on Dataset

	Ours			Continue Edge[3]
Scenes	Trans.(m) $\downarrow$	Rot.( ${}^{\circ}$ ) $\downarrow$	Init. Guess	Trans.(m) $\downarrow$	Rot.( ${}^{\circ}$ ) $\downarrow$
1	0.095	0.219	✓	0.157	0.375
2	0.094	0.128	✓	0.684	0.487
3	0.045	0.102	✓	0.462	0.193
4	0.108	0.273	✓	0.709	1.237
5	0.162	0.302	✓	1.098	1.455
6	0.099	0.189	✓	0.416	0.601
7	0.027	0.196	✓	0.088	0.270
8	0.006	0.207	✓	0.096	0.390
Avg.	0.079	0.202	8/8	0.464	0.602

TABLE II: Ground Distance Evaluation on Dataset

Methods	Max Error( $\%$ ) $\downarrow$	Median Error( $\%$ ) $\downarrow$	RMSE( $\%$ ) $\downarrow$
OptInOpt [50]	15.78	10.80	12.87
PlaneCalib [51]	14.32	9.23	11.69
DeepVPCalib [52]	12.17	8.11	10.62
Revaud et al. [53]	14.87	10.91	12.54
Vuong et al. [43]	6.75	3.22	4.68
Ours	4.75	1.16	1.67

TABLE III: Ablation Experiments Results

${}^{\mathrm{*}}$ Only five successful scenarios were included in the statistics.
		Mean Errors
Using RGB	Neighbor Rending	Trans.(m) $\downarrow$	Rot.( ${}^{\circ}$ ) $\downarrow$
✓		0.796 ${}^{*}$	0.477 ${}^{*}$
	✓	0.133	0.247
✓	✓	0.079	0.202

IV-B Evaluation Metrics

The assessment of extrinsic parameters involves quantifying the mean translation error across the three axes and mean rotation error for roll, pitch, and yaw. Furthermore, we incorporate ground distance measurements as utilized in roadside camera calibration methods [43], to evaluate the accuracy of our extrinsic parameters estimates. Specifically, we utilize the distance $d_{i}$ between cones in points cloud as the ground truth and estimate these distances using the corresponding pixel positions, denote as $\bar{d}_{i}$ . The normalized distance measurement error is calculated as $r_{i}=\frac{|\bar{d}_{i}-d_{i}|}{d_{i}}$ . It’s worth noting that in [43], $\bar{d}_{i}$ is estimated using the distance of intersection points of rays and ground plane. Since we are registering image with point cloud rather than the ground plane, we use the distance of 3D points corresponding to the pixels to estimated $\bar{d}_{i}$ .

IV-C Results and Analysis

Table I presents the registration results of our method in eight different scenes, which achieves an average translation error of $0.079m$ and a rotation error of $0.202^{\circ}$ . The maximum translation error does not exceed $0.17m$ , and the maximum rotation error is around $0.3^{\circ}$ , demonstrating the accuracy and robustness of our approach. Fig. 6 visualizes the registration results of scenes 4 (left) and 5 (middle). To replicate errors in manually selecting camera positions in point cloud, we added a $\pm 2m$ perturbation to the displacement vector of pseudo ground truth, treating this as the camera’s position for further testing. Our experiments show that our method robustly handles this variation, automatically estimating initial guess across all tested scenes.

Given that the preprocessing and initial guess estimation processes outlined in [4] are ineffective on our dataset, we adopt the SOTA automatic line-based calibration method [3] as our baseline. We add a disturbance( $\pm 0.5m$ for translation and $\pm 1^{\circ}$ for rotation) to ground truth as initial guess for [3]. As shown in Table I, [3] achieves satisfactory registration in some scenes, its performance varies significantly across different scenes due to a scarcity of rich planar features. It’s noteworthy that both our method and [3] demonstrate relatively minimal errors in rotation. This is because roadside cameras have a wide observation range, and minor angular deviations may be magnified over distances, making the errors more detectable and more amenable to optimization by the algorithm.

Table II displays the performance of our method in ground distance measurement on the self-collected dataset. We evaluate from the perspectives of maximum error, median error, and root-mean-squared error (RMSE, in $\%$ ). Compared to SOTA method [43], our distance measurement results show significant improvement across all evaluation metrics, indicating the precision of the registration method.

In extreme conditions where ground texture features are minimal and excessive tree density obstructs drones from capturing point clouds effectively, our method may fail. In Fig. 7, we illustrate scenarios where our method fails to function properly. In the depicted left scene, the road surface lacks sufficient texture features and the absence of features in the vicinity making it difficult to obtain enough feature points for matching. The image on the right shows the camera’s left FoV being obstructed, with features predominantly centralized. Moreover, the dense foliage of trees complicates the acquisition of point clouds, resulting in substantial registration inaccuracies.

IV-D Ablation studies

We investigated the impact of neighbor rendering module and appearance of point clouds to the overall method. Table III presents our results. Using intensity information instead of RGB values only results in a slight decrease in registration accuracy, thanks to SuperGlue’s robust capability in matching heterogeneous images. Omitting neighbor rendering from image generating resulted in successful registration in just five of eight scenarios, accompanied by a significant drop in accuracy. This issue arises from distant points in the upper part of the image forming dense pixels, whereas closer points at the bottom don’t without neighboring rendering, causing match concentration in the upper area and biasing extrinsic parameter estimation towards a local optimum.

To evaluate the impact of point cloud density on different rendering methods, we conducted tests assessing their sensitivity to density variations. We observe that direct projection struggles to produce matchable images across all scenes when the point cloud is downsampled by a 3 cm radius. Conversely, incorporating neighbor rendering allows the method to function effectively, even with a 8 cm downsampling radius, maintaining registration precision. Post-downsampling registration outcomes are depicted to the right of Fig. 6, with rendering details in the appendix.

V Applications

We apply the registration framework to roadside 3D object detection. We initially employ YOLOv7 [48] for 2D object detection, estimating the bottom centers of detection boxes as the interface points between vehicles’ rear ends and the ground. Using the pixel coordinates of these points, we can retrieve the corresponding 3D points from the registered point cloud, which serves as the 3D position of the target for 3D object detection.

We utilize a GPS-RTK-equipped vehicle, time-synchronized with the roadside camera, to provide positional ground truth. The distance from the GPS device to the vehicle’s rear is manually gauged. Considering the limitation of GPS to only two-dimensional (x and y coordinates) data, we simplify the evaluation by transforming 3D detection positions into distances from the target to the camera, adopting the distance measurement error from Section IV-B as our evaluative metric. Owing to spatial constraints, details on evaluation equipment and object detection result visualizations are provided in the appendix.

Fig. 8 depicts the distance measurements results of our method. Within a 250-meter detection range, most of the errors between our measurements and the ground truth is controlled within approximately $1.5\%$ . The notable error at 50 meters is due to approximation error arising from using the 2D bbox’s bottom center to estimate the vehicle’s rear end ground contact point at close distances, but this approximation error decreasing as the distance increases. Road surface irregularities cause fitting plane methods to incur substantial errors, which increase with distance. Conversely, the registered point cloud effectively mitigates this issue. This results demonstrates the significance of fusing prior point clouds with images for visual-based detection tasks and also proves the precision of our registration method.

VI Conclusion

This paper presents an automated approach for registering prior point clouds with roadside camera images. We introduce an efficient rendering method aimed at minimizing discrepancies between generated images and actual camera photos to enhance registration. Moreover, we outline a workflow capable of precisely estimating initial guess using merely approximate camera locations, markedly decreasing the registration process’s dependence on initial configurations. We achieving a rotation accuracy of 0.202 ${}^{\circ}$ and a translation precision of 0.079m in self-collected dataset.

In the future, we plan to investigate our methodology’s applicability in varied contexts, including indoor scenes and tunnels, and to enhance performance by combining our approach with advanced rendering methods.

APPENDIX

VI-A Visualization of Neighbor Rendering

Fig. A1 demonstrates the effects of different rendering methods on point clouds of varying densities. It is evident that when the point cloud downsampling radius is $3cm$ , the method of direct projection struggles to form clear images, with white holes occupying a significant portion. Additionally, within the red box, road surfaces that should be obscured by trees are visible due to these holes, a phenomenon known as the bleeding problem. Employing neighbor rendering effectively addresses these issues, creating views as realistic as the original camera images. At an $8cm$ downsampling radius, directly projecting the point cloud fails to convey any meaningful information, whereas neighbor rendering still maintains clear and visible textures, effectively mitigating the bleeding problem.

VI-B Experiments Results and Scenes Visualization

In Fig. A2, we present the scenes utilized in our experiments alongside the visualizations of their registration outcomes. Our algorithm consistently achieve accurate registration across all eight scenes. Specifically, in Scene 3, the inability of several green circles on the left to align with image objects stems from dense foliage obstructing the drone’s capacity to capture point clouds under the trees. Nevertheless, our technique adeptly registers point clouds with images, highlighting the resilience and effectiveness of our methodology.

VI-C 3D Object Detection Application

We utilize a vehicle outfitted with GPS-RTK to obtain 3D positional ground truth. The visualization of 3D object detection outcomes is shown in the right panel of Fig A3. Given that GPS only yields $xy$ coordinates as ground truth, our analysis is conducted from a Bird’s Eye View (BEV), concentrating exclusively on $xy$ results and omitting $z$ values.

We identify the contact point between vehicles and the ground using the center of the bottom edge of the green 2D detection boxes depicted in the right side of Fig. A3. The 3D point in the registered point cloud corresponding to this pixel is designated as the target’s 3D location. The two-dimensional coordinates shown in Fig. A3 indicate the $xy$ position of this 3D point.

References

[1] J. Cui, J. Niu, Z. Ouyang, Y. He, and D. Liu, “Acsc: Automatic calibration for non-repetitive scanning solid-state lidar and camera systems,” arXiv preprint arXiv:2011.08516, 2020.
[2] S. Chen, J. Liu, X. Liang, S. Zhang, J. Hyyppä, and R. Chen, “A novel calibration method between a camera and a 3d lidar with infrared images,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 4963–4969.
[3] C. Yuan, X. Liu, X. Hong, and F. Zhang, “Pixel-level extrinsic self calibration of high resolution lidar and camera in targetless environments,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7517–7524, 2021.
[4] K. Koide, S. Oishi, M. Yokozuka, and A. Banno, “General, single-shot, target-less, and automatic lidar-camera extrinsic calibration toolbox,” arXiv preprint arXiv:2302.05094, 2023.
[5] X. Zhang, S. Zhu, S. Guo, J. Li, and H. Liu, “Line-based automatic extrinsic calibration of lidar and camera,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 9347–9353.
[6] M. Á. Muñoz-Bañón, F. A. Candelas, and F. Torres, “Targetless camera-lidar calibration in unstructured environments,” IEEE Access, vol. 8, pp. 143 692–143 705, 2020.
[7] X. Li, Y. Xiao, B. Wang, H. Ren, Y. Zhang, and J. Ji, “Automatic targetless lidar–camera calibration: a survey,” Artificial Intelligence Review, pp. 1–39, 2022.
[8] R. Ishikawa, T. Oishi, and K. Ikeuchi, “Lidar and camera calibration using motions estimated by sensor fusion odometry,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 7342–7349.
[9] Y. Zhu, C. Zheng, C. Yuan, X. Huang, and X. Hong, “Camvox: A low-cost and accurate lidar-assisted visual slam system,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 5049–5055.
[10] L. Di Giammarino, I. Aloise, C. Stachniss, and G. Grisetti, “Visual place recognition using lidar intensity information,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 4382–4389.
[11] Y. Lin, C. Wang, J. Cheng, B. Chen, F. Jia, Z. Chen, and J. Li, “Line segment extraction for large scale unorganized point clouds,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 102, pp. 172–183, 2015.
[12] D. Zhang, B. Fang, W. Yang, X. Luo, and Y. Tang, “Robust inverse perspective map** based on vanishing point,” in Proceedings of the International Conference on Security, Pattern Analysis, and Cybernetics (SPAC). IEEE, 2014, pp. 458–463.
[13] S. Khan, D. Wollherr, and M. Buss, “Modeling laser intensities for simultaneous localization and map**,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 692–699, 2016.
[14] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938–4947.
[15] I. Ashraf, S. Hur, and Y. Park, “An investigation of interpolation techniques to generate 2d intensity image from lidar data,” IEEE Access, vol. 5, pp. 8250–8260, 2017.
[16] V. Lepetit, F. Moreno-Noguer, and P. Fua, “Ep n p: An accurate o (n) solution to the p n p problem,” International journal of computer vision, vol. 81, pp. 155–166, 2009.
[17] J. Canny, “A computational approach to edge detection,” IEEE Transactions on pattern analysis and machine intelligence, no. 6, pp. 679–698, 1986.
[18] C. Hertzberg, R. Wagner, U. Frese, and L. Schröder, “Integrating generic sensor fusion algorithms with sound state representations through encapsulation of manifolds,” Information Fusion, vol. 14, no. 1, pp. 57–77, 2013.
[19] T. Ma, Z. Liu, G. Yan, and Y. Li, “Crlf: Automatic calibration and refinement based on line feature for lidar and camera in road scenes,” arXiv preprint arXiv:2103.04558, 2021.
[20] Q. Zhang and R. Pless, “Extrinsic calibration of a camera and laser range finder (improves camera calibration),” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2004, pp. 2301–2306.
[21] G. Pandey, J. McBride, S. Savarese, and R. Eustice, “Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 26, no. 1, 2012, pp. 2053–2059.
[22] S. Katz, A. Tal, and R. Basri, “Direct visibility of point sets,” in ACM SIGGRAPH 2007 papers, 2007, pp. 24–es.
[23] C. Zheng, Q. Zhu, W. Xu, X. Liu, Q. Guo, and F. Zhang, “Fast-livo: Fast and tightly-coupled sparse-direct lidar-inertial-visual odometry,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 4003–4009.
[24] G. Yan, Z. Liu, C. Wang, C. Shi, P. Wei, X. Cai, T. Ma, Z. Liu, Z. Zhong, Y. Liu, et al., “Opencalib: A multi-sensor calibration toolbox for autonomous driving,” Software Impacts, vol. 14, p. 100393, 2022.
[25] S. Agarwal, K. Mierle, and T. C. S. Team, “Ceres Solver,” 3 2022. [Online]. Available: https://github.com/ceres-solver/ceres-solver
[26] X. Ye, M. Shu, H. Li, Y. Shi, Y. Li, G. Wang, X. Tan, and E. Ding, “Rope3d: The roadside perception dataset for autonomous driving and monocular 3d object detection task,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21 341–21 350.
[27] B. Ravi Kiran, L. Roldao, B. Irastorza, R. Verastegui, S. Suss, S. Yogamani, V. Talpaert, A. Lepoutre, and G. Trehard, “Real-time dynamic object detection for autonomous driving using prior 3d-maps,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0.
[28] J. Zhou, B. Ma, W. Zhang, Y. Fang, Y.-S. Liu, and Z. Han, “Differentiable registration of images and lidar point clouds with voxelpoint-to-pixel matching,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[29] D. Cattaneo, M. Vaghi, A. L. Ballardini, S. Fontana, D. G. Sorrenti, and W. Burgard, “Cmrnet: Camera to lidar-map registration,” in 2019 IEEE intelligent transportation systems conference (ITSC). IEEE, 2019, pp. 1283–1289.
[30] K.-A. Aliev, A. Sevastopolsky, M. Kolos, D. Ulyanov, and V. Lempitsky, “Neural point-based graphics,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16. Springer, 2020, pp. 696–712.
[31] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
[32] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, 2023.
[33] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
[34] H. Yu, W. Zhen, W. Yang, J. Zhang, and S. Scherer, “Monocular camera localization in prior lidar maps with 2d-3d line correspondences,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4588–4594.
[35] M. Feng, S. Hu, M. H. Ang, and G. H. Lee, “2d3d-matchnet: Learning to match keypoints across 2d image and 3d point cloud,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 4790–4796.
[36] S. Ren, Y. Zeng, J. Hou, and X. Chen, “Corri2p: Deep image-to-point cloud registration via dense correspondence,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 3, pp. 1198–1208, 2022.
[37] B. Kovač and B. Žalik, “Visualization of lidar datasets using point-based rendering technique,” Computers & Geosciences, vol. 36, no. 11, pp. 1443–1450, 2010.
[38] M. Levoy and T. Whitted, “The use of points as a display primitive,” 1985.
[39] J. P. Grossman and W. J. Dally, “Point sample rendering,” in Rendering Techniques’ 98: Proceedings of the Eurographics Workshop in Vienna, Austria, June 29—July 1, 1998 9. Springer, 1998, pp. 181–192.
[40] H. Pfister, M. Zwicker, J. Van Baar, and M. Gross, “Surfels: Surface elements as rendering primitives,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques, 2000, pp. 335–342.
[41] P. Giannakeris, V. Kaltsa, K. Avgerinakis, A. Briassouli, S. Vrochidis, and I. Kompatsiaris, “Speed estimation and abnormality detection from surveillance cameras,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 93–99.
[42] G. Pascoe, W. Maddern, and P. Newman, “Direct visual localisation and calibration for road vehicles in changing city environments,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 9–16.
[43] K. Vuong, R. Tamburo, and S. G. Narasimhan, “Toward planet-wide traffic camera calibration,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 8553–8562.
[44] V.-H. Do, L.-H. Nghiem, N. Pham Thi, and N. Pham Ngoc, “A simple camera calibration method for vehicle velocity estimation,” in 2015 12th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2015, pp. 1–5.
[45] S. D. Roth, “Ray casting for modeling solids,” Computer graphics and image processing, vol. 18, no. 2, pp. 109–144, 1982.
[46] M. Hadwiger, C. Sigg, H. Scharsach, K. Bühler, and M. H. Gross, “Real-time ray-casting and advanced shading of discrete isosurfaces,” in Computer graphics forum, vol. 24, no. 3. Citeseer, 2005, pp. 303–312.
[47] X. **g, F. Han, X. Ding, Y. Wang, and R. Xiong, “Intrinsic and extrinsic calibration of roadside lidar and camera,” in 2022 China Automation Congress (CAC). IEEE, 2022, pp. 2367–2372.
[48] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 7464–7475.
[49] Y. Sun, J. Li, Y. Wang, X. Xu, X. Yang, and Z. Sun, “Atop: An attention-to-optimization approach for automatic lidar-camera calibration via cross-modal object matching,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 696–708, 2022.
[50] V. Bartl and A. Herout, “Optinopt: Dual optimization for automatic camera calibration by multi-target observations,” in 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019, pp. 1–8.
[51] V. Bartl, R. Juránek, J. Špaňhel, and A. Herout, “Planecalib: Automatic camera calibration by multiple observations of rigid objects on plane,” in 2020 Digital Image Computing: Techniques and Applications (DICTA), 2020, pp. 1–8.
[52] V. Kocur and M. Ftáčnik, “Traffic camera calibration via vehicle vanishing point detection,” in Artificial Neural Networks and Machine Learning – ICANN 2021, I. Farkaš, P. Masulli, S. Otte, and S. Wermter, Eds. Cham: Springer International Publishing, 2021, pp. 628–639.
[53] J. Revaud and M. Humenberger, “Robust automatic monocular vehicle speed estimation for traffic surveillance,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 4551–4561.
[54] X. Lv, B. Wang, Z. Dou, D. Ye, and S. Wang, “Lccnet: Lidar and camera self-calibration using cost volume network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2021, pp. 2894–2901.
[55] X. Zhang, Y. Feng, P. Angeloudis, and Y. Demiris, “Monocular visual traffic surveillance: A review,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 9, pp. 14 148–14 165, 2022.
[56] J. Lin and F. Zhang, “R3live: A robust, real-time, rgb-colored, lidar-inertial-visual tightly-coupled state estimation and map** package,” in 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 10 672–10 678.