Semantic Object-level Modeling for Robust Visual Camera Relocalization
Abstract
Visual relocalization is crucial for autonomous visual localization and navigation of mobile robots. Thanks to advances in CNN-based object detection, the robustness of visual relocalization has been greatly enhanced, especially at viewpoints where classical methods fail. However, ellipsoids (quadrics) generated from axis-aligned object detections may limit the accuracy of the object-level representation and degrade the performance of the visual relocalization system. In this paper, we propose a novel method of automatic object-level voxel modeling for accurate ellipsoidal representations of objects. For visual relocalization, we design a better pose optimization strategy for camera pose recovery that fully utilizes the projection characteristics of 2D fitted ellipses and accurate 3D ellipsoids. All of these modules are fully integrated into a visual SLAM system. Experimental results show that our semantic object-level mapping and object-based visual relocalization methods significantly enhance the performance of visual relocalization in terms of robustness to new viewpoints.
Index Terms:
visual relocalization, object-level mapping, ellipsoidal model, SLAM, instance segmentation
I Introduction
Visual relocalization refers to the use of image sets, 3D point clouds, semantic objects, or other useful data to obtain the camera pose of query image frames, in order to handle an uninitialized camera or a localization failure in Simultaneous Localization and Mapping (SLAM). In practice, 6-DOF camera pose estimation based on known global scene maps has been widely applied in fields such as mobile robotics, autonomous driving and AR.
A robust and effective visual relocalization algorithm is extremely important for visual SLAM systems. It is quite common for motion blur and rapid movement to cause visual SLAM tracking failure, which seriously affects the widespread application of visual SLAM systems. As a consequence, the robust operation of the visual relocalization module is essential for recovering the SLAM process. It is crucial to enhance the robustness (reliability) of the visual relocalization module in order to solve the problems of robot loss and kidnapped robots that often occur in the indoor scenes.
Current keypoint-based visual SLAM frameworks such as ORB-SLAM2[1] rely on local ORB[2] features and Bag-of-Words[3] descriptors: they search the image set for the reference image most similar to the query image and obtain matches between 2D keypoints in the query image and 3D map points. From these matches, the PnP algorithm recovers the camera pose inside a RANSAC loop. However, when the viewpoint changes significantly between the query frame and the reference keyframe database, the hand-crafted local features of the image change significantly as well, which makes feature matching difficult and leads to camera relocalization failure.
Meanwhile, owing to the rapid improvement of CNN-based object detection, methods that use 3D geometric models (cuboids, ellipsoids) together with 2D object detection for object-based camera pose estimation are emerging[4, 5, 6, 7, 8, 9]. In OA-SLAM[9], Zins et al. used object detections as constraints in the mapping process to automatically build ellipsoidal object representations on the fly, and utilized both objects and points in the map for robust visual relocalization.
These methods use geometric models to roughly represent objects through 2D axis-aligned bounding box constraints, and have achieved desirable results in 2D-3D object-level association and camera pose estimation. However, because they directly pair the 2D centers of axis-aligned bounding boxes with the center points of the corresponding 3D ellipsoids, the PnP algorithm can only compute camera poses of limited accuracy, especially when the 3D ellipsoid representation is inaccurate or the center of the 2D detection box is not coherent with the projection of the ellipsoid's center point. In [8], Zins et al. proposed a learning-based method that predicts improved elliptic approximations of detected 2D objects which are coherent with the 3D ellipsoids in terms of perspective projection. It shows remarkable results but requires new manual annotations for each new scene, which makes it hard to integrate into a SLAM system.
In order to enhance the robustness of relocalization in visual SLAM while ensuring accuracy, we explore a novel way to obtain accurate ellipsoidal representations (with accurate poses) for static object landmarks in unknown indoor scenes. For relocalization, when a query image arrives, instance segmentation masks on the query image are used to calculate fitted ellipses for regular objects, and an object-based optimization strategy is applied to refine the initial pose computed by the PnP algorithm. Our main contributions are as follows:
- We propose a novel mapping method to obtain accurate ellipsoidal representations of semantic object-level landmarks, leveraging object-level voxel modelling and automatic object-level associations.
- We design an object-based relocalization strategy that fully uses the projection characteristics of 2D fitted ellipses and the accurate 3D ellipsoids built during mapping.
- Our object-level mapping method and object-based relocalization strategy are fully integrated into an RGBD SLAM system; they are robust to a variety of viewpoints, adapt to unknown indoor scenes, and run in real time.
II Related works
II-A Semantic Object-level Mapping
To autonomously navigate in real-world environments, robots must be able to perceive complex and unstructured scenes, and build object-oriented scene maps.
Recent methods [10, 11, 12, 13] used geometric representations like point, mesh, voxel or TSDF model to build semantically meaningful, object-level entities with fine grain.
Some other recent methods [4, 5, 6, 7, 8, 9, 13] build coarse-grained geometric representations (cuboids or ellipsoids) of objects to enhance visual SLAM or modules such as loop closing and relocalization in visual SLAM. In [13], Lin et al. relied on 2D instance segmentation masks to extract object-level voxel models, and used these voxel models to calculate 3D cuboids for loop closing in visual SLAM. Similarly, in CubeSLAM[7], Yang et al. used 3D cuboids to represent objects in the map. These cuboids are jointly optimized with camera poses and landmarks. Besides 3D cuboids, some methods use ellipsoids (quadrics) to represent objects in the map. In QuadricSLAM[6], Nicholson et al. derived a SLAM formulation that uses dual quadrics (ellipsoids) as 3D landmarks to represent objects. OA-SLAM[9], proposed by Zins et al., also uses ellipsoidal representations for objects during the mapping process. It is noteworthy that these coarse ellipsoidal representations use 2D axis-aligned bounding boxes for constraints and optimization, and can only roughly represent the pose of objects.
II-B Object-based Camera Pose Estimation
Recent works that use objects to calculate the camera pose in visual SLAM can be divided into two types. The first tightly couples objects as landmarks with the SLAM system and jointly optimizes camera poses and object landmarks during the SLAM process[7, 6]. The second couples object landmarks only loosely with SLAM, using the camera poses provided by SLAM to extract objects from the images. These methods build objects during the mapping process and then use them to enhance SLAM modules such as loop closing and visual relocalization[13, 5, 9]. They take full advantage of current object detection or instance segmentation algorithms such as Faster R-CNN[14], Mask R-CNN[15] or YOLO[16]. In OA-SLAM[9], Zins et al. used objects with high-level semantics and points with better spatial localization accuracy to improve the visual relocalization module, which often fails in ORB-SLAM2[1] under large viewpoint variation. Although this approach is promising, we go one step further and propose a new method to generate more accurate ellipsoidal models for objects in the map. Using these accurate ellipsoidal models, we design a relocalization strategy for query images that further enhances the robustness of relocalization under large viewpoint changes.
III Method
III-A System Overview
Our proposed framework is detailed in Fig. 2. It is built on the ORB-SLAM2 backbone and consists of two main parts: semantic object-level mapping and visual camera relocalization. Keyframes provided by the camera tracking process are used for semantic object-level mapping. The keyframes are processed by the instance segmentation thread to obtain bounding boxes and instance segmentation masks. We use the camera pose, RGBD data and instance segmentation data of the keyframes in the mapping process. In the relocalization module, instance segmentation is likewise applied to the RGB query frame to obtain fitted 2D ellipses for the detected regular objects. The 2D ellipses are then associated with the 3D ellipsoids built during mapping to estimate and recover the camera pose when SLAM tracking is lost.
III-B Object-level Voxel Modelling
In order to accurately describe the geometric appearance structure of objects in indoor scenes, facilitate object tracking (described in section III-C) and accurate ellipsoid generation (described in section III-D), we use voxel models to process the original object segmentation data, in order to achieve accurate 3D object extraction. We perform 2D instance segmentation[16] to obtain 2D observation of each object on keyframes determined by camera tracking in ORB-SLAM2.
According to the poses, instance segmentation masks and depth maps of keyframes, we can obtain dense point clouds of objects at each keyframe perspective. In practical situations, due to the limitation of segmentation mask errors, dense point clouds of actual objects contain a large amount of noise and useless background information, which will seriously deteriorate the accuracy of 3D object entities.
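The back-projection step described above can be illustrated with a minimal sketch under the standard pinhole model. The function names, the intrinsics `fx, fy, cx, cy`, and the list-of-lists image layout are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
def backproject_mask(depth, mask, fx, fy, cx, cy, depth_scale=1.0):
    """Back-project the masked depth pixels to 3D points in the camera frame."""
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (d, m) in enumerate(zip(depth_row, mask_row)):
            if not m or d <= 0:          # skip background and invalid depth
                continue
            z = d * depth_scale
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def to_world(points, R, t):
    """Apply an assumed camera-to-world pose (R: 3x3 nested lists, t: length 3)."""
    return [
        tuple(sum(R[r][k] * p[k] for k in range(3)) + t[r] for r in range(3))
        for p in points
    ]


# Tiny 2x2 example: one pixel has invalid depth, one lies outside the mask.
depth = [[0.0, 2.0], [2.0, 2.0]]
mask = [[1, 1], [0, 1]]
cam_points = backproject_mask(depth, mask, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
world_points = to_world(cam_points, identity, [0.0, 0.0, 1.0])
```

Points obtained this way still contain the mask noise discussed above, which is why a probabilistic voxel filter is applied afterwards.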
Hence, similar to the method in [13], we use semantic label probability values in place of the occupancy probability of the grid map. The semantic label probability stored in each voxel of an object entity is dynamically updated according to continuous observations from the RGBD sequence, as introduced in [17]. Our experiments show that this method filters the noise produced by segmentation masks and sensors effectively and efficiently.
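Assume that there are $N$ categories of objects in the scene and that the semantic category ID of the $i$-th object is $c_i$. The probabilistic voxel model of the $i$-th object consists of $M_i$ nodes, the $j$-th of which is denoted $n_{ij}$. The semantic label probability of $c_i$ for the $j$-th node of the $i$-th object is denoted $P(n_{ij})$. After we get the observation $z_t$ in frame $t$, if the probability $P(n_{ij} \mid z_{1:t-1})$ accumulated from the first frame to the $(t-1)$-th frame is known, then the semantic label probability is updated with the recursive formulation of [17]:

$$P(n_{ij} \mid z_{1:t}) = \left[ 1 + \frac{1 - P(n_{ij} \mid z_t)}{P(n_{ij} \mid z_t)} \cdot \frac{1 - P(n_{ij} \mid z_{1:t-1})}{P(n_{ij} \mid z_{1:t-1})} \cdot \frac{P(n_{ij})}{1 - P(n_{ij})} \right]^{-1} \qquad (1)$$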
When an object has been tracked by objects tracking (described in III-C) $N_t$ times, its voxel model is inserted into the global map, and the label with the maximum semantic probability is selected as the object's final semantic label. Then, voxels whose semantic label probability falls below a preset filtering threshold are cleared stage by stage. $N_t$ and the initial semantic label probability are set to 3 and 0.5, respectively.
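As an illustration of this kind of per-voxel update, the sketch below implements the recursive binary Bayes filter of OctoMap [17] in log-odds form and then prunes low-probability voxels. The per-observation probabilities (0.7, 0.4) and the pruning threshold are hypothetical values chosen for the example; only the initial probability of 0.5 comes from the text.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def update_label_probability(p_accumulated, p_observation, p_initial=0.5):
    """One recursive binary Bayes update, written in log-odds form."""
    log_odds = logit(p_accumulated) + logit(p_observation) - logit(p_initial)
    return 1.0 / (1.0 + math.exp(-log_odds))


def prune(voxel_probabilities, threshold):
    """Keep only voxels whose label probability reaches the threshold."""
    return {k: p for k, p in voxel_probabilities.items() if p >= threshold}


# A voxel observed as "inside the mask" twice and "outside" once.
p = 0.5
for p_obs in (0.7, 0.7, 0.4):        # hypothetical inverse sensor model values
    p = update_label_probability(p, p_obs)

kept = prune({"voxel_a": p, "voxel_b": 0.3}, threshold=0.5)
```

With an initial probability of 0.5 the prior term vanishes, so repeated consistent observations raise the probability quickly while a single contradicting observation only lowers it moderately.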
III-C Objects Tracking
Objects tracking is divided into two categories: 2D objects tracking between adjacent frames and 3D objects matching. 2D object tracking decides whether observations of objects of the same class in adjacent frames belong to the same object; 3D object matching decides whether a new object is an existing object in the scene or an unknown new object. When the same object is observed in consecutive frames, 2D object tracking mainly takes effect; when an object disappears from the field of view for a period of time and then suddenly reappears (for example at a loop), 3D object matching takes effect.
2D objects tracking. Similarly to [18, 9], 2D multi-object tracking is performed between adjacent keyframes, because the 2D-3D object correspondences need to be transferred from the previous $(t-1)$-th frame to the current $t$-th frame. The voxels $V_i^{t-1}$ of an object (where the subscript $i$ denotes the $i$-th object in the map detected in frame $t-1$) are projected onto camera frame $t$ to form a 2D point set, from which a known axis-aligned 2D bounding box is generated. The optimal associations between the known 2D bounding boxes and the currently detected 2D bounding boxes $b_j^t$ are found using the Hungarian algorithm[19]. Each element $c_{ij}$ of the cost matrix is defined as:

$$c_{ij} = w_s \, c_{ij}^{\mathrm{sem}} + w_r \, c_{ij}^{\mathrm{rep}} \qquad (2)$$

$$c_{ij}^{\mathrm{rep}} = 1 - \mathrm{IoU}\big(\mathrm{bbox}(\mathrm{proj}(V_i^{t-1})),\ b_j^t\big) \qquad (3)$$

where $w_s$ and $w_r$ are the weights of the semantic cost term and the reprojection cost term, respectively. The operations proj, bbox and IoU denote projecting a 3D object to a 2D point set, calculating the bounding box of a 2D point set, and computing the intersection-over-union (IoU), respectively. For the $i$-th object tracked in the last frame and the $j$-th 2D detection in the current frame to be considered a 2D-3D matching pair, semantic label ID consistency must be met first, and the IoU-based reprojection error must then be minimal.
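A small sketch of this association step is given below. It assumes a 0/1 semantic cost, equal weights of 0.5, and boxes given as `(x_min, y_min, x_max, y_max)`; none of these values are specified in the text. For self-containment the optimal assignment is found by exhaustive search, which is only viable for tiny matrices; a Hungarian-type solver such as `scipy.optimize.linear_sum_assignment` would normally take its place.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def bbox(points_2d):
    """Axis-aligned bounding box of a 2D point set."""
    xs, ys = zip(*points_2d)
    return (min(xs), min(ys), max(xs), max(ys))


def cost_matrix(tracked, detections, w_sem=0.5, w_rep=0.5):
    """tracked: [(label, projected 2D points)], detections: [(label, box)]."""
    return [
        [
            w_sem * (0.0 if label_i == label_j else 1.0)    # assumed 0/1 semantic cost
            + w_rep * (1.0 - iou(bbox(points), box_j))       # IoU-based reprojection cost
            for label_j, box_j in detections
        ]
        for label_i, points in tracked
    ]


def associate(tracked, detections):
    """Minimum-cost one-to-one assignment, keeping only label-consistent pairs."""
    cost = cost_matrix(tracked, detections)
    rows, cols = len(tracked), len(detections)
    assert rows <= cols, "sketch assumes no more tracked objects than detections"
    best = min(
        permutations(range(cols), rows),
        key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
    )
    return [(i, j) for i, j in enumerate(best) if tracked[i][0] == detections[j][0]]


tracked = [("chair", [(0, 0), (10, 10)]), ("book", [(20, 20), (30, 30)])]
detections = [("book", (21, 21, 31, 31)), ("chair", (1, 1, 11, 11))]
matches = associate(tracked, detections)   # (tracked index, detection index) pairs
```

Detections left without a match after this step are the ones that would be handed to 3D objects matching.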
3D objects matching. For newly detected 2D objects that remain unmatched after 2D objects tracking in the current frame, we need to determine whether their corresponding 3D objects already exist in the map. We generate a voxel model $V_n$ for a newly detected 2D object and then use voxel association to determine whether the detected object already exists in the map. Specifically, we traverse the map objects and retain each object $V_m$ whose semantic label ID matches. First, we set a search radius r, then search for and record the number of voxels in $V_n$ whose closest voxel in $V_m$ lies within that radius, denoted $N_{nm}$. $N_{mn}$ is calculated in the same way. 3D object matching succeeds if:

$$\frac{\max(N_{nm},\, N_{mn})}{N_{\min}} > \tau \qquad (4)$$

where $N_{\min}$ is the smaller of the node counts of the two objects, and $\tau$ is a threshold set to 0.5 in our experiments.
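The overlap test can be sketched as follows. The exact form of the ratio, the default radius of 0.05, and the brute-force neighbour search are assumptions for illustration (a k-d tree or octree lookup would be the usual choice for real voxel counts); only the threshold of 0.5 is taken from the text.

```python
def count_close(source, target, radius):
    """Number of voxels in `source` with at least one `target` voxel within `radius`."""
    r2 = radius * radius
    return sum(
        1
        for p in source
        if any(sum((a - b) ** 2 for a, b in zip(p, q)) <= r2 for q in target)
    )


def is_same_object(new_voxels, map_voxels, radius=0.05, tau=0.5):
    """Assumed criterion: mutual overlap count relative to the smaller model."""
    n_new_to_map = count_close(new_voxels, map_voxels, radius)
    n_map_to_new = count_close(map_voxels, new_voxels, radius)
    n_min = min(len(new_voxels), len(map_voxels))
    return max(n_new_to_map, n_map_to_new) / n_min > tau


new_object = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (0.3, 0.0, 0.0)]
near_object = [(0.01, 0.0, 0.0), (0.11, 0.0, 0.0), (0.21, 0.0, 0.0), (5.0, 5.0, 5.0)]
far_object = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)]

same = is_same_object(new_object, near_object)
different = is_same_object(new_object, far_object)
```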
III-D Accurate Ellipsoids Generation
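We use ellipsoids (dual quadrics) to represent individual objects in the map. Similarly to [6, 9], an ellipsoid is defined by nine parameters: rotation, position and three semi-axes. Compared with these methods, however, the ellipsoids generated by our method are more accurate in terms of these nine parameters. To obtain an accurate ellipsoid for an object inserted into the map, the center coordinates $\mathbf{t}$ and rotation $\mathbf{R}$ of its voxel model relative to the world coordinate system are calculated, together with its size, which gives the semi-axes $a$, $b$ and $c$. The ellipsoid $\mathbf{Q}^*$ is determined by the transformation $\mathbf{T}$ and the initial ellipsoid $\breve{\mathbf{Q}}^*$ centred at the origin:

$$\mathbf{Q}^* = \mathbf{T}\, \breve{\mathbf{Q}}^*\, \mathbf{T}^\top, \qquad \mathbf{T} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{bmatrix} \qquad (5)$$

$$\breve{\mathbf{Q}}^* = \mathrm{diag}(a^2,\ b^2,\ c^2,\ -1) \qquad (6)$$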
Pose estimation for objects. We assume that all objects are placed on the ground. Since the center coordinates of the objects have already been calculated, the six-DOF pose estimation reduces to one degree of freedom (the yaw angle). The 3D voxels of an object are projected onto the ground, and the PCA algorithm is applied to compute the main axis (shown in red in Fig. 3) of the projected 2D point set. The yaw angle of the object is given by the angle of this main axis. Hence, we obtain accurate 9-DOF ellipsoidal representations, which are inscribed in the cuboids generated from the computed poses, for all objects in the map. It is worth mentioning that works such as [6, 9] merely use axis-aligned bounding box constraints to generate coarse ellipsoids, resulting in lower accuracy of the object representation.
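The yaw estimation can be sketched with the closed-form principal axis of a 2×2 covariance matrix. The sketch assumes that the ground plane is the world x-y plane with z pointing up, and it takes the half extents of the oriented cuboid as the semi-axes of the inscribed ellipsoid; the function names are illustrative.

```python
import math


def yaw_from_voxels(voxels):
    """Angle of the dominant horizontal axis of a voxel set (PCA on the x-y projection)."""
    n = len(voxels)
    mx = sum(v[0] for v in voxels) / n
    my = sum(v[1] for v in voxels) / n
    sxx = sum((v[0] - mx) ** 2 for v in voxels) / n
    syy = sum((v[1] - my) ** 2 for v in voxels) / n
    sxy = sum((v[0] - mx) * (v[1] - my) for v in voxels) / n
    # Orientation of the eigenvector with the largest eigenvalue of [[sxx, sxy], [sxy, syy]].
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


def oriented_half_extents(voxels, yaw):
    """Half sizes of the yaw-aligned cuboid, used here as ellipsoid semi-axes."""
    c, s = math.cos(yaw), math.sin(yaw)
    along = [c * x + s * y for x, y, _ in voxels]
    across = [-s * x + c * y for x, y, _ in voxels]
    up = [z for _, _, z in voxels]
    return tuple((max(a) - min(a)) / 2.0 for a in (along, across, up))


# Synthetic object: a thin bar rotated by 30 degrees about the vertical axis.
angle = math.radians(30.0)
voxels = [
    (t * math.cos(angle), t * math.sin(angle), z)
    for t in range(-5, 6)
    for z in (0.0, 1.0)
]
yaw = yaw_from_voxels(voxels)
semi_axes = oriented_half_extents(voxels, yaw)
```

The principal-axis angle is only defined up to 180 degrees, which is sufficient here because an ellipsoid is symmetric about its axes.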
III-E Object-based Relocalization
The main procedure of object-based relocalization is shown in Fig. 2. When camera tracking fails and the pose of the query frame must be recovered from the already built global map, the object-based relocalization module is enabled. The RGB image, instance segmentation masks and axis-aligned bounding boxes of the query image are sent to the relocalization module. Objects whose projection masks are simple convex polygons are defined as regular objects. For regular objects, we fit the observation ellipses to the 2D masks; for other objects, inscribed observation ellipses are obtained from the axis-aligned bounding boxes. After the observation ellipses are obtained, initial pose recovery and pose refinement for the query image proceed as discussed separately below.
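The two ways of obtaining an observation ellipse can be sketched as below. The text does not state which fitting algorithm is used on the mask, so the moment-based fit here is a stand-in chosen for simplicity: it relies on the fact that a uniformly filled ellipse with semi-axis a has variance a²/4 along that axis. Function names and the `(cx, cy, a, b, theta)` output layout are assumptions.

```python
import math


def fit_ellipse_from_mask(pixels):
    """Moment-based ellipse fit to the (x, y) pixels of a filled mask."""
    n = len(pixels)
    cx = sum(p[0] for p in pixels) / n
    cy = sum(p[1] for p in pixels) / n
    sxx = sum((p[0] - cx) ** 2 for p in pixels) / n
    syy = sum((p[1] - cy) ** 2 for p in pixels) / n
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in pixels) / n
    # Eigenvalues of the 2x2 covariance matrix in closed form.
    mean = 0.5 * (sxx + syy)
    spread = math.hypot(0.5 * (sxx - syy), sxy)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    a = 2.0 * math.sqrt(mean + spread)
    b = 2.0 * math.sqrt(max(mean - spread, 0.0))
    return cx, cy, a, b, theta


def inscribed_ellipse(box):
    """Axis-aligned ellipse inscribed in a box (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2.0, (y0 + y1) / 2.0, (x1 - x0) / 2.0, (y1 - y0) / 2.0, 0.0


# Synthetic mask: a filled ellipse centred at (50, 40) with semi-axes 20 and 10.
mask_pixels = [
    (x, y)
    for x in range(101)
    for y in range(81)
    if ((x - 50) / 20.0) ** 2 + ((y - 40) / 10.0) ** 2 <= 1.0
]
fitted = fit_ellipse_from_mask(mask_pixels)
fallback = inscribed_ellipse((30, 30, 70, 50))
```

On a rasterized mask the recovered semi-axes are approximate, since the discrete pixel set only approximates the continuous ellipse.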
Initial pose recovery. An ellipsoid projects to an ellipse from any viewpoint, and the equation of that ellipse can be expressed in closed form using the dual space[9]. In that space, for a 2D observation ellipse $\mathbf{C}^*$ and the corresponding 3D ellipsoid $\mathbf{Q}^*$, if the projection matrix is $\mathbf{P}$, the following projection equation holds up to scale[20]:

$$\mathbf{C}^* \sim \mathbf{P}\, \mathbf{Q}^*\, \mathbf{P}^\top \qquad (7)$$
Therefore, the initial pose recovery problem involves finding the matches between the observation ellipses of the query frame and the ellipsoids in the map, and solving for the projection matrix of the query frame. We use the method (P3P loop) introduced in [9, 8] to jointly determine the object correspondences between the query image and the map and to estimate the initial pose of the camera. Unlike these works, we choose the fitting method according to whether the 2D mask is regular. If the boundary of the mask is roughly a quadrilateral, the object is considered a regular object and the 2D point set in the mask is used to fit the observation ellipse; otherwise, the ellipse inscribed in the bounding box is used directly as the observation ellipse. An observation ellipse fitted to the mask of a regular object agrees better with the projection characteristics of an ellipse-ellipsoid pair.
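The ellipse-ellipsoid projection relation of Eq. (7) can be checked numerically with a short sketch. It builds a dual quadric from an assumed pose and semi-axes in the standard way (a 4×4 transform applied to diag(a², b², c², -1)), projects it with a 3×4 matrix, and reads off the centre of the resulting dual conic. The identity intrinsics and all function names are assumptions for illustration.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dual_quadric(R, t, semi_axes):
    """Dual quadric of an ellipsoid with rotation R, centre t and semi-axes (a, b, c)."""
    a, b, c = semi_axes
    T = [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
    D = [
        [a * a, 0.0, 0.0, 0.0],
        [0.0, b * b, 0.0, 0.0],
        [0.0, 0.0, c * c, 0.0],
        [0.0, 0.0, 0.0, -1.0],
    ]
    return matmul(matmul(T, D), transpose(T))


def project_dual_quadric(P, Q_star):
    """Dual conic of the projected ellipse, defined up to scale."""
    return matmul(matmul(P, Q_star), transpose(P))


def ellipse_center(C_star):
    """Centre of the ellipse represented by a dual conic."""
    return (C_star[0][2] / C_star[2][2], C_star[1][2] / C_star[2][2])


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Q_star = dual_quadric(identity, [1.0, 0.0, 5.0], (1.0, 1.0, 1.0))   # unit sphere
P = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]  # K = I, camera at origin
C_star = project_dual_quadric(P, Q_star)
center = ellipse_center(C_star)
```

In this example the centre of the projected ellipse is at x = 5/24 ≈ 0.208, whereas the projection of the sphere's centre is at x = 1/5 = 0.2, a small instance of the mismatch between projected centres and ellipse centres under perspective.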
Pose refinement. Refinement of the initial camera pose uses the accurate ellipsoids $\{\mathbf{Q}_i^*\}$ in the map and the set of observation ellipses, denoted $\{\mathbf{C}_i^*\}$. The coarse pose of the query image and the 2D-3D object correspondences are jointly obtained by the P3P loop. A robust kernel function $\rho$ and a covariance matrix $\Sigma_i$ set according to the ellipse area are applied in the following formula:

$$\hat{\mathbf{P}} = \arg\min_{\mathbf{P}} \sum_i \rho\!\left( \left\| W_2\!\left(\mathbf{C}_i^*,\ \mathbf{P}\mathbf{Q}_i^*\mathbf{P}^\top\right) \right\|_{\Sigma_i}^2 \right) \qquad (8)$$

where $W_2$ is the Wasserstein distance between the two Gaussian distributions determined by the observation ellipse and by the ellipse projected with the camera projection matrix $\mathbf{P}$. In OA-SLAM[9], Zins et al. detailed this process, but for ellipsoid optimization in the map. The object-based camera pose refinement strategy prevents the camera pose from deteriorating due to object occlusion, inaccuracy of the observation ellipses, and other causes. Because the ellipsoidal representations of objects in the map are accurate and the fitted 2D observation ellipses agree with the projection characteristics, the pose refinement strategy makes better use of the ellipse-ellipsoid projection equation (7) in the pinhole camera model. In addition, the points produced by ORB-SLAM2 can also be used to further refine the camera pose, as in OA-SLAM.
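The Wasserstein distance between two ellipses can be sketched as follows, assuming each ellipse is turned into a 2D Gaussian whose mean is the ellipse centre and whose covariance is R diag(a², b²) Rᵀ (this conversion is an assumption of the sketch). For 2×2 covariances, the squared 2-Wasserstein distance has a closed form that needs no matrix square root routine, because the trace of the square root of a 2×2 positive semi-definite matrix M equals sqrt(tr M + 2 sqrt(det M)).

```python
import math


def ellipse_to_gaussian(cx, cy, a, b, theta):
    """Mean and covariance of the Gaussian associated with an ellipse (assumed convention)."""
    c, s = math.cos(theta), math.sin(theta)
    sxx = a * a * c * c + b * b * s * s
    syy = a * a * s * s + b * b * c * c
    sxy = (a * a - b * b) * c * s
    return (cx, cy), ((sxx, sxy), (sxy, syy))


def wasserstein2_squared(g1, g2):
    """Squared 2-Wasserstein distance between two 2D Gaussians."""
    (m1, S1), (m2, S2) = g1, g2
    mean_term = (m1[0] - m2[0]) ** 2 + (m1[1] - m2[1]) ** 2
    tr1 = S1[0][0] + S1[1][1]
    tr2 = S2[0][0] + S2[1][1]
    det1 = S1[0][0] * S1[1][1] - S1[0][1] ** 2
    det2 = S2[0][0] * S2[1][1] - S2[0][1] ** 2
    # tr(S1 S2) for symmetric matrices is the sum of element-wise products.
    tr12 = S1[0][0] * S2[0][0] + 2.0 * S1[0][1] * S2[0][1] + S1[1][1] * S2[1][1]
    cross = math.sqrt(max(tr12 + 2.0 * math.sqrt(max(det1 * det2, 0.0)), 0.0))
    return mean_term + tr1 + tr2 - 2.0 * cross


observed = ellipse_to_gaussian(0.0, 0.0, 2.0, 1.0, 0.0)
projected = ellipse_to_gaussian(3.0, 4.0, 1.0, 1.0, 0.0)
distance_sq = wasserstein2_squared(observed, projected)

tilted = ellipse_to_gaussian(1.0, 2.0, 3.0, 2.0, 0.3)
self_distance_sq = wasserstein2_squared(tilted, tilted)
```

In the first example the centres differ by (3, 4) and the semi-axes by (1, 0), so the squared distance is 25 + 1 = 26; an ellipse compared with itself gives zero. A refinement loop would evaluate this residual for every matched ellipse-ellipsoid pair at each candidate pose.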
IV Experiments
IV-A Experimental Settings
To evaluate our object-based relocalization method and mapping method, we used three common indoor scenes: two scenes from the TUM RGBD datasets[21] (fr3/long_office_household and fr2/desk) and one customized RGBD dataset collected in our office. In addition to the public benchmark, we used a Kinect v2 RGBD camera to capture RGBD sequences at 640×360 resolution and 30 frames per second.
It is noteworthy that each scene contains two video sequences. The first sequence is used for map**, and the other sequence with a significant difference in perspective from the map** is used for validating the relocalization algorithm. The ground truth poses of relocalized frames in the second video sequences are obtained using state-of-the-art visual SLAM method[1] with loop correction if needed.
TABLE I: Quantitative evaluation of relocalization performance. Column groups: fr3 = TUM fr3/long_office_household, fr2 = TUM fr2/desk, cust. = customized.

| Methods | fr3 pos. err. (cm) | fr3 rot. err. (°) | fr3 valid (%) | fr2 pos. err. (cm) | fr2 rot. err. (°) | fr2 valid (%) | cust. pos. err. (cm) | cust. rot. err. (°) | cust. valid (%) |
|---|---|---|---|---|---|---|---|---|---|
| ORB-SLAM2[1] (points) | 4.34 | 1.48 | 60.44 | 3.44 | 0.49 | 17.85 | 2.64 | 0.98 | 47.61 |
| OA-SLAM[9] (objects) | 17.60 | 7.06 | 42.73 | 20.29 | 9.03 | 45.49 | 11.35 | 8.12 | 87.61 |
| OA-SLAM (objects+points) | 7.06 | 3.68 | 45.94 | 9.50 | 5.67 | 56.89 | 4.69 | 2.36 | 95.75 |
| Ours (objects) | 11.15 | 4.92 | 56.31 | 16.53 | 6.17 | 60.49 | 9.87 | 6.44 | 93.98 |
| Ours (objects+points) | 5.16 | 2.40 | 60.68 | 7.58 | 3.37 | 69.33 | 2.96 | 1.19 | 96.28 |

OA-SLAM and our method use different maps built separately.
OA-SLAM and our method use the same map built by our method.
Additionally, the object detector or segmentation method used in the front-end directly affects the performance of semantic object-level mapping. We used YOLO v8[16] pre-trained on the COCO dataset[22] as our detector and instance segmentation algorithm, without any fine-tuning. To ensure that the object landmark categories used in mapping are consistent, OA-SLAM[9] was run with the same pre-trained YOLO v8 detector and with some object categories disabled.
IV-B Semantic Object-level Mapping
Our semantic object-level mapping method was evaluated on the three common indoor scenes shown in Fig. 4. The voxel models of common objects (TV monitors, chairs, books, keyboards, teddy bear, etc.) are clearly generated by our object-level voxel modelling method (III-B). The pose of each voxel model is also computed (III-D) and represented by a rotated 3D cuboid. Finally, accurate ellipsoidal representations of all objects in the map are generated for visual camera relocalization in case SLAM tracking fails. This semantic object-level mapping algorithm, which is fully integrated into ORB-SLAM2, runs automatically and in real time in a separate thread.
IV-C Visual Relocalization
The evaluations of our visual relocalization algorithm were conducted on the second sequence of each of the three pre-built scenes, which has different viewpoints from the first. We used ORB-SLAM2[1] and OA-SLAM[9] as our comparison methods. The map used by ORB-SLAM2 is the ORB point cloud map built from the first sequence of each scene. OA-SLAM and our method each use two types of maps: a pure object-level map with ellipsoid representations, and an objects-plus-points map, all built from the first sequence of each scene.
TABLE I shows the quantitative evaluation results of relocalization performance. We evaluated the accuracy and robustness of each method using the median position error (less than 30 cm) and the median rotation error (less than 30 degrees), as well as the proportion of query frames in the second sequences that were successfully relocalized while satisfying both the position and rotation thresholds. Our method achieves a higher valid ratio and accuracy in visual relocalization than OA-SLAM (for example, 69.33% versus 56.89% valid frames on TUM fr2/desk with objects plus points), thanks to our semantic object-level mapping method and pose refinement strategy. In addition, Fig. 7 visualizes the ground truth trajectory together with the successfully relocalized frames, which are represented by points. According to TABLE I, Fig. 6 and Fig. 7, when only objects are used for localization, both our mapping method and our relocalization strategy improve relocalization performance. When objects and points are used together, the robustness of visual relocalization is greatly improved while the overall accuracy remains close to that of ORB-SLAM2.
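The evaluation protocol can be summarized in a few lines. The sketch below reads the description as follows, which is an interpretation rather than a statement of the authors' evaluation script: a frame is valid when both errors are under the 30 cm and 30 degree thresholds, medians are taken over valid frames, and the valid ratio is computed over all query frames. The error values in the example are made up.

```python
from statistics import median


def relocalization_stats(errors, n_queries, max_pos_cm=30.0, max_rot_deg=30.0):
    """errors: (position error in cm, rotation error in degrees) per relocalized frame."""
    valid = [(p, r) for p, r in errors if p < max_pos_cm and r < max_rot_deg]
    if not valid:
        return None, None, 0.0
    median_pos = median(p for p, _ in valid)
    median_rot = median(r for _, r in valid)
    return median_pos, median_rot, 100.0 * len(valid) / n_queries


# Hypothetical run: 10 query frames, 5 returned a pose, 3 of them within both thresholds.
errors = [(2.0, 1.0), (4.0, 3.0), (50.0, 2.0), (6.0, 40.0), (8.0, 5.0)]
pos_err, rot_err, valid_ratio = relocalization_stats(errors, n_queries=10)
```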
V Conclusions
In this conference paper, we propose a novel semantic object-level mapping method and an object-based visual camera relocalization strategy, both of which are fully integrated into the ORB-SLAM2 backbone. Rather than generating ellipsoid representations from bounding box constraints, we use voxels to model object entities and directly compute more accurate ellipsoid representations, in order to better represent the position and pose of objects in unknown indoor scenes. By making full use of the accurate ellipsoid representations built in the proposed mapping process, we make the relocalization of visual SLAM more robust to large viewpoint changes while maintaining accuracy.
References
- [1] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
- [2] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International Conference on Computer Vision, 2011, pp. 2564–2571.
- [3] D. Galvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
- [4] C. Rubino, M. Crocco, and A. Del Bue, “3d object localisation from multi-view image detections,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1281–1294, 2017.
- [5] V. Gaudillière, G. Simon, and M.-O. Berger, “Camera relocalization with ellipsoidal abstraction of objects,” in 2019 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2019, pp. 8–18.
- [6] L. Nicholson, M. Milford, and N. Sünderhauf, “Quadricslam: Dual quadrics from object detections as landmarks in object-oriented slam,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2019.
- [7] S. Yang and S. Scherer, “Cubeslam: Monocular 3-d object slam,” IEEE Transactions on Robotics, vol. 35, no. 4, pp. 925–938, 2019.
- [8] M. Zins, G. Simon, and M.-O. Berger, “Object-based visual camera pose estimation from ellipsoidal model and 3d-aware ellipse prediction,” International Journal of Computer Vision, vol. 130, no. 4, pp. 1107–1126, 2022.
- [9] M. Zins, G. Simon, and M.-O. Berger, “Oa-slam: Leveraging objects for camera relocalization in visual slam,” in 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2022, pp. 720–728.
- [10] M. Grinvald, F. Furrer, T. Novkovic, J. J. Chung, C. Cadena, R. Siegwart, and J. Nieto, “Volumetric instance-aware semantic mapping and 3d object discovery,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 3037–3044, 2019.
- [11] N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, “Meaningful maps with object-oriented semantic mapping,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 5079–5085.
- [12] F. Furrer, T. Novkovic, M. Fehr, A. Gawel, M. Grinvald, T. Sattler, R. Siegwart, and J. Nieto, “Incremental object database: Building 3d models from multiple partial observations,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 6835–6842.
- [13] S. Lin, J. Wang, M. Xu, H. Zhao, and Z. Chen, “Topology aware object-level semantic mapping towards more robust loop closure,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7041–7048, 2021.
- [14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
- [15] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- [16] J. Terven and D. Cordova-Esparza, “A comprehensive review of yolo: From yolov1 and beyond,” 2023.
- [17] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, “Octomap: An efficient probabilistic 3d mapping framework based on octrees,” Autonomous robots, vol. 34, pp. 189–206, 2013.
- [18] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645–3649.
- [19] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval research logistics quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
- [20] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003.
- [21] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2012, pp. 573–580.
- [22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.