License: arXiv.org perpetual non-exclusive license
arXiv:2402.06951v1 [cs.CV] 10 Feb 2024

Semantic Object-level Modeling for Robust Visual Camera Relocalization

1st Yifan Zhu, School of Automation
Beijing Institute of Technology
Beijing, China
2nd Lingjuan Miao, School of Automation
Beijing Institute of Technology
Beijing, China
3rd Haitao Wu*, Aerospace Information Research Institute
Chinese Academy of Sciences
Beijing, China
4th Zhiqiang Zhou, School of Automation
Beijing Institute of Technology
Beijing, China
5th Weiyi Chen, School of Automation
Beijing Institute of Technology
Beijing, China
6th Longwen Wu, School of Automation
Beijing Institute of Technology
Beijing, China
Abstract

Visual relocalization is crucial for autonomous visual localization and navigation of mobile robots. Owing to improvements in CNN-based object detection algorithms, the robustness of visual relocalization has been greatly enhanced, especially from viewpoints where classical methods fail. However, ellipsoids (quadrics) generated from axis-aligned object detections may limit the accuracy of the object-level representation and degrade the performance of the visual relocalization system. In this paper, we propose a novel method of automatic object-level voxel modeling for accurate ellipsoidal representations of objects. For visual relocalization, we design a better pose optimization strategy for camera pose recovery that fully utilizes the projection characteristics of 2D fitted ellipses and accurate 3D ellipsoids. All of these modules are fully integrated into a visual SLAM system. Experimental results show that our semantic object-level mapping and object-based visual relocalization methods significantly enhance the performance of visual relocalization in terms of robustness to new viewpoints.

Index Terms:
visual relocalization, object-level mapping, ellipsoidal model, SLAM, instance segmentation

I Introduction

Visual relocalization refers to the use of image sets, 3D point clouds, semantic objects, or other useful data to obtain the camera pose of query image frames, in order to handle an uninitialized camera or a localization failure in Simultaneous Localization and Mapping (SLAM). In practice, 6-DOF camera pose estimation based on known global scene maps has been widely applied in fields such as mobile robotics, autonomous driving, and AR.

A robust and effective visual relocalization algorithm is extremely important for visual SLAM systems. Motion blur and rapid movement commonly cause visual SLAM tracking failure, which seriously hinders the widespread application of visual SLAM systems. Consequently, robust operation of the visual relocalization module is essential for recovering the SLAM process, and enhancing its robustness (reliability) is crucial for solving the lost-robot and kidnapped-robot problems that often occur in indoor scenes.

Figure 1: Relocalization using known ellipsoid-based and point-based maps. (a): An RGB frame from our collected video sequence for mapping. (b): There are viewpoint changes between the yellow keyframes used for mapping and the black ground-truth trajectory of the relocalization sequence. (c)&(d): Blue and red points are frames successfully relocalized using ORB-SLAM2[1] and our method, respectively. Our method allows the camera to be relocalized from viewpoints where ORB-SLAM2 fails.

Current keypoint-based visual SLAM frameworks such as ORB-SLAM2[1] rely on local ORB[2] features and Bag-of-Words[3] descriptors: they search the image set for the reference image most similar to the query, which yields matches between 2D keypoints in the query image and 3D map points. From these matches, the PnP algorithm recovers the camera pose inside a RANSAC loop. However, when the viewpoint changes significantly between the query frame and the reference keyframe database, the hand-crafted local features of the image change significantly as well, which makes feature matching difficult and leads to camera relocalization failure.

Meanwhile, owing to the rapid improvement of CNN-based object detection, methods that combine 3D geometric models (cuboids, ellipsoids) with 2D object detection for object-based camera pose estimation are emerging[4, 5, 6, 7, 8, 9]. In OA-SLAM[9], Zins et al. used object detections as constraints in the mapping process to automatically build ellipsoidal object representations on the fly, and utilized both objects and points in the map for robust visual relocalization.

These methods use geometric models to roughly represent objects through 2D axis-aligned bounding box constraints, and have achieved desirable results in 2D-3D object-level association and camera pose estimation. However, because they directly pair the 2D centers of axis-aligned bounding boxes with the corresponding center points of the 3D ellipsoids, the PnP algorithm yields camera poses that are not accurate enough, especially when the 3D ellipsoid representation is inaccurate or the center of the 2D detection box is not coherent with the projection of the center of the 3D ellipsoid. In [8], Zins et al. proposed a learning-based method that detects improved elliptic approximations of 2D detected objects, coherent with the 3D ellipsoids in terms of perspective projection. It shows remarkable results but requires new manual annotations for each new scene, which makes it hard to integrate into a SLAM system.

In order to enhance the robustness of relocalization in visual SLAM while ensuring accuracy, we explore a novel way to obtain accurate ellipsoidal representations (with accurate poses) for static object landmarks in unknown indoor scenes. For relocalization, when a query image arrives, instance segmentation masks on the query image are used to calculate fitted ellipses for regular objects, and an object-based optimization strategy is applied to refine the initial pose computed by the PnP algorithm. Our main contributions are as follows:

  • We propose a novel mapping method to obtain accurate ellipsoidal representations of semantic object-level landmarks, leveraging object-level voxel modelling and automatic object-level associations.

  • We design an object-based relocalization strategy that fully uses the projection characteristics of 2D fitted ellipses and the accurate 3D ellipsoids built during mapping.

  • Our object-level mapping method and object-based relocalization strategy are fully integrated into an RGBD SLAM system that is robust to a variety of viewpoints, adapts to unknown indoor scenes, and runs in real time.

II Related works

II-A Semantic Object-level Mapping

To autonomously navigate in real-world environments, robots must be able to perceive complex and unstructured scenes, and build object-oriented scene maps.

Recent methods [10, 11, 12, 13] used geometric representations such as point, mesh, voxel or TSDF models to build semantically meaningful, fine-grained object-level entities.

Some other recent methods [4, 5, 6, 7, 8, 9, 13] chose to build coarse-grained geometric representations (cuboids or ellipsoids) of objects to enhance visual SLAM or modules such as loop closing and relocalization in visual SLAM. In [13], Lin et al. relied on 2D instance segmentation masks to extract object-level voxel models, and used these voxel models to calculate 3D cuboids for loop closing in visual SLAM. Similarly, in CubeSLAM[7], Yang et al. used 3D cuboids to represent objects in the map. These cuboids are jointly optimized with camera poses and landmarks. In addition to representing objects with 3D cuboids, some methods use ellipsoids (quadrics) to represent objects in the map. In QuadricSLAM[6], Nicholson et al. derived a SLAM formulation that uses dual quadrics (ellipsoids) as 3D landmarks to represent objects. OA-SLAM[9], proposed by Zins et al., also uses ellipsoidal representations for objects during the mapping process. It is noteworthy that these coarse ellipsoidal representations use 2D axis-aligned bounding boxes for constraints and optimization, and can only roughly represent the pose of objects.

II-B Object-based Camera Pose Estimation

Recent works that use objects to calculate the camera pose in visual SLAM can be divided into two types. The first type tightly couples objects as landmarks with the SLAM system, and jointly optimizes the camera poses and object landmarks during the SLAM process[7, 6]. The second type couples object landmarks only loosely with SLAM, using the camera poses provided by SLAM to extract objects in the image. These methods build objects in the mapping process, and then use these objects to enhance SLAM modules such as loop closing and visual relocalization[13, 5, 9]. They take full advantage of current object detection or instance segmentation algorithms such as Faster R-CNN[14], Mask R-CNN[15] or YOLOs[16]. In OA-SLAM[9], Zins et al. used objects with high-level semantics and points with better spatial localization accuracy to improve the visual relocalization module, which often fails in ORB-SLAM2[1] when viewpoints vary widely. Although this approach is promising, we go one step further and propose a new method to generate more accurate ellipsoidal models for objects in the map. Using these accurate ellipsoidal models, we design a relocalization strategy for query images that further enhances the robustness of relocalization under large viewpoint changes.

III Method

III-A System Overview

Figure 2: System overview: dashed boxes are newly added elements within the ORB-SLAM2 backbone. The modules filled with different colors run in separate threads.

Our proposed framework is detailed in Fig. 2. It is based on the ORB-SLAM2 backbone and consists of two main parts: semantic object-level mapping and visual camera relocalization. Keyframes provided during the camera tracking process are used for semantic object-level mapping. The keyframes are processed by the instance segmentation thread to obtain bounding boxes and instance segmentation masks. We use the camera pose, RGBD data, and instance segmentation data of keyframes for the mapping process. In the relocalization module, instance segmentation is similarly applied to the RGB query frame in order to get fitted 2D ellipses for detected regular objects. The 2D ellipses are then associated with the 3D ellipsoids built in the mapping process to estimate and recover the camera pose when SLAM tracking is lost.

III-B Object-level Voxel Modelling

In order to accurately describe the geometric structure of objects in indoor scenes, and to facilitate object tracking (described in section III-C) and accurate ellipsoid generation (described in section III-D), we use voxel models to process the raw object segmentation data and achieve accurate 3D object extraction. We perform 2D instance segmentation[16] to obtain 2D observations of each object on the keyframes selected by camera tracking in ORB-SLAM2.

Given the poses, instance segmentation masks and depth maps of keyframes, we can obtain dense point clouds of objects from each keyframe's perspective. In practice, because of segmentation mask errors, the dense point clouds of actual objects contain a large amount of noise and useless background information, which seriously deteriorates the accuracy of the 3D object entities.
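As an illustration of this back-projection step, the following minimal sketch lifts the masked depth pixels of one keyframe into world-frame 3D points under a pinhole camera model. The function name `backproject_object`, the intrinsics `fx, fy, cx, cy`, and the camera-to-world matrix `T_wc` are illustrative assumptions, not names taken from the system.

```python
def backproject_object(depth, mask, fx, fy, cx, cy, T_wc):
    """Lift masked depth pixels to 3D world points (hypothetical helper).

    depth: rows of metric depth values (0 means no measurement)
    mask:  rows of 0/1 flags from instance segmentation
    T_wc:  4x4 camera-to-world pose as nested lists
    """
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            if not inside or z <= 0:
                continue  # skip background pixels and missing depth
            # Pinhole back-projection into the camera frame (homogeneous).
            pc = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Rigid transform into the world frame.
            pw = tuple(sum(T_wc[r][k] * pc[k] for k in range(4)) for r in range(3))
            points.append(pw)
    return points


# Tiny 2x2 example: only one pixel is both inside the mask and has valid depth.
depth = [[0.0, 2.0], [1.0, 0.0]]
mask = [[1, 1], [0, 1]]
T_wc = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
cloud = backproject_object(depth, mask, 1.0, 1.0, 0.0, 0.0, T_wc)
```

Mask errors at object boundaries end up in such a cloud as background points, which is the noise that the probabilistic voxel filtering is meant to remove.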

Hence, similar to the method in [13], we use semantic label probability values to replace the occupancy probability of the grid map. The semantic label probability stored in each voxel of an object entity is dynamically updated according to continuous observations from the RGBD sequence, as introduced in [17]. This method has been experimentally shown to filter noise produced by segmentation masks and sensors effectively and efficiently.

Assuming that there are $M$ categories of objects in the scene, for the $i$-th object, its semantic category ID is $c_i \in \{1, 2, \dots, M\}$. The number of nodes used to represent the probability voxel model of the $i$-th object is $N_i$, and the node index within the $i$-th object is denoted $n \in \{1, 2, \dots, N_i\}$. The semantic label probability of $c_i$ for the $n$-th node of the $i$-th object is denoted $P(c_i, n)$. After we get the observations in frame $T$, if a prior probability $P(c_i, n \mid z_{1:T-1})$ is known from the first frame to the $(T-1)$-th frame, then the semantic label probability $P(c_i, n \mid z_{1:T})$ is updated according to the following formulation:

$$P(c_i, n \mid z_{1:T}) = \left[ 1 + \frac{1 - P(c_i, n \mid z_T)}{P(c_i, n \mid z_T)} \cdot \frac{1 - P(c_i, n \mid z_{1:T-1})}{P(c_i, n \mid z_{1:T-1})} \cdot \frac{P(c_i, n)}{1 - P(c_i, n)} \right]^{-1}. \tag{1}$$

When an object has been tracked $\omega_0$ times by objects tracking (described in III-C), the voxel model of the object is inserted into the global map, and the label with the maximum semantic probability is selected as the object's final semantic label. Then, according to a preset filtering threshold, voxels whose semantic label probability is less than the threshold are cleared stage by stage. $\omega_0$ and the initial semantic label probability are set to 3 and 0.5, respectively.
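The update in Eq. (1) and the subsequent threshold filtering can be sketched as follows. This is a minimal illustration, assuming all probabilities lie strictly between 0 and 1; the function names and the per-frame observation values are hypothetical.

```python
def update_label_probability(p_prev, p_obs, p_prior=0.5):
    """Fuse one new observation into a voxel's semantic label probability (Eq. 1).

    p_prev:  P(c_i, n | z_{1:T-1}), the probability before the new frame
    p_obs:   P(c_i, n | z_T), the probability from the current observation
    p_prior: P(c_i, n), the prior (0.5 is the initial value given in the text)
    All inputs must lie strictly inside (0, 1) to avoid division by zero.
    """
    odds = ((1.0 - p_obs) / p_obs) * ((1.0 - p_prev) / p_prev) * (p_prior / (1.0 - p_prior))
    return 1.0 / (1.0 + odds)


def filter_voxels(voxel_probs, threshold):
    """Keep only voxels whose label probability is not below the threshold."""
    return {key: p for key, p in voxel_probs.items() if p >= threshold}


p = 0.5  # initial semantic label probability
for p_obs in (0.7, 0.7, 0.7):  # hypothetical observations from three keyframes
    p = update_label_probability(p, p_obs)

kept = filter_voxels({(0, 0, 0): 0.9, (0, 0, 1): 0.3}, threshold=0.5)
```

With a prior of 0.5 the last factor of Eq. (1) equals one, so consistent observations above 0.5 push the voxel probability toward 1, while voxels that are mostly observed as background decay and are eventually cleared.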

Figure 3: Overview of the object-based mapping pipeline. TV monitor, keyboard and book are three different sample objects. The object tracking procedure is not shown in the figure. The associated 3D object is continuously updated and filtered through the 2D images, and the pose and ellipsoid model are updated simultaneously.

III-C Objects Tracking

Objects tracking is divided into two categories: 2D objects tracking between adjacent frames and 3D objects matching. 2D object tracking determines whether observations of the same object class in adjacent frames belong to the same object; 3D object matching determines whether a new object is an existing object in the scene or an unknown new object. When the same object is observed in consecutive frames, it is mainly 2D object tracking that takes effect; when the object disappears from the field of view for a period of time and then suddenly reappears (such as at a loop), 3D object matching takes effect.

2D objects tracking. Similarly to [18, 9], 2D multiple-object tracking is performed between adjacent keyframes, because 2D-3D object correspondences need to be transferred from the previous $(t-1)$-th frame to the current $t$-th frame through 2D objects tracking. The voxels of an object $O_{t-1,i}$ (where the subscript denotes the $i$-th object in the map detected in frame $t-1$) are projected onto the camera frame to form a 2D point set, which generates known axis-aligned 2D bounding boxes. The optimal associations between the known 2D bounding boxes and the currently detected 2D bounding boxes are computed using the Hungarian algorithm[19]. Each element in the cost matrix is defined as:

$$\mathrm{cost}(i,j) = \lambda_1 f(O_{t-1,i}, o_{t,j}) + \lambda_2 \left(1 - \mathrm{IoU}\left(\mathrm{bbox}(\mathrm{proj}(O_{t-1,i})), \mathrm{bbox}(o_{t,j})\right)\right), \tag{2}$$

$$f(a,b) = \begin{cases} 1, & \text{if } a \text{ and } b \text{ have the same semantic ID,} \\ 0, & \text{otherwise,} \end{cases} \tag{3}$$

where $\lambda_1$ and $\lambda_2$ ($\lambda_1 \gg \lambda_2$) are the weights of the semantic cost term and the reprojection cost term, respectively. The operations proj, bbox and IoU denote projecting a 3D object to a 2D point set, calculating the bounding box of a 2D point set, and computing the intersection-over-union (IoU), respectively. For the $i$-th object $O_{t-1,i}$ tracked in the last frame and the $j$-th 2D detection $o_{t,j}$ in the current frame to be considered a 2D-3D matching pair, semantic label ID consistency must be met first, and then the IoU-based reprojection error must be minimal.
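The association step can be sketched as below. Two points are assumptions of this sketch rather than statements about the system: first, Eq. (3) as printed assigns 1 to matching IDs, but since the assignment minimizes cost, the sketch penalizes a label mismatch (that is, it uses $1 - f$); second, a brute-force search over permutations stands in for the Hungarian algorithm, which finds the same optimum far more efficiently. The dictionary keys, weights and function names are hypothetical.

```python
from itertools import permutations


def bbox(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) of a 2D point set."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cost(track, det, lam1=100.0, lam2=1.0):
    """Semantic mismatch penalty plus (1 - IoU); lam1 >> lam2 as in the text."""
    mismatch = 0.0 if track["label"] == det["label"] else 1.0
    return lam1 * mismatch + lam2 * (1.0 - iou(bbox(track["projected"]), det["box"]))


def associate(tracks, dets):
    """Minimum-cost one-to-one assignment by exhaustive search.

    Only suitable for a handful of objects; returns (track, detection) index
    pairs, or an empty list when there are more tracks than detections.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(dets)), len(tracks)):
        c = sum(cost(tracks[i], dets[j]) for i, j in enumerate(perm))
        if c < best_cost:
            best, best_cost = perm, c
    return list(enumerate(best)) if best is not None else []


tracks = [
    {"label": "book", "projected": [(0, 0), (10, 10)]},
    {"label": "tv", "projected": [(20, 20), (40, 40)]},
]
dets = [
    {"label": "tv", "box": (21, 21, 41, 41)},
    {"label": "book", "box": (1, 1, 11, 11)},
]
pairs = associate(tracks, dets)
```

Because the semantic weight dominates, a pair with different labels is only chosen when no label-consistent alternative exists; a practical system would additionally reject such pairs with a cost gate.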

3D objects matching. For newly detected 2D objects that are not matched in the current frame $t$ by 2D objects tracking, we need to determine whether their corresponding 3D objects already exist in the map. We generate a voxel model $v$ for a newly detected 2D object and then use voxel association to determine whether the detected object already exists in the map. Specifically, we traverse the map objects and retain each object $v_i$ that matches the semantic label ID. First, we set a search radius $r$, then search and record the number of voxels in $v_i$ that are closest to the voxels in $v$, denoted $n(v, v_i)$. $n(v_i, v)$ is calculated in the same way. The criterion for a successful 3D object match is as follows:

$$\begin{cases} n(v, v_i) > \xi \min(v, v_i), \\ n(v_i, v) > \xi \min(v, v_i), \end{cases} \tag{4}$$

where $\min(v, v_i)$ denotes the smaller of the node counts of $v$ and $v_i$, and $\xi$ is a threshold set to 0.5 in our experiments.
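One possible reading of the matching test in Eq. (4) is sketched below. The interpretation of $n(a, b)$ as the number of voxels in $a$ that have at least one voxel of $b$ within the search radius is an assumption of this sketch, as are the function names and the linear neighbor search (a spatial index would be used for real voxel models).

```python
import math


def count_near(a, b, radius):
    """Number of voxel centers in `a` with some center of `b` within `radius`."""
    return sum(1 for p in a if any(math.dist(p, q) <= radius for q in b))


def is_same_object(v, v_i, radius, xi=0.5):
    """Symmetric overlap test of Eq. (4) with threshold xi = 0.5."""
    smaller = min(len(v), len(v_i))
    return (count_near(v, v_i, radius) > xi * smaller
            and count_near(v_i, v, radius) > xi * smaller)


new_obj = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
map_obj = [(0.0, 0.0, 0.05), (1.0, 0.0, 0.05), (5.0, 5.0, 5.0)]
far_obj = [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0)]

same = is_same_object(new_obj, map_obj, radius=0.1)
different = is_same_object(new_obj, far_obj, radius=0.1)
```

Checking the overlap in both directions prevents a small voxel model from being merged into a much larger one that merely happens to contain it.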

III-D Accurate Ellipsoids Generation

We use ellipsoids (dual quadrics) to represent individual objects in the map. Similarly to [6, 9], an ellipsoid is defined by nine parameters: rotation, position and three semi-axes. However, compared with these methods, ellipsoids generated by our method have better accuracy in terms of these nine parameters. In order to obtain the accurate ellipsoid $\mathbf{Q}^{*}$ for an inserted object in the map, the center coordinates $t = [x, y, z]^{T}$ and rotation $\mathbf{R}(\theta)$ relative to the world coordinate system, and the size $(h, w, l)$ of the voxel model are calculated. The ellipsoid $\mathbf{Q}^{*}$ is determined by the transformation $\mathbf{Z}$ and the initial ellipsoid $\breve{\mathbf{Q}}^{*}$ centred at the origin:

$$\mathbf{Q}^{*} = \mathbf{Z}^{\top} \breve{\mathbf{Q}}^{*} \mathbf{Z}, \tag{5}$$

$$\mathbf{Z} = \begin{pmatrix} \mathbf{R}(\theta) & t^{\top} \\ \mathbf{0}_{3}^{\top} & 1 \end{pmatrix} \quad \text{and} \quad \breve{\mathbf{Q}}^{*} = \mathrm{diag}\left(\frac{l^{2}}{4}, \frac{w^{2}}{4}, \frac{h^{2}}{4}, -1\right). \tag{6}$$
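As a numerical illustration of Eqs. (5) and (6), the sketch below assembles a dual quadric from a center, a yaw angle and the object size. It is written under two assumptions: the rotation is a pure yaw about the world z axis, and the transform is applied as $\mathbf{Z}\breve{\mathbf{Q}}^{*}\mathbf{Z}^{\top}$ with $\mathbf{Z}$ built from the rotation and the translation column, which is the placement of the transpose under which dual quadrics project according to Eq. (7). The helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dual_quadric(center, yaw, size):
    """4x4 dual quadric of an ellipsoid with full extents size = (l, w, h)."""
    l, w, h = size
    x, y, z = center
    c, s = math.cos(yaw), math.sin(yaw)
    # Homogeneous object-to-world transform (assumed convention).
    Z = [[c, -s, 0.0, x],
         [s, c, 0.0, y],
         [0.0, 0.0, 1.0, z],
         [0.0, 0.0, 0.0, 1.0]]
    # Origin-centred ellipsoid with semi-axes l/2, w/2, h/2 (Eq. 6).
    Q0 = [[l * l / 4.0, 0.0, 0.0, 0.0],
          [0.0, w * w / 4.0, 0.0, 0.0],
          [0.0, 0.0, h * h / 4.0, 0.0],
          [0.0, 0.0, 0.0, -1.0]]
    return matmul(matmul(Z, Q0), transpose(Z))


Q = dual_quadric(center=(1.0, 2.0, 3.0), yaw=0.0, size=(2.0, 4.0, 6.0))
# The ellipsoid center can be read back from the last column of the dual quadric.
recovered_center = [Q[i][3] / Q[3][3] for i in range(3)]
```

Reading the center back from the last column is a quick sanity check that the quadric was assembled in the intended frame.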

Pose estimation for objects. We assume that all objects are placed on the ground and that the center coordinates of the objects have been calculated, so the six-DOF pose estimation is reduced to one DOF (the yaw angle). The 3D voxels of an object are projected onto the ground, and the PCA algorithm is applied to compute the main axis (shown in red in Fig. 3) of the projected 2D point set. The yaw angle of an object is determined by the angle of the computed main axis. Hence, we can obtain accurate 9-DOF ellipsoidal representations, which are inscribed inside the cuboids generated from the computed pose, for all the objects in the map. It is worth mentioning that, compared to our method, works like [6, 9] merely use axis-aligned bounding box constraints to generate coarse ellipsoids, resulting in lower accuracy in object representation.
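The yaw estimation can be illustrated with the closed-form principal axis of a 2D covariance matrix, which is what PCA reduces to in the plane. The sketch assumes that the ground plane is the world x-y plane with z pointing up, and that the cuboid size is taken as the extent of the voxels along the rotated axes; both are assumptions of this example, as are the function names.

```python
import math


def yaw_from_voxels(voxels):
    """Angle of the main axis of the voxel centers projected onto the x-y plane."""
    n = len(voxels)
    mx = sum(p[0] for p in voxels) / n
    my = sum(p[1] for p in voxels) / n
    sxx = sum((p[0] - mx) ** 2 for p in voxels) / n
    syy = sum((p[1] - my) ** 2 for p in voxels) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in voxels) / n
    # Orientation of the dominant eigenvector of [[sxx, sxy], [sxy, syy]].
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


def oriented_extents(voxels, yaw):
    """Extents (l, w, h) of the voxels along the main axis, its normal, and z."""
    c, s = math.cos(yaw), math.sin(yaw)
    along = [c * p[0] + s * p[1] for p in voxels]
    across = [-s * p[0] + c * p[1] for p in voxels]
    up = [p[2] for p in voxels]
    return (max(along) - min(along), max(across) - min(across), max(up) - min(up))


# Hypothetical voxel centers lying along the diagonal of the ground plane.
voxels = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.5), (2.0, 2.0, 1.0), (3.0, 3.0, 0.0)]
yaw = yaw_from_voxels(voxels)
l, w, h = oriented_extents(voxels, yaw)
```

For this diagonal example an axis-aligned box would report equal extents of 3 along x and y, whereas the oriented cuboid recovers the elongated shape, which is the accuracy gain the rotated ellipsoid is meant to capture.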

Figure 4: Accurate ellipsoid mapping in three different scenes. (I, II, III) are TUM fr3/long_office_household, TUM fr2/desk and the customized scene, respectively. From left to right, the columns are the RGB image(s) of the three scenes, the mapping process, voxels and cuboids, ellipsoid models and ORB points on object surfaces.

III-E Object-based Relocalization

The main procedure of object-based relocalization is shown in Fig. 2. When camera tracking fails and the pose of the query frame must be found in the already built global map, the object-based relocalization module is enabled. The RGB image, instance segmentation masks and axis-aligned bounding boxes of the query image are sent to the relocalization module. Objects whose projection masks are simple convex polygons are defined as regular objects. For regular objects, we fit the observation ellipses based on the 2D masks; for other objects, inscribed observation ellipses are obtained from the axis-aligned bounding boxes. After obtaining the fitted observation ellipses, initial pose recovery and pose refinement for the query image are discussed separately.
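The text does not specify how the observation ellipse is fitted to the mask, so the sketch below shows one simple possibility, a moment-based fit, together with the inscribed ellipse of a bounding box. It relies on the fact that a uniformly filled ellipse with semi-axis $a$ has variance $a^{2}/4$ along that axis; the function names and the synthetic mask are hypothetical.

```python
import math


def fit_ellipse_from_mask(pixels):
    """Moment-based ellipse fit; returns (center, (a, b), angle in radians)."""
    n = len(pixels)
    mu = sum(p[0] for p in pixels) / n
    mv = sum(p[1] for p in pixels) / n
    suu = sum((p[0] - mu) ** 2 for p in pixels) / n
    svv = sum((p[1] - mv) ** 2 for p in pixels) / n
    suv = sum((p[0] - mu) * (p[1] - mv) for p in pixels) / n
    # Eigenvalues of the 2x2 covariance matrix in closed form.
    half_trace = 0.5 * (suu + svv)
    root = math.sqrt((0.5 * (suu - svv)) ** 2 + suv ** 2)
    lam1, lam2 = half_trace + root, max(half_trace - root, 0.0)
    angle = 0.5 * math.atan2(2.0 * suv, suu - svv)
    # Semi-axis = 2 * standard deviation for a uniformly filled ellipse.
    return (mu, mv), (2.0 * math.sqrt(lam1), 2.0 * math.sqrt(lam2)), angle


def inscribed_ellipse(box):
    """Axis-aligned ellipse inscribed in a box (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0), ((x1 - x0) / 2.0, (y1 - y0) / 2.0), 0.0


# Synthetic mask: a filled ellipse with semi-axes 20 and 10 centred at (50, 40).
pixels = [(u, v) for u in range(101) for v in range(81)
          if ((u - 50) / 20.0) ** 2 + ((v - 40) / 10.0) ** 2 <= 1.0]
center, axes, angle = fit_ellipse_from_mask(pixels)
fallback = inscribed_ellipse((10, 20, 50, 40))
```

A moment-based fit uses every pixel of the mask, so it degrades gracefully under small boundary errors, whereas the inscribed ellipse depends only on the four box edges.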

Initial pose recovery. An ellipsoid projects to an ellipse from any viewpoint, and the equation of this ellipse can be expressed in closed form using the dual space[9]. In that space, for a 2D observation ellipse $\mathbf{C}^{*}$ and the corresponding 3D ellipsoid $\mathbf{Q}^{*}$, if the projection matrix is $\mathbf{P}$, the following projection equation holds[20]:

$$\mathbf{C}^{*} = \mathbf{P}\mathbf{Q}^{*}\mathbf{P}^{\top}. \tag{7}$$

Therefore, the initial pose recovery problem involves finding the correspondences between the observation ellipses of the query frame and the ellipsoids in the map, and solving for the projection matrix of the query frame. We use the method (P3P loop) introduced in [9, 8] to jointly determine object correspondences between the query image and the map and estimate the initial pose of the camera. Unlike these works, we choose the fitting method according to whether the 2D mask is regular. If the boundary of the mask is roughly a quadrilateral, the object is considered a regular object and the 2D point set in the mask is used to fit the observation ellipse; otherwise, the ellipse inscribed in the bounding box is directly used as the observation ellipse. The observation ellipse fitted from the mask of a regular object is more in line with the projection characteristics of an ellipse-ellipsoid pair.
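The projection in Eq. (7) can be checked numerically with a few lines of plain Python. The sketch below projects a dual quadric with a 3x4 projection matrix and reads the ellipse center from the resulting dual conic; the intrinsic values and the unit sphere placed 5 units in front of the camera are hypothetical test values.

```python
def project_dual_quadric(P, Q):
    """Dual conic C* = P Q* P^T for a 3x4 matrix P and a 4x4 dual quadric Q*."""
    PQ = [[sum(P[i][k] * Q[k][j] for k in range(4)) for j in range(4)] for i in range(3)]
    return [[sum(PQ[i][k] * P[j][k] for k in range(4)) for j in range(3)] for i in range(3)]


def conic_center(C):
    """Center of the ellipse described by the dual conic C*."""
    return (C[0][2] / C[2][2], C[1][2] / C[2][2])


# Unit sphere centred at (0, 0, 5), written directly as a dual quadric.
Q = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, -24.0, -5.0],
     [0.0, 0.0, -5.0, -1.0]]
# P = K [I | 0] with focal length 100 and principal point (320, 240).
P = [[100.0, 0.0, 320.0, 0.0],
     [0.0, 100.0, 240.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
C = project_dual_quadric(P, Q)
u, v = conic_center(C)
```

The sphere lies on the optical axis, so its image ellipse is centred on the principal point. Note that in general the ellipse center differs from the projection of the ellipsoid center, which is one reason why pairing box centers with ellipsoid centers limits the accuracy of PnP.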

Pose refinement. Refinement of the initial camera pose uses the accurate ellipsoids $\mathbf{Q}^{*}$ in the map and the set of observation ellipses, denoted $\chi$. The coarse pose of the query image and the 2D-3D object correspondences $f(\cdot)$ are jointly obtained by the P3P loop. The robust kernel function $\rho(\cdot)$ and the covariance matrix $\mathbf{\Sigma}$, set based on the elliptical area, are applied in the following formula:

$$\{\mathbf{R}^{*}, t^{*}\} = \underset{\mathbf{R}, t}{\operatorname{argmin}} \sum_{i \in \chi} \rho\left( \mathscr{W}_{2}^{2}\left(\mathbf{E}_{i}, \mathbf{P}\mathbf{Q}^{*}_{f(i)}\mathbf{P}^{\top}\right)_{\mathbf{\Sigma}} \right), \tag{8}$$

where $\mathscr{W}_{2}^{2}$ is the Wasserstein distance between two Gaussian distributions, which are determined by the observation ellipse $\mathbf{E}_{i}$ and the ellipse projected with the camera projection matrix $\mathbf{P}$. In OA-SLAM[9], Zins et al. detailed this process, but for ellipsoid optimization in the map. The object-based camera pose refinement strategy prevents the camera pose from deteriorating due to object occlusion, inaccuracy of the observation ellipses, and other causes. Because the ellipsoidal representations of objects in the map are accurate and the observed 2D fitted ellipses are in line with the projection characteristics, the pose refinement strategy makes better use of the ellipse-ellipsoid projection equation (7) in the pinhole camera model. In addition, the points produced by ORB-SLAM2 can also be used to further refine the camera pose, as in OA-SLAM.
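For two 2D Gaussians, the squared 2-Wasserstein distance has the closed form $\lVert \mu_1 - \mu_2 \rVert^{2} + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\right)$, and for 2x2 matrices the trace of the matrix square root can be evaluated without an eigendecomposition. The sketch below uses this form. Mapping an ellipse to a Gaussian with mean at the ellipse center and covariance $\mathbf{R}\,\mathrm{diag}(a^{2}, b^{2})\,\mathbf{R}^{\top}$ is an assumption of the sketch, and the robust kernel and the weighting by $\mathbf{\Sigma}$ in Eq. (8) are omitted.

```python
import math


def ellipse_to_gaussian(center, axes, angle):
    """Gaussian (mean, covariance) associated with an ellipse (assumed mapping)."""
    a, b = axes
    c, s = math.cos(angle), math.sin(angle)
    off = (a * a - b * b) * c * s
    cov = [[a * a * c * c + b * b * s * s, off],
           [off, a * a * s * s + b * b * c * c]]
    return center, cov


def wasserstein2_sq(g1, g2):
    """Squared 2-Wasserstein distance between two 2D Gaussians."""
    (m1, S1), (m2, S2) = g1, g2
    d2 = (m1[0] - m2[0]) ** 2 + (m1[1] - m2[1]) ** 2
    tr1 = S1[0][0] + S1[1][1]
    tr2 = S2[0][0] + S2[1][1]
    det1 = S1[0][0] * S1[1][1] - S1[0][1] * S1[1][0]
    det2 = S2[0][0] * S2[1][1] - S2[0][1] * S2[1][0]
    tr12 = sum(S1[i][k] * S2[k][i] for i in range(2) for k in range(2))
    # Tr((S1^1/2 S2 S1^1/2)^1/2) = sqrt(Tr(S1 S2) + 2 sqrt(det S1 det S2)) in 2D.
    cross = math.sqrt(max(tr12 + 2.0 * math.sqrt(max(det1 * det2, 0.0)), 0.0))
    return d2 + tr1 + tr2 - 2.0 * cross


observed = ellipse_to_gaussian((0.0, 0.0), (3.0, 2.0), 0.3)
shifted = ellipse_to_gaussian((3.0, 4.0), (3.0, 2.0), 0.3)
big_circle = ellipse_to_gaussian((0.0, 0.0), (2.0, 2.0), 0.0)
small_circle = ellipse_to_gaussian((0.0, 0.0), (1.0, 1.0), 0.0)

d_same = wasserstein2_sq(observed, observed)
d_shift = wasserstein2_sq(observed, shifted)
d_scale = wasserstein2_sq(big_circle, small_circle)
```

Unlike IoU, this distance stays informative when the two ellipses do not overlap: a pure translation contributes exactly the squared center offset, so the gradient still points toward the correct pose.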

IV Experiments

IV-A Experimental Settings

To evaluate our object-based relocalization method and mapping method, we used three common indoor scenes: two from the TUM RGBD datasets[21] (fr3/long_office_household and fr2/desk) and one customized RGBD dataset collected in our office. For the customized dataset, we used a Kinect v2 RGBD camera to capture the RGBD sequences at 640$\times$360 resolution and 30 frames per second.

It is noteworthy that each scene contains two video sequences. The first sequence is used for mapping, and the other sequence, whose viewpoints differ significantly from those of the mapping sequence, is used for validating the relocalization algorithm. The ground truth poses of relocalized frames in the second video sequences are obtained using a state-of-the-art visual SLAM method[1], with loop correction if needed.

TABLE I: Evaluations of Relocalization Performance in Three Different Scenarios

| Methods | TUM fr3/long_office_household¹ | | | TUM fr2/desk¹ | | | customized² | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | pos. err. (cm) | rot. err. (°) | valid (%) | pos. err. (cm) | rot. err. (°) | valid (%) | pos. err. (cm) | rot. err. (°) | valid (%) |
| ORB-SLAM2[1] (points) | 4.34 | 1.48 | 60.44 | 3.44 | 0.49 | 17.85 | 2.64 | 0.98 | 47.61 |
| OA-SLAM[9] (objects) | 17.60 | 7.06 | 42.73 | 20.29 | 9.03 | 45.49 | 11.35 | 8.12 | 87.61 |
| OA-SLAM (objects+points) | 7.06 | 3.68 | 45.94 | 9.50 | 5.67 | 56.89 | 4.69 | 2.36 | 95.75 |
| Ours (objects) | 11.15 | 4.92 | 56.31 | 16.53 | 6.17 | 60.49 | 9.87 | 6.44 | 93.98 |
| Ours (objects+points) | 5.16 | 2.40 | 60.68 | 7.58 | 3.37 | 69.33 | 2.96 | 1.19 | 96.28 |

¹ OA-SLAM and our method use different maps built separately.
² OA-SLAM and our method use the same map built by our method.

Additionally, the object detector or segmentation method used in the front-end directly affects the performance of semantic object-level mapping. We used YOLO v8[16] pre-trained on the COCO dataset[22] as our detector and instance segmentation algorithm, without any fine-tuning. To ensure that the object landmark categories used in mapping are consistent, OA-SLAM[9] was run with the same pre-trained YOLO v8 detector, with some object categories disabled.

IV-B Semantic Object-level Mapping

Our semantic object-level mapping method was evaluated on three common indoor scenes, as shown in Fig. 4. The voxel models of common objects (TV monitors, chairs, books, keyboards, teddy bear, etc.) are clearly generated by our object-level voxel modelling method (III-B). Simultaneously, the pose of each voxel model is computed (III-D) and represented by a rotated 3D cuboid. Ultimately, accurate ellipsoidal representations of all objects in the map are generated for visual camera relocalization when SLAM tracking fails. This semantic object-level mapping algorithm, which is entirely integrated into ORB-SLAM2, runs automatically and in real time in a separate thread.

Fig. 5 shows the object mapping comparison between our method and OA-SLAM[9]. Our method yields visibly more accurate ellipsoidal representations, which benefits the accuracy and robustness of our object-based relocalization method (III-E).
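As a rough illustration of how a rotated cuboid and an ellipsoid can be derived from a voxel model, the Python sketch below fits both to a list of voxel centres. It is not the procedure of III-B or III-D, which is defined in the earlier sections. It rests on simplifying assumptions that are ours: the object is rotated only about the vertical $z$ axis, the yaw is taken from the principal direction of the voxel centres in the $xy$ plane, and the ellipsoid is inscribed in the cuboid, so its semi-axes equal the cuboid half-extents. The function name is hypothetical.

```python
import math


def fit_yaw_ellipsoid(voxel_centers):
    """Fit a yaw-rotated cuboid to voxel centres; return the inscribed ellipsoid.

    Assumptions (illustrative only): rotation about z only, yaw from
    2D PCA in the xy plane, semi-axes equal to the cuboid half-extents.
    Returns (center_xyz, semi_axes, yaw_in_radians).
    """
    if not voxel_centers:
        raise ValueError("at least one voxel centre is required")
    n = len(voxel_centers)
    mx = sum(p[0] for p in voxel_centers) / n
    my = sum(p[1] for p in voxel_centers) / n
    sxx = sum((p[0] - mx) ** 2 for p in voxel_centers) / n
    syy = sum((p[1] - my) ** 2 for p in voxel_centers) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in voxel_centers) / n

    # Orientation of the dominant principal axis of a 2x2 covariance.
    yaw = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(yaw), math.sin(yaw)

    # Coordinates of the voxel centres in the object-aligned frame.
    us = [c * p[0] + s * p[1] for p in voxel_centers]
    vs = [-s * p[0] + c * p[1] for p in voxel_centers]
    zs = [p[2] for p in voxel_centers]

    semi_axes = tuple((max(a) - min(a)) / 2.0 for a in (us, vs, zs))
    mid_u, mid_v, mid_z = ((max(a) + min(a)) / 2.0 for a in (us, vs, zs))

    # Rotate the cuboid centre back into the map frame.
    center = (c * mid_u - s * mid_v, s * mid_u + c * mid_v, mid_z)
    return center, semi_axes, yaw
```

Because the extents are measured between voxel centres, half a voxel edge would have to be added to each semi-axis to cover the full voxels.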

Figure 5: Mapping comparison between our method and OA-SLAM[9] for ellipsoidal landmark mapping on TUM fr3/long_office_household. Our method produces more accurate ellipsoidal representations.

Figure 6: Percentage of estimated camera positions and orientations with respect to the corresponding error thresholds, evaluated on the customized dataset. Our method and OA-SLAM use the same ellipsoidal map built by our method. The result shows that our relocalization strategy makes full use of the accurate 3D ellipsoidal landmarks and achieves the best performance.

Figure 7: Comparison of relocalization trajectories using different maps built separately on the customized dataset. Note that the map built by OA-SLAM is not shown in this figure. Yellow keyframes are used for mapping and have some viewpoint changes compared to the black ground truth trajectory. Blue, green and red relocalization points correspond to ORB-SLAM2(reloc), OA-SLAM(reloc) and our method, respectively. Our valid relocalized query frames, shown as red points, concentrate more closely around the black ground truth.

IV-C Visual Relocalization

Our visual relocalization algorithm was evaluated on the second sequence of each of the three pre-built scenes, which is captured from different viewpoints. We used ORB-SLAM2[1] and OA-SLAM[9] as baselines. The map used by ORB-SLAM2 is the ORB point cloud map built from the first sequence of each scene. OA-SLAM and our method use two types of maps: pure object-level maps with ellipsoid representations, and objects-plus-points maps, all built from the first sequence of each scene.

TABLE I shows the quantitative evaluation results of relocalization performance. We evaluated the accuracy and robustness of each method using the median position error (below 30 cm) and median rotation error (below 30 degrees), as well as the proportion of query frames in the second sequences that were successfully relocalized while satisfying both the position and rotation thresholds. Our method achieves a higher valid ratio and accuracy than OA-SLAM, thanks to our semantic object-level mapping method and pose refinement strategy. In addition, Fig. 7 visually compares the ground truth trajectory with the successfully relocalized frames, represented by points. According to TABLE I, Fig. 6 and Fig. 7, when only objects are used for localization, both our mapping method and our relocalization strategy improve relocalization performance. When objects and points are used together, the robustness of visual relocalization is greatly improved while the overall accuracy remains close to that of ORB-SLAM2: on TUM fr2/desk, for example, the valid ratio rises from 17.85% for ORB-SLAM2 to 69.33%, with a median position error of 7.58 cm against 3.44 cm.
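The evaluation protocol described above can be sketched in Python as follows. This is one possible reading rather than the evaluation script used for TABLE I: we assume that the medians are taken over the valid frames only, that translations are given in metres, and that rotations are $3\times3$ matrices given as nested lists. The function names and the input layout are hypothetical.

```python
import math
from statistics import median


def rotation_error_deg(r_est, r_gt):
    """Angle in degrees of the relative rotation R_gt^T R_est."""
    # tr(R_gt^T R_est) equals the sum of element-wise products.
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def evaluate_relocalization(results, n_queries,
                            pos_thresh_cm=30.0, rot_thresh_deg=30.0):
    """Return (median pos. err. in cm, median rot. err. in deg, valid %).

    `results` holds one (t_est, r_est, t_gt, r_gt) tuple per query frame
    for which a pose was returned; `n_queries` counts all query frames.
    """
    valid = []
    for t_est, r_est, t_gt, r_gt in results:
        pos_cm = 100.0 * math.dist(t_est, t_gt)
        rot_deg = rotation_error_deg(r_est, r_gt)
        # A frame is valid only if it satisfies both thresholds.
        if pos_cm < pos_thresh_cm and rot_deg < rot_thresh_deg:
            valid.append((pos_cm, rot_deg))
    if not valid:
        return None, None, 0.0
    med_pos = median(p for p, _ in valid)
    med_rot = median(r for _, r in valid)
    return med_pos, med_rot, 100.0 * len(valid) / n_queries
```

Dividing by the number of all query frames, rather than by the number of returned poses, means that a method which returns no pose for a frame is penalized in the valid ratio in the same way as one that returns a pose outside the thresholds.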

V Conclusions

In this paper, we propose a novel semantic object-level mapping method and an object-based visual camera relocalization strategy, both fully integrated into the ORB-SLAM2 backbone. Rather than generating ellipsoid representations from bounding box constraints, we use voxels to model object entities and directly compute more accurate ellipsoid representations, which better represent the position and pose of objects in unknown indoor scenes. By making full use of the accurate ellipsoid representations built in the proposed mapping process, we make the relocalization of visual SLAM more robust to large viewpoint changes while preserving accuracy.

References

  • [1] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
  • [2] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International Conference on Computer Vision, 2011, pp. 2564–2571.
  • [3] D. Galvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
  • [4] C. Rubino, M. Crocco, and A. Del Bue, “3d object localisation from multi-view image detections,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1281–1294, 2017.
  • [5] V. Gaudillière, G. Simon, and M.-O. Berger, “Camera relocalization with ellipsoidal abstraction of objects,” in 2019 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2019, pp. 8–18.
  • [6] L. Nicholson, M. Milford, and N. Sünderhauf, “Quadricslam: Dual quadrics from object detections as landmarks in object-oriented slam,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2019.
  • [7] S. Yang and S. Scherer, “Cubeslam: Monocular 3-d object slam,” IEEE Transactions on Robotics, vol. 35, no. 4, pp. 925–938, 2019.
  • [8] M. Zins, G. Simon, and M.-O. Berger, “Object-based visual camera pose estimation from ellipsoidal model and 3d-aware ellipse prediction,” International Journal of Computer Vision, vol. 130, no. 4, pp. 1107–1126, 2022.
  • [9] M. Zins, G. Simon, and M.-O. Berger, “Oa-slam: Leveraging objects for camera relocalization in visual slam,” in 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2022, pp. 720–728.
  • [10] M. Grinvald, F. Furrer, T. Novkovic, J. J. Chung, C. Cadena, R. Siegwart, and J. Nieto, “Volumetric instance-aware semantic mapping and 3d object discovery,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 3037–3044, 2019.
  • [11] N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, “Meaningful maps with object-oriented semantic mapping,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 5079–5085.
  • [12] F. Furrer, T. Novkovic, M. Fehr, A. Gawel, M. Grinvald, T. Sattler, R. Siegwart, and J. Nieto, “Incremental object database: Building 3d models from multiple partial observations,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 6835–6842.
  • [13] S. Lin, J. Wang, M. Xu, H. Zhao, and Z. Chen, “Topology aware object-level semantic mapping towards more robust loop closure,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7041–7048, 2021.
  • [14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
  • [15] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [16] J. Terven and D. Cordova-Esparza, “A comprehensive review of yolo: From yolov1 and beyond,” 2023.
  • [17] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, “Octomap: An efficient probabilistic 3d mapping framework based on octrees,” Autonomous robots, vol. 34, pp. 189–206, 2013.
  • [18] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645–3649.
  • [19] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval research logistics quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
  • [20] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003.
  • [21] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2012, pp. 573–580.
  • [22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.