
DynamicGlue: Epipolar and Time-Informed Data Association
in Dynamic Environments using Graph Neural Networks

Theresa Huber
Simon Schaefer
Stefan Leutenegger
Abstract

The assumption of a static environment is common in many geometric computer vision tasks like SLAM but limits their applicability in highly dynamic scenes. Since these tasks rely on identifying point correspondences between input images within the static part of the environment, we propose a graph neural network-based sparse feature matching network designed to perform robust matching under challenging conditions while excluding keypoints on moving objects. Like state-of-the-art feature-matching networks, we enhance keypoint representations through attentional aggregation over graph edges, but we augment the graph with epipolar and temporal information and vastly reduce the number of graph edges. Furthermore, we introduce a self-supervised training scheme to extract pseudo labels for image pairs in dynamic environments from exclusively unprocessed visual-inertial data. A series of experiments shows that our network excludes keypoints on moving objects more reliably than state-of-the-art feature matching networks while still achieving similar results on conventional matching metrics. When integrated into a SLAM system, our network significantly improves performance, especially in highly dynamic scenes.

1 Introduction

In the realm of geometric computer vision, establishing correspondences between image keypoints serves as a foundational element for various tasks, including Simultaneous Localization and Mapping (SLAM) and Structure from Motion (SfM). This data association process enables the inference of relative transformations between images of a moving camera and the underlying environmental structure. Despite significant strides in the field, challenges persist, such as handling large viewpoint changes, occlusion, weak texture, and dynamic objects within the scene.

[Figure: three example image pairs with regions labeled Dynamic and Static]
Figure 1: In this paper, we present DynamicGlue, a matching framework for dynamic scenes. Compared to state-of-the-art approaches, our framework can not only deal with large changes in appearance, such as viewpoint changes, but also differentiate dynamic from static parts of the scene. Matched keypoints are shown in yellow, and unmatched keypoints in red.

While existing methodologies have addressed these challenges to some extent, a notable limitation is their tendency to assume a static environment. This assumption poses a hurdle for downstream tasks like SLAM and Visual Inertial Odometry (VIO), which heavily rely on identifying correspondences within the static portion of the environment. Common approaches either employ RANSAC to filter out dynamic elements as outliers or resort to masking predefined semantic classes, such as vehicles. Masking potentially sacrifices valuable information because it does not differentiate between, e.g., parked and moving cars, and it generalizes poorly.

In response, we present a novel context-aware keypoint matching framework designed to not only navigate substantial changes in appearance but also distinguish between static and dynamic elements in the scene. Drawing inspiration from SuperGlue [16], we construct a graph from the keypoints of two input images. We enhance their representations through self- and cross-attentional aggregation, additionally incorporating temporal and epipolar geometry consistency information as edge features within a Graph Neural Network (GNN).

Furthermore, we introduce a self-supervised pipeline for labeling point correspondences on image pairs within dynamic environments, leveraging raw stereo-inertial sensor data to eliminate the need for human annotation. We summarize our contributions as follows:

  • We propose a graph neural network-based sparse feature matching network architecture. While inspired by SuperGlue [16], we augment the graph to include epipolar and temporal information, allowing us to vastly decrease the size of the graph and number of graph operations, making it much more computationally efficient and aware of the scene dynamics.

  • We introduce a self-supervised training scheme that facilitates training using exclusively unprocessed real-world images and IMU data. To achieve this, we create pseudo labels by leveraging a SLAM system in combination with off-the-shelf networks for depth prediction and multi-object tracking.

  • We showcase the superior performance of our framework across a wide range of scenarios, including highly dynamic environments. While our system exhibits comparable performance to state-of-the-art methods when assessed using conventional matching metrics for a static world, it significantly outperforms them by reducing the number of potentially misleading matches on moving objects by $65\%$. Furthermore, we integrate our approach into a SLAM system, highlighting its ability to enhance its overall performance (VIO by $29\%$ and VI-SLAM by $15\%$).

2 Related Work

Figure 2: The network architecture comprises three parts: graph formation, attentional aggregation, and match assignment. In the first step, the graph is created from SuperPoint [3] keypoints together with epipolar and temporal information. In the second step, an enhanced representation $\mathbf{n}$ of the initial descriptors $\mathbf{d}$ is computed using attentional aggregation over self- and cross-edges. The third part computes a partial assignment based on the enhanced keypoint encodings.

Many data association pipelines follow a common scheme: detect salient keypoints, compute feature descriptors, and then match the feature descriptors between images, often followed by an outlier rejection step. Based on the found correspondences, a relative transformation is estimated using, for example, the 8-point algorithm. Regarding the estimation of scene dynamics from 2D images, optical flow-based methods have been proposed [20, 7, 23]. However, to estimate the dynamics of the current image, they require a sequence of subsequent images and, therefore, introduce an additional delay to the system.

While handcrafted methods for feature description and matching have traditionally been used [6, 14, 13, 10], more and more parts of this pipeline have adopted deep learning to increase performance in recent years. In SuperPoint [3], a deep feature descriptor is learned with a multi-step scheme, first using synthetically generated data of basic shapes with groundtruth keypoint detections, then on random homographies of real-world images. Similarly, NRE [5] learns a deep descriptor end-to-end in conjunction with a relative camera pose estimation. While several methods for matching directly based on feature descriptors have been proposed [13, 16], they fail to include contextual information extracted from image pairs. Yi et al. [24] are the first to include contextual information in the matching problem, such as the camera motion and scene geometry, and learn to predict the essential matrix using direct supervision. While this improves the matching performance, they do not encode scene dynamics information, arguably the most relevant information in many downstream tasks.

Patch2Pix [28] and NRE [5] both follow an end-to-end approach. Instead of learning correspondences, they directly regress the relative pose between two images. Since one main drawback of these end-to-end correspondence networks is a low pixel accuracy in the matching due to memory and runtime limitations, they propose a two-step network in a detect-to-refine manner. While this approach is promising for complex and highly dynamic environments, it still requires weak supervision and can hardly be integrated into optimization-based downstream applications, such as SLAM, that further refine the camera trajectory based on global optimization. In [17], Schmidt et al. were the first to propose a fully self-supervised training pipeline for data association based on a dense SLAM pipeline, similar to ours. However, they assume a static environment, require RGB-D sensor input, which is generally scarcer, and use a much more computationally complex dense SLAM pipeline.

Recently, SuperGlue [16] and its successors [11, 2, 12] have dominated the field by leveraging GNNs for matching that include both local and global contextual information in the form of self- and cross-edges. Our work, in contrast, combines the node-level information with geometric and temporal edge encodings to explicitly create awareness of the scene geometry and dynamics. Furthermore, we do not require any direct supervision or synthetically generated data for training. Thereby, we achieve good generalization capabilities as well as scalability by training directly on real-world data and are much more computationally lightweight.

3 Problem Formulation

The data association problem is finding a partial assignment between the $N$ keypoints of image $\mathcal{I}_A$ and the $M$ keypoints of image $\mathcal{I}_B$. The keypoints are indexed by $\mathcal{A} := \{1, \cdots, N\}$ and $\mathcal{B} := \{N+1, \cdots, N+M\}$, respectively, and defined by their position $k_i \, \forall i \in \mathcal{A} \cup \mathcal{B}$ and the SuperPoint [3] descriptor $d_i \, \forall i \in \mathcal{A} \cup \mathcal{B}$. The camera intrinsics $\mathbf{K}_A$ and $\mathbf{K}_B$ and the image timestamps $t_A$ and $t_B$ are known. Each keypoint in one image can be matched to at most one keypoint in the other image since a correct match results from the projection of a unique 3D location. It may also be matched to no keypoint due to occlusions or since it was not detected in the other image, resulting in a set of matches $\mathcal{M} = \{(i,j)\} \subset \mathcal{A} \times \mathcal{B}$. To this end, the network estimates a partial assignment matrix $\mathbf{P} \in [0,1]^{N \times M}$ between keypoints of images $\mathcal{I}_A$ and $\mathcal{I}_B$, as in SuperGlue [16].

4 Epipolar and Time-Informed GNN

Figure 2 shows our network architecture, which consists of three stages. The graph formation builds a graph using the keypoints and their descriptors, epipolar and temporal information. This graph is then processed using attentional aggregation to extract high-dimensional node embeddings for each keypoint. These node embeddings are then matched to each other across the two images in the last stage, leading to the final associations.

4.1 Graph Formation

We create a graph with nodes defined as the SuperPoint [3] keypoints detected in the input images and two types of directed edges: self-edges $\mathcal{E}_{\text{self}}$, which connect keypoints within the same image, and cross-edges $\mathcal{E}_{\text{cross}}$, which connect keypoints in one image to keypoints in the other. Unlike SuperGlue [16] or LightGlue [11], we do not create a complete graph but create cross-edges only to the $10$ most similar keypoints in the other image based on the SuperPoint [3] descriptors and self-edges only to the $10$ closest keypoints in the same image based on the Euclidean distance between keypoint positions in the image. We incorporate information about the consistency with the epipolar geometry and time information in the GNN as edge features for cross-edges. To this end, a lightweight matching is performed between the keypoints of both images, resulting in a set of matches $\mathcal{M}$, where each match is assigned a weight $w_m \; \forall m \in \mathcal{M}$ based on descriptor similarity. Based on the found matches $\mathcal{M}$ and match weights $w_m$, the fundamental matrix $\mathbf{F}$ is computed using the weighted 8-point algorithm. Using this estimate of the fundamental matrix $\mathbf{F}$, for every cross-edge $(i,j) \in \mathcal{E}_{\text{cross}}$, the symmetric epipolar distance $d_{\text{epi}}(\mathbf{F}, k_i, k_j)$ is computed between the keypoints $k_i$ and $k_j$, which are represented by nodes $i$ and $j$, respectively. For numerical stability, we apply the logarithm to the computed epipolar distance.

\begin{equation}
d_{\text{epi}}(\mathbf{F}, k_i, k_j) = d(k_i, \mathbf{F}, k_j)^2 + d(k_j, \mathbf{F}^T, k_i)^2 . \tag{1}
\end{equation}
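The following is a minimal NumPy sketch of a weighted 8-point estimate and the symmetric epipolar distance of Eq. (1), not the authors' implementation. Several points are assumptions on our side: the function names are hypothetical, $d(k_i, \mathbf{F}, k_j)$ is taken to be the pixel distance of $k_j$ to the epipolar line $\mathbf{F} k_i$ (the text does not define $d$), Hartley normalization is added for conditioning, and the weights simply scale the rows of the linear system.

```python
import numpy as np

def weighted_eight_point(pts_a, pts_b, weights):
    """Estimate F with x_b^T F x_a = 0 from N >= 8 weighted matches (N, 2)."""
    def normalize(pts):
        # Hartley normalization (our addition): zero mean, mean distance sqrt(2).
        mean = pts.mean(axis=0)
        scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
        T = np.array([[scale, 0.0, -scale * mean[0]],
                      [0.0, scale, -scale * mean[1]],
                      [0.0, 0.0, 1.0]])
        homog = np.column_stack([pts, np.ones(len(pts))])
        return (T @ homog.T).T, T

    xa, Ta = normalize(pts_a)
    xb, Tb = normalize(pts_b)
    # Each row encodes x_b^T F x_a = 0 for F flattened row-major.
    A = np.einsum("ni,nj->nij", xb, xa).reshape(-1, 9)
    A = A * weights[:, None]  # assumed weighting: scale each constraint row
    _, _, vt = np.linalg.svd(A)
    F = vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint of a fundamental matrix.
    u, s, vt = np.linalg.svd(F)
    F = u @ np.diag([s[0], s[1], 0.0]) @ vt
    return Tb.T @ F @ Ta  # undo the normalization

def point_line_distance(k_src, F, k_dst):
    """Pixel distance of k_dst to the epipolar line F @ k_src (assumed d)."""
    line = F @ np.append(k_src, 1.0)
    return abs(line @ np.append(k_dst, 1.0)) / np.hypot(line[0], line[1])

def symmetric_epipolar_distance(F, k_i, k_j):
    """Eq. (1): squared distances to the epipolar lines in both images."""
    return (point_line_distance(k_i, F, k_j) ** 2
            + point_line_distance(k_j, F.T, k_i) ** 2)
```

For noise-free correspondences of a static scene, the resulting distances are close to zero; keypoints on independently moving objects generally violate the epipolar constraint and yield larger values.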

Even for inaccurate initial estimates of the fundamental matrix, we found that including the epipolar information as edge features provides valuable information to cluster the match candidates into inliers and outliers. However, to further quantify the accuracy of the initial estimation, the sum of all weights $w_m$ used for the 8-point algorithm and the mean and minimum of the epipolar distances are added to the edge features. Furthermore, the time passed between the two images is incorporated as the difference of the timestamps $t_i$ and $t_j$. The edge features $\mathbf{e}_{i,j}$ are defined by concatenating all aforementioned information:

\begin{equation}
\mathbf{e}_{i,j} = \begin{bmatrix}
\log(d_{\text{epi}}(\mathbf{F}, k_i, k_j)) \\
\frac{1}{|\mathcal{E}_{\text{cross}}|} \sum_{(a,b) \in \mathcal{E}_{\text{cross}}} \log(d_{\text{epi}}(\mathbf{F}, k_a, k_b)) \\
\min_{(a,b) \in \mathcal{E}_{\text{cross}}} \log(d_{\text{epi}}(\mathbf{F}, k_a, k_b)) \\
\sum_{m \in \mathcal{M}} w_m \\
t_j - t_i
\end{bmatrix} . \tag{2}
\end{equation}
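The sparse graph and the edge features of Eq. (2) can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: all function names are hypothetical, descriptors are assumed to be L2-normalized so that the dot product serves as similarity, a keypoint is excluded from its own self-edge neighbours, only one cross-edge direction is built, and a small `eps` is added inside the logarithm to guard against zero distances.

```python
import numpy as np

def knn_edges(cost, k=10):
    """Directed edges (i, j) from each row i to its k lowest-cost columns j."""
    k = min(k, cost.shape[1])
    nearest = np.argsort(cost, axis=1)[:, :k]
    rows = np.repeat(np.arange(cost.shape[0]), k)
    return np.stack([rows, nearest.ravel()], axis=1)

def self_edges(kpts, k=10):
    """Connect each keypoint to its k closest keypoints in the same image."""
    dist = np.linalg.norm(kpts[:, None] - kpts[None], axis=-1)
    np.fill_diagonal(dist, np.inf)  # assumption: no edge from a node to itself
    return knn_edges(dist, min(k, len(kpts) - 1))

def cross_edges(desc_a, desc_b, k=10):
    """Connect each keypoint in A to its k most similar keypoints in B."""
    return knn_edges(-(desc_a @ desc_b.T), k)

def edge_features(d_epi, weights, t_i, t_j, eps=1e-12):
    """Eq. (2) for all cross-edges; d_epi has one entry per cross-edge."""
    log_d = np.log(d_epi + eps)  # eps is our addition, not from the text
    shared = np.array([log_d.mean(), log_d.min(), weights.sum(), t_j - t_i])
    return np.column_stack([log_d, np.tile(shared, (len(log_d), 1))])
```

Only the first feature is specific to an edge; the remaining four are identical for all cross-edges of an image pair and describe the quality of the initial fundamental matrix estimate and the elapsed time.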

4.2 Attentional Aggregation

The node embeddings are initialized as the SuperPoint [3] descriptors and alternately updated by attentional aggregation over the self- and cross-edges, similar to the residual message passing proposed in SuperGlue [16]. For the aggregation of self-edges, we use the graph transformer introduced in [18] with multi-head attention weights $\alpha_{i,j}$:

\begin{align}
\mathbf{\hat{m}}_{\mathcal{E}_{\text{self}},i} &= \mathbf{W}_1 \mathbf{n}_i + \sum_{j \in \mathcal{E}_{\text{self}}(i)} \alpha_{i,j}^{\text{s}} \mathbf{W}_2 \mathbf{n}_j \tag{3} \\
\alpha_{i,j}^{\text{s}} &= \text{sm}\left( \frac{(\mathbf{W}_3 \mathbf{n}_i)^T (\mathbf{W}_4 \mathbf{n}_j)}{\sqrt{D}} \right), \tag{4}
\end{align}

where $D$ is the dimensionality of the node embeddings $\mathbf{n}$, sm the softmax activation function, and $\mathbf{W}_i$ learnable network parameters. We alter the graph transformer for cross-edge aggregation to attend to edge features, too.

\begin{align}
\mathbf{\hat{m}}_{\mathcal{E}_{\text{cross}},i} &= \mathbf{W}_5 \mathbf{n}_i + \sum_{j \in \mathcal{E}_{\text{cross}}(i)} \alpha_{i,j}^{\text{c}} \left( \mathbf{W}_6 \mathbf{n}_j + \mathbf{W}_7 \mathbf{e}_{i,j} \right) \tag{5} \\
\alpha_{i,j}^{\text{c}} &= \text{sm}\left( \frac{(\mathbf{W}_8 \mathbf{n}_i)^T (\mathbf{W}_9 \mathbf{n}_j + \mathbf{W}_{10} \mathbf{e}_{i,j})}{\sqrt{D}} \right). \tag{6}
\end{align}

The graph transformer layer is followed by a PairNorm [27] layer to prevent over-smoothing of node embeddings.
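To make the cross-edge aggregation of Eqs. (5) and (6) concrete, here is a minimal single-head NumPy sketch for one node. It is a simplification under our own assumptions: the paper uses multi-head attention, the function name and the dictionary of weight matrices are hypothetical, and the PairNorm layer is omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_edge_message(n_i, n_nbrs, e_nbrs, W):
    """Single-head version of Eqs. (5)-(6) for one node i.

    n_i:    (D,)   embedding of node i
    n_nbrs: (K, D) embeddings of the cross-edge neighbours j
    e_nbrs: (K, 5) edge features e_ij of Eq. (2)
    W:      dict with W5, W6, W8, W9 of shape (D, D) and W7, W10 of shape (D, 5)
    """
    D = n_i.shape[0]
    query = W["W8"] @ n_i
    # Keys and values both depend on the neighbour embedding and the edge features.
    keys = n_nbrs @ W["W9"].T + e_nbrs @ W["W10"].T
    values = n_nbrs @ W["W6"].T + e_nbrs @ W["W7"].T
    alpha = softmax(keys @ query / np.sqrt(D))  # Eq. (6), normalized over neighbours
    return W["W5"] @ n_i + alpha @ values       # Eq. (5)
```

Dropping the two edge-feature terms (`W7` and `W10`) recovers the self-edge aggregation of Eqs. (3) and (4) with the corresponding weight matrices.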

4.3 Match Assignment Head

We use the lightweight assignment head introduced by LightGlue [11] to estimate correspondences between the keypoints of the two images based on the updated node embeddings $\mathbf{n}$. This head computes the partial assignment $\mathbf{P}$ by predicting a matchability score $\sigma$ for every keypoint and a similarity score matrix $\mathbf{S} \in \mathbb{R}^{N \times M}$ between all pairs of keypoints. The elements in $\mathbf{S}$ encode how likely a pair of points correspond, whereas the matchability score $\sigma$ encodes how likely a keypoint has a correspondence in the other image. For all $i \in \mathcal{A}$, $j \in \mathcal{B}$, and $o \in \mathcal{A} \cup \mathcal{B}$ we have:

\begin{align}
\mathbf{S}_{ij} &= \text{Linear}(\mathbf{n}_i)^T \, \text{Linear}(\mathbf{n}_j) \tag{7} \\
\sigma_o &= \text{Sigmoid}(\text{Linear}(\mathbf{n}_o)), \tag{8}
\end{align}

where $\text{Linear}(\cdot)$ denotes a learnable linear transformation with bias. A low matchability score, $\sigma_i \rightarrow 0$, is expected when a keypoint is occluded or undetected in the other image. The combination of similarity score $\mathbf{S}$ and matchability score $\sigma$ results in the partial assignment matrix $\mathbf{P}$:

\begin{equation}
\mathbf{P}_{ij} = \sigma_i \sigma_j \, \underset{l \in \mathcal{A}}{\text{sm}}(\mathbf{S}_{lj})_i \, \underset{l \in \mathcal{B}}{\text{sm}}(\mathbf{S}_{il})_j . \tag{9}
\end{equation}

To extract correspondences, all pairs $(i,j)$ that mutually yield the highest assignment score $\mathbf{P}_{ij}$ within their row and column are selected if $\mathbf{P}_{ij}$ is larger than a threshold $\tau$.
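The assignment of Eq. (9) and the mutual-best selection can be sketched in a few lines of NumPy. The function names and the default value `tau=0.1` are placeholders chosen for illustration; the text does not state the threshold used. The similarity matrix and the matchability scores are taken as given inputs here.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def partial_assignment(S, sigma_a, sigma_b):
    """Eq. (9): S is (N, M), sigma_a is (N,), sigma_b is (M,)."""
    # axis=0 normalizes over keypoints of A, axis=1 over keypoints of B.
    return np.outer(sigma_a, sigma_b) * softmax(S, axis=0) * softmax(S, axis=1)

def extract_matches(P, tau=0.1):
    """Keep pairs that are mutual maxima of their row and column and exceed tau."""
    best_col = P.argmax(axis=1)  # best j for every i
    best_row = P.argmax(axis=0)  # best i for every j
    return [(i, int(j)) for i, j in enumerate(best_col)
            if best_row[j] == i and P[i, j] > tau]
```

Because the matchability scores multiply the assignment, a keypoint with a low $\sigma$ is suppressed even when its descriptor similarity is high, which is the mechanism that allows keypoints on moving objects to be rejected.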

5 Self-Supervised Training

We propose a self-supervised training scheme using datasets containing merely sequences of real-world stereo images and IMU data. As outlined in Figure 2, the datasets are pre-processed to extract pairs of images, denoted as image queries $\mathcal{Q}$, used for training the network. Furthermore, a set of labels containing groundtruth matches $\mathcal{M}_{\text{gt}}$ and sets of non-matchable keypoints $\mathcal{N}_{\text{gt}}^A$ and $\mathcal{N}_{\text{gt}}^B$ in each image, respectively, is created from the input data.

5.1 SLAM-based Pseudo-Groundtruth Generation

Modern VI-SLAM systems can achieve centimeter-level accuracy after offline bundle adjustment, even for long sequences. Therefore, we suggest using the bundle-adjusted factor graph of OKVIS2 [9] for pseudo-groundtruth generation. Each node in the graph represents a state within the sequence and is defined by its timestamp and optimized 6D pose. The edges of the graph represent reprojection errors, where keypoints in the images are assigned to landmarks in the reconstruction. For training data generation, image queries are extracted by parsing the bundle adjustment graph of a session and selecting any pair of camera images that track common landmarks. To limit the number of extracted samples to a tractable amount and decrease the similarity of image queries, the number of common landmarks must be above a threshold $c$, and only every $s$-th state is considered for the second image of a query. In addition, the bundle adjustment graph defines the keypoint locations in each image as the projection of the tracked landmarks back into the camera image.
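One possible reading of this query selection is sketched below. The data layout (one set of landmark ids per state, ordered by time), the function name, and the default values of `c` and `s` are assumptions made for illustration; in particular, "every $s$-th state" is interpreted as stepping the second image in strides of `s` from the first.

```python
def extract_queries(observations, c=20, s=5):
    """Select image pairs (a, b) that share more than c landmarks.

    observations: list of sets of landmark ids, one set per state, time-ordered.
    Returns tuples (index_a, index_b, number_of_common_landmarks).
    """
    queries = []
    for a, landmarks_a in enumerate(observations):
        # Only every s-th later state is considered as the second image.
        for b in range(a + s, len(observations), s):
            common = landmarks_a & observations[b]
            if len(common) > c:
                queries.append((a, b, len(common)))
    return queries
```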

We found that the number of groundtruth matches from landmark projections is typically too small for good generalization capabilities due to the relatively sparse detections and matches from OKVIS2 [9] for long sequences. Thus, we augment them with SuperPoint [3] keypoints. Each keypoint from image $\mathcal{I}_A$ is projected into image $\mathcal{I}_B$ using the depth map estimated by the stereo depth network IGEV-Stereo [22] at the timestamp of image $\mathcal{I}_A$ and the relative transformation extracted from the bundle adjustment graph. If a keypoint in image $\mathcal{I}_B$ coincides with the projected keypoint, the pair is labeled as a groundtruth match. If the point is projected outside of image $\mathcal{I}_B$ or there is no keypoint within a radius of 50 pixels in image $\mathcal{I}_B$, this point is labeled as non-matchable. All other keypoints remain unlabelled. To prevent outliers in the groundtruth due to inaccuracies in the depth maps, the same procedure is performed for all keypoints in image $\mathcal{I}_B$ with the corresponding depth map. Only classifications verified by both projections are considered valid.
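A minimal sketch of the one-directional labeling step, assuming a pinhole camera model without distortion. The function names, the label encoding, and `match_radius` (the tolerance for a keypoint to "coincide" with a projection) are hypothetical; only the 50-pixel rejection radius is taken from the text. The cross-check with the projection from $\mathcal{I}_B$ into $\mathcal{I}_A$ would call the same functions with swapped roles.

```python
import numpy as np

def project_keypoints(kpts_a, depth_a, K_a, K_b, T_ba):
    """Lift keypoints of A with their depth and project them into B.

    kpts_a: (N, 2) pixels, depth_a: (N,) depth along the optical axis,
    T_ba: 4x4 transform taking points from camera A to camera B.
    Returns the (N, 2) projections and the (N,) depths in camera B.
    """
    homog = np.column_stack([kpts_a, np.ones(len(kpts_a))])
    pts_a = (np.linalg.inv(K_a) @ homog.T).T * depth_a[:, None]
    pts_b = pts_a @ T_ba[:3, :3].T + T_ba[:3, 3]
    proj = (K_b @ pts_b.T).T
    return proj[:, :2] / proj[:, 2:3], pts_b[:, 2]

def label_keypoints(proj, depth_b, kpts_b, image_size,
                    match_radius=3.0, reject_radius=50.0):
    """Labels per keypoint of A: index of the match in B, -1 for
    non-matchable, -2 for unlabelled."""
    w, h = image_size
    labels = np.full(len(proj), -2)
    for i, (p, z) in enumerate(zip(proj, depth_b)):
        inside = z > 0 and 0 <= p[0] < w and 0 <= p[1] < h
        dist = np.linalg.norm(kpts_b - p, axis=1)
        if not inside or dist.min() > reject_radius:
            labels[i] = -1
        elif dist.min() <= match_radius:
            labels[i] = dist.argmin()
    return labels
```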

5.2 Groundtruth in a Dynamic Environment

The network shall be trained to not match keypoints on moving objects, since these matches are outliers under the static-world assumption used by a \ac{SLAM} system. For queries with only a short time difference, which are typical for \ac{SLAM} systems processing subsequent images, the movement of dynamic objects may be so small that the projective approach also labels matches on moving objects as correct. To prevent this, the moving objects are explicitly extracted and their keypoints are excluded from the groundtruth matches. To this end, the \ac{MOTS} network STEm-Seg [1] computes instance segmentation masks for objects within the image sequence. Due to limitations of the \ac{MOTS} model, this work focuses on cars and pedestrians only, but it can easily be extended to other semantic classes. For each processed instance, 3D point clouds are extracted from the created depth maps using the segmentation masks assigned to that instance. These point clouds are then back-projected into a common coordinate frame to compensate for the camera movement between the frames, using the optimized inter-frame transformation from the bundle-adjusted graph. To identify the moving instances, the Chamfer distance $d_{\text{pcl}}$ is computed between all combinations of the extracted point clouds. An object is labeled as moving if $d_{\text{pcl}}$ is larger than a threshold of $5$ meters. For all image queries with a non-zero time difference, all keypoints lying on a moving object are excluded from $\mathcal{M}_{\text{gt}}$ and added to the set of non-matchable keypoints $\mathcal{N}_{\text{gt}}$. For image queries with no time difference, there is no inter-frame movement, so all keypoints in $\mathcal{M}_{\text{gt}}$ are considered matchable.
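The moving-object test can be illustrated with a small sketch. It assumes one common variant of the Chamfer distance (unsquared Euclidean nearest-neighbor distances, averaged over both directions); the text does not state which variant is used, so the function below is an assumption. The 5 m threshold is the one given above, and the point clouds are expected to be expressed in the common coordinate frame already.

```python
import math

def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two point clouds, each a list of
    (x, y, z) tuples: mean nearest-neighbor distance, averaged over both
    directions (assumed variant)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a))

def is_moving(cloud_a, cloud_b, threshold_m=5.0):
    """Label an instance as moving if its two point clouds are far apart."""
    return chamfer_distance(cloud_a, cloud_b) > threshold_m

# Toy example: a parked object and one displaced by 8 m along x.
parked_t0 = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0)]
parked_t1 = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0)]
driving_t1 = [(8.0, 0.0, 10.0), (9.0, 0.0, 10.0)]
```

This brute-force version is quadratic in the number of points; for real point clouds a spatial index such as a k-d tree would be the usual choice.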

5.3 Loss Formulation

Based on the generated labels $\mathcal{M}_{\text{gt}}$, $\mathcal{N}_{\text{gt}}^{A}$, and $\mathcal{N}_{\text{gt}}^{B}$, the \ac{NLL} loss is minimized during training. It acts on the predicted partial assignment scores $\mathbf{P}$ and the matchability scores $\sigma$ for images $\mathcal{I}_{A}$ and $\mathcal{I}_{B}$.

\begin{align}
\mathcal{L}_{\text{M}} &= \left(\frac{1}{|\mathcal{M}_{\text{gt}}|}\sum_{(i,j)\in\mathcal{M}_{\text{gt}}}\log\mathbf{P}_{i,j}\right) \tag{10}\\
\mathcal{L}_{\text{N}} &= \left(\frac{1}{2|\mathcal{N}_{\text{gt}}|}\sum_{i\in\mathcal{N}_{\text{gt}}}\log(1-\sigma_{i})\right) \tag{11}\\
\mathcal{L} &= -(\mathcal{L}_{\text{M}}+\mathcal{L}_{\text{N}}^{A}+\mathcal{L}_{\text{N}}^{B}) \tag{12}
\end{align}
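As a sanity check of Eqs. (10) to (12), the loss can be evaluated on a toy example in plain Python. The function and argument names are illustrative, and returning zero for an empty label set is an assumption made here to avoid a division by zero; an actual training loop would use differentiable tensors instead of lists.

```python
import math

def nll_loss(P, sigma_a, sigma_b, matches_gt, non_matchable_a, non_matchable_b):
    """Evaluate Eqs. (10)-(12) for one image pair.
    P[i][j]: predicted assignment score between keypoint i in A and j in B.
    sigma_a, sigma_b: predicted matchability scores per keypoint.
    matches_gt: list of (i, j) groundtruth matches.
    non_matchable_a, non_matchable_b: indices of non-matchable keypoints."""
    def mean_log(values, scale):
        # Assumption: an empty label set contributes nothing to the loss.
        return sum(math.log(v) for v in values) / (scale * len(values)) if values else 0.0

    # Eq. (10): mean log-score of the groundtruth matches.
    l_m = mean_log([P[i][j] for i, j in matches_gt], 1)
    # Eq. (11), evaluated once per image.
    l_n_a = mean_log([1.0 - sigma_a[i] for i in non_matchable_a], 2)
    l_n_b = mean_log([1.0 - sigma_b[j] for j in non_matchable_b], 2)
    # Eq. (12): negative sum of the three terms.
    return -(l_m + l_n_a + l_n_b)

# Toy example: one groundtruth match and one non-matchable keypoint per image.
P = [[0.5, 0.1], [0.1, 0.1]]
loss = nll_loss(P, [0.9, 0.5], [0.9, 0.5], [(0, 0)], [1], [1])
```

With these values, every logarithm equals $\log 0.5$, so the loss is $2\log 2 \approx 1.386$; it approaches zero as $\mathbf{P}_{0,0} \to 1$ and the matchability of the non-matchable keypoints goes to zero.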

6 Experiments

We implement our network using the PyG library [4] in the PyTorch framework [15]. The network is trained using Adam [8] with a learning rate of $0.0001$ and a batch size of $32$.

We trained the model on the respective training sets of the TUM4Seasons dataset [21] and the Hilti-Oxford dataset [25] due to their versatility, as they provide stereo-inertial sensor data and contain significant appearance changes. While the TUM4Seasons dataset [21] contains scenes recorded from a car's perspective on different roads in various conditions, the Hilti-Oxford dataset [25] is recorded with a handheld device and shows indoor and outdoor scenes on construction sites and in buildings. We set the thresholds to $c=10$ for both datasets, $s=37$ for the TUM4Seasons dataset [21], and $s=13$ for the Hilti-Oxford dataset [25]. While Hilti-Oxford contains more rapid viewpoint changes due to the handheld camera, TUM4Seasons was recorded from a car, such that subsequent images are more similar. Using the proposed training scheme, we generate roughly $185{,}000$ image pairs from the TUM4Seasons dataset and $50{,}000$ image pairs from the Hilti-Oxford dataset for the training set. Furthermore, we extract the sequences with ego-motion from the Waymo Open Perception dataset [19] as an additional evaluation dataset that was not used during training. This dataset shows inner-city and outer-city scenes from a car perspective in the USA and provides extensive labels for evaluating matches on dynamic objects. Similarly, we have recorded our own in-the-wild dataset to further test the generalization performance of our network with regard to diverse dynamic object types.

6.1 Matching Performance

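\begin{tabular}{lccccccc}
\hline
Method & AUC@5$^\circ$ $\uparrow$ & AUC@10$^\circ$ $\uparrow$ & AUC@20$^\circ$ $\uparrow$ & P $\uparrow$ & MS $\uparrow$ & $\text{M}_{\text{mov}}$ $\downarrow$ & $\text{K}_{\text{mov}}$ $\downarrow$ \\
\hline
\multicolumn{8}{c}{Hilti-Oxford Dataset [25]} \\
\hline
Mutual NN & 29.62 & 41.58 & 51.81 & 72.73 & 24.33 & 0.84 & 17.53 \\
LightGlue [11] & 34.98 & 47.94 & 58.54 & 62.75 & 35.72 & 1.15 & 41.81 \\
DynamicGlue & 33.41 & 46.45 & 57.24 & 77.38 & 27.08 & 0.62 & \\
\hline
\multicolumn{8}{c}{TUM4Seasons Dataset [21]} \\
\hline
Mutual NN & 47.78 & 58.75 & 67.45 & 79.38 & 20.48 & 3.77 & 36.78 \\
LightGlue [11] & 78.84 & 87.60 & 92.69 & 88.88 & 40.17 & 2.68 & 47.11 \\
DG - No edge features & 70.14 & 80.43 & 87.31 & 93.61 & 21.55 & 0.96 & 10.67 \\
DG - No time & 69.66 & 79.97 & 86.95 & 93.14 & 22.11 & 1.04 & 11.78 \\
DG - No pair norm & 69.18 & 79.52 & 86.52 & 94.74 & 20.56 & 0.84 & 9.26 \\
DG - iterative & 70.28 & 80.58 & 87.39 & 94.36 & 22.62 & 1.15 & 12.15 \\
DynamicGlue & 69.70 & 79.98 & 86.90 & 93.87 & 21.22 & 0.59 & 6.97 \\
\hline
\multicolumn{8}{c}{Waymo Open Perception Dataset [19]} \\
\hline
Mutual NN & 44.05 & 52.47 & 59.02 & 62.03 & 13.83 & 6.95 & 14.03 \\
LightGlue [11] & 43.18 & 57.87 & 69.23 & 85.09 & 47.47 & 8.36 & 47.15 \\
DynamicGlue & 45.17 & 53.30 & 59.49 & 74.84 & 13.45 & 7.48 & 11.69 \\
\hline
\multicolumn{8}{c}{Self-Recorded In-The-Wild Dataset} \\
\hline
Mutual NN & 20.47 & 34.67 & 48.59 & 68.74 & 17.99 & 1.93 & 10.66 \\
LightGlue [11] & 26.73 & 44.05 & 58.85 & 65.18 & 33.98 & 3.35 & 41.37 \\
DynamicGlue & 26.43 & 42.16 & 56.55 & 76.84 & 21.06 & 1.02 & 7.18 \\
\hline
\end{tabular}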

Table 1: Evaluation of matching performance. The best results are marked in bold. While achieving state-of-the-art results in the commonly used matching metrics, our method outperforms the baselines with regard to the metrics that take the scene dynamics into account, $\text{M}_{\text{mov}}$ and $\text{K}_{\text{mov}}$.

We evaluate the network's performance based on the precision, the matching score, and the \ac{AUC} of pose errors, as is common in the field. While the precision P is the ratio of correct matches to found matches, the matching score MS relates the number of correct matches to the number of keypoints in image $\mathcal{I}_{A}$. A match is classified as correct if its epipolar distance is below $5\cdot 10^{-4}$. We estimate the essential matrix with RANSAC based on the computed correspondences and take the maximum of the rotational and translational errors as the pose error, represented as an angular error. The \ac{AUC} of pose errors is reported at three thresholds: 5°, 10°, and 20°. Additionally, we evaluate the matching in dynamic environments based on the matches of keypoints on moving objects. To this end, we propose two new metrics:

\begin{align}
\text{M}_{\text{mov}} &= \frac{|\mathcal{M}_{\text{mov}}|}{|\mathcal{M}|}\text{,} \tag{13}\\
\text{K}_{\text{mov}} &= \frac{|\mathcal{A}_{\text{mov}}|+|\mathcal{B}_{\text{mov}}|}{|\mathcal{A}|+|\mathcal{B}|}\text{,} \tag{14}
\end{align}

with $|\mathcal{M}|$ the number of predicted matches, $|\mathcal{M}_{\text{mov}}|$ the number of predicted matches on moving objects, $|\mathcal{A}|$ and $|\mathcal{B}|$ the number of keypoints in $\mathcal{I}_{A}$ and $\mathcal{I}_{B}$, respectively, and $|\mathcal{A}_{\text{mov}}|$ and $|\mathcal{B}_{\text{mov}}|$ the number of keypoints on moving objects. These metrics allow us to evaluate the ability of the matcher to exclude moving objects from the matching.
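The metrics above can be sketched in a few lines of Python. Several details are assumptions rather than the authors' evaluation code: the \ac{AUC} follows a commonly used convention (trapezoidal area under the cumulative recall curve, normalized by the threshold), a match counts as moving if either of its endpoints lies on a moving object, and $\text{K}_{\text{mov}}$ is computed over whatever keypoint sets are passed in, since the text does not spell out whether these are all detected keypoints or only the matched ones. The functions return fractions, whereas Table 1 appears to report percentages.

```python
def precision_and_matching_score(num_correct, num_matches, num_keypoints_a):
    """Precision P and matching score MS as fractions in [0, 1]."""
    precision = num_correct / num_matches if num_matches else 0.0
    matching_score = num_correct / num_keypoints_a if num_keypoints_a else 0.0
    return precision, matching_score

def pose_auc(errors_deg, threshold_deg):
    """Area under the cumulative pose-error curve up to threshold_deg,
    normalized to [0, 1] (assumed convention). Each error is the maximum
    of the rotational and translational angular error of one image pair."""
    n = len(errors_deg)
    xs, ys = [0.0], [0.0]
    for k, err in enumerate(sorted(errors_deg), start=1):
        if err > threshold_deg:
            break
        xs.append(err)
        ys.append(k / n)
    # Extend the recall curve horizontally up to the threshold.
    xs.append(threshold_deg)
    ys.append(ys[-1])
    area = sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2.0
               for k in range(1, len(xs)))
    return area / threshold_deg

def m_mov(matches, moving_a, moving_b):
    """Eq. (13): fraction of predicted matches on moving objects.
    moving_a[i] / moving_b[j] flag keypoints that lie on a moving object."""
    if not matches:
        return 0.0
    return sum(1 for i, j in matches if moving_a[i] or moving_b[j]) / len(matches)

def k_mov(moving_a, moving_b):
    """Eq. (14): fraction of the given keypoints that lie on moving objects."""
    total = len(moving_a) + len(moving_b)
    return (sum(moving_a) + sum(moving_b)) / total if total else 0.0
```

For example, two image pairs with pose errors of 2.5° and 100° give `pose_auc([2.5, 100.0], 5.0) == 0.375`: half of the pairs are recovered, and the recall curve only rises at 2.5°.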

For a fair comparison to state-of-the-art frameworks for feature matching, including naive descriptor matching based on mutual nearest neighbors, all of the tested methods use the same descriptor, SuperPoint [3]. In addition, we have conducted several ablation studies. We evaluate the effect of omitting the edge features, the temporal information, and the pair norm. Furthermore, we test an alternative network structure in which the edge features are iteratively updated after each layer by re-evaluating the nearest neighbors of each node.

LightGlue [11] is trained on more diverse and much larger datasets (about 5 million image pairs). Therefore, for a fair comparison, we re-train LightGlue [11] on our dataset while maintaining its architecture and loss formulation. For an image query with $n$ keypoints per image, LightGlue computes $4n^{2}$ edges to connect every keypoint to every keypoint in the same image and to every keypoint in the other image. This results in 4,000,000 edges for a typical image query with 1,000 keypoints, compared to 20,000 edges in our method. While being much more computationally efficient, our method significantly outperforms state-of-the-art feature-matching networks with regard to matches on dynamic objects, as indicated by $\text{M}_{\text{mov}}$ and $\text{K}_{\text{mov}}$, resulting in an improvement of $85\%$ over learned approaches and $47\%$ over \ac{NN} matching. While LightGlue [11] is optimized to return as many matches as possible, resulting in high AUC scores, our method yields more trustworthy matches, which leads to much higher precision across most datasets. More qualitative examples of how our method differentiates static from dynamic objects can be found in Figure 4.
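The edge counts quoted above follow from simple arithmetic, sketched below. The value of 10 edges per keypoint is not stated in this paragraph; it is inferred here from 20,000 edges spread over the 2,000 keypoints of an image pair and should be read as an assumption.

```python
def dense_edge_count(n):
    """Fully connected graph: each of the 2n keypoints of an image pair
    connects to all 2n keypoints (same image and other image)."""
    return (2 * n) ** 2  # equals 4 * n**2

def sparse_edge_count(n, edges_per_keypoint):
    """Sparse graph with a fixed number of edges per keypoint."""
    return 2 * n * edges_per_keypoint

n = 1000
dense = dense_edge_count(n)        # 4,000,000 edges
sparse = sparse_edge_count(n, 10)  # 20,000 edges (10 per keypoint is inferred)
reduction = dense // sparse        # factor of 200
```

Under this reading, the dense graph grows quadratically with the number of keypoints, while the sparse graph grows only linearly.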

6.2 SLAM Integration

\begin{tabular}{llcccccccc}
\hline
 & & \multicolumn{4}{c}{VIO} & \multicolumn{4}{c}{SLAM} \\
Session & Label & OKVIS2 [9] & LG [11] & MNN & DG (Ours) & OKVIS2 [9] & LG [11] & MNN & DG (Ours) \\
\hline
2021-05-10\_19-15-19 & l & 1.39 & 1.69 & 1.66 & 1.75 & 1.02 & 1.07 & 1.08 & 1.15 \\
2021-01-07\_13-12-23 & l & 11.04 & 10.23 & 11.05 & 7.71 & 1.37 & 3.01 & 3.14 & 2.49 \\
2020-04-07\_11-33-45 & m & 61.81 & 111.54 & 106.39 & 37.64 & 76.71 & 90.83 & 85.15 & 33.67 \\
2020-06-12\_10-10-57 & m & 29.01 & 14.63 & 12.24 & 13.34 & 4.98 & 13.52 & 4.45 & 6.20 \\
2021-05-10\_18-32-32 & m & 5.42 & 4.90 & 5.28 & 4.86 & 5.12 & 5.09 & 4.67 & 5.05 \\
2021-01-07\_10-49-45 & h & 34.82 & 30.95 & 21.59 & 10.77 & 14.00 & 10.20 & 10.38 & 12.42 \\
\hline
\end{tabular}
Table 2: Absolute trajectory errors for various sessions of TUM4Seasons [21] using OKVIS2 [9] with several different matching frameworks: LG (LightGlue [11]), MNN (Mutual NN), and DG (DynamicGlue). Our method clearly outperforms the baseline matchers, especially in the VIO case and for medium to highly dynamic scenes.
Figure 3: Trajectories and drift statistics of our method vs. the baseline OKVIS2 [9] on the test sequence \texttt{2020-06-12\_10-10-57} of the TUM4Seasons dataset [21]. The relative motion errors for position and orientation are aggregated over different sub-trajectory lengths. Our method drifts significantly less than the baseline, especially reducing the azimuth error.

To demonstrate the effectiveness of our matching framework in real-world applications, we have integrated our pipeline as well as the baselines into the state-of-the-art SLAM system OKVIS2 [9, 26]. OKVIS2 originally uses BRISK [10] keypoints that are matched by computing the Hamming distance between descriptors. For the integration, BRISK is replaced by SuperPoint [3] and then matched using the integrated matching framework. When matching an image to the map, the matching procedure had to be adapted to account for the additional contextual information used in our framework. Landmarks are not handled individually but clustered by frames so the network matches complete images.

The results are provided for sessions of the TUM4Seasons dataset from the validation and test splits, comparing the original OKVIS2 with BRISK as the baseline and the adapted version with our network. The sessions are processed in \ac{VIO} mode with loop closures disabled and in \ac{SLAM} mode with loop closures enabled. We provide relative trajectory error statistics in Figure 3, which quantify the odometry drift and are especially meaningful when processing without loop closures. Furthermore, the \ac{ATE} is reported in Table 2, comparing the groundtruth trajectory with the estimated trajectory after aligning both trajectories in position and yaw angle.
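A position-and-yaw alignment followed by an \ac{ATE} computation can be sketched as below. The sketch rests on assumptions that the text does not confirm: both trajectories are already time-associated lists of positions, the z-axis is gravity-aligned so that yaw is a rotation about z, the alignment is the least-squares optimum, and the error is reported as a root-mean-square value.

```python
import math

def ate_yaw_aligned(est, gt):
    """RMSE between gt and est after the least-squares alignment of est
    to gt over a yaw rotation (about z) and a 3D translation.
    est, gt: equally long lists of (x, y, z) positions."""
    n = len(est)

    def centroid(points):
        return tuple(sum(p[k] for p in points) / n for k in range(3))

    c_est, c_gt = centroid(est), centroid(gt)
    # Closed-form optimal yaw from the centered xy coordinates.
    a = b = 0.0
    for p, q in zip(est, gt):
        xp, yp = p[0] - c_est[0], p[1] - c_est[1]
        xq, yq = q[0] - c_gt[0], q[1] - c_gt[1]
        a += xp * xq + yp * yq
        b += xp * yq - yp * xq
    yaw = math.atan2(b, a)
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    # Matching the centroids realizes the optimal translation implicitly.
    squared_error = 0.0
    for p, q in zip(est, gt):
        x, y, z = (p[k] - c_est[k] for k in range(3))
        dx = cos_y * x - sin_y * y - (q[0] - c_gt[0])
        dy = sin_y * x + cos_y * y - (q[1] - c_gt[1])
        dz = z - (q[2] - c_gt[2])
        squared_error += dx * dx + dy * dy + dz * dz
    return math.sqrt(squared_error / n)
```

Restricting the alignment to four degrees of freedom reflects that roll and pitch are observable from the inertial measurements, so only the global position and yaw are arbitrary.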

Especially in \ac{VIO} mode, our network enhances the system's overall performance, quantified by the \ac{ATE}, by 29\%. The most significant improvements are evident in the azimuth error in sessions with a medium or high number of moving objects, most likely due to the erroneous tracking of nearby moving objects in the original OKVIS2. In \ac{SLAM} mode, OKVIS2 significantly improves its performance by leveraging loop closures, such that both methods achieve similar results. Therefore, our method is especially powerful for long sequences without revisited locations, where drift can accumulate. Remarkably, for highly dynamic scenarios, employing our system in \ac{VIO} mode vastly outperforms the baselines, including the state-of-the-art matching framework LightGlue [11].

7 Conclusion

Most previous methods in keypoint matching focus on appearance changes, such as large viewpoint changes, and neglect scene dynamics, which can lead to massive drift in downstream tasks such as \ac{SLAM}. In this work, we introduced DynamicGlue, which is, to the best of our knowledge, the first approach to incorporate an additional awareness of the scene dynamics. By augmenting a graph neural network with epipolar and temporal information, we achieve state-of-the-art matching performance and outperform previous methods in dynamic scenes by $85\%$, while greatly reducing the computational complexity by including only a small subset of neighboring keypoints in the graph. As an example of a downstream task, we demonstrate that our framework can substantially improve the accuracy of \ac{VIO} and \ac{SLAM} pipelines, leading to a $29\%$ decrease in \ac{ATE} over a state-of-the-art system in \ac{VIO} mode, and up to $43\%$ for medium to highly dynamic scenes. While we have shown good generalization capabilities of our model, the network can also easily be fine-tuned to new environments without any human intervention using our proposed self-supervised training pipeline.

\begin{overpic}[width=238.48886pt,height=113.81102pt]{media/matching_qualitative/waymo/LightGlue/session_1482/375.png} \put(3.0,40.0){\color[rgb]{.75,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{.75,1,0}\normalsize{LightGlue}} \put(0.0,0.0){\color[rgb]{0,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,1,0}\polygon(42, 18)(42, 27)(53, 27)(53, 18)} \end{overpic}
\begin{overpic}[width=238.48886pt,height=113.81102pt]{media/matching_qualitative/waymo/ours/session_1482/375.png} \put(3.0,40.0){\color[rgb]{.75,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{.75,1,0}\normalsize{DynamicGlue}} \put(0.0,0.0){\color[rgb]{0,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,1,0}\polygon(42, 18)(42, 27)(53, 27)(53, 18)} \end{overpic}
\begin{overpic}[width=238.48886pt,height=113.81102pt]{media/matching_qualitative/waymo/LightGlue/17160696560226550358_6229_820_6249_820/670.png} \put(3.0,40.0){\color[rgb]{.75,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{.75,1,0}\normalsize{LightGlue}} \put(0.0,0.0){\color[rgb]{0,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,1,0}\polygon(6, 16)(6, 25)(20, 25)(20, 16)} \end{overpic}
\begin{overpic}[width=238.48886pt,height=113.81102pt]{media/matching_qualitative/waymo/ours/17160696560226550358_6229_820_6249_820/670.png} \put(3.0,40.0){\color[rgb]{.75,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{.75,1,0}\normalsize{DynamicGlue}} \put(0.0,0.0){\color[rgb]{0,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,1,0}\polygon(6, 16)(6, 25)(20, 25)(20, 16)} \end{overpic}
\begin{overpic}[width=238.48886pt,height=136.5733pt]{media/matching_qualitative/campus/LightGlue/02-11-2023-12-31-46/1385.png} \put(3.0,50.0){\color[rgb]{.75,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{.75,1,0}\normalsize{LightGlue}} \put(0.0,0.0){\color[rgb]{0,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,1,0}\polygon(32, 20)(32, 43)(80, 43)(80, 20)} \end{overpic}
\begin{overpic}[width=238.48886pt,height=136.5733pt]{media/matching_qualitative/campus/ours/02-11-2023-12-31-46/1385.png} \put(3.0,50.0){\color[rgb]{.75,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{.75,1,0}\normalsize{DynamicGlue}} \put(0.0,0.0){\color[rgb]{0,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,1,0}\polygon(32, 20)(32, 43)(80, 43)(80, 20)} \end{overpic}
\begin{overpic}[width=238.48886pt,height=136.5733pt]{media/matching_qualitative/campus/LightGlue/02-11-2023-12-23-55/1175.png} \put(3.0,50.0){\color[rgb]{.75,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{.75,1,0}\normalsize{LightGlue}} \put(0.0,0.0){\color[rgb]{0,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,1,0}\polygon(49, 20)(49, 37)(57, 37)(57, 20)} \end{overpic}
\begin{overpic}[width=238.48886pt,height=136.5733pt]{media/matching_qualitative/campus/ours/02-11-2023-12-23-55/1175.png} \put(3.0,50.0){\color[rgb]{.75,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{.75,1,0}\normalsize{DynamicGlue}} \put(0.0,0.0){\color[rgb]{0,1,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,1,0}\polygon(49, 20)(49, 37)(57, 37)(57, 20)} \end{overpic}
Figure 4: Qualitative results of our framework (right) in various scenarios compared to LightGlue [11] (left). Matched keypoints are shown in yellow, unmatched ones in red. Our method can distinguish dynamic from static objects in different environments and for diverse object types.

References

  • Athar et al. [2020] Ali Athar, Sabarinath Mahadevan, Aljoša Ošep, Laura Leal-Taixé, and Bastian Leibe. Stem-seg: Spatio-temporal embeddings for instance segmentation in videos. In ECCV, 2020.
  • Campbell et al. [2020] Dylan Campbell, Liu Liu, and Stephen Gould. Solving the blind perspective-n-point problem end-to-end with robust differentiable geometric optimization. In ECCV, 2020.
  • DeTone et al. [2018] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Fey and Lenssen [2019] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • Germain et al. [2021] Hugo Germain, Vincent Lepetit, and Guillaume Bourmaud. Neural reprojection error: Merging feature learning and camera pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 414–423, 2021.
  • Harris and Stephens [1988] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference 1988. Alvey Vision Club, 1988.
  • Ilg et al. [2017] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • Leutenegger [2022] Stefan Leutenegger. Okvis2: Realtime scalable visual-inertial slam with loop closure, 2022.
  • Leutenegger et al. [2011] Stefan Leutenegger, Margarita Chli, and Roland Y. Siegwart. BRISK: Binary Robust invariant scalable keypoints. In 2011 International Conference on Computer Vision, pages 2548–2555, 2011. ISSN: 2380-7504.
  • Lindenberger et al. [2023] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. In IEEE International Conference on Computer Vision (ICCV), 2023.
  • Liu et al. [2020] Liu Liu, Dylan Campbell, Hongdong Li, Dingfu Zhou, Xibin Song, and Ruigang Yang. Learning 2d-3d correspondences to solve the blind perspective-n-point problem. arXiv preprint arXiv:2003.06752, 2020.
  • Lowe [2004] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • Mouats et al. [2018] Tarek Mouats, Nabil Aouf, David Nam, and Stephen Vidas. Performance Evaluation of Feature Detectors and Descriptors Beyond the Visible. Journal of Intelligent & Robotic Systems, 92(1):33–63, 2018.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Sarlin et al. [2020] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Schmidt et al. [2017] Tanner Schmidt, Richard Newcombe, and Dieter Fox. Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters, 2(2):420–427, 2017.
  • Shi et al. [2020] Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjin Wang, and Yu Sun. Masked label prediction: Unified message passing model for semi-supervised classification. International Joint Conference on Artificial Intelligence, 2020.
  • Sun et al. [2020] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Teed and Deng [2020] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • Wenzel et al. [2020] P. Wenzel, R. Wang, N. Yang, Q. Cheng, Q. Khan, L. von Stumberg, N. Zeller, and D. Cremers. 4Seasons: A cross-season dataset for multi-weather SLAM in autonomous driving. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2020.
  • Xu et al. [2023] Gangwei Xu, Xianqi Wang, Xiaohuan Ding, and Xin Yang. Iterative geometry encoding volume for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21919–21928, 2023.
  • Xu et al. [2022] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. Gmflow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8121–8130, 2022.
  • Yi et al. [2018] Kwang Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to Find Good Correspondences, pages 2666–2674. IEEE, 2018.
  • Zhang et al. [2023a] Lintong Zhang, Michael Helmberger, Lanke Frank Tarimo Fu, David Wisth, Marco Camurri, Davide Scaramuzza, and Maurice Fallon. Hilti-oxford dataset: A millimeter-accurate benchmark for simultaneous localization and mapping. IEEE Robotics and Automation Letters, 8(1):408–415, 2023a.
  • Zhang et al. [2023b] Lintong Zhang, Michael Helmberger, Lanke Frank Tarimo Fu, David Wisth, Marco Camurri, Davide Scaramuzza, and Maurice Fallon. Hilti-oxford dataset: A millimeter-accurate benchmark for simultaneous localization and mapping. IEEE Robotics and Automation Letters, 8(1):408–415, 2023b.
  • Zhao and Akoglu [2019] Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. International Conference on Learning Representations, 2019.
  • Zhou et al. [2021] Qunjie Zhou, Torsten Sattler, and Laura Leal-Taixe. Patch2pix: Epipolar-guided pixel-level correspondences. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.