CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder

Tangfei Liao1, Xiaoqin Zhang2, Guobao Xiao3, Min Li2, Tao Wang4, Mang Ye1
1School of Computer Science, Wuhan University, China
2College of Computer Science and Artificial Intelligence, Wenzhou University, China
3School of Electronics and Information Engineering, Tongji University, China
4State Key Lab for Novel Software Technology, Nanjing University, China
{tangfeiliao, zhangxiaoqinnan, limin.simu, taowangzj}@gmail.com,
[email protected], [email protected]
Abstract

Pre-training has emerged as a simple yet powerful methodology for representation learning across various domains. However, due to the expensive training cost and limited data, pre-training has not yet been extensively studied in correspondence pruning. To tackle these challenges, we propose a pre-training method to acquire a generic inliers-consistent representation by reconstructing masked correspondences, providing a strong initial representation for downstream tasks. Toward this objective, a small number of true correspondences naturally serve as input, thus significantly reducing pre-training overhead. In practice, we introduce CorrMAE, an extension of the masked autoencoder framework tailored for the pre-training of correspondence pruning. CorrMAE involves two main phases, i.e., correspondence learning and matching point reconstruction, guiding the reconstruction of masked correspondences through learning visible correspondence consistency. Herein, we employ a dual-branch structure with an ingenious positional encoding to reconstruct unordered and irregular correspondences. Also, an encoder with a bi-level design is proposed for correspondence learning, which offers enhanced consistency learning capability and transferability. Extensive experiments show that the model pre-trained with our CorrMAE outperforms prior work on multiple challenging benchmarks. Meanwhile, our CorrMAE is primarily a task-driven pre-training method, and can achieve notable improvements for downstream tasks by pre-training on the targeted dataset. We hope this work can provide a starting point for correspondence pruning pre-training.

1 Introduction

Pre-training has achieved remarkable progress on diverse backbones in various downstream tasks [27, 13, 23, 35, 36], as it provides a strong initial representation. The conventional pre-training method, i.e., fully supervised learning with a classification task [10], is one of the most popular paradigms and significantly helps downstream tasks. However, when dealing with tasks that take long sequences as input, the expensive overhead of conventional methods poses a significant challenge, especially for correspondence pruning, which relies on graph neural networks [37] (see Fig. 1(a)).

Figure 1: (a) Comparison of pre-training costs using the conventional method, i.e., initial correspondence classification task, and our proposed method. Meanwhile, some graph-based correspondence pruning methods [46, 6, 18, 5] are used as encoders. We report results averaged by batch size for training, measured on NVIDIA Tesla V100 GPU. (b) Comparing the previous learning paradigm and our pretraining-finetuning paradigm both for correspondence pruning. The correspondence is drawn in green if it represents the inlier and red for the outlier.

Correspondence pruning is a crucial element in many computer vision tasks, including simultaneous localization and mapping [24], structure from motion [30], and visual camera localization [3]. This task involves accurately identifying true correspondences (inliers) from initial correspondences while recovering two-view geometry [41]. Unfortunately, the thousands of initial correspondences used as input and the prevailing graph neural networks [46, 6, 18, 5] greatly escalate the costs of conventional pre-training methods. As dataset scales expand, this issue becomes even more acute, leaving direct training from scratch on the targeted dataset as the only learning paradigm for correspondence pruning (as presented in Fig. 1(b)). Besides, without additional data, conventional pre-training methods fail to yield valuable knowledge for correspondence pruning.

To resolve the aforementioned challenges, we are the first to introduce pre-training for correspondence pruning via a reconstruction pretext task. As illustrated in Fig. 1(b), the purpose of our method is to obtain a generic inliers-consistent representation by reconstructing masked correspondences. The pre-training knowledge is then transferred to correspondence pruning to enhance the performance of some downstream tasks, such as camera pose estimation. In pursuit of this goal, we naturally consider leveraging true correspondences from two images as the input for this pretext task. As shown in Fig. 1(a), this approach seamlessly resolves the issue of pre-training expenses, given that inliers typically constitute a mere 10% of initial correspondences.

Based on the above analysis, a naive implementation of this pretext task entails employing a trending framework, Masked Autoencoder (MAE) [9], to reconstruct masked correspondences directly, which is similar to Point-MAE [25]. However, unlike point clouds, correspondences lack a geometric center or other location information for position encoding. That is, this solution ignores the unordered and irregular characteristics of correspondences, leading to ineffective reconstruction of masked correspondences. To this end, we extend MAE and propose a novel framework, named Correspondence Masked Autoencoder (CorrMAE). A key design element of CorrMAE is a dual-branch structure with an ingenious position encoding, which reconstructs the matching points of the source and target images respectively. Also, an alignment loss is proposed for the reconstructed matching points of the source and target images. Interestingly, compared to the conventional pre-training method, our CorrMAE is a predominantly task-driven pre-training method, which can significantly enhance the performance of downstream tasks even without any extra data (see Section 4.4 for further discussion).

Furthermore, correspondence learning is indispensable in our CorrMAE. It is responsible for embedding local and global contexts for visible correspondences, subsequently guiding the reconstruction of masked correspondences. The correspondence representations produced in this phase will be transferred to downstream tasks for fine-tuning. Therefore, it necessitates an encoder specifically designed for correspondence learning with strong transferability. Instead of roughly stacking vanilla transformers [34], our proposed encoder adopts a bi-level design for local context representation and global context acquisition. Meanwhile, linear transformers [38] serve as the fundamental element, and the graph neural network (GNN) [37] guides the learning of local context representation. The considerations behind this are: i) the linear transformer introduces low overhead during the fine-tuning stage; ii) the GNN brings locality to linear transformers, which is advantageous for the reconstruction process. That is, our encoder balances the requirements of both pre-training and fine-tuning.

Our contributions are summarized as follows: (1) To the best of our knowledge, we are the first to propose a pre-training method for correspondence pruning by a correspondence reconstruction task. Compared to the conventional pre-training approach, our approach significantly reduces pre-training costs and, even without additional data, enhances model performance. (2) To implement this reconstruction task, we present a novel framework named CorrMAE to reconstruct matching points in the source and target images respectively. Some key designs include an encoder tailored for correspondence learning, a dual-branch structure with an ingenious position encoding for reconstruction, and an alignment loss for supervision. (3) Extensive experiments show that the model pre-trained with our CorrMAE achieves new state-of-the-art performance on several downstream tasks. Our method achieves a precision increase of 16.37% and 9.30% compared with the state-of-the-art result on camera pose estimation and visual localization evaluation respectively.

2 Related Work

2.1 Correspondence Pruning

As a pioneering work, PointCN [41] formulates correspondence pruning as both a binary classification problem and an essential matrix regression problem. It also proposes a context normalization technique to embed global information into each correspondence. Since this seminal work, there have been various follow-up studies on correspondence consistency. ACNe [31] employs the attention mechanism to capture local and global contexts. OANet [43] learns global consistency in latent space through learnable soft assignment operations. Subsequently, a series of methods [19, 46, 6, 18, 5] build on the graph neural network, as dynamic graphs offer better exploration of potential correlations in correspondences. Recently, ConvMatch [44], based on motion vector fields, maps correspondences into predefined vector fields and mines the consistency of motions through 2D convolutions. In brief, architectural design is still the mainstream route to new state-of-the-art performance in correspondence pruning. From another perspective, we instead explore an acceptable pre-training method for correspondence pruning.

Figure 2: The overview of our method.

2.2 Masked Autoencoder

MAE [9], a representative representation learning method, randomly masks a high portion of the image and reconstructs the missing pixels, providing powerful initial representations for downstream tasks through the pre-trained ViT [7] encoder. Since then, many studies have adopted the MAE framework for pre-training across various tasks, including 3D object classification [25], video understanding [33], trajectory prediction [4], and action recognition [40]. However, MAEs fail to reconstruct correspondences due to their unordered and irregular characteristics. To this end, we extend MAE and leverage a dual-branch structure to reconstruct masked correspondences, providing generic inliers-consistent representations for downstream tasks.

3 Methodology

3.1 Overview

The pivotal innovation of this paper lies in introducing an acceptable pre-training method, considering factors such as training costs and data dependency, thus bridging a gap in pre-training for correspondence pruning. To be specific, we perform the masked correspondence reconstruction task through an autoencoder, aiming to produce representations that complement the subsequent downstream tasks well. In contrast to conventional pre-training methods, i.e., the initial correspondence classification task, our approach offers the benefits of lower training costs and independence from large-scale datasets.

Pre-training Stage. As shown in Fig. 2(a), we propose a Correspondence Masked Autoencoder framework (CorrMAE) to accomplish this pretext task. Given $M$ true correspondences $\left\{{}^{(pre)}\bm{I}_{i}=(\bm{x}_{i},\bm{y}_{i}) \mid i=1,\dots,M,\ \bm{x}_{i}\in\mathbb{R}^{2},\ \bm{y}_{i}\in\mathbb{R}^{2}\right\}$ selected from the initial correspondences using an empirical geometric threshold, the reconstruction task begins by randomly masking a portion of the true correspondences according to a masking ratio. The remaining visible correspondences are then embedded into both global and local contexts using an encoder $\mathrm{E_{\theta}}$, specially designed for correspondence learning. Finally, guided by the consistency within the visible correspondences, the decoder $\mathrm{D_{\theta}}$ reconstructs the masked correspondences. More details are described in Section 3.2.
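The text does not specify which geometric criterion or threshold is used to select the true correspondences. As a minimal sketch only, the snippet below assumes a symmetric squared epipolar distance computed from a known essential matrix on normalized coordinates; the function names and the default threshold value are hypothetical placeholders, not the authors' settings.

```python
def symmetric_epipolar_distance(E, x, y, eps=1e-12):
    """Symmetric squared epipolar distance of one correspondence (x, y).

    E is a 3x3 essential matrix given as nested lists; x and y are 2D
    keypoints in normalized image coordinates (assumption of this sketch).
    """
    xh = (x[0], x[1], 1.0)  # homogeneous source point
    yh = (y[0], y[1], 1.0)  # homogeneous target point
    Ex = [sum(E[r][c] * xh[c] for c in range(3)) for r in range(3)]   # E x
    Ety = [sum(E[r][c] * yh[r] for r in range(3)) for c in range(3)]  # E^T y
    residual = sum(yh[r] * Ex[r] for r in range(3))                   # y^T E x
    denom_x = Ex[0] ** 2 + Ex[1] ** 2 + eps
    denom_y = Ety[0] ** 2 + Ety[1] ** 2 + eps
    return residual ** 2 * (1.0 / denom_x + 1.0 / denom_y)


def select_true_correspondences(E, correspondences, threshold=1e-4):
    """Keep correspondences whose distance falls below a (hypothetical) threshold."""
    return [
        (x, y)
        for x, y in correspondences
        if symmetric_epipolar_distance(E, x, y) < threshold
    ]


# Toy example: a pure translation along the x-axis, so inliers keep the same row.
E_tx = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
candidates = [((0.1, 0.2), (0.5, 0.2)), ((0.1, 0.1), (0.4, 0.3))]
inliers = select_true_correspondences(E_tx, candidates)
```

In this toy setup only the first candidate satisfies the epipolar constraint, so it alone would be passed on as pre-training input.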

Fine-tuning Stage. In this paper, we perform fine-tuning for correspondence pruning. As illustrated in Fig. 2(b), following CLNet [46], we employ an iterative network, where the initial weights of the encoder are derived from the pre-training stage. Built on the pruning operation, the encoder iterates $K$ times. The specifics of the fine-tuning stage are introduced in Section 3.3.

Figure 3: The pipeline of our CorrMAE. Given a set of true correspondences selected by an empirical geometric threshold, CorrMAE aims to obtain inlier representations with strong generalization through the masked correspondence reconstruction task. The design details of each phase of CorrMAE are introduced in Section  3.2. Please note that to better distinguish between two branches, we introduce the concepts of source and target images. In fact, our pipeline does not involve images, but true correspondences (4D) as input.

3.2 Pre-training with CorrMAE

Figure 4: Illustration of our proposed CorrFormer encoder. During fine-tuning, we integrate the CorrFormer encoder into the iterative network and employ a pruning strategy [46] to maximize its capabilities.

Inspired by the effective representation learning via masked autoencoder in image recognition [9] and 3D object classification [25], we focus on correspondence pruning and build a pre-training framework named CorrMAE that embeds inliers-consistent representations in the encoder. As depicted in Fig. 3, to perform the masked correspondence reconstruction task, this framework consists of four phases, i.e., correspondence masking, correspondence learning, source/target matching point reconstruction, and supervision. In the following subsections, we will elaborate on the design specifics of each phase. By combining these sophisticated yet efficient designs, we have been able to achieve strong inlier representations.

3.2.1 Correspondences Masking

We randomly mask true correspondences with a masking ratio $\alpha$. The set of masked correspondences is represented as ${}^{(pre)}\bm{I}_{gt}=\left\{\bm{P}_{gt}^{s},\bm{P}_{gt}^{t}\right\}\in\mathbb{R}^{\alpha M\times 4}$, which is used as ground truth in the supervision. Subsequently, the visible correspondences ${}^{(pre)}\bm{I}_{vis}\in\mathbb{R}^{(1-\alpha)M\times 4}$ are processed by our encoder (details in Section 3.2.2) and then serve as guidance for the reconstruction of matching points. As for the masking technique, we investigate the impact of different masking ratios ($40\%$ to $80\%$) and types (random masking and block masking [42]) on our method, see Section 2.
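The random masking step can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes each correspondence is stored as a 4-tuple (source x, source y, target x, target y), and the function name, the rounding of $\alpha M$, and the seed argument are assumptions made for the example.

```python
import random


def mask_correspondences(correspondences, mask_ratio, seed=None):
    """Randomly split true correspondences into visible and masked subsets.

    Returns the visible correspondences plus the masked source and target
    keypoints, which play the role of ground truth during supervision.
    """
    rng = random.Random(seed)
    m = len(correspondences)
    num_masked = int(round(mask_ratio * m))  # alpha * M, rounded (assumption)
    indices = list(range(m))
    rng.shuffle(indices)
    masked_idx = sorted(indices[:num_masked])
    visible_idx = sorted(indices[num_masked:])

    visible = [correspondences[i] for i in visible_idx]
    masked = [correspondences[i] for i in masked_idx]
    p_gt_s = [(c[0], c[1]) for c in masked]  # masked source keypoints
    p_gt_t = [(c[2], c[3]) for c in masked]  # masked target keypoints
    return visible, p_gt_s, p_gt_t


# Ten synthetic correspondences and a masking ratio of 60%.
corrs = [(0.1 * i, 0.2 * i, 0.1 * i + 0.05, 0.2 * i) for i in range(10)]
visible, p_gt_s, p_gt_t = mask_correspondences(corrs, mask_ratio=0.6, seed=0)
```

With ten inputs and a ratio of 0.6, four correspondences remain visible for the encoder and six are held out as reconstruction targets.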

3.2.2 Correspondence Learning

The perennial theme of correspondence learning is to embed both local and global contexts for each correspondence. To this end, as illustrated in Fig. 4, we introduce an encoder for correspondence learning, which consists of $L$ CorrFormer blocks. Each block adopts a neat bi-level design, dedicated to learning representations of local context and acquiring global context respectively. Then, local context representations are injected into the correspondence embeddings via an element-wise summation. Meanwhile, the graph neural network (GNN) [16] guides the learning of local context representations, and the linear transformer [38] is employed as the fundamental element. This process (block $\ell$) can be described as:

$${}^{(\ell+1)}\bm{T}_{vis}={}^{(\ell)}\mathrm{Local_{\theta}}\left({}^{(\ell)}\mathrm{GNN_{\theta}}\left({}^{(\ell)}\bm{T}_{vis}\right)\right)+{}^{(\ell)}\mathrm{Global_{\theta}}\left({}^{(\ell)}\bm{T}_{vis}\right), \tag{1}$$

where $\bm{T}_{vis}\in\mathbb{R}^{(1-\alpha)M\times C}$ represents high-dimensional embeddings of visible correspondences, termed visible tokens in this paper; $\mathrm{Local_{\theta}}$ and $\mathrm{Global_{\theta}}$ denote the local context representation level and the global context acquisition level, respectively.

3.2.3 Source Matching Points Reconstruction

Owing to their unordered and irregular characteristics, correspondences lack effective location information for position encoding, which hinders the reconstruction process. As presented in Fig. 3, we adopt a dual-branch structure with an ingenious position encoding to separately recover matching points in the source and target images, indirectly achieving the reconstruction of masked correspondences. Specifically, we begin by randomly generating mask keypoints for both the source and target branches ($\left\{\bm{P}_{mask}^{s},\bm{P}_{mask}^{t}\right\}\in\mathbb{R}^{\alpha M\times 4}$). As for the source branch, we use the ground truth of keypoints from the target image ($\bm{P}_{gt}^{t}$, introduced in Section 3.2.1) as our positional prompts. These prompts are then concatenated with their respective mask keypoints and encoded into mask tokens $\bm{T}_{mask}^{s}\in\mathbb{R}^{\alpha M\times C}$ via an MLP. Such a simple yet crucial positional encoding approach addresses the most fundamental problem, i.e., how to reconstruct unordered correspondences.
Next, following vision MAE [9], the mask tokens are added to the decoder's input sequence and later used to reconstruct the masked keypoints with a simple prediction head. Meanwhile, the decoder simply stacks several linear transformers [38], though fewer than the encoder. The above process can be described as:

$$\bm{T}_{mask}^{s}=\mathrm{MLP_{\theta}}\left[\bm{P}_{mask}^{s},\bm{P}_{gt}^{t}\right], \tag{2}$$
$$\bm{\widehat{P}}^{s}=\mathrm{Head_{\theta}}\left(\mathrm{D_{\theta}}\left[\bm{T}_{mask}^{s},\bm{T}_{vis}\right]\right), \tag{3}$$

where $\bm{\widehat{P}}^{s}\in\mathbb{R}^{\alpha M\times 2}$ denotes the reconstructed source matching points; $\left[\cdot,\cdot\right]$ represents the concatenation operation along the first dimension; $\mathrm{Head_{\theta}}$ is a prediction head, essentially an MLP. Similarly, the target branch performs the same operations, with the weights of the positional encoding and the decoder shared across the two branches. Finally, the reconstructed correspondences ${}^{(pre)}\bm{\widehat{I}}=\left\{\bm{\widehat{P}}^{s},\bm{\widehat{P}}^{t}\right\}\in\mathbb{R}^{\alpha M\times 4}$ are obtained.

3.2.4 Supervision

The overall framework is optimized using a hybrid loss function, which comprises two reconstruction losses and an alignment loss:

$${}^{(pre)}\mathcal{L}=\mathcal{L}_{rec}^{s}+\mathcal{L}_{rec}^{t}+\lambda\mathcal{L}_{align}, \tag{4}$$

where $\lambda$ denotes the hyper-parameter used to balance the two objectives. As for the reconstruction objective, we employ the $\ell_{2}$-loss between the reconstructed keypoints ($\bm{\widehat{P}}^{s}$ and $\bm{\widehat{P}}^{t}$) and the masked keypoints ($\bm{P}_{gt}^{s}$ and $\bm{P}_{gt}^{t}$):

$$\mathcal{L}_{rec}={\|\bm{\widehat{P}}^{s}-\bm{P}_{gt}^{s}\|}_{2}^{2}+{\|\bm{\widehat{P}}^{t}-\bm{P}_{gt}^{t}\|}_{2}^{2}. \tag{5}$$
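To make Eq. 5 concrete, here is a minimal sketch of the two-branch reconstruction objective on plain Python lists. Whether the original implementation sums or averages over keypoints is not stated in the text; this sketch sums the squared differences, and the function names are illustrative.

```python
def squared_l2(pred, target):
    """Sum of squared coordinate differences between two keypoint lists."""
    return sum(
        (p - t) ** 2
        for pred_pt, target_pt in zip(pred, target)
        for p, t in zip(pred_pt, target_pt)
    )


def reconstruction_loss(pred_s, gt_s, pred_t, gt_t):
    """Source-branch term plus target-branch term, as in Eq. 5."""
    return squared_l2(pred_s, gt_s) + squared_l2(pred_t, gt_t)


# Toy keypoints: one source point is off by 1, one target point is off by 2.
pred_s, gt_s = [(0.0, 0.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 1.0)]
pred_t, gt_t = [(2.0, 0.0)], [(0.0, 0.0)]
loss = reconstruction_loss(pred_s, gt_s, pred_t, gt_t)
```

Here the source branch contributes 1 and the target branch contributes 4, so the two terms can be inspected separately when debugging either branch.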

The proposed alignment loss aims to reduce the overall discrepancy between the reconstructed keypoints of the two branches. We first construct two undirected complete graphs for the reconstructed keypoints of the source and target branches ($\widehat{\mathcal{G}}^{s}$ and $\widehat{\mathcal{G}}^{t}$), with Euclidean distances assigned to edges. Next, we apply an $\ell_{1}$-loss between the two undirected complete graphs to align the reconstructed keypoints from the two branches. The alignment loss can be defined as:

$$e_{i,j}^{\xi}={\|\bm{\widehat{P}}_{i}^{\xi}-\bm{\widehat{P}}_{j}^{\xi}\|}_{2},\quad \xi\in\left\{s,t\right\}, \tag{6}$$
$$\widehat{\mathcal{G}}^{\xi}=\left\{e_{i,j}^{\xi} \mid i=1,\dots,\alpha M,\ j=1,\dots,\alpha M\right\}, \tag{7}$$
$$\mathcal{L}_{align}={\|\widehat{\mathcal{G}}^{s}-\widehat{\mathcal{G}}^{t}\|}_{1}, \tag{8}$$

where $e_{i,j}^{\xi}$ denotes the Euclidean distance between keypoint $i$ and keypoint $j$; $\xi$ indexes the source and target branches.
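Eqs. 6 to 8 can be illustrated with a short sketch that builds the pairwise-distance graphs and compares them edge by edge. It is an illustration under stated assumptions, not the authors' code: the sum runs over all ordered pairs $(i, j)$ as written in Eq. 7, no normalization is applied, and the helper names are hypothetical.

```python
import math


def pairwise_distances(points):
    """Edge weights of the complete graph: Euclidean distance for every pair (Eq. 6-7)."""
    return [[math.dist(p, q) for q in points] for p in points]


def alignment_loss(pred_s, pred_t):
    """L1 difference between the source and target distance graphs (Eq. 8)."""
    g_s = pairwise_distances(pred_s)
    g_t = pairwise_distances(pred_t)
    return sum(
        abs(a - b)
        for row_s, row_t in zip(g_s, g_t)
        for a, b in zip(row_s, row_t)
    )


def pretraining_loss(rec_loss, align_loss, lam=0.1):
    """Hybrid objective of Eq. 4, given a precomputed reconstruction term."""
    return rec_loss + lam * align_loss


pred_s = [(0.0, 0.0), (3.0, 4.0)]   # edge length 5
pred_t = [(0.0, 0.0), (0.0, 1.0)]   # edge length 1
shifted = [(1.0, 1.0), (4.0, 5.0)]  # pred_s translated by (1, 1)
mismatch = alignment_loss(pred_s, pred_t)
aligned = alignment_loss(pred_s, shifted)
```

Because the graphs store only inter-point distances, a pure translation of one branch leaves the loss at zero; the term compares the relative layout of the reconstructed keypoints rather than their absolute positions.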

3.3 Fine-tuning for Correspondence Pruning

Given $N$ initial correspondences ${}^{(fine)}\bm{I}\in\mathbb{R}^{N\times 4}$ generated by the SIFT [22] detector and a nearest neighbor matching strategy, correspondence pruning involves identifying true correspondences and recovering relative camera poses. This task is typically formulated as a binary classification problem (i.e., inlier vs. outlier) and an essential matrix regression problem [41]. However, finding true correspondences among initial correspondences dominated by outliers (approximately $90\%$) is still challenging. We employ robust inlier-consistent representations to guide the learning of correspondence pruning methods. As illustrated in Fig. 2(b), the inlier predictor outputs the probability of each candidate being an inlier, and a weighted eight-point algorithm [8] $h(\cdot,\cdot)$ is then used to regress the essential matrix. The process is presented as:

$$\begin{aligned} \bm{F}&=\mathrm{E_{\theta\rightarrow\phi}}\left({}^{(fine)}\bm{I}\right),\quad \bm{F}\in\mathbb{R}^{N^{\prime}\times C}\\ \bm{W}&=\mathrm{Predictor_{\phi}}(\bm{F}),\quad \bm{W}\in\mathbb{R}^{N^{\prime}\times 1}\\ \bm{\widehat{E}}&=h\left({}^{(fine)}\bm{I}_{c},\bm{W}\right),\quad {}^{(fine)}\bm{I}_{c}\in\mathbb{R}^{N^{\prime}\times 4}\end{aligned} \tag{9}$$

where $N'$ and $C$ are the number of final candidates and the embedding dimension, respectively; ${}^{(fine)}\bm{I}_{c}$ and $\bm{F}$ denote the candidates and their embeddings, respectively; $\bm{W}$ represents the probabilities of candidates being inliers; $\bm{\widehat{E}}$ indicates the predicted essential matrix.
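
The estimator $h(\cdot,\cdot)$ in Eq. 9 can be illustrated with a minimal NumPy sketch of a weighted eight-point solver. The function name `weighted_eight_point`, the row-major flattening of the essential matrix, and the unit-norm output are assumptions made for illustration. The sketch omits any projection onto the set of valid essential matrices and is not claimed to match a released implementation.

```python
import numpy as np

def weighted_eight_point(corr, weights):
    """Estimate an essential matrix from weighted correspondences.

    corr:    (M, 4) array of normalized coordinates (x1, y1, x2, y2).
    weights: (M,) non-negative inlier weights (e.g. predicted probabilities).
    Returns a 3x3 matrix with unit Frobenius norm, defined up to sign.
    """
    x1, y1, x2, y2 = corr[:, 0], corr[:, 1], corr[:, 2], corr[:, 3]
    ones = np.ones_like(x1)
    # Each row encodes the epipolar constraint p2^T E p1 = 0,
    # which is linear in the row-major flattening of E.
    X = np.stack([x2 * x1, x2 * y1, x2,
                  y2 * x1, y2 * y1, y2,
                  x1, y1, ones], axis=1)
    # Weighted normal matrix X^T diag(w) X (9x9, symmetric).
    XwX = X.T @ (weights[:, None] * X)
    # eigh returns eigenvalues in ascending order, so column 0 belongs
    # to the smallest eigenvalue and minimizes the weighted residual.
    _, eigvecs = np.linalg.eigh(XwX)
    e = eigvecs[:, 0]
    return (e / np.linalg.norm(e)).reshape(3, 3)
```

Because the weights enter only through $X^{\top}\mathrm{diag}(\bm{W})X$, candidates with a weight close to zero have almost no influence on the estimate, which is why the predicted inlier probabilities can be used directly.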

As for supervision, we employ the widely used binary cross-entropy loss with an adaptive temperature [46] for the binary classification, and a geometric loss [43] for the essential matrix regression:

$$
{}^{(fine)}\mathcal{L} = \mathcal{L}_{cls} + \beta\,\mathcal{L}_{ess}(\bm{\widehat{E}}, \bm{E}),
\tag{10}
$$

where the hyper-parameter $\beta$ is used to balance the two loss terms and $\bm{E}$ is the ground truth of the essential matrix.

4 Experiments

4.1 Implementation Details

Both our pre-training and fine-tuning models are implemented using PyTorch [26] and trained on multiple NVIDIA Tesla V100 GPUs.

Pre-training Setting. We use an AdamW optimizer [20] and cosine learning rate decay [21]. The initial learning rate is set to $0.001$, with a weight decay of $0.05$. We pre-train our model for $100$ epochs, with a batch size of $64$. The hyper-parameter $\lambda$ in Eq. 4 is set as $0.1$.
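
As a rough illustration of the cosine decay mentioned above, the following sketch evaluates the learning rate per epoch. The function name `cosine_lr`, the per-epoch stepping, the absence of warm-up and restarts, and the final value `min_lr=0.0` are assumptions; only the initial rate of 0.001 and the 100 epochs come from the setting described here.

```python
import math

def cosine_lr(epoch, total_epochs=100, base_lr=1e-3, min_lr=0.0):
    """Learning rate at `epoch` under cosine decay without restarts."""
    # Clamp the training progress to [0, 1].
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    # Half a cosine period: base_lr at progress 0, min_lr at progress 1.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 0.001, reaches half of that value at epoch 50, and decays to zero at epoch 100.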

Fine-tuning Setting. The number of initial correspondences, final candidates, and CorrFormer blocks are $N=2000$, $N'=500$, and $L=2$, respectively. The number of network iterations $K$, the channel dimension $C$, and the pruning ratio are $2$, $128$, and $0.5$, respectively. We adopt Adam [12] with a weight decay of $0$ as the optimizer to train our network, and the canonical learning rate (for a batch size of $32$) is set to $0.001$. Following [46], the weight $\beta$ in Eq. 10 is set to $0$ during the first $20k$ iterations and to $0.5$ in the remaining $480k$ iterations.
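
The two-stage weighting of Eq. 10 can be sketched as follows. The helper names `beta_schedule` and `fine_tune_loss` are hypothetical, the iteration counter is assumed to be zero-based, and the two loss terms are treated as already computed scalar values.

```python
def beta_schedule(iteration, warmup_iters=20_000, beta=0.5):
    """Weight of the essential-matrix loss in Eq. 10 at a given iteration."""
    # Classification-only training for the first `warmup_iters` iterations.
    return 0.0 if iteration < warmup_iters else beta

def fine_tune_loss(cls_loss, ess_loss, iteration):
    """Total fine-tuning loss L_cls + beta * L_ess for scalar loss values."""
    return cls_loss + beta_schedule(iteration) * ess_loss
```

With this schedule, the essential-matrix term is only added once the classifier has had some time to produce meaningful weights.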

4.2 Ablation Studies

Is the bi-level design of CorrFormer encoder helpful? The bi-level structure is the core design of the CorrFormer encoder. As shown in Table 1, to demonstrate its effectiveness, we conduct experiments using both single-level designs (rows 1 and 2) and the bi-level design (row 5) separately. Experimental results show that, compared to using only the global context acquisition level or only the local context representation level, the bi-level design achieves performance improvements of $+3.04$ AUC@5° and $+9.04$ AUC@5°, respectively.

Table 1: Ablation study for our CorrFormer encoder. The results of AUC without pre-training knowledge on YFCC100M [32] are reported. Local and Global respectively represent the local context representation and the global context acquisition layers. 1x and 2x represent the number of CorrFormer blocks.
Local Global Injection 1x 2x Pose estimation AUC
Interactive Simple @5° @10° @20°
28.65 49.33 67.59
34.64 54.79 71.19
33.06 53.28 70.94
36.92 57.89 74.46
37.69 58.37 74.71

How to inject local context? We explore both interactive (cross-attention) and simple (element-wise summation) manners to inject local context. In Table 1, we implement complex cross-attention (row 4) between the local context representation level and the global context acquisition level. The results indicate that this manner is not as effective as a simple element-wise summation operation (row 5), which is both neat and efficient.

How to choose $L$? As shown in Table 1, we investigate the effects of stacking one and two CorrFormer blocks (rows 3 and 4) on the network. With two encoder blocks, the performance of the network improves significantly ($+3.86$ AUC@5°). Considering the trade-off between training cost and performance, we set $L$ to $2$.

Table 2: Ablation study for our CorrMAE. We conduct experiments using two masking types with different masking ratios (%), and report pre-train loss (x100) on MegaDepth [14] as well as fine-tune camera pose estimation on YFCC100M [32].
Random Block Ratio Align Loss Pose estimation AUC
40 60 80 @5° @10° @20°
1.774 39.49 59.78 75.42
1.900 39.72 60.30 75.99
2.198 38.85 59.45 75.52
2.073 39.40 59.70 75.39
1.908 40.16 60.78 76.43

How to choose a masking strategy? To study a masking strategy suitable for our method, we perform ablation experiments on the masking ratio (40%-80%) and masking types (random masking and block masking [42]), see Table 2. The comparison among rows 1, 2, and 3 shows that our method gains better performance improvements for the downstream task at a moderate masking ratio of 60%. As presented in rows 2 and 4, we also validate the impact of block masking on our method, with experimental results revealing that block masking performs slightly worse than random masking in both pre-training observation (loss) and fine-tuning performance. Hence, we adopt a random masking strategy with a masking ratio of 60% to provide more robust initial representations for downstream tasks. Additionally, as presented in Appendix B.1, we visualize the reconstruction results under this masking strategy. From the visualized results, it is evident that our reconstructed correspondences adhere to the local and global consistency of inliers. In other words, the pre-trained encoder exhibits excellent inliers-consistent representations.
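
A minimal sketch of the random masking adopted above, under the assumption that masking operates on the indices of the true correspondences of one image pair. The function name `random_mask` and the rounding rule for the number of masked items are hypothetical, and block masking is not covered.

```python
import random

def random_mask(num_corr, mask_ratio=0.6, seed=None):
    """Split correspondence indices into visible and masked subsets."""
    rng = random.Random(seed)
    indices = list(range(num_corr))
    rng.shuffle(indices)
    # Number of correspondences hidden from the encoder.
    num_masked = int(round(num_corr * mask_ratio))
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked
```

For example, with 50 true correspondences and a ratio of 60%, 20 correspondences stay visible and 30 have to be reconstructed.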

Is the alignment loss useful? As shown in Table 2, after introducing the alignment loss (the last row), our method achieves an improvement of 1.93% in AUC@5° for camera pose estimation. That is, in our correspondence reconstruction task, the proposed alignment supervision between the reconstructed keypoints of the source and target branches proves to be effective for inlier representation learning.

4.3 Downstream Evaluation

Since we are the first to propose a pre-training method specifically for correspondence pruning, all baseline methods are trained from scratch solely on the target dataset (YFCC100M). In contrast, our proposed method is pre-trained on the MegaDepth dataset [14] (see Appendix A for details) and then fine-tuned on the YFCC100M dataset [32].

Correspondence Pruning directly impacts downstream geometric estimation tasks, necessitating methods that can precisely identify true correspondences (inliers). To ensure a fair comparison, all methods are evaluated on correspondence pruning using the full-size verification [46]. Following [43, 46], we report precision, recall, and F1 score on the testing set of YFCC100M. A predicted epipolar distance threshold of $3\times 10^{-5}$ serves as the criterion separating inliers from outliers. As shown in Table 3, our method pre-trained by CorrMAE outperforms all baseline methods on all metrics. Specifically, compared to the recent state-of-the-art methods NCMNet [18] and MGNet [5], we achieve relative improvements of 3.15% and 2.83% in F1 score, respectively. Additionally, Appendix B.2 shows representative visualization results of CLNet [46], NCMNet [18], and our method, from left to right. It can be seen that our method achieves the best performance under various challenging scenes.
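
The three metrics can be sketched for a single image pair as below. The function name `precision_recall_f1` and the boolean-list inputs are illustrative assumptions; how per-pair values are aggregated over the testing set is not specified here and is left out.

```python
def precision_recall_f1(pred_inlier, true_inlier):
    """Precision, recall and F1 (in percent) for boolean inlier labels."""
    pairs = list(zip(pred_inlier, true_inlier))
    tp = sum(1 for p, t in pairs if p and t)        # correctly kept inliers
    fp = sum(1 for p, t in pairs if p and not t)    # outliers kept by mistake
    fn = sum(1 for p, t in pairs if t and not p)    # inliers that were rejected
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Usage: threshold hypothetical predicted epipolar distances at 3e-5.
distances = [1e-6, 2e-5, 5e-5, 1e-3, 1e-5]
predicted = [d < 3e-5 for d in distances]
```

The F1 score is the harmonic mean of precision and recall, so a method has to keep both high to score well.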

Table 3: Evaluation on YFCC100M for the correspondence pruning.
Method Precision Recall F-score
OANet [43] 68.05 68.41 68.23
MS2DGNet [6] 72.61 73.86 73.23
PGFNet [17] 71.56 72.71 72.13
CLNet [46] 75.05 76.41 75.72
ConvMatch [44] 73.12 74.39 73.75
UMatch [15] 73.97 75.72 74.83
NCMNet [18] 77.24 78.57 77.90
MGNet [5] 76.97 79.35 78.14
Ours 79.27 81.45 80.35

Camera Pose Estimation. The goal of this task is to estimate the relative pose (rotation and translation) between two-view images. Following the evaluation protocol of [46, 44], we report the AUC of pose error at thresholds of 5, 10, and 20 degrees, with a weighted eight-point algorithm [8] as the estimator. All baselines are evaluated on this task using their respective official settings and pre-trained models. Our results, presented in Table 4, show that our method consistently achieves the best accuracy at all thresholds. Notably, we outperform the current best graph neural network method NCMNet [18] by $+5.65$ AUC@5° while using $0.67$M fewer parameters. Furthermore, at the cost of $2.79$M additional parameters, we significantly outperform the recent light-weight method MGNet [5], with an improvement of $+7.84$ AUC@5°.
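
For readers unfamiliar with the metric, the sketch below computes a pose AUC from per-pair angular pose errors in degrees. It assumes the commonly used definition (area under the cumulative error curve up to each threshold, normalized by the threshold); the function name `pose_auc` is hypothetical and the exact protocol of [46, 44] may differ in details.

```python
def pose_auc(errors, thresholds=(5.0, 10.0, 20.0)):
    """Area under the cumulative pose-error curve, in percent, per threshold."""
    errs = sorted(errors)
    n = len(errs)
    aucs = []
    for thr in thresholds:
        # Points (error, recall) of the cumulative curve, starting at (0, 0).
        xs, ys = [0.0], [0.0]
        for i, e in enumerate(errs):
            if e >= thr:
                break
            xs.append(e)
            ys.append((i + 1) / n)
        # Extend the curve horizontally up to the threshold.
        xs.append(thr)
        ys.append(ys[-1])
        # Trapezoidal integration, normalized so a perfect method scores 100.
        area = sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2.0
                   for k in range(1, len(xs)))
        aucs.append(100.0 * area / thr)
    return aucs
```

Unlike a plain accuracy at a threshold, this score rewards small errors more than errors that only just fall below the threshold.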

Table 4: Evaluation on YFCC100M for the camera pose estimation. The AUC of the pose error in percentage is reported. Our method improves the state-of-the-art methods by a large margin.
Method Params. Pose estimation AUC
@5° @10° @20°
PointCN [41] 0.39M 10.16 24.43 43.31
OANet [43] 2.47M 15.92 35.93 57.11
MS2DGNet [6] 2.61M 20.61 42.90 64.26
LMCNet [19] 0.93M 22.35 43.57 63.34
PGFNet [17] 3.12M 24.11 45.97 65.08
CLNet [46] 1.27M 25.28 45.82 65.44
ConvMatch [44] 7.49M 26.83 49.14 67.91
UMatch [15] 7.76M 30.84 52.04 69.65
NCMNet [18] 4.77M 34.51 55.34 72.40
MGNet [5] 1.31M 32.32 53.40 71.59
Ours 4.10M 40.16 60.78 76.43
Table 5: Evaluation on Aachen Day-Night benchmarks [29, 45] for the visual localization evaluation.
Method Day Night
(0.25m, 2°) / (0.5m, 5°) / (5m, 10°)
Visual Localization of Aachen v1.0
- 82.3/88.2/92.7 38.8/45.9/57.1
OANet [43] 85.4/92.4/96.6 63.3/74.5/83.7
CLNet [46] 85.0/93.0/97.7 67.3/81.6/88.8
ConvMatch [44] 85.2/91.9/96.1 58.2/63.3/75.5
NCMNet [18] 85.2/92.7/98.1 68.4/81.6/90.8
Ours 86.8/93.9/98.2 73.5/84.7/93.9
Visual Localization of Aachen v1.1
- 85.4/91.1/94.5 35.6/42.4/55.0
OANet [43] 87.7/94.8/98.2 58.6/69.1/83.8
CLNet [46] 86.7/94.1/98.1 61.3/78.0/89.0
ConvMatch [44] 87.3/93.8/97.5 51.8/60.2/73.3
NCMNet [18] 86.7/94.1/98.4 61.3/77.5/90.1
Ours 88.8/94.9/98.7 67.0/82.2/94.2

Visual Localization. This task aims to recover 6-DoF poses of query images with respect to the corresponding 3D scene model. We integrate correspondence pruning methods into the official HLoc [28] pipeline and use two popular benchmarks, i.e., Aachen Day-Night (v1.0 and v1.1) [29, 45], to validate performance on the visual localization task. The Aachen v1.0 dataset contains $4328$ reference and $922$ query ($824$ day, $98$ night) images. Aachen v1.1 extends v1.0 with an additional $2369$ reference and $93$ night query images. Following [15, 5], we report the accuracy at error thresholds of $0.25m/2$°, $0.5m/5$°, and $5m/10$°. The baselines comprise a simple mutual nearest neighbor (MNN) matcher and several state-of-the-art correspondence pruning methods, namely OANet [43], CLNet [46], ConvMatch [44], and NCMNet [18]. Also, SIFT (with $4096$ keypoints) and MNN are used as the pre-processing step for correspondence pruning methods. As shown in Table 5, our method outperforms all baselines on both benchmarks, showing its robustness under day-night changes. Specifically, our method exhibits strong generalization ability, especially in night scenes (68.4 vs. our 73.5 on v1.0 and 61.3 vs. our 67.0 on v1.1 in the first metric). We attribute this to the powerful inlier representation learned through our pre-training method.
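
The accuracy figures at the three error thresholds can be understood through the following sketch, which counts the queries whose translation and rotation errors both fall within a threshold pair. The function name `localization_accuracy` and the input format (a list of per-query `(metres, degrees)` errors) are assumptions for illustration; the benchmark itself computes these numbers on its own evaluation server.

```python
def localization_accuracy(errors,
                          thresholds=((0.25, 2.0), (0.5, 5.0), (5.0, 10.0))):
    """Percentage of queries localized within each (metres, degrees) pair."""
    results = []
    for max_t, max_r in thresholds:
        # A query counts only if both error components are small enough.
        hits = sum(1 for t_err, r_err in errors
                   if t_err <= max_t and r_err <= max_r)
        results.append(100.0 * hits / len(errors))
    return results
```

Because each threshold pair is looser than the previous one, the three reported percentages are non-decreasing from left to right.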

Table 6: Evaluation on HPatches [1] for the homography estimation. The AUC of the corner error in percentage is reported.
Method Homography est. AUC
@3px @5px @10px
MAGSAC++ [2] 47.79 56.27 64.68
OANet [43] 50.71 62.40 75.23
CLNet [46] 51.24 63.11 76.76
ConvMatch [44] 51.38 62.94 75.77
NCMNet [18] 51.34 63.16 76.96
Ours 52.78 65.04 78.42
Table 7: Our CorrMAE serves as a plug-and-play and task-driven pre-training method. The camera pose estimation results without/with CorrMAE are reported, and those with CorrMAE are achieved through pre-training and fine-tuning both on the target dataset (YFCC100M), without utilizing any additional data for pre-training.
Encoder AUC@5° AUC@20°
- CorrMAE - CorrMAE
OANet [43] 15.92 17.40 57.11 58.34
CLNet [46] 25.28 28.56 65.44 67.31
NCMNet [18] 34.51 35.35 72.40 73.36
CorrFormer (Ours) 37.69 39.54 74.71 75.69

Homography Estimation. A homography is a planar projective mapping between two images captured from different perspectives. Each test sequence pairs one source image with five target images captured from various viewing angles and lighting conditions, along with ground-truth homography transformations. Following [15, 5], we test baselines and our method on HPatches [1] with RANSAC as a robust estimator. We report the AUC (in percent) of the average corner error at thresholds of $3/5/10$ pixels. Since the intrinsic matrices are not provided in HPatches, we normalize keypoints by the image scale. To ensure fairness, we retrain all methods on YFCC100M using this normalization. As shown in Table 6, our model generalizes best among all compared methods.
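
The average corner error can be sketched as the mean distance between the four image corners warped by the estimated and by the ground-truth homography. The helper names `warp` and `mean_corner_error`, the nested-list matrix format, and the corner convention (pixel coordinates from 0 to width-1 and height-1) are assumptions made for illustration.

```python
import math

def warp(H, point):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def mean_corner_error(H_est, H_gt, width, height):
    """Mean distance (pixels) between corners warped by H_est and H_gt."""
    corners = [(0.0, 0.0), (width - 1.0, 0.0),
               (width - 1.0, height - 1.0), (0.0, height - 1.0)]
    dists = []
    for c in corners:
        ue, ve = warp(H_est, c)
        ug, vg = warp(H_gt, c)
        dists.append(math.hypot(ue - ug, ve - vg))
    return sum(dists) / len(dists)
```

An image pair then counts as correct at a threshold of 3, 5, or 10 pixels if this mean error stays below the threshold.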

4.4 Discussion

Analysis of task-driven and data-driven effects within our CorrMAE. Through several experiments, we demonstrate that our pre-training method exhibits low data dependency. As seen in the last row of Table 7, without introducing additional data, our method leads to a 4.91% relative improvement (39.54 vs. 37.69 in AUC@5°). In contrast, pre-training on a larger dataset (MegaDepth [14]) yields only a further 1.57% relative improvement (40.16 vs. 39.54 in AUC@5°), see the last row of Table 4. While additional data can provide stronger representational capabilities, our CorrMAE is predominantly task-driven in nature. This characteristic is essential for a method designed to be plug-and-play.

Does CorrMAE serve as a plug-and-play method? To answer this question, we replace the encoder within our pretraining-finetuning paradigm with other state-of-the-art methods, namely OANet [43], CLNet [46], and NCMNet [18]. As shown in Table 7, after pre-training with our CorrMAE, these methods all achieve significant performance improvements even without extra data. Specifically, among the GNN-based methods, CLNet achieves a relative improvement of 12.97% in AUC@5°, whereas the recent state-of-the-art method NCMNet achieves a 2.43% boost. With the help of our pre-training method, CLNet, which exhibits inferior consistency learning capability, shows a particularly pronounced improvement. In addition, compared to NCMNet, our CorrFormer demonstrates stronger consistency learning ability (34.51 vs. our 37.69 in AUC@5°) and superior transferability ($+0.84$ vs. our $+1.85$ in AUC@5°). These experiments further validate that our CorrMAE can serve as a powerful tool seamlessly assisting various correspondence pruning methods.
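
The percentages quoted in this discussion are relative gains computed from the AUC@5° values in Table 7 and Table 4. The short check below reproduces them; the function name `relative_gain` is introduced only for this illustration.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

# (baseline AUC@5°, AUC@5° after the change) pairs taken from the tables.
gains = {
    "CorrFormer, pre-trained on YFCC100M": relative_gain(37.69, 39.54),
    "CorrFormer, YFCC100M vs. MegaDepth pre-training": relative_gain(39.54, 40.16),
    "CLNet with CorrMAE": relative_gain(25.28, 28.56),
    "NCMNet with CorrMAE": relative_gain(34.51, 35.35),
}
rounded = {name: round(value, 2) for name, value in gains.items()}
```

The rounded values are 4.91, 1.57, 12.97, and 2.43 percent, matching the figures given in the text.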

5 Conclusion

In this paper, we explore pre-training for correspondence pruning from a new perspective. We propose CorrMAE, a masked correspondence reconstruction pipeline aimed at inlier representation learning. To this end, we specially design an encoder, a dual-branch structure, and an alignment loss for the reconstruction task. Extensive experiments verify the proposed method on several downstream tasks. We also provide detailed analyses to evaluate the effectiveness and impact of our method.

References

  • [1] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 5173–5182, 2017.
  • [2] Daniel Barath, Jana Noskova, Maksym Ivashechkin, and Jiri Matas. Magsac++, a fast, reliable and accurate robust estimator. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 1304–1312, 2020.
  • [3] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2018.
  • [4] Hao Chen, Jiaze Wang, Kun Shao, Furui Liu, Jianye Hao, Chenyong Guan, Guangyong Chen, and Pheng-Ann Heng. Traj-mae: Masked autoencoders for trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
  • [5] Luanyuan Dai, Xiaoyu Du, Hanwang Zhang, and Jinhui Tang. Mgnet: Learning correspondences via multiple graphs. In Proceedings of the AAAI conference on Artificial Intelligence, pages 3945–3953, 2024.
  • [6] Luanyuan Dai, Yizhang Liu, Jiayi Ma, Lifang Wei, Taotao Lai, Changcai Yang, and Riqing Chen. Ms2dg-net: Progressive correspondence learning via multiple sparse semantics dynamic graph. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 8973–8982, 2022.
  • [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, 2021.
  • [8] Richard I Hartley. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):580–593, 1997.
  • [9] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 15979–15988, 2022.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [11] Jared Heinly, Johannes L Schonberger, Enrique Dunn, and Jan-Michael Frahm. Reconstructing the world* in six days*(as captured by the yahoo 100 million image dataset). In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 3287–3295, 2015.
  • [12] Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, pages 1–15, 2014.
  • [13] Ming Li, Jie Wu, Xionghui Wang, Chen Chen, Jie Qin, Xuefeng Xiao, Rui Wang, Min Zheng, and Xin Pan. Aligndet: Aligning pre-training and fine-tuning in object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6866–6876, 2023.
  • [14] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
  • [15] Zizhuo Li, Shihua Zhang, and Jiayi Ma. U-match: two-view correspondence learning with hierarchy-aware local context aggregation. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1169–1176, 2023.
  • [16] Tangfei Liao, Xiaoqin Zhang, Li Zhao, Tao Wang, and Guobao Xiao. Vsformer: Visual-spatial fusion transformer for correspondence pruning. In Proceedings of the AAAI conference on Artificial Intelligence, 2024.
  • [17] Xin Liu, Guobao Xiao, Riqing Chen, and Jiayi Ma. Pgfnet: Preference-guided filtering network for two-view correspondence learning. IEEE Transactions on Image Processing, 32:1367–1378, 2023.
  • [18] Xin Liu and Jufeng Yang. Progressive neighbor consistency mining for correspondence pruning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 9527–9537, 2023.
  • [19] Yuan Liu, Lingjie Liu, Cheng Lin, Zhen Dong, and Wenping Wang. Learnable motion coherence for correspondence pruning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 3237–3246, 2021.
  • [20] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [21] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations, 2017.
  • [22] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • [23] Haoming Lu, Hazarapet Tunanyan, Kai Wang, Shant Navasardyan, Zhangyang Wang, and Humphrey Shi. Specialist diffusion: Plug-and-play sample-efficient fine-tuning of text-to-image diffusion models to learn any unseen style. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 14267–14276, 2023.
  • [24] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
  • [25] Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In Proceedings of the European Conference on Computer Vision, pages 604–621. Springer, 2022.
  • [26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [27] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  • [28] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 12716–12725, 2019.
  • [29] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 8601–8610, 2018.
  • [30] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
  • [31] Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, and Kwang Moo Yi. Acne: Attentive context normalization for robust permutation-equivariant learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 11286–11295, 2020.
  • [32] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
  • [33] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022.
  • [34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • [35] Tao Wang, Kaihao Zhang, Xuanxi Chen, Wenhan Luo, Jiankang Deng, Tong Lu, Xiaochun Cao, Wei Liu, Hongdong Li, and Stefanos Zafeiriou. A survey of deep face restoration: Denoise, super-resolution, deblur, artifact removal. arXiv preprint arXiv:2211.02831, 2022.
  • [36] Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tong Lu, Tae-Kyun Kim, Wei Liu, and Hongdong Li. Gridformer: Residual dense transformer with grid structure for image restoration in adverse weather conditions. International Journal of Computer Vision, pages 1–23, 2024.
  • [37] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics, 38(5):1–12, 2019.
  • [38] Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Flowformer: Linearizing transformers with conservation flows. In Proceedings of the International Conference on Machine Learning, 2022.
  • [39] Fei Xue, Ignas Budvytis, and Roberto Cipolla. Imp: Iterative matching and pose estimation with adaptive pooling. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 21317–21326, 2023.
  • [40] Hong Yan, Yang Liu, Yushen Wei, Zhen Li, Guanbin Li, and Liang Lin. Skeletonmae: graph-based masked autoencoder for skeleton sequence pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5606–5618, 2023.
  • [41] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 2666–2674, 2018.
  • [42] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 19313–19322, 2022.
  • [43] Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. Learning two-view correspondences and geometry using order-aware network. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 5845–5854, 2019.
  • [44] Shihua Zhang and Jiayi Ma. Convmatch: Rethinking network design for two-view correspondence learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [45] Zichao Zhang, Torsten Sattler, and Davide Scaramuzza. Reference pose generation for long-term visual localization via learned features and view synthesis. International Journal of Computer Vision, 129:821–844, 2021.
  • [46] Chen Zhao, Yixiao Ge, Feng Zhu, Rui Zhao, Hongsheng Li, and Mathieu Salzmann. Progressive correspondence pruning by consensus learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 6464–6473, 2021.

Appendix A Datasets

YFCC100M [32] is collected by Yahoo and consists of 100 million photos from the Internet. The authors of [11] split YFCC100M into 72 sequences from different tourist landmarks, and provided camera poses and sparse models for generating ground truth. Following [43], we select 68 sequences as the training set and the remaining 4 sequences as the testing set. For Table 7, the pre-training inputs (true correspondences/inliers) are selected using an empirical epipolar distance threshold of $10^{-4}$.
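
The threshold-based selection of pre-training inputs can be sketched as follows. The text only states that an epipolar distance is thresholded; the squared symmetric epipolar distance used here, the helper names `symmetric_epipolar_distance` and `select_inliers`, and the nested-list matrix format are assumptions for illustration.

```python
def symmetric_epipolar_distance(E, p1, p2):
    """Squared symmetric epipolar distance of one correspondence.

    E is a 3x3 essential matrix (nested lists); p1 and p2 are (x, y) points
    in normalized coordinates of the first and second view.
    """
    x1 = (p1[0], p1[1], 1.0)
    x2 = (p2[0], p2[1], 1.0)
    # Epipolar line of p1 in view 2 (E x1) and of p2 in view 1 (E^T x2).
    Ex1 = [sum(E[i][j] * x1[j] for j in range(3)) for i in range(3)]
    Etx2 = [sum(E[i][j] * x2[i] for i in range(3)) for j in range(3)]
    residual = sum(x2[i] * Ex1[i] for i in range(3))
    return residual ** 2 * (1.0 / (Ex1[0] ** 2 + Ex1[1] ** 2)
                            + 1.0 / (Etx2[0] ** 2 + Etx2[1] ** 2))

def select_inliers(E, correspondences, threshold=1e-4):
    """Keep correspondences (x1, y1, x2, y2) whose distance is below threshold."""
    return [c for c in correspondences
            if symmetric_epipolar_distance(E, c[:2], c[2:]) < threshold]
```

For a pure translation along the x-axis, for instance, this keeps exactly those correspondences whose y-coordinates nearly coincide in both views.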

MegaDepth [14] is a larger dataset comprising 196 scene sequences, whose camera poses and depth maps are produced from COLMAP [30]. Following [39], we employ SIFT [22] and a nearest neighbor matching strategy to generate initial correspondences for two-view images, and refine them into true correspondences using geometric thresholds. These true correspondences constitute our pre-training dataset from MegaDepth. It is worth noting that the MegaDepth pre-training dataset consists of true correspondences from $780k$ image pairs, with an average of only $50$ true correspondences per image pair used for pre-training.

Figure 5: The examples of reconstruction results for masked correspondences. The left column represents the original correspondences, the middle column means the remaining correspondences, and the right column denotes the reconstruction results for masked correspondences.

Appendix B Visualization Results

B.1 Visualization Results of Correspondence Reconstruction

In this section, we show some correspondence reconstruction results to illustrate the effectiveness of our method. As shown in Fig. 5, across various scenarios, the reconstruction results of our method largely adhere to global and local consistency. This further demonstrates that our pre-training method CorrMAE can provide powerful inlier representations for downstream tasks.

B.2 Visualization Results of Correspondence Pruning

We present visualization results of correspondence pruning to validate the superiority of our method. As shown in Fig. 6, our method outperforms the current state-of-the-art method NCMNet [18] in various challenging scenarios.

Figure 6: Representative visualization results of correspondence pruning on YFCC100M. A correspondence is drawn in green if it is an inlier and in red if it is an outlier.