CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder

Tangfei Liao¹, Xiaoqin Zhang², Guobao Xiao³, Min Li², Tao Wang⁴, Mang Ye¹
¹School of Computer Science, Wuhan University, China
²College of Computer Science and Artificial Intelligence, Wenzhou University, China
³School of Electronics and Information Engineering, Tongji University, China
⁴State Key Lab for Novel Software Technology, Nanjing University, China
{tangfeiliao, zhangxiaoqinnan, limin.simu, taowangzj}@gmail.com,
[email protected], [email protected]

Abstract

Pre-training has emerged as a simple yet powerful methodology for representation learning across various domains. However, due to the expensive training cost and limited data, pre-training has not yet been extensively studied in correspondence pruning. To tackle these challenges, we propose a pre-training method to acquire a generic inliers-consistent representation by reconstructing masked correspondences, providing a strong initial representation for downstream tasks. Toward this objective, a small set of true correspondences naturally serves as input, thus significantly reducing pre-training overhead. In practice, we introduce CorrMAE, an extension of the masked autoencoder framework tailored for the pre-training of correspondence pruning. CorrMAE involves two main phases, i.e., correspondence learning and matching point reconstruction, guiding the reconstruction of masked correspondences through learning visible correspondence consistency. Herein, we employ a dual-branch structure with an ingenious positional encoding to reconstruct unordered and irregular correspondences. Also, a bi-level encoder is proposed for correspondence learning, which offers enhanced consistency learning capability and transferability. Extensive experiments show that the model pre-trained with our CorrMAE outperforms prior work on multiple challenging benchmarks. Meanwhile, our CorrMAE is primarily a task-driven pre-training method, and can achieve notable improvements for downstream tasks by pre-training on the targeted dataset. We hope this work can provide a starting point for correspondence pruning pre-training.

1 Introduction

Pre-training has achieved remarkable progress on diverse backbones in various downstream tasks [27, 13, 23, 35, 36], as it provides a strong initial representation. The conventional pre-training method, i.e., fully supervised learning with a classification task [10], is one of the most popular paradigms and significantly helps downstream tasks. However, when dealing with tasks that take long sequences as input, the expensive overhead of conventional methods poses a significant challenge, especially for correspondence pruning reliant on the graph neural network [37] (see Fig. 1(a)).

Correspondence pruning is a crucial element in much of computer vision, including simultaneous localization and mapping [24], structure from motion [30], and visual camera localization [3]. This task involves accurately identifying true correspondences (inliers) from initial correspondences while recovering two-view geometry [41]. Unfortunately, thousands of initial correspondences as input and prevailing graph neural networks [46, 6, 18, 5] greatly escalate the costs of conventional pre-training methods. As dataset scales expand, this issue becomes even more acute, leading to direct training from scratch on the targeted dataset as the only learning paradigm for correspondence pruning (as presented in Fig. 1(b)). Besides, without additional data, conventional pre-training methods fail to yield valuable knowledge for correspondence pruning.

To resolve the aforementioned challenges, we are the first to introduce pre-training for correspondence pruning via a reconstruction pretext task. As illustrated in Fig. 1(b), the purpose of our method is to obtain a generic inliers-consistent representation by reconstructing masked correspondences. The pre-training knowledge is then transferred to correspondence pruning to enhance the performance of some downstream tasks, such as camera pose estimation. In pursuit of this goal, we naturally consider leveraging true correspondences from two images as the input for this pretext task. As shown in Fig. 1(a), this approach seamlessly resolves the issue of pre-training expenses, given that inliers typically constitute a mere 10% of initial correspondences.

Based on the above analysis, a naive implementation of this pretext task entails employing a trending framework, Masked Autoencoder (MAE) [9], to reconstruct masked correspondences directly, which is similar to Point-MAE [25]. However, unlike point clouds, correspondences lack a geometric center or other location information for position encoding. That is, this solution ignores the unordered and irregular characteristics of correspondences, leading to ineffective reconstruction of masked correspondences. To this end, we extend MAE and propose a novel framework, named Correspondence Masked Autoencoder (CorrMAE). A key design element of CorrMAE is a dual-branch structure with an ingenious position encoding, which reconstructs the matching points of the source and target images respectively. Also, an alignment loss is proposed for the reconstructed matching points of the source and target images. Interestingly, compared to the conventional pre-training method, our CorrMAE is a predominantly task-driven pre-training method, which can significantly enhance the performance of downstream tasks even without any extra data (see Section 4.4 for further discussion).

Furthermore, correspondence learning is indispensable in our CorrMAE. It is responsible for embedding local and global contexts for visible correspondences, subsequently guiding the reconstruction of masked correspondences. The correspondence representations produced in this phase are transferred to downstream tasks for fine-tuning. Therefore, it requires an encoder specifically designed for correspondence learning with strong transferability. Instead of simply stacking vanilla transformers [34], our proposed encoder adopts a bi-level design for local context representation and global context acquisition. Linear transformers [38] serve as the fundamental element, and the graph neural network (GNN) [37] guides the learning of local context representation. The considerations behind this are: i) the linear transformer introduces low overhead during the fine-tuning stage; ii) the GNN brings locality to linear transformers, which is advantageous for the reconstruction process. That is, our encoder balances the requirements of both pre-training and fine-tuning.

Our contributions are summarized as follows: (1) To the best of our knowledge, we are the first to propose a pre-training method for correspondence pruning by a correspondence reconstruction task. Compared to the conventional pre-training approach, our approach significantly reduces pre-training costs and, even without additional data, enhances model performance. (2) To implement this reconstruction task, we present a novel framework named CorrMAE to reconstruct matching points in the source and target images respectively. Some key designs include an encoder tailored for correspondence learning, a dual-branch structure with an ingenious position encoding for reconstruction, and an alignment loss for supervision. (3) Extensive experiments show that the model pre-trained with our CorrMAE achieves new state-of-the-art performance on several downstream tasks. Our method achieves a precision increase of 16.37% and 9.30% compared with the state-of-the-art result on camera pose estimation and visual localization evaluation respectively.

2 Related Work

2.1 Correspondence Pruning

As a pioneering work, PointCN [41] formulates correspondence pruning as both a binary classification problem and an essential matrix regression problem. It also proposes a context normalization technique to embed global information into each correspondence. Since this seminal work, there have been various follow-up studies on correspondence consistency. ACNe [31] employs the attention mechanism to capture local and global contexts. OANet [43] learns global consistency in latent space through learnable soft assignment operations. Subsequently, a series of methods [19, 46, 6, 18, 5] build on the graph neural network, as dynamic graphs offer better exploration of potential correlations in correspondences. Recently, ConvMatch [44], based on motion vector fields, maps correspondences into predefined vector fields and mines the consistency of motions through 2D convolutions. In brief, architectural design remains the mainstream route to new state-of-the-art performance in correspondence pruning. From another perspective, we instead explore an acceptable pre-training method for correspondence pruning.

2.2 Masked Autoencoder

MAE [9], a representative representation learning method, randomly masks a high portion of the image and reconstructs the missing pixels, providing powerful initial representations for downstream tasks through the pre-trained ViT [7] encoder. Since then, many studies have adopted the MAE framework for pre-training across various tasks, including 3D object classification [25], video understanding [33], trajectory prediction [4], and action recognition [40]. However, MAEs fail to reconstruct correspondences due to their unordered and irregular characteristics. To this end, we extend MAE and leverage a dual-branch structure to reconstruct masked correspondences, providing generic inliers-consistent representations for downstream tasks.

3 Methodology

3.1 Overview

The pivotal innovation of this paper lies in introducing an acceptable pre-training method, considering factors such as training costs and data dependency, thus bridging a gap in pre-training for correspondence pruning. To be specific, we perform the masked correspondence reconstruction task through an autoencoder, aiming to produce representations that complement the subsequent downstream tasks. In contrast to conventional pre-training methods, i.e., an initial correspondence classification task, our approach offers the benefits of lower training costs and independence from large-scale datasets.

Pre-training Stage. As shown in Fig. 2(a), we propose a Correspondence Masked Autoencoder framework (CorrMAE) to accomplish this pretext task. Given $M$ true correspondences $\left\{{}^{(pre)}\bm{I}_{i}=(\bm{x}_{i},\bm{y}_{i})\mid i=1,...,M,\ \bm{x}_{i}\in\mathbb{R}^{2},\ \bm{y}_{i}\in\mathbb{R}^{2}\right\}$ selected from initial correspondences using an empirical geometric threshold, the reconstruction task begins by randomly masking a subset of the true correspondences according to a masking ratio. Subsequently, the remaining visible correspondences are embedded into both global and local contexts using an encoder $\mathrm{E_{\theta}}$, specially designed for correspondence learning. Finally, guided by the consistency within visible correspondences, the decoder $\mathrm{D_{\theta}}$ reconstructs the masked correspondences. More details are described in Section 3.2.

Fine-tuning Stage. In this paper, we perform fine-tuning for the correspondence pruning. As illustrated in Fig. 2(b), following CLNet [46], we employ an iterative network, where the initial weights of the encoder are derived from the pre-training stage. Built on the pruning operation, the encoder iterates $K$ times. The specifics of the fine-tuning stage are introduced in Section 3.3.

3.2 Pre-training with CorrMAE

Inspired by the effective representation learning via masked autoencoder in image recognition [9] and 3D object classification [25], we focus on correspondence pruning and build a pre-training framework named CorrMAE that embeds inliers-consistent representations in the encoder. As depicted in Fig. 3, to perform the masked correspondence reconstruction task, this framework consists of four phases, i.e., correspondence masking, correspondence learning, source/target matching point reconstruction, and supervision. In the following subsections, we will elaborate on the design specifics of each phase. By combining these sophisticated yet efficient designs, we have been able to achieve strong inlier representations.

3.2.1 Correspondences Masking

We randomly mask true correspondences with a masking ratio $\alpha$. The set of masked correspondences is represented as ${}^{(pre)}\bm{I}_{gt}=\left\{\bm{P}_{gt}^{s},\bm{P}_{gt}^{t}\right\}\in\mathbb{R}^{\alpha{M}\times 4}$, which is used as ground truth in the supervision. Subsequently, the visible correspondences ${}^{(pre)}\bm{I}_{vis}\in\mathbb{R}^{(1-\alpha)M\times 4}$ are processed by our encoder (details in Section 3.2.2) and then serve as guidance for the reconstruction of matching points. As for the masking technique, we investigate the impact of different masking ratios ($40\%$ to $80\%$) and types (random masking and block masking [42]) on our method; see Table 2.
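The masking step can be illustrated with a minimal sketch. It assumes each true correspondence is stored as a row of four coordinates and that random masking is a uniform shuffle followed by a split; the function name `mask_correspondences`, the rounding of $\alpha M$, and the toy data are hypothetical and not taken from the paper.

```python
import random


def mask_correspondences(corrs, mask_ratio=0.6, seed=None):
    """Split true correspondences into visible and masked subsets.

    corrs: list of (x_s, y_s, x_t, y_t) rows, one per true correspondence.
    mask_ratio: the masking ratio alpha (fraction of rows that are hidden).
    """
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * len(corrs)))
    order = list(range(len(corrs)))
    rng.shuffle(order)  # random masking: every row is equally likely to be hidden
    masked = [corrs[i] for i in sorted(order[:num_masked])]   # ground truth, alpha*M rows
    visible = [corrs[i] for i in sorted(order[num_masked:])]  # encoder input, (1-alpha)*M rows
    return visible, masked


# Toy data: M = 10 synthetic correspondences (hypothetical values).
corrs = [(0.1 * i, 0.2 * i, 0.1 * i + 0.05, 0.2 * i - 0.05) for i in range(10)]
visible, masked = mask_correspondences(corrs, mask_ratio=0.6, seed=0)
print(len(visible), len(masked))  # 4 6
```

With $M=10$ and $\alpha=0.6$, six rows become reconstruction targets and four rows remain visible to the encoder.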

3.2.2 Correspondence Learning

The perennial theme of correspondence learning is to embed both local and global contexts for each correspondence. To this end, as illustrated in Fig. 4, we introduce an encoder for correspondence learning, which consists of $L$ CorrFormer blocks. Each block adopts a neat bi-level design, dedicated to learning representations of local context and acquiring global context respectively. Then, local context representations are injected into the correspondence embeddings via an element-wise summation. Meanwhile, the graph neural network (GNN) [16] guides the learning of local context representations, and the linear transformer [38] is employed as the fundamental element. This process (block $\ell$ ) can be described as:

	$\displaystyle{}^{(\ell+1)}\bm{T}_{vis}={}^{(\ell)}\mathrm{Local_{\theta}}\left({}^{(\ell)}\mathrm{GNN_{\theta}}\left({}^{(\ell)}\bm{T}_{vis}\right)\right)+{}^{(\ell)}\mathrm{Global_{\theta}}\left({}^{(\ell)}\bm{T}_{vis}\right),$		(1)

where $\bm{T}_{vis}\in\mathbb{R}^{(1-\alpha)M\times C}$ represents high-dimension embeddings of visible correspondences, termed as visible tokens in this paper; $\mathrm{Local_{\theta}}$ and $\mathrm{Global_{\theta}}$ denote the local context representation level and the global context acquisition level, respectively.

3.2.3 Source Matching Points Reconstruction

Owing to their unordered and irregular characteristics, correspondences lack effective location information for position encoding, which hinders the reconstruction process. As presented in Fig. 3, we adopt a dual-branch structure with an ingenious position encoding to separately recover matching points in the source and target images, indirectly achieving the reconstruction of masked correspondences. Specifically, we begin by randomly generating mask keypoints for both the source and target branches ($\left\{\bm{P}_{mask}^{s},\bm{P}_{mask}^{t}\right\}\in\mathbb{R}^{\alpha{M}\times 4}$). For the source branch, we use the ground-truth keypoints from the target image ($\bm{P}_{gt}^{t}$, introduced in Section 3.2.1) as positional prompts. These prompts are then concatenated with their respective mask keypoints and encoded into mask tokens $\bm{T}_{mask}^{s}\in\mathbb{R}^{\alpha{M}\times C}$ via an MLP. This simple yet crucial positional encoding addresses the most fundamental problem, i.e., how to reconstruct unordered correspondences. Next, following the vision MAE [9], the different mask tokens are added to the decoder's input sequence and later used to reconstruct the masked keypoints with a simple prediction head. The decoder simply stacks several linear transformers [38], fewer than in the encoder. The above process can be described as:

	$\displaystyle\bm{T}_{mask}^{s}=\mathrm{MLP_{\theta}}\left[\bm{P}_{mask}^{s},\bm{P}_{gt}^{t}\right],$		(2)
	$\displaystyle\bm{\widehat{P}}^{s}=\mathrm{Head_{\theta}}(\mathrm{D_{\theta}}\left[\bm{T}_{mask}^{s},\bm{T}_{vis}\right]),$		(3)

where $\bm{\widehat{P}}^{s}\in\mathbb{R}^{\alpha{M}\times 2}$ denotes the reconstructed source matching points; $\left[\cdot,\cdot\right]$ represents the concatenation operation along the first dimension; $\mathrm{Head_{\theta}}$ is a prediction head, essentially an MLP. Similarly, the target branch performs the same operations, with shared weights between the positional encoding and decoder. Finally, reconstructed correspondences ${}^{(pre)}\bm{\widehat{I}}=\left\{\bm{\widehat{P}}^{s},\bm{\widehat{P}}^{t}\right\}\in\mathbb{R}^{\alpha{M}\times 4}$ are obtained.
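The way positional prompts are paired with mask keypoints before the MLP of Eq. 2 can be sketched as follows. This is only an illustration of the input layout: it assumes that each mask keypoint is joined with its prompt row by row (giving four values per masked correspondence), that mask keypoints are drawn uniformly from $[-1,1]$, and that the target branch mirrors the source branch by using the ground-truth source keypoints as prompts. The helper name `build_mask_token_inputs` is hypothetical, and the MLP, decoder, and prediction head are omitted.

```python
import random


def build_mask_token_inputs(p_gt_src, p_gt_tgt, seed=None):
    """Return per-branch MLP inputs, each with alpha*M rows of 4 values."""
    rng = random.Random(seed)
    n = len(p_gt_src)
    # Randomly generated mask keypoints; the uniform range is an assumption.
    p_mask_src = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    p_mask_tgt = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    # Source branch (Eq. 2): mask keypoint joined with the ground-truth target keypoint.
    src_inputs = [m + t for m, t in zip(p_mask_src, p_gt_tgt)]
    # Target branch (assumed symmetric): mask keypoint joined with the ground-truth source keypoint.
    tgt_inputs = [m + s for m, s in zip(p_mask_tgt, p_gt_src)]
    return src_inputs, tgt_inputs


# Three masked correspondences with hypothetical coordinates.
p_gt_src = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
p_gt_tgt = [(0.15, 0.25), (0.35, 0.45), (0.55, 0.65)]
src_inputs, tgt_inputs = build_mask_token_inputs(p_gt_src, p_gt_tgt, seed=0)
print(len(src_inputs), len(src_inputs[0]))  # 3 4
```

Because each prompt carries the location of the matching point in the other image, every mask token is tied to one specific masked correspondence even though the set itself has no ordering.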

3.2.4 Supervision

The overall framework is optimized using a hybrid loss function, which comprises two reconstruction losses and an alignment loss:

	$\displaystyle{}^{(pre)}\mathcal{L}=\mathcal{L}_{rec}^{s}+\mathcal{L}_{rec}^{t}+\lambda\mathcal{L}_{align},$		(4)

where $\lambda$ denotes the hyper-parameter used to balance between two objectives. As for the reconstruction objective, we employ the $\ell_{2}$ -loss between the reconstructed keypoints ( $\bm{\widehat{P}}^{s}$ and $\bm{\widehat{P}}^{t}$ ) and the masked keypoints ( $\bm{P}_{gt}^{s}$ and $\bm{P}_{gt}^{t}$ ):

	$\displaystyle\mathcal{L}_{rec}=\mathcal{L}_{rec}^{s}+\mathcal{L}_{rec}^{t}=\|\bm{\widehat{P}}^{s}-\bm{P}_{gt}^{s}\|_{2}^{2}+\|\bm{\widehat{P}}^{t}-\bm{P}_{gt}^{t}\|_{2}^{2}.$		(5)

The proposed alignment loss aims to reduce the overall discrepancy between the reconstructed keypoints of the two branches. We first construct two undirected complete graphs for the reconstructed keypoints of both the source and target branches ( $\widehat{\mathcal{G}}^{s}$ and $\widehat{\mathcal{G}}^{t}$ ), with Euclidean distances assigned to edges. Next, we perform an $\ell_{1}$ -loss between two undirected complete graphs to align reconstructed keypoints from two branches. The alignment loss can be defined as:

	$\displaystyle e_{i,j}^{\xi}=\|\bm{\widehat{P}}_{i}^{\xi}-\bm{\widehat{P}}_{j}^{\xi}\|_{2},\quad\xi\in\left\{s,t\right\},$		(6)
	$\displaystyle\widehat{\mathcal{G}}^{\xi}=\left\{e_{i,j}^{\xi}\mid i=1,...,\alpha{M},\ j=1,...,\alpha{M}\right\},$		(7)
	$\displaystyle\mathcal{L}_{align}=\|\widehat{\mathcal{G}}^{s}-\widehat{\mathcal{G}}^{t}\|_{1},$		(8)

where $e_{i,j}^{\xi}$ denotes the Euclidean distance between keypoint $i$ and keypoint $j$; ${\xi}$ indexes the source ($s$) and target ($t$) branches.
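The hybrid loss of Eqs. 4 to 8 can be written out as a small plain-Python sketch. It assumes that both norms are summed over all entries (the paper does not state whether a sum or a mean is used) and that keypoints are 2D tuples; the function names are hypothetical.

```python
import math


def pairwise_distances(points):
    """Edge weights e_{i,j} of the complete graph over keypoints (Eqs. 6-7)."""
    return [[math.dist(p, q) for q in points] for p in points]


def alignment_loss(pred_src, pred_tgt):
    """l1 difference between the two distance graphs (Eq. 8), summed over edges."""
    g_s = pairwise_distances(pred_src)
    g_t = pairwise_distances(pred_tgt)
    return sum(abs(a - b) for row_s, row_t in zip(g_s, g_t) for a, b in zip(row_s, row_t))


def reconstruction_loss(pred, gt):
    """Squared l2 error for one branch (one term of Eq. 5)."""
    return sum((a - b) ** 2 for p, g in zip(pred, gt) for a, b in zip(p, g))


def pretrain_loss(pred_src, pred_tgt, gt_src, gt_tgt, lam=0.1):
    """Eq. 4 with lambda = 0.1, the value reported in the pre-training setting."""
    rec = reconstruction_loss(pred_src, gt_src) + reconstruction_loss(pred_tgt, gt_tgt)
    return rec + lam * alignment_loss(pred_src, pred_tgt)


pred_src = [(0.0, 0.0), (3.0, 4.0)]
pred_tgt_shifted = [(1.0, 1.0), (4.0, 5.0)]  # same layout, translated
pred_tgt_scaled = [(0.0, 0.0), (6.0, 8.0)]   # layout stretched by a factor of 2
print(alignment_loss(pred_src, pred_tgt_shifted))  # 0.0
print(alignment_loss(pred_src, pred_tgt_scaled))   # 10.0
```

The toy values show that the alignment term ignores a pure translation between the two branches but penalizes a change in the relative layout of the reconstructed keypoints.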

3.3 Fine-tuning for Correspondence Pruning

Given $N$ initial correspondences ${}^{(fine)}\bm{I}\in\mathbb{R}^{N\times 4}$ generated by the SIFT [22] detector and a nearest neighbor matching strategy, correspondence pruning involves identifying true correspondences and recovering relative camera poses. This task is typically formulated as a binary classification problem (i.e., inlier vs. outlier) and an essential matrix regression problem [41]. However, finding true correspondences among initial correspondences dominated by outliers (approximately $90\%$) is still challenging. We employ robust inlier-consistent representations to guide the learning of correspondence pruning methods. As illustrated in Fig. 2(b), the inlier predictor outputs the probability of each candidate being an inlier, and a weighted eight-point algorithm [8] $h(\cdot,\cdot)$ then regresses the essential matrix. The process is presented as:

	$\displaystyle\begin{aligned}\bm{F}&=\mathrm{E_{\theta\rightarrow\phi}}({}^{(fine)}\bm{I}),\quad\bm{F}\in\mathbb{R}^{N^{\prime}\times C}\\ \bm{W}&=\mathrm{Predictor_{\phi}}(\bm{F}),\quad\bm{W}\in\mathbb{R}^{N^{\prime}\times 1}\\ \bm{\widehat{E}}&=h({}^{(fine)}\bm{I}_{c},\bm{W}),\quad{}^{(fine)}\bm{I}_{c}\in\mathbb{R}^{N^{\prime}\times 4}\end{aligned}$		(9)

where $N^{\prime}$ and $C$ are the number of final candidates and the embedding dimension, respectively; ${}^{(fine)}\bm{I}_{c}$ and $\bm{F}$ denote the candidates and their embeddings, respectively; $\bm{W}$ represents the probabilities of candidates being inliers; $\bm{\widehat{E}}$ indicates the predicted essential matrix.

As for supervision, we employ the widely used binary cross-entropy loss with an adaptive temperature [46] for the binary classification, and a geometric loss [43] for the essential matrix regression:

	$\displaystyle{}^{(fine)}\mathcal{L}=\mathcal{L}_{cls}+\beta\mathcal{L}_{ess}(\bm{\widehat{E}},\bm{E}),$		(10)

where the hyper-parameter $\beta$ is used to balance the two loss terms and $\bm{E}$ is the ground truth of the essential matrix.

4 Experiments

4.1 Implementation Details

Both our pre-training and fine-tuning models are implemented using PyTorch [26] and trained on multiple NVIDIA Tesla V100 GPUs.

Pre-training Setting. We use an AdamW optimizer [20] and cosine learning rate decay [21]. The initial learning rate is set to $0.001$ , with a weight decay of $0.05$ . We pre-train our model for $100$ epochs, with a batch size of $64$ . The hyper-parameter $\lambda$ in Eq. 4 is set as $0.1$ .

Fine-tuning Setting. The number of initial correspondences, final candidates, and CorrFormer blocks are $N=2000$, $N^{\prime}=500$, and $L=2$, respectively. The number of network iterations $K$, channel dimension $C$, and pruning ratio are $2$, $128$, and $0.5$, respectively. We adopt Adam [12] with a weight decay of $0$ as the optimizer to train our network, and the canonical learning rate (for a batch size of $32$) is set to $0.001$. Following [46], the weight $\beta$ in Eq. 10 is set to $0$ during the first $20k$ iterations and $0.5$ during the remaining $480k$ iterations.
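These settings are mutually consistent, which a short sketch makes explicit. It assumes that every one of the $K$ iterations keeps a fraction equal to the pruning ratio of its input, and that the switch of $\beta$ happens exactly at iteration $20k$; both function names are hypothetical.

```python
def num_final_candidates(n_initial=2000, pruning_ratio=0.5, iterations=2):
    """Number of candidates left after K pruning iterations."""
    n = n_initial
    for _ in range(iterations):
        n = int(n * pruning_ratio)  # each iteration keeps half of its input
    return n


def essential_loss_weight(iteration, warmup_iters=20_000, beta=0.5):
    """Weight of the essential-matrix loss in Eq. 10 at a given training iteration."""
    # The exact boundary handling (< versus <=) is an assumption.
    return 0.0 if iteration < warmup_iters else beta


print(num_final_candidates())          # 500
print(essential_loss_weight(10_000))   # 0.0
print(essential_loss_weight(100_000))  # 0.5
```

Starting from $N=2000$, two halvings give $2000\rightarrow 1000\rightarrow 500$, matching $N^{\prime}=500$.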

4.2 Ablation Studies

Is the bi-level design of CorrFormer encoder helpful? The bi-level structure is the core design of the CorrFormer encoder. As shown in Table 1, to demonstrate its effectiveness, we conduct experiments using both single-level designs (rows 1 and 2) and the bi-level design (row 5) separately. Experimental results show that, compared to using only the global context acquisition level (row 2) or only the local context representation level (row 1), the bi-level design achieves performance improvements of $+3.05$ AUC@5° and $+9.04$ AUC@5°, respectively.

Table 1: Ablation study for our CorrFormer encoder. The results of AUC without pre-training knowledge on YFCC100M [32] are reported. Local and Global respectively represent the local context representation and the global context acquisition layers. 1x and 2x represent the number of CorrFormer blocks.

Local	Global	Interactive injection	Simple injection	1x	2x	AUC@5°	AUC@10°	AUC@20°
✓					✓	28.65	49.33	67.59
	✓				✓	34.64	54.79	71.19
✓	✓	✓		✓		33.06	53.28	70.94
✓	✓	✓			✓	36.92	57.89	74.46
✓	✓		✓		✓	37.69	58.37	74.71

How to inject local context? We explore both interactive (cross-attention) and simple (element-wise summation) manners to inject local context. In Table 1, we implement complex cross-attention (row 4) between the local context representation level and the global context acquisition level. The results indicate that this manner is not as effective as a simple element-wise summation operation (row 5), which is both neat and efficient.

How to choose $L$? As shown in Table 1, we investigate the effects of stacking one and two CorrFormer blocks (rows 3 and 4) on the network. With two encoder blocks, the performance of the network improves significantly ($+3.86$ AUC@5°). Considering the trade-off between training cost and performance, we set $L$ to $2$.

Table 2: Ablation study for our CorrMAE. We conduct experiments using two masking types with different masking ratios (%), and report pre-train loss (x100) on MegaDepth [14] as well as fine-tune camera pose estimation on YFCC100M [32].

Random	Block	Ratio 40	Ratio 60	Ratio 80	Align	Loss	AUC@5°	AUC@10°	AUC@20°
✓		✓				1.774	39.49	59.78	75.42
✓			✓			1.900	39.72	60.30	75.99
✓				✓		2.198	38.85	59.45	75.52
	✓		✓			2.073	39.40	59.70	75.39
✓			✓		✓	1.908	40.16	60.78	76.43

How to choose a masking strategy? To find a masking strategy suitable for our method, we perform ablation experiments on the masking ratio (40% to 80%) and masking types (random masking and block masking [42]); see Table 2. The comparison among rows 1, 2, and 3 shows that our method gains better performance improvements for the downstream task at a moderate masking ratio of 60%. As presented in rows 2 and 4, we also validate the impact of block masking on our method, with experimental results revealing that block masking performs slightly worse than random masking in both pre-training loss and fine-tuning performance. Hence, we adopt a random masking strategy with a masking ratio of 60% to provide more robust initial representations for downstream tasks. Additionally, as presented in Appendix B.1, we visualize the reconstruction results under this masking strategy. The visualized results show that our reconstructed correspondences adhere to the local and global consistency of inliers. In other words, the pre-trained encoder exhibits excellent inliers-consistent representations.

Is the alignment loss useful? As shown in Table 2, after introducing the alignment loss (the last row), our method achieves an improvement of 1.93% in AUC@5° for camera pose estimation. That is, in our correspondence reconstruction task, the proposed alignment supervision between the reconstructed keypoints of the source and target branches proves to be effective for inlier representation learning.

4.3 Downstream Evaluation

Since we are the first to propose a pre-training method specifically for correspondence pruning, all baseline methods are trained from scratch solely on the target dataset (YFCC100M). In contrast, our proposed method is pre-trained on the MegaDepth dataset [14] (see Appendix A for details) and then fine-tuned on the YFCC100M dataset [32].

Correspondence Pruning directly impacts downstream geometric estimation tasks, necessitating methods that can precisely identify true correspondences (inliers). To ensure a fair comparison, all methods are evaluated on correspondence pruning using the full-size verification [46]. Following [43, 46], we report precision, recall, and F-score on the testing set of YFCC100M. The predicted epipolar distances are thresholded at $3\times 10^{-5}$ to separate inliers from outliers. As shown in Table 3, our method pre-trained by CorrMAE outperforms all baseline methods on all metrics. Specifically, compared to the recent state-of-the-art methods NCMNet [18] and MGNet [5], we achieve improvements of 3.15% and 2.83% in F-score, respectively. Additionally, Appendix B.2 shows representative visualization results of CLNet [46], NCMNet [18], and our method from left to right. It can be seen that our method achieves the best performance under various challenging scenes.
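The three pruning metrics can be sketched as below. The sketch assumes that a correspondence is predicted as an inlier when its epipolar distance is below the threshold and that the F-score is the harmonic mean of precision and recall; the function names and the toy distances are hypothetical and not taken from any released evaluation code.

```python
def label_by_epipolar_distance(distances, threshold=3e-5):
    """Mark a correspondence as an inlier when its epipolar distance is below the threshold."""
    return [d < threshold for d in distances]


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def pruning_metrics(pred_inlier, gt_inlier):
    """Precision, recall, and F-score for one image pair."""
    tp = sum(1 for p, g in zip(pred_inlier, gt_inlier) if p and g)
    n_pred = sum(1 for p in pred_inlier if p)
    n_gt = sum(1 for g in gt_inlier if g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return precision, recall, f_score(precision, recall)


# Toy example with hypothetical distances and labels.
pred = label_by_epipolar_distance([1e-5, 2e-5, 5e-5, 1e-4])
gt = [True, False, True, False]
print(pruning_metrics(pred, gt))          # (0.5, 0.5, 0.5)
print(round(f_score(79.27, 81.45), 2))    # 80.35
```

The last line reproduces the F-score reported for our method in Table 3 from its precision and recall.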

Table 3: Evaluation on YFCC100M for the correspondence pruning.

Method	Precision	Recall	F-score
OANet [43]	68.05	68.41	68.23
MS²DGNet [6]	72.61	73.86	73.23
PGFNet [17]	71.56	72.71	72.13
CLNet [46]	75.05	76.41	75.72
ConvMatch [44]	73.12	74.39	73.75
UMatch [15]	73.97	75.72	74.83
NCMNet [18]	77.24	78.57	77.90
MGNet [5]	76.97	79.35	78.14
Ours	79.27	81.45	80.35

Camera Pose Estimation. The goal of this task is to estimate the relative pose (rotation and translation) between two-view images. Following the evaluation protocol of [46, 44], we report the AUC of pose error at thresholds of 5, 10, and 20 degrees, with a weighted eight-point algorithm [8] as the estimator. All baselines are evaluated on this task using the respective official settings and pre-trained models. Our results, presented in Table 4, show that our method consistently achieves the best accuracy at all thresholds. Notably, we outperform the current best graph neural network method NCMNet [18] with an improvement of $+5.65$ AUC@5°, while using $0.67$M fewer parameters. Furthermore, we significantly outperform the recent light-weight method MGNet [5] at the cost of $2.79$M additional parameters, gaining an improvement of $+7.84$ AUC@5°.
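One common way to compute the AUC of pose error is to integrate the cumulative recall curve up to each threshold and normalize by the threshold. The sketch below follows that formulation, assuming one scalar error in degrees per image pair (commonly the larger of the rotation and translation angular errors). Whether the protocol of [46, 44] uses exactly this formulation is an assumption, and the function name `pose_auc` is hypothetical.

```python
import bisect


def pose_auc(errors, thresholds=(5.0, 10.0, 20.0)):
    """Area under the recall-vs-error curve, normalized to [0, 1] per threshold."""
    n = len(errors)
    errs = [0.0] + sorted(errors)                # error values, starting at 0
    recall = [i / n for i in range(n + 1)]       # fraction of pairs with error <= errs[i]
    aucs = []
    for t in thresholds:
        last = bisect.bisect_left(errs, t)       # points strictly below the threshold
        e = errs[:last] + [t]                    # close the curve at the threshold
        r = recall[:last] + [recall[last - 1]]
        # Trapezoidal integration of recall over error.
        area = sum((e[i + 1] - e[i]) * (r[i + 1] + r[i]) / 2 for i in range(len(e) - 1))
        aucs.append(area / t)
    return aucs


print(pose_auc([2.5], thresholds=(5.0,)))  # [0.75]
```

A single pair with a 2.5° error yields an AUC@5° of 0.75: recall rises linearly to 1 over the first half of the interval and stays at 1 over the second half.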

Table 4: Evaluation on YFCC100M for the camera pose estimation. The AUC of the pose error in percentage is reported. Our method improves the state-of-the-art methods by a large margin.

Method	Params.	AUC@5°	AUC@10°	AUC@20°
PointCN [41]	0.39M	10.16	24.43	43.31
OANet [43]	2.47M	15.92	35.93	57.11
MS²DGNet [6]	2.61M	20.61	42.90	64.26
LMCNet [19]	0.93M	22.35	43.57	63.34
PGFNet [17]	3.12M	24.11	45.97	65.08
CLNet [46]	1.27M	25.28	45.82	65.44
ConvMatch [44]	7.49M	26.83	49.14	67.91
UMatch [15]	7.76M	30.84	52.04	69.65
NCMNet [18]	4.77M	34.51	55.34	72.40
MGNet [5]	1.31M	32.32	53.40	71.59
Ours	4.10M	40.16	60.78	76.43

Table 5: Evaluation on Aachen Day-Night benchmarks [29, 45] for the visual localization evaluation.

Visual Localization of Aachen v1.0
Method	Day: (0.25m, 2°) / (0.5m, 5°) / (5m, 10°)	Night: (0.25m, 2°) / (0.5m, 5°) / (5m, 10°)
-	82.3/88.2/92.7	38.8/45.9/57.1
OANet [43]	85.4/92.4/96.6	63.3/74.5/83.7
CLNet [46]	85.0/93.0/97.7	67.3/81.6/88.8
ConvMatch [44]	85.2/91.9/96.1	58.2/63.3/75.5
NCMNet [18]	85.2/92.7/98.1	68.4/81.6/90.8
Ours	86.8/93.9/98.2	73.5/84.7/93.9
Visual Localization of Aachen v1.1
-	85.4/91.1/94.5	35.6/42.4/55.0
OANet [43]	87.7/94.8/98.2	58.6/69.1/83.8
CLNet [46]	86.7/94.1/98.1	61.3/78.0/89.0
ConvMatch [44]	87.3/93.8/97.5	51.8/60.2/73.3
NCMNet [18]	86.7/94.1/98.4	61.3/77.5/90.1
Ours	88.8/94.9/98.7	67.0/82.2/94.2

Visual Localization. This task aims to recover 6-DoF poses of query images with respect to the corresponding 3D scene model. We integrate correspondence pruning methods into the official HLoc [28] pipeline and use two popular benchmarks, i.e., Aachen Day-Night (v1.0 and v1.1) [29, 45], to validate performance on the visual localization task. The Aachen v1.0 dataset contains $4328$ reference and $922$ query ($824$ day, $98$ night) images. Aachen v1.1 extends v1.0 with an additional $2369$ reference and $93$ night query images. Following [15, 5], we report the accuracy at error thresholds of $0.25$m/$2$°, $0.5$m/$5$°, and $5$m/$10$°. The baselines comprise a simple matcher, mutual nearest neighbor (MNN), and some state-of-the-art correspondence pruning methods such as OANet [43], CLNet [46], ConvMatch [44], and NCMNet [18]. Also, SIFT (with $4096$ keypoints) and MNN are used as the pre-processing step for correspondence pruning methods. As shown in Table 5, our method outperforms all baselines on both benchmarks, showing its robustness under day-night changes. Specifically, our method exhibits strong generalization ability, especially in night scenes (68.4 vs. our 73.5 and 61.3 vs. our 67.0 in the first metric). This should be attributed to the powerful inlier representation learned through our pre-training method.

Table 6: Evaluation on HPatches [1] for the homography estimation. The AUC of the corner error in percentage is reported.

Method	AUC@3px	AUC@5px	AUC@10px
MAGSAC++ [2]	47.79	56.27	64.68
OANet [43]	50.71	62.40	75.23
CLNet [46]	51.24	63.11	76.76
ConvMatch [44]	51.38	62.94	75.77
NCMNet [18]	51.34	63.16	76.96
Ours	52.78	65.04	78.42

Table 7: Our CorrMAE serves as a plug-and-play and task-driven pre-training method. The camera pose estimation results without/with CorrMAE are reported, and those with CorrMAE are achieved through pre-training and fine-tuning both on the target dataset (YFCC100M), without utilizing any additional data for pre-training.

Encoder	AUC@5° (w/o CorrMAE)	AUC@5° (w/ CorrMAE)	AUC@20° (w/o CorrMAE)	AUC@20° (w/ CorrMAE)
OANet [43]	15.92	17.40	57.11	58.34
CLNet [46]	25.28	28.56	65.44	67.31
NCMNet [18]	34.51	35.35	72.40	73.36
CorrFormer (Ours)	37.69	39.54	74.71	75.69

Homography Estimation. Homography is defined as a planar projection between two images captured from different perspectives. The testing set provides pairs of one source and five target images captured from various viewing angles and lighting conditions, along with ground-truth homography transformations. Following [15, 5], we test baselines and our method on HPatches [1] with RANSAC as a robust estimator. We report the AUC percentage of estimated homographies whose average corner error distance is below $3/5/10$ pixels. Since the intrinsic matrices are not provided in HPatches, we normalize keypoints by the image scale. To ensure fairness, we retrain all methods on YFCC100M using this new normalization. As shown in Table 6, our model generalizes best among all baselines.
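The average corner error used in this evaluation can be sketched as follows: the four image corners are warped by the estimated and the ground-truth homography, and the distances between the two warped positions are averaged. The corner convention (pixel coordinates from 0 to width minus 1) and the function names are assumptions made for illustration.

```python
import math


def warp_point(h, x, y):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w


def mean_corner_error(h_est, h_gt, width, height):
    """Average distance between the four image corners warped by h_est and h_gt."""
    corners = [(0.0, 0.0), (width - 1.0, 0.0), (width - 1.0, height - 1.0), (0.0, height - 1.0)]
    dists = [math.dist(warp_point(h_est, x, y), warp_point(h_gt, x, y)) for x, y in corners]
    return sum(dists) / len(dists)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]  # translation by (3, 4) pixels
print(mean_corner_error(shifted, identity, width=640, height=480))  # 5.0
```

Under this definition, an estimate that is off by a pure (3, 4) pixel translation has a corner error of 5 pixels, so it counts toward the 5px and 10px thresholds only if the criterion is inclusive, and not toward the 3px threshold.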

4.4 Discussion

Analysis of the task-driven and data-driven nature of CorrMAE. Through several experiments, we demonstrate that our pre-training method depends only weakly on additional data. As seen in the last row of Table 7, without introducing any additional data, our method yields a 4.91% relative improvement (39.54 vs. 37.69 in AUC@5°). In contrast, pre-training on a larger dataset (MegaDepth [14]) yields only a further 1.57% relative improvement (40.16 vs. 39.54 in AUC@5°), see the last row of Table 5. While additional data can provide stronger representational capabilities, our CorrMAE is predominantly task-driven in nature. This characteristic is essential for a method designed to be plug-and-play.

Does CorrMAE serve as a plug-and-play method? To answer this question, we replace the encoder within our pretraining-finetuning paradigm with other state-of-the-art methods, namely OANet [43], CLNet [46], and NCMNet [18]. As shown in Table 7, after pre-training with our CorrMAE, all of these methods improve even without extra data. Specifically, among the GNN-based methods, CLNet achieves a relative improvement of 12.97% in AUC@5°, whereas the recent state-of-the-art method NCMNet achieves a 2.43% relative improvement. CLNet, which exhibits weaker consistency learning capability, thus benefits most from our pre-training method. In addition, compared to NCMNet, our CorrFormer demonstrates stronger consistency learning ability (34.51 vs. our 37.69 in AUC@5°) and superior transferability ($+0.84$ vs. our $+1.85$ in AUC@5°). These experiments further validate that our CorrMAE can serve as a powerful tool that seamlessly assists various correspondence pruning methods.
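The relative improvements quoted in this discussion follow directly from the AUC@5° columns of Table 7. The short sketch below recomputes them, where the dictionary and the helper name are ours for illustration.

```python
# AUC@5 deg values copied from Table 7: (without CorrMAE, with CorrMAE).
table7_auc5 = {
    "OANet": (15.92, 17.40),
    "CLNet": (25.28, 28.56),
    "NCMNet": (34.51, 35.35),
    "CorrFormer": (37.69, 39.54),
}


def relative_gain(baseline, pretrained):
    """Relative improvement in percent: 100 * (pretrained - baseline) / baseline."""
    return 100.0 * (pretrained - baseline) / baseline


for name, (base, pre) in table7_auc5.items():
    print(f"{name}: +{pre - base:.2f} absolute, +{relative_gain(base, pre):.2f}% relative")
```

The relative gains for CLNet, NCMNet, and CorrFormer evaluate to 12.97%, 2.43%, and 4.91%, matching the figures in the text, while the absolute gains for NCMNet and CorrFormer are $+0.84$ and $+1.85$.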

5 Conclusion

In this paper, we approach correspondence pruning from a new perspective and explore a practical pre-training method for it. We propose CorrMAE, a masked correspondence reconstruction pipeline aimed at learning inlier representations. To support this reconstruction task, we design a dedicated encoder, a dual-branch structure, and an alignment loss. Extensive experiments verify the proposed method on several downstream tasks. We also provide detailed analyses to evaluate the effectiveness and impact of our method.

References

[1] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 5173–5182, 2017.
[2] Daniel Barath, Jana Noskova, Maksym Ivashechkin, and Jiri Matas. Magsac++, a fast, reliable and accurate robust estimator. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 1304–1312, 2020.
[3] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2018.
[4] Hao Chen, Jiaze Wang, Kun Shao, Furui Liu, Jianye Hao, Chenyong Guan, Guangyong Chen, and Pheng-Ann Heng. Traj-mae: Masked autoencoders for trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
[5] Luanyuan Dai, Xiaoyu Du, Hanwang Zhang, and Jinhui Tang. Mgnet: Learning correspondences via multiple graphs. In Proceedings of the AAAI conference on Artificial Intelligence, pages 3945–3953, 2024.
[6] Luanyuan Dai, Yizhang Liu, Jiayi Ma, Lifang Wei, Taotao Lai, Changcai Yang, and Riqing Chen. Ms2dg-net: Progressive correspondence learning via multiple sparse semantics dynamic graph. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 8973–8982, 2022.
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, 2021.
[8] Richard I Hartley. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):580–593, 1997.
[9] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 15979–15988, 2022.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Jared Heinly, Johannes L Schonberger, Enrique Dunn, and Jan-Michael Frahm. Reconstructing the world* in six days *(as captured by the yahoo 100 million image dataset). In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 3287–3295, 2015.
[12] Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, pages 1–15, 2014.
[13] Ming Li, Jie Wu, Xionghui Wang, Chen Chen, Jie Qin, Xuefeng Xiao, Rui Wang, Min Zheng, and Xin Pan. Aligndet: Aligning pre-training and fine-tuning in object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6866–6876, 2023.
[14] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
[15] Zizhuo Li, Shihua Zhang, and Jiayi Ma. U-match: two-view correspondence learning with hierarchy-aware local context aggregation. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1169–1176, 2023.
[16] Tangfei Liao, Xiaoqin Zhang, Li Zhao, Tao Wang, and Guobao Xiao. Vsformer: Visual-spatial fusion transformer for correspondence pruning. In Proceedings of the AAAI conference on Artificial Intelligence, 2024.
[17] Xin Liu, Guobao Xiao, Riqing Chen, and Jiayi Ma. Pgfnet: Preference-guided filtering network for two-view correspondence learning. IEEE Transactions on Image Processing, 32:1367–1378, 2023.
[18] Xin Liu and Jufeng Yang. Progressive neighbor consistency mining for correspondence pruning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 9527–9537, 2023.
[19] Yuan Liu, Lingjie Liu, Cheng Lin, Zhen Dong, and Wenping Wang. Learnable motion coherence for correspondence pruning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 3237–3246, 2021.
[20] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[21] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations, 2017.
[22] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[23] Haoming Lu, Hazarapet Tunanyan, Kai Wang, Shant Navasardyan, Zhangyang Wang, and Humphrey Shi. Specialist diffusion: Plug-and-play sample-efficient fine-tuning of text-to-image diffusion models to learn any unseen style. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 14267–14276, 2023.
[24] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
[25] Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In Proceedings of the European Conference on Computer Vision, pages 604–621. Springer, 2022.
[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[27] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[28] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 12716–12725, 2019.
[29] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 8601–8610, 2018.
[30] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
[31] Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, and Kwang Moo Yi. Acne: Attentive context normalization for robust permutation-equivariant learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 11286–11295, 2020.
[32] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[33] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022.
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[35] Tao Wang, Kaihao Zhang, Xuanxi Chen, Wenhan Luo, Jiankang Deng, Tong Lu, Xiaochun Cao, Wei Liu, Hongdong Li, and Stefanos Zafeiriou. A survey of deep face restoration: Denoise, super-resolution, deblur, artifact removal. arXiv preprint arXiv:2211.02831, 2022.
[36] Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tong Lu, Tae-Kyun Kim, Wei Liu, and Hongdong Li. Gridformer: Residual dense transformer with grid structure for image restoration in adverse weather conditions. International Journal of Computer Vision, pages 1–23, 2024.
[37] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics, 38(5):1–12, 2019.
[38] Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Flowformer: Linearizing transformers with conservation flows. In Proceedings of the International Conference on Machine Learning, 2022.
[39] Fei Xue, Ignas Budvytis, and Roberto Cipolla. Imp: Iterative matching and pose estimation with adaptive pooling. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 21317–21326, 2023.
[40] Hong Yan, Yang Liu, Yushen Wei, Zhen Li, Guanbin Li, and Liang Lin. Skeletonmae: graph-based masked autoencoder for skeleton sequence pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5606–5618, 2023.
[41] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 2666–2674, 2018.
[42] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 19313–19322, 2022.
[43] Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. Learning two-view correspondences and geometry using order-aware network. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 5845–5854, 2019.
[44] Shihua Zhang and Jiayi Ma. Convmatch: Rethinking network design for two-view correspondence learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[45] Zichao Zhang, Torsten Sattler, and Davide Scaramuzza. Reference pose generation for long-term visual localization via learned features and view synthesis. International Journal of Computer Vision, 129:821–844, 2021.
[46] Chen Zhao, Yixiao Ge, Feng Zhu, Rui Zhao, Hongsheng Li, and Mathieu Salzmann. Progressive correspondence pruning by consensus learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 6464–6473, 2021.

Appendix A Datasets

YFCC100M [32] is collected by Yahoo and consists of 100 million photos from the Internet. The authors of [11] split YFCC100M into 72 sequences of different tourist landmarks and provide camera poses and sparse models for generating ground truth. Following [43], we select 68 sequences as the training set and the remaining 4 sequences as the testing set. For the experiments in Table 7, the pre-training inputs (true correspondences/inliers) are selected with an empirical threshold of $10^{-4}$ on the epipolar distance.
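The inlier selection step can be illustrated with a minimal sketch. We assume here the squared symmetric epipolar distance computed in normalized image coordinates from a ground-truth essential matrix, which is a common choice in correspondence pruning. The text above only specifies "epipolar distance" with a threshold of $10^{-4}$, so the exact variant, the function names, and the toy data are assumptions.

```python
def mat_vec(M, v):
    """3x3 matrix (nested lists) times 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def transpose(M):
    return [[M[c][r] for c in range(3)] for r in range(3)]


def symmetric_epipolar_distance(E, p1, p2, eps=1e-15):
    """Squared symmetric epipolar distance of one correspondence.

    E: 3x3 essential matrix satisfying x2^T E x1 = 0 for true matches.
    p1, p2: (x, y) in normalized coordinates of image 1 and image 2.
    """
    x1 = [p1[0], p1[1], 1.0]
    x2 = [p2[0], p2[1], 1.0]
    line2 = mat_vec(E, x1)             # epipolar line of x1 in image 2
    line1 = mat_vec(transpose(E), x2)  # epipolar line of x2 in image 1
    residual = sum(a * b for a, b in zip(x2, line2))  # x2^T E x1
    inv1 = 1.0 / (line1[0] ** 2 + line1[1] ** 2 + eps)
    inv2 = 1.0 / (line2[0] ** 2 + line2[1] ** 2 + eps)
    return residual ** 2 * (inv1 + inv2)


def select_true_correspondences(E, matches, threshold=1e-4):
    """Keep matches whose epipolar distance is below the threshold."""
    return [m for m in matches if symmetric_epipolar_distance(E, m[0], m[1]) < threshold]


# Toy essential matrix for a pure translation along x (R = I, t = (1, 0, 0)):
# true matches then share the same y coordinate.
E_example = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
matches_example = [((0.1, 0.2), (0.3, 0.2)), ((0.1, 0.2), (0.3, 0.25))]
print(select_true_correspondences(E_example, matches_example))
```

In the toy example only the first match satisfies the epipolar constraint and is kept as a pre-training input.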

MegaDepth [14] is a larger dataset comprising 196 scene sequences, whose camera poses and depth maps are produced from COLMAP [30]. Following [39], we employ SIFT [22] and a nearest neighbor matching strategy to generate initial correspondences for two-view images, and refine them into true correspondences using geometric thresholds. These true correspondences constitute our pre-training dataset from MegaDepth. It is worth noting that the MegaDepth pre-training dataset consists of true correspondences from $780k$ image pairs, with an average of only $50$ true correspondences per image pair used for pre-training.
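As an illustration of how initial correspondences can be generated from descriptors, the sketch below implements brute-force nearest neighbor matching with an optional mutual check. It is a simplified stand-in written in plain Python: the descriptors are toy 2D vectors rather than 128-dimensional SIFT descriptors, and the function name and `mutual` flag are hypothetical.

```python
def nearest_neighbor_matches(desc_a, desc_b, mutual=False):
    """Match each descriptor in desc_a to its nearest neighbor in desc_b.

    Returns a list of (index_in_a, index_in_b) pairs. With mutual=True,
    a pair is kept only if each descriptor is the other's nearest neighbor.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    def nearest(query, pool):
        return min(range(len(pool)), key=lambda j: sq_dist(query, pool[j]))

    a_to_b = [nearest(d, desc_b) for d in desc_a]
    if not mutual:
        return list(enumerate(a_to_b))
    b_to_a = [nearest(d, desc_a) for d in desc_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Toy descriptors (made up for illustration).
desc_a = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.0]]
desc_b = [[1.0, 1.0], [0.0, 0.1]]
print(nearest_neighbor_matches(desc_a, desc_b))
print(nearest_neighbor_matches(desc_a, desc_b, mutual=True))
```

The resulting index pairs would then be filtered with geometric thresholds to obtain the true correspondences used for pre-training.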

Appendix B Visualization Results

B.1 Visualization Results of Correspondence Reconstruction

In this section, we show several correspondence reconstruction results to illustrate the effectiveness of our method. As shown in Fig. 5, across various scenarios, the reconstruction results of our method largely conform to global and local consistency. This further demonstrates that our pre-training method CorrMAE can provide powerful inlier representations for downstream tasks.

B.2 Visualization Results of Correspondence Pruning

We present visualization results of correspondence pruning to validate the superiority of our method. As shown in Fig. 6, our method outperforms the current state-of-the-art method NCMNet [18] in various challenging scenarios.