EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition

Issar Tzachor¹ Boaz Lerner¹ Matan Levy² Michael Green¹ Tal Berkovitz Shalev¹ Gavriel Habib¹ Dvir Samuel³ Noam Korngut Zailer¹ Or Shimshi¹ Nir Darshan¹ Rami Ben-Ari¹
¹OriginAI, Israel
²The Hebrew University of Jerusalem, Israel
³Bar-Ilan University, Israel
[email protected]

Abstract

The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on task-specific data. In this paper, we propose a simple yet powerful approach to better exploit the potential of a foundation model for VPR. We first demonstrate that features extracted from self-attention layers can serve as a powerful re-ranker for VPR. Utilizing these features in a zero-shot manner, our method surpasses previous zero-shot methods and achieves competitive results compared to supervised methods across multiple datasets. Subsequently, we demonstrate that a single-stage method leveraging internal ViT layers for pooling can generate global features that achieve state-of-the-art results, even when reduced to a dimensionality as low as 128D. Incorporating our local foundation features for re-ranking widens this gap further. Our approach also demonstrates remarkable robustness and generalization, achieving state-of-the-art results by a significant margin in challenging scenarios involving occlusion, day-night variations, and seasonal changes.

1 Introduction

The task of Visual Place Recognition (VPR), also known as Geo-Localization, aims to predict the place where a photo was taken relying solely on the visual information in the image. This is typically done with an image retrieval approach [4, 21, 34, 11, 33, 27] that uses a database of geo-tagged images, often referred to as the gallery. Real-world data, including tagged VPR datasets, rely on two major image sources: 1) car street-view and 2) people's personal cameras (commonly mobile phones) [4, 43, 47, 8, 6]. As a result, images contain natural objects that are irrelevant and sometimes misleading for the VPR task. Figure 1 (top row) shows an example where people, vehicles, daylight, or camera angles differ between images.

Modern models commonly use a deep neural network to extract a so-called global feature (a.k.a. descriptor) for the query and gallery images [29, 24, 39, 30, 31]. This single-stage approach is then followed by a nearest neighbor search in the feature space to retrieve the matching candidates from the gallery. In practice, the global features of the entire gallery need to be loaded into RAM to enable fast retrieval. In large-scale, real-world scenarios, reducing the memory footprint of each image is crucial to ensure real-time applicability. Therefore, several works promote compact features to achieve both accuracy and applicability [49, 34, 8]. A popular strategy to improve accuracy is a two-stage search, in which a second similarity search re-ranks the top-k results retrieved in the first stage. This re-ranking is commonly performed by matching local key-points and their corresponding descriptors.

Following best practices in computer vision, VPR methods are often initialized with ImageNet pre-trained weights (e.g. [49, 9]) and then fine-tuned on VPR datasets, e.g. MSLS [47], or are trained from scratch [46, 21, 4]. Recent advancements in VPR exploit the capabilities of foundation models [27, 22, 34, 33] such as DINOv2 [37], a transformer-based [44] model trained by self-supervised learning on a vast amount of data. Recent approaches [34, 33, 22] argue that using vanilla DINOv2 (without fine-tuning) is ineffective: it captures dynamic and irrelevant elements (pedestrians or vehicles), thereby diverting attention from crucial VPR features (buildings or scene layout). Some methods address this issue by integrating learned adapters into the foundation model architecture [34, 33] or by replacing the standard fine-tuning scheme with a unique pooling mechanism [22], in order to facilitate the adaptation of the foundation model to the VPR task. In contrast, AnyLoc [27] suggested using DINOv2 as a general-purpose feature representation in a zero-shot manner for various localization tasks. However, AnyLoc achieves its best performance with extremely high-dimensional features (of size 49K), which raises scalability challenges; even then, it demonstrates only modest performance and struggles when queries exhibit large appearance changes. In this paper, we utilize the internal ViT features first to surpass previous zero-shot approaches and achieve results competitive with several trained VPR methods.

Typically, VPR methods [22, 34, 33, 11, 49, 8] follow a conventional approach: they first extract local features from a pretrained model and then apply a pooling method such as GeM [38] or NetVLAD [4] to obtain a global feature for each image. Methods that aggregate local features with VLAD [23], such as AnyLoc [27], or with NetVLAD [4] require learning a dictionary for each specific gallery. These models tend to "overfit" to the particular gallery distribution, which hampers their generalization. Additionally, they often require fine-grained clustering, leading to high-dimensional feature vectors. Moreover, a recent study [40] highlighted GeM's hyper-parameter sensitivity, loss of spatial information, and convergence complications. Alternatively, we suggest utilizing the internal ViT self-attention mechanism and training the class $[CLS]$ token with a classification loss. This approach allows implicit aggregation of local features into a global representation and eliminates the need for the external special components and learned features used previously, which often yield excessively large feature representations [22, 33, 34] (see Figure 1).

We hereby present an Effective foundation-based VPR method called EffoVPR, which achieves superior single-stage performance without the need for additional components. Unlike more intricate DINOv2-based methods with built-in adapter components [34, 33] or special aggregation techniques [22], EffoVPR delivers State-of-The-Art (SoTA) performance across multiple datasets while maintaining compactness, as shown in Figure 1. Our two-stage approach builds upon local feature matching derived from the internal ViT attention maps, further widening our performance gap. Additionally, EffoVPR demonstrates strong generalization and robustness across various cities and landscapes, effectively handling challenges such as occlusions, time differences and seasonal changes; Figure 1 illustrates a case of significant occlusion.

To summarize, our contributions in this work include:

1. We present a method that effectively leverages a foundation model for the VPR task by exploiting its intermediate attention layers. Building on this finding, we propose a zero-shot method that outperforms previous zero-shot approaches and, on a few datasets, shows results comparable even to trained VPR methods.
2. Our work introduces a novel foundation-model-based VPR approach that eliminates the need for external aggregation methods or specialized pooling layers (such as NetVLAD or GeM).
3. We suggest an effective yet simple re-ranking process based on ViT internal attention layers that significantly boosts performance.
4. Our method achieves state-of-the-art performance across various VPR benchmarks and generalizes to challenging scenarios such as occlusions, day-night variations or seasonal changes.

2 Related work

Traditional approaches to Visual Place Recognition (VPR) were single-stage: using SIFT [32], SURF [7], or RootSIFT [25], they matched queries to gallery images through local image features. Two-stage methods [21, 46, 49, 34] entail an initial ranking based on global representation similarity, followed by a second stage that re-ranks the top-K retrieved candidates using local features.

The first deep learning approaches to the VPR task used CNNs [4, 38, 26], which were later replaced by Vision Transformers (ViT) [46, 49, 27, 34, 33, 22]. TransVPR [46] and R2Former [49] applied transformers to VPR and adopted a two-stage approach that includes re-ranking. However, training from scratch or from an ImageNet initialization, together with a lack of effective view variability in training (in contrast to [11, 33]), has restricted their performance.

A handful of visual foundation models (VFMs) have recently emerged as backbones for numerous tasks. VFMs such as CLIP [39], DINOv2 [37], and SAM [28] use ViTs trained with distinct objectives and exhibit unique characteristics for various downstream applications, often even in a zero-shot setting (without requiring any fine-tuning). Among zero-shot applications, AnyLoc [27] proposed a single-stage solution for VPR that leverages a pre-trained DINOv2 model alongside dense local (patch-level) features. By employing VLAD for feature pooling across multiple layers, AnyLoc produces an extremely large global representation of $\sim$49K dimensions, subsequently reduced to 512D through PCA whitening at the expense of lower performance. Moreover, AnyLoc’s VLAD aggregation tends to generalize poorly to queries with large time gaps, day vs. night differences, or a different season. In this paper, we suggest a two-stage zero-shot method that first performs a global ranking and then re-ranks by matching local descriptors extracted from the ViT self-attention layers.

In contrast, SALAD [22] proposed fine-tuning the pre-trained DINOv2 model to improve its performance on the VPR task. Their single-stage approach pools local features from the output layer of DINOv2, replacing NetVLAD’s soft assignment to clusters with an optimal transport formulation. Notably, their best results are achieved with large features exceeding 8K dimensions, which adversely impacts memory consumption.

Using DINOv2 as their backbone, SelaVPR [34] and CricaVPR [33] took a different path and avoided fine-tuning the model, arguing that it causes catastrophic forgetting and performance deterioration. They instead integrate trainable adapters into the DINOv2 architecture. SelaVPR suggested a two-stage methodology that trains global and local features on two separate branches built on top of DINOv2’s output tokens, while discarding the standard [CLS] token. Their re-ranking strategy relies on mutual nearest neighbors (mutual-NN) computed over newly learned patch-level (local) features. CricaVPR instead learns a feature pyramid on top of DINOv2’s output tokens to obtain a viewpoint-robust encoder. Both SelaVPR and CricaVPR incorporate adapters in the DINOv2 architecture, use GeM pooling for feature aggregation, and are trained with a contrastive loss.

Training a VPR model, often on data sourced from Street-View images, introduces a significant challenge in selecting appropriate positive and hard negative images [4, 47, 49, 8]. EigenPlaces [11] introduced a novel training paradigm that organizes the training data into classes, with each class containing multiple viewpoints of the same scene. By employing a classification loss rather than the conventional contrastive loss (used in [34, 22, 33]), the resulting model demonstrated high resilience to varying viewpoints in testing. Hence, we adopt a similar strategy for our training process.

Current VPR methods predominantly derive a global representation by aggregating local features obtained from the output layer of a trained backbone. This aggregation is often conducted using external learnable pooling techniques, such as VLAD (Vector of Locally Aggregated Descriptors) [23], SPoC (Sum-Pooled Convolutional features) [5], RMAC (Regional Maximum Activation of Convolutions) [42], the well-known NetVLAD [4], or the now widely used GeM (Generalized Mean) pooling layer [38, 40], as utilized in various studies [12, 18, 19, 47, 21, 22, 34, 33]. However, NetVLAD is encumbered by high computational costs and high dimensionality (a $\sim$32K-dimensional feature) [8], while GeM encounters convergence issues and requires hyperparameter tuning [40]. In contrast, we suggest forgoing the explicit, external aggregation step and replacing it with training the single [CLS] token of the ViT with a classification loss.

In summary, this paper revisits the potential of the DINOv2 foundation model for the VPR task, considering a two-stage zero-shot model as well as single-stage and two-stage fine-tuned models. Unlike previous approaches, we propose leveraging the existing internal aggregation layers in DINOv2 without any additional components or adapters. Notably, our approach achieves SoTA results with compact features, a crucial consideration given realistic memory constraints.

3 Method

For the VPR task, we explore ViT [17] feature maps as local patch descriptors. In a ViT architecture, an image is split into $p$ non-overlapping patches, which are processed into tokens by linearly projecting each patch to a $d$-dimensional space and adding learned positional embeddings. An additional [CLS] token is inserted to capture global image properties. The set of tokens is then passed through $L$ transformer encoder layers, each consisting of normalization layers, Multihead Self-Attention modules, and MLP blocks. In ViT, each patch is directly associated with a set of features, a Key, a Query and a Value, that can be used as patch descriptors. We utilize the self-attention matrices at layer $l$, $\{K_{l},Q_{l},V_{l}\}\in\mathbb{R}^{p\times d}$, where $p$ is the number of patches (the image yields $p+1$ tokens once the global token is added) and $d$ is the embedding dimension. The self-attention function at layer $l$ is then given by:

$$\text{Attention}(Q_{l},K_{l},V_{l})=\text{Softmax}\left(\frac{Q_{l}K_{l}^{T}}{\sqrt{d}}\right)V_{l}\tag{1}$$

We denote the key feature of the $[CLS]$ token at layer $l$ by $k_{cls}\in K_{l}$. Each image patch is therefore directly associated with the set of features $\{q_{i}^{l},k_{i}^{l},v_{i}^{l}\}_{i\in[p+1]}$, comprising its query, key and value, respectively, at each layer $l\in[1,L]$.
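To make the roles of the query, key and value features concrete, the following is a minimal single-head sketch of scaled dot-product attention in its standard form, where the score between tokens $i$ and $j$ is $q_i\cdot k_j/\sqrt{d}$. It uses only the Python standard library; the toy matrices and the function names `softmax` and `attention` are illustrative assumptions, not part of the EffoVPR code, and a real ViT would use a multi-head, batched implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Q, K, V are lists of token vectors (one row per token, each of length d).
    Returns one output vector per token.
    """
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this token's query with every token's key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Each output is a convex combination of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


# Toy example (hypothetical numbers): 3 tokens, e.g. [CLS] plus two patches, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = attention(Q, K, V)
```

Because the softmax weights of each row sum to one, every output row is a weighted average of the value vectors, which is the sense in which the [CLS] token can implicitly aggregate patch-level information.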

In the global search stage, the aim is to perform an efficient search across a vast corpus of images in a gallery. To this end, a shared embedding space for query and gallery images is typically utilized to identify the most similar images and rank them accordingly. In the re-ranking stage, the top-K nearest neighbors (candidates) are further refined by evaluating the mutual similarity between their local features and those of the query image. Throughout both stages, we use a ViT [17] encoder as our backbone model. Our EffoVPR retrieval method is presented in Figure 2.

3.1 Training strategy

Our training strategy employs a classification loss applied on the output $[CLS]$ token. Following EigenPlaces [11], we partition the map into cells measuring $15\times 15$ meters each, where cells define classes. Each class contains images capturing the same location from various viewpoints, enhancing the model’s ability to accurately recognize locations despite substantial changes in the captured view angle. To enforce class discrimination while ensuring viewpoint invariance, we utilize the Large Margin Cosine Loss (CosFace)[8, 45].
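As an illustration of the classification objective, the sketch below computes the Large Margin Cosine Loss for a single sample: the margin $m$ is subtracted from the cosine of the ground-truth class only, the cosines are scaled by $s$, and a softmax cross-entropy follows. The values `s=30.0` and `m=0.40` and the function names are assumptions chosen for illustration; they are not the hyperparameters reported for EffoVPR. Here each row of `class_weights` would correspond to one map cell (class).

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosface_loss(feature, class_weights, label, s=30.0, m=0.40):
    """Large Margin Cosine Loss (CosFace) for one sample.

    feature:       descriptor of the image (list of floats)
    class_weights: one weight vector per class
    label:         index of the ground-truth class
    s, m:          scale and margin (illustrative values, not from the paper)
    """
    f = l2_normalize(feature)
    cosines = [sum(a * b for a, b in zip(f, l2_normalize(w))) for w in class_weights]
    # Subtract the margin from the target-class cosine only, then scale.
    logits = [s * (c - m) if j == label else s * c for j, c in enumerate(cosines)]
    # Cross-entropy computed through a stable log-sum-exp.
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return log_sum - logits[label]


W = [[1.0, 0.0], [0.0, 1.0]]                       # two hypothetical classes
loss_correct = cosface_loss([1.0, 0.0], W, label=0)  # feature aligned with its class
loss_wrong = cosface_loss([1.0, 0.0], W, label=1)    # feature aligned with the other class
```

The margin makes the objective stricter than a plain softmax: a feature must be closer to its own class centre by at least $m$ in cosine terms before the loss becomes small.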

We initialize our ViT [17] backbone with pre-trained DINOv2 weights [37, 16]. To retain the rich visual representations learned during pre-training while adapting the model to the VPR task, we fine-tune only the final layers of our backbone. Note that due to the inherent design of the ViT architecture, which incorporates a self-attention mechanism (Eq. 1), the global feature is implicitly trained to aggregate local features without the need for additional specialized components. To learn a more compact global feature, we simply add a linear layer on top of the output [CLS] token.

3.2 Inference strategy

Global ranking stage: Post-training, we extract the global feature of an image from the $[CLS]$ output token of the penultimate classification layer. We then retrieve the K nearest neighbors of the query’s global feature.
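The global ranking step can be sketched as a brute-force cosine-similarity search. The snippet below is a minimal standard-library illustration with made-up 2D descriptors; the function name `top_k` is hypothetical, and a large gallery would typically rely on a vectorized or approximate nearest-neighbor index instead.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query, gallery, k):
    """Indices of the k gallery descriptors most similar to the query (cosine)."""
    q = l2_normalize(query)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)
    return order[:k]


# Hypothetical global descriptors for four gallery images.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
candidates = top_k([1.0, 0.05], gallery, k=2)
```

The returned indices are the candidates that a second stage would then re-rank.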

Re-ranking stage: We start by extracting local patch features from each of the top-K candidates retrieved in the global stage, and re-rank the candidates based on these features. Local features are derived from an intermediate layer $l$ of the ViT using the self-attention matrices, so their computation is part of the forward pass that produces the global features and requires no recalculation. In line with the findings in [3], where $V_{l}$ was shown to have stronger instance-level characteristics, we find the $V_{l}$ matrix in Eq. 1 to be the most suitable facet for the task at hand (see Tab. S3 in Appx.). We then leverage the model’s internal prioritization to identify discriminative features and extract the attention map $S:=\text{Softmax}(Q_{l}\cdot k_{cls})\in\mathbb{R}^{p}$, which represents the attention of each Value feature to the key token $k_{cls}$ of the global $[CLS]$ image representation. We select a subset $\mathcal{V}$ of the values, $\mathcal{V}\subseteq\{v_{1},...,v_{p}\}:=V_{l}$, as the image’s local features based on the score $S$: $\mathcal{V}:=\{v_{i}~|~S_{i}>T_{1}\}$ for a predefined threshold $T_{1}$. We use a rather low threshold, which filters out the patches that are “less significant” according to the layer’s attention. Note that the number of selected local features might differ between images.
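The selection of local features by attention score can be sketched as follows. The snippet computes $S=\text{Softmax}(Q_{l}\cdot k_{cls})$ over the patches and keeps the value vectors whose score exceeds $T_{1}$. The toy matrices, the threshold `t1=0.1`, and the function name `select_local_features` are assumptions for illustration only; the paper's actual threshold is given in its appendix.

```python
import math


def select_local_features(Q, V, k_cls, t1):
    """Keep the value vectors of patches whose attention score exceeds t1.

    Q, V:  per-patch query and value vectors at the chosen layer (p rows each)
    k_cls: key vector of the [CLS] token at the same layer
    t1:    attention threshold (hypothetical value in the example below)
    Returns (selected value vectors, attention scores S).
    """
    scores = [sum(a * b for a, b in zip(q, k_cls)) for q in Q]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]   # stable softmax over patches
    total = sum(exps)
    S = [e / total for e in exps]
    selected = [v for v, s in zip(V, S) if s > t1]
    return selected, S


# Three hypothetical patches with 2-dimensional features.
Q = [[2.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k_cls = [1.0, 0.0]
selected, S = select_local_features(Q, V, k_cls, t1=0.1)
```

In this toy case the second patch receives a score of about 0.09 and is dropped, so two of the three value vectors survive; with real images the number of surviving patches varies from image to image.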

Next, given the query’s local features $\mathcal{V}$ and a candidate’s $\mathcal{V}^{\prime}$, we compute the set of mutual nearest neighbors (MNN) between $\mathcal{V}$ and $\mathcal{V}^{\prime}$, i.e., the feature pairs that are each other’s nearest neighbor, and score the candidate by the number of such pairs. We observed that applying a threshold to the MNN similarity scores enhances the model’s resilience to clutter and directs its attention to the pertinent key-points for accurate matching. We therefore count only the pairs with cosine similarity higher than a predefined threshold $T_{2}$ (for an ablation on the thresholds see Tab. S3 in Appx.). We formulate this process in Equation 2:

$$\text{MNN}(\mathcal{V},\mathcal{V}^{\prime}):=\{(v_{i},v^{\prime}_{j})\in\mathcal{V}\times\mathcal{V}^{\prime}~|~v_{i}=NN(v^{\prime}_{j},\mathcal{V})~\text{and}~v^{\prime}_{j}=NN(v_{i},\mathcal{V}^{\prime}),~v_{i}^{T}v^{\prime}_{j}>T_{2}\}\tag{2}$$

where $NN(v,\mathcal{X})$ denotes the nearest neighbor of $v$ within the set $\mathcal{X}$.

Finally, the re-ranking stage concludes by sorting the top-K candidates based on their MNN counts. Note that our re-ranking strategy, which matches local features, does not require any additional learning, optimization, or spatial verification. The thresholds are established once and remain fixed across all test sets (20 different scenarios).
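The matching and re-ranking steps can be sketched in a few lines. The snippet below counts mutual nearest-neighbor pairs whose similarity exceeds $T_{2}$ and sorts the candidates by that count. It assumes L2-normalized local features (so the dot product equals the cosine similarity), uses a hypothetical threshold `t2=0.7`, and breaks ties by keeping the first-stage order; the tie-breaking rule and all function names are assumptions, not details taken from the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest(v, pool):
    """Index of the vector in `pool` with the highest similarity to `v`."""
    return max(range(len(pool)), key=lambda j: dot(v, pool[j]))


def mnn_count(query_feats, cand_feats, t2):
    """Number of mutual nearest-neighbor pairs with similarity above t2."""
    if not query_feats or not cand_feats:
        return 0
    count = 0
    for i, v in enumerate(query_feats):
        j = nearest(v, cand_feats)
        # Mutual: the candidate feature must point back to the same query feature.
        if nearest(cand_feats[j], query_feats) == i and dot(v, cand_feats[j]) > t2:
            count += 1
    return count


def rerank(query_feats, candidates, t2):
    """candidates: list of (gallery_id, local_features) in first-stage order."""
    scored = [(mnn_count(query_feats, feats, t2), gid) for gid, feats in candidates]
    # sorted() is stable, so equal counts keep their first-stage order (assumption).
    return [gid for _, gid in sorted(scored, key=lambda s: -s[0])]


query = [[1.0, 0.0], [0.0, 1.0]]
cand_a = [[0.8, 0.6], [-1.0, 0.0]]    # one confident mutual match
cand_b = [[1.0, 0.0], [0.0, 1.0]]     # two confident mutual matches
cand_c = [[-1.0, 0.0], [0.0, -1.0]]   # mutual pairs exist, but below the threshold
ranking = rerank(query, [("a", cand_a), ("b", cand_b), ("c", cand_c)], t2=0.7)
```

Candidate `c` shows why the threshold matters: mutual nearest neighbors always exist between two non-empty sets, so without $T_{2}$ even unrelated images would collect matches.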

Zero-shot: For the first stage ranking we use the [CLS] token from vanilla-DINOv2. The results are then refined by re-ranking, while employing our suggested features $\mathcal{V}$ with MNN in Eq. (2).

4 Evaluation

In this section, we compare our approach with several SoTA VPR methods following the common VPR benchmarks [10]. We propose a single-stage approach (EffoVPR-G) and a two-stage approach that includes a re-ranker (EffoVPR-R), with the backbone trained on the publicly available SF-XL [8] dataset containing street-view images of San Francisco. For more implementation details see Appx. We then test EffoVPR on a large number of diverse datasets (20), including Pitts30k [4], Tokyo24/7 [43], MSLS-val/challenge [47], Nordland [41] and more, which exhibit a wide variety of conditions, including different cities, day/night images, and seasonal changes. Note that the MSLS Challenge [47] is a hold-out set whose labels are not released; researchers submit their predictions to the challenge server to obtain the performance. More details on the benchmarks can be found in the Appendix.

Datasets whose gallery is made from street-view images and that have the largest viewpoint variance include Tokyo 24/7 [43] and SF-XL [45, 6], where the query images are collected with a smartphone, usually from sidewalks. Most datasets consist of urban footage, the main exception being Nordland [41], a collection of photos taken across different seasons with a camera mounted on a train. Some datasets present various degrees of day-to-night changes, namely MSLS [47], Tokyo 24/7 [43], SF-XL [8] and SVOX-Night [12]. AmsterTime [48] contains grayscale historical queries and modern-time RGB gallery images, making it the only dataset with large-scale time variations of multiple decades.

We follow the common evaluation metric used in previous works, e.g. [10, 11, 49, 34, 33, 4]: we use a 25-meter radius as the threshold for correct localization and report Recall@K for K=1,5,10. For Nordland, a match within $\pm 10$ frames counts as correct, following the common evaluation protocol used in [10, 21]. For a more comprehensive description of all 20 datasets and implementation details see Appx.
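For reference, Recall@K under a distance threshold can be computed as in the sketch below: a query counts as correctly localized at K if at least one of its top-K retrieved gallery images lies within the radius. The snippet assumes planar coordinates in meters (e.g. UTM) and treats a distance exactly equal to the radius as correct; both choices, as well as the toy positions and the function name, are assumptions for illustration.

```python
import math


def recall_at_k(predictions, query_pos, gallery_pos, ks=(1, 5, 10), radius_m=25.0):
    """Recall@K in percent.

    predictions: for each query, the ranked list of retrieved gallery indices
    query_pos, gallery_pos: (x, y) positions in meters
    """
    hits = {k: 0 for k in ks}
    for q, ranked in enumerate(predictions):
        qx, qy = query_pos[q]
        for k in ks:
            # Correct if any of the top-k candidates is within the radius.
            if any(math.hypot(qx - gallery_pos[g][0], qy - gallery_pos[g][1]) <= radius_m
                   for g in ranked[:k]):
                hits[k] += 1
    n = len(predictions)
    return {k: 100.0 * hits[k] / n for k in ks}


# Hypothetical positions: three gallery images and two queries.
gallery_pos = [(0.0, 0.0), (100.0, 0.0), (10.0, 10.0)]
query_pos = [(5.0, 5.0), (95.0, 0.0)]
predictions = [[2, 1, 0], [0, 1, 2]]
recalls = recall_at_k(predictions, query_pos, gallery_pos, ks=(1, 2))
```

In this toy example the first query is localized by its top-1 candidate, while the second needs its second candidate, giving a Recall@1 of 50% and a Recall@2 of 100%.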

4.1 Zero-shot performance

Table 1: Comparison on Zero-Shot with R@1.

Method	Pitts30k	Tokyo24/7	MSLS-Val	Nordland
DINOv2 [37]	78.1	62.2	47.7	33.0
AnyLoc [27]	87.7	60.6	68.7	16.1
EffoVPR-ZS	89.4	90.8	70.3	57.9

We present the performance of our zero-shot method (EffoVPR-ZS) in Table 1, compared to two zero-shot alternatives: the recently published AnyLoc, and the DINOv2 global feature (the output [CLS] token) without fine-tuning. EffoVPR-ZS re-ranks its top-100 globally retrieved candidates. The results show that our method significantly improves over the DINOv2 baseline and is superior to AnyLoc. Note the significant gap in the more challenging scenarios of Tokyo24/7 and Nordland, which exhibit day vs. night and seasonal variations. AnyLoc tends to fail in these challenging scenarios because its VLAD aggregation, learned (in an unsupervised manner) on the gallery, cannot generalize well to challenging out-of-distribution queries. In Figure 3(a) we compare our zero-shot approach with several methods that used VPR datasets for training. Although EffoVPR-ZS was not trained on the VPR task, it still achieves results comparable to the trained methods on three popular datasets. This success can be attributed to the robust features in DINOv2, specifically those selected from the $\mathcal{V}$ facet, combined with our mutual-NN matching and scoring. Figure 3(b) demonstrates this with a scenario in which the original attention mistakenly focuses on an advertisement placed in front of a building. Our method nevertheless identifies relevant key-points on the building itself, enabling correct image matching (even though the gallery image shows a different ad).

4.2 Comparison with State-of-The-Art

In this section, we compare our single-stage (EffoVPR-G) and two-stage (EffoVPR-R) methods with the previous state of the art, including the recent works of [34, 22, 33]. SelaVPR and R2Former were trained on a combination of Pitts30k and MSLS, CricaVPR and SALAD were trained on GSV-Cities, and CosPlace and EigenPlaces were trained on SF-XL (as is ours). Table 2 shows the global retrieval results (without re-ranking) in terms of Recall@1 on five different benchmarks. EffoVPR-G achieves SoTA performance on three out of five datasets and ranks second on the other two. This highlights the effectiveness of the single global representation learned by our method. Notably, it gains $+2.9\%$ on Tokyo24/7, $+2.8\%$ on the challenging Nordland dataset, which exhibits extreme seasonal changes, and $+3.2\%$ on the hold-out MSLS-challenge dataset. We further evaluate our global feature learning with reduced dimensions of 256D and even 128D, which significantly decreases the memory footprint and enables efficient search within a considerably larger gallery. The findings indicate only a marginal degradation in performance with lower-dimensional features, while the 128D feature achieves parity with SALAD on Tokyo24/7 despite SALAD's 8,448D feature size (a 66-fold reduction in dimensionality). Figure 1 illustrates this quality on Tokyo24/7 and the hold-out MSLS-challenge, showing our top-performing results even with a 128D feature size. Results on more datasets can be found in the Appendix. The strong impact of our fine-tuning process, which covers the last five layers, is visualized in Figure 4.
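To put the dimensionality reduction into perspective, the following back-of-the-envelope sketch estimates the RAM needed to hold one dense descriptor per gallery image. The gallery size of one million images and the 4-byte float32 storage are assumptions chosen for illustration and do not describe any dataset in the paper.

```python
def gallery_memory_gib(num_images, dim, bytes_per_value=4):
    """Memory in GiB for one dense descriptor per gallery image."""
    return num_images * dim * bytes_per_value / 2**30


NUM_IMAGES = 1_000_000  # hypothetical gallery size, not from the paper

footprints = {dim: gallery_memory_gib(NUM_IMAGES, dim) for dim in (8448, 1024, 256, 128)}
for dim, gib in footprints.items():
    print(f"{dim:>5}D: {gib:6.2f} GiB")

# Ratio between the 8,448D and 128D feature sizes discussed in the text.
reduction = 8448 / 128
```

Since the footprint is linear in the feature dimension, the 66-fold reduction in dimensionality translates directly into a 66-fold smaller descriptor store under these assumptions.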

Table 2: Comparison with our single stage method - Recall@1 performance. Two-stage methods are marked with †, and present 1st-stage performance (for fair comparison). The best results are highlighted in bold and the second is underlined. We present results from EffoVPR with three different feature dimensions.

Method	Dim	Pitts30k	Tokyo24/7	MSLS-val	MSLS-chall.	Nordland
CosPlace [8]	512	90.5	81.9	82.8	61.4	66.5
MixVPR [2]	4096	91.5	86.7	88.2	64.0	58.4
R2Former [49]^†	256	76.3	45.7	79.3	56.2	50.9
EigenPlaces [11]	2048	92.5	93.0	89.1	67.4	71.2
SelaVPR^† [34]	1024	90.2	81.9	87.7	69.6	72.3
CricaVPR [33]	4096	94.9^*	93.0	90.0	69.0	90.7
SALAD [22]	8448	92.4	94.6	92.2	75.0	76.0
EffoVPR-G^†	1024	94.8	97.5	90.9	78.2	93.5
EffoVPR-G^†	256	93.8	95.9	90.4	75.6	79.7
EffoVPR-G^†	128	92.6	94.6	88.2	73.8	70.4

Next, we showcase the comprehensive performance of our two-stage approach (EffoVPR-R) in Table 3. EffoVPR-R achieves top performance across all datasets, taking second place only on Pitts30k R@1, with a close result. Note that CricaVPR reports using Pitts30k as a validation set, which may have contributed to its improved results on this dataset. EffoVPR-R demonstrates notable improvements, particularly evident on Tokyo24/7, where it increases R@1 from 94.6% to 98.7%, and on MSLS-challenge (from 75.0% to 79.0%). These results underscore the generalization capability of our approach, demonstrating its resilience in handling significant variations between query and gallery images, such as viewpoint discrepancies (as seen in Pitts30k) and changes in illumination (in Tokyo24/7), across a diverse range of locations. Following common practice, we report EffoVPR-R re-ranking performance over the top-100 candidates retrieved in the first stage ($K=100$); however, we achieve SoTA R@1 results even with a low number of candidates (from $K=5$ onwards, see Tab. S4 in Appx). Note that the EffoVPR matching method is highly efficient, with an average processing time of just 1 millisecond per match.

Table 3: Comparison to state-of-the-art methods on four benchmarks. The best results are highlighted in bold and the second best are underlined. Two-stage methods are marked with †.

Method	Pitts30k			Tokyo24/7			MSLS-val			MSLS-challenge
Method	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
NetVLAD [4]	81.9	91.2	93.7	60.6	68.9	74.6	53.1	66.5	71.1	35.1	47.4	51.7
SFRS [20]	89.4	94.7	95.9	81.0	88.3	92.4	69.2	80.3	83.1	41.6	52.0	56.3
Patch-NetVLAD^† [21]	88.7	94.5	95.9	86.0	88.6	90.5	79.5	86.2	87.7	48.1	57.6	60.5
CosPlace [8]	88.4	94.5	95.7	81.9	90.2	92.7	82.8	89.7	92.0	61.4	72.0	76.6
MixVPR [2]	91.5	95.5	96.4	86.7	92.1	94.0	88.2	93.1	94.3	64.0	75.9	80.6
R2Former^† [49]	91.1	95.2	96.3	88.6	91.4	91.7	89.7	95.0	96.2	73.0	85.9	88.8
EigenPlaces [11]	92.5	96.8	97.6	93.0	96.2	97.5	89.1	93.8	95.0	67.4	77.1	81.7
SelaVPR^† [34]	92.8	96.8	97.7	94.0	96.8	97.5	90.8	96.4	97.2	73.5	87.5	90.6
CricaVPR [33]	94.9	97.3	98.2	93.0	97.5	98.1	90.0	95.4	96.4	69.0	82.1	85.7
SALAD [22]	92.4	96.3	97.4	94.6	97.5	97.8	92.2	96.2	97.0	75.0	88.8	91.3
EffoVPR-R^†	93.9	97.4	98.5	98.7	98.7	98.7	92.8	97.2	97.4	79.0	89.0	91.6

In Figure 5, we present a failure case of our zero-shot approach, which is resolved after fine-tuning. In this instance, both the query and the gallery image contain a visually identical vehicle (an SF cable car), which leads to incorrect matching. Such instances are rare in general, since pedestrians or vehicles appearing in different images are usually not identical, but this example highlights a limitation of our zero-shot approach.

Finally, we demonstrate the efficacy of our approach in the most challenging VPR benchmark scenarios by conducting experiments on six demanding datasets: Nordland [41], which includes extensive seasonal changes; AmsterTime [48], spanning an extended time period; SF-Occlusion [6], which features queries with significant field-of-view obstructions; SF-Night [6], with severe illumination changes; and SVOX-Night and SVOX-Rain [12], with extreme illumination and weather variations. The results, detailed in Table 4, underscore the significant superiority of our method over previous approaches across these datasets. EffoVPR-R shows improvements of +4.3%, +0.8%, +7.9%, +15%, and +2% on Nordland, AmsterTime, SF-Occlusion, SF-Night, and SVOX-Night, respectively, and comparable results on SVOX-Rain. This demonstrates the high versatility of our model, which can handle extreme variations even when trained without seasonal or day-to-night changes. Figure 1 shows an example of this case. We attribute this robustness primarily to the combination of our training method and specific re-ranking strategy over the DINOv2 model. We conduct an extensive ablation study on various hyperparameters and aspects of our approach in the Appendix.

Table 4: Comparison (R@1) to SoTA methods on more challenging datasets.

Method	Nordland	AmsterTime	SF-XL Occlusion	SF-XL Night	SVOX Night	SVOX Rain
EigenPlaces [11]	71.2	48.9	32.9	23.6	58.9	90.0
SelaVPR [34]	85.2	55.2	35.5	38.4	89.4	94.7
CricaVPR [33]	90.7	64.7	42.1	35.4	85.1	95.0
SALAD [22]	76.0	58.8	51.3	46.6	95.4	98.5
EffoVPR-R	95.0	65.5	59.2	61.6	97.4	98.3

Figure 6 qualitatively highlights the superior performance of our method. While other methods fail in challenging scenarios, such as viewpoint changes, seasonal variations, illumination differences, and severe occlusions, EffoVPR demonstrates high robustness against these challenges.

5 Summary and Limitations

In this paper, we introduced a single and two-stage approach for VPR, that effectively leverages a foundation model. Our method utilizes existing internal self-attention and pooling mechanisms to propose a new approach that achieves high performance even in a zero-shot setting.

We observed that, despite the success of our zero-shot approach, it does not grasp the relevance of certain objects in the scene for localization. This limitation is highlighted in Figure 5, where the iconic and visually identical cable car in SF causes distraction. Fine-tuning the model resolves this problem by shifting the attention from transient foreground objects to static VPR-relevant cues, as seen in Figure 4. Moreover, while our second-stage matching approach proves highly effective, we forgo geometric verification for the sake of speed. Integrating such an approach in the future may offer further refinement and performance enhancement.

The experimental results demonstrated that our trained model outperforms previous SoTA often by a large margin, particularly in demanding scenarios that exhibit strong appearance change. Having compact features, our method provides a promising way to address the VPR task in real-world, large-scale applications.

References

[1] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. GSV-Cities: Toward appropriate supervised visual place recognition. Neurocomputing, 513:194–203, 2022.
[2] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. MixVPR: feature mixing for visual place recognition. In WACV, pages 2998–3007, 2023.
[3] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep ViT features as dense visual descriptors. ECCVW, 2022.
[4] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5297–5307, 2016.
[5] Artem Babenko and Victor Lempitsky. Aggregating local deep features for image retrieval. In Proceedings of the IEEE international conference on computer vision, pages 1269–1277, 2015.
[6] Giovanni Barbarani, Mohamad Mostafa, Hajali Bayramov, Gabriele Trivigno, Gabriele Berton, Carlo Masone, and Barbara Caputo. Are local features all you need for cross-domain visual place recognition? In CVPRW, pages 6155–6165, June 2023.
[7] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In ECCV, pages 404–417. Springer, 2006.
[8] Gabriele Berton, Carlo Masone, and Barbara Caputo. Rethinking visual geo-localization for large-scale applications. In CVPR, pages 4878–4888, June 2022.
[9] Gabriele Berton, Riccardo Mereu, Gabriele Trivigno, Carlo Masone, Gabriela Csurka, Torsten Sattler, and Barbara Caputo. Deep visual geo-localization benchmark. In CVPR, pages 5396–5407, 2022.
[10] Gabriele Berton, Riccardo Mereu, Gabriele Trivigno, Carlo Masone, Gabriela Csurka, Torsten Sattler, and Barbara Caputo. Deep visual geo-localization benchmark. In CVPR, June 2022.
[11] Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone. Eigenplaces: Training viewpoint robust models for visual place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11080–11090, October 2023.
[12] Gabriele Moreno Berton, Valerio Paolicelli, Carlo Masone, and Barbara Caputo. Adaptive-attentive geolocalization from few queries: A hybrid approach. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2918–2927, 2021.
[13] David M Chen, Georges Baatz, Kevin Köser, Sam S Tsai, Ramakrishna Vedantham, Timo Pylvänäinen, Kimmo Roimela, Xin Chen, Jeff Bach, Marc Pollefeys, et al. City-scale landmark identification on mobile devices. In CVPR 2011, pages 737–744. IEEE, 2011.
[14] Zetao Chen, Lingqiao Liu, Inkyu Sa, Zongyuan Ge, and Margarita Chli. Learning context flexible attention model for long-term visual place recognition. IEEE Robotics and Automation Letters, 3(4):4015–4022, 2018.
[15] Mark Cummins and Paul Newman. Highly scalable appearance-only slam-fab-map 2.0. In Robotics: Science and systems, volume 5, page 17. Seattle, USA, 2009.
[16] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers, 2023.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
[18] Sourav Garg and Michael Milford. Seqnet: Learning descriptors for sequence-based hierarchical place recognition. IEEE Robotics and Automation Letters, 6(3):4305–4312, 2021.
[19] Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 369–386. Springer, 2020.
[20] Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. In ECCV, pages 369–386. Springer, 2020.
[21] Stephen Hausler, Sourav Garg, Ming Xu, Michael Milford, and Tobias Fischer. Patch-NetVLAD: Multi-scale fusion of locally-global descriptors for place recognition. In CVPR, pages 14141–14152, 2021.
[22] Sergio Izquierdo and Javier Civera. Optimal transport aggregation for visual place recognition. In CVPR, June 2024.
[23] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In CVPR, pages 3304–3311. IEEE, 2010.
[24] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Marina Meila and Tong Zhang, editors, ICML, volume 139 of Proceedings of Machine Learning Research, pages 4904–4916. PMLR, 2021.
[25] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547, 2021.
[26] Hyo Jin Kim, Enrique Dunn, and Jan-Michael Frahm. Learned contextual feature reweighting for image geo-localization. In CVPR, pages 2136–2145, 2017.
[27] Nikhil Keetha, Avneesh Mishra, Jay Karhade, Krishna Murthy Jatavallabhula, Sebastian Scherer, Madhava Krishna, and Sourav Garg. AnyLoc: towards universal visual place recognition. IEEE Robotics and Automation Letters, 2023.
[28] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In CVPR, pages 4015–4026, 2023.
[29] Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. Chatting makes perfect: Chat-based image retrieval. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 61437–61449. Curran Associates, Inc., 2023.
[30] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML, pages 12888–12900, 2022.
[31] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models. In ICCV, pages 2105–2114, 2021.
[32] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004.
[33] Feng Lu, Xiangyuan Lan, Lijun Zhang, Dongmei Jiang, Yaowei Wang, and Chun Yuan. CricaVPR: Cross-image correlation-aware representation learning for visual place recognition. In CVPR, June 2024.
[34] Feng Lu, Lijun Zhang, Xiangyuan Lan, Shuting Dong, Yaowei Wang, and Chun Yuan. Towards seamless adaptation of pre-trained models for visual place recognition. In ICLR, 2024.
[35] Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The oxford robotcar dataset. The International Journal of Robotics Research, 36(1):3–15, 2017.
[36] Michael J Milford and Gordon F Wyeth. Mapping a suburb with a single camera using a biologically inspired slam system. IEEE Transactions on Robotics, 24(5):1038–1053, 2008.
[37] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: learning robust visual features without supervision. CoRR, abs/2304.07193, 2023.
[38] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence, 41(7):1655–1668, 2018.
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Marina Meila and Tong Zhang, editors, ICML, 2021.
[40] Shihao Shao, Kaifeng Chen, Arjun Karpur, Qinghua Cui, André Araujo, and Bingyi Cao. Global features are all you need for image retrieval and reranking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11036–11046, 2023.
[41] Niko Sünderhauf, Peer Neubert, and Peter Protzel. Are we there yet? challenging seqslam on a 3000 km journey across all four seasons. In Proc. of workshop on long-term autonomy, IEEE international conference on robotics and automation (ICRA), page 2013. Citeseer, 2013.
[42] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of cnn activations. arXiv preprint arXiv:1511.05879, 2015.
[43] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1808–1817, 2015.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
[45] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018.
[46] Ruotong Wang, Yanqing Shen, Weiliang Zuo, Sanping Zhou, and Nanning Zheng. TransVPR: transformer-based place recognition with multi-level attention aggregation. In CVPR, pages 13648–13657, 2022.
[47] Frederik Warburg, Soren Hauberg, Manuel Lopez-Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. In CVPR, pages 2626–2635, 2020.
[48] Burak Yildiz, Seyran Khademi, Ronald Maria Siebes, and Jan Van Gemert. Amstertime: A visual place recognition benchmark dataset for severe domain shift. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 2749–2755. IEEE, 2022.
[49] Sijie Zhu, Linjie Yang, Chen Chen, Mubarak Shah, Xiaohui Shen, and Heng Wang. $R^{2}$ Former: unified retrieval and reranking transformer for place recognition. In CVPR, pages 19370–19380, 2023.

Appendix

Appendix A Datasets

We evaluate EffoVPR across a large number of datasets to underscore its top performance in varied scenarios and cities. Following prior work, we use the open-source code of [10] for downloading and organizing the datasets, to ensure maximum reproducibility. Below we briefly describe each dataset.

A.1 Datasets Summary

AmsterTime [48] consists of $1,231$ image pairs from Amsterdam, Holland, exhibiting long-term changes. The queries are historical grayscale images, and for each query there is a modern-day reference photo of the same place. The pairs were curated by human experts and pose multiple challenges: different viewpoints and cameras, color vs. grayscale, and long-term changes.

Eynsham [15] is a collection of images from a car-mounted street-view camera that captured the same route through the Oxford countryside twice. The grayscale images are divided into $23,935$ queries and $23,935$ gallery images.

Mapillary Street-Level Sequences (MSLS) [47] is an image- and sequence-based VPR dataset. It consists of more than 1.6M geo-tagged images collected over seven years from 30 cities, in urban, suburban, and natural environments. There are three non-overlapping subsets: a training set, a validation set (MSLS-val), and a withheld test set (MSLS-challenge). MSLS-val and MSLS-challenge pose various challenges, including viewpoint variations, long-term changes, and illumination and seasonal changes.

Nordland [41] was collected by a camera mounted on top of a moving train in the Norwegian countryside, presenting rural and natural scenes. The data were collected over the same route across four seasons, providing seasonal and illumination variability. Following [10, 41], we use the post-processed versions of winter as queries and summer as the database, and count a localization as correct when the retrieved image is fewer than 10 frames away. This dataset consists of $27,592$ query images and $27,592$ gallery images.
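
The Nordland correctness criterion described above can be made concrete with a short sketch. It assumes that query and gallery frames are aligned, so each query has a single ground-truth gallery frame index. The function name `recall_at_n`, the strict "fewer than 10 frames" comparison, and the toy data are illustrative assumptions, not taken from the benchmark code of [10].

```python
def recall_at_n(predictions, ground_truth_frames, n, tolerance=10):
    """Fraction of queries with a correct gallery frame among the top-n.

    predictions[i] is the ranked list of gallery frame indices for query i;
    ground_truth_frames[i] is the gallery frame aligned with query i.
    A retrieved frame counts as correct when it is fewer than `tolerance`
    frames away from the ground truth (an assumed reading of the criterion).
    """
    hits = 0
    for ranked, gt in zip(predictions, ground_truth_frames):
        if any(abs(frame - gt) < tolerance for frame in ranked[:n]):
            hits += 1
    return hits / len(predictions)


# Toy example: three queries aligned with gallery frames 100, 200 and 300.
predictions = [[104, 500], [230, 195], [900, 901]]
ground_truth = [100, 200, 300]
print(recall_at_n(predictions, ground_truth, n=1))
print(recall_at_n(predictions, ground_truth, n=2))
```

In the toy example only the first query is localized at rank 1, while the second is recovered at rank 2, so the recall grows from one third to two thirds.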

Pittsburgh30k [4] is collected from Google Street View 360° panoramas of downtown Pittsburgh, split into multiple images. Ensuring that queries and gallery were taken in different years, it provides three splits: training, validation and test. Pitts30k-test consists of 10k gallery images and $6,816$ queries. Pitts250k consists of $8,280$ queries, including those of Pitts30k, and its gallery size is $83,952$.

San Francisco Landmark (SF-R) [13] is a dataset from downtown San Francisco that provides viewpoint variations. It comprises $598$ smartphone-camera queries and a gallery of $1,046,587$ images.

San Francisco eXtra Large (SF-XL) [8, 6] is an enormous dataset covering the whole city of San Francisco. It consists of a training set, which also includes raw 360° panoramas, a small validation set of $7,983$ queries and $8,015$ gallery images, and a test gallery of $2,805,840$ images.
There are four sets of queries:
SF-XL-v1 [8] consists of $1,000$ queries curated from Flickr, and provides viewpoint and camera variations, illumination changes and even some occlusions.
SF-XL-v2 [8] uses the queries of San Francisco Landmark (SF-R).
SF-XL-Night [6] is a collection of $466$ Flickr images of night scenes from San Francisco. It provides viewpoint variations and very challenging illumination changes.
SF-XL-Occlusion [6] is a collection of $76$ Flickr images from the city of San Francisco that suffer from severe occlusions, mostly by vehicles and crowds.

SPED [14] is a collection of surveillance-camera images consisting of $607$ query-gallery pairs captured across time. It provides challenging viewpoints with seasonal and illumination changes.

St Lucia [36] is a collection of nine videos from a car-mounted camera in the St Lucia suburb of Brisbane. Following the open-source code of [10], we select the first and last videos as queries and database, and sample one frame every 5 meters of driving. The gallery consists of $1,549$ images and there are $1,464$ query images.

SVOX [12] is a dataset that poses a VPR challenge across multiple weather conditions. It consists of $17,166$ gallery images of the city of Oxford. The queries were collected from the Oxford RobotCar dataset [35] and are organized into one query set per condition: night (823 queries), overcast (872 queries), rainy (937 queries), snowy (870 queries) and sunny (854 queries).

Tokyo 24/7 [43] is a dataset from downtown Tokyo, which provides viewpoint changes and challenging illumination variations. It consists of a gallery of $75,984$ images, and a collection of $315$ smartphone camera queries from $185$ places. Each place is portrayed by three photos - one taken during the day, one at sunset and one at night.

A.2 Train Dataset

Following the VPR classification methods EigenPlaces and CosPlace, we train on SF-XL [8], whereas other studies [1, 22, 33, 2] train on GSV-Cities [1] or on combinations of Pittsburgh30k [4] and MSLS [47, 4, 34, 49], which include a large mixture of cities around the world and thus introduce higher variability. Note that, similar to EigenPlaces, our approach is designed for training on panoramas with heading information and requires slicing them into lateral and frontal views, which cannot be applied to other training datasets.

Appendix B Ablation Study

Here we conduct extensive experiments on two different datasets to ablate several key components of our EffoVPR method.

Re-ranking features: We explore various configurations for selecting features in the re-ranking stage. Our initial focus is on the choice of the layer from which features are extracted. Table S1 demonstrates that extracting features from the $n-1$ layer yields the most significant enhancement in overall performance. Generally, employing re-ranking with any layer except the last improves results compared to omitting re-ranking entirely (i.e., relying solely on the global feature from the first stage). Subsequently, upon extracting the Q, K, V components from the chosen layer, we find that the Value set ( $\mathcal{V}$ ) provides the most effective local features for re-ranking, as detailed in Table S3. In Table S2 we ablate the impact of our two thresholds, the Attention Map threshold $T_{1}$ and the threshold $T_{2}$ on the score of the counted local feature matches, showing their necessity.

Table S1: Ablation study on the choice of the layer for the re-ranking stage. We find the $n-1$ layer to be the optimal for re-ranking feature extraction. Notably, the last layer $n$ is ineffective and downgrades global performance. Results (in %) are the R@1.

Dataset	Global	n-5	n-4	n-3	n-2	n-1	n
MSLS-val	90.9	90.3	90.9	92.0	92.3	92.8	88.2
Tokyo-24/7	97.5	98.1	98.1	98.1	98.7	98.7	97.1

Table S2: Ablation study on the impact of the thresholds. Results (in %) are the R@1. $T_{1}$ is the Attention Map threshold and $T_{2}$ is the threshold on the countable local features matching score.

	Tokyo24/7	MSLS-val
no thr.	95.9	86.4
$T_{1}$	97.1	91.5
$+T_{2}$	98.7	92.8

Table S3: Ablation on the choice of the local features. Results (in %) are the R@1. Query, Key and Value are respectively $Q$, $K$ and $V$ in Equation 1.

	Query	Key	Value
Tokyo24/7	96.5	96.8	98.7
MSLS-val	89.7	90.1	92.8

Number of Candidates to Re-rank: The second (re-ranking) stage is applied to the top-K candidates retrieved during the global stage. Although the common choice in the literature is $K=100$ (e.g. [34, 49, 10]), we explore different choices of K, as detailed in Table S4. We achieve SoTA results even with $K=5$. Note that in some cases increasing K can introduce more “distractor” candidates, potentially decreasing performance. However, EffoVPR's SoTA performance is consistent across all tested values of K.

Table S4: Re-ranking ablation. $K$ indicates re-ranking over the top-$K$ results. We achieve SoTA results even with $K=5$. Bold values indicate SoTA results.

Top-K	Pitts30k	Tokyo24/7	MSLS-val	Nordland	SF-XL-Occ.	SF-XL-Night	SPED
K=5	94.2	97.8	92.4	95.3	59.2	61.6	93.4
K=10	94.2	98.1	92.2	95.3	59.2	61.2	92.9
K=15	94.1	98.4	92.3	95.3	60.5	60.3	93.1
K=20	94.0	98.4	92.4	95.3	59.2	60.3	93.1
K=50	93.9	98.7	92.7	95.2	57.9	60.9	93.2
K=100	93.9	98.7	92.8	95.0	59.2	61.2	93.2
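
The role of K in the two-stage pipeline can be illustrated with a minimal sketch. It assumes that global similarities and a per-candidate local match count are already available; `rerank_top_k`, the dictionary inputs, and the toy scores are hypothetical stand-ins for illustration and are not the authors' implementation.

```python
def rerank_top_k(global_scores, local_match_count, k):
    """Two-stage ranking: global retrieval, then local re-ranking of the top-k.

    global_scores: dict mapping gallery_id -> global similarity to the query.
    local_match_count: callable mapping gallery_id -> number of local matches
        (a hypothetical stand-in for the second-stage matching score).
    Candidates beyond the top-k keep their global order.
    """
    ranked = sorted(global_scores, key=global_scores.get, reverse=True)
    head, tail = ranked[:k], ranked[k:]
    # sorted() is stable, so ties in match count keep the global order.
    head = sorted(head, key=local_match_count, reverse=True)
    return head + tail


# Toy example: "d" has the most local matches but a low global score.
global_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
matches = {"a": 3, "b": 12, "c": 40, "d": 55}
print(rerank_top_k(global_scores, matches.get, k=2))
print(rerank_top_k(global_scores, matches.get, k=4))
```

A gallery image outside the top-K can never be promoted by the second stage, so a small K relies on a strong global stage, while a large K admits more candidates, including potential distractors.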

Choice of Trainable Layers: Table S5 presents several different sets of trainable layers in our backbone model. We find that vanilla end-to-end fine-tuning of the entire model, including all layers, drastically harms the performance of EffoVPR. We attribute this decline to the fact that the DINOv2 backbone was trained on significantly larger datasets than those typically used in VPR. We further establish that training only the last five layers represents a “sweet spot”, yielding peak performance. Both increasing and decreasing the number of trainable layers from this configuration lead to lower results.

Table S5: Ablation study on the choice of trainable last layers. Results (in %) are the R@1 of the first stage. Note that the 0 column represents the zero-shot performance of DINOv2.

Dataset	0	1	2	3	4	5	6	all layers
MSLS-val	47.7	89.5	88.2	89.7	89.5	90.9	89.7	86.1
Tokyo-24/7	62.2	96.8	96.5	95.9	96.2	97.5	96.8	94.0

Appendix C Performance vs. Feature Compactness - Additional Results

In Figure S1 we present a performance comparison of our global feature (EffoVPR-G) versus feature dimensionality on additional datasets. While the current leading methods achieve their performance using large features, EffoVPR demonstrates high performance even with an extremely compact feature size.

Appendix D Visualizations

D.1 Additional Zero-Shot Visualizations

In Figure S2 we show more visualizations of the EffoVPR-ZS method. While the attention map of the pre-trained DINOv2 does not focus on discriminative VPR elements, EffoVPR is able to close the gap in the zero-shot setting through local feature matching. In the first row, the pre-trained attention map mainly focuses on temporary traffic signs and a distant advertisement and barely attends to the building; in the second row, it mainly focuses on the insignificant back of a traffic sign. Nevertheless, EffoVPR finds multiple local matches to the correct geo-tagged image in the gallery.

D.2 Additional Re-ranking Visualizations

Figure S3 shows the robustness of EffoVPR-R local feature matching in highly challenging scenes, with top-1 results. From top left to bottom right, the examples cover camera rotation, a nature scene, color variance across time (building renovation), tree matching, a challenging time-of-day change with matches on barely noticeable electric cables, and a significant night-to-day change.

Appendix E Implementation details

We use ViT-L/14 as the backbone, initialized with the pre-trained weights of DINOv2 with registers [37, 16]. We train only the last five layers of the backbone, which proved most beneficial. We employ EigenPlaces' [11] group and class partitioning with its default hyper-parameters, using both lateral and frontal views, on the publicly available SF-XL street-view panorama dataset [8]. We use an AdamW optimizer for the backbone and an Adam optimizer for the classification heads, both with a constant learning rate of $1\times 10^{-5}$. We train EffoVPR with a batch size of 16 for 25 epochs on a single NVIDIA A100 node, and otherwise follow the EigenPlaces training recipe. We choose the best epoch on the SF-XL validation set by the Recall@1 of the global ranking. Given that ViT is independent of the input image size (provided it can be divided into $14\times 14$ patches), we evaluate on images of size $504\times 504$ but train on $224\times 224$ images to expedite training. For benchmarking the global feature (EffoVPR-G), we report nearest-neighbor performance on the normalized output class token. In the re-ranking stage, we extract the $V$ self-attention facet from layer $n-1$ (with $n$ being the output layer), measure cosine similarity, filter the features by the class attention map with a threshold $\mathcal{T}_{1}=0.05$, and count only mutual nearest neighbors with a score above the threshold $\mathcal{T}_{2}=0.65$.
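
The re-ranking score described above can be sketched in a few lines. The following is a simplified illustration in plain Python rather than the authors' implementation: it assumes the local features and their class-attention weights have already been extracted, that features with attention above $\mathcal{T}_{1}$ are the ones kept, and that the filter is applied to both query and gallery images. The helper names and the toy vectors are hypothetical, and features are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def filter_by_attention(features, attention, t1=0.05):
    """Keep local features whose class-attention weight exceeds t1."""
    return [f for f, a in zip(features, attention) if a > t1]


def count_mutual_matches(query_feats, gallery_feats, t2=0.65):
    """Count mutual nearest-neighbor pairs with cosine similarity above t2."""
    if not query_feats or not gallery_feats:
        return 0
    sim = [[cosine(q, g) for g in gallery_feats] for q in query_feats]
    # Best gallery index for each query feature, and best query index
    # for each gallery feature.
    best_g = [max(range(len(gallery_feats)), key=row.__getitem__) for row in sim]
    best_q = [
        max(range(len(query_feats)), key=lambda i, j=j: sim[i][j])
        for j in range(len(gallery_feats))
    ]
    return sum(
        1 for i, j in enumerate(best_g) if best_q[j] == i and sim[i][j] > t2
    )


# Toy example: the third query feature has low attention and is dropped.
q_feats = filter_by_attention([[1, 0], [0, 1], [1, 1]], [0.5, 0.3, 0.01])
g_feats = filter_by_attention([[0.9, 0.1], [0.1, 0.9], [-1, 0]], [0.4, 0.2, 0.2])
print(count_mutual_matches(q_feats, g_feats))
```

The returned count would serve as the re-ranking score for one query-candidate pair; a practical version would batch the similarity computation on the GPU instead of looping in Python.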

E.1 Additional information

E.1.1 Zero-shot

In evaluating AnyLoc [27], we address the significant memory requirements of its VLAD pooling algorithm by implementing an online clustering scheme. We observed that their recommended layer 31 outperformed layer 23. In our zero-shot evaluation, we assess the EffoVPR-ZS method by extracting the $V$ features from layer $n-2$ to re-rank the top-100 candidates retrieved with the first-stage global [CLS] feature. In this framework, our performance is bounded by the first-stage Recall@100, which reaches $99.2\%$, $96.8\%$, $81.5\%$, and $78.1\%$ on Pitts30k, Tokyo24/7, MSLS-val, and Nordland, respectively.

E.1.2 Benchmarking

Generally, for consistent benchmarking, we adhere to [10]. In addition, we report the results of other methods in accordance with the evaluation choices of SelaVPR[34] and CricaVPR[33], including the specific versions of trained models utilized. For the recent state-of-the-art methods SelaVPR, CricaVPR, and SALAD, we provide results from the original publications whenever available. When such results are not directly available, we utilize their code and published weights. Specifically for SelaVPR, which has two sets of weights (trained on Pitts30k and MSLS), we report the best-performing for each dataset.

E.2 Other

We evaluate EffoVPR matching runtime by averaging matching function runtime on Tokyo 24/7.

Appendix F Additional Quantitative Results

To ensure comprehensiveness, the following Table S6 presents the complete results for datasets that were only partially presented in the main paper, as well as for some datasets that were previously omitted. Our method, EffoVPR, demonstrates SoTA performance on the majority of these datasets, and remains competitive with the SoTA on others.

Table S6: Comparison to SoTA on more datasets. Results (in %) are R@1, R@5 and R@10.

	SPED			SF-R			SF-XL-v1			SF-XL-v2			SF-XL-Occ.
Method	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
EigenPlaces	70.2	83.5	87.5	89.6	94.3	95.3	84.1	89.1	90.7	90.8	95.7	96.7	32.9	48.7	52.6
SelaVPR	88.6	95.1	97.2	88.5	92.0	93.0	74.9	80.7	82.1	89.3	95.7	96.3	35.5	47.4	55.3
CricaVPR	91.3	95.2	96.2	88.6	94.0	95.7	80.6	87.6	89.8	90.6	96.3	97.7	42.1	52.6	57.9
SALAD	92.1	96.2	96.5	92.3	95.7	96.8	88.6	93.5	94.4	94.8	97.3	98.3	51.3	65.8	68.4
EffoVPR	93.1	97.9	98.4	93.0	96.0	96.3	95.5	98.1	98.3	94.5	97.8	98.2	59.2	68.4	73.7

	SF-XL-Night			AmsterTime			SVOX			SVOX Night			SVOX Overcast
Method	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
EigenPlaces	23.6	30.7	34.5	48.8	69.5	76.0	98.0	99.0	99.2	58.9	76.9	82.6	93.1	97.8	98.3
SelaVPR	38.4	50.9	55.4	55.2	72.6	78.0	97.2	98.7	99.0	89.4	95.5	96.6	97.0	99.1	99.3
CricaVPR	35.4	48.3	53.4	64.7	82.5	87.9	97.8	99.2	99.3	86.3	95.3	96.6	96.7	99.0	99.0
SALAD	46.6	59.0	62.2	58.8	78.9	84.2	98.2	99.3	99.4	95.4	99.3	99.4	98.3	99.3	99.3
EffoVPR	61.6	73.4	77.0	65.5	87.2	90.7	98.7	99.5	99.6	97.4	99.5	99.5	98.4	99.3	99.7

	SVOX Rain			SVOX Snow			SVOX Sun			St Lucia			Eynsham
Method	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
EigenPlaces	90.0	96.4	98.0	93.1	97.6	98.2	86.4	95.0	96.4	99.6	99.9	100.0	90.7	94.4	95.4
SelaVPR	94.7	98.5	99.1	97.0	99.5	99.5	90.2	96.6	97.4	99.8	100.0	100.0	90.6	95.3	96.2
CricaVPR	94.8	98.5	98.7	96.0	99.2	99.2	93.8	98.1	98.8	99.9	99.9	99.9	91.6	95.0	95.8
SALAD	98.5	99.7	99.9	98.9	99.7	99.8	97.2	99.4	99.7	100.0	100.0	100.0	91.6	95.1	95.9
EffoVPR	98.3	99.6	99.6	98.7	99.7	99.7	97.7	99.3	99.4	100.0	100.0	100.0	91.0	95.2	96.3