EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition

Issar Tzachor1 Boaz Lerner1 Matan Levy2 Michael Green1 Tal Berkovitz Shalev1 Gavriel Habib1 Dvir Samuel3 Noam Korngut Zailer1 Or Shimshi1 Nir Darshan1 Rami Ben-Ari1   
1OriginAI, Israel
2The Hebrew University of Jerusalem, Israel
3Bar-Ilan University, Israel
[email protected]
Abstract

The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on task-specific data. In this paper, we propose a simple yet powerful approach to better exploit the potential of a foundation model for VPR. We first demonstrate that features extracted from self-attention layers can serve as a powerful re-ranker for VPR. Utilizing these features in a zero-shot manner, our method surpasses previous zero-shot methods and achieves competitive results compared to supervised methods across multiple datasets. Subsequently, we demonstrate that a single-stage method leveraging internal ViT layers for pooling can generate global features that achieve state-of-the-art results, even when reduced to a dimensionality as low as 128D. Incorporating our local foundation features for re-ranking widens this gap further. Our approach also demonstrates remarkable robustness and generalization, achieving state-of-the-art results by a significant margin in challenging scenarios involving occlusion, day-night variations, and seasonal changes.

1 Introduction

The task of Visual Place Recognition (VPR), also known as Geo-Localization, aims to predict the place where a photo was taken relying solely on the visual information in the image. This is typically done by an image retrieval approach [4, 21, 34, 11, 33, 27] that uses a database of geo-tagged images, often referred to as the gallery. Real-world data, including tagged VPR datasets, rely on two major sources of images: 1) car street-view imagery and 2) people’s personal cameras (commonly mobile phones) [4, 43, 47, 8, 6]. As a result, images contain natural objects that are irrelevant and sometimes misleading for the VPR task. The top row of Figure 1 shows an example, where people, vehicles, daylight, or camera angles might differ between images.

Modern models commonly use a deep neural network to extract a so-called global feature (a.k.a. descriptor) for the query and gallery images [29, 24, 39, 30, 31]. This single-stage approach is then followed by a nearest neighbor search in the feature space to retrieve the matching candidates from the gallery. In practice, the global features of the entire gallery need to be loaded into RAM to enable fast retrieval. In large-scale and real-world scenarios, reducing the memory footprint of each image is crucial to ensure real-time applicability. Therefore, several works promote compact features to achieve both accuracy and applicability [49, 34, 8]. A popular strategy to improve accuracy is a two-stage search, in which a subsequent similarity search re-ranks the top-k results retrieved in the first stage. This re-ranking is commonly performed by matching local key-points and their corresponding descriptors.
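The single-stage retrieval described above can be sketched as a brute-force cosine-similarity search over global descriptors. This is a minimal illustration only: the function name `top_k_retrieval`, the random data, and the exhaustive search are assumptions made for the example, and a large gallery would more likely use an approximate nearest-neighbor index.

```python
import numpy as np

def top_k_retrieval(query_feat, gallery_feats, k):
    """Return the indices and similarities of the k gallery descriptors closest to the query."""
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                                   # shape: (num_gallery,)
    # Sort by descending similarity and keep the first k entries.
    order = np.argsort(-sims, kind="stable")[:k]
    return order, sims[order]

# Hypothetical data: 1000 gallery images with 128-D descriptors.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 128))
query = gallery[42] + 0.01 * rng.normal(size=128)  # slightly perturbed copy of image 42
idx, scores = top_k_retrieval(query, gallery, k=5)
```

The returned top-k list is what a second, re-ranking stage would then reorder.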

Following best practices in computer vision, VPR methods are often initialized with ImageNet pre-trained weights (e.g. [49, 9]) and then fine-tuned on VPR datasets, e.g. MSLS [47], or are trained from scratch [46, 21, 4]. Recent advancements in VPR exploit the capabilities of foundation models [27, 22, 34, 33] such as DINOv2 [37], a transformer-based [44] model trained by self-supervised learning on a vast amount of data. Recent approaches [34, 33, 22] argue that using vanilla DINOv2 (without fine-tuning) is ineffective. They criticize this approach for capturing dynamic and irrelevant elements (pedestrians or vehicles), thereby diverting attention from crucial VPR features (buildings or scene layout). Some methods address this issue by integrating learned adaptors into the foundation model architecture [34, 33] or by changing the standard fine-tuning scheme with a unique pooling mechanism [22], in order to facilitate the adaptation of the foundation model to the VPR task. In contrast, AnyLoc [27] suggested using DINOv2 as a general-purpose feature representation in a zero-shot manner for various localization tasks. However, AnyLoc achieves its best performance with extremely high-dimensional features (of size 49K), which raises scalability challenges; even then it demonstrates only modest performance and struggles in cases where queries exhibit large appearance changes. In this paper, we utilize the internal ViT features to first surpass previous zero-shot approaches and achieve competitive results compared to several trained VPR methods.

Typically, VPR methods [22, 34, 33, 11, 49, 8] adhere to a conventional approach of first extracting local features from a pretrained model and then using pooling methods such as GeM [38] or NetVLAD [4] to obtain a global feature for each image. Methods that use VLAD [23], such as AnyLoc [27], or NetVLAD [4] for aggregating local features necessitate learning a dictionary for each specific gallery. These models tend to "overfit" to the particular gallery distribution, which hampers their generalization. Additionally, they often require fine-grained clustering, leading to high-dimensional feature vectors. Moreover, a recent study [40] highlighted GeM’s hyper-parameter sensitivity, spatial information loss, and convergence complications. As an alternative, we suggest utilizing the internal ViT self-attention mechanism and training the class [CLS] token with a classification loss. This approach allows implicit aggregation from local features to a global representation and eliminates the need for the external special components and learned features used previously, which often yield an excessively large feature representation [22, 33, 34] (see Figure 1).

Figure 1: EffoVPR showcase. Top row: we show a challenging query image (a) and EffoVPR’s top-1 candidate (b), retrieved from a gallery of 2.8M geo-tagged images of SF-Occ. EffoVPR demonstrates high capability in handling transitional objects and large obstructions. Bottom row: We present Recall@1 performance of EffoVPR using global features against feature dimensionality. While the current leading methods achieve their performance using large features, EffoVPR demonstrates top performance even with an extremely compact feature size.

We hereby present an Effective foundation based VPR method called EffoVPR, which achieves superior single-stage performance without the need for additional components. Unlike more intricate DINOv2-based methods with built-in adapter components [34, 33] or special aggregation techniques [22], EffoVPR delivers State-of-The-Art (SoTA) performance across multiple datasets while maintaining compactness, as shown in Figure 1. Our two-stage approach builds upon local feature matching that originates from the internal ViT attention maps, further expanding our performance gap. Additionally, EffoVPR demonstrates strong generalization and robustness across various cities and landscapes, effectively handling challenges such as occlusions, time differences, and seasonal changes, as demonstrated in Figure 1, which illustrates significant occlusion.

To summarize, our contributions in this work include:

  1. We present a method that effectively leverages a foundation model for the VPR task by exploiting its intermediate attention layers. Building on this finding, we propose a zero-shot method that outperforms previous zero-shot approaches and, on a few datasets, shows results comparable even to trained VPR methods.

  2. Our work introduces a novel foundation model based VPR approach that eliminates the need for external aggregation methods or specialized pooling layers (such as NetVLAD or GeM).

  3. We suggest an effective yet simple re-ranking process based on ViT internal attention layers that significantly boosts performance.

  4. Our method achieves state-of-the-art performance across various VPR benchmarks and generalizes to challenging scenarios such as occlusions, day-night variations, or seasonal changes.

2 Related work

Traditional approaches to Visual Place Recognition (VPR) were single-stage, matching queries to gallery images through local image features such as SIFT [32], SURF [7], or RootSIFT [25]. Two-stage methods [21, 46, 49, 34] entail an initial ranking based on global representation similarity, followed in the second stage by re-ranking the top-K retrieved candidates using local features.

The first deep learning approaches to the VPR task used CNNs [4, 38, 26], which were later replaced with Vision Transformers (ViT) [46, 49, 27, 34, 33, 22]. TransVPR [46] and R2Former [49] applied transformers to VPR and adopted a two-stage approach that includes re-ranking. However, training from scratch or from ImageNet initialization, together with the lack of effective view variability in training (in contrast to [11, 33]), has restricted their performance.

A handful of visual foundation models (VFMs) have recently emerged as backbones for numerous tasks. VFMs such as CLIP [39], DINOv2 [37], and SAM [28] use ViTs trained with distinct objectives, exhibiting unique characteristics for various downstream applications, often even in a zero-shot setting (without requiring any fine-tuning). Among zero-shot applications, AnyLoc [27] proposed a single-stage solution for VPR that leverages a pre-trained DINOv2 model alongside dense local (patch-level) features. By employing VLAD for feature pooling across multiple layers, AnyLoc produces an extremely large global representation of $\sim$49K dimensions, subsequently reduced to 512D through PCA whitening, but at the expense of lower performance. Moreover, AnyLoc’s VLAD aggregation tends to generalize poorly to queries with large time gaps, such as day vs. night or a different season. In this paper, we suggest a two-stage zero-shot method based on an initial global ranking followed by re-ranking, which matches local descriptors extracted from the ViT self-attention layers.

On the contrary, SALAD [22] proposed fine-tuning the pre-trained DINOv2 model to improve its performance on the VPR task. Their single-stage approach pools local features from the output layer of DINOv2, replacing NetVLAD’s soft assignment to clusters with an optimal transport formulation. Notably, their best results are achieved with large features exceeding 8K dimensions, which adversely impacts memory consumption.

Using DINOv2 as their backbone, SelaVPR [34] and CricaVPR [33] took a different path and avoided fine-tuning the model, arguing that it causes catastrophic forgetting and performance deterioration. Instead, they propose integrating trainable adapters into the DINOv2 architecture. SelaVPR suggested a two-stage methodology that trains global and local features on two separate branches built on top of DINOv2’s output tokens, while discarding the standard [CLS] token. Its re-ranking strategy relies on mutual nearest neighbors (mutual-NN) computed over newly learned patch-level (local) features. CricaVPR instead learns a feature pyramid on top of DINOv2’s output tokens to obtain a viewpoint-robust encoder. Both SelaVPR and CricaVPR incorporate adapters in the DINOv2 architecture, use GeM pooling for feature aggregation, and are trained with a contrastive loss.

Training a VPR model on data that is often sourced from street-view images introduces a significant challenge in selecting appropriate positive and hard negative images [4, 47, 49, 8]. EigenPlaces [11] introduced a novel training paradigm that organizes the training data into classes, with each class containing multiple viewpoints of the same scene. By employing a classification loss rather than the conventional contrastive loss (used in [34, 22, 33]), the resulting model demonstrated high resilience to varying viewpoints at test time. Hence, we adopt a similar strategy for our training process.

Current VPR methods predominantly derive a global representation by aggregating local features obtained from the output layer of a trained backbone. This aggregation is often conducted using external learnable pooling techniques, such as VLAD (Vector of Locally Aggregated Descriptors) [23], SPoC (Sum-Pooled Convolutional features) [5], RMAC (Regional Maximum Activation of Convolutions) [42], the well-known NetVLAD [4], or the now widely used GeM (Generalized Mean) pooling layer [38, 40], as utilized in various studies [12, 18, 19, 47, 21, 22, 34, 33]. However, NetVLAD is encumbered by high computational cost and high dimensionality (with $\sim$32K-dimensional features) [8], while GeM encounters convergence issues and requires hyperparameter tuning [40]. In contrast, we suggest forgoing the explicit, external aggregation approach and replacing it with training the single [CLS] token in ViT with a classification loss.

In summary, this paper revisits the potential of the DINOv2 foundation model for the VPR task, considering a two-stage zero-shot model as well as single-stage and two-stage fine-tuned models. Unlike previous approaches, we propose leveraging the existing internal aggregation layers in DINOv2 without any additional components or adaptors. Notably, our approach achieves SoTA results with compact features, a crucial consideration given realistic memory constraints.

3 Method

For the VPR task, we explore ViT [17] feature maps as local patch descriptors. In a ViT architecture, an image is split into $p$ non-overlapping patches, which are processed into tokens by linearly projecting each patch to a $d$-dimensional space and adding learned positional embeddings. An additional [CLS] token is inserted to capture global image properties. The set of tokens is then passed through $L$ transformer encoder layers, each consisting of normalization layers, Multihead Self-Attention modules, and MLP blocks. In ViT, each patch is directly associated with a set of features, a Key, Query, and Value, that can be used as patch descriptors. We utilize the self-attention matrices at layer $l$, $\{K_l, Q_l, V_l\} \in \mathbb{R}^{p \times d}$, with $p$ indicating the number of patches, resulting in $p+1$ tokens (the number of patches in the image plus one added global token), and $d$ standing for the embedding dimension. The self-attention function at layer $l$ is then given by:

$$\text{Attention}(Q_l, K_l, V_l) = \text{Softmax}\left(\frac{Q_l K_l^T}{\sqrt{d}}\right) V_l \qquad (1)$$

We indicate the key feature of the [CLS] token at layer $l$ by $k_{cls} \in K_l$. Each image patch therefore is directly associated with the set of features $\{q_i^l, k_i^l, v_i^l\}_{i \in [p+1]}$, including its query, key and value at each layer $l \in [1, L]$, respectively.
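To make the notation concrete, the sketch below computes queries, keys, and values for one layer and applies standard scaled dot-product attention. It is a simplified illustration under stated assumptions: tokens are stored as rows with [CLS] in row 0, a single attention head is used (a real ViT layer is multi-head), and the projection matrices `w_q`, `w_k`, `w_v` are random placeholders rather than DINOv2 weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(tokens, w_q, w_k, w_v):
    """tokens: (p + 1, d) with row 0 being [CLS]; w_*: (d, d) projection matrices."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (p + 1, p + 1), each row sums to 1
    return attn @ v, (q, k, v)

rng = np.random.default_rng(0)
p, d = 16, 8                                   # hypothetical patch count and embedding size
tokens = rng.normal(size=(p + 1, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, (q, k, v) = single_head_attention(tokens, w_q, w_k, w_v)

k_cls = k[0]            # key feature of the [CLS] token
patch_values = v[1:]    # one value descriptor per image patch
```

The per-patch rows of `q`, `k`, and `v` are the quantities the text refers to as patch descriptors.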

In the global search stage, the aim is to perform an efficient search across a vast corpus of images in a gallery. To this end, a shared embedding space for query and gallery images is typically utilized to identify the most similar images and rank them accordingly. In the re-ranking stage, the top-K nearest neighbors (candidates) are further refined by evaluating the mutual similarity between their local features and those of the query image. Throughout both stages, we use a ViT [17] encoder as our backbone model. Our EffoVPR retrieval method is presented in Figure 2.

Figure 2: An overview of EffoVPR. Left: During inference, we identify the nearest neighbors of the query by using the [CLS] token as the global representation for each image ($v_g$). For the second re-ranking stage, we extract (dashed-line) intermediate value features $V$ and filter them (with a predefined threshold $T_1$) using a score derived from the partial attention maps $S$. Right: Lastly, we re-rank the top-K candidates from the first stage based on the count of strongly connected mutual nearest neighbors (MNN) with a score exceeding a predefined threshold (denoted as $T_2$).

3.1 Training strategy

Our training strategy employs a classification loss applied on the output [CLS] token. Following EigenPlaces [11], we partition the map into cells measuring $15 \times 15$ meters each, where cells define classes. Each class contains images capturing the same location from various viewpoints, enhancing the model’s ability to accurately recognize locations despite substantial changes in the captured view angle. To enforce class discrimination while ensuring viewpoint invariance, we utilize the Large Margin Cosine Loss (CosFace) [8, 45].
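The two ingredients of this training setup, grid cells as classes and a large-margin cosine loss, can be sketched as follows. This is an illustrative approximation, not the training code: `cell_id` uses a plain metric grid (the EigenPlaces grouping also accounts for viewing direction), and the scale `s` and margin `m` are placeholder values that the text does not specify.

```python
import numpy as np

def cell_id(east_m, north_m, cell_size=15.0):
    """Map metric coordinates (e.g. UTM east/north) to a grid-cell index pair."""
    return (int(east_m // cell_size), int(north_m // cell_size))

def cosface_loss(features, class_weights, labels, s=30.0, m=0.4):
    """Large Margin Cosine Loss averaged over a batch.

    features: (n, d) descriptors, class_weights: (c, d), labels: (n,) integer class ids.
    s (scale) and m (margin) are illustrative values only.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = f @ w.T                                        # (n, c) cosine to every class
    rows = np.arange(len(labels))
    logits = s * cos
    logits[rows, labels] = s * (cos[rows, labels] - m)   # margin on the true class only
    logits -= logits.max(axis=1, keepdims=True)          # stabilise the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))        # hypothetical batch of 4 descriptors
weights = rng.normal(size=(10, 16))     # hypothetical 10 place classes
labels = np.array([1, 3, 5, 7])
loss_with_margin = cosface_loss(feats, weights, labels, m=0.4)
loss_no_margin = cosface_loss(feats, weights, labels, m=0.0)
```

Because the margin lowers only the true-class logit, the loss with a margin is always larger than without it for the same inputs, which is what pushes classes apart during training.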

We initialize our ViT [17] backbone with pre-trained DINOv2 weights [37, 16]. To retain the rich visual representations learned during pre-training while adapting the model to the VPR task, we fine-tune only the final layers of our backbone. Note that due to the inherent design of the ViT architecture, which incorporates a self-attention mechanism (Eq. 1), the global feature is implicitly trained to aggregate local features without the need for additional specialized components. To learn a more compact global feature, we simply add a linear layer on top of the output [CLS] token.

3.2 Inference strategy

Global ranking stage: Post-training, we extract the global feature of an image from the [CLS] output token of the penultimate classification layer. We then retrieve the K nearest neighbors of the query’s global feature.

Re-ranking stage: We start by extracting patch local features from each candidate among the top-K images retrieved in the previous global stage, and re-rank the candidates based on these features. Local features are derived from the intermediate layer $l$ of ViT, utilizing the self-attention matrices; their computation is therefore integrated into the process of computing global features and does not necessitate recalculation. In line with the findings in [3], where $V_l$ was shown to carry stronger instance-level characteristics, we find the $V_l$ matrix from Eq. 1 to be the most suitable facet for the task at hand (see Tab. S3 in Appx.). We then leverage the model’s internal prioritization to identify discriminative features and extract the attention map $S := \text{Softmax}(Q_l \cdot k_{cls}) \in \mathbb{R}^{p}$, which represents the attention of each Value feature with the global [CLS] image representation key-token $k_{cls}$.
We therefore select a subset $\mathcal{V}$ of the values, $\mathcal{V} \subseteq \{v_1, \dots, v_p\} := V_l$, as the image’s local features, based on the score $S$: $\mathcal{V} := \{v_i \mid S_i > T_1\}$ for a predefined threshold $T_1$. We use a rather low threshold that filters out patches that are “less significant” as reflected by the layer’s attention. Note that the number of selected local features might differ between images.
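The selection of local features by attention score can be sketched as below. The assumptions are: a single attention head, tokens stored as rows with [CLS] in row 0, a softmax taken over the $p$ patches, and a hypothetical threshold value; the function name `select_local_features` is introduced here for illustration only.

```python
import numpy as np

def select_local_features(q, k, v, t1):
    """q, k, v: (p + 1, d) arrays from one layer, with row 0 being [CLS].

    Returns the patch values whose attention score with the [CLS] key exceeds t1,
    together with the full score vector S.
    """
    k_cls = k[0]
    raw = q[1:] @ k_cls                  # one raw score per patch, shape (p,)
    e = np.exp(raw - raw.max())          # subtract the max for numerical stability
    s = e / e.sum()                      # softmax over the p patches
    keep = s > t1
    return v[1:][keep], s

rng = np.random.default_rng(0)
p, d = 16, 8                             # hypothetical patch count and feature size
q, k, v = (rng.normal(size=(p + 1, d)) for _ in range(3))
t1 = 0.5 / p                             # hypothetical "rather low" threshold
local_feats, s = select_local_features(q, k, v, t1)
```

Since the scores sum to one, at least one patch always exceeds any threshold below $1/p$, and the number of kept features varies from image to image.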

Next, given the query’s local features $\mathcal{V}$ and a candidate’s $\mathcal{V}'$, we find the mutual nearest neighbors (MNN) between $\mathcal{V}$ and $\mathcal{V}'$, i.e. the feature pairs in which each feature is the nearest neighbor of the other. We observed that applying a threshold to the MNN scores enhances the model’s resilience to clutter and directs its attention to the pertinent key-points for accurate matching. We therefore count only the pairs with cosine similarity higher than a predefined threshold $T_2$ (for an ablation on the thresholds see Tab. S3 in Appx.). We formulate this process in Equation 2:

$$\text{MNN}(\mathcal{V}, \mathcal{V}') := \{(v_i, v'_j) \in \mathcal{V} \times \mathcal{V}' \mid v_i = NN(\mathcal{V}, v'_j) \ \text{and} \ v'_j = NN(v_i, \mathcal{V}'), \ v_i^T v'_j > T_2\} \qquad (2)$$
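A minimal sketch of the thresholded mutual-nearest-neighbor count follows. It assumes that nearest neighbors are taken by cosine similarity on L2-normalised features; the function name `mnn_count`, the random data, and the threshold value are hypothetical.

```python
import numpy as np

def mnn_count(feats_a, feats_b, t2):
    """Count mutual nearest-neighbour pairs whose cosine similarity exceeds t2."""
    if len(feats_a) == 0 or len(feats_b) == 0:
        return 0
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T                          # (|A|, |B|) cosine similarities
    nn_a_to_b = sim.argmax(axis=1)         # best match in B for each feature of A
    nn_b_to_a = sim.argmax(axis=0)         # best match in A for each feature of B
    i = np.arange(len(a))
    mutual = nn_b_to_a[nn_a_to_b] == i     # i picks j and j picks i back
    strong = sim[i, nn_a_to_b] > t2        # keep only confident matches
    return int(np.sum(mutual & strong))

rng = np.random.default_rng(0)
query_feats = rng.normal(size=(20, 32))
# Candidate showing the same place: noisy copies of 12 query features.
same_place = query_feats[:12] + 0.05 * rng.normal(size=(12, 32))
# Candidate showing an unrelated place: independent random features.
other_place = rng.normal(size=(15, 32))

count_same = mnn_count(query_feats, same_place, t2=0.9)
count_other = mnn_count(query_feats, other_place, t2=0.9)
```

In this toy setup the matching candidate collects one strong mutual pair per shared feature, while the unrelated candidate collects none, which is the signal used for re-ranking.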

Finally, the re-ranking stage concludes by sorting the top-K candidates based on their MNN counts. Note that our re-ranking strategy, which matches local features, does not require any additional learning, optimization, or spatial verification. The thresholds are established once and remain fixed across all test sets (20 different scenarios).
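The final sorting step can be sketched in a few lines. The candidate identifiers and counts below are made up, and keeping the first-stage order for candidates with equal counts is a tie-breaking choice assumed here, since the text does not specify one.

```python
def rerank(candidate_ids, mnn_counts):
    """Reorder first-stage candidates by descending MNN count.

    Python's sort is stable, so candidates with equal counts keep their
    first-stage order (an assumed tie-breaking rule).
    """
    paired = list(zip(candidate_ids, mnn_counts))
    paired.sort(key=lambda item: item[1], reverse=True)
    return [cid for cid, _ in paired]

first_stage = ["img_07", "img_21", "img_03", "img_15"]   # hypothetical gallery ids
counts = [4, 19, 4, 11]                                   # hypothetical MNN counts
reranked = rerank(first_stage, counts)
```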

Zero-shot: For the first stage ranking we use the [CLS] token from vanilla-DINOv2. The results are then refined by re-ranking, while employing our suggested features $\mathcal{V}$ with MNN in Eq. (2).

4 Evaluation

In this section, we compare our approach with several SoTA VPR methods following the common VPR benchmarks [10]. We propose a single-stage approach (EffoVPR-G) and a two-stage approach that includes a re-ranker (EffoVPR-R), both with a backbone trained on the publicly available SF-XL [8] dataset, which contains street-view imagery of San Francisco. For more implementation details see Appx. We then test EffoVPR on a large number of diverse datasets (20), including Pitts30k [4], Tokyo24/7 [43], MSLS-val/challenge [47], Nordland [41], and more, which exhibit a wide variety of conditions, including different cities, day/night images, and seasonal changes. Note that the MSLS Challenge [47] is a hold-out set whose labels are not released; researchers submit their predictions to the challenge server to obtain the performance. More details on the benchmarks can be found in the Appendix.

Datasets whose gallery is built from street-view images and which have the largest viewpoint variance include Tokyo 24/7 [43] and SF-XL [45, 6], where the query images are collected with a smartphone, usually from sidewalks. Most datasets consist of urban footage, the main exception being Nordland [41], a collection of photos taken across different seasons with a camera mounted on a train. Some datasets present various degrees of day-to-night changes, namely MSLS [47], Tokyo 24/7 [43], SF-XL [8], and SVOX-Night [12]. AmsterTime [48] contains grayscale historical queries and modern-time RGB gallery images, making it the only dataset with large-scale time variations spanning multiple decades.

We follow the common evaluation metric used in previous works, e.g. [10, 11, 49, 34, 33, 4], using a 25-meter radius as the threshold for correct localization, and report Recall@K for K = 1, 5, 10. For Nordland, we count a match within $\pm 10$ frames as correct, following the common evaluation protocol used in [10, 21]. For a more comprehensive description of all 20 datasets and implementation details see Appx.
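The Recall@K metric with a distance threshold can be sketched as follows. The sketch assumes positions are given in a metric coordinate frame (for example UTM), so that Euclidean distance is in meters; the function name and the toy coordinates are hypothetical, and the frame-based Nordland protocol is not covered.

```python
import math

def recall_at_k(predictions, query_positions, gallery_positions, ks=(1, 5, 10), radius_m=25.0):
    """predictions[i]: ranked gallery indices for query i; positions are (x, y) in metres."""
    hits = {k: 0 for k in ks}
    for ranked, q_pos in zip(predictions, query_positions):
        # 1-based rank of the first retrieved image within the radius, if any.
        first_hit = next(
            (rank for rank, g in enumerate(ranked, start=1)
             if math.dist(q_pos, gallery_positions[g]) <= radius_m),
            None,
        )
        for k in ks:
            if first_hit is not None and first_hit <= k:
                hits[k] += 1
    return {k: 100.0 * hits[k] / len(predictions) for k in ks}

gallery_xy = [(0.0, 0.0), (100.0, 0.0), (200.0, 0.0)]
query_xy = [(10.0, 0.0), (190.0, 5.0)]
ranked = [[0, 1, 2], [1, 2, 0]]   # query 0 is correct at rank 1, query 1 at rank 2
scores = recall_at_k(ranked, query_xy, gallery_xy, ks=(1, 2))
```

A query counts as correctly localized at K as soon as any of its top-K retrieved images lies within the radius.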

4.1 Zero-shot performance

Table 1: Comparison on Zero-Shot with R@1.

| Method | Pitts30k | Tokyo24/7 | MSLS-Val | Nordland |
|---|---|---|---|---|
| DINOv2 [37] | 78.1 | 62.2 | 47.7 | 33.0 |
| AnyLoc [27] | 87.7 | 60.6 | 68.7 | 16.1 |
| EffoVPR-ZS | 89.4 | 90.8 | 70.3 | 57.9 |

We present the performance of our zero-shot method (EffoVPR-ZS) in Table 1, compared to two zero-shot alternatives: the recently published AnyLoc, and the DINOv2 global feature (using the output [CLS] token) without fine-tuning. EffoVPR-ZS re-ranks its top-100 globally retrieved candidates. The results show that our method significantly improves over the baseline and is superior to AnyLoc. Note the significant gap in the more challenging scenarios of Tokyo24/7 and Nordland, which exhibit day vs. night and seasonal variations. AnyLoc tends to fail in these challenging scenarios because its VLAD aggregation, learned (in an unsupervised manner) on the gallery, cannot generalize well to challenging out-of-distribution queries. In Figure 3(a) we compare our zero-shot approach with several methods that have used VPR datasets for training. Although EffoVPR-ZS was not trained on the VPR task, it still achieves results comparable to the trained methods on three popular datasets. This success can be attributed to the robust features in DINOv2, specifically those selected from the $\mathcal{V}$ facet, combined with our mutual-NN matching and scoring. Figure 3(b) demonstrates this with a scenario in which the original attention mistakenly focuses on an advertisement placed in front of a building. Our method nevertheless identifies relevant key-points on the building itself, enabling correct image matching (even though the gallery image shows a different ad).

Figure 3: EffoVPR zero-shot. (a) Comparison of EffoVPR-ZS with other VPR trained methods. Our zero-shot approach shows comparable results. (b) Zero-shot success despite existing dynamic and irrelevant objects and strong visual change. Matching keypoints are indicated by colored lines. Although the pre-trained DINOv2 initially has its strongest attention on the distracting temporal advertisement, EffoVPR effectively identifies correct keypoints for successful matching.

4.2 Comparison with State-of-The-Art

In this section, we compare our single-stage (EffoVPR-G) and two-stage (EffoVPR-R) methods with the previous state of the art, including the recent works of [34, 22, 33]. SelaVPR and R2Former were trained on a combination of Pitts30k and MSLS, CricaVPR and SALAD were trained on GSV-Cities, and CosPlace and EigenPlaces were trained on SF-XL (as is ours). Table 2 shows the global retrieval results (without re-ranking) with Recall@1 on five different benchmarks. EffoVPR-G achieves SoTA performance on three out of five datasets, while being ranked second on the other two. This highlights the effectiveness of the single global representation learned by our method. Notably, it achieves $+2.9\%$ on Tokyo24/7, $+2.8\%$ on the challenging Nordland dataset, which exhibits extreme seasonal changes, and $+3.2\%$ on the hold-out MSLS-challenge dataset. We further demonstrate the performance of our global feature learning with reduced dimensions of 256D and even 128D, significantly decreasing the memory footprint and enabling efficient searches within a considerably larger gallery. The findings indicate only a marginal degradation in performance with lower-dimensional features, while achieving parity with SALAD on Tokyo24/7 using a 128D feature instead of an 8,448D one (a 66-fold reduction in dimensionality). Figure 1 illustrates this quality on Tokyo24/7 and the hold-out MSLS-challenge, showing our top-performing results even with a 128D feature size. Results on more datasets can be found in the Appendix. The strong impact of our fine-tuning process, which includes the last five layers, is visualized in Figure 4.
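To put the dimensionality reduction in perspective, the short calculation below estimates the RAM needed to hold one dense descriptor per gallery image. It assumes 32-bit floats and uses the 2.8M-image gallery size quoted for SF-Occ in the Figure 1 caption; the storage format is an assumption, since the text does not state one.

```python
def gallery_memory_gib(num_images, dim, bytes_per_value=4):
    """Memory needed to hold one dense descriptor per gallery image, in GiB."""
    return num_images * dim * bytes_per_value / 2**30

num_images = 2_800_000          # gallery size quoted for SF-Occ in Figure 1
for dim in (8448, 1024, 256, 128):
    print(f"{dim:>5}-D: {gallery_memory_gib(num_images, dim):6.2f} GiB")

reduction = 8448 / 128          # dimensionality ratio between 8,448D and 128D features
```

Under these assumptions an 8,448D descriptor needs roughly 88 GiB for such a gallery, whereas a 128D descriptor needs roughly 1.3 GiB, which is the 66-fold ratio mentioned above.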

Table 2: Comparison with our single stage method - Recall@1 performance. Two-stage methods are marked with †, and present 1st-stage performance (for fair comparison). The best results are highlighted in bold and the second is underlined. We present results from EffoVPR with three different feature dimensions.

| Method | Dim | Pitts30k | Tokyo24/7 | MSLS-val | MSLS-chall. | Nordland |
|---|---|---|---|---|---|---|
| CosPlace [8] | 512 | 90.5 | 81.9 | 82.8 | 61.4 | 66.5 |
| MixVPR [2] | 4096 | 91.5 | 86.7 | 88.2 | 64.0 | 58.4 |
| R2Former [49] | 256 | 76.3 | 45.7 | 79.3 | 56.2 | 50.9 |
| EigenPlaces [11] | 2048 | 92.5 | 93.0 | 89.1 | 67.4 | 71.2 |
| SelaVPR [34] | 1024 | 90.2 | 81.9 | 87.7 | 69.6 | 72.3 |
| CricaVPR [33] | 4096 | 94.9* | 93.0 | 90.0 | 69.0 | 90.7 |
| SALAD [22] | 8448 | 92.4 | 94.6 | 92.2 | 75.0 | 76.0 |
| EffoVPR-G | 1024 | 94.8 | 97.5 | 90.9 | 78.2 | 93.5 |
| EffoVPR-G | 256 | 93.8 | 95.9 | 90.4 | 75.6 | 79.7 |
| EffoVPR-G | 128 | 92.6 | 94.6 | 88.2 | 73.8 | 70.4 |
Figure 4: Attention map visualization. Pre-trained DINOv2 focuses on irrelevant foreground objects, e.g. vehicles, whereas after training the attention of EffoVPR shifts to the scene layout and building structures such as cables and windows.

Next, we present the full performance of our two-stage approach (EffoVPR-R) in Table 3. EffoVPR-R achieves top performance across all datasets, ranking second only on Pitts30k R@1, with a very close result. Note that CricaVPR reports using Pitts30k as a validation set, which may have contributed to its improved results on this dataset. EffoVPR-R demonstrates notable improvements, particularly on Tokyo24/7, where it raises R@1 from 94.6% to 98.7%, and on MSLS-challenge (from 75.0% to 79.0%). These results underscore the generalization capability of our approach, demonstrating its resilience to significant variations between query and gallery images, such as viewpoint discrepancies (as seen in Pitts30k) and changes in illumination (in Tokyo24/7), across a diverse range of locations. Following common practice, we report EffoVPR-R re-ranking performance over the top-100 candidates retrieved in the first stage ($K=100$); however, we achieve SoTA R@1 results even with a small number of candidates (from $K=5$ onwards, see Table S4 in the Appendix). Note that the EffoVPR matching method is highly efficient, with an average processing time of just 1 millisecond per match.

Table 3: Comparison to state-of-the-art methods on four benchmarks. The best results are highlighted in bold and the second best are underlined. Two-stage methods are marked with †.
Method Pitts30k Tokyo24/7 MSLS-val MSLS-challenge
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
NetVLAD [4] 81.9 91.2 93.7 60.6 68.9 74.6 53.1 66.5 71.1 35.1 47.4 51.7
SFRS [20] 89.4 94.7 95.9 81.0 88.3 92.4 69.2 80.3 83.1 41.6 52.0 56.3
Patch-NetVLAD [21] 88.7 94.5 95.9 86.0 88.6 90.5 79.5 86.2 87.7 48.1 57.6 60.5
CosPlace [8] 88.4 94.5 95.7 81.9 90.2 92.7 82.8 89.7 92.0 61.4 72.0 76.6
MixVPR [2] 91.5 95.5 96.4 86.7 92.1 94.0 88.2 93.1 94.3 64.0 75.9 80.6
R2Former [49] 91.1 95.2 96.3 88.6 91.4 91.7 89.7 95.0 96.2 73.0 85.9 88.8
EigenPlaces [11] 92.5 96.8 97.6 93.0 96.2 97.5 89.1 93.8 95.0 67.4 77.1 81.7
SelaVPR [34] 92.8 96.8 97.7 94.0 96.8 97.5 90.8 96.4 97.2 73.5 87.5 90.6
CricaVPR [33] 94.9 97.3 98.2 93.0 97.5 98.1 90.0 95.4 96.4 69.0 82.1 85.7
SALAD [22] 92.4 96.3 97.4 94.6 97.5 97.8 92.2 96.2 97.0 75.0 88.8 91.3
EffoVPR-R 93.9 97.4 98.5 98.7 98.7 98.7 92.8 97.2 97.4 79.0 89.0 91.6
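
The R@1, R@5 and R@10 columns in the table above follow the usual Recall@N definition: a query counts as correctly localized if at least one of its top-N retrieved gallery images is a true match. The sketch below is a minimal illustration, not the benchmark code. The function name and the toy data are hypothetical, and the set of true matches per query is assumed to be given by each benchmark's own ground-truth criterion.

```python
from typing import Sequence, Set


def recall_at_n(ranked: Sequence[Sequence[int]],
                positives: Sequence[Set[int]],
                n: int) -> float:
    """Percentage of queries with at least one true match among the top-n results.

    ranked[i]    : gallery indices retrieved for query i, best first.
    positives[i] : gallery indices accepted as correct for query i.
    """
    hits = sum(
        1 for retrieved, pos in zip(ranked, positives)
        if any(g in pos for g in retrieved[:n])
    )
    return 100.0 * hits / len(ranked)


if __name__ == "__main__":
    # Toy example with three queries and a three-image gallery.
    ranked = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
    positives = [{1}, {0}, {1}]
    for n in (1, 2, 3):
        print(f"R@{n} = {recall_at_n(ranked, positives, n):.1f}")
```

Recall@N is non-decreasing in N, which is consistent with every row of the table.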

In Figure 5, we present a failure case of our zero-shot approach, which is resolved after fine-tuning. In this instance, both the query and a gallery image contain a visually identical vehicle (an SF cable car), which leads to an incorrect match. Such instances are rare in general, since pedestrians or vehicles that appear in different images are usually not identical; nevertheless, this example highlights a limitation of our zero-shot approach.

Figure 5: Zero-shot vs trained: a failure case of our zero-shot approach, resolved after training.

Finally, we demonstrate the efficacy of our approach in the most challenging VPR benchmark scenarios by conducting experiments on six demanding datasets: Nordland [41], which includes extensive seasonal changes; AmsterTime [48], which spans an extended time period; SF-Occlusion [6], which features queries with significant field-of-view obstructions; SF-Night [6], with severe illumination changes; and SVOX [12] (Night and Rain), with extreme weather and illumination variations. The results, detailed in Table 4, show a clear advantage of our method over previous approaches across these datasets. EffoVPR-R shows improvements of +4.3%, +0.8%, +7.9%, +15% and +2% on Nordland, AmsterTime, SF-Occlusion, SF-Night and SVOX-Night, respectively, and comparable results on SVOX-Rain. This demonstrates the high versatility of our model, which can handle extreme variations even when trained without seasonal or day-to-night changes. Figure 1 shows an example of this case. We attribute this robustness primarily to the combination of our training method and our specific re-ranking strategy over the DINOv2 model. We conduct an extensive ablation study on various hyperparameters and aspects of our approach in the Appendix.

Table 4: Comparison (R@1) to SoTA methods on more challenging datasets.
Method Nordland AmsterTime SF-XL-Occlusion SF-XL-Night SVOX-Night SVOX-Rain
EigenPlaces [11] 71.2 48.9 32.9 23.6 58.9 90.0
SelaVPR [34] 85.2 55.2 35.5 38.4 89.4 94.7
CricaVPR [33] 90.7 64.7 42.1 35.4 85.1 95.0
SALAD [22] 76.0 58.8 51.3 46.6 95.4 98.5
EffoVPR-R 95.0 65.5 59.2 61.6 97.4 98.3
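
The margins quoted in the text can be recomputed directly from the R@1 values in the table above. The short sketch below takes, for each dataset, the difference between EffoVPR-R and the best previous method. The variable and function names are hypothetical; the numbers are copied from the table.

```python
DATASETS = ["Nordland", "AmsterTime", "SF-XL-Occlusion",
            "SF-XL-Night", "SVOX-Night", "SVOX-Rain"]

# R@1 values of previous methods, in the column order of DATASETS.
PRIOR = {
    "EigenPlaces": [71.2, 48.9, 32.9, 23.6, 58.9, 90.0],
    "SelaVPR":     [85.2, 55.2, 35.5, 38.4, 89.4, 94.7],
    "CricaVPR":    [90.7, 64.7, 42.1, 35.4, 85.1, 95.0],
    "SALAD":       [76.0, 58.8, 51.3, 46.6, 95.4, 98.5],
}
EFFOVPR_R = [95.0, 65.5, 59.2, 61.6, 97.4, 98.3]


def margins_over_best_prior(ours, prior):
    """Difference between our R@1 and the best prior R@1, per dataset."""
    best_prior = [max(column) for column in zip(*prior.values())]
    return [round(o - b, 1) for o, b in zip(ours, best_prior)]


if __name__ == "__main__":
    for name, margin in zip(DATASETS, margins_over_best_prior(EFFOVPR_R, PRIOR)):
        print(f"{name:<16} {margin:+.1f}")
```

The only negative margin is on SVOX-Rain (98.3 vs. 98.5 for SALAD), which is the result the text describes as comparable.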

Figure 6 qualitatively highlights the superior performance of our method. While other methods fail in challenging scenarios, such as viewpoint changes, seasonal variations, illumination differences, and severe occlusions, EffoVPR demonstrates high robustness against these challenges.

Figure 6: Qualitative comparison to SoTA Methods with challenging examples.

5 Summary and Limitations

In this paper, we introduced single-stage and two-stage approaches for VPR that effectively leverage a foundation model. Our method builds on the existing internal self-attention and pooling mechanisms and achieves high performance even in a zero-shot setting.

We observed that, despite the success of our zero-shot approach, it does not grasp the relevance of certain objects in the scene for localization. This limitation is highlighted in Figure 5, where the iconic and visually identical cable car in SF acts as a distractor. Fine-tuning the model resolves this problem by shifting the attention from transient foreground objects to static VPR-relevant cues, as seen in Figure 4. Moreover, while our second-stage matching approach proves highly effective, we forgo geometric verification for the sake of speed. Integrating such verification in the future may offer further refinement and performance gains.

The experimental results demonstrated that our trained model outperforms previous SoTA often by a large margin, particularly in demanding scenarios that exhibit strong appearance change. Having compact features, our method provides a promising way to address the VPR task in real-world, large-scale applications.

References

  • [1] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. GSV-Cities: Toward appropriate supervised visual place recognition. Neurocomputing, 513:194–203, 2022.
  • [2] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. MixVPR: feature mixing for visual place recognition. In WACV, pages 2998–3007, 2023.
  • [3] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep ViT features as dense visual descriptors. ECCVW, 2022.
  • [4] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5297–5307, 2016.
  • [5] Artem Babenko and Victor Lempitsky. Aggregating local deep features for image retrieval. In Proceedings of the IEEE international conference on computer vision, pages 1269–1277, 2015.
  • [6] Giovanni Barbarani, Mohamad Mostafa, Hajali Bayramov, Gabriele Trivigno, Gabriele Berton, Carlo Masone, and Barbara Caputo. Are local features all you need for cross-domain visual place recognition? In CVPRW, pages 6155–6165, June 2023.
  • [7] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In ECCV, pages 404–417. Springer, 2006.
  • [8] Gabriele Berton, Carlo Masone, and Barbara Caputo. Rethinking visual geo-localization for large-scale applications. In CVPR, pages 4878–4888, June 2022.
  • [9] Gabriele Berton, Riccardo Mereu, Gabriele Trivigno, Carlo Masone, Gabriela Csurka, Torsten Sattler, and Barbara Caputo. Deep visual geo-localization benchmark. In CVPR, pages 5396–5407, 2022.
  • [10] Gabriele Berton, Riccardo Mereu, Gabriele Trivigno, Carlo Masone, Gabriela Csurka, Torsten Sattler, and Barbara Caputo. Deep visual geo-localization benchmark. In CVPR, June 2022.
  • [11] Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone. Eigenplaces: Training viewpoint robust models for visual place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11080–11090, October 2023.
  • [12] Gabriele Moreno Berton, Valerio Paolicelli, Carlo Masone, and Barbara Caputo. Adaptive-attentive geolocalization from few queries: A hybrid approach. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2918–2927, 2021.
  • [13] David M Chen, Georges Baatz, Kevin Köser, Sam S Tsai, Ramakrishna Vedantham, Timo Pylvänäinen, Kimmo Roimela, Xin Chen, Jeff Bach, Marc Pollefeys, et al. City-scale landmark identification on mobile devices. In CVPR 2011, pages 737–744. IEEE, 2011.
  • [14] Zetao Chen, Lingqiao Liu, Inkyu Sa, Zongyuan Ge, and Margarita Chli. Learning context flexible attention model for long-term visual place recognition. IEEE Robotics and Automation Letters, 3(4):4015–4022, 2018.
  • [15] Mark Cummins and Paul Newman. Highly scalable appearance-only slam-fab-map 2.0. In Robotics: Science and systems, volume 5, page 17. Seattle, USA, 2009.
  • [16] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers, 2023.
  • [17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
  • [18] Sourav Garg and Michael Milford. Seqnet: Learning descriptors for sequence-based hierarchical place recognition. IEEE Robotics and Automation Letters, 6(3):4305–4312, 2021.
  • [19] Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 369–386. Springer, 2020.
  • [20] Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. In ECCV, pages 369–386. Springer, 2020.
  • [21] Stephen Hausler, Sourav Garg, Ming Xu, Michael Milford, and Tobias Fischer. Patch-NetVLAD: Multi-scale fusion of locally-global descriptors for place recognition. In CVPR, pages 14141–14152, 2021.
  • [22] Sergio Izquierdo and Javier Civera. Optimal transport aggregation for visual place recognition. In CVPR, June 2024.
  • [23] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In CVPR, pages 3304–3311. IEEE, 2010.
  • [24] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Marina Meila and Tong Zhang, editors, ICML, volume 139 of Proceedings of Machine Learning Research, pages 4904–4916. PMLR, 2021.
  • [25] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. International Journal of Computer Vision, 129(2):517–547, 2021.
  • [26] Hyo Jin Kim, Enrique Dunn, and Jan-Michael Frahm. Learned contextual feature reweighting for image geo-localization. In CVPR, pages 2136–2145, 2017.
  • [27] Nikhil Keetha, Avneesh Mishra, Jay Karhade, Krishna Murthy Jatavallabhula, Sebastian Scherer, Madhava Krishna, and Sourav Garg. AnyLoc: towards universal visual place recognition. IEEE Robotics and Automation Letters, 2023.
  • [28] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In CVPR, pages 4015–4026, 2023.
  • [29] Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. Chatting makes perfect: Chat-based image retrieval. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 61437–61449. Curran Associates, Inc., 2023.
  • [30] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML, pages 12888–12900, 2022.
  • [31] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models. In ICCV, pages 2105–2114, 2021.
  • [32] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004.
  • [33] Feng Lu, Xiangyuan Lan, Lijun Zhang, Dongmei Jiang, Yaowei Wang, and Chun Yuan. CricaVPR: Cross-image correlation-aware representation learning for visual place recognition. In CVPR, June 2024.
  • [34] Feng Lu, Lijun Zhang, Xiangyuan Lan, Shuting Dong, Yaowei Wang, and Chun Yuan. Towards seamless adaptation of pre-trained models for visual place recognition. In ICLR, 2024.
  • [35] Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The oxford robotcar dataset. The International Journal of Robotics Research, 36(1):3–15, 2017.
  • [36] Michael J Milford and Gordon F Wyeth. Mapping a suburb with a single camera using a biologically inspired slam system. IEEE Transactions on Robotics, 24(5):1038–1053, 2008.
  • [37] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: learning robust visual features without supervision. CoRR, abs/2304.07193, 2023.
  • [38] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence, 41(7):1655–1668, 2018.
  • [39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Marina Meila and Tong Zhang, editors, ICML, 2021.
  • [40] Shihao Shao, Kaifeng Chen, Arjun Karpur, Qinghua Cui, André Araujo, and Bingyi Cao. Global features are all you need for image retrieval and reranking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11036–11046, 2023.
  • [41] Niko Sünderhauf, Peer Neubert, and Peter Protzel. Are we there yet? challenging seqslam on a 3000 km journey across all four seasons. In Proc. of workshop on long-term autonomy, IEEE international conference on robotics and automation (ICRA), page 2013. Citeseer, 2013.
  • [42] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of cnn activations. arXiv preprint arXiv:1511.05879, 2015.
  • [43] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1808–1817, 2015.
  • [44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
  • [45] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018.
  • [46] Ruotong Wang, Yanqing Shen, Weiliang Zuo, Sanping Zhou, and Nanning Zheng. TransVPR: transformer-based place recognition with multi-level attention aggregation. In CVPR, pages 13648–13657, 2022.
  • [47] Frederik Warburg, Soren Hauberg, Manuel Lopez-Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. In CVPR, pages 2626–2635, 2020.
  • [48] Burak Yildiz, Seyran Khademi, Ronald Maria Siebes, and Jan Van Gemert. Amstertime: A visual place recognition benchmark dataset for severe domain shift. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 2749–2755. IEEE, 2022.
  • [49] Sijie Zhu, Linjie Yang, Chen Chen, Mubarak Shah, Xiaohui Shen, and Heng Wang. $R^{2}$Former: unified retrieval and reranking transformer for place recognition. In CVPR, pages 19370–19380, 2023.

Appendix

Appendix A Datasets

We evaluate EffoVPR on a large number of datasets to demonstrate its top performance across varied scenarios and cities. Following prior work, we use the open-source code of [10] to download and organize the datasets, ensuring maximum reproducibility. Below we briefly describe each dataset.

A.1 Datasets Summary

AmsterTime [48] consists of 1,231 image pairs from Amsterdam, Holland, exhibiting long-term changes. The queries are historical grayscale images, and for each query there is a modern-day reference photo that depicts the same place. The pairs were curated by human experts and pose multiple challenges: different viewpoints and cameras, color vs. grayscale, and long-term changes.

Eynsham [15] is a collection of images from a car-mounted street-view camera that captured the same route in the Oxford countryside twice. The grayscale images are divided into 23,935 queries and 23,935 gallery images.

Mapillary Street-Level Sequences (MSLS) [47] is an image- and sequence-based VPR dataset. It consists of more than 1.6M geo-tagged images collected over seven years from 30 cities, in urban, suburban, and natural environments. There are three non-overlapping subsets: a training set, a validation set (MSLS-val), and a withheld test set (MSLS-challenge). MSLS-val and MSLS-challenge provide various challenges, including viewpoint variations, long-term changes, and illumination and seasonal changes.

Nordland [41] was collected by a camera mounted on top of a moving train in the Norwegian countryside, presenting rural and natural scenes. The data were collected over the same route across four seasons, providing seasonal and illumination variability. Following [10, 41], we use the post-processed versions of winter as queries and summer as the database, and count a localization as correct when the retrieved image is less than 10 frames away from the query. This dataset consists of 27,592 query images and 27,592 gallery images.

Pittsburgh30k [4] is collected from Google Street View 360° panoramas of downtown Pittsburgh, each split into multiple images. Queries and gallery images were taken in different years. The dataset provides three splits: a training set, a validation set, and a test set. Pitts30k-test consists of 10k gallery images and 6,816 queries. Pitts250k consists of 8,280 queries, including those of Pitts30k, and its gallery contains 83,952 images.

San Francisco Landmark (SF-R) [13] is a dataset from downtown San Francisco that provides viewpoint variations. It consists of 598 smartphone camera queries and a gallery of 1,046,587 images.

San Francisco eXtra Large (SF-XL) [8, 6] is an enormous dataset covering the whole city of San Francisco. It consists of a training set, which also includes raw 360° panoramas, a small validation set of 7,983 queries and 8,015 gallery images, and a test gallery of 2,805,840 images.
There are four sets of queries:
SF-XL-v1 [8] consists of 1,000 queries curated from Flickr, and provides viewpoint and camera variations, illumination changes and even some occlusions.
SF-XL-v2 [8] uses the queries of San Francisco Landmark (SF-R).
SF-XL-Night [6] is a collection of 466 Flickr images of night scenes from San Francisco. It provides viewpoint variations and very challenging illumination changes.
SF-XL-Occlusion [6] is a collection of 76 Flickr images from the city of San Francisco that suffer from severe occlusions, mostly by vehicles and crowds.

SPED [14] is a collection of surveillance camera images consisting of 607 query-gallery pairs captured across time. It provides challenging viewpoints with seasonal and illumination changes.

St Lucia [36] is a collection of nine videos from a car-mounted camera in the St Lucia suburb of Brisbane. Following the open-source code of [10], we select the first and last videos as queries and database, and sample one frame every 5 meters of driving. The gallery consists of 1,549 images and there are 1,464 query images.

SVOX [12] is a dataset that poses a VPR challenge across multiple weather conditions. It consists of 17,166 gallery images of the city of Oxford. The queries were collected from the Oxford RobotCar dataset [35] and are grouped into query sets by condition: night (823 queries), overcast (872 queries), rainy (937 queries), snowy (870 queries) and sunny (854 queries).

Tokyo 24/7 [43] is a dataset from downtown Tokyo that provides viewpoint changes and challenging illumination variations. It consists of a gallery of 75,984 images and a collection of 315 smartphone camera queries from 185 places. Each place is portrayed by three photos: one taken during the day, one at sunset and one at night.
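
Most datasets above define correctness through geo-tags, whereas Nordland uses a frame-distance criterion. The following minimal sketch illustrates that criterion under two assumptions that go beyond the text: query and gallery sequences are frame-aligned, so that query frame i corresponds to gallery frame i, and the bound is strict ("less than 10 frames away"). The function names are hypothetical.

```python
from typing import Sequence


def is_correct_nordland(query_frame: int, retrieved_frame: int,
                        tolerance: int = 10) -> bool:
    """True if the retrieved gallery frame is less than `tolerance` frames away."""
    return abs(query_frame - retrieved_frame) < tolerance


def nordland_recall_at_1(top1_frames: Sequence[int], tolerance: int = 10) -> float:
    """R@1 in percent; top1_frames[i] is the first gallery frame retrieved for query frame i."""
    hits = sum(
        is_correct_nordland(query, retrieved, tolerance)
        for query, retrieved in enumerate(top1_frames)
    )
    return 100.0 * hits / len(top1_frames)


if __name__ == "__main__":
    # Query frames 0..3; the third query retrieves a frame that is 28 frames away.
    print(nordland_recall_at_1([0, 5, 30, 3]))
```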

A.2 Train Dataset

Following the VPR classification methods EigenPlaces and CosPlace, we train on SF-XL [8], while other studies [1, 22, 33, 2] train on GSV-Cities [1] or on combinations of Pittsburgh30k [4] and MSLS [47, 4, 34, 49], which include a large mixture of different cities around the world (introducing higher variability). Note that, similar to EigenPlaces, our approach is designed for training on panoramas with heading information and requires slicing them into lateral and frontal views, which cannot be applied to other training datasets.

Appendix B Ablation Study

Here we conduct extensive experiments on two different datasets to ablate several key components of our EffoVPR method.

Re-ranking features: We explore various configurations for selecting features in the re-ranking stage. We first examine the choice of the layer from which features are extracted. Table S1 demonstrates that extracting features from layer $n-1$ yields the largest improvement in overall performance. Generally, re-ranking with any layer except the last improves results compared to omitting re-ranking entirely (i.e., relying solely on the global feature from the first stage). Next, among the Q, K, V components extracted from the chosen layer, we find that the Value set ($\mathcal{V}$) provides the most effective local features for re-ranking, as detailed in Table S3. In Table S2 we ablate the impact of our two thresholds, the attention-map threshold $T_1$ and the threshold $T_2$ on the local feature matching score that determines which matches are counted, showing that both are necessary.

Table S1: Ablation study on the choice of the layer for the re-ranking stage. We find layer $n-1$ to be optimal for re-ranking feature extraction. Notably, the last layer $n$ is ineffective and degrades global performance. Results (in %) are the R@1.
Dataset Global n-5 n-4 n-3 n-2 n-1 n
MSLS-val 90.9 90.3 90.9 92.0 92.3 92.8 88.2
Tokyo-24/7 97.5 98.1 98.1 98.1 98.7 98.7 97.1
Table S2: Ablation study on the impact of the thresholds. Results (in %) are the R@1. $T_1$ is the attention-map threshold and $T_2$ is the threshold on the countable local feature matching score.
Threshold Tokyo24/7 MSLS-val
no thr. 95.9 86.4
$T_1$ 97.1 91.5
$+T_2$ 98.7 92.8
Table S3: Ablation on the choice of the local features. Results (in %) are the R@1. Query, Key and Value are respectively $Q$, $K$, $V$ in Equation 1.
Dataset Query Key Value
Tokyo24/7 96.5 96.8 98.7
MSLS-val 89.7 90.1 92.8
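
The re-ranking score ablated above can be sketched as follows, using the values stated in the implementation details (cosine similarity, mutual nearest neighbors, $\mathcal{T}_1 = 0.05$, $\mathcal{T}_2 = 0.65$). This is a minimal pure-Python illustration, not the authors' implementation: the function names are hypothetical, the inputs are assumed to be per-patch feature vectors with one class-attention value per patch, and applying the attention filter to both the query and the gallery image is an assumption of this sketch.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """L2-normalize a vector so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _keep_attended(features: Sequence[Vector], attention: Sequence[float],
                   t1: float) -> List[List[float]]:
    """Keep (and normalize) only patches whose class attention exceeds t1."""
    return [_normalize(f) for f, a in zip(features, attention) if a > t1]


def match_score(q_feats: Sequence[Vector], q_attn: Sequence[float],
                g_feats: Sequence[Vector], g_attn: Sequence[float],
                t1: float = 0.05, t2: float = 0.65) -> int:
    """Count mutual nearest-neighbor patch pairs with cosine similarity above t2."""
    q = _keep_attended(q_feats, q_attn, t1)
    g = _keep_attended(g_feats, g_attn, t1)
    if not q or not g:
        return 0
    sim = [[sum(a * b for a, b in zip(qi, gj)) for gj in g] for qi in q]
    # Best gallery patch for every query patch, and vice versa.
    best_g = [max(range(len(g)), key=row.__getitem__) for row in sim]
    best_q = [max(range(len(q)), key=lambda i: sim[i][j]) for j in range(len(g))]
    return sum(
        1 for i, j in enumerate(best_g)
        if best_q[j] == i and sim[i][j] > t2
    )
```

In this sketch the score is a plain integer count, so candidates with more confident mutual matches rank higher; ties between candidates would need a tie-breaking rule.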

Number of Candidates to Re-rank: The second, re-ranking stage is applied to the top-$K$ candidates retrieved during the global stage. Although the common choice in the literature is $K=100$ (e.g. [34, 49, 10]), we explore different choices of $K$, as detailed in Table S4. We achieve SoTA results even with $K=5$. Note that in some cases a larger $K$ can introduce more “distractor” candidates, potentially decreasing performance. However, EffoVPR's SoTA performance is consistent across all tested values of $K$.

Table S4: Re-ranking ablation. $K$ indicates re-ranking over the top-$K$ results. We achieve SoTA results even with $K=5$. Bold values indicate SoTA results.
Top-K Pitts30k Tokyo24/7 MSLS-val Nordland SF-XL-Occ. SF-XL-Night SPED
K=5 94.2 97.8 92.4 95.3 59.2 61.6 93.4
K=10 94.2 98.1 92.2 95.3 59.2 61.2 92.9
K=15 94.1 98.4 92.3 95.3 60.5 60.3 93.1
K=20 94.0 98.4 92.4 95.3 59.2 60.3 93.1
K=50 93.9 98.7 92.7 95.2 57.9 60.9 93.2
K=100 93.9 98.7 92.8 95.0 59.2 61.2 93.2
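
The role of $K$ in the two-stage pipeline can be illustrated with a short sketch: only the first $K$ candidates of the global ranking are reordered by the local matching score, and the rest keep their first-stage order. The function name and the `score_fn` callback are hypothetical, and breaking ties by first-stage order is an assumption of this sketch.

```python
from typing import Callable, List, Sequence


def rerank_top_k(global_ranking: Sequence[int],
                 score_fn: Callable[[int], float],
                 k: int = 100) -> List[int]:
    """Reorder the top-k first-stage candidates by descending local score.

    global_ranking : gallery indices sorted by the global (first-stage) feature.
    score_fn       : maps a gallery index to its local matching score for the query.
    """
    head = list(global_ranking[:k])
    tail = list(global_ranking[k:])
    # sorted() is stable, so equal scores keep their first-stage order.
    head.sort(key=lambda gallery_index: -score_fn(gallery_index))
    return head + tail


if __name__ == "__main__":
    ranking = [7, 3, 9, 1, 4]
    scores = {7: 2, 3: 5, 9: 5, 1: 10, 4: 99}
    print(rerank_top_k(ranking, scores.__getitem__, k=3))
    print(rerank_top_k(ranking, scores.__getitem__, k=5))
```

Candidates outside the top $K$ can never reach rank 1, so the re-ranked R@1 is bounded by the first-stage Recall@K, while a larger $K$ exposes the matcher to more potential distractors.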

Choice of Trainable Layers: Table S5 compares several sets of trainable layers in our backbone model. We find that vanilla end-to-end fine-tuning of the entire model, including all layers, harms the performance of EffoVPR relative to the best configuration (86.1% vs. 90.9% R@1 on MSLS-val and 94.0% vs. 97.5% on Tokyo-24/7). We attribute this decline to the fact that the DINOv2 backbone was trained on significantly larger datasets than those typically used in VPR. We further find that training only the last five layers is a “sweet spot” that yields peak performance: either increasing or decreasing the number of trainable layers from this configuration leads to lower results.

Table S5: Ablation study on the number of trainable last layers. Results (in %) are the R@1 of the 1st stage. Note that the 0 column represents the zero-shot performance of DINOv2.
Dataset 0 1 2 3 4 5 6 all layers
MSLS-val 47.7 89.5 88.2 89.7 89.5 90.9 89.7 86.1
Tokyo-24/7 62.2 96.8 96.5 95.9 96.2 97.5 96.8 94.0

Appendix C Performance vs. Feature Compactness - Additional Results

In Figure S1 we present a performance comparison of our global feature (EffoVPR-G) versus feature dimensionality on more datasets. While the current leading methods achieve their performance using large features, EffoVPR demonstrates high performance even with an extremely compact feature size.

Fig. S1: Recall@1 performance of EffoVPR-G global feature versus feature dimensionality for more datasets.

Appendix D Visualizations

D.1 Additional Zero-Shot Visualizations

In Figure S2 we show more visualizations of the EffoVPR-ZS method. While the attention map of pre-trained DINOv2 does not focus on discriminative VPR elements, EffoVPR is able to bridge the gap in zero-shot through local feature matching. In the first row, the pre-trained attention map focuses mainly on temporary traffic signs and a distant advertisement and barely attends to the building; in the second row, it focuses mainly on the insignificant back of a traffic sign. Nevertheless, the EffoVPR method finds multiple local matches to the correct geo-tagged image in the gallery.

Fig. S2: Additional EffoVPR-ZS zero-shot visualizations.

D.2 Additional Re-ranking Visualizations

Figure S3 shows top-1 results that demonstrate the invariance of EffoVPR-R local feature matching in highly challenging scenes. From top left to bottom right: camera rotation, a nature scene, color variance across time (building renovation), tree matching, a challenging time-of-day change with matching of barely noticeable electric cables, and a significant night-to-day change.

Fig. S3: EffoVPR-R top-1 matching visualizations. For each matched pair, the left image is the query and the right is the top-1 result.

Appendix E Implementation details

We use ViT-L/14 as the backbone, initialized with the pre-trained weights of DINOv2 with registers [37, 16]. We train only the last five layers of the backbone, which proved most beneficial. We employ EigenPlaces' [11] group and class partitioning with its default hyper-parameters, using both lateral and frontal views, on the publicly available SF-XL street-view panorama dataset [8]. We use an AdamW optimizer for the backbone and an Adam optimizer for the classification heads, both with a constant learning rate of $1\times 10^{-5}$. We train EffoVPR with a batch size of 16 for 25 epochs on a single NVIDIA A100 node, and otherwise follow the EigenPlaces training recipe. We choose the best epoch on the SF-XL validation set, measured by Recall@1 of the global ranking. Since a ViT is independent of the input image size (provided the image can be divided into $14\times 14$ patches), we train on $224\times 224$ images to expedite training and evaluate on $504\times 504$ images. For benchmarking the global feature (EffoVPR-G), we report nearest-neighbor performance on the normalized output class token. In the re-ranking stage, we extract the $V$ self-attention facet from layer $n-1$ (with $n$ being the output layer), measure cosine similarity, filter the features by the class attention map with a threshold $\mathcal{T}_1 = 0.05$, and count only mutual nearest neighbors with a score above the threshold $\mathcal{T}_2 = 0.65$.
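
The two image sizes mentioned above translate into different numbers of patch tokens, which in turn bounds the number of local features available per image for re-ranking. The sketch below works through this arithmetic for a patch size of 14. The function name is hypothetical, and the count covers patch tokens only, excluding the class and register tokens.

```python
from typing import Tuple

PATCH_SIZE = 14  # patch size of the ViT-L/14 backbone


def patch_grid(height: int, width: int, patch: int = PATCH_SIZE) -> Tuple[int, int]:
    """Number of patch rows and columns for an image divisible by the patch size."""
    if height % patch or width % patch:
        raise ValueError(f"{height}x{width} is not divisible into {patch}x{patch} patches")
    return height // patch, width // patch


if __name__ == "__main__":
    for size in (224, 504):
        rows, cols = patch_grid(size, size)
        print(f"{size}x{size}: {rows}x{cols} grid, {rows * cols} patch tokens")
```

Training at 224×224 gives a 16×16 grid (256 patch tokens), whereas evaluation at 504×504 gives a 36×36 grid (1,296 patch tokens), which is roughly five times as many tokens per image.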

E.1 Additional information

E.1.1 Zero-shot

In evaluating AnyLoc [27], we address the significant memory requirements of its VLAD pooling algorithm by implementing an online clustering scheme. We observed that their recommendation for layer 31 outperformed layer 23. In our zero-shot evaluation, we assess the EffoVPR-ZS method by extracting the $V$ features from layer $n-2$ to re-rank the top-100 candidates retrieved with the first-stage global [CLS] feature. In this framework, our performance is bounded by the first-stage Recall@100, which is 99.2%, 96.8%, 81.5%, and 78.1% on Pitts30k, Tokyo24/7, MSLS-Val, and Nordland, respectively.

E.1.2 Benchmarking

Generally, for consistent benchmarking, we adhere to [10]. In addition, we report the results of other methods in accordance with the evaluation choices of SelaVPR[34] and CricaVPR[33], including the specific versions of trained models utilized. For the recent state-of-the-art methods SelaVPR, CricaVPR, and SALAD, we provide results from the original publications whenever available. When such results are not directly available, we utilize their code and published weights. Specifically for SelaVPR, which has two sets of weights (trained on Pitts30k and MSLS), we report the best-performing for each dataset.

E.2 Other

We measure the matching runtime of EffoVPR by averaging the runtime of the matching function on Tokyo 24/7.
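One simple way to obtain such an average is sketched below. The exact timing protocol is not specified here, so the helper name `average_runtime`, the warm-up step, and the pair-wise interface of the matching function are assumptions made for illustration.

```python
import time


def average_runtime(match_fn, pairs, warmup=1):
    """Return the mean wall-clock time (seconds) of match_fn over pairs.

    match_fn: callable taking (query, candidate); a stand-in for the
              matching function being timed.
    pairs:    non-empty list of (query, candidate) inputs.
    warmup:   number of untimed calls made first, so one-off setup costs
              do not skew the average.
    """
    for query, candidate in pairs[:warmup]:
        match_fn(query, candidate)

    start = time.perf_counter()
    for query, candidate in pairs:
        match_fn(query, candidate)
    elapsed = time.perf_counter() - start
    return elapsed / len(pairs)
```

If the matching function runs on a GPU, the device would additionally need to be synchronized before reading the clock, since kernel launches are typically asynchronous.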

Appendix F Additional Quantitative Results

For completeness, Table S6 presents the full results for datasets that were only partially reported in the main paper, as well as for some datasets that were omitted from it. Our method, EffoVPR, achieves SoTA performance on the majority of these datasets and remains competitive with the SoTA on the others.

Table S6: Comparison to SoTA on more datasets
SPED SF-R SF-XL-v1 SF-XL-v2 SF-XL-Occ.
Method R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
EigenPlaces 70.2 83.5 87.5 89.6 94.3 95.3 84.1 89.1 90.7 90.8 95.7 96.7 32.9 48.7 52.6
SelaVPR 88.6 95.1 97.2 88.5 92.0 93.0 74.9 80.7 82.1 89.3 95.7 96.3 35.5 47.4 55.3
CricaVPR 91.3 95.2 96.2 88.6 94.0 95.7 80.6 87.6 89.8 90.6 96.3 97.7 42.1 52.6 57.9
SALAD 92.1 96.2 96.5 92.3 95.7 96.8 88.6 93.5 94.4 94.8 97.3 98.3 51.3 65.8 68.4
EffoVPR 93.1 97.9 98.4 93.0 96.0 96.3 95.5 98.1 98.3 94.5 97.8 98.2 59.2 68.4 73.7
SF-XL-Night AmsterTime SVOX SVOX Night SVOX Overcast
Method R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
EigenPlaces 23.6 30.7 34.5 48.8 69.5 76.0 98.0 99.0 99.2 58.9 76.9 82.6 93.1 97.8 98.3
SelaVPR 38.4 50.9 55.4 55.2 72.6 78.0 97.2 98.7 99.0 89.4 95.5 96.6 97.0 99.1 99.3
CricaVPR 35.4 48.3 53.4 64.7 82.5 87.9 97.8 99.2 99.3 86.3 95.3 96.6 96.7 99.0 99.0
SALAD 46.6 59.0 62.2 58.8 78.9 84.2 98.2 99.3 99.4 95.4 99.3 99.4 98.3 99.3 99.3
EffoVPR 61.6 73.4 77.0 65.5 87.2 90.7 98.7 99.5 99.6 97.4 99.5 99.5 98.4 99.3 99.7
SVOX Rain SVOX Snow SVOX Sun St. Lucia Eynsham
Method R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
EigenPlaces 90.0 96.4 98.0 93.1 97.6 98.2 86.4 95.0 96.4 99.6 99.9 100.0 90.7 94.4 95.4
SelaVPR 94.7 98.5 99.1 97.0 99.5 99.5 90.2 96.6 97.4 99.8 100.0 100.0 90.6 95.3 96.2
CricaVPR 94.8 98.5 98.7 96.0 99.2 99.2 93.8 98.1 98.8 99.9 99.9 99.9 91.6 95.0 95.8
SALAD 98.5 99.7 99.9 98.9 99.7 99.8 97.2 99.4 99.7 100.0 100.0 100.0 91.6 95.1 95.9
EffoVPR 98.3 99.6 99.6 98.7 99.7 99.7 97.7 99.3 99.4 100.0 100.0 100.0 91.0 95.2 96.3