On the Estimation of Image-matching Uncertainty in Visual Place Recognition

Mubariz Zaffar
ME, TU Delft
The Netherlands
[email protected]
   Liangliang Nan
ABE, TU Delft
The Netherlands
[email protected]
   Julian F. P. Kooij
ME, TU Delft
The Netherlands
[email protected]
Abstract

In Visual Place Recognition (VPR) the pose of a query image is estimated by comparing the image to a map of reference images with known reference poses. As is typical for image retrieval problems, a feature extractor maps the query and reference images to a feature space, where a nearest neighbor search is then performed. However, until recently little attention has been given to quantifying the confidence that a retrieved reference image is a correct match. Highly confident but incorrect retrievals can lead to catastrophic failure of VPR-based localization pipelines. This work compares for the first time the main approaches for estimating the image-matching uncertainty, including the traditional retrieval-based uncertainty estimation, more recent data-driven aleatoric uncertainty estimation, and the compute-intensive geometric verification. We further formulate a simple baseline method, “SUE”, which unlike the other methods considers the freely-available poses of the reference images in the map. Our experiments reveal that a simple L2-distance between the query and reference descriptors is already a better estimate of image-matching uncertainty than current data-driven approaches. SUE outperforms the other efficient uncertainty estimation methods, and its uncertainty estimates complement the computationally expensive geometric verification approach. Future work on uncertainty estimation in VPR should consider the baselines discussed in this work.

1 Introduction

Figure 1: The Precision-Recall curves on the Pittsburgh dataset [4] for the three common categories of VPR uncertainty estimation methods (RUE, DUE, GV), and for our proposed baseline SUE which uniquely considers spatial locations of the top-K references. The global image descriptors [9] are fixed for all methods except BTL [50]. The only difference is the confidence given by each uncertainty estimation method to the best-matched reference descriptors for the corresponding queries. The legend lists the Area-under-the-Precision-Recall-curves. As GV methods are two to three orders of magnitude more computationally expensive than the others, they are plotted as dotted lines. Surprisingly, simple L2-distance in feature space is a better estimate of VPR uncertainty than recent deep learning-based uncertainty estimates. SUE outperforms all other efficient uncertainty estimation methods.

Visual Place Recognition (VPR) is the problem of identifying a previously visited place given a query camera image and a map of geo-tagged reference images [28]. It has applications in vehicle localization [57], 3D modeling [1], image search [46], and loop-closure in Simultaneous Localization and Mapping (SLAM) [28, 8].

VPR is typically approached as an image retrieval problem, transforming images into feature vectors in a latent feature space where an efficient nearest neighbor search compares the query to all references. The pose of the query image is then approximated to be the same as that of the retrieved nearest neighbor references. Since successful VPR requires a good image representation that is robust to viewpoint and/or appearance changes [16, 30, 28], the field has benefited from advances in deep representation learning.

However, two images with similar visual content could still originate from geographically far-apart areas, a concept referred to as perceptual-aliasing in VPR [16]. For example, images with mostly sky could match many locations on an outdoor map. This constitutes aleatoric uncertainty, i.e., inherent noise or ambiguity in the data which cannot be reduced, as opposed to epistemic uncertainty which could be addressed with more training data [23]. The close proximity of perceptually aliased images in the feature space can result in catastrophic failures. For instance, a highly confident false-positive from VPR could result in an incorrect loop closure in a SLAM pipeline, leading to misaligned maps [28, 8]. Reliable uncertainty estimation on the quality of the match is therefore key to avoiding such failures by, e.g., rejecting results above a certain uncertainty threshold. Moreover, uncertainty estimation can also be used to fuse multiple predictions in VPR ensemble methods [9].

From existing literature, we identify three categories of methods to estimate image-matching uncertainty in VPR: retrieval-based uncertainty estimation (RUE), data-driven aleatoric uncertainty estimation (DUE), and geometric verification (GV) by local feature matching. RUE: Traditionally in VPR, the L2-distance between the query and the best-matched reference in the feature space has been used as an estimate of uncertainty [35]. The ratio of L2-distance between the first and second nearest neighbour reference is also used [18]. DUE: On the other hand, several recent works, such as the Bayesian Triplet Loss [50] and the Self-teaching Uncertainty Estimation [9], have proposed to explicitly learn to predict the aleatoric uncertainty from the query’s image content only. GV: Another way to assert matching confidence is to test for consistent geometry among matched local features between the query and the best matching reference in a RANSAC loop [33].

Remarkably, none of the three categories exploit the spatial locations of matched images in the actual reference map, which we hypothesize can be an important source of information for estimating VPR matching uncertainty. To test this hypothesis, we formulate a new simple baseline, Spatial Uncertainty Estimation (SUE). SUE is a straightforward and efficient approach to estimating uncertainty for a query image’s match, using the spatial spread of the physical poses for the most similar references in the map as a proxy. A high spatial spread indicates perceptual aliasing leading to high matching uncertainty, while a low spread indicates a distinct area is matched. An overview of the sources of information employed by all categories of methods and by SUE is provided in Table 1.

While all categories of uncertainty estimation methods aim for the same task, i.e., rejecting false positives in VPR, previous evaluations did not include all categories, providing an incomplete picture of the state-of-the-art. This work therefore compares the three existing categories and SUE on a level playing field, to provide recommendations for future research, and insights on the strengths/weaknesses of each category. For instance, as the preview of the experimental results in Fig. 1 indicates, SUE outperforms other efficient methods (this and other experiments will be discussed in more detail in Section 4).

Category   Descriptors?   Poses?       Images?   Efficient?
RUE        Top-K          No           No        Yes
DUE        No             Only train   Yes       Yes
GV         No             No           Yes       No
SUE        Top-K          Top-K        No        Yes
Table 1: Overview of the sources of information needed by the current main categories for VPR uncertainty estimation, and by the proposed method SUE: the query/reference global image descriptors, the reference poses, or complete query/reference images. Efficiency refers to the inference time needed by each approach.

Concretely, our contributions are:

  1. A comparison of three different categories of uncertainty estimation methods in VPR.
  2. A new simple baseline method, SUE, that considers the spatial locations of the reference images, a source of information not used by existing categories.
  3. Since GV gives the best uncertainty estimates albeit at a higher computational cost, we investigate whether the other methods are complementary to GV.

2 Related work

Visual place recognition was first surveyed in the seminal work of Lowry et al. [28]. The three fundamental VPR challenges identified by Lowry et al. are viewpoint changes, appearance changes, and perceptual-aliasing.

The concept of matching images for VPR dates back to before the deep-learning era, when handcrafted methods were used to perform VPR [42, 44, 21, 12]. However, with the rise of deep learning, many deep learning-based methods were proposed to solve the first two challenges in VPR. A broad categorization of these methods can be done based on their underlying novelty, such as the use of a novel loss function [39, 26, 45], better training data [6, 2], new architectures [48, 56, 51], and new methods for feature aggregation [4, 37, 19, 3]. A number of benchmarks have been proposed in VPR, for example, the recent Deep Visual Geo-localization benchmark [7], VPR-Bench [52] and similar benchmarks in the image retrieval community [38, 41]. From these benchmarks, it is clear that the deep learning-based VPR techniques outperform handcrafted techniques by a significant margin on most datasets.

We focus on the third challenge identified in Lowry et al., i.e., perceptual aliasing, which arises from aleatoric uncertainty in the data. This challenge has received less attention in VPR literature compared to viewpoint and appearance changes. Most works in VPR use the distance (e.g., L2 or Cosine) in feature space between a query and the nearest neighbor as the uncertainty estimate [7], or the distance between the retrieved nearest neighbors [18]. Some more recent works model the aleatoric uncertainty in image retrieval, e.g., the Bayesian Triplet Loss (BTL) [50] and the Self-Teaching Uncertainty Estimation (STUN) [9]. Both BTL and STUN estimate the aleatoric uncertainty in the training data by representing images as distributions instead of point estimates in the feature space. Each image thus has an associated mean and variance for a feature descriptor.

Gronat et al. [17] treat VPR as a classification problem by training place-specific classifiers, one for each place, where each classifier naturally outputs a confidence estimate for the corresponding pose. Pion et al. [36] approximate the pose of the query image by aggregating the pose hypotheses from the top-retrieved nearest neighbors, weighing each hypothesis based on the distance in the feature space. The variance of the aggregated pose represents uncertainty over the pose space. Notably, this concept of pose uncertainty has been modeled in these existing works [36, 17, 53] and other related tasks such as classical Particle Filters [14], but, to the authors’ best knowledge, uncertainty estimates derived from the distribution of pose hypotheses have not yet been studied as a proxy for image-matching uncertainty.

Beyond global descriptors-based VPR, in local feature matching-based image retrieval the inlier count (also known as geometric verification) has been used as an estimate of confidence [33]. Zeisl et al. [55] perform 2D-to-3D local feature matching to estimate a distribution over the possible query poses. The work of [54] uses such an inlier count from local feature matching and combines it with the pose distribution of retrieved images to estimate the confidence of localization. Since local features can appear in similar geometric configurations (geometric burstiness) across unrelated images, [40] proposes to use the pose information to downweight such matches in the inlier count. However, retrieving images based on local feature descriptors is computationally expensive, whereas VPR instead only efficiently compares global image descriptors. (Footnote 1: We study the relation between geometric burstiness and SUE in the supplementary materials.)

Absolute Pose Regression (APR) directly regresses the absolute pose given a camera image, and has also considered pose uncertainty estimation. Some approaches to uncertainty-aware APR include CoordiNet [32], Bayesian PoseNet [22], and HydraNet [34]. Unlike VPR, APR approaches do not generalize to new environments. In this work, we focus on estimating the image-matching uncertainty for VPR.

3 Methodology

This section first introduces the task of uncertainty estimation for VPR. We then formalize VPR, and describe the three main categories of uncertainty estimation methods. Next, we formulate the proposed baseline approach, SUE, which unlike the other three categories uses the freely available reference pose information. Finally, we outline how we combine the different categories of methods with the computationally expensive geometric verification to investigate if the uncertainty estimates are complementary.

3.1 Uncertainty estimation in VPR

Typically VPR is considered as an image retrieval task: finding the most similar reference images to the query by Euclidean distance in some feature space. The poses associated with the images, however, distinguish VPR from other image retrieval tasks, such as web search, where a match is correct if its image content is judged to be the same as that of the query. In VPR we often instead refer to the location of the query and references to judge matches: a retrieved reference is only acceptable if its pose is within a maximum distance threshold of the (unknown) true pose of the query [7, 52, 16]. Ideally, the closest matches in the feature space thus also have the poses closest to the query pose. However, this is often not the case in VPR due to perceptual aliasing, a form of aleatoric uncertainty since it cannot be reduced by choosing a different feature encoder or by using more training data.

It is therefore desirable to obtain some uncertainty score $s_q$ for a query and the retrieved nearest neighbor, where a low score expresses confidence that the nearest neighbor is a correct match. A threshold $\tau$ on the score could then reject a query ($s_q > \tau$) for which the best match is at risk of being incorrect to prevent failures of the downstream application [16]. The objective of VPR uncertainty estimation is thus to score queries, such that queries with reliable matches can be distinguished from those with possible incorrect matches. Note that while an uncertainty estimation method could provide scores with an explicit probabilistic interpretation, this is not a strict requirement to apply an acceptance threshold.
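The rejection rule can be illustrated with a short sketch. It assumes NumPy, and the function names `accept_mask` and `precision_recall_at` are hypothetical. The precision and recall definitions used here (the fraction of accepted best matches that are correct, and the fraction of correct best matches that are accepted) are a common convention and an assumption, not necessarily the exact evaluation protocol of this work.

```python
import numpy as np

def accept_mask(scores, tau):
    """Accept a query when its uncertainty score s_q does not exceed tau."""
    return np.asarray(scores, dtype=float) <= tau

def precision_recall_at(scores, is_correct, tau):
    """Precision and recall of the accepted best matches at threshold tau.

    scores     : (Q,) uncertainty score s_q per query (lower = more confident)
    is_correct : (Q,) bool, True if the best match lies within the distance
                 threshold of the true query pose
    """
    accepted = accept_mask(scores, tau)
    is_correct = np.asarray(is_correct, dtype=bool)
    true_pos = np.sum(accepted & is_correct)
    # Assumed convention: precision is 1 when no query is accepted.
    precision = true_pos / accepted.sum() if accepted.any() else 1.0
    recall = true_pos / is_correct.sum() if is_correct.any() else 0.0
    return float(precision), float(recall)

# Toy example with four queries (values are made up for illustration).
scores = [0.2, 0.9, 0.4, 0.7]
is_correct = [True, False, True, True]
print(precision_recall_at(scores, is_correct, tau=0.5))
```

Sweeping `tau` over the observed scores traces out a precision-recall curve, which only requires that the scores rank queries sensibly, not that they are calibrated probabilities.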

3.2 Formalizing VPR

Given a set of reference images $\mathcal{I}$ with known poses $\mathcal{P}$, the goal of VPR is to find one or multiple reference images $I_i \in \mathcal{I}$ that match the place of a query image $I_q$. The unknown pose $p_q$ for the query $I_q$ can then be approximated from the poses of the matched references $p_i \in \mathcal{P}$, since correct matches should have been taken in the same area. The exact formulation of a pose generally depends on the localization source and the task, for example, 2D GPS coordinates for visual geo-localization [7], or 6D pose [25]. In this research, we follow a general task-independent formulation and only assume that a pose $p_i$ consists of 2D or 3D spatial coordinates in some global coordinate system.

In the offline map preparation phase of VPR, before accepting queries, a feature extractor $G$ is applied to every reference image $I_i \in \mathcal{I}$ to obtain $D$-dimensional reference feature descriptors $f_i = G(I_i)$. Usually $G$ is a trained neural network [30] or a handcrafted feature descriptor [13]. The resulting VPR map $\mathcal{M} = (\mathcal{R}, \mathcal{P})$ contains the reference feature descriptors set $\mathcal{R} = \{f_1, \cdots, f_N\}$, where each descriptor $f_i$ is associated with a corresponding pose $p_i \in \mathcal{P}$.

At test time, the same feature extractor $G$ is applied to the query image $I_q$, and its query descriptor $f_q = G(I_q)$ is compared to the reference descriptors in the map $\mathcal{M}$. This can be achieved through an efficient $K$-nearest neighbor lookup, considering the L2-distances $d_i = ||f_i - f_q||_2$ between each reference $i$ and the query. This gives an ordered list of $K$ nearest neighbor references $\mathcal{R}_{\textrm{nn}} = [f_{(1)}, \cdots, f_{(K)}]$, ranked by increasing distance $d_{(1)} \leq \cdots \leq d_{(K)}$ and with corresponding poses $\mathcal{P}_{\textrm{nn}} = [p_{(1)}, \cdots, p_{(K)}]$. Here we use the bracketed subscript $(j)$ to indicate the $j$-th item in the ranked order, i.e., $f_{(1)} = \textrm{argmin}_{f_i \in \mathcal{R}} ||f_i - f_q||_2$ is the descriptor with the smallest distance to the query in the feature space.

Each corresponding pose $p_{(i)} \in \mathcal{P}$ can be considered as a hypothesis to estimate the query’s true pose $p_q$, though usually only the pose of the best matching reference feature descriptor $f_{(1)}$ is considered as the VPR pose estimate $p'_q$ for the query, i.e., $p'_q = p_{(1)}$ [52]. We follow this best-match-based query pose estimation in this work. In benchmarks, a match is considered correct if $p'_q$ is ‘physically near’ to $p_q$. The threshold on what distance is still accepted as the same ‘place’ depends on the scale of each localization task [28].
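The retrieval step and the best-match pose estimate can be sketched as follows. This is a minimal illustration assuming NumPy: a brute-force search stands in for the efficient $K$-nearest neighbor lookup, the descriptors are taken as precomputed arrays (the feature extractor $G$ is not modeled), and the names `retrieve_top_k` and `is_correct_match` as well as the parameter `max_dist` are hypothetical.

```python
import numpy as np

def retrieve_top_k(f_q, ref_descriptors, ref_poses, k):
    """Brute-force K-nearest-neighbor lookup by L2-distance.

    f_q             : (D,) query descriptor
    ref_descriptors : (N, D) reference descriptors f_1..f_N
    ref_poses       : (N, 2) or (N, 3) reference poses p_1..p_N
    Returns the distances d_(1..K) in increasing order, the corresponding
    poses p_(1..K), and the indices of the retrieved references.
    """
    ref_descriptors = np.asarray(ref_descriptors, dtype=float)
    ref_poses = np.asarray(ref_poses, dtype=float)
    f_q = np.asarray(f_q, dtype=float)
    dists = np.linalg.norm(ref_descriptors - f_q, axis=1)
    order = np.argsort(dists, kind="stable")[:k]
    return dists[order], ref_poses[order], order

def is_correct_match(p_est, p_true, max_dist):
    """A match counts as correct if the estimated pose is within max_dist."""
    diff = np.asarray(p_est, dtype=float) - np.asarray(p_true, dtype=float)
    return bool(np.linalg.norm(diff) <= max_dist)

# Toy map with three references (values are made up for illustration).
refs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
poses = [[0.0, 0.0], [10.0, 0.0], [0.0, 50.0]]
d_nn, p_nn, idx = retrieve_top_k([0.9, 0.1], refs, poses, k=2)
p_est = p_nn[0]  # best-match pose estimate p'_q = p_(1)
```

For large maps, an approximate nearest neighbor index would typically replace the brute-force distance computation; the returned quantities stay the same.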

3.3 Current VPR uncertainty estimation categories

We now describe various representative uncertainty estimation methods for the three common categories.

Retrieval-based uncertainty estimation (RUE): Commonly, the matching uncertainty in VPR is considered proportional to the L2-distance from the best match $d_{(1)}$, so $s_q = d_{(1)}$ [7, 52], as this distance indicates relevant differences between the visual content of the query and match.

An alternative is to consider the distance ratio between the first and second nearest neighbor, $s_q = d_{(1)} / d_{(2)}$. This ratio is quite similar to the perceptual aliasing score (PA score) [18] and the false-positive rejection criterion in the popular local feature descriptor SIFT [27].
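Both retrieval-based scores follow directly from the sorted distances. The sketch below assumes NumPy; the function names and the small constant `eps`, which guards against division by zero, are assumptions added for illustration.

```python
import numpy as np

def rue_l2(d_sorted):
    """s_q = d_(1): distance to the best match."""
    return float(np.asarray(d_sorted, dtype=float)[0])

def rue_ratio(d_sorted, eps=1e-12):
    """s_q = d_(1) / d_(2): close to 1 when the two best matches are
    almost equally distant from the query, which suggests ambiguity."""
    d_sorted = np.asarray(d_sorted, dtype=float)
    return float(d_sorted[0] / (d_sorted[1] + eps))

# Distances to the three nearest references, sorted in increasing order.
d_nn = [0.3, 0.6, 0.9]
s_l2 = rue_l2(d_nn)
s_ratio = rue_ratio(d_nn)
```

Because the distances are sorted, the ratio lies between 0 and 1, with values near 1 marking queries whose best and second-best matches are hard to tell apart.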

Data-driven uncertainty estimation (DUE): State-of-the-art VPR encoders are typically deep neural networks trained on a labeled VPR dataset. The labeled training data contains the ground-truth poses $\mathcal{P}_{\textrm{train}}$ for the training references and query images $\mathcal{I}_{\textrm{train}}$. A deep encoder $G$ can be adapted to also predict the aleatoric uncertainty of matching a nearby pose, by learning from the training query image in $\mathcal{I}_{\textrm{train}}$ when an image is distinctive and obtains good pose matches within $\mathcal{P}_{\textrm{train}}$, and when not (e.g., images of trees, uniform walls, or sky). Methods in this category include the Bayesian Triplet Loss (BTL) [50], and STUN [9]. Note that the learned uncertainty is based on the training images and poses, not those in the test-time reference map $\mathcal{M}$.

In general, an uncertainty-aware encoder $(\bar{f}_i, \sigma^2_i) = G'(I_i)$ predicts for an image $I_i$ not only the expected feature $\bar{f}_i$, but also the variance in the feature space, i.e., $f_i \sim \mathbf{N}(\bar{f}_i, \sigma^2_i)$. The total variance in $\sigma^2_i$ can be used as a proxy for the image-matching uncertainty, $s_q = ||\sigma^2_i||_1$. The computational overhead of the deep network producing an additional output $\sigma^2_i$ is low.
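Reducing the predicted variance vector to a scalar score is a one-line operation. The sketch below assumes NumPy and takes the per-dimension variances as a given array; the uncertainty-aware encoder $G'$ itself is not modeled, and the function name `due_score` is hypothetical.

```python
import numpy as np

def due_score(sigma_sq):
    """Total predicted variance, i.e. the L1 norm of the variance vector.

    sigma_sq : (D,) per-dimension variances predicted for the query image
    """
    sigma_sq = np.asarray(sigma_sq, dtype=float)
    return float(np.abs(sigma_sq).sum())

# Placeholder variance vector standing in for a network output.
s_q = due_score([0.1, 0.2, 0.3])
```

Note that this score depends only on the query image, so it cannot react to which references happen to be present in the test-time map.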

Figure 2: In VPR, a query $q$ is compared in feature space to features $f_i \in \mathcal{R}$ of reference images with known poses. The nearest neighbors $f_{(1)}, \cdots, f_{(K)}$ are retrieved as matches. Left: The retrieved references $I_{(1)}, I_{(2)}, I_{(3)}$ share similar visual content with the query (walls, pillars, and blobs), but are geographically far apart, reflecting high uncertainty that the matched reference is correct. Right: For another query, the retrieved references are geographically close together, indicating low uncertainty.

Geometric verification (GV): Another way to estimate image-matching uncertainty is to compare the query and the best-matched reference image in more detail through local feature matching and geometric verification in a RANSAC loop, e.g., through the use of SIFT [27], DELF [33], and SuperPoint [15]. All the matched local features that satisfy a geometric transformation estimated from the randomly sampled set of matched local features between the query image and the reference image are considered inliers. The confidence is indicated by the number of inliers $c_{gv}$, which could be expressed as a matching uncertainty estimate, i.e., $s_q = -c_{gv}$. While geometric verification yields high-quality uncertainty estimates, such post-processing is computationally expensive compared to the other methods.
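The inlier-counting idea can be illustrated with a small NumPy-only RANSAC loop. This is a sketch under stated assumptions, not the pipeline used in the experiments: the matched keypoint coordinates are taken as given (local feature extraction and descriptor matching are not shown), a 2D affine transformation fitted to three random matches is used as the geometric model because the text does not specify one, and the function names, iteration count, and pixel tolerance are hypothetical.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src (M, 2) onto dst (M, 2)."""
    A = np.hstack([src, np.ones((len(src), 1))])      # (M, 3)
    params, *_ = np.linalg.lstsq(A, dst, rcond=None)  # (3, 2)
    return params

def ransac_inlier_count(src, dst, n_iters=200, inlier_tol=3.0, seed=0):
    """Largest number of matches consistent with a single affine transform.

    src, dst : (M, 2) pixel coordinates of matched local features in the
               query and in the best-matched reference image
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    if len(src) < 3:
        return 0  # an affine model needs at least three matches
    rng = np.random.default_rng(seed)
    A = np.hstack([src, np.ones((len(src), 1))])
    best = 0
    for _ in range(n_iters):
        sample = rng.choice(len(src), size=3, replace=False)
        params = fit_affine(src[sample], dst[sample])
        residuals = np.linalg.norm(A @ params - dst, axis=1)
        best = max(best, int(np.sum(residuals < inlier_tol)))
    return best

def gv_score(src, dst, **kwargs):
    """Uncertainty score s_q = -c_gv: more inliers means lower uncertainty."""
    return -ransac_inlier_count(src, dst, **kwargs)
```

The loop makes the cost structure visible: every query requires local feature matching plus many model fits, whereas the retrieval-based scores reuse distances that are already computed.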

3.4 Spatial uncertainty estimation (SUE) for VPR

We observe that the poses in the reference set $\mathcal{P}$ are a potentially powerful and freely available source of information at test time, which current uncertainty estimation methods do not exploit (more details will be presented in Sec. 3.1). The intuition behind this is illustrated in Fig. 2, where we show that if the nearest neighbors in the feature space are spatially far apart in their respective 2D/3D world coordinates, it indicates that such a feature suffers from perceptual aliasing: various areas in the test reference set contain the queried appearance, thus uncertainty on the pose estimate should be high. On the other hand, if the nearest neighbors in the feature space are also spatially close together, there is agreement among the matching pose hypotheses that the matched area is distinct within that given reference set, thus the uncertainty should be low.

To test this insight, we now formulate SUE, a purposefully simple image-matching uncertainty estimation method. Given the $K$-best retrieved references, fit a 2D or 3D multivariate Gaussian distribution $\mathbf{N}(\mu_p, \Sigma_p)$ over their 2D or 3D poses $\mathcal{P}_{\textrm{nn}}$,

$$\mu_p = \frac{1}{\sum_i w_{(i)}} \sum_{i=1}^{K} w_{(i)} \cdot p_{(i)}, \tag{1}$$

$$\Sigma_p = \frac{1}{\sum_i w_{(i)}} \sum_{i=1}^{K} w_{(i)} \cdot (p_{(i)} - \mu_p)(p_{(i)} - \mu_p)^{\top}, \tag{2}$$

where the relative contribution $w_{(i)}$ of the $i$-th best reference pose $p_{(i)}$ decreases as its L2-distance $d_{(i)}$ to the query in the feature space increases,

$$w_{(i)} = e^{-\lambda \cdot d_{(i)}}, \quad \textrm{where} \quad d_{(i)} = ||f_q - f_{(i)}||_2. \tag{3}$$

The total variance across the spatial pose dimensions could then serve as a proxy for image-matching uncertainty, i.e., $s_q = \textrm{trace}(\Sigma_p)$.

The hyper-parameter $\lambda$ controls the non-linear relative contribution of a pose $p_i$ for the nearest neighbor $f_{(i)} \in \mathcal{R}_{\textrm{nn}}$ given its distance $d_{(i)}$ in the feature space. This hyper-parameter can be optimized on training data, though our experiments will show that its choice is remarkably robust across various real-world benchmark datasets.
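Eqs. (1)-(3) and the trace-based score can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `sue_score` and the default `lam=1.0` are placeholders, and the shift by the smallest distance is an added numerical-stability step that leaves the normalized weights, and hence Eqs. (1) and (2), unchanged.

```python
import numpy as np

def sue_score(d_nn, p_nn, lam=1.0):
    """Spatial Uncertainty Estimation following Eqs. (1)-(3).

    d_nn : (K,) L2-distances d_(i) between the query and its K nearest references
    p_nn : (K, 2) or (K, 3) poses p_(i) of those references
    lam  : weighting hyper-parameter lambda (the default is a placeholder)
    Returns s_q = trace(Sigma_p), the weighted mean mu_p, and Sigma_p.
    """
    d_nn = np.asarray(d_nn, dtype=float)
    p_nn = np.asarray(p_nn, dtype=float)
    # Eq. (3); subtracting d_nn.min() rescales all weights by the same
    # factor, which cancels in the normalization below (assumed addition).
    w = np.exp(-lam * (d_nn - d_nn.min()))
    w_sum = w.sum()
    mu = (w[:, None] * p_nn).sum(axis=0) / w_sum            # Eq. (1)
    centered = p_nn - mu
    sigma = (w[:, None] * centered).T @ centered / w_sum    # Eq. (2)
    return float(np.trace(sigma)), mu, sigma

# Equal feature distances, so all three references get the same weight.
d_nn = [0.5, 0.5, 0.5]
s_close, _, _ = sue_score(d_nn, [[0, 0], [1, 0], [0, 1]])      # compact cluster
s_far, _, _ = sue_score(d_nn, [[0, 0], [100, 0], [0, 100]])    # spread-out matches
```

In this toy case the spread-out configuration yields a score larger by a factor of $100^2$, since the trace scales with the squared spatial extent of the retrieved poses. The only inputs are the distances and poses of the top-$K$ references, both of which are already available after retrieval.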

3.5 Complementing geometric verification

To study to what extent SUE’s (or another method’s) $s_q$ provides information not captured by the $c_{gv}$ metric from geometric verification, we treat both scores as a 2D feature vector and train a classifier to predict if a best-matched reference should be accepted as a true-positive, or rejected as a false-positive. The regular rejection threshold is extended from a single score ($s_q > \tau$) to a linear weighted sum of both scores ($s_q / \tau_1 + c_{gv} / \tau_2 > 1$), by the use of a regular linear Support Vector Machine (SVM) as a classifier.
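The link between a linear SVM and the two-threshold rule can be made explicit with a short derivation. It is a sketch under assumptions: $w_1$, $w_2$ and $b$ are hypothetical names for the learned SVM weights and bias, the positive side of the decision function $g$ is taken to mean "reject", $w_1$ and $w_2$ are non-zero, and $b < 0$ so that dividing by $-b$ preserves the direction of the inequality.

```latex
\begin{align*}
g(s_q, c_{gv}) = w_1 s_q + w_2 c_{gv} + b &> 0 \\
\Leftrightarrow \quad \frac{w_1}{-b}\, s_q + \frac{w_2}{-b}\, c_{gv} &> 1 \\
\Leftrightarrow \quad \frac{s_q}{\tau_1} + \frac{c_{gv}}{\tau_2} &> 1,
\qquad \tau_1 = -\frac{b}{w_1}, \quad \tau_2 = -\frac{b}{w_2}.
\end{align*}
```

Under these assumptions, and since a larger inlier count indicates higher confidence, one would expect $w_2 < 0$ and therefore a negative $\tau_2$, so that additional inliers make rejection less likely.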

4 Experiments

We first present the setup for our experiments. Then, we compare the performance of all the image-matching uncertainty estimation methods on multiple benchmark datasets. Next, we test if the methods are complementary to geometric verification. Finally, we present an ablation over the hyper-parameters of SUE and provide a discussion.

| Method | Pitts. ↑ | Sanfr. ↑ | Stluc. ↑ | Eyn. ↑ | MSLS ↑ | Nordland ↑ | Average ↑ | Time (msec) ↓ |
|---|---|---|---|---|---|---|---|---|
| (RUE) L2-distance | 0.87 | 0.76 | 0.79 | 0.87 | 0.64 | 0.18 | 0.69 | 0.05 |
| (RUE) PA-Score [18] | 0.90 | 0.65 | 0.77 | 0.88 | 0.68 | 0.21 | 0.68 | 0.05 |
| (DUE) BTL [50] | 0.44 | 0.17 | 0.34 | 0.45 | 0.21 | 0.07 | 0.28 | 0.20 |
| (DUE) STUN [9] | 0.79 | 0.57 | 0.66 | 0.71 | 0.44 | 0.05 | 0.54 | 0.10 |
| SUE | 0.94 | 0.84 | 0.88 | 0.93 | 0.77 | 0.26 | 0.77 | 1.08 |
| (GV) SIFT-RANSAC [27] | 0.92 | 0.89 | 0.93 | 0.96 | 0.70 | 0.15 | 0.76 | 129 |
| (GV) DELF-RANSAC [33] | 0.97 | 0.92 | 0.97 | 0.95 | 0.95 | 0.84 | 0.93 | 1587 |
| (GV) Super-RANSAC [15] | 0.95 | 0.95 | 0.97 | 0.96 | 0.87 | 0.50 | 0.87 | 848 |

Table 2: The AUC-PR of all the compared methods. Higher AUC-PR is better, and best is in bold. The bottom rows are the computationally expensive geometric verification methods. The last column lists the time (msec) to give an uncertainty estimate for a single query image.

4.1 Experimental setup

This section describes the datasets, baselines, evaluation metrics, and implementation details of our work.

Datasets: We use six public VPR datasets in this work: Pittsburgh-250k [4], Sanfrancisco [10, 47], Stlucia [31], Eynsham [11], MSLS [49] and Nordland [43]. Details of these datasets and their respective ground-truths are given in [5].

Baselines: Our primary baselines for uncertainty estimation include the L2-distance in feature space $d_q$, the perceptual aliasing score (PA score [18]), the Bayesian Triplet Loss (BTL) [50] and STUN [9]. As the code for BTL is not open-source, we implement it following the pseudo-code and the network details provided in the original paper.

For geometric verification, we test three types of local feature descriptors, namely the handcrafted SIFT [27], the deep-learning-based DELF [33], and SuperPoint [15], which we refer to as SIFT-RANSAC, DELF-RANSAC and Superpoint-RANSAC, respectively.

Evaluation metrics: Precision-recall (PR) curves have been widely used in VPR for estimating the retrieval quality [52]. However, they can also be used to evaluate the uncertainty estimates in VPR, as is common in existing uncertainty estimation tasks in deep learning [29, 20]. We choose PR curves over the Receiver Operating Characteristic (ROC) curve because of the absence of true-negatives in the employed VPR datasets. Given a fixed list of retrieved images, the PR curves reflect which technique has the better uncertainty estimates $s_q$. A technique that can perfectly separate true-positives (TP) from false-positives (FP), given the uncertainty estimates, achieves an Area-under-the-Precision-Recall-Curve (AUC-PR) of 1.
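
The evaluation above can be sketched as follows. Queries are accepted in order of increasing uncertainty, and precision and recall are recorded after each acceptance. The step-wise integration used here is an assumption, since the text does not state how the area is computed, and the sketch assumes at least one correct match and ignores ties in the scores. All names are illustrative.

```python
def pr_curve(uncertainties, is_correct):
    """Precision/recall pairs obtained by sweeping a threshold on s_q.

    Queries are accepted in order of increasing uncertainty. Recall is the
    fraction of all correct matches that have been accepted so far.
    """
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    total_tp = sum(1 for c in is_correct if c)
    points, tp = [], 0
    for n_accepted, i in enumerate(order, start=1):
        tp += 1 if is_correct[i] else 0
        points.append((tp / n_accepted, tp / total_tp))  # (precision, recall)
    return points

def auc_pr(points):
    """Step-wise area under the curve (average-precision style)."""
    area, prev_recall = 0.0, 0.0
    for precision, recall in points:
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area

# Toy example: 1 marks a correct best match, 0 an incorrect one.
labels = [1, 1, 0, 1, 0]
good_scores = [0.1, 0.2, 0.9, 0.3, 0.8]  # errors get the highest uncertainty
bad_scores = [0.9, 0.8, 0.1, 0.7, 0.2]   # errors get the lowest uncertainty
auc_good = auc_pr(pr_curve(good_scores, labels))
auc_bad = auc_pr(pr_curve(bad_scores, labels))
```

Note that the retrieved matches are identical in both cases. Only the ordering induced by the uncertainty scores differs, which is exactly what the metric is meant to isolate.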

For the combination of uncertainty estimates with geometric verification, the task is formulated as binary classification and we use accuracy as an evaluation metric based on the ground-truth true-positives and false-positives [5].

Implementation details: For SUE and all the other baselines except BTL, we use a ResNet-50 backbone with GeM pooling trained in a self-teaching manner in [9] on the training split of the Pittsburgh dataset. Each feature vector $f_i$ is 2048-dimensional. For BTL, the same backbone and training data are used, but with the training procedure specified in the original BTL paper [50]. For DELF and SuperPoint, the implementations are open-sourced by the respective authors, and the default settings are employed. For SIFT-RANSAC we use the OpenCV implementation with the number of extracted features set to 5000, the Lowe test ratio to 0.6, and the number of RANSAC iterations to 1000.

The hyper-parameters in SUE are fine-tuned only on the Pittsburgh dataset and then fixed as $\lambda = 350$ and $K = 10$ for all datasets and experiments. An ablation over these parameters is given later in Section 4.4. The SVM is trained with stochastic gradient descent with hinge loss and an L1-penalty, and a maximum of 1000 training iterations.
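
For illustration, a linear SVM of this kind can be fitted with a few lines of stochastic sub-gradient descent. The sketch below is not the authors' implementation: the learning rate, the penalty strength, the random seed and the toy data are all assumptions, and a library optimizer would normally be used instead.

```python
import random

def train_linear_svm(features, labels, epochs=1000, lr=0.01, l1=1e-4, seed=0):
    """Linear SVM fitted by stochastic sub-gradient descent.

    features: list of 2D score vectors, e.g. (s_q, c_gv).
    labels: +1 for a true-positive match, -1 for a false-positive match.
    Minimizes the hinge loss plus an L1 penalty on the weights.
    """
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            # The hinge loss is active only when the margin is below one.
            violated = y * (w[0] * x[0] + w[1] * x[1] + b) < 1.0
            for k in range(2):
                # Sub-gradient of the L1 penalty is the sign of the weight.
                grad = l1 * ((w[k] > 0) - (w[k] < 0))
                if violated:
                    grad -= y * x[k]
                w[k] -= lr * grad
            if violated:
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return +1 (accept the match) or -1 (reject it)."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# Toy scores in [0, 1]: accepted matches have low s_q and high c_gv.
X = [(0.10, 0.90), (0.20, 0.80), (0.15, 0.70),
     (0.80, 0.10), (0.90, 0.30), (0.70, 0.20)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
```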

4.2 Performance comparison

We first compare all the uncertainty estimation methods formulated in this work, both qualitatively and quantitatively, and in terms of their computational overhead.

Area-under-the-Precision-Recall-curves: The AUC-PR values for all the methods on all the datasets are summarized in Table 2. SUE outperforms other efficient methods by a clear margin, even on the Pittsburgh dataset which was used for training STUN and BTL. It is also important to note that a basic L2-distance-based uncertainty already outperforms BTL and STUN. Moreover, geometric verification outperforms all other uncertainty estimates although SUE achieves comparable performance. The precision-recall curves for the Pittsburgh dataset are shown in Fig. 1, and those for the remaining datasets are provided in the accompanying supplementary materials.

Computational requirements: We further report the time taken to compute the GV confidence $s_q$ and the uncertainty estimates in Table 2. Although GV gives useful uncertainty estimates, the high computational cost of these GV methods may be prohibitive for real-time online applications. Our implementation of SUE is about three orders of magnitude faster than GV using DELF-RANSAC.

Qualitative results: To obtain insight into how the different methods interpret the visual content in query images and what they are sensitive to, we show in Fig. 3 examples of the most and the least uncertain query images for different methods in the Pittsburgh dataset. While all methods usually consider feature-rich and distinctive buildings as least uncertain for VPR, differences between the methods lie in the most uncertain images. Highly saturated test images are considered most uncertain by L2-distance-based uncertainty because such saturation did not exist during the reference traversals of the same scene. On the other hand, STUN considers images of trees and walls that usually contribute to perceptual aliasing as the most uncertain for VPR. SUE considers traffic squares and common building patterns as the most uncertain. Note that because SUE uses the freely available pose information in the test reference set, whether a traffic square or a building is considered uncertain is specific to this test reference set and not due to a generally applicable visual property.

Figure 3: Examples of the two least and the two most uncertain query images with the corresponding nearest neighbor on the Pittsburgh dataset. The colors/symbols indicate whether the retrieved image is a correct match.
Figure 4: Two queries and their nearest neighbor reference images that illustrate cases where SUE outperforms other methods. Ideally a method assigns high uncertainty to the mismatched query and low uncertainty to the correct match, as SUE does here.

We further show in Fig. 4 two queries that illustrate failure cases of RUE and DUE in comparison to SUE. Images of walls generally contribute to high aleatoric uncertainty (DUE) and are closer together in the feature space in terms of L2-distance (RUE). However, we note that the query in Fig. 4 Top is correctly matched since only a unique wall with this pattern exists in the test reference set. SUE and L2-distance correctly give this query a low uncertainty, but STUN fails. The query image in Fig. 4 Bottom is given a lower uncertainty by L2-distance than by STUN and SUE. This is because images with large portions of sky contribute to aleatoric uncertainty but are close in terms of the feature-space L2-distance. This query is mismatched and shows where L2-distance-based uncertainty fails in comparison to STUN and SUE.

Figure 5: The relation between geometric verification uncertainty (x-axis) and the L2/STUN/SUE uncertainty (y-axis) on the Pittsburgh dataset [4]. Each point represents a query, with blue indicating a correct match, and red otherwise. The linear SVM boundaries are shown as black lines, while the dashed lines are the SVM margins. Scores have been linearly scaled to the $[0, 1]$ range based on the min/max value in the training data, and for better visualization the vertical scale is in log-space, hence the SVM boundaries appear non-linear. The class distributions in the right-most plot reveal that SUE complements geometric-verification, especially when the latter has low confidence.

4.3 Complementing geometric verification

Finally, we test if efficient uncertainty estimation can complement geometric verification, as outlined in Sec. 3.5. We show in Fig. 5 the relation between the different types of uncertainties with the uncertainty from geometric verification. As we note from Table 2, STUN outperforms BTL, and L2-distance is on average better than the PA-score, thus we only combine STUN, L2-distance, and SUE with geometric-verification for this analysis.

SUE provides complementary performance by giving low uncertainty $s_q$ to images that are correctly matched but which were given high GV uncertainty. Some of these complementary queries are shown in Fig. 6, where it can be seen that these queries are images that are generally difficult to match with local feature descriptors, such as facades, trees, and other repetitive features within the image [24]. We also show the linear boundaries learned by the SVM to classify between true-positives and false-positives. The classification accuracy of the different methods is reported in Table 3.

Figure 6: Correctly matched queries that are given high uncertainty by DELF-RANSAC and low uncertainty by SUE.
| Method | Pitts. | San. | Stlu. | Eyn. | MSLS |
|---|---|---|---|---|---|
| Superpoint | 85.1 | 53.2 | 76.4 | 67.5 | 36.9 |
| DELF | 86.0 | 86.6 | 85.3 | 78.3 | 80.2 |
| L2-distance | 75.7 | 57.3 | 56.2 | 67.7 | 36.8 |
| STUN | 74.0 | 54.0 | 58.0 | 67.6 | 37.4 |
| SUE | 78.9 | 70.7 | 72.8 | 77.3 | 46.0 |
| DELF + L2-distance | 85.7 | 86.1 | 82.3 | 77.3 | 72.0 |
| DELF + STUN | 85.4 | 81.6 | 80.1 | 75.0 | 68.2 |
| DELF + SUE | 87.1 | 89.6 | 88.7 | 82.1 | 73.4 |

Table 3: Binary classification accuracy (%) given the uncertainty estimates of various methods, using a linear SVM trained only on the Pittsburgh dataset. The combination DELF + SUE generalizes better than the baseline combinations. The exception is the MSLS dataset, where DELF + SUE is still better than the other combinations but less accurate than DELF alone, indicating that the SVM boundaries learned from Pittsburgh are not the best there.

4.4 Ablation study

SUE requires two hyper-parameters, the number of nearest neighbors $K$ and the decay parameter $\lambda$ that controls the relative contribution of the poses of the nearest neighbors. We show the ablation over these parameters in Fig. 7 by plotting the corresponding AUC-PR values for all datasets given a set of values for each parameter. The trend remains primarily the same across all datasets. We note that the AUC-PR increases by considering more nearest neighbors but the curves mostly plateau after $K = 5$, since poses from low-ranked neighbors contribute less to the overall pose hypothesis. For $\lambda$, we see that the range 200 to 400 is generally stable and gives reliable uncertainty estimates. We also note here (with details in the appendix) that SUE generalizes to different backbones (CosPlace [6]), and that exponential weighing of SUE in Equation (2) performs better than uniform weighing (an average AUC of 0.87 vs 0.70).
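
The plateau in $K$ can be read off from the weighting itself. Assuming exponentially decaying weights $w_{(i)} \propto e^{-\lambda d_{(i)}}$, the weight of the $i$-th neighbor relative to the best match depends only on the gap in feature distance. The gap of 0.01 below is a hypothetical value chosen for illustration:

```latex
\frac{w_{(i)}}{w_{(1)}} = e^{-\lambda\,(d_{(i)} - d_{(1)})},
\qquad
\lambda = 350,\; d_{(i)} - d_{(1)} = 0.01
\;\Rightarrow\;
\frac{w_{(i)}}{w_{(1)}} = e^{-3.5} \approx 0.03 .
```

Under this assumption, a neighbor that is only slightly further away in feature space already contributes about 3% of the weight of the best match, so adding further low-ranked neighbors changes the weighted pose variance very little.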

Figure 7: Effect of changing SUE’s hyper-parameters $K$ and $\lambda$ on the AUC-PR. For each curve, the other fixed hyper-parameter is chosen as $K = 10$ or $\lambda = 350$.

4.5 Discussion

We can now make several recommendations for estimating the image-matching uncertainty in VPR. First, future work evaluating image-matching uncertainty estimation should include diverse baselines such as SUE, even if they are simple. As our intra-category comparison revealed, even a common L2-distance-based image-matching uncertainty estimation may outperform data-driven techniques. Second, aleatoric uncertainty from training data does not necessarily generalize to the test data, so learning-based approaches should consider that perceptual aliasing is not just a property of the image content, but also of the reference map at test time. Referring back to the example of the sky in images being ambiguous for an outdoor map, in an indoor map containing just one open-air patio such images with sky might instead be considered distinctive for their location. Third, while GV gives the best uncertainty estimates at the expense of high computational needs, it is still susceptible to aleatoric uncertainty within the image, as repetitive structures, trees, and walls may also lead to incorrect matches of local features. In VPR, GV methods can still benefit from complementary uncertainty estimates provided by other methods, such as SUE.

We also note some potential limitations of SUE. SUE may underestimate the uncertainty if $K$ is too small to retrieve aliased references from multiple locations. Selecting $K$ for maps with mixed scene depths can therefore be challenging. Images in areas with low scene depth will already be perceptually distinct at small spatial offsets, whereas at high scene depth even images further apart may suffer from perceptual aliasing. A $K$ that suffices for small scene depths could be too small for areas with high scene depths. This could be mitigated by dynamically incrementing $K$ until $w_{(K)}$ becomes nearly zero. Now consider reference locations A and B which are perceptually aliased, i.e. all their image descriptors are similar. If A has 1000 references and B has one, even with $K \geq 1001$, SUE will always be confident about queries from either A or B as nearly all retrieved matches are spatially close. The high coverage of A over B thus presents an unwanted confidence bias, unless the chance of visiting A over B at test time is also $1000\times$ higher. Nevertheless, we have shown that despite these assumptions SUE performs well on many real-world datasets. We study this more in depth in the supplementary materials, and there also present a possible solution.
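
The suggested mitigation of growing $K$ until $w_{(K)}$ becomes nearly zero could be sketched as follows. The sketch measures the weight relative to the best match, $e^{-\lambda (d_{(K)} - d_{(1)})}$, which is an assumption, and the cut-off `eps` and the bound `k_max` are hypothetical parameters that are not specified in the text.

```python
import math

def choose_k(sorted_distances, lam=350.0, eps=1e-3, k_max=None):
    """Smallest K whose K-th neighbor has a relative weight below eps.

    sorted_distances: ascending feature-space distances d_(1) <= d_(2) <= ...
    Returns k_max (or the list length) if the weights never fall below eps.
    """
    n = len(sorted_distances)
    limit = n if k_max is None else min(k_max, n)
    d_best = sorted_distances[0]
    for k in range(1, limit + 1):
        rel_weight = math.exp(-lam * (sorted_distances[k - 1] - d_best))
        if rel_weight < eps:
            return k
    return limit

# A sharp drop after three neighbors stops the search at K = 4, whereas many
# near-identical (aliased) neighbors keep K growing to the end of the list.
distinct = [0.40, 0.405, 0.41, 0.60, 0.61, 0.62]
aliased = [0.40, 0.40, 0.405, 0.405, 0.41, 0.41]
k_distinct = choose_k(distinct)
k_aliased = choose_k(aliased)
```

In the aliased case the search only ends at the bound, which reflects the limitation discussed above: a fixed small $K$ would have missed part of the aliased set.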

5 Conclusions

We have compared different approaches for estimating the image-matching uncertainty in VPR, which provided (surprising) insights into this task, e.g. existing methods that learn aleatoric uncertainty from the training dataset often do not generalize well to the reference map at test time, and the common L2-distance in the feature space can be a more reliable indicator of matching uncertainty. We have shown that matching uncertainty in VPR is tightly related to the reference set at test time. Our new baseline SUE uniquely considers the spatial locations of the references, and outperforms all but the computationally expensive geometric verification. Its uncertainty estimates complement those of geometric verification. The choices for SUE’s hyper-parameters generalize for most queries across the tested datasets. We made recommendations for future research in this area.

Acknowledgement. This work was supported by the 3D Urban Understanding Lab established under the TU Delft AI Initiative, and the EU Horizon 2020 programme under grant number 964505 (Epistemic AI).

References

  • Agarwal et al. [2011] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011.
  • Ali-bey et al. [2022] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. Gsv-cities: Toward appropriate supervised visual place recognition. Neurocomputing, 513:194–203, 2022.
  • Ali-bey et al. [2023] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. Mixvpr: Feature mixing for visual place recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2998–3007, 2023.
  • Arandjelovic et al. [2016] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5297–5307, 2016.
  • Berton et al. [2021] Gabriele Berton, Carlo Masone, Valerio Paolicelli, and Barbara Caputo. Viewpoint invariant dense matching for visual geolocalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12169–12178, 2021.
  • Berton et al. [2022a] Gabriele Berton, Carlo Masone, and Barbara Caputo. Rethinking visual geo-localization for large-scale applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4878–4888, 2022a.
  • Berton et al. [2022b] Gabriele Berton, Riccardo Mereu, Gabriele Trivigno, Carlo Masone, Gabriela Csurka, Torsten Sattler, and Barbara Caputo. Deep visual geo-localization benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5396–5407, 2022b.
  • Cadena et al. [2016] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid, and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on Robotics, 32(6):1309–1332, 2016.
  • Cai et al. [2022] Kaiwen Cai, Chris Xiaoxuan Lu, and Xiaowei Huang. STUN: Self-teaching uncertainty estimation for place recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6614–6621. IEEE, 2022.
  • Chen et al. [2011] David M Chen, Georges Baatz, Kevin Köser, Sam S Tsai, Ramakrishna Vedantham, Timo Pylvänäinen, Kimmo Roimela, Xin Chen, Jeff Bach, Marc Pollefeys, et al. City-scale landmark identification on mobile devices. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pages 737–744. IEEE, 2011.
  • Cummins [2009] Mark Cummins. Highly scalable appearance-only slam-fab-map 2.0. In Proceedings of the Robotics: Sciences and Systems (RSS) Conference, 2009.
  • Cummins and Newman [2008] Mark Cummins and Paul Newman. Fab-map: Probabilistic localization and mapping in the space of appearance. The International Journal of Robotics Research, 27(6):647–665, 2008.
  • Dalal and Triggs [2005] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pages 886–893. IEEE, 2005.
  • Dellaert et al. [1999] Frank Dellaert, Wolfram Burgard, Dieter Fox, and Sebastian Thrun. Using the condensation algorithm for robust, vision-based mobile robot localization. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pages 588–594. IEEE, 1999.
  • DeTone et al. [2018] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In IEEE International Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.
  • Garg et al. [2021] Sourav Garg, Tobias Fischer, and Michael Milford. Where is your place, visual place recognition? In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021.
  • Gronat et al. [2013] Petr Gronat, Guillaume Obozinski, Josef Sivic, and Tomas Pajdla. Learning and calibrating per-location classifiers for visual place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 907–914, 2013.
  • Hausler et al. [2021a] Stephen Hausler, Tobias Fischer, and Michael Milford. Unsupervised complementary-aware multi-process fusion for visual place recognition. arXiv preprint arXiv:2112.04701, 2021a.
  • Hausler et al. [2021b] Stephen Hausler, Sourav Garg, Ming Xu, Michael Milford, and Tobias Fischer. Patch-NetVLAD: Multi-scale fusion of locally-global descriptors for place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition, pages 14141–14152, 2021b.
  • Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of the International Conference on Learning Representations, 2016.
  • Ho and Newman [2007] Kin Leong Ho and Paul Newman. Detecting loop closure with scene sequences. International Journal of Computer Vision, 74(3):261–286, 2007.
  • Kendall and Cipolla [2016] Alex Kendall and Roberto Cipolla. Modelling uncertainty in deep learning for camera relocalization. In IEEE International Conference on Robotics and Automation (ICRA), pages 4762–4769. IEEE, 2016.
  • Kendall and Gal [2017] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017.
  • Knopp et al. [2010] Jan Knopp, Josef Sivic, and Tomas Pajdla. Avoiding confusing features in place recognition. In Proceedings of the European Conference on Computer Vision, pages 748–761. Springer, 2010.
  • Laskar et al. [2017] Zakaria Laskar, Iaroslav Melekhov, Surya Kalia, and Juho Kannala. Camera relocalization by computing pairwise relative poses using convolutional neural network. In IEEE International Conference on Computer Vision Workshops, pages 929–938, 2017.
  • Leyva-Vallina et al. [2021] María Leyva-Vallina, Nicola Strisciuglio, and Nicolai Petkov. Generalized contrastive optimization of siamese networks for place recognition. arXiv preprint arXiv:2103.06638, 2021.
  • Lowe [2004] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • Lowry et al. [2015] Stephanie Lowry, Niko Sünderhauf, Paul Newman, John J Leonard, David Cox, Peter Corke, and Michael J Milford. Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1–19, 2015.
  • Malinin and Gales [2018] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. Advances in Neural Information Processing Systems, 31, 2018.
  • Masone and Caputo [2021] Carlo Masone and Barbara Caputo. A survey on deep visual place recognition. IEEE Access, 9:19516–19547, 2021.
  • Milford and Wyeth [2008] Michael J Milford and Gordon F Wyeth. Mapping a suburb with a single camera using a biologically inspired slam system. IEEE Transactions on Robotics, 24(5):1038–1053, 2008.
  • Moreau et al. [2022] Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle. Coordinet: uncertainty-aware pose regressor for reliable vehicle localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2229–2238, 2022.
  • Noh et al. [2017] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3456–3465, 2017.
  • Peretroukhin et al. [2019] Valentin Peretroukhin, Brandon Wagstaff, Matthew Giamou, and Jonathan Kelly. Probabilistic regression of rotations using quaternion averaging and a deep multi-headed network. arXiv preprint arXiv:1904.03182, 2019.
  • Piasco et al. [2018] Nathan Piasco, Désiré Sidibé, Cédric Demonceaux, and Valérie Gouet-Brunet. A survey on visual-based localization: On the benefit of heterogeneous data. Pattern Recognition, 74:90–109, 2018.
  • Pion et al. [2020] Noé Pion, Martin Humenberger, Gabriela Csurka, Yohann Cabon, and Torsten Sattler. Benchmarking image retrieval for visual localization. In International Conference on 3D Vision (3DV), pages 483–494. IEEE, 2020.
  • Radenović et al. [2018] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7):1655–1668, 2018.
  • Radenović et al. [2018] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2018.
  • Revaud et al. [2019] Jerome Revaud, Jon Almazán, Rafael S Rezende, and Cesar Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. In Proceedings of the IEEE International Conference on Computer Vision, pages 5107–5116, 2019.
  • Sattler et al. [2016] Torsten Sattler, Michal Havlena, Konrad Schindler, and Marc Pollefeys. Large-scale location recognition and the geometric burstiness problem. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pages 1582–1590, 2016.
  • Sattler et al. [2018] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8601–8610, 2018.
  • Se et al. [2002] Stephen Se, David Lowe, and Jim Little. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. International Journal of Robotics Research, 21(8):735–758, 2002.
  • Skrede [2013] Sindre Skrede. Nordland dataset. https://bit.ly/2QVBOym, 2013.
  • Stumm et al. [2013] Elena Stumm, Christopher Mei, and Simon Lacroix. Probabilistic place recognition with covisibility maps. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4158–4163. IEEE, 2013.
  • Thoma et al. [2020] Janine Thoma, Danda Pani Paudel, Ajad Chhatkuli, and Luc Van Gool. Geometrically mappable image features. IEEE Robotics and Automation Letters, 5(2):2062–2069, 2020.
  • Tolias et al. [2016] Giorgos Tolias, Yannis Avrithis, and Hervé Jégou. Image search with selective match kernels: aggregation across single and multiple images. International Journal of Computer Vision, 116(3):247–261, 2016.
  • Torii et al. [2019] Akihiko Torii, Hajime Taira, Josef Sivic, Marc Pollefeys, Masatoshi Okutomi, Tomas Pajdla, and Torsten Sattler. Are large-scale 3d models really necessary for accurate visual localization? IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • Wang et al. [2022] Ruotong Wang, Yanqing Shen, Weiliang Zuo, Sanping Zhou, and Nanning Zheng. TransVPR: Transformer-based place recognition with multi-level attention aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13648–13657, 2022.
  • Warburg et al. [2020] Frederik Warburg, Soren Hauberg, Manuel López-Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2626–2635, 2020.
  • Warburg et al. [2021] Frederik Warburg, Martin Jørgensen, Javier Civera, and Søren Hauberg. Bayesian Triplet Loss: Uncertainty quantification in image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12158–12168, 2021.
  • Yu et al. [2019] Jun Yu, Chaoyang Zhu, Jian Zhang, Qingming Huang, and Dacheng Tao. Spatial pyramid-enhanced CNN with weighted triplet loss for place recognition. IEEE Transactions on Neural Networks and Learning Systems, 31(2):661–674, 2019.
  • Zaffar et al. [2021] Mubariz Zaffar, Sourav Garg, Michael Milford, Julian Kooij, David Flynn, Klaus McDonald-Maier, and Shoaib Ehsan. VPR-Bench: An open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change. International Journal of Computer Vision, 129(7):2136–2174, 2021.
  • Zaffar et al. [2023] Mubariz Zaffar, Liangliang Nan, and Julian Francisco Pieter Kooij. CoPR: Toward accurate visual localization with continuous place-descriptor regression. IEEE Transactions on Robotics, 2023.
  • Zamir and Shah [2010] Amir Roshan Zamir and Mubarak Shah. Accurate image localization based on google maps street view. In Proceedings of the European Conference on Computer Vision, pages 255–268. Springer, 2010.
  • Zeisl et al. [2015] Bernhard Zeisl, Torsten Sattler, and Marc Pollefeys. Camera pose voting for large-scale image-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 2704–2712, 2015.
  • Zhang et al. [2021] Jian Zhang, Yunyin Cao, and Qun Wu. Vector of locally and adaptively aggregated descriptors for image feature representation. Pattern Recognition, 116:107952, 2021.
  • Zhu et al. [2018] Jianliang Zhu, Yunfeng Ai, Bin Tian, Dongpu Cao, and Sebastian Scherer. Visual place recognition in long-term and large-scale environment based on CNN feature. In IEEE Intelligent Vehicles Symposium (IV), pages 1679–1685. IEEE, 2018.

6 Supplementary Material

We provide here an ablation of SUE by changing the backbone and the weight function. A probabilistic interpretation of SUE is then presented and later used to perform density compensation for dissimilarly distributed query and reference images. We further provide the precision-recall curves for the remaining five VPR datasets. The complementarity of SUE, STUN, and L2-distance to GV is also shown on these datasets. We also show these complementarity plots of other techniques with SUE. Then, we connect the concept of geometric burstiness [40] with SUE. Finally, some qualitative results are shown in the form of correctly/incorrectly matched queries ranked with different types of uncertainty estimates.

6.1 More ablation studies of SUE

We perform two further experiments: changing the backbone feature extractor from STUN [9] to CosPlace [6] to show SUE’s generality to other backbones in Fig. 8, and the benefit of using the exponential weighing function (in Equation (2) of the main paper) instead of the uniform weighing, as reported in Table 4.

Figure 8: SUE remains SOTA when the backbone feature extractor is changed to CosPlace [6], with no retuning of SUE’s hyper-parameters. CosPlace is also used as the backbone for L2-distance and PA-score, but it was not possible to change the backbone for BTL and STUN.
| Weighing | Pitts. | San. | Stlu. | Eyn. | MSLS | Avg |
|---|---|---|---|---|---|---|
| Uniform | 0.81 | 0.77 | 0.67 | 0.77 | 0.49 | 0.70 |
| SUE | 0.94 | 0.84 | 0.88 | 0.93 | 0.77 | 0.87 |

Table 4: SUE weighs the contribution of the nearest neighbor poses based on the distance in the feature space with an exponentially decaying function. This performs better than uniform weighing of the variance of the reference poses.

6.2 A probabilistic view of SUE

We here present a probabilistic view of SUE, which will help formulate a modified version in Section 6.3 to account for different spatial distributions of queries and references.

Consider $M \in \{1, \cdots, N\}$ as a stochastic ‘match’ variable that indicates which of the $N$ references is a true reference. So, $M = i$ would mean reference $i$ is the ‘true’ match for the query. Then $\mathsf{p}(M = i)$ expresses the prior belief that any reference $i$ could be the true reference.

Assuming that some reference i𝑖iitalic_i is the true reference, M=i𝑀𝑖M=iitalic_M = italic_i, then the observed query feature fqsubscript𝑓𝑞f_{q}italic_f start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT can be expected to be similar to the reference feature fisubscript𝑓𝑖f_{i}italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, with some homoscedastic Gaussian noise or variation added to all feature dimensions,

𝗉(fq|M=i)𝗉conditionalsubscript𝑓𝑞𝑀𝑖\displaystyle\mathsf{p}(f_{q}|M=i)sansserif_p ( italic_f start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT | italic_M = italic_i ) =𝐍(fq|f(i),Σf)absent𝐍conditionalsubscript𝑓𝑞subscript𝑓𝑖subscriptΣ𝑓\displaystyle=\mathbf{N}(f_{q}|f_{(i)},\Sigma_{f})= bold_N ( italic_f start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT | italic_f start_POSTSUBSCRIPT ( italic_i ) end_POSTSUBSCRIPT , roman_Σ start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ) (4)
eλfqf(i)2proportional-toabsentsuperscript𝑒𝜆subscriptnormsubscript𝑓𝑞subscript𝑓𝑖2\displaystyle\propto e^{-\lambda\cdot||f_{q}-f_{(i)}||_{2}}∝ italic_e start_POSTSUPERSCRIPT - italic_λ ⋅ | | italic_f start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT - italic_f start_POSTSUBSCRIPT ( italic_i ) end_POSTSUBSCRIPT | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (5)
w(i).proportional-toabsentsubscript𝑤𝑖\displaystyle\propto w_{(i)}.∝ italic_w start_POSTSUBSCRIPT ( italic_i ) end_POSTSUBSCRIPT . (6)

So, the weight term of Equation (3) can be considered as the non-normalized likelihood term. Note that the hyperparameter λ𝜆\lambdaitalic_λ subsumes the noise parameter ΣfsubscriptΣ𝑓\Sigma_{f}roman_Σ start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT.

Through Bayes’ rule, we can express the posterior belief over $M$ given the query feature as

\begin{equation}
\mathsf{p}(M | f_q) = \frac{\mathsf{p}(f_q | M)\,\mathsf{p}(M)}{\mathsf{p}(f_q)} = \frac{\mathsf{p}(f_q | M)\,\mathsf{p}(M)}{\sum_j \mathsf{p}(f_q | M = j)\,\mathsf{p}(M = j)}. \tag{7}
\end{equation}

With a uniform prior ($\mathsf{p}(M) = 1/N$) that indicates equal probability for all references, we can see that the posterior reduces to $\mathsf{p}(M = i | f_q) = w_{(i)} / \sum_j w_{(j)}$, since the constant of the prior factors out in the numerator and denominator.
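The reduction of the posterior to normalized weights can be illustrated with a minimal sketch. The function name `match_posterior`, the value of `lam`, and the example distances are illustrative assumptions and are not taken from the paper.

```python
import math

def match_posterior(distances, lam=1.0, prior=None):
    """Posterior p(M=i | f_q) over candidate references via Bayes' rule.

    distances: L2 distances ||f_q - f_(i)||_2 to each candidate reference.
    lam: decay hyperparameter (illustrative value, not the paper's setting).
    prior: optional unnormalized prior p(M=i); uniform when None.
    """
    if prior is None:
        prior = [1.0] * len(distances)
    # Shift by the smallest distance for numerical stability; the shift
    # cancels in the normalization below.
    d_min = min(distances)
    unnorm = [math.exp(-lam * (d - d_min)) * p for d, p in zip(distances, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Three candidate references sorted by descriptor distance (made-up values).
posterior = match_posterior([0.2, 0.5, 0.9], lam=2.0)
```

With a uniform prior, the ratio of two posterior entries depends only on the difference of their descriptor distances, so the nearest neighbor in feature space always receives the largest posterior mass.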

If we now assume that our VPR technique is reasonable, and that the query position should be located at the ‘true’ reference, then we can express the expected query position, given our belief on the match of each reference, i.e.,

\begin{align}
\mathbb{E}[p_{(M)} | f_q] &= \sum_i \left[ \mathsf{p}(M = i | f_q)\, p_{(i)} \right] \tag{8} \\
&= \mu_p. \tag{9}
\end{align}

Here we recognise Equation (1), assuming the uniform prior $\mathsf{p}(M)$. While we do not necessarily consider this expected pose to be representative of the true query pose (it could be an average location between distant visually-matching areas), it does allow us to compute the expected squared pose distance of the true match to the query,

\begin{equation}
\mathbb{E}\left[ ||p_{(M)} - \mu_p||_2 \,\big|\, f_q \right] \approx \textrm{trace}(\Sigma_p) = s_q, \tag{10}
\end{equation}

where $\Sigma_p$ is as defined in Equation (2) for the uniform prior $\mathsf{p}(M)$. In other words, in SUE $s_q$ estimates the expected (squared) distance between the match’s pose and the query pose, thus the smaller $s_q$ the higher the chance is that a match selected according to our posterior belief is within an acceptable distance to the true query pose.
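To make the quantities $\mu_p$ and $s_q$ concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that $\Sigma_p$ is the posterior-weighted covariance of the retrieved reference poses, so that its trace equals the weighted mean squared distance to $\mu_p$. The name `sue_uncertainty`, the planar poses, and all numeric values are illustrative assumptions.

```python
import math

def sue_uncertainty(distances, poses, lam=1.0):
    """Return (mu_p, s_q) for one query under a uniform prior over references.

    distances: descriptor distances to the top-K retrieved references.
    poses: matching list of reference poses as coordinate tuples.
    Assumes Sigma_p is the posterior-weighted pose covariance, so
    trace(Sigma_p) = sum_i p(M=i|f_q) * ||p_(i) - mu_p||^2.
    """
    d_min = min(distances)
    w = [math.exp(-lam * (d - d_min)) for d in distances]
    total = sum(w)
    post = [wi / total for wi in w]          # p(M=i | f_q)
    dim = len(poses[0])
    mu = [sum(pi * p[k] for pi, p in zip(post, poses)) for k in range(dim)]
    s_q = sum(
        pi * sum((p[k] - mu[k]) ** 2 for k in range(dim))
        for pi, p in zip(post, poses)
    )
    return mu, s_q

dists = [0.30, 0.35, 0.40]
# Retrieved references that agree spatially: low uncertainty.
mu_c, s_c = sue_uncertainty(dists, [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
# One perceptually aliased reference 500 units away: high uncertainty.
mu_a, s_a = sue_uncertainty(dists, [(0.0, 0.0), (500.0, 0.0), (0.0, 1.0)])
```

The descriptor distances are identical in both cases, so the difference in $s_q$ comes only from the spatial spread of the retrieved poses.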

Finally, reference $i' = \mathop{\mathrm{argmax}}_i \mathsf{p}(M = i | f_q)$ with the highest posterior probability of being the correct match is selected, which based on the likelihood term (and with uniform prior) will be $i' = 1$, i.e. the nearest neighbor in the feature space.

Note that in the above, a uniform prior $\mathsf{p}(M)$ means all references are assumed a-priori equally likely to match the query. In case some areas in the map contain more references than other areas, this also implies a higher prior belief that the query will occur in such a denser sampled area. This ‘default’ prior is therefore not a uniform spatial prior over the mapped area, but it assumes that the local spatial density of references in the map is indicative of the probability of a query appearing in such a local region.

6.3 Spatial density compensation for dissimilar query/reference spatial distributions

As explained in the discussion of SUE’s potential limitations (Section 4.5) and in Appendix Section 6.2, the default formulation of SUE assumes that each reference is equally probable to match a query, i.e., a uniform prior $\mathsf{p}(M)$ is assumed. In other words, the query and reference images/poses are expected to be distributed similarly over the map, and the spatial density of the references in an area reflects the assumed prior probability for a query to be located in that area.

To illustrate, consider two perceptually-aliased locations A and B, where location A is represented by 100 images and location B by one image. If a query occurs at A or B, SUE’s uncertainty estimate as currently formulated in Equation (2) will be low, since the many references at location A will all agree on low spatial variance, while the contribution of the distant reference at location B is $100\times$ smaller. This high confidence could be desired if location A is also $100\times$ more likely to be visited at query-time than location B (i.e. the uniform $\mathsf{p}(M)$ holds, so the spatial density of the references reflects a spatial prior of a query’s location). However, this prior could also be undesired if we expect queries at A and B to be equally likely to occur, irrespective of the reference density. Ultimately, what is desired depends on the application and data collection procedure.

In case the uniform prior $\mathsf{p}(M)$ over references is undesired, we can substitute it with a different prior in the equations of Section 6.2. Specifically, in Equation (7) the likelihood terms should not be multiplied with a constant prior term (which cancelled out in the numerator and denominator). Still, it may be more convenient to express the prior over references in terms of a spatial prior for the query. In other words, a reference would be more probable to match if it is in an area where the query is more probable to occur, while a reference would be less probable if there are more other references in the same spatial area. Let $\mathsf{p}_q(p)$ denote the desired spatial prior for the query to be at a pose $p$, and $\mathsf{p}_r(p)$ denote the spatial density of the references at a pose $p$, then

\begin{equation}
\mathsf{p}(M = i) \propto \frac{\mathsf{p}_q(p_{(i)})}{\mathsf{p}_r(p_{(i)})}. \tag{11}
\end{equation}

We will refer to this as spatial density compensation. In practice, we can thus compensate SUE for a desired spatial prior by multiplying the reference weight $w_{(i)}$ with a term (proportional to) the desired prior $\mathsf{p}(M)$. Note from Equation (11) that if the spatial distributions of queries and references are assumed equal, we again obtain that $\mathsf{p}(M)$ is uniform, as is the case for the default SUE formulation.
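The effect of Equation (11) on the two-location example can be checked numerically. This is a minimal sketch in which the function name `reference_prior` and the density values are illustrative assumptions: location A holds 100 references, location B holds one, and queries are assumed equally likely at both locations.

```python
def reference_prior(query_density, reference_density):
    """Normalized prior p(M=i) proportional to p_q(p_(i)) / p_r(p_(i))."""
    raw = [q / r for q, r in zip(query_density, reference_density)]
    total = sum(raw)
    return [x / total for x in raw]

# References 0..99 lie at location A, reference 100 lies at location B.
ref_density = [100.0] * 100 + [1.0]   # local reference density per reference
query_density = [1.0] * 101           # queries equally likely at A and B

prior = reference_prior(query_density, ref_density)
mass_A = sum(prior[:100])             # total prior mass assigned to location A
mass_B = prior[100]                   # prior mass assigned to location B

# If queries follow the reference density, the prior is uniform again.
uniform = reference_prior(ref_density, ref_density)
```

Under the default uniform prior over references, location B would hold only 1/101 of the prior mass; after compensation both locations hold equal mass, which matches the case where queries at A and B are expected to be equally likely.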

(a) Pittsburgh query
(b) Pittsburgh reference
(c) Stlucia query
(d) Stlucia reference
Figure 9: The density of queries and references is depicted using the distance ($z$) of each query/reference to its nearest neighbour ($k = 1$) in the pose space. Queries and references in the Pittsburgh dataset are highly dense and hence uniformly spatially distributed. The queries and references are non-uniformly (albeit similarly) spatially distributed in the sparser Stlucia dataset.

6.4 Validating spatial density compensation

In this section, we test the spatial density compensation concept of adjusting SUE as explained in Section 6.3.

Applying a uniform spatial prior for the query

Let’s assume the spatial density of query poses is uniform, so all query poses within the map are equally likely, in which case term $\mathsf{p}_q(p)$ becomes a constant (and thus will cancel out when normalizing the weights).

The spatial density of the references $\mathsf{p}_r(p)$ can be estimated from the finite samples of poses in the reference set. We can for instance model the spatial density of references by simply taking the distance $z_{(i)}$ of the reference $i$ to its $k$-th nearest neighbor in the pose space, such that the area $z_{(i)}^2$ is inversely proportional to the local density of the reference $i$, i.e., $\mathsf{p}_r(p_{(i)}) \propto 1 / z_{(i)}^2$. Hyperparameter $k$ regularizes the smoothness of the estimated reference pose density.

We can now see that $\mathsf{p}(M = i) \propto z_{(i)}^2$, thus the density compensated SUE for this uniform spatial prior for query poses is obtained by re-weighting Equation (3) with $z_{(i)}^2$, i.e.,

\begin{equation}
w_{(i)} = e^{-\lambda \cdot d_{(i)}} \cdot z_{(i)}^2. \tag{12}
\end{equation}
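The density estimate and the re-weighting of Equation (12) can be sketched as follows. This is a brute-force illustration under stated assumptions, not the authors' code: the function names, the planar poses, and the value of `lam` are hypothetical, and a spatial index would be preferable for large maps.

```python
import math

def kth_neighbor_distance(poses, k=1):
    """z_(i): distance of each reference pose to its k-th nearest other pose."""
    if not 1 <= k < len(poses):
        raise ValueError("k must satisfy 1 <= k < number of poses")
    z = []
    for i, p in enumerate(poses):
        others = sorted(math.dist(p, q) for j, q in enumerate(poses) if j != i)
        z.append(others[k - 1])
    return z

def compensated_weights(distances, z, lam=1.0):
    """Equation (12): w_(i) = exp(-lam * d_(i)) * z_(i)^2."""
    return [math.exp(-lam * d) * zi ** 2 for d, zi in zip(distances, z)]

# Three densely sampled references and one isolated reference (made-up map).
map_poses = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (50.0, 0.0)]
z = kth_neighbor_distance(map_poses, k=1)

# Suppose references 0 and 3 are retrieved with equal descriptor distance.
retrieved = [0, 3]
w = compensated_weights([0.4, 0.4], [z[i] for i in retrieved], lam=1.0)
```

The isolated reference receives a much larger weight than the densely sampled one even though both have the same descriptor distance, which is the intended effect when queries are assumed to be spatially uniform.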

Do common datasets have a uniform query distribution?

We used the above formulation of spatial density to study the properties of the VPR datasets used in this work. First, we find that most of our datasets have a mostly uniform spatial distribution for both queries and references, except the Stlucia dataset. Fig. 9 illustrates the distribution of distances to the $k = 1$ nearest neighbors for the Pittsburgh and Stlucia datasets. Second, we can conclude that the assumption that references and queries have a similar spatial distribution does hold in common VPR datasets, hence SUE’s default formulation with uniform reference prior is reasonable.

To properly validate the density compensation concept of Section 6.3, we also create a modified version of the Stlucia data such that queries and references actually do have a different spatial distribution. We greedily subsample the Stlucia queries such that the spatial density of the resampled queries is uniform.
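The exact subsampling procedure is not specified here. One plausible greedy scheme, shown purely as an assumption, keeps a query only if it lies at least a minimum spacing away from every query kept so far. The function name `greedy_subsample`, the spacing value, and the toy trajectory are all hypothetical.

```python
import math

def greedy_subsample(poses, min_spacing):
    """Return indices of poses kept by a single greedy pass.

    A pose is kept only if it is at least min_spacing away from all
    previously kept poses, which yields a roughly uniform spatial density.
    This is an assumed procedure, not necessarily the one used in the paper.
    """
    kept = []
    for idx, p in enumerate(poses):
        if all(math.dist(p, poses[j]) >= min_spacing for j in kept):
            kept.append(idx)
    return kept

# Toy trajectory: one query per unit of distance along a straight line.
query_poses = [(float(x), 0.0) for x in range(31)]
kept = greedy_subsample(query_poses, min_spacing=8.0)
```

On this toy trajectory the kept queries are spaced exactly `min_spacing` apart; on real data the nearest-neighbor distances of the kept queries would fall in a narrow band above the chosen spacing.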

Does assuming a uniform query distribution help?

Finally, we test the density compensated SUE of Equation (12) on the VPR datasets for different choices of $k$, see Table 5.

Since queries and references of datasets other than Stlucia are already uniformly distributed spatially, the table confirms that density compensation does not lead to any major effect on SUE’s performance. We also see that for the (unmodified) Stlucia dataset, density compensation actually hurts performance because the queries and references are in fact non-uniformly and similarly distributed. The default uniform prior assumption of SUE is therefore better suited for Stlucia.

However, if we test density compensated SUE on the modified Stlucia dataset where queries are in fact uniformly spatially distributed while the references are not, then we do observe a benefit over the default SUE as shown in Table 6. In this case, the spatial prior of density compensated SUE does hold, whereas the default SUE assumption that queries and references are similarly distributed does not.

Compensation Pitts. San. Stlu. Eyns. MSLS
none 0.94 0.84 0.89 0.93 0.76
$k = 1$ 0.94 0.84 0.82 0.93 0.76
$k = 3$ 0.94 0.84 0.84 0.93 0.77
$k = 10$ 0.94 0.81 0.85 0.92 0.77
Table 5: SUE’s AUC-PR with reference density compensation.
$z$ none $k = 1$ $k = 3$ $k = 5$ $k = 8$ $k = 10$
8-9 0.92 0.96 0.96 0.96 0.94 0.94
10-11 0.68 0.76 0.73 0.7 0.71 0.69
Table 6: SUE’s AUC-PR with reference density compensation using different values of $k$ on the Stlucia dataset when the queries are resampled to have a close to uniform spatial density (e.g., $z$ = 8-9). Reference density compensation helps SUE when queries are spatially uniformly distributed and references are non-uniformly distributed. Best across the columns is in Bold.

In conclusion, whether spatial density compensation is needed depends on the specific spatial distributions of the references and queries in a dataset. For the studied VPR benchmark datasets that represent densely collected queries and references, the default assumption of SUE that their spatial distributions are similar holds. Still, in applications where we can expect that queries and references are distributed differently, additional density compensation can be helpful. The formulation of spatial density compensation can be motivated from a probabilistic view on SUE. Future work can investigate better estimates for query and reference density for non-uniformly distributed data to further improve SUE.

6.5 Precision-Recall curves

In addition to the Precision-Recall curves of the Pittsburgh dataset in Fig. 1, the PR-curves for the remaining five datasets are shown in Fig. 10. SUE outperforms the methods in the RUE and DUE categories on all datasets. GV remains the overall state-of-the-art, albeit at a computational cost that is two to three orders of magnitude higher.
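For readers who wish to reproduce this kind of evaluation, the sketch below shows one common way to turn per-query uncertainties into a precision-recall curve and its area. It assumes that queries are accepted in order of increasing uncertainty and that the area is computed as a step-wise sum (average precision); the exact procedure used for the reported numbers may differ, and all names and values are illustrative.

```python
def pr_curve(uncertainties, is_correct):
    """Precision and recall as queries are accepted from most to least certain."""
    total_pos = sum(1 for c in is_correct if c)
    if total_pos == 0:
        raise ValueError("at least one correctly matched query is required")
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    precision, recall = [], []
    tp = 0
    for n, i in enumerate(order, start=1):
        if is_correct[i]:
            tp += 1
        precision.append(tp / n)        # correct matches among accepted queries
        recall.append(tp / total_pos)   # correct matches recovered so far
    return precision, recall

def area_under_pr(precision, recall):
    """Step-wise area under the curve (equivalent to average precision)."""
    area, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        area += p * (r - prev_recall)
        prev_recall = r
    return area

correct = [True, False, True, False]
# Uncertainties that rank both correct matches first.
good = area_under_pr(*pr_curve([0.1, 0.4, 0.2, 0.9], correct))
# Uncertainties that rank both correct matches last.
bad = area_under_pr(*pr_curve([0.9, 0.1, 0.8, 0.2], correct))
```

An uncertainty estimate that assigns the lowest values to the correct matches reaches the maximum area, while one that ranks incorrect matches first is penalized.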

Figure 10: The precision-recall curves on the six datasets using SUE and other baselines. SUE outperforms the existing methods within the efficient category on all datasets. Note how an L2-based retrieval uncertainty outperforms the data-driven aleatoric uncertainty estimated in BTL and STUN.

6.6 Complementing geometric verification

We further show in Fig. 11 how the SVM trained on the Pittsburgh dataset generalizes to the other datasets. For all these datasets, the relation of our SUE uncertainty with DELF-RANSAC leads to complementarity with queries in the bottom-left of the plot that can be linearly separated.

Figure 11: The relation of L2-based uncertainty, STUN, and SUE with geometric verification uncertainty. The SVM boundaries are learned on the Pittsburgh dataset only. Each point represents a query, and the color indicates whether it is a true-positive (Blue) or a false-positive (Red). The linear SVM boundaries are shown as black lines, while the dashed lines are the SVM margins. The combination of SUE with geometric verification leads to more correctly matched queries in the bottom right of the plots (where SUE is certain but GV is uncertain), indicating complementarity. For better visualization, the vertical scale is in log-space, due to which the SVM boundaries appear non-linear although they are linear.

6.7 SUE combined with other uncertainty estimates

For completeness, we show the combination of other uncertainty estimation methods with SUE in Fig. 12. Most of the queries that can be classified as true- or false-positives by other methods can already be classified using only SUE. We hypothesize that this is because of SUE’s similarity to BTL and STUN which also estimate the aleatoric uncertainty, and since SUE already uses the L2-distance and nearest neighbours in its uncertainty estimate.

Figure 12: The relation of L2-based, PA-score, BTL, and STUN uncertainties with SUE uncertainty. Each point represents a query, and the color indicates whether it is a true-positive (Blue) or a false-positive (Red). The linear SVM boundaries are shown as black lines, while the dashed lines are the SVM margins. As indicated by the near-vertical decision boundaries, most of the queries that can be classified as true- or false-positives by other methods can also be classified by SUE, and we do not see much complementarity.

6.8 Relating SUE to geometric burstiness

Relation: Features that appear in similar configurations across multiple unrelated reference images are referred to as geometric burstiness (GB) [40]. Ideally, such features should not be considered for estimating the image matching confidence using geometric verification (GV). Whether images are related or unrelated is determined using their pose information, i.e., different images that are physically close to each other could be looking at the same place. While the use of pose information of the Top-$K$ retrieved reference images is common between SUE and GB, the latter is evaluated for image re-ranking and the former for VPR. GB is implemented on top of GV and is more computationally expensive than GV, concretely by a factor on the order of $K$, but gives better uncertainty estimates. For completeness, we implement a version of GB inspired by [40] and compare it to SUE. The implementation details are as follows.

Our implementation of GB: We use SIFT features, and perform feature matching in a RANSAC loop between a query and its Top-$K$ retrieved nearest neighbors. Local feature matches [$q_i$, $r^k_j$] that satisfy a geometric transform (a homography) are considered inliers, where $q_i$ is the $i$th query feature and $r^k_j$ is the $j$th feature in the $k$th nearest neighbour. A query feature $q_i$ contributes to geometric burstiness if it forms part of the inlier set for multiple (say $T$) retrieved images, and in the most naive case, such [$q_i$, $r^k_j$] should be discarded from the inlier count. But similar to [40], we down-weight their contribution by $T$ instead of completely discarding such inliers.

However, Sattler et al. [40] further observed that the top retrieved images could come from the same place, and hence query features could legitimately form part of the inlier set for multiple retrieved images. To classify whether a set of reference images represents the same place or not, we use the definition of place from [7] where images that are within 25 meters of each other are considered as the same place. Thus, only inliers from reference images of different (more than 25 meters apart) places are classified as geometric bursts. We use $K = 20$ and for feature matching the same hyperparameters are used as that of SIFT-RANSAC.
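The down-weighting described above can be sketched as follows, under several assumptions that are not confirmed by the text: each inlier contributes $1/T$ to the score of its retrieved image, $T$ counts distinct places rather than images, and places are formed greedily by comparing each image with the first image of every place found so far. The function name and the toy inputs are hypothetical, and the RANSAC matching that produces the inlier sets is taken as given.

```python
import math
from collections import defaultdict

def burstiness_weighted_inliers(inlier_sets, poses, place_radius=25.0):
    """Down-weighted inlier score per retrieved image.

    inlier_sets: for each retrieved image, the set of query-feature indices
        that are geometric inliers for that image.
    poses: pose of each retrieved image as a coordinate tuple (meters).
    A query feature seen as an inlier in T distinct places contributes 1/T
    to every image in which it is an inlier (assumed weighting).
    """
    images_per_feature = defaultdict(list)
    for k, features in enumerate(inlier_sets):
        for f in features:
            images_per_feature[f].append(k)

    scores = [0.0] * len(inlier_sets)
    for f, images in images_per_feature.items():
        # Greedily count distinct places among the images containing f.
        anchors = []
        for k in images:
            if all(math.dist(poses[k], poses[a]) > place_radius for a in anchors):
                anchors.append(k)
        t = len(anchors)
        for k in images:
            scores[k] += 1.0 / t
    return scores

# Images 0 and 1 show the same place; image 2 is 400 m away.
inliers = [{0, 1, 2}, {0, 1}, {0}]
image_poses = [(0.0, 0.0), (5.0, 0.0), (400.0, 0.0)]
scores = burstiness_weighted_inliers(inliers, image_poses)
```

In this toy case, query feature 0 is an inlier in two distinct places and is therefore halved everywhere, whereas feature 1 appears only in images of the same place and keeps its full weight.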

Results: We report in Table 7 that adding GB on top of SIFT-RANSAC leads to better performance than just using SIFT-RANSAC. Overall, among all uncertainty estimation methods, DELF-RANSAC still performs the best. GB is the most computationally expensive among all the uncertainty estimation methods. Note that GB could also be added on top of Superpoint-RANSAC and DELF-RANSAC albeit at an even higher computational cost.

We further test if SUE remains complementary to GB, given that both methods use reference poses. Fig. 13 shows that the uncertainty estimates from SUE can also complement GB. In conclusion, the several orders of magnitude higher computational needs of GB compared to SUE, and their mutual complementarity suggest that SUE is a useful baseline for uncertainty estimation in VPR.

Method ↑ Pitts. ↑ Nord. ↑ MSLS ↓ Time (ms)
L2-dist 0.87 0.18 0.64 0.05
STUN 0.79 0.05 0.44 0.10
SUE 0.94 0.26 0.77 1.08
SIFT 0.92 0.15 0.70 129
DELF 0.97 0.84 0.95 1587
GB (SIFT) 0.92 0.31 0.87 2709
Table 7: AUC-PR and computation time (milliseconds) comparison of the methods discussed in the main paper with geometric burstiness [40]. Best across the columns is in Bold. Implementing GB on top of SIFT-RANSAC leads to better performance but at several orders of magnitude higher computational cost.
Figure 13: SUE remains complementary to GB since many true-positives can be separated from false-positives using SUE uncertainty and not using GB. See other such plots in this paper for details on the employed info-graphics.

6.9 Qualitative results

We show examples of queries with their corresponding nearest neighbors ranked with the uncertainties computed by the different types of uncertainty estimation methods in Fig. 14. We keep the set of randomly chosen queries the same for all the methods. These examples further indicate what each method is sensitive to for uncertainty estimation.

Figure 14: Exemplar matched/mismatched queries are ranked with different types of estimated uncertainties in the Pittsburgh dataset. Note that the set of chosen queries is the same for all types of uncertainty estimation methods. $I_{(n)}$ denotes the nearest neighbor where the subscript $n$ denotes its rank. The number of nearest neighbors shown relates to the corresponding number needed by each method (e.g. PA-score requires two nearest neighbors). The retrieved nearest neighbors for BTL are different than other methods due to the different feature encoder. A good uncertainty estimation method when used for ordering would rank correct matches to the left and incorrect matches to the right of the reader. The query image in column 12 of SUE depicts the failure case of SUE, where the perceptually aliased nearest neighbors are geographically far-apart leading to high uncertainty but the best match is still the correct match.