Visual place recognition for aerial imagery: A survey

Ivan Moskalenko [email protected] Anastasiia Kornilova [email protected] Gonzalo Ferrer [email protected]

Abstract

Aerial imagery and its direct application to visual localization is an essential problem for many Robotics and Computer Vision tasks. While Global Navigation Satellite Systems (GNSS) are the standard default solution for solving the aerial localization problem, it is subject to a number of limitations, such as, signal instability or solution unreliability that make this option not so desirable. Consequently, visual geolocalization is emerging as a viable alternative. However, adapting Visual Place Recognition (VPR) task to aerial imagery presents significant challenges, including weather variations and repetitive patterns. Current VPR reviews largely neglect the specific context of aerial data. This paper introduces a methodology tailored for evaluating VPR techniques specifically in the domain of aerial imagery, providing a comprehensive assessment of various methods and their performance. However, we not only compare various VPR methods, but also demonstrate the importance of selecting appropriate zoom and overlap levels when constructing map tiles to achieve maximum efficiency of VPR algorithms in the case of aerial imagery. The code is available on our GitHub repository — https://github.com/prime-slam/aero-vloc.

keywords:

visual place recognition , geolocalization , aerial imagery , benchmark [Uncaptioned image]

Figure 1: Left: sample frames from the used aerial datasets for Visual Place Recognition (VPR). Center: Recall@1 statistics for VPR methods without re-ranking step. Right: Recall@1 statistics for VPR methods with re-ranking step, top-100 candidates are used everywhere.

^†^†journal: Robotics and Autonomous Systems

\affiliation

[first]organization=Center for AI Technology (CAIT), Skolkovo Institute of Science and Technology (Skoltech),city=Moscow, country=The Russian Federation \affiliation[second]organization=Software Engineering Department, Saint Petersburg State University,city=Saint Petersburg, country=The Russian Federation

1 Introduction

Aerial imagery has become an indispensable tool across various sectors, including geology (53), agriculture (47), environmental monitoring (41), disaster management (30), and civil engineering (38). The effectiveness of aerial imagery, however, is contingent on accurate localization, which involves determining the geographical coordinates of the images. One method for localizing aerial imagery is visual geolocalization, which identifies the location of a captured image using only visual information.

Global Navigation Satellite Systems (GNSS), such as GPS and GLONASS, are commonly employed to localize aerial imagery. However, this method requires stable satellite signals and absence of interference, conditions that are not always achievable, especially in the presence of signal reflections from water bodies or complex terrain. Additionally, the uneven signal coverage of global satellite systems introduces further challenges, thus enhancing the relevance of visual geolocalization.

Visual geolocalization can be implemented through various methods, typically relying on a pre-built database of images with known locations (36). This approach generally involves two stages: global localization (or Visual Place Recognition, VPR) and local alignment. Global localization involves identifying the nearest frame from the database (Image Retrieval), while local alignment determines the precise position using the selected frame. VPR is often achieved using global descriptors, which are compact representations of images derived from aggregated image features. These compact representations enable rapid searching even in extensive databases (19), though they suffer from relatively low accuracy, which necessitates the use of re-ranking approaches to enhance quality (4), albeit at the cost of increased processing time.

The application of Visual Place Recognition methods to aerial imagery is complicated by several factors, such as varying weather conditions, changing seasons, and different times of day. Aerial image localization requires sufficiently large maps, imposing additional constraints on the global localization stage. Scale variations due to differences in altitude further complicate global localization. For these systems to function effectively, the database images must closely match the scale of the query images. Additionally, aerial images often feature repetitive patterns, such as urban grids and agricultural fields, making localization based on visual information challenging. These factors present significant challenges for contemporary VPR methods.

To date, reviews and comparisons of VPR systems have primarily focused on indoor and outdoor data, neglecting satellite and aerial imagery (40; 22; 59; 8). This paper aims to deliver a thorough comparative analysis of current VPR and re-ranking methodologies in the context of aerial imagery, as well as to identify the most effective map representation for this application. To achieve this, we introduce a novel methodology for constructing databases for the VPR task, which encompasses the determination of the optimal scale level and the level of overlap between adjacent map tiles. Additionally, we present an open-source benchmark that incorporates a range of popular VPR and re-ranking techniques.

The main contributions of the paper are as follows:

1.

a methodology for evaluating VPR methods in case of aerial data, including database construction part;
2.

an open-source benchmark that allows combining various VPR and re-ranking techniques;
3.

a complete evaluation of different VPR methods, providing both quality and performance comparisons.

2 Related Work

This section provides a comprehensive review of existing Visual Place Recognition systems, highlighting various re-ranking techniques designed to enhance their efficacy (4). It particularly emphasizes on the review articles pertinent to this topic. Special consideration is given to the VPR in aerial imagery, detailing commonly utilized datasets. Additionally, this section discusses the overview of evaluation technologies employed in assessing the performance of VPR methodologies.

2.1 Visual Place Recognition methods

The goal of VPR algorithms is to find the most appropriate image among a database of images and the locations that match them. This objective constitutes a subset within the broader domain of Image Retrieval (17; 55), with applications spanning search engines, visual content analysis, and robotics. The classic approach for Visual Place Recognition, as employed in early works (14; 21), involves utilizing descriptors, which are compact representations of images. They are obtained by aggregating local features of images into a vector of fixed length. Then, L2 or cosine similarity is used to search in the database of descriptors.

During the nascent phases of feature aggregation methodology, the bag-of-words (21) technique gained significant traction. Subsequent to this, the scientific community was acquainted with the VLAD (Vector of Locally Aggregated Descriptors) (3) methodology, which entails the fusion of local descriptors obtained from keypoints within an image into a vectorized form via the calculation of disparities between each descriptor and a set of predefined visual clusters. Following this, a trainable adaptation of VLAD, denoted as NetVLAD (2), was introduced to the community. NetVLAD utilizes AlexNet (31) and VGG-16 (51) as base architectures. More contemporary VPR techniques like CosPlace (6) and its viewpoint-robust variant, EigenPlaces (9), are evaluated on both VGG-16 and the newer backbone ResNet (26), with the latter demonstrating superior performance. MixVPR (1) adopts ResNet as its backbone and employs the all-MLP feature aggregation technique.

Foundation models, such as CLIP (46), mark a significant shift in AI by providing models with a profound grasp of text and images. Through pre-training on extensive text-image datasets, these models redefine the landscape of image processing AI. Notably, recent remarkable achievements in the VPR task have also been realized with foundation models, namely DINOv2 (42). The pioneering approach that utilizes DINOv2 is the method proposed by Keetha et al. and known as AnyLoc (29). They demonstrate that using foundation models without VPR-specific training is able to show SOTA quality. AnyLoc also provides a comparison of their method with competing methods on two aerial-to-aerial datasets. The idea of using foundation models has been developed in the SALAD (27) and SelaVPR (37) approaches. The first approach proposes a new aggregation technique based on the optimal transport problem. SelaVPR, on the other hand, proposes a method to realize Seamless adaptation by adding a few tunable lightweight adapters to the frozen pre-trained model.

2.2 Visual Place Recognition surveys

Over the past few years, a number of surveys exploring different approaches to Visual Place Recognition have become available to the community. The purpose of some (5; 40) is only to review different approaches to VPR or to find possible applications (22), and no data-driven comparisons are made.

Zaffar et al. (60) offer a tool to the community to compare different VPR methods and present their experimental comparison, but it is only done on indoor and outdoor datasets and does not cover the aerial case. The same can be said for the study by Pion et al. (45).

Wilson et al. (57) talk about different approaches to visual geolocalization, including cross-view geolocalization, where the query image is a ground-view image and satellite images are used in the VPR database. However, they do not focus on aerial-to-aerial localization and do not conduct an experimental study on them.

Berton et al. (8) in their paper provide a comparison of different backbones and aggregation techniques for the VPR task, but they also use only indoor and outdoor datasets for this purpose.

Also worth mentioning in this section is the paper by Barbarani et al. (4) In it, they evaluate different re-ranking methods that are actively used in VPR approaches. However, this study also focuses on indoor and outdoor datasets and does not address aerial data.

2.3 Re-ranking methods

Despite the rapid variety of techniques for local feature extraction and their further aggregation, the application of various re-ranking methods remains relevant (4). This holds particular significance in the context of aerial imagery, where there is often a high degree of visual homogeneity observed across urban, semi-urban, and especially pristine natural landscapes. In essence, re-ranking methods are also Visual Place Recognition methods with the difference that local features are used directly, without their prior aggregation, to determine the similarity measure between two images. These methods show higher quality but also require more inference time. Therefore, usually re-ranking method is fed with N candidate images obtained by classical VPR methods, from which it selects K best images.

One approach for re-ranking is to use local features such as keypoints with their corresponding descriptors. Then for re-ranking, keypoints of the query image are matched with sets of keypoints of candidate images, and the criterion is the number of matched keypoints. Both RANSAC (20) and the more advanced SuperGlue (48) or LightGlue (34) trainable methods can be used for keypoint matching. While this approach isn’t specifically tailored for the Image Retrieval task, it is used by the community.

Another approach for the re-ranking task is to use dense local features. For example, Patch-NetVLAD (25) applies VLAD (3) aggregation to extract image patch descriptors and match them using RANSAC or Rapid Scoring. Recent SOTA solutions in this area are R2Former (63) and SelaVPR (37). These methods use ViT backbones to extract dense local features both for their matching directly and for further aggregation. Thus, these solutions combine both classical Image Retrieval approach and re-ranking.

2.4 VPR in aerial images

Visual Place Recognition for aerial imagery primarily emerges in research works centered on UAV localization, with various studies exploring advanced methodologies to improve image matching capabilities between UAV-captured and satellite images.

Multiple research efforts have focused on adapting existing deep learning architectures to enhance their effectiveness in this application. For example, Shetty et al. (50) propose using the AlexNet architecture without its final classification layer, while Zhuang et al. (65) recommend modifications to the ResNet-50 architecture.

Additionally, uses of other models have been highlighted. Dai et al. (15) suggest employing a small Visual Transformer (ViT-S), pre-trained on ImageNet, as a robust backbone for their matching tasks. Similarly, Bianchi et al. (11) investigate the potential of an autoencoder architecture for improving UAV and satellite image matching.

In the case of Mantelli et al. (39), the abBRIEF descriptor is utilized to calculate the similarity score of images. In contrast, both Chen et al. (12) and Hao et al. (24) integrate the NetVLAD algorithm for global localization but employ SuperGlue for re-ranking. Moreover, Gurgu et al. (23) exclusively use SuperGlue for selecting the best matching satellite image, which imposes limitations on the number of map tiles.

Despite the existence of a variety of Visual Place Recognition solutions tailored for aerial imagery, this paper primarily concentrates on classical VPR methodologies.

2.5 Datasets

Currently, the availability of publicly accessible datasets comprising aerial imagery is relatively limited. Among these, the University-1652 (61) dataset is notable, encompassing images of 1,652 buildings distributed across 72 universities globally. Additionally, the CVUSA (58) dataset is widely utilized, containing several million photographs captured throughout the United States. Another significant dataset is SUES-200 (62). The datasets such as DenseUAV (16), ALTO (13), VPAir (49), and MARS-LVIG (32) consist of sequential images captured from aircraft and are employed in the assessment of visual geolocalization algorithms. Furthermore, notable contributions in this field include the works of Tian et al. (52) and Lin et al. (33), along with datasets VIGOR (64) and DAG (54).

Refer to caption — Figure 2: Pipeline of the proposed aerial visual geo-localization system. The offline phase involves the computation of global descriptors and local features for map tiles. During the online phase, the VPR method selects the $N$ nearest images from the database, a re-ranking method identifies the optimal frame from the candidate set, and the precise location is determined through local alignment.

3 Methodology

The suggested approach adheres to a typical framework for Visual Place Recognition, with its comprehensive workflow depicted in Figure 2. The process is divided into two main segments: offline and online. The offline segment involves building a database for VPR techniques. The online segment includes the Image Retrieval component as well as Local Alignment, which determines the precise position of the system using the chosen frame. Although our approach shares similarities with other VPR benchmarks, it incorporates several unique innovations. Although the challenge of determining appropriate zoom and overlap levels between frames in database construction from aerial imagery has been previously highlighted in the academic community (16; 24), we introduce a generalized parameterization method to address this problem. Additionally, we propose two new metrics that are better suited for aerial imagery compared to the widely used Recall@k metric.

In this section, we elaborate on the methodologies employed in the construction of the database (Section A), as well as the procedures involved in Image Retrieval (Section B) and Local Alignment (Section C) within the proposed pipeline. Additionally, the criteria for the selection of test datasets are detailed (Section D). This section also introduces two novel metrics developed for the assessment of VPR methods in the context of aerial imagery (Section E).

3.1 Database construction

This paper focuses on the use of satellite imagery as a reference, however, the proposed approach for database construction can be used with any data that has continuous coverage of the area over which visual geolocalization is planned.

The main idea of the proposed methodology for database construction is to partition the whole space into rectangular tiles of equal size. Such a set of tiles can already be used as a database for VPR methods (23). Nevertheless, this approach has some disadvantages, so we propose some improvements for it.

The first problem is that typically open data sources, such as Google Maps or Sentinel, offer a choice of only a few levels of map zoom level. Thus, even between neighboring levels, the difference can be significant to affect the performance quality of VPR methods. Therefore, we propose to not only choose the most appropriate zoom level when downloading images, but also to construct images of different scales locally using raw images. In this case, we consider that the zoom level of the downloaded raw images is equal to 100%. An example of constructing map tiles of different zoom levels can be seen in Fig. 3.

The second problem with the naive approach is the presence of stitches between tiles in the map. Thus, if a query image is captured somewhere between two adjacent tiles of the map, the probability of its successful localization is significantly reduced. To offset this problem, we propose to generate overlap** tiles. An example of database construction from images with different overlap levels between frames can be seen in Fig. 4.

3.2 Image retrieval

Following the standard scheme for the Visual Place Recognition task, we use pre-calculated descriptors of map tiles, as well as local features that are needed for the re-ranking step. The process of query frame localization is as follows. For a frame, its descriptor is computed using VPR method. Then the database is searched from the descriptors and $N$ nearest frames are selected. Using $N$ candidate frames, $K$ best matches are selected using re-ranking algorithms.

3.3 Local alignment

The task of Local Alignment is to calculate the geo-coordinates of the query image using the previously found corresponding satellite map tile. To do this, we use the keypoints of the query and satellite images and match them. Then we compute the perspective transformation, which allows us to map the query image to the satellite map tile. Since the satellite image itself is a rectangle with previously known geo-coordinates of its corners, it is possible to convert pixel coordinates to geographic coordinates. For this purpose, our approach provides Web Mercator model, which is used in Google Maps.

3.4 Datasets

Since one of the novelties of our approach is the database construction algorithm, our primary criterion for test datasets is that the images originate from a delimited geographical area, enabling the acquisition of corresponding satellite maps. Additionally, we require precise geocoordinates for these test images. Since our research concentrates on aerial-to-satellite localization, datasets generated from satellite imagery were not considered. Consequently, the VPAir (49), ALTO (13), MARS-LVIG (32) and DenseUAV (16) datasets align with our requirements.

The VPAir (49) dataset consists of two sequences: plane and ortho. The plane sequence is captured from an aircraft, while the ortho sequence comprises orthophotos derived from aerial data. Both sequences are utilized as test sequences in our experiments. Although the ALTO (13) dataset offers two sequences (Round1 and Round2), the former is a subset of the latter, prompting our utilization of Round2 exclusively. As for the MARS-LVIG (32) dataset, it encompasses various sequences; we opted for the AMtown01 sequence, one of the longest, captured over rural towns. DenseUAV (16) is shot in similar conditions to AMtown01 in terms of altitude (80 — 100 meters) and landscape, so we did not use it, favoring AMtown01. The characteristics of the test sequences used are given in Tab. 1.

Dataset	Number of	Trajectory	Resolution of
Dataset	frames	length	frames
VPAir (plane)	2706	100 km	800x600
VPAir (ortho)	2706	100 km	800x600
ALTO	460	37 km	500x500
AMtown01	2620	4.8 km	2448x2048

Table 1: Characteristics of the test datasets used

3.5 Metrics

In the visual place recognition task, the standard metric is Recall@k (8; 2; 7; 10; 25; 28; 35; 43; 44; 56), which measures the percentage of query images for which at least one of the $K$ best images selected by the method is no further away than a given threshold. This metric assumes that the geo-coordinates of the images retrieved from the database are utilized as the resultant geo-coordinates. However, within the realm of aerial data, this approach isn’t appropriate due to the substantial size of map tiles, raising ambiguity regarding the association of specific geo-coordinates with images from the database. To address this challenge, two adaptations of the original Recall@k metric have been suggested. One assesses the complete visual geolocalization process, encompassing the local alignment stage, while the other focuses solely on the global localization phase. It’s worth noting that for both suggested metrics, we assume that the query images are nadir images, and we consider their 2D projection. Consequently, we presume that the coordinates of the query image represent the latitude and longitude of its center.

Before presenting the metrics, we introduce the following general notations:

1.

$DB$ — the set of map tiles in the database;
2.

$Q$ and $q$ — a sequence of query images and a single query image, respectively;
3.

$q_{loc_{gt}}$ — the ground-truth coordinates of the query image;
4.

$VPR(q,DB,N)$ — a function representing a VPR pipeline, which may include a re-ranking step, that finds the N closest images in the database for a given query image;
5.

$DB_{k}=VPR(q,DB,k)$ .

The first proposed metric is Georeference Recall. It calculates the locations of the query images and calculates the distances to the ground-truth coordinates. If the distance is less than the specified threshold value, this image is considered to be correctly localized. The result of the metric is the ratio of correctly localized pairs to their total number. Thus, the metric allows to assess the accuracy of georeferencing, that is, the entire visual geolocalization pipeline. In essence, this metric is similar to Recall@1, with the only difference that the resulting location is not the geo-coordinates of the selected map tile, but the results of local alignment relative to it. Its formal definition is as follows:

\displaystyle R_{georef.}=\frac{\sum_{q\in Q}\begin{cases}1&\text{if $dist(q_{% loc_{gt}},LA(q,DB_{1}))<\mu$}\\ 0&\text{else}\end{cases}}{|Q|},

(1)

where $LA$ is a local alignment function that finds the exact location of the system, $dist$ is the geographic distance function and $\mu$ is a threshold value. A schematic representation of the metric is available in Fig. 5.

Another suggested metric is VPR Recall, which enables the evaluation of global localization systems. Due to the database images being flat rectangular map tiles, only their corner coordinates are known by default. Therefore, this metric computes $N$ best predictions from the image database using a global localization system for each query image, and then if the center of the query image hits at least one of the selected images from the database, this result is considered as correct. The result of the metric is the ratio of correct results to the total number of test images. Thus, the metric allows to evaluate the quality of global localization systems, does not require specifying a specific threshold value and can work with different zoom levels of satellite maps. Its formal definition is as follows:

\displaystyle R_{VPR}=\frac{\sum_{q\in Q}\begin{cases}1&\text{if $\exists d\in DB% _{N}$ s.t. $q_{loc_{gt}}\in d_{rect}$}\\ 0&\text{else}\end{cases}}{|Q|},

(2)

where $d_{rect}=(d_{NW},d_{SW})\times(d_{NW},d_{NE})$ and $d_{NW}$ , $d_{SW}$ , $d_{NE}$ represent the corresponding corner coordinates of the map tile. A visual depiction of the metric is presented in Fig. 6.

4 Experiments

In this section, we perform a qualitative comparative analysis of different VPR methods, examining their performance under different zoom levels and different levels of overlap between neighboring tiles. Furthermore, we present the results of our proposed local alignment method. Additionally, this section provides a summary of the comparative analysis of the different VPR methods in terms of both time efficiency and memory usage.

4.1 Test bench

During our experiments, we employed four test sequences: ALTO (13), AMtown01 (32), as well as two sequences from the VPAir (49) dataset (plane and ortho). Three maps were downloaded from Google Maps to encompass the entirety of the test trajectories. From the variety of zoom levels offered by the Google Maps API, we opted for zoom level 17 for ALTO and VPAir, and zoom level 19 for AMtown01, chosen to correspond closely with the scale of the test data. In terms of our methodology, these maps were considered as maps with a zoom level equal to 100%. Detailed specifications of the maps are provided in Table 2.

In addressing the VPR task, we employed several prominent methods that have emerged in recent years: CosPlace (6), EigenPlaces (9), MixVPR (1), and SALAD (27), alongside the established NetVLAD (2) approach. The original weights provided by the respective authors were utilized without any fine-tuning. For the re-ranking phase, two methods utilizing keypoints were used: the conventional SuperGlue (48) approach and its recent variant, LightGlue (34). Additionally, we incorporated the novel SelaVPR (37) method, which has demonstrated superior performance on several popular VPR datasets. SelaVPR integrates both candidate selection for the re-ranking phase and the re-ranking process itself.

Dataset	Number of	Area	Resolution of
Dataset	raw tiles	Area	tiles
VPAir	4788	1079 km²	1280x1230
ALTO	2706	718 km²	1280x1230
AMtown01	3190	66 km²	1280x1230

Table 2: Characteristics of the maps used

4.2 Quality evaluation

4.2.1 Zoom level

To compare VPR methods across various zoom levels of the map, multiple databases were constructed with an overlap level of 25%. All considered VPR methods were utilized. The zoom levels employed were 50%, 100%, 150%, and 200%. Additionally, for the ALTO (13) dataset, a zoom level of 250% was included, as multiple methods demonstrated the best performance at the 200% zoom level. The VPR Recall metric was used. The results of the comparison are summarized in Table 3.

Dataset	Localization	Zoom 50%			Zoom 100%			Zoom 150%			Zoom 200%			Zoom 250%			Best
Dataset	method	R@1	R@10	R@50	R@1	R@10	R@50	R@1	R@10	R@50	R@1	R@10	R@50	R@1	R@10	R@50	zoom
VPAir (plane)	AnyLoc	5.1	22.8	54.4	14.2	36.0	57.4	21.7	43.6	61.2	18.9	39.4	57.2	—	—	—	150%
	CosPlace	2.5	10.5	25.0	4.7	11.2	22.1	4.3	9.3	16.1	2.0	6.0	11.0	—	—	—	100%
	EigenPlaces	2.3	11.0	25.2	6.7	15.4	27.1	7.1	15.2	22.4	3.5	9.9	17.8	—	—	—	150%
	MixVPR	1.1	8.0	26.1	4.1	12.5	25.2	7.2	14.5	24.2	5.5	10.6	17.4	—	—	—	150%
	NetVLAD	1.0	7.3	21.9	1.4	8.2	18.7	0.9	4.0	9.5	0.6	2.3	5.2	—	—	—	100%
	SALAD	2.1	14.0	36.4	4.8	16.0	33.8	6.9	17.3	30.0	5.8	14.1	24.1	—	—	—	150%
	SelaVPR	2.3	16.4	43.5	9.1	30.6	55.7	11.9	35.0	53.6	8.7	26.3	44.5	—	—	—	150%
VPAir (ortho)	AnyLoc	13.3	42.8	72.0	42.9	69.3	83.2	59.5	76.3	86.3	59.7	76.9	84.5	—	—	—	200%
	CosPlace	8.8	29.1	50.8	24.8	49.4	66.7	28.8	54.7	70.0	24.3	48.2	63.9	—	—	—	150%
	EigenPlaces	7.5	28.5	56.5	34.0	60.4	75.5	48.4	72.5	82.0	46.9	69.1	79.7	—	—	—	150%
	MixVPR	5.7	22.8	50.8	43.5	69.2	83.1	67.2	85.1	91.9	66.0	84.1	90.5	—	—	—	150%
	NetVLAD	2.1	13.1	39.1	9.4	26.3	47.1	9.2	22.6	40.7	6.8	17.1	30.0	—	—	—	100%
	SALAD	10.4	34.2	61.6	32.7	56.3	76.0	47.2	70.0	82.1	47.0	67.4	78.9	—	—	—	150%
	SelaVPR	5.1	21.4	45.6	18.5	43.9	66.5	29.5	53.3	69.0	23.9	46.6	63.2	—	—	—	150%
ALTO	AnyLoc	3.7	19.1	40.0	10.7	30.0	44.3	19.3	34.1	48.5	20.0	35.4	46.5	18.9	32.6	43.7	200%
	CosPlace	5.0	12.6	28.7	6.1	15.4	26.5	5.0	13.3	26.1	6.1	12.8	21.7	4.1	11.5	20.2	100%
	EigenPlaces	4.6	12.0	31.1	7.6	21.3	37.6	7.4	18.9	35.0	6.5	16.1	27.6	4.8	15.7	27.0	100%
	MixVPR	2.6	14.6	27.4	8.0	22.8	38.7	17.4	35.4	50.2	20.7	38.9	53.5	17.0	34.6	50.4	200%
	NetVLAD	2.6	7.2	14.3	4.8	8.5	17.0	0.9	5.9	16.1	0.9	5.2	10.4	0.9	3.3	6.7	100%
	SALAD	6.1	18.7	37.8	12.2	24.1	40.7	16.7	29.8	45.0	18.3	31.7	46.5	15.9	28.9	42.8	200%
	SelaVPR	4.1	17.2	32.6	9.6	23.5	34.6	10.2	27.0	37.8	15.0	28.3	42.8	9.1	22.0	33.9	200%
AMtown01	AnyLoc	15.7	52.7	83.2	25.3	63.7	89.3	18.3	52.7	77.4	10.7	37.4	65.6	—	—	—	100%
	CosPlace	7.2	20.3	46.9	2.1	12.4	23.9	1.3	9.7	25.7	0.6	5.1	13.8	—	—	—	50%
	EigenPlaces	8.4	22.4	62.4	4.7	15.2	37.7	2.9	15.9	34.5	6.3	13.0	18.5	—	—	—	50%
	MixVPR	1.1	9.7	38.3	8.4	18.4	34.6	6.8	15.2	29.0	2.9	11.1	22.3	—	—	—	100%
	NetVLAD	5.6	31.1	64.9	4.5	18.5	39.0	0.8	6.8	23.5	0.1	2.7	13.8	—	—	—	50%
	SALAD	5.9	42.6	88.2	17.6	45.0	76.6	18.4	44.5	67.9	11.6	26.0	46.4	—	—	—	100%
	SelaVPR	4.7	40.6	91.8	26.7	62.9	87.3	16.9	56.8	81.0	10.9	32.0	56.5	—	—	—	100%

Table 3: Comparison of different zoom levels, VPR Recall value in %. The most favorable results in each row are emphasized in bold. The superior method for each dataset is identified in green, followed by the second-best method in blue, and the third-best method in red.

The findings suggest that the optimal zoom level is contingent not only on the characteristics of the test dataset, but also on the specific localization method employed; so this is a hyperparameter that needs to be tuned in each scenario. Nevertheless, among the top-performing methods (AnyLoc (29), MixVPR (1), SALAD (27) and SelaVPR (37)), there is typically a convergence on the optimal zoom level within the same dataset. Subsequent experiments were conducted exclusively at the optimal zoom level identified for each algorithm.

4.2.2 Overlap level

All the VPR methods under consideration were utilized to investigate the impact of the overlap level between neighboring frames. Coverage levels of 0%, 25%, and 50% were examined. The VPR Recall metric was used. The results of comparison can be seen in Table 4.

Dataset	Localization	Overlap 0%			Overlap 25%			Overlap 50%
Dataset	method	R@1	R@10	R@50	R@1	R@10	R@50	R@1	R@10	R@50
VPAir (plane)	AnyLoc	12.9	30.3	48.9	21.7	43.6	61.2	26.3	49.2	66.0
	CosPlace	2.6	9.3	19.3	4.7	11.2	22.1	4.9	11.6	21.8
	EigenPlaces	2.6	8.6	15.9	7.1	15.2	22.4	7.2	15.5	23.8
	MixVPR	3.6	9.6	18.9	7.2	14.5	24.2	6.9	14.8	24.2
	NetVLAD	1.4	6.9	17.3	1.4	8.2	18.7	1.7	6.6	15.9
	SALAD	4.7	12.8	24.9	6.9	17.3	30.0	9.3	21.7	33.3
	SelaVPR	7.8	24.7	44.0	11.9	35.0	53.6	13.6	37.3	57.8
VPAir (ortho)	AnyLoc	44.1	64.6	76.0	59.7	76.9	84.5	69.1	81.1	86.8
	CosPlace	21.8	41.8	57.0	28.8	54.7	70.0	39.0	60.3	73.4
	EigenPlaces	34.3	53.8	66.7	48.4	72.5	82.0	60.5	77.5	84.8
	MixVPR	45.5	65.7	77.8	67.2	85.1	91.9	78.9	90.7	94.5
	NetVLAD	7.9	22.1	40.0	9.4	26.3	47.1	11.6	28.9	45.8
	SALAD	31.1	50.8	64.9	47.2	70.0	82.1	65.7	80.7	88.5
	SelaVPR	20.4	40.2	56.9	29.5	53.3	69.0	37.8	60.3	74.4
ALTO	AnyLoc	14.6	28.7	39.6	20.0	35.4	46.5	25.7	39.1	48.0
	CosPlace	4.1	10.9	22.0	6.1	15.4	26.5	7.2	16.3	28.0
	EigenPlaces	7.0	17.4	32.2	7.6	21.3	37.6	11.1	23.7	36.1
	MixVPR	13.5	24.8	40.7	20.7	38.9	53.5	32.2	49.3	61.1
	NetVLAD	2.6	9.6	17.0	4.8	8.5	17.0	4.6	11.7	20.2
	SALAD	9.8	20.9	35.4	18.3	31.7	46.5	23.9	40.9	56.7
	SelaVPR	8.7	22.4	33.0	15.0	28.3	42.8	16.3	31.5	42.8
AMtown01	AnyLoc	18.3	55.8	82.6	25.3	63.7	89.3	31.6	67.6	87.1
	CosPlace	2.5	15.3	44.8	7.2	20.3	46.9	8.7	19.0	28.5
	EigenPlaces	2.7	18.3	64.3	8.4	22.4	62.4	6.8	20.3	40.6
	MixVPR	6.3	12.6	28.4	8.4	18.4	34.6	8.4	17.1	29.6
	NetVLAD	4.0	21.8	60.5	5.6	31.1	64.9	6.5	32.5	60.4
	SALAD	11.6	34.0	61.8	17.6	45.0	76.6	19.2	45.0	72.1
	SelaVPR	17.2	54.7	85.5	26.7	62.9	87.3	25.6	69.4	91.0

Table 4: Comparison of different overlap levels, VPR Recall value in %. The most favorable results in each row are emphasized in bold. The superior method for each dataset is identified in green, followed by the second-best method in blue, and the third-best method in red.

The findings indicate that increasing the level of overlap between frames enhances the efficacy of global localization systems, with the principal constraint in such scenarios revolving around the temporal and memory resources demanded for database construction. Upon comparative analysis, AnyLoc (29), MixVPR (1), SALAD (27) and SelaVPR (37) methods perform the best, albeit subject to variability contingent upon dataset. In further experiments, we used an overlap level equal to 50%.

4.2.3 Re-ranking

We conducted a comparative analysis of re-ranking techniques using the top-100 predictions produced by diverse global localization methods. For global localization, we employed AnyLoc (29), MixVPR (1), SALAD (27) and SelaVPR (37), which demonstrated superior performance compared to other methods. For re-ranking, we used all the methods considered: SuperGlue (48), LightGlue (34) and SelaVPR (37). Notably, SelaVPR, which integrates both global localization and re-ranking steps, was exclusively evaluated in conjunction with itself. The VPR Recall metric was used. The outcomes of this comparison are detailed in Table 5.

Dataset	Localization	W/o re-ranking			SuperGlue			LightGlue			SelaVPR
Dataset	method	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
VPAir (plane)	AnyLoc	26.3	41.9	49.2	23.7	32.6	37.7	17.1	23.5	28.4	—	—	—
	MixVPR	6.9	11.4	14.8	10.8	14.7	17.0	8.8	11.8	14.2	—	—	—
	SALAD	9.3	17.3	21.7	17.9	22.2	24.6	13.5	16.5	19.3	—	—	—
	SelaVPR	13.6	29.7	37.3	—	—	—	—	—	—	26.9	42.8	49.2
VPAir (ortho)	AnyLoc	69.1	78.2	81.1	72.7	80.9	82.8	71.6	79.7	81.3	—	—	—
	MixVPR	78.9	87.8	90.7	78.5	84.0	86.0	79.9	85.4	88.3	—	—	—
	SALAD	65.7	77.4	80.7	75.1	80.5	82.0	75.5	79.5	81.9	—	—	—
	SelaVPR	37.8	54.5	60.3	—	—	—	—	—	—	68.3	74.1	76.0
ALTO	AnyLoc	25.7	34.3	39.1	35.2	40.4	41.7	35.7	39.8	42.6	—	—	—
	MixVPR	32.2	42.6	49.3	38.9	44.6	47.8	41.7	47.2	50.2	—	—	—
	SALAD	23.9	35.9	40.9	41.1	46.1	48.3	40.4	45.2	48.0	—	—	—
	SelaVPR	16.3	26.5	31.5	—	—	—	—	—	—	33.3	42.6	44.8
AMtown01	AnyLoc	31.6	56.5	67.6	38.4	43.7	49.7	36.8	40.2	43.9	—	—	—
	MixVPR	8.4	13.4	17.1	23.9	25.4	26.8	23.7	25.3	26.6	—	—	—
	SALAD	19.2	35.8	45.0	37.1	41.1	45.0	36.6	39.0	41.5	—	—	—
	SelaVPR	25.6	57.8	69.4	—	—	—	—	—	—	45.5	69.6	78.7

Table 5: Comparison of different re-ranking methods, VPR Recall value in %. Top-100 candidates are used everywhere. The best configurations for R@1, R@5 and R@10 are highlighted in green.

The results underscore the variability in the optimal configuration of VPR, encompassing global localization and re-ranking, across distinct datasets. Notably, for the VPAir plane (49) and AMtown01 (32) sequences, the combined SelaVPR (37) method demonstrates superior performance, contrasting sharply with its relatively poorer performance on the other two sequences. Conversely, for the VPAir ortho (49) sequence, the MixVPR (1) without re-ranking step yields the most favorable outcomes, while on the ALTO (13) dataset this method shows itself better in conjunction with LightGlue (34).

4.2.4 Local alignment

We employed the AnyLoc (29), MixVPR (1), SALAD (27) and SelaVPR (37) methods, along with all re-ranking methods, to estimate the entire visual geolocalization pipeline, including the local alignment step. SelaVPR was only used in conjunction with itself. The Georeference Recall metric was utilized, with thresholds set at 10, 50, and 100 meters. The outcomes of this evaluation are summarized in Table 6.

Dataset	Localization	SuperGlue			LightGlue			SelaVPR
Dataset	method	10m	50m	100m	10m	50m	100m	10m	50m	100m
VPAir (plane)	AnyLoc	4.6	15.4	18.1	3.7	11.6	13.9	—	—	—
	MixVPR	2.6	7.4	8.5	2.4	6.5	7.6	—	—	—
	SALAD	4.0	12.9	14.7	3.2	10.3	11.3	—	—	—
	SelaVPR	—	—	—	—	—	—	1.1	9.2	15.7
VPAir (ortho)	AnyLoc	52.1	71.8	74.7	49.9	68.7	72.2	—	—	—
	MixVPR	58.3	74.8	77.2	57.4	73.9	76.8	—	—	—
	SALAD	56.9	71.8	74.0	55.4	70.8	73.4	—	—	—
	SelaVPR	—	—	—	—	—	—	16.0	57.5	65.0
ALTO	AnyLoc	5.2	28.7	33.0	3.9	28.0	31.3	—	—	—
	MixVPR	5.9	33.3	38.0	5.0	34.8	38.5	—	—	—
	SALAD	6.5	34.1	37.6	4.1	33.0	36.1	—	—	—
	SelaVPR	—	—	—	—	—	—	1.7	20.9	29.6
AMtown01	AnyLoc	5.3	37.7	38.8	6.0	36.5	37.1	—	—	—
	MixVPR	4.8	25.2	25.6	5.3	23.5	24.0	—	—	—
	SALAD	5.8	36.9	37.8	6.1	36.3	36.6	—	—	—
	SelaVPR	—	—	—	—	—	—	7.4	35.4	46.0

Table 6: Comparison of VPR methods with local alignment step, Georeference Recall value in %. The best configurations for each threshold are highlighted in green.

The results indicate that on the VPAir plane (49) dataset, the AnyLoc (29) method combined with SuperGlue (48) performs optimally. Similarly, on the VPAir ortho (49) dataset, the MixVPR (1) method, also combined with SuperGlue, yields the best performance. On the ALTO (13) sequence, the highest performance varies with the threshold value, with SALAD (27) and MixVPR demonstrating superior results when paired with SuperGlue and LightGlue, respectively. For the AMtown01 dataset, the combined SelaVPR (37) approach predominantly achieves the best results.

4.3 Performance evaluation

4.3.1 Time measurements

For the time measurements, the same test bench utilized for the Local alignment step estimation was employed. All measurements were conducted on the following machine configuration: AMD Ryzen Threadripper 3970X, NVIDIA GeForce RTX 3090, and 128GB RAM. Time measurements were taken for each stage of the visual geolocalization process, specifically: global descriptor extraction, database search, local feature extraction, re-ranking, and local alignment. Since both LightGlue (34) and SuperGlue (48) use SuperPoint (18) features, the results for these methods for the local feature extraction step are combined. The results can be seen in Table 7.

Dataset	Localization method	Descriptor calculation	Database search	Local features		Re-ranking			Local
				calculation		Re-ranking			alignment
				SP	Sela	SG	LG	Sela	SG	LG	Sela
VPAir (plane)	AnyLoc	0.63	1.51	0.04	0.04	22.86	1.04	—	0.07	0.11	—
	MixVPR	0.08	0.12			23.06	1.03	—	0.09	0.12	—
	SALAD	0.09	0.26			23.44	1.08	—	0.07	0.09	—
	SelaVPR	0.04	0.03			—	—	0.08	—	—	0.07
VPAir (ortho)	AnyLoc	0.63	2.68	0.04	0.04	23.64	1.08	—	0.05	0.07	—
	MixVPR	0.08	0.12			23.11	1.08	—	0.04	0.07	—
	SALAD	0.09	0.26			23.57	1.14	—	0.04	0.05	—
	SelaVPR	0.04	0.03			—	—	0.08	—	—	0.06
ALTO	AnyLoc	0.37	1.32	0.04	0.04	24.08	1.16	—	0.06	0.07	—
	MixVPR	0.08	0.11			17.28	1.18	—	0.07	0.10	—
	SALAD	0.05	0.21			17.87	1.05	—	0.08	0.10	—
	SelaVPR	0.04	0.03			—	—	0.09	—	—	0.09
AMtown01	AnyLoc	0.84	0.44	0.14	0.17	18.59	1.07	—	0.13	0.14	—
	MixVPR	0.20	0.04			15.67	1.05	—	0.15	0.15	—
	SALAD	0.24	0.07			18.29	1.06	—	0.13	0.14	—
	SelaVPR	0.17	0.01			—	—	0.09	—	—	0.02

Table 7: Performance of VPR methods, time in seconds. The best result for each step is highlighted in green.

The comparison findings indicate that computing descriptors and conducting a database search are relatively swift across all methods, except for AnyLoc (29), where these tasks significantly prolong. This discrepancy arises from its utilization of a large ViT-G DINOv2 (42) model and the fact that AnyLoc generates descriptors of high dimensionality (49152). Local features computation and alignment demonstrate rapidity across all methods. Notably, there’s a discernible contrast in the runtime of re-ranking methods. While SelaVPR (37) completes in less than 0.1 seconds, LightGlue (34) demands around 1 second. SuperGlue (48), on the other hand, operates at an even slower pace, necessitating 15 to 25 seconds, contingent upon the dataset.

4.3.2 Memory consumption

To compare the memory consumption of different methods, the corresponding descriptors for AnyLoc (29), MixVPR (1), SALAD (27) and SelaVPR (37) methods and local features for SuperGlue (48), LightGlue (34) and SelaVPR were saved to a hard drive. The optimal zoom level was employed for each method, with an overlap level set to 50%. Consequently, for the VPAir (49) dataset, statistics are provided at a zoom level of 200% exclusively for the AnyLoc method. The ensuing outcomes are accessible for scrutiny in Table 8.

Dataset		VPAir		ALTO	AMtown01
Zoom level		150%	200%	200%	100%
Descriptors size	AnyLoc	7.9 Gb	13.9 Gb	6.4 Gb	2.3 Gb
	MixVPR	0.7 Gb	—	0.5 Gb	0.2 Gb
	SALAD	1.4 Gb	—	1.1 Gb	0.4 Gb
	SelaVPR	0.2 Gb	—	0.1 Gb	¡ 0.1 Gb
Local features size	SP	30.7 Gb	36.0 Gb	16.2 Gb	7.8 Gb
Local features size	SelaVPR	75.7 Gb	—	61.4 Gb	22.1 Gb

Table 8: Memory consumption of VPR methods. The best results for each dataset are highlighted in green.

It’s important to highlight that AnyLoc’s (29) descriptors require significantly more memory compared to descriptors in other methods due to their high dimensionality. Similarly, SelaVPR’s (37) local features demand substantially more memory than the SuperPoint (18) local features utilized by LightGlue (34) and SuperGlue (48). However, these features also consume a considerable amount of memory, particularly noticeable with large map sizes, presenting potential challenges for devices with limited memory capacity.

4.4 Discussion

The culmination of the experimental investigation yields several significant conclusions. Firstly, the selection of an appropriate zoom level stands as a crucial factor in augmenting the efficacy of any VPR system. When aiming to localize images captured at varying altitudes, it is advisable to utilize multiple databases constructed at diverse zoom levels. Moreover, augmenting the overlap level predominantly enhances the quality of VPR methods.

Secondly, upon scrutinizing various VPR techniques alongside diverse re-ranking strategies, discerning an optimal configuration proves arduous. The efficacy of different methods appears contingent upon the specific characteristics of the test data. This trend persists when evaluating VPR system quality in conjunction with the Local alignment step, where the ranking of methods hinges on the dataset under examination.

In terms of temporal and memory metrics, a clearer pattern emerges. SelaVPR (37) emerges as the swiftest method across most stages, although certain other methodologies demonstrate comparable speeds at select stages. Additionally, SelaVPR demands the least memory allocation for storing global descriptors among all methods. However, SelaVPR necessitates a higher memory allocation for storing local features compared to SuperPoint (18), the feature descriptor employed by SuperGlue (48) and LightGlue (34).

5 Conclusion

This paper presents a comprehensive evaluation of different VPR and re-ranking methodologies tailored to aerial imagery. By introducing a novel approach for database construction, we provide a methodology for evaluating VPR methods in the context of aerial data. Furthermore, our open-source benchmark facilitates the combination and comparison of various VPR and re-ranking techniques. This paper serves as a starting point for researchers and practitioners exploring the application of aerial VPR algorithms, emphasizing the significance of hyperparameter optimization (specifically zoom and overlap) in the generation of map tiles. Additionally, it engages in discourse concerning various facets of state-of-the-art localization algorithms within the aerial domain. We have made all contributions, including code, public.

References

Ali-Bey et al. (2023) Ali-Bey, A., Chaib-Draa, B., Giguere, P., 2023. Mixvpr: Feature mixing for visual place recognition, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2998–3007.
Arandjelovic et al. (2016) Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J., 2016. Netvlad: Cnn architecture for weakly supervised place recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5297–5307.
Arandjelovic and Zisserman (2013) Arandjelovic, R., Zisserman, A., 2013. All about vlad, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1578–1585.
Barbarani et al. (2023) Barbarani, G., Mostafa, M., Bayramov, H., Trivigno, G., Berton, G., Masone, C., Caputo, B., 2023. Are local features all you need for cross-domain visual place recognition?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6154–6164.
Barros et al. (2021) Barros, T., Pereira, R., Garrote, L., Premebida, C., Nunes, U.J., 2021. Place recognition survey: An update on deep learning approaches. arXiv preprint arXiv:2106.10458 .
Berton et al. (2022a) Berton, G., Masone, C., Caputo, B., 2022a. Rethinking visual geo-localization for large-scale applications, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4878–4888.
Berton et al. (2021a) Berton, G., Masone, C., Paolicelli, V., Caputo, B., 2021a. Viewpoint invariant dense matching for visual geolocalization, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12169–12178.
Berton et al. (2022b) Berton, G., Mereu, R., Trivigno, G., Masone, C., Csurka, G., Sattler, T., Caputo, B., 2022b. Deep visual geo-localization benchmark, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5396–5407.
Berton et al. (2023) Berton, G., Trivigno, G., Caputo, B., Masone, C., 2023. Eigenplaces: Training viewpoint robust models for visual place recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11080–11090.
Berton et al. (2021b) Berton, G.M., Paolicelli, V., Masone, C., Caputo, B., 2021b. Adaptive-attentive geolocalization from few queries: A hybrid approach, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2918–2927.
Bianchi and Barfoot (2021) Bianchi, M., Barfoot, T.D., 2021. Uav localization using autoencoded satellite images. IEEE Robotics and Automation Letters 6, 1761–1768.
Chen et al. (2021) Chen, S., Wu, X., Mueller, M.W., Sreenath, K., 2021. Real-time geo-localization using satellite imagery and topography for unmanned aerial vehicles, in: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE. pp. 2275–2281.
Cisneros et al. (2022) Cisneros, I., Yin, P., Zhang, J., Choset, H., Scherer, S., 2022. Alto: A large-scale dataset for uav visual place recognition and localization. arXiv preprint arXiv:2207.12317 .
Cummins and Newman (2008) Cummins, M., Newman, P., 2008. Fab-map: Probabilistic localization and map** in the space of appearance. The International journal of robotics research 27, 647–665.
Dai et al. (2021) Dai, M., Hu, J., Zhuang, J., Zheng, E., 2021. A transformer-based feature segmentation and region alignment method for uav-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology 32, 4376–4389.
Dai et al. (2023) Dai, M., Zheng, E., Feng, Z., Qi, L., Zhuang, J., Yang, W., 2023. Vision-based uav self-positioning in low-altitude urban environments. IEEE Transactions on Image Processing .
Datta et al. (2008) Datta, R., Joshi, D., Li, J., Wang, J.Z., 2008. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (Csur) 40, 1–60.
DeTone et al. (2018) DeTone, D., Malisiewicz, T., Rabinovich, A., 2018. Superpoint: Self-supervised interest point detection and description, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 224–236.
Douze et al. (2024) Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., Jégou, H., 2024. The faiss library. arXiv preprint arXiv:2401.08281 .
Fischler and Bolles (1981) Fischler, M.A., Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–395.
Gálvez-López and Tardos (2012) Gálvez-López, D., Tardos, J.D., 2012. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on robotics 28, 1188–1197.
Garg et al. (2021) Garg, S., Fischer, T., Milford, M., 2021. Where is your place, visual place recognition?, in: IJCAI, pp. 4416–4425.
Gurgu et al. (2022) Gurgu, M.M., Queralta, J.P., Westerlund, T., 2022. Vision-based gnss-free localization for uavs in the wild, in: 2022 7th International Conference on Mechanical Engineering and Robotics Research (ICMERR), IEEE. pp. 7–12.
Hao et al. (2023) Hao, Y., He, M., Liu, Y., Liu, J., Meng, Z., 2023. Range–visual–inertial odometry with coarse-to-fine image registration fusion for uav localization. Drones 7, 540.
Hausler et al. (2021) Hausler, S., Garg, S., Xu, M., Milford, M., Fischer, T., 2021. Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14141–14152.
He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
Izquierdo and Civera (2023) Izquierdo, S., Civera, J., 2023. Optimal transport aggregation for visual place recognition. arXiv preprint arXiv:2311.15937 .
** Kim et al. (2017) ** Kim, H., Dunn, E., Frahm, J.M., 2017. Learned contextual feature reweighting for image geo-localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2136–2145.
Keetha et al. (2023) Keetha, N., Mishra, A., Karhade, J., Jatavallabhula, K.M., Scherer, S., Krishna, M., Garg, S., 2023. Anyloc: Towards universal visual place recognition. IEEE Robotics and Automation Letters .
Khan et al. (2022) Khan, A., Gupta, S., Gupta, S.K., 2022. Emerging uav technology for disaster detection, mitigation, response, and preparedness. Journal of Field Robotics 39, 905–955.
Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25.
Li et al. (2024) Li, H., Zou, Y., Chen, N., Lin, J., Liu, X., Xu, W., Zheng, C., Li, R., He, D., Kong, F., et al., 2024. Mars-lvig dataset: A multi-sensor aerial robots slam dataset for lidar-visual-inertial-gnss fusion. The International Journal of Robotics Research , 02783649241227968.
Lin et al. (2015) Lin, T.Y., Cui, Y., Belongie, S., Hays, J., 2015. Learning deep representations for ground-to-aerial geolocalization, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5007–5015.
Lindenberger et al. (2023) Lindenberger, P., Sarlin, P.E., Pollefeys, M., 2023. Lightglue: Local feature matching at light speed, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17627–17638.
Liu et al. (2019) Liu, L., Li, H., Dai, Y., 2019. Stochastic attraction-repulsion embedding for large scale image localization, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2570–2579.
Lowry et al. (2015) Lowry, S., Sünderhauf, N., Newman, P., Leonard, J.J., Cox, D., Corke, P., Milford, M.J., 2015. Visual place recognition: A survey. ieee transactions on robotics 32, 1–19.
Lu et al. (2024) Lu, F., Zhang, L., Lan, X., Dong, S., Wang, Y., Yuan, C., 2024. Towards seamless adaptation of pre-trained models for visual place recognition. arXiv preprint arXiv:2402.14505 .
Lu and Dai (2024) Lu, L., Dai, F., 2024. Accurate road user localization in aerial images captured by unmanned aerial vehicles. Automation in Construction 158, 105257.
Mantelli et al. (2019) Mantelli, M., Pittol, D., Neuland, R., Ribacki, A., Maffei, R., Jorge, V., Prestes, E., Kolberg, M., 2019. A novel measurement model based on abbrief for global localization of a uav over satellite images. Robotics and Autonomous Systems 112, 304–319.
Masone and Caputo (2021) Masone, C., Caputo, B., 2021. A survey on deep visual place recognition. IEEE Access 9, 19516–19547.
Mastelic et al. (2020) Mastelic, T., Lorincz, J., Ivandic, I., Boban, M., 2020. Aerial imagery based on commercial flights as remote sensing platform. Sensors 20, 1658.
Oquab et al. (2023) Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al., 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 .
Peng et al. (2021a) Peng, G., Yue, Y., Zhang, J., Wu, Z., Tang, X., Wang, D., 2021a. Semantic reinforced attention learning for visual place recognition, in: 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE. pp. 13415–13422.
Peng et al. (2021b) Peng, G., Zhang, J., Li, H., Wang, D., 2021b. Attentional pyramid pooling of salient visual residuals for place recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 885–894.
Pion et al. (2020) Pion, N., Humenberger, M., Csurka, G., Cabon, Y., Sattler, T., 2020. Benchmarking image retrieval for visual localization, in: 2020 International Conference on 3D Vision (3DV), IEEE. pp. 483–494.
Radford et al. (2021) Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al., 2021. Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR. pp. 8748–8763.
Rejeb et al. (2022) Rejeb, A., Abdollahi, A., Rejeb, K., Treiblmaier, H., 2022. Drones in agriculture: A review and bibliometric analysis. Computers and electronics in agriculture 198, 107017.
Sarlin et al. (2020) Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A., 2020. Superglue: Learning feature matching with graph neural networks, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4938–4947.
Schleiss et al. (2022) Schleiss, M., Rouatbi, F., Cremers, D., 2022. Vpair–aerial visual place recognition and localization in large-scale outdoor environments. arXiv preprint arXiv:2205.11567 .
Shetty and Gao (2019) Shetty, A., Gao, G.X., 2019. Uav pose estimation using cross-view geolocalization with satellite imagery, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE. pp. 1827–1833.
Simonyan and Zisserman (2014) Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 .
Tian et al. (2017) Tian, Y., Chen, C., Shah, M., 2017. Cross-view image matching for geo-localization in urban environments, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3608–3616.
Umar et al. (2016) Umar, I., Yelwa, N., Jibrin, B., 2016. 3d geological models in ground investigation: Examples from the united kingdom. Journal of Scientific Research and Reports 9, 1–6.
Vallone et al. (2022) Vallone, A., Warburg, F., Hansen, H., Hauberg, S., Civera, J., 2022. Danish airs and grounds: A dataset for aerial-to-street-level place recognition and localization. IEEE Robotics and Automation Letters 7, 9207–9214.
Wan et al. (2014) Wan, J., Wang, D., Hoi, S.C.H., Wu, P., Zhu, J., Zhang, Y., Li, J., 2014. Deep learning for content-based image retrieval: A comprehensive study, in: Proceedings of the 22nd ACM international conference on Multimedia, pp. 157–166.
Warburg et al. (2020) Warburg, F., Hauberg, S., Lopez-Antequera, M., Gargallo, P., Kuang, Y., Civera, J., 2020. Mapillary street-level sequences: A dataset for lifelong place recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2626–2635.
Wilson et al. (2024) Wilson, D., Zhang, X., Sultani, W., Wshah, S., 2024. Image and object geo-localization. International Journal of Computer Vision 132, 1350–1392.
Workman et al. (2015) Workman, S., Souvenir, R., Jacobs, N., 2015. Wide-area image geolocalization with aerial reference imagery, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 3961–3969.
Zaffar (2020) Zaffar, M., 2020. Visual Place Recognition for Autonomous Robots. Ph.D. thesis. University of Essex.
Zaffar et al. (2021) Zaffar, M., Garg, S., Milford, M., Kooij, J., Flynn, D., McDonald-Maier, K., Ehsan, S., 2021. Vpr-bench: An open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change. International Journal of Computer Vision 129, 2136–2174.
Zheng et al. (2020) Zheng, Z., Wei, Y., Yang, Y., 2020. University-1652: A multi-view multi-source benchmark for drone-based geo-localization, in: Proceedings of the 28th ACM international conference on Multimedia, pp. 1395–1403.
Zhu et al. (2023a) Zhu, R., Yin, L., Yang, M., Wu, F., Yang, Y., Hu, W., 2023a. Sues-200: A multi-height multi-scene cross-view image benchmark across drone and satellite. IEEE Transactions on Circuits and Systems for Video Technology .
Zhu et al. (2023b) Zhu, S., Yang, L., Chen, C., Shah, M., Shen, X., Wang, H., 2023b. R2former: Unified retrieval and reranking transformer for place recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19370–19380.
Zhu et al. (2021) Zhu, S., Yang, T., Chen, C., 2021. Vigor: Cross-view image geo-localization beyond one-to-one retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3640–3649.
Zhuang et al. (2021) Zhuang, J., Dai, M., Chen, X., Zheng, E., 2021. A faster and more effective cross-view matching method of uav and satellite images for uav geolocalization. Remote Sensing 13, 3979.

Ivan Moskalenko is pursuing a bachelor’s degree at Saint Petersburg State University in Saint Petersburg, Russia. Since 2022, he has been a Research Intern at the Mobile Robotics Lab, Skoltech, concentrating on reproducible research in SLAM and Visual Place Recognition across various data modalities.

Anastasiia Kornilova obtained her B.S. (2018) degree in Software Engineering from Saint Petersburg State University, St. Petersburg, Russia and M.S. (2021) degree in Data Science from Skolkovo Institute of Science and Technology (Skoltech), Moscow, Russia.

Since 2020, Anastasiia works as a Research Engineer in Mobile Robotics Laboratory at Skoltech Center for Artificial Intelligence Technology (CAIT). Her research targets development of robust and accurate localization and SLAM methods on different visual sensor modalities. As a lab member Anastasiia leads projects with industrial partners to adapt and embed those technologies into commercial products.

Gonzalo Ferrer obtained his Ph.D. in Robotics from the Universitat Politècnica de Catalunya (UPC), Barcelona, Spain in 2015 and worked during two years as a Research Fellow (postdoc) at the APRIL lab. in the department of Computer Science and Engineering at the University of Michigan. In 2018, Gonzalo joined the Skolkovo Institute of Science and Technology as an Assistant Professor. He is heading the Mobile Robotics lab., focusing his research on planning, perception and how to combine both into new solutions in robotics.