High-resolution open-vocabulary
object 6D pose estimation

Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, and Fabio Poiesi (Corresponding author: Jaime Corsetti) Jaime Corsetti is with Fondazione Bruno Kessler (FBK), 38123 Trento, Italy, and also with the Department of Information Engineering and Computer Science, University of Trento, 38122 Trento, Italy (e-mail: [email protected]). Davide Boscaini, Francesco Giuliari, and Fabio Poiesi are with Fondazione Bruno Kessler (FBK), 38123 Trento, Italy (e-mail: [email protected]; [email protected]; [email protected]). Changjae Oh is with the School of Electronic Engineering and Computer Science, Queen Mary University of London, E1 4NS London, United Kingdom (e-mail: [email protected]). Andrea Cavallaro is with the Idiap Research Institute, CH-1920 Martigny, Switzerland, and also with the École polytechnique fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland (e-mail: [email protected]).
Abstract

The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming by 12.6 in Average Recall the previous best-performing approach.

Index Terms:
Object 6D pose estimation, open-vocabulary.

I Introduction

The estimation of the 6D pose of an object requires the prediction of its 3D rotation matrix and 3D translation vector with reference to the camera. Accurate object 6D pose estimation is a fundamental phase in a wide range of applications, such as augmented reality [1], robot gras** [2], and autonomous driving [3]. Data-driven methods [4, 5] can achieve reliable pose estimation, but require expensive object annotations. This problem is mitigated by the development of techniques to generate large-scale synthetic datasets [6, 7]. Another line of research instead defines the unseen-object setting [8], which assumes no overlap between the set of training and test objects [9, 7]. However, most methods for the unseen-object setting are model-based, as they require a CAD model of the object at test time [5, 10, 11, 12], as detailed in the first group of Tab. I. Recent works changed this assumption, and instead require a video sequence of the object at test time [9, 13], from which a set of reference views is extracted (second group in Tab. I). These methods use Structure-from-Motion (SfM) [14] to reconstruct the object 3D model from the reference views [13, 9].

To eliminate the need for the object models or multiple views, in our previous work, Oryon [15], we showed that the object can be described with a textual prompt, thus defining the open-vocabulary object 6D pose estimation setting. This setting assumes the availability of two RGBD scenes, along with a natural language description of the object of interest, provided by the user. The object 6D pose in a scene is estimated with respect to the object in the other scene.

In this work, we present Horyon (High-resolution Oryon), a VLM-based architecture for open-vocabulary object 6D pose estimation that generalises to unseen objects. Horyon uses a VLM to extract visual and textual representations from the input scene pair and the natural language description (i.e., the prompt). The two representations are fused using cross-attention layers that enable information exchange between each scene patch with each prompt token, without losing key details contained in the object description. The resulting feature maps are upsampled and fused with the high-resolution feature maps from a visual encoder. Horyon estimates the object pose in the query scene with respect to the anchor scene by segmentation and pixel-level view matching, and subsequently backprojecting and registering the 3D matches.

This work significantly extends our previous work [15] in several aspects. Instead of processing the whole scene, we identify and localise the object of interest in both scenes with GroundingDino [16] to obtain a bounding box. This leads to a higher resolution feature map without losing generalisation to unseen objects. Moreover, we remove the need for the guidance backbone and instead extract the guidance features directly from the VLM, reducing the training time and the number of parameters to 50% and 37% of the original method, respectively. To validate our choices, we test our method on an extended version of the Oryon benchmark, which includes two popular datasets for pose estimation, i.e., Linemod [17] and YCB-Video [18], in addition to REAL275 [19] and Toyota-Light [20]. Linemod features scenes with high clutter, occlusions and objects that are small and unusual. YCB-Video presents a large variety of poses, and objects that often belong to similar categories (e.g., boxes and cans). We provide extended ablation studies to justify each choice, and also evaluate the effect of the prompt on the final results. In summary, our main contributions are:

  • We provide a comprehensive analysis of recent approaches for estimating the 6D pose of unseen objects, detailing their requirements and operational conditions.

  • We propose to use a detector that localises and crops the object of interest from the input images using a natural language description. Despite its simplicity, this operation effectively mitigates the influence of extraneous factors, such as background clutter and unrelated objects, on the feature representations used for matching.

  • We provide a new multi-scale formulation for the VLM-based feature extractor and decoder modules, which enables extraction of higher-quality features for correspondence matching, resulting in an increase of +12.6 in Average Recall compared to the previous version.

  • We extend the evaluation set with two new challenging datasets with occluded (YCB-Video) or small (Linemod) objects. We provide textual prompts for each object and will make this benchmark public.

II Related work

TABLE I: Comparison of the data requirements of Horyon with examples of state-of-the-art methods for unseen-object 6D pose estimation. We classify the methods based on: Input: the type of input data, typically RGB or RGBD; Reference: additional data used to identify the unseen object at test time; Pose: whether the method is capable of estimating the 6D pose or is limited to the rotation component. Methods that can estimate the translation component up to a scale are identified by 𝐑,𝐭~𝐑~𝐭\mathbf{R},\mathbf{\tilde{t}}bold_R , over~ start_ARG bold_t end_ARG.; Object preprocessing: eventual process required at test time to acquire information about the object of interest; Localisation: eventual external modules used to localise the object, typically a segmentor or a detector; Zero-shot: True if the localisation module was not specifically trained for the test dataset.
Method Input Reference Pose Object preprocessing Localisation
RGB D Type Zero-shot
OVE6D [21] 3D model 𝐑,𝐭𝐑𝐭\mathbf{R},\mathbf{t}bold_R , bold_t Generates and encodes 4K object templates Segmentor
MegaPose [7] 3D model 𝐑,𝐭𝐑𝐭\mathbf{R},\mathbf{t}bold_R , bold_t - Detector
OSOP [12] 3D model 𝐑,𝐭𝐑𝐭\mathbf{R},\mathbf{t}bold_R , bold_t Generates 90 object templates - -
Gen6D [13] Video sequence 𝐑,𝐭𝐑𝐭\mathbf{R},\mathbf{t}bold_R , bold_t SfM and manual crop** of point cloud - -
OnePose [22] Video sequence 𝐑,𝐭𝐑𝐭\mathbf{R},\mathbf{t}bold_R , bold_t SfM to retrieve camera viewpoints Detector
OnePose++ [9] Video sequence 𝐑,𝐭𝐑𝐭\mathbf{R},\mathbf{t}bold_R , bold_t SfM to retrieve camera viewpoints Detector
NOPE [23] Single supp. view 𝐑𝐑\mathbf{R}bold_R - - -
RelPose [24] Two or more supp. views 𝐑𝐑\mathbf{R}bold_R - - -
RelPose++ [25] One or more supp. views 𝐑,𝐭~𝐑~𝐭\mathbf{R},\mathbf{\tilde{t}}bold_R , over~ start_ARG bold_t end_ARG - Detector
PoseDiffusion [26] One or more supp. views 𝐑,𝐭~𝐑~𝐭\mathbf{R},\mathbf{\tilde{t}}bold_R , over~ start_ARG bold_t end_ARG - Detector
LatentFusion [27] One or more supp. views 𝐑,𝐭𝐑𝐭\mathbf{R},\mathbf{t}bold_R , bold_t - Segmentor
Oryon Single supp. view 𝐑,𝐭𝐑𝐭\mathbf{R},\mathbf{t}bold_R , bold_t Expression of textual prompt - -
Horyon (ours) Single supp. view 𝐑,𝐭𝐑𝐭\mathbf{R},\mathbf{t}bold_R , bold_t Expression of textual prompt Detector
\begin{overpic}[clip,trim=34.14322pt 28.45274pt 34.14322pt 28.45274pt,width=43% 3.62pt]{main/figures/diagrams/main_diagram.pdf} \par\end{overpic}
Figure 1: The main modules of our proposed method, Horyon. (a) Overview of a processing branch: Horyon first crops the object of interest from the scene given a textual prompt, and subsequently extracts visual and textual features with DINO and BERT, respectively from the image crop and the prompt. The fusion module ϕTVsubscriptitalic-ϕ𝑇𝑉\phi_{TV}italic_ϕ start_POSTSUBSCRIPT italic_T italic_V end_POSTSUBSCRIPT outputs a multimodal representation of the scene, which is upsampled by a decoder ϕDsubscriptitalic-ϕ𝐷\phi_{D}italic_ϕ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT. At this stage, skip connections from the image encoder ϕVsubscriptitalic-ϕ𝑉\phi_{V}italic_ϕ start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT are used to enrich the final representation. The output features 𝐅𝐅\mathbf{F}bold_F are used to obtain the object segmentation mask 𝐌𝐌\mathbf{M}bold_M. (b) Optimisation procedure. 𝐅Asuperscript𝐅𝐴\mathbf{F}^{A}bold_F start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT, 𝐅Qsuperscript𝐅𝑄\mathbf{F}^{Q}bold_F start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT are optimised by a hardest contrastive loss which uses ground-truth matches as supervision, while the segmentation is supervised by a Dice loss. (c) Test procedure. The predicted masks are used to filter 𝐅Asuperscript𝐅𝐴\mathbf{F}^{A}bold_F start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT, 𝐅Qsuperscript𝐅𝑄\mathbf{F}^{Q}bold_F start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT, and matches are obtained by nearest neighbor. Finally, the matches are backprojected in 3D, and a registration algorithm is used to retrieve the final pose TAQsubscript𝑇𝐴𝑄T_{A\rightarrow{Q}}italic_T start_POSTSUBSCRIPT italic_A → italic_Q end_POSTSUBSCRIPT.

In Tab. I we report examples of methods for unseen-object 6D pose estimation, along with their assumptions. We classify them in model-based (first group), model-free (second group), relative pose estimation methods (third group) and open-vocabulary methods (last group). In the following paragraphs, we describe the assumptions and discuss some example methods for each of the last three groups, as they have the most similar assumptions to Horyon. We also discuss the usage of localisation modules for pose estimation, to contextualise our choice of an open-vocabulary detector to crop the object.

Model-free pose estimation. Recent research has focused on develo** methods for pose estimation of unseen objects without requiring their 3D models during inference, as these 3D models might not always be available. Model-free methods only require a video sequence captured from multiple viewpoints around the object of interest. The video sequence is then used to reconstruct an approximate 3D model of the object through structure-from-motion (SfM) techniques [14]. Exemplary methods include OnePose++ [9] and Gen6D [13]. OnePose++ employs a graph neural network based on attention [28] to establish correspondences between the input image and the 3D reconstruction of the unseen object. Gen6D leverages three distinct modules for localisation, viewpoint estimation, and pose refinement. Both methods require a video sequence of the unseen object during inference, assuming the physical availability of the object. Additionally, a preprocessing procedure is needed to perform the 3D reconstruction of the object, typically involving the execution of an SfM-based algorithm on the video sequence. This procedure can be cumbersome and challenging for users without technical expertise. A related approach, FS6D [6], instead relies on a sparse set of reference images annotated with their respective camera poses. In contrast, Horyon does not require annotated images or complex preprocessing procedures; the user only needs to provide a natural language description of the object.

Relative pose estimation approaches estimate the relative pose of an object with respect to one or more reference views [24, 25, 26]. They are independent on the object’s 3D model. NOPE [23] estimates the rotation of an object in a scene given a single reference view, but the usage of RGB information only does not allow it to estimate the translation component of the pose. Similarly, RelPose [24] estimates the relative rotation of an object using as few as three reference views, but it requires that these views come from the same scene. RelPose++ [25] enhances its predecessor by introducing a reference frame, enabling the estimation of both the translation and rotation components of the pose using as few as two RGB views of the same scene. However, RelPose++ only estimates the pose up to a scale factor; the translation component retrieved is scaled under the assumption that the first camera has q unitary distance from the object. Moreover, RelPose++ requires the images to be approximately centred on the object of interest and does not handle the presence of multiple objects in the same scene. Similarly to RelPose++, PoseDiffusion [26] addresses camera pose estimation up to a scale factor in the translation component. It allows for the estimation of camera poses using 2-8 views of the object by enforcing consistency between pairwise matches through geometric constraints. However, like RelPose++, PoseDiffusion also requires object-centric images and is therefore limited to working with a single object per image. In contrast, Horyon can estimate an absolute value for the translation component of the pose and does not assume the views are captured from the same scene.

Language models for pose estimation. So far, the use of VLMs for pose estimation tasks has been limited. An early approach [29] adopted a grounding network to extend the capabilities of a gras** robot in a category-level pose estimation scenario. Subsequent works similarly used textual prompts for instance-level [30] or category-level [31] pose estimation, but their generalisation capability is limited to the training categories. Recently, Oryon [15] has shown an effective approach based on joint image matching and segmentation in order to estimate the pose of a common object given two scenes. Instead of relying on an object model or a set of reference views, the object of interest is described with a textual prompt, and localised by segmentation. The reference scene and the usage of depth information allows Oryon to estimate the relative 6D pose between the two scenes, even for objects unseen at test time. However, the limited resolution of Oryon’s feature map limits its ability to estimating accurate poses. In this work, we address this limitation by proposing a set of key modifications, which include a higher-resolution feature map for fine-grained feature matching, and an updated VLM to better exploit the textual prompt.

Object localisation for pose estimation. Adopting external modules to localise the object of interest within input images is standard practice in the pose estimation community [32, 5, 11]. This is often necessary to deal with highly occluded datasets [33], and is common both in the classic setting [4, 5] and in the unseen-object setting [22, 9]. Crops are obtained by training a detector on the specific objects (e.g., YOLOv5 [34] in OnePose and OnePose++, FCOS [35] in ZebraPose [5]) or by using ground-truth bounding boxes [6, 25, 26]. Notable exceptions are methods that train the detector contextually to the rest of the pipeline, such as Gen6D [13] and OSOP [12]. For non-generalisable methods, adopting supervised detectors [36, 35, 34] to obtain a bounding box is reasonable, as they assume to have access to the object model at test time. Therefore using them to train a detector does not lead to loss of generalisation. Recently, CNOS [37] has been proposed to provide accurate segmentation masks given a CAD model of the object, without training on the specific object instances. This method can be naturally paired with model-based methods for unseen-object pose estimation [7, 10, 12], as CNOS shares their assumptions (i.e., object 3D model available at test time, but no specific training on it). Horyon retains the open-vocabulary assumptions, as we use an open-vocabulary detector to locate and crop the object.

III Proposed approach: Horyon

Refer to caption
Figure 2: (a) Overview of a layer of the fusion module ϕTVsubscriptitalic-ϕ𝑇𝑉\phi_{TV}italic_ϕ start_POSTSUBSCRIPT italic_T italic_V end_POSTSUBSCRIPT. Note that the left-side output of the module is not used in the last layer. (b) Architecture of the decoder ϕDsubscriptitalic-ϕ𝐷\phi_{D}italic_ϕ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT. 𝐄1subscript𝐄1\mathbf{E}_{1}bold_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT,𝐄2subscript𝐄2\mathbf{E}_{2}bold_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT,𝐄3subscript𝐄3\mathbf{E}_{3}bold_E start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT: visual features obtained from ϕVsubscriptitalic-ϕ𝑉\phi_{V}italic_ϕ start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT; Transp.Conv.: transposed 2D convolution; Group.Conv.: group composed by two blocks, each contains a 2D convolution, a group normalisation and a ReLU; [|][\cdot|\cdot][ ⋅ | ⋅ ] denotes feature concatenation.

III-A Overview

Let two RGBD scenes, an anchor A𝐴Aitalic_A and a query Q𝑄Qitalic_Q, depict two scenes with a common object. Each scene is represented by the RGB channels RGBA (RGBQ) and a depth map 0ptA (0ptQ). The intrinsic camera parameters used to capture both A𝐴Aitalic_A and Q𝑄Qitalic_Q are available. For each pair (A𝐴Aitalic_A, Q𝑄Qitalic_Q) a textual prompt T𝑇Titalic_T describing the common object O𝑂Oitalic_O is provided. The object present at test time was not observed at training time. The objective is to estimate the 6D pose of the object in Q𝑄Qitalic_Q with respect to the reference provided by A𝐴Aitalic_A.

Fig. 1(a) shows the proposed Horyon architecture. Horyon consists of four main modules: an open-vocabolary detector, the vision and text encoders, the fusion module, and the decoder. The open-vocabulary detector identifies the object O𝑂Oitalic_O in both scenes. The two pre-trained networks: the vision encoder ϕVsubscriptitalic-ϕ𝑉\phi_{V}italic_ϕ start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT and the text encoder ϕTsubscriptitalic-ϕ𝑇\phi_{T}italic_ϕ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, encode the cropped A𝐴Aitalic_A and Q𝑄Qitalic_Q in multi-scale feature maps and generate an embedding of the prompt T𝑇Titalic_T, respectively. The fusion module ϕTVsubscriptitalic-ϕ𝑇𝑉\phi_{TV}italic_ϕ start_POSTSUBSCRIPT italic_T italic_V end_POSTSUBSCRIPT provides a joint representation across the visual and text modalities. Finally, the decoder ϕDsubscriptitalic-ϕ𝐷\phi_{D}italic_ϕ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT upsamples the features from the fusion module to provide a set of features that can be used for image matching. The final features are processed by a convolutional layer which outputs the localisation masks of the object. In the following sections, we describe each component, along with the training and test procedures.

III-B Object identification and crop**

We perform 6D pose estimation by image matching, therefore the accuracy of the predicted matches is crucial to the final performance. To this end, we use feature maps that encode the single object of interest instead of the whole scene. We process each scene with the open-vocabulary object detector GroundingDino [16], which allows a good accuracy and can generalise to previously unseen objects. We use the same prompt T𝑇Titalic_T given as input to Horyon, and square and resize each bounding box to avoid deformation. The predicted bounding boxes are used only at test time, and instead we use the ground-truth ones for training. In this way, we can obtain accurate detections without losing generalisation capability.

III-C Vision-Language backbone

The features used for matching should be (1) conditioned on the textual prompt T𝑇Titalic_T, and (2) robust to unseen objects, for which Horyon was not specifically trained. We use DINO [38] as vision encoder ϕVsubscriptitalic-ϕ𝑉\phi_{V}italic_ϕ start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT to extract multi-scale visual feature maps 𝐄1subscript𝐄1\mathbf{E}_{1}bold_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, 𝐄2subscript𝐄2\mathbf{E}_{2}bold_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, 𝐄3subscript𝐄3\mathbf{E}_{3}bold_E start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT and BERT [39] as text encoder ϕTsubscriptitalic-ϕ𝑇\phi_{T}italic_ϕ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, to encode the prompt in a sequence of textual features 𝐞Tsuperscript𝐞𝑇\mathbf{e}^{T}bold_e start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT. Previous works [16] have proven this combination to be effective in obtaining fine-grained feature maps with good generalisation capabilities. Using BERT provides an additional advantage: 𝐞Tsuperscript𝐞𝑇\mathbf{e}^{T}bold_e start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT is not a global vector, but instead is a sequence of feature vectors, one for each token. This allows to preserve the information related to the description of the object, and therefore provides a better representation for the natural language description. Unlike Horyon, Oryon uses CLIP [40] as Vision-Language backbone. Although CLIP showed impressive capabilities on tasks based on localisation [41], it was trained for alignment of global embeddings, and therefore was not designed to provide spatially-informed feature maps.

III-D Fusion module

The fusion module ϕTVsubscriptitalic-ϕ𝑇𝑉\phi_{TV}italic_ϕ start_POSTSUBSCRIPT italic_T italic_V end_POSTSUBSCRIPT is based on cross-attention between patches of the visual feature maps 𝐄1subscript𝐄1\mathbf{E}_{1}bold_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and tokens of the prompt 𝐞Tsuperscript𝐞𝑇\mathbf{e}^{T}bold_e start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT [16] (see Fig. 2(a)). This strategy allows Horyon to preserve the information contained in each specific token in the prompt, and therefore the model can learn to associate each visual patch with the most suitable component of the prompt (e.g., a token representing “red” in the prompt can easily be associated to red patches on the object). In contrast, Oryon [15] adopts a global representation from CLIP as textual embedding, thus collapsing in a single feature vector all the information contained in the prompt. We found the new design to be more effective, particularly when the prompt is noisy (see the ablation study on the prompt type in Tab. IV.

III-E Decoder

The feature map produced by the fusion module ϕTVsubscriptitalic-ϕ𝑇𝑉\phi_{TV}italic_ϕ start_POSTSUBSCRIPT italic_T italic_V end_POSTSUBSCRIPT retains the low resolution of the original input features 𝐄1subscript𝐄1\mathbf{E}_{1}bold_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. In order to obtain high-resolution features necessary for matching, we use a decoder ϕDsubscriptitalic-ϕ𝐷\phi_{D}italic_ϕ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT based on transposed convolutions to upsample the input features. Prior to each upsample layer, we concatenate the purely visual features obtained from DINO, respectively 𝐄1subscript𝐄1\mathbf{E}_{1}bold_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, 𝐄2subscript𝐄2\mathbf{E}_{2}bold_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT and 𝐄3subscript𝐄3\mathbf{E}_{3}bold_E start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, to enrich the semantic representation of the Vision-Language backbone. We define such feature map as guidance features. In Oryon [15], guidance features are extracted from a pretrained Swin transformer [42]. Instead, we directly use the features from DINO, therefore substantially reducing the number of parameters. Moreover, the training strategy of DINO encourages generalisation to novel concepts [38], and thus it is more effective than Swin in our task. The output of the decoder is a high-resolution feature map 𝐅𝐅\mathbf{F}bold_F. To obtain the segmentation mask 𝐌𝐌\mathbf{M}bold_M, we process 𝐅𝐅\mathbf{F}bold_F with a convolutional layer.

III-F Optimisation

We train Horyon with an optimisation procedure based on joint image matching and segmentation. The feature maps 𝐅Asuperscript𝐅𝐴\mathbf{F}^{A}bold_F start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT, 𝐅Qsuperscript𝐅𝑄\mathbf{F}^{Q}bold_F start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT are optimised with a hardest-contrastive loss Fsubscript𝐹\ell_{F}roman_ℓ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT [43, 4], supervised by ground-truth matches between A𝐴Aitalic_A and Q𝑄Qitalic_Q. The positive component Psubscript𝑃\ell_{P}roman_ℓ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT forces the matches to be close in the feature space, while the negative component Nsubscript𝑁\ell_{N}roman_ℓ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT increases the distance between a feature point and its hardest negative.

Let 𝐟Asuperscript𝐟𝐴\mathbf{f}^{A}bold_f start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT, 𝐟Qsuperscript𝐟𝑄\mathbf{f}^{Q}bold_f start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT C×Dabsentsuperscript𝐶𝐷\in\mathbb{R}^{C\times D}∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_D end_POSTSUPERSCRIPT be the features extracted respectively from the feature maps 𝐅Asuperscript𝐅𝐴\mathbf{F}^{A}bold_F start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT, 𝐅Qsuperscript𝐅𝑄\mathbf{F}^{Q}bold_F start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT, by using the ground-truth matches 𝒫𝒫\mathcal{P}caligraphic_P. C=|𝒫|𝐶𝒫C=\lvert\mathcal{P}\rvertitalic_C = | caligraphic_P | is the total number of matches and D𝐷Ditalic_D is the feature dimension. The positive component Psubscript𝑃\ell_{P}roman_ℓ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT is

P=(i,j)𝒫1|𝒫|(dist(fiA,fjQ)μP)+,subscript𝑃subscript𝑖𝑗𝒫1𝒫subscriptdistsubscriptsuperscriptf𝐴𝑖subscriptsuperscriptf𝑄𝑗subscript𝜇𝑃\ell_{P}=\sum_{(i,j)\in\mathcal{P}}\frac{1}{\lvert\mathcal{P}\rvert}\left(% \mathrm{dist}(\textbf{f}^{A}_{i},\textbf{f}^{Q}_{j})-\mu_{P}\right)_{+},roman_ℓ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT ( italic_i , italic_j ) ∈ caligraphic_P end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG | caligraphic_P | end_ARG ( roman_dist ( f start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , f start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) - italic_μ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT , (1)

where ()+=max(0,)subscript0(\cdot)_{+}=\max(0,\cdot)( ⋅ ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT = roman_max ( 0 , ⋅ ) and μPsubscript𝜇𝑃\mu_{P}italic_μ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT is a positive margin, i.e., the desired distance in the feature space of a positive pair.

We consider a set of features 𝐟𝐟\mathbf{f}bold_f and their corresponding coordinates x on the scene to define the negative pairs. Given 𝐟isubscript𝐟𝑖\mathbf{f}_{i}bold_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, its candidate negative set is defined as 𝒩i={k:xkx,ki,xixkτ}subscript𝒩𝑖conditional-set𝑘formulae-sequencesubscriptx𝑘xformulae-sequence𝑘𝑖delimited-∥∥subscriptx𝑖subscriptx𝑘𝜏\mathcal{N}_{i}=\{k\colon\textbf{x}_{k}\in\textbf{x},k\neq i,\lVert\textbf{x}_% {i}-\textbf{x}_{k}\rVert\ \geq\tau\}caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { italic_k : x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ x , italic_k ≠ italic_i , ∥ x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ ≥ italic_τ }. This excludes all points closer than the distance τ𝜏\tauitalic_τ from the reference point xisubscriptx𝑖\textbf{x}_{i}x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in the scene, thereby excluding features describing the same points. Candidate negatives sets are computed for all points of 𝐱Asuperscript𝐱𝐴\mathbf{x}^{A}bold_x start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT and 𝐱Qsuperscript𝐱𝑄\mathbf{x}^{Q}bold_x start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT, and the negative component Nsubscript𝑁\ell_{N}roman_ℓ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT is

N=(i,j)𝒫subscript𝑁subscript𝑖𝑗𝒫\displaystyle\ell_{N}=\sum_{(i,j)\in\mathcal{P}}roman_ℓ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT ( italic_i , italic_j ) ∈ caligraphic_P end_POSTSUBSCRIPT 12|𝒫i|(μNmink𝒩idist(fi,fk))+12subscript𝒫𝑖subscriptsubscript𝜇𝑁subscript𝑘subscript𝒩𝑖distsubscriptf𝑖subscriptf𝑘\displaystyle\frac{1}{2\lvert\mathcal{P}_{i}\rvert}\left(\mu_{N}-\min_{k\in% \mathcal{N}_{i}}\mathrm{dist}(\textbf{f}_{i},\textbf{f}_{k})\right)_{+}divide start_ARG 1 end_ARG start_ARG 2 | caligraphic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG ( italic_μ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT - roman_min start_POSTSUBSCRIPT italic_k ∈ caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_dist ( f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT (2)
+12|𝒫j|(μNmink𝒩jdist(fj,fk))+.12subscript𝒫𝑗subscriptsubscript𝜇𝑁subscript𝑘subscript𝒩𝑗distsubscriptf𝑗subscriptf𝑘\displaystyle+\frac{1}{2\lvert\mathcal{P}_{j}\rvert}\left(\mu_{N}-\min_{k\in% \mathcal{N}_{j}}\mathrm{dist}(\textbf{f}_{j},\textbf{f}_{k})\right)_{+}.+ divide start_ARG 1 end_ARG start_ARG 2 | caligraphic_P start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | end_ARG ( italic_μ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT - roman_min start_POSTSUBSCRIPT italic_k ∈ caligraphic_N start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_dist ( f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) start_POSTSUBSCRIPT + end_POSTSUBSCRIPT .

For each 𝐟isubscript𝐟𝑖\mathbf{f}_{i}bold_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, the nearest 𝐟ksubscript𝐟𝑘\mathbf{f}_{k}bold_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT in the feature space (i.e. the hardest negative) is selected. Given a negative pair, the negative margin μNsubscript𝜇𝑁\mu_{N}italic_μ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT is the desired distance of the two points in the feature space. In Eqs. (1) and (2), dist()dist\mathrm{dist}(\cdot)roman_dist ( ⋅ ) is the inverted and normalised cosine similarity. We implement the segmentation loss Msubscript𝑀\ell_{M}roman_ℓ start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT as a Dice loss [44]. The final loss \ellroman_ℓ is defined as

=M+λNN+λPP,subscript𝑀subscript𝜆𝑁subscript𝑁subscript𝜆𝑃subscript𝑃\ell=\ell_{M}+\lambda_{N}\ell_{N}+\lambda_{P}\ell_{P},roman_ℓ = roman_ℓ start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ,

where λNsubscript𝜆𝑁\lambda_{N}italic_λ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT and λPsubscript𝜆𝑃\lambda_{P}italic_λ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT are hyperparameters.

In Horyon, we change the loss hyperparameters with respect to the ones used in Oryon, in order to adapt to the new context in which the object is cropped. Intuitively, a higher resolution implies more ground-truth matches, therefore we raise the number of matches to C=𝐶absentC=italic_C = 2000, as opposed to C=𝐶absentC=italic_C = 500 in Oryon. For the same reason, we set the excluding distance for the hardest negatives to τ=𝜏absent\tau=italic_τ = 20.

III-G Matching and registration

At test time, we obtain the predicted mask 𝐌Asuperscript𝐌𝐴\mathbf{M}^{A}bold_M start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT, 𝐌Qsuperscript𝐌𝑄\mathbf{M}^{Q}bold_M start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT and the features 𝐅Asuperscript𝐅𝐴\mathbf{F}^{A}bold_F start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT, 𝐅Qsuperscript𝐅𝑄\mathbf{F}^{Q}bold_F start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT from A𝐴Aitalic_A and Q𝑄Qitalic_Q. The predicted masks are used to filter the features belonging to the objects, thus obtaining two lists of features 𝐅MAsubscriptsuperscript𝐅𝐴𝑀\mathbf{F}^{A}_{M}bold_F start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT C1×Dabsentsuperscriptsuperscript𝐶1𝐷\in\mathbb{R}^{C^{1}\times D}∈ blackboard_R start_POSTSUPERSCRIPT italic_C start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT × italic_D end_POSTSUPERSCRIPT, 𝐅MQsubscriptsuperscript𝐅𝑄𝑀\mathbf{F}^{Q}_{M}bold_F start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT C2×Dabsentsuperscriptsuperscript𝐶2𝐷\in\mathbb{R}^{C^{2}\times D}∈ blackboard_R start_POSTSUPERSCRIPT italic_C start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT × italic_D end_POSTSUPERSCRIPT.

We compute the nearest neighbor 𝐟iQ𝐅MQsubscriptsuperscript𝐟𝑄𝑖subscriptsuperscript𝐅𝑄𝑀\mathbf{f}^{Q}_{i}\in\mathbf{F}^{Q}_{M}bold_f start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ bold_F start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT in the feature space for each feature 𝐟iA𝐅MAsubscriptsuperscript𝐟𝐴𝑖subscriptsuperscript𝐅𝐴𝑀\mathbf{f}^{A}_{i}\in\mathbf{F}^{A}_{M}bold_f start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ bold_F start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT, and reject the pairs 𝐟iA,𝐟iQsubscriptsuperscript𝐟𝐴𝑖subscriptsuperscript𝐟𝑄𝑖\mathbf{f}^{A}_{i},\mathbf{f}^{Q}_{i}bold_f start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_f start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for which dist(𝐟iA,𝐟iQ)>μTdistsubscriptsuperscript𝐟𝐴𝑖subscriptsuperscript𝐟𝑄𝑖subscript𝜇𝑇\mathrm{dist}(\mathbf{f}^{A}_{i},\mathbf{f}^{Q}_{i})>\mu_{T}roman_dist ( bold_f start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_f start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) > italic_μ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT.

The resulting points are backprojected to the 3D space, by using the depth maps and the intrinsic camera parameters of A𝐴Aitalic_A and Q𝑄Qitalic_Q, thus obtaining two point clouds 𝐏A,𝐏Qsuperscript𝐏𝐴superscript𝐏𝑄\mathbf{P}^{A},\mathbf{P}^{Q}bold_P start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT , bold_P start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT C×3absentsuperscript𝐶3\in\mathbb{R}^{C\times 3}∈ blackboard_R start_POSTSUPERSCRIPT italic_C × 3 end_POSTSUPERSCRIPT. Finally, to obtain the pose TAQsubscript𝑇𝐴𝑄T_{A\rightarrow Q}italic_T start_POSTSUBSCRIPT italic_A → italic_Q end_POSTSUBSCRIPT, we use PointDSC [45] as its spatial consistency formulation allows to reject inconsistent matches, thus leading to more accurate poses.

IV Results

IV-A Experimental setup

During training, the weights of the image and text encoders ϕVsubscriptitalic-ϕ𝑉\phi_{V}italic_ϕ start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT, ϕTsubscriptitalic-ϕ𝑇\phi_{T}italic_ϕ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT are frozen, we only update the weights of the fusion and decoder modules. We train our model using the Adam optimiser [46] for 20202020 epochs with learning rate 104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT, weight decay 51045superscript1045\cdot 10^{-4}5 ⋅ 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT, and a cosine annealing scheduler [47] to lower the learning rate to 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT. We randomly augment training data by applying horizontal flip**, vertical flip**, and colour jittering. The output resolution of 𝐅Asuperscript𝐅𝐴\mathbf{F}^{A}bold_F start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT, 𝐅Qsuperscript𝐅𝑄\mathbf{F}^{Q}bold_F start_POSTSUPERSCRIPT italic_Q end_POSTSUPERSCRIPT is 192×192192192192\times 192192 × 192, and their feature dimension is F=32𝐹32F=32italic_F = 32. Loss weights are set as λP=λN=0.5subscript𝜆𝑃subscript𝜆𝑁0.5\lambda_{P}=\lambda_{N}=0.5italic_λ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT = italic_λ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT = 0.5. We set the positive and negative margins in the loss as μP=0.2subscript𝜇𝑃0.2\mu_{P}=0.2italic_μ start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT = 0.2 and μN=0.9subscript𝜇𝑁0.9\mu_{N}=0.9italic_μ start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT = 0.9, and the excluding distance for hardest negatives as τ=20𝜏20\tau=20italic_τ = 20. At test time we set the maximum feature distance to identify a match as μT=0.25subscript𝜇𝑇0.25\mu_{T}=0.25italic_μ start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT = 0.25. We limit the number of matches to C=2000𝐶2000C=2000italic_C = 2000. We implement Horyon with PyTorch Lightning [48]. We set the batch size to 8888 and train on four Nvidia V100 GPUs. In this standard setting, a training requires about 12121212 hours.

IV-B Datasets

We train on ShapeNet6D [6], the same synthetic dataset used by Oryon [15]. For evaluation, we introduce a novel benchmark that extends the one proposed in Oryon. This comprises four real-world datasets: REAL275 [19] and Toyota-Light [20], which are also part of the original benchmark, as well as two additional datasets, Linemod [17] and YCB-Video [18], featuring more severe occlusions, clutter and object variety. To comply with our relative pose estimation setting, we form image pairs (A𝐴Aitalic_A, Q𝑄Qitalic_Q) by randomly sampling A𝐴Aitalic_A, Q𝑄Qitalic_Q from their image distribution while ensuring that they are captured from different scenes. We extract 2K image pairs from each test dataset. To accommodate our open-vocabulary setting, we provide natural language descriptions for each object across the four test datasets. In the following paragraphs, we detail the training and testing datasets.

ShapeNet6D (SN6D for short) [6] is a large-scale synthetic dataset comprising a diverse collection of scenes generated by rendering ShapeNet [49] objects in various poses against random backgrounds. SN6D adhere to our zero-shot assumption, as it does not contain any object instances present in the test datasets. We provide natural language descriptions for each object of SN6D by leveraging ShapeNetSem [50], a subset of ShapeNet that includes additional metadata associated with the 3D models, to extract both the object name and a set of synonyms. To generate an object description, we randomly select either the object name or one of the provided synonyms (e.g., possible synonyms for “television” are “tv”, “telly”, and “television receiver”). This augmentation strategy increases the object description variability and reduces overfitting. By following this procedure, we collect a set of 20K data tuples (A𝐴Aitalic_A, Q𝑄Qitalic_Q, T𝑇Titalic_T) to form our training set. Note that T𝑇Titalic_T contains only semantic information and does not include additional details such as the object colour, material, or physical attributes.

REAL275 [19] features 18 objects spanning six different categories, arranged in realistic configurations within diverse indoor scenarios (e.g., on tables, floors). Its challenges are the presence of multiple instances of the same object categories and a wide variety of viewpoints captured in the scenes.

Toyota-Light (TOYL for short) [20] contains scenes where one of 21 different objects is randomly positioned on various fabric types. The images are captured under challenging lighting conditions, which is particularly relevant in our setting. As we process pairs of images, significant lighting differences between them pose a major challenge for image matching.

Linemod [17] (LM for short) comprises 15 different objects arranged on a table in various configurations, along with other objects acting as clutter. It is challenging because the objects are often small, poorly textured with mostly uniform colours, and less conventional compared to those in REAL275 (e.g., plastic toy figures of an ape and a cat).

YCB-Video [18] (YCBV for short) includes 22 household objects, mainly boxes or cans, arranged in 12 different scenes with distractor objects in the background. Many objects share similar geometries, making photometric information crucial for accurate identification and pose estimation. It also contains occlusions, including objects stacked atop one another.

IV-C Evaluation metrics

We evaluate the poses using Average Recall [8] (AR) and ADD(S)-0.1d (abbreviated as ADD) [51]. AR measures pose error using three metrics: VSD (Visible Surface Discrepancy), MSSD (Maximum Symmetry-aware Surface Distance), and MSPD (Maximum Symmetry-Aware Projection Distance). For each metric, AR computes the average recall over a set of thresholds. For VSD and MSSD, the thresholds range from 5% to 50% of the object’s diameter, while for MSPD, the thresholds range from 5% to 50% of the image size. The final AR score is the average of these individual recalls. ADD measures pose error as the average of the pairwise distances between the 3D model points transformed according to the ground-truth and predicted poses. It then computes the recall of pose errors that are smaller than 10% of the object’s diameter [51]. Both AR and ADD are designed to be robust to object symmetries. Compared to AR, ADD’s more restrictive threshold makes it more effective in measuring highly accurate poses. As the quality of the masks influences the matches, we also evaluate the segmentation by mean Intersection-over-Union (mIoU) [52, 53] across the image pairs (A𝐴Aitalic_A, Q𝑄Qitalic_Q).

IV-D Comparison procedure

We consider three groups of methods for comparison. As crop-free methods, we report the results of ObjectMatch [54], LatentFusion [27] and a pipeline built from SIFT [55] and PointDSC [45]. Additionally, in this group we report Oryon’s [15] results. As crop-based methods, we report Horyon along with the results from ObjectMatch, LatentFusion, and SIFT+PointDSC obtained by using a detector to crop the objects. We refer as these versions of the baselines as ObjectMatch, LatentFusion and SIFT respectively. Furthermore, as sparse-view methods we report results with PoseDiffusion [26] and RelPose++ [25]. These are RGB-only methods that require an object detector, but no segmentation mask. To ensure a fair comparison, our main focus is on crop-based methods that use RGBD data (i.e., the crop-based group), as presented in rows 13-21 of Tab. II.

Oryon [15] serves as our primary baseline for comparison, being the only existing method for open-vocabulary object 6D pose estimation. In contrast to approaches that use detectors to crop the input images around the object of interest, Oryon processes entire images without requiring any crop**. We evaluate Oryon’s performance using the official checkpoints made available through its repository.

ObjectMatch [54] was proposed to tackle point cloud registration of scenes with low overlap. ObjectMatch is based on SuperGlue [56] to estimate the matches, and on a custom pose estimator. To compare with it, the mask is used to filter the keypoints obtained by the first step (i.e., keypoint estimation with SuperGlue), and the remaining matches are forwarded to the pose estimator model. When evaluating with the crop (ObjectMatch), we first crop the object according to the predicted box, and then follow the same procedure as above. We use ObjectMatch’s model trained on ScanNet [57] from the official repository. In Oryon [15], results with ObjectMatch were reported by using the mask to crop the image, and subsequently run the method on the resulting crop.

SIFT [55] is used to extract keypoints and descriptors from each image pair, from which matches are computed through descriptor similarity. We use the mask to filter the matches, and subsequently backproject them to the 3D space. The final pose is obtained by registering the points with PointDSC. When evaluating with the crop (SIFT), we first crop the object of interest according to the predicted bounding box, and then follow the same procedure as above.

LatentFusion [27] is a method for RGBD-based pose estimation of unseen objects. Given a set of RGBD support views with associated masks and poses, LatentFusion first builds a latent representation of the object, by aggregating the features of each view. Subsequently, given a query view of the same object, the optimal pose is obtained by optimising a differentiable renderer with the latent object representation as input. To compare with LatentFusion we use A𝐴Aitalic_A to build the representation, and Q𝑄Qitalic_Q as query view. Although LatentFusion has been designed to use 8-16 references, the ablation studies shows that it retains a good performance even with a single reference view, as in our setting. With the crop (LatentFusion), we fed the cropped images of A𝐴Aitalic_A and Q𝑄Qitalic_Q.

For sparse-view methods (rows 1-4) we report results with ground-truth detections and the ones obtained from GroundingDino [16] (GDino). Crop-free methods (rows 5-12) are reported with the ground-truth mask (Oracle) and the one obtained from by Oryon [15]. For crop-based methods (rows 13-24), we report results with ground-truth mask (Oracle), the one predicted by CATSeg [52] and the one predicted by Horyon (Ours). We use the ground-truth detection when evaluating with the ground-truth mask, and the detection obtained from GDino otherwise.

IV-E Quantitative results

TABLE II: We compare the results of Horyon with our baselines. Experiments are reported with different crop and segmentation priors. Results in bold and underlined represent best and second-best methods when using predicted priors, respectively. In the last row we report the increment with respect to Oryon [15]. All metrics are the higher the better.
Method Crop Mask REAL275 Toyota-Light Linemod YCB-Video Average
AR ADD mIoU AR ADD mIoU AR ADD mIoU AR ADD mIoU AR ADD mIoU
Sparse-view 1 PoseDiffusion [26] Oracle - 9.2 0.8 - 7.8 1.2 - 10.8 1.4 - 7.5 0.8 - 8.8 1.1 -
2 GDino - 9.5 0.8 - 8.1 1.6 - 7.7 1.0 - 8.5 1.1 - 8.5 1.1 -
3 RelPose++ [25] Oracle - 22.8 11.9 - 30.9 11.6 - 14.6 9.4 - 15.1 10.8 - 20.8 10.9 -
4 GDino - 23.1 12.8 - 30.5 11.6 - 15.1 10.8 - 15.5 10.6 - 21.1 11.5 -
Crop-free 5 ObjectMatch [54] - Oracle 9.7 4.5 100.0 3.5 1.6 100.0 14.2 13.8 100.0 3.7 2.8 100.0 7.8 5.7 100.0
6 - Oryon 9.2 4.2 66.5 2.9 1.1 68.1 7.5 6.8 30.1 1.6 1.3 39.2 5.3 3.4 51.0
7 SIFT [55] - Oracle 34.1 16.4 100.0 30.3 14.1 100.0 17.6 10.0 100.0 18.5 12.8 100.0 25.1 13.3 100.0
8 - Oryon 24.4 12.8 66.5 27.2 9.9 68.1 6.4 2.6 30.1 4.7 2.1 39.2 15.7 6.9 51.0
9 LatentFusion [27] - Oracle 22.9 9.8 100.0 28.2 10.2 100.0 14.5 7.1 100.0 18.0 10.2 100.0 20.9 9.3 100.0
10 - Oryon 13.7 3.8 66.5 23.1 5.0 68.1 4.8 1.0 30.1 6.7 3.5 39.2 12.1 3.3 51.0
11 Oryon [15] - Oracle 46.5 34.9 100.0 34.1 22.9 100.0 25.3 20.4 100.0 19.4 12.8 100.0 31.3 22.7 100.0
12 - Oryon 32.2 24.3 66.5 30.3 20.9 68.1 12.2 10.2 30.1 8.6 1.6 39.2 20.8 14.3 51.0
Crop-based 13 ObjectMatch [54] Oracle Oracle 20.5 11.1 100.0 8.0 4.1 100.0 12.2 11.1 100.0 6.0 3.7 100.0 11.7 7.5 100.0
14 GDino CATSeg 21.5 11.6 84.5 7.3 4.0 81.6 13.8 11.7 54.4 6.0 3.4 71.7 12.2 7.7 73.0
15 GDino Ours 21.0 11.0 81.3 8.2 4.3 82.1 13.6 11.8 67.6 5.9 3.4 52.0 12.2 7.6 70.7
16 SIFT [55] Oracle Oracle 38.8 21.6 100.0 32.4 16.5 100.0 18.7 10.8 100.0 19.3 13.9 100.0 27.3 15.7 100.0
17 GDino CATSeg 32.9 15.2 84.5 29.5 14.8 81.6 9.4 5.0 54.4 11.3 6.1 71.7 20.8 10.3 73.0
18 GDino Ours 33.5 18.1 81.3 29.9 14.6 82.1 14.6 8.5 67.6 10.1 6.5 52.0 22.0 11.9 70.7
19 LatentFusion [27] Oracle Oracle 22.6 9.6 100.0 28.2 10.2 100.0 14.5 6.4 100.0 17.5 10.2 100.0 20.7 9.1 100.0
20 GDino CATSeg 18.6 6.0 84.5 26.6 9.8 81.6 9.3 4.8 54.4 9.3 5.1 71.7 15.9 6.4 73.0
21 GDino Ours 19.8 8.2 81.3 26.0 10.3 82.1 10.3 3.9 67.6 11.0 8.9 52.0 16.8 7.8 70.7
22 Horyon Oracle Oracle 57.7 49.8 100.0 37.4 28.4 100.0 34.4 27.6 100.0 28.6 22.6 100.0 39.5 32.1 100.0
23 GDino CATSeg 47.4 39.6 84.5 32.7 23.5 81.6 19.6 17.4 54.4 17.9 9.3 71.7 29.4 22.5 73.0
24 GDino Ours 57.9 51.6 81.3 33.0 25.1 82.1 22.0 20.4 67.6 20.6 12.3 52.0 33.4 27.3 70.7
25 ΔΔ\Deltaroman_Δ score +25.5 +27.3 +14.8 +3.0 +4.2 +14.0 +9.8 +10.2 +37.5 +12.0 +10.7 +12.8 +12.6 +13.0 +19.7

We report our results in Tab. II, along with the ones obtained from the baselines and the ones reported in Oryon [15]. In rows 1-4 we report the results of sparse-view methods, which only use RGB data. Note that in their evaluation procedure, both RelPose++ [25] and PoseDiffusion [26] use the ground-truth translation component of the pose to obtain a translation in meters. We observe that PoseDiffusion reaches an overall low performance (8.5 average AR when using GDino as detector), while RelPose++ obtains 21.1 AR with the same detection, which is on par with Oryon (row 12) and also SIFT (row 18). PoseDiffusion is based on matches between the two views, while RelPose++ uses an energy-based formulation to recover the relative pose. In the context of our benchmark, in which images are crowded and distractor objects are present, the matches learned by PoseDiffusion fail at recovering an accurate pose, while RelPose++ can better generalise to our setting. Nonetheless, both methods score significantly lower then Horyon, which surpasses RelPose++ and PoseDiffusion by 12.3 and 24.9 in average AR, respectively.

Rows 5 to 12 show the results obtain from methods that do not use any detection crop. In this setting Oryon reaches 20.8 average AR, while SIFT obtains 15.7 average AR, thus showing that SIFT matches can lead to good pose estimation performance when paired with a robust registration method (i.e., PointDSC [45]). On the other hand, LatentFusion and ObjectMatch only obtain 12.1 and 5.3 AR, respectively. We attribute ObjectMatch’s low performance to the domain shift: ObjectMatch was trained for registration of scenes with low overlap, and in our datasets the only overlap** part is the one showing the object of interest. Moreover, the presence of distractor objects may lead to ambiguities in the matches, which results in an overall low score. LatentFusion was instead trained on renderings of ShapeNet, and therefore it suffers from the domain shift introduced by our benchmark, which show real images. Moreover, LatentFusion was not tested with predicted segmentation maps, which by nature may be noisy and led to inaccurate latent representations.

Rows 16 to 21 show results with crop-based baselines, which are our main comparison with Horyon. For most methods, working on the cropped version of the images is beneficial, even when the resulting matches are filtered by an oracle mask: ObjectMatch improves by 3.9 points on average AR (row 13 vs 5), SIFT by 2.2 points (row 16 vs 7), while LatentFusion loses 0.2 points (row 19 vs 10). This is reasonable, as the crop** can be considered a normalising operator on the image scale that reduces the effect of the camera pose on the image (i.e., farther objects are smaller).

Horyon beats SIFT, the next best method, by 11.4 average AR when using our predicted mask (row 24 vs 18), while the gap falls to 8.6 points when the mask predicted by CATSeg is used (rows 23 vs 17). By comparing the mIoU results on rows 23 and 24, we can observe that CATSeg on average outperforms Horyon, as the first reaches 73.0 mIoU against a 70.7 of the latter. We can gain more insight on this result by observing the mIoU performances separately on each dataset: while Horyon’s and CATSeg’s results on TOYL are on par, CATSeg performs slightly better on REAL275 (+3.2 mIoU) and is much more accurate on YCBV (+19.7 mIoU). On the other hand, on LM Oryon outperforms CATSeg by 13.2 mIoU points. We observe that REAL275 and YCBV contain many different variations of quite common objects (e.g., cans, cups, laptops), while LM is comprised of many unusually objects (e.g., an office hole puncher, a toy ape). These results suggest that Oryon is more effective than CATSeg in identifying unusual objects, and therefore exhibit better generalisation capabilities in LM. On the other hand, CATSeg is more effective with common objects, and possibly can easily disambiguate between similar objects.

In rows 23-24 we can observe how Horyon’s pose estimation performance changes when a different mask is used. The better average AR is obtained with our segmentation mask (33.4 vs 29.4), albeit CATSeg scores on average an higher mIoU (73.0 vs 70.7). This result is even more evident when considering the results of YCBV in which CATSeg produces better masks than our method, but the AR is still higher when our predicted masks are used (20.6 vs 17.9 in average AR). As observed in Oryon [15], an accurate segmentation mask is only important up to a point to determine the quality of pose estimation performance. In Horyon, our joint optimisation procedure enables to learn masks which are optimised to the matching objective, thus resulting in a better performance than the one obtained with an external mask.

Finally, in row 25 we report the increments of Horyon with respect to Oryon, by comparing the settings with predicted masks and detection boxes (i.e., row 24 vs row 12). While all the changes are positive, the behaviour is different if each dataset is considered separately. REAL275 shows the largest improvements in pose estimation, with an AR increment of 25.5. The increment in ADD(S) is even higher (+27.3), showing that Horyon can produce a much higher ratio of very accurate poses. On the other hand, TOYL exhibits the smaller improvement, as AR and ADD(S) only improve by 3.0 and 4.2 respectively. This dataset is less affected by the usage of a detector, as it presents a single object for each scene. It is also the dataset with the largest change in light and point of view across scenes: this suggests that architectural changes can be made to address this specific setting and provide a larger improvement. LM and YCBV both exhibit consistent improvements in AR (9.8 and 12.0 respectively), albeit their performance is still low compared the REAL275 and TOYL. We attribute this fact to the clear higher difficult of these datasets compared to the original ones used in Oryon, which is due to high clutter, object occlusion and object similarity.

IV-F Qualitative results

In Fig. 3 we report some qualitative result on the pose predicted by Horyon, compared to the ones predicted by Oryon [15], SIFT [55] and the ground-truth pose. Oryon’s results are reported using its predicted mask, while results for SIFT and Horyon are reported using Horyon’s predicted mask and detections from GroundingDino.

Fig. 3(a) show results on REAL275 [19]. In this example Horyon obtains fairly accurate poses, albeit the black camera presents a small rotation error, which causes it to be not completely aligned to the ground truth. SIFT and Oryon obtain correct localisations, but they present clearly visible rotation and translation errors.

On Figs. 3(b), we can observe that Horyon is quite accurate on TOYL. SIFT similarly obtains good performances, with only small errors in translation. Instead, Oryon fails completely at localising the object, which results in a large translation error. We observe that the high variation in lightning conditions makes particularly difficult to obtain correct matches, especially when the images are represented with a low-resolution feature map as in Oryon.

Figs. 3(c) show results on LM. Horyon retrieves a pose with a small rotation error, which is larger in the prediction of SIFT. In this case Oryon fails, likely due to a wrong localisation of the blue iron. This example is quite difficult due to the high variation in viewpoint between the two scenes.

Figs. 3(d) present results on YCBV. we can observe that Oryon and SIFT both fail in estimating the pose of the water jug: the first presents a large translation error, while the second shows a wrong rotation. In this case, Horyon predicts a pose which is mostly aligned, but one of the rotation components is wrong due to the partial symmetry of the object. This suggest that, while our method benefits from the high-resolution feature map, it still has difficulties in performing matches in small object regions, such as the handle in the water jug.

Anchor

Ground truth

Oryon [15]

SIFT [55]

Ours

% \begin{overpic}[width=433.62pt]{main/figures/qualitatives/nocs/pose/0_anchor-% min.png} \end{overpic} \begin{overpic}[width=433.62pt]{main/figures/qualitatives/nocs/pose/0_gt-min.% png} \end{overpic} \begin{overpic}[width=433.62pt]{main/figures/qualitatives/nocs/pose/0_oryon-% min.png} \end{overpic} \begin{overpic}[width=433.62pt]{main/figures/qualitatives/nocs/pose/0_sift-min% .png} \end{overpic} \begin{overpic}[width=433.62pt]{main/figures/qualitatives/nocs/pose/0_ours-min% .png} \end{overpic} (a) Prompt: Black lens camera

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/toyl/pose/1_anchor-% min.png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/toyl/pose/1_gt-min.% png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/toyl/pose/1_oryon-% min.png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/toyl/pose/1_sift-min% .png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/toyl/pose/1_ours-min% .png} \end{overpic}

(b) Prompt: Green plastic bottle

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/lm/pose/1_anchor-min% .png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/lm/pose/1_gt-min.png% } \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/lm/pose/1_oryon-min.% png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/lm/pose/1_sift-min.% png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/lm/pose/1_ours-min.% png} \end{overpic}

(c) Prompt: Light blue iron

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/ycbv/pose/1_anchor-% min.png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/ycbv/pose/1_gt-min.% png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/ycbv/pose/1_oryon-% min.png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/ycbv/pose/1_sift-min% .png} \end{overpic}
\begin{overpic}[width=433.62pt]{main/figures/qualitatives/ycbv/pose/1_ours-min% .png} \end{overpic}

(d) Prompt: Blue plastic water jug

Figure 3: Sample pose results from REAL275 [19] (a), Toyota-Light [20] (b), Linemod [17] (c) and YCB-Video [18] (d). All the results use crop from GroundingDino [16] and segmentation mask predicted by Horyon. We show the object model coloured by map** its 3D coordinates to the RGB space. Query images are darkened to highlight the object poses.

IV-G Ablation study

TABLE III: We show the results obtained by ablation the architectural components of Oryon, with the baseline at row 8 to better compare with the other results. All the results are computed with GroundingDino as detector prior and our predicted mask as segmentation prior. All metrics are the higher the better.
Components REAL275 Toyota-Light Linemod YCB-Video Average
Crop Loss Hyp. Our ϕDsubscriptitalic-ϕ𝐷\phi_{D}italic_ϕ start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT Fusion Guidance VLM AR ADD mIoU AR ADD mIoU AR ADD mIoU AR ADD mIoU AR ADD mIoU
1 Ours DINO GDino 35.8 22.2 61.2 35.4 27.3 83.9 5.7 2.9 15.9 6.7 4.5 32.6 20.9 14.2 48.4
2 Ours DINO GDino 43.5 30.8 80.8 28.6 15.9 82.3 15.6 14.2 51.4 13.6 8.8 67.6 25.3 17.4 70.5
3 Ours DINO GDino 26.2 12.6 64.8 30.4 19.6 82.2 6.6 4.7 19.4 8.0 2.7 36.5 17.8 9.9 50.7
4 - DINO GDino 51.2 39.0 74.8 31.2 19.2 77.9 19.9 18.2 48.0 13.7 5.7 55.6 29.0 20.5 64.1
5 Oryon DINO GDino 49.2 41.8 79.2 32.9 24.4 80.9 21.5 19.8 51.3 18.9 9.4 60.4 30.6 23.8 67.9
6 Ours DINO GDino 50.5 41.2 81.8 32.7 24.4 82.5 21.9 20.1 53.0 20.4 13.2 67.6 31.4 24.7 71.6
7 Ours - GDino 48.2 32.6 80.3 32.3 23.1 82.3 21.5 20.1 52.2 20.6 12.8 67.6 30.6 22.2 70.6
8 Ours DINO GDino 57.9 51.6 81.3 33.0 25.1 82.1 22.0 20.4 67.6 20.6 12.3 52.0 33.4 27.3 70.7
9 Ours Swin-B GDino 49.1 46.2 81.3 33.2 24.2 82.3 21.9 20.3 51.9 21.0 13.8 68.5 31.3 26.1 71.0
10 Ours Swin-B CLIP 37.3 19.7 72.0 26.9 13.4 69.1 16.7 14.2 53.8 11.4 4.9 54.5 23.1 13.1 62.3
11 Ours Swin-B ALIGN 40.0 21.6 77.8 26.5 12.8 71.1 16.2 13.3 53.7 11.1 4.7 55.6 23.4 13.1 64.6

We report in Tab. III an ablation study on the components of Horyon, with the baseline at row 8.

What is the effect of crop**? The effect of the crop is strictly related to the change in loss hyperparameters, as using the crop raises the number of matches due to the higher resolution. Therefore, in rows 1 to 3 we examine how the addition of the crop and the change in hyperparameters influence each other. In row 1 we do not use any crop and keep the original loss hyperparameters, resulting in average AR similar to Oryon (20.9 vs 20.8) and much worse than our baseline of 33.4 AR. Only using the crop (row 2) improves the performance, but still results in a significant gap with respect to our baseline (25.3 vs 33.4 AR). In row 3 we observe that only updating the loss hyperparameters leads to a worse performance than Oryon (17.8 vs 20.8 in AR), thus motivating our choice. The most significant change in the loss is in the dimension of the excluding kernel of the negative loss, which restricts the pool of negative candidates. Without using the crop, a larger kernel is detrimental as it removes a significant portion of the candidate negatives from the object region.

How important is the fusion design? To study the fusion impact, we remove (row 4) and replace it with the one from Oryon (row 5). Both experiments result in a worse AR (30.6 and 29.4 vs 33.4 AR), and also the segmentation quality is lower (64.1 and 67.9 vs 70.7 mIoU). Therefore, a fusion module based on cross-attention on visual and text modalities is more effective than aggregating the cross-modalities similarity map as in Oryon [15], as shown in row 5.

How important are the decoder design and the guidance features? In row 6, we use a different design for the decoder module, which uses two guidance feature maps from DINO instead of three, as the lowest-resolution feature map is discarded. This results in a drop of 2 AR points with respect to our baseline, which is mostly due to the REAL275 dataset (50.5 vs 57.9 AR). Intuitively, the low-resolution feature map captures useful information for larger objects, for which the high-resolution could encode noise related to local pixel variations. This explains such behaviour in REAL275, which consists on larger objects on average. In row 7, we remove the skip connections between DINO and the decoder. While this change does not significantly impact the mask quality (70.6 vs 70.7 mIoU), there is an important drop in pose quality, as the AR drops from 33.4 to 30.6. Similarly, switching to a Swin Transformer as guidance backbone (row 9) lowers the AR to 31.3. These results underline the importance of high-resolution features representative of appearance, to counterbalance the semantic content of the embeddings from the VLM.

What is the influence of the VLM choice? In rows 10 and 11 we replace our VLM (GDino) with CLIP [40] and ALIGN [58], respectively. Both backbones similarly underperform our default choice, as they score 23.1 and 23.4 AR (CLIP and ALIGN respectively) against 31.3 AR of row 9. Note that CLIP and ALIGN encode the prompt in a single global embedding, while we use BERT, which outputs a sequence of token-wise features. This allows Horyon to retain the information related to the object description, which instead may be lost in a global representation, and therefore is beneficial to the type of prompts we use.

TABLE IV: We report the results obtained by changing the prompts at test time with an alternative version, which can have an incorrect description (Misleading), show only the object name (Generic), or show only the object description without the name (No name). Crop priors can be Oracle or predicted by GroundingDino [16] Segmentation priors can be Oracle or predicted by Horyon. All metrics are the higher the better.
Prompt type Crop Mask REAL275 Toyota-Light Linemod YCB-Video Average
AR ADD mIoU AR ADD mIoU AR ADD mIoU AR ADD mIoU AR ADD mIoU
1 No Name Oracle Oracle 55.0 46.2 100.0 36.8 28.9 100.0 34.4 28.1 100.0 28.2 22.7 100.0 38.6 31.5 100.0
2 GDino Ours 24.0 27.1 49.0 26.5 20.2 74.3 13.3 10.5 33.3 11.2 5.8 53.7 18.7 15.9 52.6
3 Misleading Oracle Oracle 55.6 47.2 100.0 36.8 28.9 100.0 33.9 27.6 100.0 27.9 21.6 100.0 38.5 31.3 100.0
4 GDino Ours 40.1 38.2 75.9 28.5 19.6 76.7 7.3 4.0 20.2 13.9 7.0 50.6 22.5 17.2 55.8
5 Generic Oracle Oracle 55.6 47.5 100.0 36.8 28.1 100.0 33.7 27.6 100.0 28.0 21.9 100.0 38.5 31.3 100.0
6 GDino Ours 40.3 37.6 75.6 36.1 27.7 85.3 13.6 10.9 33.1 14.3 4.3 54.2 26.1 20.1 62.0
7 Standard Oracle Oracle 57.7 49.8 100.0 37.4 28.4 100.0 34.4 27.6 100.0 28.6 22.6 100.0 39.5 32.1 100.0
8 GDino Ours 57.9 51.6 81.3 33.0 25.1 82.1 22.0 20.4 67.6 20.6 12.3 52.0 33.4 27.3 70.7

What is the most important part of the prompt? In Tab. IV we answer this question by changing the type of prompts used at test time. The experiments in this table only affect the evaluation procedure, as all the training prompts and parameters remains the same. We evaluate three alternative prompt types: No name, in which the object name is replaced by “object” in the prompt (e.g., “brown open object” instead of “brown open laptop”); Misleading, in which the object description is changed to be different from the object appearance (e.g., “white closed laptop” for a laptop that appears brown and open); Generic, that only includes the object name, without any description (e.g., “laptop”). In the same table we report our baseline results, with the standard prompts. For a fair evaluation, when evaluating with GDino as detector, we use the same alternative prompt type we fed to Horyon.

In row 2, we report the results with the No name prompt. This experiment results in a very significant drop in both AR (-14.7) and mIoU (-18.1) compared with the baseline at row 8. While the drop is present and significant in all datasets, it is less catastrophic on TOYL. This dataset is the only one with a single object for each scene, and therefore changing the prompt introduces less ambiguities. This setting reflects the case in which the user providing the prompt is faced with an unknown object they cannot name, thus resulting in a partial description about the object characteristics. Row 4 shows that providing a wrong description greatly impacts the average performance, resulting in a drop of -10.9 AR and -14.9 mIoU. Similarly to the previous prompt, this change has less impact on the TOYL dataset, while LM is more significantly affected, losing 14.7 and 47.4 points in mIoU respectively. It is clear that in this case the main source of error is due to wrong localisation (either from GDino or from the segmentation mask). In row 6 the object description is dropped, resulting in the best average results among the ones with alternative prompts, with an AR of 26.1 and an mIoU of 62.0. While the average drop is still significant compared to the baseline, TOYL is again a notable exception. On this dataset, using a generic description is beneficial, as the experiment outperforms our baseline by 3.1 and 3.2 points in AR and mIoU, respectively. In a context where no ambiguity is possible (i.e., a single object is present), adding a description is detrimental to the pose estimation performance. Finally, rows 1, 3 and 5 report the results with the ground-truth localisation and segmentation. We observe that on average the performances are very close to the baseline with the same mask and detector on row 7. Unsurprisingly, with a perfect localisation Horyon’s features are enough to obtain a good performance in pose estimation, even with a suboptimal prompt.

In conclusion, in our architecture the object name is the most important part of the prompt, as only retaining it still provides a good performance, while completely removing it leads to failure, particularly in case of complex scenes with multiple objects. This experiment highlights an important limitation to the usage of VLMs for pose estimation: such models are still unable to identify and reason about objects given only a description of their visual characteristics.

V Conclusions

We presented Horyon, an approach that significantly improves upon previous open-vocabulary 6D pose estimation methods, by increasing the feature map resolution and providing more accurate matches to perform registration. Our experiments show that Horyon obtains good performances also in scenarios with unusual objects (Linemod) and mild occlusions (YCB-Video). The ablation studies show that the token-wise representation provided by the updated VLM, together with the new fusion strategy, greatly benefit Horyon. The improvement is consistent in terms of generalisation capabilities and robustness to prompt noise as well.

Horyon’s limitations are the need for depth maps and intrinsic camera parameters to perform registration. Such data requirements could be relaxed by exploring monocular depth estimation methods such as DepthAnything [59] on each RGB image. While Horyon is more robust to suboptimal prompts than Oryon, the resulting drop in performance is still significant. Moreover, the variety of prompt usable at test time is limited by the training data, which provides prompts without descriptions. To enrich the prompts at training time, LLMs or image captioners could be used on image samples to provide prompts that also include a description of the object’s colour and physical attributes.

Acknowledgements. This work was supported by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101058589 (AI-PRISM), and it made use of time on the Tier 2 HPC facility JADE2, funded by EPSRC (EP/T022205/1).

References

  • [1] E. Marchand, H. Uchiyama, and F. Spindler, “Pose estimation for augmented reality: a hands-on survey,” in TVCG, 2015.
  • [2] G. Du, K. Wang, S. Lian, and K. Zhao, “Vision-based robotic gras** from object localization, object pose estimation to grasp estimation for parallel grippers: A review,” in Artificial Intelligence Review, 2021.
  • [3] F. Manhardt, W. Kehl, and A. Gaidon, “Roi-10d: Monocular lifting of 2D detection to 6D pose and metric shape,” in CVPR, 2019.
  • [4] J. Corsetti, D. Boscaini, and F. Poiesi, “Revisiting Fully Convolutional Geometric Features for Object 6D Pose Estimation,” in ICCVW, 2023.
  • [5] Y. Su, M. Saleh, T. Fetzer, J. Rambach, N. Navab, B. Busam, D. Stricker, and F. Tombari, “ZebraPose: Coarse to Fine Surface Encoding for 6DoF Object Pose Estimation,” in CVPR, 2022.
  • [6] H. Yisheng, W. Yao, F. Haoqiang, C. Qifeng, and S. Jian, “FS6D: Few-shot 6D pose estimation of novel objects,” in CVPR, 2022.
  • [7] Y. Labbé, L. Manuelli, A. Mousavian, S. Tyree, S. Birchfield, J. Tremblay, J. Carpentier, M. Aubry, D. Fox, and J. Sivic, “Megapose: 6D pose estimation of novel objects via render & compare,” in CoRL, 2022.
  • [8] T. Hodan, M. Sundermeyer, Y. Labbe, V. N. Nguyen, G. Wang, E. Brachmann, B. Drost, V. Lepetit, C. Rother, and J. Matas, “BOP challenge 2023 on detection, segmentation and pose estimation of seen and unseen rigid objects,” in CVPR, 2024.
  • [9] X. He, J. Sun, Y. Wang, D. Huang, H. Bao, and X. Zhou, “OnePose++: Keypoint-Free One-Shot Object Pose Estimation without CAD Models,” in NeurIPS, 2022.
  • [10] A. Caraffa, D. Boscaini, A. Hamza, and F. Poiesi, “Object 6D pose estimation meets zero-shot learning,” arXiv:2312.00947, 2023.
  • [11] V. N. Nguyen, T. Groueix, M. Salzmann, and V. Lepetit, “GigaPose: Fast and robust novel object pose estimation via one correspondence,” in CVPR, 2024.
  • [12] I. Shugurov, F. Li, B. Busam, and S. Ilic, “OSOP: a multi-stage one shot object pose estimation framework,” in CVPR, 2022.
  • [13] Y. Liu, Y. Wen, S. Peng, C. Lin, X. Long, T. Komura, and W. Wang, “Gen6d: Generalizable model-free 6-DoF object pose estimation from rgb images,” in ECCV, 2022.
  • [14] J. Schönberger and J. Frahm, “Structure-from-motion revisited,” in CVPR, 2016.
  • [15] J. Corsetti, D. Boscaini, C. Oh, A. Cavallaro, and F. Poiesi, “Open-vocabulary object 6D pose estimation,” in CVPR, 2024.
  • [16] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv:2303.05499, 2023.
  • [17] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes,” in ACCV, 2012.
  • [18] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes,” in RSS, 2018.
  • [19] H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, “Normalized object coordinate space for category-level 6D object pose and size estimation,” in CVPR, 2019.
  • [20] T. Hodan, F. Michel, E. Brachmann, W. Kehl, A. GlentBuch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis et al., “BOP: Benchmark for 6D object pose estimation,” in ECCV, 2018.
  • [21] D. Cai, J. Heikkila, and E. Rahtu, “OVE6D: Object Viewpoint Encoding for Depth-based 6D Object Pose Estimation,” in CVPR, 2022.
  • [22] J. Sun, Z. Wang, S. Zhang, X. He, H. Zhao, G. Zhang, and X. Zhou, “Onepose: One-shot object pose estimation without CAD models,” in CVPR, 2022.
  • [23] V. N. Nguyen, T. Groueix, G. Ponimatkin, Y. Hu, R. Marlet, M. Salzmann, and V. Lepetit, “NOPE: Novel object pose estimation from a single image,” in CVPR, 2024.
  • [24] J. Zhang, D. Ramanan, and S. Tulsiani, “Relpose: Predicting probabilistic relative rotation for single objects in the wild,” in ECCV, 2022.
  • [25] A. Lin, J. Y. Zhang, D. Ramanan, and S. Tulsiani, “Relpose++: Recovering 6d poses from sparse-view observations,” in arXiv:2305.04926, 2023.
  • [26] J. Wang, C. Rupprecht, and D. Novotny, “Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment,” in CVPR, 2023.
  • [27] K. Park, A. Mousavian, Y. Xiang, and D. Fox, “Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation,” in CVPR, 2020.
  • [28] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in ICLR, 2018.
  • [29] C. Cheang, H. Lin, Y. Fu, and X. Xue, “Learning 6-DoF Object Poses to Grasp Category-Level Objects by Language Instructions,” in ICRA, 2022.
  • [30] B. Fu, S. K. Leong, Y. Di, J. Tang, and X. Ji, “Lanpose: Language-instructed 6D object pose estimation for robotic assembly,” arXiv:2310.13819, 2023.
  • [31] J. Cai, Y. He, W. Yuan, S. Zhu, Z. Dong, L. Bo, and Q. Chen, “OV9D: Open-vocabulary category-level 9D object pose and size estimation,” arXiv:2403.12396, 2024.
  • [32] Y. He, H. Huang, H. Fan, Q. Chen, and J. Sun, “FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation,” in CVPR, 2021.
  • [33] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother, “Learning 6D Object Pose Estimation using 3D Object Coordinates,” in ECCV, 2014.
  • [34] G. Jocher, “Yolov5 by ultralytics,” 2020. [Online]. Available: https://github.com/ultralytics/yolov5
  • [35] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: A simple and strong anchor-free object detector,” TPAMI, 2020.
  • [36] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in CVPR, 2017.
  • [37] T. Groueix, G. Ponimatkin, V. Lepetit, T. Hodan et al., “CNOS: A strong baseline for CAD-based novel object segmentation,” in ICCVW, 2023.
  • [38] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” arXiv:2203.03605, 2022.
  • [39] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805, 2018.
  • [40] A. Radford, J. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  • [41] C. Zhou, C. Loy, and B. Dai, “Extract free dense labels from CLIP,” in ECCV, 2022.
  • [42] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.
  • [43] C. Choy, J. Park, and V. Koltun, “Fully Convolutional Geometric Features,” in ICCV, 2019.
  • [44] C. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” in MICCAI, 2017.
  • [45] X. Bai, Z. Luo, L. Zhou, H. Chen, L. Li, Z. Hu, H. Fu, and C. L. Tai, “PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency,” in CVPR, 2021.
  • [46] D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in ICLR, 2015.
  • [47] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in ICLR, 2017.
  • [48] “Pytorch lightning documentation,” https://lightning.ai/docs/pytorch/stable/, last access: 13/06/2024.
  • [49] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An Information-Rich 3D Model Repository,” arXiv:1512.03012, 2015.
  • [50] M. Savva, A. Chang, and H. P., “Semantically-Enriched 3D Models for Common-sense Knowledge,” in CVPR, 2015.
  • [51] T. Hodan, J. Matas, and S. Obdrzalek, “On evaluation of 6D object pose estimation,” in ECCV, 2016.
  • [52] S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. Kim, “CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation,” in CVPR, 2024.
  • [53] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted CLIP,” in CVPR, 2023.
  • [54] C. Gümeli, A. Dai, and M. Nießner, “ObjectMatch: Robust Registration using Canonical Object Correspondences,” in CVPR, 2023.
  • [55] D. Lowe, “Object recognition from local scale-invariant features,” in ICCV, 1999.
  • [56] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in CVPR, 2020.
  • [57] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes,” in CVPR, 2017.
  • [58] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in ICML, 2021.
  • [59] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in CVPR, 2024.
[Uncaptioned image] Jaime Corsetti is a PhD student at University of Trento, affiliated with the Technologies of Vision Lab in Fondazione Bruno Kessler in Trento, Italy. He received his MsC Degree in Artificial Intelligence Systems at the University of Trento in October 2023. His research interests are in object 6D pose estimation and 3D scene understanding.
[Uncaptioned image] Davide Boscaini is a Research Scientist at the Technologies of Vision Lab of the Fondazione Bruno Kessler in Trento, Italy. He received his PhD in Computational Science from the Università della Svizzera italiana in Lugano, Switzerland, under the supervision of Michael Bronstein. He has been a pioneer in the field of geometric deep learning. His current research interests include 3D perception and understanding, with a focus on object 6D pose estimation and 3D scene understanding.
[Uncaptioned image] Francesco Giuliari is a Researcher at the Technologies of Vision Centre in Fondazione Bruno Kessler in Trento, Italy. He received his PhD degree in Deep Learning from the University of Genoa in 2024. His research interests include visual scene understanding using scene graphs, and vision-based robot navigation using classical and deep learning planners.
[Uncaptioned image] Changjae Oh is a Lecturer at the School of Electronic Engineering and Computer Science and the Centre for Intelligent Sensing from Queen Mary University of London, UK. He received his PhD in Electrical and Electronic Engineering from Yonsei University, Seoul, South Korea, in 2018. From 2018 to 2019, he was a Postdoctoral Research Assistant at Queen Mary University of London, UK. His current research interests include embodied agents and vision-based robot manipulation.
[Uncaptioned image] Andrea Cavallaro is the director of the Idiap Research Institute and a Full Professor at École Polytechnique Fédérale de Lausanne (EPFL), Switzerland. He has been a Full Professor at Queen Mary University of London since 2010. He is a Fellow of the International Association for Pattern Recognition. His research interests include machine learning for multimodal perception, computer vision, audio processing and information privacy.
[Uncaptioned image] Fabio Poiesi is the Head of the Technologies of Vision (TeV) Lab, Fondazione Bruno Kessler, Trento, Italy. He received the PhD degree from the Queen Mary University of London, U.K. and was a post-doctoral researcher with the Queen Mary University of London before moving to TeV. His research interests include 3D scene understanding, object detection and tracking, and extended reality.