High-resolution open-vocabulary
object 6D pose estimation

Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, and Fabio Poiesi (Corresponding author: Jaime Corsetti) Jaime Corsetti is with Fondazione Bruno Kessler (FBK), 38123 Trento, Italy, and also with the Department of Information Engineering and Computer Science, University of Trento, 38122 Trento, Italy (e-mail: [email protected]). Davide Boscaini, Francesco Giuliari, and Fabio Poiesi are with Fondazione Bruno Kessler (FBK), 38123 Trento, Italy (e-mail: [email protected]; [email protected]; [email protected]). Changjae Oh is with the School of Electronic Engineering and Computer Science, Queen Mary University of London, E1 4NS London, United Kingdom (e-mail: [email protected]). Andrea Cavallaro is with the Idiap Research Institute, CH-1920 Martigny, Switzerland, and also with the École polytechnique fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland (e-mail: [email protected]).

Abstract

The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming by 12.6 in Average Recall the previous best-performing approach.

Index Terms:

Object 6D pose estimation, open-vocabulary.

I Introduction

The estimation of the 6D pose of an object requires the prediction of its 3D rotation matrix and 3D translation vector with reference to the camera. Accurate object 6D pose estimation is a fundamental phase in a wide range of applications, such as augmented reality [1], robot gras** [2], and autonomous driving [3]. Data-driven methods [4, 5] can achieve reliable pose estimation, but require expensive object annotations. This problem is mitigated by the development of techniques to generate large-scale synthetic datasets [6, 7]. Another line of research instead defines the unseen-object setting [8], which assumes no overlap between the set of training and test objects [9, 7]. However, most methods for the unseen-object setting are model-based, as they require a CAD model of the object at test time [5, 10, 11, 12], as detailed in the first group of Tab. I. Recent works changed this assumption, and instead require a video sequence of the object at test time [9, 13], from which a set of reference views is extracted (second group in Tab. I). These methods use Structure-from-Motion (SfM) [14] to reconstruct the object 3D model from the reference views [13, 9].

To eliminate the need for the object models or multiple views, in our previous work, Oryon [15], we showed that the object can be described with a textual prompt, thus defining the open-vocabulary object 6D pose estimation setting. This setting assumes the availability of two RGBD scenes, along with a natural language description of the object of interest, provided by the user. The object 6D pose in a scene is estimated with respect to the object in the other scene.

In this work, we present Horyon (High-resolution Oryon), a VLM-based architecture for open-vocabulary object 6D pose estimation that generalises to unseen objects. Horyon uses a VLM to extract visual and textual representations from the input scene pair and the natural language description (i.e., the prompt). The two representations are fused using cross-attention layers that enable information exchange between each scene patch with each prompt token, without losing key details contained in the object description. The resulting feature maps are upsampled and fused with the high-resolution feature maps from a visual encoder. Horyon estimates the object pose in the query scene with respect to the anchor scene by segmentation and pixel-level view matching, and subsequently backprojecting and registering the 3D matches.

This work significantly extends our previous work [15] in several aspects. Instead of processing the whole scene, we identify and localise the object of interest in both scenes with GroundingDino [16] to obtain a bounding box. This leads to a higher resolution feature map without losing generalisation to unseen objects. Moreover, we remove the need for the guidance backbone and instead extract the guidance features directly from the VLM, reducing the training time and the number of parameters to 50% and 37% of the original method, respectively. To validate our choices, we test our method on an extended version of the Oryon benchmark, which includes two popular datasets for pose estimation, i.e., Linemod [17] and YCB-Video [18], in addition to REAL275 [19] and Toyota-Light [20]. Linemod features scenes with high clutter, occlusions and objects that are small and unusual. YCB-Video presents a large variety of poses, and objects that often belong to similar categories (e.g., boxes and cans). We provide extended ablation studies to justify each choice, and also evaluate the effect of the prompt on the final results. In summary, our main contributions are:

•

We provide a comprehensive analysis of recent approaches for estimating the 6D pose of unseen objects, detailing their requirements and operational conditions.
•

We propose to use a detector that localises and crops the object of interest from the input images using a natural language description. Despite its simplicity, this operation effectively mitigates the influence of extraneous factors, such as background clutter and unrelated objects, on the feature representations used for matching.
•

We provide a new multi-scale formulation for the VLM-based feature extractor and decoder modules, which enables extraction of higher-quality features for correspondence matching, resulting in an increase of +12.6 in Average Recall compared to the previous version.
•

We extend the evaluation set with two new challenging datasets with occluded (YCB-Video) or small (Linemod) objects. We provide textual prompts for each object and will make this benchmark public.

II Related work

TABLE I: Comparison of the data requirements of Horyon with examples of state-of-the-art methods for unseen-object 6D pose estimation. We classify the methods based on: Input: the type of input data, typically RGB or RGBD; Reference: additional data used to identify the unseen object at test time; Pose: whether the method is capable of estimating the 6D pose or is limited to the rotation component. Methods that can estimate the translation component up to a scale are identified by

\mathbf{R},\mathbf{\tilde{t}}

.; Object preprocessing: eventual process required at test time to acquire information about the object of interest; Localisation: eventual external modules used to localise the object, typically a segmentor or a detector; Zero-shot: True if the localisation module was not specifically trained for the test dataset.

Method	Input		Reference	Pose	Object preprocessing	Localisation
Method	RGB	D	Reference	Pose	Object preprocessing	Type	Zero-shot
OVE6D [21]		✓	3D model	$\mathbf{R},\mathbf{t}$	Generates and encodes 4K object templates	Segmentor	✗
MegaPose [7]	✓	✓	3D model	$\mathbf{R},\mathbf{t}$	-	Detector	✗
OSOP [12]	✓		3D model	$\mathbf{R},\mathbf{t}$	Generates 90 object templates	-	-
Gen6D [13]	✓		Video sequence	$\mathbf{R},\mathbf{t}$	SfM and manual crop** of point cloud	-	-
OnePose [22]	✓		Video sequence	$\mathbf{R},\mathbf{t}$	SfM to retrieve camera viewpoints	Detector	✗
OnePose++ [9]	✓		Video sequence	$\mathbf{R},\mathbf{t}$	SfM to retrieve camera viewpoints	Detector	✗
NOPE [23]	✓		Single supp. view	$\mathbf{R}$	-	-	-
RelPose [24]	✓		Two or more supp. views	$\mathbf{R}$	-	-	-
RelPose++ [25]	✓		One or more supp. views	$\mathbf{R},\mathbf{\tilde{t}}$	-	Detector	✗
PoseDiffusion [26]	✓		One or more supp. views	$\mathbf{R},\mathbf{\tilde{t}}$	-	Detector	✗
LatentFusion [27]	✓	✓	One or more supp. views	$\mathbf{R},\mathbf{t}$	-	Segmentor	✗
Oryon	✓	✓	Single supp. view	$\mathbf{R},\mathbf{t}$	Expression of textual prompt	-	-
Horyon (ours)	✓	✓	Single supp. view	$\mathbf{R},\mathbf{t}$	Expression of textual prompt	Detector	✓

\begin{overpic}[clip,trim=34.14322pt 28.45274pt 34.14322pt 28.45274pt,width=43% 3.62pt]{main/figures/diagrams/main_diagram.pdf} \par\end{overpic}

Figure 1: The main modules of our proposed method, Horyon. (a) Overview of a processing branch: Horyon first crops the object of interest from the scene given a textual prompt, and subsequently extracts visual and textual features with DINO and BERT, respectively from the image crop and the prompt. The fusion module

\phi_{TV}

outputs a multimodal representation of the scene, which is upsampled by a decoder

\phi_{D}

. At this stage, skip connections from the image encoder

\phi_{V}

are used to enrich the final representation. The output features

\mathbf{F}

are used to obtain the object segmentation mask

\mathbf{M}

. (b) Optimisation procedure.

\mathbf{F}^{A}

\mathbf{F}^{Q}

are optimised by a hardest contrastive loss which uses ground-truth matches as supervision, while the segmentation is supervised by a Dice loss. (c) Test procedure. The predicted masks are used to filter

\mathbf{F}^{A}

\mathbf{F}^{Q}

, and matches are obtained by nearest neighbor. Finally, the matches are backprojected in 3D, and a registration algorithm is used to retrieve the final pose

T_{A\rightarrow{Q}}

In Tab. I we report examples of methods for unseen-object 6D pose estimation, along with their assumptions. We classify them in model-based (first group), model-free (second group), relative pose estimation methods (third group) and open-vocabulary methods (last group). In the following paragraphs, we describe the assumptions and discuss some example methods for each of the last three groups, as they have the most similar assumptions to Horyon. We also discuss the usage of localisation modules for pose estimation, to contextualise our choice of an open-vocabulary detector to crop the object.

Model-free pose estimation. Recent research has focused on develo** methods for pose estimation of unseen objects without requiring their 3D models during inference, as these 3D models might not always be available. Model-free methods only require a video sequence captured from multiple viewpoints around the object of interest. The video sequence is then used to reconstruct an approximate 3D model of the object through structure-from-motion (SfM) techniques [14]. Exemplary methods include OnePose++ [9] and Gen6D [13]. OnePose++ employs a graph neural network based on attention [28] to establish correspondences between the input image and the 3D reconstruction of the unseen object. Gen6D leverages three distinct modules for localisation, viewpoint estimation, and pose refinement. Both methods require a video sequence of the unseen object during inference, assuming the physical availability of the object. Additionally, a preprocessing procedure is needed to perform the 3D reconstruction of the object, typically involving the execution of an SfM-based algorithm on the video sequence. This procedure can be cumbersome and challenging for users without technical expertise. A related approach, FS6D [6], instead relies on a sparse set of reference images annotated with their respective camera poses. In contrast, Horyon does not require annotated images or complex preprocessing procedures; the user only needs to provide a natural language description of the object.

Relative pose estimation approaches estimate the relative pose of an object with respect to one or more reference views [24, 25, 26]. They are independent on the object’s 3D model. NOPE [23] estimates the rotation of an object in a scene given a single reference view, but the usage of RGB information only does not allow it to estimate the translation component of the pose. Similarly, RelPose [24] estimates the relative rotation of an object using as few as three reference views, but it requires that these views come from the same scene. RelPose++ [25] enhances its predecessor by introducing a reference frame, enabling the estimation of both the translation and rotation components of the pose using as few as two RGB views of the same scene. However, RelPose++ only estimates the pose up to a scale factor; the translation component retrieved is scaled under the assumption that the first camera has q unitary distance from the object. Moreover, RelPose++ requires the images to be approximately centred on the object of interest and does not handle the presence of multiple objects in the same scene. Similarly to RelPose++, PoseDiffusion [26] addresses camera pose estimation up to a scale factor in the translation component. It allows for the estimation of camera poses using 2-8 views of the object by enforcing consistency between pairwise matches through geometric constraints. However, like RelPose++, PoseDiffusion also requires object-centric images and is therefore limited to working with a single object per image. In contrast, Horyon can estimate an absolute value for the translation component of the pose and does not assume the views are captured from the same scene.

Language models for pose estimation. So far, the use of VLMs for pose estimation tasks has been limited. An early approach [29] adopted a grounding network to extend the capabilities of a gras** robot in a category-level pose estimation scenario. Subsequent works similarly used textual prompts for instance-level [30] or category-level [31] pose estimation, but their generalisation capability is limited to the training categories. Recently, Oryon [15] has shown an effective approach based on joint image matching and segmentation in order to estimate the pose of a common object given two scenes. Instead of relying on an object model or a set of reference views, the object of interest is described with a textual prompt, and localised by segmentation. The reference scene and the usage of depth information allows Oryon to estimate the relative 6D pose between the two scenes, even for objects unseen at test time. However, the limited resolution of Oryon’s feature map limits its ability to estimating accurate poses. In this work, we address this limitation by proposing a set of key modifications, which include a higher-resolution feature map for fine-grained feature matching, and an updated VLM to better exploit the textual prompt.

Object localisation for pose estimation. Adopting external modules to localise the object of interest within input images is standard practice in the pose estimation community [32, 5, 11]. This is often necessary to deal with highly occluded datasets [33], and is common both in the classic setting [4, 5] and in the unseen-object setting [22, 9]. Crops are obtained by training a detector on the specific objects (e.g., YOLOv5 [34] in OnePose and OnePose++, FCOS [35] in ZebraPose [5]) or by using ground-truth bounding boxes [6, 25, 26]. Notable exceptions are methods that train the detector contextually to the rest of the pipeline, such as Gen6D [13] and OSOP [12]. For non-generalisable methods, adopting supervised detectors [36, 35, 34] to obtain a bounding box is reasonable, as they assume to have access to the object model at test time. Therefore using them to train a detector does not lead to loss of generalisation. Recently, CNOS [37] has been proposed to provide accurate segmentation masks given a CAD model of the object, without training on the specific object instances. This method can be naturally paired with model-based methods for unseen-object pose estimation [7, 10, 12], as CNOS shares their assumptions (i.e., object 3D model available at test time, but no specific training on it). Horyon retains the open-vocabulary assumptions, as we use an open-vocabulary detector to locate and crop the object.

III Proposed approach: Horyon

Refer to caption — Figure 2: (a) Overview of a layer of the fusion module $\phi_{TV}$ . Note that the left-side output of the module is not used in the last layer. (b) Architecture of the decoder $\phi_{D}$ . $\mathbf{E}_{1}$ , $\mathbf{E}_{2}$ , $\mathbf{E}_{3}$ : visual features obtained from $\phi_{V}$ ; Transp.Conv.: transposed 2D convolution; Group.Conv.: group composed by two blocks, each contains a 2D convolution, a group normalisation and a ReLU; $[\cdot|\cdot]$ denotes feature concatenation.

III-A Overview

Let two RGBD scenes, an anchor $A$ and a query $Q$ , depict two scenes with a common object. Each scene is represented by the RGB channels RGB^A (RGB^Q) and a depth map 0ptA (0ptQ). The intrinsic camera parameters used to capture both $A$ and $Q$ are available. For each pair ( $A$ , $Q$ ) a textual prompt $T$ describing the common object $O$ is provided. The object present at test time was not observed at training time. The objective is to estimate the 6D pose of the object in $Q$ with respect to the reference provided by $A$ .

Fig. 1(a) shows the proposed Horyon architecture. Horyon consists of four main modules: an open-vocabolary detector, the vision and text encoders, the fusion module, and the decoder. The open-vocabulary detector identifies the object $O$ in both scenes. The two pre-trained networks: the vision encoder $\phi_{V}$ and the text encoder $\phi_{T}$ , encode the cropped $A$ and $Q$ in multi-scale feature maps and generate an embedding of the prompt $T$ , respectively. The fusion module $\phi_{TV}$ provides a joint representation across the visual and text modalities. Finally, the decoder $\phi_{D}$ upsamples the features from the fusion module to provide a set of features that can be used for image matching. The final features are processed by a convolutional layer which outputs the localisation masks of the object. In the following sections, we describe each component, along with the training and test procedures.

III-B Object identification and crop**

We perform 6D pose estimation by image matching, therefore the accuracy of the predicted matches is crucial to the final performance. To this end, we use feature maps that encode the single object of interest instead of the whole scene. We process each scene with the open-vocabulary object detector GroundingDino [16], which allows a good accuracy and can generalise to previously unseen objects. We use the same prompt $T$ given as input to Horyon, and square and resize each bounding box to avoid deformation. The predicted bounding boxes are used only at test time, and instead we use the ground-truth ones for training. In this way, we can obtain accurate detections without losing generalisation capability.

III-C Vision-Language backbone

The features used for matching should be (1) conditioned on the textual prompt $T$ , and (2) robust to unseen objects, for which Horyon was not specifically trained. We use DINO [38] as vision encoder $\phi_{V}$ to extract multi-scale visual feature maps $\mathbf{E}_{1}$ , $\mathbf{E}_{2}$ , $\mathbf{E}_{3}$ and BERT [39] as text encoder $\phi_{T}$ , to encode the prompt in a sequence of textual features $\mathbf{e}^{T}$ . Previous works [16] have proven this combination to be effective in obtaining fine-grained feature maps with good generalisation capabilities. Using BERT provides an additional advantage: $\mathbf{e}^{T}$ is not a global vector, but instead is a sequence of feature vectors, one for each token. This allows to preserve the information related to the description of the object, and therefore provides a better representation for the natural language description. Unlike Horyon, Oryon uses CLIP [40] as Vision-Language backbone. Although CLIP showed impressive capabilities on tasks based on localisation [41], it was trained for alignment of global embeddings, and therefore was not designed to provide spatially-informed feature maps.

III-D Fusion module

The fusion module $\phi_{TV}$ is based on cross-attention between patches of the visual feature maps $\mathbf{E}_{1}$ and tokens of the prompt $\mathbf{e}^{T}$ [16] (see Fig. 2(a)). This strategy allows Horyon to preserve the information contained in each specific token in the prompt, and therefore the model can learn to associate each visual patch with the most suitable component of the prompt (e.g., a token representing “red” in the prompt can easily be associated to red patches on the object). In contrast, Oryon [15] adopts a global representation from CLIP as textual embedding, thus collapsing in a single feature vector all the information contained in the prompt. We found the new design to be more effective, particularly when the prompt is noisy (see the ablation study on the prompt type in Tab. IV.

III-E Decoder

The feature map produced by the fusion module $\phi_{TV}$ retains the low resolution of the original input features $\mathbf{E}_{1}$ . In order to obtain high-resolution features necessary for matching, we use a decoder $\phi_{D}$ based on transposed convolutions to upsample the input features. Prior to each upsample layer, we concatenate the purely visual features obtained from DINO, respectively $\mathbf{E}_{1}$ , $\mathbf{E}_{2}$ and $\mathbf{E}_{3}$ , to enrich the semantic representation of the Vision-Language backbone. We define such feature map as guidance features. In Oryon [15], guidance features are extracted from a pretrained Swin transformer [42]. Instead, we directly use the features from DINO, therefore substantially reducing the number of parameters. Moreover, the training strategy of DINO encourages generalisation to novel concepts [38], and thus it is more effective than Swin in our task. The output of the decoder is a high-resolution feature map $\mathbf{F}$ . To obtain the segmentation mask $\mathbf{M}$ , we process $\mathbf{F}$ with a convolutional layer.

III-F Optimisation

We train Horyon with an optimisation procedure based on joint image matching and segmentation. The feature maps $\mathbf{F}^{A}$ , $\mathbf{F}^{Q}$ are optimised with a hardest-contrastive loss $\ell_{F}$ [43, 4], supervised by ground-truth matches between $A$ and $Q$ . The positive component $\ell_{P}$ forces the matches to be close in the feature space, while the negative component $\ell_{N}$ increases the distance between a feature point and its hardest negative.

Let $\mathbf{f}^{A}$ , $\mathbf{f}^{Q}$ $\in\mathbb{R}^{C\times D}$ be the features extracted respectively from the feature maps $\mathbf{F}^{A}$ , $\mathbf{F}^{Q}$ , by using the ground-truth matches $\mathcal{P}$ . $C=\lvert\mathcal{P}\rvert$ is the total number of matches and $D$ is the feature dimension. The positive component $\ell_{P}$ is

\ell_{P}=\sum_{(i,j)\in\mathcal{P}}\frac{1}{\lvert\mathcal{P}\rvert}\left(% \mathrm{dist}(\textbf{f}^{A}_{i},\textbf{f}^{Q}_{j})-\mu_{P}\right)_{+},

(1)

where $(\cdot)_{+}=\max(0,\cdot)$ and $\mu_{P}$ is a positive margin, i.e., the desired distance in the feature space of a positive pair.

We consider a set of features $\mathbf{f}$ and their corresponding coordinates x on the scene to define the negative pairs. Given $\mathbf{f}_{i}$ , its candidate negative set is defined as $\mathcal{N}_{i}=\{k\colon\textbf{x}_{k}\in\textbf{x},k\neq i,\lVert\textbf{x}_% {i}-\textbf{x}_{k}\rVert\ \geq\tau\}$ . This excludes all points closer than the distance $\tau$ from the reference point $\textbf{x}_{i}$ in the scene, thereby excluding features describing the same points. Candidate negatives sets are computed for all points of $\mathbf{x}^{A}$ and $\mathbf{x}^{Q}$ , and the negative component $\ell_{N}$ is

	$\displaystyle\ell_{N}=\sum_{(i,j)\in\mathcal{P}}$	$\displaystyle\frac{1}{2\lvert\mathcal{P}_{i}\rvert}\left(\mu_{N}-\min_{k\in% \mathcal{N}_{i}}\mathrm{dist}(\textbf{f}_{i},\textbf{f}_{k})\right)_{+}$		(2)
		$\displaystyle+\frac{1}{2\lvert\mathcal{P}_{j}\rvert}\left(\mu_{N}-\min_{k\in% \mathcal{N}_{j}}\mathrm{dist}(\textbf{f}_{j},\textbf{f}_{k})\right)_{+}.$		(2)

For each $\mathbf{f}_{i}$ , the nearest $\mathbf{f}_{k}$ in the feature space (i.e. the hardest negative) is selected. Given a negative pair, the negative margin $\mu_{N}$ is the desired distance of the two points in the feature space. In Eqs. (1) and (2), $\mathrm{dist}(\cdot)$ is the inverted and normalised cosine similarity. We implement the segmentation loss $\ell_{M}$ as a Dice loss [44]. The final loss $\ell$ is defined as

\ell=\ell_{M}+\lambda_{N}\ell_{N}+\lambda_{P}\ell_{P},

where $\lambda_{N}$ and $\lambda_{P}$ are hyperparameters.

In Horyon, we change the loss hyperparameters with respect to the ones used in Oryon, in order to adapt to the new context in which the object is cropped. Intuitively, a higher resolution implies more ground-truth matches, therefore we raise the number of matches to $C=$ 2000, as opposed to $C=$ 500 in Oryon. For the same reason, we set the excluding distance for the hardest negatives to $\tau=$ 20.

III-G Matching and registration

At test time, we obtain the predicted mask $\mathbf{M}^{A}$ , $\mathbf{M}^{Q}$ and the features $\mathbf{F}^{A}$ , $\mathbf{F}^{Q}$ from $A$ and $Q$ . The predicted masks are used to filter the features belonging to the objects, thus obtaining two lists of features $\mathbf{F}^{A}_{M}$ $\in\mathbb{R}^{C^{1}\times D}$ , $\mathbf{F}^{Q}_{M}$ $\in\mathbb{R}^{C^{2}\times D}$ .

We compute the nearest neighbor $\mathbf{f}^{Q}_{i}\in\mathbf{F}^{Q}_{M}$ in the feature space for each feature $\mathbf{f}^{A}_{i}\in\mathbf{F}^{A}_{M}$ , and reject the pairs $\mathbf{f}^{A}_{i},\mathbf{f}^{Q}_{i}$ for which $\mathrm{dist}(\mathbf{f}^{A}_{i},\mathbf{f}^{Q}_{i})>\mu_{T}$ .

The resulting points are backprojected to the 3D space, by using the depth maps and the intrinsic camera parameters of $A$ and $Q$ , thus obtaining two point clouds $\mathbf{P}^{A},\mathbf{P}^{Q}$ $\in\mathbb{R}^{C\times 3}$ . Finally, to obtain the pose $T_{A\rightarrow Q}$ , we use PointDSC [45] as its spatial consistency formulation allows to reject inconsistent matches, thus leading to more accurate poses.

IV Results

IV-A Experimental setup

During training, the weights of the image and text encoders $\phi_{V}$ , $\phi_{T}$ are frozen, we only update the weights of the fusion and decoder modules. We train our model using the Adam optimiser [46] for $20$ epochs with learning rate $10^{-4}$ , weight decay $5\cdot 10^{-4}$ , and a cosine annealing scheduler [47] to lower the learning rate to $10^{-5}$ . We randomly augment training data by applying horizontal flip**, vertical flip**, and colour jittering. The output resolution of $\mathbf{F}^{A}$ , $\mathbf{F}^{Q}$ is $192\times 192$ , and their feature dimension is $F=32$ . Loss weights are set as $\lambda_{P}=\lambda_{N}=0.5$ . We set the positive and negative margins in the loss as $\mu_{P}=0.2$ and $\mu_{N}=0.9$ , and the excluding distance for hardest negatives as $\tau=20$ . At test time we set the maximum feature distance to identify a match as $\mu_{T}=0.25$ . We limit the number of matches to $C=2000$ . We implement Horyon with PyTorch Lightning [48]. We set the batch size to $8$ and train on four Nvidia V100 GPUs. In this standard setting, a training requires about $12$ hours.

IV-B Datasets

We train on ShapeNet6D [6], the same synthetic dataset used by Oryon [15]. For evaluation, we introduce a novel benchmark that extends the one proposed in Oryon. This comprises four real-world datasets: REAL275 [19] and Toyota-Light [20], which are also part of the original benchmark, as well as two additional datasets, Linemod [17] and YCB-Video [18], featuring more severe occlusions, clutter and object variety. To comply with our relative pose estimation setting, we form image pairs ( $A$ , $Q$ ) by randomly sampling $A$ , $Q$ from their image distribution while ensuring that they are captured from different scenes. We extract 2K image pairs from each test dataset. To accommodate our open-vocabulary setting, we provide natural language descriptions for each object across the four test datasets. In the following paragraphs, we detail the training and testing datasets.

ShapeNet6D (SN6D for short) [6] is a large-scale synthetic dataset comprising a diverse collection of scenes generated by rendering ShapeNet [49] objects in various poses against random backgrounds. SN6D adhere to our zero-shot assumption, as it does not contain any object instances present in the test datasets. We provide natural language descriptions for each object of SN6D by leveraging ShapeNetSem [50], a subset of ShapeNet that includes additional metadata associated with the 3D models, to extract both the object name and a set of synonyms. To generate an object description, we randomly select either the object name or one of the provided synonyms (e.g., possible synonyms for “television” are “tv”, “telly”, and “television receiver”). This augmentation strategy increases the object description variability and reduces overfitting. By following this procedure, we collect a set of 20K data tuples ( $A$ , $Q$ , $T$ ) to form our training set. Note that $T$ contains only semantic information and does not include additional details such as the object colour, material, or physical attributes.

REAL275 [19] features 18 objects spanning six different categories, arranged in realistic configurations within diverse indoor scenarios (e.g., on tables, floors). Its challenges are the presence of multiple instances of the same object categories and a wide variety of viewpoints captured in the scenes.

Toyota-Light (TOYL for short) [20] contains scenes where one of 21 different objects is randomly positioned on various fabric types. The images are captured under challenging lighting conditions, which is particularly relevant in our setting. As we process pairs of images, significant lighting differences between them pose a major challenge for image matching.

Linemod [17] (LM for short) comprises 15 different objects arranged on a table in various configurations, along with other objects acting as clutter. It is challenging because the objects are often small, poorly textured with mostly uniform colours, and less conventional compared to those in REAL275 (e.g., plastic toy figures of an ape and a cat).

YCB-Video [18] (YCBV for short) includes 22 household objects, mainly boxes or cans, arranged in 12 different scenes with distractor objects in the background. Many objects share similar geometries, making photometric information crucial for accurate identification and pose estimation. It also contains occlusions, including objects stacked atop one another.

IV-C Evaluation metrics

We evaluate the poses using Average Recall [8] (AR) and ADD(S)-0.1d (abbreviated as ADD) [51]. AR measures pose error using three metrics: VSD (Visible Surface Discrepancy), MSSD (Maximum Symmetry-aware Surface Distance), and MSPD (Maximum Symmetry-Aware Projection Distance). For each metric, AR computes the average recall over a set of thresholds. For VSD and MSSD, the thresholds range from 5% to 50% of the object’s diameter, while for MSPD, the thresholds range from 5% to 50% of the image size. The final AR score is the average of these individual recalls. ADD measures pose error as the average of the pairwise distances between the 3D model points transformed according to the ground-truth and predicted poses. It then computes the recall of pose errors that are smaller than 10% of the object’s diameter [51]. Both AR and ADD are designed to be robust to object symmetries. Compared to AR, ADD’s more restrictive threshold makes it more effective in measuring highly accurate poses. As the quality of the masks influences the matches, we also evaluate the segmentation by mean Intersection-over-Union (mIoU) [52, 53] across the image pairs ( $A$ , $Q$ ).

IV-D Comparison procedure

We consider three groups of methods for comparison. As crop-free methods, we report the results of ObjectMatch [54], LatentFusion [27] and a pipeline built from SIFT [55] and PointDSC [45]. Additionally, in this group we report Oryon’s [15] results. As crop-based methods, we report Horyon along with the results from ObjectMatch, LatentFusion, and SIFT+PointDSC obtained by using a detector to crop the objects. We refer as these versions of the baselines as ObjectMatch^†, LatentFusion^† and SIFT^† respectively. Furthermore, as sparse-view methods we report results with PoseDiffusion [26] and RelPose++ [25]. These are RGB-only methods that require an object detector, but no segmentation mask. To ensure a fair comparison, our main focus is on crop-based methods that use RGBD data (i.e., the crop-based group), as presented in rows 13-21 of Tab. II.

Oryon [15] serves as our primary baseline for comparison, being the only existing method for open-vocabulary object 6D pose estimation. In contrast to approaches that use detectors to crop the input images around the object of interest, Oryon processes entire images without requiring any crop**. We evaluate Oryon’s performance using the official checkpoints made available through its repository.

ObjectMatch [54] was proposed to tackle point cloud registration of scenes with low overlap. ObjectMatch is based on SuperGlue [56] to estimate the matches, and on a custom pose estimator. To compare with it, the mask is used to filter the keypoints obtained by the first step (i.e., keypoint estimation with SuperGlue), and the remaining matches are forwarded to the pose estimator model. When evaluating with the crop (ObjectMatch^†), we first crop the object according to the predicted box, and then follow the same procedure as above. We use ObjectMatch’s model trained on ScanNet [57] from the official repository. In Oryon [15], results with ObjectMatch were reported by using the mask to crop the image, and subsequently run the method on the resulting crop.

SIFT [55] is used to extract keypoints and descriptors from each image pair, from which matches are computed through descriptor similarity. We use the mask to filter the matches, and subsequently backproject them to the 3D space. The final pose is obtained by registering the points with PointDSC. When evaluating with the crop (SIFT^†), we first crop the object of interest according to the predicted bounding box, and then follow the same procedure as above.

LatentFusion [27] is a method for RGBD-based pose estimation of unseen objects. Given a set of RGBD support views with associated masks and poses, LatentFusion first builds a latent representation of the object, by aggregating the features of each view. Subsequently, given a query view of the same object, the optimal pose is obtained by optimising a differentiable renderer with the latent object representation as input. To compare with LatentFusion we use $A$ to build the representation, and $Q$ as query view. Although LatentFusion has been designed to use 8-16 references, the ablation studies shows that it retains a good performance even with a single reference view, as in our setting. With the crop (LatentFusion^†), we fed the cropped images of $A$ and $Q$ .

For sparse-view methods (rows 1-4) we report results with ground-truth detections and the ones obtained from GroundingDino [16] (GDino). Crop-free methods (rows 5-12) are reported with the ground-truth mask (Oracle) and the one obtained from by Oryon [15]. For crop-based methods (rows 13-24), we report results with ground-truth mask (Oracle), the one predicted by CATSeg [52] and the one predicted by Horyon (Ours). We use the ground-truth detection when evaluating with the ground-truth mask, and the detection obtained from GDino otherwise.

IV-E Quantitative results

TABLE II: We compare the results of Horyon with our baselines. Experiments are reported with different crop and segmentation priors. Results in bold and underlined represent best and second-best methods when using predicted priors, respectively. In the last row we report the increment with respect to Oryon [15]. All metrics are the higher the better.

		Method	Crop	Mask	REAL275			Toyota-Light			Linemod			YCB-Video			Average
		Method	Crop	Mask	AR	ADD	mIoU	AR	ADD	mIoU	AR	ADD	mIoU	AR	ADD	mIoU	AR	ADD	mIoU
Sparse-view	1	PoseDiffusion [26]	Oracle	-	9.2	0.8	-	7.8	1.2	-	10.8	1.4	-	7.5	0.8	-	8.8	1.1	-
	2	PoseDiffusion [26]	GDino	-	9.5	0.8	-	8.1	1.6	-	7.7	1.0	-	8.5	1.1	-	8.5	1.1	-
	3	RelPose++ [25]	Oracle	-	22.8	11.9	-	30.9	11.6	-	14.6	9.4	-	15.1	10.8	-	20.8	10.9	-
	4	RelPose++ [25]	GDino	-	23.1	12.8	-	30.5	11.6	-	15.1	10.8	-	15.5	10.6	-	21.1	11.5	-
Crop-free	5	ObjectMatch [54]	-	Oracle	9.7	4.5	100.0	3.5	1.6	100.0	14.2	13.8	100.0	3.7	2.8	100.0	7.8	5.7	100.0
	6	ObjectMatch [54]	-	Oryon	9.2	4.2	66.5	2.9	1.1	68.1	7.5	6.8	30.1	1.6	1.3	39.2	5.3	3.4	51.0
	7	SIFT [55]	-	Oracle	34.1	16.4	100.0	30.3	14.1	100.0	17.6	10.0	100.0	18.5	12.8	100.0	25.1	13.3	100.0
	8	SIFT [55]	-	Oryon	24.4	12.8	66.5	27.2	9.9	68.1	6.4	2.6	30.1	4.7	2.1	39.2	15.7	6.9	51.0
	9	LatentFusion [27]	-	Oracle	22.9	9.8	100.0	28.2	10.2	100.0	14.5	7.1	100.0	18.0	10.2	100.0	20.9	9.3	100.0
	10	LatentFusion [27]	-	Oryon	13.7	3.8	66.5	23.1	5.0	68.1	4.8	1.0	30.1	6.7	3.5	39.2	12.1	3.3	51.0
	11	Oryon [15]	-	Oracle	46.5	34.9	100.0	34.1	22.9	100.0	25.3	20.4	100.0	19.4	12.8	100.0	31.3	22.7	100.0
	12	Oryon [15]	-	Oryon	32.2	24.3	66.5	30.3	20.9	68.1	12.2	10.2	30.1	8.6	1.6	39.2	20.8	14.3	51.0
Crop-based	13	ObjectMatch^† [54]	Oracle	Oracle	20.5	11.1	100.0	8.0	4.1	100.0	12.2	11.1	100.0	6.0	3.7	100.0	11.7	7.5	100.0
	14		GDino	CATSeg	21.5	11.6	84.5	7.3	4.0	81.6	13.8	11.7	54.4	6.0	3.4	71.7	12.2	7.7	73.0
	15		GDino	Ours	21.0	11.0	81.3	8.2	4.3	82.1	13.6	11.8	67.6	5.9	3.4	52.0	12.2	7.6	70.7
	16	SIFT^† [55]	Oracle	Oracle	38.8	21.6	100.0	32.4	16.5	100.0	18.7	10.8	100.0	19.3	13.9	100.0	27.3	15.7	100.0
	17		GDino	CATSeg	32.9	15.2	84.5	29.5	14.8	81.6	9.4	5.0	54.4	11.3	6.1	71.7	20.8	10.3	73.0
	18		GDino	Ours	33.5	18.1	81.3	29.9	14.6	82.1	14.6	8.5	67.6	10.1	6.5	52.0	22.0	11.9	70.7
	19	LatentFusion^† [27]	Oracle	Oracle	22.6	9.6	100.0	28.2	10.2	100.0	14.5	6.4	100.0	17.5	10.2	100.0	20.7	9.1	100.0
	20		GDino	CATSeg	18.6	6.0	84.5	26.6	9.8	81.6	9.3	4.8	54.4	9.3	5.1	71.7	15.9	6.4	73.0
	21		GDino	Ours	19.8	8.2	81.3	26.0	10.3	82.1	10.3	3.9	67.6	11.0	8.9	52.0	16.8	7.8	70.7
	22	Horyon	Oracle	Oracle	57.7	49.8	100.0	37.4	28.4	100.0	34.4	27.6	100.0	28.6	22.6	100.0	39.5	32.1	100.0
	23		GDino	CATSeg	47.4	39.6	84.5	32.7	23.5	81.6	19.6	17.4	54.4	17.9	9.3	71.7	29.4	22.5	73.0
	24		GDino	Ours	57.9	51.6	81.3	33.0	25.1	82.1	22.0	20.4	67.6	20.6	12.3	52.0	33.4	27.3	70.7
	25	$\Delta$ score			+25.5	+27.3	+14.8	+3.0	+4.2	+14.0	+9.8	+10.2	+37.5	+12.0	+10.7	+12.8	+12.6	+13.0	+19.7

We report our results in Tab. II, along with the ones obtained from the baselines and the ones reported in Oryon [15]. In rows 1-4 we report the results of sparse-view methods, which only use RGB data. Note that in their evaluation procedure, both RelPose++ [25] and PoseDiffusion [26] use the ground-truth translation component of the pose to obtain a translation in meters. We observe that PoseDiffusion reaches an overall low performance (8.5 average AR when using GDino as detector), while RelPose++ obtains 21.1 AR with the same detection, which is on par with Oryon (row 12) and also SIFT^† (row 18). PoseDiffusion is based on matches between the two views, while RelPose++ uses an energy-based formulation to recover the relative pose. In the context of our benchmark, in which images are crowded and distractor objects are present, the matches learned by PoseDiffusion fail at recovering an accurate pose, while RelPose++ can better generalise to our setting. Nonetheless, both methods score significantly lower then Horyon, which surpasses RelPose++ and PoseDiffusion by 12.3 and 24.9 in average AR, respectively.

Rows 5 to 12 show the results obtain from methods that do not use any detection crop. In this setting Oryon reaches 20.8 average AR, while SIFT obtains 15.7 average AR, thus showing that SIFT matches can lead to good pose estimation performance when paired with a robust registration method (i.e., PointDSC [45]). On the other hand, LatentFusion and ObjectMatch only obtain 12.1 and 5.3 AR, respectively. We attribute ObjectMatch’s low performance to the domain shift: ObjectMatch was trained for registration of scenes with low overlap, and in our datasets the only overlap** part is the one showing the object of interest. Moreover, the presence of distractor objects may lead to ambiguities in the matches, which results in an overall low score. LatentFusion was instead trained on renderings of ShapeNet, and therefore it suffers from the domain shift introduced by our benchmark, which show real images. Moreover, LatentFusion was not tested with predicted segmentation maps, which by nature may be noisy and led to inaccurate latent representations.

Rows 16 to 21 show results with crop-based baselines, which are our main comparison with Horyon. For most methods, working on the cropped version of the images is beneficial, even when the resulting matches are filtered by an oracle mask: ObjectMatch improves by 3.9 points on average AR (row 13 vs 5), SIFT by 2.2 points (row 16 vs 7), while LatentFusion loses 0.2 points (row 19 vs 10). This is reasonable, as the crop** can be considered a normalising operator on the image scale that reduces the effect of the camera pose on the image (i.e., farther objects are smaller).

Horyon beats SIFT^†, the next best method, by 11.4 average AR when using our predicted mask (row 24 vs 18), while the gap falls to 8.6 points when the mask predicted by CATSeg is used (rows 23 vs 17). By comparing the mIoU results on rows 23 and 24, we can observe that CATSeg on average outperforms Horyon, as the first reaches 73.0 mIoU against a 70.7 of the latter. We can gain more insight on this result by observing the mIoU performances separately on each dataset: while Horyon’s and CATSeg’s results on TOYL are on par, CATSeg performs slightly better on REAL275 (+3.2 mIoU) and is much more accurate on YCBV (+19.7 mIoU). On the other hand, on LM Oryon outperforms CATSeg by 13.2 mIoU points. We observe that REAL275 and YCBV contain many different variations of quite common objects (e.g., cans, cups, laptops), while LM is comprised of many unusually objects (e.g., an office hole puncher, a toy ape). These results suggest that Oryon is more effective than CATSeg in identifying unusual objects, and therefore exhibit better generalisation capabilities in LM. On the other hand, CATSeg is more effective with common objects, and possibly can easily disambiguate between similar objects.

In rows 23-24 we can observe how Horyon’s pose estimation performance changes when a different mask is used. The better average AR is obtained with our segmentation mask (33.4 vs 29.4), albeit CATSeg scores on average an higher mIoU (73.0 vs 70.7). This result is even more evident when considering the results of YCBV in which CATSeg produces better masks than our method, but the AR is still higher when our predicted masks are used (20.6 vs 17.9 in average AR). As observed in Oryon [15], an accurate segmentation mask is only important up to a point to determine the quality of pose estimation performance. In Horyon, our joint optimisation procedure enables to learn masks which are optimised to the matching objective, thus resulting in a better performance than the one obtained with an external mask.

Finally, in row 25 we report the increments of Horyon with respect to Oryon, by comparing the settings with predicted masks and detection boxes (i.e., row 24 vs row 12). While all the changes are positive, the behaviour is different if each dataset is considered separately. REAL275 shows the largest improvements in pose estimation, with an AR increment of 25.5. The increment in ADD(S) is even higher (+27.3), showing that Horyon can produce a much higher ratio of very accurate poses. On the other hand, TOYL exhibits the smaller improvement, as AR and ADD(S) only improve by 3.0 and 4.2 respectively. This dataset is less affected by the usage of a detector, as it presents a single object for each scene. It is also the dataset with the largest change in light and point of view across scenes: this suggests that architectural changes can be made to address this specific setting and provide a larger improvement. LM and YCBV both exhibit consistent improvements in AR (9.8 and 12.0 respectively), albeit their performance is still low compared the REAL275 and TOYL. We attribute this fact to the clear higher difficult of these datasets compared to the original ones used in Oryon, which is due to high clutter, object occlusion and object similarity.

IV-F Qualitative results

In Fig. 3 we report some qualitative result on the pose predicted by Horyon, compared to the ones predicted by Oryon [15], SIFT^† [55] and the ground-truth pose. Oryon’s results are reported using its predicted mask, while results for SIFT^† and Horyon are reported using Horyon’s predicted mask and detections from GroundingDino.

Fig. 3(a) show results on REAL275 [19]. In this example Horyon obtains fairly accurate poses, albeit the black camera presents a small rotation error, which causes it to be not completely aligned to the ground truth. SIFT^† and Oryon obtain correct localisations, but they present clearly visible rotation and translation errors.

On Figs. 3(b), we can observe that Horyon is quite accurate on TOYL. SIFT^† similarly obtains good performances, with only small errors in translation. Instead, Oryon fails completely at localising the object, which results in a large translation error. We observe that the high variation in lightning conditions makes particularly difficult to obtain correct matches, especially when the images are represented with a low-resolution feature map as in Oryon.

Figs. 3(c) show results on LM. Horyon retrieves a pose with a small rotation error, which is larger in the prediction of SIFT^†. In this case Oryon fails, likely due to a wrong localisation of the blue iron. This example is quite difficult due to the high variation in viewpoint between the two scenes.

Figs. 3(d) present results on YCBV. we can observe that Oryon and SIFT^† both fail in estimating the pose of the water jug: the first presents a large translation error, while the second shows a wrong rotation. In this case, Horyon predicts a pose which is mostly aligned, but one of the rotation components is wrong due to the partial symmetry of the object. This suggest that, while our method benefits from the high-resolution feature map, it still has difficulties in performing matches in small object regions, such as the handle in the water jug.

Anchor

Ground truth

Oryon [15]

SIFT^† [55]

Ours

% \begin{overpic}[width=433.62pt]{main/figures/qualitatives/nocs/pose/0_anchor-% min.png} \end{overpic} \begin{overpic}[width=433.62pt]{main/figures/qualitatives/nocs/pose/0_gt-min.% png} \end{overpic} \begin{overpic}[width=433.62pt]{main/figures/qualitatives/nocs/pose/0_oryon-% min.png} \end{overpic} \begin{overpic}[width=433.62pt]{main/figures/qualitatives/nocs/pose/0_sift-min% .png} \end{overpic} \begin{overpic}[width=433.62pt]{main/figures/qualitatives/nocs/pose/0_ours-min% .png} \end{overpic} (a) Prompt: Black lens camera

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/toyl/pose/1_anchor-% min.png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/toyl/pose/1_gt-min.% png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/toyl/pose/1_oryon-% min.png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/toyl/pose/1_sift-min% .png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/toyl/pose/1_ours-min% .png} \end{overpic}

(b) Prompt: Green plastic bottle

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/lm/pose/1_anchor-min% .png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/lm/pose/1_gt-min.png% } \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/lm/pose/1_oryon-min.% png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/lm/pose/1_sift-min.% png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/lm/pose/1_ours-min.% png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/ycbv/pose/1_anchor-% min.png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/ycbv/pose/1_gt-min.% png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/ycbv/pose/1_oryon-% min.png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/ycbv/pose/1_sift-min% .png} \end{overpic}

\begin{overpic}[width=433.62pt]{main/figures/qualitatives/ycbv/pose/1_ours-min% .png} \end{overpic}

(d) Prompt: Blue plastic water jug

Figure 3: Sample pose results from REAL275 [19] (a), Toyota-Light [20] (b), Linemod [17] (c) and YCB-Video [18] (d). All the results use crop from GroundingDino [16] and segmentation mask predicted by Horyon. We show the object model coloured by map** its 3D coordinates to the RGB space. Query images are darkened to highlight the object poses.

IV-G Ablation study

TABLE III: We show the results obtained by ablation the architectural components of Oryon, with the baseline at row 8 to better compare with the other results. All the results are computed with GroundingDino as detector prior and our predicted mask as segmentation prior. All metrics are the higher the better.

	Components						REAL275			Toyota-Light			Linemod			YCB-Video			Average
	Crop	Loss Hyp.	Our $\phi_{D}$	Fusion	Guidance	VLM	AR	ADD	mIoU	AR	ADD	mIoU	AR	ADD	mIoU	AR	ADD	mIoU	AR	ADD	mIoU
1			✓	Ours	DINO	GDino	35.8	22.2	61.2	35.4	27.3	83.9	5.7	2.9	15.9	6.7	4.5	32.6	20.9	14.2	48.4
2	✓		✓	Ours	DINO	GDino	43.5	30.8	80.8	28.6	15.9	82.3	15.6	14.2	51.4	13.6	8.8	67.6	25.3	17.4	70.5
3		✓	✓	Ours	DINO	GDino	26.2	12.6	64.8	30.4	19.6	82.2	6.6	4.7	19.4	8.0	2.7	36.5	17.8	9.9	50.7
4	✓	✓	✓	-	DINO	GDino	51.2	39.0	74.8	31.2	19.2	77.9	19.9	18.2	48.0	13.7	5.7	55.6	29.0	20.5	64.1
5	✓	✓	✓	Oryon	DINO	GDino	49.2	41.8	79.2	32.9	24.4	80.9	21.5	19.8	51.3	18.9	9.4	60.4	30.6	23.8	67.9
6	✓	✓		Ours	DINO	GDino	50.5	41.2	81.8	32.7	24.4	82.5	21.9	20.1	53.0	20.4	13.2	67.6	31.4	24.7	71.6
7	✓	✓	✓	Ours	-	GDino	48.2	32.6	80.3	32.3	23.1	82.3	21.5	20.1	52.2	20.6	12.8	67.6	30.6	22.2	70.6
8	✓	✓	✓	Ours	DINO	GDino	57.9	51.6	81.3	33.0	25.1	82.1	22.0	20.4	67.6	20.6	12.3	52.0	33.4	27.3	70.7
9	✓	✓	✓	Ours	Swin-B	GDino	49.1	46.2	81.3	33.2	24.2	82.3	21.9	20.3	51.9	21.0	13.8	68.5	31.3	26.1	71.0
10	✓	✓	✓	Ours	Swin-B	CLIP	37.3	19.7	72.0	26.9	13.4	69.1	16.7	14.2	53.8	11.4	4.9	54.5	23.1	13.1	62.3
11	✓	✓	✓	Ours	Swin-B	ALIGN	40.0	21.6	77.8	26.5	12.8	71.1	16.2	13.3	53.7	11.1	4.7	55.6	23.4	13.1	64.6

We report in Tab. III an ablation study on the components of Horyon, with the baseline at row 8.

What is the effect of crop**? The effect of the crop is strictly related to the change in loss hyperparameters, as using the crop raises the number of matches due to the higher resolution. Therefore, in rows 1 to 3 we examine how the addition of the crop and the change in hyperparameters influence each other. In row 1 we do not use any crop and keep the original loss hyperparameters, resulting in average AR similar to Oryon (20.9 vs 20.8) and much worse than our baseline of 33.4 AR. Only using the crop (row 2) improves the performance, but still results in a significant gap with respect to our baseline (25.3 vs 33.4 AR). In row 3 we observe that only updating the loss hyperparameters leads to a worse performance than Oryon (17.8 vs 20.8 in AR), thus motivating our choice. The most significant change in the loss is in the dimension of the excluding kernel of the negative loss, which restricts the pool of negative candidates. Without using the crop, a larger kernel is detrimental as it removes a significant portion of the candidate negatives from the object region.

How important is the fusion design? To study the fusion impact, we remove (row 4) and replace it with the one from Oryon (row 5). Both experiments result in a worse AR (30.6 and 29.4 vs 33.4 AR), and also the segmentation quality is lower (64.1 and 67.9 vs 70.7 mIoU). Therefore, a fusion module based on cross-attention on visual and text modalities is more effective than aggregating the cross-modalities similarity map as in Oryon [15], as shown in row 5.

How important are the decoder design and the guidance features? In row 6, we use a different design for the decoder module, which uses two guidance feature maps from DINO instead of three, as the lowest-resolution feature map is discarded. This results in a drop of 2 AR points with respect to our baseline, which is mostly due to the REAL275 dataset (50.5 vs 57.9 AR). Intuitively, the low-resolution feature map captures useful information for larger objects, for which the high-resolution could encode noise related to local pixel variations. This explains such behaviour in REAL275, which consists on larger objects on average. In row 7, we remove the skip connections between DINO and the decoder. While this change does not significantly impact the mask quality (70.6 vs 70.7 mIoU), there is an important drop in pose quality, as the AR drops from 33.4 to 30.6. Similarly, switching to a Swin Transformer as guidance backbone (row 9) lowers the AR to 31.3. These results underline the importance of high-resolution features representative of appearance, to counterbalance the semantic content of the embeddings from the VLM.

What is the influence of the VLM choice? In rows 10 and 11 we replace our VLM (GDino) with CLIP [40] and ALIGN [58], respectively. Both backbones similarly underperform our default choice, as they score 23.1 and 23.4 AR (CLIP and ALIGN respectively) against 31.3 AR of row 9. Note that CLIP and ALIGN encode the prompt in a single global embedding, while we use BERT, which outputs a sequence of token-wise features. This allows Horyon to retain the information related to the object description, which instead may be lost in a global representation, and therefore is beneficial to the type of prompts we use.

TABLE IV: We report the results obtained by changing the prompts at test time with an alternative version, which can have an incorrect description (Misleading), show only the object name (Generic), or show only the object description without the name (No name). Crop priors can be Oracle or predicted by GroundingDino [16] Segmentation priors can be Oracle or predicted by Horyon. All metrics are the higher the better.

	Prompt type	Crop	Mask	REAL275			Toyota-Light			Linemod			YCB-Video			Average
	Prompt type	Crop	Mask	AR	ADD	mIoU	AR	ADD	mIoU	AR	ADD	mIoU	AR	ADD	mIoU	AR	ADD	mIoU
1	No Name	Oracle	Oracle	55.0	46.2	100.0	36.8	28.9	100.0	34.4	28.1	100.0	28.2	22.7	100.0	38.6	31.5	100.0
2	No Name	GDino	Ours	24.0	27.1	49.0	26.5	20.2	74.3	13.3	10.5	33.3	11.2	5.8	53.7	18.7	15.9	52.6
3	Misleading	Oracle	Oracle	55.6	47.2	100.0	36.8	28.9	100.0	33.9	27.6	100.0	27.9	21.6	100.0	38.5	31.3	100.0
4	Misleading	GDino	Ours	40.1	38.2	75.9	28.5	19.6	76.7	7.3	4.0	20.2	13.9	7.0	50.6	22.5	17.2	55.8
5	Generic	Oracle	Oracle	55.6	47.5	100.0	36.8	28.1	100.0	33.7	27.6	100.0	28.0	21.9	100.0	38.5	31.3	100.0
6	Generic	GDino	Ours	40.3	37.6	75.6	36.1	27.7	85.3	13.6	10.9	33.1	14.3	4.3	54.2	26.1	20.1	62.0
7	Standard	Oracle	Oracle	57.7	49.8	100.0	37.4	28.4	100.0	34.4	27.6	100.0	28.6	22.6	100.0	39.5	32.1	100.0
8	Standard	GDino	Ours	57.9	51.6	81.3	33.0	25.1	82.1	22.0	20.4	67.6	20.6	12.3	52.0	33.4	27.3	70.7

What is the most important part of the prompt? In Tab. IV we answer this question by changing the type of prompts used at test time. The experiments in this table only affect the evaluation procedure, as all the training prompts and parameters remains the same. We evaluate three alternative prompt types: No name, in which the object name is replaced by “object” in the prompt (e.g., “brown open object” instead of “brown open laptop”); Misleading, in which the object description is changed to be different from the object appearance (e.g., “white closed laptop” for a laptop that appears brown and open); Generic, that only includes the object name, without any description (e.g., “laptop”). In the same table we report our baseline results, with the standard prompts. For a fair evaluation, when evaluating with GDino as detector, we use the same alternative prompt type we fed to Horyon.

In row 2, we report the results with the No name prompt. This experiment results in a very significant drop in both AR (-14.7) and mIoU (-18.1) compared with the baseline at row 8. While the drop is present and significant in all datasets, it is less catastrophic on TOYL. This dataset is the only one with a single object for each scene, and therefore changing the prompt introduces less ambiguities. This setting reflects the case in which the user providing the prompt is faced with an unknown object they cannot name, thus resulting in a partial description about the object characteristics. Row 4 shows that providing a wrong description greatly impacts the average performance, resulting in a drop of -10.9 AR and -14.9 mIoU. Similarly to the previous prompt, this change has less impact on the TOYL dataset, while LM is more significantly affected, losing 14.7 and 47.4 points in mIoU respectively. It is clear that in this case the main source of error is due to wrong localisation (either from GDino or from the segmentation mask). In row 6 the object description is dropped, resulting in the best average results among the ones with alternative prompts, with an AR of 26.1 and an mIoU of 62.0. While the average drop is still significant compared to the baseline, TOYL is again a notable exception. On this dataset, using a generic description is beneficial, as the experiment outperforms our baseline by 3.1 and 3.2 points in AR and mIoU, respectively. In a context where no ambiguity is possible (i.e., a single object is present), adding a description is detrimental to the pose estimation performance. Finally, rows 1, 3 and 5 report the results with the ground-truth localisation and segmentation. We observe that on average the performances are very close to the baseline with the same mask and detector on row 7. Unsurprisingly, with a perfect localisation Horyon’s features are enough to obtain a good performance in pose estimation, even with a suboptimal prompt.

In conclusion, in our architecture the object name is the most important part of the prompt, as only retaining it still provides a good performance, while completely removing it leads to failure, particularly in case of complex scenes with multiple objects. This experiment highlights an important limitation to the usage of VLMs for pose estimation: such models are still unable to identify and reason about objects given only a description of their visual characteristics.

V Conclusions

We presented Horyon, an approach that significantly improves upon previous open-vocabulary 6D pose estimation methods, by increasing the feature map resolution and providing more accurate matches to perform registration. Our experiments show that Horyon obtains good performances also in scenarios with unusual objects (Linemod) and mild occlusions (YCB-Video). The ablation studies show that the token-wise representation provided by the updated VLM, together with the new fusion strategy, greatly benefit Horyon. The improvement is consistent in terms of generalisation capabilities and robustness to prompt noise as well.

Horyon’s limitations are the need for depth maps and intrinsic camera parameters to perform registration. Such data requirements could be relaxed by exploring monocular depth estimation methods such as DepthAnything [59] on each RGB image. While Horyon is more robust to suboptimal prompts than Oryon, the resulting drop in performance is still significant. Moreover, the variety of prompt usable at test time is limited by the training data, which provides prompts without descriptions. To enrich the prompts at training time, LLMs or image captioners could be used on image samples to provide prompts that also include a description of the object’s colour and physical attributes.

Acknowledgements. This work was supported by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101058589 (AI-PRISM), and it made use of time on the Tier 2 HPC facility JADE2, funded by EPSRC (EP/T022205/1).

References

[1] E. Marchand, H. Uchiyama, and F. Spindler, “Pose estimation for augmented reality: a hands-on survey,” in TVCG, 2015.
[2] G. Du, K. Wang, S. Lian, and K. Zhao, “Vision-based robotic gras** from object localization, object pose estimation to grasp estimation for parallel grippers: A review,” in Artificial Intelligence Review, 2021.
[3] F. Manhardt, W. Kehl, and A. Gaidon, “Roi-10d: Monocular lifting of 2D detection to 6D pose and metric shape,” in CVPR, 2019.
[4] J. Corsetti, D. Boscaini, and F. Poiesi, “Revisiting Fully Convolutional Geometric Features for Object 6D Pose Estimation,” in ICCVW, 2023.
[5] Y. Su, M. Saleh, T. Fetzer, J. Rambach, N. Navab, B. Busam, D. Stricker, and F. Tombari, “ZebraPose: Coarse to Fine Surface Encoding for 6DoF Object Pose Estimation,” in CVPR, 2022.
[6] H. Yisheng, W. Yao, F. Haoqiang, C. Qifeng, and S. Jian, “FS6D: Few-shot 6D pose estimation of novel objects,” in CVPR, 2022.
[7] Y. Labbé, L. Manuelli, A. Mousavian, S. Tyree, S. Birchfield, J. Tremblay, J. Carpentier, M. Aubry, D. Fox, and J. Sivic, “Megapose: 6D pose estimation of novel objects via render & compare,” in CoRL, 2022.
[8] T. Hodan, M. Sundermeyer, Y. Labbe, V. N. Nguyen, G. Wang, E. Brachmann, B. Drost, V. Lepetit, C. Rother, and J. Matas, “BOP challenge 2023 on detection, segmentation and pose estimation of seen and unseen rigid objects,” in CVPR, 2024.
[9] X. He, J. Sun, Y. Wang, D. Huang, H. Bao, and X. Zhou, “OnePose++: Keypoint-Free One-Shot Object Pose Estimation without CAD Models,” in NeurIPS, 2022.
[10] A. Caraffa, D. Boscaini, A. Hamza, and F. Poiesi, “Object 6D pose estimation meets zero-shot learning,” arXiv:2312.00947, 2023.
[11] V. N. Nguyen, T. Groueix, M. Salzmann, and V. Lepetit, “GigaPose: Fast and robust novel object pose estimation via one correspondence,” in CVPR, 2024.
[12] I. Shugurov, F. Li, B. Busam, and S. Ilic, “OSOP: a multi-stage one shot object pose estimation framework,” in CVPR, 2022.
[13] Y. Liu, Y. Wen, S. Peng, C. Lin, X. Long, T. Komura, and W. Wang, “Gen6d: Generalizable model-free 6-DoF object pose estimation from rgb images,” in ECCV, 2022.
[14] J. Schönberger and J. Frahm, “Structure-from-motion revisited,” in CVPR, 2016.
[15] J. Corsetti, D. Boscaini, C. Oh, A. Cavallaro, and F. Poiesi, “Open-vocabulary object 6D pose estimation,” in CVPR, 2024.
[16] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv:2303.05499, 2023.
[17] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, “Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes,” in ACCV, 2012.
[18] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes,” in RSS, 2018.
[19] H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, “Normalized object coordinate space for category-level 6D object pose and size estimation,” in CVPR, 2019.
[20] T. Hodan, F. Michel, E. Brachmann, W. Kehl, A. GlentBuch, D. Kraft, B. Drost, J. Vidal, S. Ihrke, X. Zabulis et al., “BOP: Benchmark for 6D object pose estimation,” in ECCV, 2018.
[21] D. Cai, J. Heikkila, and E. Rahtu, “OVE6D: Object Viewpoint Encoding for Depth-based 6D Object Pose Estimation,” in CVPR, 2022.
[22] J. Sun, Z. Wang, S. Zhang, X. He, H. Zhao, G. Zhang, and X. Zhou, “Onepose: One-shot object pose estimation without CAD models,” in CVPR, 2022.
[23] V. N. Nguyen, T. Groueix, G. Ponimatkin, Y. Hu, R. Marlet, M. Salzmann, and V. Lepetit, “NOPE: Novel object pose estimation from a single image,” in CVPR, 2024.
[24] J. Zhang, D. Ramanan, and S. Tulsiani, “Relpose: Predicting probabilistic relative rotation for single objects in the wild,” in ECCV, 2022.
[25] A. Lin, J. Y. Zhang, D. Ramanan, and S. Tulsiani, “Relpose++: Recovering 6d poses from sparse-view observations,” in arXiv:2305.04926, 2023.
[26] J. Wang, C. Rupprecht, and D. Novotny, “Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment,” in CVPR, 2023.
[27] K. Park, A. Mousavian, Y. Xiang, and D. Fox, “Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation,” in CVPR, 2020.
[28] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in ICLR, 2018.
[29] C. Cheang, H. Lin, Y. Fu, and X. Xue, “Learning 6-DoF Object Poses to Grasp Category-Level Objects by Language Instructions,” in ICRA, 2022.
[30] B. Fu, S. K. Leong, Y. Di, J. Tang, and X. Ji, “Lanpose: Language-instructed 6D object pose estimation for robotic assembly,” arXiv:2310.13819, 2023.
[31] J. Cai, Y. He, W. Yuan, S. Zhu, Z. Dong, L. Bo, and Q. Chen, “OV9D: Open-vocabulary category-level 9D object pose and size estimation,” arXiv:2403.12396, 2024.
[32] Y. He, H. Huang, H. Fan, Q. Chen, and J. Sun, “FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation,” in CVPR, 2021.
[33] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother, “Learning 6D Object Pose Estimation using 3D Object Coordinates,” in ECCV, 2014.
[34] G. Jocher, “Yolov5 by ultralytics,” 2020. [Online]. Available: https://github.com/ultralytics/yolov5
[35] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: A simple and strong anchor-free object detector,” TPAMI, 2020.
[36] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in CVPR, 2017.
[37] T. Groueix, G. Ponimatkin, V. Lepetit, T. Hodan et al., “CNOS: A strong baseline for CAD-based novel object segmentation,” in ICCVW, 2023.
[38] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” arXiv:2203.03605, 2022.
[39] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805, 2018.
[40] A. Radford, J. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[41] C. Zhou, C. Loy, and B. Dai, “Extract free dense labels from CLIP,” in ECCV, 2022.
[42] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in ICCV, 2021.
[43] C. Choy, J. Park, and V. Koltun, “Fully Convolutional Geometric Features,” in ICCV, 2019.
[44] C. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” in MICCAI, 2017.
[45] X. Bai, Z. Luo, L. Zhou, H. Chen, L. Li, Z. Hu, H. Fu, and C. L. Tai, “PointDSC: Robust Point Cloud Registration using Deep Spatial Consistency,” in CVPR, 2021.
[46] D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in ICLR, 2015.
[47] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in ICLR, 2017.
[48] “Pytorch lightning documentation,” https://lightning.ai/docs/pytorch/stable/, last access: 13/06/2024.
[49] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An Information-Rich 3D Model Repository,” arXiv:1512.03012, 2015.
[50] M. Savva, A. Chang, and H. P., “Semantically-Enriched 3D Models for Common-sense Knowledge,” in CVPR, 2015.
[51] T. Hodan, J. Matas, and S. Obdrzalek, “On evaluation of 6D object pose estimation,” in ECCV, 2016.
[52] S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. Kim, “CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation,” in CVPR, 2024.
[53] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted CLIP,” in CVPR, 2023.
[54] C. Gümeli, A. Dai, and M. Nießner, “ObjectMatch: Robust Registration using Canonical Object Correspondences,” in CVPR, 2023.
[55] D. Lowe, “Object recognition from local scale-invariant features,” in ICCV, 1999.
[56] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in CVPR, 2020.
[57] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes,” in CVPR, 2017.
[58] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in ICML, 2021.
[59] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in CVPR, 2024.