License: arXiv.org perpetual non-exclusive license
arXiv:2403.16238v1 [cs.RO] 24 Mar 2024

KITchen: A Real-World Benchmark and Dataset for 6D Object Pose Estimation in Kitchen Environments

Abdelrahman Younes and Tamim Asfour

The research leading to these results has received funding from the Baden-Württemberg Ministry of Science, Research and the Arts (MWK) as part of the state's "digital@bw" digitization strategy in the context of the Real-World Lab "Robotics AI" and by the Carl Zeiss Foundation through the JuBot project. The authors are with the High Performance Humanoid Technologies Lab, Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology (KIT), Germany. {younes, asfour}@kit.edu

Abstract

Despite the recent progress on 6D object pose estimation methods for robotic grasping, a substantial performance gap persists between the capabilities of these methods on existing datasets and their efficacy in real-world mobile manipulation tasks, particularly when robots rely solely on their monocular egocentric field of view (FOV). Existing real-world datasets primarily focus on table-top grasping scenarios, where a robotic arm is placed in a fixed position and the objects are centralized within the FOV of fixed external camera(s). Assessing performance on such datasets may not accurately reflect the challenges encountered in everyday mobile manipulation tasks within kitchen environments, such as retrieving objects from higher shelves, sinks, dishwashers, ovens, refrigerators, or microwaves. To address this gap, we present KITchen, a novel benchmark designed specifically for estimating the 6D poses of objects located in diverse positions within kitchen settings. For this purpose, we recorded a comprehensive dataset comprising around 205k real-world RGBD images for 111 kitchen objects captured in two distinct kitchens, utilizing one humanoid robot with its egocentric perspectives. Subsequently, we developed a semi-automated annotation pipeline to streamline the labeling process of such datasets, resulting in the generation of 2D object labels, 2D object segmentation masks, and 6D object poses with minimized human effort. The benchmark, the dataset, and the annotation pipeline are available at https://kitchen-dataset.github.io/KITchen.

I Introduction

Recent work on robot navigation in indoor environments shows remarkable advances in enabling mobile robots to navigate towards a goal position specified through different modalities such as 2D points [50, 15], object images [9, 36], language instructions [8, 1], and acoustic signals [51, 10]. However, expanding the capabilities of these robots beyond navigation to perform tasks that require physical interaction with the surrounding objects remains a harder challenge. Therefore, understanding the 3D surroundings and estimating the 6D poses of objects are essential prerequisites for any robotic grasping and manipulation task [38, 39, 5].

Current advances in tackling the 6D pose estimation problem focus on developing new models and approaches [40, 31] to achieve the best results on the BOP challenge (https://bop.felk.cvut.cz) datasets [20, 6, 21, 18, 27, 45, 49, 17, 43, 22]. While this paradigm boosted the research on 6D pose estimation, the available real-world datasets primarily serve the table-top robotic grasping setup, featuring a robotic arm fixed in a position above and close to the objects, which are often centered within the robot's FOV and, in some cases, captured with a multi-camera setup [44].

Figure 1: Challenging kitchen locations that our dataset covers in contrast with the currently available datasets. The objects are distributed across diverse locations such as fridge, drawer, sink, higher shelves, microwave, dishwasher, oven, etc.

These datasets do not cover the challenging scenarios that mobile manipulators face inside indoor environments, especially in kitchens, where objects are normally placed in different off-center positions with respect to the robot's field of view (FOV), such as on higher shelves, inside fridges, microwaves, dishwashers, or ovens, or in sinks. These locations not only impose challenging 6D poses with respect to the robot's camera but also involve more diverse and challenging surroundings, such as transparent shelves in the case of refrigerators, see-through shelves in the case of dishwashers, and reflective backgrounds in the case of sinks. These challenges are not covered in the currently available real-world datasets [44]. Because of these gaps, performance on these real-world datasets is not a reliable indication of how the developed methods perform in mobile manipulation tasks with monocular egocentric FOV.

In addition to that, the current top 10 models on the BOP leaderboard train a model for each dataset [24], or even for each object [40, 47], which makes them hard to use for robotic applications, where the robots have to deal with a large number of objects under constrained resources. Furthermore, the average inference speed of these top 10 approaches is 0.0283 frames per second (fps), with the best being 4.386 fps. This makes these approaches unreliable for real-time applications such as mobile manipulation, where the 6D pose estimate is only a preliminary step of object grasping and is followed by a set of actions needed to execute the grasp, such as grasp selection and motion planning.

To overcome the limitations of the current 6D pose estimation methods, we introduce KITchen, the first-of-its-kind large-scale real-world dataset recorded using the humanoid robot ARMAR-6 [3], which has adjustable height and roll-yaw neck, in 2 different kitchen environments covering 111 kitchen objects from the humanoid robots’ egocentric perspective to cover the objects in the challenging kitchens’ locations as shown in Fig. 1. KITchen offers 2D bounding boxes, object segmentation, and 6D poses annotated with a semi-automated annotation pipeline to minimize the need for manual labeling.

The main contributions of our work are: (i) we introduce a large real-world annotated RGBD dataset for 111 objects with their 2D bounding boxes, segmentation masks, and 6D poses; (ii) we propose a semi-automated annotation pipeline to annotate the objects in the dataset and make it publicly available to support other robotics groups in creating such large-scale real-world datasets; (iii) we introduce a new benchmark and competition, where the focus is to solve the object 6D pose estimation problem depending solely on the monocular FOV of robots, limiting the submissions to approaches that run at a minimum of 5 fps to encourage further work on this problem while taking real-time applicability into consideration.

Figure 2: The two distinguished kitchens where we recorded our dataset. On the left side is the Main Kitchen while on the right side is the Mobile Kitchen.

II Related Work

Dataset | Objects | Images | Annotated Objects/Image | Multi-object | Multi-instance | Robot's FOV
LineMOD/LineMOD-Occluded [20, 6] | 15 | 18.2K | ≤ 8 | | |
T-LESS [21] | 30 | 39K | ≤ 10 | | |
ITODD [18] | 28 | 1K | ≤ 8 | | |
Homebrewed-Database [27] | 33 | 5K | ≤ 8 | | |
HOPE [45] | 28 | 238 | 5-20 | | |
IC-BIN [17] | 3 | 177 | ≤ 3 | | |
TUD-L [22] | 3 | 11K | 1 | | |
MP6D [11] | 20 | 20.1K | ≤ 8 | | |
ClearPose [12] | 63 | 355K | ≤ 10 | | |
YCB-Video (YCB-V) [49, 7] | 21 | 134K | 5 | | |
KITchen (ours) | 111 | 205K | 10-50 | | |
TABLE I: Overview of available datasets for instance-level 6D pose estimation, highlighting key metrics including the number of covered objects in the dataset, total images count, number of annotated objects per image, presence of multi-object setups, availability of multiple instances of same objects, and whether the dataset was captured using a mobile robot’s field of view.

II-A Objects Datasets

Current research on 6D pose estimation leverages several datasets categorized into two main groups: instance-level object datasets and category-level object datasets. Instance-level datasets offer 6D pose annotations for specific objects, serving as benchmarks for many object pose estimation methods. In contrast, category-level datasets aim to extend object pose estimation approaches to estimate the pose of different instances within the same category. In this work, we focus on instance-level object pose estimation. This subsection provides an overview of currently available real-world datasets for instance-level 6D object pose estimation.

LineMOD (LM) [20] comprises 15 texture-less objects with diverse shapes, colors, and sizes. LM provides approximately 1.2K real-world test images for each object in cluttered scenes, totaling 18241 images. LineMOD-Occluded (LMO) [6] offers pose annotations for only eight objects from the LineMOD dataset under severely occluded conditions. T-LESS [21] consists of 30 industrial texture-less, symmetric, and similar objects with 1296 real-world images per object, totaling around 39K images. ITODD [18] provides 6D pose annotations for 28 industrial objects with fewer than 1K publicly available Gray-Depth validation images. Homebrewed-Database (HB) [27] comprises fewer than 5K real-world images as a validation set for 33 objects, with only 8 of them being household objects. HOPE [45] consists of 28 toy grocery objects that could be utilized in kitchen environments, but it provides only 238 real-world images in 50 scenes. IC-BIN [17] also offers only 177 real-world test images for only 3 out of its 8 objects in cluttered multi-object scenes with heavy occlusion to be used for the BOP challenge. TUD-L [22] provides around 11K real-world images for 3 objects that are not placed on tables, which distinguishes this dataset from the others. MP6D [11] consists of 20.1K real-world frames for 20 symmetrical specular-reflective objects in cluttered multi-object setups with occlusion. ClearPose [12] offers about 355K real images for 63 transparent symmetrical objects in 51 cluttered scenes with diverse backgrounds and occlusion. YCB-Video (YCB-V) [49] provides 134K real-world images for 21 objects from the original YCB dataset [7].

KIT object models database [28] was originally introduced in 2012 and offers 3D CAD models for more than 100 diverse objects, the majority of which are kitchen-related groceries. However, it only offers very few images for each object, which makes it hard to use this dataset for 6D pose estimation with the current state-of-the-art (SOTA) data-driven 6D pose estimation approaches. KIT bimanual manipulation dataset [30] provides rich data for learning models of bimanual manipulation tasks from human demonstrations. It includes accurate whole-body motion data, hand configurations, and 6D object poses captured using various sensors. The dataset features 12 bimanual actions for 21 kitchen-related objects.

In this work, we carefully selected 111 kitchen-related objects from the YCB, KIT object dataset, and the KIT bimanual manipulation dataset to record the first-of-its-kind large-scale real-world RGBD dataset featuring multi-objects in structured cluttered setups with diverse backgrounds and lighting conditions recorded using a humanoid robot.

II-B 6D Pose Estimation Methods

The current landscape of 6D pose estimation methods is diverse, ranging from traditional techniques such as template matching [26, 19, 41, 35] and correspondences with locally invariant features [14, 13, 37] to the current advanced deep learning SOTA render & compare approaches [31, 48]. These approaches provide the 6D poses of novel objects by rendering many views of the object during inference using its 3D CAD model and then passing these rendered views with the received cropped image of the object obtained by any 2D object detectors [42, 33, 52, 32, 46] to a coarse model which classifies which rendered image best matches the input image. Finally, they pass the initial pose to a refiner network to estimate an updated 6D pose of the object. In this work, we leverage MegaPose [31], Segment Anything [29], and YOLOv5 [25] to annotate our dataset.

III The KITchen Dataset

III-A Dataset’s Objects

We aim to create a large-scale real-world dataset that covers objects that are commonly used inside kitchen environments. Although some of the existing object datasets already offer objects that are commonly used in kitchens, they either lack enough diverse annotated RGBD images to train on [49] or offer no annotated RGBD images at all [7, 30, 28, 34]. Therefore, we decided to reuse the already available kitchen-related objects from these datasets and provide a large real-world annotated RGBD dataset for them to facilitate research on 6D pose estimation for kitchen objects. These objects range from toy vegetables and fruits from [7], to kitchen tools such as knives, spoons, cups, mugs, bowls, cutting board, egg whisk, frying pan, plate, etc. from [7, 49, 30, 28, 34], to kitchen grocery objects from [7, 28].

III-B Dataset Recording


Figure 3: Diverse robot torso positions. The images display heights of 145cm, 177cm, and 185cm from left to right, illustrating the varied perspectives captured in the datasets and the different placements of objects relative to the robot’s field of view.

We recorded the dataset using our humanoid robot ARMAR-6 [2, 4] inside two distinct kitchen environments, as seen in Fig. 2. The first kitchen, referred to as the Main Kitchen, includes typical kitchen appliances such as a fridge, counter with drawers, table, sink, microwave, dishwasher, and oven. The second kitchen, named Mobile Kitchen, features a counter with drawers, sink, dishwasher, fridge, and three tables. To enhance diversity, we utilized four different table-top colors (red, white, gray, and blue) and varied the camera's heights (150cm, 177cm, and 185cm) using ARMAR-6's torso as demonstrated in Fig. 3. Additionally, we recorded data under three different pitch angles (10 degrees, 37 degrees, and 49 degrees down) and six different lighting conditions as shown in Fig. 4. To avoid similar and repetitive frames, we limited our recording to 5 fps. To the best of our knowledge, this is the first dataset of its kind that covers this many different robot fields of view.

Figure 4: Variation in robot neck pitch angle. The images depict angles of 10, 37, and 49 degrees from left to right, showcasing a diverse range of perspectives.

III-C Annotation Pipeline

Figure 5: Our proposed Annotation Pipeline. The pipeline begins with inputting 3D meshes of dataset objects, which BlenderProc2 processes to generate synthetic data with 2D bounding boxes. This annotated 2D data is utilized to train a YOLOv5 2D object detector. Subsequently, real-world recorded data is fed into the trained model, and the output undergoes manual inspection for correct and incorrect labeling. The correctly labeled images are employed for model refinement, which is then validated on the incorrectly labeled ones. This iterative process continues until all images are accurately labeled. The images with correct labels are then passed to Segment Anything (SAM) to produce masks. Finally, the images, along with the 2D labels and 3D meshes, are input into MegaPose to generate 6D poses for detected objects. Manual inspection of poses is conducted through contour and mesh overlay images, and corrected annotations are used to fine-tune MegaPose iteratively until the entire dataset is accurately annotated.

Annotating objects with their ground truth 6D poses is a labor-intensive and time-consuming task. To streamline this process, we propose a semi-automated annotation pipeline. This pipeline generates three types of annotations: 2D object bounding boxes, 2D segmentation masks, and 6D poses.

III-C1 2D Objects Bounding Boxes Annotation

The pipeline starts by receiving the collected 3D CAD object models for the dataset and then generates around 100K annotated photo-realistic synthetic RGBD images with 2D bounding boxes using BlenderProc2 [16]. These synthetic images are used to fine-tune a pretrained YOLOv5 model [25] for 2D object detection. Subsequently, the trained model is applied to our real-world data, and its predictions are manually inspected to separate correctly labeled images from incorrectly labeled ones. The model is then fine-tuned iteratively on the correctly labeled images until all real-world data is accurately labeled with 2D object labels.

III-C2 2D Objects Segmentation Masks

For segmenting the objects and producing the 2D segmentation masks, we leverage Segment Anything [29], by passing the images as well as the 2D bounding boxes generated from the previous step.
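As a minimal sketch of the hand-over between the detector and the segmentation step, the snippet below converts one detection from the normalized (center x, center y, width, height) layout used in YOLOv5 label files into a pixel-space (x1, y1, x2, y2) box, which is the layout box prompts for Segment Anything typically take. The function name `yolo_to_xyxy`, the image size, and the example box are illustrative assumptions; the paper does not state in which format the boxes are exchanged.

```python
import numpy as np

def yolo_to_xyxy(box_norm, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box_norm
    box = np.array(
        [
            (cx - w / 2.0) * image_width,   # left edge
            (cy - h / 2.0) * image_height,  # top edge
            (cx + w / 2.0) * image_width,   # right edge
            (cy + h / 2.0) * image_height,  # bottom edge
        ],
        dtype=np.float64,
    )
    # Keep the prompt inside the image in case the detector overshoots.
    box[[0, 2]] = np.clip(box[[0, 2]], 0, image_width)
    box[[1, 3]] = np.clip(box[[1, 3]], 0, image_height)
    return box

# Hypothetical detection centered in a 640x480 image.
example_box = yolo_to_xyxy((0.5, 0.5, 0.25, 0.5), image_width=640, image_height=480)
print(example_box)
```

Clipping is a design choice of this sketch: it avoids passing a prompt that extends past the image border for objects that are only partially visible.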

III-C3 6D Object Poses

To generate the 6D poses for the objects in the images, we pass the 2D bounding boxes which are generated using the fine-tuned YOLOv5 object detection model alongside the 3D CAD models of the detected objects with the input image into MegaPose [31]. The output 6D poses are used to overlay contours and meshes on the images for manual inspection. The MegaPose model is fine-tuned with corrected labeled data iteratively until all data are accurately annotated. The entire annotation pipeline is illustrated in Fig. 5 and several illustrative examples of the output of each step are demonstrated in Fig. 6.
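The overlay used for manual inspection relies on projecting the object mesh into the image with the estimated pose. The following is a minimal sketch of that projection under a standard pinhole camera model. The intrinsics `K`, the example pose, and the three vertices are made-up placeholders, and the authors' actual rendering code may differ.

```python
import numpy as np

def project_vertices(vertices, pose, intrinsics):
    """Project Nx3 model-frame vertices to Nx2 pixel coordinates.

    pose is a 4x4 homogeneous transform from the model frame to the camera
    frame; intrinsics is the 3x3 pinhole camera matrix.
    """
    ones = np.ones((vertices.shape[0], 1))
    vertices_h = np.hstack([vertices, ones])        # N x 4 homogeneous points
    in_camera = (pose @ vertices_h.T).T[:, :3]      # N x 3 in the camera frame
    uvw = (intrinsics @ in_camera.T).T              # N x 3 before perspective division
    return uvw[:, :2] / uvw[:, 2:3]

# Placeholder intrinsics and a pose that puts the object 1 m in front of the camera.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [0.0, 0.0, 1.0]
vertices = np.array([[0.0, 0.0, 0.0],
                     [0.1, 0.0, 0.0],
                     [0.0, 0.1, 0.0]])
pixels = project_vertices(vertices, pose, K)
print(pixels)
```

Drawing the projected vertices or the outline of the projected mesh on top of the RGB image then makes a wrong pose visible as a misaligned overlay.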

Figure 6: Examples of the results generated by our proposed annotation pipeline. Sequentially from left to right: output of the 2D detector, segmentation masks, contour overlay, and mesh overlay.

III-D Comparison to Existing Datasets

When compared to currently available datasets, the KITchen dataset stands out in several key aspects. With a diverse collection of 111 objects, our dataset offers a significantly wider range than the average number of objects found in existing datasets, surpassing the average by a factor of four. This expansive variety is crucial for training robust pose estimation models capable of handling a multitude of real-world scenarios. Moreover, the KITchen dataset offers a total of 205K RGBD images. This surpasses the average number of annotated images in existing datasets by over threefold, providing more data for training and evaluation purposes. Furthermore, our dataset has a remarkably larger number of annotated objects per image compared to the existing datasets with an unprecedented number of objects reaching 50 per image. This exceeds any available dataset by a significant margin, enabling more comprehensive analysis and training of instance-level 6D pose estimation models. Additionally, the KITchen dataset is unique in its capture methodology. It is the only dataset to have been recorded using the field of view of a humanoid robot with adjustable heights, camera angles, and lighting conditions. Unlike existing datasets that predominantly focus on tabletop scenes, our dataset features challenging locations within kitchen environments including refrigerators, ovens, sinks, higher shelves, microwaves, and dishwashers, offering a broader scope of real-world scenarios for pose estimation research. An overview of the dataset comparison is given in Tab. I.
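The two ratios quoted above can be reproduced from the values in Tab. I. The sketch below is one way to do the arithmetic. It assumes a plain unweighted mean over the ten table rows of existing datasets and uses the rounded image counts as printed in the table; whether the authors computed the averages in exactly this way is not stated.

```python
# (number of objects, number of images in thousands), copied from Tab. I.
existing_datasets = {
    "LineMOD/LineMOD-Occluded": (15, 18.2),
    "T-LESS": (30, 39.0),
    "ITODD": (28, 1.0),
    "Homebrewed-Database": (33, 5.0),
    "HOPE": (28, 0.238),
    "IC-BIN": (3, 0.177),
    "TUD-L": (3, 11.0),
    "MP6D": (20, 20.1),
    "ClearPose": (63, 355.0),
    "YCB-V": (21, 134.0),
}
kitchen_objects, kitchen_images = 111, 205.0

count = len(existing_datasets)
mean_objects = sum(objs for objs, _ in existing_datasets.values()) / count
mean_images = sum(imgs for _, imgs in existing_datasets.values()) / count

object_ratio = kitchen_objects / mean_objects
image_ratio = kitchen_images / mean_images

print(f"mean objects: {mean_objects:.1f}, KITchen ratio: {object_ratio:.2f}")
print(f"mean images: {mean_images:.1f}K, KITchen ratio: {image_ratio:.2f}")
```

Under these assumptions the mean is about 24.4 objects and about 58.4K images per dataset, which gives ratios of roughly 4.5 and 3.5, in line with the stated factor of four and the "over threefold" claim.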

IV The KITchen Benchmark

Our proposed KITchen benchmark aims to encourage researchers in both the computer vision and robotics fields to test their developed methods on our diverse and challenging multi-object dataset while considering the constraints of robots' resources. To this end, we impose specific guidelines for leaderboard submissions to ensure practical applicability. Specifically, submissions must utilize a single model for all objects and maintain a minimum processing frequency of 5 fps during inference. These conditions increase the likelihood that the submitted methods are applicable in robotics. Checking the methods on the BOP Benchmark [23] leaderboard against these criteria reveals noteworthy disparities. Among the top 10 methods on the leaderboard, only two adhere to the requirement of utilizing a single model per dataset rather than per object. Moreover, none of these methods achieves the prescribed 5 fps, with the closest reaching 4.3 fps. This discrepancy underscores a crucial gap between current state-of-the-art approaches and the demands of time-critical robotics applications, as evidenced by the average processing speed of the top 10 approaches on the BOP leaderboard, which stands at a mere 0.03 fps.
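To make the 5 fps requirement concrete, the sketch below times a pose estimator over a list of frames and compares the resulting throughput with the threshold. The helper names, the warm-up convention, and the dummy estimator are hypothetical; the benchmark text does not specify the exact timing protocol or hardware.

```python
import time

MIN_FPS = 5.0  # minimum processing frequency required for submissions

def measure_fps(estimate_pose, frames, warmup=1):
    """Return frames processed per second, excluding a few warm-up calls."""
    for frame in frames[:warmup]:
        estimate_pose(frame)                # warm-up, not timed
    start = time.perf_counter()
    for frame in frames:
        estimate_pose(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

def meets_requirement(fps, min_fps=MIN_FPS):
    return fps >= min_fps

def seconds_per_frame(fps):
    return 1.0 / fps

def dummy_estimator(frame):
    """Stand-in for a real pose estimator; it only sleeps for 10 ms."""
    time.sleep(0.01)

measured = measure_fps(dummy_estimator, list(range(5)))
print(f"measured {measured:.1f} fps, passes: {meets_requirement(measured)}")
print(f"budget at 5 fps: {seconds_per_frame(MIN_FPS):.2f} s per frame")
```

Expressed per frame, 5 fps leaves a budget of 0.2 s for each image, whereas the reported leaderboard average of 0.0283 fps corresponds to roughly 35 s per image.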

IV-A Problem Statement

The benchmark is designed to address the object 6D pose estimation problem, where the model receives an image $I$ from the dataset $D$, where $D$ is a set of RGBD images. The image $I$ contains a set of objects $\{o\}^{n}_{i=0}$. The model has access to $M$, where $M$ is a set of 3D meshes of all objects $O$ in the dataset $D$. The objective is to estimate the pose $P$ of all objects $\{o\}^{n}_{i=0}$ in each image $I$, where $P=[R,T;0,1]$, where $R$ is a $3\times 3$ rotational matrix that describes the rotation of each of $\{o\}^{n}_{i=0}$ to the robot camera's frame and $T$ is the translation vector to the origin of the robot camera's coordinate system.
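A minimal sketch of the pose representation follows: it assembles the 4x4 matrix $P=[R,T;0,1]$ from a rotation $R$ and a translation $T$ and applies it to a point. It assumes the common convention that $P$ maps points from the object (model) frame into the camera frame. The function names and example values are illustrative only.

```python
import numpy as np

def make_pose(R, T):
    """Assemble the homogeneous pose P = [R, T; 0, 1] after validating R."""
    R = np.asarray(R, dtype=np.float64)
    T = np.asarray(T, dtype=np.float64).reshape(3)
    is_orthonormal = np.allclose(R @ R.T, np.eye(3), atol=1e-6)
    is_proper = np.isclose(np.linalg.det(R), 1.0, atol=1e-6)
    if not (is_orthonormal and is_proper):
        raise ValueError("R is not a valid rotation matrix")
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = T
    return P

def transform_points(P, points):
    """Map Nx3 points from the object frame into the camera frame."""
    return points @ P[:3, :3].T + P[:3, 3]

# Example: rotate 90 degrees about the z axis, then translate.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
P = make_pose(R, [0.1, 0.0, 0.5])
point_in_camera = transform_points(P, np.array([[1.0, 0.0, 0.0]]))
print(P)
print(point_in_camera)
```

Validating $R$ before use is a choice of this sketch; it catches predictions that are not proper rotations (for example, a reflection) before they reach an error metric.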

IV-B Datasets

Our benchmark leverages the KITchen dataset introduced in Sec. III. Notably, this dataset stands out as the first of its kind, captured from the perspective of a humanoid robot, and encompasses varying heights and pitch angles, making it better suited to cover robotic mobile manipulation scenarios in kitchen environments. We split the dataset into train/val/test sets with a 70/20/10 ratio.
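A 70/20/10 split like the one described above could be produced as in the following sketch. It shuffles item identifiers with a fixed seed and cuts the list at the two boundaries. The paper does not state whether the split is made per frame or per recording, nor whether it is random, so the function `split_dataset` and its defaults are assumptions.

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle items reproducibly and split them into train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(round(ratios[0] * len(items)))
    n_val = int(round(ratios[1] * len(items)))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # remainder goes to the test set
    return train, val, test

# Hypothetical identifiers standing in for frames or whole recordings.
train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

Because consecutive frames of one recording look alike, passing recording identifiers rather than individual frames may be the safer choice, since it keeps near-duplicate images out of different subsets.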

Although our benchmark primarily focuses on the KITchen dataset, we invite other robotics research groups to record datasets in kitchen environments using their own robots and leverage our proposed annotation pipeline in Sec. III-C to annotate their data efficiently. Our vision for this benchmark extends beyond our dataset alone: we see it as a dynamic community platform where diverse research groups can collectively work to advance the field of robotic perception and pose estimation by testing their methods on a variety of datasets and providing their own datasets for other researchers to test on.

IV-C Pose Error Calculation

We utilize the same pose error functions used by the BOP challenge [23]. The estimated pose is considered correct if the pose error function $e$ calculated between the annotated pose $P$ and the estimated pose $\hat{P}$ is lower than a predefined threshold $\theta_{e}$, where $e \in \{e_{VSD}, e_{MSSD}, e_{MSPD}\}$. Here, $e_{VSD}$ is the Visible Surface Discrepancy, which focuses on the visible part of the object and evaluates poses with indistinguishable shapes as equivalent, disregarding color information. $e_{MSSD}$ is the Maximum Symmetry-Aware Surface Distance, which measures the surface deviation between model vertices in 3D; the maximum distance between model vertices is a crucial indicator of the chance of a successful grasp. $e_{MSPD}$ is the Maximum Symmetry-Aware Projection Distance, which considers the object symmetries and calculates the difference along the $X, Y$ axes, which makes it suitable for methods that rely on RGB data only. Finally, the Recall is defined as the ratio of correctly estimated poses, i.e., those with a pose error $e$ lower than the threshold $\theta_{e}$, across all objects. The Average Recall is then computed by averaging these recall values across various threshold settings.
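The sketch below illustrates one of the three error functions, $e_{MSSD}$, together with the recall and average recall computations described above. It follows the usual BOP formulation, in which the maximum vertex distance is minimized over the object's symmetry transformations. The vertices, poses, error values, and thresholds are made-up placeholders, and the official evaluation code should be preferred for real submissions.

```python
import numpy as np

def transform(P, points):
    """Apply a 4x4 homogeneous pose to Nx3 points."""
    return points @ P[:3, :3].T + P[:3, 3]

def mssd(P_est, P_gt, vertices, symmetries):
    """Maximum Symmetry-Aware Surface Distance between two poses.

    symmetries is a list of 4x4 transforms that map the model onto itself;
    it must contain at least the identity.
    """
    estimated = transform(P_est, vertices)
    candidates = []
    for S in symmetries:
        reference = transform(P_gt @ S, vertices)
        candidates.append(np.linalg.norm(estimated - reference, axis=1).max())
    return min(candidates)

def recall(errors, threshold):
    """Fraction of pose estimates whose error is below the threshold."""
    return float(np.mean(np.asarray(errors) < threshold))

def average_recall(errors, thresholds):
    """Mean of the recall values over several thresholds."""
    return float(np.mean([recall(errors, t) for t in thresholds]))

# Placeholder model and poses: the estimate is shifted by 1 cm along x.
vertices = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.05]])
P_gt = np.eye(4)
P_gt[:3, 3] = [0.0, 0.0, 1.0]
P_est = P_gt.copy()
P_est[0, 3] += 0.01
error = mssd(P_est, P_gt, vertices, symmetries=[np.eye(4)])

# Placeholder per-object errors and thresholds, both in meters.
ar = average_recall([0.01, 0.03, 0.2], thresholds=[0.02, 0.05, 0.1])
print(error, ar)
```

Taking the maximum over vertices rather than the mean makes the metric sensitive to the worst-aligned part of the object, which matches the grasping motivation given above.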

V Conclusion

We introduce KITchen, a novel object 6D pose estimation benchmark tailored to tackle this task within challenging kitchen environments using only monocular vision from robots’ FOV, with a specific emphasis on real-time performance. To serve this benchmark, we recorded the first-of-its-kind large-scale real-world dataset, captured from the perspective of a humanoid robot, featuring multi-objects in structured cluttered scenes in two distinct kitchen environments with diverse lighting conditions. Lastly, we proposed a semi-automated annotation pipeline aimed at streamlining the annotation of such datasets while minimizing manual human effort. We envision our benchmark as a bridge between robotics and the computer vision fields, fostering the development of innovative approaches to solve the 6D pose problem on resource-constrained platforms while also prioritizing real-time applicability.

Acknowledgment

We would like to thank Diana Burkart and Lisa Joosten for their contributions and assistance during the annotation process of the dataset.

References

  • [1] Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., Van Den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3674–3683 (2018)
  • [2] Asfour, T., Kaul, L., Wächter, M., Ottenhaus, S., Weiner, P., Rader, S., Grimm, R., Zhou, Y., Grotz, M., Paus, F., et al.: Armar-6: A collaborative humanoid robot for industrial environments. In: 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids). pp. 447–454. IEEE (2018)
  • [3] Asfour, T., Wächter, M., Kaul, L., Rader, S., Weiner, P., Ottenhaus, S., Grimm, R., Zhou, Y., Grotz, M., Paus, F.: Armar-6: A high-performance humanoid for human-robot collaboration in real world scenarios. IEEE Robotics & Automation Magazine 26(4), 108–121 (2019)
  • [4] Asfour, T., Waechter, M., Kaul, L., Rader, S., Weiner, P., Ottenhaus, S., Grimm, R., Zhou, Y., Grotz, M., Paus, F.: Armar-6: A high-performance humanoid for human-robot collaboration in real-world scenarios. IEEE Robotics & Automation Magazine 26(4), 108–121 (2019)
  • [5] Birr, T., Pohl, C., Younes, A., Asfour, T.: Autogpt+ p: Affordance-based task planning with large language models. arXiv preprint arXiv:2402.10778 (2024)
  • [6] Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6d object pose estimation using 3d object coordinates. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II 13. pp. 536–551. Springer (2014)
  • [7] Calli, B., Walsman, A., Singh, A., Srinivasa, S., Abbeel, P., Dollar, A.M.: Benchmarking in manipulation research: Using the yale-cmu-berkeley object and model set. IEEE Robotics & Automation Magazine 22(3), 36–52 (2015)
  • [8] Chang, M., Gervet, T., Khanna, M., Yenamandra, S., Shah, D., Min, S.Y., Shah, K., Paxton, C., Gupta, S., Batra, D., et al.: Goat: Go to any thing. arXiv preprint arXiv:2311.06430 (2023)
  • [9] Chaplot, D.S., Gandhi, D.P., Gupta, A., Salakhutdinov, R.R.: Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems 33, 4247–4258 (2020)
  • [10] Chen, C., Schissler, C., Garg, S., Kobernik, P., Clegg, A., Calamia, P., Batra, D., Robinson, P., Grauman, K.: Soundspaces 2.0: A simulation platform for visual-acoustic learning. Advances in Neural Information Processing Systems 35, 8896–8911 (2022)
  • [11] Chen, L., Yang, H., Wu, C., Wu, S.: Mp6d: An rgb-d dataset for metal parts’ 6d pose estimation. IEEE Robotics and Automation Letters 7(3), 5912–5919 (2022)
  • [12] Chen, X., Zhang, H., Yu, Z., Opipari, A., Chadwicke Jenkins, O.: Clearpose: Large-scale transparent object dataset and benchmark. In: European Conference on Computer Vision. pp. 381–396. Springer (2022)
  • [13] Collet, A., Martinez, M., Srinivasa, S.S.: The moped framework: Object recognition and pose estimation for manipulation. The international journal of robotics research 30(10), 1284–1306 (2011)
  • [14] Collet, A., Srinivasa, S.S.: Efficient multi-view object recognition and full pose estimation. In: 2010 IEEE International Conference on Robotics and Automation. pp. 2050–2055. IEEE (2010)
  • [15] Datta, S., Maksymets, O., Hoffman, J., Lee, S., Batra, D., Parikh, D.: Integrating egocentric localization for more realistic point-goal navigation agents. In: Conference on Robot Learning. pp. 313–328. PMLR (2021)
  • [16] Denninger, M., Winkelbauer, D., Sundermeyer, M., Boerdijk, W., Knauer, M., Strobl, K.H., Humt, M., Triebel, R.: Blenderproc2: A procedural pipeline for photorealistic rendering. Journal of Open Source Software 8(82), 4901 (2023). https://doi.org/10.21105/joss.04901
  • [17] Doumanoglou, A., Kouskouridas, R., Malassiotis, S., Kim, T.K.: Recovering 6d object pose and predicting next-best-view in the crowd. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3583–3592 (2016)
  • [18] Drost, B., Ulrich, M., Bergmann, P., Hartinger, P., Steger, C.: Introducing mvtec itodd-a dataset for 3d object recognition in industry. In: Proceedings of the IEEE international conference on computer vision workshops. pp. 2200–2208 (2017)
  • [19] Hinterstoisser, S., Holzer, S., Cagniart, C., Ilic, S., Konolige, K., Navab, N., Lepetit, V.: Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In: 2011 international conference on computer vision. pp. 858–865. IEEE (2011)
  • [20] Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In: Computer Vision–ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I 11. pp. 548–562. Springer (2013)
  • [21] Hodan, T., Haluza, P., Obdržálek, Š., Matas, J., Lourakis, M., Zabulis, X.: T-less: An rgb-d dataset for 6d pose estimation of texture-less objects. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 880–888. IEEE (2017)
  • [22] Hodan, T., Michel, F., Brachmann, E., Kehl, W., GlentBuch, A., Kraft, D., Drost, B., Vidal, J., Ihrke, S., Zabulis, X., et al.: Bop: Benchmark for 6d object pose estimation. In: Proceedings of the European conference on computer vision (ECCV). pp. 19–34 (2018)
  • [23] Hodaň, T., Sundermeyer, M., Drost, B., Labbé, Y., Brachmann, E., Michel, F., Rother, C., Matas, J.: Bop challenge 2020 on 6d object localization. In: Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. pp. 577–594. Springer (2020)
  • [24] Hu, Y., Fua, P., Salzmann, M.: Perspective flow aggregation for data-limited 6d object pose estimation. In: European Conference on Computer Vision. pp. 89–106. Springer (2022)
  • [25] Jocher, G., Stoken, A., Borovec, J., NanoCode012, ChristopherSTAN, Changyu, L., Laughing, tkianai, Hogan, A., lorenzomammana, yxNONG, AlexWang1900, Diaconu, L., Marc, wanghaoyang0106, ml5ah, Doug, Ingham, F., Frederik, Guilhen, Hatovix, Poznanski, J., Fang, J., Yu, L., changyu98, Wang, M., Gupta, N., Akhtar, O., PetrDvoracek, Rai, P.: ultralytics/yolov5: v3.1 - Bug Fixes and Performance Improvements (Oct 2020). https://doi.org/10.5281/zenodo.4154370
  • [26] Jurie, F., Dhome, M.: A simple and efficient template matching algorithm. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001. vol. 2, pp. 544–549. IEEE (2001)
  • [27] Kaskman, R., Zakharov, S., Shugurov, I., Ilic, S.: Homebreweddb: Rgb-d dataset for 6d pose estimation of 3d objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2019)
  • [28] Kasper, A., Xue, Z., Dillmann, R.: The kit object models database: An object model database for object recognition, localization and manipulation in service robotics. The International Journal of Robotics Research 31(8), 927–934 (2012)
  • [29] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)
  • [30] Krebs, F., Meixner, A., Patzer, I., Asfour, T.: The kit bimanual manipulation dataset. In: IEEE/RAS International Conference on Humanoid Robots (Humanoids). pp. 499–506 (2021)
  • [31] Labbé, Y., Manuelli, L., Mousavian, A., Tyree, S., Birchfield, S., Tremblay, J., Carpentier, J., Aubry, M., Fox, D., Sivic, J.: Megapose: 6d pose estimation of novel objects via render & compare. arXiv preprint arXiv:2212.06870 (2022)
  • [32] Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Ke, Z., Li, Q., Cheng, M., Nie, W., et al.: Yolov6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022)
  • [33] Long, X., Deng, K., Wang, G., Zhang, Y., Dang, Q., Gao, Y., Shen, H., Ren, J., Han, S., Ding, E., et al.: Pp-yolo: An effective and efficient implementation of object detector. arXiv preprint arXiv:2007.12099 (2020)
  • [34] Mandery, C., Terlemez, O., Do, M., Vahrenkamp, N., Asfour, T.: Unifying representations and large-scale whole-body motion databases for studying human motion. IEEE Transactions on Robotics 32(4), 796–809 (2016)
  • [35] Nguyen, V.N., Hu, Y., Xiao, Y., Salzmann, M., Lepetit, V.: Templates for 3d object pose estimation revisited: Generalization to new objects and robustness to occlusions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6771–6780 (2022)
  • [36] Pal, A., Qiu, Y., Christensen, H.: Learning hierarchical relationships for object-goal navigation. In: Conference on Robot Learning. pp. 517–528. PMLR (2021)
  • [37] Pauwels, K., Kragic, D.: Simtrack: A simulation-based framework for scalable real-time object pose detection and tracking. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1300–1307. IEEE (2015)
  • [38] Pohl, C., Reister, F., Peller-Konrad, F., Asfour, T.: Memory-centered and affordance-based framework for mobile manipulation. arXiv preprint arXiv:2401.16899 (2024)
  • [39] Reister, F., Grotz, M., Asfour, T.: Combining navigation and manipulation costs for time-efficient robot placement in mobile manipulation tasks. IEEE Robotics and Automation Letters 7(4), 9913–9920 (2022)
  • [40] Su, Y., Saleh, M., Fetzer, T., Rambach, J., Navab, N., Busam, B., Stricker, D., Tombari, F.: Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6738–6748 (2022)
  • [41] Sundermeyer, M., Durner, M., Puang, E.Y., Marton, Z.C., Vaskevicius, N., Arras, K.O., Triebel, R.: Multi-path learning for object pose estimation across domains. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13916–13925 (2020)
  • [42] Tan, M., Pang, R., Le, Q.V.: Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10781–10790 (2020)
  • [43] Tejani, A., Tang, D., Kouskouridas, R., Kim, T.K.: Latent-class hough forests for 3d object detection and pose estimation. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13. pp. 462–477. Springer (2014)
  • [44] Thalhammer, S., Bauer, D., Hönig, P., Weibel, J.B., García-Rodríguez, J., Vincze, M.: Challenges for monocular 6d object pose estimation in robotics. arXiv preprint arXiv:2307.12172 (2023)
  • [45] Tyree, S., Tremblay, J., To, T., Cheng, J., Mosier, T., Smith, J., Birchfield, S.: 6-dof pose estimation of household objects for robotic manipulation: An accessible dataset and benchmark. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 13081–13088. IEEE (2022)
  • [46] Wang, C.Y., Yeh, I.H., Liao, H.Y.M.: Yolov9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616 (2024)
  • [47] Wang, G., Manhardt, F., Tombari, F., Ji, X.: Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16611–16621 (2021)
  • [48] Wen, B., Yang, W., Kautz, J., Birchfield, S.: Foundationpose: Unified 6d pose estimation and tracking of novel objects. arXiv preprint arXiv:2312.08344 (2023)
  • [49] Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017)
  • [50] Ye, J., Batra, D., Wijmans, E., Das, A.: Auxiliary tasks speed up learning point goal navigation. In: Conference on Robot Learning. pp. 498–516. PMLR (2021)
  • [51] Younes, A., Honerkamp, D., Welschehold, T., Valada, A.: Catch me if you hear me: Audio-visual navigation in complex unmapped environments with moving sounds. IEEE Robotics and Automation Letters 8(2), 928–935 (2023)
  • [52] Zhang, Z., Lu, X., Cao, G., Yang, Y., Jiao, L., Liu, F.: Vit-yolo: Transformer-based yolo for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2799–2808 (2021)