License: CC BY 4.0
arXiv:2401.12202v2 [cs.RO] 29 Feb 2024

OK-Robot:
What Really Matters in Integrating Open-Knowledge Models for Robotics

Peiqi Liu*¹  Yaswanth Orru*¹  Jay Vakil²  Chris Paxton²
Nur Muhammad Mahi Shafiullah†¹  Lerrel Pinto†¹
New York University¹, AI at Meta²
https://ok-robot.github.io
* denotes equal contribution and † denotes equal advising. Correspondence to: [email protected]
Abstract

Remarkable progress has been made in recent years in the fields of vision, language, and robotics. We now have vision models capable of recognizing objects based on language queries, navigation systems that can effectively control mobile systems, and grasping models that can handle a wide range of objects. Despite these advancements, general-purpose applications of robotics still lag behind, even though they rely on these fundamental capabilities of recognition, navigation, and grasping. In this paper, we adopt a systems-first approach to develop a new Open Knowledge-based robotics framework called OK-Robot. By combining Vision-Language Models (VLMs) for object detection, navigation primitives for movement, and grasping primitives for object manipulation, OK-Robot offers an integrated solution for pick-and-drop operations without requiring any training. To evaluate its performance, we run OK-Robot in 10 real-world home environments. The results demonstrate that OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks, representing a new state-of-the-art in Open Vocabulary Mobile Manipulation (OVMM) with nearly 1.8× the performance of prior work. In cleaner, uncluttered environments, OK-Robot’s performance increases to 82%. However, the most important insight gained from OK-Robot is the critical role of nuanced details when combining Open Knowledge systems like VLMs with robotic modules. We have published our code and robot videos at https://ok-robot.github.io to encourage further investigation.

I Introduction

Creating a general-purpose robot has been a longstanding dream of the robotics community. With the increase in data-driven approaches and large robot models, impressive progress is being made [1, 2, 3, 4]. However, current systems are brittle, closed, and fail when encountering unseen scenarios. Even the largest robotics models can often only be deployed in previously seen environments [5, 6]. The brittleness of these systems is further exacerbated in settings where little robotic data is available, such as in unstructured home environments.

The poor generalization of robotic systems lies in stark contrast to large vision models [7, 8, 9, 10], which show capabilities of semantic understanding [11, 12, 13], detection [7, 8], and connecting visual representations to language [14, 9, 10]. At the same time, base robotic skills for navigation [15], grasping [16, 17, 18, 19], and rearrangement [20, 21] are fairly mature. Hence, it is perplexing that robotic systems that combine modern vision models with robot-specific primitives perform so poorly. To highlight the difficulty of this problem, the recent NeurIPS 2023 challenge for open-vocabulary mobile manipulation (OVMM) [22] registered a success rate of 33% for the winning solution [23].

So what makes open-vocabulary robotics so hard? Unfortunately, there isn’t a single challenge that makes this problem hard. Instead, inaccuracies in different components compound and together result in an overall drop in performance. For example, the quality of open-vocabulary retrievals of objects in homes is dependent on the quality of query strings, navigation targets determined by VLMs may not be reachable to the robot, and the choice of different grasping models may lead to large differences in grasping performance. Hence, making progress on this problem requires a careful and nuanced framework that integrates both VLMs and robotics primitives, while being flexible enough to incorporate newer models as they are developed by the VLM and robotics community.

We present OK-Robot, an Open Knowledge Robot that integrates state-of-the-art VLMs with powerful robotics primitives for navigation and grasping to enable pick-and-drop. Here, Open Knowledge refers to learned models trained on large, publicly available datasets. When placed in a new home environment, OK-Robot is seeded with a scan taken from an iPhone. Given this scan, dense vision-language representations are computed using LangSam [24] and CLIP [9] and stored in a semantic memory. Then, when a language-query for an object to be picked comes in, semantic memory is queried with the language embedding to find that object. After this, navigation and picking primitives are applied sequentially to move to the desired object and pick it up. A similar process can be carried out for dropping the object.

To study OK-Robot, we tested it in 10 real-world home environments. Through our experiments, we found that in an unseen natural home environment, a zero-shot deployment of our system achieves 58.5% success on average. However, this success rate is largely dependent on the “naturalness” of the environment: we show that by improving the queries, decluttering the space, and excluding objects that are clearly adversarial (too large, too translucent, too slippery), this success rate reaches 82.4%. Overall, through our experiments, we make the following observations:

  • Pre-trained VLMs are highly effective for open-vocabulary navigation: Current open-vocabulary vision-language models such as CLIP [9] or OWL-ViT [8] offer strong performance in identifying arbitrary objects in the real world, and enable navigating to them in a zero-shot manner (see Section II-A.)

  • Pre-trained grasping models can be directly applied to mobile manipulation: Similar to VLMs, special purpose robot models pre-trained on large amounts of data can be applied out of the box to approach open-vocabulary grasping in homes. These robot models do not require any additional training or fine-tuning (see Section II-B.)

  • How components are combined is crucial: Given the pretrained models, we find that they can be combined with no training using a simple state-machine model. We also find that using heuristics to counteract the robot’s physical limitations can lead to a better success rate in the real world (see Section II-D.)

  • Several challenges still remain: OK-Robot improves upon prior work despite the immense challenge of operating zero-shot in arbitrary homes. Still, by analyzing its failure modes, we find that significant improvements can be made to the VLMs, robot models, and robot morphology, which would directly increase the performance of open-knowledge manipulation agents (see Section III-D).

To encourage and support future work in open-knowledge robotics, we have shared the code and modules for OK-Robot, and are committed to supporting reproduction of our results. More information along with robot videos and the code are available on our project website: https://ok-robot.github.io.

Figure 2: Open-vocabulary, open knowledge object localization and navigation in the real-world. We use the VoxelMap [25] for localizing objects with natural language queries, and use an A* algorithm similar to USANet [26] for path planning.

II Technical Components and Method

Our method, on a high level, solves the problem described by the query: “Pick up A (from B) and drop it on/in C”, where A is an object and B and C are places in a real-world environment such as a home. The system we introduce is a combination of three primary subsystems combined on a Hello Robot: Stretch. Namely, these are the open-vocabulary object navigation module, the open-vocabulary RGB-D grasping module, and the dropping heuristic. In this section, we describe each of these components in more detail.

II-A Open-home, open-vocabulary object navigation

The first component of our method is an open-home, open-vocabulary object navigation model that we use to map a home and subsequently navigate to any object of interest designated by a natural language query.

Scanning the home: For open vocabulary object navigation, we follow the approach from CLIP-Fields [27] and assume a pre-mapping phase where the home is “scanned” manually using an iPhone. This manual scan simply consists of taking a video of the home using the Record3D app on the iPhone, which results in a sequence of posed RGB-D images and takes less than one minute for a new room. Once collected, the RGB-D images, along with the camera pose and positions, are exported to our library for map-building. To ensure our semantic memory contains both the objects of interest as well as the navigable surface and any obstacles, we capture the floor surface alongside the objects and receptacles in the environment.

Detecting objects: On each frame of the scan, we run an open-vocabulary object detector. We chose OWL-ViT [8] over Detic [7] as the object detector since we found OWL-ViT to perform better in preliminary queries. We apply the detector on every frame, extract each object’s bounding box, CLIP embedding, and detector confidence, and pass this information on to the object memory module. We further refine the bounding boxes into object masks with Segment Anything (SAM) [28]. Note that, in many cases, open-vocabulary object detectors require a set of natural language object queries to be detected. We supply a large set of such object queries, derived from the original Scannet200 labels [29] and presented in Appendix B, to help the detector capture the most common objects in the scene.

Object-centric semantic memory: We use an object-centric memory similar to CLIP-Fields [27] and OVMM [25] that we call the VoxelMap. VoxelMap is built by back-projecting the object masks into real-world coordinates using the depth image and the pose collected by the camera. This process gives us a point cloud where each point has an associated CLIP semantic vector. Then, we voxelize the point cloud to a 5 cm resolution. For each voxel, we calculate the detector-confidence weighted average of the CLIP embeddings that belong to that voxel. This VoxelMap forms the base of our object memory module. Note that the representation created this way remains static after the first scan, and cannot be adapted during the robot’s operation. This inability to dynamically create a map is discussed in our limitations section (Section V).
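The voxel aggregation step can be sketched as follows. This is a minimal illustration and not the released implementation: the function name `build_voxel_map`, the array layout, and the assumption that every voxel receives at least one point with positive confidence are ours.

```python
import numpy as np

def build_voxel_map(points, embeddings, confidences, voxel_size=0.05):
    """Aggregate per-point embeddings into voxels by confidence-weighted average.

    points:      (N, 3) world coordinates in metres
    embeddings:  (N, D) per-point semantic vectors (e.g. CLIP features)
    confidences: (N,)   detector confidence per point (assumed positive)
    Returns (voxel_centers (M, 3), voxel_embeddings (M, D)).
    """
    # Integer voxel index of every point at the chosen resolution (5 cm by default).
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Group points that fall into the same voxel.
    unique_keys, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # shape of `inverse` differs across NumPy versions
    num_voxels = len(unique_keys)

    # Accumulate confidence-weighted sums and total weights per voxel.
    weighted_sum = np.zeros((num_voxels, embeddings.shape[1]))
    weight_total = np.zeros(num_voxels)
    np.add.at(weighted_sum, inverse, embeddings * confidences[:, None])
    np.add.at(weight_total, inverse, confidences)

    voxel_embeddings = weighted_sum / weight_total[:, None]
    voxel_centers = (unique_keys + 0.5) * voxel_size
    return voxel_centers, voxel_embeddings
```

Weighting by detector confidence means that a low-confidence detection contributes little to a voxel that also contains high-confidence points.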

Querying the memory module: Our semantic object memory gives us a static world representation represented as possibly non-empty voxels in the world, and a semantic vector in CLIP space associated with each voxel. Given a language query, we first convert it to a semantic vector using the CLIP language encoder. Then, we find the voxel where the dot product between the encoded embedding and the voxel’s associated embedding is maximized. Since each voxel is associated with a real location in the home, this lets us find the location where a queried object is most likely to be found, similar to Figure 2(a).
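As a rough sketch of this lookup, assuming the query has already been encoded into the same embedding space by some text encoder (not shown here), the retrieval reduces to an argmax over dot products. The function name `query_voxel_map` is hypothetical.

```python
import numpy as np

def query_voxel_map(query_embedding, voxel_centers, voxel_embeddings):
    """Return the voxel location whose embedding best matches the query.

    query_embedding:  (D,)   encoded language query
    voxel_centers:    (M, 3) voxel locations in the home
    voxel_embeddings: (M, D) per-voxel semantic vectors
    """
    # Dot product between the query and every voxel embedding.
    scores = voxel_embeddings @ query_embedding
    best = int(np.argmax(scores))
    return voxel_centers[best], float(scores[best])
```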

We also implement querying for “A on B” by interpreting it as “A near B”. We do so by selecting the top-10 points for query A and the top-50 points for query B. Then, we calculate the $10 \times 50$ pairwise $L_2$ distances and pick the A-point associated with the shortest (A, B) distance. Note that during the object navigation phase we use this query only to navigate to the object approximately, and not for manipulation. This approach gives us two advantages: our map can be of lower resolution than those in prior work [27, 26, 30], and we can deal with small movements in an object’s location after building the map.
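The “A near B” lookup can be illustrated with the short sketch below. The function name `locate_a_near_b` and the input layout are assumptions made for illustration; only the top-10 and top-50 candidate counts and the pairwise $L_2$ distance come from the description above.

```python
import numpy as np

def locate_a_near_b(emb_a, emb_b, voxel_centers, voxel_embeddings, k_a=10, k_b=50):
    """Pick the candidate location for A that lies closest to any candidate for B."""
    k_a = min(k_a, len(voxel_centers))
    k_b = min(k_b, len(voxel_centers))
    # Top-k voxels by dot-product similarity for each query.
    top_a = np.argsort(voxel_embeddings @ emb_a)[::-1][:k_a]
    top_b = np.argsort(voxel_embeddings @ emb_b)[::-1][:k_b]
    cand_a = voxel_centers[top_a]
    cand_b = voxel_centers[top_b]
    # (k_a, k_b) matrix of pairwise Euclidean distances.
    dists = np.linalg.norm(cand_a[:, None, :] - cand_b[None, :, :], axis=-1)
    a_idx, _ = np.unravel_index(np.argmin(dists), dists.shape)
    return cand_a[a_idx]
```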

Navigating to objects in the real world: Once our navigation model gives us a 3D location coordinate in the real world, we use that as a navigation target for our robot to initialize our manipulation phase. Going and looking at an object [27, 15, 31] can be done while remaining at a safe distance from the object itself. In contrast, our navigation module must place the robot at arm’s length so that the robot can manipulate the target object afterwards. Thus, our navigation method has to balance the following objectives:

  1. The robot needs to be close enough to the object to manipulate it,

  2. The robot needs some space to move its gripper, so there needs to be a small but non-negligible space between the robot and the object, and,

  3. The robot needs to avoid collision during manipulation, and thus needs to keep its distance from all obstacles.

We use three different navigation score functions, each associated with one of the above points, and evaluate them on each point of the space to find the best position to place the robot.

Let a random point be $\overrightarrow{x}$, the closest obstacle point be $\overrightarrow{x}_{obs}$, and the target object be $\overrightarrow{x_o}$. We define the following three functions $s_1, s_2, s_3$ to capture our three criteria. We define $s$ as their weighted sum. The ideal navigation point $\overrightarrow{x}^{*}$ is the point in space that minimizes $s(\overrightarrow{x})$, and the ideal direction is given by the vector from $\overrightarrow{x}^{*}$ to $\overrightarrow{x_o}$.

\begin{align*}
s_1(\overrightarrow{x}) &= ||\overrightarrow{x} - \overrightarrow{x_o}|| \\
s_2(\overrightarrow{x}) &= 40 - \min(||\overrightarrow{x} - \overrightarrow{x_o}||, 40) \\
s_3(\overrightarrow{x}) &= \begin{cases} 1/||\overrightarrow{x} - \overrightarrow{x}_{obs}||, & \text{if } ||\overrightarrow{x} - \overrightarrow{x}_{obs}||_{0} \leq 30 \\ 0, & \text{otherwise} \end{cases} \\
s(\overrightarrow{x}) &= s_1(\overrightarrow{x}) + 8 s_2(\overrightarrow{x}) + 8 s_3(\overrightarrow{x})
\end{align*}
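The scoring above can be sketched in a few lines. This is an illustrative reading rather than the released code: it uses the Euclidean norm for both the distance and the threshold test, takes the constants 40 and 30 in whatever length unit the map uses (the text does not state one), and clamps the obstacle distance away from zero to avoid division by zero. The function names are hypothetical.

```python
import numpy as np

def navigation_score(x, x_obj, obstacles, eps=1e-6):
    """Weighted sum s = s1 + 8*s2 + 8*s3 for one candidate point x."""
    x = np.asarray(x, dtype=float)
    d_obj = np.linalg.norm(x - x_obj)
    # Distance to the closest obstacle point (clamped: an assumption of this sketch).
    d_obs = max(np.min(np.linalg.norm(obstacles - x, axis=1)), eps)
    s1 = d_obj                                   # stay close to the object
    s2 = 40.0 - min(d_obj, 40.0)                 # but leave room for the gripper
    s3 = 1.0 / d_obs if d_obs <= 30.0 else 0.0   # and keep away from obstacles
    return s1 + 8.0 * s2 + 8.0 * s3

def best_navigation_point(candidates, x_obj, obstacles):
    """Return the candidate with the lowest score."""
    scores = [navigation_score(c, x_obj, obstacles) for c in candidates]
    return candidates[int(np.argmin(scores))]
```

Under this reading, and away from obstacles, $s_1 + 8 s_2$ equals $320 - 7d$ for an object distance $d \leq 40$ and $d$ beyond that, so the score is lowest at $d = 40$: close enough to reach the object while leaving space for the gripper.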

To navigate to this target point safely from any other point in space, we follow a similar approach to [26, 32] by building an obstacle map from our captured posed RGB-D images. We build a 2D, 10 cm × 10 cm grid of obstacles over which we navigate using the A* algorithm. To convert our VoxelMap to an obstacle map, we first set a floor and ceiling height. The presence of occupied voxels between them implies that the grid cell is occupied, while the presence of neither ceiling nor floor voxels means that the grid cell is unexplored. We mark both occupied and unexplored cells as not navigable. Around each occupied point, we mark any point within a 20 cm radius as also non-navigable to account for the robot’s radius and a turn radius. During A* search, we use $s_3$ as a heuristic function on the node costs to navigate further away from any obstacles, which makes our generated paths similar to ideal Voronoi paths [33] in our experiments.
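A compact sketch of the grid construction and search follows. It is not the released planner: the 4-connected moves, the Manhattan-distance heuristic, and the reading of $s_3$ as an additive per-cell penalty (`extra_cost`) are assumptions, as are the function names. With 10 cm cells, the 20 cm inflation radius corresponds to two cells.

```python
import heapq
import numpy as np

def navigable_grid(occupied, explored, inflate_cells=2):
    """Cells are navigable if explored and farther than inflate_cells from any obstacle."""
    rows, cols = occupied.shape
    inflated = occupied.copy()
    n = inflate_cells
    for r, c in zip(*np.nonzero(occupied)):
        r0, r1 = max(r - n, 0), min(r + n + 1, rows)
        c0, c1 = max(c - n, 0), min(c + n + 1, cols)
        rr, cc = np.ogrid[r0:r1, c0:c1]
        # Mark a disc of radius n cells around the occupied cell.
        inflated[r0:r1, c0:c1] |= (rr - r) ** 2 + (cc - c) ** 2 <= n ** 2
    return ~inflated & explored

def astar(navigable, start, goal, extra_cost=None):
    """A* over a boolean grid; returns a list of (row, col) cells or None."""
    rows, cols = navigable.shape
    if extra_cost is None:
        extra_cost = np.zeros(navigable.shape)

    def h(cell):  # admissible because every step costs at least 1
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or not navigable[nxt]:
                continue
            # Unit step cost plus an obstacle-proximity penalty for the entered cell.
            new_cost = cost + 1.0 + float(extra_cost[nxt])
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_cost + h(nxt), new_cost, nxt))
    return None
```

Passing a penalty grid that grows near obstacles makes the search prefer cells in the middle of free space, which is the effect the Voronoi-path comparison describes.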

II-B Open-vocabulary grasping in the real world

Figure 3: Open-vocabulary grasping in the real world. From left to right, we show the (a) robot POV image, (b) all suggested grasps from AnyGrasp [19], (c) object mask given label from LangSam [24], (d) grasp points filtered by the mask, and (e) grasp chosen for execution.

Grasping or physically interacting with arbitrary objects in the real world is much more complex than open-vocabulary navigation. We opt for using a pre-trained grasping model to generate grasp poses in the real world, and filter them with language-conditioning using a modern VLM.

Grasp perception: Once the robot reaches the object location using the navigation method outlined in Section II-A, we use a pre-trained grasping model, AnyGrasp [19], to generate a grasp for the robot. We point the robot’s RGB-D head camera towards the object’s 3D location, given to us by the semantic memory, and capture an RGB-D image from it (Figure 3, column 1). We backproject and convert the depth image to a pointcloud and pass this information to the grasp generation model. Our grasp generation model, AnyGrasp, generates all collision-free grasps (Figure 3 column 2) for a parallel jaw gripper in a scene given a single RGB image and a pointcloud. AnyGrasp provides us with a grasp point, width, height, depth, and a “graspness score”, indicating uncalibrated model confidence in each grasp.
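The backprojection step is standard; a minimal sketch under a pinhole camera model is shown below. The intrinsics `fx`, `fy`, `cx`, `cy` and the function name are assumptions for illustration, and pixels without a valid depth reading are assumed to be stored as zero.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Convert an (H, W) depth image into an (N, 3) camera-frame point cloud."""
    h, w = depth.shape
    # Pixel coordinates: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop pixels with no depth reading.
    return points[points[:, 2] > 0]
```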

Filtering grasps using language queries: Once we get all proposed grasps from AnyGrasp, we filter them using LangSam [24]. LangSam [24] segments the captured image and finds the desired object mask with a language query (Figure 3 column 3). We project all the proposed grasp points onto the image and find the grasps that fall into the object mask (Figure 3 column 4). We pick the best grasp using a heuristic. Given a grasp score $\mathcal{S}$ and the angle $\theta$ between the grasp normal and floor normal, the new heuristic score is $\mathcal{S} - (\theta^{4}/10)$. This heuristic balances high graspness scores with finding flat, horizontal grasps. We prefer horizontal grasps because they are robust to small calibration errors on the robot, while vertical grasps need better hand-eye calibration to be successful. Robustness to hand-eye calibration errors leads to higher success as we transport the robot to different homes during our experiments.
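The filtering and re-scoring can be sketched as follows. The inputs here are assumed to be already-projected pixel coordinates of the grasp points together with their scores and angles; the function name `select_grasp` is hypothetical, and $\theta$ is assumed to be in radians since the text does not state a unit.

```python
import numpy as np

def select_grasp(pixels, scores, thetas, mask):
    """Index of the best grasp inside the object mask, or None if there is none.

    pixels: (N, 2) integer (row, col) image coordinates of projected grasp points
    scores: (N,)   graspness scores S
    thetas: (N,)   angle between grasp normal and floor normal
    mask:   (H, W) boolean object mask for the language query
    """
    rows, cols = pixels[:, 0], pixels[:, 1]
    h, w = mask.shape
    in_image = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    on_object = np.zeros(len(pixels), dtype=bool)
    on_object[in_image] = mask[rows[in_image], cols[in_image]]
    if not on_object.any():
        return None
    # Heuristic score S - theta^4 / 10; grasps off the mask are excluded.
    heuristic = np.where(on_object, scores - thetas ** 4 / 10.0, -np.inf)
    return int(np.argmax(heuristic))
```

Because the penalty grows with the fourth power of $\theta$, small angles are barely penalized while large ones quickly outweigh a high graspness score.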

Grasp execution: Once we identify the best grasp (Figure 3 column 5), we use a simple pre-grasp approach [34] to grasp our intended object. If $\overrightarrow{p}$ is the grasp point and $\overrightarrow{a}$ is the approach vector given by the grasping model, our robot gripper follows the following trajectory:

$$\langle \overrightarrow{p} - 0.2\overrightarrow{a},\; \overrightarrow{p} - 0.08\overrightarrow{a},\; \overrightarrow{p} - 0.04\overrightarrow{a},\; \overrightarrow{p} \rangle$$

Put simply, our method approaches the object from a pre-grasp position in a line with progressively smaller motions. Moving slower as we approach the object helps the robot not knock over light objects. Once we reach the predicted grasp point, we close the gripper in a closed-loop fashion to get a solid grip on the object without crushing it. After grasping the object, we lift up the robot arm, retract it fully, and rotate the wrist to have the object tucked over the body. This behavior maintains the robot footprint while ensuring the object is held securely by the robot and doesn’t fall while navigating to the drop location.
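The approach trajectory is simple enough to write out directly. The sketch below assumes the approach vector should be normalized to unit length before scaling, which the text does not state; the function name is hypothetical.

```python
import numpy as np

def pregrasp_waypoints(p, a, offsets=(0.2, 0.08, 0.04, 0.0)):
    """Waypoints p - d*a for each offset d, ending at the grasp point p."""
    p = np.asarray(p, dtype=float)
    a = np.asarray(a, dtype=float)
    a = a / np.linalg.norm(a)  # assumption: treat a as a direction only
    return [p - d * a for d in offsets]
```

With these offsets the successive motions have lengths 0.12, 0.04, and 0.04, so the gripper covers most of the distance in the first move and then closes in with short steps.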

II-C Dropping heuristic

After picking up an object, we find and navigate to the drop location using the same methods described in Section II-A. Unlike HomeRobot’s baseline implementation [25], which assumes that the drop-off location is a flat surface, we extend our heuristic to cover concave objects such as sinks, bins, boxes, and bags. First, we segment the point cloud $P$ captured by the robot’s head camera using LangSam [24] similar to Section 3 using the drop language query. Then, we align that segmented point cloud such that the X-axis is aligned with the way the robot is facing, the Y-axis is to its left and right, and the Z-axis of the point cloud is aligned with the floor normal. Then, we normalize the point cloud so that the robot’s $(x, y)$ coordinate is $(0, 0)$, and the floor plane is at $z = 0$. We call this pointcloud $P_a$. On the aligned, segmented point cloud, we consider the $(x, y)$ coordinates for each point, and find the median values $x_m$ and $y_m$ on each axis. Finally, we find a drop height using $z_{\max} = 0.2 + \max\{z \mid (x, y, z) \in P_a;\ 0 \leq x \leq x_m;\ |y - y_m| < 0.1\}$ on the segmented, aligned pointcloud. We add a small buffer of $0.2$ to the height to avoid collisions between the robot and the drop location.
Finally, we move the robot gripper above the drop point, and open the gripper to drop the object. While this heuristic doesn’t explicitly reason about clutter, in our experiments it performs well on average.
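The drop-point computation can be sketched as below, assuming the point cloud has already been segmented and aligned as described (robot at the origin, floor at $z = 0$). The function name `drop_target` and the choice to return `None` when no point falls inside the strip are ours.

```python
import numpy as np

def drop_target(points_aligned, y_halfwidth=0.1, clearance=0.2):
    """Return (x_m, y_m, z_max) for an aligned, segmented (N, 3) point cloud."""
    x, y, z = points_aligned.T
    x_m, y_m = np.median(x), np.median(y)
    # Points between the robot and the median x, within a narrow band around y_m.
    strip = (x >= 0) & (x <= x_m) & (np.abs(y - y_m) < y_halfwidth)
    if not strip.any():
        return None
    # Highest point along the approach strip plus a clearance buffer.
    return x_m, y_m, clearance + z[strip].max()
```

Taking the maximum height only over the strip between the robot and the median point means the gripper clears the near rim of a sink or bin without being raised by taller geometry elsewhere on the receptacle.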

II-D Deployment in homes

Our navigation, pick, and drop primitives are combined to create our robot method that can be applied in any novel home. For a new home environment, we “scan” the room in under a minute. Then, it takes less than five minutes to process the scan into our VoxelMap. Once that is done, the robot can be immediately placed at the base and start operating. From arriving in a completely novel environment to operating autonomously in it, our system takes under 10 minutes on average to complete the first pick-and-drop task.

Transitioning between modules: The transition between different modules is predefined and happens automatically once a user specifies the object to pick and where to drop it. Since we do not implement error detection or correction, our state machine model is a simple linear chain of steps leading from navigating to object, to grasping, to navigating to goal, and to dropping the object at the goal to finish the task.
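A linear chain of this kind can be sketched as follows. The `Stage` names and the `primitives` mapping are hypothetical placeholders for the navigation, grasping, and dropping modules; consistent with the description, the sketch performs no success checks between stages.

```python
from enum import Enum, auto

class Stage(Enum):
    NAVIGATE_TO_OBJECT = auto()
    GRASP = auto()
    NAVIGATE_TO_GOAL = auto()
    DROP = auto()

def run_pick_and_drop(primitives, pick_query, drop_query):
    """Run the four stages in a fixed order and return the stages executed.

    primitives maps each Stage to a callable that takes a language query.
    """
    plan = [
        (Stage.NAVIGATE_TO_OBJECT, pick_query),
        (Stage.GRASP, pick_query),
        (Stage.NAVIGATE_TO_GOAL, drop_query),
        (Stage.DROP, drop_query),
    ]
    executed = []
    for stage, query in plan:
        primitives[stage](query)  # no error detection or recovery
        executed.append(stage)
    return executed
```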

Protocol for home experiments: To run our experiment in a novel home, we move the robot to a previously unobserved room. We record the scene and create our VoxelMap. Concurrently, we pick between 10-20 objects arbitrarily in each scene that can fit in the robot gripper. These are objects found in the scene, and are not chosen ahead of time. We come up with a language query for each chosen object using GPT-4V [35] to keep the queries consistent and free of experimenter bias. We query our navigation module to filter out all the navigation failures; i.e. objects that our semantic memory module could not locate properly. Then, we execute pick-and-drop on remaining objects sequentially without resets between trials.

Figure 4: All the success and failure cases in our home experiments, aggregated over all three cleaning phases, and broken down by mode of failure. From left to right, we show the application of the three components of OK-Robot, and show a breakdown of the long-tail failure modes of each of the components.

III Experiments

We evaluate our method in two sets of experiments. In the first set of experiments, we compare multiple alternatives for each of our navigation and manipulation modules. These experiments give us insights about which modules to use and evaluate in a home environment as a part of our method. In the next set of experiments, we took our robots to 10 homes and ran 171 pick-and-drop experiments to empirically evaluate how our method performs in completely novel homes, and to understand the failure modes of our system.

Through these experiments, we look to answer a series of questions regarding the capabilities and limits of current Open Knowledge robotic systems, as embodied by OK-Robot. Namely, we ask the following:

  1. How well can such a system tackle the challenge of pick and drop in arbitrary homes?

  2. How well do alternate primitives for navigation and grasping compare to the recipe presented here for building an Open Knowledge robotic system?

  3. How well can our current systems handle unique challenges that make homes particularly difficult, such as clutter, ambiguity, and affordance challenges?

  4. What are the failure modes of such a system and its individual components in real home environments?

III-A Results of home experiments

Over the 10 home environments, OK-Robot achieved a 58.5% success rate in completing full pick-and-drops. Notably, this success rate is over novel objects sourced from each home with our zero-shot algorithm. As a result, each success and failure of the robot tells us something interesting about applying open-knowledge models in robotics, which we analyze over the next sections. In Appendix E, we provide the details of all our home experiments and their results. In Appendix C we show a subset of the target objects and in Appendix D we show snapshots of homes where OK-Robot was deployed. Snippets of our experiments are in Figure 1, and full videos are presented on our project website.

Reproduction of our system: Beyond the home experiment results presented here, we also reproduced OK-Robot in two homes in Pittsburgh, PA, and Fremont, CA. These homes were larger and more complex: a cluttered, actively-used home kitchen environment, and a large, controlled test apartment used in prior work [25, 22]. In Appendix Figure 12, we show the robot performing pick-and-drop in these two environments. These homes were different from our initial ten experiments in a few ways. Both were larger compared to the average NY homes, requiring more robot motion to navigate to different goals. The PA environment (Figure 12 top) notably had much more clutter. However, given only a scan, OK-Robot was able to successfully pick and drop objects like stuffed lion, plush cactus, toy drill, or green water bottle in both environments.

Figure 5: Ablation experiment using different semantic memory and grasping modules, with the bars showing average performance and the error bars showing standard deviation over the environments.

III-B Ablations over system components

Apart from the navigation and manipulation strategies used in OK-Robot, we also evaluated a number of alternative open vocabulary navigation and grasping modules. We compared them by evaluating them in three different environments in our lab. Apart from VoxelMap [25], we evaluate CLIP-Fields [27] and USA-Net [26] for semantic navigation. For the grasping module, we consider AnyGrasp and its open-source baseline, Open Graspness [19], Contact-GraspNet [16], and the Top-Down grasp heuristic from home-robot [25]. More details about them are provided in Appendix Section A.

In Figure 5, we see their comparative performance in three lab environments. For semantic memory modules, we see that VoxelMap, used in OK-Robot and described in Sec. II-A, outperforms other semantic memory modules by a small margin. It also has much lower variance compared to the alternatives, meaning it is more reliable. As for grasping modules, AnyGrasp clearly outperforms other grasping methods, performing almost 50% better on a relative scale than the next best candidate, top-down grasp. However, the fact that a heuristic-based algorithm, top-down grasp from HomeRobot [25], beats the open-source AnyGrasp baseline and Contact-GraspNet shows that building a truly general-purpose grasping model remains difficult.

III-C Impact of clutter, object ambiguity, and affordance

What makes home environments especially difficult compared to lab experiments is the presence of physical clutter, language-to-object mapping ambiguity, and hard-to-reach positions. To gain a clear understanding of how such factors play into our experiments, we go through two “clean-up” processes in each environment. During the clean-up, we pick a subset of objects that are free from ambiguity from the previous rounds, clean the clutter around objects, and generally relocate them to accessible locations. The two clean-up rounds at each environment give us insights about the performance gap caused by the natural difficulties of a home-like environment.

We show a complete analysis of the tasks listed in Section III-A which failed in various stages in Figure 6. As we can see from this breakdown, as we clean up the environment and remove the ambiguous objects, the navigation accuracy goes up, and the total error rate goes down from 15% to 12% and finally all the way down to 4%. Similarly, as we clean up clutter from the environment, we find that the manipulation accuracy also improves and the error rates decrease from 25% to 16% and finally 13%. Finally, since the drop module is agnostic of the label ambiguity or manipulation difficulty arising from clutter, the failure rate of the dropping primitive stays roughly constant through the three phases of cleanup.

Figure 6: Failure modes of our method in novel homes, broken down by the failures of the three modules and the cleanup levels.

III-D Understanding the performance of OK-Robot

While our method can show zero-shot generalization in completely new environments, we probe OK-Robot to better understand its failure modes. Primarily, we elaborate on how our model performed in novel homes and what the biggest challenges were, and we discuss potential solutions to them.

We first show a coarse-level breakdown of the failures in Figure 6, considering only the three high-level modules of our method. We see that, generally, the leading cause of failure is manipulation, which intuitively is also the most difficult stage. However, on a closer look, we notice a long tail of failure causes, presented in Figure 4.

The three leading causes of failures are failing to retrieve the right object to navigate to from the semantic memory (9.3%), getting a difficult pose from the manipulation module (8.0%), and robot hardware difficulties (7.5%). In this section, we go over the analysis of the failure modes presented in Figure 4 and discuss the most frequent cases.

Natural language queries for objects:

Figure 7: Samples of failed or ambiguous language queries into our semantic memory module. Since the memory module depends on a pretrained large vision-language model, its performance is susceptible to particular “incantations”, similar to current LLMs.

One of the primary reasons OK-Robot can fail is that a natural language query given by the user does not retrieve the intended object from the semantic memory. In Figure 7 we show how some queries may fail while a semantically very similar but slightly reworded version of the same query might succeed.

Generally, this has been the case for scenes where there are multiple visually or semantically similar objects, as shown in the figure. There are other cases where some queries may pass while other very similar queries may fail. An interactive system that gets confirmation from the user as it retrieves an object from memory would avoid such issues.

Grasping module limitations: One failure mode of our manipulation module comes from executing grasps from a pre-trained manipulation model’s output based on a single RGB-D image. Moreover, this model was not designed for the Hello Robot: Stretch gripper. As a result, the proposed grasps are sometimes unreliable or unrealistic (Figure 8).

Sometimes, the grasp is infeasible given the robot joint limits, or is simply too far from the robot body. Developing better grasp perception models or heuristics will let us sample better grasps for a given object.

In other cases, the model generates a good grasp pose, but as the robot is executing the grasping primitive, it collides with some minor environment obstacle. Since we apply the same grasp trajectory in every case instead of planning the grasp trajectory, some such failures are inevitable. Grasping models that generate a grasp trajectory as well as a pose may solve such issues.

Finally, our grasping module categorically struggles with flat objects, like chocolate bars and books, since it is difficult to grasp them off a surface with a two-fingered gripper.

Figure 8: Samples of failures of our manipulation module. Most failures stem from using only a single RGB-D view to generate the grasp and the limiting form-factor of a large two-fingered parallel jaw gripper.

Robot hardware limitations: While our robot of choice, a Hello Robot: Stretch, is able to pick and drop a variety of objects, certain hardware limitations also dictate what our system can and cannot manipulate. For example, the fully extended robot arm has a 1 kg payload limit, and thus our method is unable to pick objects like a full dish soap bottle. Similarly, objects that are far from navigable floor space, e.g., in the middle of a bed or in high places, are difficult for the robot to reach because of the reach limits of the arm. The robot hardware or the RealSense camera can occasionally get miscalibrated over time, especially during continuous home operations. This miscalibration can lead to manipulation errors, since that module requires hand-eye coordination in the robot. The robot base wheels have small diameters and in some cases struggle to move smoothly between carpet and floor.

IV Related Works

IV-A Vision-Language models for robotic navigation

Early applications of pre-trained open-knowledge models in robotics have been in open-vocabulary navigation. Navigating to various objects is an important task which has been looked at in a wide range of previous works [36, 25, 31], as well as in the context of longer pick-and-place tasks [37, 38]. However, these methods have generally been applied to relatively small numbers of objects [39]. Recently, Objaverse [40] has shown navigation to thousands of object types, for example, but much of this work has been restricted to simulated or highly controlled environments.

The early work addressing this problem builds upon representations derived from pre-trained vision language models, such as SemAbs [41], CLIP-Fields [27], VLMaps [42], NLMap-SayCan [43], and later, ConceptFusion [44] and LERF [30]. Most of these models show object localization in pre-mapped scenes, while CLIP-Fields, VLMaps, and NLMap-SayCan show integration with real robots for indoor navigation tasks. USA-Nets [26] extends this task to include an affordance model, navigating with open-vocabulary queries while doing object avoidance. ViNT [45] proposes a foundation model for robotic navigation which can be applied to vision-language navigation problems. More recently, GOAT [31] was proposed as a modular system for “going to anything” and navigating to any object in any environment given either language or image queries. ConceptGraphs [46] proposed an open scene graph representation capable of handling complex queries using LLMs. Any such open-vocabulary embodied model has the potential to improve modular systems like OK-Robot.

IV-B Pretrained robot manipulation models

While humans can frequently look at objects and immediately know how to grasp them, such grasping knowledge is not easily accessible to robots. Over the years, many works have focused on creating such a general robot grasp generation model [1, 47, 48, 49, 50, 51, 52] for arbitrary objects and potentially cluttered scenes via learning methods. Our work focuses on more recent iterations of such methods [16, 19] that are trained on large grasping datasets [53, 18]. While these models only perform one task, namely grasping, they predict grasps across a large object surface and thus enable downstream complex, long-horizon manipulation tasks [20, 54, 21].

More recently, there is a set of general-purpose manipulation models moving beyond just grasping [55, 56, 57, 58, 59]. Some of these works perform general language-conditioned manipulation tasks, but are largely limited to a small set of scenes and objects. HACMan [60] demonstrates a larger range of object manipulation capabilities, focused on pushing and prodding. In the future, such models could expand the reach of our system.

IV-C Open vocabulary robot systems

Many recent works have addressed language-enabled tasks for complex robot systems. Some examples include language-conditioned policy learning [61, 55, 62, 63], learning goal-conditioned value functions [3, 64], and using large language models to generate code [65, 66, 67]. However, a fundamental difference remains between systems which aim to operate on arbitrary objects in an open-vocabulary manner, and systems where one can specify one among a limited number of goals or options using language. Consequently, Open-Vocabulary Mobile Manipulation has been proposed as a key challenge for robotic manipulation [25]. There have previously been efforts to build such a system [68, 69]. However, unlike such previous work, we try to build everything on an open platform and ensure our method can work without having to re-train anything for a novel home. Recently, UniTeam [23] won the 2023 HomeRobot OVMM Challenge [22] with a modular system doing pick-and-place to arbitrary objects, with a zero-shot generalization requirement similar to ours.

In parallel, there have recently been a number of papers doing open-vocabulary manipulation using GPT or especially GPT4 [35]. GPT4V can be included in robot task planning frameworks and used to execute long-horizon robot tasks, including ones from human demonstrations [70]. ConceptGraphs [46] is a good recent example, showing complex object search, planning, and pick-and-place capabilities to open-vocabulary objects. SayPlan [71] also shows how these can be used together with a scene graph to handle very large, complex environments and multi-step tasks; this work is complementary to ours, as it does not handle how to implement pick and place.

V Limitations, Open Problems and Requests for Research

While our method shows significant success in completely novel home environments, it also shows many places where such methods can improve. In this section, we discuss a few such potential improvements.

V-A Live semantic memory and obstacle maps

All the current semantic memory modules and obstacle map builders build a static representation of the world, without a good way of keeping it up-to-date as the world changes. However, homes are dynamic environments, with many small changes over the day every day. Future research that can build a dynamic semantic memory and obstacle map would unlock potential for continuous application of such pick-and-drop methods in a novel home out of the box.

V-B Grasp plans instead of proposals

Currently, the grasping module proposes generic grasps without taking the robot’s body and dynamics into account. Similarly, given a grasp pose, the open-loop grasping trajectory often collides with environmental obstacles, which can be easily improved by using a module to generate grasp plans rather than grasp poses only.
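
As an illustration of the collision problem described above, the following is a minimal sketch of a check that rejects a grasp whose straight-line approach passes through an occupied voxel. This is a simpler stand-in for a grasp planner, not the method used in OK-Robot: the function names, the straight-line approach model, and all numeric parameters (voxel size, sampling step, the radius ignored around the target) are assumptions made for this example.

```python
import math


def voxel_index(point, voxel_size):
    """Integer index of the voxel containing a 3D point (metres)."""
    return tuple(math.floor(c / voxel_size) for c in point)


def approach_is_clear(start, grasp, occupied, voxel_size=0.05,
                      step=0.01, ignore_radius=0.05):
    """Return True if a straight-line approach avoids all occupied voxels.

    Samples points every `step` metres from `start` to `grasp`. Points within
    `ignore_radius` of the grasp are skipped, since the target object itself
    occupies voxels there. All defaults are hypothetical values.
    """
    length = math.dist(start, grasp)
    n = max(1, math.ceil(length / step))
    for i in range(n + 1):
        t = i / n
        p = tuple(s + t * (g - s) for s, g in zip(start, grasp))
        if math.dist(p, grasp) <= ignore_radius:
            continue
        if voxel_index(p, voxel_size) in occupied:
            return False
    return True


# One obstacle voxel halfway along the x axis at gripper height.
occupied = {voxel_index((0.5, 0.0, 0.5), 0.05)}

blocked = approach_is_clear((0.0, 0.0, 0.5), (1.0, 0.0, 0.5), occupied)
free = approach_is_clear((0.0, 0.3, 0.5), (1.0, 0.3, 0.5), occupied)
print(blocked, free)
```

A check of this kind only filters grasp proposals; generating an alternative trajectory around the obstacle would still require a planner.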

V-C Improving interactivity between robot and user

One of the major causes of failure in our method is in navigation, where the semantic query is ambiguous and the intended object is not retrieved from the semantic memory. In such ambiguous cases, interaction with the user would go a long way to disambiguate the query and help the robot succeed more often.
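
One simple way to decide when to involve the user is to compare the similarity scores of the best-matching candidates. The sketch below is hypothetical: the function `pick_or_ask`, the score values, and the `margin` threshold are illustrative assumptions, not part of the OK-Robot code.

```python
def pick_or_ask(scored_candidates, margin=0.05):
    """Decide whether to act on the top match or ask the user to confirm.

    scored_candidates: list of (label, similarity) pairs from the memory.
    margin: hypothetical threshold on the gap between the top two scores.
    Returns ("go", [label]) or ("ask", [labels the user should choose from]).
    """
    ranked = sorted(scored_candidates, key=lambda c: c[1], reverse=True)
    if not ranked:
        return ("ask", [])
    best_score = ranked[0][1]
    if len(ranked) == 1 or best_score - ranked[1][1] >= margin:
        return ("go", [ranked[0][0]])
    # Several candidates score almost the same: list them for the user.
    close = [label for label, score in ranked if best_score - score < margin]
    return ("ask", close)


# Two visually similar blue bottles score nearly the same, so we ask.
ambiguous = pick_or_ask(
    [("blue body wash", 0.31), ("blue air spray", 0.30), ("silver cup", 0.12)]
)
# A clear winner lets the robot proceed without interaction.
clear = pick_or_ask([("silver cup", 0.40), ("blue air spray", 0.20)])
print(ambiguous)
print(clear)
```

The threshold trades off interruptions against wrong retrievals, and a suitable value would have to be tuned for the embedding model in use.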

V-D Detecting and recovering from failure

Currently, we observe a multiplicative error accumulation between our modules: if any of our independent components fail, the entire process fails. As a result, even if our modules each perform independently at or above 80% success rate, our final success rate can still be below 60%. However, with better error detection and retrying algorithms, we can recover from many more single-stage errors, and similarly improve our overall success rate [23].
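
The compounding described above can be made concrete with a short calculation. This is a minimal sketch, assuming three independent stages (navigation, grasping, dropping) and, as a hypothetical extension, that a failed stage is always detected and retried with independent attempts. The function names and the retry model are illustrative assumptions.

```python
from math import prod


def pipeline_success(stage_rates):
    """Success rate of a pipeline whose stages must all succeed independently."""
    return prod(stage_rates)


def with_retries(rate, attempts):
    """Success rate of one stage if failures are detected and retried.

    Assumes every attempt is independent, which is optimistic for a real robot.
    """
    return 1.0 - (1.0 - rate) ** attempts


# Three stages at 80% each: the end-to-end rate drops to about 51%.
single_try = pipeline_success([0.8, 0.8, 0.8])

# Two attempts per stage lift each stage to 96% and the pipeline to about 88%.
one_retry = pipeline_success([with_retries(0.8, 2)] * 3)

print(f"no retries: {single_try:.3f}")
print(f"one retry per stage: {one_retry:.3f}")
```

Under these assumptions, a single detected retry per stage recovers most of the loss caused by chaining the modules.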

V-E Robustifying robot hardware

While Hello Robot: Stretch [72] is an affordable and portable platform on which we can implement such an open-home system for arbitrary homes, we also acknowledge that with robust hardware such methods may have vastly enhanced capacity. Such robust hardware may enable us to reach high and low places, and pick up heavier objects. Finally, improved robot odometry will enable us to execute much finer grasps than is possible today.

Acknowledgments

NYU authors are supported by grants from Amazon, Honda, and ONR award numbers N00014-21-1-2404 and N00014-21-1-2758. NMS is supported by the Apple Scholar in AI/ML Fellowship. LP is supported by the Packard Fellowship. Our utmost gratitude goes to our friends and colleagues who helped us by hosting our experiments in their homes. Finally, we thank Siddhant Haldar, Paula Pascual and Ulyana Piterbarg for valuable feedback and conversations.

References

  • [1] Lerrel Pinto and Abhinav Gupta “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours”, 2015 arXiv:1509.06825 [cs.LG]
  • [2] Sergey Levine et al. “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection” In The International journal of robotics research 37.4-5 SAGE Publications Sage UK: London, England, 2018, pp. 421–436
  • [3] Michael Ahn et al. “Do as I can, not as I say: Grounding language in robotic affordances” In Conference on Robot Learning (CoRL), 2022
  • [4] Nur Muhammad Mahi Shafiullah et al. “On Bringing Robots Home”, 2023 arXiv:2311.16098 [cs.RO]
  • [5] Anthony Brohan et al. “Rt-1: Robotics transformer for real-world control at scale” In arXiv preprint arXiv:2212.06817, 2022
  • [6] Anthony Brohan et al. “Rt-2: Vision-language-action models transfer web knowledge to robotic control” In arXiv preprint arXiv:2307.15818, 2023
  • [7] Xingyi Zhou et al. “Detecting twenty-thousand classes using image-level supervision” In European Conference on Computer Vision, 2022, pp. 350–368 Springer
  • [8] Matthias Minderer et al. “Simple Open-Vocabulary Object Detection with Vision Transformers” In European Conference on Computer Vision, 2022, pp. 728–755 Springer
  • [9] Alec Radford et al. “Learning Transferable Visual Models From Natural Language Supervision” In International Conference on Machine Learning (ICML) 139, 2021, pp. 8748–8763
  • [10] Kenneth Marino, Mohammad Rastegari, Ali Farhadi and Roozbeh Mottaghi “Ok-vqa: A visual question answering benchmark requiring external knowledge” In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, 2019, pp. 3195–3204
  • [11] Jean-Baptiste Alayrac et al. “Flamingo: a Visual Language Model for Few-Shot Learning”, 2022 arXiv:2204.14198 [cs.CV]
  • [12] Shilong Liu et al. “Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection”, 2023 arXiv:2303.05499 [cs.CV]
  • [13] Haotian Liu, Chunyuan Li, Qingyang Wu and Yong Jae Lee “Visual Instruction Tuning”, 2023 arXiv:2304.08485 [cs.CV]
  • [14] Alec Radford et al. “Language models are unsupervised multitask learners” In OpenAI Blog, 2019
  • [15] Theophile Gervet et al. “Navigating to objects in the real world” In Science Robotics 8.79 American Association for the Advancement of Science, 2023, pp. eadf6991
  • [16] Martin Sundermeyer, Arsalan Mousavian, Rudolph Triebel and Dieter Fox “Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes” In 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 13438–13444 IEEE
  • [17] Jeffrey Mahler et al. “Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics”, 2017 arXiv:1703.09312 [cs.RO]
  • [18] Hao-Shu Fang, Chenxi Wang, Minghao Gou and Cewu Lu “Graspnet-1billion: a large-scale benchmark for general object grasping” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11444–11453
  • [19] Hao-Shu Fang et al. “Anygrasp: Robust and efficient grasp perception in spatial and temporal domains” In IEEE Transactions on Robotics IEEE, 2023
  • [20] Ankit Goyal et al. “Ifor: Iterative flow minimization for robotic object rearrangement” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14787–14797
  • [21] Weiyu Liu et al. “StructDiffusion: Language-Guided Creation of Physically-Valid Structures using Unseen Objects”, 2023 arXiv:2211.04604 [cs.RO]
  • [22] Sriram Yenamandra et al. “The HomeRobot Open Vocab Mobile Manipulation Challenge” In Thirty-seventh Conference on Neural Information Processing Systems: Competition Track, 2023 URL: https://aihabitat.org/challenge/2023_homerobot_ovmm/
  • [23] Andrew Melnik et al. “UniTeam: Open Vocabulary Mobile Manipulation Challenge” In arXiv preprint arXiv:2312.08611, 2023
  • [24] Luca Medeiros “Lang Segment Anything” In GitHub repository GitHub, https://github.com/luca-medeiros/lang-segment-anything, 2023
  • [25] Sriram Yenamandra et al. “HomeRobot: Open Vocabulary Mobile Manipulation” In arXiv preprint arXiv:2306.11565, 2023 URL: https://github.com/facebookresearch/home-robot
  • [26] Benjamin Bolte et al. “USA-Net: Unified Semantic and Affordance Representations for Robot Memory”, 2023 arXiv:2304.12164 [cs.RO]
  • [27] Nur Muhammad Mahi Shafiullah et al. “CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory”, 2023 arXiv:2210.05663 [cs.RO]
  • [28] Alexander Kirillov et al. “Segment Anything” In ICCV, 2023, pp. 4015–4026
  • [29] David Rozenberszki, Or Litany and Angela Dai “Language-Grounded Indoor 3D Semantic Segmentation in the Wild”, 2022 arXiv:2204.07761 [cs.CV]
  • [30] Justin Kerr et al. “Lerf: Language embedded radiance fields” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19729–19739
  • [31] Matthew Chang et al. “Goat: Go to any thing” In arXiv preprint arXiv:2311.06430, 2023
  • [32] Chenguang Huang, Oier Mees, Andy Zeng and Wolfram Burgard “Audio Visual Language Maps for Robot Navigation” In arXiv preprint arXiv:2303.07522, 2023
  • [33] Santiago Garrido, Luis Moreno, Mohamed Abderrahim and Fernando Martin “Path planning for mobile robot navigation using voronoi diagram and fast marching” In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006, pp. 2376–2381 IEEE
  • [34] Sudeep Dasari, Abhinav Gupta and Vikash Kumar “Learning Dexterous Manipulation from Exemplar Object Trajectories and Pre-Grasps”, 2023 arXiv:2209.11221 [cs.RO]
  • [35] OpenAI “GPT-4 Technical Report” In arXiv preprint arXiv:2303.08774, 2023 arXiv:2303.08774 [cs.CL]
  • [36] Arsalan Mousavian et al. “Visual Representations for Semantic Target Driven Navigation” In 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 8846–8852 IEEE
  • [37] Valts Blukis et al. “A persistent spatial semantic representation for high-level natural language instruction execution” In Conference on Robot Learning, 2022, pp. 706–717 PMLR
  • [38] So Yeon Min et al. “Film: Following instructions in language with modular methods” In arXiv preprint arXiv:2110.07342, 2021
  • [39] Matt Deitke et al. “Retrospectives on the embodied ai workshop” In arXiv preprint arXiv:2210.06849, 2022
  • [40] Matt Deitke et al. “Objaverse: A universe of annotated 3d objects” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13142–13153
  • [41] Huy Ha and Shuran Song “Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models” In CoRL, 2022 arXiv:2207.11514 [cs.CV]
  • [42] Chenguang Huang, Oier Mees, Andy Zeng and Wolfram Burgard “Visual language maps for robot navigation” In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 10608–10615 IEEE
  • [43] Boyuan Chen et al. “Open-vocabulary Queryable Scene Representations for Real World Planning” In arXiv preprint arXiv:2209.09874, 2022
  • [44] Krishna Murthy Jatavallabhula et al. “Conceptfusion: Open-set multimodal 3d mapping” In arXiv preprint arXiv:2302.07241, 2023
  • [45] Dhruv Shah et al. “ViNT: A Foundation Model for Visual Navigation” In 7th Annual Conference on Robot Learning (CoRL), 2023
  • [46] Qiao Gu et al. “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning” In arXiv preprint arXiv:2309.16650, 2023
  • [47] Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi and Lerrel Pinto “Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias” In Advances in Neural Information Processing Systems 31, 2018, pp. 9094–9104
  • [48] Jeffrey Mahler et al. “Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics” In Robotics: Science and Systems (RSS), 2017
  • [49] Jeffrey Mahler et al. “Dex-Net 3.0: Computing Robust Robot Vacuum Suction Grasp Targets in Point Clouds using a New Analytic Model and Deep Learning”, 2018 arXiv:1709.06670 [cs.RO]
  • [50] Dmitry Kalashnikov et al. “QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation” In arXiv preprint arXiv:1806.10293, 2018
  • [51] Yuzhe Qin et al. “S4G: Amodal Single-view Single-Shot SE(3) Grasp Detection in Cluttered Scenes”, 2019 arXiv:1910.14218 [cs.RO]
  • [52] Arsalan Mousavian, Clemens Eppner and Dieter Fox “6-DOF GraspNet: Variational Grasp Generation for Object Manipulation”, 2019 arXiv:1905.10520 [cs.CV]
  • [53] Clemens Eppner, Arsalan Mousavian and Dieter Fox “Acronym: A large-scale grasp dataset based on simulation” In 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 6222–6227 IEEE
  • [54] I. Singh et al. “Progprompt: Generating situated robot task plans using large language models” In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 11523
  • [55] Mohit Shridhar, Lucas Manuelli and Dieter Fox “Perceiver-Actor: A multi-task transformer for robotic manipulation” In CoRL, 2023, pp. 785–799 PMLR
  • [56] Priyam Parashar, Jay Vakil, Sam Powers and Chris Paxton “Spatial-Language Attention Policies for Efficient Robot Learning” In arXiv preprint arXiv:2304.11235, 2023
  • [57] Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya and Lerrel Pinto “Behavior Transformers: Cloning k modes with one stone” In Advances in neural information processing systems 35, 2022, pp. 22955–22968
  • [58] Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah and Lerrel Pinto “From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data”, 2022 arXiv:2210.10047 [cs.RO]
  • [59] Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios and Katerina Fragkiadaki “Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation” In Conference on Robot Learning, 2023, pp. 3949–3965 PMLR
  • [60] Wenxuan Zhou et al. “Learning Hybrid Actor-Critic Maps for 6D Non-Prehensile Manipulation” In arXiv preprint arXiv:2305.03942, 2023
  • [61] Mohit Shridhar, Lucas Manuelli and Dieter Fox “CLIPort: What and where pathways for robotic manipulation” In CoRL, 2022, pp. 894–906 PMLR
  • [62] Corey Lynch et al. “Learning latent plans from play” In CoRL, 2020, pp. 1113–1132 PMLR
  • [63] Corey Lynch and Pierre Sermanet “Language Conditioned Imitation Learning over Unstructured Data” In Robotics: Science and Systems, 2021 URL: https://arxiv.org/abs/2005.07648
  • [64] Wenlong Huang et al. “VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models” In CoRL, 2023
  • [65] Jacky Liang et al. “Code as Policies: Language model programs for embodied control” In ICRA, 2023, pp. 9493–9500 IEEE
  • [66] Guanzhi Wang et al. “Voyager: An Open-Ended Embodied Agent with Large Language Models” In arXiv preprint arXiv:2305.16291, 2023
  • [67] Ishika Singh et al. “ProgPrompt: Generating Situated Robot Task Plans using Large Language Models” In ICRA, 2023, pp. 11523–11530 IEEE
  • [68] Naoki Yokoyama et al. “ASC: Adaptive Skill Coordination for Robotic Mobile Manipulation” In arXiv preprint arXiv:2304.00410, 2023
  • [69] Austin Stone et al. “Open-World Object Manipulation using Pre-trained Vision-Language Models”, 2023 arXiv:2303.00905 [cs.RO]
  • [70] Naoki Wake et al. “GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration” In arXiv preprint arXiv:2311.12015, 2023
  • [71] Krishan Rana et al. “Sayplan: Grounding large language models using 3d scene graphs for scalable task planning” In arXiv preprint arXiv:2307.06135, 2023
  • [72] Charles C Kemp, Aaron Edsinger, Henry M Clever and Blaine Matulevich “The design of Stretch: A compact, lightweight mobile manipulator for indoor human environments” In 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 3150–3157 IEEE
  • [73] Nils Reimers and Iryna Gurevych “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, 2019 arXiv:1908.10084 [cs.CL]
  • [74] Ben Mildenhall et al. “Nerf: Representing scenes as neural radiance fields for view synthesis” In European Conference on Computer Vision (ECCV) 65.1 ACM New York, NY, USA, 2020, pp. 99–106

Appendix A Description of alternate system components

In this section, we provide more details about the alternate system components that we evaluated in Section III-B.

Alternate semantic navigation strategies: We evaluate the following semantic memory modules:

  • VoxelMap [25]: VoxelMap converts every detected object to a semantic vector and stores that vector in the associated voxel. Occupied voxels serve as an obstacle map.

  • CLIP-Fields [27]: CLIP-Fields converts a sequence of posed RGB-D images to a semantic vector field by using open-label object detectors and semantic language embedding models. The result associates each point in the space with two semantic vectors, one generated via a VLM [9], and another generated via a language model [73], which is then embedded into a neural field [74].

  • USA-Net [26]: USA-Net generates multi-scale CLIP features and embeds them in a neural field that also doubles as a signed distance field. As a result, a single model can support both object retrieval and navigation.

We compare them in the same three environments with a fixed set of queries, the results of which are shown in Figure 5.
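
To make the VoxelMap idea concrete, the following toy sketch stores one feature vector per occupied voxel and answers a query by cosine similarity. It is only an illustration under stated assumptions: the class `ToyVoxelMemory`, the hand-written 2D feature vectors (standing in for VLM embeddings), and the choice to sum features that fall in the same voxel are all hypothetical and do not reproduce the HomeRobot or OK-Robot implementation.

```python
import math


def voxel_key(point, voxel_size):
    """Map a 3D point (metres) to the integer index of its containing voxel."""
    return tuple(math.floor(c / voxel_size) for c in point)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVoxelMemory:
    """Toy semantic memory: one accumulated feature vector per occupied voxel."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.features = {}  # voxel index -> summed feature vector

    def add(self, point, feature):
        key = voxel_key(point, self.voxel_size)
        if key in self.features:
            # Summing is enough here: cosine similarity ignores vector scale.
            self.features[key] = [s + f for s, f in zip(self.features[key], feature)]
        else:
            self.features[key] = list(feature)

    def occupied(self):
        """Occupied voxel indices, usable as a simple obstacle map."""
        return set(self.features)

    def query(self, query_feature):
        """Centre of the voxel whose feature best matches the query vector."""
        best = max(self.features, key=lambda k: cosine(self.features[k], query_feature))
        return tuple((i + 0.5) * self.voxel_size for i in best)


memory = ToyVoxelMemory(voxel_size=0.1)
memory.add((1.02, 0.51, 0.33), [1.0, 0.0])  # stand-in for a "cup" embedding
memory.add((2.47, 1.18, 0.92), [0.0, 1.0])  # stand-in for a "towel" embedding
target = memory.query([0.9, 0.1])           # a query close to "cup"
print(target, len(memory.occupied()))
```

The same occupied-voxel set serves both purposes mentioned above: retrieval returns a location to navigate to, and the set of keys marks where the robot cannot drive.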

Alternate grasping strategies: Similarly, we compare multiple grasping strategies to find the best one for our method.

  • AnyGrasp [19]: AnyGrasp is a single-view RGB-D based grasping model. It is trained on the GraspNet dataset, which contains 1B grasp labels.

  • Open Graspness [19]: Since the AnyGrasp model is free but not open source, we use an open licensed baseline trained on the same dataset.

  • Contact-GraspNet [16]: We use Contact-GraspNet as a prior work baseline, which is trained on the Acronym [53] dataset. One limitation of Contact-GraspNet is that it was trained on a fixed camera view for a tabletop setting. As a result, in our application with a moving camera and arbitrary locations, it failed to give us meaningful grasps.

  • Top-down grasp [25]: As a heuristic-based baseline, we compare with the top-down heuristic grasp provided in the HomeRobot project.

Appendix B Scannet200 text queries

To detect objects in a given home environment using OWL-ViT, we use the Scannet200 labels. The full label set is here: [’shower head’, ’spray’, ’inhaler’, ’guitar case’, ’plunger’, ’range hood’, ’toilet paper dispenser’, ’adapter’, ’soy sauce’, ’pipe’, ’bottle’, ’door’, ’scale’, ’paper towel’, ’paper towel roll’, ’stove’, ’mailbox’, ’scissors’, ’tape’, ’bathroom stall’, ’chopsticks’, ’case of water bottles’, ’hand sanitizer’, ’laptop’, ’alcohol disinfection’, ’keyboard’, ’coffee maker’, ’light’, ’toaster’, ’stuffed animal’, ’divider’, ’clothes dryer’, ’toilet seat cover dispenser’, ’file cabinet’, ’curtain’, ’ironing board’, ’fire extinguisher’, ’fruit’, ’object’, ’blinds’, ’container’, ’bag’, ’oven’, ’body wash’, ’bucket’, ’cd case’, ’tv’, ’tray’, ’bowl’, ’cabinet’, ’speaker’, ’crate’, ’projector’, ’book’, ’school bag’, ’laundry detergent’, ’mattress’, ’bathtub’, ’clothes’, ’candle’, ’basket’, ’glass’, ’face wash’, ’notebook’, ’purse’, ’shower’, ’power outlet’, ’trash bin’, ’paper bag’, ’water dispenser’, ’package’, ’bulletin board’, ’printer’, ’windowsill’, ’disinfecting wipes’, ’bookshelf’, ’recycling bin’, ’headphones’, ’dresser’, ’mouse’, ’shower gel’, ’dustpan’, ’cup’, ’storage organizer’, ’vacuum cleaner’, ’fireplace’, ’dish rack’, ’coffee kettle’, ’fire alarm’, ’plants’, ’rag’, ’can’, ’piano’, ’bathroom cabinet’, ’shelf’, ’cushion’, ’monitor’, ’fan’, ’tube’, ’box’, ’blackboard’, ’ball’, ’bicycle’, ’guitar’, ’trash can’, ’hand sanitizers’, ’paper towel dispenser’, ’whiteboard’, ’bin’, ’potted plant’, ’tennis’, ’soap dish’, ’structure’, ’calendar’, ’dumbbell’, ’fish oil’, ’paper cutter’, ’ottoman’, ’stool’, ’hand wash’, ’lamp’, ’toaster oven’, ’music stand’, ’water bottle’, ’clock’, ’charger’, ’picture’, ’bascketball’, ’sink’, ’microwave’, ’screwdriver’, ’kitchen counter’, ’rack’, ’apple’, ’washing machine’, ’suitcase’, ’ladder’, ’ping pong ball’, ’window’, ’dishwasher’, ’storage container’, ’toilet paper holder’, ’coat rack’, ’soap dispenser’, ’refrigerator’,
’banana’, ’counter’, ’toilet paper’, ’mug’, ’marker pen’, ’hat’, ’aerosol’, ’luggage’, ’poster’, ’bed’, ’cart’, ’light switch’, ’backpack’, ’power strip’, ’baseball’, ’mustard’, ’bathroom vanity’, ’water pitcher’, ’closet’, ’couch’, ’beverage’, ’toy’, ’salt’, ’plant’, ’pillow’, ’broom’, ’pepper’, ’muffins’, ’multivitamin’, ’towel’, ’storage bin’, ’nightstand’, ’radiator’, ’telephone’, ’pillar’, ’tissue box’, ’vent’, ’hair dryer’, ’ledge’, ’mirror’, ’sign’, ’plate’, ’tripod’, ’chair’, ’kitchen cabinet’, ’column’, ’water cooler’, ’plastic bag’, ’umbrella’, ’doorframe’, ’paper’, ’laundry hamper’, ’food’, ’jacket’, ’closet door’, ’computer tower’, ’stairs’, ’keyboard piano’, ’person’, ’table’, ’machine’, ’projector screen’, ’shoe’].

Appendix C Sample objects from our trials

During our experiments, we tried to sample objects that can plausibly be manipulated by the Hello Robot: Stretch gripper from the home environments. As a result, OK-Robot encountered a large variety of objects with different shapes and visual features. A subsample of such objects is presented in Figures 9 and 10.

Figure 9: Sample objects from our home experiments, sampled from each home environment, which OK-Robot was able to pick and drop successfully.
Figure 10: Sample objects from our home experiments, sampled from each home environment, which OK-Robot failed to pick up successfully.

Appendix D Sample home environments from our trials

We show snapshots from a subset of home environments where we evaluated our method in Figure 11. Additionally, in Figure 12 we show the two home environments in Pittsburgh, PA, and Fremont, CA, where we reproduced the OK-Robot system.

Figure 11: Eight out of the ten New York home environments in which we evaluated OK-Robot. In each figure caption, we show the queries that the system is being evaluated on.
Figure 12: Home environments outside of New York where we successfully reproduced OK-Robot. We ensured that OK-Robot can function in these homes by trying pick-and-drop on a number of objects in the homes.

Appendix E List of home experiments

A full list of experiments in homes can be found in Table I.

TABLE I: A list of all tasks in the home environments, along with their categories and success rates out of 10 trials.
Pick object Place location Result
Home 1
Cleanup level: none
silver cup white table Success
blue eye glass case chair Success
printed paper cup, coffee cup [white table] ____ Manipulation failure
small red and white medication Chair Success
power adapter Grey Bed Success
wrapped paper ____ Navigation failure
blue body wash study table Success
blue air spray white table Success
black face wash ____ Manipulation failure
yellow face wash chair Success
body spray ____ Navigation failure
small hand sanitizer ____ Manipulation failure
blue inhaler device(window) white table Success
inhaler box(window) dust bin Success
multivitamin container ____ Navigation failure
red towel white cloth bin (air conditioner) Success
white shirt white cloth bin (air conditioner) Success
Cleanup level: low
silver cup white table Success
blue eye glass case ____ Navigation failure
printed paper cup, coffee cup [white table] dust bin Success
small red and white medication Chair Success
power adapter ____ Navigation failure
blue body wash white table Success
blue air spray white table Success
yellow face wash white table Success
small hand sanitizer study table Success
blue inhaler device(window) ____ Manipulation failure
inhaler box(window) dust bin Success
red towel white cloth bin(air conditioner) Success
white shirt white cloth bin(air conditioner) Success
Cleanup level: high
silver cup white table Success
printed paper cup, coffee cup [white table] dust bin Success
blue body wash white table Success
blue air spray white table Success
yellow face wash ____ Manipulation failure
small hand sanitizer ____ Manipulation failure
inhaler box(window) dust bin Success
white shirt white cloth bin(air conditioner) Success
Home 2
Cleanup level: None
fanta can dust bin Success
tennis ball small red shopping bag Success
black head band [bed] ____ Manipulation failure
purple shampoo bottle white rack Success
toothpaste small red shopping bag Success
orange packaging dust bin Success
green hair cream jar [white rack] ____ Navigation failure
green detergent pack [white rack] white table Success
blue moisturizer [white rack] ____ Navigation failure
green plastic cover ____ Navigation failure
storage container ____ Manipulation failure
blue hair oil bottle white rack Success
blue pretzels pack white rack Success
blue hair gel tube ____ Manipulation failure
red bottle [white rack] brown desk Success
blue bottle [air conditioner] white cloth bin(air conditioner) Success
wallet ____ Manipulation failure
Cleanup level: low
fanta can black trash can Success
tennis ball red target bag Success
black head band [bed] red target bag Success
purple shampoo bottle red target bag Success
toothpaste red target bag Success
orange packaging black trash can Success
green detergent pack [white rack] ____ Manipulation failure
blue moisturizer [white rack] ____ Navigation failure
blue hair oil bottle white rack Success
blue pretzels pack white rack Success
wallet ____ Manipulation failure
Cleanup level: high
fanta can black trash can Success
purple shampoo bottle small red shopping bag Success
orange packaging black trash can Success
blue moisturizer [white rack] white rack Success
blue hair oil bottle ____ Manipulation failure
blue hair gel tube dust bin Success
red bottle [white rack] target bag Placing failure
blue bottle [air conditioner] white cloth bin(air conditioner) Success
Home 3
Cleanup level: none
apple white plate Success
ice cream white and green bag Success
green lime juice bottle red basket Success
yellow packet ____ Manipulation failure
red packet ____ Manipulation failure
orange can card board box Success
cooking oil bottle ____ Manipulation failure
pasta sauce ____ Manipulation failure
orange box [stove] ____ Manipulation failure
green bowl sink Success
washing gloves green bag [card board box] Success
small oregano bottle red basket Success
yellow noodles packet [stove] red basket Success
blue dish wash bottle card board box Success
scrubber ____ Navigation failure
dressing salad bottle ____ Navigation failure
Cleanup level: low
apple white plate Success
ice cream red basket Success
green lime juice bottle red basket Success
yellow packet green bag Success
red packet ____ Manipulation failure
orange can card board box Success
cooking oil bottle marble surface [red basket] Success
green bowl ____ Manipulation failure
washing gloves sink Success
small oregano bottle red basket Success
yellow noodles packet [stove] ____ Manipulation failure
blue dish wash bottle card board box Success
Cleanup level: high
apple white plate Success
ice cream red basket Success
green lime juice bottle red basket Success
orange can card board box Success
cooking oil bottle ____ Manipulation failure
washing gloves sink Success
small oregano bottle red basket Success
yellow noodles packet [stove] red basket Success
blue dish wash bottle card board box Success
Home 4
Cleanup level: none
pepsi black chair Success
birdie cloth bin Success
black hat ____ Navigation failure
owl like wood carving bed Success
red inhaler ____ Manipulation failure
black sesame seeds ____ Manipulation failure
banana ____ Manipulation failure
loose-leaf herbal tea jar black chair Success
red pencil sharpener ____ Navigation failure
fast-food French fries container blue shopping bag [metal drying rack] Placing failure
milk plastic storage drawer unit Success
socks[bed] ____ Navigation failure
purple gloves ____ Manipulation failure
target bag cloth bin Success
muffin grey bed Success
tissue paper table Success
grey eyeglass box ____ Manipulation failure
Cleanup level: low
pepsi basket Success
birdie white drawer Success
owl like wood carving ____ Navigation failure
red inhaler plastic storage drawer unit Success
black sesame seeds bed Success
loose-leaf herbal tea jar table Success
fast-food French fries container chair Success
milk chair Success
purple gloves basket Success
target bag basket Placing failure
muffin table Success
tissue paper ____ Manipulation failure
grey eyeglass box ____ Navigation failure
Cleanup level: high
pepsi basket Success
birdie bed Success
red inhaler plastic storage drawer unit Success
black sesame seeds desk Success
banana ____ Manipulation failure
loose-leaf herbal tea jar ____ Manipulation failure
milk chair Success
purple gloves basket Success
target bag basket Success
muffin bed Success
Home 5
Cleanup level: none
tiger balm topical ointment ____ Navigation failure
pink shampoo trader joes shopping bag Success
aveeno sunscreen protective lotion trader joes shopping bag Success
small yellow nozzle spray ____ Manipulation failure
black hair care spray ____ Manipulation failure
green hand sanitizer ____ Manipulation failure
white hand sanitizer ____ Navigation failure
white bowl [ketchup] black sofa chair Success
blue bowl ____ Manipulation failure
blue sponge trader joes shopping bag Success
ketchup ____ Manipulation failure
white salt ____ Manipulation failure
black pepper black drawer Success
blue bottle ____ Navigation failure
purple light bulb box trader joes shopping bag Success
white plastic bag bed Success
rag white rack Success
Cleanup level: low
pink shampoo ____ Navigation failure
aveeno sunscreen protective lotion _____ Manipulation failure
small yellow nozzle spray _____ Manipulation failure
white bowl [ketchup] black sofa chair Success
blue sponge bed Success
ketchup trader joes shopping bag Success
white salt trader joes shopping bag Success
black pepper ____ Navigation failure
blue bottle black sofa chair Success
purple light bulb box _____ Manipulation failure
rag white rack Success
Cleanup level: high
pink shampoo trader joes shopping bag Success
green hand sanitizer black sofa chair Success
white bowl [ketchup] _____ Manipulation failure
blue sponge bed Success
ketchup black drawer Success
white salt white drawer Success
purple light bulb box trader joes shopping bag Success
rag black sofa chair Success
Home 6
Cleanup level: none
translucent grey cup ____ Manipulation failure
green mouth spray box stove Success
green eyeglass container chair Success
blue bag ____ Manipulation failure
black burn ointment box _____ Navigation failure
white vitamin bottle ____ Navigation failure
McDonald’s paper bag stove Success
purple medicine packaging chair Success
grey rag sink Success
sparkling water can [sink] countertop Success
gold wrapped chocolate ____ Manipulation failure
lemon tea carton table Success
metallic golden beverage can table Success
red bottle table Success
tea milk bottle ____ Navigation failure
nyu water bottle [sink] table Success
white hand wash _____ Navigation failure
Cleanup level: low
translucent grey cup ____ Navigation failure
green mouth spray box ____ Manipulation failure
blue bag brown box Success
black burn ointment box brown box Success
McDonald’s paper bag ____ Navigation failure
grey rag sink Success
sparkling water can [sink] chair Success
lemon tea carton stove Success
metallic golden beverage can ____ Navigation failure
red bottle brown box Success
nyu water bottle [sink] table Success
white hand wash sink Success
Cleanup level: high
blue bag brown box Success
black burn ointment box ____ Manipulation failure
grey rag sink Success
sparkling water can [sink] chair Success
lemon tea carton table Success
metallic golden beverage can stove Success
red bottle ____ Navigation failure
nyu water bottle [sink] ____ Manipulation failure
white hand wash ____ Manipulation failure
Home 7
Cleanup level: none
blue plastic bag roll _____ Navigation failure
green bag basket[window] Success
toy cactus desk Success
toy van chair Success
brown medical bandage chair Success
power adapter _____ Navigation failure
red herbal tea brown cardboard box Success
apple juice box brown cardboard box Success
paper towel blue cardboard box Success
toy bear bed blanket Success
yellow ball bed blanket Success
black pants basket[window] Success
purple water bottle desk Success
blue eyeglass case _____ Manipulation failure
brown toy monkey _____ Navigation failure
blue hardware box [table] blue cardboard box Success
green zandu balm container blue cardboard box Success
Cleanup level: low
green bag basket Success
toy cactus basket Success
toy van chair Success
brown medical bandage _____ Manipulation failure
red herbal tea brown box Success
apple juice box brown box Success
paper towel basket Success
toy bear desk Success
purple water bottle desk Success
blue eyeglass case _____ Manipulation failure
green zandu balm container blue cardboard box Success
Cleanup level: high
green bag stool [window] Success
toy cactus table Success
toy van white basket Success
red herbal tea brown cardboard box Success
apple juice box brown cardboard box Success
paper towel blue cardboard box Success
toy bear white basket Success
yellow ball bed Success
purple water bottle black tote bag Success
green zandu balm container blue cardboard box Success
Home 8
Cleanup level: none
cyan air spray brown shelf [sink] Success
blue gloves kitchen sink Success
blue peanut butter black stove Success
nutella table Success
green bag brown shelf [sink] Success
green bandage box trash can Success
green detergent kitchen sink Success
black ‘red pepper sauce’ ____ Manipulation failure
red bag chair Success
black bag chair Success
red spray [brown shelf] kitchen countertop Success
steel wool _____ Manipulation failure
white aerosol trash can Success
white pretzel black stove Success
purple crisp kitchen countertop Success
plastic bowl ______ Manipulation failure
playing card microwave Success
Cleanup level: low
cyan air spray chair Success
blue gloves sink Success
blue peanut butter ____ Navigation failure
green bag brown shelf Success
green bandage box brown shopping bag Success
green detergent microwave Success
red bag ____ Manipulation failure
black bag chair Success
white aerosol trash can Success
white pretzel black stove Success
purple crisp kitchen countertop Success
plastic bowl ______ Manipulation failure
playing card microwave Success
Cleanup level: high
cyan air spray brown shelf [sink] Success
blue gloves stove Success
blue peanut butter black stove Success
green bag brown shelf [sink] Success
green bandage box microwave Success
green detergent ____ Manipulation failure
black bag chair Success
white aerosol table Success
purple crisp chair Success
playing card microwave Success
Home 9
Cleanup level: none
toy grapes black laundry bag Success
purple strap _____ Manipulation failure
red foggy body spray _____ Manipulation failure
arm smartphone holder bed Success
medicine bottle _____ Manipulation failure
yogurt beverage _____ Navigation failure
blue shaving cream can ____ Navigation failure
blue cup table Success
purple tape _____ Manipulation failure
black shoe brush _____ Navigation failure
fluffy headband _____ Manipulation failure
black water bottle brown shopping bag Placing failure
yellow eyeglass case black chair Success
paper cup _____ Manipulation failure
lotion pump _____ Manipulation failure
nasal spray _____ Manipulation failure
plastic bag trash basket Success
Cleanup level: low
toy grapes _____ Manipulation failure
red foggy body spray brown paper bag Success
arm smartphone holder brown paper bag Success
yogurt beverage desk Success
blue shaving cream can black bag Success
blue cup black chair Success
black shoe brush _____ Manipulation failure
fluffy headband _____ Navigation failure
black water bottle folded chair Success
nasal spray _____ Navigation failure
plastic bag trash basket Success
Cleanup level: high
red foggy body spray brown paper bag Success
arm smartphone holder _____ Manipulation failure
yogurt beverage desk Success
blue shaving cream can black bag Success
blue cup black chair Success
black water bottle white bed Success
nasal spray folded chair Success
plastic bag trash basket Success
Home 10
Cleanup level: none
grey toy dragon bed Success
purple body spray ____ Manipulation failure
hand sanitizer shelf Success
toy plant bed [shelf] Success
brown trail mix bag ____ Manipulation failure
hanging blue shirt cloth bin Success
white apple bag ____ Manipulation failure
white and pink powder bottle table Success
cough syrup bottle shelf Success
tangled ear phones office chair Success
red deodorant stick [table] chair Success
black body spray chair Success
hair treatment medicine bottle ____ Manipulation failure
green tea package chair Success
portable speaker [green tea package] office chair Success
wooden workout gripper _____ Navigation failure
brown box _____ Navigation failure
blue bulb adapter office chair Success
game controller office chair Success
Cleanup level: low
grey toy dragon orange bag Success
purple body spray table Success
hand sanitizer _____ Navigation failure
toy plant bed Success
brown trail mix bag ____ Manipulation failure
white and pink powder bottle black chair [bed] Success
cough syrup bottle shelf [bed] Success
red deodorant stick [table] bed [rack] Success
black body spray rack [bed] Placing failure
green tea package orange bag Success
brown box black chair [bed] Success
blue bulb adapter _____ Manipulation failure
Cleanup level: high
purple body spray orange bag Success
toy plant bed Success
white and pink powder bottle _____ Navigation failure
cough syrup bottle shelf [bed] Success
red deodorant stick [table] _____ Navigation failure
black body spray black chair Success
green tea package table Success
blue bulb adapter shelf Success
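The per-trial outcomes listed above can be aggregated into success rates per home and cleanup level. The following is a minimal sketch of one way to do that, assuming the rows are available as plain-text lines in the layout shown in the table (a "Home N" line, a "Cleanup level: ..." line, then one row per trial ending in its result). The function names `tally` and `success_rate` are hypothetical helpers introduced only for illustration and are not part of the OK-Robot code release; the sample rows are a short excerpt copied from Home 1, not the full table.

```python
from collections import Counter, defaultdict

# Result labels exactly as they appear in the last column of the table.
OUTCOMES = (
    "Success",
    "Navigation failure",
    "Manipulation failure",
    "Placing failure",
)


def tally(lines):
    """Count outcomes per (home, cleanup level) from flattened table rows."""
    counts = defaultdict(Counter)
    home = level = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("Home "):
            home = line
        elif line.lower().startswith("cleanup level:"):
            level = line.split(":", 1)[1].strip().lower()
        else:
            # A trial row ends with one of the known result labels;
            # anything else (caption, column header) is ignored.
            for outcome in OUTCOMES:
                if line.endswith(outcome):
                    counts[(home, level)][outcome] += 1
                    break
    return counts


def success_rate(counter):
    """Fraction of trials in a Counter that ended in 'Success'."""
    total = sum(counter.values())
    return counter["Success"] / total if total else 0.0


# Excerpt of four rows from Home 1, cleanup level "none" (illustration only).
sample = [
    "Home 1",
    "Cleanup level: none",
    "silver cup white table Success",
    "blue eye glass case chair Success",
    "printed paper cup, coffee cup [white table] ____ Manipulation failure",
    "wrapped paper ____ Navigation failure",
]

counts = tally(sample)
for key, counter in counts.items():
    print(key, dict(counter), success_rate(counter))
```

Matching on the trailing result label avoids having to split the pick object from the place location, which the flattened rows do not delimit.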