Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Jiafei Duan ¹    Wentao Yuan ¹*    Wilbert Pumacay ²    Yi Ru Wang ¹
Kiana Ehsani ³    Dieter Fox ¹,⁴    Ranjay Krishna ¹,³
¹University of Washington    ²Universidad Católica San Pablo
³Allen Institute for Artificial Intelligence    ⁴NVIDIA
robot-ma.github.io    *Equal contribution

Abstract

Large-scale endeavors like RT-1 [1] and widespread community efforts such as Open-X-Embodiment [2] have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information, they require hand-designed skills, and they are limited to interactions with few object instances. We propose Manipulate-Anything, a scalable automated generation method for real-world robotic manipulation. Unlike prior work, our method can operate in real-world environments without any privileged state information or hand-designed skills, and can manipulate any static object. We evaluate our method using two setups. First, Manipulate-Anything successfully generates trajectories for all $5$ real-world and $12$ simulation tasks, significantly outperforming existing methods like VoxPoser. Second, Manipulate-Anything’s demonstrations can train more robust behavior cloning policies than human demonstrations or data generated by VoxPoser [3] and Code-As-Policies [4]. We believe Manipulate-Anything can be a scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting.

Keywords: Zero-shot manipulation, multimodal language models, multiview state verification, robot skill generation, behavior cloning, robotic manipulation

1 Introduction

The success of modern machine learning systems fundamentally relies on the quantity [5, 6, 7, 8, 9, 10], quality [11, 12, 13, 14, 15], and diversity [16, 17, 18, 19, 20] of the data they are trained on. The availability of large-scale internet data made possible significant advances in vision and language [21, 22, 23]. However, the dearth of data has prevented similar advancements in robotics. Human demonstration collection methods do not scale to sufficient quantity or diversity. Projects like RT-1 [1] demonstrated the utility of high-quality human data collected over 17 months. Others have developed low-cost hardware for data collection [24, 25, 26]. However, all these procedures require expensive human collection. In an effort to diversify demonstration data, the Open X-Embodiment project gathered 1 million trajectories through a participatory effort by 34 research labs [2]. Despite the widespread effort, the dataset only contains 20 tasks.

Automated data collection methods do not scale to sufficient diversity. With the advent of vision-language models (VLMs), the robotics community has been abuzz with new systems that leverage VLMs to guide robotic behavior [27, 4, 28, 29, 3, 30, 31]. In these systems, VLMs decompose tasks into language plans [27, 4] or generate code to execute predefined skills [32, 3]. Though successful in simulation, these methods underperform in the real world [32, 3]. Some methods rely on privileged state information only available in simulation [28, 30], require hand-designed skills [29], or are limited to manipulating a fixed set of object instances with known geometric shape [3, 32].

Figure 1: Manipulate-Anything is an automated method for robot manipulation in real-world environments. Unlike prior methods, it doesn’t require privileged state information or hand-designed skills, and it is not limited to manipulating a fixed number of object instances. It can guide a robot to accomplish a diverse set of unseen tasks, manipulating diverse objects. Furthermore, the generated data enables training behavior cloning policies that outperform training with human demonstrations.

We propose Manipulate-Anything, a scalable automated demonstration generation method for real-world robotic manipulation. Manipulate-Anything produces high-quality data, in large quantities (if needed), and can manipulate a diverse set of objects to perform a diverse set of tasks. When placed in a real-world environment and given a task (e.g., “open the top drawer” in Figure 2), Manipulate-Anything effectively leverages VLMs to guide a robotic arm to complete the task. Unlike prior methods, it doesn’t need privileged state information or hand-designed skills, and it is not limited to specific object instances. Not relying on privileged information makes Manipulate-Anything environment-agnostic, so it can easily generalize to the real world. Manipulate-Anything plans a sequence of sub-goals and generates actions to execute the sub-goals. It can verify whether the robot succeeded in the sub-goal using a verifier and re-plan from the current state if needed. This error recovery enables mistake identification, re-planning, and recovering from failure. It also injects recovery behavior into the collected demonstrations. We further enhance the VLMs’ capabilities by incorporating reasoning from multiple viewpoints, significantly improving performance.

We showcase the utility of Manipulate-Anything through two evaluation setups. First, we show that it can be prompted with a novel, never-before-seen task and complete it in a zero-shot manner. We quantitatively evaluate across $5$ real-world and $12$ RLBench [33] simulation tasks and demonstrate capabilities across many real-world everyday tasks (refer to supplementary). Our method significantly outperforms VoxPoser [3] in 9/12 simulation tasks for zero-shot evaluation. It also generalizes to tasks where VoxPoser completely fails because of its limitation to specific object instances. Furthermore, we demonstrate that our approach can solve real-world manipulation tasks in a zero-shot manner, achieving a task-averaged success rate of 36%. Second, we show that Manipulate-Anything can generate useful training data for a behavior cloning policy. We compare Manipulate-Anything generated data against ground truth human demonstrations as well as against data from VoxPoser [3] and Code-As-Policies [4]. Surprisingly, policies trained on our data outperform even those trained on human data on 5 out of 12 tasks and perform on par for 3 more. Meanwhile, the baselines are unable to generate training data for some of the tasks. Manipulate-Anything demonstrates the broad possibility of large-scale deployment of robots across unstructured real-world environments. It also highlights its utility as a training data generator, aiding in the crucial goal of scaling up robot demonstration data.

2 Related work

Manipulate-Anything enables scaling of robotic manipulation data using VLMs. As such, we review recent efforts in 1) scaling manipulation data, and 2) applications of VLMs in robotics.

Scaling manipulation data. When deploying vision and language-based control policies for real-world applications, a significant challenge revolves around acquiring data. Traditionally, a convenient avenue to collect such trajectories is through human annotations for action (i.e., through teleoperation) and language labeling [34, 35, 36]; however, this approach is difficult to scale. To address this limitation and achieve autonomous scalability, prior works employ vision-language models or procedurally generate language annotations in simulated environments [30, 37, 38]. For action labels, strategies range from random exploration to learned policies [39]. While human egocentric videos are relevant, they lack action labels and require cross-embodiment transfer [40]. Another strategy involves model-based policies, such as task and motion planning (TAMP) [41]. Our approach extends these methods by incorporating common-sense knowledge from large language models (LLMs) and vision-language models (VLMs), providing a framework that combines the strengths of VLMs, object pose prediction, and dynamic retry to synthesize demonstrations in simulated and real environments.

Language models for robotics. In the field of robotics, large language models have found diverse applications, including policy learning [42], task and motion planning [43, 44], log summarization [45], policy program synthesis [4], and optimization program generation [27]. Previous research has also explored the physical grounding capabilities of these models [3, 32], while ongoing work investigates their integration with task and motion planners to create expert demonstrations [30]. [36] attempted to collect extensive real-world interaction data, with short-horizon trajectories. [46, 47] proposed a key-point based visual prompting method for real-world manipulation, through predicting affordances and corresponding motions. Our work complements the existing line of works, by leveraging the high-level planning capabilities of language models, scene understanding capabilities of vision language models, and action sampling, to enable synthesis of robot trajectories, which include language, vision, and robot state, given arbitrary tasks and environments.

3 Manipulate-Anything

We propose Manipulate-Anything, a framework that solves everyday manipulation tasks conditioned on language. Under the hood, Manipulate-Anything leverages VLMs to decompose tasks into sub-goals, generates code for new skills or task-specific grasp poses, and verifies the success of each sub-goal (Figure 4). Because the framework is modular, Manipulate-Anything will continue to improve as the underlying VLMs improve.

3.1 Task plan generation

Manipulate-Anything takes as input any task described by a free-form language instruction, $\mathbf{T}$ (e.g., ‘open the top drawer’). Creating robot trajectories that adhere to $\mathbf{T}$ is challenging due to its potential complexity and ambiguity, requiring a nuanced understanding of the current environment state. Given $\mathbf{T}$ and an image of the scene, we first apply a VLM to identify task-relevant objects in the scene, appending them to a list. We then use a VLM to decompose the main task $\mathbf{T}$ into a series of discrete, smaller sub-goals, represented as $\mathbf{T}_{i}$, along with the corresponding verification conditions $v_{i}$, where $i$ ranges from 1 to $n$. For the above task, sub-goals include ‘grasp the drawer handle’ and ‘pull open the drawer’, and the verification conditions are ‘did the robot grasp the handle?’ and ‘is the drawer opened?’. This transforms the instruction $\mathbf{T}$ into a sequence of specific sub-goals $\{(\mathbf{T}_{1},v_{1}),(\mathbf{T}_{2},v_{2}),\dots,(\mathbf{T}_{n},v_{n})\}$. For each sub-goal, Manipulate-Anything generates desired actions (§ 3.2) and uses the corresponding verification condition to validate whether the generated actions result in the successful completion of the sub-goal (§ 3.3). This verification step allows Manipulate-Anything to recover from mistakes and attempt again in the case of failure.
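To make the plan representation concrete, the following is a minimal sketch of how the sequence of $(\mathbf{T}_{i}, v_{i})$ pairs might be held in code. The `SubGoal` class, the `decompose_task` helper, the prompt wording, and the `sub-goal | verification` reply format are all illustrative assumptions rather than the prompts or data structures used by Manipulate-Anything; the VLM is replaced by a stub that returns the drawer example from the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubGoal:
    instruction: str   # T_i, e.g. "grasp the drawer handle"
    verification: str  # v_i, e.g. "did the robot grasp the handle?"


def decompose_task(task: str, objects: List[str],
                   query_vlm: Callable[[str], str]) -> List[SubGoal]:
    """Ask a VLM (hypothetical callable) for sub-goals and parse the reply.

    Assumed reply format: one 'sub-goal | verification question' per line.
    """
    prompt = (
        f"Task: {task}\n"
        f"Task-relevant objects: {', '.join(objects)}\n"
        "List sub-goals as 'sub-goal | verification question', one per line."
    )
    plan = []
    for line in query_vlm(prompt).splitlines():
        if "|" not in line:
            continue  # skip anything that does not follow the assumed format
        goal, check = (part.strip() for part in line.split("|", 1))
        plan.append(SubGoal(goal, check))
    return plan


def fake_vlm(prompt: str) -> str:
    # Stand-in for a real VLM call; returns the example from the text.
    return ("grasp the drawer handle | did the robot grasp the handle?\n"
            "pull open the drawer | is the drawer opened?")


plan = decompose_task("open the top drawer", ["drawer", "handle"], fake_vlm)
for i, sub_goal in enumerate(plan, start=1):
    print(i, sub_goal.instruction, "->", sub_goal.verification)
```

Keeping each verification condition next to its sub-goal means the later verification step needs no extra lookup.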

3.2 Action generation module

Given a sub-goal, the desired output from the action generation module is a sequence of low-level actions, each represented as a 6-DoF end-effector pose. The actions can be categorized into two sets: agent-centric or object-centric. Agent-centric actions modify the agent’s state; e.g., they can move the robot’s end-effector from the current state (e.g., “rotate $90^{\circ}$”). We feed the VLM the current observation, together with in-context examples, and prompt it to write code that synthesizes the desired motion. Unlike prior methods that use only language models to generate code [4], our approach utilizes VLMs to understand and reason about object locations and the scene, which helps to ground the generation in the current state of the scene. This advantage is demonstrated in the ablation studies in the Appendix.

Object-centric actions require manipulating a certain object (e.g., “grasp a knife”). We use an object-agnostic grasp prediction model [48]. The grasp model generates all the possible 6-DoF grasping poses in the scene. These poses are not conditioned on the objects and could contain errors. From the RGB-D image of the current state, we extract a raw 3D point cloud. The point cloud is sent to the grasp model, which predicts 6-DoF grasp placements across the scene. We then filter the proposed candidate grasp poses using a VLM with in-context learning, conditioned on the given task (e.g., if the task is “grasp a knife”, the VLM will detect the handle of the knife). Lastly, we use the detected bounding box to filter and sample an ideal grasp pose.
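The bounding-box filtering step can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: grasp candidates are reduced to a camera-frame position and a confidence score (orientation is omitted), a pinhole camera model projects them into the image, and the highest-scoring candidate inside the box is chosen. The `Grasp` type, the intrinsics, and the "pick the top score" rule are all hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Grasp(NamedTuple):
    position: Tuple[float, float, float]  # camera-frame (x, y, z) in metres
    score: float                          # confidence from the grasp model


def project(point: Tuple[float, float, float],
            fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy


def select_grasp(grasps: List[Grasp],
                 bbox: Tuple[float, float, float, float],
                 fx: float, fy: float, cx: float, cy: float) -> Optional[Grasp]:
    """Keep grasps whose projection lies inside bbox and return the best one.

    bbox is (u_min, v_min, u_max, v_max) in pixels, e.g. a detected handle.
    """
    u_min, v_min, u_max, v_max = bbox
    inside = []
    for grasp in grasps:
        if grasp.position[2] <= 0:
            continue  # behind the camera, cannot be projected
        u, v = project(grasp.position, fx, fy, cx, cy)
        if u_min <= u <= u_max and v_min <= v <= v_max:
            inside.append(grasp)
    return max(inside, key=lambda g: g.score, default=None)


# Toy scene: three candidates, assumed intrinsics for a 256x256 image.
candidates = [
    Grasp((0.0, 0.0, 1.0), 0.6),  # projects to (128, 128)
    Grasp((0.5, 0.0, 1.0), 0.9),  # projects to (178, 128), outside the box
    Grasp((0.1, 0.1, 1.0), 0.8),  # projects to (138, 138)
]
best = select_grasp(candidates, bbox=(100, 100, 150, 150),
                    fx=100.0, fy=100.0, cx=128.0, cy=128.0)
print(best)
```

In the toy scene the highest-scoring candidate overall is rejected because it falls outside the detected region, which is the behaviour the task-conditioned filter is meant to provide.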

A single viewpoint might be insufficient to provide the VLM with enough information to perform the task (e.g., some views might be occluded by the robot arm). Therefore, for both agent-centric and object-centric action generation, we render multiple viewpoints of the scene and query VLMs to choose an ideal viewpoint given the sub-task. For example, if the task is to open a drawer, the view in which the handle of the drawer is visible would be preferred. After the best viewpoint is chosen, either the grasping poses are restricted to those visible in that viewpoint, or the code generation is conditioned on that image. After the action is generated, a simple motion planner can be used to move the robot to the desired pose, as detailed in Fig. 3.

3.3 Sub-goal verification

To ensure that each sub-goal $T_{i}$ is executed correctly, we introduce a VLM-based verifier. After all actions for a sub-goal are executed, we use the VLM to check whether the end state matches the verification condition $v_{i}$. Similar to the action generation module, we use multi-view VLM reasoning to find the optimal view, avoiding errors due to occlusion or ambiguity from a single viewpoint. If the verifier identifies failure, we re-attempt the action generation step from the current state. Otherwise, the next sub-goal $T_{i+1}$ is attempted. More details of the implementation are in the Appendix.
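The execute-verify-retry control flow described in this section can be sketched as a short loop. The `run_plan`, `act`, and `verify` names are hypothetical stand-ins for the action generation module and the VLM verifier; the default of 30 tries mirrors the retry budget stated in the implementation details, and everything else is an assumption made for illustration.

```python
from typing import Callable, List, Tuple


def run_plan(plan: List[Tuple[str, str]],
             act: Callable[[str], None],
             verify: Callable[[str], bool],
             max_tries: int = 30) -> bool:
    """Execute each (sub_goal, condition) pair, retrying on verifier failure.

    Returns True if every sub-goal is verified, False if any sub-goal
    exhausts its retry budget.
    """
    for sub_goal, condition in plan:
        for _ in range(max_tries):
            act(sub_goal)          # regenerate and execute actions from the current state
            if verify(condition):  # VLM-based check of the end state
                break              # move on to the next sub-goal
        else:
            return False           # budget exhausted without success
    return True


# Stubbed example: the second sub-goal fails once before succeeding.
attempts: List[str] = []
outcomes = iter([True, False, True])


def act(sub_goal: str) -> None:
    attempts.append(sub_goal)


def verify(condition: str) -> bool:
    return next(outcomes)


plan = [
    ("grasp the drawer handle", "did the robot grasp the handle?"),
    ("pull open the drawer", "is the drawer opened?"),
]
success = run_plan(plan, act, verify)
print(success, attempts)
```

Because failed attempts are still executed and recorded, a loop of this shape naturally leaves recovery behavior in the logged trajectory.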

4 Experiments

Our experiments are designed to address two questions: 1) Can Manipulate-Anything accurately solve a diverse set of tasks in a zero-shot manner? 2) Can data generated from Manipulate-Anything be used to train a robust policy?

Implementation details. We use both GPT-4V and Qwen-VL [49] as our VLMs. We use GPT-4V for task decomposition, action generation, and verification. We use Qwen-VL to detect and extract object information. To ensure zero-shot execution within a reasonable budget, we limit the number of action steps in each trajectory to $50$, and the verification module allows a maximum of $30$ tries to accomplish a sub-goal. For the task plan generation, we follow the prompting structure adapted from ProgPrompt [27]. All prompts input into the VLM are accompanied by few-shot demonstrations [50]. Additionally, we provide three manually curated primitive action code snippets as examples to prompt the VLM for new action code generation. Full prompts are included in the Appendix. We use four viewpoints $\mathbf{M}_{4}=[\text{front},\text{wrist},\text{left\_shoulder},\text{right\_shoulder}]$ for the simulation experiments, and re-render three viewpoints for the real-world experiments [51]. For better reasoning by the VLM, we use a resolution of $256\times 256$.

4.1 Zero-shot Performance in Simulation

We empirically study the zero-shot capability of Manipulate-Anything in solving 12 diverse tasks in simulation. Our simulation experiments are reported to ensure reproducibility and provide a benchmark for future methods.

Table 1: Task-averaged success rate % for zero-shot evaluation. Manipulate-Anything outperformed other baselines in 9 out of 12 simulation tasks from RLBench [33]. Each task was evaluated over 3 seeds to obtain the task-averaged success rate and standard deviations.

Method	Put_block	Play_jenga	Open_jar	Close_box	Open_box	Pickup_cup
VoxPoser [3]	70.7 $\pm 2.31$	0.00 $\pm 0.00$	0.00 $\pm 0.00$	0.00 $\pm 0.00$	0.00 $\pm 0.00$	26.7 $\pm 14.00$
CAP [4]	84.00 $\pm 16.00$	0.00 $\pm 0.00$	0.00 $\pm 0.00$	0.00 $\pm 0.00$	0.00 $\pm 0.00$	14.67 $\pm 4.62$
MA (Ours)	96.00 $\pm 4.00$	77.33 $\pm 6.11$	80.00 $\pm 4.00$	33.33 $\pm 12.86$	29.00 $\pm 10.07$	82.67 $\pm 14.04$
Method	Take_umbrella	Sort_mustard	Open_wine	Lamp_on	Put_knife	Pick_&_lift
VoxPoser [3]	33.33 $\pm 8.33$	96.0 $\pm 6.93$	8.00 $\pm 4.00$	57.3 $\pm 12.22$	92.00 $\pm 4.00$	96.00 $\pm 0.00$
CAP [4]	4.00 $\pm 4.00$	0.00 $\pm 0.00$	0.00 $\pm 0.00$	64.00 $\pm 6.93$	14.67 $\pm 8.33$	100.00 $\pm 0.00$
MA (Ours)	61.33 $\pm 20.13$	64.00 $\pm 6.93$	42.00 $\pm 4.00$	69.33 $\pm 6.11$	52.00 $\pm 10.58$	84.00 $\pm 6.93$

Environment and tasks. The simulation is set up in CoppeliaSim and interfaced through PyRep. All simulation experiments use a Franka Panda robot with a parallel gripper. Input observations are captured from four RGB-D cameras positioned around a tabletop setting. We use RLBench [33], a robot learning benchmark with diverse language-conditioned tasks and built-in success conditions. We sample $12$ tasks from RLBench, covering a diverse range of action primitives, task horizons, and object position perturbations. Each action can be represented as a waypoint, and the trajectories are computed and executed via a motion planner using the Open Motion Planning Library [52].

Baselines. We compare against two state-of-the-art zero-shot data generation approaches: Code-as-Policies (CAP) [4] and VoxPoser [3]. CAP uses language models to generate executable programs that call hand-crafted primitive actions. VoxPoser [3] builds a 3D voxel map of value functions for predicting waypoints. We provide both CAP and VoxPoser with ground truth simulation state information of the target object’s asset names or positions.

Results: Manipulate-Anything can generate successful trajectories for all $12$ tasks while VoxPoser and CAP cover only $8$ and $6$ tasks, respectively (Table 1). Without the privileged state information, the baselines would not succeed on any of the $12$ tasks. Manipulate-Anything outperforms the baselines in $9$ out of the $12$ tasks. The three tasks where our method achieves lower performance require fine-grained manipulation of objects; these are the hardest tasks without the privileged state information used by the baselines. VoxPoser fails in the tasks that require moving the arm beyond 4-DoF. Manipulate-Anything outperforms the strongest baseline, VoxPoser, by a task-averaged margin of up to $25\%$.
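The summary counts quoted above can be re-derived from the mean success rates in Table 1. The short script below is only a bookkeeping aid: it copies the table means (standard deviations omitted) into a dictionary, counts the tasks on which each method has a non-zero success rate, counts the tasks on which Manipulate-Anything beats both baselines, and computes the gap between the task-averaged rates of Manipulate-Anything and VoxPoser. The variable names are our own.

```python
# Mean success rates (%) from Table 1, ordered as (VoxPoser, CAP, MA).
TABLE_1 = {
    "put_block":     (70.7, 84.00, 96.00),
    "play_jenga":    (0.00, 0.00, 77.33),
    "open_jar":      (0.00, 0.00, 80.00),
    "close_box":     (0.00, 0.00, 33.33),
    "open_box":      (0.00, 0.00, 29.00),
    "pickup_cup":    (26.7, 14.67, 82.67),
    "take_umbrella": (33.33, 4.00, 61.33),
    "sort_mustard":  (96.0, 0.00, 64.00),
    "open_wine":     (8.00, 0.00, 42.00),
    "lamp_on":       (57.3, 64.00, 69.33),
    "put_knife":     (92.00, 14.67, 52.00),
    "pick_and_lift": (96.00, 100.00, 84.00),
}
METHODS = ("VoxPoser", "CAP", "MA")

# Number of tasks on which each method has a non-zero success rate.
covered = {
    name: sum(1 for row in TABLE_1.values() if row[i] > 0)
    for i, name in enumerate(METHODS)
}

# Tasks on which MA is strictly better than both baselines.
ma_wins = sum(1 for vox, cap, ma in TABLE_1.values() if ma > max(vox, cap))

# Task-averaged success rate per method, and the MA - VoxPoser gap.
averages = {
    name: sum(row[i] for row in TABLE_1.values()) / len(TABLE_1)
    for i, name in enumerate(METHODS)
}
gap = averages["MA"] - averages["VoxPoser"]

print(covered)         # {'VoxPoser': 8, 'CAP': 6, 'MA': 12}
print(ma_wins)         # 9
print(round(gap, 1))   # a little over 24 percentage points
```

The computed gap of roughly 24 percentage points is consistent with the "up to $25\%$" margin stated in the text.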

4.2 Behavior cloning with demonstrations from Manipulate-Anything

Next, we analyze the quality of the generated data by comparing the success rates of behavior cloning models trained with the data. Zero-shot methods like Manipulate-Anything are computationally expensive but hold the potential to generate useful training data. To evaluate the quality and effectiveness of the generated training data, we use the methods described in the previous section to generate data for each task. We also compare performance against a model trained on human-generated demonstrations across the $12$ tasks. We use the data to train behavior cloning policies.

Table 2: Behavior Cloning with different generated data. The behavior cloning policy trained on the data generated by Manipulate-Anything provides the best performance on 10 out of 12 tasks compared to the other autonomous data generation baselines. We report the Success Rate % for behavior cloning policies trained with data generated from VoxPoser [3] and Code as Policies [4] in comparison. Note that the RLBench [33] baseline uses human expert demonstrations and is considered an upper bound for behavior cloning.

Generated data	Put_block	Play_jenga	Open_jar	Close_box	Open_box	Pickup_cup
VoxPoser[3]	2.67 $\pm 2.31$	-	-	-	-	4.00 $\pm 4.00$
CAP[4]	6.67 $\pm 2.31$	-	-	-	-	14.67 $\pm 12.86$
MA (Ours)	85.33 $\pm 10.07$	81.33 $\pm 2.31$	21.33 $\pm 10.07$	42.67 $\pm 8.33$	30.67 $\pm 11.55$	54.00 $\pm 12.49$
RLBench[33]	20.00 $\pm 18.33$	81.33 $\pm 9.24$	58.67 $\pm 45.49$	68.00 $\pm 24.98$	14.67 $\pm 6.11$	54.67 $\pm 23.09$
Generated data	Take_umbrella	Sort_mustard	Open_wine	Lamp_on	Put_knife	Pick_&_lift
VoxPoser[3]	4.00 $\pm 4.00$	0.00 $\pm 0.00$	1.33 $\pm 2.31$	5.33 $\pm 4.62$	1.33 $\pm 2.31$	5.67 $\pm 1.64$
CAP[4]	13.33 $\pm 10.06$	-	-	8.00 $\pm 16.00$	9.33 $\pm 6.11$	46.67 $\pm 2.31$
MA (Ours)	84.00 $\pm 6.93$	53.33 $\pm 6.11$	86.67 $\pm 6.11$	89.33 $\pm 6.11$	8.00 $\pm 4.00$	33.33 $\pm 2.31$
RLBench[33]	58.67 $\pm 50.80$	53.33 $\pm 34.02$	86.67 $\pm 12.86$	84.00 $\pm 13.86$	30.67 $\pm 10.07$	62.67 $\pm 9.24$

Data generation details. We generate 10 successful demonstrations per task. We use the system’s success condition to filter for successful demonstrations. Each demonstration consists of a language instruction, RGB-D frames for the trajectory, and waypoints represented as 6-DoF gripper poses and states. For the tasks on which the baselines were unable to generate any successful demonstrations, we patched the missing training data with RLBench system-generated demonstrations.

Training and evaluation protocol. Using the generated demonstrations, we train a Perceiver-Actor (PerAct) model, which is a transformer-based robotic manipulation behavior cloning model [34]. The model expects tokenized voxel grids and language instructions as inputs and predicts discretized voxel grid 6 DoF poses and gripper states. For all the generated training datasets, we train a multi-task PerAct policy with a batch size of $4$ for $30$ k iterations. To ensure consistent evaluation, we generate one set of testing environments with RLBench. We evaluate the last checkpoint from each of the trained policies. Each policy is evaluated for $25$ episodes across each task using $3$ different seeds. We measure the success rate based on the simulation-defined success condition.
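As an illustration of how a "mean ± standard deviation" entry could be produced under this protocol, the sketch below converts per-seed success counts (out of $25$ episodes) into percentages and aggregates them over seeds. The per-seed counts are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption, since the text does not say which one is reported.

```python
from statistics import mean, stdev
from typing import List, Tuple

EPISODES_PER_SEED = 25  # from the evaluation protocol above


def success_rate_summary(successes_per_seed: List[int],
                         episodes: int = EPISODES_PER_SEED) -> Tuple[float, float]:
    """Return (mean, sample std) of the per-seed success rates in percent."""
    rates = [100.0 * count / episodes for count in successes_per_seed]
    return mean(rates), stdev(rates)


# Hypothetical counts for one task over 3 seeds: 18, 18 and 17 successes.
avg, std = success_rate_summary([18, 18, 17])
print(f"{avg:.2f} ± {std:.2f}")  # 70.67 ± 2.31
```

With only three seeds, the sample standard deviation is noticeably larger than the population one, so stating which estimator is used matters when comparing tables across papers.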

Results: Policies trained using Manipulate-Anything data perform similarly to policies trained using human demonstrations ($p=0.973$) (Table 2). Training on either Manipulate-Anything or on human demonstrations results in a performance difference of a mere $0.27$% across all tasks. Furthermore, models trained on data from the baselines exhibit statistically lower performance ($p\leq 0.01$ for both VoxPoser and CAP). One of the main factors potentially contributing to the differences in performance could be that Manipulate-Anything generates diverse expert trajectories that are preferable to humans. This can be seen in Fig. 5, which shows the action distribution of the data generated by different methods for the same given tasks. Additionally, our generated data recorded the lowest Chamfer Distance (CD), 0.056, to the human-generated demonstration data. We also observed that the policy trained on MA data achieves a lower standard deviation of $3.39$ across all tasks compared to the zero-shot performance of $8.48$. This suggests the benefits of training on generated data instead of relying solely on zero-shot deployment.
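For readers unfamiliar with the Chamfer Distance, a minimal pure-Python sketch of one common symmetric variant is given below. The text does not specify which variant was used (mean versus sum, squared versus unsquared distances) or exactly which points were compared, so this is an assumption for illustration only: each point set is taken to be a list of 3D action positions, and the distance is the sum of the two mean nearest-neighbour distances.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty point sets.

    For every point in one set, find the Euclidean distance to its nearest
    neighbour in the other set, average these, and add the two directions.
    """
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy example with two small sets of action positions (made-up values).
set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
cd = chamfer_distance(set_a, set_b)
print(cd)  # 0.5
```

A smaller value means the two action distributions occupy more similar regions of space; identical sets give a distance of zero.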

4.3 Real-world experiments

Finally, we evaluate Manipulate-Anything in the real world. We also automatically generate real-world demonstrations for training PerAct.

Environment and tasks. We employ a Franka Panda manipulator equipped with a parallel gripper. We use a front-facing Kinect 2 RGB-D camera. To generate multi-view inputs for the Manipulate-Anything framework, we re-render virtual viewpoints from the generated point cloud, similar to prior work [51]. We select 5 representative real world tasks: open_jar, sort_objects, correct_dices, open_drawer, and on_lamp, all conditioned on language instructions. We evaluate each task for $10$ episodes, with varying object poses across $3$ trials of evaluation.

Data generation details. We used Manipulate-Anything to generate $6$ demonstrations for each task and manually perform scene resets when failures occur. We train a similar multi-task PerAct for $120$ k iterations and evaluate the trained policies in a manner similar to the zero-shot experiments.

Table 3: Real-world Results. The model trained on the data generated by our model in the real world (no expert in the loop) demonstrates on par results with the model trained on human expert collected data. We present a comparison of success rates for task completion in a zero-shot manner (Code as Policies [4] and Manipulate-Anything), and using trained policies from Manipulate-Anything data and human expert data.

Method	Open_drawer	Sort_object	On_lamp	Open_jar	Correct_dice
CAP (0-shot)	0.00 $\pm 0.00$	13.33 $\pm 5.77$	0.00 $\pm 0.00$	6.67 $\pm 5.77$	6.67 $\pm 5.77$
MA (0-shot)	36.67 $\pm 5.77$	60.00 $\pm 10.00$	26.67 $\pm 11.55$	40.00 $\pm 10.00$	53.33 $\pm 5.77$
PerAct (MA data)	50.00 $\pm 0.00$	33.33 $\pm 5.77$	46.67 $\pm 5.77$	56.67 $\pm 5.77$	60.00 $\pm 0.00$
PerAct (Human data)	53.33 $\pm 11.55$	36.67 $\pm 5.77$	60.00 $\pm 0.00$	76.67 $\pm 5.77$	80.00 $\pm 10.00$

Results: Manipulate-Anything is able to generate successful demonstrations for each of the $5$ real-world tasks. Even for the worst-performing task, Manipulate-Anything achieves a success rate of more than $25\%$. Averaged over tasks, our approach outperforms CAP by 38%. Consistent with the simulation results, training with the data generated by Manipulate-Anything produces a more robust policy than zero-shot execution: in 4 out of 5 tasks, the trained policies perform better than the zero-shot approach. The policy underperforms on the sort_object task because it requires longer-horizon memory, a known limitation pointed out in PerAct [34].

4.4 Ablations

For effective real-world deployment of Manipulate-Anything, it’s crucial that the collected data supports scaling of robotics transformers and offers diverse skills and interacted objects. We conducted an ablation study to evaluate the quality of Manipulate-Anything-generated data for scaling and its generalization to language instruction changes. For scaling, we generated behavior cloning data, ranging from 1 to 100 training demonstrations from RLBench and Manipulate-Anything for a single task, and trained a PerAct policy. For generalization, we varied the sort_mustard task with different language instructions and target objects. We compared our approach to VoxPoser to assess robustness to object and language instruction changes. Further implementation details are in the supplementary materials.

Results: Our scaling experiments demonstrate that generating more training data via Manipulate-Anything improves PerAct policy performance (Fig. 6). The data from our approach shows a better rate of change, with a slope of 0.503 for a linear fit, compared to 0.197 for RLBench-generated data. Additionally, Manipulate-Anything data is more generalizable and robust to language instruction changes, outperforming VoxPoser in task success across language and object variations. Detailed results are in the appendix.

5 Discussion

Limitations. Manipulate-Anything relies on the availability of VLMs. While this creates a dependence on foundation models, we believe the rise of open VLMs will address this issue soon.

Future work. With the enhancements of large foundation models, Manipulate-Anything, due to its modularity, will continue to grow and scale up to more complex tasks.

Conclusion. Manipulate-Anything is a scalable, environment-agnostic approach for generating zero-shot demonstrations for robotic tasks without the use of privileged environment information. Manipulate-Anything uses VLMs for high-level planning and scene understanding and is capable of error recovery. This enables high-quality data generation for behavior cloning that can achieve better performance than using human data.

6 Acknowledgement

Jiafei Duan is supported by the Agency for Science, Technology and Research (A*STAR) National Science Fellowship. Wilbert Pumacay is supported by grant 234-2015-FONDECYT from Cienciactiva of the National Council for Science, Technology and Technological Innovation (CONCYTEC-PERU). This project is partially supported by Amazon Science. We would also like to thank Winson Han from AI2 for helping with the figure and icon design.

References

Brohan et al. [2022] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
Collaboration [2023] O. X.-E. Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023.
Huang et al. [2023] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
Liang et al. [2023] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023.
Kaplan et al. [2020] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Cherti et al. [2023] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
Hoffmann et al. [2022] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Schuhmann et al. [2022] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
Ramanujan et al. [2024] V. Ramanujan, T. Nguyen, S. Oh, A. Farhadi, and L. Schmidt. On the connection between pre-training data diversity and fine-tuning robustness. Advances in Neural Information Processing Systems, 36, 2024.
Udandarao et al. [2024] V. Udandarao, A. Prabhu, A. Ghosh, Y. Sharma, P. H. Torr, A. Bibi, S. Albanie, and M. Bethge. No "zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance. arXiv preprint arXiv:2404.04125, 2024.
Gadre et al. [2024] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.
Zhou et al. [2024] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024.
Nguyen et al. [2024] T. Nguyen, S. Y. Gadre, G. Ilharco, S. Oh, and L. Schmidt. Improving multimodal datasets with image captioning. Advances in Neural Information Processing Systems, 36, 2024.
Nguyen et al. [2022] T. Nguyen, G. Ilharco, M. Wortsman, S. Oh, and L. Schmidt. Quality not quantity: On the interaction between dataset design and robustness of clip. Advances in Neural Information Processing Systems, 35:21455–21469, 2022.
Lee et al. [2021] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021.
Fang et al. [2022] A. Fang, G. Ilharco, M. Wortsman, Y. Wan, V. Shankar, A. Dave, and L. Schmidt. Data determines distributional robustness in contrastive language image pre-training (clip). In International Conference on Machine Learning, pages 6216–6234. PMLR, 2022.
Gururangan et al. [2020] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964, 2020.
Tian et al. [2023] Y. Tian, L. Fan, K. Chen, D. Katabi, D. Krishnan, and P. Isola. Learning vision from models rivals learning vision from data. arXiv preprint arXiv:2312.17742, 2023.
Xu et al. [2023] H. Xu, S. Xie, X. E. Tan, P.-Y. Huang, R. Howes, V. Sharma, S.-W. Li, G. Ghosh, L. Zettlemoyer, and C. Feichtenhofer. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023.
Chen et al. [2023] Z. Chen, A. H. Cano, A. Romanou, A. Bonnet, K. Matoba, F. Salvi, M. Pagliardini, S. Fan, A. Köpf, A. Mohtashami, et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023.
Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
Lin et al. [2014] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
Schuhmann et al. [2022] C. Schuhmann, R. Beaumont, R. Vencu, C. W. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. R. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=M3Y74vmsMcY.
Chi et al. [2024] C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024.
Wang et al. [2024] C. Wang, H. Shi, W. Wang, R. Zhang, L. Fei-Fei, and C. K. Liu. Dexcap: Scalable and portable mocap data collection system for dexterous manipulation. arXiv preprint arXiv:2403.07788, 2024.
Duan et al. [2023] J. Duan, Y. R. Wang, M. Shridhar, D. Fox, and R. Krishna. Ar2-d2: Training a robot without a robot. arXiv preprint arXiv:2306.13818, 2023.
Singh et al. [2023] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530. IEEE, 2023.
Wang et al. [2023a] L. Wang, Y. Ling, Z. Yuan, M. Shridhar, C. Bao, Y. Qin, B. Wang, H. Xu, and X. Wang. Gensim: Generating robotic simulation tasks via large language models. arXiv preprint arXiv:2310.01361, 2023a.
Wang et al. [2023b] Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Z. Erickson, D. Held, and C. Gan. Genbot: Generative simulation empowers automated robotic skill learning at scale. arXiv, 2023b.
Ha et al. [2023] H. Ha, P. Florence, and S. Song. Scaling up and distilling down: Language-guided robot skill acquisition. arXiv preprint arXiv:2307.14535, 2023.
Nasiriany et al. [2024] S. Nasiriany, F. Xia, W. Yu, T. Xiao, J. Liang, I. Dasgupta, A. Xie, D. Driess, A. Wahid, Z. Xu, Q. Vuong, T. Zhang, T.-W. E. Lee, K.-H. Lee, P. Xu, S. Kirmani, Y. Zhu, A. Zeng, K. Hausman, N. Heess, C. Finn, S. Levine, and B. Ichter. Pivot: Iterative visual prompting elicits actionable knowledge for vlms, 2024.
Huang et al. [2024] H. Huang, F. Lin, Y. Hu, S. Wang, and Y. Gao. Copa: General robotic manipulation through spatial constraints of parts with foundation models. arXiv preprint arXiv:2403.08248, 2024.
James et al. [2020] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.
Shridhar et al. [2023] M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023.
Shridhar et al. [2022] M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning, pages 894–906. PMLR, 2022.
Brohan et al. [2023] A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, et al. Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning, pages 287–318. PMLR, 2023.
Ehsani et al. [2024] K. Ehsani, T. Gupta, R. Hendrix, J. Salvador, L. Weihs, K.-H. Zeng, K. P. Singh, Y. Kim, W. Han, A. Herrasti, et al. Imitating shortest paths in simulation enables effective navigation and manipulation in the real world. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
Dasgupta et al. [2021] A. Dasgupta, J. Duan, M. H. Ang Jr, Y. Lin, S.-h. Wang, R. Baillargeon, and C. Tan. A benchmark for modeling violation-of-expectation in physical reasoning across event categories. arXiv preprint arXiv:2111.08826, 2021.
Wang et al. [2023] Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation. arXiv preprint arXiv:2311.01455, 2023.
Grauman et al. [2022] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
Garrett et al. [2021] C. R. Garrett, R. Chitnis, R. Holladay, B. Kim, T. Silver, L. P. Kaelbling, and T. Lozano-Pérez. Integrated task and motion planning. Annual review of control, robotics, and autonomous systems, 4:265–293, 2021.
Xie et al. [2023] T. Xie, S. Zhao, C. H. Wu, Y. Liu, Q. Luo, V. Zhong, Y. Yang, and T. Yu. Text2reward: Automated dense reward function generation for reinforcement learning. arXiv preprint arXiv:2309.11489, 2023.
Lin et al. [2023] K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg. Text2motion: From natural language instructions to feasible plans, 2023.
Huang et al. [2022] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
Liu et al. [2023] Z. Liu, A. Bahety, and S. Song. Reflect: Summarizing robot experiences for failure explanation and correction, 2023.
Liu et al. [2024] F. Liu, K. Fang, P. Abbeel, and S. Levine. Moka: Open-vocabulary robotic manipulation through mark-based visual prompting. arXiv preprint arXiv:2403.03174, 2024.
Yuan et al. [2024] W. Yuan, J. Duan, V. Blukis, W. Pumacay, R. Krishna, A. Murali, A. Mousavian, and D. Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024.
Yuan et al. [2023] W. Yuan, A. Murali, A. Mousavian, and D. Fox. M2t2: Multi-task masked transformer for object-centric pick and place, 2023.
Bai et al. [2023] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Goyal et al. [2023] A. Goyal, J. Xu, Y. Guo, V. Blukis, Y.-W. Chao, and D. Fox. Rvt: Robotic view transformer for 3d object manipulation. In Conference on Robot Learning, pages 694–710. PMLR, 2023.
Sucan et al. [2012] I. A. Sucan, M. Moll, and L. E. Kavraki. The open motion planning library. IEEE Robotics & Automation Magazine, 19(4):72–82, 2012.