Modeling the Real World with High-Density Visual Particle Dynamics
Abstract
We present High-Density Visual Particle Dynamics (HD-VPD), a learned world model that can emulate the physical dynamics of real scenes by processing massive latent point clouds containing 100K+ particles. To enable efficiency at this scale, we introduce a novel family of Point Cloud Transformers (PCTs) called Interlacers, which leverage intertwined linear-attention Performer layers and graph-based neighbor attention layers. We demonstrate the capabilities of HD-VPD by modeling the dynamics of high degree-of-freedom bi-manual robots with two RGB-D cameras. Compared to the previous graph neural network approach, our Interlacer dynamics is twice as fast with the same prediction quality, and can achieve higher quality using as many particles. We illustrate how HD-VPD can evaluate motion plan quality with robotic box pushing and can grasping tasks. See videos and particle dynamics rendered by HD-VPD at https://sites.google.com/view/hd-vpd.
Keywords: point clouds, particle dynamics, world models for control, Performers
![Refer to caption](x1.png)
1 Introduction
Physical simulators are the linchpin of modern robotics, enabling cheap data collection, safe verification of control algorithms, and real-time control via planning. Traditional analytic simulators [1, 2, 3, 4, 5] are fast and convenient to use, but lack the ability to precisely match the complex objects and dynamics of real-world scenes. Learned dynamics models make fewer assumptions on the form of objects and their interactions, but typically come with onerous constraints on their training data, e.g., requiring information such as 3D meshes and poses for all objects or per-object segmentation masks [6, 7, 8].
To address this problem, new classes of learned dynamics models have been proposed that can train directly on multi-view RGB-D observations, with no object-centrality in their data requirements or representations. A state-of-the-art approach of this flavor is Visual Particle Dynamics (VPD) [9], which represents scenes as a collection of 3D particles whose interaction dynamics is governed by graph neural networks (GNNs) [10, 11] and rendered to images with a conditional NeRF [12]. VPD models are trained end-to-end with a video prediction loss, support 3D state editing and multi-material dynamics modeling, and are data-efficient enough to model simple dynamic scenes with training trajectories. However, VPD has only been applied to simple simulated scenes under passive dynamics (without actuation), and it is unable to train with more than 30K particles due to memory and speed limitations. For applications in robotics, VPD needs two extensions: (1) to take robot actions into account, and (2) to be able to model environments with a much higher level of detail. This is the focus of this paper.
We propose a High-Density Visual Particle Dynamics (HD-VPD) world model which can train on robot interactions in real scenes and model their dynamics with 100K+ particles. To reach this scale, we propose a new class of Point Cloud Transformers (PCTs) [13] called Interlacers, which intertwine linear-attention Performer [14] layers and local neighborhood attention layers. We demonstrate the capabilities of HD-VPD by modeling the dynamics of bi-manual Kuka robots with multi-view depth input data. We find that Interlacer dynamics models match the GNN’s prediction quality in half the time using matched point cloud densities, and that they can exceed the best GNN’s quality by leveraging high-density point clouds with as many particles as the GNN can fit in memory. We complement these results with illustrative experiments using HD-VPD for downstream applications such as planning in manipulation, leveraging HD-VPD’s ability to capture a scene from a single observation and define goals in 3D space. See videos and particle dynamics predictions from our model at https://sites.google.com/view/hd-vpd.
Our main contributions are as follows:
1. We present High-Density Visual Particle Dynamics (HD-VPD) world models that train end-to-end on real robotics data (Section 2). These models take RGB-D images and robot kinematic skeletons representing actions as inputs, and they predict future 3D particle states and rendered images.
2. To enable HD-VPD to operate on large, detailed scenes, we propose a new class of Transformers for point clouds called Interlacers, which scale linearly (rather than quadratically) with point cloud size (Section 3). Interlacers apply Performer-PCT layers [14, 15] combined with memory-efficient GNN-inspired local neighborhood attention layers, enabling efficient global point-to-point attention (with the former) while maintaining high-fidelity local geometric detail (with the latter).
3. We show that HD-VPD can deliver highly realistic action-conditioned video predictions on complex scenes with bi-manual robots interacting with various objects (Section 4.2). We find that the Interlacer architecture enables fast, memory-efficient, and high-fidelity video prediction with HD-VPD. Whereas typical NeRF models require tens or hundreds of camera views for training, HD-VPD uses only two, making real-world data collection straightforward. We use these models to illustrate potential downstream applications of HD-VPD in robotics (Section 4.3), applying HD-VPD for robotic bi-manual control, where the HD-VPD world model is able to predict the success or failure of candidate plans.
2 HD-VPD: High-density visual particle dynamics world models
HD-VPD is a 3D dynamics model disguised as a video prediction model. It takes RGB-D images and robot actions as inputs and predicts images of the future, just like a video world model. Under the hood, though, it represents both the current state of the world and the actions which operate on it as particles in a shared 3D space. To make a prediction several timesteps into the future, HD-VPD goes through three steps, shown in Figure 2:
1. Encode input images from all views and timesteps into 3D particles that represent the state of the scene.
2. Predict the dynamics of the scene. Combine the particles representing the first robot action with those representing the scene state and predict the change in the state. Use this to update the state. Then repeat this for the next robot action, and so forth.
3. Render the particles representing the predicted state of the scene into an image.
In this way, HD-VPD can make predictions many steps into the future while remaining in particle space, only rendering to pixel space as needed for training or visualization. The rest of this section describes the HD-VPD model, architectural choices, and training in more detail.
![Refer to caption](x2.png)
Encoding HD-VPD receives RGB-D images from multiple cameras (in our experiments, 2) at timesteps for an input window of length ; throughout this work we use . Each RGB image is first encoded into a feature image of per-pixel feature vectors using a U-Net [16]. Using known camera intrinsics and extrinsics, combined with the depth channel of the RGB-D input, each of these per-pixel features can be unprojected into the global coordinate frame, forming a point cloud where each point corresponds to an input pixel. Each point is associated with the predicted feature vector from its corresponding pixel to form a particle represented as a (location, feature) tuple . The particles from all cameras are merged together within one timestep before being subsampled uniformly at random to a fixed total number of particles .
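The unprojection and subsampling steps can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera model, depth measured along the camera z-axis, and a 4x4 camera-to-world extrinsics matrix; the function names `unproject` and `merge_and_subsample` are hypothetical and are not taken from the HD-VPD implementation.

```python
import numpy as np

def unproject(depth, feats, K, cam_to_world):
    """Lift per-pixel features to 3D particles in the world frame.

    depth: (H, W) depth along the camera z-axis (assumed convention).
    feats: (H, W, C) per-pixel feature vectors, e.g. from a U-Net.
    K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) extrinsics.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel column, pixel row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T                  # back-project to the z = 1 plane
    pts_cam = rays * depth.reshape(-1, 1)            # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]      # move into the global frame
    return pts_world, feats.reshape(H * W, -1)

def merge_and_subsample(clouds, num_particles, rng):
    """Merge (xyz, feature) clouds from all cameras, then subsample uniformly."""
    xyz = np.concatenate([c[0] for c in clouds], axis=0)
    f = np.concatenate([c[1] for c in clouds], axis=0)
    idx = rng.choice(xyz.shape[0], size=num_particles, replace=False)
    return xyz[idx], f[idx]
```

Each returned row pairs a 3D location with the feature vector of the pixel it came from, which matches the (location, feature) particle form described above.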
Action representation Unlike VPD, HD-VPD includes a representation of actions, allowing it to model robotic scenes and be used for planning and other forms of controllable generation. We represent actions as a set of kinematic particles describing the motion of the robot across multiple timesteps: previous steps and , and next step which represents where the robot plans to move. The particles from each timestep are located at the joints, fingertips, and the base of the grippers of the two arms. Each kinematic particle is associated with a feature vector consisting of two concatenated one-hot vectors: one indicating which robot joint it is, and one indicating which timestep it came from. By representing actions where they occur in 3D space, HD-VPD can learn to associate them directly to the scene particles which they affect. More details and an example are in Appendix C.
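As an illustration of this feature layout, the sketch below builds kinematic particles from an array of robot keypoint positions. The array shapes and the name `kinematic_particles` are assumptions made for the example; the actual keypoint set is described in Appendix C.

```python
import numpy as np

def kinematic_particles(joint_positions):
    """Build kinematic particles from robot keypoints.

    joint_positions: (T, J, 3) keypoint locations for T timesteps.
    Returns (T*J, 3) locations and (T*J, J+T) features, where each feature is
    a one-hot over keypoints concatenated with a one-hot over timesteps.
    """
    T, J, _ = joint_positions.shape
    joint_id = np.tile(np.arange(J), T)      # 0..J-1 repeated for every timestep
    step_id = np.repeat(np.arange(T), J)     # 0,0,...,1,1,...
    feats = np.concatenate([np.eye(J)[joint_id], np.eye(T)[step_id]], axis=1)
    return joint_positions.reshape(T * J, 3), feats
```

Because the particle locations live in the same 3D frame as the scene particles, the two sets can simply be concatenated before being passed to the dynamics model.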
Dynamics We employ the Interlacer architecture, described in detail in Section 3, for the dynamics. To predict the scene state at time , the dynamics model takes as inputs the scene particles from the encoder corresponding to timesteps and and the kinematic particles corresponding to timesteps , , and . The Interlacer encodes each timestep’s scene particles and the full set of kinematic particles separately before combining them in one large trunk. Predictions are made in the form for each particle from timestep . The prediction for timestep can then be constructed as .
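The delta-based update and its repeated application can be sketched as follows. This is a simplified illustration that assumes a two-timestep input window; `dynamics_fn` is a hypothetical stand-in for the learned Interlacer, and each element of `actions` stands for the kinematic particles of one step.

```python
import numpy as np

def rollout(xyz_prev, f_prev, xyz_cur, f_cur, actions, dynamics_fn):
    """Autoregressively roll particles forward, one step per action.

    dynamics_fn(prev, cur, action) -> (delta_xyz, delta_f) is hypothetical and
    stands in for the learned dynamics model.
    """
    states = []
    for action in actions:
        d_xyz, d_f = dynamics_fn((xyz_prev, f_prev), (xyz_cur, f_cur), action)
        xyz_next, f_next = xyz_cur + d_xyz, f_cur + d_f   # apply predicted deltas
        states.append((xyz_next, f_next))
        xyz_prev, f_prev = xyz_cur, f_cur                 # slide the input window
        xyz_cur, f_cur = xyz_next, f_next
    return states
```

The rollout stays entirely in particle space; any of the returned states can be handed to a renderer when an image is needed.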
Rendering HD-VPD renders images with a ray-based renderer conditioned on a point cloud, similar to Point-NeRF [17]. This involves casting rays through the scene and, at each sampled location along the ray, finding a set of neighboring particles. Summary statistics describing these particles are computed and then provided as input to a NeRF MLP, which predicts the color and density of that location in the scene. These predictions are composited along the ray to produce a rendered pixel. For more details, refer to [9].
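The final compositing step can be sketched with the standard NeRF volume-rendering quadrature [12]. The neighbor lookup and the NeRF MLP are omitted here; the sketch assumes the per-sample colors and densities have already been predicted, and `composite_ray` is a name chosen for illustration.

```python
import numpy as np

def composite_ray(colors, densities, deltas):
    """Composite per-sample predictions along one ray into a pixel color.

    colors: (S, 3), densities: (S,), deltas: (S,) spacing between samples.
    """
    alpha = 1.0 - np.exp(-densities * deltas)            # opacity of each sample
    # Transmittance: fraction of light that reaches each sample unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return weights @ colors, weights
```

Samples behind an opaque region receive near-zero weight, so the rendered pixel is dominated by the first dense location the ray encounters.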
Training During training, we encode a set of input RGB-D images into point clouds, then recursively apply the dynamics model with actions from time to make predictions multiple timesteps into the future. For supervision we sample a small set of rays to render at each timestep and compute the loss between the predicted and observed RGB values. Unlike typical NeRF models, which require tens or hundreds of views for training, we train HD-VPD with data from just two cameras at each timestep. Both views are provided as input to the model, and the training loss is computed on pixels sampled from both views.
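A rough sketch of the ray-sampled supervision follows. It assumes pixels are drawn uniformly across all views and uses a mean squared error on RGB values; both choices, and the helper names, are assumptions made for illustration rather than details confirmed by the text.

```python
import numpy as np

def sample_pixels(num_views, height, width, num_rays, rng):
    """Uniformly sample (view, row, col) index arrays across all camera views."""
    flat = rng.choice(num_views * height * width, size=num_rays, replace=False)
    return np.unravel_index(flat, (num_views, height, width))

def rgb_loss(pred_rgb, images, pixel_idx):
    """Mean squared error on the sampled pixels (squared error is an assumption).

    pred_rgb: (num_rays, 3) rendered colors; images: (V, H, W, 3) observations.
    """
    target = images[pixel_idx]                 # (num_rays, 3)
    return float(np.mean((pred_rgb - target) ** 2))
```

Rendering only the sampled rays, rather than full images, keeps the per-step supervision cost independent of image resolution.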
3 Interlacers: when linear-attention Transformers and GNNs meet
![Refer to caption](x3.png)
To make a prediction for time , the Interlacer takes as input the scene state from each timestep in a given window, where each scene state ( for the window length ) consists of the set of the 3D positions of a given set of particles in time and their corresponding feature vectors at that time: (for some ). For modeling robotics data, the Interlacer also receives kinematic particles at timesteps representing the action the robot will take.
The Interlacer consists of two main types of layers: (1) a linear-attention Performer-PCT layer [14, 15] and (2) a graph-based feature-agglomeration layer that leverages structural inductive priors encoded by local neighborhoods in graphs, which we refer to as the Neighbor-Attender. The input point cloud for each timestep is first processed by the Neighbor-Attender modules. Their outputs are then processed by Performer-PCT modules, which produce versions of their input point clouds with updated features. These feature point clouds from each timestep are then merged together into one large point cloud along with the robot’s skeleton points from all times. The Interlacer then applies one more Neighbor-Attender layer, followed by a final block of Performer-PCT layers. This model predicts the change in location and features for each particle in the last input timestep . These deltas are applied to to construct a particle prediction at time , which is fed back into the dynamics to predict another step forward or into the renderer for image generation.
The Performer-PCT [14, 15] is described in detail in Appendix A. Next, we discuss Neighbor-Attender, a novel mechanism to incorporate geometry into attention.
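To give some intuition for why linear attention scales to large point clouds, the sketch below computes kernelized attention in two ways: by reordering the matrix products so that no N x N matrix is formed, and by the explicit quadratic route. The ReLU-of-random-projection feature map is a simple stand-in chosen for the example and is an assumption; it is not necessarily the feature map used in the Performer-PCT.

```python
import numpy as np

def feature_map(x, proj):
    """Positive feature map; a ReLU random-projection stand-in (assumption)."""
    return np.maximum(x @ proj, 0.0) + 1e-6

def linear_attention(Q, K, V, proj):
    """Attention in O(N) time and memory: the N x N matrix is never formed."""
    Qp, Kp = feature_map(Q, proj), feature_map(K, proj)   # (N, m) each
    kv = Kp.T @ V                   # (m, d): summary of all keys and values
    norm = Qp @ Kp.sum(axis=0)      # (N,): row sums of the implicit matrix
    return (Qp @ kv) / norm[:, None]

def quadratic_reference(Q, K, V, proj):
    """Same quantity computed the O(N^2) way, for checking only."""
    A = feature_map(Q, proj) @ feature_map(K, proj).T     # (N, N)
    return (A / A.sum(axis=1, keepdims=True)) @ V
```

Both functions return the same values up to floating-point error; only the order of the matrix multiplications, and therefore the cost, differs.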
![Refer to caption](extracted/5697885/figures/GRAFA.png)
3.1 Incorporating geometry into attention: the Neighbor-Attender layer
The Neighbor-Attender, illustrated in Figure 4, is designed to provide each particle with information about the geometry and features of its immediate neighborhood as efficiently as possible. A simple approach might find the nearest neighbors of each particle and extract a summary of those neighbors’ features and relative positions, but this would involve a forward pass on particle-neighbor pairs. Inspired by RandLA-Net [18], Neighbor-Attender instead computes such neighborhood features only on a small, uniformly-sampled subset of particles we call anchor particles, then uses the anchor particles to update the rest of the particles. This results in a bottleneck step of only pairs for a subsampling rate , allowing us to control memory consumption at will. The computations of the Neighbor-Attender layer consist of six steps that we explain in detail below; steps 1-4 aggregate neighborhood features onto the anchor particles, then steps 5-6 use those to update the rest.
1. Choosing anchor particles: Sample uniformly at random particles from the input set of particles. We refer to them as anchor particles.
2. Computing neighbors of anchors: For each anchor particle , compute the set of its nearest neighbors from the entire -element set.
3. Computing attention-vectors: For each anchor particle and its neighbor , compute the relative position feature vector defined as: and concatenate it with the feature vector . We refer to the resulting vector as the edge feature .
4. Updating feature vectors for all anchor particles: For each anchor particle , compute its new feature vector as the weighted sum of its edge features using a learnable MLP module:
(1)
5. Finding closest anchors for all the particles: For each particle in the original -element set, find its closest anchor particle .
6. Computing new feature vectors for all the points: For each particle in the original -element set, compute its new feature vector .
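The six steps can be sketched in NumPy as below. Several details are assumptions made only so the example runs: the relative position feature is taken to be the plain offset between neighbor and anchor, the learnable MLP is replaced by a single score vector `w_score` with softmax weights, and step 6 concatenates each particle's feature with that of its closest anchor. The brute-force distance matrix is also for clarity only; a practical implementation would use a spatial index.

```python
import numpy as np

def neighbor_attender(xyz, feats, num_anchors, k, w_score, rng):
    """xyz: (N, 3), feats: (N, C), w_score: (3 + C,) stand-in for the MLP."""
    N = xyz.shape[0]
    # 1. Choose anchor particles uniformly at random.
    anchor_idx = rng.choice(N, size=num_anchors, replace=False)
    a_xyz = xyz[anchor_idx]                                     # (M, 3)
    # 2. k nearest neighbors of each anchor among all N particles (brute force).
    d2 = ((a_xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)   # (M, N)
    nbr = np.argsort(d2, axis=1)[:, :k]                         # (M, k)
    # 3. Edge features: relative position (assumed: offset) + neighbor feature.
    rel = xyz[nbr] - a_xyz[:, None, :]                          # (M, k, 3)
    edges = np.concatenate([rel, feats[nbr]], axis=-1)          # (M, k, 3 + C)
    # 4. Weighted sum of edge features; softmax scoring is an assumption.
    scores = edges @ w_score                                    # (M, k)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    a_feats = (w[..., None] * edges).sum(axis=1)                # (M, 3 + C)
    # 5. Closest anchor for every particle in the original set.
    closest = np.argmin(d2, axis=0)                             # (N,)
    # 6. Update all particles from their closest anchor (concatenation assumed).
    return np.concatenate([feats, a_feats[closest]], axis=-1)   # (N, 2C + 3)
```

Only the anchors pay the cost of neighborhood aggregation, so the number of particle-neighbor pairs processed in steps 2 to 4 is `num_anchors * k` rather than `N * k`.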
4 Experiments
4.1 Datasets and HD-VPD instantiations
Hardware and dataset We train HD-VPD models on a dataset of bi-manual Kuka robots interacting with a variety of objects and tasks. The dataset consists of episodes collected from 5 different robots over several months. The robots (details in Appendix B, [19]) are equipped with an overhead RealSense and left arm wrist RealSense cameras capable of providing RGB-D images. The dataset is an uneven distribution of approximately 60 tasks. We break the episodes into snippets of trajectory of at most 8 steps with 500ms passing between consecutive steps. The dataset is 1.6TB.
Models We train HD-VPD models with three different dynamics architectures and varying numbers of particles for the experiments below. These architectures are: (1) a hierarchical GNN using the architecture from [9] with the addition of kinematic particles for actions; (2) Performer-PCT, which consists of repeated Performer-PCT layers; and (3) Interlacer, which includes both Performer-PCT and Neighbor-Attender layers. Architecture and training details are available in Appendix E.
![Refer to caption](extracted/5697885/figures/quality_vs_particles.png)
![Refer to caption](extracted/5697885/figures/step_speed_vs_particles.png)
![Refer to caption](extracted/5697885/figures/train_memory_vs_particles.png)
4.2 Learned bi-manual robot dynamics with HD-VPD
In this section, we present learned dynamics obtained with our HD-VPD method for the bi-manual Kuka robot. Corresponding videos are included in the supplementary material and on the web at https://sites.google.com/view/hd-vpd.
Video quality scales with particles We evaluate the quality of next-step image predictions (measured by SSIM) on the test set made with each architecture using varying numbers of particles. In Fig. 5(a), we show that the Interlacer-based HD-VPD model produces comparable-quality predictions to the GNN at every number of particles. The GNN model cannot be trained at 65K and 130K particles due to memory consumption, whereas the Interlacer model continues to improve with scale. The Performer-PCT model lags behind the GNN and Interlacer at each number of particles. We also explore the behavior of these models on multi-step predictions, where the increasing uncertainty associated with long-range prediction might render high-density models’ image quality advantage moot. Figure 6 shows that while Interlacers of all sizes are less accurate the further into the future they predict, using more particles continues to produce better predictions.
![Refer to caption](extracted/5697885/figures/rollout_vs_particles.png)
Computational requirements We evaluate the key computational costs associated with running and training the three classes of models. Figure 5(b) shows the number of dynamics steps per second achieved with each architecture and number of particles. The Interlacer is at least as fast as the GNN model at each scale, in some cases providing more than 2x speedups. The Performer-PCT is the fastest by a significant margin, but it is unable to match the image quality of the other models. In terms of memory consumption (Figure 5(c)), the GNN model always requires considerably more memory to train than the Interlacer or Performer-PCT models, and is infeasible to use at scales beyond 32K.
4.3 Downstream robotic applications
![Refer to caption](extracted/5697885/figures/Kuka_Box_Push_Task.jpg)
![Refer to caption](extracted/5697885/figures/gras**_setup.jpg)
![Refer to caption](extracted/5697885/figures/BoxPushRolloutV2.jpg)
![Refer to caption](extracted/5697885/figures/grasp_planning.jpg)
Box push plan evaluation The goal is to move the box a desired distance to the left or right as shown in Fig. 7. We have a fixed set of motion plans that move between 12.5 and -12.5 cm along the y-axis (left to right). Given a goal, these planned motions are rolled out via our learned dynamics model. The cost function of a push plan is how close the median distance traveled by the box points is to the desired push distance. Lower cost is better and implies box particles are closer to the goal. Figure 10(a) shows that our predicted rollout costs are consistent with expected distance to a goal box position, and this rollout cost can be used to select the desired candidate push plan.
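A minimal sketch of this cost and of plan selection is given below. It assumes the box particles have already been identified, interprets "distance traveled" as the signed displacement along the y-axis, and expresses distances in meters; the function names are hypothetical.

```python
import numpy as np

def push_cost(box_start, box_end, goal_dy):
    """Gap between the median y-displacement of box particles and the goal.

    box_start, box_end: (P, 3) box particle positions before and after the
    predicted rollout. Signed y-displacement is an assumed interpretation.
    """
    dy = np.median(box_end[:, 1] - box_start[:, 1])
    return abs(dy - goal_dy)

def select_plan(box_start, rollouts, goal_dy):
    """rollouts: predicted final box particle positions, one array per plan."""
    costs = [push_cost(box_start, end, goal_dy) for end in rollouts]
    return int(np.argmin(costs)), costs
```

Using the median rather than the mean makes the cost less sensitive to a few particles whose predicted motion is an outlier.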
Grasp plan evaluation The goal is to lift a Coke can off the table. We have a fixed set of planned grasps, where the grasp pose is offset from the center of the object in 5mm increments. The cost function of the grasp plan is the distance of the particles from the goal position of 0.2m off the table. Lower cost is better and implies can particles are closer to the goal. As shown in Fig. 10(b), the HD-VPD model cost increases as the planned grasp pose moves away from the object and becomes less likely to succeed. The HD-VPD model is able to infer that the Coke can points will be lifted on the grasp motions that do succeed in the real world. More details are in Appendix D.
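The grasp cost can be sketched in the same spirit. The example assumes a z-up world frame, a known table height, and a mean absolute vertical distance as the aggregation; these choices and the name `grasp_cost` are assumptions for illustration.

```python
import numpy as np

def grasp_cost(can_xyz, table_z, goal_height=0.2):
    """Mean vertical distance of can particles from the goal height.

    can_xyz: (P, 3) predicted can particle positions after the rollout.
    Assumes z is up and goal_height is measured in meters above the table.
    """
    height = can_xyz[:, 2] - table_z
    return float(np.mean(np.abs(height - goal_height)))
```

A rollout in which the can stays on the table yields a cost near 0.2, while a successful lift to the goal height yields a cost near zero.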
![Refer to caption](extracted/5697885/figures/push_scatter.png)
![Refer to caption](extracted/5697885/figures/grasp_scatter.png)
Dataset inspection After training HD-VPD, we evaluate it on trajectories from the training set and find those with anomalously high loss. This surfaces previously-unnoticed errors and outliers in the data, including sessions with incorrect geometric camera calibration, faulty camera exposures, and humans inside the robot workspace. This method of dataset introspection may be useful when preparing data for training models such as policies on other downstream tasks. Examples are available in Appendix F.
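One simple way to surface anomalously high-loss trajectories is a robust threshold on the per-trajectory loss. The rule below, based on the median and the median absolute deviation with a cutoff of five deviations, is an assumption chosen for the example and not necessarily the criterion used in this work.

```python
import numpy as np

def flag_outliers(losses, num_mads=5.0):
    """Return indices of trajectories whose loss is far above the median.

    The median + num_mads * MAD rule is an illustrative assumption.
    """
    losses = np.asarray(losses, dtype=float)
    med = np.median(losses)
    mad = np.median(np.abs(losses - med)) + 1e-12   # avoid a zero threshold gap
    return np.flatnonzero(losses > med + num_mads * mad)
```

The flagged indices can then be reviewed by hand to check for calibration errors, exposure problems, or other issues in the recorded sessions.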
5 Related work
Dynamics models for robotics
Accurately modeling robot dynamics is essential for tasks such as motion planning, control, and simulation. Traditional approaches rely on rigid body dynamics models [1, 4, 2, 5, 3], but these models can be inaccurate in complex environments where objects are deformable or interacting in unpredictable ways, and require painstaking authoring of scenes and assets. A significant literature has explored the use of learned models for robot dynamics, which can broadly be divided into two camps: 2D and 3D. 2D models operate directly on pixels and use architectures designed for image generation [20, 21]. While these have found some success by scaling up to modern foundation model sizes and datasets [22, 23], their learned dynamics can be nonphysical, and they lack 3D representations that can be used for cost functions and scene editing.
More closely related to this work are the 3D learned dynamics models. Various works have used datasets of 3D dynamics to train dynamics models with particle [24, 25], mesh [26, 7], or object [27, 28] representations, and these models can be more accurate than analytic simulators in some circumstances [8]. However, the requirement for training data with ground-truth 3D particle or object poses limits their applications. Works have attempted to train 3D dynamics models with perceptual data such as videos and point clouds, many leveraging Neural Radiance Fields (NeRF) [12]. One approach has been to apply point cloud distance functions such as Chamfer distance to define loss functions between predicted points and the next-step point cloud [29] or NeRF [30] observations, though these often require object segmentations and have not been applied for large-scale scenes. Another strategy pre-trains conditional or object-level NeRF renderers, then trains a dynamics model on these representations [31, 32]. This split training process and coarse representation is limiting for dealing with complex scenes and results in lower-quality predictions [9].
HD-VPD derives from Visual Particle Dynamics (VPD) [9], extended to support robot actions and larger, more detailed scenes. Relative to the broader literature, HD-VPD has less restrictive data requirements (2 RGB-D cameras) than other NeRF-based works or mesh- or object-based works, and it jointly predicts 3D dynamics and 2D video, unlike pure 2D video or point cloud models.
Point cloud architectures
The rise of deep learning models has opened up new possibilities for processing and understanding 3D point cloud data. Models like PointNet++ [33], Point Transformer [34] and Point Cloud Transformers [35] have demonstrated strong performance in tasks such as object classification, segmentation, and registration. Similar to our work, RandLA-Net [18] focuses on building memory- and time-efficient architectures that scale up to large point clouds. However, these models are typically designed for static point clouds and do not directly model the dynamics of the scene. Our work introduces a new class of PCTs, called Interlacers, which are specifically designed for modeling the dynamics of large-scale point clouds. The Performer-PCT architecture was introduced in [15], but applied in the different setting of scalable Robotics Transformers rather than dynamics modeling. In this paper, we show that intertwining Performer-PCT layers with the Neighbor-Attender layers introduced here is key to achieving scalable and high-precision dynamics models.
6 Conclusion
We present a new high-density particle-based world model called HD-VPD and train it on real-world RGB-D and kinematic bi-manual robotics data. To support HD-VPD’s high-fidelity predictions, we propose Interlacers, which combine linear-attention techniques with GNN-inspired local neighborhood attention methods and can model point clouds of 100K+ particles, a scale infeasible for previous PCTs and GNNs. HD-VPD outperforms previous SOTA particle-based world models both in prediction quality and computational cost. Finally, we show how HD-VPD can be used in robotic planning as an accurate world model for scenes involving objects and the robotic agents operating on them.
7 Limitations
HD-VPD is a deterministic dynamics model, meaning that its predictions blur with long horizons and stochastic dynamics. We leave extending the model to probabilistic predictions and long-horizon tasks to future work. HD-VPD might be improved via (1) more detailed finger and arm modeling in the kinematic skeleton; (2) data with near misses for robot planning applications; (3) upweighting points most relevant to robot interaction in training loss computations; (4) using loss functions to enforce physically realistic 3D motion rather than only correct RGB values. Planning with visual costs might avoid any mismatch between particle motion and physical outcomes.
Acknowledgements
The authors thank Chad Boodoo, Zac Gong, Neil Seejoor, and Armando Fuentes for robot operation support. We thank Lama Yassine, Gabriela Taras, and Grecia Salazar for data collection help. We thank Dmitry Kalashnikov, Anthony Brohan, Grace Vesom, and Ken Oslund for infrastructure support.
References
- Todorov et al. [2012] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2012, Vilamoura, Algarve, Portugal, October 7-12, 2012, pages 5026–5033. IEEE, 2012. doi:10.1109/IROS.2012.6386109. URL https://doi.org/10.1109/IROS.2012.6386109.
- Coumans and Bai [2016] E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. In http://pybullet.org, 2016.
- Shah et al. [2017] S. Shah, D. Dey, C. Lovett, and A. Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In M. Hutter and R. Siegwart, editors, Field and Service Robotics, Results of the 11th International Conference, FSR 2017, Zurich, Switzerland, 12-15 September 2017, volume 5 of Springer Proceedings in Advanced Robotics, pages 621–635. Springer, 2017. doi:10.1007/978-3-319-67361-5_40. URL https://doi.org/10.1007/978-3-319-67361-5_40.
- Tedrake [2019] R. Tedrake. Drake: Model-based design and verification for robotics, 2019. URL https://drake.mit.edu.
- Makoviychuk et al. [2021] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, and G. State. Isaac gym: High performance gpu-based physics simulation for robot learning. ArXiv, abs/2108.10470, 2021. URL https://api.semanticscholar.org/CorpusID:237277983.
- Pfaff et al. [2021] T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. W. Battaglia. Learning mesh-based simulation with graph networks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=roNqYL0_XP.
- Allen et al. [2022a] K. R. Allen, Y. Rubanova, T. Lopez-Guevara, W. Whitney, A. Sanchez-Gonzalez, P. W. Battaglia, and T. Pfaff. Learning rigid dynamics with face interaction graph networks. CoRR, abs/2212.03574, 2022a. doi:10.48550/ARXIV.2212.03574. URL https://doi.org/10.48550/arXiv.2212.03574.
- Allen et al. [2022b] K. R. Allen, T. Lopez-Guevara, Y. Rubanova, K. L. Stachenfeld, A. Sanchez-Gonzalez, P. W. Battaglia, and T. Pfaff. Graph network simulators can learn discontinuous, rigid contact dynamics. In K. Liu, D. Kulic, and J. Ichnowski, editors, Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, pages 1157–1167. PMLR, 2022b. URL https://proceedings.mlr.press/v205/allen23a.html.
- Whitney et al. [2023] W. F. Whitney, T. Lopez-Guevara, T. Pfaff, Y. Rubanova, T. Kipf, K. Stachenfeld, and K. R. Allen. Learning 3d particle-based simulators from rgb-d videos. arXiv preprint arXiv:2312.05359, 2023.
- Scarselli et al. [2009] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009. doi:10.1109/TNN.2008.2005605. URL https://doi.org/10.1109/TNN.2008.2005605.
- Liu et al. [2023] R. Liu, P. Calafiura, S. Farrell, X. Ju, D. T. Murnane, and T. M. Pham. Hierarchical graph neural networks for particle track reconstruction. CoRR, abs/2303.01640, 2023. doi:10.48550/ARXIV.2303.01640. URL https://doi.org/10.48550/arXiv.2303.01640.
- Mildenhall et al. [2020] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
- Guo et al. [2021] M. Guo, J. Cai, Z. Liu, T. Mu, R. R. Martin, and S. Hu. PCT: point cloud transformer. Comput. Vis. Media, 7(2):187–199, 2021. doi:10.1007/S41095-021-0229-5. URL https://doi.org/10.1007/s41095-021-0229-5.
- Choromanski et al. [2021] K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller. Rethinking attention with performers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH.
- Leal et al. [2024] I. Leal, K. Choromanski, D. Jain, A. Dubey, J. Varley, M. Ryoo, Y. Lu, F. Liu, V. Sindhwani, Q. Vuong, et al. Sara-rt: Scaling up robotics transformers with self-adaptive robust attention. ICRA 2024, 2024.
- Ronneberger et al. [2015] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation, 2015.
- Xu et al. [2022] Q. Xu, Z. Xu, J. Philip, S. Bi, Z. Shu, K. Sunkavalli, and U. Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5438–5448, 2022.
- Hu et al. [2020] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham. Randla-net: Efficient semantic segmentation of large-scale point clouds, 2020.
- Varley et al. [2024] J. Varley, S. Singh, D. Jain, K. Choromanski, A. Zeng, S. B. R. Chowdhury, A. Dubey, and V. Sindhwani. Embodied ai with two arms: Zero-shot learning, safety and modularity. arXiv preprint arXiv:2404.03570, 2024.
- Finn et al. [2016] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. Advances in neural information processing systems, 29, 2016.
- Ha and Schmidhuber [2018] D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
- Yang et al. [2023] M. Yang, Y. Du, K. Ghasemipour, J. Tompson, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023.
- Bruce et al. [2024] J. Bruce, M. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. Bechtle, F. M. P. Behbahani, S. Chan, N. M. O. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rocktaschel. Genie: Generative interactive environments. ArXiv, abs/2402.15391, 2024. URL https://api.semanticscholar.org/CorpusID:267897982.
- Li et al. [2018] Y. Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. arXiv preprint arXiv:1810.01566, 2018.
- Sanchez-Gonzalez et al. [2020] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. Battaglia. Learning to simulate complex physics with graph networks. In International Conference on Machine Learning, pages 8459–8468. PMLR, 2020.
- Pfaff et al. [2021] T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. Battaglia. Learning mesh-based simulation with graph networks. In International Conference on Learning Representations, 2021.
- Battaglia et al. [2016] P. W. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In Neural Information Processing Systems, 2016. URL https://api.semanticscholar.org/CorpusID:2200675.
- Rubanova et al. [2024] Y. Rubanova, T. Lopez-Guevara, K. R. Allen, W. F. Whitney, K. Stachenfeld, and T. Pfaff. Learning rigid-body simulators over implicit shapes for large-scale scenes and vision. arXiv preprint arXiv:2405.14045, 2024.
- Shi et al. [2023] H. Shi, H. Xu, S. Clarke, Y. Li, and J. Wu. Robocook: Long-horizon elasto-plastic object manipulation with diverse tools. ArXiv, abs/2306.14447, 2023. URL https://api.semanticscholar.org/CorpusID:259251806.
- Xue et al. [2023] H. Xue, A. Torralba, J. Tenenbaum, D. Yamins, Y. Li, and H.-Y. Tung. 3d-intphys: Towards more generalized 3d-grounded visual intuitive physics under challenging scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3624–3634, 2023.
- Li et al. [2021] Y. Li, S. Li, V. Sitzmann, P. Agrawal, and A. Torralba. 3d neural scene representations for visuomotor control. arXiv preprint arXiv:2107.04004, 2021.
- Driess et al. [2022] D. Driess, Z. Huang, Y. Li, R. Tedrake, and M. Toussaint. Learning multi-object dynamics with compositional neural radiance fields. In K. Liu, D. Kulic, and J. Ichnowski, editors, Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, pages 1755–1768. PMLR, 2022. URL https://proceedings.mlr.press/v205/driess23a.html.
- Qi et al. [2017] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
- Zhao et al. [2021] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021.
- Guo et al. [2021] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu. Pct: Point cloud transformer. Computational Visual Media, 7:187–199, 2021.
- Loshchilov and Hutter [2017] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
Appendix
Appendix A Performer-PCTs linear attention: closer look
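Let $\mathbf{X} \in \mathbb{R}^{N \times d}$ be an input to the Transformer's attention block, where $N$ stands for the number of points and $d$ is the dimensionality of their corresponding latent feature vectors. The output of the regular attention module can be computed as follows for learnable $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$ and query/key dimensionality $d_{QK}$:
$$\mathrm{Att}(\mathbf{X}) = \mathbf{D}^{-1}\mathbf{A}\mathbf{V}, \quad \mathbf{A} = \exp\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{QK}}}\right), \quad \mathbf{D} = \mathrm{diag}(\mathbf{A}\mathbf{1}_N), \quad \mathbf{Q} = \mathbf{X}\mathbf{W}_Q,\ \mathbf{K} = \mathbf{X}\mathbf{W}_K,\ \mathbf{V} = \mathbf{X}\mathbf{W}_V \tag{2}$$
In Eq. 2, $\mathbf{1}_N$ denotes the all-one vector of length $N$, and the matrices $\mathbf{V}$, $\mathbf{Q}$ and $\mathbf{K}$ are often referred to as value, query and key respectively. Furthermore, the mapping $\exp$ is applied element-wise. Finally, the matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ is called the attention matrix and encodes how points attend to each other via the so-called softmax-kernel (defined as $\mathrm{SM}(\mathbf{u}, \mathbf{v}) = \exp(\mathbf{u}\mathbf{v}^{\top})$ for row-vectors $\mathbf{u}, \mathbf{v}$, here applied to the rows of $\mathbf{Q}$ and $\mathbf{K}$ rescaled by $d_{QK}^{-1/4}$).
The time and space complexity of the regular attention mechanism is quadratic in the number of points $N$, since the attention matrix $\mathbf{A}$ has $N^2$ entries. This makes it prohibitively expensive for massive point clouds. Interlacers therefore use the linear attention mechanism introduced in the class of Transformers called Performers [14]. Performers propose the following computational model of attention, where $\phi: \mathbb{R}^{d_{QK}} \rightarrow \mathbb{R}^{m}$ is applied row-wise and $m$ is a hyper-parameter:
$$\widehat{\mathrm{Att}}(\mathbf{X}) = \widehat{\mathbf{D}}^{-1}\left(\mathbf{Q}'\left((\mathbf{K}')^{\top}\mathbf{V}\right)\right), \quad \widehat{\mathbf{D}} = \mathrm{diag}\left(\mathbf{Q}'\left((\mathbf{K}')^{\top}\mathbf{1}_N\right)\right), \quad \mathbf{Q}' = \phi(\mathbf{Q}),\ \mathbf{K}' = \phi(\mathbf{K}) \tag{3}$$
The brackets indicate the order of computations. Calculations in Eq. 3 can be performed in time linear in $N$ (and in the hyper-parameter $m$, which in practice is much smaller than $N$). The system defined by Eq. 3 is obtained from the one given by Eq. 2 by replacing the regular attention matrix $\mathbf{A}$ with the matrix $\widehat{\mathbf{A}} = \mathbf{Q}'(\mathbf{K}')^{\top}$, which is never materialized. Following [15], we use $\phi = \mathrm{ReLU}$ applied element-wise (thus $m = d_{QK}$).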
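To make the bracketed ordering of Eq. 3 concrete, the following NumPy sketch computes ReLU linear attention without ever forming the point-by-point attention matrix, and checks it against the explicit quadratic computation. It is an illustration rather than the paper's implementation: the function names, the `eps` guard against division by zero, and the random inputs are assumptions, while the key dimension of 16 and value dimension of 64 follow Appendix E.

```python
import numpy as np

def relu_linear_attention(Q, K, V, eps=1e-6):
    """Performer-style attention with phi = ReLU, linear in the number of points."""
    Qp = np.maximum(Q, 0.0)      # Q' = phi(Q), shape (N, m)
    Kp = np.maximum(K, 0.0)      # K' = phi(K), shape (N, m)
    KV = Kp.T @ V                # (m, d_v): (K')^T V, computed first
    K1 = Kp.sum(axis=0)          # (m,): (K')^T 1_N
    num = Qp @ KV                # (N, d_v)
    den = Qp @ K1                # (N,): diagonal of the normalizer
    # eps is an assumption of this sketch; it avoids 0/0 for all-zero rows of Q'.
    return num / (den[:, None] + eps)

def relu_quadratic_attention(Q, K, V, eps=1e-6):
    """Reference version that materializes the N x N matrix Q'(K')^T."""
    A_hat = np.maximum(Q, 0.0) @ np.maximum(K, 0.0).T
    return (A_hat @ V) / (A_hat.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
N, m, d_v = 512, 16, 64          # small N so the quadratic reference stays cheap
Q = rng.normal(size=(N, m))
K = rng.normal(size=(N, m))
V = rng.normal(size=(N, d_v))
Y_lin = relu_linear_attention(Q, K, V)
Y_quad = relu_quadratic_attention(Q, K, V)
```

Both functions return the same values up to floating-point error; only the linear version keeps its largest intermediate at size m by d_v instead of N by N.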
Appendix B Hardware Setup
An overview of the hardware setup is shown in Figure 11.
Cameras and Image Processing. We use a Realsense D415 overhead camera and a Realsense D405 attached to the left wrist. We sample the overhead camera at 480x640 resolution; the wrist camera is captured at 480x848 resolution and cropped to 480x640. We inpaint the overhead camera depth images via cv.INPAINT_TELEA. We mask both cameras' observations to exclude points more than 2 meters from the cameras according to the depth images.
![Refer to caption](extracted/5697885/figures/hardware_setup.jpg)
Appendix C Action Space
We represent actions as 26-particle kinematic skeletons at the desired future state positions. Each skeleton includes a particle representing where the fingertips would be if the gripper were closed, as well as particles representing the estimated current position of the fingertips. There is also an additional particle on the side of each wrist so that the orientation of the gripper can be disambiguated, especially when the fingers are closed. The fingers themselves are somewhat compliant, so the skeleton points do not exactly reflect the true pose of the fingertips, especially when in contact with the environment. The origin of the space is at the corner of the pink mat closest to the base of the right arm (the arm without a wrist camera). An example rendering of this skeleton is shown in Figure 12.
![Refer to caption](extracted/5697885/figures/ActionSpace.jpg)
Appendix D Grasp Experiments
For the real-world grasp planning evaluations, two threads of fishing wire were attached to the bottom of the Coke can, run through the table, and tied to a small weight. This way, whenever the can is dropped, it returns to exactly the same pose for all rollouts.
We start with an initial grasp centered at the center of the can, and then back off the grasps along the y-axis of the table (aligned with the pink mat) in 5 mm increments. Figure 10(b) shows the relationship between these offsets and predicted plan cost. Figure 13 shows snapshots from the left and overhead cameras during the real-world rollout of these planned grasps.
![Refer to caption](extracted/5697885/figures/GraspRollouts.jpg)
Appendix E Training and architecture details
We train all models with a batch size of 16 for 1M total steps. We use a learning rate schedule of 3e-4 until step 1000, then 1e-4 until step 100K, then 3e-5 until the end of training. We use the AdamW optimizer [36] with weight decay 1e-3 and clip the gradient norm to 0.01 to prevent outliers in the dataset from destabilizing training.
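The schedule and clipping rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code: the function names are hypothetical, the treatment of the exact boundary steps (1000 and 100K) is an assumption, and gradients are represented as a flat list of floats for simplicity.

```python
import math

def learning_rate(step: int) -> float:
    """Piecewise-constant schedule: 3e-4, then 1e-4, then 3e-5."""
    if step < 1_000:        # assumption: step 1000 itself already uses 1e-4
        return 3e-4
    if step < 100_000:      # assumption: step 100K itself already uses 3e-5
        return 1e-4
    return 3e-5

def clip_by_global_norm(grads, max_norm=0.01):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0])     # original norm is 5
small = clip_by_global_norm([0.001, 0.0])     # already below the limit
```

With a clipping norm as small as 0.01, most updates are rescaled, so a single outlier batch cannot produce a much larger step than a typical batch.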
All models are trained end-to-end with 6-step rollouts during training. At each of the two input timesteps and the 6 predicted timesteps, losses are computed on 128 sampled rays. Rays from input timesteps are rendered conditioned on the particles encoded from their respective input timesteps, while rays from predicted timesteps use the particles predicted by recursively rolling out the dynamics model in particle space conditioned on a sequence of actions.
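The recursive rollout can be pictured as a sliding window over two particle states. The sketch below assumes a hypothetical `dynamics_step(prev, curr, action)` callable standing in for the learned dynamics model, and uses a toy constant-velocity rule only to show the data flow; none of these names come from the paper.

```python
import numpy as np

def rollout(dynamics_step, particles_prev, particles_curr, actions):
    """Recursively apply a two-timestep dynamics model, one action per step."""
    predictions = []
    for action in actions:
        particles_next = dynamics_step(particles_prev, particles_curr, action)
        predictions.append(particles_next)
        # Slide the window: the prediction becomes the newest input timestep.
        particles_prev, particles_curr = particles_curr, particles_next
    return predictions

def toy_dynamics(prev, curr, action):
    """Stand-in for the learned model: constant velocity plus an action offset."""
    return 2.0 * curr - prev + action

prev = np.zeros((4, 3))                       # 4 particles at the origin
curr = np.ones((4, 3))                        # moved by 1 along every axis
actions = [np.zeros((4, 3)) for _ in range(6)]  # 6-step rollout, as in training
preds = rollout(toy_dynamics, prev, curr, actions)
```

During training, each of the six predicted particle sets would be rendered along sampled rays and compared with the corresponding target image.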
The encoder uses a U-Net [16] with [32, 64, 128, 256, 256, 128, 64, 32] channels applied to each input image. It outputs 16-d per-pixel feature vectors. Points from the encoder are only included in the set of dynamics particles if they fall within the robot’s workspace. Input and target RGB-D images are masked using the depth channel, changing the color of any pixel whose depth exceeds 2m to solid white.
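The depth-based masking step is simple enough to show directly. The sketch below assumes float RGB values in [0, 1], depth in meters, and that the threshold means pixels farther than 2 m, in line with Appendix B; the function name is hypothetical.

```python
import numpy as np

def mask_far_pixels(rgb, depth, max_depth_m=2.0):
    """Return a copy of rgb with pixels beyond max_depth_m painted solid white."""
    out = rgb.copy()
    out[depth > max_depth_m] = 1.0   # broadcasts over the 3 color channels
    return out

rgb = np.zeros((480, 640, 3), dtype=np.float32)     # all-black test image
depth = np.full((480, 640), 1.0, dtype=np.float32)  # everything at 1 m ...
depth[:, 320:] = 3.0                                 # ... except the right half
masked = mask_far_pixels(rgb, depth)
```

Painting distant pixels white matches the solid white background that rays are composited against in the renderer.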
The Interlacer dynamics model uses one Neighbor-Attender block and one Performer-PCT block on each input timestep of scene particles, using separate weights for each input timestep, and one (quadratic) PCT block on the kinematic particles. These three point clouds are then concatenated. The resulting large, multi-timestep, multi-input-modality point cloud then goes through another Neighbor-Attender block, followed by 3 Performer-ReLU blocks. All Performer-ReLU layers use a key dimension of 16 and a value dimension of 64. This dynamics model outputs a position update $\Delta x$ and a feature vector for each particle of timestep $t$. The $\Delta x$ values are constrained to be in $[-0.15, 0.15]$ (in meters) using a tanh and scaling, preventing any particle from moving more than 15cm in any single step. The output feature vectors are 16-d to match the features coming from the encoder. To compute neighbors in the Neighbor-Attender layers, we use a CUDA kernel generated by Pallas (https://jax.readthedocs.io/en/latest/pallas/index.html). This same kernel is used for finding nearby points in the renderer.
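The tanh-and-scaling constraint on per-step motion can be illustrated as follows. This is a sketch under the assumption that the bound is applied independently to each coordinate of the raw network output; the names `MAX_STEP_M` and `bound_displacement` are hypothetical.

```python
import numpy as np

MAX_STEP_M = 0.15  # 15 cm, the per-step motion limit stated in the text

def bound_displacement(raw):
    """Squash unbounded network outputs into [-MAX_STEP_M, MAX_STEP_M]."""
    return MAX_STEP_M * np.tanh(raw)

raw = np.array([[-1000.0, 0.0, 1000.0],   # extreme outputs saturate at the limit
                [0.1, -0.2, 0.3]])        # small outputs pass through almost linearly
dx = bound_displacement(raw)
```

Because tanh is smooth, the bound does not block gradients the way a hard clip would, while still ruling out physically implausible jumps.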
The renderer follows the same scheme for constructing input features for the NeRF MLP as Whitney et al. [9]. We use four concentric annular kernels with radii [0, 0.01, 0.02, 0.05] and corresponding bandwidths [0.01, 0.01, 0.01, 0.05], and these kernels are approximated using 16 nearest neighbors. The near plane for ray rendering is set at 0.1m and the far plane at 2m. Rays are composited against a solid white background.
![Refer to caption](x4.png)
Appendix F Dataset inspection
We introspect our training data by computing a histogram, shown in Figure 15, of HD-VPD’s loss on a sample of the training set and using that to set a threshold for unusually high losses. Once we have set this threshold, we can search the training data for trajectories with high loss.
![Refer to caption](extracted/5697885/figures/loss_histogram.png)
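A minimal version of this inspection loop is sketched below. The text sets the threshold by looking at the histogram; the percentile rule used here is only an assumed stand-in for that manual choice, and the function name and synthetic losses are hypothetical.

```python
import numpy as np

def flag_high_loss(losses, percentile=99.0):
    """Return a loss threshold and the indices above it, highest loss first."""
    losses = np.asarray(losses, dtype=float)
    threshold = np.percentile(losses, percentile)   # assumed thresholding rule
    flagged = np.flatnonzero(losses > threshold)
    order = np.argsort(-losses[flagged])            # sort flagged items by loss
    return threshold, flagged[order]

rng = np.random.default_rng(0)
typical = rng.normal(loc=0.1, scale=0.01, size=1000)   # synthetic typical losses
outliers = np.array([1.0, 2.0, 3.0])                   # synthetic anomalous losses
losses = np.concatenate([typical, outliers])
counts, bin_edges = np.histogram(losses, bins=50)      # histogram to inspect
threshold, flagged = flag_high_loss(losses)
```

The flagged indices can then be mapped back to trajectories and viewed by hand, which is how issues such as those in Figures 16 and 17 would surface.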
In Figures 16 and 17 we present examples of dataset quality issues discovered by inspecting training examples where HD-VPD’s loss is anomalously high. These issues with our dataset were not previously known, and might pose problems for other applications of this data, such as policy learning. Discovering them with HD-VPD also allows us to refine our hardware setup, fixing calibration issues and avoiding problematic robot states.
![Refer to caption](extracted/5697885/figures/error_person_in_workspace.png)
![Refer to caption](extracted/5697885/figures/error_miscalibration.png)
![Refer to caption](extracted/5697885/figures/error_color_calibration_pose.png)